In [2]:
import pandas as pd

In [3]:
df=pd.read_csv('C:/Users/dasgu/Desktop/MyImp/Code/aiml/homepricesd5.csv')
df

Unnamed: 0,town,area,price
0,monroe township,2600,550000
1,monroe township,3000,565000
2,monroe township,3200,610000
3,monroe township,3600,680000
4,monroe township,4000,725000
5,west windsor,2600,585000
6,west windsor,2800,615000
7,west windsor,3300,650000
8,west windsor,3600,710000
9,robinsville,2600,575000


In [4]:
'''
   - One-hot encoding is a technique for converting categorical variables into a numerical format that machine learning algorithms can work with.
   - It turns each category of a variable into its own column of 0's and 1's so that the categorical data can be processed effectively.
   - Categorical variables are either nominal (no natural order), e.g. (monroe township, west windsor, robinsville), (green, blue, red), (male, female), or ordinal (a natural order), e.g. (btech, mtech, phd), (high, medium, low), (good, okay, bad).
   - Multicollinearity
     Definition: Multicollinearity is a situation where two or more independent variables are highly correlated, making it hard to isolate their individual effects.
     Example: A model predicting weight that includes both "Height" and "Foot Size". Taller people tend to have larger feet, so the two variables are highly correlated, which can distort the coefficient estimates.
   - Dummy Variable Trap
     Definition: The dummy variable trap occurs when all dummy variables created from a categorical variable are included in a regression model that also has an intercept, leading to perfect multicollinearity.
     Example: For a variable "Color" with categories "Red," "Green," and "Blue," if you create dummy variables for all three (Color_Red, Color_Green, Color_Blue) without dropping one, any one of them can be predicted exactly from the other two, which is redundant. To avoid this, drop one category (e.g., Color_Blue).
   - Imagine a Classroom
     You have two friends:
     Friend A: Measures how tall everyone is.
     Friend B: Measures everyone's shoe size.
     What Happens?
     High Correlation:
     If you notice that taller kids usually have bigger shoes, then Friend A and Friend B are telling you similar things. If you know someone is tall, you can guess their shoe size pretty well!
     Confusion:
     If you try to use both friends' measurements to understand how much kids weigh, it gets confusing. If both measurements say similar things, it's hard to figure out which one is really helping you more.
     Wobbly Answers:
     When you try to calculate how height and shoe size affect weight, your answers (coefficients) might become wobbly. One time the model might say height is really important, and another time it might say shoe size is more important, even though they're both saying nearly the same thing!
     What to Do?
     Choose One: To make it easier, listen to Friend A or Friend B, but not both. This way, you get clearer answers about weight without the confusion.
'''
   

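The dummy variable trap described in the notes can be checked numerically. The sketch below is illustrative only: the four-row `towns` series and all variable names are made up for the demonstration. It builds a design matrix with an intercept column and shows that keeping every dummy column makes the matrix rank-deficient, while dropping one restores full column rank.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only for this illustration
towns = pd.Series(
    ["monroe township", "west windsor", "robinsville", "monroe township"],
    name="town",
)

all_dummies = pd.get_dummies(towns).astype(int)
# Every row has exactly one 1, so the dummy columns always add up to 1,
# which is exactly the intercept column of a linear model.
row_sums = all_dummies.sum(axis=1)

# Intercept + all three dummies: 4 columns, but one of them is redundant
full = np.column_stack([np.ones(len(towns)), all_dummies.to_numpy()])
rank_full = np.linalg.matrix_rank(full)

# Intercept + two dummies (first dummy dropped): no redundancy left
reduced = np.column_stack([np.ones(len(towns)), all_dummies.iloc[:, 1:].to_numpy()])
rank_reduced = np.linalg.matrix_rank(reduced)

print("all dummies:", full.shape[1], "columns, rank", rank_full)
print("one dropped:", reduced.shape[1], "columns, rank", rank_reduced)
```

A rank lower than the number of columns means the coefficients are not uniquely determined, which is the "perfect multicollinearity" from the definition above.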

In [5]:
dummies = pd.get_dummies(df.town).astype(int)
dummies

Unnamed: 0,monroe township,robinsville,west windsor
0,1,0,0
1,1,0,0
2,1,0,0
3,1,0,0
4,1,0,0
5,0,0,1
6,0,0,1
7,0,0,1
8,0,0,1
9,0,1,0


In [6]:
merged = pd.concat([df,dummies],axis='columns')
merged

Unnamed: 0,town,area,price,monroe township,robinsville,west windsor
0,monroe township,2600,550000,1,0,0
1,monroe township,3000,565000,1,0,0
2,monroe township,3200,610000,1,0,0
3,monroe township,3600,680000,1,0,0
4,monroe township,4000,725000,1,0,0
5,west windsor,2600,585000,0,0,1
6,west windsor,2800,615000,0,0,1
7,west windsor,3300,650000,0,0,1
8,west windsor,3600,710000,0,0,1
9,robinsville,2600,575000,0,1,0


In [7]:
final = merged.drop(['town'], axis='columns')
final

Unnamed: 0,area,price,monroe township,robinsville,west windsor
0,2600,550000,1,0,0
1,3000,565000,1,0,0
2,3200,610000,1,0,0
3,3600,680000,1,0,0
4,4000,725000,1,0,0
5,2600,585000,0,0,1
6,2800,615000,0,0,1
7,3300,650000,0,0,1
8,3600,710000,0,0,1
9,2600,575000,0,1,0


In [8]:
# Drop one dummy column to avoid the dummy variable trap.
# sklearn's LinearRegression still fits if all the town columns are kept, because its least-squares solver tolerates collinear columns,
# but the individual coefficients are then not uniquely determined. Make a habit of dropping one column yourself
# in case the library you are using does not handle this for you.
final = final.drop(['west windsor'], axis='columns')
final

Unnamed: 0,area,price,monroe township,robinsville
0,2600,550000,1,0
1,3000,565000,1,0
2,3200,610000,1,0
3,3600,680000,1,0
4,4000,725000,1,0
5,2600,585000,0,0
6,2800,615000,0,0
7,3300,650000,0,0
8,3600,710000,0,0
9,2600,575000,0,1
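The get_dummies, concat and drop steps above can also be done in one call. This is a minimal sketch, assuming a small made-up frame `df_small` with the same column names as the notebook's data. Note that `drop_first=True` drops the alphabetically first category (monroe township here), not west windsor as in the cell above, so the baseline town differs.

```python
import pandas as pd

# Hypothetical three-row frame with the same columns as the notebook's df
df_small = pd.DataFrame({
    "town": ["monroe township", "west windsor", "robinsville"],
    "area": [2600, 2600, 2600],
    "price": [550000, 585000, 575000],
})

# Encode 'town', drop the original column, and drop one dummy in a single step
encoded = pd.get_dummies(df_small, columns=["town"], drop_first=True, dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

The row for the dropped category is the one where every remaining dummy is 0. With an intercept in the model, the choice of which category to drop changes the coefficients but should not change the predictions.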


In [9]:
X = final.drop('price', axis='columns')
X

Unnamed: 0,area,monroe township,robinsville
0,2600,1,0
1,3000,1,0
2,3200,1,0
3,3600,1,0
4,4000,1,0
5,2600,0,0
6,2800,0,0
7,3300,0,0
8,3600,0,0
9,2600,0,1


In [10]:
y = final.price

In [11]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()

In [12]:
model.fit(X,y)

In [13]:
model.predict(X)

array([539709.7398409 , 590468.71640508, 615848.20468716, 666607.18125134,
       717366.15781551, 579723.71533005, 605103.20361213, 668551.92431735,
       706621.15674048, 565396.15136531, 603465.38378844, 628844.87207052,
       692293.59277574])

In [14]:
model.score(X,y) # R^2 (coefficient of determination) on the training data, not a classification accuracy

0.9573929037221872
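For a regressor, `score` returns the coefficient of determination, R^2 = 1 - SS_res / SS_tot. The sketch below recomputes it by hand on a small area-only model to show what the number means. The names `X_demo`, `y_demo` and `demo_model` are illustrative; the five rows are the monroe township rows from the table above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Area and price of the five monroe township rows
X_demo = np.array([[2600], [3000], [3200], [3600], [4000]])
y_demo = np.array([550000, 565000, 610000, 680000, 725000])

demo_model = LinearRegression().fit(X_demo, y_demo)
pred = demo_model.predict(X_demo)

ss_res = np.sum((y_demo - pred) ** 2)           # variation the model fails to explain
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)  # total variation around the mean
r2_manual = 1 - ss_res / ss_tot

r2_sklearn = demo_model.score(X_demo, y_demo)
print(r2_manual, r2_sklearn)
```

Because the score here is computed on the same rows the model was trained on, it says how well the line fits this data, not how well it will do on unseen homes.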

In [15]:
model.predict([[3400,0,0]]) # 3400 sq ft home in west windsor; column order is area, monroe township, robinsville



array([681241.66845839])

In [16]:
model.predict([[2800,0,1]]) # 2800 sq ft home in robinsville; column order is area, monroe township, robinsville



array([590775.63964739])
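Passing a bare list such as `[[3400,0,0]]` relies on remembering the column order. A possible alternative is to build the query as a DataFrame with the training columns, so each value is labelled. This is a sketch: the 13 rows are reconstructed from the arrays printed in this notebook, and the names `data`, `X_named`, `named_model` and `query` are illustrative.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Rows reconstructed from the town/area/price arrays printed in this notebook
rows = [
    ("monroe township", 2600, 550000), ("monroe township", 3000, 565000),
    ("monroe township", 3200, 610000), ("monroe township", 3600, 680000),
    ("monroe township", 4000, 725000), ("west windsor", 2600, 585000),
    ("west windsor", 2800, 615000), ("west windsor", 3300, 650000),
    ("west windsor", 3600, 710000), ("robinsville", 2600, 575000),
    ("robinsville", 2900, 600000), ("robinsville", 3100, 620000),
    ("robinsville", 3600, 695000),
]
data = pd.DataFrame(rows, columns=["town", "area", "price"])

# Same encoding as above: one dummy per town, west windsor dropped as the baseline
dummies = pd.get_dummies(data["town"]).astype(int)
X_named = pd.concat([data[["area"]], dummies.drop(columns="west windsor")], axis="columns")
named_model = LinearRegression().fit(X_named, data["price"])

# Query labelled with the training columns: a 3400 sq ft home in west windsor
query = pd.DataFrame([[3400, 0, 0]], columns=X_named.columns)
pred = named_model.predict(query)
print(query)
print(pred)
```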

In [17]:
# Now doing the same thing with sklearn's OneHotEncoder

In [20]:
from sklearn.preprocessing import LabelEncoder # LabelEncoder maps each distinct label to an integer code
le=LabelEncoder()

In [21]:
dfle = df.copy() # work on a copy so the original df keeps its town names
dfle.town = le.fit_transform(dfle.town) # replace each town name with its integer code
dfle

Unnamed: 0,town,area,price
0,0,2600,550000
1,0,3000,565000
2,0,3200,610000
3,0,3600,680000
4,0,4000,725000
5,2,2600,585000
6,2,2800,615000
7,2,3300,650000
8,2,3600,710000
9,1,2600,575000
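The integer codes produced by LabelEncoder are assigned in sorted label order, which is why monroe township is 0, robinsville is 1 and west windsor is 2 in the table above. A short sketch for inspecting and reversing that mapping follows; `le_demo` and the four-item `towns` list are illustrative names and data.

```python
from sklearn.preprocessing import LabelEncoder

towns = ["monroe township", "west windsor", "robinsville", "monroe township"]

le_demo = LabelEncoder()
codes = le_demo.fit_transform(towns)

# classes_ holds the labels in sorted order; the position is the integer code
mapping = {str(label): code for code, label in enumerate(le_demo.classes_)}
print(mapping)

# inverse_transform turns codes back into the original labels
decoded = le_demo.inverse_transform([2, 1])
print(decoded)
```

The codes are only identifiers. Fed directly into a linear model they would imply an ordering (west windsor "greater than" robinsville) that the towns do not have, which is why the codes are one-hot encoded in the next cells.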


In [24]:
X = dfle[['town','area']].values # 2D NumPy array built from the town and area columns
X

array([[   0, 2600],
       [   0, 3000],
       [   0, 3200],
       [   0, 3600],
       [   0, 4000],
       [   2, 2600],
       [   2, 2800],
       [   2, 3300],
       [   2, 3600],
       [   1, 2600],
       [   1, 2900],
       [   1, 3100],
       [   1, 3600]], dtype=int64)

In [25]:
y = dfle.price.values # .values returns the price column as a NumPy array
y

array([550000, 565000, 610000, 680000, 725000, 585000, 615000, 650000,
       710000, 575000, 600000, 620000, 695000], dtype=int64)

In [28]:
from sklearn.preprocessing import OneHotEncoder
# ColumnTransformer (in sklearn.compose) applies different preprocessing steps to different columns of a dataset.
# This is useful when a single dataset mixes data types, such as numerical and categorical features.
from sklearn.compose import ColumnTransformer
# OneHotEncoder is applied to the column at index 0 (town); remainder='passthrough' keeps the other columns unchanged.
ct = ColumnTransformer([('town', OneHotEncoder(), [0])], remainder = 'passthrough')

In [29]:
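X = ct.fit_transform(X)
X # after a single run the first row is 1, 0, 0, 2600 (three town dummies, then area); the 5-column output below comes from running this cell a second time on the already-encoded X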

array([[0.0e+00, 1.0e+00, 0.0e+00, 0.0e+00, 2.6e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 0.0e+00, 3.0e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 0.0e+00, 3.2e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 0.0e+00, 3.6e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 0.0e+00, 4.0e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 1.0e+00, 2.6e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 1.0e+00, 2.8e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 1.0e+00, 3.3e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 1.0e+00, 3.6e+03],
       [1.0e+00, 0.0e+00, 1.0e+00, 0.0e+00, 2.6e+03],
       [1.0e+00, 0.0e+00, 1.0e+00, 0.0e+00, 2.9e+03],
       [1.0e+00, 0.0e+00, 1.0e+00, 0.0e+00, 3.1e+03],
       [1.0e+00, 0.0e+00, 1.0e+00, 0.0e+00, 3.6e+03]])

In [32]:
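X = X[:,1:] # drop the first dummy column to avoid the dummy variable trap; run this cell only once
X # the 2-column output below shows the slice was applied repeatedly, leaving only the west windsor dummy and area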

array([[0.0e+00, 2.6e+03],
       [0.0e+00, 3.0e+03],
       [0.0e+00, 3.2e+03],
       [0.0e+00, 3.6e+03],
       [0.0e+00, 4.0e+03],
       [1.0e+00, 2.6e+03],
       [1.0e+00, 2.8e+03],
       [1.0e+00, 3.3e+03],
       [1.0e+00, 3.6e+03],
       [0.0e+00, 2.6e+03],
       [0.0e+00, 2.9e+03],
       [0.0e+00, 3.1e+03],
       [0.0e+00, 3.6e+03]])

In [33]:
model.fit(X,y)

In [37]:
model.predict([[0,1,3400]]) # 3400 sq ft home in west windsor, as [robinsville, west windsor, area]; fails here because the model was fit on the 2-column X above

ValueError: X has 3 features, but LinearRegression is expecting 2 features as input.

In [38]:
model.predict([[1,0,2800]]) # 2800 sq ft home in robinsville, as [robinsville, west windsor, area]; fails here because the model was fit on the 2-column X above

ValueError: X has 3 features, but LinearRegression is expecting 2 features as input.
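The ValueError means the rows passed to `predict` have a different number of columns than the X used in `fit`. A possible way to rule out that kind of mismatch is to put the encoder and the model in one pipeline, so the same transformation is applied at fit and predict time. This is a sketch, not the notebook's original method: it uses `OneHotEncoder(drop="first")` instead of manual slicing, the 13 rows are reconstructed from the arrays printed above, and `homes`, `encoder`, `pipeline` and `queries` are illustrative names.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Rows reconstructed from the town/area/price arrays printed in this notebook
rows = [
    ("monroe township", 2600, 550000), ("monroe township", 3000, 565000),
    ("monroe township", 3200, 610000), ("monroe township", 3600, 680000),
    ("monroe township", 4000, 725000), ("west windsor", 2600, 585000),
    ("west windsor", 2800, 615000), ("west windsor", 3300, 650000),
    ("west windsor", 3600, 710000), ("robinsville", 2600, 575000),
    ("robinsville", 2900, 600000), ("robinsville", 3100, 620000),
    ("robinsville", 3600, 695000),
]
homes = pd.DataFrame(rows, columns=["town", "area", "price"])

# drop="first" removes one dummy column inside the encoder itself
encoder = ColumnTransformer(
    [("town", OneHotEncoder(drop="first"), ["town"])],
    remainder="passthrough",
)
pipeline = make_pipeline(encoder, LinearRegression())
pipeline.fit(homes[["town", "area"]], homes["price"])

# Queries use plain town names; the pipeline encodes them the same way as in training
queries = pd.DataFrame({
    "town": ["west windsor", "robinsville"],
    "area": [3400, 2800],
})
preds = pipeline.predict(queries)
r2 = pipeline.score(homes[["town", "area"]], homes["price"])
print(preds)
print(r2)
```

Since the queries carry town names rather than hand-built dummy columns, there is no column order or column count to get wrong when calling `predict`.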