Categorical variables come in two types. The first is "nominal", where the categories have no inherent order relative to one another. For example: {male, female}, {green, blue, red}<br>

The second type is "ordinal", where the categories have a natural order. For example: {high, medium, low}, {satisfied, neutral, dissatisfied}<br>
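
Because ordinal categories are ranked, a plain integer mapping can preserve their meaning. The snippet below is a minimal sketch: the `answers` list and the `levels` mapping are made up for illustration and are not part of the home prices dataset.

```python
# Hypothetical ordinal data; the ranking low < medium < high is an assumption.
levels = {"low": 0, "medium": 1, "high": 2}
answers = ["high", "low", "medium", "high"]

# Replace each label with its rank so the order is preserved.
encoded = [levels[a] for a in answers]
print(encoded)  # [2, 0, 1, 2]
```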

When we are dealing with nominal categorical variables, simple integer encoding is not going to work, because the integers would imply an order that does not exist. For this type of data, we need to use one-hot encoding.<br>

One-hot encoding creates a new column for each category and assigns it a binary value of one or zero. These extra columns are also called dummy variables.
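
To make the mechanics concrete, here is a rough sketch of one-hot encoding done by hand in plain Python. The short `towns` list is illustrative only; it borrows town names from the dataset used in this notebook but is not the dataset itself.

```python
# Illustrative list of town labels (not the full dataset).
towns = ["monroe township", "west windsor", "robinsville", "west windsor"]

# One column per distinct category, in alphabetical order.
categories = sorted(set(towns))

# Each row gets a 1 in the column of its own category and 0 elsewhere.
rows = [[1 if town == c else 0 for c in categories] for town in towns]

print(categories)  # ['monroe township', 'robinsville', 'west windsor']
for row in rows:
    print(row)
```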

In [1]:
import pandas as pd 

In [2]:
df = pd.read_csv("homeprices3.csv")

In [4]:
df

Unnamed: 0,town,area,price
0,monroe township,2600,550000
1,monroe township,3000,565000
2,monroe township,3200,610000
3,monroe township,3600,680000
4,monroe township,4000,725000
5,west windsor,2600,585000
6,west windsor,2800,615000
7,west windsor,3300,650000
8,west windsor,3600,710000
9,robinsville,2600,575000


Using pandas to create dummy variables

In [5]:
dummies = pd.get_dummies(df['town'])
dummies

Unnamed: 0,monroe township,robinsville,west windsor
0,1,0,0
1,1,0,0
2,1,0,0
3,1,0,0
4,1,0,0
5,0,0,1
6,0,0,1
7,0,0,1
8,0,0,1
9,0,1,0


In [7]:
merged = pd.concat([df,dummies],axis=1)
merged

Unnamed: 0,town,area,price,monroe township,robinsville,west windsor
0,monroe township,2600,550000,1,0,0
1,monroe township,3000,565000,1,0,0
2,monroe township,3200,610000,1,0,0
3,monroe township,3600,680000,1,0,0
4,monroe township,4000,725000,1,0,0
5,west windsor,2600,585000,0,0,1
6,west windsor,2800,615000,0,0,1
7,west windsor,3300,650000,0,0,1
8,west windsor,3600,710000,0,0,1
9,robinsville,2600,575000,0,1,0


In [13]:
final = merged.drop('town',axis=1)
final

Unnamed: 0,area,price,monroe township,robinsville,west windsor
0,2600,550000,1,0,0
1,3000,565000,1,0,0
2,3200,610000,1,0,0
3,3600,680000,1,0,0
4,4000,725000,1,0,0
5,2600,585000,0,0,1
6,2800,615000,0,0,1
7,3300,650000,0,0,1
8,3600,710000,0,0,1
9,2600,575000,0,1,0


### Dummy Variable Trap

When one variable can be derived from the other variables, the variables are said to be multicollinear. Here, if you know the values of monroe township and robinsville, you can easily infer the value of west windsor: it is 1 only when monroe township=0 and robinsville=0. Therefore these town variables are multicollinear. In this situation linear regression may not work as expected, so you need to drop one column.<br>

<b>NOTE: the sklearn library takes care of the dummy variable trap, so it is going to work even if you don't drop one of the town columns. However, we should make a habit of handling the dummy variable trap ourselves in case the library we are using does not handle it for us.</b>
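
One way to see the trap is to check the rank of the design matrix. The sketch below uses a small hand-written dummy matrix (an assumption, not the notebook's data) plus an explicit intercept column: with all three dummy columns the matrix has four columns but only rank 3, and dropping one dummy restores full column rank.

```python
import numpy as np

# Hand-written dummy columns, in the order:
# monroe township, robinsville, west windsor
dummies = np.array([
    [1, 0, 0],
    [1, 0, 0],
    [0, 0, 1],
    [0, 0, 1],
    [0, 1, 0],
])
intercept = np.ones((len(dummies), 1))

# All dummies plus intercept: the intercept equals the sum of the dummies.
full = np.hstack([intercept, dummies])
# Drop the 'west windsor' column, as the notebook does.
reduced = np.hstack([intercept, dummies[:, :2]])

rank_full = np.linalg.matrix_rank(full)
rank_reduced = np.linalg.matrix_rank(reduced)
print(full.shape[1], rank_full)        # 4 columns, rank 3
print(reduced.shape[1], rank_reduced)  # 3 columns, rank 3
```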

In [14]:
final = final.drop('west windsor',axis=1)
final

Unnamed: 0,area,price,monroe township,robinsville
0,2600,550000,1,0
1,3000,565000,1,0
2,3200,610000,1,0
3,3600,680000,1,0
4,4000,725000,1,0
5,2600,585000,0,0
6,2800,615000,0,0
7,3300,650000,0,0
8,3600,710000,0,0
9,2600,575000,0,1
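
As an alternative to dropping a column by hand, `pd.get_dummies` accepts a `drop_first` argument. The sketch below uses a small illustrative Series rather than the full dataset. Note that `drop_first=True` removes the first category in sorted order (monroe township here), not west windsor as in the cell above, and that `dtype=int` is passed because newer pandas versions return boolean columns by default.

```python
import pandas as pd

# Illustrative Series with one row per town (not the full dataset).
towns = pd.Series(["monroe township", "west windsor", "robinsville"], name="town")

# drop_first=True drops the first category in sorted order.
dummies = pd.get_dummies(towns, drop_first=True, dtype=int)
print(dummies)
```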


In [15]:
X = final.drop('price',axis=1)
X

Unnamed: 0,area,monroe township,robinsville
0,2600,1,0
1,3000,1,0
2,3200,1,0
3,3600,1,0
4,4000,1,0
5,2600,0,0
6,2800,0,0
7,3300,0,0
8,3600,0,0
9,2600,0,1


In [17]:
y = final.price
y

0     550000
1     565000
2     610000
3     680000
4     725000
5     585000
6     615000
7     650000
8     710000
9     575000
10    600000
11    620000
12    695000
Name: price, dtype: int64

In [23]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()

In [24]:
model.fit(X.values,y)

LinearRegression()

In [27]:
model.predict(X.values)

array([539709.7398409 , 590468.71640508, 615848.20468716, 666607.18125134,
       717366.15781551, 579723.71533005, 605103.20361213, 668551.92431735,
       706621.15674048, 565396.15136531, 603465.38378844, 628844.87207052,
       692293.59277574])

In [28]:
model.predict([[3400,0,0]]) # 3400 sq ft home in west windsor

array([681241.66845839])

In [29]:
model.predict([[2800,0,1]]) # 2800 sq ft home in robinsville

array([590775.63964739])

To see how well the model fits, we can use the 'model.score' method. Given X and y, it predicts a value for every row in X, compares the predicted values with the actual values (y), and returns the coefficient of determination (R²).

In [31]:
model.score(X.values,y)

0.9573929037221873

So our model explains about 96% of the variance in price (R² ≈ 0.957) on the data it was trained on.
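
For reference, the score reported by a linear regression model in sklearn is R², which can be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses a hypothetical helper `r_squared` and made-up numbers, not the notebook's predictions.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Made-up values for illustration only.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 3.0]

score = r_squared(actual, predicted)
print(score)
```

A score of 1.0 means the predictions match the actual values exactly; a model that always predicts the mean of y scores 0.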

### Using sklearn OneHotEncoder

In [52]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

In [53]:
dfle = df.copy()  # copy so the original df keeps its town names
dfle.town = le.fit_transform(dfle.town)

In [54]:
dfle

Unnamed: 0,town,area,price
0,0,2600,550000
1,0,3000,565000
2,0,3200,610000
3,0,3600,680000
4,0,4000,725000
5,2,2600,585000
6,2,2800,615000
7,2,3300,650000
8,2,3600,710000
9,1,2600,575000


In [86]:
X = dfle[['area']].values
X

array([[2600],
       [3000],
       [3200],
       [3600],
       [4000],
       [2600],
       [2800],
       [3300],
       [3600],
       [2600],
       [2900],
       [3100],
       [3600]], dtype=int64)

In [87]:
X = pd.DataFrame(X)

In [74]:
y = dfle.price
y

0     550000
1     565000
2     610000
3     680000
4     725000
5     585000
6     615000
7     650000
8     710000
9     575000
10    600000
11    620000
12    695000
Name: price, dtype: int64

In [75]:
from sklearn.preprocessing import OneHotEncoder 
ohe = OneHotEncoder()

In [82]:
X1 = ohe.fit_transform(dfle[['town']].values).toarray()
X1

array([[1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 1., 0.]])

In [89]:
dfle2 = pd.DataFrame(X1)

In [88]:
# pd.concat expects a list of objects; axis=1 places them side by side
final = pd.concat([dfle2, X], axis=1)
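
The encoding and regression steps can also be combined into a single sklearn pipeline. This is a sketch of one possible approach, not the workflow used above: it rebuilds the first ten rows of the dataset inline, applies `OneHotEncoder(drop="first")` to the town column through a `ColumnTransformer`, and passes `area` through unchanged. It assumes a reasonably recent sklearn release in which `OneHotEncoder` accepts string columns and the `drop` argument.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# First ten rows of the home prices data, rebuilt inline.
df = pd.DataFrame({
    "town": ["monroe township"] * 5 + ["west windsor"] * 4 + ["robinsville"],
    "area": [2600, 3000, 3200, 3600, 4000, 2600, 2800, 3300, 3600, 2600],
    "price": [550000, 565000, 610000, 680000, 725000,
              585000, 615000, 650000, 710000, 575000],
})

# One-hot encode 'town' (dropping the first category to avoid the
# dummy variable trap) and pass 'area' through unchanged.
encode_town = ColumnTransformer(
    [("town", OneHotEncoder(drop="first"), ["town"])],
    remainder="passthrough",
)

model = make_pipeline(encode_town, LinearRegression())
model.fit(df[["town", "area"]], df["price"])

# Predict for a 3400 sq ft home in west windsor.
new_home = pd.DataFrame({"town": ["west windsor"], "area": [3400]})
prediction = model.predict(new_home)
print(prediction)
```

With this setup the town names never need to be converted to integers first, and new rows can be passed to `predict` using the original column names.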