# DATA PREPROCESSING IN MACHINE LEARNING

## Dummy Variables And One Hot Encoding

**Integer encoding (label encoding):** This replaces each distinct non-numerical value with a unique integer.
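
As a minimal sketch of integer encoding, the snippet below uses `pd.factorize` on a small hypothetical column of town names (the variable names are illustrative, not part of the notebook). Integers are assigned in order of first appearance, so the numbers carry no real meaning beyond identity.

```python
import pandas as pd

# Hypothetical toy column used only for illustration
towns = pd.Series(["monroe township", "west windsor", "robinsville", "west windsor"])

# factorize returns one integer code per row plus the distinct values it found
codes, uniques = pd.factorize(towns)

# Attach the codes back to the original index
label_encoded = pd.Series(codes, index=towns.index, name="town_code")
print(label_encoded.tolist())
print(list(uniques))
```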

**Categorical (discrete) variables:** These are variables that take one of two or more categories or values.
> **Types of categorical variables:**\
> There are two types of categorical variables:
> - **Nominal:** Categories with no inherent ordering, for example male and female, or names of towns.
> - **Ordinal:** Categories with a clear ordering, for example grades A, B and C, or low, medium and high.
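
For ordinal data the order can be kept when encoding. The sketch below assumes a hypothetical `ratings` column with the levels low, medium and high, and uses an ordered `pd.Categorical` so that the integer codes follow the stated order rather than alphabetical order.

```python
import pandas as pd

# The order is stated explicitly: low < medium < high
levels = ["low", "medium", "high"]

# Hypothetical ordinal column used only for illustration
ratings = pd.Series(["medium", "low", "high", "low"])

ordered = pd.Categorical(ratings, categories=levels, ordered=True)

# codes gives the position of each value within `levels`
ordinal_codes = pd.Series(ordered.codes, index=ratings.index)
print(ordinal_codes.tolist())
```

Nominal data has no such order, which is why it is one hot encoded instead.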

### One Hot Encoding

Nominal values have no clear ordering, but our models work on numeric values, so we need to encode them. To do this we'll use one hot encoding: create a separate column in the DataFrame for each category, then place a 1 in the rows where that category applies and a 0 in all other rows. These new columns are the dummy variables.

**Dummy variables** - variables that take the value 1 or 0 to represent the presence or absence of a categorical value.
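
A minimal sketch of one hot encoding on a three-row toy column, assuming pandas is available. The `dtype=int` argument is an addition here so that the result shows 1 and 0 regardless of pandas version.

```python
import pandas as pd

# Toy column with one row per town
towns = pd.Series(["monroe township", "west windsor", "robinsville"], name="town")

# One new column per category; exactly one of them is 1 in each row
one_hot = pd.get_dummies(towns, prefix="town", dtype=int)
print(one_hot)
```

Each row sums to 1 across the dummy columns, which is the property behind the dummy variable trap.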

## Dummy Variable Trap

**Note: when working with dummy variables, watch out for the dummy variable trap, which makes the coefficients of a regression model unreliable.**
[Read More Here](https://www.algosome.com/articles/dummy-variable-trap-regression.html#:~:text=The%20Dummy%20Variable%20trap%20is,%2Ffemale\)%20as%20an%20example.)

In short, the dummy variable trap occurs because of [multi-collinearity](https://en.wikipedia.org/wiki/Multicollinearity) between the dummy variables (the value of one column can be predicted exactly from the others).
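
The collinearity can be checked numerically. The sketch below uses a hypothetical five-row one hot matrix (not the housing data) and compares the rank of the design matrix with and without one dummy column, assuming the model includes an intercept.

```python
import numpy as np

# Three categories, one hot encoded (rows are observations); illustrative values
dummies = np.array([
    [1, 0, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
])
intercept = np.ones((dummies.shape[0], 1))

# Intercept plus all three dummies: 4 columns
full_design = np.hstack([intercept, dummies])
# Intercept plus two dummies (first one dropped): 3 columns
reduced_design = np.hstack([intercept, dummies[:, 1:]])

rank_full = np.linalg.matrix_rank(full_design)
rank_reduced = np.linalg.matrix_rank(reduced_design)
print(full_design.shape[1], rank_full)
print(reduced_design.shape[1], rank_reduced)
```

With all three dummies the rank is lower than the number of columns, because the dummies add up to the intercept column. Dropping one dummy leaves only independent columns.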

## How To Get Around Dummy Variable Trap

The solution to the dummy variable trap is to drop one of the dummy columns (or, alternatively, drop the intercept constant). If there are m categories, use m-1 dummy columns in the model. The category left out can be thought of as the reference value, and the fitted coefficients of the remaining categories represent the change from this reference. [source](https://www.algosome.com/articles/dummy-variable-trap-regression.html#:~:text=The%20Dummy%20Variable%20trap%20is,%2Ffemale\)%20as%20an%20example.)
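
One way to get m-1 columns directly is the `drop_first=True` option of `pd.get_dummies`. The sketch below uses a hypothetical three-row frame named `df_demo`. Note that `drop_first` drops the first category in sorted order (here monroe township), whereas the notebook cells below drop west windsor by hand. Either choice is valid; only the reference category changes.

```python
import pandas as pd

# Small illustrative frame, one row per town
df_demo = pd.DataFrame({
    "town": ["monroe township", "west windsor", "robinsville"],
    "area": [2600, 2600, 2600],
})

# drop_first=True keeps m-1 dummy columns; the dropped category is the reference
encoded = pd.get_dummies(df_demo, columns=["town"], drop_first=True, dtype=int)
print(encoded)
```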

In [1]:
import pandas as pd
import numpy as np

In [4]:
df = pd.read_csv('homeprices.csv')
df

| | town | area | price |
|---|---|---|---|
| 0 | monroe township | 2600 | 550000 |
| 1 | monroe township | 3000 | 565000 |
| 2 | monroe township | 3200 | 610000 |
| 3 | monroe township | 3600 | 680000 |
| 4 | monroe township | 4000 | 725000 |
| 5 | west windsor | 2600 | 585000 |
| 6 | west windsor | 2800 | 615000 |
| 7 | west windsor | 3300 | 650000 |
| 8 | west windsor | 3600 | 710000 |
| 9 | robinsville | 2600 | 575000 |


## Creating The Dummy Variables Columns

In [7]:
df_with_dummies = pd.get_dummies(df, dtype=int)  # dtype=int keeps the 0/1 output shown here; pandas >= 2.0 returns booleans by default
df_with_dummies

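| | area | price | town_monroe township | town_robinsville | town_west windsor |
|---|---|---|---|---|---|
| 0 | 2600 | 550000 | 1 | 0 | 0 |
| 1 | 3000 | 565000 | 1 | 0 | 0 |
| 2 | 3200 | 610000 | 1 | 0 | 0 |
| 3 | 3600 | 680000 | 1 | 0 | 0 |
| 4 | 4000 | 725000 | 1 | 0 | 0 |
| 5 | 2600 | 585000 | 0 | 0 | 1 |
| 6 | 2800 | 615000 | 0 | 0 | 1 |
| 7 | 3300 | 650000 | 0 | 0 | 1 |
| 8 | 3600 | 710000 | 0 | 0 | 1 |
| 9 | 2600 | 575000 | 0 | 1 | 0 |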


In [15]:
df_with_dummies = df_with_dummies.drop(['town_west windsor'], axis='columns')  # west windsor becomes the reference category
df_with_dummies

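| | area | price | town_monroe township | town_robinsville |
|---|---|---|---|---|
| 2 | 3200 | 610000 | 1 | 0 |
| 3 | 3600 | 680000 | 1 | 0 |
| 4 | 4000 | 725000 | 1 | 0 |
| 5 | 2600 | 585000 | 0 | 0 |
| 6 | 2800 | 615000 | 0 | 0 |
| 7 | 3300 | 650000 | 0 | 0 |
| 8 | 3600 | 710000 | 0 | 0 |
| 9 | 2600 | 575000 | 0 | 1 |
| 10 | 2900 | 600000 | 0 | 1 |
| 11 | 3100 | 620000 | 0 | 1 |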


## Create Regression Model

In [23]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

In [33]:
Y = df_with_dummies['price']  # target: the price column
X = df_with_dummies.drop(['price'], axis='columns')  # features: area plus the two remaining dummy columns

In [49]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, train_size=0.8)  # random 80/20 split; rows differ between runs unless random_state is set
y_test

3     680000
12    695000
11    620000
Name: price, dtype: int64
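
Because the split is random, the rows in `y_test` (and therefore the fitted coefficients) change from run to run. A minimal sketch of making the split repeatable with the `random_state` parameter, using placeholder arrays rather than the housing data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features and target, used only to demonstrate the split
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# The same random_state gives the same partition on every call
split_a = train_test_split(X_demo, y_demo, train_size=0.8, random_state=0)
split_b = train_test_split(X_demo, y_demo, train_size=0.8, random_state=0)

same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
```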

In [50]:
model = LinearRegression()

In [51]:
model.fit(X_train, y_train)

LinearRegression()

In [52]:
model.coef_

array([   122.67002519, -36901.76322418, -12632.2418136 ])
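
The coefficients are easier to read when paired with their column names. The sketch below does this on hypothetical, noise-free data (price = 100 * area + 300000 + a town offset), so the fitted values can be compared with the numbers that generated them. It is not the notebook's data or model.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical features; west windsor is the reference (both dummies 0)
demo = pd.DataFrame({
    "area": [2600, 3000, 3200, 2600, 2800, 3300, 2600, 2900],
    "town_monroe township": [1, 1, 1, 0, 0, 0, 0, 0],
    "town_robinsville": [0, 0, 0, 0, 0, 0, 1, 1],
})

# Made-up offsets relative to the reference town
offsets = -30000 * demo["town_monroe township"] - 10000 * demo["town_robinsville"]
demo_price = 100 * demo["area"] + 300000 + offsets

demo_model = LinearRegression().fit(demo, demo_price)

# Each dummy coefficient is the price shift relative to the reference town
coef_by_name = pd.Series(demo_model.coef_, index=demo.columns)
print(coef_by_name)
print(demo_model.intercept_)
```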

In [53]:
model.predict([[3600, 1, 0]])  # area=3600, town_monroe township=1, town_robinsville=0

array([667500.])

In [54]:
model.predict([[3300, 0, 0]])  # area=3300, both dummies 0, i.e. the reference town west windsor

array([667600.75566751])
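
When a model is fitted on a DataFrame, recent scikit-learn versions warn if `predict` receives a plain list without column names. A sketch of passing a one-row DataFrame instead is shown below. The training rows are a handful copied from the tables above, so `demo_model` is a stand-in and its prediction will not match the notebook's output.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

feature_names = ["area", "town_monroe township", "town_robinsville"]

# A few rows copied from the tables above, for illustration only
train = pd.DataFrame(
    [[2600, 1, 0], [3000, 1, 0], [2600, 0, 0], [2800, 0, 0], [2600, 0, 1], [2900, 0, 1]],
    columns=feature_names,
)
prices = [550000, 565000, 585000, 615000, 575000, 600000]

demo_model = LinearRegression().fit(train, prices)

# Naming the columns makes the order of the inputs explicit
query = pd.DataFrame([[3600, 1, 0]], columns=feature_names)
prediction = demo_model.predict(query)
print(prediction)
```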