### Encode categorical variables using One Hot Encoding Method

In [1]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression

import warnings
warnings.filterwarnings('ignore')

In [2]:
df = pd.read_csv("homeprices.csv")
df

Unnamed: 0,town,area,price
0,Kolkata,2600,550000
1,Kolkata,3000,565000
2,Kolkata,3200,610000
3,Kolkata,3600,680000
4,Kolkata,4000,725000
5,Hyderabad,2600,585000
6,Hyderabad,2800,615000
7,Hyderabad,3300,650000
8,Hyderabad,3600,710000
9,Bangalore,2600,575000


#### Before applying one-hot encoding (OHE), let us first apply label encoding

In [3]:
le = LabelEncoder()

In [4]:
df.town = le.fit_transform(df.town)
df

Unnamed: 0,town,area,price
0,2,2600,550000
1,2,3000,565000
2,2,3200,610000
3,2,3600,680000
4,2,4000,725000
5,1,2600,585000
6,1,2800,615000
7,1,3300,650000
8,1,3600,710000
9,0,2600,575000


The town column in the DataFrame has been successfully encoded using Label Encoding. The towns have been converted to numerical values:

Kolkata is encoded as **2**
Hyderabad is encoded as **1**
Bangalore is encoded as **0**

The **LabelEncoder** in scikit-learn assigns labels according to the **sorted order** of the categories, which for strings like these is alphabetical. It does not use the **order in which categories appear** in the data or any other criterion. Here is how the labeling works in our case:

We have three towns: Bangalore, Hyderabad, and Kolkata.
Alphabetically, "Bangalore" comes first, so it gets labeled as 0. "Hyderabad" comes next, so it gets labeled as 1.
"Kolkata" is last alphabetically, so it gets labeled as 2.

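The sorted-order rule can be checked directly through the encoder's `classes_` attribute. The snippet below is a minimal sketch on a small hand-written list of towns (not the rows of homeprices.csv); the names `towns`, `codes` and `mapping` are illustrative only.

```python
from sklearn.preprocessing import LabelEncoder

# Towns listed in order of appearance, deliberately not alphabetical
towns = ["Kolkata", "Hyderabad", "Bangalore", "Kolkata"]

le_demo = LabelEncoder()
codes = le_demo.fit_transform(towns)

# classes_ holds the sorted unique categories; each one's position is its label
mapping = {town: code for code, town in enumerate(le_demo.classes_)}
print(mapping)
print(codes.tolist())
```

Even though "Kolkata" appears first in the list, it still receives the largest label because it sorts last.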
In [5]:
# retrieve the feature columns (training data)
x = df[['town', 'area']]
x

Unnamed: 0,town,area
0,2,2600
1,2,3000
2,2,3200
3,2,3600
4,2,4000
5,1,2600
6,1,2800
7,1,3300
8,1,3600
9,0,2600


In [6]:
# retrieve the target column
y = df.price
y

0     550000
1     565000
2     610000
3     680000
4     725000
5     585000
6     615000
7     650000
8     710000
9     575000
10    600000
11    620000
12    695000
Name: price, dtype: int64

**The OneHotEncoder instance is created here with the _handle_unknown='ignore'_ option. With this option, a category that shows up at transform time (for example, in test data) but was not present when the encoder was fitted is encoded as a row of all zeros instead of raising an error.**
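To see what this option does in practice, the sketch below fits two encoders on three towns and then transforms a town that was never seen during fitting. "Chennai" is a hypothetical value chosen for illustration and is not part of the dataset; the variable names are likewise illustrative.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"town": ["Kolkata", "Hyderabad", "Bangalore"]})
unseen = pd.DataFrame({"town": ["Chennai"]})  # hypothetical town, absent from training

# With handle_unknown="ignore", an unseen category becomes an all-zero row
lenient = OneHotEncoder(handle_unknown="ignore").fit(train)
unknown_row = lenient.transform(unseen).toarray()
print(unknown_row)

# With the default (handle_unknown="error"), the same call raises ValueError
strict = OneHotEncoder().fit(train)
try:
    strict.transform(unseen)
    raised = False
except ValueError:
    raised = True
print(raised)
```

Note that an all-zero row is not an error, but the model then treats that house as belonging to none of the known towns.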

In [7]:
ohe = OneHotEncoder(handle_unknown='ignore')

In [8]:
x1 = ohe.fit_transform(df[['town']])
x1

<13x3 sparse matrix of type '<class 'numpy.float64'>'
	with 13 stored elements in Compressed Sparse Row format>

**1. ohe.fit_transform(df[['town']]):** This call uses the OneHotEncoder object ohe to both fit and transform the town column of the DataFrame.

**2. The fit part of fit_transform** learns the unique values in the town column and assigns a new binary column to each one. In this case, it identifies three unique towns: Bangalore, Hyderabad, and Kolkata (already label-encoded as 0, 1 and 2).

**3. The transform part of fit_transform** then takes each row in the town column and encodes it as a binary array in which only the column corresponding to that row's town is 1, and all others are 0.

**The result of this operation is a sparse matrix, a format that stores only the nonzero entries and therefore represents a mostly-zero matrix efficiently.**
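The fit and transform steps described above can be inspected separately. The following sketch uses a four-row toy column (not the notebook's data) and reads the learned categories from `categories_`, then confirms that the output is sparse and that each row has exactly one 1.

```python
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

towns = pd.DataFrame({"town": ["Kolkata", "Kolkata", "Hyderabad", "Bangalore"]})

enc = OneHotEncoder(handle_unknown="ignore")
encoded = enc.fit_transform(towns)

# fit: the unique categories, in sorted order, one output column per category
learned = enc.categories_[0].tolist()
print(learned)

# transform: the default output is a SciPy sparse matrix
is_sparse = sparse.issparse(encoded)
print(is_sparse)

# Converting to a dense array shows exactly one 1 per row
dense = encoded.toarray()
print(dense)
```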

In [9]:
x1 = pd.DataFrame(x1.toarray())
x1

Unnamed: 0,0,1,2
0,0.0,0.0,1.0
1,0.0,0.0,1.0
2,0.0,0.0,1.0
3,0.0,0.0,1.0
4,0.0,0.0,1.0
5,0.0,1.0,0.0
6,0.0,1.0,0.0
7,0.0,1.0,0.0
8,0.0,1.0,0.0
9,1.0,0.0,0.0


**x1.toarray():** This converts the sparse matrix obtained from the OneHotEncoder into a regular (dense) NumPy array. This array now represents the one-hot encoded form of the town column.

**pd.DataFrame(...):** This converts the NumPy array into a pandas DataFrame, making it easier to view and manipulate. Each column in this DataFrame corresponds to one unique value in the original town column, with rows representing the one-hot encoded binary values.

In [10]:
# to avoid the dummy variable trap, drop the 0th column
x1 = x1.iloc[:, 1:]
x1

Unnamed: 0,1,2
0,0.0,1.0
1,0.0,1.0
2,0.0,1.0
3,0.0,1.0
4,0.0,1.0
5,1.0,0.0
6,1.0,0.0
7,1.0,0.0
8,1.0,0.0
9,0.0,0.0


### In the DataFrame above:

The column labeled **1 represents Hyderabad.**
The column labeled **2 represents Kolkata.**

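The absence of both (0 in both columns) implicitly indicates Bangalore, the category represented by the dropped column. Dropping one column prevents perfect multicollinearity: with all three indicator columns present, every row would sum to 1, which duplicates the model's intercept term.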

In [11]:
#add these columns to x
x = pd.concat([x,x1], axis="columns")
x

Unnamed: 0,town,area,1,2
0,2,2600,0.0,1.0
1,2,3000,0.0,1.0
2,2,3200,0.0,1.0
3,2,3600,0.0,1.0
4,2,4000,0.0,1.0
5,1,2600,1.0,0.0
6,1,2800,1.0,0.0
7,1,3300,1.0,0.0
8,1,3600,1.0,0.0
9,0,2600,0.0,0.0


**In this DataFrame:**

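_Area represents the area of the houses._ The columns labeled 1 and 2 are the one-hot encoded columns for Hyderabad and Kolkata, respectively. The label-encoded town column is still present at this point and is dropped in the next step.

**A 1.0 indicates that the house is in that town, and a 0.0 indicates that it is not.**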

In [12]:
x.drop('town', axis=1, inplace=True)
x

Unnamed: 0,area,1,2
0,2600,0.0,1.0
1,3000,0.0,1.0
2,3200,0.0,1.0
3,3600,0.0,1.0
4,4000,0.0,1.0
5,2600,1.0,0.0
6,2800,1.0,0.0
7,3300,1.0,0.0
8,3600,1.0,0.0
9,2600,0.0,0.0


### Convert all column names in 'x' to strings
You may encounter a TypeError when trying to fit the linear regression model with the sklearn library. The error message indicates that the feature names in the DataFrame x are of mixed types (some are strings and some are integers). The sklearn library requires all feature names to be of the same type, preferably strings, for compatibility.
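The mixed types are easy to miss because `area` and `1` look alike when printed. The sketch below builds a two-row stand-in for `x` (the name `x_demo` and its values are illustrative) and lists the Python types of the column labels before and after the conversion.

```python
import pandas as pd

# Stand-in for x: one string label ("area") and two integer labels (1, 2)
x_demo = pd.DataFrame({"area": [2600, 3000], 1: [0.0, 1.0], 2: [1.0, 0.0]})

types_before = {type(c).__name__ for c in x_demo.columns}
print(types_before)

# Cast every column label to str so all feature names share one type
x_demo.columns = x_demo.columns.astype(str)

types_after = {type(c).__name__ for c in x_demo.columns}
print(types_after)
print(list(x_demo.columns))
```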

In [13]:
x.columns = x.columns.astype(str)
x

Unnamed: 0,area,1,2
0,2600,0.0,1.0
1,3000,0.0,1.0
2,3200,0.0,1.0
3,3600,0.0,1.0
4,4000,0.0,1.0
5,2600,1.0,0.0
6,2800,1.0,0.0
7,3300,1.0,0.0
8,3600,1.0,0.0
9,2600,0.0,0.0


In [14]:
model = LinearRegression()
model.fit(x, y)  # train the model

### Predict the price of a 2800 sq ft house located in Kolkata

In [15]:
model.predict([[2800,0,1]])

array([565089.22812299])

### Predict the price of a 3400 sq ft house located in Bangalore

In [16]:
model.predict([[3400,0,0]])

array([666914.10449365])
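A prediction from `LinearRegression` is the intercept plus the dot product of the coefficients with the feature row, so the order of values passed to `predict` must match the column order used for fitting (area, then the two town indicators). The sketch below illustrates this on made-up toy numbers with an exact linear relationship, not on homeprices.csv; all names and values are assumptions for illustration. Passing a DataFrame with the same column names also makes the feature order explicit.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up toy data: price = 100 + 2*area + 50*(col "1") + 30*(col "2")
X_toy = pd.DataFrame({
    "area": [1000, 1500, 2000, 2500, 1200, 1800],
    "1":    [0, 0, 1, 1, 0, 0],
    "2":    [1, 0, 0, 0, 0, 1],
})
y_toy = 100 + 2 * X_toy["area"] + 50 * X_toy["1"] + 30 * X_toy["2"]

toy_model = LinearRegression().fit(X_toy, y_toy)

# Query row with the same column names and order as the training data
query = pd.DataFrame([[1600, 0, 1]], columns=X_toy.columns)
predicted = toy_model.predict(query)[0]

# The same value computed by hand: intercept + coefficients . features
manual = toy_model.intercept_ + np.dot(toy_model.coef_, query.iloc[0].to_numpy())
print(predicted, manual)
```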

### R² score (coefficient of determination) on the training data

In [17]:
model.score(x,y)

0.9573929037221872
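For a regressor, `model.score` returns the coefficient of determination R², not a classification accuracy, and here it is computed on the same data the model was trained on. The sketch below reproduces the R² formula by hand on a small made-up dataset (the numbers are illustrative, not from homeprices.csv) and compares it with the value from `score`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up, nearly linear toy data
X_toy = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

reg = LinearRegression().fit(X_toy, y_toy)
pred = reg.predict(X_toy)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_toy - pred) ** 2)
ss_tot = np.sum((y_toy - y_toy.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

r2_sklearn = reg.score(X_toy, y_toy)
print(r2_manual, r2_sklearn)
```

A score measured on the training data tends to be optimistic; evaluating on held-out rows gives a better picture of how the model generalizes.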