<h2 align="center">Codebasics ML Course: Data Preprocessing: One Hot Encoding</h2>

In [1]:
import pandas as pd

df = pd.read_csv("home_prices.csv")
df

Unnamed: 0,locality,area_sqr_ft,price_lakhs,bedrooms
0,Kollur,656,39.0,2
1,Kollur,1260,83.2,2
2,Kollur,1057,86.6,3
3,Kollur,1259,59.0,2
4,Kollur,1800,140.0,3
5,Kollur,1325,80.1,2
6,Kollur,1085,116.0,3
7,Kollur,1110,45.0,2
8,Kollur,1700,100.0,3
9,Banjara Hills,1650,200.0,3


In [2]:
df_encoded = pd.get_dummies(df, columns=['locality'], drop_first=True)
df_encoded.sample(5)

Unnamed: 0,area_sqr_ft,price_lakhs,bedrooms,locality_Kollur,locality_Mankhal
16,1100,54.0,2,0,1
4,1800,140.0,3,1,0
7,1110,45.0,2,1,0
5,1325,80.1,2,1,0
19,1266,78.0,2,0,1
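
To see what `drop_first=True` does, here is a minimal sketch on a hypothetical three-row toy frame (only the locality names are taken from the notebook; the frame itself is made up). It assumes the dataset has three localities, which matches the two dummy columns in the output above. With `drop_first=True`, the alphabetically first category is dropped and is represented by zeros in all remaining dummy columns.

```python
import pandas as pd

# Hypothetical toy data; only the locality names come from the notebook.
toy = pd.DataFrame({"locality": ["Banjara Hills", "Kollur", "Mankhal"]})

# One dummy column per category.
full = pd.get_dummies(toy, columns=["locality"], dtype=int)

# Drop the first (alphabetically sorted) category to avoid a redundant column.
reduced = pd.get_dummies(toy, columns=["locality"], drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
print(reduced)
```

In `reduced`, the "Banjara Hills" row has 0 in both remaining columns, so no information is lost by dropping its column.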


In [3]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression


X = df_encoded.drop('price_lakhs', axis=1)
y = df_encoded['price_lakhs']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Building the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

# R^2 (coefficient of determination) on the held-out test set
model.score(X_test, y_test)

0.8558905263155381

In [4]:
X_test

Unnamed: 0,area_sqr_ft,bedrooms,locality_Kollur,locality_Mankhal
0,656,2,1,0
13,2400,3,0,0
8,1700,3,1,0
1,1260,2,1,0
15,2600,3,0,0


### The model is trained. Now let's predict home prices

In [5]:
test = pd.DataFrame([
    {'area_sqr_ft': 1600, "bedrooms": 2, "locality_Kollur": False, "locality_Mankhal": False},
    {'area_sqr_ft': 1600, "bedrooms": 2, "locality_Kollur": False, "locality_Mankhal": True},
])

model.predict(test)

array([157.03383393, 109.25104283])
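
The prediction input above was written out by hand with the dummy columns already in place. If new homes arrive with a raw `locality` column instead, one possible approach is to encode them and then align the result to the training columns. The sketch below assumes the column order shown in the `X_test` output, and the `new_homes` frame is hypothetical.

```python
import pandas as pd

# Column order the model was trained on (taken from the X_test output).
train_columns = ["area_sqr_ft", "bedrooms", "locality_Kollur", "locality_Mankhal"]

# Hypothetical raw homes to price.
new_homes = pd.DataFrame({
    "area_sqr_ft": [1600, 1600],
    "bedrooms": [2, 2],
    "locality": ["Banjara Hills", "Mankhal"],
})

# Encode without drop_first: new data may not contain every locality,
# so the "first" category here could differ from the one dropped in training.
encoded = pd.get_dummies(new_homes, columns=["locality"], dtype=int)

# Keep only the training columns, in training order; missing dummies become 0.
aligned = encoded.reindex(columns=train_columns, fill_value=0)
print(aligned)
```

The `aligned` frame has the same columns as the training data, so it can be passed to `model.predict` in place of the hand-built `test` frame.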

### You can also use scikit-learn's `OneHotEncoder` class for one-hot encoding

In [8]:
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Sample data
fruits = np.array([['Apple'], ['Banana'], ['Cherry'], ['Apple'], ['Cherry']])

# Initialize OneHotEncoder with custom parameters
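# Note: the dense-output flag is `sparse_output` (it was named `sparse` before scikit-learn 1.2)
encoder = OneHotEncoder(categories='auto', drop='first', sparse_output=False, handle_unknown='ignore')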

# Fit and transform the data
one_hot_encoded = encoder.fit_transform(fruits)

print(one_hot_encoded)



[[0. 0.]
 [1. 0.]
 [0. 1.]
 [0. 0.]
 [0. 1.]]


### OneHotEncoder Parameters

1. **categories**: `'auto'` (default) or a list of array-like.
   - `'auto'`: Determine categories automatically from the training data.
   - List: Specify the categories manually for each feature.

2. **drop**: `None` (default), `'first'`, `'if_binary'`, or an array-like of shape (n_features,).
   - `None`: Retain all categories.
   - `'first'`: Drop the first category in each feature.
   - `'if_binary'`: Drop the first category in each feature with two categories.
   - Array: Specify which category to drop for each feature.

3. **sparse_output**: `True` (default) or `False`. This parameter was named `sparse` before scikit-learn 1.2.
   - `True`: Return a sparse matrix.
   - `False`: Return a dense array.

4. **dtype**: Data type of the output, default is `numpy.float64`.

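5. **handle_unknown**: `'error'` (default), `'ignore'`, or `'infrequent_if_exist'`.
   - `'error'`: Raise an error if an unknown category is encountered during transformation.
   - `'ignore'`: Ignore unknown categories and set all one-hot columns for that feature to zero.
   - `'infrequent_if_exist'`: Map unknown categories to the infrequent category if one exists; otherwise treat them as with `'ignore'`.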

6. **min_frequency**: `None` (default) or an int/float.
   - Minimum frequency below which categories are grouped into a single "infrequent" category. An int is an absolute count; a float is a fraction of the number of samples.

7. **max_categories**: `None` (default) or an int.
   - Maximum number of categories per feature. Categories with the lowest frequency are grouped together.

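8. **feature_name_combiner**: `'concat'` (default) or a callable.
   - Controls how output feature names are built from the input feature name and the category. `'concat'` joins them as `feature_category`; a callable receives `(input_feature, category)` and returns the name as a string.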
