# Data Preprocessing

## P2: Dummy Variables & One-Hot Encoding


Most machine learning models require numerical inputs, so categorical variables must be converted to numeric form before training.

### Techniques Covered:
- **Dummy Variable Encoding** (using `pd.get_dummies()`)
- **One-Hot Encoding** (using `OneHotEncoder` from Scikit-learn)


### Step 1: Import libraries and create sample data

In [None]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

data = {'City': ['Delhi', 'Mumbai', 'Delhi', 'Chennai', 'Bangalore']}
df = pd.DataFrame(data)
df

### Step 2: Dummy Encoding using `pd.get_dummies()`

In [None]:
df_dummies = pd.get_dummies(df, drop_first=True)
df_dummies
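
As a minimal sketch of what `drop_first=True` does here, the snippet below requests integer output and checks which category becomes the baseline. The names `dummies_int`, `dropped`, and `baseline_rows` are illustrative and not part of the original notebook; `dtype=int` is an optional choice, since recent pandas versions return boolean columns by default.

```python
import pandas as pd

df = pd.DataFrame({'City': ['Delhi', 'Mumbai', 'Delhi', 'Chennai', 'Bangalore']})

# drop_first=True removes the first category in sorted order ('Bangalore'),
# which becomes the implicit baseline: all remaining columns are 0 for it.
# dtype=int requests 0/1 integers instead of boolean columns.
dummies_int = pd.get_dummies(df, drop_first=True, dtype=int)

# Category that serves as the baseline (illustrative check, not a pandas API).
dropped = sorted(df['City'].unique())[0]
baseline_rows = dummies_int[df['City'] == dropped]

print(dummies_int)
print("Baseline category:", dropped)
```

Rows belonging to the baseline category carry zeros in every remaining column, so no information is lost by dropping it.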

### Step 3: One-Hot Encoding using `OneHotEncoder`

In [None]:
# scikit-learn >= 1.2: the parameter is sparse_output (formerly sparse)
encoder = OneHotEncoder(drop='first', sparse_output=False)
encoded_array = encoder.fit_transform(df[['City']])

# Convert to DataFrame
encoded_df = pd.DataFrame(encoded_array, columns=encoder.get_feature_names_out(['City']))
encoded_df

### Step 4: Concatenate original and encoded features

In [None]:
result_df = pd.concat([df, encoded_df], axis=1)
result_df

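### Conclusion:
- `pd.get_dummies()` is the simpler option: a single call on a pandas DataFrame returns the encoded columns.
- `OneHotEncoder` is more flexible: it learns the categories during `fit`, applies the same encoding to new data with `transform`, and works inside scikit-learn pipelines.
- Dropping the first category avoids perfect multicollinearity (the "dummy variable trap"), which matters mainly for linear models that include an intercept. The dropped category becomes the baseline, represented by all zeros.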
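
The multicollinearity point can be checked numerically. The following is a rough sketch, assuming a linear model with an intercept column: with every category kept, the dummy columns sum to the intercept, so the design matrix loses one rank. The helper `design_matrix` is hypothetical and written only for this illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'City': ['Delhi', 'Mumbai', 'Delhi', 'Chennai', 'Bangalore']})

def design_matrix(drop_first):
    """Return an intercept column followed by the dummy columns as floats."""
    dummies = pd.get_dummies(df['City'], drop_first=drop_first, dtype=float)
    intercept = np.ones((len(df), 1))
    return np.hstack([intercept, dummies.to_numpy()])

full = design_matrix(drop_first=False)    # 1 intercept + 4 dummy columns
reduced = design_matrix(drop_first=True)  # 1 intercept + 3 dummy columns

full_rank = np.linalg.matrix_rank(full)
reduced_rank = np.linalg.matrix_rank(reduced)

print("all categories:", full.shape[1], "columns, rank", full_rank)
print("first dropped: ", reduced.shape[1], "columns, rank", reduced_rank)
```

When the rank is lower than the number of columns, the coefficients of an unregularized linear model are not uniquely determined. Dropping one category restores full column rank.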


1. Data Preprocessing

1.1. P2: Dummy Variables & One-Hot Encoding

Most machine learning models require numerical inputs, so categorical variables must be converted to numeric form before training.

1.1.1. Techniques Covered:

- Dummy Variable Encoding (using `pd.get_dummies()`)
- One-Hot Encoding (using `OneHotEncoder` from Scikit-learn)

1.1.2. Step 1: Import libraries and create sample data

In [1]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

data = {'City': ['Delhi', 'Mumbai', 'Delhi', 'Chennai', 'Bangalore']}
df = pd.DataFrame(data)
df

|   | City |
|---|------|
| 0 | Delhi |
| 1 | Mumbai |
| 2 | Delhi |
| 3 | Chennai |
| 4 | Bangalore |


1.1.3. Step 2: Dummy Encoding using pd.get_dummies()

In [2]:
df_dummies = pd.get_dummies(df, drop_first=True)
df_dummies

|   | City_Chennai | City_Delhi | City_Mumbai |
|---|--------------|------------|-------------|
| 0 | False | True | False |
| 1 | False | False | True |
| 2 | False | True | False |
| 3 | True | False | False |
| 4 | False | False | False |


1.1.4. Step 3: One-Hot Encoding using OneHotEncoder


In [13]:

encoder = OneHotEncoder(drop='first', sparse_output=False)
encoded_array = encoder.fit_transform(df[['City']])

# Convert to DataFrame
encoded_df = pd.DataFrame(encoded_array, columns=encoder.get_feature_names_out(['City']))
encoded_df

|   | City_Chennai | City_Delhi | City_Mumbai |
|---|--------------|------------|-------------|
| 0 | 0.0 | 1.0 | 0.0 |
| 1 | 0.0 | 0.0 | 1.0 |
| 2 | 0.0 | 1.0 | 0.0 |
| 3 | 1.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 |


1.1.5. Step 4: Concatenate original and encoded features

In [14]:
result_df = pd.concat([df, encoded_df], axis=1)
result_df

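|   | City | City_Chennai | City_Delhi | City_Mumbai |
|---|------|--------------|------------|-------------|
| 0 | Delhi | 0.0 | 1.0 | 0.0 |
| 1 | Mumbai | 0.0 | 0.0 | 1.0 |
| 2 | Delhi | 0.0 | 1.0 | 0.0 |
| 3 | Chennai | 1.0 | 0.0 | 0.0 |
| 4 | Bangalore | 0.0 | 0.0 | 0.0 |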
