Encode Categorical Features:

Categorical encoding converts categorical variables (text labels or other nominal data) into a numerical format that machine learning algorithms can use. Common methods include:

**Label Encoding**: Assigns each unique category an integer label. Suitable for ordinal categories, provided the integers follow the category order, and for tree-based models.

**One-Hot Encoding**: Creates one binary column per category, indicating presence or absence. Good for nominal categorical variables with no intrinsic order.

**Pandas Categorical dtype**: Stores each value as a small integer code plus a lookup table of the categories, which saves memory. The codes can be used directly as a label encoding.
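
For ordinal data, the integer codes are only meaningful if they follow the real order of the categories. The sketch below uses a hypothetical `risk` feature (the labels and their order are assumptions for illustration, not part of the dataset used later) to compare the default alphabetical codes with codes taken from an explicitly ordered `pd.CategoricalDtype`.

```python
import pandas as pd

# Hypothetical ordinal feature; labels and ordering are illustrative assumptions
risk = pd.Series(['low', 'high', 'medium', 'low'])

# Default: categories are sorted alphabetically (high=0, low=1, medium=2)
default_codes = risk.astype('category').cat.codes

# Explicit order: low=0, medium=1, high=2
risk_dtype = pd.CategoricalDtype(categories=['low', 'medium', 'high'], ordered=True)
ordered_codes = risk.astype(risk_dtype).cat.codes

print(default_codes.tolist())  # [1, 0, 2, 1]
print(ordered_codes.tolist())  # [0, 2, 1, 0]
```

With the default codes, `high` sorts before `low`, so a model that reads the integers as magnitudes would see the scale in the wrong order.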

In [9]:
import pandas as pd
import numpy as np

# Synthetic dataset with categorical columns and duplicate rows
data = {
    'City': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Houston', 'Los Angeles', 'Boston', 'Boston', 'Chicago', 'Dallaas'],
    'Weather': ['Sunny', 'Rainy', 'Sunny', 'Cloudy', 'Sunny', 'Rainy', 'Rainy', 'Sunny', 'Cloudy', 'Cloudy'],
    'Severity': [3, 2, 3, 1, 3, 2, 3, 3, 1, 1],
    'Accidents': [100, 150, 100, 80, 90, 150, 60, 60, 80, 80]
}

df = pd.DataFrame(data)
print("Original Dataset:")
print(df)

Original Dataset:
          City Weather  Severity  Accidents
0     New York   Sunny         3        100
1  Los Angeles   Rainy         2        150
2     New York   Sunny         3        100
3      Chicago  Cloudy         1         80
4      Houston   Sunny         3         90
5  Los Angeles   Rainy         2        150
6       Boston   Rainy         3         60
7       Boston   Sunny         3         60
8      Chicago  Cloudy         1         80
9      Dallaas  Cloudy         1         80


In [10]:
# Remove duplicate rows from original data
df_no_duplicates = df.drop_duplicates()
print("\nDataset After Removing Duplicates:")
print(df_no_duplicates)


Dataset After Removing Duplicates:
          City Weather  Severity  Accidents
0     New York   Sunny         3        100
1  Los Angeles   Rainy         2        150
3      Chicago  Cloudy         1         80
4      Houston   Sunny         3         90
6       Boston   Rainy         3         60
7       Boston   Sunny         3         60
9      Dallaas  Cloudy         1         80


In [12]:
# Label encoding with the pandas category dtype.
# Take an explicit copy first so the new columns are not written to a
# slice of df (this avoids pandas' SettingWithCopyWarning).
df_no_duplicates = df_no_duplicates.copy()
# cat.codes numbers the categories in sorted (alphabetical) order, starting at 0
df_no_duplicates['City_encoded'] = df_no_duplicates['City'].astype('category').cat.codes
df_no_duplicates['Weather_encoded'] = df_no_duplicates['Weather'].astype('category').cat.codes
print("\nLabel Encoded Dataset:")
print(df_no_duplicates)


Label Encoded Dataset:
          City Weather  Severity  Accidents  City_encoded  Weather_encoded
0     New York   Sunny         3        100             5                2
1  Los Angeles   Rainy         2        150             4                1
3      Chicago  Cloudy         1         80             1                0
4      Houston   Sunny         3         90             3                2
6       Boston   Rainy         3         60             0                1
7       Boston   Sunny         3         60             0                2
9      Dallaas  Cloudy         1         80             2                0
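
The integer codes are only useful if they can be traced back to the original labels. As a minimal sketch, the mapping can be read from `.cat.categories`. The standalone `cities` Series below simply repeats the deduplicated `City` values so the snippet runs on its own; the variable names are illustrative.

```python
import pandas as pd

# City values from the deduplicated dataset, repeated so the snippet is self-contained
cities = pd.Series(['New York', 'Los Angeles', 'Chicago', 'Houston',
                    'Boston', 'Boston', 'Dallaas'])
city_cat = cities.astype('category')

# Position in .cat.categories is the integer code assigned by .cat.codes
code_to_city = dict(enumerate(city_cat.cat.categories))
print(code_to_city)

# Decode the integer codes back to the original labels
decoded = city_cat.cat.codes.map(code_to_city)
print(decoded.tolist())
```

Keeping this mapping alongside the encoded data makes it possible to decode predictions or apply the same codes to new rows.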


In [13]:
# One-Hot Encoding using pandas get_dummies
df_encoded = pd.get_dummies(df_no_duplicates, columns=['Weather'])
print("\nOne-Hot Encoded Dataset:")
print(df_encoded)


One-Hot Encoded Dataset:
          City  Severity  Accidents  City_encoded  Weather_encoded  \
0     New York         3        100             5                2   
1  Los Angeles         2        150             4                1   
3      Chicago         1         80             1                0   
4      Houston         3         90             3                2   
6       Boston         3         60             0                1   
7       Boston         3         60             0                2   
9      Dallaas         1         80             2                0   

   Weather_Cloudy  Weather_Rainy  Weather_Sunny  
0           False          False           True  
1           False           True          False  
3            True          False          False  
4           False          False           True  
6           False           True          False  
7           False          False           True  
9            True          False          False  
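
In the output above, `get_dummies` returns boolean columns. The sketch below shows two optional parameters of `pd.get_dummies` that are sometimes useful: `dtype=int` for 0/1 integers, and `drop_first=True` to drop one redundant column per feature, which helps with linear models. The small `weather` frame is illustrative only.

```python
import pandas as pd

# Small illustrative frame reusing the Weather labels from the dataset
weather = pd.DataFrame({'Weather': ['Sunny', 'Rainy', 'Cloudy', 'Sunny']})

# dtype=int gives 0/1 instead of True/False;
# drop_first=True drops the first category (Cloudy), which becomes the all-zeros baseline
dummies = pd.get_dummies(weather, columns=['Weather'], dtype=int, drop_first=True)
print(dummies)
```

A `Cloudy` row is then represented by zeros in both `Weather_Rainy` and `Weather_Sunny`.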
