
# **Chapter 4: Unleashing the Power of Data Through Transformation and Feature Engineering**

**Feature Engineering**

* Transform raw data into useful features

* Leverages domain knowledge to improve ML models

* Helps uncover hidden patterns

**Basics**

* Binning → group values into ranges

* Interaction Features → combine variables

* Polynomial Features → higher-order terms

Example: Lemonade Stand

* Record: temperature + lemonade sales

* New feature: Lemonade per Degree = sales ÷ temperature

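* Shows how many sales each degree of temperature yields; in the sample data the ratio climbs from 0.40 at 75° to about 0.67 at 89°, so sales grow faster than temperature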

In [21]:
import pandas as pd

In [22]:
# sample data
data = {
'Temperature': [75, 77, 82, 85, 89, 91, 95],
'Lemonade Sold': [30, 35, 50, 55, 60, 65, 80]
}

df = pd.DataFrame(data)

df.head()

| | Temperature | Lemonade Sold |
|---|---|---|
| 0 | 75 | 30 |
| 1 | 77 | 35 |
| 2 | 82 | 50 |
| 3 | 85 | 55 |
| 4 | 89 | 60 |


In [23]:
# create new feature
df['Lemonade per Degree'] = df['Lemonade Sold'] / df['Temperature']

df.head()

| | Temperature | Lemonade Sold | Lemonade per Degree |
|---|---|---|---|
| 0 | 75 | 30 | 0.4 |
| 1 | 77 | 35 | 0.454545 |
| 2 | 82 | 50 | 0.609756 |
| 3 | 85 | 55 | 0.647059 |
| 4 | 89 | 60 | 0.674157 |


**Binning**

* Convert numerical data → categorical data

* Example: temperature → Cool, Warm, Hot, Very Hot

* Benefits:

  - Reduces sensitivity to outliers

  - Simplifies models

  - Can reveal clearer variable–target relationships

In [24]:
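# define bins and labels
# pd.cut uses right-closed intervals by default:
# (70, 75] -> cool, (75, 85] -> warm, (85, 95] -> hot, (95, 100] -> very hot
bins = [70, 75, 85, 95, 100]
labels = ['cool', 'warm', 'hot', 'very hot']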

# apply binning
df['Temperature Category'] = pd.cut(df['Temperature'], bins=bins, labels=labels)

df.head()

| | Temperature | Lemonade Sold | Lemonade per Degree | Temperature Category |
|---|---|---|---|---|
| 0 | 75 | 30 | 0.4 | cool |
| 1 | 77 | 35 | 0.454545 | warm |
| 2 | 82 | 50 | 0.609756 | warm |
| 3 | 85 | 55 | 0.647059 | warm |
| 4 | 89 | 60 | 0.674157 | hot |
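
The output above labels 75 as cool and 85 as warm because `pd.cut` closes each interval on the right by default. A minimal sketch of how the `right` parameter moves those boundary values; the names `temps`, `right_closed`, `left_closed`, and `comparison` are illustrative and not part of the original notebook:

```python
import pandas as pd

temps = pd.Series([75, 77, 82, 85, 89, 91, 95], name="Temperature")
bins = [70, 75, 85, 95, 100]
labels = ["cool", "warm", "hot", "very hot"]

# Default (right=True): intervals are (70, 75], (75, 85], (85, 95], (95, 100]
right_closed = pd.cut(temps, bins=bins, labels=labels)

# right=False: intervals become [70, 75), [75, 85), [85, 95), [95, 100)
left_closed = pd.cut(temps, bins=bins, labels=labels, right=False)

# Side-by-side view; only the values sitting exactly on a bin edge change label
comparison = pd.DataFrame({
    "Temperature": temps,
    "right=True": right_closed,
    "right=False": left_closed,
})
print(comparison)
```

Only the values that sit exactly on an edge (75, 85, 95) switch category, so the choice matters mainly when the data contains round numbers that coincide with the bin edges.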


**Interaction Features**

* Combine two or more variables into a new feature, typically as a product or a ratio

* Example (lemonade stand):

  - Variables: Temperature + Ice Cubes

  - New feature: Ice Cubes per Degree = ice_cubes ÷ temperature

* Can capture relationships not visible in individual variables

In [25]:
# add Ice Cubes data
df['Ice Cubes'] = [100, 110, 120, 130, 140, 150, 200]

# create interaction feature
df['Ice Cubes per Degree'] = df['Ice Cubes'] / df['Temperature']

df.head()

| | Temperature | Lemonade Sold | Lemonade per Degree | Temperature Category | Ice Cubes | Ice Cubes per Degree |
|---|---|---|---|---|---|---|
| 0 | 75 | 30 | 0.4 | cool | 100 | 1.333333 |
| 1 | 77 | 35 | 0.454545 | warm | 110 | 1.428571 |
| 2 | 82 | 50 | 0.609756 | warm | 120 | 1.463415 |
| 3 | 85 | 55 | 0.647059 | warm | 130 | 1.529412 |
| 4 | 89 | 60 | 0.674157 | hot | 140 | 1.573034 |
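
The notebook combines the two variables as a ratio. Another common form of interaction feature is the product of the two columns. The sketch below assumes the same temperature and ice-cube values; the frame `df_int` and the column name `Temperature x Ice Cubes` are made up for illustration:

```python
import pandas as pd

# Same sample values as the lemonade stand example
df_int = pd.DataFrame({
    "Temperature": [75, 77, 82, 85, 89, 91, 95],
    "Ice Cubes": [100, 110, 120, 130, 140, 150, 200],
})

# Product interaction: grows when both temperature and ice usage are high
df_int["Temperature x Ice Cubes"] = df_int["Temperature"] * df_int["Ice Cubes"]

print(df_int.head())
```

Whether a ratio or a product is more useful depends on the question being asked; a ratio normalizes one variable by the other, while a product emphasizes rows where both are large.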


**Polynomial Features**

* Let an otherwise linear model fit non-linear (curved) relationships

* Use powers of features or interaction terms

* Example (lemonade stand):

  - Suspect non-linear relation → create Temperature²

  - Helps capture curved patterns in sales vs. temperature

In [26]:
# create polynomial feature

df['Temperature Squared'] = df['Temperature'] ** 2

df.head()

| | Temperature | Lemonade Sold | Lemonade per Degree | Temperature Category | Ice Cubes | Ice Cubes per Degree | Temperature Squared |
|---|---|---|---|---|---|---|---|
| 0 | 75 | 30 | 0.4 | cool | 100 | 1.333333 | 5625 |
| 1 | 77 | 35 | 0.454545 | warm | 110 | 1.428571 | 5929 |
| 2 | 82 | 50 | 0.609756 | warm | 120 | 1.463415 | 6724 |
| 3 | 85 | 55 | 0.647059 | warm | 130 | 1.529412 | 7225 |
| 4 | 89 | 60 | 0.674157 | hot | 140 | 1.573034 | 7921 |


## **Categorical Variable Encoding**

* ML models need numerical input

* Convert categorical values → numbers (encoding)

* Example: lemonade stand weather data (Sunny, Cloudy, Rainy)



**Methods**

* One-hot Encoding

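  - Creates a new column for each category

  - Assigns 1 (True) when the row belongs to that category, 0 (False) otherwise

  - Example, with columns ordered Sunny, Cloudy, Rainy: Sunny → [1,0,0], Cloudy → [0,1,0], Rainy → [0,0,1] (`pd.get_dummies` orders its columns alphabetically, so its output reads Cloudy, Rainy, Sunny)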

* Ordinal Encoding (for ordered categories, e.g., Low/Medium/High)

In [27]:
# Python Implementation:

# example data
data_2 = {'Weather': ['Sunny', 'Cloudy', 'Rainy', 'Sunny', 'Cloudy']}
df_2 = pd.DataFrame(data_2)

# apply one-hot encoding
df_encoded = pd.get_dummies(df_2, columns=['Weather'])

df_encoded.head()

| | Weather_Cloudy | Weather_Rainy | Weather_Sunny |
|---|---|---|---|
| 0 | False | False | True |
| 1 | True | False | False |
| 2 | False | True | False |
| 3 | False | False | True |
| 4 | True | False | False |
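
The output above shows `True`/`False` rather than 1/0 because recent pandas versions return boolean columns from `pd.get_dummies` by default. If integer indicators are preferred, the `dtype` parameter can request them. A small sketch, with `df_w` and `encoded_int` as illustrative names:

```python
import pandas as pd

df_w = pd.DataFrame({"Weather": ["Sunny", "Cloudy", "Rainy", "Sunny", "Cloudy"]})

# dtype=int yields 1/0 indicator columns instead of True/False
encoded_int = pd.get_dummies(df_w, columns=["Weather"], dtype=int)

print(encoded_int)
```

Both forms carry the same information; most estimators accept either, so the integer version is mainly a readability choice.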


**Ordinal Encoding**

* Used when categories have a natural order

* Example (lemonade stand – ice amount):

  - Little → 1

  - Medium → 2

  - Lots → 3

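* Preserves ranking/priority in categorical data; the exact integers do not matter as long as the order holds (scikit-learn's `OrdinalEncoder` counts from 0, so it produces Little → 0.0, Medium → 1.0, Lots → 2.0)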

In [28]:
# Python Implementation:
from sklearn.preprocessing import OrdinalEncoder

# example data
data_3 = {'Ice': ['Little', 'Medium', 'Lots', 'Little', 'Lots']}
df_3 = pd.DataFrame(data_3)

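# create encoder
# pass the order explicitly; by default the categories are sorted
# alphabetically (Little, Lots, Medium), which would break the ranking
ord_enc = OrdinalEncoder(categories=[['Little', 'Medium', 'Lots']])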

# apply ordinal encoding
df_3['Ice_encoded'] = ord_enc.fit_transform(df_3[['Ice']])

df_3.head()

| | Ice | Ice_encoded |
|---|---|---|
| 0 | Little | 0.0 |
| 1 | Medium | 1.0 |
| 2 | Lots | 2.0 |
| 3 | Little | 0.0 |
| 4 | Lots | 2.0 |
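
The encoder output starts at 0, while the bullets above number the levels 1 to 3. If that exact numbering is wanted, a plain dictionary with `Series.map` is one simple way to get it. This is a sketch under that assumption; `df_ice`, `ice_order`, and `Ice_rank` are hypothetical names:

```python
import pandas as pd

df_ice = pd.DataFrame({"Ice": ["Little", "Medium", "Lots", "Little", "Lots"]})

# Explicit mapping chosen to match the 1/2/3 numbering in the text
ice_order = {"Little": 1, "Medium": 2, "Lots": 3}

# Values missing from the mapping would become NaN, which makes typos easy to spot
df_ice["Ice_rank"] = df_ice["Ice"].map(ice_order)

print(df_ice)
```

The dictionary approach keeps the mapping visible in the code, whereas `OrdinalEncoder` is more convenient inside a scikit-learn pipeline.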
