
# **FEATURE ENGINEERING**

Feature engineering is the process of cleaning, transforming, and preparing raw data so a machine learning model can learn from it effectively.

We can create new features from the ones we already have to help the model learn better and make more accurate predictions.


In [1]:
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv("/content/drive/MyDrive/housing.csv")

In [3]:
df.head()

|  | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB


Let's create two new features that can be beneficial for our model.

In [5]:
df["rooms_per_household"] = df["total_rooms"] / df["households"]

**`rooms_per_household`**

**What it means:**
This tells us how many rooms, on average, each household has.

Instead of just knowing total rooms or number of households, this feature answers a better question:

“Are people living in spacious homes or crowded ones?”

**Simple example:**

* 100 rooms for 10 households → 10 rooms per household (spacious)

* 100 rooms for 50 households → 2 rooms per household (crowded)

This is more meaningful than just “100 rooms” by itself.

In [6]:
df["bedrooms_per_room"] = df["total_bedrooms"] / df["total_rooms"]

**`bedrooms_per_room`**

**What it means:**
This shows what fraction of the rooms are bedrooms.

It helps describe the layout or design of the house.

**Simple example:**

* 5 bedrooms out of 10 rooms → 0.5

* 2 bedrooms out of 10 rooms → 0.2

A higher value suggests houses are mostly bedrooms, while a lower value suggests more living rooms, kitchens, or other spaces.

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   longitude            20640 non-null  float64
 1   latitude             20640 non-null  float64
 2   housing_median_age   20640 non-null  float64
 3   total_rooms          20640 non-null  float64
 4   total_bedrooms       20433 non-null  float64
 5   population           20640 non-null  float64
 6   households           20640 non-null  float64
 7   median_income        20640 non-null  float64
 8   median_house_value   20640 non-null  float64
 9   ocean_proximity      20640 non-null  object 
 10  rooms_per_household  20640 non-null  float64
 11  bedrooms_per_room    20433 non-null  float64
dtypes: float64(11), object(1)
memory usage: 1.9+ MB
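
In the `info()` output above, `bedrooms_per_room` has 20433 non-null values, the same count as `total_bedrooms`. That is expected, because dividing by a column propagates missing values. The sketch below shows this on a small, made-up frame; the toy numbers and the name `toy` are assumptions for illustration, not rows taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; the second row is missing total_bedrooms.
toy = pd.DataFrame({
    "total_rooms": [1000.0, 1500.0, 1200.0],
    "total_bedrooms": [200.0, np.nan, 300.0],
})

# A missing numerator produces a missing ratio (NaN propagates).
toy["bedrooms_per_room"] = toy["total_bedrooms"] / toy["total_rooms"]

# Count missing values per column.
missing_counts = toy.isna().sum()
print(missing_counts)
```

This is why the missing values still need handling (for example, by imputation) before a model that cannot accept NaN is trained on the new feature.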


# **Encoding Categorical Variables**

Most machine learning models work with numbers, not words.

So if we have data like:

| Gender | City  |
| ------ | ----- |
| Male   | Lagos |
| Female | Abuja |

The model cannot process `Male`, `Female`, `Lagos`, or `Abuja` directly.
We must convert categories into numbers; this is called encoding.

**Types of Categorical Variables**

* Nominal (no order)

  Examples:

  * Gender (Male, Female)
  * City (Lagos, Abuja, Ibadan)

* Ordinal (has order)

  Examples:

  * Education (Primary < Secondary < Tertiary)
  * Rating (Low < Medium < High)

**NOTE:** For ordinal variables, order matters.
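
One way to make the nominal/ordinal distinction explicit in pandas is an ordered `Categorical`. The following is a minimal sketch on made-up values; the variable names `rating` and `city` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy values, not taken from the housing data.
# Ordinal: declare the categories in their real order and mark them ordered.
rating = pd.Categorical(
    ["Medium", "Low", "High", "Low"],
    categories=["Low", "Medium", "High"],
    ordered=True,
)

# Nominal: no order is declared, so pandas treats the categories as unordered.
city = pd.Categorical(["Lagos", "Abuja", "Ibadan", "Lagos"])

print(rating.ordered, city.ordered)
print(list(rating.codes))          # integer codes follow the declared order
print(rating.min(), rating.max())  # min/max are only meaningful when ordered
```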

## **Encoding Methods**

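**1. Label Encoding**

What it does:
Assigns an integer to each category. The integers are arbitrary labels, so a model may read them as an order that does not exist.

Example:

Male   → 1

Female → 0


* When to use

Ordinal variables, provided the assigned integers follow the real order

Tree-based models (Decision Trees, Random Forest, XGBoost)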
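
As a small illustration of the Male/Female example, the sketch below runs scikit-learn's `LabelEncoder` on a made-up list. Note that `LabelEncoder` sorts the classes, so the integers reflect alphabetical order rather than any meaningful ranking; scikit-learn documents it as intended for target labels, and it is used on a feature here only for teaching purposes.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy values.
genders = ["Male", "Female", "Female", "Male"]

encoder = LabelEncoder()
codes = encoder.fit_transform(genders)

# classes_ holds the sorted categories; a category's position is its code.
print(list(encoder.classes_))
print(codes.tolist())

# inverse_transform maps the integers back to the original labels.
decoded = encoder.inverse_transform(codes)
print(list(decoded))
```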


**2. One-Hot Encoding**

What it does:
Creates a new column for each category.

Example:

| City_Lagos | City_Abuja | City_Ibadan |
| ---------- | ---------- | ----------- |
| 1          | 0          | 0           |
| 0          | 1          | 0           |

* When to use

Nominal variables

Linear Regression, Logistic Regression, KNN, SVM

A safe default for nominal data, because it does not impose an order on the categories.

* Limitation

Many categories → many columns, which increases dimensionality and memory use (curse of dimensionality)
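
The City example can be reproduced with `pandas.get_dummies`, shown here as a quick sketch on made-up rows. This is an alternative to scikit-learn's `OneHotEncoder`, chosen only because it is compact; the frame name `cities` is an assumption.

```python
import pandas as pd

# Hypothetical toy rows.
cities = pd.DataFrame({"City": ["Lagos", "Abuja", "Ibadan"]})

# One new 0/1 column per category; the original column is replaced.
one_hot = pd.get_dummies(cities, columns=["City"], dtype=int)
print(one_hot)
```

Each row has exactly one `1`, and the number of new columns equals the number of distinct categories, which is where the column growth comes from.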


**3. Ordinal Encoding**

Used only when order exists.

Example:

Low    → 1

Medium → 2

High   → 3

* When to use

Ordinal data ONLY
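
To get the Low/Medium/High mapping above, the order has to be passed explicitly. Below is a minimal sketch with scikit-learn's `OrdinalEncoder`; the toy frame is made up, and the `+ 1` is added only so the codes start at 1 as in the example (the encoder itself starts at 0).

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy values.
ratings = pd.DataFrame({"Rating": ["Medium", "Low", "High", "Low"]})

# Pass the categories in their real order; otherwise they are sorted alphabetically.
encoder = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
encoded = encoder.fit_transform(ratings[["Rating"]])

# The encoder returns floats starting at 0; shift to 1..3 to match the example.
ratings["Rating_encoded"] = encoded.ravel().astype(int) + 1
print(ratings)
```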

In [8]:
df["ocean_proximity"].unique()

array(['NEAR BAY', '<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'ISLAND'],
      dtype=object)

### **Label Encoding `ocean_proximity`**


In [9]:
from sklearn.preprocessing import LabelEncoder

# Create a copy of the dataframe
df_encoded = df.copy()

# Initialize LabelEncoder
label_encoder = LabelEncoder()

# Apply Label encoding to ocean proximity
df_encoded["ocean_proximity_label_encoded"] = label_encoder.fit_transform(df_encoded["ocean_proximity"])

In [10]:
# Mapping of categories to labels
for i, category in enumerate(label_encoder.classes_):
    print(f"{category}: {i}")

<1H OCEAN: 0
INLAND: 1
ISLAND: 2
NEAR BAY: 3
NEAR OCEAN: 4


In [11]:
# Display 10 random rows with the new column
display(df_encoded[["ocean_proximity", "ocean_proximity_label_encoded"]].sample(10))

|  | ocean_proximity | ocean_proximity_label_encoded |
| --- | --- | --- |
| 4440 | <1H OCEAN | 0 |
| 15230 | NEAR OCEAN | 4 |
| 661 | NEAR BAY | 3 |
| 15677 | NEAR BAY | 3 |
| 13068 | INLAND | 1 |
| 4031 | <1H OCEAN | 0 |
| 2740 | INLAND | 1 |
| 20411 | <1H OCEAN | 0 |
| 10229 | <1H OCEAN | 0 |
| 9324 | NEAR BAY | 3 |


In [12]:
df_encoded.head()

|  | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity | rooms_per_household | bedrooms_per_room | ocean_proximity_label_encoded |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY | 6.984127 | 0.146591 | 3 |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY | 6.238137 | 0.155797 | 3 |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY | 8.288136 | 0.129516 | 3 |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY | 5.817352 | 0.184458 | 3 |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY | 6.281853 | 0.172096 | 3 |


### **One-Hot Encoding `ocean_proximity`**


In [13]:
from sklearn.preprocessing import OneHotEncoder

df_one_hot = df.copy()

# Initialize OneHotEncoder (return a dense array instead of a sparse matrix)
one_hot_encoder = OneHotEncoder(sparse_output=False)

# Double brackets select a one-column DataFrame (2D), which OneHotEncoder requires
encode_features = one_hot_encoder.fit_transform(df_one_hot[["ocean_proximity"]])

In [14]:
# Create a dataframe from the one-hot encoded features (reuse the original index so rows align)
encoded_df = pd.DataFrame(encode_features, columns=one_hot_encoder.get_feature_names_out(["ocean_proximity"]), index=df_one_hot.index)

# Concatenate the one-hot encoded dataframe with the original dataframe
df_one_hot = pd.concat([df_one_hot, encoded_df], axis=1)

df_one_hot.head()

|  | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity | rooms_per_household | bedrooms_per_room | ocean_proximity_<1H OCEAN | ocean_proximity_INLAND | ocean_proximity_ISLAND | ocean_proximity_NEAR BAY | ocean_proximity_NEAR OCEAN |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY | 6.984127 | 0.146591 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY | 6.238137 | 0.155797 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY | 8.288136 | 0.129516 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY | 5.817352 | 0.184458 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY | 6.281853 | 0.172096 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |


# **Machine Learning Pipeline**

A pipeline is a structured sequence of steps that automatically applies preprocessing and then trains a model in the correct order.

It is like an assembly line:

* Clean data

* Encode categories

* Train model

* Predict

All in one object.
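
The assembly-line idea can be sketched end to end on a tiny, made-up frame. This version uses a `ColumnTransformer` so that imputation and one-hot encoding both happen inside the pipeline; that is one possible layout, not the one used in the cells that follow, and every value and name prefixed with `toy` is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data: one numeric feature (with a gap), one categorical feature.
toy = pd.DataFrame({
    "median_income": [8.0, 7.0, np.nan, 4.0, 2.0, 6.0],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND", "INLAND", "INLAND", "NEAR BAY"],
    "median_house_value": [450000.0, 350000.0, 120000.0, 150000.0, 90000.0, 340000.0],
})
X_toy = toy[["median_income", "ocean_proximity"]]
y_toy = toy["median_house_value"]

# Clean and encode: each transformer is applied only to the columns listed for it.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["median_income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["ocean_proximity"]),
])

# Train and predict: preprocessing and the model live in one object.
toy_pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", DecisionTreeRegressor(random_state=42)),
])

toy_pipeline.fit(X_toy, y_toy)
toy_predictions = toy_pipeline.predict(X_toy)
print(toy_predictions)
```

Calling `fit` runs every step in order, and calling `predict` reapplies the already-fitted preprocessing before the model sees the data.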

In [15]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error, r2_score

In [16]:
df_encoded.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 13 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   longitude                      20640 non-null  float64
 1   latitude                       20640 non-null  float64
 2   housing_median_age             20640 non-null  float64
 3   total_rooms                    20640 non-null  float64
 4   total_bedrooms                 20433 non-null  float64
 5   population                     20640 non-null  float64
 6   households                     20640 non-null  float64
 7   median_income                  20640 non-null  float64
 8   median_house_value             20640 non-null  float64
 9   ocean_proximity                20640 non-null  object 
 10  rooms_per_household            20640 non-null  float64
 11  bedrooms_per_room              20433 non-null  float64
 12  ocean_proximity_label_encoded  20640 non-null 

In [17]:
# Features and target
X = df_encoded.drop(["median_house_value", "ocean_proximity"], axis=1)
y = df_encoded["median_house_value"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"shape of X: {X.shape}")
print(f"shape of y: {y.shape}")

shape of X: (20640, 11)
shape of y: (20640,)


In [18]:
# Pipeline: fill missing values with the column median, then fit a decision tree
model = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('model', DecisionTreeRegressor(
        random_state=42,
        max_depth=10,
        min_samples_split=20,
        min_samples_leaf=10
    ))
])

model.fit(X_train, y_train)

In [19]:
y_pred_train = model.predict(X_train)
print(y_pred_train)

[128976.47058824 392952.88888889 257930.97707736 ... 241600.
 279541.71024735 271908.33333333]


In [21]:
# Create a dataframe for the actual vs predicted value on the train set
results_df = pd.DataFrame({'Actual': y_train, 'Predicted': y_pred_train})
display(results_df.head(10))

|  | Actual | Predicted |
| --- | --- | --- |
| 14196 | 103000.0 | 128976.470588 |
| 8267 | 382100.0 | 392952.888889 |
| 17445 | 172600.0 | 257930.977077 |
| 14265 | 93400.0 | 121014.655172 |
| 2271 | 96500.0 | 97665.217391 |
| 17848 | 264800.0 | 327446.7 |
| 6252 | 157300.0 | 183252.317881 |
| 9389 | 500001.0 | 431946.384615 |
| 6113 | 139800.0 | 147246.666667 |
| 6061 | 315600.0 | 279541.710247 |


In [22]:
# Make predictions for the test set
y_pred_test = model.predict(X_test)
print(y_pred_test)

[ 56934.54878049  72902.4291498  497719.85714286 ... 500001.
  72902.4291498  198635.29411765]


In [23]:
# Create a dataframe for the actual vs predicted on the test set
results_df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred_test})
display(results_df.head(10))

|  | Actual | Predicted |
| --- | --- | --- |
| 20046 | 47700.0 | 56934.54878 |
| 3024 | 45800.0 | 72902.42915 |
| 15663 | 500001.0 | 497719.857143 |
| 20484 | 218600.0 | 279541.710247 |
| 9814 | 278000.0 | 257930.977077 |
| 13311 | 158700.0 | 176151.388889 |
| 7113 | 198200.0 | 205884.717647 |
| 7668 | 157500.0 | 187320.289855 |
| 18246 | 340000.0 | 257930.977077 |
| 5723 | 446600.0 | 487619.5 |


In [25]:
# Calculate metrics
train_r2 = r2_score(y_train, y_pred_train)
test_r2 = r2_score(y_test, y_pred_test)

train_mae = mean_absolute_error(y_train, y_pred_train)
test_mae = mean_absolute_error(y_test, y_pred_test)

print("Train R2 Score:", round(train_r2, 4))
print("Test R2 Score:", round(test_r2, 4))
print("Train MAE:", round(train_mae, 4))
print("Test MAE:", round(test_mae, 4))

Train R2 Score: 0.8171
Test R2 Score: 0.735
Train MAE: 33272.4108
Test MAE: 38972.9268
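
To make the two metrics less of a black box, the sketch below recomputes MAE and R2 by hand and checks them against scikit-learn. It uses only the first five rows of the test-set table shown earlier, so the numbers it prints will not match the full-dataset scores above; the variable names are assumptions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# First five Actual/Predicted pairs from the test-set table above.
y_true = np.array([47700.0, 45800.0, 500001.0, 218600.0, 278000.0])
y_hat = np.array([56934.54878, 72902.42915, 497719.857143, 279541.710247, 257930.977077])

# MAE: the average absolute gap between actual and predicted values.
mae_manual = np.mean(np.abs(y_true - y_hat))

# R2: 1 minus (squared error of the model / squared error of always predicting the mean).
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print("Manual MAE:", round(mae_manual, 4), "| sklearn MAE:", round(mean_absolute_error(y_true, y_hat), 4))
print("Manual R2:", round(r2_manual, 4), "| sklearn R2:", round(r2_score(y_true, y_hat), 4))
```

MAE is in the same units as the target (house value), while R2 is unitless, which is why the two are usually reported together.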


## Summary:


Using pipelines for teaching machine learning offers several benefits:
*   **Modularity and Clarity:** Pipelines break down complex workflows into clear, sequential steps, making it easier for students to understand each component's role.
*   **Reproducibility:** They ensure that all preprocessing and modeling steps are applied consistently. Because preprocessing is fitted only on the data passed to `fit`, this helps prevent data leakage and promotes reproducible results.
*   **Reduced Errors:** Pipelines automate the flow of data, minimizing manual intervention and the potential for errors during data transformations.
*   **Encapsulation:** They encapsulate the entire workflow, allowing students to treat the pipeline as a single unit for training, prediction, and deployment, which simplifies model management.
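
The leakage point is easiest to see with cross-validation: when the whole pipeline is passed to `cross_val_score`, the imputer is refitted on the training part of each fold, so statistics from the held-out part never reach the model. The sketch below uses synthetic data from `make_regression` with randomly injected gaps; the data, the 5% missing rate, and the names `X_demo`, `y_demo`, and `demo_pipeline` are all assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (not the housing dataset).
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# Blank out roughly 5% of the entries so the imputer has work to do.
rng = np.random.default_rng(42)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan

demo_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("model", DecisionTreeRegressor(max_depth=5, random_state=42)),
])

# Each fold fits the imputer and the tree on the training split only.
scores = cross_val_score(demo_pipeline, X_demo, y_demo, cv=5, scoring="r2")
print(scores.round(3))
```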

