<a href="https://colab.research.google.com/github/Mohsal2026/github.com/blob/main/Copy_of_2_feature__engineering__live_lecture.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### 🎥 Recommended Video: [Feature Engineering](https://www.youtube.com/watch?v=DkLQtGqQedo)

#### 🎥 Recommended Video: [How to select features for Machine learning](https://www.youtube.com/watch?v=YaKMeAlHgqQ)


# Unlocking the Power of Feature Engineering

Imagine you’re tasked with predicting house prices. You have raw data with features like square footage, number of bedrooms, and age of the house. While the data is valuable, it may not tell the full story. What if we could create new insights, transform existing ones, or identify the most important features to improve prediction accuracy? This process is called **feature engineering**—a critical step in building high-performing machine learning models.

---

## Why Feature Engineering Matters

Feature engineering enhances your models in three key ways:

1. **Improved Performance**: Features that express the underlying relationships more directly often let a model reach higher accuracy.
2. **Reduced Overfitting**: Keeping only meaningful features gives the model fewer opportunities to fit noise.
3. **Enhanced Interpretability**: Well-constructed, clearly named features make a model's behavior easier to explain.

---

## Techniques in Feature Engineering

Let’s explore some key techniques with relatable examples.

### 1. Feature Creation

Sometimes raw data isn’t enough, and we need to create new features to capture hidden patterns.

- **Combining Features**: Imagine you’re analyzing housing data. Creating a feature like `bedrooms_per_bath` (number of bedrooms divided by bathrooms) can add value by capturing room-to-bathroom balance.
- **Discretization**: Numeric or ordinal features, such as house grades (e.g., 1-10), can be grouped into categories: low, medium, high.
- **Feature Extraction**: Extract relevant signals from complex data types like text, images, or time-series data.

---

### 2. Feature Transformation

Transformations can standardize or compress data for better model performance.

- **Scaling**: Normalize features to bring them to the same scale, using methods like Min-Max scaling or standardization.
- **Log Transformation**: For right-skewed data (e.g., house prices), a logarithmic transformation compresses large values and often makes the distribution closer to symmetric.
- **One-Hot Encoding**: Convert categorical features (e.g., neighborhood names) into numerical representations, allowing models to process them effectively.
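
The three transformations above can be sketched with pandas and NumPy alone. This is a minimal illustration, not a recommended pipeline: the `houses` frame, its values, and the `nbhd` prefix are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are invented for illustration only.
houses = pd.DataFrame({
    "price": [200_000, 350_000, 500_000, 2_000_000],
    "sqft_living": [900, 1500, 2100, 4000],
    "neighborhood": ["north", "south", "north", "east"],
})

# Standardization: subtract the mean and divide by the standard deviation.
sqft = houses["sqft_living"]
houses["sqft_living_std"] = (sqft - sqft.mean()) / sqft.std(ddof=0)

# Log transformation: log1p computes log(1 + x), which is also defined at zero.
houses["log_price"] = np.log1p(houses["price"])

# One-hot encoding: one indicator column per neighborhood value.
houses = pd.get_dummies(houses, columns=["neighborhood"], prefix="nbhd")
```

In a real project the scaling statistics (mean, standard deviation) would normally be computed on the training data only and then reused for the test data.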

---

### 3. Feature Selection

Not all features contribute equally. Selecting the right ones reduces noise and improves model accuracy.

- **Filter Methods**: Use statistical measures (e.g., correlation, chi-squared test) to identify the most relevant features.
- **Wrapper Methods**: Evaluate subsets of features based on model performance (e.g., recursive feature elimination).
- **Embedded Methods**: Selection happens during model training. For example, L1 regularization shrinks the coefficients of less relevant features to exactly zero, which effectively removes them.
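
As a small illustration of the filter approach only, the sketch below ranks features by their absolute Pearson correlation with the target. The data is synthetic, and the names `signal`, `noise_a`, `noise_b`, and the cut-off `k` are assumptions made for the example. Wrapper and embedded methods need a model and are not shown here.

```python
import numpy as np
import pandas as pd

# Synthetic data (hypothetical): only `signal` is related to the target.
rng = np.random.default_rng(0)
n = 500
features = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
})
target = 3.0 * features["signal"] + rng.normal(scale=0.5, size=n)

# Filter method: score each feature independently of any model,
# here by the absolute Pearson correlation with the target.
scores = features.corrwith(target).abs().sort_values(ascending=False)

# Keep the k highest-scoring features (k is an arbitrary choice here).
k = 1
selected = scores.index[:k].tolist()
```

Correlation only captures linear relationships, so a feature with a strong non-linear effect can still score low under this filter.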

---

## Example: Predicting House Prices

Let’s see feature engineering in action for house price prediction.

**Raw Data:**

<table>
  <tr>
    <th>Feature</th>
    <th>Description</th>
  </tr>
  <tr>
    <td>sqft_living</td>
    <td>Square footage of living space</td>
  </tr>
  <tr>
    <td>bedrooms</td>
    <td>Number of bedrooms</td>
  </tr>
  <tr>
    <td>bathrooms</td>
    <td>Number of bathrooms</td>
  </tr>
  <tr>
    <td>floors</td>
    <td>Number of floors</td>
  </tr>
  <tr>
    <td>grade</td>
    <td>Overall grade given to the house</td>
  </tr>
</table>


### Applying Feature Engineering

1. **Create a New Feature**: Combine features to create `bedrooms_per_bath` = `bedrooms` / `bathrooms`.
2. **Discretize `grade`**: Group grades into categories (e.g., low, medium, high).
3. **Scale Numerical Features**: Use Min-Max scaling to bring all numerical features to a common range.
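
The three steps can be sketched as follows. The rows are hypothetical, and the grade cut points (1-4 low, 5-7 medium, 8-10 high) and the `grade_band` column name are assumptions chosen for illustration; only the column names come from the table above.

```python
import pandas as pd

# Hypothetical rows using the column names from the raw data table.
houses = pd.DataFrame({
    "sqft_living": [1180, 2570, 770, 1960],
    "bedrooms": [3, 3, 2, 4],
    "bathrooms": [1.0, 2.25, 1.0, 3.0],
    "floors": [1.0, 2.0, 1.0, 1.0],
    "grade": [3, 7, 6, 9],
})

# Step 1: combine two features into a ratio.
# A house with zero bathrooms would produce inf here and need special handling.
houses["bedrooms_per_bath"] = houses["bedrooms"] / houses["bathrooms"]

# Step 2: discretize grade into three bands (the cut points are an assumption).
houses["grade_band"] = pd.cut(
    houses["grade"], bins=[0, 4, 7, 10], labels=["low", "medium", "high"]
)

# Step 3: Min-Max scaling, (x - min) / (max - min), maps each column to [0, 1].
numeric_cols = ["sqft_living", "bedrooms", "bathrooms", "floors", "bedrooms_per_bath"]
col_min = houses[numeric_cols].min()
col_max = houses[numeric_cols].max()
houses_scaled = (houses[numeric_cols] - col_min) / (col_max - col_min)
```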

### The Impact

These engineered features enhance your model’s ability to learn patterns, leading to better predictions and insights into what drives house prices.

---

## Advanced Applications

Feature engineering isn’t limited to numeric data. Here are a few advanced examples:

- **Text Data**: Extract features like word counts, sentiment scores, or topic categories from unstructured text. We will look at this in later modules.
- **Time-Series Data**: Derive trends, seasonality, and rolling averages. We will also look at this in later modules.
- **Interaction Terms**: Combine two features (e.g., `sqft_living` * `grade`) to capture interactions between them.

---


## Tips for Effective Feature Engineering

1. **Leverage Domain Knowledge**: Insights from the field can guide feature creation and transformation.
2. **Experimentation Matters**: Test different approaches and evaluate their impact.
3. **Use Regularization**: Techniques like L1 regularization can highlight important features and reduce overfitting.
4. **Analyze Feature Importance**: Use model-based methods to understand which features matter most.

---

By mastering feature engineering, you can unlock the full potential of your machine learning models. Every dataset tells a story—your job is to bring that story to life with the right features.



## Example: Feature Engineering: Exploration of Individual Features

Consider the logistic regression model we built in [Module 1](../Module%201/4.%20Logistic%20Regression.ipynb) to predict the propensity of term deposit purchases, that is, whether a customer is likely to buy a special type of savings account called a 'term deposit'. For that model we could create a new variable called **Asset Index**: *a way to measure the overall financial standing of a customer by combining their assets and liabilities into a single score.* This score helps us estimate the customer's likelihood of buying a term deposit.

We could calculate the Asset Index by:

1. **Assigning Weights to Assets**:
   - If the customer owns a house, it contributes **5 points** to their Asset Index (a house is a valuable asset).
   - If the customer does not own a house, it contributes **1 point**.

2. **Assigning Weights to Liabilities**:
   - If the customer has a loan, it contributes **1 point** to their Asset Index (a loan is a liability).
   - If the customer has no loan, it contributes **5 points**.

### Example Calculation

Let’s calculate the Asset Index for two customers:

- **Customer A**:
  - Owns a house: **5 points**
  - Has a loan: **1 point**
  - **Total Asset Index = 5 + 1 = 6**

- **Customer B**:
  - Does not own a house: **1 point**
  - Has no loan: **5 points**
  - **Total Asset Index = 1 + 5 = 6**
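
The scoring rule can be sketched in pandas as below. It assumes `housing` and `loan` columns holding `yes`/`no` values, as in the bank data loaded later in this notebook; the `assetIndex` column name and the two-row frame are hypothetical.

```python
import numpy as np
import pandas as pd

# Two hypothetical customers matching the worked example.
customers = pd.DataFrame(
    {"housing": ["yes", "no"], "loan": ["yes", "no"]},
    index=["Customer A", "Customer B"],
)

# Owning a house scores 5 points, otherwise 1.
house_points = np.where(customers["housing"] == "yes", 5, 1)
# Having a loan scores 1 point, having no loan scores 5.
loan_points = np.where(customers["loan"] == "yes", 1, 5)

customers["assetIndex"] = house_points + loan_points
```

Both customers end up with an index of 6, so a purely additive score with these weights cannot tell their two situations apart.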

In [None]:
import pandas as pd

In [None]:
# file_url = 'sample_data/bank-full.csv'
# bankData = pd.read_csv(file_url, sep=";")
# bankData.head()

In [None]:
# Load the bank marketing data from a shared Google Drive file

import requests
import pandas as pd
from io import StringIO

def read_gd(sharingurl):
    """Download a semicolon-separated CSV from a Google Drive sharing URL."""
    # The file ID is the second-to-last path segment of the sharing URL
    file_id = sharingurl.split('/')[-2]
    download_url = 'https://drive.google.com/uc?export=download&id=' + file_id
    response = requests.get(download_url)
    response.raise_for_status()  # raise an exception on HTTP error responses
    return pd.read_csv(StringIO(response.text), sep=';')

url = "https://drive.google.com/file/d/1dlrFu_FxM95RTlb2lE1WSix7ameJVwHD/view?usp=sharing"
bankData = read_gd(url)

In [None]:
bankData.head(10)

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no
5,35,management,married,tertiary,no,231,yes,no,unknown,5,may,139,1,-1,0,unknown,no
6,28,management,single,tertiary,no,447,yes,yes,unknown,5,may,217,1,-1,0,unknown,no
7,42,entrepreneur,divorced,tertiary,yes,2,yes,no,unknown,5,may,380,1,-1,0,unknown,no
8,58,retired,married,primary,no,121,yes,no,unknown,5,may,50,1,-1,0,unknown,no
9,43,technician,single,secondary,no,593,yes,no,unknown,5,may,55,1,-1,0,unknown,no


In [None]:
# Relationship between housing and propensity for term deposits
bankData.groupby(['housing', 'y'])['y'].agg(houseTot='count').reset_index()

Unnamed: 0,housing,y,houseTot
0,no,no,16727
1,no,yes,3354
2,yes,no,23195
3,yes,yes,1935


In [None]:
# Relationship between having a loan and propensity for term deposits
bankData.groupby(['loan', 'y'])['y'].agg(loanTot='count').reset_index()

Unnamed: 0,loan,y,loanTot
0,no,no,33162
1,no,yes,4805
2,yes,no,6760
3,yes,yes,484
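
Raw counts are hard to compare because the groups differ in size. The sketch below converts the two count tables above into within-group percentages. The counts are copied from the outputs shown; the helper name `add_group_share` and the `pct` column are hypothetical.

```python
import pandas as pd

def add_group_share(counts, group_col, count_col):
    """Return a copy with each row's percentage share of its group total."""
    out = counts.copy()
    group_total = out.groupby(group_col)[count_col].transform("sum")
    out["pct"] = out[count_col] / group_total * 100
    return out

# Counts copied from the two outputs above.
housing_counts = pd.DataFrame({
    "housing": ["no", "no", "yes", "yes"],
    "y": ["no", "yes", "no", "yes"],
    "houseTot": [16727, 3354, 23195, 1935],
})
loan_counts = pd.DataFrame({
    "loan": ["no", "no", "yes", "yes"],
    "y": ["no", "yes", "no", "yes"],
    "loanTot": [33162, 4805, 6760, 484],
})

housing_share = add_group_share(housing_counts, "housing", "houseTot")
loan_share = add_group_share(loan_counts, "loan", "loanTot")
```

With these counts, roughly 16.7% of customers with `housing` equal to `no` bought a term deposit, against about 7.7% of those with `housing` equal to `yes`. For `loan`, the shares are about 12.7% (`no`) and 6.7% (`yes`).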


In [None]:
# Taking the quantiles for 25%, 50% and 75% of the balance data
import numpy as np
np.quantile(bankData['balance'],[0.25,0.5,0.75])

array([  72.,  448., 1428.])

In [None]:
# Creating a new feature for the bank data based on the quartile values.
# Note: the comparisons are strict, so a balance exactly equal to 448 or 1428
# matches none of the conditions and keeps the default 'Quant1' label.

bankData['balanceClass'] = 'Quant1'

bankData.loc[(bankData['balance'] > 72) & (bankData['balance'] < 448), 'balanceClass'] = 'Quant2'

bankData.loc[(bankData['balance'] > 448) & (bankData['balance'] < 1428), 'balanceClass'] = 'Quant3'

bankData.loc[bankData['balance'] > 1428, 'balanceClass'] = 'Quant4'

bankData.head(10)

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y,balanceClass
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no,Quant4
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no,Quant1
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no,Quant1
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no,Quant4
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no,Quant1
5,35,management,married,tertiary,no,231,yes,no,unknown,5,may,139,1,-1,0,unknown,no,Quant2
6,28,management,single,tertiary,no,447,yes,yes,unknown,5,may,217,1,-1,0,unknown,no,Quant2
7,42,entrepreneur,divorced,tertiary,yes,2,yes,no,unknown,5,may,380,1,-1,0,unknown,no,Quant1
8,58,retired,married,primary,no,121,yes,no,unknown,5,may,50,1,-1,0,unknown,no,Quant2
9,43,technician,single,secondary,no,593,yes,no,unknown,5,may,55,1,-1,0,unknown,no,Quant3
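
An alternative way to build the same kind of feature is `pd.cut`, which assigns every value to exactly one right-closed interval. The sketch below is not the cell used above: it uses hypothetical balances that include the cut points themselves, with the edges 72, 448 and 1428 taken from the quantile output. Unlike the strict comparisons in the cell above, it places a balance of exactly 448 in Quant2 and exactly 1428 in Quant3, so group counts on the real data could differ slightly.

```python
import numpy as np
import pandas as pd

# Hypothetical balances, including the three cut points themselves.
balances = pd.DataFrame({"balance": [-50, 72, 73, 448, 593, 1428, 2143]})

# Quartile edges taken from the np.quantile output, padded with +/- infinity.
edges = [-np.inf, 72, 448, 1428, np.inf]
labels = ["Quant1", "Quant2", "Quant3", "Quant4"]

# Intervals are right-closed by default:
# (-inf, 72], (72, 448], (448, 1428], (1428, inf)
balances["balanceClass"] = pd.cut(balances["balance"], bins=edges, labels=labels)
```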


In [None]:
# Calculating the customers under each quantile
balanceTot = bankData.groupby(['balanceClass'])['y'].agg(balanceTot='count').reset_index()
balanceTot

Unnamed: 0,balanceClass,balanceTot
0,Quant1,11340
1,Quant2,11275
2,Quant3,11299
3,Quant4,11297


In [None]:
# Calculating the total customers categorised as per quantile and propensity classification
balanceProp = bankData.groupby(['balanceClass', 'y'])['y'].agg(balanceCat='count').reset_index()
balanceProp

Unnamed: 0,balanceClass,y,balanceCat
0,Quant1,no,10517
1,Quant1,yes,823
2,Quant2,no,10049
3,Quant2,yes,1226
4,Quant3,no,9884
5,Quant3,yes,1415
6,Quant4,no,9472
7,Quant4,yes,1825


In [None]:
# Merging both the data frames
balanceComb = pd.merge(balanceProp, balanceTot, on = ['balanceClass'])
balanceComb['catProp'] = (balanceComb['balanceCat'] / balanceComb['balanceTot'])*100
balanceComb

Unnamed: 0,balanceClass,y,balanceCat,balanceTot,catProp
0,Quant1,no,10517,11340,92.743
1,Quant1,yes,823,11340,7.257
2,Quant2,no,10049,11275,89.126
3,Quant2,yes,1226,11275,10.874
4,Quant3,no,9884,11299,87.477
5,Quant3,yes,1415,11299,12.523
6,Quant4,no,9472,11297,83.845
7,Quant4,yes,1825,11297,16.155
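
The same percentages can also be obtained in one step with `pd.crosstab`, without building and merging two intermediate frames. This is a sketch on a hypothetical miniature frame named `toy`; on the real data the inputs would be `bankData['balanceClass']` and `bankData['y']`.

```python
import pandas as pd

# Hypothetical miniature of bankData with the two columns used above.
toy = pd.DataFrame({
    "balanceClass": ["Quant1", "Quant1", "Quant1", "Quant1", "Quant4", "Quant4"],
    "y": ["no", "no", "no", "yes", "no", "yes"],
})

# normalize="index" divides each row by its row total, so each row sums to 1;
# multiplying by 100 turns the shares into percentages.
cat_prop = pd.crosstab(toy["balanceClass"], toy["y"], normalize="index") * 100
```

The result has one row per balance class and one column per value of `y`, which makes the `yes` share of each class easy to read off.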


The distribution shows that the proportion of customers who buy term deposits rises steadily from Quant1 to Quant4. Of the customers in Quant1, about 7.3% bought term deposits (this is the **catProp** value on the `y = yes` row). The proportion increases to about 10.9% for Quant2, 12.5% for Quant3, and 16.2% for Quant4. This trend suggests that customers with higher balances have a higher propensity to buy term deposits.

In [None]:
bankNumeric = bankData[['age','balance','day','duration','campaign','pdays','previous']]

In [None]:
# Show correlations to three decimal places on a wider display
pd.set_option('display.width', 150)
pd.set_option('display.precision', 3)

# Pairwise Pearson correlation between the numeric columns
bankCorr = bankNumeric.corr(method='pearson')
bankCorr


Unnamed: 0,age,balance,day,duration,campaign,pdays,previous
age,1.0,0.098,-0.009,-0.005,0.005,-0.024,0.001
balance,0.098,1.0,0.005,0.022,-0.015,0.003,0.017
day,-0.009,0.005,1.0,-0.03,0.162,-0.093,-0.052
duration,-0.005,0.022,-0.03,1.0,-0.085,-0.002,0.001
campaign,0.005,-0.015,0.162,-0.085,1.0,-0.089,-0.033
pdays,-0.024,0.003,-0.093,-0.002,-0.089,1.0,0.455
previous,0.001,0.017,-0.052,0.001,-0.033,0.455,1.0
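
A full matrix is tedious to scan, so it can help to list the feature pairs with the strongest correlations. The sketch below rebuilds the matrix from the values printed above so that it runs on its own; the names `pairs` and `top_pairs` are hypothetical.

```python
import numpy as np
import pandas as pd

cols = ["age", "balance", "day", "duration", "campaign", "pdays", "previous"]
# Values copied from the correlation matrix printed above.
bankCorr = pd.DataFrame([
    [1.0, 0.098, -0.009, -0.005, 0.005, -0.024, 0.001],
    [0.098, 1.0, 0.005, 0.022, -0.015, 0.003, 0.017],
    [-0.009, 0.005, 1.0, -0.03, 0.162, -0.093, -0.052],
    [-0.005, 0.022, -0.03, 1.0, -0.085, -0.002, 0.001],
    [0.005, -0.015, 0.162, -0.085, 1.0, -0.089, -0.033],
    [-0.024, 0.003, -0.093, -0.002, -0.089, 1.0, 0.455],
    [0.001, 0.017, -0.052, 0.001, -0.033, 0.455, 1.0],
], index=cols, columns=cols)

# Take the positions above the diagonal (k=1) so each pair appears once.
rows, cols_idx = np.triu_indices(len(cols), k=1)
pairs = pd.Series(
    bankCorr.to_numpy()[rows, cols_idx],
    index=pd.MultiIndex.from_arrays(
        [bankCorr.index[rows], bankCorr.columns[cols_idx]]
    ),
)

# Rank pairs by absolute correlation, strongest first.
top_pairs = pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(3)
```

Here the strongest pair is `pdays` and `previous` at 0.455, followed by `day` and `campaign` at 0.162; every other pair is below 0.1 in absolute value.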
