In [1]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, MinMaxScaler

In [2]:
data = {
    'CustomerID': [101, 102, 103, 104, 105],
    'Age': [25, 30, None, 22, 28],
    'Gender': ['Male', 'Female', 'Male', 'Female', None],
    'Income': [50000, 60000, 55000, None, 52000],
    'City': ['Urban', 'Rural', 'Urban', 'Rural', 'Urban'],
    'SubscriptionStatus': ['Subscribed', 'Not Subscribed', None, 'Subscribed', 'Not Subscribed']
}

df = pd.DataFrame(data)
df


,CustomerID,Age,Gender,Income,City,SubscriptionStatus
0,101,25.0,Male,50000.0,Urban,Subscribed
1,102,30.0,Female,60000.0,Rural,Not Subscribed
2,103,,Male,55000.0,Urban,
3,104,22.0,Female,,Rural,Subscribed
4,105,28.0,,52000.0,Urban,Not Subscribed


In [3]:
df.isna()     # shows where values are missing

,CustomerID,Age,Gender,Income,City,SubscriptionStatus
0,False,False,False,False,False,False
1,False,False,False,False,False,False
2,False,True,False,False,False,True
3,False,False,False,True,False,False
4,False,False,True,False,False,False


In [4]:
df.isna().sum()       # counts how many missing in each column

CustomerID            0
Age                   1
Gender                1
Income                1
City                  0
SubscriptionStatus    1
dtype: int64

###  Filling Missing Values

In this step, we replace the missing values in our dataset.  
We do **not** fill values randomly, because that can distort the dataset.

Instead, we use statistical methods:

- **Mean** → for numerical columns like *Age* and *Income*  
- **Mode** → for categorical columns like *Gender* and *SubscriptionStatus*

These fills keep each column's typical value roughly unchanged, although they slightly reduce the spread of the filled column.


We can also remove rows with missing values using `dropna()`,  
but in this example we prefer `fillna()` to keep all rows.
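
To make the trade-off concrete, here is a minimal sketch comparing the two approaches. The frame name `sample` is made up for illustration; it copies only the columns of this dataset that contain missing values.

```python
import pandas as pd

# Hypothetical frame `sample`: same missing-value pattern as Age, Gender and Income above.
sample = pd.DataFrame({
    'Age': [25, 30, None, 22, 28],
    'Gender': ['Male', 'Female', 'Male', 'Female', None],
    'Income': [50000, 60000, 55000, None, 52000],
})

# dropna() removes every row that has at least one missing value.
dropped = sample.dropna()
print(len(sample), len(dropped))   # 5 2

# fillna() with a dict fills each column with its own replacement value.
filled = sample.fillna({
    'Age': sample['Age'].mean(),
    'Income': sample['Income'].mean(),
    'Gender': sample['Gender'].mode()[0],
})
print(len(filled), int(filled.isna().sum().sum()))   # 5 0
```

With only five rows, dropping would discard three of them, which is why filling is the more practical choice here.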

In [5]:
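# Numerical columns: fill with the column mean (mean() skips missing values).
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Income'] = df['Income'].fillna(df['Income'].mean())
# Categorical columns: fill with the most frequent value; mode() returns a Series, so take [0].
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])
df['SubscriptionStatus'] = df['SubscriptionStatus'].fillna(df['SubscriptionStatus'].mode()[0])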

###  Handling Missing Values

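When a dataset has empty cells (missing values), we usually cannot leave them blank because:
- Many machine learning models, including most scikit-learn estimators, raise an error on missing values   
- Calculations can mislead: pandas skips missing values by default in `mean()` and `sum()`, while plain NumPy functions return `nan`   
- It reduces the quality of the analysis  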

So we fill these missing values using simple statistical methods:

---

###  1. Numerical Columns → Use **Mean** or **Median**
Numerical columns (like **Age** and **Income**) contain numbers.  
We replace missing values using:

- **Mean** → Average value of the column; a reasonable default when there are no extreme values  
- **Median** → Middle value when the data is sorted; more robust when the column has outliers  

**Why?**  
Because we want a reasonable estimate, not a random guess.  
Example: If one “Age” is missing, using the average age keeps the data consistent.
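
The choice between mean and median matters most when a column has extreme values. The sketch below uses made-up income figures (the series `incomes` is hypothetical, not part of the dataset) to show how one outlier shifts the mean but not the median.

```python
import pandas as pd

# Hypothetical incomes: four typical values plus one extreme outlier.
incomes = pd.Series([50000, 60000, 55000, 52000, 1000000])

print(incomes.mean())     # 243400.0, pulled far above the typical values
print(incomes.median())   # 55000.0, still a typical value

# Filling the missing Age from this notebook with the median instead of the mean.
ages = pd.Series([25, 30, None, 22, 28])
ages_filled = ages.fillna(ages.median())
print(ages_filled.tolist())   # [25.0, 30.0, 26.5, 22.0, 28.0]
```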

---

###  2. Categorical Columns → Use **Mode**
Categorical columns (like **Gender**, **City**, **SubscriptionStatus**) contain text labels.  
We replace missing values using:

- **Mode** → The most frequent (common) value in that column  

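**Why?**  
If most customers are “Male”, and one value is missing, the majority category is the most likely guess.  
In this dataset, *Gender* and *SubscriptionStatus* are each tied 2 to 2, so `mode()[0]` picks the first value in sorted order (“Female” and “Not Subscribed”).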
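
A short sketch of how `mode()` behaves when two categories are tied, using the original Gender values from this notebook. The variable names are chosen for illustration only.

```python
import pandas as pd

gender = pd.Series(['Male', 'Female', 'Male', 'Female', None])

# mode() ignores missing values and returns every tied value, sorted.
modes = gender.mode()
print(modes.tolist())   # ['Female', 'Male']

# Taking [0] therefore picks the alphabetically first of the tied values.
fill_value = modes[0]
print(fill_value)       # Female
```

When categories are tied like this, the fill value is an arbitrary choice, so it is worth checking `value_counts()` before relying on it.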

---

###  Final Idea
- **Numerical data** → Fill using **Mean/Median**  
- **Categorical data** → Fill using **Mode**  
These methods avoid random filling and keep the dataset complete, but the filled values are estimates, not real observations.


In [6]:
df

,CustomerID,Age,Gender,Income,City,SubscriptionStatus
0,101,25.0,Male,50000.0,Urban,Subscribed
1,102,30.0,Female,60000.0,Rural,Not Subscribed
2,103,26.25,Male,55000.0,Urban,Not Subscribed
3,104,22.0,Female,54250.0,Rural,Subscribed
4,105,28.0,Female,52000.0,Urban,Not Subscribed


In [7]:
df.isna()

,CustomerID,Age,Gender,Income,City,SubscriptionStatus
0,False,False,False,False,False,False
1,False,False,False,False,False,False
2,False,False,False,False,False,False
3,False,False,False,False,False,False
4,False,False,False,False,False,False


In [8]:
df.isna().sum()

CustomerID            0
Age                   0
Gender                0
Income                0
City                  0
SubscriptionStatus    0
dtype: int64

### Encoding Categorical Data 

Some columns in a dataset contain text values instead of numbers.  
For example:
- Gender → Male / Female  
- City → Urban / Rural  
- SubscriptionStatus → Subscribed / Not Subscribed  

Most machine learning models **cannot work with text** directly, so we must convert these categories into numbers.  
This process is called **Encoding**.

We mainly use two methods:

---

###  1. Label Encoding
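This method converts each category into a unique integer.  
scikit-learn's `LabelEncoder` assigns the integers in sorted (alphabetical) order of the categories.

Example:  
- Male → 1  
- Female → 0  

Best suited to columns with only two categories. With three or more unordered categories, the integers suggest an order that does not exist, so One-Hot Encoding is usually the safer choice.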
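
As a minimal sketch of how the numbers are assigned, the example below encodes the filled Gender values and then inspects the mapping. The variable names are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

genders = ['Male', 'Female', 'Male', 'Female', 'Female']

encoder = LabelEncoder()
codes = encoder.fit_transform(genders)

# classes_ holds the categories in sorted order; the position is the assigned code.
print(encoder.classes_.tolist())                     # ['Female', 'Male']
print(codes.tolist())                                # [1, 0, 1, 0, 0]

# inverse_transform maps codes back to the original labels.
print(encoder.inverse_transform([0, 1]).tolist())    # ['Female', 'Male']
```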

---

###  2. One-Hot Encoding
This method creates separate columns (0 or 1) for each category.

Example for City:  
- City_Urban  
- City_Rural  

If a customer lives in Urban → City_Urban = 1, City_Rural = 0  
If they live in Rural → City_Urban = 0, City_Rural = 1
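
The same idea as a small sketch with `pd.get_dummies()`. Passing `dtype=int` is an optional choice made here so the new columns show 0/1 instead of True/False; `drop_first=True` is shown only as a variation for two-category columns, where one column already carries all the information.

```python
import pandas as pd

cities = pd.DataFrame({'City': ['Urban', 'Rural', 'Urban', 'Rural', 'Urban']})

# One new column per category, named <column>_<category>.
one_hot = pd.get_dummies(cities, columns=['City'], dtype=int)
print(one_hot.columns.tolist())   # ['City_Rural', 'City_Urban']
print(one_hot.iloc[0].tolist())   # [0, 1]  (the first customer lives in Urban)

# Variation: drop the first category and keep a single 0/1 column.
reduced = pd.get_dummies(cities, columns=['City'], dtype=int, drop_first=True)
print(reduced.columns.tolist())   # ['City_Urban']
```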

---

###  Why Encoding?
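- Most machine learning models need numbers, not text  
- scikit-learn estimators generally raise an error when given string columns  
- Choosing the right encoding avoids implying an order between categories that have none  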


In [9]:
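from sklearn.preprocessing import LabelEncoder

# Use one encoder per column so each keeps its own fitted classes_.
gender_le = LabelEncoder()
subscription_le = LabelEncoder()

df['Gender_Encoded'] = gender_le.fit_transform(df['Gender'])                          # Female → 0, Male → 1
df['Subscription_Encoded'] = subscription_le.fit_transform(df['SubscriptionStatus'])  # Not Subscribed → 0, Subscribed → 1

# One-hot encode City; the City column is replaced by City_Rural and City_Urban.
df = pd.get_dummies(df, columns=['City'])
df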

,CustomerID,Age,Gender,Income,SubscriptionStatus,Gender_Encoded,Subscription_Encoded,City_Rural,City_Urban
0,101,25.0,Male,50000.0,Subscribed,1,1,False,True
1,102,30.0,Female,60000.0,Not Subscribed,0,0,True,False
2,103,26.25,Male,55000.0,Not Subscribed,1,0,False,True
3,104,22.0,Female,54250.0,Subscribed,0,1,True,False
4,105,28.0,Female,52000.0,Not Subscribed,0,0,False,True


###  Feature Scaling 

Feature Scaling is used to **adjust the range of numerical values** so that they are on a similar scale.  
Why?  
Because many machine learning models work better when all features are in similar ranges.

In this dataset:
- Age ranges from 22 to 30  
- Income ranges from 50,000 to 60,000  

If we don’t scale them, models that rely on distances or gradients (such as k-nearest neighbours) may give more importance to Income just because it has bigger numbers. Tree-based models are largely unaffected by scale.
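
A rough sketch of the effect, using the filled Age and Income values from this notebook. The helper `euclidean` and the array names are made up for this illustration. It compares the distance between customers 104 and 102 before and after rescaling each column to the 0 to 1 range.

```python
import numpy as np

# Columns: Age, Income (filled values from this notebook).
X = np.array([
    [25.0, 50000.0],
    [30.0, 60000.0],
    [26.25, 55000.0],
    [22.0, 54250.0],
    [28.0, 52000.0],
])

def euclidean(a, b):
    """Hypothetical helper: straight-line distance between two rows."""
    return float(np.linalg.norm(a - b))

# Customers 104 (row 3) and 102 (row 1) are the youngest and the oldest.
raw_distance = euclidean(X[3], X[1])
print(round(raw_distance, 2))      # about 5750.01, almost entirely the Income gap of 5750

# Rescale every column to the 0 to 1 range using its own min and max.
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
scaled_distance = euclidean(X_scaled[3], X_scaled[1])
print(round(scaled_distance, 2))   # about 1.15, now the Age gap contributes the most
```

Before scaling, the 8-year age gap (the largest possible in this data) barely changes the distance; after scaling, both features contribute on comparable terms.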

---

###  MinMaxScaler
We use **MinMaxScaler** to scale values between **0 and 1**.

Formula: Scaled Value = (Value − Min) / (Max − Min), where Min and Max are the smallest and largest values in that column
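
To check the formula by hand, the sketch below applies it to the filled Age values and compares the result with `MinMaxScaler`. The names `manual` and `scaled` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

ages = pd.DataFrame({'Age': [25.0, 30.0, 26.25, 22.0, 28.0]})

# Apply (Value - Min) / (Max - Min) directly; here Min = 22 and Max = 30.
age_min, age_max = ages['Age'].min(), ages['Age'].max()
manual = (ages['Age'] - age_min) / (age_max - age_min)
print(manual.tolist())               # [0.375, 1.0, 0.53125, 0.0, 0.75]

# MinMaxScaler with its default feature_range=(0, 1) gives the same values.
scaled = MinMaxScaler().fit_transform(ages)[:, 0]
print(np.allclose(manual, scaled))   # True
```

For example, Age 25 becomes (25 − 22) / (30 − 22) = 0.375.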


In [10]:
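from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()  # default feature_range is (0, 1)
# Each column is scaled independently, using its own min and max.
df[['Age_Scaled', 'Income_Scaled']] = scaler.fit_transform(df[['Age', 'Income']])
df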

,CustomerID,Age,Gender,Income,SubscriptionStatus,Gender_Encoded,Subscription_Encoded,City_Rural,City_Urban,Age_Scaled,Income_Scaled
0,101,25.0,Male,50000.0,Subscribed,1,1,False,True,0.375,0.0
1,102,30.0,Female,60000.0,Not Subscribed,0,0,True,False,1.0,1.0
2,103,26.25,Male,55000.0,Not Subscribed,1,0,False,True,0.53125,0.5
3,104,22.0,Female,54250.0,Subscribed,0,1,True,False,0.0,0.425
4,105,28.0,Female,52000.0,Not Subscribed,0,0,False,True,0.75,0.2


### Summary

In this notebook, we:

1. Created a customer information DataFrame with:
   - CustomerID, Age, Gender, Income, City, SubscriptionStatus
2. Checked for missing values using `isna()` and `isna().sum()`
3. Handled missing values:
   - Used **Mean** for numerical columns (Age, Income)
   - Used **Mode** for categorical columns (Gender, SubscriptionStatus)
4. Encoded categorical data:
   - **Label Encoding** for Gender and SubscriptionStatus
   - **One-Hot Encoding** for City using `pd.get_dummies()`
5. Applied **Feature Scaling** on Age and Income using `MinMaxScaler`
   - Scaled values to the range **0 to 1**

After these steps the dataset has no missing values, and its categorical and numerical columns are in a numeric form that machine learning models can use.
