# Using Scikit-Learn to Fill Missing Numerical and Categorical Values

### The `SimpleImputer(strategy="?")` class supports several strategies for filling missing values:
- #### mean (generally used; the default; numerical data only)
- #### median (numerical data only)
- #### most_frequent (numerical or categorical data)
- #### constant (numerical or categorical data, used together with `fill_value`)
  
- #### The SimpleImputer class is also needed in automated data-cleaning pipelines to fill missing values
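
As a rough illustration of the pipeline use case, the sketch below combines two imputers with `ColumnTransformer` so that numerical and categorical columns are cleaned in one step. The toy DataFrame and its column names (`income`, `gender`) are hypothetical and are not part of the loan dataset used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical toy data (assumption: not the loan dataset)
toy = pd.DataFrame({
    "income": [5000.0, np.nan, 3000.0, 4000.0],
    "gender": pd.Series(["Male", "Female", np.nan, "Male"], dtype=object),
})

# One imputer per column group: median for numbers, mode for categories
cleaner = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["income"]),
    ("cat", SimpleImputer(strategy="most_frequent"), ["gender"]),
])

# Output columns follow the transformer order: income first, then gender
cleaned = cleaner.fit_transform(toy)
print(cleaned)
```

The same `cleaner` object could be placed in front of a model inside a scikit-learn `Pipeline`, so the fill values learned on training data are reused on new data.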

In [77]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


In [78]:
df = pd.read_csv("loan_data_set.csv")

df.head(3)

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y


In [79]:
df.isnull().sum()

Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64

In [80]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            614 non-null    object 
 1   Gender             601 non-null    object 
 2   Married            611 non-null    object 
 3   Dependents         599 non-null    object 
 4   Education          614 non-null    object 
 5   Self_Employed      582 non-null    object 
 6   ApplicantIncome    614 non-null    int64  
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         592 non-null    float64
 9   Loan_Amount_Term   600 non-null    float64
 10  Credit_History     564 non-null    float64
 11  Property_Area      614 non-null    object 
 12  Loan_Status        614 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB


### Filling Numerical Data Using the `SimpleImputer()` Class of sklearn

In [81]:
df.select_dtypes(include="float64")

Unnamed: 0,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History
0,0.0,,360.0,1.0
1,1508.0,128.0,360.0,1.0
2,0.0,66.0,360.0,1.0
3,2358.0,120.0,360.0,1.0
4,0.0,141.0,360.0,1.0
...,...,...,...,...
609,0.0,71.0,360.0,1.0
610,0.0,40.0,180.0,1.0
611,240.0,253.0,360.0,1.0
612,0.0,187.0,360.0,1.0


In [82]:
float_columns = df.select_dtypes(include = "float64").columns.tolist()
float_columns

['CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term', 'Credit_History']

### Now importing the `SimpleImputer` class from the `sklearn.impute` module

In [83]:
from sklearn.impute import SimpleImputer

## Note: In the line `from sklearn.impute import SimpleImputer`

- `sklearn` : This refers to Scikit-learn, which is a popular open-source Python library for machine learning. It provides a wide range of tools for data mining and data analysis, including classification, regression, clustering, dimensionality reduction, model selection, and preprocessing.


- `impute` : This is a submodule within the sklearn library. It specifically contains tools and methods for imputation, which is the process of filling in or replacing missing values in a dataset. Missing values can occur for various reasons and can negatively impact the performance of machine learning models if not handled appropriately.

- `SimpleImputer` : This is a class within the impute submodule. It provides basic strategies for imputing missing values. It can replace missing values (represented as NaN or other specified placeholders) with a constant value, or with statistics calculated from the data, such as 
   - the mean, 
   - median,
   - most frequent value of each column.
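
To make the statistics concrete, here is a small sketch that fits each statistical strategy on a single hypothetical column and reads the learned fill value from the imputer's `statistics_` attribute. The array `X` is made-up data chosen only so the numbers are easy to verify by hand.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One feature, five samples, one missing value (hypothetical data)
X = np.array([[1.0], [2.0], [2.0], [np.nan], [10.0]])

learned = {}
for strategy in ["mean", "median", "most_frequent"]:
    imputer = SimpleImputer(strategy=strategy)
    imputer.fit(X)  # learns one fill value per column
    learned[strategy] = float(imputer.statistics_[0])

print(learned)
# {'mean': 3.75, 'median': 2.0, 'most_frequent': 2.0}
```

The single large value (10.0) pulls the mean up to 3.75, while the median and mode stay at 2.0.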
  

  


# 1. ➡️ **`SimpleImputer(strategy="mean")`**

### 🌟 **Mean imputation suits data that is roughly normally distributed (symmetric, without strong outliers)**

In [84]:
si = SimpleImputer(strategy="mean")

## ⭐ Important Note : 

### The 2D structure is a universal convention in scikit-learn for distinguishing between features and samples.

### 👌 1. The Scikit-learn Convention

- Scikit-learn expects all input data matrices to adhere to the following shape convention:

                                                        (n_samples,n_features)

- Rows `(n_samples)`: Represent the individual data points or observations (e.g., individual houses, people, or transactions).

- Columns `(n_features)`: Represent the different measurable properties or characteristics of those samples (e.g., 'Square Footage', 'Income', 'Loan Amount').

### 👌 2. Why a Series (df['column']) Fails

- When you use single square brackets on a pandas DataFrame, like `dataset['CoapplicantIncome']`, the result is a pandas Series.

- A Series is a 1-dimensional object.

- Scikit-learn interprets a 1D object as having the shape `(n_samples,)`, which is ambiguous. It doesn't know if the data represents:

    - One sample with many features (a single row).

    - Or, many samples with only one feature (a single column).

- The error message, `Expected a 2-dimensional container but got <class 'pandas.core.series.Series'> instead`, reflects this ambiguity.

### 👌 3. Why Double Brackets (df[['column']]) Works

- When you use double square brackets on a pandas DataFrame, like **`dataset[['CoapplicantIncome']]`**:

- The result is a pandas DataFrame, even if it only contains one column.

- A single-column DataFrame is inherently 2-dimensional, with the shape `(n_samples, 1)`.

- This explicitly tells scikit-learn that you are providing n samples and 1 feature, satisfying the required 2D structure.
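
The difference between the two bracket styles can be checked directly by inspecting shapes. The snippet below is a minimal sketch on a hypothetical three-row DataFrame; it also shows `reshape(-1, 1)` as one alternative way to turn a Series into a 2D array.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame reusing one column name from the dataset
toy = pd.DataFrame({"CoapplicantIncome": [0.0, 1508.0, np.nan]})

as_series = toy["CoapplicantIncome"]    # single brackets -> Series (1D)
as_frame = toy[["CoapplicantIncome"]]   # double brackets -> DataFrame (2D)

print(type(as_series).__name__, as_series.shape, as_series.ndim)
print(type(as_frame).__name__, as_frame.shape, as_frame.ndim)

# A Series can also be made 2D explicitly: (n_samples,) -> (n_samples, 1)
as_2d = as_series.to_numpy().reshape(-1, 1)
print(as_2d.shape)
```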

In [85]:
# If double square brackets (df[[col]]) are not used inside fit_transform(), scikit-learn raises:
# Expected a 2-dimensional container but got <class 'pandas.core.series.Series'> instead. Pass a DataFrame containing a single row (i.e. single sample) or a single column (i.e. single feature) instead.
for col in float_columns:
    # Fit the imputer on the column (learn its mean), then transform it (fill the NaNs)
    df[col] = si.fit_transform(df[[col]])


## 🌟 NOTE ➡️ By default, the output of **`.fit_transform(df[[col]])`** for any **`SimpleImputer` strategy** is a **2D NumPy array** with shape `(n_samples, 1)`

### For **numerical data**, pandas accepted this 2D array when it was assigned to a single column (`df[col] = ...`) with every strategy, but the same assignment failed for **categorical values** in this notebook. That's why it is safer to always write:
```python

dataset[[column]] = imputer.fit_transform(dataset[[column]])

```

### ⚠️ Otherwise, a `ValueError` can be raised
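
Because `SimpleImputer` computes its statistic separately for each column, a per-column loop is not strictly required. The sketch below, on a hypothetical two-column frame, passes several columns at once and assigns the resulting 2D array back in one step. This is offered as an alternative, not as the approach used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical frame with two numeric columns, each missing one value
toy = pd.DataFrame({
    "LoanAmount": [np.nan, 128.0, 66.0],
    "Loan_Amount_Term": [360.0, np.nan, 180.0],
})

imputer = SimpleImputer(strategy="mean")
filled = imputer.fit_transform(toy)   # 2D array, shape (3, 2)

# Assign all columns back at once; each NaN gets its own column's mean
toy[toy.columns] = filled
print(toy)
```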

In [86]:
df.select_dtypes(include="float64").isnull().sum()

CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
dtype: int64

### NOTE: The output above shows that no missing values remain in the float columns, so the missing entries were successfully filled with each column's mean using `SimpleImputer(strategy="mean").fit_transform(df[[col]])`

<br>
<hr>
<br>

# 2. ➡️ `SimpleImputer(strategy = "median")`

### 🌟 **Median imputation is preferred when the data is skewed**

## 🌟 Important Note : ➡️ What is data skewness❓


- When we say that data is **skewed**, it means the data distribution is asymmetrical and leans or stretches more to one side than the other.

- In a perfectly symmetrical, single-peaked distribution (like the bell-shaped **Normal Distribution**), the left and right sides are mirror images, and the mean, median, and mode are all the same value. 

- **Skewness** describes the extent to which a distribution deviates from this symmetry.


**Types of Skewness**

***There are two main types of skewness, defined by the direction of the "tail" or stretch:***


1. **Positive Skewness (Right-Skewed)** 📈
  
    - **Appearance**: The distribution has a long, stretched-out tail on the right side (the positive side of the number line). The bulk of the data is concentrated on the left.

    - **Relationship**: Typically **`Mean > Median > Mode`**. The few large, extreme values (outliers) on the right side pull the mean higher than the median.

    - **Example**: Distribution of personal income (most people earn a modest income, but a few high earners stretch the tail far to the right).

<br>

2. **Negative Skewness (Left-Skewed)** 📉
  
    - **Appearance**: The distribution has a long, stretched-out tail on the left side (the negative side of the number line). The bulk of the data is concentrated on the right.

    - **Relationship**: Typically **`Mean < Median < Mode`**. The few small, extreme values (outliers) on the left side pull the mean lower than the median.

    - **Example**: Distribution of test scores on an easy exam (most students score high, but a few very low scores stretch the tail to the left).
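
A quick way to see right skew in practice is to compare the mean and median and to check the sign of pandas' `Series.skew()`. The income values below are invented for illustration: six modest values and one extreme one.

```python
import pandas as pd

# Hypothetical incomes: most are modest, one is extreme
incomes = pd.Series([2500, 3000, 3200, 3500, 4000, 4200, 81000])

print("mean  :", incomes.mean())
print("median:", incomes.median())
print("skew  :", incomes.skew())   # positive -> right-skewed
```

Here the single large value drags the mean far above the median (3500), which is why the median is usually the safer fill value for a column like this.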


## 🌟 Why It Matters in Data Science ❓

**Skewness is a critical descriptive statistic because it affects how we interpret and model the data:**

- **Choice of Metric**: For skewed data, the median is often a better measure of the "typical" value than the mean, as the median is less affected by the extreme values in the long tail.

- **Statistical Assumptions**: Many statistical models (like linear regression and t-tests) assume the data or the errors are normally distributed (no skew). Significant skewness violates this assumption, which can lead to inaccurate model results or unreliable predictions.


- **Data Transformation**: Data scientists frequently apply transformations (like the logarithm or square root) to skewed data to make it more symmetrical before using it in certain machine learning models.
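
The effect of a log transform on skewness can be sketched with a small made-up series. The values below double at each step, so they are strongly right-skewed on the raw scale but evenly spaced after taking the logarithm. This assumes all values are strictly positive, since `np.log` is undefined at zero.

```python
import numpy as np
import pandas as pd

# Hypothetical values that double each step (strictly positive)
values = pd.Series([1000, 2000, 4000, 8000, 16000, 32000, 64000])

raw_skew = values.skew()
log_skew = np.log(values).skew()   # evenly spaced after log -> symmetric

print("raw skew:", raw_skew)
print("log skew:", log_skew)
```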





In [87]:
# loading data
dataset1 = pd.read_csv("loan_data_set.csv")
dataset1.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


In [88]:
dataset1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            614 non-null    object 
 1   Gender             601 non-null    object 
 2   Married            611 non-null    object 
 3   Dependents         599 non-null    object 
 4   Education          614 non-null    object 
 5   Self_Employed      582 non-null    object 
 6   ApplicantIncome    614 non-null    int64  
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         592 non-null    float64
 9   Loan_Amount_Term   600 non-null    float64
 10  Credit_History     564 non-null    float64
 11  Property_Area      614 non-null    object 
 12  Loan_Status        614 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB


In [89]:
# Listing/Collecting All the Numerical Columns 

numericalColumns = dataset1.select_dtypes(exclude= "object").columns.tolist()
numericalColumns

['ApplicantIncome',
 'CoapplicantIncome',
 'LoanAmount',
 'Loan_Amount_Term',
 'Credit_History']

In [90]:
# Count of the missing values in numericalColumns
dataset1[numericalColumns].isnull().sum()

ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
dtype: int64

In [91]:
medianImputer = SimpleImputer(strategy = "median")

## 🌟 What does **`.fit_transform`** do ❓

#### `fit_transform()` in imputation first learns how to fill missing values **(e.g., mean, median, mode, or constant)** during the fit step, then applies that learned rule to actually fill the NaN values during the transform step — all in one command.

#### It’s equivalent to doing fit() + transform() together.
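
The equivalence can be checked on a tiny example. The sketch below uses hypothetical arrays `train` and `new`; it also shows why the two steps are sometimes kept separate: an imputer fitted once can fill later data with the value it learned earlier.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data: one column, one missing value
train = np.array([[1.0], [3.0], [np.nan], [100.0]])
new = np.array([[np.nan], [7.0]])

# Two explicit steps: learn the median, then apply it
imputer = SimpleImputer(strategy="median")
imputer.fit(train)
two_step = imputer.transform(train)

# The same thing in one call
one_step = SimpleImputer(strategy="median").fit_transform(train)
print(np.array_equal(two_step, one_step))

# The fitted imputer reuses the learned median (3.0) on unseen data
new_filled = imputer.transform(new)
print(new_filled)
```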

In [92]:
# Impute every missing value in numericalColumns with that column's median

for col in numericalColumns:
    dataset1[[col]]= medianImputer.fit_transform(dataset1[[col]])

In [93]:
dataset1.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849.0,0.0,128.0,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583.0,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000.0,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583.0,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000.0,0.0,141.0,360.0,1.0,Urban,Y


In [94]:
dataset1[numericalColumns].isnull().sum()

ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
dtype: int64

<br>
<hr>
<br>

# 3. ➡️ `SimpleImputer(strategy = "most_frequent")`

### :star: **Mode imputation is used mainly for categorical data. For nominal (e.g., color, city) or ordinal (e.g., small, medium, large) features, the most frequent category is often the most sensible single value for missing entries, because the mean and median are not applicable.**

### :star: It can also be an alternative for numerical data if the data is:

- **Discrete and low-cardinality** (e.g., number of children, a count with a limited range).
- **Highly skewed**, where the mode provides a better representation of the central tendency than the mean, or if you want to prioritize preserving the most common value.

### Why Use It :question:

It is the standard method for single-value imputation of categorical missing data because it doesn't introduce a new, non-existent category or value.

### ⚠️ Limitations: 

**It does not take into account the values of other features, only the distribution of the values in that specific column.**
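
As a minimal sketch of mode imputation on a categorical column, the snippet below fills one missing entry in a hypothetical four-row `Gender` column. The column is created with `dtype=object` to mirror the object columns in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical object-dtype column with one missing entry
toy = pd.DataFrame({
    "Gender": pd.Series(["Male", "Female", np.nan, "Male"], dtype=object)
})

imputer = SimpleImputer(strategy="most_frequent")

# Double brackets on both sides keep everything 2D
toy[["Gender"]] = imputer.fit_transform(toy[["Gender"]])

print(imputer.statistics_)   # the learned mode for the column
print(toy)
```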


In [95]:
# Listing the object type columns
categoricalColumns = dataset1.select_dtypes(include = "object").columns.tolist()

In [96]:
#checking missing values in Categorical Columns

dataset1[categoricalColumns].isnull().sum()

Loan_ID           0
Gender           13
Married           3
Dependents       15
Education         0
Self_Employed    32
Property_Area     0
Loan_Status       0
dtype: int64

In [97]:
# Imputer that fills missing entries with the most frequent value

modeImputer = SimpleImputer(strategy="most_frequent")


### 🚫 Incorrect way to impute categorical values (assigning to `dataset1[col]`)

In [101]:
for col in categoricalColumns:
    dataset1[col] = modeImputer.fit_transform(dataset1[[col]])

ValueError: 2

### ✅️ Correct way to do the same (assigning to `dataset1[[col]]`)

In [102]:
for col in categoricalColumns:
    dataset1[[col]] = modeImputer.fit_transform(dataset1[[col]])

In [103]:
dataset1[categoricalColumns].isnull().sum()

Loan_ID          0
Gender           0
Married          0
Dependents       0
Education        0
Self_Employed    0
Property_Area    0
Loan_Status      0
dtype: int64

In [104]:
dataset1.isnull().sum()

Loan_ID              0
Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
Loan_Status          0
dtype: int64

<hr>

# 4. ➡️ **`SimpleImputer(strategy = "constant")`**

### :star: When to Use Constant Value Imputation

#### ✅ Use it when:

- You know what **a sensible default would be for missing values** (e.g., 0 sales, 0 temperature difference, etc.).

- You want to **keep missing values identifiable** using a special code like -1 or 999.

#### 🚫 Avoid it when:

- The fixed value **distorts the true meaning** of your data.

- The column has **no logical default** (e.g., average temperature, price).
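
If the goal is to keep missing values identifiable, one option besides a sentinel code is `SimpleImputer`'s `add_indicator=True` parameter, which appends a 0/1 column marking which entries were originally missing. The sketch below uses a hypothetical single-column array and a fill value of 0, assuming 0 is a sensible default for that column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numeric column with one missing entry
X = np.array([[1.0], [np.nan], [3.0]])

imputer = SimpleImputer(strategy="constant", fill_value=0, add_indicator=True)
out = imputer.fit_transform(X)

# Column 0: filled values; column 1: 1.0 where the value was missing
print(out)
```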

In [105]:
# Let's try this on the missing categorical values of df
df.isnull().sum()

Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount            0
Loan_Amount_Term      0
Credit_History        0
Property_Area         0
Loan_Status           0
dtype: int64

In [107]:
catColumns = df.select_dtypes(include="object").columns.tolist()
catColumns

['Loan_ID',
 'Gender',
 'Married',
 'Dependents',
 'Education',
 'Self_Employed',
 'Property_Area',
 'Loan_Status']

In [108]:
constantImputer = SimpleImputer(strategy = "constant", fill_value="unknown")


In [109]:
for col in catColumns:
    df[[col]] = constantImputer.fit_transform(df[[col]])

In [110]:
df[catColumns].isnull().sum()

Loan_ID          0
Gender           0
Married          0
Dependents       0
Education        0
Self_Employed    0
Property_Area    0
Loan_Status      0
dtype: int64

In [115]:
df.tail(20)

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
594,LP002938,Male,Yes,0,Graduate,Yes,16120,0.0,260.0,360.0,1.0,Urban,Y
595,LP002940,Male,No,0,Not Graduate,No,3833,0.0,110.0,360.0,1.0,Rural,Y
596,LP002941,Male,Yes,2,Not Graduate,Yes,6383,1000.0,187.0,360.0,1.0,Rural,N
597,LP002943,Male,No,unknown,Graduate,No,2987,0.0,88.0,360.0,0.0,Semiurban,N
598,LP002945,Male,Yes,0,Graduate,Yes,9963,0.0,180.0,360.0,1.0,Rural,Y
599,LP002948,Male,Yes,2,Graduate,No,5780,0.0,192.0,360.0,1.0,Urban,Y
600,LP002949,Female,No,3+,Graduate,unknown,416,41667.0,350.0,180.0,0.842199,Urban,N
601,LP002950,Male,Yes,0,Not Graduate,unknown,2894,2792.0,155.0,360.0,1.0,Rural,Y
602,LP002953,Male,Yes,3+,Graduate,No,5703,0.0,128.0,360.0,1.0,Urban,Y
603,LP002958,Male,No,0,Graduate,No,3676,4301.0,172.0,360.0,1.0,Rural,Y
