# **Missing Value Imputation with Scikit-Learn**

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
Dataset = pd.read_csv("Titanic-Dataset.csv")
Dataset

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [3]:
Dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [4]:
Dataset.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [5]:
Dataset.select_dtypes(include="float64").columns

Index(['Age', 'Fare'], dtype='object')

## **SimpleImputer**
**`SimpleImputer`** handles **missing values** (`NaN`) in a dataset by replacing them with a value computed from each column (such as its mean) or with a **constant** you supply.  

**Key Features:**
- Can replace missing values in **numeric or categorical data**.
- **Strategies:**
  - `mean` → Replace with the column mean (numeric data only)
  - `median` → Replace with the column median (numeric data only)
  - `most_frequent` → Replace with the column mode (works for both numeric and categorical data)
  - `constant` → Replace with a fixed value, set through the `fill_value` parameter
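
As a rough illustration of the four strategies, here is a minimal sketch on a tiny made-up table. The `toy` DataFrame, its `age` and `port` columns, and the `"Unknown"` fill value are all assumptions for this example and are not part of the Titanic data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: one numeric column and one categorical column.
toy = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, 90.0],
    "port": pd.Series(["S", "C", np.nan, "S"], dtype="object"),
})

# mean / median only make sense for numeric columns.
mean_filled = SimpleImputer(strategy="mean").fit_transform(toy[["age"]])
median_filled = SimpleImputer(strategy="median").fit_transform(toy[["age"]])

# most_frequent fills with the mode, so it also works on text columns.
mode_filled = SimpleImputer(strategy="most_frequent").fit_transform(toy[["port"]])

# constant fills with whatever is passed as fill_value.
const_filled = SimpleImputer(
    strategy="constant", fill_value="Unknown"
).fit_transform(toy[["port"]])

print(mean_filled.ravel())
print(median_filled.ravel())
print(mode_filled.ravel())
print(const_filled.ravel())
```

The mean and median differ here because the value 90 pulls the mean upward, which is one reason the median is often preferred for skewed columns.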

---

### `fit()` and `transform()` in scikit-learn

- **`fit()`**  
  Learns the parameters from the data.  
  Example: For `SimpleImputer(strategy='mean')`, `fit()` calculates the **mean** of each column.

- **`transform()`**  
  Uses the learned parameters to **modify the data**.  
  Example: Replaces missing values with the **calculated mean**.

- **`fit_transform()`**  
  Combines `fit()` and `transform()` in one step. By default it returns a **NumPy array**, not a DataFrame, so the column names and index are dropped.
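
The split between `fit()` and `transform()` can be sketched as follows. The `train` and `test` frames are hypothetical stand-ins (this notebook does not split the Titanic data); the point is only that the mean learned in `fit()` is stored on the imputer and reused by every later `transform()` call.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical train/test frames, used only to separate "learning" from "applying".
train = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 40.0]})
test = pd.DataFrame({"Age": [np.nan, 50.0]})

imputer = SimpleImputer(strategy="mean")

# fit(): compute and store the per-column mean, ignoring NaN.
imputer.fit(train)
learned_mean = imputer.statistics_[0]

# transform(): fill NaN with the stored mean; nothing is recomputed here.
train_filled = imputer.transform(train)
test_filled = imputer.transform(test)

print(learned_mean)
print(test_filled.ravel())
```

The learned values live in the `statistics_` attribute, one entry per column, so they can be inspected after fitting.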
 

In [6]:
from sklearn.impute import SimpleImputer

In [8]:
imputer = SimpleImputer(strategy="mean")
data_fill = imputer.fit_transform(Dataset[['Age', 'Fare']])

In [9]:
data_fill

array([[22.        ,  7.25      ],
       [38.        , 71.2833    ],
       [26.        ,  7.925     ],
       ...,
       [29.69911765, 23.45      ],
       [26.        , 30.        ],
       [32.        ,  7.75      ]])

- ⚠️ **`SimpleImputer`** **cannot work** on a **single Series** (1D, for example `Dataset['Age']`) or on bare column names.
- **`SimpleImputer`** expects an input of shape **`(n_samples, n_features)`**.
- **Double brackets**, as in `Dataset[['Age', 'Fare']]`, return a **DataFrame (2D)**, which is what `SimpleImputer` requires. This also applies to a single column: use `Dataset[['Age']]`.
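
The shape requirement is easy to check on a small example. This is a sketch on a made-up three-row frame (`toy` is hypothetical), assuming the imputer rejects 1D input with a `ValueError`, which is how scikit-learn's input validation normally behaves.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical three-row frame with one missing age.
toy = pd.DataFrame({"Age": [22.0, np.nan, 26.0]})
imputer = SimpleImputer(strategy="mean")

series_shape = toy["Age"].shape    # single brackets: 1D Series
frame_shape = toy[["Age"]].shape   # double brackets: 2D DataFrame

# Passing the 1D Series is rejected by input validation.
try:
    imputer.fit_transform(toy["Age"])
    series_failed = False
except ValueError:
    series_failed = True

# The 2D selection works; ravel() flattens the (3, 1) result back to 1D
# so it can be assigned to a single column.
toy["Age"] = imputer.fit_transform(toy[["Age"]]).ravel()

print(series_shape, frame_shape, series_failed)
print(toy)
```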

In [12]:
# Create a new DataFrame from the imputed array.
# The array carries no labels, so pass the same column names that were imputed.
df = pd.DataFrame(data_fill, columns=['Age', 'Fare'])

In [13]:
df

Unnamed: 0,Age,Fare
0,22.000000,7.2500
1,38.000000,71.2833
2,26.000000,7.9250
3,35.000000,53.1000
4,35.000000,8.0500
...,...,...
886,27.000000,13.0000
887,19.000000,30.0000
888,29.699118,23.4500
889,26.000000,30.0000


In [14]:
df.isnull().sum()

Age     0
Fare    0
dtype: int64
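
Building a separate DataFrame keeps only the imputed columns. If the other columns are needed as well, one possible alternative is to write the imputed values back into a copy of the original frame. The sketch below uses a hypothetical three-row `raw` frame rather than the Titanic file.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical frame: one text column plus two numeric columns with gaps.
raw = pd.DataFrame({
    "Name": ["a", "b", "c"],
    "Age": [22.0, np.nan, 26.0],
    "Fare": [7.25, 71.28, np.nan],
})
num_cols = ["Age", "Fare"]

# Work on a copy so the raw data stays untouched.
filled = raw.copy()

# The imputer returns a (3, 2) array, which is assigned back column by column.
filled[num_cols] = SimpleImputer(strategy="mean").fit_transform(raw[num_cols])

print(filled)
print(filled.isnull().sum())
```

Here `Name` is preserved and only `Age` and `Fare` are modified.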

In [16]:
# Check: the Age mean matches the value imputed for the missing age in row 888
Dataset['Age'].mean()

np.float64(29.69911764705882)

In [17]:
# Repeat the imputation with strategy="median"
imputer2 = SimpleImputer(strategy="median")
med_ar = imputer2.fit_transform(Dataset[['Age', 'Fare']])

In [18]:
med_ar

array([[22.    ,  7.25  ],
       [38.    , 71.2833],
       [26.    ,  7.925 ],
       ...,
       [28.    , 23.45  ],
       [26.    , 30.    ],
       [32.    ,  7.75  ]])

In [19]:
# Wrap the median-imputed array in a DataFrame with the same column names
df2 = pd.DataFrame(med_ar, columns=['Age', 'Fare'])

In [20]:
df2

Unnamed: 0,Age,Fare
0,22.0,7.2500
1,38.0,71.2833
2,26.0,7.9250
3,35.0,53.1000
4,35.0,8.0500
...,...,...
886,27.0,13.0000
887,19.0,30.0000
888,28.0,23.4500
889,26.0,30.0000


In [21]:
# Check: the Age median matches the value imputed for the missing age in row 888
Dataset['Age'].median()

28.0