# Data Imputation

## **What is Imputation?**
Imputation is the process of filling in missing values in a dataset. In real-world data, missing values are common and need to be handled properly before training a machine learning model.

### **Why is Imputation Needed?**
1. **ML Models Can't Handle Missing Data:** Many machine learning algorithms, including most scikit-learn estimators, raise an error when the input contains missing values.
2. **Prevents Data Loss:** Simply deleting rows or columns with missing values can lead to **loss of important data**.
3. **Improves Accuracy:** Proper imputation can make models more **robust and accurate**.
4. **Ensures Data Consistency:** Missing data can introduce **bias** or misinterpretations in analysis.
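
The first two points can be illustrated with a minimal sketch. The toy frame below reuses the `Age` and `Salary` values from the example later in this section; the `Bonus` column and the name `toy` are assumptions added only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data: 'Bonus' is a hypothetical, fully observed target column
toy = pd.DataFrame({
    "Age":    [25, np.nan, 30, 35, np.nan, 40],
    "Salary": [50000, 60000, np.nan, 80000, 75000, np.nan],
    "Bonus":  [5000, 6000, 6500, 8000, 7500, 9000],
})

# Count missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Deleting incomplete rows keeps only the fully observed ones
complete_rows = toy.dropna()
print(f"Rows kept after dropna(): {len(complete_rows)} of {len(toy)}")

# LinearRegression rejects NaN inputs with a ValueError
try:
    LinearRegression().fit(toy[["Age", "Salary"]], toy["Bonus"])
    fit_failed = False
except ValueError as err:
    fit_failed = True
    print("Fit failed:", err)
```

Here each column is missing only two values, yet row deletion discards four of the six rows, which is the data loss that imputation is meant to avoid.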

---

## **Types of Imputation Techniques**
Imputation techniques are mainly classified into:
1. **Simple Imputation** (Statistical methods)
2. **Advanced Imputation** (Machine learning-based methods)

---

## **1. Simple Imputation Techniques**
These methods use **basic statistical techniques** to fill in missing values.

### **A. Mean, Median, and Mode Imputation**
- **Mean Imputation:** Replace missing values with the column’s **mean** (average).  
- **Median Imputation:** Replace with the **median** (middle value) of the column.  
- **Mode Imputation:** Replace with the **most frequent value** (best for categorical data).  

#### **Example using Pandas and Scikit-Learn**
```python
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

# Creating a dataset with missing values
data = pd.DataFrame({
    'Age': [25, np.nan, 30, 35, np.nan, 40],
    'Salary': [50000, 60000, np.nan, 80000, 75000, np.nan],
    'City': ['Hyderabad', 'Mumbai', 'Delhi', np.nan, 'Mumbai', 'Hyderabad']
})

print("Original Data:\n", data)

# Mean Imputation
mean_imputer = SimpleImputer(strategy='mean')
data[['Age', 'Salary']] = mean_imputer.fit_transform(data[['Age', 'Salary']])

# Mode Imputation for Categorical Data
mode_imputer = SimpleImputer(strategy='most_frequent')
data[['City']] = mode_imputer.fit_transform(data[['City']])

print("After Imputation:\n", data)
```
**When to Use?**
| **Method**  | **Best For** |
|-------------|-------------|
| Mean | Normally distributed numerical data |
| Median | Skewed numerical data (e.g., salary, house price) |
| Mode | Categorical data (e.g., gender, city, department) |
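
The difference between the mean and the median on skewed data can be checked with a short sketch. The salary values below are made up for illustration (an assumption, not data from this notebook), with one extreme value to create the skew.

```python
import numpy as np
import pandas as pd

# Hypothetical salaries: one extreme value and two missing entries
salary = pd.Series([30000, 32000, 35000, 36000, 40000, 500000, np.nan, np.nan])

mean_fill = salary.mean()      # pulled upward by the extreme value
median_fill = salary.median()  # middle of the observed values

print(f"Mean fill value:   {mean_fill:.0f}")
print(f"Median fill value: {median_fill:.0f}")

# Fill the gaps with each statistic for comparison
filled_mean = salary.fillna(mean_fill)
filled_median = salary.fillna(median_fill)
```

The mean-based fill lands far above every typical salary in the column, while the median-based fill stays among them, which is why the median is usually preferred for skewed columns.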

---

## **2. Advanced Imputation Techniques**
These methods estimate each missing value from the other features, which often gives more accurate results than a single column statistic.

### **A. K-Nearest Neighbors (KNN) Imputation**
- Fills missing values by considering **nearest neighbors**.
- Uses **distance-based similarity** to estimate missing values.

#### **Example using Scikit-Learn**
```python
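import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Re-create the numeric columns: `data` was already filled by the mean imputer above
data = pd.DataFrame({
    'Age': [25, np.nan, 30, 35, np.nan, 40],
    'Salary': [50000, 60000, np.nan, 80000, 75000, np.nan]
})

# Each missing value becomes the average of that feature over the 2 nearest rows
knn_imputer = KNNImputer(n_neighbors=2)

# Applying KNN Imputation
data[['Age', 'Salary']] = knn_imputer.fit_transform(data[['Age', 'Salary']])

print("After KNN Imputation:\n", data)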
```
✅ **Best when missing values are dependent on other features.**  
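
To make the "distance-based similarity" concrete: by default `KNNImputer` measures distance with the NaN-aware Euclidean metric, which skips coordinates missing in either row and rescales the result. The sketch below re-implements that formula by hand and compares it with scikit-learn's `nan_euclidean_distances`. The helper name `nan_euclidean` and the two sample rows are assumptions for illustration only.

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

def nan_euclidean(a, b):
    """Euclidean distance over coordinates present in both rows, rescaled."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    present = ~np.isnan(a) & ~np.isnan(b)
    if not present.any():
        return np.nan  # no shared coordinates: distance is undefined
    # Weight = total number of coordinates / number of shared coordinates
    scale = a.size / present.sum()
    return np.sqrt(scale * np.sum((a[present] - b[present]) ** 2))

# Hypothetical rows: the second coordinate of row_a is missing
row_a = [3.0, np.nan, 5.0]
row_b = [1.0, 0.0, 0.0]

manual = nan_euclidean(row_a, row_b)
library = nan_euclidean_distances([row_a], [row_b])[0, 0]
print(manual, library)
```

Because the distance is computed on raw feature values, columns with large ranges (such as salary) can dominate it, so scaling the features before KNN imputation is often worth considering.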

---

### **B. Regression Imputation**
- Uses **linear regression** to predict missing values from the other columns.
- Example: If "Age" is missing, it can be predicted using "Salary".

#### **Example using Linear Regression**
```python
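import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Re-create the numeric columns: `data` was already filled in the earlier examples
data = pd.DataFrame({
    'Age': [25, np.nan, 30, 35, np.nan, 40],
    'Salary': [50000, 60000, np.nan, 80000, 75000, np.nan]
})

# Train only on rows where both Age and Salary are known (the model cannot take NaN)
known = data.dropna(subset=['Age', 'Salary'])

# Rows to fill: Age is missing but the predictor Salary is available
to_fill = data['Age'].isna() & data['Salary'].notna()

# Training a regression model that predicts Age from Salary
model = LinearRegression()
model.fit(known[['Salary']], known['Age'])

# Predicting missing values
data.loc[to_fill, 'Age'] = model.predict(data.loc[to_fill, ['Salary']])

print("After Regression Imputation:\n", data)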
```
✅ **Best for numerical data with strong relationships.**

---

### **C. Multiple Imputation (MICE - Multivariate Imputation by Chained Equations)**
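- Models each feature with missing values as a function of the other features and repeats the cycle until the estimates stabilize.
- Full MICE produces **several completed datasets**, so the uncertainty of the imputation carries into later analysis. scikit-learn's `IterativeImputer` is inspired by MICE but returns a **single imputation** by default.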

#### **Example using IterativeImputer**
```python
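import numpy as np
import pandas as pd
# IterativeImputer is experimental: this import must come first to enable it
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Re-create the numeric columns: `data` was already filled in the earlier examples
data = pd.DataFrame({
    'Age': [25, np.nan, 30, 35, np.nan, 40],
    'Salary': [50000, 60000, np.nan, 80000, 75000, np.nan]
})

# Creating an Iterative Imputer instance (fixed seed for repeatable results)
iter_imputer = IterativeImputer(random_state=0)

# Applying MICE-style imputation
data[['Age', 'Salary']] = iter_imputer.fit_transform(data[['Age', 'Salary']])

print("After MICE Imputation:\n", data)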
```
✅ **Best when missing data is complex and interdependent.**
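
The "multiple" part of multiple imputation can be approximated with `IterativeImputer` by setting `sample_posterior=True` and varying `random_state`, so that each run draws a different plausible fill. The following is a sketch under stated assumptions: the synthetic `frame`, the choice of five imputations, and the simple averaging at the end are all illustrative, not a full MICE analysis.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical correlated data with 20 Age values removed at random
rng = np.random.default_rng(0)
age = rng.uniform(20, 60, size=100)
salary = 1500 * age + rng.normal(0, 5000, size=100)
frame = pd.DataFrame({"Age": age, "Salary": salary})
frame.loc[rng.choice(100, size=20, replace=False), "Age"] = np.nan

# Draw several completed datasets, one per random seed
n_imputations = 5
completed = []
for seed in range(n_imputations):
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    values = imputer.fit_transform(frame)
    completed.append(pd.DataFrame(values, columns=frame.columns))

# Compute the statistic of interest on each dataset, then pool
age_means = [c["Age"].mean() for c in completed]
print("Mean Age per imputed dataset:", np.round(age_means, 2))
print("Pooled estimate:", np.mean(age_means))
print("Between-imputation spread:", np.std(age_means))
```

The spread between the runs gives a rough sense of how much the result depends on the imputed values, which a single imputation cannot show.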

---

## **3. Imputation for Different Data Types**
| **Data Type** | **Best Imputation Technique** |
|--------------|------------------------------|
| **Numerical (Continuous)** | Mean, Median, KNN, Regression, MICE |
| **Numerical (Skewed)** | Median, KNN |
| **Categorical (Nominal)** | Mode (followed by One-Hot or Frequency Encoding) |
| **Categorical (Ordinal)** | Mode (followed by Ordinal Encoding) |
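
One way to apply a different technique per data type in a single step is scikit-learn's `ColumnTransformer`. The sketch below reuses the small example frame from earlier in this section; the choice of the median for numeric columns and the mode for `City`, as well as the variable names, are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

data = pd.DataFrame({
    "Age": [25, np.nan, 30, 35, np.nan, 40],
    "Salary": [50000, 60000, np.nan, 80000, 75000, np.nan],
    "City": ["Hyderabad", "Mumbai", "Delhi", np.nan, "Mumbai", "Hyderabad"],
})

numeric_cols = ["Age", "Salary"]
categorical_cols = ["City"]

# Route each group of columns to its own imputer
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    ("cat", SimpleImputer(strategy="most_frequent"), categorical_cols),
])

# Output columns follow the order of the transformers listed above
filled = pd.DataFrame(
    preprocess.fit_transform(data),
    columns=numeric_cols + categorical_cols,
)
print(filled)
```

Encoding steps such as one-hot encoding can then be chained after the categorical imputer, for example inside a `Pipeline`.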

---

## **Conclusion**
- **Imputation is necessary** to prevent data loss and improve model performance.
- **Simple imputation** (Mean, Median, Mode) is fast and easy but may introduce bias.
- **Advanced imputation** (KNN, Regression, MICE) often provides better accuracy but requires more computation.

In [1]:
import numpy as np
import pandas as pd
import scipy as sci
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import SimpleImputer

# Load the Titanic passenger data
df = pd.read_csv("titanics.csv")

In [2]:
df.head(10)

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.075 | NaN | S |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
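
The non-null counts in this summary translate directly into missing percentages, which helps decide between imputing a column and dropping it. The sketch below copies the three incomplete columns' counts from the `df.info()` output; the 50% cut-off and the variable names are assumptions for illustration, not a rule from this notebook. On the loaded frame itself, the same fractions would come from `df.isna().mean()`.

```python
import pandas as pd

total_rows = 891  # from "RangeIndex: 891 entries"

# Non-null counts copied from the df.info() output
non_null = pd.Series({"Age": 714, "Cabin": 204, "Embarked": 889})

# Percentage of missing values per column
missing_pct = ((1 - non_null / total_rows) * 100).round(1)
print(missing_pct)

# Hypothetical rule of thumb: flag columns with more than half their values missing
drop_threshold_pct = 50.0
drop_candidates = missing_pct[missing_pct > drop_threshold_pct].index.tolist()
print("Candidates to drop:", drop_candidates)
```

Under this assumed threshold only `Cabin` is flagged, while `Age` and `Embarked` remain candidates for imputation.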


In [4]:
# from ydata_profiling import ProfileReport as pr
# pr(df)

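In [21]:
# This fails: `from sklearn.impute import SimpleImputer` binds only the name
# `SimpleImputer`; `sklearn` itself is undefined until `import sklearn.impute` runs.
dir(sklearn.impute)

NameError: name 'sklearn' is not defined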

In [23]:
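# Fill missing ages with the mean of the observed ages
mean_impute = SimpleImputer(strategy="mean")
# fit_transform returns a 2-D array of shape (n_rows, 1); ravel() flattens it to one column
df["Age"] = mean_impute.fit_transform(df[["Age"]]).ravel()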

In [25]:
df["Age"].head(10)

0    22.000000
1    38.000000
2    26.000000
3    35.000000
4    35.000000
5    29.699118
6    54.000000
7     2.000000
8    27.000000
9    14.000000
Name: Age, dtype: float64
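
After mean imputation, filled rows such as row 5 can no longer be told apart from observed ones, and the column's spread shrinks. One option, sketched below, is `SimpleImputer(add_indicator=True)`, which appends a flag column marking the filled rows. The sketch uses only the ten `Age` values shown by `df.head(10)`, so its fill value is the mean of those ten rows rather than the 29.699118 computed on the full data; the column name `Age_was_missing` is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Age values of the first ten passengers, as shown by df.head(10)
ages = pd.DataFrame(
    {"Age": [22.0, 38.0, 26.0, 35.0, 35.0, np.nan, 54.0, 2.0, 27.0, 14.0]}
)

# add_indicator=True appends a 0/1 column for each feature that had missing values
imputer = SimpleImputer(strategy="mean", add_indicator=True)
result = imputer.fit_transform(ages)

flagged = pd.DataFrame(result, columns=["Age", "Age_was_missing"])
print(flagged)

# Standard deviation before (NaN ignored) and after mean imputation
std_before = ages["Age"].std()
std_after = flagged["Age"].std()
print("Std before:", std_before, "after:", std_after)
```

Keeping the flag lets a downstream model learn whether missingness itself carries information, and the drop in standard deviation shows the bias that mean imputation can introduce.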

In [27]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          891 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [29]:
# Cabin has only 204 non-null values out of 891 (see df.info() above); drop the column
df.drop("Cabin", axis=1, inplace=True)

In [31]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          891 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(4)
memory usage: 76.7+ KB


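In [101]:
# Mode imputation for the categorical Embarked column
mode_impute = SimpleImputer(strategy="most_frequent")


In [105]:
# ravel() flattens the (n_rows, 1) result so it can be assigned as a single column
df['Embarked'] = mode_impute.fit_transform(df[["Embarked"]]).ravel()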

In [107]:
df['Embarked']

0      S
1      C
2      S
3      S
4      S
      ..
886    S
887    S
888    S
889    C
890    Q
Name: Embarked, Length: 891, dtype: object

In [109]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          891 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Embarked     891 non-null    object 
dtypes: float64(2), int64(5), object(4)
memory usage: 76.7+ KB


In [75]:
import sklearn.impute
print(dir(sklearn.impute))

['KNNImputer', 'MissingIndicator', 'SimpleImputer', '__all__', '__builtins__', '__cached__', '__doc__', '__file__', '__getattr__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '_base', '_knn', 'typing']


In [111]:
# from ydata_profiling import ProfileReport as pr
# pr(df)