# For Numerical Values

- When the project is deployed, missing values must be handled automatically. At that point we use `SimpleImputer` to fill them in.
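
To make the deployment point concrete, here is a minimal sketch, assuming small made-up `train` and `incoming` frames (both hypothetical, not taken from `Sales_data.csv`). The imputer learns the column means once from the training data, and `transform` then reuses those stored means on rows that arrive later.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical training data with one missing value per column
train = pd.DataFrame({
    "Sales_Before": [100.0, 200.0, np.nan, 300.0],
    "Sales_After": [150.0, np.nan, 250.0, 350.0],
})

# Learn the column means once, from the training data only
imputer = SimpleImputer(strategy="mean")
imputer.fit(train)

# Hypothetical rows arriving after deployment
incoming = pd.DataFrame({
    "Sales_Before": [np.nan, 120.0],
    "Sales_After": [180.0, np.nan],
})

# transform() fills gaps with the stored training means; it does not refit
filled = pd.DataFrame(imputer.transform(incoming), columns=incoming.columns)

print(imputer.statistics_)  # means learned from `train`
print(filled)
```

Keeping `fit` and `transform` separate means new data is filled with the statistics learned during training rather than with statistics recomputed from each incoming batch.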

## Key Scikit-Learn Classes for Missing Data
- SimpleImputer: For basic strategies like mean, median, or mode.

- KNNImputer: For imputing values based on K-nearest neighbors.

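- IterativeImputer: For advanced, iterative imputation in which each feature with missing values is modeled from the other features (still marked experimental in Scikit-learn).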

- Pipeline: To integrate imputation with other preprocessing and modeling steps.
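
The classes listed above that are not demonstrated later in this notebook can be sketched as follows. This is an illustration on a tiny made-up array `X`, not on the sales data, and the parameter choices (`n_neighbors=2`, `random_state=0`, the median strategy, the `StandardScaler` step) are assumptions picked for the example. Note that `IterativeImputer` needs the explicit `enable_iterative_imputer` import before it can be imported.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must come first)
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two numeric features with one gap in each
X = np.array([[1.0, 2.0],
              [3.0, np.nan],
              [5.0, 6.0],
              [np.nan, 8.0]])

# Fill each gap using the 2 nearest rows
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)

# Fill each gap by modeling one feature from the other
X_iter = IterativeImputer(random_state=0).fit_transform(X)

# Chain imputation with a later preprocessing step
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
X_pipe = pipe.fit_transform(X)

print(X_knn)
print(X_iter)
print(X_pipe)
```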

In [2]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [3]:
ds = pd.read_csv("Sales_data.csv")
ds.head(5)

| | Group | Customer_Segment | Sales_Before | Sales_After | Customer_Satisfaction_Before | Customer_Satisfaction_After | Purchase_Made |
|---|---|---|---|---|---|---|---|
| 0 | Control | High Value | 240.548359 | 300.007568 | 74.684767 | NaN | No |
| 1 | Treatment | High Value | 246.862114 | 381.337555 | 100.0 | 100.0 | Yes |
| 2 | Control | High Value | 156.978084 | 179.330464 | 98.780735 | 100.0 | No |
| 3 | Control | Medium Value | 192.126708 | 229.278031 | 49.333766 | 39.811841 | Yes |
| 4 | NaN | High Value | 229.685623 | NaN | 83.974852 | 87.738591 | Yes |


In [4]:
ds.isnull().sum()

Group                           1401
Customer_Segment                1966
Sales_Before                    1522
Sales_After                      767
Customer_Satisfaction_Before    1670
Customer_Satisfaction_After     1640
Purchase_Made                    805
dtype: int64

In [5]:
ds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 7 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Group                         8599 non-null   object 
 1   Customer_Segment              8034 non-null   object 
 2   Sales_Before                  8478 non-null   float64
 3   Sales_After                   9233 non-null   float64
 4   Customer_Satisfaction_Before  8330 non-null   float64
 5   Customer_Satisfaction_After   8360 non-null   float64
 6   Purchase_Made                 9195 non-null   object 
dtypes: float64(4), object(3)
memory usage: 547.0+ KB


## Simple Imputation
- The SimpleImputer class in Scikit-learn can be used for basic imputation techniques such as filling missing values with the mean, median, or most frequent value.

### Numerical Data
- For numerical data, we replace missing values with the column's mean or median.
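
The two strategies can give quite different fill values when a column contains extreme values. A small sketch, assuming a made-up single column `col` with one outlier (hypothetical numbers, not from the dataset):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical column: three typical values, one outlier, one missing entry
col = np.array([[10.0], [12.0], [11.0], [1000.0], [np.nan]])

# The last row is the missing one, so read back the value it was filled with
mean_fill = SimpleImputer(strategy="mean").fit_transform(col)[-1, 0]
median_fill = SimpleImputer(strategy="median").fit_transform(col)[-1, 0]

print(mean_fill)    # 258.25, pulled upward by the outlier
print(median_fill)  # 11.5, close to the typical values
```

For skewed columns or columns with outliers, the median is therefore often the safer default, while the mean is reasonable for roughly symmetric data.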


In [7]:
ds.select_dtypes(include='float64').columns

Index(['Sales_Before', 'Sales_After', 'Customer_Satisfaction_Before',
       'Customer_Satisfaction_After'],
      dtype='object')

In [8]:
from sklearn.impute import SimpleImputer

In [19]:
# Impute missing numerical values with each column's mean
si1 = SimpleImputer(strategy="mean")
ar1 = si1.fit_transform(ds[['Sales_Before', 'Sales_After', 'Customer_Satisfaction_Before',
                            'Customer_Satisfaction_After']])

In [20]:
# Impute missing numerical values with each column's median
si2 = SimpleImputer(strategy="median")
ar2 = si2.fit_transform(ds[['Sales_Before', 'Sales_After', 'Customer_Satisfaction_Before',
                            'Customer_Satisfaction_After']])

In [21]:
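# fit_transform returns NumPy arrays, so wrap them back into DataFrames
# with the original numerical column names
new_ds1 = pd.DataFrame(ar1, columns=ds.select_dtypes(include="float64").columns)
new_ds2 = pd.DataFrame(ar2, columns=ds.select_dtypes(include="float64").columns)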

In [22]:
new_ds1.isnull().sum()

Sales_Before                    0
Sales_After                     0
Customer_Satisfaction_Before    0
Customer_Satisfaction_After     0
dtype: int64

In [23]:
new_ds2.isnull().sum()

Sales_Before                    0
Sales_After                     0
Customer_Satisfaction_Before    0
Customer_Satisfaction_After     0
dtype: int64

### Categorical Data
- For categorical data, we replace missing values with the mode, that is, the most frequent value in the column.

In [24]:
ds.select_dtypes(include='object').columns

Index(['Group', 'Customer_Segment', 'Purchase_Made'], dtype='object')

In [27]:
# Impute missing categorical values with each column's most frequent value
si3 = SimpleImputer(strategy="most_frequent")
ar3 = si3.fit_transform(ds[['Group', 'Customer_Segment', 'Purchase_Made']])

In [28]:
new_ds3 = pd.DataFrame(ar3, columns= ds.select_dtypes(include="object").columns)

In [29]:
new_ds3.isnull().sum()

Group               0
Customer_Segment    0
Purchase_Made       0
dtype: int64