## 3) Handling Missing Values

Missing values can bias an analysis and cause many models to fail or predict poorly. Imputation fills those gaps with plausible values, so every row stays usable for modeling.

**Instructions:**
1. Check for missing values.
2. Impute missing values for numeric data using the mean and for non-numeric data using the mode.

In [2]:
# CHECK FOR MISSING VALUES
import pandas as pd
import numpy as np

data = pd.read_excel("/Users/abhirajchaudhary/Downloads/Approval.xlsx")

missing_count = data.isnull().sum()
print(missing_count)

Gender            0
Age               0
Debt              0
Married           0
BankCustomer      0
Industry          0
Ethnicity         0
YearsEmployed     0
PriorDefault      0
Employed          0
CreditScore       0
DriversLicense    0
Citizen           0
ZipCode           0
Income            0
Approved          0
dtype: int64
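A count of zero from `isnull()` only shows that there are no `NaN` entries. Some files encode missing values as placeholder strings, which `isnull()` does not catch. The sketch below illustrates this on a hypothetical toy frame; the column values and the placeholder tokens (`"?"`, `""`, `"NA"`, `"n/a"`) are assumptions for illustration and are not taken from `Approval.xlsx`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: "?" and "" stand in for missing entries.
toy = pd.DataFrame({
    "Age": ["30.83", "?", "24.50"],
    "Industry": ["Industrials", "", "Materials"],
})

# isnull() finds no NaN here, because the placeholders are ordinary strings.
naive_count = toy.isnull().sum()

# Assumed placeholder tokens; adjust to whatever the data source actually uses.
placeholders = ["?", "", "NA", "n/a"]
cleaned = toy.replace(placeholders, np.nan)

# After the replacement, each column reports one missing value.
true_count = cleaned.isnull().sum()

print(naive_count)
print(true_count)
```

If a check like this also finds nothing in the real data, the conclusion that there is no missing data is on firmer ground.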


In [7]:
# IMPUTE MISSING VALUES
import pandas as pd
from sklearn.impute import SimpleImputer

# Numeric columns only; passing text columns such as Industry raises a ValueError
numeric_cols = data.select_dtypes(include="number").columns

# Use the mean, as the instructions specify
imputer_numeric = SimpleImputer(strategy="mean")
data[numeric_cols] = imputer_numeric.fit_transform(data[numeric_cols])


In [10]:
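# For non-numeric data, impute the mode (most frequent value)
import pandas as pd
from sklearn.impute import SimpleImputer

# Select the text (non-numeric) columns
non_numeric_cols = data.select_dtypes(exclude="number").columns
imputer_non_numeric = SimpleImputer(strategy="most_frequent")
data[non_numeric_cols] = imputer_non_numeric.fit_transform(data[non_numeric_cols])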
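
Because this dataset has no missing values, the imputers above leave every value unchanged. To see what they do when gaps exist, here is a minimal sketch on a hypothetical toy frame. The `toy` data and its gaps are invented for illustration; only `SimpleImputer` and its `strategy` options come from the cells above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one gap in each column.
toy = pd.DataFrame({
    "Debt": [0.0, 4.0, np.nan, 2.0],
    "Industry": pd.Series(
        ["Materials", "Industrials", "Materials", np.nan], dtype=object
    ),
})

num_cols = ["Debt"]
cat_cols = ["Industry"]

# Mean imputation for the numeric column: (0 + 4 + 2) / 3 = 2.0
toy[num_cols] = SimpleImputer(strategy="mean").fit_transform(toy[num_cols])

# Mode imputation for the text column: "Materials" appears most often
toy[cat_cols] = SimpleImputer(strategy="most_frequent").fit_transform(toy[cat_cols])

print(toy)
```

The gap in `Debt` is filled with 2.0 and the gap in `Industry` with `"Materials"`.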


In [11]:
data.head(10)

Unnamed: 0,Gender,Age,Debt,Married,BankCustomer,Industry,Ethnicity,YearsEmployed,PriorDefault,Employed,CreditScore,DriversLicense,Citizen,ZipCode,Income,Approved
0,1.0,30.83,0.0,1.0,1.0,Industrials,White,1.25,1.0,1.0,1.0,0.0,ByBirth,202.0,0.0,1.0
1,0.0,58.67,4.46,1.0,1.0,Materials,Black,3.04,1.0,1.0,6.0,0.0,ByBirth,43.0,560.0,1.0
2,0.0,24.5,0.5,1.0,1.0,Materials,Black,1.5,1.0,0.0,0.0,0.0,ByBirth,280.0,824.0,1.0
3,1.0,27.83,1.54,1.0,1.0,Industrials,White,3.75,1.0,1.0,5.0,1.0,ByBirth,100.0,3.0,1.0
4,1.0,20.17,5.625,1.0,1.0,Industrials,White,1.71,1.0,0.0,0.0,0.0,ByOtherMeans,120.0,0.0,1.0
5,1.0,32.08,4.0,1.0,1.0,CommunicationServices,White,2.5,1.0,0.0,0.0,1.0,ByBirth,360.0,0.0,1.0
6,1.0,33.17,1.04,1.0,1.0,Transport,Black,6.5,1.0,0.0,0.0,1.0,ByBirth,164.0,31285.0,1.0
7,0.0,22.92,11.585,1.0,1.0,InformationTechnology,White,0.04,1.0,0.0,0.0,0.0,ByBirth,80.0,1349.0,1.0
8,1.0,54.42,0.5,0.0,0.0,Financials,Black,3.96,1.0,0.0,0.0,0.0,ByBirth,180.0,314.0,1.0
9,1.0,42.5,4.915,0.0,0.0,Industrials,White,3.165,1.0,0.0,0.0,1.0,ByBirth,52.0,1442.0,1.0


## <span style="color:red">*Q2. Is there any missing data at all?*</span>

## <span style="color:black">*A2. No. `isnull().sum()` reports 0 missing values for every column.*</span>

## <span style="color:red">*Q3. Why do we impute mean for numeric and mode for non-numeric data?*</span>

## <span style="color:black">*A3. The mean is a reasonable fill value for numeric data when the column is roughly symmetric (for example, approximately normal), because it then represents a typical value; for strongly skewed columns the median is usually more robust. The mode suits non-numeric data because a category has no average, so the most common category is the most plausible replacement for a missing value.*</span>
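
As a rough illustration of how the mean behaves on skewed data, the sketch below uses the ten `Income` and `Industry` values shown in the `data.head(10)` output above. Ten rows are far too few to characterize the full dataset, so treat this as a demonstration of the statistics, not a finding about the data.

```python
import pandas as pd

# Income values from the ten rows displayed by data.head(10).
income = pd.Series([0, 560, 824, 3, 0, 0, 31285, 1349, 314, 1442], dtype=float)

# One large value (31285) pulls the mean far above most observations.
mean_income = income.mean()      # 3577.7
median_income = income.median()  # 437.0

# Industry values from the same ten rows.
industry = pd.Series([
    "Industrials", "Materials", "Materials", "Industrials", "Industrials",
    "CommunicationServices", "Transport", "InformationTechnology",
    "Financials", "Industrials",
])

# The mode is the most frequent category, which is well defined for text data.
top_industry = industry.mode()[0]  # "Industrials"

print(mean_income, median_income, top_industry)
```

Here the mean (3577.7) is larger than nine of the ten values, while the median (437.0) sits in the middle of them, which is why the median is often preferred for skewed columns.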