<p> Null Values handling 


<p> In our previous discussions, we analyzed both numeric and categorical data. Until now, our approach for handling missing values involved removing the rows containing them. However, removing rows is not always the best solution, as it leads to a loss of valuable data.

In this article, we will explore alternative methods to handle missing numerical values by filling them instead of removing them.

Why Not Remove Null Values?
When working with a dataset, missing values can pose challenges. The simplest way to handle them is by removing rows where missing values exist, but this method has drawbacks:

Loss of Data: If multiple rows have missing values, we lose a significant portion of data.
Skewed Analysis: Deleting rows can distort distributions and affect insights.
Not Always Practical: In small datasets, removing rows may not be feasible.
Instead, we can fill missing values with estimated values using different imputation techniques.

<p> 1. Removing the null values using dropna ()

In [4]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

In [7]:
data=pd.read_csv("Data (1).csv")
data

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,Yes
1,Spain,27.0,48000.0,Yes
2,,30.0,54000.0,
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


In [8]:
data.dropna()

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,Yes
1,Spain,27.0,48000.0,Yes
3,Spain,38.0,61000.0,No
5,France,35.0,58000.0,Yes
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


<p>Approach 2: Filling Null Values Using Imputation
Instead of deleting missing values, we can fill them with appropriate values.
This is done using imputation techniques, such as:

Mean (average of existing values)
Median (middle value when sorted)
Most Frequent (mode, most common value)
Here, we use most frequent value (mode) to fill missing values.

SimpleImputer is a scikit-learn class which is helpful in handling the missing data in the predictive model dataset. It replaces the NaN values with a specified placeholder. 
It is implemented by the use of the SimpleImputer() method which takes the following arguments :
 

missing_values : The missing_values placeholder which has to be imputed. By default is NaN 
strategy : The data which will replace the NaN values from the dataset. The strategy argument can take the values – ‘mean'(default), ‘median’, ‘most_frequent’ and ‘constant’. 
fill_value : The constant value to be given to the NaN data using the constant strategy.  

In [9]:
# filling the mean 

imputer=SimpleImputer(missing_values=np.nan,strategy="mean")



In [24]:
# imputer.fit(data["Salary"])



<p> The Above Satement will throw error because the input array is 1-d and the expected array is 2-d means the data is in 2-d mode but we are fitting it in 1 -d so we have to use the numpy or we can also use iloc function by using all the numerical columns 


<p> using numpy array

In [None]:
imputer.fit(np.array(data[["Salary"]]))
# will throw error 

<p> Using iloc function 

In [65]:
imputer2=SimpleImputer(missing_values=np.nan,strategy="median")


In [None]:
x=data.iloc[:,1:3]

Unnamed: 0,Age,Salary
0,44.0,72000.0
1,27.0,48000.0
2,30.0,54000.0
3,38.0,61000.0
4,40.0,63777.777778
5,35.0,58000.0
6,,52000.0
7,48.0,79000.0
8,50.0,83000.0
9,37.0,67000.0


In [39]:
imputer2.fit(x)


<p> Handling categorical null values 

In [102]:
dataa=pd.read_csv("Data (1).csv")
dataa

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,Yes
1,Spain,27.0,48000.0,Yes
2,,30.0,54000.0,
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


In [104]:
imputer3=SimpleImputer(missing_values=np.nan,strategy="most_frequent")

imputer3.fit(dataa)
dataa=imputer3.transform(dataa)
dataa

array([['France', 44.0, 72000.0, 'Yes'],
       ['Spain', 27.0, 48000.0, 'Yes'],
       ['France', 30.0, 54000.0, 'Yes'],
       ['Spain', 38.0, 61000.0, 'No'],
       ['Germany', 40.0, 48000.0, 'Yes'],
       ['France', 35.0, 58000.0, 'Yes'],
       ['Spain', 27.0, 52000.0, 'No'],
       ['France', 48.0, 79000.0, 'Yes'],
       ['Germany', 50.0, 83000.0, 'No'],
       ['France', 37.0, 67000.0, 'Yes']], dtype=object)

In [106]:
data

array([['France'],
       ['Spain'],
       ['France'],
       ['Spain'],
       ['Germany'],
       ['France'],
       ['Spain'],
       ['France'],
       ['Germany'],
       ['France']], dtype=object)

In [None]:
imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

imputer.fit(dataa.iloc[:, :].values)

dataa.iloc[:, :] = imputer.transform(dataa.iloc[:, :].values)

AttributeError: 'numpy.ndarray' object has no attribute 'iloc'