---
- # **Data Preprocessing**
  - ## **Introduction to Data Preprocessing** 
    - ### **Data Cleaning (Coding Examples Included)**
        - ### **What is Data Preprocessing?**
          - ![Data_Cleaning](./Images/1.png)
          - **Data preprocessing is a crucial step in data mining that involves transforming raw data into a useful and efficient format. It prepares data for further analysis by addressing inconsistencies and making sure the data is clean and organized.**
          - ![Data_Cleaning](./Images/2.png)
        - #### **Data Cleaning**
          - **Data Cleaning:** The process of removing or correcting inaccurate, incomplete, or irrelevant data. It involves identifying and handling missing values, outliers, and errors in the data.
          - ![Data_Cleaning](./Images/3.png)
          - **Noisy Data:** Data that contains random errors or outliers, which can lead to inaccurate models if not handled properly.
            - ![Noisy_Data](./Images/4.png)
  
          - **Binning:** A technique for smoothing noisy data by grouping sorted values into bins (ranges) and replacing each value with a representative of its bin, such as the bin mean.
          - ![Binning](./Images/5.png)
          - ![Binning ](./Images/6.png)
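To make binning concrete, here is a minimal sketch of smoothing by bin means. The sample values and the choice of three equal-frequency bins are assumptions made for illustration; they do not come from the figures above.

```python
import pandas as pd

# Hypothetical noisy measurements, already sorted (illustrative data only)
values = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-frequency binning: pd.qcut assigns each value to one of 3 bins
# so that every bin holds (roughly) the same number of values
bin_ids = pd.qcut(values, q=3, labels=False)

# Smoothing by bin means: replace each value with the mean of its bin
smoothed = values.groupby(bin_ids).transform("mean")

print(pd.DataFrame({"value": values, "bin": bin_ids, "smoothed": smoothed}))
```

Replacing values with the bin median or the nearest bin boundary follows the same pattern; only the aggregation passed to `transform` changes.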
- **Handling Missing Values: Data Cleaning Code**
  - Missing values in the data can be handled using various strategies such as deletion, mean/mode/median imputation, or using algorithms that can handle missing data.
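As a rough sketch of the imputation strategies mentioned above, the snippet below fills each column's gaps with that column's mean or median. It uses the same small DataFrame as the cells that follow; the variable names `mean_filled` and `median_filled` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Marks": [100, 90, np.nan, 95],
                   "Rank": [1, 2, 4, np.nan],
                   "Avg Marks": [np.nan, 40, 80, 98]})

# Mean imputation: df.mean() skips NaN, and fillna matches values by column name
mean_filled = df.fillna(df.mean())

# Median imputation: less sensitive to outliers than the mean
median_filled = df.fillna(df.median())

print(mean_filled)
print(median_filled)
```

Mode imputation is usually a better fit for categorical columns, where a mean or median is not defined.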


In [1]:
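import numpy as np
import pandas as pd

# Small example table with one missing value (np.nan) in each column.
# The name `data` avoids shadowing Python's built-in `dict`.
data = {"Marks": [100, 90, np.nan, 95],
        "Rank": [1, 2, 4, np.nan],
        "Avg Marks": [np.nan, 40, 80, 98]}
df = pd.DataFrame(data)
df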

|   | Marks | Rank | Avg Marks |
|---|-------|------|-----------|
| 0 | 100.0 | 1.0  | NaN       |
| 1 | 90.0  | 2.0  | 40.0      |
| 2 | NaN   | 4.0  | 80.0      |
| 3 | 95.0  | NaN  | 98.0      |


In [2]:
df.isnull()

|   | Marks | Rank  | Avg Marks |
|---|-------|-------|-----------|
| 0 | False | False | True      |
| 1 | False | False | False     |
| 2 | True  | False | False     |
| 3 | False | True  | False     |
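The boolean mask is hard to read on larger tables, so it is common to aggregate it. A minimal sketch, rebuilding the same DataFrame so the snippet runs on its own (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Marks": [100, 90, np.nan, 95],
                   "Rank": [1, 2, 4, np.nan],
                   "Avg Marks": [np.nan, 40, 80, 98]})

# True counts as 1, so summing the mask gives the number of NaN per column
missing_per_column = df.isnull().sum()

# Number of rows that contain at least one NaN
rows_with_missing = int(df.isnull().any(axis=1).sum())

print(missing_per_column)   # each column has exactly 1 missing value
print(rows_with_missing)    # 3
```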


In [3]:
df.fillna(0)  # Replace every missing value with 0

|   | Marks | Rank | Avg Marks |
|---|-------|------|-----------|
| 0 | 100.0 | 1.0  | 0.0       |
| 1 | 90.0  | 2.0  | 40.0      |
| 2 | 0.0   | 4.0  | 80.0      |
| 3 | 95.0  | 0.0  | 98.0      |


In [4]:
# Backward fill: each NaN takes the next valid value below it in the same column.
# The NaN in the last row of Rank has nothing below it, so it stays NaN.
# (df.bfill() replaces the deprecated df.fillna(method='bfill').)
df.bfill()

|   | Marks | Rank | Avg Marks |
|---|-------|------|-----------|
| 0 | 100.0 | 1.0  | 40.0      |
| 1 | 90.0  | 2.0  | 40.0      |
| 2 | 95.0  | 4.0  | 80.0      |
| 3 | 95.0  | NaN  | 98.0      |


In [5]:
# Forward fill: each NaN takes the last valid value above it in the same column.
# The NaN in the first row of Avg Marks has nothing above it, so it stays NaN.
# (df.ffill() replaces the deprecated df.fillna(method='pad').)
df.ffill()

|   | Marks | Rank | Avg Marks |
|---|-------|------|-----------|
| 0 | 100.0 | 1.0  | NaN       |
| 1 | 90.0  | 2.0  | 40.0      |
| 2 | 90.0  | 4.0  | 80.0      |
| 3 | 95.0  | 4.0  | 98.0      |


In [6]:
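# Linear interpolation: a NaN between two valid values becomes a point on the
# straight line joining them, e.g. Marks in row 2 is (90 + 95) / 2 = 92.5.
# With limit_direction='forward', a leading NaN (Avg Marks, row 0) is left as is.
df.interpolate(method='linear', limit_direction='forward')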

|   | Marks | Rank | Avg Marks |
|---|-------|------|-----------|
| 0 | 100.0 | 1.0  | NaN       |
| 1 | 90.0  | 2.0  | 40.0      |
| 2 | 92.5  | 4.0  | 80.0      |
| 3 | 95.0  | 4.0  | 98.0      |
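The output above still has a NaN in the first row of Avg Marks because filling only runs forward. As a hedged sketch, setting `limit_direction='both'` also fills leading gaps; with the linear method these edge gaps take the nearest valid value rather than an extrapolated one. The variable name `both_ways` is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Marks": [100, 90, np.nan, 95],
                   "Rank": [1, 2, 4, np.nan],
                   "Avg Marks": [np.nan, 40, 80, 98]})

# Fill interior gaps linearly and edge gaps in both directions
both_ways = df.interpolate(method="linear", limit_direction="both")

print(both_ways)
```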


In [7]:
df.dropna()  # Drop every row that contains at least one NaN; only row 1 is complete

|   | Marks | Rank | Avg Marks |
|---|-------|------|-----------|
| 1 | 90.0  | 2.0  | 40.0      |
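Dropping every incomplete row discards three of the four rows here, which is often too aggressive. A minimal sketch of two less destructive `dropna` options, again on the same DataFrame (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Marks": [100, 90, np.nan, 95],
                   "Rank": [1, 2, 4, np.nan],
                   "Avg Marks": [np.nan, 40, 80, 98]})

# Only drop rows where a specific column is missing
marks_present = df.dropna(subset=["Marks"])

# Only drop rows where every column is missing (none qualify here)
not_all_missing = df.dropna(how="all")

print(marks_present)
print(not_all_missing)
```

Here `subset` keeps rows 0, 1, and 3, while `how="all"` keeps all four rows because no row is entirely empty.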
