# PANDAS 🐼🐼

## Data Preprocessing in Machine Learning using Pandas

Data preprocessing is the set of tasks required to clean data and make it suitable for a machine learning model. Done well, it can also improve the model's accuracy and efficiency.

Pandas is an open-source Python package that is widely used for data science, data analysis, and machine learning tasks. It is built on top of another package, NumPy, which provides support for multi-dimensional arrays. Preprocessing is the process of doing a pre-analysis of data in order to transform it into a standard and normalized format.

**Preprocessing involves the following aspects:**
- missing values
- data standardization
- data normalization
- data mining
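
As a rough illustration of the first three items, the sketch below fills a missing value, standardizes one column to zero mean and unit variance, and min-max normalizes another column to the range [0, 1]. The column names and values are made up for this example, and the formulas shown are only one common choice for each step.

```python
import pandas as pd

# Hypothetical toy data; None marks a missing value.
raw = pd.DataFrame({
    "height_cm": [150.0, 160.0, None, 180.0],
    "weight_kg": [50.0, 60.0, 70.0, 80.0],
})

# Missing values: fill each gap with its column mean.
filled = raw.fillna(raw.mean())

# Standardization: subtract the mean, divide by the standard deviation.
# Series.std() uses the sample standard deviation (ddof=1) by default.
h = filled["height_cm"]
standardized = (h - h.mean()) / h.std()

# Normalization (min-max): rescale values to the range [0, 1].
w = filled["weight_kg"]
normalized = (w - w.min()) / (w.max() - w.min())

print(filled)
print(standardized.round(3))
print(normalized.round(3))
```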





In [1]:
import pandas as pd

In [2]:
data={
    "cars": ["bmw", "honda", "suzuki", "byd"],
    "models": ["2007", "2008" ,"2015", "2024"],
    "passenger": ["4", "6", "6", "5"]
}
print(data)

{'cars': ['bmw', 'honda', 'suzuki', 'byd'], 'models': ['2007', '2008', '2015', '2024'], 'passenger': ['4', '6', '6', '5']}


In [3]:
# Convert the data dictionary into a DataFrame
df= pd.DataFrame(data)
df

,cars,models,passenger
0,bmw,2007,4
1,honda,2008,6
2,suzuki,2015,6
3,byd,2024,5


In [4]:
df.to_csv("Dataset/car_data.csv")

In [5]:
pd.read_csv("Dataset/car_data.csv")

,Unnamed: 0,cars,models,passenger
0,0,bmw,2007,4
1,1,honda,2008,6
2,2,suzuki,2015,6
3,3,byd,2024,5
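
The extra `Unnamed: 0` column in the output above appears because `to_csv` writes the index as an unnamed first column, which `read_csv` then reads back as ordinary data. A minimal sketch of two ways to avoid it follows. It uses an in-memory buffer instead of a file on disk (an assumption made so the example runs anywhere); the same arguments apply when a file path is passed.

```python
import io
import pandas as pd

cars = pd.DataFrame({
    "cars": ["bmw", "honda", "suzuki", "byd"],
    "models": ["2007", "2008", "2015", "2024"],
    "passenger": ["4", "6", "6", "5"],
})

# Option 1: do not write the index at all.
csv_without_index = cars.to_csv(index=False)
round_trip_1 = pd.read_csv(io.StringIO(csv_without_index))

# Option 2: write the index, then tell read_csv that column 0 is the index.
csv_with_index = cars.to_csv()
round_trip_2 = pd.read_csv(io.StringIO(csv_with_index), index_col=0)

print(list(round_trip_1.columns))
print(list(round_trip_2.columns))
```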


In [6]:
# Create a DataFrame with two features: calories and duration.
burn={
    "calories":["420", "500", "390"],
    "duration":["50", "60", "45"]
}
df=pd.DataFrame(burn)
df

,calories,duration
0,420,50
1,500,60
2,390,45


In [7]:
# Select a specific row by its index label
print(df.loc[1])

calories    500
duration     60
Name: 1, dtype: object
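
The `dtype: object` in the output above reflects that the values were entered as strings, so arithmetic such as a mean would not work directly. If numeric operations are needed, one option is converting the columns with `pd.to_numeric`. A short sketch, reusing the same values:

```python
import pandas as pd

burn = pd.DataFrame({
    "calories": ["420", "500", "390"],
    "duration": ["50", "60", "45"],
})

# Before conversion the columns hold strings, not numbers.
print(burn.dtypes)

# Convert every column to a numeric dtype. Passing errors="coerce" to
# pd.to_numeric would turn unparseable entries into NaN instead of raising.
numeric = burn.apply(pd.to_numeric)

print(numeric.dtypes)
print(numeric["duration"].sum())
```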


In [8]:
# Select multiple rows by passing a list of labels
print(df.loc[[0,1]])

  calories duration
0      420       50
1      500       60


In [9]:
# Create a DataFrame with a custom (named) index
data={
    "calories":["420", "500", "390"],
    "duration":["50", "60", "45"]
}
frame=pd.DataFrame(data, index=["day1","day2","day3"])
frame


,calories,duration
day1,420,50
day2,500,60
day3,390,45


In [10]:
print(frame.loc[["day1","day2"]])

     calories duration
day1      420       50
day2      500       60
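
`loc` selects by index label, which is why the named index works above. For selection by integer position there is the related `iloc` accessor. The sketch below contrasts the two on the same frame, with the values stored as integers here for simplicity:

```python
import pandas as pd

frame = pd.DataFrame(
    {"calories": [420, 500, 390], "duration": [50, 60, 45]},
    index=["day1", "day2", "day3"],
)

# Label-based: the row whose index label is "day2".
by_label = frame.loc["day2"]

# Position-based: the second row, whatever its label is.
by_position = frame.iloc[1]

# A single cell: row label plus column name.
cell = frame.loc["day3", "calories"]

print(by_label.equals(by_position))
print(cell)
```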


# Data Cleaning with Pandas

**Bad data can be:**
- empty cells
- data in the wrong format
- wrong data
- duplicates
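
Each of these four problems can usually be detected before it is fixed. The sketch below builds a small, hypothetical workout log containing one problem of each kind and flags them. The 120-minute limit and the expected `YYYY/MM/DD` date pattern are assumptions chosen for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical data with an empty cell, badly formatted dates,
# an implausible duration, and a duplicated row.
log = pd.DataFrame({
    "Duration": [60, 60, 450, 45, 45],
    "Date": ["2020/12/01", "2020/12/02", "2020/12/03", "20201204", "20201204"],
    "Calories": [409.1, np.nan, 253.3, 282.4, 282.4],
})

# Empty cells: count missing values per column.
missing_per_column = log.isna().sum()

# Wrong format: dates that do not match YYYY/MM/DD.
bad_format = ~log["Date"].str.match(r"^\d{4}/\d{2}/\d{2}$")

# Wrong data: durations above the assumed limit of 120 minutes.
implausible = log["Duration"] > 120

# Duplicates: rows identical to an earlier row.
duplicates = log.duplicated()

print(missing_per_column)
print(log[bad_format | implausible | duplicates])
```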

In [11]:
dataset=pd.read_csv("Dataset/dataset(1).csv")
dataset

,Duration,Date,Pulse,Maxpulse,Calories
0,60,2020/12/01',110,130,409.1
1,60,2020/12/02',117,145,479.0
2,60,2020/12/03',103,135,340.0
3,45,2020/12/04',109,175,282.4
4,45,2020/12/05',117,148,406.0
5,60,2020/12/06',102,127,300.0
6,60,2020/12/07',110,136,374.0
7,450,2020/12/08',104,134,253.3
8,30,2020/12/09',109,133,195.1
9,60,2020/12/10',98,124,269.0


In [12]:
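# Cleaning Date: strip the stray trailing quote from each value
# Note: plain assignment makes dataset0 another name for the same DataFrame
# (not a copy), so this change is also visible through `dataset`.
dataset0= dataset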
dataset0['Date']=dataset0['Date'].str.replace("'","")
dataset0.head(5)

,Duration,Date,Pulse,Maxpulse,Calories
0,60,2020/12/01,110,130,409.1
1,60,2020/12/02,117,145,479.0
2,60,2020/12/03,103,135,340.0
3,45,2020/12/04,109,175,282.4
4,45,2020/12/05,117,148,406.0
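
After the stray quote is removed, the dates are still plain strings. A possible next step, sketched here on a few of the same values, is parsing them into real datetimes with `pd.to_datetime`, which makes date arithmetic and accessors such as `.dt.day_name()` available:

```python
import pandas as pd

dates = pd.Series(["2020/12/01'", "2020/12/02'", "2020/12/03'"], name="Date")

# Remove the quote as a literal character (regex=False), then parse
# with an explicit format so unexpected layouts raise an error.
cleaned = dates.str.replace("'", "", regex=False)
parsed = pd.to_datetime(cleaned, format="%Y/%m/%d")

print(parsed.dtype)
print(parsed.dt.day_name().tolist())
```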


In [13]:
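# Handling empty values: fillna returns a new DataFrame with empty cells
# replaced by 130; dataset1 itself is not modified.
dataset1=dataset
dataset1.fillna(130)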

,Duration,Date,Pulse,Maxpulse,Calories
0,60,2020/12/01,110,130,409.1
1,60,2020/12/02,117,145,479.0
2,60,2020/12/03,103,135,340.0
3,45,2020/12/04,109,175,282.4
4,45,2020/12/05,117,148,406.0
5,60,2020/12/06,102,127,300.0
6,60,2020/12/07,110,136,374.0
7,450,2020/12/08,104,134,253.3
8,30,2020/12/09,109,133,195.1
9,60,2020/12/10,98,124,269.0
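
The rows displayed above contain no empty cells, so `fillna` has no visible effect on them. The sketch below uses a small made-up frame with one missing value to show what the call does, and that the result has to be assigned to be kept:

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "Pulse": [110, 117, 103],
    "Calories": [409.1, np.nan, 340.0],
})

# fillna returns a new DataFrame; `sample` is left untouched.
filled = sample.fillna(130)

print(sample["Calories"].isna().sum())
print(filled["Calories"].tolist())

# To keep the result, assign it back to a name.
sample = sample.fillna(130)
```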


In [14]:
# Fill empty values in one specific column only
dataset2=dataset
dataset2['Calories']=dataset2['Calories'].fillna(120)
dataset2

,Duration,Date,Pulse,Maxpulse,Calories
0,60,2020/12/01,110,130,409.1
1,60,2020/12/02,117,145,479.0
2,60,2020/12/03,103,135,340.0
3,45,2020/12/04,109,175,282.4
4,45,2020/12/05,117,148,406.0
5,60,2020/12/06,102,127,300.0
6,60,2020/12/07,110,136,374.0
7,450,2020/12/08,104,134,253.3
8,30,2020/12/09,109,133,195.1
9,60,2020/12/10,98,124,269.0
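
Note that `dataset0`, `dataset1`, and `dataset2` above are all names for the same DataFrame, because plain assignment does not copy it. If each step should start from untouched data, `DataFrame.copy()` is one way to get an independent object. A small sketch with made-up values:

```python
import pandas as pd

original = pd.DataFrame({"Calories": [409.1, None, 340.0]})

alias = original               # same object under a second name
independent = original.copy()  # separate object with its own data

# Modifying through the alias also changes `original`.
alias["Calories"] = alias["Calories"].fillna(120)

print(alias is original)
print(original["Calories"].tolist())
print(independent["Calories"].isna().sum())
```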


In [15]:
# Fill null values with the mean (median() or mode()[0] can be used the same way)
dataset3= dataset
x=dataset3["Calories"].mean()
dataset3['Calories']=dataset3['Calories'].fillna(x)
dataset3

,Duration,Date,Pulse,Maxpulse,Calories
0,60,2020/12/01,110,130,409.1
1,60,2020/12/02,117,145,479.0
2,60,2020/12/03,103,135,340.0
3,45,2020/12/04,109,175,282.4
4,45,2020/12/05,117,148,406.0
5,60,2020/12/06,102,127,300.0
6,60,2020/12/07,110,136,374.0
7,450,2020/12/08,104,134,253.3
8,30,2020/12/09,109,133,195.1
9,60,2020/12/10,98,124,269.0
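
The cell above uses the mean; the median and mode follow the same pattern. The sketch below compares all three on a small made-up series with one missing value. Which statistic is appropriate depends on the data; for example, the median is less affected by outliers than the mean.

```python
import numpy as np
import pandas as pd

calories = pd.Series([300.0, 300.0, np.nan, 480.0], name="Calories")

mean_value = calories.mean()      # NaN is skipped by default
median_value = calories.median()
mode_value = calories.mode()[0]   # mode() returns a Series; take the first entry

filled_mean = calories.fillna(mean_value)
filled_median = calories.fillna(median_value)
filled_mode = calories.fillna(mode_value)

print(filled_mean.tolist())
print(filled_median.tolist())
print(filled_mode.tolist())
```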
