<center> 
    <h2> <b> Exploratory Data Analysis(EDA) Using Python.</b> </h2>
</center> 

<hr />

Exploratory Data Analysis or (EDA) is understanding the data sets by summarizing their main characteristics often plotting them visually.

This step is very important especially when we arrive at modeling the data in order to apply Machine learning.

Plotting in EDA consists of Histograms, Box plot, Scatter plot and many more. It often takes much time to explore the data.

Through the process of EDA, we can ask to define the problem statement or definition on our data set which is very important.
<hr />

## How to perform Exploratory Data Analysis?

This is one such question that everyone is keen on knowing the answer. Well, the answer is it depends on the data set that you are working. 

There is no one method or common methods in order to perform EDA, whereas in this guide you can understand some common methods and plots that would be used in the EDA process.
<hr />

## What data are we exploring today?

Since I am a huge fan of cars, I got a very beautiful data-set of cars from Kaggle. The data-set can be downloaded from [here](https://www.kaggle.com/CooperUnion/cardataset). 

To give a piece of brief information about the data set this data contains more of 10, 000 rows and more than 10 columns which contains features of the car such as Engine Fuel Type, Engine Size, HP, Transmission Type, highway MPG, city MPG and many more.

So in this guide, we will explore the data and make it ready for modeling. 
<hr />

<center>
<h3>
    <b>
        Exploratory Data Analysis Process.
    </b>
</h3>
</center>

### 1). Importing the required libraries for EDA

Below are the libraries that are used in order to perform EDA (Exploratory data analysis) in this guide. The complete code can be found on my GitHub.
[Link to the Source Code](url)

In [6]:
"""Importing required libraries."""
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt 
%matplotlib inline 
sns.set(color_codes=True)

### 2). Loading the data into the data frame.

Loading the data into the pandas data frame is certainly one of the most important steps in EDA, as we can see that the value from the data set is comma-separated.

So all we have to do is to just read the CSV into a data frame and pandas data frame does the job for us.

If you are using google colab her a simple step to read your dataset, <br>
In <b> Google Colab </b> at the left-hand side of the notebook, you will find a <b> “>” </b> (greater than symbol). 

<br> When you click that you will find a tab with three options, you just have to select Files.

<br> Then you can easily upload your file with the help of the Upload option. No need to mount to the google drive or use any specific libraries just upload the data set and your job is done.

<br> 
<b> One thing to remember in this step is that uploaded files will get deleted when this runtime is recycled.
</b>

In [10]:
df = pd.read_csv("CovidData.csv")

In [11]:
"""To display the top 5 rows"""
df.head(5)

Unnamed: 0,SNo,ObservationDate,Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered
0,1,01/22/2020,Anhui,Mainland China,1/22/2020 17:00,1.0,0.0,0.0
1,2,01/22/2020,Beijing,Mainland China,1/22/2020 17:00,14.0,0.0,0.0
2,3,01/22/2020,Chongqing,Mainland China,1/22/2020 17:00,6.0,0.0,0.0
3,4,01/22/2020,Fujian,Mainland China,1/22/2020 17:00,1.0,0.0,0.0
4,5,01/22/2020,Gansu,Mainland China,1/22/2020 17:00,0.0,0.0,0.0


In [12]:
"""To display the bottom 5 rows"""
df.tail(5) 

Unnamed: 0,SNo,ObservationDate,Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered
205946,205947,01/19/2021,Zaporizhia Oblast,Ukraine,2021-01-20 05:21:54,62492.0,738.0,39168.0
205947,205948,01/19/2021,Zeeland,Netherlands,2021-01-20 05:21:54,13031.0,149.0,0.0
205948,205949,01/19/2021,Zhejiang,Mainland China,2021-01-20 05:21:54,1316.0,1.0,1298.0
205949,205950,01/19/2021,Zhytomyr Oblast,Ukraine,2021-01-20 05:21:54,42758.0,707.0,37834.0
205950,205951,01/19/2021,Zuid-Holland,Netherlands,2021-01-20 05:21:54,224398.0,3153.0,0.0


### 3). Checking the types of data

Here we check for the datatypes because sometimes the MSRP or the price of the car would be stored as a string or object, if in that case, we have to convert that string to the integer data only then we can plot the data via a graph. 

Here, in this case, the data is already in integer format so nothing to worry.

In [15]:
"""Checking the data type"""
df.dtypes

SNo                  int64
ObservationDate     object
Province/State      object
Country/Region      object
Last Update         object
Confirmed          float64
Deaths             float64
Recovered          float64
dtype: object

### 4). Dropping irrelevant columns 

This step is certainly needed in every EDA because sometimes there would be many columns that we never use in such cases dropping is the only solution. 

In this case, the columns such as Engine Fuel Type, Market Category, Vehicle style, Popularity, Number of doors, Vehicle Size doesn't make any sense to me so I just dropped for this instance.

In [17]:
# Dropping irrelevant columns
df = df.drop([‘Engine Fuel Type’, ‘Market Category’, ‘Vehicle Style’, ‘Popularity’, ‘Number of Doors’, ‘Vehicle Size’], axis=1)
df.head(5)

SyntaxError: invalid character in identifier (<ipython-input-17-06dbd36c4729>, line 2)

### 5. Renaming the columns 

In this instance, most of the column names are very confusing to read, so I just tweaked their column names. This is a good approach it improves the readability of the data set.

In [19]:
# Renaming the column names
df = df.rename(columns={“Engine HP”: “HP”, “Engine Cylinders”: “Cylinders”, “Transmission Type”: “Transmission”, “Driven_Wheels”: “Drive Mode”,”highway MPG”: “MPG-H”, “city mpg”: “MPG-C”, “MSRP”: “Price” })
df.head(5)

SyntaxError: invalid character in identifier (<ipython-input-19-66d0da2e1637>, line 2)

### 6. Dropping the duplicate rows

This is often a handy thing to do because a huge data set as in this case contains more than 10, 000 rows often have some duplicate data which might be disturbing, so here I remove all the duplicate value from the data-set. 

For example prior to removing I had 11914 rows of data but after removing the duplicates 10925 data meaning that I had 989 of duplicate data.

In [21]:
"""Total number of rows and columns"""
df.shape
(11914, 10)

"""Rows containing duplicate data"""
duplicate_rows_df = df[df.duplicated()]
print(“number of duplicate rows:", duplicate_rows_df.shape)
number of duplicate rows:  (989, 10)

SyntaxError: invalid character in identifier (<ipython-input-21-6325787e99ce>, line 7)

Now let us remove the duplicate data because it's ok to remove them.

In [23]:
# Used to count the number of rows before removing the data
df.count() 

SNo                205951
ObservationDate    205951
Province/State     150574
Country/Region     205951
Last Update        205951
Confirmed          205951
Deaths             205951
Recovered          205951
dtype: int64