# <center><u> Exploratory Data Analysis</u> </center>

## 1. Importing the necessary libraries

Import the libraries that will be used in this notebook.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns #visualisation
import matplotlib.pyplot as plt #visualisation
from scipy import stats
import warnings
warnings.filterwarnings("ignore")

%matplotlib inline 
sns.set(color_codes=True)

## 2. Download the dataset and load into dataframe

Please download the dataset from [here](https://www.kaggle.com/CooperUnion/cardataset) and extract the csv file. Load the csv file as pandas dataframe. 

In [None]:
## load the data; pandas can read a zip archive directly when it contains a single csv file
df = pd.read_csv('7A-archive.zip')

In [None]:
## print the head of the dataframe
df.head()

## 3. Check the datatypes

In [None]:
# Get the datatype of each column and the number of non-null records in each column.
df.info()

## 4. Dropping irrelevant columns

Reference:-https://www.youtube.com/watch?v=cRurWEfmxC0


Not all the columns in the given dataframe are important to us, so we will drop the columns that are irrelevant. <br>

The list cols_to_drop contains the names of the columns that are irrelevant. Drop all of these columns from the dataframe.

`cols_to_drop = ["Engine Fuel Type", "Market Category", "Vehicle Style", "Popularity", "Number of Doors", "Vehicle Size"]`

In [None]:
# initialise cols_to_drop
cols_to_drop = ["Engine Fuel Type","Market Category","Vehicle Style","Popularity","Number of Doors","Vehicle Size"]


In [None]:
# drop the irrelevant cols and print the head of the dataframe
df.drop(cols_to_drop,axis=1,inplace=True)
df.head()

## 5.  Renaming the columns

We have already dropped the unnecessary columns, and now we are left with the useful ones. One extra thing we will do is rename the columns so that each name clearly represents the essence of the column.

The given dict represents (in key value pair) the previous name, and the new name for the dataframe columns

In [None]:
rename_cols = {"Engine HP":"HP","Engine Cylinders":"Cylinders","Transmission Type":"Transmission",
               "Driven_Wheels":"Drive Mode","highway MPG":"MPG_H","city mpg":"MPG-C","MSRP":"Price" }


In [None]:
# use a pandas function to rename the current columns with names in rename_cols 
df = df.rename(columns=rename_cols)


In [None]:
# Print the head of the dataframe
df.head()


## 6. Dropping the duplicate rows

Reference: https://www.youtube.com/watch?v=bFVMR1qfzXo


There are many rows in the dataframe which are duplicates, so they just repeat information. It's better to remove these rows as they don't add any value to the dataframe. 

For the given data, we would like to see how many rows were duplicates. For this, we will count the number of rows, remove the duplicated rows, and count the number of rows again.

In [None]:
# non-null count per column before removing duplicated rows (df.shape[0] gives the row count itself)
df.count()


In [None]:
# drop the duplicated rows and print its head
df = df.drop_duplicates()
df.head()

In [None]:
# non-null count per column after deleting duplicated rows (df.shape[0] gives the row count itself)
df.count()
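As a small illustration of the before/after counting described above, the sketch below counts duplicates directly with `duplicated()`. The `toy` frame and its values are made up for the example and are not part of the car dataset.

```python
import pandas as pd

# Toy data (hypothetical values) containing one exact duplicate row
toy = pd.DataFrame({
    "Make": ["BMW", "BMW", "Audi", "Audi"],
    "HP": [300, 300, 220, 250],
})

n_before = len(toy)                    # number of rows before cleaning
n_dupes = int(toy.duplicated().sum())  # rows that repeat an earlier row
deduped = toy.drop_duplicates()        # keeps the first occurrence of each row

print(n_before, n_dupes, len(deduped))
```

The difference between the row counts before and after `drop_duplicates()` should equal the number reported by `duplicated().sum()`.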

## 7. Dropping the null or missing values 

There are many values which are missing; in a pandas dataframe these values are referred to as np.nan. We want to deal with these values because we can't use nan values to train models. We can either remove them or apply some strategy to replace them with other values.

To keep things simple we will be dropping nan values.

In [None]:
# check for total nan values in each columns
df.isnull().sum()

In [None]:
# drop missing values from df
df.dropna(inplace=True)

In [None]:
# Make sure that missing values are removed
# check total nan values in each col again
df.isnull().sum()
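The notebook drops the missing values, but the text also mentions replacing them as an alternative. A minimal sketch of one such strategy, median imputation, is shown here on made-up data; whether imputation suits the car dataset is an assumption you would need to check.

```python
import numpy as np
import pandas as pd

# Toy numeric data (hypothetical values) with one missing entry per column
toy = pd.DataFrame({
    "HP": [100.0, np.nan, 300.0, 200.0],
    "Cylinders": [4.0, 4.0, np.nan, 6.0],
})

# Replace each missing value with the median of its own column
imputed = toy.fillna(toy.median())

print(imputed)
print(imputed.isnull().sum())
```

Unlike `dropna`, this keeps every row, at the cost of inserting values that were never observed.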

In [None]:
#Describe statistics of df
df.describe()

## 8. Removing outliers


Reference: https://www.youtube.com/watch?v=yxTRB0boTVg

Sometimes a dataset can contain extreme values that are outside the expected range and unlike the rest of the data. These are called outliers, and machine learning modeling and model skill in general can often be improved by understanding and even removing these outlier values.

### Detecting outliers

There are many techniques to detect outliers. Let us first see the simplest way of visualizing outliers.

Box plots are a graphical depiction of numerical data through their quantiles. They are a very simple but effective way to visualize outliers. Think of the lower and upper whiskers as the boundaries of the data distribution. Any data points that show above or below the whiskers can be considered outliers or anomalous.

In [None]:
## Plot a boxplot for 'Price' column in dataset. 
plt.boxplot(df['Price'])
plt.show()

In [None]:
df.head(2)

In [None]:
## Plot a boxplot for 'HP' column in dataset
sns.boxplot(x=df['HP'])

plt.show()

In [None]:
## Plot a boxplot for 'Cylinders' column in dataset
sns.boxplot(x=df['Cylinders'])

plt.show()

Print all the columns which are of int or float datatype in df. 

Hint: Use loc with condition

In [None]:
#code here
df.loc[:,df.dtypes != object]


Save the column names of the above output in variable list named 'l'


In [None]:
#code here

l = list(df.loc[:,df.dtypes != 'object'].columns)
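As an alternative to comparing `dtypes` against `object`, pandas also offers `select_dtypes`. The sketch below uses a made-up three-column frame to show the idea; the name `numeric_cols` is only illustrative.

```python
import pandas as pd

# Toy frame (hypothetical values) mixing text and numeric columns
toy = pd.DataFrame({
    "Make": ["BMW", "Audi"],
    "Year": [2011, 2015],
    "HP": [300.0, 220.0],
})

# Keep only integer and float columns, then collect their names
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

print(numeric_cols)
```

This selects by what the column is (numeric) rather than by what it is not (object), which stays correct if the frame later gains boolean or datetime columns.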

## Outlier removal techniques

Reference: https://www.youtube.com/watch?v=A3gClkblXK8

1. **Using IQR Technique**
 

**Here comes a cool fact for you!**

IQR is the first quartile subtracted from the third quartile; these quartiles can be clearly seen on a box plot on the data.

The anatomy of boxplot is given below.

![image.png](attachment:image.png)

- Calculate the IQR and choose a suitable threshold to remove the outliers, then save the new dataframe into df1.

Let us help you decide on the threshold: outliers in this case are defined as the observations that fall below (Q1 − 1.5 x IQR) or above (Q3 + 1.5 x IQR).

In [None]:
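## Your code here
# quantiles are only defined for the numeric columns stored in l
q1 = df[l].quantile(0.25)
q3 = df[l].quantile(0.75)
iqr = q3 - q1

# keep the rows where no numeric column falls outside the 1.5 x IQR fences
df1 = df[~((df[l]<(q1-1.5*iqr)) | (df[l]>(q3+1.5*iqr))).any(axis=1)]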
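To make the IQR rule concrete, here is a small worked sketch on a single made-up series. The values are hypothetical and chosen so the arithmetic is easy to follow by hand.

```python
import pandas as pd

# Toy prices (hypothetical values); 100 is the obvious extreme value
prices = pd.Series([10, 12, 13, 15, 16, 18, 100])

q1 = prices.quantile(0.25)   # first quartile
q3 = prices.quantile(0.75)   # third quartile
iqr = q3 - q1                # interquartile range

lower = q1 - 1.5 * iqr       # lower fence
upper = q3 + 1.5 * iqr       # upper fence

# Keep only the observations inside the fences
kept = prices[(prices >= lower) & (prices <= upper)]

print(q1, q3, iqr, lower, upper)
print(kept.tolist())
```

Here Q1 is 12.5 and Q3 is 17.0, so the fences sit at 5.75 and 23.75 and only the value 100 is dropped.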


2. **Outlier removal using Z-score function**

Reference:- https://www.youtube.com/watch?v=rfmfXa0kdrQ
  - The intuition behind the Z-score is to describe any data point by its relationship with the standard deviation and mean of the group of data points.

We will use the Z-score function defined in the scipy library to detect the outliers in dataframe df, using the columns stored in variable 'l'.


In [None]:
l

In [None]:
# calculate absolute zscore values for each column present in l, save it in variable z. Print z
z = np.abs(stats.zscore(df[l]))
print(z)


Hey buddy! Do you understand the above output? Difficult, right? Let's try to define a threshold to identify an outlier so that we get a clear picture of what's going on.

We will not spare you without a good fact! ;)

`In most cases a threshold of 3 or -3 is used`, i.e. `if the Z-score value is greater than 3 or less than -3, that data point will be identified as an outlier`.

In [None]:
# keep the rows whose absolute z-scores are all below the threshold and save this dataframe as df2
threshold = 3
df2 = df[(z<threshold).all(axis=1)]
df2
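The Z-score itself is just (value − mean) / standard deviation. The sketch below computes it by hand with NumPy on made-up data, using the population standard deviation (`ddof=0`), which is also the default of `scipy.stats.zscore`.

```python
import numpy as np

# Toy data (hypothetical values): nineteen typical points and one extreme point
values = np.array([10.0] * 19 + [100.0])

# Z-score: distance from the mean measured in standard deviations
z = (values - values.mean()) / values.std()   # np.std defaults to ddof=0

threshold = 3
is_outlier = np.abs(z) > threshold
cleaned = values[~is_outlier]

print(z.round(3))
print(int(is_outlier.sum()), cleaned.max())
```

Note that in very small samples no point can reach a Z-score of 3, because the extreme value itself inflates the standard deviation, so the threshold is only meaningful with enough data.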


Print the shapes of df, df1 and df2 to compare them. 

In [None]:
print(df.shape)
print(df1.shape)
print(df2.shape)

Interesting, right? Bam! You have removed 489 rows from the dataframe which were detected as outliers by the Z-score technique,
and 1636 rows which were detected as outliers by the IQR technique.

By the way, `there are many other techniques by which you can remove outliers. You can explore more of the interesting techniques available.`

We know you must have many questions in your mind, like:
- Which technique should we use, and why?
- Is everything detected as an outlier really an outlier?

Don't worry, this dilemma is faced by many data analysts. We provide some good references below for you to explore further:

- https://www.theanalysisfactor.com/outliers-to-drop-or-not-to-drop/
- https://www.researchgate.net/post/Which-is-the-best-method-for-removing-outliers-in-a-data-set

Let's find the unique values and their counts in each column of df using the value_counts function.

Reference: https://www.youtube.com/watch?v=8wRfskrpTIk

In [None]:
#Replicate the below output

for col in df.columns:
    print()
    print('-------------%s---------------'%col)
    print(df[col].value_counts())

In [None]:
# Plot a pie chart to visualise unique value counts in column 'Make' using matplotlib
plt.figure(figsize=(7,7))

df['Make'].value_counts().plot(kind='pie')

plt.show()

## Visualising Univariate Distributions

Reference:-https://www.youtube.com/watch?v=lI5-7rX3xPY

We will use the seaborn library to draw eye-catching univariate plots. 

Did you know? You have already explored one univariate plot. Guess which one? Yes, it's the box plot.


### 1 . Histogram & Density Plots

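Histograms and density plots show the frequency of a numeric variable along the y-axis, and the value along the x-axis. The ```sns.distplot()``` function draws a histogram with a density curve on top; it has been deprecated since seaborn 0.11, and ```sns.histplot()``` with ```kde=True``` is its replacement. Notice that the result is aesthetically better than vanilla ```matplotlib```.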

In [None]:
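# plotting the distribution of variable HP
# (sns.distplot is deprecated; histplot with kde=True draws the same histogram plus density curve)
sns.histplot(x=df['HP'],kde=True,stat='density')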

plt.show()
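To see the difference between a frequency histogram and a density histogram, the sketch below bins a synthetic column with NumPy. The data are randomly generated stand-ins, not the real 'HP' values.

```python
import numpy as np

# Synthetic stand-in for a numeric column (assumed normal, purely for illustration)
rng = np.random.default_rng(0)
hp = rng.normal(loc=250, scale=60, size=1000)

# Raw frequencies: how many values fall in each of the 20 bins
counts, edges = np.histogram(hp, bins=20)

# Density: bar heights rescaled so that the total bar area equals 1
density, _ = np.histogram(hp, bins=20, density=True)

widths = np.diff(edges)
area = (density * widths).sum()

print(counts.sum(), round(area, 6))
```

The counts add up to the number of observations, while the density heights times the bin widths add up to 1, which is what lets a smooth density curve be overlaid on the bars.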

Since seaborn uses matplotlib behind the scenes, the usual matplotlib functions work well with seaborn. For example, you can use subplots to plot multiple univariate distributions.
- Hint: use matplotlib subplot function

Reference:- https://www.youtube.com/watch?v=Tf-dgRR1PMA

In [None]:
# plot all the columns present in list l together using subplot of dimension (2,3).

plt.figure(figsize=(12,8))
counter = 0
for col in l:
    counter += 1
    plt.subplot(2,3,counter)
    plt.title(col)
    # histplot with kde=True replaces the deprecated sns.distplot
    sns.histplot(x=df[col],kde=True,stat='density')
    
plt.show()

### 2. Bar plots


Plot a bar chart with the make on the x-axis and the number of cars on the y-axis. <br>

In [None]:
#replicate below chart keeping nlargest as 40 and figsize as (12,6)
df.Make.value_counts().nlargest(40).plot(kind='bar',figsize=(12,6))

plt.title('Number of cars by make')
plt.xlabel('Make')
plt.ylabel('Number of cars')

plt.show()

### 3. Count Plot

Reference:-https://www.youtube.com/watch?v=8U5h3EJuu8M

`A count plot can be thought of as a histogram across a categorical, instead of quantitative, variable.`


Plot a vertical countplot for the variable Transmission with hue as Drive Mode.

In [None]:
plt.figure(figsize=(12,4))

sns.countplot(x='Transmission',hue='Drive Mode',data=df)

plt.show()
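The bar heights in a count plot with a hue are simply counts per pair of categories. A minimal sketch with `pd.crosstab` on made-up rows shows the numbers such a plot would draw.

```python
import pandas as pd

# Toy categorical data (hypothetical rows, using the renamed column names)
toy = pd.DataFrame({
    "Transmission": ["MANUAL", "MANUAL", "AUTOMATIC", "AUTOMATIC", "AUTOMATIC"],
    "Drive Mode": [
        "rear wheel drive",
        "front wheel drive",
        "front wheel drive",
        "front wheel drive",
        "all wheel drive",
    ],
})

# One row per Transmission value, one column per Drive Mode value
counts = pd.crosstab(toy["Transmission"], toy["Drive Mode"])

print(counts)
```

Each cell of the table corresponds to one bar: the row picks the group on the x-axis and the column picks the hue colour.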

# Visualising Bivariate Distributions


Bivariate distributions are simply two univariate distributions plotted on x and y axes respectively. They help you observe the relationship between the two variables.




## 1. Scatterplots
Reference:- https://www.youtube.com/watch?v=Is9bc-WA-c8

Scatterplots are used to visualise the relationship between two continuous variables. <br>

Using a scatterplot, explore the correlation between the 'HP' and 'Price' columns of the data. 



In [None]:
## Your code here - 
plt.figure(figsize=(12,4))

# recent seaborn versions require x and y to be passed as keyword arguments
sns.scatterplot(x='HP',y='Price',data=df,alpha=0.4)

plt.show()
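A scatterplot shows the relationship visually; the Pearson correlation coefficient puts a number on it. The sketch below computes it with `Series.corr` on made-up values, not on the real dataset.

```python
import pandas as pd

# Toy data (hypothetical values) where price rises with horsepower
toy = pd.DataFrame({
    "HP": [100, 150, 200, 250, 300],
    "Price": [15000, 21000, 30000, 38000, 52000],
})

# Pearson correlation: +1 is a perfect increasing line, 0 means no linear relation
r = toy["HP"].corr(toy["Price"])

print(round(r, 3))
```

A value close to 1 matches a point cloud that climbs from the lower left to the upper right, though it says nothing about whether the trend is exactly linear.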

## 2. Lineplot 

Reference: https://www.youtube.com/watch?v=CMRVEKf9jWA


Plot a lineplot where the x-axis is 'HP' and the y-axis shows 'Price'.
Size of graph - (10,5)



In [None]:
plt.figure(figsize=(10,5))
sns.lineplot(x='HP',y='Price',data=df)

plt.show()

Optional Question: - What happens to Price as we increase HP? is the trend linear?
* `A sharp increase can be seen in Price for HP more than 780`

## 3. Box plot w.r.t various variables

Reference:- https://www.youtube.com/watch?v=KoTHJPzpw0c


In [None]:
# Plot boxplot of 'Price' across various 'Drive Mode'
plt.figure(figsize=(12,4))

sns.boxplot(x='Drive Mode',y='Price',data=df)

plt.yscale('log')
plt.show()

Try to understand the above plot. Let us help you: this chart tells you that the price of front wheel drive cars is, on average, lower than that of the other three categories. 

Let's now plot the other important variable - 'Transmission'

In [None]:
# plot boxplot of 'Price' across various 'Transmission'
plt.figure(figsize=(12,4))

sns.boxplot(x='Transmission',y='Price',data=df)

plt.yscale('log')
plt.show()

Hey Techie! Try to explain what insight you get from the above chart.

In [None]:
#### comment below
print('Price of direct_drive on average is lower')

You can **visualise the distribution across two categorical variables** using the ```hue= ``` argument.

In [None]:
# plot a boxplot with Transmission on x axis and Price on y axis, along with hue as Drive Mode
plt.figure(num=None, figsize=(12, 8), dpi=80, facecolor='w', edgecolor='k')

sns.boxplot(x='Transmission',y='Price',hue='Drive Mode',data=df)

plt.yscale('log')
plt.show()

Wow!! If you have plotted this for the first time, the plot might look very complicated. But no, it shows some really simple and interesting insights.
Can you tell us what you think about the above chart?
Please comment below:

In [None]:
#Comment
print('Avg price based on Transmission as Manual for All Drive mode can be seen and so on...','\n',
     'In Transmission as MANUAL front wheel drive and four wheel drive are cheaper than other two')


## 4. Joint distributions

Reference:-https://www.youtube.com/watch?v=LrSLBbe6pGY


Seaborn's jointplot displays the relationship between 2 variables (bivariate) as well as 1D profiles (univariate) in the margins. It is a convenience function that wraps the JointGrid class.


In [None]:
# joint plots of MPG_H and MPG-C

sns.jointplot(x='MPG_H',y='MPG-C',data=df)

plt.show()


In [None]:
# joint plots of Price and Year

sns.jointplot(x='Price',y='Year',data=df)

plt.show()


You can adjust the arguments of jointplot() to make the plot more readable. For example, specifying kind='hex' will create a 'hexbin plot'.

In [None]:
# plotting hex plot for Price and Year
# hex plot

sns.jointplot(x='Price',y='Year',data=df,kind='hex',color='r')

plt.show()


## 5. Plotting Aggregated Values across Categories

Reference:-https://www.youtube.com/watch?v=yWYWmeuH7no

### Bar Plots - Mean, Median and Count Plots


Bar plots are used to **display aggregated values** of a variable, rather than entire distributions. This is especially useful when you have a lot of data which is difficult to visualise in a single figure. 

For example, say you want to visualise and *compare the Price across Cylinders*. The ```sns.barplot()``` function can be used to do that.


In [None]:
# bar plot with default statistic=mean between Cylinder and Price

sns.barplot(x='Cylinders',y='Price',data=df)

plt.show()


Hey, smarty! An amazing fact for you:
Note that, **by default, seaborn plots the mean value across categories**, though you can plot the count, median, sum etc. Also, barplot computes and shows the confidence interval of the mean as well.
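The aggregated values that a bar plot displays can be reproduced with a plain `groupby`. The sketch below uses made-up prices to show the mean, median and count per category side by side.

```python
import pandas as pd

# Toy data (hypothetical values): price by number of cylinders
toy = pd.DataFrame({
    "Cylinders": [4, 4, 6, 6, 8],
    "Price": [20000, 24000, 30000, 40000, 60000],
})

# One row per category, one column per aggregate
summary = toy.groupby("Cylinders")["Price"].agg(["mean", "median", "count"])

print(summary)
```

The `mean` column corresponds to the default bar heights, while passing a different `estimator` to `sns.barplot` would correspond to one of the other columns.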



When you want to visualise a variable with a large number of categories, it is helpful to plot the categories along the y-axis. Let's now *drill down into Transmission sub categories*. 

In [None]:
# Plotting categorical variable Transmission across the y-axis

sns.barplot(x='Price',y='Transmission',data=df,estimator=np.median)

plt.show()


Plot bar plot for Price and Transmission with hue="Drive Mode"


In [None]:
plt.figure(num=None,figsize=(12,8),dpi=80,facecolor='w',edgecolor='k')

# specify hue="Drive Mode"
sns.barplot(x='Price',y='Transmission',hue='Drive Mode',data=df,estimator=np.median)

plt.show()


These plots look beautiful, don't they? In a data analyst's life such charts are an unavoidable friend. :)

# Multivariate Plots



## 1. Pairplot

Reference:- https://www.youtube.com/watch?v=TexdD7t0IKU

Plot a pairplot for the dataframe df. 



In [None]:
sns.pairplot(df)

plt.show()

## 2. Multivariate scatter plots

Seaborn's lmplot() function combines regplot() and FacetGrid. It is intended as a convenient interface to fit regression models across conditional subsets of a dataset.

Plot a multivariate scatter plot between HP and Price using the lmplot function of seaborn, with hue as Transmission and fit_reg set to False.

In [None]:
# fit_reg=False draws only the scatter points, as the exercise asks
sns.lmplot(x='HP',y='Price',hue='Transmission',data=df,fit_reg=False)


An interesting plot. In the above output you can clearly see that the price of automated_manual transmission cars increases gradually with HP.

## 3. Parallel Coordinates

Reference:-https://www.youtube.com/watch?v=QVBQdS-uL_I


A parallel coordinates plot provides another way of visualizing data across many variables.

In a Parallel Coordinates Plot, each variable is given its own axis and all the axes are placed parallel to each other. Each axis can have a different scale, as each variable works off a different unit of measurement, or all the axes can be normalised to keep the scales uniform. Values are plotted as a series of lines connected across all the axes. This means that each line is a collection of points, one placed on each axis, that have all been connected together.
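The normalisation mentioned above can be sketched with min-max scaling, which maps every column onto the range 0 to 1 so that no single variable dominates the shared axis. The data and the names `features` and `scaled` are made up for illustration.

```python
import pandas as pd

# Toy data (hypothetical values) on very different scales
toy = pd.DataFrame({
    "Year": [2010, 2015, 2017],
    "HP": [150.0, 300.0, 450.0],
    "Cylinders": [4, 6, 8],
})

features = ["Year", "HP"]  # columns to rescale; Cylinders stays as the class label

# Min-max scaling: (value - column minimum) / (column maximum - column minimum)
col_min = toy[features].min()
col_max = toy[features].max()
scaled = (toy[features] - col_min) / (col_max - col_min)
scaled["Cylinders"] = toy["Cylinders"]

print(scaled)
```

A frame prepared this way could then be passed to `parallel_coordinates(scaled, 'Cylinders')` so that all axes share a comparable scale.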


In [None]:
# copy l into another variable l1 and remove the 'Price' element from l1
l1 = l.copy()
l1.remove('Price')

Plot a parallel plot with columns present in l1 in df with respect to Cylinders

In [None]:
from pandas.plotting import parallel_coordinates

In [None]:
plt.figure(figsize=(15,5))

#complete the code
parallel_coordinates(df[l1],'Cylinders')

plt.show()


From the above plot we can see that 4-cylinder cars account for most of the spread in MPG_H and MPG-C. The HP variable shows some growth with respect to Year, and the lines stay roughly constant across MPG_H and MPG-C.



## 4.  Heatmaps

Reference:- https://www.youtube.com/watch?v=ZSwXRn50lnA

A heat map is a two-dimensional representation of information with the help of colors. Heat maps can help the user visualize simple or complex information.


Using heatmaps plot the correlation between the features present in the dataset.

In [None]:
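# find the correlation of the numeric features of the data df
# (restricting to the columns in l avoids errors on text columns in recent pandas versions)
df[l].corr()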


In [None]:
# Using the correlation matrix, plot the heatmap 
# set cmap = 'BrBG', annot = True - to get the same graph as shown below 
# set size of graph = (12,8)

plt.figure(figsize=(12,8))
sns.heatmap(df[l].corr(),cmap='BrBG',annot=True)

plt.show()
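Besides reading the heatmap by eye, the strongest relationship can be pulled out of the correlation matrix programmatically. The sketch below does this on made-up data; the values and the name `strongest` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy numeric data (hypothetical values)
toy = pd.DataFrame({
    "HP": [100, 150, 200, 250, 300],
    "MPG_H": [40, 34, 30, 25, 20],
    "Year": [2012, 2010, 2016, 2011, 2015],
})

corr = toy.corr()

# Keep only the upper triangle so each pair appears once and the diagonal is excluded
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

# Pair of columns with the largest absolute correlation
strongest = pairs.abs().idxmax()

print(corr.round(2))
print(strongest)
```

Taking the absolute value matters here because a strong negative correlation, such as more horsepower going with lower mileage, is just as informative as a strong positive one.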