## EDA in Python

Python offers many libraries for data analysis, such as pandas, NumPy, matplotlib, and seaborn, which help us explore the data and draw out useful insights. I will be using Jupyter Notebook along with these libraries.

### Importing Libraries

In [None]:
import pandas as pd # To explore the dataframe
import numpy as np # To manipulate the data
import matplotlib # For visualisation
import matplotlib.pyplot as plt # For visualisation
import seaborn as sns # For visualisation

plt.rcParams['figure.figsize'] = (16, 8) # To set the figure size
plt.style.use('fivethirtyeight') # To set the style

# Avoid unwanted warnings
import warnings
warnings.filterwarnings('ignore')

### Data Loading

In [None]:
car_data = pd.read_csv('/kaggle/input/car-features-and-prices-dataset/data.csv')
car_data.head()

In [None]:
car_data.tail()

In [None]:
# Shape of the Data
car_data.shape

In [None]:
# Detail info of data
car_data.info()

pandas stores a column as the object data type if it contains strings, as int if it holds whole numbers, and as float if it holds decimal values. For example, MSRP (the price of the car) is stored as int, while Driven_Wheels is stored as object.

The output above also shows that several variables, namely Engine Fuel Type, Engine HP, Engine Cylinders, Number of Doors, and Market Category, have missing values.
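As a small illustration of the two points above, the sketch below builds a toy frame (the rows are made up, not taken from the dataset) and checks how pandas types each column and how many entries are missing. The column names mirror the original ones only for readability.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; the values are made up for illustration.
toy = pd.DataFrame({
    "Driven_Wheels": ["rear wheel drive", "front wheel drive", "all wheel drive"],
    "MSRP": [46135, 40650, 36350],
    "Engine HP": [335.0, np.nan, 230.0],
})

# Text columns are not numeric; whole numbers become int, decimals become float.
is_numeric = {col: pd.api.types.is_numeric_dtype(toy[col]) for col in toy.columns}
print(toy.dtypes)
print(is_numeric)

# Count the missing entries per column.
missing_counts = toy.isnull().sum()
print(missing_counts)
```

Note that a numeric column containing a missing value is stored as float, because NaN is itself a float.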

In [None]:
car_data.describe()

In [None]:
car_data.columns

In [None]:
# Shorten the column names; columns not listed here keep their original names
car_data.rename(columns={'Make': 'make', 'Engine Fuel Type': 'Fuel Type', 'Engine HP': 'HP',
                         'Engine Cylinders': 'Cylinders', 'Transmission Type': 'Transmission',
                         'Number of Doors': 'Doors', 'Vehicle Size': 'Size', 'Vehicle Style': 'Style',
                         'highway MPG': 'h_MPG', 'city mpg': 'c_MPG', 'MSRP': 'price'}, inplace=True)

In [None]:
car_data.columns

## Cleaning Categorical Data in our data set

In [None]:
# Making a list of the categorical columns
categorical = list(car_data.dtypes[car_data.dtypes == 'object'].index)
# Other methods
# list(car_data.select_dtypes(include='O').columns)
# [col for col in car_data.columns if car_data[col].dtype == 'object']
# Lowercase the values and replace spaces with underscores
for col in categorical:
    car_data[col] = car_data[col].str.lower().str.replace(" ", "_")
car_data.head()

In [None]:
# Print the first 5 unique values and the number of unique values for each column
for col in car_data.columns:
    print(col)
    print(car_data[col].unique()[:5])
    print(car_data[col].nunique())
    print('\n')

## Missing Values

In [None]:
print("Number of Missing Values in our data set\n")
missing_df = car_data.isnull().sum().to_frame().reset_index().rename({"index" : 'Variable', 0: 'Missing Values'}, axis =1)
display(missing_df.style.background_gradient('gnuplot2_r'))
print("\n Percentage of Missing Values in our data set")
display((car_data.isnull().sum() / (len(car_data.index)) * 100).head(20).to_frame().rename({0:'Count'}, axis = 1).style.background_gradient('gnuplot2_r'))
round((car_data.isnull().sum() / (len(car_data.index)) * 100) , 2).plot(kind = 'barh',color ='#bf0606')

plt.title("Percentage of Missing values");

The results above show that Fuel Type, HP, Cylinders, Doors, and Market Category have missing values.

About 0.025%, 0.58%, and 0.25% of the values are missing in Fuel Type, HP, and Cylinders respectively.

There are two common ways to treat these missing values:

* Drop: remove the rows where missing values are present.
* Impute: replace the missing values with a value such as the mean, median, or mode.

Since the percentage of missing data is very small, dropping those rows would also be acceptable; the imputation cell that follows fills the values instead.
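The two options can be compared on a toy frame. This is a minimal sketch with made-up values; the column names follow the renamed columns only for familiarity.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in each column.
toy = pd.DataFrame({
    "HP": [100.0, np.nan, 300.0, 200.0],
    "Fuel Type": ["regular", "premium", None, "regular"],
})

# Option 1: drop every row that has at least one missing value.
dropped = toy.dropna()

# Option 2: impute, using the mean for a numeric column
# and the mode (most frequent value) for a categorical one.
imputed = toy.copy()
imputed["HP"] = imputed["HP"].fillna(imputed["HP"].mean())
imputed["Fuel Type"] = imputed["Fuel Type"].fillna(imputed["Fuel Type"].mode()[0])

print(len(toy), len(dropped), len(imputed))
print(imputed)
```

Dropping shrinks the data, while imputing keeps every row at the cost of introducing estimated values.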

In [None]:
car_data.columns

In [None]:
# Use the mode to fill missing values in categorical columns
car_data['Market Category'] = car_data['Market Category'].fillna(car_data['Market Category'].mode()[0])
car_data['Fuel Type'] = car_data['Fuel Type'].fillna(car_data['Fuel Type'].mode()[0])

# Use the mean to fill missing values in numerical columns
car_data['HP'] = car_data['HP'].fillna(car_data['HP'].mean())

# Use the median to fill missing values in ordinal numerical columns
car_data['Cylinders'] = car_data['Cylinders'].fillna(car_data['Cylinders'].median())
car_data['Doors'] = car_data['Doors'].fillna(car_data['Doors'].median())

# Check the missing values after imputing
display(car_data.isnull().sum().to_frame().reset_index().rename({'index' : 'Variables', 0: 'Missing Values'},  axis =1).style.background_gradient('copper_r'))

## Checking duplicates in our data set

In [None]:
display("Total number of duplicates present in data: %s" % car_data.duplicated().sum())


In [None]:
# Dropping the duplicates
car_data.drop_duplicates(inplace=True)

# Checking the duplicates again
print("Total number of duplicates present in data: %s" % car_data.duplicated().sum())

From the descriptive summary, we learn that there are 47 unique car makes and 904 models. Chevrolet is the most frequent make, with 1115 cars. The average price of a car is 40581.5 dollars, while the 50th percentile, or median, of the price is 29970. The large gap between the mean and the median indicates that the price variable is highly skewed, which we can check visually using a histogram.
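To see why a mean far above the median points to right skew, consider the sketch below. The prices are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical prices: mostly moderate, with one very expensive car.
prices = pd.Series([20000, 25000, 30000, 35000, 2000000])

mean_price = prices.mean()      # pulled upwards by the expensive car
median_price = prices.median()  # barely affected by it
skewness = prices.skew()        # positive values indicate a long right tail

print(mean_price, median_price, skewness)
```

A single extreme value moves the mean a long way but leaves the median almost unchanged, which is the pattern seen in the price column.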

## Data Visualisation

Data visualisation, as its name suggests, means observing the data through plots and graphs such as histograms, scatterplots, boxplots, and heatmaps. We will use matplotlib and seaborn together to visualise a few variables.

In [None]:
numerical = [col for col in car_data.columns if col not in categorical]
car_data[numerical].corr().style.background_gradient('copper_r')

From the correlation matrix above, it can be inferred that many variables are strongly related to each other. For example, the correlation between c_MPG and h_MPG is 0.85, which is close to 1, meaning there is a strong positive relationship between them. Likewise, Cylinders and c_MPG have a negative relationship.

#### Observation

price (originally MSRP) is the dependent variable.

HP and Cylinders have the highest correlation with the dependent variable.

Multicollinearity exists in our data set: observe the correlation between HP and Cylinders, and between h_MPG and c_MPG.
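One way to spot such pairs programmatically is to scan the correlation matrix for large absolute values. The sketch below uses synthetic data, and the 0.8 cut-off is a hypothetical threshold rather than a rule from this analysis.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Synthetic data: Cylinders is built to track HP closely, Popularity is independent noise.
rng = np.random.default_rng(0)
n = 200
hp = rng.normal(250, 60, n)
toy = pd.DataFrame({
    "HP": hp,
    "Cylinders": hp / 40 + rng.normal(0, 0.3, n),
    "Popularity": rng.normal(1500, 400, n),
})

corr = toy.corr().abs()
threshold = 0.8  # hypothetical cut-off for "strongly correlated"

# Visit each pair of columns once and keep the pairs above the threshold.
high_pairs = [
    (a, b, round(corr.loc[a, b], 2))
    for a, b in combinations(corr.columns, 2)
    if corr.loc[a, b] > threshold
]
print(high_pairs)
```

Pairs flagged this way are candidates for dropping one variable or combining the two before fitting a linear model.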

## Checking the distribution of numerical variables (Univariate Analysis)

In [None]:
numerical = [col for col in car_data.columns if col not in categorical]
for i in numerical:
    # histplot with kde=True replaces the deprecated distplot
    ax = sns.histplot(car_data[i], kde=True, stat='density', color='#bf0606')
    plt.title("Distribution of %s" % i, fontsize=20)
    plt.xlabel(" ")
    plt.ylabel(" ")
    plt.xticks(fontsize=15)
    plt.show()
    print('\n')

#### Observation

The distributions of the numerical columns show that some variables are skewed: price, h_MPG, and Year. A log transformation may be useful before using the data for prediction.

Certain numerical variables are ordinal in nature, e.g. Doors and Cylinders. They can be converted into categorical columns and then transformed for prediction using techniques such as one-hot encoding.

The variable Popularity shows a multimodal distribution.
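The two transformations mentioned above can be sketched on synthetic data. The frame below is generated at random and only borrows the column names; `np.log1p` and `pd.get_dummies` are one possible choice among several.

```python
import numpy as np
import pandas as pd

# Synthetic data: a right-skewed price and a small set of door counts.
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "price": rng.lognormal(mean=10.0, sigma=1.0, size=500),
    "Doors": rng.choice([2, 3, 4], size=500),
})

# log1p computes log(1 + x), which compresses the long right tail.
toy["log_price"] = np.log1p(toy["price"])
raw_skew = toy["price"].skew()
log_skew = toy["log_price"].skew()
print(raw_skew, log_skew)

# Treat the door count as a category and one-hot encode it:
# each distinct value becomes its own indicator column.
doors_encoded = pd.get_dummies(toy["Doors"].astype("category"), prefix="Doors")
print(doors_encoded.head())
```

After the log transformation the skewness should be much closer to zero, and every row of the encoded frame has exactly one indicator set.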

### Distribution and relationship of Numerical Variables with dependent variable

In [None]:
nrows = 2
ncols = 4
fig, ax = plt.subplots(nrows, ncols, figsize=(16, 8))
# Plot each numerical variable against price, one subplot per variable
for axis, col in zip(ax.flat, numerical):
    sns.scatterplot(data=car_data, x=col, y='price', ax=axis, color='#bf0606').set(ylabel='')
plt.tight_layout()

#### Observation

There are outliers in price, h_MPG, and Year. Outliers can be treated using capping and flooring methods.

Also, transforming certain variables can help improve the linearity of their relationship with the dependent variable.
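Capping and flooring can be sketched with percentile limits and `Series.clip`. The prices and the 5th/95th percentile limits below are hypothetical choices for illustration, not values used in this analysis.

```python
import pandas as pd

# Hypothetical prices: 99 evenly spaced values plus one extreme outlier.
prices = pd.Series(list(range(20000, 69500, 500)) + [2000000])

# Floor at the 5th percentile and cap at the 95th percentile.
lower = prices.quantile(0.05)
upper = prices.quantile(0.95)
capped = prices.clip(lower=lower, upper=upper)

print(prices.max(), capped.max())
print(prices.min(), capped.min())
```

Values outside the limits are replaced by the limits themselves, so no rows are lost and the extreme car no longer dominates the scale.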

### Checking Relation between all categorical variables and dependent variables (price)

In [None]:
# Function for ordering the groups in a column by their mean price, highest first
def sort_order(column):
    orders = car_data.groupby(column)['price'].mean().sort_values(ascending=False).index
    return orders

In [None]:
# Looping over categorical variables to check the price over different groups
for i in categorical:
    if car_data[i].nunique() < 72:
        sns.barplot(x=car_data[i], y=car_data['price'], order=sort_order(i), palette='copper')
        plt.title("Bar Plot of %s" % i, fontsize=20)
        plt.xlabel("%s" % i)
        plt.ylabel("Car Price")
        plt.xticks(fontsize=15, rotation=90)
        plt.show()
        print('\n')

#### Observation

* These bar graphs represent the relationship of each categorical variable with the dependent variable.
* In every variable, several groups are associated with a high car price.

Some of them are:

* convertible and coupe in Style
* large in Size
* exotic, luxury, performance in Market Category
* all wheel drive and rear wheel drive in Driven_Wheels
* automated-manual in Transmission
* bugatti, maybach in make