# Code by [Avishake Adhikary](https://www.linkedin.com/in/avishakeadhikary/)

#### Amity University Kolkata
>Under the guidance of [**Prof. Indranil Seal**](https://github.com/Indranil-Seal/)

# Machine Learning Lifecycle

1. Data Sanity Check 
2. Exploratory Data Analysis (EDA)
3. Additional Insights
4. Feature Engineering
5. Baseline Model Creation
6. Optimizing Baseline Model

# Install Dependencies

In [1]:
!pip install numpy pandas scikit-learn matplotlib seaborn



# Import Libraries

In [2]:
import numpy as np
import pandas as pd
import sklearn as skl
from matplotlib import pyplot as plt
import seaborn as sns

# Import Dataset

In [3]:
data = pd.read_csv('ElectricCarData_Clean.csv')
data

Unnamed: 0,Brand,Model,AccelSec,TopSpeed_KmH,Range_Km,Efficiency_WhKm,FastCharge_KmH,RapidCharge,PowerTrain,PlugType,BodyStyle,Segment,Seats,PriceEuro
0,Tesla,Model 3 Long Range Dual Motor,4.6,233,450,161,940,Yes,AWD,Type 2 CCS,Sedan,D,5,55480
1,Volkswagen,ID.3 Pure,10.0,160,270,167,250,Yes,RWD,Type 2 CCS,Hatchback,C,5,30000
2,Polestar,2,4.7,210,400,181,620,Yes,AWD,Type 2 CCS,Liftback,D,5,56440
3,BMW,iX3,6.8,180,360,206,560,Yes,RWD,Type 2 CCS,SUV,D,5,68040
4,Honda,e,9.5,145,170,168,190,Yes,RWD,Type 2 CCS,Hatchback,B,4,32997
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98,Nissan,Ariya 63kWh,7.5,160,330,191,440,Yes,FWD,Type 2 CCS,Hatchback,C,5,45000
99,Audi,e-tron S Sportback 55 quattro,4.5,210,335,258,540,Yes,AWD,Type 2 CCS,SUV,E,5,96050
100,Nissan,Ariya e-4ORCE 63kWh,5.9,200,325,194,440,Yes,AWD,Type 2 CCS,Hatchback,C,5,50000
101,Nissan,Ariya e-4ORCE 87kWh Performance,5.1,200,375,232,450,Yes,AWD,Type 2 CCS,Hatchback,C,5,65000


# Create Top View Dataset

#### We first build a top-level view of the dataset so that we understand the data before processing it further
The view has one row per column of the dataset, with these attributes:
>||ColumnName|NoOfUniqueValues|DataType|NoOfMissingValues|IsPrimaryKey||

In [4]:
col_list = list(data.columns) #list of all column names
topview_data = pd.DataFrame() #empty DataFrame that collects one summary row per column

for col in col_list:
    datum = pd.DataFrame({'colname':col, #name of the current column
                          '#uniq_':[len(data[col].unique())], #number of unique values in the column
                          'dtype':[data[col].dtype], #data type of the column
                          '#missing_':[data[col].isna().sum()], #number of NA values in the column
                          'is_PKey':[len(data[col].unique()) == data.shape[0]]}) #True if every row holds a distinct value
                          #data.shape[0] is the number of rows, so equality means the column could serve as a primary key
    topview_data = pd.concat([topview_data,datum],axis = 0) #appends the summary row to the top view
    #axis = 0 stacks the frames along rows; axis = 1 would place them side by side as columns
    
topview_data #displays the top view

Unnamed: 0,colname,#uniq_,dtype,#missing_,is_PKey
0,Brand,33,object,0,False
0,Model,102,object,0,False
0,AccelSec,55,float64,0,False
0,TopSpeed_KmH,25,int64,0,False
0,Range_Km,50,int64,0,False
0,Efficiency_WhKm,54,int64,0,False
0,FastCharge_KmH,51,object,0,False
0,RapidCharge,2,object,0,False
0,PowerTrain,3,object,0,False
0,PlugType,4,object,0,False
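
The loop above calls `pd.concat` once per column, which also explains why every row in the output carries the index 0. As a minimal sketch, the same summary can be built in one step from column-wise pandas methods. The helper name `build_topview` and the toy frame are assumptions made for illustration and are not part of the original notebook.

```python
import pandas as pd

def build_topview(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize each column: unique count, dtype, missing count, key candidate."""
    # dropna=False counts NaN as a value, matching len(df[col].unique())
    n_unique = df.nunique(dropna=False)
    return pd.DataFrame({
        'colname': df.columns,
        '#uniq_': n_unique.to_numpy(),
        'dtype': df.dtypes.to_numpy(),
        '#missing_': df.isna().sum().to_numpy(),
        # a column is a key candidate when every row holds a distinct value
        'is_PKey': (n_unique == len(df)).to_numpy(),
    })

# Toy frame standing in for the real dataset (values are made up)
toy = pd.DataFrame({'Brand': ['A', 'A', 'B'],
                    'Model': ['x', 'y', 'z'],
                    'Seats': [5, None, 5]})
topview = build_topview(toy)
print(topview)
```

Calling `build_topview(data)` on the notebook's DataFrame should give the same columns as the loop, with a running 0..n-1 index instead of repeated zeros.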


# Truly Validate Data

#### We see that no missing values are reported and that no attribute is individually a primary key
>To overcome this we combine the Brand and Model attributes into a single Brand+Model attribute and test it as a composite primary key
>We also manually check for missing values so that we can deal with them

In [5]:
#creates a new attribute combining Brand and Model as a candidate primary key
data['brand_model_key'] = data['Brand'].astype(str) + data['Model'].astype(str)
#prints the number of unique keys, the number of rows, and whether the two are equal
print(len(data['brand_model_key'].unique()), data.shape[0] , len(data['brand_model_key'].unique()) == data.shape[0])
#displays the dataset
data

"""
BM_array = data[['Brand','Model']].drop_duplicates()
BM_array.shape,data[['Brand','Model']].shape
"""

102 103 False


"\nBM_array = data[['Brand','Model']].drop_duplicates()\nBM_array.shape,data[['Brand','Model']].shape\n"

#### We see that the counts don't match: the key has 102 unique values but the dataset has 103 rows, so one Brand+Model combination appears twice
> So to overcome this we drop the duplicates from the dataset
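
The notebook states that duplicates are dropped but does not show the cell that does it, and the later outputs still report 103 rows. The following is a minimal sketch of one way to locate and drop repeated Brand+Model pairs. It runs on a made-up toy frame, and the names `toy`, `dupes`, and `deduped` are assumptions for illustration. Passing both columns as `subset` also avoids a pitfall of plain string concatenation, where different pairs can collapse to the same key text.

```python
import pandas as pd

# Toy rows standing in for the real dataset (values are made up)
toy = pd.DataFrame({'Brand': ['A', 'A', 'B', 'A'],
                    'Model': ['x', 'y', 'z', 'x'],
                    'Seats': [5, 4, 5, 5]})

# Show every row whose Brand+Model pair occurs more than once
dupes = toy[toy.duplicated(subset=['Brand', 'Model'], keep=False)]
print(dupes)

# Keep the first occurrence of each pair and renumber the index
deduped = (toy.drop_duplicates(subset=['Brand', 'Model'], keep='first')
              .reset_index(drop=True))

# After dropping, the pair identifies each row uniquely
is_key = not deduped.duplicated(subset=['Brand', 'Model']).any()
print(len(toy), len(deduped), is_key)
```

Inspecting `dupes` before dropping is worthwhile, since two rows that share a Brand and Model may still differ in other attributes.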

In [6]:
#Selects the Brand, Model and FastCharge_KmH columns for the rows where FastCharge_KmH is '-'
data.loc[data['FastCharge_KmH'] == '-',['Brand','Model','FastCharge_KmH']]

Unnamed: 0,Brand,Model,FastCharge_KmH
57,Renault,Twingo ZE,-
68,Renault,Kangoo Maxi ZE 33,-
77,Smart,EQ forfour,-
82,Smart,EQ fortwo coupe,-
91,Smart,EQ fortwo cabrio,-


#### We see that the dataset does contain missing values, encoded as '-', so we need to deal with them
> To keep the rows that contain these placeholders, we replace each unreadable value with 0

In [7]:
#Replaces '-' values in FastCharge_KmH with 0
data['FastCharge_KmH'] = data['FastCharge_KmH'].replace('-',0)
#prints a boolean Series that is True wherever '-' remains (all False after the replacement)
print(data['FastCharge_KmH'] == '-')

0      False
1      False
2      False
3      False
4      False
       ...  
98     False
99     False
100    False
101    False
102    False
Name: FastCharge_KmH, Length: 103, dtype: bool
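
Replacing '-' with 0 keeps the rows, but the zeros then take part in every later statistic such as the mean and minimum. As a hedged alternative sketch, `pd.to_numeric` with `errors='coerce'` turns unparseable entries into NaN and converts the column to a numeric dtype in one step; filling with 0 afterwards reproduces the notebook's choice. The toy Series below is made up and only stands in for the real column.

```python
import pandas as pd

# Toy column standing in for FastCharge_KmH (values are made up)
fast_charge = pd.Series(['940', '250', '-', '560'], name='FastCharge_KmH')

# Anything that cannot be parsed as a number (here '-') becomes NaN
as_numeric = pd.to_numeric(fast_charge, errors='coerce')
n_missing = int(as_numeric.isna().sum())

# Optional: mirror the notebook's choice of treating the placeholder as 0
filled = as_numeric.fillna(0)
print(n_missing, filled.tolist())
```

Leaving the values as NaN instead of filling them means pandas skips them by default in aggregations such as `mean` and `describe`.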


# Exploratory Data Analysis
![EDA Cheat Sheet](https://miro.medium.com/max/1000/0*l0P7bwhVkLp1SPbS.png)
- Feature: AccelSec > Acceleration Seconds

In [16]:
data.AccelSec.describe() #Gives necessary information based on AccelSec attribute in the dataset

count    103.000000
mean       7.396117
std        3.017430
min        2.100000
25%        5.100000
50%        7.300000
75%        9.000000
max       22.400000
Name: AccelSec, dtype: float64

In [15]:
#We group the data by Brand, since it is the natural category for comparing models, and run aggregate functions on each group
#agg with named aggregations returns a DataFrame with one row per brand
brandaccelgrp = data.groupby('Brand').agg(avg_AccelSec = pd.NamedAgg('AccelSec','median'),#median of AccelSec (stored under the name avg_AccelSec)
                                          min_AccelSec = pd.NamedAgg('AccelSec','min'),#minimum of AccelSec
                                          max_AccelSec = pd.NamedAgg('AccelSec','max'),#maximum of AccelSec
#counts the models per brand, then resets the index and sorts the rows by avg_AccelSec in ascending order
                                          models = pd.NamedAgg('Model','count')).reset_index().sort_values('avg_AccelSec',ascending = True)
brandaccelgrp #displays the new DataFrame

Unnamed: 0,Brand,avg_AccelSec,min_AccelSec,max_AccelSec,models
15,Lucid,2.8,2.8,2.8,1
24,Porsche,3.5,2.8,4.0,5
30,Tesla,3.8,2.1,7.0,13
23,Polestar,4.7,4.7,4.7,1
11,Jaguar,4.8,4.8,4.8,1
32,Volvo,4.9,4.9,4.9,1
18,Mercedes,5.1,5.0,10.0,3
1,Audi,5.7,3.5,6.8,9
8,Ford,6.3,6.0,7.0,4
4,CUPRA,6.5,6.5,6.5,1


- Feature: TopSpeed_KmH > Top Speed in kmph

In [17]:
data.TopSpeed_KmH.describe() #Gives necessary information based on TopSpeed_KmH attribute in the dataset

count    103.000000
mean     179.194175
std       43.573030
min      123.000000
25%      150.000000
50%      160.000000
75%      200.000000
max      410.000000
Name: TopSpeed_KmH, dtype: float64

In [18]:
#We group the data by Brand, since it is the natural category for comparing models, and run aggregate functions on each group
#agg with named aggregations returns a DataFrame with one row per brand
data.groupby('Brand').agg(avg_TopSpeed_KmH = pd.NamedAgg('TopSpeed_KmH','mean'),#mean of TopSpeed_KmH
                          min_TopSpeed_KmH = pd.NamedAgg('TopSpeed_KmH','min'),#minimum of TopSpeed_KmH
                          max_TopSpeed_KmH = pd.NamedAgg('TopSpeed_KmH','max'),#maximum of TopSpeed_KmH
#counts the models per brand, then resets the index and sorts the rows by avg_TopSpeed_KmH in descending order
                          models = pd.NamedAgg('Model','count')).reset_index().sort_values('avg_TopSpeed_KmH',ascending = False)

Unnamed: 0,Brand,avg_TopSpeed_KmH,min_TopSpeed_KmH,max_TopSpeed_KmH,models
24,Porsche,254.0,250,260,5
15,Lucid,250.0,250,250,1
30,Tesla,244.461538,180,410,13
23,Polestar,210.0,210,210,1
1,Audi,200.0,180,240,9
11,Jaguar,200.0,200,200,1
3,Byton,190.0,190,190,3
32,Volvo,180.0,180,180,1
8,Ford,180.0,180,180,4
18,Mercedes,173.333333,140,200,3


- Feature: Range_Km > Range in km

In [19]:
data.Range_Km.describe() #Gives necessary information based on Range_Km attribute in the dataset

count    103.000000
mean     338.786408
std      126.014444
min       95.000000
25%      250.000000
50%      340.000000
75%      400.000000
max      970.000000
Name: Range_Km, dtype: float64

In [20]:
data.groupby('Brand').agg(avg_Range_Km = pd.NamedAgg('Range_Km','mean'),
                          min_Range_Km = pd.NamedAgg('Range_Km','min'),
                          max_Range_Km = pd.NamedAgg('Range_Km','max'),
                          models = pd.NamedAgg('Model','count')).reset_index().sort_values('avg_Range_Km',ascending = False)

Unnamed: 0,Brand,avg_Range_Km,min_Range_Km,max_Range_Km,models
15,Lucid,610.0,610,610,1
14,Lightyear,575.0,575,575,1
30,Tesla,500.769231,310,970,13
4,CUPRA,425.0,425,425,1
23,Polestar,400.0,400,400,1
8,Ford,395.0,340,450,4
24,Porsche,388.0,365,425,5
32,Volvo,375.0,375,375,1
3,Byton,371.666667,325,400,3
11,Jaguar,365.0,365,365,1


- Feature: Efficiency_WhKm

In [21]:
data.Efficiency_WhKm.describe()

count    103.000000
mean     189.165049
std       29.566839
min      104.000000
25%      168.000000
50%      180.000000
75%      203.000000
max      273.000000
Name: Efficiency_WhKm, dtype: float64

In [22]:
data.groupby('Brand').agg(avg_Efficiency_WhKm = pd.NamedAgg('Efficiency_WhKm','mean'),
                          min_Efficiency_WhKm = pd.NamedAgg('Efficiency_WhKm','min'),
                          max_Efficiency_WhKm = pd.NamedAgg('Efficiency_WhKm','max'),
                          models = pd.NamedAgg('Model','count')).reset_index().sort_values('avg_Efficiency_WhKm',ascending = False)

Unnamed: 0,Brand,avg_Efficiency_WhKm,min_Efficiency_WhKm,max_Efficiency_WhKm,models
3,Byton,234.666667,222,244,3
11,Jaguar,232.0,232,232,1
1,Audi,224.555556,188,270,9
18,Mercedes,220.0,171,273,3
24,Porsche,209.4,195,223,5
8,Ford,202.25,194,209,4
30,Tesla,201.384615,153,267,13
32,Volvo,200.0,200,200,1
20,Nissan,194.75,164,232,8
13,Lexus,193.0,193,193,1


- Feature: FastCharge_KmH

In [23]:
data.FastCharge_KmH.describe()

count     103
unique     51
top       230
freq        6
Name: FastCharge_KmH, dtype: object

> We see that the FastCharge_KmH attribute has the object dtype, so we need to convert it to a numeric type to get the remaining summary statistics

In [25]:
data['FastCharge_KmH'] = data['FastCharge_KmH'].astype(float) #Converts the data into float datatype in the dataset
data.FastCharge_KmH.describe()

count    103.000000
mean     434.563107
std      219.660061
min        0.000000
25%      260.000000
50%      440.000000
75%      555.000000
max      940.000000
Name: FastCharge_KmH, dtype: float64

In [26]:
data.groupby('Brand').agg(avg_FastCharge_KmH = pd.NamedAgg('FastCharge_KmH','mean'),
                          min_FastCharge_KmH = pd.NamedAgg('FastCharge_KmH','min'),
                          max_FastCharge_KmH = pd.NamedAgg('FastCharge_KmH','max'),
                          models = pd.NamedAgg('Model','count')).reset_index().sort_values('avg_FastCharge_KmH',ascending = False)

Unnamed: 0,Brand,avg_FastCharge_KmH,min_FastCharge_KmH,max_FastCharge_KmH,models
24,Porsche,796.0,730.0,890.0,5
30,Tesla,730.0,480.0,940.0,13
15,Lucid,620.0,620.0,620.0,1
23,Polestar,620.0,620.0,620.0,1
4,CUPRA,570.0,570.0,570.0,1
1,Audi,567.777778,450.0,850.0,9
14,Lightyear,540.0,540.0,540.0,1
32,Volvo,470.0,470.0,470.0,1
3,Byton,453.333333,420.0,480.0,3
2,BMW,435.0,260.0,650.0,4


- Feature: RapidCharge

In [27]:
data.RapidCharge.describe()

count     103
unique      2
top       Yes
freq       98
Name: RapidCharge, dtype: object
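
For an object column, `describe()` reports only the most frequent value and its count. As a small sketch, `value_counts` shows the full distribution, and a dictionary passed to `map` turns the Yes/No labels into a 0/1 flag that later modelling steps can use. The toy Series and the variable names are assumptions for illustration.

```python
import pandas as pd

# Toy column standing in for RapidCharge (values are made up)
rapid = pd.Series(['Yes', 'Yes', 'No', 'Yes'], name='RapidCharge')

counts = rapid.value_counts()                # absolute count per label
shares = rapid.value_counts(normalize=True)  # fraction of rows per label

# Encode the two labels as integers; labels missing from the dict would become NaN
as_flag = rapid.map({'Yes': 1, 'No': 0})
print(counts, shares, as_flag.tolist(), sep='\n')
```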

###### We can always save the newly processed data with `datasetvariable.to_csv('newfile.csv')`
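
By default `to_csv` also writes the row index, and reading such a file back with `read_csv` produces an extra `Unnamed: 0` column. The sketch below shows the difference that `index=False` makes. It writes to in-memory buffers instead of files so that it runs anywhere, and the toy frame is made up.

```python
import io
import pandas as pd

# Toy frame standing in for the processed dataset (values are made up)
toy = pd.DataFrame({'Brand': ['A', 'B'], 'Seats': [5, 4]})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the real columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

For a file on disk the equivalent call would be `datasetvariable.to_csv('newfile.csv', index=False)`.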