# Code by [Avishake Adhikary](https://www.linkedin.com/in/avishakeadhikary/)

#### Amity University Kolkata
>Under the guidance of [**Prof. Indranil Seal**](https://github.com/Indranil-Seal/)

# Machine Learning Lifecycle

1. Data Sanity Check 
2. Exploratory Data Analysis (EDA)
3. Additional Insights
4. Feature Engineering
5. Baseline Model Creation
6. Optimizing Baseline Model

# Install Dependencies

In [1]:
!pip install numpy pandas scikit-learn matplotlib seaborn



# Import Libraries

In [2]:
import numpy as np
import pandas as pd
import sklearn as skl
from matplotlib import pyplot as plt
import seaborn as sns

# Import Dataset

In [3]:
data = pd.read_csv('ElectricCarData_Clean.csv')
data

Unnamed: 0,Brand,Model,AccelSec,TopSpeed_KmH,Range_Km,Efficiency_WhKm,FastCharge_KmH,RapidCharge,PowerTrain,PlugType,BodyStyle,Segment,Seats,PriceEuro
0,Tesla,Model 3 Long Range Dual Motor,4.6,233,450,161,940,Yes,AWD,Type 2 CCS,Sedan,D,5,55480
1,Volkswagen,ID.3 Pure,10.0,160,270,167,250,Yes,RWD,Type 2 CCS,Hatchback,C,5,30000
2,Polestar,2,4.7,210,400,181,620,Yes,AWD,Type 2 CCS,Liftback,D,5,56440
3,BMW,iX3,6.8,180,360,206,560,Yes,RWD,Type 2 CCS,SUV,D,5,68040
4,Honda,e,9.5,145,170,168,190,Yes,RWD,Type 2 CCS,Hatchback,B,4,32997
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98,Nissan,Ariya 63kWh,7.5,160,330,191,440,Yes,FWD,Type 2 CCS,Hatchback,C,5,45000
99,Audi,e-tron S Sportback 55 quattro,4.5,210,335,258,540,Yes,AWD,Type 2 CCS,SUV,E,5,96050
100,Nissan,Ariya e-4ORCE 63kWh,5.9,200,325,194,440,Yes,AWD,Type 2 CCS,Hatchback,C,5,50000
101,Nissan,Ariya e-4ORCE 87kWh Performance,5.1,200,375,232,450,Yes,AWD,Type 2 CCS,Hatchback,C,5,65000


# Create Top View Dataset

#### We build a top view of the dataset to understand its structure before processing it further
The new view has one row per column of the dataset, with the attributes => 
>||ColumnName|NoOfUniqueValues|DataType|NoOfMissingValues|IsPrimaryKey||

In [4]:
col_list = list(data.columns) # list of all column names
topview_data = pd.DataFrame() # empty DataFrame that collects one summary row per column

for col in col_list:
    datum = pd.DataFrame({'colname':col, # name of the current column
                          '#uniq_':[len(data[col].unique())], # number of unique values in the column
                          'dtype':[data[col].dtype], # data type of the column
                          '#missing_':[data[col].isna().sum()], # number of NA values in the column
                          'is_PKey':[len(data[col].unique()) == data.shape[0]]}) # True if every row has a distinct value
                          # data.shape[0] is the number of rows; a column whose unique count equals it can serve as a primary key
    topview_data = pd.concat([topview_data,datum],axis = 0) # appends this column's summary row to the top view
    # axis=0 stacks the frames row-wise (one below another); axis=1 would place them side by side as columns
    
topview_data # displays the top view

Unnamed: 0,colname,#uniq_,dtype,#missing_,is_PKey
0,Brand,33,object,0,False
0,Model,102,object,0,False
0,AccelSec,55,float64,0,False
0,TopSpeed_KmH,25,int64,0,False
0,Range_Km,50,int64,0,False
0,Efficiency_WhKm,54,int64,0,False
0,FastCharge_KmH,51,object,0,False
0,RapidCharge,2,object,0,False
0,PowerTrain,3,object,0,False
0,PlugType,4,object,0,False
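The same summary can be built without an explicit loop. The following is a minimal sketch, not the notebook's method: the helper name `top_view` is hypothetical, and the three sample rows are copied from the dataset preview above only to keep the example self-contained.

```python
import pandas as pd

# Stand-in frame: three rows taken from the dataset preview shown earlier
sample = pd.DataFrame({
    'Brand': ['Tesla', 'Volkswagen', 'Polestar'],
    'Model': ['Model 3 Long Range Dual Motor', 'ID.3 Pure', '2'],
    'AccelSec': [4.6, 10.0, 4.7],
    'Seats': [5, 5, 5],
})

def top_view(df):
    """Hypothetical helper: one summary row per column of df."""
    n_rows = len(df)
    # dropna=False counts NaN as a value, matching len(df[col].unique())
    n_unique = df.nunique(dropna=False)
    return pd.DataFrame({
        'colname': df.columns,
        '#uniq_': n_unique.to_numpy(),
        'dtype': df.dtypes.to_numpy(),
        '#missing_': df.isna().sum().to_numpy(),
        'is_PKey': (n_unique == n_rows).to_numpy(),
    })

topview_sample = top_view(sample)
print(topview_sample)
```

Because the result is built in one step, it gets a default 0..n-1 index instead of the repeated 0 index produced by concatenating one-row frames.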


# Truly Validate Data

#### We see that no missing values are reported and that no attribute is a primary key on its own
>So to overcome the problem we combine the Brand and Model attributes into a single Brand+Model attribute and test it as a composite primary key
>And we also check manually for missing values that isna() does not detect

In [5]:
#creates a new attribute combining Brand and Model as a candidate primary key
data['brand_model_key'] = data['Brand'].astype(str) + data['Model'].astype(str)
#prints the number of unique keys, the number of rows, and whether the two are equal
print(len(data['brand_model_key'].unique()), data.shape[0] , len(data['brand_model_key'].unique()) == data.shape[0])
#displays the dataset
data

"""
BM_array = data[['Brand','Model']].drop_duplicates()
BM_array.shape,data[['Brand','Model']].shape
"""

102 103 False


"\nBM_array = data[['Brand','Model']].drop_duplicates()\nBM_array.shape,data[['Brand','Model']].shape\n"

#### We see that the counts don't match: the number of unique key values (102) is smaller than the number of rows (103), so one Brand+Model combination occurs twice
> So to overcome this we drop the duplicates from the dataset
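A minimal sketch of how repeated Brand+Model rows could be inspected and then dropped. It assumes that keeping the first occurrence is acceptable, and it uses a small stand-in frame in which the repeated row is constructed for illustration; which rows are actually duplicated in the dataset is not shown here.

```python
import pandas as pd

# Stand-in frame; the second row is an artificial duplicate for illustration
sample = pd.DataFrame({
    'Brand': ['Honda', 'Honda', 'BMW'],
    'Model': ['e', 'e', 'iX3'],
    'PriceEuro': [32997, 32997, 68040],
})

# keep=False marks every member of a duplicated group, so all copies can be inspected
dup_mask = sample.duplicated(subset=['Brand', 'Model'], keep=False)
duplicates = sample[dup_mask]
print(duplicates)

# Keep the first occurrence of each Brand+Model pair and renumber the rows
deduped = sample.drop_duplicates(subset=['Brand', 'Model'], keep='first').reset_index(drop=True)

# After dropping, the pair (Brand, Model) identifies each row uniquely
is_key = not deduped.duplicated(subset=['Brand', 'Model']).any()
print(len(deduped), is_key)
```

Passing `subset=['Brand', 'Model']` compares the two columns directly, which avoids relying on a concatenated string key.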

In [6]:
#Selects the Brand, Model and FastCharge_KmH columns for the rows where FastCharge_KmH is '-'
data.loc[data['FastCharge_KmH'] == '-',['Brand','Model','FastCharge_KmH']]

Unnamed: 0,Brand,Model,FastCharge_KmH
57,Renault,Twingo ZE,-
68,Renault,Kangoo Maxi ZE 33,-
77,Smart,EQ forfour,-
82,Smart,EQ fortwo coupe,-
91,Smart,EQ fortwo cabrio,-


#### We see that missing values are present after all, encoded as '-', so we need to deal with them
> In order to keep the rows that contain them, we replace the unreadable '-' values with 0

In [7]:
#Replaces '-' values in FastCharge_KmH with the integer 0
data['FastCharge_KmH'] = data['FastCharge_KmH'].replace('-',0)
#prints a boolean Series that is True wherever a '-' remains (all False after the replacement)
print(data['FastCharge_KmH'] == '-')

0      False
1      False
2      False
3      False
4      False
       ...  
98     False
99     False
100    False
101    False
102    False
Name: FastCharge_KmH, Length: 103, dtype: bool
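After the replacement the column likely still has the `object` dtype, since the remaining entries were read as strings. A minimal sketch of one way to obtain a numeric column, assuming that treating '-' as 0 is the intended choice; the sample values are taken from the dataset preview and the variable names are illustrative.

```python
import pandas as pd

# Stand-in for the raw column: numbers read as strings plus the '-' placeholder
fast_charge = pd.Series(['940', '250', '-', '190'], name='FastCharge_KmH')

# errors='coerce' turns anything that is not a number into NaN
numeric = pd.to_numeric(fast_charge, errors='coerce')
n_missing = int(numeric.isna().sum())  # how many placeholders were found

# Fill the gaps with 0 (the same choice as above) and store as integers
filled = numeric.fillna(0).astype('int64')
print(n_missing, filled.dtype)
```

Counting the NaN values before filling them keeps a record of how many entries were originally unreadable.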


# Exploratory Data Analysis
![EDA Cheat Sheet](https://miro.medium.com/max/1000/0*l0P7bwhVkLp1SPbS.png)
- [ ] Feature: AccelSec > Acceleration Seconds

In [18]:
data.TopSpeed_KmH.describe() #Summary statistics (count, mean, std, min, quartiles, max) for the TopSpeed_KmH attribute

count    103.000000
mean     179.194175
std       43.573030
min      123.000000
25%      150.000000
50%      160.000000
75%      200.000000
max      410.000000
Name: TopSpeed_KmH, dtype: float64
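One way to read this summary is to compare the extremes with the quartiles. The sketch below applies the common 1.5 × IQR rule of thumb to the figures printed above; the rule is an assumption introduced here, not a step taken in the notebook.

```python
# Quartiles and extremes copied from the describe() output above
q1, q3 = 150.0, 200.0
min_speed, max_speed = 123.0, 410.0

iqr = q3 - q1                    # interquartile range
lower_fence = q1 - 1.5 * iqr     # values below this are flagged as low outliers
upper_fence = q3 + 1.5 * iqr     # values above this are flagged as high outliers

has_low_outlier = min_speed < lower_fence
has_high_outlier = max_speed > upper_fence
print(lower_fence, upper_fence, has_low_outlier, has_high_outlier)
```

Under this rule the maximum of 410 km/h lies above the upper fence, so at least one car would be flagged as a high outlier, while the minimum stays inside the fences.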