# SUPPORT VECTOR MACHINE

 
 What You're Aiming For

In this checkpoint, we are going to work on the 'Electric Vehicle Data' dataset that was provided by Kaggle as part of the Electric Vehicle Price Prediction competition.

Dataset description: This dataset contains information on the Battery Electric Vehicles (BEVs) and Plug-in Hybrid Electric Vehicles (PHEVs) that are currently registered with the Washington State Department of Licensing (DOL). This dataset was introduced as part of an official invitation-based competition on Kaggle. Our SVM model should answer the question: "This is my car's make and model, along with a few other parameters; at what price can this vehicle be bought or sold?"

➡️ Dataset link

https://i.imgur.com/IpuCW3s.jpg

➡️Columns explanation 

 

Import your data and perform a basic data exploration phase

    Display general information about the dataset

    Create a pandas profiling report to gain insights into the dataset

    Handle missing and corrupted values

    Remove duplicates, if they exist

    Handle outliers, if they exist

    Encode categorical features

Select your target variable and the features

Split your dataset into training and test sets

Build and train an SVM model on the training set

Assess your model performance on the test set using relevant evaluation metrics

Discuss with your cohort alternative ways to improve your model performance

The steps below follow the instructions above to predict vehicle prices from the 'Electric Vehicle Data' dataset with an SVM model, using Python with Pandas, NumPy, and Scikit-learn.

### Step 1: Import Libraries and Load Data

In [1]:
import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

from sklearn.model_selection import train_test_split

from sklearn.svm import SVR

from sklearn.metrics import mean_squared_error, r2_score

from pandas_profiling import ProfileReport



In [2]:
# Load the dataset

url = (r"C:\Users\User\Desktop\gomycode\Machine Learning\Electric_cars_dataset.csv")  

data = pd.read_csv(url)

### Step 2: Basic Data Exploration
Display General Information

In [4]:
# Display general information about the dataset

print(data.info())

print(data.describe())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64353 entries, 0 to 64352
Data columns (total 18 columns):
 #   Column                                             Non-Null Count  Dtype  
---  ------                                             --------------  -----  
 0   ID                                                 64353 non-null  object 
 1   VIN (1-10)                                         64353 non-null  object 
 2   County                                             64349 non-null  object 
 3   City                                               64344 non-null  object 
 4   State                                              64342 non-null  object 
 5   ZIP Code                                           64347 non-null  float64
 6   Model Year                                         64346 non-null  float64
 7   Make                                               64349 non-null  object 
 8   Model                                              64340 non-null  object 
 9   Electr

* Create a Pandas Profiling Report

In [7]:
# Create a profiling report

profile = ProfileReport(data, title="Pandas Profiling Report", explorative=True)

profile.to_file("electric_vehicle_data_report.html")  # Save the report


### Step 3: Handle Missing and Corrupted Values

In [11]:
# Check for missing values

missing_values = data.isnull().sum()

print(missing_values[missing_values > 0])


# Handle missing values: fill numeric columns with their mean.
# numeric_only=True is needed because data.mean() raises a TypeError on text columns.

data.fillna(data.mean(numeric_only=True), inplace=True)  # or data.dropna(inplace=True)

County                    4
City                      9
State                    11
ZIP Code                  6
Model Year                7
Make                      4
Model                    13
Legislative District    169
Vehicle Location        510
Electric Utility        722
dtype: int64

In [19]:
print(data.dtypes)

ID                                                    object
VIN (1-10)                                            object
County                                                object
City                                                  object
State                                                 object
ZIP Code                                             float64
Model Year                                           float64
Make                                                  object
Model                                                 object
Electric Vehicle Type                                 object
Clean Alternative Fuel Vehicle (CAFV) Eligibility     object
Electric Range                                         int64
Base MSRP                                              int64
Legislative District                                 float64
DOL Vehicle ID                                         int64
Vehicle Location                                      object
Electric Utility        

In [21]:
numeric_data = data.select_dtypes(include=['number'])

In [23]:
data[numeric_data.columns] = numeric_data.fillna(numeric_data.mean())
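The mean fill above only covers numeric columns, so text columns such as County or Make keep their gaps. One possible way to handle both kinds of column is sketched below on a small made-up frame; the values and the choice of the mode for text columns are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "Model Year": [2020.0, np.nan, 2015.0, 2013.0],
    "Make": ["TESLA", "NISSAN", None, "TESLA"],
})

# Numeric columns: fill gaps with the column mean
num_cols = toy.select_dtypes(include="number").columns
toy[num_cols] = toy[num_cols].fillna(toy[num_cols].mean())

# Text columns: fill gaps with the most frequent value (the mode)
cat_cols = toy.select_dtypes(exclude="number").columns
for col in cat_cols:
    toy[col] = toy[col].fillna(toy[col].mode().iloc[0])

print(toy)
```

Dropping the affected rows with `dropna` is an equally valid choice when only a handful of values are missing, as is the case for most columns here.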

In [32]:
print(data.columns)

Index(['ID', 'VIN (1-10)', 'County', 'City', 'State', 'ZIP Code', 'Model Year',
       'Make', 'Model', 'Electric Vehicle Type',
       'Clean Alternative Fuel Vehicle (CAFV) Eligibility', 'Electric Range',
       'Base MSRP', 'Legislative District', 'DOL Vehicle ID',
       'Vehicle Location', 'Electric Utility', 'Expected Price ($1k)'],
      dtype='object')


In [56]:
print(data.head())

        ID  VIN (1-10)     County        City State  ZIP Code  Model Year  \
0  EV33174  5YJ3E1EC6L  Snohomish    LYNNWOOD    WA   98037.0      2020.0   
1  EV40247  JN1AZ0CP8B     Skagit  BELLINGHAM    WA   98229.0      2011.0   
2  EV12248  WBY1Z2C56F     Pierce      TACOMA    WA   98422.0      2015.0   
3  EV55713  1G1RD6E44D       King     REDMOND    WA   98053.0      2013.0   
4  EV28799  1G1FY6S05K     Pierce    PUYALLUP    WA   98375.0      2019.0   

        Make    Model                   Electric Vehicle Type  \
0      TESLA  MODEL 3          Battery Electric Vehicle (BEV)   
1     NISSAN     LEAF          Battery Electric Vehicle (BEV)   
2        BMW       I3          Battery Electric Vehicle (BEV)   
3  CHEVROLET     VOLT  Plug-in Hybrid Electric Vehicle (PHEV)   
4  CHEVROLET  BOLT EV          Battery Electric Vehicle (BEV)   

  Clean Alternative Fuel Vehicle (CAFV) Eligibility  Electric Range  \
0           Clean Alternative Fuel Vehicle Eligible             308   
1   

In [58]:
# Strip whitespace from column names
data.columns = data.columns.str.strip()

In [60]:
data['Index'] = range(len(data))  # Example: create an index column

In [64]:
# Assuming you have a column named 'Index' that you want to convert
data['Index'] = pd.to_numeric(data['Index'], errors='coerce')

In [68]:
# Check the DataFrame after modifications

print(data)

            ID  VIN (1-10)     County        City State  ZIP Code  Model Year  \
0      EV33174  5YJ3E1EC6L  Snohomish    LYNNWOOD    WA   98037.0      2020.0   
1      EV40247  JN1AZ0CP8B     Skagit  BELLINGHAM    WA   98229.0      2011.0   
2      EV12248  WBY1Z2C56F     Pierce      TACOMA    WA   98422.0      2015.0   
3      EV55713  1G1RD6E44D       King     REDMOND    WA   98053.0      2013.0   
4      EV28799  1G1FY6S05K     Pierce    PUYALLUP    WA   98375.0      2019.0   
...        ...         ...        ...         ...   ...       ...         ...   
64348   EV6357  KNDCE3LG7L       King     SEATTLE    WA   98144.0      2020.0   
64349    EV423  JTDKN3DP2D     Pierce      TACOMA    WA   98402.0      2013.0   
64350  EV27852  1G1FX6S05J       King     SEATTLE    WA   98119.0      2018.0   
64351    EV830  WP1AE2A24H       King     SEATTLE    WA   98115.0      2017.0   
64352  EV11120  1N4BZ1CP8K      Lewis      TOLEDO    WA   98591.0      2019.0   

            Make          M

In [None]:
data.fillna(data.select_dtypes(include=['number']).mean(), inplace=True)

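In [None]:
# Coerce only the price column to numeric. Applying pd.to_numeric to every column
# would turn all text columns (County, Make, Model, ...) into NaN.
data['Expected Price ($1k)'] = pd.to_numeric(data['Expected Price ($1k)'], errors='coerce')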

### Step 4: Remove Duplicates

In [None]:
# Remove duplicates if they exist

data.drop_duplicates(inplace=True)

### Step 5: Handle Outliers

In [None]:
# Visualize outliers using boxplots

plt.figure(figsize=(10, 6))

sns.boxplot(data=data.select_dtypes(include='number'))  # boxplots only make sense for numeric columns

plt.show()


In [70]:
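# IQR rule: keep rows within 1.5 * IQR of the first and third quartiles.
# Note: 'Index' is just a row counter, so this filter removes nothing;
# apply the same rule to a real numeric column such as 'Electric Range' to drop actual outliers.

Q1 = data['Index'].quantile(0.25)

Q3 = data['Index'].quantile(0.75)

IQR = Q3 - Q1

data = data[(data['Index'] >= (Q1 - 1.5 * IQR)) & (data['Index'] <= (Q3 + 1.5 * IQR))]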
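As a rough illustration of the same IQR rule on a column that can actually contain outliers, the sketch below wraps it in a helper. The function name `iqr_filter` and the toy values are hypothetical; only the column name 'Electric Range' comes from the dataset.

```python
import pandas as pd


def iqr_filter(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose `column` value lies within k * IQR of the quartiles."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    mask = df[column].between(q1 - k * iqr, q3 + k * iqr)
    return df[mask]


# Made-up ranges with one obviously extreme value
toy = pd.DataFrame({"Electric Range": [80, 90, 100, 110, 120, 900]})
filtered = iqr_filter(toy, "Electric Range")
print(len(toy), "->", len(filtered))
```

The multiplier `k` is left as a parameter because 1.5 is a convention rather than a rule; a larger value keeps more of the tail.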

In [72]:
print(data.columns) 

Index(['ID', 'VIN (1-10)', 'County', 'City', 'State', 'ZIP Code', 'Model Year',
       'Make', 'Model', 'Electric Vehicle Type',
       'Clean Alternative Fuel Vehicle (CAFV) Eligibility', 'Electric Range',
       'Base MSRP', 'Legislative District', 'DOL Vehicle ID',
       'Vehicle Location', 'Electric Utility', 'Expected Price ($1k)',
       'Index'],
      dtype='object')


### Step 6: Encode Categorical Features

In [None]:
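# Drop identifier-like columns first: one-hot encoding 'ID' or 'VIN (1-10)'
# would create one dummy column per unique value.
data = data.drop(columns=['ID', 'VIN (1-10)', 'DOL Vehicle ID', 'Index'], errors='ignore')

# Convert the remaining categorical features to numerical using one-hot encoding
data = pd.get_dummies(data, drop_first=True)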
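To make the effect of one-hot encoding concrete, here is a small sketch on a made-up frame. The rows are invented for illustration; the point is only to show which columns `pd.get_dummies` produces and why a per-row identifier is better dropped beforehand.

```python
import pandas as pd

# Toy rows (made up) with an identifier, one categorical and one numeric column
toy = pd.DataFrame({
    "ID": ["EV1", "EV2", "EV3", "EV4"],
    "Make": ["TESLA", "NISSAN", "BMW", "TESLA"],
    "Electric Range": [308, 73, 81, 220],
})

# Drop the identifier first: it is unique per row and carries no signal
features = toy.drop(columns=["ID"])

# drop_first=True removes one dummy per feature to avoid redundant columns
encoded = pd.get_dummies(features, columns=["Make"], drop_first=True)
print(encoded.columns.tolist())
```

With `drop_first=True` the first category in sorted order (here BMW) becomes the implicit baseline, so a row with all `Make_*` columns set to false is a BMW.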

### Step 7: Select Target Variable and Features

In [None]:
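# Select target variable and features.
# The dataset has no 'price' column; the price column listed in data.columns is 'Expected Price ($1k)'.

target = 'Expected Price ($1k)'

X = data.drop(target, axis=1)  # Features

y = data[target]  # Target variable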

In [None]:
data

### Step 8: Split Dataset into Training and Test Sets

In [None]:
# Split the dataset into training and test sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### Step 9: Build and Train an SVM Model

In [None]:
# Build and train the SVM model

model = SVR(kernel='rbf')  # You can experiment with different kernels

model.fit(X_train, y_train)
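Because an RBF-kernel SVR works on distances between samples, features on very different scales (for example ZIP codes next to model years) can dominate the fit. A common remedy is to standardize inside a pipeline so the scaler is fitted on the training split only. The sketch below uses synthetic data and an arbitrary `C=10.0`; both are assumptions for illustration, not settings taken from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic data: two features on very different scales
rng = np.random.default_rng(42)
X = np.column_stack([rng.uniform(0, 1, 300), rng.uniform(0, 10_000, 300)])
y = 50 * X[:, 0] + 0.001 * X[:, 1] + rng.normal(0, 0.5, 300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The scaler is fitted on the training data only, then reused at predict time
svr_pipeline = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0))
svr_pipeline.fit(X_train, y_train)

test_r2 = svr_pipeline.score(X_test, y_test)
print("R^2 on held-out data:", test_r2)
```

Note that SVR training time grows quickly with the number of rows, so fitting on a sample of the roughly 64,000 rows first may be a practical starting point.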

### Step 10: Assess Model Performance

In [None]:
# Make predictions on the test set

y_pred = model.predict(X_test)


# Evaluate the model performance

mse = mean_squared_error(y_test, y_pred)

r2 = r2_score(y_test, y_pred)


print(f'Mean Squared Error: {mse}')

print(f'R-squared: {r2}')
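MSE is expressed in squared units, which makes it hard to read when the target is a price in thousands of dollars. RMSE and MAE are in the same units as the target and can be reported alongside it. The sketch below uses four made-up values rather than the notebook's predictions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up true and predicted prices (in $1k)
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_hat = np.array([12.0, 18.0, 33.0, 37.0])

mse = mean_squared_error(y_true, y_hat)
rmse = np.sqrt(mse)                        # same units as the target
mae = mean_absolute_error(y_true, y_hat)   # less sensitive to large errors
r2 = r2_score(y_true, y_hat)

print(f"MSE: {mse:.3f}  RMSE: {rmse:.3f}  MAE: {mae:.3f}  R-squared: {r2:.3f}")
```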

### Step 11: Discuss Alternative Ways to Improve Model Performance

 1. Hyperparameter Tuning: Use techniques like GridSearchCV or RandomizedSearchCV to find the best hyperparameters for the SVM model.
 2. Feature Engineering: Create new features based on existing ones or transform features to improve model performance.
 3. Scaling Features: Standardize or normalize the features, as SVM is sensitive to the scale of the data.
 4. Try Different Models: Experiment with other regression models like Random Forest, Gradient Boosting, or Neural Networks.
 5. Cross-Validation: Use cross-validation to ensure that the model's performance is consistent across different subsets of the data.
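Items 1, 3 and 5 of the list above can be combined in a single step: a pipeline that scales the features, wrapped in a cross-validated grid search. The following is a minimal sketch on synthetic data; the grid values are arbitrary starting points chosen for illustration, not tuned recommendations.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic regression data standing in for the encoded features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=200)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("svr", SVR(kernel="rbf")),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {
    "svr__C": [1.0, 10.0],
    "svr__gamma": ["scale", 0.1],
}

# 5-fold cross-validation; scikit-learn maximizes scores, hence the negated MSE
search = GridSearchCV(pipe, param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best CV MSE:", -search.best_score_)
```

Keeping the scaler inside the pipeline means it is refitted on each training fold, which avoids leaking information from the validation fold into the scaling statistics.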
