### What You're Aiming For
- In this checkpoint, we are going to work on the 'Electric Vehicle Data' dataset that was provided by Kaggle as part of the Electric Vehicle Price Prediction competition.

- Dataset description: This dataset contains information on the Battery Electric Vehicles (BEVs) and Plug-in Hybrid Electric Vehicles (PHEVs) that are currently registered with the Washington State Department of Licensing (DOL). It was introduced as part of an official invitation-based competition on Kaggle. Our SVM model should answer the question: "Given my car's make and model, along with a few other parameters, at what price can this vehicle be bought or sold?"

##### Data Overview
- VIN (1-10) - The 1st 10 characters of each vehicle's Vehicle Identification Number (VIN).
- County - The county in which the registered owner resides.
- City - The city in which the registered owner resides.
- State - The state in which the registered owner resides.
- ZIP Code - The 5-digit zip code in which the registered owner resides.
- Model Year - The model year of the vehicle, determined by decoding the Vehicle Identification Number (VIN).
- Make - The manufacturer of the vehicle, determined by decoding the Vehicle Identification Number (VIN).
- Model - The model of the vehicle, determined by decoding the Vehicle Identification Number (VIN).
- Electric Vehicle Type - This distinguishes the vehicle as all-electric or a plug-in hybrid.
- Clean Alternative Fuel Vehicle (CAFV) Eligibility - This categorizes vehicles as Clean Alternative Fuel Vehicles (CAFVs) based on the fuel requirement and electric-only range requirement.
- Electric Range - Describes how far a vehicle can travel purely on its electric charge.
- Base MSRP - This is the lowest Manufacturer's Suggested Retail Price (MSRP) for any trim level of the model in question.
- Legislative District - The specific section of Washington State that the vehicle's owner resides in, as represented in the state legislature.
- DOL Vehicle ID - Unique number assigned to each vehicle by the Department of Licensing for identification purposes.
- Vehicle Location - The center of the ZIP Code for the registered vehicle.
- Electric Utility - This is the electric power retail service territory serving the address of the registered vehicle.
- Expected Price ($1k) - The expected price of the vehicle, in thousands of dollars. This is the target variable.

##### Instructions

- Import your data and perform a basic data exploration phase
    - Display general information about the dataset
    - Create a pandas profiling report to gain insights into the dataset
    - Handle missing and corrupted values
    - Remove duplicates, if they exist
    - Handle outliers, if they exist
    - Encode categorical features
- Select your target variable and the features
- Split your dataset into training and test sets
- Build and train an SVM model on the training set
- Assess your model performance on the test set using relevant evaluation metrics
- Discuss with your cohort alternative ways to improve your model performance


In [None]:
import warnings
warnings.simplefilter("ignore")

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats  # Statistical helpers (Z-scores for outlier detection)
from sklearn.model_selection import train_test_split, GridSearchCV  # Data splitting and hyperparameter search
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder  # Feature scaling and categorical encoding
from sklearn.svm import SVR  # Support vector regression
from sklearn.metrics import mean_squared_error, r2_score, make_scorer  # Model evaluation metrics

In [None]:
data = pd.read_csv("Electric_cars_dataset.csv")

In [None]:
data.head()

In [None]:
data.info()

In [None]:
data.describe(include = "number").T

In [None]:
data.describe(include = 'object')

In [None]:
data = data.drop("ID", axis = 1)

In [None]:
data = data.drop("VIN (1-10)", axis = 1)

In [None]:
print(data.isnull().sum())

In [None]:
data = data.dropna()

In [None]:
data.drop_duplicates(inplace=True)

In [None]:
data["Base MSRP"].value_counts()

In [None]:
data['State'].value_counts()

In [None]:
data['City'].value_counts()

In [None]:
# Use ydata-profiling to generate a report of the provided dataset.
from ydata_profiling import ProfileReport

profile = ProfileReport(data, title="Electric Cars", explorative=True)

profile.to_file("electric_cars_dataset.html")

In [None]:
profile.to_notebook_iframe()

In [None]:
data.head()

In [None]:
data.info()

In [None]:
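# Convert the target to numeric; entries that cannot be parsed become NaN
data['Expected Price ($1k)'] = pd.to_numeric(data['Expected Price ($1k)'], errors='coerce')
# Drop rows with a missing target instead of filling with 0, which would be learned as a real price
data = data.dropna(subset=['Expected Price ($1k)'])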

In [None]:
data['Expected Price ($1k)'] = data['Expected Price ($1k)'].astype(float)
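
Before deciding how to treat unparseable target values, it can help to count them. The sketch below uses a small hypothetical series (`raw_prices` and its values are made up for illustration, not taken from the dataset) to show what `errors="coerce"` does and how many rows would be affected. Whether such rows are then filled or dropped is a modelling choice; this sketch drops them.

```python
import pandas as pd

# Hypothetical raw target column with two entries that are not numbers
raw_prices = pd.Series(["12.5", "N/A", "30", "unknown"])

# errors="coerce" turns anything that cannot be parsed into NaN
parsed = pd.to_numeric(raw_prices, errors="coerce")

# Count the affected rows before choosing a strategy
n_bad = int(parsed.isna().sum())
print(f"{n_bad} of {len(raw_prices)} entries could not be parsed")

# Dropping keeps only rows with a usable target value
clean_prices = parsed.dropna()
print(clean_prices.tolist())
```

If the share of unparseable rows is large, it is worth inspecting the raw strings before discarding them.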

In [None]:
categorical_features = data.select_dtypes(include='object').columns
categorical_features

In [None]:
plt.figure(figsize=(25, 88))  # Tall figure so each row of subplots stays readable
for i in range(0, len(categorical_features)):
    plt.subplot(11, 4, i+1)  # 11x4 grid leaves room for up to 44 categorical features
    sns.boxplot(x=categorical_features[i], y='Expected Price ($1k)', data=data, palette='viridis')
    plt.title(f'Expected Price ($1k) vs. {categorical_features[i]}', fontsize=15)
    plt.xlabel(categorical_features[i], fontsize=12)
    plt.xticks(rotation=45)  # Rotate x-axis labels for better readability
    plt.ylabel('Expected Price ($1k)', fontsize=12)  # Add y-axis label for clarity

# Apply tight_layout after all subplots are created
plt.tight_layout()
plt.show()

In [None]:
numerical_features = data.select_dtypes(include='number').columns
numerical_features

In [None]:
plt.figure(figsize=(25, 25))
for i in range(0, len(numerical_features)):
    plt.subplot(10, 4, i+1)
    sns.boxplot(x = data[numerical_features[i]], palette = 'viridis')
    plt.title(numerical_features[i], fontsize = 30)
    plt.xlabel(' ')

# Apply tight_layout once, after all subplots are created
plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(15, 7.5))
correlation_matrix = data[numerical_features].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix')
plt.show()

In [None]:
# Drop the columns that are not used as model features
data = data.drop(
    columns=[
        "Legislative District", "ZIP Code", "DOL Vehicle ID", "Base MSRP",
        "County", "Vehicle Location", "Electric Utility", "State", "City",
    ]
)

In [None]:
# Removing outliers
from scipy.stats import zscore

# Re-select the numeric columns: several were dropped above, so indexing with
# the earlier `numerical_features` would raise a KeyError
numeric_cols = data.select_dtypes(include='number').columns

# Calculate Z-scores for the remaining numerical columns
z_scores = data[numeric_cols].apply(zscore)

# Z-score threshold for detecting outliers (commonly 3)
threshold = 3

# Keep only the rows that lie within the threshold in every numeric column.
# (Filtering inside a loop would overwrite the result and apply only the last column's filter.)
data_no_outliers = data[(z_scores.abs() < threshold).all(axis=1)].copy()

# Print the shape of the DataFrame before and after removing outliers
print("Original shape:", data.shape)
print("Shape after removing outliers:", data_no_outliers.shape)
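
As a quick sanity check of the Z-score filter, the sketch below applies the same idea to a toy frame. The helper name `drop_zscore_outliers` and all the values are hypothetical, chosen only so that one row is an obvious outlier.

```python
import pandas as pd
from scipy.stats import zscore


def drop_zscore_outliers(df, columns, threshold=3.0):
    """Return the rows of df whose Z-score is within +/- threshold in every listed column."""
    z = df[columns].apply(zscore)
    keep = (z.abs() < threshold).all(axis=1)
    return df[keep].copy()


# Toy frame: 20 typical rows plus one row with an extreme price (made-up values)
toy_cars = pd.DataFrame({
    "range": [100 + i for i in range(20)] + [110],
    "price": [30.0 + 0.5 * i for i in range(20)] + [500.0],
})

toy_cleaned = drop_zscore_outliers(toy_cars, ["range", "price"])
print("Before:", toy_cars.shape, "After:", toy_cleaned.shape)
```

Note that with very few rows a single extreme value cannot reach a Z-score of 3, because it inflates the standard deviation it is measured against, so this rule is only meaningful on reasonably sized samples.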

In [None]:
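categorical_cols = ['Make', 'Model', 'Electric Vehicle Type', 'Clean Alternative Fuel Vehicle (CAFV) Eligibility']

# Work on an explicit copy so the assignments below do not trigger SettingWithCopyWarning
data_no_outliers = data_no_outliers.copy()

# Fit one LabelEncoder per column and keep it, so the integer codes can be mapped back to labels later
label_encoders = {}
for col in categorical_cols:
    label_encoders[col] = LabelEncoder()
    data_no_outliers[col] = label_encoders[col].fit_transform(data_no_outliers[col])

# Check the encoded dataset
data_no_outliers.head()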
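
Label encoding maps each category to an integer, which implies an ordering between makes or models that does not really exist, and a distance-based model such as an RBF-kernel SVR will treat that ordering as meaningful. One alternative worth comparing is one-hot encoding. The sketch below is only an illustration on a made-up frame (`toy_vehicles` and its values are hypothetical); it uses scikit-learn's `ColumnTransformer` to one-hot encode a categorical column and scale a numeric one in a single step.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up sample with one categorical and one numeric feature
toy_vehicles = pd.DataFrame({
    "Make": ["TESLA", "NISSAN", "TESLA", "BMW"],
    "Model Year": [2020, 2018, 2021, 2019],
})

toy_preprocess = ColumnTransformer(
    [
        # handle_unknown="ignore" encodes categories unseen during fit as all zeros
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Make"]),
        ("num", StandardScaler(), ["Model Year"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

toy_encoded = toy_preprocess.fit_transform(toy_vehicles)
print(toy_encoded.shape)  # 3 one-hot columns for Make + 1 scaled numeric column
```

For a high-cardinality column such as `Model`, one-hot encoding produces many columns, so it is worth checking training time as well as accuracy when comparing the two encodings.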

In [None]:
data = data_no_outliers

In [None]:
data.head()

In [None]:
target = 'Expected Price ($1k)'
features = ['Make', 'Model', 'Model Year', 'Electric Vehicle Type', 'Electric Range', 'Clean Alternative Fuel Vehicle (CAFV) Eligibility']

X = data[features]
y = data[target]

In [None]:
# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Check the shape of the splits
X_train.shape, X_test.shape

In [None]:
# Standardizing the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Build and train the SVM model
svm_model = SVR(kernel='rbf', C=100, gamma='auto')
svm_model.fit(X_train_scaled, y_train)

In [None]:
# Predictions
y_pred = svm_model.predict(X_test_scaled)

# Evaluation
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse}")
print(f"R² Score: {r2}")
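
MSE is expressed in squared target units, which makes it hard to read against prices quoted in $1k. A small sketch, using made-up numbers rather than the model's actual predictions, shows how RMSE and MAE can be reported alongside it so the error is in the same unit as the target.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up true and predicted prices, in $1k
toy_y_true = np.array([30.0, 45.0, 60.0, 25.0])
toy_y_pred = np.array([32.0, 40.0, 63.0, 25.0])

toy_mse = mean_squared_error(toy_y_true, toy_y_pred)
toy_rmse = np.sqrt(toy_mse)  # same unit as the target
toy_mae = mean_absolute_error(toy_y_true, toy_y_pred)  # less sensitive to large errors than RMSE

print(f"MSE:  {toy_mse:.2f}")
print(f"RMSE: {toy_rmse:.2f}")
print(f"MAE:  {toy_mae:.2f}")
```

The same three calls can be applied to `y_test` and `y_pred` from the cell above.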


In [None]:
# Define parameter grid
param_grid = {
    'C': [1, 10, 100],
    'gamma': ['scale', 'auto'],
    'kernel': ['rbf', 'linear'],
}

# Perform GridSearch
grid_search = GridSearchCV(SVR(), param_grid, cv=5, scoring='r2')
grid_search.fit(X_train_scaled, y_train)

print(f"Best parameters: {grid_search.best_params_}")
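
Two follow-ups are worth considering after a grid search: scoring the tuned model on the held-out test set, and putting the scaler inside a `Pipeline` so that each cross-validation fold is scaled using only its own training part. The sketch below illustrates both on synthetic data (`X_demo`, `y_demo` and the reduced grid are assumptions for illustration, not the notebook's data or final settings).

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic regression problem: a linear signal plus a little noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The scaler is refit inside every CV fold because it is part of the pipeline
demo_pipe = Pipeline([("scaler", StandardScaler()), ("svr", SVR())])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
demo_grid = {"svr__C": [1, 10], "svr__kernel": ["rbf", "linear"]}

demo_search = GridSearchCV(demo_pipe, demo_grid, cv=5, scoring="r2")
demo_search.fit(X_tr, y_tr)

print("Best parameters:", demo_search.best_params_)
print("Best cross-validated R²:", demo_search.best_score_)

# score() uses the refitted best estimator and the scoring metric given above
demo_test_r2 = demo_search.score(X_te, y_te)
print("Test R²:", demo_test_r2)
```

Comparing the cross-validated score with the test score gives a rough indication of whether the tuned model generalises or has been fitted too closely to the training folds.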