##  Correct Order of Steps for Support Vector Machine Regression
#### 1.	Handle duplicates & missing values
#### 2.	Treat outliers (using IQR, capping, or transformations)
#### 3.	Apply Target Variable Transformation (If necessary)
#### 4.	Split data into train & test sets
#### 5.	Apply categorical encoding (Only on training data, then transform test data)
#### 6.	Handle multi-collinearity (VIF check, dropping highly correlated features)
#### 7.	Normalize/Standardize numerical features (fit on training data only, then transform test data) (not necessary for trees, but recommended for SVMs)
#### 8.	Train Support Vector Regression (SVR) model
#### 9.	Evaluate the model
#### 10. Apply Hyperparameter tuning using Grid search
#### 11. Make final predictions

## Import Libraries

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn import tree
from sklearn.metrics import confusion_matrix
from sklearn.metrics import r2_score
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import accuracy_score, classification_report, mean_squared_error

## Data Import

In [None]:
from google.colab import drive
drive.mount('/content/drive')


In [None]:
from google.colab import files
uploaded = files.upload()


In [None]:
!ls

In [None]:
electric = pd.read_csv('train.csv')
print(electric.head())

## Data Overview
#### VIN (1-10) - The 1st 10 characters of each vehicle's Vehicle Identification Number (VIN).
#### County - The county in which the registered owner resides.
#### City - The city in which the registered owner resides.
#### State - The state in which the registered owner resides.
#### ZIP Code - The 5-digit zip code in which the registered owner resides.
#### Model Year - The model year of the vehicle is determined by decoding the Vehicle Identification Number (VIN).
#### Make - The manufacturer of the vehicle, determined by decoding the Vehicle Identification Number (VIN).
#### Model- The model of the vehicle is determined by decoding the Vehicle Identification Number (VIN).
#### Electric Vehicle Type - This distinguishes the vehicle as all-electric or a plug-in hybrid.
#### Clean Alternative Fuel Vehicle (CAFV) Eligibility - This categorizes vehicles as Clean Alternative Fuel Vehicles (CAFVs) based on the fuel requirement and electric-only range requirement.
#### Electric Range - Describes how far a vehicle can travel purely on its electric charge.
#### Base MSRP - This is the lowest Manufacturer's Suggested Retail Price (MSRP) for any trim level of the model in question.
#### Legislative District - The specific section of Washington State that the vehicle's owner resides in, as represented in the state legislature.
#### DOL Vehicle ID - Unique number assigned to each vehicle by the Department of Licensing for identification purposes.
#### Vehicle Location - The center of the ZIP Code for the registered vehicle.
#### Electric Utility - This is the electric power retail service territory serving the address of the registered vehicle.
#### Expected Price - This is the expected price of the vehicle.

## Data type

In [None]:
electric.info()

In [None]:
electric['ZIP Code'] = electric['ZIP Code'].astype(str)
electric['Legislative District'] = electric['Legislative District'].astype(str)


In [None]:
electric.describe()

In [None]:
electric = electric.drop(['ID','DOL Vehicle ID','Base MSRP','VIN (1-10)','Vehicle Location'],axis=1)

## Treatment of null values

In [None]:
electric.isnull().sum()

In [None]:
electric['Expected Price ($1k)'] = electric['Expected Price ($1k)'].replace('N/',np.nan).astype(float)
electric = electric.dropna(subset=['Expected Price ($1k)'])


In [None]:
electric = electric.fillna(electric.mode().iloc[0])

In [None]:
electric.isnull().sum()

### Outlier treatment using IQR

In [None]:
fig, axarr  = plt.subplots(figsize=(10,10))
sns.boxplot(data=electric)

### No outliers are present in 'Electric Range'
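
The box plot only visualises outliers. If a numeric column did show them, the IQR capping named in step 3 could be sketched roughly as follows. The helper name `cap_outliers_iqr`, the 1.5 multiplier, and the toy data are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def cap_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to those bounds."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series.clip(lower=lower, upper=upper)

# Toy data standing in for a numeric column such as 'Electric Range'
demo = pd.Series([10, 12, 11, 13, 12, 300])
capped = cap_outliers_iqr(demo)
print(capped.tolist())
```

Capping keeps every row and only pulls extreme values back to the fence, which is usually gentler than dropping rows when the dataset is small.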

## Target Variable Transformation

In [None]:
electric['Expected Price ($1k)'] = np.log1p(electric['Expected Price ($1k)'])

## Splitting the data into training & testing sets.

In [None]:
X = electric.drop(columns=['Expected Price ($1k)'])  # Features
y = electric['Expected Price ($1k)']  # Target variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


## Treatment of Numerical columns

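#### Standardization/Normalization is skipped here because 'Electric Range' is the only numerical variable at this stage. Note that SVR is scale-sensitive, so scaling all features after encoding may still improve results.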

## Treatment of Categorical columns

In [None]:
# Categorical Column types:
# County : Nominal, 140 categories
# City: Nominal, 545 Categories
# State: Nominal, 39 Categories
# ZIP Code: Nominal, 679 Categories
# Make: Nominal, 35 Categories
# Model: Nominal, 107 Categories
# Electric Vehicle Type: Nominal, 2 Categories
# Clean Alternative Fuel Vehicle (CAFV) Eligibility: Nominal, 3 Categories
# Legislative District: Nominal, 51 Categories
# Electric Utility: Nominal, 69 Categories

In [None]:
# Here we have 'Electric Vehicle Type' and 'CAFV Eligibility' where we can apply One-Hot Encoding or Dummies.

low_cardinality_features = ['Electric Vehicle Type', 'Clean Alternative Fuel Vehicle (CAFV) Eligibility']

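X_train = pd.get_dummies(X_train, columns=low_cardinality_features, drop_first=True)
X_test = pd.get_dummies(X_test, columns=low_cardinality_features, drop_first=True)

# Align test columns with train: a category missing from one split would otherwise give mismatched dummy columns
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)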

In [None]:
# To all other variables which have higher number of categories, we can apply Target encoding.
!pip install category_encoders
import category_encoders
from category_encoders import TargetEncoder
high_cardinality_features = ['County', 'City', 'State', 'ZIP Code', 'Make', 'Model', 'Legislative District', 'Electric Utility']

target_enc = TargetEncoder()
X_train[high_cardinality_features] = target_enc.fit_transform(X_train[high_cardinality_features], y_train)
X_test[high_cardinality_features] = target_enc.transform(X_test[high_cardinality_features])
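
Target encoding replaces each category with a statistic of the target computed on the training rows only, which is why the encoder is fitted on `X_train` and merely applied to `X_test`. The sketch below illustrates the idea with a plain per-category mean in pandas. It is a simplification: `category_encoders.TargetEncoder` additionally smooths each category mean towards the global mean, and the helper names and toy values here are hypothetical.

```python
import pandas as pd

def fit_mean_encoding(col: pd.Series, target: pd.Series):
    """Learn per-category target means plus a global fallback from training data."""
    means = target.groupby(col).mean()
    return means, target.mean()

def apply_mean_encoding(col: pd.Series, means: pd.Series, fallback: float) -> pd.Series:
    """Map categories to the learned means; unseen categories get the fallback."""
    return col.map(means).fillna(fallback)

# Toy training data: a 'Make' column and a (log-scale) price target
train_make = pd.Series(["TESLA", "TESLA", "NISSAN", "NISSAN"])
train_price = pd.Series([4.0, 5.0, 2.0, 3.0])
test_make = pd.Series(["TESLA", "KIA"])  # 'KIA' never appears in training

means, global_mean = fit_mean_encoding(train_make, train_price)
encoded_test = apply_mean_encoding(test_make, means, global_mean)
print(encoded_test.tolist())
```

Because the test rows never contribute to the learned means, the encoded test features carry no information from `y_test`.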


## Support Vector Machine model building

In [20]:
from sklearn.svm import SVR
clf = SVR(kernel='rbf', C=100, epsilon=0.1)
clf.fit(X_train, y_train)
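
An RBF-kernel SVR works on distances between rows, so features on very different scales (a range in miles next to target-encoded log prices) can dominate one another. One way to apply step 7 is to wrap the scaler and the model in a scikit-learn pipeline, which fits the scaler on the training rows only. The sketch below uses synthetic data, and all variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(42)
# Two synthetic features on very different scales
X_demo = np.column_stack([rng.uniform(0, 300, 200), rng.uniform(0, 1, 200)])
y_demo = 0.01 * X_demo[:, 0] + 2.0 * X_demo[:, 1] + rng.normal(0, 0.05, 200)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The scaler is fitted inside the pipeline, using training rows only
scaled_svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=100, epsilon=0.1))
scaled_svr.fit(Xd_train, yd_train)

rmse_demo = np.sqrt(mean_squared_error(yd_test, scaled_svr.predict(Xd_test)))
print("Demo RMSE:", rmse_demo)
```

Keeping the scaler inside the pipeline also means cross-validation re-fits it per fold, which avoids leaking validation statistics into training.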

### Model Evaluation

In [21]:
if y.nunique() <= 2:  # Classification
    y_pred = clf.predict(X_test)
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("Classification Report:\n", classification_report(y_test, y_pred))
else:  # Regression
    y_pred = clf.predict(X_test)
    print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))

RMSE: 0.21835584284487897
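
Because the target was transformed with `np.log1p`, this RMSE is in log units rather than in thousands of dollars. A sketch of reporting the error on the original scale, by inverting the transform with `np.expm1` before scoring, is shown below. The prices and the "predictions" are made-up numbers for illustration only.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical prices in $1k, and predictions made on the log1p scale
true_price = np.array([20.0, 35.0, 50.0])
y_true_log = np.log1p(true_price)
y_pred_log = y_true_log + np.array([0.1, -0.1, 0.0])  # pretend model output

# Error measured on the transformed (log) scale
rmse_log = np.sqrt(mean_squared_error(y_true_log, y_pred_log))

# Error measured after mapping both arrays back to $1k with expm1
rmse_price = np.sqrt(
    mean_squared_error(np.expm1(y_true_log), np.expm1(y_pred_log))
)

print(f"RMSE (log scale): {rmse_log:.4f}")
print(f"RMSE ($1k): {rmse_price:.4f}")
```

In the notebook itself, the same idea would apply to `y_test` and `y_pred`.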


### Hyperparameter Tuning using Grid-search

In [None]:
param_grid = {
    'C': [1, 10, 100, 1000],
    'epsilon': [0.01, 0.1, 0.5, 1],
    'kernel': ['linear', 'rbf', 'poly']
}
grid_search = GridSearchCV(clf, param_grid, cv=5, scoring='accuracy' if y.nunique() <= 2 else 'neg_mean_squared_error')
grid_search.fit(X_train, y_train)
print("Best Parameters:", grid_search.best_params_)
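
With `scoring='neg_mean_squared_error'`, `best_score_` is the negated mean squared error averaged over the folds, so it needs a sign flip and a square root before it can be compared with an RMSE. The sketch below shows this on synthetic data with a deliberately tiny grid. The `svr__` prefixes follow scikit-learn's pipeline step naming; the data, step names, and grid values are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(150, 2))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] + rng.normal(0, 0.05, 150)

# Scaling lives inside the pipeline so each CV fold fits its own scaler
pipe = Pipeline([("scale", StandardScaler()), ("svr", SVR())])
small_grid = {
    "svr__C": [1, 10],
    "svr__epsilon": [0.01, 0.1],
    "svr__kernel": ["linear", "rbf"],
}

search = GridSearchCV(pipe, small_grid, cv=3, scoring="neg_mean_squared_error")
search.fit(X_demo, y_demo)

# best_score_ is negative MSE: flip the sign, then take the square root
cv_rmse = np.sqrt(-search.best_score_)
print("Best parameters:", search.best_params_)
print("Cross-validated RMSE:", cv_rmse)
```

A full grid with `C=1000` and a linear or polynomial kernel can be slow on larger datasets, so starting with a small grid like this one and widening it gradually is often more practical.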

### Refitting the SVR with the Best Parameters

In [None]:
# Rebuild the SVR from the tuned parameters (SVR has no random_state argument)
pruned_clf = SVR(**grid_search.best_params_)
pruned_clf.fit(X_train, y_train)

### Final Predictions

In [None]:
y_final_pred = pruned_clf.predict(X_test)
print("Final Accuracy:" if y.nunique() <= 2 else "Final RMSE:", accuracy_score(y_test, y_final_pred) if y.nunique() <= 2 else np.sqrt(mean_squared_error(y_test, y_final_pred)))


In [None]:
from sklearn.metrics import r2_score

def adjusted_r2(y_true, y_pred, X):
    """Calculate Adjusted R²"""
    r2 = r2_score(y_true, y_pred)
    n = X.shape[0]  # Number of samples
    p = X.shape[1]  # Number of predictors
    adj_r2 = 1 - ((1 - r2) * (n - 1) / (n - p - 1))
    return r2, adj_r2

# Example usage with your trained model
y_pred = pruned_clf.predict(X_test)
r2, adj_r2 = adjusted_r2(y_test, y_pred, X_test)

print(f"R² Score: {r2:.4f}")
print(f"Adjusted R² Score: {adj_r2:.4f}")
