<a href="https://colab.research.google.com/github/Abiola0101/Go-Data/blob/main/GoData.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# import necessary python libraries to visualize and manipulate the data
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler

In [None]:
#import data file into a pandas dataframe
df_goData = pd.read_csv('CBB_Listings.csv', on_bad_lines='skip')

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)


**1. Exploring the CBB_Listings Dataset**

In [None]:
# display the first 5 rows
df_goData.head()

In [None]:
# display the last 5 rows
df_goData.tail()

In [None]:
# making a list of the columns in the dataset
df_goData.columns

In [None]:
# get general information about the dataset
df_goData.info()

**Observations**

---


1. There are 145144 rows and 46 columns
2. 9 columns contain null values:

  listing_heading,
  dealer_email,
  dealer_phone,
  series,
  exterior_color,
  exterior_color_category,
  interior_color,
  interior_color_category,
  listing_dropoff_date

3. The entire column of dealer email has no data
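
The observations above can be reproduced programmatically rather than read off the `info()` output. The following is a minimal sketch on a hypothetical three-row frame (`toy` and its values are made up for illustration, not taken from the listings file); it lists the columns that contain nulls and the columns that are entirely empty.

```python
import pandas as pd

# Hypothetical toy frame standing in for the listings data
toy = pd.DataFrame({
    "price": [21000, 18500, 30250],
    "series": ["LX", None, "EX"],
    "dealer_email": [None, None, None],
})

# Number of null cells per column
null_counts = toy.isnull().sum()

# Columns with at least one null, and columns where every cell is null
columns_with_nulls = null_counts[null_counts > 0].index.tolist()
fully_empty_columns = null_counts[null_counts == len(toy)].index.tolist()

print(columns_with_nulls)   # ['series', 'dealer_email']
print(fully_empty_columns)  # ['dealer_email']
```

Applied to `df_goData`, the same two expressions would give the list of nine columns and the empty `dealer_email` column noted above.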


In [None]:
# get the number of rows and columns in the dataset

df_goData.shape

In [None]:
# checking the data type for each column
df_goData.dtypes

In [None]:
# checking for unique values in each column
df_goData.nunique()

In [None]:
# checking for duplicates

df_goData.duplicated().sum()

In [None]:
# generating correlation matrix
corr_matrix = df_goData.corr(numeric_only = True)
corr_matrix

In [None]:
# Creating a heatmap using Seaborn
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt= ".2f")
plt.title('Correlation Heatmap', size = 24)


In [None]:
%pip install ydata-profiling

In [None]:
from ydata_profiling import ProfileReport

In [None]:
profile = ProfileReport(df_goData, title="Pandas Profiling Report", explorative=True)
profile

### **2a. Data cleaning: Removing null values**

In [None]:
# enabling copy on write to avoid creating unnecessary copies
pd.options.mode.copy_on_write = True

In [None]:
# filling null values in every column with the placeholder 'unknown'
df_goData.fillna('unknown', inplace=True)

In [None]:
# confirming there are no cells containing null in the dataframe
df_goData.info()

In [None]:
# confirming there are no cells containing null in the dataframe
df_goData.isnull().sum()

### **2b. Data cleaning: Removing Duplicates**

In [None]:
# checking for duplicate rows
duplicate_rows = df_goData.duplicated()
duplicate_rows

In [None]:
duplicate_rows.nunique()

**`duplicate_rows` holds a single unique value. Because `duplicated()` never flags the first occurrence of a row, that value must be `False`, so the dataset contains no duplicate rows.**
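
The same conclusion can be read more directly by counting the flagged rows. This is a small sketch on hypothetical frames (`unique_rows`, `with_repeat` and the helper `count_duplicate_rows` are illustrative names, not part of the notebook).

```python
import pandas as pd

def count_duplicate_rows(frame: pd.DataFrame) -> int:
    """Return how many rows repeat an earlier row exactly."""
    return int(frame.duplicated().sum())

# Hypothetical examples: one frame without repeats, one with a repeated row
unique_rows = pd.DataFrame({"vin": ["A1", "B2", "C3"], "price": [100, 200, 300]})
with_repeat = pd.DataFrame({"vin": ["A1", "B2", "A1"], "price": [100, 200, 100]})

print(count_duplicate_rows(unique_rows))  # 0
print(count_duplicate_rows(with_repeat))  # 1
```

A count of 0 corresponds to the single-unique-value result obtained with `nunique()`.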

In [None]:
df_goData.shape

### **2c. Data cleaning: Removing Outliers**

**Removing outliers from price column**

In [None]:
# Calculate Q1 (25th percentile) and Q3 (75th percentile)
Q1 = df_goData['price'].quantile(0.25)
Q3 = df_goData['price'].quantile(0.75)

# Calculate IQR
IQR = Q3 - Q1

# Define the bounds for outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Remove outliers
df_no_outliers = df_goData[(df_goData['price'] >= lower_bound) & (df_goData['price'] <= upper_bound)]

#print("Original Data:")
#print(df_clean)
#print("\nData without outliers:")
#print(df_no_outliers)


In [None]:
df_no_outliers.shape

**Removing outliers from mileage column**

In [None]:
# Calculate Q1 (25th percentile) and Q3 (75th percentile)
Q1 = df_goData['mileage'].quantile(0.25)
Q3 = df_goData['mileage'].quantile(0.75)

# Calculate IQR
IQR = Q3 - Q1

# Define the bounds for outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Remove mileage outliers from the price-filtered data so that both filters apply
df_no_outliers = df_no_outliers[(df_no_outliers['mileage'] >= lower_bound) & (df_no_outliers['mileage'] <= upper_bound)]

In [None]:
df_no_outliers.shape

In [None]:
df_no_outliers.reset_index(drop=True, inplace=True)

In [None]:
df_no_outliers.info()

### **2d. Data cleaning: Recoding 6 and 7 in the transmission_from_vin column as manual (M) and automatic (A) respectively**

In [None]:
# checking the unique entries for transmission_from_vin column
df_no_outliers['transmission_from_vin'].unique()

In [None]:
df_no_outliers.replace({'transmission_from_vin': '6'}, 'M', inplace=True)
df_no_outliers.replace({'transmission_from_vin': '7'}, 'A', inplace=True)

In [None]:
# checking the unique entries for transmission_from_vin column after replacing 6 and 7
df_no_outliers['transmission_from_vin'].unique()

### **3. Identifying Significant Attributes for Problem 3.**

---



Based on our research into car features, we identified 17 features with high potential to predict vehicle transmission type accurately (18 columns including the target, transmission_from_vin). Following this selection, we use the Chi-square test to identify which of these candidates are most strongly associated with the target, thereby reducing the number of features from the initial selection.
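
To make the idea behind the test concrete, here is a minimal sketch of the classical Chi-square statistic for a contingency table of one categorical feature against the target. The counts are hypothetical, and `chi_square_statistic` is an illustrative helper; `sklearn.feature_selection.chi2` computes its statistic from non-negative feature values, so its numbers will generally differ from this hand calculation.

```python
import numpy as np

def chi_square_statistic(table) -> float:
    """Pearson Chi-square statistic for a contingency table of counts."""
    observed = np.asarray(table, dtype=float)
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    # Counts expected if the row and column variables were independent
    expected = row_totals @ col_totals / observed.sum()
    return float(((observed - expected) ** 2 / expected).sum())

# Rows: two hypothetical feature categories; columns: transmission A and M
independent = [[10, 20], [20, 40]]   # same A:M ratio in both rows
associated = [[10, 20], [30, 40]]    # ratio differs between rows

print(round(chi_square_statistic(independent), 4))  # 0.0
print(round(chi_square_statistic(associated), 4))   # 0.7937
```

A larger statistic (and therefore a smaller p-value) means the feature's categories split the target classes less evenly than independence would predict.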

In [None]:
# creating a new dataframe containing relevant features
df_features = df_no_outliers[['model_year', 'make', 'model', 'mileage', 'price', 'series', 'style', 'dealer_type', 'stock_type', 'days_on_market', 'certified', 'vin',
                         'drivetrain_from_vin', 'engine_from_vin', 'wheelbase_from_vin','fuel_type_from_vin', 'number_price_changes','transmission_from_vin']]

In [None]:
df_features.info()

In [None]:
from sklearn.feature_selection import chi2
from sklearn.preprocessing import OrdinalEncoder

X = df_features.drop('transmission_from_vin', axis=1)  # Features
y = df_features['transmission_from_vin']  # Target

# Convert categorical features to numerical using OrdinalEncoder
encoder = OrdinalEncoder() # Initialize OrdinalEncoder
X_encoded = encoder.fit_transform(X) # Fit and transform X

# Chi-squared test
chi_scores = chi2(X_encoded, y) # Use encoded X for chi2 test
p_values = pd.Series(chi_scores[1], index=X.columns)
p_values.sort_values(ascending=True, inplace=True)
print(p_values)  # Lower p-values indicate a stronger association with the target

From the result of Chi-Square test, **model_year, model, number_price_changes, stock_type, dealer_type, fuel_type_from_vin, and certified** have the lowest p-values and are the most useful in making accurate predictions. In addition to these 7, we will include **make, mileage and price** which we have been instructed to include as features in our model.

In [None]:
# creating a new dataframe containing the 10 selected features plus the target
df_model_features = df_features[['model_year', 'make', 'model', 'mileage', 'price', 'number_price_changes',
                              'stock_type', 'dealer_type', 'fuel_type_from_vin', 'certified', 'transmission_from_vin']]

# displaying the new dataframe
df_model_features.head()

In [None]:
df_model_features.shape

In [None]:
df_model_features.info()

### **4. Splitting data into Train and Test sets**

In [None]:
# importing the train_test_split function
from sklearn.model_selection import train_test_split

In [None]:
# defining the independent (X) and dependent (y) variables
X = df_model_features.drop('transmission_from_vin', axis=1)
y = df_model_features['transmission_from_vin']

In [None]:
# splitting the dataset into train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.65, stratify=y, random_state = 42)

In [None]:
# validating the shape of the train and test sets
X_train.shape, X_test.shape, y_train.shape, y_test.shape

In [None]:
X_train.head()

In [None]:
X_train.info()

In [None]:
X_test.head()

In [None]:
X_test.info()

In [None]:
y_train.head()

In [None]:
y_test.head()

### **5. Data Pre-processing - Encoding Categorical columns**

In [None]:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder

In [None]:
%pip install category_encoders

In [None]:
from category_encoders import BinaryEncoder

### **Encoding X_train**

**i) Encoding the 'make' column**

In [None]:
# checking the unique entries in 'make' column
X_train['make'].unique()


In [None]:
# checking the count of unique entries in 'make' column
X_train['make'].nunique()

In [None]:
# Encoding the 'make' column with BinaryEncoder
be_make = BinaryEncoder(cols = ['make'])
X_train = be_make.fit_transform(X_train)


In [None]:
# confirming the 'make' column has been encoded
X_train.head()

**ii) Encoding the 'model' column**

In [None]:
# checking the unique entries in 'model' column
X_train['model'].unique()


In [None]:
X_train['model'].nunique()

In [None]:
# Encoding the 'model' column with BinaryEncoder
be_model = BinaryEncoder(cols = ['model'])
X_train = be_model.fit_transform(X_train)


In [None]:
X_train.head()

In [None]:
X_train.shape

**iii) Encoding the 'stock_type' column**

In [None]:
# checking the unique entries in the 'stock_type' column
X_train['stock_type'].unique()

In [None]:
# Encoding the stock_type column with LabelEncoder

le_number_stock_type = LabelEncoder()
X_train['stock_type'] = le_number_stock_type.fit_transform(X_train['stock_type'])


In [None]:
# confirming the 'stock_type' column has been encoded
X_train.head()

**iv) Encoding the 'dealer_type' column**

In [None]:
# checking the unique entries in the 'dealer_type' column
X_train['dealer_type'].unique()

In [None]:
# Encoding the dealer_type column with LabelEncoder

le_dealer_type = LabelEncoder()
X_train['dealer_type'] = le_dealer_type.fit_transform(X_train['dealer_type'])


In [None]:
X_train.head()

**v) Encoding the 'fuel_type_from_vin' column**

In [None]:
# checking the unique entries in the 'fuel_type_from_vin' column
X_train['fuel_type_from_vin'].unique()

In [None]:
X_test['fuel_type_from_vin'].unique()

In [None]:
# One-hot encoding the fuel_type_from_vin column with pd.get_dummies
X_train = pd.get_dummies(X_train, columns=['fuel_type_from_vin'], dtype = 'int')

In [None]:
X_train.head()

In [None]:
X_train.shape

### **Encoding X_test**

In [None]:
# checking the unique entries in 'make' column
X_test['make'].unique()

In [None]:
X_test['make'].nunique()

In [None]:
# Encoding the 'make' column with the BinaryEncoder fitted on X_train
X_test = be_make.transform(X_test)

In [None]:
# Encoding the 'model' column with the BinaryEncoder fitted on X_train
X_test = be_model.transform(X_test)

In [None]:
# Encoding the stock_type column with the LabelEncoder fitted on X_train
# (transform raises a ValueError if the test set holds a label unseen in training)

X_test['stock_type'] = le_number_stock_type.transform(X_test['stock_type'])

In [None]:
# Encoding the dealer_type column with the LabelEncoder fitted on X_train
# (transform raises a ValueError if the test set holds a label unseen in training)

X_test['dealer_type'] = le_dealer_type.transform(X_test['dealer_type'])


In [None]:
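# One-hot encoding the fuel_type_from_vin column with pd.get_dummies
X_test = pd.get_dummies(X_test, columns=['fuel_type_from_vin'], dtype = 'int')
# Align with the training columns; fuel types absent from the test set become all-zero columns
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)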

In [None]:
X_test.head()

In [None]:
print(X_train.shape)
print(X_test.shape)

In [None]:
X_test.info()

### **Encoding y_train & y_test**

In [None]:
y_train = pd.get_dummies(y_train, dtype = 'int', drop_first= True)
y_train.head()

In [None]:
y_test = pd.get_dummies(y_test, dtype = 'int', drop_first= True)
y_test.head()

In [None]:
# rename M to transmission_from_vin in y_train and y_test
y_train.rename(columns={'M': 'transmission_from_vin'}, inplace=True)
y_test.rename(columns={'M': 'transmission_from_vin'}, inplace=True)

In [None]:
print(y_train.info())
print(y_test.info())

In [None]:
print(y_train.head())
print(y_test.head())

### **6. Handling Imbalanced data columns**

In [None]:
profile = ProfileReport(X_train, title="Pandas Profiling Report", explorative=True)
profile

The result of profiling X_train set after encoding shows high imbalance in
- **model_0**
- **model_1**
- **certified**
- **fuel_type_from_vin_CNG**
- **fuel_type_from_vin_Diesel**
- **fuel_type_from_vin_Electric**
- **fuel_type_from_vin_Hybrid**
- **fuel_type_from_vin_Hydrogen**
- **fuel_type_from_vin_PHEV**
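
Imbalance of this kind can also be quantified without the profiling report. The sketch below uses a hypothetical ten-row frame, and the 20% cut-off is an assumption chosen only for illustration; it computes the share of the rarest value in each column and flags the columns that fall below the cut-off.

```python
import pandas as pd

# Hypothetical encoded columns: 'certified' is skewed, 'stock_type' is balanced
toy = pd.DataFrame({
    "certified": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    "stock_type": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})

# Share of the least frequent value in each column
minority_share = {
    col: float(toy[col].value_counts(normalize=True).min()) for col in toy.columns
}

# Assumed threshold: a column counts as imbalanced below a 20% minority share
threshold = 0.20
imbalanced_columns = [col for col, share in minority_share.items() if share < threshold]

print(imbalanced_columns)  # ['certified']
```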

In [None]:
#count the number of classes in each imbalanced column in the Train set
print(X_train['model_0'].value_counts())
print(X_train['model_1'].value_counts())
print(X_train['certified'].value_counts())
print(X_train['fuel_type_from_vin_CNG'].value_counts())
print(X_train['fuel_type_from_vin_Diesel'].value_counts())
print(X_train['fuel_type_from_vin_Electric'].value_counts())
print(X_train['fuel_type_from_vin_Hybrid'].value_counts())
print(X_train['fuel_type_from_vin_Hydrogen'].value_counts())
print(X_train['fuel_type_from_vin_PHEV'].value_counts())
print(y_train['transmission_from_vin'].value_counts())


In [None]:
#count the number of classes in each imbalanced column in the Test set
print(X_test['model_0'].value_counts())
print(X_test['model_1'].value_counts())
print(X_test['certified'].value_counts())
print(X_test['fuel_type_from_vin_CNG'].value_counts())
print(X_test['fuel_type_from_vin_Diesel'].value_counts())
print(X_test['fuel_type_from_vin_Electric'].value_counts())
print(X_test['fuel_type_from_vin_Hybrid'].value_counts())
print(X_test['fuel_type_from_vin_Hydrogen'].value_counts())
print(X_test['fuel_type_from_vin_PHEV'].value_counts())
print(y_test['transmission_from_vin'].value_counts())

**Visualization of Imbalanced columns in the Train set**

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Create individual countplots for each column
fig, axes = plt.subplots(5, 2, figsize=(15, 20))  # Adjust figsize as needed

sns.countplot(x='model_0', hue = 'model_0', data=X_train, ax=axes[0, 0])
sns.countplot(x='model_1', hue = 'model_1', data=X_train, ax=axes[0, 1])
sns.countplot(x='certified', hue = 'certified', data=X_train, ax=axes[1, 0])
sns.countplot(x='fuel_type_from_vin_CNG', hue = 'fuel_type_from_vin_CNG', data=X_train, ax=axes[1, 1])
sns.countplot(x='fuel_type_from_vin_Diesel', hue = 'fuel_type_from_vin_Diesel', data=X_train, ax=axes[2, 0])
sns.countplot(x='fuel_type_from_vin_Electric', hue = 'fuel_type_from_vin_Electric', data=X_train, ax=axes[2, 1])
sns.countplot(x='fuel_type_from_vin_Hybrid', hue = 'fuel_type_from_vin_Hybrid', data=X_train, ax=axes[3, 0])
sns.countplot(x='fuel_type_from_vin_Hydrogen', hue = 'fuel_type_from_vin_Hydrogen', data=X_train, ax=axes[3, 1])
sns.countplot(x='fuel_type_from_vin_PHEV', hue = 'fuel_type_from_vin_PHEV', data=X_train, ax=axes[4, 0])
sns.countplot(x='transmission_from_vin', hue = 'transmission_from_vin', data=y_train, ax=axes[4, 1])
# clear extra subplots to avoid empty plots
# axes[4, 1].axis('off')
plt.tight_layout()  # Adjust spacing between subplots
plt.show()

**Using SMOTE technique to handle imbalance in Train set**

In [None]:
from imblearn.over_sampling import SMOTE

# SMOTE balances the classes of a target variable, so it is applied once to the
# full training feature matrix with transmission_from_vin as the target.
# This keeps every feature row aligned with its label.
y = y_train['transmission_from_vin']

# Initialize SMOTE with k_neighbors=1
smote = SMOTE(random_state=42, k_neighbors=1)

# Oversample the minority transmission class; the variable names match the later cells
X_train_resampled_combined, y_resampled = smote.fit_resample(X_train, y)

In [None]:
print(X_train_resampled_combined.shape)
print(y_resampled.shape)

In [None]:
#count the number of classes in each column of the Train set after handling imbalance
print(X_train_resampled_combined['model_0'].value_counts())
print(X_train_resampled_combined['model_1'].value_counts())
print(X_train_resampled_combined['certified'].value_counts())
print(X_train_resampled_combined['fuel_type_from_vin_CNG'].value_counts())
print(X_train_resampled_combined['fuel_type_from_vin_Diesel'].value_counts())
print(X_train_resampled_combined['fuel_type_from_vin_Electric'].value_counts())
print(X_train_resampled_combined['fuel_type_from_vin_Hybrid'].value_counts())
print(X_train_resampled_combined['fuel_type_from_vin_Hydrogen'].value_counts())
print(X_train_resampled_combined['fuel_type_from_vin_PHEV'].value_counts())
print(y_resampled.value_counts())

In [None]:
# Create individual countplots for each column
fig, axes = plt.subplots(5, 2, figsize=(15, 20))

sns.countplot(x='model_0', hue = 'model_0', data=X_train_resampled_combined, ax=axes[0, 0])
sns.countplot(x='model_1', hue = 'model_1', data=X_train_resampled_combined, ax=axes[0, 1])
sns.countplot(x='certified', hue = 'certified', data=X_train_resampled_combined, ax=axes[1, 0])
sns.countplot(x='fuel_type_from_vin_CNG', hue = 'fuel_type_from_vin_CNG', data=X_train_resampled_combined, ax=axes[1, 1])
sns.countplot(x='fuel_type_from_vin_Diesel', hue = 'fuel_type_from_vin_Diesel', data=X_train_resampled_combined, ax=axes[2, 0])
sns.countplot(x='fuel_type_from_vin_Electric', hue = 'fuel_type_from_vin_Electric', data=X_train_resampled_combined, ax=axes[2, 1])
sns.countplot(x='fuel_type_from_vin_Hybrid', hue = 'fuel_type_from_vin_Hybrid', data=X_train_resampled_combined, ax=axes[3, 0])
sns.countplot(x='fuel_type_from_vin_Hydrogen', hue = 'fuel_type_from_vin_Hydrogen', data=X_train_resampled_combined, ax=axes[3, 1])
sns.countplot(x='fuel_type_from_vin_PHEV', hue = 'fuel_type_from_vin_PHEV', data=X_train_resampled_combined, ax=axes[4, 0])
# Convert y_resampled to a DataFrame
y_resampled_df = y_resampled.to_frame()
sns.countplot(x='transmission_from_vin', hue = 'transmission_from_vin', data=y_resampled_df, ax=axes[4, 1])
# clear extra subplots to avoid empty plots
# axes[4, 1].axis('off')
plt.tight_layout()  # Adjust spacing between subplots
plt.show()

**Visualizing imbalanced columns in the Test set and handling them with SMOTE**

In [None]:
# Create individual countplots for each column
fig, axes = plt.subplots(5, 2, figsize=(15, 20))

sns.countplot(x='model_0', hue = 'model_0', data=X_test, ax=axes[0, 0])
sns.countplot(x='model_1', hue = 'model_1', data=X_test, ax=axes[0, 1])
sns.countplot(x='certified', hue = 'certified', data=X_test, ax=axes[1, 0])
sns.countplot(x='fuel_type_from_vin_CNG', hue = 'fuel_type_from_vin_CNG', data=X_test, ax=axes[1, 1])
sns.countplot(x='fuel_type_from_vin_Diesel', hue = 'fuel_type_from_vin_Diesel', data=X_test, ax=axes[2, 0])
sns.countplot(x='fuel_type_from_vin_Electric', hue = 'fuel_type_from_vin_Electric', data=X_test, ax=axes[2, 1])
sns.countplot(x='fuel_type_from_vin_Hybrid', hue = 'fuel_type_from_vin_Hybrid', data=X_test, ax=axes[3, 0])
sns.countplot(x='fuel_type_from_vin_Hydrogen', hue = 'fuel_type_from_vin_Hydrogen', data=X_test, ax=axes[3, 1])
sns.countplot(x='fuel_type_from_vin_PHEV', hue = 'fuel_type_from_vin_PHEV', data=X_test, ax=axes[4, 0])
sns.countplot(x='transmission_from_vin', hue = 'transmission_from_vin', data=y_test, ax=axes[4, 1])
# clear extra subplots to avoid empty plots
# axes[4, 1].axis('off')
plt.tight_layout()  # Adjust spacing between subplots
plt.show()

In [None]:
from imblearn.over_sampling import SMOTE

# Apply SMOTE once to the full test feature matrix, with transmission_from_vin
# as the target, so that features and labels stay row-aligned
y1 = y_test['transmission_from_vin']

# Initialize SMOTE with k_neighbors=1
smote = SMOTE(random_state=42, k_neighbors=1)

# Oversample the minority transmission class; the variable names match the later cells
X_test_resampled_combined, y1_resampled = smote.fit_resample(X_test, y1)

print(f"Shape of X_test_resampled_combined: {X_test_resampled_combined.shape}")
print(f"Shape of y1_resampled: {y1_resampled.shape}")

In [None]:
# Create individual countplots for each column
fig, axes = plt.subplots(5, 2, figsize=(15, 20))

sns.countplot(x='model_0', hue = 'model_0', data=X_test_resampled_combined, ax=axes[0, 0])
sns.countplot(x='model_1', hue = 'model_1', data=X_test_resampled_combined, ax=axes[0, 1])
sns.countplot(x='certified', hue = 'certified', data=X_test_resampled_combined, ax=axes[1, 0])
sns.countplot(x='fuel_type_from_vin_CNG', hue = 'fuel_type_from_vin_CNG', data=X_test_resampled_combined, ax=axes[1, 1])
sns.countplot(x='fuel_type_from_vin_Diesel', hue = 'fuel_type_from_vin_Diesel', data=X_test_resampled_combined, ax=axes[2, 0])
sns.countplot(x='fuel_type_from_vin_Electric', hue = 'fuel_type_from_vin_Electric', data=X_test_resampled_combined, ax=axes[2, 1])
sns.countplot(x='fuel_type_from_vin_Hybrid', hue = 'fuel_type_from_vin_Hybrid', data=X_test_resampled_combined, ax=axes[3, 0])
sns.countplot(x='fuel_type_from_vin_Hydrogen', hue = 'fuel_type_from_vin_Hydrogen', data=X_test_resampled_combined, ax=axes[3, 1])
sns.countplot(x='fuel_type_from_vin_PHEV', hue = 'fuel_type_from_vin_PHEV', data=X_test_resampled_combined, ax=axes[4, 0])

# Convert y1_resampled to a DataFrame before using it in sns.countplot
y1_resampled_df = y1_resampled.to_frame()

# Now use the DataFrame in sns.countplot
sns.countplot(x='transmission_from_vin', hue='transmission_from_vin', data=y1_resampled_df, ax=axes[4, 1])

# clear extra subplots to avoid empty plots
# axes[4, 1].axis('off')
plt.tight_layout()  # Adjust spacing between subplots
plt.show()

In [None]:
# count the number of classes in each column of the Test set after handling imbalance
print(X_test_resampled_combined['model_0'].value_counts())
print(X_test_resampled_combined['model_1'].value_counts())
print(X_test_resampled_combined['certified'].value_counts())
print(X_test_resampled_combined['fuel_type_from_vin_CNG'].value_counts())
print(X_test_resampled_combined['fuel_type_from_vin_Diesel'].value_counts())
print(X_test_resampled_combined['fuel_type_from_vin_Electric'].value_counts())
print(X_test_resampled_combined['fuel_type_from_vin_Hybrid'].value_counts())
print(X_test_resampled_combined['fuel_type_from_vin_Hydrogen'].value_counts())
print(X_test_resampled_combined['fuel_type_from_vin_PHEV'].value_counts())
print(y1_resampled.value_counts())

### **7. Scaling Test and Train Sets**

The purpose of scaling is to bring all features (variables) into a common range or distribution. This can improve the performance and convergence speed of machine learning algorithms.
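
As a rough illustration of what min-max scaling does, the sketch below rescales each column with x' = (x - min) / (max - min). The arrays are hypothetical, and the choice to take the minimum and range from the training rows and reuse them for the test rows is a common convention assumed here, not something stated in this notebook.

```python
import numpy as np

# Hypothetical feature matrices: two columns on very different scales
train = np.array([[10.0, 1.0], [20.0, 3.0], [30.0, 5.0]])
test = np.array([[15.0, 5.0], [40.0, 2.0]])

# Statistics come from the training rows only
col_min = train.min(axis=0)
col_range = train.max(axis=0) - col_min

# x' = (x - min) / (max - min), applied column by column
scaled_train = (train - col_min) / col_range
scaled_test = (test - col_min) / col_range

print(scaled_train)  # every training column now spans 0 to 1
print(scaled_test)   # 40.0 exceeds the training maximum, so it scales to 1.5
```

Test values outside the training range land outside [0, 1], which is expected when the training statistics are reused.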

### **Scaling X_train**

In [None]:
from sklearn.preprocessing import MinMaxScaler

# Initialize MinMaxScaler with desired range (default is 0 to 1)
scaler = MinMaxScaler()

# Fit and transform the data
scaled_X_train_resampled_combined = scaler.fit_transform(X_train_resampled_combined)

# Convert the result back to a DataFrame
scaled_X_train_resampled_combined = pd.DataFrame(scaled_X_train_resampled_combined, columns=X_train_resampled_combined.columns)

print("Scaled Data:")
scaled_X_train_resampled_combined.head()


### **Scaling X_test**

In [None]:
# Reuse the scaler fitted on the training set so both sets share one scale;
# the test columns are first put in the training column order
X_test_aligned = X_test_resampled_combined[X_train_resampled_combined.columns]

# Transform (do not refit) the test data
scaled_X_test_resampled_combined = scaler.transform(X_test_aligned)

# Convert the result back to a DataFrame using the aligned column names
scaled_X_test_resampled_combined = pd.DataFrame(scaled_X_test_resampled_combined, columns=X_test_aligned.columns)

print("Scaled Data:")
scaled_X_test_resampled_combined.head()

In [None]:
# scaling y1_resampled_df
scaler3 = MinMaxScaler()

# Fit and transform the data
scaled_y1_resampled_df = scaler3.fit_transform(y1_resampled_df)


In [None]:
# scaling y_resampled_df
scaler4 = MinMaxScaler()

# Fit and transform the data
scaled_y_resampled_df = scaler4.fit_transform(y_resampled_df)

In [None]:
print(scaled_X_train_resampled_combined.shape)
print(scaled_X_test_resampled_combined.shape)

In [None]:
#print(scaled_y_resampled.shape)
print(scaled_y1_resampled_df.shape)

In [None]:
print(y_resampled.shape) # train
print(y1_resampled.shape) # test

## **Model Building**

In [None]:
import shap
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from xgboost.sklearn import XGBClassifier
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix,ConfusionMatrixDisplay
from sklearn.impute import SimpleImputer # Import the SimpleImputer class from the correct module
from sklearn.pipeline import Pipeline  # Import Pipeline for creating the pipeline

In [None]:
# Import necessary classes
from sklearn.model_selection import KFold, cross_val_score

# Define the models to evaluate
models = [('Logistic Regression', LogisticRegression(solver='lbfgs')),
    ('K-Nearest Neighbors', KNeighborsClassifier()),
    ('Naive Bayes', GaussianNB()),
    ('Support Vector Machine', SVC()),
    ('Random Forest', RandomForestClassifier()),
    ('Decision Tree', DecisionTreeClassifier()),
    ('XGB', XGBClassifier())
]

# Define the number of folds for k-fold cross-validation
num_folds = 5
results = []
names = []

# Iterate through the models
for name, model in models:
    # Create a KFold object
    kfold = KFold(n_splits=num_folds, shuffle=True, random_state=42)

    # Handle NaN values before cross-validation
    # This ensures that any NaN values are filled for each fold
    imputer = SimpleImputer(strategy='constant', fill_value=0) # Use an imputer to handle NaNs
    pipeline = Pipeline([('imputer', imputer), ('model', model)]) # Create a pipeline with imputation and model

    # Perform cross-validation using the pipeline
    cv_results = cross_val_score(pipeline, scaled_X_train_resampled_combined, y_resampled, cv=kfold, scoring='accuracy')

    # Store the results
    results.append(cv_results)
    names.append(name)

    # Print the mean and standard deviation of the accuracy scores
    print(f"{name}: {cv_results.mean():.4f} ({cv_results.std():.4f})")

### **Hyperparameter Tuning**
Since RandomForestClassifier is the best-performing model based on the cross-validation results, we will perform hyperparameter tuning to identify the best hyperparameters for prediction.

In [None]:
# hyperparameter tuning of the random forest classifier
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import mean_squared_error


In [None]:
#Instantiating
RF = RandomForestClassifier()

# Default parameters
RF.get_params()

In [None]:
# Define the parameter grid to search
param_grid = {
    'n_estimators': [100, 150, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
    'bootstrap': [True, False],
    'max_features': ['sqrt', 'log2'],
}
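
Before launching the search, it can help to estimate its cost. The sketch below counts the candidate combinations in a grid with the same six parameters using `ParameterGrid`, then multiplies by the 3 folds that the grid search in this notebook uses. The names `param_grid_demo`, `n_candidates`, and `total_fits` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Copy of the six-parameter grid (kept separate so this cell is self-contained)
param_grid_demo = {
    'n_estimators': [100, 150, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
    'bootstrap': [True, False],
    'max_features': ['sqrt', 'log2'],
}

# Number of distinct hyperparameter combinations: 3 * 3 * 2 * 2 * 2 * 2
n_candidates = len(ParameterGrid(param_grid_demo))

# Each candidate is fitted once per cross-validation fold
cv_folds = 3
total_fits = n_candidates * cv_folds

print(f"{n_candidates} candidates x {cv_folds} folds = {total_fits} model fits")
```

With 144 candidates and 3 folds, the search trains 432 forests, so trimming the grid or switching to a randomized search is worth considering if runtime becomes a problem.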

In [None]:
# Create a Random Forest classifier
rf_classifier = RandomForestClassifier()

# Create a GridSearchCV object
grid_search = GridSearchCV(estimator=rf_classifier,
                           param_grid=param_grid, cv=3,
                           scoring='accuracy',
                           n_jobs=-1, verbose=2)

In [None]:
# Fit the GridSearchCV object to the training data
grid_search.fit(scaled_X_train_resampled_combined, y_resampled)

#Use the best estimator from grid search
best_rf = grid_search.best_estimator_


In [None]:
best_rf
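
Besides `best_estimator_`, a fitted `GridSearchCV` object also exposes the winning parameters, the mean cross-validated score, and a per-candidate results table. The following is a minimal sketch on synthetic data with a deliberately tiny grid; `X_demo`, `y_demo`, `small_grid`, and `demo_search` are assumptions for illustration only.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption, not the notebook's dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# Tiny grid so the example runs quickly
small_grid = {'n_estimators': [10, 20], 'max_depth': [None, 3]}
demo_search = GridSearchCV(RandomForestClassifier(random_state=0),
                           param_grid=small_grid, cv=3, scoring='accuracy')
demo_search.fit(X_demo, y_demo)

# Winning hyperparameters and their mean cross-validated accuracy
print(demo_search.best_params_)
print(round(demo_search.best_score_, 4))

# One row per candidate, best rank first
cv_table = (pd.DataFrame(demo_search.cv_results_)
            [['params', 'mean_test_score', 'std_test_score', 'rank_test_score']]
            .sort_values('rank_test_score'))
print(cv_table)
```

Looking at the full table rather than only the winner shows whether the best score is clearly ahead or within noise of the other candidates.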

In [None]:
# Get the feature names from the training data
training_feature_names = scaled_X_train_resampled_combined.columns

# Ensure the test data has the same feature names and order
scaled_X_test_resampled_combined = scaled_X_test_resampled_combined[training_feature_names]

# Now, make predictions
y_pred = best_rf.predict(scaled_X_test_resampled_combined)

### **Evaluating Model Performance**

In [None]:
# Evaluate the performance of the best model on the test dataset
accuracy = accuracy_score(scaled_y1_resampled_df, y_pred)
print(f"Accuracy of the best model on the test dataset: {accuracy:.4f}")

In [None]:
# Repeat the evaluation using y1_resampled as the test labels instead of scaled_y1_resampled_df
accuracy = accuracy_score(y1_resampled, y_pred)
print(f"Accuracy of the best model on the test dataset: {accuracy:.4f}")

In [None]:
# Generate classification report and confusion matrix
print(classification_report(scaled_y1_resampled_df, y_pred))
cm = confusion_matrix(scaled_y1_resampled_df, y_pred)
print("Confusion Matrix:")
print(cm)

In [None]:
# You can also visualize the confusion matrix using ConfusionMatrixDisplay
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()
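
To connect the confusion matrix to the figures in the classification report, the sketch below derives precision and recall for the positive class directly from the four cells of a binary confusion matrix. The toy labels `y_true_demo` and `y_pred_demo` are made up for illustration; the binary layout is an assumption about the target.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels (illustrative only)
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred_demo = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])

# Rows are true classes, columns are predicted classes
cm_demo = confusion_matrix(y_true_demo, y_pred_demo)
tn, fp, fn, tp = cm_demo.ravel()

# Precision: of everything predicted positive, how much was correct
precision_manual = tp / (tp + fp)
# Recall: of everything actually positive, how much was found
recall_manual = tp / (tp + fn)

print(cm_demo)
print(f"precision = {precision_manual:.3f}, recall = {recall_manual:.3f}")

# The manual values agree with scikit-learn's own metrics
print(np.isclose(precision_manual, precision_score(y_true_demo, y_pred_demo)))
print(np.isclose(recall_manual, recall_score(y_true_demo, y_pred_demo)))
```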

**Exporting Model Predictions to CSV**

In [None]:
# Create a DataFrame
model_df = pd.DataFrame({
    'Predictions': y_pred
})

# Save to CSV
model_df.to_csv('predictions.csv', index=False)

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import numpy as np

# Train a fresh RandomForestClassifier with default hyperparameters on the resampled training data.
# Note: this is a separate model from the tuned best_rf found by the grid search above.
model = RandomForestClassifier(random_state=42)
model.fit(scaled_X_train_resampled_combined, scaled_y_resampled_df)

# Predict probabilities (for ROC curve, we need probabilities, not just predictions)
y_pred_prob = model.predict_proba(scaled_X_test_resampled_combined)[:, 1]  # Get the probability for the positive class

# Compute ROC curve
fpr, tpr, thresholds = roc_curve(scaled_y1_resampled_df, y_pred_prob)

# Compute AUC score
auc_score = roc_auc_score(scaled_y1_resampled_df, y_pred_prob)

# Plot ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='blue', label=f'ROC curve (AUC = {auc_score:.2f})')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--')  # Diagonal line (no discrimination)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate (FPR)')
plt.ylabel('True Positive Rate (TPR)')
plt.title('ROC Curve')
plt.legend(loc="lower right")
plt.grid(True)
plt.show()
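
As a small worked example of what `roc_curve` and `roc_auc_score` compute, the sketch below uses four hand-picked scores (illustrative values, not model output). Each point on the curve is the (FPR, TPR) pair obtained at one score threshold, and the AUC equals the fraction of positive/negative pairs in which the positive sample receives the higher score.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Four samples: two negatives, two positives, with illustrative scores
y_true_demo = np.array([0, 0, 1, 1])
y_score_demo = np.array([0.1, 0.4, 0.35, 0.8])

fpr_demo, tpr_demo, thresholds_demo = roc_curve(y_true_demo, y_score_demo)
auc_demo = roc_auc_score(y_true_demo, y_score_demo)

# One (FPR, TPR) point per threshold
for f, t in zip(fpr_demo, tpr_demo):
    print(f"FPR={f:.2f}  TPR={t:.2f}")

# 3 of the 4 positive/negative pairs are ranked correctly
# (only 0.35 vs 0.40 is in the wrong order), so AUC = 0.75
print(f"AUC = {auc_demo:.2f}")
```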


In [None]:
import pandas as pd
import pickle

# Replace these with your actual model, features, and predictions
model = ...  # Trained machine learning model
encoded_features = ...  # Pandas DataFrame containing encoded features
predictions = ...  # Predictions as a pandas Series or numpy array

# File paths
model_pkl_path = "model.pkl"
features_csv_path = "encoded_features.csv"
predictions_csv_path = "predictions.csv"

# Save the model as a pickle file
with open(model_pkl_path, "wb") as f:
    pickle.dump(model, f)
print(f"Model saved to {model_pkl_path}")

# Save the encoded features to a CSV file
encoded_features.to_csv(features_csv_path, index=False)
print(f"Encoded features saved to {features_csv_path}")

# Save the predictions to a CSV file
predictions_df = pd.DataFrame(predictions, columns=["Prediction"])
predictions_df.to_csv(predictions_csv_path, index=False)
print(f"Predictions saved to {predictions_csv_path}")
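
After saving a model with `pickle`, a quick round-trip check can confirm that the file loads and reproduces the same predictions. This is a minimal sketch using a small synthetic model and a temporary directory; `demo_model`, `demo_path`, and `reloaded_model` are hypothetical names, and the check assumes the same scikit-learn version is used for saving and loading.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic model standing in for the trained model (assumption)
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=1)
demo_model = RandomForestClassifier(n_estimators=10, random_state=1).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_model.pkl")

    # Save the fitted model
    with open(demo_path, "wb") as f:
        pickle.dump(demo_model, f)

    # Load it back from disk
    with open(demo_path, "rb") as f:
        reloaded_model = pickle.load(f)

# The reloaded model should give identical predictions
same_predictions = np.array_equal(demo_model.predict(X_demo),
                                  reloaded_model.predict(X_demo))
print(same_predictions)
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.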


#                           **THE END**