# EECS 3401: Project

## Authors: Harsh Parmar & Shubhkumar Patel

**Dataset Source: Suraj. (2023). _Car Sales Data_. Kaggle. https://www.kaggle.com/datasets/suraj520/car-sales-data**

**Modified Dataset: _Car Sales Data_. https://media.githubusercontent.com/media/ParmarHarsh/Project-Group-50/main/car_sales_data.csv**

# Car Sales Data

**Attributes for car-sales-data.csv dataset:**

The attributes below are copied from the original dataset.
1. Date: The date of the car sale
2. Salesperson: The name of the salesperson who made the sale
3. Customer Name: The name of the customer who purchased the car
4. Car Make: The make of the car that was purchased
5. Car Model: The model of the car that was purchased
6. Car Year: The year of the car that was purchased
7. Sale Price: The sale price of the car in USD
8. Commission Rate: The commission rate paid to the salesperson on the sale
9. Commission Earned: The amount of commission earned by the salesperson on the sale

## 1 - Look at the big picture & frame the problem.

### Look at the big picture

- Predicting future car sales aids manufacturers in planning production, managing inventory, and optimizing marketing strategies based on historical sales data.

### Frame the problem

- Supervised learning: The historical records are labeled with the target (Sale Price), so predicting it is a supervised learning problem.
- A regression task: The target is a continuous value, so predicting it from the available attributes is a regression task.
- Batch learning: The model is trained once on the entire dataset rather than incrementally, which is batch learning.

## 2 - Load the dataset.

In [None]:
# Import libraries.
import sklearn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset.
url = "https://media.githubusercontent.com/media/ParmarHarsh/Project-Group-50/main/car_sales_data.csv"
cars = pd.read_csv(url, sep=',')

# Create a backup copy of the dataset (copy() returns an independent DataFrame).
cars_backup = cars.copy()
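
A note on the backup step: plain assignment only binds a second name to the same DataFrame, whereas `DataFrame.copy()` returns an independent object. The sketch below illustrates the difference on a hypothetical two-row frame (`demo` and its values are made up for illustration and are not part of the project data).

```python
import pandas as pd

# Hypothetical toy frame, not the project dataset.
demo = pd.DataFrame({"Car Year": [2015, 2018], "Sale Price": [20000, 30000]})

alias = demo          # same object under a second name
backup = demo.copy()  # independent copy of the data

# Modify the original in place.
demo.loc[0, "Sale Price"] = 0

print(alias.loc[0, "Sale Price"])   # follows the change made to demo
print(backup.loc[0, "Sale Price"])  # keeps the original value
```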

### 2.1 - Take a quick look at the data structure.

In [None]:
# Content of dataset.
cars

In [None]:
# First few rows of dataset.
cars.head()

In [None]:
# Descriptive statistics of numerical columns.
cars.describe()

In [None]:
# Concise summary of dataset.
cars.info()

In [None]:
# Dimensions of dataset.
cars.shape

## 3 - Explore and visualize the data to gain insights.

### 3.1 - Plot a histogram of the data.

In [None]:
# Displaying a histogram
cars.hist(figsize=(15,15))
plt.show()

### 3.2 - Look for correlations between the features.

#### 3.2.1 - Correlations using Pearson correlation coefficient.

In [None]:
# Calculating the correlation matrix.
corr_matrix = cars.corr(numeric_only=True)

# Sorting the correlations with 'Sale Price' in descending order.
corr_matrix["Sale Price"].sort_values(ascending=False)
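
For reference, the Pearson coefficient reported by `corr()` is the covariance of two columns divided by the product of their standard deviations. The sketch below computes it by hand on hypothetical toy values (not taken from the project dataset) and compares the result with pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values, not taken from the project dataset.
toy = pd.DataFrame({
    "Car Year": [2010, 2012, 2014, 2016, 2018],
    "Sale Price": [12000, 15000, 14000, 20000, 23000],
})

x = toy["Car Year"].to_numpy(dtype=float)
y = toy["Sale Price"].to_numpy(dtype=float)

# Center both variables, then divide the summed cross-products
# by the square root of the product of the summed squares.
x_c = x - x.mean()
y_c = y - y.mean()
r_manual = (x_c * y_c).sum() / np.sqrt((x_c ** 2).sum() * (y_c ** 2).sum())

# The same coefficient as reported by pandas.
r_pandas = toy.corr(numeric_only=True).loc["Car Year", "Sale Price"]

print(r_manual, r_pandas)
```

A value near +1 or -1 indicates a strong linear relationship; a value near 0 indicates little linear relationship, although a non-linear one may still exist.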

#### 3.2.2 - Correlations with regard to our target.

In [None]:
# Line plot to visualize the relationship between 'Car Year' and 'Sale Price'.
year_vs_price = sns.lineplot(x="Car Year", y="Sale Price", data=cars, errorbar=None)

### 3.3 - Look at the structure of Car Make and Car Model

In [None]:
# Counting the occurrences of each car make in the 'Car Make' column.
count_make = cars["Car Make"].value_counts()
count_make

In [None]:
# Counting the occurrences of each car model in the 'Car Model' column.
count_model = cars["Car Model"].value_counts()
count_model

In [None]:
# Creating a line plot to visualize the relationship between 'Car Make' and 'Sale Price'.
make_vs_price = sns.lineplot(x="Car Make", y="Sale Price", data=cars, errorbar=None)

# Calculating the average sale price for each car make and sorting in descending order.
average_price_by_make = cars.groupby('Car Make')['Sale Price'].mean().sort_values(ascending=False)
print(average_price_by_make)

In [None]:
# Creating a line plot to visualize the relationship between 'Car Model' and 'Sale Price'.
model_vs_price = sns.lineplot(x="Car Model", y="Sale Price", data=cars, errorbar=None)

# Calculating the average sale price for each car model and sorting in descending order
average_price_by_model = cars.groupby('Car Model')['Sale Price'].mean().sort_values(ascending=False)
print(average_price_by_model)

## 4 - Preprocessing.

In [None]:
# Selecting every 500th row in the dataset.
cars = cars.iloc[::500]

cars.shape

### 4.1 - Check for duplicate rows and remove them if any.

In [None]:
# Checking for and counting duplicated rows.
print(cars.duplicated().sum())

# Removing duplicated rows, if any.
cars = cars.drop_duplicates()

### 4.2 - Handle the missing values.

In [None]:
# Replacing '?' with NaN (missing value).
cars = cars.replace('?', np.nan)

# Counting missing values (NaN) in each column.
cars.isna().sum()
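
One thing to watch for with placeholder values: a numeric column that contained `'?'` is read as strings, and it stays non-numeric after the replacement, so it would later be treated as categorical. The sketch below shows this on a hypothetical toy frame (the values are made up) and converts the column explicitly with `pd.to_numeric`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame in which a numeric column contains a '?' placeholder.
toy = pd.DataFrame({
    "Car Year": ["2015", "?", "2018"],
    "Car Make": ["Ford", "Honda", "?"],
})

# Replace the placeholder with NaN and count missing values per column.
toy = toy.replace("?", np.nan)
missing_per_column = toy.isna().sum()
print(missing_per_column)

# The column still holds strings after the replacement, so convert it explicitly.
toy["Car Year"] = pd.to_numeric(toy["Car Year"])
print(toy.dtypes)
```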

### 4.3 - Create a pipeline.

In [None]:
# Creating features (X) by dropping the "Sale Price" column.
X = cars.drop(["Sale Price"], axis = 1)

# Creating the target variable (y) using the "Sale Price" column.
y = cars["Sale Price"]

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Separating numeric and categorical columns.
num_cols = X.select_dtypes(include='number').columns.to_list()
cat_cols = X.select_dtypes(exclude='number').columns.to_list()

# Creating pipelines for numeric and categorical data.
num_pipeline = make_pipeline(SimpleImputer(strategy='mean'), StandardScaler())
cat_pipeline = make_pipeline(SimpleImputer(strategy='most_frequent'), OneHotEncoder())

# Combining numeric and categorical preprocessing pipelines using ColumnTransformer.
preprocessing = ColumnTransformer(
    [('num', num_pipeline, num_cols), ('cat', cat_pipeline, cat_cols)],
    remainder='passthrough',
)

preprocessing

In [None]:
from scipy import sparse

# Performing preprocessing on the features (X).
X_prep = preprocessing.fit_transform(X)

# Checking if the resulting transformed data is a sparse matrix (of any format)
# and converting it to a dense array.
if sparse.issparse(X_prep):
    X_prep = X_prep.toarray()

# Getting feature names after transformation.
feature_names = preprocessing.get_feature_names_out()

# Creating a DataFrame using the transformed data and the obtained feature names.
X_prep = pd.DataFrame(data=X_prep, columns=feature_names)

X_prep
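
The cell above fits the transformers on all of `X` before any train/test split, so the imputation values, scaling statistics, and encoder categories are learned from rows that later end up in the test set. One common alternative, sketched here under the assumption of a hypothetical toy frame (`toy`, `toy_num_cols`, and `toy_cat_cols` are illustrative names), is to split first and chain the preprocessing and the model in a single pipeline. `handle_unknown="ignore"` lets the encoder cope with categories that appear only in the test rows.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data standing in for the car sales frame.
toy = pd.DataFrame({
    "Car Make": ["Ford", "Honda", "Ford", "Toyota", "Honda", "Ford", "Toyota", "Honda"],
    "Car Year": [2010, 2012, 2014, 2016, 2018, 2020, 2011, 2013],
    "Sale Price": [12000, 15000, 14000, 20000, 23000, 26000, 13000, 16000],
})
X_toy = toy.drop(columns=["Sale Price"])
y_toy = toy["Sale Price"]

# Split first so the test rows never influence the fitted transformers.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=42)

toy_num_cols = ["Car Year"]
toy_cat_cols = ["Car Make"]
preprocess = ColumnTransformer([
    ("num", make_pipeline(SimpleImputer(strategy="mean"), StandardScaler()), toy_num_cols),
    ("cat", make_pipeline(SimpleImputer(strategy="most_frequent"),
                          OneHotEncoder(handle_unknown="ignore")), toy_cat_cols),
])

# fit() learns the preprocessing and the regression from the training rows only;
# predict() applies the already-fitted transformers to the test rows.
model = make_pipeline(preprocess, LinearRegression())
model.fit(X_tr, y_tr)
predictions = model.predict(X_te)
print(predictions)
```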

## 5 - Model Selection.

### 5.1 - Split the testing and training datasets.

In [None]:
from sklearn.model_selection import train_test_split

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_prep, y, test_size=0.2, random_state=42)

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

### 5.2 - Training and evaluation of MLAs.

#### 5.2.1 - Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression

# Creating an instance of the Linear Regression model
lr_model = LinearRegression()

# Fitting the Linear Regression model to the training data
lr_model.fit(X_train,y_train)

In [None]:
# Predicting target values using the Linear Regression model on the test set
lr_y_predict = lr_model.predict(X_test)

from sklearn.metrics import mean_squared_error as mse

# Calculating the Mean Squared Error (MSE) between predicted and actual target values
lr_mse=mse(y_test, lr_y_predict)

print("Linear Regression MSE:", lr_mse)
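
Because MSE is expressed in squared units of the target (USD squared here), it can be hard to interpret directly. Taking the square root gives the RMSE, which is in USD. A minimal sketch with hypothetical actual and predicted prices:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted prices in USD.
actual = np.array([20000.0, 25000.0, 30000.0])
predicted = np.array([22000.0, 24000.0, 27000.0])

mse_value = mean_squared_error(actual, predicted)  # units: USD squared
rmse_value = np.sqrt(mse_value)                    # units: USD

print(mse_value, rmse_value)
```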

In [None]:
from sklearn.model_selection import cross_val_score

# Performing cross-validation on the Linear Regression model
# cv=5 specifies 5-fold cross-validation, scoring='neg_mean_squared_error' calculates negative MSE
scores = cross_val_score(lr_model, X_train, y_train, cv=5, scoring='neg_mean_squared_error')

# Calculating the mean of the negative MSE scores obtained from cross-validation
# Multiplying by -1 to revert to positive MSE
cross_validation_scores = -scores.mean()

print(f'Cross-Validation Mean MSE: {cross_validation_scores}')
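
It can also help to look at the individual folds rather than only their mean, since a large spread suggests the score depends heavily on which rows land in each fold. The sketch below uses hypothetical synthetic data (`X_demo`, `y_demo`) and reports the per-fold RMSE.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Hypothetical synthetic regression data, not the project dataset.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=100)

# scikit-learn returns negated MSE so that higher scores are always better.
neg_mse = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5,
                          scoring="neg_mean_squared_error")

# Flip the sign to get MSE per fold, then take the square root for RMSE.
fold_rmse = np.sqrt(-neg_mse)
print(fold_rmse)
print(fold_rmse.mean(), fold_rmse.std())
```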

In [None]:
from sklearn.linear_model import Lasso

# Creating an instance of the Lasso Regression model
LassoRegression = Lasso()

# Fitting the Lasso Regression model to the training data
lasso_model = LassoRegression.fit(X_train, y_train)

# Predicting target values using the trained Lasso Regression model on the test set
Lasso_y_predict = lasso_model.predict(X_test)

# Calculating the Mean Squared Error (MSE) between predicted and actual target values
lasso_mse = mse(y_test, Lasso_y_predict)

print(f'Lasso Regression MSE: {lasso_mse}')
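
`Lasso()` with no arguments uses the default regularization strength `alpha=1.0`. The sketch below, on hypothetical synthetic data, illustrates how `alpha` affects the fitted coefficients; the two alpha values are arbitrary choices for illustration, not tuned settings for this project.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Hypothetical synthetic data: only the first two features influence the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 4.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

weak = Lasso(alpha=0.01).fit(X_demo, y_demo)
strong = Lasso(alpha=1.0).fit(X_demo, y_demo)

# A larger alpha shrinks the coefficients harder and can set some exactly to zero.
print(np.round(weak.coef_, 2))
print(np.round(strong.coef_, 2))
```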

In [None]:
# Scatter plot of predicted vs actual values using the Linear Regression model
plt.scatter(lr_y_predict, y_test)
plt.xlabel("Predicted Values")
plt.ylabel("Actual Values")
plt.show()