# Project 4: Car Sales Analysis

# Prefatory Remarks

- [x] **Create a virtual environment to download the packages**

In [None]:
# You don't have to do this, it's just safer.

# Install virtualenv (optional: the venv module used below ships with Python 3):

# !pip install virtualenv

# Create a virtual environment named "myenv":

# !python -m venv myenv

# Activate the virtual environment:

# myenv\Scripts\activate (Windows)
# source myenv/bin/activate (macOS/Linux)

# Upgrade pip and install essential data science libraries inside the virtual environment
# (macOS/Linux paths shown; on Windows use myenv\Scripts\python instead of myenv/bin/python):

# !myenv/bin/python -m pip install --upgrade pip  
# !myenv/bin/python -m pip install numpy pandas matplotlib seaborn scikit-learn scipy statsmodels jupyterlab plotly openpyxl xlrd tensorflow keras torch torchvision pyspark ipykernel

# Add the virtual environment as a Jupyter kernel:

# !myenv/bin/python -m ipykernel install --user --name=myenv --display-name "Python (myenv)"

# Deactivate the virtual environment (Run this in the terminal):

# deactivate

- [x] **Libraries we might need to install or upgrade**

In [None]:
# If you don't care to create a virtual environment, here is what you need to do to download the libraries

# Run these directly in a cell to download the libraries:

#!pip install tensorflow
#!pip install pyspark
#!pip install scikit-optimize  # provides the skopt module
#!pip install missingno
#!pip install seaborn
#!pip install numpy
#!pip install pandas
#!pip install matplotlib
#!pip install scikit-learn

# To update them, run this (with your desired library):

#!pip install --upgrade scikit-learn

- [x] **Tips for rearranging your Notebook**

- Hold Ctrl+Shift and click the cells you want to move, then press the arrow keys to move them up or down.

## 1. Visualize the data

- [x] **View the data**

In [6]:
import pandas as pd
import numpy as np
import math as ma
import re
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("car_sales.csv")

df

Unnamed: 0,Car_id,Date,Customer Name,Gender,Annual Income,Dealer_Name,Company,Model,Engine,Transmission,Color,Price ($),Dealer_No,Body Style,Phone,Dealer_Region
0,C_CND_000001,1/2/2022,Geraldine,Male,13500,Buddy Storbeck's Diesel Service Inc,Ford,Expedition,DoubleÂ Overhead Camshaft,Auto,Black,26000,06457-3834,SUV,8264678,Middletown
1,C_CND_000002,1/2/2022,Gia,Male,1480000,C & M Motors Inc,Dodge,Durango,DoubleÂ Overhead Camshaft,Auto,Black,19000,60504-7114,SUV,6848189,Aurora
2,C_CND_000003,1/2/2022,Gianna,Male,1035000,Capitol KIA,Cadillac,Eldorado,Overhead Camshaft,Manual,Red,31500,38701-8047,Passenger,7298798,Greenville
3,C_CND_000004,1/2/2022,Giselle,Male,13500,Chrysler of Tri-Cities,Toyota,Celica,Overhead Camshaft,Manual,Pale White,14000,99301-3882,SUV,6257557,Pasco
4,C_CND_000005,1/2/2022,Grace,Male,1465000,Chrysler Plymouth,Acura,TL,DoubleÂ Overhead Camshaft,Auto,Red,24500,53546-9427,Hatchback,7081483,Janesville
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23901,C_CND_023902,12/31/2023,Martin,Male,13500,C & M Motors Inc,Plymouth,Voyager,Overhead Camshaft,Manual,Red,12000,60504-7114,Passenger,8583598,Pasco
23902,C_CND_023903,12/31/2023,Jimmy,Female,900000,Ryder Truck Rental and Leasing,Chevrolet,Prizm,DoubleÂ Overhead Camshaft,Auto,Black,16000,06457-3834,Hardtop,7914229,Middletown
23903,C_CND_023904,12/31/2023,Emma,Male,705000,Chrysler of Tri-Cities,BMW,328i,Overhead Camshaft,Manual,Red,21000,99301-3882,Sedan,7659127,Scottsdale
23904,C_CND_023905,12/31/2023,Victoire,Male,13500,Chrysler Plymouth,Chevrolet,Metro,DoubleÂ Overhead Camshaft,Auto,Black,31000,53546-9427,Passenger,6030764,Austin
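
The `Engine` values in the preview contain a stray `Â`, which usually means a non-breaking space was decoded with the wrong encoding. Below is a minimal sketch of one way to normalize such text. The helper name `clean_text` and the sample values are illustrative and not part of the original notebook.

```python
import pandas as pd

def clean_text(series: pd.Series) -> pd.Series:
    """Remove the stray 'Â' marker and collapse any run of whitespace to one space."""
    return (
        series.str.replace("\u00c2", "", regex=False)   # drop the 'Â' debris
              .str.replace("\u00a0", " ", regex=False)  # non-breaking space -> regular space
              .str.split()                               # split on any whitespace
              .str.join(" ")                             # rejoin with single spaces
    )

# Hypothetical sample mimicking the Engine column
sample = pd.Series(["Double\u00c2\u00a0Overhead Camshaft", "Overhead Camshaft"])
cleaned = clean_text(sample)
print(cleaned.tolist())
```

On the real data this would be applied as `df["Engine"] = clean_text(df["Engine"])`.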


- [x] **Check the data types**

In [16]:
df.dtypes

Car_id           object
Date             object
Customer Name    object
Gender           object
Annual Income     int64
Dealer_Name      object
Company          object
Model            object
Engine           object
Transmission     object
Color            object
Price ($)         int64
Dealer_No        object
Body Style       object
Phone             int64
Dealer_Region    object
dtype: object
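
`Date` is stored as `object` (plain strings), which gets in the way of date arithmetic and resampling. Here is a small sketch of the conversion, assuming the month/day/year format seen in the preview. The sample frame is made up for illustration.

```python
import pandas as pd

# Hypothetical two-row sample in the same shape as the real columns
sample = pd.DataFrame({
    "Date": ["1/2/2022", "12/31/2023"],
    "Price ($)": [26000, 31000],
})

# An explicit format avoids day/month ambiguity and fails loudly on malformed rows
sample["Date"] = pd.to_datetime(sample["Date"], format="%m/%d/%Y")
print(sample.dtypes)
```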

- [x] **Count Occurrences**

In [22]:
df["Gender"].value_counts()

for column in df.columns:
    print(f"Value counts for {column}:\n{df[column].value_counts()}\n{'-'*40}\n")

Value counts for Car_id:
Car_id
C_CND_000001    1
C_CND_015935    1
C_CND_015944    1
C_CND_015943    1
C_CND_015942    1
               ..
C_CND_007967    1
C_CND_007966    1
C_CND_007965    1
C_CND_007964    1
C_CND_023906    1
Name: count, Length: 23906, dtype: int64
----------------------------------------

Value counts for Date:
Date
9/5/2023      190
11/10/2023    175
12/29/2023    151
12/11/2023    140
11/24/2023    135
             ... 
6/21/2022       5
7/12/2023       5
12/9/2022       5
7/8/2022        5
6/29/2023       5
Name: count, Length: 612, dtype: int64
----------------------------------------

Value counts for Customer Name:
Customer Name
Thomas           92
Emma             90
Lucas            88
Nathan           80
Louis            76
                 ..
Adelin            1
Zakarya           1
Paule             1
Noeline           1
Djamel Epoine     1
Name: count, Length: 3021, dtype: int64
----------------------------------------

Value counts for Gender:
Gender


## 2. Clean the data

- [x] **Check if there are any NaN values in any columns. Use Imputation to fill them if necessary.**

- [x] **Check if there are any null values. Use Imputation to fill them if necessary.**

- [x] **Check for missing values**
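
The three checks above can be covered together, because pandas treats `NaN` and `None` alike as missing in `isna()`. The sketch below counts missing values per column and fills them. The median/mode strategy and the helper name `impute_missing` are assumptions for illustration, not the notebook's actual method.

```python
import numpy as np
import pandas as pd

def impute_missing(frame: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric gaps with the column median and other gaps with the column mode."""
    filled = frame.copy()
    for column in filled.columns:
        if not filled[column].isna().any():
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(filled[column]):
            filled[column] = filled[column].fillna(filled[column].median())
        else:
            modes = filled[column].mode()
            if not modes.empty:  # an all-missing column has no mode
                filled[column] = filled[column].fillna(modes.iloc[0])
    return filled

# Hypothetical sample with one numeric and one categorical gap
sample = pd.DataFrame({
    "Annual Income": [13500, np.nan, 1035000],
    "Color": ["Black", "Black", None],
})
print(sample.isna().sum())   # missing values per column before imputation
clean = impute_missing(sample)
print(clean.isna().sum())    # missing values per column after imputation
```

If `df.isna().sum()` is zero for every column of the real dataset, no imputation is needed.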

## 3. Analysis and Visualizations

# Data Science (Using the Pandas Library)

## 4. Inferential Statistics

## PCA (Principal Component Analysis)

- [] **Use a PCA to analyze what are the best **

In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Target column (car price), kept aside to color the plot
y = df['Price ($)']

# PCA needs numeric input, so one-hot encode a few low-cardinality categorical columns.
# The column choice is illustrative; identifiers such as Car_id and Phone are left out on purpose.
categorical_cols = ['Gender', 'Engine', 'Transmission', 'Color', 'Body Style', 'Dealer_Region']
X = pd.get_dummies(df[['Annual Income'] + categorical_cols], columns=categorical_cols, dtype=float)

# Standardize the features so that no single column dominates the components
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Initialize PCA (reduce the dimensions to 2 for visualization)
pca = PCA(n_components=2)  # Choose n_components based on how much variance you want to preserve
X_pca = pca.fit_transform(X_scaled)

# Convert to DataFrame for easy visualization
pca_df = pd.DataFrame(X_pca, columns=['PC1', 'PC2'])
pca_df['Target'] = y.to_numpy()  # Add target to the DataFrame for visualization

print(pca_df.head())
print("Explained variance ratio:", pca.explained_variance_ratio_)


import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
plt.scatter(pca_df['PC1'], pca_df['PC2'], c=pca_df['Target'], cmap='viridis')
plt.title("PCA: Car Sales Dataset")
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.colorbar(label='Price ($)')
plt.show()
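
Two components are convenient for plotting, but the number worth keeping depends on how much variance they capture. The sketch below picks the smallest number of components that reaches a cumulative explained variance of 95%. The threshold and the synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: 5 columns driven by 2 underlying factors plus small noise
factors = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = factors @ mixing + 0.05 * rng.normal(size=(200, 5))

X_scaled = StandardScaler().fit_transform(X)
pca = PCA().fit(X_scaled)  # keep all components to inspect the full spectrum

# Running total of the variance explained by the first 1, 2, ... components
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches 95%
n_components = int(np.searchsorted(cumulative, 0.95) + 1)
print(cumulative.round(3), n_components)
```

scikit-learn also accepts a fraction directly, as in `PCA(n_components=0.95)`, which performs the same selection internally.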


## Random Forest

- [] **Use a Random Forest model, coupled with feature importance, to help us understand which features (columns) in this dataset contribute the most to the model’s predictions**

In [None]:
from sklearn.ensemble import RandomForestRegressor  # Use RandomForestClassifier for classification tasks
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder
import pandas as pd

# Drop the target and the columns that only identify a sale or a customer
# (which columns count as identifiers is a judgment call)
id_cols = ['Car_id', 'Date', 'Customer Name', 'Phone']
X = df.drop(columns=['Price ($)'] + id_cols)  # Features
y = df['Price ($)']  # Target

# Random forests need numeric input: map each category to an integer code
cat_cols = X.select_dtypes(include='object').columns
X[cat_cols] = OrdinalEncoder().fit_transform(X[cat_cols])

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the Random Forest model
model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)

# Get feature importances
feature_importances = model.feature_importances_

# Create a DataFrame to display the importance of each feature
feature_df = pd.DataFrame({
    'Feature': X.columns,
    'Importance': feature_importances
}).sort_values(by='Importance', ascending=False)

print(feature_df)

- [] **Use a random forest model to predict**
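
A minimal sketch of the prediction step, evaluated on a held-out test set. It runs on a small synthetic stand-in for the car sales data, so the price formula, the feature subset, and the resulting scores are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder

# Synthetic stand-in for car_sales.csv (hypothetical values)
rng = np.random.default_rng(42)
n = 400
demo = pd.DataFrame({
    "Body Style": rng.choice(["SUV", "Sedan", "Hatchback"], size=n),
    "Transmission": rng.choice(["Auto", "Manual"], size=n),
    "Annual Income": rng.integers(13_500, 1_500_000, size=n),
})
demo["Price ($)"] = (
    15_000
    + 10_000 * (demo["Body Style"] == "SUV")
    + 3_000 * (demo["Transmission"] == "Auto")
    + rng.normal(0, 500, size=n)
)

# Encode the categorical features as integer codes
categorical = ["Body Style", "Transmission"]
X = demo.drop(columns="Price ($)")
X[categorical] = OrdinalEncoder().fit_transform(X[categorical])
y = demo["Price ($)"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predict prices for cars the model has not seen and score the predictions
y_pred = model.predict(X_test)
print(f"MAE: {mean_absolute_error(y_test, y_pred):.0f}")
print(f"R^2: {r2_score(y_test, y_pred):.3f}")
```

On the real dataset, the same `predict` and metric calls would follow the feature-importance cell, reusing its `X_test` and `y_test`.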

## KMeans Clustering

- [] **Segment customers into different clusters based on their preferences and demographics.**

In [None]:

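from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Illustrative feature choice (the columns were left unspecified here):
# demographics = Gender, Annual Income; preferences = Body Style, Transmission
cluster_cols = ["Gender", "Annual Income", "Body Style", "Transmission"]
X = pd.get_dummies(df[cluster_cols], dtype=float)
X_scaled = StandardScaler().fit_transform(X)  # KMeans is distance-based, so scale first

# Elbow method: fit KMeans for k = 1..10 and record the inertia
inertia = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X_scaled)
    inertia.append(km.inertia_)

plt.figure(figsize=(10, 5))
plt.plot(range(1, 11), inertia, marker="o")
plt.title("Best k value for KMeans Clustering")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Inertia")
plt.show()

# Refit with the k read off the elbow plot (4 is only a placeholder)
best_k = 4
km = KMeans(n_clusters=best_k, random_state=42, n_init=10)
labels = km.fit_predict(X_scaled)  # KMeans is unsupervised, so no y is passed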

## Linear Regression

- [] **Use a Linear Regression model to predict the car price using the Engine, Body Style, Color, Dealer Region, Car Model, Company**

In [None]:
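from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Column names as they appear in df.dtypes
features = ["Engine", "Body Style", "Color", "Dealer_Region", "Model", "Company"]

# All six features are categorical, so one-hot encode them
X = pd.get_dummies(df[features], drop_first=True, dtype=float)
y = df["Price ($)"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.2)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print(f"MAE: {mae:.2f}  RMSE: {rmse:.2f}  R^2: {r2:.3f}")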



## Gradient Boosting

- [] **Use a gradient boosting model to predict a customer's gender based on their Annual Income, Car Model, Body Style, and Dealer Region.**

In [None]:
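from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from xgboost import XGBClassifier  # xgboost's classifier class; requires: pip install xgboost

# Features and target named in the task above, using the column names from df.dtypes
X = pd.get_dummies(df[["Annual Income", "Model", "Body Style", "Dealer_Region"]], dtype=float)
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(df["Gender"])  # XGBoost expects integer class labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = XGBClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

c_r = classification_report(y_test, y_pred, target_names=label_encoder.classes_)
c_m = confusion_matrix(y_test, y_pred)
print(c_r)

plt.figure(figsize=(8, 6))
sns.heatmap(c_m, annot=True, fmt="d", cmap="Blues",
            xticklabels=label_encoder.classes_, yticklabels=label_encoder.classes_)
plt.xlabel("Predicted gender")
plt.ylabel("Actual gender")
plt.title("Gradient Boosting: Confusion Matrix")
plt.show()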



## Logistic Regression

- [] **Use a Logistic Regression model to predict the car model based on Dealer Region, Transmission, and Body Style**

In [None]:
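from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Column names as they appear in df.dtypes; all three features are categorical
X = pd.get_dummies(df[["Dealer_Region", "Transmission", "Body Style"]], dtype=float)
y = df["Model"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = LogisticRegression(max_iter=1000)  # extra iterations help convergence with many classes
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

c_r = classification_report(y_test, y_pred, zero_division=0)
c_m = confusion_matrix(y_test, y_pred)
print(c_r)

plt.figure(figsize=(8, 6))
sns.heatmap(c_m, cmap="Blues")  # annotations are off because the target likely has many classes
plt.xlabel("Predicted model")
plt.ylabel("Actual model")
plt.title("Logistic Regression: Confusion Matrix")
plt.show()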

## Time Series and Forecasting

- [] **Put the date as the index and plot the price column over time**

- [] **Check for Seasonality**

- [] **Check for Stationarity**

- [] **Try the ARIMA model**

- [] **Try the SARIMA model**

- [] **Try the Exponential Smoothing model**
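
A sketch of the first step in this list: several cars sell on the same day, so the prices need to be aggregated onto a regular time index before they can be plotted or passed to a forecasting model. The monthly mean used here is an assumption (a daily or weekly aggregate would work the same way), and the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical sample in the same shape as the Date and Price ($) columns
sample = pd.DataFrame({
    "Date": ["1/2/2022", "1/2/2022", "1/15/2022", "2/3/2022", "2/20/2022"],
    "Price ($)": [26000, 19000, 31500, 14000, 24500],
})
sample["Date"] = pd.to_datetime(sample["Date"], format="%m/%d/%Y")

# Put the date on the index, then average all sales within each calendar month
monthly = sample.set_index("Date")["Price ($)"].resample("MS").mean()
print(monthly)
```

Calling `monthly.plot()` then draws the price over time, and the same regularly spaced series is a natural input for the seasonality, stationarity, and forecasting steps.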

# Transferring the data to MySQL

- [x] **Save the original dataset with fixed columns**

- [x] **Save the clean dataset**
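
A sketch of what these two steps could look like, assuming "fixed columns" means renaming headers such as `Price ($)` to SQL-friendly names. The helper `sql_friendly`, the table name `car_sales`, and the sample row are hypothetical. `sqlite3` is used only as a stand-in so that the example runs without a database server.

```python
import re
import sqlite3

import pandas as pd

def sql_friendly(name: str) -> str:
    """Lower-case a column name and replace runs of non-alphanumerics with underscores."""
    return re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_").lower()

# Hypothetical one-row sample with the original header style
sample = pd.DataFrame({
    "Car_id": ["C_CND_000001"],
    "Customer Name": ["Geraldine"],
    "Price ($)": [26000],
})
fixed = sample.rename(columns=sql_friendly)

# Stand-in database: for MySQL, pass a SQLAlchemy engine to to_sql
# instead of this sqlite3 connection.
connection = sqlite3.connect(":memory:")
fixed.to_sql("car_sales", connection, index=False, if_exists="replace")
round_trip = pd.read_sql("SELECT * FROM car_sales", connection)
connection.close()

print(list(round_trip.columns))
```

Reading the table back, as `round_trip` does here, is a cheap check that the column names and row count survived the transfer.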