# Project Title: Car Price Prediction


## Outline 
- Creating a car-price prediction project with a linear regression model 
- Performing an initial exploratory data analysis with Jupyter notebook
- Setting up a validation framework
- Implementing the linear regression model and other regression models
- Performing feature engineering for the model 
- Keeping the model under control with regularization 
- Using the model to predict car prices

## Project Description
The aim of this project is to develop a machine learning model that can accurately predict the price of a car based on various features such as make, model, mileage, year, and other relevant factors. The project will involve collecting a dataset of car listings with associated prices and features from online sources or existing databases. Data preprocessing steps will be implemented to clean and prepare the dataset for modeling, including handling missing values, encoding categorical variables, and scaling numerical features.

Next, various machine learning algorithms such as linear regression, decision trees, random forests, and gradient boosting will be trained and evaluated using techniques like cross-validation and hyperparameter tuning to identify the best-performing model. Feature importance analysis will also be conducted to understand which features have the most significant impact on the predicted car prices.

The developed model will be deployed into a user-friendly interface, allowing users to input car features and obtain a predicted price estimate. Additionally, the project will include documentation detailing the steps involved in data collection, preprocessing, modeling, evaluation, and deployment, making it accessible for others to understand and replicate.

## Dataset Description 

### Dataset URL
- https://www.kaggle.com/code/jshih7/car-price-prediction?select=data.csv


### Attributes

- make: make of a car (BMW, Toyota, and so on)
- model: model of a car
- year: year when the car was manufactured
- engine_fuel_type: type of fuel the engine needs (diesel, electric, and so on)
- engine_hp: horsepower of the engine
- engine_cylinders: number of cylinders in the engine
- transmission_type: type of transmission (automatic or manual)
- driven_wheels: front, rear, all
- number_of_doors: number of doors a car has
- market_category: luxury, crossover, and so on
- vehicle_size: compact, midsize, or large
- vehicle_style: sedan or convertible
- highway_mpg: miles per gallon (mpg) on the highway
- city_mpg: miles per gallon in the city
- popularity: number of times the car was mentioned in a Twitter stream
- msrp: manufacturer’s suggested retail price


## Importing Libraries


In [10]:
## loading and preprocessing data
import pandas as pd 
import numpy as np 

## visualization of data
import matplotlib.pyplot as plt 
import seaborn as sns 

## building validation framework 
from sklearn.model_selection import train_test_split 

## categorical encoding 
from sklearn.feature_extraction import DictVectorizer

## regression model 
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

## metrics 
from sklearn.metrics import root_mean_squared_error

## Loading Data 

In [11]:
## loading dataset
data = pd.read_csv("Datasets/car_price_datatset.csv")

## create a copy 
df = data.copy()

FileNotFoundError: [Errno 2] No such file or directory: 'Datasets/car_price_datatset.csv'

## Data Preview And Understanding
- Loading the dataset

In [None]:
## view the first rows 
df.head()

In [None]:
## view the last five rows 
df.tail()

In [None]:
## check the number of rows and columns 
print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")

In [None]:
## get a summary description of the data
df.info()

In [None]:
## checking for missing values
df.isnull().sum()

In [None]:
## checking for duplicated values 
df.duplicated().sum()

In [None]:
## check the data type of each column
df.dtypes

In [None]:
## count the unique values in each column
df.nunique()

In [None]:
for each_name in df.columns: 
    print(each_name)
    print(df[each_name].unique())

## Data preprocessing 
- Normalizing column names (lower case, underscores instead of spaces)
- Converting categorical columns (year, number_of_doors, vehicle_size, and others) to the category dtype
- Replacing unusual characters with NaN values
- Filling in missing values

In [None]:
## change column names to lower case and replace spaces with underscores
df.columns = df.columns.str.lower().str.replace(' ', '_')

In [None]:
## treat these columns as categorical rather than numeric or free text
categorical_cols = [
    'year', 'number_of_doors', 'vehicle_size', 'vehicle_style', 'make',
    'transmission_type', 'market_category', 'engine_cylinders',
    'engine_fuel_type', 'model',
]
for col in categorical_cols:
    df[col] = df[col].astype('category')

In [None]:
## fill missing categorical values with the most frequent value (the mode)
for col in ['engine_cylinders', 'number_of_doors', 'market_category', 'engine_fuel_type']:
    df[col] = df[col].fillna(df[col].value_counts().index[0])

In [None]:
## fill missing horsepower values with the column mean
df['engine_hp'] = df['engine_hp'].fillna(df['engine_hp'].mean())

## Descriptive Analysis
- statistical summary

In [None]:
df.describe().round()

In [None]:
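## select the numeric columns ('number' covers every integer and float dtype)
numerical_cols = df.select_dtypes(include='number')

## pairwise correlation between the numeric columns
corr_matrix = numerical_cols.corr()

## correlation of each numeric column with the target
corr_matrix['msrp']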

## Exploratory Data Analysis
- Target variable analysis
- Plotting the correlation against the target variable
- Outlier analysis

In [None]:
plt.figure(figsize=(12, 6))

plt.title('Frequency Distribution of Car Price')
plt.xlabel('Price')
plt.ylabel('Count') 

sns.histplot(df['msrp'][df['msrp'] < 100000], color='gold') 

plt.show()


In [None]:
## log transformation 
log_price = np.log1p(df['msrp'])

In [None]:

plt.figure(figsize=(12, 6))

plt.title('Frequency Distribution of Log Car Price')
plt.xlabel('Log Price')
plt.ylabel('Count') 

sns.histplot(log_price, color='gold') 

plt.show()

In [None]:
## performing a correlation on the numerical columns
## select numerical ..


## Building a validation framework 
- Training dataset 60%
- Validation dataset 20%
- Testing dataset 20%

In [None]:
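## hold out 20% of the data for testing
df_train_full, df_test = train_test_split(df, test_size=0.2, random_state=10)
## 25% of the remaining 80% is 20% of the full dataset, leaving 60% for training
df_train, df_valid = train_test_split(df_train_full, test_size=0.25, random_state=10)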

print(f'Size of Training Dataset {len(df_train)}')
print(f'Size of Validation Dataset {len(df_valid)}')
print(f'Size of Testing Dataset {len(df_test)}')
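
The 60/20/20 proportions follow from splitting twice: 20% is held out first, then 25% of the remaining 80% becomes the validation set. The sketch below checks that arithmetic on a hypothetical array of 1,000 row indices that stands in for the dataframe; the array and the names used here are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the dataframe: 1,000 row indices.
rows = np.arange(1000)

# First split: 20% goes to the test set, 80% remains.
train_full, test = train_test_split(rows, test_size=0.2, random_state=10)
# Second split: 25% of the remaining 80% is 20% of the whole.
train, valid = train_test_split(train_full, test_size=0.25, random_state=10)

fractions = {
    name: len(part) / len(rows)
    for name, part in [('train', train), ('valid', valid), ('test', test)]
}
print(fractions)
```

The three parts are disjoint, so no row is used for both fitting and evaluation.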

In [None]:
df_train.head()

## Data Preprocessing 2 

In [None]:
## log-transform the target to reduce the skew of the price distribution
y_train = np.log1p(df_train['msrp']).values
y_valid = np.log1p(df_valid['msrp']).values
y_test = np.log1p(df_test['msrp']).values 
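
Because the models are trained on log1p(msrp), their predictions are on the log scale and have to be converted back before they can be read as prices. A minimal sketch of the round trip, using made-up prices rather than values from the dataset:

```python
import numpy as np

# Hypothetical prices in dollars, spanning several orders of magnitude.
prices = np.array([0.0, 2_000.0, 25_000.0, 1_500_000.0])

log_prices = np.log1p(prices)     # log(1 + price); defined even when price is 0
recovered = np.expm1(log_prices)  # exp(value) - 1 undoes log1p

print(log_prices.round(2))
print(recovered.round(2))
```

np.expm1 is the exact inverse of np.log1p, so applying it to a model's output returns an estimate in dollars.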

In [None]:
y_valid

In [None]:
del df_train['msrp']
del df_valid['msrp']
del df_test['msrp'] 

In [None]:
df_train.head()

In [None]:
## select the numeric columns for the baseline model
df_train_bl = df_train.select_dtypes(include='number')
df_valid_bl = df_valid.select_dtypes(include='number')

## convert the dataframes to NumPy arrays
X_train_bl = df_train_bl.values
X_valid_bl = df_valid_bl.values

## Training a Baseline Linear Regression 

In [None]:
## create an instance of the model
lr_bl_model = LinearRegression()

## train the model
lr_bl_model.fit(X_train_bl, y_train)

## Model Evaluation 

In [None]:
## generate validation predictions 
y_valid_pred = lr_bl_model.predict(X_valid_bl) 

In [None]:
## RMSE is computed on log1p(msrp), so it is a log-scale error, not a percentage
rmse_bl = root_mean_squared_error(y_valid, y_valid_pred)

print(f'Baseline validation RMSE (log scale): {rmse_bl:.2f}')
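
The validation metric is the root mean squared error between the log-transformed prices and the predictions. The sketch below spells out the formula with NumPy and applies it to made-up prices (an assumption for illustration). It also shows one rough way to read a log-scale RMSE: exponentiating it gives an approximate multiplicative error factor.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical true and predicted prices, compared on the log1p scale.
true_prices = np.array([20_000.0, 35_000.0, 50_000.0])
pred_prices = np.array([22_000.0, 30_000.0, 55_000.0])

log_rmse = rmse(np.log1p(true_prices), np.log1p(pred_prices))
# exp(log_rmse) is a rough "typical" multiplicative error factor.
print(f'log-scale RMSE: {log_rmse:.3f}, typical factor: {np.exp(log_rmse):.2f}x')
```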

In [None]:
df_train.columns

## Decision Tree

In [None]:
## create an instance of the decision tree regressor
dt_model_bl = DecisionTreeRegressor(random_state=11)

## train it on the same baseline features as the linear model
dt_model_bl.fit(X_train_bl, y_train)
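
The decision tree is fitted but not yet scored. One way to compare it with the linear baseline is to compute its RMSE on both the training and validation sets. The sketch below does this on synthetic data (an assumption, since the real dataset is not loaded here). It computes RMSE as the square root of mean_squared_error, and max_depth=5 is a hypothetical setting, not a value taken from the notebook.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: 2 numeric features and a noisy log-price target.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 2))
y = 10 + 2 * X[:, 0] - X[:, 1] + rng.normal(0, 0.1, size=500)

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=11
)

# max_depth is a hypothetical setting; an unrestricted tree can memorize the training set.
tree = DecisionTreeRegressor(max_depth=5, random_state=11)
tree.fit(X_train, y_train)

rmse_train = np.sqrt(mean_squared_error(y_train, tree.predict(X_train)))
rmse_valid = np.sqrt(mean_squared_error(y_valid, tree.predict(X_valid)))
print(f'train RMSE: {rmse_train:.3f}, validation RMSE: {rmse_valid:.3f}')
```

A large gap between the two numbers suggests overfitting, which is common for unconstrained trees.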

## Feature Engineering 
- select two categorical cols: year, transmission_type 

In [None]:
## create a new list of the column names
cat_fe_1 = ['year', 'transmission_type']

numerical_cols = ['engine_hp', 'city_mpg', 'popularity', 'highway_mpg']


df_train_fe_1 = df_train[numerical_cols + cat_fe_1]

df_valid_fe_1 = df_valid[numerical_cols + cat_fe_1]


In [None]:
## convert our dataframe to a dictionary 
dict_train_fe_1 = df_train_fe_1.to_dict(orient='records')
dict_valid_fe_1 = df_valid_fe_1.to_dict(orient='records')

In [None]:
## create an instance 
dv = DictVectorizer(sparse=False)

dv.fit(dict_train_fe_1)

In [None]:
X_train_fe_1 = dv.transform(dict_train_fe_1)
X_valid_fe_1 = dv.transform(dict_valid_fe_1)

## Train Fe_1 Model 

In [None]:
lr_fe_1_model = LinearRegression() 

lr_fe_1_model.fit(X_train_fe_1, y_train)

In [None]:
y_valid_pred_fe_1 = lr_fe_1_model.predict(X_valid_fe_1)

In [None]:
rmse_fe_1 = root_mean_squared_error(y_valid, y_valid_pred_fe_1)

print(f'Linear regression validation RMSE for Fe_1 (log scale): {rmse_fe_1:.2f}')
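
The outline lists regularization as a way to keep the model under control, but the notebook does not implement it yet. A minimal sketch of the idea with scikit-learn's Ridge, on synthetic data with two nearly duplicate columns (an assumption that mimics redundant one-hot features); alpha=1.0 is a hypothetical starting value that would normally be tuned on the validation set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(1)
x = rng.normal(size=200)
# Two nearly identical columns make the plain least-squares solution unstable.
X = np.column_stack([x, x + rng.normal(scale=1e-4, size=200)])
y = 3 * x + rng.normal(scale=0.1, size=200)

plain = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)  # alpha is a hypothetical starting value

print('plain coefficients:', plain.coef_)
print('ridge coefficients:', ridge.coef_)
```

Ridge adds a penalty on the squared size of the coefficients, so it splits the weight between the duplicated columns instead of letting them grow in opposite directions.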

## Save The Model 

## Load The Model 

## Make Predictions 
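
The save, load, and predict steps named in the last three headings can be sketched as one round trip with pickle. Everything below is illustrative: the training records, the prices, and the file name car_price_model.bin are hypothetical, and the model is a small stand-in for the one trained in the notebook.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression

# Hypothetical training records and prices, only to make the sketch runnable.
train_records = [
    {'engine_hp': 150.0, 'transmission_type': 'AUTOMATIC'},
    {'engine_hp': 300.0, 'transmission_type': 'MANUAL'},
    {'engine_hp': 200.0, 'transmission_type': 'AUTOMATIC'},
    {'engine_hp': 400.0, 'transmission_type': 'MANUAL'},
]
train_prices = np.array([20_000.0, 45_000.0, 27_000.0, 60_000.0])

dv = DictVectorizer(sparse=False)
model = LinearRegression().fit(dv.fit_transform(train_records), np.log1p(train_prices))

# Save the vectorizer together with the model; both are needed at prediction time.
model_path = os.path.join(tempfile.mkdtemp(), 'car_price_model.bin')  # hypothetical name
with open(model_path, 'wb') as f_out:
    pickle.dump((dv, model), f_out)

# Load them back.
with open(model_path, 'rb') as f_in:
    dv_loaded, model_loaded = pickle.load(f_in)

# Predict for a new car, then undo the log1p transform to get a price.
new_car = {'engine_hp': 250.0, 'transmission_type': 'MANUAL'}
X_new = dv_loaded.transform([new_car])
predicted_price = np.expm1(model_loaded.predict(X_new))[0]
print(f'Predicted price: {predicted_price:.0f}')
```

Pickle files should only be loaded from trusted sources, and the loading environment should use the same scikit-learn version that saved the model.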