# <font color=darkblue> Machine Learning model deployment with Flask framework</font>

## <font color=Blue>Used Cars Price Prediction Application</font>

### Objective:
1. Build a machine learning regression model that predicts the selling price of used cars from input features such as fuel_type, kms_driven, and transmission type.
2. Deploy the machine learning model with the Flask framework.

### Dataset Information:
#### Dataset Source: https://www.kaggle.com/datasets/nehalbirla/vehicle-dataset-from-cardekho?select=CAR+DETAILS+FROM+CAR+DEKHO.csv
This dataset contains information about used cars listed on www.cardekho.com
- **Car_Name**: Name of the car
- **Year**: Year of Purchase
- **Selling Price (target)**: Selling price of the car in lakhs
- **Present Price**: Present price of the car in lakhs
- **Kms_Driven**: Kilometers driven
- **Fuel_Type**: Petrol, Diesel, or CNG
- **Seller_Type**: Dealer or Individual
- **Transmission**: Manual or Automatic
- **Owner**: first, second or third owner


### 1. Import required libraries

In [1]:
# Importing libraries for data processing and machine learning
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler


### 2. Load the dataset

In [2]:
data_set = "./data/car_data.csv"
df = pd.read_csv(data_set)



In [3]:
df.head()


Unnamed: 0,Car_Name,Year,Selling_Price,Present_Price,Kms_Driven,Fuel_Type,Seller_Type,Transmission,Owner
0,ritz,2014,3.35,5.59,27000,Petrol,Dealer,Manual,0
1,sx4,2013,4.75,9.54,43000,Diesel,Dealer,Manual,0
2,ciaz,2017,7.25,9.85,6900,Petrol,Dealer,Manual,0
3,wagon r,2011,2.85,4.15,5200,Petrol,Dealer,Manual,0
4,swift,2014,4.6,6.87,42450,Diesel,Dealer,Manual,0


### 3. Check the shape and basic information of the dataset.

In [4]:
df.shape


(301, 9)

In [5]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 301 entries, 0 to 300
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Car_Name       301 non-null    object 
 1   Year           301 non-null    int64  
 2   Selling_Price  301 non-null    float64
 3   Present_Price  301 non-null    float64
 4   Kms_Driven     301 non-null    int64  
 5   Fuel_Type      301 non-null    object 
 6   Seller_Type    301 non-null    object 
 7   Transmission   301 non-null    object 
 8   Owner          301 non-null    int64  
dtypes: float64(2), int64(3), object(4)
memory usage: 21.3+ KB


In [6]:
df.isnull().sum()


Car_Name         0
Year             0
Selling_Price    0
Present_Price    0
Kms_Driven       0
Fuel_Type        0
Seller_Type      0
Transmission     0
Owner            0
dtype: int64

### 4. Check for duplicate records in the dataset and drop them if present

In [7]:
duplicates = df[df.duplicated()]


In [8]:
print("Duplicate rows:")
print(duplicates)

Duplicate rows:
    Car_Name  Year  Selling_Price  Present_Price  Kms_Driven Fuel_Type  \
17    ertiga  2016           7.75          10.79       43000    Diesel   
93  fortuner  2015          23.00          30.61       40000    Diesel   

   Seller_Type Transmission  Owner  
17      Dealer       Manual      0  
93      Dealer    Automatic      0  


In [9]:
# drop duplicates:
df_no_duplicates = df.drop_duplicates()

In [10]:
df_no_duplicates.head()

Unnamed: 0,Car_Name,Year,Selling_Price,Present_Price,Kms_Driven,Fuel_Type,Seller_Type,Transmission,Owner
0,ritz,2014,3.35,5.59,27000,Petrol,Dealer,Manual,0
1,sx4,2013,4.75,9.54,43000,Diesel,Dealer,Manual,0
2,ciaz,2017,7.25,9.85,6900,Petrol,Dealer,Manual,0
3,wagon r,2011,2.85,4.15,5200,Petrol,Dealer,Manual,0
4,swift,2014,4.6,6.87,42450,Diesel,Dealer,Manual,0


In [11]:
df_no_duplicates.info()

<class 'pandas.core.frame.DataFrame'>
Index: 299 entries, 0 to 300
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Car_Name       299 non-null    object 
 1   Year           299 non-null    int64  
 2   Selling_Price  299 non-null    float64
 3   Present_Price  299 non-null    float64
 4   Kms_Driven     299 non-null    int64  
 5   Fuel_Type      299 non-null    object 
 6   Seller_Type    299 non-null    object 
 7   Transmission   299 non-null    object 
 8   Owner          299 non-null    int64  
dtypes: float64(2), int64(3), object(4)
memory usage: 23.4+ KB


In [12]:
df_no_duplicates.shape

(299, 9)

In [13]:
df_no_duplicates.duplicated().sum()

np.int64(0)

In [14]:
# We have removed all the duplicates
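
The duplicate check above can be reproduced on a toy frame. This is a minimal sketch with hypothetical values (not rows taken from the dataset file); it shows that `duplicated()` flags only the second and later copies and that `drop_duplicates()` keeps the original index labels, which is why `info()` above reports 299 entries with an index running from 0 to 300.

```python
import pandas as pd

# Toy frame (hypothetical values) with one exact duplicate row.
toy = pd.DataFrame({
    "Car_Name": ["ertiga", "ertiga", "swift"],
    "Year": [2016, 2016, 2014],
    "Selling_Price": [7.75, 7.75, 4.60],
})

dup_mask = toy.duplicated()        # True only for repeats of an earlier row
deduped = toy.drop_duplicates()    # keep="first" is the default

n_dup = int(dup_mask.sum())
kept_index = deduped.index.tolist()  # original labels survive the drop
print(n_dup, kept_index)
```

If a contiguous index is preferred afterwards, `deduped.reset_index(drop=True)` would renumber the rows; the notebook does not do this, and nothing later depends on it.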

### 5. Drop the columns that are redundant for the analysis

In [15]:
df_new = df_no_duplicates.copy()
df_new.head()


Unnamed: 0,Car_Name,Year,Selling_Price,Present_Price,Kms_Driven,Fuel_Type,Seller_Type,Transmission,Owner
0,ritz,2014,3.35,5.59,27000,Petrol,Dealer,Manual,0
1,sx4,2013,4.75,9.54,43000,Diesel,Dealer,Manual,0
2,ciaz,2017,7.25,9.85,6900,Petrol,Dealer,Manual,0
3,wagon r,2011,2.85,4.15,5200,Petrol,Dealer,Manual,0
4,swift,2014,4.6,6.87,42450,Diesel,Dealer,Manual,0


In [16]:
df_new.columns


Index(['Car_Name', 'Year', 'Selling_Price', 'Present_Price', 'Kms_Driven',
       'Fuel_Type', 'Seller_Type', 'Transmission', 'Owner'],
      dtype='object')

In [17]:
# Check for constant columns (columns with only one unique value)
constant_columns = [col for col in df_new.columns if df_new[col].nunique() == 1]
print("\nConstant columns:")
print(constant_columns)


Constant columns:
[]


In [18]:
# Drop the columns treated as redundant here: 'Year', 'Fuel_Type', 'Seller_Type', 'Transmission', 'Owner'
# ('Year' stays available in df_new and is used later to derive the car's age)
columns_to_drop = ['Year', 'Fuel_Type', 'Seller_Type', 'Transmission', 'Owner']
df_cleaned = df_new.drop(columns=columns_to_drop)

In [19]:
df_cleaned.columns


Index(['Car_Name', 'Selling_Price', 'Present_Price', 'Kms_Driven'], dtype='object')

In [20]:
df_cleaned.head()

Unnamed: 0,Car_Name,Selling_Price,Present_Price,Kms_Driven
0,ritz,3.35,5.59,27000
1,sx4,4.75,9.54,43000
2,ciaz,7.25,9.85,6900
3,wagon r,2.85,4.15,5200
4,swift,4.6,6.87,42450


In [21]:
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Index: 299 entries, 0 to 300
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Car_Name       299 non-null    object 
 1   Selling_Price  299 non-null    float64
 2   Present_Price  299 non-null    float64
 3   Kms_Driven     299 non-null    int64  
dtypes: float64(2), int64(1), object(1)
memory usage: 11.7+ KB


In [22]:
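# Take an explicit copy so that adding columns later does not trigger a SettingWithCopyWarning
df_car = df_cleaned.dropna().copy()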

In [23]:
df_car.shape

(299, 4)

In [24]:
df_car.info()

<class 'pandas.core.frame.DataFrame'>
Index: 299 entries, 0 to 300
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Car_Name       299 non-null    object 
 1   Selling_Price  299 non-null    float64
 2   Present_Price  299 non-null    float64
 3   Kms_Driven     299 non-null    int64  
dtypes: float64(2), int64(1), object(1)
memory usage: 11.7+ KB


### 6. Extract a new feature called 'age_of_the_car' from the feature 'Year' and drop the feature 'Year'

In [25]:
from datetime import datetime

In [26]:
current_year = datetime.now().year

In [27]:
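# 'Year' was dropped from df_car in step 5, so read it from df_new (rows align on the shared index)
df_car["age_of_the_car"] = current_year - df_new["Year"]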

In [28]:
df_car.head()

Unnamed: 0,Car_Name,Selling_Price,Present_Price,Kms_Driven,age_of_the_car
0,ritz,3.35,5.59,27000,10
1,sx4,4.75,9.54,43000,11
2,ciaz,7.25,9.85,6900,7
3,wagon r,2.85,4.15,5200,13
4,swift,4.6,6.87,42450,10
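
Because `current_year` comes from `datetime.now()`, the ages change whenever the notebook is rerun in a later year. The sketch below pins the reference year instead. `REFERENCE_YEAR` and `add_car_age` are hypothetical names introduced for illustration, and the value 2024 is an assumption inferred from the output above (a 2014 car showing an age of 10).

```python
import pandas as pd

# Assumption: inferred from the output above, where Year 2014 maps to age 10.
REFERENCE_YEAR = 2024

def add_car_age(frame, year_column="Year", reference_year=REFERENCE_YEAR):
    """Return a copy of `frame` with an `age_of_the_car` column added."""
    out = frame.copy()
    out["age_of_the_car"] = reference_year - out[year_column]
    return out

toy = pd.DataFrame({"Car_Name": ["ritz", "sx4", "ciaz"], "Year": [2014, 2013, 2017]})
aged = add_car_age(toy)
ages = aged["age_of_the_car"].tolist()
print(ages)
```

The same reference year would then need to be used in the Flask app when a user enters a purchase year, so that training and prediction compute the age the same way.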


In [29]:
df_car.describe()

Unnamed: 0,Selling_Price,Present_Price,Kms_Driven,age_of_the_car
count,299.0,299.0,299.0,299.0
mean,4.589632,7.541037,36916.752508,10.384615
std,4.98424,8.567887,39015.170352,2.896868
min,0.1,0.32,500.0,6.0
25%,0.85,1.2,15000.0,8.0
50%,3.51,6.1,32000.0,10.0
75%,6.0,9.84,48883.5,12.0
max,35.0,92.6,500000.0,21.0


In [30]:
# Note: the feature "Year" was already dropped in step 5, so df_car does not contain it

### 7. Encode the categorical columns

In [31]:
# Let us use Label Encoding for the "Car_Name" categorical column

In [32]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

In [33]:
df_car['CarName'] = le.fit_transform(df_car['Car_Name'])

In [34]:
df_car.head()


Unnamed: 0,Car_Name,Selling_Price,Present_Price,Kms_Driven,age_of_the_car,CarName
0,ritz,3.35,5.59,27000,10,90
1,sx4,4.75,9.54,43000,11,93
2,ciaz,7.25,9.85,6900,7,68
3,wagon r,2.85,4.15,5200,13,96
4,swift,4.6,6.87,42450,10,92
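
One consequence of label encoding matters at deployment time: a fitted `LabelEncoder` raises a `ValueError` when asked to transform a label it did not see during `fit`. The sketch below guards against that with a hypothetical helper, `encode_car_name`, using a small made-up list of car names rather than the full dataset.

```python
from sklearn.preprocessing import LabelEncoder

# Small stand-in vocabulary (hypothetical subset of car names).
encoder = LabelEncoder()
encoder.fit(["ritz", "sx4", "ciaz", "swift"])

def encode_car_name(name, fitted_encoder):
    """Return the integer code for `name`, or None if it was not seen at fit time."""
    if name not in set(fitted_encoder.classes_):
        return None
    return int(fitted_encoder.transform([name])[0])

known = encode_car_name("swift", encoder)            # codes follow sorted label order
unknown = encode_car_name("unknown model", encoder)  # would raise without the guard
print(known, unknown)
```

Note also that the integer codes follow alphabetical order of the names, so they carry no pricing meaning by themselves; a tree-based model can still split on them, but the ordering is arbitrary.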


### 8. Separate the target and independent features.

In [35]:
# Reorder columns to start with 'CarName'
new_column_order = ['CarName'] + [col for col in df_car.columns if col != 'CarName']
df_car = df_car[new_column_order]

In [36]:
df_car.head()

Unnamed: 0,CarName,Car_Name,Selling_Price,Present_Price,Kms_Driven,age_of_the_car
0,90,ritz,3.35,5.59,27000,10
1,93,sx4,4.75,9.54,43000,11
2,68,ciaz,7.25,9.85,6900,7
3,96,wagon r,2.85,4.15,5200,13
4,92,swift,4.6,6.87,42450,10


In [37]:
# Independent Features (X)
columns_to_drop = ['Car_Name', 'Selling_Price', 'Present_Price']
X = df_car.drop(columns=columns_to_drop)

In [38]:
X.head()

Unnamed: 0,CarName,Kms_Driven,age_of_the_car
0,90,27000,10
1,93,43000,11
2,68,6900,7
3,96,5200,13
4,92,42450,10


In [39]:
# Target (y)
y = df_car['Selling_Price']

In [40]:
y.head()

0    3.35
1    4.75
2    7.25
3    2.85
4    4.60
Name: Selling_Price, dtype: float64

### 9. Split the data into train and test.

In [41]:
# Splitting the data into training and testing sets (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

### 10. Build a Random Forest Regressor model and check the R2-score for train and test

In [42]:
from sklearn.ensemble import RandomForestRegressor

# Random Forest Regressor model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)


In [43]:
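# Standardize the features (tree ensembles do not require scaling, but the fitted
# scaler is saved later and must be applied to inputs at prediction time)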
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


In [44]:
# Train the model
model = rf_model.fit(X_train_scaled, y_train)


In [45]:
model

In [46]:
model.score(X_train_scaled, y_train)

0.9179304736428145

In [47]:
model.score(X_test_scaled, y_test)

0.7211179495100304
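
The gap between the training score (about 0.918) and the test score (about 0.721) comes from a single 80-20 split of 299 rows, so it may vary noticeably with the split. A k-fold cross-validation gives a steadier estimate. The sketch below is an illustration on synthetic data from `make_regression`, not on the car dataset, and the fold count of 5 is an assumption.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data with three features, like X in the notebook.
X_demo, y_demo = make_regression(n_samples=120, n_features=3, noise=10.0, random_state=0)

forest = RandomForestRegressor(n_estimators=50, random_state=42)

# One R2 score per fold; the model is refit from scratch for each fold.
scores = cross_val_score(forest, X_demo, y_demo, cv=5, scoring="r2")
mean_r2, std_r2 = scores.mean(), scores.std()
print(f"R2 per fold: {scores.round(3)}, mean={mean_r2:.3f}, std={std_r2:.3f}")
```

On the real data, the same call could be made with `X` and `y` from step 8 in place of the synthetic arrays.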

In [48]:
y_pred = rf_model.predict(X_test_scaled)


In [49]:
from sklearn.metrics import r2_score

# Make predictions on both training and testing sets
y_train_pred = rf_model.predict(X_train_scaled)
y_test_pred = rf_model.predict(X_test_scaled)

In [50]:
# Calculate the R2-score for both training and testing data
r2_train = r2_score(y_train, y_train_pred)
r2_test = r2_score(y_test, y_test_pred)


In [51]:
# Output the R2-scores
print(f"R2-score for training data: {r2_train:.4f}")
print(f"R2-score for testing data: {r2_test:.4f}")


R2-score for training data: 0.9179
R2-score for testing data: 0.7211
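
R2 is unitless, so it does not say how far off a typical prediction is in lakhs. `mean_squared_error` is imported in step 1 but never used; the sketch below shows how it could give a root mean squared error in the target's own units. The arrays are hypothetical values for illustration, not the notebook's predictions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([3.35, 4.75, 7.25, 2.85])  # hypothetical prices in lakhs
y_hat = np.array([3.00, 5.00, 7.00, 3.00])   # hypothetical predictions

# RMSE is the square root of the mean squared error, in the same units as y.
rmse = float(np.sqrt(mean_squared_error(y_true, y_hat)))
r2 = float(r2_score(y_true, y_hat))
print(f"RMSE: {rmse:.4f} lakhs, R2: {r2:.4f}")
```

In the notebook, the same two lines would take `y_test` and `y_test_pred` as arguments.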


### 11. Create a pickle file with an extension as .pkl

In [52]:
import pickle as pkl

# Save the trained model; the context manager closes the file even if dump() fails
with open('./models/model.pkl', 'wb') as pickle_file1:
    pkl.dump(rf_model, pickle_file1)


In [53]:
# Save the fitted scaler so the app can apply the same standardization
with open('./models/scaler.pkl', 'wb') as pickle_file2:
    pkl.dump(scaler, pickle_file2)


In [54]:
# Save the fitted label encoder so the app can map car names to the same codes
with open('./models/car_name_encoder.pkl', 'wb') as pickle_file3:
    pkl.dump(le, pickle_file3)
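
Before moving on to the app, it can help to confirm that the saved artifacts reload and reproduce the same prediction. The sketch below is a self-contained round trip using tiny stand-in objects and a temporary directory instead of `./models/`; the helper names `save_pickle` and `load_pickle` and the four training rows are assumptions made for illustration.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import LabelEncoder, StandardScaler

def save_pickle(obj, path):
    with open(path, "wb") as fh:
        pickle.dump(obj, fh)

def load_pickle(path):
    with open(path, "rb") as fh:
        return pickle.load(fh)

# Tiny stand-in artifacts (hypothetical rows) so the sketch runs on its own.
names = ["ritz", "sx4", "ciaz", "swift"]
encoder = LabelEncoder().fit(names)
X = np.column_stack([encoder.transform(names), [27000, 43000, 6900, 42450], [10, 11, 7, 10]])
y = np.array([3.35, 4.75, 7.25, 4.60])
scaler = StandardScaler().fit(X)
model = RandomForestRegressor(n_estimators=10, random_state=42).fit(scaler.transform(X), y)

with tempfile.TemporaryDirectory() as tmp:
    for stem, obj in [("model", model), ("scaler", scaler), ("car_name_encoder", encoder)]:
        save_pickle(obj, Path(tmp) / f"{stem}.pkl")
    loaded_model = load_pickle(Path(tmp) / "model.pkl")
    loaded_scaler = load_pickle(Path(tmp) / "scaler.pkl")
    loaded_encoder = load_pickle(Path(tmp) / "car_name_encoder.pkl")

# Same feature order as in training: CarName, Kms_Driven, age_of_the_car
row = np.array([[loaded_encoder.transform(["swift"])[0], 42450, 10]])
original_pred = model.predict(scaler.transform(row))[0]
reloaded_pred = loaded_model.predict(loaded_scaler.transform(row))[0]
print(original_pred, reloaded_pred)
```

In the notebook the scaler was fitted on a DataFrame, so passing it a DataFrame with the columns `CarName`, `Kms_Driven`, and `age_of_the_car` at prediction time avoids scikit-learn's feature-name warning. Pickle files should also be loaded with the same scikit-learn version that wrote them, and only from sources you trust.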


### 12. Create a new folder/project in Visual Studio/PyCharm that contains the "model.pkl" file. *Make sure you are using a virtual environment and have installed the required packages.*

### a) Create a basic HTML form for the frontend

Create a file **index.html** in the templates folder and copy the following code.

In [None]:
# Please refer to "./templates/index.html"


### b) Create app.py file and write the predict function

In [None]:
# please refer to ./app.py
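
The actual `app.py` is not reproduced in this notebook. As a rough idea of what such a file could look like, here is a minimal sketch under stated assumptions: the form field names (`car_name`, `kms_driven`, `age_of_the_car`), the `/predict` route, the `create_app` factory, and the inline `PAGE` template standing in for `templates/index.html` are all hypothetical, and tiny stand-in objects replace the pickled model, scaler, and encoder so the sketch runs on its own.

```python
import numpy as np
from flask import Flask, render_template_string, request
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Inline stand-in for templates/index.html (hypothetical field names).
PAGE = """
<form method="post" action="/predict">
  <input name="car_name"> <input name="kms_driven"> <input name="age_of_the_car">
  <button type="submit">Predict</button>
</form>
{% if prediction is not none %}<p>Predicted price: {{ prediction }} lakhs</p>{% endif %}
{% if error %}<p>{{ error }}</p>{% endif %}
"""

def create_app(model, scaler, encoder):
    """Build a Flask app around already-loaded model artifacts."""
    app = Flask(__name__)

    @app.route("/")
    def index():
        return render_template_string(PAGE, prediction=None, error=None)

    @app.route("/predict", methods=["POST"])
    def predict():
        name = request.form.get("car_name", "")
        if name not in set(encoder.classes_):
            return render_template_string(PAGE, prediction=None, error="Unknown car name"), 400
        try:
            kms = float(request.form["kms_driven"])
            age = float(request.form["age_of_the_car"])
        except (KeyError, ValueError):
            return render_template_string(PAGE, prediction=None, error="Invalid number"), 400
        # Same feature order as in training: CarName, Kms_Driven, age_of_the_car
        row = np.array([[encoder.transform([name])[0], kms, age]])
        price = float(model.predict(scaler.transform(row))[0])
        return render_template_string(PAGE, prediction=round(price, 2), error=None)

    return app

# Stand-in artifacts (hypothetical rows); the real app would unpickle the saved files.
names = ["ritz", "sx4", "ciaz", "swift"]
encoder = LabelEncoder().fit(names)
X = np.column_stack([encoder.transform(names), [27000, 43000, 6900, 42450], [10, 11, 7, 10]])
scaler = StandardScaler().fit(X)
model = RandomForestRegressor(n_estimators=10, random_state=42).fit(
    scaler.transform(X), [3.35, 4.75, 7.25, 4.60]
)

app = create_app(model, scaler, encoder)
client = app.test_client()  # exercises the routes without starting a server
ok = client.post("/predict", data={"car_name": "swift", "kms_driven": "42450", "age_of_the_car": "10"})
bad = client.post("/predict", data={"car_name": "nope", "kms_driven": "1", "age_of_the_car": "1"})
print(ok.status_code, bad.status_code)
```

In a real `app.py`, the three objects would be loaded from `./models/` with `pickle.load`, the template would live in `templates/index.html` and be rendered with `render_template`, and the file would end with `app.run()` under an `if __name__ == "__main__":` guard.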

### 13. Run the app.py file, which renders the index.html page; then enter the input values and get the prediction

In [None]:
# Run "app.py" from the project folder:
# python .\app.py
# then open http://127.0.0.1:5000/ in a browser to use the prediction app

### Happy Learning :)