# Problem Statement

- Size of test set: 2671 records

- FEATURES:

- Airline: The name of the airline.

- Date_of_Journey: The date of the journey.

- Source: The source from which the service begins.

- Destination: The destination where the service ends.

- Route: The route taken by the flight to reach the destination.

- Dep_Time: The time when the journey starts from the source.

- Arrival_Time: Time of arrival at the destination.

- Duration: Total duration of the flight.

- Total_Stops: Total stops between the source and destination.

- Additional_Info: Additional information about the flight.

- Price: The price of the ticket.

## Importing Libraries

In [161]:
import seaborn as sns
import pandas as pd
import numpy as np
import random

sns.set_context('notebook',font_scale=1.5)

import matplotlib.pyplot as plt


import warnings
warnings.filterwarnings("ignore")

In [162]:
train_data=pd.read_csv('train.csv')
test_data=pd.read_csv('test.csv')

In [163]:
train_data.shape,test_data.shape

((8012, 11), (2671, 10))

In [164]:
train_data.head()

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price
0,Airline C,12/06/2019,Delhi,Cochin,DEL → MAA → COK,20:40,09:25 13 Jun,12h 45m,1 stop,No info,7480
1,Airline A,18/06/2019,Banglore,Delhi,BLR → DEL,18:55,22:00,3h 5m,non-stop,No info,8016
2,Airline C,18/05/2019,Delhi,Cochin,DEL → BOM → COK,03:50,19:15,15h 25m,1 stop,No info,8879
3,Airline A,6/05/2019,Kolkata,Banglore,CCU → BOM → BLR,20:00,08:15 07 May,12h 15m,1 stop,In-flight meal not included,9663
4,Airline A,9/05/2019,Kolkata,Banglore,CCU → BOM → BLR,06:30,12:00,5h 30m,1 stop,In-flight meal not included,9663


In [165]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8012 entries, 0 to 8011
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Airline          8012 non-null   object
 1   Date_of_Journey  8012 non-null   object
 2   Source           8012 non-null   object
 3   Destination      8012 non-null   object
 4   Route            8011 non-null   object
 5   Dep_Time         8012 non-null   object
 6   Arrival_Time     8012 non-null   object
 7   Duration         8012 non-null   object
 8   Total_Stops      8011 non-null   object
 9   Additional_Info  8012 non-null   object
 10  Price            8012 non-null   int64 
dtypes: int64(1), object(10)
memory usage: 688.7+ KB


In [166]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2671 entries, 0 to 2670
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Airline          2671 non-null   object
 1   Date_of_Journey  2671 non-null   object
 2   Source           2671 non-null   object
 3   Destination      2671 non-null   object
 4   Route            2671 non-null   object
 5   Dep_Time         2671 non-null   object
 6   Arrival_Time     2671 non-null   object
 7   Duration         2671 non-null   object
 8   Total_Stops      2671 non-null   object
 9   Additional_Info  2671 non-null   object
dtypes: object(10)
memory usage: 208.8+ KB


In [167]:
# Check both datasets for missing values (train, test)
train_data.isnull().values.any(),test_data.isnull().values.any()

(True, False)

In [168]:
# Re-check the test set for missing values (both expressions inspect test_data)
test_data.isnull().values.any(),test_data.isnull().values.any()

(False, False)

In [169]:
train_data.isnull().sum()

Airline            0
Date_of_Journey    0
Source             0
Destination        0
Route              1
Dep_Time           0
Arrival_Time       0
Duration           0
Total_Stops        1
Additional_Info    0
Price              0
dtype: int64

In [171]:
test_data.isnull().sum()

Airline            0
Date_of_Journey    0
Source             0
Destination        0
Route              0
Dep_Time           0
Arrival_Time       0
Duration           0
Total_Stops        0
Additional_Info    0
dtype: int64

In [172]:
# test_data has no missing values (see the counts above), so this call leaves it unchanged
test_data.dropna(inplace=True)

In [173]:
# Check for duplicate rows
train_data[train_data.duplicated()]

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price
1048,Airline A,15/06/2019,Delhi,Cochin,DEL → JAI → BOM → COK,05:30,04:25 16 Jun,22h 55m,2 stops,In-flight meal not included,10368
1331,Airline A,12/06/2019,Delhi,Cochin,DEL → NAG → BOM → COK,14:35,12:35 13 Jun,22h,2 stops,In-flight meal not included,10919
1421,Airline A,15/06/2019,Delhi,Cochin,DEL → JAI → BOM → COK,05:30,04:25 16 Jun,22h 55m,2 stops,No info,13014
1909,Airline A,21/03/2019,Delhi,Cochin,DEL → AMD → BOM → COK,19:10,04:25 23 Mar,33h 15m,2 stops,In-flight meal not included,8834
2000,Airline C,9/05/2019,Delhi,Cochin,DEL → GOI → BOM → COK,22:00,19:15 10 May,21h 15m,2 stops,No info,10441
...,...,...,...,...,...,...,...,...,...,...,...
7800,Airline A,27/06/2019,Delhi,Cochin,DEL → AMD → BOM → COK,23:05,12:35 28 Jun,13h 30m,2 stops,No info,12819
7833,Airline C,18/05/2019,Delhi,Cochin,DEL → GOI → BOM → COK,22:00,19:15 19 May,21h 15m,2 stops,No info,10441
7851,Airline A,21/03/2019,Delhi,Cochin,DEL → MAA → BOM → COK,16:10,18:50 22 Mar,26h 40m,2 stops,In-flight meal not included,8728
7923,Airline C,21/05/2019,Delhi,Cochin,DEL → HYD → BOM → COK,17:15,19:15 22 May,26h,2 stops,No info,11972


In [174]:
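# Preview the de-duplicated frame; without inplace=True (or reassignment) train_data itself is unchanged
train_data.drop_duplicates()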

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price
0,Airline C,12/06/2019,Delhi,Cochin,DEL → MAA → COK,20:40,09:25 13 Jun,12h 45m,1 stop,No info,7480
1,Airline A,18/06/2019,Banglore,Delhi,BLR → DEL,18:55,22:00,3h 5m,non-stop,No info,8016
2,Airline C,18/05/2019,Delhi,Cochin,DEL → BOM → COK,03:50,19:15,15h 25m,1 stop,No info,8879
3,Airline A,6/05/2019,Kolkata,Banglore,CCU → BOM → BLR,20:00,08:15 07 May,12h 15m,1 stop,In-flight meal not included,9663
4,Airline A,9/05/2019,Kolkata,Banglore,CCU → BOM → BLR,06:30,12:00,5h 30m,1 stop,In-flight meal not included,9663
...,...,...,...,...,...,...,...,...,...,...,...
8007,Airline A,12/06/2019,Kolkata,Banglore,CCU → BOM → BLR,06:30,04:40 13 Jun,22h 10m,1 stop,In-flight meal not included,7594
8008,Airline C,9/06/2019,Delhi,Cochin,DEL → GOI → BOM → COK,22:00,19:15 10 Jun,21h 15m,2 stops,No info,10651
8009,Airline A,3/03/2019,Delhi,Cochin,DEL → BOM → COK,08:00,04:25 04 Mar,20h 25m,1 stop,No info,17024
8010,Airline A,18/05/2019,Delhi,Cochin,DEL → BOM → COK,11:30,12:35 19 May,25h 5m,1 stop,In-flight meal not included,12373


In [175]:
train_data.shape

(8011, 11)

In [176]:
# Drop duplicate rows, keeping the first occurrence
train_data.drop_duplicates(keep='first',inplace=True)

In [177]:
train_data.shape

(7894, 11)
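
The shape only changes after the call that uses `inplace=True`. A minimal sketch on a made-up three-row frame (the `demo` name and its values are illustrative, not taken from the dataset) shows the difference between previewing and modifying:

```python
import pandas as pd

# Hypothetical toy frame with one exact duplicate row
demo = pd.DataFrame({"Airline": ["A", "A", "B"], "Price": [100, 100, 250]})

preview = demo.drop_duplicates()       # returns a new, de-duplicated frame
shape_after_preview = demo.shape       # demo itself is untouched

demo.drop_duplicates(keep="first", inplace=True)  # modifies demo in place
shape_after_inplace = demo.shape

print(preview.shape, shape_after_preview, shape_after_inplace)
```

Reassigning the result (`demo = demo.drop_duplicates()`) would have the same effect as `inplace=True`.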

In [178]:
# Check for duplicate rows
test_data[test_data.duplicated()]

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info
431,Airline A,21-03-2019,Delhi,Cochin,DEL → BDQ → BOM → COK,18:25,23-03-2020 04:25,34h,2 stops,No info
806,Airline A,03-03-2019,Delhi,Cochin,DEL → IDR → BOM → COK,05:25,04-03-2020 18:50,37h 25m,2 stops,No info
1503,Airline A,27-06-2019,Delhi,Cochin,DEL → AMD → BOM → COK,23:05,28-06-2020 19:00,19h 55m,2 stops,No info
1614,Airline D,12-06-2019,Delhi,Cochin,DEL → BOM → COK,13:00,13-06-2020 01:30,12h 30m,1 stop,No info
1628,Airline A,24-06-2019,Delhi,Cochin,DEL → AMD → BOM → COK,23:05,25-06-2020 19:00,19h 55m,2 stops,No info
1639,Airline A,06-03-2019,Delhi,Cochin,DEL → AMD → BOM → COK,19:10,08-03-2020 04:25,33h 15m,2 stops,No info
1706,Airline C,03-03-2019,Delhi,Cochin,DEL → HYD → BOM → COK,21:30,04-03-2020 19:15,21h 45m,2 stops,No info
1892,Airline A,27-05-2019,Delhi,Cochin,DEL → AMD → BOM → COK,19:10,28-05-2020 19:00,23h 50m,2 stops,No info
1943,Airline C,03-03-2019,Banglore,New Delhi,BLR → DEL,06:10,08:55,2h 45m,non-stop,No info
1971,Airline A,21-05-2019,Delhi,Cochin,DEL → JAI → BOM → COK,05:30,22-05-2020 04:25,22h 55m,2 stops,No info


In [179]:
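# Preview the de-duplicated frame; without inplace=True (or reassignment) test_data itself is unchanged
test_data.drop_duplicates()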

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info
0,Airline B,27-03-2019,Delhi,Cochin,DEL → HYD → COK,04:55,16:10,11h 15m,1 stop,No info
1,Airline E,27-05-2019,Kolkata,Banglore,CCU → BLR,22:20,28-05-2020 00:40,2h 20m,non-stop,No info
2,Airline C,06-06-2019,Kolkata,Banglore,CCU → IXR → DEL → BLR,05:50,20:25,14h 35m,2 stops,No info
3,Airline A,06-03-2019,Banglore,New Delhi,BLR → MAA → DEL,09:45,14:25,4h 40m,1 stop,No info
4,Airline B,15-06-2019,Delhi,Cochin,DEL → BOM → COK,16:00,16-06-2020 01:30,9h 30m,1 stop,No info
...,...,...,...,...,...,...,...,...,...,...
2666,Airline C,21-03-2019,Delhi,Cochin,DEL → BOM → COK,08:00,19:15,11h 15m,1 stop,No info
2667,Airline G,27-04-2019,Kolkata,Banglore,CCU → BLR,19:55,22:25,2h 30m,non-stop,No info
2668,Airline A,09-06-2019,Delhi,Cochin,DEL → BHO → BOM → COK,05:30,12:35,7h 5m,2 stops,In-flight meal not included
2669,Airline A,01-05-2019,Banglore,Delhi,BLR → DEL,08:20,11:20,3h,non-stop,In-flight meal not included


In [180]:
test_data.shape

(2671, 10)

In [181]:
# Drop duplicate rows, keeping the first occurrence
test_data.drop_duplicates(keep='first',inplace=True)

In [182]:
test_data.shape

(2655, 10)

### Feature Engineering (Dividing data into features and labels)

In [183]:
# Convert Duration strings such as "12h 45m" into total minutes.
# The replacements turn "12h 45m" into the expression "12*60+45*1", which eval() then computes.
train_data['Duration'] = train_data['Duration'].str.replace('h', '*60').str.replace(' ', '+').str.replace('m', '*1').apply(eval)
test_data['Duration'] = test_data['Duration'].str.replace('h', '*60').str.replace(' ', '+').str.replace('m', '*1').apply(eval)
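
Calling `eval` on strings read from a file works here but will execute whatever the column contains. As an alternative, the sketch below parses the hour and minute parts with regular expressions instead. The helper name `duration_to_minutes` and the sample values are assumptions for illustration, not part of the original notebook:

```python
import pandas as pd

def duration_to_minutes(durations: pd.Series) -> pd.Series:
    """Convert strings like '12h 45m', '22h' or '5m' to total minutes."""
    # Pull out the number before 'h' and the number before 'm'; a missing part becomes 0
    hours = pd.to_numeric(durations.str.extract(r"(\d+)\s*h", expand=False), errors="coerce")
    minutes = pd.to_numeric(durations.str.extract(r"(\d+)\s*m", expand=False), errors="coerce")
    return (hours.fillna(0) * 60 + minutes.fillna(0)).astype(int)

sample = pd.Series(["12h 45m", "3h 5m", "22h", "5m"])
minutes = duration_to_minutes(sample)
print(minutes.tolist())  # [765, 185, 1320, 5]
```

The first two results match the `Duration` values shown for the same rows in the processed `train_data.head()` output (765 and 185).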

In [184]:
# Preprocessing the training data

print("Train data Info")
print("-"*75)
print(train_data.info())

print()
print()

print("Null values :")
print("-"*75)
train_data.dropna(inplace = True)
print(train_data.isnull().sum())

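# Extract numeric features from the date and time columns

# Date_of_Journey
# Caution: the first call passes neither dayfirst=True nor a format, so an ambiguous
# day-first date such as 12/06/2019 is read month-first (the later head() output shows
# Journey_day=6, Journey_month=12 for that row). The format arguments on the next two
# lines have no effect because the column is already datetime by then.
train_data['Date_of_Journey'] = pd.to_datetime(train_data['Date_of_Journey'])
train_data["Journey_day"] = pd.to_datetime(train_data.Date_of_Journey, format="%d/%m/%Y").dt.day
train_data["Journey_month"] = pd.to_datetime(train_data["Date_of_Journey"], format = "%d/%m/%Y").dt.month
train_data.drop(["Date_of_Journey"], axis = 1, inplace = True)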

# Dep_Time
train_data["Dep_hour"] = pd.to_datetime(train_data["Dep_Time"]).dt.hour
train_data["Dep_min"] = pd.to_datetime(train_data["Dep_Time"]).dt.minute
train_data.drop(["Dep_Time"], axis = 1, inplace = True)

# Arrival_Time
train_data["Arrival_hour"] = pd.to_datetime(train_data.Arrival_Time).dt.hour
train_data["Arrival_min"] = pd.to_datetime(train_data.Arrival_Time).dt.minute
train_data.drop(["Arrival_Time"], axis = 1, inplace = True)

train_data.shape

Train data Info
---------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7894 entries, 0 to 8011
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Airline          7894 non-null   object
 1   Date_of_Journey  7894 non-null   object
 2   Source           7894 non-null   object
 3   Destination      7894 non-null   object
 4   Route            7894 non-null   object
 5   Dep_Time         7894 non-null   object
 6   Arrival_Time     7894 non-null   object
 7   Duration         7894 non-null   int64 
 8   Total_Stops      7894 non-null   object
 9   Additional_Info  7894 non-null   object
 10  Price            7894 non-null   int64 
dtypes: int64(2), object(9)
memory usage: 740.1+ KB
None


Null values :
---------------------------------------------------------------------------
Airline            0
Date_of_Journey    0
Source             0
Dest

(7894, 14)
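
The raw `Date_of_Journey` strings in the training file appear to be written day first (for example `18/06/2019`). A minimal sketch of parsing such strings with an explicit format, so that day and month cannot be swapped, could look like this. The three sample strings are copied from the `train_data.head()` output; the variable names are illustrative:

```python
import pandas as pd

# Day-first date strings as they appear in the raw training data
raw = pd.Series(["12/06/2019", "18/06/2019", "6/05/2019"])

# An explicit format removes any ambiguity between day and month
parsed = pd.to_datetime(raw, format="%d/%m/%Y")

journey = pd.DataFrame({
    "Journey_day": parsed.dt.day,
    "Journey_month": parsed.dt.month,
})
print(journey)
```

For the test file, whose dates use dashes (`27-03-2019`), the format string would need to be `"%d-%m-%Y"`, or `dayfirst=True` could be passed instead of a format.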

In [185]:
# Preprocessing same as training data that we have done

print("Test data Info")
print("-"*75)
print(test_data.info())

print()
print()

print("Null values :")
print("-"*75)
test_data.dropna(inplace = True)
print(test_data.isnull().sum())

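# Extract numeric features from the date and time columns

# Date_of_Journey
# Caution: the first call passes neither dayfirst=True nor a format, so pandas may read an
# ambiguous day-first date such as 06-03-2019 as 3 June rather than 6 March.
test_data['Date_of_Journey'] = pd.to_datetime(test_data['Date_of_Journey'])
test_data["Journey_day"] = pd.to_datetime(test_data.Date_of_Journey, format="%d/%m/%Y").dt.day
test_data["Journey_month"] = pd.to_datetime(test_data["Date_of_Journey"], format = "%d/%m/%Y").dt.month
test_data.drop(["Date_of_Journey"], axis = 1, inplace = True)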

# Dep_Time
test_data["Dep_hour"] = pd.to_datetime(test_data["Dep_Time"]).dt.hour
test_data["Dep_min"] = pd.to_datetime(test_data["Dep_Time"]).dt.minute
test_data.drop(["Dep_Time"], axis = 1, inplace = True)

# Arrival_Time
test_data["Arrival_hour"] = pd.to_datetime(test_data.Arrival_Time).dt.hour
test_data["Arrival_min"] = pd.to_datetime(test_data.Arrival_Time).dt.minute
test_data.drop(["Arrival_Time"], axis = 1, inplace = True)


Test data Info
---------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2655 entries, 0 to 2670
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Airline          2655 non-null   object
 1   Date_of_Journey  2655 non-null   object
 2   Source           2655 non-null   object
 3   Destination      2655 non-null   object
 4   Route            2655 non-null   object
 5   Dep_Time         2655 non-null   object
 6   Arrival_Time     2655 non-null   object
 7   Duration         2655 non-null   int64 
 8   Total_Stops      2655 non-null   object
 9   Additional_Info  2655 non-null   object
dtypes: int64(1), object(9)
memory usage: 228.2+ KB
None


Null values :
---------------------------------------------------------------------------
Airline            0
Date_of_Journey    0
Source             0
Destination        0
Route              0
Dep_Time

In [186]:
# Total_Stops
train_data['Total_Stops'].replace(['1 stop', 'non-stop', '2 stops', '3 stops', '4 stops'], [1, 0, 2, 3, 4], inplace=True)
test_data['Total_Stops'].replace(['1 stop', 'non-stop', '2 stops', '3 stops', '4 stops'], [1, 0, 2, 3, 4], inplace=True)
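
An equivalent way to express this encoding is a dictionary passed to `Series.map`, which also makes unexpected labels easy to spot because they come back as `NaN`. The sketch below uses a made-up sample, and the label `"5 stops"` is hypothetical (it does not occur in the dataset):

```python
import pandas as pd

# Same label-to-count mapping as the two replace() calls above
stops_map = {"non-stop": 0, "1 stop": 1, "2 stops": 2, "3 stops": 3, "4 stops": 4}

stops = pd.Series(["1 stop", "non-stop", "2 stops", "5 stops"])  # last label is hypothetical
encoded = stops.map(stops_map)

# Labels missing from the mapping become NaN and can be listed explicitly
unmapped = stops[encoded.isna()].tolist()
print(encoded.tolist(), unmapped)
```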

In [187]:
train_data["Airline"].value_counts()

Airline A    2807
Airline B    1545
Airline C    1252
Airline D     906
Airline E     622
Airline F     359
Airline G     235
Airline H     146
Airline I      12
Airline J       6
Airline K       3
Airline L       1
Name: Airline, dtype: int64

In [188]:
# Airline
train_data["Airline"].replace({'Airline L':'Other','Airline K':'Other','Airline J':'Other','Airline I':'Other'},inplace=True)

test_data["Airline"].replace({'Airline L':'Other','Airline K':'Other','Airline J':'Other','Airline I':'Other'},inplace=True)
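
The same grouping can be derived from the counts instead of listing airline names by hand. The sketch below is one possible approach; the helper `lump_rare` and the toy data are assumptions for illustration. With the counts shown above, a threshold of 13 would select the same four airlines (I, J, K and L):

```python
import pandas as pd

def lump_rare(values: pd.Series, min_count: int, other: str = "Other") -> pd.Series:
    """Replace labels that occur fewer than min_count times with a single label."""
    counts = values.value_counts()
    rare_labels = counts[counts < min_count].index
    # Keep the original label unless it is in the rare set
    return values.where(~values.isin(rare_labels), other)

# Toy data: two frequent airlines and two that appear only once
airlines = pd.Series(["Airline A"] * 5 + ["Airline B"] * 3 + ["Airline K", "Airline L"])
lumped = lump_rare(airlines, min_count=2)
print(lumped.value_counts().to_dict())
```

To keep train and test consistent, the rare labels would need to be determined on the training data and then applied to both sets.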

- From the scatter plot above, it is clear that ticket prices for Business class are higher, which is quite obvious.

In [189]:
train_data["Additional_Info"].value_counts()

No info                         6162
In-flight meal not included     1465
No check-in baggage included     247
1 Long layover                     9
Business class                     3
No Info                            3
Change airports                    2
1 Short layover                    1
2 Long layover                     1
Red-eye flight                     1
Name: Additional_Info, dtype: int64

In [190]:
# Additional_Info: merge rare categories into a single 'Other' label
rare_info = {'Change airports': 'Other',
             'Business class': 'Other',
             '1 Short layover': 'Other',
             'Red-eye flight': 'Other',
             '2 Long layover': 'Other'}
train_data["Additional_Info"].replace(rare_info, inplace=True)
test_data["Additional_Info"].replace(rare_info, inplace=True)

In [191]:
train_data.head()

Unnamed: 0,Airline,Source,Destination,Route,Duration,Total_Stops,Additional_Info,Price,Journey_day,Journey_month,Dep_hour,Dep_min,Arrival_hour,Arrival_min
0,Airline C,Delhi,Cochin,DEL → MAA → COK,765,1,No info,7480,6,12,20,40,9,25
1,Airline A,Banglore,Delhi,BLR → DEL,185,0,No info,8016,18,6,18,55,22,0
2,Airline C,Delhi,Cochin,DEL → BOM → COK,925,1,No info,8879,18,5,3,50,19,15
3,Airline A,Kolkata,Banglore,CCU → BOM → BLR,735,1,In-flight meal not included,9663,5,6,20,0,8,15
4,Airline A,Kolkata,Banglore,CCU → BOM → BLR,330,1,In-flight meal not included,9663,5,9,6,30,12,0


## Convert categorical data into numerical

In [192]:
data = train_data.drop(["Price"], axis=1)

In [193]:
train_categorical_data = data.select_dtypes(exclude=['int64', 'float','int32'])
train_numerical_data = data.select_dtypes(include=['int64', 'float','int32'])

test_categorical_data = test_data.select_dtypes(exclude=['int64', 'float','int32'])
test_numerical_data  = test_data.select_dtypes(include=['int64', 'float','int32'])

In [194]:
train_categorical_data.head()

Unnamed: 0,Airline,Source,Destination,Route,Additional_Info
0,Airline C,Delhi,Cochin,DEL → MAA → COK,No info
1,Airline A,Banglore,Delhi,BLR → DEL,No info
2,Airline C,Delhi,Cochin,DEL → BOM → COK,No info
3,Airline A,Kolkata,Banglore,CCU → BOM → BLR,In-flight meal not included
4,Airline A,Kolkata,Banglore,CCU → BOM → BLR,In-flight meal not included


In [195]:
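# Label-encode the categorical columns (no one-hot encoding is applied).
# Caution: a separate LabelEncoder is fitted per column on train and on test, so the same
# category can receive different integer codes in the two sets unless both contain
# exactly the same set of values.
from sklearn.preprocessing import LabelEncoder
train_categorical_data = train_categorical_data.apply(LabelEncoder().fit_transform)
test_categorical_data = test_categorical_data.apply(LabelEncoder().fit_transform)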
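
Because the encoders for train and test are fitted independently, a category such as a given route may be mapped to different integers in the two sets. One way to keep the codes aligned is to take the category list from the training column and reuse it for the test column. The sketch below does this with `pd.Categorical`; the helper name, the toy frames and the label `"Airline Z"` are hypothetical:

```python
import pandas as pd

def encode_like_train(train_col: pd.Series, test_col: pd.Series):
    """Integer-encode both columns using the categories seen in training."""
    categories = sorted(train_col.unique())
    train_codes = pd.Categorical(train_col, categories=categories).codes
    # Labels never seen in training receive the code -1
    test_codes = pd.Categorical(test_col, categories=categories).codes
    return train_codes, test_codes

train = pd.DataFrame({"Airline": ["Airline C", "Airline A", "Airline B"]})
test = pd.DataFrame({"Airline": ["Airline B", "Airline A", "Airline Z"]})

train_codes, test_codes = encode_like_train(train["Airline"], test["Airline"])
print(list(train_codes), list(test_codes))
```

How to treat the `-1` code for unseen labels (drop, map to a dedicated bucket, and so on) is a modelling decision the notebook does not address.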

In [196]:
train_categorical_data.head()

Unnamed: 0,Airline,Source,Destination,Route,Additional_Info
0,2,2,1,115,4
1,0,0,2,17,4
2,2,2,1,98,4
3,0,3,0,61,1
4,0,3,0,61,1


## Concatenate both categorical and numerical data

In [197]:
X = pd.concat([train_categorical_data, train_numerical_data], axis=1)
y=train_data['Price']
test_set = pd.concat([test_categorical_data, test_numerical_data], axis=1)

In [198]:
X.head()

Unnamed: 0,Airline,Source,Destination,Route,Additional_Info,Duration,Total_Stops,Journey_day,Journey_month,Dep_hour,Dep_min,Arrival_hour,Arrival_min
0,2,2,1,115,4,765,1,6,12,20,40,9,25
1,0,0,2,17,4,185,0,18,6,18,55,22,0
2,2,2,1,98,4,925,1,18,5,3,50,19,15
3,0,3,0,61,1,735,1,5,6,20,0,8,15
4,0,3,0,61,1,330,1,5,9,6,30,12,0


In [199]:
X.shape

(7894, 13)

In [200]:
test_set.shape

(2655, 13)

In [201]:
y.head()

0    7480
1    8016
2    8879
3    9663
4    9663
Name: Price, dtype: int64

### Building Machine Learning Models

In [202]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import r2_score

from math import sqrt

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.model_selection import KFold

def mean_absolute_percentage_error(y_true, y_pred): 
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
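
As a quick sanity check of this metric, the sketch below repeats the function and applies it to three made-up prices (the numbers are illustrative only). Note that the formula divides by `y_true`, so it is undefined when a true value is zero:

```python
import numpy as np

def mean_absolute_percentage_error(y_true, y_pred):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

actual = [100, 200, 400]
predicted = [110, 180, 400]

# Per-row relative errors are 10%, 10% and 0%, so the mean is about 6.67%
mape = mean_absolute_percentage_error(actual, predicted)
print(round(mape, 2))  # 6.67
```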

In [203]:
# Split the labelled data into training (75%) and validation (25%) subsets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

In [204]:
print("The size of training input is", X_train.shape)
print("The size of training output is", y_train.shape)
print(50 *'*')
print("The size of testing input is", X_test.shape)
print("The size of testing output is", y_test.shape)

The size of training input is (5920, 13)
The size of training output is (5920,)
**************************************************
The size of testing input is (1974, 13)
The size of testing output is (1974,)
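
The two row counts follow from the 25% split of the 7894 labelled rows. As far as I know, scikit-learn rounds the test share up and gives the remainder to the training set when only `test_size` is specified; under that assumption the arithmetic is:

```python
import math

n_samples = 7894      # rows in X after cleaning
test_size = 0.25

# Test share rounded up; the remaining rows go to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 5920 1974
```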


In [205]:
print("The size of testing input is", test_set.shape)
print("The size of testing output is", y_test.shape)

The size of testing input is (2655, 13)
The size of testing output is (1974,)


### Ridge

In [206]:
params ={'alpha' :[0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000, 100000]}
ridge_regressor =GridSearchCV(Ridge(), params ,cv =5,scoring = 'neg_mean_absolute_error', n_jobs =-1)
ridge_regressor.fit(X_train ,y_train)

GridSearchCV(cv=5, estimator=Ridge(), n_jobs=-1,
             param_grid={'alpha': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000,
                                   10000, 100000]},
             scoring='neg_mean_absolute_error')
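
The notebook imports `StandardScaler` but fits Ridge on unscaled features, whose ranges differ widely (for example `Duration` in minutes versus `Total_Stops`). A hedged sketch of how scaling could be folded into the same grid search is shown below. It runs on synthetic data, and the names `X_demo`, `y_demo`, `pipeline` and `search` are illustrative rather than part of the original analysis:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales, with a noisy linear target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 3)) * np.array([1.0, 100.0, 0.01])
y_demo = X_demo @ np.array([2.0, 0.05, 300.0]) + rng.normal(scale=0.1, size=120)

# The scaler is refitted inside every CV fold, so no validation data leaks into it
pipeline = Pipeline([("scale", StandardScaler()), ("ridge", Ridge())])
param_grid = {"ridge__alpha": [0.01, 0.1, 1, 10, 100]}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="neg_mean_absolute_error")
search.fit(X_demo, y_demo)

print(search.best_params_)   # chosen alpha
print(-search.best_score_)   # mean cross-validated MAE (sign flipped back)
```

`best_params_` and `best_score_` are also available on the `ridge_regressor` object fitted above, which is a direct way to see which `alpha` was selected.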

In [207]:
y_train_pred =ridge_regressor.predict(X_train) ##Predict train result
y_test_pred =ridge_regressor.predict(X_test) ##Predict test result

In [208]:
X_test.shape

(1974, 13)

In [209]:
test_set.shape

(2655, 13)

In [210]:
print("Train Results for Ridge Regressor Model:")
print(50 * '-')
print("Root mean squared error: ", sqrt(mse(y_train.values, y_train_pred)))
print("Mean absolute % error: ", round(mean_absolute_percentage_error(y_train.values, y_train_pred)))
print("R-squared: ", r2_score(y_train.values, y_train_pred))

Train Results for Ridge Regressor Model:
--------------------------------------------------
Root mean squared error:  3498.6464279877982
Mean absolute % error:  30.0
R-squared:  0.4423675033996044


In [211]:
print("Test Results for Ridge Regressor Model:")
print(50 * '-')
print("Root mean squared error: ", sqrt(mse(y_test, y_test_pred)))
print("Mean absolute % error: ", round(mean_absolute_percentage_error(y_test, y_test_pred)))
print("R-squared: ", r2_score(y_test, y_test_pred))

Test Results for Ridge Regressor Model:
--------------------------------------------------
Root mean squared error:  3387.7818598357053
Mean absolute % error:  32.0
R-squared:  0.4291772544042115


### Lasso

In [212]:
params ={'alpha' :[0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000, 100000]}
lasso_regressor =GridSearchCV(Lasso(), params ,cv =15,scoring = 'neg_mean_absolute_error', n_jobs =-1)
lasso_regressor.fit(X_train ,y_train)

GridSearchCV(cv=15, estimator=Lasso(), n_jobs=-1,
             param_grid={'alpha': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000,
                                   10000, 100000]},
             scoring='neg_mean_absolute_error')

In [213]:
y_train_pred =lasso_regressor.predict(X_train) ##Predict train result
y_test_pred =lasso_regressor.predict(X_test) ##Predict test result

In [214]:
test_prediction = lasso_regressor.predict(test_set) ##Predict test result

In [215]:
print("Train Results for Lasso Regressor Model:")
print(50 * '-')
print("Root mean squared error: ", sqrt(mse(y_train.values, y_train_pred)))
print("Mean absolute % error: ", round(mean_absolute_percentage_error(y_train.values, y_train_pred)))
print("R-squared: ", r2_score(y_train.values, y_train_pred))

Train Results for Lasso Regressor Model:
--------------------------------------------------
Root mean squared error:  3506.955133356198
Mean absolute % error:  30.0
R-squared:  0.4397157889383171


In [216]:
print("Test Results for Lasso Regressor Model:")
print(50 * '-')
print("Root mean squared error: ", sqrt(mse(y_test, y_test_pred)))
print("Mean absolute % error: ", round(mean_absolute_percentage_error(y_test, y_test_pred)))
print("R-squared: ", r2_score(y_test, y_test_pred))

Test Results for Lasso Regressor Model:
--------------------------------------------------
Root mean squared error:  3387.329820158752
Mean absolute % error:  32.0
R-squared:  0.42932957667590166


## Random Forest Regressor

In [217]:
tuned_params = {'n_estimators': [100, 200, 300, 400, 500], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 2, 4]}
random_regressor = RandomizedSearchCV(RandomForestRegressor(), tuned_params, n_iter = 20, scoring = 'neg_mean_absolute_error', cv = 5, n_jobs = -1)
random_regressor.fit(X_train, y_train)

RandomizedSearchCV(cv=5, estimator=RandomForestRegressor(), n_iter=20,
                   n_jobs=-1,
                   param_distributions={'min_samples_leaf': [1, 2, 4],
                                        'min_samples_split': [2, 5, 10],
                                        'n_estimators': [100, 200, 300, 400,
                                                         500]},
                   scoring='neg_mean_absolute_error')

In [218]:
y_train_pred = random_regressor.predict(X_train)
y_test_pred_RD = random_regressor.predict(X_test)

In [219]:
print("Train Results for Random Forest Regressor Model:")
print(50 * '-')
print("Root mean squared error: ", sqrt(mse(y_train.values, y_train_pred)))
print("Mean absolute % error: ", round(mean_absolute_percentage_error(y_train.values, y_train_pred)))
print("R-squared: ", r2_score(y_train.values, y_train_pred))

Train Results for Random Forest Regressor Model:
--------------------------------------------------
Root mean squared error:  682.9907511846886
Mean absolute % error:  3.0
R-squared:  0.9787490861835482


In [220]:
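# All three metrics are computed from y_test_pred_RD. The RMSE and R-squared values recorded
# below came from y_test_pred, which still held the Lasso predictions, so they repeat the Lasso figures.
print("Test Results for Random Forest Regressor Model:")
print(50 * '-')
print("Root mean squared error: ", sqrt(mse(y_test, y_test_pred_RD)))
print("Mean absolute % error: ", round(mean_absolute_percentage_error(y_test, y_test_pred_RD)))
print("R-squared: ", r2_score(y_test, y_test_pred_RD))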

Test Results for Random Forest Regressor Model:
--------------------------------------------------
Root mean squared error:  3387.329820158752
Mean absolute % error:  8.0
R-squared:  0.42932957667590166
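
Each results cell passes prediction arrays to three separate metric calls by name, which makes it easy for one call to pick up an array left over from an earlier model. A minimal sketch of a helper that computes all three metrics from a single pair of arrays is given below. The name `regression_report` and the sample numbers are assumptions for illustration; the formulas are the standard definitions of RMSE, MAPE and R-squared:

```python
import numpy as np

def regression_report(y_true, y_pred):
    """Return RMSE, MAPE (in percent) and R-squared for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    rmse = np.sqrt(np.mean(residuals ** 2))
    mape = np.mean(np.abs(residuals / y_true)) * 100
    # R-squared: 1 minus residual sum of squares over total sum of squares
    r2 = 1 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return {"rmse": rmse, "mape": mape, "r2": r2}

report = regression_report([100, 200, 300], [110, 190, 300])
print(report)
```

Calling it once per model, for example `regression_report(y_test, random_regressor.predict(X_test))`, keeps the three numbers tied to the same predictions.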


## K Neighbors Regressor

In [221]:
k_range = list(range(1, 30))
params = dict(n_neighbors = k_range)
knn_regressor = GridSearchCV(KNeighborsRegressor(), params, cv =10, scoring = 'neg_mean_squared_error')
knn_regressor.fit(X_train, y_train)

GridSearchCV(cv=10, estimator=KNeighborsRegressor(),
             param_grid={'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
                                         13, 14, 15, 16, 17, 18, 19, 20, 21, 22,
                                         23, 24, 25, 26, 27, 28, 29]},
             scoring='neg_mean_squared_error')

In [222]:
y_train_pred =knn_regressor.predict(X_train) ##Predict train result
y_test_pred =knn_regressor.predict(X_test) ##Predict test result

In [223]:
print("Train Results for KNN Regressor Model:")
print(50 * '-')
print("Root mean squared error: ", sqrt(mse(y_train.values, y_train_pred)))
print("Mean absolute % error: ", round(mean_absolute_percentage_error(y_train.values, y_train_pred)))
print("R-squared: ", r2_score(y_train.values, y_train_pred))

Train Results for KNN Regressor Model:
--------------------------------------------------
Root mean squared error:  2993.382662607981
Mean absolute % error:  21.0
R-squared:  0.591800524306235


In [224]:
print("Test Results for KNN Regressor Model:")
print(50 * '-')
print("Root mean squared error: ", sqrt(mse(y_test, y_test_pred)))
print("Mean absolute % error: ", round(mean_absolute_percentage_error(y_test, y_test_pred)))
print("R-squared: ", r2_score(y_test, y_test_pred))

Test Results for KNN Regressor Model:
--------------------------------------------------
Root mean squared error:  3065.4356284299724
Mean absolute % error:  24.0
R-squared:  0.5326364965816399


## Decision Tree Regressor

In [225]:
depth  =list(range(3,30))
param_grid =dict(max_depth =depth)
tree =GridSearchCV(DecisionTreeRegressor(),param_grid,cv =10)
tree.fit(X_train,y_train)

GridSearchCV(cv=10, estimator=DecisionTreeRegressor(),
             param_grid={'max_depth': [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
                                       15, 16, 17, 18, 19, 20, 21, 22, 23, 24,
                                       25, 26, 27, 28, 29]})

In [226]:
y_train_pred =tree.predict(X_train) ##Predict train result
y_test_pred =tree.predict(X_test) ##Predict test result

In [227]:
print("Train Results for Decision Tree Regressor Model:")
print(50 * '-')
print("Root mean squared error: ", sqrt(mse(y_train.values, y_train_pred)))
print("Mean absolute % error: ", round(mean_absolute_percentage_error(y_train.values, y_train_pred)))
print("R-squared: ", r2_score(y_train.values, y_train_pred))

Train Results for Decision Tree Regressor Model:
--------------------------------------------------
Root mean squared error:  243.03631276894473
Mean absolute % error:  0.0
R-squared:  0.997309144974626


In [228]:
print("Test Results for Decision Tree Regressor Model:")
print(50 * '-')
print("Root mean squared error: ", sqrt(mse(y_test, y_test_pred)))
print("Mean absolute % error: ", round(mean_absolute_percentage_error(y_test, y_test_pred)))
print("R-squared: ", r2_score(y_test, y_test_pred))


Test Results for Decision Tree Regressor Model:
--------------------------------------------------
Root mean squared error:  1704.2577870381842
Mean absolute % error:  8.0
R-squared:  0.8555422095727175


## XGB Regressor

In [229]:
tuned_params = {'max_depth': [1, 2, 3, 4, 5], 'learning_rate': [0.01, 0.05, 0.1], 'n_estimators': [100, 200, 300, 400, 500], 'reg_lambda': [0.001, 0.1, 1.0, 10.0, 100.0]}
model = RandomizedSearchCV(XGBRegressor(), tuned_params, n_iter=20, scoring = 'neg_mean_absolute_error', cv=5, n_jobs=-1)
model.fit(X_train, y_train)

RandomizedSearchCV(cv=5,
                   estimator=XGBRegressor(base_score=None, booster=None,
                                          colsample_bylevel=None,
                                          colsample_bynode=None,
                                          colsample_bytree=None, gamma=None,
                                          gpu_id=None, importance_type='gain',
                                          interaction_constraints=None,
                                          learning_rate=None,
                                          max_delta_step=None, max_depth=None,
                                          min_child_weight=None, missing=nan,
                                          monotone_constraints=None,
                                          n_estimators=100, n...
                                          random_state=None, reg_alpha=None,
                                          reg_lambda=None,
                                          scale_pos_we

In [230]:
y_train_pred = model.predict(X_train)
y_test_pred_XGB = model.predict(X_test)


In [231]:
print("Train Results for XGBoost Regressor Model:")
print(50 * '-')
print("Root mean squared error: ", sqrt(mse(y_train.values, y_train_pred)))
print("Mean absolute % error: ", round(mean_absolute_percentage_error(y_train.values, y_train_pred)))
print("R-squared: ", r2_score(y_train.values, y_train_pred))

Train Results for XGBoost Regressor Model:
--------------------------------------------------
Root mean squared error:  732.228505642702
Mean absolute % error:  6.0
R-squared:  0.9755746256485236


In [232]:
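# All three metrics are computed from y_test_pred_XGB. The R-squared value recorded below came
# from y_test_pred, which still held the Decision Tree predictions, so it repeats the Decision Tree figure.
print("Test Results for XGBoost Regressor Model:")
print(50 * '-')
print("Root mean squared error: ", sqrt(mse(y_test, y_test_pred_XGB)))
print("Mean absolute % error: ", round(mean_absolute_percentage_error(y_test, y_test_pred_XGB)))
print("R-squared: ", r2_score(y_test, y_test_pred_XGB))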

Test Results for XGBoost Regressor Model:
--------------------------------------------------
Root mean squared error:  1354.334070123846
Mean absolute % error:  9.0
R-squared:  0.8555422095727175


#### Random Forest Regressor and XGB Regressor give the best accuracy compared with the other regression algorithms.

In [235]:
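# Negated root mean squared error of the log10-transformed prices (XGB predictions, validation split)
rmse_XGB = -np.sqrt(np.square(np.log10(y_test_pred_XGB + 1) - np.log10(y_test + 1)).mean())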
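
The expression above is a root mean squared error taken on `log10(price + 1)` and then negated, so values closer to zero are better. A small worked sketch with made-up prices follows; the function name `neg_rmsle_log10` is illustrative:

```python
import numpy as np

def neg_rmsle_log10(y_true, y_pred):
    """Negated RMSE between log10(y_pred + 1) and log10(y_true + 1)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    log_diff = np.log10(y_pred + 1) - np.log10(y_true + 1)
    return -np.sqrt(np.mean(np.square(log_diff)))

# First prediction is off by a factor of ten (log difference of 1), second is exact,
# so the score is -sqrt((1 + 0) / 2), roughly -0.707
score = neg_rmsle_log10([99, 999], [9, 999])
print(score)
```

Because the error is measured on a log scale, a miss by a given factor costs the same regardless of whether the ticket is cheap or expensive.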

In [236]:
#y_test_pred_RD = y_test_pred_RD.astype(int)

In [237]:
#y_test_pred_RD

In [240]:
y_test_pred_RD = random_regressor.predict(test_set)
final_df=pd.DataFrame({ 'Price': y_test_pred_RD})
final_df.to_csv('random_regressor.csv',index=False)

In [241]:
y_test_pred =knn_regressor.predict(test_set) ##Predict test result
final_df=pd.DataFrame({ 'Price': y_test_pred})
final_df.to_csv('knn_regressor.csv',index=False)