# Ticket Price Prediction using Regression

This notebook is part of a ticket price monitoring system. The system scrapes ticket pricing data periodically and stores it in a database. Ticket prices change with demand and time, and the differences can be significant. We are building this product mainly with ourselves in mind. Users can set up email alerts by choosing an origin and destination (cities), a time (date and hour range picker), a price reduction relative to the mean price, etc.

**Data set**<br>
**The columns in the dataset are described below**<br>
- insert_date: date and time when the price was collected and written in the database<br>
- origin: origin city <br>
- destination: destination city <br>
- start_date: train departure time<br>
- end_date: train arrival time<br>
- train_type: train service name<br>
- price: price<br>
- train_class: ticket class, tourist, business, etc.<br>
- fare: ticket fare, round trip, etc <br>

# Importing the Dataset and Computing Summary Statistics

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Show Matplotlib graphs inline in the Jupyter notebook
%matplotlib inline
# Use Seaborn's default style, which looks a bit nicer than Matplotlib's defaults
sns.set()

#### Import Dataset and create a copy of that dataset

In [None]:
data = pd.read_csv('data1.csv')
df = data.copy()

#### Display first five rows

In [None]:
df.head()

#### Drop 'Unnamed: 0' column
 The dataset contains an 'Unnamed: 0' column. We remove it first because it is not needed.

In [None]:
# Dropping the column and saving the result back to the same dataframe
df = df.drop(['Unnamed: 0'],axis=1) 

#### Check the number of rows and columns

In [None]:
# shape is a dataframe attribute which shows the number of rows and columns
df.shape

#### Check data types of all columns

In [None]:
# dtypes is a dataframe attribute which shows data type of all columns
df.dtypes

####  Check the basic summary statistics

In [None]:
df.describe()

#### Check summary statistics of all columns, including object data types

In [None]:
df.describe(include='all')

**Question: Explain the summary statistics for the above data set**

**Answer:**
 - The response variable here is 'price'
 - The mean and median are close, so the distribution is almost symmetric; because the mean is above the median, it is slightly right-skewed
 - The minimum of the price column is 16.60, the maximum is 206.80, the mean is 56.723 and the median is 53.40
 - 'MADRID' is the most frequent origin/destination in the data, and 'Turista' is the most used train class
 - 'Promo' is the fare most people used for their journey
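
The mean-versus-median reading above can be backed by a skewness coefficient. The following is a minimal sketch on made-up prices (the values are invented for illustration and are not taken from the dataset); it uses the pandas `Series.skew()` method, where a positive value indicates a longer right tail.

```python
import pandas as pd

# Hypothetical prices, invented for illustration (not taken from data1.csv)
prices = pd.Series([20.0, 28.0, 30.0, 45.0, 53.0, 55.0, 60.0, 85.0, 110.0, 200.0])

mean_price = prices.mean()
median_price = prices.median()
skewness = prices.skew()  # > 0 indicates a longer right tail

print(f"mean={mean_price:.2f}, median={median_price:.2f}, skew={skewness:.2f}")
```

In the notebook itself, `df['price'].skew()` would give the same measure for the real price column.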

# Data Cleansing
  - Fill null values in 'price' with the mean
  - Drop unnecessary columns
  - Drop those rows/records that have missing values in some columns
  - Use NumPy and Pandas

#### Check null values in dataset

In [None]:
df.isnull().sum()

####  Filling the Null values in the 'price' column.

In [None]:
# First find the mean, then replace null values with it
mean_price = df['price'].mean()
df['price'] = df['price'].fillna(mean_price)

**Check null values again in dataset**

In [None]:
# Confirm that the null values in the price column were filled with the mean
df['price'].isnull().sum()

#### Dropping the rows containing null values in the attributes 'train_class' and 'fare'

In [None]:
# Keep only the records that have no null value in 'train_class' or 'fare'
df = df.dropna(subset=['train_class', 'fare'])
df.head()

In [None]:
# To ensure all the null values are gone
df.isnull().sum()
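
Whether a row is dropped when either column is missing, or only when both are missing, can change how many nulls remain. The sketch below illustrates the difference on a toy frame (the rows are invented for illustration); `both_missing` and `either_missing` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy frame, invented for illustration
toy = pd.DataFrame({
    'train_class': ['Turista', np.nan, 'Preferente', np.nan],
    'fare':        ['Promo',   np.nan, np.nan,       'Flexible'],
})

# Drops a row only when BOTH columns are missing
both_missing = toy[~(toy['train_class'].isnull() & toy['fare'].isnull())]

# Drops a row when EITHER column is missing
either_missing = toy.dropna(subset=['train_class', 'fare'])

print(len(both_missing), len(either_missing))
```

If the null check still reports missing values after the drop, the "either" form is the one that guarantees none remain in these two columns.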

####  Drop 'insert_date' column
 

In [None]:
df.drop(['insert_date'], axis=1, inplace=True)

# Data Visualization
  Using Matplotlib and Seaborn to get some useful visual insights

#### Plot number of people boarding from different stations


In [None]:
sns.countplot(x=df['origin'])

**Question: What insights do you get from the above plot?**

**Answer**
 - More than 100,000 people have 'Madrid' as their origin
 - 'Ponferrada' is the least common origin
 - 'Valencia' and 'Barcelona' are nearly equal in number as origins

#### Plot number of people for the destination stations

In [None]:
sns.countplot(x=df['destination'])

**Question: What insights do you get from the above graph?**

**Answer**
 - More than 100,000 people have 'Madrid' as their destination
 - 'Ponferrada' is the least common destination
 - 'Valencia' and 'Barcelona' are nearly equal in number as destinations

#### Plot different types of trains that run in Spain

In [None]:
plt.figure(figsize=(18, 26))
plt.ylim(0, 130000)
sns.countplot(x=df['train_type'])

**Question: Which train type runs most often compared to the other train types?**

**Answer:**  
 The **AVE** train runs most often compared to the other trains

#### Plot number of trains of different class

In [None]:
plt.figure(figsize=(16, 10))
sns.countplot(x='train_class', data=df)

**Question: Which is the most common train class for traveling among people in general?**

**Answer:**  
 **Turista** is the most used general train class

#### Plot number of tickets bought from each category

In [None]:
plt.figure(figsize=(15, 6))
sns.countplot(x='fare', data=df)

**Question: Which tickets are bought most often?**

**Answer:**   
 **Promo fare** tickets were the most commonly bought

####  Plot distribution of the ticket prices

In [None]:
plt.figure(figsize=(15, 6))
plt.xlim(10, 140)
# histplot replaces distplot, which is deprecated in recent Seaborn versions
sns.histplot(df['price'], bins=30, kde=True)

**Question: What readings can you get from the above plot?**

**Answer:**        
- Tickets priced 27 to 30 are bought more often than any other tickets
- Tickets priced 56 to 60 come next, bought less often than those above
- Tickets priced 43 to 45 and 83 to 85 are bought in about the same numbers

#### Show train_class vs price through boxplot

In [None]:
plt.figure(figsize=(15, 6))
sns.boxplot(x='train_class', y='price', data=df)

**Question: What pricing trends can you find out by looking at the plot above?**

**Answer:** ?

#### Show train_type vs price through boxplot

In [None]:
plt.figure(figsize=(15, 6))
sns.boxplot(x='train_type', y='price', data=df)

**Question: Which type of trains cost more as compared to others?**

**Answer:** 



# Feature Engineering


In [None]:
df = df.reset_index()

**Finding the travel time between the place of origin and destination**<br>
We need the travel time for each entry, which can be obtained from the 'start_date' and 'end_date' columns. These columns are stored as object (string) type, so a datetime format must be defined to parse them before computing the time difference.

**Import datetime library**

In [None]:
import datetime

In [None]:
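datetimeFormat = '%Y-%m-%d %H:%M:%S'
def fun(a, b):
    # Parse both timestamps and return the difference in hours
    diff = datetime.datetime.strptime(b, datetimeFormat) - datetime.datetime.strptime(a, datetimeFormat)
    # total_seconds() includes whole days; .seconds would drop them
    return diff.total_seconds() / 3600.0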
    

In [None]:
df['travel_time_in_hrs'] = df.apply(lambda x:fun(x['start_date'],x['end_date']),axis=1) 
df.head()
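
As an alternative to the row-wise `apply`, the same travel time can be computed on whole columns. This is a minimal sketch on toy timestamps (invented for illustration), assuming the same `'%Y-%m-%d %H:%M:%S'` format; `trips`, `start`, and `end` are hypothetical names.

```python
import pandas as pd

# Toy departures/arrivals, invented for illustration
trips = pd.DataFrame({
    'start_date': ['2019-05-29 06:20:00', '2019-05-29 23:30:00'],
    'end_date':   ['2019-05-29 09:16:00', '2019-05-30 02:00:00'],
})

fmt = '%Y-%m-%d %H:%M:%S'
start = pd.to_datetime(trips['start_date'], format=fmt)
end = pd.to_datetime(trips['end_date'], format=fmt)

# Subtracting datetime columns gives a timedelta column;
# total_seconds() keeps whole days, so overnight trips are handled
trips['travel_time_in_hrs'] = (end - start).dt.total_seconds() / 3600.0
print(trips)
```

On a large frame this usually runs much faster than calling a Python function once per row.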

#### Remove redundant features
  - We need to remove features that carry the same information as 'travel_time_in_hrs'
  - Remove the 'start_date' and 'end_date' columns, since 'travel_time_in_hrs' was derived from them
  - This reduces redundancy

In [None]:
df.drop(['start_date', 'end_date'], axis=1, inplace=True)

In [None]:
df.head()

We now need to find out the pricing from 'MADRID' to other destinations. We also need to find out the time each train type requires for the journey.

## 1: **Travelling from MADRID to SEVILLA**

#### Find people travelling from MADRID to SEVILLA

In [None]:
df1 = df.loc[(df.origin == 'MADRID') & (df.destination == 'SEVILLA')]
df1

#### Make a plot for finding out travelling hours for each train type

In [None]:
# We will use the above dataframe as 'df1'
plt.figure(figsize=(15, 6))
sns.barplot(x='train_type', y='travel_time_in_hrs', data=df1)

#### Show train_type vs price through boxplot

In [None]:
plt.figure(figsize=(15, 6))
sns.boxplot(x='train_type', y='price', data=df1)

## 2: **Travelling from MADRID to BARCELONA**

#### Find people travelling from MADRID to BARCELONA

In [None]:
df2 = df.loc[(df.origin == 'MADRID') & (df.destination == 'BARCELONA')]

#### Make a plot for finding out travelling hours for each train type

In [None]:
# We will be using 'df2'
plt.figure(figsize=(15, 6))
sns.barplot(x='train_type', y='travel_time_in_hrs', data=df2)

#### Show train_type vs price through boxplot

In [None]:
plt.figure(figsize=(15, 6))
sns.boxplot(x='train_type', y='price', data=df2)

## 3: **Travelling from MADRID to VALENCIA**

#### Find people travelling from MADRID to VALENCIA

In [None]:
df3 = df.loc[(df.origin == 'MADRID') & (df.destination == 'VALENCIA')]

#### Make a plot for finding out travelling hours for each train type

In [None]:
# We will be using 'df3'
plt.figure(figsize=(15, 6))
sns.barplot(x='train_type', y='travel_time_in_hrs', data=df3)

#### Show train_type vs price through boxplot

In [None]:
plt.figure(figsize=(15, 6))
sns.boxplot(x='train_type', y='price', data=df3)

## 4: **Travelling from MADRID to PONFERRADA**

#### Find people travelling from MADRID to PONFERRADA

In [None]:
df4 = df.loc[(df.origin == 'MADRID') & (df.destination == 'PONFERRADA')]

#### Make a plot for finding out travelling hours for each train type

In [None]:
# We will be using 'df4'
plt.figure(figsize=(15, 6))
sns.barplot(x='train_type', y='travel_time_in_hrs', data=df4)

#### Show train_type vs price through boxplot

In [None]:
plt.figure(figsize=(15, 6))
sns.boxplot(x='train_type', y='price', data=df4)
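
The four route sections above repeat the same filter-then-compare steps. One way to get the numbers behind the plots is a small helper that groups by train type; the sketch below uses a toy frame (rows invented for illustration) with the same column names as the notebook, and `route_summary` is a hypothetical helper, not part of the original analysis.

```python
import pandas as pd

# Toy rows, invented for illustration; column names match the notebook's df
toy = pd.DataFrame({
    'origin':      ['MADRID', 'MADRID', 'MADRID', 'SEVILLA'],
    'destination': ['SEVILLA', 'SEVILLA', 'BARCELONA', 'MADRID'],
    'train_type':  ['AVE', 'ALVIA', 'AVE', 'AVE'],
    'price':       [60.0, 40.0, 90.0, 55.0],
    'travel_time_in_hrs': [2.5, 3.0, 2.75, 2.5],
})

def route_summary(frame, origin, destination):
    """Mean price and travel time per train type for one route (hypothetical helper)."""
    route = frame[(frame['origin'] == origin) & (frame['destination'] == destination)]
    return route.groupby('train_type')[['price', 'travel_time_in_hrs']].mean()

summary = route_summary(toy, 'MADRID', 'SEVILLA')
print(summary)
```

Calling the helper once per destination would give a table to read alongside each bar plot and boxplot.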

# Applying Linear Regression

#### Import the preprocessing module (for LabelEncoder) from sklearn

In [None]:
from sklearn import preprocessing

**Data Encoding**

In [None]:
df

In [None]:
# Encode each categorical column with its own LabelEncoder, selecting columns by name
categorical_cols = ['origin', 'destination', 'train_type', 'train_class', 'fare']
for col in categorical_cols:
    df[col] = preprocessing.LabelEncoder().fit_transform(df[col])

In [None]:
df.head(10)
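
Label encoding turns each category into an integer, and a linear model then treats those integers as magnitudes even though the categories have no natural order. One-hot encoding is a common alternative; the sketch below shows it on a toy column (values invented for illustration) using `pd.get_dummies`. It is not what this notebook uses for the models that follow.

```python
import pandas as pd

# Toy column, invented for illustration
toy = pd.DataFrame({'train_class': ['Turista', 'Preferente', 'Turista', 'Cama']})

# One indicator column per category; no ordering is implied between classes
encoded = pd.get_dummies(toy, columns=['train_class'], dtype=int)
print(encoded)
```

The trade-off is width: every distinct category becomes its own column, which matters more once polynomial features are added.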

#### Separate the dependent and independent variables

In [None]:
X = df.drop(['price'], axis=1)
Y = df[['price']]
print(X.shape)
print(Y.shape)

#### Import train_test_split from sklearn
  To split the data into train and test samples

In [None]:
from sklearn.model_selection import train_test_split

#### Splitting the data into training and test set

In [None]:
X_train,X_test,Y_train,Y_test = train_test_split(X, Y, test_size=0.30, random_state=25,shuffle=True)
print(X_train.shape, Y_train.shape)
print(X_test.shape, Y_test.shape)

#### Import LinearRegression model from sklearn

In [None]:
from sklearn.linear_model import LinearRegression

#### Make an object of LinearRegression( ) / Instantiate the model and train it using the training data sets

In [None]:
lr = LinearRegression()

In [None]:
#Training the model
lr.fit(X_train, Y_train)

#### Find out the predictions using test data set.

In [None]:
lr_predict = lr.predict(X_test)

#### Find out the predictions using training data set.

In [None]:
lr_predict_train = lr.predict(X_train)

#### Checking model performance
- Import r2_score from sklearn
- r2_score (the coefficient of determination) is a metric used to check model performance
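
For reference, the R2 score is defined as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand with NumPy on invented values; `r2_manual` is a hypothetical helper written for illustration and is not part of sklearn.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot (hypothetical helper, not part of sklearn)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # squared errors of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # squared errors of predicting the mean
    return 1.0 - ss_res / ss_tot

# Invented values for illustration
y_true = [30.0, 50.0, 70.0, 90.0]
y_pred = [35.0, 45.0, 75.0, 85.0]
score = r2_manual(y_true, y_pred)
print(score)
```

A score of 1 means perfect predictions, and a score of 0 means the model does no better than always predicting the mean price.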

In [None]:
from sklearn.metrics import r2_score

#### Find out the R2 Score for test data and print it

In [None]:
lr_r2_test = r2_score(Y_test,lr_predict)
print(lr_r2_test)

#### Find out the R2 Score for training data and print it

In [None]:
lr_r2_train = r2_score(Y_train,lr_predict_train)
print(lr_r2_train)

**Comparing training and testing R2 scores**

In [None]:
print('R2 score of Linear Regression for Training Data is: ', lr_r2_train)
print('R2 score of Linear Regression for Testing Data is: ', lr_r2_test)

# Applying Polynomial Regression

#### Import PolynomialFeatures from sklearn

In [None]:
from sklearn.preprocessing import PolynomialFeatures

#### Making an object of default Polynomial Features

In [None]:
# Using degree = 2
poly_reg = PolynomialFeatures(degree=2)

#### Transform the features to higher degree features.

In [None]:
# Fit the transformer on the training data only, then reuse it for the test data
X_train_poly = poly_reg.fit_transform(X_train)
X_test_poly = poly_reg.transform(X_test)

#### Fit the transformed features to Linear Regression

In [None]:
poly_model = LinearRegression()
poly_model.fit(X_train_poly, Y_train)

#### Find the predictions on the data set

In [None]:
y_train_predicted = poly_model.predict(X_train_poly)
y_test_predict = poly_model.predict(X_test_poly)

#### Evaluate R2 score for training data set

In [None]:
r2_train = r2_score(Y_train, y_train_predicted)

#### Evaluate R2 score for test data set

In [None]:
r2_test =  r2_score(Y_test, y_test_predict)

**Comparing training and testing R2 scores**

In [None]:
print('The r2 score for training set is: ', r2_train)
print('The r2 score for testing set is: ', r2_test)

## Model Selection
 - **Question: Which model gives the best result for price prediction? Compare model complexity using the R2 score and give your answer.**<br>
 - **Using a for loop to find the best degree and model complexity for the polynomial regression model**

In [None]:
r2_train=[]
r2_test=[]
for i in range(1,6):
    poly_reg = PolynomialFeatures(degree=i)
    
    X_tr_poly, X_tst_poly = poly_reg.fit_transform(X_train), poly_reg.transform(X_test)
    poly = LinearRegression()
    poly.fit(X_tr_poly, Y_train)
   
    y_tr_predicted,y_tst_predict = poly.predict(X_tr_poly),poly.predict(X_tst_poly)
    r2_train.append(r2_score(Y_train, y_tr_predicted))
    r2_test.append(r2_score(Y_test, y_tst_predict))
    
print ('R2 Train', r2_train)
print ('R2 Test', r2_test)

#### Plotting the model

In [None]:
plt.figure(figsize=(18,5))
sns.set_context('poster')
plt.subplot(1,2,1)
sns.lineplot(x=list(range(1,6)), y=r2_train, label='Training');
plt.subplot(1,2,2)
sns.lineplot(x=list(range(1,6)), y=r2_test, label='Testing');
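
Besides reading the plots, the best degree can be picked programmatically from the two score lists. The sketch below uses invented scores (they are not results from this notebook) and hypothetical names `r2_train_demo` and `r2_test_demo`; it selects the degree with the highest test score and reports the train/test gap at each degree.

```python
# Hypothetical scores for degrees 1..5, invented for illustration
degrees = list(range(1, 6))
r2_train_demo = [0.70, 0.78, 0.83, 0.86, 0.90]
r2_test_demo = [0.69, 0.77, 0.81, 0.80, 0.62]

# Pick the degree with the highest test score
best_idx = max(range(len(degrees)), key=lambda i: r2_test_demo[i])
best_degree = degrees[best_idx]

# A widening train/test gap at higher degrees hints at overfitting
gaps = [round(tr - te, 2) for tr, te in zip(r2_train_demo, r2_test_demo)]
print(best_degree, gaps)
```

With the notebook's own lists, the same two lines applied to `r2_train` and `r2_test` would indicate which degree generalizes best.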

**Answer**