# Spanish High-Speed Rail Ticket Pricing Analytics


Created by Venessa M Yuhong on 13-Jun-2019.


## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>


## Introduction

> The data for this project come from [Kaggle](https://www.kaggle.com/thegurus/spanish-high-speed-rail-system-ticket-pricing). The project explores the options for travelling on the most popular routes and recommends the most cost-saving option for travellers.

In [None]:
# import packages
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import datetime as dt

%matplotlib inline

In [None]:
# load data
df=pd.read_csv('../input/renfe.csv')
df.head(1)

In [None]:
df.info()

In [None]:
df.shape

### Data Wrangling

In [None]:
# handling missing data
# check missing data rows
sum(df.isna().any(axis=1))

In [None]:
# check columns with missing data
df.isna().any(axis=0)

In [None]:
# View rows with missing data
df[df.isna().any(axis=1)].tail(10)

In [None]:
# Drop rows with missing value
df.dropna(inplace=True)
sum(df.isna().any(axis=1))
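
Before dropping rows, it can help to see how much data `dropna` would remove and which columns are responsible. The sketch below uses a small made-up frame (`toy`, with invented values) rather than the real ticket data, so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the ticket data; all values are invented for illustration.
toy = pd.DataFrame({
    "origin": ["MADRID", "MADRID", "SEVILLA", "MADRID"],
    "price": [43.25, np.nan, 60.30, np.nan],
    "train_class": ["Turista", None, "Turista", "Preferente"],
})

# Share of missing values in each column.
missing_share = toy.isna().mean()

# Share of rows that dropna() would remove (rows with at least one missing value).
rows_lost = toy.isna().any(axis=1).mean()

# Rows that survive the drop.
cleaned = toy.dropna()
```

If `rows_lost` turns out to be large on the real data, dropping only the rows where `price` is missing may be a gentler alternative.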

In [None]:
# change data types
df['insert_date']=pd.to_datetime(df['insert_date'])
df['start_date'] =pd.to_datetime(df['start_date'])
df['end_date']=pd.to_datetime(df['end_date'])

In [None]:
# create new columns
df['duration']=df['end_date']-df['start_date']
df['duration_mins']=df['duration']/np.timedelta64(1, 'm')
df['start_time']=df['start_date'].dt.time
df.head(1)
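
As a possible alternative to dividing by `np.timedelta64(1, 'm')`, the duration in minutes can be derived with the `.dt.total_seconds()` accessor. The sketch below assumes a toy frame with invented timestamps; `start_hour` is a hypothetical extra column, not one the notebook creates.

```python
import pandas as pd

# Invented departure and arrival times, for illustration only.
toy = pd.DataFrame({
    "start_date": pd.to_datetime(["2019-06-13 07:15:00", "2019-06-13 09:30:00"]),
    "end_date": pd.to_datetime(["2019-06-13 10:15:00", "2019-06-13 12:40:00"]),
})

# Timedelta between arrival and departure.
duration = toy["end_date"] - toy["start_date"]

# Convert the timedelta to minutes as a plain float column.
toy["duration_mins"] = duration.dt.total_seconds() / 60

# Hypothetical helper column: the departure hour as an integer, handy for grouping.
toy["start_hour"] = toy["start_date"].dt.hour
```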

In [None]:
#drop unwanted columns
df.drop(['insert_date'],axis=1,inplace=True)

In [None]:
# histogram of price
df['price'].hist()

In [None]:
# histogram of duration
df['duration_mins'].hist()

In [None]:
# plot train type
df['train_type'].value_counts().plot.bar()

In [None]:
# plot train class
df['train_class'].value_counts().plot.bar()

In [None]:
# plot fare
df['fare'].value_counts().plot.bar()

## Exploratory Data Analysis

### Question 1: Where do the majority of travellers go?

In [None]:
# count and mean of duration and price for each origin-destination pair
df.groupby(by=['origin','destination'])[['duration','duration_mins','price']].agg(['count','mean'])

### Observations
- MADRID, Spain's centrally located capital, has the largest inbound and outbound traffic volume.
- Trips between BARCELONA and MADRID have the highest volume and the highest ticket prices of all routes.
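
The inbound and outbound volumes per city can be tallied directly from the `origin` and `destination` columns. This is a minimal sketch on a made-up frame; the names `traffic` and `busiest` are illustrative, and the rows are invented.

```python
import pandas as pd

# Invented trips, for illustration only.
toy = pd.DataFrame({
    "origin": ["MADRID", "MADRID", "BARCELONA", "SEVILLA"],
    "destination": ["BARCELONA", "SEVILLA", "MADRID", "MADRID"],
})

# Tickets leaving each city and tickets arriving at each city.
outbound = toy["origin"].value_counts()
inbound = toy["destination"].value_counts()

# Align both counts on the city name; a city missing from one side gets 0.
traffic = pd.concat([outbound, inbound], axis=1, keys=["outbound", "inbound"])
traffic = traffic.fillna(0).astype(int)
traffic["total"] = traffic["outbound"] + traffic["inbound"]

# City with the largest combined volume.
busiest = traffic["total"].idxmax()
```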

### Question 2: Which route is the key revenue contributor?

In [None]:
# calculate ticket counts and revenue by origin, destination, train type, and train class
df_mjr=df.groupby(by=['origin','destination','train_type','train_class'])['price'].agg(['count','sum']).reset_index()
df_mjr.head(10)

In [None]:
df_mjr['count_per']=df_mjr['count']/sum(df_mjr['count'])
df_mjr['sum_per']=df_mjr['sum']/sum(df_mjr['sum'])

In [None]:
df_mjr.sort_values('count_per',ascending=False, inplace=True)

In [None]:
df_mjr['origin_destination']=df_mjr['origin']+"-"+df_mjr['destination']
df_mjr.set_index('origin_destination',inplace=True)
# totals for the six largest groups (numeric columns only)
df_mjr.head(6)[['count','sum','count_per','sum_per']].sum()

In [None]:
df_mjr.head(6).plot(kind='pie',y='count_per',legend=False)

In [None]:
df_mjr.head(6).plot(kind='pie',y='sum_per',legend=False)

### Observations
- 58.2% of travellers travel among three cities (MADRID, BARCELONA, and SEVILLA) on AVE trains in Turista class. This group also contributes the most revenue, 61% of the total.
- However, travellers between MADRID and BARCELONA contribute more revenue per trip than the others.
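
Revenue per trip is total revenue divided by the number of tickets in each group. The sketch below shows that calculation on a made-up frame; `revenue_per_trip` is an illustrative column name and the prices are invented.

```python
import pandas as pd

# Invented tickets, for illustration only.
toy = pd.DataFrame({
    "origin": ["MADRID", "MADRID", "MADRID", "SEVILLA"],
    "destination": ["BARCELONA", "BARCELONA", "SEVILLA", "MADRID"],
    "price": [100.0, 80.0, 50.0, 40.0],
})

# Ticket count and total revenue per route.
route = toy.groupby(["origin", "destination"])["price"].agg(["count", "sum"])

# Average revenue contributed by one ticket on each route.
route["revenue_per_trip"] = route["sum"] / route["count"]

# Each route's share of total revenue.
route["sum_per"] = route["sum"] / route["sum"].sum()

# Route with the highest revenue per trip.
top_route = route["revenue_per_trip"].idxmax()
```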

### Question 3: Which is the most cost-saving approach to travel between Madrid and Barcelona?

In [None]:
# create dataset for trips between Barcelona and Madrid only
df_bm=df.query('origin in ("BARCELONA","MADRID") and destination in ("BARCELONA","MADRID")')
df_bm.head(1)

In [None]:
# boxplot of price and duration grouped by train class and fare
df_bm.boxplot(column=['price','duration_mins'],by=['train_class','fare'],rot=45,layout=(2, 1),figsize=(10,9))

### Observations
- The (Turista, Adulto ida) combination offers the lowest price, and that price is fixed; however, it also has the longest travel time.
- Some combinations have a fixed price, while others have a variable price.

In [None]:
# group data by origin, destination, train type, train class, fare, and duration
# since most travellers travel between Madrid and Barcelona, the investigation focuses on these two cities
df_app=df_bm.groupby(by=['origin','destination','train_type','train_class','fare','duration_mins'])['price'].describe().reset_index()
df_app.head(6)

In [None]:
# view info in order
df_app.sort_values(['mean']).head(2)

In [None]:
# investigate fixed price and non-fixed price
# fixed price 
df_app_fixed=df_app[df_app['std'] == 0]
df_app_fixed.sort_values(['min','origin'])

In [None]:
# boxplot of fixed-price travel packages grouped by train class and fare
df_app_fixed.boxplot(column=['mean','duration_mins'],by=['train_class','fare'],rot=45,layout=(3, 1),figsize=(10,9))

### Observations
- There are two types of ticket price for travel between BARCELONA and MADRID: fixed and non-fixed.
- A risk-averse traveller who wants the lowest cost can choose the Turista + Adulto ida package at 43.25 euros per trip. However, the traveller sacrifices time: this is the longest trip, at more than 562 minutes.
- Overall, for travellers who prefer a fixed price, Turista class with the Adulto ida fare offers the lowest price (43.25 euros) but the longest travel time, and it is the option chosen by the majority of travellers.
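
One subtlety when separating fixed from non-fixed prices by standard deviation: a group with a single ticket has a `std` of `NaN`, and `NaN != 0` evaluates to true, so a filter on `std` alone would count that group as non-fixed. The sketch below is one possible way to label groups more carefully, using made-up rows and a hypothetical `pricing` column.

```python
import pandas as pd

# Invented tickets, for illustration only.
toy = pd.DataFrame({
    "train_class": ["Turista", "Turista", "Turista", "Turista", "Preferente"],
    "fare": ["Adulto ida", "Adulto ida", "Promo", "Promo", "Flexible"],
    "price": [43.25, 43.25, 60.0, 75.0, 120.0],
})

# Ticket count, standard deviation, and number of distinct prices per group.
stats = toy.groupby(["train_class", "fare"])["price"].agg(["count", "std", "nunique"])

# Default label, then refine: one distinct price means fixed,
# and a single observation is not enough to decide either way.
stats["pricing"] = "variable"
stats.loc[stats["nunique"] == 1, "pricing"] = "fixed"
stats.loc[stats["count"] < 2, "pricing"] = "undetermined"
```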

### Question 4: Are there any travel time restrictions on the Turista class with the Adulto ida package?

In [None]:
# investigate whether there are travel time restrictions
df_fixed_cheapest=df_bm.query('train_class=="Turista" and fare=="Adulto ida"')
df_fixed_cheapest.head(1)

In [None]:
# check available travel time
df_fixed_cheapest['start_time'].unique()

### Observations

There are restrictions on the Turista + Adulto ida package: travellers can only use it on trains departing at 7:15 and 9:30.
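
To see how often each departure time occurs, and not only which times exist, the start times can be counted and sorted. This is a sketch on invented timestamps; `departures` is an illustrative name.

```python
import pandas as pd

# Invented departures, for illustration only.
toy = pd.DataFrame({
    "start_date": pd.to_datetime([
        "2019-05-01 09:30:00",
        "2019-05-02 07:15:00",
        "2019-05-03 09:30:00",
    ]),
})

# Format each departure as HH:MM, count occurrences, and sort chronologically.
departures = toy["start_date"].dt.strftime("%H:%M").value_counts().sort_index()
```

Zero-padded `HH:MM` strings sort in chronological order, which keeps the result easy to read.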

### Question 5: What if a traveller prefers a non-fixed package? Which one is the most cost-saving?

In [None]:
# non-fixed price
df_app_vary=df_app[df_app['std'] != 0]
df_app_vary.sort_values(['std','min'],ascending=True).head(4)

In [None]:
# investigate non-fixed trips
df_app_vary.sort_values(by='min',ascending=True).head(2)

In [None]:
# plot non-fixed trips
df_app_vary.boxplot(column=['mean','duration_mins'],by=['train_class','fare'],rot=45,layout=(3, 1),figsize=(10,9))

### Observations
- For non-fixed price options, trips usually take about 180 minutes (3 hours), and the differences in travel time between options are negligible.
- The (Turista, Promo) package offers the best overall price.

In [None]:
# merge the two datasets to get the travel start time
df_bm_vary=pd.merge(df_bm,df_app_vary, how='left',on=['origin','destination','train_type','train_class','fare','duration_mins'])
df_bm_vary.head(1)

In [None]:
# find out available travel time
df_bm_tp=df_bm_vary.query('train_class =="Turista" and fare=="Promo"')
df_bm_tp.head(5)


In [None]:
# start_time already holds datetime.time values (created with .dt.time), so no conversion is needed
# take a copy so later changes do not affect the merged frame
df_bm_tp=df_bm_tp.copy()
df_bm_tp.head(1)

In [None]:
# plot average price by start time
df_bm_tp.groupby('start_time')['price'].mean().plot.line()

In [None]:
# plot number of tickets by start time
df_bm_tp.groupby('start_time')['price'].count().plot.line()

### Observations

- The ticket price of the (Turista, Promo) package depends on the train's start time: during the morning and afternoon peak hours the price goes up, and during off-peak hours it goes down.
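
The peak versus off-peak comparison can be made explicit by bucketing departures by hour. The notebook does not define exact peak hours, so the windows below (7:00 to 9:59 and 17:00 to 19:59) are an assumption, and the prices are invented for illustration.

```python
import pandas as pd

# Invented tickets, for illustration only.
toy = pd.DataFrame({
    "start_date": pd.to_datetime([
        "2019-05-01 08:00:00",
        "2019-05-01 12:00:00",
        "2019-05-01 18:00:00",
        "2019-05-01 22:00:00",
    ]),
    "price": [90.0, 60.0, 100.0, 50.0],
})

# Assumed peak windows; adjust to whatever the real price curve suggests.
peak_hours = list(range(7, 10)) + list(range(17, 20))

# Label each ticket by whether its departure hour falls in a peak window.
toy["start_hour"] = toy["start_date"].dt.hour
toy["period"] = toy["start_hour"].isin(peak_hours).map({True: "peak", False: "off-peak"})

# Average price in each period.
avg_price = toy.groupby("period")["price"].mean()
```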

### Conclusion
- The most popular travel route is between Madrid and Barcelona.
- Two types of tickets are available: fixed-price and non-fixed-price.
- A risk-averse traveller who prefers a fixed-price ticket has four options to choose from. The most cost-saving option (Turista, Adulto ida) has the longest travel time, about three times the normal travel time. It is also constrained: travellers can only take trains that depart at 7:15 am and 9:30 am.
- For a traveller who prefers a non-fixed-price ticket, the (Turista, Promo) package offers the best overall price. Travellers with a flexible schedule can choose off-peak hours to get the lowest price.




