# Extensive Spanish High Speed Rail tickets pricing data analysis 
![](https://www.seat61.com/images/Spain-pato-train-barcelona2.jpg)

# Introduction:
### Rail transport in Spain operates on four rail gauges, and services are run by a variety of private and public operators. The total route length in 2012 was 16,026 km (10,182 km electrified).

### Most railways are operated by Renfe Operadora; metre and narrow-gauge lines are operated by FEVE and other carriers in individual autonomous communities. There are plans to build or convert more lines to standard gauge, including some dual gauging of broad-gauge lines, especially where these lines link to France, and to raise platform heights.

### Spain is a member of the International Union of Railways (UIC).

# Table of contents:
1. [Importing libraries](#1)
2. [Importing Data](#2) <br>
&nbsp;&nbsp; 2.1. [Having a glimpse at data](#2.1) <br>
3. [Changing date columns to datetime columns](#3)
4. [Checking for the null values](#4)
5. [Checking for minimum and maximum dates in datetime columns](#5)
6. [Feature extraction](#6) <br>
&nbsp;&nbsp; 6.1 [Getting the latitude and longitude of the origin and destinations](#6.1) <br>
&nbsp;&nbsp; 6.2 [Mapping origin and destination columns to longitude dict and latitude dict](#6.2)<br>
7. [Plotting of date columns according to start date and end date](#7)<br>
&nbsp;&nbsp; 7.1 [Journey Start date analysis](#7.1)<br>
&nbsp;&nbsp; 7.2 [Journey End date analysis](#7.2)<br>
8. [Visualising that journeys ended on same or different date](#8)<br>
9. [Plot of Most journeys started and ended hours](#9)<br>
&nbsp;&nbsp; 9.1 [Start hour analysis](#9.1)<br>
&nbsp;&nbsp; 9.2 [End hour analysis](#9.2)<br>
10. [Plot of count of journeys by month](#10)
11. [Analysis of travelling time in minutes](#11)<br>
&nbsp;&nbsp; 11.1 [Histogram plot of travelling time in minutes(50k observations)](#11.1)<br>
&nbsp;&nbsp; 11.2 [Box plot of travelling time only 50k observations](#11.2)<br>
12. [Analysis of journeys in weekdays](#12)
13. [Analysing Origin and Destination columns](#13)<br>
&nbsp;&nbsp; 13.1 [Plotting origins according to their counts](#13.1)<br>
&nbsp;&nbsp; 13.2 [Plotting destinations according to their counts](#13.2)<br>
14. [Creating a new column 'Route'](#14)<br>
&nbsp;&nbsp; 14.1 [Analysis of route column using pie chart](#14.1)<br>
&nbsp;&nbsp; 14.2 [Analysis of route column using Tree Maps](#14.2)<br>
15. [Analysis of train type column](#15)<br>
&nbsp;&nbsp; 15.1 [Overview of different train types in spain](#15.1)<br>
&nbsp;&nbsp; 15.2 [Analysis of train type using pie chart](#15.2)<br>
16. [Analysis of train class column](#16)<br>
&nbsp;&nbsp; 16.1 [Overview of train classes](#16.1)<br>
&nbsp;&nbsp; 16.2 [Analysis of train classes using pie chart](#16.2)<br>
17. [Analysis of Fare column](#17)<br>
&nbsp;&nbsp; 17.1 [Overview of fare types in spain](#17.1)<br>
&nbsp;&nbsp; 17.2 [Pie chart analysis of fare column](#17.2)<br>
18. [Getting insights from route column](#18)<br>
&nbsp;&nbsp; 18.1 [Analysis of route column with train type](#18.1)<br>
&nbsp;&nbsp; 18.2 [Analysis of route column with train class](#18.2)<br>
&nbsp;&nbsp; 18.3 [Analysis of route column with fare type](#18.3)<br>
19. [Analysis of price according to Train class,Train type,fare](#19)
20. [Average time of journeys by routes](#20)
21. [Average time of journeys by train type](#21)



# 1. Importing libraries <a name="1"></a>

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt

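#plotly
# Note: plotly.plotly exists only in plotly < 4; later versions moved it to the separate chart_studio package
import plotly.plotly as py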
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.figure_factory as ff
from IPython.display import HTML, Image
from plotly import tools
import folium 
from folium import plugins 
import squarify

from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

# 2.Importing data <a name="2"></a>

In [None]:
rail_data = pd.read_csv('../input/renfe.csv')

# 2.1 Having a glimpse at data <a name="2.1"></a>

In [None]:
rail_data.head()

## About the dataset:
* insert_date: date and time when the price was collected and written in the database, scraping time (UTC)<br>
* origin: origin city<br>
* destination: destination city<br>
* start_date: train departure time (European Central Time)<br>
* end_date: train arrival time (European Central Time)<br>
* train_type: train service name<br>
* price: price (euros)<br>
* train_class: ticket class, tourist, business, etc.<br>
* fare: ticket fare, round trip, etc.<br>

In [None]:
rail_data.tail()

In [None]:
rail_data.shape

Good, there are 2,579,771 rows and 10 columns in the dataset.

In [None]:
rail_data.info()

# 3.Changing date columns to datetime columns <a name="3"></a>

In [None]:
for i in ['insert_date','start_date','end_date']:
    rail_data[i] = pd.to_datetime(rail_data[i])
rail_data.info()

### Good, we have converted the date columns to datetime64.
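The same conversion can be tried on a couple of made-up rows. This is only a sketch: the toy values are hypothetical, and `errors="coerce"` is an optional addition not used in the cell above; it turns unparseable strings into `NaT` instead of raising an error.

```python
import pandas as pd

# Toy frame with hypothetical values; the real data comes from renfe.csv
toy = pd.DataFrame({
    "start_date": ["2019-05-23 06:30:00", "not a date"],
    "end_date": ["2019-05-23 09:10:00", "2019-05-23 12:00:00"],
})

for col in ["start_date", "end_date"]:
    # errors="coerce" replaces unparseable strings with NaT
    toy[col] = pd.to_datetime(toy[col], errors="coerce")

n_bad = int(toy["start_date"].isna().sum())
print(toy.dtypes)
print("unparseable start dates:", n_bad)
```

Counting the `NaT` values afterwards is a quick way to see how many rows failed to parse.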

# 4.Checking for the null values <a name="4"></a>

In [None]:
rail_data.isnull().mean()*100

### Well, there are some null values in the price, train_class and fare columns.

## 4.1 Filling missing values <a name="4.1"></a>
1. We will fill the train_class and fare columns with their mode.
2. We will fill the missing prices with the mean price of the corresponding fare type.

In [None]:
cols = ['train_class','fare']
for c in cols:
    # assign back instead of calling fillna(inplace=True) on a column selection
    rail_data[c] = rail_data[c].fillna(rail_data[c].mode()[0])

In [None]:
rail_data.loc[rail_data.price.isnull(), 'price'] = rail_data.groupby('fare').price.transform('mean')

In [None]:
# check that no NaN values remain after imputation
rail_data.isnull().any()

### Good, the NaN values were imputed successfully.
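To make the price imputation easier to follow, here is a minimal sketch on a toy frame. The fares and prices are made up for illustration; only the technique (a per-fare mean via `groupby(...).transform('mean')`) matches the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two fare groups, each with one missing price
toy = pd.DataFrame({
    "fare": ["Promo", "Promo", "Promo", "Flexible", "Flexible"],
    "price": [30.0, 50.0, np.nan, 100.0, np.nan],
})

# Mean price of each row's fare group (NaN values are skipped by mean)
group_mean = toy.groupby("fare")["price"].transform("mean")

# Fill only the missing prices; existing prices are left untouched
toy["price"] = toy["price"].fillna(group_mean)
print(toy)
```

Because `transform` returns a Series aligned with the original rows, it can be passed straight to `fillna`.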

# 5.Checking for minimum and maximum dates in datetime columns <a name="5"></a>

In [None]:
print(f" started date minimum value {rail_data.start_date.min()}")
print(f" started date maximum value {rail_data.start_date.max()}")

In [None]:
print(f" end date minimum value {rail_data.end_date.min()}")
print(f" end date maximum value {rail_data.end_date.max()}")

In [None]:
print(f" Inserted date minimum value {rail_data.insert_date.min()}")
print(f" Inserted date maximum value {rail_data.insert_date.max()}")

## Good! Wait, what? It is still June, so how can we have data for July (7)? I think a mistake occurred while the data was being recorded (at least as of the time I was writing this kernel). Let's solve this :D

## I am going to solve this by taking the month from the insert_date column. The remaining attributes of start_date and end_date, such as hour, year and date, are fine, so I use them as they are in the analysis and treat the month of insert_date as the journey month.

# 6.Feature extraction: <a name="6"></a>
#### 1. Extracting the starting hour and ending hour of the journey.
#### 2. Flagging whether the journey was completed on the same day or not.
#### 3. Extracting the travelling time in minutes.
#### 4. Extracting the weekday name and month of the journeys from the insert_date column.

In [None]:
# lets create some important features using date columns
rail_data['start_hour'] = rail_data['start_date'].dt.hour
rail_data['end_hour'] = rail_data['end_date'].dt.hour
rail_data['is_journey_end_on_sameday'] = np.where(rail_data['start_date'].dt.date==rail_data['end_date'].dt.date, 
                                           'yes', 'no')
rail_data['travel_time_in_mins'] = rail_data['end_date'] - rail_data['start_date']
rail_data['travel_time_in_mins']=rail_data['travel_time_in_mins']/np.timedelta64(1,'m')
rail_data['journey_day_of_week'] = rail_data['insert_date'].dt.day_name()  # dt.weekday_name was removed in pandas 1.0
rail_data['journey_month'] = rail_data['insert_date'].dt.month
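
The date features above can be checked on two made-up journeys. This is a sketch under assumptions: the timestamps are hypothetical, and the weekday is taken from `start_date` purely for illustration, whereas the notebook derives it from `insert_date`.

```python
import pandas as pd

# Two hypothetical journeys: one crosses midnight, one does not
toy = pd.DataFrame({
    "start_date": pd.to_datetime(["2019-05-20 23:30:00", "2019-05-21 08:00:00"]),
    "end_date": pd.to_datetime(["2019-05-21 02:00:00", "2019-05-21 10:45:00"]),
})

# Travel time in minutes from the timedelta between arrival and departure
duration = toy["end_date"] - toy["start_date"]
toy["travel_time_in_mins"] = duration.dt.total_seconds() / 60

# True when departure and arrival share the same calendar date
toy["same_day"] = toy["start_date"].dt.date == toy["end_date"].dt.date

# English weekday name of the departure
toy["weekday"] = toy["start_date"].dt.day_name()
print(toy)
```

`dt.total_seconds() / 60` gives the same result as dividing by `np.timedelta64(1, 'm')` and may read more clearly.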

## 6.1 Getting the latitude and longitude of the origin and destinations <a name="6.1"></a>

#### As the origin and destination columns contain the same set of cities, we can get the latitude and longitude from just one of them. I am using the origin column.

In [None]:
geolocator = Nominatim(user_agent="specify_your_app_name_here")
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)
dictt_latitude = {}
dictt_longitude = {}
for i in rail_data['origin'].unique():
    location = geocode(i)
    # geocode returns None when no match is found, so skip such names
    if location is None:
        print(f"No geocoding result for {i}")
        continue
    print(location.address)
    print(location.latitude, location.longitude)
    dictt_latitude[i] = location.latitude
    dictt_longitude[i] = location.longitude
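
Since a geocoder can fail to find a place, the lookup logic can be separated from the network call and tested offline. The sketch below is an assumption-laden illustration: `build_coordinate_lookup` and `fake_geocode` are hypothetical helpers, and the coordinates in the stand-in geocoder are rough values chosen only for the example.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Coord = Tuple[float, float]

def build_coordinate_lookup(
    names: Iterable[str],
    geocode: Callable[[str], Optional[Coord]],
) -> Tuple[Dict[str, float], Dict[str, float], List[str]]:
    """Return latitude dict, longitude dict and the names that could not be resolved."""
    latitudes: Dict[str, float] = {}
    longitudes: Dict[str, float] = {}
    missing: List[str] = []
    for name in names:
        result = geocode(name)
        if result is None:
            # keep track of failures instead of crashing on them
            missing.append(name)
            continue
        latitudes[name], longitudes[name] = result
    return latitudes, longitudes, missing

# Stand-in geocoder with rough, made-up coordinates (no network access needed)
def fake_geocode(name: str) -> Optional[Coord]:
    known = {"MADRID": (40.42, -3.70), "SEVILLA": (37.39, -5.98)}
    return known.get(name)

lat, lon, missing = build_coordinate_lookup(["MADRID", "SEVILLA", "NOWHERE"], fake_geocode)
print(lat, lon, missing)
```

With geopy, the `geocode` argument could be a small wrapper that returns `(location.latitude, location.longitude)` or `None`.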

## 6.2 Mapping origin and destination columns to longitude dict and latitude dict <a name="6.2"></a>

In [None]:
rail_data['start_latitude']= rail_data['origin'].map(dictt_latitude)
rail_data['start_longitude'] = rail_data['origin'].map(dictt_longitude)
rail_data['end_latitude'] = rail_data['destination'].map(dictt_latitude)
rail_data['end_longitude'] = rail_data['destination'].map(dictt_longitude)

In [None]:
#having a glimpse at data
rail_data.head()

# 7.Plotting of date columns according to start date and end date <a name="7"></a>

 ## 7.1 Journey Start date analysis <a name="7.1"></a>

In [None]:
count_  = rail_data['start_date'].dt.date.value_counts()
count_ = count_.head(50)
plt.figure(figsize=(20,10))
sns.barplot(x=count_.index, y=count_.values, alpha=0.8, palette="GnBu_d")
plt.title('Dates on which the most journeys started')
plt.xticks(rotation='vertical')
plt.ylabel('Number of journeys', fontsize=12)
plt.xlabel('Date', fontsize=12)
plt.show()

## Observations:
#### From the above graph we can observe that the most journeys started on 2019-05-23, 2019-05-16, 2019-05-13 and 2019-05-20.

## 7.2 Journey end date analysis <a name="7.2"></a>

In [None]:
count_  = rail_data['end_date'].dt.date.value_counts()
count_ = count_.head(50)
plt.figure(figsize=(20,10))
sns.barplot(x=count_.index, y=count_.values, alpha=0.8, palette="cubehelix")
plt.title('Dates on which the most journeys ended')
plt.xticks(rotation='vertical')
plt.ylabel('Number of journeys', fontsize=12)
plt.xlabel('Date', fontsize=12)
plt.show()

## Observations:
#### From the above graph we can observe that the most journeys ended on 2019-05-23, 2019-05-16, 2019-05-13 and 2019-05-20.

# 8. Visualising whether journeys ended on the same or a different date <a name="8"></a>

In [None]:
cnt_ = rail_data['is_journey_end_on_sameday'].value_counts()
cnt_ = cnt_.sort_index() 
fig = {
  "data": [
    {
      "values": cnt_.values,
      "labels": cnt_.index,
      "domain": {"x": [0, .5]},
      "name": "Percentage of journeys started and ended on same date",
      "hoverinfo":"label+percent+name",
      "hole": .3,
      "type": "pie"
    },],
  "layout": {
        "title":"Percentage of journeys started and ended on same date",
        "annotations": [
            { "font": { "size": 20},
              "showarrow": False,
             "text": "Pie Chart",
                "x": 0.50,
                "y": 1
            },
        ]
    }
}
iplot(fig)
cnt_

## Observation:
* #### From the above plot we can observe that most journeys end on the day they start (one-day journeys).

# 9.Plot of Most journeys started and ended hours <a name="9"></a>

## 9.1 Start hour analysis <a name="9.1"></a>

In [None]:
import plotly.graph_objs as go
cnt_srs = rail_data['start_hour'].value_counts()
trace1 = go.Bar(
                x = cnt_srs.index,
                y = cnt_srs.values,
                marker = dict(color = 'rgba(0, 255, 200, 0.8)',
                             line=dict(color='rgb(0,0,0)',width=0.2)),
                text = cnt_srs.index)

data = [trace1]
layout = go.Layout(title = 'Plot of most journeys started according to hour')
fig = go.Figure(data = data, layout = layout)
iplot(fig)

In [None]:
cnt_srs = rail_data['start_hour'].value_counts()
trace1 = go.Scatter(
                    x = cnt_srs.index,
                    y = cnt_srs.values,
                    mode = "markers",
                    marker = dict(color = 'rgba(100, 35, 55, 0.8)')
                    )

data = [trace1]
layout = dict(title = 'Journeys started according to hour',
              xaxis= dict(title= 'Start hour',ticklen= 5,zeroline= False)
             )
fig = dict(data = data, layout = layout)
iplot(fig)

## Observation:
#### *  From the above graphs we can observe that most journeys started between 6AM and 9AM and again between 2PM and 5PM, perhaps because of office timings.
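`value_counts()` orders its result by frequency, not by hour. When reading peaks off an hourly chart it can help to sort the counts by hour first. A minimal sketch with made-up hours:

```python
import pandas as pd

# Hypothetical start hours for six journeys
start_hour = pd.Series([22, 22, 22, 7, 15, 15])

# Default ordering: most frequent hour first
by_frequency = start_hour.value_counts()

# Chronological ordering: hours ascending, which suits a time-of-day axis
by_hour = by_frequency.sort_index()

print(by_frequency)
print(by_hour)
```

Passing `by_hour.index` and `by_hour.values` to the bar trace would then draw the hours from left to right in clock order.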

## 9.2 End hour analysis <a name="9.2"></a>

In [None]:
cnt_srs = rail_data['end_hour'].value_counts()
trace1 = go.Bar(
                x = cnt_srs.index,
                y = cnt_srs.values,
                marker = dict(color = 'rgba(0, 155, 100, 0.8)',
                             line=dict(color='rgb(0,0,0)',width=0.2)),
                text = cnt_srs.index)

data = [trace1]
layout = go.Layout(title = 'Plot of most journeys ended according to hour')
fig = go.Figure(data = data, layout = layout)
iplot(fig)

In [None]:
cnt_srs = rail_data['end_hour'].value_counts()
trace1 = go.Scatter(
                    x = cnt_srs.index,
                    y = cnt_srs.values,
                    mode = "markers",
                    marker = dict(color = 'rgba(155, 28, 155, 0.8)')
                    )

data = [trace1]
layout = dict(title = 'Journeys ended according to hour',
              xaxis= dict(title= 'End hour',ticklen= 5,zeroline= False)
             )
fig = dict(data = data, layout = layout)
iplot(fig)

## Observations:
#### *  Most of the journeys ended at 9AM-11AM and 4PM-11PM.

# 10.Plot of count of journeys by month <a name="10"></a>

In [None]:
cnt_srs = rail_data['journey_month'].value_counts()
trace1 = go.Bar(
                x = cnt_srs.values,
                y = cnt_srs.index,orientation = 'h',
                marker = dict(color = 'rgba(155, 0, 100, 0.8)',
                             line=dict(color='rgb(0,0,0)',width=0.2)),
                text = cnt_srs.index)

data = [trace1]
layout = go.Layout(title = 'Plot of count of journeys by month')
fig = go.Figure(data = data, layout = layout)
iplot(fig)

## Observations:
#### *  Most of the journeys are in the month of April.

In [None]:
print(f"The average travelling time was {rail_data.travel_time_in_mins.mean()} mins\n"
      f"The maximum travelling time was {rail_data.travel_time_in_mins.max()} mins\n"
      f"The minimum travelling time was {rail_data.travel_time_in_mins.min()} mins")

# 11. Analysis of travelling time in minutes <a name="11"></a>

## 11.1 Histogram plot of travelling time in minutes (50k observations) <a name="11.1"></a>

In [None]:
fig = ff.create_distplot([rail_data.travel_time_in_mins[:50000]],['travel_time_in_mins'],bin_size=5)
iplot(fig, filename='Basic Distplot')

## 11.2 Box plot of travelling time (only 50k observations) <a name="11.2"></a>

In [None]:
trace1 = go.Box(
    y=rail_data.travel_time_in_mins[:50000],
    name = 'Travelling time in minutes (first 50k observations)',
    marker = dict(
        color = 'rgb(12, 12, 140)',
    )
)
data = [trace1]
iplot(data)
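
The two plots above use the first 50k rows. If the rows happen to be ordered (for example by scrape time or by route, which is an assumption and not something checked here), the first rows may not represent the whole dataset. A hedged alternative is a random sample, sketched on made-up travel times:

```python
import numpy as np
import pandas as pd

# Hypothetical ordered data: 500 short journeys followed by 500 long ones
times = pd.Series(np.r_[np.full(500, 150.0), np.full(500, 400.0)])

# Taking the head only sees the short journeys
head_sample = times.head(100)

# A random sample draws from the whole series; random_state makes it repeatable
random_sample = times.sample(n=100, random_state=0)

print("head mean:", head_sample.mean())
print("random mean:", random_sample.mean())
```

On the real data this would correspond to `rail_data.travel_time_in_mins.sample(n=50000, random_state=0)`.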

# 12.Analysis of journeys in weekdays <a name="12"></a>

In [None]:
cnt_srs = rail_data['journey_day_of_week'].value_counts()
trace1 = go.Bar(
                x = cnt_srs.index,
                y = cnt_srs.values,
                marker = dict(color = 'rgba(55, 25, 55, 0.3)',
                             line=dict(color='rgb(0,0,0)',width=1.5)),
                text = cnt_srs.index)

data = [trace1]
layout = go.Layout()
fig = go.Figure(data = data, layout = layout)
iplot(fig)

## Observations:
#### *  Most of the journeys are on Mondays, followed by Sundays and Tuesdays.
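For a weekday chart it is often easier to compare bars when they appear in calendar order instead of frequency order. A small sketch with made-up weekday names, using `reindex` to impose the order:

```python
import pandas as pd

# Hypothetical weekday names for six journeys
days = pd.Series(["Monday", "Sunday", "Monday", "Tuesday", "Sunday", "Monday"])

order = ["Monday", "Tuesday", "Wednesday", "Thursday",
         "Friday", "Saturday", "Sunday"]

# Reindex to calendar order; days that never occur get a count of 0
counts = days.value_counts().reindex(order, fill_value=0)
print(counts)
```

The reordered `counts` can be passed to the bar trace in the same way as `cnt_srs`.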

# 13.Analysing Origin and Destination columns <a name="13"></a>

## 13.1 Plotting origins according to their counts <a name="13.1"></a>

In [None]:
df = rail_data['origin'].value_counts()
df = pd.DataFrame(df)
df = df.reset_index()
df.columns = ['origin', 'counts'] 
df['start_latitude']= df['origin'].map(dictt_latitude)
df['start_longitude'] = df['origin'].map(dictt_longitude)
# Spain lies west of the prime meridian, so the longitude of the map centre is negative
map1 = folium.Map(location=[40.4637, -3.7492], tiles='CartoDB dark_matter', zoom_start=5)
for i, row in df.iterrows():
    # scale the marker radius with the number of journeys
    count = row['counts']*0.00003
    folium.CircleMarker([float(row['start_latitude']), float(row['start_longitude'])], radius=float(count), color='#ef4f61', fill=True).add_to(map1)
map1

## Observations:
#### * As we can see from the map above, most trains start from **Madrid** and very few start from **Ponferrada**.

## 13.2 Plotting Destinations according to their counts <a name="13.2"></a>

In [None]:
df = rail_data['destination'].value_counts()
df = pd.DataFrame(df)
df = df.reset_index()
df.columns = ['destination', 'counts'] 
df['start_latitude']= df['destination'].map(dictt_latitude)
df['start_longitude'] = df['destination'].map(dictt_longitude)
# Spain lies west of the prime meridian, so the longitude of the map centre is negative
map1 = folium.Map(location=[40.4637, -3.7492], tiles='CartoDB dark_matter', zoom_start=5)
for i, row in df.iterrows():
    # scale the marker radius with the number of journeys
    count = row['counts']*0.00003
    folium.CircleMarker([float(row['start_latitude']), float(row['start_longitude'])], radius=float(count), color='#ef4f61', fill=True).add_to(map1)
map1

## Observations:
#### * Great! Most trains have **Madrid** as their destination. From the two maps above we can confirm that train traffic is heaviest in **Madrid**, as it is the capital of Spain, and lightest in **Ponferrada**.

# 14.Creating a new column 'Route' <a name="14"></a>

In [None]:
rail_data['route'] = rail_data['origin']+' to '+rail_data['destination']
print('There are {} unique routes in the dataframe'.format(rail_data['route'].nunique()))

### We can see that there are 8 unique routes in dataset.

## 14.1 Analysis of route column using pie chart <a name="14.1"></a>

In [None]:
cnt_ = rail_data['route'].value_counts()

fig = {
  "data": [
    {
      "values": cnt_.values,
      "labels": cnt_.index,
      "domain": {"x": [0, .5]},
      "name": "Routes",
      "hoverinfo":"label+percent+name",
      "hole": .5,
      "type": "pie"
    },],
  "layout": {
        "title":"Pie chart of routes",
        "annotations": [
            { "font": { "size": 20},
              "showarrow": False,
             "text": "Pie Chart",
                "x": 0.50,
                "y": 1
            },
        ]
    }
}
iplot(fig)
cnt_

## 14.2 Analysis of route column using Tree Maps <a name="14.2"></a>


In [None]:
x = 0.
y = 0.
width = 50.
height = 50.
type_list = list(rail_data['route'].unique())
values = [len(rail_data[rail_data['route'] == i]) for i in type_list]

normed = squarify.normalize_sizes(values, width, height)
rects = squarify.squarify(normed, x, y, width, height)

color_brewer = ['#2D3142','#4F5D75','#BFC0C0','#F2D7EE','#EF8354','#839788','#EEE0CB','#494949']
shapes = []
annotations = []
counter = 0

for r in rects:
    shapes.append( 
        dict(
            type = 'rect', 
            x0 = r['x'], 
            y0 = r['y'], 
            x1 = r['x']+r['dx'], 
            y1 = r['y']+r['dy'],
            line = dict( width = 2 ),
            fillcolor = color_brewer[counter]
        ) 
    )
    annotations.append(
        dict(
            x = r['x']+(r['dx']/2),
            y = r['y']+(r['dy']/2),
            text = "{}-{}".format(type_list[counter], values[counter]),
            showarrow = False
        )
    )
    counter = counter + 1
    if counter >= len(color_brewer):
        counter = 0

# For hover text
trace0 = go.Scatter(
    x = [ r['x']+(r['dx']/2) for r in rects ], 
    y = [ r['y']+(r['dy']/2) for r in rects ],
    text = [ str(v) for v in values ], 
    mode = 'text',
)
        
layout = dict(
    height=1000,
    width=1250,
    xaxis=dict(showgrid=False,zeroline=False),
    yaxis=dict(showgrid=False,zeroline=False),
    shapes=shapes,
    annotations=annotations,
    hovermode='closest',
    font=dict(color="#FFFFFF")
)

# With hovertext
figure = dict(data=[trace0], layout=layout)
iplot(figure, filename='treemap')

### Observations:
1. Most of the journeys in our dataset are on the routes **MADRID to BARCELONA**, **MADRID to SEVILLA** and **BARCELONA to MADRID**.
2. There are fewer journeys on **PONFERRADA to MADRID** and **MADRID to PONFERRADA**.

# 15.Analysis of train type column <a name="15"></a>

## 15.1 Overview of different train types in spain: <a name="15.1"></a>
#### AVE: With 3,100 km of track, the Spanish high-speed AVE trains operate on the longest high-speed network in Europe. Running at speeds of up to 310 km/h, this extensive network allows for fast connections between cities in Spain: Madrid to Barcelona takes less than 3 hours. This modern train system connects many cities across Spain, from Madrid and Barcelona to Córdoba, Seville, Málaga and Valencia.
#### ALVIA: The Spanish Alvia trains combine a long-distance and a high-speed service to connect major cities across Spain. The Alvia offers many routes, such as connections from Madrid to Gijón, Alicante and Castellón and from Barcelona to Bilbao, A Coruña and Vigo. With air-conditioned carriages and check-in control before boarding, the Alvia is a comfortable and relaxed way to traverse one of Europe's biggest countries.
#### REGIONAL: Regional and intercity trains in Spain. FEVE trains operate in the north of Spain, connecting cities like Bilbao, Gijón, León and Santander. Cercanías (suburban trains) is a network of trains that operates in and around the larger Spanish cities, including Barcelona and Valencia.
#### INTERCITY: Traditional intercity trains travelling at between 160 and 250 km/h allow you to reach nearly every corner of Spain. You can choose to travel in 2nd class (Turista) or 1st class (Preferente). The comfort of the carriages is close to that of the high-speed AVE trains. All trains are air-conditioned.
#### AV City: The AV City trains are high-speed trains that complement the AVE by offering lower prices; they are marketed in economy class (p) and economy plus (p+).
#### Long distance (LD) - Medium distance (MD): The LD-AVE and MD-AVE entries in the list of trains are indirect services that use a combination of the regular trains (either LD - Larga Distancia/Long Distance or MD - Media Distancia/Medium Distance). These services require a change (usually in Zaragoza or Valencia). Because of the change and the lower-speed trains on part of the way, the journey is longer but the tickets are cheaper.
#### The LD, AVE-MD, AVE-LD, LD-MD, MD-AVE, MD and LD-AVE types broadly fall under the above category.
#### TRENHOTEL: Trenhotel are night trains running in Spain and from Spain to Portugal. The trains consist of Talgo articulated stock, providing a very comfortable and smooth ride. There are different kinds of trains in use, offering different kinds of services. Older trains convey 2nd class seats, four-bed sleepers, 2-bed sleepers and 2-bed sleepers with ensuite shower and WC. Newer trains convey only 1st class reclining seats and 2-bed sleepers with ensuite shower and WC. All trains have a bistro carriage.


In [None]:
rail_data['train_type'].value_counts()

## 15.2 Analysis of train type using pie chart <a name="15.2"></a>

In [None]:
cnt_ = rail_data['train_type'].value_counts()

fig = {
  "data": [
    {
      "values": cnt_.values,
      "labels": cnt_.index,
      "domain": {"x": [0, .5]},
      "name": "Train types",
      "hoverinfo":"label+percent+name",
      "hole": .7,
      "type": "pie"
    },],
  "layout": {
        "title":"Pie chart Train types",
        "annotations": [
            { "font": { "size": 20},
              "showarrow": False,
             "text": "Pie Chart",
                "x": 0.50,
                "y": 1
            },
        ]
    }
}
iplot(fig)
cnt_

### Observations:
1. Most of the journeys in our dataset are on AVE trains, as these are high-speed trains; they account for 69.4% of the data.
2. People rarely choose the MD-AVE, MD and LD-AVE trains because these journeys take longer. We can say that people in Spain like to travel on high-speed trains.
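The percentage shown in the pie chart can also be computed directly with `value_counts(normalize=True)`. The sketch below uses a made-up series of ten train types, so its numbers are illustrative and are not the dataset's actual shares.

```python
import pandas as pd

# Hypothetical sample of ten journeys
train_type = pd.Series(["AVE"] * 7 + ["ALVIA"] * 2 + ["MD"])

# Share of each train type in percent, rounded to one decimal place
share = train_type.value_counts(normalize=True).mul(100).round(1)
print(share)
```

On the real data, `rail_data['train_type'].value_counts(normalize=True)` would give the shares behind the pie chart.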

# 16.Analysis of train class column: <a name="16"></a>

## 16.1 Overview of train classes: <a name="16.1"></a>
#### Turista and Preferente: Spanish long-distance trains generally have two classes. Turista = 2nd class (least expensive) and Preferente = 1st class (more expensive). On weekdays on AVE & EuroMed high-speed trains, Preferente usually includes a hot airline-style meal & wine.
#### The remaining classes fall under Turista, with different prices.
#### Cama G. Clase: Night trains with berths.

## 16.2 Analysis of train classes using pie chart <a name="16.2"></a>

In [None]:
cnt_ = rail_data['train_class'].value_counts()

fig = {
  "data": [
    {
      "values": cnt_.values,
      "labels": cnt_.index,
      "domain": {"x": [0, .5]},
      "name": "Train Class",
      "hoverinfo":"label+percent+name",
      "hole": .8,
      "type": "pie"
    },],
  "layout": {
        "title":"Pie chart Train Class",
        "annotations": [
            { "font": { "size": 20},
              "showarrow": False,
             "text": "Pie Chart",
                "x": 0.50,
                "y": 1
            },
        ]
    }
}
iplot(fig)
cnt_

### Observations:
1. Most people prefer the Turista class.
2. Fewer people travel at night, so there are fewer journeys in the Cama G. Clase class.

# 17.Analysis of Fare column: <a name="17"></a>

## 17.1 Overview of fare types in spain: <a name="17.1"></a>
#### Promo: These are based on a dynamic pricing system with heavy discounts on AVE and Larga Distancia (Long-Distance) trains on domestic journeys, which are set depending on the train, date of travel and advance purchase.
#### Flexible: The Flexible Ticket is a commercial offer for AVE and Larga Distancia (Long-Distance) services only across all classes and seats (seats and berths). It is the same price as the General fare without any discount, but comes with additional offers that enable passengers to obtain better conditions for changes, cancellations or if they miss their train.
#### Promo+: These are based on a dynamic pricing system with heavy discounts on AVE and Larga Distancia (Long-Distance) trains on domestic journeys, which are set depending on the train, date of travel and advance purchase, and without lowering any quality standards.


## 17.2 Pie chart analysis of fare column <a name="17.2"></a>

In [None]:
cnt_ = rail_data['fare'].value_counts()
fig = {
  "data": [
    {
      "values": cnt_.values,
      "labels": cnt_.index,
      "domain": {"x": [0, .5]},
      "name": "Fare",
      "hoverinfo":"label+percent+name",
      "hole": .9,
      "type": "pie"
    },],
  "layout": {
        "title":"Pie chart Fare",
        "annotations": [
            { "font": { "size": 20},
              "showarrow": False,
             "text": "Pie Chart",
                "x": 0.50,
                "y": 1
            },
        ]
    }
}
iplot(fig)
cnt_

## Observations:
1. Most people prefer Promo fares, which offer discounted prices.
2. Very few travel on the Grupos Ida fare type.

# 18.Getting insights from route column <a name="18"></a>

## 18.1 Analysis of route column with train type <a name="18.1"></a>

In [None]:
rail_data.groupby(['route','train_type'])['train_type'].count()

In [None]:
plt.figure(figsize=(12,10))
sns.countplot(x= 'route', hue = 'train_type', data = rail_data,alpha=1.0,linewidth=5)
plt.title('Count plot most used train type by passengers according to routes')
plt.xticks(rotation='vertical')
plt.ylabel('Number of journeys', fontsize=12)
plt.xlabel('Route', fontsize=12)
plt.show()

## Observations:
1. Most passengers prefer **AVE** trains when travelling from **BARCELONA to MADRID**.
2. Similarly, from **MADRID to BARCELONA** most passengers travel on **AVE** trains.
3. Most passengers prefer **AVE-MD**, **AVE-LD** and **ALVIA** trains when travelling from **MADRID to PONFERRADA**.
4. From **MADRID to SEVILLA** passengers prefer **AVE**.
5. **AVE** has the upper hand among all train types from **MADRID to VALENCIA**.
6. Most passengers chose **LD** trains when travelling from **PONFERRADA to MADRID**.
7. From **SEVILLA to MADRID** and **VALENCIA to MADRID** most passengers choose **AVE** trains.<br>
#### Finally, we can say that since AVE trains are the fastest, people in Spain prefer the trains that reach their destination quickest.
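Reading the dominant train type per route off a count plot can be error-prone when there are many bars. One possible way to get it as a table is `pd.crosstab` followed by `idxmax`. This is a sketch on made-up rows; the column names match the notebook, the values do not come from the dataset.

```python
import pandas as pd

# Hypothetical journeys on two routes
toy = pd.DataFrame({
    "route": ["MADRID to BARCELONA"] * 3 + ["PONFERRADA to MADRID"] * 3,
    "train_type": ["AVE", "AVE", "ALVIA", "LD", "LD", "MD"],
})

# Rows are routes, columns are train types, cells are journey counts
table = pd.crosstab(toy["route"], toy["train_type"])

# For each route, the train type (column label) with the highest count
top_type = table.idxmax(axis=1)
print(top_type)
```

The same two lines applied to `rail_data` would list one train type per route; ties are resolved in favour of the first column.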

## 18.2 Analysis of route column with train class <a name="18.2"></a>

In [None]:
rail_data.groupby(['route','train_class'])['train_class'].count()

In [None]:
plt.figure(figsize=(12,10))
sns.countplot(x= 'route', hue = 'train_class', data = rail_data,alpha=1.0,linewidth=5)
plt.title('Count plot most used train class by passengers according to routes')
plt.xticks(rotation='vertical')
plt.ylabel('Number of journeys', fontsize=12)
plt.xlabel('Route', fontsize=12)
plt.show()

## Observations:
1. Most passengers prefer the **Turista** class when travelling from **BARCELONA to MADRID**.
2. Similarly, from **MADRID to BARCELONA** most passengers travel in the **Turista** class.
3. Most passengers prefer the **Turista con enlace** class when travelling from **MADRID to PONFERRADA**.
4. From **MADRID to SEVILLA** passengers prefer **Turista**.
5. **Turista** has the upper hand among all train classes from **MADRID to VALENCIA**.
6. Most passengers chose the **Turista con enlace** class when travelling from **PONFERRADA to MADRID**.
7. From **SEVILLA to MADRID** and **VALENCIA to MADRID** most passengers choose the **Turista** class.<br>
#### Finally, we can say that the Turista class is the one most often chosen on every route by people in Spain.

## 18.3 Analysis of route column with fare type <a name="18.3"></a>

In [None]:
rail_data.groupby(['route','fare'])['fare'].count()

In [None]:
plt.figure(figsize=(12,10))
sns.countplot(x= 'route', hue = 'fare', data = rail_data,alpha=1.0,linewidth=5)
plt.title('Count plot most used fare type by passengers according to routes')
plt.xticks(rotation='vertical')
plt.ylabel('Number of journeys', fontsize=12)
plt.xlabel('Route', fontsize=12)
plt.show()

## Observations:
#### * Undoubtedly, the Promo fare has the upper hand on every route.

# 19. Analysis of price according to Train class,Train type,fare <a name="19"></a>

In [None]:
tools.set_credentials_file(username='Ratan2513', api_key='T94PMqZ1KYsD6E8JPw1g')
def horizontal_bar_chart(cnt_srs, color):
    trace = go.Bar(
        y=cnt_srs.index[::-1],
        x=cnt_srs.values[::-1],
        showlegend=False,
        orientation = 'h',
        marker=dict(
            color=color,
        ),
    )
    return trace
#train_type
cnt_srs = rail_data.groupby('train_type')['price'].agg(['mean'])
cnt_srs.columns = ["mean"]
cnt_srs = cnt_srs.sort_values(by="mean", ascending=False)
trace0 = horizontal_bar_chart(cnt_srs['mean'], 'rgba(50, 71, 96, 0.6)')

#train_class
cnt_srs = rail_data.groupby('train_class')['price'].agg(['mean'])
cnt_srs.columns = ["mean"]
cnt_srs = cnt_srs.sort_values(by="mean", ascending=False)
trace1 = horizontal_bar_chart(cnt_srs['mean'], 'rgba(71, 58, 131, 0.8)')

#route
cnt_srs = rail_data.groupby('route')['price'].agg(['mean'])
cnt_srs.columns = ["mean"]
cnt_srs = cnt_srs.sort_values(by="mean", ascending=False)
trace2 = horizontal_bar_chart(cnt_srs['mean'], 'rgba(246, 78, 139, 0.6)')

#fare
cnt_srs = rail_data.groupby('fare')['price'].agg(['mean'])
cnt_srs.columns = ["mean"]
cnt_srs = cnt_srs.sort_values(by="mean", ascending=False)
trace3 = horizontal_bar_chart(cnt_srs['mean'], 'rgba(200, 108, 39, 0.6)')

# Creating four subplots, one per grouping column
fig = tools.make_subplots(rows=4, cols=1, vertical_spacing=0.04, 
                          subplot_titles=['Average prices by Train Type','Average prices by Train Class','Average prices by Route','Average prices by Fare'])

fig.append_trace(trace0, 1, 1)
fig.append_trace(trace1, 2, 1)
fig.append_trace(trace2, 3, 1)
fig.append_trace(trace3, 4, 1)


fig['layout'].update(height=1200, width=1200, paper_bgcolor='rgb(233,233,233)', title="Price(Euros) Plots")
py.iplot(fig, filename='Price(Euros) plots')

In [None]:
cnt_srs = rail_data.groupby('train_type')['price'].agg(['mean'])
cnt_srs.columns = ["mean"]
cnt_srs['train_type'] = cnt_srs.index

data = [
    {
        'x': cnt_srs['train_type'],
        'y': cnt_srs['mean'],
        'mode': 'markers+text',
        'text' : cnt_srs['train_type'],
        'textposition' : 'bottom center',
        'marker': {
            'color': "#f27da6",
            'size': 15,
            'opacity': 0.9
        }
    }
]

layout = go.Layout(title="Average fare prices according to Train type", 
                   xaxis=dict(title='Train type'),
                   yaxis=dict(title='Average price(Euros)')
                  )
fig = go.Figure(data = data, layout = layout)
iplot(fig, filename='scatter0')

In [None]:
cnt_srs = rail_data.groupby('train_class')['price'].agg(['mean'])
cnt_srs.columns = ["mean"]
cnt_srs['train_class'] = cnt_srs.index

data = [
    {
        'x': cnt_srs['train_class'],
        'y': cnt_srs['mean'],
        'mode': 'markers+text',
        'text' : cnt_srs['train_class'],
        'textposition' : 'bottom center',
        'marker': {
            'color': "#d889f9",
            'size': 15,
            'opacity': 0.9
        }
    }
]

layout = go.Layout(title="Average fare prices according to Train class", 
                   xaxis=dict(title='Train class'),
                   yaxis=dict(title='Average price(Euros)')
                  )
fig = go.Figure(data = data, layout = layout)
iplot(fig, filename='scatter1')

In [None]:
cnt_srs = rail_data.groupby('route')['price'].agg(['mean'])
cnt_srs.columns = ["mean"]
cnt_srs['route'] = cnt_srs.index

data = [
    {
        'x': cnt_srs['route'],
        'y': cnt_srs['mean'],
        'mode': 'markers+text',
        'text' : cnt_srs['route'],
        'textposition' : 'bottom center',
        'marker': {
            'color': "#7ae6ff",
            'size': 15,
            'opacity': 0.9
        }
    }
]

layout = go.Layout(title="Average fare prices according to routes", 
                   xaxis=dict(title='Routes'),
                   yaxis=dict(title='Average price(Euros)')
                  )
fig = go.Figure(data = data, layout = layout)
iplot(fig, filename='scatter2')

In [None]:
cnt_srs = rail_data.groupby('fare')['price'].agg(['mean'])
cnt_srs.columns = ["mean"]
cnt_srs['fare'] = cnt_srs.index

data = [
    {
        'x': cnt_srs['fare'],
        'y': cnt_srs['mean'],
        'mode': 'markers+text',
        'text' : cnt_srs['fare'],
        'textposition' : 'bottom center',
        'marker': {
            'color': "#42f4bc",
            'size': 15,
            'opacity': 0.9
        }
    }
]

layout = go.Layout(title="Average fare prices according to fare type", 
                   xaxis=dict(title='Fare'),
                   yaxis=dict(title='Average price(Euros)')
                  )
fig = go.Figure(data = data, layout = layout)
iplot(fig, filename='scatter')

## Observations:
### By Train Type:<br>
#### 1. AVE-TGV tickets cost around 85 euros on average, which is on the expensive side.<br>
#### 2. Regional tickets are the cheapest of all train types.<br>
### By Train class:<br>
#### 1. The Cama G. Clase class costs around 150 euros per journey on average.<br>
#### 2. Turista con enlace is the cheapest of all classes.<br>
### By Route:<br>
#### 1. Barcelona to Madrid has a higher average price than the other routes.<br>
#### 2. Average prices are lowest for journeys between Madrid and Sevilla.<br>
### By Fare:<br>
#### 1. The Mesa fare type has a higher average price than the other fare types.<br>
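
The four group-wise averages in this section all follow the same pattern, so they can be produced by one small helper. The sketch below is an illustration, not the kernel's own code: `mean_price_by` is a hypothetical helper name, and the `toy` frame is invented. It also reports the number of priced tickets per group, since an average based on very few tickets is less reliable.

```python
import pandas as pd

def mean_price_by(df, column):
    """Return mean price and number of priced tickets per group, highest mean first."""
    # 'count' ignores missing prices, as does 'mean'.
    summary = df.groupby(column)["price"].agg(["mean", "count"])
    return summary.sort_values(by="mean", ascending=False)

# Hypothetical toy data standing in for rail_data; the values are invented.
toy = pd.DataFrame({
    "train_type": ["AVE", "AVE", "REGIONAL", "REGIONAL", "REGIONAL"],
    "fare": ["Promo", "Flexible", "Promo", "Promo", "Flexible"],
    "price": [80.0, 90.0, 20.0, 30.0, None],
})

# One summary table per grouping column.
summaries = {col: mean_price_by(toy, col) for col in ["train_type", "fare"]}
for col, table in summaries.items():
    print(f"Average price by {col}")
    print(table)
```

With the real data, the column list would be `['train_type', 'train_class', 'route', 'fare']` and `toy` would be replaced by `rail_data`.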

# 20. Average time of journeys by routes <a name="20"></a>

In [None]:
cnt_srs = rail_data.groupby('route')['travel_time_in_mins'].agg(['mean'])
cnt_srs.columns = ["mean"]
cnt_srs['route'] = cnt_srs.index

data = [
    {
        'x': cnt_srs['route'],
        'y': cnt_srs['mean'],
        'mode': 'markers+text',
        'text' : cnt_srs['route'],
        'textposition' : 'bottom center',
        'marker': {
            'color': "#d069f7",
            'size': 15,
            'opacity': 0.9
        }
    }
]

layout = go.Layout(title="Average Time taken for journeys by routes (in mins)", 
                   xaxis=dict(title='Routes'),
                   yaxis=dict(title='Average time in minutes')
                  )
fig = go.Figure(data = data, layout = layout)
iplot(fig, filename='scatter3')

## Observations:
#### * It takes roughly 285 minutes on average to travel from Madrid to Ponferrada and about 300 minutes from Ponferrada to Madrid.
#### * The shortest journeys are between Madrid and Sevilla.
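
A mean travel time can be pulled upwards by a few very long journeys. As a hedged sketch, the snippet below compares the mean with the median per route, assuming the same `route` and `travel_time_in_mins` column names as `rail_data`. The `toy` frame and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for rail_data; the values are invented.
toy = pd.DataFrame({
    "route": ["MADRID to SEVILLA"] * 3 + ["MADRID to PONFERRADA"] * 3,
    "travel_time_in_mins": [150, 160, 170, 240, 250, 500],
})

# Mean, median and number of journeys per route, shortest median first.
stats = (
    toy.groupby("route")["travel_time_in_mins"]
    .agg(["mean", "median", "count"])
    .sort_values(by="median")
)

# Route with the shortest typical (median) journey.
shortest = stats["median"].idxmin()
print(stats)
print("Shortest typical journey:", shortest)
```

A large gap between the mean and the median for a route suggests that a few unusually long journeys are driving its average.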

# 21. Average time of journeys by train type <a name="21"></a>

In [None]:
cnt_srs = rail_data.groupby('train_type')['travel_time_in_mins'].agg(['mean'])
cnt_srs.columns = ["mean"]
cnt_srs['train_type'] = cnt_srs.index

data = [
    {
        'x': cnt_srs['train_type'],
        'y': cnt_srs['mean'],
        'mode': 'markers+text',
        'text' : cnt_srs['train_type'],
        'textposition' : 'bottom center',
        'marker': {
            'color': "#d62728",
            'size': 15,
            'opacity': 0.9
        }
    }
]

layout = go.Layout(title="Average Time taken for journeys by train type (in mins)", 
                   xaxis=dict(title='train_type'),
                   yaxis=dict(title='Average time in minutes')
                  )
fig = go.Figure(data = data, layout = layout)
iplot(fig, filename='scatter4')

## Observations:
#### * AVE trains are the fastest on average, while MD trains take considerably longer to complete their journeys.
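
Averages by train type mix together routes of different lengths, so a train type that mostly serves long routes will look slow regardless of its speed. A minimal sketch of one way to check this, assuming the same `route`, `train_type` and `travel_time_in_mins` column names as `rail_data`, is to compare train types within each route. The `toy` frame and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for rail_data; the values are invented.
toy = pd.DataFrame({
    "route": ["MADRID to SEVILLA"] * 4 + ["MADRID to PONFERRADA"] * 2,
    "train_type": ["AVE", "AVE", "MD", "MD", "MD", "MD"],
    "travel_time_in_mins": [150, 160, 400, 420, 300, 320],
})

# Overall average per train type, as in the plot above.
overall = toy.groupby("train_type")["travel_time_in_mins"].mean()

# Average per (route, train type); NaN marks combinations with no journeys.
by_route = toy.pivot_table(
    index="route",
    columns="train_type",
    values="travel_time_in_mins",
    aggfunc="mean",
)
print(overall)
print(by_route)
```

Reading `by_route` row by row compares train types on the same route, which is a fairer comparison than the overall averages.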

## <font color='red'>That's all for now. Thank you for reading the kernel! Upvote if you like it 😊. Suggestions to improve this kernel further are welcome.</font>