# Exercice de data crunching
 

## Installation des librairies nécessaires

- pandas
- datetime


**Documentation:**
- Tutoriel utilisé: https://jakevdp.github.io/blog/2017/12/05/installing-python-packages-from-jupyter/

In [2]:
import sys

# Installing pandas
!{sys.executable} -m pip install pandas

# Installing datetime
!{sys.executable} -m pip install datetime

# Installing datetime
!{sys.executable} -m pip install numpy



## Importation des datasets



In [3]:
import pandas as pd
import datetime
import numpy as np

# Importing CSVs
tickets_df = pd.read_csv('./csv/ticket_data.csv', comment='#')
cities_df = pd.read_csv('./csv/cities.csv', comment='#')
providers_df = pd.read_csv('./csv/providers.csv', comment='#')
stations_df = pd.read_csv('./csv/stations.csv', comment='#')

## Sélection d'un voyage dans le dataset

**Documentation:**
- Doc pandas: https://pandas.pydata.org/pandas-docs/stable/user_guide/  
- Exemple supplémentaire StackOverflow: https://stackoverflow.com/questions/17071871/how-to-select-rows-from-a-dataframe-based-on-column-values

In [4]:
# Function returning a dataframe with all the trips between these cities
def getTrips(tickets_df, origin_city = 628, destination_city = 453):
    
    # Selecting the trips between origin and destination
    selected_trips_df = tickets_df.loc[(tickets_df['o_city'] == origin_city) & (tickets_df['d_city'] == destination_city)]
    
    return selected_trips_df;

## Calcul des prix min, max et moyens

In [5]:
# Getting the minimum and the maximum price of the selected trip and the mean of the prices

def getPricesBounds(trips_df):
    minPrice = (trips_df['price_in_cents'].min()) / 100
    avgPrice = (trips_df['price_in_cents'].mean()) / 100
    maxPrice = (trips_df['price_in_cents'].max()) / 100
    
    return minPrice, avgPrice, maxPrice;

## Conversion des champs de date et calcul de la durée des trajets

**Documentation:**
- https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html
- https://pandas.pydata.org/pandas-docs/stable/user_guide/timedeltas.html

**Problème rencontré:**  
- *SettingWithCopyWarning*

    Solutions trouvées: 
    - https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas
    - https://stackoverflow.com/questions/36846060/how-to-replace-an-entire-column-on-pandas-dataframe

In [6]:
# Adds a duration column and format the timestamps to datetime
def addDuration(trips_df):
    
    # Changing departure and arrival to datetime
    formatted_trips_df = trips_df.loc[:, ['departure_ts', 'arrival_ts']].apply(pd.to_datetime)
    
    # Adding all the other columns
    formatted_trips_df = trips_df.assign(departure_ts = formatted_trips_df['departure_ts'], arrival_ts = formatted_trips_df['arrival_ts'])
    
    
    # Create a Serie containing the durations between departure and arrival
    duration_s = formatted_trips_df.loc[:, 'arrival_ts'] - formatted_trips_df.loc[:, 'departure_ts']
    duration_s.name = 'duration_tdelt'
    
    # Adding the durations to the dataframe
    formatted_trips_df = pd.concat([formatted_trips_df, duration_s], axis = 1)
    
    return formatted_trips_df;

## Calcul des durées min, max et moyennes

**Documentation:**  
- Convertir un timedelta en secondes: https://pandas.pydata.org/pandas-docs/stable/user_guide/timedeltas.html#conversions

In [10]:
def getDurationBounds(trips_df):
    
    # Min and max duration
    minDuration = min(trips_df.loc[:, 'duration_tdelt'])
    maxDuration = max(trips_df.loc[:, 'duration_tdelt'])
    
    # Converting the durations in seconds
    durationInSecs = trips_df.loc[:, 'duration_tdelt'].astype('timedelta64[s]')
    
    # Computing the mean and returning it in timedelta
    avgDuration = durationInSecs.mean()
    avgDuration = pd.to_timedelta(avgDuration, 'seconds')
    
    return minDuration, maxDuration, avgDuration;

## Affichage d'un timestamp au format HH:mm:  

**Documentation:**  
- Formater un timedelta: https://stackoverflow.com/questions/538666/format-timedelta-to-string

In [22]:
# Display of a timedelta
def timedeltaDisplay(timedelta):
    
    # Get the total # of seconds
    timedeltaSeconds = timedelta.seconds
    
    # Translate it in hours and minutes
    hours, remainder = divmod(timedeltaSeconds,3600)
    minutes = remainder // 60
    
    # Build the string
    display = '{}h{}mn'.format(hours, minutes)
    return display;

## Récupération de l'id d'une ville

In [173]:
# Get the id of a given city
def getCityId(city):
    
    # The cities are in lowercase in the column we're searching
    city = city.lower()
    
    # If the city isn't in the dataframe
    if(cities_df.loc[cities_df['unique_name'].isin([city])].empty):
        
        # Not very clean exit
        print('We couln\'t find this city')
        sys.exit(0)
    
    # Find the correct row in the cities dataframe
    else:
        
        # Get the id
        city_id = cities_df.loc[cities_df['unique_name'] == city, 'id'].item()
    
    return city_id;

## Récupération du nom complet d'une ville

In [103]:
# Get the city, its region and its country 
def getCityLocalName(city_id):
    
    cityLocalName = cities_df.loc[cities_df['id'] == city_id, 'local_name'].item()
    
    return cityLocalName;

## Sélection du trajet

In [80]:
# Get the trip origin and destination input
def getTripEndpoints():
    
    destination_city = input("You want to go to : ")
    origin_city = input("from : ")
    
    return getCityId(origin_city), getCityId(destination_city);

## Main

**Documentation:**  
- Affichage des durées moyennes: https://stackoverflow.com/questions/538666/format-timedelta-to-string

In [178]:
# Retrieve the destination and the origin
origin_city_id, destination_city_id = getTripEndpoints()

# Select the trips available between the given cities
myTrip = getTrips(tickets_df, origin_city_id, destination_city_id)

# Get price overview
minPrice, avgPrice, maxPrice = getPricesBounds(myTrip)
print("For the trip from {} to {}, the prices are going from {:.2f}€ to {:.2f}€ and the average price is {:.2f}€".format(getCityLocalName(origin_city_id), getCityLocalName(destination_city_id), minPrice, maxPrice, avgPrice))

# Add a duration column for each trip
myTrip = addDuration(myTrip)

# Get duration overview
minDuration, maxDuration, avgDuration = getDurationBounds(myTrip)
print("The trip from town A to town B, last between {} and {} and the average duration is {}".format(timedeltaDisplay(minDuration), timedeltaDisplay(maxDuration), timedeltaDisplay(avgDuration)))

You want to go to : lisboa
from : paris
For the trip from Paris, Île-de-France, France to Lisboa, Área Metropolitana de Lisboa, Portugal, the prices are going from 71.90€ to 134.50€ and the average price is 88.03€
The trip from town A to town B, last between 18h20mn and 8h10mn and the average duration is 15h13mn
