# Data crunching exercise - Tictactrip
 
 
**My approach:**

I started by reading the pandas user guide and reproducing its examples with the ticket_data dataset until I felt ready to tackle the first point. From there I moved forward step by step, going back to the docs whenever I did not know how to do something.

## Installing the required libraries

- pandas
- datetime (part of the Python standard library, so nothing to install)
- numpy
- geopy
- tabulate


**Documentation:**
- Tutorial used: https://jakevdp.github.io/blog/2017/12/05/installing-python-packages-from-jupyter/

In [1]:
import sys

# Installing pandas: it's a pandas exercise
!{sys.executable} -m pip install pandas

# datetime handles time-related operations; it ships with Python, so no install is needed
# (`pip install datetime` would fetch the unrelated Zope DateTime package)

# Installing numpy: to manipulate arrays
!{sys.executable} -m pip install numpy

# Installing geopy: to convert GPS coordinates into distances
!{sys.executable} -m pip install geopy

# Installing tabulate: to display a pretty table
!{sys.executable} -m pip install tabulate



# PART 1: Stats for a trip chosen by the user

## A. Selecting a trip

**Documentation:**
- pandas docs: https://pandas.pydata.org/pandas-docs/stable/user_guide/  
- Additional StackOverflow example: https://stackoverflow.com/questions/17071871/how-to-select-rows-from-a-dataframe-based-on-column-values

### Entering the origin and destination cities

In [2]:
# Get the trip origin and destination input
def getTripEndpoints():
    
    destination_city = input("You want to go to : ")
    origin_city = input("from : ")
    
    return getCityId(origin_city), getCityId(destination_city);

### Getting a city's id

In [3]:
# Get the id of a given city
def getCityId(city):
    
    # The cities are in lowercase in the column we're searching
    city = city.lower()
    
    # Find the matching row(s) in the cities dataframe
    matches = cities_df.loc[cities_df['unique_name'] == city, 'id']
    
    # If the city isn't in the dataframe
    if matches.empty:
        
        # Not very clean exit
        print("We couldn't find this city")
        sys.exit(0)
    
    # Get the id
    return matches.item()
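
The exit above is flagged as not very clean. One possible alternative, sketched here under assumptions, is to raise an exception and let the caller decide what to do. The function name `find_city_id` and the two rows of toy data are hypothetical; only the column names come from the notebook.

```python
import pandas as pd

# Toy stand-in for cities.csv (hypothetical ids, same column names as the notebook)
cities_df = pd.DataFrame({
    "id": [10, 20],
    "unique_name": ["paris", "lille"],
})


def find_city_id(cities, name):
    """Return the id of `name`, or raise ValueError when it is missing."""
    matches = cities.loc[cities["unique_name"] == name.strip().lower(), "id"]
    if matches.empty:
        raise ValueError(f"Unknown city: {name!r}")
    # Take the first match so a duplicated name does not raise
    return int(matches.iloc[0])


print(find_city_id(cities_df, "Paris"))

try:
    find_city_id(cities_df, "atlantis")
except ValueError as error:
    print(error)
```

With this shape, the input cell could catch the `ValueError` and ask again instead of stopping the kernel.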

### Selecting all the tickets for a trip

In [4]:
# Function returning a dataframe with all the trips between these cities
def getTrips(tickets_df, originCity = 628, destinationCity = 453):
    
    # Selecting the trips between origin and destination
    selected_trips_df = tickets_df.loc[(tickets_df['o_city'] == originCity) & (tickets_df['d_city'] == destinationCity)]
    
    # If there is no trip available between the chosen cities, end the program
    if(selected_trips_df.empty):
        print('No trip available for this location\n')
        sys.exit(0)

    return selected_trips_df;
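
As a small illustration of the row selection used in `getTrips`, the sketch below combines two boolean masks with `&` on a toy dataframe. The data is made up; only the column names match the notebook.

```python
import pandas as pd

# Toy tickets (hypothetical values)
tickets = pd.DataFrame({
    "o_city": [1, 1, 2, 1],
    "d_city": [2, 3, 1, 2],
    "price_in_cents": [1000, 2500, 1200, 1800],
})

# Each comparison yields a boolean Series; `&` combines them element-wise.
# The parentheses are required because `&` binds tighter than `==`.
mask = (tickets["o_city"] == 1) & (tickets["d_city"] == 2)
selected = tickets.loc[mask]

print(selected)
```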

## B. Ticket prices

In [5]:
# Getting the minimum and the maximum price of the selected trip and the mean of the prices

def getPricesBounds(trips_df):
    minPrice = (trips_df['price_in_cents'].min()) / 100
    avgPrice = (trips_df['price_in_cents'].mean()) / 100
    maxPrice = (trips_df['price_in_cents'].max()) / 100
    
    priceBounds = pd.Series([minPrice, avgPrice, maxPrice], index=list(['minimum', 'average', 'maximum']))
    
    return priceBounds;

## C. Trip durations

### Adding a 'trip duration' column

**Documentation:**
- https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html
- https://pandas.pydata.org/pandas-docs/stable/user_guide/timedeltas.html

**Problem encountered:**  
- *SettingWithCopyWarning*

    Solutions found: 
    - https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas
    - https://stackoverflow.com/questions/36846060/how-to-replace-an-entire-column-on-pandas-dataframe
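
A minimal sketch of two common ways to sidestep this warning, on made-up data: take an explicit `.copy()` of a selection before writing to it, or build a new dataframe with `assign`. Depending on the pandas version, the warning may not be raised at all, so this illustrates the pattern rather than reproducing the exact message.

```python
import pandas as pd

# Toy tickets with timestamps stored as strings (hypothetical values)
tickets = pd.DataFrame({
    "departure_ts": ["2017-10-13 14:00:00", "2017-10-13 13:05:00"],
    "arrival_ts": ["2017-10-13 20:10:00", "2017-10-14 06:55:00"],
    "price_in_cents": [4550, 1450],
})

# Option 1: copy the selection explicitly, then modify the copy
cheap = tickets.loc[tickets["price_in_cents"] < 4000].copy()
cheap["departure_ts"] = pd.to_datetime(cheap["departure_ts"])

# Option 2: build a new dataframe with assign, leaving the input untouched
converted = tickets.assign(
    departure_ts=pd.to_datetime(tickets["departure_ts"]),
    arrival_ts=pd.to_datetime(tickets["arrival_ts"]),
)

print(converted["arrival_ts"] - converted["departure_ts"])
```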

In [6]:
# Adds a duration column and formats the timestamps to datetime
def addDuration(trips_df):
    
    # Changing departure and arrival to datetime
    formatted_trips_df = trips_df.loc[:, ['departure_ts', 'arrival_ts']].apply(pd.to_datetime)
    
    # Adding all the other columns
    formatted_trips_df = trips_df.assign(departure_ts = formatted_trips_df['departure_ts'], 
                                         arrival_ts = formatted_trips_df['arrival_ts'])
    
    
    # Create a Series containing the durations between departure and arrival
    duration_s = formatted_trips_df.loc[:, 'arrival_ts'] - formatted_trips_df.loc[:, 'departure_ts']
    duration_s.name = 'duration_tdelt'
    
    # Adding the durations to the dataframe
    formatted_trips_df = pd.concat([formatted_trips_df, duration_s], axis = 1)
    
    return formatted_trips_df;

### Computing the minimum, maximum and average durations

**Documentation:**  
- Converting a timedelta to seconds: https://pandas.pydata.org/pandas-docs/stable/user_guide/timedeltas.html#conversions

In [7]:
def getDurationBounds(trips_df):
    
    # Min and max duration
    minDuration = trips_df['duration_tdelt'].min()
    maxDuration = trips_df['duration_tdelt'].max()
    
    # Converting the durations to seconds (as floats); unlike astype('timedelta64[s]'),
    # this returns numbers regardless of the pandas version
    durationInSecs = trips_df['duration_tdelt'].dt.total_seconds()
    
    # Computing the mean and converting it back to a timedelta
    avgDuration = durationInSecs.mean()
    avgDuration = pd.to_timedelta(avgDuration, unit='s')
    
    durationBounds_s = pd.Series([minDuration, avgDuration, maxDuration], 
                                 index=['minimum', 'average', 'maximum'])
    
    return durationBounds_s
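
To make the averaging step concrete, here is a short sketch on three made-up durations. It compares the "convert to seconds, average, convert back" route with letting pandas average the timedeltas directly.

```python
import pandas as pd

# Three hypothetical trip durations: 1h, 2h and 27h
durations = pd.Series(pd.to_timedelta([1, 2, 27], unit="h"))

# Route 1: convert to float seconds, average, convert back to a timedelta
avg_from_seconds = pd.to_timedelta(durations.dt.total_seconds().mean(), unit="s")

# Route 2: average the timedeltas directly
avg_direct = durations.mean()

print(avg_from_seconds, avg_direct)
```

Both routes should agree here (10 hours), so the second one is a shorter option when no intermediate value in seconds is needed.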

## D. Display

### Displaying a timedelta in HH:mm format

**Documentation:**  
- Formatting a timedelta: https://stackoverflow.com/questions/538666/format-timedelta-to-string

**Problem encountered:**  
- A trip of 1 day and 3 hours was displayed as 3h

**Solution adopted:**  
- Add 86400 seconds per day

In [8]:
# Display of a timedelta
def timedeltaDisplay(timedelta):
    
    # Get the total # of seconds
    timedeltaSeconds = timedelta.seconds + 86400 * timedelta.days
    
    # Translate it in hours and minutes
    hours, remainder = divmod(timedeltaSeconds,3600)
    minutes = remainder // 60
    
    # Build the string
    display = '{}h{:02d}mn'.format(hours, minutes)
    
    return display;
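
The "1 day and 3 hours shown as 3h" problem comes from the `.seconds` attribute, which only holds the part of the duration below one day. The sketch below shows the difference on a made-up duration, using `total_seconds()` as an alternative to adding 86400 seconds per day by hand.

```python
import pandas as pd

trip = pd.Timedelta(days=1, hours=3, minutes=20)

# `.seconds` ignores the days component
hours_without_days = trip.seconds // 3600

# `total_seconds()` includes the days
total = int(trip.total_seconds())
hours, remainder = divmod(total, 3600)
label = "{}h{:02d}mn".format(hours, remainder // 60)

print(hours_without_days)  # 3
print(label)               # 27h20mn
```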

### Getting a city's full name

In [9]:
# Get the city, its region and its country 
def getCityLocalName(city_id):
    
    # Search the city local name using id
    cityLocalName = cities_df.loc[cities_df['id'] == city_id, 'local_name'].item()
    
    # Split the city, the region and the country    
    cityLocation = pd.Series(cityLocalName.split(', '), index=list(['city', 'region', 'country']))
    
    return cityLocation;

### Displaying the results

**Documentation:**  
- Displaying average durations: https://stackoverflow.com/questions/538666/format-timedelta-to-string

In [10]:
# Display the stats of the trips available
def displayTripStats(origin_city_id, destination_city_id, priceBounds, durationBounds):
    
    # Get the cities full names
    origin = getCityLocalName(origin_city_id)
    destination = getCityLocalName(destination_city_id)
    
    # Print trip endpoints
    print("\n")
    print("Trip from {} ({}) to {} ({}):\n"
          .format(origin.city, origin.region, destination.city, destination.region))
         
    # Print prices available
    print("Prices : {:.2f}€ >>> {:.2f}€. Average: {:.2f}€"
          .format(priceBounds.minimum, priceBounds.maximum, priceBounds.average))
    
    # Print trip durations
    print("Duration : {} >>> {}. Average: {}"
          .format(
              timedeltaDisplay(durationBounds.minimum), 
              timedeltaDisplay(durationBounds.maximum), 
              timedeltaDisplay(durationBounds.average)))
    
    return;

# PART 2: Prices by distance and means of transport

## A. Computing the distance of each trip

### Adding a distance column

**Documentation:**

- geopy: https://pypi.org/project/geopy/

**Problems encountered:**
- Computing the distance from several columns of the dataframe
- Using cities rather than stations when data is missing
- Attaching the distance computed for each trip to the full set of tickets

**Solutions adopted:**

- Use the apply function:
 - apply docs: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
 - StackOverflow example: https://stackoverflow.com/questions/31414481/new-column-with-coordinates-using-geopy-pandas
- Reread the pandas docs: https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html
- Use the merge function: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html

In [29]:
# Add a distance column for each ticket
def addDistance(trips_df):

    # Extracting only the geographic info from the tickets dataframe and keep a single copy of each trip
    distances_df = trips_df.loc[:,['o_station', 'd_station', 'middle_stations', 'o_city', 'd_city']]
    distances_df = distances_df.drop_duplicates()

    # Computing the distance for each unique trip
    distances_df['distance'] = distances_df.apply(lambda df: computeDistance(df['o_city'], 
                                                                             df['d_city'], 
                                                                             df['o_station'], 
                                                                             df['d_station'],
                                                                             df['middle_stations']), axis=1)

    # Associate to each ticket its distance
    trips_df = pd.merge(trips_df, distances_df, on=['o_city',
                                                    'd_city',
                                                    'o_station', 
                                                    'd_station', 
                                                    'middle_stations'])

    return trips_df;
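
The pattern in `addDistance` (compute once per unique route, then attach the result to every ticket) can be sketched on toy data as follows. `fake_distance` is a hypothetical placeholder for the real distance computation, and the values are made up.

```python
import pandas as pd

# Toy tickets: three of the four share the same route (hypothetical values)
tickets = pd.DataFrame({
    "o_city": [1, 1, 2, 1],
    "d_city": [2, 2, 3, 2],
    "price_in_cents": [1000, 1800, 900, 1500],
})


def fake_distance(row):
    """Placeholder for an expensive computation (hypothetical formula)."""
    return 100.0 * (row["o_city"] + row["d_city"])


# Keep one row per unique route and compute its distance once
routes = tickets[["o_city", "d_city"]].drop_duplicates()
routes["distance"] = routes.apply(fake_distance, axis=1)

# Attach the distance back to every ticket of that route
tickets_with_distance = tickets.merge(routes, on=["o_city", "d_city"], how="left")

print(tickets_with_distance)
```

Using `how="left"` here is a design choice of the sketch: it keeps every ticket and preserves the original row order, whereas the default inner join would drop tickets without a matching route.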

### Computing the distance between two pairs of GPS coordinates

*Readability needs work  
Not optimized*

In [30]:
def computeDistance(originCity, destinationCity, originStation, destinationStation, middleStations):
    
    # If a station is missing, fall back to the origin and destination cities
    if pd.isna(originStation) or pd.isna(destinationStation):
        
        # Find the GPS coordinates from cities.csv
        originCoords = getCityCoords(originCity)
        destinationCoords = getCityCoords(destinationCity)
        
        # Use geodesic distance
        return geodesic(originCoords, destinationCoords).km
    
    # Otherwise both stations are known: start from an empty list of middle stations
    stationsTable = []
    
    # If there are middle stations
    if pd.notna(middleStations):
        
        # Enter the middle stations in a list (the code assumes a '{id1,id2,...}' string)
        stationsTable = [s for s in middleStations.strip('{}').split(',') if s]
    
    # Add the first and the last station
    stationsTable.insert(0, originStation)
    stationsTable.append(destinationStation)
    
    # Sum every consecutive leg, so no segment is skipped and an odd number
    # of stations cannot run past the end of the list
    distance = 0
    for i in range(len(stationsTable) - 1):
        distance = distance + getDistanceBetweenTwoStations(stationsTable[i], 
                                                            stationsTable[i + 1])
    
    return distance
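
To illustrate summing a route leg by leg without depending on geopy, the sketch below swaps in a plain haversine (great-circle) formula instead of `geodesic`; the two give close but not identical results. The function names and the approximate coordinates are assumptions made for the example.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))  # 6371 km: mean Earth radius


def path_length_km(points):
    """Sum the length of every consecutive leg: A-B, B-C, C-D, ..."""
    return sum(haversine_km(start, end) for start, end in zip(points, points[1:]))


# Approximate coordinates, for illustration only
paris, amiens, lille = (48.8566, 2.3522), (49.8941, 2.2958), (50.6292, 3.0573)

direct = haversine_km(paris, lille)
via_amiens = path_length_km([paris, amiens, lille])

print(round(direct), round(via_amiens))
```

Pairing the list with itself shifted by one (`zip(points, points[1:])`) covers every leg exactly once and returns 0 for a single point.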

### Computing the distance between two stations

In [13]:
def getDistanceBetweenTwoStations(firstStation, secondStation):
    
    # Get the coordinates of both stations
    firstStationCoords = getStationCoords(int(firstStation))
    secondStationCoords = getStationCoords(int(secondStation))
    
    # Compute their distance using geodesic
    distance = geodesic(firstStationCoords, secondStationCoords).km
    
    return distance;

### Getting a station's GPS coordinates

In [14]:
def getStationCoords(station_id):
    
    station_lat = stations_df.loc[stations_df['id'] == station_id, 'latitude'].item()
    station_lon = stations_df.loc[stations_df['id'] == station_id, 'longitude'].item()
    
    return (station_lat, station_lon)

### Getting a city's GPS coordinates

In [15]:
def getCityCoords(city_id):
    
    city_lat = cities_df.loc[cities_df['id'] == city_id, 'latitude'].item()
    city_lon = cities_df.loc[cities_df['id'] == city_id, 'longitude'].item()
    
    return (city_lat, city_lon)

## B. Average durations and prices by distance and means of transport

**Problem encountered:**

I can't read, so I waste time

**Solution:**

Reread


### Splitting the trips by distance band

In [16]:
# Returns 4 separate dataframes for short, medium, long and very long trips
def separateTripLength(trips_df):
    
    shortTrip = trips_df.loc[trips_df['distance'] <= 200]
    mediumTrip = trips_df.loc[(trips_df['distance'] <= 800) & (trips_df['distance'] > 200)]
    longTrip = trips_df.loc[(trips_df['distance'] <= 2000) & (trips_df['distance'] > 800)]
    veryLongTrip = trips_df.loc[trips_df['distance'] > 2000]
    
    return shortTrip, mediumTrip, longTrip, veryLongTrip;
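
As an alternative sketch, the same four distance bands can be expressed as one labelled column with `pd.cut`, which keeps everything in a single dataframe. The thresholds come from the function above; the label strings and the toy distances are made up.

```python
import numpy as np
import pandas as pd

# Toy distances in km (hypothetical values, including the band edges)
trips = pd.DataFrame({"distance": [50.0, 200.0, 200.1, 800.0, 1500.0, 2500.0]})

# Right-closed intervals: (0, 200], (200, 800], (800, 2000], (2000, inf)
bins = [0, 200, 800, 2000, np.inf]
labels = ["<=200km", "200-800km", "800-2000km", ">2000km"]

# include_lowest=True also puts a distance of exactly 0 in the first band
trips["distance_band"] = pd.cut(trips["distance"], bins=bins, labels=labels,
                                include_lowest=True)

print(trips)
```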

### Getting the trips that use the same means of transport

In [17]:
def getByTransportType(trips_df, transportType):

    # Get all the ids of providers matching the selected transport type
    companyByTransportType = providers_df.loc[providers_df['transport_type'] == transportType, 'id']

    # Select the trips from the selected providers
    selectedTrips_df = trips_df.loc[trips_df['company'].isin(companyByTransportType)]
    
    return selectedTrips_df;
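
Another way to relate tickets to a transport type, sketched here on made-up data, is to attach the type as a column with `Series.map` instead of filtering once per type. The provider ids are hypothetical; the column names follow the notebook.

```python
import pandas as pd

# Toy providers and tickets (hypothetical ids)
providers = pd.DataFrame({
    "id": [1, 2, 3],
    "transport_type": ["bus", "train", "carpooling"],
})
tickets = pd.DataFrame({"company": [1, 2, 2, 3, 4]})

# Build an id -> transport_type lookup and apply it to every ticket
type_by_provider = providers.set_index("id")["transport_type"]
tickets["transport_type"] = tickets["company"].map(type_by_provider)

print(tickets)
```

A company missing from the providers table (id 4 here) ends up with a missing value rather than being silently dropped.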

### Splitting a dataframe by means of transport

In [31]:
# Returns a dict of 3 dataframes for train, bus and carpooling trip tickets
def separateTransportTypes(trips_df):
    
    trainTrips_df = getByTransportType(trips_df, 'train')
    busTrips_df = getByTransportType(trips_df, 'bus')
    carpoolingTrips_df = getByTransportType(trips_df, 'carpooling')
    
    separatedTrips_df = {'train': trainTrips_df, 
                         'bus': busTrips_df, 
                         'carpooling': carpoolingTrips_df}
    
    return separatedTrips_df;

### Computing the average price and duration of a dataframe of tickets

In [19]:
def getPricesAndDurations(trips_df):
    
    # Very bad handling of empty datasets: one empty transport type hides the whole band
    if trips_df['train'].empty or trips_df['bus'].empty or trips_df['carpooling'].empty:
        return "No information available"
    
    else:

        # Computes the average prices
        avgTrainPrice = trips_df['train'].price_in_cents.mean() / 100
        avgBusPrice = trips_df['bus'].price_in_cents.mean() / 100
        avgCarpoolingPrice = trips_df['carpooling'].price_in_cents.mean() / 100

        # Computes the average durations
        avgTrainDuration = trips_df['train'].loc[:, 'duration_tdelt'].mean()
        avgBusDuration = trips_df['bus'].loc[:, 'duration_tdelt'].mean()
        avgCarpoolingDuration = trips_df['carpooling'].loc[:, 'duration_tdelt'].mean()

        # Gather the average for each way of transportation
        pricesAndDurations = pd.DataFrame(np.array([[timedeltaDisplay(avgTrainDuration), 
                                                     '{:.2f}€'.format(avgTrainPrice)],
                                                    [timedeltaDisplay(avgBusDuration), 
                                                     '{:.2f}€'.format(avgBusPrice)],
                                                    [timedeltaDisplay(avgCarpoolingDuration), 
                                                     '{:.2f}€'.format(avgCarpoolingPrice)]]), 
                                          columns=['Average duration', 'Average price'], 
                                          index=['Train', 'Bus', 'Carpooling'])

        return pricesAndDurations;

### Comparing prices by means of transport

Empty dataframes are handled badly (very badly)

In [20]:
# Prints the average price and duration for each means of transport, per distance band
def comparePrices(shortTrip, mediumTrip, longTrip, veryLongTrip):
    
    # Separate the trips per transport type
    shortTripPerTransportation = separateTransportTypes(shortTrip)
    mediumTripPerTransportation = separateTransportTypes(mediumTrip)
    longTripPerTransportation = separateTransportTypes(longTrip)
    veryLongTripPerTransportation = separateTransportTypes(veryLongTrip)
    
    # Get the average prices and durations for each transport type
    shortTripInfo = getPricesAndDurations(shortTripPerTransportation)
    mediumTripInfo = getPricesAndDurations(mediumTripPerTransportation)
    longTripInfo = getPricesAndDurations(longTripPerTransportation)
    veryLongTripInfo = getPricesAndDurations(veryLongTripPerTransportation)
    
    
    # Printing the results (an empty band comes back as a plain string, not a dataframe,
    # so only call to_markdown() on actual dataframes)
    results = [('Trips of < 200km:', shortTripInfo),
               ('Trips between 200km and 800km:', mediumTripInfo),
               ('Trips between 800km and 2000km:', longTripInfo),
               ('Trips of > 2000km:', veryLongTripInfo)]
    
    for title, info in results:
        print('\n' + title + '\n')
        print(info.to_markdown() if isinstance(info, pd.DataFrame) else info)
    
    return
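
A possible way around the empty-dataframe problem, sketched under assumptions, is a single `groupby` over the distance band and the transport type: combinations with no tickets simply do not appear in the result. The columns `distance_band` and `transport_type` are hypothetical additions to the tickets dataframe, and the data is made up.

```python
import pandas as pd

# Toy tickets (hypothetical values)
tickets = pd.DataFrame({
    "distance_band": ["short", "short", "short", "long"],
    "transport_type": ["bus", "bus", "train", "train"],
    "price_in_cents": [1000, 2000, 3000, 9000],
    "duration_tdelt": pd.to_timedelta([60, 120, 45, 300], unit="min"),
})

# Work with plain numbers so the averages behave the same across pandas versions
tickets["price"] = tickets["price_in_cents"] / 100
tickets["duration_min"] = tickets["duration_tdelt"].dt.total_seconds() / 60

summary = tickets.groupby(["distance_band", "transport_type"]).agg(
    avg_price=("price", "mean"),
    avg_duration_min=("duration_min", "mean"),
)

print(summary)
```

In this toy data there is no long bus trip and no carpooling at all, yet the other averages are still reported.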

# Main


### Importing the datasets

In [21]:
import pandas as pd
import datetime
import numpy as np
from geopy.distance import geodesic

# Importing CSVs
tickets_df = pd.read_csv('./csv/ticket_data.csv', comment='#')
cities_df = pd.read_csv('./csv/cities.csv', comment='#')
providers_df = pd.read_csv('./csv/providers.csv', comment='#')
stations_df = pd.read_csv('./csv/stations.csv', comment='#')

### Computing the duration and the distance

In [22]:
# Computes a duration and a distance column for each trip
tickets_df = addDuration(tickets_df)
tickets_df = addDistance(tickets_df)

### Average price and duration per trip

In [32]:
# Ask for the destination and the origin of the trip 
originCityId, destinationCityId = getTripEndpoints()

# Select the trips available between the given cities
myTrip_df = getTrips(tickets_df, originCityId, destinationCityId)

# Get price overview
priceBounds_s = getPricesBounds(myTrip_df)

# Get duration overview
durationBounds_s = getDurationBounds(myTrip_df)


# Display the results
displayTripStats(originCityId, destinationCityId, priceBounds_s, durationBounds_s)

You want to go to : lille
from : paris


Trip from Paris (Île-de-France) to Lille (Hauts-de-France):

Prices : 10.00€ >>> 134.50€. Average: 20.31€
Duration : 1h08mn >>> 37h20mn. Average: 3h38mn


### Average price and duration per distance band

In [None]:
# Prices and durations overview

# Separate the trips by length
shortTrips_df, mediumTrips_df, longTrips_df, veryLongTrips_df = separateTripLength(tickets_df)

# Compare the transport types for each trip length
comparePrices(shortTrips_df, mediumTrips_df, longTrips_df, veryLongTrips_df)