# 02_Processing
##### Authors: Diego Senso González, Luis Vaciero
##### 15 January 2021
##### Module: Machine Learning - Master's Degree in Data Science for Finance

## Objective
The purpose of this document is to build several datasets from the previously analyzed data, so that the later network visualization can offer different perspectives. Three datasets will be created, representing the number of journeys by street, by neighbourhood and by district.

## Libraries
First, we import the required libraries.

In [82]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## Loading the data

In [83]:
bicimad = pd.read_csv('../data/bicimad.csv', delimiter = ',')

## Journeys by street

We join the origin and destination of each journey into a single column, separated by the symbol '!', which will later let us split them apart again.

In [84]:
bicimad["recorrido"] = bicimad["origin"] + "!" + bicimad["destination"]
bicimad

Unnamed: 0.1,Unnamed: 0,user_type,travel_time,idunplug_station,ageRange,idplug_station,district_origin,neighbourhood_origin,origin,district_dest,neighbourhood_dest,destination,date,hour_time,recorrido
0,0,1,210,87,2,78,03 RETIRO,03-02 ADELFAS,"DOCTOR ESQUERDO, CALLE, DEL",03 RETIRO,03-06 NIÑO JESÚS,"JUAN DE URBIETA, CALLE, DE",2020-01-01,00:00:00,"DOCTOR ESQUERDO, CALLE, DEL!JUAN DE URBIETA, C..."
1,1,1,330,87,0,78,03 RETIRO,03-02 ADELFAS,"DOCTOR ESQUERDO, CALLE, DEL",03 RETIRO,03-06 NIÑO JESÚS,"JUAN DE URBIETA, CALLE, DE",2020-01-02,09:00:00,"DOCTOR ESQUERDO, CALLE, DEL!JUAN DE URBIETA, C..."
2,2,1,281,87,4,78,03 RETIRO,03-02 ADELFAS,"DOCTOR ESQUERDO, CALLE, DEL",03 RETIRO,03-06 NIÑO JESÚS,"JUAN DE URBIETA, CALLE, DE",2020-01-03,09:00:00,"DOCTOR ESQUERDO, CALLE, DEL!JUAN DE URBIETA, C..."
3,3,1,270,87,0,78,03 RETIRO,03-02 ADELFAS,"DOCTOR ESQUERDO, CALLE, DEL",03 RETIRO,03-06 NIÑO JESÚS,"JUAN DE URBIETA, CALLE, DE",2020-01-03,13:00:00,"DOCTOR ESQUERDO, CALLE, DEL!JUAN DE URBIETA, C..."
4,4,1,245,87,0,78,03 RETIRO,03-02 ADELFAS,"DOCTOR ESQUERDO, CALLE, DEL",03 RETIRO,03-06 NIÑO JESÚS,"JUAN DE URBIETA, CALLE, DE",2020-01-04,08:00:00,"DOCTOR ESQUERDO, CALLE, DEL!JUAN DE URBIETA, C..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2039704,2056519,1,1464,32,0,109,01 CENTRO,01-03 CORTES,"ALCALA, CALLE, DE",04 SALAMANCA,04-01 RECOLETOS,"SERRANO, CALLE, DE",2020-10-03,20:00:00,"ALCALA, CALLE, DE!SERRANO, CALLE, DE"
2039705,2056520,1,377,32,3,109,01 CENTRO,01-03 CORTES,"ALCALA, CALLE, DE",04 SALAMANCA,04-01 RECOLETOS,"SERRANO, CALLE, DE",2020-10-15,22:00:00,"ALCALA, CALLE, DE!SERRANO, CALLE, DE"
2039706,2056521,1,489,32,0,109,01 CENTRO,01-03 CORTES,"ALCALA, CALLE, DE",04 SALAMANCA,04-01 RECOLETOS,"SERRANO, CALLE, DE",2020-10-19,06:00:00,"ALCALA, CALLE, DE!SERRANO, CALLE, DE"
2039707,2056522,1,396,32,0,109,01 CENTRO,01-03 CORTES,"ALCALA, CALLE, DE",04 SALAMANCA,04-01 RECOLETOS,"SERRANO, CALLE, DE",2020-10-19,07:00:00,"ALCALA, CALLE, DE!SERRANO, CALLE, DE"


Now we look at the most and least frequent routes.

In [85]:
bicimad["recorrido"].value_counts()

CASTELLANA, PASEO, DE LA!CASTELLANA, PASEO, DE LA            7173
MENENDEZ PELAYO, AVENIDA, DE!MENENDEZ PELAYO, AVENIDA, DE    5999
SANTA ENGRACIA, CALLE, DE!SANTA ENGRACIA, CALLE, DE          3734
SERRANO, CALLE, DE!SERRANO, CALLE, DE                        2737
SAN BERNARDO, CALLE, DE!SAN BERNARDO, CALLE, DE              2711
                                                             ... 
LIRA, CALLE, DE LA!MARQUES DE CERRALBO, PLAZA, DEL              1
CORDON, CALLE, DEL!JOSE GUTIERREZ ABASCAL, CALLE, DE            1
MARQUES DE LA ENSENADA, CALLE, DEL!PAVIA, CALLE, DE             1
JACOMETREZO, CALLE, DE!DOCTOR ARCE, AVENIDA, DEL                1
JOSE GUTIERREZ ABASCAL, CALLE, DE!CARLOS III, CALLE, DE         1
Name: recorrido, Length: 15117, dtype: int64

We group by route and count the rows in each group, which yields a new dataframe with one row per route and the number of journeys made along it. We then drop the columns that are not needed for the study.

In [86]:
bicimad_trips = bicimad.groupby("recorrido")
bicimad_trips = bicimad_trips.count()
aristas = pd.DataFrame(data=bicimad_trips)
aristas = aristas.reset_index()
aristas.drop(["user_type","travel_time","idunplug_station","ageRange","idplug_station","district_origin","neighbourhood_origin","origin","district_dest","neighbourhood_dest","destination","date","hour_time"], axis = 'columns', inplace = True)
aristas

Unnamed: 0.1,recorrido,Unnamed: 0
0,"ALBERTO ALCOCER, AVENIDA, DE!ALBERTO ALCOCER, ...",1224
1,"ALBERTO ALCOCER, AVENIDA, DE!ALCALA, CALLE, DE",457
2,"ALBERTO ALCOCER, AVENIDA, DE!ALCANTARA, CALLE, DE",59
3,"ALBERTO ALCOCER, AVENIDA, DE!ALFONSO XII, CALL...",283
4,"ALBERTO ALCOCER, AVENIDA, DE!ALMADEN, CALLE, DE",50
...,...,...
15112,"VISTILLAS, TRAVESIA, DE LAS!VALENCIA, CALLE, DE",139
15113,"VISTILLAS, TRAVESIA, DE LAS!VAZQUEZ DE MELLA, ...",144
15114,"VISTILLAS, TRAVESIA, DE LAS!VELAZQUEZ, CALLE, DE",122
15115,"VISTILLAS, TRAVESIA, DE LAS!VENTURA RODRIGUEZ,...",48


We split the two fields that we had previously joined.

In [87]:
aristas_final = aristas["recorrido"].str.split("!", expand=True)
aristas_final

Unnamed: 0,0,1
0,"ALBERTO ALCOCER, AVENIDA, DE","ALBERTO ALCOCER, AVENIDA, DE"
1,"ALBERTO ALCOCER, AVENIDA, DE","ALCALA, CALLE, DE"
2,"ALBERTO ALCOCER, AVENIDA, DE","ALCANTARA, CALLE, DE"
3,"ALBERTO ALCOCER, AVENIDA, DE","ALFONSO XII, CALLE, DE"
4,"ALBERTO ALCOCER, AVENIDA, DE","ALMADEN, CALLE, DE"
...,...,...
15112,"VISTILLAS, TRAVESIA, DE LAS","VALENCIA, CALLE, DE"
15113,"VISTILLAS, TRAVESIA, DE LAS","VAZQUEZ DE MELLA, PLAZA, DE"
15114,"VISTILLAS, TRAVESIA, DE LAS","VELAZQUEZ, CALLE, DE"
15115,"VISTILLAS, TRAVESIA, DE LAS","VENTURA RODRIGUEZ, CALLE, DE"
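Splitting on '!' only recovers the two original fields if no street name contains that character itself. As a minimal sketch, assuming a toy series of names (`names` and `separator` are hypothetical and not part of the notebook), the assumption could be checked like this:

```python
import pandas as pd

# Toy street names in the same format as the dataset (illustrative only).
names = pd.Series(["ALCALA, CALLE, DE", "SERRANO, CALLE, DE"])
separator = "!"

# regex=False makes pandas treat the separator as a literal character.
separator_is_safe = not names.str.contains(separator, regex=False).any()

# Join two names and split them again; a safe separator yields exactly two columns.
joined = pd.Series([names.iloc[0] + separator + names.iloc[1]])
parts = joined.str.split(separator, expand=True)
print(separator_is_safe, parts.shape)
```

If the check failed on the real data, another separator that does not occur in any name would be needed.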


We tidy the resulting dataset so that it ends up with an origin column, a destination column and the number of journeys between the two points.

In [88]:
aristas["origin"] = aristas_final[0]
aristas["destination"] = aristas_final[1]
aristas.rename(columns = {'Unnamed: 0':'number_of_trips'}, inplace = True)
aristas.drop(["recorrido"], axis = 'columns', inplace = True)
aristas

Unnamed: 0,number_of_trips,origin,destination
0,1224,"ALBERTO ALCOCER, AVENIDA, DE","ALBERTO ALCOCER, AVENIDA, DE"
1,457,"ALBERTO ALCOCER, AVENIDA, DE","ALCALA, CALLE, DE"
2,59,"ALBERTO ALCOCER, AVENIDA, DE","ALCANTARA, CALLE, DE"
3,283,"ALBERTO ALCOCER, AVENIDA, DE","ALFONSO XII, CALLE, DE"
4,50,"ALBERTO ALCOCER, AVENIDA, DE","ALMADEN, CALLE, DE"
...,...,...,...
15112,139,"VISTILLAS, TRAVESIA, DE LAS","VALENCIA, CALLE, DE"
15113,144,"VISTILLAS, TRAVESIA, DE LAS","VAZQUEZ DE MELLA, PLAZA, DE"
15114,122,"VISTILLAS, TRAVESIA, DE LAS","VELAZQUEZ, CALLE, DE"
15115,48,"VISTILLAS, TRAVESIA, DE LAS","VENTURA RODRIGUEZ, CALLE, DE"
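The join, count and split steps above could probably also be written as a single grouping on the two columns. The following is a sketch on a toy dataframe, not the notebook's own method: `trips` and `edges` are hypothetical names and the values are illustrative only.

```python
import pandas as pd

# Toy stand-in for the bicimad dataframe (values are illustrative only).
trips = pd.DataFrame({
    "origin": ["A, CALLE, DE", "A, CALLE, DE", "B, PLAZA, DE", "A, CALLE, DE"],
    "destination": ["B, PLAZA, DE", "B, PLAZA, DE", "A, CALLE, DE", "A, CALLE, DE"],
})

# Count the rows of each (origin, destination) pair in one step,
# without building and then splitting an intermediate joined column.
edges = (
    trips.groupby(["origin", "destination"])
    .size()
    .reset_index(name="number_of_trips")
)
print(edges)
```

With this variant the count column comes last rather than first, so the column order would differ from the table shown above.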


We export the file for use in the subsequent network visualization.

In [89]:
aristas.to_csv(r'../data/streets.csv')
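By default, `to_csv` also writes the dataframe index, which comes back as an extra `Unnamed: 0` column when the file is read again. The sketch below illustrates the difference with an in-memory buffer instead of a file. The two rows are taken from the table above, and whether the index is wanted in the exported file is a design choice this example does not settle.

```python
import io
import pandas as pd

# Two rows copied from the street table above.
edges = pd.DataFrame({
    "number_of_trips": [1224, 457],
    "origin": ["ALBERTO ALCOCER, AVENIDA, DE", "ALBERTO ALCOCER, AVENIDA, DE"],
    "destination": ["ALBERTO ALCOCER, AVENIDA, DE", "ALCALA, CALLE, DE"],
})

# to_csv() without a path returns the CSV text, so no file is written here.
with_index = pd.read_csv(io.StringIO(edges.to_csv()))
without_index = pd.read_csv(io.StringIO(edges.to_csv(index=False)))

print(list(with_index.columns))     # includes the extra index column
print(list(without_index.columns))  # only the three data columns
```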

## Journeys by neighbourhood

We repeat the same steps as before, this time working with neighbourhoods, in order to know how many journeys went from one neighbourhood to another.

In [90]:
bicimad["neighbourhood_trip"] = bicimad["neighbourhood_origin"] + "!" + bicimad["neighbourhood_dest"]
bicimad

Unnamed: 0.1,Unnamed: 0,user_type,travel_time,idunplug_station,ageRange,idplug_station,district_origin,neighbourhood_origin,origin,district_dest,neighbourhood_dest,destination,date,hour_time,recorrido,neighbourhood_trip
0,0,1,210,87,2,78,03 RETIRO,03-02 ADELFAS,"DOCTOR ESQUERDO, CALLE, DEL",03 RETIRO,03-06 NIÑO JESÚS,"JUAN DE URBIETA, CALLE, DE",2020-01-01,00:00:00,"DOCTOR ESQUERDO, CALLE, DEL!JUAN DE URBIETA, C...",03-02 ADELFAS!03-06 NIÑO JESÚS
1,1,1,330,87,0,78,03 RETIRO,03-02 ADELFAS,"DOCTOR ESQUERDO, CALLE, DEL",03 RETIRO,03-06 NIÑO JESÚS,"JUAN DE URBIETA, CALLE, DE",2020-01-02,09:00:00,"DOCTOR ESQUERDO, CALLE, DEL!JUAN DE URBIETA, C...",03-02 ADELFAS!03-06 NIÑO JESÚS
2,2,1,281,87,4,78,03 RETIRO,03-02 ADELFAS,"DOCTOR ESQUERDO, CALLE, DEL",03 RETIRO,03-06 NIÑO JESÚS,"JUAN DE URBIETA, CALLE, DE",2020-01-03,09:00:00,"DOCTOR ESQUERDO, CALLE, DEL!JUAN DE URBIETA, C...",03-02 ADELFAS!03-06 NIÑO JESÚS
3,3,1,270,87,0,78,03 RETIRO,03-02 ADELFAS,"DOCTOR ESQUERDO, CALLE, DEL",03 RETIRO,03-06 NIÑO JESÚS,"JUAN DE URBIETA, CALLE, DE",2020-01-03,13:00:00,"DOCTOR ESQUERDO, CALLE, DEL!JUAN DE URBIETA, C...",03-02 ADELFAS!03-06 NIÑO JESÚS
4,4,1,245,87,0,78,03 RETIRO,03-02 ADELFAS,"DOCTOR ESQUERDO, CALLE, DEL",03 RETIRO,03-06 NIÑO JESÚS,"JUAN DE URBIETA, CALLE, DE",2020-01-04,08:00:00,"DOCTOR ESQUERDO, CALLE, DEL!JUAN DE URBIETA, C...",03-02 ADELFAS!03-06 NIÑO JESÚS
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2039704,2056519,1,1464,32,0,109,01 CENTRO,01-03 CORTES,"ALCALA, CALLE, DE",04 SALAMANCA,04-01 RECOLETOS,"SERRANO, CALLE, DE",2020-10-03,20:00:00,"ALCALA, CALLE, DE!SERRANO, CALLE, DE",01-03 CORTES!04-01 RECOLETOS
2039705,2056520,1,377,32,3,109,01 CENTRO,01-03 CORTES,"ALCALA, CALLE, DE",04 SALAMANCA,04-01 RECOLETOS,"SERRANO, CALLE, DE",2020-10-15,22:00:00,"ALCALA, CALLE, DE!SERRANO, CALLE, DE",01-03 CORTES!04-01 RECOLETOS
2039706,2056521,1,489,32,0,109,01 CENTRO,01-03 CORTES,"ALCALA, CALLE, DE",04 SALAMANCA,04-01 RECOLETOS,"SERRANO, CALLE, DE",2020-10-19,06:00:00,"ALCALA, CALLE, DE!SERRANO, CALLE, DE",01-03 CORTES!04-01 RECOLETOS
2039707,2056522,1,396,32,0,109,01 CENTRO,01-03 CORTES,"ALCALA, CALLE, DE",04 SALAMANCA,04-01 RECOLETOS,"SERRANO, CALLE, DE",2020-10-19,07:00:00,"ALCALA, CALLE, DE!SERRANO, CALLE, DE",01-03 CORTES!04-01 RECOLETOS


In [91]:
bicimad["neighbourhood_trip"].value_counts()

01-02 EMBAJADORES!01-02 EMBAJADORES    31769
01-01 PALACIO!01-01 PALACIO            21691
01-05 UNIVERSIDAD!01-05 UNIVERSIDAD    19153
01-02 EMBAJADORES!01-01 PALACIO        15148
04-01 RECOLETOS!04-01 RECOLETOS        15062
                                       ...  
05-05 NUEVA ESPAÑA!02-03 CHOPERA          82
06-03 CASTILLEJOS!03-01 PACÍFICO          81
03-02 ADELFAS!06-03 CASTILLEJOS           80
03-03 ESTRELLA!06-03 CASTILLEJOS          79
03-01 PACÍFICO!06-03 CASTILLEJOS          70
Name: neighbourhood_trip, Length: 1089, dtype: int64

In [92]:
neighbourhood_trips = bicimad.groupby("neighbourhood_trip")
neighbourhood_trips = neighbourhood_trips.count()
neighbourhood_aristas = pd.DataFrame(data=neighbourhood_trips)
neighbourhood_aristas = neighbourhood_aristas.reset_index()
neighbourhood_aristas.drop(["user_type","travel_time","idunplug_station","ageRange","idplug_station","district_origin","neighbourhood_origin","origin","district_dest","neighbourhood_dest","destination","date","hour_time"], axis = 'columns', inplace = True)
neighbourhood_aristas

Unnamed: 0.1,neighbourhood_trip,Unnamed: 0,recorrido
0,01-01 PALACIO!01-01 PALACIO,21691,21691
1,01-01 PALACIO!01-02 EMBAJADORES,12886,12886
2,01-01 PALACIO!01-03 CORTES,5491,5491
3,01-01 PALACIO!01-04 JUSTICIA,7948,7948
4,01-01 PALACIO!01-05 UNIVERSIDAD,14494,14494
...,...,...,...
1084,09-02 ARGÜELLES!07-03 TRAFALGAR,998,998
1085,09-02 ARGÜELLES!07-04 ALMAGRO,974,974
1086,09-02 ARGÜELLES!07-05 RÍOS ROSAS,1803,1803
1087,09-02 ARGÜELLES!09-01 CASA DE CAMPO,566,566


In [93]:
neighbourhood_aristas_final = neighbourhood_aristas["neighbourhood_trip"].str.split("!", expand=True)
neighbourhood_aristas_final

Unnamed: 0,0,1
0,01-01 PALACIO,01-01 PALACIO
1,01-01 PALACIO,01-02 EMBAJADORES
2,01-01 PALACIO,01-03 CORTES
3,01-01 PALACIO,01-04 JUSTICIA
4,01-01 PALACIO,01-05 UNIVERSIDAD
...,...,...
1084,09-02 ARGÜELLES,07-03 TRAFALGAR
1085,09-02 ARGÜELLES,07-04 ALMAGRO
1086,09-02 ARGÜELLES,07-05 RÍOS ROSAS
1087,09-02 ARGÜELLES,09-01 CASA DE CAMPO


In [94]:
neighbourhood_aristas["neighbourhood_origin"] = neighbourhood_aristas_final[0]
neighbourhood_aristas["neighbourhood_destination"] = neighbourhood_aristas_final[1]
neighbourhood_aristas.rename(columns = {'Unnamed: 0':'number_of_trips'}, inplace = True)
neighbourhood_aristas.drop(["recorrido", "neighbourhood_trip"], axis = 'columns', inplace = True)
neighbourhood_aristas

Unnamed: 0,number_of_trips,neighbourhood_origin,neighbourhood_destination
0,21691,01-01 PALACIO,01-01 PALACIO
1,12886,01-01 PALACIO,01-02 EMBAJADORES
2,5491,01-01 PALACIO,01-03 CORTES
3,7948,01-01 PALACIO,01-04 JUSTICIA
4,14494,01-01 PALACIO,01-05 UNIVERSIDAD
...,...,...,...
1084,998,09-02 ARGÜELLES,07-03 TRAFALGAR
1085,974,09-02 ARGÜELLES,07-04 ALMAGRO
1086,1803,09-02 ARGÜELLES,07-05 RÍOS ROSAS
1087,566,09-02 ARGÜELLES,09-01 CASA DE CAMPO


In [95]:
neighbourhood_aristas.to_csv(r'../data/neighbourhoods.csv')

## Journeys by district
We repeat the steps described above one last time, now to obtain the number of journeys between the different districts of the city of Madrid.

In [96]:
bicimad["district_trip"] = bicimad["district_origin"] + "!" + bicimad["district_dest"]
bicimad

Unnamed: 0.1,Unnamed: 0,user_type,travel_time,idunplug_station,ageRange,idplug_station,district_origin,neighbourhood_origin,origin,district_dest,neighbourhood_dest,destination,date,hour_time,recorrido,neighbourhood_trip,district_trip
0,0,1,210,87,2,78,03 RETIRO,03-02 ADELFAS,"DOCTOR ESQUERDO, CALLE, DEL",03 RETIRO,03-06 NIÑO JESÚS,"JUAN DE URBIETA, CALLE, DE",2020-01-01,00:00:00,"DOCTOR ESQUERDO, CALLE, DEL!JUAN DE URBIETA, C...",03-02 ADELFAS!03-06 NIÑO JESÚS,03 RETIRO!03 RETIRO
1,1,1,330,87,0,78,03 RETIRO,03-02 ADELFAS,"DOCTOR ESQUERDO, CALLE, DEL",03 RETIRO,03-06 NIÑO JESÚS,"JUAN DE URBIETA, CALLE, DE",2020-01-02,09:00:00,"DOCTOR ESQUERDO, CALLE, DEL!JUAN DE URBIETA, C...",03-02 ADELFAS!03-06 NIÑO JESÚS,03 RETIRO!03 RETIRO
2,2,1,281,87,4,78,03 RETIRO,03-02 ADELFAS,"DOCTOR ESQUERDO, CALLE, DEL",03 RETIRO,03-06 NIÑO JESÚS,"JUAN DE URBIETA, CALLE, DE",2020-01-03,09:00:00,"DOCTOR ESQUERDO, CALLE, DEL!JUAN DE URBIETA, C...",03-02 ADELFAS!03-06 NIÑO JESÚS,03 RETIRO!03 RETIRO
3,3,1,270,87,0,78,03 RETIRO,03-02 ADELFAS,"DOCTOR ESQUERDO, CALLE, DEL",03 RETIRO,03-06 NIÑO JESÚS,"JUAN DE URBIETA, CALLE, DE",2020-01-03,13:00:00,"DOCTOR ESQUERDO, CALLE, DEL!JUAN DE URBIETA, C...",03-02 ADELFAS!03-06 NIÑO JESÚS,03 RETIRO!03 RETIRO
4,4,1,245,87,0,78,03 RETIRO,03-02 ADELFAS,"DOCTOR ESQUERDO, CALLE, DEL",03 RETIRO,03-06 NIÑO JESÚS,"JUAN DE URBIETA, CALLE, DE",2020-01-04,08:00:00,"DOCTOR ESQUERDO, CALLE, DEL!JUAN DE URBIETA, C...",03-02 ADELFAS!03-06 NIÑO JESÚS,03 RETIRO!03 RETIRO
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2039704,2056519,1,1464,32,0,109,01 CENTRO,01-03 CORTES,"ALCALA, CALLE, DE",04 SALAMANCA,04-01 RECOLETOS,"SERRANO, CALLE, DE",2020-10-03,20:00:00,"ALCALA, CALLE, DE!SERRANO, CALLE, DE",01-03 CORTES!04-01 RECOLETOS,01 CENTRO!04 SALAMANCA
2039705,2056520,1,377,32,3,109,01 CENTRO,01-03 CORTES,"ALCALA, CALLE, DE",04 SALAMANCA,04-01 RECOLETOS,"SERRANO, CALLE, DE",2020-10-15,22:00:00,"ALCALA, CALLE, DE!SERRANO, CALLE, DE",01-03 CORTES!04-01 RECOLETOS,01 CENTRO!04 SALAMANCA
2039706,2056521,1,489,32,0,109,01 CENTRO,01-03 CORTES,"ALCALA, CALLE, DE",04 SALAMANCA,04-01 RECOLETOS,"SERRANO, CALLE, DE",2020-10-19,06:00:00,"ALCALA, CALLE, DE!SERRANO, CALLE, DE",01-03 CORTES!04-01 RECOLETOS,01 CENTRO!04 SALAMANCA
2039707,2056522,1,396,32,0,109,01 CENTRO,01-03 CORTES,"ALCALA, CALLE, DE",04 SALAMANCA,04-01 RECOLETOS,"SERRANO, CALLE, DE",2020-10-19,07:00:00,"ALCALA, CALLE, DE!SERRANO, CALLE, DE",01-03 CORTES!04-01 RECOLETOS,01 CENTRO!04 SALAMANCA


In [97]:
bicimad["district_trip"].value_counts()

01  CENTRO!01  CENTRO                343250
07  CHAMBERÍ!01  CENTRO               92803
01  CENTRO!07  CHAMBERÍ               92159
04  SALAMANCA!01  CENTRO              83268
01  CENTRO!04  SALAMANCA              81754
                                      ...  
09  MONCLOA-ARAVACA!05  CHAMARTÍN      6769
06  TETUÁN!03  RETIRO                  5669
03  RETIRO!06  TETUÁN                  5318
06  TETUÁN!09  MONCLOA-ARAVACA         4627
09  MONCLOA-ARAVACA!06  TETUÁN         4440
Name: district_trip, Length: 64, dtype: int64

In [98]:
district_trips = bicimad.groupby("district_trip")
district_trips = district_trips.count()
district_aristas = pd.DataFrame(data=district_trips)
district_aristas = district_aristas.reset_index()
district_aristas

Unnamed: 0.1,district_trip,Unnamed: 0,user_type,travel_time,idunplug_station,ageRange,idplug_station,district_origin,neighbourhood_origin,origin,district_dest,neighbourhood_dest,destination,date,hour_time,recorrido,neighbourhood_trip
0,01 CENTRO!01 CENTRO,343250,343250,343250,343250,343250,343250,343250,343250,343250,343250,343250,343250,343250,343250,343250,343250
1,01 CENTRO!02 ARGANZUELA,61156,61156,61156,61156,61156,61156,61156,61156,61156,61156,61156,61156,61156,61156,61156,61156
2,01 CENTRO!03 RETIRO,76520,76520,76520,76520,76520,76520,76520,76520,76520,76520,76520,76520,76520,76520,76520,76520
3,01 CENTRO!04 SALAMANCA,81754,81754,81754,81754,81754,81754,81754,81754,81754,81754,81754,81754,81754,81754,81754,81754
4,01 CENTRO!05 CHAMARTÍN,37976,37976,37976,37976,37976,37976,37976,37976,37976,37976,37976,37976,37976,37976,37976,37976
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59,09 MONCLOA-ARAVACA!04 SALAMANCA,10685,10685,10685,10685,10685,10685,10685,10685,10685,10685,10685,10685,10685,10685,10685,10685
60,09 MONCLOA-ARAVACA!05 CHAMARTÍN,6769,6769,6769,6769,6769,6769,6769,6769,6769,6769,6769,6769,6769,6769,6769,6769
61,09 MONCLOA-ARAVACA!06 TETUÁN,4440,4440,4440,4440,4440,4440,4440,4440,4440,4440,4440,4440,4440,4440,4440,4440
62,09 MONCLOA-ARAVACA!07 CHAMBERÍ,10104,10104,10104,10104,10104,10104,10104,10104,10104,10104,10104,10104,10104,10104,10104,10104


In [99]:
district_aristas.drop(["user_type","travel_time","idunplug_station","ageRange","idplug_station","district_origin","neighbourhood_origin","origin","district_dest","neighbourhood_dest","destination","date","hour_time", "neighbourhood_trip","recorrido"], axis = 'columns', inplace = True)

In [100]:
district_aristas_final = district_aristas["district_trip"].str.split("!", expand=True)
district_aristas_final

Unnamed: 0,0,1
0,01 CENTRO,01 CENTRO
1,01 CENTRO,02 ARGANZUELA
2,01 CENTRO,03 RETIRO
3,01 CENTRO,04 SALAMANCA
4,01 CENTRO,05 CHAMARTÍN
...,...,...
59,09 MONCLOA-ARAVACA,04 SALAMANCA
60,09 MONCLOA-ARAVACA,05 CHAMARTÍN
61,09 MONCLOA-ARAVACA,06 TETUÁN
62,09 MONCLOA-ARAVACA,07 CHAMBERÍ


In [101]:
district_aristas["district_origin"] = district_aristas_final[0]
district_aristas["district_destination"] = district_aristas_final[1]
district_aristas.rename(columns = {'Unnamed: 0':'number_of_trips'}, inplace = True)
district_aristas.drop(['district_trip'], axis = 'columns', inplace = True)
district_aristas

Unnamed: 0,number_of_trips,district_origin,district_destination
0,343250,01 CENTRO,01 CENTRO
1,61156,01 CENTRO,02 ARGANZUELA
2,76520,01 CENTRO,03 RETIRO
3,81754,01 CENTRO,04 SALAMANCA
4,37976,01 CENTRO,05 CHAMARTÍN
...,...,...,...
59,10685,09 MONCLOA-ARAVACA,04 SALAMANCA
60,6769,09 MONCLOA-ARAVACA,05 CHAMARTÍN
61,4440,09 MONCLOA-ARAVACA,06 TETUÁN
62,10104,09 MONCLOA-ARAVACA,07 CHAMBERÍ
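A quick sanity check for any of the three aggregations is that the counts add up to the number of journeys with a non-missing key. The sketch below shows the idea on a toy dataframe (`trips` and its values are hypothetical); the same comparison could be run on `bicimad` and the aggregated dataframe.

```python
import pandas as pd

# Toy journeys between two districts (illustrative values only).
trips = pd.DataFrame({
    "district_origin": ["01 CENTRO", "01 CENTRO", "01 CENTRO", "04 SALAMANCA"],
    "district_dest": ["01 CENTRO", "04 SALAMANCA", "04 SALAMANCA", "01 CENTRO"],
})
trips["district_trip"] = trips["district_origin"] + "!" + trips["district_dest"]

# Number of journeys per origin/destination pair.
counts = trips["district_trip"].value_counts()

# Every journey with a non-missing key should be counted exactly once.
total_counted = int(counts.sum())
total_rows = int(trips["district_trip"].notna().sum())
assert total_counted == total_rows
```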


In [102]:
district_aristas.to_csv(r'../data/districts.csv')

The next step will be to visualize the graphs using both the original data and these newly generated datasets.