## Exploring the public API of [Open Bike Share Data](https://bikeshare-research.org/)

In this exercise we explore data using the BikeShare API.

What is Open Bike Share Data?

Research on bike share systems (BSS), beyond the analysis of individual systems, requires extensive data collection and analysis. Bike Share Research (BSR) aims to facilitate the curation of BSS data through a collaborative, open data platform, while making that data accessible through an API.

In the following section we explore the data available through the API.

In [124]:
import requests
import json
import pandas as pd

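def parse_response(response):
    # Decode the JSON body of a requests.Response into Python objects
    return json.loads(response.text)

def get_feed_from_list(feed_name, feed_list):
    # Return the first feed whose "name" matches feed_name, or None if absent
    for feed in feed_list:
        if feed["name"] == feed_name:
            return feed

    return None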
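
`get_feed_from_list` is defined above but not called in this notebook. As a minimal sketch of how it behaves, the snippet below runs it on a hand-written feed list. The feed names and URLs are invented placeholders, not real API output.

```python
def get_feed_from_list(feed_name, feed_list):
    # Return the first feed whose "name" matches feed_name, or None if absent
    for feed in feed_list:
        if feed["name"] == feed_name:
            return feed
    return None

# Hypothetical feed list for illustration only (made-up URLs)
sample_feeds = [
    {"name": "station_status", "url": "https://example.org/station_status"},
    {"name": "station_information", "url": "https://example.org/station_information"},
]

found = get_feed_from_list("station_information", sample_feeds)
missing = get_feed_from_list("free_bike_status", sample_feeds)
print(found["url"])
print(missing)
```

Returning `None` for a missing feed means callers should check the result before indexing into it.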

## Analyzing MiBici bike stations in the city of Guadalajara

Next we use the BikeShare API to explore what data is available for MiBici, the public bike share system of the Guadalajara Metropolitan Area.

This endpoint reports the status of each station: for example, the number of bikes available, the number of docks available, and the number of disabled bikes.

In [125]:
station_status_url = 'https://guadalajara.publicbikesystem.net/customer/ube/gbfs/v1/en/station_status'
station_status_response = parse_response(requests.get(station_status_url))

station_status_df = pd.DataFrame.from_dict(station_status_response['data']['stations'])
display(station_status_df.head())

Unnamed: 0,station_id,num_bikes_available,num_bikes_available_types,num_bikes_disabled,num_docks_available,num_docks_disabled,last_reported,is_charging_station,status,is_installed,is_renting,is_returning,traffic
0,2,11,"{'mechanical': 11, 'ebike': 0}",1,3,0,1693261464,False,IN_SERVICE,1,1,1,
1,3,2,"{'mechanical': 2, 'ebike': 0}",2,11,0,1693261574,False,IN_SERVICE,1,1,1,
2,4,7,"{'mechanical': 7, 'ebike': 0}",0,12,0,1693261650,False,IN_SERVICE,1,1,1,
3,5,6,"{'mechanical': 6, 'ebike': 0}",0,5,0,1693261593,False,IN_SERVICE,1,1,1,
4,6,8,"{'mechanical': 8, 'ebike': 0}",0,3,0,1693261489,False,IN_SERVICE,1,1,1,
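
Two of the columns above need a little work before analysis: `num_bikes_available_types` holds a dict per row, and `last_reported` looks like a Unix timestamp in seconds (an assumption based on the values shown). A minimal sketch on two rows copied from the output above:

```python
import pandas as pd

# Two rows copied from the station_status output above
sample_status = pd.DataFrame([
    {"station_id": 2, "num_bikes_available": 11,
     "num_bikes_available_types": {"mechanical": 11, "ebike": 0},
     "num_docks_available": 3, "last_reported": 1693261464},
    {"station_id": 3, "num_bikes_available": 2,
     "num_bikes_available_types": {"mechanical": 2, "ebike": 0},
     "num_docks_available": 11, "last_reported": 1693261574},
])

# Expand the nested bike-type dict into one column per type
bike_types = pd.json_normalize(sample_status["num_bikes_available_types"].tolist())
status_flat = pd.concat(
    [sample_status.drop(columns="num_bikes_available_types"), bike_types],
    axis=1,
)

# Assumption: last_reported is a Unix epoch in seconds; convert it to a timestamp
status_flat["last_reported"] = pd.to_datetime(status_flat["last_reported"], unit="s", utc=True)
print(status_flat)
```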


We can also get more detailed information about each station from the following API endpoint. These station attributes can complement the project dataset and help us discover features that are useful for the project.

In [126]:
station_info_url = 'https://guadalajara.publicbikesystem.net/customer/ube/gbfs/v1/en/station_information'
station_info_response = parse_response(requests.get(station_info_url))

station_info_df = pd.DataFrame.from_dict(station_info_response['data']['stations'])
display(station_info_df.head())

Unnamed: 0,station_id,name,physical_configuration,lat,lon,altitude,address,post_code,capacity,is_charging_station,rental_methods,groups,obcn,nearby_distance,_ride_code_support,rental_uris
0,2,(GDL-001) C. Epigmenio Glez./ Av. 16 de Sept.,REGULAR,20.666378,-103.34882,0.0,(GDL-001) C. Epigmenio González / Av. 16 de Se...,44180,15,False,"[KEY, TRANSITCARD, CREDITCARD, PHONE]",[],GDL-001,500.0,True,{}
1,3,(GDL-002) C. Colonias / Av. Niños héroes,REGULAR,20.667228,-103.366,1.0,(GDL-002) C. Colonias / Av. Niños Héroes,44160,15,False,"[KEY, TRANSITCARD, CREDITCARD, PHONE]",[],GDL-002,500.0,True,{}
2,4,(GDL-003) C. Vidrio / Av. Chapultepec,REGULAR,20.66769,-103.368252,1.0,(GDL-003) C. Vidrio / Av. Chapultepec,44160,19,False,"[KEY, TRANSITCARD, CREDITCARD, PHONE]",[],GDL-003,500.0,True,{}
3,5,(GDL-004) C. Ghilardi /C. Miraflores,REGULAR,20.69175,-103.36255,0.0,(GDL-004) C. Ghilardi / C. Miraflores *,44600,11,False,"[KEY, TRANSITCARD, CREDITCARD, PHONE]",[],GDL-004,500.0,True,{}
4,6,(GDL-005) C. San Diego /Calzada Independencia,REGULAR,20.681158,-103.339363,0.0,(GDL-005) C. San Diego / Calz. Independencia *,44280,11,False,"[KEY, TRANSITCARD, CREDITCARD, PHONE]",[],GDL-005,500.0,True,{}
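
Both endpoints share the `station_id` key, so the live status can be joined to the static station information. The sketch below uses three rows copied from the two outputs above. The `fill_ratio` column is a name introduced here for illustration and is not part of the API.

```python
import pandas as pd

# Rows copied from the two outputs above (station_id 2, 3 and 4)
status = pd.DataFrame({
    "station_id": [2, 3, 4],
    "num_bikes_available": [11, 2, 7],
    "num_docks_available": [3, 11, 12],
})
info = pd.DataFrame({
    "station_id": [2, 3, 4],
    "obcn": ["GDL-001", "GDL-002", "GDL-003"],
    "capacity": [15, 15, 19],
})

# Join live status with static information on the shared station_id key
stations = status.merge(info, on="station_id", how="inner")

# Share of each station's capacity currently occupied by available bikes
stations["fill_ratio"] = (stations["num_bikes_available"] / stations["capacity"]).round(2)
print(stations)
```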


# Data Wrangling

### Getting MiBici trip data for the latest 3 months of 2023 (May to July 2023)

Next we fetch the public data on bike trips taken by MiBici users in Guadalajara, Mexico. In addition, we fetch the station information.

### 1. Get the MiBici trip data for May, June, and July 2023

In [139]:
from pandas import DataFrame
from typing import List
from io import StringIO
from enum import Enum

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:55.0) Gecko/20100101 Firefox/55.0',
}
url_base = 'https://www.mibici.net/site/assets/files/'
datasets_to_download = [
    '46759/datos_abiertos_2023_05.csv',
    '48116/datos_abiertos_2023_06.csv',
    '49448/datos_abiertos_2023_07.csv'
]
raw_dataframes: List[DataFrame] = []
    
class DataSethMonth(Enum):
    MAY = 0
    JUNE = 1
    JULY = 2

# Download each monthly CSV and parse it into its own DataFrame
for dataset_url in datasets_to_download:
    response = requests.get(f"{url_base}{dataset_url}", headers=headers)
    data = StringIO(response.text)
    temp_df = pd.read_csv(data)
    raw_dataframes.append(temp_df)

#### May 2023 data

In [141]:
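# The May file arrives with a garbled header for the birth-year column:
# copy it to the correctly spelled name, then drop the garbled one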
raw_dataframes[DataSethMonth.MAY.value]['Año_de_nacimiento'] = raw_dataframes[DataSethMonth.MAY.value]['A}äe_nacimiento']
del raw_dataframes[DataSethMonth.MAY.value]['A}äe_nacimiento']

display(raw_dataframes[DataSethMonth.MAY.value].head())
display(raw_dataframes[DataSethMonth.MAY.value].describe())
print(raw_dataframes[DataSethMonth.MAY.value].info())

Unnamed: 0,Viaje_Id,Usuario_Id,Genero,Inicio_del_viaje,Fin_del_viaje,Origen_Id,Destino_Id,Año_de_nacimiento
0,28467098,70123,M,2023-05-01 00:00:03,2023-05-01 00:22:19,64,141,1967.0
1,28467099,2237235,M,2023-05-01 00:00:23,2023-05-01 00:04:26,36,172,1980.0
2,28467100,2051727,F,2023-05-01 00:01:05,2023-05-01 00:10:21,96,296,2002.0
3,28467101,2246225,M,2023-05-01 00:01:07,2023-05-01 00:04:14,33,255,1969.0
4,28467102,324247,M,2023-05-01 00:01:26,2023-05-01 00:13:18,226,231,1975.0


Unnamed: 0,Viaje_Id,Usuario_Id,Origen_Id,Destino_Id,Año_de_nacimiento
count,364158.0,364158.0,364158.0,364158.0,363601.0
mean,28683190.0,1132535.0,138.012876,137.981475,1989.064026
std,124492.0,754204.1,96.712673,98.347428,10.673858
min,28467100.0,102.0,2.0,2.0,1920.0
25%,28574350.0,442464.0,51.0,51.0,1984.0
50%,28684160.0,1140731.0,132.0,120.0,1992.0
75%,28791050.0,1737235.0,224.0,232.0,1997.0
max,28897250.0,2371510.0,327.0,327.0,2022.0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 364158 entries, 0 to 364157
Data columns (total 8 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   Viaje_Id           364158 non-null  int64  
 1   Usuario_Id         364158 non-null  int64  
 2   Genero             362472 non-null  object 
 3   Inicio_del_viaje   364158 non-null  object 
 4   Fin_del_viaje      364158 non-null  object 
 5   Origen_Id          364158 non-null  int64  
 6   Destino_Id         364158 non-null  int64  
 7   Año_de_nacimiento  363601 non-null  float64
dtypes: float64(1), int64(4), object(3)
memory usage: 22.2+ MB
None
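
A garbled header like the one renamed above is typical of text decoded with the wrong character encoding. The sketch below reproduces the effect on a made-up two-column CSV. The Latin-1 encoding is an assumption for illustration; the actual encoding of the May file is not established here.

```python
from io import StringIO
import pandas as pd

# Hypothetical payload: a tiny CSV whose header contains "ñ", encoded as Latin-1
raw_bytes = "Viaje_Id,Año_de_nacimiento\n28467098,1967\n".encode("latin-1")

# Decoding with a codec that does not match the file mangles the header
wrong_df = pd.read_csv(StringIO(raw_bytes.decode("utf-8", errors="replace")))
# Decoding with the matching codec keeps the column name intact
right_df = pd.read_csv(StringIO(raw_bytes.decode("latin-1")))

print(list(wrong_df.columns))
print(list(right_df.columns))
```

With `requests`, the raw bytes are available as `response.content`, so the same idea could be applied by decoding them explicitly instead of relying on `response.text`.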


#### June 2023 data

In [142]:
display(raw_dataframes[DataSethMonth.JUNE.value].head())
display(raw_dataframes[DataSethMonth.JUNE.value].describe())
print(raw_dataframes[DataSethMonth.JUNE.value].info())

Unnamed: 0,Viaje_Id,Usuario_Id,Genero,Año_de_nacimiento,Inicio_del_viaje,Fin_del_viaje,Origen_Id,Destino_Id
0,28897410,1521317,M,1995.0,2023-06-01 00:00:40,2023-06-01 00:09:26,20,255
1,28897411,1712554,M,1993.0,2023-06-01 00:00:46,2023-06-01 00:13:33,20,12
2,28897412,2105703,M,2000.0,2023-06-01 00:00:55,2023-06-01 00:01:10,273,273
3,28897413,2105706,F,2001.0,2023-06-01 00:00:59,2023-06-01 00:12:41,273,246
4,28897414,2105703,M,2000.0,2023-06-01 00:01:20,2023-06-01 00:12:33,273,246


Unnamed: 0,Viaje_Id,Usuario_Id,Año_de_nacimiento,Origen_Id,Destino_Id
count,339628.0,339628.0,339131.0,339628.0,339628.0
mean,29094850.0,1155184.0,1989.029136,136.610253,136.531202
std,113192.6,779054.1,10.515967,96.304264,97.996399
min,28897410.0,102.0,1920.0,2.0,2.0
25%,28997200.0,437522.0,1984.0,51.0,51.0
50%,29095260.0,1145763.0,1992.0,127.0,110.0
75%,29193110.0,1776062.0,1997.0,214.0,229.0
max,29290180.0,2423462.0,2007.0,327.0,327.0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 339628 entries, 0 to 339627
Data columns (total 8 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   Viaje_Id           339628 non-null  int64  
 1   Usuario_Id         339628 non-null  int64  
 2   Genero             337989 non-null  object 
 3   Año_de_nacimiento  339131 non-null  float64
 4   Inicio_del_viaje   339628 non-null  object 
 5   Fin_del_viaje      339628 non-null  object 
 6   Origen_Id          339628 non-null  int64  
 7   Destino_Id         339628 non-null  int64  
dtypes: float64(1), int64(4), object(3)
memory usage: 20.7+ MB
None


#### July 2023 data

In [143]:
display(raw_dataframes[DataSethMonth.JULY.value].head())
display(raw_dataframes[DataSethMonth.JULY.value].describe())
print(raw_dataframes[DataSethMonth.JULY.value].info())


### 2. Next we combine all the datasets obtained from MiBici's public data

In [144]:
raw_mibici_df = pd.concat(raw_dataframes)
display(raw_mibici_df.head())
display(raw_mibici_df.describe())
display(raw_mibici_df.info())

Unnamed: 0,Viaje_Id,Usuario_Id,Genero,Inicio_del_viaje,Fin_del_viaje,Origen_Id,Destino_Id,Año_de_nacimiento
0,28467098,70123,M,2023-05-01 00:00:03,2023-05-01 00:22:19,64,141,1967.0
1,28467099,2237235,M,2023-05-01 00:00:23,2023-05-01 00:04:26,36,172,1980.0
2,28467100,2051727,F,2023-05-01 00:01:05,2023-05-01 00:10:21,96,296,2002.0
3,28467101,2246225,M,2023-05-01 00:01:07,2023-05-01 00:04:14,33,255,1969.0
4,28467102,324247,M,2023-05-01 00:01:26,2023-05-01 00:13:18,226,231,1975.0


Unnamed: 0,Viaje_Id,Usuario_Id,Origen_Id,Destino_Id,Año_de_nacimiento
count,1035573.0,1035573.0,1035573.0,1035573.0,1034005.0
mean,29073720.0,1154088.0,136.8167,136.6023,1989.024
std,346975.5,778358.9,96.48719,98.07246,10.60156
min,28467100.0,102.0,2.0,2.0,1920.0
25%,28774080.0,440363.0,51.0,51.0,1984.0
50%,29076750.0,1145524.0,128.0,111.0,1992.0
75%,29374360.0,1774264.0,216.0,229.0,1997.0
max,29671920.0,2476513.0,327.0,327.0,2022.0


<class 'pandas.core.frame.DataFrame'>
Int64Index: 1035573 entries, 0 to 331786
Data columns (total 8 columns):
 #   Column             Non-Null Count    Dtype  
---  ------             --------------    -----  
 0   Viaje_Id           1035573 non-null  int64  
 1   Usuario_Id         1035573 non-null  int64  
 2   Genero             1030612 non-null  object 
 3   Inicio_del_viaje   1035573 non-null  object 
 4   Fin_del_viaje      1035573 non-null  object 
 5   Origen_Id          1035573 non-null  int64  
 6   Destino_Id         1035573 non-null  int64  
 7   Año_de_nacimiento  1034005 non-null  float64
dtypes: float64(1), int64(4), object(3)
memory usage: 71.1+ MB


None
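
The index above runs from 0 to 331786 for 1,035,573 rows because `pd.concat` keeps each monthly frame's own index, so labels repeat across months. If unique row labels are wanted, one option is `ignore_index=True`, sketched here on two toy frames (the values are placeholders):

```python
import pandas as pd

may = pd.DataFrame({"Viaje_Id": [1, 2]})
june = pd.DataFrame({"Viaje_Id": [3, 4]})

# Default behaviour: each frame keeps its own 0-based index, so labels repeat
stacked = pd.concat([may, june])
# ignore_index=True rebuilds a single 0..n-1 index
renumbered = pd.concat([may, june], ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```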

### 3. Next we begin the data transformation so the dataset can be used. We will:
* rename the columns
* add new columns computed from the existing information (enrichment)
* reorder the columns to make them easier to read

In [146]:
def rename_columns(df: DataFrame) -> DataFrame:
    temp_df = df.copy()
    existing_columns = ['Viaje_Id', 'Usuario_Id', 'Genero', 'Año_de_nacimiento', 'Inicio_del_viaje', 'Fin_del_viaje', 'Origen_Id', 'Destino_Id']
    new_columns = ['trip_id', 'user_id', 'gender', 'birthday', 'start_trip', 'end_trip', 'origin_id', 'destination_id']
    
    for idx, new_column in enumerate(new_columns):
        temp_df[new_column] = temp_df[existing_columns[idx]]
        del temp_df[existing_columns[idx]]
    
    return temp_df

def add_new_columns(df: DataFrame) -> DataFrame:
    from datetime import datetime

    temp_df = df.copy()
    current_year = datetime.now().year

    # approximate age: current year minus birth year ("birthday" holds only the year)
    temp_df["age"] = current_year - temp_df["birthday"]

    # parse the trip timestamps so durations can be computed
    temp_df['start_trip_date'] = pd.to_datetime(temp_df['start_trip'])
    temp_df['end_trip_date'] = pd.to_datetime(temp_df['end_trip'])

    # trip duration in hours (rounded to 2 decimals) and in minutes
    trip_duration = temp_df['end_trip_date'] - temp_df['start_trip_date']
    temp_df["trip_duration_hours"] = (trip_duration / pd.Timedelta(1, 'h')).round(2)
    temp_df["trip_duration_minutes"] = trip_duration / pd.Timedelta(1, 'm')
    temp_df["year"] = temp_df.start_trip_date.dt.year

    # drop the helper datetime columns
    del temp_df['start_trip_date']
    del temp_df['end_trip_date']

    return temp_df

def reorder_columns(df: DataFrame) -> DataFrame:
    temp_df = df.copy()
    columns_order = [
        'trip_id', 
        'user_id', 
        'gender', 
        'birthday', 
        'age',
        'start_trip', 
        'end_trip',
        'trip_duration_minutes',
        'trip_duration_hours',
        'year',
        'origin_id',
        'origin_name',
        'origin_obcn',
        'origin_location',
        'origin_latitude',
        'origin_longitude',
        'origin_status',
        'destination_id',
        'destination_name',
        'destination_obcn',
        'destination_location',
        'destination_latitude',
        'destination_longitude',
        'destination_status'
    ]
    
    return temp_df.reindex(columns=columns_order)

# prepare the data: rename, enrich, then reorder
mibici_df = rename_columns(raw_mibici_df)
mibici_df = add_new_columns(mibici_df)
mibici_df = reorder_columns(mibici_df)

display(mibici_df.head())
display(mibici_df.describe())
print(mibici_df.info())

Unnamed: 0,trip_id,user_id,gender,birthday,age,start_trip,end_trip,trip_duration_minutes,trip_duration_hours,year,origin_id,origin_name,origin_obcn,origin_location,origin_latitude,origin_longitude,origin_status,destination_id,destination_name,destination_obcn,destination_location,destination_latitude,destination_longitude,destination_status
0,28467098,70123,M,1967.0,56.0,2023-05-01 00:00:03,2023-05-01 00:22:19,22.266667,0.37,2023,64,,,,,,,141,,,,,,
1,28467099,2237235,M,1980.0,43.0,2023-05-01 00:00:23,2023-05-01 00:04:26,4.05,0.07,2023,36,,,,,,,172,,,,,,
2,28467100,2051727,F,2002.0,21.0,2023-05-01 00:01:05,2023-05-01 00:10:21,9.266667,0.15,2023,96,,,,,,,296,,,,,,
3,28467101,2246225,M,1969.0,54.0,2023-05-01 00:01:07,2023-05-01 00:04:14,3.116667,0.05,2023,33,,,,,,,255,,,,,,
4,28467102,324247,M,1975.0,48.0,2023-05-01 00:01:26,2023-05-01 00:13:18,11.866667,0.2,2023,226,,,,,,,231,,,,,,


Unnamed: 0,trip_id,user_id,birthday,age,trip_duration_minutes,trip_duration_hours,year,origin_id,origin_name,origin_obcn,origin_location,origin_latitude,origin_longitude,origin_status,destination_id,destination_name,destination_obcn,destination_location,destination_latitude,destination_longitude,destination_status
count,1035573.0,1035573.0,1034005.0,1034005.0,1035573.0,1035573.0,1035573.0,1035573.0,0.0,0.0,0.0,0.0,0.0,0.0,1035573.0,0.0,0.0,0.0,0.0,0.0,0.0
mean,29073720.0,1154088.0,1989.024,33.97595,10.81775,0.1803302,2023.0,136.8167,,,,,,,136.6023,,,,,,
std,346975.5,778358.9,10.60156,10.60156,10.56433,0.1760707,0.0,96.48719,,,,,,,98.07246,,,,,,
min,28467100.0,102.0,1920.0,1.0,0.0,0.0,2023.0,2.0,,,,,,,2.0,,,,,,
25%,28774080.0,440363.0,1984.0,26.0,5.55,0.09,2023.0,51.0,,,,,,,51.0,,,,,,
50%,29076750.0,1145524.0,1992.0,31.0,9.3,0.16,2023.0,128.0,,,,,,,111.0,,,,,,
75%,29374360.0,1774264.0,1997.0,39.0,14.55,0.24,2023.0,216.0,,,,,,,229.0,,,,,,
max,29671920.0,2476513.0,2022.0,103.0,2251.65,37.53,2023.0,327.0,,,,,,,327.0,,,,,,


<class 'pandas.core.frame.DataFrame'>
Int64Index: 1035573 entries, 0 to 331786
Data columns (total 24 columns):
 #   Column                 Non-Null Count    Dtype  
---  ------                 --------------    -----  
 0   trip_id                1035573 non-null  int64  
 1   user_id                1035573 non-null  int64  
 2   gender                 1030612 non-null  object 
 3   birthday               1034005 non-null  float64
 4   age                    1034005 non-null  float64
 5   start_trip             1035573 non-null  object 
 6   end_trip               1035573 non-null  object 
 7   trip_duration_minutes  1035573 non-null  float64
 8   trip_duration_hours    1035573 non-null  float64
 9   year                   1035573 non-null  int64  
 10  origin_id              1035573 non-null  int64  
 11  origin_name            0 non-null        float64
 12  origin_obcn            0 non-null        float64
 13  origin_location        0 non-null        float64
 14  origin_latitude    
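
The renaming and enrichment steps can also be sketched with `DataFrame.rename` and the `.dt` accessor. This is an alternative sketch on two trips copied from the May output above, not the notebook's own functions. The `age_at_trip` column is a name introduced here; it uses the trip year instead of the current year, so the result does not change with the run date.

```python
import pandas as pd

# Two trips copied from the May output above
trips = pd.DataFrame({
    "Viaje_Id": [28467098, 28467099],
    "Inicio_del_viaje": ["2023-05-01 00:00:03", "2023-05-01 00:00:23"],
    "Fin_del_viaje": ["2023-05-01 00:22:19", "2023-05-01 00:04:26"],
    "Año_de_nacimiento": [1967.0, 1980.0],
})

# rename() maps old names to new ones in a single call
trips = trips.rename(columns={
    "Viaje_Id": "trip_id",
    "Inicio_del_viaje": "start_trip",
    "Fin_del_viaje": "end_trip",
    "Año_de_nacimiento": "birthday",
})

start = pd.to_datetime(trips["start_trip"])
end = pd.to_datetime(trips["end_trip"])

# Duration in minutes, from the total seconds of each Timedelta
trips["trip_duration_minutes"] = (end - start).dt.total_seconds() / 60
# Age relative to the trip year rather than the current year
trips["age_at_trip"] = start.dt.year - trips["birthday"]
print(trips[["trip_id", "trip_duration_minutes", "age_at_trip"]])
```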

### 4. Next we fetch the MiBici station information from its public open-data files

In [147]:
stations_url = '1118/nomenclatura_2023_07.csv'
response = requests.get(f"{url_base}{stations_url}", headers=headers)
data = StringIO(response.text)
station_df = pd.read_csv(data)
display(station_df.head())
display(station_df.describe())
print(station_df.info())

Unnamed: 0,id,name,obcn,location,latitude,longitude,status
0,2,(GDL-001) C. Epigmenio Glez./ Av. 16 de Sept.,GDL-001,POLÍGONO CENTRAL,20.666378,-103.34882,IN_SERVICE
1,3,(GDL-002) C. Colonias / Av. Niños héroes,GDL-002,POLÍGONO CENTRAL,20.667228,-103.366,IN_SERVICE
2,4,(GDL-003) C. Vidrio / Av. Chapultepec,GDL-003,POLÍGONO CENTRAL,20.66769,-103.368252,IN_SERVICE
3,5,(GDL-004) C. Ghilardi /C. Miraflores,GDL-004,POLÍGONO CENTRAL,20.69175,-103.36255,IN_SERVICE
4,6,(GDL-005) C. San Diego /Calzada Independencia,GDL-005,POLÍGONO CENTRAL,20.681151,-103.338863,IN_SERVICE


Unnamed: 0,id,latitude,longitude
count,312.0,312.0,312.0
mean,163.330128,20.679264,-103.365016
std,93.845926,0.022943,0.026635
min,2.0,20.63613,-103.419421
25%,80.75,20.665107,-103.38458
50%,164.5,20.674973,-103.365685
75%,242.25,20.687425,-103.348492
max,327.0,20.73837,-103.301239


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 312 entries, 0 to 311
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   id         312 non-null    int64  
 1   name       312 non-null    object 
 2   obcn       312 non-null    object 
 3   location   312 non-null    object 
 4   latitude   312 non-null    float64
 5   longitude  312 non-null    float64
 6   status     312 non-null    object 
dtypes: float64(2), int64(1), object(4)
memory usage: 17.2+ KB
None
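
Before joining trips to stations, it can help to check that every station id used by the trips is present in the station table, since an inner merge drops unmatched rows. A minimal sketch on toy data, where the id 999 is a made-up unknown station:

```python
import pandas as pd

stations = pd.DataFrame({"id": [2, 3, 4], "obcn": ["GDL-001", "GDL-002", "GDL-003"]})
trips = pd.DataFrame({"origin_id": [2, 3, 999], "destination_id": [4, 2, 3]})

# Collect every station id referenced by the trips, as origin or destination
used_ids = pd.concat([trips["origin_id"], trips["destination_id"]]).unique()

# Ids that the station table does not know about
known_ids = set(stations["id"].tolist())
unknown_ids = sorted(int(i) for i in used_ids if int(i) not in known_ids)
print(unknown_ids)
```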


### 5. Now we use the station information to enrich our DataFrame of MiBici bike trips

In [148]:
def add_origin_destination_nomenclature(df: DataFrame, nomenclature: DataFrame):
    temp_df = df.copy()
    
    for left_on in ['origin', 'destination']:    
        temp_df = temp_df.merge(nomenclature, left_on=f"{left_on}_id", right_on='id', suffixes=(None, f"_{left_on}"))

        temp_df[f"{left_on}_name"] = temp_df['name']
        temp_df[f"{left_on}_obcn"] = temp_df['obcn']
        temp_df[f"{left_on}_location"] = temp_df['location']
        temp_df[f"{left_on}_latitude"] = temp_df['latitude']
        temp_df[f"{left_on}_longitude"] = temp_df['longitude']
        temp_df[f"{left_on}_status"] = temp_df['status']


        del temp_df['id']
        del temp_df['name']
        del temp_df['obcn']
        del temp_df['location']
        del temp_df['latitude']
        del temp_df['longitude']
        del temp_df['status']

    return temp_df

mibici_df = add_origin_destination_nomenclature(mibici_df, nomenclature=station_df)

display(mibici_df.head())
display(mibici_df.describe())
print(mibici_df.info())

Unnamed: 0,trip_id,user_id,gender,birthday,age,start_trip,end_trip,trip_duration_minutes,trip_duration_hours,year,origin_id,origin_name,origin_obcn,origin_location,origin_latitude,origin_longitude,origin_status,destination_id,destination_name,destination_obcn,destination_location,destination_latitude,destination_longitude,destination_status
0,28467098,70123,M,1967.0,56.0,2023-05-01 00:00:03,2023-05-01 00:22:19,22.266667,0.37,2023,64,(GDL-062) C. Libertad / C. Moscú,GDL-062,POLÍGONO CENTRAL,20.673072,-103.365055,IN_SERVICE,141,(ZPN-047) Av. Tizoc / Av. López Mateos,ZPN-047,POLÍGONO CENTRAL,20.65381,-103.40134,IN_SERVICE
1,28653697,1515654,F,1994.0,29.0,2023-05-14 10:57:09,2023-05-14 11:33:51,36.7,0.61,2023,50,(GDL-048) C. Constancio Hernández/ Av. Juaréz,GDL-048,POLÍGONO CENTRAL,20.674721,-103.358548,IN_SERVICE,141,(ZPN-047) Av. Tizoc / Av. López Mateos,ZPN-047,POLÍGONO CENTRAL,20.65381,-103.40134,IN_SERVICE
2,28930381,2221010,M,1999.0,24.0,2023-06-03 11:10:49,2023-06-03 11:35:36,24.783333,0.41,2023,50,(GDL-048) C. Constancio Hernández/ Av. Juaréz,GDL-048,POLÍGONO CENTRAL,20.674721,-103.358548,IN_SERVICE,141,(ZPN-047) Av. Tizoc / Av. López Mateos,ZPN-047,POLÍGONO CENTRAL,20.65381,-103.40134,IN_SERVICE
3,29554571,134164,M,1989.0,34.0,2023-07-21 17:03:15,2023-07-21 17:19:58,16.716667,0.28,2023,154,(GDL-087) Av. Américas / Av. López Mateos,GDL-087,POLÍGONO CENTRAL,20.69493,-103.37333,IN_SERVICE,141,(ZPN-047) Av. Tizoc / Av. López Mateos,ZPN-047,POLÍGONO CENTRAL,20.65381,-103.40134,IN_SERVICE
4,28654633,306600,M,1991.0,32.0,2023-05-14 12:04:18,2023-05-14 12:13:48,9.5,0.16,2023,131,(ZPN-037) Av. San Ignacio / Av. Guadalupe,ZPN-037,POLÍGONO CENTRAL,20.666914,-103.402512,IN_SERVICE,141,(ZPN-047) Av. Tizoc / Av. López Mateos,ZPN-047,POLÍGONO CENTRAL,20.65381,-103.40134,IN_SERVICE


Unnamed: 0,trip_id,user_id,birthday,age,trip_duration_minutes,trip_duration_hours,year,origin_id,origin_latitude,origin_longitude,destination_id,destination_latitude,destination_longitude
count,1035573.0,1035573.0,1034005.0,1034005.0,1035573.0,1035573.0,1035573.0,1035573.0,1035573.0,1035573.0,1035573.0,1035573.0,1035573.0
mean,29073720.0,1154088.0,1989.024,33.97595,10.81775,0.1803302,2023.0,136.8167,20.6785,-103.3613,136.6023,20.67891,-103.3594
std,346975.5,778358.9,10.60156,10.60156,10.56433,0.1760707,0.0,96.48719,0.01592944,0.01888493,98.07246,0.0158508,0.01804993
min,28467100.0,102.0,1920.0,1.0,0.0,0.0,2023.0,2.0,20.63613,-103.4194,2.0,20.63613,-103.4194
25%,28774080.0,440363.0,1984.0,26.0,5.55,0.09,2023.0,51.0,20.67036,-103.3726,51.0,20.67051,-103.3693
50%,29076750.0,1145524.0,1992.0,31.0,9.3,0.16,2023.0,128.0,20.67563,-103.3595,111.0,20.67575,-103.357
75%,29374360.0,1774264.0,1997.0,39.0,14.55,0.24,2023.0,216.0,20.68452,-103.3485,229.0,20.68465,-103.348
max,29671920.0,2476513.0,2022.0,103.0,2251.65,37.53,2023.0,327.0,20.73837,-103.3012,327.0,20.73837,-103.3012


<class 'pandas.core.frame.DataFrame'>
Int64Index: 1035573 entries, 0 to 1035572
Data columns (total 24 columns):
 #   Column                 Non-Null Count    Dtype  
---  ------                 --------------    -----  
 0   trip_id                1035573 non-null  int64  
 1   user_id                1035573 non-null  int64  
 2   gender                 1030612 non-null  object 
 3   birthday               1034005 non-null  float64
 4   age                    1034005 non-null  float64
 5   start_trip             1035573 non-null  object 
 6   end_trip               1035573 non-null  object 
 7   trip_duration_minutes  1035573 non-null  float64
 8   trip_duration_hours    1035573 non-null  float64
 9   year                   1035573 non-null  int64  
 10  origin_id              1035573 non-null  int64  
 11  origin_name            1035573 non-null  object 
 12  origin_obcn            1035573 non-null  object 
 13  origin_location        1035573 non-null  object 
 14  origin_latitude   
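
The same enrichment can be sketched more compactly by prefixing the station columns before each merge, which avoids the copy-and-delete steps. This is an alternative sketch on rows taken from the outputs above, not the notebook's own function. The `how="left"` and `validate="many_to_one"` arguments are choices made here for illustration.

```python
import pandas as pd

stations = pd.DataFrame({
    "id": [64, 141],
    "name": ["(GDL-062) C. Libertad / C. Moscú", "(ZPN-047) Av. Tizoc / Av. López Mateos"],
    "obcn": ["GDL-062", "ZPN-047"],
})
trips = pd.DataFrame({"trip_id": [28467098], "origin_id": [64], "destination_id": [141]})

enriched = trips
for side in ["origin", "destination"]:
    # Prefix every station column, e.g. id -> origin_id, name -> origin_name
    prefixed = stations.add_prefix(f"{side}_")
    # Left join keeps every trip; validate checks that station ids are unique
    enriched = enriched.merge(prefixed, on=f"{side}_id", how="left", validate="many_to_one")

print(enriched.columns.tolist())
```

With a left join, a trip whose station id is missing from the station table is kept with empty station columns instead of being dropped.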

### 6. Next we remove null values

In [149]:
print("Before removing null values", mibici_df.shape)
mibici_df = mibici_df.dropna()
print("After removing null values", mibici_df.shape)

Before removing null values (1035573, 24)
After removing null values (1029044, 24)
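
`dropna()` with no arguments removes a row if any column is null. In the visible `info()` output, `gender` (4,961 nulls) and `birthday`/`age` (1,568 nulls) are the columns with missing values, and 4,961 + 1,568 = 6,529 matches the number of rows removed. The sketch below shows how to inspect null counts and restrict the drop to chosen columns. It uses made-up rows.

```python
import pandas as pd

# Toy frame with missing values (made-up rows)
df = pd.DataFrame({
    "trip_id": [1, 2, 3, 4],
    "gender": ["M", None, "F", "M"],
    "birthday": [1990.0, 1985.0, None, 2000.0],
})

# Count nulls per column before deciding what to drop
null_counts = df.isna().sum()
print(null_counts)

# Drop rows only when one of the listed columns is null
cleaned = df.dropna(subset=["gender", "birthday"])
print(cleaned.shape)
```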


### 7. Next we run some aggregation operations on the data, purely for exploration

In [150]:
# Count trips per origin station and per destination station
print(mibici_df.groupby(['origin_id', 'origin_name']).size())
print(mibici_df.groupby(['destination_id', 'destination_name']).size())



origin_id  origin_name                                  
2          (GDL-001) C. Epigmenio Glez./ Av. 16 de Sept.    4893
3          (GDL-002) C. Colonias  / Av.  Niños héroes       4364
4          (GDL-003) C. Vidrio / Av. Chapultepec            6256
5          (GDL-004) C. Ghilardi /C. Miraflores             4465
6          (GDL-005) C. San Diego /Calzada Independencia    2621
                                                            ... 
323        (GDL-213) C. Palavicini / Av. Circunvalación     2761
324        (GDL-221) De los científicos/ J. J. Tablada      2283
325        (GDL-226) C. Luis G. Cuevas /Av. Revolución      1050
326        (GDL-219) C. D. Rivera / C. Filósofos            1014
327        (GDL-227) C. Artistas / C. Carlos Pereira        2005
Length: 300, dtype: int64
destination_id  destination_name                             
2               (GDL-001) C. Epigmenio Glez./ Av. 16 de Sept.    5257
3               (GDL-002) C. Colonias  / Av.  Niños héroes       3717
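
To rank stations by number of trips rather than list them by id, the counts can be sorted. A sketch on toy data, where the station names are placeholders:

```python
import pandas as pd

trips = pd.DataFrame({
    "origin_id": [2, 2, 3, 4, 4, 4],
    "origin_name": ["A", "A", "B", "C", "C", "C"],
})

# Number of trips starting at each station, busiest first
trips_per_origin = (
    trips.groupby(["origin_id", "origin_name"])
    .size()
    .sort_values(ascending=False)
)
print(trips_per_origin.head(2))
```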


### 8. Additional steps

As we can see, we carried out several data wrangling tasks. There is still more to explore in this dataset, which will be done in the upcoming challenges.
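
One possible follow-up, sketched under assumptions: the `describe()` output above shows trips lasting up to 2251.65 minutes and ages from 1 to 103, which suggests outliers. The thresholds below (`MAX_MINUTES`, `MIN_AGE`, `MAX_AGE`) are hypothetical values chosen for illustration, not rules from MiBici.

```python
import pandas as pd

# Hypothetical thresholds; adjust to the analysis at hand
MAX_MINUTES = 180
MIN_AGE, MAX_AGE = 16, 90

# Toy rows using extreme values seen in the describe() output
trips = pd.DataFrame({
    "trip_id": [1, 2, 3, 4],
    "trip_duration_minutes": [9.3, 2251.65, 14.55, 5.55],
    "age": [31.0, 39.0, 103.0, 1.0],
})

# Keep rows whose duration and age fall inside the chosen ranges
plausible = trips[
    (trips["trip_duration_minutes"] <= MAX_MINUTES)
    & trips["age"].between(MIN_AGE, MAX_AGE)
]
print(plausible)
```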