# Exploratory Data Analysis

This notebook explores the geographic location of the bus stops in Ciudad Autónoma de Buenos Aires (CABA), together with some administrative information such as the commune (comuna) and the bus lines associated with each stop.

### Source
**Dataset:** paradas-de-colectivo.csv 

**Link:** [Buenos Aires Data - Paradas de Colectivo][bus_stops]

[bus_stops]: https://data.buenosaires.gob.ar/dataset/colectivos-paradas/resource/d0e599d2-3e78-4fb2-9255-30a2be0525f8

## Resource fields (as published on the dataset website)

Each row represents a bus stop location in CABA, with its address, coordinates, and up to 6 associated bus lines (plus direction).

| Name | Type | Description (EN translation) |
|---|---|---|
| CALLE | string | Street name where the address is located. |
| ALT_PLANO | string | Street number (it may represent the building/property numbering). |
| DIRECCION | string | Full address, including street and number. |
| coord_X | string | X coordinate (geographic longitude) of the address location. |
| coord_Y | string | Y coordinate (geographic latitude) of the address location. |
| COMUNA | string | Commune (comuna) number where the address is located. |
| BARRIO | string | Neighborhood (barrio) name the address belongs to. |
| L1 | string | Public transport line number that serves the location. |
| l1_sen | string | Direction of line L1 (e.g., `I` for outbound/ida or `V` for return/vuelta). |
| L2 | string | Second public transport line number serving the location (if applicable). |
| l2_sen | string | Direction of line L2. |
| L3 | string | Third public transport line number serving the location (if applicable). |
| l3_sen | string | Direction of line L3. |
| L4 | string | Fourth public transport line number serving the location (if applicable). |
| l4_sen | string | Direction of line L4. |
| L5 | string | Fifth public transport line number serving the location (if applicable). |
| l5_sen | string | Direction of line L5. |
| L6 | string | Sixth public transport line number serving the location (if applicable). |
| l6_sen | string | Direction of line L6. |

## Questions:

To complete:
* abc
* def
* ghi

## Setup

#### Imports

In [54]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


#### Initial settings

In [55]:
%matplotlib inline

## Loading Data

In [56]:
from pathlib import Path

PROJECT_ROOT = Path.cwd().parent 
DATA_PATH = PROJECT_ROOT / "data" / "raw" / "paradas-de-colectivo.csv"

DATA_PATH.exists()

True

In [57]:
stops_raw = pd.read_csv(DATA_PATH)

# Work on a copy so the DataFrame as loaded from disk stays untouched
stops = stops_raw.copy()
stops

Unnamed: 0,fid,CALLE,ALT PLANO,DIRECCION,coord_X,coord_Y,COMUNA,BARRIO,L1,l1_sen,L2,l2_sen,L3,l3_sen,L4,l4_sen,L5,l5_sen,L6,l6_sen
0,1,DEFENSA,1524,1524 DEFENSA,-583709946,-3462565880,1,SAN TELMO,22.0,V,53.0,I,,,,,,,,
1,2,DEFENSA,1528,1528 DEFENSA,-583709994,-3462571060,1,SAN TELMO,29.0,I,,,,,,,,,,
2,3,BARTOLOME MITRE,906,"906 MITRE, BARTOLOME",-583796587,-3460721560,1,SAN NICOLAS,105.0,V,,,,,,,,,,
3,4,REGIMIENTO DE PATRICIOS AV.,51,51 REGIMIENTO DE PATRICIOS AV.,-583706639,-3463022580,4,BARRACAS,93.0,I,70.0,V,74,I,,,,,,
4,5,REGIMIENTO DE PATRICIOS AV.,389,389 REGIMIENTO DE PATRICIOS AV.,-583703604,-3463340970,4,BARRACAS,10.0,I,22.0,I,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6957,6958,NAZCA AV.,1672,1672 NAZCA AV.,-58479063,-3461479700,11,VILLA SANTA RITA,63.0,I,133.0,V,,,,,,,,
6958,6959,BAIGORRIA,4185,4185 BAIGORRIA,-5850318,-3461053300,11,VILLA DEVOTO,109.0,V,,,,,,,,,,
6959,6960,GRAL. PAZ AV.,11118,"PAZ, GRAL. AV. 11118",-58528564,-3464514700,9,LINIERS,4.0,I,185.0,I,,,,,,,,
6960,6961,GRAL. PAZ AV.,11108,"PAZ, GRAL. AV. 11108",-58528629,-3464461600,9,LINIERS,8.0,V,86.0,V,,,,,,,,


## Preprocessing

I will normalize the column names to lowercase snake_case:

In [58]:
stops.columns

Index(['fid', 'CALLE', 'ALT PLANO', 'DIRECCION', 'coord_X', 'coord_Y',
       'COMUNA', 'BARRIO', 'L1', 'l1_sen', 'L2', 'l2_sen', 'L3', 'l3_sen',
       'L4', 'l4_sen', 'L5', 'l5_sen', 'L6', 'l6_sen'],
      dtype='object')

In [59]:
stops.columns = (
    stops.columns
    .str.strip()
    .str.lower()
    .str.replace(" ", "_")
)
stops.columns

Index(['fid', 'calle', 'alt_plano', 'direccion', 'coord_x', 'coord_y',
       'comuna', 'barrio', 'l1', 'l1_sen', 'l2', 'l2_sen', 'l3', 'l3_sen',
       'l4', 'l4_sen', 'l5', 'l5_sen', 'l6', 'l6_sen'],
      dtype='object')
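The same normalization can be wrapped in a small reusable function. The sketch below is only an illustration: the helper name `normalize_columns` and the toy `demo` frame are hypothetical, and it collapses any run of whitespace into a single underscore, which is slightly broader than the single-space replacement used above.

```python
import pandas as pd


def normalize_columns(columns: pd.Index) -> pd.Index:
    """Strip, lowercase, and replace whitespace runs with '_' (hypothetical helper)."""
    return (
        columns.str.strip()
        .str.lower()
        .str.replace(r"\s+", "_", regex=True)
    )


# Toy frame with the same kinds of labels as the raw dataset (no rows needed)
demo = pd.DataFrame(columns=[" ALT PLANO", "coord_X", "l1_sen"])
demo.columns = normalize_columns(demo.columns)
print(list(demo.columns))
```

Keeping the logic in one function makes it easy to apply the same rule to other raw files later.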

Let's check the dtypes of the DataFrame:

In [60]:
stops.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6962 entries, 0 to 6961
Data columns (total 20 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   fid        6962 non-null   int64  
 1   calle      6962 non-null   object 
 2   alt_plano  6910 non-null   object 
 3   direccion  6959 non-null   object 
 4   coord_x    6962 non-null   object 
 5   coord_y    6962 non-null   object 
 6   comuna     6962 non-null   int64  
 7   barrio     6961 non-null   object 
 8   l1         6959 non-null   float64
 9   l1_sen     6957 non-null   object 
 10  l2         3813 non-null   float64
 11  l2_sen     3806 non-null   object 
 12  l3         587 non-null    object 
 13  l3_sen     586 non-null    object 
 14  l4         98 non-null     float64
 15  l4_sen     94 non-null     object 
 16  l5         19 non-null     float64
 17  l5_sen     19 non-null     object 
 18  l6         8 non-null      float64
 19  l6_sen     8 non-null      object 
dtypes: float

I will start by converting the columns coord_x and coord_y to float. They were read as strings, so the code below replaces commas with dots before parsing them as numbers:

In [61]:
for col in ["coord_x", "coord_y"]:
    stops[col] = pd.to_numeric(stops[col].str.replace(",", "."), errors="coerce")

stops[["coord_x", "coord_y"]].dtypes

coord_x    float64
coord_y    float64
dtype: object
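Because `errors="coerce"` silently turns unparseable values into NaN, it may be worth checking what was lost and whether the parsed values look plausible. The following is a minimal sketch on made-up toy values, not on the real column. The bounds `LON_MIN` and `LON_MAX` are a rough, assumed range for longitudes around Buenos Aires and should be adjusted before use.

```python
import pandas as pd

# Toy values (assumption): decimal comma, plus one entry that cannot be parsed
raw = pd.Series(["-58,3709946", "-58,479063", "sin dato"], dtype="object")

# Same conversion as above: swap the decimal comma for a dot, then parse
parsed = pd.to_numeric(raw.str.replace(",", ".", regex=False), errors="coerce")

# Values that were present in the raw data but became NaN after parsing
failed = raw[parsed.isna() & raw.notna()]

# Loose, assumed longitude bounds for the city; not taken from the dataset
LON_MIN, LON_MAX = -58.6, -58.3
out_of_range = parsed[parsed.notna() & ~parsed.between(LON_MIN, LON_MAX)]

print(failed.tolist())
print(len(out_of_range))
```

On the real data, the same two checks could be run on `stops_raw` and `stops` for both coord_x and coord_y.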

Next, I will convert the columns l1-l6 to an integer type, since bus line numbers are integers.

This needs some care: the dtype of the column l3 is object, so it probably contains some non-numeric string values. Because these columns also contain missing values, I use the nullable `Int64` dtype.

In [62]:
for col in ["l1", "l2", "l3", "l4", "l5", "l6"]:
    stops[col] = pd.to_numeric(stops[col], errors="coerce").astype('Int64')

stops[["l1", "l2", "l3", "l4", "l5", "l6"]].dtypes

l1    Int64
l2    Int64
l3    Int64
l4    Int64
l5    Int64
l6    Int64
dtype: object
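Since `errors="coerce"` replaces anything non-numeric with a missing value, it can help to list those entries before discarding them. The sketch below uses a made-up toy column (`l3_demo`, with a hypothetical value `"12A"`) to show the idea. It is not a claim about what the real l3 column contains.

```python
import pandas as pd

# Toy stand-in for a line column read as object (values are assumptions)
l3_demo = pd.Series(["74", "12A", None, "105.0"], dtype="object")

numeric = pd.to_numeric(l3_demo, errors="coerce")

# Entries that had a value originally but could not be parsed as a number
lost = l3_demo[numeric.isna() & l3_demo.notna()]
print(lost.tolist())

# Nullable integer dtype keeps missing values as <NA> instead of forcing floats
converted = numeric.astype("Int64")
print(converted.dtype)
```

Running the same mask against `stops_raw["L3"]` would show which raw values, if any, are lost by the conversion above.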