## Subway stations dataset
In this notebook, a dataset about subway stations in the City of Buenos Aires (CABA) is prepared for use in the final visualization. It contains the geolocation, name, commune, and other attributes of each subway station entrance.

### Source
The source is the dataset [Bocas de subte](https://data.buenosaires.gob.ar/dataset/bocas-subte/resource/c9b9628b-6ca5-4867-bf9d-ad11997f951f), published by the Government of the City of Buenos Aires.

### Details
Pandas is the main tool used here. The main normalization steps are translating some terms from Spanish to English and deleting columns that are not relevant to our use case.

The dataset is normalized by overwriting the original DataFrame: each step assigns its result back to `subway_st_df` instead of to a new variable. This keeps the notebook simple and avoids holding several versions of the data under different names. The disadvantage is that the unnormalized data is no longer available once a step has run, so reverting to it requires either keeping a separate copy or reloading the original dataset.
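A minimal sketch of the "separate copy" option mentioned above. The data is made up and the name `raw_df` is hypothetical; neither is part of this notebook.

```python
import pandas as pd

# Toy data standing in for the real CSV (illustrative values only)
df = pd.DataFrame({'comuna': ['Comuna 1', None, 'Comuna 3']})

# Keep an untouched copy before any normalization step runs
raw_df = df.copy()

# Reassigning to the same name replaces the previous version of df
df = df.rename(columns={'comuna': 'commune'})

print(list(raw_df.columns))
print(list(df.columns))
```

With the copy in place, the original column names and values remain available in `raw_df` without reloading the file.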

In [1]:
import pandas as pd

In [2]:
subway_st_df = pd.read_csv('source_datasets/bocas-de-subte.csv')

Let's take a first look at the attributes.

Many of the attributes are irrelevant to our use case. The commune-related data is the part we need.

In [3]:
subway_st_df.head()

Unnamed: 0,long,lat,id,linea,estacion,numero_de_,destino_bo,lineas_de_,cierra_fin,escalera_p,...,salvaescal,calle,altura,calle2,barrio,comuna,observacio,Objeto,dom_norma,dom_orig
0,-58.384068,-34.602106,1,D,TRIBUNALES - TEATRO COLÓN,4,a Catedral y Congreso de Tucumán,,True,True,...,False,Libertad,556,,San Nicolas,Comuna 1,Andén central,Boca de subte,LIBERTAD 556,Libertad 556
1,-58.384372,-34.602394,2,D,TRIBUNALES - TEATRO COLÓN,5,a Catedral y Congreso de Tucumán,,True,True,...,False,Lavalle,1221,,San Nicolas,Comuna 1,Andén central,Boca de subte,LAVALLE 1221,Lavalle 1221
2,-58.39725,-34.587804,3,H,LAS HERAS,1,a Hospitales,,False,True,...,False,Pueyrredon,2199,,,,Vestíbulo intermedio,Boca de subte,,Pueyrredon 2199
3,-58.403967,-34.598733,4,H,CÓRDOBA,1,a Las Heras y Hopitales,,False,True,...,False,Pueyrredon,984,,,,Vestíbulo intermedio,Boca de subte,,Pueyrredon 984
4,-58.405406,-34.603884,5,H,CORRIENTES,6,a Las Heras y Hospitales,B,False,False,...,False,Pueyrredón,558,,,,Vestíbulo intermedio,Boca de subte,,Pueyrredón 558
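Because `head()` hides the middle columns behind `...`, it can help to list every column name and count the missing values per column. The sketch below uses a small made-up DataFrame with a few of the columns shown above, so the numbers are illustrative only.

```python
import pandas as pd

# Toy frame with a few of the columns from the real dataset (made-up rows)
df = pd.DataFrame({
    'estacion': ['TRIBUNALES - TEATRO COLÓN', 'LAS HERAS'],
    'barrio': ['San Nicolas', None],
    'comuna': ['Comuna 1', None],
})

# Full list of column names, with nothing elided
all_columns = df.columns.tolist()

# Number of missing values in each column
missing_per_column = df.isna().sum()

print(all_columns)
print(missing_per_column)
```

On the real data, the same two calls applied to `subway_st_df` would show all attributes and which ones have gaps.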


In the official outline for the dataset, we can see that the commune-related information is in the "comuna" column. Let's take a first look at it.

In [4]:
print(subway_st_df['comuna'].unique())

['Comuna 1' nan 'Comuna 3' 'Comuna 5' 'Comuna 6' 'Comuna 7' 'Comuna 15'
 'Comuna 12' 'Comuna 2' 'Comuna 14' 'Comuna 13' 'Comuna 4']
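`unique()` shows which values occur but not how often. A possible follow-up, sketched here on made-up values, is to count each value including the missing ones:

```python
import pandas as pd

# Made-up values in the same "Comuna X" format as the real column
comuna = pd.Series(['Comuna 1', 'Comuna 1', None, 'Comuna 3', None, None])

# Frequency of each value; dropna=False keeps the missing entries in the count
counts = comuna.value_counts(dropna=False)

# Number of rows with no commune recorded
n_missing = int(comuna.isna().sum())

print(counts)
print(n_missing)
```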


### Translation of attribute names
The only attribute relevant to us for now is the commune.

In [5]:
subway_st_df = subway_st_df.rename(columns={'comuna': 'commune'})

### Normalization of attributes
#### Info about the commune
To normalize this attribute for final use in our visualization, let's do the same thing we did for the train stations dataset: convert values such as "Comuna 1" to the numeric value 1.

In [6]:
# Extract the numeric value from strings like "Comuna X" and convert to numbers
subway_st_df['commune'] = pd.to_numeric(
    subway_st_df['commune'].str.extract(r'(\d+)', expand=False)
)

In [7]:
print(subway_st_df['commune'].unique())

[ 1. nan  3.  5.  6.  7. 15. 12.  2. 14. 13.  4.]
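The values print as floats (`1.`, `3.`) because of the NaN entries: a regular integer column cannot hold NaN, so pandas falls back to a float dtype. The sketch below reproduces this on made-up values and also shows pandas' nullable `Int64` dtype as an optional alternative that this notebook does not use.

```python
import pandas as pd

# Made-up values in the same format as the real column
sample = pd.Series(['Comuna 1', None, 'Comuna 15'])

# Capture the first run of digits; rows without a match stay missing
digits = sample.str.extract(r'(\d+)', expand=False)

# Convert the digit strings to numbers; the missing value forces a float dtype
as_float = pd.to_numeric(digits)

# Optional: nullable integer dtype keeps whole numbers alongside missing values
as_nullable_int = as_float.astype('Int64')

print(as_float.tolist())
print(as_nullable_int.tolist())
```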


#### NaN values
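Let's also drop the rows whose commune is NaN, the same as we did for the train stations dataset. In that dataset, NaN indicated stations outside the City of Buenos Aires. Here, the preview above shows Line H entrances on Pueyrredón with no commune recorded, so NaN appears to mean a missing value in the source, and dropping these rows removes those entrances from the output.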

In [8]:
subway_st_df = subway_st_df.dropna(subset=['commune'])
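Two optional follow-ups to the drop, sketched on made-up data: counting how many rows were removed, and casting the commune to plain integers now that no NaN remains. Neither step is part of this notebook, and the names `cleaned` and `rows_dropped` are hypothetical.

```python
import pandas as pd

# Made-up rows; the commune column is float because it contains NaN
df = pd.DataFrame({
    'estacion': ['A', 'B', 'C'],
    'commune': [1.0, float('nan'), 3.0],
})

rows_before = len(df)

# Drop rows without a commune, then cast the remaining values to integers
cleaned = df.dropna(subset=['commune']).astype({'commune': 'int64'})

# How many rows the drop removed
rows_dropped = rows_before - len(cleaned)

print(rows_dropped)
print(cleaned['commune'].tolist())
```

Checking `rows_dropped` on the real data is a quick way to see how much of the dataset the NaN filter discards.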

### Exporting the dataset

In [9]:
subway_st_df.to_csv('processed_data/subway_stations.csv', encoding='utf-8', index=False)
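The `index=False` argument matters when the file is read back: without it, the row index is written as an extra unnamed column, which pandas loads as `Unnamed: 0` (the same column name visible in the preview of the source file). A minimal round-trip sketch on made-up data, using in-memory buffers instead of files:

```python
import io

import pandas as pd

# Made-up row standing in for the processed dataset
df = pd.DataFrame({'estacion': ['CÓRDOBA'], 'commune': [3]})

# Export without the index, then read it back
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Export with the default index=True for comparison
buffer_with_index = io.StringIO()
df.to_csv(buffer_with_index)
buffer_with_index.seek(0)
reloaded_with_index = pd.read_csv(buffer_with_index)

print(list(reloaded.columns))
print(list(reloaded_with_index.columns))
```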