## Train Stations Dataset
This notebook prepares a dataset of train stations in the City of Buenos Aires (CABA) for use in the final visualization. Each record holds a station's coordinates (longitude and latitude), name, line, commune, and other attributes.

### Source
The source is the [Estaciones de Ferrocarril](https://data.buenosaires.gob.ar/dataset/estaciones-ferrocarril/resource/juqdkmgo-1021-resource) dataset published by the Government of the City of Buenos Aires.

### Details
Pandas is the main tool used here. The normalization consists mainly of translating terms from Spanish to English and removing data that is not relevant to our use case.

I normalize the dataset by overwriting the original DataFrame rather than keeping a separate normalized copy: each step reassigns its result to `train_stations_df`. (pandas also offers an *inplace=True* parameter on some methods, but the cells below do not use it, and it does not by itself guarantee lower memory use.) The advantage of this approach is that no second, unnormalized DataFrame stays in memory for the rest of the notebook. The disadvantage is that reverting to the original data requires either keeping a separate copy or re-loading the source file.
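
If being able to revert matters, one option is to keep an explicit copy of the raw data before normalizing. The following is a minimal sketch on toy data (two rows taken from the preview in this notebook); the names `raw_df` and `stations_df` are illustrative and are not used elsewhere in the notebook.

```python
import pandas as pd

# Toy stand-in for the real CSV; only two columns are kept for brevity.
raw_df = pd.DataFrame({
    "nombre": ["3 de Febrero", "Dr. Cetrángolo"],
    "comuna": ["Comuna 14", None],
})

# Work on an explicit copy so raw_df stays available for comparison.
stations_df = raw_df.copy()
stations_df = stations_df.rename(columns={"nombre": "name", "comuna": "commune"})
stations_df = stations_df.dropna(subset=["commune"])

# The raw frame still has its Spanish column names and both rows.
print(list(raw_df.columns))
print(len(raw_df), len(stations_df))
```

The trade-off is the one described above: the copy costs memory, but it avoids re-reading the file to recover the original values.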

In [3]:
import pandas as pd

In [4]:
train_stations_df = pd.read_csv('estaciones_ferrocarril.csv')

Let's take a first look at the dataset and its attributes.

In [5]:
train_stations_df.head()

Unnamed: 0,long,lat,id,nombre,linea,linea_2,ramal,barrio,comuna,localidad,partido
0,-58.424982,-34.571748,1,3 de Febrero,Mitre,F.C.G.B.M.,Retiro - Mitre,Palermo,Comuna 14,,
1,-58.461808,-34.568051,35,Belgrano R,Mitre,F.C.G.B.M.,Retiro - Mitre,Belgrano,Comuna 13,,
2,-58.475349,-34.565261,90,Coghlan,Mitre,F.C.G.B.M.,Retiro - Mitre,Coghlan,Comuna 12,,
3,-58.448253,-34.572975,91,Colegiales,Mitre,F.C.G.B.M.,Retiro - Mitre,Colegiales,Comuna 13,,
4,-58.49413,-34.523089,119,Dr. Cetrángolo,Mitre,F.C.G.B.M.,Retiro - Mitre,,,Florida,Vicente López


### Translation of attribute names

In [6]:
translations = {
    'columns': {
        'nombre': 'name',
        'linea': 'line',
        'linea_2': 'line_short_name',
        'ramal': 'branch',
        'barrio': 'neighborhood',
        'comuna': 'commune',
        'localidad': 'city',
        'partido': 'department',
    }
}

In [7]:
train_stations_df = train_stations_df.rename(columns=translations['columns'])
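
By default, `rename` silently ignores dictionary keys that do not match any column, so a typo in a Spanish column name would go unnoticed. As a hedged sketch on toy data, passing `errors="raise"` makes such mismatches fail loudly; the variable names here are illustrative.

```python
import pandas as pd

# Toy frame with two of the original Spanish column names.
df = pd.DataFrame({"nombre": ["Coghlan"], "linea": ["Mitre"]})

column_translations = {"nombre": "name", "linea": "line"}
renamed = df.rename(columns=column_translations, errors="raise")
print(list(renamed.columns))

# A key that is not an existing column now raises KeyError
# instead of being skipped.
try:
    df.rename(columns={"ramal": "branch"}, errors="raise")
    missing_detected = False
except KeyError:
    missing_detected = True
print(missing_detected)
```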

In [9]:
train_stations_df.head()

Unnamed: 0,long,lat,id,name,line,line_short_name,branch,neighborhood,commune,city,department
0,-58.424982,-34.571748,1,3 de Febrero,Mitre,F.C.G.B.M.,Retiro - Mitre,Palermo,Comuna 14,,
1,-58.461808,-34.568051,35,Belgrano R,Mitre,F.C.G.B.M.,Retiro - Mitre,Belgrano,Comuna 13,,
2,-58.475349,-34.565261,90,Coghlan,Mitre,F.C.G.B.M.,Retiro - Mitre,Coghlan,Comuna 12,,
3,-58.448253,-34.572975,91,Colegiales,Mitre,F.C.G.B.M.,Retiro - Mitre,Colegiales,Comuna 13,,
4,-58.49413,-34.523089,119,Dr. Cetrángolo,Mitre,F.C.G.B.M.,Retiro - Mitre,,,Florida,Vicente López


### Normalization of attributes
#### Info about the commune

Since the main focus of our final visualization is to count how many train stations are in each commune, let's see what the `commune` attribute looks like in the dataset.

In [13]:
print(train_stations_df['commune'].unique())

['Comuna 14' 'Comuna 13' 'Comuna 12' 'Comuna 1' 'Comuna 4' 'Comuna 8'
 'Comuna 11' 'Comuna 6' 'Comuna 7' 'Comuna 10' 'Comuna 9' 'Comuna 3']


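Let's pay attention to two things:
- `NaN` values are present in the raw data. (Judging by the cell execution order, the output above appears to have been captured after those rows were already dropped, which would explain why it shows no `nan`.)
- The communes are labeled in Spanish and stored as strings (for example, 'Comuna 14'). To make them easier to parse later, I think it is better to store them as numeric values.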
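
One way to make missing values visible is to count them explicitly. This is a small sketch on a made-up Series in the same 'Comuna X' format, not on the real data.

```python
import pandas as pd

# Toy commune column with two missing entries.
commune = pd.Series(
    ["Comuna 14", "Comuna 13", None, "Comuna 13", None], name="commune"
)

# Number of rows without a commune.
missing_count = int(commune.isna().sum())
print(missing_count)

# value_counts hides missing values unless dropna=False is passed.
counts = commune.value_counts(dropna=False)
print(counts)
```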

#### NaN values

Even though it was not stated in the dataset's description, rows with a 'NaN' value for the commune indicate that the station is outside the city of Buenos Aires. Consequently, those rows will be deleted since they serve no purpose for our final visualization.

In [12]:
train_stations_df = train_stations_df.dropna(subset=["commune"])
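
The assumption that a missing commune means the station lies outside the city can be checked before dropping rows. Below is a sketch on two toy rows from the preview; it assumes that stations outside the city carry a `department` value instead, which the preview suggests but does not prove for the full dataset.

```python
import pandas as pd

# Two toy rows: one inside the city, one outside (values from the preview).
df = pd.DataFrame({
    "name": ["3 de Febrero", "Dr. Cetrángolo"],
    "commune": ["Comuna 14", None],
    "city": [None, "Florida"],
    "department": [None, "Vicente López"],
})

# Rows that would be removed by dropna(subset=["commune"]).
outside = df[df["commune"].isna()]

# If every such row has a department, the "outside the city" reading holds.
all_have_department = bool(outside["department"].notna().all())
print(outside[["name", "city", "department"]])
print(all_have_department)
```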

In [14]:
print(train_stations_df['commune'].unique())

['Comuna 14' 'Comuna 13' 'Comuna 12' 'Comuna 1' 'Comuna 4' 'Comuna 8'
 'Comuna 11' 'Comuna 6' 'Comuna 7' 'Comuna 10' 'Comuna 9' 'Comuna 3']
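
Once only stations inside the city remain, columns that describe locations outside it may be left completely empty. Assuming that is the case for `city` and `department` (an assumption, not something verified here), a sketch for dropping all-empty columns could look like this:

```python
import pandas as pd

# Toy frame after filtering: city and department have no values left.
df = pd.DataFrame({
    "name": ["3 de Febrero", "Belgrano R"],
    "commune": ["Comuna 14", "Comuna 13"],
    "city": [None, None],
    "department": [None, None],
})

# Drop only the columns in which every value is missing.
trimmed = df.dropna(axis="columns", how="all")
print(list(trimmed.columns))
```

Using `how="all"` keeps any column that still has at least one value, so it does not discard partially filled attributes.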


#### Setting commune info to numeric values

In [16]:
# Extract the numeric value from strings like "Comuna X" and convert to numbers
train_stations_df["commune"] = pd.to_numeric(
    train_stations_df["commune"].str.extract(r'(\d+)', expand=False)
)

In [20]:
print(train_stations_df["commune"].unique())

[14 13 12  1  4  8 11  6  7 10  9  3]
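
With numeric communes, the count the final visualization needs becomes a one-liner. This sketch uses a toy Series rather than the real data and repeats the extraction step so that it runs on its own.

```python
import pandas as pd

# Toy commune labels in the same "Comuna X" format as the dataset.
labels = pd.Series(
    ["Comuna 14", "Comuna 13", "Comuna 12", "Comuna 13"], name="commune"
)

# Pull out the digits, then convert the resulting strings to integers.
commune_numbers = pd.to_numeric(labels.str.extract(r"(\d+)", expand=False))

# Count stations per commune, ordered by commune number.
stations_per_commune = commune_numbers.value_counts().sort_index()
print(stations_per_commune)
```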


### Exporting the dataset

In [22]:
train_stations_df.to_csv("train_stations.csv", encoding='utf-8', index=False)
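
A quick way to confirm that an export is usable is to read it back. The sketch below writes a toy frame to a temporary directory instead of the notebook's working directory; the frame contents and variable names are illustrative.

```python
import os
import tempfile

import pandas as pd

# Toy version of the normalized dataset.
exported = pd.DataFrame({
    "name": ["3 de Febrero", "Coghlan"],
    "commune": [14, 12],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train_stations.csv")
    # index=False keeps the row index out of the file, so no extra
    # unnamed column appears when the CSV is read again.
    exported.to_csv(path, encoding="utf-8", index=False)
    reloaded = pd.read_csv(path, encoding="utf-8")

print(list(reloaded.columns))
print(reloaded["commune"].tolist())
```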