## PAIR PROGRAMMING ETL II

### Transformation I - Cleaning
---

In [51]:
import requests
import pandas as pd
import ast 
from IPython.core.interactiveshell import InteractiveShell 
InteractiveShell.ast_node_interactivity = "all" 
pd.options.display.max_columns = None

You will need to use the csv attacks_limpieza_completa.

In today's lesson we learned how to transform our data so that it is ready to be stored in a database. At this point we have two data sources:

1. The csv with the shark attacks that we have been cleaning so far, which we have attached (attacks_limpieza_completa). Feel free to use your own csv files if you prefer.
2. The csv with the climate data for the main countries with shark attacks, which we created in yesterday's pair programming.

The **goal** of today's session is to combine the information from both sources into a single csv. To do so:

- We will load the two data files

In [52]:
df= pd.read_pickle('../files/attacks12_remplazo_nulos.pkl')


In [53]:
df.head()

Unnamed: 0,case_number,year,mes,sexo,edades,country,type,activity,fatal,cat_species
0,1800.00.00,1997.0,,F,27.878808,seychelles,Unprovoked,a corsair's boat was overturned,Y,
1,1797.05.28.R,1997.0,May,M,27.878808,,Unprovoked,Dropped overboard,Y,
2,1792.09.12,1997.0,Sep,M,27.878808,england,Provoked,Fishing,Y,
3,1791.00.00,1997.0,,F,27.878808,australia,Unprovoked,,Y,
4,1788.05.10,1997.0,May,M,27.878808,australia,Boat,Fishing,N,


In [54]:
df_clima = pd.read_csv('../files/clima_paises.csv', index_col=0)

---

- From the attacks dataframe we will keep only the rows for the countries we selected in yesterday's lesson:
    - USA
    - Australia
    - New Zealand
    - South Africa
    - Papua New Guinea

In [55]:
# Create a new dataframe by filtering on the five countries
df_attacks= df[df['country'].isin(['usa','australia','new zealand','south africa', 'papua new guinea'])]

In [56]:
# Check that only those countries remain
df_attacks['country'].unique()

array(['australia', 'usa', 'papua new guinea', 'new zealand',
       'south africa'], dtype=object)

In [57]:
print(f'No. of rows: {df_attacks.shape[0]}\nNo. of columns: {df_attacks.shape[1]}')

No. of rows: 1355
No. of columns: 10
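
The filtering step above can be tried on a small toy dataframe first. This is only a sketch: the rows in `toy` are invented for illustration, and the `.copy()` call is an optional addition (not part of the notebook's cell) that makes the filtered result independent from the original dataframe.

```python
import pandas as pd

# Toy data standing in for the attacks dataframe (values are invented)
toy = pd.DataFrame({
    "country": ["usa", "spain", "australia", None, "new zealand"],
    "fatal": ["N", "Y", "N", "Y", "N"],
})

selected = ["usa", "australia", "new zealand", "south africa", "papua new guinea"]

# isin() returns a boolean mask; missing countries never match, so they are dropped too
mask = toy["country"].isin(selected)

# .copy() returns an independent dataframe, which avoids SettingWithCopyWarning
# if we modify columns of the filtered data later on
toy_filtered = toy[mask].copy()

print(toy_filtered["country"].tolist())
```

Note that `isin` compares exact strings, so the country names must already be normalized (lower case, no extra spaces), as they are in the cleaned attacks file.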


---

- From the climate dataframe we will select all the columns.

In [58]:
df_clima.head(2)

Unnamed: 0,timepoint,cloudcover,highcloud,midcloud,lowcloud,rh_profile,wind_profile,temp2m,lifted_index,rh2m,msl_pressure,prec_type,prec_amount,snow_depth,wind10m.direction,wind10m.speed,country
0,3,1,-9999,-9999,-9999,"[{'layer': '950mb', 'rh': 2}, {'layer': '900mb...","[{'layer': '950mb', 'direction': 235, 'speed':...",13,15,3,1026,none,0,0,195,3,usa
1,6,2,-9999,-9999,-9999,"[{'layer': '950mb', 'rh': 5}, {'layer': '900mb...","[{'layer': '950mb', 'direction': 220, 'speed':...",14,10,9,1026,none,0,0,215,3,usa


- Once we have all the data we want, we will join the two csv files.
    - To make this join we first need a groupby on the climate table to get the mean of the climate measurements per country.
    - Before the groupby, notice that two columns, rh_profile and wind_profile, hold a list of dictionaries in each cell. Taking the mean of those cells would not give a meaningful value. We already faced this problem in the ETL-2 flipped class, where there was a Bonus exercise to unpack this information. In case you did not manage to solve it then, here is one possible solution that splits the information into separate columns. The code is documented. ⚠️ We recommend going through the code step by step and checking what each line returns in order to understand it better.
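
To follow that recommendation, the individual steps can be inspected on a tiny example before running them on the real data. The sketch below uses an invented two-row dataframe called `toy` with only two layers per profile; it mirrors the calls used in the solution (`ast.literal_eval` and `apply(pd.Series)`) but is not the solution itself.

```python
import ast
import pandas as pd

# Two invented rows, stored as strings just as they arrive from read_csv
toy = pd.DataFrame({"rh_profile": [
    "[{'layer': '950mb', 'rh': 2}, {'layer': '900mb', 'rh': -1}]",
    "[{'layer': '950mb', 'rh': 5}, {'layer': '900mb', 'rh': 1}]",
]})

# Step 1: after reading a csv, each cell is a plain string, not a list
print(type(toy["rh_profile"][0]))

# Step 2: literal_eval parses the string into a real list of dictionaries
toy["rh_profile"] = toy["rh_profile"].apply(ast.literal_eval)
print(type(toy["rh_profile"][0]))

# Step 3: apply(pd.Series) spreads each list over one column per layer
layers = toy["rh_profile"].apply(pd.Series)
print(layers.shape)

# Step 4: a second apply(pd.Series) on one of those columns splits the dictionary keys
first_layer = layers[0].apply(pd.Series)
print(first_layer)
```

Printing the intermediate objects like this makes it easier to see why the loop in the solution needs two levels of unpacking: one for the list and one for each dictionary.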

In [59]:
df_clima['rh_profile']= df_clima['rh_profile'].apply(ast.literal_eval) # parse each string into a real list of dictionaries

In [60]:
# Split the list of dictionaries into several columns (one per layer)
x = df_clima['rh_profile'].apply(pd.Series)

In [61]:
x.head(2)  # Check that it worked.

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,"{'layer': '950mb', 'rh': 2}","{'layer': '900mb', 'rh': -1}","{'layer': '850mb', 'rh': -1}","{'layer': '800mb', 'rh': 0}","{'layer': '750mb', 'rh': 1}","{'layer': '700mb', 'rh': 2}","{'layer': '650mb', 'rh': 2}","{'layer': '600mb', 'rh': 2}","{'layer': '550mb', 'rh': 1}","{'layer': '500mb', 'rh': 2}","{'layer': '450mb', 'rh': 4}","{'layer': '400mb', 'rh': 5}","{'layer': '350mb', 'rh': 4}","{'layer': '300mb', 'rh': 12}","{'layer': '250mb', 'rh': 8}","{'layer': '200mb', 'rh': 4}"
1,"{'layer': '950mb', 'rh': 5}","{'layer': '900mb', 'rh': 1}","{'layer': '850mb', 'rh': 0}","{'layer': '800mb', 'rh': 0}","{'layer': '750mb', 'rh': 1}","{'layer': '700mb', 'rh': 2}","{'layer': '650mb', 'rh': 2}","{'layer': '600mb', 'rh': 1}","{'layer': '550mb', 'rh': 4}","{'layer': '500mb', 'rh': 4}","{'layer': '450mb', 'rh': 5}","{'layer': '400mb', 'rh': 6}","{'layer': '350mb', 'rh': 7}","{'layer': '300mb', 'rh': 10}","{'layer': '250mb', 'rh': 12}","{'layer': '200mb', 'rh': 12}"


In [62]:
# For loop to get the column name and the row values for each layer
for i in range(len(x.columns)): 
    
    # unpack the dictionaries of column i into the columns "layer" and "rh" (done once per layer)
    capa = x[i].apply(pd.Series)

    # take the value of the "layer" key from the first row and build the column name as a string
    nombre = "rh_" + str(capa["layer"][0]) 

    # store the values of the "rh" key for every row
    valores = list(capa["rh"])

    # use the dataframe insert method to add this information to the climate dataframe
    df_clima.insert(i, nombre, valores)

In [63]:
df_clima['wind_profile']= df_clima['wind_profile'].apply(ast.literal_eval)

In [64]:
# Split the list of dictionaries into several columns (one per layer)
y = df_clima['wind_profile'].apply(pd.Series)

In [65]:
# For loop to get the column names and the row values for each layer
for i in range(len(y.columns)): 
    
    # unpack the dictionaries of column i into "layer", "direction" and "speed" (done once per layer)
    capa = y[i].apply(pd.Series)

    # take the value of the "layer" key from the first row and build the two column names as strings
    nombre1 = "direction" + str(capa["layer"][0]) 
    nombre2 = "speed" + str(capa["layer"][0]) 

    # store the values of the "direction" and "speed" keys for every row
    valores1 = list(capa["direction"])
    valores2 = list(capa["speed"])

    # use the dataframe insert method to add this information to the climate dataframe
    df_clima.insert(i, nombre1, valores1)
    df_clima.insert(i, nombre2, valores2)
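
As an alternative to the double `apply(pd.Series)`, the same unpacking can be written with a small helper that flattens one profile at a time. This is a sketch, not the course solution: `flatten_profile` is a hypothetical helper name, the values in `toy` are invented, and the resulting column order differs from the one produced by `insert` in the cell above.

```python
import pandas as pd

# Invented rows; the profiles are already real lists (i.e. after literal_eval)
toy = pd.DataFrame({
    "wind_profile": [
        [{"layer": "950mb", "direction": 235, "speed": 3},
         {"layer": "900mb", "direction": 240, "speed": 4}],
        [{"layer": "950mb", "direction": 220, "speed": 2},
         {"layer": "900mb", "direction": 225, "speed": 3}],
    ],
    "country": ["usa", "usa"],
})

def flatten_profile(profile):
    """Turn one list of layer dictionaries into a flat {column name: value} dict."""
    flat = {}
    for entry in profile:
        flat[f"direction{entry['layer']}"] = entry["direction"]
        flat[f"speed{entry['layer']}"] = entry["speed"]
    return flat

# One flat dict per row becomes one dataframe row; reuse the original index to align
wind_columns = pd.DataFrame(
    [flatten_profile(p) for p in toy["wind_profile"]], index=toy.index
)

# Put the new columns next to the remaining ones and drop the packed column
toy_flat = pd.concat([wind_columns, toy.drop(columns="wind_profile")], axis=1)
print(toy_flat)
```

Building all the columns in one pass avoids repeating `apply(pd.Series)` for every layer, which tends to be slow on larger dataframes.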

In [66]:
# Drop the columns that hold the lists of dictionaries, since that information is now duplicated
df_clima.drop(['rh_profile','wind_profile'], axis=1, inplace=True)

In [67]:
print(f'No. of rows: {df_clima.shape[0]}\nNo. of columns: {df_clima.shape[1]}')

No. of rows: 320
No. of columns: 63


In [68]:
# Group by country to get the mean of the climate measurements per country.
# numeric_only=True skips text columns such as prec_type, which cannot be averaged.
df_clima = df_clima.groupby('country').mean(numeric_only=True)


In [69]:
df_clima.head()

Unnamed: 0_level_0,speed950mb,speed900mb,speed850mb,speed800mb,speed750mb,speed700mb,speed650mb,speed600mb,speed550mb,speed500mb,speed450mb,speed400mb,speed350mb,speed300mb,speed250mb,speed200mb,direction200mb,direction250mb,direction300mb,direction350mb,direction400mb,direction450mb,direction500mb,direction550mb,direction600mb,direction650mb,direction700mb,direction750mb,direction800mb,direction850mb,direction900mb,direction950mb,rh_950mb,rh_900mb,rh_850mb,rh_800mb,rh_750mb,rh_700mb,rh_650mb,rh_600mb,rh_550mb,rh_500mb,rh_450mb,rh_400mb,rh_350mb,rh_300mb,rh_250mb,rh_200mb,timepoint,cloudcover,highcloud,midcloud,lowcloud,temp2m,lifted_index,rh2m,msl_pressure,prec_amount,snow_depth,wind10m.direction,wind10m.speed
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1,Unnamed: 59_level_1,Unnamed: 60_level_1,Unnamed: 61_level_1
australia,3.375,3.34375,3.203125,3.1875,3.078125,3.0625,3.125,3.3125,3.5,3.5625,3.578125,3.5625,3.59375,3.828125,4.265625,5.046875,137.34375,121.796875,121.875,127.1875,125.0,103.59375,101.71875,94.296875,101.484375,84.53125,81.171875,73.046875,65.78125,64.921875,67.265625,72.8125,13.421875,11.5,8.109375,5.59375,4.140625,3.546875,3.171875,3.265625,3.046875,3.1875,3.5625,3.359375,2.53125,2.171875,4.265625,5.828125,97.5,3.015625,-9999.0,-9999.0,-9999.0,25.828125,-2.78125,10.359375,1016.171875,2.546875,0.0,74.53125,3.171875
new zealand,3.9375,3.78125,3.703125,3.75,3.640625,3.640625,3.703125,3.6875,3.734375,3.859375,3.96875,4.265625,5.015625,5.296875,6.0625,6.109375,231.484375,187.578125,196.328125,170.15625,162.96875,146.015625,145.46875,137.65625,132.421875,131.09375,126.25,129.53125,126.328125,129.140625,113.90625,116.328125,13.125,13.4375,6.21875,3.765625,3.421875,2.921875,2.0625,1.34375,0.9375,1.46875,2.453125,4.8125,5.40625,6.421875,6.953125,2.71875,97.5,7.1875,-9999.0,-9999.0,-9999.0,15.0625,10.1875,10.46875,1019.28125,2.1875,0.0,113.828125,3.546875
papua new guinea,4.046875,4.46875,4.59375,4.453125,3.90625,3.234375,2.28125,2.59375,3.453125,4.4375,5.0,4.9375,5.1875,5.140625,5.796875,6.28125,235.0,228.4375,244.296875,254.453125,256.71875,260.15625,253.4375,240.15625,217.421875,150.15625,78.90625,81.328125,83.359375,83.28125,81.796875,81.640625,13.75,11.296875,7.90625,4.046875,1.265625,-0.3125,0.09375,2.296875,2.5625,2.0625,2.34375,4.90625,5.765625,5.71875,6.015625,2.4375,97.5,4.25,-9999.0,-9999.0,-9999.0,25.703125,-0.4375,11.265625,1009.9375,2.359375,0.0,81.796875,3.25
south africa,2.578125,2.34375,2.234375,2.265625,2.234375,2.359375,2.40625,2.4375,2.609375,2.96875,3.453125,3.765625,3.875,3.90625,4.15625,4.5,214.296875,228.359375,229.609375,230.390625,228.28125,224.21875,222.890625,221.484375,217.109375,221.25,219.609375,207.5,180.625,166.015625,138.203125,119.453125,13.140625,10.546875,8.28125,8.046875,9.359375,9.75,10.140625,9.140625,6.21875,5.046875,2.65625,1.359375,2.4375,4.15625,5.625,7.328125,97.5,5.390625,-9999.0,-9999.0,-9999.0,23.6875,2.09375,10.609375,1019.34375,1.421875,0.0,122.265625,2.53125
usa,3.328125,3.734375,3.984375,4.15625,4.265625,4.5,4.609375,4.96875,5.3125,6.0,6.59375,7.015625,7.484375,8.21875,8.765625,9.1875,288.4375,281.640625,274.296875,289.140625,289.453125,293.828125,296.71875,286.640625,292.34375,282.96875,275.078125,283.828125,296.25,281.71875,258.4375,244.0625,3.65625,3.9375,3.78125,3.109375,2.3125,2.171875,2.609375,2.953125,3.640625,4.1875,5.109375,4.390625,3.671875,4.65625,5.015625,4.1875,97.5,4.0625,-9999.0,-9999.0,-9999.0,12.203125,9.859375,3.671875,1013.640625,0.609375,0.0,244.84375,2.953125
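
In the table above, highcloud, midcloud and lowcloud are -9999.0 for every country. That looks like a placeholder for missing data rather than a real measurement, although the notebook does not confirm it. Under that assumption, a minimal sketch of how such values could be turned into NaN before averaging (the rows in `toy` are invented):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "country": ["usa", "usa", "australia", "australia"],
    "highcloud": [-9999, -9999, -9999, -9999],
    "temp2m": [13, 14, 25, 27],
    "prec_type": ["none", "none", "rain", "none"],
})

# Assumption: -9999 means "no data", so we replace it with NaN before averaging
toy_clean = toy.replace(-9999, np.nan)

# numeric_only=True leaves out text columns such as prec_type
means = toy_clean.groupby("country").mean(numeric_only=True)
print(means)
```

With NaN in place of the placeholder, a column that has some real readings would be averaged over those readings only, instead of being dragged towards -9999.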


In [70]:
# country is now the index, so we reset it to be able to join the dataframes on that column
df_clima.reset_index(inplace=True)

In [71]:
# Save the clean climate dataframe
df_clima.to_csv('../files/datos_clima_paises_2.csv')

In [72]:
# Join the two dataframes
df_union= df_attacks.merge(df_clima, how= 'inner', on= 'country')

In [73]:
df_union.head()

Unnamed: 0,case_number,year,mes,sexo,edades,country,type,activity,fatal,cat_species,speed950mb,speed900mb,speed850mb,speed800mb,speed750mb,speed700mb,speed650mb,speed600mb,speed550mb,speed500mb,speed450mb,speed400mb,speed350mb,speed300mb,speed250mb,speed200mb,direction200mb,direction250mb,direction300mb,direction350mb,direction400mb,direction450mb,direction500mb,direction550mb,direction600mb,direction650mb,direction700mb,direction750mb,direction800mb,direction850mb,direction900mb,direction950mb,rh_950mb,rh_900mb,rh_850mb,rh_800mb,rh_750mb,rh_700mb,rh_650mb,rh_600mb,rh_550mb,rh_500mb,rh_450mb,rh_400mb,rh_350mb,rh_300mb,rh_250mb,rh_200mb,timepoint,cloudcover,highcloud,midcloud,lowcloud,temp2m,lifted_index,rh2m,msl_pressure,prec_amount,snow_depth,wind10m.direction,wind10m.speed
0,1791.00.00,1997.0,,F,27.878808,australia,Unprovoked,,Y,,3.375,3.34375,3.203125,3.1875,3.078125,3.0625,3.125,3.3125,3.5,3.5625,3.578125,3.5625,3.59375,3.828125,4.265625,5.046875,137.34375,121.796875,121.875,127.1875,125.0,103.59375,101.71875,94.296875,101.484375,84.53125,81.171875,73.046875,65.78125,64.921875,67.265625,72.8125,13.421875,11.5,8.109375,5.59375,4.140625,3.546875,3.171875,3.265625,3.046875,3.1875,3.5625,3.359375,2.53125,2.171875,4.265625,5.828125,97.5,3.015625,-9999.0,-9999.0,-9999.0,25.828125,-2.78125,10.359375,1016.171875,2.546875,0.0,74.53125,3.171875
1,1788.05.10,1997.0,May,M,27.878808,australia,Boat,Fishing,N,,3.375,3.34375,3.203125,3.1875,3.078125,3.0625,3.125,3.3125,3.5,3.5625,3.578125,3.5625,3.59375,3.828125,4.265625,5.046875,137.34375,121.796875,121.875,127.1875,125.0,103.59375,101.71875,94.296875,101.484375,84.53125,81.171875,73.046875,65.78125,64.921875,67.265625,72.8125,13.421875,11.5,8.109375,5.59375,4.140625,3.546875,3.171875,3.265625,3.046875,3.1875,3.5625,3.359375,2.53125,2.171875,4.265625,5.828125,97.5,3.015625,-9999.0,-9999.0,-9999.0,25.828125,-2.78125,10.359375,1016.171875,2.546875,0.0,74.53125,3.171875
2,0005.00.00,1997.0,,M,27.878808,australia,Unprovoked,,N,,3.375,3.34375,3.203125,3.1875,3.078125,3.0625,3.125,3.3125,3.5,3.5625,3.578125,3.5625,3.59375,3.828125,4.265625,5.046875,137.34375,121.796875,121.875,127.1875,125.0,103.59375,101.71875,94.296875,101.484375,84.53125,81.171875,73.046875,65.78125,64.921875,67.265625,72.8125,13.421875,11.5,8.109375,5.59375,4.140625,3.546875,3.171875,3.265625,3.046875,3.1875,3.5625,3.359375,2.53125,2.171875,4.265625,5.828125,97.5,3.015625,-9999.0,-9999.0,-9999.0,25.828125,-2.78125,10.359375,1016.171875,2.546875,0.0,74.53125,3.171875
3,ND-0139,1997.0,,F,15.0,australia,Unprovoked,,N,,3.375,3.34375,3.203125,3.1875,3.078125,3.0625,3.125,3.3125,3.5,3.5625,3.578125,3.5625,3.59375,3.828125,4.265625,5.046875,137.34375,121.796875,121.875,127.1875,125.0,103.59375,101.71875,94.296875,101.484375,84.53125,81.171875,73.046875,65.78125,64.921875,67.265625,72.8125,13.421875,11.5,8.109375,5.59375,4.140625,3.546875,3.171875,3.265625,3.046875,3.1875,3.5625,3.359375,2.53125,2.171875,4.265625,5.828125,97.5,3.015625,-9999.0,-9999.0,-9999.0,25.828125,-2.78125,10.359375,1016.171875,2.546875,0.0,74.53125,3.171875
4,ND-0138,1997.0,,M,27.878808,australia,Unprovoked,Fell into the water,Y,,3.375,3.34375,3.203125,3.1875,3.078125,3.0625,3.125,3.3125,3.5,3.5625,3.578125,3.5625,3.59375,3.828125,4.265625,5.046875,137.34375,121.796875,121.875,127.1875,125.0,103.59375,101.71875,94.296875,101.484375,84.53125,81.171875,73.046875,65.78125,64.921875,67.265625,72.8125,13.421875,11.5,8.109375,5.59375,4.140625,3.546875,3.171875,3.265625,3.046875,3.1875,3.5625,3.359375,2.53125,2.171875,4.265625,5.828125,97.5,3.015625,-9999.0,-9999.0,-9999.0,25.828125,-2.78125,10.359375,1016.171875,2.546875,0.0,74.53125,3.171875


In [74]:
df_union.shape

(1355, 71)
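
The shape confirms that no attack rows were lost: 1355 rows before and after the join, and 10 + 61 = 71 columns. A sketch of how the same checks could be made explicit, using two invented toy dataframes (`attacks` and `clima`) and the `validate` parameter of `merge`:

```python
import pandas as pd

# Invented stand-ins for df_attacks and the per-country climate means
attacks = pd.DataFrame({
    "case_number": ["a1", "a2", "a3"],
    "country": ["usa", "australia", "usa"],
})
clima = pd.DataFrame({
    "country": ["usa", "australia"],
    "temp2m": [12.2, 25.8],
})

# validate="many_to_one" raises a MergeError if a country appears more than once
# in the climate table, which would silently duplicate attack rows
merged = attacks.merge(clima, how="inner", on="country", validate="many_to_one")

# With an inner join, the row count only stays the same if every attack
# found a matching country in the climate table
assert merged.shape[0] == attacks.shape[0]
print(merged.shape)
```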

---

- Save the results in a csv that we will use in upcoming pair programming exercises.

In [75]:
df_union.to_csv('../files/datos_clima_attacks.csv')
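
By default `to_csv` also writes the index, which comes back as an extra `Unnamed: 0` column the next time the file is read without `index_col=0`. The sketch below shows the difference using an in-memory buffer instead of a real file; `toy` is invented, and whether to pass `index=False` is a matter of preference for the following exercises.

```python
import io
import pandas as pd

toy = pd.DataFrame({"country": ["usa", "australia"], "temp2m": [12.2, 25.8]})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()
print(cols_with)

# index=False writes only the data columns
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = pd.read_csv(without_index).columns.tolist()
print(cols_without)
```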