# Challenge: Multiple variables stored in one column

We now turn to the following dataset: tuberculosis records from the World Health Organization.
This dataset documents the count of confirmed tuberculosis cases by country, year, age and sex.

## Problems:
- Some columns combine several variables in their names: sex and age group (for example, `m014`).
- A mix of zeros and missing values (NaN). This stems from the data collection process, and the distinction matters for this dataset.
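
The difference between a reported zero and a missing value can be checked directly in pandas. The following is a minimal sketch on hypothetical toy values (the series `cases` is made up for illustration and is not read from `tb-raw.csv`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 0 means "zero cases reported", NaN means "nothing reported".
cases = pd.Series([0.0, np.nan, 3.0], index=["AD", "AS", "AE"], name="m014")

reported_zero = cases.eq(0)   # True only where a zero was actually reported
not_reported = cases.isna()   # True only where the value is missing

# sum() and count() skip NaN by default, so missing reports are not treated as zeros.
total_reported = cases.sum()
n_reports = cases.count()
```

Filling the NaN values with 0 would erase this distinction, so it seems safer to leave them as they are while reshaping.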

In [1]:
import pandas as pd

In [2]:
tb = pd.read_csv("tb-raw.csv")
tb

Unnamed: 0,country,year,m014,m1524,m2534,m3544,m4554,m5564,m65,mu,f014
0,AD,2000,0.0,0.0,1.0,0.0,0,0,0.0,,
1,AE,2000,2.0,4.0,4.0,6.0,5,12,10.0,,3.0
2,AF,2000,52.0,228.0,183.0,149.0,129,94,80.0,,93.0
3,AG,2000,0.0,0.0,0.0,0.0,0,0,1.0,,1.0
4,AL,2000,2.0,19.0,21.0,14.0,24,19,16.0,,3.0
5,AM,2000,2.0,152.0,130.0,131.0,63,26,21.0,,1.0
6,AN,2000,0.0,0.0,1.0,2.0,0,0,0.0,,0.0
7,AO,2000,186.0,999.0,1003.0,912.0,482,312,194.0,,247.0
8,AR,2000,97.0,278.0,594.0,402.0,419,368,330.0,,121.0
9,AS,2000,,,,,1,1,,,


To tidy this dataset, we need to take the different values out of the header and unpivot them into rows.

### Instructions:
- First, we need to merge (melt) the sex + age-group columns into a single column.
- Once we have that single column, we derive three columns from it: sexo, age_lower and age_upper. With these, we can build a properly tidy dataset.
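
The two steps above can be sketched end to end on a small frame. This is only one possible approach, not necessarily the intended solution: the frame `raw`, the column name `sex_age` and the regular expression are assumptions made for illustration, and the pattern only covers codes shaped like `m014`, `m1524` and `m65`.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row excerpt with the same column layout as the raw table.
raw = pd.DataFrame({
    "country": ["AD", "AE", "AS"],
    "year": [2000, 2000, 2000],
    "m014": [0.0, 2.0, np.nan],
    "m1524": [0.0, 4.0, np.nan],
    "m65": [0.0, 10.0, np.nan],
})

# Step 1: unpivot the sex + age-group columns into one column.
long = raw.melt(id_vars=["country", "year"], var_name="sex_age", value_name="cases")

# Step 2: derive sexo, age_lower and age_upper from that single column.
# Pattern: one letter, a lower bound ("0" or two digits), an optional two-digit upper bound.
pattern = r"^(?P<sexo>[mf])(?P<age_lower>0|\d{2})(?P<age_upper>\d{2})?$"
parts = long["sex_age"].str.extract(pattern)

tidy = pd.concat([long.drop(columns="sex_age"), parts], axis=1)
```

In this sketch an open-ended group such as `m65` gets a missing `age_upper`, and the NaN case counts are carried through unchanged.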

In [22]:
# Use melt, as in the first challenge, to unpivot the sex + age-group columns
tb1=pd.melt(frame=tb, id_vars=["country","year"], var_name="sexo + grupo de edad", value_name="cases")
tb1

Unnamed: 0,country,year,sexo + grupo de edad,cases
0,AD,2000,m014,0.0
1,AE,2000,m014,2.0
2,AF,2000,m014,52.0
3,AG,2000,m014,0.0
4,AL,2000,m014,2.0
...,...,...,...,...
85,AM,2000,f014,1.0
86,AN,2000,f014,0.0
87,AO,2000,f014,247.0
88,AR,2000,f014,121.0
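
When `value_vars` is omitted, `melt` unpivots every column that is not listed in `id_vars`. A hedged sketch of selecting the measure columns explicitly follows; the frame `wide` and its extra `comment` column are hypothetical and only serve to show the effect.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one extra column that should not be unpivoted.
wide = pd.DataFrame({
    "country": ["AD", "AE"],
    "year": [2000, 2000],
    "comment": ["ok", "ok"],
    "m014": [0.0, 2.0],
    "f014": [np.nan, 3.0],
})

# Keep only columns named like m014, f1524, m65 or mu.
value_cols = wide.filter(regex=r"^[mf](\d+|u)$").columns.tolist()

long = wide.melt(
    id_vars=["country", "year"],
    value_vars=value_cols,
    var_name="sexo + grupo de edad",
    value_name="cases",
)
```

The result has one row per country, year and measure column, so its length is the number of input rows times the number of selected columns.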


In [45]:
# Strip surrounding whitespace from the labels; the result must be assigned back to take effect
tb1["sexo + grupo de edad"]=tb1["sexo + grupo de edad"].str.strip()
tb1

Unnamed: 0,country,year,sexo + grupo de edad,cases
0,AD,2000,m014,0.0
1,AE,2000,m014,2.0
2,AF,2000,m014,52.0
3,AG,2000,m014,0.0
4,AL,2000,m014,2.0
...,...,...,...,...
85,AM,2000,f014,1.0
86,AN,2000,f014,0.0
87,AO,2000,f014,247.0
88,AR,2000,f014,121.0
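
String methods never change a value in place, so a stripped result only survives if it is assigned. A minimal sketch with hypothetical labels (the stray spaces are added here on purpose; the labels shown above do not appear to contain any):

```python
import pandas as pd

# Hypothetical labels with stray whitespace, to make the effect visible.
labels = pd.Series([" m014", "f014 ", "mu"])

# Calling strip() on each element and discarding the result changes nothing.
for value in labels:
    value.strip()
unchanged = labels.tolist()

# The vectorized accessor returns a new Series, which is kept by assigning it.
cleaned = labels.str.strip()
```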


In [17]:
# Use str.extract, as in the first challenge, to separate sex from the age range
# (\D+) captures the leading letters, (\d+)? the digits when there are any
tb1Expanded=tb1["sexo + grupo de edad"].str.extract(r"(\D+)(\d+)?")
tb1Expanded.columns=["sexo","grupo de edad"]
tb1Expanded


Unnamed: 0,sexo,grupo de edad
0,m,014
1,m,014
2,m,014
3,m,014
4,m,014
...,...,...
85,f,014
86,f,014
87,f,014
88,f,014
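
Named groups give the extracted columns their names directly. The sketch below assumes that the trailing `u` in `mu` marks an unknown age group, so it treats `mu` as sex `m` plus age code `u`; that reading is an assumption and differs from the cell above, which keeps `mu` whole.

```python
import pandas as pd

labels = pd.Series(["m014", "m1524", "m65", "mu", "f014"], name="sexo + grupo de edad")

# First group: the sex letter. Second group: the digits, or "u" (assumed: unknown age).
parts = labels.str.extract(r"^(?P<sexo>[mf])(?P<grupo>\d+|u)$")
parts = parts.rename(columns={"grupo": "grupo de edad"})
```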


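In [63]:
# drop() returns a new DataFrame; with inplace=True it would return None instead
tb1_clear=tb1.drop(columns=["sexo + grupo de edad"])


In [66]:
print(tb1_clear.columns.tolist())

['country', 'year', 'cases']


In [64]:
tb2=pd.concat([tb1_clear, tb1Expanded ],axis=1)
tb2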

Unnamed: 0,country,year,cases,sexo,grupo de edad
0,AD,2000,0.0,m,014
1,AE,2000,2.0,m,014
2,AF,2000,52.0,m,014
3,AG,2000,0.0,m,014
4,AL,2000,2.0,m,014
...,...,...,...,...,...
85,AM,2000,1.0,f,014
86,AN,2000,0.0,f,014
87,AO,2000,247.0,f,014
88,AR,2000,121.0,f,014
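
This step relies on two pandas behaviours: what `drop` returns, and how `concat(axis=1)` lines rows up. A small sketch on hypothetical frames (`frame` and `extra` are made up for illustration):

```python
import pandas as pd

frame = pd.DataFrame({"label": ["m014", "f014"], "cases": [0.0, 3.0]})

# Without inplace, drop() leaves frame untouched and returns a new DataFrame.
trimmed = frame.drop(columns=["label"])

# With inplace=True, drop() modifies its target and returns None.
scratch = frame.copy()
returned = scratch.drop(columns=["label"], inplace=True)

# concat(axis=1) matches rows by index label; the labels differ here on purpose.
extra = pd.DataFrame({"sexo": ["m", "f"]}, index=[0, 5])
by_label = pd.concat([trimmed, extra], axis=1)  # three rows, with NaN gaps
by_position = pd.concat([trimmed, extra.reset_index(drop=True)], axis=1)
```

In the notebook both frames share the default index 0..n-1, so matching by label and matching by position give the same result.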


In [47]:
tb2_age=tb2["grupo de edad"].str.split("", expand=True)
tb2_age

Unnamed: 0,0,1,2,3,4,5
0,,0,1,4,,
1,,0,1,4,,
2,,0,1,4,,
3,,0,1,4,,
4,,0,1,4,,
...,...,...,...,...,...,...
85,,0,1,4,,
86,,0,1,4,,
87,,0,1,4,,
88,,0,1,4,,
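
Splitting on an empty string yields one column per character, which then has to be glued back together. As an alternative sketch, string slicing keeps the two bounds whole, assuming every code ends in a complete two-digit number as in `014`, `1524` and `65`:

```python
import pandas as pd

age_codes = pd.Series(["014", "1524", "65"])

# The last two characters form one bound; whatever precedes them (possibly nothing) is the other.
head = age_codes.str[:-2]
tail = age_codes.str[-2:]
```

For `65` the head is empty, which signals that the code carries a single bound.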


In [8]:
tb_age=pd.concat([tb1Expanded, tb2_age ],axis=1)
tb_age

Unnamed: 0,sexo,grupo de edad,0,1,2,3,4,5
0,m,014,,0,1,4,,
1,m,014,,0,1,4,,
2,m,014,,0,1,4,,
3,m,014,,0,1,4,,
4,m,014,,0,1,4,,
...,...,...,...,...,...,...,...,...
85,f,014,,0,1,4,,
86,f,014,,0,1,4,,
87,f,014,,0,1,4,,
88,f,014,,0,1,4,,


In [40]:
tb1["sexo + grupo de edad"][78]

'mu'

In [62]:
# Column 0 of tb2_age is always empty, so "Age lower" is the first digit and
# "Age upper" the next two. This is right for three-digit codes such as "014",
# but codes such as "1524" or "65" need different slicing.
tb2_age["Age lower"]=tb2_age[0].map(str)+tb2_age[1]
tb2_age["Age upper"]=tb2_age[2].map(str)+tb2_age[3]

tb2_age["Age"]=tb2_age["Age lower"].map(str)+"-"+tb2_age["Age upper"]
tb2_age.sort_values(["Age"], ascending=False)


Unnamed: 0,0,1,2,3,4,5,Age lower,Age upper,Age
68,,6,5,,,,6,5,6-5
67,,6,5,,,,6,5,6-5
66,,6,5,,,,6,5,6-5
65,,6,5,,,,6,5,6-5
64,,6,5,,,,6,5,6-5
...,...,...,...,...,...,...,...,...,...
75,,,,,,,,,
76,,,,,,,,,
77,,,,,,,,,
78,,,,,,,,,
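
The `6-5` labels in this output come from a two-digit code being cut one character at a time. A hedged sketch of deriving the bounds without per-character columns follows. The `65+` label for the open-ended group is an assumption made here, and `None` stands in for rows without digits, such as `mu`.

```python
import pandas as pd

age_codes = pd.Series(["014", "1524", "65", None])

# Two-character codes such as "65" carry only a lower bound.
open_ended = age_codes.str.len().eq(2)
head = age_codes.str[:-2]
tail = age_codes.str[-2:]

lower_txt = head.mask(open_ended, tail)  # "0", "15", "65", missing
upper_txt = tail.mask(open_ended)        # "14", "24", missing, missing

age_lower = pd.to_numeric(lower_txt)
age_upper = pd.to_numeric(upper_txt)

# Readable label: "lower-upper", or "lower+" for the open-ended group (assumed convention).
age = (lower_txt + "-" + tail).mask(open_ended, lower_txt + "+")
```

Keeping `age_lower` and `age_upper` numeric also makes later filtering and sorting by age straightforward.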


In [27]:
tb_f=pd.concat([tb2, tb2_age["Age"]],axis=1)
tb_f.drop(columns=["grupo de edad"],inplace=True)

In [28]:
tb_f

Unnamed: 0,country,year,cases,sexo,Age
0,AD,2000,0.0,m,0-14
1,AE,2000,2.0,m,0-14
2,AF,2000,52.0,m,0-14
3,AG,2000,0.0,m,0-14
4,AL,2000,2.0,m,0-14
...,...,...,...,...,...
85,AM,2000,1.0,f,0-14
86,AN,2000,0.0,f,0-14
87,AO,2000,247.0,f,0-14
88,AR,2000,121.0,f,0-14


In [29]:
tb_f.sort_values(["Age"],ascending=False)

Unnamed: 0,country,year,cases,sexo,Age
68,AR,2000,330.0,m,6-5
67,AO,2000,194.0,m,6-5
66,AN,2000,0.0,m,6-5
65,AM,2000,21.0,m,6-5
64,AL,2000,16.0,m,6-5
...,...,...,...,...,...
75,AM,2000,,mu,
76,AN,2000,,mu,
77,AO,2000,,mu,
78,AR,2000,,mu,
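
Sorting on a text label compares characters, not numbers, so the order can be misleading once a lower bound has a different number of digits. A small sketch with hypothetical labels (the frame `ages` is not taken from the dataset):

```python
import pandas as pd

# Hypothetical labels chosen to show the difference between the two orderings.
ages = pd.DataFrame({
    "Age": ["5-14", "15-24", "65+", "0-4"],
    "age_lower": [5, 15, 65, 0],
})

as_text = ages.sort_values("Age", ascending=False)["Age"].tolist()
as_number = ages.sort_values("age_lower", ascending=False)["Age"].tolist()
```

Sorting by a numeric lower bound gives the expected age order, whereas the text sort places `5-14` ahead of `15-24`.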


## Result
When you finish, you should get a dataframe similar to this one:

![image.png](https://storage.googleapis.com/campus-cvs/lectures/tidyDataChallenge2.PNG)