# Handling errors & missing data

So far we have imported flat files with minor adjustments to set column names and manage the amount of data loaded.

This is enough when the data is already in good shape, but what happens if there are problems with the data or with importing it?

## Common problems

* Incorrect column data types, which can make analysis harder.

* Missing values indicated with custom designators.

* Records that pandas cannot read.

Fortunately, `read_csv` offers several ways to address these problems during import, which reduces the data wrangling needed later.


When importing data, pandas infers the data type of each column, but sometimes it infers the wrong one.

In [6]:
from urllib.request import urlretrieve
import pandas as pd

# assign the URL of the dataset
data = "https://assets.datacamp.com/production/repositories/4412/datasets/61bb27bf939aac4344d4f446ce6da1d1bf534174/vt_tax_data_2016.csv"
# download a copy of the file (saved to a temporary location)
urlretrieve(data)

# read the file into a DataFrame (read_csv can load directly from the URL)
df = pd.read_csv(data)

# show the data type of each column
df.dtypes

STATEFIPS     int64
STATE        object
zipcode       int64
agi_stub      int64
N1            int64
              ...  
A85300        int64
N11901        int64
A11901        int64
N11902        int64
A11902        int64
Length: 147, dtype: object

Checking the data types of the tax data, we see that pandas interpreted the ZIP codes as integers. However, they are modeled more accurately as strings: ZIP codes are not quantities, and they can include significant leading zeros.
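To see why integer inference is a problem for codes with leading zeros, here is a minimal sketch on a small in-memory CSV. The three sample rows are hypothetical and are not taken from the tax file.

```python
import io
import pandas as pd

# Hypothetical sample: ZIP codes that start with a significant zero.
csv_text = "zipcode,N1\n05001,10\n05030,20\n05401,30\n"

# Let pandas infer the column types on its own.
inferred = pd.read_csv(io.StringIO(csv_text))

print(inferred["zipcode"].dtype)     # an integer dtype
print(inferred["zipcode"].tolist())  # the leading zeros have been dropped
```

Once the column has been parsed as a number, the original text cannot be recovered reliably, so the type is best fixed at import time.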

Instead of letting pandas guess, we can set the data type of any or all columns with the `dtype` argument of `read_csv`.

`dtype` takes a dictionary in which each key is a column name and each value is the data type that column should have.

Next, we specify that `zipcode` is a string:

In [8]:
text_data = pd.read_csv(data, dtype={"zipcode": str})
text_data.dtypes

STATEFIPS     int64
STATE        object
zipcode      object
agi_stub      int64
N1            int64
              ...  
A85300        int64
N11901        int64
A11901        int64
N11902        int64
A11902        int64
Length: 147, dtype: object

Printing the data types shows that `zipcode` is now `object`, which is the pandas counterpart of Python strings.
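The `dtype` dictionary can name several columns at once. The following sketch assumes a hypothetical in-memory sample and sets one column to `str` and another to `float`, leaving the rest to be inferred.

```python
import io
import pandas as pd

# Hypothetical sample with the same column names as the tax data.
csv_text = "zipcode,agi_stub,N1\n05001,1,10\n05030,2,20\n"

typed = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"zipcode": str, "N1": float},  # agi_stub is still inferred
)

print(typed.dtypes)
print(typed["zipcode"].tolist())  # leading zeros are preserved as text
```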

# Customizing missing data values

Missing data is another common problem. pandas automatically recognizes some values as NA or NaN, which makes it possible to use its handy data-cleaning functions.

In our data, the records are sorted so that the first ones have ZIP code 0, which is not a valid code and should be treated as missing.

In [9]:
text_data.head()

Unnamed: 0,STATEFIPS,STATE,zipcode,agi_stub,N1,mars1,MARS2,MARS4,PREP,N2,NUMDEP,TOTAL_VITA,VITA,TCE,VITA_EIC,RAL,RAC,ELDERLY,A00100,N02650,A02650,N00200,A00200,N00300,A00300,N00600,A00600,N00650,A00650,N00700,A00700,N00900,A00900,N01000,A01000,N01400,A01400,N01700,A01700,SCHF,...,N07230,A07230,N07240,A07240,N07220,A07220,N07260,A07260,N09400,A09400,N85770,A85770,N85775,A85775,N09750,A09750,N10600,A10600,N59660,A59660,N59720,A59720,N11070,A11070,N10960,A10960,N11560,A11560,N06500,A06500,N10300,A10300,N85530,A85530,N85300,A85300,N11901,A11901,N11902,A11902
0,50,VT,0,1,111580,85090,14170,10740,45360,130630,26200,5900,2140,3760,860,1440,12620,30760,1314871,111570,1346018,82490,963138,23700,11699,14740,28241,13320,19296,1690,800,16620,113552,11790,10571,9940,55612,16860,137654,950,...,2100,1529,3910,647,1880,839,90,35,14140,19686,5100,18624,5020,17366,2690,1433,96880,181921,31090,59882,25370,52128,13020,14784,2880,2558,2530,2011,42510,29414,53660,50699,0,0,0,0,10820,9734,88260,138337
1,50,VT,0,2,82760,51960,18820,11310,35600,132950,32310,1670,840,830,30,680,13670,19160,3006587,82780,3066101,72330,2426615,20940,12790,13520,40635,12830,29294,7220,3592,12130,136240,10440,26072,8290,75651,13600,213453,940,...,4520,4809,12490,2434,13880,13041,960,622,10070,25291,7590,26044,8430,31171,5130,3154,80870,357207,12420,24615,9020,18327,8280,11187,2210,2005,1270,1286,71180,192875,74340,221146,0,0,0,0,12820,20029,68760,151729
2,50,VT,0,3,46270,19540,22650,3620,24140,91870,23610,170,0,170,0,0,4550,14920,2851399,46260,2906848,40370,2171527,18790,13201,11810,49417,10880,36257,10840,6784,8870,106641,9910,39322,6750,86142,10960,241188,330,...,2530,3432,2310,412,10480,15525,1780,1475,6760,19479,960,3912,1710,7384,900,905,45620,336580,0,0,0,0,570,759,2130,2140,0,0,44160,242908,44860,266097,0,0,0,0,10810,24499,34600,90583
3,50,VT,0,4,30070,5830,22190,960,16060,71610,18860,0,0,0,0,0,1880,10270,2617891,30050,2663815,26300,1931226,14670,11664,9180,46602,8590,35978,8360,6374,6280,93425,7330,43844,4840,81761,7730,208364,290,...,2270,3231,0,0,8260,13131,1170,1327,4770,15946,0,0,400,1607,110,189,29760,313992,0,0,0,0,0,0,1410,1455,0,0,29410,245476,29580,264678,0,0,0,0,7320,21573,21300,67045
4,50,VT,0,5,39530,3900,33800,590,22500,103710,30330,0,0,0,0,0,930,13600,5472982,39530,5574273,34740,3810688,24120,29196,17190,156496,16230,125983,14860,16227,8570,193032,14550,189103,7260,182370,10580,372597,0,...,2420,3719,0,0,6530,8430,1800,3309,6890,31938,0,0,0,0,0,0,39200,787504,0,0,0,0,0,0,2150,2223,0,0,39050,689575,39170,731963,40,24,0,0,12500,67761,23320,103034


We can tell pandas to treat these values as missing with the `na_values` argument, which accepts a single value, a list of values, or a dictionary that maps column names to the values to treat as missing in that column.
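The three forms of `na_values` can be compared on a small example. This is a sketch on a hypothetical in-memory CSV; the `note` column and its values are invented for illustration.

```python
import io
import pandas as pd

# Hypothetical sample: 0 appears in two columns, "unknown" in a third.
csv_text = "zipcode,N1,note\n0,10,unknown\n5001,0,ok\n5030,20,ok\n"

# Single value: every 0, in any column, becomes NaN.
single = pd.read_csv(io.StringIO(csv_text), na_values=0)

# List: any of the listed values, in any column, becomes NaN.
several = pd.read_csv(io.StringIO(csv_text), na_values=[0, "unknown"])

# Dictionary: 0 is treated as missing only in the zipcode column.
per_column = pd.read_csv(io.StringIO(csv_text), na_values={"zipcode": 0})

# Count the missing values per column for each variant.
print(single.isna().sum())
print(several.isna().sum())
print(per_column.isna().sum())
```

The dictionary form is the safest choice when a value such as 0 is invalid in one column but meaningful in others.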

Let's pass a dictionary specifying that zeros in the ZIP code column should be treated as missing data.

In [10]:
text_data = pd.read_csv(data, na_values={"zipcode": 0})
text_data.head()

Unnamed: 0,STATEFIPS,STATE,zipcode,agi_stub,N1,mars1,MARS2,MARS4,PREP,N2,NUMDEP,TOTAL_VITA,VITA,TCE,VITA_EIC,RAL,RAC,ELDERLY,A00100,N02650,A02650,N00200,A00200,N00300,A00300,N00600,A00600,N00650,A00650,N00700,A00700,N00900,A00900,N01000,A01000,N01400,A01400,N01700,A01700,SCHF,...,N07230,A07230,N07240,A07240,N07220,A07220,N07260,A07260,N09400,A09400,N85770,A85770,N85775,A85775,N09750,A09750,N10600,A10600,N59660,A59660,N59720,A59720,N11070,A11070,N10960,A10960,N11560,A11560,N06500,A06500,N10300,A10300,N85530,A85530,N85300,A85300,N11901,A11901,N11902,A11902
0,50,VT,,1,111580,85090,14170,10740,45360,130630,26200,5900,2140,3760,860,1440,12620,30760,1314871,111570,1346018,82490,963138,23700,11699,14740,28241,13320,19296,1690,800,16620,113552,11790,10571,9940,55612,16860,137654,950,...,2100,1529,3910,647,1880,839,90,35,14140,19686,5100,18624,5020,17366,2690,1433,96880,181921,31090,59882,25370,52128,13020,14784,2880,2558,2530,2011,42510,29414,53660,50699,0,0,0,0,10820,9734,88260,138337
1,50,VT,,2,82760,51960,18820,11310,35600,132950,32310,1670,840,830,30,680,13670,19160,3006587,82780,3066101,72330,2426615,20940,12790,13520,40635,12830,29294,7220,3592,12130,136240,10440,26072,8290,75651,13600,213453,940,...,4520,4809,12490,2434,13880,13041,960,622,10070,25291,7590,26044,8430,31171,5130,3154,80870,357207,12420,24615,9020,18327,8280,11187,2210,2005,1270,1286,71180,192875,74340,221146,0,0,0,0,12820,20029,68760,151729
2,50,VT,,3,46270,19540,22650,3620,24140,91870,23610,170,0,170,0,0,4550,14920,2851399,46260,2906848,40370,2171527,18790,13201,11810,49417,10880,36257,10840,6784,8870,106641,9910,39322,6750,86142,10960,241188,330,...,2530,3432,2310,412,10480,15525,1780,1475,6760,19479,960,3912,1710,7384,900,905,45620,336580,0,0,0,0,570,759,2130,2140,0,0,44160,242908,44860,266097,0,0,0,0,10810,24499,34600,90583
3,50,VT,,4,30070,5830,22190,960,16060,71610,18860,0,0,0,0,0,1880,10270,2617891,30050,2663815,26300,1931226,14670,11664,9180,46602,8590,35978,8360,6374,6280,93425,7330,43844,4840,81761,7730,208364,290,...,2270,3231,0,0,8260,13131,1170,1327,4770,15946,0,0,400,1607,110,189,29760,313992,0,0,0,0,0,0,1410,1455,0,0,29410,245476,29580,264678,0,0,0,0,7320,21573,21300,67045
4,50,VT,,5,39530,3900,33800,590,22500,103710,30330,0,0,0,0,0,930,13600,5472982,39530,5574273,34740,3810688,24120,29196,17190,156496,16230,125983,14860,16227,8570,193032,14550,189103,7260,182370,10580,372597,0,...,2420,3719,0,0,6530,8430,1800,3309,6890,31938,0,0,0,0,0,0,39200,787504,0,0,0,0,0,0,2150,2223,0,0,39050,689575,39170,731963,40,24,0,0,12500,67761,23320,103034


Then we filter the data with the `isna` method on the ZIP code column to see the rows with missing ZIP codes.

In [11]:
text_data[text_data.zipcode.isna()]

Unnamed: 0,STATEFIPS,STATE,zipcode,agi_stub,N1,mars1,MARS2,MARS4,PREP,N2,NUMDEP,TOTAL_VITA,VITA,TCE,VITA_EIC,RAL,RAC,ELDERLY,A00100,N02650,A02650,N00200,A00200,N00300,A00300,N00600,A00600,N00650,A00650,N00700,A00700,N00900,A00900,N01000,A01000,N01400,A01400,N01700,A01700,SCHF,...,N07230,A07230,N07240,A07240,N07220,A07220,N07260,A07260,N09400,A09400,N85770,A85770,N85775,A85775,N09750,A09750,N10600,A10600,N59660,A59660,N59720,A59720,N11070,A11070,N10960,A10960,N11560,A11560,N06500,A06500,N10300,A10300,N85530,A85530,N85300,A85300,N11901,A11901,N11902,A11902
0,50,VT,,1,111580,85090,14170,10740,45360,130630,26200,5900,2140,3760,860,1440,12620,30760,1314871,111570,1346018,82490,963138,23700,11699,14740,28241,13320,19296,1690,800,16620,113552,11790,10571,9940,55612,16860,137654,950,...,2100,1529,3910,647,1880,839,90,35,14140,19686,5100,18624,5020,17366,2690,1433,96880,181921,31090,59882,25370,52128,13020,14784,2880,2558,2530,2011,42510,29414,53660,50699,0,0,0,0,10820,9734,88260,138337
1,50,VT,,2,82760,51960,18820,11310,35600,132950,32310,1670,840,830,30,680,13670,19160,3006587,82780,3066101,72330,2426615,20940,12790,13520,40635,12830,29294,7220,3592,12130,136240,10440,26072,8290,75651,13600,213453,940,...,4520,4809,12490,2434,13880,13041,960,622,10070,25291,7590,26044,8430,31171,5130,3154,80870,357207,12420,24615,9020,18327,8280,11187,2210,2005,1270,1286,71180,192875,74340,221146,0,0,0,0,12820,20029,68760,151729
2,50,VT,,3,46270,19540,22650,3620,24140,91870,23610,170,0,170,0,0,4550,14920,2851399,46260,2906848,40370,2171527,18790,13201,11810,49417,10880,36257,10840,6784,8870,106641,9910,39322,6750,86142,10960,241188,330,...,2530,3432,2310,412,10480,15525,1780,1475,6760,19479,960,3912,1710,7384,900,905,45620,336580,0,0,0,0,570,759,2130,2140,0,0,44160,242908,44860,266097,0,0,0,0,10810,24499,34600,90583
3,50,VT,,4,30070,5830,22190,960,16060,71610,18860,0,0,0,0,0,1880,10270,2617891,30050,2663815,26300,1931226,14670,11664,9180,46602,8590,35978,8360,6374,6280,93425,7330,43844,4840,81761,7730,208364,290,...,2270,3231,0,0,8260,13131,1170,1327,4770,15946,0,0,400,1607,110,189,29760,313992,0,0,0,0,0,0,1410,1455,0,0,29410,245476,29580,264678,0,0,0,0,7320,21573,21300,67045
4,50,VT,,5,39530,3900,33800,590,22500,103710,30330,0,0,0,0,0,930,13600,5472982,39530,5574273,34740,3810688,24120,29196,17190,156496,16230,125983,14860,16227,8570,193032,14550,189103,7260,182370,10580,372597,0,...,2420,3719,0,0,6530,8430,1800,3309,6890,31938,0,0,0,0,0,0,39200,787504,0,0,0,0,0,0,2150,2223,0,0,39050,689575,39170,731963,40,24,0,0,12500,67761,23320,103034
5,50,VT,,6,9620,600,8150,0,7040,26430,8140,0,0,0,0,0,0,3320,4061746,9620,4136422,8010,1892110,7680,40146,6390,249078,6190,213156,2590,10839,2320,125969,6250,526171,1520,71135,1700,82145,0,...,0,0,0,0,0,0,210,741,2240,16466,0,0,0,0,0,0,9520,892121,0,0,0,0,0,0,0,0,0,0,9580,842156,9600,894432,3350,4939,4990,20428,3900,93123,2870,39425
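The same boolean mask can also be used to count missing values or to keep only the complete rows. A minimal sketch on a hypothetical in-memory sample:

```python
import io
import pandas as pd

# Hypothetical sample: two records use 0 as a placeholder ZIP code.
csv_text = "zipcode,N1\n0,10\n0,20\n5001,30\n"
sample = pd.read_csv(io.StringIO(csv_text), na_values={"zipcode": 0})

# Boolean Series: True where the ZIP code is missing.
missing_mask = sample["zipcode"].isna()

print(missing_mask.sum())     # number of rows with a missing ZIP code
print(sample[~missing_mask])  # rows that do have a ZIP code
```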


A final problem we may face is lines that pandas simply cannot parse.

For example, a record could have more values than there are columns. To handle this, `read_csv` has two arguments (both deprecated in pandas 1.3 and removed in 2.0 in favor of `on_bad_lines`):

* `error_bad_lines=False` - skips records that cannot be parsed instead of raising an error.
* `warn_bad_lines=True` - tells pandas whether to show a message when lines that cannot be parsed are skipped.
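A minimal sketch of skipping unparseable lines with `on_bad_lines`, which requires pandas 1.3 or later. The in-memory sample is hypothetical; its third line has one value too many.

```python
import io
import pandas as pd

# Hypothetical sample: the second record has three values for two columns.
csv_text = "zipcode,N1\n05001,10\n05030,20,999\n05401,30\n"

# By default, pandas raises an error when it reaches the malformed line.
try:
    pd.read_csv(io.StringIO(csv_text), dtype={"zipcode": str})
except pd.errors.ParserError as exc:
    print("Default behavior raises:", exc)

# on_bad_lines="skip" drops the malformed record silently;
# on_bad_lines="warn" would drop it and emit a warning.
clean = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"zipcode": str},
    on_bad_lines="skip",
)
print(clean)
```

Skipping is convenient, but it is worth checking how many records were dropped before relying on the result.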

