### Pair Programming – Cleaning

In [118]:
import pandas as pd
import numpy as np

In [119]:
# 1. When we started working with this dataframe, we already dropped some columns that did not seem relevant at first glance.
# Now it is time to drop a few more: remove the columns that are not useful for answering our questions.
# But careful ⚠️: work on a copy of the dataframe so that you do not destroy the original dataframe
# and lose the information.

df_copia = pd.read_csv("datos/attacks_pandas7.csv", index_col = 0)
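The cell above gets a fresh working frame by re-reading the CSV. When the original is already loaded in memory, `DataFrame.copy()` serves the same purpose. The following is a minimal sketch on a hypothetical toy frame (`original` and `working` are illustrative names, not part of the exercise):

```python
import pandas as pd

# Hypothetical toy frame standing in for the shark-attack data.
original = pd.DataFrame({"country": ["USA", "BRAZIL"], "year": [2018.0, 2018.0]})

# .copy() returns an independent object (a deep copy by default),
# so edits to `working` do not touch `original`.
working = original.copy()
working["country"] = working["country"].str.lower()
working = working.drop(columns=["year"])

print(original.columns.tolist())  # the original still has both columns
print(working.columns.tolist())
```

A plain assignment such as `working = original` would only create a second name for the same object, which is why the explicit copy matters here.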

In [120]:
df_copia.head(2)

Unnamed: 0,case_number,unnamed:_0,year,type,country,area,location,activity,name,sex_,...,time,species_,href,siglo,fatal_(y/n),injury,date,mes,fatal_(y/n)_limpio,sex
0,2018.06.25,0,2018.0,Boating,USA,California,"Oceanside, San Diego County",Paddling,Julie Wolfe,F,...,18h00,White shark,http://sharkattackfile.net/spreadsheets/pdf_di...,siglo XXI,N,"No injury to occupant, outrigger canoe and pad...",25-Jun-2018,['Jun'],N,Y
1,2018.06.03.a,6,2018.0,Unprovoked,BRAZIL,Pernambuco,"Piedade Beach, Recife",Swimming,Jose Ernesto da Silva,M,...,Late afternoon,Tiger shark,http://sharkattackfile.net/spreadsheets/pdf_di...,siglo XXI,Y,FATAL,03-Jun-2018,['Jun'],Y,Y


In [121]:
df_copia = df_copia.drop(["unnamed:_0", "type", "activity", "name", "time", "href", "fatal_(y/n)"], axis = 1)
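As a side note on the `drop` call above, pandas also accepts the `columns=` keyword instead of `axis = 1`, plus `errors="ignore"` to skip names that are not present. This can help when a cell is re-run after the columns are already gone. A small sketch with made-up data (the frame and the `not_a_column` name are assumptions for illustration):

```python
import pandas as pd

# Hypothetical one-row frame with a few of the column names used in this notebook.
toy = pd.DataFrame(
    {"case_number": ["2018.06.25"], "type": ["Boating"], "href": ["http://example.invalid"]}
)

# "not_a_column" does not exist; errors="ignore" skips it instead of raising KeyError.
to_drop = ["type", "href", "not_a_column"]
trimmed = toy.drop(columns=to_drop, errors="ignore")

print(trimmed.columns.tolist())
```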

In [122]:
# 2. Are there duplicated rows in our dataframe? If there are, remove them.

n_duplicates = df_copia.duplicated().sum()
print(f"There are {n_duplicates} duplicated rows, so {n_duplicates} need to be removed.")

There are 0 duplicated rows, so 0 need to be removed.
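This dataframe has no duplicates, so nothing needs removing. For reference, the sketch below shows how the check and the removal would look on a hypothetical frame that does contain a repeated row (all data here is invented for illustration):

```python
import pandas as pd

# Hypothetical frame: the first two rows are identical.
toy = pd.DataFrame(
    {"case_number": ["a", "a", "b"], "country": ["USA", "USA", "USA"]}
)

# duplicated() marks every repeat of an earlier row; the first occurrence is not flagged.
n_dupes = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row.
deduped = toy.drop_duplicates().reset_index(drop=True)

# subset= restricts the comparison to selected columns.
n_dupes_by_country = int(toy.duplicated(subset=["country"]).sum())

print(n_dupes, len(deduped), n_dupes_by_country)
```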


In [123]:
# 3. As we have seen, some columns do not have the data type they should. Change the data type of
# the year column.

df_copia.dtypes

case_number            object
year                  float64
country                object
area                   object
location               object
sex_                   object
age                    object
species_               object
siglo                  object
injury                 object
date                   object
mes                    object
fatal_(y/n)_limpio     object
sex                    object
dtype: object

In [124]:
df_copia.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7789 entries, 0 to 7788
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   case_number         7788 non-null   object 
 1   year                1672 non-null   float64
 2   country             1662 non-null   object 
 3   area                1626 non-null   object 
 4   location            1621 non-null   object 
 5   sex_                1658 non-null   object 
 6   age                 1518 non-null   object 
 7   species_            1546 non-null   object 
 8   siglo               1502 non-null   object 
 9   injury              6258 non-null   object 
 10  date                6286 non-null   object 
 11  mes                 6286 non-null   object 
 12  fatal_(y/n)_limpio  5678 non-null   object 
 13  sex                 7789 non-null   object 
dtypes: float64(1), object(13)
memory usage: 912.8+ KB


In [125]:
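# "Int64" (capital I) is the nullable integer dtype, so missing years become <NA> instead of raising.
# errors="ignore" is omitted: it would silently return the original column if the cast failed.
df_copia["year"] = df_copia["year"].astype("Int64")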

In [126]:
df_copia.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7789 entries, 0 to 7788
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   case_number         7788 non-null   object
 1   year                1672 non-null   Int64 
 2   country             1662 non-null   object
 3   area                1626 non-null   object
 4   location            1621 non-null   object
 5   sex_                1658 non-null   object
 6   age                 1518 non-null   object
 7   species_            1546 non-null   object
 8   siglo               1502 non-null   object
 9   injury              6258 non-null   object
 10  date                6286 non-null   object
 11  mes                 6286 non-null   object
 12  fatal_(y/n)_limpio  5678 non-null   object
 13  sex                 7789 non-null   object
dtypes: Int64(1), object(13)
memory usage: 920.4+ KB
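The conversion works despite the many missing years because `Int64` is the nullable integer dtype in pandas, whereas the plain NumPy `int64` cannot hold missing values. A minimal sketch on invented values follows. The `pd.to_numeric` step is an assumption about how text input could be handled, not something this notebook needs:

```python
import numpy as np
import pandas as pd

# Hypothetical float column with a missing value, like `year` before the cast.
years = pd.Series([2018.0, np.nan, 1999.0])

# The nullable dtype stores the gap as <NA> and the rest as integers.
years_int = years.astype("Int64")

# If the column held text, one option is to coerce unparseable entries to NaN first.
from_text = pd.to_numeric(pd.Series(["2018", "unknown"]), errors="coerce").astype("Int64")

print(years_int.dtype, int(years_int.isna().sum()))
print(from_text.tolist())
```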


In [127]:
df_copia.head(2)

Unnamed: 0,case_number,year,country,area,location,sex_,age,species_,siglo,injury,date,mes,fatal_(y/n)_limpio,sex
0,2018.06.25,2018,USA,California,"Oceanside, San Diego County",F,57,White shark,siglo XXI,"No injury to occupant, outrigger canoe and pad...",25-Jun-2018,['Jun'],N,Y
1,2018.06.03.a,2018,BRAZIL,Pernambuco,"Piedade Beach, Recife",M,18,Tiger shark,siglo XXI,FATAL,03-Jun-2018,['Jun'],Y,Y


In [128]:
# 4. Convert all the values in the country column to lowercase. Hint: you will need to use a function or a
# lambda.

# Option A: a named function.

def minusculas(col):
    # Lowercase strings only; anything else (e.g. NaN) is returned as a missing value.
    # Calling str() on NaN would store the literal text "nan" and hide the missing data.
    if isinstance(col, str):
        return col.lower()
    return np.nan

In [129]:
# Option B: a lambda, whose result is stored in a new column country_:

df_copia["country_"] = df_copia["country"].apply(lambda col: col.lower() if isinstance(col, str) else np.nan)

In [130]:
df_copia["country"] = df_copia["country"].apply(minusculas)

In [131]:
df_copia.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7789 entries, 0 to 7788
Data columns (total 15 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   case_number         7788 non-null   object
 1   year                1672 non-null   Int64 
 2   country             1662 non-null   object
 3   area                1626 non-null   object
 4   location            1621 non-null   object
 5   sex_                1658 non-null   object
 6   age                 1518 non-null   object
 7   species_            1546 non-null   object
 8   siglo               1502 non-null   object
 9   injury              6258 non-null   object
 10  date                6286 non-null   object
 11  mes                 6286 non-null   object
 12  fatal_(y/n)_limpio  5678 non-null   object
 13  sex                 7789 non-null   object
 14  country_            1662 non-null   object
dtypes: Int64(1), object(14)
memory usage: 981.2+ KB


In [132]:
df_copia.head()

Unnamed: 0,case_number,year,country,area,location,sex_,age,species_,siglo,injury,date,mes,fatal_(y/n)_limpio,sex,country_
0,2018.06.25,2018,usa,California,"Oceanside, San Diego County",F,57,White shark,siglo XXI,"No injury to occupant, outrigger canoe and pad...",25-Jun-2018,['Jun'],N,Y,usa
1,2018.06.03.a,2018,brazil,Pernambuco,"Piedade Beach, Recife",M,18,Tiger shark,siglo XXI,FATAL,03-Jun-2018,['Jun'],Y,Y,brazil
2,2018.05.26.b,2018,usa,Florida,"Cocoa Beach, Brevard County",M,15,"Bull shark, 6'",siglo XXI,Lower left leg bitten,26-May-2018,['May'],N,N,usa
3,2018.05.24,2018,australia,Queensland,Cairns Aquarium,M,32,Grey reef shark,siglo XXI,Minor bite to hand by captive shark. PROVOKED ...,24-May-2018,['May'],N,N,australia
4,2018.05.13.a,2018,england,Cornwall,Off Land's End,M,21,Invalid incident,siglo XXI,Injured by teeth of a dead porbeagle shark he ...,13-May-2018,['May'],N,N,england
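Besides a function or a lambda, pandas offers the vectorized `.str.lower()` accessor, which leaves missing values untouched. The sketch below uses invented data and also shows the pitfall of wrapping values in `str()` before lowercasing:

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value.
countries = pd.Series(["USA", np.nan, "BRAZIL"])

# Vectorized lowercasing: the missing entry stays missing.
lowered = countries.str.lower()

# Pitfall: str(NaN) is the text "nan", so the gap becomes a regular string.
stringified = countries.apply(lambda value: str(value).lower())

print(int(lowered.notna().sum()))      # non-null count is unchanged
print(int(stringified.notna().sum()))  # every row now looks non-null
```

This matters for the non-null counts reported by `info()`: a column that has been passed through `str()` appears complete even though the missing data is still there.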


In [133]:
# 5. Save the CSV so we can keep working on it in the next pair-programming cleaning exercise.

df_copia.to_csv("datos/attacks_pandas8.csv")
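By default `to_csv` writes the index as an unnamed first column, which is probably where columns such as `unnamed:_0` came from in earlier versions of this file. The sketch below uses an in-memory buffer and a toy frame (both are assumptions for illustration) to show how `index_col=0` restores the index on the next read:

```python
import io

import pandas as pd

# Hypothetical frame; an in-memory buffer replaces a file on disk.
toy = pd.DataFrame({"country": ["usa", "brazil"]})
buffer = io.StringIO()
toy.to_csv(buffer)  # the index is written as a first column with an empty header

# Reading without index_col turns that column into data named "Unnamed: 0".
buffer.seek(0)
with_extra = pd.read_csv(buffer)

# Reading with index_col=0 uses it as the index again, as this notebook does.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

print(with_extra.columns.tolist())
print(restored.columns.tolist())
```

Passing `index=False` to `to_csv` is the other common option when the index carries no information.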