# Data Cleaning: Shark Attacks
The mission of the Global Shark Attack File is to provide current and historical data on interactions between humans and sharks to those seeking accurate, meaningful information and verifiable references.

Humans are not on the sharks' menu. Sharks bite humans out of curiosity or in self-defense.


- A **provoked incident** is defined as one in which the shark was speared, hooked, or captured, or in which a human drew "first blood". We know that a living human is rarely perceived as prey by a shark. Many incidents are motivated by curiosity; others may occur when a shark perceives a human as a threat or as a competitor for a food source, and could be classified as "provoked" when examined from the shark's perspective.

- **Incidents involving boats**: incidents in which a boat was bitten or rammed by a shark are shown in green. However, in cases where the shark was hooked, entangled, or tied up, the entry is orange because these are classified as provoked incidents.

- **Questionable incidents**: incidents in which there is not enough data to determine whether the injury was caused by a shark or whether the person drowned and the body was later eaten by sharks.

Source: https://sharkattackfile.net/

### Modules

In [24]:
import pandas as pd
import numpy as np
import datetime


In [25]:
attacks = pd.read_csv('https://raw.githubusercontent.com/PPereyraAN/Cursos/main/attacks.csv',encoding = 'latin-1', sep = "|")
attacks

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,...,Species,Investigator or Source,pdf,href formula,href,Case Number.1,Case Number.2,original order,Unnamed: 22,Unnamed: 23
0,2018.06.25,25-Jun-2018,2018.0,Boating,USA,California,Oceanside| San Diego County,Paddling,Julie Wolfe,F,...,White shark,R. Collier| GSAF,2018.06.25-Wolfe.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.25,2018.06.25,6303.0,,
1,2018.06.18,18-Jun-2018,2018.0,Unprovoked,USA,Georgia,St. Simon Island| Glynn County,Standing,Adysonï¿½McNeely,F,...,,K.McMurray| TrackingSharks.com,2018.06.18-McNeely.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.18,2018.06.18,6302.0,,
2,2018.06.09,09-Jun-2018,2018.0,Invalid,USA,Hawaii,Habush| Oahu,Surfing,John Denges,M,...,,K.McMurray| TrackingSharks.com,2018.06.09-Denges.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.09,2018.06.09,6301.0,,
3,2018.06.08,08-Jun-2018,2018.0,Unprovoked,AUSTRALIA,New South Wales,Arrawarra Headland,Surfing,male,M,...,2 m shark,B. Myatt| GSAF,2018.06.08-Arrawarra.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.08,2018.06.08,6300.0,,
4,2018.06.04,04-Jun-2018,2018.0,Provoked,MEXICO,Colima,La Ticla,Free diving,Gustavo Ramos,M,...,Tiger shark| 3m,A .Kipper,2018.06.04-Ramos.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.04,2018.06.04,6299.0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25718,,,,,,,,,,,...,,,,,,,,,,
25719,,,,,,,,,,,...,,,,,,,,,,
25720,,,,,,,,,,,...,,,,,,,,,,
25721,,,,,,,,,,,...,,,,,,,,,,


We examine the dataset, tackling the null values first.

In [26]:
attacks.shape

(25723, 24)

In [27]:
attacks.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25723 entries, 0 to 25722
Data columns (total 24 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Case Number             8702 non-null   object 
 1   Date                    6302 non-null   object 
 2   Year                    6300 non-null   float64
 3   Type                    6298 non-null   object 
 4   Country                 6252 non-null   object 
 5   Area                    5847 non-null   object 
 6   Location                5762 non-null   object 
 7   Activity                5758 non-null   object 
 8   Name                    6092 non-null   object 
 9   Sex                     5737 non-null   object 
 10  Age                     3471 non-null   object 
 11  Injury                  6274 non-null   object 
 12  Fatal (Y/N)             5763 non-null   object 
 13  Time                    2948 non-null   object 
 14  Species                 3464 non-null 

In [28]:
attacks.describe()

| | Year | original order |
|---|---|---|
| count | 6300.0 | 6309.0 |
| mean | 1927.272381 | 3155.999683 |
| std | 281.116308 | 1821.396206 |
| min | 0.0 | 2.0 |
| 25% | 1942.0 | 1579.0 |
| 50% | 1977.0 | 3156.0 |
| 75% | 2005.0 | 4733.0 |
| max | 2018.0 | 6310.0 |


Percentage of null values per column


In [29]:
def nullPercentagePerColunm():
    return round((attacks.isna().sum()*100/attacks.shape[0]).sort_values(ascending=False),3)

We observe that the percentage of nulls exceeds 50% in every column.
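As a small illustration of the null-percentage calculation, here is a sketch of a variant that takes the DataFrame as an argument instead of reading the global `attacks`. The name `null_percentage_per_column`, the `decimals` parameter, and the toy data are assumptions made for this example; they are not part of the original notebook.

```python
import numpy as np
import pandas as pd


def null_percentage_per_column(df: pd.DataFrame, decimals: int = 3) -> pd.Series:
    """Return the percentage of null values in each column, highest first."""
    # isna().mean() gives the fraction of nulls per column; times 100 is a percentage.
    return (df.isna().mean() * 100).round(decimals).sort_values(ascending=False)


# Hypothetical toy data, not taken from the shark attack file.
toy = pd.DataFrame(
    {
        "a": [1.0, np.nan, np.nan, np.nan],  # 3 of 4 values missing
        "b": ["x", "y", None, None],         # 2 of 4 values missing
        "c": [1, 2, 3, 4],                   # nothing missing
    }
)

toy_null_pct = null_percentage_per_column(toy)
print(toy_null_pct)
```

Passing the frame explicitly makes it possible to compare the percentages before and after each cleaning step without depending on the state of a global variable.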

In [30]:
# drop the columns that are almost 100% null
attacks.drop(axis=1,labels=["Unnamed: 22","Unnamed: 23"],inplace=True)

In [31]:
# drop the rows that are entirely null
attacks.dropna(axis=0,how="all",inplace=True)
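To make the effect of `how="all"` concrete, the sketch below contrasts it with `how="any"` on a hypothetical three-row frame (the values are illustrative only). With `how="all"` only rows in which every column is null are removed, so partially filled rows survive.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one complete row, one empty row, one partially filled row.
toy = pd.DataFrame(
    {
        "Date": ["25-Jun-2018", None, "18-Jun-2018"],
        "Year": [2018.0, np.nan, np.nan],
    }
)

# how="all": drop a row only if every value in it is null.
only_empty_rows_removed = toy.dropna(axis=0, how="all")

# how="any": drop a row as soon as a single value in it is null.
any_null_rows_removed = toy.dropna(axis=0, how="any")

print(len(only_empty_rows_removed), len(any_null_rows_removed))
```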

In [32]:
attacks.shape

(8703, 22)

In [33]:
nullPercentagePerColunm()

Time                      66.127
Species                   60.198
Age                       60.117
Sex                       34.080
Activity                  33.839
Location                  33.793
Fatal (Y/N)               33.781
Area                      32.816
Name                      30.001
Country                   28.163
Injury                    27.910
Investigator or Source    27.784
Type                      27.634
Year                      27.611
href formula              27.600
Date                      27.588
Case Number.2             27.588
pdf                       27.588
href                      27.588
Case Number.1             27.588
original order            27.508
Case Number                0.011
dtype: float64

In [34]:
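# keep only rows with at least 80% non-null values: thresh counts non-null cells,
# so with 22 columns a row needs at least 18 of them (rows with more than ~20% nulls are dropped)
MIN_NON_NULL = int(round(80*attacks.shape[1]/100,0))
attacks.dropna(axis=0,inplace=True,thresh=MIN_NON_NULL)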
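Because `thresh` in `dropna` is the minimum number of non-null values a row must have in order to be kept, it is worth checking on a small example which rows actually survive. The sketch below uses a hypothetical five-column frame (illustrative data only) and the same 80% calculation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: rows 0..3 have 5, 4, 3 and 1 non-null values respectively.
toy = pd.DataFrame(
    {
        "c1": [1, np.nan, np.nan, np.nan],
        "c2": [1, 1, np.nan, np.nan],
        "c3": [1, 1, 1, np.nan],
        "c4": [1, 1, 1, np.nan],
        "c5": [1, 1, 1, 1],
    }
)

# 80% of 5 columns is 4, so a row needs at least 4 non-null values to be kept.
min_non_null = int(round(80 * toy.shape[1] / 100, 0))
kept = toy.dropna(axis=0, thresh=min_non_null)

print(min_non_null)
print(kept.index.tolist())
```

Only the rows with at most one null (20% of the columns) remain, which shows that the threshold expresses a required share of filled cells rather than a tolerated share of nulls.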

In [35]:
nullPercentagePerColunm()

Time                      50.085
Species                   42.935
Age                       41.181
Fatal (Y/N)                7.542
Activity                   5.941
Sex                        5.737
Location                   5.346
Area                       4.086
Name                       1.055
Investigator or Source     0.221
Country                    0.170
Injury                     0.136
Year                       0.034
href formula               0.017
Case Number                0.017
Type                       0.017
Date                       0.000
pdf                        0.000
href                       0.000
Case Number.1              0.000
Case Number.2              0.000
original order             0.000
dtype: float64

This reduces the percentage of nulls considerably.


In [36]:
attacks.head()

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,...,Fatal (Y/N),Time,Species,Investigator or Source,pdf,href formula,href,Case Number.1,Case Number.2,original order
0,2018.06.25,25-Jun-2018,2018.0,Boating,USA,California,Oceanside| San Diego County,Paddling,Julie Wolfe,F,...,N,18h00,White shark,R. Collier| GSAF,2018.06.25-Wolfe.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.25,2018.06.25,6303.0
1,2018.06.18,18-Jun-2018,2018.0,Unprovoked,USA,Georgia,St. Simon Island| Glynn County,Standing,Adysonï¿½McNeely,F,...,N,14h00 -15h00,,K.McMurray| TrackingSharks.com,2018.06.18-McNeely.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.18,2018.06.18,6302.0
2,2018.06.09,09-Jun-2018,2018.0,Invalid,USA,Hawaii,Habush| Oahu,Surfing,John Denges,M,...,N,07h45,,K.McMurray| TrackingSharks.com,2018.06.09-Denges.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.09,2018.06.09,6301.0
3,2018.06.08,08-Jun-2018,2018.0,Unprovoked,AUSTRALIA,New South Wales,Arrawarra Headland,Surfing,male,M,...,N,,2 m shark,B. Myatt| GSAF,2018.06.08-Arrawarra.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.08,2018.06.08,6300.0
4,2018.06.04,04-Jun-2018,2018.0,Provoked,MEXICO,Colima,La Ticla,Free diving,Gustavo Ramos,M,...,N,,Tiger shark| 3m,A .Kipper,2018.06.04-Ramos.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.04,2018.06.04,6299.0


#### Analyzing the columns
We decide which data will actually be useful (this is quite subjective and depends on what the data will be used for). In my case I am preparing the data so that an analyst can draw conclusions from it, so it makes sense to keep numeric values, case classifications, and other data relevant to that analysis.
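One possible aid when deciding which columns to keep is a per-column summary of type, share of nulls, and number of distinct values. The helper `column_overview` and the toy data below are assumptions for illustration; the original notebook makes this choice by inspection.

```python
import numpy as np
import pandas as pd


def column_overview(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize each column: dtype, percentage of nulls, number of distinct values."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "null_pct": (df.isna().mean() * 100).round(1),
            "n_unique": df.nunique(dropna=True),
        }
    )


# Hypothetical toy data shaped like two of the dataset's columns.
toy = pd.DataFrame(
    {
        "Type": ["Unprovoked", "Provoked", "Unprovoked", None],
        "Year": [2018.0, 2018.0, 2017.0, np.nan],
    }
)

overview = column_overview(toy)
print(overview)
```

Columns with few distinct values tend to be usable as categories, while columns with almost as many distinct values as rows (names, free-text locations) are usually less useful for aggregate analysis.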

In [41]:
attacks.shape

(5874, 23)

In [48]:
# several columns carry date information; we keep the one whose format varies the least
# we coerce each column to datetime and count how many values fail to parse (NaT),
# so we know which of the date columns to keep
print("nulls in Date: ",pd.to_datetime(attacks["Date"],errors="coerce").isna().sum())
print("nulls in Case Number: ",pd.to_datetime(attacks["Case Number"],errors="coerce").isna().sum())
print("nulls in Case Number.1: ",pd.to_datetime(attacks["Case Number.1"],errors="coerce").isna().sum())
print("nulls in Case Number.2: ",pd.to_datetime(attacks["Case Number.2"],errors="coerce").isna().sum())

nulls in Date:  684
nulls in Case Number:  2222
nulls in Case Number.1:  2220
nulls in Case Number.2:  2222


In [None]:
# we keep the Date column because it has the fewest unparseable dates
attacks.drop(axis = 1,inplace=True,labels=["Case Number","Case Number.1","Case Number.2"])
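The comparison used to choose the date column can be sketched as a small helper that counts the values that do not parse. Note the assumptions: the helper name `count_unparseable` is hypothetical, the toy values are made up to imitate the two styles seen above, and an explicit `format` is passed here for predictability, whereas the notebook lets pandas infer the format.

```python
import pandas as pd


def count_unparseable(series: pd.Series, fmt: str) -> int:
    """Count values that do not parse as dates in the given format (nulls included)."""
    # errors="coerce" turns anything that does not match the format into NaT.
    parsed = pd.to_datetime(series, format=fmt, errors="coerce")
    return int(parsed.isna().sum())


# Hypothetical toy values, not real rows from the dataset.
toy = pd.DataFrame(
    {
        "Date": ["25-Jun-2018", "18-Jun-2018", "Reported 09-Jun-2018"],
        "Case Number": ["2018.06.25", "2018.06.18.b", "ND.0001"],
    }
)

unparseable = {
    "Date": count_unparseable(toy["Date"], "%d-%b-%Y"),
    "Case Number": count_unparseable(toy["Case Number"], "%Y.%m.%d"),
}
print(unparseable)
```

The column with the lowest count loses the least information when converted to a datetime type, which is the criterion applied above.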

In [59]:
attacks.head()

Unnamed: 0,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,Age,Injury,Fatal (Y/N),Time,Species,Investigator or Source,pdf,href formula,href,original order
0,25-Jun-2018,2018.0,Boating,USA,California,Oceanside| San Diego County,Paddling,Julie Wolfe,F,57.0,No injury to occupant| outrigger canoe and pad...,N,18h00,White shark,R. Collier| GSAF,2018.06.25-Wolfe.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,6303.0
1,18-Jun-2018,2018.0,Unprovoked,USA,Georgia,St. Simon Island| Glynn County,Standing,Adysonï¿½McNeely,F,11.0,Minor injury to left thigh,N,14h00 -15h00,,K.McMurray| TrackingSharks.com,2018.06.18-McNeely.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,6302.0
2,09-Jun-2018,2018.0,Invalid,USA,Hawaii,Habush| Oahu,Surfing,John Denges,M,48.0,Injury to left lower leg from surfboard skeg,N,07h45,,K.McMurray| TrackingSharks.com,2018.06.09-Denges.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,6301.0
3,08-Jun-2018,2018.0,Unprovoked,AUSTRALIA,New South Wales,Arrawarra Headland,Surfing,male,M,,Minor injury to lower leg,N,,2 m shark,B. Myatt| GSAF,2018.06.08-Arrawarra.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,6300.0
4,04-Jun-2018,2018.0,Provoked,MEXICO,Colima,La Ticla,Free diving,Gustavo Ramos,M,,Lacerations to leg & hand shark PROVOKED INCIDENT,N,,Tiger shark| 3m,A .Kipper,2018.06.04-Ramos.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,6299.0


In [62]:
# drop the PDF and link columns, which are of no use for our analysis
attacks.drop(columns=["href formula","pdf","href"],inplace=True)

In [63]:
attacks.head()

Unnamed: 0,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,Age,Injury,Fatal (Y/N),Time,Species,Investigator or Source,original order
0,25-Jun-2018,2018.0,Boating,USA,California,Oceanside| San Diego County,Paddling,Julie Wolfe,F,57.0,No injury to occupant| outrigger canoe and pad...,N,18h00,White shark,R. Collier| GSAF,6303.0
1,18-Jun-2018,2018.0,Unprovoked,USA,Georgia,St. Simon Island| Glynn County,Standing,Adysonï¿½McNeely,F,11.0,Minor injury to left thigh,N,14h00 -15h00,,K.McMurray| TrackingSharks.com,6302.0
2,09-Jun-2018,2018.0,Invalid,USA,Hawaii,Habush| Oahu,Surfing,John Denges,M,48.0,Injury to left lower leg from surfboard skeg,N,07h45,,K.McMurray| TrackingSharks.com,6301.0
3,08-Jun-2018,2018.0,Unprovoked,AUSTRALIA,New South Wales,Arrawarra Headland,Surfing,male,M,,Minor injury to lower leg,N,,2 m shark,B. Myatt| GSAF,6300.0
4,04-Jun-2018,2018.0,Provoked,MEXICO,Colima,La Ticla,Free diving,Gustavo Ramos,M,,Lacerations to leg & hand shark PROVOKED INCIDENT,N,,Tiger shark| 3m,A .Kipper,6299.0


In [65]:
# in my case I drop the area, the location of the incident, the name, the investigator, and the time at which it occurred,
# since these are data that do not interest me for this analysis
attacks.drop(columns=["Area","Location","Name","Investigator or Source","Time"],inplace=True) 
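Cells that drop columns with `inplace=True` raise a `KeyError` if they are run a second time, because the labels no longer exist. A possible alternative, sketched below on a hypothetical one-row frame, is to pass `errors="ignore"` and assign the result, which makes the step safe to repeat. This is a suggestion, not what the notebook does.

```python
import pandas as pd

# Hypothetical toy frame containing only some of the columns named above.
toy = pd.DataFrame(
    {
        "Date": ["25-Jun-2018"],
        "Area": ["California"],
        "Name": ["example name"],
        "Time": ["18h00"],
    }
)
unwanted = ["Area", "Location", "Name", "Investigator or Source", "Time"]

# errors="ignore" skips labels that are not present instead of raising KeyError.
trimmed = toy.drop(columns=unwanted, errors="ignore")

# Running the same step again is harmless.
trimmed_again = trimmed.drop(columns=unwanted, errors="ignore")

print(list(trimmed.columns))
```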