# Shark attacks - Pandas Project Juan Perez de Ayala

### In this project we set out to test three hypotheses about shark attacks around the world.

    Hypothesis 1: Swimming is the most dangerous activity, ahead of surfing.
    Hypothesis 2: The white shark is the deadliest species.
    Hypothesis 3: There are more attacks in the USA than in Australia and South Africa combined.

###### We start by importing the required libraries and loading the CSV:

In [1]:
import re
import numpy as np
import pandas as pd
import seaborn as sb
import src.cleaning_functs as cf  # project-local cleaning helpers

# Load the raw attack records.
df = pd.read_csv("DATA/attacks.csv", encoding="ISO-8859-1")

##### We check which columns the DataFrame has:

In [2]:
print(df.columns)

Index(['Case Number', 'Date', 'Year', 'Type', 'Country', 'Area', 'Location',
       'Activity', 'Name', 'Sex ', 'Age', 'Injury', 'Fatal (Y/N)', 'Time',
       'Species ', 'Investigator or Source', 'pdf', 'href formula', 'href',
       'Case Number.1', 'Case Number.2', 'original order', 'Unnamed: 22',
       'Unnamed: 23'],
      dtype='object')


##### We standardize the format of the column names:

In [3]:
columnas_df = list(df.columns)

In [4]:
diccio_nuevas = {columna: columna.replace(" ", "_").lower() for columna in columnas_df}

In [5]:
df.rename(columns = diccio_nuevas, inplace=True)

In [6]:
print(df.columns)

Index(['case_number', 'date', 'year', 'type', 'country', 'area', 'location',
       'activity', 'name', 'sex_', 'age', 'injury', 'fatal_(y/n)', 'time',
       'species_', 'investigator_or_source', 'pdf', 'href_formula', 'href',
       'case_number.1', 'case_number.2', 'original_order', 'unnamed:_22',
       'unnamed:_23'],
      dtype='object')
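The output above shows that `'Sex '` and `'Species '` carry a trailing space in the raw file, which is why they become `sex_` and `species_`. As a minimal sketch on a toy frame (the toy frame is an assumption for illustration, not the notebook's data), stripping whitespace before replacing spaces avoids the dangling underscore:

```python
import pandas as pd

# Toy frame that only reproduces a few of the column names shown above.
toy = pd.DataFrame(columns=["Case Number", "Sex ", "Species ", "Fatal (Y/N)"])

# Strip surrounding whitespace first, then lower-case and replace inner spaces.
toy.columns = (
    toy.columns.str.strip().str.lower().str.replace(" ", "_", regex=False)
)

print(list(toy.columns))
```

The rest of this notebook refers to `species_` with the trailing underscore, so adopting this variant would mean updating those later references as well.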


##### We drop the columns that none of our hypotheses will use:

In [7]:
df_hip_filt = df.drop(['case_number','date','year','type','injury','name','sex_','age','time','investigator_or_source','pdf','href_formula','href','case_number.1','case_number.2','unnamed:_22','unnamed:_23'], axis=1)
print(df_hip_filt.columns)

Index(['country', 'area', 'location', 'activity', 'fatal_(y/n)', 'species_',
       'original_order'],
      dtype='object')
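Since only seven columns survive, an equivalent and shorter way to express the same filter is to list the columns to keep rather than the ones to drop. A minimal sketch on a toy frame (the toy values are placeholders, not rows from the dataset):

```python
import pandas as pd

# Toy frame with two wanted columns and two unwanted ones.
toy = pd.DataFrame(
    {"country": ["USA"], "area": ["California"], "pdf": ["x"], "href": ["y"]}
)

# Select the columns to keep; .copy() gives an independent frame to modify later.
keep = ["country", "area"]
toy_filt = toy[keep].copy()

print(list(toy_filt.columns))
```

Selecting by name also fails loudly with a `KeyError` if an expected column is missing from the input.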


##### We count the null values in the filtered DataFrame:

In [8]:
pd.isnull(df_hip_filt).sum()

country           19471
area              19876
location          19961
activity          19965
fatal_(y/n)       19960
species_          22259
original_order    19414
dtype: int64

##### We drop the rows that have null values in every column:

In [9]:
df_hip_filt.shape

(25723, 7)

In [10]:
df_hip_filt.dropna(how="all", inplace=True)

In [11]:
df_hip_filt.shape

(6309, 7)

In [12]:
pd.isnull(df_hip_filt).sum()

country             57
area               462
location           547
activity           551
fatal_(y/n)        546
species_          2845
original_order       0
dtype: int64
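The shapes above show 25723 - 6309 = 19414 rows removed, which matches the number of nulls `original_order` had before the drop. The sketch below, on toy data invented for illustration, shows how `how="all"` differs from the default `how="any"`:

```python
import numpy as np
import pandas as pd

# Row 1 is entirely empty; row 2 is only partially empty.
toy = pd.DataFrame(
    {
        "country": ["USA", np.nan, np.nan],
        "activity": ["Surfing", np.nan, "Swimming"],
    }
)

only_empty_rows_dropped = toy.dropna(how="all")  # removes row 1 only
any_null_rows_dropped = toy.dropna()             # default how="any": removes rows 1 and 2

print(only_empty_rows_dropped.index.tolist())
print(any_null_rows_dropped.index.tolist())
```

Using `how="all"` keeps partially filled records, which is why nulls remain in the other columns after this step.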

##### We fill the NaN values column by column:

In [13]:
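# Assign the result back: fillna(inplace=True) on a single column is not
# guaranteed to update the parent DataFrame in newer pandas versions.
df_hip_filt["country"] = df_hip_filt["country"].fillna("Desconocido")

In [14]:
df_hip_filt["area"] = df_hip_filt["area"].fillna("Desconocido")

In [15]:
df_hip_filt["location"] = df_hip_filt["location"].fillna("Desconocido")

In [16]:
df_hip_filt["activity"] = df_hip_filt["activity"].fillna("Desconocido")

In [17]:
df_hip_filt["fatal_(y/n)"] = df_hip_filt["fatal_(y/n)"].fillna("Desconocido")

In [18]:
df_hip_filt["species_"] = df_hip_filt["species_"].fillna("Desconocido")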

In [19]:
pd.isnull(df_hip_filt).sum()

country           0
area              0
location          0
activity          0
fatal_(y/n)       0
species_          0
original_order    0
dtype: int64
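Because every text column receives the same placeholder, the six cells can also be collapsed into a single assignment. A minimal sketch on toy data (the toy frame and the `text_cols` name are assumptions for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "country": ["USA", np.nan],
        "species_": [np.nan, "White shark"],
        "original_order": [6303.0, 6302.0],
    }
)

# Fill only the text columns so the numeric column keeps its dtype.
text_cols = ["country", "species_"]
toy[text_cols] = toy[text_cols].fillna("Desconocido")

print(toy.isnull().sum().sum())
```

Restricting the fill to named columns avoids accidentally writing a string into a numeric column if it ever contains nulls.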

##### Starting from this filtered dataset, we focus on the columns needed for each hypothesis

Hypothesis 1

In [20]:
df_hip_filt['activity'].value_counts()

Surfing                                   971
Swimming                                  869
Desconocido                               551
Fishing                                   431
Spearfishing                              333
                                         ... 
Playing with a frisbee in the shallows      1
Sinking of the ferryboat Dumaguete          1
Wreck of the Storm King                     1
Feeding mullet to sharks                    1
Wreck of  large double sailing canoe        1
Name: activity, Length: 1533, dtype: int64

In [21]:
df_hip_filt['activity_new'] = df_hip_filt['activity'].apply(cf.new_activity)

In [22]:
df_hip_filt['activity_new'].unique()

array(['Surfing', 'Otra', 'Diving', 'Swimming', 'Fishing', 'Boating',
       'Feeding'], dtype=object)

In [23]:
df_hip_filt['activity_new'].value_counts()

Otra        1732
Surfing     1495
Swimming    1474
Fishing     1091
Diving       454
Boating       58
Feeding        5
Name: activity_new, dtype: int64
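The body of `cf.new_activity` is not shown in this notebook. The following is only a hypothetical sketch of how such a keyword-based grouping could look; the function name `new_activity_sketch` and every keyword pattern are assumptions, chosen so that the labels match the seven categories listed above:

```python
import re

# Hypothetical keyword table (not the author's). Order matters: first match wins.
ACTIVITY_PATTERNS = [
    ("Surfing", r"surf|paddl|board"),
    ("Fishing", r"fish"),
    ("Diving", r"div|snorkel"),
    ("Swimming", r"swim|bath|wad"),
    ("Boating", r"boat|kayak|canoe|sail"),
    ("Feeding", r"feed"),
]


def new_activity_sketch(activity: str) -> str:
    """Map a free-text activity to a coarse category, or 'Otra' if none fits."""
    for label, pattern in ACTIVITY_PATTERNS:
        if re.search(pattern, activity, flags=re.IGNORECASE):
            return label
    return "Otra"


for raw in ["Paddling", "Standing", "Free diving"]:
    print(raw, "->", new_activity_sketch(raw))
```

With these assumed patterns, `Desconocido` falls into `Otra`, so unknown activities inflate that category unless they are separated out first.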

Hypothesis 2

Species column

In [24]:
df_hip_filt['species_'].value_counts()

Desconocido                                                                                                                      2845
White shark                                                                                                                       163
Shark involvement prior to death was not confirmed                                                                                105
Invalid                                                                                                                           102
Shark involvement not confirmed                                                                                                    88
                                                                                                                                 ... 
1.2 m to 1.5 m [4.5' to 5'] shark                                                                                                   1
Bull shark, 2.3 m [7.5']                                      

In [25]:
df_hip_filt['species_filt'] = df_hip_filt['species_'].apply(cf.clean_species)

In [26]:
df_hip_filt['species_filt'].unique()

array(['White shark', 'Desconocido', 'Tiger shark', 'Bull shark',
       'Whitetip shark', 'Shortfin Mako Shark', 'Hammerhead Shark',
       'Blacktip Shark', 'Sand Tiger Shark'], dtype=object)

In [27]:
df_hip_filt['species_filt'].value_counts()

Desconocido            4928
White shark             668
Tiger shark             278
Bull shark              186
Whitetip shark          108
Shortfin Mako Shark      56
Hammerhead Shark         49
Blacktip Shark           18
Sand Tiger Shark         18
Name: species_filt, dtype: int64
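`cf.clean_species` is likewise defined outside this notebook. As a hedged illustration of the idea, the sketch below extracts a species name from free text with ordered regular expressions; the function name `clean_species_sketch` and the patterns are assumptions, and only the output labels are taken from the list above:

```python
import re

# Hypothetical patterns (not the author's). More specific names come first so
# that "sand tiger" is not caught by "tiger" and "whitetip" not by "white".
SPECIES_PATTERNS = [
    ("Sand Tiger Shark", r"sand\s*tiger"),
    ("Tiger shark", r"tiger"),
    ("Whitetip shark", r"white\s*tip"),
    ("White shark", r"white"),
    ("Bull shark", r"bull"),
    ("Shortfin Mako Shark", r"mako"),
    ("Hammerhead Shark", r"hammerhead"),
    ("Blacktip Shark", r"black\s*tip"),
]


def clean_species_sketch(species: str) -> str:
    """Return a canonical species label, or 'Desconocido' if none is recognized."""
    for label, pattern in SPECIES_PATTERNS:
        if re.search(pattern, species, flags=re.IGNORECASE):
            return label
    return "Desconocido"


for raw in ["Tiger shark, 3m", "2 m shark", "Bull shark, 2.3 m [7.5']"]:
    print(raw, "->", clean_species_sketch(raw))
```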

Fatality column

In [28]:
df_hip_filt['fatal_(y/n)'].value_counts()

N              4293
Y              1388
Desconocido     546
UNKNOWN          71
 N                7
M                 1
2017              1
N                 1
y                 1
Name: fatal_(y/n), dtype: int64

In [29]:
df_hip_filt['fatal_new'] = df_hip_filt.apply(cf.clean_fatal, axis=1)

In [30]:
df_hip_filt['fatal_new'].value_counts()

N              4301
Y              2006
Desconocido       2
Name: fatal_new, dtype: int64
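The counts above are worth a second look: the raw column had 1388 `Y` plus one `y`, yet `fatal_new` reports 2006, and the difference of 617 equals the 546 `Desconocido` plus the 71 `UNKNOWN` rows. This is consistent with unknown outcomes being counted as fatal, although the body of `cf.clean_fatal` is not shown here, so that reading is an inference. As an alternative sketch (not the author's function), the column can be normalized so that anything other than a clean `Y` or `N` stays unknown:

```python
import pandas as pd

# One example of each raw value listed in the value_counts output above.
raw = pd.Series(["N", "Y", "Desconocido", "UNKNOWN", " N", "M", "2017", "N ", "y"])

# Trim whitespace and unify case, then keep only the two valid answers.
normalized = raw.str.strip().str.upper()
fatal_sketch = normalized.where(normalized.isin(["Y", "N"]), "Desconocido")

print(fatal_sketch.tolist())
```

Keeping unknowns separate matters for Hypothesis 2, because fatality rates per species depend directly on how many rows are labeled `Y`.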

Hypothesis 3

In [31]:
df_hip_filt['country'].value_counts()

USA                      2229
AUSTRALIA                1338
SOUTH AFRICA              579
PAPUA NEW GUINEA          134
NEW ZEALAND               128
                         ... 
THE BALKANS                 1
NORTH ATLANTIC OCEAN        1
MAYOTTE                     1
GABON                       1
CEYLON (SRI LANKA)          1
Name: country, Length: 213, dtype: int64

In [32]:
df_hip_filt['country_org'] = df_hip_filt['country'].apply(cf.new_country)

In [33]:
df_hip_filt['country_org'].value_counts()

USA             2229
AUSTRALIA       1338
SOUTH AFRICA     579
Name: country_org, dtype: int64
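`cf.new_country` evidently keeps only the three countries of interest. A minimal sketch of the same filtering with built-in pandas operations follows (the toy series and variable names are assumptions), together with the comparison Hypothesis 3 asks for, using the counts printed above:

```python
import pandas as pd

# Toy input; every country outside the target list becomes NaN.
countries = pd.Series(["USA", "AUSTRALIA", "MEXICO", "SOUTH AFRICA", "USA"])
targets = ["USA", "AUSTRALIA", "SOUTH AFRICA"]
country_org_sketch = countries.where(countries.isin(targets))

# Counts copied from the value_counts output above.
counts = pd.Series({"USA": 2229, "AUSTRALIA": 1338, "SOUTH AFRICA": 579})
combined = counts["AUSTRALIA"] + counts["SOUTH AFRICA"]
usa_exceeds = counts["USA"] > combined

print(combined, usa_exceeds)
```

With these figures, the USA total (2229) is higher than Australia and South Africa combined (1917).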

In [34]:
df_hip_filt.head()

| | country | area | location | activity | fatal_(y/n) | species_ | original_order | activity_new | species_filt | fatal_new | country_org |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | USA | California | Oceanside, San Diego County | Paddling | N | White shark | 6303.0 | Surfing | White shark | N | USA |
| 1 | USA | Georgia | St. Simon Island, Glynn County | Standing | N | Desconocido | 6302.0 | Otra | Desconocido | N | USA |
| 2 | USA | Hawaii | Habush, Oahu | Surfing | N | Desconocido | 6301.0 | Surfing | Desconocido | N | USA |
| 3 | AUSTRALIA | New South Wales | Arrawarra Headland | Surfing | N | 2 m shark | 6300.0 | Surfing | Desconocido | N | AUSTRALIA |
| 4 | MEXICO | Colima | La Ticla | Free diving | N | Tiger shark, 3m | 6299.0 | Diving | Tiger shark | N | |


##### Finally, we drop the old columns and save the new DataFrame

In [35]:
df_final = df_hip_filt.drop(['country','activity','species_','fatal_(y/n)'], axis=1)
print(df_final.columns)

Index(['area', 'location', 'original_order', 'activity_new', 'species_filt',
       'fatal_new', 'country_org'],
      dtype='object')


In [36]:
df_final.to_csv('data/df_fin.csv')
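Two details of this last step are worth checking. The notebook reads from `DATA/` but writes to `data/`, and on a case-sensitive filesystem those are different directories. Also, `to_csv` writes the index by default, which shows up as an `Unnamed: 0` column when the file is read back. The sketch below uses an in-memory buffer and a toy frame (both assumptions for illustration) to show the round trip with `index=False`:

```python
import io

import pandas as pd

toy = pd.DataFrame({"area": ["California"], "fatal_new": ["N"]})

# Write without the index, then read the same text back.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(list(restored.columns))
```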