<div float: center>
    <img src="../common/../common/logo_DH.png" width="30%" height="30%" style="text-align: left;">
    <img src="../common/wipro_logo.png"  width="20%" height="20%" style="text-align: right; margin-left:400px;">
<div/>

---

# SHARK ATTACK ANALYSIS

Developed by:
- Eng. Aer. Pablo Bauer

<a id="table_contents"></a> 
## Table of Contents

### <a href='#intro'>1. Introduction</a>
- #### <a href='#case'>1.1. Case</a>
- #### <a href='#workflow'>1.2. Workflow</a>

### <a href='#section_mod_import'>2. Modules Import</a>


### <a href='#dataset_import'>3. DataSet Import</a> 
- #### <a href='#cleaninig'>3.1. Cleaninig data</a>

### <a  href='#set_training'>4. Set and training Dataset</a>
- #### <a href='#features_dummies'>4.1. Features and dummies</a>
- #### <a href='#split'>4.2. Split training Dataset</a>
- #### <a href='#matrix_corr'>4.3. Correlation Matrx</a>
- #### <a href='#analysis_corr'>4.4. Correlation Analysis</a>

### <a  href='#models_evaluations'>5. Metrics and Evaluations</a>
- #### <a href='#pipeline_rfr'>5.1. PipeLine [RandomForestRegressor]</a>
- #### <a href='#features_imp_rfr'>5.2. Features Importance [RandomForestRegressor]</a>
- #### <a href='#pipeline_ada'>5.3. PipeLine [Ada Boost Regressor]</a>
- #### <a href='#features_imp_ada'>5.4. Features Importance [Ada Boost Regressor]</a>
- #### <a href='#pipeline_dtr'>5.5. PipeLine [DecisionTreeRegressor]</a>
- #### <a href='#features_imp_dtr'>5.6. Features Importance [DecisionTreeRegressor]</a>
- #### <a href='#pipeline_knn'>5.7. PipeLine [KNN]</a>
- #### <a href='#features_imp_knn'>5.8. Features Importance [KNN]</a>
- #### <a href='#best_performance'>5.9. Entrenamiento con los parámetros de mejor performance [DataSet Completo]</a>
    
### <a href='#roc_curve'>6. ROC curve</a>
- #### <a href='#section_roc_auc'>6.1. ROC and AUC</a>

### <a href='#export_pickle'>7. Expostarndo en Pickle</a>

### <a href='#conclusions'>8. Conclusuions</a>
---

<a id="intro"></a> 
## 1. Introduction
---
<a href='#table_contents'>Back up</a>

<a id="case"></a> 
### 1.1 Abaout the Case

<img src="../common/solar_farms.png"  width="100%" height="100%">

#### Global Shark Attack Incidents
*Credits to Brenda Griffith [source](https://data.world/siyeh)*

The Global Shark Attack Data is a comprehensive dataset that provides daily updated records of shark attack incidents worldwide. It offers valuable information on various aspects of each incident, including the date and location of the attack, specific details about the activity the victim was engaged in at the time, and whether it resulted in a fatality or not. With additional columns such as age, injury description, and even the name of the victim involved, this dataset aims to inform people about the risks associated with coastal water activities.

The dataset also sheds light on factors that contribute to shark attacks by providing information on the type of incident, such as whether it was provoked or unprovoked. It further categorizes incidents based on different countries and areas within those countries where they occurred. The presence of additional columns like investigator or source helps trace back to their reporting sources for further verification and analysis.

Moreover, this dataset goes beyond numbers and text by including links to PDF files containing more detailed information about each shark attack incident. This allows researchers or interested individuals to delve deeper into individual cases for better understanding.

To facilitate data visualization and exploration, several columns provide configurations related to encoding and mark type used in visualization tools. The inclusion of original order column helps maintain consistency while handling large datasets.

In summary, this dataset serves as a valuable resource for researchers studying shark attacks worldwide while aiming to improve understanding between sharks and humans through education on associated risks during coastal water activities

---

### Employer Needs

**Explore Attack Types:**

Use information from Type column to analyze different types of attacks like Unprovoked or Provoked incidents. Identify patterns or trends and gain insights into factors contributing to various types of attacks.

**Analyze Geographic Trends:**

Utilize data from Country and Area columns to identify regions with higher concentrations of recorded incidents. Compare different countries and areas to understand geographic trends regarding locations prone to attacks.

**Study Activity-Related Incidents:**

Focus on data from Activity column to examine activities that often lead to sharks encountering humans. Determine which activities pose higher risks and develop guidelines or recommendations accordingly.

**Investigate Victim Characteristics:**

Analyze data from Name and Age columns to study the profiles of victims involved in shark attack incidents. Identify any patterns or demographics that might be useful in understanding the prevalence of attacks.

**Assess Injury Severity:**

Explore data from Injury column to gain insights into different types and severities of injuries resulting from shark attacks. Analyze patterns and determine the most common injuries sustained by victims.

**Consider Fatality Rates:**

Examine the Fatal (Y/N) column to determine fatality rates associated with shark attack incidents. Compare fatality rates across different countries, areas, or activity types to understand risks

---

#### Global Shark Attack Incidents Dataset

#### Columns
- **Case Number** - A unique identifier for each shark attack incident.
- **Type** - The type of shark attack incident, such as Unprovoked or Provoked.
- **Country** - The country where the shark attack incident occurred.
- **Area** - The specific area within the country where the shark attack incident occurred.
- **Location** - The location or beach where the shark attack incident occurred.
- **Activity** - The activity the victim was engaged in at the time of the shark attack incident.
- **Name** - The name of the victim involved in the shark attack incident.
- **Age** - he age of the victim involved in the shark attack incident.
- **Injury** - The description of the injury sustained by the victim in a shark attack incident.
- **Fatal (Y/N)** - Indicates whether or not a fatality occurred in a given shark attack incident.

<a id="workflow"></a> 
### 1.2 La Workflow

*TO-DO*

<a id="section_mod_import"></a>
## 2. Modules Import
---
<a href='#table_contents'>Back up</a>

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pickle
import plotly.express as px
import re

from dateutil.parser import parse
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler, MinMaxScaler
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix, r2_score, mean_absolute_error, mean_squared_error, roc_curve
from sklearn.utils import resample

<a id="dataset_import"></a>
## 3. DataSet Import
---
<a href='#table_contents'>Back up</a>

### 3.1 Convert in a DataFrame

In [2]:
file_path = "../data/archive/GSAF5.xls.csv"

data = pd.read_csv(file_path, sep=',', encoding='utf8')

print(data.shape)

(6462, 257)


  data = pd.read_csv(file_path, sep=',', encoding='utf8')


In [3]:
data

Unnamed: 0,index,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,...,Unnamed: 246,Unnamed: 247,Unnamed: 248,Unnamed: 249,Unnamed: 250,Unnamed: 251,Unnamed: 252,Unnamed: 253,Unnamed: 254,Unnamed: 255
0,0,2020.02.05,05-Feb-2020,2020.0,Unprovoked,USA,Maui,,Stand-Up Paddle boarding,,...,,,,,,,,,,
1,1,2020.01.30.R,Reported 30-Jan-2020,2020.0,Provoked,BAHAMAS,Exumas,,Floating,Ana Bruna Avila,...,,,,,,,,,,
2,2,2020.01.17,17-Jan-2020,2020.0,Unprovoked,AUSTRALIA,New South Wales,Windang Beach,Surfing,Will Schroeter,...,,,,,,,,,,
3,3,2020.01.16,16-Jan-2020,2020.0,Unprovoked,NEW ZEALAND,Southland,Oreti Beach,Surfing,Jordan King,...,,,,,,,,,,
4,4,2020.01.13,13-Jan-2020,2020.0,Unprovoked,USA,North Carolina,"Rodanthe, Dare County",Surfing,Samuel Horne,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6457,6457,ND.0005,Before 1903,0.0,Unprovoked,AUSTRALIA,Western Australia,Roebuck Bay,Diving,male,...,,,,,,,,,,
6458,6458,ND.0004,Before 1903,0.0,Unprovoked,AUSTRALIA,Western Australia,,Pearl diving,Ahmun,...,,,,,,,,,,
6459,6459,ND.0003,1900-1905,0.0,Unprovoked,USA,North Carolina,Ocracoke Inlet,Swimming,Coast Guard personnel,...,,,,,,,,,,
6460,6460,ND.0002,1883-1889,0.0,Unprovoked,PANAMA,,"Panama Bay 8ºN, 79ºW",,Jules Patterson,...,,,,,,,,,,


### 3.2 Overview: Nulls in columns

Identifying columns with and without data, droping that ones that is just noise.

In [4]:
data_nan = data.apply(lambda x: x.notnull().sum())
data_nan

index           6462
Case Number     6460
Date            6461
Year            6459
Type            6457
                ... 
Unnamed: 251       0
Unnamed: 252       0
Unnamed: 253       0
Unnamed: 254       0
Unnamed: 255       0
Length: 257, dtype: int64

In [5]:
data = data.dropna(how='all', axis=1)

In [6]:
data_nan = data.apply(lambda x: x.isnull().sum())
data_nan

index                        0
Case Number                  2
Date                         1
Year                         3
Type                         5
Country                     51
Area                       463
Location                   545
Activity                   552
Name                       215
Unnamed: 9                6434
Age                       2871
Injury                      29
Fatal (Y/N)                547
Time                      3392
Species                   2924
Investigator or Source      19
pdf                       3396
href formula              3400
href                      3400
Case Number.1             3400
Case Number.2             3400
original order            3400
Unnamed: 23               6460
dtype: int64

### 3.3 Dropping other columns

We will drop columns that is not esential, like: Unnamed:9, Unnamed:23, pdf,

Todo that we'll just select that columns that we need.

In [7]:
data.columns

Index(['index', 'Case Number', 'Date', 'Year', 'Type', 'Country', 'Area',
       'Location', 'Activity', 'Name', 'Unnamed: 9', 'Age', 'Injury',
       'Fatal (Y/N)', 'Time', 'Species ', 'Investigator or Source', 'pdf',
       'href formula', 'href', 'Case Number.1', 'Case Number.2',
       'original order', 'Unnamed: 23'],
      dtype='object')

In [8]:
data = data[['index', 'Date', 'Year', 'Type', 'Country', 'Area',
       'Location', 'Activity', 'Age', 'Injury',
       'Fatal (Y/N)', 'Species ', 'Investigator or Source']]

In [9]:
data

Unnamed: 0,index,Date,Year,Type,Country,Area,Location,Activity,Age,Injury,Fatal (Y/N),Species,Investigator or Source
0,0,05-Feb-2020,2020.0,Unprovoked,USA,Maui,,Stand-Up Paddle boarding,,"No injury, but paddleboard bitten",N,Tiger shark,"K. McMurray, TrackingSharks.com"
1,1,Reported 30-Jan-2020,2020.0,Provoked,BAHAMAS,Exumas,,Floating,24,PROVOKED INCIDENT Scratches to left wrist,N,,"K. McMurray, TrackingSharks.com"
2,2,17-Jan-2020,2020.0,Unprovoked,AUSTRALIA,New South Wales,Windang Beach,Surfing,59,Laceration ot left ankle and foot,N,"""A small shark""","B. Myatt & M. Michaelson, GSAF; K. McMurray, T..."
3,3,16-Jan-2020,2020.0,Unprovoked,NEW ZEALAND,Southland,Oreti Beach,Surfing,13,Minor injury to lower leg,N,Broadnose seven gill shark?,"K. McMurray, TrackingSharks.com"
4,4,13-Jan-2020,2020.0,Unprovoked,USA,North Carolina,"Rodanthe, Dare County",Surfing,26,Lacerations to foot,N,,"C. Creswell, GSAF"
...,...,...,...,...,...,...,...,...,...,...,...,...,...
6457,6457,Before 1903,0.0,Unprovoked,AUSTRALIA,Western Australia,Roebuck Bay,Diving,,FATAL,Y,,"H. Taunton; N. Bartlett, p. 234"
6458,6458,Before 1903,0.0,Unprovoked,AUSTRALIA,Western Australia,,Pearl diving,,FATAL,Y,,"H. Taunton; N. Bartlett, pp. 233-234"
6459,6459,1900-1905,0.0,Unprovoked,USA,North Carolina,Ocracoke Inlet,Swimming,,FATAL,Y,,"F. Schwartz, p.23; C. Creswell, GSAF"
6460,6460,1883-1889,0.0,Unprovoked,PANAMA,,"Panama Bay 8ºN, 79ºW",,,FATAL,Y,,"The Sun, 10/20/1938"


In [10]:
mask = data['index'] == 4649

data.loc[mask]

Unnamed: 0,index,Date,Year,Type,Country,Area,Location,Activity,Age,Injury,Fatal (Y/N),Species,Investigator or Source
4649,4649,1950s,1950.0,Invalid,USA,Hawaii,"Waikiki, O'ahu",Spearfishing,,"No injury from shark, scraped chest climbing o...",,"Alleged to involve a white shark ""with little ...","J. Borg, p.73; L. Taylor (1993), pp.100-101; S..."


### 3.4 Cleaning Date column

**3.4.1 Extracting series**

Extract the series that we need to regularized in a format that we need

In [11]:
data_date_regex = data['Date']

**3.4.2 Use the pattern to regularized**

Arrange the data and the pattern that we need to extarct the useful info.

In [12]:
pattern_date = r"^([a-zA-Z]+|[a-z0-9]+)([\-\s\.])*([a-zA-Z]+|[0-9]+)?(-)*([a-zA-Z]+|[0-9]*)(-)*([0-9]*)"
text_1 = 'Before 1903'
text_2 = '05-Feb-2020'
text_3 = 'Reported 30-Jan-2020'
text_4 = '1845-1853'
text_5 = '1950s'
text_6 = '1999'

pattern_regex = re.compile(pattern_date)

print(re.findall(pattern_date, text_1))
print(re.findall(pattern_date, text_2))
print(re.findall(pattern_date, text_3))
print(re.findall(pattern_date, text_4))
print(re.findall(pattern_date, text_5))
print(re.findall(pattern_date, text_6))

data_test = data['Date'].str.extract(pattern_date)
data_test.set_axis(['day','1','month','3','year','5','6'], axis='columns', inplace=True)

#Testing methods to extract info
"""data_test = data['Date'].apply(lambda x: condition_regex(pattern_regex.findall(str(x))))
data_test = data_test.tolist()
data_test = pd.DataFrame(data_test, columns=['t'])
print(type(data_test))
data_test
data_test = pd.DataFrame(data_test.t.to_list(), columns=['0','1','2','3','4','5','6'])"""

[('Before', ' ', '1903', '', '', '', '')]
[('05', '-', 'Feb', '-', '2020', '', '')]
[('Reported', ' ', '30', '-', 'Jan', '-', '2020')]
[('1845', '-', '1853', '', '', '', '')]
[('1950s', '', '', '', '', '', '')]
[('1999', '', '', '', '', '', '')]


"data_test = data['Date'].apply(lambda x: condition_regex(pattern_regex.findall(str(x))))\ndata_test = data_test.tolist()\ndata_test = pd.DataFrame(data_test, columns=['t'])\nprint(type(data_test))\ndata_test\ndata_test = pd.DataFrame(data_test.t.to_list(), columns=['0','1','2','3','4','5','6'])"

In [13]:
data_test

Unnamed: 0,day,1,month,3,year,5,6
0,05,-,Feb,-,2020,,
1,Reported,,30,-,Jan,-,2020
2,17,-,Jan,-,2020,,
3,16,-,Jan,-,2020,,
4,13,-,Jan,-,2020,,
...,...,...,...,...,...,...,...
6457,Before,,1903,,,,
6458,Before,,1903,,,,
6459,1900,-,1905,,,,
6460,1883,-,1889,,,,


**3.4.3 Identifying differences and Parsing dates**

In [14]:
mask = data_test.day != 'Before'
data_test = data_test[mask]

In [15]:
condition_1 = pd.DataFrame()
condition_1[['day','1','month','3','year']] = data_test[['day','1','month','3','year']].where(data_test['day']!='Reported', data_test[['month','3','year','5','6']].values)
condition_1

condition_1 = condition_1.replace({
    'Before':np.nan,
})
data_test = condition_1

In [16]:
data_test = data_test.drop(['1','3'], axis=1)

In [17]:
condition_2 = data_test['year'].where(data_test['day']!='Before', data_test['month'].values)
data_test['year'] = condition_2

for x in ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec', 'Ca','Summer']:
    condition_2 = data_test['year'].where(data_test['day']!=x, data_test['month'].values)
    data_test['year'] = condition_2
    
    condition_2 = data_test['month'].where(data_test['day']!=x, data_test['day'].values)
    data_test['month'] = condition_2
    
    condition_2 = data_test['day'].where(data_test['day']!=x, '01')
    data_test['day'] = condition_2

In [18]:
data_test['day'] = pd.to_numeric(data_test['day'], errors='coerce')

In [19]:
'''mask = (data_test['day'] != 'Late') & (data_test['day'] != 'Early') & (data_test['day'] != 'Circa') & (data_test['day'] != 'Between') & (data_test['day'] != 'Fall') & (data_test['day'] != 'July') & (data_test['day'] != 'June') & (data_test['day'] != 'World') & (data_test['day'] != 'Winter') & (data_test['day'] != 'Some') & (data_test['day'] != 'October') & (data_test['day'] != 'A') & (data_test['day'] != 'in') & (data_test['day'] != 'Said') & (data_test['day'] != 'Beforer') & (data_test['day'] != 'Letter') & (data_test['day'] != 'Woirld') & (data_test['day'] != 'November') & (data_test['day'] != 'Last') & (data_test['day'] != 'Reprted') & (data_test['day'] != 'Reportd') & (data_test['day'] != 'December') & (data_test['day'] != 'to') & (data_test['day'] != 'No') & (data_test['day'] != 'Mid')
data_test = data_test.loc[mask]'''

"mask = (data_test['day'] != 'Late') & (data_test['day'] != 'Early') & (data_test['day'] != 'Circa') & (data_test['day'] != 'Between') & (data_test['day'] != 'Fall') & (data_test['day'] != 'July') & (data_test['day'] != 'June') & (data_test['day'] != 'World') & (data_test['day'] != 'Winter') & (data_test['day'] != 'Some') & (data_test['day'] != 'October') & (data_test['day'] != 'A') & (data_test['day'] != 'in') & (data_test['day'] != 'Said') & (data_test['day'] != 'Beforer') & (data_test['day'] != 'Letter') & (data_test['day'] != 'Woirld') & (data_test['day'] != 'November') & (data_test['day'] != 'Last') & (data_test['day'] != 'Reprted') & (data_test['day'] != 'Reportd') & (data_test['day'] != 'December') & (data_test['day'] != 'to') & (data_test['day'] != 'No') & (data_test['day'] != 'Mid')\ndata_test = data_test.loc[mask]"

In [20]:
data_test= data_test.dropna(how='any')

In [21]:
mask = data_test['day'] < 32

data_test = data_test[mask]

In [22]:
data_test.month.value_counts()

Jul       678
Aug       607
Sep       558
Jan       526
Jun       499
Oct       458
Apr       458
Dec       450
Nov       424
Mar       419
May       401
Feb       385
Ca         33
Summer     17
July        4
Ap          2
April       2
Sept        2
30          1
or          1
Decp        1
26          1
March       1
            1
Name: month, dtype: int64

In [23]:
mask = (data_test['month'] == 'Jan') | (data_test['month'] == 'Feb') | (data_test['month'] == 'Mar') | (data_test['month'] == 'Apr') | (data_test['month'] == 'May') | (data_test['month'] == 'Jun') | (data_test['month'] == 'Jul') | (data_test['month'] == 'Aug') | (data_test['month'] == 'Sep') | (data_test['month'] == 'Oct') | (data_test['month'] == 'Nov') | (data_test['month'] == 'Dec')
data_test = data_test[mask]

In [24]:
data_test

Unnamed: 0,day,month,year
0,5.0,Feb,2020
1,30.0,Jan,2020
2,17.0,Jan,2020
3,16.0,Jan,2020
4,13.0,Jan,2020
...,...,...,...
6310,27.0,Oct,1753
6311,27.0,Jul,1751
6315,17.0,Dec,1742
6316,6.0,Apr,1738


In [25]:
m = {
    'Jan':1,
    'Feb':2,
    'Mar':3,
    'Apr':4,
    'May':5,
    'Jun':6,
    'Jul':7,
    'Aug':8,
    'Sep':9,
    'Oct':10,
    'Nov':11,
    'Dec':12
}
data_test['month'] = data_test['month'].map(m)

In [26]:
data_test

Unnamed: 0,day,month,year
0,5.0,2,2020
1,30.0,1,2020
2,17.0,1,2020
3,16.0,1,2020
4,13.0,1,2020
...,...,...,...
6310,27.0,10,1753
6311,27.0,7,1751
6315,17.0,12,1742
6316,6.0,4,1738


In [27]:
data_test['year'] = pd.to_numeric(data_test['year'], errors='coerce')
data_test= data_test.dropna(how='any')

In [28]:
data_test['month'] = data_test['month'].astype(float)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_test['month'] = data_test['month'].astype(float)


In [29]:
mask = data_test['year'] > 1699
data_test = data_test[mask]

In [30]:
data_test

Unnamed: 0,day,month,year
0,5.0,2.0,2020.0
1,30.0,1.0,2020.0
2,17.0,1.0,2020.0
3,16.0,1.0,2020.0
4,13.0,1.0,2020.0
...,...,...,...
6310,27.0,10.0,1753.0
6311,27.0,7.0,1751.0
6315,17.0,12.0,1742.0
6316,6.0,4.0,1738.0


**3.4.4 Parsing columns in one Datetime column**

After that we attach it to the original DataSet and rename the rows without date with 'Unkonw'

In [31]:
data_test['clean_date'] = pd.to_datetime(data_test[['year','month','day']])

In [32]:
data['clean_date'] = data_test['clean_date']

In [33]:
data

Unnamed: 0,index,Date,Year,Type,Country,Area,Location,Activity,Age,Injury,Fatal (Y/N),Species,Investigator or Source,clean_date
0,0,05-Feb-2020,2020.0,Unprovoked,USA,Maui,,Stand-Up Paddle boarding,,"No injury, but paddleboard bitten",N,Tiger shark,"K. McMurray, TrackingSharks.com",2020-02-05
1,1,Reported 30-Jan-2020,2020.0,Provoked,BAHAMAS,Exumas,,Floating,24,PROVOKED INCIDENT Scratches to left wrist,N,,"K. McMurray, TrackingSharks.com",2020-01-30
2,2,17-Jan-2020,2020.0,Unprovoked,AUSTRALIA,New South Wales,Windang Beach,Surfing,59,Laceration ot left ankle and foot,N,"""A small shark""","B. Myatt & M. Michaelson, GSAF; K. McMurray, T...",2020-01-17
3,3,16-Jan-2020,2020.0,Unprovoked,NEW ZEALAND,Southland,Oreti Beach,Surfing,13,Minor injury to lower leg,N,Broadnose seven gill shark?,"K. McMurray, TrackingSharks.com",2020-01-16
4,4,13-Jan-2020,2020.0,Unprovoked,USA,North Carolina,"Rodanthe, Dare County",Surfing,26,Lacerations to foot,N,,"C. Creswell, GSAF",2020-01-13
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6457,6457,Before 1903,0.0,Unprovoked,AUSTRALIA,Western Australia,Roebuck Bay,Diving,,FATAL,Y,,"H. Taunton; N. Bartlett, p. 234",NaT
6458,6458,Before 1903,0.0,Unprovoked,AUSTRALIA,Western Australia,,Pearl diving,,FATAL,Y,,"H. Taunton; N. Bartlett, pp. 233-234",NaT
6459,6459,1900-1905,0.0,Unprovoked,USA,North Carolina,Ocracoke Inlet,Swimming,,FATAL,Y,,"F. Schwartz, p.23; C. Creswell, GSAF",NaT
6460,6460,1883-1889,0.0,Unprovoked,PANAMA,,"Panama Bay 8ºN, 79ºW",,,FATAL,Y,,"The Sun, 10/20/1938",NaT


### 3.5 Coordenates identification with *"Google Maps"*

use a coordenates identifications with google maps and Calcs sheet

link: [script](https://tesel.mx/convertir-direcciones-a-coordenadas-de-latitud-y-longitud-con-google-sheets-y-google-maps-geocoding-7928/)

In [34]:
coord_file_path = "../data/places_coord.xlsx"

coord_data = pd.read_excel(coord_file_path)
coord_data['Area'] = coord_data['Dirección']
coord_data = coord_data.drop(columns=['Dirección'])

In [35]:
coord_data

Unnamed: 0,Latitud,Longitud,Area
0,27.664827,-81.515754,Florida
1,-31.253218,146.921099,New South Wales
2,-22.575197,144.084793,Queensland
3,19.898682,-155.665857,Hawaii
4,36.778261,-119.417932,California
...,...,...,...
805,,,Ho Ha Wan Marine Park
806,8.715832,99.545097,Southern Thailand
807,45.396138,13.041007,Golfo de Venezia
808,22.319304,114.169361,South China Sea 200 miles from Hong Kong


In [36]:
data = pd.merge(data, coord_data, on='Area')

In [37]:
data

Unnamed: 0,index,Date,Year,Type,Country,Area,Location,Activity,Age,Injury,Fatal (Y/N),Species,Investigator or Source,clean_date,Latitud,Longitud
0,0,05-Feb-2020,2020.0,Unprovoked,USA,Maui,,Stand-Up Paddle boarding,,"No injury, but paddleboard bitten",N,Tiger shark,"K. McMurray, TrackingSharks.com",2020-02-05,20.798363,-156.331925
1,1,Reported 30-Jan-2020,2020.0,Provoked,BAHAMAS,Exumas,,Floating,24,PROVOKED INCIDENT Scratches to left wrist,N,,"K. McMurray, TrackingSharks.com",2020-01-30,23.619260,-75.969547
2,121,09-Nov-2018,2018.0,Unprovoked,BAHAMAS,Exumas,Compass Cay,Swimming,8,Arc of shallow lacerations to back,N,Nurse shark,"K. McMurray, TrackingSharks.com",2018-11-09,23.619260,-75.969547
3,2,17-Jan-2020,2020.0,Unprovoked,AUSTRALIA,New South Wales,Windang Beach,Surfing,59,Laceration ot left ankle and foot,N,"""A small shark""","B. Myatt & M. Michaelson, GSAF; K. McMurray, T...",2020-01-17,-31.253218,146.921099
4,16,20-Dec-2019,2019.0,Provoked,AUSTRALIA,New South Wales,Shellharbour,Fishing,,PROVOKED INCIDENT,,White shark,"B. Myatt, GSAF",2019-12-20,-31.253218,146.921099
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5994,6411,Before 1958,0.0,Unprovoked,INDONESIA,Riau Province,"Natuna Islands, between Sumatra & Kalimantan i...",Swimming near anchored ship,,"FATAL, leg severed",Y,,"C.H. Townsend, p. 172; V.M. Coppleson, p.258",NaT,0.293347,101.706829
5995,6415,Before 1956,0.0,Unprovoked,MARSHALL ISLANDS,Bikini Atoll,,Swimming,,Buttocks bitten,N,,J.E. Lasch,NaT,11.606514,165.376810
5996,6431,World War II,0.0,Sea Disaster,PAPUA NEW GUINEA,Between New Ireland & New Britain,St. George’s Channel,Spent 8 days in dinghy,,"No injury, but shark removed the heel & part o...",N,,"G.A. Llano in Airmen Against the Sea, p.69; V....",NaT,41.661210,-72.779542
5997,6444,Before 1911,0.0,Unprovoked,VIETNAM,Ba Ria-Vung Tau Province,Vũng Tàu,Swimming around anchored ship,,Foot bitten,N,,"Daily Kennebec Journal, 3/27/1911",NaT,10.541740,107.242998


### 3.6 Analizing Activity and standarized

In [38]:
mask = data['Activity'].value_counts().values > 5
data['Activity'].value_counts()[mask].sum()

3938

In [39]:
key_words = ['Surfing', 'Swimming', 'Fishing', 'Spearfishing', 'Wading', 'Bathing',
       'Diving', 'Standing', 'Snorkeling', 'Scuba diving', 'Body boarding',
       'Body surfing', 'Boogie boarding', 'Kayaking', 'Treading water',
       'Free diving', 'Pearl diving', 'Surf skiing', 'Floating',
       'Fell overboard', 'Walking', 'Boogie Boarding', 'Windsurfing',
       'Shark fishing', 'Rowing', 'Surf fishing', 'Surf-skiing', 'Canoeing',
       'Kayak Fishing', 'Sitting on surfboard', 'Scuba Diving',
       'Diving for trochus', 'Fell into the water', 'Paddle boarding',
       'Sailing', 'Playing', 'Fishing for sharks', 'Sponge diving',
       'Floating on his back', 'Free diving for abalone',
       'Surfing (sitting on his board)', 'Skindiving',
       'Stand-Up Paddleboarding', 'Diving for abalone',
       'Spearfishing on Scuba', 'Freediving', 'Kite Surfing', 'Splashing']

In [40]:
data['Activity'] = data['Activity'].apply(lambda x: x if x in key_words else 'other')

In [41]:
data['Age'] = pd.to_numeric(data['Age'], errors='coerce')

In [42]:
mask = data.Injury.value_counts() > 10
data['Injury'].value_counts()[mask].sum()

1696

### 3.7 Cleaning if is Fatal or Not

In [43]:
mask = data['Fatal (Y/N)'].notnull()
data = data[mask]

data['Fatal (Y/N)'] = data['Fatal (Y/N)'].apply(lambda x: 'Y' if x == 'F' or x == 'M' or x == '2017' else x)
data['Fatal (Y/N)'] = data['Fatal (Y/N)'].apply(lambda x: 'N' if x == 'UNKNOWN' else x) 
data['Fatal (Y/N)'].value_counts()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['Fatal (Y/N)'] = data['Fatal (Y/N)'].apply(lambda x: 'Y' if x == 'F' or x == 'M' or x == '2017' else x)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['Fatal (Y/N)'] = data['Fatal (Y/N)'].apply(lambda x: 'N' if x == 'UNKNOWN' else x)


N    4294
Y    1210
Name: Fatal (Y/N), dtype: int64

In [44]:
data['Type'].value_counts()

Unprovoked             4423
Provoked                556
Sea Disaster            170
Watercraft              128
Boat                    102
Boating                  86
Invalid                  26
Questionable              9
Unverified                1
Under investigation       1
Boatomg                   1
Name: Type, dtype: int64

In [45]:
data['Country'].value_counts()

USA                     2117
AUSTRALIA               1246
SOUTH AFRICA             517
PAPUA NEW GUINEA         127
NEW ZEALAND              120
                        ... 
WESTERN SAMOA              1
SLOVENIA                   1
ARGENTINA                  1
NETHERLANDS ANTILLES       1
KOREA                      1
Name: Country, Length: 157, dtype: int64

In [46]:
data['Age'] = data['Age'].fillna(data['Age'].median())

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['Age'] = data['Age'].fillna(data['Age'].median())


### 3.8 Dropping columns useless to the analysis of the model

In [47]:
data = data[['Type', 'Country', 'Area', 'Activity', 'Age', 'Fatal (Y/N)', 'clean_date', 'Latitud', 'Longitud']]

In [48]:
data

Unnamed: 0,Type,Country,Area,Activity,Age,Fatal (Y/N),clean_date,Latitud,Longitud
0,Unprovoked,USA,Maui,other,24.0,N,2020-02-05,20.798363,-156.331925
1,Provoked,BAHAMAS,Exumas,Floating,24.0,N,2020-01-30,23.619260,-75.969547
2,Unprovoked,BAHAMAS,Exumas,Swimming,8.0,N,2018-11-09,23.619260,-75.969547
3,Unprovoked,AUSTRALIA,New South Wales,Surfing,59.0,N,2020-01-17,-31.253218,146.921099
5,Unprovoked,AUSTRALIA,New South Wales,Surfing,29.0,N,2019-10-05,-31.253218,146.921099
...,...,...,...,...,...,...,...,...,...
5994,Unprovoked,INDONESIA,Riau Province,other,24.0,Y,NaT,0.293347,101.706829
5995,Unprovoked,MARSHALL ISLANDS,Bikini Atoll,Swimming,24.0,N,NaT,11.606514,165.376810
5996,Sea Disaster,PAPUA NEW GUINEA,Between New Ireland & New Britain,other,24.0,N,NaT,41.661210,-72.779542
5997,Unprovoked,VIETNAM,Ba Ria-Vung Tau Province,other,24.0,N,NaT,10.541740,107.242998


In [49]:
data.groupby(['Type']).get_group('Unprovoked')

Unnamed: 0,Type,Country,Area,Activity,Age,Fatal (Y/N),clean_date,Latitud,Longitud
0,Unprovoked,USA,Maui,other,24.0,N,2020-02-05,20.798363,-156.331925
2,Unprovoked,BAHAMAS,Exumas,Swimming,8.0,N,2018-11-09,23.619260,-75.969547
3,Unprovoked,AUSTRALIA,New South Wales,Surfing,59.0,N,2020-01-17,-31.253218,146.921099
5,Unprovoked,AUSTRALIA,New South Wales,Surfing,29.0,N,2019-10-05,-31.253218,146.921099
7,Unprovoked,AUSTRALIA,New South Wales,Spearfishing,18.0,N,2019-04-27,-31.253218,146.921099
...,...,...,...,...,...,...,...,...,...
5993,Unprovoked,MADAGASCAR,Toamasina Province,Bathing,24.0,Y,NaT,-17.774529,49.041312
5994,Unprovoked,INDONESIA,Riau Province,other,24.0,Y,NaT,0.293347,101.706829
5995,Unprovoked,MARSHALL ISLANDS,Bikini Atoll,Swimming,24.0,N,NaT,11.606514,165.376810
5997,Unprovoked,VIETNAM,Ba Ria-Vung Tau Province,other,24.0,N,NaT,10.541740,107.242998


In [50]:
mask = data['Area'].value_counts().values > 4
data['Area'].value_counts()[mask]

Florida               1008
New South Wales        449
Queensland             300
Hawaii                 278
California             265
                      ... 
Hamilton                 5
Taveuni                  5
Madang                   5
Bimini Islands           5
Cap Vert Peninsula       5
Name: Area, Length: 99, dtype: int64

### 3.9 Getting the Dummies

In [51]:
encoder = OneHotEncoder(sparse=False, handle_unknown='ignore')
dummies = encoder.fit_transform(data[['Type','Country','Activity','Fatal (Y/N)']])

In [52]:
dummies = pd.DataFrame(dummies.astype(int), columns=np.concatenate(encoder.categories_))
dummies.sample(10)

Unnamed: 0,Boat,Boating,Boatomg,Invalid,Provoked,Questionable,Sea Disaster,Under investigation,Unprovoked,Unverified,...,Surfing,Surfing (sitting on his board),Swimming,Treading water,Wading,Walking,Windsurfing,other,N,Y
1778,0,0,0,0,0,0,0,0,1,0,...,1,0,0,0,0,0,0,0,1,0
2543,0,0,0,0,0,0,0,0,1,0,...,0,0,1,0,0,0,0,0,0,1
4514,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,1,1,0
4468,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,1,0,1
4619,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,1,1,0
1155,0,0,0,0,0,0,0,0,1,0,...,0,0,1,0,0,0,0,0,1,0
2731,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
4157,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,1
3900,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,1,0,1
2213,0,0,0,0,0,0,0,0,1,0,...,1,0,0,0,0,0,0,0,1,0


In [55]:
data.sample(3)

Unnamed: 0,Area,Age,clean_date,Latitud,Longitud,Boat,Boating,Boatomg,Invalid,Provoked,...,Surfing,Surfing (sitting on his board),Swimming,Treading water,Wading,Walking,Windsurfing,other,N,Y
929,Queensland,24.0,NaT,-22.575197,144.084793,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
11,New South Wales,36.0,2018-12-09,-31.253218,146.921099,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1687,Hawaii,58.0,2013-04-02,19.898682,-155.665857,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


In [53]:
data = data.join(dummies)
data = data.drop(columns=['Type','Country','Activity','Fatal (Y/N)'])

### 4.1 Split del set de entrenamiento

In [None]:
X = data.drop([