<div style="text-align: center;">
    <img src="../common/logo_DH.png" width="30%" height="30%" style="text-align: left;">
    <img src="../common/wipro_logo.png"  width="20%" height="20%" style="text-align: right; margin-left:400px;">
</div>

---

# SHARK ATTACK ANALYSIS

Developed by:
- Eng. Aer. Pablo Bauer

<a id="table_contents"></a> 
## Table of Contents

### <a href='#intro'>1. Introduction</a>
- #### <a href='#case'>1.1. Case</a>
- #### <a href='#workflow'>1.2. Workflow</a>

### <a href='#section_mod_import'>2. Modules Import</a>


### <a href='#dataset_import'>3. DataSet Import</a> 
- #### <a href='#cleaninig'>3.1. Cleaning data</a>

### <a  href='#set_training'>4. Set and training Dataset</a>
- #### <a href='#features_dummies'>4.1. Features and dummies</a>
- #### <a href='#split'>4.2. Split training Dataset</a>
- #### <a href='#matrix_corr'>4.3. Correlation Matrix</a>
- #### <a href='#analysis_corr'>4.4. Correlation Analysis</a>

### <a  href='#models_evaluations'>5. Metrics and Evaluations</a>
- #### <a href='#pipeline_rfr'>5.1. PipeLine [RandomForestRegressor]</a>
- #### <a href='#features_imp_rfr'>5.2. Features Importance [RandomForestRegressor]</a>
- #### <a href='#pipeline_ada'>5.3. PipeLine [Ada Boost Regressor]</a>
- #### <a href='#features_imp_ada'>5.4. Features Importance [Ada Boost Regressor]</a>
- #### <a href='#pipeline_dtr'>5.5. PipeLine [DecisionTreeRegressor]</a>
- #### <a href='#features_imp_dtr'>5.6. Features Importance [DecisionTreeRegressor]</a>
- #### <a href='#pipeline_knn'>5.7. PipeLine [KNN]</a>
- #### <a href='#features_imp_knn'>5.8. Features Importance [KNN]</a>
- #### <a href='#best_performance'>5.9. Training with the best-performing parameters [Full DataSet]</a>
    
### <a href='#roc_curve'>6. ROC curve</a>
- #### <a href='#section_roc_auc'>6.1. ROC and AUC</a>

### <a href='#export_pickle'>7. Exporting to Pickle</a>

### <a href='#conclusions'>8. Conclusions</a>
---

<a id="intro"></a> 
## 1. Introduction
---
<a href='#table_contents'>Back up</a>

<a id="case"></a> 
### 1.1 About the Case

<img src="../common/solar_farms.png"  width="100%" height="100%">

#### Global Shark Attack Incidents
*Credits to Brenda Griffith [source](https://data.world/siyeh)*

The Global Shark Attack Data is a comprehensive dataset that provides daily updated records of shark attack incidents worldwide. It offers valuable information on various aspects of each incident, including the date and location of the attack, specific details about the activity the victim was engaged in at the time, and whether it resulted in a fatality or not. With additional columns such as age, injury description, and even the name of the victim involved, this dataset aims to inform people about the risks associated with coastal water activities.

The dataset also sheds light on factors that contribute to shark attacks by providing information on the type of incident, such as whether it was provoked or unprovoked. It further categorizes incidents based on different countries and areas within those countries where they occurred. The presence of additional columns like investigator or source helps trace back to their reporting sources for further verification and analysis.

Moreover, this dataset goes beyond numbers and text by including links to PDF files containing more detailed information about each shark attack incident. This allows researchers or interested individuals to delve deeper into individual cases for better understanding.

To facilitate data visualization and exploration, several columns provide configurations related to the encoding and mark type used in visualization tools. The inclusion of the original order column helps maintain consistency when handling large datasets.

In summary, this dataset serves as a valuable resource for researchers studying shark attacks worldwide, and it aims to improve understanding between sharks and humans through education on the risks associated with coastal water activities.

---

### Employer Needs

**Explore Attack Types:**

Use information from the Type column to analyze different types of attacks, such as Unprovoked or Provoked incidents. Identify patterns or trends and gain insights into the factors contributing to each type of attack.

**Analyze Geographic Trends:**

Use data from the Country and Area columns to identify regions with higher concentrations of recorded incidents. Compare countries and areas to understand which locations are more prone to attacks.

**Study Activity-Related Incidents:**

Focus on data from the Activity column to examine the activities that most often lead to encounters between sharks and humans. Determine which activities pose higher risks and develop guidelines or recommendations accordingly.

**Investigate Victim Characteristics:**

Analyze data from the Name and Age columns to study the profiles of victims involved in shark attack incidents. Identify any patterns or demographics that might be useful in understanding the prevalence of attacks.

**Assess Injury Severity:**

Explore data from the Injury column to gain insights into the types and severities of injuries resulting from shark attacks. Analyze patterns and determine the most common injuries sustained by victims.

**Consider Fatality Rates:**

Examine the Fatal (Y/N) column to determine the fatality rates associated with shark attack incidents. Compare fatality rates across countries, areas, or activity types to understand the relative risks.
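
As an illustration of the fatality-rate comparison, the idea can be sketched with a pandas `groupby`. This is only a sketch under assumptions: the rows below are invented for the example and are not records from the dataset, and the names `toy`, `is_fatal` and `fatality_rate` are hypothetical. Only the column names `Country` and `Fatal (Y/N)` come from the dataset.

```python
import pandas as pd

# Toy rows, invented for illustration only (not real records).
toy = pd.DataFrame({
    "Country": ["USA", "USA", "AUSTRALIA", "AUSTRALIA", "AUSTRALIA"],
    "Fatal (Y/N)": ["N", "Y", "N", "N", "Y"],
})

# Map Y/N to 1.0/0.0; any other value becomes NaN and is skipped by mean().
is_fatal = (
    toy["Fatal (Y/N)"].str.strip().str.upper().map({"Y": 1.0, "N": 0.0})
)

# Share of fatal incidents per country.
fatality_rate = is_fatal.groupby(toy["Country"]).mean()
print(fatality_rate)
```

The same grouping key could be swapped for `Area` or `Activity` to compare rates along those dimensions.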

---

#### Global Shark Attack Incidents Dataset

#### Columns
- **Case Number** - A unique identifier for each shark attack incident.
- **Type** - The type of shark attack incident, such as Unprovoked or Provoked.
- **Country** - The country where the shark attack incident occurred.
- **Area** - The specific area within the country where the shark attack incident occurred.
- **Location** - The location or beach where the shark attack incident occurred.
- **Activity** - The activity the victim was engaged in at the time of the shark attack incident.
- **Name** - The name of the victim involved in the shark attack incident.
- **Age** - The age of the victim involved in the shark attack incident.
- **Injury** - The description of the injury sustained by the victim in a shark attack incident.
- **Fatal (Y/N)** - Indicates whether or not a fatality occurred in a given shark attack incident.

<a id="workflow"></a> 
### 1.2 Workflow

*TO-DO*

<a id="section_mod_import"></a>
## 2. Modules Import
---
<a href='#table_contents'>Back up</a>

In [32]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pickle
import plotly.express as px
import re

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler, MinMaxScaler
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix, r2_score, mean_absolute_error, mean_squared_error, roc_curve
from sklearn.utils import resample

<a id="dataset_import"></a>
## 3. DataSet Import
---
<a href='#table_contents'>Back up</a>

In [7]:
file_path = "../data/archive/GSAF5.xls.csv"

data = pd.read_csv(file_path, sep=',', encoding='utf8')

print(data.shape)

(6462, 257)


In [8]:
data

Unnamed: 0,index,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,...,Unnamed: 246,Unnamed: 247,Unnamed: 248,Unnamed: 249,Unnamed: 250,Unnamed: 251,Unnamed: 252,Unnamed: 253,Unnamed: 254,Unnamed: 255
0,0,2020.02.05,05-Feb-2020,2020.0,Unprovoked,USA,Maui,,Stand-Up Paddle boarding,,...,,,,,,,,,,
1,1,2020.01.30.R,Reported 30-Jan-2020,2020.0,Provoked,BAHAMAS,Exumas,,Floating,Ana Bruna Avila,...,,,,,,,,,,
2,2,2020.01.17,17-Jan-2020,2020.0,Unprovoked,AUSTRALIA,New South Wales,Windang Beach,Surfing,Will Schroeter,...,,,,,,,,,,
3,3,2020.01.16,16-Jan-2020,2020.0,Unprovoked,NEW ZEALAND,Southland,Oreti Beach,Surfing,Jordan King,...,,,,,,,,,,
4,4,2020.01.13,13-Jan-2020,2020.0,Unprovoked,USA,North Carolina,"Rodanthe, Dare County",Surfing,Samuel Horne,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6457,6457,ND.0005,Before 1903,0.0,Unprovoked,AUSTRALIA,Western Australia,Roebuck Bay,Diving,male,...,,,,,,,,,,
6458,6458,ND.0004,Before 1903,0.0,Unprovoked,AUSTRALIA,Western Australia,,Pearl diving,Ahmun,...,,,,,,,,,,
6459,6459,ND.0003,1900-1905,0.0,Unprovoked,USA,North Carolina,Ocracoke Inlet,Swimming,Coast Guard personnel,...,,,,,,,,,,
6460,6460,ND.0002,1883-1889,0.0,Unprovoked,PANAMA,,"Panama Bay 8ºN, 79ºW",,Jules Patterson,...,,,,,,,,,,


**Overview: Nulls in columns**

Identify which columns contain data and which do not, then drop the ones that are just noise.

In [13]:
data_nan = data.apply(lambda x: x.notnull().sum())
data_nan

index           6462
Case Number     6460
Date            6461
Year            6459
Type            6457
                ... 
Unnamed: 251       0
Unnamed: 252       0
Unnamed: 253       0
Unnamed: 254       0
Unnamed: 255       0
Length: 257, dtype: int64

In [16]:
data = data.dropna(how='all', axis=1)
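
To make the effect of `dropna(how='all', axis=1)` concrete, here is a minimal sketch on a toy frame. The values are invented for the example and `toy` and `cleaned` are hypothetical names. It shows that only columns with no values at all are removed, while a column holding even a single value is kept.

```python
import numpy as np
import pandas as pd

# Toy frame, invented for illustration only.
toy = pd.DataFrame({
    "Type": ["Unprovoked", None, "Provoked"],        # partly filled: kept
    "Unnamed: 250": [np.nan, np.nan, np.nan],        # entirely empty: dropped
    "Unnamed: 9": [np.nan, "stray value", np.nan],   # one value: kept
})

# how="all" drops a column only when every entry in it is missing.
cleaned = toy.dropna(how="all", axis=1)
print(list(cleaned.columns))  # ['Type', 'Unnamed: 9']
```

This would explain why nearly empty columns can survive this step and have to be removed by explicit selection afterwards.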

In [22]:
data_nan = data.apply(lambda x: x.isnull().sum())
data_nan

index                        0
Case Number                  2
Date                         1
Year                         3
Type                         5
Country                     51
Area                       463
Location                   545
Activity                   552
Name                       215
Unnamed: 9                6434
Age                       2871
Injury                      29
Fatal (Y/N)                547
Time                      3392
Species                   2924
Investigator or Source      19
pdf                       3396
href formula              3400
href                      3400
Case Number.1             3400
Case Number.2             3400
original order            3400
Unnamed: 23               6460
dtype: int64

**Dropping other columns**

We will drop columns that are not essential, such as `Unnamed: 9`, `Unnamed: 23`, and `pdf`.

To do that, we simply select the columns that we need.

In [23]:
data.columns

Index(['index', 'Case Number', 'Date', 'Year', 'Type', 'Country', 'Area',
       'Location', 'Activity', 'Name', 'Unnamed: 9', 'Age', 'Injury',
       'Fatal (Y/N)', 'Time', 'Species ', 'Investigator or Source', 'pdf',
       'href formula', 'href', 'Case Number.1', 'Case Number.2',
       'original order', 'Unnamed: 23'],
      dtype='object')

In [29]:
data = data[['index', 'Date', 'Year', 'Type', 'Country', 'Area',
       'Location', 'Activity', 'Age', 'Injury',
       'Fatal (Y/N)', 'Species ', 'Investigator or Source']]
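
Note that the column is named `'Species '` with a trailing space, as the `data.columns` output shows. One possible way to avoid carrying that space around is to strip whitespace from all column names. The following is a sketch on a toy frame (the frame and the name `toy` are hypothetical), not a step taken from the original notebook.

```python
import pandas as pd

# Toy frame reproducing the trailing space in the column name.
toy = pd.DataFrame({"Species ": ["Tiger shark"], "Fatal (Y/N)": ["N"]})

# Index.str.strip() removes leading and trailing whitespace from each name.
toy.columns = toy.columns.str.strip()
print(list(toy.columns))  # ['Species', 'Fatal (Y/N)']
```

If this were applied to `data`, any later cell that refers to `'Species '` would have to use `'Species'` instead.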

In [30]:
data

Unnamed: 0,index,Date,Year,Type,Country,Area,Location,Activity,Age,Injury,Fatal (Y/N),Species,Investigator or Source
0,0,05-Feb-2020,2020.0,Unprovoked,USA,Maui,,Stand-Up Paddle boarding,,"No injury, but paddleboard bitten",N,Tiger shark,"K. McMurray, TrackingSharks.com"
1,1,Reported 30-Jan-2020,2020.0,Provoked,BAHAMAS,Exumas,,Floating,24,PROVOKED INCIDENT Scratches to left wrist,N,,"K. McMurray, TrackingSharks.com"
2,2,17-Jan-2020,2020.0,Unprovoked,AUSTRALIA,New South Wales,Windang Beach,Surfing,59,Laceration ot left ankle and foot,N,"""A small shark""","B. Myatt & M. Michaelson, GSAF; K. McMurray, T..."
3,3,16-Jan-2020,2020.0,Unprovoked,NEW ZEALAND,Southland,Oreti Beach,Surfing,13,Minor injury to lower leg,N,Broadnose seven gill shark?,"K. McMurray, TrackingSharks.com"
4,4,13-Jan-2020,2020.0,Unprovoked,USA,North Carolina,"Rodanthe, Dare County",Surfing,26,Lacerations to foot,N,,"C. Creswell, GSAF"
...,...,...,...,...,...,...,...,...,...,...,...,...,...
6457,6457,Before 1903,0.0,Unprovoked,AUSTRALIA,Western Australia,Roebuck Bay,Diving,,FATAL,Y,,"H. Taunton; N. Bartlett, p. 234"
6458,6458,Before 1903,0.0,Unprovoked,AUSTRALIA,Western Australia,,Pearl diving,,FATAL,Y,,"H. Taunton; N. Bartlett, pp. 233-234"
6459,6459,1900-1905,0.0,Unprovoked,USA,North Carolina,Ocracoke Inlet,Swimming,,FATAL,Y,,"F. Schwartz, p.23; C. Creswell, GSAF"
6460,6460,1883-1889,0.0,Unprovoked,PANAMA,,"Panama Bay 8ºN, 79ºW",,,FATAL,Y,,"The Sun, 10/20/1938"


**Extracting series**

Extract the series that needs to be normalized into the format we need.

In [33]:
data_date_regex = data['Date']

**Use the pattern to normalize**

Set up sample strings and the pattern needed to extract the useful information.

In [163]:
# Python's re module has no POSIX classes, so [:space:] is written as \s.
pattern_date = r"^([a-zA-Z]+|[0-9]+)(-|.\s*)([a-zA-Z]+|[0-9]+)(-)*([a-zA-Z]+|[0-9]*)(-)*([0-9]*)"
text_1 = 'Before 1903'
text_2 = '05-Feb-2020'
text_3 = 'Reported 30-Jan-2020	'
text_4 = '1845-1853'

pattern_regex = re.compile(pattern_date)

"""
def condition_regex(x):
    if x is not(x[0][0]=='Before') or not(x[0][0]=='Reported') or int(x[0][0]) > 31:
        x[0][0]
"""

print(re.findall(pattern_date, text_1))
print(re.findall(pattern_date, text_2))
print(re.findall(pattern_date, text_3))
print(re.findall(pattern_date, text_4))

data_test = data['Date'].apply(lambda x: pattern_regex.findall(x))
data_test

[('Before', ' ', '1903', '', '', '', '')]
[('05', '-', 'Feb', '-', '2020', '', '')]
[('Reported', ' ', '30', '-', 'Jan', '-', '2020')]
[('1845', '-', '1853', '', '', '', '')]


TypeError: expected string or bytes-like object
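
`findall` raises this `TypeError` when it receives something that is not a string. A plausible cause here is the one missing `Date` value counted earlier, since pandas represents it as a float `NaN`. The following is a minimal sketch of one possible workaround on a toy series. The simplified pattern `token_regex` and the names `dates` and `tokens` are hypothetical stand-ins, not the notebook's own pattern or data.

```python
import re

import numpy as np
import pandas as pd

# Simplified stand-in pattern (hypothetical): runs of letters or digits.
token_regex = re.compile(r"[A-Za-z]+|[0-9]+")

# Toy series with one missing value, mimicking the Date column.
dates = pd.Series(["05-Feb-2020", "Before 1903", np.nan])

# Apply the regex only to real strings; missing values yield an empty list.
tokens = dates.apply(
    lambda x: token_regex.findall(x) if isinstance(x, str) else []
)
print(tokens.tolist())
# [['05', 'Feb', '2020'], ['Before', '1903'], []]
```

Dropping the missing rows first with `dropna()` would be an alternative, at the cost of changing the series length relative to `data`.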