## Introduction

The project aims to predict the fatality of a potential shark attack based on predictor variables such as:

- Wound location
- Activity during the attack
- Type of incident
- Victim's age and gender
- Geographical area and a risk categorization assigned to the country based on its fatality ratio
- Size of the shark species responsible for the attack
- Year of the incident

To make this prediction, we used a logistic regression model and evaluated its performance with accuracy, precision, recall, F1 score, and ROC AUC.
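As a reference for how the threshold-based metrics relate to each other, here is a minimal sketch that derives accuracy, precision, recall, and F1 from confusion-matrix counts. The helper `binary_metrics` and the toy labels are illustrative assumptions, not part of the project data; ROC AUC is left out because it needs predicted scores rather than hard labels.

```python
def binary_metrics(y_true, y_pred):
    """Compute basic binary classification metrics from 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels for illustration only (1 = fatal, 0 = non-fatal)
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Precision and recall are not symmetric in their arguments, so the order of true labels and predictions matters when calling the equivalent scikit-learn functions.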

In [1]:
# !pip uninstall scikit-learn --yes
# !pip uninstall imblearn --yes
# !pip install scikit-learn==1.2.2
# !pip install imblearn

### Import and settings

In [4]:
import pandas as pd
import numpy as np
import re
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder, Normalizer, StandardScaler, MinMaxScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix
from imblearn.over_sampling import RandomOverSampler

import warnings
warnings.filterwarnings('ignore')

ImportError: cannot import name '_MissingValues' from 'sklearn.utils._param_validation' (C:\Users\laura\anaconda3\Lib\site-packages\sklearn\utils\_param_validation.py)

In [5]:
pd.set_option("display.max_rows", 20)
pd.set_option("display.max_columns", 20)

### Loading the cleaned data

At the start of the project, we met to compare our dataframes and understand the different approaches and processes each of us had followed.

Because of differences in steps such as NaN handling, our datasets had different sizes.

We therefore chose one of the two datasets to continue with model improvement.

In [None]:
final_df = pd.read_csv(r"c:\Users\USUARIO\Desktop\Data Analysis\Ironhack\Mini-Proyecto\final_df_angel.csv", encoding= "latin1")
final_df = final_df.drop(columns="Unnamed: 0")
final_df.head()

## Starting point: Classification model 

In [None]:
X = final_df.drop(["y", "n", "fatal_unspecified"], axis=1)
y = final_df["y"]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [None]:
scaler = StandardScaler()
scaler.fit(X_train)

In [None]:
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [None]:
classification = LogisticRegression(random_state=42)
classification.fit(X_train_scaled, y_train)

In [None]:
predictions = classification.predict(X_test_scaled)
predictions

In [None]:
cm = confusion_matrix(y_test, predictions)
cf_matrix = confusion_matrix(y_test, predictions, normalize='all')
sns.heatmap(cf_matrix, annot=True)

In [None]:
# The model was trained on scaled features, so it must be scored on scaled test data
classification.score(X_test_scaled, y_test)

In [None]:
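# scikit-learn metrics expect the true labels first, then the predictions
precision = precision_score(y_test, predictions)
print("Precision:", precision)

recall = recall_score(y_test, predictions)
print("Recall:", recall)

f1 = f1_score(y_test, predictions)
print("F1-score:", f1)

# ROC AUC is computed from the predicted probability of the positive class
roc_auc = roc_auc_score(y_test, classification.predict_proba(X_test_scaled)[:, 1])
print("ROC-AUC Score:", roc_auc)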

## Optimization 

With the starting point established, we shifted our focus to optimizing the model, beginning with outlier removal.

### Remove outliers

Using the interquartile range (IQR), we flagged numerical values lying at or beyond 1.5 times the IQR below the first quartile or above the third quartile, and removed those rows from the final dataset.
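The IQR rule described above can be sketched as a single boolean mask. This is a minimal illustration on a made-up dataframe; the helper name `iqr_mask` and the toy values are assumptions, not project data.

```python
import pandas as pd


def iqr_mask(df, columns, k=1.5):
    """Return True for rows whose values in `columns` all lie strictly inside the IQR fences."""
    q1 = df[columns].quantile(0.25)
    q3 = df[columns].quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Comparisons align the per-column limits with the dataframe columns
    inside = (df[columns] > lower) & (df[columns] < upper)
    return inside.all(axis=1)


# Toy data for illustration only; the year 1700 is a deliberate outlier
toy = pd.DataFrame({
    "year": [1990, 1995, 2000, 2005, 2010, 1700],
    "age": [20, 25, 30, 35, 40, 33],
})
kept = toy[iqr_mask(toy, ["year", "age"])].reset_index(drop=True)
print(kept)
```

Filtering with a mask keeps the selection and the removal in one step, so there is no need to match outlier rows back to the original dataframe afterwards.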

In [None]:
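num_df = final_df[["year", "age"]]
# Draw each variable on its own axes so the two boxplots do not overlap
fig, axes = plt.subplots(2, 1, figsize=(15, 10))
sns.boxplot(data=num_df, x="year", ax=axes[0])
sns.boxplot(data=num_df, x="age", ax=axes[1])
plt.show()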

In [None]:
# Calculating the interquartile range of year and age

summary = num_df.describe().T
IQR = summary["75%"] - summary["25%"]
left_end = summary["25%"] - 1.5 * IQR
right_end = summary["75%"] + 1.5 * IQR
print(left_end, right_end)

In [None]:
# Filtering outliers: rows where year or age falls on or outside the IQR limits

outliers = final_df[
    (final_df["year"] <= left_end["year"])
    | (final_df["year"] >= right_end["year"])
    | (final_df["age"] <= left_end["age"])
    | (final_df["age"] >= right_end["age"])
]
outliers.head()

In [None]:
# Removing outliers from df by index, then counting how many rows were dropped

filtered_df = final_df.drop(index=outliers.index).reset_index(drop=True)
final_df.shape[0] - filtered_df.shape[0]

In [None]:
# Checking boxplots, one set of axes per variable

fig, axes = plt.subplots(2, 1, figsize=(15, 10))
sns.boxplot(data=filtered_df, x="year", ax=axes[0])
sns.boxplot(data=filtered_df, x="age", ax=axes[1])
plt.show()

### Balancing the model

The next step was balancing the classes, as the dependent variable had significantly more negative values than positive ones. We used RandomOverSampler to resample the training partition only, leaving the test partition untouched.
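To illustrate the idea behind random oversampling, the sketch below duplicates randomly chosen minority-class rows until every class matches the majority count. It is a hand-rolled illustration using only the standard library, not imblearn's implementation; `random_oversample` and the toy data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=42):
    """Duplicate randomly chosen rows of smaller classes until all classes are the same size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())  # size of the majority class
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == cls]
        # Sample with replacement to fill the gap up to the majority size
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Toy training partition: six negatives, two positives
rows = [[i] for i in range(8)]
labels = [0, 0, 0, 0, 0, 0, 1, 1]
balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))
```

Resampling only the training partition matters: duplicating rows before the split could place copies of the same observation in both train and test sets and inflate the test metrics.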

In [None]:
# Train-test splitting

X = filtered_df.drop(["y", "n", "fatal_unspecified"], axis=1)
y = filtered_df["y"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=33)

In [None]:
# Creating the resampler

sampler = RandomOverSampler(random_state=42)

In [None]:
# Resampling X_train and y_train, then checking the class counts

X_train_balanced, y_train_balanced = sampler.fit_resample(X_train, y_train)
y_train_balanced.value_counts()

In [None]:
# Scaling balanced data

scaler = StandardScaler()
scaler.fit(X_train_balanced)

In [None]:
X_train_scaled = scaler.transform(X_train_balanced)
X_test_scaled = scaler.transform(X_test)

## Final Implementation

After this phase, we intended to select the best predictor variables iteratively using backward selection.

However, after monitoring the model's performance following these changes, we judged the results good enough to skip further feature selection.
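For reference, backward selection could be sketched with scikit-learn's `SequentialFeatureSelector`. This was not run in the project; the synthetic data, the number of features to keep, and the F1 scoring choice are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data; the project's dataframe is not used here
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=3, n_redundant=0, random_state=0
)

# Start from all features and drop one at a time, keeping the subset
# with the best cross-validated F1 score at each step
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=3,
    direction="backward",
    scoring="f1",
    cv=3,
)
selector.fit(X_demo, y_demo)

selected = selector.get_support()  # boolean mask over the original columns
print(selected)
```

With a dataframe input, the mask returned by `get_support()` can be applied to the column index to recover the names of the retained predictors.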

Finally, we experimented with different scaling methods and ultimately chose StandardScaler.

We cannot confirm whether the model is overfitting, but we are satisfied with the results achieved.
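One way to compare scalers and get a rough indication of overfitting is to fit the same model with each scaler and compare training and test accuracy. The sketch below uses synthetic data and is only an illustration under those assumptions; it does not reproduce the project's results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-in data; the project's dataframe is not used here
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

scores = {}
for scaler in (StandardScaler(), MinMaxScaler()):
    # The pipeline fits the scaler on the training data only
    model = make_pipeline(scaler, LogisticRegression(max_iter=1000))
    model.fit(X_tr, y_tr)
    scores[type(scaler).__name__] = (model.score(X_tr, y_tr), model.score(X_te, y_te))

for name, (train_acc, test_acc) in scores.items():
    # A large positive gap between train and test accuracy suggests overfitting
    print(f"{name}: train={train_acc:.3f} test={test_acc:.3f} gap={train_acc - test_acc:.3f}")
```

A small gap does not prove the absence of overfitting, but a large one is a useful warning sign; cross-validation would give a more stable estimate than a single split.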

In [None]:
classification.fit(X_train_scaled, y_train_balanced)

In [None]:
# Representing performance in a confusion matrix

predictions = classification.predict(X_test_scaled)
cf_matrix = confusion_matrix(y_test, predictions, normalize='all')
sns.heatmap(cf_matrix, annot=True)
print(cf_matrix)

In [None]:
classification.score(X_test_scaled, y_test)

In [3]:
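# Trying new metrics to deepen our understanding of the performance of the model
# scikit-learn metrics expect the true labels first, then the predictions

precision = precision_score(y_test, predictions)
print("Precision:", precision)

recall = recall_score(y_test, predictions)
print("Recall:", recall)

f1 = f1_score(y_test, predictions)
print("F1-score:", f1)

# ROC AUC is computed from the predicted probability of the positive class
roc_auc = roc_auc_score(y_test, classification.predict_proba(X_test_scaled)[:, 1])
print("ROC-AUC Score:", roc_auc)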

NameError: name 'predictions' is not defined