# Global Shark Attack Incidents
## Heatmap

This notebook uses a dataset of worldwide shark attack records to explore how often and where these incidents occur. The analysis covers the geographic distribution of shark attacks, showing the areas where such incidents are more prevalent, and ends with a heat map of the regions most affected.

## Shark data (import and visualization)

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
!pip install seaborn



In [3]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn import preprocessing

""" 
Setting display.max_columns to None ensures that Pandas will display all columns of a DataFrame, 
regardless of its size. This can be useful when working with datasets that have a large number of columns, 
as it allows you to see all the available data without truncation.
It improves the readability and usability of the DataFrame, especially when exploring or analyzing 
the data interactively.
"""
pd.set_option("display.max_columns",None)

"""
The low_memory=False parameter is optional and was set to False.
With the default (low_memory=True), pandas parses the file in chunks, which uses less memory
but can infer different data types for different parts of the same column (mixed types).
Setting low_memory=False makes pandas read the whole file before inferring each column's type,
which avoids mixed-type columns in files with many columns or varied data.
The trade-off is that it may consume more RAM while the file is being read.
"""
df = pd.read_csv("./inputs/sharks/GSAF5.xls.csv", low_memory=False)
print(df.columns)

df = df[['Case Number', 'Date', 'Year', 'Type', 'Country', 'Area',
       'Location', 'Activity', 'Name', 'Unnamed: 9', 'Age', 'Time','Species ','Fatal (Y/N)','Injury']]


"""
It's not strictly necessary to maintain the original order of columns when renaming them. 
However, doing so can enhance clarity, consistency, and compatibility with existing code.
"""
# Renaming column ('Unnamed: 9' to 'Victim\'s Gender')
df.columns = ['Case Number', 'Date', 'Year', 'Type', 'Country', 'Area',
       'Location', 'Activity', 'Name', 'Victim\'s Gender', 'Age', 'Time','Species','Fatal', 'Injury']

# Amount of "(lines, columns)" respectively
print(df.shape)

df.head()
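
The positional rename above depends on the column order staying fixed. A minimal sketch of an alternative, assuming a hypothetical two-row sample in place of the real `GSAF5.xls.csv`: `DataFrame.rename` maps old names to new ones explicitly, so the order no longer matters. The sample rows and the `raw` and `renamed` variable names are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for GSAF5.xls.csv. Note the
# trailing space in 'Species ' and the unnamed gender column.
sample_csv = io.StringIO(
    "Case Number,Country,Unnamed: 9,Species ,Fatal (Y/N)\n"
    "2018.06.25,USA,F,White shark,N\n"
    "2018.06.18,BRAZIL,M,Tiger shark,Y\n"
)

raw = pd.read_csv(sample_csv, low_memory=False)

# Map old names to new names explicitly; untouched columns keep their names.
renamed = raw.rename(columns={
    "Unnamed: 9": "Victim's Gender",
    "Species ": "Species",
    "Fatal (Y/N)": "Fatal",
})
print(list(renamed.columns))
```

With this form, adding or reordering columns in the selection step cannot silently shift a label onto the wrong column.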


In [None]:
# Boolean array that is True wherever a value is missing
df.isna().to_numpy()

In [None]:
"""
This code uses the DataFrame's .isnull() method to flag null values in each column
and then sums the flags per column to obtain the total number of nulls in each one.
"""
df.isnull().sum()

In [None]:
"""
The nunique() function in Pandas is used to count the number of unique values in a specific column of a DataFrame.
Applying this function to the 'Country' column provides valuable insight into the shark incidents,
highlighting the global nature of shark attacks documented in the dataset. 
This information is crucial for understanding the geographical distribution and scope of the dataset.
"""
df['Country'].nunique()

### Counting the occurrences of shark attacks by country

In [None]:
"""
value_counts() counts the occurrences of shark attacks by country.
reset_index() converts the result into a DataFrame with the columns 'Country' and 'count'
(these column names assume pandas 2.0 or later).
Finally, head(20) shows the twenty countries with the highest number of occurrences,
offering a quick view of the countries most affected by shark attacks.
Brazil (the country where I live) is in the seventh position.
"""
# Counting the occurrences of shark attacks by country and storing the result in a DataFrame
df_country = df['Country'].value_counts().reset_index()

# Displaying the top 20 countries with the highest number of shark attack occurrences
df_country.head(20)

In [None]:
# Plotting a bar graph that shows the 20 countries with the most occurrences of shark attacks
df_country.head(20).plot.bar(x='Country', y='count')

# Adding a title to the graph
plt.title("Number of Shark Attacks")

# Displaying the graph
plt.show()

As a percentage of the total, the USA accounts for 36% and Australia for 21%, so these two countries together account for about 57%, more than half of the total.

In [None]:
# Calculate the rate of shark attack occurrences for each country relative to the total occurrences worldwide
# by dividing the number of shark attack occurrences for each country by the total occurrences of all countries.
# Add a new column 'rate' to the DataFrame df_country to store these rates.
df_country['rate'] = df_country['count'] / df_country['count'].sum()

# Display the first five rows of the DataFrame df_country, including the new 'rate' column.
df_country.head(5)
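
The shares quoted above (36% + 21% = 57%) can also be obtained in one step. A minimal sketch, assuming a hypothetical miniature `Country` column rather than the real dataset: `value_counts(normalize=True)` returns each value's share of the non-null total instead of a raw count. The `countries` series and its values are made up.

```python
import pandas as pd

# Hypothetical miniature 'Country' column, not the real dataset.
countries = pd.Series(["USA", "USA", "USA", "AUSTRALIA", "AUSTRALIA", "BRAZIL", None])

# normalize=True returns shares of the non-null total instead of raw counts.
shares = countries.value_counts(normalize=True)

# Results are sorted in descending order, so the first two rows are the top two.
top_two_share = shares.iloc[:2].sum()

print(shares.round(3))
print(f"Top two countries: {top_two_share:.1%}")
```

Missing countries are excluded from the denominator by default, which matches dividing by the sum of the `count` column.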

In [None]:
# count() counts the non-null 'Type' values in each (Country, Area) group,
# so after grouping, the 'Type' column holds the number of shark attacks per area.
# reset_index() turns the group keys back into ordinary columns with a sequential integer index,
# which makes specific rows easier to access and keeps the result compatible with later operations.
df_Area = df[['Country','Area','Type']].groupby(['Country','Area']).count().reset_index().sort_values('Type',ascending=False)

df_Area.head(20)
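
One detail of the grouping above is that `count()` skips rows whose `Type` is missing, while `size()` counts every row in the group. A minimal sketch of the difference on hypothetical rows (the `toy` frame is made up):

```python
import pandas as pd

# Hypothetical rows; the second Florida record has no 'Type'.
toy = pd.DataFrame({
    "Country": ["USA", "USA", "AUSTRALIA"],
    "Area": ["Florida", "Florida", "Queensland"],
    "Type": ["Unprovoked", None, "Provoked"],
})

# count() ignores missing 'Type' values; size() counts every row.
by_count = toy.groupby(["Country", "Area"])["Type"].count()
by_size = toy.groupby(["Country", "Area"]).size()

print(by_count.loc[("USA", "Florida")], by_size.loc[("USA", "Florida")])
```

Since `Type` has only a handful of missing values in this dataset, the two approaches should give nearly identical rankings here.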

# Geolocation Data (Importing and Processing)

In [None]:
# Opening the dataset [Geolocation Data](https://www.kaggle.com/datasets/liewyousheng/geolocation/data)
# and reading the 'inputs/geolocation/cities.csv' file to obtain latitude and longitude values
# that can be cross-referenced with the data from the Shark dataset.
place_data = pd.read_csv('./inputs/geolocation/cities.csv')

In [None]:
# Selecting only the columns 'state_name', 'country_name', 'latitude', and 'longitude' from the DataFrame place_data
# and storing the result back into the variable place_data.
place_data = place_data[['state_name','country_name','latitude','longitude']]

# Cleaning the data by removing duplicate rows based on the combination of 'state_name' and 'country_name'
# subset parameter specifies the columns to consider when identifying duplicates.
# In this case, we want to drop rows where both 'state_name' and 'country_name' are the same.
place_data = place_data.drop_duplicates(subset=['state_name','country_name'])

In [None]:
# Manipulating string data in the 'state_name' column of the 'place_data' DataFrame:
# 1. Converting all letters in the 'state_name' column to lowercase using the str.lower() method.
place_data['state_name'] = place_data['state_name'].str.lower()

# 2. Removing all space characters (' ') from the 'state_name' column using the str.replace() method.
# This step keeps the string formatting consistent so that names match when the datasets are merged.
place_data['state_name'] = place_data['state_name'].str.replace(' ', '')

place_data.head(10)
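
The same lowercase-and-strip normalization is applied later to the shark data, so it can be useful to keep it in one helper. A minimal sketch, assuming a hypothetical function name `normalize_key` and made-up sample values; unlike the cell above it removes every whitespace character (tabs included), not only spaces.

```python
import pandas as pd

def normalize_key(names: pd.Series) -> pd.Series:
    """Lowercase and remove all whitespace so join keys line up."""
    return names.str.lower().str.replace(r"\s+", "", regex=True)

# Hypothetical values from the two sources
shark_areas = pd.Series(["New South Wales", " Florida", None])
geo_states = pd.Series(["new south wales", "Florida"])

print(normalize_key(shark_areas).tolist())
print(normalize_key(geo_states).tolist())
```

Missing values stay missing, so they can still be counted with `isnull()` before the merge.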

In [None]:
"""
The shape attribute of a DataFrame (or a NumPy array) returns a tuple with its dimensions.
For place_data.shape, the first element is the number of rows (entries or records)
and the second element is the number of columns (features or variables).
This is useful for understanding the data structure and checking its size
before performing additional operations or analyses.
"""
place_data.shape

# Data Merge

In [None]:
df.head()

In [None]:
# Create a new DataFrame containing only the 'Country' and 'Area' columns from the original DataFrame.
# .copy() makes it independent of df, which avoids SettingWithCopyWarning in the assignments below.
df_only_Area = df[['Country', 'Area']].copy()

# Rename the 'Area' column to 'state_name' so that it matches the join key in place_data
df_only_Area.columns = ['Country', 'state_name']

# Convert the values in the 'state_name' column to lowercase
df_only_Area['state_name'] = df_only_Area['state_name'].str.lower()

# Remove whitespace characters from the values in the 'state_name' column
df_only_Area['state_name'] = df_only_Area['state_name'].str.replace(' ', '')

# Display the first few rows of the preprocessed DataFrame
df_only_Area.head()

In [None]:
df_only_Area.isnull().sum()

In [None]:
# Perform a left join to merge df_only_Area (left) with place_data (right) on the 'state_name' column.
# A left join retains every row of df_only_Area and adds coordinates from place_data only where a match exists.
# Note: the join key is 'state_name' alone, so a state name shared by two countries matches both rows of place_data.
df_join = pd.merge(df_only_Area, place_data, how='left', on='state_name')

# Display the shape of the resulting DataFrame to check the number of rows and columns
print(df_join.shape)

# Display the first few rows of the merged DataFrame
df_join.head()

Because the two datasets were built for different purposes, many area names have no match in the geolocation data, which leaves null coordinates after the merge.
In this case, all rows with nulls will be excluded.
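
Before dropping the unmatched rows, it can help to measure how many incidents actually found coordinates. A minimal sketch on hypothetical data: `indicator=True` adds a `_merge` column to the result, and `validate="many_to_one"` raises an error if a `state_name` appears more than once on the right side. The frames `incidents` and `places` and their coordinates are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for df_only_Area and place_data
incidents = pd.DataFrame({"state_name": ["florida", "hawaii", "unknownbay", None]})
places = pd.DataFrame({
    "state_name": ["florida", "hawaii"],
    "latitude": [27.66, 19.90],
    "longitude": [-81.52, -155.58],
})

# '_merge' is 'both' for matched rows and 'left_only' for unmatched ones.
joined = pd.merge(incidents, places, how="left", on="state_name",
                  indicator=True, validate="many_to_one")

match_rate = (joined["_merge"] == "both").mean()
print(joined["_merge"].value_counts())
print(f"Matched: {match_rate:.0%}")
```

A low match rate would suggest that the heat map reflects only the areas whose names happen to coincide with state names in the geolocation file.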

In [None]:
# Check for missing values in the merged DataFrame df_join
df_join.isnull().sum()

In [None]:
# Display the rows where the 'country_name' column is null (missing)
df_join[pd.isnull(df_join.country_name)].head()

In [None]:
"""
dropna(axis=0) removes every row that contains a missing value (axis=0 means rows).
drop([...], axis=1) then removes the 'Country', 'country_name' and 'state_name' columns
(axis=1 means columns), leaving only 'latitude' and 'longitude' for the heat map.
"""
df_join = df_join.dropna(axis=0).drop(['Country','country_name','state_name'], axis=1)

# Check for missing values after dropping rows 
print(df_join.isnull().sum())

# Display the shape of the DataFrame after dropping rows and columns
df_join.shape

In [None]:
# This displays the first few rows of the DataFrame, allowing the data to be inspected 
# in order to confirm that the preprocessing steps have been applied correctly.
df_join.head()

# Heat Map Visualization

In [None]:
# This command installs the Folium package, which is used for visualizing geospatial data with interactive maps.
# Once it is installed, Folium can be imported and used in the Jupyter notebook.
!pip install folium

In [None]:
# Convert DataFrame values to a list of [latitude, longitude] pairs
df_list = df_join.values.tolist()

# Import the Folium library for creating interactive maps
import folium
from folium.plugins import HeatMap

# Create a base map centered at latitude 10 and longitude -20, with a zoom level of 3
# (named shark_map so that it does not shadow Python's built-in map function)
shark_map = folium.Map(location=[10, -20], zoom_start=3)

# Create a HeatMap layer using the list of values converted from the DataFrame
# with a radius of 7 and a blur factor of 5, and add it to the map
HeatMap(df_list, radius=7, blur=5).add_to(shark_map)

# Display the map
shark_map

Successfully visualized heatmap!

We can see that there are incidents with sharks in various locations in the United States. However, it's important to note that the latitude and longitude data may not be perfectly accurate, resulting in some coloration even on land due to imprecise geolocation data. Nevertheless, exploring this heatmap can still provide valuable insights into areas where shark incidents are more prevalent.

Feel free to interact with the map by moving around the region and zooming in to explore specific areas. This visualization can be useful for gaining awareness and understanding potential risk areas, allowing you to make informed decisions and take precautions when traveling or engaging in activities near coastal areas.
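
Because the geolocation table keeps one coordinate pair per state, every incident in a state lands on exactly the same point. A minimal sketch, on hypothetical coordinates, of collapsing those repeated points into `[lat, lon, weight]` triples, a form that Folium's `HeatMap` also accepts. The `coords` frame and the `weight` column name are assumptions for illustration.

```python
import pandas as pd

# Hypothetical output of the merge step: one coordinate pair per incident.
coords = pd.DataFrame({
    "latitude":  [27.66, 27.66, 27.66, -31.25],
    "longitude": [-81.52, -81.52, -81.52, 146.92],
})

# One row per distinct location, with the incident count as the weight.
weighted = (coords.groupby(["latitude", "longitude"])
                  .size()
                  .reset_index(name="weight"))

heat_points = weighted.values.tolist()  # [[lat, lon, weight], ...]
print(heat_points)
```

The weighted list is much shorter than the per-incident list, which keeps the rendered map lighter while carrying the same information.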

# ML

In [None]:
# Importing libraries
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
# Random forest (the best option?)
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [None]:
#df = pd.read_csv("./inputs/sharks/GSAF5.xls.csv",low_memory=False)
df.info()

df = df[['Case Number','Time', 'Date', 'Year', 'Type', 'Country','Fatal','Injury', 'Area','Location', 'Activity', 'Name', 'Victim\'s Gender', 'Age','Species']]
#df_ml = df[['Time','Country','Location','Fatal','Injury','Age','Species','Activity','Victim\'s Gender']]

### Analyze

In [None]:
df.describe()

In [None]:
df.info()

# Check NaNs
print("\nNumber of NaNs in each column:\n\n", df.isna().sum())

In [None]:
# Calculate the percentage of missing values in critical columns
critical_columns = ['Country', 'Location', 'Activity', 'Injury']
missing_data_percentage = df[critical_columns].isna().mean() * 100
print("Percentage of missing data in critical columns:\n", missing_data_percentage)

import seaborn as sns
import matplotlib.pyplot as plt

# Heatmap to visualize missing data
plt.figure(figsize=(10, 6))
sns.heatmap(df[critical_columns].isna(), cbar=False, cmap='viridis')
plt.title("Missing Data in Critical Columns")
plt.show()

# Data Cleaning Strategy for Shark Attack Dataset

Based on the analysis of missing values (NaNs) in the shark attack dataset, we need a more nuanced approach to clean the data while retaining as much valuable information as possible.

## NaN Analysis Summary

The following columns have been identified with the respective number of NaNs:

- **Case Number**: 2 NaNs
- **Date**: 1 NaN
- **Year**: 3 NaNs
- **Type**: 5 NaNs
- **Country**: 51 NaNs
- **Area**: 463 NaNs
- **Location**: 545 NaNs
- **Activity**: 552 NaNs
- **Name**: 215 NaNs
- **Victim's Gender**: 6434 NaNs
- **Age**: 2871 NaNs
- **Time**: 3392 NaNs
- **Species**: 2924 NaNs
- **Fatal**: 547 NaNs
- **Injury**: 29 NaNs

## Data Cleaning Strategy

### Columns to Drop NaNs

For columns where a minimal number of entries are missing, we will drop the rows containing NaNs. This is crucial for columns where missing values would significantly impact the analysis.

- **Country**: 51 NaNs
- **Location**: 545 NaNs
- **Activity**: 552 NaNs
- **Injury**: 29 NaNs

### Columns to Fill NaNs

For columns with a high percentage of missing values, we will fill or impute the NaNs with appropriate values to retain as much data as possible.

- **Age**: 2871 NaNs (Fill with median)
- **Time**: 3392 NaNs (Fill with 'Unknown')
- **Victim's Gender**: 6434 NaNs (Fill with 'Unknown')
- **Species**: 2924 NaNs (Fill with 'Unknown')

### Other Columns with Minimal NaNs

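The remaining columns have comparatively few missing values, so we handle them by dropping the rows that contain NaNs. In the code below, **Fatal** is first filled with 'Unknown', so no rows are dropped on its account.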

- **Case Number**
- **Date**
- **Year**
- **Type**
- **Fatal**
- **Area**
- **Name**
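
The strategy above can be sketched on a small example. This is a minimal illustration under stated assumptions, not the notebook's actual cleaning cell: the `toy` frame and its values are hypothetical, and only two critical columns and two fill columns are shown.

```python
import pandas as pd

# Hypothetical rows with gaps in both critical and sparse columns
toy = pd.DataFrame({
    "Country":  ["USA", None, "BRAZIL", "AUSTRALIA"],
    "Activity": ["Surfing", "Swimming", "Fishing", None],
    "Age":      [20.0, 31.0, None, 45.0],
    "Time":     ["14h00", None, None, "Morning"],
})

# 1. Drop rows that are missing a critical column.
cleaned = toy.dropna(subset=["Country", "Activity"]).copy()

# 2. Fill the sparse columns: median for numbers, a sentinel for text.
cleaned = cleaned.fillna({"Age": cleaned["Age"].median(), "Time": "Unknown"})

print(cleaned)
```

Computing the median after the drop means it reflects only the rows that remain in the analysis.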

### Normalizing NaNs: Further Cleaning

4FUTURE: Review entries with 'Unknown' values and investigate whether additional data sources or methods can reduce these unknowns.
Consider imputing values with more sophisticated methods if further cleaning is necessary.

## 'Gender'

In [None]:
from gender_guesser.detector import Detector

In [None]:
detector = Detector(case_sensitive=False)

# Filled with 'Unknown' to avoid dropping rows, ensuring the dataset remains comprehensive.
# 4FUTURE: Find a library (or similar) to help w/ that - We need to read the gender
#df['Victim\'s Gender'].fillna('Unknown', inplace=True)

# Define a function to guess gender from name
def guess_gender(name):
    if pd.isnull(name) or name.strip() == '':
        return 'Unknown'

    if name == 'male':
        return 'M'
    elif name == 'female':
        return 'F'
    
    gender = detector.get_gender(name.split()[0]) 
    if gender in ['male', 'mostly_male']:
        return 'M'
    elif gender in ['mostly_female', 'female']:
        return 'F'
            
    return 'Unknown'

In [None]:
# Define critical columns
critical_columns = ['Country', 'Location', 'Activity', 'Injury']

# Drop rows with NaNs in critical columns if percentage is acceptable
df = df.dropna(subset=critical_columns)

# Filled with 'Unknown' to maintain consistency and handle missing values without assuming incorrect times.
# 4FUTURE: Considered doing the MEDIAN
# (assigning the result back avoids the chained inplace fillna, which newer pandas versions warn about)
df['Time'] = df['Time'].fillna('Unknown')

# Filled with 'Unknown' to retain entries even if the species is not identified.
# 4FUTURE: Find a library (or similar) to help with that. This variable is helpful for reading Injury (and only that?)
df['Species'] = df['Species'].fillna('Unknown')

# Filled with 'Unknown' to retain incidents whose outcome was not recorded.
# 4FUTURE: Reading Injury should make this easier to fill in
df['Fatal'] = df['Fatal'].fillna('Unknown')

# Drop remaining rows with NaNs in the minimal-NaN columns ('Fatal' was filled above, so it has no NaNs left)
df = df.dropna(subset=['Case Number', 'Date', 'Year', 'Type', 'Area', 'Name'])

# Verify the changes
print("\nNumber of NaNs after treatment:\n\n", df.isna().sum())
df.info()

In [None]:
#import pandas as pd

# Remove irrelevant columns
#df = df.loc[:, ~df.columns.str.contains('Unnamed')]

# Check NaNs
#print("\nNumber of NaNs before treatment:\n\n", df.isna().sum())

# Handle NaNs
#df = df.dropna(subset=['Time','Country','Location','Fatal','Injury','Age','Activity','Species','Victim\'s Gender'])

# Check again
#print("\nNumber of NaNs after treatment:\n\n", df.isna().sum())

# Show the cleaned DataFrame
print(df)

## 'Age'

In [None]:
# Check unique values in 'Age' column
print(df['Age'].unique())
df.info()

In [None]:
df["Age"] = df["Age"].replace(['30s', '60s', "20's", '40s', 'a minor', '20s', 'Teen', 
                               '18 months', '50s', 'teen', '6½', 'mid-30s', '20?', "60's", 
                               'Elderly', 'mid-20s', 'Ca. 33', '>50', 'adult', '9 months', 
                               '(adult)', 'X', '"middle-age"', '2 to 3 months', 
                               'MAKE LINE GREEN', '"young"', 'F', 'young', 'A.M.', 
                               '2½'], 
                                ['35', '65', '25', '45', '16', '25', '14', '1', 
                                '55', '14', '6', '35', '20', '65', '60', '25', 
                                '33', '55', '30', '0', '30', 'unknown', '50', '0', 
                                'unknown', '20', 'unknown', '20', 'unknown', '2'])

In [None]:
print(df['Age'].unique())

In [None]:
# Convert 'Age' to numeric; entries that are not numbers (such as the unknown placeholders) become NaN.
# Then fill the NaNs with the median age, retaining the central tendency of the data.
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')
df['Age'] = df['Age'].fillna(df['Age'].median())
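
The manual replacement list works, but it has to be extended whenever a new spelling appears. A minimal sketch of a pattern-based alternative, assuming a hypothetical helper named `parse_age`: it reproduces the numeric rules used above (decades map to their midpoint, months to whole years, otherwise the leading integer). Purely verbal entries such as 'Teen' or 'adult' return `None` and would still need the manual mapping.

```python
import re
from typing import Optional

def parse_age(raw: object) -> Optional[float]:
    """Best-effort conversion of a free-text age to a number (None if unusable)."""
    if not isinstance(raw, str):
        return None
    text = raw.strip().lower()
    number = re.search(r"\d+", text)
    if number is None:
        return None                       # e.g. 'adult', 'X', 'A.M.'
    value = int(number.group())
    if "month" in text:
        return float(value // 12)         # '18 months' -> 1, '9 months' -> 0
    if re.search(r"\d+'?s\b", text):
        return float(value + 5)           # '30s', "20's", 'mid-20s' -> midpoint
    return float(value)                   # '20?', 'Ca. 33', '6½' -> leading integer

for sample in ["30s", "20's", "mid-20s", "18 months", "6½", "Ca. 33", "adult"]:
    print(repr(sample), "->", parse_age(sample))
```

It could be applied with `df['Age'].apply(parse_age)` before the median fill; note that it reads '>50' as 50, whereas the manual list maps it to 55.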

### Visualization of Shark Incidents by age

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Age Distribution
plt.figure(figsize=(10, 6))
sns.histplot(df['Age'], bins=30, kde=True, color='blue')
plt.title('Distribution of Shark Incidents by Age')
plt.xlabel('Age')
plt.ylabel('Number of Incidents')
plt.show()

At this stage, the chart covers all recorded shark encounters, not only shark attacks.

At which time of day are people attacked the most?

### 'Time'

In [None]:
# Function to categorize time
def categorize_time(time):
    if pd.isnull(time) or time.strip() == '':
        return "Unknown"
    
    time = time.strip().lower()
    
    # Direct categories ("early morning" and "late afternoon" are already
    # covered by the "morning" and "afternoon" substring checks)
    if "morning" in time:
        return "Morning"
    elif "afternoon" in time or "pm" in time:
        return "Afternoon"
    elif "evening" in time or "night" in time:
        return "Evening"
    
    # Specific time conversion, e.g. "13h30" -> hour 13
    try:
        hour = int(time.replace('h', ':').split(':')[0])
    except ValueError:
        return "Unknown"
    
    if 0 <= hour < 12:
        return "Morning"
    elif 12 <= hour < 18:
        return "Afternoon"
    elif 18 <= hour < 24:
        return "Evening"
    
    # Hours outside 0-23 are not valid times
    return "Unknown"

# Apply the categorization function
df['Time'] = df['Time'].apply(categorize_time)

# Fill remaining missing values with 'Unknown'
df['Time'] = df['Time'].fillna('Unknown')
print(df['Time'])

In [None]:
# Count all values in the 'Time' column
print("\nCount of all values in the 'Time' column:")
print(df['Time'].value_counts())

# Filter and visualize only the entries that are not 'Unknown' - checking the cleaning
non_unknown_times = df[df['Time'] != 'Unknown']
print("\nEntries in the 'Time' column that are not 'Unknown':")
print(non_unknown_times['Time'].value_counts())
print(non_unknown_times[['Time']])
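
The hour extraction can also be written with a regular expression, which makes it explicit which formats are accepted. A minimal sketch, assuming the `13h30` style that the `replace('h', ':')` step above implies; the helper names `extract_hour` and `period_of` are hypothetical.

```python
import re
from typing import Optional

def extract_hour(raw: str) -> Optional[int]:
    """Return the hour from strings such as '13h30' or '08h00', else None."""
    match = re.match(r"\s*(\d{1,2})h(\d{2})", raw.lower())
    if match is None:
        return None
    hour, minute = int(match.group(1)), int(match.group(2))
    # Reject impossible clock values instead of guessing a period.
    return hour if hour < 24 and minute < 60 else None

def period_of(hour: Optional[int]) -> str:
    """Map an hour to the same three periods used by categorize_time."""
    if hour is None:
        return "Unknown"
    if hour < 12:
        return "Morning"
    if hour < 18:
        return "Afternoon"
    return "Evening"

for sample in ["13h30", "08h00", "18h45", "25h00", "sometime"]:
    print(sample, "->", period_of(extract_hour(sample)))
```

Separating extraction from bucketing also makes it easy to try different period boundaries later.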

In [None]:
# Plot bar chart to visualize the distribution of shark attacks over time
time_counts = df['Time'].value_counts()

plt.figure(figsize=(10, 6))
time_counts.plot(kind='bar', color=['violet','crimson','gold','lightgreen'])
plt.title('Distribution of Shark Incidents by Time of Day')
plt.xlabel('Time of Day')
plt.ylabel('Number of Incidents')
plt.xticks(rotation=0)
plt.show()
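
When reading this chart, note that the periods are not equally long: the hour ranges in `categorize_time` give "Morning" twelve hours (00 to 12) and the other two six hours each. A minimal sketch of normalizing by period length, using counts that are hypothetical and for illustration only; substitute the real `time_counts` values. Entries categorized by keyword (for example "morning") do not have an exact width, so this is only an approximation.

```python
# Hypothetical counts, for illustration only; not results from the dataset.
counts = {"Morning": 1200, "Afternoon": 1500, "Evening": 300}

# Period lengths in hours implied by the hour ranges 00-12, 12-18 and 18-24.
hours = {"Morning": 12, "Afternoon": 6, "Evening": 6}

# Incidents per clock hour make periods of different length comparable.
per_hour = {period: counts[period] / hours[period] for period in counts}

for period, rate in per_hour.items():
    print(f"{period}: {rate:.0f} incidents per clock hour")
```

Comparing rates rather than raw totals avoids over- or understating a period simply because it spans more hours.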

# Why Do Shark Attacks Occur More in the Afternoon?

The chart shows that recorded shark attacks are more common in the afternoon. The data alone does not explain why, but a blend of human activity, shark behavior, and environmental factors offers plausible explanations.

## Human Activity
- **Peak Beach Hours**: In the afternoon, there is a significant increase in beachgoers participating in swimming, surfing, and other water activities. This heightened human presence in the water elevates the likelihood of encounters with sharks.
- **Surfing Patterns**: Surfers, who are frequently attacked, often prefer the afternoon for better wave conditions, leading them to spend more time in the water and becoming more susceptible to shark encounters.

## Shark Behavior
- **Feeding Times**: Sharks are generally more active in their feeding around dawn and dusk, so water activity that continues into the late afternoon can overlap with the evening feeding period.
- **Prey Movement**: The movement of prey fish closer to shore in the afternoon draws sharks nearer to coastal areas, increasing the chances of shark-human interactions.

## Environmental Factors
- **Water Conditions**: Warmer and clearer water conditions in the afternoon attract more marine life, including sharks, thus heightening the potential for shark attacks.

Understanding these patterns can help us develop strategies to reduce the risk of shark attacks by avoiding peak shark activity times and raising awareness among beachgoers.

### Higher incidence of shark attacks (M/F)? 

In [None]:
# 4FUTURE: Kaggle notebooks has restrictions -- Figured it out
!pip install gender-guesser
!pip install spacy
!python -m spacy download en_core_web_sm

In [None]:
# Convert gender values to readable format
df['Victim\'s Gender'] = df['Victim\'s Gender'].map({'F': 'Female', 'M': 'Male'})

print(df['Victim\'s Gender'])

In [None]:
import spacy
from gender_guesser.detector import Detector

# Apply the function to the 'Name' column
df['Predicted Gender'] = df['Name'].apply(guess_gender)

# Replace 'Unknown' values in 'Victim\'s Gender' with predicted values where possible
#df['Victim\'s Gender'] = df.apply(lambda row: row['Victim\'s Gender'] if row['Victim\'s Gender'] != 'NaN' else row['Predicted Gender'], axis=1)

# Drop the 'Predicted Gender' column as it's no longer needed
#df.drop(columns=['Predicted Gender'], inplace=True)

print(df['Predicted Gender'].value_counts())

# Remove 'Unknown's
#df = df[df['Victim\'s Gender'] != 'Unknown']

pd.set_option('display.max_rows', None)
df
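
The commented-out line above was meant to fill missing recorded genders with the predicted ones. A minimal sketch of one way to do that, on hypothetical rows: `fillna` accepts another Series and aligns it by index, so recorded values are kept and only the gaps are filled. The `toy` frame and the combined `Gender` column are assumptions for illustration.

```python
import pandas as pd

# Hypothetical rows: recorded gender (already mapped to 'Male'/'Female')
# next to the name-based prediction ('M', 'F' or 'Unknown').
toy = pd.DataFrame({
    "Victim's Gender":  ["Male", None, None, "Female"],
    "Predicted Gender": ["M", "F", "Unknown", "M"],
})

# Translate predictions to the recorded vocabulary; 'Unknown' becomes NaN.
predicted = toy["Predicted Gender"].map({"M": "Male", "F": "Female"})

# Keep the recorded value where present, otherwise use the prediction.
toy["Gender"] = toy["Victim's Gender"].fillna(predicted).fillna("Unknown")

print(toy["Gender"].tolist())
```

In the last row the recorded value wins over a conflicting prediction, which seems the safer default since the prediction is only a guess from the first name.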

In [None]:
# Select the rows that have a NaN value in any column
nan_rows = df[df.isna().any(axis=1)]

# Save rows with NaN values to a CSV file for manual review
nan_rows.to_csv('rows_with_nans.csv', index=False)
print("Rows with NaN values saved to 'rows_with_nans.csv' for manual review.")

# Create a link to download the CSV file
from IPython.display import FileLink
FileLink('rows_with_nans.csv')

In [None]:
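# The column is named 'Age' (capitalized) and is numeric at this point,
# so keep only the rows that have an age value
df = df[df['Age'].notna()]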


# Count the number of incidents by gender
gender_counts = df['Predicted Gender'].value_counts()

# Display the result
print("Number of shark incidents by gender:")
print(gender_counts)

# Visualize the distribution of incidents by gender
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
gender_counts.plot(kind='bar', color=['violet','lightblue', 'lightpink'])
plt.title('Distribution of Shark Incidents by Gender')
plt.xlabel('Gender')
plt.ylabel('Number of Incidents')
plt.xticks(rotation=0)
plt.show()

### Activity

In [None]:
# Understanding the Activity variable
df['Activity']

In [None]:
# Trying to automate the categorization of activities

import spacy
from spacy.matcher import PhraseMatcher

# Load SpaCy model
nlp = spacy.load("en_core_web_sm")

# Create a PhraseMatcher object
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

categories = {
    "Surface Water Sports": ["surfing", "stand-up paddleboarding", "kayak fishing", "windsurfing", "kite surfing", "body boarding"],
    "Underwater Activities": ["scuba diving", "snorkeling", "spearfishing", "free diving"],
    "Casual Water Activities": ["swimming", "floating", "wading", "bathing"],
    "Fishing and Hunting": ["fishing", "spearfishing", "crabbing", "lobstering", "collecting seashells"],
    "Interaction with Marine Life": ["feeding sharks", "touching a shark", "photographing marine life"],
    "Unique or Unspecified": ["sea disaster", "walking on the beach", "cleaning fish", "standing in water"]
}

# Add patterns to the matcher
for category, activities in categories.items():
    patterns = [nlp.make_doc(text) for text in activities]
    matcher.add(category, patterns)

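# Function that returns the category of the first matched phrase, or "Unknown"
def categorize_activities(activity):
    # Missing or non-text activities cannot be matched
    if not isinstance(activity, str):
        return "Unknown"

    matches = matcher(nlp(activity))
    if not matches:
        return "Unknown"

    # Each match is a (match_id, start, end) tuple; match_id maps back to the category name
    match_id = matches[0][0]
    return nlp.vocab.strings[match_id]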

# Apply the function to the 'Activity' column
df['Activity (Categorized)'] = df['Activity']
df['Activity (Categorized)'] = df['Activity (Categorized)'].apply(categorize_activities)

activity_counts = df['Activity (Categorized)'].value_counts().sort_values(ascending=False)

# Move "Unknown" to the end
if 'Unknown' in activity_counts:
    unknown_count = activity_counts['Unknown']
    activity_counts = activity_counts.drop('Unknown')
    activity_counts['Unknown'] = unknown_count

print(activity_counts)

plt.figure(figsize=(10, 6))
activity_counts.plot(kind='bar', color=['crimson', 'gold', 'lightgreen', 'violet', 'brown', 'orange', 'blue'])
plt.title('Distribution of Shark Incidents by Victim\'s Activity at the moment')
plt.xlabel('Activity')
plt.ylabel('Number of Incidents')
plt.xticks(rotation=45) 
plt.show()
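
As a quick cross-check of the matcher's output, the same idea can be approximated without spaCy by plain substring matching. This is a swapped-in, simpler technique and not the `PhraseMatcher` used above; the `KEYWORDS` table is a hypothetical subset of the categories, and the function name is made up. Dictionary order decides which category wins when several phrases match.

```python
from typing import Optional

# Hypothetical subset of the notebook's categories; order decides ties,
# so 'spearfishing' is checked before the more general 'fishing'.
KEYWORDS = {
    "Underwater Activities": ["scuba diving", "snorkeling", "spearfishing", "free diving"],
    "Surface Water Sports": ["surfing", "windsurfing", "body boarding"],
    "Casual Water Activities": ["swimming", "floating", "wading", "bathing"],
    "Fishing and Hunting": ["fishing", "crabbing", "lobstering"],
}

def categorize_by_keyword(activity: Optional[str]) -> str:
    """Return the first category whose phrase occurs in the text."""
    if not isinstance(activity, str):
        return "Unknown"
    text = activity.lower()
    for category, phrases in KEYWORDS.items():
        if any(phrase in text for phrase in phrases):
            return category
    return "Unknown"

for sample in ["Spearfishing", "Surfing near pier", "Kayak fishing", "Sitting on boat"]:
    print(sample, "->", categorize_by_keyword(sample))
```

Because this matches substrings rather than tokens, it can over-match (any word containing "fishing" counts), so large disagreements with the matcher are worth inspecting by hand.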

In [None]:
# Checking if it was an attack or an encounter
from spacy.tokens import DocBin
from spacy.training import Example

# Initialize the SpaCy model
nlp = spacy.blank("en")

# Add the classification component to the pipeline
textcat = nlp.add_pipe("textcat")

# Add labels to the classification component
textcat.add_label("attack")
textcat.add_label("encounter")

# 4FUTURE: train the model with more parameters (for example, FATAL is giving Encounter)
# Manually annotated dataset (example)
train_data = [
    ("severe lacerations to left forearm", {"cats": {"attack": 1, "encounter": 0}}),
    ("no injury, shark bit board", {"cats": {"attack": 0, "encounter": 1}}),
    ("lacerations to right foot", {"cats": {"attack": 1, "encounter": 0}}),
    ("no injury, kayak bitten", {"cats": {"attack": 0, "encounter": 1}}),
    ("fatal", {"cats": {"attack": 1, "encounter": 0}})
]

# Convert the annotated data to spaCy Example objects
examples = [Example.from_dict(nlp.make_doc(text), annotations) for text, annotations in train_data]

# Model training (the examples let spaCy initialize the text classifier)
optimizer = nlp.initialize(lambda: examples)

for i in range(10):  # Number of training epochs
    losses = {}
    batches = spacy.util.minibatch(examples, size=2)
    for batch in batches:
        nlp.update(batch, sgd=optimizer, drop=0.5, losses=losses)

# Testing the model with new descriptions
texts = ["puncture wounds to left foot & lower leg", "no injury, board bitten", "injuries to right leg & hand"]
for doc in nlp.pipe(texts):
    print(doc.text, doc.cats)

# Using the model on dataset data
# Define the function that will categorize the attacks based on the recorded Injury
def categorize_injury(injury):
    doc = nlp(injury)
    
    if doc.cats["attack"] > doc.cats["encounter"]:
        return "Attack"
    
    else:
        return "Encounter"
    
# Apply the categorization function
df["Attack or Encounter"] = df["Injury"].apply(categorize_injury)
df[["Injury", "Attack or Encounter"]]
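
With only five training sentences, the classifier's labels are hard to trust (the note above mentions that 'fatal' was coming out as an encounter). A minimal sketch of a rule-based baseline to compare against, assuming that descriptions beginning with "no injury" are encounters and everything else is an attack. The rule and the name `baseline_label` are assumptions for illustration, not part of the model above.

```python
def baseline_label(injury: str) -> str:
    """Rule of thumb: descriptions starting with 'no injury' are encounters."""
    text = injury.strip().lower()
    if text.startswith("no injury"):
        return "Encounter"
    return "Attack"

# The same short descriptions used as training examples above
samples = [
    "severe lacerations to left forearm",
    "no injury, shark bit board",
    "lacerations to right foot",
    "no injury, kayak bitten",
    "fatal",
]

for text in samples:
    print(f"{text!r} -> {baseline_label(text)}")
```

Rows where this baseline and the trained model disagree are good candidates for manual review or for extending the annotated training set.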

In [None]:
# Include species data in the injury categorization
#def categorize_injury(injury, species):
    #doc = nlp(injury)
    
    # Check if the species indicates a non-shark
    #if "not a shark" in species.lower():
        #return "Encounter"
    
    #if doc.cats["attack"] > doc.cats["encounter"]:
        #return "Attack"
    
    #else:
        #return "Encounter"

# Apply the categorization function with species consideration
#df["Attack or Encounter"] = df.apply(lambda row: categorize_injury(row['Injury'], row['Species']), axis=1)
#df[["Injury", "Species", "Attack or Encounter"]]
