
In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import re
%load_ext google.colab.data_table

In [None]:
df = pd.read_csv('https://github.com/Dvdbijl/Shark_Attack/raw/main/attacks.csv', encoding='cp1252')
display(df)

### Cleaning the data

In [None]:
# First I want to make the dataframe more readable by keeping only the columns of interest.

df2 = df.loc[:,"Date":"Species "]
display(df2)

In [None]:
#There is a lot of missing data in this dataframe.
print("Missing values per column:")
df2.isna().sum()

In [None]:
#Dropping all rows where Date is NaN, then checking what is still missing
df3 = df2.dropna(subset=['Date'])
print(df3.isna().sum())

display(df3)

In [None]:
#Checking column names
print(df3.columns)

#Removing the ' ' in Sex and Species column and (Y/N) in Fatal
df4 = df3.rename(columns={'Sex ':'Sex', 'Species ':'Species', 'Fatal (Y/N)':'Fatal'})
df4.columns
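
The trailing spaces in the column names can also be removed for every column at once instead of renaming each one by hand. A minimal sketch on a small stand-in frame; the `toy` frame and its values are invented for illustration, only the column names mirror the dataset:

```python
import pandas as pd

# Stand-in frame: only the column names mirror the dataset, the values are invented.
toy = pd.DataFrame({"Sex ": ["M"], "Species ": ["White shark"], "Fatal (Y/N)": ["N"]})

# Strip surrounding whitespace from every column name, then rename the one
# column whose name needs more than trimming.
toy.columns = toy.columns.str.strip()
toy = toy.rename(columns={"Fatal (Y/N)": "Fatal"})

print(list(toy.columns))
```

This keeps working if other columns turn out to have stray spaces too.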

### What are the most dangerous types of sharks to humans?

In [None]:
#For this question, I need the columns Species and Fatal
#Taking a copy so the cleaning steps below modify this frame and not a view of df4
columns = ['Species', 'Fatal']
dangerous_shark = df4[columns].copy()

display(dangerous_shark)

In [None]:
print("Missing values per column:")
dangerous_shark.isna().sum()

In [None]:
#Checking the values of the column Fatal
dangerous_shark.Fatal.value_counts()

In [None]:
#First I will clean the column Fatal

#Replacing NaN with Incorrect Data
dangerous_shark['Fatal'] = dangerous_shark['Fatal'].fillna('Incorrect Data')

#Stripping surrounding spaces, e.g. ' N'
dangerous_shark['Fatal'] = dangerous_shark['Fatal'].str.strip()

#Changing N to No
dangerous_shark.loc[dangerous_shark['Fatal'] == "N", 'Fatal'] = 'No'

#Changing every value that contains Y or y to Yes
dangerous_shark.loc[dangerous_shark['Fatal'].str.contains("Y|y"), 'Fatal'] = "Yes"

#Changing all other input that is not Yes or No to Incorrect Data (only in the Fatal column,
#so that the Species value of those rows is not overwritten)
dangerous_shark.loc[~dangerous_shark['Fatal'].isin(["Yes", "No"]), 'Fatal'] = "Incorrect Data"

display(dangerous_shark)
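
The Fatal clean-up can also be written as one chained expression. This is only a sketch on invented sample values, and it is stricter than the cell above: it accepts exact `Y`/`N` values (in any case) rather than anything containing a Y.

```python
import pandas as pd

# Invented sample values that imitate the kinds of entries found in the Fatal column.
raw = pd.Series(["N", " N", "Y", "y", "UNKNOWN", "2017", None])

# Trim, upper-case, map the two valid codes, and label everything else.
cleaned = (
    raw.str.strip()
       .str.upper()
       .map({"N": "No", "Y": "Yes"})
       .fillna("Incorrect Data")
)

print(cleaned.tolist())
```

Values that are not in the mapping become NaN after `map`, which is why a single `fillna` at the end covers both missing and invalid entries.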

In [None]:
#Checking the values of the column Species
dangerous_shark.Species.value_counts()

In [None]:
#Cleaning the column Species

#Quite some NaN's in Species, so I will replace them with 'Unknown Shark'
dangerous_shark['Species'] = dangerous_shark['Species'].fillna('Unknown Shark')
display(dangerous_shark)

#Create new column where I will store the extracted Shark species
dangerous_shark['New Species'] = None

#Regular expression pattern that extracts the shark species: everything up to and including the word "shark"
shark_pattern = r'.* (shark|Shark)'

#Looping through every row in the dataframe
for row in range(len(dangerous_shark)):
    try:
        shark_species = re.search(shark_pattern, dangerous_shark.iat[row, dangerous_shark.columns.get_loc('Species')]).group()
        dangerous_shark.iat[row, dangerous_shark.columns.get_loc('New Species')] = shark_species
    except (AttributeError, TypeError):
        # If there is nothing before the string shark or doesn't contain string shark at all,
        # then I will put the string 'Shark involvement not confirmed'
        dangerous_shark.iat[row, dangerous_shark.columns.get_loc('New Species')] = "Shark involvement not confirmed"

new_dangerous_shark = dangerous_shark.drop(columns = 'Species')
new_dangerous_shark = new_dangerous_shark.rename(columns={"New Species":"Species"})

display(new_dangerous_shark)

In [None]:
#Dropping rows I'm not interested in
cleaned_dangerous_shark = new_dangerous_shark[~((new_dangerous_shark['Species'] == 'Shark involvement not confirmed') | (new_dangerous_shark['Species'] == 'Unknown Shark') | (new_dangerous_shark['Fatal'] == 'Incorrect Data'))]
display(cleaned_dangerous_shark)

In [None]:
#Sort on highest number
sorted_sharks = cleaned_dangerous_shark.groupby(['Fatal', 'Species'],as_index=False).size()
sorted_sharks = sorted_sharks.sort_values(by=['size', 'Species'], ascending=False)

display(sorted_sharks)

In [None]:
#Selecting the fatal and non-fatal rows for White shark, Tiger shark and Bull shark by position
#(the positions are hard-coded and depend on the sort order above)
new_sorted_sharks = sorted_sharks[0:23]
top_3_dangerous_shark = new_sorted_sharks.iloc[[0, 1, 2, 3, 4, 22]]
display(top_3_dangerous_shark)
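
Selecting rows by position breaks as soon as the data or the sort order changes. A possible alternative is to select by species name. The sketch below uses invented counts in a frame shaped like `sorted_sharks`; the names `toy_sorted`, `top_names` and `top_3` are hypothetical.

```python
import pandas as pd

# Invented counts, shaped like sorted_sharks (columns Fatal, Species, size).
toy_sorted = pd.DataFrame({
    "Fatal": ["No", "No", "Yes", "No", "No", "Yes", "Yes"],
    "Species": ["White shark", "Tiger shark", "White shark", "Bull shark",
                "Nurse shark", "Tiger shark", "Bull shark"],
    "size": [50, 30, 20, 15, 12, 10, 5],
})

# Total attacks per species (fatal + non-fatal), keeping the three largest.
top_names = toy_sorted.groupby("Species")["size"].sum().nlargest(3).index

# Keep both the fatal and the non-fatal row of each selected species.
top_3 = toy_sorted[toy_sorted["Species"].isin(top_names)]

print(top_3)
```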

In [106]:
#Plotting the results
df = top_3_dangerous_shark

fig = px.bar(df, x="Species",
             y=["size"],
             color= "Fatal",
             title="Top 3 dangerous sharks",
             color_discrete_sequence=px.colors.qualitative.Pastel)


fig.update_layout(paper_bgcolor='cornsilk',
                  legend_traceorder="reversed",
                  yaxis_title="Attacks",
                  legend_title = 'Fatal',
                  font = dict(
                      family = "Courier New, monospace",
                      size = 18,
                      color = 'black'
                  ))
fig.show()

What are the most dangerous types of sharks to humans?
The top 3 most dangerous types of sharks are White sharks, Tiger sharks and Bull sharks.
*BIAS: There's a lot of missing data in both the Fatal and Species columns of the dataset. I've made efforts to improve the usability of the Species column, yet a substantial amount of data remains unusable.*
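
The species extraction can also be done without a row-by-row loop. A minimal sketch using the same regular expression with `str.extract`, on invented sample strings:

```python
import pandas as pd

# Invented sample strings that imitate entries in the Species column.
species = pd.Series([
    "White shark",
    "Tiger shark, 3 m",
    "Invalid",
    "Shark involvement not confirmed",
    "1.8 m [6'] Bull shark",
])

# Same pattern as in the loop: everything up to and including " shark" / " Shark".
# The outer group makes str.extract return the whole match; no match gives NaN.
extracted = species.str.extract(r"(.* (?:shark|Shark))", expand=False)
extracted = extracted.fillna("Shark involvement not confirmed")

print(extracted.tolist())
```

Like the loop, this keeps any size description that precedes the species name, so further clean-up would still be needed to merge such entries.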

### Are children more likely to be attacked by sharks?

In [None]:
#For this question, I need the Age column
#Taking a copy so that later assignments do not trigger a SettingWithCopyWarning
age = ['Age']
age_attacks = df4[age].copy()
display(age_attacks)

In [56]:
#Replacing all the NaN's with Unknown
age_attacks['Age'] = age_attacks['Age'].fillna('Unknown')
display(age_attacks)


Unnamed: 0,Age
0,57
1,11
2,48
3,Unknown
4,Unknown
...,...
6297,Unknown
6298,Unknown
6299,Unknown
6300,Unknown


In [None]:
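#Replacing everything that is not a(n) (usable) number with Unknown
#Caution: str.contains reads the pattern as a regular expression, so "20?" matches a 2
#followed by an optional 0; every age containing the digit 2 is set to Unknown as well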
age_attacks.loc[age_attacks['Age'].str.contains("s|20?|X|F|A|>|e|u|½| "), 'Age'] = "Unknown"

#More than 4000 rows are Unknown
age_attacks.value_counts()

In [58]:
#Dropping all the Unknowns
age_attacks = age_attacks.drop(age_attacks[age_attacks['Age'] == 'Unknown'].index)

display(age_attacks)

Unnamed: 0,Age
0,57
1,11
2,48
6,18
8,15
...,...
6242,6
6243,16
6254,50
6276,16


In [59]:
#Changing the Age column to a numeric type
age_attacks['Age'] = pd.to_numeric(age_attacks['Age'])

#Making a separate column age_group: ages 1-12 are children,
#13-18 are teenagers and older than 18 are adults (each bin includes its upper edge)
bins = [0, 12, 18, 120]
labels = ['Children', 'Teenagers', 'Adults']

age_attacks['age_group'] = pd.cut(age_attacks['Age'], bins=bins, labels=labels)

display(age_attacks)

Unnamed: 0,Age,age_group
0,57,Adults
1,11,Children
2,48,Adults
6,18,Teenagers
8,15,Teenagers
...,...,...
6242,6,Children
6243,16,Teenagers
6254,50,Adults
6276,16,Teenagers
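
How `pd.cut` treats the bin edges is easy to get wrong, so here is a small check on invented ages. By default each bin excludes its left edge and includes its right edge, which means an age of 0 falls outside the first bin unless `include_lowest=True` is passed.

```python
import pandas as pd

ages = pd.Series([0, 1, 12, 13, 18, 19, 57])  # invented ages around the bin edges
bins = [0, 12, 18, 120]
labels = ["Children", "Teenagers", "Adults"]

# Default behaviour: bins are (0, 12], (12, 18], (18, 120].
groups = pd.cut(ages, bins=bins, labels=labels)

# With include_lowest=True the first bin also contains its left edge, 0.
groups_incl = pd.cut(ages, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({"age": ages, "default": groups, "include_lowest": groups_incl}))
```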


In [60]:
#Counting the rows per age group (written so that it works the same across pandas versions)
age_count = age_attacks['age_group'].value_counts().rename_axis('Group').reset_index(name='Count')
display(age_count)


Unnamed: 0,Group,Count
0,Adults,1166
1,Teenagers,776
2,Children,206


In [61]:
#Making a piechart
fig = px.pie(data_frame = age_count,
             values = 'Count',
             names = 'Group',
             title = 'Shark Attack by Age Group',
             color_discrete_sequence=px.colors.qualitative.Pastel
             )

fig.update_traces(textposition ='outside',
                  textinfo = 'label+percent')
fig.update_layout(paper_bgcolor='cornsilk',
                  legend_title = 'Age Group',
                  font = dict(
                      family = "Courier New, monospace",
                      size = 18,
                      color = 'black'
                  ))

fig.show()

Are children more likely to be attacked by sharks?
Almost 10% of the attacks with a usable age (206 of 2,148) were on children, so children are less likely to be attacked by sharks than teenagers or adults.
*BIAS: Out of the 6000+ reported shark attacks, I had to exclude over 4000 cases due to the absence of a usable age figure. The filter also drops every age that contains the digit 2, because the `20?` part of the pattern is read as a regular expression. It is plausible that the results would have been significantly different if each shark attack had included information about the age of the victims.*
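
A stricter way to decide which ages are usable is to keep only entries that consist of digits and nothing else. The sketch below uses invented values and also shows what a pattern such as `20?` matches when it is passed to `str.contains` as a regular expression.

```python
import pandas as pd

# Invented values that imitate entries in the Age column.
raw_age = pd.Series(["57", "11", "20", "20?", "Teen", "6½", "30s", " 28", None])

# As a regular expression, "20?" means "a 2, optionally followed by a 0",
# so it matches every entry that contains the digit 2.
matches_two = raw_age.str.contains("20?", na=False)

# Keep only entries made up entirely of digits, then convert them to numbers.
is_plain_number = raw_age.str.fullmatch(r"\d+", na=False)
usable_age = pd.to_numeric(raw_age[is_plain_number])

print(matches_two.tolist())
print(usable_age.tolist())
```

Here the valid age `"20"` survives, while `"20?"`, `"Teen"` and the entry with a leading space are excluded. Whether entries with stray spaces should be stripped and kept is a separate decision.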

### Are shark attacks where sharks were provoked more or less dangerous?

In [None]:
#For this question, I need the columns Type and Fatal
columns = ['Type', 'Fatal']
provoked_sharks = df4[columns]

display(provoked_sharks)

In [None]:
print("Missing values per column:")
print(provoked_sharks.isnull().sum())

In [None]:
#First I will remove the NaN's in Fatal
cleaned_provoked = provoked_sharks[~provoked_sharks['Fatal'].isna()]
cleaned_provoked.isna().sum()

In [None]:
#Now I will remove the NaN's in Type
#(copying so that the cleaning below does not trigger a SettingWithCopyWarning)
clean_provoked = cleaned_provoked[~cleaned_provoked['Type'].isna()].copy()
clean_provoked.isna().sum()

In [None]:
#Checking the values of Fatal
clean_provoked.Fatal.value_counts()

In [None]:
#Some values are incorrect
#Strip space around N
clean_provoked['Fatal'] = clean_provoked['Fatal'].str.strip()

#Changing N to No
clean_provoked.loc[clean_provoked['Fatal'] == "N", 'Fatal'] = 'No'

#Changing Y to Yes and adding them together
clean_provoked.loc[clean_provoked['Fatal'].str.contains("Y|y"), 'Fatal'] = "Yes"

#Changing all other input that is not exactly Yes or No to UNKNOWN
clean_provoked.loc[~clean_provoked['Fatal'].isin(["Yes", "No"]), 'Fatal'] = "UNKNOWN"

#Making sure the values are correct now
clean_provoked.Fatal.value_counts()


In [None]:
#Checking the values of Type
clean_provoked.Type.value_counts()

In [None]:
#Changing Boating and Boatomg to Boat, as it means the same
clean_provoked.loc[(clean_provoked['Type'] == 'Boating') | (clean_provoked['Type'] == 'Boatomg'), "Type"] = "Boat"

clean_provoked.Type.value_counts()

In [None]:
display(clean_provoked)

In [None]:
#Counting the attacks per Fatal/Type combination, sorted by Type in descending order
provoked_sharks = clean_provoked.groupby(['Fatal', 'Type'],as_index=False).size()
provoked_sharks = provoked_sharks.sort_values(by=['Type'], ascending=False)

display(provoked_sharks)

In [None]:
#Removing UNKNOWN, Questionable and Invalid rows
#(one mask over both text columns; copying so a column can be added afterwards)
unwanted = 'UNKNOWN|Questionable|Invalid'
unwanted_rows = provoked_sharks['Fatal'].str.contains(unwanted) | provoked_sharks['Type'].str.contains(unwanted)
clean_provoked_sharks = provoked_sharks[~unwanted_rows].copy()
display(clean_provoked_sharks)

In [100]:
#Adding an extra column to put the numbers in perspective: the share of each outcome within its Type
clean_provoked_sharks['Percentage'] = clean_provoked_sharks.groupby('Type')['size'].transform(lambda x: 100*x/x.sum())
display(clean_provoked_sharks)

Unnamed: 0,Fatal,Type,size,Percentage
5,No,Unprovoked,3351,73.940865
15,Yes,Unprovoked,1181,26.059135
4,No,Sea Disaster,66,28.205128
14,Yes,Sea Disaster,168,71.794872
2,No,Provoked,548,96.64903
13,Yes,Provoked,19,3.35097
0,No,Boat,319,96.666667
11,Yes,Boat,11,3.333333


In [98]:
#Plotting the results
df = clean_provoked_sharks

fig = px.bar(df, x="Type",
             y=["Percentage"],
             color= "Fatal",
             title="Shark attacks",
             color_discrete_sequence=px.colors.qualitative.Pastel)


fig.update_layout(paper_bgcolor='cornsilk',
                  legend_traceorder="reversed",
                  yaxis_title="Attacks (%)",
                  legend_title = 'Fatal',
                  font = dict(
                      family = "Courier New, monospace",
                      size = 18,
                      color = 'black'
                  ))
fig.show()

Are shark attacks where sharks were provoked more or less dangerous?
Provoked attacks are less dangerous than unprovoked and Sea Disaster attacks: only about 3% of provoked attacks are fatal, compared with about 26% of unprovoked attacks and about 72% of Sea Disaster attacks.
*BIAS: The higher fatality rate for sea disasters may reflect uncertainty about the cause of death, which in many cases could be either the disaster itself or a shark attack.*
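
The fatality percentages can be cross-checked from the counts alone. The sketch below re-enters the counts from the table above into a small frame (the names `counts`, `table` and `fatal_pct` are hypothetical) and computes the fatal share per type with a pivot instead of `groupby().transform()`.

```python
import pandas as pd

# Counts copied from the table above (Type, Fatal, size).
counts = pd.DataFrame({
    "Type": ["Unprovoked", "Unprovoked", "Sea Disaster", "Sea Disaster",
             "Provoked", "Provoked", "Boat", "Boat"],
    "Fatal": ["No", "Yes", "No", "Yes", "No", "Yes", "No", "Yes"],
    "size": [3351, 1181, 66, 168, 548, 19, 319, 11],
})

# One row per Type, one column per Fatal value.
table = counts.pivot(index="Type", columns="Fatal", values="size")

# Share of fatal attacks within each Type, highest first.
fatal_pct = (100 * table["Yes"] / table.sum(axis=1)).round(2).sort_values(ascending=False)

print(fatal_pct)
```

The wide layout puts one fatality rate per type in a single column, which makes the comparison between types easier to read than the long format.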

### Are certain activities more likely to result in a shark attack?

In [None]:
#For this question, I need the Activity column
activities = ['Activity']
activities_attacks = df4[activities]
display(activities_attacks)

In [None]:
#Show activities with the most counts
activities_attacks['Activity'].value_counts()

In [21]:
#Counting the attacks per activity and sorting by the size column in descending order
activities_shark = activities_attacks.groupby(['Activity'],as_index=False).size()
activities_shark = activities_shark.sort_values(by=['size'], ascending=False)

display(activities_shark)

Unnamed: 0,Activity,size
1156,Surfing,971
1193,Swimming,869
420,Fishing,431
1048,Spearfishing,333
113,Bathing,162
...,...,...
543,"Fishing, stepped on hooked shark's head",1
542,"Fishing, standing in water washing fish",1
541,"Fishing, standing in water next to purse net",1
540,"Fishing, standing in waist-deep water",1


In [None]:
#Making a top 10 of the activities with the most shark attacks
top10_activities_shark = activities_shark.head(10)
display(top10_activities_shark)

In [18]:
#Plotting the results
df = top10_activities_shark

fig = px.bar(df, x="Activity",
             y=["size"],
             title="Top 10 dangerous activities",
             color_discrete_sequence=px.colors.qualitative.Pastel)


fig.update_layout(paper_bgcolor='cornsilk',
                  showlegend=False,
                  yaxis_title="Shark attacks",
                  font = dict(
                      family = "Courier New, monospace",
                      size = 18,
                      color = 'black'
                  ))
fig.show()

Are certain activities more likely to result in a shark attack?
The data shows that surfing (971) and swimming (869) have by far the most recorded shark attacks, each more than twice as many as fishing (431), which ranks third.
*BIAS: Certain activities have the same meaning but are written differently, which can influence the outcome. Making the Activity column more useful would require more time to clean it thoroughly.*
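
One possible way to merge differently worded activities is to map each description to a coarse category by keyword. This is only a sketch: the sample strings, the keyword list and the `to_category` helper are assumptions made for illustration, not part of the dataset or the analysis above.

```python
import pandas as pd

# Invented sample strings that imitate entries in the Activity column.
activity = pd.Series([
    "Surfing",
    "surfing ",
    "Surfing, fell off board",
    "Swimming",
    "Fishing, standing in waist-deep water",
    "Spearfishing",
    None,
])

# Hypothetical keyword list; "spearfishing" comes before "fishing" so the
# more specific keyword wins.
keywords = ["spearfishing", "surfing", "swimming", "fishing", "bathing"]

def to_category(text):
    """Map a free-text activity to a coarse category by keyword."""
    if not isinstance(text, str):
        return "Unknown"
    lowered = text.strip().lower()
    for keyword in keywords:
        if keyword in lowered:
            return keyword.capitalize()
    return "Other"

activity_group = activity.apply(to_category)
print(activity_group.value_counts())
```

Substring matching is crude (for example, "windsurfing" would also count as surfing), so the keyword list would need to be reviewed against the real values.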