# Shark Attack Data Analysis

The goal of this project is to analyze and visualize information about shark attacks. The dataset covers recorded shark attacks through 2018 and can be found here: https://www.kaggle.com/datasets/felipeesc/shark-attack-dataset/data

This project will visualize the types of shark attacks to see whether they are more commonly provoked or unprovoked, and whether a certain activity is most likely to result in a shark attack. It will also look at the number of shark attacks over time, and examine which countries have had the most shark attacks. Additionally, since the dataset includes the sex of the shark attack victim, we can examine the data to see if men or women are attacked more often. Finally, we will visualize the number of attacks that have been fatal.

In [1]:
import pandas as pd
import altair as alt

#dataset found at https://www.kaggle.com/datasets/felipeesc/shark-attack-dataset/data
attacks = pd.read_csv("attacks.csv", encoding = "ISO-8859-1")

attacks.head()


#Color scheme to match fun shark themed slide deck!
color_scheme = ['#e9dfd7', '#c3b8af', '#595959', '#423d3e', '#97bfc5', '#799ba0']

## Data Cleanup

Before we can visualize our data, it needs a bit of cleaning. First up, we will look at Null values and unnecessary columns. Several columns are not needed for the analysis and visualizations we are aiming for, so they can be dropped to make things simpler.

For Null values, we are specifically concerned with rows where no date is listed. These rows appear to be unreliable, so we filter out every row where the date is Null.

In [2]:
attacks.isna().sum()

Case Number               17021
Date                      19421
Year                      19423
Type                      19425
Country                   19471
Area                      19876
Location                  19961
Activity                  19965
Name                      19631
Sex                       19986
Age                       22252
Injury                    19449
Fatal (Y/N)               19960
Time                      22775
Species                   22259
Investigator or Source    19438
pdf                       19421
href formula              19422
href                      19421
Case Number.1             19421
Case Number.2             19421
original order            19414
Unnamed: 22               25722
Unnamed: 23               25721
dtype: int64

In [3]:
attacks = attacks.drop(['pdf', 'href formula', 'href', 'Case Number.1', 'Case Number.2', 'original order', 'Unnamed: 22', 'Unnamed: 23'], axis = 1)

In [4]:
attacks = attacks[~ attacks['Date'].isna()]

In [5]:
attacks.isna().sum()

Case Number                  1
Date                         0
Year                         2
Type                         4
Country                     50
Area                       455
Location                   540
Activity                   544
Name                       210
Sex                        565
Age                       2831
Injury                      28
Fatal (Y/N)                539
Time                      3354
Species                   2838
Investigator or Source      17
dtype: int64
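The two cleanup steps above (dropping unneeded columns, then dropping rows with no date) can also be written as one chained expression. The following is a minimal sketch on a made-up three-row frame, not the real dataset; the column names are borrowed from the notebook and the values are placeholders.

```python
import pandas as pd

# Toy stand-in for the raw frame; values are illustrative only.
toy = pd.DataFrame({
    "Date": ["01-Jan-2018", None, None],
    "Country": ["USA", None, "BRAZIL"],
    "pdf": ["a.pdf", None, None],
})

# Drop a column that is not needed, then drop rows whose Date is missing.
cleaned = toy.drop(columns=["pdf"]).dropna(subset=["Date"])
print(cleaned.isna().sum())
```

`dropna(subset=["Date"])` keeps the same rows as the boolean mask `~toy['Date'].isna()` used in the notebook.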

In [6]:
attacks[attacks['Year'].isna()]

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,Age,Injury,Fatal (Y/N),Time,Species,Investigator or Source
187,2017.01.08.R,Reported 08-Jan-2017,,Invalid,AUSTRALIA,Queensland,,Spearfishing,Kerry Daniel,M,35.0,"No attack, shark made a threat display",,,Bull shark,Liquid Vision 1/8/2017
6079,1836.08.19.R,Reported 19-Aug-1836,,Unprovoked,ENGLAND,Cumberland,Whitehaven,Swimming,a boy,M,,FATAL,Y,,,"C. Moore, GSAF"


In [7]:
attacks.loc[187, 'Year'] = 2017

In [8]:
attacks.loc[187, 'Year']

2017.0

In [9]:
attacks.loc[6079, 'Year'] = 1836

In [10]:
attacks['Year'].describe()

count    6302.000000
mean     1927.272136
std       281.076315
min         0.000000
25%      1942.000000
50%      1977.000000
75%      2005.000000
max      2018.000000
Name: Year, dtype: float64

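The data here goes back well before 1900 (the minimum Year above is 0), but we are mostly interested in more current data, so we will only look at the data from 1900 on. The cell below also previews the Year column as datetime. Because the result of pd.to_datetime is not assigned back, the Year column itself stays numeric, which the year comparisons later in the notebook rely on.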

In [11]:
attacks = attacks[attacks['Year']>=1900]
pd.to_datetime(attacks['Year'], format='%Y')


0      2018-01-01
1      2018-01-01
2      2018-01-01
3      2018-01-01
4      2018-01-01
          ...    
5559   1900-01-01
5560   1900-01-01
5561   1900-01-01
5562   1900-01-01
5563   1900-01-01
Name: Year, Length: 5563, dtype: datetime64[ns]

In [12]:
attacks['Year'].describe()

count    5563.000000
mean     1978.763437
std        31.502432
min      1900.000000
25%      1958.000000
50%      1987.000000
75%      2007.000000
max      2018.000000
Name: Year, dtype: float64
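The `describe()` output above still reports `float64` because `pd.to_datetime` returns a new Series instead of changing the column. If a datetime version is wanted alongside the numeric one, one option is to store it in a separate column. This is a sketch on toy data; the column name `Year_dt` is hypothetical and does not appear in the notebook.

```python
import pandas as pd

# Toy stand-in for the notebook's `attacks` frame (values are illustrative only).
toy = pd.DataFrame({"Year": [1899.0, 1900.0, 1975.0, 2018.0]})

# Keep 1900 onward; .copy() avoids SettingWithCopyWarning on the assignment below.
recent = toy[toy["Year"] >= 1900].copy()

# pd.to_datetime returns a new Series; nothing changes unless it is assigned.
pd.to_datetime(recent["Year"].astype(int).astype(str), format="%Y")
print(recent["Year"].dtype)  # still float64

# Store the datetime version in a separate (hypothetical) column,
# so numeric comparisons such as Year > 1950 keep working.
recent["Year_dt"] = pd.to_datetime(recent["Year"].astype(int).astype(str), format="%Y")
print(recent.dtypes)
```

Keeping the numeric `Year` untouched means the later filters and group-bys in the notebook behave exactly as before.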

## Shark Attack Types

The data contains information on the type of shark attack. This data can be examined to give some understanding of the causes behind shark attacks, including whether they are typically provoked or unprovoked.

The first step will require a bit more data cleanup to consolidate the listed attack types, and then we can create a visualization to see which attack types are most frequent.

In [13]:
attacks['Type'].value_counts(dropna = False)

Unprovoked      4044
Provoked         539
Invalid          477
Sea Disaster     186
Boating          181
Boat             130
NaN                3
Questionable       2
Boatomg            1
Name: Type, dtype: int64

In [14]:
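# Assign the results back; inplace=True on a selected column is unreliable in newer pandas
attacks['Type'] = attacks['Type'].replace('Questionable', 'Unknown')
attacks['Type'] = attacks['Type'].fillna('Unknown')
attacks['Type'] = attacks['Type'].replace(['Boat', 'Boatomg'], 'Boating')

# 'Invalid' records are not confirmed attacks, so they are removed
attacks = attacks[attacks['Type'] != 'Invalid']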

In [15]:
attacks['Type'].value_counts(dropna = False)

Unprovoked      4044
Provoked         539
Boating          312
Sea Disaster     186
Unknown            5
Name: Type, dtype: int64
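The same consolidation can be expressed with a single mapping from variant spellings to canonical labels, which may be easier to extend if more variants turn up. This is a minimal sketch on a made-up Series that mirrors the spellings seen in the counts above; the `canonical` dictionary is an illustrative name, not part of the notebook.

```python
import pandas as pd

# Illustrative labels mirroring the spellings in the value counts above.
types = pd.Series(
    ["Unprovoked", "Boat", "Boatomg", "Boating", "Questionable", None, "Invalid"]
)

# Map each variant spelling to its canonical label.
canonical = {"Boat": "Boating", "Boatomg": "Boating", "Questionable": "Unknown"}

# replace() and fillna() return new Series, so the result is assigned
# instead of relying on inplace=True.
cleaned = types.replace(canonical).fillna("Unknown")

# Drop the 'Invalid' entries, as the notebook does.
cleaned = cleaned[cleaned != "Invalid"]
print(cleaned.value_counts())
```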

In [16]:
attack_types = attacks.groupby('Type')['Type'].count().reset_index(name = 'counts')
attack_types.head()

Unnamed: 0,Type,counts
0,Boating,312
1,Provoked,539
2,Sea Disaster,186
3,Unknown,5
4,Unprovoked,4044


In [17]:
type_chart = alt.Chart(attack_types, title = "Shark Attack Types").mark_arc().encode(
    theta = 'counts',
    color = 'Type:N'
    #color = alt.Color('Type').scale(domain = 'Type', range = color_scheme)
).configure_range(
    category={'scheme': color_scheme}
)

type_chart

From this chart, we can clearly see that a large majority of recorded shark attacks are considered to have been unprovoked! This may be skewed by the fact that most people do not go out of their way to provoke a shark, but it is a bit unsettling to see how many attacks occur without provocation.
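To put a number on "a large majority", the counts from the attack_types table can be turned into percentages. The sketch below re-enters those counts by hand so it runs on its own; the `share_pct` column name is an assumption introduced for illustration.

```python
import pandas as pd

# Counts copied from the attack_types table above.
attack_types = pd.DataFrame({
    "Type": ["Boating", "Provoked", "Sea Disaster", "Unknown", "Unprovoked"],
    "counts": [312, 539, 186, 5, 4044],
})

# Share of each type as a percentage of all remaining records.
total = attack_types["counts"].sum()
attack_types["share_pct"] = (attack_types["counts"] / total * 100).round(1)
print(attack_types.sort_values("share_pct", ascending=False))
```

With these counts, unprovoked attacks make up roughly 79.5% of the 5,086 remaining records.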

## Shark Attacks Over Time

We also want to examine the number of shark attacks over time. This can be visualized with a line chart showing the number of attacks per year.

In [18]:
attack_years = attacks.groupby('Year')['Year'].count().reset_index(name = "Attacks")


attacks_by_year = alt.Chart(attack_years, title = 'Shark Attacks by Year').mark_line().encode(
    x = 'Year:N',
    y = 'Attacks'
).configure_range(
    category={'scheme': color_scheme}
)


attacks_by_year

The above chart shows a very large timespan. This can be simplified a bit by focusing only on more recent data, such as attacks after 1950.

In [19]:
attacks_by_year_1950 = alt.Chart(attack_years[attack_years['Year'] > 1950], title = 'Shark Attacks by Year After 1950').mark_line().encode(
    x = 'Year:N',
    y = 'Attacks'
).configure_range(
    category={'scheme': color_scheme}
)


attacks_by_year_1950

These visualizations show a (scary) trend: shark attacks have mostly been increasing over time. Interestingly, there is a fairly significant spike between 1957 and 1963. The attacks then settle down a bit before picking up significantly through the 1990s and 2000s.

In case anyone was wondering, Jaws was released in 1975!
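Year-to-year counts are noisy, so a rolling average can make a long-run trend easier to read than the raw line. This is a minimal sketch, assuming a three-year centered window is acceptable; the yearly counts below are made up to show the mechanics and are not values from the dataset.

```python
import pandas as pd

# Made-up yearly counts, only to show the mechanics (not dataset values).
attack_years_demo = pd.DataFrame({
    "Year": [2000, 2001, 2002, 2003, 2004, 2005],
    "Attacks": [10, 14, 9, 15, 13, 17],
})

# Centered 3-year rolling mean; the first and last rows have no full
# window and therefore become NaN.
attack_years_demo["Attacks_smoothed"] = (
    attack_years_demo["Attacks"].rolling(window=3, center=True).mean()
)
print(attack_years_demo)
```

The smoothed column could then be plotted in place of, or on top of, the raw yearly counts.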

## Shark Attacks by Country

So we've now seen that sharks will attack people unprovoked and that shark attacks seem to be increasing over time. But where do shark attacks happen the most frequently? To visualize this we will look at the 5 countries with the most recorded shark attacks.

In [20]:
attacks_by_country = attacks.groupby('Country')['Country'].count().reset_index(name = 'Attacks')
attacks_by_country.sort_values('Attacks', ascending = False, inplace = True)

country_chart = alt.Chart(attacks_by_country.head(), title = 'Top 5 Countries with the Most Shark Attacks').mark_bar().encode(
    x = 'Country',
    y = 'Attacks',
    color = alt.Color('Country').legend(None)
).configure_range(
    category={'scheme': color_scheme}
)

country_chart

Great... well now we know where not to go swimming! 

But has this been consistent over time? We already saw how the total number of shark attacks has increased since 1950; now we look into this further by plotting each of the countries with the most shark attacks.

In [21]:
countrylist = ['AUSTRALIA', 'BRAZIL', 'PAPUA NEW GUINEA', 'SOUTH AFRICA', 'USA']
country_time = attacks[attacks['Country'].isin(countrylist)]

# Count records per country and year, then keep only years after 1950
country_time = country_time.groupby(['Country', 'Year']).count().reset_index()
country_time = country_time[['Country', 'Year', 'Case Number']]
country_time.columns = ['Country', 'Year', 'Attacks']
country_time = country_time[country_time['Year'] > 1950]

country_time_chart = alt.Chart(country_time, title = 'Shark Attacks by Country Over Time').mark_line().encode(
    x = 'Year:O',
    y = 'Attacks',
    color = 'Country'
).configure_range(
    category={'scheme': color_scheme}
)


country_time_chart

Of note, the USA, which leads in the number of recorded shark attacks by a large margin, appears to track the other leading countries fairly closely until around 1980, when it breaks away dramatically.
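The list of leading countries can also be derived from the data instead of being typed by hand, which keeps the per-country chart in sync with the top-5 bar chart. The following is a sketch on illustrative rows; the `top_n` variable and the sample values are assumptions, not dataset contents.

```python
import pandas as pd

# Illustrative records; country labels and years are placeholders.
demo = pd.DataFrame({
    "Country": ["USA", "USA", "USA", "AUSTRALIA", "AUSTRALIA", "BRAZIL", None],
    "Year": [1990, 1990, 1991, 1990, 1991, 1991, 1991],
})

# Take the n most frequent countries straight from the data
# (value_counts skips missing values by default).
top_n = 2  # hypothetical; the notebook looks at 5
top_countries = demo["Country"].value_counts().head(top_n).index.tolist()

# Count rows per country and year; size() counts rows regardless of
# missing values in other columns.
per_year = (
    demo[demo["Country"].isin(top_countries)]
    .groupby(["Country", "Year"])
    .size()
    .reset_index(name="Attacks")
)
print(top_countries)
print(per_year)
```

Using `size()` rather than `count()` on a particular column avoids undercounting when that column has missing entries.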

## Shark Attacks by Species

When most people think of a shark attack, they picture a great white shark - but is this consistent with what the data shows? We can break down the data by species to find out.

The data contains a lot of inconsistent species information, with some more obscure shark types only having a few recorded attacks, and some recordings only specifying the size of the shark instead of the species. After looking at the data, 3 species in particular appear to be the most common attackers, so we will focus on those specifically.

In [22]:

attacks.rename(columns = {'Species ': 'Species'}, inplace = True)


shark_species = attacks.value_counts('Species')

shark_species = shark_species.to_frame(name = 'Attacks').reset_index()

shark_species = shark_species.head(3)
species_chart = alt.Chart(shark_species).mark_bar().encode(
    x = 'Species',
    y = 'Attacks',
    color = 'Species'
).configure_range(
    category={'scheme': color_scheme}
)


species_chart

Well it turns out that your mental image might be right, with white sharks leading the number of attacks!
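Because the species column is free text, raw value counts can split one species across several labels (for example, entries that add a size to the name). One way to reduce this is keyword matching on lowercased text. This is a minimal sketch; the sample strings, the `keywords` list, and the `normalize_species` helper are all hypothetical and would need adjusting after inspecting the real column.

```python
import pandas as pd

# Made-up free-text entries in the style described above (not dataset rows).
raw = pd.Series([
    "White shark, 4 m",
    "white shark",
    "Tiger shark, 3m",
    "2 m shark",
    "Bull shark",
    None,
])

# Hypothetical keyword list; extend it after inspecting the real column.
keywords = ["white shark", "tiger shark", "bull shark"]

def normalize_species(text):
    """Return the first matching keyword, or 'unspecified' if none matches."""
    if not isinstance(text, str):
        return "unspecified"
    lowered = text.lower()
    for key in keywords:
        if key in lowered:
            return key
    return "unspecified"

species_clean = raw.map(normalize_species)
print(species_clean.value_counts())
```

Entries that only give a size, or are missing entirely, end up in the `unspecified` bucket instead of being counted as separate species.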

## Shark Attacks by Sex

The data contains information about the sex of the victim, so let's see who has been attacked more frequently - men or women?

Note, there are a couple of outlier entries other than 'M' or 'F'. These are infrequent and appear to be typos or misentries rather than additional gender identities, so they will not be included in this visualization.

In [23]:
attacks.rename(columns = {'Sex ': 'Sex'}, inplace = True)
attacks.columns
sex_data = attacks.value_counts('Sex').to_frame(name = 'Attacks').reset_index()

sex_data = sex_data[sex_data['Sex'].isin(['M', 'F'])]

sex_chart = alt.Chart(sex_data).mark_bar().encode(
    
    x = 'Sex',
    y = 'Attacks',
    color = 'Sex'

).configure_range(
    category={'scheme': color_scheme}
)


sex_chart

This is actually fairly surprising. Around 8 times as many men as women have been attacked. This could use some further investigation - are men doing something more likely to cause an attack, or are men just shark-bait?

## Shark Attack Fatalities 

We've looked a lot at the frequency of attacks - but how dangerous are they actually? While certainly no one wants to be attacked by a shark, it would be good to visualize how many attacks have actually been fatal.

Again, this will require a bit of data cleanup.

In [24]:
attacks.value_counts('Fatal (Y/N)')

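# Normalize stray spellings; assign back instead of using inplace=True on a column
attacks['Fatal (Y/N)'] = attacks['Fatal (Y/N)'].replace([' N', 'N '], 'N')
attacks['Fatal (Y/N)'] = attacks['Fatal (Y/N)'].replace('y', 'Y')
attacks['Fatal (Y/N)'] = attacks['Fatal (Y/N)'].replace(['2017', 'M'], 'UNKNOWN')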

attacks.value_counts('Fatal (Y/N)')

Fatal (Y/N)
N          4002
Y          1019
UNKNOWN      56
dtype: int64

In [25]:
fatal_count = attacks.value_counts('Fatal (Y/N)').to_frame(name = 'Attacks').reset_index()


fatal_pie = alt.Chart(fatal_count, title = 'Shark Attack Fatalities').mark_arc(outerRadius = 120).encode(
    alt.Theta('Attacks').stack(True),
    color = 'Fatal (Y/N)'
).configure_range(
    category={'scheme': color_scheme}
)
#text = fatal_pie.mark_text(radius=140, size=20).encode(text="Attacks:Q")

fatal_pie #+ text


Luckily, it appears that a large majority of shark attacks have actually not been fatal!
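The counts shown in the output above can be turned into a fatality rate directly. The sketch below re-enters those counts by hand so it runs on its own; `rate_all` and `rate_known` are names introduced here for illustration.

```python
import pandas as pd

# Counts copied from the value_counts output above.
fatal_count = pd.Series({"N": 4002, "Y": 1019, "UNKNOWN": 56}, name="Attacks")

# Fatality rate over all records, and over records with a known outcome only.
rate_all = fatal_count["Y"] / fatal_count.sum()
rate_known = fatal_count["Y"] / (fatal_count["Y"] + fatal_count["N"])
print(f"{rate_all:.1%} of all records, {rate_known:.1%} of records with a known outcome")
```

With these counts, roughly one in five recorded attacks was fatal, whether or not the records with an unknown outcome are included.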

And finally, since we already saw the 3 sharks responsible for the most attacks, let's see if they are also the most deadly.

In [26]:
fatal_attack_species = attacks[attacks['Fatal (Y/N)'] == 'Y'].value_counts('Species').to_frame(name = 'Attacks').reset_index()


fatal_attack_species_chart = alt.Chart(fatal_attack_species.head(3), title = 'Shark Species with the Most Fatal Attacks').mark_bar().encode(
    x = 'Species',
    y = 'Attacks',
    color = 'Species'
).configure_range(
    category={'scheme': color_scheme}
)

fatal_attack_species_chart

As it turns out, not only have White Sharks caused the most recorded attacks - they have also caused the most fatalities. 


Swim safe everyone!