# More Shark Bite Analysis

Data is from [Global Shark Attack File](https://www.sharkattackfile.net/). You can find documentation at [https://www.sharkattackfile.net/incidentlog.htm](https://www.sharkattackfile.net/incidentlog.htm).

## Questions

1. Read in `GSAF5.xls` and print the first 5 rows of the data, making sure you can see *every column* in the dataset.
2. How many rows and columns are in the dataset?
3. Are more shark attacks provoked or unprovoked?
4. What is the most common activity when a shark attack occurs?
5. What is the most common activity when a shark attack occurs in Australia?
6. What is the most common activity when a shark attack occurs in the USA?
7. What are the top 5 countries with the most shark attacks?
8. What are the top 5 states with the most shark attacks in the USA?
9. What are the most common days of the week for shark attacks?
10. What are the most common months for shark attacks?
11. Are shark attacks from 1900-1950 more or less likely to be fatal than those after 1950?
12. Find 3 things that are bad or difficult to clean about this dataset. This will probably involve sorting, filtering, or grouping the data.

In [1]:
import requests
import pandas as pd

In [3]:
df = pd.read_excel('GSAF5.xls')
df

# The table has 25849 rows and 24 columns.

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,...,Species,Investigator or Source,pdf,href formula,href,Case Number.1,Case Number.2,original order,Unnamed: 22,Unnamed: 23
0,2022.11.04,04-Nov-2022,2022.0,Unprovoked,USA,California,"Del Mar Beach, San Diego County",Swimming,Lyn Jutronich,F,...,Juvenile white shark,"R.Collier, GSAF",,,,,,,,
1,2022.11.03,03-Nov-2022,2022.0,Provoked,USA,Florida,"Cape San Blas, Gulf County",Fishing,male,M,...,,"Miami Herald, 11/4/2022",,,,,,,,
2,2022.10.31,31-Oct-2022,2022.0,Unprovoked,USA,California,"Otter Point, Pacific Grove",Surfing,Jim Affinito,M,...,,"R. Collier, GSAF",,,,,,,,
3,2022.10.28,28-Oct-2022,2022.0,Unprovoked,BAHAMAS,,Cat Island,Diving,male,M,...,,"Bahamas Press, 10/29/2022",,,,,,,,
4,2022.10.27,27-Oct-2022,2022.0,Provoked,USA,Mississippi,Horn Island,Fishing,male,M,...,5'shark,"WXXV, 10/27/2022",,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25844,,,,,,,,,,,...,,,,,,,,,,
25845,,,,,,,,,,,...,,,,,,,,,,
25846,,,,,,,,,,,...,,,,,,,,,,
25847,,,,,,,,,,,...,,,,,,,,,,
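
The output above elides the middle columns with `...`, while question 1 asks for the first 5 rows with every column visible. A minimal sketch of one way to do that, together with the row and column count for question 2, follows. The `toy` frame is a made-up stand-in so the snippet runs without `GSAF5.xls`.

```python
import pandas as pd

# Hypothetical stand-in for the GSAF data; the notebook itself loads GSAF5.xls.
toy = pd.DataFrame({
    "Case Number": ["2022.11.04", "2022.11.03", None],
    "Type": ["Unprovoked", "Provoked", None],
    "Country": ["USA", "USA", None],
})

# Ask pandas to print every column instead of eliding the middle ones.
pd.set_option("display.max_columns", None)
print(toy.head())

# shape is a (rows, columns) tuple.
n_rows, n_cols = toy.shape
print(f"{n_rows} rows, {n_cols} columns")
```

With the real data, the same two calls would be applied to `df` right after `pd.read_excel`.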


In [4]:
#Are more shark attacks provoked or unprovoked?
df.Type.value_counts()
# Unprovoked attacks (5003) far outnumber provoked ones (625).

Unprovoked             5003
Provoked                625
Invalid                 552
Watercraft              350
Sea Disaster            241
Questionable             12
Boat                      7
?                         1
Unconfirmed               1
Unverified                1
Under investigation       1
Name: Type, dtype: int64

In [7]:
# What is the most common activity when a shark attack occurs?
df.Activity.value_counts()

# Surfing is the most common activity when a shark attack occurs (1095 incidents).

Surfing                                                             1095
Swimming                                                             943
Fishing                                                              477
Spearfishing                                                         372
Wading                                                               167
                                                                    ... 
Scuba diving & U/W photography                                         1
Fishing from paddleski                                                 1
Crabbing (spearing crabs)                                              1
Being pulled to shore from wreck of 25-ton fishing vessel Alan S       1
Wreck of  large double sailing canoe                                   1
Name: Activity, Length: 1577, dtype: int64

In [15]:
# What is the most common activity when a shark attack occurs in Australia?
df[df.Country == 'AUSTRALIA'].Activity.value_counts()

#Surfing is the most common activity when a shark attack occurs in Australia

Surfing                                      230
Swimming                                     177
Fishing                                      122
Spearfishing                                  86
Bathing                                       58
                                            ... 
Adrift after wave swamped engine               1
Spearfishing, Scuba diving                     1
Surfing on "chest board" (boogie board?)       1
Catching sharks under government contract      1
Helmet diving                                  1
Name: Activity, Length: 387, dtype: int64

In [16]:
# What is the most common activity when a shark attack occurs in the USA?

df[df.Country == 'USA'].Activity.value_counts()

# Surfing is the most common activity when a shark attack occurs in the USA

Surfing                                           642
Swimming                                          343
Fishing                                           134
Wading                                            117
Standing                                           69
                                                 ... 
Adrift on refugee raft                              1
Commercial diver (submerged or treading water)      1
Surfing, sitting on board                           1
Playing / jumping                                   1
Free diving, collecting sand dollars                1
Name: Activity, Length: 528, dtype: int64

In [17]:
# What are the top 5 countries with the most shark attacks?
df.Country.value_counts().head()

USA                 2484
AUSTRALIA           1455
SOUTH AFRICA         593
NEW ZEALAND          142
PAPUA NEW GUINEA     136
Name: Country, dtype: int64

In [19]:
# What are the top 5 states with the most shark attacks in the USA?
df[df.Country == 'USA'].Area.value_counts().head()

Florida           1153
Hawaii             330
California         320
South Carolina     168
North Carolina     117
Name: Area, dtype: int64

In [36]:
# What are the most common days of the week for shark attacks?

0      2022.11.04
1      2022.11.03
2      2022.10.31
3      2022.10.28
4      2022.10.27
5      2022.10.25
6      2022.10.23
7      2022.10.10
8      2022.10.08
9      2022.10.07
10     2022.10.06
11    2022.010.02
Name: Case_Number, dtype: object
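
The cell above only inspects `Case_Number` and never counts weekdays. One possible approach, sketched here on a few made-up date strings in the same `DD-Mon-YYYY` style as the `Date` column, is to parse the dates and count day names. The `dates` series and the single `'Reported '` clean-up step are illustrative assumptions, not the notebook's actual cleaning.

```python
import pandas as pd

# Hypothetical sample in the same format as the Date column.
dates = pd.Series([
    "04-Nov-2022",
    "03-Nov-2022",
    "31-Oct-2022",
    "Reported 28-Oct-2022",
    "not a date",
])

# Strip the literal 'Reported ' prefix, then parse; unparseable values become NaT.
cleaned = dates.str.replace("Reported ", "", regex=False)
parsed = pd.to_datetime(cleaned, format="%d-%b-%Y", errors="coerce")

# day_name() maps each timestamp to 'Monday', 'Tuesday', ...; NaT rows are dropped by value_counts().
day_counts = parsed.dt.day_name().value_counts()
print(day_counts)
```

Rows that fail to parse are silently excluded, so it is worth checking `parsed.isna().sum()` before trusting the ranking.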

In [116]:
#df.columns = df.columns.str.replace(" ", "_")
df.columns

Index(['Case_Number', 'Date', 'Year', 'Type', 'Country', 'Area', 'Location',
       'Activity', 'Name', 'Sex', 'Age', 'Injury', 'Fatal_(Y/N)', 'Time',
       'Species_', 'Investigator_or_Source', 'pdf', 'href_formula', 'href',
       'Case_Number.1', 'Case_Number.2', 'original_order', 'Unnamed:_22',
       'Unnamed:_23', 'Date_dt_type', 'Year_only'],
      dtype='object')

In [74]:
#df.Case_Number[11] = df.Case_Number[11].replace('010', '10')
#df.Case_Number[17] = df.Case_Number[17].replace('.c', '')
#df.Date[58] = df.Date[58].replace('Reported ', '')
#df.Date[58]
#df.Date = df.Date.str.replace('Reported ', '')

#df.Date[91] = df.Date[91].replace('`', '')
#df.Date[96] = df.Date[96].replace('Nox', 'Nov')
#df.Date[257] = df.Date[257].replace('202', '2020')
#df.Date[487] = df.Date[487].replace('1018', '2018')
#df.Date[1371] = df.Date[1371].replace('16-Aug--2011', '16-Aug-2011')
#df.Date[1374] = df.Date[1374].replace('11-Aug--2011', '11-Aug-2011')

#df.Date[1522] = df.Date[1522].replace('190Feb-2010', '19-Feb-2010')

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.Date[1522] = df.Date[1522].replace('190Feb-2010', '19-Feb-2010')
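
The warning above comes from chained indexing (`df.Date[1522] = ...`), where pandas cannot guarantee that the assignment reaches the original frame. A small sketch of the same kind of fix written with `.loc`, plus column-wide replacements for patterns that repeat, is shown below. The `toy` frame and its row labels are invented for illustration.

```python
import pandas as pd

# Hypothetical rows with the same kinds of typos seen in the Date column.
toy = pd.DataFrame({"Date": ["16-Aug--2011", "190Feb-2010", "Reported 12-Aug-2016"]})

# Single-cell fix: select row and column in one .loc call, so the write goes to toy itself.
toy.loc[1, "Date"] = toy.loc[1, "Date"].replace("190Feb", "19-Feb")

# Column-wide fixes for recurring patterns, using literal (non-regex) matching.
toy["Date"] = (
    toy["Date"]
    .str.replace("Reported ", "", regex=False)
    .str.replace("--", "-", regex=False)
)

print(toy["Date"].tolist())
```

Column-wide replacements scale better than fixing rows one index at a time, but one-off typos such as `190Feb` still need a targeted `.loc` assignment.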


In [108]:
# What are the most common months for shark attacks?

#df['Date_dt_type'] = pd.to_datetime(df.Date, errors='ignore')
#df.Date_dt_type = pd.to_datetime(df.Date_dt_type, errors='coerce')
df.Date_dt_type.dt.month.value_counts()

# ANSWER: July has the most shark attacks (740), followed by January (695) and August (647).

7.0     740
1.0     695
8.0     647
9.0     591
6.0     538
10.0    486
4.0     482
12.0    470
3.0     440
11.0    438
5.0     427
2.0     404
Name: Date_dt_type, dtype: int64

In [132]:
# Are shark attacks from 1900-1950 more or less likely to be fatal than those after 1950?
#df['Year_only'] = df.Date_dt_type.dt.year
#df_1900 = df[(df.Year_only > 1900) & (df.Year_only < 1950)]
df_1900.columns = df_1900.columns.str.replace('Fatal_(Y/N)', 'fatal')

  df_1900.columns = df_1900.columns.str.replace('Fatal_(Y/N)', 'fatal')
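
The next cell still reads the column as `Fatal_(Y/N)`, which suggests the rename above had no effect. A likely cause, assuming a pandas version where `str.replace` treats the pattern as a regular expression by default, is that the parentheses in `Fatal_(Y/N)` are read as a regex group rather than literal characters. The sketch below shows two ways to rename by exact name, using a made-up `toy` frame.

```python
import pandas as pd

# Hypothetical frame with the awkward column name.
toy = pd.DataFrame({"Fatal_(Y/N)": ["N", "Y"], "Year_only": [1920, 1935]})

# Option 1: rename by exact key; no pattern matching is involved.
renamed = toy.rename(columns={"Fatal_(Y/N)": "fatal"})

# Option 2: keep str.replace but force literal matching.
also_renamed = toy.copy()
also_renamed.columns = also_renamed.columns.str.replace("Fatal_(Y/N)", "fatal", regex=False)

print(list(renamed.columns), list(also_renamed.columns))
```

`rename` returns a new frame, so the result has to be assigned back (or used directly) for the new name to stick.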


In [152]:
df_1900['Fatal_(Y/N)'].value_counts()

N          505
Y          360
UNKNOWN      7
N            1
Name: Fatal_(Y/N), dtype: int64

In [153]:
df_post_1950 = df[df.Year_only > 1950]
df_post_1950['Fatal_(Y/N)'].value_counts()

N          3767
Y           636
UNKNOWN      37
 N            7
M             2
F             2
n             1
Nq            1
2017          1
Y x 2         1
Name: Fatal_(Y/N), dtype: int64

In [154]:
# Attacks from 1900-1950 were more likely to be fatal: 41.2% of incidents with a recorded outcome were fatal (360 of 873),
# compared with 14.3% after 1950 (636 of 4455).
# The absolute number of fatal incidents is nonetheless higher after 1950 (636 versus 360),
# because far more incidents were recorded in that period.
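
The fatality percentages can also be computed directly rather than read off the `value_counts()` output. The sketch below normalizes the flag first (stripping whitespace, upper-casing, and keeping only `Y` and `N`) and then takes the share of `Y` per period. The `toy` data and the `fatal_clean` column name are assumptions made for illustration, and treating everything other than `Y`/`N` as unknown is one possible cleaning rule, not the only one.

```python
import pandas as pd

# Hypothetical rows mimicking the messy values seen in Fatal_(Y/N).
toy = pd.DataFrame({
    "Year_only": [1920, 1935, 1949, 1960, 1999, 2005, 2010, 2017, 2020],
    "Fatal_(Y/N)": ["Y", "Y", " N", "n", "UNKNOWN", "N", "Y", "2017", "N"],
})

# Normalize the flag: ' N' and 'n' become 'N'; anything else that is not Y/N becomes NaN.
flag = toy["Fatal_(Y/N)"].str.strip().str.upper()
toy["fatal_clean"] = flag.where(flag.isin(["Y", "N"]))

# Keep rows from 1900 onward that have a usable flag.
known = toy.dropna(subset=["fatal_clean"])
known = known[known["Year_only"] >= 1900]

# Label each row by period (between() is inclusive on both ends) and average the Y indicator.
period = known["Year_only"].between(1900, 1950).map({True: "1900-1950", False: "after 1950"})
fatal_rate = known["fatal_clean"].eq("Y").astype(float).groupby(period).mean()
print(fatal_rate)
```

Averaging a 0/1 indicator gives the fatal share per group, so the two periods can be compared without counting by hand.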

In [164]:
# Find 3 things that are bad or difficult to clean about this dataset.
1.
# The Date column is hard to standardize: several entries contain typos or extra text
# (for example 'Reported ', '16-Aug--2011', '190Feb-2010'). The Case Number column has the same problem.
2.
# The Fatal_(Y/N) column contains values that make no sense, such as 'M', 'F', 'Nq', '2017' and 'Y x 2'.
3.
# The file ends with a large number of rows that have no entries at all, which inflates the row count.

#This will probably involve sorting, filtering, or grouping the data.

3.0
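
For the third problem, rows with no entries at all can be removed with `dropna(how="all")`. The sketch below uses a made-up `toy` frame. Whether the trailing rows in the real file are empty in every column, including helper columns such as `original order`, is an assumption that should be checked before dropping anything.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one complete row, one partial row, two fully empty rows.
toy = pd.DataFrame({
    "Case Number": ["2022.11.04", "2022.11.03", np.nan, np.nan],
    "Country": ["USA", np.nan, np.nan, np.nan],
})

# how="all" drops a row only when every column is missing; partial rows are kept.
trimmed = toy.dropna(how="all")
empty_rows = len(toy) - len(trimmed)
print(f"dropped {empty_rows} empty rows, kept {len(trimmed)}")
```

Comparing the lengths before and after gives a quick count of how many rows were pure padding.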