# Project: Data Cleaning and Manipulation with Pandas

Task description:

## Technical Requirements

* Import the data using Pandas.
* Examine the data for potential issues.
* Use at least 8 of the cleaning and manipulation methods you have learned on the data.
* Produce a Jupyter Notebook that shows the steps you took and the code you used to clean and transform your data set.
* Export a clean CSV version of your data using Pandas.

## Necessary Deliverables

The following deliverables should be pushed to your Github repo for this chapter.

* **A cleaned CSV data file** containing the results of your data wrangling work.
* **A Jupyter Notebook (data-wrangling.ipynb)** containing all Python code and commands used in the importing, cleaning, manipulation, and exporting of your data set.
* **A ``README.md`` file** containing a detailed explanation of the process followed in the importing, cleaning, manipulation, and exporting of your data as well as your results, obstacles encountered, and lessons learned.

In [1]:
# Import required
import pandas as pd
import numpy as np
import re

In [2]:
# Import the data using pandas
df = pd.read_csv("sharkattack.csv", sep=",", engine="python")

In [3]:
## Examine the dataset for potential issues

# display df

df

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,...,Species,Investigator or Source,pdf,href formula,href,Case Number.1,Case Number.2,original order,Unnamed: 22,Unnamed: 23
0,2016.09.18.c,18-Sep-16,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,male,M,...,,"Orlando Sentinel, 9/19/2016",2016.09.18.c-NSB.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2016.09.18.c,2016.09.18.c,5993,,
1,2016.09.18.b,18-Sep-16,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,Chucky Luciano,M,...,,"Orlando Sentinel, 9/19/2016",2016.09.18.b-Luciano.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2016.09.18.b,2016.09.18.b,5992,,
2,2016.09.18.a,18-Sep-16,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,male,M,...,,"Orlando Sentinel, 9/19/2016",2016.09.18.a-NSB.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2016.09.18.a,2016.09.18.a,5991,,
3,2016.09.17,17-Sep-16,2016,Unprovoked,AUSTRALIA,Victoria,Thirteenth Beach,Surfing,Rory Angiolella,M,...,,"The Age, 9/18/2016",2016.09.17-Angiolella.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2016.09.17,2016.09.17,5990,,
4,2016.09.15,16-Sep-16,2016,Unprovoked,AUSTRALIA,Victoria,Bells Beach,Surfing,male,M,...,2 m shark,"The Age, 9/16/2016",2016.09.16-BellsBeach.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2016.09.16,2016.09.15,5989,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5987,ND.0005,Before 1903,0,Unprovoked,AUSTRALIA,Western Australia,Roebuck Bay,Diving,male,M,...,,"H. Taunton; N. Bartlett, p. 234",ND-0005-RoebuckBay.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,ND.0005,ND.0005,6,,
5988,ND.0004,Before 1903,0,Unprovoked,AUSTRALIA,Western Australia,,Pearl diving,Ahmun,M,...,,"H. Taunton; N. Bartlett, pp. 233-234",ND-0004-Ahmun.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,ND.0004,ND.0004,5,,
5989,ND.0003,1900-1905,0,Unprovoked,USA,North Carolina,Ocracoke Inlet,Swimming,Coast Guard personnel,M,...,,"F. Schwartz, p.23; C. Creswell, GSAF",ND-0003-Ocracoke_1900-1905.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,ND.0003,ND.0003,4,,
5990,ND.0002,1883-1889,0,Unprovoked,PANAMA,,"Panama Bay 8�N, 79�W",,Jules Patterson,M,...,,"The Sun, 10/20/1938",ND-0002-JulesPatterson.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,ND.0002,ND.0002,3,,


In [4]:
# display column names

df.columns

Index(['Case Number', 'Date', 'Year', 'Type', 'Country', 'Area', 'Location',
       'Activity', 'Name', 'Sex ', 'Age', 'Injury', 'Fatal (Y/N)', 'Time',
       'Species ', 'Investigator or Source', 'pdf', 'href formula', 'href',
       'Case Number.1', 'Case Number.2', 'original order', 'Unnamed: 22',
       'Unnamed: 23'],
      dtype='object')

In [5]:
# display column types

df.dtypes

Case Number               object
Date                      object
Year                       int64
Type                      object
Country                   object
Area                      object
Location                  object
Activity                  object
Name                      object
Sex                       object
Age                       object
Injury                    object
Fatal (Y/N)               object
Time                      object
Species                   object
Investigator or Source    object
pdf                       object
href formula              object
href                      object
Case Number.1             object
Case Number.2             object
original order             int64
Unnamed: 22               object
Unnamed: 23               object
dtype: object

In [6]:
# display missing cell count per column

df.isna().sum()

Case Number                  0
Date                         0
Year                         0
Type                         0
Country                     43
Area                       402
Location                   496
Activity                   527
Name                       200
Sex                        567
Age                       2681
Injury                      27
Fatal (Y/N)                 19
Time                      3213
Species                   2934
Investigator or Source      15
pdf                          0
href formula                 1
href                         3
Case Number.1                0
Case Number.2                0
original order               0
Unnamed: 22               5991
Unnamed: 23               5990
dtype: int64

In [7]:
# Unnamed 22 and 23 can be dropped for sure

df = df.drop(['Unnamed: 22', 'Unnamed: 23'],axis=1)

In [8]:
# Case numbers seem to be duplicates

df = df.drop(['Case Number.1', 'Case Number.2'],axis=1)

In [9]:
# Check if href formula and href are duplicates

print(df.loc[0,"href formula"])
print(df.loc[0,"href"])

# Create column showing whether they are duplicate

df["href_equal"] = df["href formula"] == df["href"]

http://sharkattackfile.net/spreadsheets/pdf_directory/2016.09.18.c-NSB.pdf
http://sharkattackfile.net/spreadsheets/pdf_directory/2016.09.18.c-NSB.pdf


In [10]:
# Check number of href duplicate

df["href_equal"].value_counts()

True     5938
False      54
Name: href_equal, dtype: int64

In [11]:
# Compare all unequal rows

href_unequal_df = df.loc[df["href_equal"]==False, ["href formula","href"]]

In [12]:
href_unequal_df

Unnamed: 0,href formula,href
20,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...
27,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...
61,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...
107,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...
114,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...
134,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...
180,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...
193,http://sharkattackfile.net/spreadsheets/pdf_di...,
232,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...
262,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...


In [13]:
# Look at the columns values, do they link to the same object?

print(href_unequal_df.loc[27,"href formula"])
print(href_unequal_df.loc[27,"href"])

http://sharkattackfile.net/spreadsheets/pdf_directory/2016.07.23.a-Cutbirth.pdf
http://sharkattackfile.net/spreadsheets/pdf_directory/2016.07.23-Cutbirth.pdf


In [14]:
# Saw a NaN in the second column. Count for NaNs

href_unequal_df.isna().sum()


href formula    1
href            3
dtype: int64

In [15]:
# Copy href to href formula where href formula is NaN. Drop href column.

df.loc[pd.isna(df["href formula"]), ["href formula"]] = df["href"]


In [16]:
# Check if it worked

df["href formula"].isna().sum()

0

In [17]:
# Drop href column and href_equal helper column, rename href formula column

df = df.drop(["href","href_equal"],axis=1)

df = df.rename(columns={"href formula":"href"})

In [18]:
df

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,Age,Injury,Fatal (Y/N),Time,Species,Investigator or Source,pdf,href,original order
0,2016.09.18.c,18-Sep-16,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,male,M,16,Minor injury to thigh,N,13h00,,"Orlando Sentinel, 9/19/2016",2016.09.18.c-NSB.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,5993
1,2016.09.18.b,18-Sep-16,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,Chucky Luciano,M,36,Lacerations to hands,N,11h00,,"Orlando Sentinel, 9/19/2016",2016.09.18.b-Luciano.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,5992
2,2016.09.18.a,18-Sep-16,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,male,M,43,Lacerations to lower leg,N,10h43,,"Orlando Sentinel, 9/19/2016",2016.09.18.a-NSB.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,5991
3,2016.09.17,17-Sep-16,2016,Unprovoked,AUSTRALIA,Victoria,Thirteenth Beach,Surfing,Rory Angiolella,M,,Struck by fin on chest & leg,N,,,"The Age, 9/18/2016",2016.09.17-Angiolella.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,5990
4,2016.09.15,16-Sep-16,2016,Unprovoked,AUSTRALIA,Victoria,Bells Beach,Surfing,male,M,,No injury: Knocked off board by shark,N,,2 m shark,"The Age, 9/16/2016",2016.09.16-BellsBeach.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,5989
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5987,ND.0005,Before 1903,0,Unprovoked,AUSTRALIA,Western Australia,Roebuck Bay,Diving,male,M,,FATAL,Y,,,"H. Taunton; N. Bartlett, p. 234",ND-0005-RoebuckBay.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,6
5988,ND.0004,Before 1903,0,Unprovoked,AUSTRALIA,Western Australia,,Pearl diving,Ahmun,M,,FATAL,Y,,,"H. Taunton; N. Bartlett, pp. 233-234",ND-0004-Ahmun.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,5
5989,ND.0003,1900-1905,0,Unprovoked,USA,North Carolina,Ocracoke Inlet,Swimming,Coast Guard personnel,M,,FATAL,Y,,,"F. Schwartz, p.23; C. Creswell, GSAF",ND-0003-Ocracoke_1900-1905.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,4
5990,ND.0002,1883-1889,0,Unprovoked,PANAMA,,"Panama Bay 8�N, 79�W",,Jules Patterson,M,,FATAL,Y,,,"The Sun, 10/20/1938",ND-0002-JulesPatterson.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,3


In [19]:
# Check what pdf column is

print(df.loc[0,"pdf"])
print(df.loc[0,"href"])

2016.09.18.c-NSB.pdf
http://sharkattackfile.net/spreadsheets/pdf_directory/2016.09.18.c-NSB.pdf


In [20]:
# It's just another variation of href. drop pdf.

df = df.drop(columns="pdf",index=1)

In [21]:
# Check if some case numbers are duplicated

df["Case Number"].value_counts().gt(1)

1980.07.00       True
1962.06.11.b     True
1966.12.26       True
1923.00.00.a     True
2014.08.02       True
                ...  
1878.11.17      False
1992.11.12.R    False
1903.00.00.b    False
1983.08.06      False
1908.06.02.R    False
Name: Case Number, Length: 5975, dtype: bool

In [22]:
# Look into duplicated case number rows

dupdf = df[df.duplicated(['Case Number'])]
dupdf

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,Age,Injury,Fatal (Y/N),Time,Species,Investigator or Source,href,original order
301,2014.08.02,02-Aug-14,2014,Unprovoked,USA,Florida,"Table Beach, Brevard County",Boogie boarding,Christian Sanhueza,M,8,Laceration to ankle,N,13h00,,"Florida Today, 8/2/2014",http://sharkattackfile.net/spreadsheets/pdf_di...,5691
393,2013.10.05,10-Oct-13,2013,Unprovoked,USA,Florida,"Destin, Okaloosa County",Wading,Zachary Tyke Standridge,M,12,Lacerations to right forearm,N,15h30,Small bull shark,"Monroe County Advocate, 10/9/2013",http://sharkattackfile.net/spreadsheets/pdf_di...,5600
524,2012.09.02.b,02-Sep-12,2012,Provoked,USA,Hawaii,"Spreckelsville, Maui",Spearfishing,M. Malabon,,,Minor laceration to hand PROVOKED INCIDENT,N,12h00,"Tiger shark, 10' to 12'",HawaiiNow.com,http://sharkattackfile.net/spreadsheets/pdf_di...,5469
841,2009.12.18,18-Dec-09,2009,Invalid,SOUTH AFRICA,KwaZulu-Natal,"North Beach, Durban",Surfing,Lance Morris,M,,Minor lacerations to left leg. nitially report...,N,,No shark involvement,"M. Addison, C. Eckstander, GSAF",http://sharkattackfile.net/spreadsheets/pdf_di...,5152
1213,2006.09.02,02-Sep-06,2006,Unprovoked,SOUTH AFRICA,Western Cape Province,Noordhoek,Surfing,Steven Harcourt-Wood,M,37,"No injury, shark rammed surfboard",N,,"White shark, 3.5m","Cape Times, 9/3/2006",http://sharkattackfile.net/spreadsheets/pdf_di...,4780
1376,2005.04.06,06-Apr-05,2005,Invalid,HONDURAS,Bay Islands,Utila,SCUBA Diving,female,F,,"Laceration on siide of calf, small laceration ...",N,,Shark involvement not confirmed,"J. Engel, SRI & S. Fox, Deep Blue",http://sharkattackfile.net/spreadsheets/pdf_di...,4617
2410,1990.05.10,10-May-90,1990,Unprovoked,AUSTRALIA,Queensland,Outer Barrier Reef near Port Douglas,"Snorkeling, possibly holding a fish",German male,M,30s,Lacerations,N,,2 m hammerhead,"Courier-Mail, 5/11/1990, p.1",http://sharkattackfile.net/spreadsheets/pdf_di...,3584
2713,1983.06.15,15-Jun-83,1983,Invalid,ITALY,Northwest Italy,Riomaggiore (Ligura),Scuba diving,Roberto Piaviali,M,,"No injury, shark ""harassed"" him at depth of 5 m",N,,3 m [10'] white shark,MEDSAF,http://sharkattackfile.net/spreadsheets/pdf_di...,3280
2838,1980.07.00,Early Jul-1980,1980,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,Tom Durrance,M,,Lower leg bitten,N,,,"Daytona Beach Morning Journal, 7/9/1980",http://sharkattackfile.net/spreadsheets/pdf_di...,3155
3318,1966.12.26,26-Dec-66,1966,Unprovoked,AUSTRALIA,New South Wales,Coogee,Spearfishing,David Jensen,M,29,Right leg bitten,N,,1.8 m [6'] shark,"Sun (Sydney), 12/29/1966; H.D. Baldridge, p.137",http://sharkattackfile.net/spreadsheets/pdf_di...,2675


In [23]:
# Check if any rows in the whole df are duplicates

df.duplicated().value_counts()

False    5991
dtype: int64

In [24]:
# Compare with number of duplicate case number rows

df.duplicated(['Case Number']).value_counts()

False    5975
True       16
dtype: int64

In [25]:
# Create list of duplicate case number columns

dup_case_no_list = list(df["Case Number"].value_counts().gt(1).index)
lis = dup_case_no_list

In [26]:
# Check if they are very similar, check one random item 6

df.loc[df['Case Number'] == lis[6]]

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,Age,Injury,Fatal (Y/N),Time,Species,Investigator or Source,href,original order
5207,1907.10.16.R,Reported 16-Oct-1907,1907,Unprovoked,CHINA,Hong Kong,"Sharp Peak, Sai Kung Peninsula, New Territories",Fishing,fishermen,M,,"3 of thel 5 were injured, one of whom lost bot...",N,,Shark involvement probable,"Dawson Daily News, 11/20/1907",http://sharkattackfile.net/spreadsheets/pdf_di...,786
5208,1907.10.16.R,Reported 16-Oct-1907,1907,Unprovoked,CHINA,Hong Kong,"Sharp Peak, Sai Kung Peninsula, New Territories",Fishing,fishermen,M,,2 of the 5 fishermen were so seriously injure...,Y,,Shark involvement probable,"Dawson Daily News, 11/20/1907",http://sharkattackfile.net/spreadsheets/pdf_di...,785


In [27]:
# Duplicate case numbers are not duplicate caes. Since case numbers just seem to be a variation of date, they can be dropped. 

df = df.drop(columns="Case Number")

In [28]:
# Check out date column values

lis = list(df["Date"].value_counts().index)

In [29]:
# Replace words such as reported, early and summer, mid, before

removelist = ["Reported", "reported", "Early", "early", "Summer", "summer", "mid", "Mid", "before", "Before"]

for i in removelist:
    df["Date"] = df["Date"].str.replace(i,"")

In [30]:
df["Date"].value_counts()

1957              11
1942               9
1956               8
1958               7
1941               7
                  ..
29-Dec-63          1
 30-March-1878     1
12-30-1980         1
12-Feb-88          1
 28-Mar-1928       1
Name: Date, Length: 5124, dtype: int64

In [31]:
# Furthermore, merge date and year columns into one uniform date format columns.

df['Date'] = pd.to_datetime(df['Date'], errors = "coerce")

In [32]:
# Check datetime

list(df["Date"].value_counts().index)

[Timestamp('1958-01-01 00:00:00'),
 Timestamp('1957-01-01 00:00:00'),
 Timestamp('1942-01-01 00:00:00'),
 Timestamp('1956-01-01 00:00:00'),
 Timestamp('1950-01-01 00:00:00'),
 Timestamp('1941-01-01 00:00:00'),
 Timestamp('1960-01-01 00:00:00'),
 Timestamp('1898-01-01 00:00:00'),
 Timestamp('1959-01-01 00:00:00'),
 Timestamp('2060-10-01 00:00:00'),
 Timestamp('2056-08-01 00:00:00'),
 Timestamp('1949-01-01 00:00:00'),
 Timestamp('1876-01-01 00:00:00'),
 Timestamp('1952-01-01 00:00:00'),
 Timestamp('1954-01-01 00:00:00'),
 Timestamp('1971-01-01 00:00:00'),
 Timestamp('1955-01-01 00:00:00'),
 Timestamp('2001-04-12 00:00:00'),
 Timestamp('1940-01-01 00:00:00'),
 Timestamp('1961-01-01 00:00:00'),
 Timestamp('2003-10-05 00:00:00'),
 Timestamp('1938-01-01 00:00:00'),
 Timestamp('1995-07-28 00:00:00'),
 Timestamp('1911-01-01 00:00:00'),
 Timestamp('1948-01-01 00:00:00'),
 Timestamp('2060-04-01 00:00:00'),
 Timestamp('1890-01-01 00:00:00'),
 Timestamp('1998-01-01 00:00:00'),
 Timestamp('1970-01-

In [33]:
# Check how many could not be converted.

df["Date"].isnull().sum()

272

In [34]:
# Drop year column, as it overlaps with Date column

df = df.drop(columns="Year")

In [35]:
# Check type column.

df["Type"].value_counts()

Unprovoked      4385
Provoked         557
Invalid          519
Sea Disaster     220
Boat             200
Boating          110
Name: Type, dtype: int64

In [36]:
# Merge Boat and Boating to Boat Type Attacks. Other types seem ok

df.loc[df["Type"] == "Boating", ["Type"]] = "Boat"


In [37]:
## Next, look at country, area and location

list(df["Country"].value_counts().index)

['USA',
 'AUSTRALIA',
 'SOUTH AFRICA',
 'PAPUA NEW GUINEA',
 'NEW ZEALAND',
 'BRAZIL',
 'BAHAMAS',
 'MEXICO',
 'ITALY',
 'FIJI',
 'PHILIPPINES',
 'REUNION',
 'NEW CALEDONIA',
 'MOZAMBIQUE',
 'CUBA',
 'SPAIN',
 'INDIA',
 'EGYPT',
 'CROATIA',
 'JAPAN',
 'PANAMA',
 'SOLOMON ISLANDS',
 'IRAN',
 'GREECE',
 'HONG KONG',
 'JAMAICA',
 'FRENCH POLYNESIA',
 'INDONESIA',
 'ENGLAND',
 'PACIFIC OCEAN',
 'ATLANTIC OCEAN',
 'BERMUDA',
 'TONGA',
 'VANUATU',
 'VIETNAM',
 'MARSHALL ISLANDS',
 'SRI LANKA',
 'FRANCE',
 'TURKEY',
 'IRAQ',
 'SOUTH ATLANTIC OCEAN',
 'COSTA RICA',
 'SENEGAL',
 'VENEZUELA',
 'CANADA',
 'UNITED KINGDOM',
 'NEW GUINEA',
 'KENYA',
 'COLUMBIA',
 'TAIWAN',
 'SOUTH KOREA',
 'SCOTLAND',
 'ECUADOR',
 'SIERRA LEONE',
 'TANZANIA',
 'CARIBBEAN SEA',
 'CHILE',
 'DOMINICAN REPUBLIC',
 'NORTH PACIFIC OCEAN',
 'YEMEN ',
 'ISRAEL',
 'INDIAN OCEAN',
 'SAMOA',
 'MAURITIUS',
 'MADAGASCAR',
 'SEYCHELLES',
 'NICARAGUA',
 'KIRIBATI',
 'SOMALIA',
 'SINGAPORE',
 'CHINA',
 'THAILAND',
 'NEW BRITAIN',


In [38]:
## Country: Uppercase all, remove question marks, remove whitespace at beginning and end

df['Country'] = df['Country'].str.upper() 
df['Country'] = df['Country'].str.replace("?","")
df['Country'] = df['Country'].str.lstrip().str.rstrip()

In [39]:
list(df["Country"].value_counts().index)

['USA',
 'AUSTRALIA',
 'SOUTH AFRICA',
 'PAPUA NEW GUINEA',
 'NEW ZEALAND',
 'BRAZIL',
 'BAHAMAS',
 'MEXICO',
 'ITALY',
 'FIJI',
 'PHILIPPINES',
 'REUNION',
 'NEW CALEDONIA',
 'MOZAMBIQUE',
 'CUBA',
 'SPAIN',
 'EGYPT',
 'INDIA',
 'CROATIA',
 'PANAMA',
 'JAPAN',
 'IRAN',
 'SOLOMON ISLANDS',
 'GREECE',
 'HONG KONG',
 'JAMAICA',
 'FRENCH POLYNESIA',
 'INDONESIA',
 'PACIFIC OCEAN',
 'ENGLAND',
 'TONGA',
 'ATLANTIC OCEAN',
 'BERMUDA',
 'VIETNAM',
 'VANUATU',
 'SRI LANKA',
 'MARSHALL ISLANDS',
 'FRANCE',
 'COSTA RICA',
 'IRAQ',
 'TURKEY',
 'SOUTH ATLANTIC OCEAN',
 'VENEZUELA',
 'SENEGAL',
 'CANADA',
 'KENYA',
 'UNITED KINGDOM',
 'NEW GUINEA',
 'YEMEN',
 'SIERRA LEONE',
 'COLUMBIA',
 'TAIWAN',
 'SOUTH KOREA',
 'ECUADOR',
 'SCOTLAND',
 'SEYCHELLES',
 'TANZANIA',
 'CARIBBEAN SEA',
 'INDIAN OCEAN',
 'CHILE',
 'DOMINICAN REPUBLIC',
 'MADAGASCAR',
 'SAMOA',
 'MAURITIUS',
 'NICARAGUA',
 'NORTH PACIFIC OCEAN',
 'ISRAEL',
 'THAILAND',
 'KIRIBATI',
 'CHINA',
 'SOMALIA',
 'NEW BRITAIN',
 'OKINAWA',
 'S

In [40]:
# Area column: remove whitespaces at beginning and end, remove �, Capitalise all words



df['Area'] = df['Area'].str.title() 
df['Area'] = df['Area'].str.lstrip().str.rstrip()
df['Area'] = df['Area'].str.replace("�","")
df['Area'] = df['Area'].str.replace("?","")




In [41]:
list(df["Area"].value_counts().index)

['Florida',
 'New South Wales',
 'Queensland',
 'Hawaii',
 'California',
 'Kwazulu-Natal',
 'Western Cape Province',
 'Western Australia',
 'Eastern Cape Province',
 'South Carolina',
 'North Carolina',
 'South Australia',
 'Victoria',
 'Torres Strait',
 'Texas',
 'Pernambuco',
 'North Island',
 'New Jersey',
 'Tasmania',
 'South Island',
 'New York',
 'Oregon',
 'Northern Territory',
 'Central Province',
 'Abaco Islands',
 'Virginia',
 'Havana Province',
 'South Province',
 'Veracruz',
 'Puerto Rico',
 'Gaza',
 'New Ireland Province',
 'Madang Province',
 'Alabama',
 'Khuzestan Province',
 'Georgia',
 'Guerrero',
 'Primorje-Gorski Kotar County',
 'North Province',
 'Quintana Roo',
 'Adriatic Sea',
 'Luzon Island',
 'Tyrrhenian Sea',
 'Mediterranean Sea',
 'Louisiana',
 'Rio De Janeiro',
 'Massachusetts',
 'Malampa Province',
 'Grand Bahama Island',
 'Baja California',
 'Red Sea',
 'Basrah',
 'Society Islands',
 'Bougainville (North Solomons)',
 'Eastern Province',
 'Morobe Province',


In [42]:
# Same for location column

list(df["Location"].value_counts().index)

df['Location'] = df['Location'].str.title() 
df['Location'] = df['Location'].str.replace("�","")
df['Location'] = df['Location'].str.lstrip().str.rstrip()



In [43]:
# Check activity column

list(df["Activity"].value_counts().index)


['Surfing',
 'Swimming',
 'Fishing',
 'Spearfishing',
 'Bathing',
 'Wading',
 'Diving',
 'Standing',
 'Snorkeling',
 'Scuba diving',
 'Body boarding',
 'Body surfing',
 'Swimming ',
 'Pearl diving',
 'Treading water',
 'Kayaking',
 'Boogie boarding',
 'Free diving',
 'Fell overboard',
 'Windsurfing',
 'Boogie Boarding',
 'Shark fishing',
 'Walking',
 'Surf-skiing',
 'Rowing',
 'Floating',
 'Surf skiing',
 'Fishing ',
 'Canoeing',
 'Surf fishing',
 'Fishing for sharks',
 'Kayak Fishing',
 'Sponge diving',
 'Freediving',
 'Diving for trochus',
 'Sitting on surfboard',
 'Sailing',
 'Sea disaster',
 'Fell into the water',
 'Surfing (sitting on his board)',
 'Skindiving',
 'Spearfishing ',
 'Floating on his back',
 'Playing',
 'Boating',
 'Free diving for abalone',
 'Paddle boarding',
 'Diving for abalone',
 'Murder',
 'Spearfishing on Scuba',
 'Kite Surfing',
 'Surf skiing ',
 'Sea Disaster',
 'Fishing for mackerel',
 'Dangling feet in the water',
 'Paddleskiing',
 'Hard hat diving',
 'Fre

In [44]:
# Rename sex column

df = df.rename(columns={"Sex ": "Sex"})

In [45]:
# make Sex column to only M and F
df.loc[df["Sex"] == "M ", ["Sex"]] = "M"
df.loc[df["Sex"] == ".", ["Sex"]] = ""
df.loc[df["Sex"] == "lli", ["Sex"]] = ""
df.loc[df["Sex"] == "N", ["Sex"]] = ""
df.loc[df["Sex"] == "", ["Sex"]] = np.nan


In [46]:
# Make sex column binary

df["Sex"] = df["Sex"].str.replace("M","1").str.replace("F","0")

df = df.rename(columns={"Sex": "Sex_Male"})

In [47]:
# Check injury column

list(df["Injury"].value_counts().index)

['FATAL',
 'Survived',
 'Foot bitten',
 'No injury',
 'Leg bitten',
 'Left foot bitten',
 'No details',
 'Right foot bitten',
 'Hand bitten',
 'Thigh bitten',
 'No injury, board bitten',
 'FATAL, body not recovered',
 'Minor injury',
 'Foot lacerated',
 'Calf bitten',
 'Right leg bitten',
 'Arm bitten',
 'Lacerations to foot',
 'Ankle bitten',
 'Right calf bitten',
 'Lacerations to right foot',
 'Lacerations to left foot',
 'No injury to occupants',
 'Heel bitten',
 'Left leg bitten',
 'No injury, surfboard bitten',
 'Left arm bitten',
 'Foot severed',
 'Right thigh bitten',
 'Leg lacerated',
 'Leg severed',
 'Minor injuries',
 'Left calf bitten',
 'FATAL, leg severed ',
 'Leg injured',
 'Thigh lacerated',
 'Legs bitten',
 'Lacerations to right leg',
 'Lacerations to leg',
 'Probable drowning & scavenging',
 'Left foot lacerated',
 'Lacerations to lower leg',
 'Lacerations to right hand',
 'Laceration to left foot',
 'Left hand bitten',
 'Right foot lacerated',
 'Puncture wounds to foo

In [48]:
# Check age column values

list(df["Age"].value_counts().index)

['17',
 '18',
 '19',
 '20',
 '15',
 '16',
 '21',
 '22',
 '24',
 '25',
 '14',
 '13',
 '23',
 '26',
 '27',
 '28',
 '29',
 '30',
 '12',
 '32',
 '35',
 '10',
 '40',
 '31',
 '38',
 '34',
 '43',
 '36',
 '33',
 '39',
 '37',
 '42',
 '11',
 '9',
 '52',
 '41',
 '50',
 '45',
 '44',
 '47',
 '8',
 '49',
 '46',
 '48',
 '7',
 '55',
 '51',
 '57',
 '6',
 '60',
 '54',
 '53',
 '58',
 '59',
 '61',
 '63',
 '56',
 '62',
 'Teen',
 '69',
 '30s',
 '5',
 '70',
 '68',
 '20s',
 'teen',
 '65',
 '64',
 '3',
 '77',
 '71',
 '66',
 'young',
 '74',
 '7 or 8',
 'F',
 '75',
 ' ',
 '50s',
 '1',
 '8 or 10',
 '40s',
 '78',
 '73',
 '10 or 12',
 '33 & 37',
 '9 or 10',
 'mid-20s',
 'Elderly',
 '84',
 '20?',
 'mid-30s',
 'M',
 '86',
 '32 & 30',
 '  ',
 'Teens',
 '21, 34,24 & 35',
 '13 or 14',
 'X',
 '23 & 26',
 '>50',
 'adult',
 '46 & 34',
 '81',
 '"young"',
 '33 & 26',
 '33 or 37',
 '16 to 18',
 '"middle-age"',
 '7      &    31',
 '18 or 20',
 '28 & 26',
 '21 or 26',
 '36 & 23',
 '28, 23 & 30',
 '87',
 '2 to 3 months',
 '21 & 

In [49]:
# Objective: keep first instance of repeating digits. For pre-cleaning, remove every string with months value

df.loc[df["Age"].str.contains("month").fillna(False), ["Age"]] = np.nan


In [50]:
# Look at structure of age column

list(df["Age"].value_counts().index)

['17',
 '18',
 '19',
 '20',
 '15',
 '16',
 '21',
 '22',
 '24',
 '25',
 '14',
 '13',
 '26',
 '23',
 '27',
 '28',
 '29',
 '30',
 '12',
 '35',
 '32',
 '10',
 '31',
 '40',
 '38',
 '34',
 '43',
 '36',
 '33',
 '37',
 '39',
 '42',
 '9',
 '11',
 '41',
 '52',
 '45',
 '50',
 '44',
 '47',
 '49',
 '8',
 '48',
 '46',
 '7',
 '55',
 '51',
 '60',
 '6',
 '57',
 '54',
 '53',
 '58',
 '59',
 '61',
 '56',
 '63',
 '69',
 'Teen',
 '62',
 '5',
 '30s',
 '68',
 '70',
 '20s',
 'teen',
 '64',
 '3',
 '65',
 '66',
 '71',
 '77',
 ' ',
 '7 or 8',
 'F',
 '75',
 '40s',
 'young',
 '1',
 '8 or 10',
 '78',
 '50s',
 '74',
 '10 or 12',
 'mid-20s',
 '21, 34,24 & 35',
 '20?',
 'Elderly',
 'mid-30s',
 '32 & 30',
 '84',
 '  ',
 '73',
 'X',
 'Teens',
 '>50',
 '13 or 14',
 '23 & 26',
 '86',
 'M',
 '33 & 37',
 '9 or 10',
 '46 & 34',
 'adult',
 '34 & 19',
 '87',
 '18 to 22',
 '33 & 26',
 '33 or 37',
 '16 to 18',
 '"middle-age"',
 '7      &    31',
 '18 or 20',
 '28 & 26',
 '36 & 23',
 '"young"',
 '28, 23 & 30',
 '21 & ?',
 '2�',
 '

In [51]:
# for every age row without na, create new Age row containing the first instance of consecutive digits

df["newAge"] = 1
df.loc[~df["Age"].isna() , ["newAge"]] = df['Age'].str.extract(r"([0-9]+)", expand=False)

In [52]:
# Check for new Age value counts

df['newAge'].value_counts()

1     2684
17     150
18     147
20     144
19     139
      ... 
87       1
67       1
2        1
86       1
84       1
Name: newAge, Length: 81, dtype: int64

In [53]:
# Compare value count lists

print(list(df['newAge'].value_counts().index))
print(list(df['Age'].value_counts().index))



[1, '17', '18', '20', '19', '16', '15', '21', '22', '25', '24', '14', '13', '30', '23', '26', '28', '27', '29', '12', '32', '35', '40', '10', '31', '38', '34', '33', '43', '36', '37', '9', '39', '42', '11', '50', '41', '52', '45', '8', '44', '47', '49', '46', '48', '7', '55', '51', '60', '6', '57', '54', '53', '58', '61', '59', '63', '56', '69', '62', '5', '70', '68', '3', '65', '64', '71', '77', '66', '75', '74', '1', '78', '73', '81', '72', '87', '67', '2', '86', '84']
['17', '18', '19', '20', '15', '16', '21', '22', '24', '25', '14', '13', '26', '23', '27', '28', '29', '30', '12', '35', '32', '10', '31', '40', '38', '34', '43', '36', '33', '37', '39', '42', '9', '11', '41', '52', '45', '50', '44', '47', '49', '8', '48', '46', '7', '55', '51', '60', '6', '57', '54', '53', '58', '59', '61', '56', '63', '69', 'Teen', '62', '5', '30s', '68', '70', '20s', 'teen', '64', '3', '65', '66', '71', '77', ' ', '7 or 8', 'F', '75', '40s', 'young', '1', '8 or 10', '78', '50s', '74', '10 or 12', 'm

In [54]:
# Check again

df[['newAge','Age']].dropna()

Unnamed: 0,newAge,Age
0,16,16
2,43,43
6,60,60s
7,51,51
8,50,50
...,...,...
5933,16,16
5944,50,50
5955,13,13 or 14
5966,16,16


In [55]:
## Fatal column. Rename and make binary

df = df.rename(columns={"Fatal (Y/N)":"Fatal"})

In [56]:
# Check Fatal column value names

df["Fatal"].value_counts()

N          4314
Y          1552
UNKNOWN      94
 N            8
#VALUE!       1
F             1
n             1
N             1
Name: Fatal, dtype: int64

In [57]:
# make Sex column to only M and F
df.loc[df["Fatal"] == " N", ["Fatal"]] = "N"
df.loc[df["Fatal"] == "N ", ["Fatal"]] = "N"
df.loc[df["Fatal"] == "n", ["Fatal"]] = "N"
df.loc[df["Fatal"] == "F", ["Fatal"]] = "UNKNOWN"
df.loc[df["Fatal"] == "#VALUE!", ["Fatal"]] = "UNKNOWN"

# convert UNKNOWN into NaN
df.loc[df["Fatal"] == "UNKNOWN", ["Fatal"]] = np.nan

# Convert into binary
df.loc[df["Fatal"] == "N", ["Fatal"]] = 0
df.loc[df["Fatal"] == "Y", ["Fatal"]] = 1

In [58]:
# Confirm new Fatal column structure

df["Fatal"].value_counts()

0    4324
1    1552
Name: Fatal, dtype: int64

In [59]:
## Check time column

list(df["Time"].value_counts().index)

df["nTime"] = 1
# Just convert first occurence of digits followed by h into hour
df.loc[~df["Time"].isna() , ["nTime"]] = df['Time'].str.extract(r"([0-9]+)[h]", expand=False)

In [60]:
# Check 

df[['nTime','Time']].dropna()

Unnamed: 0,nTime,Time
0,13,13h00
2,10,10h43
6,15,15h15
7,14,14h30
8,15,15h40
...,...,...
5782,14,14h00
5804,17,17h00-18h00
5811,19,19h00-20h00
5824,13,13h00


In [61]:
# Rename into time_hours

df = df.rename(columns={"nTime":"time_hours"})

In [62]:
# Remove trailing whitespace in Species column name

df = df.rename(columns={"Species ":"Species"})

In [63]:
# Check out species column

list(df["Species"].value_counts().index)

['White shark',
 'Shark involvement not confirmed',
 'Tiger shark',
 'Bull shark',
 "6' shark",
 "4' shark",
 "1.8 m [6'] shark",
 "1.5 m [5'] shark",
 "1.2 m [4'] shark",
 "3' shark",
 "5' shark",
 "4' to 5' shark",
 "3 m [10'] shark",
 '2 m shark',
 'No shark involvement',
 'Wobbegong shark',
 "3' to 4' shark",
 '3 m shark',
 "2.4 m [8'] shark",
 "3.7 m [12'] shark",
 "12' shark",
 'Blue shark',
 'Blacktip shark',
 'Mako shark',
 "1.2 m to 1.5 m [4' to 5'] shark",
 "7' shark",
 '"a small shark"',
 "5 m [16.5'] white shark",
 "10' shark",
 '1.5 m shark',
 "4 m [13'] white shark",
 'Raggedtooth shark',
 'Zambesi shark',
 'Bronze whaler shark',
 "6 m [20'] white shark",
 'Grey nurse shark',
 'Nurse shark',
 "3 m [10'] white shark",
 '1 m shark',
 "White shark, 4 m [13'] ",
 "2.1 m [7'] shark",
 "2' to 3' shark",
 'a small shark',
 'Hammerhead shark',
 '"small shark"',
 'Unidentified species',
 '2.5 m shark',
 "9' shark",
 'Sandtiger shark',
 'Basking shark',
 'Lemon shark',
 "Bull shark

In [64]:
# Convert whole column to lowercase

df["Species"] = df["Species"].str.lower()

In [65]:
# Keep only letters excluding single m

#stop_words = ["thought", "large", "small", "�", "named", "of", "to","lb ", " lb", "m ", " m", "or", "a ", "not","confirmed", "injury", "may", "be", "due", "stringray", "probable", "but", "kg", "possibly", "said", "involve","identity","confirmed", "by", "tooth", "pattern"]
stop_words = [" m ", "m ", "lb"]

df["nSpecies"] = df["Species"]
for i in stop_words:

    df.loc[~df["nSpecies"].isna() , ["nSpecies"]] = df["nSpecies"].str.replace(i,"")

In [66]:
# Remove numbers

df.loc[~df["nSpecies"].isna() , ["nSpecies"]] = df["nSpecies"].str.replace(r"[0-9]+","",regex=True)


In [67]:
# Remove special characters

df.loc[~df["nSpecies"].isna() , ["nSpecies"]] = df["nSpecies"].str.replace(r"[.',?\[\]\"-><]+","",regex=True)


In [68]:
# Remove one and two letter words

df.loc[~df["nSpecies"].isna() , ["nSpecies"]] = df["nSpecies"].str.replace(r"\b[a-zA-Z]{1,2}\b","",regex=True)


In [69]:
# Remove trailing whitespaces

df.loc[~df["nSpecies"].isna() , ["nSpecies"]] = df["nSpecies"].str.strip()

In [70]:
list(df["nSpecies"].value_counts().index)

['shark',
 'white shark',
 'tiger shark',
 'bull shark',
 'shark involvement not confirmed',
 'bronze whaler shark',
 'blacktip shark',
 'nurse shark',
 'small shark',
 'raggedtooth shark',
 'wobbegong shark',
 'mako shark',
 'grey nurse shark',
 'blue shark',
 'hammerhead shark',
 'zambesi shark',
 'lemon shark',
 'shark involvement',
 'sandtiger shark',
 'sand shark',
 'spinner shark',
 'sharks',
 'grey reef shark',
 'reef shark',
 'oceanic whitetip shark',
 'blacktip  spinner shark',
 'caribbean reef shark',
 'blacktip reef shark',
 '',
 'sevengill shark',
 'unidentified species',
 'dusky shark',
 'carpet shark',
 'basking shark',
 'thought  involve  zambesi shark',
 'small sharks',
 'dog shark',
 'blue pointer',
 'copper shark',
 'angel shark',
 'possibly  bull shark',
 'porbeagle shark',
 'juvenile shark',
 'possibly  juvenile blacktip  spinner shark',
 'shark involvement prior  death not confirmed',
 'greycolored shark',
 'gill shark',
 'sandbar shark',
 'small blacktip shark',
 

In [71]:
# List of shark species

shark_list = ["thresher", "grey","small","foot","goblin","silvertip","sevengill", "cocktail","bonita","whale","basket","angel","dog", "leopard","shovelnose", "red","raggedtooth","gummy", "bronze whaler", "galapagos", "sand", "wobbegong", "white ", "tiger", "lemon", "blue", "gray", "hammerhead", "spinner", "brown", "whitetip", "blacktip", "bull", "zambezi", "banjo", "nurse", "blue nose", "salmon", "mako", "blue pointer","porbeagle", "cow", "reef","smoothhound", "dogfish", "dusky"]

# new column
df["mSpecies"] = df["nSpecies"]

# find shark species 
for i in shark_list:
    df.loc[df["nSpecies"].str.contains(i).fillna(False), ["mSpecies"]] = i


# Drop all values not in shark list

df = df[df['mSpecies'].isin(shark_list)]

In [72]:
df = df.drop(columns=["Time","original order","nSpecies","Species"])

In [73]:
df = df.reset_index()

In [74]:
df.dtypes

index                              int64
Date                      datetime64[ns]
Type                              object
Country                           object
Area                              object
Location                          object
Activity                          object
Name                              object
Sex_Male                          object
Age                               object
Injury                            object
Fatal                             object
Investigator or Source            object
href                              object
newAge                            object
time_hours                        object
mSpecies                          object
dtype: object

In [75]:
# Make Sex Male, Age and Fatal to numeric

df['Sex_Male'] = pd.to_numeric(df['Sex_Male'], errors='coerce')
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')
df['Fatal'] = pd.to_numeric(df['Fatal'], errors='coerce')
df['time_hours'] = pd.to_numeric(df['time_hours'], errors='coerce')

In [76]:
# Check numeric columns with describe

df.describe()

Unnamed: 0,index,Sex_Male,Age,Fatal,time_hours
count,1916.0,1749.0,1245.0,1899.0,1751.0
mean,2387.072547,0.889651,28.714859,0.164824,7.556254
std,1477.759363,0.313414,13.512938,0.371119,6.409467
min,7.0,0.0,3.0,0.0,1.0
25%,1191.75,1.0,19.0,0.0,1.0
50%,2296.0,1.0,25.0,0.0,8.0
75%,3508.75,1.0,37.0,0.0,14.0
max,5986.0,1.0,84.0,1.0,23.0


In [77]:
df = df.drop(columns="index")

In [78]:
# display fraction of missing cell count per column of total cells


null_columns = df.isna().sum()
null_columns[null_columns.gt(0)]/len(df)

Date                      0.038100
Country                   0.001566
Area                      0.038622
Location                  0.045407
Activity                  0.049061
Name                      0.016701
Sex_Male                  0.087161
Age                       0.350209
Injury                    0.002088
Fatal                     0.008873
Investigator or Source    0.001566
newAge                    0.002610
time_hours                0.086117
dtype: float64

In [79]:
# Only column with high number of missing values is Age, which is not very important. 
# It is probably still a representative subset.
# Therefore nan rows or column are not dropped

In [80]:
# Rename mSpecies to Species and newAge to Age and time_hours to Time_Hours

df = df.rename(columns={"mSpecies":"Species", "newAge":"Age", "time_hours":"Time_Hours"})

In [81]:
df

Unnamed: 0,Date,Type,Country,Area,Location,Activity,Name,Sex_Male,Age,Injury,Fatal,Investigator or Source,href,Age.1,Time_Hours,Species
0,2016-09-07,Unprovoked,USA,Hawaii,"Makaha, Oahu",Swimming,female,0.0,51.0,Severe lacerations to shoulder & forearm,0.0,"Hawaii News Now, 9/7/2016",http://sharkattackfile.net/spreadsheets/pdf_di...,51,14.0,tiger
1,2016-09-01,Unprovoked,USA,California,"Refugio State Beach, Santa Barbara County",Spearfishing,Tyler McQuillen,1.0,22.0,Two toes broken & lacerated,0.0,"R. Collier, GSAF",http://sharkattackfile.net/spreadsheets/pdf_di...,22,1.0,white
2,2016-08-29,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,Sam Cumiskey,1.0,25.0,Lacerations to right foot,0.0,"News Channel 8, 8/30/16",http://sharkattackfile.net/spreadsheets/pdf_di...,25,15.0,bull
3,2016-08-27,Unprovoked,REUNION,,Boucan Canot,Surfing,Laurent Chardard,1.0,20.0,"Right arm severed, ankle severely bitten",0.0,"LaDepeche, 8/29/2016",http://sharkattackfile.net/spreadsheets/pdf_di...,20,17.0,bull
4,2016-08-06,Unprovoked,USA,Hawaii,Maui,SUP Foil boarding,Connor Baxter,1.0,21.0,"No inury, shark & board collided",0.0,"SUP, 8/9/2015",http://sharkattackfile.net/spreadsheets/pdf_di...,21,1.0,tiger
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1911,NaT,Unprovoked,BELIZE,,,Standing,a servant,1.0,16.0,FATAL,1.0,Mitchell-Hedges,http://sharkattackfile.net/spreadsheets/pdf_di...,16,1.0,tiger
1912,1906-01-01,Unprovoked,AUSTRALIA,,,Fishing,boy,1.0,,"FATAL, knocked overboard by tail of shark & ca...",1.0,"NY Sun, 9/9/1906, referring to account by Loui...",http://sharkattackfile.net/spreadsheets/pdf_di...,1,1.0,blue pointer
1913,1906-01-01,Unprovoked,AUSTRALIA,,,Fishing,fisherman,1.0,,FATAL,1.0,"NY Sun, 9/9/1906, referring to account by Loui...",http://sharkattackfile.net/spreadsheets/pdf_di...,1,1.0,blue pointer
1914,1906-01-01,Unprovoked,AUSTRALIA,,,Fishing,fisherman,1.0,,FATAL,1.0,"NY Sun, 9/9/1906, referring to account by Loui...",http://sharkattackfile.net/spreadsheets/pdf_di...,1,1.0,blue pointer


In [82]:
# Export as csv

df.to_csv("sharkattack_clean.csv")