# Project 1: Shark attacks
## Cleaning and wrangling

In [1]:
import pandas as pd
import numpy as np
import statistics
import re

### 1. Open the document

In [2]:
attacks = pd.read_csv("../data/attacks.csv", encoding = "latin-1")
attacks.head()

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,...,Species,Investigator or Source,pdf,href formula,href,Case Number.1,Case Number.2,original order,Unnamed: 22,Unnamed: 23
0,2018.06.25,25-Jun-2018,2018.0,Boating,USA,California,"Oceanside, San Diego County",Paddling,Julie Wolfe,F,...,White shark,"R. Collier, GSAF",2018.06.25-Wolfe.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.25,2018.06.25,6303.0,,
1,2018.06.18,18-Jun-2018,2018.0,Unprovoked,USA,Georgia,"St. Simon Island, Glynn County",Standing,Adyson McNeely,F,...,,"K.McMurray, TrackingSharks.com",2018.06.18-McNeely.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.18,2018.06.18,6302.0,,
2,2018.06.09,09-Jun-2018,2018.0,Invalid,USA,Hawaii,"Habush, Oahu",Surfing,John Denges,M,...,,"K.McMurray, TrackingSharks.com",2018.06.09-Denges.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.09,2018.06.09,6301.0,,
3,2018.06.08,08-Jun-2018,2018.0,Unprovoked,AUSTRALIA,New South Wales,Arrawarra Headland,Surfing,male,M,...,2 m shark,"B. Myatt, GSAF",2018.06.08-Arrawarra.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.08,2018.06.08,6300.0,,
4,2018.06.04,04-Jun-2018,2018.0,Provoked,MEXICO,Colima,La Ticla,Free diving,Gustavo Ramos,M,...,"Tiger shark, 3m",A .Kipper,2018.06.04-Ramos.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.04,2018.06.04,6299.0,,


### 2. Exploring data

#### 2.1. Size of the dataset

Let's take a look at the size of the dataset:

In [3]:
attacks.shape

(25723, 24)

There are 25723 records of shark attacks in our dataset. In each record, 24 characteristics were saved. Which are these 24 characteristics?

In [4]:
attacks.columns

Index(['Case Number', 'Date', 'Year', 'Type', 'Country', 'Area', 'Location',
       'Activity', 'Name', 'Sex ', 'Age', 'Injury', 'Fatal (Y/N)', 'Time',
       'Species ', 'Investigator or Source', 'pdf', 'href formula', 'href',
       'Case Number.1', 'Case Number.2', 'original order', 'Unnamed: 22',
       'Unnamed: 23'],
      dtype='object')

#### 2.2. Duplicates
There could be some duplicates which should be eliminated from the dataset.

In [5]:
attacks = attacks.drop_duplicates()
attacks.shape

(6312, 24)

There were 19411 duplicates, which included empty rows.

#### 2.2. Empty rows and columns

**Is there any record (row) empty?**

In [6]:
emptyrows = attacks.isnull().sum(axis=1) #count number of empty data in each row
print(sum(emptyrows ==24)) #counts number of empty rows
print(sum(emptyrows ==23)) #counts number of rows with only one data
print(sum(emptyrows ==22)) #counts number of rows with only two data

1
2
7


We have eliminated all empty rows except the first one. There are also records with only one or two data that doesn't help us to solve the hypothesis. Let's check which data are completed in those records in which we only have two data:

In [7]:
attacks[emptyrows==22]

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,...,Species,Investigator or Source,pdf,href formula,href,Case Number.1,Case Number.2,original order,Unnamed: 22,Unnamed: 23
6302,0,,,,,,,,,,...,,,,,,,,6304.0,,
6303,0,,,,,,,,,,...,,,,,,,,6305.0,,
6304,0,,,,,,,,,,...,,,,,,,,6306.0,,
6305,0,,,,,,,,,,...,,,,,,,,6307.0,,
6306,0,,,,,,,,,,...,,,,,,,,6308.0,,
6307,0,,,,,,,,,,...,,,,,,,,6309.0,,
6308,0,,,,,,,,,,...,,,,,,,,6310.0,,


No useful information can be extracted from these rows either so, let's delete the rows that have 2 or less data.

In [8]:
isemptyrow = [i for i in attacks.index if emptyrows[i] >= 22] #list with the indexes of the empty rows
attacks = attacks.drop(isemptyrow) #df with at least three data for each record
attacks.shape

(6302, 24)

We have reduced the dataset to 6302 records.

**How many nulls do we have in each column?**

In [9]:
attacks.isnull().sum()

Case Number                  1
Date                         0
Year                         2
Type                         4
Country                     50
Area                       455
Location                   540
Activity                   544
Name                       210
Sex                        565
Age                       2831
Injury                      28
Fatal (Y/N)                539
Time                      3354
Species                   2838
Investigator or Source      17
pdf                          0
href formula                 1
href                         0
Case Number.1                0
Case Number.2                0
original order               0
Unnamed: 22               6301
Unnamed: 23               6300
dtype: int64

#### 2.3 Data wrangling of columns "Unnamed:22", "Unnamed:23", "pdf", "href formula" and "href"
* We don't have any data in columns "Unnamed: 22" and "Unnamed: 23" as the number of nulls is pretty much the same number of rows.
* No useful information can be got from the "pdf", "href formula" and "href" columns.

Let's delete them all.

In [10]:
attacks = attacks.drop(columns = ["Unnamed: 22", "Unnamed: 23", "pdf", "href formula", "href"])

#### 2.5. Data wrangling of columns "Case Number", "Case Number.1" and "Case Number.2"
Columns named "Case Number", "Case Number.1" and "Case Number.2" seem to have the same values.

In [11]:
attacks[["Case Number","Case Number.1", "Case Number.2"]]

Unnamed: 0,Case Number,Case Number.1,Case Number.2
0,2018.06.25,2018.06.25,2018.06.25
1,2018.06.18,2018.06.18,2018.06.18
2,2018.06.09,2018.06.09,2018.06.09
3,2018.06.08,2018.06.08,2018.06.08
4,2018.06.04,2018.06.04,2018.06.04
...,...,...,...
6297,ND.0005,ND.0005,ND.0005
6298,ND.0004,ND.0004,ND.0004
6299,ND.0003,ND.0003,ND.0003
6300,ND.0002,ND.0002,ND.0002


Let's see how many are different:

In [12]:
diffrows = [i for i in range(attacks.shape[0])
    if attacks["Case Number"][i] != attacks["Case Number.1"][i] 
    or attacks["Case Number"][i] != attacks["Case Number.2"][i]]

print("Different rows:", len(diffrows))

Different rows: 24


In [13]:
attacks[["Case Number","Case Number.1", "Case Number.2"]].iloc[diffrows]

Unnamed: 0,Case Number,Case Number.1,Case Number.2
34,2018.04.03,2018.04.02,2018.04.03
117,2017.07.20.a,2017/07.20.a,2017.07.20.a
144,2017.05.06,2017.06.06,2017.05.06
217,2016.09.15,2016.09.16,2016.09.15
314,2016.01.24.b,2015.01.24.b,2016.01.24.b
334,2015.12.23,2015.11.07,2015.12.23
339,2015.10.28.a,2015.10.28,2015.10.28.a
390,2015.07-10,2015.07.10,2015.07.10
560,2014.05.04,2013.05.04,2014.05.04
3522,1967.07.05,1967/07.05,1967.07.05


We are going to add another column to the dataframe with the standarized case number. We will fill it with the original Case Number in case the three columns have the same value, and with the mode in case they don't.

In [14]:
attacks["Case Number st."]=""
for i in attacks.index:
    if i in diffrows:
        #print(i)
        attacks["Case Number st."][i]=statistics.mode(attacks[["Case Number","Case Number.1", "Case Number.2"]].iloc[i])
    else:
        attacks["Case Number st."][i]=attacks["Case Number"][i]
        
attacks[["Case Number st."]].head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  import sys
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """


Unnamed: 0,Case Number st.
0,2018.06.25
1,2018.06.18
2,2018.06.09
3,2018.06.08
4,2018.06.04


Finnaly, we are going to keep this "Case Number st." column, and delete the other three.

In [15]:
attacks=attacks.drop(["Case Number", "Case Number.1", "Case Number.2"], axis=1)
attacks.head()

Unnamed: 0,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,Age,Injury,Fatal (Y/N),Time,Species,Investigator or Source,original order,Case Number st.
0,25-Jun-2018,2018.0,Boating,USA,California,"Oceanside, San Diego County",Paddling,Julie Wolfe,F,57.0,"No injury to occupant, outrigger canoe and pad...",N,18h00,White shark,"R. Collier, GSAF",6303.0,2018.06.25
1,18-Jun-2018,2018.0,Unprovoked,USA,Georgia,"St. Simon Island, Glynn County",Standing,Adyson McNeely,F,11.0,Minor injury to left thigh,N,14h00 -15h00,,"K.McMurray, TrackingSharks.com",6302.0,2018.06.18
2,09-Jun-2018,2018.0,Invalid,USA,Hawaii,"Habush, Oahu",Surfing,John Denges,M,48.0,Injury to left lower leg from surfboard skeg,N,07h45,,"K.McMurray, TrackingSharks.com",6301.0,2018.06.09
3,08-Jun-2018,2018.0,Unprovoked,AUSTRALIA,New South Wales,Arrawarra Headland,Surfing,male,M,,Minor injury to lower leg,N,,2 m shark,"B. Myatt, GSAF",6300.0,2018.06.08
4,04-Jun-2018,2018.0,Provoked,MEXICO,Colima,La Ticla,Free diving,Gustavo Ramos,M,,Lacerations to leg & hand shark PROVOKED INCIDENT,N,,"Tiger shark, 3m",A .Kipper,6299.0,2018.06.04


#### 2.6. Data wrangling of column "Date"

Column "Date" and the new column "Case Number st." seem to have the same values, although in different format.

In [16]:
attacks[["Date", "Case Number st."]].head(10)

Unnamed: 0,Date,Case Number st.
0,25-Jun-2018,2018.06.25
1,18-Jun-2018,2018.06.18
2,09-Jun-2018,2018.06.09
3,08-Jun-2018,2018.06.08
4,04-Jun-2018,2018.06.04
5,03-Jun-2018,2018.06.03.b
6,03-Jun-2018,2018.06.03.a
7,27-May-2018,2018.05.27
8,26-May-2018,2018.05.26.b
9,26-May-2018,2018.05.26.a


**Changing the format of attacks["Date"]**

Let's change the format of column "Date" to compare them.

Function definition:

In [17]:
def dateformat (date):
    [dd, mm, yyyy] = date.split('-')
    months1=["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sept", "Oct", "Nov", "Dec"]
    months2=["01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12"]
    for m1,m2 in zip(months1, months2):
        if mm == m1: mm=m2
    return '.'.join([yyyy,mm,dd])

Application of the function to all elements in column "Date".

In [18]:
for i in attacks.index:
    original = attacks["Date"][i]  
    try:
        attacks["Date"][i] = dateformat(original) # We are rewritting the column
    except:
        pass

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  after removing the cwd from sys.path.


**Comparing values between column "Date" and column "Case Number st."**

Now they have the same format, let's compare them.

In [19]:
attacks[["Date", "Case Number st."]].head(10)

Unnamed: 0,Date,Case Number st.
0,2018.06.25,2018.06.25
1,2018.06.18,2018.06.18
2,2018.06.09,2018.06.09
3,2018.06.08,2018.06.08
4,2018.06.04,2018.06.04
5,2018.06.03,2018.06.03.b
6,2018.06.03,2018.06.03.a
7,2018.05.27,2018.05.27
8,2018.05.26,2018.05.26.b
9,2018.05.26,2018.05.26.a


Let's merge the values in a new column named *Date std.*:
* In case values in column *Date* have the format yyyy.mm.dd, we'll keep them. 
* If *Date* has not the correct format, we'll check "Case Number st." and we'll keep this last value.

In [21]:
attacks["Date std."] = ""
for i in attacks.index:
    if re.search(r"\d{4}.\d{2}.\d{2}", attacks["Date"][i]):
        attacks["Date std."][i] = re.search(r"\d{4}.\d{2}.\d{2}", attacks["Date"][i]).group()
    elif re.search(r"\d{4}.\d{2}.\d{2}", attacks["Case Number st."][i]):
        attacks["Date std."][i] = re.search(r"\d{4}.\d{2}.\d{2}", attacks["Case Number st."][i]).group()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  after removing the cwd from sys.path.
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


Now they are combined, we can drop "Date" and "Case Number st.":

In [22]:
attacks = attacks.drop(columns = ["Date", "Case Number st."])

### 3. Filling nulls

How many nulls are there in our current dataframe?

In [23]:
attacks.isnull().sum()

Year                         2
Type                         4
Country                     50
Area                       455
Location                   540
Activity                   544
Name                       210
Sex                        565
Age                       2831
Injury                      28
Fatal (Y/N)                539
Time                      3354
Species                   2838
Investigator or Source      17
original order               0
Date std.                    0
dtype: int64

#### 3.1 Filling the nulls in "Year"
There are only two nulls in "Year", and they can be filled with the data in "Date std."

In [24]:
attacks[["Year", "Date std."]][attacks.Year.isnull()]

Unnamed: 0,Year,Date std.
187,,2017.01.08
6079,,1836.08.19


In [25]:
attacks["Year"][187] = "2017"
attacks["Year"][6079] = "1836"

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


#### 3.2. Filling nulls and data wrangling in "Type"

In [26]:
attacks["Type"] = attacks.Type.fillna("NA") # NA = Not Available

In [27]:
attacks.Type.value_counts()

Unprovoked      4595
Provoked         574
Invalid          547
Sea Disaster     239
Boating          203
Boat             137
NA                 4
Questionable       2
Boatomg            1
Name: Type, dtype: int64

In [28]:
attacks.loc[attacks["Type"].str.startswith("Boat"),"Type"] = "Boat"
attacks.Type.value_counts()

Unprovoked      4595
Provoked         574
Invalid          547
Boat             341
Sea Disaster     239
NA                 4
Questionable       2
Name: Type, dtype: int64

#### 3.3. Filling nulls in any other column

In [29]:
attacks = attacks.fillna("NA") # NA = Not Available

### 4. Data wrangling
#### 4.1. Data wrangling in "Country"
Group countries by proximity to create bigger areas of study:

In [30]:
countries = {"USA": ["USA"],
             "AUSTRALIA": ["AUSTRALIA", "PAPUA NEW GUINEA", "NEW GUINEA", "NEW ZEALAND", "SOLOMON ISLANDS"],
             "SOUTH PACIFIC OCEAN": ["FIJI", "NEW CALEDONIA", "FRENCH POLYNESIA", "TONGA", "VANUATU"],
             "SOUTH EAST ASIA": ["PHILIPPINES", "HONG KONG", "INDONESIA", "VIETNAM", "TAIWAN"],
             "MADAGASCAR": ["REUNION", "MOZAMBIQUE", "MAURITIUS", "KENYA", "TANZANIA"],
             "JAPAN":["JAPAN", "SOUTH KOREA", "CHINA", "OKINAWA", "KOREA"],
             "MEDITERRANEAN SEA": ["CROATIA", "ITALY", "FRANCE", "GREECE", "TURKEY"],
             "PERSIAN GULF":["IRAN", "IRAQ", "YEMEN", "SAUDI ARABIA", "PERSIAN GULF"],
             "CARIBE AND MEXICAN GULF": ["BAHAMAS", "MEXICO", "CUBA", "PANAMA", "JAMAICA", "COSTA RICA"],
             "INDIAN OCEAN": ["INDIAN OCEAN", "SRI LANKA", "CEYLON"]
            }

Create a new column "Regions" with grouped countries.

In [31]:
attacks["Regions"]=""

for i in attacks.index:
    for k,v in countries.items():
        if attacks["Country"][i] in v:
            attacks["Regions"][i]=k
            break
        else:
            attacks["Regions"][i]="Other"
        

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  if __name__ == '__main__':


In [32]:
attacks["Regions"].value_counts()

USA                        2229
AUSTRALIA                  1640
Other                      1392
CARIBE AND MEXICAN GULF     320
SOUTH PACIFIC OCEAN         169
MEDITERRANEAN SEA           155
MADAGASCAR                  133
SOUTH EAST ASIA             132
JAPAN                        56
PERSIAN GULF                 52
INDIAN OCEAN                 24
Name: Regions, dtype: int64

#### 4.2. Data wrangling in "Sex "

In [33]:
attacks["Sex "] = attacks["Sex "].fillna("NA")
attacks["Sex "].value_counts()

M      5094
F       637
NA      565
M         2
N         2
lli       1
.         1
Name: Sex , dtype: int64

In [34]:
attacks.loc[attacks["Sex "]=="M "]="M"
attacks.loc[attacks["Sex "]=="lli"] = "M"
attacks.loc[attacks["Sex "]=="."] = "NA"
attacks.loc[attacks["Sex "]=="N"] = "M"
attacks["Sex "].value_counts()

M     5099
F      637
NA     566
Name: Sex , dtype: int64

#### 4.3. Data wrangling in "Age"

In [35]:
attacks["Age"].value_counts()

NA                 2828
17                  154
18                  150
19                  142
20                  141
                   ... 
"young"               1
9 months              1
10 or 12              1
MAKE LINE GREEN       1
21 or 26              1
Name: Age, Length: 159, dtype: int64

Let's standarized the ages:

In [36]:
for i in attacks.index:
    if re.match(r"\d+$", attacks["Age"][i]):
        attacks.Age[i] = re.match(r"\d+$",attacks["Age"][i]).group()    
    elif re.match(r"\d+s", attacks["Age"][i]):
        attacks.Age[i] = re.match(r"\d+s",attacks["Age"][i]).group()[:-1]
    elif re.match(r"(?i)teen", attacks["Age"][i]):
        attacks.Age[i] = "15"
    elif re.findall(r"(\d+) or (\d+)", attacks["Age"][i]):
        x=re.findall(r"(\d+) or (\d+)", attacks["Age"][i])
        attacks.Age[i] = str(int(np.mean([int(x[0][0]),int(x[0][1])])))
    else:
        attacks.Age[i] = "NA"

#### 4.4. Data wrangling in Injury

In [37]:
attacks["Injury"].value_counts()

FATAL                                 802
Survived                               97
Foot bitten                            87
No injury                              82
Leg bitten                             72
                                     ... 
Laceration to left upper leg            1
3 lacerations to foot                   1
No injury, shark charged surfboard      1
Thighs bitten                           1
Lacerations on right ankle & heel       1
Name: Injury, Length: 3734, dtype: int64

Let's try to group some of the injuries.

In [38]:
for i in attacks.index:
    if re.findall(r"(?i)no injury", attacks["Injury"][i]):
        attacks["Injury"][i] = "No injury"
    elif re.findall(r"(?i)fatal", attacks["Injury"][i]) or re.findall(r"(?i)death", attacks["Injury"][i]):
        attacks["Injury"][i] = "FATAL"
    elif re.findall(r"(?i)lacerat", attacks["Injury"][i]) or re.findall(r"(?i)cut", attacks["Injury"][i]):
        attacks["Injury"][i] = "Laceration"
    elif re.findall(r"(?i)bit", attacks["Injury"][i]):
        attacks["Injury"][i] = "Bite"
    elif (re.findall(r"(?i)injur", attacks["Injury"][i])) or (re.findall(r"(?i)sever", attacks["Injury"][i])) or (re.findall(r"(?i)wound", attacks["Injury"][i])) or (re.findall(r"(?i)abras", attacks["Injury"][i])):
        attacks["Injury"][i] = "Some injury"
    elif re.findall(r"(?i)survived", attacks["Injury"][i]):
        attacks["Injury"][i] = "Survived"
    else:
        attacks["Injury"][i] = "NA"

#### 4.5. Data wrangling in "Fatal"

In [39]:
attacks["Fatal (Y/N)"].value_counts()

N          4289
Y          1386
NA          540
UNKNOWN      71
 N            7
M             6
N             1
y             1
2017          1
Name: Fatal (Y/N), dtype: int64

In [40]:
for i in attacks.index:
    if attacks["Fatal (Y/N)"][i]!= "N" and attacks["Fatal (Y/N)"][i]!= "Y":
        attacks["Fatal (Y/N)"][i] = "NA"

#### 4.6. Data wrangling in "Species"

In [41]:
attacks["Species "].value_counts()

NA                                                                       2838
White shark                                                               163
Shark involvement prior to death was not confirmed                        105
Invalid                                                                   102
Shark involvement not confirmed                                            88
                                                                         ... 
Thought to involve an oceanic whitetip shark or a white shark               1
0.9 m  [3'] shark                                                           1
Dooley believed his Injury was caused by stingray (Dasyatidae family)       1
C. leucas tooth fragment recovered from kayak                               1
4.9 m to 5.5 m [16' to 18'] white shark                                     1
Name: Species , Length: 1549, dtype: int64

In [42]:
for i in attacks.index:
    if re.search(r"(?i)white", attacks["Species "][i]): 
        attacks["Species "][i] = "White shark"
    elif re.search(r"(?i)tiger", attacks["Species "][i]): 
        attacks["Species "][i] = "Tiger shark"
    elif re.search(r"(?i)bull", attacks["Species "][i]): 
        attacks["Species "][i] = "Bull shark"
    elif re.search(r"(?i)wobbegong", attacks["Species "][i]): 
        attacks["Species "][i] = "Wobbegong shark"
    elif re.search(r"(?i)blacktip", attacks["Species "][i]): 
        attacks["Species "][i] = "Blacktip shark"
    elif re.search(r"(?i)mako", attacks["Species "][i]): 
        attacks["Species "][i] = "Mako shark"
    elif re.search(r"(?i)raggedtooth", attacks["Species "][i]): 
        attacks["Species "][i] = "Raggedtooth shark"
    elif re.search(r"(?i)blue", attacks["Species "][i]): 
        attacks["Species "][i] = "Blue shark"
    elif re.search(r"(?i)lemon", attacks["Species "][i]):
        attacks["Species "][i] = "Lemon shark"
    elif re.search(r"(?i)nurse", attacks["Species "][i]): 
        attacks["Species "][i] = "Nurse shark"
    elif re.search(r"(?i)zambesi", attacks["Species "][i]):
        attacks["Species "][i] = "Zambesi shark"
    elif re.search(r"(?i)bronze whaler", attacks["Species "][i]):
        attacks["Species "][i] = "Bronze whaler shark"
    elif re.search(r"(?i)hammerhead", attacks["Species "][i]):
        attacks["Species "][i] = "Hammerhead shark"
    elif re.search(r"[1-6]\'", attacks["Species "][i]) or re.search(r"^2\sm", attacks["Species "][i]):
        attacks["Species "][i] = "0-2m shark"
    elif re.search(r"\d+'", attacks["Species "][i]) or re.search(r"^3\sm", attacks["Species "][i]):
        attacks["Species "][i] = "2-3m shark"
    elif re.search(r"not conf", attacks["Species "][i]) or re.search(r"unconf", attacks["Species "][i]):
        attacks["Species "][i] = "Shark involvement not confirmed"
    else: 
        attacks["Species "][i] = "NA"

#### 4.7. Data wrangling in "Investigator"

In [43]:
attacks["Investigator or Source"].value_counts()

C. Moore, GSAF                                                105
C. Creswell, GSAF                                              92
S. Petersohn, GSAF                                             82
R. Collier                                                     55
R. Collier, GSAF                                               48
                                                             ... 
R. Collier, p. xxvi;  Orlando Sentinel, 9/6/1989, p.3A          1
K. Jones                                                        1
NY Herald Tribune, 8/27/1931; L. Schultz & M. Malin, p.555      1
Sail-World.com, 2/17/2014                                       1
Daily Dispatch; M. Levine, GSAF                                 1
Name: Investigator or Source, Length: 4966, dtype: int64

In [44]:
for i in attacks.index:
    if re.findall(r"[A-Z].\s\w+",attacks["Investigator or Source"][i]):
        attacks["Investigator or Source"][i] = re.findall(r"[A-Z].\s\w+",attacks["Investigator or Source"][i])[0]
    else:
        attacks["Investigator or Source"][i] = "Others"

### 5. Saving the cleaning data

In [45]:
attacks.to_csv("OUTPUT/attacks_cleaned.csv")