# Project 1: Shark attacks
## Cleaning and wrangling

In [1]:
import pandas as pd
import numpy as np
import statistics
import re

### 1. Open the document

In [2]:
attacks = pd.read_csv("../data/attacks.csv", encoding = "latin-1")
attacks.head()

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,...,Species,Investigator or Source,pdf,href formula,href,Case Number.1,Case Number.2,original order,Unnamed: 22,Unnamed: 23
0,2018.06.25,25-Jun-2018,2018.0,Boating,USA,California,"Oceanside, San Diego County",Paddling,Julie Wolfe,F,...,White shark,"R. Collier, GSAF",2018.06.25-Wolfe.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.25,2018.06.25,6303.0,,
1,2018.06.18,18-Jun-2018,2018.0,Unprovoked,USA,Georgia,"St. Simon Island, Glynn County",Standing,Adyson McNeely,F,...,,"K.McMurray, TrackingSharks.com",2018.06.18-McNeely.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.18,2018.06.18,6302.0,,
2,2018.06.09,09-Jun-2018,2018.0,Invalid,USA,Hawaii,"Habush, Oahu",Surfing,John Denges,M,...,,"K.McMurray, TrackingSharks.com",2018.06.09-Denges.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.09,2018.06.09,6301.0,,
3,2018.06.08,08-Jun-2018,2018.0,Unprovoked,AUSTRALIA,New South Wales,Arrawarra Headland,Surfing,male,M,...,2 m shark,"B. Myatt, GSAF",2018.06.08-Arrawarra.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.08,2018.06.08,6300.0,,
4,2018.06.04,04-Jun-2018,2018.0,Provoked,MEXICO,Colima,La Ticla,Free diving,Gustavo Ramos,M,...,"Tiger shark, 3m",A .Kipper,2018.06.04-Ramos.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.04,2018.06.04,6299.0,,


### 2. Exploring data

#### 2.1. Size of the dataset

Let's take a look at the size of the dataset:

In [3]:
attacks.shape

(25723, 24)

There are 25723 records of shark attacks in our dataset. In each record, 24 characteristics were saved. Which are these 24 characteristics?

In [4]:
attacks.columns

Index(['Case Number', 'Date', 'Year', 'Type', 'Country', 'Area', 'Location',
       'Activity', 'Name', 'Sex ', 'Age', 'Injury', 'Fatal (Y/N)', 'Time',
       'Species ', 'Investigator or Source', 'pdf', 'href formula', 'href',
       'Case Number.1', 'Case Number.2', 'original order', 'Unnamed: 22',
       'Unnamed: 23'],
      dtype='object')

#### 2.2. Duplicates
There could be some duplicates which should be eliminated from the dataset.

In [5]:
attacks = attacks.drop_duplicates()
attacks.shape

(6312, 24)

There were 19411 duplicates, which included empty rows.

#### 2.2. Empty rows and columns

**Is there any record (row) empty?**

In [6]:
emptyrows = attacks.isnull().sum(axis=1) #count number of empty data in each row
print(sum(emptyrows ==24)) #counts number of empty rows
print(sum(emptyrows ==23)) #counts number of rows with only one data
print(sum(emptyrows ==22)) #counts number of rows with only two data

1
2
7


We have eliminated all empty rows except the first one. There are also records with only one or two data that doesn't help us to solve the hypothesis. Let's check which data are completed in those records in which we only have two data:

In [7]:
attacks[emptyrows==22]

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,...,Species,Investigator or Source,pdf,href formula,href,Case Number.1,Case Number.2,original order,Unnamed: 22,Unnamed: 23
6302,0,,,,,,,,,,...,,,,,,,,6304.0,,
6303,0,,,,,,,,,,...,,,,,,,,6305.0,,
6304,0,,,,,,,,,,...,,,,,,,,6306.0,,
6305,0,,,,,,,,,,...,,,,,,,,6307.0,,
6306,0,,,,,,,,,,...,,,,,,,,6308.0,,
6307,0,,,,,,,,,,...,,,,,,,,6309.0,,
6308,0,,,,,,,,,,...,,,,,,,,6310.0,,


No useful information can be extracted from these rows either so, let's delete the rows that have 2 or less data.

In [8]:
isemptyrow = [i for i in attacks.index if emptyrows[i] >= 22] #list with the indexes of the empty rows
attacks = attacks.drop(isemptyrow) #df with at least three data for each record
attacks.shape

(6302, 24)

We have reduced the dataset to 6302 records.

**How many nulls do we have in each column?**

In [9]:
attacks.isnull().sum()

Case Number                  1
Date                         0
Year                         2
Type                         4
Country                     50
Area                       455
Location                   540
Activity                   544
Name                       210
Sex                        565
Age                       2831
Injury                      28
Fatal (Y/N)                539
Time                      3354
Species                   2838
Investigator or Source      17
pdf                          0
href formula                 1
href                         0
Case Number.1                0
Case Number.2                0
original order               0
Unnamed: 22               6301
Unnamed: 23               6300
dtype: int64

We don't have any data in columns "Unnamed: 22" and "Unnamed: 23" as the number of nulls is pretty much the same number of rows so let's delete them.

In [10]:
attacks = attacks.drop(columns = ["Unnamed: 22", "Unnamed: 23"])

#### 2.3 Data wrangling of column "pdf"

If we look at the names of the pdfs generated for each case, we realize that the name is a concatenation of the "Case Number" and the last name of the victim or the location. No more useful information can be get from the "pdf" column, so it is deleted:

In [11]:
attacks = attacks.drop(columns = ["pdf"])
attacks.shape

(6302, 21)

#### 2.4 Data wrangling of columns "href formula" and href"
"href formula" and href" columns seem to have the same values. Let's compare them:

In [12]:
attacks["equalhref"] = "" # Defining a new column that stores True or False
attacks["equalhref"] = attacks["href formula"] == attacks["href"]
print(sum(attacks["equalhref"]))
attacks[~attacks["equalhref"]][["href formula", "href"]].head()

6242


Unnamed: 0,href formula,href
50,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...
96,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...
131,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...
133,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...
141,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...


We can find some spelling mistakes between both columns:

In [13]:
print(attacks["href formula"].iloc[602])
print(attacks["href"].iloc[602])
print("---------------")
print(attacks["href formula"].iloc[6166])
print(attacks["href"].iloc[6166])

http://sharkattackfile.net/spreadsheets/pdf_directory/2013.11.10-Bahamas.pdf
http://sharkattackfile.net/spreadsheets/pdf_directory/2013.11.10-Bahamas
---------------
http://sharkattackfile.net/spreadsheets/pdf_directory/1637.00.00.R-Manrique.pdf
http://sharkattackfile.net/spreadsheets/pdf_directory/1637.00.00-Manrique.pdf


But, in general, these two columns do not add value to the study. Let's delete them and the "equalhref" too.

In [14]:
attacks = attacks.drop(columns = ["href formula", "href", "equalhref"])
attacks.head()

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,Age,Injury,Fatal (Y/N),Time,Species,Investigator or Source,Case Number.1,Case Number.2,original order
0,2018.06.25,25-Jun-2018,2018.0,Boating,USA,California,"Oceanside, San Diego County",Paddling,Julie Wolfe,F,57.0,"No injury to occupant, outrigger canoe and pad...",N,18h00,White shark,"R. Collier, GSAF",2018.06.25,2018.06.25,6303.0
1,2018.06.18,18-Jun-2018,2018.0,Unprovoked,USA,Georgia,"St. Simon Island, Glynn County",Standing,Adyson McNeely,F,11.0,Minor injury to left thigh,N,14h00 -15h00,,"K.McMurray, TrackingSharks.com",2018.06.18,2018.06.18,6302.0
2,2018.06.09,09-Jun-2018,2018.0,Invalid,USA,Hawaii,"Habush, Oahu",Surfing,John Denges,M,48.0,Injury to left lower leg from surfboard skeg,N,07h45,,"K.McMurray, TrackingSharks.com",2018.06.09,2018.06.09,6301.0
3,2018.06.08,08-Jun-2018,2018.0,Unprovoked,AUSTRALIA,New South Wales,Arrawarra Headland,Surfing,male,M,,Minor injury to lower leg,N,,2 m shark,"B. Myatt, GSAF",2018.06.08,2018.06.08,6300.0
4,2018.06.04,04-Jun-2018,2018.0,Provoked,MEXICO,Colima,La Ticla,Free diving,Gustavo Ramos,M,,Lacerations to leg & hand shark PROVOKED INCIDENT,N,,"Tiger shark, 3m",A .Kipper,2018.06.04,2018.06.04,6299.0


#### 2.5. Data wrangling of columns "Case Number", "Case Number.1" and "Case Number.2"
Columns named "Case Number", "Case Number.1" and "Case Number.2" seem to have the same values.

In [15]:
attacks[["Case Number","Case Number.1", "Case Number.2"]]

Unnamed: 0,Case Number,Case Number.1,Case Number.2
0,2018.06.25,2018.06.25,2018.06.25
1,2018.06.18,2018.06.18,2018.06.18
2,2018.06.09,2018.06.09,2018.06.09
3,2018.06.08,2018.06.08,2018.06.08
4,2018.06.04,2018.06.04,2018.06.04
...,...,...,...
6297,ND.0005,ND.0005,ND.0005
6298,ND.0004,ND.0004,ND.0004
6299,ND.0003,ND.0003,ND.0003
6300,ND.0002,ND.0002,ND.0002


Let's see how many are different:

--->> SE PUEDE HACER MÁS FÁCIL??

In [16]:
diffrows = [i for i in range(attacks.shape[0])
    if attacks["Case Number"][i] != attacks["Case Number.1"][i] 
    or attacks["Case Number"][i] != attacks["Case Number.2"][i]]

print("Different rows:", len(diffrows))

Different rows: 24


In [17]:
attacks[["Case Number","Case Number.1", "Case Number.2"]].iloc[diffrows]

Unnamed: 0,Case Number,Case Number.1,Case Number.2
34,2018.04.03,2018.04.02,2018.04.03
117,2017.07.20.a,2017/07.20.a,2017.07.20.a
144,2017.05.06,2017.06.06,2017.05.06
217,2016.09.15,2016.09.16,2016.09.15
314,2016.01.24.b,2015.01.24.b,2016.01.24.b
334,2015.12.23,2015.11.07,2015.12.23
339,2015.10.28.a,2015.10.28,2015.10.28.a
390,2015.07-10,2015.07.10,2015.07.10
560,2014.05.04,2013.05.04,2014.05.04
3522,1967.07.05,1967/07.05,1967.07.05


We are going to add another column to the dataframe with the standarized case number. We will fill it with the original Case Number in case the three columns have the same value, and with the mode in case they don't.

___>>> INTENTAR HACERLO CON UN WHERE

In [18]:
attacks["Case Number st."]=""
for i in attacks.index:
    if i in diffrows:
        #print(i)
        attacks["Case Number st."][i]=statistics.mode(attacks[["Case Number","Case Number.1", "Case Number.2"]].iloc[i])
    else:
        attacks["Case Number st."][i]=attacks["Case Number"][i]
        
attacks[["Case Number st."]].head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  import sys
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """


Unnamed: 0,Case Number st.
0,2018.06.25
1,2018.06.18
2,2018.06.09
3,2018.06.08
4,2018.06.04


Finnaly, we are going to keep this "Case Number st." column, and delete the other three.

In [19]:
attacks=attacks.drop(["Case Number", "Case Number.1", "Case Number.2"], axis=1)
attacks.head()

Unnamed: 0,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,Age,Injury,Fatal (Y/N),Time,Species,Investigator or Source,original order,Case Number st.
0,25-Jun-2018,2018.0,Boating,USA,California,"Oceanside, San Diego County",Paddling,Julie Wolfe,F,57.0,"No injury to occupant, outrigger canoe and pad...",N,18h00,White shark,"R. Collier, GSAF",6303.0,2018.06.25
1,18-Jun-2018,2018.0,Unprovoked,USA,Georgia,"St. Simon Island, Glynn County",Standing,Adyson McNeely,F,11.0,Minor injury to left thigh,N,14h00 -15h00,,"K.McMurray, TrackingSharks.com",6302.0,2018.06.18
2,09-Jun-2018,2018.0,Invalid,USA,Hawaii,"Habush, Oahu",Surfing,John Denges,M,48.0,Injury to left lower leg from surfboard skeg,N,07h45,,"K.McMurray, TrackingSharks.com",6301.0,2018.06.09
3,08-Jun-2018,2018.0,Unprovoked,AUSTRALIA,New South Wales,Arrawarra Headland,Surfing,male,M,,Minor injury to lower leg,N,,2 m shark,"B. Myatt, GSAF",6300.0,2018.06.08
4,04-Jun-2018,2018.0,Provoked,MEXICO,Colima,La Ticla,Free diving,Gustavo Ramos,M,,Lacerations to leg & hand shark PROVOKED INCIDENT,N,,"Tiger shark, 3m",A .Kipper,6299.0,2018.06.04


#### 2.6. Data wrangling of column "Date"

Column "Date" and the new column "Case Number st." seem to have the same values, although in different format.

In [20]:
attacks[["Date", "Case Number st."]].head(10)

Unnamed: 0,Date,Case Number st.
0,25-Jun-2018,2018.06.25
1,18-Jun-2018,2018.06.18
2,09-Jun-2018,2018.06.09
3,08-Jun-2018,2018.06.08
4,04-Jun-2018,2018.06.04
5,03-Jun-2018,2018.06.03.b
6,03-Jun-2018,2018.06.03.a
7,27-May-2018,2018.05.27
8,26-May-2018,2018.05.26.b
9,26-May-2018,2018.05.26.a


**Changing the format of attacks["Date"]**

Let's change the format of column "Date" to compare them.

Function definition:

In [21]:
def dateformat (date):
    [dd, mm, yyyy] = date.split('-')
    months1=["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sept", "Oct", "Nov", "Dec"]
    months2=["01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12"]
    for m1,m2 in zip(months1, months2):
        if mm == m1: mm=m2
    return '.'.join([yyyy,mm,dd])

Application of the function to all elements in column "Date".

In [22]:
for i in attacks.index:
    original = attacks["Date"][i]  
    try:
        attacks["Date"][i] = dateformat(original) # We are rewritting the column
    except:
        pass

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  after removing the cwd from sys.path.


**Comparing values between column "Date" and column "Case Number st."**

Now they have the same format, let's compare them.

In [23]:
attacks[["Date", "Case Number st."]].head(10)

Unnamed: 0,Date,Case Number st.
0,2018.06.25,2018.06.25
1,2018.06.18,2018.06.18
2,2018.06.09,2018.06.09
3,2018.06.08,2018.06.08
4,2018.06.04,2018.06.04
5,2018.06.03,2018.06.03.b
6,2018.06.03,2018.06.03.a
7,2018.05.27,2018.05.27
8,2018.05.26,2018.05.26.b
9,2018.05.26,2018.05.26.a


Let's merge the values in a new column named *Date std.*:
* In case values are equal in column *Date* and *Case Number st.*, the new column will keep the value.
* In case we find "2018.05.26" in column *Date* and "2018.05.26.a" in *Case Number st.*, we will keep tha value in *Date*.
* In case *Date* is "2018.04.Reported 30" and *Case Number st.* is "2018.04.30.R", we apply a regular expression.
* In case *Date* is "1962.08.30" and *Case Number st.* is "1962,08.30.b", we apply another regular expression.
* In any other cases, we keep *Date* value.

In [24]:
attacks["Date std."] = ""
for i in attacks.index:
    if re.search(r"\d{4}.\d{2}.\d{2}", attacks["Date"][i]):
        attacks["Date std."][i] = re.search(r"\d{4}.\d{2}.\d{2}", attacks["Date"][i]).group()
    elif re.search(r"\d{4}.\d{2}.\d{2}", attacks["Case Number st."][i]):
        attacks["Date std."][i] = re.search(r"\d{4}.\d{2}.\d{2}", attacks["Case Number st."][i]).group()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  after removing the cwd from sys.path.
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


In [25]:
attacks[["Date", "Case Number st.","Date std."]].head()

Unnamed: 0,Date,Case Number st.,Date std.
0,2018.06.25,2018.06.25,2018.06.25
1,2018.06.18,2018.06.18,2018.06.18
2,2018.06.09,2018.06.09,2018.06.09
3,2018.06.08,2018.06.08,2018.06.08
4,2018.06.04,2018.06.04,2018.06.04


Now they are combined, we can drop "Date" and "Case Number st.":

In [26]:
attacks = attacks.drop(columns = ["Date", "Case Number st."])
attacks

Unnamed: 0,Year,Type,Country,Area,Location,Activity,Name,Sex,Age,Injury,Fatal (Y/N),Time,Species,Investigator or Source,original order,Date std.
0,2018.0,Boating,USA,California,"Oceanside, San Diego County",Paddling,Julie Wolfe,F,57,"No injury to occupant, outrigger canoe and pad...",N,18h00,White shark,"R. Collier, GSAF",6303.0,2018.06.25
1,2018.0,Unprovoked,USA,Georgia,"St. Simon Island, Glynn County",Standing,Adyson McNeely,F,11,Minor injury to left thigh,N,14h00 -15h00,,"K.McMurray, TrackingSharks.com",6302.0,2018.06.18
2,2018.0,Invalid,USA,Hawaii,"Habush, Oahu",Surfing,John Denges,M,48,Injury to left lower leg from surfboard skeg,N,07h45,,"K.McMurray, TrackingSharks.com",6301.0,2018.06.09
3,2018.0,Unprovoked,AUSTRALIA,New South Wales,Arrawarra Headland,Surfing,male,M,,Minor injury to lower leg,N,,2 m shark,"B. Myatt, GSAF",6300.0,2018.06.08
4,2018.0,Provoked,MEXICO,Colima,La Ticla,Free diving,Gustavo Ramos,M,,Lacerations to leg & hand shark PROVOKED INCIDENT,N,,"Tiger shark, 3m",A .Kipper,6299.0,2018.06.04
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6297,0.0,Unprovoked,AUSTRALIA,Western Australia,Roebuck Bay,Diving,male,M,,FATAL,Y,,,"H. Taunton; N. Bartlett, p. 234",6.0,
6298,0.0,Unprovoked,AUSTRALIA,Western Australia,,Pearl diving,Ahmun,M,,FATAL,Y,,,"H. Taunton; N. Bartlett, pp. 233-234",5.0,
6299,0.0,Unprovoked,USA,North Carolina,Ocracoke Inlet,Swimming,Coast Guard personnel,M,,FATAL,Y,,,"F. Schwartz, p.23; C. Creswell, GSAF",4.0,
6300,0.0,Unprovoked,PANAMA,,"Panama Bay 8ºN, 79ºW",,Jules Patterson,M,,FATAL,Y,,,"The Sun, 10/20/1938",3.0,


### 3. Filling nulls and data wrangling

How many nulls are there in our current dataframe?

In [27]:
attacks.isnull().sum()

Year                         2
Type                         4
Country                     50
Area                       455
Location                   540
Activity                   544
Name                       210
Sex                        565
Age                       2831
Injury                      28
Fatal (Y/N)                539
Time                      3354
Species                   2838
Investigator or Source      17
original order               0
Date std.                    0
dtype: int64

#### 3.1 Filling the nulls in "Year"
There are only two nulls in "Year", and they can be filled with the data in "Date std."

In [28]:
attacks[["Year", "Date std."]][attacks.Year.isnull()]

Unnamed: 0,Year,Date std.
187,,2017.01.08
6079,,1836.08.19


In [29]:
attacks["Year"][187] = "2017"
attacks["Year"][6079] = "1836"

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


#### 3.2. Filling nulls and data wrangling in "Type"

In [30]:
attacks["Type"] = attacks.Type.fillna("NA") # NA = Not Available

In [31]:
attacks.Type.value_counts()

Unprovoked      4595
Provoked         574
Invalid          547
Sea Disaster     239
Boating          203
Boat             137
NA                 4
Questionable       2
Boatomg            1
Name: Type, dtype: int64

In [32]:
attacks.loc[attacks["Type"].str.startswith("Boat"),"Type"] = "Boat"
attacks.Type.value_counts()

Unprovoked      4595
Provoked         574
Invalid          547
Boat             341
Sea Disaster     239
NA                 4
Questionable       2
Name: Type, dtype: int64

#### 3.3. Filling nulls and data wrangling in "Country"

In [67]:
attacks["Country"] = attacks["Country"].fillna("NA")

In [76]:
for i in attacks.index:
    if re.search(r"TONGA", attacks["Country"][i]):
        attacks["Country"][i] = "TONGA"
    if re.search(r"YEMEN", attacks["Country"][i]):
        attacks["Country"][i] = "YEMEN"
    if re.search(r"ANDAMAN", attacks["Country"][i]):
        attacks["Country"][i] = "ANDAMAN ISLANDS"

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  import sys


In [83]:
attacks["Country"].value_counts()[100:150]

GUYANA                                   3
Fiji                                     3
PORTUGAL                                 3
CAPE VERDE                               3
TRINIDAD & TOBAGO                        3
TUNISIA                                  3
HAITI                                    3
BELIZE                                   3
MICRONESIA                               3
MONTENEGRO                               3
GUINEA                                   3
ST HELENA, British overseas territory    2
EGYPT                                    2
UNITED ARAB EMIRATES                     2
NAMIBIA                                  2
CENTRAL PACIFIC                          2
ANTIGUA                                  2
CRETE                                    2
WEST INDIES                              2
ICELAND                                  2
JOHNSTON ISLAND                          2
CAYMAN ISLANDS                           2
SOUTH PACIFIC OCEAN                      2
NORWAY     

#### Filling nulls in "Sex "

In [33]:
attacks["Sex "] = attacks["Sex "].fillna("NA")
attacks["Sex "].value_counts()

M      5094
F       637
NA      565
M         2
N         2
lli       1
.         1
Name: Sex , dtype: int64

In [34]:
attacks.loc[attacks["Sex "]=="M "]="M"
attacks.loc[attacks["Sex "]=="lli"] = "M"
attacks.loc[attacks["Sex "]=="."] = "NA"
attacks.loc[attacks["Sex "]=="N"] = "M"
attacks["Sex "].value_counts()

M     5099
F      637
NA     566
Name: Sex , dtype: int64

#### Filling nulls in "Age"

In [35]:
attacks["Age"] = attacks["Age"].fillna("NA")
attacks["Age"].value_counts()

NA                 2828
17                  154
18                  150
19                  142
20                  141
                   ... 
34 & 19               1
                      1
MAKE LINE GREEN       1
9 & 12                1
12 or 13              1
Name: Age, Length: 159, dtype: int64

In [36]:
for i in attacks.index:
    if re.match(r"\d+$", attacks["Age"][i]):
        attacks.Age[i] = re.match(r"\d+$",attacks["Age"][i]).group()    
    elif re.match(r"\d+s", attacks["Age"][i]):
        attacks.Age[i] = re.match(r"\d+s",attacks["Age"][i]).group()[:-1]
    elif re.match(r"(?i)teen", attacks["Age"][i]):
        attacks.Age[i] = "15"
    elif re.findall(r"(\d+) or (\d+)", attacks["Age"][i]):
        x=re.findall(r"(\d+) or (\d+)", attacks["Age"][i])
        attacks.Age[i] = str(int(np.mean([int(x[0][0]),int(x[0][1])])))
    else:
        attacks.Age[i] = "NA"

#### Injury

In [37]:
attacks["Injury"] = attacks["Injury"].fillna("NA")
attacks["Injury"].value_counts()

FATAL                                                          802
Survived                                                        97
Foot bitten                                                     87
No injury                                                       82
Leg bitten                                                      72
                                                              ... 
No injury, kayak scratched                                       1
Leg bitten 3 times                                               1
Lower right  leg lacerated                                       1
No injury & not on board. Board adrift when bitten by shark      1
Right thigh & buttock bitten                                     1
Name: Injury, Length: 3734, dtype: int64

In [38]:
for i in attacks.index:
    if re.findall(r"(?i)no injury", attacks["Injury"][i]):
        attacks["Injury"][i] = "No injury"
    elif re.findall(r"(?i)fatal", attacks["Injury"][i]) or re.findall(r"(?i)death", attacks["Injury"][i]):
        attacks["Injury"][i] = "FATAL"
    elif re.findall(r"(?i)lacerat", attacks["Injury"][i]) or re.findall(r"(?i)cut", attacks["Injury"][i]):
        attacks["Injury"][i] = "Laceration"
    elif re.findall(r"(?i)bit", attacks["Injury"][i]):
        attacks["Injury"][i] = "Bite"
    elif (re.findall(r"(?i)injur", attacks["Injury"][i])) or (re.findall(r"(?i)sever", attacks["Injury"][i])) or (re.findall(r"(?i)wound", attacks["Injury"][i])) or (re.findall(r"(?i)abras", attacks["Injury"][i])):
        attacks["Injury"][i] = "Some injury"
    elif re.findall(r"(?i)survived", attacks["Injury"][i]):
        attacks["Injury"][i] = "Survived"
    else:
        attacks["Injury"][i] = "NA"

#### Filling nulls and data wrangling in Fatal

In [39]:
attacks["Fatal (Y/N)"]=attacks["Fatal (Y/N)"].fillna("NA")

In [40]:
attacks["Fatal (Y/N)"].value_counts()

N          4289
Y          1386
NA          540
UNKNOWN      71
 N            7
M             6
N             1
2017          1
y             1
Name: Fatal (Y/N), dtype: int64

In [41]:
for i in attacks.index:
    if attacks["Fatal (Y/N)"][i]!= "N" and attacks["Fatal (Y/N)"][i]!= "Y":
        attacks["Fatal (Y/N)"][i] = "NA"

#### Filling nulls and data wrangling in species

In [42]:
attacks["Species "]=attacks["Species "].fillna("NA")

In [43]:
attacks["Species "].value_counts()

NA                                                    2838
White shark                                            163
Shark involvement prior to death was not confirmed     105
Invalid                                                102
Shark involvement not confirmed                         88
                                                      ... 
said to involve 2.5 m hammerhead sharks                  1
4.5 m to 5.5 m [14.7' to 18'] white shark                1
"black tipped" shark                                     1
Bitten by several 1.8 m [6'] sharks                      1
Bronze whaler shark, 2.4 m [8']                          1
Name: Species , Length: 1549, dtype: int64

In [44]:
for i in attacks.index:
    if re.search(r"(?i)white", attacks["Species "][i]): 
        attacks["Species "][i] = "White shark"
    elif re.search(r"(?i)tiger", attacks["Species "][i]): 
        attacks["Species "][i] = "Tiger shark"
    elif re.search(r"(?i)bull", attacks["Species "][i]): 
        attacks["Species "][i] = "Bull shark"
    elif re.search(r"(?i)wobbegong", attacks["Species "][i]): 
        attacks["Species "][i] = "Wobbegong shark"
    elif re.search(r"(?i)blacktip", attacks["Species "][i]): 
        attacks["Species "][i] = "Blacktip shark"
    elif re.search(r"(?i)mako", attacks["Species "][i]): 
        attacks["Species "][i] = "Mako shark"
    elif re.search(r"(?i)raggedtooth", attacks["Species "][i]): 
        attacks["Species "][i] = "Raggedtooth shark"
    elif re.search(r"(?i)blue", attacks["Species "][i]): 
        attacks["Species "][i] = "Blue shark"
    elif re.search(r"(?i)lemon", attacks["Species "][i]):
        attacks["Species "][i] = "Lemon shark"
    elif re.search(r"(?i)nurse", attacks["Species "][i]): 
        attacks["Species "][i] = "Nurse shark"
    elif re.search(r"(?i)zambesi", attacks["Species "][i]):
        attacks["Species "][i] = "Zambesi shark"
    elif re.search(r"(?i)bronze whaler", attacks["Species "][i]):
        attacks["Species "][i] = "Bronze whaler shark"
    elif re.search(r"(?i)hammerhead", attacks["Species "][i]):
        attacks["Species "][i] = "Hammerhead shark"
    elif re.search(r"[1-6]\'", attacks["Species "][i]) or re.search(r"^2\sm", attacks["Species "][i]):
        attacks["Species "][i] = "0-2m shark"
    elif re.search(r"\d+'", attacks["Species "][i]) or re.search(r"^3\sm", attacks["Species "][i]):
        attacks["Species "][i] = "2-3m shark"
    elif re.search(r"not conf", attacks["Species "][i]) or re.search(r"unconf", attacks["Species "][i]):
        attacks["Species "][i] = "Shark involvement not confirmed"
    else: 
        attacks["Species "][i] = "NA"

#### Filling nulls and data wrangling in "Investigator"

In [45]:
attacks["Investigator or Source"]=attacks["Investigator or Source"].fillna("NA")

In [46]:
attacks["Investigator or Source"].value_counts()

C. Moore, GSAF                                                                                  105
C. Creswell, GSAF                                                                                92
S. Petersohn, GSAF                                                                               82
R. Collier                                                                                       55
T. Peake, GSAF                                                                                   48
                                                                                               ... 
B. Smee, Newcastle Herald, 2/27/2012                                                              1
J. Borg, p.73; L. Taylor (1993), pp.100-101; Skin Diver Magazine, March 1956; SAF Case #1042      1
BBC News, 10/19/2009                                                                              1
NSRI, 6/27/205                                                                                    1


In [47]:
for i in attacks.index:
    if re.findall(r"[A-Z].\s\w+",attacks["Investigator or Source"][i]):
        attacks["Investigator or Source"][i] = re.findall(r"[A-Z].\s\w+",attacks["Investigator or Source"][i])[0]
    else:
        attacks["Investigator or Source"][i] = "Others"

#### Filling nulls and data wrangling of "Time"

In [48]:
attacks["Time"]=attacks["Time"].fillna("NA")

In [54]:
attacks["Time"].value_counts()[0:20]

NA                 3351
Afternoon           187
Morning             121
Night                62
Evening              53
Late afternoon       35
P.M.                 12
A.M.                 12
Early morning        11
Midday               10
Early afternoon       8
M                     5
--                    5
Sunset                4
Midnight              4
After noon            2
AM                    2
1600                  2
Dark                  2
Late afternon         2
Name: Time, dtype: int64

* Early Morning: 6:00 - 9:00
* Morning: 9:00 - 12:00
* Midday: 12:00 - 13:00
* Early afternoon: 13:00-16:00
* Afternoon: 16:00 - 19:00
* Evening: 19:00-22:00
* Night: 22:00 - 6:00

In [50]:
franjas={"Early Morning": ["06", "07", "08"],
         "Morning": ["09", "10", "11"],
         "Midday": ["12"],
         "Early afternoon": ["13", "14", "15"],
         "Afternoon": ["16", "17", "18"],
         "Evening": ["19", "20", "21"],
         "Nigth": ["22", "23", "00", "24", "01", "02", "03", "04", "05", "06"]
        }

In [51]:
def timeband (h):
    for key in franjas.keys():
        for j in range(len(franjas[key])):
            if franjas[key][j]==h: 
                return key

In [52]:
for i in attacks.index:
    if re.search(r"\d+h\d+", attacks["Time"][i]):
        x = re.search(r"\d+h\d+", attacks["Time"][i])
        h=x.group().split("h")
        attacks["Time"][i]=timeband (h)
    elif re.findall(r"(?i)dusk", attacks["Time"][i]):
        attacks["Time"][i]="Evening"

#### Filling nulls and data wrangling of "Activity"

---AGRUPAR

In [96]:
attacks["Activity"] = attacks.Activity.fillna("NA")

In [97]:
attacks["Activity"].value_counts()

Surfing                                      971
Swimming                                     869
NA                                           544
Fishing                                      431
Spearfishing                                 333
                                            ... 
native boats sunk in storm                     1
Crawling                                       1
Sea disaster, wreck of the Alfred Watts        1
Dragging stranded shark into deeper water      1
Murder victim                                  1
Name: Activity, Length: 1533, dtype: int64

Correct some spelling errors:

In [125]:
attacks["Age"] = attacks["Age"].fillna("NA")
attacks["Age"].value_counts()[60:86]

Limpiamos los que sean del formato "20s" y les adjudicamos "20".

In [176]:
for i in attacks.index:
    if re.match(r"\d*", attacks["Age"][i]):
        attacks.Age[i] = re.match(r"\d*",attacks["Age"][i]).group()
    else:
        attacks.Age[i] = "NA"
    #if re.search(r"\d+s*",attacks["Age"][i]):
        #attacks.Age[i] = re.search(r"\d+s*",attacks["Age"][i]).group()[:2]

Los valores "Teen", "teen", "Teens", etc los cambio por 15.

In [154]:
attacks.loc[attacks["Age"]=="Teen"] = "15"
attacks.loc[attacks["Age"]=="teen"] = "15"
attacks.loc[attacks["Age"]=="Teens"] = "15"
attacks.loc[attacks["Age"]=="M"] = "NA"
attacks.loc[attacks["Age"]==" "] = "NA"

In [170]:
attacks.Age[2441]

'20'

In [175]:
re.match(r"\d*", "56's")

<_sre.SRE_Match object; span=(0, 2), match='56'>

In [166]:
re.findall(r"\d+\s.\d+",attacks.Age[4587] )

[]

#### 3.2. Filling some nulls and data wrangling in "Country"

"Country" column could be filled in if we have some data about the area or location.

In [32]:
attacks[["Country","Area","Location"]][attacks.Country.isnull() & (~attacks.Area.isnull() | ~attacks.Location.isnull())]

Unnamed: 0,Country,Area,Location
2956,,English Channel,
3605,,,Florida Strait
4266,,Between Comores & Madagascar,Geyser Bank
4498,,Caribbean Sea,Between Cuba & Costa Rica
4639,,,225 miles east of Hong Kong
4700,,Off South American coast,
4712,,300 miles east of St. Thomas (Virgin Islands),
5425,,,Near the equator
5612,,Mediterranean Sea,
5808,,Western Banks,


With these data we can fill in seven countries:

In [31]:
attacks["Country"][3387]="SAINT KITTS AND NEVIS"
attacks["Country"][4018]="AUSTRALIA"
attacks["Country"][4231]="INDIA"
attacks["Country"][5020]="FRANCE"
attacks["Country"][5742]="MEXICO"
attacks["Country"][6137]="UNITED KINGDOM"
attacks["Country"][6155]="BARBADOS"

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  after removing the cwd from sys.path.
A

In [92]:
attacks.loc[attacks["Country"].str.startswith("ANDAMAN"),"Country"] = "ANDAMAN"

ValueError: Cannot mask with non-boolean array containing NA / NaN values

In [90]:
attacks.Country.value_counts()[80:100]

MALAYSIA                5
TURKS & CAICOS          5
MALTA                   5
NIGERIA                 4
HONDURAS                4
RUSSIA                  4
NORTH ATLANTIC OCEAN    4
BURMA                   4
SUDAN                   4
EL SALVADOR             4
PERSIAN GULF            4
URUGUAY                 4
GRENADA                 4
GUAM                    4
LIBERIA                 3
CAPE VERDE              3
LEBANON                 3
GUYANA                  3
CEYLON                  3
HAITI                   3
Name: Country, dtype: int64

In [94]:
attacks.loc[attacks["Country"]=="ANDAMAN ISLANDS"]

Unnamed: 0,Year,Type,Country,Area,Location,Activity,Name,Sex,Age,Injury,Fatal (Y/N),Time,Species,Investigator or Source,original order,Date std.
5959,1863.0,Unprovoked,ANDAMAN ISLANDS,,,Fell overboard,a local fisherman,M,,FATAL,Y,10h00,,"North Adams Transcript, 3/18/1898",344.0,1863


In [None]:
Reunion, New Caledonia, French Polynesia --> France (area)
Hong Kong --> China
England, Bermuda, Scotland --> United kingdom
Okinawa --> Japón
New Britain --> Papua New Guinea
Pacific Ocean? North Pacific Ocean?
Atlantic Ocean?
South Atlantic Ocean?
Indian Ocean?
New Guinea?
CARIBBEAN SEA?
Mid Atlantic Ocean

#### 3.4. Filling NaNs

In [61]:
attacks[["Country","Area","Location"]][attacks.Area.isnull() & (~attacks.Country.isnull() | ~attacks.Location.isnull())][0:20]

Unnamed: 0,Country,Area,Location
32,NEW CALEDONIA,,"Magenta Beach, Noumea"
33,BAHAMAS,,Bimini
48,NEW CALEDONIA,,Nouville
56,BAHAMAS,,
59,LIBYA,,Gars Garabulli
90,SOLOMON ISLANDS,,Owarigi Island
101,BAHAMAS,,
129,REUNION,,Roches Noire
132,BAHAMAS,,
206,MEXICO,,Guadalupe Island
