In [1]:
import pandas as pd

In [2]:
original_dataset=pd.read_csv('attacks.csv', encoding='latin-1')

At first sight, the dataset has 24 columns. Some hold dates, some hold categorical data, and some are links to files that can be useful if important information is missing.

In [3]:
original_dataset.head(5)

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,...,Species,Investigator or Source,pdf,href formula,href,Case Number.1,Case Number.2,original order,Unnamed: 22,Unnamed: 23
0,2018.06.25,25-Jun-2018,2018.0,Boating,USA,California,"Oceanside, San Diego County",Paddling,Julie Wolfe,F,...,White shark,"R. Collier, GSAF",2018.06.25-Wolfe.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.25,2018.06.25,6303.0,,
1,2018.06.18,18-Jun-2018,2018.0,Unprovoked,USA,Georgia,"St. Simon Island, Glynn County",Standing,Adyson McNeely,F,...,,"K.McMurray, TrackingSharks.com",2018.06.18-McNeely.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.18,2018.06.18,6302.0,,
2,2018.06.09,09-Jun-2018,2018.0,Invalid,USA,Hawaii,"Habush, Oahu",Surfing,John Denges,M,...,,"K.McMurray, TrackingSharks.com",2018.06.09-Denges.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.09,2018.06.09,6301.0,,
3,2018.06.08,08-Jun-2018,2018.0,Unprovoked,AUSTRALIA,New South Wales,Arrawarra Headland,Surfing,male,M,...,2 m shark,"B. Myatt, GSAF",2018.06.08-Arrawarra.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.08,2018.06.08,6300.0,,
4,2018.06.04,04-Jun-2018,2018.0,Provoked,MEXICO,Colima,La Ticla,Free diving,Gustavo Ramos,M,...,"Tiger shark, 3m",A .Kipper,2018.06.04-Ramos.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.04,2018.06.04,6299.0,,


We work on a copy of the original dataset so that the raw data stays untouched.

In [4]:
cleaned_dataset=original_dataset.copy()
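
As a minimal sketch of why the copy matters, the toy frame below (hypothetical values, not rows from `attacks.csv`) is modified through its copy while the original keeps its contents. `DataFrame.copy()` is deep by default, so the two frames do not share data.

```python
import pandas as pd

# Toy stand-in for the real data (hypothetical values)
original = pd.DataFrame({"Country": ["USA", "MEXICO"], "Year": [2018.0, 2018.0]})

# copy() is deep by default: the new frame owns its own data
cleaned = original.copy()

# Modify the copy in place and drop a column from it
cleaned.loc[0, "Country"] = "unknown"
cleaned = cleaned.drop(columns=["Year"])

# The original still has both columns and its first value
print(original.columns.tolist())
print(original.loc[0, "Country"])
```

Had we written `cleaned = original` instead, both names would point to the same object and the in-place assignment would have changed the original as well.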

# Looking for NaN values

First, we check the last two columns, which appear to contain only NaN values.

In [5]:
print(f'Since the original data set has {cleaned_dataset["Unnamed: 22"].size} rows and the column Unnamed:22 \
has {sum(cleaned_dataset["Unnamed: 22"].isna())}  NaN values we can safely drop it.')

Since the original data set has 25723 rows and the column Unnamed:22 has 25722  NaN values we can safely drop it.


In [6]:
print(f'Since the original data set has {cleaned_dataset["Unnamed: 23"].size} rows and the column Unnamed:23 \
has {sum(cleaned_dataset["Unnamed: 23"].isna())}  NaN values we can safely drop it.')

Since the original data set has 25723 rows and the column Unnamed:23 has 25721  NaN values we can safely drop it.


In [7]:
cleaned_dataset=cleaned_dataset.drop(columns=["Unnamed: 23","Unnamed: 22"])
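
The same check can be run on every column at once instead of one column per cell. The sketch below uses a toy frame with hypothetical values, and the cut-off `max_nan_share` is an assumption that would have to be tuned. In this dataset it would need to be close to 1, because most columns are also sparse until the empty rows are removed.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): column "junk" is almost entirely NaN
df = pd.DataFrame({
    "Country": ["USA", "USA", None, "MEXICO"],
    "Year": [2018.0, 2018.0, np.nan, 2018.0],
    "junk": [np.nan, np.nan, np.nan, "x"],
})

max_nan_share = 0.7  # assumed cut-off, not taken from the notebook

# Fraction of NaN values per column (mean of a boolean mask)
nan_share = df.isna().mean()

# Columns whose NaN share exceeds the cut-off
to_drop = nan_share[nan_share > max_nan_share].index.tolist()

df = df.drop(columns=to_drop)
print(to_drop)
print(df.columns.tolist())
```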

The cell below shows that we are far from done with the missing values. Comparing the NaN counts with the total number of rows (25723), it seems that many rows are completely empty. Let's check that by creating a new column that counts the NaN values in each row.

In [8]:
cleaned_dataset.isna().sum()

Case Number               17021
Date                      19421
Year                      19423
Type                      19425
Country                   19471
Area                      19876
Location                  19961
Activity                  19965
Name                      19631
Sex                       19986
Age                       22252
Injury                    19449
Fatal (Y/N)               19960
Time                      22775
Species                   22259
Investigator or Source    19438
pdf                       19421
href formula              19422
href                      19421
Case Number.1             19421
Case Number.2             19421
original order            19414
dtype: int64

In [9]:
cleaned_dataset["Is_all_NaN"]=cleaned_dataset.isna().sum(axis=1)

cleaned_dataset["Is_all_NaN"].value_counts()

22    17020
21     2394
1      1516
0      1422
2      1200
3      1196
4       540
5       293
6       102
7        26
8         7
20        7
Name: Is_all_NaN, dtype: int64

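So a large number of rows carry no information or almost none (21 or 22 NaN values). We remove them next. Note that `thresh=20` keeps only rows with at least 20 non-NaN values, and the helper column "Is_all_NaN" counts as one of them, so rows with more than three missing fields are dropped as well. Afterwards, we drop the helper column "Is_all_NaN".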

In [10]:
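# Keep rows with at least 20 non-NaN values; `how` is omitted because
# `thresh` overrides it and newer pandas versions reject passing both
cleaned_dataset=cleaned_dataset.dropna(axis=0, thresh=20)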

In [11]:
cleaned_dataset=cleaned_dataset.drop(columns="Is_all_NaN")
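
Since `thresh` counts non-NaN values rather than NaN values, it is easy to misread. The sketch below (toy data, hypothetical values) shows the equivalence between `dropna(thresh=...)` and a filter written as "at most `max_nan` missing values per row". It computes the per-row NaN count as a separate Series, so no helper column ends up inside the frame and gets counted.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with 0, 1, 3 and 1 NaN per row
df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [1.0, 2.0, np.nan, 4.0],
    "c": [1.0, 2.0, np.nan, np.nan],
})

# NaN count per row, kept outside the frame
nan_per_row = df.isna().sum(axis=1)

# thresh=2: keep rows that have at least 2 non-NaN values
kept_thresh = df.dropna(thresh=2)

# Same rule phrased as "at most max_nan missing values"
max_nan = df.shape[1] - 2
kept_mask = df[nan_per_row <= max_nan]

print(nan_per_row.tolist())
print(kept_thresh.index.tolist())
```

The mask form states the rule in terms of missing values, which is what the `value_counts()` output above reports.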

# Duplicated values

Now we look for duplicated rows.

In [12]:
sum(cleaned_dataset.duplicated())

0

So there are no duplicated rows. Now we look for columns with duplicated values.

In [13]:
cleaned_dataset[['Case Number','Date','Case Number.1','Case Number.2']]

Unnamed: 0,Case Number,Date,Case Number.1,Case Number.2
0,2018.06.25,25-Jun-2018,2018.06.25,2018.06.25
1,2018.06.18,18-Jun-2018,2018.06.18,2018.06.18
2,2018.06.09,09-Jun-2018,2018.06.09,2018.06.09
3,2018.06.08,08-Jun-2018,2018.06.08,2018.06.08
4,2018.06.04,04-Jun-2018,2018.06.04,2018.06.04
...,...,...,...,...
6290,ND.0012,Before 19-Jul-1913,ND.0012,ND.0012
6296,ND.0006,Before 1906,ND.0006,ND.0006
6297,ND.0005,Before 1903,ND.0005,ND.0005
6299,ND.0003,1900-1905,ND.0003,ND.0003


In [14]:
cleaned_dataset=cleaned_dataset.drop(columns=['Case Number.1','Case Number.2'])

With "Case Number" the comparison was easy to do by eye, but URLs are harder to compare that way, so we check them with a boolean comparison.

In [17]:
cleaned_dataset['href formula']==cleaned_dataset['href']

0       True
1       True
2       True
3       True
4       True
        ... 
6290    True
6296    True
6297    True
6299    True
6301    True
Length: 5334, dtype: bool
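
The printed Series only shows the first and last rows, so a mismatch in the middle could go unnoticed. A minimal sketch of a fuller check follows, using toy data with hypothetical URLs. It reduces the comparison to a single answer and lists the rows that differ. Because `NaN == NaN` evaluates to `False`, rows where both columns are missing are treated as equal explicitly.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical URLs): row 2 is NaN in both columns, row 3 differs
df = pd.DataFrame({
    "href formula": ["http://a", "http://b", np.nan, "http://d"],
    "href":         ["http://a", "http://b", np.nan, "http://x"],
})

# Element-wise comparison; NaN == NaN is False
same = df["href formula"] == df["href"]

# Rows where both values are missing count as matching
both_nan = df["href formula"].isna() & df["href"].isna()
same_or_both_nan = same | both_nan

# One boolean for the whole column pair, plus the offending rows
print(same_or_both_nan.all())
mismatches = df[~same_or_both_nan]
print(mismatches)
```

If `mismatches` comes back empty on the real data, one of the two columns is redundant. `Series.equals` offers a shorter all-or-nothing test that also treats NaN values in the same position as equal.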