# Baby shark 🦈
## *(Doo doo doo doo doo doo!)*

### 1. Data acquisition and workspace setup
Import the libraries and locate the file containing the data.

In [1]:
#Import pandas library
import pandas as pd
import numpy as np

#Read the datasheet and force its encoding to 'ISO-8859-1' ('utf-8' default returns an error)
sharks_df = pd.read_csv("../resources/GSAF5.csv", encoding = "ISO-8859-1")

### 2. Data wrangling
To spot potential problems in the dataset, each column is examined after some general preprocessing:

##### 2.1 General transformations

In [2]:
def remove_worthless_data(df):
    """
    Removes worthless data from the dataframe.
    """
    
    df = df.replace(["No date", "Invalid", "Unknown"], np.nan)
    return df

#Apply the function
sharks_df = remove_worthless_data(sharks_df)

In [3]:
def strip_dataframe(df):
    """
    Strip leading and trailing spaces of all strings and headers in dataframe.
    """
    
    #Strip the headers
    df.columns = df.columns.str.strip()    
    #Strip cells (non-string values are left untouched)
    stripped_string = lambda cell: cell.strip() if isinstance(cell, str) else cell
    
    return df.applymap(stripped_string)

#Apply the function
sharks_df = strip_dataframe(sharks_df)

In [4]:
def lower_headers(df):
    """
    Convert all headers of dataframe to lowercase.
    """
    df.columns = [header.lower() for header in df.columns]
    
#Apply the function
lower_headers(sharks_df)

In [5]:
def remove_empty_series(df, limit):
    """
    Go through dataframe series and remove the columns with more empty values than the given percentage (0-100).
    """
    
    rows = len(df.index)
    
    #Iterate over a copy of the column labels, since columns are deleted inside the loop
    for column in list(df.columns):
        empty_values = df[column].isna().sum()
        empty_values_percentage = (empty_values/rows)*100
        
        if empty_values_percentage > limit:
            del df[column]
            
    return df

#Apply the function with 90% as maximum empty cells
sharks_df = remove_empty_series(sharks_df, 90)

##### 2.2 'case number' column
There are three *'case number'* columns. The standard ID for each attack is generated from its date in *yyyy/mm/dd* format, so the workflow will be:
1. Remove the second and third *'case number'* columns
2. Create a *'date of report'* column indicating whether the date is the date of the attack (*False*) or the date of the report (*True*), and remove this information from *'date'*
3. Wrangle the *'date'* column
4. Regenerate *'case number'* from the fixed *'date'* column
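Step 4 of this workflow could look like the following minimal sketch. It assumes the *'date'* value has already been cleaned to *yyyy-mm-dd*; `case_number_from_date` is a hypothetical helper, not part of the notebook, and anything that is not a full date is returned as missing.

```python
import re

def case_number_from_date(date):
    """
    Build a 'yyyy/mm/dd' case number from a cleaned 'yyyy-mm-dd' date (hypothetical helper).
    """
    #Missing or non-string values cannot produce an ID
    if not isinstance(date, str):
        return None

    match = re.fullmatch(r"(\d{4})-(\d{2})-(\d{2})", date.strip())
    if match is None:
        return None

    #Join year, month and day with slashes
    return "/".join(match.groups())

print(case_number_from_date("2018-06-25"))
```

It could then be applied cell by cell, for example with `sharks_df["date"].map(case_number_from_date)`.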

In [6]:
#Remove 'case number.1' and 'case number.2'
del sharks_df["case number.1"]
del sharks_df["case number.2"]

In [7]:
#Create a 'date of report' column (True when the date is the report date, not the attack date)
sharks_df["date of report"] = sharks_df["date"].str.lower().str.contains("reported", na=False)

#Rearrange columns
new_col_order = ["case number", "date", "date of report", "year", "type", "country", "area", "location", "activity", "name", "sex", "injury", "fatal (y/n)", "species", "investigator or source", "pdf", "href formula", "href", "original order"]
sharks_df = sharks_df[new_col_order]

#Lowercase the 'date' column and delete the "reported" substrings
sharks_df["date"] = sharks_df["date"].str.lower().str.replace("reported", "", regex=False).str.strip()

In [29]:
#Wrangle 'date' column
def clean_dates(df_serie):
    """
    First steps towards converting all dates to yyyy-mm-dd, yyyy-mm or yyyy, when possible:
    unify the date separators and convert months in letters to numbers.
    """
    
    #Convert every date separator between two digits into a dash
    #(lookarounds keep the digits untouched, so consecutive separators are all replaced)
    df_serie = df_serie.replace({r"(?<=\d)[^\w\n\-](?=\d)" : "-"}, regex=True)

    #Convert months in letters to numbers (the pattern also swallows the rest of the month name)
    months_dictionary = {
                "jan":"01",
                "feb":"02",
                "mar":"03",
                "apr":"04",
                "may":"05",
                "jun":"06",
                "jul":"07",
                "aug":"08",
                "sep":"09",
                "oct":"10",
                "nov":"11",
                "dec":"12"
                }
    df_serie = df_serie.str.lower()
    for month, number in months_dictionary.items():
        df_serie = df_serie.str.replace(rf"{month}[a-z]*", number, regex=True)
    
    #TODO: 4 digits for year
    
    #TODO: change formats from dd-mm-yyyy to yyyy-mm-dd (the year has to be filled in first)
    #df_serie = df_serie.replace({r"(\d\d)-(\d\d)-(\d\d\d\d)" : r"\3-\2-\1"}, regex=True)
    
    return df_serie
    
sharks_df["date"] = clean_dates(sharks_df["date"])
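The reordering step that `clean_dates` leaves commented out could be sketched per cell as below. This is only an illustration under assumptions: `reorder_date` is a hypothetical helper, it expects dash-separated numeric dates, it does not check that day and month are in range, and it leaves two-digit years alone because the century cannot be inferred from the value itself.

```python
import re

def reorder_date(date):
    """
    Turn 'dd-mm-yyyy' into 'yyyy-mm-dd' and 'mm-yyyy' into 'yyyy-mm' (hypothetical helper).
    Anything else is returned unchanged (strings are only stripped).
    """
    if not isinstance(date, str):
        return date
    text = date.strip()

    #Full date: day, month and 4-digit year
    match = re.fullmatch(r"(\d{1,2})-(\d{1,2})-(\d{4})", text)
    if match:
        day, month, year = match.groups()
        return f"{year}-{int(month):02d}-{int(day):02d}"

    #Month and 4-digit year only
    match = re.fullmatch(r"(\d{1,2})-(\d{4})", text)
    if match:
        month, year = match.groups()
        return f"{year}-{int(month):02d}"

    return text

print(reorder_date("25-06-2018"))
```

One way to use it would be `sharks_df["date"].map(reorder_date)` after `clean_dates` has run.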

In [30]:
sharks_df.to_csv("out.csv")
sharks_df

##### 3.2 Date
The standard format here is dd-mm-yy, but it is inconsistent over time and can appear as:
1. yyyy
2. dd-mm-yyyy
3. mm-yy

Or even as strings such as:
4. "Reported dd-mm-yyyy" *(or variations)*
5. Periods as text: "Before the war", "During the war", "Winter 1969", etc.
6. Notations
7. Random: "Last incident of 1994 in Hong Kong"
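To get a feeling for how often each of these shapes occurs, the values could be classified with a few regular expressions. The sketch below is an assumption-laden illustration: `classify_date` and its labels are hypothetical, it expects dash-separated numeric dates, and everything it does not recognise falls into "other".

```python
import re
from collections import Counter

#Shapes taken from the list above; the labels are hypothetical names
DATE_PATTERNS = [
    ("yyyy", re.compile(r"\d{4}")),
    ("dd-mm-yyyy", re.compile(r"\d{1,2}-\d{1,2}-\d{4}")),
    ("mm-yy", re.compile(r"\d{1,2}-\d{2}")),
]

def classify_date(value):
    """
    Return a label describing the shape of a date value (hypothetical helper).
    """
    if not isinstance(value, str):
        return "missing"
    text = value.strip().lower()

    #"Reported ..." values are flagged before looking at the date itself
    if "reported" in text:
        return "reported"

    for label, pattern in DATE_PATTERNS:
        if pattern.fullmatch(text):
            return label
    return "other"

samples = ["1994", "25-06-2018", "06-18", "Reported 25-06-2018", "Before the war", None]
print(Counter(classify_date(sample) for sample in samples))
```

On the real data, something like `sharks_df["date"].map(classify_date).value_counts()` would give the tally per shape.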

##### 3.3 Year
Standard format is *yyyy*, but *0* appears in some rows.
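One possible way to deal with the *0* values is to fall back on a four-digit year found in the *'date'* text. This is a sketch, not the notebook's method: `fix_year` is a hypothetical helper, and it assumes that the first four-digit number in the date is the year.

```python
import math
import re

def fix_year(year, date):
    """
    Return the year as int, falling back on the 'date' text when the year is 0 or missing
    (hypothetical helper).
    """
    #A positive number is taken as a valid year
    if isinstance(year, (int, float)) and not math.isnan(year) and year > 0:
        return int(year)

    #Otherwise look for a standalone 4-digit number in the date
    if isinstance(date, str):
        match = re.search(r"\b(\d{4})\b", date)
        if match:
            return int(match.group(1))

    return None

print(fix_year(0, "Winter 1969"))
```

It needs both columns, so it could be applied row by row, for example with `sharks_df.apply(lambda row: fix_year(row["year"], row["date"]), axis=1)`.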

##### 3.4 Type
Both *Boat* and *Boating* appear as separate values.
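The two spellings could be merged with a small alias table. In this sketch `unify_type` and `TYPE_ALIASES` are hypothetical names, and choosing *Boating* as the canonical value is an arbitrary assumption.

```python
#Hypothetical alias table: lowercase spelling -> canonical value
TYPE_ALIASES = {"boat": "Boating", "boating": "Boating"}

def unify_type(value):
    """
    Map the known spelling variants of 'type' to a single value (hypothetical helper).
    """
    if not isinstance(value, str):
        return value
    text = value.strip()
    #Values without an alias are returned as they are (only stripped)
    return TYPE_ALIASES.get(text.lower(), text)

print(unify_type("Boat"))
```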

##### 3.5 Country
Some of the fields are empty.

##### 3.6 Area
Some of the areas are empty. Others start with a space or are written as coordinates.

##### 3.7 Location
Adds more info to *Area* column. It appears to be optional and the format varies greatly, so it will not be processed in this exercise.

##### 3.8 Activity
Some are empty. Others could be unified ("*swimming*", "*swimming vigorously*", "*swimming to canoe*") for better data analysis.
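A simple way to unify these variants is to look for a known keyword inside the text. The sketch below is only an illustration: `unify_activity` is a hypothetical helper and the keyword list is an assumption that would have to be built from the real values. The first keyword found wins, so the order of the list matters.

```python
#Hypothetical keyword list; the real one should come from inspecting the column
ACTIVITY_KEYWORDS = ["swimming", "surfing", "fishing", "diving"]

def unify_activity(value):
    """
    Reduce an activity description to the first known keyword it contains (hypothetical helper).
    """
    if not isinstance(value, str):
        return value
    text = value.strip().lower()

    for keyword in ACTIVITY_KEYWORDS:
        if keyword in text:
            return keyword

    #No keyword found: keep the lowercased description
    return text

print(unify_activity("Swimming vigorously"))
```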

##### 3.9 Name
Names are not relevant for stats, so these won't be processed.

##### 3.10 Sex
"N" and "." appear where only "M" and "F" were expected.

##### 3.11 Age
Some values are of *int* type (a number only) and others are *strings* ("*8 or 10*", "*from 7 to 14*"...).
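The string values could be reduced to a single number by extracting the digits. In this sketch `parse_age` is a hypothetical helper, and taking the midpoint between the lowest and highest number found is an assumption, not something the dataset prescribes.

```python
import re

def parse_age(value):
    """
    Return the age as float; for texts with several numbers, return the midpoint
    between the lowest and the highest one (hypothetical helper).
    """
    #Numeric values are kept as they are
    if isinstance(value, (int, float)):
        return float(value)
    if not isinstance(value, str):
        return None

    numbers = [int(number) for number in re.findall(r"\d+", value)]
    if not numbers:
        return None

    return (min(numbers) + max(numbers)) / 2

print(parse_age("from 7 to 14"))
```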

##### 3.12 Injury
In some cases this column provides extra information about the damage, so it can be kept as a human-readable value.

##### 3.13 Fatal (Y/N)
It must only contain *"Y"*, *"N"* or *"Unknown"*. Some values have typos or stray spaces.
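A minimal sketch of that rule follows. `clean_fatal` is a hypothetical helper, and sending every unrecognised or missing value to "Unknown" is an assumption.

```python
def clean_fatal(value):
    """
    Normalise a 'fatal (y/n)' value to "Y", "N" or "Unknown" (hypothetical helper).
    """
    if isinstance(value, str):
        #Remove stray spaces and ignore case
        text = value.strip().upper()
        if text in {"Y", "N"}:
            return text
    return "Unknown"

print(clean_fatal(" N"))
```

It could be applied with `sharks_df["fatal (y/n)"].map(clean_fatal)`.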

##### 3.14 Time
*"Afternoon"* or *"Morning"* are mixed with times in numeric format.
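One option is to bring everything down to a period of the day. The sketch below rests on several assumptions: `time_to_period` is a hypothetical helper, numeric times are expected to look like "14h30" or "14:30", and the hour boundaries of each period are arbitrary.

```python
import re

PERIODS = {"morning", "afternoon", "evening", "night"}

def time_to_period(value):
    """
    Map a time value to "Morning", "Afternoon", "Evening" or "Night" (hypothetical helper).
    """
    if not isinstance(value, str):
        return None
    text = value.strip().lower()

    #Values that are already a period are only capitalised
    if text in PERIODS:
        return text.capitalize()

    #Assumed numeric shapes: "14h30" or "14:30"
    match = re.match(r"(\d{1,2})\s*[h:]\s*\d{2}", text)
    if match is None:
        return None

    hour = int(match.group(1))
    if hour > 23:
        return None
    if 6 <= hour < 12:
        return "Morning"
    if 12 <= hour < 18:
        return "Afternoon"
    if 18 <= hour < 22:
        return "Evening"
    return "Night"

print(time_to_period("14h30"))
```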

##### 3.15 Species
That's anarchy. I don't know anything 'bout sharkies, please let me go, my family is waiting for me.

##### 3.16 Investigator or Source
Just more random people and data sources. Processing it would lose information without helping the analysis, so I will preserve it as is.

##### 3.17 pdf
Okay!

##### 3.18 href formula
That seems nice.

##### 3.19 href
And this too.

##### 3.20 Case Number
Column is duplicated.

##### 3.21 Original order