# Would sharks rather work a 9-5 "desk job" or go freelance? 

**Criteria used to define their preference:**
  ***

&nbsp;
  >1. *ENJOYS WORKING AT THESE HOURS*
   - This is the first point to be considered when choosing a non-flexible job. For a shark, the word "work" would mean "hunting" since eating requires intense labor to meet their physiological needs. They do not eat for pleasure like humans do.
  &nbsp;
  >2. *LIKES WORKING CLOSE TO OTHER PEOPLE (or other sharks)*
   - Although many "desk job" positions can be done remotely, a research by Ladders showed that by the end of 2022 only around 25% of jobs offered by the top 50,000 employers in North America would be remote. That means that, statistically, if a shark decides to work a 9-5 job, it is highly likely that it will be working close to other sharks.
&nbsp;
  >3. *DOES MAINLY THE SAME TYPE OF JOB EVERY DAY*
   - Having a traditional 9-5 "desk job" usually means doing the same type of tasks every day. Picture a data analyst, for instance: even though the data might differ, most of the work consists of cleaning, analysing and then visualizing the data. In shark terms this would be translated into attacking the same type of people in a similar way.
  &nbsp;
  >4. *PREFERS HAVING VACATIONS DURING HIGH SEASONS*
 - Although this is not necessarily always true, a non-flexible type of job might make it challenging for a employee to have several days off during months in which this is not expected from them (because of social and/or family reasons). Therefore, it is reasonable to assume that it would be easier to ask for days off during the holidays (December) or school vacations (July) since it is a societal convention.

# Exploration

In [1]:
# Libraries:

import pandas as pd
import re 
import numpy as np
import plotly.express as px
import chart_studio.plotly as py
import plotly.graph_objects as go


In [2]:
df_attacks = pd.read_csv ('./data/attacks.csv', encoding='unicode_escape')
df_attacks.sample(5)

# print(df_attacks.isna().sum())
# df_attacks.shape


Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,...,Species,Investigator or Source,pdf,href formula,href,Case Number.1,Case Number.2,original order,Unnamed: 22,Unnamed: 23
10431,,,,,,,,,,,...,,,,,,,,,,
16735,,,,,,,,,,,...,,,,,,,,,,
18061,,,,,,,,,,,...,,,,,,,,,,
7927,0.0,,,,,,,,,,...,,,,,,,,,,
17296,,,,,,,,,,,...,,,,,,,,,,


In [3]:
df_attacks.dropna(axis=0, inplace=False, how="all")

#print(df_attacks.isna().sum())

df_attacks["Investigator or Source"].sample(15)

# Checking the columns I realized everything below Species doesn't seem to be relevant.
# I'll create a copy with only the data I intend to analyse

df_attacks_2 = df_attacks.drop(['Investigator or Source', 'pdf', 'href formula', 'href', 'Case Number.1', 'Case Number.2', 'original order', 'Unnamed: 22', 'Unnamed: 23'], axis=1)


In [4]:
# I realized many columns had all NaN values but the number "0" in the Case Number column
# I used the dropna combined with a threshold of 2 to delete these columns since they had no information. 
# The way the threshold works is that at least 2 values in the row have to be not null in order for the row not to be deleted.

df_attacks_2.dropna(axis=0, inplace=True, thresh=2)

df_attacks_2.head(3)


Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,Age,Injury,Fatal (Y/N),Time,Species
0,2018.06.25,25-Jun-2018,2018.0,Boating,USA,California,"Oceanside, San Diego County",Paddling,Julie Wolfe,F,57,"No injury to occupant, outrigger canoe and pad...",N,18h00,White shark
1,2018.06.18,18-Jun-2018,2018.0,Unprovoked,USA,Georgia,"St. Simon Island, Glynn County",Standing,Adyson McNeely,F,11,Minor injury to left thigh,N,14h00 -15h00,
2,2018.06.09,09-Jun-2018,2018.0,Invalid,USA,Hawaii,"Habush, Oahu",Surfing,John Denges,M,48,Injury to left lower leg from surfboard skeg,N,07h45,


In [6]:
# Cleaning the Date so I have Months and Years in different columns
# The original Year column had over 800 NaNs even when the year was explicitly shown in the Date column
# To solve this I used regex to gather the months from the Date column and went to 19 NaNs instead

df_attacks_2[["Month"]] = df_attacks_2["Date"].str.lower().str.extract(r'-(\w{3})-')
df_attacks_2[["Year"]] = df_attacks_2["Date"].str.extract(r'(\d{4})')


df_attacks_2['Month'].isna().sum()
df_attacks_2['Year'].isna().sum()

df_attacks_2.drop(['Case Number'], axis=1, inplace=True)

# print(df_attacks_2['Year'])
# print(df_attacks_2['Date'])

In [7]:
print(df_attacks_2.isna().sum())

df_attacks_2.shape

# Here I can have some insight on which datapoints might be worth exploring in order to formulate a hypothesis
# Datapoints with a large percentage of NaNs will be difficult to draw conclusions from


Date              0
Year             19
Type              4
Country          50
Area            455
Location        540
Activity        544
Name            210
Sex             565
Age            2831
Injury           28
Fatal (Y/N)     539
Time           3354
Species        2838
Month           910
dtype: int64


(6302, 15)

In [8]:
# Checking the size of the dataframe and the unique entries for the categories Activity, Type, Country and Area

# print("\n\nNumber of rows and columns", df_attacks_2.shape)

# print("\n\nUnique descriptions for Type", df_attacks_2['Type'].value_counts())
# print("\n\n\nUnique descriptions for Activity", df_attacks_2['Activity'].value_counts().sum())
# print("\n\n\nUnique descriptions for Area", df_attacks_2['Area'].value_counts())
# print("\n\n\nUnique descriptions for Country", df_attacks_2['Country'].value_counts())
# print("\n\n\nUnique descriptions for Time", df_attacks_2['Time'].value_counts())

#print("\n\n\nUnique descriptions for Country", df_attacks_2['Country'].value_counts().head(25))

In [9]:

# Analysing the time patterns to see if they fit the "traditional" category or the freelancing one
# For the purposes of the analysis, "traditional" is any time between 8am and 6pm

#df_attacks_2[df_attacks_2["Case Number"].apply(lambda x: len(str(x))!=10)]

def time_habits (time):  
    '''
    Function that receives the elements in the column 'Time' and categorizes them into traditional working hours or 
    freelance working hours. 
    '''
    
    try: 
        if 8 <= int(time) <= 18: 
            return "Trad."
        elif 19 <= int(time) <= 7:
            return "Freela."
      
    
    except Exception: 
        
        if ('noon' in str(time).lower()) or ('morning' in str(time).lower()) or ('evening' in str(time).lower()) or ('dusk' in str(time).lower()): 
            return "Trad."
    
        elif ('midday' in str(time).lower()) or ('sunset' in str(time).lower()) or ('a.m' in str(time).lower()): #broken for readability
            return "Trad."
        
        elif ('night' in str(time).lower()) or ('p.m' in str(time).lower()):
            return "Freela."
        
        else:
            return "Unknown"
    


In [10]:
df_attacks_2['Time'] = df_attacks_2['Time'].str.replace(r"(h\d\w*)", r"", regex=True) # Standardizing the time to only have the "hour" digits


df_attacks_2["Traditional or Freelance?"] = df_attacks_2['Time'].apply(lambda x: (time_habits(x)))
df_attacks_2["Traditional or Freelance?"] = df_attacks_2["Traditional or Freelance?"].fillna("Unknown")



print("\nUnique descriptions for Traditional or Freelance\n\n", df_attacks_2['Traditional or Freelance?'].value_counts())
print(df_attacks_2["Traditional or Freelance?"].isna().sum())

# Over half of the sample consists of unknown times, therefore the conclusion cannot be precise.
# Another important point is that the vast majority of people go into the sea during the day only.
# Therefore, it would be far more rare for a shark to attack people during the night.


Unique descriptions for Traditional or Freelance

 Unknown    3692
Trad.      2526
Freela.      84
Name: Traditional or Freelance?, dtype: int64
0


In [11]:
# The objective is to get a list of the top 5 countries where most attacks happened
# From this, the var Total_prefered_countries saves the sum of the attacks in these countries only


df_attacks_2['Attacks_country'] = df_attacks_2.groupby('Country')['Country'].transform('count')

Total_prefered_countries = df_attacks_2["Country"].value_counts()[:5]  

print("\n\nNumber of attacks in top 5 countries:\n", Total_prefered_countries)
print("\n\nNumber of attacks in top 5 countries:", Total_prefered_countries.sum())
print("\nTotal number of attacks registered:", df_attacks_2['Country'].value_counts().sum())
print("\n\nNumber of countries registered:\n\n", df_attacks_2['Country'].value_counts())

# More than 70% of attacks registered were concentrated in these 5 countries.
# It can be argued that this is a matter of attacks being registered into a dataframe only in these countries
# other than the real number of attacks in the world.
# Since this theory can't be corroborated here, it will be assumed that the data reflects the real world.



Number of attacks in top 5 countries:
 USA                 2229
AUSTRALIA           1338
SOUTH AFRICA         579
PAPUA NEW GUINEA     134
NEW ZEALAND          128
Name: Country, dtype: int64


Number of attacks in top 5 countries: 4408

Total number of attacks registered: 6252


Number of countries registered:

 USA                       2229
AUSTRALIA                 1338
SOUTH AFRICA               579
PAPUA NEW GUINEA           134
NEW ZEALAND                128
                          ... 
MALDIVE ISLANDS              1
NICARAGUA                    1
NORTH SEA                    1
RED SEA / INDIAN OCEAN       1
CEYLON (SRI LANKA)           1
Name: Country, Length: 212, dtype: int64


In [12]:
df_attacks_2['Activity'].value_counts()

Surfing                                   971
Swimming                                  869
Fishing                                   431
Spearfishing                              333
Bathing                                   162
                                         ... 
Playing with a frisbee in the shallows      1
Sinking of the ferryboat Dumaguete          1
Wreck of the Storm King                     1
Feeding mullet to sharks                    1
Wreck of  large double sailing canoe        1
Name: Activity, Length: 1532, dtype: int64

In [13]:
def clean_columns(col, pats, subs):
    
    '''
    Function that receives the name of a column, a list of patterns to be looked in it,
    a list of substitutions for these patterns (both lists need to match in legth and position)
    and updates the column with the changes.
    The first line transforms the type of the column in string so the NaN don't break the lambda function
    '''
    
    df_attacks_2[col] = df_attacks_2[col].astype('str')
    
    for n in range(len(pats)):
        
        df_attacks_2[col] = df_attacks_2[col].apply(lambda x: subs[n] if pats[n] in x.lower() else x)


In [14]:
# Standardizing the Activity column by searching key words from the top 5 results 
# since there are too many different descriptions

clean_columns('Activity', ["surf", "swim", "fish", "bath", "dive"],["Surfing", "Swimming", "Fishing", "Bathing", "Diving"])

df_attacks_2['Activity'].value_counts()

Surfing                                                                                                                                                  1261
Fishing                                                                                                                                                  1181
Swimming                                                                                                                                                 1106
nan                                                                                                                                                       544
Bathing                                                                                                                                                   189
                                                                                                                                                         ... 
Overturned skiff                                    

In [16]:
# Standardizing the Injury column by searching key words from the top 5 results since there are too many different descriptions

clean_columns('Injury', ['no injury','survived', 'foot', 'leg', 'bite'],['No serious injury', 'No serious injury', 'Foot bitten', 'Leg bitten', 'Bitten somewhere else'])

print("Top 5 types of injuries:\n\n", df_attacks_2['Injury'].value_counts().head(5))

print("\nSum of top 5 injuries:", df_attacks_2['Injury'].value_counts().head(5).sum())

print("\nSum of other types of injuries:", df_attacks_2['Injury'].value_counts().sum() - 
      df_attacks_2['Injury'].value_counts().head(5).sum())
    

Top 5 types of injuries:

 No serious injury        971
Leg bitten               884
FATAL                    802
Foot bitten              780
Bitten somewhere else     65
Name: Injury, dtype: int64

Sum of top 5 injuries: 3502

Sum of other types of injuries: 2800


In [17]:
# Type consists of provoked or unprovoked
# Instances such as "boat", "boating" or "invalid" don't state if the shark was provoked or not
# so they were grouped under the category "Questionable" that already existed in the dataframe

clean_columns('Type', ['boat', 'invalid', 'sea disaster'], ['Questionable', 'Questionable', 'Questionable'])

df_attacks_2['Type'].value_counts()

Unprovoked      4595
Questionable    1129
Provoked         574
nan                4
Name: Type, dtype: int64

In [18]:
df_attacks_2['Month'].value_counts()

# The top 5 countries with most attacks have holidays/school vacations during jul/aug and dec since their culture is mostly western

jul    621
aug    556
sep    521
jan    494
jun    475
apr    420
oct    417
dec    415
mar    381
nov    378
may    358
feb    356
Name: Month, dtype: int64

# Visualizations

## Working hours

**The first criterion is the time preference. Traditional 'desk jobs' hours usually fall within the range of 8am to 7pm, so for the purposes of this analysis if a shark works (read: hunts) during these times then it prefers a 9-5 type of job.**

The 'time' column was categorized in Traditional (Trad.) or Freelance (Freela.) and below we can see the results plotted.

In [43]:
fig = go.Figure()
fig.add_trace(go.Histogram(histfunc="count",  x=df_attacks_2["Traditional or Freelance?"], opacity = 1))

fig.update_layout({

'plot_bgcolor': 'rgba(100, 0, 100, 0)',
'paper_bgcolor': 'rgba(0, 0, 0, 0)'
})

fig.update_traces(
                  marker_color='lightblue',
                  marker_line_color='blue',
                  marker_line_width=1, opacity=1
)

fig.show()

### Conclusion


- Over half of the sample consists of unknown times, therefore the conclusion cannot be precise. Another important point is that the vast majority of people go into the sea during the day only. Therefore, it would be far more rare for a shark to attack people during the night. Besides that, the data is not organized taking into consideration weekdays or weekends. During the weekends, holidays or vacations days, all the attacks would be classified as "Freela". This would be something I would have explored if I had more time.  


- Consenquently, although the results indicate a preference for traditional working hours, without the possibility of replicating the analysis in a setting where people entered the sea at night as much as during the day, it becomes impossible to say if they do prefer this time or if it is the only possible time for them to 'work'.



## Workplace  

**The second criterion is workplace. Here, the places where the attacks happened are analysed in order to check if most attacks normally happen in the same few locations or if they are heavily scattered through several places.**

Possible conclusions:
- If most attacks happen in a few select locations this indicates sharks indeed like to work close to one another. Therefore, they prefer a more "traditional" work setting;  


- On the other hand, if they are scattered, this would indicate that they do not choose their place of work necessarily based on where other sharks work. In this case, it can be concluded that sharks prefer a rather flexible form of work, a.k.a freelancing.


In [38]:
fig2 = px.bar(Total_prefered_countries, title = "Top 5 most attacked countries", 
              labels = {'index': 'Country', 'value':'No. of attacks' })

fig2.update_layout({

'plot_bgcolor': 'rgba(100, 0, 50, 0)',
'paper_bgcolor': 'rgba(0, 0, 0, 0)'
})

fig2.update_traces(
                  marker_color='lightblue',
                  marker_line_color='blue',
                  marker_line_width=1, opacity=1
)

fig2.show()


In [50]:
print("\nNumber of attacks in top 5 countries:", Total_prefered_countries.sum())
print("\nTotal number of attacks registered:", df_attacks_2['Country'].value_counts().sum())
print("\n\nNumber of countries registered:\n\n", df_attacks_2['Country'].value_counts())



Number of attacks in top 5 countries: 4408

Total number of attacks registered: 6252


Number of countries registered:

 USA                       2229
AUSTRALIA                 1338
SOUTH AFRICA               579
PAPUA NEW GUINEA           134
NEW ZEALAND                128
                          ... 
MALDIVE ISLANDS              1
NICARAGUA                    1
NORTH SEA                    1
RED SEA / INDIAN OCEAN       1
CEYLON (SRI LANKA)           1
Name: Country, Length: 212, dtype: int64


### Conclusion: 

- More than 70% of attacks registered were concentrated in these 5 countries.  


- It can be argued that this is a matter of attacks being registered into a dataframe only in these countries other than the real number of attacks in the world. Since this theory can't be corroborated here, it will be assumed that the data reflects the real world.  


- Based on this criterion alone, sharks seem to prefer a more traditional setting.

## Same type of job

**The third criterion is the type of work they usually do. For this, first we see if sharks normally kill their victims or not, and second, we look for a correlation between the type of injury (fatal or not) and what activity the human was performing. For the analysis, only the top 5 types of activity were taken into consideration since the other were not standardized and varied vastly.**

Possible conslusions:

- If there is a pattern between these two things, then sharks operate in a similar manner most of the time, therefore, they do the same type of 'job' and into de category of 'traditional' work;  


- If the correlation is not expressive, then they are erratic in the way they work and are more similar to a freelance worker.


In [40]:

# 1 IN GENERAL, DO THEY KILL THEIR HUMAN VICTIMS?

subset1 = df_attacks_2.loc[(df_attacks_2["Fatal (Y/N)"].isin(["N", "Y"]))]
caca = subset1.groupby(["Fatal (Y/N)"]).agg({"Date" : "count"}).rename(columns={'Date': 'count'}).reset_index()
caca['%'] = 100 * caca['count'] / caca['count'].sum()
caca

fig3 = px.bar(caca,  y="%", x="Fatal (Y/N)", barmode="group")

fig3.update_layout({

'plot_bgcolor': 'rgba(100, 0, 50, 0)',
'paper_bgcolor': 'rgba(0, 0, 0, 0)'
})

fig3.update_traces(
                  marker_color='lightblue',
                  marker_line_color='blue',
                  marker_line_width=1, opacity=1
)

fig3



In [31]:
#2 DOES THE KILL RATIO DEPEND ON THE ACTIVITY THE HUMAN WAS DOING?

df_excl_nan_activities = df_attacks_2[df_attacks_2["Activity"]!="nan"]
top_activities = df_excl_nan_activities["Activity"].value_counts().head(5).index

subset = df_attacks_2.loc[(df_attacks_2["Activity"].isin(top_activities)) & (df_attacks_2["Fatal (Y/N)"].isin(["N", "Y"]))]

graph = subset.groupby(["Activity", "Fatal (Y/N)"]).agg({"Date" : "count"}).rename(columns={'Date': 'count'})

graph['%'] = 100 * graph['count'] / graph.groupby('Activity')['count'].transform('sum')
graph = graph.reset_index()
graph

fig4 = px.bar(graph, x="Activity", y="%", color="Fatal (Y/N)", barmode="group")

fig4.update_layout({

'plot_bgcolor': 'rgba(100, 0, 50, 0)',
'paper_bgcolor': 'rgba(0, 0, 0, 0)'
})


fig4


### Conclusions

- In general, sharks do not kill their victims, but looking at the percentage of kills according to the different types of activities it becomes clear that they do not act the same (or do the same type of "job") all the time;  


- Based on these 5 activities, sharks seem to follow a pattern when it comes to surfing, fishing and wading, but act more erratically in the others;  


- Still, most of the time they seem to repeat a pattern and, therefore, fall into the category of a 9-5 job.  


## Vacation time

**The last criterion refers to vacations. For traditional jobs, it's more common that vacation days are taken during the holidays and school vacation months, that is, jul/aug and dec, specially in countries where the employee is not allowed to take a few days off. For example, in Brazil a worker can only take between 15 and 30 days at a time, so it's more likely for the employer to accept dates that are the social norm.**

Possible conclusions:

- If there are fewer attacks during these months, then the sharks follow a more traditional vacation time;  


- Otherwise, they are more flexible with vacations and fall into the category of freelancing.

In [47]:

fig5 = go.Figure()
fig5.add_trace(go.Histogram(histfunc="count",  x=df_attacks_2["Month"], opacity = 1))

fig5.update_xaxes(categoryorder='array', categoryarray= ['jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 
                                                        'aug', 'sep', 'oct', 'nov', 'dec'])

fig5.update_layout({
                'plot_bgcolor': 'rgba(100, 0, 50, 0)',
                'paper_bgcolor': 'rgba(0, 0, 0, 0)'
})

fig5.update_traces(
                  marker_color='lightblue',
                  marker_line_color='blue',
                  marker_line_width=1, opacity=1
)

fig5.show()

### Conclusions

- Sharks do tend to work more during vacation time. However, this is also a period where more people go to beach, so it becomes impossible to objectively analyse this data point without a proportional comparison of the number of beach visitors throughout the year;  



- Another aspect to be taken into consideration is that the top 5 attacked countries are in different hemispheres, so the months  represent different seasons in each country and this should also be accounted for;  



- In the end, using simply this data, we can conclude that when it comes to vacation sharks behave as freelancers, not caring about holidays at all!


# Grand finale: would sharks be freelancers or not?!

### The answer seems to be no! Apparently, sharks are all about their 9 to 5 jobs.

- It is important to mention that the results are, of course, unreliable due to the nature of the dataframe analysed. However, I personally like to imagine sharks wearing suits and going to their desk jobs every day. It makes me feel less alone, don't know about you :) 