<a href="https://colab.research.google.com/github/MehrNoushR/sharkattPy/blob/main/shark_attacks2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

In [None]:
# Load the dataset
attacks_df = pd.read_csv("attacks.csv", encoding="ISO-8859-1", low_memory=False)

# Display the first few rows of the dataset for exploration
attacks_df.head()


Based on the displayed rows, here are some relevant columns that may help answer the given questions:

* Type: Describes whether the shark attack was provoked or unprovoked.
* Species: Information on the species of shark involved in the attack.
* Activity: Describes the activity the victim was engaged in when the attack occurred.
* Age: Age of the victim. This can be used to address the question about children.
* Fatal (Y/N): Indicates if the attack was fatal or not.

Data Cleaning

* Check for missing values.
* Assess the unique values in the columns of interest, especially to understand the classifications used.


In [None]:
# Checking for missing values in the dataset
missing_values = attacks_df.isnull().sum()

# Display the number of missing values for each column
missing_values

For our questions, the most relevant columns are Type, Species, Activity, Age, and Fatal (Y/N). We can see from the missing values count that:

* Type has 19,425 missing values.
* Species has 22,259 missing values.
* Activity has 19,965 missing values.
* Age has 22,252 missing values.
* Fatal (Y/N) has 19,960 missing values.
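
Raw counts like these are easier to judge as a share of all rows. The following is a minimal sketch, assuming a small made-up frame (`toy_df`, a hypothetical stand-in for `attacks_df`), of how the missing fraction per column and the number of completely empty rows could be checked:

```python
import pandas as pd

# Toy frame standing in for attacks_df (values are made up for illustration)
toy_df = pd.DataFrame({
    "Type": ["Unprovoked", None, "Provoked", None],
    "Age": ["12", None, None, None],
})

# isnull().mean() gives the fraction of missing entries in each column
missing_share = toy_df.isnull().mean()

# Rows where every column is missing carry no information at all
fully_empty_rows = int(toy_df.isnull().all(axis=1).sum())

print(missing_share)
print("Fully empty rows:", fully_empty_rows)
```

If most of the missing values turn out to come from fully empty rows, dropping those rows first (for example with `dropna(how="all")`) may make the later counts easier to interpret.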


In [None]:
# Checking the unique values in the relevant columns
unique_values = {
    "Type": attacks_df["Type"].unique(),
    "Species": attacks_df["Species"].dropna().unique()[:10],  # only first 10
    "Activity": attacks_df["Activity"].dropna().unique()[:10],  # only first 10
    "Age": attacks_df["Age"].dropna().unique()[:10],  # only first 10
    "Fatal (Y/N)": attacks_df["Fatal (Y/N)"].unique(),
}

unique_values

1. Type:

* Boating
* Unprovoked
* Invalid
* Provoked
* Questionable
* Sea Disaster
* Boat
* Boatomg
* (Some missing values)





2. Species (first 10 unique values):

* White shark
* 2m shark
* Tiger shark, 3m
* Tiger shark
* Lemon shark, 3'
* Bull shark, 6'
* Grey reef shark
* Invalid incident Tawny nurse shark, 2m
* Shark involvement not confirmed



3. Activity (first 10 unique values):

* Paddling
* Standing
* Surfing
* Free diving
* Kite surfing
* Swimming
* Fishing
* Walking
* Feeding sharks
* Boogie boarding



4. Fatal (Y/N):

* N
* Y
* M
* UNKNOWN
* 2017
* (Some variations of 'N' and 'Y' with extra spaces)
* (Some missing values)



Observations and Assumptions:

* The Type column includes values like "Invalid" and "Boatomg" (which might be a typo for "Boating"). We should be cautious while considering these values.
* The Species column has various formats for describing sharks. We might need to group similar species or generalize some categories for analysis.
* The Activity column seems relatively straightforward. However, there are numerous unique activities, so we might need to categorize them into broader groups for analysis.
* The Age column includes numeric age values. We can decide on an age threshold to define "children" (e.g., age < 18).
* The Fatal (Y/N) column has some inconsistencies and odd values (like "2017" and "M"). We'll need to clean this up to have a consistent binary classification.
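
One way to handle the suspicious Type values is an explicit mapping. This is a minimal sketch on a made-up Series (`type_col`, hypothetical). Treating "Boat" and "Boatomg" as variants of "Boating" is an assumption, not something the dataset confirms:

```python
import pandas as pd

# Made-up sample of Type values, including the suspected typo
type_col = pd.Series(["Boating", "Boatomg", "Boat", "Unprovoked", None])

# Hypothetical mapping: merging these labels into "Boating" is an assumption
type_fixes = {"Boatomg": "Boating", "Boat": "Boating"}

# replace() only touches the listed values; missing entries stay missing
clean_type = type_col.replace(type_fixes)

print(clean_type.value_counts(dropna=False))
```

Keeping the mapping in a named dictionary makes the assumption visible and easy to revise.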





Question 1: What are the most dangerous types of sharks to humans?

We can group by the Species column and count the number of fatal attacks to determine this.
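
Counting fatal attacks depends on how the odd values in Fatal (Y/N) are handled. As an alternative to recoding them, the sketch below (on a made-up Series, `fatal_raw`, hypothetical) keeps only "Y" and "N" and turns everything else into missing values, so unknown outcomes are not counted as non-fatal. This is one possible design choice, not the approach this notebook takes:

```python
import pandas as pd

# Made-up sample containing the kinds of odd values seen in the column
fatal_raw = pd.Series([" N", "Y", "y", "M", "UNKNOWN", "2017", None])

# Normalize whitespace and case first
fatal_norm = fatal_raw.str.strip().str.upper()

# Keep only the two valid labels; everything else becomes missing (NaN)
fatal_clean = fatal_norm.where(fatal_norm.isin(["Y", "N"]))

print(fatal_clean.value_counts(dropna=False))
```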



In [None]:
# Cleaning up the "Fatal (Y/N)" column: strip whitespace, uppercase, and recode the odd values as "N"
attacks_df["Fatal (Y/N)"] = attacks_df["Fatal (Y/N)"].str.strip().str.upper().replace({"M": "N", "UNKNOWN": "N", "2017": "N"})

In [None]:
# Filtering the data to consider only fatal attacks
fatal_attacks = attacks_df[attacks_df["Fatal (Y/N)"] == "Y"]

In [None]:
# Grouping by the species and counting the number of fatal attacks
dangerous_sharks = fatal_attacks["Species"].value_counts().head(10)

In [None]:
# Set up the visualization style
sns.set_style("whitegrid")

# Plot for the most dangerous types of sharks to humans
plt.figure(figsize=(12, 7))
dangerous_sharks_plot = sns.barplot(x=dangerous_sharks.values,
                                    y=dangerous_sharks.index, palette="viridis")
plt.title("Top 10 Shark Species by Fatal Attacks", fontsize=15)
plt.xlabel("Number of Fatal Attacks", fontsize=12)
plt.ylabel("Shark Species", fontsize=12)
dangerous_sharks_plot.figure.tight_layout()

plt.show()

In [None]:
dangerous_sharks

Based on the data, the top 10 most dangerous types of sharks to humans, in terms of fatal attacks, are:


White shark: 44 fatal attacks

Tiger shark: 25 fatal attacks

Bull shark: 15 fatal attacks

3.7 m [12'] shark: 9 fatal attacks

3 m [10'] shark: 8 fatal attacks

12' shark: 5 fatal attacks

6 m [20'] white shark: 5 fatal attacks

Thought to involve a Zambesi shark: 5 fatal attacks

2 m shark: 4 fatal attacks

Blue shark: 4 fatal attacks



Assumptions and Potential Biases:

We assumed that the species descriptions are accurate and consistent. However, there might be some inconsistencies or variations in naming conventions, leading to fragmented counts.

Some descriptions, like "3.7 m [12'] shark", are not specific to a species but rather indicate the size of the shark. This can create ambiguity in identifying the exact species.


We considered only fatal attacks to determine the "dangerousness" of a shark species. However, the frequency of non-fatal attacks could provide additional context.
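
The fragmented counts could be reduced by mapping free-text descriptions to coarse species labels. This is a minimal sketch on a few descriptions taken from the list above. The keyword table and the `species_group` helper are hypothetical, and mapping "Zambesi" to bull shark assumes both names refer to the same species:

```python
import pandas as pd

# A few Species descriptions, plus a missing value
species = pd.Series([
    "White shark", "6 m [20'] white shark", "Tiger shark, 3m",
    "3.7 m [12'] shark", "Thought to involve a Zambesi shark", None,
])

# Hypothetical keyword table; "zambesi" -> "Bull shark" is an assumption
keywords = {
    "white": "White shark",
    "tiger": "Tiger shark",
    "bull": "Bull shark",
    "zambesi": "Bull shark",
}

def species_group(description):
    """Return a coarse species label, or 'Unidentified' if no keyword matches."""
    if pd.isna(description):
        return "Unidentified"
    text = description.lower()
    for key, label in keywords.items():
        if key in text:
            return label
    return "Unidentified"

species_grouped = species.apply(species_group)
print(species_grouped.value_counts())
```

Plain substring matching is crude: a keyword such as "white" would also match names like "whitetip", so a real mapping would need more specific patterns.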





Question 2: Are children more likely to be attacked by sharks?


We can group by age and count the number of attacks for each age group. For simplicity, we'll consider anyone below the age of 18 as a child.


In [None]:
# Cleaning the Age column: extract the first run of digits and convert to float (NaN where no digits are found)
attacks_df['Cleaned_Age'] = attacks_df['Age'].str.extract(r'(\d+)', expand=False).astype(float)

In [None]:
# Categorizing ages into "Child" and "Adult"
# Note: NaN < 18 evaluates to False, so rows with a missing age are also labeled 'Adult'
attacks_df['Age_Group'] = attacks_df['Cleaned_Age'].apply(lambda x: 'Child' if x < 18 else 'Adult')

In [None]:
# Counting the number of attacks for each age group
age_group_attacks = attacks_df['Age_Group'].value_counts()

In [None]:
age_group_attacks

In [None]:
# Pie chart for number of attacks on children vs. adults
age_group_attacks.plot(kind='pie', figsize=(8, 6),
                       autopct='%1.1f%%', startangle=140,
                       colors=sns.color_palette("coolwarm", 2))
plt.title("Shark Attacks: Children vs. Adults", fontsize=15)
plt.ylabel("")
plt.show()

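Based on the data:

The "Adult" group contains 24,798 records. The categorization labels every record without a usable age as "Adult", and the Age column has 22,252 missing values, so most of these records have no reported age rather than a confirmed adult victim.

Children (individuals below the age of 18) experienced 925 shark attacks.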


Assumptions and Potential Biases:

We've defined "children" as individuals below the age of 18. This threshold is somewhat arbitrary and could be adjusted based on different definitions of childhood.


Age data might not be consistently reported or might be missing for many records. This can affect the accuracy of our counts.

We've only considered the absolute number of attacks. A relative comparison (e.g., considering the population size of children vs. adults or the frequency of exposure to shark habitats) might provide a different perspective.
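
Because missing ages affect the counts, it may help to keep them in a separate group. This is a minimal sketch on a made-up Series (`ages`, hypothetical) that assigns "Unknown" when no numeric age can be extracted. The threshold of 18 follows the definition used above:

```python
import numpy as np
import pandas as pd

# Made-up sample of raw Age entries, including non-numeric and missing ones
ages = pd.Series(["12", "30s", "Teen", None, "45"])

# Pull the first run of digits; entries without digits become NaN
cleaned_age = pd.to_numeric(ages.str.extract(r"(\d+)", expand=False))

# np.select checks the conditions in order; unmatched rows get the default,
# so NaN ages end up as "Unknown" instead of "Adult"
age_group = pd.Series(
    np.select(
        [cleaned_age < 18, cleaned_age >= 18],
        ["Child", "Adult"],
        default="Unknown",
    ),
    index=ages.index,
)

print(age_group.value_counts())
```

Comparing only the "Child" and "Adult" groups then restricts the analysis to records with a reported age.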




Question 3: Are shark attacks where sharks were provoked more or less dangerous?


We can group by the Type column and count the number of fatal vs. non-fatal attacks to determine this.


In [None]:
# Grouping by the 'Type' column and counting the number of fatal vs. non-fatal attacks
provoked_danger = attacks_df.groupby('Type')['Fatal (Y/N)'].value_counts().unstack()
provoked_danger

In [None]:
# Filter data to only consider "Provoked" and "Unprovoked" types for clarity
provoked_data = provoked_danger.loc[["Provoked","Unprovoked"]]

# Stacked bar plot for provoked vs. unprovoked attacks
provoked_data_plot = provoked_data.plot(kind="bar", stacked=True, figsize=(8, 6),
                                        colormap="viridis")
plt.title("Fatal vs. Non-Fatal Attacks: Provoked vs. Unprovoked", fontsize=15)
plt.xlabel("Type of Attack", fontsize=12)
plt.ylabel("Number of Attacks", fontsize=12)
plt.legend(title="Fatality")
provoked_data_plot.figure.tight_layout()

plt.show()

In [None]:
# Pie charts for provoked vs. unprovoked attacks
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(14, 7))

# Non-Fatal Attacks
provoked_data['N'].plot(kind='pie', ax=axes[0], autopct='%1.1f%%', startangle=140,
                        colors=sns.color_palette("viridis", 2))
axes[0].set_title("Non-Fatal Attacks", fontsize=15)
axes[0].set_ylabel("")

# Fatal Attacks
provoked_data['Y'].plot(kind='pie', ax=axes[1], autopct='%1.1f%%', startangle=140,
                        colors=sns.color_palette("viridis", 2))
axes[1].set_title("Fatal Attacks", fontsize=15)
axes[1].set_ylabel("")

plt.suptitle("Provoked vs. Unprovoked Attacks", fontsize=16)
plt.tight_layout()
plt.show()



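Based on the data:

For "Provoked" attacks, there were 553 non-fatal and 19 fatal incidents.

For "Unprovoked" attacks, there were 3,408 non-fatal and 1,181 fatal incidents.

That corresponds to a fatality rate of about 3.3% for provoked attacks (19 of 572) and about 25.7% for unprovoked attacks (1,181 of 4,589), so in this dataset provoked attacks were fatal far less often.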
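
Since the two groups differ greatly in size, comparing shares is more informative than comparing raw counts. This is a minimal sketch of how per-type fatality rates could be derived, using a small frame (`counts`, hypothetical name) filled with the counts reported above:

```python
import pandas as pd

# Counts copied from the summary above
counts = pd.DataFrame(
    {"N": [553, 3408], "Y": [19, 1181]},
    index=["Provoked", "Unprovoked"],
)

# Divide each row by its total to get the share of each outcome per type
rates = counts.div(counts.sum(axis=1), axis=0)

# Fatality rate in percent, rounded to one decimal place
fatality_rate_pct = (rates["Y"] * 100).round(1)
print(fatality_rate_pct)
```

The same row-wise normalization could be applied directly to the grouped table produced from `attacks_df`.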

Assumptions and Potential Biases:

We assumed that the Type column accurately classifies the nature of the attack.

Values like "Invalid" and "Boatomg" were left out of this comparison and might need further investigation.









Question 4: Are certain activities more likely to result in a shark attack?

We can analyze the Activity column and count the number of attacks associated with each recorded activity.

However, there might be many unique activities in the dataset, so to keep the analysis concise we'll focus on the activities associated with the highest number of shark attacks.

In [None]:
# Counting the number of attacks for each activity
activity_attacks = attacks_df["Activity"].value_counts().head(10)

activity_attacks

In [None]:
# Plotting the top 10 activities associated with shark attacks
plt.figure(figsize=(12, 8))
activity_attacks.plot(kind='barh', color='lightblue')
plt.xlabel('Number of Attacks')
plt.ylabel('Activity')
plt.title('Top 10 Activities Associated with Shark Attacks')
plt.gca().invert_yaxis() # to display the activity with the highest count at the top
plt.show()

Based on the data, the top 10 activities most associated with shark attacks are:

>Surfing: 971 attacks

>Swimming: 869 attacks

>Fishing: 431 attacks

>Spearfishing: 333 attacks

>Bathing: 162 attacks

>Wading: 149 attacks

>Diving: 127 attacks

>Standing: 99 attacks

>Snorkeling: 89 attacks

>Scuba diving: 76 attacks


Assumptions and Potential Biases:

We assumed that the Activity column accurately captures the primary activity the individual was engaged in at the time of the attack.

There might be variations or nuances in naming conventions for activities, leading to fragmented counts (e.g., "Diving" and "Scuba diving" might refer to similar activities but are counted separately).

The dataset may not be exhaustive or representative of all shark attacks globally, so the actual distribution of activities could be different.
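
Fragmented activity labels could be merged into broader groups before counting. This is a minimal sketch on a made-up Series (`activities`, hypothetical). The group names and regular expressions in `group_patterns` are illustrative assumptions, not categories defined by the dataset:

```python
import re

import pandas as pd

# Made-up sample of Activity values
activities = pd.Series([
    "Scuba diving", "Diving", "Free diving",
    "Surfing", "Kite surfing", "Spearfishing", "Wading",
])

# Hypothetical broader groups; patterns are checked in order, first match wins
group_patterns = {
    "Diving": r"div|snorkel",
    "Fishing": r"fish",
    "Surfing": r"surf|board",
}

def activity_group(activity):
    """Map a raw activity description to a broader group."""
    if pd.isna(activity):
        return "Unknown"
    for group, pattern in group_patterns.items():
        if re.search(pattern, activity, flags=re.IGNORECASE):
            return group
    return "Other"

activity_grouped = activities.apply(activity_group)
print(activity_grouped.value_counts())
```

Where a label fits more than one group, the order of `group_patterns` decides the result, so the grouping should be reviewed against the actual values.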





