# COGS 108 - EDA Checkpoint

# Names

- Hugs Clorina
- John Howell
- Andy Chow
- Jawad Osman
- Vince Ermitano

<a id='research_question'></a>
# Research Question

Are rising ocean temperature and sea level correlated, to a statistically significant degree, with the frequency of unprovoked shark attacks in North America?

# Setup

In [1]:
# Import seaborn and apply its plotting styles
import seaborn as sns
sns.set(font_scale=2, style="white")

# import matplotlib
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.style as style
# set plotting size parameter
plt.rcParams['figure.figsize'] = (17, 7)

# import pandas & numpy library
import pandas as pd
import numpy as np

In [2]:
# read in all datasets
sharks_df = pd.read_csv("./shark_attacks.csv")
temp_df = pd.read_csv("./temperature_anomalies.csv")
sl_pacific_df = pd.read_csv("./sea_level_north_pacific.csv")
sl_atlantic_df = pd.read_csv("./sea_level_north_atlantic.csv")

# Data Cleaning

The shark attack data needs the most cleaning because:
- it contains columns that are irrelevant to our research or that reveal personal information
- it contains shark attacks in regions of the world that are outside our scope
- it contains rows for *provoked* shark attacks, whereas our research concerns *unprovoked* shark attacks
- it does not categorize attack areas as East Coast or West Coast

Thus, we cleaned the shark attack data as follows:
1. read in the shark attack csv file
2. filtered the dataset to only the areas that are of interest (North America; the code below keeps rows whose 'Country' is 'USA')
    * looked at all unique values for countries
    * defined which of these values to retain
    * dropped rows whose values fall outside our area of interest
3. filtered the dataset to only include rows whose 'Type' column value is 'Unprovoked'
4. kept only rows with a 'Year' of 1880 or later, the first year in the temperature data
5. dropped columns that were either irrelevant or included personal data
6. categorized the areas into East and West Coast appropriately

In [3]:
# get a feel for the data
print(sharks_df.shape)
print(sharks_df.columns)

(25847, 24)
Index(['Case Number', 'Date', 'Year', 'Type', 'Country', 'Area', 'Location',
       'Activity', 'Name', 'Sex ', 'Age', 'Injury', 'Fatal (Y/N)', 'Time',
       'Species ', 'Investigator or Source', 'pdf', 'href formula', 'href',
       'Case Number.1', 'Case Number.2', 'original order', 'Unnamed: 22',
       'Unnamed: 23'],
      dtype='object')


In [4]:
# look at all unique countries
print(sharks_df['Country'].unique())

# filter by area, unprovoked attacks, and relevant year frame
sharks_df = sharks_df[sharks_df['Country'] == 'USA']
sharks_df = sharks_df[sharks_df['Type'] == 'Unprovoked']
sharks_df = sharks_df[sharks_df['Year'] >= 1880]

# drop irrelevant or ethically exposing columns
drop_cols = ['Investigator or Source', 'Injury', 'Time', 'pdf', 'Species ', 'href formula',
             'Name', 'Unnamed: 22', 'Unnamed: 23', 'Case Number.1', 'Case Number.2',
             'href', 'original order']
sharks_df = sharks_df.drop(columns=drop_cols).reset_index(drop=True)

['USA' 'BAHAMAS' 'AUSTRALIA' 'SOUTH AFRICA' 'ENGLAND' 'JAPAN' 'INDONESIA'
 'EGYPT' 'JA MAICA' 'BELIZE' 'MALDIVES' 'FRENCH POLYNESIA' 'THAILAND'
 'COLUMBIA' 'NEW ZEALAND' 'MEXICO' 'COSTA RICA' 'New Zealand' 'BRAZIL'
 'British Overseas Territory' 'CANADA' 'ECUADOR' 'JORDAN' 'NEW CALEDONIA'
 'JAMAICA' 'ST KITTS / NEVIS' 'ST MARTIN' 'SPAIN' 'FIJI' 'SEYCHELLES'
 'PAPUA NEW GUINEA' 'REUNION ISLAND' 'ISRAEL' 'CHINA' 'SAMOA' 'IRELAND'
 'ITALY' 'COLOMBIA' 'MALAYSIA' 'LIBYA' nan 'CUBA' 'MAURITIUS'
 'SOLOMON ISLANDS' 'ST HELENA, British overseas territory' 'COMOROS'
 'REUNION' 'UNITED KINGDOM' 'UNITED ARAB EMIRATES' 'PHILIPPINES'
 'CAPE VERDE' 'Fiji' 'DOMINICAN REPUBLIC' 'CAYMAN ISLANDS' 'ARUBA'
 'MOZAMBIQUE' 'PUERTO RICO' 'ATLANTIC OCEAN' 'GREECE' 'ST. MARTIN'
 'FRANCE' 'TRINIDAD & TOBAGO' 'KIRIBATI' 'DIEGO GARCIA' 'TAIWAN'
 'PALESTINIAN TERRITORIES' 'GUAM' 'NIGERIA' 'TONGA' 'SCOTLAND' 'CROATIA'
 'SAUDI ARABIA' 'CHILE' 'ANTIGUA' 'KENYA' 'RUSSIA' 'TURKS & CAICOS'
 'UNITED ARAB EMIRATES (UAE)' 'AZ
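
The unique values above list some countries under more than one spelling or casing ('JA MAICA' and 'JAMAICA', 'New Zealand' and 'NEW ZEALAND', 'Fiji' and 'FIJI'). The exact match on 'USA' is not affected, but if the filter were widened to more of North America, the labels would likely need normalizing first. The following is a minimal sketch on a toy Series; `normalize_country`, the `fixes` mapping, and the `north_america` set are hypothetical names introduced for illustration and are not part of the notebook.

```python
import pandas as pd

# Toy stand-in for sharks_df['Country']; labels copied from the output above
countries = pd.Series(['USA', 'JA MAICA', 'JAMAICA', 'New Zealand', 'MEXICO', None, 'CANADA'])

# Hypothetical manual fixes for misspelled labels seen in the output
fixes = {'JA MAICA': 'JAMAICA'}

def normalize_country(s):
    """Strip whitespace, upper-case, then apply manual spelling fixes."""
    cleaned = s.str.strip().str.upper()
    return cleaned.replace(fixes)

normalized = normalize_country(countries)

# Hypothetical and deliberately incomplete set of in-scope labels
north_america = {'USA', 'MEXICO', 'CANADA'}
in_scope = normalized.isin(north_america)  # missing labels evaluate to False

print(normalized[in_scope].tolist())
```

Normalizing before filtering keeps the comparison set short, since each country then needs only one spelling.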

In [5]:
# categorize 'Area' column values to 'East Coast' or 'West Coast'
# (Pacific areas count as West Coast; Atlantic and Gulf of Mexico areas count as East Coast)
print(sharks_df['Area'].unique())

west_coast = ['California', 'Hawaii', 'Oregon', 'Guam', 'Maui', 'Baja ', 'Guerrero',
              'Washington', 'Baja California Sur', 'Palmyra Atoll', 'Johnston Atoll', 'Midway Atoll',
              'Wake Island']

east_coast = ['Louisiana', 'South Carolina', 'Florida', 'New York', 'Noirth Carolina', 'Alabama',
              'Maryland', 'North Carolina', 'Georgia', 'Franklin County, Florida', 'Virgin Islands',
              'Maine', 'Bahamas', 'Cayman Islands', 'Rhode Island', 'New Jersey', 'Massachusetts', 'Delaware',
              'Virginia', 'Puerto Rico', 'US Virgin Islands', 'South Carolina ', 'Connecticut', 'Mississippi',
              'Texas', ' North Carolina', 'East coast']

['California' 'Hawaii' 'Louisiana' 'South Carolina' 'Florida' 'New York'
 'Noirth Carolina' 'Alabama' 'Texas' 'Maryland' 'North Carolina' 'Georgia'
 'Oregon' 'Franklin County, Florida' 'Virgin Islands' 'Maine' 'Bahamas'
 'Maui' 'Guam' 'Cayman Islands' 'Rhode Island' 'New Jersey'
 'Massachusetts' 'Washington' 'Delaware' 'Palmyra Atoll' 'Puerto Rico'
 'Virginia' 'US Virgin Islands' 'South Carolina ' 'Johnston Atoll'
 'Connecticut' 'Mississippi' 'Wake Island' ' North Carolina'
 'Midway Atoll' 'East coast']


In [6]:
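def categorize_east_west(str_in):
    # map an 'Area' value to its coast; values in neither list (e.g. NaN) stay uncategorized
    if str_in in west_coast:
        return 'West Coast'
    if str_in in east_coast:
        return 'East Coast'
    return None

sharks_df['West/East Coast'] = sharks_df['Area'].apply(categorize_east_west)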
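
The `Area` values include near-duplicates that differ only by stray whitespace or a typo (' North Carolina', 'South Carolina ', 'Noirth Carolina'), which is why the lists carry each variant separately. One alternative, sketched here on toy data, is to normalize the strings first so that each list needs only one spelling per area. The names `clean_area`, `area_fixes`, `west`, and `east` are hypothetical, and the sets are shortened for illustration.

```python
import pandas as pd

# Toy stand-in for sharks_df['Area'], using variants seen in the output above
areas = pd.Series(['California', ' North Carolina', 'Noirth Carolina', 'South Carolina ', 'Hawaii'])

# Hypothetical typo corrections, applied after whitespace is stripped
area_fixes = {'Noirth Carolina': 'North Carolina'}

def clean_area(s):
    """Strip leading/trailing whitespace, then fix known typos."""
    return s.str.strip().replace(area_fixes)

# Shortened, illustrative lookup sets (one spelling per area)
west = {'California', 'Hawaii'}
east = {'North Carolina', 'South Carolina'}

cleaned = clean_area(areas)
coast = cleaned.map(
    lambda a: 'West Coast' if a in west else 'East Coast' if a in east else None
)

print(pd.DataFrame({'Area': cleaned, 'West/East Coast': coast}))
```

A side benefit is that later group-bys on `Area` would no longer split one state across several labels.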

In [7]:
# take a look at our cleaned-up shark data
sharks_df.head()

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Sex,Age,Fatal (Y/N),West/East Coast
0,2022.10.31,31-Oct-2022,2022.0,Unprovoked,USA,California,"Otter Point, Pacific Grove",Surfing,M,,N,West Coast
1,2022.10.25,25 Oct-2022,2022.0,Unprovoked,USA,Hawaii,Kauai,Snorkeling,M,51.0,N,West Coast
2,2022.10.08,08-Oct-2022,2022.0,Unprovoked,USA,Louisiana,25 miles off Empire,Shipwreck,M,40.0,N,East Coast
3,2022.010.02,02-Oct-2022,2022.0,Unprovoked,USA,California,Centerville Beach,Surfing,M,31.0,N,West Coast
4,2022.09.03,03-Sep-2022,2022.0,Unprovoked,USA,Hawaii,"Lower Paia Beach Park, Maui",Swimming or Snorkeling,F,51.0,N,West Coast


For the remaining datasets (sea level and temperature anomalies), most of the data was already clean, so we only made the following changes to the ocean temperature dataset:
1. Removed the first 4 rows because they store file metadata (units, base period, missing-value code, and the original header row) rather than observations.
2. Renamed the columns, because the initial column names did not describe the values they hold. The new names are Year and Temperature Anomaly (Celsius).

In [8]:
# get a feel for the data
print(temp_df.shape)
print(temp_df.columns)

temp_df.head()

(146, 2)
Index(['Northern Hemisphere Ocean Temperature Anomalies', ' January-December'], dtype='object')


Unnamed: 0,Northern Hemisphere Ocean Temperature Anomalies,January-December
0,Units: Degrees Celsius,
1,Base Period: 1901-2000,
2,Missing: -999,
3,Year,Value
4,1880,-0.02


In [9]:
# need to remove unnecessary initial rows (0-3): file metadata and the original header row
temp_df = temp_df.loc[4:].reset_index(drop=True)

# rename column titles appropriately
temp_df = temp_df.rename(columns={'Northern Hemisphere Ocean Temperature Anomalies': 'Year', ' January-December': 'Temperature Anomaly (Celsius)'})

temp_df.head()

Unnamed: 0,Year,Temperature Anomaly (Celsius)
0,1880,-0.02
1,1881,-0.02
2,1882,-0.03
3,1883,-0.08
4,1884,-0.16
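
Because the first rows of the raw file held text, pandas most likely read both columns as strings, and the file header declares `-999` as its missing-value code. Before any numeric analysis, the columns would need converting and the sentinel replacing. The sketch below shows one way to do that on a toy frame; the `-999` entry is made up to demonstrate the sentinel and does not come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for temp_df after the slicing and renaming above (values held as strings)
temp = pd.DataFrame({
    'Year': ['1880', '1881', '1882'],
    # the '-999' entry is invented here to illustrate the missing-value code
    'Temperature Anomaly (Celsius)': ['-0.02', '-999', '-0.03'],
})

# Convert years to integers
temp['Year'] = pd.to_numeric(temp['Year']).astype(int)

# Convert anomalies to floats; unparseable entries become NaN
anomaly = pd.to_numeric(temp['Temperature Anomaly (Celsius)'], errors='coerce')

# Replace the documented sentinel with NaN so it cannot distort means or correlations
temp['Temperature Anomaly (Celsius)'] = anomaly.replace(-999, np.nan)

print(temp.dtypes)
print(temp)
```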


The sea level data was already clean, so we read the datasets in with no modifications.

In [10]:
# sea level data does not need cleaning, so just read in and examine overall structure

sl_pacific_df.shape
sl_pacific_df

Unnamed: 0,year,TOPEX/Poseidon,Jason-1,Jason-2,Jason-3
0,1992.9611,19.62,,,
1,1992.9865,-8.28,,,
2,1993.0126,-16.68,,,
3,1993.0408,-43.48,,,
4,1993.0659,-61.18,,,
...,...,...,...,...,...
1370,2022.5020,,,,66.58
1371,2022.5291,,,,91.58
1372,2022.5563,,,,99.38
1373,2022.5835,,,,99.38


In [11]:
# for northern atlantic region

sl_atlantic_df.shape
sl_atlantic_df

Unnamed: 0,year,TOPEX/Poseidon,Jason-1,Jason-2,Jason-3
0,1992.9620,-0.32,,,
1,1992.9873,-0.62,,,
2,1993.0129,-16.42,,,
3,1993.0413,-8.52,,,
4,1993.0667,-35.72,,,
...,...,...,...,...,...
1373,2022.5025,,,,69.24
1374,2022.5295,,,,83.54
1375,2022.5567,,,,99.44
1376,2022.5839,,,,114.54
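
Both sea level tables store one column per satellite mission, with NaN wherever a mission was not reporting, and `year` as a decimal year. To compare them with yearly attack counts, one option is to collapse the mission columns into a single series and average it per calendar year. The sketch below uses a toy frame (three rows copied from the Pacific table plus one made-up row); averaging across missions where they overlap is an assumption, not a choice the notebook makes, and `sea_level` and `annual` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy stand-in for sl_pacific_df; the 1993.5 row is made up for illustration
sl = pd.DataFrame({
    'year': [1992.9611, 1993.0126, 1993.5000, 2022.5020],
    'TOPEX/Poseidon': [19.62, -16.68, -10.00, np.nan],
    'Jason-3': [np.nan, np.nan, np.nan, 66.58],
})

# Every column other than 'year' is treated as a mission column
mission_cols = [c for c in sl.columns if c != 'year']

# Row-wise mean ignores NaN, so each row takes the value of whichever mission reported
sl['sea_level'] = sl[mission_cols].mean(axis=1)

# Decimal year -> calendar year, then average within each year
sl['Year'] = np.floor(sl['year']).astype(int)
annual = sl.groupby('Year')['sea_level'].mean()

print(annual)
```

The resulting `Year` index lines up with the integer years used in the temperature data, which makes a later merge straightforward.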


# Data Analysis & Results (EDA)

Carry out EDA on your dataset(s); Describe in this section
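
As a possible starting point for the research question, the sketch below counts attacks per year, merges the counts with the temperature anomalies, and computes a Pearson correlation coefficient. All values are toy data invented for illustration, and `attacks`, `counts`, and `merged` are hypothetical names. Three points are far too few for any conclusion, and judging statistical significance would additionally need a p-value, for example from `scipy.stats.pearsonr`.

```python
import pandas as pd

# Toy stand-in for the cleaned sharks_df: one row per attack ('Year' is stored as a float there)
attacks = pd.DataFrame({'Year': [2000.0, 2000.0, 2001.0, 2002.0, 2002.0, 2002.0]})

# Toy stand-in for the cleaned temp_df with numeric columns
temp = pd.DataFrame({
    'Year': [2000, 2001, 2002],
    'Temperature Anomaly (Celsius)': [0.30, 0.25, 0.45],
})

# Count attacks per calendar year
attacks['Year'] = attacks['Year'].astype(int)
counts = attacks.groupby('Year').size().rename('Attacks').reset_index()

# Keep only years present in both tables
merged = counts.merge(temp, on='Year', how='inner')

# Series.corr defaults to the Pearson correlation coefficient
r = merged['Attacks'].corr(merged['Temperature Anomaly (Celsius)'])

print(merged)
print(f"Pearson r = {r:.3f}")
```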

In [12]:
## YOUR CODE HERE
## FEEL FREE TO ADD MULTIPLE CELLS PER SECTION