## Hypothesis

`The objective of the project is to analyze three hypotheses:`

`- The highest number of attacks occurred in the Northern Hemisphere in summer.`

`- The highest number of attacks occurred in Australia.`

`- The number of attacks depends on the water activity being performed.`

### Libraries 📚

In [3]:
import pandas as pd
import numpy as np
import re
import seaborn as sns
import src.limpieza_texto as lt

### Cleaning for hypothesis resolution

#### 1st Hypothesis

##### 1.1 Hemispheres

In [23]:
df = pd.read_csv("midatasetslimpio.csv/attack_cleann.csv")

In [24]:
# To analyze attacks by hemisphere, we first need to map each country to its hemisphere.
df.Country.value_counts()

USA                 2229
AUSTRALIA           1338
SOUTH AFRICA         579
PAPUA NEW GUINEA     134
NEW ZEALAND          128
                    ... 
GUATEMALA              1
AFRICA                 1
DIEGO GARCIA           1
SOUTH CHINA SEA        1
OCEAN                  1
Name: Country, Length: 212, dtype: int64

`First, match country names that refer to the same country but are written differently, then map them to a hemisphere in a new column.`

In [25]:
# List of country-name patterns, and a parallel list with the hemisphere each country is in.
regex_countries = [r'^usa', r'^australia', r'^new zealand', r'^south africa', r'^new guinea$', r'^papua new guinea$', r'^brazil$', r'^bahamas$', r'^mexico$', r'^italy$']
new_countries_hemisphere = ['Hemisphere-N', 'Hemisphere-S', 'Hemisphere-S', 'Hemisphere-S', 'Hemisphere-S', 'Hemisphere-S', 'Hemisphere-S', 'Hemisphere-N', 'Hemisphere-N', 'Hemisphere-N']

In [26]:
# Create a new "Hemispheres" column by replacing each matched country pattern with its hemisphere.
df['Hemispheres'] = df.Country.str.lower().replace(regex_countries, new_countries_hemisphere, regex=True)
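The list-based regex replacement in the cell above can be illustrated on a few made-up values. This is only a sketch with toy data (the `countries` Series is an assumption, not part of the dataset); it shows that values matching none of the patterns pass through unchanged, which is why country names such as `panama` still appear in the `Hemispheres` column afterwards.

```python
import pandas as pd

# Toy values standing in for the real Country column.
countries = pd.Series(["USA", "AUSTRALIA", "PANAMA"])

patterns = [r"^usa", r"^australia"]
labels = ["Hemisphere-N", "Hemisphere-S"]

# The i-th pattern is replaced by the i-th label; unmatched strings are left as they are.
mapped = countries.str.lower().replace(patterns, labels, regex=True)
print(mapped.tolist())
```

Because unmatched values are kept, a later filtering step is still needed to drop the countries that were not mapped.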

In [27]:
df

Unnamed: 0.1,Unnamed: 0,Date,Country,Activity,Hemispheres
0,0,25-Jun-2018,USA,Paddling,Hemisphere-N
1,1,18-Jun-2018,USA,Standing,Hemisphere-N
2,2,09-Jun-2018,USA,Surfing,Hemisphere-N
3,3,08-Jun-2018,AUSTRALIA,Surfing,Hemisphere-S
4,4,04-Jun-2018,MEXICO,Free diving,Hemisphere-N
...,...,...,...,...,...
6297,6297,Before 1903,AUSTRALIA,Diving,Hemisphere-S
6298,6298,Before 1903,AUSTRALIA,Pearl diving,Hemisphere-S
6299,6299,1900-1905,USA,Swimming,Hemisphere-N
6300,6300,1883-1889,PANAMA,,panama


In [28]:
# value_counts() shows how many attacks fall in each hemisphere, plus the countries that were not mapped.
df.Hemispheres.value_counts()

Hemisphere-N                  2498
Hemisphere-S                  2301
fiji                            65
philippines                     61
reunion                         60
                              ... 
british virgin islands           1
paraguay                         1
andaman / nicobar islandas       1
admiralty islands                1
indian ocean?                    1
Name: Hemispheres, Length: 201, dtype: int64

In [29]:
hemisphere = df["Hemispheres"]
hemisphere

0             Hemisphere-N
1             Hemisphere-N
2             Hemisphere-N
3             Hemisphere-S
4             Hemisphere-N
               ...        
6297          Hemisphere-S
6298          Hemisphere-S
6299          Hemisphere-N
6300                panama
6301    ceylon (sri lanka)
Name: Hemispheres, Length: 6302, dtype: object

In [30]:
list_ = ['Hemisphere-N', 'Hemisphere-S']

In [31]:
df = lt.create(df,'Hemispheres',list_)
df.Hemispheres.value_counts()

Hemisphere-N    2498
Hemisphere-S    2301
Name: Hemispheres, dtype: int64
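`lt.create` comes from the project's own `src.limpieza_texto` module, whose source is not shown here. Judging from the output above, it appears to keep only the rows whose column value is in the given list. A minimal sketch of that behaviour, under that assumption (the name `keep_values` is hypothetical and is not the module's actual function):

```python
import pandas as pd

def keep_values(frame, column, allowed):
    """Hypothetical stand-in for lt.create: keep rows whose `column` value is in `allowed`."""
    # .copy() returns an independent DataFrame, so adding columns to the result
    # later does not raise SettingWithCopyWarning.
    return frame[frame[column].isin(allowed)].copy()

toy = pd.DataFrame({"Hemispheres": ["Hemisphere-N", "panama", "Hemisphere-S"]})
filtered = keep_values(toy, "Hemispheres", ["Hemisphere-N", "Hemisphere-S"])
print(filtered.Hemispheres.value_counts())
```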

##### 1.2 Months

In [32]:
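# Pattern matching the first run of letters in a date string (the month abbreviation in "25-Jun-2018").
pattern = r"[a-zA-Z]+"
# This helper should live in the src module (edited in VS Code), but that did not work, so it is defined here.
def months(patron, string):
    try:
        return re.search(patron, string).group().lower()
    except (AttributeError, TypeError):
        # AttributeError: no letters found; TypeError: the value is not a string (e.g. NaN).
        return "Error"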

In [33]:
df["Months"] = df.Date.apply(lambda x: months(pattern, x))
df.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["Months"] = df.Date.apply(lambda x: months(pattern, x))


Unnamed: 0.1,Unnamed: 0,Date,Country,Activity,Hemispheres,Months
0,0,25-Jun-2018,USA,Paddling,Hemisphere-N,jun
1,1,18-Jun-2018,USA,Standing,Hemisphere-N,jun
2,2,09-Jun-2018,USA,Surfing,Hemisphere-N,jun
3,3,08-Jun-2018,AUSTRALIA,Surfing,Hemisphere-S,jun
4,4,04-Jun-2018,MEXICO,Free diving,Hemisphere-N,jun
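As an alternative to applying a Python function row by row, the same first-word extraction can be sketched with the vectorized `Series.str.extract`. The toy `dates` Series below is an assumption used only for illustration. Note that a value such as `Before 1903` yields `before`, so the result still has to be filtered against the list of valid month abbreviations.

```python
import pandas as pd

# Toy values mimicking the formats seen in the Date column.
dates = pd.Series(["25-Jun-2018", "Before 1903", "1900-1905", None])

# str.extract returns the first capture group; rows without a match become NaN.
first_word = dates.str.extract(r"([a-zA-Z]+)", expand=False).str.lower()
print(first_word.tolist())
```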


In [34]:
# Keep only the rows whose Months value is a valid month abbreviation.
lista_months = ['jul', 'aug', 'sep', 'jan', 'jun', 'apr', 'oct', 'dec', 'mar', 'nov', 'may', 'feb']
df = lt.create(df,'Months',lista_months)
df.Months.value_counts()

jul    468
aug    422
sep    389
jan    385
jun    356
oct    340
apr    332
dec    329
mar    307
nov    293
feb    288
may    273
Name: Months, dtype: int64
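With both `Hemispheres` and `Months` available, the first hypothesis could be checked by flagging each attack as summer or not for its own hemisphere. The sketch below uses toy rows and assumes meteorological summer (June to August in the north, December to February in the south); the `Summer` column and the `summer` mapping are hypothetical names introduced here.

```python
import pandas as pd

# Toy rows with the same column names as the cleaned DataFrame.
toy = pd.DataFrame({
    "Hemispheres": ["Hemisphere-N", "Hemisphere-N", "Hemisphere-S", "Hemisphere-S"],
    "Months": ["jul", "jan", "jan", "jul"],
})

# Assumption: meteorological summer months for each hemisphere.
summer = {
    "Hemisphere-N": {"jun", "jul", "aug"},
    "Hemisphere-S": {"dec", "jan", "feb"},
}

# True when the attack month is a summer month in that row's hemisphere.
toy["Summer"] = [m in summer[h] for h, m in zip(toy.Hemispheres, toy.Months)]

# Rows: hemisphere; columns: summer flag; cells: number of attacks.
counts = pd.crosstab(toy.Hemispheres, toy.Summer)
print(counts)
```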

#### Exporting the Dataset

In [35]:
df.to_csv("attack_cleanhipotesis.csv")
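The `Unnamed: 0` and `Unnamed: 0.1` columns visible in the tables above are typical of a DataFrame that was written with its index and then read back. A small sketch of the effect, using an in-memory buffer and toy data instead of the real files, and of how `index=False` avoids it:

```python
import io
import pandas as pd

toy = pd.DataFrame({"Country": ["USA", "AUSTRALIA"]})

# Default: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the real columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```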

#### 2nd Hypothesis

In [36]:
# Keep only the five countries with the most recorded attacks.
lista_country = ['USA', 'AUSTRALIA', 'SOUTH AFRICA', 'PAPUA NEW GUINEA', 'NEW ZEALAND']
df = df[df['Country'].isin(lista_country)]

In [37]:
df.Country.value_counts()

USA                 2016
AUSTRALIA           1152
SOUTH AFRICA         523
NEW ZEALAND          100
PAPUA NEW GUINEA      81
Name: Country, dtype: int64
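The country list above is typed by hand. As a sketch of an alternative, the most frequent countries can be derived from `value_counts()` itself; the toy DataFrame and the cut-off of two countries are assumptions made only for this example.

```python
import pandas as pd

toy = pd.DataFrame({"Country": ["USA", "USA", "USA", "AUSTRALIA", "AUSTRALIA", "FIJI"]})

# value_counts sorts by frequency, so head(n) gives the n most frequent labels.
top_countries = toy.Country.value_counts().head(2).index.tolist()

# Keep only the rows belonging to those countries.
top_only = toy[toy.Country.isin(top_countries)]
print(top_countries, len(top_only))
```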

#### Exporting the Dataset

In [38]:
df.to_csv("attack_cleanhipotesis2.csv")

#### 3rd Hypothesis

In [39]:
# Map every Activity value that mentions one of these activities to a single canonical name.
new_activities = ['swimming', 'spearfishing', 'bathing', 'surfing', 'fishing']
regex_activity = [r".*?\bswimming\b.*", r".*?\bspearfishing\b.*", r".*?\bbathing\b.*", r".*?\bsurfing\b.*", r".*?\bfishing\b.*"] 

In [40]:
df["Activities"] = df.Activity.str.lower().replace(regex_activity, new_activities, regex=True)
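The `\b` word boundaries in these patterns decide which descriptions are merged into each activity. A short sketch with made-up activity strings shows that `\bfishing\b` does not match inside `spearfishing`, so the two categories stay separate, and that `\bsurfing\b` matches `kite surfing` but would not match a one-word spelling such as `windsurfing`.

```python
import re

fishing = re.compile(r".*?\bfishing\b.*")
surfing = re.compile(r".*?\bsurfing\b.*")

# Made-up activity descriptions, already lowercased.
fishing_hits = [bool(fishing.search(s)) for s in ["fishing for mackerel", "spearfishing"]]
surfing_hits = [bool(surfing.search(s)) for s in ["kite surfing", "windsurfing"]]

print(fishing_hits)
print(surfing_hits)
```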

In [41]:
df

Unnamed: 0.1,Unnamed: 0,Date,Country,Activity,Hemispheres,Months,Activities
0,0,25-Jun-2018,USA,Paddling,Hemisphere-N,jun,paddling
1,1,18-Jun-2018,USA,Standing,Hemisphere-N,jun,standing
2,2,09-Jun-2018,USA,Surfing,Hemisphere-N,jun,surfing
3,3,08-Jun-2018,AUSTRALIA,Surfing,Hemisphere-S,jun,surfing
5,5,03-Jun-2018,AUSTRALIA,Kite surfing,Hemisphere-S,jun,surfing
...,...,...,...,...,...,...,...
6128,6128,May-17-1803,USA,,Hemisphere-N,may,
6129,6129,Mar-1803,AUSTRALIA,,Hemisphere-S,mar,
6136,6136,10-May-1788,AUSTRALIA,Fishing,Hemisphere-S,may,fishing
6142,6142,08-Aug-1780,USA,Swimming,Hemisphere-N,aug,swimming


In [42]:
list3 = ["swimming", "surfing", "fishing", "spearfishing", "bathing"]
df = lt.create(df,'Activities',list3)
df.Activities.value_counts()

surfing         952
swimming        665
fishing         422
spearfishing    234
bathing          90
Name: Activities, dtype: int64

#### Exporting the Dataset

In [44]:
df.to_csv("attack_cleanhipotesis3.csv")