## DATA CLEANING:

En este notebook se procederá a la limpieza del csv de ataques de tiburones registrados a nivel global. No se realizará ningún tipo de manipulación de datos, únicamente limpieza, de manera que cualquier persona que lo necesite pueda obtener los datos para realizar sus propioes estudios

In [1]:
import numpy as np
import pandas as pd

In [2]:
Shark = pd.read_csv("./CSVs/global-shark-attack.csv")
Shark = pd.read_csv(".global-shark-attack.csv", sep= ";")
Shark.head()

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,...,Fatal (Y/N),Time,Species,Investigator or Source,pdf,href formula,href,Case Number.1,Case Number.2,original order
0,202273.b,03/07/2022,2022.0,Unprovoked,USA,New York,"Smith Point Beach, Suffolk County",Lifeguard Exercises,Zach Gallo,M,...,N,10h15,5'shark,"ABC7,7/3/2022",202273.a-Gallo.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,202273.a,202273.a,6778.0
1,20226.22,21/06/2022,2022.0,Unprovoked,USA,South Carolina,"Myrtle Beach, Horry County",,male,M,...,N,,,"C. Creswll, GSAF",20226.21-AC.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,20226.21,20226.21,6766.0
2,20224.11,09/04/2022,2022.0,Provoked,USA,New Jersey,Tutle Back Zoo,Feeding,female,F,...,N,Afternoon,Epaulette shark,"Rsl Media, 4/11/2022",202249-TurtleBackZoo.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,202249,202249,6752.0
3,20222.17,16/02/2022,2022.0,Unprovoked,AUSTRALIA,New South Wales,"Buchan Point, Sydney",Swimming,Simon Nellist,M,...,Y,16h30,,"7News,2/16/2022",20222.16-Nellist.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,20222.16,20222.16,6739.0
4,2021.12.25,22/12/2021,2021.0,Unprovoked,USA,Florida,"Anna Maria Island, Manatee County",Wing Foil Surfing,Erika Lane,F,...,N,,Blacktip or spinner shark,ABC.net,2021.12.22-Lane.pdf,,,2021.12.22,2021.12.22,6724.0


In [3]:
#Vamos a ver que columnas tenemos en nuestro DataFrame, para empezar con la limpieza

In [4]:
Shark.columns

Index(['Case Number', 'Date', 'Year', 'Type', 'Country', 'Area', 'Location',
       'Activity', 'Name', 'Sex ', 'Age', 'Injury', 'Fatal (Y/N)', 'Time',
       'Species ', 'Investigator or Source', 'pdf', 'href formula', 'href',
       'Case Number.1', 'Case Number.2', 'original order'],
      dtype='object')

In [5]:
#Vemos que nos sobran varias columnas para hacer nuestro análisis. En primer lugar, la primera columna o Case Number, coincide con la fecha
#en la que se ha producido el ataque, por lo que es redundante (a lo sumo, varía en unos pocos días, lo cual es irrelevante para nuestro análisis). 
#En segundo lugar, las columnas de pdf en adelante tampoco nos van a interesar en nuestro análisis, por lo que podemos de igual forma eliminarlas. 

In [6]:
Shark = Shark.drop(columns=["Case Number", "pdf", "href formula", "href",
       "Case Number.1", "Case Number.2", "original order"], axis= 1)

In [7]:
#También vamos a corregir el nombre de Sex y Species, para evitar problemas a la hora de utilizar las columnas (contienen espacios)

In [8]:
Shark.rename(columns={"Sex ": "Sex", "Species ": "Species"}, inplace= True)

In [9]:
Shark.head()

Unnamed: 0,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,Age,Injury,Fatal (Y/N),Time,Species,Investigator or Source
0,03/07/2022,2022.0,Unprovoked,USA,New York,"Smith Point Beach, Suffolk County",Lifeguard Exercises,Zach Gallo,M,31,Injuries to chest and right hand,N,10h15,5'shark,"ABC7,7/3/2022"
1,21/06/2022,2022.0,Unprovoked,USA,South Carolina,"Myrtle Beach, Horry County",,male,M,14,,N,,,"C. Creswll, GSAF"
2,09/04/2022,2022.0,Provoked,USA,New Jersey,Tutle Back Zoo,Feeding,female,F,12,Finger nipped by captive shark PROVOKED INCIDENT,N,Afternoon,Epaulette shark,"Rsl Media, 4/11/2022"
3,16/02/2022,2022.0,Unprovoked,AUSTRALIA,New South Wales,"Buchan Point, Sydney",Swimming,Simon Nellist,M,36,FATAL,Y,16h30,,"7News,2/16/2022"
4,22/12/2021,2021.0,Unprovoked,USA,Florida,"Anna Maria Island, Manatee County",Wing Foil Surfing,Erika Lane,F,42,Punctures to leg,N,,Blacktip or spinner shark,ABC.net


In [10]:
Shark.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6833 entries, 0 to 6832
Data columns (total 15 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Date                    6632 non-null   object 
 1   Year                    6698 non-null   float64
 2   Type                    6810 non-null   object 
 3   Country                 6779 non-null   object 
 4   Area                    6351 non-null   object 
 5   Location                6268 non-null   object 
 6   Activity                6247 non-null   object 
 7   Name                    6611 non-null   object 
 8   Sex                     6260 non-null   object 
 9   Age                     3876 non-null   object 
 10  Injury                  6795 non-null   object 
 11  Fatal (Y/N)             6833 non-null   object 
 12  Time                    3326 non-null   object 
 13  Species                 3751 non-null   object 
 14  Investigator or Source  6809 non-null   

In [11]:
#Vemos que las columnas de Age, Time y Species tienen muchos datos nulos. 

In [12]:
#Comprobamos que no exista ninguna fila donde todos los datos estén perdidos
Shark.isnull().all(axis=1).value_counts()

False    6833
dtype: int64

In [13]:
#Comprobamos que no haya registros duplicados

Shark.duplicated().value_counts()

False    6830
True        3
dtype: int64

In [14]:
#Como vemos que hay tres resgistros duplicados, procedemos a borrarlos. 
Shark.drop_duplicates(inplace= True)

In [15]:
#La columna de Source nonos va a aportar mucha información para el análisis, la borramos también
Shark.drop(columns= "Investigator or Source", inplace= True)

In [16]:
#Aquellas filas donde la gran mayoría de datos están perdidos no tendrán mucho valor para nuestro estudio, vamos a borrarlas.
#Primero echamos un vistazo, asegurándonos que no borramos nada importante. 

Shark[Shark.isnull().sum(axis=1) > 5]

Unnamed: 0,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,Age,Injury,Fatal (Y/N),Time,Species
137,01/02/1977,1977.0,Unprovoked,NEW CALEDONIA,,I'le Ouen,,Jean Blanchet,,,Face & thorax bitten,N,,
234,1878-10-13,1878.0,Unprovoked,,,,Jumped overboard after murdering 2 shipmates,Sherrington,M,,FATAL,Y,,
246,,,Unprovoked,BELIZE,,,,Charles Ritchie CBE,M,,,N,,
248,,,Unprovoked,REUNION,Grand'Anse,Petite-île,yachtsman in a zodiac,,M,,Survived,N,,
249,,,Unprovoked,AUSTRALIA,New South Wales,"Spectacle Island, Port Jackson",,"male, the Sergeant of Marines",M,,,UNKNOWN,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6811,1890-03-03,1890.0,,CEYLON,,,Diving,a pearl diver,M,,FATAL,Y,,
6822,1836-01-01,1836.0,Unprovoked,AUSTRALIA,South Australia,,,,,,"No details, it was the year the first settlers...",UNKNOWN,,
6825,1797-05-28,1797.0,Unprovoked,,,,Dropped overboard,child,,,FATAL,Y,,
6828,,,Unprovoked,SPAIN,Canary Islands,Tenerife,Skin diving,,,,Injury required 16 stitches,N,,


In [17]:
#Tras echar un vistazo, y con la gran cantidad de datos que tiene el dataset, podemos permitirnos borrar estas filas.

Shark.dropna(how='any',axis=0, thresh=9, inplace= True)

In [18]:
Shark.reset_index(drop= True, inplace= True) #Hacemos reset index tras borras las filas para evitar futuros problemas

----------------------------------

Vamos ahora a ir analizando columna a columna, realizando la mejor limpieza posible de nuestros datos

### COLUMNA DATES

In [19]:
#Vamos a comprobar si todas las fechas están escritas de forma dd/mm/aaaa, y sustituir los NaN por "unknow"
Shark.Date = Shark.Date.fillna("Unknown")

In [20]:
Shark.Date.isnull().value_counts() #Comprobamos que no quedan NaN

False    6674
Name: Date, dtype: int64

In [21]:
mask = Shark.Date.str.contains("/")
mask.value_counts()

True     5995
False     679
Name: Date, dtype: int64

In [22]:
#Vemos que la mayoría de ellas tiene pinta de encontrarse escritas en el formato que queremos, pero hay 775 que no, vamos a verlas. 

In [23]:
Shark.Date[~mask]


176        Unknown
226     1899-05-04
227     1895-04-29
228     1893-06-22
229     1892-11-09
           ...    
6669    1733-01-01
6670    1595-01-01
6671       Unknown
6672       Unknown
6673       Unknown
Name: Date, Length: 679, dtype: object

In [24]:
#Como vemos, aparte de las fechas desconocidas, el resto de fechas están escritas en formato aaaa-mm-dd, vamos a cambiarlas a nuestro formato desado

def change_date(x):
    if "-" in x:
        x = x.split("-")
        return x[2] + "/" + x[1] + "/" + x[0]
    return x

Shark["Date"] = Shark["Date"].map(change_date)


In [25]:
"-" in Shark["Date"]

False

### COLUMNA YEAR

In [26]:
#Vamos ahora a cambiar la columna de Year de float a int, y los Nan los trataremos del mismo modo como "Unknown"

In [27]:
Shark.Year.fillna("Unknown", inplace= True)

def change_year (x):
    if x == "Unknown":
        return x
    return int(x)

Shark["Year"] = Shark["Year"].map(change_year)

In [28]:
Shark.Year.isnull().value_counts()

False    6674
Name: Year, dtype: int64

### COLUMNA TYPE

In [29]:
#Vamos ahora a investigar la columna Type, que nos indicará si el accidente fue provocado o no

In [30]:
Shark["Type"].value_counts()

Unprovoked             4921
Provoked                620
Invalid                 542
Watercraft              341
Sea Disaster            218
Questionable             12
Unconfirmed               1
?                         1
Under investigation       1
Unverified                1
Name: Type, dtype: int64

In [31]:
#Vemos que tenemos varios datos, primero empezaremos por sustituir todos aquellos datos que no nos permiten conocer el tipo
Shark["Type"] = Shark["Type"].str.replace("Invalid", "Unknown").replace("Questionable", "Unknown").replace("Unconfirmed", "Unknown")\
                .replace("?", "Unknown").replace("Under investigation", "Unknown").replace("Unverified", "Unknown")

In [32]:
Shark["Type"].value_counts()

Unprovoked      4921
Provoked         620
Unknown          558
Watercraft       341
Sea Disaster     218
Name: Type, dtype: int64

In [33]:
#Watercraft no nos indica información sobre el tipo de ataque, nos conviene incluirlo dentro de la actividad. Pasaremos los Watercraft a Unknown
#y, después de observar las actividades donde el tipo de ataque es Watercraft, y ver que no aportan mucha información, concluimos cambiar 
#aquellas actividades por Watercraft. 

In [34]:
Shark["Activity"] = np.where(Shark["Type"] == "Watercraft", "Watercraft", Shark["Activity"])
Shark["Type"] = np.where(Shark["Type"] == "Watercraft", "Unknown", Shark["Type"])

In [35]:
Shark["Type"].value_counts()

Unprovoked      4921
Unknown          899
Provoked         620
Sea Disaster     218
Name: Type, dtype: int64

### COLUMNA COUNTRY

In [36]:
#Vamos a echar un vistazo a la columna de Country

In [37]:
Shark["Country"].value_counts()

USA                             2486
AUSTRALIA                       1443
SOUTH AFRICA                     593
NEW ZEALAND                      139
PAPUA NEW GUINEA                 133
                                ... 
EQUATORIAL GUINEA / CAMEROON       1
COLOMBIA                           1
COOK ISLANDS                       1
GEORGIA                            1
JAVA                               1
Name: Country, Length: 203, dtype: int64

In [38]:
Shark["Country"].isnull().value_counts()

False    6649
True       25
Name: Country, dtype: int64

In [39]:
#Comenzamos viendo que hay 54 valores nulos, vamos a sustituirlos por "Unknown"

Shark["Country"].fillna("UNKNOWN", inplace= True)

In [40]:
#Hay 212 países, vamos a echar un vistazo a ver si se puede limpiar algo
Shark.Country.unique()

array(['USA', 'AUSTRALIA', 'CANADA', 'BELIZE', 'BAHAMAS', 'SOUTH AFRICA',
       'EGYPT', 'MALDIVES', 'COSTA RICA', 'MEXICO', 'SPAIN',
       'UNITED ARAB EMIRATES', 'THAILAND', 'NEW ZEALAND', 'TONGA',
       'PAPUA NEW GUINEA', 'BRAZIL', 'MOZAMBIQUE', 'ATLANTIC OCEAN',
       'GRAND CAYMAN', 'JAPAN', 'HONG KONG', 'REUNION', 'SOUTH KOREA',
       'PACIFIC OCEAN', 'BRITISH ISLES', 'FIJI', 'ITALY', 'JAMAICA',
       'NEW GUINEA', 'KENYA', 'BRITISH WEST INDIES', 'TURKEY', 'SENEGAL',
       'SOLOMON ISLANDS', 'CUBA', 'PORTUGAL', 'NORTH PACIFIC OCEAN',
       'GUATEMALA', 'NORTH ATLANTIC OCEAN', 'SRI LANKA', 'TANZANIA',
       'NEW CALEDONIA', 'CROATIA', 'INDONESIA', 'NIGERIA', 'KIRIBATI',
       'UNKNOWN', 'CAPE VERDE', 'FRENCH POLYNESIA', 'TRINIDAD & TOBAGO',
       'PHILIPPINES', 'VIETNAM', 'LIBERIA', 'Fiji', 'GRENADA', 'GREECE',
       'COLUMBIA', 'CHINA', 'BERMUDA', 'ENGLAND', 'IRELAND',
       'FALKLAND ISLANDS', 'HAITI', 'YEMEN', 'LEBANON', 'PERSIAN GULF',
       'INDIAN OCEAN', 'IND

In [41]:
#Empezamos viendo algún país escrito en minúscula, por lo que pasaremos todo a mayúsculas. 
Shark["Country"] = Shark["Country"].str.upper().str.strip()

In [42]:
#Vamos a ver si hay interrogantes, y los cambiaremos por Unknown
Shark[Shark["Country"].str.contains("?", regex=False)]

Unnamed: 0,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,Age,Injury,Fatal (Y/N),Time,Species
4776,30/11/1872,1872,Unprovoked,INDIAN OCEAN?,,,Swimming to avoid capture,Malay pirates,M,,FATAL,Y,,
5895,01/04/1954,1954,Provoked,SUDAN?,Red Sea,Southern part,Spearfishing,Jean Foucher-Createau,M,,"Speared small shark, shark bit his thigh and/o...",N,,


In [43]:
def change_country(x):
    if ("?" in x) or ("/" in x):
        return "Unknown"
    return x

Shark["Country"] = Shark["Country"].map(change_country)

### COLUMNA ÁREA

In [44]:
#Continuamos limpiando viendo que tenemos en la columna Área. Aquí realmente nos interesará para nuestro análisis los áreas de los países
#donde tenemos más ataques de tiburón. Vamos a verlo. Primero de todo, cambiamos los Nan por Unknown y después vemos la frecuencia de datos por países. 

In [45]:
#Pasamos todos las áreas con la primera letra en mayúscula y quitamos espacios a la derecha e izquierda del string
Shark["Area"] = Shark["Area"].str.strip().str.title()

In [46]:
Shark["Area"].fillna("Unknown", inplace= True)

In [47]:
Shark["Country"].value_counts()[Shark["Country"].value_counts() > 90]


USA                 2486
AUSTRALIA           1443
SOUTH AFRICA         594
NEW ZEALAND          140
PAPUA NEW GUINEA     133
BAHAMAS              127
BRAZIL               119
MEXICO                97
Name: Country, dtype: int64

In [48]:
#Vemos que con diferencia, USA y Australia son los países donde se han registrado más ataques, seguido de Sudáfrica, vamos a Analizar estas áreas,
#que serán las más relevantes en nuestro análisis. 

In [49]:
Countrys = Shark[(Shark["Country"] == "USA") | (Shark["Country"] == "AUSTRALIA") | (Shark["Country"] == "SOUTH AFRICA")]

In [50]:
Countrys.head()

Unnamed: 0,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,Age,Injury,Fatal (Y/N),Time,Species
0,03/07/2022,2022,Unprovoked,USA,New York,"Smith Point Beach, Suffolk County",Lifeguard Exercises,Zach Gallo,M,31,Injuries to chest and right hand,N,10h15,5'shark
1,21/06/2022,2022,Unprovoked,USA,South Carolina,"Myrtle Beach, Horry County",,male,M,14,,N,,
2,09/04/2022,2022,Provoked,USA,New Jersey,Tutle Back Zoo,Feeding,female,F,12,Finger nipped by captive shark PROVOKED INCIDENT,N,Afternoon,Epaulette shark
3,16/02/2022,2022,Unprovoked,AUSTRALIA,New South Wales,"Buchan Point, Sydney",Swimming,Simon Nellist,M,36,FATAL,Y,16h30,
4,22/12/2021,2021,Unprovoked,USA,Florida,"Anna Maria Island, Manatee County",Wing Foil Surfing,Erika Lane,F,42,Punctures to leg,N,,Blacktip or spinner shark


In [51]:
Countrys.groupby(["Country"])["Area"].unique().values

array([array(['New South Wales', 'Queensland', 'Western Australia',
              'South Australia', 'Victoria', 'Torres Strait', 'Norfolk Island',
              'Northern Territory', 'Westerm Australia', 'Tasmania', 'Unknown',
              'Territory Of Cocos (Keeling) Islands', 'New South Ales',
              'Western  Australia'], dtype=object)                             ,
       array(['Eastern Cape Province', 'Kwazulu-Natal', 'Western Cape Province',
              'Knz', 'Unknown', 'Eastern Province', 'Transvaal',
              'South Atlantic Ocean', 'Kzn',
              'Kwazulu-Natal Between Port Edward And Port St Johns',
              'Eastern Cape  Province', 'Western Province'], dtype=object)      ,
       array(['New York', 'South Carolina', 'New Jersey', 'Florida', 'Hawaii',
              'Texas', 'Washington', 'North Carolina', 'California', 'Louisiana',
              'Connecticut', 'Georgia', 'Unknown', 'Massachusetts', 'Oregon',
              'Alabama', 'Kentucky', '

In [52]:
#Respecto al primer array, que se corresponde con las áreas de AUSTRALIA, lo primero que salta a la vista es que Western Australia, está
#escrito de tres formas diferentes. Vamos a solucionar este problema, y cambiarlo dentro de nuestro DF. También nos damos cuenta de que 
#New South Wales está escrito también como New South Ales, lo cambiamos tambien. 

In [53]:
Shark["Area"] = Shark["Area"].str.replace("Westerm Australia", "Western Australia").replace("Western  Australia", "Western Australia")
Shark["Area"] = Shark["Area"].str.replace("New South Ales", "New South Wales")

In [54]:
#Vamos a echar un ojo al array segundo, perteneciente a SUDÁFRICA. Vemos varios nombres a cambiar.

In [55]:
Shark["Area"] = Shark["Area"].str.replace("Eastern Province", "Eastern Cape Province").replace("Eastern Cape  Province", "Eastern Cape Province")
Shark["Area"] = Shark["Area"].str.replace("Knz", "Kwazulu-Natal").replace("Kzn", "Kwazulu-Natal").replace("Kwazulu-Natal Between Port Edward And Port St Johns", "Kwazulu-Natal")
Shark["Area"] = Shark["Area"].str.replace("Western Province", "Western Cape Province")

In [56]:
#Por último, echamos un ojo al último array, las áreas de USA,y hacemos pequeños cambios del mismo modo

Shark["Area"] = Shark["Area"].str.replace("Franklin County, Florida", "Florida")
Shark["Area"] = Shark["Area"].str.replace("South Carolina", "North & South Carolina").replace("North Carolina", "North & South Carolina")
Shark["Area"] = Shark["Area"].str.replace("Us Virgin Islands", "Virgin Islands")


### COLUMNA ACTIVITY

In [57]:
#Seguimos con la columna Activity, que nos resultará muy útil para nuestro análisis. Vamos a comenzar viendo cuantos Nan tnemos y modificarlos por Unknowns

Shark["Activity"].isnull().sum()

463

In [58]:
Shark["Activity"].fillna("Unknown", inplace= True)

In [59]:
#Limpiamos los strings del mismo modo que hemos hecho anteriormente: 
Shark["Activity"] = Shark["Activity"].str.strip().str.lower()

In [60]:
#Vamos ahora a ver los valores únicos que tiene esta columna. Es una columna con información muy valiosa por lo que es imprescindible realizar
#una limpieza exhaustiva. Primero vamos a ver los valores que más se repiten. 

In [61]:
#Tras una investigación, concluimos qu Boogie boarding en realidad es body boarding, y que body surfing es lo mismo que surfing. Vamos a modificarlo

Shark["Activity"] = Shark["Activity"].str.replace("boogie boarding", "body boarding")
Shark["Activity"] = Shark["Activity"].str.replace("body surfing", "surfing")

In [62]:
#Vamos a ver si podemos limpiar un poco más los datos, tomaremos como referencia las actividades con más frecuencia y buscaremos si otras actividades
#se pueden englobar en ellas. Vemos a fondo los datos que tenemos y realizamos los cambios oportunos mediante una funcion 

In [63]:
Shark["Activity"].value_counts()[Shark["Activity"].value_counts() >40]

surfing          1233
swimming         1018
unknown           466
spearfishing      387
fishing           375
watercraft        345
diving            315
wading            220
bathing           166
body boarding     128
snorkeling        121
standing          116
floating           54
Name: Activity, dtype: int64

In [64]:
set(Shark[(Shark["Activity"].str.contains("surfing"))  & ((Shark["Activity"] != "Surfing"))]["Activity"])

{'body boarding or surfing',
 'body-surfing',
 'bodysurfing',
 'kite surfing',
 'kitesurfing',
 'night surfing',
 'paddle-surfing',
 'surfing',
 'surfing & dangling foot in water amid baitfish',
 'surfing & filming dolphins',
 'surfing (lying prone on his board)',
 'surfing (or body boarding)',
 'surfing (or sailboarding)',
 'surfing (pneumatic surfboard)',
 'surfing (sitting on his board)',
 'surfing amid a shoal of sharks',
 'surfing on chest board (boogie board?)',
 'surfing or body boarding',
 'surfing or surfing',
 'surfing with dolphins',
 'surfing, but lying prone on his board',
 'surfing, but standing in water alongside board',
 'surfing, but swimming to his board',
 'surfing, collided with shark',
 'surfing, lying on surfboard',
 'surfing, paddling seawards',
 'surfing, paddling shorewards',
 'surfing, pushing board ashore',
 'surfing, sitting on board',
 'surfing, stood up on sandbar',
 'surfing?',
 'surfinging',
 'surfinging (urinating on his board)',
 'surfinging, but sitti

In [65]:
set(Shark[(Shark["Activity"].str.contains("swimming")) & ((Shark["Activity"] != "Swimming"))]["Activity"])

{'aircraft ditched in the sea, swimming ashore',
 'anti-aircraft cruiser uss atlanta (cl,-05) travelling in convoy after the battle of midway, encountered a japanese flotilla  (battle of guadalcanal) &, heavily damaged by gunfire, she was lost off lunga point. victim was swimming when bitten.',
 'bathing/swimming',
 'boat capsized, swimming ashore',
 'boogie-boarding / swimming',
 'canoe swamped, swimming back to canoe',
 'crew swimming alongside their anchored ship',
 'dived overboard & was swimming near stern of trawler',
 'diving for trochus shell, swimming to dinghy',
 'diving, but swimming on surface',
 'fall into the water overboard & swimming',
 'fishing, fall into the water in water & swimming strongly to shore',
 'flying fortress bomber aircraft went down after daytime raid on naples. he was swimming on the surface',
 'freedom swimming',
 'had just dived into water & was swimming',
 'jumped overboard and swimming',
 'marathon swimming',
 'refused permission to cross on the fer

In [66]:
set(Shark[(Shark["Activity"].str.contains("spearfishing")) & ((Shark["Activity"] != "Spearfishing"))]["Activity"])

{'commercial spearfishing',
 'competing in spearfishing championship & towing dead fish',
 'diving & spearfishing',
 'diving & spearfishing (ascending)',
 'diving & spearfishing (descending)',
 'diving / spearfishing',
 'diving / spearfishing (resting on the surface)',
 'diving / spearfishing,',
 'diving / spearfishingat edge of reef',
 'diving spearfishing',
 'diving, reportedly also spearfishing',
 'diving, spearfishing',
 'spearfishing',
 'spearfishing & diving for paua',
 'spearfishing & had just speared a ulua',
 'spearfishing & holding catch',
 'spearfishing & lassoed shark',
 'spearfishing (but on surface)',
 'spearfishing (diving)',
 'spearfishing / diving',
 'spearfishing / diving (at surface)',
 'spearfishing / freediving',
 'spearfishing / night diving',
 'spearfishing / swimming on surface',
 'spearfishing competition',
 'spearfishing on scuba',
 'spearfishing on scuba & transferring fish onto a stringer',
 'spearfishing or fishing',
 'spearfishing using scuba',
 'spearfish

In [67]:
set(Shark[(Shark["Activity"].str.contains("bathing"))  & ((Shark["Activity"] != "bathing"))]["Activity"])

{'bathing / standing',
 'bathing alongside ship',
 'bathing alongside the american ship thomas w. sears',
 'bathing alongside us naval ship',
 'bathing close inshore',
 'bathing in 2 feet of water',
 "bathing in 3' to 4' of water",
 "bathing in 5' of water",
 'bathing in knee-deep water',
 'bathing in river',
 'bathing in waist-deep water',
 'bathing in water 0.9 m deep',
 'bathing near whaling ship (bark a. r. tucker of new bedford, massachusetts)',
 'bathing or body',
 'bathing or washing',
 'bathing with her mother',
 'bathing with sister',
 'bathing/swimming',
 'british ship, britannia,  was loading lumber. he was bathing',
 'chasing shark out of bathing area while riding on a surf-ski',
 'night bathing',
 'standing / bathing',
 'sunbathing on beach when he saw child being attacked by the shark',
 'surf bathing'}

In [68]:
set(Shark[(Shark["Activity"].str.contains("fishing")) & ((Shark["Activity"] != "fishing")) & ((Shark["Activity"] != "spearfishing"))]["Activity"])

{'a 75-ton  japanese fishing ship was sunk by  chinese nationalist gunboat, shipwrecked men were clinging to debris',
 'attempting to remove fishing net from submerged object',
 'being pulled to shore from wreck of 25-ton fishing vessel alan s',
 'capsized fishing boat',
 'commercial fishing vessel, ev-nn, struck object & sank. ken crosby and  jame & ann dumas adrift on makeshift raft.',
 'commercial spearfishing',
 'competing in spearfishing championship & towing dead fish',
 'crayfishing',
 'diving & fishing with net',
 'diving & spearfishing',
 'diving & spearfishing (ascending)',
 'diving & spearfishing (descending)',
 'diving / fishing',
 'diving / spearfishing',
 'diving / spearfishing (resting on the surface)',
 'diving / spearfishing,',
 'diving / spearfishingat edge of reef',
 'diving spearfishing',
 'diving, reportedly also spearfishing',
 'diving, spearfishing',
 'dynamite fishing',
 'fall into the water from cliff while fishing & disappeared in strong current',
 'fall into 

In [69]:
set(Shark[(Shark["Activity"].str.contains("diving")) & ((Shark["Activity"] != "Diving"))]["Activity"])

{'abalone diving',
 'abalone diving using hookah (near calving whales)',
 'abalone diving using hookah (resting on the surface)',
 'anti-sabotage night dive exercise alongside destroyer (diving)',
 'cage diving',
 'closed circuit diving (submerged). diving to recover jettisoned packets of opium for police',
 'commercial salvage diving',
 'diving',
 'diving & filming',
 'diving & fishing with net',
 'diving & spearfishing',
 'diving & spearfishing (ascending)',
 'diving & spearfishing (descending)',
 'diving & u/w photography',
 'diving (ascending using scooter)',
 'diving (but on surface)',
 'diving (hookah)',
 'diving (shell maintenance)',
 'diving (submerged riding a scooter)',
 'diving (submerged)',
 'diving , but surfacing',
 'diving / culling lionfish',
 'diving / filming',
 'diving / fishing',
 'diving / kissing the shark',
 'diving / modeling',
 'diving / photographing pilot whales',
 'diving / photography, kneeling on sand',
 'diving / spearfishing',
 'diving / spearfishing (re

In [70]:
def change_activity (x): 
    if ("surfing" in x):
        return "surfing"
    if ("fall" in x) or ("fell" in x):
        return "fall into the water"
    if ("spearfishing" in x):
        return "spearfishing"
    if ("bathing" in x):
        return "bathing"
    if ("fishing" in x):
        return "fishing"
    if ("diving" in x):
        return "diving"
    if ("swimming" in x):
        return "swimming"
    if ("board" in x):
        return "body boarding"
    if ("snork" in x):
        return "snorkeling"
    if ("surf" in x):
        return "surfing" #ponemos esto en último lugar debido a que se han observado varios registros donde aparece surface, que no tiene que ver con el surf
    if ("standing" in x):
        return "wading"

    return x

Shark["Activity"] = Shark["Activity"].map(change_activity)

In [71]:
Shark["Activity"] = Shark["Activity"].str.capitalize()

In [72]:
Shark.Activity.value_counts()

Surfing                                                1336
Swimming                                               1177
Fishing                                                 564
Diving                                                  536
Spearfishing                                            470
                                                       ... 
Wreck of the 1689-ton portuguese  coaster angoche         1
Murder victim                                             1
Moving a shark in a net                                   1
Attempting to drive shark away from sailing regatta       1
Ship lay at anchor & man was working on its rudder        1
Name: Activity, Length: 542, dtype: int64

In [73]:
Shark.head()

Unnamed: 0,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,Age,Injury,Fatal (Y/N),Time,Species
0,03/07/2022,2022,Unprovoked,USA,New York,"Smith Point Beach, Suffolk County",Lifeguard exercises,Zach Gallo,M,31,Injuries to chest and right hand,N,10h15,5'shark
1,21/06/2022,2022,Unprovoked,USA,North & South Carolina,"Myrtle Beach, Horry County",Unknown,male,M,14,,N,,
2,09/04/2022,2022,Provoked,USA,New Jersey,Tutle Back Zoo,Feeding,female,F,12,Finger nipped by captive shark PROVOKED INCIDENT,N,Afternoon,Epaulette shark
3,16/02/2022,2022,Unprovoked,AUSTRALIA,New South Wales,"Buchan Point, Sydney",Swimming,Simon Nellist,M,36,FATAL,Y,16h30,
4,22/12/2021,2021,Unprovoked,USA,Florida,"Anna Maria Island, Manatee County",Surfing,Erika Lane,F,42,Punctures to leg,N,,Blacktip or spinner shark


### COLUMNA NAME Y SEX

In [74]:
#La columna Name no nos aporta información relevante para nuestro análisis, sin embargo, a veces nos aporta información sobre el sexo de la
#víctima. De esta forma, rellenaremos información en nuestra columna Sexo gracias a esta información: 

In [75]:
#Vamos a pasar la columna a tipo string
Shark["Sex"].astype(str)

0         M
1         M
2         F
3         M
4         F
       ... 
6669    nan
6670      M
6671      M
6672      M
6673      M
Name: Sex, Length: 6674, dtype: object

In [76]:
Shark["Name"] = Shark["Name"].str.lower()

In [77]:
Shark.Sex.isnull().value_counts()

False    6179
True      495
Name: Sex, dtype: int64

In [78]:
Shark.Sex.value_counts(dropna= False)

M        5426
F         747
NaN       495
N           3
M x 2       1
lli         1
.           1
Name: Sex, dtype: int64

In [79]:
Shark["Sex"]= Shark["Sex"].str.replace("N", "Unknown").replace("M x 2", "Unknown").replace("lli", "Unknown").replace(".", "Unknown")

In [80]:
for i in range(len(Shark)):

    if str(Shark['Sex'][i]) == "nan":
    
        if ("female" in str(Shark['Name'][i])) or ("girl" in str(Shark['Name'][i])):
            Shark['Sex'][i] = "F"

        elif ("male" in str(Shark['Name'][i])) or ("boy" in str(Shark['Name'][i])):
            Shark['Sex'][i] = "M"        

In [81]:
Shark["Sex"].fillna("Unknown", inplace= True) #Rellenamos el resto de Nans con Unknown

In [82]:
#Borramos la columna de name, que no nos es de utilidad para nuestro análisis

Shark.drop(columns= "Name", inplace= True)

In [83]:
Shark.head()

Unnamed: 0,Date,Year,Type,Country,Area,Location,Activity,Sex,Age,Injury,Fatal (Y/N),Time,Species
0,03/07/2022,2022,Unprovoked,USA,New York,"Smith Point Beach, Suffolk County",Lifeguard exercises,M,31,Injuries to chest and right hand,N,10h15,5'shark
1,21/06/2022,2022,Unprovoked,USA,North & South Carolina,"Myrtle Beach, Horry County",Unknown,M,14,,N,,
2,09/04/2022,2022,Provoked,USA,New Jersey,Tutle Back Zoo,Feeding,F,12,Finger nipped by captive shark PROVOKED INCIDENT,N,Afternoon,Epaulette shark
3,16/02/2022,2022,Unprovoked,AUSTRALIA,New South Wales,"Buchan Point, Sydney",Swimming,M,36,FATAL,Y,16h30,
4,22/12/2021,2021,Unprovoked,USA,Florida,"Anna Maria Island, Manatee County",Surfing,F,42,Punctures to leg,N,,Blacktip or spinner shark


### COLUMNA AGE


In [84]:
Shark.Age.value_counts()

17            174
18            155
15            155
19            153
20            152
             ... 
X               1
middle-age      1
72              1
M               1
84              1
Name: Age, Length: 86, dtype: int64

In [85]:
Shark.Age.unique()

array(['31', '14', '12', '36', '42', '21', '57', '16', '73', nan, '22',
       '18', '9', '20', '50', '41', '32', '26', '15', '37', '10', '60',
       '40', '23', '24', '63', '25', '30', '29', '34', '51', '46', '19',
       '17', '8', '39', '62', '33', '52', '35', '49', '48', '13', '27',
       '7', '44', '38', '43', '1', '28', '11', '59', '54', '58', '61',
       '55', '6', '68', '45', '71', '70', '47', '3', '75', '53', '4',
       '65', '56', '74', '67', 'middle-age', '69', '64', '5', '77', '86',
       '66', 'Teen', '81', '82', '!!', '87', 'X', '78', '72', 'M', '84'],
      dtype=object)

In [86]:
Shark.Age.describe()

count     3872
unique      86
top         17
freq       174
Name: Age, dtype: object

In [87]:
Shark["Age"] = Shark["Age"].str.replace("nan", "Unknown").replace("middle-age", "Unknown").replace("Teen", "Unknown")\
    .replace("!!", "Unknown").replace("X", "Unknown").replace("M", "Unknown")

In [88]:
Shark["Age"].fillna("Unknown", inplace= True)

### COLUMNA FATAL


In [89]:
#Cambiamos el nombre de la columna
Shark.rename(columns= {"Fatal (Y/N)": "Damage"}, inplace= True)

In [90]:
Shark["Damage"] = Shark["Damage"].str.upper()

In [91]:
Shark["Damage"].unique()

array(['N', 'Y', 'UNKNOWN', 'Y X 2', 'F', 'NQ', '2017'], dtype=object)

In [92]:
#Vamos a modificar los valores a tres categorías: No fatal, Fatal o Unknown

In [93]:
def change_damage(x):
    if x == "Y" or x == "Y x 2":
        return "Fatal"
    
    if x == "N":
        return "No Fatal"
    
    return "Unknown"

Shark["Damage"] = Shark["Damage"].map(change_damage)

In [94]:
Shark.Damage.isnull().value_counts() #No hay datos perdidos en esta columna

False    6674
Name: Damage, dtype: int64

In [95]:
Shark.Damage.unique() #Comprobamos

array(['No Fatal', 'Fatal', 'Unknown'], dtype=object)

### COLUMNA TIME

In [96]:
Shark.head()

Unnamed: 0,Date,Year,Type,Country,Area,Location,Activity,Sex,Age,Injury,Damage,Time,Species
0,03/07/2022,2022,Unprovoked,USA,New York,"Smith Point Beach, Suffolk County",Lifeguard exercises,M,31,Injuries to chest and right hand,No Fatal,10h15,5'shark
1,21/06/2022,2022,Unprovoked,USA,North & South Carolina,"Myrtle Beach, Horry County",Unknown,M,14,,No Fatal,,
2,09/04/2022,2022,Provoked,USA,New Jersey,Tutle Back Zoo,Feeding,F,12,Finger nipped by captive shark PROVOKED INCIDENT,No Fatal,Afternoon,Epaulette shark
3,16/02/2022,2022,Unprovoked,AUSTRALIA,New South Wales,"Buchan Point, Sydney",Swimming,M,36,FATAL,Fatal,16h30,
4,22/12/2021,2021,Unprovoked,USA,Florida,"Anna Maria Island, Manatee County",Surfing,F,42,Punctures to leg,No Fatal,,Blacktip or spinner shark


In [97]:
Shark.Time.value_counts()

Afternoon         212
11h00             138
Morning           133
12h00             123
16h00             119
                 ... 
22h20               1
10h45-11h15         1
11h115              1
18h30?              1
Late Afternoon      1
Name: Time, Length: 397, dtype: int64

In [98]:
Shark.Time.unique()

array(['10h15', nan, 'Afternoon', '16h30', '`17h00', '10h00', '11h45',
       '07h45', '17h00', '15h30', '10h30', '13h40', '06h15', 'Morning',
       '21h50', '15h01', '18h30', '19h30', '11h30', '14h00',
       'Late afternoon', '08h30', 'Night', '16h00', '18h00', '05h45',
       '10h45', '10h55', 'Evening', '11h00', '09h00', '13h00', 'Sunset',
       '13h30', '16h45', '15h00', '11h10', '20h00', '17h30', '06h10',
       '09h40', '14h20', '07h50', '16h20', '14h30', '22h00', '14h55',
       '09h15', 'P.M.', '13h15', '11h48', '10h25', '13h37', '14h45',
       '07h05', '17h15', '08h35', '12h30', '12h00', '15h15', '12h40',
       '16h23', '09h30', 'Before daybreak', '07h00', '08h40',
       'Early Morning', '11h55', '09h30 / 10h00', '06h00', '17h51',
       '10h50', '06h30', '15h45', '09h45', '07h08', '13h20', '13h14',
       '11h50', '10h35', 'Early morning', 'Mid-morning', '19h00', '18h40',
       '>17h00', '12h45 / 13h45', '08h00', '14h35', '14h10', '15h40',
       '08h15', '11h41', '12h

In [99]:
Shark["Time"] = Shark["Time"].str.strip().str.lower().str.replace(">","").replace("<","")

In [100]:
Shark.Time.isnull().value_counts()

True     3352
False    3322
Name: Time, dtype: int64

In [101]:
Shark["Time"].fillna("Unknown", inplace= True)

In [102]:
def change_time(x):
    if x[:2] in ["06", "07", "08", "09", "10", "11", "12"] or "morning" in x or "sunset" in x:
        return "morning"
    if x[:2] in ["13", "14", "15", "16", "17", "18"] or "afternoon" in x or "afternon" in x:
        return "afternoon"
    if x[:2] in ["19", "20", "21", "22"] or "evening" in x:
        return "evening"
    if x[:2] in ["23", "24", "00", "01", "02", "03", "04", "05"] or "night" in x:
        return "night"
    return x

Shark["Time"] = Shark["Time"].map(change_time)
    

In [103]:
Shark["Time"] = Shark["Time"].str.title()

### COLUMNA SPECIES

In [104]:
Shark["Species"] = Shark["Species"].str.strip()

In [105]:
Shark["Species"] = Shark["Species"].str.lower()

In [106]:
Shark.Species.value_counts()

white shark                                           194
shark involvement prior to death was not confirmed    103
invalid                                               102
shark involvement not confirmed                        93
tiger shark                                            89
                                                     ... 
porbeagle, 1.5 m                                        1
white shark, 1,900-lb                                   1
comrades saw shark's tail appear about 5' away          1
4 m [13'] shark seen in vicinity                        1
blue or porbeagle shark                                 1
Name: Species, Length: 1526, dtype: int64

In [107]:
Shark.Species = Shark.Species.astype(str)

In [108]:
set(Shark.Species.unique())

{'unidentified shark',
 'questionable incident; reported as shark attack but thought to involve a pinniped instead',
 'white shark, 4.6m',
 "4.3 m [14'] shark seen in area previous week",
 "3.6 m [11'9] white shark",
 "said to involve a bull shark, 5' to 6'",
 '4 to 5m white shark',
 'juvenile bull shark?',
 "tiger shark, >3 m [10']",
 "2.4 m [8'] shark, possibly a dusky shark",
 "hammerhead shark, 2.4 m [8'], according to lifeguard sam barrows",
 "said to involve a 6 m to 7.3 m [20' to 24'] shark",
 "bronze whaler shark, 3 m [10'], 200-lb",
 "bull shark, 4 m [13']",
 "1.8 m to 2.1 m [6' to 7'] shark",
 'bull shark',
 'white shark, 4.5m',
 'possibly a small hammerhead shark',
 'hammerhead shark?+o2356',
 "2 m [6'9], 87.5-kg [193-lb]  shark",
 "2.5' shark",
 'shark involvement prior to death could not be determined',
 'raggedtooth shark, 1.96 m, 140-kg',
 "2.1 m [7'] shark, possibly a spinner shark",
 'white shark, 2.8 to 3 m',
 'possibly juvenile tiger shark',
 "bronze whaler shark, 6'

In [109]:
Shark["Species"].fillna("Unknown")

0                                            5'shark
1                                                nan
2                                    epaulette shark
3                                                nan
4                          blacktip or spinner shark
                            ...                     
6669    shark involvement prior to death unconfirmed
6670                                             nan
6671                         nurse shark, 2.1 m [7']
6672                                             nan
6673                                      bull shark
Name: Species, Length: 6674, dtype: object

In [110]:
#Después de echar un ojo a las principales especies, realizamos cambios con una función, para limpiar lo máximo posible la columna

def change_species(x):
    if "white" in x:
        return "white shark"
    if "Tiger" in x:
        return "tiger shark"
    if "hammerhead" in x:
        return "hammerhead shark"
    if "wobbegong" in x:
        return "wobbegong shark"
    if "bull" in x:
        return "bull shark"
    if "blacktip" in x:
        return "blacktip shark"
    if "blacktip" in x:
        return "blacktip shark"
    if "nurse" in x:
        return "nurse shark"
    if "raggedtooth" in x:
        return "raggedtooth shark"
    if "mako" in x:
        return "mako shark"
    if "lemon" in x:
        return "lemon shark"
    if "mako" in x:
        return "mako shark"
    if "zambesi" in x: 
        return "zambesi shark"
    if "whale" in x: 
        return "whaler shark"
    if "reef" in x:
        return "reef shark"
    if "blue" in x: 
        return "blue shark"
    if ("invalid" in x) or ("involvement" in x) or ("questionable" in x) or ("unknown" in x) or ("nan" in x): 
        return "unknown"
    return x
    
Shark["Species"] = Shark["Species"].map(change_species)
    

In [111]:
Shark["Species"] = Shark["Species"].str.title()

In [112]:
Shark.Species.value_counts()

Unknown                                 3455
White Shark                              728
Bull Shark                               221
Blacktip Shark                           130
Nurse Shark                              111
                                        ... 
Tiger Shark, 4 M [13']                     1
Two 1.2 M To 1.5 M [4' To 5'] Sharks       1
Sandtiger Shark, 2 M, Male                 1
200 To 300 Kg Shark                        1
13'10 Shark                                1
Name: Species, Length: 648, dtype: int64

In [113]:
Shark.head()

Unnamed: 0,Date,Year,Type,Country,Area,Location,Activity,Sex,Age,Injury,Damage,Time,Species
0,03/07/2022,2022,Unprovoked,USA,New York,"Smith Point Beach, Suffolk County",Lifeguard exercises,M,31,Injuries to chest and right hand,No Fatal,Morning,5'Shark
1,21/06/2022,2022,Unprovoked,USA,North & South Carolina,"Myrtle Beach, Horry County",Unknown,M,14,,No Fatal,Unknown,Unknown
2,09/04/2022,2022,Provoked,USA,New Jersey,Tutle Back Zoo,Feeding,F,12,Finger nipped by captive shark PROVOKED INCIDENT,No Fatal,Afternoon,Epaulette Shark
3,16/02/2022,2022,Unprovoked,AUSTRALIA,New South Wales,"Buchan Point, Sydney",Swimming,M,36,FATAL,Fatal,Afternoon,Unknown
4,22/12/2021,2021,Unprovoked,USA,Florida,"Anna Maria Island, Manatee County",Surfing,F,42,Punctures to leg,No Fatal,Unknown,Blacktip Shark


In [114]:
Shark.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6674 entries, 0 to 6673
Data columns (total 13 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Date      6674 non-null   object
 1   Year      6674 non-null   object
 2   Type      6658 non-null   object
 3   Country   6674 non-null   object
 4   Area      6674 non-null   object
 5   Location  6219 non-null   object
 6   Activity  6674 non-null   object
 7   Sex       6674 non-null   object
 8   Age       6674 non-null   object
 9   Injury    6652 non-null   object
 10  Damage    6674 non-null   object
 11  Time      6674 non-null   object
 12  Species   6674 non-null   object
dtypes: object(13)
memory usage: 678.0+ KB


-----------------------------

In [115]:
Shark.head()

Unnamed: 0,Date,Year,Type,Country,Area,Location,Activity,Sex,Age,Injury,Damage,Time,Species
0,03/07/2022,2022,Unprovoked,USA,New York,"Smith Point Beach, Suffolk County",Lifeguard exercises,M,31,Injuries to chest and right hand,No Fatal,Morning,5'Shark
1,21/06/2022,2022,Unprovoked,USA,North & South Carolina,"Myrtle Beach, Horry County",Unknown,M,14,,No Fatal,Unknown,Unknown
2,09/04/2022,2022,Provoked,USA,New Jersey,Tutle Back Zoo,Feeding,F,12,Finger nipped by captive shark PROVOKED INCIDENT,No Fatal,Afternoon,Epaulette Shark
3,16/02/2022,2022,Unprovoked,AUSTRALIA,New South Wales,"Buchan Point, Sydney",Swimming,M,36,FATAL,Fatal,Afternoon,Unknown
4,22/12/2021,2021,Unprovoked,USA,Florida,"Anna Maria Island, Manatee County",Surfing,F,42,Punctures to leg,No Fatal,Unknown,Blacktip Shark


### SE AÑADEN LAS COLUMNAS: MESES, TABLE PRESENCE, ACTIVITY ON SURFACE

In [116]:
#Vamos a crear una columna con los meses donde se han producido los accidentes

def create_month(x):

    if x[3:5] == "01":
        return "January"
    if x[3:5] == "02":
        return "February"
    if x[3:5] == "03":
        return "March"
    if x[3:5] == "04":
        return "April"
    if x[3:5] == "05":
        return "May"
    if x[3:5] == "06":
        return "June"
    if x[3:5] == "07":
        return "July"
    if x[3:5] == "08":
        return "August"
    if x[3:5] == "09":
        return "September"
    if x[3:5] == "10":
        return "October"
    if x[3:5] == "11":
        return "November"
    if x[3:5] == "12":
        return "December"
        
    return x
    

New_column = Shark["Date"].map(create_month)
Shark.insert(2, "Month", New_column)

In [117]:
#Nuestro estudio trata sobre si la presencia de tablas de surf o similares influye a la hora de que un tiburón ataque o no.
#Sería interesante, del mismo modo, crear una columna que nos indique ese dato a partir de la columna de actividad.

def create_column_board(x):
    if  x in ["Surfing", "Body boarding"]:
        return "Yes"
    if x in ["Swimming", "Diving", "Spearfishing", "Bathing", "Snorkeling", "Fall into the water", "Wading"]:
        return "No"
    
    
New_column2 = Shark["Activity"].map(create_column_board)
Shark.insert(8, "Table Presence", New_column2)

In [119]:
#Sería interesante, del mismo modo, crear una columna que nos indique si la actividad que estaba realizando la persona era sobre la superficie o no 

def create_column_surface(x):
    if  x in ["Surfing", "Body boarding", "Swimming", "Bathing", "Snorkeling", "Fall into the water", "Floating", ]:
        return "Yes"
    if x in ["Diving", "Spearfishing"]:
        return "No"
    
    
New_column3 = Shark["Activity"].map(create_column_surface)
Shark.insert(8, "Activity on surface", New_column3)

In [120]:
Shark.head()

Unnamed: 0,Date,Year,Month,Type,Country,Area,Location,Activity,Activity on surface,Table Presence,Sex,Age,Injury,Damage,Time,Species
0,03/07/2022,2022,July,Unprovoked,USA,New York,"Smith Point Beach, Suffolk County",Lifeguard exercises,,,M,31,Injuries to chest and right hand,No Fatal,Morning,5'Shark
1,21/06/2022,2022,June,Unprovoked,USA,North & South Carolina,"Myrtle Beach, Horry County",Unknown,,,M,14,,No Fatal,Unknown,Unknown
2,09/04/2022,2022,April,Provoked,USA,New Jersey,Tutle Back Zoo,Feeding,,,F,12,Finger nipped by captive shark PROVOKED INCIDENT,No Fatal,Afternoon,Epaulette Shark
3,16/02/2022,2022,February,Unprovoked,AUSTRALIA,New South Wales,"Buchan Point, Sydney",Swimming,Yes,No,M,36,FATAL,Fatal,Afternoon,Unknown
4,22/12/2021,2021,December,Unprovoked,USA,Florida,"Anna Maria Island, Manatee County",Surfing,Yes,Yes,F,42,Punctures to leg,No Fatal,Unknown,Blacktip Shark


In [122]:
#Guardamos el DF en nuestro directorio

Shark.to_csv("./CSVs/shark-file.csv", index= False)

----------------------------------