### Data cleaning

Este notebook sirve para limpiar la base de datos extraida y almcenada con título `scraped_data.csv`. Aquí se llevarán a cabo métodos de inspección del dataframe para comprobar que los datatypes sean correctos y que la información contenida sea útil y legible en otras herramientas como pueden ser SQL y PowerBI.

In [1]:
import pandas as pd
import numpy as np

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [2]:
df = pd.read_csv('../data/scraped_data.csv')
df.head()

Unnamed: 0,date,lat_and_long,GTOA_Protocol,Boat_Type,Boat_Length,Towing_Inflatable,Trailing_Fishing_Lure,Physical_Contact_With_Boat,Number_of_Adult_Orcas,Number_of_Juvenile_Orcas,Number_of_Uncertain_Age_Orcas,Rudder,Motoring_or_Sailing,Speed_Knots,Sea_State,Wind_Speed_Beaufort,Daylight_or_Darkness,Cloud_Cover,Distance_Off_Land_NM,Depth_Meters,Depth_Gauge,Autopilot,Hull_Topsides_Color,Antifoul_Color,Boat_Damaged,Tow_Required,Crew_Response,Orcas_Behaviour
0,2023-11-01 22:15:00,"32 47.4980 N, 9 54.3980 W","No, did not follow protocol, interaction laste...",Sail,10 - 12.5m,Not towing,No,No,1,0,0,Spade,Sailing,5 - 7,Moderate,5 - 6 (17 - 27 knots),Night,0 - 25%,Over 10,200m+,On,On,White/light,Blue,No,No,"Orca interaction at 10:15pm on 01/11, 40 miles...",I would describe the behaviour of the Orca dur...
1,2023-10-31 07:50:00,"39 26.0000 N, 9 23.0000 W","Yes, followed protocol, interaction lasted les...",Sail,12.5 - 15m,Not towing,No,Yes,2,5,0,Twin rudder,Motoring,5 - 7,Rough,3 - 4 (7 - 16 knots),Day,50 - 75%,2 - 5,40 - 200m,On,On,White/light,Black,"Yes, moderate - immediate repairs required",No,We had sandbags on our sugar scoops and metal ...,Juveniles hitting the rudders adults close by
2,2023-09-19 11:00:00,"37 40.0000 N, 8 54.0000 W","No, did not follow protocol, interaction laste...",Sail,12.5 - 15m,Not towing,No,Yes,1,0,0,Spade,Motoring,5 - 7,Calm,3 - 4 (7 - 16 knots),Day,0 - 25%,2 - 5,40 - 200m,On,On,White/light,Other,"Yes, moderate - immediate repairs required",No,We saw the orca approach from 10 o’clock posit...,There was an initial approach 45 minutes earli...
3,2023-09-01 13:15:00,"45 36.0000 N, 3 45.0000 W","Yes, followed protocol, interaction lasted 10 ...",Sail,Over 15m,Not towing,Yes,Yes,1,2,0,Spade,Sailing,3 - 4,Calm,3 - 4 (7 - 16 knots),Day,25 - 50%,Over 10,200m+,Off,Off,White/light,Black,"Yes, moderate - immediate repairs required",No,Les trois orques passent constamment de bâbord...,Pas de comportement visblement agressif./// No...
4,2023-09-02 03:45:00,"42 45.0000 N, 9 14.0000 W","Yes, followed protocol, interaction lasted les...",Sail,12.5 - 15m,Not towing,No,Yes,1,2,0,Spade,Motorsailing,5 - 7,Calm,0 - 2 (0 - 6 knots),Night,0 - 25%,5 - 10,40 - 200m,On,On,White/light,Black,"Yes, moderate - immediate repairs required",Yes,Arrêt du pilote automatique a la 2 eme interac...,Approche furtive à la première interaction dir...


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 154 entries, 0 to 153
Data columns (total 28 columns):
 #   Column                         Non-Null Count  Dtype 
---  ------                         --------------  ----- 
 0   date                           154 non-null    object
 1   lat_and_long                   154 non-null    object
 2   GTOA_Protocol                  154 non-null    object
 3   Boat_Type                      154 non-null    object
 4   Boat_Length                    154 non-null    object
 5   Towing_Inflatable              154 non-null    object
 6   Trailing_Fishing_Lure          154 non-null    object
 7   Physical_Contact_With_Boat     154 non-null    object
 8   Number_of_Adult_Orcas          154 non-null    object
 9   Number_of_Juvenile_Orcas       154 non-null    object
 10  Number_of_Uncertain_Age_Orcas  154 non-null    object
 11  Rudder                         154 non-null    object
 12  Motoring_or_Sailing            154 non-null    object
 13  Speed

Tenemos 28 columnas, todas de tipo objeto por como se ha escrapeado la información en el notebook `1.WebScraping`. Entre otras cosas se va a hacer lo siguiente:

* Borrar aquellas filas que no contengan información o que estén repetidas
* Corregir cualquier fila de datos que pueda estar en columnas equivocadas
* Separar datos en dos columnas si fuera necesario, como por ejemplo con latitud y longitud que actualmente se encuentra en una sola columna
* Cambiar los datatypes de las columnas. Actualmente todas las columnas son de tipo objeto/string.
* Quitar duplicados en caso de que los hubiera
* La información contenida en el dataframe viene de un formulario que llevaron a cabo los patrones de las embarcaciones que sufrieron una interacción con una orca constaban de varias opciones (y no respuesta libre), será util convertir las variables categóricas en numéricas, vía aplicación de One-Hot encoding. Esto supondrá un paso clave de cara a aplicar métodos de modelos predictivos en el futuro.
* Identificar y rechazar outliers en caso de que los hubiera y siempre razonando si interesa deshacernos de ellos o nos aportan alguna información útil.
* Comprobación de consistencia de la BBDD.

#### 1. Descripción de las columnas

A continuación se incluye una descripción de las columnas con las que cuenta originalmente nuestra base de datos:

* **date**: Fecha y hora aproximada de la interacción
* **lat_and_long**: Latitud y longitud donde se dio la interacción
* **GTOA_Protocol**: Se siguió el *protocolo* de GT Orca Atlántica: Arriar las velas, detener la embarcación, apagar el motor y mantener un perfil bajo así como *duración* de la interacción
* **Boat_Type**: Tipo de barco - Velero | Motor | Barco de pesca
* **Boat_Length**: Eslora del barco (m) - menos de 10m | 10-12.5 | 12.5-15m | Más de 15m
* **Towing_Inflatable**: Se encontraba el barco remolcando una lancha neumática?
* **Trailing_Fishing_Lure**: Se encontraba el barco arrastrando un señuelo de pesca?
* **Physical_Contact_With_Boat**: Hubo contacto físico de las orcas con la embarcación?
* **Number_of_Adult_Orcas**: Número de orcas adultas?
* **Number_of_Juvenile_Orcas**: Número de orcas juveniles?
* **Number_of_Uncertain_Age_Orcas**: Número de orcas de edad incierta?
* **Rudder**: Tipo de timón - Spade/pala | Semi skeg/Semicompensado | Full skeg/Completo | Twin rudder/Doble timón | Keel hung/Quilla corrida
* **Motoring_or_Sailing**: Motor o vela - Vela | Motor | Motor/Vela | Hove-to
* **Speed_Knots**: Velocidad (kts)
* **Sea_State**: Estado del mar - Calma | Moderado | Gruesa
* **Wind_Speed_Beaufort**: Velocidad del viento (Escala de Beaufort) - 0.2 | 3-4 | 5-6 | 7+
* **Daylight_or_Darkness**: Noche/Día - Amanecer | Día | Atardecer | Noche
* **Cloud_Cover**: Cobertura de nubes - 0-25% | 25-50% | 50-75% | 75-100%
* **Distance_Off_Land_NM**: Distancia a tierra (nm) - 0-2 | 2-5 | 5-10 | Más de 10
* **Depth_Meters**: Profundidad (m) - hasta 20m | 20-40m | 40-200m | 200m+
* **Depth_Gauge**: Medidor de profundidad - On | Off
* **Autopilot**: Piloto automático - On | Off
* **Hull_Topsides_Color**: Color del casco - Blanco | Oscuro
* **Antifoul_Color**: Color del antifoul - Negro | Azul | Rojo | Blanco | Verde | Coppercoat | Otro
* **Boat_Damaged**: Fue dañado el barco o necesita reparación alguna? Sí, menor | Sí, moderado | Sí, grandes reparaciones | No
* **Tow_Required**: ¿Fue remolcado? - Yes | No
* **Crew_Response**: Descripción abierta de la interacción así como acciones que se tomaron y su estas  disuadieron o no la interacción con las orcas.
* **Orcas_Behaviour**: Descripción del comportamiento de la/s orca/s

#### Fila 105
* Cambio en la fila con indice 105 por error de dimensionamiento - Vamos a mover una celda a la derecha todos los valores desde la columna 'Rudder' en adelante:

In [4]:
print(df.iloc[105])

date                                                           2022-04-13 15:15:00
lat_and_long                                              35 52.0000 N, 6 1.1000 W
GTOA_Protocol                                                                 Sail
Boat_Type                                                               12.5 - 15m
Boat_Length                                                             Not towing
Towing_Inflatable                                                            Spade
Trailing_Fishing_Lure                                                      Unknown
Physical_Contact_With_Boat                                                 Unknown
Number_of_Adult_Orcas                                                            0
Number_of_Juvenile_Orcas                                                         0
Number_of_Uncertain_Age_Orcas                                                    0
Rudder                                                                     Sailing
Moto

In [5]:
index_to_shift = 105

# Cogemos el indice de la columna 'Rudder'
rudder_column_index = df.columns.get_loc('Rudder')

# Usamos la función shift para la fila en particular y desde 'Rudder' en adelante
df.iloc[index_to_shift, rudder_column_index:] = df.iloc[index_to_shift, rudder_column_index:].shift(1)

print(df.iloc[index_to_shift])

date                                                           2022-04-13 15:15:00
lat_and_long                                              35 52.0000 N, 6 1.1000 W
GTOA_Protocol                                                                 Sail
Boat_Type                                                               12.5 - 15m
Boat_Length                                                             Not towing
Towing_Inflatable                                                            Spade
Trailing_Fishing_Lure                                                      Unknown
Physical_Contact_With_Boat                                                 Unknown
Number_of_Adult_Orcas                                                            0
Number_of_Juvenile_Orcas                                                         0
Number_of_Uncertain_Age_Orcas                                                    0
Rudder                                                                        None
Moto

Ahora voy a cambiar a mano:
* GTOA_Protocol: 'Unknown'
* Boat_Type: 'Sail'
* Boat_Length: '12.5 - 15m'
* Towing_Inflatable: 'Not towing'
* Rudder: 'Spade'

In [6]:
# Voy a usar .loc[] para hacer los cambios
df.loc[105, 'GTOA_Protocol'] = 'Unknown'
df.loc[105, 'Boat_Type'] = 'Sail'
df.loc[105, 'Boat_Length'] = '12.5 - 15m'
df.loc[105, 'Towing_Inflatable'] = 'Not towing'
df.loc[105, 'Rudder'] = 'Spade'

print(df.iloc[105])

date                                                           2022-04-13 15:15:00
lat_and_long                                              35 52.0000 N, 6 1.1000 W
GTOA_Protocol                                                              Unknown
Boat_Type                                                                     Sail
Boat_Length                                                             12.5 - 15m
Towing_Inflatable                                                       Not towing
Trailing_Fishing_Lure                                                      Unknown
Physical_Contact_With_Boat                                                 Unknown
Number_of_Adult_Orcas                                                            0
Number_of_Juvenile_Orcas                                                         0
Number_of_Uncertain_Age_Orcas                                                    0
Rudder                                                                       Spade
Moto

#### Fila 153
* Vamos a borrar esta fila por contener todo 'Unknown'

In [7]:
print(df.iloc[153])

date                             Unknown
lat_and_long                     Unknown
GTOA_Protocol                    Unknown
Boat_Type                        Unknown
Boat_Length                      Unknown
Towing_Inflatable                Unknown
Trailing_Fishing_Lure            Unknown
Physical_Contact_With_Boat       Unknown
Number_of_Adult_Orcas            Unknown
Number_of_Juvenile_Orcas         Unknown
Number_of_Uncertain_Age_Orcas    Unknown
Rudder                           Unknown
Motoring_or_Sailing              Unknown
Speed_Knots                      Unknown
Sea_State                        Unknown
Wind_Speed_Beaufort              Unknown
Daylight_or_Darkness             Unknown
Cloud_Cover                      Unknown
Distance_Off_Land_NM             Unknown
Depth_Meters                     Unknown
Depth_Gauge                      Unknown
Autopilot                        Unknown
Hull_Topsides_Color              Unknown
Antifoul_Color                   Unknown
Boat_Damaged    

In [8]:
df.drop(index=153, inplace=True)
df.tail()

Unnamed: 0,date,lat_and_long,GTOA_Protocol,Boat_Type,Boat_Length,Towing_Inflatable,Trailing_Fishing_Lure,Physical_Contact_With_Boat,Number_of_Adult_Orcas,Number_of_Juvenile_Orcas,Number_of_Uncertain_Age_Orcas,Rudder,Motoring_or_Sailing,Speed_Knots,Sea_State,Wind_Speed_Beaufort,Daylight_or_Darkness,Cloud_Cover,Distance_Off_Land_NM,Depth_Meters,Depth_Gauge,Autopilot,Hull_Topsides_Color,Antifoul_Color,Boat_Damaged,Tow_Required,Crew_Response,Orcas_Behaviour
148,2022-07-06 07:15:00,"35 59.0000 N, 5 55.0000 W","No, did not follow protocol, interaction laste...",Sail,12.5 - 15m,Not towing,Unknown,Unknown,0,0,0,Twin rudder,Motorsailing,5 - 7,Calm,3 - 4 (7 - 16 knots),Day,0 - 25%,5 - 10,40 - 200m,On,On,White/light,Black,No,No,We were following another sail boat from Porti...,"They appeared from my starboard, went under th..."
149,2022-07-03 18:30:00,"39 24.6240 N, 9 38.4150 W","Yes, followed protocol, interaction lasted 10 ...",Sail,10 - 12.5m,Not towing,Unknown,Unknown,0,0,0,Twin rudder,Sailing,5 - 7,Moderate,3 - 4 (7 - 16 knots),Day,50 - 75%,Over 10,200m+,On,On,White/light,Black,"Yes, extensive - major works required",No,Nous naviguions à la voile avec un catamaran d...,"Relativement calme, allant sous le bateau, don..."
150,2022-07-03 17:50:00,"38 7.2100 N, 9 1.4300 W","Yes, followed protocol, interaction lasted les...",Sail,Under 10m,Not towing,Unknown,Unknown,0,0,0,Spade,Motorsailing,5 - 7,Calm,5 - 6 (17 - 27 knots),Day,25 - 50%,5 - 10,40 - 200m,On,On,White/light,Blue,No,No,I only saw 2 adult females and 2 juveniles. My...,Very placid and almost lethargic. I’ve never s...
151,2022-07-03 13:00:00,"39 21.0000 N, 9 35.0000 W","No, did not follow protocol, interaction laste...",Sail,Over 15m,Not towing,Unknown,Unknown,0,0,0,Twin rudder,Motorsailing,5 - 7,Moderate,7+ (28 knots+),Day,25 - 50%,5 - 10,40 - 200m,On,Off,White/light,Black,"Yes, extensive - major works required",No,The boat is a catamaran with 2 engines and rud...,There were 3 adults and 2 juveniles. The small...
152,2022-06-24 12:50:00,"36 50.0000 N, 8 55.0000 W","No, did not follow protocol, interaction laste...",Sail,10 - 12.5m,Not towing,Unknown,Unknown,0,0,0,Spade,Sailing,5 - 7,Calm,3 - 4 (7 - 16 knots),Day,0 - 25%,Over 10,200m+,On,On,White/light,Blue,No,No,The two orcas were nearby a buoy and as soon a...,The two orcas (one of 6 meters and other of 3-...


## Comprobación columna a columna
* Vamos a comprobar los valores unicos de cada columna para detectar valores que no cuadren

#### GTOA_Protocol

In [9]:
df.GTOA_Protocol.unique()

array(['No, did not follow protocol, interaction lasted less than 10 minutes',
       'Yes, followed protocol, interaction lasted less than 10 minutes',
       'Yes, followed protocol, interaction lasted 10 to 30 minutes',
       'Yes, followed protocol, interaction lasted 30 to 60 minutes',
       'No, did not follow protocol, interaction lasted 10 to 30 minutes',
       'Yes, followed protocol, interaction lasted more than 60 minutes',
       'No, did not follow protocol, interaction lasted more than 60 minutes',
       'No, did not follow protocol, interaction lasted 30 to 60 minutes',
       'Unknown'], dtype=object)

Dado que esta columna contiene estos valores únicos, podemos facilmente definir una función que nos separe en dos columnas de mayor utilidad. Por un lado nos están diciendo si se siguió o no el protocolo y por otro cuanto tiempo duró la interacción. Separaremos esta columna en las siguientes: 'Followed_GTOA_Protocol' e 'Interaction_time'.

Vamos a aprovechar como están construidas las respuestas para separar en cada una de las nuevas columnas:
* 'Followed_GTOA_Protocol': A partir de la primera palabra podemos saber si se siguió (``Yes``) si no (``No``) o si no se conoce la respuesta del patrón, desconocido (``Unknown``)

* 'Interaction_time': A partir de las últimas 4 palabras podemos clasificar en 5 rangos de tiempo:
1) less than 10 minutes: 0-10
2) 10 to 30 minutes: 10-30
3) 30 to 30 minutes: 30-60
4) more than 60 minutes: 60+
5) Unknown

In [10]:
# Definimos una función que nos categorice:
def saca_protocolo_tiempo(value):
    
    # Empezamos categorizando según se siguiera el protocolo o no
    if value.startswith('Yes'):
        protocolo = 'Yes'
    elif value.startswith('No'):
        protocolo = 'No'
    else:
        protocolo = 'Unknown'

    # Seguimos categorizando por duración de la interacción
    if 'less than 10 minutes' in value:
        interaccion = '0-10'
    elif '10 to 30 minutes' in value:
        interaccion = '10-30'
    elif '30 to 60 minutes' in value:
        interaccion = '30-60'
    elif 'more than 60 minutes' in value:
        interaccion = '60+'
    else:
        interaccion = 'Unknown'

    return pd.Series([protocolo, interaccion], index = ['Followed_GTOA_Protocol', 'Interaction_time'])

In [11]:
# Hago la llamada a la función con un apply() y creo las dos nuevas columnas:
df[['Followed_GTOA_Protocol', 'Interaction_time']] = df.GTOA_Protocol.apply(saca_protocolo_tiempo)
df.head()

Unnamed: 0,date,lat_and_long,GTOA_Protocol,Boat_Type,Boat_Length,Towing_Inflatable,Trailing_Fishing_Lure,Physical_Contact_With_Boat,Number_of_Adult_Orcas,Number_of_Juvenile_Orcas,Number_of_Uncertain_Age_Orcas,Rudder,Motoring_or_Sailing,Speed_Knots,Sea_State,Wind_Speed_Beaufort,Daylight_or_Darkness,Cloud_Cover,Distance_Off_Land_NM,Depth_Meters,Depth_Gauge,Autopilot,Hull_Topsides_Color,Antifoul_Color,Boat_Damaged,Tow_Required,Crew_Response,Orcas_Behaviour,Followed_GTOA_Protocol,Interaction_time
0,2023-11-01 22:15:00,"32 47.4980 N, 9 54.3980 W","No, did not follow protocol, interaction laste...",Sail,10 - 12.5m,Not towing,No,No,1,0,0,Spade,Sailing,5 - 7,Moderate,5 - 6 (17 - 27 knots),Night,0 - 25%,Over 10,200m+,On,On,White/light,Blue,No,No,"Orca interaction at 10:15pm on 01/11, 40 miles...",I would describe the behaviour of the Orca dur...,No,0-10
1,2023-10-31 07:50:00,"39 26.0000 N, 9 23.0000 W","Yes, followed protocol, interaction lasted les...",Sail,12.5 - 15m,Not towing,No,Yes,2,5,0,Twin rudder,Motoring,5 - 7,Rough,3 - 4 (7 - 16 knots),Day,50 - 75%,2 - 5,40 - 200m,On,On,White/light,Black,"Yes, moderate - immediate repairs required",No,We had sandbags on our sugar scoops and metal ...,Juveniles hitting the rudders adults close by,Yes,0-10
2,2023-09-19 11:00:00,"37 40.0000 N, 8 54.0000 W","No, did not follow protocol, interaction laste...",Sail,12.5 - 15m,Not towing,No,Yes,1,0,0,Spade,Motoring,5 - 7,Calm,3 - 4 (7 - 16 knots),Day,0 - 25%,2 - 5,40 - 200m,On,On,White/light,Other,"Yes, moderate - immediate repairs required",No,We saw the orca approach from 10 o’clock posit...,There was an initial approach 45 minutes earli...,No,0-10
3,2023-09-01 13:15:00,"45 36.0000 N, 3 45.0000 W","Yes, followed protocol, interaction lasted 10 ...",Sail,Over 15m,Not towing,Yes,Yes,1,2,0,Spade,Sailing,3 - 4,Calm,3 - 4 (7 - 16 knots),Day,25 - 50%,Over 10,200m+,Off,Off,White/light,Black,"Yes, moderate - immediate repairs required",No,Les trois orques passent constamment de bâbord...,Pas de comportement visblement agressif./// No...,Yes,10-30
4,2023-09-02 03:45:00,"42 45.0000 N, 9 14.0000 W","Yes, followed protocol, interaction lasted les...",Sail,12.5 - 15m,Not towing,No,Yes,1,2,0,Spade,Motorsailing,5 - 7,Calm,0 - 2 (0 - 6 knots),Night,0 - 25%,5 - 10,40 - 200m,On,On,White/light,Black,"Yes, moderate - immediate repairs required",Yes,Arrêt du pilote automatique a la 2 eme interac...,Approche furtive à la première interaction dir...,Yes,0-10


In [12]:
df.GTOA_Protocol.value_counts()

GTOA_Protocol
No, did not follow protocol, interaction lasted less than 10 minutes    40
Yes, followed protocol, interaction lasted 10 to 30 minutes             26
Yes, followed protocol, interaction lasted less than 10 minutes         24
No, did not follow protocol, interaction lasted 10 to 30 minutes        24
Yes, followed protocol, interaction lasted 30 to 60 minutes             15
Yes, followed protocol, interaction lasted more than 60 minutes         11
No, did not follow protocol, interaction lasted 30 to 60 minutes         9
No, did not follow protocol, interaction lasted more than 60 minutes     3
Unknown                                                                  1
Name: count, dtype: int64

In [13]:
df.Followed_GTOA_Protocol.value_counts()

Followed_GTOA_Protocol
No         76
Yes        76
Unknown     1
Name: count, dtype: int64

In [14]:
df.Interaction_time.value_counts()

Interaction_time
0-10       64
10-30      50
30-60      24
60+        14
Unknown     1
Name: count, dtype: int64

In [15]:
# Una vez comprobado que se han separado bien los datos via el value_counts(), reordenamos columnas y nos deshacemos de la columna de partida GTOA_Protocol

df.drop(columns='GTOA_Protocol', inplace=True)

column_order = ['date', 'lat_and_long', 'Followed_GTOA_Protocol', 'Interaction_time', 'Boat_Type', 'Boat_Length',
                'Towing_Inflatable', 'Trailing_Fishing_Lure', 'Physical_Contact_With_Boat', 'Number_of_Adult_Orcas',
                'Number_of_Juvenile_Orcas', 'Number_of_Uncertain_Age_Orcas', 'Rudder', 'Motoring_or_Sailing',
                'Speed_Knots', 'Sea_State', 'Wind_Speed_Beaufort', 'Daylight_or_Darkness', 'Cloud_Cover',
                'Distance_Off_Land_NM', 'Depth_Meters', 'Depth_Gauge', 'Autopilot', 'Hull_Topsides_Color',
                'Antifoul_Color', 'Boat_Damaged', 'Tow_Required', 'Crew_Response', 'Orcas_Behaviour']

df = df[column_order]
df.head()

Unnamed: 0,date,lat_and_long,Followed_GTOA_Protocol,Interaction_time,Boat_Type,Boat_Length,Towing_Inflatable,Trailing_Fishing_Lure,Physical_Contact_With_Boat,Number_of_Adult_Orcas,Number_of_Juvenile_Orcas,Number_of_Uncertain_Age_Orcas,Rudder,Motoring_or_Sailing,Speed_Knots,Sea_State,Wind_Speed_Beaufort,Daylight_or_Darkness,Cloud_Cover,Distance_Off_Land_NM,Depth_Meters,Depth_Gauge,Autopilot,Hull_Topsides_Color,Antifoul_Color,Boat_Damaged,Tow_Required,Crew_Response,Orcas_Behaviour
0,2023-11-01 22:15:00,"32 47.4980 N, 9 54.3980 W",No,0-10,Sail,10 - 12.5m,Not towing,No,No,1,0,0,Spade,Sailing,5 - 7,Moderate,5 - 6 (17 - 27 knots),Night,0 - 25%,Over 10,200m+,On,On,White/light,Blue,No,No,"Orca interaction at 10:15pm on 01/11, 40 miles...",I would describe the behaviour of the Orca dur...
1,2023-10-31 07:50:00,"39 26.0000 N, 9 23.0000 W",Yes,0-10,Sail,12.5 - 15m,Not towing,No,Yes,2,5,0,Twin rudder,Motoring,5 - 7,Rough,3 - 4 (7 - 16 knots),Day,50 - 75%,2 - 5,40 - 200m,On,On,White/light,Black,"Yes, moderate - immediate repairs required",No,We had sandbags on our sugar scoops and metal ...,Juveniles hitting the rudders adults close by
2,2023-09-19 11:00:00,"37 40.0000 N, 8 54.0000 W",No,0-10,Sail,12.5 - 15m,Not towing,No,Yes,1,0,0,Spade,Motoring,5 - 7,Calm,3 - 4 (7 - 16 knots),Day,0 - 25%,2 - 5,40 - 200m,On,On,White/light,Other,"Yes, moderate - immediate repairs required",No,We saw the orca approach from 10 o’clock posit...,There was an initial approach 45 minutes earli...
3,2023-09-01 13:15:00,"45 36.0000 N, 3 45.0000 W",Yes,10-30,Sail,Over 15m,Not towing,Yes,Yes,1,2,0,Spade,Sailing,3 - 4,Calm,3 - 4 (7 - 16 knots),Day,25 - 50%,Over 10,200m+,Off,Off,White/light,Black,"Yes, moderate - immediate repairs required",No,Les trois orques passent constamment de bâbord...,Pas de comportement visblement agressif./// No...
4,2023-09-02 03:45:00,"42 45.0000 N, 9 14.0000 W",Yes,0-10,Sail,12.5 - 15m,Not towing,No,Yes,1,2,0,Spade,Motorsailing,5 - 7,Calm,0 - 2 (0 - 6 knots),Night,0 - 25%,5 - 10,40 - 200m,On,On,White/light,Black,"Yes, moderate - immediate repairs required",Yes,Arrêt du pilote automatique a la 2 eme interac...,Approche furtive à la première interaction dir...


#### Boat Type

In [16]:
df.Boat_Type.unique()

array(['Sail', 'Motor', 'Fishing Vessel'], dtype=object)

In [17]:
df.Boat_Type.value_counts()

Boat_Type
Sail              150
Motor               2
Fishing Vessel      1
Name: count, dtype: int64

* Parece que no hay valores nulos y que la gran mayoría de los barcos que tuvieron interacciones con orcas eran veleros.

#### Boat length

In [18]:
df.Boat_Length.unique()

array(['10 - 12.5m', '12.5 - 15m', 'Over 15m', 'Under 10m'], dtype=object)

In [19]:
df.Boat_Length.value_counts()

Boat_Length
10 - 12.5m    59
12.5 - 15m    56
Over 15m      30
Under 10m      8
Name: count, dtype: int64

Vamos a cambiar el formato de los rangos a los siguientes:

* Under 10m -> 0-10
* 10 - 12.5m -> 10-12.5
* 12.5 - 15m -> 12.5-15
* Over 15m -> 15+


In [20]:
# Definimos una función de cambio:
def clean_length(value):

    if value == 'Under 10m':
        return '0-10'
    elif value == '10 - 12.5m':
        return '10-12.5'
    elif value == '12.5 - 15m':
        return '12.5-15'
    elif value == 'Over 15m':
        return '15+'
    else:
        return 'Unknown'

In [21]:
# Hacemos la llamada a la función con un apply() para aplicárselo a toda la columna
df.Boat_Length = df.Boat_Length.apply(clean_length)
df.head()

Unnamed: 0,date,lat_and_long,Followed_GTOA_Protocol,Interaction_time,Boat_Type,Boat_Length,Towing_Inflatable,Trailing_Fishing_Lure,Physical_Contact_With_Boat,Number_of_Adult_Orcas,Number_of_Juvenile_Orcas,Number_of_Uncertain_Age_Orcas,Rudder,Motoring_or_Sailing,Speed_Knots,Sea_State,Wind_Speed_Beaufort,Daylight_or_Darkness,Cloud_Cover,Distance_Off_Land_NM,Depth_Meters,Depth_Gauge,Autopilot,Hull_Topsides_Color,Antifoul_Color,Boat_Damaged,Tow_Required,Crew_Response,Orcas_Behaviour
0,2023-11-01 22:15:00,"32 47.4980 N, 9 54.3980 W",No,0-10,Sail,10-12.5,Not towing,No,No,1,0,0,Spade,Sailing,5 - 7,Moderate,5 - 6 (17 - 27 knots),Night,0 - 25%,Over 10,200m+,On,On,White/light,Blue,No,No,"Orca interaction at 10:15pm on 01/11, 40 miles...",I would describe the behaviour of the Orca dur...
1,2023-10-31 07:50:00,"39 26.0000 N, 9 23.0000 W",Yes,0-10,Sail,12.5-15,Not towing,No,Yes,2,5,0,Twin rudder,Motoring,5 - 7,Rough,3 - 4 (7 - 16 knots),Day,50 - 75%,2 - 5,40 - 200m,On,On,White/light,Black,"Yes, moderate - immediate repairs required",No,We had sandbags on our sugar scoops and metal ...,Juveniles hitting the rudders adults close by
2,2023-09-19 11:00:00,"37 40.0000 N, 8 54.0000 W",No,0-10,Sail,12.5-15,Not towing,No,Yes,1,0,0,Spade,Motoring,5 - 7,Calm,3 - 4 (7 - 16 knots),Day,0 - 25%,2 - 5,40 - 200m,On,On,White/light,Other,"Yes, moderate - immediate repairs required",No,We saw the orca approach from 10 o’clock posit...,There was an initial approach 45 minutes earli...
3,2023-09-01 13:15:00,"45 36.0000 N, 3 45.0000 W",Yes,10-30,Sail,15+,Not towing,Yes,Yes,1,2,0,Spade,Sailing,3 - 4,Calm,3 - 4 (7 - 16 knots),Day,25 - 50%,Over 10,200m+,Off,Off,White/light,Black,"Yes, moderate - immediate repairs required",No,Les trois orques passent constamment de bâbord...,Pas de comportement visblement agressif./// No...
4,2023-09-02 03:45:00,"42 45.0000 N, 9 14.0000 W",Yes,0-10,Sail,12.5-15,Not towing,No,Yes,1,2,0,Spade,Motorsailing,5 - 7,Calm,0 - 2 (0 - 6 knots),Night,0 - 25%,5 - 10,40 - 200m,On,On,White/light,Black,"Yes, moderate - immediate repairs required",Yes,Arrêt du pilote automatique a la 2 eme interac...,Approche furtive à la première interaction dir...


In [22]:
df.Boat_Length.value_counts()

Boat_Length
10-12.5    59
12.5-15    56
15+        30
0-10        8
Name: count, dtype: int64

In [23]:
# Queda comprobado que se ha hecho bien el cambio, pasamos a la siguiente columna

#### Towing Inflatable

In [24]:
df.Towing_Inflatable.unique()

array(['Not towing', 'Towing and interacted with inflatable first',
       'Unknown', 'Spade'], dtype=object)

In [25]:
df.Towing_Inflatable.value_counts()

Towing_Inflatable
Not towing                                     148
Unknown                                          3
Towing and interacted with inflatable first      1
Spade                                            1
Name: count, dtype: int64

In [26]:
# Vamos a localizar la fila donde está 'Spade' porque es un valor que no debería estar ahí:
df[df.Towing_Inflatable == 'Spade']

Unnamed: 0,date,lat_and_long,Followed_GTOA_Protocol,Interaction_time,Boat_Type,Boat_Length,Towing_Inflatable,Trailing_Fishing_Lure,Physical_Contact_With_Boat,Number_of_Adult_Orcas,Number_of_Juvenile_Orcas,Number_of_Uncertain_Age_Orcas,Rudder,Motoring_or_Sailing,Speed_Knots,Sea_State,Wind_Speed_Beaufort,Daylight_or_Darkness,Cloud_Cover,Distance_Off_Land_NM,Depth_Meters,Depth_Gauge,Autopilot,Hull_Topsides_Color,Antifoul_Color,Boat_Damaged,Tow_Required,Crew_Response,Orcas_Behaviour
144,2022-06-04 08:00:00,"36 4.4000 N, 5 59.9000 W",Yes,0-10,Sail,12.5-15,Spade,Unknown,Unknown,0,0,0,Motoring,3 - 4,0 - 2 (0 - 6 knots),Day,5 - 10,40 - 200m,On,On,White/light,Black,"Yes, minor - will wait until the end of the se...",No,This report was obtained by GTOA. I stopped th...,This report was obtained by GTOA. It was only ...,daytime,waxing\n20% illuminated,Within 3 days of spring tide


Hay un pequeño desorden en la fila con indice 144 que vamos a arreglar a mano porque no hay un claro patrón. Lo vamos a hacer en los siguientes pasos:

1) Desde el valor que está en la columna *Wind_Speed_Beaufort* en adelante vamos a mover todos los valores 3 posiciones a la derecha
2) El valor correspondiente a *Cloud_Cover* se va a cambiar a Unknown
3) El valor correspondiente a *Daylight_or_Darkness* se va a cambiar a Day
4) El valor correspondiente a *Wind_Speed_Beaufort* se va a cambiar a 0 - 2 (0 - 6 knots)
5) El valor correspondiente a *Sea_State* se va a cambiar a Unknown
6) El valor correspondiente a *Speed_Knots* se va a cambiar a 3 - 4
7) El valor correspondiente a *Motoring_or_Sailing* se va a cambiar a Motoring
8) El valor correspondiente a *Rudder* se va a cambiar a Spade
9) El valor correspondiente a *Towing_Inflatable* se va a cambiar a Unknown


In [27]:
# 1) Desde el valor que está en la columna *Wind_Speed_Beaufort* en adelante vamos a mover todos los valores 3 posiciones a la derecha

# Indice de la fila que queremos alterar
index_to_shift = 144

# Cogemos el indice de la columna 'Wind_Speed_Beaufort'
Beaufort_column_index = df.columns.get_loc('Wind_Speed_Beaufort')

# Usamos la función shift para la fila en particular y desde 'Wind_Speed_Beaufort' en adelante
df.iloc[index_to_shift, Beaufort_column_index:] = df.iloc[index_to_shift, Beaufort_column_index:].shift(3)

print(df.iloc[index_to_shift])

date                                                           2022-06-04 08:00:00
lat_and_long                                              36 4.4000 N, 5 59.9000 W
Followed_GTOA_Protocol                                                         Yes
Interaction_time                                                              0-10
Boat_Type                                                                     Sail
Boat_Length                                                                12.5-15
Towing_Inflatable                                                            Spade
Trailing_Fishing_Lure                                                      Unknown
Physical_Contact_With_Boat                                                 Unknown
Number_of_Adult_Orcas                                                            0
Number_of_Juvenile_Orcas                                                         0
Number_of_Uncertain_Age_Orcas                                                    0
Rudd

In [28]:
# Voy a usar .loc[] para hacer los cambios
# 2) El valor correspondiente a *Cloud_Cover* se va a cambiar a Unknown

df.loc[144, 'Cloud_Cover'] = 'Unknown'

# 3) El valor correspondiente a *Daylight_or_Darkness* se va a cambiar a Day 

df.loc[144, 'Daylight_or_Darkness'] = 'Day'

# 4) El valor correspondiente a *Wind_Speed_Beaufort* se va a cambiar a 0 - 2 (0 - 6 knots)

df.loc[144, 'Wind_Speed_Beaufort'] = '0 - 2 (0 - 6 knots)'

# 5) El valor correspondiente a *Sea_State* se va a cambiar a Unknown

df.loc[144, 'Sea_State'] = 'Unknown'

# 6) El valor correspondiente a *Speed_Knots* se va a cambiar a 3 - 4

df.loc[144, 'Speed_Knots'] = '3 - 4'

# 7) El valor correspondiente a *Motoring_or_Sailing* se va a cambiar a Motoring

df.loc[144, 'Motoring_or_Sailing'] = 'Motoring'

# 8) El valor correspondiente a *Rudder* se va a cambiar a Spade

df.loc[144, 'Rudder'] = 'Spade'

# 9) El valor correspondiente a *Towing_Inflatable* se va a cambiar a Unknown

df.loc[144, 'Towing_Inflatable'] = 'Not towing'


print(df.iloc[144])

date                                                           2022-06-04 08:00:00
lat_and_long                                              36 4.4000 N, 5 59.9000 W
Followed_GTOA_Protocol                                                         Yes
Interaction_time                                                              0-10
Boat_Type                                                                     Sail
Boat_Length                                                                12.5-15
Towing_Inflatable                                                       Not towing
Trailing_Fishing_Lure                                                      Unknown
Physical_Contact_With_Boat                                                 Unknown
Number_of_Adult_Orcas                                                            0
Number_of_Juvenile_Orcas                                                         0
Number_of_Uncertain_Age_Orcas                                                    0
Rudd

In [29]:
# Volvemos a ejecutar el value_counts()
df.Towing_Inflatable.value_counts()

Towing_Inflatable
Not towing                                     149
Unknown                                          3
Towing and interacted with inflatable first      1
Name: count, dtype: int64

Dada la naturaleza de las respuestas se va a cambiar a las siguientes:
* Not towing = No
* Towing and interacted with inflatable first = Yes
* Unknown will remain as Unknown

In [30]:
# Definimos una función que haga esta limpieza
def limpia_inflatable(value):

    if value == 'Not towing':
        return 'No'
    elif value == 'Towing and interacted with inflatable first':
        return 'Yes'
    elif value == 'Towing and interacted with boat first': # Esta no está entre las actuales opciones pero se puede dar en el futuroy se interpretará como un 'Yes'
        return 'Yes'
    else:
        return 'Unknown'

In [31]:
df.Towing_Inflatable = df.Towing_Inflatable.apply(limpia_inflatable)
df.Towing_Inflatable.value_counts()

Towing_Inflatable
No         149
Unknown      3
Yes          1
Name: count, dtype: int64

In [32]:
# Perfecto, pasamos a la siguiente.

#### Trailing Fishing Lure

In [33]:
df.Trailing_Fishing_Lure.unique()

array(['No', 'Yes', 'Unknown'], dtype=object)

In [34]:
df.Trailing_Fishing_Lure.value_counts()

Trailing_Fishing_Lure
Unknown    129
No          20
Yes          4
Name: count, dtype: int64

Esta columna se puede quedar como tal, pasamos a la siguiente columna.

#### Physical Contact with Boat

In [35]:
df.Physical_Contact_With_Boat.value_counts()

Physical_Contact_With_Boat
Unknown    129
Yes         22
No           2
Name: count, dtype: int64

Posiblemente esta sea una columna de la que podamos prescindir más adelante o completar la gran cantidad de *Unknowns* usando la información contenida en otras columnas. De momento la vamos a dejar como está.

Number of Adult, juvenile and uncertain age orcas

In [36]:
df.Number_of_Adult_Orcas.value_counts()

Number_of_Adult_Orcas
0    134
1     12
2      3
5      1
6      1
4      1
3      1
Name: count, dtype: int64

In [37]:
df.Number_of_Juvenile_Orcas.value_counts()

Number_of_Juvenile_Orcas
0    144
2      6
5      2
1      1
Name: count, dtype: int64

In [38]:
df.Number_of_Uncertain_Age_Orcas.value_counts()

Number_of_Uncertain_Age_Orcas
0    141
4      4
2      2
3      2
5      2
6      1
7      1
Name: count, dtype: int64

Voy a filtrar aquellas filas que tienen un 0 en las tres columnas. Esta condición se debe a que cambió el formato del cuestionario a mitad base de datos y no se preguntaba cuantas orcas interactuaron a los patrones de los barcos. Sin embargo, de las columnas *Crew_Response* y *Orcas_Behaviour* sí que se puede sacar información. Esta relleno de las columnas con el número de orcas se hará en un notebook aparte.

In [39]:
df[(df.Number_of_Adult_Orcas == '0') & (df.Number_of_Juvenile_Orcas == '0') & (df.Number_of_Uncertain_Age_Orcas == '0')].shape

(126, 29)

#### Rudder

In [40]:
df.Rudder.value_counts()

Rudder
Spade           72
Twin rudder     29
Full skeg       23
Semi skeg       15
Keel hung        5
Unknown          3
5 - 7            2
Sailing          1
3 - 4            1
Yes              1
Motorsailing     1
Name: count, dtype: int64

Aquellos valores que son distintos de Spade, Twin Rudder, Full Skeg, Semi skeg, Keel hung u Unknown son valores que hay que arreglar. Vamos a inspeccionarlos uno a uno.

In [41]:
df[df.Rudder == '5 - 7']

Unnamed: 0,date,lat_and_long,Followed_GTOA_Protocol,Interaction_time,Boat_Type,Boat_Length,Towing_Inflatable,Trailing_Fishing_Lure,Physical_Contact_With_Boat,Number_of_Adult_Orcas,Number_of_Juvenile_Orcas,Number_of_Uncertain_Age_Orcas,Rudder,Motoring_or_Sailing,Speed_Knots,Sea_State,Wind_Speed_Beaufort,Daylight_or_Darkness,Cloud_Cover,Distance_Off_Land_NM,Depth_Meters,Depth_Gauge,Autopilot,Hull_Topsides_Color,Antifoul_Color,Boat_Damaged,Tow_Required,Crew_Response,Orcas_Behaviour
20,2023-06-03 04:30:00,"36 2.0000 N, 6 13.0000 W",Yes,10-30,Sail,15+,No,No,Yes,0,0,5,5 - 7,Moderate,3 - 4 (7 - 16 knots),Dawn,0 - 25%,Over 10,200m+,On,On,White/light,Black,"Yes, moderate - immediate repairs required",No,Linked to 141 and 143. Details to follow,before dawn,waxing\n99% illuminated\nwithin 3 days of full,Not within 3 days of springs
21,2023-06-02 20:45:00,"36 2.0000 N, 6 57.0000 W",No,0-10,Sail,15+,No,No,No,0,0,5,5 - 7,Calm,0 - 2 (0 - 6 knots),Dusk,0 - 25%,Over 10,200m+,On,On,White/light,Black,No,No,Linked to 142 and 143. Details to follow,dusk twilight,waxing\n98% illuminated\nwithin 3 days of full,Not within 3 days of springs


In [42]:
# Para estas dos filas, vamos a mover 2 posiciones a la derecha todos los valores desde la columna 'Rudder' en adelante

# Indice de la fila que queremos alterar
index_to_shift = 20

# Cogemos el indice de la columna 'Wind_Speed_Beaufort'
rudder_column_index = df.columns.get_loc('Rudder')

# Usamos la función shift para la fila en particular y desde 'Wind_Speed_Beaufort' en adelante
df.iloc[index_to_shift, rudder_column_index:] = df.iloc[index_to_shift, rudder_column_index:].shift(2)

print(df.iloc[index_to_shift])

date                                                    2023-06-03 04:30:00
lat_and_long                                       36 2.0000 N, 6 13.0000 W
Followed_GTOA_Protocol                                                  Yes
Interaction_time                                                      10-30
Boat_Type                                                              Sail
Boat_Length                                                             15+
Towing_Inflatable                                                        No
Trailing_Fishing_Lure                                                    No
Physical_Contact_With_Boat                                              Yes
Number_of_Adult_Orcas                                                     0
Number_of_Juvenile_Orcas                                                  0
Number_of_Uncertain_Age_Orcas                                             5
Rudder                                                                 None
Motoring_or_

In [43]:
# Para estas dos filas, vamos a mover 2 posiciones a la derecha todos los valores desde la columna 'Rudder' en adelante

# Indice de la fila que queremos alterar
index_to_shift = 21

# Cogemos el indice de la columna 'Wind_Speed_Beaufort'
rudder_column_index = df.columns.get_loc('Rudder')

# Usamos la función shift para la fila en particular y desde 'Wind_Speed_Beaufort' en adelante
df.iloc[index_to_shift, rudder_column_index:] = df.iloc[index_to_shift, rudder_column_index:].shift(2)

print(df.iloc[index_to_shift])

date                                                  2023-06-02 20:45:00
lat_and_long                                     36 2.0000 N, 6 57.0000 W
Followed_GTOA_Protocol                                                 No
Interaction_time                                                     0-10
Boat_Type                                                            Sail
Boat_Length                                                           15+
Towing_Inflatable                                                      No
Trailing_Fishing_Lure                                                  No
Physical_Contact_With_Boat                                             No
Number_of_Adult_Orcas                                                   0
Number_of_Juvenile_Orcas                                                0
Number_of_Uncertain_Age_Orcas                                           5
Rudder                                                               None
Motoring_or_Sailing                   

In [44]:
# Ahora tendremos que meter a mano para las columnas 'Rudder' y 'Motoring_or_Sailing' dos valores 'Unknown' para cada fila

# Fila 20
df.loc[20, 'Rudder'] = 'Unknown'
df.loc[20, 'Motoring_or_Sailing'] = 'Unknown'

# Fila 21
df.loc[21, 'Rudder'] = 'Unknown'
df.loc[21, 'Motoring_or_Sailing'] = 'Unknown'

Seguimos limpiando la columna Rudder, volvemos a ejecutar el value counts, debería haber dos valores más de ``Unknown``

In [45]:
df.Rudder.value_counts()

Rudder
Spade           72
Twin rudder     29
Full skeg       23
Semi skeg       15
Unknown          5
Keel hung        5
Sailing          1
3 - 4            1
Yes              1
Motorsailing     1
Name: count, dtype: int64

In [46]:
# Repetimos el proceso que acabamos de hacer, ahora con 'Sailing'
df[df.Rudder == 'Sailing']

Unnamed: 0,date,lat_and_long,Followed_GTOA_Protocol,Interaction_time,Boat_Type,Boat_Length,Towing_Inflatable,Trailing_Fishing_Lure,Physical_Contact_With_Boat,Number_of_Adult_Orcas,Number_of_Juvenile_Orcas,Number_of_Uncertain_Age_Orcas,Rudder,Motoring_or_Sailing,Speed_Knots,Sea_State,Wind_Speed_Beaufort,Daylight_or_Darkness,Cloud_Cover,Distance_Off_Land_NM,Depth_Meters,Depth_Gauge,Autopilot,Hull_Topsides_Color,Antifoul_Color,Boat_Damaged,Tow_Required,Crew_Response,Orcas_Behaviour
7,2023-07-15 23:58:00,"36 15.0090 N, 5 11.2660 W",Yes,30-60,Sail,12.5-15,No,No,Yes,4,0,0,Sailing,5 - 7,Moderate,3 - 4 (7 - 16 knots),Night,0 - 25%,2 - 5,40 - 200m,On,On,White/light,Blue,"Yes, moderate - immediate repairs required",Yes,We were sailing and the boat was on autopilot....,One of them was under the boat. Others were at...,"Not applicable, did not reverse"


In [47]:
# Este es facil, movemos una posición todos los valores desde la columna 'Rudder' en adelante y la posición de Rudder la cambiamos a 'Unknown'
# Indice de la fila que queremos alterar
index_to_shift = 7

# Cogemos el indice de la columna 'Wind_Speed_Beaufort'
rudder_column_index = df.columns.get_loc('Rudder')

# Usamos la función shift para la fila en particular y desde 'Wind_Speed_Beaufort' en adelante
df.iloc[index_to_shift, rudder_column_index:] = df.iloc[index_to_shift, rudder_column_index:].shift(1)

print(df.iloc[index_to_shift])

date                                                           2023-07-15 23:58:00
lat_and_long                                             36 15.0090 N, 5 11.2660 W
Followed_GTOA_Protocol                                                         Yes
Interaction_time                                                             30-60
Boat_Type                                                                     Sail
Boat_Length                                                                12.5-15
Towing_Inflatable                                                               No
Trailing_Fishing_Lure                                                           No
Physical_Contact_With_Boat                                                     Yes
Number_of_Adult_Orcas                                                            4
Number_of_Juvenile_Orcas                                                         0
Number_of_Uncertain_Age_Orcas                                                    0
Rudd

In [48]:
# Cambiamos valor de la fila 7 y columna Rudder a Unknown. Al ejecutar el value_counts() de nuevo tendremos un ´Unknown´ más.

df.loc[7, 'Rudder'] = 'Unknown'
df.Rudder.value_counts()

Rudder
Spade           72
Twin rudder     29
Full skeg       23
Semi skeg       15
Unknown          6
Keel hung        5
3 - 4            1
Yes              1
Motorsailing     1
Name: count, dtype: int64

In [49]:
# Repetimos el proceso que acabamos de hacer, ahora con '3 - 4'
df[df.Rudder == '3 - 4']

Unnamed: 0,date,lat_and_long,Followed_GTOA_Protocol,Interaction_time,Boat_Type,Boat_Length,Towing_Inflatable,Trailing_Fishing_Lure,Physical_Contact_With_Boat,Number_of_Adult_Orcas,Number_of_Juvenile_Orcas,Number_of_Uncertain_Age_Orcas,Rudder,Motoring_or_Sailing,Speed_Knots,Sea_State,Wind_Speed_Beaufort,Daylight_or_Darkness,Cloud_Cover,Distance_Off_Land_NM,Depth_Meters,Depth_Gauge,Autopilot,Hull_Topsides_Color,Antifoul_Color,Boat_Damaged,Tow_Required,Crew_Response,Orcas_Behaviour
19,2023-06-03 07:10:00,"36 7.0000 N, 6 3.0000 W",Yes,10-30,Sail,15+,No,No,Yes,1,0,0,3 - 4,Moderate,3 - 4 (7 - 16 knots),Day,0 - 25%,Over 10,200m+,On,On,White/light,Black,"Yes, extensive - major works required",No,Linked to 141 and 142. Details to follow,daytime,waxing\n99% illuminated\nwithin 3 days of full,Within 3 days of spring tide


In [50]:
# Otra vez toca mover desde la columna 'Rudder' 2 posiciones a la derecha y luego cambiar los valores de la mencionada columna y 'Motoring_or_Sailing' o 'Unknown'

# Indice de la fila que queremos alterar
index_to_shift = 19

# Cogemos el indice de la columna 'Wind_Speed_Beaufort'
rudder_column_index = df.columns.get_loc('Rudder')

# Usamos la función shift para la fila en particular y desde 'Wind_Speed_Beaufort' en adelante
df.iloc[index_to_shift, rudder_column_index:] = df.iloc[index_to_shift, rudder_column_index:].shift(2)

print(df.iloc[index_to_shift])

date                                                  2023-06-03 07:10:00
lat_and_long                                      36 7.0000 N, 6 3.0000 W
Followed_GTOA_Protocol                                                Yes
Interaction_time                                                    10-30
Boat_Type                                                            Sail
Boat_Length                                                           15+
Towing_Inflatable                                                      No
Trailing_Fishing_Lure                                                  No
Physical_Contact_With_Boat                                            Yes
Number_of_Adult_Orcas                                                   1
Number_of_Juvenile_Orcas                                                0
Number_of_Uncertain_Age_Orcas                                           0
Rudder                                                               None
Motoring_or_Sailing                   

In [51]:
# Cambiamos mencionadas columnas y ejecutamos value_counts()

df.loc[19, 'Rudder'] = 'Unknown'
df.loc[19, 'Motoring_or_Sailing'] = 'Unknown'

df.Rudder.value_counts()

Rudder
Spade           72
Twin rudder     29
Full skeg       23
Semi skeg       15
Unknown          7
Keel hung        5
Yes              1
Motorsailing     1
Name: count, dtype: int64

In [52]:
# Quedan dos más, vamos con 'Yes'
df[df.Rudder == 'Yes']

Unnamed: 0,date,lat_and_long,Followed_GTOA_Protocol,Interaction_time,Boat_Type,Boat_Length,Towing_Inflatable,Trailing_Fishing_Lure,Physical_Contact_With_Boat,Number_of_Adult_Orcas,Number_of_Juvenile_Orcas,Number_of_Uncertain_Age_Orcas,Rudder,Motoring_or_Sailing,Speed_Knots,Sea_State,Wind_Speed_Beaufort,Daylight_or_Darkness,Cloud_Cover,Distance_Off_Land_NM,Depth_Meters,Depth_Gauge,Autopilot,Hull_Topsides_Color,Antifoul_Color,Boat_Damaged,Tow_Required,Crew_Response,Orcas_Behaviour
23,2023-05-25 14:00:00,"36 1.0000 N, 5 56.0000 W",Yes,10-30,Sail,15+,No,Unknown,Unknown,0,0,0,Yes,Yes,Spade,Motorsailing,8 - 11,Calm,3 - 4 (7 - 16 knots),Day,0 - 25%,5 - 10,40 - 200m,On,On,White/light,Black,"Yes, moderate - immediate repairs required",No


In [53]:
# Esta fila tiene un error distinto a las demás, primero hay que cambiar los valores de 'Trailing_Fishing_Lure' y 'Physical_Contact_With_Boat' a Yes y segundo hay que mover desde la columna 'Orcas_behaviour' hasta 0Speed_Knots' dos posiciones hacia la izquierda

# En orden, primero cambiamos 'Trailing_Fishing_Lure' y 'Physical_Contact_With_Boat'
df.loc[23, 'Trailing_Fishing_Lure'] = 'Yes'
df.loc[23, 'Physical_Contact_With_Boat'] = 'Yes'


In [54]:
df.Trailing_Fishing_Lure.value_counts()

Trailing_Fishing_Lure
Unknown    128
No          20
Yes          5
Name: count, dtype: int64

In [55]:
df.Physical_Contact_With_Boat.value_counts()

Physical_Contact_With_Boat
Unknown    128
Yes         23
No           2
Name: count, dtype: int64

In [56]:
# Ahora vamos a mover desde la columna 'Orcas_behaviour' hasta 0Speed_Knots' dos posiciones hacia la izquierda

# Especificamos que fila estamos alterando
row_index_to_shift = 23

# Especificamos las columnas que vamos a mover (desde 'Speed_Knots' a 'Orcas_Behaviour')
start_column_index = df.columns.get_loc('Speed_Knots')
end_column_index = df.columns.get_loc('Orcas_Behaviour')

# Movemos los valores dos posiciones a la izq (de ahí el signo negativo de dentro del shift())
df.iloc[row_index_to_shift, start_column_index:end_column_index + 1] = df.iloc[row_index_to_shift, start_column_index:end_column_index + 1].shift(-2)

# Printeamos resultado
print(df.iloc[row_index_to_shift])

date                                                    2023-05-25 14:00:00
lat_and_long                                       36 1.0000 N, 5 56.0000 W
Followed_GTOA_Protocol                                                  Yes
Interaction_time                                                      10-30
Boat_Type                                                              Sail
Boat_Length                                                             15+
Towing_Inflatable                                                        No
Trailing_Fishing_Lure                                                   Yes
Physical_Contact_With_Boat                                              Yes
Number_of_Adult_Orcas                                                     0
Number_of_Juvenile_Orcas                                                  0
Number_of_Uncertain_Age_Orcas                                             0
Rudder                                                                  Yes
Motoring_or_

In [57]:
# Vale, ahora cambiamos los valores de 'Rudder' y 'Motoring_or_Sailing' a Unknown y pasamos al ultimo error de la columna
df.loc[23, 'Rudder'] = 'Unknown'
df.loc[23, 'Motoring_or_Sailing'] = 'Unknown'

df.Rudder.value_counts()

Rudder
Spade           72
Twin rudder     29
Full skeg       23
Semi skeg       15
Unknown          8
Keel hung        5
Motorsailing     1
Name: count, dtype: int64

In [58]:
# Ya solo nos queda el error 'Motorsailing'
df[df.Rudder == 'Motorsailing']

Unnamed: 0,date,lat_and_long,Followed_GTOA_Protocol,Interaction_time,Boat_Type,Boat_Length,Towing_Inflatable,Trailing_Fishing_Lure,Physical_Contact_With_Boat,Number_of_Adult_Orcas,Number_of_Juvenile_Orcas,Number_of_Uncertain_Age_Orcas,Rudder,Motoring_or_Sailing,Speed_Knots,Sea_State,Wind_Speed_Beaufort,Daylight_or_Darkness,Cloud_Cover,Distance_Off_Land_NM,Depth_Meters,Depth_Gauge,Autopilot,Hull_Topsides_Color,Antifoul_Color,Boat_Damaged,Tow_Required,Crew_Response,Orcas_Behaviour
66,2022-11-12 08:50:00,"44 0.6000 N, 7 32.6000 W",No,0-10,Sail,10-12.5,No,Unknown,Unknown,0,0,0,Motorsailing,5 - 7,Calm,0 - 2 (0 - 6 knots),Day,0 - 25%,Over 10,40 - 200m,On,On,White/light,Black,"Yes, moderate - immediate repairs required",No,The interaction lasted 2 minutes. 2 or 3 big b...,"Not interested in us, only broke rudder and went",daytime


In [59]:
# Tenemos que mover una posición a la derecha todos los valores desde la columna Rudder en adelante y luego meter un 'Unknown' en la columna Rudder que se quedará vacía

# Indice de la fila que queremos alterar
index_to_shift = 66

# Cogemos el indice de la columna 'Rudder'
rudder_column_index = df.columns.get_loc('Rudder')

# Usamos la función shift para la fila en particular y desde 'Rudder' en adelante
df.iloc[index_to_shift, rudder_column_index:] = df.iloc[index_to_shift, rudder_column_index:].shift(1)

print(df.iloc[index_to_shift])

date                                                           2022-11-12 08:50:00
lat_and_long                                              44 0.6000 N, 7 32.6000 W
Followed_GTOA_Protocol                                                          No
Interaction_time                                                              0-10
Boat_Type                                                                     Sail
Boat_Length                                                                10-12.5
Towing_Inflatable                                                               No
Trailing_Fishing_Lure                                                      Unknown
Physical_Contact_With_Boat                                                 Unknown
Number_of_Adult_Orcas                                                            0
Number_of_Juvenile_Orcas                                                         0
Number_of_Uncertain_Age_Orcas                                                    0
Rudd

In [60]:
# Metemos un 'Unknown' en la celda correspondiente a la columna Rudder

df.loc[66, 'Rudder'] = 'Unknown'

df.Rudder.value_counts()

Rudder
Spade          72
Twin rudder    29
Full skeg      23
Semi skeg      15
Unknown         9
Keel hung       5
Name: count, dtype: int64

In [61]:
# Hecho, pasamos a la siguiente columna

#### Motoring or Sailing

In [62]:
df.Motoring_or_Sailing.value_counts()

Motoring_or_Sailing
Motoring        51
Sailing         49
Motorsailing    47
Unknown          4
Hove-to          1
Twin rudder      1
Name: count, dtype: int64

Vamos a seguir el mismo proceso que con las anteriores columnas. Aquí únicamente debería haber como opciones Motoring, Sailing, Motorsailing y Hove-to (capeando). Pude haber *Unknown* como opción alternativa cuando no se conoce la respuesta. Analizamos separadamente 'Twin rudder' para ver como se pueden corregir.

In [63]:
df[df.Motoring_or_Sailing == 'Twin rudder']

Unnamed: 0,date,lat_and_long,Followed_GTOA_Protocol,Interaction_time,Boat_Type,Boat_Length,Towing_Inflatable,Trailing_Fishing_Lure,Physical_Contact_With_Boat,Number_of_Adult_Orcas,Number_of_Juvenile_Orcas,Number_of_Uncertain_Age_Orcas,Rudder,Motoring_or_Sailing,Speed_Knots,Sea_State,Wind_Speed_Beaufort,Daylight_or_Darkness,Cloud_Cover,Distance_Off_Land_NM,Depth_Meters,Depth_Gauge,Autopilot,Hull_Topsides_Color,Antifoul_Color,Boat_Damaged,Tow_Required,Crew_Response,Orcas_Behaviour
145,2022-06-01 12:30:00,"36 20.1000 N, 6 16.7400 W",No,30-60,Sail,12.5-15,Unknown,Unknown,Unknown,0,0,4,Unknown,Twin rudder,Sailing,Unknown,Unknown,Day,Unknown,2 - 5,20 - 40m,Unknown,Unknown,On,On,White/light,Black,"Yes, extensive - major works required",Yes


In [64]:
# Vamos a corregir esta fila en más de un pasos separados.

# Primero vamos a mover una posición a la izq los valores comprendidos entre las columnas Motoring_or_Sailing y Wind_Speed_Beaufort

# Especificamos que fila estamos alterando
row_index_to_shift = 145

# Especificamos las columnas que vamos a mover (desde 'Motoring_or_Sailing' a 'Wind_Speed_Beaufort')
start_column_index = df.columns.get_loc('Motoring_or_Sailing')
end_column_index = df.columns.get_loc('Wind_Speed_Beaufort')

# Movemos los valores dos posiciones a la izq (de ahí el signo negativo de dentro del shift())
df.iloc[row_index_to_shift, start_column_index:end_column_index + 1] = df.iloc[row_index_to_shift, start_column_index:end_column_index + 1].shift(-1)

# Printeamos resultado
print(df.iloc[row_index_to_shift])

date                                               2022-06-01 12:30:00
lat_and_long                                 36 20.1000 N, 6 16.7400 W
Followed_GTOA_Protocol                                              No
Interaction_time                                                 30-60
Boat_Type                                                         Sail
Boat_Length                                                    12.5-15
Towing_Inflatable                                              Unknown
Trailing_Fishing_Lure                                          Unknown
Physical_Contact_With_Boat                                     Unknown
Number_of_Adult_Orcas                                                0
Number_of_Juvenile_Orcas                                             0
Number_of_Uncertain_Age_Orcas                                        4
Rudder                                                         Unknown
Motoring_or_Sailing                                            Sailing
Speed_

In [65]:
# Introducimos a mano los valores correspondientes a las columnas Rudder y Wind_Speed_Beaufort

df.loc[145, 'Rudder'] = 'Twin rudder'
df.loc[145, 'Wind_Speed_Beaufort'] = 'Unknown'

In [66]:
# Ahora vamos a mver dos posiciones a la izquierda los valores comprendidos entre las columnas Hull_Topsides_Color y Orcas_Behaviour

# Especificamos que fila estamos alterando
row_index_to_shift = 145

# Especificamos las columnas que vamos a mover (desde 'Hull_Topsides_Color' a 'Orcas_Behaviour')
start_column_index = df.columns.get_loc('Hull_Topsides_Color')
end_column_index = df.columns.get_loc('Orcas_Behaviour')

# Movemos los valores dos posiciones a la izq (de ahí el signo negativo de dentro del shift())
df.iloc[row_index_to_shift, start_column_index:end_column_index + 1] = df.iloc[row_index_to_shift, start_column_index:end_column_index + 1].shift(-2)

# Printeamos resultado
print(df.iloc[row_index_to_shift])

date                                               2022-06-01 12:30:00
lat_and_long                                 36 20.1000 N, 6 16.7400 W
Followed_GTOA_Protocol                                              No
Interaction_time                                                 30-60
Boat_Type                                                         Sail
Boat_Length                                                    12.5-15
Towing_Inflatable                                              Unknown
Trailing_Fishing_Lure                                          Unknown
Physical_Contact_With_Boat                                     Unknown
Number_of_Adult_Orcas                                                0
Number_of_Juvenile_Orcas                                             0
Number_of_Uncertain_Age_Orcas                                        4
Rudder                                                     Twin rudder
Motoring_or_Sailing                                            Sailing
Speed_

In [67]:
# Introducimos a mano los valores correspondientes a las columnas Rudder y Wind_Speed_Beaufort y además de los comentarios Crew_Response y Orcas_Behaviour

df.loc[145, 'Depth_Gauge'] = 'On'
df.loc[145, 'Autopilot'] = 'On'
df.loc[145, 'Crew_Response'] = 'This report was obtained by an interview with a GTOA member. One orca was seen near a fisherman who was gathering his net. It disappeared but was then seen about 30m away. When it arrived the crew dropped the sails and turned off all electrical systems. Initially they tried to deter it with a boat hook and they shouted at it and switched on one engine but quickly decided to stop that for fear of increasing the aggression. After that they sat still. The protocol cannot be said to have been followed due to the initial reaction. Both rudders were damaged including metal structural elements and the autopilot and one propeller may be damaged. There was concern that water ingress might occur due to the nature and extent of the damage. The boat was towed to port by the maritime rescue service.'
df.loc[145, 'Orcas_Behaviour'] = 'There was a single orca. The crew saw it 30m away and it went straight for the rudders before doing anything else. It broke one rudder and then the other on the catamaran. It started to hit the daggerboards but the crew lifted them without damage. The orca then started to hit the hulls with its head, gathering momentum. The interaction lasted 60 minutes. This seemed very much like an aggressive attack rather than play.'

In [68]:
#Volvemos a comprobar la fila a ver si ya está bien:

df[df.date == '2022-06-01 12:30:00']

Unnamed: 0,date,lat_and_long,Followed_GTOA_Protocol,Interaction_time,Boat_Type,Boat_Length,Towing_Inflatable,Trailing_Fishing_Lure,Physical_Contact_With_Boat,Number_of_Adult_Orcas,Number_of_Juvenile_Orcas,Number_of_Uncertain_Age_Orcas,Rudder,Motoring_or_Sailing,Speed_Knots,Sea_State,Wind_Speed_Beaufort,Daylight_or_Darkness,Cloud_Cover,Distance_Off_Land_NM,Depth_Meters,Depth_Gauge,Autopilot,Hull_Topsides_Color,Antifoul_Color,Boat_Damaged,Tow_Required,Crew_Response,Orcas_Behaviour
145,2022-06-01 12:30:00,"36 20.1000 N, 6 16.7400 W",No,30-60,Sail,12.5-15,Unknown,Unknown,Unknown,0,0,4,Twin rudder,Sailing,Unknown,Unknown,Unknown,Day,Unknown,2 - 5,20 - 40m,On,On,White/light,Black,"Yes, extensive - major works required",Yes,This report was obtained by an interview with ...,There was a single orca. The crew saw it 30m a...


In [69]:
df.Motoring_or_Sailing.value_counts()

Motoring_or_Sailing
Motoring        51
Sailing         50
Motorsailing    47
Unknown          4
Hove-to          1
Name: count, dtype: int64

#### Speed Knots

In [70]:
df.Speed_Knots.value_counts()

Speed_Knots
5 - 7      111
3 - 4       24
8 - 11      14
0 - 2        2
Calm         1
Unknown      1
Name: count, dtype: int64

In [71]:
# Una vez más, analizamos el caso particular donde tenemos un caso que no es una de las opciones de respuesta, 'calm'

df[df.Speed_Knots == 'Calm']

Unnamed: 0,date,lat_and_long,Followed_GTOA_Protocol,Interaction_time,Boat_Type,Boat_Length,Towing_Inflatable,Trailing_Fishing_Lure,Physical_Contact_With_Boat,Number_of_Adult_Orcas,Number_of_Juvenile_Orcas,Number_of_Uncertain_Age_Orcas,Rudder,Motoring_or_Sailing,Speed_Knots,Sea_State,Wind_Speed_Beaufort,Daylight_or_Darkness,Cloud_Cover,Distance_Off_Land_NM,Depth_Meters,Depth_Gauge,Autopilot,Hull_Topsides_Color,Antifoul_Color,Boat_Damaged,Tow_Required,Crew_Response,Orcas_Behaviour
40,2023-04-20 10:45:00,"35 53.0500 N, 5 50.3900 W",No,10-30,Sail,12.5-15,No,Unknown,Unknown,0,0,0,Twin rudder,Motoring,Calm,3 - 4 (7 - 16 knots),Day,0 - 25%,5 - 10,40 - 200m,On,On,White/light,Other,"Yes, moderate - immediate repairs required",No,One or two orcas approached us at first. They ...,"Playful, determined, rolling onto their backs ...",daytime


In [72]:
# Debemos mover una posición todos los valores desde la columna Sea_State en adelante y luego poner un *Unknown* manualmente para la celda de dicha columna

# Indice de la fila que queremos alterar
index_to_shift = 40

# Cogemos el indice de la columna 'Rudder'
Sea_State_column_index = df.columns.get_loc('Sea_State')

# Usamos la función shift para la fila en particular y desde 'Sea_State' en adelante
df.iloc[index_to_shift, Sea_State_column_index:] = df.iloc[index_to_shift, Sea_State_column_index:].shift(1)

print(df.iloc[index_to_shift])

date                                                           2023-04-20 10:45:00
lat_and_long                                             35 53.0500 N, 5 50.3900 W
Followed_GTOA_Protocol                                                          No
Interaction_time                                                             10-30
Boat_Type                                                                     Sail
Boat_Length                                                                12.5-15
Towing_Inflatable                                                               No
Trailing_Fishing_Lure                                                      Unknown
Physical_Contact_With_Boat                                                 Unknown
Number_of_Adult_Orcas                                                            0
Number_of_Juvenile_Orcas                                                         0
Number_of_Uncertain_Age_Orcas                                                    0
Rudd

In [73]:
df.loc[40, 'Sea_State'] = 'Calm'
df.loc[40, 'Speed_Knots'] = 'Unknown'

df.Speed_Knots.value_counts()

Speed_Knots
5 - 7      111
3 - 4       24
8 - 11      14
Unknown      2
0 - 2        2
Name: count, dtype: int64

#### Sea State

In [74]:
df.Sea_State.value_counts()

Sea_State
Calm        86
Moderate    52
Rough       11
Unknown      4
Name: count, dtype: int64

In [75]:
df[df.Sea_State == 'Unknown']

Unnamed: 0,date,lat_and_long,Followed_GTOA_Protocol,Interaction_time,Boat_Type,Boat_Length,Towing_Inflatable,Trailing_Fishing_Lure,Physical_Contact_With_Boat,Number_of_Adult_Orcas,Number_of_Juvenile_Orcas,Number_of_Uncertain_Age_Orcas,Rudder,Motoring_or_Sailing,Speed_Knots,Sea_State,Wind_Speed_Beaufort,Daylight_or_Darkness,Cloud_Cover,Distance_Off_Land_NM,Depth_Meters,Depth_Gauge,Autopilot,Hull_Topsides_Color,Antifoul_Color,Boat_Damaged,Tow_Required,Crew_Response,Orcas_Behaviour
141,2022-06-06 06:00:00,"36 10.0000 N, 6 15.0000 W",No,30-60,Fishing Vessel,0-10,Unknown,Unknown,Unknown,0,0,4,Unknown,Hove-to,0 - 2,Unknown,Unknown,Dawn,Unknown,5 - 10,20 - 40m,Unknown,Unknown,White/light,Red,"Yes, moderate - immediate repairs required",No,This report was obtained by GTOA. The boat was...,There were four orcas and they all acted. They...
143,2022-06-04 02:30:00,"36 8.0000 N, 6 21.0000 W",Yes,0-10,Sail,12.5-15,Unknown,Unknown,Unknown,0,0,4,Unknown,Motoring,5 - 7,Unknown,Unknown,Night,Unknown,Over 10,40 - 200m,Unknown,Unknown,White/light,Black,"Yes, extensive - major works required",Yes,From a report obtained by GTOA. The yacht is a...,We could not see the animals as it was dark.
144,2022-06-04 08:00:00,"36 4.4000 N, 5 59.9000 W",Yes,0-10,Sail,12.5-15,No,Unknown,Unknown,0,0,0,Spade,Motoring,3 - 4,Unknown,0 - 2 (0 - 6 knots),Day,Unknown,5 - 10,40 - 200m,On,On,White/light,Black,"Yes, minor - will wait until the end of the se...",No,This report was obtained by GTOA. I stopped th...,This report was obtained by GTOA. It was only ...
145,2022-06-01 12:30:00,"36 20.1000 N, 6 16.7400 W",No,30-60,Sail,12.5-15,Unknown,Unknown,Unknown,0,0,4,Twin rudder,Sailing,Unknown,Unknown,Unknown,Day,Unknown,2 - 5,20 - 40m,On,On,White/light,Black,"Yes, extensive - major works required",Yes,This report was obtained by an interview with ...,There was a single orca. The crew saw it 30m a...


In [76]:
# Esta columna parece estar bien, vamos con la siguiente.

#### Wind Speed Beaufort

In [77]:
df.Wind_Speed_Beaufort.value_counts()

Wind_Speed_Beaufort
3 - 4 (7 - 16 knots)     58
0 - 2 (0 - 6 knots)      54
5 - 6 (17 - 27 knots)    31
7+ (28 knots+)            7
Unknown                   3
Name: count, dtype: int64

In [78]:
df[df.Wind_Speed_Beaufort == 'Unknown']

Unnamed: 0,date,lat_and_long,Followed_GTOA_Protocol,Interaction_time,Boat_Type,Boat_Length,Towing_Inflatable,Trailing_Fishing_Lure,Physical_Contact_With_Boat,Number_of_Adult_Orcas,Number_of_Juvenile_Orcas,Number_of_Uncertain_Age_Orcas,Rudder,Motoring_or_Sailing,Speed_Knots,Sea_State,Wind_Speed_Beaufort,Daylight_or_Darkness,Cloud_Cover,Distance_Off_Land_NM,Depth_Meters,Depth_Gauge,Autopilot,Hull_Topsides_Color,Antifoul_Color,Boat_Damaged,Tow_Required,Crew_Response,Orcas_Behaviour
141,2022-06-06 06:00:00,"36 10.0000 N, 6 15.0000 W",No,30-60,Fishing Vessel,0-10,Unknown,Unknown,Unknown,0,0,4,Unknown,Hove-to,0 - 2,Unknown,Unknown,Dawn,Unknown,5 - 10,20 - 40m,Unknown,Unknown,White/light,Red,"Yes, moderate - immediate repairs required",No,This report was obtained by GTOA. The boat was...,There were four orcas and they all acted. They...
143,2022-06-04 02:30:00,"36 8.0000 N, 6 21.0000 W",Yes,0-10,Sail,12.5-15,Unknown,Unknown,Unknown,0,0,4,Unknown,Motoring,5 - 7,Unknown,Unknown,Night,Unknown,Over 10,40 - 200m,Unknown,Unknown,White/light,Black,"Yes, extensive - major works required",Yes,From a report obtained by GTOA. The yacht is a...,We could not see the animals as it was dark.
145,2022-06-01 12:30:00,"36 20.1000 N, 6 16.7400 W",No,30-60,Sail,12.5-15,Unknown,Unknown,Unknown,0,0,4,Twin rudder,Sailing,Unknown,Unknown,Unknown,Day,Unknown,2 - 5,20 - 40m,On,On,White/light,Black,"Yes, extensive - major works required",Yes,This report was obtained by an interview with ...,There was a single orca. The crew saw it 30m a...


In [79]:
# Esta columna parece estar bien, vamos con la siguiente

#### Daylight or Darkness

In [80]:
df.Daylight_or_Darkness.value_counts()

Daylight_or_Darkness
Day      107
Night     23
Dawn      12
Dusk      11
Name: count, dtype: int64

In [81]:
# Esta columna está perfecta, no falta ningún dato. Vamos con la siguiente.

#### Cloud_Cover

In [82]:
df.Cloud_Cover.value_counts()

Cloud_Cover
0 - 25%      83
25 - 50%     36
75 - 100%    13
50 - 75%     12
Unknown       4
2 - 5         3
Over 10       1
5 - 10        1
Name: count, dtype: int64

In [83]:
# Aquí si tenemos algun valor que no debería estar en esta columna. Analizamos uno a uno.

In [84]:
df[df.Cloud_Cover == '2 - 5']

Unnamed: 0,date,lat_and_long,Followed_GTOA_Protocol,Interaction_time,Boat_Type,Boat_Length,Towing_Inflatable,Trailing_Fishing_Lure,Physical_Contact_With_Boat,Number_of_Adult_Orcas,Number_of_Juvenile_Orcas,Number_of_Uncertain_Age_Orcas,Rudder,Motoring_or_Sailing,Speed_Knots,Sea_State,Wind_Speed_Beaufort,Daylight_or_Darkness,Cloud_Cover,Distance_Off_Land_NM,Depth_Meters,Depth_Gauge,Autopilot,Hull_Topsides_Color,Antifoul_Color,Boat_Damaged,Tow_Required,Crew_Response,Orcas_Behaviour
89,2022-08-20 21:00:00,"43 18.3610 N, 9 3.5880 W",Yes,30-60,Sail,10-12.5,No,Unknown,Unknown,0,0,0,Twin rudder,Motoring,3 - 4,Moderate,0 - 2 (0 - 6 knots),Day,2 - 5,40 - 200m,On,Off,Dark colour,White,"Yes, extensive - major works required",No,From a report obtained by GTOA/MITECO There we...,During the interactions the orcas moved away a...,dusk twilight
95,2022-08-06 14:10:00,"42 50.0000 N, 9 13.7000 W",Yes,10-30,Sail,10-12.5,No,Unknown,Unknown,0,0,0,Spade,Motorsailing,5 - 7,Calm,0 - 2 (0 - 6 knots),Day,2 - 5,40 - 200m,On,Off,White/light,Black,"Yes, moderate - immediate repairs required",Yes,This report is from an interview by GTOA. The ...,There were two orcas that were not detected be...,daytime
101,2022-04-28 08:30:00,"36 2.0000 N, 5 42.0000 W",No,0-10,Sail,12.5-15,No,Unknown,Unknown,0,0,0,Spade,Motoring,8 - 11,Calm,0 - 2 (0 - 6 knots),Day,2 - 5,20 - 40m,Off,White/light,Black,No,No,Les orques venaient de la même manière de touc...,daytime,waning\n7% illuminated\nwithin 3 days of new,Not within 3 days of springs


In [85]:
# Para las filas 89 y 95 vamos a mover una posición todos los valores desde la columna Cloud_Cover en adelante
# Indice de la fila que queremos alterar
index_to_shift = 89

# Cogemos el indice de la columna 'Cloud_Cover'
Cloud_Cover_column_index = df.columns.get_loc('Cloud_Cover')

# Usamos la función shift para la fila en particular y desde 'Sea_State' en adelante
df.iloc[index_to_shift, Cloud_Cover_column_index:] = df.iloc[index_to_shift, Cloud_Cover_column_index:].shift(1)

print(df.iloc[index_to_shift])

date                                                           2022-08-20 21:00:00
lat_and_long                                              43 18.3610 N, 9 3.5880 W
Followed_GTOA_Protocol                                                         Yes
Interaction_time                                                             30-60
Boat_Type                                                                     Sail
Boat_Length                                                                10-12.5
Towing_Inflatable                                                               No
Trailing_Fishing_Lure                                                      Unknown
Physical_Contact_With_Boat                                                 Unknown
Number_of_Adult_Orcas                                                            0
Number_of_Juvenile_Orcas                                                         0
Number_of_Uncertain_Age_Orcas                                                    0
Rudd

In [86]:
# Indice de la fila que queremos alterar
index_to_shift = 95

# Cogemos el indice de la columna 'Rudder'
Cloud_Cover_column_index = df.columns.get_loc('Cloud_Cover')

# Usamos la función shift para la fila en particular y desde 'Sea_State' en adelante
df.iloc[index_to_shift, Cloud_Cover_column_index:] = df.iloc[index_to_shift, Cloud_Cover_column_index:].shift(1)

print(df.iloc[index_to_shift])

date                                                           2022-08-06 14:10:00
lat_and_long                                             42 50.0000 N, 9 13.7000 W
Followed_GTOA_Protocol                                                         Yes
Interaction_time                                                             10-30
Boat_Type                                                                     Sail
Boat_Length                                                                10-12.5
Towing_Inflatable                                                               No
Trailing_Fishing_Lure                                                      Unknown
Physical_Contact_With_Boat                                                 Unknown
Number_of_Adult_Orcas                                                            0
Number_of_Juvenile_Orcas                                                         0
Number_of_Uncertain_Age_Orcas                                                    0
Rudd

In [87]:
# Ahora cambiamos manualmente los valores correspondientes a la columna Cloud_Cover de las filas 89 y 95

df.loc[40, 'Cloud_Cover'] = 'Unknown'
df.loc[40, 'Cloud_Cover'] = 'Unknown'

df.Cloud_Cover.value_counts()

Cloud_Cover
0 - 25%      82
25 - 50%     36
75 - 100%    13
50 - 75%     12
Unknown       5
Over 10       1
5 - 10        1
2 - 5         1
Name: count, dtype: int64

In [88]:
# Seguimos con el siguiente valor anómalo de el anterior output, la fila 101

# Recordamos otra vez su apariencia:
df[df.Cloud_Cover == '2 - 5']

Unnamed: 0,date,lat_and_long,Followed_GTOA_Protocol,Interaction_time,Boat_Type,Boat_Length,Towing_Inflatable,Trailing_Fishing_Lure,Physical_Contact_With_Boat,Number_of_Adult_Orcas,Number_of_Juvenile_Orcas,Number_of_Uncertain_Age_Orcas,Rudder,Motoring_or_Sailing,Speed_Knots,Sea_State,Wind_Speed_Beaufort,Daylight_or_Darkness,Cloud_Cover,Distance_Off_Land_NM,Depth_Meters,Depth_Gauge,Autopilot,Hull_Topsides_Color,Antifoul_Color,Boat_Damaged,Tow_Required,Crew_Response,Orcas_Behaviour
101,2022-04-28 08:30:00,"36 2.0000 N, 5 42.0000 W",No,0-10,Sail,12.5-15,No,Unknown,Unknown,0,0,0,Spade,Motoring,8 - 11,Calm,0 - 2 (0 - 6 knots),Day,2 - 5,20 - 40m,Off,White/light,Black,No,No,Les orques venaient de la même manière de touc...,daytime,waning\n7% illuminated\nwithin 3 days of new,Not within 3 days of springs


In [89]:
# Debemos mover una posición los valores entre Cloud_Cover y Depth_Gauge, y los valores siguientes están desordenados más de una posicion asi que volveremos a visionar la fila para corregir el final
# Indice de la fila que queremos alterar
index_to_shift = 101

# Cogemos el indice de la columna 'Cloud_Cover'
Cloud_Cover_column_index = df.columns.get_loc('Cloud_Cover')

# Usamos la función shift para la fila en particular y desde 'Sea_State' en adelante
df.iloc[index_to_shift, Cloud_Cover_column_index:] = df.iloc[index_to_shift, Cloud_Cover_column_index:].shift(1)

print(df.iloc[index_to_shift])


date                                                           2022-04-28 08:30:00
lat_and_long                                              36 2.0000 N, 5 42.0000 W
Followed_GTOA_Protocol                                                          No
Interaction_time                                                              0-10
Boat_Type                                                                     Sail
Boat_Length                                                                12.5-15
Towing_Inflatable                                                               No
Trailing_Fishing_Lure                                                      Unknown
Physical_Contact_With_Boat                                                 Unknown
Number_of_Adult_Orcas                                                            0
Number_of_Juvenile_Orcas                                                         0
Number_of_Uncertain_Age_Orcas                                                    0
Rudd

In [90]:
df[df.date == '2022-04-28 08:30:00']

Unnamed: 0,date,lat_and_long,Followed_GTOA_Protocol,Interaction_time,Boat_Type,Boat_Length,Towing_Inflatable,Trailing_Fishing_Lure,Physical_Contact_With_Boat,Number_of_Adult_Orcas,Number_of_Juvenile_Orcas,Number_of_Uncertain_Age_Orcas,Rudder,Motoring_or_Sailing,Speed_Knots,Sea_State,Wind_Speed_Beaufort,Daylight_or_Darkness,Cloud_Cover,Distance_Off_Land_NM,Depth_Meters,Depth_Gauge,Autopilot,Hull_Topsides_Color,Antifoul_Color,Boat_Damaged,Tow_Required,Crew_Response,Orcas_Behaviour
101,2022-04-28 08:30:00,"36 2.0000 N, 5 42.0000 W",No,0-10,Sail,12.5-15,No,Unknown,Unknown,0,0,0,Spade,Motoring,8 - 11,Calm,0 - 2 (0 - 6 knots),Day,,2 - 5,20 - 40m,Off,White/light,Black,No,No,Les orques venaient de la même manière de touc...,daytime,waning\n7% illuminated\nwithin 3 days of new


In [91]:
# Volvemos a mover una posición pero ahora los valores comprendidos entre la columna Depth_Gauge y en adelante

# Indice de la fila que queremos alterar
index_to_shift = 101

# Cogemos el indice de la columna 'Cloud_Cover'
Depth_Gauge_column_index = df.columns.get_loc('Depth_Gauge')

# Usamos la función shift para la fila en particular y desde 'Sea_State' en adelante
df.iloc[index_to_shift, Depth_Gauge_column_index:] = df.iloc[index_to_shift, Depth_Gauge_column_index:].shift(1)

print(df.iloc[index_to_shift])

date                                                           2022-04-28 08:30:00
lat_and_long                                              36 2.0000 N, 5 42.0000 W
Followed_GTOA_Protocol                                                          No
Interaction_time                                                              0-10
Boat_Type                                                                     Sail
Boat_Length                                                                12.5-15
Towing_Inflatable                                                               No
Trailing_Fishing_Lure                                                      Unknown
Physical_Contact_With_Boat                                                 Unknown
Number_of_Adult_Orcas                                                            0
Number_of_Juvenile_Orcas                                                         0
Number_of_Uncertain_Age_Orcas                                                    0
Rudd

In [92]:
# Y ahora rellenamos a mano los valores de las columnas Cloud_Cover, Depth_Gauge y Orcas_Behaviour
df.loc[101, 'Cloud_Cover'] = 'Unknown'
df.loc[101, 'Depth_Gauge'] = 'On'
df.loc[101, 'Orcas_Behaviour'] = 'Unknown'

df.Cloud_Cover.value_counts()


Cloud_Cover
0 - 25%      82
25 - 50%     36
75 - 100%    13
50 - 75%     12
Unknown       6
Over 10       1
5 - 10        1
Name: count, dtype: int64

In [93]:
# Pasamos con el siguiente valor anómalo: Over 10
df[df.Cloud_Cover == 'Over 10']

Unnamed: 0,date,lat_and_long,Followed_GTOA_Protocol,Interaction_time,Boat_Type,Boat_Length,Towing_Inflatable,Trailing_Fishing_Lure,Physical_Contact_With_Boat,Number_of_Adult_Orcas,Number_of_Juvenile_Orcas,Number_of_Uncertain_Age_Orcas,Rudder,Motoring_or_Sailing,Speed_Knots,Sea_State,Wind_Speed_Beaufort,Daylight_or_Darkness,Cloud_Cover,Distance_Off_Land_NM,Depth_Meters,Depth_Gauge,Autopilot,Hull_Topsides_Color,Antifoul_Color,Boat_Damaged,Tow_Required,Crew_Response,Orcas_Behaviour
71,2022-10-14 15:40:00,"37 4.2000 N, 9 8.2000 W",No,0-10,Sail,12.5-15,No,Unknown,Unknown,0,0,0,Spade,Sailing,5 - 7,Calm,3 - 4 (7 - 16 knots),Day,Over 10,200m+,On,On,White/light,Black,No,No,"Two Orcas observed about 50m from boat, headin...","They swam past, close by.",daytime


In [95]:
# Esta fila es fácil de arreglar, movemos una posición todos los valores desde la columna Cloud_Cover en adelante y luego a mano el valor de esta columna a Unknown
# Indice de la fila que queremos alterar
index_to_shift = 71

# Cogemos el indice de la columna 'Cloud_Cover'
Cloud_Cover_column_index = df.columns.get_loc('Cloud_Cover')

# Usamos la función shift para la fila en particular y desde 'Cloud_Cover' en adelante
df.iloc[index_to_shift, Cloud_Cover_column_index:] = df.iloc[index_to_shift, Cloud_Cover_column_index:].shift(1)

print(df.iloc[index_to_shift])


date                                                           2022-10-14 15:40:00
lat_and_long                                               37 4.2000 N, 9 8.2000 W
Followed_GTOA_Protocol                                                          No
Interaction_time                                                              0-10
Boat_Type                                                                     Sail
Boat_Length                                                                12.5-15
Towing_Inflatable                                                               No
Trailing_Fishing_Lure                                                      Unknown
Physical_Contact_With_Boat                                                 Unknown
Number_of_Adult_Orcas                                                            0
Number_of_Juvenile_Orcas                                                         0
Number_of_Uncertain_Age_Orcas                                                    0
Rudd

In [96]:
df.loc[71, 'Cloud_Cover'] = 'Unknown'

df.Cloud_Cover.value_counts()

Cloud_Cover
0 - 25%      82
25 - 50%     36
75 - 100%    13
50 - 75%     12
Unknown       7
5 - 10        1
Name: count, dtype: int64

In [97]:
# Repetimos el proceso con el último valor anómalo de la columna: 5 - 10
df[df.Cloud_Cover == '5 - 10']

Unnamed: 0,date,lat_and_long,Followed_GTOA_Protocol,Interaction_time,Boat_Type,Boat_Length,Towing_Inflatable,Trailing_Fishing_Lure,Physical_Contact_With_Boat,Number_of_Adult_Orcas,Number_of_Juvenile_Orcas,Number_of_Uncertain_Age_Orcas,Rudder,Motoring_or_Sailing,Speed_Knots,Sea_State,Wind_Speed_Beaufort,Daylight_or_Darkness,Cloud_Cover,Distance_Off_Land_NM,Depth_Meters,Depth_Gauge,Autopilot,Hull_Topsides_Color,Antifoul_Color,Boat_Damaged,Tow_Required,Crew_Response,Orcas_Behaviour
88,2022-08-06 18:00:00,"42 52.0000 N, 9 25.0000 W",No,10-30,Sail,12.5-15,No,Unknown,Unknown,0,0,0,Full skeg,Motorsailing,5 - 7,Calm,0 - 2 (0 - 6 knots),Day,5 - 10,40 - 200m,On,On,White/light,Black,"Yes, moderate - immediate repairs required",No,From a report obtained by GTOA: Half an hour b...,1-2 orcas. They believe it is likely that they...,daytime


In [98]:
# Esta fila es fácil de arreglar, movemos una posición todos los valores desde la columna Cloud_Cover en adelante y luego a mano el valor de esta columna a Unknown
# Indice de la fila que queremos alterar
index_to_shift = 88

# Cogemos el indice de la columna 'Cloud_Cover'
Cloud_Cover_column_index = df.columns.get_loc('Cloud_Cover')

# Usamos la función shift para la fila en particular y desde 'Cloud_Cover' en adelante
df.iloc[index_to_shift, Cloud_Cover_column_index:] = df.iloc[index_to_shift, Cloud_Cover_column_index:].shift(1)

print(df.iloc[index_to_shift])

date                                                           2022-08-06 18:00:00
lat_and_long                                             42 52.0000 N, 9 25.0000 W
Followed_GTOA_Protocol                                                          No
Interaction_time                                                             10-30
Boat_Type                                                                     Sail
Boat_Length                                                                12.5-15
Towing_Inflatable                                                               No
Trailing_Fishing_Lure                                                      Unknown
Physical_Contact_With_Boat                                                 Unknown
Number_of_Adult_Orcas                                                            0
Number_of_Juvenile_Orcas                                                         0
Number_of_Uncertain_Age_Orcas                                                    0
Rudd

In [99]:
df.loc[88, 'Cloud_Cover'] = 'Unknown'

df.Cloud_Cover.value_counts()

Cloud_Cover
0 - 25%      82
25 - 50%     36
75 - 100%    13
50 - 75%     12
Unknown       8
Name: count, dtype: int64

In [100]:
# Perfecto, pasamos a la siguiente columna

#### Distance off land NM

In [101]:
df.Distance_Off_Land_NM.value_counts()

Distance_Off_Land_NM
Over 10    57
2 - 5      43
5 - 10     40
0 - 2      12
On          1
Name: count, dtype: int64

In [102]:
# El único valor anómalo es 'On', lo estudiamos individualmente
df[df.Distance_Off_Land_NM == 'On']

Unnamed: 0,date,lat_and_long,Followed_GTOA_Protocol,Interaction_time,Boat_Type,Boat_Length,Towing_Inflatable,Trailing_Fishing_Lure,Physical_Contact_With_Boat,Number_of_Adult_Orcas,Number_of_Juvenile_Orcas,Number_of_Uncertain_Age_Orcas,Rudder,Motoring_or_Sailing,Speed_Knots,Sea_State,Wind_Speed_Beaufort,Daylight_or_Darkness,Cloud_Cover,Distance_Off_Land_NM,Depth_Meters,Depth_Gauge,Autopilot,Hull_Topsides_Color,Antifoul_Color,Boat_Damaged,Tow_Required,Crew_Response,Orcas_Behaviour
63,2022-05-02 11:37:00,"36 2.4690 N, 5 55.4080 W",Yes,60+,Sail,10-12.5,No,Unknown,Unknown,0,0,0,Twin rudder,Motoring,5 - 7,Moderate,3 - 4 (7 - 16 knots),Day,0 - 25%,On,On,White/light,Black,"Yes, extensive - major works required",We were motoring off the southern Spanish coas...,The orcas we clearly coordinated in their acti...,daytime,waxing\n2% illuminated\nwithin 3 days of new,Within 3 days of spring tide


In [103]:
# Debemos pasar 2 posiciones a la derechas todos los valores desde la columna Distance_Off_Land_NM en adelante
# Indice de la fila que queremos alterar
index_to_shift = 63

# Cogemos el indice de la columna 'Cloud_Cover'
Distance_Off_Land_NM_column_index = df.columns.get_loc('Distance_Off_Land_NM')

# Usamos la función shift para la fila en particular y desde 'Cloud_Cover' en adelante
df.iloc[index_to_shift, Distance_Off_Land_NM_column_index:] = df.iloc[index_to_shift, Distance_Off_Land_NM_column_index:].shift(2)

print(df.iloc[index_to_shift])

date                                                           2022-05-02 11:37:00
lat_and_long                                              36 2.4690 N, 5 55.4080 W
Followed_GTOA_Protocol                                                         Yes
Interaction_time                                                               60+
Boat_Type                                                                     Sail
Boat_Length                                                                10-12.5
Towing_Inflatable                                                               No
Trailing_Fishing_Lure                                                      Unknown
Physical_Contact_With_Boat                                                 Unknown
Number_of_Adult_Orcas                                                            0
Number_of_Juvenile_Orcas                                                         0
Number_of_Uncertain_Age_Orcas                                                    0
Rudd

In [104]:
df[df.date == '2022-05-02 11:37:00']

Unnamed: 0,date,lat_and_long,Followed_GTOA_Protocol,Interaction_time,Boat_Type,Boat_Length,Towing_Inflatable,Trailing_Fishing_Lure,Physical_Contact_With_Boat,Number_of_Adult_Orcas,Number_of_Juvenile_Orcas,Number_of_Uncertain_Age_Orcas,Rudder,Motoring_or_Sailing,Speed_Knots,Sea_State,Wind_Speed_Beaufort,Daylight_or_Darkness,Cloud_Cover,Distance_Off_Land_NM,Depth_Meters,Depth_Gauge,Autopilot,Hull_Topsides_Color,Antifoul_Color,Boat_Damaged,Tow_Required,Crew_Response,Orcas_Behaviour
63,2022-05-02 11:37:00,"36 2.4690 N, 5 55.4080 W",Yes,60+,Sail,10-12.5,No,Unknown,Unknown,0,0,0,Twin rudder,Motoring,5 - 7,Moderate,3 - 4 (7 - 16 knots),Day,0 - 25%,,,On,On,White/light,Black,"Yes, extensive - major works required",We were motoring off the southern Spanish coas...,The orcas we clearly coordinated in their acti...,daytime


In [105]:
# Ahora debemos mover una posición a la derecha los valores correspondientes a Crew_Response y Orcas_Behaviour
# Indice de la fila que queremos alterar
index_to_shift = 63

# Cogemos el indice de la columna 'Cloud_Cover'
Tow_Required_column_index = df.columns.get_loc('Tow_Required')

# Usamos la función shift para la fila en particular y desde 'Cloud_Cover' en adelante
df.iloc[index_to_shift, Tow_Required_column_index:] = df.iloc[index_to_shift, Tow_Required_column_index:].shift(1)

print(df.iloc[index_to_shift])

date                                                           2022-05-02 11:37:00
lat_and_long                                              36 2.4690 N, 5 55.4080 W
Followed_GTOA_Protocol                                                         Yes
Interaction_time                                                               60+
Boat_Type                                                                     Sail
Boat_Length                                                                10-12.5
Towing_Inflatable                                                               No
Trailing_Fishing_Lure                                                      Unknown
Physical_Contact_With_Boat                                                 Unknown
Number_of_Adult_Orcas                                                            0
Number_of_Juvenile_Orcas                                                         0
Number_of_Uncertain_Age_Orcas                                                    0
Rudd

In [106]:
# Y ahora cambiar a mano los valores de las columnas Distance_Off_Land_NM, Depth_Meters y Tow_Required

df.loc[63, 'Distance_Off_Land_NM'] = 'Unknown'
df.loc[63, 'Depth_Meters'] = 'Unknown'
df.loc[63, 'Tow_Required'] = 'Unknown'

df.Distance_Off_Land_NM.value_counts()

Distance_Off_Land_NM
Over 10    57
2 - 5      43
5 - 10     40
0 - 2      12
Unknown     1
Name: count, dtype: int64

In [107]:
# Bien, pasamos a la siguiente columna

#### Depth Meters

In [108]:
df.Depth_Meters.value_counts()

Depth_Meters
40 - 200m    88
200m+        47
20 - 40m     13
On            3
Unknown       1
Up to 20m     1
Name: count, dtype: int64

In [109]:
# Hay que cambiar los tres valores que tienen un 'On', único valor anómalo de la columna
df[df.Depth_Meters == 'On']

Unnamed: 0,date,lat_and_long,Followed_GTOA_Protocol,Interaction_time,Boat_Type,Boat_Length,Towing_Inflatable,Trailing_Fishing_Lure,Physical_Contact_With_Boat,Number_of_Adult_Orcas,Number_of_Juvenile_Orcas,Number_of_Uncertain_Age_Orcas,Rudder,Motoring_or_Sailing,Speed_Knots,Sea_State,Wind_Speed_Beaufort,Daylight_or_Darkness,Cloud_Cover,Distance_Off_Land_NM,Depth_Meters,Depth_Gauge,Autopilot,Hull_Topsides_Color,Antifoul_Color,Boat_Damaged,Tow_Required,Crew_Response,Orcas_Behaviour
26,2023-04-29 08:00:00,"36 6.0000 N, 5 54.0000 W",No,10-30,Sail,15+,No,Unknown,Unknown,0,0,0,Twin rudder,Motorsailing,8 - 11,Moderate,5 - 6 (17 - 27 knots),Dawn,25 - 50%,5 - 10,On,On,Dark colour,Black,No,No,"Au petit matin, alors que nous avions traversé...","Les orques sont arrivés rapidement, et ont tou...",daytime
70,2022-10-03 13:30:00,"35 56.0000 N, 5 43.0000 W",Yes,0-10,Sail,10-12.5,No,Unknown,Unknown,0,0,0,Twin rudder,Motorsailing,5 - 7,Calm,3 - 4 (7 - 16 knots),Day,0 - 25%,2 - 5,On,On,White/light,Black,"Yes, moderate - immediate repairs required",No,Attaque très rapide. 3 orques ont attaqué les ...,Tout a été très rapide. Elles se sont concentr...,daytime
114,2022-08-27 14:30:00,"42 15.0000 N, 9 19.0000 W",Yes,60+,Sail,12.5-15,No,Unknown,Unknown,0,0,0,Spade,Motorsailing,5 - 7,Moderate,5 - 6 (17 - 27 knots),Day,0 - 25%,Over 10,On,On,White/light,Blue,"Yes, extensive - major works required",Yes,1 er choc j ai arrêté le moteur puis affaler m...,agressifs/// Aggressive,daytime


In [110]:
# Para las 3 filas hay que hacer lo mismo, mover una posición a la derecha todos los valores entre Depth_Meters y el final. Luego pasar a 'Unknown' las tres casillas de la misma columna

In [111]:
# Indice de la fila que queremos alterar
index_to_shift = 26

# Cogemos el indice de la columna 'Depth_Meters'
Depth_Meters_column_index = df.columns.get_loc('Depth_Meters')

# Usamos la función shift para la fila en particular y desde 'Depth_Meters' en adelante
df.iloc[index_to_shift, Depth_Meters_column_index:] = df.iloc[index_to_shift, Depth_Meters_column_index:].shift(1)

print(df.iloc[index_to_shift])

date                                                           2023-04-29 08:00:00
lat_and_long                                              36 6.0000 N, 5 54.0000 W
Followed_GTOA_Protocol                                                          No
Interaction_time                                                             10-30
Boat_Type                                                                     Sail
Boat_Length                                                                    15+
Towing_Inflatable                                                               No
Trailing_Fishing_Lure                                                      Unknown
Physical_Contact_With_Boat                                                 Unknown
Number_of_Adult_Orcas                                                            0
Number_of_Juvenile_Orcas                                                         0
Number_of_Uncertain_Age_Orcas                                                    0
Rudd

In [112]:
# Indice de la fila que queremos alterar
index_to_shift = 70

# Cogemos el indice de la columna 'Depth_Meters'
Depth_Meters_column_index = df.columns.get_loc('Depth_Meters')

# Usamos la función shift para la fila en particular y desde 'Depth_Meters' en adelante
df.iloc[index_to_shift, Depth_Meters_column_index:] = df.iloc[index_to_shift, Depth_Meters_column_index:].shift(1)

print(df.iloc[index_to_shift])

date                                                           2022-10-03 13:30:00
lat_and_long                                             35 56.0000 N, 5 43.0000 W
Followed_GTOA_Protocol                                                         Yes
Interaction_time                                                              0-10
Boat_Type                                                                     Sail
Boat_Length                                                                10-12.5
Towing_Inflatable                                                               No
Trailing_Fishing_Lure                                                      Unknown
Physical_Contact_With_Boat                                                 Unknown
Number_of_Adult_Orcas                                                            0
Number_of_Juvenile_Orcas                                                         0
Number_of_Uncertain_Age_Orcas                                                    0
Rudd

In [113]:
# Indice de la fila que queremos alterar
index_to_shift = 114

# Cogemos el indice de la columna 'Depth_Meters'
Depth_Meters_column_index = df.columns.get_loc('Depth_Meters')

# Usamos la función shift para la fila en particular y desde 'Depth_Meters' en adelante
df.iloc[index_to_shift, Depth_Meters_column_index:] = df.iloc[index_to_shift, Depth_Meters_column_index:].shift(1)

print(df.iloc[index_to_shift])

date                                                           2022-08-27 14:30:00
lat_and_long                                             42 15.0000 N, 9 19.0000 W
Followed_GTOA_Protocol                                                         Yes
Interaction_time                                                               60+
Boat_Type                                                                     Sail
Boat_Length                                                                12.5-15
Towing_Inflatable                                                               No
Trailing_Fishing_Lure                                                      Unknown
Physical_Contact_With_Boat                                                 Unknown
Number_of_Adult_Orcas                                                            0
Number_of_Juvenile_Orcas                                                         0
Number_of_Uncertain_Age_Orcas                                                    0
Rudd

In [114]:
# Cambiamos a mano el valor de la columna Depth_Meters a 'Unknown' para las tres filas

df.loc[26, 'Depth_Meters'] = 'Unknown'
df.loc[70, 'Depth_Meters'] = 'Unknown'
df.loc[114, 'Depth_Meters'] = 'Unknown'

df.Depth_Meters.value_counts()

Depth_Meters
40 - 200m    88
200m+        47
20 - 40m     13
Unknown       4
Up to 20m     1
Name: count, dtype: int64

In [115]:
# Bienn, pasamos a la siguiente columna

#### Depth Gauge

In [116]:
df.Depth_Gauge.value_counts()

Depth_Gauge
On             136
Off             14
Unknown          2
White/light      1
Name: count, dtype: int64

In [117]:
# Debemos analizar por separado la fila en la que está el valor anómalo 'White/light'
df[df.Depth_Gauge == 'White/light']

Unnamed: 0,date,lat_and_long,Followed_GTOA_Protocol,Interaction_time,Boat_Type,Boat_Length,Towing_Inflatable,Trailing_Fishing_Lure,Physical_Contact_With_Boat,Number_of_Adult_Orcas,Number_of_Juvenile_Orcas,Number_of_Uncertain_Age_Orcas,Rudder,Motoring_or_Sailing,Speed_Knots,Sea_State,Wind_Speed_Beaufort,Daylight_or_Darkness,Cloud_Cover,Distance_Off_Land_NM,Depth_Meters,Depth_Gauge,Autopilot,Hull_Topsides_Color,Antifoul_Color,Boat_Damaged,Tow_Required,Crew_Response,Orcas_Behaviour
140,2022-06-04 10:30:00,"36 3.0000 N, 6 25.0000 W",No,30-60,Sail,12.5-15,No,Unknown,Unknown,0,0,0,Spade,Motorsailing,5 - 7,Calm,0 - 2 (0 - 6 knots),Day,0 - 25%,Over 10,40 - 200m,White/light,Black,"Yes, moderate - immediate repairs required",Yes,This is from a report obtained by GTOA togethe...,There were 6 or 7 orcas involved and the crew ...,daytime,waxing\n20% illuminated


In [118]:
# Debemos mover 2 posiciones todos los valores desde la columna Depth_Gauge en adelante
# Indice de la fila que queremos alterar
index_to_shift = 140

# Cogemos el indice de la columna 'Depth_Meters'
Depth_Gauge_column_index = df.columns.get_loc('Depth_Gauge')

# Usamos la función shift para la fila en particular y desde 'Depth_Meters' en adelante
df.iloc[index_to_shift, Depth_Gauge_column_index:] = df.iloc[index_to_shift, Depth_Gauge_column_index:].shift(2)

print(df.iloc[index_to_shift])

date                                                           2022-06-04 10:30:00
lat_and_long                                              36 3.0000 N, 6 25.0000 W
Followed_GTOA_Protocol                                                          No
Interaction_time                                                             30-60
Boat_Type                                                                     Sail
Boat_Length                                                                12.5-15
Towing_Inflatable                                                               No
Trailing_Fishing_Lure                                                      Unknown
Physical_Contact_With_Boat                                                 Unknown
Number_of_Adult_Orcas                                                            0
Number_of_Juvenile_Orcas                                                         0
Number_of_Uncertain_Age_Orcas                                                    0
Rudd

In [119]:
# Y ahora se cambian los valores de Depth_Gauge y Autopilot a 'Unknown'
df.loc[140, 'Depth_Gauge'] = 'Unknown'
df.loc[140, 'Autopilot'] = 'Unknown'

df.Depth_Gauge.value_counts()

Depth_Gauge
On         136
Off         14
Unknown      3
Name: count, dtype: int64

In [120]:
# Hecho, pasamos a la siguiente columna.

#### Autopilot

In [121]:
df.Autopilot.value_counts()

Autopilot
On             111
Off             36
White/light      3
Unknown          3
Name: count, dtype: int64

In [122]:
# Tenemos un solo valor anómalo, repetido 3 veces. Lo filtramos para ver como arreglarlo
df[df.Autopilot == 'White/light']

Unnamed: 0,date,lat_and_long,Followed_GTOA_Protocol,Interaction_time,Boat_Type,Boat_Length,Towing_Inflatable,Trailing_Fishing_Lure,Physical_Contact_With_Boat,Number_of_Adult_Orcas,Number_of_Juvenile_Orcas,Number_of_Uncertain_Age_Orcas,Rudder,Motoring_or_Sailing,Speed_Knots,Sea_State,Wind_Speed_Beaufort,Daylight_or_Darkness,Cloud_Cover,Distance_Off_Land_NM,Depth_Meters,Depth_Gauge,Autopilot,Hull_Topsides_Color,Antifoul_Color,Boat_Damaged,Tow_Required,Crew_Response,Orcas_Behaviour
51,2022-09-16 12:00:00,"42 5.0000 N, 9 2.0000 W",Yes,60+,Sail,10-12.5,No,Unknown,Unknown,0,0,0,Spade,Motoring,5 - 7,Calm,0 - 2 (0 - 6 knots),Day,0 - 25%,5 - 10,40 - 200m,On,White/light,Blue,"Yes, moderate - immediate repairs required",Yes,FROM A REPORT OBTAINED BY GTOA: They were calm...,Number and behaviour of the orcas observed? It...,daytime
57,2022-11-20 10:50:00,"38 36.0000 N, 9 5.0000 W",No,0-10,Sail,10-12.5,No,Unknown,Unknown,0,0,0,Semi skeg,Sailing,3 - 4,Calm,7+ (28 knots+),Day,0 - 25%,5 - 10,40 - 200m,On,White/light,Blue,"Yes, moderate - immediate repairs required",No,j'ai vu les orques(au moins quatre) arriver dr...,Les m'ont semblé jouer à la toupie avec mon ba...,daytime
118,2022-08-13 15:00:00,"43 46.6866 N, 8 28.0606 W",Yes,10-30,Sail,0-10,No,Unknown,Unknown,0,0,0,Full skeg,Motorsailing,3 - 4,Calm,0 - 2 (0 - 6 knots),Day,50 - 75%,Over 10,40 - 200m,On,White/light,Green,"Yes, moderate - immediate repairs required",No,J’allais à une vitesse de 4 à 5 Nœuds environ....,daytime,waning\n97% illuminated\nwithin 3 days of full


In [123]:
# Hay que mover una posicion a la derecha los valores desde la columna de Autopilot en adelante, para las 3 filas

In [124]:
# Indice de la fila que queremos alterar
index_to_shift = 51

# Cogemos el indice de la columna 'Depth_Meters'
Autopilot_column_index = df.columns.get_loc('Autopilot')

# Usamos la función shift para la fila en particular y desde 'Depth_Meters' en adelante
df.iloc[index_to_shift, Autopilot_column_index:] = df.iloc[index_to_shift, Autopilot_column_index:].shift(1)

print(df.iloc[index_to_shift])

date                                                           2022-09-16 12:00:00
lat_and_long                                               42 5.0000 N, 9 2.0000 W
Followed_GTOA_Protocol                                                         Yes
Interaction_time                                                               60+
Boat_Type                                                                     Sail
Boat_Length                                                                10-12.5
Towing_Inflatable                                                               No
Trailing_Fishing_Lure                                                      Unknown
Physical_Contact_With_Boat                                                 Unknown
Number_of_Adult_Orcas                                                            0
Number_of_Juvenile_Orcas                                                         0
Number_of_Uncertain_Age_Orcas                                                    0
Rudd

In [125]:
# Indice de la fila que queremos alterar
index_to_shift = 57

# Cogemos el indice de la columna 'Depth_Meters'
Autopilot_column_index = df.columns.get_loc('Autopilot')

# Usamos la función shift para la fila en particular y desde 'Depth_Meters' en adelante
df.iloc[index_to_shift, Autopilot_column_index:] = df.iloc[index_to_shift, Autopilot_column_index:].shift(1)

print(df.iloc[index_to_shift])

date                                                           2022-11-20 10:50:00
lat_and_long                                              38 36.0000 N, 9 5.0000 W
Followed_GTOA_Protocol                                                          No
Interaction_time                                                              0-10
Boat_Type                                                                     Sail
Boat_Length                                                                10-12.5
Towing_Inflatable                                                               No
Trailing_Fishing_Lure                                                      Unknown
Physical_Contact_With_Boat                                                 Unknown
Number_of_Adult_Orcas                                                            0
Number_of_Juvenile_Orcas                                                         0
Number_of_Uncertain_Age_Orcas                                                    0
Rudd

In [126]:
# Indice de la fila que queremos alterar
index_to_shift = 118

# Cogemos el indice de la columna 'Depth_Meters'
Autopilot_column_index = df.columns.get_loc('Autopilot')

# Usamos la función shift para la fila en particular y desde 'Depth_Meters' en adelante
df.iloc[index_to_shift, Autopilot_column_index:] = df.iloc[index_to_shift, Autopilot_column_index:].shift(1)

print(df.iloc[index_to_shift])

date                                                           2022-08-13 15:00:00
lat_and_long                                             43 46.6866 N, 8 28.0606 W
Followed_GTOA_Protocol                                                         Yes
Interaction_time                                                             10-30
Boat_Type                                                                     Sail
Boat_Length                                                                   0-10
Towing_Inflatable                                                               No
Trailing_Fishing_Lure                                                      Unknown
Physical_Contact_With_Boat                                                 Unknown
Number_of_Adult_Orcas                                                            0
Number_of_Juvenile_Orcas                                                         0
Number_of_Uncertain_Age_Orcas                                                    0
Rudd

In [127]:
# Ahora añadimos el valor unknown a la columna Autopilot de las 3 filas

df.loc[51, 'Autopilot'] = 'Unknown'
df.loc[57, 'Autopilot'] = 'Unknown'
df.loc[118, 'Autopilot'] = 'Unknown'

df.Autopilot.value_counts()

Autopilot
On         111
Off         36
Unknown      6
Name: count, dtype: int64

In [128]:
# Hecho, pasamos a la siguiente columna

MAÑANA SEGUIR POR AQUÍ!!!!

Tareas:
* Empezar arreglando el error de Rudder == Motorsailing, comprobar el valuecounts una ultima vez y pasar a la siguiente columna
* Cuando haya terminado todas las columnas tendré que pasar a cambiar los datatypes y luego en un notebook distinto hacer lo de meter el número de orcas en las tres columnas ad,ju,unc.
* El objetivo es que este finde tenga ya en SQL la base de datos guay almacenada
* PLANTEARME objetivos para cada día!!

In [60]:
df.head()

Unnamed: 0,date,lat_and_long,Followed_GTOA_Protocol,Interaction_time,Boat_Type,Boat_Length,Towing_Inflatable,Trailing_Fishing_Lure,Physical_Contact_With_Boat,Number_of_Adult_Orcas,Number_of_Juvenile_Orcas,Number_of_Uncertain_Age_Orcas,Rudder,Motoring_or_Sailing,Speed_Knots,Sea_State,Wind_Speed_Beaufort,Daylight_or_Darkness,Cloud_Cover,Distance_Off_Land_NM,Depth_Meters,Depth_Gauge,Autopilot,Hull_Topsides_Color,Antifoul_Color,Boat_Damaged,Tow_Required,Crew_Response,Orcas_Behaviour
0,2023-11-01 22:15:00,"32 47.4980 N, 9 54.3980 W",No,0-10,Sail,10-12.5,No,No,No,1,0,0,Spade,Sailing,5 - 7,Moderate,5 - 6 (17 - 27 knots),Night,0 - 25%,Over 10,200m+,On,On,White/light,Blue,No,No,"Orca interaction at 10:15pm on 01/11, 40 miles...",I would describe the behaviour of the Orca dur...
1,2023-10-31 07:50:00,"39 26.0000 N, 9 23.0000 W",Yes,0-10,Sail,12.5-15m,No,No,Yes,2,5,0,Twin rudder,Motoring,5 - 7,Rough,3 - 4 (7 - 16 knots),Day,50 - 75%,2 - 5,40 - 200m,On,On,White/light,Black,"Yes, moderate - immediate repairs required",No,We had sandbags on our sugar scoops and metal ...,Juveniles hitting the rudders adults close by
2,2023-09-19 11:00:00,"37 40.0000 N, 8 54.0000 W",No,0-10,Sail,12.5-15m,No,No,Yes,1,0,0,Spade,Motoring,5 - 7,Calm,3 - 4 (7 - 16 knots),Day,0 - 25%,2 - 5,40 - 200m,On,On,White/light,Other,"Yes, moderate - immediate repairs required",No,We saw the orca approach from 10 o’clock posit...,There was an initial approach 45 minutes earli...
3,2023-09-01 13:15:00,"45 36.0000 N, 3 45.0000 W",Yes,10-30,Sail,15+,No,Yes,Yes,1,2,0,Spade,Sailing,3 - 4,Calm,3 - 4 (7 - 16 knots),Day,25 - 50%,Over 10,200m+,Off,Off,White/light,Black,"Yes, moderate - immediate repairs required",No,Les trois orques passent constamment de bâbord...,Pas de comportement visblement agressif./// No...
4,2023-09-02 03:45:00,"42 45.0000 N, 9 14.0000 W",Yes,0-10,Sail,12.5-15m,No,No,Yes,1,2,0,Spade,Motorsailing,5 - 7,Calm,0 - 2 (0 - 6 knots),Night,0 - 25%,5 - 10,40 - 200m,On,On,White/light,Black,"Yes, moderate - immediate repairs required",Yes,Arrêt du pilote automatique a la 2 eme interac...,Approche furtive à la première interaction dir...
