02 – DATA PREPROCESSING

This notebook cleans and prepares the raw data so that it is suitable for analysis and modeling. It loads the datasets, checks for missing values and duplicates, and confirms that the data types are correct. It also covers encoding categorical variables, scaling features, and preparing the train/test split.

1. Load the Datasets
   
Instead of loading a single file, this step loads all CSV files located in the data/raw/ folder into a dictionary of pandas DataFrames. Each dataset is stored using its filename (without the .csv extension) as the key. This approach allows easy access and inspection of multiple datasets simultaneously.

In [2]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [4]:
import os
import glob
import pandas as pd

# Path to the raw data folder
raw_data_path = '/content/drive/MyDrive/yacht-data-insights/data/raw/'

# Get a list of all CSV files in the folder
csv_files = glob.glob(os.path.join(raw_data_path, '*.csv'))

# Create a dictionary of DataFrames with the filename (without extension) as the key
dataframes = {}

for file in csv_files:
    name = os.path.splitext(os.path.basename(file))[0]  # filename without path or extension
    try:
        # Try reading the CSV with a specified encoding
        df = pd.read_csv(file, encoding='ISO-8859-1', low_memory=False)
        dataframes[name] = df
        print(f"Loaded {name} with shape {df.shape}")
    except Exception as e:
        print(f"Error loading {name}: {e}")

# Optionally, preview the first loaded dataset to confirm
# (reuses the DataFrame already in memory instead of re-reading the file)
if dataframes:
    first_name = next(iter(dataframes))
    print(f"\nFirst file preview: {dataframes[first_name].head()}")
else:
    print("No CSV files found.")


Loaded named_anchorages_v1_20191205 with shape (166508, 10)
Loaded CVP_loitering_202411 with shape (684, 14)
Loaded named_anchorages_v1_20181108 with shape (119748, 7)
Loaded Weather-for-Boating-Activities with shape (1060, 6)
Loaded CVP_ports_202411 with shape (1410, 14)
Loaded boat_data with shape (9888, 10)
Loaded CVP_encounters_202411 with shape (348, 14)
Loaded sar_vessel_detections_pipev20231026_202410 with shape (268681, 10)
Loaded named_anchorages_v2_20201104 with shape (166515, 10)
Loaded boat_dataset with shape (10344, 38)
Loaded Boats_No_Price_dataset with shape (936, 26)
Loaded named_anchorages_v2_20221206 with shape (166482, 10)
Loaded sar_vessel_detections_pipev3_202411 with shape (248247, 10)
Loaded sar_vessel_detections_pipev3_202412 with shape (239081, 10)

First file preview:        s2id        lat        lon     label sublabel     label_source iso3  \
0  3e4e429b  26.914042  52.220320   SHARJAH      NaN  top_destination  IRN   
1  1a575de7  -7.715992  11.724560  BLOC



2. Initial Checks: Data Types, Missing Values, and Duplicates
This step checks the data types to confirm that each column is stored in the expected format, counts missing values per column, and counts fully duplicated rows. These checks surface common data issues that need to be addressed before further analysis.

In [5]:
# Loop through each loaded DataFrame in the dictionary
for name, df in dataframes.items():
    print(f"\nüîç Initial checks for: {name}")

    # Overview of the dataset
    print("\nüìÑ Dataset Info:")
    print(df.info())

    # Count missing values per column
    print("\n‚ùì Missing Values per Column:")
    print(df.isnull().sum())

    # Check for duplicates
    print("\nüîÅ Number of Duplicate Rows:")
    print(df.duplicated().sum())



üîç Initial checks for: named_anchorages_v1_20191205

üìÑ Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 166508 entries, 0 to 166507
Data columns (total 10 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   s2id                   166508 non-null  object 
 1   lat                    166508 non-null  float64
 2   lon                    166508 non-null  float64
 3   label                  166502 non-null  object 
 4   sublabel               5586 non-null    object 
 5   label_source           166508 non-null  object 
 6   iso3                   166501 non-null  object 
 7   distance_from_shore_m  166483 non-null  float64
 8   drift_radius           166346 non-null  float64
 9   at_dock                166507 non-null  object 
dtypes: float64(4), object(6)
memory usage: 12.7+ MB
None

‚ùì Missing Values per Column:
s2id                          0
lat                           0
lon                     
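
With fourteen datasets, the per-dataset printouts get long. As a complementary sketch (not part of the original notebook), the same three checks can be condensed into one summary table. The helper name `summarize_checks` and the toy `demo` dictionary are assumptions for illustration; in the notebook the function would be called on `dataframes` instead.

```python
import pandas as pd


def summarize_checks(frames):
    """Return one row per dataset: size, missing cells, and duplicate rows."""
    rows = []
    for name, df in frames.items():
        rows.append({
            "dataset": name,
            "rows": len(df),
            "columns": df.shape[1],
            "missing_cells": int(df.isnull().sum().sum()),
            "duplicate_rows": int(df.duplicated().sum()),
        })
    return pd.DataFrame(rows).set_index("dataset")


# Toy stand-in for the `dataframes` dictionary (hypothetical data)
demo = {
    "a": pd.DataFrame({"x": [1, 1, None], "y": ["p", "p", "q"]}),
    "b": pd.DataFrame({"x": [1, 2], "y": ["p", "q"]}),
}

summary = summarize_checks(demo)
print(summary)
```

Sorting the result, for example with `summary.sort_values("missing_cells", ascending=False)`, shows which datasets need the most cleaning.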

3. Merging Anchorages Datasets
The four anchorage datasets are concatenated into a single DataFrame to allow unified analysis. Rows from every version are stacked, and columns that exist in only some versions (such as anchorage_group and dock) are filled with NaN where they are absent. Consolidating the data from different timeframes and sources streamlines later analysis. Note that concatenation itself does not remove repeated records, so any anchorage present in more than one version still appears multiple times until duplicates are removed.

In [7]:
# Merge the anchorage datasets
# Assumes the datasets have already been loaded into the `dataframes` dictionary

# Select the anchorage datasets
anchorage_keys = [key for key in dataframes if 'anchorages' in key]

# Print the datasets being merged
print("Merging the following Anchorage datasets:")
for key in anchorage_keys:
    print(f"- {key}")

# Merge them into a single DataFrame
df_anchorages = pd.concat([dataframes[key] for key in anchorage_keys], ignore_index=True)

# Display quick check
print("\nCombined Anchorage dataset shape:", df_anchorages.shape)
print(df_anchorages.head())


Merging the following Anchorage datasets:
- named_anchorages_v1_20191205
- named_anchorages_v1_20181108
- named_anchorages_v2_20201104
- named_anchorages_v2_20221206

Combined Anchorage dataset shape: (619253, 12)
       s2id        lat        lon     label sublabel     label_source iso3  \
0  3e4e429b  26.914042  52.220320   SHARJAH      NaN  top_destination  IRN   
1  1a575de7  -7.715992  11.724560  BLOCK 17      NaN  top_destination  AGO   
2  3fcf5295  29.642077  48.696705  KAZ IRAQ      NaN  top_destination  KWT   
3  3fcf52bf  29.644148  48.701873  KAZ IRAQ      NaN  top_destination  KWT   
4  3fcf52bd  29.639744  48.701769  UMM QASR      NaN  top_destination  KWT   

   distance_from_shore_m  drift_radius at_dock anchorage_group dock  
0                63000.0      0.056322   False             NaN  NaN  
1               134000.0      0.111111   False             NaN  NaN  
2                33000.0      0.162583   False             NaN  NaN  
3                33000.0      0.16162
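
Once the versions are stacked, it is no longer visible which file a row came from. The sketch below, on toy data, records the source in a column and then keeps one row per anchorage. The `source` column name and the choice of `s2id` as the identifier for "the same anchorage" are assumptions, not something the notebook confirms.

```python
import pandas as pd

# Toy stand-ins for two anchorage versions (hypothetical values)
v1 = pd.DataFrame({"s2id": ["a1", "b2"], "lat": [1.0, 2.0]})
v2 = pd.DataFrame({"s2id": ["a1", "c3"], "lat": [1.0, 3.0], "dock": [True, False]})
frames = {"v1": v1, "v2": v2}

# Tag each row with the dataset it came from, then stack the rows.
# Columns missing from a version (here `dock` in v1) become NaN.
combined = pd.concat(
    [df.assign(source=name) for name, df in frames.items()],
    ignore_index=True,
)

# Assumption: s2id identifies an anchorage across versions.
# keep="last" retains the row from the dataset listed last in `frames`.
latest = combined.drop_duplicates(subset="s2id", keep="last")

print(combined.shape)
print(latest)
```

Whether to keep the first or last occurrence depends on which version should win; the order of `anchorage_keys` in the notebook follows the file listing, so it would need sorting before relying on `keep="last"`.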

4. Handling Missing Values Across Multiple Datasets
This step addresses missing values in all loaded datasets. A loop iterates through each dataset and checks for columns with missing data. For categorical columns such as 'sublabel', missing values are filled with 'Unknown'; numerical columns such as 'Length' and 'Width' are filled with the column median. Each column's existence is verified before any change is made, which prevents KeyErrors in datasets that lack it. After cleaning, the missing-value counts are rechecked to confirm that the fill succeeded. The cells shown here cover the inspection part: they print sample values and missing-value counts for 'Length' and 'Width' in each dataset.
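
The fill strategy described above can be sketched as follows. This is a minimal illustration on toy data, not the notebook's own cell: the helper name `fill_missing` and the `demo` dictionary are assumptions, and the numerical columns are assumed to already have a numeric dtype.

```python
import pandas as pd


def fill_missing(df, categorical=("sublabel",), numerical=("Length", "Width")):
    """Fill categorical gaps with 'Unknown' and numerical gaps with the median."""
    out = df.copy()
    for col in categorical:
        if col in out.columns:  # skip datasets without this column (avoids KeyError)
            out[col] = out[col].fillna("Unknown")
    for col in numerical:
        if col in out.columns:
            out[col] = out[col].fillna(out[col].median())
    return out


# Toy stand-in for the `dataframes` dictionary (hypothetical values)
demo = {
    "anchorages": pd.DataFrame({"sublabel": ["PIER 1", None]}),
    "boats": pd.DataFrame({"Length": [4.0, None, 6.0], "Width": [1.5, 2.5, None]}),
}

cleaned = {name: fill_missing(df) for name, df in demo.items()}

# Recheck: count the missing values remaining in the targeted columns
for name, df in cleaned.items():
    print(name, int(df.isnull().sum().sum()))
```

Returning a copy rather than modifying in place keeps the raw DataFrames available for comparison.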

In [10]:
# Loop through each dataset in your collection
for name, df in dataframes.items():
    print(f"\nSample values for 'Length' and 'Width' in dataset: {name}")

    # Sample values for 'Length' column
    if 'Length' in df.columns:
        print("\nSample values in 'Length' column:")
        print(df['Length'].head(10))
        # Check for missing values in 'Length'
        print(f"Missing values in 'Length' column: {df['Length'].isnull().sum()}")

    # Sample values for 'Width' column
    if 'Width' in df.columns:
        print("\nSample values in 'Width' column:")
        print(df['Width'].head(10))
        # Check for missing values in 'Width'
        print(f"Missing values in 'Width' column: {df['Width'].isnull().sum()}")

    print("-" * 50)



Sample values for 'Length' and 'Width' in dataset: named_anchorages_v1_20191205
--------------------------------------------------

Sample values for 'Length' and 'Width' in dataset: CVP_loitering_202411
--------------------------------------------------

Sample values for 'Length' and 'Width' in dataset: named_anchorages_v1_20181108
--------------------------------------------------

Sample values for 'Length' and 'Width' in dataset: Weather-for-Boating-Activities
--------------------------------------------------

Sample values for 'Length' and 'Width' in dataset: CVP_ports_202411
--------------------------------------------------

Sample values for 'Length' and 'Width' in dataset: boat_data

Sample values in 'Length' column:
0    4.00
1    4.00
2    3.69
3    3.00
4    3.55
5    4.03
6    6.20
7    3.00
8    3.64
9    4.35
Name: Length, dtype: float64
Missing values in 'Length' column: 0

Sample values in 'Width' column:
0    1.90
1    1.50
2    1.42
3    1.00
4    1.46
5    1.56
6

In [9]:
# Loop through each dataset in your collection
for name, df in dataframes.items():
    print(f"\nSample values for 'Length' and 'Width' in dataset: {name}")

    # Sample values for 'Length' column
    if 'Length' in df.columns:
        print("Sample values in 'Length' column:")
        print(df['Length'].head(10))

    # Sample values for 'Width' column
    if 'Width' in df.columns:
        print("Sample values in 'Width' column:")
        print(df['Width'].head(10))



Sample values for 'Length' and 'Width' in dataset: named_anchorages_v1_20191205

Sample values for 'Length' and 'Width' in dataset: CVP_loitering_202411

Sample values for 'Length' and 'Width' in dataset: named_anchorages_v1_20181108

Sample values for 'Length' and 'Width' in dataset: Weather-for-Boating-Activities

Sample values for 'Length' and 'Width' in dataset: CVP_ports_202411

Sample values for 'Length' and 'Width' in dataset: boat_data
Sample values in 'Length' column:
0    4.00
1    4.00
2    3.69
3    3.00
4    3.55
5    4.03
6    6.20
7    3.00
8    3.64
9    4.35
Name: Length, dtype: float64
Sample values in 'Width' column:
0    1.90
1    1.50
2    1.42
3    1.00
4    1.46
5    1.56
6    2.38
7    3.33
8    1.37
9    1.73
Name: Width, dtype: float64

Sample values for 'Length' and 'Width' in dataset: CVP_encounters_202411

Sample values for 'Length' and 'Width' in dataset: sar_vessel_detections_pipev20231026_202410

Sample values for 'Length' and 'Width' in dataset: named_

5. Remove Duplicates
Duplicate rows are removed to avoid bias and redundancy during analysis or training.

In [None]:
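# Drop duplicate rows from the working DataFrame `df`
# Note: after the loops above, `df` refers to the last dataset iterated,
# so assign the intended dataset to `df` before running this cell
df = df.drop_duplicates()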

6. Encode Categorical Variables
Categorical columns are one-hot encoded to convert them into a numerical format suitable for machine learning models.

In [None]:
# One-hot encode selected categorical column
df = pd.get_dummies(df, columns=['categorical_column'])
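
To show what the encoding produces, here is a small sketch on toy data. The column name `boat_type` and its values are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Toy data; `boat_type` is a hypothetical categorical column
boats = pd.DataFrame({
    "boat_type": ["sail", "motor", "sail"],
    "Length": [4.0, 6.2, 3.0],
})

# Each category becomes its own 0/1 indicator column;
# dtype=int yields integers instead of booleans
encoded = pd.get_dummies(boats, columns=["boat_type"], dtype=int)
print(encoded)
```

The original column is replaced by one indicator column per category, named with the original column name as a prefix.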

7. Scale Numerical Features
Numerical features are scaled with StandardScaler, which standardizes each feature to zero mean and unit variance so that features measured on different scales have comparable magnitudes.

In [None]:
from sklearn.preprocessing import StandardScaler

# Apply standard scaling
scaler = StandardScaler()
df[['numerical_column1', 'numerical_column2']] = scaler.fit_transform(df[['numerical_column1', 'numerical_column2']])
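
As a quick check of what the scaler computes, the sketch below compares its output with the manual formula (x - mean) / std on a few length values. The variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A single feature as a column vector (StandardScaler expects 2-D input)
lengths = np.array([[4.00], [3.00], [6.20], [3.55]])

scaled = StandardScaler().fit_transform(lengths)

# Manual standardization; np.std uses the population standard deviation
# (ddof=0) by default, which matches StandardScaler
manual = (lengths - lengths.mean(axis=0)) / lengths.std(axis=0)

print(np.allclose(scaled, manual))
```

Because the scaler uses the population standard deviation, results differ slightly from pandas' `Series.std()`, which defaults to the sample standard deviation (ddof=1).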

8. Split Dataset into Train/Test
The cleaned and prepared dataset is split into training (80%) and testing (20%) sets, with a fixed random seed for reproducibility, to allow model validation on unseen data.

In [None]:
from sklearn.model_selection import train_test_split

# Separate features and target
X = df.drop('target_column', axis=1)
y = df['target_column']

# Create training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
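
One point worth considering is the order of scaling and splitting. If the scaler is fitted on the full dataset before the split, statistics from the test rows influence the training features. A common alternative, sketched below, is to split first and fit the scaler on the training set only. The `Length` and `Width` values are the samples printed earlier for boat_data; the `price` target and its values are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = pd.DataFrame({
    "Length": [4.00, 4.00, 3.69, 3.00, 3.55, 4.03, 6.20, 3.00, 3.64, 4.35],
    "Width": [1.90, 1.50, 1.42, 1.00, 1.46, 1.56, 2.38, 3.33, 1.37, 1.73],
    # Hypothetical target values, for illustration only
    "price": [3300, 3400, 3700, 3300, 3400, 3600, 3600, 3300, 3500, 4600],
})

X = data.drop(columns="price")
y = data["price"]

# Split first, so the test rows stay unseen during preprocessing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training rows only, then apply it to both sets
scaler = StandardScaler()
X_train_scaled = pd.DataFrame(
    scaler.fit_transform(X_train), columns=X.columns, index=X_train.index
)
X_test_scaled = pd.DataFrame(
    scaler.transform(X_test), columns=X.columns, index=X_test.index
)

print(X_train_scaled.shape, X_test_scaled.shape)
```

The training features end up with zero mean per column, while the test features generally do not, since they are transformed with the training statistics.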
