# **Data Cleaning & Preprocessing**

## Objectives

* This notebook comprises the data cleaning and preprocessing steps necessary to ensure the data is in a suitable format for subsequent visualisations and analysis.

## Inputs

* The raw dataset data/01_RAW_NEA-Seafloor-Litter.csv, which is loaded in Section 1.1.

## Outputs

* Write here which files, code or artefacts you generate by the end of the notebook 

## Additional Comments

* If you have any additional comments that don't fit in the previous bullets, please state them here. 



---

# Change working directory

* The notebooks are stored in a subfolder, so when running a notebook in the editor the working directory must be changed to the project root.

We need to change the working directory from the notebook's folder to its parent folder.
* We access the current directory with os.getcwd()

In [3]:
import os
current_dir = os.getcwd()
current_dir

'/Users/mahahussain/Desktop/NEA-Seafloor-Litter-Analysis/NEA-Seafloor-Litter-Analysis/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [4]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [5]:
current_dir = os.getcwd()
current_dir

'/Users/mahahussain/Desktop/NEA-Seafloor-Litter-Analysis/NEA-Seafloor-Litter-Analysis'
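
Re-running the os.chdir() cell above would climb one more level each time. A minimal sketch of a guard against that, assuming the notebooks live in a folder named `jupyter_notebooks` (as in the path shown above); the helper name `move_to_project_root` is hypothetical and not part of the notebook:

```python
import os
from pathlib import Path


def move_to_project_root(notebook_folder: str = "jupyter_notebooks") -> Path:
    """Change to the parent directory only when currently inside notebook_folder."""
    cwd = Path.cwd()
    if cwd.name == notebook_folder:
        # Only move up when we are still inside the notebooks subfolder
        os.chdir(cwd.parent)
    return Path.cwd()
```

Calling `move_to_project_root()` a second time leaves the working directory unchanged, because the current folder name no longer matches.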

# Section 1: Design & Implement an ETL Pipeline

## 1.1 Extract: Importing Libraries & Extracting Dataset

This section involves:
- Importing the required libraries.
- Loading the data from a CSV file into a Pandas DataFrame. 

In [1]:
# Importing necessary libraries
#   1) Pandas: for data manipulation and analysis
#   2) NumPy: for numerical operations
import pandas as pd 
import numpy as np


In [8]:
# Read the dataset in with read_csv()
raw_df = pd.read_csv("data/01_RAW_NEA-Seafloor-Litter.csv")

# Verify successful operation
print("Dataset loaded successfully!")

Dataset loaded successfully!
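
The success message above is printed whenever read_csv() does not raise, so it does not confirm that any rows were read. A minimal sketch of a slightly stronger check, using a tiny in-memory CSV with illustrative values in place of the real file; `load_csv_checked` is a hypothetical helper:

```python
import io

import pandas as pd


def load_csv_checked(source) -> pd.DataFrame:
    """Read a CSV and raise if the result has no rows or no columns."""
    df = pd.read_csv(source)
    if df.empty:
        raise ValueError("CSV loaded but contains no data")
    # Report the shape so the reader can compare it with expectations
    print(f"Loaded {df.shape[0]} rows and {df.shape[1]} columns")
    return df


# In-memory stand-in for the real file (illustrative values only)
demo = load_csv_checked(io.StringIO("year,survey\n2011,IBTS\n2010,NWGFS\n"))
```

With the real dataset, the same function could be called with the file path used in the cell above.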


## 1.2 Understanding the Data

This section involves:
- Examining column data types and counts
- Correcting incorrect types.

### Checking Data Types

In [11]:
def check_data_and_types(data_frame):
    """
    This function shows the type of data for each column and the first 5 rows of the dataset.

    Args: data_frame: The data frame we want to inspect.
    """
    print(" ----- Data Types -----")
    data_frame.info()  # info() prints its report directly and returns None

    print(" ----- First Five Rows -----")
    print(data_frame.head())

check_data_and_types(raw_df)

 ----- Data Types -----
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4310 entries, 0 to 4309
Data columns (total 62 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   year               4310 non-null   int64  
 1   survey             4310 non-null   object 
 2   cruise             4310 non-null   object 
 3   area               4033 non-null   object 
 4   station            4310 non-null   int64  
 5   fldHaulLatDegrees  4310 non-null   int64  
 6   fldHaulLatMinutes  4052 non-null   float64
 7   fldHaulLonDegrees  4310 non-null   int64  
 8   fldHaulLonMinutes  4052 non-null   float64
 9   fldHaulEorW        4057 non-null   object 
 10  fldShotLatDegrees  3829 non-null   float64
 11  fldShotLatMinutes  3603 non-null   float64
 12  fldShotLonDegrees  3745 non-null   float64
 13  fldShotLonMinutes  3519 non-null   float64
 14  fldShotEorW        3519 non-null   object 
 15  Latitude           4310 non-null   float64
 16  
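
The non-null counts above show that several columns (for example area and the fldShot* coordinates) have gaps. A minimal sketch of summarising those gaps directly, shown on a small made-up frame that borrows a few column names from the output; `missing_summary` is a hypothetical helper:

```python
import numpy as np
import pandas as pd


def missing_summary(data_frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages for columns with any gaps."""
    counts = data_frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(data_frame) * 100).round(1),
    })
    # Keep only columns with at least one missing value, largest first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Illustrative data only; not taken from the real dataset
demo = pd.DataFrame({
    "year": [2011, 2010, 2014, 2013],
    "area": ["Greater North Sea", None, "Celtic Seas", None],
    "fldHaulLatMinutes": [12.5, np.nan, 30.0, 45.2],
})
print(missing_summary(demo))
```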

### Using value_counts() to examine frequency of each unique value in select columns.

In [17]:
def get_value_counts(data_frame):
    """
    This function shows the frequency of each category for the following columns:
    1. year
    2. survey
    3. cruise
    4. area
    5. station

    It provides the count of each unique category for the specified columns.

    Args: data_frame: The data frame to inspect.
    """

    # Create a list of the columns to inspect
    columns_to_check = ['year', 'survey', 'cruise', 'area', 'station']

    # For every col in the list, print its title and its value counts
    for col in columns_to_check:
        print(f"\n Value count for {col}:")
        print(data_frame[col].value_counts())

get_value_counts(raw_df)


 Value count for year:
year
2011    640
2010    570
2014    503
2013    484
2012    416
2015    277
2009    163
1994    150
1992    149
1993    146
2008    111
2000     92
2005     86
1999     74
1998     74
1997     74
1995     74
1996     72
2007     31
2003     28
2006     27
2004     25
2001     22
2002     22
Name: count, dtype: int64

 Value count for survey:
survey
IBTS                 1575
NWGFS                 566
Q1SW_with Blinder     496
7DBTS                 444
Q1SW_No Blinder       408
CSEMP                 368
Q4SW                  220
Q1SW                  138
MEMFISH                95
Name: count, dtype: int64

 Value count for cruise:
cruise
CEND 4/15     277
Cend04/14     192
Cend5/11      175
Cend6/10      172
CEND 02/13    142
Cend 18/13    118
CEND 13/12    118
CEND 15/11    116
Cend 18/14    112
CEND 14/10    112
CEND 15/12    108
CEND4/11       95
Cend 12/10     94
CEND 11/12     93
CEND 15/14     91
CEND 13/10     90
CEND 12/09     90
Cend 15/13     89
CEND 14

#### **AI Insights for Value Counts**
   **Year Distribution**
  - The data spans several years, with 2011 having the highest count (640) and 2001 and 2002 the lowest (22 each).
  - Coverage is uneven rather than steadily declining into the past: 1992-1994 each have roughly 150 records, while 2001-2007 have the fewest (22 to 31 per year, apart from 86 in 2005).
  - The years 2010 to 2014 dominate, together accounting for 2613 of the 4310 records (about 61%).

  **Survey Distribution**
  - **IBTS** survey has the highest count (1575), significantly outpacing all other surveys.
  - The next largest survey is **NWGFS** (566), followed by **Q1SW_with Blinder** (496), suggesting these surveys might be more extensive or more frequently conducted.
  - Other surveys, such as **MEMFISH** (95) and **Q1SW** (138), have significantly lower counts, indicating they might be more specialised or less frequent.

  **Cruise Distribution**
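  - **CEND 4/15** has the most records (277), followed by other CEND series cruises (e.g., **Cend04/14**, **Cend5/11**) with roughly 90 to 200 records each. The labels are formatted inconsistently (e.g., **CEND 4/15**, **Cend04/14**, **CEND4/11**), so case and spacing should be standardised before cruises are compared.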
  - Several cruises with **CIRO** prefixes (e.g., **CIRO 6/00**, **CIRO 11/94**) have counts of 75, indicating they were conducted on a few key occasions.
  - Some cruises like **CSEMP2008** (37) and **CSEMP2007** (31) show smaller numbers, likely due to being more specific or infrequent surveys.

  **Area Distribution**
  - Two areas are recorded: **Greater North Sea** (2241) and **Celtic Seas** (1792). These sum to 4033, so the remaining 277 of the 4310 rows have no area value and are not shown by value_counts().
  - **Greater North Sea** has about 25% more records than **Celtic Seas**, indicating it may be a more frequently surveyed or larger region.
  
  **Station Distribution**
  - Station values are widely distributed:
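    - **Station 0** has the highest count (1031). This may indicate a primary location, but 0 could also be a placeholder for an unrecorded station, so it should be checked against the data source.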
    - Other stations have significantly lower counts, with some stations (like 290, 142, 292) having only 1 count, suggesting they represent rare or specific locations.
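
The area counts above sum to 4033 while the DataFrame has 4310 rows, because value_counts() leaves out missing values by default. A minimal sketch of including them with the `dropna=False` argument, shown on a small made-up Series rather than the real column:

```python
import pandas as pd

# Illustrative values only; one entry is deliberately missing
demo_area = pd.Series(["Greater North Sea", "Celtic Seas", None, "Greater North Sea"])

default_counts = demo_area.value_counts()              # missing values excluded
with_missing = demo_area.value_counts(dropna=False)    # missing values get their own row

print(with_missing)
```

Applied to the real area column, the second form would list the rows with no area alongside the two named regions.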


#### **Observations**
From the analysis, data collection is concentrated in 2010-2014. The **IBTS** survey is the largest single source, with eight other surveys contributing the rest. Record counts vary widely between cruises, and the Greater North Sea is sampled more often than the Celtic Seas. Station 0 accounts for far more records than any other station, which may reflect a primary location or a placeholder value, while many other stations appear only once or a handful of times.
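
The cruise labels in the value counts mix case, spacing and zero-padding (for example CEND 4/15, Cend04/14 and CEND4/11). A minimal sketch of one way to standardise them, assuming these differences are formatting noise and that leading zeros carry no meaning; `normalise_cruise` is a hypothetical helper, and the assumption should be confirmed against the survey documentation before it is applied:

```python
import re

import pandas as pd


def normalise_cruise(label: str) -> str:
    """Upper-case, drop whitespace, and strip leading zeros from numeric parts."""
    compact = re.sub(r"\s+", "", label.upper())
    # Remove zeros that start a number (not preceded by a digit, followed by one)
    return re.sub(r"(?<!\d)0+(?=\d)", "", compact)


# Labels copied from the value counts output above
sample = pd.Series(["CEND 4/15", "Cend04/14", "CEND4/11", "Cend 18/13", "CEND 02/13"])
normalised = sample.map(normalise_cruise)
print(normalised.tolist())
# ['CEND4/15', 'CEND4/14', 'CEND4/11', 'CEND18/13', 'CEND2/13']
```

Comparing nunique() before and after such a step would show whether any labels collapse into the same cruise.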


### Using nunique() to Check Cardinality (Number of Unique Values per Column)

In [18]:
def get_cardinality(data_frame):
    """
    This function shows the count of distinct categories for the following columns:
    1. year
    2. survey
    3. cruise
    4. area
    5. station

    Args: data_frame: The data frame to inspect.
    """

    # Create a list of the columns to inspect
    cols = ['year', 'survey', 'cruise', 'area', 'station']

    # For every col in the list, print its name and its number of unique values
    for col in cols:
        unique_count = data_frame[col].nunique()
        print(f"Number of unique {col}s: {unique_count}")


get_cardinality(raw_df)
       

Number of unique years: 24
Number of unique surveys: 9
Number of unique cruises: 54
Number of unique areas: 2
Number of unique stations: 320
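
nunique() ignores missing values unless told otherwise, which matters for columns such as area that contain gaps. A minimal sketch comparing both counts for every column at once, using a small made-up frame:

```python
import pandas as pd

# Illustrative values only; the area column has one missing entry
demo = pd.DataFrame({
    "area": ["Greater North Sea", "Celtic Seas", None],
    "station": [0, 0, 12],
})

summary = pd.DataFrame({
    "unique": demo.nunique(),                           # missing values ignored
    "unique_incl_missing": demo.nunique(dropna=False),  # missing counted as a value
})
print(summary)
```

A difference between the two columns flags a column that contains missing values.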


#### **AI Insights for Unique Category Counts**
**Year Distribution**
- The data spans **24 unique years**, covering every year from 1992 to 2015.
- This reflects long-term monitoring, although the number of records per year varies widely.

**Survey Distribution**
- **9 unique surveys** are represented, showcasing a diverse range of research efforts.
- The presence of multiple surveys suggests a variety of research methodologies and objectives, capturing different aspects of the dataset.

**Cruise Distribution**
- The data covers **54 unique cruises**, indicating a diverse set of research trips or expeditions.
- This figure counts distinct labels, so it may be inflated if the same cruise appears under differently formatted names (the value counts show mixed case and spacing, e.g. CEND 4/15 and Cend04/14).

**Area Distribution**
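- The data is focused on **2 unique areas**, likely representing two primary regions of interest. nunique() ignores missing values by default, so rows without an area are not counted as a third category.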
- These areas may have distinct environmental conditions, offering opportunities for comparison and regional analysis.

**Station Distribution**
- The dataset includes data from **320 unique stations**, reflecting a detailed geographical distribution.
- The large number of stations suggests a comprehensive sampling effort, providing rich spatial data for analysis.

#### **Observations**
- The data spans a significant time period with 24 years of collected data.
- The variety in the number of surveys, cruises, and stations shows a comprehensive and diverse data collection effort, capturing both broad trends and specific research events.
- The focus on two primary areas highlights regional interest, while the extensive number of stations offers a high level of detail and spatial resolution in the dataset.


---

# Section 2

Section 2 content

---

NOTE

* You may add as many sections as you want, as long as it supports your project workflow.
* All notebook cells should run top-down: avoid any workflow where you must go back to an earlier cell to execute a task, such as refreshing a variable's contents.

---

# Push files to Repo

* In cases where you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os
try:
  # create your folder here
  # os.makedirs(name='')
  pass  # placeholder so the try block is valid until a folder name is supplied
except Exception as e:
  print(e)
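
A minimal sketch of the folder-creation step, assuming a hypothetical output folder `outputs/cleaned` (the notebook does not name one). It runs inside a temporary directory so nothing is written to the project; `ensure_folder` is an illustrative helper:

```python
import os
import tempfile


def ensure_folder(path: str) -> str:
    """Create the folder (and any parents) if missing; do nothing if it exists."""
    os.makedirs(path, exist_ok=True)
    return path


# Demonstrated in a temporary directory; the folder name is an assumption
with tempfile.TemporaryDirectory() as tmp:
    target = ensure_folder(os.path.join(tmp, "outputs", "cleaned"))
    created = os.path.isdir(target)

print(created)
```

Passing `exist_ok=True` means the cell can be re-run without raising an error when the folder already exists.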
