# **Exploratory Data Analysis**


* This notebook documents ......

## Objectives
- **Obj**: ....
 


## Inputs
* Processed Dataset: `02_PROCESSED_NEA-Seafloor-Litter.csv`. 

    This served as the initial data source.

## Outputs

* .....


## Additional Comments

* .....



11.1 Research and experiment with the application of data analytics tools, technologies, and methodologies:
Explain why you chose specific tools (e.g., comparing different Python libraries like Matplotlib vs. Plotly for visualisation). Detail in Jupyter any experiments with new tools or methodologies, showing code snippets and explaining their results. Include version control commits that show progressive experimentation and refinements of your work. Incorporate new tools/technologies you've researched (such as a new library). The project would show evidence of trial, adaptation, and application in a meaningful way. Focus on challenges encountered and solutions found, explaining how you adapted to using them in the project.


# Change working directory

To make the relative file paths used later resolve correctly, change the working directory to its parent directory.
* os.getcwd() returns the current working directory

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/Users/mahahussain/Desktop/NEA-Seafloor-Litter-Analysis/NEA-Seafloor-Litter-Analysis/jupyter_notebooks'

To make the parent of the current directory the new working directory:
* os.path.dirname() gets the parent directory
* os.chdir() sets the new working directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/Users/mahahussain/Desktop/NEA-Seafloor-Litter-Analysis/NEA-Seafloor-Litter-Analysis'

---

# Section 1 : Descriptive Statistics

## In this section:
- Introduction to the key statistical concepts used throughout the analysis.
- #TODO
-




### 1.1 Introduction to Core Statistical Concepts

During data analysis, we want to be able to describe and interpret our dataset. We can do so by utilising statistical concepts that help us identify patterns and draw conclusions.

Summary statistics condense a column of data into a handful of numbers: the mean, median, mode, standard deviation and min/max values.

---

1. `MEAN:` The mean is a statistical tool used to find the average of a set of given numbers. 

- To calculate it, sum all of the values in the dataset and divide that sum by the count. We can use np.mean() to do so.

    - Sum: Total amount of the addition of all datapoints in the given numerical column.
    - Count: Total number of all datapoints in the set.

This is an important foundational principle for data analysis because it provides a measure of central tendency which represents the typical value in a given data column. This allows us to compare averages across different datasets or subgroups within a given dataset. The mean can also be used to examine the change in mean over an extended period of time to view shifts and trends in the data.

Note: The mean is sensitive to extreme outliers. In a dataset with a highly skewed distribution, the mean may not accurately represent a typical datapoint. Hence, we must investigate the data in conjunction with other statistical measures.
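The calculation and the outlier sensitivity described above can be sketched as follows. The litter counts are hypothetical values chosen for illustration, not rows from the dataset.

```python
import numpy as np

# Hypothetical litter counts for five records (illustrative values only)
counts = np.array([0, 1, 1, 2, 3])

# Mean = sum of the values divided by the number of values
manual_mean = counts.sum() / counts.size
numpy_mean = np.mean(counts)
print(manual_mean, numpy_mean)

# Adding one extreme record pulls the mean far away from the typical value
with_outlier = np.append(counts, 271)
outlier_mean = np.mean(with_outlier)
print(outlier_mean)
```

A single extreme value is enough to move the mean well above every other datapoint, which is why the mean is read alongside the median.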

---

2. `MEDIAN:` The median is a statistical tool used to find the middle value of an ordered set of given numbers. 

- To calculate it, you first need to order the datapoints in ascending order.

    - If you have an odd number of values, the median is simply the middle number.
    - If you have an even number of values, the median is the average of the two middle numbers.

    We can use np.median() to do so.

This is an important foundational principle for data analysis because it helps to identify a typical value of the dataset - 50% of the data is above this point and 50% of the data is below this point.

Note: Unlike the mean, the median is largely unaffected by extreme outliers, so it is a better measure of central tendency for a dataset with a skewed distribution. If the mean and median are closely aligned, the data is likely to be roughly symmetrical (as a normal distribution is); if they differ noticeably, the data is likely skewed.
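A minimal sketch of the odd and even cases, using hypothetical values. It also compares the mean with the median to show how a gap between the two hints at skew.

```python
import numpy as np

# Hypothetical values (illustrative only)
odd = np.array([3, 0, 1, 2, 271])   # five values
even = np.array([3, 0, 1, 2])       # four values

# Odd count: middle value of the sorted data [0, 1, 2, 3, 271]
median_odd = np.median(odd)

# Even count: average of the two middle values of [0, 1, 2, 3]
median_even = np.median(even)

# The outlier 271 drags the mean upwards but leaves the median untouched
mean_odd = np.mean(odd)
print(median_odd, median_even, mean_odd)
```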

---

3. `MODE:` The mode is a statistical tool used to find the most frequent value in the given datapoints.

- To calculate it, simply count the number of times each value appears in the data.

We can use scipy.stats.mode() to do so on a NumPy array, or Series.mode() on a pandas column; the pandas method returns every value that is tied for the highest count.

This is an important foundational principle for data analysis because it allows us to identify the most common value in our data. It is especially helpful for analysing categorical variables, such as types of litter.
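As a sketch of counting each value and picking the most frequent one, the example below uses the pandas `Series.mode()` method on hypothetical litter categories. The category values are made up for illustration.

```python
import pandas as pd

# Hypothetical litter categories (illustrative, not taken from the dataset)
litter = pd.Series(["bag", "bottle", "bag", "sheet", "bottle"])

# value_counts() tallies how often each category appears
counts = litter.value_counts()

# mode() returns every value tied for the highest count, in sorted order
modes = litter.mode().tolist()
print(counts)
print(modes)
```

Here "bag" and "bottle" both appear twice, so the data has two modes. This is why code that reports the mode should allow for more than one result.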
    
---

4. `STANDARD DEVIATION:` Standard deviation is a statistical tool used to measure how spread out the numbers in a dataset are.

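- To calculate it you can use: np.std(data). By default this computes the population standard deviation given by the formula below (dividing by N).
- Since we are using DataFrames we can use df.std() from Pandas. Note that df.std() computes the sample standard deviation by default (dividing by N - 1); pass ddof=0 to match the formula below.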

$$
\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}
$$

This is an important foundational principle for data analysis because it tells us how far, on average, the datapoints lie from the mean value.

- Small standard deviation: The datapoints are close to the average.
- Large standard deviation: The datapoints are far from the average.
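The two library calls do not use the same default divisor, which is easy to miss. The sketch below, on a small illustrative list, compares NumPy's default (`ddof=0`, population) with pandas' default (`ddof=1`, sample).

```python
import numpy as np
import pandas as pd

# Small illustrative sample
values = [2, 4, 4, 4, 5, 5, 7, 9]

# NumPy default: ddof=0, divides by N (population standard deviation)
population_std = np.std(values)

# ddof=1 divides by N - 1 (sample standard deviation)
sample_std = np.std(values, ddof=1)

# pandas default is ddof=1, so it matches the sample version
pandas_std = pd.Series(values).std()

# Passing ddof=0 to pandas reproduces the population formula
pandas_population_std = pd.Series(values).std(ddof=0)
print(population_std, sample_std, pandas_std, pandas_population_std)
```

For a column with thousands of rows the two versions are almost identical, but on small subgroups the difference can be noticeable.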

---
5. `HYPOTHESIS TESTING:` Hypothesis testing is a statistical concept used to check whether the data supports a claim we want to make about it.

- It involves setting up a "null hypothesis" (a default statement, usually that there is no effect or no difference) and then using statistical tests to see if the data provides enough evidence to reject it.

This is an important foundational principle for data analysis because it allows us to judge whether observed patterns could plausibly be due to chance or are statistically significant.
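One way to make the idea concrete is a permutation test, sketched below with NumPy only. This is an illustrative technique, not necessarily the test used later in this project. The function name `permutation_test` and the two samples are hypothetical. The null hypothesis is that both groups share the same mean.

```python
import numpy as np

def permutation_test(a, b, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in means."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    hits = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are interchangeable
        shuffled = rng.permutation(pooled)
        diff = abs(shuffled[:a.size].mean() - shuffled[a.size:].mean())
        if diff >= observed:
            hits += 1
    # Add 1 to both counts so the estimated p-value is never exactly 0
    return (hits + 1) / (n_permutations + 1)

# Hypothetical litter counts for two groups (illustrative values only)
group_a = [0, 1, 0, 2, 1, 0, 3, 1]
group_b = [4, 6, 5, 7, 3, 8, 5, 6]
p_value = permutation_test(group_a, group_b)
print(f"p-value: {p_value:.4f}")
```

A small p-value means that a difference this large rarely arises when the labels are shuffled at random, which is evidence against the null hypothesis.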

---
6. `BASIC PROBABILITY:` Probability is a statistical concept used to understand how likely it is that something will happen.

- It's expressed as a number between 0 and 1 (or as a percentage). 0 means it's impossible, and 1 (or 100%) means it's certain.

This is an important foundational principle for data analysis because it helps us understand the chances of different outcomes. For example, if we want to know the probability of finding a certain type of litter in a specific area, we can estimate it from the historical records as the proportion of past surveys in which that litter type was found.
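A minimal sketch of estimating such a probability from historical records. The DataFrame, its column names and the area labels are hypothetical and only illustrate the calculation.

```python
import pandas as pd

# Hypothetical records: 1 if at least one bottle was found, else 0
records = pd.DataFrame({
    "area": ["Area A", "Area A", "Area A", "Area B",
             "Area B", "Area B", "Area B", "Area A"],
    "bottle_found": [1, 0, 0, 1, 1, 0, 1, 0],
})

# Empirical probability = favourable outcomes / total outcomes
p_bottle = records["bottle_found"].mean()

# Probability of finding a bottle, conditional on the area
p_by_area = records.groupby("area")["bottle_found"].mean()
print(p_bottle)
print(p_by_area)
```

Because the indicator column holds only 0s and 1s, its mean equals the share of records with a find, which is the empirical probability.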


### 1.2 Statistical Summary for Numerical Values

This section provides a statistical overview of the numerical data, calculating and interpreting descriptive measures such as mean, median, and standard deviation.


In [4]:
# Import libraries
import numpy as np
import pandas as pd

In [5]:
# Load in the processed dataset
df = pd.read_csv('data/02_PROCESSED_NEA-Seafloor-Litter.csv')

In [6]:
# Display the first few rows of the dataframe
df.head()

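| Unnamed: 0 | survey | cruise | area | station | Latitude | Longitude | date | bottle | sheet | bag | ... | other.wood | clothing | shoes | other.misc | totallitter | distance | wingspread | year | month | day |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | IBTS | CIRO 9/92 | Greater North Sea | 1 | 51.738333 | 1.753333 | 1992-08-14 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 1 | 3794 | 0 | 1992 | 8 | 14 |
| 1 | IBTS | CIRO 9/92 | Greater North Sea | 2 | 51.601667 | 2.796667 | 1992-08-15 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 2 | 3918 | 0 | 1992 | 8 | 15 |
| 2 | IBTS | CIRO 9/92 | Greater North Sea | 3 | 51.823333 | 3.643333 | 1992-08-15 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 3624 | 0 | 1992 | 8 | 15 |
| 3 | IBTS | CIRO 9/92 | Greater North Sea | 4 | 52.823333 | 2.76 | 1992-08-16 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 1 | 3642 | 0 | 1992 | 8 | 16 |
| 4 | IBTS | CIRO 9/92 | Greater North Sea | 5 | 52.685 | 3.411667 | 1992-08-16 | 1 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 2 | 3791 | 0 | 1992 | 8 | 16 |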


In [7]:
# Columns that we want to describe
cols_to_describe = ['totallitter', 'distance', 'wingspread', 'year', 'month', 'day']

def describe_columns(df, cols):
    for col in cols:
        print(f"Column: {col}")
        print(df[col].describe())  # Get the usual describe() output

        # Add median
        median = df[col].median()
        print(f"Median: {median}")

        # Add mode (handling potential multiple modes)
        mode = df[col].mode()
        if len(mode) == 1:
            print(f"Mode: {mode[0]}")
        else:
            print(f"Modes: {', '.join(map(str, mode.tolist()))}")

        print("\n")

describe_columns(df, cols_to_describe)

Column: totallitter
count    4307.000000
mean        2.451823
std         6.979546
min         0.000000
25%         0.000000
50%         1.000000
75%         3.000000
max       271.000000
Name: totallitter, dtype: float64
Median: 1.0
Mode: 0


Column: distance
count     4307.000000
mean      2071.758068
std       2212.877810
min          0.000000
25%          0.000000
50%       2951.000000
75%       3696.000000
max      13208.000000
Name: distance, dtype: float64
Median: 2951.0
Mode: 0


Column: wingspread
count    4307.000000
mean        3.038078
std         4.618863
min         0.000000
25%         0.000000
50%         4.000000
75%         4.000000
max        22.000000
Name: wingspread, dtype: float64
Median: 4.0
Mode: 4


Column: year
count    4307.000000
mean     2007.981658
std         7.048329
min      1992.000000
25%      2005.000000
50%      2011.000000
75%      2013.000000
max      2015.000000
Name: year, dtype: float64
Median: 2011.0
Mode: 2011


Column: month
count    4307.0

## AI Insights from Descriptive Statistics (Numerical)

**1. Marine Litter Abundance ('totallitter'):**

* The mean litter count per survey is 2.45, with a high standard deviation of 6.98, indicating significant variability.
* A median of 1 and mode of 0 suggest that most surveys recorded very low or no litter.
* The maximum count of 271 highlights the presence of extreme accumulation events, contributing to data skewness.

**2. Survey Distance ('distance'):**

* The mean survey distance is 2071.76, with a large standard deviation of 2212.88, reflecting substantial variation in survey effort.
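* The median distance of 2951 exceeds the mean, and both the mode and the 25th percentile are 0, so at least a quarter of the records have a distance of 0.
* These zeros pull the mean below the median (a negative skew). They may represent unrecorded distances rather than genuine zero-distance surveys, which should be checked before interpreting the shape of the distribution.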

**3. Wingspread Measurements ('wingspread'):**

* The mean wingspread is 3.04 (std 4.62), showing considerable variability.
* The median and mode of 4 indicate a common measurement value.
* The 25th percentile of 0 shows that at least a quarter of the records have a wingspread of 0, which suggests that wingspread was not recorded for a significant proportion of surveys.

**4. Temporal Survey Distribution (Year, Month, Day):**

* Surveys span 1992-2015 (mean 2008, std 7.05), showing a wide temporal range.
* The median and mode year of 2011 suggest a concentration of survey activity around this year.
* Surveys are spread across the year (mean month 6.88), but not evenly: the median and mode month of 8 (August) indicate peak survey activity during this month.
* The mean day is 17, with a median of 18 and modes of 20 and 22, indicating a slight tendency for surveys in the latter half of the month.

**General Observations:**

* High standard deviations across 'totallitter', 'distance', and 'wingspread' suggest substantial data variability, warranting further investigation into influencing factors.
* Skewness in 'totallitter' and 'distance' suggests the presence of outliers and non-normal distributions, requiring appropriate statistical handling.
* Temporal data, including medians and modes, provides valuable context for understanding long-term and seasonal trends in marine litter distribution.
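The skewness and the share of zero values noted above can be checked directly rather than inferred from the mean and median alone. The sketch below uses a hypothetical Series of distances, with zeros standing in for values that were not recorded; the same calls could be applied to a column such as `df['distance']`.

```python
import pandas as pd

# Hypothetical distances; the zeros stand in for unrecorded values
distance = pd.Series([0, 0, 0, 2951, 3624, 3696, 3794, 3918])

# Fraction of rows that are exactly 0
zero_share = (distance == 0).mean()

# Skewness with and without the zero rows (negative = longer left tail)
skew_all = distance.skew()
skew_nonzero = distance[distance > 0].skew()

print(f"share of zeros: {zero_share:.3f}")
print(f"skew (all rows): {skew_all:.3f}")
print(f"skew (non-zero rows): {skew_nonzero:.3f}")
```

Comparing the two skewness values shows how much of the apparent shape is driven by the zero rows, which helps decide whether to treat them as missing data.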

#### Contextualising the AI summary 
While the descriptive statistics provide a quantitative overview, it is crucial to acknowledge the inherent limitations and potential biases in the data. The observed variability in litter counts and survey distances may reflect not only genuine environmental fluctuations but also variations in survey methodologies, environmental conditions, and observer biases. Furthermore, the temporal distribution of surveys is uneven (half of the records fall between 2005 and 2013), which might reflect logistical constraints and funding cycles and could affect the representativeness of the data. Therefore, these statistics must be interpreted with an understanding of the data collection process and its limitations.


### 1.3 Statistical Summary for Categorical Values

**Objective:** To perform a descriptive analysis of categorical variables within the dataset.

**Methodology:**
- Utilise frequency counts to determine the occurrence of each unique value within selected categorical columns.
- Examine the cardinality of these columns, which refers to the number of unique categories present.

**Purpose:**
- To provide a comprehensive understanding of the distribution and diversity of categorical variables.
- To contribute to a broader understanding of the dataset's composition.

#### 1.3.1 Mode of Categorical Values

In [None]:
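cols_for_mode = ['year', 'month', 'day', 'survey', 'cruise', 'area', 'station']

def find_modes(df, cols_for_mode):
    # Print the most frequent value(s) of each listed column
    for col in cols_for_mode:
        print(f"{col}: {df[col].mode().tolist()}")

find_modes(df, cols_for_mode)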

#### 1.3.2 Using value_counts() to examine frequency of each unique value in select columns.

Categorical data, representing distinct groups or classifications, requires different statistical approaches compared to numerical data. We will utilise frequency counts to determine how often each unique value occurs within selected columns.

In [8]:
def get_value_counts(data_frame):
    """
    This function shows the frequency of each category for the following columns:
    1. year, month, day
    2. survey
    3. cruise
    4. area
    5. station

    It provides the count of each unique category for the specified columns.

    Args: data_frame: The data frame to inspect.
    """

    # Create a list of the columns to inspect
    columns_to_check = ['year', 'month', 'day', 'survey', 'cruise', 'area', 'station']

    # For every col in the list, print its title and its value counts
    for col in columns_to_check:
        print(f"\n Value count for {col}:")
        print(data_frame[col].value_counts())

get_value_counts(df)


 Value count for year:
year
2011    640
2010    570
2014    500
2013    484
2012    416
2015    277
2009    163
1994    150
1992    149
1993    146
2008    111
2000     92
2005     86
1999     74
1998     74
1997     74
1995     74
1996     72
2007     31
2003     28
2006     27
2004     25
2001     22
2002     22
Name: count, dtype: int64

 Value count for month:
month
8     1085
9      811
3      782
7      616
11     321
2      296
6      192
10     125
4       59
12      19
5        1
Name: count, dtype: int64

 Value count for day:
day
20    219
22    219
17    197
18    190
19    190
21    189
28    173
25    172
23    169
26    168
16    164
27    161
24    159
13    136
15    134
1     128
12    119
14    117
2     111
30    110
9     110
29    110
3     107
8     102
7     101
5     101
6      99
10     98
4      97
11     97
31     60
Name: count, dtype: int64

 Value count for survey:
survey
IBTS                 1575
NWGFS                 566
Q1SW_with Blinder     496
7DBTS

#### **AI Insights for Value Counts**
   **Year Distribution**
  - The data spans several years, with 2011 having the highest count (640) and 2001 and 2002 the lowest (22 each).
  - Counts do not fall steadily going back in time: 1992-1994 have around 150 records each, whereas 2001-2007 have only 22-31 records each, apart from 2005 (86).
  - The years 2011, 2010, and 2014 dominate, indicating that more data was collected in those years.

  **Survey Distribution**
  - **IBTS** survey has the highest count (1575), significantly outpacing all other surveys.
  - The next largest survey is **NWGFS** (566), followed by **Q1SW_with Blinder** (496), suggesting these surveys might be more extensive or more frequently conducted.
  - Other surveys, such as **MEMFISH** (95) and **Q1SW** (138), have significantly lower counts, indicating they might be more specialised or less frequent.

  **Cruise Distribution**
  - **CEND 4/15** has the highest cruise count (277), followed by other CEND series cruises (e.g., **Cend04/14**, **Cend5/11**), which seem to be relatively frequent (counts between 90 and 200).
  - Several cruises with **CIRO** prefixes (e.g., **CIRO 6/00**, **CIRO 11/94**) have counts of 75, indicating they were conducted on a few key occasions.
  - Some cruises like **CSEMP2008** (37) and **CSEMP2007** (31) show smaller numbers, likely due to being more specific or infrequent surveys.

  **Area Distribution**
  - Two main areas: **Greater North Sea** (2241) and **Celtic Seas** (1792).
  - **Greater North Sea** has a significantly higher count, indicating it may be a more frequently surveyed or larger region compared to the **Celtic Seas**.
  
  **Station Distribution**
  - Station values are widely distributed:
    - **Station 0** has the highest count (1031). This may mark a primary location, or it may be a placeholder used where no station number was recorded; this should be checked against the source data.
    - Other stations have significantly lower counts, with some stations (like 290, 142, 292) having only 1 count, suggesting they represent rare or specific locations.


#### **Observations**
From the analysis, it's clear that data collection is concentrated in the later years of the series, particularly 2010-2014. The dominance of the **IBTS** survey suggests it's the primary source of the data, with a smaller set of other surveys contributing. The variety in the number of cruises and the regions surveyed shows that the data may come from both regular, extensive surveys (Greater North Sea) and more targeted studies in specific areas. Station 0 accounts for the most records, although it remains to be confirmed whether it is a single central location or a placeholder value, while the other stations each contribute far fewer records.
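To put a number on the 2010-2014 concentration, the counts can be converted to shares of the total. The sketch below re-enters the year counts printed by `value_counts()` above; on the real DataFrame, `df['year'].value_counts(normalize=True)` should give the same shares directly.

```python
import pandas as pd

# Year counts copied from the value_counts() output above
year_counts = pd.Series({
    2011: 640, 2010: 570, 2014: 500, 2013: 484, 2012: 416, 2015: 277,
    2009: 163, 1994: 150, 1992: 149, 1993: 146, 2008: 111, 2000: 92,
    2005: 86, 1999: 74, 1998: 74, 1997: 74, 1995: 74, 1996: 72,
    2007: 31, 2003: 28, 2006: 27, 2004: 25, 2001: 22, 2002: 22,
})

# The counts add up to the 4307 rows reported by describe()
total = year_counts.sum()

# Share of all records contributed by each year
share = year_counts / total

# Combined share of the five busiest consecutive years
share_2010_2014 = share.loc[[2010, 2011, 2012, 2013, 2014]].sum()
print(f"{share_2010_2014:.1%}")
```

Roughly three in five records come from these five years, so any pooled statistic mostly describes that period.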


#### 1.3.3 Using nunique() to Check Cardinality (Number of Unique Values in a Specific Column)

In [9]:
def get_cardinality(data_frame):
    """
    This function shows the count of distinct categories for the following columns:
    1. year, month, day
    2. survey
    3. cruise
    4. area
    5. station

    Args: data_frame: The data frame to inspect.
    """

    # Create a list of the columns to inspect
    cols = ['year', 'month', 'day', 'survey', 'cruise', 'area', 'station']

    # For every col in the list, print its name and its number of unique values
    for col in cols:
        unique_count = data_frame[col].nunique()
        print(f"Number of unique {col}s: {unique_count}")

get_cardinality(df)
       

Number of unique years: 24
Number of unique months: 11
Number of unique days: 31
Number of unique surveys: 9
Number of unique cruises: 54
Number of unique areas: 4
Number of unique stations: 320


In [10]:
# list of unique areas
unique_areas = df['area'].unique()
print(f"\nUnique areas: {unique_areas}")


Unique areas: ['Greater North Sea' 'Celtic Seas' 'Celtic Seas, Greater North Sea'
 'Unknown']
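Two of the four labels are not single regions, so it may help to isolate those rows before comparing areas. The sketch below uses the four labels printed above, but the mix of rows is hypothetical, as is the choice of which labels count as primary.

```python
import pandas as pd

# The four labels reported by unique(); the row mix is illustrative only
area = pd.Series([
    "Greater North Sea", "Celtic Seas", "Greater North Sea",
    "Celtic Seas, Greater North Sea", "Unknown", "Celtic Seas",
])

# Assumed set of labels that map to exactly one region
primary = ["Greater North Sea", "Celtic Seas"]

# Number of distinct labels, as counted by nunique()
n_labels = area.nunique()

# Rows whose label is combined or unknown
ambiguous = area[~area.isin(primary)]
print(n_labels, len(ambiguous))
```

Whether the ambiguous rows are dropped, reassigned or kept as their own group is an analysis decision; the sketch only shows how to find them.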


#### **AI Insights for Unique Category Counts**
**Year Distribution**
- The data spans across **24 unique years**, indicating a broad temporal range of data collection.
- The presence of multiple years reflects long-term monitoring, with a focus on both recent and older data points.

**Survey Distribution**
- **9 unique surveys** are represented, showcasing a diverse range of research efforts.
- The presence of multiple surveys suggests a variety of research methodologies and objectives, capturing different aspects of the dataset.

**Cruise Distribution**
- The data covers **54 unique cruises**, indicating a diverse set of research trips or expeditions.
- The variety in cruises suggests both extensive and more specific sampling events across various periods.

**Area Distribution**
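- The data contains **4 unique area labels**: two primary regions (Greater North Sea and Celtic Seas), a combined label ('Celtic Seas, Greater North Sea'), and 'Unknown'.
- The two primary regions may have distinct environmental conditions, offering opportunities for comparison and regional analysis, once the combined and unknown labels have been handled.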

**Station Distribution**
- The dataset includes data from **320 unique stations**, reflecting a detailed geographical distribution.
- The large number of stations suggests a comprehensive sampling effort, providing rich spatial data for analysis.

#### **Observations**
- The data spans a significant time period with 24 years of collected data.
- The variety in the number of surveys, cruises, and stations shows a comprehensive and diverse data collection effort, capturing both broad trends and specific research events.
- The records fall mainly into two primary areas (with a combined label and an 'Unknown' label making up the four area values), while the extensive number of stations offers a high level of detail and spatial resolution in the dataset.


### 1.4 Additional Explorations

variability, outliers, discussion of data limitations

# Section 2 : Univariate Analysis

### 2.1 Intro to Univariate Analysis

why univariate analysis for distribution
why distributions
why chose to look at x vars

### 2.2 Analysis of Numerical Values

### 2.3 Analysis of Categorical Values

### 2.4 Visualisation Insights & Evaluation

# Section 3 : Multivariate Analysis

### 3.1 Intro to Multivariate Analysis

### 3.2 Categorical vs. Categorical

### 3.3 Categorical vs. Numerical

### 3.4 Numerical vs. Numerical

### 3.5 Visualisation Insights & Evaluation

# Section 4 : Insights and Interpretations

### 4.1 Summary of Key Findings

### 4.2 Discussion of Implications