### Using value_counts() to Examine the Frequency of Each Unique Value in Selected Columns

In [None]:
def get_value_counts(data_frame):
    """
    Print the frequency of each unique value in the following columns:
    1. year
    2. survey
    3. cruise
    4. area
    5. station

    Args:
        data_frame: The data frame to inspect.
    """
    # Columns to inspect
    columns_to_check = ['year', 'survey', 'cruise', 'area', 'station']

    # For every column in the list, print a title followed by its value counts
    for col in columns_to_check:
        print(f"\nValue counts for {col}:")
        print(data_frame[col].value_counts())
    
get_value_counts(raw_df)

#### **AI Insights for Value Counts**
  **Year Distribution**
  - The data spans several years; 2011 has the highest count (640) and 2002 the lowest (22).
  - Counts generally fall off further back in time.
  - 2011, 2010, and 2014 have the most records.
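
To read the year counts as a timeline rather than as a frequency ranking, the result of `value_counts()` can be sorted by its index. A minimal sketch, assuming a `year` column like the one in `raw_df`; the `toy_df` values are made up for illustration only.

```python
import pandas as pd

# Hypothetical toy data standing in for raw_df; the values are illustrative only.
toy_df = pd.DataFrame({"year": [2011, 2011, 2011, 2010, 2010, 2002]})

# value_counts() orders by frequency; sort_index() reorders the result by year.
year_counts = toy_df["year"].value_counts().sort_index()
print(year_counts)
```

Sorting by year makes gaps and drop-offs in the record count easier to spot than the default frequency ordering.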

  **Survey Distribution**
  - The **IBTS** survey has the highest count (1575), nearly three times that of the next largest survey.
  - The next largest is **NWGFS** (566), followed by **Q1SW_with Blinder** (496), suggesting these surveys may be more extensive or more frequently conducted.
  - Other surveys, such as **Q1SW** (138) and **MEMFISH** (95), have much lower counts and may be more specialised or less frequent.

  **Cruise Distribution**
  - **CEND 4/15** has the highest cruise count (277), followed by other CEND series cruises (e.g., **Cend04/14**, **Cend5/11**), which seem to be relatively frequent (counts between 90 and 200).
  - Several cruises with **CIRO** prefixes (e.g., **CIRO 6/00**, **CIRO 11/94**) have 75 records each.
  - Some cruises, such as **CSEMP2008** (37) and **CSEMP2007** (31), have fewer records, possibly because they were more targeted or shorter.

  **Area Distribution**
  - Two main areas: **Greater North Sea** (2241) and **Celtic Seas** (1792).
  - **Greater North Sea** has about 25% more records than **Celtic Seas**, which may mean it is surveyed more often or covers a larger region.
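
Raw counts are easier to compare as proportions. A small sketch, assuming an `area` column as in `raw_df`; the `toy_df` rows are invented and do not reproduce the real counts.

```python
import pandas as pd

# Hypothetical toy data; the 3:2 split is illustrative, not the real distribution.
toy_df = pd.DataFrame(
    {"area": ["Greater North Sea"] * 3 + ["Celtic Seas"] * 2}
)

# normalize=True returns each value's share of the rows instead of its count.
area_share = toy_df["area"].value_counts(normalize=True)
print(area_share.round(3))
```

Applied to `raw_df["area"]`, the same call would give the fraction of records per area directly.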
  
  **Station Distribution**
  - Station values are widely distributed:
    - **Station 0** has the highest count (1031). It may be a primary location, or `0` may be a default value used when no station was recorded; this is worth checking.
    - Other stations have much lower counts, and some (such as 290, 142, and 292) appear only once, suggesting rare or one-off sampling locations.
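
The long tail of the station counts can be summarised numerically. A sketch, assuming a `station` column as in `raw_df`; the station IDs and counts in `toy_df` are made up.

```python
import pandas as pd

# Hypothetical toy data; station IDs and frequencies are illustrative only.
toy_df = pd.DataFrame({"station": [0, 0, 0, 0, 12, 12, 290, 142, 292]})

station_counts = toy_df["station"].value_counts()

# Stations that appear exactly once.
singletons = station_counts[station_counts == 1]
print(f"Stations with one record: {len(singletons)} of {len(station_counts)}")

# Share of all rows contributed by the most frequent station.
top_share = station_counts.iloc[0] / station_counts.sum()
print(f"Share of rows from station {station_counts.index[0]}: {top_share:.1%}")
```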


#### **Observations**
From the analysis, data collection is concentrated in recent years, particularly 2010-2014. The **IBTS** survey is the largest single source (1575 records), with a smaller set of other surveys contributing. Cruise counts vary widely, from 277 records for **CEND 4/15** down to a few dozen for the **CSEMP** cruises, and both main regions are well represented, with the Greater North Sea ahead of the Celtic Seas. Station 0 accounts for far more records than any other station (1031); whether it is a genuine central location or a default value is worth checking, since many other stations appear only once.


### Using nunique() to Check Cardinality (Number of Unique Values in Specified Columns)

In [None]:
def get_cardinality(data_frame):
    """
    Print the number of distinct values in the following columns:
    1. year
    2. survey
    3. cruise
    4. area
    5. station

    Args:
        data_frame: The data frame to inspect.
    """
    # Columns to inspect
    cols = ['year', 'survey', 'cruise', 'area', 'station']

    # For every column in the list, print its number of unique values
    for col in cols:
        unique_count = data_frame[col].nunique()
        print(f"Number of unique {col}s: {unique_count}")
  
    
get_cardinality(raw_df)
       

Number of unique years: 24
Number of unique surveys: 9
Number of unique cruises: 54
Number of unique areas: 3
Number of unique stations: 320
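
The same counts can be obtained without a loop, because `DataFrame.nunique()` returns one value per column. The sketch below also shows that missing values are excluded unless `dropna=False` is passed. The `toy_df` data is hypothetical and only illustrates the behaviour.

```python
import pandas as pd

# Hypothetical toy data with one missing area label.
toy_df = pd.DataFrame({
    "year": [2010, 2011, 2011, 2014],
    "area": ["Greater North Sea", "Celtic Seas", None, "Celtic Seas"],
})
cols = ["year", "area"]

# One call returns a Series of unique counts, indexed by column name.
cardinality = toy_df[cols].nunique()

# With dropna=False, a missing value counts as its own category.
cardinality_with_na = toy_df[cols].nunique(dropna=False)

print(cardinality)
print(cardinality_with_na)
```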


#### **AI Insights for Unique Category Counts**
**Year Distribution**
- The data spans **24 unique years**, a broad temporal range.
- This is consistent with long-term monitoring rather than a single campaign.

**Survey Distribution**
- **9 unique surveys** are represented, showcasing a diverse range of research efforts.
- The presence of multiple surveys suggests a variety of research methodologies and objectives, capturing different aspects of the dataset.

**Cruise Distribution**
- The data covers **54 unique cruises**, indicating a diverse set of research trips or expeditions.
- The variety in cruises suggests both extensive and more specific sampling events across various periods.

**Area Distribution**
- The output above reports **3 unique areas**, while the value counts list two main regions (**Greater North Sea** and **Celtic Seas**); the third label is worth inspecting.
- The two main areas may have distinct environmental conditions, offering opportunities for comparison and regional analysis.
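
The `nunique()` output earlier in this section lists 3 areas, whereas the value-count insights name two main ones. One way to find out what the extra label is would be to list the distinct values and check whether any differ only by whitespace. A sketch under that assumption; the stray label in `toy_df` is invented, and the actual cause in `raw_df` may be different.

```python
import pandas as pd

# Hypothetical toy data: "Celtic Seas " carries a trailing space on purpose.
toy_df = pd.DataFrame({
    "area": ["Greater North Sea", "Celtic Seas", "Celtic Seas ", "Greater North Sea"]
})

# repr() makes leading or trailing whitespace visible in the printed labels.
raw_labels = sorted(toy_df["area"].dropna().unique())
print([repr(label) for label in raw_labels])

# Stripping whitespace shows whether the extra category is a formatting variant.
cleaned = toy_df["area"].str.strip()
print(cleaned.nunique())
```

If the count stays at 3 after cleaning, the third label is a genuinely separate area rather than a formatting variant.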

**Station Distribution**
- The dataset includes data from **320 unique stations**, reflecting a detailed geographical distribution.
- The large number of stations suggests a comprehensive sampling effort, providing rich spatial data for analysis.

#### **Observations**
- The data covers 24 distinct years.
- The number of surveys (9), cruises (54), and stations (320) points to a varied data collection effort that captures both broad trends and specific research events.
- Records concentrate in two primary areas, although `nunique()` reports 3 distinct area labels, and the large number of stations gives the dataset fine spatial resolution.
