# IFQ619 Assignment 1 – Part A
**Name:** [Your Name]  
**Student ID:** [Your Student ID]  

---

## Introduction
This notebook demonstrates foundational data analytics techniques using the OSMI 2016 mental health survey dataset (sample).  
It addresses **Criterion 1 (Verification)** and **Criterion 2 (Basic techniques)** through loading, verifying, cleaning, analysing, and visualising survey data.

---


## Step 1 — Load the dataset
Load the CSV dataset into a pandas DataFrame. This allows us to inspect, clean, and analyse the survey responses.

In [None]:

import pandas as pd

# Load dataset
df = pd.read_csv("osmi_2016.csv")

# Preview shape and first rows
print("Shape (rows, columns):", df.shape)
df.head()


## Step 2 — Verification and cleaning
We verify the dataset by checking data types, missing values, and duplicates. Since all fields are categorical survey responses, we apply light cleaning:
- Normalise Yes/No/Maybe style answers
- Leave graded responses unchanged
- Prepare for analysis and plotting

In [None]:

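import numpy as np

# Data types, missing values, and duplicates
print("Data types:\n", df.dtypes)
missing = df.isna().sum()
duplicates = df.duplicated().sum()

print("\nMissing values per column:\n", missing)
print("\nDuplicate rows:", duplicates)

# Simple normalisation for Yes/No/Maybe
def norm_yn(x):
    # Keep missing values as NaN
    if pd.isna(x):
        return np.nan
    # Compare on a trimmed, lower-cased string so booleans (True/False) are handled too
    s = str(x).strip().lower()
    if s in ["yes", "y", "true", "t"]:
        return "Yes"
    if s in ["no", "n", "false", "f"]:
        return "No"
    if s in ["maybe", "not sure", "unsure", "don't know", "do not know", "dk", "na"]:
        return "Unsure / Maybe"
    # Any other answer (e.g. a graded response) is left unchanged apart from surrounding whitespace
    return x.strip() if isinstance(x, str) else x

# Work on a copy so the raw DataFrame stays available for comparison
clean = df.copy()
for col in clean.columns:
    clean[col] = clean[col].apply(norm_yn)

clean.head()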
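
As an alternative to applying a Python function cell by cell, the same normalisation can be sketched with pandas string methods and a lookup table. This is only an illustration on toy data: `CANONICAL`, `normalise_series`, and the sample answers are hypothetical names and values, not part of the notebook or the survey file, and the lookup covers only a subset of the variants handled above.

```python
import pandas as pd

# Hypothetical lookup table: lower-cased raw answer -> canonical label
CANONICAL = {
    "yes": "Yes", "y": "Yes",
    "no": "No", "n": "No",
    "maybe": "Unsure / Maybe",
    "not sure": "Unsure / Maybe",
    "don't know": "Unsure / Maybe",
}

def normalise_series(s: pd.Series) -> pd.Series:
    """Map Yes/No/Maybe variants to canonical labels; keep other answers as they are."""
    trimmed = s.str.strip()                   # missing values stay missing
    mapped = trimmed.str.lower().map(CANONICAL)  # answers without a match become NaN here
    return mapped.fillna(trimmed)             # fall back to the trimmed original answer

# Toy data, not taken from the survey file
sample = pd.Series([" yes", "No ", "MAYBE", "Somewhat easy", None])
result = normalise_series(sample)
print(result.tolist())
```

Graded answers such as "Somewhat easy" pass through untouched, which matches the stated intent of leaving graded responses unchanged.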


## Step 3 — Descriptive statistics
We calculate frequency tables (counts and percentages) for each question. This provides an overview of response patterns.

In [None]:

# Frequency tables for all columns
for col in clean.columns:
    vc = clean[col].value_counts(dropna=False).to_frame(name="count")
    vc["percent"] = (vc["count"] / len(clean) * 100).round(1)
    print(f"\nFrequency table – {col}:\n", vc)
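
When a question has many missing answers, it can help to report the percentage of all respondents alongside the percentage of those who actually answered. The sketch below shows one way to do that on toy data; `freq_table` and the `valid_percent` column are hypothetical names introduced for illustration, not part of the notebook.

```python
import pandas as pd

def freq_table(s: pd.Series) -> pd.DataFrame:
    """Counts plus two percentage columns for one question."""
    counts = s.value_counts(dropna=False)            # missing answers get their own row
    percent = counts / len(s) * 100                  # share of all respondents
    valid_percent = counts / s.notna().sum() * 100   # share of respondents who answered
    valid_percent[counts.index.isna()] = float("nan")  # not meaningful for the missing row
    return pd.DataFrame({
        "count": counts,
        "percent": percent.round(1),
        "valid_percent": valid_percent.round(1),
    })

# Toy data, not taken from the survey file
answers = pd.Series(["Yes", "No", "Yes", None, "Yes"])
table = freq_table(answers)
print(table)
```

The `percent` column corresponds to the calculation in the cell above (count divided by the number of rows); `valid_percent` is an optional extra.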


## Step 4 — Visualisations
We visualise response distributions with simple matplotlib bar charts and tabulate selected relationships:
- One bar chart per question
- Row-normalised cross-tabs (printed as tables) for selected pairs of questions

Plots are unstyled as required (default matplotlib).

In [None]:

import matplotlib.pyplot as plt

# Plot bar chart for each column
for col in clean.columns:
    plt.figure()
    clean[col].value_counts(dropna=False).plot(kind="bar")
    plt.title(f"Distribution – {col}")
    plt.xlabel(col)
    plt.ylabel("Count")
    plt.tight_layout()
    plt.show()


In [None]:

# Example cross-tabs: willingness vs perceived consequences.
# Pairs whose columns are not in the dataset are reported and skipped.

pairs = [
    ("willing_to_discuss_with_supervisor", "negative_consequence_supervisor"),
    ("willing_to_discuss_with_coworkers", "negative_consequence_coworkers"),
    ("benefits_mental_health_coverage", "aware_of_mental_health_resources")
]

for a, b in pairs:
    if a in clean.columns and b in clean.columns:
        # normalize="index" makes each row sum to 1 (shares within each answer to `a`)
        ct = pd.crosstab(clean[a], clean[b], normalize="index").round(2)
        print(f"\nCross-tab (row-normalised) – {a} vs {b}:\n", ct)
    else:
        print(f"\nSkipped {a} vs {b}: column(s) not found in the dataset")
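
A row-normalised cross-tab can also be drawn as a stacked bar chart with default matplotlib styling, which may make the comparison easier to read than the printed table. The following is a minimal sketch on toy data; the column names `willing` and `consequence` are hypothetical stand-ins, not columns from the survey file.

```python
import pandas as pd

# Toy answers, not taken from the survey file; the column names are hypothetical
toy = pd.DataFrame({
    "willing": ["Yes", "Yes", "No", "No", "Yes", "No"],
    "consequence": ["No", "No", "Yes", "Yes", "Yes", "No"],
})

# Each row of the cross-tab sums to 1 because of normalize="index"
ct = pd.crosstab(toy["willing"], toy["consequence"], normalize="index")

# One bar per row category, split into segments by column category
ax = ct.plot(kind="bar", stacked=True)
ax.set_title("Perceived consequence by willingness (toy data)")
ax.set_xlabel("willing")
ax.set_ylabel("Share of row")
ax.figure.tight_layout()
```

In a notebook the figure is rendered when the cell finishes; in a plain script, a call to `matplotlib.pyplot.show()` would be needed to display it.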


## Step 5 — Notes for verification
- I can demonstrate dataset loading, inspection, cleaning, and analysis steps live in the verification session.
- I normalised Yes/No/Maybe style responses for consistency.
- I created frequency tables and visualisations to address Criteria 1 and 2.
- The cleaned data (`clean`) can be exported to CSV with `DataFrame.to_csv` for reproducibility.
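
Saving the cleaned data could look like the sketch below, which writes a toy DataFrame to a temporary folder and reads it back as a quick round-trip check. The file name `osmi_2016_clean.csv` and the `clean_demo` DataFrame are assumptions made for illustration; in the notebook, `clean` from Step 2 would take the place of `clean_demo`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for the cleaned DataFrame (`clean`) built in Step 2
clean_demo = pd.DataFrame({
    "q1": ["Yes", "No", None],
    "q2": ["Unsure / Maybe", "Yes", "No"],
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "osmi_2016_clean.csv"  # hypothetical file name
    # index=False avoids writing the row index as an extra unnamed column
    clean_demo.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# Round-trip check: same shape, same columns, same number of missing answers
print(reloaded.shape, list(reloaded.columns), int(reloaded["q1"].isna().sum()))
```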

---
**End of Part A**