
In [8]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Analytical Software Technologies — Coursework Notebook
### Code Marks = 40/100, Report Marks = 60/100

---
## Introduction

This coursework aims to give you experience working on a real-world project. You are provided with a recent dataset and guiding questions. Informed by your analysis, you will make decisions and ultimately answer the business questions and present your recommendations.

This coursework is **not** just about filling out the missing code snippets and providing broad answers. It is intentionally designed to lead you through the often-unclear stages of data analysis: starting with broad questions, formulating your own deeper lines of inquiry, and then seeking patterns and insights to make evidence-based decisions and conclusions. **This spirit of independent inquiry is central to the coursework and will be rewarded**.

As part of this coursework, you will discuss your implementation details, insights, and the justification for your decisions in your report. It is therefore advisable to look beyond the guide questions and explore the dataset so you can provide well-supported decisions and recommendations.

---
## Scenario

You have been hired as a Data Scientist by a UK university that plans to launch a new degree programme OR revamp an existing degree programme. The university seeks evidence on commercial feasibility, student demand, and the competitive landscape. As part of this project, you will analyse the [National Student Survey (NSS)](https://www.thestudentsurvey.com/) 2025 dataset (survey conducted in April 2025). Working from broad questions set by the Executive Board, your task is to explore the dataset, generate insights, and write an executive summary (your report) that contains your reflections and recommendation for the new OR existing degree programme based on evidence.

This is your first role as a Data Scientist. Use this project to demonstrate rigorous, independent inquiry, justify your approach, look beyond the guide questions, and present an analysis backed by transparent, reproducible evidence.

---

## Code
This notebook provides skeleton code for you to modify, extend, or replace. It is designed to run on the Google Colab free-tier.

### How to use this notebook

1. **Set up**: Ensure the dataset archive `nss2025.zip` is available on your project path. You can also download `nss2025.zip` from the coursework page of the AST unit on Moodle.
2. **Environment**: Use Python with pandas, numpy and any other necessary libraries installed. You are free to use any libraries of your own, **as long as they are available on Google Colab**.    
3. **Execution order**: Run from top to bottom. Tasks are numbered and output of some tasks can be utilised in other tasks.
4. **Reproducibility**: Avoid editing data in place without keeping a copy; prefer creating tidy tables.
5. **Your answers**: Your answers are saved at the end of each task, in a specified format, in a dictionary named `all_answers`. At the end of each task, there is code that appends your result to `all_answers`. For each task, you must save your answer at the required location (the line after the comment *"STORE YOUR RESULTS HERE"*) in the specified data structure. A helper method, `add_to_answer()`, appends your answer for each task to `all_answers`. You are advised not to modify the code that follows the comment *"DO NOT CHANGE"*.

---
##  Tasks
There are 16 tasks in total. Each task states the question. Where helpful, comments and examples are provided to guide your analysis.

---
## Dataset

### What is the NSS?  

The **[National Student Survey (NSS)](https://www.thestudentsurvey.com/)** is an annual UK-wide survey of final-year undergraduate students. It asks students 28 questions (some are mandatory, others are not) about their experiences with teaching, assessment, learning resources, academic support, organisation, and overall satisfaction with their course.

The survey is run across all publicly funded universities and many colleges in the UK. Results are used by universities, policymakers, and prospective students to understand how satisfied students are with different subjects and providers.

The NSS 2025 dataset is available [online](https://www.officeforstudents.org.uk/data-and-analysis/national-student-survey-data/download-the-nss-data/). For this coursework, we have downloaded the data and pre-processed a subset so you can use it easily. If you are curious, the schema for the original, larger NSS dataset is available [here](https://www.officeforstudents.org.uk/data-and-analysis/national-student-survey-data/about-the-nss-data/). The schema for the subset of the NSS 2025 data used in this coursework is included in a section below.


---

### Why this matters to your employer  

Your employer is considering launching/revamping a degree programme and wants evidence on its potential success. The NSS dataset provides valuable insights, some examples of which may include:  

- **Commercial feasibility** — by identifying subjects with high student demand (large populations)  
- **Student satisfaction** — by highlighting areas where student responses/scores are consistently positive or negative  
- **Competitive landscape** — by comparing your university against peer institutions on specific subjects  
- **Strategic opportunities** — by finding subjects where student demand is high but satisfaction is relatively low, suggesting room for improvement and market entry  

As a data scientist, your analysis will help provide an **evidence-based recommendation** on how to maximise both demand and student satisfaction. Your executive summary should make a clear, evidence-based case for the programme.

---

### NSS 2025 dataset schema for this coursework

Each row in the dataset represents a provider’s response data for a subject, broken down by study characteristics and question.


| Column         | Data Type | Description                                                                 |
|----------------|-----------|-----------------------------------------------------------------------------|
| `ukprn`        | int       | Unique UK Provider Reference Number identifying the university/college.     |
| `provider`     | object    | Name of the higher education provider.                                      |
| `cah1_code`    | object    | The subject code for the broad subject (could be thought of as the *academic discipline*). |
| `cah1_subject` | object    | The subject name for the broad subject (could be thought of as the *academic discipline*). |
| `cah2_code`    | object    | The subject code for the sub-categories of a given `cah1_code`. Think of it as a more specific category that narrows down the `cah1_code`. |
| `cah2_subject` | object    | The subject name for the sub-categories of a given `cah1_subject`. Think of it as a more specific category that narrows down the `cah1_subject`. |
| `cah3_code`    | object    | The subject code that further specifies categories within `cah2_code`.      |
| `cah3_subject` | object    | The subject name that further specifies categories within `cah2_subject`.   |
| `question`     | object    | NSS question.                                                               |
| `agree_pct`    | float     | Percentage of students who agreed/strongly agreed (positivity measure) for the NSS question. |
| `agree_pct_sd` | float     | Standard deviation of the `agree_pct` percentage of students who agreed/strongly agreed for a given NSS question. |
| `benchmark`    | float     | Sector-wide benchmark `agree_pct` for comparison.                           |
| `respondents`  | int       | Number of respondents (headcount) who answered a given NSS question.        |
| `population`   | int       | Total student population (i.e., cohort).                                    |

---

### Summary
- `agree_pct` is the main measure used in most analyses. It is the "student satisfaction" score for a given NSS question.  
- `benchmark` allows comparison against the sector average for a given NSS question.  
- `respondents` and `population`, for a given cohort and a given NSS question, are the number of students who responded and the total number of students in the cohort, respectively.

- The `cah*_code` and `cah*_subject` columns follow the [Common Aggregation Hierarchy (CAH)](https://www.hesa.ac.uk/collection/coding-manual-tools/hecoscahdata/cah) classification system. Broadly speaking, `cah1_subject` can be interpreted as the *academic discipline*, which is further narrowed by `cah2_subject` and then `cah3_subject`. For example, if you select `cah1_subject = "social sciences"`, then the `cah2_subject` subcategories are ['sociology, social policy and anthropology', 'health and social care', 'economics', 'politics'], and if you further select `cah2_subject = "health and social care"`, then the `cah3_subject` subcategories (for the selected `cah2_subject`) are ['social work', 'childhood and youth studies', 'health studies']. For some academic disciplines, `cah1_subject`, `cah2_subject` or `cah3_subject` may have a single subject. For example, for the academic discipline "Law", the `cah1_subject`, `cah2_subject` and `cah3_subject` are all "law".
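
The drill-down described above can be sketched on a small toy table. The rows below are hypothetical and only mimic the three CAH levels (they are not taken from `nss2025.csv`), and `subcategories` is an illustrative helper name, not part of the notebook.

```python
import pandas as pd

# Toy rows (hypothetical) that mimic the cah1 -> cah2 -> cah3 hierarchy.
toy = pd.DataFrame({
    "cah1_subject": ["social sciences"] * 4 + ["law"],
    "cah2_subject": ["economics", "politics", "health and social care",
                     "health and social care", "law"],
    "cah3_subject": ["economics", "politics", "social work",
                     "health studies", "law"],
})

def subcategories(df, parent_col, parent_value, child_col):
    """Return the sorted unique child categories found under one parent category."""
    mask = df[parent_col] == parent_value
    return sorted(df.loc[mask, child_col].unique())

cah2_in_social = subcategories(toy, "cah1_subject", "social sciences", "cah2_subject")
cah3_in_health = subcategories(toy, "cah2_subject", "health and social care", "cah3_subject")

print(cah2_in_social)
print(cah3_in_health)
```

The same filter-then-`unique()` pattern should carry over to the `nss` DataFrame once it is loaded, assuming the column names match the schema above.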

---
## IMPORTANT NOTES

#### **1. Calculating NSS Populations**

Please note that the provided NSS dataset for this coursework is a subset of the original NSS 2025 data, and has been processed specifically for this coursework.

Due to the structure of the NSS questions, the process through which the [Office for Students](https://www.officeforstudents.org.uk/) calculates and provides the NSS data, and the pre-processing done for this coursework, the NSS respondent and population numbers may not be exact. For example, for a given `cah1_subject`, if you sum the populations of its subcategories (i.e., the `cah2_subject` or `cah3_subject` populations), the result will not necessarily match the `cah1_subject` population. For this coursework, wherever you are required to sum a population, you must take the **sum of the unique population values**. This provides an indirect measure of the total population that is sufficient for the tasks in this coursework. Please see **Example 1**, which shows the correct way of calculating the unique sum of populations.
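
To illustrate why the unique sum is used, here is a minimal sketch on hypothetical toy data in which each cohort's population is repeated once per NSS question. The values and subject names are made up for illustration. A plain sum counts every repeat, whereas the sum of unique values counts each distinct figure once.

```python
import pandas as pd

# Toy rows (hypothetical values): each population figure appears once per question.
toy = pd.DataFrame({
    "cah1_subject": ["computing"] * 4 + ["law"] * 2,
    "cah3_subject": ["computer science", "computer science",
                     "software engineering", "software engineering",
                     "law", "law"],
    "question": ["Q01", "Q02", "Q01", "Q02", "Q01", "Q02"],
    "population": [120, 120, 80, 80, 200, 200],
})

# Plain sum: every repeated row is counted again.
naive = toy.groupby("cah1_subject")["population"].sum()

# Sum of unique values: each distinct population figure is counted once per group.
unique_sum = toy.groupby("cah1_subject")["population"].agg(
    lambda s: s.drop_duplicates().sum()
)

print(naive.to_dict())
print(unique_sum.to_dict())
```

One consequence of this approach is that two different cohorts that happen to have the same size within a group are also counted once, which is part of why the measure is only indirect.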

#### **2. Your Employer**
This is the university where you work as a data scientist. **You must only use the employer you have been allocated**.

```python
employer_name = "University of Southampton"
```

#### **3. Your Answers**

Your answers are stored in a dictionary, `all_answers`; the answer for each task is added to this dictionary as you work through the notebook. After all tasks, the last code cell dumps it to a JSON file. **Please do not** change the format of your answers or the JSON file in any of the tasks.

**WHILST SOME CHECKING IS PROVIDED BY THE `add_to_answer()` METHOD TO HELP YOU WHEN YOU ADD YOUR ANSWERS TO `all_answers`, IT IS THE STUDENT'S RESPONSIBILITY TO ENSURE THAT THEIR ANSWERS ARE CORRECTLY FORMATTED AS SPECIFIED IN EACH SECTION BEFORE SUBMISSION.**
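
As a rough illustration of what dumping such a dictionary to JSON involves, the sketch below serialises a hypothetical two-entry dictionary with the standard `json` module. The values are made up and `example_answers` is an illustrative name; the notebook's own final cell is not shown in this section and may differ.

```python
import json

# Hypothetical answers in the formats described for Tasks 1 and 2 (values are made up).
example_answers = {1: 12, 2: ["law", 345]}

# Serialise to a JSON string, then parse it back to inspect the round trip.
encoded = json.dumps(example_answers, indent=2)
decoded = json.loads(encoded)

print(encoded)
print(list(decoded.keys()))  # integer dictionary keys come back as strings
```

Because JSON object keys are always strings, the task numbers appear as `"1"`, `"2"`, and so on in the file. This is worth keeping in mind when reviewing the generated file before submission.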

#### **4. Freedom to Code**
You are allowed to change any part of the code in this notebook. It is advisable not to edit the parts of the code marked as "DO NOT CHANGE" or the utility methods in this notebook, but you may still edit them if you deem it necessary, as long as your answers in `all_answers` remain in the correct format as specified in each task.

You are also allowed to use any Python libraries you like, but they **must be** available on Google Colab, so that assessors can run your submitted notebook on Google Colab if necessary.

You are also allowed to add sections / cells to the notebook for any further analyses you may wish to carry out.
  

#### **6. Marking**

The Analytical Software Technologies unit has a single coursework worth 100 marks. The coursework has two parts: the coding tasks in this notebook (40/100 marks) and the report (60/100 marks). Each of the 16 tasks in this notebook has clearly associated marks, which sum to a total of 40 marks. Marks for a given task will be awarded based on the correct implementation and the quality of code. For tasks with incorrect implementation and/or answers, partial or zero marks may be awarded.


#### **7. Submission**

You must submit 3 files on Moodle as follows:

1. This Jupyter Notebook (in .ipynb format) with all the cells executed so all outputs of the cells **must be** displayed and visible in the submitted notebook by default, without the assessors having to run your notebook. All of your developed code must run without any errors or throwing exceptions within the Google Colab environment. This means that any additional Python libraries you would like to use, **must be** available on Google Colab. **If an assessor re-runs your notebook on Google Colab then your code must execute without any errors**. If your code shows errors or does not run properly on Google Colab, then **a penalty of up to 10 marks** could be applied.

2. The generated `all_answers.json` (in .json format), which contains your answers. Each answer in the JSON file **must be** in the correct format as specified for each task. Before submission, you **must** review your JSON file thoroughly to ensure that it accurately reflects your answers and is formatted correctly. If your `all_answers.json` file is not correctly formatted, then **a penalty of up to 5 marks** could be applied.

3. Your report must be in the .pdf format. The requirements for your report are specified in the coursework specification on the Moodle unit.

In [6]:
# @title Section (a): Importing Modules, Initialising Variables and Some Utility Methods

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import zipfile, os, time
import json
from IPython.display import display, HTML, Markdown

# Your allocated employer
employer_name = "University of Southampton"

# Your answers are stored here
all_answers = {}

##------ SKIP THE METHODS BELOW AS THEY ARE NOT PART OF YOUR TASKS------------------

## This is to validate answers that are in a list format
def validate_list_of_lists(data, expected_types, list_length):

    assert isinstance(data, list), "Input data must be a list"
    if list_length!=-1:
      assert len(data) == list_length, f"Expected {list_length} items, got {len(data)}"

    for i, row in enumerate(data):
        assert isinstance(row, list), f"Item {i} is not a list"
        assert len(row) == len(expected_types), f"Item {i} should have {len(expected_types)} elements"

        for j, (value, expected_type) in enumerate(zip(row, expected_types)):
            assert value is not None, f"Item {i}, element {j} is None"
            assert isinstance(value, expected_type), (
                f"Item {i}, element {j} should be of type {expected_type.__name__}"
            )

            # Additional logical checks
            if isinstance(value, str):
                assert value.strip() != "", f"Item {i}, element {j} string is empty or whitespace"
            elif isinstance(value, (float, int)):
                assert value > 0, f"Item {i}, element {j} number must be positive"

# This is to validate answers before adding them.
## THE CHECKS HERE ARE NOT COMPREHENSIVE. IT IS THE STUDENT'S RESPONSIBILITY TO ENSURE THAT EACH ANSWER IS IN THE CORRECT FORMAT
## IF NEEDED, YOU ARE FREE TO EDIT THIS FUNCTION

def add_to_answer(answer_number, answer):
  assert answer is not None, f"answer_{answer_number} is None"
  assert answer_number in np.arange(1,17), f"Invalid answer_number: {answer_number}"

  # Validation checks specific to each task
  if answer_number==1:
    assert isinstance(answer, int), f"answer_{answer_number} is not an integer."
  elif answer_number==2:
    validate_list_of_lists([answer], [str, int], 1)
  elif answer_number in [13,14]:
    assert isinstance(answer, str), f"answer_{answer_number} is not a string."
    assert answer.strip() != "", "String is empty or whitespace only"
  else:
    assert isinstance(answer, list), f"answer_{answer_number} is not a list"
    if answer_number in [3,6,12,15]:
      validate_list_of_lists(answer, [str, float],5)
      answer = [[x[0], float(round(x[1], 2))] for x in answer]
    elif answer_number in [4,7, 11]:
      validate_list_of_lists(answer, [str, str, float], 15)
      answer = [[x[0], x[1], float(round(x[2], 2))] for x in answer]
    elif answer_number in [5,8]:
      validate_list_of_lists(answer, [str, float, float], 5)
      answer = [[x[0], float(round(x[1], 2)), float(round(x[2], 2))] for x in answer]
    elif answer_number in [9]:
      validate_list_of_lists(answer, [str, int],-1)
    elif answer_number in [10]:
      validate_list_of_lists(answer, [str, int],5)
    elif answer_number in [16]:
      validate_list_of_lists(answer, [str, float], -1)
      answer = [[x[0], float(round(x[1], 2))] for x in answer]
    else:
      raise ValueError(f"Invalid answer_number: {answer_number}")

  all_answers[answer_number] = answer


### Section (b): Loading Data

In [9]:
# %%time is a built-in IPython cell magic that reports how long the cell takes to run (https://ipython.readthedocs.io/en/9.2.0/interactive/magics.html)
%%time

dir_path  = '/content/drive/MyDrive/Colab Notebooks/AnalyticSoftwareTechnologies/' # The zip file should be in your working directory.
zip_file = os.path.join(dir_path, "nss2025.zip") #This is the path to your dataset zip file.
csv_file = os.path.join(dir_path, "nss2025.csv") #This is the path to your extracted csv file.

with zipfile.ZipFile(zip_file, 'r') as z: #Extract the csv file
  z.extract("nss2025.csv", path=dir_path)

nss = pd.read_csv(csv_file, low_memory=False) # Load into DataFrame

CPU times: user 393 ms, sys: 85.6 ms, total: 479 ms
Wall time: 2.22 s


### Section (c): Let's have a quick look at the NSS data

In [10]:
%%time
# This utility function displays the given dataset in a readable format
from tabulate import tabulate

def describe_dataframe(df, name="DataFrame"):
    # Basic info
    shape = df.shape
    memory = df.memory_usage(deep=True).sum() / (1024**2)  # MB

    print(f"Summary of `{name}`")
    print("-" * 50)
    print(f"Rows: {shape[0]:,}")
    print(f"Columns: {shape[1]:,}")
    print(f"Memory usage: {memory:.2f} MB")

    # Safe helper to extract example values
    def safe_examples(series):
        try:
            vals = series.dropna().unique()
            # Flatten if elements are list-like or arrays
            vals_flat = []
            for v in vals:
                if isinstance(v, (list, tuple, set)):
                    vals_flat.extend(v)
                else:
                    vals_flat.append(v)
            return [str(v) for v in vals_flat[:3]]
        except Exception:
            return ["<unprintable>"]

    # Column details
    details = pd.DataFrame({
        "DataType": df.dtypes.astype(str),
        "Missing": df.isna().sum(),
        "Unique": df.nunique(dropna=True),
        "Example": [safe_examples(df[col]) for col in df.columns]
    })

    details.reset_index(inplace=True)
    details.rename(columns={"index": "Column"}, inplace=True)

    print("\nColumn Details:")
    print(tabulate(details, headers="keys", tablefmt="github", showindex=False))

# --- Usage ---
describe_dataframe(nss, name="NSS 2025")

nss.head()

Summary of `NSS 2025`
--------------------------------------------------
Rows: 158,700
Columns: 14
Memory usage: 96.01 MB

Column Details:
| Column       | DataType   |   Missing |   Unique | Example                                                                                                                                                                               |
|--------------|------------|-----------|----------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| ukprn        | int64      |         0 |      374 | ['10042570', '10067853', '10007849']                                                                                                                                                  |
| provider     | object     |         0 |      374 | ['AAP Education Limited', 'ACM Guildford Limited', 'Abertay University']                         

Unnamed: 0,ukprn,provider,cah1_code,cah1_subject,cah2_code,cah2_subject,cah3_code,cah3_subject,question,agree_pct,agree_pct_sd,benchmark,respondents,population
0,10042570,AAP Education Limited,CAH11,computing,CAH11-01,computing,CAH11-01-06,computer games and animation,Q01: How good are teaching staff at explaining...,87.4,2.8,87.0,119,141
1,10042570,AAP Education Limited,CAH11,computing,CAH11-01,computing,CAH11-01-06,computer games and animation,Q02: How often do teaching staff make the subj...,80.5,3.6,70.6,118,141
2,10042570,AAP Education Limited,CAH11,computing,CAH11-01,computing,CAH11-01-06,computer games and animation,Q03: How often is the course intellectually st...,77.8,3.5,81.7,117,141
3,10042570,AAP Education Limited,CAH11,computing,CAH11-01,computing,CAH11-01-06,computer games and animation,Q04: How often does your course challenge you ...,84.9,3.2,83.0,119,141
4,10042570,AAP Education Limited,CAH11,computing,CAH11-01,computing,CAH11-01-06,computer games and animation,Q05: To what extent have you had the chance to...,80.5,3.5,80.0,118,141


### Section (d): Further Dataset Visualisation (Optional)

This section is optional and provides further visualisation to help you understand the data. You can uncomment the following code and run it. Please note that it can take up to **5 minutes** to install and run.

In [12]:
%%time
#You need the ydata-profiling module for it so let's install it (in a quiet mode)


!pip install ydata-profiling --exists-action=i --quiet

from ydata_profiling import ProfileReport
profile = ProfileReport(nss, explorative=True)
profile.to_notebook_iframe()




CPU times: user 17.5 s, sys: 1.08 s, total: 18.6 s
Wall time: 32.4 s


### Example 1: Calculating the number of students (`population`) at the University of Southampton

For the purpose of this coursework, this example demonstrates how to correctly calculate the number of students when the dataset is grouped by a particular column (or criterion), resulting in multiple population values (i.e., a list of populations). The total population must be calculated as the **sum of unique values within the list of populations**.

This example shows how to sum the unique populations when grouped for a subject (`cah1_subject`).

In [13]:
#Example 1: Calculating the unique population values per cah1_subject for University of Southampton

## Filtering for University of Southampton
df = nss[nss["provider"]==employer_name]

## Group by 'cah1_subject', summing unique population values, and sorting by sum of unique population
grouped = (
    df
    .groupby('cah1_subject', as_index=False)
    # We create a new column "sum_unique_population" which uses lambda function to sum the unique values (via set() function)
    .agg(sum_unique_population=('population', lambda x: sum(set(x))))
    .sort_values(by='sum_unique_population', ascending=False)
)
print("Total population = ",grouped['sum_unique_population'].sum())
print("grouped.shape = ", grouped.shape)
display(grouped.head())

Total population =  3933
grouped.shape =  (17, 2)


Unnamed: 0,cah1_subject,sum_unique_population
1,business and management,481
5,engineering and technology,443
16,subjects allied to medicine,291
0,biological and sport sciences,289
12,medicine and dentistry,268


### Task 1: How many subjects (`cah3_subject`) are offered by the University of Oxford? [0.5 Mark]


In [18]:
# YOUR CODE HERE

df_oxford = nss[nss["provider"]== "University of Oxford"]


#-------- STORE YOUR RESULTS HERE ---------------
## Your answer for the number of subjects MUST be an integer
# Your answer must be stored in "answer_1"
answer_1 = df_oxford['cah3_subject'].nunique()


#-------- DO NOT CHANGE ---------------
add_to_answer(1, answer_1)
display(Markdown(f"**Your Answer:** At Oxford, a total of **{all_answers[1]}** subjects are offered"))
display(Markdown(f"**Your answer appended to `all_answer` dictionary: {all_answers[1]}**"))


**Your Answer:** At Oxford, a total of **35** subjects are offered

**Your answer appended to `all_answer` dictionary: 35**

### Task 2: Which subject (`cah3_subject`) has the highest number of students (`population`) at the University of Oxford? [0.5 Mark]

In [None]:
# YOUR CODE HERE





#-------- STORE YOUR RESULTS HERE ---------------
## Your answer must be a list of two items: [cah3_subject (string), sum of unique population (int)]
# Your answer must be stored in "answer_2"
answer_2 = None


#-------- DO NOT CHANGE ---------------
add_to_answer(2, answer_2)
display(Markdown(f"**Your Answer:** At Oxford, the subject **{all_answers[2][0]}** has the highest number of students = **{all_answers[2][1]}**"))
display(Markdown(f"**Your answer appended to `all_answer` dictionary: {all_answers[2]}**"))


### Task 3: Which 5 NSS questions (`question`) have the highest **mean** `agree_pct`, regardless of providers, subjects or populations. [2 Marks]

In [None]:
# YOUR CODE HERE




#-------- STORE YOUR RESULTS HERE ---------------
## Your answer MUST be a list of lists. Each list has two values ["question" (string), "mean agree_pct" (float)]
# Your answer must be stored in "answer_3"
answer_3 = None

#-------- DO NOT CHANGE ---------------
add_to_answer(3, answer_3)
display(Markdown(f"**Your Answer:** Top 5 Questions with corresponding highest mean agree_pct are:"))
print("\n".join(str(x) for x in all_answers[3]))
display(Markdown(f"**Your answer appended to `all_answer` dictionary:** <br> {all_answers[3]}"))


### Task 4: For each of the top 5 NSS questions (`question`) identified in Task 3, find the 3 providers (`provider`) with the highest mean scores (`agree_pct`). [5 Marks]

You need to find 3 providers per NSS question (5 questions), so in total 15 providers.
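
One possible way to pick the top few groups within each category is sketched below on hypothetical toy data (made-up providers, questions and scores). It averages `agree_pct` per question and provider, sorts within each question, and keeps the first three rows per question. The variable `top_questions` stands in for the questions found in Task 3, and the output is ordered by question label rather than by Task 3 rank, which is an assumption of this sketch.

```python
import pandas as pd

# Toy data (hypothetical providers and scores): two questions, four providers.
toy = pd.DataFrame({
    "question": ["Q01"] * 8 + ["Q02"] * 8,
    "provider": ["A", "A", "B", "B", "C", "C", "D", "D"] * 2,
    "agree_pct": [90.0, 80.0, 70.0, 75.0, 95.0, 93.0, 60.0, 62.0,
                  50.0, 54.0, 88.0, 86.0, 70.0, 72.0, 91.0, 89.0],
})

top_questions = ["Q01", "Q02"]  # stand-in for the questions found in Task 3

# Mean agree_pct for every (question, provider) pair.
provider_means = (
    toy[toy["question"].isin(top_questions)]
    .groupby(["question", "provider"], as_index=False)["agree_pct"]
    .mean()
)

# Sort within each question by descending mean, then keep the first 3 rows per question.
top3 = (
    provider_means
    .sort_values(["question", "agree_pct"], ascending=[True, False])
    .groupby("question")
    .head(3)
)

result = [[q, p, float(round(m, 2))] for q, p, m in top3.itertuples(index=False)]
print(result)
```

Swapping the sort direction for `agree_pct` would give the lowest-scoring providers instead.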

In [None]:
# YOUR CODE HERE
# HINT: Use the top 5 questions you worked out in the previous task



#-------- STORE YOUR RESULTS HERE ---------------
## Your answer MUST be a list of 15 lists. Each list MUST contain the following items in order ["question" (string), "provider" (string), "mean agree_pct" (float)]
# Your answer must be stored in "answer_4"
answer_4 = None


#-------- DO NOT CHANGE ---------------
add_to_answer(4, answer_4)
display(Markdown(f"**Your Answer:** Top 5 NSS Questions the Top 3 Providers with corresponding highest mean agree_pct are:"))
print("\n".join(str(x) for x in all_answers[4]))
display(Markdown(f"**Your answer appended to `all_answer` dictionary:** <br> {all_answers[4]}"))

### Task 5: For each of the top 5 NSS questions (`question`) with highest mean `agree_pct` identified in Task 3, calculate the distribution (mean and standard deviation) for **all providers**, and plot the 5 distributions (i.e., one distribution for each NSS question) [6 Marks]

You need to calculate 5 distributions for the top 5 NSS questions.
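
A minimal sketch of summarising a per-question distribution by its mean and standard deviation is shown below, using hypothetical toy scores. It assumes the statistics are taken over all rows for a question and uses the pandas default of a sample standard deviation (`ddof=1`). Whether the task expects this or a per-provider aggregation first is a decision you should justify yourself. Plotting is left out of the sketch.

```python
import pandas as pd

# Toy data (hypothetical scores): two questions, four providers each.
toy = pd.DataFrame({
    "question": ["Q01"] * 4 + ["Q02"] * 4,
    "provider": ["A", "B", "C", "D"] * 2,
    "agree_pct": [80.0, 90.0, 70.0, 100.0, 60.0, 60.0, 80.0, 80.0],
})

# Mean and sample standard deviation (pandas default, ddof=1) per question.
summary = toy.groupby("question")["agree_pct"].agg(["mean", "std"])

result = [
    [q, float(round(row["mean"], 2)), float(round(row["std"], 2))]
    for q, row in summary.iterrows()
]
print(result)
```

Each inner list follows the `["question", mean, standard deviation]` layout required for this task's answer.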

In [None]:
# YOUR CODE HERE

# HINT: Use the top 5 questions you found in previous task






#-------- STORE YOUR RESULTS HERE ---------------

## Your answer MUST be a list of 5 lists. Each list MUST contain the following items in order ["question" (string), "mean" (float), "standard deviation" (float)]
## Your answer must be stored in "answer_5" list below
answer_5 = None


#-------- DO NOT CHANGE ---------------
add_to_answer(5, answer_5)
display(Markdown(f"**Your Answer:** Top 5 NSS Questions with mean and standard deviation for **all providers** are:"))
display(Markdown(f"**[Question, Mean, Standard Deviation]**:"))
print("\n".join(str(x) for x in all_answers[5]))
display(Markdown(f"**Your answer appended to `all_answer` dictionary:** <br> {all_answers[5]}"))

### Task 6: Which 5 NSS questions (`question`) have the **lowest mean** `agree_pct`, regardless of providers, subjects or population. [1.5 Marks]

In [None]:
# YOUR CODE HERE




#-------- STORE YOUR RESULTS HERE ---------------

## Your answer MUST be a list of 5 lists. Each list has two values ["question" (string), "mean agree_pct" (float)]
# Your answer must be stored in "answer_6"
answer_6 = None


#-------- DO NOT CHANGE ---------------
add_to_answer(6, answer_6)
display(Markdown(f"**Your Answer:** 5 NSS Questions with corresponding lowest mean agree_pct are:"))
print("\n".join(str(x) for x in all_answers[6]))
display(Markdown(f"**Your answer appended to `all_answer` dictionary:** <br> {all_answers[6]}"))


### Task 7: For each of the lowest scoring 5 NSS questions (`question`) identified in Task 6, find the 3 providers with the lowest mean scores (`agree_pct`). [2.5 Marks]

You need to find 3 providers per NSS question (5 questions), so in total 15 providers.

In [None]:
# YOUR CODE HERE

# HINT: Use the lowest 5 NSS questions you found in Task 6




#-------- STORE YOUR RESULTS HERE ---------------

## Your answer MUST be list of 15 lists. Each list MUST contain the following items in order ["question" (string), "provider" (string), "mean agree_pct" (float)]
# Your answer must be stored in "answer_7"
answer_7 = None


#-------- DO NOT CHANGE ---------------
add_to_answer(7, answer_7)
display(Markdown(f"**Your Answer:** 5 NSS Questions with 3 Providers with the corresponding lowest mean agree_pct are:"))
print("\n".join(str(x) for x in all_answers[7]))
display(Markdown(f"**Your answer appended to `all_answer` dictionary:** <br> {all_answers[7]}"))

### Task 8: For each of the lowest scoring 5 NSS questions (`question`) identified in Task 6, calculate the distribution (mean and standard deviation) for **all providers**, and plot the 5 distributions (i.e., one distribution for each NSS question). [4 Marks]

You need to calculate 5 distributions across all providers for the lowest scoring 5 NSS questions.

In [None]:
# YOUR CODE HERE
# HINT: Use the bottom 5 questions you found in Task 6




#-------- STORE YOUR RESULTS HERE ---------------
## Your answer MUST be a list of 5 lists. Each list MUST contain the following items in order ["question" (string), "mean" (float), "standard deviation" (float)]
## Your answer must be stored in "answer_8"
answer_8 = None


#-------- DO NOT CHANGE ---------------
add_to_answer(8, answer_8)
display(Markdown(f"**Your Answer:** 5 NSS Questions with the lowest mean and standard deviation for **all providers** are:"))
display(Markdown(f"**[Question, Mean, Standard Deviation]**:"))
print("\n".join(str(x) for x in all_answers[8]))
display(Markdown(f"**Your answer appended to `all_answer` dictionary:** <br> {all_answers[8]}"))

### Task 9: Calculate the total number of students for each provider identified in Task 4. [2 Marks]

In Task 4, for each of the top 5 NSS questions, you found the 3 providers with the highest mean `agree_pct`. What is the total number of undergraduate students (`population`) at each provider? Given 15 providers, your answer will have 15 populations.
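
A likely reason for summing *unique* values is that the same `population` figure is repeated on several rows (for example, once per question), so a plain sum would count students more than once. A minimal sketch on toy data, where the frame `toy`, its values, and the `provider` column name are assumptions:

```python
import pandas as pd

# Toy stand-in: provider A has two distinct cohorts (100 and 250),
# with the 100 figure repeated across question rows.
toy = pd.DataFrame({
    "provider": ["A", "A", "A", "B", "B"],
    "question": ["Q1", "Q2", "Q1", "Q1", "Q2"],
    "population": [100, 100, 250, 40, 40],
})

# Drop repeated population values within each provider before summing.
totals = toy.groupby("provider")["population"].agg(
    lambda s: int(s.drop_duplicates().sum())
)

# Convert to the required [provider, sum of unique population] shape.
result = [[provider, int(total)] for provider, total in totals.items()]
print(result)
```

A plain `sum()` would give 450 for provider A here, while the sum of unique values gives 350. Be aware that this approach also merges two genuinely different cohorts that happen to have the same size.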

In [None]:
# YOUR CODE HERE

# HINT: Use the 15 providers you identified in Task 4
# HINT: TAKE THE SUM OF UNIQUE VALUES IN POPULATIONS for a given provider


#-------- STORE YOUR RESULTS HERE ---------------

## Your answer MUST be a list of lists where each list is in order ["provider" (string), "sum of unique population" (int)]
# Your answer must be stored in "answer_9"
answer_9 = None


#-------- DO NOT CHANGE ---------------
add_to_answer(9, answer_9)
display(Markdown(f"**Your Answer:** for each of the top 5 NSS questions, the sum of unique populations for the top 3 providers with the highest mean `agree_pct` are:"))
print("\n".join(str(x) for x in all_answers[9]))
display(Markdown(f"**Your answer appended to `all_answer` dictionary:** <br> {all_answers[9]}"))


### Task 10: Which 5 broad subjects (`cah1_subject`) are the most popular (i.e., have the highest `population`) across all providers? [2.5 Marks]

In [None]:
# YOUR CODE HERE
# HINT: You MUST TAKE THE SUM OF UNIQUE VALUES IN POPULATION FOR A GIVEN CAH1_SUBJECT


#-------- STORE YOUR RESULTS HERE ---------------

## Your answer MUST be a list of 5 lists where each list is in order ["subject" (string), "sum of unique populations" (int)]
# Your answer must be stored in "answer_10"
answer_10 = None


#-------- DO NOT CHANGE ---------------
add_to_answer(10, answer_10)
display(Markdown(f"**Your Answer:** The top 5 `cah1_subject`, and their respective populations are:"))
print("\n".join(str(x) for x in all_answers[10]))
display(Markdown(f"**Your answer appended to `all_answer` dictionary:** <br> {all_answers[10]}"))

### Task 11: For each of the 5 most popular broad subjects (`cah1_subject`) calculated in Task 10, which 3 NSS questions have the highest mean `agree_pct`? [3.5 Marks]

For each of the 5 subjects, you will have 3 NSS questions, so 15 questions in total.

In [None]:
# YOUR CODE HERE

# HINT: use the 5 most popular subjects that you identified in Task 10




#-------- STORE YOUR RESULTS HERE ---------------

## Your answer MUST be a list of 15 lists. Each list MUST contain the following items in order ["cah1_subject" (string), "question" (string), "mean agree_pct" (float)]
# Your answer must be stored in "answer_11"
answer_11 = None

#-------- DO NOT CHANGE ---------------
add_to_answer(11, answer_11)
display(Markdown(f"**Your Answer:** For the top 5 subjects, the top 3 questions with the corresponding highest mean agree_pct are:"))
print("\n".join(str(x) for x in all_answers[11]))
display(Markdown(f"**Your answer appended to `all_answer` dictionary:** <br> {all_answers[11]}"))



### Task 12: Which 5 subjects (`cah3_subject`) have the highest mean score (`agree_pct`) across **all** NSS questions? [3 Marks]

For each subject, you will need to find the mean `agree_pct` across **all** NSS questions for that subject.

In [None]:
# YOUR CODE HERE



#-------- STORE YOUR RESULTS HERE ---------------

## Your answer MUST be a list of 5 lists. Each list MUST contain the following items in order ["cah3_subject" (string), "mean agree_pct" (float)]
# Your answer must be stored in "answer_12"
answer_12 = None


#-------- DO NOT CHANGE ---------------
add_to_answer(12, answer_12)
display(Markdown(f"**Your Answer:** For the top 5 specific subjects (cah3_subject), the corresponding mean `agree_pct` across all questions are:"))
print("\n".join(str(x) for x in all_answers[12]))
display(Markdown(f"**Your answer appended to `all_answer` dictionary:** <br> {all_answers[12]}"))

### Task 13: Choose a broad subject (`cah1_subject`) [0.5 Mark]

The tasks above have guided you through the questions and an initial exploration of the NSS dataset.

You may now wish to explore the dataset further to decide which broad subject (or discipline) to recommend, based on your reasoning and your findings in the dataset.

If needed, you can add new code and/or text cells.

In [None]:
# YOUR CODE HERE
#Set this variable to a string as per your analysis, e.g. "mathematical sciences"
chosen_cha1_subject = None



#-------- STORE YOUR RESULTS HERE ---------------
## Your answer MUST be a string and must exist in the "cah1_subject" column in the nss dataset
# Your answer must be stored in "answer_13"

answer_13 = None

#-------- DO NOT CHANGE ---------------
add_to_answer(13, answer_13)
display(Markdown(f"**Your Answer:** You have chosen : {all_answers[13]}"))
display(Markdown(f"**Your answer appended to `all_answer` dictionary:** <br> {all_answers[13]}"))

### Task 14: Choose a subject (`cah3_subject`) within your chosen broad subject (`cah1_subject`) [0.5 Mark]

You need to explore which subjects (`cah3_subject`) are available within your chosen broad subject (`cah1_subject`) and, based on your analyses, recommend your specific subject (`cah3_subject`).

If needed, you can add new code and/or text cells.

In [None]:
# YOUR CODE HERE
#Set this variable to a string as per your analysis, e.g. "statistics"
chosen_cha3_subject = None



#-------- STORE YOUR RESULTS HERE ---------------
## Your answer MUST be a string and must exist in the "cah3_subject" column in the nss dataset
# Your answer must be stored in "answer_14"

answer_14 = None


#-------- DO NOT CHANGE ---------------
## DO NOT Change - Formatted answer and appending them to all answers
add_to_answer(14, answer_14)
display(Markdown(f"**Your Answer:** You have chosen : {all_answers[14]}"))
display(Markdown(f"**Your answer appended to `all_answer` dictionary:** <br> {all_answers[14]}"))

### Task 15: For your chosen subject (`chosen_cha3_subject`), what are the top 5 NSS questions with the highest mean `agree_pct` across all providers, excluding your employer? [4 Marks]
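
The filter-then-rank step can be sketched on toy data as follows. Everything here is illustrative: the frame `toy`, its values, the `provider` column name, and the local names `subject` and `employer` are assumptions standing in for your real data, `chosen_cha3_subject`, and `employer_name`.

```python
import pandas as pd

# Toy stand-in for the NSS data; values and the "provider" column name are assumptions.
toy = pd.DataFrame({
    "provider": ["Employer U", "Other A", "Other B", "Other A", "Other B", "Other A"],
    "cah3_subject": ["statistics"] * 5 + ["physics"],
    "question": ["Q1", "Q1", "Q1", "Q2", "Q2", "Q1"],
    "agree_pct": [99.0, 80.0, 90.0, 70.0, 60.0, 10.0],
})
subject = "statistics"   # hypothetical stand-in for chosen_cha3_subject
employer = "Employer U"  # hypothetical stand-in for employer_name

# Keep only the chosen subject and drop every row belonging to the employer.
mask = (toy["cah3_subject"] == subject) & (toy["provider"] != employer)

# Mean agree_pct per question, highest first, at most 5 rows.
top = toy[mask].groupby("question")["agree_pct"].mean().nlargest(5)

# Convert to the required [question, mean agree_pct] shape.
result = [[q, float(m)] for q, m in top.items()]
print(result)
```

Excluding the employer before averaging matters: with the employer's 99.0 included, the Q1 mean in this toy example would be noticeably higher than the 85.0 obtained from the other providers alone.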

In [None]:
# YOUR CODE HERE
# HINT: Filter to your chosen cah3_subject and exclude your employer before averaging agree_pct per question


#-------- STORE YOUR RESULTS HERE ---------------
## Your answer MUST be a list of 5 lists. Each list MUST contain the following items in order ["question" (string), "mean agree_pct" (float)]
# Your answer must be stored in "answer_15"
answer_15 = None



#-------- DO NOT CHANGE ---------------
add_to_answer(15, answer_15)
display(Markdown(f"**Your Answer:** For {chosen_cha3_subject} subject, the corresponding 5 NSS questions with the highest mean `agree_pct` are:"))
print("\n".join(str(x) for x in all_answers[15]))
display(Markdown(f"**Your answer appended to `all_answer` dictionary:** <br> {all_answers[15]}"))

### Task 16: What are the NSS question scores (`agree_pct`) for your employer across all subjects (`cah3_subject`)? [2 Marks]

As an approximate comparison with your findings in Task 15, you need to explore your employer's NSS question scores (`agree_pct`) across all subjects (`cah3_subject`).

In [None]:
# YOUR CODE HERE
# HINT: Filter to your employer's rows, then average agree_pct per question across all cah3_subject values

# HINT: This is already initialised to "University of Southampton"
employer_name = employer_name




#-------- STORE YOUR RESULTS HERE ---------------
## Your answer MUST be a list of lists. Each list MUST contain the following items in order ["question" (string), "mean agree_pct" (float)]
# Your answer must be stored in "answer_16"
answer_16 = None


#-------- DO NOT CHANGE ---------------
add_to_answer(16, answer_16)
display(Markdown(f"**Your Answer:** For {employer_name}, the corresponding NSS questions with the highest mean `agree_pct` are:"))
print("\n".join(str(x) for x in all_answers[16]))
display(Markdown(f"**Your answer appended to `all_answer` dictionary:** <br> {all_answers[16]}"))

### Section (e): Saving Your Answers to a JSON File

The code below saves your answers to a JSON file and then prints them out.
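
One detail worth knowing when checking the saved file: JSON object keys are always strings, so integer dictionary keys come back as strings after a save-and-load round trip. The sketch below shows this in memory with a made-up `answers` dictionary (its contents are purely illustrative, not real results).

```python
import json

# Hypothetical answers dictionary keyed by integer task number.
answers = {6: [["Q1", 61.5]], 13: "mathematical sciences"}

# Serialise and parse again, as happens when writing to and reading from a file.
restored = json.loads(json.dumps(answers))

print(list(restored.keys()))  # keys are now strings
print(restored["6"])          # look up with the string key after reloading
```

After reloading, look up a task with `restored["6"]` rather than `restored[6]`. Lists and floats survive the round trip unchanged.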

In [None]:
# DOUBLE CHECK ALL YOUR ANSWERS

# Save your answers to a JSON file - this will be created in your current directory
with open("all_answers.json", "w") as f:
    json.dump(all_answers, f)
    display(Markdown(f"## Your answers have been saved to `all_answers.json` file successfully!"))


# Finally, open the all_answers.json file to check that it contains correctly formatted answers
# If you are satisfied with it, submit the all_answers.json file along with this notebook (.ipynb format)

# --- Load JSON file ---
with open("all_answers.json", "r") as f:
    all_answers = json.load(f)

# --- Display in readable plain-text format ---
display(Markdown(f"## Your saved answers :"))

for key, value in all_answers.items():
    if isinstance(value, list):
        print(f"{key}: [")
        for item in value:
            if isinstance(item, list):
                print(f"   [{', '.join(str(x) for x in item)}]")
            else:
                print(f"   {item}")
        print("]")
    else:
        print(f"{key}: {value}")
    print()
