## Quiz 05 - Parallel Computing, Reproducibility, and Containers

### Instructions

This quiz is based on the material covered in lectures 21 to 24. You may use
any resources available to you, including the lecture notes and the internet.

All the data required for this quiz can be found in the `data` folder within this repository. If you need to recreate the datasets, you can do so by running the Python script included in the `script-data-generation` folder.

**Important:** Please start by completing Question 01 to set up the correct Python environment before proceeding with the other questions.

This notebook contains the questions you need to answer.
If possible, please submit your answers as an `.html` file on Canvas.

### Question 01 - Setting up the Python Environment

Before proceeding with the rest of the quiz, it is important to set up a Python environment with specific package versions to ensure compatibility and reproducibility. This quiz requires **Python 3.10** and the following packages with exact versions:
- `dask-sql=2024.5.0`
- `dask=2024.4.1`
- `ipykernel=6.29.3`
- `joblib=1.3.2`
- `numpy=1.26.4`
- `pandas=2.2.1`

You can use tools like `conda`, `pipenv`, or `uv` to manage your environment. However, due to the specific version requirements, especially with `dask-sql`, **`conda` is strongly recommended**, as the instructions below are tested for this setup.

Write the terminal commands to accomplish this in the code cell below:

In [3]:
# Please write your bash commands here. You can run them using the `!` operator or the `%%bash` magic.
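# Step 1: Create a new virtual environment with Python 3.10 (using pipenv here rather than conda)
!pipenv --python 3.10
# Step 2: Install the required packages, pinned to the exact versions listed above
!pipenv install dask-sql==2024.5.0 dask==2024.4.1 ipykernel==6.29.3 joblib==1.3.2 numpy==1.26.4 pandas==2.2.1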



Creating a virtualenv for this project
Pipfile: C:\Users\manny\OneDrive\Documents\GitHub\qtm350-quiz05\Pipfile
Using 
C:/Users/manny/AppData/Local/Microsoft/WindowsApps/PythonSoftwareFoundation.Pyt
hon.3.10_qbz5n2kfra8p0/python.exe3.10.11 to create virtualenv...
created virtual environment CPython3.10.11.final.0-64 in 23344ms
  creator Venv(dest=C:\Users\manny\.virtualenvs\qtm350-quiz05-FgxbsimO, 
clear=False, no_vcs_ignore=False, global=False, describe=CPython3Windows)
  seeder FromAppData(download=False, pip=bundle, setuptools=bundle, 
wheel=bundle, via=copy, 
app_data_dir=C:\Users\manny\AppData\Local\Packages\PythonSoftwareFoundation.Pyt
hon.3.10_qbz5n2kfra8p0\LocalCache\Local\pypa\virtualenv)
    added seed packages: pip==25.0.1, setuptools==78.1.0, wheel==0.45.1
  activators 
BashActivator,BatchActivator,FishActivator,NushellActivator,PowerShellActivator
,PythonActivator

Successfully created virtual environment!
Virtualenv location: C:\Users\manny\.virtualenvs\qtm350-quiz05-Fgxbs

To activate this project's virtualenv, run pipenv shell.
Alternatively, run a command inside the virtualenv with pipenv run.
Installing dask-sql==2024.5.0...
Installation Succeeded
Installing dask==2024.4.1...
Installation Succeeded
Installing ipykernel==6.29.3...
Installation Succeeded
Installing joblib==1.3.2...
Installation Succeeded
Installing numpy==1.26.4...
Installation Succeeded
Installing pandas==2.2.1...
Installation Succeeded
To activate this project's virtualenv, run pipenv shell.
Alternatively, run a command inside the virtualenv with pipenv run.
Installing dependencies from Pipfile.lock (c606b7)...
All dependencies are now up-to-date!
Building requirements...
Resolving dependencies...

Pipfile.lock not found, creating...
Locking [packages] dependencies...
Locking [dev-packages] dependencies...
Updated Pipfile.lock (34c67823d0895516cdc008b7a1e9cfbb2ac3da7fcb4b6c292c9e6fcbecc606b7)!
Upgrading dask-sql==2024.5.0, dask==2024.4.1, ipykernel==6.29.3, joblib==1.3.2,
numpy==1.26.4, pandas==2.2.1 in  dependencies.


In [4]:
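# Note: `pipenv shell` opens an interactive subshell, so it blocks when run from a
# notebook cell (it was interrupted below). Run it from a terminal instead, or
# prefix individual commands with `pipenv run`.
!pipenv shell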

^C


### Question 02 - Parallelising a Function with Joblib

Use `joblib` to parallelise the computation of squaring numbers in a large array. Import the required packages and write code that uses four cores to parallelise the computation. Print the first 10 numbers.

```python
import numpy as np

def square(x):
    return x ** 2

numbers = np.arange(1000000)
```

In [None]:
# Please write your answer here.
import numpy as np
from joblib import Parallel, delayed

def square(x):
    return x ** 2

numbers = np.arange(1000000)

# Use Joblib to parallelize using 4 cores
squared_numbers = Parallel(n_jobs=4)(delayed(square)(x) for x in numbers)

# Print the first 10 results
print(squared_numbers[:10])
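
Dispatching one task per element creates a million tiny tasks, and the scheduling overhead can outweigh the work itself. A minimal sketch of a chunked alternative follows, assuming it is acceptable to split the array with `np.array_split`. The helper name `square_chunk`, the choice of four chunks, and the explicit `int64` dtype (to avoid overflow on platforms where NumPy defaults to 32-bit integers) are illustrative choices, not part of the quiz.

```python
import numpy as np
from joblib import Parallel, delayed

def square_chunk(chunk):
    """Square a whole NumPy chunk with one vectorised operation."""
    return chunk ** 2

# Explicit int64 so that large squares do not overflow a 32-bit default.
numbers = np.arange(1_000_000, dtype=np.int64)

# Split the array into four roughly equal chunks, one per worker.
chunks = np.array_split(numbers, 4)

# Each worker squares one chunk; results come back in input order.
squared_chunks = Parallel(n_jobs=4)(delayed(square_chunk)(c) for c in chunks)

# Stitch the chunks back together into a single array.
squared_numbers = np.concatenate(squared_chunks)

print(squared_numbers[:10])
```

With only four tasks, most of the time goes to the vectorised squaring rather than to task dispatch.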


### Question 03 - Using Dask Arrays for Large Data

Using Dask's `array` module, create a Dask array of random numbers with 10,000 rows and 10,000 columns. The array should be divided into chunks of 1,000 rows by 1,000 columns to enable efficient parallel computation. Populate the array with random numbers drawn from a normal distribution, where the mean is 0 and the standard deviation is 1. After creating the array, compute the mean, standard deviation, maximum, and minimum of the array using Dask's parallel computation capabilities. Use the `.compute()` method to execute the computations and print the results.

In [None]:
# Please write your answer here.
import dask.array as da

# Create a Dask array with 10,000 x 10,000 random values from a normal distribution
# with chunks of 1,000 x 1,000
array = da.random.normal(loc=0, scale=1, size=(10000, 10000), chunks=(1000, 1000))

# Compute statistics in parallel using Dask
mean = array.mean().compute()
std = array.std().compute()
maximum = array.max().compute()
minimum = array.min().compute()

# Print results
print(f"Mean: {mean}")
print(f"Standard Deviation: {std}")
print(f"Maximum: {maximum}")
print(f"Minimum: {minimum}")
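
To see why chunking allows parallel reductions, here is a small NumPy-only sketch of the idea: each block contributes a partial result, and the partial results combine into the global statistic. It illustrates the concept rather than Dask's actual internals. The 4 x 4 array, the 2 x 2 blocks, and the seed are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability only
x = rng.normal(loc=0, scale=1, size=(4, 4))

# Cut the array into 2 x 2 blocks, mimicking chunks=(2, 2).
blocks = [x[i:i + 2, j:j + 2] for i in range(0, 4, 2) for j in range(0, 4, 2)]

# Each block can be reduced independently (and therefore in parallel).
partial_sums = [b.sum() for b in blocks]
partial_counts = [b.size for b in blocks]
partial_maxes = [b.max() for b in blocks]

# Combine the per-block partial results into global statistics.
chunked_mean = sum(partial_sums) / sum(partial_counts)
chunked_max = max(partial_maxes)

# Both checks should agree with the direct whole-array computation.
print(np.isclose(chunked_mean, x.mean()), chunked_max == x.max())
```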


### Question 04 - Dask DataFrame Operations with Parquet Files

The `data` folder contains datasets for four countries—Brazil, India, UK, and USA—covering the years 1945 to 2023. Each country's data is stored in a separate Parquet file named after the country (`Brazil.parquet`, `India.parquet`, `UK.parquet`, `USA.parquet`). Each file contains the following columns:

- `country` (string): The name of the country.
- `year` (integer): The year of the record.
- `gdp_per_capita` (float): The GDP per capita for that country and year.
- `population` (integer): The population for that country and year.

Using Dask's `dataframe` module, read _only the `country` and the `gdp_per_capita` columns_ from the Parquet files into a Dask DataFrame. Then, compute the mean and standard deviation of the GDP per capita for each country using Dask's parallel computation capabilities.

In [None]:
# Please write your answer here.
import dask.dataframe as dd

# Step 1: Read only 'country' and 'gdp_per_capita' columns from all Parquet files
df = dd.read_parquet(
    'data/*.parquet', 
    columns=['country', 'gdp_per_capita']
)

# Step 2: Group by country and compute mean and standard deviation of gdp_per_capita
result = df.groupby('country')['gdp_per_capita'].agg(['mean', 'std']).compute()

# Step 3: Print the result
print(result)


### Question 05 - Dask and SQL Queries

Load the `data.csv` file into a Dask DataFrame and use the `dask_sql` package to perform a SQL query that selects the `country` and `gdp_per_capita` columns and filters the rows where `gdp_per_capita` is greater than 20000 in 2014. Display the results. Do not forget to register the Dask DataFrame as a SQL table with the `create_table` method.

In [None]:
# Please write your answer here.
import dask.dataframe as dd
from dask_sql import Context

# Step 1: Load the CSV into a Dask DataFrame
df = dd.read_csv('data.csv')

# Step 2: Create a dask-sql context
c = Context()

# Step 3: Register the DataFrame as a table
c.create_table("gdp_data", df)

# Step 4: Run the SQL query
query = """
SELECT country, gdp_per_capita
FROM gdp_data
WHERE year = 2014 AND gdp_per_capita > 20000
"""

# Step 5: Execute and compute the result
result = c.sql(query).compute()

# Step 6: Display the results
print(result)
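
For comparison, the same filter can be written as a boolean mask instead of SQL. The sketch below uses a tiny made-up pandas DataFrame (the rows are invented for illustration and are not taken from `data.csv`). The same mask expression also works on a Dask DataFrame, followed by `.compute()`.

```python
import pandas as pd

# Hypothetical rows, invented purely to illustrate the filter.
sample = pd.DataFrame({
    "country": ["A", "B", "C", "A"],
    "year": [2014, 2014, 2013, 2013],
    "gdp_per_capita": [25000.0, 15000.0, 30000.0, 24000.0],
})

# Equivalent of: WHERE year = 2014 AND gdp_per_capita > 20000
mask = (sample["year"] == 2014) & (sample["gdp_per_capita"] > 20000)

# Equivalent of: SELECT country, gdp_per_capita
filtered = sample.loc[mask, ["country", "gdp_per_capita"]]

print(filtered)
```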


### Question 06 - Parallelising a Function with Dask Delayed

Suppose we need to compute the sum of squares of numbers for large ranges. The function below calculates the sum of squares from `0` up to `n-1`. Modify the given `sum_of_squares` function to use Dask's `@delayed` decorator and compute the sum of squares for each number in the `numbers` list in parallel. Measure and print the total execution time for the parallel computation, and print the result for each input number (as indicated in the code).

```python
import time

def sum_of_squares(n):
    """Compute the sum of squares from 0 to n-1."""
    return sum(i * i for i in range(n))

numbers = [100_000_000, 200_000_000, 300_000_000, 400_000_000]

# Measure the start time
start_time = time.time()

# Perform the computations serially
results_serial = []
for n in numbers:
    result = sum_of_squares(n)
    results_serial.append(result)
    print(f"Sum of squares up to {n}: {result}")

# Measure the end time
end_time = time.time()

# Calculate and print the total execution time
serial_execution_time = end_time - start_time
print(f"Total execution time (serial): {serial_execution_time:.2f} seconds")
```

In [None]:
# Please write your answer here.
import time
from dask import delayed, compute

# Use the delayed decorator to parallelize the function
@delayed
def sum_of_squares(n):
    """Compute the sum of squares from 0 to n-1."""
    return sum(i * i for i in range(n))

numbers = [100_000_000, 200_000_000, 300_000_000, 400_000_000]

# Measure the start time
start_time = time.time()

# Create delayed tasks
delayed_results = [sum_of_squares(n) for n in numbers]

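# Trigger parallel execution and gather results.
# The default threaded scheduler is limited by the GIL for pure-Python loops,
# so use the multiprocessing scheduler for this CPU-bound work.
results_parallel = compute(*delayed_results, scheduler="processes")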

# Measure the end time
end_time = time.time()

# Print the results
for n, result in zip(numbers, results_parallel):
    print(f"Sum of squares up to {n}: {result}")

# Calculate and print the total execution time
parallel_execution_time = end_time - start_time
print(f"Total execution time (parallel): {parallel_execution_time:.2f} seconds")
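
One way to sanity-check the results is the closed form for this sum, $\sum_{i=0}^{n-1} i^2 = \frac{(n-1)\,n\,(2n-1)}{6}$. The sketch below compares it with the loop on small inputs; the function names `sum_of_squares_loop` and `sum_of_squares_formula` are illustrative and not part of the quiz.

```python
def sum_of_squares_loop(n):
    """Sum of squares from 0 to n-1, computed term by term."""
    return sum(i * i for i in range(n))

def sum_of_squares_formula(n):
    """Same quantity via the closed form (n-1) * n * (2n-1) / 6."""
    return (n - 1) * n * (2 * n - 1) // 6

# Compare both versions on small inputs where the loop is cheap.
for n in [1, 2, 10, 1_000]:
    assert sum_of_squares_loop(n) == sum_of_squares_formula(n)

# The formula handles the quiz-sized inputs without looping.
print(sum_of_squares_formula(100_000_000))
```

The printed value should match the first result reported by the serial and parallel runs.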


### Question 07 - Using `pip` and `requirements.txt` for Dependency Management

Explain how you can use `pip` to manage dependencies in a Python project. Describe the process of generating a `requirements.txt` file from your current environment and how to use this file to install the same packages in another environment or on a different machine. Please comment your code to explain each step. It is not necessary to run the code, but you can if you want to test it. You can use the code cell below to write your answer.

In [None]:
# Please write your answer here.
# -------------------------------
# Managing Dependencies with pip
# -------------------------------

# 1. Install project dependencies normally using pip
# Example: pip install numpy pandas scikit-learn

# 2. Once your environment is set up, generate a list of all installed packages
# and their versions by exporting them to a requirements.txt file
# This is useful for reproducing the environment on another machine

# Generate the requirements.txt file
# This will create a text file listing all installed packages with their exact versions
# (the leading `!` runs the line as a shell command from the notebook)
!pip freeze > requirements.txt

# 3. To replicate the environment on another machine or virtual environment:
# First, create/activate your new environment (optional but recommended)

# Then, install all the dependencies listed in requirements.txt
!pip install -r requirements.txt

# ---------------------------------------
# Notes:
# - Use `pip freeze` instead of `pip list` because `freeze` outputs in a format
#   that `pip install -r` can understand.
# - You can customize requirements.txt manually to include only the packages
#   your project directly depends on, rather than all packages in the environment.
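
To make the file format concrete, here is a minimal sketch that reads simple `name==version` pins from requirements-style text. The sample text reuses versions from Question 01, and the helper `parse_pins` is hypothetical. It handles only plain `==` pins, whereas `pip` itself understands many more specifier forms.

```python
# Example contents of a requirements.txt file (versions from Question 01).
requirements_text = """\
dask==2024.4.1
joblib==1.3.2
numpy==1.26.4
pandas==2.2.1
"""

def parse_pins(text):
    """Return {package: version} for simple 'name==version' lines."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, _, version = line.partition("==")
        pins[name] = version
    return pins

print(parse_pins(requirements_text))
```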


### Question 08 - Writing a Simple Dockerfile

Write a simple `Dockerfile` that creates a Docker image for a Python application. The application consists of a single Python script named `app.py` that prints "Hello, World!" when executed. The `Dockerfile` should use the official Python image as the base image, set a working directory in the container called `app`, and copy the `app.py` script into the image. When the container is run, it should execute the `app.py` script and print "Hello, World!".

#### Please write your answer here. You can use a `dockerfile` code fence to format your code.

```dockerfile
# Use the official Python base image
FROM python:3.10-slim

# Set the working directory inside the container
WORKDIR /app

# Copy the local app.py script into the container at /app
COPY app.py .

# Define the command to run the script
CMD ["python", "app.py"]
```

### Question 09 - Writing a Dockerfile to Install Software on a Base Image

Create a Dockerfile that starts from an Ubuntu 24.04 base image and installs the following software:

- Git version 2.43.0-1ubuntu7.1
- SQLite version 3.45.1-1ubuntu2

Ensure that you specify the exact versions of the packages by checking their versions after installation. Include commands to clean up the package manager cache after installation to reduce the image size.

#### Please write your answer here. You can use a `dockerfile` code fence to format your code.
```dockerfile
# Use the official Ubuntu 24.04 base image
FROM ubuntu:24.04

# Set the working directory (optional)
WORKDIR /root

# Update the package lists and install the necessary dependencies
# Install specific versions of Git and SQLite
RUN apt-get update && \
    apt-get install -y \
    git=2.43.0-1ubuntu7.1 \
    sqlite3=3.45.1-1ubuntu2 && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

# Verify the installed versions of Git and SQLite
RUN git --version && \
    sqlite3 --version

# Start a Bash shell by default (run the container with `-it` to keep it open)
CMD ["bash"]
```


### Question 10 - Writing a Dockerfile to Install Python and Packages on Ubuntu

Write a `Dockerfile` that starts from an Ubuntu 24.04 base image, installs Python 3.12 and `pip`, and then uses `pip` to install specific versions of `numpy` (1.26.4), `pandas` (2.2.2), and `matplotlib` (3.9.2). Ensure you include commands to clean up the package manager cache after installation to reduce the image size. Set up a working directory named `app/` and configure the container to start an interactive Python shell `python3` by default.

#### Please write your answer here. You can use a `dockerfile` code fence to format your code.
```dockerfile
# Use the official Ubuntu 24.04 base image
FROM ubuntu:24.04

# Set the working directory inside the container
WORKDIR /app

# Install Python 3.12, the venv module, and pip, then clean the apt cache
RUN apt-get update && \
    DEBIAN_FRONTEND=noninteractive apt-get install -y \
    python3.12 \
    python3.12-venv \
    python3-pip && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

# Ubuntu 24.04 marks the system Python as externally managed (PEP 668),
# so install the packages into a virtual environment instead
# (the /opt/venv location is an arbitrary choice)
RUN python3.12 -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

# Install specific versions of numpy, pandas, and matplotlib using pip
RUN pip install --no-cache-dir numpy==1.26.4 pandas==2.2.2 matplotlib==3.9.2

# Verify the installed versions of numpy, pandas, and matplotlib
RUN python3 -c "import numpy, pandas, matplotlib; print(numpy.__version__, pandas.__version__, matplotlib.__version__)"

# Set the default command to start an interactive Python shell
CMD ["python3"]
```