## Install required packages

In [None]:
!pip install camog

In [None]:
!pip install requests

In [None]:
!pip install numpy

In [None]:
!pip install pandas

In [None]:
!pip install opencv-python

In [None]:
!pip install matplotlib

In [None]:
!pip install scikit-learn

## Load packages

In [None]:
import pandas as pd
import numpy as np
import time
import multiprocessing as mp

## Check the number of logical cores (threads) you have

In [None]:
NUM_OF_THREADS = mp.cpu_count()

print("You have {} logical cores".format(NUM_OF_THREADS))

**Jupyter Notebook can measure the time needed to execute the code in a cell using "cell magic".**

**If you want similar functionality in Python scripts outside Jupyter, you can use the following code snippet.**

**For this practical, we will use Jupyter Notebook cell magic to time code execution.**

Documentation: https://ipython.readthedocs.io/en/stable/interactive/magics.html

In [None]:
# timing code without jupyter notebook cell magic

total_runs = 10

# collect the run times in a plain list
time_runs = []

for i in range(total_runs):

    start_time = time.time()

    ### Start of code


    ### End of code

    end_time = time.time()

    time_taken = round(end_time - start_time, 2)

    print("Time taken for run {}: {} seconds".format(i, time_taken))

    time_runs.append(time_taken)

print('Average run time: {}'.format(round(np.mean(time_runs), 2)))
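
The template above can also be wrapped in a small reusable helper, which may be convenient when the same timing logic is needed for several functions in a script. This is only a sketch: `time_function` and `example_workload` are hypothetical names introduced here, and the workload is a placeholder. It uses `time.perf_counter`, a standard-library clock intended for measuring elapsed intervals.

```python
import statistics
import time


def time_function(func, total_runs=10):
    """Call func() total_runs times and return the wall-clock durations in seconds."""
    durations = []
    for _ in range(total_runs):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return durations


def example_workload():
    # Placeholder workload (hypothetical); replace with the code you want to time
    return sum(i * i for i in range(100_000))


durations = time_function(example_workload, total_runs=5)
print("Average run time: {:.4f} seconds".format(statistics.mean(durations)))
```

Returning the raw list of durations, rather than only the mean, leaves room to inspect the spread between runs.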

## Exercise 1: Read files in parallel, using the "camog" package

Download the augmented gene expression file from the following link, unzip it, and place the resulting CSV inside the data/ folder:
https://drive.google.com/file/d/1xpaueGzBUpK2lSECN-JpbReg3pKUqFcq/view?usp=sharing

This is a made-up large file simulating gene expression data for 3000 samples and ~43000 human genes.

The file is ~2.4GB. It is deliberately large so that reading it in parallel produces a measurable difference in performance.

In [None]:
import camog

The following 4 code cells read the file "augmented_gene_expression.csv" using 1, 2, 4 and 8 threads respectively, and print the execution time underneath each cell.
Run those cells and compare the resulting times.

What are your observations?

In [None]:
%%timeit -r1 -n10

headers, columns = camog.load('data/augmented_gene_expression.csv', nthreads=1)
pandas_df = pd.DataFrame({key: value for key, value in zip(headers, columns)})

In [None]:
%%timeit -r1 -n10

headers, columns = camog.load('data/augmented_gene_expression.csv', nthreads=2)
pandas_df = pd.DataFrame({key: value for key, value in zip(headers, columns)})

In [None]:
%%timeit -r1 -n10

headers, columns = camog.load('data/augmented_gene_expression.csv', nthreads=4)
pandas_df = pd.DataFrame({key: value for key, value in zip(headers, columns)})

In [None]:
%%timeit -r1 -n10

headers, columns = camog.load('data/augmented_gene_expression.csv', nthreads=8)
pandas_df = pd.DataFrame({key: value for key, value in zip(headers, columns)})
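
One way to compare the four timings is to convert them into speedup (single-thread time divided by the time with n threads) and efficiency (speedup divided by n). The sketch below assumes you copy your own measurements into a dictionary by hand. `speedup_table` is a hypothetical helper, and the numbers in `example_timings` are made up for illustration, not measured results.

```python
def speedup_table(timings):
    """Return (threads, speedup, efficiency) tuples.

    timings maps a thread count to measured seconds and must contain the key 1,
    which is used as the single-thread baseline.
    """
    baseline = timings[1]
    rows = []
    for n_threads in sorted(timings):
        speedup = baseline / timings[n_threads]
        efficiency = speedup / n_threads
        rows.append((n_threads, speedup, efficiency))
    return rows


# Made-up numbers for illustration only; replace them with your own measurements
example_timings = {1: 40.0, 2: 22.0, 4: 13.0, 8: 10.0}

for n_threads, speedup, efficiency in speedup_table(example_timings):
    print("{} threads: speedup {:.2f}x, efficiency {:.0%}".format(n_threads, speedup, efficiency))
```

An efficiency well below 100% at higher thread counts would suggest that part of the work, such as disk access, does not scale with the number of threads.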

Let's now check the run time when using Pandas directly to read the file.

In [None]:
%%timeit -r1 -n5 

pandas_df = pd.read_csv('data/augmented_gene_expression.csv', encoding='utf-8')

The "camog" package is an open source project available on GitHub https://github.com/walshb/camog/tree/master

Check the source code file responsible for reading the CSV file in parallel and **explain what type of parallel programming is used in this package, which programming language it is written in, and which library is used.**

**Hint:** start with this file https://github.com/walshb/camog/blob/master/camog/_csv.py

## Exercise 2: Populate a Pandas dataframe with API calls (enrich UniProt IDs with protein information)

In [None]:
import json
import requests

The following line of code reads a dataframe of 100 rows and five columns. The first column contains UniProt IDs for 100 proteins and the remaining columns are empty.

The purpose of this exercise is to apply parallel computing to a Pandas dataframe: make external API calls to retrieve information about the proteins in the first column, and fill the remaining empty columns with that information.

In [None]:
uniprot_df = pd.read_csv('data/uniprot_ids_df.csv', encoding='utf-8', dtype=str)

In [None]:
uniprot_df.shape

In [None]:
uniprot_df.head(10)

The code cell below defines a function that will be applied to each dataframe chunk processed in parallel. It takes a dataframe as input (in the parallel version, the dataframe is split into multiple chunks, each handled by a different worker process) and returns the same dataframe after filling the empty columns.

Explain the function code below using comments showing the role of each piece of code.

In [None]:
def process_df_row(df):

    uniprot_url = "https://rest.uniprot.org/uniprotkb/{}.json"

    for i, row in df.iterrows():

        uniprot_id = row['uniprot_id'].strip()

        try:
            response = requests.get(uniprot_url.format(uniprot_id))

            response_json = json.loads(response.content)

            try:
                protein_name = response_json['proteinDescription']['recommendedName']['fullName']['value']
            except KeyError:
                protein_name = ""

            try:
                protein_length = response_json['sequence']['length']
            except KeyError:
                protein_length = ""

            try:
                protein_organism = response_json['organism']['scientificName']
            except KeyError:
                protein_organism = ""

            try:
                protein_sequence = response_json['sequence']['value']
            except KeyError:
                protein_sequence = ""

            df.at[i, 'protein_name'] = protein_name
            df.at[i, 'protein_length'] = protein_length
            df.at[i, 'protein_organism'] = protein_organism
            df.at[i, 'protein_sequence'] = protein_sequence

        except Exception as e:
            print(e)

    return df

In [None]:
# Apply the defined function to the dataframe
# This line is meant to show what the output looks like, not to assess performance
uniprot_enriched_df = process_df_row(uniprot_df)

In [None]:
uniprot_enriched_df.head(10)

The following two cells compare the sequential and parallel processing of the dataframe.

Run the code and compare the results.

You can try the parallel part with different numbers of cores and see how that affects the execution time.

In [None]:
%%timeit -r1 -n5

process_df_row(uniprot_df)

In [None]:
%%timeit -r1 -n5

NUM_OF_THREADS = 4

df_splits = np.array_split(uniprot_df, NUM_OF_THREADS)

pool = mp.Pool(NUM_OF_THREADS)

results = pool.map(process_df_row, df_splits)

uniprot_enriched_df = pd.concat(results)

pool.close()
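
The split, map and concatenate pattern used in the cell above can be tried on a toy dataframe without any network access. The sketch below is not the notebook's method: it swaps in `concurrent.futures.ThreadPoolExecutor` (threads) for `mp.Pool` (processes), and `add_id_length` is a hypothetical stand-in for `process_df_row`. The `split_dataframe` helper and the sample IDs are likewise only illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


def add_id_length(chunk):
    # Hypothetical stand-in for process_df_row; works on a copy of the chunk
    chunk = chunk.copy()
    chunk["id_length"] = chunk["uniprot_id"].str.len()
    return chunk


def split_dataframe(df, n_chunks):
    # Contiguous row slices of roughly equal size (assumes a non-empty dataframe)
    chunk_size = -(-len(df) // n_chunks)  # ceiling division
    return [df.iloc[start:start + chunk_size] for start in range(0, len(df), chunk_size)]


# Illustrative IDs only
toy_df = pd.DataFrame({"uniprot_id": ["P12345", "Q9Y6K9", "O15111", "P04637", "P38398"]})

chunks = split_dataframe(toy_df, n_chunks=2)

with ThreadPoolExecutor(max_workers=2) as executor:
    # executor.map yields results in the same order as the input chunks
    results = list(executor.map(add_id_length, chunks))

enriched_df = pd.concat(results)
print(enriched_df)
```

Because the results come back in input order, concatenating them restores the original row order.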

## Exercise 3: Apply edge detection to blood cell image files in parallel

In this exercise, we will apply an image processing function to a large number of blood cell images (simulating what you would do in a similar research project) and compare the run time with and without parallel computing.

The dataset was originally obtained from Kaggle (https://www.kaggle.com/datasets/paultimothymooney/blood-cells/). However, for this practical, the images were duplicated several times to increase their number, so that a measurable difference between the sequential and parallel approaches can be observed.

Download the image dataset from the following URL: https://drive.google.com/file/d/1a5EPJPSrrpaKTu6tvIY37sdtqoPdTMNd/view?usp=sharing

Unzip the archive into the data/ folder and make sure that the images are directly under data/original_images/

In [None]:
from os import listdir
from os.path import isfile, join
import cv2

In [None]:
files_path = 'data/original_images'

# Get a list of all image file names in the specified path
list_of_files = [f for f in listdir(files_path) if isfile(join(files_path, f))]

In [None]:
len(list_of_files)

In [None]:
list_of_files[:5]

In [None]:
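from os import makedirs

def process_image(image_file):

    folder = "data/original_images"
    folder_processed = "data/processed_images"

    # create the output folder if it does not exist yet
    makedirs(folder_processed, exist_ok=True)

    image = cv2.imread(join(folder, image_file))

    # cv2.imread returns None when a file cannot be read as an image; skip such files
    if image is None:
        return

    # convert to gray scale image
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    # apply median filter for smoothing
    blurM = cv2.medianBlur(gray, 5)

    # apply Canny edge detector and save the output to processed_images folder
    edgeM = cv2.Canny(blurM, 10, 50)
    cv2.imwrite(join(folder_processed, image_file), edgeM)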

In [None]:
%%timeit -r1 -n5

for image_file in list_of_files:
    
    process_image(image_file)   

In [None]:
%%timeit -r1 -n5

NUM_OF_THREADS = 4

pool = mp.Pool(NUM_OF_THREADS)
pool.map(process_image, list_of_files)

pool.close()

What is the difference in the use of pool.map() in this exercise compared to Exercise 2?

## Exercise 4: Machine learning model training using parallel computing

In [None]:
from matplotlib import pyplot as plt
from sklearn.datasets import make_regression 
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RepeatedStratifiedKFold

In [None]:
# generate a synthetic regression dataset to use for machine learning model training
X, y = make_regression(n_samples=30000, n_features=25, n_informative=10,
                        random_state=0, shuffle=False)

Explain the code below using comments, then run it. What do you observe in the output plot?

In [None]:
time_runs = np.empty((0,0))

n_cores = [1,2,3,4,5,6,7,8]

for n in n_cores:
    
    start_time = time.time() 

    model = RandomForestRegressor(n_jobs=n)
    
    model.fit(X, y)

    end_time = time.time() 

    time_taken = round(end_time - start_time, 2)

    print("Time taken for run {}: {} seconds".format(n, time_taken))

    time_runs = np.append(time_runs, time_taken)

In [None]:
plt.plot(n_cores, time_runs)

To learn more about parallelism in scikit-learn, check this link: https://scikit-learn.org/stable/computing/parallelism.html

Which parallel computing paradigm is used by default in scikit-learn?

## Exercise 5: Solve the following problem using a parallel programming approach

1. Load the CSV file in the "data" folder, named "pdb_ids.csv" using Pandas
2. Create a function to process each row of the dataframe. The function should take one argument of type DataFrame and return a dataframe object.
The function should iterate through the dataframe rows and perform the following steps:
    * Get the PDB ID from the relevant column and make an HTTP call to download the protein image from PDB (use the following URL template: http://cdn.rcsb.org/images/structures/dl/{}/{}_assembly-1.jpeg).
    * Save the content of the response (binary content) to an image file stored in the folder "data/pdb_images", named with the PDB ID and the extension "jpeg"
    * Read the image file from the folder using OpenCV and extract the size of the image (i.e. width and height)
    * Store the width and the height of the image in the relevant columns in the dataframe
3. Save the dataframe to a file
4. Use the timing template provided at the beginning of this notebook to time your code and test it using 2, 4 and 8 cores
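
For step 4, the timing loop can be run once per worker count. The sketch below only illustrates that outer loop and is not a solution to the exercise. `simulated_download` is a hypothetical placeholder that sleeps instead of making an HTTP call, `time_with_workers` is a hypothetical helper, and a thread pool from `concurrent.futures` is swapped in for the `mp.Pool` used earlier in this notebook.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def simulated_download(item):
    # Placeholder for an I/O-bound step such as an HTTP request (hypothetical)
    time.sleep(0.01)
    return item


def time_with_workers(items, n_workers):
    """Process all items with n_workers threads; return (elapsed seconds, results)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_workers) as executor:
        results = list(executor.map(simulated_download, items))
    return time.perf_counter() - start, results


items = list(range(40))
timings = {}

for n_workers in (2, 4, 8):
    elapsed, results = time_with_workers(items, n_workers)
    timings[n_workers] = elapsed
    print("{} workers: {:.3f} seconds".format(n_workers, elapsed))
```

Replacing `simulated_download` with your own row-processing function, and the list of items with your dataframe chunks, gives one timing per worker count to compare.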