
# Real-world data coding for neuroscientists (ReCoN)

### MSc in Translational Neuroscience,

### Department of Brain Sciences, Faculty of Medicine,

### Imperial College London

### Contributors: Katarzyna Marta Zoltowska, Cecilia Rodriguez, Rishideep Chatterjee, Marirena Bafaloukou, Anastasia Ilina, Sahar Rahbar, Cynthia Sandor

### Autumn 2025

## Exercises to practise Python programming and pandas DataFrame processing


##### The aim of the following exercises is to practise basic programming concepts

Please do not worry about the assert statements in the cells. They are there to tell you whether your solution is correct. If an assert raises an error, your solution still needs some work.

#### Task 1
Given two DNA sequences of equal length, the Hamming distance between those sequences is the number of corresponding symbols that differ.
Write a Python function that calculates the Hamming distance.

In [None]:
# Write your code here 

def hamming_distance(a,b):
    distance=0
    for n,i in enumerate(a): # iterate over the first string - with enumerate you can get index of each letter and use this information to access letters in the string b
        if i!=b[n]:
            distance+=1
    return distance

assert hamming_distance("GAGCCTACTAACGGGAT", "CATCGTAATGACGGCCT")==7, "There is a mistake in your function. The expected distance is 7"
assert hamming_distance("GG", "CC")==2, "There is a mistake in your function. The expected distance is 2"
assert hamming_distance("CC", "CC")==0, "There is a mistake in your function. The expected distance is 0"
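
For comparison, the same count can be written more compactly with `zip()` and `sum()`. This is only a sketch of an alternative, not the expected solution. The function name `hamming_distance_zip` and the choice to raise a `ValueError` for sequences of unequal length are assumptions made for illustration.

```python
def hamming_distance_zip(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        # Assumption: unequal lengths are treated as an error in this sketch
        raise ValueError("Sequences must have the same length")
    # zip() pairs up the letters; every mismatch (True) adds 1 to the sum
    return sum(x != y for x, y in zip(a, b))

print(hamming_distance_zip("GAGCCTACTAACGGGAT", "CATCGTAATGACGGCCT"))  # 7
```

The explicit length check matters because `zip()` silently stops at the end of the shorter string.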

#### Task 2

In DNA strings, symbols 'A' and 'T' are complements of each other, as are 'C' and 'G'.

Write a function that returns the reverse complement of a DNA sequence. Make sure that your function works for both uppercase and lowercase DNA sequences. If a letter is not one of [A,T,C,G], just skip it.

In [None]:
# Write your code here

def rev_compl(seq):
    complement=[] # Use a name that differs from the function name to avoid confusion
    rev_seq=seq[::-1] # This is to reverse the string
    for i in rev_seq.upper(): # This is to convert it to uppercase letters
        if i=="A":
            complement.append("T")
        elif i=="T":
            complement.append("A")
        elif i=="G":
            complement.append("C")
        elif i=="C":
            complement.append("G")
    new_seq="".join(complement) # Letters other than A, T, C, G were never appended, so they are skipped
    return new_seq

assert rev_compl("GAGCCTACTAACGGGAT") == "ATCCCGTTAGTAGGCTC", "There is a mistake in your function. The expected reverse complement is ATCCCGTTAGTAGGCTC"
assert rev_compl("atcg") == "CGAT", "There is a mistake in your function. The expected reverse complement is CGAT"
assert rev_compl("aattGGCC") == "GGCCAATT", "There is a mistake in your function. The expected reverse complement is GGCCAATT"
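
The chain of if/elif statements can also be replaced by a translation table. The sketch below uses the built-in `str.maketrans()` and `str.translate()`. The names `COMPLEMENT`, `VALID_BASES` and `rev_compl_translate` are made up for this example.

```python
# Translation table: each base maps to its complement; lowercase input gives uppercase output
COMPLEMENT = str.maketrans("ATCGatcg", "TAGCTAGC")
VALID_BASES = set("ATCG")

def rev_compl_translate(seq):
    # Keep only valid bases, so any other letter is skipped
    valid = [base for base in seq if base.upper() in VALID_BASES]
    # Reverse the remaining bases, then complement them in one step
    return "".join(reversed(valid)).translate(COMPLEMENT)

print(rev_compl_translate("aattGGCC"))  # GGCCAATT
```

A lookup table keeps the base pairs in one place, which makes them easier to check than four separate branches.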

#### Task 3
This task builds on the Hamming distance.
Two DNA samples — one from a healthy individual and one from a patient — have been sequenced. Find whether the patient’s sequence contains mutations (differences) compared to the healthy sequence.
Write a function called mutation_report(seq1, seq2) that:
 - Checks if both sequences have the same length (DNA sequences must align). If not, the function stops and prints that the sequences are of different length.
 - Counts how many positions differ between the two.
 - Prints each mutation position and the change (e.g., "Position 5: A → G").
 - Returns the total number of mutations.

Test your function with:

healthy = "ATGCTAGCTAGCTTACG"

patient = "ATGCGAGATAGCCTACG"

In [None]:
healthy = "ATGCTAGCTAGCTTACG"

patient = "ATGCGAGATAGCCTACG"

def mutation_report(seq1, seq2):
    if len(seq1) != len(seq2):
        print("The sequences are of different length")
        return None # Stop early: sequences of different length cannot be compared position by position

    mutations = []
    # enumerate(..., start=1) counts positions from 1 instead of 0; another option is to add 1 to each position later, as Python counts from 0
    for i, (a, b) in enumerate(zip(seq1.upper(), seq2.upper()), start=1):
        if a != b:
            mutations.append((i, a, b)) # A tuple keeps the position and both bases together

    for pos, ref, mut in mutations:
        print(f"Position {pos}: {ref} → {mut}")

    print(f"Total mutations found: {len(mutations)}")
    return len(mutations)

assert mutation_report("ATGCTAGCTAGCTTACG", "ATGCGAGATAGCCTACG") == 3, "Expected 3 mutations between the healthy and patient sequences."
assert mutation_report("ATGC", "ATGC") == 0, "Expected 0 mutations for identical sequences."
assert mutation_report("AAAA", "TTTT") == 4, "Expected 4 mutations when all bases differ."
assert mutation_report("atgc", "aTgG") == 1, "Expected 1 mutation; function should handle lowercase input."


#### Task 4

Each patient reports one or more symptoms related to neurological, general, or psychological health. The goal is to count how many patients report each symptom and to find the most common ones.
Note that some strings contain extra white space and some contain uppercase letters. Use the appropriate string methods to correct that.
Write a function that returns a dictionary of symptom: number of occurrences pairs.

In [None]:
# Patient symptoms
patients = {
    "P001": ["headache", "fatiguE", "memory loss"],
    "P002": ["dizziness", "nausea"],
    "P003": ["headache", "vision problems", " difficulty concentrating"],
    "P004": ["fatigue", "anxiety", "sleep problems"],
    "P005": ["headache", "nausea", "vomiting"],
    "P006": ["dizziness", "blurred vision"],
    "P007": ["memory loss", "confusion", "anxiety"],
    "P008": ["tremor", "fatigue"],
    "P009": ["Headache ", "dizziness", "confusion"],
    "P010": ["sleep problems", " Anxiety", "fatigue"],
}

# Write your count_symptoms() function here
def count_symptoms(patient_dict):
    counts = {}
    for pid, symptoms in patient_dict.items(): # Iterate over the items in the dictionary
        for symptom in symptoms: # Iterate over each list of symptoms
            symptom = symptom.lower().strip() # Convert to lower case and strip leading and trailing white space
            if symptom in counts: # Check if the key already exists in the dictionary
                counts[symptom] += 1 # If yes, increase the value by 1
            else:
                counts[symptom] = 1 # If not, add a new key and set the value to 1
    return counts

count_symptoms(patients)
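
The standard library offers `collections.Counter`, which does the same bookkeeping and can also rank the symptoms. The sketch below uses a small made-up dictionary (`toy_patients`) rather than the full patient data, and the function name `count_symptoms_counter` is an assumption for this example.

```python
from collections import Counter

# Hypothetical toy data, not the full patient dictionary from the task
toy_patients = {
    "P001": ["headache", "fatiguE"],
    "P002": [" Headache ", "nausea"],
    "P003": ["fatigue", "headache"],
}

def count_symptoms_counter(patient_dict):
    counts = Counter()
    for symptoms in patient_dict.values():
        # Clean each string before counting it
        counts.update(s.strip().lower() for s in symptoms)
    return counts

symptom_counts = count_symptoms_counter(toy_patients)
# most_common(n) returns the n most frequent symptoms with their counts
print(symptom_counts.most_common(2))  # [('headache', 3), ('fatigue', 2)]
```

A `Counter` behaves like a dictionary, so `symptom_counts["headache"]` works as it does with the hand-written version.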

#### Task 5
##### The aim of the exercise is to explore clinical and demographic data recorded for healthy individuals and Alzheimer's disease patients.

The dataset is adapted from: https://www.kaggle.com/datasets/rabieelkharoua/alzheimers-disease-dataset

In [None]:
# Import the modules that you need for the project
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

How many rows and columns are in the dataset?

What are the names of the columns?

Are all columns informative?

In [None]:
# Load the data
data=pd.read_csv("./data/alzheimers_disease_data.csv")

# How many rows and columns are there?
print(data.shape)

# Print column names
print(data.columns)

# Display first 5 rows of the dataframe
display(data.head(5))

# As you may have noticed, not all the columns are visible
# Can you split the dataframe into column chunks and display 10 columns at once (1-10, 11-20, 21-30, 31-35), looping through the chunks? As you may need this code later, write a function
# Note: when the dataframe is the last line in the code cell, it is displayed in the notebook by default, but not when it is in the middle. Use display() in that case.

def display_dataframe_in_chunks(df, chunk_size=10, n_rows=5):
    """
    Display a pandas DataFrame in column chunks.

    Parameters:
    df (pd.DataFrame): The DataFrame to display.
    chunk_size (int): Number of columns to show at a time. Default is 10.
    n_rows (int): Number of rows to display. Default is 5.
    """
    num_columns = df.shape[1]
    for start in range(0, num_columns, chunk_size):
        end = min(start + chunk_size, num_columns)
        chunk = df.iloc[:n_rows, start:end] # Slicing does not fail if the dataframe has fewer than n_rows rows
        display(chunk)

# Use this function to display the first 3 rows for all columns

display_dataframe_in_chunks(data.head(5), 10, 3)

Set the patient ID as the index and drop the DoctorInCharge column

In [None]:
# set index
data.set_index("PatientID", inplace=True)

# Drop column doctor in charge
data.drop("DoctorInCharge", axis=1, inplace=True)

Confirm that it worked by displaying 4 rows of the dataframe using your function

In [None]:
display_dataframe_in_chunks(data.head(5), 10,4)

How many categorical and numeric columns are there?
Check the data types as well as the data distribution (histogram). Are all numeric columns continuous?

In [None]:
# Check data types
print(data.dtypes)

# Plot a histogram
data.hist(grid=False, figsize=(14,14));
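
Besides looking at the histograms, a rough way to spot numeric columns that are really categories (for example 0/1 flags) is to count the distinct values per column with `nunique()`. The sketch below runs on a small made-up dataframe. The column names only mimic the style of the dataset, and the cut-off of 5 distinct values is an arbitrary assumption.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration
toy = pd.DataFrame({
    "Age": [61, 75, 68, 82, 90, 66],
    "BMI": [22.5, 30.1, 27.8, 19.4, 25.0, 33.2],
    "Smoking": [0, 1, 0, 0, 1, 0],
    "EducationLevel": [0, 1, 2, 3, 1, 2],
})

# Number of distinct values in each column
unique_counts = toy.nunique()

# Assumption: a column with 5 or fewer distinct values is treated as categorical
threshold = 5
categorical_like = unique_counts[unique_counts <= threshold].index.tolist()
continuous_like = unique_counts[unique_counts > threshold].index.tolist()

print(categorical_like)  # ['Smoking', 'EducationLevel']
print(continuous_like)   # ['Age', 'BMI']
```

On the real dataset a sensible threshold depends on the columns, so treat the result as a starting point and check it against the histograms.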

Is there any correlation between different variables? 

Plot correlation heatmap and pairplots (color the pairplots by Diagnosis)

Take a look at how to set a larger figure size so that all the labels are displayed
(https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.figure.html)

In [None]:
# Heatmap
plt.figure(figsize=(16, 12))
sns.heatmap(data.corr())

Looking at the heatmap, which columns are best correlated with the AD diagnosis? Is it what you expected?
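
Reading exact values off a large heatmap is difficult, so it can help to pull out the Diagnosis column of the correlation matrix and sort it by absolute value. The sketch below is an illustration on a small made-up dataframe, not a result from the real dataset.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration
toy = pd.DataFrame({
    "MMSE": [29, 28, 27, 15, 12, 10],
    "Age": [65, 80, 72, 70, 85, 68],
    "Diagnosis": [0, 0, 0, 1, 1, 1],
})

# Correlation of every column with Diagnosis, without Diagnosis itself
corr_with_diagnosis = toy.corr()["Diagnosis"].drop("Diagnosis")

# Sort by absolute value so that strong negative correlations are ranked high too
ranked = corr_with_diagnosis.sort_values(key=abs, ascending=False)
print(ranked)
```

Sorting on the absolute value matters because a strongly negative correlation is as informative as a strongly positive one.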

In [None]:
sns.pairplot(data, hue="Diagnosis");

How many unique diagnosis values are there? How many participants per group?

In [None]:
print(data.Diagnosis.unique())

data.value_counts("Diagnosis")

Replace the numeric values in the diagnosis column with CTRL and AD, 0 is CTRL and 1 is AD

Take a look at the replace function https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html

In [None]:
# Change the 0 and 1 values in the diagnosis column
data["Diagnosis"]=data.Diagnosis.replace({0:"CTRL", 1:"AD"})

Plot a count plot with the distribution of the Diagnosis values

In [None]:
sns.countplot(data, x="Diagnosis");

Do you think that it is a balanced dataset in terms of the number of individuals in the CTRL and AD classes?
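
To put a number on the balance, `value_counts(normalize=True)` returns the proportion of each class instead of the raw count. The sketch below uses a short made-up Series, so the proportions say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical toy labels, not taken from the real dataset
diagnosis = pd.Series(
    ["CTRL", "AD", "CTRL", "CTRL", "AD", "CTRL", "CTRL", "AD"],
    name="Diagnosis",
)

# Fraction of participants in each class (the fractions sum to 1)
proportions = diagnosis.value_counts(normalize=True)
print(proportions)

# Ratio of the largest to the smallest class; 1.0 means perfectly balanced
imbalance_ratio = proportions.max() / proportions.min()
print(round(imbalance_ratio, 2))
```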

Filter the dataframe - include only participants older than 60 who are non-smokers (Smoking is 0)

In [None]:
# Filter the dataframe (.copy() gives an independent dataframe, so adding columns later does not trigger a SettingWithCopyWarning)
data=data.loc[(data.Age>60) & (data.Smoking==0),:].copy()

# Check the filtering - look at the min age in your new dataframe and smoking values
data.select_dtypes('number').describe()
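
The filter combines two boolean masks with `&`. Each condition needs its own parentheses (or its own variable), because `&` binds more tightly than the comparison operators. The sketch below shows this on a small made-up dataframe and compares the result with the equivalent `query()` call.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration
toy = pd.DataFrame(
    {"Age": [58, 60, 61, 75, 80], "Smoking": [0, 0, 1, 0, 0]},
    index=["P1", "P2", "P3", "P4", "P5"],
)

# Each condition gives one True/False value per row
older_than_60 = toy["Age"] > 60
non_smoker = toy["Smoking"] == 0

# & keeps only the rows where both conditions are True
filtered = toy.loc[older_than_60 & non_smoker].copy()

# The same filter written as a query string
same_result = toy.query("Age > 60 and Smoking == 0")

print(filtered)
```

In this toy example the participant aged exactly 60 is excluded, because the condition is strictly greater than 60.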

MMSE stands for the Mini-Mental State Examination, a 30-point questionnaire used to screen for cognitive impairment, such as dementia. 

Plot a boxplot to determine if there is any difference in MMSE between CTRL and AD patients. Do not worry about statistical significance at this point. You will learn about that in the next workshop.
Try both the built-in pandas boxplot https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.boxplot.html and the seaborn package https://seaborn.pydata.org/generated/seaborn.boxplot.html.

Remember - documentation is your best friend, not AI!

Explore seaborn boxplot documentation: https://seaborn.pydata.org/generated/seaborn.boxplot.html
 - try to adjust colors of the boxplot - explore color palettes: deep, muted, pastel, bright, dark, tab10 
 - change the orientation to horizontal
 - group the boxplot by gender

In [None]:
data.boxplot(column="MMSE", by="Diagnosis")

In [None]:
# plot a boxplot
sns.boxplot(data, x="Diagnosis", y="MMSE");

In [None]:
# Nicer boxplot
sns.boxplot(data, y="Diagnosis", x="MMSE", hue="Gender", palette="bright");

Take a closer look at the distribution of the MMSE scores among the individuals

Explore the displot() function in the seaborn documentation https://seaborn.pydata.org/tutorial/distributions.html

Explore the hist() function in pandas further https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.hist.html - we plotted histograms of all columns before. Can you select one column to plot and split it by diagnosis?

In [None]:
# Plot the data distribution
sns.displot(data, hue="Diagnosis", x="MMSE");

In [None]:
# Plot the distribution using the built-in pandas .hist function
data.hist(column="MMSE", by="Diagnosis");

Sometimes ratios of certain diagnostic assays may be more informative in a clinical context - add a column to the dataframe holding the ratio CholesterolHDL/CholesterolLDL

There are multiple methods to complete this exercise: https://pandas.pydata.org/docs/getting_started/intro_tutorials/05_add_columns.html

 - simple column addition (described in the tutorial linked above)
 - `assign()` function https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.assign.html
 - you can also insert this column in a specific place with `insert()` https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.insert.html

Check the documentation for each of the methods and try it on your dataframe, generating 3 extra columns named: HDL_LDL_ratio_method1, HDL_LDL_ratio_method2 and HDL_LDL_ratio_method3

What happens if you try to re-run the cell with the insert function?

In [None]:
# Use simple column addition method
data["HDL_LDL_ratio_method1"]=data["CholesterolHDL"]/data["CholesterolLDL"]
data

In [None]:
# Use assign method
data=data.assign(HDL_LDL_ratio_method2=data.CholesterolHDL/data.CholesterolLDL)
data

In [None]:
# Use insert method - insert as a second column
data.insert(loc=1, column="HDL_LDL_ratio_method3", value=data.CholesterolHDL/data.CholesterolLDL)
data
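
On re-running the insert cell: `insert()` modifies the dataframe in place and by default refuses to add a column whose name already exists, so a second run raises a `ValueError`. The sketch below reproduces this on a small made-up dataframe. The helper `insert_once` is a hypothetical name for one possible guard, not a pandas function.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration
toy = pd.DataFrame({"CholesterolHDL": [60.0, 45.0], "CholesterolLDL": [120.0, 90.0]})
ratio = toy["CholesterolHDL"] / toy["CholesterolLDL"]

# First call works: the new column becomes the second column
toy.insert(loc=1, column="HDL_LDL_ratio", value=ratio)

# Second call with the same column name fails
second_insert_failed = False
try:
    toy.insert(loc=1, column="HDL_LDL_ratio", value=ratio)
except ValueError as error:
    second_insert_failed = True
    print(f"Second insert failed: {error}")

def insert_once(df, loc, column, value):
    """Insert the column only if it is not already there (hypothetical helper)."""
    if column not in df.columns:
        df.insert(loc=loc, column=column, value=value)

# Safe to call repeatedly: nothing happens because the column already exists
insert_once(toy, 1, "HDL_LDL_ratio", ratio)
print(list(toy.columns))
```

The other two methods do not raise an error on a re-run: they overwrite the existing column with the same values.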