
# Real-world data coding for neuroscientists (ReCoN)

### MSc in Translational Neuroscience,

### Department of Brain Sciences, Faculty of Medicine,

### Imperial College London

### Contributors: Katarzyna Marta Zoltowska, Cecilia Rodriguez, Rishideep Chatterjee, Marirena Bafaloukou, Anastasia Ilina, Sahar Rahbar, Cynthia Sandor

### Autumn 2025

## Exercises to practise Python programming and pandas dataframe processing

##### The aim of the following exercises is to practise basic programming concepts

Please do not worry about the assert statements in the cells. They are there to tell you whether your solution is correct. If one of them raises an error, your solution still needs to be improved.
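
As a minimal sketch of how such a check behaves, the toy function `double` below (a hypothetical name, not part of the exercises) is tested with one assert that passes and one that fails on purpose:

```python
def double(x):
    """Toy function used only to demonstrate assert (hypothetical example)."""
    return 2 * x

# The condition is True, so this assert passes silently.
assert double(3) == 6, "double(3) should be 6"

# The condition is False, so assert raises AssertionError with the message.
# It is caught here only so that the example runs to the end.
try:
    assert double(3) == 7, "double(3) should be 6, not 7"
except AssertionError as err:
    message = str(err)
    print("Assertion failed:", message)
```

In the exercise cells the error is not caught, so a failing assert stops the cell and shows its message.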

#### Task 1
Given two DNA sequences of equal length, the Hamming distance between them is the number of positions at which the corresponding symbols differ.
Write a Python function that calculates the Hamming distance.
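
As a hint rather than a solution, the sketch below shows one possible way to walk through two strings position by position with the built-in `zip()` and `enumerate()`. The strings `first` and `second` are made up for illustration; the counting itself is left to you.

```python
first = "ACGT"
second = "ACCT"

# zip() pairs up the characters found at the same index in both strings.
pairs = list(zip(first, second))
print(pairs)  # [('A', 'A'), ('C', 'C'), ('G', 'C'), ('T', 'T')]

# enumerate() adds the position of each pair.
for position, (x, y) in enumerate(zip(first, second)):
    print(position, x, y, x == y)
```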

In [None]:
# Write your code here 

def hamming_distance(a,b):
   

   

assert hamming_distance("GAGCCTACTAACGGGAT", "CATCGTAATGACGGCCT")==7, "There is a mistake in your function. The expected distance is 7"
assert hamming_distance("GG", "CC")==2, "There is a mistake in your function. The expected distance is 2"
assert hamming_distance("CC", "CC")==0, "There is a mistake in your function. The expected distance is 0"

#### Task 2

In DNA strings, symbols 'A' and 'T' are complements of each other, as are 'C' and 'G'.

Write a function that returns the reverse complement of a DNA sequence. Make sure that your function works for both uppercase and lowercase DNA sequences and returns the result in uppercase. If a letter is not one of A, T, C or G, skip it.

In [None]:
# Write your code here

def rev_compl(seq):
 


 

assert rev_compl("GAGCCTACTAACGGGAT") == "ATCCCGTTAGTAGGCTC", "There is a mistake in your function. The expected reverse complement is ATCCCGTTAGTAGGCTC"
assert rev_compl("atcg") == "CGAT", "There is a mistake in your function. The expected reverse complement is CGAT"
assert rev_compl("aattGGCC") == "GGCCAATT", "There is a mistake in your function. The expected reverse complement is GGCCAATT"

#### Task 3
This task builds on the Hamming distance.
Two DNA samples, one from a healthy individual and one from a patient, have been sequenced. Find whether the patient’s sequence contains mutations (differences) compared to the healthy sequence.
Write a function called `mutation_report(seq1, seq2)` that:
 - Checks if both sequences have the same length (DNA sequences must align). If they do not, the function prints that the sequences are of different lengths and stops.
 - Counts how many positions differ between the two, ignoring letter case.
 - Prints each mutation position and the change (e.g., "Position 5: A → G").
 - Returns the total number of mutations.

Test your function with:

healthy = "ATGCTAGCTAGCTTACG"

patient = "ATGCGAGATAGCCTACG"

In [None]:
healthy = "ATGCTAGCTAGCTTACG"

patient = "ATGCGAGATAGCCTACG"

def mutation_report(seq1, seq2):





assert mutation_report("ATGCTAGCTAGCTTACG", "ATGCGAGATAGCCTACG") == 3, "Expected 3 mutations between the healthy and patient sequences."
assert mutation_report("ATGC", "ATGC") == 0, "Expected 0 mutations for identical sequences."
assert mutation_report("AAAA", "TTTT") == 4, "Expected 4 mutations when all bases differ."
assert mutation_report("atgc", "aTgG") == 1, "Expected 1 mutation; function should handle lowercase input."

#### Task 4

Each patient reports one or more symptoms related to neurological, general, or psychological health. The goal is to count how many patients report each symptom and visualize the most common ones.
Note that some of the strings contain leading or trailing white space and some contain uppercase letters. Use the appropriate string methods to correct that.
Write a function that returns a dictionary of symptom: number of occurrences pairs.
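
The following sketch illustrates the string clean-up step only, on a made-up list `raw` (not the patient data). It assumes that `str.strip()` and `str.lower()` are the methods you want; counting is left as the exercise.

```python
raw = ["  Tremor", "tremor ", "TREMOR"]

# Without cleaning, Python treats these as three different strings.
print(len(set(raw)))  # 3

# strip() removes leading/trailing white space, lower() converts to lowercase.
cleaned = [s.strip().lower() for s in raw]
print(cleaned)            # ['tremor', 'tremor', 'tremor']
print(len(set(cleaned)))  # 1
```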

In [None]:
# Patient symptoms
patients = {
    "P001": ["headache", "fatiguE", "memory loss"],
    "P002": ["dizziness", "nausea"],
    "P003": ["headache", "vision problems", " difficulty concentrating"],
    "P004": ["fatigue", "anxiety", "sleep problems"],
    "P005": ["headache", "nausea", "vomiting"],
    "P006": ["dizziness", "blurred vision"],
    "P007": ["memory loss", "confusion", "anxiety"],
    "P008": ["tremor", "fatigue"],
    "P009": ["Headache ", "dizziness", "confusion"],
    "P010": ["sleep problems", " Anxiety", "fatigue"],
}

# Write your count_symptoms() function here
def count_symptoms(patient_dict):






count_symptoms(patients)

#### Task 5 

##### Exercises to practise Python programming and pandas dataframe processing

##### The aim of the exercise is to explore clinical and demographic data recorded for healthy individuals and Alzheimer's disease patients.

The dataset is adapted from: https://www.kaggle.com/datasets/rabieelkharoua/alzheimers-disease-dataset

In [None]:
# Import the modules that you need for the project


How many rows and columns are in the dataset?

What are the names of the columns?

Are all columns informative?

In [None]:
# Load the data


# How many rows and columns are there?


# Print column names


# Display first 5 rows of the dataframe


# As you may have noticed, not all the columns are visible
# Can you split the dataframe into column chunks and display 10 columns at a time (1-10, 11-20, 21-30, 31-35), looping through the chunks? As you may need this code later, write a function
# Note: when the dataframe is the last line in the code cell, it is displayed in the notebook by default, but not when it is in the middle of the cell. Use display() in that case.


# Use this function to display the first 3 rows for all columns



Set the patient ID as the index and drop the Doctor in charge column

In [None]:
# set index


# Drop column doctor in charge


Confirm that it worked by displaying the first 4 rows of the dataframe using your function

In [None]:
# Display first 4 rows of the modified dataframe 


How many categorical and numeric columns are there? 
Check the data types as well as the data distribution (histogram). Are all numeric columns continuous?
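
A minimal sketch of one way to inspect column types, using a made-up dataframe `toy` whose column names are hypothetical and are not taken from the dataset. It also shows why a numeric data type does not guarantee a continuous variable.

```python
import pandas as pd

# Toy data for illustration only (hypothetical columns and values).
toy = pd.DataFrame({
    "Age": [61, 72, 85, 90],
    "BMI": [22.5, 30.1, 27.8, 24.0],
    "Smoking": [0, 1, 0, 0],
    "Site": ["A", "B", "A", "C"],
})

# Data type of every column.
print(toy.dtypes)

# Split the column names into numeric and non-numeric ones.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
other_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols, other_cols)

# A numeric column with very few unique values (here Smoking)
# is probably a coded category rather than a continuous measurement.
print(toy[numeric_cols].nunique())
```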

In [None]:
# Check data types


# Plot histograms


Is there any correlation between different variables? 

Plot correlation heatmap and pairplots (color the pairplots by Diagnosis)

Take a look at how to set a larger figure size, so that all the labels are displayed
(https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.figure.html)

In [None]:
# Heatmap


Looking at the heatmap, which columns are most strongly correlated with the AD diagnosis? Is this what you expected?

In [None]:
# Plot pairplots


How many unique diagnosis values are there? How many participants per group?

In [None]:
# Look at the diagnoses

Replace the numeric values in the diagnosis column with CTRL and AD: 0 is CTRL and 1 is AD

Take a look at the replace function https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html
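
As a sketch of how `replace` can work with a dictionary, the example below recodes a made-up column `Treated` in a toy dataframe. The column name and the labels are hypothetical; adapt the idea to the diagnosis column yourself.

```python
import pandas as pd

# Toy data for illustration only (hypothetical column and labels).
toy = pd.DataFrame({"Treated": [0, 1, 1, 0], "Score": [29, 18, 21, 30]})

# The dictionary maps each old value to its new label.
# replace() returns a new Series, so assign it back to the column.
toy["Treated"] = toy["Treated"].replace({0: "no", 1: "yes"})

print(toy)
print(toy["Treated"].value_counts())
```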

In [None]:
# Change the 0 and 1 values in the diagnosis column


Plot a count plot with the distribution of the Diagnosis values

In [None]:
# Plot countplot


Do you think that this is a balanced dataset in terms of the number of individuals in the CTRL and AD classes?

Filter the dataframe: include only participants with age greater than 60 who are non-smokers (smoking is 0)
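
One common way to filter on two conditions is a boolean mask. The sketch below uses a toy dataframe with hypothetical columns (`Height_cm`, `Vegetarian`), so the column names and thresholds are assumptions for illustration only.

```python
import pandas as pd

# Toy data for illustration only (hypothetical columns and values).
toy = pd.DataFrame({
    "Height_cm": [150, 172, 181, 165],
    "Vegetarian": [0, 1, 0, 0],
})

# Each comparison gives a True/False Series; & combines them row by row.
# The parentheses around each condition are required.
mask = (toy["Height_cm"] > 160) & (toy["Vegetarian"] == 0)
subset = toy[mask]

# Check the result: minimum of the filtered column and remaining values.
print(subset["Height_cm"].min())
print(subset["Vegetarian"].unique())
```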

In [None]:
# Filter the dataframe


# Check the filtering - look at the min age in your new dataframe and smoking values


MMSE stands for the Mini-Mental State Examination, a 30-point questionnaire used to screen for cognitive impairment, such as dementia. 

Plot a boxplot to see whether MMSE scores differ between CTRL and AD participants. Do not worry about statistical significance at this point. You will learn about that in the next workshop.
Try both the built-in pandas boxplot https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.boxplot.html and the seaborn package https://seaborn.pydata.org/generated/seaborn.boxplot.html.

Remember: documentation is your best friend, not AI!

Explore seaborn boxplot documentation: https://seaborn.pydata.org/generated/seaborn.boxplot.html
 - try to adjust colors of the boxplot - explore color palettes: deep, muted, pastel, bright, dark, tab10 
 - change the orientation to horizontal
 - group the boxplot by gender

In [None]:
# Plot a boxplot 


In [None]:
# Plot a boxplot


In [None]:
# Nicer boxplot


Take a closer look at the distribution of the MMSE scores among the individuals

Explore the displot() function (the older distplot() is deprecated) in the seaborn documentation https://seaborn.pydata.org/tutorial/distributions.html

Explore the hist() function in pandas further https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.hist.html. We plotted histograms of all columns before; can you now select a single column to plot and split it by diagnosis?

In [None]:
# Plot the data distribution


In [None]:
# Plot distribution using the built-in pandas .hist function


Sometimes ratios of certain diagnostic assays may be more informative in a clinical context. Add columns to the dataframe containing the ratio CholesterolHDL/CholesterolLDL

There are multiple methods to complete this exercise: https://pandas.pydata.org/docs/getting_started/intro_tutorials/05_add_columns.html

 - simple column addition https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.insert.html
 - `assign()` function https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.assign.html
 - you can also insert this column in a specific place with `insert()` https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.insert.html

Check the documentation for each of the methods and try it on your dataframe, generating 3 extra columns named: HDL_LDL_ratio_method1, HDL_LDL_ratio_method2 and HDL_LDL_ratio_method3

What happens if you try to re-run the cell with the insert function?
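
The three approaches can be compared on a toy dataframe first. The sketch below uses made-up columns `a` and `b` and hypothetical names `ratio_1` to `ratio_3`; it illustrates how the methods differ and is not the solution for the cholesterol columns.

```python
import pandas as pd

# Toy data for illustration only (hypothetical columns and values).
toy = pd.DataFrame({"a": [2.0, 9.0], "b": [4.0, 3.0]})

# 1. Plain assignment: modifies toy in place, new column goes to the end.
toy["ratio_1"] = toy["a"] / toy["b"]

# 2. assign(): returns a NEW dataframe, so the result must be stored.
toy = toy.assign(ratio_2=toy["a"] / toy["b"])

# 3. insert(): modifies toy in place and puts the column at position 1.
toy.insert(1, "ratio_3", toy["a"] / toy["b"])
print(toy.columns.tolist())

# Calling insert() again with the same column name raises a ValueError,
# because the column already exists (allow_duplicates is False by default).
second_insert_failed = False
try:
    toy.insert(1, "ratio_3", toy["a"] / toy["b"])
except ValueError as err:
    second_insert_failed = True
    print("Second insert failed:", err)
```

Plain assignment and `assign()` overwrite an existing column of the same name, which is why re-running those cells does not raise an error.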

In [None]:
# Use simple column addition method


In [None]:
# Use assign method


In [None]:
# Use insert method - insert as a second column
