
# Real-world data coding for neuroscientists (ReCoN)

### MSc in Translational Neuroscience,

### Department of Brain Sciences, Faculty of Medicine,

### Imperial College London

### Contributors: Cecilia Rodriguez, Katarzyna Zoltowska, Rishideep Chatterjee, Marirena Bafaloukou, Anastasia Ilina, Sahar Rahbar, Cynthia Sandor

### Autumn 2025

# Introduction to Python

**Python is a popular, high-level, general-purpose programming language known for its readable, English-like syntax, making it easy to learn and use.**

**It is commonly used in data science, bioinformatics and machine learning fields.**

**Why do programmers love Python?**
- Simple, English-like syntax
- Open source
- Fast to write and prototype in
- Easy to debug
- Well-documented libraries for data processing, visualisation, machine learning and more

**Python syntax**
- A new line ends a statement (no semicolon is needed)
- Indentation defines scope (blocks of code)
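
To see both rules at work, here is a minimal sketch. The heart-rate values and the threshold of 60 are made up for illustration only.

```python
# Hypothetical heart-rate readings, used only for illustration
heart_rates = [58, 72, 101]
above_60 = []

for hr in heart_rates:          # the colon opens a block
    if hr > 60:                 # a nested block, indented one level deeper
        above_60.append(hr)     # runs only when the condition is True
    print(f"Checked {hr}")      # belongs to the for block, so it runs for every value

print(above_60)                 # not indented, so it runs once after the loop
```

Moving the `print(f"Checked {hr}")` line one level deeper would make it part of the `if` block, so it would run only for values above 60.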

**How to run Python code?**
- As a Python file with the extension `.py`: e.g. run a script from the command line in the terminal with `python script.py`
- In Jupyter notebooks, which are run interactively and can combine code with markdown, tables and plots to generate reports. Notebook files have the extension `.ipynb`. During this workshop we will be using Jupyter notebooks.


The "Zen of Python" is a set of aphorisms by Tim Peters that serve as guiding principles for writing elegant, readable, and Pythonic code, emphasizing simplicity, explicitness, and readability.
The poem is "hidden" inside Python and can be displayed by running `import this`.

*Beautiful is better than ugly.<br>
Explicit is better than implicit.<br>
Simple is better than complex.<br>
Complex is better than complicated.<br>
Flat is better than nested.<br>
Sparse is better than dense.<br>
Readability counts.<br>
Special cases aren't special enough to break the rules.<br>
Although practicality beats purity.<br>
Errors should never pass silently.<br>
Unless explicitly silenced.<br>
In the face of ambiguity, refuse the temptation to guess.<br>
There should be one-- and preferably only one --obvious way to do it.<br>
Although that way may not be obvious at first unless you're Dutch.<br>
Now is better than never.<br>
Although never is often better than *right* now.<br>
If the implementation is hard to explain, it's a bad idea.<br>
If the implementation is easy to explain, it may be a good idea.<br>
Namespaces are one honking great idea -- let's do more of those!*<br>

---
### 1. Let's run our first Python code
---

#### Print function

In [None]:
print("Hello world")

#### Using python as a calculator

In [None]:
2+92

#### Counting letters in a word

In [None]:
len("Hello")

In [None]:
len(Hello)

Oh no! We got an error. Why?

Without quotes, Python does not see Hello as a string but as a variable name. No variable with that name has been defined, so Python raises a `NameError`.

#### Creating variables

A variable is a container that stores a data value.

In [None]:
# Here we assign 3 to a and 4 to b and then we can use those variables to perform operations
a=3
b=4
c=a/b
print(c)

#### How to check the type of a variable?
More on data types later.

In [None]:
# Check the type of a variable
print(type(a))
print(type(c))

The type of a is int and the type of c is float. But what does this mean?

---
### 2. Data types in Python
---

| Data Type     | Description                            | Example                         |
|---------------|----------------------------------------|---------------------------------|
| `int`         | Integer numbers                        | `5`, `-10`, `0`                 |
| `float`       | Floating-point numbers                 | `3.14`, `-0.001`, `2.0`         |
| `bool`        | Boolean values                         | `True`, `False`                 |
| `str`         | String of characters                   | `"hello"`, `'Python'`           |
| `list`        | Ordered, mutable sequence              | `[1, 2, 3]`, `['a', 'b']`       |
| `tuple`       | Ordered, immutable sequence            | `(1, 2)`, `('x', 'y', 'z')`     |
| `set`         | Unordered collection of unique items   | `{1, 2, 3}`, `{'a', 'b'}`       |         
| `dict`        | Key-value pairs                        | `{'a': 1, 'b': 2}`              |
                      


#### Examples from the biomedical field

| Data Type | Why Useful in Medical/Neurology Context                                      | Example |
|-----------|------------------------------------------------------------------------------|---------|
| `int`     | Used for whole-number values like patient age, heart rate, test scores.     | `age = 55`, `heart_rate = 72` |
| `float`   | Captures precise measurements like BMI, glucose level, brain volume.         | `bmi = 24.7`, `brain_volume = 1350.8` |
| `bool`    | Represents binary conditions like symptoms present/absent, test positive/negative. | `has_seizures = True`, `mri_abnormal = False` |
| `str`     | Stores textual data like patient names, diagnoses, or medication names.      | `diagnosis = "Epilepsy"`, `medication = "Levetiracetam"` |
| `list`    | Used for collections like symptoms, medications, visit dates.                | `symptoms = ["headache", "nausea", "dizziness"]` |
| `tuple`   | Stores fixed sets of data, like coordinates in brain scans or immutable records. | `scan_resolution = (256, 256)`, `coordinates = (34.2, -117.1)` |
| `set`     | Holds unique unordered data, such as known allergies or genetic markers.     | `allergies = {"penicillin", "latex"}` |
| `dict`    | Key-value pairs for structured patient records, labs, or visit notes.        | `patient = {"name": "John", "age": 50, "diagnosis": "Stroke"}` |


#### Before you review the examples, let's take a look at the printing function syntax in the code below. 


In [None]:
x="Hello World"
print(f"{x} is {type(x)}\n{80*'='}")

In the above cell we use an f-string for string formatting.<br>
The curly brackets {} are placeholders that let us use a variable inside the string.<br>
As you see in the example, you can put not only a variable but also a function call inside the curly brackets.<br>
The \n stands for a new line.<br>
The 80*'=' creates a line of 80 = signs for nicer formatting.<br>
Note the use of different quotes: " vs '. Quotes nested inside an f-string should differ from the quotes that enclose it. Reusing " inside the string raises a SyntaxError in Python versions before 3.12.<br>

Another useful feature of string formatting is formatting numbers inside a string.

In [None]:
# Let's print a number - the result of a division
print(f"The number is {4/3}")

This does not look nice. Can we format it to show just 2 decimal places? Yes, we can.

In [None]:
print(f"The number is {4/3:.2f}")

#### Let's go back to the data types and see some examples

In [None]:
# String - textual data such as diagnosis or symptom
x = "Parkinson's Disease"
print(f"{x} is {type(x)}\n{80*'='}")

# Integer - a whole number: patient age, UPDRS score
x = 65  # age of patient
print(f"{x} is {type(x)}\n{80*'='}")

# Float - a number with decimal: tremor frequency, medication dosage
x = 1.5  # tremor frequency in Hz
print(f"{x} is {type(x)}\n{80*'='}")

# List - a sequence of items such as symptoms or medication names
x = ["tremor", "bradykinesia", "rigidity", "tremor"]  # symptoms (duplicates allowed)
print(f"{x} is {type(x)}\n{80*'='}")

# Tuple - fixed set of immutable values like MRI scan dimensions or fixed coordinates
x = (256, 256, 150)  # dimensions of brain scan
print(f"{x} is {type(x)}\n{80*'='}")

# Range - e.g., simulating time points in signal analysis
x = range(20)  # 20 time points 
print(f"{x} is {type(x)}\n{80*'='}")

# Dictionary - structured patient record with key-value pairs
x = {
    "name": "Alice",
    "age": 68,
    "diagnosis": "Parkinson's Disease",
    "UPDRS_score": 42
}
print(f"{x} is {type(x)}\n{80*'='}")

# Set - unique items such as known gene mutations or allergens
x = {"LRRK2", "PINK1", "SNCA"}  # genes linked to Parkinson's disease
print(f"{x} is {type(x)}\n{80*'='}")


#### Why do data types matter?

In [None]:
# Lists are mutable - can be changed
symptoms_list = ["tremor", "rigidity"]
symptoms_list.append("bradykinesia")

symptoms_list

In [None]:
# Tuples are immutable - cannot be changed
scan_resolution = (256, 256, 150)
scan_resolution[0] = 512 # Raises TypeError: tuples do not support item assignment
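
If you want the notebook to keep running past the failing assignment, one option is to catch the error and build a new tuple instead. This is a minimal sketch; the name `new_resolution` is introduced here for illustration.

```python
scan_resolution = (256, 256, 150)

# Trying to change an element of a tuple raises a TypeError
try:
    scan_resolution[0] = 512
except TypeError as error:
    print(f"TypeError: {error}")

# Build a new tuple from the new first value and the remaining old values
new_resolution = (512,) + scan_resolution[1:]
print(new_resolution)   # (512, 256, 150)
print(scan_resolution)  # (256, 256, 150), the original is unchanged
```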

### Simple methods applied on different data types

##### String methods

To call a method on a Python object we use dot notation: the object is followed by a dot and then the method name with parentheses, e.g. `"tremor".upper()`.


| Method           | Description                                      | Example                                                 | Output                      |
|------------------|--------------------------------------------------|----------------------------------------------------------|-----------------------------|
| `.lower()`       | Converts all characters to lowercase             | `"Parkinson".lower()`                                   | `'parkinson'`              |
| `.upper()`       | Converts all characters to uppercase             | `"dopamine".upper()`                                    | `'DOPAMINE'`               |
| `.capitalize()`  | Capitalizes the first letter                     | `"tremor".capitalize()`                                 | `'Tremor'`                 |
| `.title()`       | Capitalizes the first letter of each word        | `"parkinson's disease".title()`                           | `"Parkinson'S Disease"` (the letter after an apostrophe is capitalized too) |
| `.strip()`       | Removes leading/trailing whitespace              | `"  rigidity  ".strip()`                                | `'rigidity'`               |
| `.replace(a, b)` | Replaces `a` with `b`                            | `"UPDRS Score".replace(" ", "_")`                       | `'UPDRS_Score'`            |
| `.split()`       | Splits string into list                         | `"tremor,rigidity,slowness".split(",")`                 | `['tremor', 'rigidity', 'slowness']` |
| `.join()`        | Joins list into string using a separator         | `" - ".join(["tremor", "gait", "balance"])`             | `'tremor - gait - balance'`|
| `.find()`        | Returns index of first occurrence                | `"bradykinesia".find("kin")`                            | `5`                        |
| `.startswith()`  | Checks if string starts with substring           | `"Parkinson's".startswith("Park")`                      | `True`                     |
| `.endswith()`    | Checks if string ends with substring             | `"medication.pdf".endswith(".pdf")`                     | `True`                     |
| `.count()`       | Counts occurrences of substring                  | `"tremor tremor tremor".count("tremor")`                | `3`                        |
| `.isalpha()`     | Checks if all characters are letters             | `"GaitTest".isalpha()`                                  | `True`                     |
| `.isdigit()`     | Checks if all characters are digits              | `"1234".isdigit()`                                      | `True`                     |


In [None]:
# Different methods that can be applied on strings - examples

x="Hello World"

# Get 3rd character of the string
print(f"The third character of the string is {x[2]}.") # Python counts from 0, so index 2 is the third character

# Check the length of the string
print(f"The string has {len(x)} characters.")

# Concatenate strings - put 2 strings together
y="Hello"
z="World"

print("The concatenated string is "+y+" "+z+".")

# Capitalise string
a="hello"
print(f"The string starting with capital letter is {a.capitalize()}.") # Converts the first character to capital letter
print(f"The string in all capital letters is {a.upper()}.") # Converts all letters to capitals

# Replace part of the string
print(f"The new string is {x.replace('World', 'Universe')}.") # Replaces World with Universe
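
String methods return a new string, so they can be chained one after another. The sketch below cleans up a label and splits a comma-separated entry; both input strings are hypothetical examples.

```python
# Hypothetical column label with stray spaces and mixed case
raw_label = "  UPDRS Score  "
clean_label = raw_label.strip().lower().replace(" ", "_")
print(clean_label)  # updrs_score

# Hypothetical free-text entry with symptoms separated by a comma and a space
raw_symptoms = "tremor, rigidity, slowness"
parts = raw_symptoms.split(", ")
print(parts)        # ['tremor', 'rigidity', 'slowness']
```

The methods are applied from left to right: first `.strip()`, then `.lower()`, then `.replace()`.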

In [None]:
# Example using different data types and string methods

dose_count = 3                      # Integer: number of doses
dose_price = 12.5                   # Float: price per dose in GBP
medication = "Levodopa"            # String: medication name

print(f"{dose_count} {medication.lower()} doses cost {dose_price * dose_count:.2f} GBP.")

But what if we would like to do some maths and someone stored the numbers as a string?

In [None]:
a="3.2"
b=3
a+b

We get an error, but the error message already tells us what the problem is.<br>
Let's look at the error: <br>`TypeError: can only concatenate str (not "int") to str` - Python cannot add a string and an integer. Luckily, we can fix that using casting. We can change the data type using:
- str() - converts to string
- float() - converts to float
- int() - converts to integer

In [None]:
a="3.2"
b=3
float(a)+b
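
Casting has a few pitfalls worth knowing. The following sketch reuses the string `"3.2"` from above; the variable names `as_float` and `as_int` are introduced here for illustration.

```python
a = "3.2"

as_float = float(a)     # "3.2" -> 3.2
as_int = int(as_float)  # 3.2 -> 3 (int() drops the decimal part)
print(as_float, as_int)

# int() cannot read a string that contains a decimal point
try:
    int(a)
except ValueError as error:
    print(f"ValueError: {error}")

# str() lets us glue a number to text
print(str(3) + " doses")  # 3 doses
```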

##### List methods

| Method           | Description                                         | Example                                                   | Output / Effect                          |
|------------------|-----------------------------------------------------|------------------------------------------------------------|-------------------------------------------|
| `.append(x)`     | Adds item `x` to the end of the list                | `symptoms.append("fatigue")`                              | Adds `"fatigue"` to the `symptoms` list   |
| `.extend(iter)`  | Adds all items from another iterable                | `symptoms.extend(["tremor", "rigidity"])`                 | Adds both `"tremor"` and `"rigidity"`     |
| `.insert(i, x)`  | Inserts item `x` at index `i`                       | `medications.insert(1, "Levodopa")`                       | Inserts `"Levodopa"` at position 1        |
| `.remove(x)`     | Removes first occurrence of item `x`                | `symptoms.remove("tremor")`                               | Removes `"tremor"` from the list          |
| `.pop(i)`        | Removes and returns item at index `i`               | `symptoms.pop(0)`                                         | Removes and returns first symptom         |
| `.index(x)`      | Returns index of first occurrence of item `x`       | `medications.index("Amantadine")`                         | Returns index of `"Amantadine"`           |
| `.count(x)`      | Counts how many times item `x` appears              | `symptoms.count("rigidity")`                              | Returns number of `"rigidity"` instances  |
| `.sort()`        | Sorts list in place (ascending by default)          | `updrs_scores.sort()`                                     | Sorts UPDRS scores from lowest to highest |
| `.reverse()`     | Reverses the order of the list                      | `visit_dates.reverse()`                                   | Reverses the order of visit dates         |
| `.copy()`        | Returns a shallow copy of the list                  | `copy_scores = updrs_scores.copy()`                       | Makes a copy of the `updrs_scores` list   |
| `.clear()`       | Removes all items from the list                     | `symptoms.clear()`                                        | Empties the `symptoms` list               |


In [None]:
# Different methods that can be applied on lists (examples shown below)
x = ["tremor", "bradykinesia", "rigidity", "tremor"]

# Access list items
# Retrieve first element of the list
print(f"The first element of the list is {(first_el := x[0])}.")  # := is the walrus operator

# The element was also assigned to a new variable
print(first_el)

# Add an element to the list
x.append("postural instability")  # the original list gets modified in place
print(f"The new list is {x}.")

# Remove an element from the list
x.remove("tremor")  # removes the first occurrence only
print(f"The new list is {x}.")

# Change the 3rd element of the list
x[2] = "depression"

# Sort list
x.sort()
print(f"The sorted list is {x}.")

# Loop over the elements of the list
print("=" * 80)
for i in x:
    print(i)

# The enumerate function pairs each element of the list with its index
print("=" * 80)
for n, i in enumerate(x):
    print(f"Symptom #{n} is {i}.")
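
Because lists are modified in place, it matters whether you hold a second name for the same list or a real copy. A minimal sketch with hypothetical scores:

```python
updrs_scores = [42, 17, 30]    # hypothetical scores

alias = updrs_scores           # a second name for the SAME list
backup = updrs_scores.copy()   # an independent (shallow) copy

updrs_scores.sort()            # sorts the list in place

print(updrs_scores)  # [17, 30, 42]
print(alias)         # [17, 30, 42], changed together with updrs_scores
print(backup)        # [42, 17, 30], the copy keeps the original order

# sorted() returns a new sorted list and leaves its input untouched
ordered = sorted(backup)
print(ordered)       # [17, 30, 42]
```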


#### Dictionary methods

| Method              | Description                                             | Example                                                                 | Output / Effect                                                       |
|---------------------|---------------------------------------------------------|-------------------------------------------------------------------------|------------------------------------------------------------------------|
| `.get(key)`         | Returns the value for the specified key                | `patient.get("diagnosis")`                                              | `"Parkinson's Disease"`                                               |
| `.keys()`           | Returns a view of all keys                             | `patient.keys()`                                                        | `dict_keys(['name', 'age', 'diagnosis'])`                             |
| `.values()`         | Returns a view of all values                           | `patient.values()`                                                      | `dict_values(['Alice', 68, "Parkinson's Disease"])`                   |
| `.items()`          | Returns a view of key-value pairs                      | `patient.items()`                                                       | `dict_items([('name', 'Alice'), ('age', 68), ...])`                   |
| `.update(dict)`     | Updates dictionary with key-value pairs from another   | `patient.update({"UPDRS": 42})`                                         | Adds `'UPDRS': 42` to dictionary                                       |
| `.pop(key)`         | Removes the key and returns its value                  | `patient.pop("age")`                                                    | Returns `68`, removes `"age"` from dictionary                         |
| `.popitem()`        | Removes and returns the last inserted key-value pair   | `patient.popitem()`                                                     | e.g., `('diagnosis', "Parkinson's Disease")`                          |
| `.clear()`          | Removes all items from the dictionary                  | `patient.clear()`                                                       | Dictionary becomes `{}`                                               |
| `.copy()`           | Returns a shallow copy of the dictionary               | `new_patient = patient.copy()`                                          | `new_patient` is a separate copy of `patient`                         |
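
One practical difference between `.get(key)` and square brackets is what happens when the key is missing. The sketch below assumes a small `patient` dictionary like the one used in the table; the fallback text `"not recorded"` is an arbitrary choice.

```python
patient = {"name": "Alice", "age": 68, "diagnosis": "Parkinson's Disease"}

found = patient.get("diagnosis")
missing = patient.get("UPDRS_score")                       # key is absent -> None
with_default = patient.get("UPDRS_score", "not recorded")  # key is absent -> fallback value

print(found)         # Parkinson's Disease
print(missing)       # None
print(with_default)  # not recorded

# Square brackets raise a KeyError for a missing key
try:
    patient["UPDRS_score"]
except KeyError as error:
    print(f"KeyError: {error}")
```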
     


In [None]:
# Original dictionary with Parkinson's symptoms and severity levels
x = {
    "tremor": "mild",
    "bradykinesia": "moderate",
    "rigidity": "severe"
}

# Add a new symptom with its severity
x["postural instability"] = "moderate"
print("The new dictionary is:")
print(x)

# Get the keys, values or key:value pairs from the dictionary
print("=" * 80)
print("The symptoms are:")
for k in x.keys():
    print(k)

print("=" * 80)
print("The severity levels are:")
for v in x.values():
    print(v)

print("=" * 80)
print("The symptom:severity pairs are:")
for k, v in x.items():
    print(f"{k}: {v}")

# Reverse keys and values in the dictionary
# (severity becomes key, symptom becomes value; values must be unique for a meaningful reversal.
# Here "moderate" appears twice, so only the last symptom with that severity is kept)
y = {v: k for k, v in x.items()}
print("=" * 80)
print("The reversed dictionary (severity:symptom) is:")
for k, v in y.items():
    print(f"{k}: {v}")

#### Combining 2 lists into a dictionary

A very useful functionality is to combine two lists into a dictionary, with one list becoming keys and the other becoming values.

In [None]:
# Lists of symptoms and their severity levels
symptoms = ["tremor", "bradykinesia", "rigidity"]
severity = ["mild", "moderate", "severe"]

# Combine two lists into a dictionary
symptom_severity_dict = dict(zip(symptoms, severity))

print(symptom_severity_dict)
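
One thing to watch out for: `zip` stops at the end of the shorter list without raising an error. A minimal sketch, with one severity value left out on purpose:

```python
symptoms = ["tremor", "bradykinesia", "rigidity"]
severity = ["mild", "moderate"]  # one value is missing on purpose

truncated = dict(zip(symptoms, severity))
print(truncated)  # {'tremor': 'mild', 'bradykinesia': 'moderate'}

# A simple length check catches the mismatch before it causes silent data loss
if len(symptoms) != len(severity):
    print(f"Length mismatch: {len(symptoms)} symptoms but {len(severity)} severity values")
```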

#### List comprehensions

A list comprehension creates a new list by applying an expression to each item in an existing iterable (like a list).

Syntax:
`new_list = [expression for item in iterable if condition]`


In [None]:
severities = ["mild", "moderate", "severe"]
upper_severities = [s.upper() for s in severities]
print(upper_severities)


In [None]:
symptoms = ["tremor", "rigidity", "fatigue", "bradykinesia"]
selected_symptoms = [s for s in symptoms if s.startswith("t")]
print(selected_symptoms)

#### Dictionary comprehensions

A dictionary comprehension creates a new dictionary by generating key-value pairs from an iterable.

Syntax:
`new_dict = {key_expression: value_expression for item in iterable if condition}`

In [None]:
symptoms = ["tremor", "rigidity", "bradykinesia"]
severities = ["mild", "severe", "moderate"]

symptom_severity = {symptom: severity for symptom, severity in zip(symptoms, severities)}
print(symptom_severity)


In [None]:
symptom_severity = {
    "tremor": "mild",
    "rigidity": "severe",
    "bradykinesia": "moderate",
    "postural instability": "severe"
}

severe_symptoms = {symptom: severity for symptom, severity in symptom_severity.items() if severity == "severe"}
print(severe_symptoms)


---
### 3. Arithmetic, comparison, assignment and logical operators
---

#### Arithmetic operators
| Operator | Name           | Description     |
|----------|----------------|-----------------|
| +        | Addition       | Adds two numbers|
| -        | Subtraction    | Subtracts one number from another |
| *        | Multiplication | Multiplies two numbers |
| /        | Division       | Divides one number by another (returns float) |
| %        | Modulus        | Returns remainder of division |
| **       | Exponentiation | Raises a number to the power of another |
| //       | Floor Division | Divides and rounds the result down to the nearest whole number |

In [None]:
# Examples

print(9/7)

print(9%7)

print(9//7)

#### Comparison operators
| Operator | Name                    | Description                              |
|----------|-------------------------|------------------------------------------|
| ==       | Equal                   | Checks if two values are equal            |
| !=       | Not equal               | Checks if two values are not equal        |
| >        | Greater than            | Checks if left value is greater than right|
| <        | Less than               | Checks if left value is less than right   |
| >=       | Greater than or equal to| Checks if left value is greater or equal  |
| <=       | Less than or equal to   | Checks if left value is less or equal      |
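
Every comparison returns a Boolean value (`True` or `False`). A short sketch, where the age and the cut-off of 65 are hypothetical:

```python
age = 55
threshold = 65  # hypothetical cut-off, for illustration only

print(age == threshold)  # False
print(age != threshold)  # True
print(age < threshold)   # True
print(age >= threshold)  # False

# The result can be stored in a variable like any other value
is_older = age >= threshold
print(type(is_older))    # <class 'bool'>
```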


#### Assignment operators
| Operator | Example      | Description                                                   |
|----------|--------------|---------------------------------------------------------------|
| =        | `x = 3`      | Assignment operator: assigns value 3 to variable `x`          |
| +=       | `x += 1`     | Increment assignment: equivalent to `x = x + 1`                |
| -=       | `x -= 1`     | Decrement assignment: equivalent to `x = x - 1`                |


The last two are used a lot in loops.
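
As a small illustration, `+=` is often used as a counter inside a loop. The visit names and the starting value of 10 below are made up.

```python
visit_count = 0
for visit in ["baseline", "month 6", "month 12"]:  # hypothetical visits
    visit_count += 1   # same as visit_count = visit_count + 1

print(visit_count)     # 3

remaining = 10         # hypothetical number of planned visits
remaining -= visit_count  # same as remaining = remaining - visit_count
print(remaining)       # 7
```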

#### Logical operators
| Operator | Description                          | Example                      | Result          |
|----------|------------------------------------|------------------------------|-----------------|
| and      | Returns True if **both** statements are True | `(x > 5) and (x < 10)`       | True if both conditions true |
| or       | Returns True if **at least one** statement is True | `(x < 5) or (x > 10)`        | True if one or both true     |
| not      | Reverses the result (negates)      | `not(x == 5)`                 | True if `x` is not 5         |


In [None]:
# Examples
x=6

# Check if x<8 or x>3
print(x<8 or x>3)

# Check if x equals 3
print(x==3)
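
The `and` and `not` operators work in the same way. A minimal sketch with hypothetical patient values:

```python
age = 68
has_tremor = True
on_medication = False

both_true = age > 60 and has_tremor      # both conditions are True
negated = not on_medication              # reverses False to True
neither = age < 60 or on_medication      # both conditions are False

print(both_true)  # True
print(negated)    # True
print(neither)    # False
```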

---
### 4. Python modules
---
##### What is a library/module?
A module is a file of Python code that defines functions, classes, and variables. A library is a collection of modules.
##### How to install a library?
```%pip install library_name``` (`%pip` is a Jupyter magic command that installs the library into the environment the notebook is running in)
Alternatively, create a conda environment and install libraries using ```conda install library_name```
##### How to make it available?
```import library_name```
If you need only part of the library you can use ```from library_name import module_name```
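
The three common import styles look like this. The sketch uses `math` and `statistics`, which ship with Python, so nothing needs to be installed; the alias `stats` is just a name chosen for this example.

```python
import math                # import the whole module
import statistics as stats # import the module under a shorter alias
from math import sqrt      # import a single function from a module

print(math.pi)                   # access through the module name
print(stats.mean([1, 2, 3, 4]))  # 2.5
print(sqrt(16))                  # 4.0
```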
##### Examples of powerful modules in data science:
- 🐼 pandas - DataFrame manipulation
- 🔢 numpy - Working with arrays
- 📊 seaborn and 📉 matplotlib - Plotting and visualization
- 🤖 scikit-learn - Machine learning

### Let's try our first module

In [None]:
# Python comes with preinstalled (standard library) modules such as random and os

# The random module generates random numbers
import random
# Choose a random element from a list
print(f"The chosen random number is: {random.choice([1,3,4,0,2,3,0])}") 

# Generate random number between 0 and 1
print(f"A random number is: {random.random():.2f}") # the :.2f allows to format the number and show only 2 decimal places
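
Random numbers change on every run. If you need the same "random" numbers each time, for example to make an analysis reproducible, you can set a seed first. A minimal sketch; the seed value 42 is arbitrary.

```python
import random

random.seed(42)  # any fixed number works as a seed
first_run = [random.randint(1, 10) for _ in range(3)]

random.seed(42)  # resetting the same seed restarts the same sequence
second_run = [random.randint(1, 10) for _ in range(3)]

print(first_run)
print(second_run)
print(first_run == second_run)  # True
```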

---
### 5. Loops and conditionals
---

**for loop** - iterates over a sequence, for example over the elements of a list

In [None]:
# A for loop lets us go through a sequence and modify or print its elements
# We saw simple examples of for loops while working with dictionaries and lists

# Print a simple triangle
for i in range(1,6):
    print('*'*i)

print("="*80)

#=====================================================
# As seen before, we can loop over the elements of a list; here we modify all of them (add 3 to every element)
list1=[0,1,2,3,4,5]

list2=[] # Initialise a new list before the loop
# Loop over all elements of list1, increment them by 3 and append them to the new list2
for i in list1:
    list2.append(i+3)
print(f"Old list: {list1} vs new list {list2}.")

# The same statement can be written as a list comprehension - more concise and usually somewhat faster, with exactly the same outcome
list3=[i+3 for i in list1]
print(f"Old list: {list1} vs new list {list3}.")


**while loop** - executes a block of code as long as the condition is True

In [None]:
i=7
while i>0:
    print(i)
    i-=1 # This is a shortcut for i=i-1 that we saw before in the assignment operators

In [None]:
# We can break a loop under certain conditions
i=7
while i>0:
    print(i)
    i-=1 # This is a shortcut for i=i-1
    if i==3:
        break # The loop stops when i equals 3

**conditionals** - execute a statement if the condition is met.<br><br>
Basic syntax: `if...: elif...: else:`<br><br>
After `if` we place our first condition, after `elif` our second condition, and after `else` no condition at all: the `else` block is executed when none of the previous conditions were met.

In [None]:
# Use if elif else statements

x = {
    "tremor": "mild",
    "bradykinesia": "moderate",
    "rigidity": "severe"
}

# While writing Python code, pay attention to indentation

for n, (k, v) in enumerate(x.items()):
    if k == "tremor":  # Important to use == for comparison, not = which is assignment
        print(f"{k} is the {n} key of the dictionary and the severity is {v}.")
    elif k == "bradykinesia":
        print(f"{k} is the {n} key of the dictionary and the severity is {v}.")
    else:  # This else executes if we get to "rigidity"
        print("It is neither tremor nor bradykinesia.")


#### Indentations
How you align the blocks of code is important. 

In [None]:
for n,(k,v) in enumerate(x.items()): 
    if k=="tremor": # Important to use == sign for comparison, as opposed to = which is used for assignment
        print(f"{k} is the {n} key of the dictionary and the severity is {v}.")

In [None]:
for n,(k,v) in enumerate(x.items()): 
    if k=="tremor": # Important to use == sign for comparison, as opposed to = which is used for assignment
    print(f"{k} is the {n} key of the dictionary and the severity is {v}.")

Oh no! We got an error. This is because Python expects an indented block of code after `if`, `elif`, `else`, `for` and `while` statements. In simple loops this may seem unnecessary, but in more complex nested loops and statements it is very important: indentation controls which block each line belongs to. Code with misaligned blocks may still run, but it will give unexpected results.

In [None]:
for n,(k,v) in enumerate(x.items()): 
if k=="tremor": # Important to use == sign for comparison, as opposed to = which is used for assignment
    print(f"{k} is the {n} key of the dictionary and the severity is {v}.")

In [None]:
for n,(k,v) in enumerate(x.items()): 
    if k=="tremor":
        print("") 
print(f"{k} is the {n} key of the dictionary and the severity is {v}.")

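This worked, but the second print is outside the `if` block, so it is not conditioned on k=="tremor". It is also outside the `for` loop, so it runs only once, after the loop has finished, using the last key and value.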

---
### 6. Time to put the theory into practice
---
#### Fizz buzz exercise

Write code to play FizzBuzz for numbers from 1 to 50.
Print "fizz" for numbers divisible by three, "buzz" for numbers divisible by five, and "fizzbuzz" for numbers divisible by both three and five.

**Expected output:**<br>
3 fizz<br>
5 buzz<br>
6 fizz<br>
9 fizz<br>
10 buzz<br>
12 fizz<br>
15 fizzbuzz<br>
18 fizz<br>
20 buzz<br>
21 fizz<br>
24 fizz<br>
25 buzz<br>
27 fizz<br>
30 fizzbuzz<br>
33 fizz<br>
35 buzz<br>
36 fizz<br>
39 fizz<br>
40 buzz<br>
42 fizz<br>
45 fizzbuzz<br>
48 fizz<br>
50 buzz<br>

Hint: use the modulus operator (`%`), which calculates the remainder of a division. For instance, 7 % 3 equals 1 because when 7 is divided by 3, the remainder is 1. 10 % 2 equals 0.
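
As a warm-up that is separate from the exercise itself, the sketch below uses the modulus operator to tell even numbers from odd ones. The numbers are arbitrary.

```python
for number in [7, 10, 13]:
    remainder = number % 2   # 0 for even numbers, 1 for odd numbers
    if remainder == 0:
        print(number, "is even")
    else:
        print(number, "is odd")

# The same test inside a list comprehension
even_numbers = [n for n in [7, 10, 13] if n % 2 == 0]
print(even_numbers)  # [10]
```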

In [None]:
for i : # Specify the range you want to loop over
    if : # Specify conditions for fizzbuzz
        print(i, "fizzbuzz")
    elif : # Specify conditions for buzz
        print(i, "buzz")
    elif : # Specify conditions for fizz
        print(i, "fizz")

#### Triangle exercises

Print a triangle with a base of 5, from * characters.

*<br>
**<br>
***<br>
****<br>
*****<br>

In [None]:
# Write your code here

Print a triangle with a top of 5, from * characters.

*****<br>
****<br>
***<br>
**<br>
*

In [None]:
# Write your code here

Print a balanced triangle.

Note: in your printed output the triangle will start on the left side of the notebook.

<center>
   *<br>
  ***<br>
 *****<br>
*******
</center>

In [None]:
# Write your code here

##### Processing patient symptoms

Tasks:

1. Print all symptoms with their severity scores.

2. Use a loop and conditional statements to categorize symptoms as:

 - "Mild" if severity is 1 or 2
 - "Moderate" if severity is 3 or 4
 - "Severe" if severity is 5 or more

3. Count how many symptoms are in each category and print the counts at the end.


In [None]:
# Dictionary of symptoms
patient_symptoms = {
    "tremor": 3,
    "bradykinesia": 4,
    "rigidity": 2,
    "postural instability": 1,
    "depression": 5
}

# Write your code here


---
### 7. Navigating directories and processing text files
---

In [None]:
# os module allows you to navigate through the file system
import os 

# Where are you in the file system?
# Get current working directory
print("Start point")
print(os.getcwd())
# Change to parent directory
os.chdir("../") # This takes you to the parent directory, one can use both relative and absolute paths
# Check where you are after the change
print("Changed directory")
print(os.getcwd())
# Change again - use the path where you store the course files; please note that you need to replace the path below with the path on your computer
path="/Users/k.zoltowska/Documents/teaching_module/intro_python_all"
os.chdir(path)
if not os.path.isdir("results"): # Check if the directory exists, if not create it
    os.mkdir("results")
os.chdir("data")

#### Listing content of directory in python and reading files

In [None]:
# Get list of files in the directory
print(os.listdir(os.path.join(path, "data")))
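
Paths can also be put together with `os.path.join`, which inserts the correct separator for your operating system. The sketch below works inside a temporary directory so that it runs on any computer; the folder and file names are hypothetical.

```python
import os
import tempfile

# Work in a temporary directory that is deleted automatically at the end
with tempfile.TemporaryDirectory() as base_dir:
    data_dir = os.path.join(base_dir, "data")

    os.makedirs(data_dir, exist_ok=True)  # create the folder
    os.makedirs(data_dir, exist_ok=True)  # no error if it already exists

    example_path = os.path.join(data_dir, "example.txt")
    file_exists_before = os.path.exists(example_path)  # nothing was written yet
    contents = os.listdir(base_dir)

    print(file_exists_before)  # False
    print(contents)            # ['data']
```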

### For this example we will be working with the .vcf file format (Variant Call Format), which is commonly used in genomics
<img src="./vcf_format.png" width="800" height="400">


To read a file we can use the `open()` function:

`file = open("filename.txt", "mode")`

"filename.txt" – the name (and optionally the path) of the file.

"mode" – how you want to open the file (read, write, etc.).

| Mode  | Name           | Description                                                                 |
|-------|----------------|-----------------------------------------------------------------------------|
| `r`   | Read           | Opens file for reading (default). Raises error if file doesn't exist.       |
| `w`   | Write          | Opens file for writing. Overwrites existing file or creates a new one.      |
| `a`   | Append         | Opens file for appending. Creates file if it doesn't exist.                 |
| `r+`  | Read and Write | Opens file for both reading and writing. File must already exist.           |
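
The modes in the table can be tried out safely on a throwaway file. The sketch below writes to a temporary folder, so the file name `demo.txt` is hypothetical; it shows that `a` keeps existing content while `w` replaces it. The `with` statement closes the file automatically at the end of the indented block.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.txt")  # hypothetical file name

    with open(demo_path, "w") as f:   # "w" creates the file (or overwrites it)
        f.write("first line\n")

    with open(demo_path, "a") as f:   # "a" adds to the end of the file
        f.write("second line\n")

    with open(demo_path, "r") as f:   # "r" reads the file
        after_append = f.read()

    with open(demo_path, "w") as f:   # "w" again: the previous content is lost
        f.write("new content\n")

    with open(demo_path) as f:        # the mode defaults to "r"
        after_overwrite = f.read()

print(after_append)
print(after_overwrite)
```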


In [None]:
# By default read function will read the whole file, but we can specify the number of characters we want to read, 200 in this example
with open("clinvar_5000.vcf") as file:
  print(file.read(200))

In [None]:
# We can also read a single line
with open("clinvar_5000.vcf") as file:
  print(file.readline()) # Reads a line of the file

In [None]:
# We can also iterate through the lines of a file
n=50
# Start counter to read only 50 lines of the file (it is a large file)
with open("clinvar_5000.vcf") as file:
    for i, line in enumerate(file):
        if i >= n:
            break
        print(line.strip()) #strip removes leading and trailing white spaces

# First are just header lines starting with ## - with information about the file
# Then, there is a line with column headers #CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
# Then, there are lines with actual genetic variants

---
#### Task - read the .vcf file, extract:

```#CHROM	POS	ID	REF	ALT```<br>
```1	66926	3385321	AG	A```<br>
```1	69134	2205837	A	G```<br>
```1	69308	3925305	A	G```<br>
```1	69314	3205580	T	G```<br>
...

and save to a new file



In [None]:
# Hints -  look at the file - we need to extract rows that do not start with ## 
# From the remaining rows - we need to split the lines on tab space and extract first 5 elements of the list
# How? using loops and conditionals

# It is a large file so test only on part of the file first
with open("clinvar_5000.vcf", "r") as file_r:
    count = 0
    for line in file_r:
        if line.startswith("##"):
            continue  # skip header lines
        if count >= 50:
            break     # stop after 50 lines
        # This line can be decomposed into three steps: strip() removes leading and trailing white space,
        # split("\t") splits the line on tabs and returns a list, and [0:5] extracts the first five elements of that list
        list_line = line.strip().split("\t")[0:5]
        print(list_line)
        count += 1

# The output below shows that we are correctly extracting the lines we want from the file
# Printing is a great strategy for coding and debugging - use it when you can!
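
To see what each step of `line.strip().split("\t")[0:5]` does, it can help to run the steps one at a time on a single line. The sketch below uses a made-up VCF-style record (the last three fields are placeholders, not real ClinVar content).

```python
# A made-up VCF-style record; the last three fields are placeholders
line = "1\t69134\t2205837\tA\tG\t.\t.\tEXAMPLE_INFO\n"

stripped = line.strip()          # remove the trailing newline
fields = stripped.split("\t")    # split on tabs -> list of 8 strings
first_five = fields[0:5]         # keep CHROM, POS, ID, REF, ALT

print(repr(stripped))
print(fields)
print(first_five)
```

Note that every element is still a string, including the position; convert with `int()` if you need to do arithmetic on it.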


Now we need to write the output to a file. By default `open()` uses `r` (read) mode. To write to a file, we open it in `w` (write) mode instead.
Note that `w` will overwrite any existing file content. If you would like to append to the end of the file, use `a` (append) mode.

In [None]:
# Next step - we need to write it to a file
# Most of the code below is exactly the same as before - we just need to add two lines that allow us to write to a file
with open("clinvar_5000.vcf", "r") as file_r:
    with open("../results/clinvar_modified.vcf", "w") as file_w: # Creates file handle to write to a file
        count = 0
        for line in file_r:
            if line.startswith("##"):
                continue 
            if count >= 50:
                break 
            list_line = line.strip().split("\t")[0:5]
            # print(list_line)  # printing is no longer needed, so we comment it out
            file_w.write("\t".join(list_line) + "\n") # Join the list elements with a tab separator and add a new line at the end
            count += 1

---
### Time to put the theory into practice
---

#### Processing .sam file

Given a sam file (see format below), process the file to skip the header rows and extract only rows where MAPQ is larger than 30. Save the results into a new file.

<img src="./sam_file.png" width="800" height="400">

In [None]:
with open("example_sam.sam", "r") as file_r: # Creates handle to read file line by line
    with open("", "w") as file_w: # Creates file handle to write to a file - save modified file in the results folder
        for : # Loop over lines in the sam file
            if  : # Include condition to skip the rows that start with @
                continue 
            mapq =  # Extract mapping quality from the line
            if : # Check the condition if MAPQ>30 
                print(line.strip()) # Check whether we are extracting correct lines
                 # Write the line to a file

---
### 8. Data Science with Pandas
---

Pandas is an open source library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
https://pandas.pydata.org/docs/index.html

Pandas comes with a wide range of functionality and excellent documentation: https://pandas.pydata.org/docs/ Here, we will cover only the basics.

In [None]:
# import pandas 
import pandas as pd # pd is used by convention

##### Simple methods in pandas

| Function / Attribute     | Description                                                                 | Example                            |
|--------------------------|-----------------------------------------------------------------------------|------------------------------------|
| `pd.read_csv()`          | Reads a CSV file into a DataFrame                                          | `pd.read_csv("patients.csv")`      |
| `.shape`                 | Returns the number of rows and columns as a tuple                          | `df.shape`                         |
| `.head()`                | Returns the first 5 rows (can specify number)                              | `df.head()` or `df.head(10)`       |
| `.tail()`                | Returns the last 5 rows (can specify number)                               | `df.tail()`                        |
| `.describe()`            | Generates summary statistics for numeric columns                           | `df.describe()`                    |
| `.select_dtypes()`       | Selects columns by data type (e.g., number, object)                        | `df.select_dtypes(include="number")` |
| `.columns`               | Returns the list of column names                                           | `df.columns`                       |
| `.index`                 | Returns the index (row labels)                                             | `df.index`                         |
| `.loc[]`                 | Label-based selection (rows/columns by name)                               | `df.loc[0, "age"]` or `df.loc[0]`  |
| `.iloc[]`                | Position-based selection (rows/columns by integer index)                   | `df.iloc[0, 1]` or `df.iloc[0]`    |
| `.to_csv()`              | Writes the DataFrame to a CSV file                                         | `df.to_csv("output.csv", index=False)` |


#### Read csv file as pandas dataframe

We can read a file using the `pandas.read_csv()` function. The only required argument is the file path, but you can specify much more, e.g. `sep` - the character used as a delimiter, or `index_col` - which column to use as row names.
https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html

When working with new functions it is always good practice to check the documentation.

In [None]:
data=pd.read_csv("./Parkinson_disease.txt") 
data

Oh no! It does not display as a table. Instead we see: Jitter(%)\tMDVP:Jitter(Abs)\tMDVP:RAP\t
The values are separated by tabs, not commas (.tsv vs .csv).

In [None]:
data=pd.read_csv("./Parkinson_disease.txt", sep="\t", index_col=0)
data
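
The effect of `sep` and `index_col` can also be checked on a tiny table held in memory, without touching any file. This is a minimal sketch with made-up values and hypothetical column names; `io.StringIO` simply lets pandas treat a string as if it were a file.

```python
import io
import pandas as pd

# A tiny tab-separated table held in memory (values are made up)
text = "name\tage\tstatus\nP01\t64\tPD\nP02\t58\tCTRL\n"

# sep="\t" sets the delimiter; index_col="name" uses that column as row labels
toy = pd.read_csv(io.StringIO(text), sep="\t", index_col="name")

print(toy)
print(toy.shape)
```

`index_col` accepts either a column name, as here, or a column position such as `0`.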

#### Check what is in the rows and columns

In [None]:
# Exploring the dataframe

# Get column names
print("Column names")
print(data.columns)

print("="*80)

# Get row names
print("Row names")
print(data.index)

You can see that there are no meaningful rownames. Let's fix that.

##### What is in the columns?

name - ASCII subject name and recording number<br>
MDVP:Fo(Hz) - Average vocal fundamental frequency<br>
MDVP:Fhi(Hz) - Maximum vocal fundamental frequency<br>
MDVP:Flo(Hz) - Minimum vocal fundamental frequency<br>
MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP - Several measures of variation in fundamental frequency<br>
MDVP:Shimmer,MDVP:Shimmer(dB),Shimmer:APQ3,Shimmer:APQ5,MDVP:APQ,Shimmer:DDA - Several measures of variation in amplitude<br>
NHR,HNR - Two measures of ratio of noise to tonal components in the voice<br>
status - Health status of the subject PD or CTRL<br>
RPDE,D2 - Two nonlinear dynamical complexity measures<br>
DFA - Signal fractal scaling exponent<br>
spread1,spread2,PPE - Three nonlinear measures of fundamental frequency variation<br>
gender - Male or Female<br>

The dataset was adapted from:

*'Exploiting Nonlinear Recurrence and Fractal Scaling Properties for Voice Disorder Detection',
Little MA, McSharry PE, Roberts SJ, Costello DAE, Moroz IM.
BioMedical Engineering OnLine 2007, 6:23 (26 June 2007)*

https://www.kaggle.com/datasets/debasisdotcom/parkinson-disease-detection

#### Set index to name

In [None]:
# Let's set the name column as the row names
data.set_index("name", inplace=True) # By using inplace we are modifying the existing dataframe

print("Row names")
print(data.index)

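#### Get basic summary statistics
The `describe()` method reports:
- count (number of non-missing values)
- mean
- std (standard deviation)
- min
- 25%, 50% (median) and 75% quantiles
- max
- for categorical columns: the number of unique values, the most frequent value and its frequency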

In [None]:
# Get basic summary statistics from the dataframe
data.describe() # For all numeric columns we can view basic statistics - number of values, mean, standard deviation, min, max, quantiles

# Notice that some values must be missing, as not every column has 195 values

After running `describe()` we see only the numeric columns. Can we also check categorical or string columns?
What data types are there in pandas?

| pandas Data Type | Description                           | Python Equivalent     | Example Values                      |
|------------------|---------------------------------------|------------------------|-------------------------------------|
| `int64`          | Integer numbers                       | `int`                 | 0, 1, -20, 103                      |
| `float64`        | Decimal (floating-point) numbers      | `float`               | 3.14, -2.5, 0.0                     |
| `object`         | Text or mixed types (usually strings) | `str`                 | "Male", "Tremor", "Patient A"      |
| `bool`           | Boolean values                        | `bool`                | True, False                        |
| `datetime64`     | Date and time values                  | `datetime.datetime`   | 2023-06-01, 2025-10-15             |
| `category`       | Categorical data                      | pandas `Categorical`  | "Mild", "Moderate", "Severe"       |


In [None]:
# Let's check what types of data we have in the dataframe
data.dtypes

With `select_dtypes()` we can select only categorical columns

In [None]:
# Floats were covered by describe, but can we check the object type columns
# Yes we can, but we need to select the categorical columns (type: object)
data.select_dtypes(object).describe()

#### Handling missing data
Missing values present a challenge in data science, but they are a reality.
How can we handle missing data?
- Drop rows/columns with missing values
- Impute  - more about it later

In pandas we can use `dropna()` to drop rows or columns with missing values. Take a look at the documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html

What do the `axis=` and `how=` arguments control?
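
One way to explore these two arguments is to try them on a small table where you can see every value. The sketch below uses a made-up DataFrame (the column names `a`, `b`, `c` are hypothetical) and compares the shapes that result from different settings.

```python
import numpy as np
import pandas as pd

# Made-up table with some missing values
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
    "c": [7.0, 8.0, 9.0],
})

rows_any = toy.dropna(axis=0, how="any")   # drop rows with at least one NaN
rows_all = toy.dropna(axis=0, how="all")   # drop rows where every value is NaN
cols_any = toy.dropna(axis=1, how="any")   # drop columns with at least one NaN

print(rows_any.shape)
print(rows_all.shape)
print(cols_any.shape)
```

Without `inplace=True`, `dropna()` returns a new DataFrame and leaves `toy` unchanged, which is why the three calls can be compared side by side.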

In [None]:
# For today let's just drop all the rows that contain missing values - not always the best idea, as we might be losing lots of information
print("rows,columns")
print(data.shape) # shape allows us to see the number of rows and columns in the dataframe - returns a tuple (rows, cols)
data.dropna(inplace=True)
print("rows,columns")
print(data.shape)

You might have noticed that sometimes we use () and sometimes not. `data.dropna()` and `data.shape` . <br>
In Python, methods are functions associated with a specific object or class and are utilized to perform particular operations. For methods we use (). dropna() is a method. <br>
Attributes are data related to an object or class and can be used to store information or state. To get attributes, we do not use (). shape is an attribute.
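
The same distinction applies to built-in Python objects, not only to pandas. As a small illustration outside pandas, a complex number has an attribute `real` and a method `conjugate()`; the variable name `z` is just an example.

```python
z = 3 + 4j

print(z.real)          # attribute: stored data, no parentheses
print(z.conjugate())   # method: an action, called with parentheses
print(z.conjugate)     # without (), you get the method object itself, not its result
```

If a line of code prints something like `<built-in method ...>` or `<bound method ...>`, the usual cause is a forgotten pair of parentheses.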


### Exploring the dataframe
- extracting columns
- simple plotting

The best way to access rows and columns in pandas is to use `.loc[]` and `.iloc[]` (note the square brackets). The difference between the two is that `.loc[]` uses label-based indexing and `.iloc[]` uses position-based indexing.

We can also view first and last rows of the dataframe with `head()` or `tail()`.
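
The difference between label-based and position-based selection is easiest to see on a small table whose row labels are not numbers. The sketch below uses made-up values and hypothetical labels (`P01` to `P03`).

```python
import pandas as pd

# Made-up table with text row labels
toy = pd.DataFrame(
    {"age": [64, 58, 71], "status": ["PD", "CTRL", "PD"]},
    index=["P01", "P02", "P03"],
)

by_label = toy.loc["P02", "age"]     # row labelled "P02", column labelled "age"
by_position = toy.iloc[1, 0]         # second row, first column (counting from 0)

label_slice = toy.loc["P01":"P02"]   # label slices include the end label
position_slice = toy.iloc[0:2]       # position slices exclude the end position

print(by_label, by_position)
print(label_slice)
print(position_slice)
```

Both slices return the same two rows here, but for different reasons: the label slice stops at `P02` inclusive, while the position slice stops before position 2.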



In [None]:
# Take a look at the names of the columns
data.columns

In [None]:
# Accessing parts of the dataframe - using column names
# The : in the brackets indicates that we are selecting all rows.
data.loc[:,['MDVP:Fhi(Hz)', 'MDVP:Flo(Hz)']]

In [None]:
# We can also select specific rows and columns based on their labels
data.loc[["phon_R01_S01_2", "phon_R01_S50_5"],['MDVP:Fhi(Hz)','MDVP:Flo(Hz)']]

In [None]:
# We can also extract single columns
data['gender']

In [None]:
# We can also extract single columns
data.gender

In [None]:
# We can also extract single columns
data[["gender"]] # double square brackets to keep it as a dataframe

In [None]:
# Accessing parts of the dataframe - using row and column positions
data.iloc[[1,3,5],[2,3]]

In [None]:
# Let's view the first 3 rows of the data
data.head(3)

In [None]:
# Let's view the last 3 rows of the data
data.tail(3)

#### Basic plotting functionality in pandas

| Function          | Library | Description                                         | Example                                              |
| ----------------- | ------- | --------------------------------------------------- | ---------------------------------------------------- |
| `.plot()`         | pandas  | General-purpose plot (line by default)              | `df.plot()`                                          |
| `.plot.line()`    | pandas  | Line plot                                           | `df.plot.line(y="UPDRS_score")`                      |
| `.plot.bar()`     | pandas  | Vertical bar chart                                  | `df.plot.bar(x="patient", y="score")`                |
| `.plot.barh()`    | pandas  | Horizontal bar chart                                | `df.plot.barh(x="test", y="value")`                  |
| `.plot.hist()`    | pandas  | Histogram                                           | `df["age"].plot.hist(bins=10)`                       |
| `.plot.box()`     | pandas  | Box plot for distributions and outliers             | `df[["UPDRS1", "UPDRS2"]].plot.box()`                |
| `.plot.kde()`     | pandas  | Kernel density estimate (smooth histogram)          | `df["age"].plot.kde()`                               |
| `.plot.area()`    | pandas  | Area plot                                           | `df.plot.area()`                                     |
| `.plot.scatter()` | pandas  | 2D scatter plot                                     | `df.plot.scatter(x="age", y="UPDRS")`                |
| `.plot.pie()`     | pandas  | Pie chart (for single series or value counts)       | `df["diagnosis"].value_counts().plot.pie()`          |


Today we just introduce basic plotting options. More about plotting in future modules.

##### Simple scatterplot between two variables

In [None]:
# Plot simple scatterplot between different variables
data.plot(x="MDVP:Fo(Hz)", y="MDVP:Fhi(Hz)",kind="scatter"); # The semicolon at the end is a "trick" to avoid showing processing text, so plot looks nicer

##### Histograms to show the distribution of the data

In [None]:
# Plot histograms to look at the distribution of the data
# It shows numeric columns
data.hist(grid=False, bins=20,figsize=(14,10), layout=(4,6)); # With layout we can control number of rows and columns, with grid whether grids are shown on the plot or not

What if we would like to know value distributions for the categorical columns?
- We can convert them into numbers.
- We can use the `.value_counts()` method to count occurrences of different values
- Use `countplot()` from seaborn library

##### Using `.value_counts()` to check the distribution of categorical columns

In [None]:
# Checking the number in each category for categorical columns
for col in data.select_dtypes(object).columns: # get column names for the categorical columns - one can select columns by data type
    print(data.value_counts(col)) # count values in the categorical columns

##### Using `countplot()` from seaborn library

Seaborn is a library for data visualisation based on matplotlib. It is a very popular library for data science.
https://seaborn.pydata.org/tutorial.html 

In [None]:
# run %pip install seaborn if the library is not installed
import seaborn as sns

# With seaborn we can very easily plot the distribution of the categorical data
sns.countplot(data, x="status", hue="gender");

We can also use integrated seaborn and pandas functionality to look at the correlations between different variables

The `pairplot()` function will plot distributions and relationships between all numeric variables

In [None]:
# Note that due to the size of the dataset it takes ~10-30 sec to run this

sns.pairplot(data, hue="status");

##### Correlation heatmap

In order to plot correlation heatmap first we need to calculate correlations between different columns and then plot a heatmap.

Pandas has a built in function `.corr()` to calculate correlation coefficients.

Then we can use `heatmap()` from seaborn to plot it as a heatmap.


In [None]:
# Let's first calculate correlations
data.corr()

---
Oh no! We got an error. But Python tells us what the problem is.<br>
ValueError: could not convert string to float: 'PD'

Luckily there is a solution: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html

Let's check the documentation for the corr() function. From the documentation:<br> 
DataFrame.corr(method='pearson', min_periods=1, numeric_only=False)

We can change the default numeric_only from False to True and try again.

In [None]:
data.corr(numeric_only=True)

Now let's wrap it inside the heatmap function

In [None]:
sns.heatmap(data.corr(numeric_only=True)); 

# Seaborn allows us to customise the plots. Always check documentation. It also comes with examples, which can be very handy.
# https://seaborn.pydata.org/generated/seaborn.heatmap.html

By default pearson coefficient is used in the corr() function, but we can choose from:
 - pearson : standard correlation coefficient
 - kendall : Kendall Tau correlation coefficient
 - spearman : Spearman rank correlation

| Method     | Full Name                     | Type of Relationship Measured         | Handles Ranks? | Handles Nonlinear? | Use Case Example                                           |
|------------|-------------------------------|----------------------------------------|----------------|---------------------|-------------------------------------------------------------|
| `pearson`  | Pearson correlation coefficient| Measures **linear** relationships      | ❌ No           | ❌ No                | Compare age vs. UPDRS score assuming linearity             |
| `kendall`  | Kendall Tau correlation       | Measures **monotonic** relationships   | ✅ Yes          | ✅ Yes (monotonic)   | Rank-based correlation between symptom severity ratings     |
| `spearman` | Spearman rank correlation     | Measures **monotonic** relationships   | ✅ Yes          | ✅ Yes (monotonic)   | Association between patient rank in tremor vs rigidity     |


When to use which?
- Pearson for linear relationships between continuous, normally distributed variables.
- Spearman for monotonic relationships, non-normally distributed data, or ordinal variables, as it's robust to outliers.
- Kendall for monotonic relationships, especially with small datasets. It can be more robust than Spearman in those situations.
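
The difference between "linear" and "monotonic" can be seen on a made-up example where `y` always increases with `x`, but not along a straight line. This is only a sketch on artificial numbers (the column names `x` and `y` are hypothetical), not an analysis of the course dataset.

```python
import pandas as pd

# Artificial data: y grows with x, but as a curve rather than a straight line
toy = pd.DataFrame({"x": range(1, 11)})
toy["y"] = toy["x"] ** 3

pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]

print("Pearson:", round(pearson, 3))
print("Spearman:", round(spearman, 3))
```

Spearman works on the ranks of the values, so a perfectly increasing relationship gives a coefficient of 1, while Pearson is lower because the points do not lie on a straight line.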

Was the choice of pearson coefficient best for this data?

More about that in the next modules.

### Subsetting and sorting

We can subset/filter the dataframe based on a condition. This can be done with Boolean indexing.
In simple terms we create a sequence of booleans telling us whether to keep a row or column or not.
Let's look at an example.

We want to keep only records where gender column has a value "Female".


In [None]:
# This code creates a boolean sequence: True when gender is "Female" and False for any other value.
data.gender=="Female"
# We can use this sequence for filtering.

Let's now apply this to filter the dataframe

In [None]:
# Filtering the data frame to include only female participants
females=data.loc[data.gender=="Female",:]
females
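
Several conditions can be combined in one filter. The sketch below uses a made-up table with hypothetical column names; each condition sits in its own parentheses, with `&` meaning "and" and `|` meaning "or".

```python
import pandas as pd

# Made-up values; the column names are illustrative
toy = pd.DataFrame({
    "age": [55, 67, 72, 61],
    "gender": ["Female", "Male", "Female", "Female"],
})

# Each condition goes in its own parentheses; & means AND, | means OR
older_females = toy.loc[(toy["gender"] == "Female") & (toy["age"] > 60), :]
male_or_young = toy.loc[(toy["gender"] == "Male") | (toy["age"] < 60), :]

print(older_females)
print(male_or_young)
```

The parentheses matter: without them Python evaluates `&` before `==` and `>`, which usually ends in an error.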

We can also sort the dataframe rows based on values in a column using the pandas `sort_values()` method.

In [None]:
# Sorting the dataframe
data.sort_values(by="MDVP:Fo(Hz)", ascending=True) # We sort the dataframe on the MDVP:Fo(Hz) column, from low to high values

#### Saving the processed dataframe to a .tsv file

Again pandas offers a very simple solution: `to_csv()`. In order to save as a tsv (tab as delimiter) we need to change the separator value to "\t".

In [None]:
data.to_csv("../results/Parkinson_disease_processed.tsv", sep="\t")

---
### Pandas exercises
---

#### Task - explore another Parkinson's disease dataset

parkinsons_disease_data.csv

The dataset was obtained from: https://www.kaggle.com/datasets/rabieelkharoua/parkinsons-disease-dataset-analysis

In [None]:
# import dataframe from csv file

In [None]:
# How many rows and columns are in the dataframe?

In [None]:
# Are there any missing values? Generate summary statistics

# Look at numeric columns

# Look at categorical columns

In [None]:
# Select only rows with BMI between 19 and 25 and older than 60
# Use describe to verify

In [None]:
# Drop column DoctorInCharge as it is not informative

In [None]:
# How many different ethnicities are in the dataset, what are they and how many participants does each include?


In [None]:
# Plot distribution of numeric data on histograms, all columns


In [None]:
# Plot histograms but this time include only those columns that have non-binary (0,1) content


In [None]:
# Plot countplot with number of records per gender category


In [None]:
# Plot correlation heatmap but use only columns with non binary data - prepared in the step before


In [None]:
# Plot boxplot with x axis representing Diagnosis and y axis representing UPDRS score

---
### 9. Gentle introduction to numpy
---

NumPy is the fundamental package for scientific computing in Python. It is a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more.

 - NumPy arrays have a fixed size at creation, unlike Python lists (which can grow dynamically). Changing the size of an ndarray will create a new array and delete the original.

 - The elements in a NumPy array are all required to be of the same data type, and thus will be the same size in memory.  
 
 - NumPy arrays facilitate advanced mathematical and other types of operations on large numbers of data. 
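
The first two points can be checked directly. The sketch below is a minimal illustration with made-up numbers: mixing an integer with a decimal number produces a single common data type, and "adding" an element with `np.append()` returns a new array and does not grow the original.

```python
import numpy as np

# All elements share one data type: the integers are converted to floats
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)

# np.append returns a NEW array; the original keeps its size
original = np.array([1, 2, 3])
longer = np.append(original, 4)

print(original)
print(longer)
```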

| **Name**              | **Type**  | **Description**                                               |
| --------------------- | --------- | ------------------------------------------------------------- |
| `ndarray.shape`       | Attribute | Returns a tuple representing the dimensions of the array.     |
| `ndarray.ndim`        | Attribute | Returns the number of dimensions (axes) of the array.         |
| `ndarray.size`        | Attribute | Returns the total number of elements in the array.            |
| `ndarray.dtype`       | Attribute | Returns the data type of the array elements.                  |
| `ndarray.T`           | Attribute | Returns the transposed version of the array.                  |
| `np.array()`          | Function  | Creates a NumPy array from a Python list or tuple.            |
| `np.zeros()`          | Function  | Creates an array filled with zeros.                           |
| `np.ones()`           | Function  | Creates an array filled with ones.                            |
| `np.arange()`         | Function  | Creates an array with regularly spaced values within a range. |
| `np.reshape()`        | Function  | Changes the shape of an array without changing its data.      |
| `np.sum()`            | Function  | Computes the sum of array elements.                           |
| `np.mean()`           | Function  | Computes the mean (average) of the array elements.            |
| `np.std()`            | Function  | Computes the standard deviation.                              |
| `np.max()`            | Function  | Returns the maximum value in the array.                       |
| `np.min()`            | Function  | Returns the minimum value in the array.                       |


In [None]:
import numpy as np

#### 9.1. Creating arrays

In [None]:
# Create a 1D array
arr1 = np.array([1, 2, 3, 4])
print("1D array:", arr1)

# Create a 2D array
arr2 = np.array([[1, 2, 3], [4, 5, 6]])
print("2D array:\n", arr2)

# Create arrays with default values
zeros = np.zeros((2, 3))     # 2 rows, 3 columns of zeros
ones = np.ones((3, 2))       # 3 rows, 2 columns of ones


print("Zeros:\n", zeros)
print("Ones:\n", ones)

# Create a sequence of numbers
# Use the arange function - analogous to range, with start, stop and step

print("Ranges")
print(np.arange(10, 30, 5))
print(np.arange(4))


#### 9.2. Array operations

In [None]:
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Element-wise operations
print("a + b:", a + b)
print("a * b:", a * b)
print("a squared:", a ** 2)
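
Element-wise behaviour is one of the main differences between NumPy arrays and Python lists. The short comparison below uses made-up values to show that the same `*` operator repeats a list but multiplies the elements of an array.

```python
import numpy as np

values_list = [1, 2, 3]
values_array = np.array(values_list)

list_times_two = values_list * 2     # repeats the list
array_times_two = values_array * 2   # multiplies every element by 2

print(list_times_two)
print(array_times_two)
```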



#### 9.3 Basic numpy functions and array attributes

In [None]:
arr = np.array([[1, 2, 3], [4, 5, 6]])

print("Shape:", arr.shape)
print("Data type:", arr.dtype)
print("Max:", np.max(arr))
print("Min:", np.min(arr))
print("Sum:", np.sum(arr))
print("Mean:", np.mean(arr))
print("Standard Deviation:", np.std(arr))


#### 9.4 More complex but very useful numpy functions
 - `concatenate()` - function is used to join two or more arrays along an existing axis
 - `convolve()` - performs a discrete, linear convolution of two 1-dimensional sequences

In [3]:
# 1D Arrays
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Concatenate along axis 0
result_1d = np.concatenate((a, b))
print("1D Concatenation:", result_1d)

# 2D Arrays
c = np.array([[1, 2], [3, 4]])
d = np.array([[5, 6]])

# Concatenate along axis 0 (rows)
result_2d_axis0 = np.concatenate((c, d), axis=0)
print("2D Concatenation along axis 0:\n", result_2d_axis0)

# Concatenate along axis 1 (columns)
e = np.array([[7], [8]])
result_2d_axis1 = np.concatenate((c, e), axis=1)
print("2D Concatenation along axis 1:\n", result_2d_axis1)

1D Concatenation: [1 2 3 4 5 6]
2D Concatenation along axis 0:
 [[1 2]
 [3 4]
 [5 6]]
2D Concatenation along axis 1:
 [[1 2 7]
 [3 4 8]]


In [None]:
# Convolution "slides" the kernel over the signal and multiplies-and-sums overlapping values.

# Signal and kernel
signal = np.array([1, 2, 3])
kernel = np.array([0, 1, 0.5])

# Full convolution
full = np.convolve(signal, kernel, mode='full')
print("Full:", full)

# Same size as original
same = np.convolve(signal, kernel, mode='same')
print("Same:", same)

# Valid part only (no padding)
valid = np.convolve(signal, kernel, mode='valid')
print("Valid:", valid)

Full: [0.  1.  2.5 4.  1.5]
Same: [1.  2.5 4. ]
Valid: [2.5]
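
A common use of convolution is smoothing a noisy signal with a moving average. The sketch below assumes a made-up signal and a 3-point window; with `mode="valid"` only the positions where the window fully overlaps the signal are kept, so the result is shorter than the input.

```python
import numpy as np

noisy = np.array([1.0, 5.0, 3.0, 7.0, 5.0, 9.0])  # made-up signal
window = np.ones(3) / 3                             # 3-point averaging kernel

# Each output value is the mean of three neighbouring input values
smoothed = np.convolve(noisy, window, mode="valid")
print(smoothed)
```

A wider window gives a smoother result but loses more points at the edges and blurs fast changes in the signal.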


---
### 10. Introduction to functions
---

In this lesson, you will learn how to design, implement, and organise your code with functions. As you will see further on, a function is just a piece of code that lets you perform a specific task multiple times without duplicating any code.

### 10.1 Definitions

Every function has 2 parts: a header and a body. 
The header specifies the name of the function and its argument(s).
The body specifies the work that the function does.

For example, 


In [None]:
def multiply_by_five(input_variable):
    output_variable = input_variable * 5
    return output_variable

In this case, the header is:

**<font color="blue">def</font> <font color="black">multiply_by_five(input_variable):</font>**


and the body is the rest of the function:


&nbsp;&nbsp;&nbsp;&nbsp;**<font color="black">output_variable = input_variable * 5</font>**

&nbsp;&nbsp;&nbsp;&nbsp;**<font color="blue">return</font> <font color="black">output_variable</font>**

&rarr;  **<font color="blue">def</font>**: This is the keyword that tells Python we are about to define a function.

&rarr;  **<font color="black">multiply_by_five</font>**: This is the name you choose for your function. It can be whatever you like.

&rarr;  **<font color="black">(input_variable):</font>**: This is called the argument. It is the name of the variable the function will use. You can also have a function with no argument, or with multiple arguments. Arguments are always enclosed in parentheses that appear immediately after the name of the function. For every function, the parentheses enclosing the function argument(s) must be followed by a colon : .

&rarr;  **<font color="blue">return</font>**: This is the keyword that tells Python we are about to exit a function.

&rarr;  **<font color="black">output_variable</font>** is the value that is returned when the function is exited. A function can return as many values as you like.
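
Returning more than one value is done by separating the values with commas after `return`. The function below is a hypothetical example (the name `min_and_max` is not part of the course material); the values come back together as a tuple, which can be unpacked into separate variables.

```python
def min_and_max(values):
    smallest = min(values)
    largest = max(values)
    return smallest, largest   # returned together as a tuple

result = min_and_max([3, 1, 4])
print(result)

# The tuple can be unpacked into two variables in one step
low, high = min_and_max([3, 1, 4])
print(low, high)
```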

### 10.2 How to run (or "call") a function

In the code below, we run the function with 10 as the input value. 
It is important to note that variables created inside the function are not accessible outside it. In the other direction, the cleanest way to give a function the values it needs is to pass them in as arguments, rather than relying on variables defined outside the function.

Let's see an example of these rules.

In [None]:
# first, we can define our original variable:

my_number = 10
print(my_number)

In [None]:
# now, we can call the function with that input value, and obtain the output:

new_number = multiply_by_five(my_number)
print(new_number)

What we have learned so far:
1. We have to define the function (run the cell containing its `def` block) **BEFORE** we call it. Developers often place all the functions at the top of the document, before the code that uses them.

2. The name of the function should only use lowercase letters, with words separated by underscores and **NOT** spaces. As for the variables, the functions should not start with a number; e.g. 5_multiply() would be wrong.

In [None]:
def 2_print_hello():
    print("Hello World!")

2_print_hello()
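
The cell above stops with a `SyntaxError` because the function name starts with a digit. A name can also be checked without triggering an error, using the string method `str.isidentifier()`. The candidate names below are made up for illustration.

```python
candidate_names = ["multiply_by_five", "2_print_hello", "print hello", "print_hello_2"]

valid_names = []
for name in candidate_names:
    # isidentifier() is True only if the text follows Python's naming rules
    print(name, "->", name.isidentifier())
    if name.isidentifier():
        valid_names.append(name)

print(valid_names)
```

A digit is allowed anywhere except the first position, and spaces are never allowed.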

Naming, designing and calling functions will feel natural over time. It is normal for it to feel uncomfortable at first. To help with this, let's go over some easy examples.

### 10.3 Examples

##### 10.3.1 Functions with no arguments.

In [None]:
def my_function():
  print("Hello world!")

my_function()

In [None]:
# Try it yourself! Write a function that prints "Hello World!" 5 times.

# Add the code here

my_function()

In [None]:
# Those functions are called "void" functions, as they do not return any values.
# You can also return a value from a function.

# The variables inside the function that are not returned will not be accessible outside the function, e.g.:
def my_function():
    string = "Hello World!"
    number = 5    # randomly create a variable to test this
    return string

string = my_function()
print(string)
print(number)  # this line raises a NameError
# as you can see, the returned string is accessible outside the function, but number is not


In [None]:
# Functions cannot be empty: the definition below has no body, so running this cell raises an error.
# If you need a function definition with no content, add the pass statement (next cell) to avoid the error:
def my_function():

In [None]:
def my_function():
    pass

# this does not return anything, but is not an error
my_function()

In [None]:
print(my_function())
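
Printing the call above shows `None`: a function that finishes without reaching a `return` statement hands back the special value `None`. A minimal sketch with a hypothetical function `say_hello`:

```python
def say_hello():
    print("Hello World!")  # prints, but does not return anything

returned_value = say_hello()

print(returned_value)          # the value handed back by the function
print(returned_value is None)  # check explicitly that it is None
```

This is why storing the result of a "void" function in a variable rarely does what you want.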

##### 10.3.2 Functions with strings

In [None]:
def add_surname(name):
    return name + " Smith" # we don't have to create a variable. We can just return the result.

print(add_surname("John"))
print(add_surname("Michael"))
print(add_surname("Paul"))

In [None]:
# we can also see that if we add the wrong number of arguments, we will get an error:
print(add_surname("John", "Michael"))

It is important that you understand how to read and understand these errors and not rely on AI to tell you what they mean!
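
One way to practise reading an error is to look at its two parts: the error type and the message. The sketch below is an illustration under assumptions: it redefines `add_surname` so the block is self-contained, and it uses `try`/`except` to capture the error instead of stopping the cell. The exact wording of the message can differ between Python versions.

```python
def add_surname(name):
    return name + " Smith"

try:
    add_surname("John", "Michael")  # two arguments, but only one parameter
    error_type = None
except TypeError as error:
    error_type = type(error).__name__
    print("Error type:", error_type)  # what kind of problem occurred
    print("Message:   ", error)       # what Python says went wrong
```

The type tells you the category of the problem (here, a call that does not match the function definition) and the message gives the details.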

In [None]:
# Try it yourself! Write a function that replaces the base T with U in a DNA sequence. 

# Add the code here


replace_t_u("ATGTTC")

In [None]:
# Try it yourself! Write a function that cleans a sentence of punctuation marks.
    # HINT: there is a python string method that replaces specific characters with another set of characters. 
    # https://docs.python.org/3/library/stdtypes.html#str.replace - look for str.replace() to read the documentation.

# Add the code here

clean_sentence(string = "Hello,world!&My_name|is:John")

In [None]:
# You can also pass a list as an argument the same way you would pass a string.
def list_items(items):
    print(items)

list_items(["pineapples", "banana", "oranges"])

##### 10.3.3 Arbitrary arguments. Choosing from a tuple.

In [None]:
# If you are unsure how many arguments will be passed into your function,
# you can add a * before the parameter name in the function definition. 

# the function then receives all the arguments packed into a tuple:
def find_youngest(*ages):
    # ages[-1] is the last element; it is the youngest here only because the ages are passed in descending order
    print("The youngest person is", ages[-1], "years old")

find_youngest(12, 10, 8, 6)
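
To see what the `*` actually does, the sketch below (with a hypothetical function `describe_ages`) returns the type and length of the collected arguments, plus the smallest value. It also shows that an existing list can be spread into separate arguments by putting `*` in front of it in the call.

```python
def describe_ages(*ages):
    # all positional arguments arrive packed into one tuple
    return type(ages).__name__, len(ages), min(ages)

print(describe_ages(12, 10, 8, 6))

age_list = [31, 25, 47]
# the * in the call unpacks the list into three separate arguments
print(describe_ages(*age_list))
```

Using `min()` rather than an index means the result does not depend on the order in which the ages are passed.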

In [None]:
# Try it yourself! Create a function that will find the average of heart rate measurements, given any number of measurements.

# Add the code here


average = avg_heart_rate(120, 70, 80, 68, 55, 90)
print(average)

In [None]:
# Try it yourself! Create a function that will find the total count of a specific base in any number of given DNA sequences.


total_match = count_bases("ATCGCG", "GGGTTTGCACT", "ACCGCTAGG", matched_base='C')
print(total_match)

##### 10.3.4 Keyword arguments

In [None]:
# If you have a function with many parameters, you can use keyword arguments to pass the correct values to the function.
def sum_and_multiply(sum1, sum2, sum3, multiply2, multiply3):
    sum_result = sum1 + sum2 + sum3

    multiply_result = sum_result * multiply2 * multiply3

    return multiply_result

result = sum_and_multiply(sum1=1, sum2=2, sum3=3, multiply2=4, multiply3=5)
print(result)

In [None]:
# an interesting thing to note is that the order of the arguments does not matter as long as they are named.

# Try it yourself!

result2 = sum_and_multiply(multiply3=5, sum2=2, sum1=1, sum3=3, multiply2=4)
print(result2)



In [None]:
# Now, try removing the names of some of the arguments, and see what happens.
# Uncomment the next two lines to see the error. While they are active, the SyntaxError stops the whole cell from running.
# result3 = sum_and_multiply(1, sum2=2, 3, multiply2=4, multiply3=5)
# print(result3)

# If you get an error (positional argument follows keyword argument), that's okay. 
# Unnamed (positional) arguments are matched to parameters by their order, so they must all come before any named (keyword) arguments.
# If you want to remove some names, and not others, you can only remove the names of the first arguments. 
# For example, this will work:
result4 = sum_and_multiply(1, 2, 3, multiply2=4, multiply3=5)
print(result4)


In [None]:
# but this will not:
result5 = sum_and_multiply(1, 2, 3, multiply2=4, 5)
print(result5)

##### 10.3.5 Arbitrary keyword arguments (kwargs)

In [None]:
# If you do not know how many keyword arguments will be passed into your function, add two asterisks (**) before the parameter name in the function definition.
# This way the function will receive a *dictionary* of arguments, and can access the items accordingly:

def sum_squares(**numbers):
    # numbers is a dictionary: {"num1": 2, "num2": 4, "num3": 6}
    # only num1 and num2 are used here; num3 is accepted but ignored
    result = numbers["num1"]**2 + numbers["num2"]**2
    return result
result = sum_squares(num1=2, num2=4, num3=6)
print(result)
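
The example above reads two fixed keys from the dictionary. As a hedged variation, the sketch below uses every keyword argument it receives, however many there are. `sum_all_squares` is a hypothetical name, not part of the original notebook.

```python
def sum_all_squares(**numbers):
    # numbers is a dictionary such as {"num1": 2, "num2": 4, "num3": 6}
    total = 0
    for value in numbers.values():
        total += value ** 2
    return total

result_all = sum_all_squares(num1=2, num2=4, num3=6)
print(result_all)
```

Looping over `.values()` means the function does not need to know the argument names in advance.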

In [None]:
# Try it yourself! Create a function that will add any amount of metadata to a given participant and return the updated participant info in dictionary format.

# Add your code here


participant_info = add_metadata("S-10", gender="female", age=25, height=170, weight=60)
print(participant_info)


In [None]:
# Try it yourself! Create a function that will accept any set of channels (e.g. gNa, gK, gLeak, etc.) and return the total and the normalized fraction per channel

# HINT: to access the arguments in the functions, you can use the .items() method. https://docs.python.org/3/library/stdtypes.html#dict.items
# for example, if you have a dictionary and want to access the keys and the values:
my_dict = {
    "key1": "value1",
    "key2": "value2",
    "key3": "value3",
    "key4": "value4"
}

for key, value in my_dict.items():
    print(key, value)


# Add your code here


total, fractions = channel_fractions(gNa=120, gK=36, gLeak=0.3)
print(f"Total conductance: {total}")
print(f"Fractions: {fractions}")


##### 10.3.6 Default parameter value

In [None]:
# Sometimes you will want to give a parameter a default value that can be overridden when the function is called.
def sum_squares(num1=5, num2=10):
    result = num1**2 + num2**2
    return result

result = sum_squares() # the default values that are used are 5, and 10, respectively. 
print(f'The result with the default values is {result}') 

result = sum_squares(num2 = 20) # we can change only one of the values
print(f'The result with num2=20 is {result}') 

result = sum_squares(4, 3) # or override both values
print(f'The result with num1=4, and num2=3 is {result}') 

In [None]:
# You can set the default value to any data type, such as dictionaries, lists, tuples, boolean values, etc. 
# Let's practice with a boolean value. Try it yourself! Create a function that ignores a specific base or bases of a DNA sequence. If multiple bases are given, ignore all matching bases. No need to match the order.

def ignore_bases(sequence, *bases, ignore=True):
    if ignore:
        for base in bases:
            sequence = sequence.replace(base, "")                
    return sequence


original_sequence = "ATCGATCGATCG"
print(original_sequence)

new_sequence = ignore_bases(original_sequence, "A", "T", "C", ignore=True)
print(new_sequence)

# now, make sure it works with a single base 
new_sequence2 = ignore_bases(original_sequence, "A", ignore=True)
print(new_sequence2)
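
One caveat when the default value is a list or dictionary: Python creates the default once, when the function is defined, so a mutable default is shared between calls. The sketch below illustrates this with two hypothetical functions. Using `None` as the default and creating the list inside the function is a common way around it.

```python
def add_sample_shared(sample, samples=[]):
    # the same default list is reused on every call
    samples.append(sample)
    return samples

def add_sample_fresh(sample, samples=None):
    # create a new list on each call unless one is passed in
    if samples is None:
        samples = []
    samples.append(sample)
    return samples

shared_first = list(add_sample_shared("S-01"))   # copy, to keep a snapshot
shared_second = list(add_sample_shared("S-02"))
fresh_first = add_sample_fresh("S-01")
fresh_second = add_sample_fresh("S-02")

print(shared_first, shared_second)
print(fresh_first, fresh_second)
```

Immutable defaults such as numbers, strings, booleans and tuples do not have this problem.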

##### 10.3.7 Calculate the total number of occurrences of a base (A, C, T, G) in a given DNA sequence 

In [None]:
def count_base(sequence, base):
    base_count = 0  # We have to initialise the variable that will count the numbers

    # To loop through the sequence, we have to loop from 0 until the end. 
    # First, we have to find the length of the sequence.
    sequence_length = len(sequence)

    for idx in range(sequence_length):
        seq_base = sequence[idx] # this is the base for the current index.
        if seq_base == base: # if the current base matches the target base, we add 1 to base_count
            base_count += 1
    
    return base_count

# We have now defined the function. Let's define the variables:
sequence1 = 'ACGTACGTACGT'
sequence2 = 'TTATCGACTTC'

base1 = 'C'

sequence1_count = count_base(sequence1, base1)
print(f"In the first sequence, {sequence1}, the base {base1} appears {sequence1_count} times")

# we can also pass the base directly, without first storing it in a variable
sequence2_count = count_base(sequence2, "T")
print(f"In the second sequence, {sequence2}, the base 'T' appears {sequence2_count} times")
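
The loop above walks through the sequence by index to show how counting works. Two shorter forms give the same result: looping over the characters directly, and the built-in string method `str.count()`. The function name `count_base_direct` below is hypothetical.

```python
def count_base_direct(sequence, base):
    base_count = 0
    for seq_base in sequence:  # a string can be looped over directly
        if seq_base == base:
            base_count += 1
    return base_count

sequence1 = "ACGTACGTACGT"

loop_count = count_base_direct(sequence1, "C")
builtin_count = sequence1.count("C")  # built-in equivalent for single bases

print(loop_count, builtin_count)
```

Writing the loop yourself is useful for learning. In everyday code the built-in method is usually the simpler choice.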


It is good practice to document the function you are creating for future use, especially if you are working with multiple people on a project. You need to state what the arguments are, what the function does, and what it returns. If working with multiple variables and data structures, additional documentation would include the shape, size or type of the array that it returns.

In [None]:
# one way to document the function is the following:
def count_base(sequence, base):
    ''' 
    Calculate the total number of occurrences of a base in a DNA sequence.
    
    Args: 
        sequence: str, the DNA sequence to be analysed
        base: str, the base to count in the sequence
    
    Returns:
        base_count: int, the total number of occurrences in the sequence

    '''
    base_count = 0 
    sequence_length = len(sequence)

    # Loop through every base in the sequence
    for idx in range(sequence_length):
        seq_base = sequence[idx]

        # Match the current base to the desired base
        if seq_base == base: 
            base_count += 1
    
    return base_count
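
A docstring is more than a comment: Python stores it on the function, so it can be read back later through the `__doc__` attribute. A minimal sketch, using a shortened hypothetical function `gc_count` so the block is self-contained:

```python
def gc_count(sequence):
    '''
    Count the G and C bases in a DNA sequence.

    Args:
        sequence: str, the DNA sequence to be analysed

    Returns:
        int, the number of G and C bases in the sequence
    '''
    return sequence.count("G") + sequence.count("C")

print(gc_count.__doc__)  # the raw docstring text
print(gc_count("ACGTACGTACGT"))
```

In a notebook, calling `help(gc_count)` shows the same text together with the function signature.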

In [None]:
# As you can see, we used the function "range()" to loop through the sequence. Let's practice with this.
x = range(3, 10)
print(x)

In [None]:
# This creates a sequence of each of the numbers from 3 to 9:
for i in x:
    print(i)

In [None]:
# if we want both the position (index) and the value of each element:
for idx, value in enumerate(x):
    print("index: ", idx, " value: ", value)

Remember that python indexing starts from 0. 

In [None]:
# We can also add a step to the range. For example, let's print all the even numbers from 0 up to (but not including) 10:
x = range(0, 10, 2)
for i in x:
    print(i)

In [None]:
# we can also access the elements as a list by converting the range to a list:
print(list(x))

In [None]:
# We can also slice a range to access only the first 5 elements:
x = range(0, 20, 2)
y = x[:5]
print('Original list: ', list(x))
print('First 5 elements of the list: ', list(y))
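
A few more properties of `range()` are worth knowing. The sketch below is a short illustration using only built-in behaviour: the stop value is never included, a negative step counts downwards, and `len()` works without converting to a list.

```python
evens = range(0, 10, 2)
print(list(evens))        # the stop value 10 is not included
print(len(evens))         # number of elements, no list needed

countdown = range(10, 0, -2)
print(list(countdown))    # a negative step counts downwards

print(8 in evens, 10 in evens)  # membership tests work on a range
```

A range stores only its start, stop and step rather than every number.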

##### 10.3.8 Examples with increasing complexity comprising all these concepts

In [None]:
# Convert a DNA sequence to RNA (by replacing T with U in the sequence)
def convert_to_rna(dna):
    rna = dna.replace("T", "U")
    return rna 

rna = convert_to_rna("ATGTTCA")
print(f'The RNA sequence is {rna}')

# what if the string is not clean?
rna = convert_to_rna("atgTtca")
print(f'The RNA sequence is {rna}')

# it will only convert the EXACT matches. Try creating a function that first converts all the bases to uppercase:

# Add your code here

print(f'The RNA sequence is {rna}')

In [None]:
# Count the total number of codon occurrences in a large DNA sequence. Adapt to count more than one codon.

# Add your code here


count_codons("ATGAAAATGCCCAATGGHS", "ATG", "GCC")

In [None]:
# Access the centre point of a nested array (centroid).

array = [[1, 2, 3, 4, 5], 
         [4, 5, 6, 3, 7], 
         [7, 8, 9, 7, 1]]

centroid, value = access_centroid(array)
print("The centroid location in the array is: ", centroid)
print("The value at the centroid is: ", value)

In [None]:
# What if the nested array is missing some values?
array = [[1, 2, 3,     ], 
         [4, 5, 6, 3, 7], 
         [7, 8, 9, 7   ]]

centroid, value = access_centroid(array)
print("The centroid location in the array is: ", centroid)
print("The value at the centroid is: ", value)

# We can clean the data by imputing the missing values with various methods. You can try it with the mean of the values in that row ("mean") or of the whole array ("total_mean").
def clean_array(array, method="mean"):
    width = max(len(row) for row in array)
    total_mean = sum(sum(row) for row in array) / sum(len(row) for row in array)
    for row in array:
        if len(row) < width:
            number_missing_cols = width - len(row)
            row_mean = sum(row) / len(row)  # computed once, before any value is imputed
            # append exactly one value per missing column (note: this modifies the input array in place)
            for _ in range(number_missing_cols):
                if method == "mean":
                    row.append(int(row_mean))
                elif method == "total_mean":
                    row.append(int(total_mean))
    return array

cleaned_array = clean_array(array, "total_mean")
print("The new cleaned array is: ", cleaned_array)

centroid, value = access_centroid(cleaned_array)
print("The centroid location in the cleaned array is: ", centroid)
print("The value at the centroid is: ", value)

In [None]:
# Analyse the spike trains of an input neuron. Adapt for multiple neurons. Return the mean, median and standard deviation of each spike train, plus the number of spikes per unit time.
        # HINT: you can add nested functions inside another function to further modularise your code. In this example, we have added four nested functions to calculate the four different metrics.

def analyze_spike_train(*trains, start, end):
    # Add smaller, locally-available functions to calculate each metric
    def calculate_mean(train):
        return sum(train) / len(train)

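    def calculate_median(train):
        ordered = sorted(train)
        middle = len(ordered) // 2
        if len(ordered) % 2 == 1:
            return ordered[middle]
        # even number of values: average the two middle values
        return (ordered[middle - 1] + ordered[middle]) / 2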
    
    def calculate_std(train):
        mean = calculate_mean(train) # calculate the mean of the train 
        return (sum((x-mean)**2 for x in train) / len(train))**0.5
    
    def sampling_frequency(train, start, end):
        # number of spikes divided by the recording duration, i.e. the mean firing rate
        return len(train) / (end - start)


    data = {}
    for idx, train in enumerate(trains):
        data[f"Train {idx+1}"] = {}
        data[f"Train {idx+1}"]["mean"] = calculate_mean(train)
        data[f"Train {idx+1}"]["median"] = calculate_median(train)
        data[f"Train {idx+1}"]["std"] = calculate_std(train)
        data[f"Train {idx+1}"]["sampling_freq"] = sampling_frequency(train, start, end)

    return data

    

train1 = [0.05, 0.10, 0.30, 0.305, 1.2]
train2 = [0.02, 0.50, 0.70, 1.1, 1.4, 1.41]

data = analyze_spike_train(train1, train2, start=0.0, end=1.5)
print(data)
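
When writing statistics by hand, it helps to cross-check against a trusted implementation. The sketch below is a suggestion rather than part of the original exercise: it uses Python's standard `statistics` module, where `pstdev` is the population standard deviation (dividing by the number of values, as `calculate_std` does). The helper name `reference_stats` is hypothetical.

```python
import statistics

def reference_stats(train):
    return {
        "mean": statistics.mean(train),
        "median": statistics.median(train),  # averages the two middle values for even lengths
        "std": statistics.pstdev(train),     # population standard deviation
    }

train1 = [0.05, 0.10, 0.30, 0.305, 1.2]
train2 = [0.02, 0.50, 0.70, 1.1, 1.4, 1.41]

for name, train in [("Train 1", train1), ("Train 2", train2)]:
    print(name, reference_stats(train))
```

If your own functions and the reference disagree, check the even-length median and whether the standard deviation divides by `n` or `n - 1`.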
