## 🐍 Introduction to Reproducibility with Python

*Notebook by [Pedro V Hernandez Serrano](https://github.com/pedrohserrano)*


---
# 2. Reproducible Analysis in Jupyter Notebooks
* [2.1. Creating basic functions](#2.1)
* [2.2. Data structures](#2.2)
* [2.3. Sharing Jupyter Notebooks](#2.3)


---
## 2.1. Creating basic functions
<a id="2.1"></a>

A function is a block of organized, reusable code that can make scripts easier to read and simple to manage. We can think of functions as small, self-contained programs for specific tasks.

We have already used some functions, such as the `print()` command, a built-in Python function. 

**Making a function in Python**
- Begin the definition of a new function with **def**.
- Follow it with the name of the function.
    - The name must obey the same rules as variable names.
- Then list the parameters in parentheses.
    - Use empty parentheses if the function doesn’t take any inputs.
- Then add a colon.
- Finish with an indented block of code.

In [None]:
def print_greeting():
    print('Hello! Maastricht')

- Defining a function does not run it.
    - Like assigning a value to a variable.
- Must call the function to execute the code it contains.

In [None]:
print_greeting?

In [None]:
print_greeting()

- Functions are more useful when we can specify parameters when defining them.
    - Parameters become variables when the function is executed.
    - They are assigned the arguments in the call (i.e., the values passed to the function).
    - If you don’t name the arguments in the call, they are matched to parameters in the order the parameters are defined in the function.

In [None]:
str(1988) + '/' + str(9) + '/' + str(23)

In [None]:
def print_date(year, month, day):
    joined = str(year) + '/' + str(month) + '/' + str(day)
    print(joined)

In [None]:
print_date(1871, 3, 19)

In [None]:
print_date(month=3, year=1871, day=19)

Most importantly, we want to automate calculations with functions.

Complete the template below to write a function that adds 100 to a given value `x`.

```Python
def _____(x):
    value = x + _____
    return _____   
```
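
One possible way to fill in the template is sketched below. The function name `add_hundred` is a hypothetical choice for illustration; the exercise does not prescribe a name.

```python
def add_hundred(x):
    """Return x plus 100 (the name is chosen for illustration only)."""
    value = x + 100
    return value


# Because the function returns its result, the value can be stored and reused
result = add_hundred(5)
print(result)  # 105
```

Returning the value, rather than printing it inside the function, lets later cells use the result in further calculations.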


---
## 2.2. Data structures
<a id="2.2"></a>

* Pandas is a widely used Python library for handling data, particularly tabular data.
* It borrows many features from R's dataframes:
  - A dataframe is a two-dimensional table whose columns have names and can hold different data types.
* Load it with `import pandas as pd`. The alias `pd` is commonly used for Pandas.
* Pandas can read and write many file formats, such as CSV, Excel, JSON, and SQL.

![](https://pandas.pydata.org/docs/_images/02_io_readwrite.svg)

In [None]:
import pandas as pd

In [None]:
path = '../data/'

* The columns in a dataframe are the observed variables, and the rows are the observations.
* Pandas uses backslash `\` to show wrapped lines when the output is too wide to fit the screen.

**File Not Found:**

> Normally, data files are stored in a `data` sub-directory, which is why the path to the file is `data/...csv`. If you forget to include `data/`, or if you include it but your copy of the file is somewhere else, you will get a runtime error that ends with a line like this:

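`ERROR`: _FileNotFoundError: [Errno 2] No such file or directory: 'XXXXX.csv'_ (older versions of Pandas report _OSError: File b'XXXXX.csv' does not exist_)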
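
A minimal sketch of checking for the file before loading it, so that a wrong path produces a readable message instead of a traceback. The helper `load_csv_if_present` and the file name `example.csv` are hypothetical and only serve the illustration.

```python
from pathlib import Path

import pandas as pd


def load_csv_if_present(csv_path):
    """Return a DataFrame if the file exists, otherwise None (hypothetical helper)."""
    csv_path = Path(csv_path)
    if not csv_path.is_file():
        # Show the absolute path so it is clear where Python was looking
        print(f"File not found: {csv_path.resolve()}")
        return None
    return pd.read_csv(csv_path)


# 'example.csv' is a placeholder name, not a file shipped with this notebook
data = load_csv_if_present(Path('../data') / 'example.csv')
```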

- A DataFrame is a collection of Series.
- A DataFrame is how Pandas represents a table, and a Series is the data structure Pandas uses to represent a column.
- Pandas is built on top of the NumPy library, so most of the methods defined for NumPy arrays also apply to Pandas Series and DataFrames.
- In addition, Pandas provides methods for accessing individual records, handling missing values, and performing relational-database-style operations between DataFrames.

In [None]:
# Basic representation of a dataframe
dictionary = {'age': [20, 30], 
              'height': [1.80, 1.60], 
              'course':['Python', None]}

# Define a dataframe
df = pd.DataFrame(data=dictionary)

# Print the dataframe
df
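
The relationship between DataFrames, Series, and NumPy can be sketched as follows. The variable names `people` and `age_column` are illustrative only, and the data simply repeats the small example table.

```python
import numpy as np
import pandas as pd

people = pd.DataFrame({'age': [20, 30],
                       'height': [1.80, 1.60],
                       'course': ['Python', None]})

# Selecting a single column gives back a Series
age_column = people['age']
print(type(age_column))

# NumPy functions accept a Series directly
print(np.mean(age_column))  # 25.0

# Missing values can be counted per column
print(people.isna().sum())
```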

Use the `DataFrame.info()` method to find out more about a dataframe

In [None]:
df.info()

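* This is a DataFrame with two rows
* Columns named `age`, `height` and `course`
* `age` holds 64-bit integers, `height` holds 64-bit floating point values, and `course` holds non-numeric data (shown as `object` in most Pandas versions)
* `age` and `height` have two non-null values each, while `course` has only one because of the `None`
* Uses about 176 bytes of memory (the exact figure depends on the Pandas version)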

The `DataFrame.columns` attribute stores information about the dataframe's columns.

* Note that this is data, _not_ a method, so it doesn't have parentheses.
  - Like `math.pi`
  - So do not use `()` to call it

In [None]:
df.columns

Use `DataFrame.T` to transpose a dataframe:

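* Sometimes we want to treat columns as rows and vice versa.
* Transpose (written `.T`) swaps rows and columns. When all columns share one data type, it can often just change the program's view of the data; for a dataframe with mixed data types (like this one), the result is a copy whose values are converted to a common type.
* Like `columns`, it is a member variable, so it is used without parentheses.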


In [None]:
df.T

Use `DataFrame.describe()` to get summary statistics about the data.

By default, `DataFrame.describe()` summarises only the columns that hold numerical data. All other columns are ignored unless you pass the argument `include='all'`.

In [None]:
df.describe()
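
To illustrate the effect of `include='all'`, the sketch below compares both calls on the same small example table. The variable names are illustrative only.

```python
import pandas as pd

people = pd.DataFrame({'age': [20, 30],
                       'height': [1.80, 1.60],
                       'course': ['Python', None]})

# Default behaviour: only the numerical columns are summarised
numeric_summary = people.describe()
print(list(numeric_summary.columns))  # ['age', 'height']

# include='all' adds the non-numerical column as well
full_summary = people.describe(include='all')
print(list(full_summary.columns))  # ['age', 'height', 'course']
```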

Use a list of column names to select the desired columns and create a subset.

In [None]:
# Desired columns
my_columns = ['age','course']

# Selection of subset
df[my_columns]

Use `DataFrame.to_csv()` to save the analysed dataframe as a CSV file.

In [None]:
#df[my_columns].to_csv('this_is_an_example.csv', index=False)
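
A small round-trip sketch of what `to_csv()` writes and what `read_csv()` gets back. It writes into a temporary folder so that nothing is left on disk; the use of `tempfile` is an assumption made for the illustration, not part of the original workflow.

```python
import tempfile
from pathlib import Path

import pandas as pd

people = pd.DataFrame({'age': [20, 30],
                       'height': [1.80, 1.60],
                       'course': ['Python', None]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / 'this_is_an_example.csv'
    # index=False leaves the row labels (0, 1) out of the file
    people[['age', 'course']].to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# The missing course is written as an empty field and read back as NaN
print(reloaded)
```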

---
## 2.3. Sharing Jupyter Notebooks
<a id="2.3"></a>


- Notebooks posted on **GitHub** are rendered automatically in the browser, and public notebooks can also be viewed with **NBviewer:** [nbviewer.jupyter.org/](https://nbviewer.jupyter.org/)

- Uploading your work to **Google Colab** makes it shareable: [colab.research.google.com/](https://colab.research.google.com/)  
**Note:** Never upload personal data to Google Colab or any other Google Drive service: doing so counts as processing personal data and risks infringing data protection laws.

- Markdown cells can contain embedded links and images

>Add a link using the following pattern: `[link text](URL_or_relative_path)`
gives the clickable link: [Maastricht University](https://www.maastrichtuniversity.nl).

>Add an image using the following pattern: `![image alt text](URL_or_path)`
 embeds the following image: ![UM logo](https://logos-download.com/wp-content/uploads/2017/11/Maastricht_University_logo.png)

- Markdown cells can include LaTeX expressions with a `$` on either side.

> For example, `$e^x=\sum_{i=0}^\infty \frac{1}{i!}x^i$` is rendered as the inline $e^x=\sum_{i=0}^\infty \frac{1}{i!}x^i$ expression.

> Wrapping the expression with `$$` on either side forces it to be rendered on a new line in the centre of the cell: $$e^x=\sum_{i=0}^\infty \frac{1}{i!}x^i$$

- Checking Reproducibility

> One purpose of notebooks is to produce an executable document that can be rerun to reproduce the results.

> To run cells from scratch (i.e. from a fresh kernel), `Kernel -> Restart and Clear Output` and then run the cells you want.

> To run all the cells in the notebook from scratch: `Kernel -> Restart and Run All`

- Freezing the environment

> `! conda list --export > requirements.txt`

Or simply watermark the notebook:

> `!pip install -q watermark`

> Load the watermark extension
`%load_ext watermark`

> Show packages that were imported
`%watermark --iversions`
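
If neither conda nor the watermark extension is available, package versions can also be recorded with the standard library alone. The sketch below is one way to do it; the helper `report_versions` is hypothetical, and the package list is only an example.

```python
import sys
from importlib.metadata import PackageNotFoundError, version


def report_versions(package_names):
    """Map each package name to its installed version (hypothetical helper)."""
    report = {'python': sys.version.split()[0]}
    for name in package_names:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            report[name] = 'not installed'
    return report


# Print the versions in a requirements-like layout
for name, installed in report_versions(['pandas', 'numpy']).items():
    print(f'{name}=={installed}')
```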

- Notebook Licensing

[Attribution 4.0 International (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/)  
More tips at: [Reproducible science curriculum, Carpentries lesson](https://reproducible-science-curriculum.github.io/sharing-RR-Jupyter/LICENSE.html)

---

## 🙌🏼 Hands-on

The following exercise aims to practice the creation of modular Python functions.   
We will **develop functions to compute 1) Height in meters, 2) Body Mass Index (BMI), and 3) Body Fat Percentage (BFP)**. Finally, we will encapsulate these functions within a Python class and export them to a standalone `.py` file.

- **Exercise 1:** Develop a function that converts the input height **h** to meters, based on the unit given in **metric**

$$
\text{height in meters }(h, \text{metric}) = 
\begin{cases} 
h & \text{if metric = 'm'} \\ 
\frac{h}{100} & \text{if metric = 'cm'} \\
h \times 0.3048 & \text{if metric = 'ft'} \\
\text{Error} & \text{otherwise}
\end{cases}
$$

See: [Wikipedia - foot (unit)](https://en.wikipedia.org/wiki/Foot_(unit))

**Hint:**
```Python
def height_in_meters(height, metric):
    if metric == 'm':
        return height
    elif _____ :
        _____ _____
    elif _____ :
        _____ _____ 
    else:
        raise ValueError("Unknown metric unit")
```

In [None]:
# Answer
def height_in_meters(height, metric):
    if metric == 'cm':
        return round(height / 100, 2)  # Convert centimeters to meters
    elif metric == 'ft':
        return round(height * 0.3048, 2)  # Convert feet to meters
    elif metric == 'm':
        return height  # No conversion needed if already in meters
    else:
        raise ValueError("Unknown metric unit")

- Test your formula with the following values
```Python
height = 5.51
metric = 'ft'
```
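
As a rough check of the conversion, the sketch below restates the piecewise definition with a dictionary of conversion factors instead of an `if`/`elif` chain. This is an alternative formulation for illustration, not the notebook's answer; the names `TO_METERS` and `height_in_meters_lookup` are assumptions. Unlike the answer, it also rounds heights that are already in meters.

```python
# Conversion factors from each supported unit to meters
TO_METERS = {'m': 1.0, 'cm': 0.01, 'ft': 0.3048}


def height_in_meters_lookup(height, metric):
    """Convert a height to meters using a lookup table of factors."""
    if metric not in TO_METERS:
        raise ValueError("Unknown metric unit")
    return round(height * TO_METERS[metric], 2)


print(height_in_meters_lookup(5.51, 'ft'))  # 1.68
```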
---

- **Exercise 2:** Develop a function that calculates the body-mass index (BMI), body mass divided by the square of the body height

$$
\text{BMI }(\text{weight}, \text{height}) = \frac{\text{weight(kg)}}{\text{height(m)}^2}
$$

See: [Wikipedia - Body mass index](https://simple.wikipedia.org/wiki/Body_mass_index)

**Hint:**

```Python
def calculate_bmi(_____, _____):
    bmi = _____ / _____
    return bmi
```

In [None]:
# Answer
def calculate_bmi(weight, height):
    bmi = weight / (height ** 2)
    return round(bmi, 2)

- Test your formula with the following values
```Python
weight = 71
height = 1.77
```
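
The expected result for these test values can be worked out directly from the formula. The height has to be numeric: a string such as `'1.77'` must be converted with `float()` before it can be squared, which the sketch does explicitly.

```python
weight = 71               # kg
height = float('1.77')    # m, converted to a number before squaring

bmi = weight / height ** 2
print(round(bmi, 2))  # 22.66
```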
---

- **Exercise 3:** Develop a function that calculates the Body fat percentage (BFP); total mass of fat divided by total body mass, multiplied by 100:

$$
\text{BFP }(\text{fat mass}, \text{weight}) = \left(\frac{\text{fat mass (kg)}}{\text{weight(kg)}}\right) \times 100
$$

See: [Wikipedia - Body fat percentage](https://en.wikipedia.org/wiki/Body_fat_percentage#References)

**Hint:**

```Python
def _____ (_____, _____):
    _____ _____ _____
    return _____
```

In [None]:
# Answer
def calculate_bfp(fat_mass, weight):
    bfp = (fat_mass / weight) * 100
    return round(bfp, 2)

- Test your formula with the following values
```Python
weight = 87
fat_mass = 22
```
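
The same kind of hand check works for the body fat percentage, plugging the test values straight into the formula:

```python
weight = 87     # kg
fat_mass = 22   # kg

bfp = fat_mass / weight * 100
print(round(bfp, 2))  # 25.29
```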
---

- **Exercise 4:** Apply the previously created functions to create 3 new columns in the following synthetic dataset. The dataset contains 5 columns/variables.

Here are the first few rows of the synthetic dataset:

| ID  | Weight    | Fat_Mass  | Height  | Metric |
|-----|-----------|-----------|---------|--------|
| ID1 | 77.44 kg  | 10.29 kg  | 1.58 m  | m      |
| ID2 | 85.76 kg  | 20.48 kg  | 5.10 ft | ft     |
| ID3 | 80.14 kg  | 14.12 kg  | 1.83 m  | m      |

Generate new columns

| ID  | Weight    | Fat_Mass  | Height  | Metric | Height in meters | BMI | BFP |
|-----|-----------|-----------|---------|--------|--------|--------|--------|
| ID1 | 77.44 kg  | 10.29 kg  | 1.58 m  | m      | ? | ? | ? |
| ID2 | 85.76 kg  | 20.48 kg  | 5.10 ft | ft     | ? | ? | ? |
| ID3 | 80.14 kg  | 14.12 kg  | 1.83 m  | m      | ? | ? | ? |


**Hint:**

```Python
import pandas as pd

df = pd.read_csv('../data/synthetic_dataset.csv')

# Apply the formulas to create new variables/columns
df['Height_in_Meters'] = _____
df['BMI'] = _____
df['BFP'] = _____

# Descriptive statistics
print(df.describe())
```
---

- **Exercise 5:** Export the new dataset as a CSV file into the `data_output` folder and call it `"updated_synthetic_dataset.csv"`.

- **(BONUS) Exercise 6:** Generate histogram plots using the `seaborn` Python package ([seaborn.pydata.org/](https://seaborn.pydata.org/)).  
Save the images into the `data_output` folder too.

In [None]:
import pandas as pd

# Load the dataset
df = pd.read_csv('../data/synthetic_dataset.csv')

# Apply the formulas to create new variables
df['Height_in_Meters'] = df.apply(lambda x: height_in_meters(x['Height'], x['Metric']), axis=1)
df['BMI'] = df.apply(lambda x: calculate_bmi(x['Weight'], x['Height_in_Meters']), axis=1)  # Height was converted to meters in the previous step
df['BFP'] = df.apply(lambda x: calculate_bfp(x['Fat_Mass'], x['Weight']), axis=1)

# Perform simple descriptive statistics
print(df.describe())
print(df.describe(include=[object]))

# Save the updated DataFrame with new variables
df.to_csv('../data_outputs/updated_synthetic_dataset.csv', index=False)

print("Updated dataset saved successfully with new variables for height in meters, BMI, and BFP.")
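
The row-wise `apply` calls can also be written as column arithmetic, which is usually shorter and faster on large tables. The sketch below is an alternative under stated assumptions, not the notebook's solution: it rebuilds the three rows shown in the exercise table by hand and assumes the CSV stores plain numbers without unit suffixes. The names `sample` and `factors` are illustrative.

```python
import pandas as pd

# The three rows from the exercise table, typed in as plain numbers
sample = pd.DataFrame({
    'ID': ['ID1', 'ID2', 'ID3'],
    'Weight': [77.44, 85.76, 80.14],
    'Fat_Mass': [10.29, 20.48, 14.12],
    'Height': [1.58, 5.10, 1.83],
    'Metric': ['m', 'ft', 'm'],
})

# Map each unit to its conversion factor; an unknown unit becomes NaN
# here instead of raising an error as height_in_meters does
factors = {'m': 1.0, 'cm': 0.01, 'ft': 0.3048}
sample['Height_in_Meters'] = (sample['Height'] * sample['Metric'].map(factors)).round(2)

# Whole-column arithmetic replaces the row-by-row lambda calls
sample['BMI'] = (sample['Weight'] / sample['Height_in_Meters'] ** 2).round(2)
sample['BFP'] = (sample['Fat_Mass'] / sample['Weight'] * 100).round(2)

print(sample)
```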
