# Reproducible Research

## What is reproducibility?

As the name suggests, reproducibility, or reproducible research, means that the work that you have done can be reproduced by others. Though the idea might seem simple, it can actually take quite a bit of work. However, it is well worth doing, because the benefits of reproducible research can be vast. 

## Why reproducibility?

The first and foremost reason that we want to emphasize reproducibility in science is that this is simply how science should be done. It is important that the work you do be verifiable so that others can confirm it. Having all of your code easily accessible lends much more credibility to your work, because nothing is being obfuscated or hidden. The overall practice of reproducible research also means that you can go in and see what others have done to confirm or refute their findings. It also helps others build on the work that you've done and allows for more efficient collaboration.

Making sure that your work is reproducible can be very helpful for you too. Going back to your work after a break is much easier if you practice reproducible research strategies. Code is much easier to read and the workflow is much clearer when you follow these strategies.

## How do you *practice* reproducible research?

One of the easiest ways to make sure that your work is reproducible is something we have been doing throughout this class already: using Jupyter notebooks! More specifically, using Jupyter notebooks in conjunction with the cloud environment means that we can share our work and know that someone else can run the code we used and get the same result. In our classes, this lets me share code and show step by step what is going on within a specific process. Jupyter also allows me to provide descriptions in narrative text in Markdown cells along with the code in code cells.

> **NOTE FOR FINAL PROJECT**: These are all techniques you should be employing in your final projects as well! Make sure you describe what is going on within your code, and follow the proper practice of making sure that I can reproduce what you are doing if I were to run your code!

### Describing Code

However, just using Jupyter notebooks isn't enough. We need to make sure that the code we write is readable and accessible by others. As an example, let's take a look at a block of code that we've used before. 

In [1]:
def data_function(year, census_key):
    census_base_url = f'https://api.census.gov/data/{year}/acs/acs1/profile'
    census_params = {'get':'NAME,DP02_0001E,DP03_0087E,DP03_0002PE,DP02_0068PE,DP02_0066PE',
                     'for':'county:*',
                     'key': census_key}
    r = get(census_base_url, params = census_params)
    people_by_county = r.json()
    keys = ['county', 'num_households','mean_income','percent_employed','percent_bachelors','percent_graduate']
    census_dict = {key:[county[keys.index(key)] for county in people_by_county[1:]] for key in keys}
    census_df = pd.DataFrame(census_dict)
    census_df[keys[1:]] = census_df[keys[1:]].apply(pd.to_numeric)
    county_state = census_df.county.str.split(', ', expand = True).rename({0:'county', 1:'state'}, axis = 1)
    return county_state.merge(census_df.iloc[:,1:], right_index = True, left_index = True)

There's a lot going on here! You might be able to piece together what is going on eventually, but it might be difficult at first glance. Not only that, it might be easy to make a mistake in interpreting what is going on. One of the most important steps in writing replicable code is making sure that the comments and descriptions of the code make it easy to decipher and understand what is going on. For example, let's take a look at the same code, except with more descriptions.

In [2]:
def get_county_data(year, census_key):
    '''
    Gets county-level data from the ACS using the Census API
    
    Arguments:
        year: Year for which the data should be pulled
        census_key: str, the Census key to use to pull from the API
        
    Returns:
        A DataFrame
    '''
    census_base_url = f'https://api.census.gov/data/{year}/acs/acs1/profile'
    census_params = {'get':'NAME,DP02_0001E,DP03_0087E,DP03_0002PE,DP02_0068PE,DP02_0066PE',
                     'for':'county:*',
                     'key': census_key}
    # Pull from the API
    r = get(census_base_url, params = census_params)
    people_by_county = r.json()
    
    # Get the data into dictionary format
    keys = ['county', 
            'num_households',
            'mean_income',
            'percent_employed',
            'percent_bachelors',
            'percent_graduate']
    census_dict = {key:[county[keys.index(key)] for county in people_by_county[1:]] for key in keys}
    
    # Convert to DataFrame
    census_df = pd.DataFrame(census_dict)
    
    # Change numeric values to numeric
    census_df[keys[1:]] = census_df[keys[1:]].apply(pd.to_numeric)
    county_state = census_df.county.str.split(', ', expand = True).rename({0:'county', 1:'state'}, axis = 1)
    
    return county_state.merge(census_df.iloc[:,1:], right_index = True, left_index = True)

In [4]:
help(get_county_data)

Help on function get_county_data in module __main__:

get_county_data(year, census_key)
    Gets county-level data from the ACS using the Census API

    Arguments:
        year: Year for which the data should be pulled
        census_key: str, the Census key to use to pull from the API

    Returns:
        A DataFrame



This function does the exact same thing, but it's much more descriptive and easier to follow. 

- **The function name is more descriptive.** It is closer to what the actual function is doing, making it easier to follow code that uses this function as well.
- **There is a docstring.** This provides a summary of what the function does, not only allowing you to follow what is going on when defining the function, but also giving you a way to look it up while using the function later on.
- **There are comments throughout the function.** These comments notably outline each step of the process in digestible chunks, making it much easier to see what each step does. 

#### Tips for making sure that your code is readable

After you have written the code to do the analysis you need to do, ask yourself these questions:
- **Has every step of the process been explained in words rather than just the code?** That is, if you were to remove the code, can you follow what is happening?
- **Is the code easy to read?** Are there parts that you can make easier to follow by adding spaces, giving helpful names, etc.?


## Making generalizable code

Another key feature of reproducible code is generalizability: we want to avoid writing code that makes too many assumptions about where it's running, what data are available, or what code might have run before. One common place this issue might show up would be when we have code that needs to reference a file in a specific directory. For instance, if I need to load a file in my documents folder, I might write something like this:

In [19]:
import pandas as pd

cah_data = pd.read_csv('/home/nlund/BSOS326-Spring25/1. In-Class Work/Week 04 - Pandas/201807-CAH_PulseOfTheNation_Raw.csv')

...but obviously this is going to pose problems because, barring a truly lucky coincidence, another user won't have a directory path that starts with `/home/nlund/`.

Instead of writing an absolute path, I'll want to write file paths to be **relative to the directory where my code is running.** My directory is structured so that the notebook I'm running sits in its own folder inside `1. In-Class Work/`, alongside `Week 04 - Pandas/`.

So I'll need to go "up" one level to get to the parent directory `1. In-Class Work/` and then go back down into `Week 04 - Pandas/` so I can access the relevant file. In a file path, the double dots `..` indicate going up one level from the current directory. So my correct relative path would be:

In [32]:
path = '../Week 04 - Pandas/201807-CAH_PulseOfTheNation_Raw.csv'
cah_data = pd.read_csv(path)

I know this will work because we're all sharing `BSOS326-Spring25/`, even if that folder might be located in a different place on your computer.
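
As a small sketch of the same idea, the relative path above can also be assembled with `pathlib` from the standard library, which joins the pieces for you instead of relying on a hand-typed string. The folder and file names are copied from the example above; nothing here touches the disk.

```python
from pathlib import Path

# Folder and file names are copied from the example above.
data_path = Path("..") / "Week 04 - Pandas" / "201807-CAH_PulseOfTheNation_Raw.csv"

# as_posix() renders the path with forward slashes on every operating system.
print(data_path.as_posix())

# A relative path is resolved against the current working directory at run time.
print(data_path.is_absolute())
```

`pd.read_csv` accepts a `Path` object as well as a string, so `data_path` could be passed to it directly.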

If you're not sure how to specify the relative path for a file, you can use `os.path.relpath` from the `os` module to turn an absolute file location into a relative one:

In [34]:
import os
# getting the absolute location (this depends on your username)
abspath = os.path.expanduser("~") + '/BSOS326-Spring25/1. In-Class Work/Week 04 - Pandas/201807-CAH_PulseOfTheNation_Raw.csv'
print("The absolute location is: ", abspath)
# turning the absolute path to a relative path
print("The relative location is: " , os.path.relpath(abspath))

The absolute location is:  C:\Users\neilb/BSOS326-Spring25/1. In-Class Work/Week 04 - Pandas/201807-CAH_PulseOfTheNation_Raw.csv
The relative location is:  ..\..\..\..\..\BSOS326-Spring25\1. In-Class Work\Week 04 - Pandas\201807-CAH_PulseOfTheNation_Raw.csv
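
By default, `os.path.relpath` measures from the current working directory, which is why the output above starts with a run of `..` when the code is run from somewhere outside the course folder. A minimal sketch of passing the optional `start` argument instead is below; the `Week 10` folder name is a hypothetical stand-in for wherever your notebook lives.

```python
import os

# Hypothetical layout: the notebook runs from a "Week 10" folder next to "Week 04 - Pandas".
course_dir = os.path.join("BSOS326-Spring25", "1. In-Class Work")
data_file = os.path.join(course_dir, "Week 04 - Pandas", "201807-CAH_PulseOfTheNation_Raw.csv")
notebook_dir = os.path.join(course_dir, "Week 10")

# `start` is the directory the result should be relative to
# (when omitted, the current working directory is used).
rel = os.path.relpath(data_file, start=notebook_dir)
print(rel)
```

The result goes up one level and back down into `Week 04 - Pandas`, using whichever path separator your operating system expects.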


Other ways to make code generalizable are:

- Store user-specific parameters like API keys in a separate file, then import them as variables. (We've done a version of this with the `keys.yaml` file already.)
- Parameterize commands that might need to change at some later date. If you want to make a function that always queries data from the last 2 weeks, then do something like:

In [39]:
from datetime import date, timedelta
# Get the current date
today = date.today()
two_weeks_ago = today - timedelta(days=14)
two_weeks_ago

datetime.date(2025, 4, 23)
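
One way to take the parameterizing idea a step further is to wrap the calculation in a function, so the length of the window (and, for testing, the reference date) can be changed without editing the code. This is only a sketch; the function name `start_of_window` and its arguments are made up for illustration.

```python
from datetime import date, timedelta

def start_of_window(days_back=14, today=None):
    """Return the date `days_back` days before `today` (defaults to the current date)."""
    if today is None:
        today = date.today()
    return today - timedelta(days=days_back)

# Passing a fixed date makes the result repeatable.
print(start_of_window(today=date(2025, 5, 7)))  # prints 2025-04-23

# Leaving `today` out uses the date on which the code is run.
print(start_of_window(days_back=30))
```

Keeping `today` as an optional argument means the same function works for everyday use and for checking a known answer.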

## Version Control with Git

One common method for implementing transparency and improving workflows in coding projects is **version control**. Version control is the practice of tracking changes to code over time. You can think of it as similar to how Google Docs keeps track of any changes that you make within its history. With version control, you can look back at previous versions of your work, see the progress as you add or change things, and even revert if a change wasn't desirable.

**Git** is a version control system that is widely used for coding projects. Working with Git involves setting up a **repository** and tracking files within that repository as you make changes. You can then easily make updates or revert changes within that repository with a few commands. It's like saving multiple versions of the same file, except you don't need to have different files for each of them. 

### GitHub

GitHub (https://github.com) is commonly used as a **remote repository**. A remote repository is just a repository that is stored somewhere other than your own local computer or workspace. Typically, a remote repository like GitHub is used to share your work with others and/or make it publicly available. This is most helpful when you are collaborating with others, because you can make changes to a project and others who have access to the remote repository can then see those changes without needing access to your own workspace.

### How Git Works

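**Step 1: Initialize a git repository** This creates a git repository so that you can start tracking files in it. Use `git init` to start a new repository in the current folder, or `git clone` to copy an existing remote repository to your workspace.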

    git init 
    
    git clone <remote repo>

**Step 2: Make changes and/or add files** Just adding files or making changes to files in the repository folder isn't enough to track them! You need to add the changes or files so that they are tracked. 

    git add <filename>
    
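**Step 3: Commit changes** This creates a checkpoint of all the changes you have added so far. Typically, you add your changes first, then commit whatever you have added. Running `git commit` on its own opens an editor where you write the commit message.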

    git commit
    
**Step 4: Push changes to remote repository** If you have a remote repository, you then push those changes so that the remote repository is the same as your local repository.

    git push

![Git](git.png)

## Using GitHub

You can make a free account at GitHub and create repositories that track the work that you do, then use them as a way to share that work.

> **Note:** This is particularly useful for resumes! Put coding projects that you want to showcase on your GitHub page and link to it on your resume!

GitHub also offers a lot of integrations and functionalities that make sharing your work easier. For example, we have been using the integration with Git and GitHub to share the material throughout the course with you! The link that you click automatically pulls from a GitHub repository, which is what brings in the latest material into your space. 

If you have a working repository with some files, you can import it into JupyterLab automatically with a URL. The basic structure will be:

`https://jupyter.umd.edu/hub/user-redirect/git-pull?repo=<URL OF REPO>&urlpath=lab/tree/<REPO NAME>&branch=main`


You can try importing [this repo](https://github.com/Neilblund/example_git/tree/main) by clicking the link below:

https://jupyter.umd.edu/hub/user-redirect/git-pull?repo=https://github.com/Neilblund/example_git&urlpath=lab/tree/example_git&branch=main
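
If you want to generate a link like this for your own repository, a short helper can fill in the template for you. The sketch below assumes the same hub address as the link above and assumes the folder name matches the last part of the repository URL; the function name `build_git_pull_link` is made up for illustration.

```python
from urllib.parse import urlencode

# Hub address copied from the link above; the helper name is hypothetical.
HUB_GIT_PULL = "https://jupyter.umd.edu/hub/user-redirect/git-pull"

def build_git_pull_link(repo_url, branch="main"):
    """Build a link that pulls `repo_url` into JupyterLab and opens its folder."""
    # Assumption: the folder created by the pull is named after the repository.
    repo_name = repo_url.rstrip("/").split("/")[-1]
    query = {
        "repo": repo_url,
        "urlpath": f"lab/tree/{repo_name}",
        "branch": branch,
    }
    # safe=":/" leaves colons and slashes unescaped, matching the link above.
    query_string = urlencode(query, safe=":/")
    return f"{HUB_GIT_PULL}?{query_string}"

print(build_git_pull_link("https://github.com/Neilblund/example_git"))
```

Called with the example repository, this reproduces the link shown above.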




## Git Resources

- Pro Git Book: https://git-scm.com/book/en/v2
- Git and GitHub Cheatsheet: https://training.github.com/downloads/github-git-cheat-sheet.pdf
- Git and GitHub Resources: https://docs.github.com/en/get-started/start-your-journey/git-and-github-learning-resources