
![dsl_logo](https://github.com/BrockDSL/RDM_Jupyter_Workshop/raw/main/dsl_logo.png)

# RDM in Jupyter: The importance of keeping your data reproducible


This session takes a deep dive into research data management best practices when developing in a Jupyter environment. The focus is on ensuring that analysis is reproducible and on bundling code and data for use by others. We examine this in two ways: moving your project to GitHub, and remixing or extending work that already exists. Participants will need a GitHub account for the session; one can be created [here](https://github.com/join).

# Part 2 - Recreating the work of others


In this part of the workshop we'll clone someone else's repository to see if we can fix a problem with their results.

## Hmmmmm

A colleague has told us that there is something fishy in the results reported from a research project they came across...

[https://github.com/tribaric/Super_Star_Research](https://github.com/tribaric/Super_Star_Research)

Let's have a look at this repository to see if we can spot any problems. Once you are ready to proceed, start running the cells below.

In [None]:
#Libraries to Load

#My binder doesn't seem to have pandas in the base install!
#!pip -q install pandas

import pandas as pd
import matplotlib.pyplot as plt

## Forking a Repository

Our first step is to create a copy of the repository we are interested in:

- Navigate to [https://github.com/tribaric/Super_Star_Research](https://github.com/tribaric/Super_Star_Research)
- Click the 'Fork' button at the top of the screen and follow the steps to fork.
- Just like in part 1, copy the URL of **your** version of the repository and add it to the variable definition for `github_url` below.


In [None]:
#GitHub Configuration
gh_username = "elibtronic"
github_url = "https://github.com/elibtronic/Super_Star_Research.git"

## GH Token

We also need a GH token to do our work. We can re-use the one from part 1. Paste it in and run the next cell. See the full [instructions](https://github.com/BrockDSL/RDM_Jupyter_Workshop/raw/main/Token_instructions.pdf) if you need to create one.

In [None]:
gh_token = ""

## GH Branch

We will need to **branch** our repository to make our own changes. We'll set the branch name in the `gh_branch` variable. Include something specific to you in the name so we can identify it later, but avoid spaces and punctuation. I'm going to call mine `tim_branch`.
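
As a quick sanity check on the naming advice above, here is a minimal sketch. The helper `is_simple_branch_name` is hypothetical (it is not part of the workshop code) and is deliberately stricter than Git itself, which also accepts characters such as hyphens and slashes in branch names.

```python
import re

def is_simple_branch_name(name: str) -> bool:
    """Return True if name uses only letters, digits, and underscores."""
    # Hypothetical helper: stricter than Git's own rules, to match the workshop advice.
    return re.fullmatch(r"[A-Za-z0-9_]+", name) is not None

print(is_simple_branch_name("tim_branch"))  # True
print(is_simple_branch_name("tim branch"))  # False: contains a space
print(is_simple_branch_name(""))            # False: empty name
```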

In [None]:

gh_branch = ""


#Some parts we'll need later
clone_url = github_url.replace("https://","@")
gh_folder = github_url.split("/")[4].split(".")[0]
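
To see what the two helper variables in the cell above hold, here is a small sketch using the example URL from this notebook. The token value is a placeholder, not a real token, and `full_url` is an illustrative name for the address that the later `git clone` cell assembles.

```python
github_url = "https://github.com/elibtronic/Super_Star_Research.git"
gh_username = "elibtronic"
gh_token = "TOKEN_PLACEHOLDER"  # placeholder only, never commit a real token

# Swap the scheme for "@" so username:token can be placed in front of the host
clone_url = github_url.replace("https://", "@")

# The fifth "/"-separated piece is "<repo>.git"; keep the part before the first dot
gh_folder = github_url.split("/")[4].split(".")[0]

# Illustrative: the address the clone command ends up using
full_url = f"https://{gh_username}:{gh_token}{clone_url}"

print(clone_url)  # @github.com/elibtronic/Super_Star_Research.git
print(gh_folder)  # Super_Star_Research
print(full_url)   # https://elibtronic:TOKEN_PLACEHOLDER@github.com/elibtronic/Super_Star_Research.git
```

Note that `split(".")[0]` would truncate a repository name that itself contains a dot, so this derivation assumes a name without one.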

## Clone your version of the repository

Run the next cell to clone your version of the repository so that you can explore it and modify it.

In [None]:
# Clone your fork, authenticating with your username and token

!git clone https://$gh_username:$gh_token$clone_url
%cd $gh_folder

## Branch work

We'll switch to our branch and work within that.

In [None]:
!git checkout -b $gh_branch

## Setting Remote

We need to tell Git about the original repository we forked from by adding it as a remote named `upstream`. Just run the following cell; we don't need to modify anything.

In [None]:
!git remote add upstream https://github.com/tribaric/Super_Star_Research

In [None]:
# One last look at git status.
!git status

## Let's investigate

We have a feeling that there are some problems with the data and/or the analysis. Let's see what we can find... It looks like the diagram on the `README` page is incorrect. Let's correct that. The repository includes the code used to generate the image. Let's copy that into the next cell and modify it until the graph is correct.

In [None]:
#this cell will load up the data into a dataframe and display it

dataset = pd.read_csv("gpt_article_analysis.csv")
dataset

In [None]:
# Some modifications to the data?
# this is the code direct from the bad repository README file
# can you modify it so that the graph ends up being correct?

labels = ['cnn','cbc','bbc']

plt.bar(labels, dataset["compound"])
plt.title("Sentiment of news sources")
plt.ylabel("Score")
plt.xlabel("News Source")
plt.savefig("graph.png")
plt.show()
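
One thing worth checking when a plot uses hard-coded labels is whether they line up with the row order of the data. The sketch below uses made-up stand-in data: the column name `source` and all values are assumptions for illustration and may not match the real CSV, so inspect `dataset` yourself before drawing conclusions.

```python
import pandas as pd

# Made-up stand-in data; the real CSV's columns and values may differ.
example = pd.DataFrame({
    "source": ["bbc", "cnn", "cbc"],   # hypothetical column name
    "compound": [0.10, -0.20, 0.30],   # made-up scores
})

hard_coded = ["cnn", "cbc", "bbc"]        # labels typed by hand
from_data = example["source"].tolist()   # labels read from the rows themselves

# Pair each hand-typed label with the label its bar actually belongs to
mismatches = [(h, d) for h, d in zip(hard_coded, from_data) if h != d]

print(hard_coded == from_data)  # False
print(mismatches)               # [('cnn', 'bbc'), ('cbc', 'cnn'), ('bbc', 'cbc')]
```

Taking labels from the data rather than typing them by hand keeps each bar attached to the row it came from.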

## Make changes in GH

Now that we've modified our files, we want to stage them with Git.

In [None]:
!git status

In [None]:
!git add .
!git status

## Pull Request to Github

Now that we've identified the problem, we want to create a **pull request** to correct the original repository. We just need to commit and push our changes, then complete the pull request in the GH web interface.

In [None]:
#Set Commit Message

commit_message = """


"""
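
Git normally refuses to commit when the message is empty or only whitespace, so the template above needs some text before the next cell will succeed. A minimal sketch of one possible message follows; the wording and the helper `is_usable_message` are illustrative assumptions, not part of the workshop code.

```python
# Illustrative message: a short summary line, a blank line, then details
example_message = """Correct the sentiment graph in the README

Describe here what was wrong and what you changed.
"""

def is_usable_message(message: str) -> bool:
    """Return True if the message contains something other than whitespace."""
    return bool(message.strip())

print(is_usable_message(example_message))  # True
print(is_usable_message("\n\n\n"))         # False: the untouched template
```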

In [None]:
!git commit -am "$commit_message"


In [None]:
#!git push $github_url
!git push -u origin $gh_branch