
![dsl_logo](https://github.com/BrockDSL/RDM_Jupyter_Workshop/raw/main/dsl_logo.png)

# RDM in Jupyter: The importance of keeping your data reproducible


This session takes a deep dive into some research data management (RDM) best practices when developing in a Jupyter environment. The focus is on ensuring the reproducibility of an analysis and bundling up code and data for use by others. We will examine this in two ways: moving your project to GitHub, and remixing/extending work that already exists. Participants will need a GitHub account for the session; you can create one [here](https://github.com/join).

# First a word...

As we've discovered, Jupyter is a wonderful environment that gives us something close to a _virtual machine_ that emulates a Linux environment pretty well. Notebooks themselves are pretty much just web pages that show us the results of code and shell execution.

We are going to rely on that for the two parts of this workshop. We are just going to be hitting the run button in some pre-made cells (with some modifications) but in practice you could probably fire up a terminal window in your environment and run some scripts that will accomplish these steps.

## Loading in our Libraries

I like to put these in their own cell right off the bat.

In [None]:
# Run this cell to load libraries

#my binder doesn't have pandas out of the box, so we'll load it up
#!pip -q install pandas

import requests
import pandas as pd
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
from textblob import TextBlob
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import warnings


## Functions

The cell below has the function that will be our workhorse. It will fetch the contents of a URL, keep just the text by stripping off the HTML tags, and score that text with VADER.

In [None]:

#some settings we'll use later
warnings.simplefilter(action='ignore', category=FutureWarning)
sid = SentimentIntensityAnalyzer()

# Helper Function that will fetch content for us from a URL and score it
# Uses VADER sentiment analysis
def process_url(url):
    
    res = requests.get(url)
    html_page = res.content
    soup = BeautifulSoup(html_page, 'html.parser')
    # `string=True` selects every text node; it replaces the deprecated `text=True`
    text = soup.find_all(string=True)

    output = ''
    blacklist = [
        '[document]',
        'noscript',
        'header',
        'html',
        'meta',
        'head', 
        'input',
        'script',
    ]

    for t in text:
        if t.parent.name not in blacklist:
            output += '{} '.format(t)
        
    score = sid.polarity_scores(output)
    return [url, score['neg'], score['neu'], score['pos'], score['compound']]
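
To see what the filtering step in `process_url` does without touching the network, here is a minimal sketch that applies the same idea to a small inline HTML string. It is a stand-in, not the function above: it uses the standard library's `html.parser` instead of BeautifulSoup, and the names `TextKeeper`, `visible_text`, `SKIPPED_PARENTS` and `VOID_TAGS` are made up for this illustration.

```python
from html.parser import HTMLParser

# Same parent tags that process_url skips
SKIPPED_PARENTS = {"[document]", "noscript", "header", "html",
                   "meta", "head", "input", "script"}
# Tags that never get a closing tag, so they are never a "parent"
VOID_TAGS = {"meta", "input", "br", "img", "link", "hr"}


class TextKeeper(HTMLParser):
    """Collects text whose immediate parent tag is not in SKIPPED_PARENTS."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of currently open tags
        self.chunks = []      # pieces of text we decided to keep

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()

    def handle_data(self, data):
        parent = self.open_tags[-1] if self.open_tags else "[document]"
        if parent not in SKIPPED_PARENTS and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    parser = TextKeeper()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample = ("<html><head><title>Demo</title><script>var x = 1;</script></head>"
          "<body><p>ChatGPT is <b>great</b> news.</p></body></html>")
print(visible_text(sample))
```

The script contents are dropped because their parent tag is `script`, while the page title survives because `title` is not in the skip list. The same thing happens in `process_url`.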

## The plan

For this part of the workshop we are going to:
- generate a data set
  - identify a list of URLs we are interested in
  - score their sentiment with VADER
- create a data codebook that describes the data
- create a README file describing our project
- save our data set to GitHub
- modify our data set in GitHub by adding to it

## GitHub Prep

Let's start by getting the GitHub side of things done. We need to do two things:

- Create a repository for our material. We'll be using the GitHub web interface for this
- Create a GitHub security token. We'll also be using the GitHub web interface for this

### Create a Repository

First create a [new repository](https://github.com/new) and put its URL in the `github_url` variable below. Set `gh_username` to your own GitHub username.

In [None]:
gh_username = "elibtronic"
github_url = "https://github.com/elibtronic/rdm_workshop.git"


#Some parts we'll need later
clone_url = github_url.replace("https://","@")
gh_folder = github_url.split("/")[4].split(".")[0]
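
The two helper values above come from simple string surgery on the repository URL. As a rough sketch of what they contain, and of a slightly more defensive way to derive them, the snippet below parses the URL with the standard library. The function name `describe_repo` and the second URL are hypothetical and only here for illustration.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def describe_repo(github_url):
    """Split a GitHub HTTPS URL into the pieces the clone step needs."""
    parsed = urlparse(github_url)
    owner, repo = PurePosixPath(parsed.path).parts[1:3]
    # Drop only a trailing ".git", so dots inside the name survive
    folder = repo[:-4] if repo.endswith(".git") else repo
    return {
        "owner": owner,
        "folder": folder,
        # Equivalent to github_url.replace("https://", "@")
        "clone_suffix": "@" + parsed.netloc + parsed.path,
    }


print(describe_repo("https://github.com/elibtronic/rdm_workshop.git"))

# Hypothetical repository name containing dots
print(describe_repo("https://github.com/someone/my.data.repo.git"))
```

For the workshop repository both approaches agree. They differ only when the repository name itself contains a dot, where splitting on `"."` would keep just the first part of the name.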

### Create a token

You'll first need to configure a [GitHub token](https://github.com/settings/personal-access-tokens/new). Be sure to configure it so that it only works against the repository you just created, then paste it into `gh_token` below.

In [None]:
gh_username = "elibtronic"
gh_token = ""
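
If you would rather not leave a token typed into a cell of a notebook that you might share later, one option is to read it from an environment variable and fall back to a hidden prompt. This is only a sketch: the helper `load_token` and the variable names `GH_TOKEN` and `GH_TOKEN_DEMO` are assumptions made for this example, not something the workshop requires.

```python
import os
from getpass import getpass


def load_token(env_var="GH_TOKEN", prompt=getpass):
    """Return the token from the environment, or ask for it without echoing."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        token = prompt("Paste your GitHub token: ").strip()
    return token


# Demo with a placeholder value so that nothing real is ever printed
os.environ["GH_TOKEN_DEMO"] = "not-a-real-token"
demo_token = load_token("GH_TOKEN_DEMO")
print("token loaded, length:", len(demo_token))
```

In the notebook you would then write `gh_token = load_token()` in place of the empty string.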


# Connect our repository

Now that the repository is created we want to **clone** it into our current Jupyter environment. Once that is done we can add files to it, and **push** the changes back to GitHub. 

### Cloning Your Repository

In [None]:
!git clone https://$gh_username:$gh_token$clone_url
%cd $gh_folder
!ls -lah

## Generating your Data

Add in as many URLs as you like, following the pattern shown here: keep each one inside quotes and follow it with a comma. These are the URLs we are going to analyze. I'm doing some work on news articles that talk about ChatGPT.

In [None]:
url_list = [
    "https://www.cbc.ca/news/canada/new-brunswick/chatgpt-academia-cybersecurity-1.6733202",
    "https://www.cbc.ca/news/canada/hamilton/chatgpt-school-cheating-1.6734580",
]

The next cell will harvest the URLs in the list above, score them using [VADER](https://medium.com/@piocalderon/vader-sentiment-analysis-explained-f1c4f9101cd9) and then create a Pandas DataFrame with the results. VADER returns four scores (negative, neutral, positive and compound), as we will see.

In [None]:
# Create Data Frame

# Collect one row per URL, then build the DataFrame in one go
# (DataFrame.append was removed in pandas 2.0)
rows = []

for url in url_list:
    print("processing... ", url)
    try:
        rows.append(process_url(url))
    except Exception as err:
        print("Couldn't download URL:", err)

data = pd.DataFrame(rows, columns=["url","neg","neu","pos","compound"])
print("Done")
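
To help read the scores: `neg`, `neu` and `pos` are the proportions of the text that VADER rates as negative, neutral and positive, while `compound` is a single normalized score between -1 and +1. A common convention, suggested by VADER's authors, is to call a text positive when `compound` is at least 0.05 and negative when it is at most -0.05. The sketch below applies that rule to a few made-up rows. The helper `label_compound` and the example values are hypothetical and are not results from the URLs above.

```python
def label_compound(compound, threshold=0.05):
    """Turn a VADER compound score into a coarse label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Made-up rows in the same order as our columns: url, neg, neu, pos, compound
example_rows = [
    ["https://example.org/a", 0.05, 0.80, 0.15, 0.62],
    ["https://example.org/b", 0.20, 0.75, 0.05, -0.41],
    ["https://example.org/c", 0.02, 0.96, 0.02, 0.01],
]

for url, neg, neu, pos, compound in example_rows:
    print(url, label_compound(compound))
```

With the real data you could add the labels as a new column, for example `data["label"] = data["compound"].apply(label_compound)`.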

### Data so far

Run the next cell to print out the dataframe to screen.

In [None]:
data

### Analyze

Visualizations are cool... Let's make a simple graph of average scores

In [None]:
avg_neg = data['neg'].mean()
avg_neu = data['neu'].mean()
avg_pos = data['pos'].mean()

plt.pie([avg_neg,avg_neu,avg_pos], labels = ["Negative Scores", "Neutral Scores", "Positive Scores"])
plt.title("Average Scores per column")
plt.show()

## Save our Data

Now that we have a dataset that we built from scratch, let's turn it into a CSV file so that we can put it into our repository.

In [None]:
#Write-out dataframe to CSV file
data.to_csv("url_data.csv", index=False)
!ls -lah
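
Before committing a data file it can be worth reading it back to confirm that it has the columns you expect and that the scores parse as numbers. The following is a small sketch of such a check using only the standard library. The helper `check_csv` and the two sample rows are made up for the example, and the file is written to a temporary folder so that it does not touch `url_data.csv`.

```python
import csv
import tempfile
from pathlib import Path

EXPECTED_COLUMNS = ["url", "neg", "neu", "pos", "compound"]


def check_csv(path, expected_columns=EXPECTED_COLUMNS):
    """Return the number of data rows, raising if the file looks wrong."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        if reader.fieldnames != expected_columns:
            raise ValueError(f"unexpected columns: {reader.fieldnames}")
        rows = list(reader)
    for row in rows:
        for column in expected_columns[1:]:
            float(row[column])  # raises ValueError if a score is not numeric
    return len(rows)


# Demo on a throwaway file with made-up values
with tempfile.TemporaryDirectory() as folder:
    demo_path = Path(folder, "demo_data.csv")
    with open(demo_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(EXPECTED_COLUMNS)
        writer.writerow(["https://example.org/a", 0.05, 0.80, 0.15, 0.62])
        writer.writerow(["https://example.org/b", 0.20, 0.75, 0.05, -0.41])
    row_count = check_csv(demo_path)

print("rows found:", row_count)
```

In the notebook you could call `check_csv("url_data.csv")` right after the cell above.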

### Codebook

Modify the text that is between the `"""` and `"""` areas to create your codebook. Describe what each column of the CSV file contains.

Once you run the cell it will save your results into a text file that we will add to our repository.

In [None]:
#Generate codebook

data_code_book = """

Columns of the data are:

xxx -
xxx - 
xxx -
xxx -

"""


# write() is the idiomatic call for saving a single string
with open("data_code_book.txt","w") as data_code_book_file:
    data_code_book_file.write(data_code_book)
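
If you are unsure what to write, here is one possible way to fill in the codebook, sketched as a small helper that builds the text from a dictionary of column descriptions. The descriptions reflect how VADER defines its scores. The names `COLUMN_DESCRIPTIONS` and `build_code_book` are made up for this example, and you should adjust the wording to suit your own project.

```python
COLUMN_DESCRIPTIONS = {
    "url": "Address of the web page that was downloaded and scored",
    "neg": "Proportion of the text VADER rated as negative (0 to 1)",
    "neu": "Proportion of the text VADER rated as neutral (0 to 1)",
    "pos": "Proportion of the text VADER rated as positive (0 to 1)",
    "compound": "Overall VADER score, from -1 (most negative) to +1 (most positive)",
}


def build_code_book(descriptions):
    """Format one 'column - description' line per column."""
    lines = ["Columns of the data are:", ""]
    lines += [f"{name} - {text}" for name, text in descriptions.items()]
    return "\n".join(lines) + "\n"


print(build_code_book(COLUMN_DESCRIPTIONS))
```

The returned string can be assigned to `data_code_book` and saved exactly as in the cell above.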

### Readme


Modify the text that is between the `"""` and `"""` areas to create a README file for your project. If you know Markdown, feel free to use that.

Once you run the cell it will save your results into a text file that we will add to our repository.

In [None]:
#Generate README.md

readme = """

# RDM Test dataset

This is a pretend dataset that I've created for a workshop

"""

# write() is the idiomatic call for saving a single string
with open("README.md","w") as readme_file:
    readme_file.write(readme)

### Last peek

Run the next cell to look at our project files one last time.

In [None]:
#last look at our files

!pwd
!ls -lh

## Staging in GitHub

Now that we've created our files, we want to get them into GitHub. That is a two-stage process: first we stage the files, then we commit and push them.

In [None]:
!git status

In [None]:
#We 'add' all the files that we have been working on into the repository
!git add data_code_book.txt
!git add url_data.csv
!git add README.md
!git status

## Pushing to GitHub

We have now staged our files; we just need to commit and push them. To finalize our git commit we'll add a commit message. As before, add some text between the `"""` and `"""` areas.

In [None]:
#Set Commit Message

commit_message = """

A commit message is usually a simple description of what changed in the repository.

"""

## Did it work?

Run the following cell, then go check your repository in the GitHub web interface. It should have some files and a commit associated with them.

In [None]:
!git commit -am "$commit_message"
# Push through the 'origin' remote, which carries the token from the clone step
!git push origin HEAD
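
If you want to watch the stage-then-commit sequence without risking your real repository, the sketch below replays it in a throwaway local folder using Python's `subprocess` module. Nothing is pushed anywhere. The helper names `run_git` and `demo_stage_and_commit`, and the demo author name and email, are placeholders invented for this example.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def run_git(args, cwd):
    """Run one git command inside `cwd` and return what it printed."""
    done = subprocess.run(["git", *args], cwd=cwd,
                          capture_output=True, text=True, check=True)
    return done.stdout


def demo_stage_and_commit():
    """Create, stage and commit one file, recording git status at each step."""
    with tempfile.TemporaryDirectory() as repo:
        run_git(["init", "-q"], repo)
        Path(repo, "README.md").write_text("# RDM Test dataset\n", encoding="utf-8")
        untracked = run_git(["status", "--porcelain"], repo)

        run_git(["add", "README.md"], repo)
        staged = run_git(["status", "--porcelain"], repo)

        # Placeholder identity, supplied only for this one commit
        run_git(["-c", "user.name=Workshop Demo",
                 "-c", "user.email=demo@example.com",
                 "commit", "-q", "-m", "Add README"], repo)
        committed = run_git(["status", "--porcelain"], repo)
    return untracked, staged, committed


if shutil.which("git"):
    steps = ("after creating", "after git add", "after git commit")
    for step, status in zip(steps, demo_stage_and_commit()):
        print(f"{step}: {status.strip() or '(nothing to report)'}")
else:
    print("git is not installed here, so the demo was skipped")
```

In the short status format, `??` marks a file git is not tracking yet and `A` marks a file that has been staged. Once the commit is made, there is nothing left to report.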

## Do it again!

Return to the cell right after **Generating your Data** and add a couple of URLs to the list. Re-run all the following cells and see if you can commit an updated dataset to your repository!

# End of Part 1

I hope this section of the workshop has highlighted the need for proper RDM practices to make sure your data is safely tracked and available to others who wish to replicate your findings. In part 2 we'll modify someone else's code and analysis to demonstrate how we can move the scholarly conversation along.