
![dsl_logo](https://github.com/BrockDSL/RDM_Jupyter_Workshop/raw/main/dsl_logo.png)

# RDM in Jupyter: The importance of keeping your data reproducible


This session will take a deep dive into some research data management best practices when developing in a Jupyter environment. The focus will be on ensuring reproducibility of analysis and bundling up code and data for use by others. This will be examined in two ways: moving your project to Github, and remixing/extending work that already exists. Participants will need a GitHub account for the session that can be created [here](https://github.com/join).

# Part 1 - Saving Your Work to Github

We're going to work through a pretend project where you have created both:

- code
- data

We are going to prepare and stage both of these pieces into a repository on github so that you can share it with other researchers and hopefully get a citation out of all your work.

In [2]:
#Libraries to Load

import pandas as pd

# Github

Now that we have our data and code all ready we want to get GitHub ready!

## Create a Repository

First create a [new repository](https://github.com/new) and put the URL in the box variable below

In [1]:
gh_username = "elibtronic"
github_url = "https://github.com/elibtronic/rdm_workshop.git"


#Some parts we'll need later
clone_url = github_url.replace("https://","@")
gh_folder = github_url.split("/")[4].split(".")[0]

## Create a token

You'll need to first configure a [Github Token](https://github.com/settings/personal-access-tokens/new). Be sure to configure it so that it only works against the repository your just created.

In [3]:
gh_username = "elibtronic"
gh_token = "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"

## Cloning Your Repository

In [4]:
## git init

!git clone https://$gh_username:$gh_token$clone_url

Cloning into 'rdm_workshop'...
remote: Enumerating objects: 12, done.[K
remote: Counting objects: 100% (12/12), done.[K
remote: Compressing objects: 100% (8/8), done.[K
remote: Total 12 (delta 0), reused 4 (delta 0), pack-reused 0[K
Unpacking objects: 100% (12/12), done.


## Getting your Data Ready

In [6]:
#Run this cell to automatically download a CSV file of 'data' that we'll modify
%cd $gh_folder
!wget https://borealisdata.ca/api/access/datafile/75156
!mv 75156 data_set.csv

[Errno 2] No such file or directory: 'rdm_workshop'
/Users/tim/Documents/RDM_Jupyter_Workshop/rdm_workshop
--2023-01-31 12:53:55--  https://borealisdata.ca/api/access/datafile/75156
Resolving borealisdata.ca (borealisdata.ca)... 142.1.121.150
Connecting to borealisdata.ca (borealisdata.ca)|142.1.121.150|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1238972 (1.2M) [text/tab-separated-values]
Saving to: '75156'


2023-01-31 12:53:55 (2.56 MB/s) - '75156' saved [1238972/1238972]



In [8]:
#Load our dataset
data_set = pd.read_csv("data_set.csv",delimiter="\t")
data_set.describe()

Unnamed: 0,Ngbd,SaleYearandMonth,Saletype,PrevSaleYearandMonth,SalePrice,LNSalePrice,PrevSalePrice,LNSalePriceLNPrevSalePrice,Saleyear1981,Saleyear1982,...,dYoungadultresidents,dFamilieswithatleastonechildathome,dOwneroccupiers,dDwellingunitsinneedofmajorrepairs,dResidentsmovingintoorwithinaDAduringpast5years,dAdultunemploymentrate,dBluecollarworkers,dUniversityeducatedadults,dAdultmedianincome,dVisibleminorityresidents
count,4495.0,4495.0,4495.0,4495.0,4495.0,4495.0,1575.0,1575.0,1999.0,1999.0,...,1575.0,1575.0,1575.0,1575.0,1575.0,1575.0,1575.0,1575.0,1575.0,1575.0
mean,0.555284,2000.896574,2.931479,1991.675457,86672.25406,11.238065,71983.987937,0.260346,0.001001,0.003002,...,0.546937,6.96318,-1.187886,0.138585,-0.599923,0.546404,2.593848,-1.983629,-1.016674,1.412656
std,0.49699,11.013255,1.0021,9.811376,44298.174962,0.538962,32445.99416,0.459625,0.122532,0.176131,...,2.090517,14.43565,7.585719,6.928623,12.452793,8.561403,9.624771,7.779842,5.08725,8.831253
min,0.0,1981.02,1.0,1980.99,5000.0,8.517193,6000.0,-1.492,-1.0,-1.0,...,-6.0,-26.0,-25.702128,-28.0,-61.0,-39.0,-32.0,-25.0,-31.0,-30.0
25%,0.0,1990.03,2.0,1985.03,57000.0,10.950807,48000.0,0.0,0.0,0.0,...,0.0,0.0,-2.589744,0.0,-1.0,0.0,0.0,-4.0,-2.074,0.0
50%,1.0,2001.07,3.0,1988.08,80000.0,11.289782,69900.0,0.197,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,1.0,2011.09,4.0,1999.04,109000.0,11.599103,93000.0,0.506709,0.0,0.0,...,1.0,11.0,0.0,0.0,0.0,0.0,3.22549,0.0,0.0,1.0
max,1.0,2018.12,4.0,2018.11,380000.0,12.848,218000.0,2.763,1.0,1.0,...,8.0,57.948718,27.0,28.0,65.0,41.0,39.0,28.0,25.0,42.0


In [9]:
#Generate codebook

data_code_book = """

pretend text to stand in for code book

"""


with open("data_code_book.txt","w") as data_code_book_file:
    data_code_book_file.writelines(data_code_book)

In [10]:
#last look at our files

!ls -lh

total 2432
-rw-r--r--  1 tim  staff    42B 31 Jan 12:54 data_code_book.txt
-rw-r--r--  1 tim  staff   1.2M 31 Jan 12:53 data_set.csv


## Staging in Github

now that we've modified our files we want to stage them in Github

In [11]:
!git status

On branch main
Your branch is up to date with 'origin/main'.

Untracked files:
  (use "git add <file>..." to include in what will be committed)

	[31mdata_code_book.txt[m
	[31mdata_set.csv[m

nothing added to commit but untracked files present (use "git add" to track)


In [12]:
#%cd $gh_folder
!git add data_code_book.txt
!git add data_set.csv
!git status

On branch main
Your branch is up to date with 'origin/main'.

Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

	[32mnew file:   data_code_book.txt[m
	[32mnew file:   data_set.csv[m



## Pushing to Github

We have now staged our files, we just need to push them

In [13]:
#Set Commit Message

commit_message = """

first commit

"""

In [14]:
!git commit -am "$commit_message"
!git push $github_url

[main 3445bad] first commit
 2 files changed, 4500 insertions(+)
 create mode 100644 data_code_book.txt
 create mode 100644 data_set.csv
Enumerating objects: 5, done.
Counting objects: 100% (5/5), done.
Delta compression using up to 16 threads
Compressing objects: 100% (3/3), done.
Writing objects: 100% (4/4), 183.10 KiB | 5.23 MiB/s, done.
Total 4 (delta 0), reused 0 (delta 0)
To https://github.com/elibtronic/rdm_workshop.git
   02266c1..3445bad  main -> main


## Check out your Repository

You should now see the CSV file, and the codebook