<div style="font-size:30px" align="center"> <b> Scraping GitHub READMEs </b> </div>

<div style="font-size:18px" align="center"> <b> Brandon Kramer, UVA Biocomplexity Institute, OSS DSPG 2021 </b> </div>

<br>

### Overview  

In this notebook, we have developed a function for scraping GitHub READMEs in order to classify repositories into different types of software projects. 

The pipeline is set up with the following steps: 

1. Loading all of the packages 

2. Calling a function that scrapes the READMEs 

3. Loading the repositories as a DataFrame 

    - In the initial stage, this data came from PostgreSQL
    - In the later stages, we used CSVs to sync scraping efforts between Brandon and Crystal 
    

4. Cross-referencing the repos against the scraped data 

5. Scraping the repos using multiprocessing 

6. Checking the data that was scraped 

### Load Packages and Personal Access Tokens

In [1]:
# for pulling/manipulating data 
import os 
import glob
import itertools
import pandas as pd
from datetime import datetime
import psycopg2 as pg
from sqlalchemy import create_engine

# for web scraping 
import json 
import lxml
import requests 
from requests.auth import HTTPBasicAuth
from bs4 import BeautifulSoup
from ratelimit import limits, sleep_and_retry
import multiprocessing
from multiprocessing.pool import ThreadPool as Pool
import sys

# loads personal access tokens from the database 
# do NOT print any individual tokens in the notebook
# printing tokens and posting them to github invalidates the pats
connection = pg.connect(host = 'postgis1', database = 'sdad', 
                        user = os.environ.get('db_user'), 
                        password = os.environ.get('db_pwd'))
github_pats = '''SELECT * FROM gh_2007_2020.pats_update'''
github_pats = pd.read_sql_query(github_pats, con=connection)
connection.close()

print("ready")

ready


### Create Functions

To use the `scrape_readmes()` function, you first need to set the username and personal access token (PAT) that you create on GitHub. You can run it without this information, but having no PAT means you can only make 60 calls an hour compared to about 5,000. It helps to have several PATs, especially since this function takes only a few moments to make 5,000 calls after the multiprocessing is incorporated. In this pipeline, we are calling the usernames and PATs from a table in PostgreSQL that looks like this: 

|    login    |   token  | 
|    :---:    |  :----:  |     
|  username1  |   PAT1   | 
|  username2  |   PAT2   | 
|  etc.  |   etc.   | 

Once the username and PAT are passed into the authentication fields, `requests` will connect to the GitHub repository of each slug you feed it. We have designed the function to stop with an error for every issue it encounters except a 404 error (i.e. no README available), as this usually means that the repo or README has been deleted, never existed, or is no longer available for some other reason. 

Unfortunately, the way that GitHub's API is set up seems to require two calls to get the README: one to get the JSON with the README location and another to retrieve the content. Because each slug costs two calls, a PAT's 5,000 hourly calls cover only about 2,500 READMEs. We have a `@sleep_and_retry` decorator to deal with this if we set up a `slurm` job for each PAT, but using it here in the notebook means that you will just want to wait for the threshold to be hit and then move on to the next PAT. While we plan to continue working on this function to minimize the number of calls, this process will allow us to get some preliminary data for classification in the short term. Lastly, our current implementation cannot loop through all available PATs. This is something we plan to work on, but for now we just feed the function PATs one by one and let it scrape until it hits the query threshold.
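
One possible way to reduce the two calls to one, offered as a sketch rather than the approach used in this notebook: the JSON returned by the `/readme` endpoint also carries the file body in a base64-encoded `content` field, so it could be decoded locally instead of requesting `download_url`. The helper name `decode_readme_payload` and the sample payload below are hypothetical and exist only for illustration.

```python
import base64


def decode_readme_payload(payload):
    """Return the README text from a /readme JSON payload (hypothetical helper)."""
    # the API states how the body is encoded; this sketch only handles base64
    if payload.get("encoding") != "base64":
        raise ValueError(f"unexpected encoding: {payload.get('encoding')}")
    # b64decode discards the newlines that appear inside the content field
    raw_bytes = base64.b64decode(payload["content"])
    # replace undecodable bytes instead of failing on unusual encodings
    return raw_bytes.decode("utf-8", errors="replace")


# made-up payload shaped like the API response, used only for illustration
sample_payload = {
    "name": "README.md",
    "encoding": "base64",
    "content": base64.b64encode(b"# Hello\n\nA sample README.\n").decode("ascii"),
}
readme_text = decode_readme_payload(sample_payload)
print(readme_text)
```

In the notebook's function, `payload` would be the parsed JSON of the first response, which would leave the second request unnecessary under the assumption that the README fits in the `content` field.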

At the end of this chunk, you will also see the `filter_scraped_readmes()` function that filters scraped READMEs. Basically, you just feed this function your original data and then it filters out slugs that have already been scraped based on the local CSV that has that information. 
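
The filtering idea can be illustrated on toy data. This is a minimal sketch, not the function itself: the two DataFrames are invented stand-ins for the repository table and the combined CSV logs, and the `Failed` status is a hypothetical value included only to show that rows not marked `Done` are scraped again.

```python
import pandas as pd

# toy stand-ins for the repository table and the combined scrape logs
original_data = pd.DataFrame({"slug": ["a/one", "b/two", "c/three", "d/four"]})
scraped_log = pd.DataFrame({
    "slug": ["a/one", "c/three", "d/four"],
    "status": ["Done", "Done", "Failed"],  # 'Failed' is hypothetical
})

# only rows marked 'Done' count as already scraped
scraped_slugs = scraped_log.loc[scraped_log["status"] == "Done", "slug"]

# keep the input rows whose slug is not in the scraped list
remaining = original_data[~original_data["slug"].isin(scraped_slugs)]
print(remaining["slug"].tolist())
```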

**NOTE: Before running this cell, you need to set the** `github_pat_index`. **Changing this parameter provides access to 36 different PATs.** 

In [2]:
# set this parameter to a valid row index of github_pats (0 to len(github_pats) - 1)
github_pat_index = 30

# can only make 2500 calls per hour 
# because the function calls twice each time 
@sleep_and_retry
@limits(calls=2500, period=3600)
def scrape_readmes(slug):
    
    github_username = github_pats.login[github_pat_index]
    github_token = github_pats.token[github_pat_index]
    
    def get_or_exit(url):
        # send an authenticated request; report any requests error and stop the run
        try:
            return requests.get(url, auth=(github_username, github_token))
        except requests.exceptions.HTTPError as http_error:
            print("HTTP Error:", http_error)
            raise SystemExit(http_error)
        except requests.exceptions.ConnectionError as connection_error:
            print("Error Connecting:", connection_error)
            raise SystemExit(connection_error)
        except requests.exceptions.TooManyRedirects as redirect_error:
            print("Too Many Redirects:", redirect_error)
            raise SystemExit(redirect_error)
        except requests.exceptions.Timeout as timeout_error:
            print("Timeout Error:", timeout_error)
            raise SystemExit(timeout_error)
        except requests.exceptions.RequestException as request_exception_error:
            print("Oops, Some Other Error:", request_exception_error)
            raise SystemExit(request_exception_error)

    # first call: the JSON that describes the README and where to download it
    url = f'https://api.github.com/repos/{slug}/readme'
    response = get_or_exit(url)

    # a 404 means there is no README to collect, so log it and move on
    if response.status_code == 404:
        print(f"404 error on {slug}")
        current_time = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        return slug, "404 ERROR - NO README", current_time, "Done"

    # the body is JSON, so parse it directly; a missing 'download_url'
    # raises a KeyError here (see Common/Known Errors)
    readme_link = response.json()['download_url']

    # second call: pull the content out of the readme
    readme_response = get_or_exit(readme_link)
    readme_string = str(BeautifulSoup(readme_response.content, 'html.parser'))

    # give us the timing and status
    current_time = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    return slug, readme_string, current_time, "Done"

def filter_scraped_readmes(original_data): 
    ''' 
    Function ingests repos data and filters out already scraped data from local csv 
    '''
    
    # ingests local csv data and converts it to a list 
    #os.chdir('/project/class/bii_sdad_dspg/ncses_oss_2021/requests_scrape/')
    os.chdir('/project/biocomplexity/sdad/projects_data/ncses/oss/dspg_2021/')
    all_filenames = [i for i in glob.glob('*.csv')]
    combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames ])
    combined_csv = combined_csv[combined_csv['status'] == 'Done']
    scraped_slugs_list = combined_csv['slug'].tolist()
    
    # filters out all of the scraped slugs from the original_data 
    filtered_slugs = ~original_data.slug.isin(scraped_slugs_list)
    filtered_slugs = original_data[filtered_slugs]
    
    # provides the output of current slug count and number of slugs filtered 
    new_slug_count = filtered_slugs['slug'].count()
    slug_count_diff = original_data['slug'].count() - filtered_slugs['slug'].count()
    print("Current data has", new_slug_count, "entries left to scrape (filtered", slug_count_diff, "from input data)")
    return filtered_slugs
    
if __name__ == "__main__":
    print("Started scraping")
    scrape_readmes(slug = 'cjbd/src')
    print("Finished scraping")

Started scraping
404 error on cjbd/src
Finished scraping


### Ingesting the Repo Slugs from the Database 

First, we ingested the repository data from PostgreSQL. To pull from different subsets of the data, you can change the range of the commits clause in the `SQL` code. Even if you pull from the same range as someone else already has, the next step of the pipeline cross-references which READMEs have already been scraped and removes those slugs from the dataset. Given how long scraping takes, we decided to just collect READMEs from repositories with 350 or more commits and then collect the rest in the fall.

In [3]:
# connect to the database, download data 
connection = pg.connect(host = 'postgis1', database = 'sdad', 
                        user = os.environ.get('db_user'), 
                        password = os.environ.get('db_pwd'))
# commits < 99 AND commits > 95 AND status != 'Init'
raw_slug_data = '''SELECT * FROM gh_2007_2020.repos_ranked where commits > 349 '''

# convert to a dataframe, close the connection, and preview the first rows
raw_slug_data = pd.read_sql_query(raw_slug_data, con=connection)
connection.close()
raw_slug_data.head()

Unnamed: 0,id,spdx,slug,createdat,description,primarylanguage,branch,commits,asof,status
0,MDEwOlJlcG9zaXRvcnkyNzUwNzAwMzc=,BSD-3-Clause,dbuskariol-org/chromium,2020-06-26 04:04:29,The official GitHub mirror of the Chromium source,,MDM6UmVmMjc1MDcwMDM3OnJlZnMvaGVhZHMvbWFzdGVy,849191,2021-01-03 16:55:57,Init
1,MDEwOlJlcG9zaXRvcnkxOTM0MzEyNTI=,BSD-3-Clause,cjbd/src,2019-06-24 04:03:03,src,,MDM6UmVmMTkzNDMxMjUyOnJlZnMvaGVhZHMvbWFzdGVy,795211,2021-01-03 22:57:50,Init
2,MDEwOlJlcG9zaXRvcnkyODUxOTgyOTQ=,GPL-2.0,firemax13/android_kernel_sm6150_unified,2020-08-05 06:17:00,Samsung Galaxy A71 & A80 Unified Kernel Source...,C,MDM6UmVmMjg1MTk4Mjk0OnJlZnMvaGVhZHMvYnRmNy1maXJl,745131,2021-01-03 19:04:27,Init
3,MDEwOlJlcG9zaXRvcnkzMDU2MTkzMjA=,GPL-2.0,firemax13/a80kernel,2020-10-20 07:04:11,FireKernel Custom Extreme Kernel For Galaxy A80,C,MDM6UmVmMzA1NjE5MzIwOnJlZnMvaGVhZHMvbWFpbg==,745131,2021-01-03 23:51:13,Init
4,MDEwOlJlcG9zaXRvcnk4Mjk0MDUzOA==,MIT,eugene-matvejev/ultimate-commit-machine,2017-02-23 15:24:09,"explore github.com limits and ""same-hash"" attack",Shell,MDM6UmVmODI5NDA1Mzg6cmVmcy9oZWFkcy9tYXN0ZXI=,716089,2021-01-03 22:19:35,Init


### Checking the Counts 

Before running the function, you want to compare the original table that you just pulled from the database and the local CSV files that are keeping track of the already downloaded data. The first cell below gives you the count for the original table and the next cell pulls in all of the CSVs, concatenates them, and then gives the count of how many more slugs need to be scraped. 

In [None]:
raw_slug_data['slug'].count()

In [None]:
new_slugs = filter_scraped_readmes(original_data=raw_slug_data)

### Later Stage Scraping 

If you were scraping on your own, pulling the list of repos from PostgreSQL would suffice, but our other team member is also scraping GitHub repo stats, so we want to make sure that we are syncing up with the same repos that they are collecting. This alternative process pulls from a .csv file of all the repos that they have scraped and scrapes the READMEs from those projects.

In [4]:
os.chdir('/project/class/bii_sdad_dspg/uva_2021/dspg21oss/')
raw_slug_data = pd.read_csv('brandon_to_scrape_0712.csv')
raw_slug_data.head()

Unnamed: 0,slug
0,coderZsq/coderZsq.project.ios
1,healthsparq/ember-radical
2,PierreSenellart/provsql
3,sfi0zy/muilessium
4,kufii/My-UserScripts


In [49]:
new_slugs = filter_scraped_readmes(original_data=raw_slug_data)

Current data has 150 entries left to scrape (filtered 143971 from input data)


### Scraping the READMEs 

This chunk of code does a few things. First, it sets the `batch_name`, which both tracks which batch of data we are collecting and names the output file. This MUST be done before you run the code chunk. Second, the code sets up a pool of worker threads (`ThreadPool`), sized to one fewer than the number of available cores, so that several READMEs are requested at once. Third, the code converts the DataFrame into a list of slugs. Lastly, we feed all the slugs to the function in a for loop, and after each result it saves the data collected so far into our project folder. 
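
The pooling pattern can be tried on toy inputs before spending any API calls. The sketch below is an illustration under stated assumptions: `fake_scrape` is a hypothetical stand-in for `scrape_readmes()` that returns the same four-field tuple, and the slugs and `toy_batch` name are invented.

```python
from multiprocessing.pool import ThreadPool

import pandas as pd


def fake_scrape(slug):
    # stand-in for scrape_readmes(); returns the same four-field tuple shape
    return slug, f"README for {slug}", "2021-07-12 00:00:00", "Done"


slugs = ["a/one", "b/two", "c/three"]
rows = []
# the context manager shuts the pool down once the loop has finished
with ThreadPool(2) as pool:
    # results arrive in completion order, not input order
    for result in pool.imap_unordered(fake_scrape, slugs):
        rows.append(result)

# build the log once from the collected tuples, then add the batch column
log = pd.DataFrame(rows, columns=["slug", "readme_text", "as_of", "status"])
log.insert(2, "batch", "toy_batch")
print(sorted(log["slug"]))
```

Because `imap_unordered` yields results as they complete, the rows in `log` may not follow the order of `slugs`.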

**NOTE: Before running this cell, set the** `batch_name` **variable. Changing this variable will tell us which download batch the data was collected during and then save the CSV with that batch name.** 

In [None]:
# need to change this for each batch or you will save over what you had  
batch_name = 'oss_readme_batch_07_01' 

# sets the number of cores so that it can draw from multiprocessors 
# there must be 1 core subtracted so that the notebook can run too 
cores_available = multiprocessing.cpu_count() - 1
pool = Pool(cores_available)

# convert the dataframe into a list for the subsequent for loop 
raw_slugs = new_slugs["slug"].tolist()
slugs = []
for s in raw_slugs:
    slugs.append(s.strip())
    
# now we will feed in all of the remaining slugs 
slug_log = []
readme_log = []
asof_log = []
status_log = []
for result in pool.imap_unordered(scrape_readmes, slugs):
    slug_log.append(result[0])
    readme_log.append(result[1])
    asof_log.append(result[2])
    status_log.append(result[3])
    final_log = pd.DataFrame({'slug': slug_log, "readme_text": readme_log, 'batch': batch_name, 'as_of': asof_log, 'status': status_log}, 
                              columns=["slug", "readme_text", "batch", "as_of", "status"])
    #final_log.to_csv('/project/class/bii_sdad_dspg/ncses_oss_2021/requests_scrape/'+batch_name+'.csv', sep=',', encoding='utf-8', index=False)
    final_log.to_csv('/project/biocomplexity/sdad/projects_data/ncses/oss/dspg_2021/'+batch_name+'.csv', sep=',', encoding='utf-8', index=False)
print("Finished scraping", len(final_log), "of", len(slugs), "records")

### Common/Known Errors 

Here are some common/known errors that I have identified: 

1. `MarkupResemblesLocatorWarning: "{slug}" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client like requests to get the document behind the URL, and feed that document to Beautiful Soup.` 

    Solution: This will just throw a warning and does not seem to affect subsequent runs of the dataset. You can ignore this. 
    
2. `RecursionError:` Sometimes you will get a recursion error.

    Solution: You can quickly fix that by increasing the limit with `sys.setrecursionlimit(50000)`.

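3. `KeyError: 'download_url'` This is basically equivalent to running out of queries: once a PAT hits its limit, the API response no longer contains a `download_url` field. 

    Solution: Change `github_pat_index` to the next PAT and rerun. We will need to develop a better system, but this was a good workaround to get our data in the short term. 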
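
One way to tell an exhausted PAT apart from other failures, sketched here as an assumption rather than part of the current pipeline, is to read the rate-limit headers that GitHub's REST API attaches to its responses (`X-RateLimit-Remaining` and `X-RateLimit-Reset`). The helper `quota_status` is hypothetical, and the header values below are made up so the example runs without a network call.

```python
from datetime import datetime, timezone


def quota_status(headers):
    """Summarize the rate-limit headers of a GitHub API response (hypothetical helper)."""
    # -1 marks a response that did not include the header at all
    remaining = int(headers.get("X-RateLimit-Remaining", -1))
    # the reset header is a UTC epoch timestamp in seconds
    reset_epoch = int(headers.get("X-RateLimit-Reset", 0))
    reset_time = datetime.fromtimestamp(reset_epoch, tz=timezone.utc)
    return {"exhausted": remaining == 0, "remaining": remaining, "resets_at": reset_time}


# made-up header values for illustration; in practice pass response.headers
status = quota_status({"X-RateLimit-Remaining": "0", "X-RateLimit-Reset": "1626130000"})
print(status["exhausted"], status["resets_at"].isoformat())
```

Checking `status["exhausted"]` before reading `download_url` would allow a clearer message than the `KeyError`, and `resets_at` indicates when the same PAT becomes usable again.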

### Examining the Dataset 

This code just pulls in all of the downloaded data as a DataFrame and sees how many entries we have downloaded already. 

In [2]:
os.chdir('/project/biocomplexity/sdad/projects_data/ncses/oss/dspg_2021/')
extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames ])
combined_csv = combined_csv[combined_csv['status'] == 'Done']
combined_csv = combined_csv.sort_values("batch")
combined_csv

### Saving the Dataset

In [None]:
# we kept this saved in a couple of spots (the final collaborative version goes in the class folder while a backup is stored in biocomplexity)
combined_csv.to_csv('/project/class/bii_sdad_dspg/uva_2021/dspg21oss/oss_readme_data_071521.csv', sep=',', encoding='utf-8', index=False)
combined_csv.to_csv('/project/biocomplexity/sdad/projects_data/ncses/oss/dspg_2021/oss_readme_data_071521.csv', sep=',', encoding='utf-8', index=False)

In [9]:
os.chdir('/project/biocomplexity/sdad/projects_data/ncses/oss/dspg_2021/')
check = pd.read_csv('oss_readme_data_071221.csv')
check

Unnamed: 0,slug,readme_text,batch,as_of,status
0,mumblepins/debian-samba,"This is the release version of Samba, the free...",oss_readme_batch1_1,6/11/21 17:21,Done
1,fdvarela/odoo8,[![Build Status](http://runbot.odoo.com/runbot...,oss_readme_batch1_1,6/11/21 17:21,Done
2,Nkosi-tshawe/moodle,.-..-.\n __...,oss_readme_batch1_1,6/11/21 17:21,Done
3,dllsf/odootest,[![Build Status](http://runbot.odoo.com/runbot...,oss_readme_batch1_1,6/11/21 17:21,Done
4,dkavadia/stupefy,.-..-.\n __...,oss_readme_batch1_1,6/11/21 17:21,Done
...,...,...,...,...,...
444522,zuokun2013/site1,# Initial page\n\nthis is a hello world page\n...,oss_readme_batch_05_72,2021-07-12 23:16:10,Done
444523,almcalle/gatsby-starter-netlify-cms,# Gatsby + Netlify CMS Starter\n\n[![Netlify S...,oss_readme_batch_05_72,2021-07-12 23:16:10,Done
444524,biwers/aj-test,<!-- AUTO-GENERATED-CONTENT:START (STARTER) --...,oss_readme_batch_05_72,2021-07-12 23:16:10,Done
444525,taraldga/mossekarusellen-netlify-cms,404 ERROR - NO README,oss_readme_batch_05_72,2021-07-12 23:16:10,Done
