<div style="font-size:30px" align="center"> <b> Scraping GitHub READMEs </b> </div>

<div style="font-size:18px" align="center"> <b> Brandon Kramer, UVA Biocomplexity Institute, OSS DSPG 2021 </b> </div>

<br>

### Overview  

In this notebook, we have developed a function for scraping GitHub READMEs in order to classify repositories into different types of software projects. 

The pipeline is set up with the following steps: 

1. Loading all of the packages 

2. Calling a function that scrapes the READMEs 

3. Loading the repositories from PostgreSQL as a DataFrame 

4. Cross-referencing the repos against the scraped data 

5. Scraping the repos using multiprocessing 

6. Checking the data that was scraped 

### Load Packages 

In [5]:
# for pulling/manipulating data 
import os 
import glob
import itertools
import pandas as pd
from datetime import datetime
import psycopg2 as pg
from sqlalchemy import create_engine

# for web scraping 
import json 
import lxml
import requests 
from requests.auth import HTTPBasicAuth
from bs4 import BeautifulSoup
from ratelimit import limits, sleep_and_retry
import multiprocessing
from multiprocessing.pool import ThreadPool as Pool
print("ready")

ready


### Call Functions

To use the `scrape_readmes()` function, you first need to set the username and personal access token (PAT) that you create on GitHub. You can run it without this information, but having no PAT means you can only make 60 calls an hour compared to about 5,000. It helps to have several PATs, especially since this function takes only a few moments to make 5,000 calls once the multiprocessing is incorporated. In this pipeline, we pull the usernames and PATs from a table in PostgreSQL that looks like this: 

|    login    |   token  | 
|    :---:    |  :----:  |     
|  username1  |   PAT1   | 
|  username2  |   PAT2   | 
|  etc.  |   etc.   | 

Once the username and PAT are passed into the authentication fields, `requests` will connect to the GitHub repository for each slug you feed it. We have designed the function to raise an error and stop for every issue it encounters except a 404 (i.e. no README available), as this usually means that the repo or README has been deleted, never existed, or is no longer available for some reason. Unfortunately, the way that GitHub's API is set up seems to require two calls to get the README: one to get the JSON with the README location and another to download the content. The `@sleep_and_retry` decorator handles the rate limit if we set up a `slurm` job for each PAT, but using it here in the notebook means that you will just want to wait for the threshold to be hit and then move on to the next PAT. While we plan to continue working on this function to minimize the number of calls, this process will allow us to get some preliminary data for classification in the short term. 
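One possible way to avoid the second call, offered as a sketch rather than something tested against this pipeline: the JSON returned by the `/repos/{slug}/readme` endpoint normally carries the file itself in a base64-encoded `content` field, so it could be decoded locally. The helper name `decode_readme_payload` is hypothetical, and the sample payload is hand-built for illustration, not a real API response.

```python
import base64


def decode_readme_payload(payload):
    """Return README text from a contents-style JSON payload, or None.

    Hypothetical helper: assumes the payload has a base64 `content` field,
    as the GitHub contents API normally returns for small files.
    """
    if payload.get("encoding") != "base64" or not payload.get("content"):
        # e.g. an error body such as {"message": "..."} has no content
        return None
    raw_bytes = base64.b64decode(payload["content"])
    # READMEs are not guaranteed to be UTF-8, so replace undecodable bytes
    return raw_bytes.decode("utf-8", errors="replace")


# Hand-built sample payload (not a real API response)
sample = {
    "name": "README.md",
    "encoding": "base64",
    "content": base64.b64encode(b"# Demo\n\nHello").decode("ascii"),
}
readme_text = decode_readme_payload(sample)
error_text = decode_readme_payload({"message": "Not Found"})
```

Inside `scrape_readmes()`, a helper like this would take `response.json()` as its input. Returning `None` for a body without `content` may also be a gentler way to handle responses that lack the expected keys.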

At the end of this chunk, you will also see the `filter_scraped_readmes()` function that filters scraped READMEs. Basically, you just feed this function your original data and then it filters out slugs that have already been scraped based on the local CSV that has that information. 

**NOTE: Before running this cell, you need to set the** `github_pat_index`. **Changing this parameter provides access to 36 different PATs.** 
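To decide when it is time to change `github_pat_index`, one option is to read the rate-limit headers that GitHub attaches to API responses (`X-RateLimit-Remaining` and `X-RateLimit-Reset`). The sketch below is an assumption about how that check could look. The function names are hypothetical and the header values are made up for illustration.

```python
import time


def rate_limit_status(headers, now=None):
    """Return (remaining_calls, seconds_until_reset) from response headers.

    Missing headers are treated as an exhausted quota, which is a
    conservative assumption made for this sketch.
    """
    now = time.time() if now is None else now
    remaining = int(headers.get("X-RateLimit-Remaining", 0))
    reset_epoch = int(headers.get("X-RateLimit-Reset", now))
    return remaining, max(0, int(reset_epoch - now))


def should_switch_pat(headers, calls_needed=2):
    """True when fewer calls remain than one README scrape needs."""
    remaining, _ = rate_limit_status(headers)
    return remaining < calls_needed


# Illustrative header values, not captured from a real response
example_headers = {"X-RateLimit-Remaining": "1", "X-RateLimit-Reset": "1700003600"}
remaining, wait_seconds = rate_limit_status(example_headers, now=1700000000)
switch = should_switch_pat(example_headers)
```

With `requests`, the same functions could be given `response.headers` directly, since that object supports `.get()` like a dictionary.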

In [6]:
# set this parameter to a number between 0 and 35 
# PATs #7, #14, #19, #25 seem to be set up wrong or have been removed 
github_pat_index = 20

# set the environment variables with your username and github personal access token here 
# connect to the database, download data 
connection = pg.connect(host = 'postgis1', database = 'sdad', 
                        user = os.environ.get('db_user'), 
                        password = os.environ.get('db_pwd'))
github_pats = '''SELECT * FROM gh_2007_2020.pats'''
github_pats = pd.read_sql_query(github_pats, con=connection)
connection.close()
github_username = github_pats.login[github_pat_index]
github_token = github_pats.token[github_pat_index]

# os.environ['GITHUB_USERNAME'] = ''
# os.environ['GITHUB_TOKEN'] = ''
# github_username = os.environ.get("GITHUB_USERNAME")
# github_token = os.environ.get("GITHUB_TOKEN")

# can only make 2500 calls per hour 
# because the function calls twice each time 
@sleep_and_retry
@limits(calls=2500, period=3600)
def scrape_readmes(slug):
    
    while True:
        try: 
            # define url based on the slug 
            url = f'https://api.github.com/repos/{slug}/readme'
            response = requests.get(url, auth=(github_username, github_token))
            response_code = response.status_code
            
            if response_code == 404: 
                print(f"404 error on {slug}")
                readme_string = "404 ERROR - NO README"
                now = datetime.now()
                current_time = now.strftime("%Y-%m-%d %H:%M:%S")
                return slug, readme_string, current_time, "Done"
            
            # can i build in a response_code == 403 continue onto the next PAT in the PAT for loop 
            
        except KeyError:
            print("Key error for: " + slug, flush=True)
            break
        
        except requests.exceptions.HTTPError as http_error:
            print("HTTP Error:", http_error)
            raise SystemExit(http_error)
            
        except requests.exceptions.ConnectionError as connection_error:
            print("Error Connecting:", connection_error)
            raise SystemExit(connection_error)
        
        except requests.exceptions.TooManyRedirects as redirect_error:
            print("Too Many Redirects:", redirect_error)
            raise SystemExit(redirect_error)
                
        except requests.exceptions.Timeout as timeout_error:
            print("Timeout Error:", timeout_error)
            raise SystemExit(timeout_error)
        
        except requests.exceptions.RequestException as request_exception_error:
            print("Oops, Some Other Error:", request_exception_error)
            raise SystemExit(request_exception_error)
            
        html_content = response.content
        soup = BeautifulSoup(html_content, 'html.parser')
        site_json=json.loads(soup.text)
        readme_link = site_json['download_url']
        
        while True:
            try: 
                readme_response = requests.get(readme_link, auth=(github_username, github_token))
                readme_response_code = readme_response.status_code
                
            except requests.exceptions.HTTPError as http_error:
                print("HTTP Error:", http_error)
                raise SystemExit(http_error)
            
            except requests.exceptions.ConnectionError as connection_error:
                print("Error Connecting:", connection_error)
                raise SystemExit(connection_error)
            
            except requests.exceptions.TooManyRedirects as redirect_error:
                print("Too Many Redirects:", redirect_error)
                raise SystemExit(redirect_error)
        
            except requests.exceptions.Timeout as timeout_error:
                print("Timeout Error:", timeout_error)
                raise SystemExit(timeout_error)
        
            except requests.exceptions.RequestException as request_exception_error:
                print("Oops, Some Other Error:", request_exception_error)
                raise SystemExit(request_exception_error)
    
            # pull the content out of the readme 
            readme_content = readme_response.content
            readme_soup = BeautifulSoup(readme_content, 'html.parser')
            readme_string = str(readme_soup)
    
            # give us the timing and status 
            now = datetime.now()
            current_time = now.strftime("%Y-%m-%d %H:%M:%S")
            #print(readme_string)
            return slug, readme_string, current_time, "Done"
        
def filter_scraped_readmes(original_data): 
    ''' 
    Function ingests repos data and filters out already scraped data from local csv 
    '''
    
    # ingests local csv data and converts it to a list 
    os.chdir('/project/class/bii_sdad_dspg/ncses_oss_2021/requests_scrape/')
    all_filenames = [i for i in glob.glob('*.csv')]
    combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames ])
    combined_csv = combined_csv[combined_csv['status'] == 'Done']
    scraped_slugs_list = combined_csv['slug'].tolist()
    
    # filters out all of the scraped slugs from the original_data 
    filtered_slugs = ~original_data.slug.isin(scraped_slugs_list)
    filtered_slugs = original_data[filtered_slugs]
    
    # provides the output of current slug count and number of slugs filtered 
    new_slug_count = filtered_slugs['slug'].count()
    slug_count_diff = original_data['slug'].count() - filtered_slugs['slug'].count()
    print("Current data has", new_slug_count, "entries (filtered", slug_count_diff, "from input data)")
    return filtered_slugs
    
if __name__ == "__main__":
    print("Started scraping")
    scrape_readmes(slug = 'cjbd/src')
    print("Finished scraping")

Started scraping
404 error on cjbd/src
Finished scraping


### Ingesting the Repo Slugs from the Database 

Here, we ingest the repository data from PostgreSQL. To pull from different subsets of the data, you can change the range of the commits clause in the `SQL` code. Even if you pull from the same range as someone else already has, the next step of the pipeline cross-references which READMEs have already been scraped and removes those slugs from the dataset. I have also added a clause to remove all of the repos with an 'Init' status, as their commits data have not yet been scraped and they seem to be missing in some systematic way. For now, we are just ignoring them to deal with the majority of valid repos.
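If the commit range gets changed often, it may be easier to pass it in as parameters than to edit the query string by hand. The sketch below is one way to do that, assuming the same `gh_2007_2020.repos_ranked` table. The helper name `build_repo_query` is hypothetical, and the `%(name)s` placeholders follow the parameter style that `psycopg2` accepts.

```python
def build_repo_query(min_commits, max_commits, excluded_status="Init"):
    """Return (sql, params) selecting repos with min < commits < max."""
    sql = (
        "SELECT * FROM gh_2007_2020.repos_ranked "
        "WHERE commits > %(min_commits)s "
        "AND commits < %(max_commits)s "
        "AND status != %(excluded_status)s"
    )
    params = {
        "min_commits": min_commits,
        "max_commits": max_commits,
        "excluded_status": excluded_status,
    }
    return sql, params


# Same range as the cell below: more than 100 and fewer than 106 commits
sql, params = build_repo_query(100, 106)

# With an open psycopg2 connection this could then be run as:
# raw_slug_data = pd.read_sql_query(sql, con=connection, params=params)
```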

In [7]:
# connect to the database, download data 
connection = pg.connect(host = 'postgis1', database = 'sdad', 
                        user = os.environ.get('db_user'), 
                        password = os.environ.get('db_pwd'))

raw_slug_data = '''SELECT * FROM gh_2007_2020.repos_ranked where commits < 106 AND commits > 100 AND status != 'Init' '''

# convert to a dataframe and preview the first rows
raw_slug_data = pd.read_sql_query(raw_slug_data, con=connection)
raw_slug_data.head()

Unnamed: 0,id,spdx,slug,createdat,description,primarylanguage,branch,commits,asof,status
0,MDEwOlJlcG9zaXRvcnk4NTk0MjE4NQ==,Apache-2.0,crain/accounting,2017-03-23 11:29:26,"Microservice for managing accounts, ledgers, a...",Java,MDM6UmVmODU5NDIxODU6cmVmcy9oZWFkcy9kZXZlbG9w,105,2021-01-03 13:53:22,Done
1,MDEwOlJlcG9zaXRvcnkyNTMzODA2OA==,MIT,yanhaijing/data.js,2014-10-17 04:23:00,data.js 是带有消息通知的数据中心，我称其为会说话的数据。旨在让编程变得简单，世界变得美好。,JavaScript,MDM6UmVmMjUzMzgwNjg6cmVmcy9oZWFkcy9tYXN0ZXI=,105,2021-01-03 14:18:29,Done
2,MDEwOlJlcG9zaXRvcnkyMzc4MDg1NTQ=,MIT,kituyiharry/gatsby-starter-blog-theme,2020-02-02 17:30:53,,JavaScript,MDM6UmVmMjM3ODA4NTU0OnJlZnMvaGVhZHMvbWFzdGVy,105,2021-01-03 15:47:44,Done
3,MDEwOlJlcG9zaXRvcnkxNjc4MDQ2NDQ=,MIT,cancit/teyit.link,2019-01-27 12:10:11,,Go,MDM6UmVmMTY3ODA0NjQ0OnJlZnMvaGVhZHMvbWFzdGVy,105,2021-01-04 04:43:20,Done
4,MDEwOlJlcG9zaXRvcnk5NjE1OTkyOA==,MIT,temiooo/PostIt-Application,2017-07-04 00:24:29,An application that allows users to create gro...,JavaScript,MDM6UmVmOTYxNTk5Mjg6cmVmcy9oZWFkcy9EZXZlbG9w,105,2021-01-03 18:06:24,Done


### Checking the Counts 

Before running the function, you want to compare the original table that you just pulled from the database and the local CSV files that are keeping track of the already downloaded data. The first cell below gives you the count for the original table and the next cell pulls in all of the CSVs, concatenates them, and then gives the count of how many more slugs need to be scraped. 
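The cross-referencing step boils down to removing every slug that already appears with a 'Done' status in the local CSVs. As a rough illustration of that logic, here is a standard-library sketch that runs on toy data in a temporary folder. The function names, the made-up slugs, and the 'Error' status are assumptions for the example and are not part of the project data.

```python
import csv
import glob
import os
import tempfile


def load_scraped_slugs(csv_dir):
    """Collect slugs marked 'Done' across every CSV in csv_dir."""
    scraped = set()
    for path in glob.glob(os.path.join(csv_dir, "*.csv")):
        with open(path, newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                if row.get("status") == "Done":
                    scraped.add(row["slug"])
    return scraped


def filter_unscraped(slugs, scraped):
    """Keep only the slugs not in the scraped set, preserving order."""
    return [slug for slug in slugs if slug not in scraped]


# Toy data in a temporary folder (made-up slugs, not project data)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "batch_demo.csv")
    with open(demo_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["slug", "readme_text", "batch", "as_of", "status"])
        writer.writerow(["owner/a", "# A", "batch_demo", "2021-06-14 12:00:00", "Done"])
        writer.writerow(["owner/b", "", "batch_demo", "2021-06-14 12:00:01", "Error"])
    scraped = load_scraped_slugs(tmp_dir)

all_slugs = ["owner/a", "owner/b", "owner/c"]
remaining = filter_unscraped(all_slugs, scraped)
print("Current data has", len(remaining), "entries (filtered",
      len(all_slugs) - len(remaining), "from input data)")
```

Building the paths with `os.path.join` instead of calling `os.chdir` leaves the notebook's working directory unchanged, which may matter if later cells rely on relative paths.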

In [11]:
raw_slug_data['slug'].count()

27976

In [12]:
new_slugs = filter_scraped_readmes(original_data=raw_slug_data)

Current data has 19755 entries (filtered 8221 from input data)


### Scraping the READMEs 

This chunk of code does a few things. First, it sets the `batch_name` to both keep track of which batch of data we are collecting and to name the output files with the appropriate name. This MUST be done before you run the code chunk. Second, the code sets up a pool of worker threads (`ThreadPool`, imported as `Pool`) sized to the number of available cores minus one. Third, the code converts the DataFrame into a list to feed the slugs into the for loop. Lastly, we feed all the slugs to the function in a for loop, which saves the data to our project folder after each result. 

**NOTE: Before running this cell, set the** `batch_name` **variable. Changing this variable will tell us which download batch the data was collected during and then save the CSV with that batch name.** 
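To show the pattern in isolation, here is a small sketch of `imap_unordered` on a `ThreadPool`, with a stand-in worker instead of real HTTP calls. `fake_scrape`, the made-up slugs, and the in-memory buffer that replaces the batch CSV are all assumptions for the example.

```python
import csv
import io
from multiprocessing.pool import ThreadPool


def fake_scrape(slug):
    """Stand-in for scrape_readmes(); returns the same 4-tuple shape."""
    return slug, f"README for {slug}", "2021-06-14 12:00:00", "Done"


demo_slugs = ["owner/a", "owner/b", "owner/c"]  # made-up slugs
buffer = io.StringIO()  # stands in for the batch CSV file
writer = csv.writer(buffer)
writer.writerow(["slug", "readme_text", "batch", "as_of", "status"])

# ThreadPool runs the workers as threads, which suits I/O-bound HTTP calls
with ThreadPool(2) as pool:
    for slug, readme, as_of, status in pool.imap_unordered(fake_scrape, demo_slugs):
        # Results arrive in completion order, not input order
        writer.writerow([slug, readme, "demo_batch", as_of, status])

rows = list(csv.reader(io.StringIO(buffer.getvalue())))
scraped_slugs = sorted(row[0] for row in rows[1:])
```

Writing one row per result is an alternative to rebuilding and re-saving the whole DataFrame on every iteration. It still keeps partial results if a run is interrupted, though whether it is worth changing depends on batch size.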

In [10]:
# need to change this for each batch or you will save over what you had  
batch_name = 'oss_readme_batch_02_01' 

# size the worker pool from the number of cores (Pool here is ThreadPool, so these are threads) 
# one core is subtracted so that the notebook can keep running too 
cores_available = multiprocessing.cpu_count() - 1
pool = Pool(cores_available)

# convert the dataframe into a list for the subsequent for loop 
raw_slugs = new_slugs["slug"].tolist()
slugs = []
for s in raw_slugs:
    slugs.append(s.strip())
    
# now we will feed in all of the remaining slugs 
slug_log = []
readme_log = []
asof_log = []
status_log = []
for result in pool.imap_unordered(scrape_readmes, slugs):
    slug_log.append(result[0])
    readme_log.append(result[1])
    asof_log.append(result[2])
    status_log.append(result[3])
    final_log = pd.DataFrame({'slug': slug_log, "readme_text": readme_log, 'batch': batch_name, 'as_of': asof_log, 'status': status_log}, 
                              columns=["slug", "readme_text", "batch", "as_of", "status"])
    final_log.to_csv('/project/class/bii_sdad_dspg/ncses_oss_2021/requests_scrape/'+batch_name+'.csv', sep=',', encoding='utf-8', index=False)
print("Finished scraping", len(final_log), "of", len(slugs), "records")

404 error on leewujung/rousettus_bp
404 error on saai/codingbitch
404 error on emergenetwork/etherboy-core
404 error on rizkiramadhan2/antarpulau
404 error on mozama/AG
404 error on jdaigle/jdaigle.github.io
404 error on E-L-N-A/ELNA
404 error on layd/wordsearch-android
404 error on hrgdavor/java-hipster-sql
404 error on BastyZ/Subastas_nSystem
404 error on serlosan/serlosan.github.io
404 error on fdabl/fdabl.github.io
404 error on fvendrameto/Databases-Final-Project
404 error on xbib/catalog
404 error on DrenfongWong/x509-ada
404 error on darrowco/z76-Docs
404 error on tronixworkshop/tronixworkshop.github.io
404 error on cvast/cvast-arches
404 error on LaunchCoderGirlSTL/Data-Analysis-Learning-Track
404 error on bmaupin/bmaupin.github.io
404 error on AshtrayBroom/ashbloom
404 error on selfhostedworks/openmappr
404 error on zeddee/nsdmdh-src
404 error on Jaruso/Maple_Tycoon
404 error on SENERGY-Platform/smart-meter-connector
404 error on cronfy/experience
404 error on Steven-Chavez/Cou

KeyError: 'download_url'

Once this chunk is completed, you can go back up and filter out what you just scraped from the dataframe by re-running the "Checking the Counts" section and then the "Scraping the READMEs" section. Rinse, cycle, repeat. Feel free to document errors below, and don't forget to update the batch number with each cycle.

### Common/Known Errors 

Here are some common/known errors that I have identified: 

1. `MarkupResemblesLocatorWarning: "{slug}" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client like requests to get the document behind the URL, and feed that document to Beautiful Soup.` 

    Solution: This will just throw a warning and does not seem to affect subsequent runs of the dataset. You can ignore this. 
    
2. `KeyError: 'download_url'`

    Solution: It seems like this error can mean one of two things. The first is insignificant and 

### Examining the Dataset 

This code just pulls in all of the downloaded data as a DataFrame and sees how many entries we have downloaded already. You can use the commented line to output all the data together.

In [4]:
os.chdir('/project/class/bii_sdad_dspg/ncses_oss_2021/requests_scrape/')
extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames ])
combined_csv = combined_csv[combined_csv['status'] == 'Done']
combined_csv = combined_csv.sort_values("batch")
combined_csv
# 21220 > 55872 now 
#combined_csv.to_csv('/project/class/bii_sdad_dspg/ncses_oss_2021/requests_scrape/oss_readme_aggregated/oss_readme_data_062121.csv', sep=',', encoding='utf-8', index=False)

Unnamed: 0,slug,readme_text,batch,as_of,status
0,h2oota/emacs-win64-msvc,Copyright (C) 2001-2016 Free Software Foundati...,oss_readme_batch1_1,6/11/21 17:21,Done
462,CelestiaProject/CelestiaContent,Scientific Data Base\n--------------------\n\n...,oss_readme_batch1_1,6/11/21 16:40,Done
463,leanprover-community/mathlib,# Lean mathlib\n\n![](https://github.com/leanp...,oss_readme_batch1_1,6/11/21 16:40,Done
464,adamlaska/osmos-cosmos-sdk,# Cosmos SDK\r\n\r\n![banner](docs/cosmos-sdk-...,oss_readme_batch1_1,6/11/21 16:40,Done
465,UniTime/unitime,<!-- \n * Licensed to The Apereo Foundation un...,oss_readme_batch1_1,6/11/21 16:40,Done
...,...,...,...,...,...
55196,Comparative-Pathology/czi_spatial,# czi_spatial\n\n[![Build Status](https://trav...,oss_readme_batch1_9,2021-06-14 15:06:37,Done
55197,nirenjan/x52pro-linux,Saitek X52Pro joystick driver for Linux\n=====...,oss_readme_batch1_9,2021-06-14 15:06:37,Done
55198,giftman/Gifts,Gifts\n====\n\n##Test MarkDown\n`\n\npublic vo...,oss_readme_batch1_9,2021-06-14 15:06:35,Done
55185,xsfelvis/lemon-yang.github.io,404 ERROR - NO README,oss_readme_batch1_9,2021-06-14 15:06:38,Done


^ That tells you the number of entries we have downloaded. 

### Development Space 

Below are just some snippets of code that might be useful when tweaking the existing code. 

In [123]:
os.chdir('/project/class/bii_sdad_dspg/ncses_oss_2021/requests_scrape/')
check = pd.read_csv('oss_readme_batch1_8.csv')
check

Unnamed: 0,slug,readme_text,batch,as_of,status
0,shahabsaf1/copy,# [InfernalTG](https://telegram.me/TeleInferna...,oss_readme_batch1_8,2021-06-14 12:20:52,Done
1,stephan0992/week-1,# Jekyll Now\n\n**Jekyll** is a static site ge...,oss_readme_batch1_8,2021-06-14 12:20:52,Done
2,SynapseProject/handlers.ActiveDirectory.net,# handlers.ActiveDirectory.net\nActive Directo...,oss_readme_batch1_8,2021-06-14 12:20:52,Done
3,mraggi/discreture,[![Build Status](https://travis-ci.org/mraggi/...,oss_readme_batch1_8,2021-06-14 12:20:52,Done
4,karlstroetmann/Lineare-Algebra,Lineare-Algebra\n===============\n\nIn diesem ...,oss_readme_batch1_8,2021-06-14 12:20:52,Done
...,...,...,...,...,...
2174,pkalro/project8,# angular-seed — the seed for AngularJS apps...,oss_readme_batch1_8,2021-06-14 12:35:51,Done
2175,Tizzio/WifiTransfer,# WifiTransfer\nDirect wifi file transfer betw...,oss_readme_batch1_8,2021-06-14 12:35:51,Done
2176,PAPARA-ZZ-I/PAPARA-ZZ-I,# PAPARA(ZZ)I\nCopyright 2015-2017 Yann Marcon...,oss_readme_batch1_8,2021-06-14 12:35:51,Done
2177,Capital-T-Industries/docker-elk,"# The ELK stack (Elasticsearch, Logstash, Kiba...",oss_readme_batch1_8,2021-06-14 12:35:51,Done


In [110]:
# Copy the function from above and make tweaks in this code chunk
combined_csv[combined_csv['readme_text'] == "404 ERROR - NO README"]

Unnamed: 0,slug,readme_text,batch,as_of,status
211,g0v-data/mirror-10minutely,404 ERROR - NO README,oss_readme_batch1_1,6/11/21 17:21,Done
301,gregorycv/moodle_quiz_extended,404 ERROR - NO README,oss_readme_batch1_1,6/11/21 17:21,Done
49,sisirkoppaka/fluent,404 ERROR - NO README,oss_readme_batch1_1,6/11/21 17:21,Done
82,djbender/homebrew-tmux,404 ERROR - NO README,oss_readme_batch1_1,6/11/21 17:21,Done
43,saidganim/llvm_clone,404 ERROR - NO README,oss_readme_batch1_1,6/11/21 17:21,Done
...,...,...,...,...,...
278,King19931229/KApp,404 ERROR - NO README,oss_readme_batch1_9,2021-06-14 15:06:24,Done
281,tedkulp/phpspec,404 ERROR - NO README,oss_readme_batch1_9,2021-06-14 15:06:24,Done
221,Mad9201/T.M.D,404 ERROR - NO README,oss_readme_batch1_9,2021-06-14 15:06:23,Done
185,davidjhardman/djh-cms,404 ERROR - NO README,oss_readme_batch1_9,2021-06-14 15:06:22,Done


In [124]:
select_data = combined_csv[combined_csv['readme_text'].notna()]
#select_data['readme_clean'] = select_data['readme_text'].str.lower()
select_data[select_data['readme_text'].str.contains("Blockchain")]

Unnamed: 0,slug,readme_text,batch,as_of,status
449,flowchain/flowchain-ledger,# flowchain-ledger\n&gt; Flowchain distributed...,oss_readme_batch1_10,2021-06-14 15:15:29,Done
948,fabiocolacio/Mercury,# Mercury Chat\n\nMercury is my end-to-end enc...,oss_readme_batch1_10,2021-06-14 15:15:38,Done
254,fkbenjamin/pc-firebase-starter,"# PassChain\n\nAuthors: Rob-Jago Flötgen, Flor...",oss_readme_batch1_11,2021-06-14 15:19:18,Done
1041,peacedudegregoryks/Old-SBC-SocialBenefitCoin,# The Social Benefit Coin Smart contract\n \nW...,oss_readme_batch1_11,2021-06-14 15:19:33,Done
758,mukira/fukoblockchainexplorer,# Fuko blockchain Explorer\n\n![GitHub Logo](h...,oss_readme_batch1_11,2021-06-14 15:19:28,Done
...,...,...,...,...,...
1724,agadzinski/vehicle-manufacture-20180607173737101,# Blockchain - Tutorial\n\nThis is the tutoria...,oss_readme_batch1_8,2021-06-14 12:35:40,Done
1689,shubhamp1p/vehicle-manufacture-20180606135842485,# Blockchain - Tutorial\n\nThis is the tutoria...,oss_readme_batch1_8,2021-06-14 12:35:40,Done
33,energywebfoundation/ew-did-registry,# EW DID Library v0.1\n## Disclaimer\n&gt; The...,oss_readme_batch1_8,2021-06-14 12:20:53,Done
236,jeffet/vehicle-manufacture-20180318192523657,# Blockchain - Tutorial\n\nThis is the tutoria...,oss_readme_batch1_9,2021-06-14 15:06:23,Done


to insert updates into sql: https://stackoverflow.com/questions/23103962/how-to-write-dataframe-to-postgres-table
multiprocessing: https://stackoverflow.com/questions/45718546/with-clause-for-multiprocessing-in-python/45734483


