<div style="font-size:30px" align="center"> <b> Scaping GitHub READMEs </b> </div>

<div style="font-size:18px" align="center"> <b> Brandon Kramer, UVA Biocomplexity Institute, OSS DSPG 2021 </b> </div>

<br>

### Overview  

In this notebook, we have developed a function for scraping GitHub READMEs in order to classify repositories into different types of software projects. 

The pipeline is setup with the following steps: 

1. Loading all of the packages 

2. Calling a function that scrapes the READMEs 

3. Loading the repositories from PostgreSQL as a DataFrame 

4. Cross-referencing the repos against the scraped data 

5. Scraping the repos using multiprocessing 

6. Checking the data that was scraped 

### Load Packages 

In [1]:
# for pulling/manipulating data 
import os 
import glob
import itertools
import pandas as pd
from datetime import datetime
import psycopg2 as pg
from sqlalchemy import create_engine

# for web scraping 
import json 
import lxml
import requests 
from requests.auth import HTTPBasicAuth
from bs4 import BeautifulSoup
from ratelimit import limits, sleep_and_retry
import multiprocessing
from multiprocessing.pool import ThreadPool as Pool
from datetime import datetime
import sys
print("ready")

ready


### Call Functions

To use the `scrape_readmes()` function, you first need to set the username and personal acccess token (PAT) that you create on GitHub. While you can run it without this information but having no PAT means you can only make 50 calls an hour compared to about 5,000. It helps if you have several PATs to help in this process, especially since this function takes only a few moments to make 5,000 calls after the multiprocessing is incorporated. In this pipeline, we are calling the usernames and PATs from a table in PostgreSQL that looks like this: 

|    login    |   token  | 
|    :---:    |  :----:  |     
|  username1  |   PAT1   | 
|  username2  |   PAT2   | 
|  etc.  |   etc.   | 

Once the username and PAT are passed into the authetication fields, `requests` will connect to the GitHub repository of all the slugs you feed it. We have designed the function to throw errors for all issues it encounters unless it gets a 404 error (i.e. no README available), as this usually means that the repo or README has been deleted, never existed, and/or is no longer available for some reason. Unfortunately, the way that GitHub's API is setup seems to require two calls to get the README: one to get the JSON with the README location and another to decode the content. We have a `@sleep_and_retry` decorator to deal with this if we setup a `slurm` for each PAT, but using it here in the notebook means that you will just want to wait for the threshold to be hit and then move onto the next PAT. While we plan to continue working on this function to minimize the number of calls, this process will allow us to get some preliminary data for classification in the short-term. 

At the end of this chunk, you will also see the `filter_scraped_readmes()` function that filters scraped READMEs. Basically, you just feed this function your original data and then it filters out slugs that have already been scraped based on the local CSV that has that information. 

**NOTE: Before running this cell, you need to set the** `github_pat_index`. **Changing this parameter provides access to 36 different PATs.** 

In [2]:
connection = pg.connect(host = 'postgis1', database = 'sdad', 
                        user = os.environ.get('db_user'), 
                        password = os.environ.get('db_pwd'))
github_pats = '''SELECT * FROM gh_2007_2020.pats_update'''
github_pats = pd.read_sql_query(github_pats, con=connection)
connection.close()

In [28]:
# set this parameter to a number between 0 and 35  
github_pat_index = 24

# can only make 2500 calls per hour 
# because the function calls twice each time 
@sleep_and_retry
@limits(calls=2500, period=3600)
def scrape_readmes(slug):
    
    github_username = github_pats.login[github_pat_index]
    github_token = github_pats.token[github_pat_index]
    
    while True:
        try: 
            # define url based on the slug 
            url = f'https://api.github.com/repos/{slug}/readme'
            response = requests.get(url, auth=(github_username, github_token))
            response_code = response.status_code
            
            if response_code == 404: 
                print(f"404 error on {slug}")
                readme_string = "404 ERROR - NO README"
                now = datetime.now()
                current_time = now.strftime("%Y-%m-%d %H:%M:%S")
                return slug, readme_string, current_time, "Done"
            
        except KeyError:
            print("Key error for: " + slug, flush=True)
            break
        
        except requests.exceptions.HTTPError as http_error:
            print ("HTTP Error:", http_error)
            raise SystemExit(http_error)
            break
            
        except requests.exceptions.ConnectionError as connection_error:
            print ("Error Connecting:", connection_error)
            raise SystemExit(connection_error)
            break 
        
        except requests.exceptions.TooManyRedirects as toomany_requests:
            print ("Too Many Requests:", toomany_requests)
            raise SystemExit(toomany_requests)
            break
                
        except requests.exceptions.Timeout as timeout_error:
            print ("Timeout Error:", timeout_error)
            raise SystemExit(timeout_error)
            break
        
        except requests.exceptions.RequestException as request_exception_error:
            print ("Oops, Some Other Error:", request_exception_error)
            raise SystemExit(request_exception_error)
            break 
            
        html_content = response.content
        soup = BeautifulSoup(html_content, 'html.parser')
        site_json=json.loads(soup.text)
        readme_link = site_json['download_url']
        
        #try:
        #    readme_link = site_json['download_url']
        
        #except KeyError as error_403:
        #    now = datetime.now()
        #    print(f"Rate limited exceeded (403 error) on {slug} at", 
        #          now.strftime("%Y-%m-%d %H:%M:%S"))
            #github_pat_index = github_pat_index - 1
        #    raise SystemExit(error_403) 
        #    break 
            
        while True:
            try: 
                readme_response = requests.get(readme_link, auth=(github_username, github_token))
                readme_response_code = readme_response.status_code
                
            except requests.exceptions.HTTPError as http_error:
                print ("HTTP Error:", http_error)
                raise SystemExit(http_error)
                break
            
            except requests.exceptions.ConnectionError as connection_error:
                print ("Error Connecting:", connection_error)
                raise SystemExit(connection_error)
                break
            
            except requests.exceptions.TooManyRedirects as toomany_requests:
                print ("Too Many Requests:", toomany_requests)
                raise SystemExit(toomany_requests)
                break 
        
            except requests.exceptions.Timeout as timeout_error:
                print ("Timeout Error:", timeout_error)
                raise SystemExit(timeout_error)
                break
        
            except requests.exceptions.RequestException as request_exception_error:
                print ("Oops, Some Other Error:", request_exception_error)
                raise SystemExit(request_exception_error)
                break  
    
            # pull the content out of the readme 
            readme_content = readme_response.content
            readme_soup = BeautifulSoup(readme_content, 'html.parser')
            readme_string = str(readme_soup)
    
            #give us the the timing and status 
            now = datetime.now()
            current_time = now.strftime("%Y-%m-%d %H:%M:%S")
            #print(readme_string)
            return slug, readme_string, current_time, "Done"

def filter_scraped_readmes(original_data): 
    ''' 
    Function ingests repos data and filters out already scraped data from local csv 
    '''
    
    # ingests local csv data and converts it to a list 
    #os.chdir('/project/class/bii_sdad_dspg/ncses_oss_2021/requests_scrape/')
    os.chdir('/project/biocomplexity/sdad/projects_data/ncses/oss/dspg_2021/')
    all_filenames = [i for i in glob.glob('*.csv')]
    combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames ])
    combined_csv = combined_csv[combined_csv['status'] == 'Done']
    scraped_slugs_list = combined_csv['slug'].tolist()
    
    # filters out all of the scraped slugs from the original_data 
    filtered_slugs = ~raw_slug_data.slug.isin(scraped_slugs_list)
    filtered_slugs = raw_slug_data[filtered_slugs]
    
    # provides the output of current slug count and number of slugs filtered 
    new_slug_count = filtered_slugs['slug'].count()
    slug_count_diff = raw_slug_data['slug'].count() - filtered_slugs['slug'].count()
    print("Current data has", new_slug_count, "entries left to scrape (filtered", slug_count_diff, "from input data)")
    return filtered_slugs
    
if __name__ == "__main__":
    print("Started scraping")
    scrape_readmes(slug = 'cjbd/src')
    print("Finished scraping")

Started scraping
404 error on cjbd/src
Finished scraping


### Ingesting the Repo Slugs from the Database 

Here, we ingested the repository data from PostgreSQL. To pull from different subsets of the data, you can change the range of the commits clause in the `SQL` code. Even if you pull from the same range as someone else already has, the next step of the pipeline is cross-referencing which READMEs have already been scraped and then removing from slugs from the dataset. I have also added a clause to remove all of the repos with an 'Init' status, as their commits data have not yet been scraped and they seem to missing in some systematic way. For now, we are just ignoring them to deal with the majority of valid repos.

In [5]:
# connect to the database, download data 
connection = pg.connect(host = 'postgis1', database = 'sdad', 
                        user = os.environ.get('db_user'), 
                        password = os.environ.get('db_pwd'))
# commits < 99 AND commits > 95 AND status != 'Init'
raw_slug_data = '''SELECT * FROM gh_2007_2020.repos_ranked where commits > 1000 '''

# convert to a dataframe, show how many missing we have (none)
raw_slug_data = pd.read_sql_query(raw_slug_data, con=connection)
raw_slug_data.head()

Unnamed: 0,id,spdx,slug,createdat,description,primarylanguage,branch,commits,asof,status
0,MDEwOlJlcG9zaXRvcnkyNzUwNzAwMzc=,BSD-3-Clause,dbuskariol-org/chromium,2020-06-26 04:04:29,The official GitHub mirror of the Chromium source,,MDM6UmVmMjc1MDcwMDM3OnJlZnMvaGVhZHMvbWFzdGVy,849191,2021-01-03 16:55:57,Init
1,MDEwOlJlcG9zaXRvcnkxOTM0MzEyNTI=,BSD-3-Clause,cjbd/src,2019-06-24 04:03:03,src,,MDM6UmVmMTkzNDMxMjUyOnJlZnMvaGVhZHMvbWFzdGVy,795211,2021-01-03 22:57:50,Init
2,MDEwOlJlcG9zaXRvcnkyODUxOTgyOTQ=,GPL-2.0,firemax13/android_kernel_sm6150_unified,2020-08-05 06:17:00,Samsung Galaxy A71 & A80 Unified Kernel Source...,C,MDM6UmVmMjg1MTk4Mjk0OnJlZnMvaGVhZHMvYnRmNy1maXJl,745131,2021-01-03 19:04:27,Init
3,MDEwOlJlcG9zaXRvcnkzMDU2MTkzMjA=,GPL-2.0,firemax13/a80kernel,2020-10-20 07:04:11,FireKernel Custom Extreme Kernel For Galaxy A80,C,MDM6UmVmMzA1NjE5MzIwOnJlZnMvaGVhZHMvbWFpbg==,745131,2021-01-03 23:51:13,Init
4,MDEwOlJlcG9zaXRvcnk4Mjk0MDUzOA==,MIT,eugene-matvejev/ultimate-commit-machine,2017-02-23 15:24:09,"explore github.com limits and ""same-hash"" attack",Shell,MDM6UmVmODI5NDA1Mzg6cmVmcy9oZWFkcy9tYXN0ZXI=,716089,2021-01-03 22:19:35,Init


### Checking the Counts 

Before running the function, you want to compare the original table that you just pulled from the database and the local CSV files that are keeping track of the already downloaded data. The first cell below gives you the count for the original table and the next cell pulls in all of the CSVs, concatenates them, and then gives the count of how many more slugs need to be scraped. 

In [9]:
os.chdir('/project/class/bii_sdad_dspg/uva_2021/dspg21oss/')
raw_slug_data = pd.read_csv('brandon_to_scrape_0712.csv')
raw_slug_data = raw_slug_data.iloc[100000:]
raw_slug_data.head()

Unnamed: 0,slug
100000,ITHIM/ithim-r-interface
100001,wmde/jahresbericht2016
100002,asskek/VARIS
100003,konnectors/caf
100004,tuleyman/discord


In [10]:
raw_slug_data['slug'].count()

44121

In [29]:
new_slugs = filter_scraped_readmes(original_data=raw_slug_data)

Current data has 37369 entries left to scrape (filtered 6752 from input data)


### Scraping the READMEs 

This chunk of code does a few things. First, it sets the `batch_name` to both keep track of which batch of data we are collecting and to name the output files with the appropriate name. This MUST be done before you run the code chunk. Second, the code sets up multiprocessing to draw from multiple cores. Third, the code converts the DataFrame into a list to feed the slugs into the for loop. Lastly, we feed all the slugs to the function as a for loop and it downloads the data into our project folder. 

**NOTE: Before running this cell, set the** `batch_name` **variable. Changing this variable will tell us which download batch the data was collected during and then save the CSV with that batch name.** 

In [30]:
# need to change this for each batch or you will save over what you had  
batch_name = 'oss_readme_batch_06_34' 

# sets the number of cores so that it can draw from multiprocessors 
# there must be 1 core subtracted so that the notebook can run too 
cores_available = multiprocessing.cpu_count() - 1
pool = Pool(cores_available)

# convert the dataframe into a list for the subsequent for loop 
raw_slugs = new_slugs["slug"].tolist()
slugs = []
for s in raw_slugs:
    slugs.append(s.strip())
    
# now we will feed in all of the remaining slugs 
slug_log = []
readme_log = []
asof_log = []
status_log = []
for result in pool.imap_unordered(scrape_readmes, slugs):
    slug_log.append(result[0])
    readme_log.append(result[1])
    asof_log.append(result[2])
    status_log.append(result[3])
    final_log = pd.DataFrame({'slug': slug_log, "readme_text": readme_log, 'batch': batch_name, 'as_of': asof_log, 'status': status_log}, 
                              columns=["slug", "readme_text", "batch", "as_of", "status"])
    #final_log.to_csv('/project/class/bii_sdad_dspg/ncses_oss_2021/requests_scrape/'+batch_name+'.csv', sep=',', encoding='utf-8', index=False)
    final_log.to_csv('/project/biocomplexity/sdad/projects_data/ncses/oss/dspg_2021/'+batch_name+'.csv', sep=',', encoding='utf-8', index=False)
print("Finished scraping", len(final_log), "of", len(slugs), "records")

404 error on ctessum/geom
404 error on KDAB/perfparser
404 error on zeatul/poc
404 error on Azure/azure-mobile-apps-js-client
404 error on gstoner/gpudb
404 error on tttamaki/libdai
404 error on irfu/Lapdog_GIT
404 error on romanodesouza/dotfiles
404 error on Yechengyang/FOD
404 error on james5deutschland/nsjail
404 error on openSUSE/susefirewall2
404 error on CyanogenMod/android_external_libphonenumbergoogle
404 error on mqudsi/nvim-config
404 error on code-hunger/opengl3-sandbox
404 error on BaroboRobotics/BaroboLink2
404 error on atefganm/openpli-oe-core
404 error on humanfactors/blog
404 error on natheon/rustlings
404 error on knl/prezto
404 error on cilt-uct/uct-quartz
404 error on cilt-uct/profilewow
404 error on akihiro-terasaki/sakai_master
404 error on stevemar/collections-master
404 error on tianocore/tianocore.github.io
404 error on arizvisa/syringe
404 error on logasja/pbrt-v3
404 error on rockon9sky/custom-kcptun
404 error on aunali1/xtideuniversalbios
404 error on GerkinD

KeyError: 'download_url'

### Common/Known Errors 

Here are some common/known errors that I have identified: 

1. `"MarkupResemblesLocatorWarning: "{slug}" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client like requests to get the document behind the URL, and feed that document to Beautiful Soup.` 

    Solution: This will just throw a warning and does not seemt to affect subsequent runs of the dataset. You can ignore this. 
    
2. `KeyError: 'download_url'`

    Solution: It seems like this error can mean one of two things. The first is insignificant and 

### Examining the Dataset 

This code just pulls in all of the downloaded data as a DataFrame and sees how many entries we have downloaded already. You can use the commented line to output all the data together.

In [51]:
#import sys
#print(sys.getrecursionlimit())
sys.setrecursionlimit(50000)

In [7]:
os.chdir('/project/biocomplexity/sdad/projects_data/ncses/oss/dspg_2021/')
extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames ])
combined_csv = combined_csv[combined_csv['status'] == 'Done']
combined_csv = combined_csv.sort_values("batch")
combined_csv
# 21220 > 55872 now 
combined_csv.to_csv('/project/class/bii_sdad_dspg/uva_2021/dspg21oss/oss_readme_data_071221.csv', sep=',', encoding='utf-8', index=False)
combined_csv.to_csv('/project/biocomplexity/sdad/projects_data/ncses/oss/dspg_2021/oss_readme_data_071221.csv', sep=',', encoding='utf-8', index=False)


Unnamed: 0,slug,readme_text,batch,as_of,status
415,mumblepins/debian-samba,"This is the release version of Samba, the free...",oss_readme_batch1_1,6/11/21 17:21,Done
238,fdvarela/odoo8,[![Build Status](http://runbot.odoo.com/runbot...,oss_readme_batch1_1,6/11/21 17:21,Done
237,Nkosi-tshawe/moodle,.-..-.\n __...,oss_readme_batch1_1,6/11/21 17:21,Done
236,dllsf/odootest,[![Build Status](http://runbot.odoo.com/runbot...,oss_readme_batch1_1,6/11/21 17:21,Done
235,dkavadia/stupefy,.-..-.\n __...,oss_readme_batch1_1,6/11/21 17:21,Done
...,...,...,...,...,...
535,zuokun2013/site1,# Initial page\n\nthis is a hello world page\n...,oss_readme_batch_05_72,2021-07-12 23:16:10,Done
534,almcalle/gatsby-starter-netlify-cms,# Gatsby + Netlify CMS Starter\n\n[![Netlify S...,oss_readme_batch_05_72,2021-07-12 23:16:10,Done
533,biwers/aj-test,<!-- AUTO-GENERATED-CONTENT:START (STARTER) --...,oss_readme_batch_05_72,2021-07-12 23:16:10,Done
539,taraldga/mossekarusellen-netlify-cms,404 ERROR - NO README,oss_readme_batch_05_72,2021-07-12 23:16:10,Done


In [9]:
os.chdir('/project/biocomplexity/sdad/projects_data/ncses/oss/dspg_2021/')
check = pd.read_csv('oss_readme_data_071221.csv')
check
#check.to_csv('/project/class/bii_sdad_dspg/uva_2021/dspg21oss/oss_readme_data_071121.csv', sep=',', encoding='utf-8', index=False)


Unnamed: 0,slug,readme_text,batch,as_of,status
0,mumblepins/debian-samba,"This is the release version of Samba, the free...",oss_readme_batch1_1,6/11/21 17:21,Done
1,fdvarela/odoo8,[![Build Status](http://runbot.odoo.com/runbot...,oss_readme_batch1_1,6/11/21 17:21,Done
2,Nkosi-tshawe/moodle,.-..-.\n __...,oss_readme_batch1_1,6/11/21 17:21,Done
3,dllsf/odootest,[![Build Status](http://runbot.odoo.com/runbot...,oss_readme_batch1_1,6/11/21 17:21,Done
4,dkavadia/stupefy,.-..-.\n __...,oss_readme_batch1_1,6/11/21 17:21,Done
...,...,...,...,...,...
444522,zuokun2013/site1,# Initial page\n\nthis is a hello world page\n...,oss_readme_batch_05_72,2021-07-12 23:16:10,Done
444523,almcalle/gatsby-starter-netlify-cms,# Gatsby + Netlify CMS Starter\n\n[![Netlify S...,oss_readme_batch_05_72,2021-07-12 23:16:10,Done
444524,biwers/aj-test,<!-- AUTO-GENERATED-CONTENT:START (STARTER) --...,oss_readme_batch_05_72,2021-07-12 23:16:10,Done
444525,taraldga/mossekarusellen-netlify-cms,404 ERROR - NO README,oss_readme_batch_05_72,2021-07-12 23:16:10,Done


^ That tells you the number of entries we have downloaded. 

### Development Space 

Below, are just some snippets of code that might be useful when tweaking the existing code. 

In [12]:
os.chdir('/project/class/bii_sdad_dspg/ncses_oss_2021/requests_scrape/')
check = pd.read_csv('oss_readme_batch_01_all.csv')
check

Unnamed: 0,slug,readme_text,batch,as_of,status
0,h2oota/emacs-win64-msvc,Copyright (C) 2001-2016 Free Software Foundati...,oss_readme_batch1_1,6/11/21 17:21,Done
1,Distrotech/mysql-server,MySQL Server 5.6\r\n\r\nThis is a release of M...,oss_readme_batch1_1,6/11/21 17:21,Done
2,DaichiUeura/Emacs-for-Windows--xj-,"Copyright (C) 2001, 2002, 2003, 2004, 2005, 20...",oss_readme_batch1_1,6/11/21 17:21,Done
3,httpgit12/jb4evea-16,"<img alt=""Swift logo"" height=""70"" src=""https:/...",oss_readme_batch1_1,6/11/21 17:21,Done
4,Tehsurfer/hugo-contrarian,# Contrarian website\nThis website is the face...,oss_readme_batch1_1,6/11/21 17:21,Done
...,...,...,...,...,...
55867,jongbinjung/undi,\n<!-- README.md is generated from README.Rmd....,oss_readme_batch1_9,2021-06-14 15:06:27,Done
55868,angelsenra/orphans,# orphans\nCollection of small projects or gis...,oss_readme_batch1_9,2021-06-14 15:06:26,Done
55869,polsani/gatsby-starter-netlify-cms,**Note:** Gatsby v2 beta support is here! Chec...,oss_readme_batch1_9,2021-06-14 15:06:26,Done
55870,Arya-NK/Arya-NK.github.io,\n,oss_readme_batch1_9,2021-06-14 15:06:27,Done


In [18]:
# Copy the function from above and make tweaks in this code chunk
combined_csv[combined_csv['readme_text'] == "404 ERROR - NO README"]

Unnamed: 0,slug,readme_text,batch,as_of,status
460,zvini/website,404 ERROR - NO README,oss_readme_batch1_1,6/11/21 16:40,Done
448,paleobiodb/data_service,404 ERROR - NO README,oss_readme_batch1_1,6/11/21 16:40,Done
451,jandockx/ppwcode-recovered-from-google-code,404 ERROR - NO README,oss_readme_batch1_1,6/11/21 16:40,Done
493,Liujingfang1/kprune,404 ERROR - NO README,oss_readme_batch1_1,6/11/21 17:21,Done
439,tiagoanatar/ninjagame,404 ERROR - NO README,oss_readme_batch1_1,6/11/21 16:40,Done
...,...,...,...,...,...
1386,a14chrve/Examensarbetet,404 ERROR - NO README,oss_readme_batch_02_27,2021-06-23 12:44:33,Done
1321,albinsjolin/esporthub-website,404 ERROR - NO README,oss_readme_batch_02_27,2021-06-23 12:44:32,Done
1317,edurekavivekh/pistream,404 ERROR - NO README,oss_readme_batch_02_27,2021-06-23 12:44:32,Done
1344,mouseM/learningMouse,404 ERROR - NO README,oss_readme_batch_02_27,2021-06-23 12:44:32,Done


In [124]:
select_data = combined_csv[combined_csv['readme_text'].notna()]
#select_data['readme_clean'] = select_data['readme_text'].str.lower()
select_data[select_data['readme_text'].str.contains("Blockchain")]

Unnamed: 0,slug,readme_text,batch,as_of,status
449,flowchain/flowchain-ledger,# flowchain-ledger\n&gt; Flowchain distributed...,oss_readme_batch1_10,2021-06-14 15:15:29,Done
948,fabiocolacio/Mercury,# Mercury Chat\n\nMercury is my end-to-end enc...,oss_readme_batch1_10,2021-06-14 15:15:38,Done
254,fkbenjamin/pc-firebase-starter,"# PassChain\n\nAuthors: Rob-Jago Flötgen, Flor...",oss_readme_batch1_11,2021-06-14 15:19:18,Done
1041,peacedudegregoryks/Old-SBC-SocialBenefitCoin,# The Social Benefit Coin Smart contract\n \nW...,oss_readme_batch1_11,2021-06-14 15:19:33,Done
758,mukira/fukoblockchainexplorer,# Fuko blockchain Explorer\n\n![GitHub Logo](h...,oss_readme_batch1_11,2021-06-14 15:19:28,Done
...,...,...,...,...,...
1724,agadzinski/vehicle-manufacture-20180607173737101,# Blockchain - Tutorial\n\nThis is the tutoria...,oss_readme_batch1_8,2021-06-14 12:35:40,Done
1689,shubhamp1p/vehicle-manufacture-20180606135842485,# Blockchain - Tutorial\n\nThis is the tutoria...,oss_readme_batch1_8,2021-06-14 12:35:40,Done
33,energywebfoundation/ew-did-registry,# EW DID Library v0.1\n## Disclaimer\n&gt; The...,oss_readme_batch1_8,2021-06-14 12:20:53,Done
236,jeffet/vehicle-manufacture-20180318192523657,# Blockchain - Tutorial\n\nThis is the tutoria...,oss_readme_batch1_9,2021-06-14 15:06:23,Done


In [16]:
# set the environment variables with your username and github personal access token here 
# connect to the database, download data 
#connection = pg.connect(host = 'postgis1', database = 'sdad', 
#                        user = os.environ.get('db_user'), 
#                        password = os.environ.get('db_pwd'))
#github_pats = '''SELECT * FROM gh_2007_2020.pats_update'''
#github_pats = pd.read_sql_query(github_pats, con=connection)
   
        
# can only make 2500 calls per hour 
# because the function calls twice each time 
@sleep_and_retry
@limits(calls=2500, period=3600)
def scrape_readmes(slug, github_pat_index):
    
    github_username = github_pats.login[github_pat_index]
    github_token = github_pats.token[github_pat_index]
    
    while True: 
        try: 
            # define url based on the slug 
            url = f'https://api.github.com/repos/{slug}/readme'
            response = requests.get(url, auth=(github_username, github_token))
            response_code = response.status_code
            
            if response_code == 404: 
                print(f"404 error on {slug}")
                readme_string = "404 ERROR - NO README"
                now = datetime.now()
                current_time = now.strftime("%Y-%m-%d %H:%M:%S")
                return slug, readme_string, current_time, "Done"
            
            elif response_code == 403:
                print(f"Rate limit exceeded (403 error) on {slug} at ", datetime.datetime.now())
                github_pat_index+=1
                print("***Exit current PAT, proceed to next PAT.")
                break  
            
        except KeyError:
            print("Key error for: " + slug, flush=True)
            break
        
        except requests.exceptions.HTTPError as http_error:
            print ("HTTP Error:", http_error)
            raise SystemExit(http_error)
            break
            
        except requests.exceptions.ConnectionError as connection_error:
            print ("Error Connecting:", connection_error)
            raise SystemExit(connection_error)
            break 
        
        except requests.exceptions.TooManyRedirects as toomany_requests:
            print ("Too Many Requests:", toomany_requests)
            raise SystemExit(toomany_requests)
            break
                
        except requests.exceptions.Timeout as timeout_error:
            print ("Timeout Error:", timeout_error)
            raise SystemExit(timeout_error)
            break
        
        except requests.exceptions.RequestException as request_exception_error:
            print ("Oops, Some Other Error:", request_exception_error)
            raise SystemExit(request_exception_error)
            break 
            
        html_content = response.content
        soup = BeautifulSoup(html_content, 'html.parser')
        site_json=json.loads(soup.text)
        readme_link = site_json['download_url']
        
        while True:
            try: 
                readme_response = requests.get(readme_link, auth=(github_username, github_token))
                readme_response_code = readme_response.status_code
                
            except requests.exceptions.HTTPError as http_error:
                print ("HTTP Error:", http_error)
                raise SystemExit(http_error)
                break
            
            except requests.exceptions.ConnectionError as connection_error:
                print ("Error Connecting:", connection_error)
                raise SystemExit(connection_error)
                break
            
            except requests.exceptions.TooManyRedirects as toomany_requests:
                print ("Too Many Requests:", toomany_requests)
                raise SystemExit(toomany_requests)
                break 
        
            except requests.exceptions.Timeout as timeout_error:
                print ("Timeout Error:", timeout_error)
                raise SystemExit(timeout_error)
                break
        
            except requests.exceptions.RequestException as request_exception_error:
                print ("Oops, Some Other Error:", request_exception_error)
                raise SystemExit(request_exception_error)
                break  
    
            # pull the content out of the readme 
            readme_content = readme_response.content
            readme_soup = BeautifulSoup(readme_content, 'html.parser')
            readme_string = str(readme_soup)
    
            #give us the the timing and status 
            now = datetime.now()
            current_time = now.strftime("%Y-%m-%d %H:%M:%S")
            #print(readme_string)
            return slug, readme_string, current_time, "Done"
        
def filter_scraped_readmes(original_data): 
    ''' 
    Function ingests repos data and filters out already scraped data from local csv 
    '''
    
    # ingests local csv data and converts it to a list 
    os.chdir('/project/class/bii_sdad_dspg/ncses_oss_2021/requests_scrape/')
    all_filenames = [i for i in glob.glob('*.csv')]
    combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames ])
    combined_csv = combined_csv[combined_csv['status'] == 'Done']
    scraped_slugs_list = combined_csv['slug'].tolist()
    
    # filters out all of the scraped slugs from the original_data 
    filtered_slugs = ~raw_slug_data.slug.isin(scraped_slugs_list)
    filtered_slugs = raw_slug_data[filtered_slugs]
    
    # provides the output of current slug count and number of slugs filtered 
    new_slug_count = filtered_slugs['slug'].count()
    slug_count_diff = raw_slug_data['slug'].count() - filtered_slugs['slug'].count()
    print("Current data has", new_slug_count, "entries (filtered", slug_count_diff, "from input data)")
    return filtered_slugs    

test_slugs = ["brandonleekramer/diversity", 
              'cjbd/src'
              "uva-bi-sdad/oss-2020", 
              "facebook/react", 
              "RichardLitt/standard-readme",            
]
    
if __name__ == "__main__":
    print("Started scraping")
    
    cores_available = multiprocessing.cpu_count() - 1
    print(f'There are {cores_available} CPUs available.')
    pool = multiprocessing.Pool(cores_available)
    
    # now we will feed in all of the remaining slugs 
    slug_log = []
    readme_log = []
    asof_log = []
    status_log = []
    
    for slug in test_slugs:
        result = pool.apply_async(scrape_readmes, args=(test_slugs, 15)
        results = [r.get() for r in results]
            #slug_log.append(result[0])
            #readme_log.append(result[1])
            #asof_log.append(result[2])
            #status_log.append(result[3])
            #final_log = pd.DataFrame({'slug': slug_log, "readme_text": readme_log, 'batch': batch_name, 'as_of': asof_log, 'status': status_log}, 
            #                  columns=["slug", "readme_text", "batch", "as_of", "status"])
            #final_log.to_csv('/project/class/bii_sdad_dspg/ncses_oss_2021/requests_scrape/'+batch_name+'.csv', sep=',', encoding='utf-8', index=False)
        #print("Finished scraping", len(final_log), "of", len(slugs), "records")
        #print(results)

SyntaxError: invalid syntax (<ipython-input-16-ae7a4300d788>, line 158)

In [15]:
connection = pg.connect(host = 'postgis1', database = 'sdad', 
                        user = os.environ.get('db_user'), 
                        password = os.environ.get('db_pwd'))
#PATs access token, saved as a dataframe
github_pats = '''SELECT * FROM gh_2007_2020.pats_update'''
github_pats = pd.read_sql_query(github_pats, con=connection)

#PATs access token, saved as a list
access_tokens = github_pats["token"]

#number of tokens available for use, a numeric value
num_token = '''SELECT COUNT(*) FROM gh_2007_2020.pats_update'''
num_token = pd.read_sql_query(num_token, con=connection)
num_token=num_token.iloc[0]['count']

connection.close()

In [17]:
# index ranges from 0 to maximum number of PATs available
def get_access_token(github_pat_index):
    if github_pat_index < num_token:
       # print("Extracting access token #", github_pat_index+1,", total", num_token, "tokens are available.")
        return github_pats.token[github_pat_index]
    else:
        print("token exceed limit")

In [19]:
num_token

28

In [59]:
test_slugs = ["brandonleekramer/diversity", 
              'cjbd/src'
              "uva-bi-sdad/oss-2020", 
              "facebook/react", 
              "RichardLitt/standard-readme",            
]

tmp_username = os.environ['GITHUB_USERNAME'] = 'brandonleekramer'
tmp_token = os.environ['GITHUB_TOKEN'] = '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'

# need to change this for each batch or you will save over what you had  
batch_name_test = "nothing"
cores_available = multiprocessing.cpu_count() - 1
pool = Pool(cores_available)

slug_log = []
readme_log = []
asof_log = []
status_log = []

for github_username, github_token in zip(login_list, token_list): 
    
    test_slugs = filter_scraped_readmes(original_data = test_slugs)
    
    for result in pool.imap_unordered(scrape_readmes, test_slugs):
        slug_log.append(result[0])
        readme_log.append(result[1])
        asof_log.append(result[2])
        status_log.append(result[3])
        final_log = pd.DataFrame({'slug': slug_log, "readme_text": readme_log, 
                                  'batch': batch_name_test, 'as_of': asof_log, 'status': status_log}, 
                                  columns=["slug", "readme_text", "batch", "as_of", "status"])
    print("Finished scraping", len(final_log), "of", len(test_slugs), "records", "using", github_username)

Current data has 24729 entries (filtered 3247 from input data)
404 error on id
404 error on slug
404 error on branch
404 error on status
404 error on primarylanguage
404 error on commits
404 error on createdat
404 error on description
404 error on spdx
404 error on asof
Finished scraping 10 of 24729 records using akindlon977
Current data has 24729 entries (filtered 3247 from input data)
404 error on description
404 error on slug
404 error on primarylanguage
404 error on status
404 error on id
404 error on createdat
404 error on commits
404 error on asof
404 error on branch
404 error on spdx
Finished scraping 20 of 24729 records using Azraab


In [61]:
for github_username, github_token in zip(login_list, token_list): 
    check_this = filter_scraped_readmes(original_data = raw_slug_data)
    # now you need to add +=1 to the batch name when sending the output 

print(raw_slug_data['slug'].count(), check_this['slug'].count())


Current data has 24729 entries (filtered 3247 from input data)
Current data has 24729 entries (filtered 3247 from input data)
27976 24729


to insert updates into sql: https://stackoverflow.com/questions/23103962/how-to-write-dataframe-to-postgres-table
multiprocessing: https://stackoverflow.com/questions/45718546/with-clause-for-multiprocessing-in-python/45734483


