# Data Access Notebook

In [18]:
import pandas as pd

## Project Gutenberg

Visit the [homepage](https://www.gutenberg.org/) for Project Gutenberg if you are having trouble finding a specific book.

Usage documentation for the Python package can be found [here](https://pypi.org/project/Gutenberg/).

In [67]:
from gutenberg.acquire import load_etext
from gutenberg.cleanup import strip_headers

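# Download etext 11 (Alice's Adventures in Wonderland) and strip the Project Gutenberg header and footer
text = strip_headers(load_etext(11)).strip()
# print(text)  # uncomment to view the full text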

### Simple Example

In [20]:
books = [[1342,"Pride and Prejudice","Jane Austen"],
         [11,"Alice's Adventures in Wonderland", "Lewis Carroll"],
         [2701,"Moby Dick; Or, The Whale","Herman Melville"],
         [84,"Frankenstein; Or, The Modern Prometheus", "Mary Wollstonecraft Shelley"],
         [345,"Dracula", "Bram Stoker"]
        ]

In [24]:
gutenDF = pd.DataFrame(books, columns=['ID', 'Title', 'Author'])
# Download each book by ID and strip the Project Gutenberg header and footer
gutenDF['FullText'] = gutenDF.apply(lambda row: strip_headers(load_etext(row['ID'])).strip(), axis=1)
gutenDF

| | ID | Title | Author | FullText |
| --- | --- | --- | --- | --- |
| 0 | 1342 | Pride and Prejudice | Jane Austen | `THERE IS AN ILLUSTRATED EDITION OF THIS TITLE ...` |
| 1 | 11 | Alice's Adventures in Wonderland | Lewis Carroll | `[Illustration]\n\n\n\n\nAlice’s Adventures in ...` |
| 2 | 2701 | Moby Dick; Or, The Whale | Herman Melville | `MOBY-DICK;\n\nor, THE WHALE.\n\nBy Herman Melv...` |
| 3 | 84 | Frankenstein; Or, The Modern Prometheus | Mary Wollstonecraft Shelley | `and David Meltzer. HTML version by Al Haines.\...` |
| 4 | 345 | Dracula | Bram Stoker | `DRACULA\n\n\n\n\n\n ...` |


### MetaData and Caching

If you plan to work extensively with Project Gutenberg's metadata, you will need to build a local cache of that metadata first. Populating the cache can take a very long time, but once it exists, metadata queries are fast.
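
The caching step might look like the following. This is a minimal sketch that relies on the `gutenberg` package's documented `get_metadata_cache` and `get_metadata` functions; the wrapper names `populate_metadata_cache` and `lookup_book` are hypothetical and only for illustration. The imports sit inside the functions so that nothing is downloaded until you call them.

```python
def populate_metadata_cache():
    """Build the local metadata cache. Run this once; it can take a very long time."""
    from gutenberg.acquire import get_metadata_cache

    cache = get_metadata_cache()
    cache.populate()
    return cache


def lookup_book(etext_id):
    """Return title and author metadata for one etext ID (needs a populated cache)."""
    from gutenberg.query import get_metadata

    return {
        "title": get_metadata("title", etext_id),
        "author": get_metadata("author", etext_id),
    }


# Example usage (commented out because populating the cache is slow):
# populate_metadata_cache()
# lookup_book(2701)
```

Keeping the slow populate step in its own function makes it easy to run once in a dedicated cell and skip on later runs.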

## EDGAR Database

Here is the [homepage](https://www.sec.gov/edgar.shtml) for the Securities and Exchange Commission's EDGAR database. If you are having trouble finding a specific company, try their [full text search](https://www.sec.gov/edgar/search/#).

Usage documentation for the Python package can be found [here](https://pypi.org/project/edgar/).

Pull the five most recent 10-K reports for Oracle Corporation:

In [48]:
from edgar import Company, TXTML
company = Company("Oracle Corp", "0001341439")
tree = company.get_all_filings(filing_type = "10-K")
docs = Company.get_documents(tree, no_of_documents=5)

Parse the most recent 10-K filing for IBM:

In [49]:
company = Company("INTERNATIONAL BUSINESS MACHINES CORP", "0000051143")
doc = company.get_10K()
text = TXTML.parse_full_10K(doc)

Search EDGAR for companies whose names match "Cisco System":

In [50]:
from edgar import Edgar
edgar = Edgar()
possible_companies = edgar.find_company_name("Cisco System")
possible_companies

['CISCO SYSTEMS (SWITZERLAND) INVESTMENTS LTD',
 'CISCO SYSTEMS CAPITAL CORP',
 'CISCO SYSTEMS INC',
 'CISCO SYSTEMS INTERNATIONAL B.V.',
 'CISCO SYSTEMS, INC.',
 'SPANISH BROADCASTING SYSTEM SAN FRANCISCO INC']

### Simple Example

In [51]:
companies = [['AMAZON COM INC','0001018724'],
            ['Alphabet Inc.','0001652044'],
            ['MICROSOFT CORP','0000789019']
            ]

In [52]:
edgarDF = pd.DataFrame(companies, columns=['Company', 'CIK'])
# Fetch and parse the most recent 10-K filing for each company
edgarDF['MostRecent_10K'] = edgarDF.apply(lambda row: TXTML.parse_full_10K(Company(row['Company'], row['CIK']).get_10K()), axis=1)
edgarDF

| | Company | CIK | MostRecent_10K |
| --- | --- | --- | --- |
| 0 | AMAZON COM INC | 1018724 | `\n\n\n\n\n\n\n\n\t\n\t\t\n\t\tDocument\n\t\n\t...` |
| 1 | Alphabet Inc. | 1652044 | `\n\n\n\n\n\n\n\n\t\n\t\t\n\t\tDocument\n\t\n\t...` |
| 2 | MICROSOFT CORP | 789019 | `\n\n\n\n\n\n\n\n\n\n\n\nmsft-10k_20200630.htm\...` |
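
The CIK values in the output above appear without the leading zeros used in the `companies` list. EDGAR CIKs are conventionally written as 10-digit, zero-padded strings, so if your identifiers lose their padding (for example after being parsed as integers from a CSV file), a small helper can restore it. The function name `normalize_cik` is hypothetical; this is a sketch, not part of the `edgar` package.

```python
def normalize_cik(cik):
    """Return a CIK as a 10-character, zero-padded string."""
    digits = str(cik).strip()
    if not digits.isdigit():
        raise ValueError("CIK must contain only digits, got {!r}".format(cik))
    if len(digits) > 10:
        raise ValueError("CIK has more than 10 digits: {!r}".format(cik))
    return digits.zfill(10)


print(normalize_cik(1018724))       # 0001018724
print(normalize_cik("0000789019"))  # 0000789019
```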


## Hansard

## US Congress

The data was pulled via a shell script (congress_download.sh) from [Stanford's Congressional Record](https://data.stanford.edu/congress_text) dataset and is available on M2. If you use this data, please cite the dataset properly.

In [68]:
import glob

In [69]:
path_to_congress = "/scratch/group/oit_research_data/stanford_congress"

In [72]:
glob.glob('{}/*'.format(path_to_congress))

['/scratch/group/oit_research_data/stanford_congress/__MACOSX',
 '/scratch/group/oit_research_data/stanford_congress/speakermap_stats',
 '/scratch/group/oit_research_data/stanford_congress/keywords.txt',
 '/scratch/group/oit_research_data/stanford_congress/topic_phrases.txt',
 '/scratch/group/oit_research_data/stanford_congress/congress_download.sh',
 '/scratch/group/oit_research_data/stanford_congress/vocabulary',
 '/scratch/group/oit_research_data/stanford_congress/hein-daily',
 '/scratch/group/oit_research_data/stanford_congress/false_matches.txt',
 '/scratch/group/oit_research_data/stanford_congress/party_full',
 '/scratch/group/oit_research_data/stanford_congress/partisan_phrases',
 '/scratch/group/oit_research_data/stanford_congress/hein-bound',
 '/scratch/group/oit_research_data/stanford_congress/audit']

To make the best use of this dataset, we suggest reading the [codebook](https://stacks.stanford.edu/file/druid:md374tz9962/codebook_v4.pdf).

## COVID-19 Text Data

The data was pulled via a shell script (congress_download.sh) from the [COVID-19 Open Research Dataset (CORD-19)](https://innovation.mit.edu/cord19/) and is available on M2. If you use this data, please cite the dataset properly.

In [200]:
import glob
import json

In [201]:
date = '2020-08-23'

In [202]:
path_to_covid = "/scratch/group/oit_research_data/semantic_scholar_cord_19/"+date

In [203]:
glob.glob('{}/*'.format(path_to_covid))

['/scratch/group/oit_research_data/semantic_scholar_cord_19/2020-08-23/document_parses',
 '/scratch/group/oit_research_data/semantic_scholar_cord_19/2020-08-23/cord_19_embeddings_2020-08-23.csv',
 '/scratch/group/oit_research_data/semantic_scholar_cord_19/2020-08-23/changelog',
 '/scratch/group/oit_research_data/semantic_scholar_cord_19/2020-08-23/metadata.csv']

In [204]:
metadata = pd.read_csv('{}/metadata.csv'.format(path_to_covid), dtype=object)
metadata.head(2)

| | cord_uid | sha | source_x | title | doi | pmcid | pubmed_id | license | abstract | publish_time | authors | journal | mag_id | who_covidence_id | arxiv_id | pdf_json_files | pmc_json_files | url | s2_id |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | ug7v899j | d1aafb70c066a2068b02786f8929fd9c900897fb | PMC | Clinical features of culture-proven Mycoplasma... | 10.1186/1471-2334-1-6 | PMC35282 | 11472636 | no-cc | OBJECTIVE: This retrospective chart review des... | 2001-07-04 | Madani, Tariq A; Al-Ghamdi, Aisha A | BMC Infect Dis | | | | document_parses/pdf_json/d1aafb70c066a2068b027... | document_parses/pmc_json/PMC35282.xml.json | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3... | |
| 1 | 02tnwd4m | 6b0567729c2143a66d737eb0a2f63f2dce2e5a7d | PMC | Nitric oxide: a pro-inflammatory mediator in l... | 10.1186/rr14 | PMC59543 | 11667967 | no-cc | Inflammatory diseases of the respiratory tract... | 2000-08-15 | Vliet, Albert van der; Eiserich, Jason P; Cros... | Respir Res | | | | document_parses/pdf_json/6b0567729c2143a66d737... | document_parses/pmc_json/PMC59543.xml.json | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5... | |


The function below reads the JSON file associated with a row of the metadata table. It tries the PDF parse first, falls back to the PMC parse, and returns a placeholder if neither file is found.

In [205]:
def read_text_from_json(row, path=path_to_covid):
    """Load the parsed JSON for a metadata row, preferring the PDF parse over the PMC parse."""
    file_pdf_json = '{}/{}'.format(path, row['pdf_json_files'])
    file_pmc_json = '{}/{}'.format(path, row['pmc_json_files'])

    try:
        # Try the PDF-derived parse first
        with open(file_pdf_json) as f:
            text = json.load(f)
    except FileNotFoundError:
        try:
            # Fall back to the PMC-derived parse
            with open(file_pmc_json) as f:
                text = json.load(f)
        except FileNotFoundError:
            # Neither file exists: return a placeholder with the same key
            text = {"body_text": "No Files Listed Were Found"}

    return text

In [206]:
# Select a random sample of 100 articles that list at least one JSON file
covidDF = metadata.dropna(subset=['pdf_json_files', 'pmc_json_files'], thresh=1).sample(100)

# Read each article's parsed JSON with read_text_from_json
covidDF['Text'] = covidDF.apply(lambda row: read_text_from_json(row, path_to_covid), axis=1)
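
`read_text_from_json` treats each metadata cell as a single path. In CORD-19 metadata, a cell can sometimes list several JSON files separated by semicolons, in which case the combined string will not match any file. The sketch below shows one way to expand a row into individual candidate paths. It assumes the semicolon separator described in the dataset documentation, and the name `candidate_json_paths` and the example paths are hypothetical.

```python
def candidate_json_paths(row, base_path):
    """List full paths for every JSON file named in a metadata row, PDF parses first."""
    paths = []
    for column in ("pdf_json_files", "pmc_json_files"):
        value = row.get(column)
        # Missing cells are NaN (a float) in pandas, so keep only real strings
        if not isinstance(value, str):
            continue
        for relative in value.split(";"):
            relative = relative.strip()
            if relative:
                paths.append("{}/{}".format(base_path, relative))
    return paths


# Illustrative row; a pandas Series supports .get() in the same way as a dict
example_row = {
    "pdf_json_files": "document_parses/pdf_json/aaa.json; document_parses/pdf_json/bbb.json",
    "pmc_json_files": float("nan"),
}
print(candidate_json_paths(example_row, "/data/cord19"))
```

You could then try each returned path in order and stop at the first one that opens successfully.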

Pull the body text of the paper in the first row of our sample:

In [199]:
covidDF.iloc[0]['Text']['body_text']

| | cord_uid | sha | source_x | title | doi | pmcid | pubmed_id | license | abstract | publish_time | authors | journal | mag_id | who_covidence_id | arxiv_id | pdf_json_files | pmc_json_files | url | s2_id | Text |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 193929 | 8dyaeu2v | 40276a234b15f83ca6f70b4956659025fd7b2ae5 | Elsevier; Medline; PMC | Persistent rhinovirus infection in pediatric h... | 10.1016/j.jcv.2015.03.022 | PMC7172262 | 25959156 | no-cc | BACKGROUND: HRV infections are generally self-... | 2015-03-31 | Piralla, Antonio; Zecca, Marco; Comoli, Patriz... | J Clin Virol | | | | document_parses/pdf_json/40276a234b15f83ca6f70... | document_parses/pmc_json/PMC7172262.xml.json | https://doi.org/10.1016/j.jcv.2015.03.022; htt... | 8392293 | `{'paper_id': '40276a234b15f83ca6f70b4956659025...` |
| 22329 | 7lgffsh2 | 9249cd9619f8ea2f256907447cf4438a9fd368b8 | PMC | Vaccine Safety | 10.1016/b978-0-323-35761-6.00082-1 | PMC7173515 | | no-cc | | 2017-07-17 | Destefano, Frank; Offit, Paul A.; Fisher, Allison | Plotkin's Vaccines | | | | document_parses/pdf_json/9249cd9619f8ea2f25690... | document_parses/pmc_json/PMC7173515.xml.json | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7... | | `{'paper_id': '9249cd9619f8ea2f256907447cf4438a...` |
| 25522 | 7alkvlx2 | 8c33b15e03d3af2f797ec6c1a1ec451e1162ce29 | PMC | Arbeitsfähig mit Flexibilität und Pragmatismus | 10.1007/s35114-020-0238-8 | PMC7280019 | | no-cc | | 2020-06-09 | Papenkort, Jörg; Preußner, Philipp | Innov Verwalt | | | | document_parses/pdf_json/8c33b15e03d3af2f797ec... | | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7... | | `{'paper_id': '8c33b15e03d3af2f797ec6c1a1ec451e...` |
| 19701 | v10t6kvw | 73f5bad27f59343df9604ab16a5af610c9c19f33 | PMC | Esophageal cancer spatial and correlation anal... | 10.1007/s11442-014-1072-8 | PMC7149033 | | no-cc | Esophageal cancer exhibits one of the highest ... | 2013-12-17 | Zhang, Xueyan; Zhuang, Dafang; Ma, Xin; Jiang,... | | | | | document_parses/pdf_json/73f5bad27f59343df9604... | | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7... | | `{'paper_id': '73f5bad27f59343df9604ab16a5af610...` |
| 23140 | i6puqauk | 1477b1cdf998c498ab4772b27d46e60998a9de56 | PMC | Deep Multivariate Time Series Embedding Cluste... | 10.1007/978-3-030-47426-3_25 | PMC7206254 | | no-cc | Nowadays, great quantities of data are produce... | 2020-04-17 | Ienco, Dino; Interdonato, Roberto | Advances in Knowledge Discovery and Data Mining | | | | document_parses/pdf_json/1477b1cdf998c498ab477... | document_parses/pmc_json/PMC7206254.xml.json | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7... | | `{'paper_id': '1477b1cdf998c498ab4772b27d46e609...` |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 227128 | jm07z3zk | 5bb36fbce1ce7a228b339e244490b2e7148b42c6 | Medline; PMC | A Marketing Approach to a Psychological Proble... | 10.3390/ijerph17072471 | PMC7177352 | 32260429 | cc-by | Background: Smartphones have become an indispe... | 2020-04-04 | Ertemel, Adnan Veysel; Ari, Ela | Int J Environ Res Public Health | | | | document_parses/pdf_json/5bb36fbce1ce7a228b339... | document_parses/pmc_json/PMC7177352.xml.json | https://www.ncbi.nlm.nih.gov/pubmed/32260429/;... | 215410395 | `{'paper_id': '5bb36fbce1ce7a228b339e244490b2e7...` |
| 231772 | 65e8ol64 | f7ff62013c879bbcaca24b52d09d15b0aa70839b | Medline; PMC | Biochemical Microarrays for Studying Chemical ... | 10.1111/j.1747-0285.2005.00326.x | PMC7162007 | 16492155 | no-cc | DiscoveryDot(TM) is a novel solution‐phase tec... | 2005-12-21 | Horiuchi, Kurumi Y.; Wang, Yuan; Ma, Haiching | Chem Biol Drug Des | | | | document_parses/pdf_json/f7ff62013c879bbcaca24... | document_parses/pmc_json/PMC7162007.xml.json | https://www.ncbi.nlm.nih.gov/pubmed/16492155/ | 29919012 | `{'paper_id': 'f7ff62013c879bbcaca24b52d09d15b0...` |
| 206739 | k353k8x9 | 51469a904b145a6b4f0b5a3724e9621929400294 | Medline; PMC | Folding of Viral Envelope Glycoproteins in the... | 10.1034/j.1600-0854.2000.010702.x | PMC7190097 | 11208140 | no-cc | Viral glycoproteins fold and oligomerize in th... | 2002-01-10 | Braakman, Ineke; Van Anken, Eelco | Traffic | | | | document_parses/pdf_json/51469a904b145a6b4f0b5... | document_parses/pmc_json/PMC7190097.xml.json | https://www.ncbi.nlm.nih.gov/pubmed/11208140/ | 21045560 | `{'paper_id': '51469a904b145a6b4f0b5a3724e96219...` |
| 30351 | t9dkn5mp | aa66d9515d520d06f9022dc33a2208e1c8c24904 | PMC | Evidence-Based Guidelines for Recording Slide-... | 10.1007/s40670-020-01032-w | PMC7394704 | | no-cc | Pre-recorded lectures can be an efficient way ... | 2020-07-31 | Kurzweil, Dina; Meyer, Eric; Marcellas, Karen;... | Med Sci Educ | | | | document_parses/pdf_json/aa66d9515d520d06f9022... | document_parses/pmc_json/PMC7394704.xml.json | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7... | | `{'paper_id': 'aa66d9515d520d06f9022dc33a2208e1...` |
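
In the parsed CORD-19 JSON, `body_text` is typically a list of paragraph dictionaries, each with a `text` field, whereas the placeholder returned by `read_text_from_json` stores a plain string under the same key. The sketch below, which assumes that layout, flattens either form into one string. The function name `body_text_to_string` and the example document are hypothetical.

```python
def body_text_to_string(doc):
    """Join the paragraphs of a parsed article into one string."""
    body = doc.get("body_text", "")
    # The placeholder for missing files stores a plain string
    if isinstance(body, str):
        return body
    # Otherwise expect a list of paragraph dicts with a "text" field
    paragraphs = [p["text"] for p in body if isinstance(p, dict) and "text" in p]
    return "\n\n".join(paragraphs)


# Illustrative document in the assumed layout
example_doc = {
    "paper_id": "example",
    "body_text": [
        {"section": "Introduction", "text": "First paragraph."},
        {"section": "Methods", "text": "Second paragraph."},
    ],
}
print(body_text_to_string(example_doc))
```

Applied to the sample, this could look like `covidDF['BodyText'] = covidDF['Text'].apply(body_text_to_string)`.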


## Reddit Archive

Loading the Reddit archive into pandas requires more memory than a typical node provides, so you will want to open a JupyterLab session on a medium-mem queue. For more information, see the documentation [here](http://faculty.smu.edu/csc/documentation/slurm.html#maneframe-ii-s-slurm-partitions-queues).

If you would rather keep the data on disk than use a higher-memory node, you will need a dataframe library other than pandas. Dask supports this, but it will be much slower.
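
To illustrate the idea of processing data from disk without loading it all into memory, here is a sketch that uses only the Python standard library rather than Dask. It assumes the archive is stored as newline-delimited JSON with a `subreddit` field, which this notebook does not state; the function name `count_by_field` and the file path in the usage comment are hypothetical.

```python
import json
from collections import Counter


def count_by_field(path, field="subreddit"):
    """Count records per value of `field`, reading one line at a time."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            counts[record.get(field)] += 1
    return counts


# Example usage (hypothetical path):
# counts = count_by_field("/path/to/reddit_comments.jsonl")
# print(counts.most_common(10))
```

Only one record is held in memory at a time, so memory use stays flat regardless of file size, at the cost of a full pass over the file for every question you ask.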