# Basics on Using `Corpus`

One of the easiest ways to get started with allofplos is to use the `Corpus` class. 

But, you ask:

> Why use the `Corpus` class?

It is a straightforward way to get back `Article` objects from your corpus without needing to instantiate them one by one.

It also has handy utilities for more specialized tasks, which this tutorial does not cover.

> How do I use it? 

Eager, are we‽ I thought you'd never ask!

## Import Corpus

First, we need to import `Corpus`. 

We're also going to import the `starterdir` corpus directory to use the corpus that comes with `allofplos`.

In [1]:
from allofplos import Corpus, starterdir

## Instantiate Corpus

Second, we need to instantiate the `Corpus` object.

In this case, we pass in `starterdir` so that we use the starter corpus that ships with allofplos.

In [2]:
corpus = Corpus(starterdir)

## Use Corpus

Now you're ready to use `corpus`! 

`Corpus()` is a great way to get access to `Article` objects that come from your corpus directory. There are a number of ways that you can access this functionality that we'll discuss below.

### See how many articles are in the corpus

You can use `len(corpus)` to get the number of articles in the corpus.

In [3]:
len(corpus)

122

### Display a random article 

To get a single random article we use `corpus.random_article`.

Each time you access it, a new article is drawn at random.

In [4]:
display(corpus.random_article)

DOI: 10.1371/journal.pone.0118342
Title: An Efficient Algorithm to Perform Local Concerted Movements of a Chain Molecule
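
To see why a plain attribute access can give a different article each time, here is a minimal sketch using a Python property. `ToyCorpus` is a hypothetical stand-in written for this illustration (it hands back doi strings rather than `Article` objects, and the `seed` argument is an assumption of the sketch); it is not how allofplos itself is implemented.

```python
import random


class ToyCorpus:
    """Hypothetical stand-in for illustration; not the allofplos Corpus."""

    def __init__(self, dois, seed=None):
        self.dois = sorted(dois)
        # A private random generator, so that passing a seed makes draws repeatable.
        self._rng = random.Random(seed)

    @property
    def random_article(self):
        # A property runs this body on every attribute access,
        # so each read of `toy.random_article` draws again.
        return self._rng.choice(self.dois)


toy = ToyCorpus(
    [
        "10.1371/journal.pone.0118342",
        "10.1371/journal.pcbi.1004141",
        "10.1371/journal.pbio.0020188",
    ],
    seed=0,
)

# Five separate accesses, five independent draws (repeats are possible).
draws = [toy.random_article for _ in range(5)]
print(draws)
```

If you want to work with the same article across several steps, assign it to a variable once instead of accessing the property again.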

### Display an article from a specific doi


If you already know the doi for the article you are interested in, you can look the article up the way you would look up a value in a dictionary: `corpus[your_doi]`.

In [5]:
corpus['10.1371/journal.pcbi.1004141']

DOI: 10.1371/journal.pcbi.1004141
Title: The Equivalence of Information-Theoretic and Likelihood-Based Methods for Neural Dimensionality Reduction

### Display specific articles in lexicographic order

You can also get access to articles at specific positions in the corpus.

#### Integer indexing
You can do this with a single integer:

In [6]:
display(corpus[0])

DOI: 10.1371/journal.pbio.0020188
Title: Taking the Stem Cell Debate to the Public

#### Slice indexing

Or you can do this with a slice of integers, just as you would with a list.

However, Articles can take up a lot of memory if you have (say) over 200,000 of them. To avoid that memory overhead, slicing returns a generator rather than a list.

Below, we display every other article among the first 10 in the corpus.

In [7]:
display(*corpus[:10:2])

DOI: 10.1371/journal.pbio.0020188
Title: Taking the Stem Cell Debate to the Public

DOI: 10.1371/journal.pbio.0030408
Title: Stimulating the Brain Makes the Fingers More Sensitive

DOI: 10.1371/journal.pbio.1000359
Title: The Light-Driven Proton Pump Proteorhodopsin Enhances Bacterial Survival during Tough Times

DOI: 10.1371/journal.pbio.1001199
Title: Interplay between BRCA1 and RHAMM Regulates Epithelial Apicobasal Polarization and May Influence Risk of Breast Cancer

DOI: 10.1371/journal.pbio.1001315
Title: Sialyllactose in Viral Membrane Gangliosides Is a Novel Molecular Recognition Pattern for Mature Dendritic Cell Capture of HIV-1
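
The three access styles shown above (doi string, integer, slice) can all be served by a single `__getitem__` method. The sketch below is a hypothetical `ToyCorpus` that stores doi strings only. It illustrates the pattern under that assumption and makes no claim about the actual allofplos source.

```python
class ToyCorpus:
    """Hypothetical stand-in; returns doi strings where Corpus returns Articles."""

    def __init__(self, dois):
        # Keep dois sorted so positions follow lexicographic order.
        self.dois = sorted(dois)

    def __len__(self):
        return len(self.dois)

    def __getitem__(self, key):
        if isinstance(key, int):
            # corpus[0]: a single item by position
            return self.dois[key]
        if isinstance(key, slice):
            # corpus[:10:2]: a generator, so a real corpus could build
            # each (heavier) article object only when it is requested
            return (doi for doi in self.dois[key])
        if key in self.dois:
            # corpus['10.1371/...']: dictionary-style lookup by doi
            return key
        raise KeyError(key)


toy = ToyCorpus(
    [
        "10.1371/journal.pbio.0030408",
        "10.1371/journal.pbio.0020188",
        "10.1371/journal.pbio.1000359",
    ]
)

first = toy[0]
by_doi = toy["10.1371/journal.pbio.0030408"]
# Wrap the generator in list() when you need a concrete list.
every_other = list(toy[::2])
print(first, by_doi, every_other)
```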

### Access every article in the corpus

You can use Python's `for article in corpus:` syntax to do something to each article in your corpus.

Each time you run the loop, the articles come back in a new random order.

In [8]:
for article in corpus:
    print("doi:", article.doi, "journal:", article.journal)

doi: 10.1371/journal.pone.0100977 journal: PLOS ONE
doi: 10.1371/journal.pone.0126470 journal: PLOS ONE
doi: 10.1371/journal.pone.0052690 journal: PLOS ONE
doi: 10.1371/journal.pcbi.1000589 journal: PLOS Computational Biology
doi: 10.1371/journal.pone.0121226 journal: PLOS ONE
doi: 10.1371/journal.pone.0152025 journal: PLOS ONE
doi: 10.1371/journal.ppat.0040045 journal: PLOS Pathogens
doi: 10.1371/journal.pone.0117014 journal: PLOS ONE
doi: 10.1371/journal.pone.0153170 journal: PLOS ONE
doi: 10.1371/journal.pbio.1000359 journal: PLOS Biology
doi: 10.1371/journal.pntd.0001969 journal: PLOS Neglected Tropical Diseases
doi: 10.1371/journal.pone.0118238 journal: PLOS ONE
doi: 10.1371/journal.pone.0008519 journal: PLOS ONE
doi: 10.1371/journal.pbio.1001199 journal: PLOS Biology
doi: 10.1371/journal.pbio.1001473 journal: PLOS Biology
doi: 10.1371/journal.pone.0120924 journal: PLOS ONE
doi: 10.1371/journal.pone.0016329 journal: PLOS ONE
doi: 10.1371/journal.pone.0111971 journal: PLOS ONE
doi:
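
Because the iteration order changes from run to run, code that loops over a corpus should either not depend on order or impose one explicitly. The sketch below shows both options. `ToyArticle` and `ShuffledCorpus` are hypothetical stand-ins invented for this example; the dois and journal names are taken from the output above.

```python
import random
from collections import Counter, namedtuple

# Hypothetical stand-in for Article, with just the two attributes used above.
ToyArticle = namedtuple("ToyArticle", ["doi", "journal"])


class ShuffledCorpus:
    """Hypothetical stand-in that yields its items in a fresh random order."""

    def __init__(self, articles):
        self._articles = list(articles)

    def __iter__(self):
        shuffled = self._articles[:]  # copy, so the stored order is untouched
        random.shuffle(shuffled)
        return iter(shuffled)


toy = ShuffledCorpus(
    [
        ToyArticle("10.1371/journal.pone.0100977", "PLOS ONE"),
        ToyArticle("10.1371/journal.pone.0126470", "PLOS ONE"),
        ToyArticle("10.1371/journal.pcbi.1000589", "PLOS Computational Biology"),
        ToyArticle("10.1371/journal.pbio.1000359", "PLOS Biology"),
    ]
)

# Option 1: an aggregate that does not care about order.
journal_counts = Counter(article.journal for article in toy)

# Option 2: impose a stable order by sorting on the doi.
in_doi_order = sorted(toy, key=lambda article: article.doi)

print(journal_counts)
print([article.doi for article in in_doi_order])
```

Note that `sorted()` holds every item in memory at once, which is fine for a small corpus but may be costly for a very large one.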

### Access a random sample of articles

You can use the `corpus.random_sample()` method to get a random sample of articles from the corpus. 

The best way to use this is by iterating through the random sample: `for article in corpus.random_sample(x):`.

**NB**: It returns a generator (not a list) to avoid using too much memory.

In [9]:
for article in corpus.random_sample(50):
    display(article)

DOI: 10.1371/journal.pone.0042593
Title: The Impact of Psychological Stress on Men's Judgements of Female Body Size

DOI: 10.1371/journal.pone.0120049
Title: Effects of Acute Exposure to Increased Plasma Branched-Chain Amino Acid Concentrations on Insulin-Mediated Plasma Glucose Turnover in Healthy Young Subjects

DOI: 10.1371/journal.pmed.0030520
Title: Angiotensin-Converting Enzyme I/D Polymorphism and Preeclampsia Risk: Evidence of Small-Study Bias

DOI: 10.1371/journal.pgen.1000052
Title: A Genome-Wide Gene Expression Signature of Environmental Geography in Leukocytes of Moroccan Amazighs

DOI: 10.1371/journal.pone.0118342
Title: An Efficient Algorithm to Perform Local Concerted Movements of a Chain Molecule

DOI: 10.1371/journal.pone.0119074
Title: The Effect of Cluster Size Variability on Statistical Power in Cluster-Randomized Trials

DOI: 10.1371/journal.pone.0047391
Title: Monitoring HIV Viral Load in Resource Limited Settings: Still a Matter of Debate?

DOI: 10.1371/journal.ppat.1000166
Title: The Pseudomonas Quinolone Signal (PQS) Balances Life and Death in Pseudomonas aeruginosa Populations

DOI: 10.1371/journal.pone.0153170
Title: Renal Transplant Recipients Treated with Calcineurin-Inhibitors Lack Circulating Immature Transitional CD19+CD24hiCD38hi Regulatory B-Lymphocytes

DOI: 10.1371/journal.pbio.0030408
Title: Stimulating the Brain Makes the Fingers More Sensitive

DOI: 10.1371/journal.pone.0147124
Title: A Microarray-Based Analysis Reveals that a Short Photoperiod Promotes Hair Growth in the Arbas Cashmere Goat

DOI: 10.1371/journal.pcbi.1003292
Title: Reconstructing the Genomic Content of Microbiome Taxa through Shotgun Metagenomic Deconvolution

DOI: 10.1371/journal.pone.0055490
Title: Genetic Testing for TMEM154 Mutations Associated with Lentivirus Susceptibility in Sheep

DOI: 10.1371/journal.pone.0100977
Title: Identification of a Major Phosphopeptide in Human Tristetraprolin by Phosphopeptide Mapping and Mass Spectrometry

DOI: 10.1371/journal.pone.0097541
Title: Correction: Pollen and Phytolith Evidence for Rice Cultivation and Vegetation Change during the Mid-Late Holocene at the Jiangli Site, Suzhou, East China

DOI: 10.1371/journal.pntd.0001041
Title: A Phase Two Randomised Controlled Double Blind Trial of High Dose Intravenous Methylprednisolone and Oral Prednisolone versus Intravenous Normal Saline and Oral Prednisolone in Individuals with Leprosy Type 1 Reactions and/or Nerve Function Impairment

DOI: 10.1371/journal.pone.0138823
Title: Structure-Activity Relationship of Indole-Tethered Pyrimidine Derivatives that Concurrently Inhibit Epidermal Growth Factor Receptor and Other Angiokinases

DOI: 10.1371/journal.pone.0005723
Title: Complete Primate Skeleton from the Middle Eocene of Messel in Germany: Morphology and Paleobiology

DOI: 10.1371/journal.ppat.1003133
Title: Schmallenberg Virus Pathogenesis, Tropism and Interaction with the Innate Immune System of the Host

DOI: 10.1371/journal.ppat.1002247
Title: Vaccinia Virus Protein C6 Is a Virulence Factor that Binds TBK-1 Adaptor Proteins and Inhibits Activation of IRF3 and IRF7

DOI: 10.1371/journal.pone.0116586
Title: Diminished Response of Arctic Plants to Warming over Time

DOI: 10.1371/journal.pbio.1001044
Title: Cancer: The Whole Story

DOI: 10.1371/journal.pone.0080518
Title: Ecoinformatics Can Reveal Yield Gaps Associated with Crop-Pest Interactions: A Proof-of-Concept

DOI: 10.1371/journal.pmed.1001300
Title: Multidrug Resistant Pulmonary Tuberculosis Treatment Regimens and Patient Outcomes: An Individual Patient Data Meta-analysis of 9,153 Patients

DOI: 10.1371/journal.pbio.1001473
Title: The Oxytricha trifallax Macronuclear Genome: A Complex Eukaryotic Genome with 16,000 Tiny Chromosomes

DOI: 10.1371/journal.pmed.0020124
Title: Why Most Published Research Findings Are False

DOI: 10.1371/journal.pone.0068090
Title: Abnormal Contextual Modulation of Visual Contour Detection in Patients with Schizophrenia

DOI: 10.1371/journal.pcbi.1004082
Title: On the Number of Neurons and Time Scale of Integration Underlying the Formation of Percepts in the Brain

DOI: 10.1371/journal.pone.0040259
Title: The Eyes Don’t Have It: Lie Detection and Neuro-Linguistic Programming

DOI: 10.1371/journal.pmed.1001186
Title: Guidance for Evidence-Informed Policies about Health Systems: Linking Guidance Development to Policy Development

DOI: 10.1371/journal.pone.0052690
Title: The Internal Organization of Mycobacterial Partition Assembly: Does the DNA Wrap a Protein Core?

DOI: 10.1371/journal.pone.0153152
Title: "May I Buy a Pack of Marlboros, Please?" A Systematic Review of Evidence to Improve the Validity and Impact of Youth Undercover Buy Inspections

DOI: 10.1371/journal.pmed.0030132
Title: Bigger and Better: How Pfizer Redefined Erectile Dysfunction

DOI: 10.1371/journal.pcbi.1000589
Title: A Quick Guide for Developing Effective Bioinformatics Programming Skills

DOI: 10.1371/journal.pone.0058242
Title: Improved Glomerular Filtration Rate Estimation by an Artificial Neural Network

DOI: 10.1371/journal.ppat.1005207
Title: Retraction: Extreme Resistance as a Host Counter-counter Defense against Viral Suppression of RNA Silencing

DOI: 10.1371/journal.pone.0116752
Title: A Kramers-Moyal Approach to the Analysis of Third-Order Noise with Applications in Option Valuation

DOI: 10.1371/journal.pone.0066742
Title: Relative Impact of Multimorbid Chronic Conditions on Health-Related Quality of Life – Results from the MultiCare Cohort Study

DOI: 10.1371/journal.pcbi.1004089
Title: Delayed Response and Biosonar Perception Explain Movement Coordination in Trawling Bats

DOI: 10.1371/journal.pntd.0001969
Title: An In-Depth Analysis of a Piece of Shit: Distribution of Schistosoma mansoni and Hookworm Eggs in Human Stool

DOI: 10.1371/journal.pone.0002554
Title: A Comparison of Wood Density between Classical Cremonese and Modern Violins

DOI: 10.1371/journal.pone.0117949
Title: Exact Solutions of Linear Reaction-Diffusion Processes on a Uniformly Growing Domain: Criteria for Successful Colonization

DOI: 10.1371/journal.pone.0119705
Title: TBI Server: A Web Server for Predicting Ion Effects in RNA Folding

DOI: 10.1371/journal.pmed.0030205
Title: Mischievous Odds Ratios

DOI: 10.1371/journal.pbio.1001315
Title: Sialyllactose in Viral Membrane Gangliosides Is a Novel Molecular Recognition Pattern for Mature Dendritic Cell Capture of HIV-1

DOI: 10.1371/journal.pone.0114370
Title: Strategies of Eradicating Glioma Cells: A Multi-Scale Mathematical Model with MiR-451-AMPK-mTOR Control

DOI: 10.1371/journal.pmed.1000431
Title: Strategies and Practices in Off-Label Marketing of Pharmaceuticals: A Retrospective Analysis of Whistleblower Complaints

DOI: 10.1371/journal.pone.0120924
Title: Glowing Seashells: Diversity of Fossilized Coloration Patterns on Coral Reef-Associated Cone Snail (Gastropoda: Conidae) Shells from the Neogene of the Dominican Republic

DOI: 10.1371/journal.pcbi.1000204
Title: Defrosting the Digital Library: Bibliographic Tools for the Next Generation Web

DOI: 10.1371/journal.pntd.0002570
Title: NTDs V.2.0: “Blue Marble Health”—Neglected Tropical Disease Control and Elimination in a Shifting Health Policy Landscape
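
Since the sample is a generator, a couple of list habits do not carry over: it has no length, and it can be walked through only once. The sketch below demonstrates both with `toy_random_sample`, a hypothetical helper written for this example (including its `seed` parameter); it only mimics the generator behaviour described above and is not the allofplos method.

```python
import random


def toy_random_sample(dois, k, seed=None):
    """Hypothetical helper: lazily yield k distinct dois chosen at random."""
    rng = random.Random(seed)
    for doi in rng.sample(dois, k):
        yield doi


# A handful of dois taken from the output above.
dois = [
    "10.1371/journal.pone.0042593",
    "10.1371/journal.pone.0120049",
    "10.1371/journal.pmed.0030520",
    "10.1371/journal.pgen.1000052",
    "10.1371/journal.pone.0118342",
    "10.1371/journal.pone.0119074",
]

sample = toy_random_sample(dois, 3, seed=1)

# A generator does not support len().
try:
    len(sample)
    has_len = True
except TypeError:
    has_len = False

# list() walks the generator and keeps the items.
picked = list(sample)

# The generator is now exhausted, so a second pass yields nothing.
second_pass = list(sample)

print(has_len, picked, second_pass)
```

When the sample is small, collecting it with `list(...)` once and reusing that list is usually the simplest approach. For a large sample, iterating directly keeps memory use low.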

# Now you know! 

Now you know the basics of using the `Corpus` class. 

- You can point `Corpus(directory)` to a corpus directory on your file system. 
- You can see how many articles are in your corpus with `len(Corpus())`.
- You can get one random article with `Corpus().random_article`.
- You can get the article with a specific doi with `Corpus()[doi]`.
- You can get the first article in the corpus with `Corpus()[0]`.
- You can get the first `x` articles in the corpus with `Corpus()[:x]` 
- You can access all of the articles in a corpus iteratively with `for article in Corpus():`.
- You can access `x` random articles from the corpus with `for article in Corpus().random_sample(x):`.

Now it's time to check out the Article tutorial. Once it exists, we'll definitely link to it here.