## Fetching corpora
CLTK provides a wide range of freely available corpora for numerous languages. To download one of them, create a ``CorpusImporter`` for the language and call its ``import_corpus`` method with the corpus name.

In [1]:
from cltk.corpus.latin.corpora import LATIN_CORPORA 

In [2]:
from cltk.corpus.utils.importer import CorpusImporter
corpus_importer = CorpusImporter("latin")

In [3]:
CorpusImporter

cltk.corpus.utils.importer.CorpusImporter

In [4]:
corpus_importer.list_corpora

['latin_text_perseus',
 'latin_treebank_perseus',
 'latin_text_latin_library',
 'phi5',
 'phi7',
 'latin_proper_names_cltk',
 'latin_models_cltk',
 'latin_pos_lemmata_cltk',
 'latin_treebank_index_thomisticus',
 'latin_lexica_perseus',
 'latin_training_set_sentence_cltk',
 'latin_word2vec_cltk',
 'latin_text_antique_digiliblt',
 'latin_text_corpus_grammaticorum_latinorum',
 'latin_text_poeti_ditalia',
 'latin_text_tesserae']
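
The names in this list mostly follow a `language_kind_source` pattern, so the list can be filtered before deciding what to import. The sketch below is only an illustration: the pattern is inferred from the names shown above (entries such as `phi5` do not follow it), and `corpora_of_kind` is a hypothetical helper, not part of CLTK.

```python
# Names copied from the `list_corpora` output above.
available = [
    "latin_text_perseus",
    "latin_treebank_perseus",
    "latin_text_latin_library",
    "phi5",
    "phi7",
    "latin_proper_names_cltk",
    "latin_models_cltk",
    "latin_pos_lemmata_cltk",
    "latin_treebank_index_thomisticus",
    "latin_lexica_perseus",
    "latin_training_set_sentence_cltk",
    "latin_word2vec_cltk",
    "latin_text_antique_digiliblt",
    "latin_text_corpus_grammaticorum_latinorum",
    "latin_text_poeti_ditalia",
    "latin_text_tesserae",
]


def corpora_of_kind(names, kind):
    """Return the names whose underscore-separated parts include `kind`.

    Hypothetical helper: it relies on a naming convention that is only
    assumed from the list above.
    """
    return [name for name in names if kind in name.split("_")]


text_corpora = corpora_of_kind(available, "text")
treebanks = corpora_of_kind(available, "treebank")
print(text_corpora)
print(treebanks)
```

In a live session, `corpus_importer.list_corpora` could be passed in place of the hard-coded `available` list.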

In [5]:
corpus_importer.import_corpus('latin_text_latin_library')

In [6]:
corpus_importer.import_corpus('latin_models_cltk')

## Accessing the corpora

After the corpus has been imported, you will want to access the data. The ``get_corpus_reader`` function returns a corpus reader for an imported corpus, which gives you access to the documents, paragraphs and words it contains.

In [7]:
from cltk.corpus.readers import get_corpus_reader

In [8]:
latin_text_lib = get_corpus_reader(corpus_name='latin_text_latin_library', language='latin')

In [9]:
latin_text_lib

<FilteredPlaintextCorpusReader in 'C:\\Users\\clems\\cltk_data\\latin\\text\\latin_text_latin_library'>

In [10]:
list(latin_text_lib.words())[:10]

['DUODECIM',
 'TABULARUM',
 'LEGES',
 'DUODECIM',
 'TABULARUM',
 'LEGES',
 'TABULA',
 'I',
 'Si',
 'in']
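
Wrapping `words()` in `list(...)` reads every token in the corpus before the slice is taken. Assuming `words()` yields tokens lazily, which the need for `list(...)` suggests, `itertools.islice` can stop after the first few. The sketch below uses a stand-in generator, `fake_words`, so that it runs without the corpus installed; it is not part of CLTK.

```python
from itertools import islice


def fake_words():
    """Stand-in for `latin_text_lib.words()` (illustration only)."""
    for token in ["DUODECIM", "TABULARUM", "LEGES", "TABULA", "I", "Si", "in"]:
        yield token


# Take only the first three tokens; the rest of the generator is never consumed.
first_three = list(islice(fake_words(), 3))
print(first_three)
```

With the real reader, the equivalent call would be `list(islice(latin_text_lib.words(), 10))`.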

## Fetching texts from an online source

Of course, you may also want to work with texts that are not available through CLTK. In that case, you need to fetch and clean up the texts yourself, using Python's standard library and third-party packages.

In [11]:
links = [f"http://www.thelatinlibrary.com/apicius/apicius{i}.shtml" for i in range(1, 6)]

In [12]:
from requests import get

In [14]:
books = [get(link).text for link in links]
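
The list comprehension above stops at the first failed download. A more defensive variant might skip the failing link and carry on. The sketch below is one way to do that; `fetch_all` and `fake_fetch` are hypothetical names, and the download function is passed in as a parameter so the example runs without network access.

```python
from typing import Callable, Dict, Iterable


def fetch_all(links: Iterable[str], fetch: Callable[[str], str]) -> Dict[str, str]:
    """Map each link to its page text, skipping links whose download fails."""
    pages = {}
    for link in links:
        try:
            pages[link] = fetch(link)
        except OSError as error:
            # Network errors are usually OSError subclasses; report and move on.
            print(f"Skipping {link}: {error}")
    return pages


def fake_fetch(url: str) -> str:
    """Stand-in downloader that fails for one page (illustration only)."""
    if url.endswith("apicius3.shtml"):
        raise ConnectionError("simulated network failure")
    return f"<html>{url}</html>"


demo_links = [f"http://www.thelatinlibrary.com/apicius/apicius{i}.shtml" for i in range(1, 6)]
pages = fetch_all(demo_links, fake_fetch)
print(len(pages))
```

With `requests`, the `fetch` argument could be something like `lambda url: get(url, timeout=10).text`; the exceptions raised by `requests` derive from `OSError`, so the same `except` clause should cover them.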

In [15]:
books[0][:500]

'<html>\n\t<head>\n\t\t<title>\n\t\t\tApicius: de Re Coquinaria Liber I\n\t\t</title>\n\n\t\t<link rel="SHORTCUT ICON" href="http://www.thelatinlibrary.com/icon.ico">\n\t\t<link rel="StyleSheet" href="http://www.thelatinlibrary.com/latinlibrary.css">\n\t</head>\n\t\n<body>\n\n<p class=pagehead>DE RE COQUINARIA LIBER PRIMUS M. GAVII APICII</p>\n\n<p class=border></P>\n\n<p>\n<b>LIBER I. EPIMELES.</b>\n</P>\n\n<P>\n<b>I. Conditum paradoxum.</b>\n</P>\n\n<P>\n1. CONDITI PARADOXI COMPOSITIO.\n</P>\n\n<P>\nMellis p.XV in aeneum uas mittuntur, '

In its current form, the text is buried in HTML markup and hard for a human to read. To parse the document easily, we turn to Beautiful Soup, a Python library commonly used for extracting text from web pages.

In [16]:
from bs4 import BeautifulSoup

In [17]:
r = BeautifulSoup(books[0], "lxml")

In [18]:
ps = [p.text.strip() for p in r.body.select("p") if p.text.strip()]
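
Where Beautiful Soup or `lxml` is not installed, a similar paragraph extraction can be sketched with the standard library's `html.parser`. This is a swapped-in technique, not the approach used in this notebook, and `ParagraphCollector` is a hypothetical class written for illustration. The sample markup is adapted from the page source shown earlier.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the stripped, non-empty text of each <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None while the parser is outside a <p> element

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # an unclosed <p> ends where the next one starts
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p":  # tag names arrive lowercased, so </P> matches too
            self._flush()

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def _flush(self):
        if self._buffer is not None:
            text = "".join(self._buffer).strip()
            if text:
                self.paragraphs.append(text)
        self._buffer = None


sample = (
    "<body>\n"
    "<p class=pagehead>DE RE COQUINARIA LIBER PRIMUS M. GAVII APICII</p>\n\n"
    "<p class=border></P>\n\n"
    "<p>\n<b>LIBER I. EPIMELES.</b>\n</P>\n"
    "</body>"
)

collector = ParagraphCollector()
collector.feed(sample)
collector.close()
print(collector.paragraphs)
```

Beautiful Soup is more forgiving with malformed markup, so this stand-in is best treated as a fallback for simple pages.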

In [19]:
ps

['DE RE COQUINARIA LIBER PRIMUS M. GAVII APICII',
 'LIBER I. EPIMELES.',
 'I. Conditum paradoxum.',
 '1. CONDITI PARADOXI COMPOSITIO.',
 'Mellis p.XV in aeneum uas mittuntur, praemissis vini sextariis duobus, ut in coctura mellis vinum decoquas.  Quod igni lento et aridis lignis calefactum, commotum ferula dum coquitur, si effervere coeperit, vini rore conpescitur, praeter quod subtracto igni in se redit.  Cum perfrixerit, rursus accenditur.  Hoc secundo ac tertio fiet, ac tum demum remotum a foco postridie despumatur.  Tum mittis piperis uncias quattor iam triti, masticis scripulos III, folii et croci dragmae singulae, datilorum ossibus torridis quinque, isdemque dactilis vino mollitis, intercedente prius suffusione vini de suo modo ac numero, ut tritura lenis habeatur.  His omnibus paratis supermittis vini lenis sextaria XVIII.  Carbones perfecto aderunt.',
 '2. CONDITUM MELIZOMUM PERPETUUM QUOD SUBMINISTRATUR PER VIAM PEREGRINANTI.',
 'Piper tritum cum melle despumto in cupellam mit