# Working with a corpus

This notebook shows how to use a CSV file as an index for working with a collection of documents in a folder.

## Importing libraries

As usual, we first import the needed libraries.

In [1]:
import csv, re, collections

## Checking directory

We need to check where the directory of texts is.

In [2]:
ls

ExampleTable.csv             Texts/
Hume Enquiry.txt             Third Notebook.ipynb
My First Notebook.ipynb      Web Scraping.ipynb
Second Notebook.ipynb        Working with a Corpus.ipynb
Test.txt


## Selecting and Reading Texts

In this loop we go through a CSV with rows for each item of the corpus we want to work with. We have included a test so we can grab only those that meet some condition.

In [3]:
files = []
with open('Texts/mycorpus.csv', 'r') as file: # This makes sure that file is closed after reading
    data = csv.reader(file)
    # This goes through all the rows in the CSV
    for row in data:
        # This tests a condition to decide if we want to process the text.
        if row[4] != "error": 
            files.append(row[3]) # This adds the file name (the fourth column) to the list
fileNames = files[1:] # This gets all items but the first (which is a label)
fileNames

['phil.txt', 'theo.txt', 'grcv.txt']
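
The same selection can be sketched with `csv.DictReader`, which treats the first row as column names, so the label row never ends up in the list. The column names `filename` and `status` and the sample rows are assumptions for illustration; the real layout of `mycorpus.csv` is not shown in this notebook.

```python
import csv
import io

# Hypothetical stand-in for mycorpus.csv; the real column names may differ.
sample_csv = io.StringIO(
    "id,title,author,filename,status\n"
    "1,Philosophy,Anon,phil.txt,ok\n"
    "2,Theology,Anon,theo.txt,ok\n"
    "3,Broken,Anon,bad.txt,error\n"
)

reader = csv.DictReader(sample_csv)  # the first row becomes the dictionary keys
# Keep the file name of every row that is not marked as an error.
selected = [row["filename"] for row in reader if row["status"] != "error"]
print(selected)
```

Selecting by column name rather than by position keeps the code working if columns are later added or reordered in the CSV.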

## Change Directory

Now we need to change directory to make sure we process the files in the *Texts* directory.

In [4]:
cd Texts

/Users/grockwel/Sync/Rockwell IPython Stuff/LearningNotebooks/LearningNotebooks/Texts


In [5]:
ls

grcv.txt      mycorpus.csv  phil.txt      results.csv   theo.txt
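
Changing directory works in a notebook, but it changes the working directory for every later cell. A minimal alternative sketch, assuming the texts live in a folder named `Texts` next to the notebook, is to build full paths instead. The helper name `corpus_paths` is hypothetical.

```python
from pathlib import Path

def corpus_paths(folder, names):
    """Return a full path for each file name inside the given folder."""
    base = Path(folder)
    return [base / name for name in names]

paths = corpus_paths("Texts", ["phil.txt", "theo.txt", "grcv.txt"])
for path in paths:
    # path.exists() reports whether the file is really there before we open it
    print(path.name, path.exists())
```

Each `Path` can be passed straight to `open()`, so the reading loops would not need to change otherwise.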


## Processing all the Texts

Here we do something with all the texts. In this case we read them and add them to a string. 

In [6]:
theBigText = ""
for item in fileNames:
    with open(item, 'r') as file:
        theBigText += file.read()
# theBigText[:300] would preview the start of the text; a cell shows only its last expression
len(theBigText)

159322

Now we get a dictionary of word counts.

In [7]:
tokens = re.findall(r'\b\w[\w-]*\b', theBigText.lower())
theTypesCount = {}
theTypes = set(tokens)

for item in theTypes:
    theTypesCount[item] = tokens.count(item)

theTypesCount["the"]

1310

With this next command you can check the count for any word.

In [8]:
theTypesCount["university"]

346
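
Calling `tokens.count()` once per word type rescans the whole token list each time, which gets slow on a large corpus. As a sketch of an alternative, `collections.Counter` (from the `collections` module imported at the top) tallies every token in a single pass. The sample sentence is made up for illustration.

```python
import re
from collections import Counter

sample_text = "The cat saw the dog. The dog saw the well-fed cat."
# Same tokenizing pattern as the notebook: words, with internal hyphens kept.
sample_tokens = re.findall(r'\b\w[\w-]*\b', sample_text.lower())

sample_counts = Counter(sample_tokens)  # one pass over the tokens
print(sample_counts["the"])
print(sample_counts["banana"])  # a Counter gives 0 for a word it never saw
```

A `Counter` behaves like a dictionary, so lookups such as `sample_counts["the"]` work the same way as with `theTypesCount`.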

## List of High Frequency Words

The dictionary is not ordered by count, so this turns it into a list of word and count pairs sorted from the most frequent word to the least. (Each pair here is a two-item list; a tuple would work the same way.)

In [10]:
listTuples = []
for w in sorted(theTypesCount, key=theTypesCount.get, reverse=True):
    theTuple = [w,theTypesCount[w]]
    listTuples.append(theTuple)

listTuples[:30]

[['the', 1310],
 ['of', 805],
 ['and', 651],
 ['at', 645],
 ['on', 565],
 ['a', 497],
 ['in', 491],
 ['for', 461],
 ['university', 346],
 ['humanities', 328],
 ['conference', 233],
 ['presented', 222],
 ['with', 202],
 ['digital', 200],
 ['to', 179],
 ['by', 163],
 ['is', 139],
 ['s', 135],
 ['computing', 135],
 ['text', 132],
 ['research', 132],
 ['june', 125],
 ['paper', 119],
 ['2011', 114],
 ['2014', 114],
 ['2012', 114],
 ['2013', 109],
 ['may', 109],
 ['2010', 102],
 ['was', 94]]
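
A shorter route to the same kind of ranking, sketched here with a small made-up dictionary, is to sort the dictionary's `items()`. Each item is a true tuple `(word, count)`, whereas the loop above builds two-item lists.

```python
sample_counts = {"the": 4, "cat": 2, "well-fed": 1, "saw": 2}

# Sort the (word, count) pairs by count, largest first.
ranked = sorted(sample_counts.items(), key=lambda pair: pair[1], reverse=True)
print(ranked[:2])
```

Words with equal counts keep the order they had in the dictionary, because Python's sort is stable.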

# Table of Word Frequencies

Now we will repeat the process of opening the files, this time to build a table of word counts or frequencies for each individual file.

This first function will **count** the tokens for each word in the list provided.

In [66]:
def wordCounter(text,words):
    tokens = re.findall(r'\b\w[\w-]*\b', text.lower())
    listCounts = []
    for word in words:
        listCounts.append(tokens.count(word))
        
    return listCounts

This second function will calculate the **relative frequency** for each word in the list provided.

In [11]:
def wordFreqCounter(text,words):
    tokens = re.findall(r'\b\w[\w-]*\b', text.lower())
    listFreqs = []
    numTokens = len(tokens)
    for word in words:
        listFreqs.append(tokens.count(word)/numTokens)
        
    return listFreqs
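
To make relative frequency concrete, here is a small self-contained sketch on a made-up sentence. The helper `relative_frequency` is hypothetical and, unlike `wordFreqCounter`, it returns 0.0 for an empty text instead of raising a division error.

```python
import re

def relative_frequency(text, word):
    """Return occurrences of word divided by the total number of tokens."""
    tokens = re.findall(r'\b\w[\w-]*\b', text.lower())
    if not tokens:  # avoid dividing by zero on an empty text
        return 0.0
    return tokens.count(word) / len(tokens)

sample = "Digital humanities research is research."
print(relative_frequency(sample, "research"))  # 2 of the 5 tokens
print(relative_frequency("", "research"))
```

Relative frequencies make files of different lengths comparable, which raw counts do not.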

## Open Files and Calculate 

This is the main loop that opens each file and does calculations for each word provided in each file. It builds a list of lists (or table) of results. 

**Note:** You need to provide the list of words you want counted. You can use the list of high frequency words above that was generated from all the files.

In [12]:
listOfWords2Count = ["university","humanities","conference","research","banana"]
listOflists = []
listOflists.append(["file"] + listOfWords2Count) # Header row; "file" labels the file name column
for item in fileNames:
    with open(item, 'r') as file:
        theText = file.read()
    # If you want just word counts use wordCounter function.
    # Otherwise use the wordFreqCounter function.
    localCounts = [item] + wordFreqCounter(theText,listOfWords2Count)
    listOflists.append(localCounts)
    
listOflists[:3]

[['file', 'university', 'humanities', 'conference', 'research', 'banana'],
 ['phil.txt',
  0.005250875145857643,
  0.01575262543757293,
  0.027421236872812137,
  0.0029171528588098016,
  0.0],
 ['theo.txt',
  0.0,
  0.0054525627044711015,
  0.006543075245365322,
  0.0016357688113413304,
  0.0]]

## Write Out the Table

Now we write out the table to a new CSV. Make sure you change the file name if you don't want to overwrite the existing file.

In [13]:
with open("results.csv", 'w', newline='') as csvfile:
    resultsWriter = csv.writer(csvfile, delimiter=',')
    for item in listOflists:
        resultsWriter.writerow(item)
        
print("Done")

Done


The resulting CSV can now be opened and manipulated in a spreadsheet program such as Excel.
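
As a quick check before opening a spreadsheet, a table written this way can be read back with `csv.reader`. This sketch writes a small made-up table to a temporary file so that it does not touch `results.csv`; the values are illustrative only.

```python
import csv
import os
import tempfile

sample_table = [["file", "university"], ["a.txt", 0.25], ["b.txt", 0.0]]

# Write to a temporary file rather than overwriting real results.
with tempfile.NamedTemporaryFile("w", newline="", suffix=".csv", delete=False) as tmp:
    csv.writer(tmp).writerows(sample_table)
    tmp_path = tmp.name

# Read the rows back; every cell comes back as a string.
with open(tmp_path, newline="") as csvfile:
    rows_back = list(csv.reader(csvfile))
os.remove(tmp_path)  # clean up the temporary file

print(rows_back)
```

Because `csv.reader` returns every cell as a string, numeric columns need converting with `float()` before any further calculation.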