# Python for Poets

This Jupyter Notebook is inspired by Kenneth W. Church's [Unix for Poets](https://www.cs.upc.edu/~padro/Unixforpoets.pdf). From that chapter itself:

- "many researchers have more data than they know what to do with"
- "Many researchers believe that they don’t have sufficient computing resources to do these things for themselves."
- "This chapter will describe a set of simple Unix-based (**Python in our case**) tools that should
be more than adequate for counting trigrams on a corpus the size of the Brown Corpus"
- "this chapter will focus on examples and avoid definitions whenever possible"

The code was developed with Python 3.6, written in [PyCharm](), and tested on [Colab](). All snippets can be run on any machine with Python 3.6 (or higher) installed, or online as a Jupyter notebook.

Note: many of these exercises are simpler to solve with Unix/GNU Linux command-line one-liners!

## 1. Exercise 1: Count words in a text

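From Church: "The problem is to input a text file, say Genesis (a good place to start), and output a list of words in the file along with their frequency counts. The algorithm consists of three steps:" Church's steps are tokenize, sort, and count duplicates. In Python the sorting step is not needed, which leaves two: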

1. Tokenize the text into a sequence of words ([re](https://docs.python.org/3.8/library/re.html)),
2. Count the words (with a [dictionary](https://docs.python.org/3.8/tutorial/datastructures.html?highlight=dictionary#dictionaries) or with [Counter](https://docs.python.org/3.8/library/collections.html?highlight=counter#collections.Counter))


In [1]:
import re   

In [3]:
# Use a name other than `input` so the built-in input() is not shadowed
with open("genesis.txt", 'r') as f:
    txt = f.read()

# Apply a regex to string txt and look for all occurrences of the given pattern
tokens = re.findall('[A-Za-z]+', txt)


In [None]:
print(tokens)

Option 1: using a [dictionary](https://docs.python.org/3/tutorial/datastructures.html#dictionaries)

In [None]:
mydict = {}
for token in tokens:
    if token not in mydict:
        mydict[token] = 0
    mydict[token] += 1
print(mydict)

Option 2: using a [Counter](https://docs.python.org/3/library/collections.html?highlight=counter#collections.Counter)

In [None]:
# Option 2: using a counter
from collections import Counter
counter = Counter(tokens)
print(counter)

print(counter['the'])

In [None]:
print("Counter", counter["his"])
print("dictionary", mydict["his"])
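
A common next step is to look at the most frequent words rather than the whole table. The following is a minimal sketch using `Counter.most_common`; the `sample_tokens` list is made up for illustration and is not taken from Genesis.

```python
from collections import Counter

# Hypothetical sample; in the notebook you would pass `tokens` instead
sample_tokens = ["the", "beginning", "the", "earth", "the", "earth"]
sample_counter = Counter(sample_tokens)

# most_common(k) returns the k most frequent items as (item, count) pairs
top_two = sample_counter.most_common(2)
print(top_two)  # [('the', 3), ('earth', 2)]
```

A plain dictionary has no such method; with `mydict` you would have to sort the items by count yourself.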

- There are many official Python (and contributed) libraries available. They are imported with _import_:
  - import _module_
  - from _module_ import _name_
- Once a module is imported, we have access to its functions and classes
- The contents of a (text) file are accessed with _open()_
- Regular expressions are powerful tools for finding patterns in text
- Lists are precisely that: ordered lists of elements.
- Dictionaries store key-value pairs.
- Loops repeat a block of code (here we use _for_, which runs once per element of a sequence)
- Conditionals execute a block of code only if a condition is _true_ (here we use a simple _if_)

In [None]:
# print all the tokens, how many there are, and the fourth one (indexing starts at 0)

print(tokens)
print(len(tokens))
print(tokens[3])

In [None]:
for i in range(0, 20):
  print(i, tokens[i])

In [None]:
print(tokens[0:7])

In [None]:
print(tokens[-7:])

In [None]:
# sort the words in the list

sorted_tokens = sorted(tokens)
print(sorted_tokens[:10])
print(sorted_tokens[-10:])

In [None]:
# counting again, but this time on the sorted_tokens

mydict = {}
for token in sorted_tokens:
    if token not in mydict:
        mydict[token]  = 0
    mydict[token] += 1
print(mydict)

## 2. Different ways of sorting a list of words

Ignore the case when counting: lower casing

In [None]:
print(txt)

In [None]:
txt = txt.lower()
tokens = re.findall('[A-Za-z]+', txt)

counter = Counter(tokens)
print(counter)

Count sequences of vowels

In [None]:
vowels = re.findall('[aeiou]+', txt)
counter = Counter(vowels)
print(counter)

Count sequences of consonants

In [None]:
consonants = re.findall('[bcdfghjklmnpqrstvwxyz]+', txt)
counter = Counter(consonants)
print(counter)


**From Unix for poets**

"These three examples are intended to show how easy it is to change the definition of what counts as a word. Sometimes you want to distinguish between upper and lower case, and sometimes you don’t [...] The same basic counting program can be used to count a variety of different things, depending on how you implement the definition of _thing_ (=token)."
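
One way to make this point concrete is to wrap the counting program in a function whose only varying part is the pattern that defines a token. This is a sketch, not part of the original notebook: `count_things`, its `ignore_case` parameter, and the sample sentence are all hypothetical names introduced here.

```python
import re
from collections import Counter

def count_things(text, pattern, ignore_case=False):
    """Count every non-overlapping match of `pattern` in `text`."""
    if ignore_case:
        text = text.lower()
    return Counter(re.findall(pattern, text))

# Made-up sample sentence; in the notebook you would pass `txt`
sample = "The cat saw the other cat"

# Same program, two definitions of "thing"
words = count_things(sample, r"[A-Za-z]+", ignore_case=True)
vowel_runs = count_things(sample, r"[aeiou]+", ignore_case=True)

print(words["the"])     # 2
print(vowel_runs["e"])  # 3
```

Only the regular expression changes between the two calls; the tokenizing and counting steps stay the same.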

### 2.1 Sort in dictionary order

In [None]:
# Use a name other than `input` so the built-in input() is not shadowed
with open("genesis.txt", 'r') as f:
    txt = f.read()
tokens = re.findall('[A-Za-z]+', txt)
tokens = sorted(tokens)
print(tokens)

### 2.2 Sort in "rhyming" order

In [None]:
word = ["hello how are you", "my name", "today"]
for w in word:
  print(w[::-1])

In [None]:
for i in range(len(tokens) -1 ):
  print(tokens[i:i+2])

In [None]:
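# Helper function: returns the word spelled backwards
def invert(word):
    return word[::-1]

# key=invert makes sorted() compare the reversed words,
# so words with the same ending end up next to each other
rythm_tokens = sorted(tokens, key=invert)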

print(rythm_tokens)
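
The effect is easier to see on a handful of words than on the whole of Genesis. Here is a small sketch with a made-up list (`sample_words` is illustrative), using a `lambda` in place of the `invert` function:

```python
# Hypothetical sample list, chosen so that two pairs of words rhyme
sample_words = ["cat", "nation", "hat", "station"]

# Sort by the reversed spelling: "noitan" < "noitats" < "tac" < "tah"
rhyme_order = sorted(sample_words, key=lambda w: w[::-1])
print(rhyme_order)  # ['nation', 'station', 'cat', 'hat']
```

The words themselves are unchanged; only the comparison key is reversed.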

## 3. Compute n-gram statistics

In [None]:
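# join() glues a list of strings together, placing the string it is called on
# between the elements (here the empty string, so nothing is placed between them)
"".join(["one", "two", "three"])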

For 2-grams

In [None]:
# Use a name other than `input` so the built-in input() is not shadowed
with open("genesis.txt", 'r') as f:
    txt = f.read()
txt = txt.lower()
tokens = re.findall('[A-Za-z]+', txt)

bigrams = [" ".join(tokens[i:i+2]) for i in range(len(tokens)-1) ] 
c = Counter(bigrams)
print(c)

For 3-grams

In [None]:
trigrams = [" ".join(tokens[i:i+3]) for i in range(len(tokens)-2)]
c = Counter(trigrams)
print(c)

For **any** n

In [None]:
n = 4
grams = [" ".join(tokens[i:i+n]) for i in range(len(tokens)-n+1)]
c = Counter(grams)
print(c)
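
Since only `n` changes between the three cells above, the list comprehension can be wrapped in a reusable function. This is a sketch under that assumption; the `ngrams` function and the sample tokens are hypothetical and not part of the original notebook.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all n-grams in `tokens` as space-joined strings."""
    # If n is larger than len(tokens), the range is empty and so is the result
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Made-up sample; in the notebook you would pass the Genesis `tokens`
sample_tokens = "the cat saw the cat".split()

bigram_counts = Counter(ngrams(sample_tokens, 2))
print(bigram_counts.most_common(1))  # [('the cat', 2)]
print(ngrams(sample_tokens, 3))      # ['the cat saw', 'cat saw the', 'saw the cat']
```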

---

# Exercises

Your turn to play with Genesis.

First of all, load the corpus and store it in a variable named **txt**, holding a list of all its lines (hint: scroll up)

In [None]:
None


## 1. Create a list containing only those lines in the corpus **including** substring "gh"


In [None]:
lines_with_gh = None


print(lines_with_gh)


## 2. Create a list containing only those lines in the corpus **not including** substring "gh"

In [None]:
lines_without_gh = None



print(lines_without_gh)


## 3. Create a list containing only those lines in the corpus **ending** with "ing"

In [None]:
lines_ending_with_ing = None


print(lines_ending_with_ing)

## Now start from scratch. 

Reload the corpus into a variable **txt** containing the full corpus as a single string

In [None]:
None

## 4. How many uppercase tokens exist in this version of Genesis?


In [None]:
None

## 5. How many 4-letter words exist in this version of Genesis?


In [None]:
None

## 6. How many words contain only vowels?



In [None]:
None

## 7. Are there words without vowels?

In [None]:
None