# Python Programming for Linguists
**03 - Python for (Corpus) Linguists**
as of 2021-06-11

## 1. Environment and Data

Before we begin, we need to set up **our development environment**.

First, we will download (*git clone*) the workshop repository. The ["magic command"](https://ipython.readthedocs.io/en/stable/interactive/magics.html) `%%capture` suppresses all output of a cell. Be careful: `rm -r python-programming-for-linguists` deletes any previous copy of the repository folder, including files you have changed in it.


Next, we install two additional libraries/dependencies: `textdirectory` and `justext`. While many libraries are already available on Colab, some need to (and can) be installed using `pip`.

Then we `import` all the dependencies we need.

Finally, we use two scripts provided in the repository to download two corpora.

In addition, we will define a `print_dict` helper function that we will use to look at large dictionaries without breaking *Colab*.

In [None]:
%%capture
!rm -r python-programming-for-linguists
!git clone https://github.com/IngoKl/python-programming-for-linguists

In [None]:
%%capture
!pip install textdirectory --upgrade
!pip install justext

In [None]:
# Basics from Python's standard library
import re
import statistics
import math

from collections import Counter
from operator import itemgetter

from io import StringIO

# Data Science
import pandas as pd
import numpy as np

# Plotting
import matplotlib.pyplot as plt
import seaborn as sns

# XML
import lxml

# NLP
import nltk
from nltk.corpus import wordnet
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.collocations import BigramCollocationFinder
from nltk.collocations import BigramAssocMeasures

import spacy
from spacy import displacy

import textdirectory

# Web
import requests
from bs4 import BeautifulSoup
import justext

# Formatting output
from tabulate import tabulate

Downloading two corpora (HUM19UK and COCA sampler)

In [None]:
%%capture
!cd python-programming-for-linguists/2020/data && sh download_hum19uk.sh
!cd python-programming-for-linguists/2020/data && sh download_coca.sh

Helper function for looking at large dictionaries:

In [None]:
def print_dict(d, top=10):
  # Print only the first `top` (key, value) pairs so large dictionaries stay readable
  print(list(d.items())[:top])
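
As a quick illustration of the helper, the following sketch builds a small frequency dictionary with `Counter` and prints only its first three entries. The example sentence and the names `tokens` and `frequencies` are made up for illustration; `print_dict` is repeated here only so that the snippet runs on its own.

```python
from collections import Counter

def print_dict(d, top=10):
  print(list(d.items())[0:top])

# Hypothetical toy data, not part of the workshop corpora
tokens = 'the cat sat on the mat and the dog sat too'.split()
frequencies = Counter(tokens)

# Only the first three (key, value) pairs are printed
print_dict(frequencies, top=3)
```

Note that the entries appear in insertion order, not sorted by frequency.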

## 2. New Tools and Hints

### Classes and Objects

You can think of classes as blueprints for objects. An object, which is an instantiation of a class, can have attributes and methods (basically functions tied to the object). There's lots more to this, but this should get you going!

Here we create a new class `Word`. The class has two attributes (`word` and `length`) as well as one method, `reverse`, which reverses the stored word in place.

In [None]:
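class Word:

  def __init__(self, word):
    # Store the word itself and its length as instance attributes
    self.word = word
    self.length = len(word)

  def reverse(self):
    # Reverse the stored word in place (its length stays the same)
    self.word = self.word[::-1]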

In [None]:
new_word = Word('cat')

Now we have created a new object based on our blueprint. We can access the instance attributes by using `object.attribute`.

In [None]:
new_word.word, new_word.length

('cat', 3)

Of course, we can now also use the methods of the object by calling `object.method()`.

In [None]:
new_word.reverse()
new_word.word

'tac'
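
To make the blueprint idea a little more concrete, here is a minimal sketch that creates two objects from the same class. The class is repeated so the snippet runs on its own; the names `first` and `second` and the example words are only illustrative. Each object keeps its own attributes, so calling a method on one does not affect the other.

```python
class Word:

  def __init__(self, word):
    self.word = word
    self.length = len(word)

  def reverse(self):
    self.word = self.word[::-1]

# Two independent objects built from the same blueprint
first = Word('green')
second = Word('ideas')

# Only `first` is changed; `second` keeps its original word
first.reverse()

print(first.word, first.length)
print(second.word, second.length)
```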

### List Comprehensions

In [None]:
numbers = [10, 20, 30]
times_ten = [n * 10 for n in numbers]

times_ten

[100, 200, 300]

In [None]:
list_of_lists = [['A', 1], ['B', 2], ['C', 3]]
only_second_element = [n[1] for n in list_of_lists]

only_second_element

[1, 2, 3]
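
It may help to see a list comprehension next to the `for` loop it replaces. The following is a small sketch with a made-up token list (`tokens` and the other variable names are assumptions for illustration). It also shows the optional `if` at the end of a comprehension, which filters elements.

```python
# Hypothetical toy data
tokens = ['The', 'cat', 'sat', 'on', 'the', 'mat']

# Long form: build the list step by step with a for loop
lowercased_loop = []
for token in tokens:
  lowercased_loop.append(token.lower())

# The same result written as a list comprehension
lowercased = [token.lower() for token in tokens]

# An optional condition keeps only tokens longer than two characters
long_tokens = [token for token in tokens if len(token) > 2]

print(lowercased == lowercased_loop)
print(long_tokens)
```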

### Enumerate

In [None]:
l = ['A', 'B', 'C']

for index, value in enumerate(l):
  print(index, value)

0 A
1 B
2 C
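By default, `enumerate` starts counting at 0. As a small sketch (the variable names are only illustrative), the built-in `start` parameter can be used when positions should be counted from 1, for example when numbering tokens for human readers.

```python
tokens = ['A', 'B', 'C']

# Count positions from 1 instead of 0
for position, token in enumerate(tokens, start=1):
  print(position, token)

# enumerate yields (index, value) pairs, which can be collected in a list
pairs = list(enumerate(tokens, start=1))
```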


## 3. Exercises (8 to 17)

### Exercise 8 - Concordancer

In [None]:
# YOUR CODE GOES HERE

### Exercise 9 - N-Grams
Note: A text with `T` tokens contains `T + 1 - N` n-grams of size `N` (e.g., 5 tokens yield 4 bigrams).
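
The formula in the note can be checked without building any n-grams by counting the positions at which an n-gram can start. This is only a sketch for verifying the arithmetic, not a solution to the exercise; the helper name `expected_ngram_count` and the example sentence are hypothetical, and returning 0 when `N` exceeds the number of tokens is an added assumption.

```python
def expected_ngram_count(number_of_tokens, n):
  # Tokens + 1 - N, but never below zero (assumption for very short texts)
  return max(number_of_tokens + 1 - n, 0)

# Hypothetical example sentence with 6 tokens
tokens = 'this is a short example sentence'.split()

counts = {}
for n in (1, 2, 3):
  # An n-gram can start at every position that leaves room for n tokens
  valid_starts = sum(1 for start in range(len(tokens)) if start + n <= len(tokens))
  counts[n] = valid_starts
  print(n, valid_starts, expected_ngram_count(len(tokens), n))
```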

In [None]:
# YOUR CODE GOES HERE

### Exercise 10 - Frequency Analysis

In [None]:
# YOUR CODE GOES HERE

### Exercise 11 - Computing Basic Statistics

In [None]:
# YOUR CODE GOES HERE

### Exercise 12 - Basic Collocation Analysis

In [None]:
# YOUR CODE GOES HERE

### Exercise 13 - NLTK Stemming, Lemmatization, and WordNet

In [None]:
# YOUR CODE GOES HERE

### Exercise 14 - spaCy Tagging

In [None]:
# YOUR CODE GOES HERE

### Exercise 15 - Parsing XML

In [None]:
# YOUR CODE GOES HERE

### Exercise 16 - Web Scraping

In [None]:
# YOUR CODE GOES HERE

### Exercise 17 - Putting Everything Together (Keyword Analysis)

In [None]:
# YOUR CODE GOES HERE