# Python Programming for Linguists
**03 - Python for (Corpus) Linguists**

There are a number of **new tools** and a **little bit of new syntax**, which can be extremely helpful in approaching exercises 8 to 17. Here are some basic examples to get you started!

This notebook will introduce these concepts by looking at a number of straightforward examples. **Play around** with the code to get a feeling for what these things are doing.

## 1. Miscellaneous

##### Lists and Sets



In [None]:
tokens = ['a', 'the', 'car', 'the']
tokens

Sets, at least on the surface level, work similarly to lists. However, they are *unordered* and their elements are *unique*. Hence, turning (casting) a list into a set will remove duplicate entries.

In [None]:
types = set(tokens)
types
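
For linguists, this maps nicely onto the distinction between tokens and types. The following is a minimal sketch of that idea; the variable names `sorted_types` and `ttr` (type-token ratio) are our own additions for illustration and are not used elsewhere in this notebook.

```python
tokens = ['a', 'the', 'car', 'the']

# Casting the list to a set removes the duplicate 'the'
types = set(tokens)

# Sets are unordered, so we sort them to get a predictable list
sorted_types = sorted(types)

# Type-token ratio: number of unique words divided by the number of words
ttr = len(types) / len(tokens)

print(sorted_types)
print(ttr)
```

If you need a list again (e.g., to use an index), `list(types)` or `sorted(types)` will convert the set back.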

##### The `.join()` method (on strings)

The `.join()` method allows you to turn a list of strings into a single string. The string you call it on is used as the separator between the items. This is useful, for example, if you want to turn a tokenized text back into a string.

In [None]:
tokens = ['The', 'cat', 'is', 'grey']
s1 = ' '.join(tokens)
s2 = '-'.join(tokens)

s1, s2
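
As a small sketch of how `.join()` relates to other string operations: `.split()` does roughly the opposite, and `.join()` only works on strings, so numbers have to be converted first. The variable names below are just for illustration.

```python
tokens = ['The', 'cat', 'is', 'grey']

# Join the tokens with a space, then split the string again
sentence = ' '.join(tokens)
tokens_again = sentence.split(' ')

# .join() only accepts strings, so we convert the numbers with str() first
numbers = [1, 2, 3]
joined_numbers = ', '.join(str(n) for n in numbers)

print(sentence)
print(tokens_again)
print(joined_numbers)
```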

##### Lambda Functions / Anonymous (nameless) Functions

Lambda functions are a little bit strange but very powerful. They are small anonymous functions that can be used whenever you don't need a named (regular) function.

In [None]:
x = lambda a: a + 10
x(5)

We will be using a Lambda below when using `.apply()` on a DataFrame (see Pandas).
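
Another place where lambdas commonly show up is as the `key` argument of `sorted()`. This is only a sketch with made-up example tokens; it sorts a list of words by their length.

```python
tokens = ['linguistics', 'a', 'corpus', 'the']

# The lambda receives one token at a time and returns its length,
# which sorted() then uses to order the tokens
by_length = sorted(tokens, key=lambda token: len(token))

print(by_length)
```

Since the lambda is only needed in this one place, there is no reason to give it a name.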

##### `Counter` objects

A `Counter` is a quick and easy way to count the items in a list and, for example, to generate frequency tables.

In [None]:
from collections import Counter

numbers = [1, 1, 2, 3, 3, 4]
counts = Counter(numbers)

# Show the frequency of the number 1
counts[1]

In [None]:
counts.most_common(2)

In [None]:
tokens = ['I', 'have', 'a', 'cat', '.', 'She', 'has', 'a', 'mouse.']
Counter(tokens)
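
When counting words, capitalization often gets in the way. The sketch below (with a made-up token list) lowercases the tokens before counting and also shows that asking for an item that never occurs returns 0 instead of raising an error.

```python
from collections import Counter

tokens = ['The', 'cat', 'saw', 'the', 'other', 'cat', '.']

# Lowercase every token so that 'The' and 'the' are counted together
counts = Counter(token.lower() for token in tokens)

# The two most frequent items as (item, count) pairs
top_two = counts.most_common(2)

# Items that never occur have a count of 0
missing = counts['dog']

print(top_two)
print(missing)
```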

##### Adding to Variables


Python supports the `+=` and `-=` operators to easily add to or subtract from a variable. `+=` also works when concatenating strings.

In [None]:
a = 1
a += 5

a

In [None]:
b = 'Hello'
b += 'World'

b
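
These operators are particularly handy inside loops, where a value is built up step by step. Here is a minimal sketch with made-up tokens that adds up the number of characters and builds a string at the same time.

```python
tokens = ['I', 'have', 'a', 'cat']

total_characters = 0
sentence = ''

for token in tokens:
  total_characters += len(token)  # add the length of the current token
  sentence += token + ' '         # append the token and a space

# Remove the trailing space that was added after the last token
sentence = sentence.strip()

print(total_characters)
print(sentence)
```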

##### Enumerate and (Un)packing

When looping (e.g., iterating over a list) it is sometimes helpful to keep track of where you are. The `enumerate` function allows you to do just that.

In [None]:
l = ['A', 'B', 'C']

for i in l:
  print(i)

In [None]:
for e, i in enumerate(l):
  print(e, i)

Maybe you are wondering what is up with this `e, i` construction. In Python, you can pack and unpack variables from structures. Have a look at the following example:

In [None]:
# A list with two elements
characters = ['this is a', 'this is b']

# Unpacking the elements into two variables
a, b = characters

a

Similarly, we can also pack variables into, for example, a list:

In [None]:
a = 'this is a'
b = 'this is b'

characters = [a, b]

characters
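
Enumerating and unpacking can also be combined. The following sketch assumes a small, made-up list of (token, tag) pairs; it uses the `start` argument of `enumerate` to begin counting at 1 and unpacks each pair directly in the loop header.

```python
tagged = [('The', 'DET'), ('cat', 'NOUN'), ('sleeps', 'VERB')]

lines = []

# enumerate() yields (position, item) pairs; start=1 begins counting at 1.
# Each item is itself a pair, which we unpack into token and tag.
for position, (token, tag) in enumerate(tagged, start=1):
  lines.append(f'{position}: {token}/{tag}')

print(lines)
```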

##### Slicing Notation

In many cases, we need to only get part of a list. This is possible using slicing notation. 

The syntax is: *start:stop:step*. The element at position *stop* is not included, and *step* is optional.

In [None]:
l = [0, 1, 2, 3, 4, 5]

In [None]:
l[1:3]

In [None]:
l[0:5:2]
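
Parts of the slice can be left out, and negative numbers count from the end of the list. A few more variations, as a quick sketch (the variable names are just for illustration):

```python
l = [0, 1, 2, 3, 4, 5]

first_three = l[:3]    # start omitted: begin at the first element
from_three = l[3:]     # stop omitted: go up to the end of the list
last_two = l[-2:]      # negative indices count from the end
reversed_l = l[::-1]   # a step of -1 walks through the list backwards

print(first_three, from_three, last_two, reversed_l)
```

The same notation also works on strings, which makes it useful for looking at prefixes and suffixes of words.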

## 2. List Comprehensions

List comprehensions offer a shorter syntax for creating lists. Here are two examples:

**Example 1**: We want to create a new list that contains all of the numbers in `numbers` multiplied by ten.

In [None]:
numbers = [1, 2, 3]
n_times_ten = []

for number in numbers:
  n_times_ten.append(number * 10)

n_times_ten

The very same thing can be achieved using a list comprehension:

In [None]:
numbers = [1, 2, 3]

[n * 10 for n in numbers]

**Example 2**: We can also use list comprehensions when working with lists of lists. Below is a list of lists (`lol`). Let's assume we only want to print the second element (A, B, C) from each sublist.

In [None]:
lol = [
       [1, 'A'],
       [2, 'B'],
       [3, 'C']
]

lol

In [None]:
for n in lol:
  print(n[1])

The same thing but as a compact list comprehension:

In [None]:
[n[1] for n in lol]
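
List comprehensions can also contain a condition, which turns them into a compact filter. As a sketch with made-up tokens, the following keeps only the tokens longer than two characters and lowercases them along the way.

```python
tokens = ['The', 'cat', 'is', 'on', 'the', 'mat']

# Read as: "t.lower() for every t in tokens, if t is longer than two characters"
long_tokens = [t.lower() for t in tokens if len(t) > 2]

print(long_tokens)
```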

## 3. Pandas

Pandas is one of the most widely used libraries in Data Science. While Pandas can do lots more, we are focusing on the `DataFrame` provided by the library. You can think of DataFrames as tables. They look and work similarly to spreadsheets.


In [None]:
import pandas as pd

When importing libraries, we can use `as` to give the library another name. For `pandas`, the convention is to simply use `pd` as an alias.

The first thing we will do is to create a simple `DataFrame`, a table. The easiest way of doing this is to populate each column (Document, Tokens, Sentiment) individually.

In [None]:
df = pd.DataFrame()

df['Document'] = [0, 1, 2, 3]
df['Tokens'] = [1000, 2000, 3000, 3000]
df['Sentiment'] = [0.2, 0.3, 0.8, None]

df

Of course, we don't have to create tables manually all the time.

Pandas has many methods that help with getting data into your programs. For example, here we are using `read_csv()` to read a CSV file.

In [None]:
df_2 = pd.read_csv('https://raw.githubusercontent.com/IngoKl/python-programming-for-linguists/main/2020/data/numerical/pandas_demo.csv')

In [None]:
# Use the 'Document' column of our manually created DataFrame as its index (row labels)
df = df.set_index('Document')

`df.head()` will only show the first few lines of a `DataFrame`.

In [None]:
df.head()

Now that we have a new `DataFrame`, we can also look at individual columns or calculate some statistics.

In [None]:
df['Tokens']

In [None]:
df['Tokens'].mean()

In [None]:
df['Sentiment'].describe()

We can also filter/select parts of the data:

In [None]:
df[df['Tokens'] > 2000]

This selection works based on boolean logic (True/False). `df['Tokens'] > 2000` returns a Series with one True/False value per row, indicating whether that row meets the criterion (`> 2000`). Only the rows marked True are kept.

In [None]:
df['Tokens'] > 2000
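
Several conditions can be combined into one selection. The sketch below rebuilds the small table from above so that it runs on its own; the thresholds are arbitrary and only serve as an illustration. Note that each condition needs its own parentheses.

```python
import pandas as pd

# The same data as in the table created above
df = pd.DataFrame({
    'Document': [0, 1, 2, 3],
    'Tokens': [1000, 2000, 3000, 3000],
    'Sentiment': [0.2, 0.3, 0.8, None],
}).set_index('Document')

# & means "and", | means "or"; comparisons with missing values are False
mask = (df['Tokens'] > 1000) & (df['Sentiment'] > 0.25)
selected = df[mask]

print(selected)
```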

In [None]:
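# Replace missing values (NaN) with the mean of the respective column.
# This returns a new DataFrame; df itself is left unchanged.
df.fillna(df.mean())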

The `.apply()` method can be used to apply a function along an axis of a DataFrame, for example to each row.

In [None]:
def double(x):
  '''This function will double a given number.'''
  return x * 2

We will `apply` the `double` function to axis 1 (rows). As you can see, all numbers have doubled.

In [None]:
df.apply(double, axis=1)

Sometimes we might want to use column values while using apply. Here Lambdas come into play. In the example below, we want to create a new column that contains *Sentiment* times 100. We will be using a very simple function `times100` to do that. In the `.apply()` method, we will be using a Lambda to pass the relevant column (*Sentiment*) to the function.

In [None]:
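def times100(x):
  '''This function will multiply a given number by 100.'''
  return x * 100

In [None]:
# Select the column by name; this is more robust than relying on its position
df['Sx100'] = df.apply(lambda row: times100(row['Sentiment']), axis=1)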
df
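
For a simple calculation on a single column, there are shorter alternatives to a row-wise `.apply()`. The following sketch rebuilds the table so that it runs on its own and compares two of them; the variable names `via_apply` and `via_arithmetic` are just for illustration.

```python
import pandas as pd

# The same data as in the table created above
df = pd.DataFrame({
    'Document': [0, 1, 2, 3],
    'Tokens': [1000, 2000, 3000, 3000],
    'Sentiment': [0.2, 0.3, 0.8, None],
}).set_index('Document')

# .apply() on a single column passes each value to the function
via_apply = df['Sentiment'].apply(lambda x: x * 100)

# Arithmetic on a column is applied to all of its values at once
via_arithmetic = df['Sentiment'] * 100

print(via_apply)
print(via_arithmetic)
```

The row-wise version with `axis=1` becomes useful once the function needs values from more than one column.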