# About this notebook

This notebook was written for the [2016 Construction grammar course](http://budling.hu/elmnyelv/index.php/Constructions2016) at the [Department of Theoretical Linguistics](http://www.nytud.hu/tlp/index.html). Its main aim is to provide easier access to the [Tádé corpus](http://hlt.bme.hu/hu/resources/tade), on which most of the experiments in the course are based.

## Basic setup

The first part of the file shows how to download the corpus to your computer and then how to load it into a [Pandas](http://pandas.pydata.org/) *dataframe* (i.e. table). The best way to set up your own experiments is to copy this file, give the copy a new name (e.g. *my_fruitful_experiments.ipynb*), and open that notebook: this way, you won't have to worry about all this boilerplate, and you will be able to start crunching those verb frame frequencies right away!

In [2]:
import pandas as pd
# So that plots work correctly
%matplotlib inline   

import matplotlib
import numpy as np

By default, the code in this notebook creates files in the current directory, i.e. the one from which you started the notebook. To use a different directory, just change the value of the `work_directory` variable.

In [12]:
import os

work_directory = os.path.abspath('.')
data_file = 'tade.tsv'

if not os.path.isdir(work_directory):
    os.makedirs(work_directory)
os.chdir(work_directory)

print("The working directory is: " + os.getcwd())

The working directory is: /run/shm/Tade-corpus-tools/notebooks


## Getting the data into a table

The first step is to download the Tádé file if it has not been downloaded yet. Remember to execute the working directory cell above before this one, so that you are in the data directory you specified.

In [10]:
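if not os.path.exists(data_file):
    # `import urllib` alone does not load the `request` submodule
    import urllib.request
    print('Downloading Tádé')
    # urlretrieve() replaces the deprecated URLopener class
    urllib.request.urlretrieve('http://people.mokk.bme.hu/~recski/verb_clusters/tade.tsv', data_file)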


Now that we have the file, we can read it into a `DataFrame` and start working on our experiments. The file is in the Latin-2 (ISO-8859-2) encoding, which is not the default in Python (nor in the modern world); that title belongs to UTF-8. To load the file properly, we therefore need to specify the encoding explicitly.

In [20]:
column_names = ['verb', 'frame', 'frame_freq', 'verb_freq', 'freq_ratio']
df = pd.read_table(data_file, encoding='latin2', sep='\t', names=column_names)
print('Loaded ' + data_file + '; read ' + str(len(df)) + ' lines. The first five lines are:')
df.head()

Loaded tade.tsv; read 1158484 lines. The first five lines are:


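|   | verb | frame | frame_freq | verb_freq | freq_ratio |
|---|------|-------|------------|-----------|------------|
| 0 | van | @ | 362298 | 908829 | 0.398643 |
| 1 | van | NP<CAS<INE>> | 71800 | 908829 | 0.079003 |
| 2 | van | NP<CAS<DAT>> | 56905 | 908829 | 0.062614 |
| 3 | van | NP<CAS<SBL>> | 35869 | 908829 | 0.039467 |
| 4 | van | NP<CAS<SUE>> | 29836 | 908829 | 0.032829 |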
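
To see why the encoding matters, here is a minimal sketch that uses only the Python standard library. It encodes the word "Tádé" as Latin-2 bytes and then tries to decode those bytes both ways. The variable names (`raw`, `utf8_ok`, `decoded`) are made up for this illustration and are not used elsewhere in the notebook.

```python
# The same bytes mean different things under different encodings.
raw = 'Tádé'.encode('iso-8859-2')   # Latin-2 bytes: b'T\xe1d\xe9'

try:
    raw.decode('utf-8')             # these bytes are not valid UTF-8
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

decoded = raw.decode('latin2')      # 'latin2' is an alias for ISO-8859-2

print('Decodes as UTF-8:', utf8_ok)
print('Decoded as Latin-2:', decoded)
```

This is presumably the kind of error you would run into if the `encoding` argument were left out when reading the file.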


## A few examples

To give you an idea of Pandas in action, this section lists a few examples, in which we extract various statistics from the data. Let's dig in!

### How many different verbs are in the corpus?

In other words, how many unique elements are there in the first column (`verb`)? There are several ways to find out:

1. Have Pandas [`describe()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.describe.html) the column for you. What it shows depends on the type of the column; for strings, the description will have a `unique` field. What do you think the other fields mean?
2. Just call the [`unique()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.unique.html) function on the column. This returns an array of all the unique elements; all you need is to take the length of the array.
3. Or just do (2) in a single step with [`nunique()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.nunique.html).

In [39]:
print('describe()-based solution:\n')    # \n means new line (i.e. Enter, Return, etc.)
desc = df.verb.describe()
print(desc, '\n')
print('Unique only: ', desc['unique'], '\n\n')

print('unique()-based solution:', len(df.verb.unique()), '\n\n')
print('nunique()-based solution:', df.verb.nunique())

describe()-based solution:

count     1158484
unique     108045
top           van
freq        15337
Name: verb, dtype: object 

Unique only:  108045 


unique()-based solution: 108045 


nunique()-based solution: 108045
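
The three approaches agree here, presumably because the `verb` column has no missing values. As a hedged aside, the sketch below uses a tiny invented series (not Tádé data) to show where they can differ: `unique()` keeps a missing value as an element of its own, while `nunique()` and `describe()` ignore it by default.

```python
import pandas as pd

# Toy series, invented for illustration; the last element is missing
s = pd.Series(['van', 'van', 'lesz', None])

n_unique_array = len(s.unique())         # counts the missing value too
n_nunique = s.nunique()                  # dropna=True by default
n_describe = s.describe()['unique']      # also ignores missing values

print('len(unique()):', n_unique_array)
print('nunique():', n_nunique)
print("describe()['unique']:", n_describe)
```

If you want `nunique()` to count the missing value as well, pass `dropna=False`.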


### In all, how many frames did we extract from our corpus?

Once again, there are different ways of doing this; for instance:

1. Sum the frame counts of all the different verb-frame pairs.
2. Notice that if we sum up the frame counts of a single verb, we get that verb's frequency. So we can just as well `sum()` the verb frequencies, counting each verb only once. This way is a bit longer:
   1. First, we do not need to work on the whole table, only the verbs and their counts; therefore, we extract these two columns into a new table.
   2. Then, we group this new table by verb; this gives us a [`GroupBy`](http://pandas.pydata.org/pandas-docs/stable/api.html#groupby) object.
   3. Since the verb frequency is the same in every row for a given verb, we only need the `first()` row in each group.
   4. Finally, we can `sum()` the resulting column.

In [41]:
print('Sum of frame counts: ', df.frame_freq.sum())

df_verb_and_freq = df[['verb', 'verb_freq']]
verb_and_freq_groups = df_verb_and_freq.groupby('verb')  # verb -> [verb_freq, verb_freq, verb_freq, ...]
verb_freqs = verb_and_freq_groups.first()                # verb -> verb_freq

print('Sum of verb counts: ', verb_freqs.verb_freq.sum())

# This works too: df[['verb', 'verb_freq']].groupby('verb').first().verb_freq.sum()

Sum of frame counts:  9966153
Sum of verb counts:  9966153
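
The grouping steps are easier to follow on a table small enough to check by hand. The sketch below uses invented numbers (not Tádé data) in which each verb's frame counts add up to its verb frequency, as they do in the corpus. It also tries `drop_duplicates()` as one possible alternative to `groupby().first()`; that alternative is a suggestion and was not part of the original solution.

```python
import pandas as pd

# Toy table, invented for illustration: two verbs, five verb-frame pairs
toy = pd.DataFrame({
    'verb':       ['van', 'van', 'lesz', 'lesz', 'lesz'],
    'frame_freq': [3, 2, 4, 1, 1],
    'verb_freq':  [5, 5, 6, 6, 6],
})

# Way 1: sum the frame counts of all verb-frame pairs
frame_total = toy.frame_freq.sum()

# Way 2: keep one verb_freq value per verb, then sum
verb_total_groupby = toy.groupby('verb').verb_freq.first().sum()

# Alternative to way 2: keep only the first row of each verb
verb_total_dedup = toy.drop_duplicates('verb').verb_freq.sum()

print(frame_total, verb_total_groupby, verb_total_dedup)
```

All three totals should come out the same, mirroring the two identical sums printed for the real corpus.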
