
<p>Linguist 278: Programming for Linguists<br/>
Stanford Linguistics, Fall 2020<br/>
Christopher Potts</p>



<h1 id="Assignment-7:-Language-dataset-hackathon">Assignment 7: Language dataset hackathon<a class="anchor-link" href="#Assignment-7:-Language-dataset-hackathon">¶</a></h1>



<p>Distributed 2020-11-02<br/>
Due 2020-11-09 (but my intention is that you will be able to turn it in after class on Nov 4)</p>



<h2 id="Contents">Contents<a class="anchor-link" href="#Contents">¶</a></h2><ol>
<li><a href="#Overview">Overview</a><ol>
<li><a href="#Requirements">Requirements</a></li>
<li><a href="#Ideas">Ideas</a></li>
</ol>
</li>
<li><a href="#Set-up">Set-up</a></li>
<li><a href="#Age-of-acquisition-dataset">Age of acquisition dataset</a></li>
<li><a href="#Concreteness-dataset">Concreteness dataset</a></li>
<li><a href="#Sentiment-dataset">Sentiment dataset</a></li>
<li><a href="#Beautiful-words">Beautiful words</a></li>
<li><a href="#Novels-from-Project-Gutenberg">Novels from Project Gutenberg</a></li>
<li><a href="#Potentially-useful-code">Potentially useful code</a><ol>
<li><a href="#Project-Gutenberg-iterator">Project Gutenberg iterator</a></li>
<li><a href="#Sentence-tokenizing-using-NLTK">Sentence tokenizing using NLTK</a></li>
<li><a href="#Word-counts">Word counts</a></li>
<li><a href="#egrep">egrep</a></li>
</ol>
</li>
</ol>



<h2 id="Overview">Overview<a class="anchor-link" href="#Overview">¶</a></h2><ul>
<li><p>Our in-class hackathon on November 2 and 4 <em>is</em> Assignment 7. My hope is that you can complete the hackathon/homework by the end of class on November 4.</p>
</li>
<li><p>You are encouraged to work in groups (max size: 3 people).</p>
</li>
<li><p>The requirements are much more open-ended than for other assignments. Part of the task is to think up an original question using the materials in this notebook. I've given some general ideas below to get you started.</p>
</li>
<li><p>You should think about scoping the problem so that you can complete it in two class periods. This is something you should assess for yourselves. Of course, you <em>can</em> take until Nov 9 if you like, but I'd advise not doing that!</p>
</li>
</ul>
<h3 id="Requirements">Requirements<a class="anchor-link" href="#Requirements">¶</a></h3><ul>
<li><p>Designate one person to submit the notebook and make absolutely sure all group members' names are given at the top of the notebook.</p>
</li>
<li><p>Submit a modified version of this notebook, with your new code included, extraneous code removed, and prose added so that I can follow along. That is, try to make this notebook look like a polished piece of <a href="https://en.wikipedia.org/wiki/Literate_programming">literate programming</a>.</p>
</li>
<li><p>I am guessing most notebooks will have about the scope of three regular two-pointer assignment questions, with some prose explaining what they do and why. However, there is no strict requirement on the amount of code.</p>
</li>
<li><p>You needn't confine yourself to the data and other resources in this notebook. (There is a lot to work with here, though.)</p>
</li>
</ul>
<h3 id="Ideas">Ideas<a class="anchor-link" href="#Ideas">¶</a></h3><p>Examples of things you might do (not meant to be restrictive!):</p>
<ul>
<li><p>Write code that identifies interesting relationships in the concreteness, sentiment, and age of acquisition datasets. You can use Pandas to merge them on their <code>Index</code> values and then reason across them!</p>
</li>
<li><p>Write code that identifies differences between Project Gutenberg authors as revealed by the concreteness, sentiment, age of acquisition, and/or beautiful words datasets.</p>
</li>
<li><p>Write code to find the <em>most something</em> – the most sentiment-laden sentence or passage, the most challenging passage, the most abstract passage, etc. I've included code below for heuristic paragraph and sentence parsing.</p>
</li>
</ul>
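
<p>As a starting point for the first idea, here is a minimal sketch of an index-based merge. The three toy frames and all of their values are invented stand-ins for the age of acquisition, concreteness, and sentiment frames loaded later in this notebook; only the column names come from the real data.</p>

```python
import pandas as pd

# Toy stand-ins for the real frames. Every value here is invented
# purely for illustration.
toy_age_df = pd.DataFrame(
    {"Rating.Mean": [3.0, 5.5, 11.0]},
    index=pd.Index(["dog", "happy", "theory"], name="Word"))

toy_conc_df = pd.DataFrame(
    {"Conc.M": [4.9, 2.6, 1.8, 4.7]},
    index=pd.Index(["dog", "happy", "theory", "table"], name="Word"))

toy_sent_df = pd.DataFrame(
    {"Valence": [7.0, 8.5, 5.5, 6.0]},
    index=pd.Index(["dog", "happy", "theory", "window"], name="Word"))

# Inner joins keep only the words that appear in every frame.
merged = toy_age_df.join(toy_conc_df, how="inner").join(toy_sent_df, how="inner")

# Pairwise Pearson correlations over the shared vocabulary.
correlations = merged.corr()
```

<p>With the real data, the same two <code>join</code> calls should work on the full frames, since all three are indexed by <code>Word</code>. An inner join silently drops words missing from any one dataset, so it is worth checking how many rows survive.</p>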



<h2 id="Set-up">Set-up<a class="anchor-link" href="#Set-up">¶</a></h2>


In [1]:
%matplotlib inline
import glob
import os
import pandas as pd
import re
import string


<p>Download the hackathon data distribution:</p>
<p><a href="http://web.stanford.edu/class/linguist278/data/hackathon.zip">http://web.stanford.edu/class/linguist278/data/hackathon.zip</a></p>
<p>and unzip it in the same directory as this notebook. (If you want to put it somewhere else, just be sure to change <code>data_home</code> in the next cell.)</p>


In [8]:
data_home = os.path.join('data', 'hackathon')


<h2 id="Age-of-acquisition-dataset">Age of acquisition dataset<a class="anchor-link" href="#Age-of-acquisition-dataset">¶</a></h2><p>From <a href="https://www.humanities.mcmaster.ca/~vickup/Kuperman-BRM-2012.pdf">Age-of-acquisition ratings for 30 thousand English words</a> (Victor Kuperman, Hans Stadthagen-Gonzalez, and Marc Brysbaert, <em>Behavior Research Methods</em>, 2012):</p>
<ol>
<li><code>Word</code>: The word (str)</li>
<li><code>OccurTotal</code>: Total number of participants who responded for the word</li>
<li><code>OccurNum</code>: Number of participants who gave an age of acquisition, rather than saying "Unknown"</li>
<li><code>Rating.Mean</code>: Mean age of acquisition, in years</li>
<li><code>Rating.SD</code>: Standard deviation of the distribution of ages of acquisition</li>
<li><code>Frequency</code>: Token count in their data</li>
</ol>


In [12]:

age_df = pd.read_csv(
    os.path.join(data_home, "Kuperman-BRM-data-2012.csv"),
    index_col='Word')

In [10]:

age_df.shape



(30121, 5)

In [11]:

age_df.head(2)



Word,OccurTotal,OccurNum,Rating.Mean,Rating.SD,Frequency
have,18,18,3.72,1.96,314232.0
do,20,20,3.6,1.6,312915.0
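
<p>One thing you might derive from these columns is how widely known each word is. The sketch below re-enters the two rows displayed above by hand so that it runs on its own. It assumes that <code>OccurNum / OccurTotal</code> can be read as the share of participants who knew the word; <code>KnownRate</code> and the 0.9 cutoff are hypothetical choices, not part of the dataset.</p>

```python
import pandas as pd

# The two rows displayed above, typed in by hand.
sample_age_df = pd.DataFrame(
    {"OccurTotal": [18, 20],
     "OccurNum": [18, 20],
     "Rating.Mean": [3.72, 3.60],
     "Rating.SD": [1.96, 1.60]},
    index=pd.Index(["have", "do"], name="Word"))

# Assumed interpretation: the share of participants who supplied an
# age rather than answering "Unknown". `KnownRate` is a made-up name.
sample_age_df["KnownRate"] = (
    sample_age_df["OccurNum"] / sample_age_df["OccurTotal"])

# Keep widely known words, then sort from earliest to latest acquired.
early_words = (
    sample_age_df[sample_age_df["KnownRate"] >= 0.9]
    .sort_values("Rating.Mean"))
```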



<h2 id="Concreteness-dataset">Concreteness dataset<a class="anchor-link" href="#Concreteness-dataset">¶</a></h2><p>We've worked with this dataset before. It's presented in <a href="http://crr.ugent.be/papers/Brysbaert_Warriner_Kuperman_BRM_Concreteness_ratings.pdf">Concreteness ratings for 40 thousand generally known English word lemmas</a> (Marc Brysbaert, Amy Beth Warriner, and Victor Kuperman, <em>Behavior Research Methods</em>, 2014). Overview:</p>
<ol>
<li><code>Word</code>: The word (str)</li>
<li><code>Bigram</code>: Whether it is a single word or a two-word expression</li>
<li><code>Conc.M</code>: The mean concreteness rating</li>
<li><code>Conc.SD</code>: The standard deviation of the concreteness ratings (float)</li>
<li><code>Unknown</code>: The number of persons indicating they did not know the word</li>
<li><code>Total</code>: The total number of persons who rated the word</li>
<li><code>Percent_known</code>: Proportion of participants who knew the word (between 0 and 1)</li>
<li><code>SUBTLEX</code>: The SUBTLEX-US frequency count</li>
<li><code>Dom_Pos</code>: The part-of-speech where known</li>
</ol>


In [13]:
concreteness_df = pd.read_csv(
    os.path.join(data_home, "Concreteness_ratings_Brysbaert_et_al_BRM.csv"),
    index_col='Word')

In [14]:
concreteness_df.shape

(39954, 8)

In [15]:
concreteness_df.head()

Word,Bigram,Conc.M,Conc.SD,Unknown,Total,Percent_known,SUBTLEX,Dom_Pos
roadsweeper,0,4.85,0.37,1,27,0.96,0,0
traindriver,0,4.54,0.71,3,29,0.9,0,0
tush,0,4.45,1.01,3,25,0.88,66,0
hairdress,0,3.93,1.28,0,29,1.0,1,0
pharmaceutics,0,3.77,1.41,4,26,0.85,0,0
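
<p>Before comparing concreteness scores, it may help to drop two-word expressions and words that many raters did not know. Here is a minimal sketch using three columns from the five rows shown above, re-entered by hand. The threshold <code>min_known</code> is an arbitrary, hypothetical choice.</p>

```python
import pandas as pd

# The rows displayed above (three of the columns), typed in by hand.
sample_conc_df = pd.DataFrame(
    {"Bigram": [0, 0, 0, 0, 0],
     "Conc.M": [4.85, 4.54, 4.45, 3.93, 3.77],
     "Percent_known": [0.96, 0.90, 0.88, 1.00, 0.85]},
    index=pd.Index(
        ["roadsweeper", "traindriver", "tush", "hairdress", "pharmaceutics"],
        name="Word"))

min_known = 0.9  # Hypothetical cutoff; adjust to taste.

# Boolean masks combine with `&`; each condition needs its own parentheses.
reliable = sample_conc_df[
    (sample_conc_df["Bigram"] == 0)
    & (sample_conc_df["Percent_known"] >= min_known)]

# Index label (the word) with the highest mean concreteness.
most_concrete = reliable["Conc.M"].idxmax()
```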



<h2 id="Sentiment-dataset">Sentiment dataset<a class="anchor-link" href="#Sentiment-dataset">¶</a></h2>



<p>The dataset <a href="https://www.humanities.mcmaster.ca/~vickup/Warriner-etal-BRM-2013.pdf">Norms of valence, arousal, and dominance for 13,915 English lemmas</a> (Amy Beth Warriner, Victor Kuperman, and Marc Brysbaert, <em>Behavior Research Methods</em>, 2013) contains sentiment ratings for 13,915 words. The following code reads in the full dataset and then restricts to just the mean ratings for the three core semantic dimensions:</p>
<ol>
<li><code>Word</code>: The word (str)</li>
<li><code>Valence</code> (positive/negative)</li>
<li><code>Arousal</code> (intensity)</li>
<li><code>Dominance</code></li>
</ol>


In [16]:
sentiment_df = pd.read_csv(
    os.path.join(data_home, "Warriner_et_al emot ratings.csv"),
    index_col='Word')

In [17]:
sentiment_df.shape

(13915, 64)

In [18]:
sentiment_df = sentiment_df[['V.Mean.Sum', 'A.Mean.Sum', 'D.Mean.Sum']]

sentiment_df = sentiment_df.rename(
    columns={'V.Mean.Sum': 'Valence',
             'A.Mean.Sum': 'Arousal',
             'D.Mean.Sum': 'Dominance'})


<h2 id="Beautiful-words">Beautiful words<a class="anchor-link" href="#Beautiful-words">¶</a></h2><p>I took the <a href="http://www.alphadictionary.com/articles/100_most_beautiful_words.html">100 Most Beautiful Words</a> (of which there are 107) and enriched them:</p>
<ol>
<li><code>Word</code>: The word (str).</li>
<li><code>Pronunciation</code>: CMU Pronouncing Dictionary representation.</li>
<li><code>Morphology</code>: Celex morphological representations.</li>
<li><code>Frequency</code>: frequency according to the Google N-gram Corpus. </li>
<li><code>Category</code>: 'most-beautiful' or 'regular'</li>
</ol>
<p>The 'regular' examples are 107 randomly selected non-proper-names.</p>
<p>Maybe there's something interesting here?</p>


In [19]:
beauty_df = pd.read_csv(
    os.path.join(data_home, "wordbeauty.csv"),
    index_col="Word")

In [20]:
beauty_df.shape

(214, 4)

In [21]:
beauty_df.head(2)

Word,Pronunciation,Morphology,Frequency,Category
lithe,L AY1 DH,(lithe)[A],136457,most-beautiful
vestige,V EH1 S T IH0 JH,(vestige)[N],135247,most-beautiful


In [22]:
beauty_df['Category'].value_counts()

most-beautiful    107
regular           107
Name: Category, dtype: int64

In [23]:
beauty_df.head()

Word,Pronunciation,Morphology,Frequency,Category
lithe,L AY1 DH,(lithe)[A],136457,most-beautiful
vestige,V EH1 S T IH0 JH,(vestige)[N],135247,most-beautiful
nemesis,N EH1 M AH0 S IH0 S,(nemesis)[N],1338430,most-beautiful
inure,IH0 N Y UH1 R,(inure)[V],123230,most-beautiful
imbue,IH0 M B Y UW1,(imbue)[V],105790,most-beautiful
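
<p>One possible angle: the CMU pronunciations mark every vowel phoneme with a stress digit (0, 1, or 2), so counting digit-bearing phonemes gives a syllable count. The sketch below applies this to the five rows shown above, re-entered by hand. <code>syllable_count</code> and the <code>Syllables</code> column are hypothetical names introduced for illustration.</p>

```python
import pandas as pd

# The five rows displayed above (two of the columns), typed in by hand.
sample_beauty_df = pd.DataFrame(
    {"Pronunciation": ["L AY1 DH",
                       "V EH1 S T IH0 JH",
                       "N EH1 M AH0 S IH0 S",
                       "IH0 N Y UH1 R",
                       "IH0 M B Y UW1"],
     "Category": ["most-beautiful"] * 5},
    index=pd.Index(
        ["lithe", "vestige", "nemesis", "inure", "imbue"], name="Word"))


def syllable_count(pron):
    """Count the phonemes in `pron` that end in a stress digit."""
    return sum(1 for phone in pron.split() if phone[-1].isdigit())


sample_beauty_df["Syllables"] = (
    sample_beauty_df["Pronunciation"].apply(syllable_count))

# Mean syllable count per category. On the full frame this would
# compare 'most-beautiful' with 'regular'.
mean_syllables = sample_beauty_df.groupby("Category")["Syllables"].mean()
```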



<h2 id="Novels-from-Project-Gutenberg">Novels from Project Gutenberg<a class="anchor-link" href="#Novels-from-Project-Gutenberg">¶</a></h2><p>The Gutenberg metadata has been removed from these files, and the first line gives the title, author, and publication year in a systematic pattern.</p>


In [25]:
gutenberg_home = os.path.join(data_home, "gutenberg")

gutenberg_home

'data\\hackathon\\gutenberg'

In [26]:
gutenberg_filenames = glob.glob(os.path.join(gutenberg_home, "*.txt"))

In [29]:
len(gutenberg_filenames)

26

In [30]:
gutenberg_filenames[: 5]

['data\\hackathon\\gutenberg\\austen-emma.txt',
 'data\\hackathon\\gutenberg\\austen-persuasion.txt',
 'data\\hackathon\\gutenberg\\austen-sense.txt',
 'data\\hackathon\\gutenberg\\blake-poems.txt',
 'data\\hackathon\\gutenberg\\blake-songs.txt']
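
<p>If you want to group files by author or year, the first line of each file can be parsed with a regular expression. This is a sketch only: the pattern is inferred from the one header visible in this notebook, <code>[Emma by Jane Austen 1816]</code>, and may need adjusting for other files. <code>HEADER_RE</code> and <code>parse_header</code> are hypothetical names.</p>

```python
import re

# Assumed pattern: "[<title> by <author> <four-digit year>]".
HEADER_RE = re.compile(
    r"^\[(?P<title>.+) by (?P<author>.+) (?P<year>\d{4})\]$")


def parse_header(first_line):
    """Return a dict with 'title', 'author', and 'year' (int), or
    None if `first_line` does not fit the assumed pattern."""
    match = HEADER_RE.match(first_line.strip())
    if match is None:
        return None
    info = match.groupdict()
    info["year"] = int(info["year"])
    return info


emma_info = parse_header("[Emma by Jane Austen 1816]")
```

<p>Returning <code>None</code> rather than raising an error makes it easy to spot files whose headers deviate from the pattern.</p>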


<h2 id="Potentially-useful-code">Potentially useful code<a class="anchor-link" href="#Potentially-useful-code">¶</a></h2>



<h3 id="Project-Gutenberg-iterator">Project Gutenberg iterator<a class="anchor-link" href="#Project-Gutenberg-iterator">¶</a></h3><p>You might want to modify this, depending on how you want to process these texts (by word? sentence? chapter?).</p>


In [35]:
def gutenberg_iterator(filename):
    """Yields paragraphs (as defined simply by multiple
    newlines in a row).

    Parameters
    ----------
    filename : str
        Full path to the file.

    Yields
    ------
    multiline str

    """
    with open(filename) as f:
        contents = f.read()
    
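    # Split on blank lines: a newline, optional whitespace, and then
    # another newline. Single newlines inside a paragraph are kept.
    for para in re.split(r"\n\s*\n", contents.strip()):
        yield para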

In [36]:
emma_iterator = gutenberg_iterator(gutenberg_filenames[0])

for _ in range(5):
    print("="*50)
    print(next(emma_iterator))

[Emma by Jane Austen 1816]
VOLUME I
CHAPTER I
Emma Woodhouse, handsome, clever, and rich, with a comfortable home
and happy disposition, seemed to unite some of the best blessings
of existence; and had lived nearly twenty-one years in the world
with very little to distress or vex her.
She was the youngest of the two daughters of a most affectionate,
indulgent father; and had, in consequence of her sister's marriage,
been mistress of his house from a very early period.
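
<p>To process the texts by chapter instead, one option is to group paragraphs under the chapter headings that precede them. The sketch below assumes headings look like <code>CHAPTER I</code>, as in the output above; other novels may use different conventions. <code>chapter_iterator</code> and the toy paragraphs are hypothetical.</p>

```python
import re

# Assumed heading format: "CHAPTER" followed by a Roman numeral.
CHAPTER_RE = re.compile(r"^CHAPTER\s+[IVXLC]+$")


def chapter_iterator(paragraphs):
    """Yield (heading, list of paragraphs) pairs. Anything before the
    first chapter heading is skipped."""
    heading, body = None, []
    for para in paragraphs:
        para = para.strip()
        if CHAPTER_RE.match(para):
            # A new heading closes the previous chapter, if any.
            if heading is not None:
                yield heading, body
            heading, body = para, []
        elif heading is not None:
            body.append(para)
    # Emit the final chapter.
    if heading is not None:
        yield heading, body


# Placeholder paragraphs standing in for the output of a paragraph iterator.
toy_paragraphs = [
    "[Emma by Jane Austen 1816]", "VOLUME I",
    "CHAPTER I", "First paragraph.", "Second paragraph.",
    "CHAPTER II", "Third paragraph."]

chapters = list(chapter_iterator(toy_paragraphs))
```

<p>Note that a later heading such as <code>VOLUME II</code> would be treated as an ordinary paragraph of the preceding chapter under this simple scheme.</p>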



<h3 id="Sentence-tokenizing-using-NLTK">Sentence tokenizing using NLTK<a class="anchor-link" href="#Sentence-tokenizing-using-NLTK">¶</a></h3>


In [31]:
from nltk.tokenize import sent_tokenize

In [32]:
sent_tokenize("Hello? This is Dr. Potts! How are you?")

['Hello?', 'This is Dr. Potts!', 'How are you?']
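
<p>Once a text is split into sentences, finding the <em>most something</em> sentence comes down to scoring each one against a lexicon and taking the maximum. Here is a minimal sketch under stated assumptions: the scores in <code>toy_valence</code> are invented, the sentences are hand-written, and <code>mean_lexicon_score</code> is a hypothetical helper.</p>

```python
import string


def mean_lexicon_score(text, lexicon):
    """Average the `lexicon` values of the tokens in `text` that the
    lexicon covers. Returns None if no token is covered."""
    toks = [w.strip(string.punctuation) for w in text.lower().split()]
    scores = [lexicon[w] for w in toks if w in lexicon]
    if not scores:
        return None
    return sum(scores) / len(scores)


# Invented valence scores, for illustration only.
toy_valence = {"happy": 8.0, "home": 7.0, "distress": 2.0, "vex": 3.0}

sentences = [
    "A comfortable home and happy disposition.",
    "Very little to distress or vex her.",
    "Nothing covered here."]

scored = [(s, mean_lexicon_score(s, toy_valence)) for s in sentences]

# Sentences with no covered words have no score and are left out.
covered = [(s, score) for s, score in scored if score is not None]
most_positive = max(covered, key=lambda pair: pair[1])[0]
```

<p>With the real data, a dictionary built from one column of a word-indexed frame (for example via the Pandas <code>Series.to_dict</code> method) could play the role of <code>toy_valence</code>. Averaging only over covered words is one design choice among several; you could also penalize sentences with low coverage.</p>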


<h3 id="Word-counts">Word counts<a class="anchor-link" href="#Word-counts">¶</a></h3><p>From assignment 2.</p>


In [None]:

def simple_tokenize(s):
    """Break str `s` into a list of str.

    1. `s` has all of its peripheral whitespace removed.
    2. `s` is downcased with `lower`.
    3. `s` is split on whitespace.
    4. For each token, any peripheral punctuation on it is stripped
       off. Punctuation is here defined by `string.punctuation`.

    Parameters
    ----------
    s : str
        The string to tokenize.

    Returns
    -------
    list of str
    """
    punct = string.punctuation
    final_toks = []
    toks = s.lower().strip().split()
    for w in toks:
        final_toks.append(w.strip(punct))
    return final_toks


def word_counts(s, tokenizing_func=simple_tokenize):
    """Count distribution for the words in `s` according to
    `tokenizing_func`.

    Parameters
    ----------
    s : str
        String to tokenize and get word counts for.
    tokenizing_func : function
        Any function that can be called as `tokenizing_func(s)` where
        `s` is a string. The default is `simple_tokenize`.

    Returns
    -------
    dict mapping str to int
    """
    counts = {}
    for tok in tokenizing_func(s):
        # Add 1 to the running total, starting from 0 for unseen tokens.
        counts[tok] = counts.get(tok, 0) + 1
    return counts




<h3 id="egrep">egrep<a class="anchor-link" href="#egrep">¶</a></h3><p>From assignment 5.</p>


In [33]:
def egrep(regex, filename):
    """Python version of egrep. The function iterates through the
    user's file `filename`, line-by-line, stripping off the final
    newline character, and yielding only the lines that match the
    user's regular expression `regex`.

    Note: like basic egrep, a line that contains multiple matches for
    `regex` is yielded only once.

    Parameters
    ----------
    regex : Compiled regular expression
        The pattern to use for matching
    filename : str
        Full path to the file to open and iterate through

    Yields
    ------
    str
        Lines from the file, with newline characters removed
    """
    
    with open(filename) as f:
        for line in f:
            # Strip only the final newline, as the docstring promises.
            line = line.rstrip("\n")
            if regex.search(line):
                yield line