# CS4765/6765 Assignment 1: Tokenization and Counting Words (from Russian Trolls)

This (relatively short) assignment will give you experience working
with Python, corpora, regular expressions, tokenization, and some NLP terminology related to words.

## Data

**In this assignment you will be working with real tweets. Some
  of the tweets contain content that you might find offensive (e.g.,
  expletives, racist and homophobic remarks). Despite this offensive
  content, tweets can still be very valuable data, and building NLP
  systems that can operate over them is important. That is why we are
  working with this potentially-offensive data in this assignment.**

I've provided you with the following file for this assignment:

- `russian-troll-tweets-en.txt.gz`: A collection of roughly
  $750k$ English tweets sent by Russian trolls, mostly from
  2015-2017. Each line of this file is a single tweet, and the file is UTF-8 encoded. (If
  you want to take a look at the contents of a file without
  uncompressing it, `zcat` is helpful. On OSX I use `gzcat`.) You can read more about the project that collected these tweets here:
  https://fivethirtyeight.com/features/why-were-sharing-3-million-russian-troll-tweets/

## Implementation

You will implement tokenization and counting in this assignment. You must not use NLTK or any other NLP toolkits. You should not import any modules that this notebook does not already import for you. Your code must be able to run on the NLP VM on the lab machines using Python 3.9.

It is important that your code be reasonably efficient in this assignment. In particular, your code must not make multiple passes over the corpus, and must not read the entire corpus into memory at once. Corpora often contain billions of tokens, and these efficiency issues become particularly important when working with such large corpora. The corpus we are working with in this assignment is relatively small. (You will determine its size in tokens as part of the assignment.)
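As a minimal sketch of the single-pass, constant-memory requirement above: iterating over a file object yields one line at a time, so nothing forces the whole corpus into memory. The helper name `iter_lines` and the tiny demo file are assumptions made for illustration; they are not part of the assignment.

```python
import gzip
import os
import tempfile

def iter_lines(fname):
    """Yield one line at a time from a gzip-compressed UTF-8 text file."""
    with gzip.open(fname, mode='rt', encoding='utf-8') as infile:
        for line in infile:
            # Only the current line is held in memory.
            yield line.rstrip('\n')

# Build a tiny stand-in corpus (hypothetical data) so the sketch is runnable.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.txt.gz')
with gzip.open(demo_path, mode='wt', encoding='utf-8') as f:
    f.write('first tweet\nsecond tweet\n')

# A single pass that accumulates running totals instead of storing the lines.
n_lines = 0
n_chars = 0
for line in iter_lines(demo_path):
    n_lines += 1
    n_chars += len(line)

print(n_lines, n_chars)
```

By contrast, calling `infile.read()` or `infile.readlines()` would load the whole file at once, which is what the requirement rules out.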



## Tokenization (3 marks)

You will first write a function, `tokenize`, to tokenize tweets. This function takes a line/tweet as input and returns a list where each element is a token from the line/tweet. To tokenize tweets, use regular expressions. The following describes the tokenization:

- Any sequence of alphanumeric characters, underscores, hyphens, or apostrophes that optionally begins with `#` or `@` is a token.
- Any sequence of other non-whitespace characters is a token.
- Any sequence of whitespace characters is a token boundary. (Whitespace does not appear as tokens in the output.)

**Hint:** I used the `re.split` function in my solution; you might find this helpful too.

The test cases provided below help to show the expected behaviour of the tokenizer.
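One possible reading of the three rules above can be sketched with a single regular expression. This is an illustration under stated assumptions, not the sample solution: it uses `re.findall` rather than the `re.split` mentioned in the hint, and the names `TOKEN_RE` and `tokenize_sketch` are hypothetical. It also assumes that a `#` or `@` not followed by a word character is grouped with the surrounding "other" characters.

```python
import re

# Rule 1: optional # or @, then a run of word characters, hyphens, apostrophes.
# Rule 2: a run of any other non-whitespace characters.
# Rule 3: whitespace is never matched, so it only acts as a boundary.
TOKEN_RE = re.compile(r"[#@]?[\w'-]+|[^\w\s'-]+")

def tokenize_sketch(line):
    """Return the list of tokens found in one line/tweet."""
    return TOKEN_RE.findall(line)

print(tokenize_sketch("@USER please, don't stop :( #NLP!!!"))
```

In Python 3, `\w` on `str` patterns matches Unicode letters and digits as well as the underscore, which suits UTF-8 tweets.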


In [4]:
import re

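def tokenize(l):
    # Regex-based tokenizer. The alternatives are tried left to right:
    #   [@#]?\w+(?:[-']\w+)*  a word, optionally starting with @ or #, with
    #                         internal hyphens or apostrophes (e.g. I'm)
    #   [:;][()D]             simple emoticons such as :( and ;)
    #   [!?]+                 a run of exclamation and/or question marks
    #   [^\s\w]               any other single non-whitespace character
    # Note that this splits most punctuation runs into single characters,
    # which is narrower than the "any sequence of other non-whitespace
    # characters" rule described above.

    tokens = re.findall(r"[@#]?\w+(?:[-']\w+)*|[:;][()D]|[!?]+|[^\s\w]", l)
    return tokens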

Test `tokenize` on three (carefully chosen and lightly-edited) tweets (from a different Twitter corpus).

In [6]:
import unittest

class TestTokenization(unittest.TestCase):
    def test_simple(self):
        self.assertEqual(tokenize('''I just ate a whole bag of chips, help!!!'''),
                        ['I',
                         'just', 
                         'ate', 
                         'a', 
                         'whole', 
                         'bag', 
                         'of', 
                         'chips', 
                         ',', 
                         'help', 
                         '!!!'])

    def test_clitic(self):
        self.assertEqual(tokenize('''Now I'm tired?'''),
                         ['Now', 
                          "I'm", 
                          'tired', 
                          '?'])

    def test_at_hashtag_smiley(self):
        self.assertEqual(tokenize('''@USER please bring back @USER, pleaseeee :( #AlwaysInOurHearts'''),
                         ['@USER',
                          'please',
                          'bring',
                          'back',
                          '@USER',
                          ',',
                          'pleaseeee',
                          ':(',
                          '#AlwaysInOurHearts'])
            
unittest.main(argv=[''], verbosity=2, exit=False)

test_at_hashtag_smiley (__main__.TestTokenization.test_at_hashtag_smiley) ... ok
test_clitic (__main__.TestTokenization.test_clitic) ... ok
test_simple (__main__.TestTokenization.test_simple) ... ok

----------------------------------------------------------------------
Ran 3 tests in 0.006s

OK


<unittest.main.TestProgram at 0x22535aa5f40>

Tokenize the corpus and write the output to file.

In [8]:
import gzip

def tokenize_file(in_fname, out_fname):
    # Apply tokenize to each line in in_fname and write output
    # one-token-per-line to out_fname. This one-token-per-line
    # format for a corpus is also referred to as "vertical" format.
    with gzip.open(in_fname, mode='rt', encoding='utf-8') as infile:
        with open(out_fname, mode='w', encoding='utf-8') as outfile:
            for line in infile:
                # Tokenize each sentence/line
                tokens = tokenize(line)
                # Write each token on a separate line, with a blank line between
                # sentences
                for t in tokens:
                    print(t, file=outfile)
                print(file=outfile)

In [9]:
tokenize_file('a1data/russian-troll-tweets-en.txt.gz', 'russian-troll-tweets-en.tokens')
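To make the "vertical" format concrete, here is a small sketch that writes two already-tokenized tweets to an in-memory buffer in the same layout as `tokenize_file`: one token per line, with a blank line after each tweet. The helper name `write_vertical` and the sample tokens are assumptions for illustration.

```python
import io

def write_vertical(token_lists, outfile):
    """Write each token on its own line, with a blank line after each tweet."""
    for tokens in token_lists:
        for t in tokens:
            print(t, file=outfile)
        print(file=outfile)

buf = io.StringIO()
write_vertical([['Now', "I'm", 'tired', '?'], ['help', '!!!']], buf)
vertical = buf.getvalue()
print(vertical)
```

Any code that later reads this format has to decide what to do with the blank separator lines.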

## Counting (2 marks)

Now write a function, `count_tokens`, that counts how many times each type occurs in a
corpus. In doing this counting, apply case folding (i.e., convert
everything to lower case). Your function takes the filename for a file such as `russian-troll-tweets-en.tokens` as input (i.e., a file in one-token-per-line format) and returns a dictionary in which the keys are types (ignoring case) and the value for each key is its count in the corpus.


In [11]:
def count_tokens(fname):
    counts = {}

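    # Read the vertical-format file one line (i.e., one token) at a time,
    # so the corpus is never held in memory all at once.
    with open(fname, 'r', encoding='utf-8') as file:
        for line in file:
            # Drop the trailing newline and case-fold. Note that the blank
            # separator lines between tweets become '' and are counted too.
            token = line.strip().lower()
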
            if token in counts:
                counts[token] += 1
            else:
                counts[token] = 1

    return counts
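In the Question 4 output further down, the most frequent "type" is the empty string, which suggests that the blank separator lines of the vertical format are being counted. A hedged variant that skips those lines might look like the sketch below; the name `count_tokens_skip_blank` and the demo file are hypothetical, and whether the sample solution skips blank lines is an assumption rather than something the assignment states.

```python
import os
import tempfile

def count_tokens_skip_blank(fname):
    """Count case-folded types in a one-token-per-line file, ignoring blank lines."""
    counts = {}
    with open(fname, 'r', encoding='utf-8') as infile:
        for line in infile:
            token = line.strip().lower()
            if not token:
                # Blank line: a tweet boundary, not a token.
                continue
            counts[token] = counts.get(token, 0) + 1
    return counts

# Tiny stand-in file (hypothetical data): two "tweets" in vertical format.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.tokens')
with open(demo_path, 'w', encoding='utf-8') as f:
    f.write('The\ncat\n\nthe\nCAT\n!!!\n\n')

demo_counts = count_tokens_skip_blank(demo_path)
print(demo_counts)
```

Skipping the separators would change the type count, the token count, and the top-10 list, so the outputs recorded in this notebook correspond to the version above that counts them.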

Apply the counting to the corpus.

In [13]:
russian_troll_counts = count_tokens('russian-troll-tweets-en.tokens')

Here are some test cases. These pass using my sample solution to the tokenizer and counting.

In [15]:
import unittest

class TestCounting(unittest.TestCase):
    def test_duck(self):
        self.assertEqual(russian_troll_counts['duck'], 294)
    def test_cat(self):
        self.assertEqual(russian_troll_counts['cat'], 932)
    def test_test(self):
        self.assertEqual(russian_troll_counts['test'], 1591)
    def test_smiley(self):
        self.assertEqual(russian_troll_counts[':)'], 779)
    def test_3exclamationmarks(self):
        self.assertEqual(russian_troll_counts['!!!'], 3213)
        
unittest.main(argv=[''], verbosity=2, exit=False)

test_3exclamationmarks (__main__.TestCounting.test_3exclamationmarks) ... FAIL
test_cat (__main__.TestCounting.test_cat) ... FAIL
test_duck (__main__.TestCounting.test_duck) ... FAIL
test_smiley (__main__.TestCounting.test_smiley) ... FAIL
test_test (__main__.TestCounting.test_test) ... FAIL
test_at_hashtag_smiley (__main__.TestTokenization.test_at_hashtag_smiley) ... ok
test_clitic (__main__.TestTokenization.test_clitic) ... ok
test_simple (__main__.TestTokenization.test_simple) ... ok

FAIL: test_3exclamationmarks (__main__.TestCounting.test_3exclamationmarks)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Users\Rakib\AppData\Local\Temp\ipykernel_604\516314098.py", line 13, in test_3exclamationmarks
    self.assertEqual(russian_troll_counts['!!!'], 3213)
AssertionError: 3412 != 3213

FAIL: test_cat (__main__.TestCounting.test_cat)
----------------------------------------------------------------------
Traceback (mos

<unittest.main.TestProgram at 0x22538107e30>
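The failing assertion shown above reports a count higher than expected (3412 against 3213 for `!!!`). One plausible contributor, which this sketch does not verify against the actual corpus or the sample solution, is that the tokenizer in this notebook splits punctuation more finely than the written rules. The comparison below uses the notebook's pattern and a hypothetical pattern (`SPEC_RE`) that follows one reading of the rules.

```python
import re

# Pattern used by tokenize() in this notebook.
NOTEBOOK_RE = re.compile(r"[@#]?\w+(?:[-']\w+)*|[:;][()D]|[!?]+|[^\s\w]")
# Hypothetical pattern following one reading of the written rules.
SPEC_RE = re.compile(r"[#@]?[\w'-]+|[^\w\s'-]+")

sample = "\"wow!!!\" said the 'cat' ..."

notebook_tokens = NOTEBOOK_RE.findall(sample)
spec_tokens = SPEC_RE.findall(sample)

print(notebook_tokens)
print(spec_tokens)
```

With the notebook pattern, `!!!` and `cat` come out as standalone tokens here, whereas under the rule-based reading they are absorbed into `!!!"` and `'cat'`. Differences of this kind would inflate the counts for `!!!` and `cat`.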

## Questions (5 marks)

Answer the following questions. For some questions, you will need to write some code to get the answer.


1. How many types are in the corpus?

In [18]:
# Write code to help answer this question here

corpus_type_count = len(russian_troll_counts)
print(f"{corpus_type_count} types")

216287 types


Write your answer here

2. How many tokens are in the corpus?

In [21]:
# Write code to help answer this question here

# Reuse the counts computed above rather than making another pass over the corpus.
total_tokens = sum(russian_troll_counts.values())
print(f"{total_tokens} tokens")

11856883 tokens


Write your answer here



3. How many hapax legomena (types that occur once) are in the corpus?

In [24]:
# Write code to help answer this question here

hapax_legomena = [token for token, count in russian_troll_counts.items() if count == 1]
hapax_legomena_num = len(hapax_legomena)
print(f"{hapax_legomena_num} hapax legomena")

110067 hapax legomena
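To make the terminology used in Questions 1 to 3 concrete, here is a toy example with a hypothetical counts dictionary: types are the distinct keys, tokens are the total occurrences, and hapax legomena are the types whose count is exactly one.

```python
# Hypothetical counts for a tiny corpus: "the cat sat . the the cat"
toy_counts = {'the': 3, 'cat': 2, 'sat': 1, '.': 1}

n_types = len(toy_counts)                                 # distinct types
n_tokens = sum(toy_counts.values())                       # total occurrences
n_hapax = sum(1 for c in toy_counts.values() if c == 1)   # types seen once

print(n_types, n_tokens, n_hapax)
```

Counting hapax legomena with a generator expression avoids building a list of them, which matters little here but helps when the vocabulary is large.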


Write your answer here



4. What are the 10 most frequent types in the corpus?

In [27]:
# Write code to help answer the question here

most_frequent_types = sorted(russian_troll_counts.items(), key=lambda x: x[1], reverse=True)[:10]
for token, count in most_frequent_types:
    print(f"'{token}' Occurs {count} Times")

'' Occurs 759346 Times
'.' Occurs 477367 Times
',' Occurs 233837 Times
'to' Occurs 230468 Times
'the' Occurs 224984 Times
''' Occurs 186804 Times
'in' Occurs 159822 Times
'a' Occurs 137100 Times
'of' Occurs 131660 Times
':' Occurs 125107 Times
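The first entry in the list above is the empty string rather than a real type. A sketch of a top-k helper that leaves it out, and that breaks ties alphabetically so the order is reproducible, is shown below on hypothetical toy counts; the name `top_k_types` is an assumption for illustration.

```python
def top_k_types(counts, k=10):
    """Return the k most frequent (type, count) pairs, ignoring the empty string."""
    items = [(tok, c) for tok, c in counts.items() if tok != '']
    # Sort by descending count, then alphabetically to break ties.
    items.sort(key=lambda pair: (-pair[1], pair[0]))
    return items[:k]

# Hypothetical toy counts, including an empty-string entry.
toy_counts = {'': 5, 'the': 3, '.': 3, 'cat': 2, 'sat': 1}
top3 = top_k_types(toy_counts, k=3)
print(top3)
```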


Write your answer here

5. I carried out some analysis on a corpus consisting of a random sample of ~500$k$ English tweets from a similar time period as the collection of Russian troll tweets that you have been working with in this assignment. This is a sample of all English tweets, i.e., it has not been carefully constructed to represent the language of Russian trolls, or any other group. I used the same approach to tokenization and counting as above. Here are the top-10 types for this corpus of English tweets:

    * .
    * i
    * the
    * you
    * to
    * ,
    * a
    * and
    * my
    * me
  
Compare the top-10 most frequent types for this corpus to the top-10 most frequent types for the corpus of Russian troll tweets from Question 4. Based on this analysis, how is the language of Russian trolls on Twitter different from that of more-general Twitter users?

Write your answer here

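The language of Russian trolls differs from that of general Twitter users in several ways. For example,

Engagement: General tweets focus more on personal interaction: four of their top-10 types are first- and second-person pronouns (i, you, my, me), whereas no personal pronoun appears in the troll top 10.

Punctuation: Troll tweets use punctuation more heavily. Besides the period and comma, their top 10 includes the apostrophe/quote mark and the colon, which may reflect quoted material and headline-like text and could lend a more authoritative tone.

Function vs. Content Words: Neither list contains content words, but the mix of function words differs. The general tweets favour pronouns and "and", while the troll tweets favour prepositions (in, of) and punctuation, suggesting a broadcast style.

Overall, Russian trolls appear to aim for attention and controversy, while general users engage more casually.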
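The comparison above can be made explicit with a few list comprehensions over the two top-10 lists. The lists are copied from the Question 4 output and from the list given in Question 5; the variable names are assumptions for illustration.

```python
# Top-10 types as printed in Question 4 (including the empty string).
troll_top10 = ['', '.', ',', 'to', 'the', "'", 'in', 'a', 'of', ':']
# Top-10 types for the general English tweets, as listed in Question 5.
general_top10 = ['.', 'i', 'the', 'you', 'to', ',', 'a', 'and', 'my', 'me']

shared = [t for t in troll_top10 if t in general_top10]
troll_only = [t for t in troll_top10 if t not in general_top10]
general_only = [t for t in general_top10 if t not in troll_top10]

print('shared:      ', shared)
print('troll only:  ', troll_only)
print('general only:', general_only)
```

The types unique to the general list are mostly personal pronouns, while those unique to the troll list are prepositions and punctuation (plus the empty string).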

## What to submit

When you're done, submit a1.ipynb to the assignment 1 folder on D2L.

## A final note... 

In this assignment you’re working with real data. You might encounter problems or quirks with file formats, character encodings, etc. If you encounter such issues, please post about them on the bulletin board on D2L.

Have fun, and good luck!

