# Assignment 1

## Part A: Linguistic analysis using spaCy

### 1. Tokenization (1 point)
Process the dataset using the spaCy package and extract the following information:
- Number of tokens
- Number of types
- Number of words
- Average number of words per sentence
- Average word length

Provide the definition that you used to determine words.

In [1]:
%%bash
python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl (13.9 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [10]:
import spacy
import os
from collections import Counter
import numpy as np
import pandas as pd

In [3]:
# This loads a small English model trained on web data.
# For other models and languages check: https://spacy.io/models
nlp = spacy.load('en_core_web_sm')

In [4]:
cwd = os.getcwd()

with open(f"{cwd}/data/preprocessed/train/sentences.txt", "r") as file:
    train_dataset = file.read()
doc = nlp(train_dataset)

In [5]:
# Making our test doc for later testing
test_input = "I have an awesome cat. It's sitting on the mat that I bought yesterday."
doc_test = nlp(test_input)

In [6]:
# Counting words and frequencies
word_frequencies = Counter()
words_length = []   # mean word length of each sentence
avg_num_words = []  # number of words in each sentence

for sentence in doc.sents:
    # A word is any token that is not punctuation
    words = [token.text for token in sentence if not token.is_punct]
    avg_num_words.append(len(words))
    # Skip sentences without words so np.mean is never called on an empty list
    if words:
        words_length.append(np.mean([len(word) for word in words]))
    word_frequencies.update(words)

num_tokens = len(doc)
num_words = sum(word_frequencies.values())
num_types = len(word_frequencies)
avg_num_words_sentence = np.mean(avg_num_words)
# Average of the per-sentence mean word lengths
avg_num_word_length = np.mean(words_length)


print(
    f"Num tokens: {num_tokens}\n"
    f"Num words: {num_words}\n"
    f"Num types: {num_types}\n"
    f"Words per sentence: {np.around(avg_num_words_sentence, 3)}\n"
    f"Average length per word: {np.around(avg_num_word_length, 3)}"
)

Num tokens: 16130
Num words: 13895
Num types: 3722
Words per sentence: 19.352
Average length per word: 4.432




**Important: the words counted above also include the newline token "\n", which should not count as a word. Excluding it as well gives:**

In [7]:
# Counting words and frequencies
word_frequencies = Counter()
words_length = []   # mean word length of each sentence
avg_num_words = []  # number of words in each sentence

for sentence in doc.sents:
    # A word is any token that is neither punctuation nor the newline token
    words = [
        token.text
        for token in sentence
        if not (token.is_punct or token.text == "\n")
    ]
    avg_num_words.append(len(words))
    # Skip sentences without words so np.mean is never called on an empty list
    if words:
        words_length.append(np.mean([len(word) for word in words]))
    word_frequencies.update(words)

num_tokens = len(doc)
num_words = sum(word_frequencies.values())
num_types = len(word_frequencies)
avg_num_words_sentence = np.mean(avg_num_words)
# Average of the per-sentence mean word lengths
avg_num_word_length = np.mean(words_length)


print(
    f"Num tokens: {num_tokens}\n"
    f"Num words: {num_words}\n"
    f"Num types: {num_types}\n"
    f"Words per sentence: {np.around(avg_num_words_sentence, 3)}\n"
    f"Average length per word: {np.around(avg_num_word_length, 3)}"
)

Num tokens: 16130
Num words: 13242
Num types: 3721
Words per sentence: 18.443
Average length per word: 4.891


Our definition of a word: any token that is neither punctuation (as flagged by spaCy's `token.is_punct`) nor the newline token "\n".
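
To make this definition concrete, here is a minimal sketch that counts tokens, words, and types on a hand-tokenized toy input, without spaCy. The helper `is_word` and the toy sentences are assumptions for illustration only, and `string.punctuation` merely approximates spaCy's `token.is_punct`. The sketch also computes the average word length over all words at once, which can differ slightly from averaging the per-sentence means.

```python
from collections import Counter
import string

# Hypothetical pre-tokenized input: one list of token strings per sentence.
sentences = [
    ["I", "have", "an", "awesome", "cat", ".", "\n"],
    ["It", "'s", "sitting", "on", "the", "mat", "."],
]

def is_word(token):
    """Assumed word test: not the newline token and not made up only of punctuation."""
    if token == "\n":
        return False
    return not all(ch in string.punctuation for ch in token)

# Keep only the words of each sentence, then flatten them into one list.
words_per_sentence = [[t for t in sent if is_word(t)] for sent in sentences]
all_words = [w for sent in words_per_sentence for w in sent]

num_tokens = sum(len(sent) for sent in sentences)   # every token, including "." and "\n"
num_words = len(all_words)                          # tokens that pass is_word
num_types = len(Counter(all_words))                 # distinct word forms
avg_words_per_sentence = num_words / len(sentences)
avg_word_length = sum(len(w) for w in all_words) / num_words  # weighted by word, not by sentence

print(num_tokens, num_words, num_types, avg_words_per_sentence, round(avg_word_length, 3))
```

Types are counted case-sensitively here, matching the notebook's use of `token.text`; lowercasing first would merge forms such as "The" and "the".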

### 2. Word Classes (1.5 points)

Run the default part-of-speech tagger on the dataset and identify the ten most frequent
POS tags. Complete the table below for these ten tags (the tagger in the model
en_core_web_sm is trained on the Penn Treebank tagset).

In [11]:
# The ten most frequent (universal POS, fine-grained tag) pairs in the dataset
top_tags = [
    ("NOUN", "NN"), ("PROPN", "NNP"), ("ADP", "IN"), ("DET", "DT"), ("ADJ", "JJ"),
    ("NOUN", "NNS"), ("PUNCT", ","), ("PUNCT", "."), ("SPACE", "_SP"), ("VERB", "VBN"),
]

pos_list = []
tokens_by_tag = {pair: [] for pair in top_tags}

for token in doc:
    pair = (token.pos_, token.tag_)
    pos_list.append('{}, {}'.format(*pair))
    if pair in tokens_by_tag:
        tokens_by_tag[pair].append(token.text)

pos_frequencies = Counter(pos_list)
total_tags = sum(pos_frequencies.values())

# token.tag_ is the fine-grained (Penn Treebank) tag, token.pos_ the universal tag
finegrained = [tag for _, tag in top_tags]
universal = [pos for pos, _ in top_tags]
occurrences = [len(tokens_by_tag[pair]) for pair in top_tags]
# Relative frequency as a percentage of all tokens, rounded to whole percent
rtf = [round(100 * pos_frequencies['{}, {}'.format(*pair)] / total_tags) for pair in top_tags]

# NOTE: list(Counter(x)) keeps first-appearance order, so these are the first three
# distinct tokens seen for each tag and the last distinct token seen, not frequency
# ranks; Counter.most_common() would rank by count.
most_frequent = [list(Counter(tokens_by_tag[pair]))[0:3] for pair in top_tags]
least_frequent = [list(Counter(tokens_by_tag[pair]))[-1] for pair in top_tags]

word_class_table = pd.DataFrame({
    "Finegrained POS-tag": finegrained,
    "Universal POS-tag": universal,
    "Occurrences": occurrences,
    "Relative Tag Frequency (%)": rtf,
    "3 most frequent tokens": most_frequent,
    "Example of infrequent token": least_frequent,
})
word_class_table.head(10)

| | Finegrained POS-tag | Universal POS-tag | Occurrences | Relative Tag Frequency (%) | 3 most frequent tokens | Example of infrequent token |
|---|---|---|---|---|---|---|
| 0 | NN | NOUN | 2066 | 13 | [month, baby, agar] | project |
| 1 | NNP | PROPN | 2060 | 13 | [ROS, Police, Virginia] | Navy |
| 2 | IN | ADP | 1600 | 10 | [alongside, of, with] | By |
| 3 | DT | DET | 1313 | 8 | [an, the, a] | Each |
| 4 | JJ | ADJ | 868 | 5 | [old, different, regional] | Sebastian |
| 5 | NNS | NOUN | 774 | 5 | [children, years, concentrations] | areas |
| 6 | , | PUNCT | 699 | 4 | [,, ;, …] | … |
| 7 | . | PUNCT | 655 | 4 | [., ?, !] | ! |
| 8 | _SP | SPACE | 653 | 4 | [\n] | \n |
| 9 | VBN | VERB | 454 | 3 | [thought, aged, represented] | acquitted |
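
For the two token columns, `collections.Counter.most_common` ranks entries by count rather than by order of appearance. A minimal sketch on a hypothetical toy list of tokens that share one tag (the data and variable names are assumptions, not taken from the dataset):

```python
from collections import Counter

# Hypothetical tokens that all received the same POS tag.
tokens = ["the", "a", "the", "an", "the", "a", "this"]

counts = Counter(tokens)

# most_common(n) returns (token, count) pairs sorted by descending count.
three_most_frequent = [token for token, _ in counts.most_common(3)]

# The last entry of the full ranking is one of the least frequent tokens.
infrequent_example = counts.most_common()[-1][0]

print(three_most_frequent, infrequent_example)
```

When several tokens occur equally often, which of them is reported depends on the order in which they were first seen, so the infrequent example is one of possibly many candidates.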


### 3. N-Grams (1.5 points)
Calculate the distribution of n-grams and provide the 3 most frequent:
- Token bigrams
- Token trigrams
- POS bigrams
- POS trigrams
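
One possible way to count n-grams, sketched on a hypothetical toy token list: slide a window of length n over the sequence with `zip` and tally the resulting tuples with `Counter`. The helper name `ngram_counts` and the toy data are assumptions; the same function would work unchanged on a list of POS tags.

```python
from collections import Counter

def ngram_counts(items, n):
    """Count all contiguous n-grams (as tuples) in a sequence."""
    # Zipping n shifted views of the list yields every window of length n.
    return Counter(zip(*(items[i:] for i in range(n))))

# Hypothetical token sequence; a list of POS tags could be passed the same way.
tokens = ["the", "cat", "sat", "on", "the", "cat", "mat"]

bigrams = ngram_counts(tokens, 2)
trigrams = ngram_counts(tokens, 3)

# The three most frequent bigrams with their counts.
top_bigrams = bigrams.most_common(3)
print(top_bigrams)
```

This sketch treats the input as one continuous sequence, so n-grams may span sentence boundaries; counting per sentence and summing the counters would avoid that.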