# Assignment 1

## Part A: Linguistic analysis using spaCy

### 1. Tokenization (1 point)
Process the dataset using the spaCy package and extract the following information:
- **Number of tokens**
- **Number of types**
- **Number of words**
- **Average number of words per sentence**
- **Average word length**

**Provide the definition that you used to determine words.**

In [None]:
%%bash
python -m spacy download en_core_web_sm

In [51]:
import spacy
import os
from collections import Counter
import numpy as np

In [3]:
# This loads a small English model trained on web data.
# For other models and languages check: https://spacy.io/models
nlp = spacy.load('en_core_web_sm')

In [14]:
cwd = os.getcwd()

with open(f"{cwd}/data/preprocessed/train/sentences.txt", "r") as file:
    train_dataset = file.read()
doc = nlp(train_dataset)

In [73]:
# Making our test doc for later testing
test_input = "I have an awesome cat. It's sitting on the mat that I bought yesterday."
doc_test = nlp(test_input)

In [72]:
# Counting words and frequencies
word_frequencies = Counter()
words_per_sentence = []   # number of words in each sentence
mean_word_lengths = []    # mean word length within each sentence

for sentence in doc.sents:
    # Filter out punctuation; every other token counts as a word
    words = [token.text for token in sentence if not token.is_punct]
    words_per_sentence.append(len(words))
    # np.mean([]) is nan for a sentence without words; np.nanmean below skips those
    mean_word_lengths.append(np.mean([len(word) for word in words]))
    word_frequencies.update(words)

num_tokens = len(doc)
num_words = sum(word_frequencies.values())
# Types are distinct surface forms (case-sensitive, taken from token.text)
num_types = len(word_frequencies)
avg_num_words_sentence = np.mean(words_per_sentence)
# Mean of the per-sentence means, so each sentence is weighted equally
avg_num_word_length = np.nanmean(mean_word_lengths)

print(
    f"Num tokens: {num_tokens}\n"
    f"Num words: {num_words}\n"
    f"Num types: {num_types}\n"
    f"Words per sentence: {np.around(avg_num_words_sentence, 3)}\n"
    f"Average length per word: {np.around(avg_num_word_length, 3)}"
)

Num tokens: 16130
Num words: 13895
Num types: 3722
Words per sentence: 19.352
Average length per word: 4.432


**Important: the words above also include the newline token "\n", which should not count as a word. Excluding it gives:**

In [81]:
# Counting words and frequencies, now also excluding newline tokens
word_frequencies = Counter()
words_per_sentence = []   # number of words in each sentence
mean_word_lengths = []    # mean word length within each sentence

for sentence in doc.sents:
    # Filter out punctuation and the newline token; every other token counts as a word
    words = [
        token.text
        for token in sentence
        if not (token.is_punct or token.text == "\n")
    ]
    words_per_sentence.append(len(words))
    # np.mean([]) is nan for a sentence without words; np.nanmean below skips those
    mean_word_lengths.append(np.mean([len(word) for word in words]))
    word_frequencies.update(words)

num_tokens = len(doc)
num_words = sum(word_frequencies.values())
# Types are distinct surface forms (case-sensitive, taken from token.text)
num_types = len(word_frequencies)
avg_num_words_sentence = np.mean(words_per_sentence)
# Mean of the per-sentence means, so each sentence is weighted equally
avg_num_word_length = np.nanmean(mean_word_lengths)

print(
    f"Num tokens: {num_tokens}\n"
    f"Num words: {num_words}\n"
    f"Num types: {num_types}\n"
    f"Words per sentence: {np.around(avg_num_words_sentence, 3)}\n"
    f"Average length per word: {np.around(avg_num_word_length, 3)}"
)

Num tokens: 16130
Num words: 13242
Num types: 3721
Words per sentence: 18.443
Average length per word: 4.891
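
A caveat on the last figure: `avg_num_word_length` is computed as the mean of the per-sentence mean word lengths, so every sentence carries the same weight regardless of how many words it has. This generally differs from the mean taken over all words pooled together. The sketch below illustrates the gap on made-up word lists (not taken from the dataset); the names `macro_avg` and `micro_avg` are introduced here only for illustration.

```python
# Toy word lists, made up for illustration (not from the dataset).
sentences = [
    ["a", "cat"],         # word lengths 1 and 3
    ["extraordinary"],    # word length 13
]

# Mean of per-sentence means: every sentence counts equally.
per_sentence_means = [sum(len(w) for w in s) / len(s) for s in sentences]
macro_avg = sum(per_sentence_means) / len(per_sentence_means)

# Mean over all words pooled together: every word counts equally.
all_lengths = [len(w) for s in sentences for w in s]
micro_avg = sum(all_lengths) / len(all_lengths)

print(f"Mean of sentence means: {macro_avg:.3f}")
print(f"Mean over all words:    {micro_avg:.3f}")
```

On this toy input the first average is 7.5 and the second is about 5.667. Which of the two is the intended "average word length" depends on how the question is read, so it may be worth stating the choice alongside the reported number.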


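Our definition of a word: any token that spaCy does not flag as punctuation (`token.is_punct`) and that is not the newline token "\n". Types are the distinct surface forms of these words, counted case-sensitively.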
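
To make the definition concrete, here is a minimal sketch on a hand-tokenized toy example. It deliberately does not call spaCy: the token lists are an assumed tokenization, and `is_word` is a hypothetical stand-in for the filter `not (token.is_punct or token.text == "\n")` that only approximates spaCy's punctuation flag.

```python
from collections import Counter
import string

# Hand-tokenized toy sentences (an assumption; in the notebook these come from doc.sents).
sentences = [
    ["I", "have", "an", "awesome", "cat", ".", "\n"],
    ["It", "'s", "sitting", "on", "the", "mat", "."],
]


def is_word(token):
    # Hypothetical stand-in for the notebook's filter: a token is a word unless
    # it is the newline token or consists only of punctuation characters.
    if token == "\n":
        return False
    return not all(ch in string.punctuation for ch in token)


words = [tok for sent in sentences for tok in sent if is_word(tok)]

num_tokens = sum(len(sent) for sent in sentences)  # every token, including "." and "\n"
num_words = len(words)                             # tokens that pass the word filter
num_types = len(Counter(words))                    # distinct case-sensitive surface forms
words_per_sentence = num_words / len(sentences)

print(num_tokens, num_words, num_types, words_per_sentence)
```

Under this assumed tokenization the clitic `'s` counts as a word of its own, because it is not made up of punctuation characters only.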

### 2. Word Classes (1.5 points)

Run the default part-of-speech tagger on the dataset and identify the ten most frequent
POS tags. Complete the table below for these ten tags (the tagger in the model
`en_core_web_sm` is trained on the Penn Treebank tagset).