# POS Chunking
**1. Create a chunker that detects noun-phrases (NPs) and lists the NPs in the text below.**

- Both [NLTK](https://www.nltk.org/book/ch07.html) and [spaCy](https://spacy.io/api/matcher) supports chunking
- Look up RegEx parsing for NLTK and the document object for spaCy.
- Make use of what you've learned about tokenization.

In [2]:
text = "The language model predicted the next word. It was a very nice word!"
# TODO: set up a pos tagger and a chunker.
# Output: a list of all tokens, grouped as noun-phrases where applicable

from nltk import pos_tag
from nltk.chunk.regexp import RegexpParser
from nltk.tokenize import RegexpTokenizer


def pos_tagger(text):
    tokenizer = RegexpTokenizer(r"\w+")
    t_text = tokenizer.tokenize(text)
    return pos_tag(t_text)

print("The senteces parsed with the standard pos tagger from nltk.")
print(pos_tagger(text))

patterns = """  
    NP: {<DT>?<RB>?<JJ>*<NN>+} 

"""
def regexp_chunker(patterns, text):
    PChunker = RegexpParser(patterns)
    parsed_text = PChunker.parse(pos_tagger(text))
    return parsed_text

print()
print("The text parsed with pos tagger and chunked with a spesific noun phrase pattern")
phrases = regexp_chunker(patterns, text)
for phrase in phrases:
    print(phrase)
# regexp_chunker(patterns, text).draw()

The senteces parsed with the standard pos tagger from nltk.
[('The', 'DT'), ('language', 'NN'), ('model', 'NN'), ('predicted', 'VBD'), ('the', 'DT'), ('next', 'JJ'), ('word', 'NN'), ('It', 'PRP'), ('was', 'VBD'), ('a', 'DT'), ('very', 'RB'), ('nice', 'JJ'), ('word', 'NN')]

The text parsed with pos tagger and chunked with a spesific noun phrase pattern
(NP The/DT language/NN model/NN)
('predicted', 'VBD')
(NP the/DT next/JJ word/NN)
('It', 'PRP')
('was', 'VBD')
(NP a/DT very/RB nice/JJ word/NN)


**2. Modify the chunker to handle verb-phases (VPs) as well.**
- This can be done by using a RegEx parser in NLTK or using a spaCy Matcher.

In [3]:
# TODO: set up grammars to chunk VPs

grammar = """
    VP: {<VB*.><DT>?<RB>?<JJ>?<NN>}
    NP: {<DT>?<RB>?<JJ>*<NN>+}
   
"""

print("The text parsed with pos tagger and chunked with a spesific NP and VP pattern")
for phrase in regexp_chunker(grammar, text):
    print(phrase)

The text parsed with pos tagger and chunked with a spesific NP and VP pattern
(NP The/DT language/NN model/NN)
(VP predicted/VBD the/DT next/JJ word/NN)
('It', 'PRP')
(VP was/VBD a/DT very/RB nice/JJ word/NN)


**3. Verb-phrases (VPs) can be defined by many different grammatical rules. Give four examples.**
- Hint: Context-Free Grammars, chapter 8 in NLTK.

Verb Phrases usually consist of a verb combined with a different phrase type. Unlike noun phrases its unlikely to see several verbs in a row, but not impossible. From the NLTK documentation there are here four examples of verbs followed by a different phrase type. <br>
 VP: <br>
V Adj <br>
V NP <br>
V PP<br>
V NP PP<br>
In the text above i combined the verb phrase with the noun phrases that included prepositions as well as an adjective.

**4. After these applications, do you find chunking to be beneficial in the context of language modeling and next-word prediction? Why or why not?**

I would assume it to be quite benefical in processes like semantic analysis as the addition of adjectives would likely determine if there is an opinion within the text. Using POS and chuncking with some additional use of n-grams could likely provide the context needed to sort out what these adjectives are desribing and if they contain negations or any sort of hidden meaning, increasing the likelihood of concluding the correct semantic meaning.
Additionally the grouping of phrases and the language understanding it can provide would be useful in most language modelling, from classifing text to understanding queries.

___

# Dependency Parsing

**1. Use spaCy to inspect/visualise the dependency tree of the text provided below.**
- Optional addition: visualize the dependencies as a graph using `networkx`

In [13]:
text = "The language model predicted the next word"
# TODO: use spacy and displacy to visualize the dependency tree

import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
document = nlp(text)
options = {"compact": True, "color": "indianred"}
displacy.render(document, style="dep", options=options)

**2. What is the root of the sentence? Attempt to spot it yourself, but the answer should be done by code**

In [7]:
# TODO: implement a function to find the root of the document
# Return both the word and its POS tag

for token in document:
    if token.head.text == token.text:
        print("The root word in this text is:", token.text)
        print("The POS tag of", token.text, "is:", token.tag_)

The root word in this text is: predicted
The POS tag of predicted is: VBD


**3. Find the subject and object of a sentence. Print the results for the sentence above.**

In [8]:
# TODO: implement a function to find the subjects + objects in the document

def get_subject_chunk(doc) -> str:
    for chunk in doc.noun_chunks:
        if ("subj" in chunk.root.dep_):
            return chunk
def get_subject(doc) -> str:
    for token in doc:
        if ("subj" in token.dep_):
            return token.text

def get_object_chunk(doc) -> str:
    for chunk in doc.noun_chunks:
        if ("dobj" in chunk.root.dep_):
            return chunk

def get_object(doc) -> str:
    for token in doc:
        if ("dobj" in token.dep_):
            return token.text

print(get_subject_chunk(document),"->", get_subject(document))
print(get_object_chunk(document),"->", get_object(document))

The language model -> model
the next word -> word


**4. How would you use the relationships extracted from dependency parsing in language modeling contexts?**

Similarly to chuncking, dependency provides an additional context and relative word understanding that should be helpful in most language modelling. Using the example of semantic analysis again, provided you know how the words are dependent on eachother you can understand better how the describing words impact the general message of the sentence or text. This would also be largly usefull in classification, deducing what the general subject of the text is.

___

# Wordnet

**1. Use Wordnet (from NLTK) and create a function to get all synonyms of a word of your choice. Try with "language"**

In [9]:
from nltk.corpus import wordnet as wn
# TODO: find synonyms

def get_synonyms(word: str) -> set:
    synonyms = set()
    [synonyms.update(syn_set.lemma_names()) for syn_set in wn.synsets(word)]
    return synonyms

get_synonyms("language")

{'language',
 'linguistic_communication',
 'linguistic_process',
 'lyric',
 'nomenclature',
 'oral_communication',
 'speech',
 'speech_communication',
 'spoken_communication',
 'spoken_language',
 'terminology',
 'voice_communication',
 'words'}

**2. From the same word you chose, extract an additional 4 or more features from wordnet (such as hyponyms). Describe each category briefly.**

In [10]:
# TODO: expand the function to find more features!

def get_antonyms(word: str) -> set:
    antonyms = set()
    for syn_set in wn.synsets(word):
        for lemma in syn_set.lemmas():
            for antonym in lemma.antonyms():
                antonyms.add(antonym.name())
    return antonyms

def get_hypernyms(word: str) -> set:
    hypernyms = set()
    for syn_set in wn.synsets(word): 
        [hypernyms.update(hyp_set.lemma_names()) for hyp_set in syn_set.hypernyms()]
    return hypernyms

def get_attributes(word: str) -> set:
    attributes = set()
    for syn_set in wn.synsets(word): 
        [attributes.update(att_set.lemma_names()) for att_set in syn_set.attributes()]
    return attributes

def get_word_features(word: str) -> dict:
    syn, ant, hyp, att = get_synonyms(word), get_antonyms(word), get_hypernyms(word), get_attributes(word)
    features = {"Synonyms": syn, "Antonyms": ant, "Hypernyms": hyp, "Attributes": att}
    return features

get_word_features("stupid")


{'Synonyms': {'dazed',
  'dolt',
  'dullard',
  'pillock',
  'poor_fish',
  'pudden-head',
  'pudding_head',
  'stunned',
  'stupe',
  'stupefied',
  'stupid',
  'stupid_person',
  'unintelligent'},
 'Antonyms': {'intelligent', 'smart'},
 'Hypernyms': {'simple', 'simpleton'},
 'Attributes': {'intelligence'}}

___

# Machine Learning Exercise - A sentiment classifier
- A rule-based approach with SentiWordNet + A machine learning classifier

**1. There are several steps required to build a classifier or any sort of machine learning application for textual data. For data including (INPUT_TEXT, LABEL), list the typical pipeline for classification.**

The typical machine learning pipline: <br>
select relevant data -> Preprocess -> Split to training and testing -> feature extraction -> model creation -> model training -> test and evaluate -> make changes if nessecary

**2. Before developing a classifier, having a baseline is very useful. Build a baseline model for sentiment classification using SentiWordNet.**
- How you decide to aggregate sentiment is up to you. Explain your approach.
- It should report the accuracy of the classifier.

In [11]:
from nltk.corpus import sentiwordnet as swn
from nltk.corpus import wordnet as wn
import spacy

# TODO: implement a function to get the sentiment of a text
# Must use the sentiwordnet lexicon

# Evaluate it on the following sentences:
sents = [
    "I liked it! Did you?",
    "It's not bad but... Nevermind, it is.",
    "It's awful",
    "I don't care if you loved it - it was terrible!",
    "I don't care if you hated it, I think it was awesome"
]
# 0: negative, 1: positive
y_true = [1, 0, 0, 0, 1]

nlp = spacy.load("en_core_web_sm")

def word_sentiment_score(word: str) -> float:
    synsets = list(swn.senti_synsets(word))
    if synsets:
        sentiment = synsets[0]
        return sentiment.pos_score() - sentiment.neg_score()
    else:
        return 0.0  

def sent_sentiment(sentences: list[str]) -> list[int]:
    sentiment_labels = []
    for sent in sentences:
        sentence = nlp(sent)
        sentence_sentiment = sum(word_sentiment_score(token.text) for token in sentence)
        sentiment_labels.append(1 if sentence_sentiment >= 0 else 0)  
    return sentiment_labels

classifications = sent_sentiment(sents)
print(classifications)
print("Score:", sum(1 for x,y in zip(classifications, y_true) if x == y) / len(y_true))

[1, 0, 0, 0, 1]
Score: 1.0


So in this code i decided each word was represented with the difference between their positive and negative scores in the swn library. Then these differences were aggregated in the sum fuction totaling a score for the entire sentence. Wheather or not this sentence was classified as a having a positive or negative sentiment depended wheather their score was positive or negative. <br>
This method is quite simple and disregards all negations or other phrasings that provide negative words while eluding to something positive or vice versa. In this case it functions, but for more complex texts it would likely fail to classify the text properly.

## The SST-2 binary sentiment dataset

**3. Split the training set into a training and test set. Choose a split size, and justify your choice.**

In [25]:
from datasets import load_dataset
dataset = load_dataset("sst2")

train_df = dataset["train"].to_pandas().drop(columns=["idx"])
train_df = train_df.sample(10000)  # a tiny subset
print(train_df.label.value_counts())
train_df.head()

Found cached dataset parquet (C:/Users/marcu/.cache/huggingface/datasets/parquet/sst2-1151590ea3b3f98b/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)


  0%|          | 0/3 [00:00<?, ?it/s]

label
1    5523
0    4477
Name: count, dtype: int64


Unnamed: 0,sentence,label
64594,there 'll be a power outage during your screen...,0
11796,the color sense of stuart little 2 is its most...,1
26050,"photographed with color and depth , and rather...",1
57107,too few laughs,0
28994,its limit to sustain a laugh,0


In [26]:
# TODO: split the data
from sklearn.model_selection import train_test_split as TTS
train_df, test_df = TTS(train_df, train_size=0.8, random_state=42)

# I choose an 80/20 split as this is a quite common divider in the ML world

**4. Evaluate your baseline model on the test set.**

- Additionally: compare it against a random baseline. That is, a random guess for each example

In [27]:
# TODO: evaluate on test set + random guess
# Report results in terms of accuracy
import numpy as np
predictions = sent_sentiment(test_df["sentence"])
print("Model score:", sum(1 for x,y in zip(predictions, test_df["label"]) if x == y) / len(test_df))

rand = np.random.choice([0, 1], size=(len(test_df),))
print("Random baseline:", sum(1 for x,y in zip(rand, test_df["label"]) if x == y)/ len(test_df))

Model score: 0.634
Random baseline: 0.496


**5. Did you beat random guess?**

If not, can you think of any reasons why?

My model did indeed outperform the random baseline. As the score of the baseline is close to 0.5 it could mean that there is about 50/50 of the two classes in the dataset, in this case the split is 55/45, meaning a random model will almost always get scores of around 50% accuracy. Despite my model not being great at 63.4% it still performed better likely due to the fact that it manages to read some words and get classify the postivie and negative words correctly. The error i assume is due to the lack of context for negation and other gramatical tricks that change the sentiment face value for a word.

## Classification with Naive Bayes and TF-IDF
This is the final task of the lab. You will use high-level libraries to implement a TF-IDF vectorizer and train your data using a Naive Bayes classifier

In [28]:
# TODO: use scikit-learn to...
# - normalize
# - vectorize/extract features
# - train a classifier
# - evaluate the classifier using `classification_report` and `accuracy`
# 
# expect an accuracy of > 0.8

from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.metrics import classification_report, accuracy_score 
from sklearn.naive_bayes import MultinomialNB 


def tf_idf_vectorize(train_data, test_data):
    vectorizer = TfidfVectorizer(encoding="utf-8", strip_accents="ascii", 
                                 lowercase=True, analyzer="word", stop_words="english")
    
    train_vector = vectorizer.fit_transform(train_data)
    test_vector = vectorizer.transform(test_data)
    return train_vector, test_vector

train_vector, test_vector = tf_idf_vectorize(train_df["sentence"], test_df["sentence"])

model = MultinomialNB()
model.fit(train_vector, train_df["label"])

print("The classification report:")
print(classification_report(predictions, test_df["label"]))
print("The Accuracy of the model is:", accuracy_score(model.predict(test_vector), test_df["label"])*100,"%")

The classification report:
              precision    recall  f1-score   support

           0       0.47      0.64      0.54       671
           1       0.78      0.63      0.70      1329

    accuracy                           0.63      2000
   macro avg       0.62      0.64      0.62      2000
weighted avg       0.67      0.63      0.64      2000

The Accuracy of the model is:  78.95 %


Using gridsearch on additional features i was able to improve the score to above 80%, although this was performed in a different notebook. The model also acchived that with its inital randomized train test split, prior to fixing the random_state to 42.

## Optional task: using a pre-trained transformer model
If you wish to push the accuracy as far as you can, take a look at BERT-based or other pre-trained language models. As a starting point, take a look at a model already fine-tuned on the SST-2 dataset: [DistilBERT](https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english)

**Advanced:**

Going beyond this, you could look into the addition of a *classification head* on top of the pooling layer of a BERT-based model. This is a common approach to fine-tuning these models on classification or regression problems.