# Lab 3
We'll use this lab as an experiment in using a single file, where you fill in code blocks where necessary. The file is available as .py and .ipynb. Using the latter with Jupyter Notebook is highly recommended, as it provides substantially better feedback.


Provide your outputs in a simple report, along with textual answers.


The idea behind this format is to clarify what sort of output is required, as all answers are checked by tests in the `tests.py` file.

In [1]:
import sklearn
import nltk
import random
import pandas as pd
import re
# feel free to import from modules of sklearn and nltk later
# e.g., from sklearn.model_selection import train_test_split
from sklearn.model_selection import train_test_split

## Exercise 1 - Gender detection of names
In NLTK you'll find the corpus `corpus.names`, a set of roughly 3000 male and 5000 female names.
1) Select a ratio of train/test data (based on experiences from previous labs perhaps?)
2) Build a feature extractor function
3) Build two classifiers:
    - Decision tree
    - Naïve bayes
    
Finally, write code to evaluate the classifiers. Explain your results, and what do you think would change if you altered your feature extractor?

In [3]:
from typing import List, Tuple

class GenderDataset:
    def __init__(self):
        self.names = nltk.corpus.names
        self.data = None
        self.build()

    def make_labels(self, gender: str) -> List[Tuple[str, str]]:
        """
        this function is to help you get started
        based on the passed gender, as you can fetch from the file ids,
        we return a tuple of (name, gender) for each name
        
        use this in `build` below, or do your own thing completely :)
        """
        return [(n, gender) for n in self.names.words(gender + ".txt")]
    
    def build(self) -> None:
        """
        combine the data in "male" and "female" into one
        remember to randomize the order
        """
        data = self.make_labels("male")
        data.extend(self.make_labels("female"))
        random.shuffle(data)
        self.data = data
    
    def split(self, ratio):
        return train_test_split(self.data, test_size=ratio)

In [4]:
class Classifier:
    def __init__(self, classifier: nltk.ClassifierI):
        self.classifier = classifier
        self.model = None
    
    def train(self, data):
        self.model = self.classifier.train(data)
        
    def test(self, data):
        return nltk.classify.accuracy(self.model, data)
    
    def train_and_evaluate(self, train, test):
        self.train(train)
        return self.test(test)
        
    def show_features(self):
        # OPTIONAL
        pass

                                 
class FeatureExtractor:
    def __init__(self, data):
        self.data = data
        self.features = []  
        
        self.build()
                 
    @staticmethod
    def text_to_features(name):
        # TODO: create a dict of features from a name
        return {
            "last_letter": name[-1],
            "first_letter": name[0]
        }
    
    def build(self):
        for name, gender in self.data:
            self.features.append((self.text_to_features(name), gender))


Note: you should achieve an accuracy of well above 70%!
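One way to explore how the feature extractor affects accuracy is to give the classifiers more than single letters to work with. The function below is a minimal sketch of a richer extractor; the name `name_features` and the particular features (a two-letter suffix and the name length) are illustrative assumptions, not part of the lab template.

```python
def name_features(name: str) -> dict:
    """Map a name to a small dict of features (illustrative only)."""
    lowered = name.lower()
    return {
        "first_letter": lowered[0],
        "last_letter": lowered[-1],
        # A two-letter suffix may capture endings that one letter cannot.
        "suffix2": lowered[-2:],
        "length": len(lowered),
    }


print(name_features("Anna"))
# {'first_letter': 'a', 'last_letter': 'a', 'suffix2': 'na', 'length': 4}
```

A dict like this could be returned from `text_to_features` in place of the two-feature version. Whether it actually helps has to be checked on held-out data, since more features can also lead to overfitting.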

In [5]:
split_ratio = 0.1  # TODO: modify
train, test = GenderDataset().split(ratio=split_ratio)

classifiers = {
    "decision_tree": Classifier(nltk.DecisionTreeClassifier), # TODO
    "naive_bayes": Classifier(nltk.NaiveBayesClassifier), # TODO
}

train_set = FeatureExtractor(train).features
test_set = FeatureExtractor(test).features

for name, classifier in classifiers.items():
    acc = classifier.train_and_evaluate(train_set, test_set)
    print(f"Model: {name}\tAccuracy: {acc}")



Model: decision_tree	Accuracy: 0.7559748427672957
Model: naive_bayes	Accuracy: 0.7433962264150943


I get an accuracy of around 70-75% for both classifiers. The results vary each time I run the code because of the random shuffling and the random train/test split. That being said, the naive Bayes classifier usually gives the better accuracy, although in the run shown above the decision tree scored slightly higher.
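Run-to-run variation of this kind can be removed by fixing the random seed. The sketch below shows the idea in plain Python with a hypothetical helper, `seeded_split`, and made-up names; with scikit-learn the same effect comes from passing `random_state` to `train_test_split`, as done later in this lab.

```python
import random


def seeded_split(items, test_ratio, seed):
    """Shuffle a copy of items with a private RNG, then cut off a test slice."""
    rng = random.Random(seed)  # private generator; the global RNG is untouched
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


names = [f"name{i}" for i in range(20)]  # placeholder data
train_a, test_a = seeded_split(names, 0.1, seed=4310)
train_b, test_b = seeded_split(names, 0.1, seed=4310)
print(test_a == test_b)  # True: same seed, same split
```

With a fixed seed, differences in accuracy between two runs can be attributed to changes in the features or the classifier rather than to the split.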

## Exercise 2 - Spam or ham
"Spam or ham" refers to whether a mail is spam or regular ("ham"). Follow the instructions and implement the `TODOs`.

In [6]:
spam = pd.read_csv(
    'spam.csv',
    usecols=["v1", "v2"],
    encoding="latin-1"
).rename(columns={"v1": "label", "v2": "text"})

print(spam.label.value_counts())
spam.head()

ham     4825
spam     747
Name: label, dtype: int64


| | label | text |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [7]:
""" TODO: transform label to numerical
Expected output:
0    4825
1     747
Name: label, dtype: int64

hint: you can use "apply" or "replace" for a column in pandas
"""
spam["label"] = spam["label"].map({"ham": 0, "spam": 1}) # your transformation goes here

spam.label.value_counts()

0    4825
1     747
Name: label, dtype: int64

In [8]:
class TextCleaner:
    def __init__(self, text):
        self.tokenize(text) # TODO: tokenize
        self.stemmer = nltk.stem.PorterStemmer() # TODO: incorporate a stemmer of your choice
        self.stopwords = nltk.corpus.stopwords.words("english")  # TODO: you've done this a few times
        self.lem = nltk.stem.WordNetLemmatizer()  # TODO: lemmatizer
    
    """
    Create small functions to replace your tokens (self.text)
    iteratively. Such as a lowercase function.
    """
    def tokenize(self, text):
        # self.text = [word for word in text.split()]
        self.text = nltk.tokenize.word_tokenize(text)

    def lowercase(self):
        self.text = [word.lower() for word in self.text]
    
    def lemmatize(self):
        self.text = [self.lem.lemmatize(word) for word in self.text]

    def stem(self):
        self.text = [self.stemmer.stem(word) for word in self.text]

    def remove_stopwords(self):
        self.text = [word for word in self.text if word not in self.stopwords]

    def clean(self, lem: bool = False, stem: bool = False):
        self.lowercase()
        self.remove_stopwords()
        """
        TODO: populate with your defined cleaning functions here
        perhaps you want some conditional values to
        control which functions to use?
        """
        if lem:
            self.lemmatize()
        if stem:
            self.stem()
        
        # finally, return it as a text 
        return " ".join(self.text)

In [9]:
old_spam = spam.copy()
clean = lambda text: TextCleaner(text).clean(stem=True)
spam.text = spam.text.apply(clean)

In [10]:
spam.head()

| | label | text |
|---|---|---|
| 0 | 0 | go jurong point , crazi .. avail bugi n great ... |
| 1 | 0 | ok lar ... joke wif u oni ... |
| 2 | 1 | free entri 2 wkli comp win fa cup final tkt 21... |
| 3 | 0 | u dun say earli hor ... u c alreadi say ... |
| 4 | 0 | nah n't think goe usf , live around though |


In [11]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix
from sklearn.feature_extraction.text import TfidfVectorizer

split_ratio = 0.1
X_train, X_test, y_train, y_test = train_test_split(
    spam.text, spam.label, test_size=split_ratio, random_state=4310)


# vectorize with sklearn
vectorizer = TfidfVectorizer()
# fit the vectorizer to your training data
X_train = vectorizer.fit_transform(X_train)



# TODO: set up a multinomial classifier
classifier = MultinomialNB()
if classifier:
    classifier.fit(X_train, y_train)
    
vectorized = None
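
To make the vectorization step less of a black box, here is a minimal sketch of TF-IDF weighting in plain Python. It assumes what I understand to be scikit-learn's default behaviour (smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row), but it tokenizes by whitespace only, so it will not reproduce `TfidfVectorizer` exactly. The function name `tfidf` and the three example messages are made up for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.split() for doc in docs]  # whitespace split: a simplification
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Term count times smoothed idf.
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        }
        # Guard against empty documents, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows


rows = tfidf(["free entri win", "ok lar", "win free prize"])
print(rows[0])
```

In the first message, "entri" gets a higher weight than "free" or "win" because it appears in only one of the three documents, which is the effect TF-IDF is meant to produce.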


In [12]:
def predict(model, vectorizer, data, all_predictions=False):
    data = vectorizer.transform(data) # TODO apply the transformation from the vectorizer to test data 
    if all_predictions:
        return model.predict_proba(data)
    else:
        return model.predict(data)

def print_examples(data, probs, label1, label2, n=10):
    percent = lambda x: "{}%".format(round(x*100, 1))

    for text, pred in list(zip(data, probs))[:n]:
        print(f"{text} \n{label1}: {percent(pred[0])} / {label2}: {percent(pred[1])}\n{'-' * 100}")

In [13]:
if classifier:
    y_probas = predict(classifier, vectorizer, X_test, all_predictions=True)
    print_examples(X_test, y_probas, "ham", "spam", n = 2)

    y_pred = predict(classifier, vectorizer, X_test)
    # TODO display a confusion matrix on the test set vs predictions
    confusion_mat = confusion_matrix(y_true = y_test, y_pred = y_pred)
    print(confusion_mat)

    # show precision and recall in a confusion matrix
    tn, fp, fn, tp = confusion_mat.ravel()
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)

    print(f"Recall={round(recall, 2)}\nPrecision={round(precision, 2)}")

world famamu .... 
ham: 95.6% / spam: 4.4%
----------------------------------------------------------------------------------------------------
\aww must nearli dead ! well jez iscom todo workand whilltak forev ! \ '' '' 
ham: 95.2% / spam: 4.8%
----------------------------------------------------------------------------------------------------
[[488   0]
 [ 18  52]]
Recall=0.74
Precision=1.0
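
The numbers above can be checked by hand from the four cells of the confusion matrix. The sketch below repeats the arithmetic and adds accuracy and F1, which the cell above does not print.

```python
# Counts taken from the confusion matrix printed above:
# rows are true labels (ham, spam), columns are predicted labels.
tn, fp, fn, tp = 488, 0, 18, 52

recall = tp / (tp + fn)        # share of actual spam that was caught
precision = tp / (tp + fp)     # share of predicted spam that really was spam
accuracy = (tp + tn) / (tp + tn + fp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"recall={recall:.3f} precision={precision:.3f} "
      f"accuracy={accuracy:.3f} f1={f1:.3f}")
# recall=0.743 precision=1.000 accuracy=0.968 f1=0.852
```

In words: no ham message was flagged as spam, but 18 of the 70 spam messages in the test set slipped through as ham.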


## Exercise 3 - Word features
Word features can be very useful for performing document classification, since the words that appear in a document give a strong indication of what its semantic content is. However, many words occur very infrequently, and some of the most informative words in a document may never have occurred in our training data. One solution is to make use of a lexicon, which describes how different words relate to each other.

Your task:
- Use the WordNet lexicon and augment the movie review document classifier (See NLTK book, Ch. 6, section 1.3) to use features that generalize the words that appear in a document, making it more likely that they will match words found in the training data.

Download wordnet and import

In [14]:
from nltk.corpus import movie_reviews
from nltk.corpus import wordnet as wn
import random

In [16]:
from typing import List
import random

def word_to_syn(word) -> str:
    '''Returns a synonym for a word. If no synonym is found the function returns the word itself'''

    all_lemmas = []
    for syn in wn.synsets(word):
        all_lemmas.append(syn.lemma_names())

    if len(all_lemmas) == 0:
        return word

    first_synset = all_lemmas[0]

    if len(first_synset) == 0:
        return word

    if len(first_synset) > 1:
        if word in first_synset:
            first_synset.remove(word)
        return first_synset[0]

    return first_synset[0]

print(word_to_syn("frog"))

toad


In [102]:
"""
this is from Ch. 6, sec. 1.3, with slight modifications
note that word_to_syn(word) (from the above implementation)
is in the beginning of the following function
"""
documents = [([word_to_syn(word) for word in list(movie_reviews.words(fileid))], category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

random.shuffle(documents)

all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
n_most_freq = 2000
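# most_common() sorts by frequency; list(all_words) only keeps insertion order
word_features = [word for word, _ in all_words.most_common(n_most_freq)]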

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

In [103]:
featuresets = [(document_features(d), c) for (d, c) in documents]

split_ratio = 0.1
train_set, test_set = train_test_split(featuresets, test_size=split_ratio)


classifier = nltk.NaiveBayesClassifier
model = classifier.train(train_set)

In [166]:
from typing import List

def synset_expansion(words: List[str]) -> List:
    '''Returns a list of all lemmas given an input wordlist'''
    all_lemmas = []

    for word in words:
        for syn in wn.synsets(word):
            for lemma in syn.lemmas():
                all_lemmas.append(lemma.name())


    all_lemmas = [word.lower() for word in (set(all_lemmas))]

    return sorted(all_lemmas)

expanded_word_features = synset_expansion(word_features)
print(synset_expansion(["programming", "coder"]))

['coder', 'computer_programing', 'computer_programmer', 'computer_programming', 'program', 'programing', 'programme', 'programmer', 'programming', 'scheduling', 'software_engineer']


In [171]:
# some assertions to test your code :-)
assert sorted(synset_expansion(["pc"])) == ["microcomputer", "pc", "personal_computer"]
assert sorted(synset_expansion(["programming", "coder"])) == [
    'coder',
    'computer_programing',
    'computer_programmer',
    'computer_programming',
    'program',
    'programing',
    'programme',
    'programmer',
    'programming',
    'scheduling',
    'software_engineer'
]

In [172]:
doc_featuresets = [(document_features(d), c) for (d, c) in documents]
doc_train_set, doc_test_set = train_test_split(doc_featuresets, test_size=0.1)

doc_model = classifier.train(doc_train_set)
doc_model.show_most_informative_features(5)
print("Accuracy: ", nltk.classify.accuracy(doc_model, doc_test_set))

Most Informative Features
          contains(lame) = True              neg : pos    =     10.5 : 1.0
         contains(mulan) = True              pos : neg    =      8.2 : 1.0
        contains(seagal) = True              neg : pos    =      7.1 : 1.0
   contains(outstanding) = True              pos : neg    =      6.3 : 1.0
         contains(flynt) = True              pos : neg    =      5.6 : 1.0
Accuracy:  0.81


In [169]:
def lexicon_features(reviews):
    review_words = set(reviews)
    features = {}
    for word in expanded_word_features:
        if word not in word_features:
            features['synset({})'.format(word)] = (word in review_words)
        features['contains({})'.format(word)] = (word in review_words)

    return features

Question: do you see any issues with including the synsets? Experiment a bit with different words and verify your ideas.

Including these expanded synsets might lead to the feature set growing too large and to including words that do not have the same meaning or, even worse, have the opposite meaning. One should tread lightly when doing this, and perhaps use a validation set to check that the expansion makes sense.
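The mixing of meanings can be illustrated without WordNet. The sketch below uses a tiny hand-written sense inventory (hypothetical, not real WordNet data) in which "programming" has both a software sense and a timetable sense, mirroring the `scheduling` lemma that showed up in the `synset_expansion` output earlier.

```python
# Hypothetical, hand-written sense inventory (NOT real WordNet data).
TOY_SENSES = {
    "programming": [
        ["programming", "computer_programming"],  # sense 1: writing software
        ["scheduling", "programming"],            # sense 2: planning a timetable
    ],
}


def expand(words):
    """Union of all lemmas over all senses of each word."""
    lemmas = set()
    for word in words:
        # Unknown words fall back to a single sense containing just the word.
        for sense in TOY_SENSES.get(word, [[word]]):
            lemmas.update(sense)
    return sorted(lemmas)


print(expand(["programming"]))
# ['computer_programming', 'programming', 'scheduling']
```

Because the expansion takes every sense of a word, a review about software would also switch on the timetable-related feature. Restricting the expansion to one sense per word would reduce this, at the cost of missing some valid synonyms.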

In [62]:
# warning: this may take some time to run
lex_featuresets = [(lexicon_features(d), c) for (d, c) in documents]
lex_train_set, lex_test_set = train_test_split(lex_featuresets, test_size=0.1)
lex_model = classifier.train(lex_train_set)  # the same classifier as you defined above
lex_model.show_most_informative_features()
print("Accuracy: ", nltk.classify.accuracy(lex_model, lex_test_set))

Most Informative Features
        contains(feeble) = True              neg : pos    =     10.9 : 1.0
          synset(feeble) = True              neg : pos    =     10.9 : 1.0
     contains(interpret) = True              pos : neg    =      8.4 : 1.0
       synset(interpret) = True              pos : neg    =      8.4 : 1.0
     contains(illogical) = True              neg : pos    =      8.3 : 1.0
       contains(misfire) = True              neg : pos    =      8.3 : 1.0
       synset(illogical) = True              neg : pos    =      8.3 : 1.0
         synset(misfire) = True              neg : pos    =      8.3 : 1.0
           contains(bit) = True              pos : neg    =      7.7 : 1.0
         contains(chaff) = True              neg : pos    =      7.6 : 1.0
Accuracy:  0.73


## Exercise 4 -- Experimentation
This exercise is largely open: experiment and test your skills thus far!
Large websites are an ideal place to look for large corpora of natural language. In this exercise, you're free to apply what you've learned to real-world data, mined from YouTube (see `youtube_data`). Reuse classes defined earlier in the exercise if you want.

The only requirement here is to **use a classifier not previously used in the exercise**

In [3]:
import os

yt_data_path = os.path.join(os.getcwd(), "youtube_data")
df = pd.read_csv(os.path.join(yt_data_path, "videos.csv"))

I chose to clean this data by keeping only the columns I want and dropping all duplicate entries of `video_id`. After looking at the dataset I noticed that almost all videos were represented multiple times, with a slightly different `trending_date` and different `views`, `likes` and `dislikes`. In this task I will try to predict a video's category from its description.

In [4]:
clean_df = df[["video_id", "title", "category_id","description"]].drop_duplicates("video_id")

In [5]:
clean_df["category_id"] = clean_df["category_id"].astype("category")
clean_df["description"] = clean_df["description"].astype("str")

In [6]:
clean_df.category_id.value_counts()

24    1619
10     799
26     595
23     547
25     505
22     498
17     451
28     380
1      318
27     250
15     138
20     103
2       70
19      60
29      14
43       4
Name: category_id, dtype: int64
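
These counts are quite unbalanced, so it helps to know what accuracy a trivial classifier would reach by always predicting the largest category. The sketch below works that out from the counts printed above. Note that it uses the whole dataset, so the share in any particular test split may differ slightly.

```python
# Per-category video counts copied from the value_counts() output above.
counts = {
    24: 1619, 10: 799, 26: 595, 23: 547, 25: 505, 22: 498, 17: 451, 28: 380,
    1: 318, 27: 250, 15: 138, 20: 103, 2: 70, 19: 60, 29: 14, 43: 4,
}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)
baseline = counts[majority_class] / total

print(total, majority_class, round(baseline, 3))
# 6351 24 0.255
```

Any classifier for this task should therefore be compared against roughly 25% accuracy rather than against zero.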

In [7]:
clean = lambda text: TextCleaner(str(text)).clean(stem=True)
clean_df["description"] = clean_df["description"].apply(clean)

In [8]:
clean_df["description"].describe()

count     6351
unique    6196
top        nan
freq       102
Name: description, dtype: object

In [92]:
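# missing descriptions became the literal string "nan" after astype("str"),
# so filter on that text; dropna(axis=1) would drop columns, not rows
drop_na_df = clean_df[clean_df["description"] != "nan"]

I notice that the dataset contains some values that are `NaN`, but pandas' `df.dropna()` has no effect here. The earlier `astype("str")` call turned each missing description into the string `"nan"` (the value that `describe()` reports 102 times above), which pandas no longer treats as missing. In addition, `axis=1` asks `dropna` to drop columns rather than rows. Note that the cells below still use `clean_df`, so these placeholder descriptions remain in the training data.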
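The effect of converting a missing value to text can be reproduced in plain Python. The sketch below assumes missing descriptions start out as float NaN, which is how pandas usually stores them; the example strings are made up.

```python
import math

missing = float("nan")       # a typical representation of a missing value
as_text = str(missing)       # what a string conversion turns it into

print(math.isnan(missing))   # True: still detectable as missing
print(as_text == "nan")      # True: now an ordinary three-letter string

descriptions = ["great video", as_text, "official trailer"]
# After the conversion, filtering has to compare against the literal text.
kept = [d for d in descriptions if d != "nan"]
print(kept)
# ['great video', 'official trailer']
```

Dropping missing rows before any string conversion avoids the problem altogether, because the values are then still recognizable as missing.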

In [14]:
from sklearn.metrics import confusion_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

split_ratio = 0.25
X_train, X_test, y_train, y_test = train_test_split(
    clean_df.description, clean_df.category_id, test_size=split_ratio, random_state=4310
)

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train)

In [15]:
SVC_classifier = SVC()
if SVC_classifier:
    SVC_classifier.fit(X_train, y_train)

In [16]:
y_pred = SVC_classifier.predict(vectorizer.transform(X_test))

In [17]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred, zero_division = 0))

From this I get the output:
```
              precision    recall  f1-score   support

           1       0.93      0.34      0.50        80
           2       0.00      0.00      0.00        14
          10       0.93      0.83      0.88       199
          15       0.93      0.41      0.57        32
          17       0.97      0.76      0.85       113
          19       1.00      0.07      0.12        15
          20       1.00      0.12      0.22        24
          22       0.82      0.39      0.53       119
          23       0.98      0.70      0.82       142
          24       0.49      0.96      0.65       413
          25       0.93      0.62      0.74       125
          26       0.80      0.72      0.76       143
          27       1.00      0.65      0.79        68
          28       0.90      0.49      0.64        96
          29       0.00      0.00      0.00         3
          43       0.00      0.00      0.00         2

    accuracy                           0.70      1588
   macro avg       0.73      0.44      0.50      1588
weighted avg       0.79      0.70      0.69      1588
```

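This shows that precision is mostly good across the categories, while recall is often much lower. Categories 2, 29 and 43 score 0 on every metric: the classifier never got one of them right, and if a category is never predicted at all its precision is undefined, which `zero_division = 0` reports as 0. This is probably because these categories are rare (70, 14 and 4 videos in the whole dataset), which leaves few examples to learn from. Category 24, the largest, shows the opposite pattern, with a recall of 0.96 but a precision of only 0.49, which suggests that many videos from other categories are assigned to it.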
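To see how a precision of 0 can arise for a category that does occur in the test set, the sketch below computes per-class precision by hand. The helper `precision_per_class` and the label lists are made up for illustration; its `zero_division` argument only imitates the option of the same name in `classification_report`.

```python
def precision_per_class(y_true, y_pred, zero_division=0.0):
    """Precision per label; uses zero_division when a label is never predicted."""
    labels = sorted(set(y_true) | set(y_pred))
    result = {}
    for label in labels:
        predicted = sum(1 for p in y_pred if p == label)
        correct = sum(1 for t, p in zip(y_true, y_pred) if t == p == label)
        # With no predictions for this label, correct / predicted would be 0 / 0.
        result[label] = correct / predicted if predicted else zero_division
    return result


# Made-up labels: category 2 occurs in the data but is never predicted.
y_true = [24, 24, 10, 2, 2]
y_pred = [24, 24, 10, 24, 24]
print(precision_per_class(y_true, y_pred))
# {2: 0.0, 10: 1.0, 24: 0.5}
```

The same toy data also shows the side effect on the majority class: the two videos of category 2 that were assigned to category 24 pull its precision down to 0.5.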