# Supervised Learning
## Machine learning (ML)

The sentiment analysis program we wrote earlier (in Lab 2) does not use machine learning. Instead, it relies on a hand-built definition of which words carry good and bad sentiment, and it assumes that every such word is listed in the word_sentiment.csv file.


Machine Learning (ML) is a class of data-driven algorithms: unlike "normal" algorithms, it is the data that "tells" the algorithm what the "good answer" is. A machine learning algorithm has no hand-coded definition of good and bad sentiment; instead it "learns by example". You show it several sentences that have been labeled as good or bad sentiment, and a good ML algorithm will eventually learn to predict whether an unseen sentence has good or bad sentiment. This particular example of sentiment analysis is "supervised", which means that the example sentences must be labeled, i.e. the data explicitly says which sentences are good and which are bad.
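To make "learning by example" concrete, here is a minimal sketch in plain Python. The four labeled sentences and the helper names (`learn_word_counts`, `predict`) are made up for illustration; this is a simple word-voting scheme, not the method used later in this lab.

```python
from collections import Counter, defaultdict

# Hypothetical, hand-made training examples: (sentence, label) pairs.
labeled_sentences = [
    ("what a fantastic day", "good"),
    ("i love this song", "good"),
    ("this is a terrible mistake", "bad"),
    ("i hate waiting", "bad"),
]

def learn_word_counts(examples):
    """Count how often each word appears under each label."""
    counts = defaultdict(Counter)
    for sentence, label in examples:
        for word in sentence.split():
            counts[word][label] += 1
    return counts

def predict(counts, sentence, default="good"):
    """Let every known word vote for the labels it was seen with."""
    votes = Counter()
    for word in sentence.split():
        votes.update(counts.get(word, {}))
    # Fall back to an arbitrary default when no word is known.
    return votes.most_common(1)[0][0] if votes else default

word_counts = learn_word_counts(labeled_sentences)
print(predict(word_counts, "i love a fantastic song"))
```

Nothing in the code says which words are good or bad; that knowledge comes entirely from the labeled examples.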

On the other hand, in unsupervised learning the example sentences are not labeled. Of course, in such a case the algorithm cannot "invent" what a good sentiment is, but it can try to cluster the data into different groups. For example, it can figure out that sentences containing certain words differ from those containing others (e.g. it might cluster sentences around words like "mother" and "children", and find that cluster to be different from another group of sentences that contain words like "politician").

There are also "intermediate" forms of supervision, i.e. semi-supervised and active learning. Technically, these are supervised methods that use some "smart" way to avoid needing a large number of labeled examples.

- In active learning, the algorithm itself decides which examples you should label (e.g. it can be pretty sure about a sentence that contains the word "fantastic", but it might ask you to confirm the label if the sentence also contains a negation such as "not").
- In semi-supervised learning, only part of the data is labeled. In one common variant (co-training), two different algorithms start from the labeled examples and then "tell" each other what they think about a large number of unlabeled examples. From this "discussion" they learn.
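The active learning idea can be sketched as "ask about the example the model is least sure of". The probability function below is a hypothetical stand-in for a trained model, and the numbers it returns are invented for illustration.

```python
def most_uncertain(unlabeled, prob_positive):
    """Return the item whose predicted probability is closest to 0.5."""
    return min(unlabeled, key=lambda item: abs(prob_positive(item) - 0.5))

# Hypothetical stand-in for a trained model's probability output.
def toy_prob_positive(sentence):
    words = sentence.split()
    if "fantastic" in words and "not" not in words:
        return 0.95   # confident: positive
    if "fantastic" in words:
        return 0.55   # the negation makes the model unsure
    return 0.20       # fairly confident: negative

pool = ["a fantastic goal", "not a fantastic goal", "a boring match"]
query = most_uncertain(pool, toy_prob_positive)
print("Ask the human to label:", query)
```

Only the selected sentence is sent to a human for labeling, which is how the number of labeled examples is kept small.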


#### Figure 1: Supervised learning approach


<center>
    <img src="ML_SL.png"  width="500" title="Supervised learning">
</center>



* Ref 1: https://www.youtube.com/watch?v=nKW8Ndu7Mjw&t=382s

* Ref 2: https://www.nltk.org/book/ch06.html

## Sentiment Exercise:

In this exercise we try to improve the *word_sentiment.csv* file. We build a classifier model to predict the sentiment of an unknown word, using the dictionary (corpus) of sentiments available in the word_sentiment.csv file.

- Objective: Predict sentiment of word
- Token : Characters of the word
- Training data: word_sentiment.csv 

### Training the ML algorithm

#### Module : NLTK (Natural Language Tool Kit)
The NLTK module is built for working with language data. It supports classification, tokenization, stemming, tagging, parsing, and semantic reasoning. We will use NLTK's Naive Bayes classifier to classify words as having either positive or negative sentiment. You can also use other modules specifically meant for ML, e.g. sklearn.

The nltk module may not be installed on your laptop. If it is not, install it with:

*pip install --user nltk*

To check whether it is installed, just run:

*import nltk*
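If you prefer a check that does not raise an error when the module is missing, the standard library can look a module up without importing it. This is a small sketch; the helper name `is_installed` is our own.

```python
import importlib.util

def is_installed(module_name):
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None

if is_installed("nltk"):
    print("nltk is available")
else:
    print("nltk is missing; install it with: pip install --user nltk")
```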


In [2]:
pip install --user nltk

Note: you may need to restart the kernel to use updated packages.


In [3]:
import csv
import nltk
import random

def feature_extractor(word):
    """ This is my feature extractor. Given a word, it returns a dictionary of the first and last letter
    of the word given to it"""
    first_l = word[0]
    last_l = word[-1]
    dict_feature = {"first letter" : first_l,"last letter" : last_l}
    return dict_feature

def gen_featureset():
    """ This function creates a feature set for the word_sentiment.csv file"""
    SENTIMENT_CSV = r"word_sentiment.csv"
    with open(SENTIMENT_CSV,'rt', encoding='utf-8') as csvobj:
        ws_data = csv.reader(csvobj)
        featureset = list()
        for row in ws_data:
            w_feature = list()
            feature = feature_extractor(row[0])
            w_feature.append(feature)
            if int(row[1]) > 0:
                w_feature.append('positive')
            else:
                w_feature.append('negative')
            featureset.append(w_feature)
        random.shuffle(featureset)
        return featureset
            

def ML_train():
    """ This will train the classifier using the word sentiment feature set"""
    featureset = gen_featureset()
    train_set = featureset[:2000]
    test_set = featureset[2000:]
    classifier = nltk.NaiveBayesClassifier.train(train_set) #Train the NaiveBayes model using the training data set
    acc = nltk.classify.accuracy(classifier,test_set) #Find accuracy of the model using the test set
    print('Accuracy of the model: ',acc)
    classifier.show_most_informative_features() #show the most informative features
    return classifier

nb_class = ML_train()

word = input('Enter a word: ').lower()
f_wrd = feature_extractor(word)
sentiment = nb_class.classify(f_wrd)
print("Sentiment of the word : ", sentiment)

Accuracy of the model:  0.6310272536687631
Most Informative Features
            first letter = 'y'            positi : negati =      7.9 : 1.0
             last letter = 'w'            positi : negati =      6.7 : 1.0
            first letter = 'd'            negati : positi =      5.2 : 1.0
            first letter = 'j'            positi : negati =      3.7 : 1.0
            first letter = 'g'            positi : negati =      3.5 : 1.0
             last letter = 'o'            positi : negati =      3.3 : 1.0
            first letter = 'u'            negati : positi =      2.9 : 1.0
            first letter = 'e'            positi : negati =      2.5 : 1.0
            first letter = 'n'            negati : positi =      2.3 : 1.0
            first letter = 'r'            positi : negati =      2.3 : 1.0
Enter a word: messi
Sentiment of the word :  negative


### Step 1: Feature extraction
Define which features of a word you want to use to classify the data set. We will select two features: the first and the last letter of the word.

**Tokenization:** the task of chopping text up into pieces (e.g. words or characters), called tokens
* For words, the characters in the word can act as tokens
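As a quick illustration of the two levels of tokenization, the sketch below splits a made-up sentence into word tokens and then one word into character tokens. Splitting on whitespace is a simplification that ignores punctuation.

```python
sentence = "Messi scored a fantastic goal"

# Word-level tokens: lowercase, then split on whitespace.
word_tokens = sentence.lower().split()

# Character-level tokens for a single word.
char_tokens = list(word_tokens[0])

print(word_tokens)
print(char_tokens)
```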

**Feature generation:** selecting the right features and determining how to encode them

* Feature 1: First letter of the word
* Feature 2: Last letter of the word

#### Figure 2: Feature Extractor


<center>
    <img src="FE2.png"  width="500" title="Feature Extraction">
</center>


#### Part 1: Feature Extractor

Write a function to extract the first and last letters of a word. Create a dictionary of features (first and last letters) and return the dictionary.

In [4]:
def feature_extractor(word):
    """ This is the feature extractor. Given a word, it returns a dictionary of the first and last letter
    of the word given to it"""
    first_l = word[0]
    last_l = word[-1]
    dict_feature = {"first letter" : first_l,"last letter" : last_l}
    return dict_feature

input_word = input("Enter a word ").lower()
feature = feature_extractor(input_word)
print(feature)

Enter a word cristiano
{'first letter': 'c', 'last letter': 'o'}


#### Part 2: Create the feature set from the training data
We will use the corpus of sentiments from the word_sentiment.csv file to create a feature dataset which we will use to train and test the ML model. 

Write a function that takes all the words in the training data, extracts their features, and creates a list of features and their sentiment.

In [8]:
import csv

def feature_extractor(word):
    """ This is my feature extractor. Given a word, it returns a dictionary of the first and last letter
    of the word given to it"""
    first_l = word[0]
    last_l = word[-1]
    dict_feature = {"first letter" : first_l,"last letter" : last_l}
    return dict_feature

def gen_featureset():
    """ This function creates a feature set for all the words in the word_sentiment.csv file"""
    SENTIMENT_CSV = r"word_sentiment.csv"
    with open(SENTIMENT_CSV,'rt', encoding='utf-8') as csvobj:
        sentidata = csv.reader(csvobj)
        featureset = list() # create an empty list with name featureset
        for row in sentidata:
            temp_l = list()
            f_word = feature_extractor(row[0])
            temp_l.append(f_word)
            temp_l.append(row[1])
            featureset.append(temp_l)
            print(temp_l)     
    return featureset
            
featureset = gen_featureset()
print(len(featureset))

[{'first letter': 'a', 'last letter': 'n'}, '-2']
[{'first letter': 'a', 'last letter': 'd'}, '-2']
[{'first letter': 'a', 'last letter': 's'}, '-2']
[{'first letter': 'a', 'last letter': 'd'}, '-2']
[{'first letter': 'a', 'last letter': 'n'}, '-2']
[{'first letter': 'a', 'last letter': 's'}, '-2']
[{'first letter': 'a', 'last letter': 'r'}, '-3']
[{'first letter': 'a', 'last letter': 'd'}, '-3']
[{'first letter': 'a', 'last letter': 't'}, '-3']
[{'first letter': 'a', 'last letter': 's'}, '-3']
[{'first letter': 'a', 'last letter': 's'}, '2']
[{'first letter': 'a', 'last letter': 'y'}, '2']
[{'first letter': 'a', 'last letter': 'd'}, '1']
[{'first letter': 'a', 'last letter': 'e'}, '-1']
[{'first letter': 'a', 'last letter': 's'}, '-1']
[{'first letter': 'a', 'last letter': 'e'}, '2']
[{'first letter': 'a', 'last letter': 'd'}, '2']
[{'first letter': 'a', 'last letter': 's'}, '2']
[{'first letter': 'a', 'last letter': 'g'}, '2']
[{'first letter': 'a', 'last letter': 'd'}, '1']
[{'first

### Step 2: Train the model on the training data set
##### Split the sample into training and testing data sets

We will split the feature data set into training and test data sets. The training set is used to train our ML model, and the test set is then used to check how good the model is. It is common to hold out about 20% of the data set for testing. In our case we will use the first 2000 words for training and the rest for testing.
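A proportional split can be sketched as follows. The helper `split_dataset`, its 20% default, and the shuffling step are assumptions of this sketch rather than part of the lab code, which slices at a fixed index of 2000.

```python
import random

def split_dataset(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of items and split it into (train, test)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    # The first n_test shuffled items become the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(10))
train_set, test_set = split_dataset(data)
print(len(train_set), len(test_set))
```

Fixing the seed makes the split reproducible, which helps when comparing two versions of a feature extractor.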

##### Use ML method (Naive Bayes) to create the classifier model

The NLTK module gives us several ML methods to create a classifier model using our training set and based on our selected features. 
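To give an idea of what a Naive Bayes classifier computes, here is a from-scratch sketch in plain Python. It scores each label as log P(label) plus the sum of log P(feature value | label) and picks the best one. The function names, the toy data, and the smoothing settings (`alpha`, `n_values`) are assumptions for illustration; NLTK's own implementation differs in details such as smoothing.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_features):
    """Count labels and feature values per label."""
    label_counts = Counter()
    value_counts = defaultdict(Counter)   # (label, feature name) -> Counter of values
    for features, label in labeled_features:
        label_counts[label] += 1
        for name, value in features.items():
            value_counts[(label, name)][value] += 1
    return label_counts, value_counts

def classify(model, features, alpha=1.0, n_values=26):
    """Pick the label maximizing log P(label) + sum of log P(value | label)."""
    label_counts, value_counts = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)
        for name, value in features.items():
            seen = value_counts[(label, name)][value]
            # Additive (Laplace) smoothing so unseen values do not give log(0).
            score += math.log((seen + alpha) / (count + alpha * n_values))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training data in the same shape as our feature set.
toy = [
    ({"first letter": "g", "last letter": "d"}, "positive"),  # "good"
    ({"first letter": "g", "last letter": "t"}, "positive"),  # "great"
    ({"first letter": "b", "last letter": "d"}, "negative"),  # "bad"
    ({"first letter": "b", "last letter": "g"}, "negative"),  # "boring"
]
model = train_naive_bayes(toy)
print(classify(model, {"first letter": "g", "last letter": "e"}))
```

The "naive" part is the assumption that the features are independent given the label, which is why the per-feature terms are simply added in log space.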

#### Figure 3: Training


<center>
    <img src="Tr2.png"  width="500" title="Training">
</center>

In [9]:
import csv
import nltk # this module has NaiveBayes classifier model

def feature_extractor(word):
    """ This is my feature extractor. Given a word, it returns a dictionary of the first and last letter
    of the word given to it"""
    first_l = word[0]
    last_l = word[-1]
    dict_feature = {"first letter" : first_l,"last letter" : last_l}
    return dict_feature

def gen_featureset():
    """ This function creates a feature set for the word_sentiment.csv file"""
    SENTIMENT_CSV = r"word_sentiment.csv"
    with open(SENTIMENT_CSV,'rt', encoding='utf-8') as csvobj:
        ws_data = csv.reader(csvobj)
        featureset = list()
        for row in ws_data:
            w_feature = list()
            feature = feature_extractor(row[0])
            w_feature.append(feature)
            w_feature.append(row[1])
            featureset.append(w_feature)
        return featureset
            

def ML_train():
    """ This will train the classifier using the word sentiment feature set"""
    featureset = gen_featureset()
    train_set = featureset[:2000]
    test_set = featureset[2000:]  
    classifier = nltk.NaiveBayesClassifier.train(train_set) #Train the NaiveBayes model using the training data set
    return classifier

nb_classifier = ML_train()
print(nb_classifier)

<nltk.classify.naivebayes.NaiveBayesClassifier object at 0x000002673C42B640>


### Step 3: Use the trained classifier to predict the sentiment of a given word

Once the ML model is trained we can ask the user to input the word and predict the sentiment of the word.

#### Figure 4: Prediction


<center>
    <img src="Pr.png"  width="500" title="Prediction">
</center>

In [11]:
import csv
import nltk

def feature_extractor(word):
    """ This is my feature extractor. Given a word, it returns a dictionary of the first and last letter
    of the word given to it"""
    first_l = word[0]
    last_l = word[-1]
    dict_feature = {"first letter" : first_l,"last letter" : last_l}
    return dict_feature

def gen_featureset():
    """ This function creates a feature set for the word_sentiment.csv file"""
    SENTIMENT_CSV = r"word_sentiment.csv"
    with open(SENTIMENT_CSV,'rt', encoding='utf-8') as csvobj:
        ws_data = csv.reader(csvobj)
        featureset = list()
        for row in ws_data:
            w_feature = list()
            feature = feature_extractor(row[0])
            w_feature.append(feature)
            w_feature.append(row[1])
            featureset.append(w_feature)
        return featureset
            

def ML_train():
    """ This will train the classifier using the word sentiment feature set"""
    featureset = gen_featureset()
    train_set = featureset[:2000]
    test_set = featureset[2000:]
    classifier = nltk.NaiveBayesClassifier.train(train_set) #Train the NaiveBayes model using the training data set
    return classifier

nb_class = ML_train()

word = input('Enter a word: ').lower()
f_wrd = feature_extractor(word)
sentiment = nb_class.classify(f_wrd)
print("Sentiment of the word : ", sentiment)

Enter a word: messi
Sentiment of the word :  -2


### Step 4: Evaluating the model

Find how good the model is in identifying the labels. Ensure that the test set is distinct from the training corpus. If we simply re-used the training set as the test set, then a model that simply memorized its input, without learning how to generalize to new examples, would receive misleadingly high scores. The function nltk.classify.accuracy() will calculate the accuracy of a classifier model on a given test set.
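Accuracy is simply the fraction of test examples that the model labels correctly. The sketch below computes it by hand; `toy_classify` is a hypothetical rule standing in for a trained classifier, and the held-out examples are made up.

```python
def accuracy(classify, labeled_examples):
    """Fraction of examples whose predicted label equals the true label."""
    correct = sum(1 for features, label in labeled_examples
                  if classify(features) == label)
    return correct / len(labeled_examples)

# Hypothetical rule-based stand-in for a trained classifier.
def toy_classify(features):
    return "positive" if features["first letter"] == "g" else "negative"

held_out = [
    ({"first letter": "g"}, "positive"),   # predicted positive: correct
    ({"first letter": "g"}, "negative"),   # predicted positive: wrong
    ({"first letter": "b"}, "negative"),   # predicted negative: correct
    ({"first letter": "h"}, "positive"),   # predicted negative: wrong
]
print(accuracy(toy_classify, held_out))
```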

#### Figure 5: Evaluation


<center>
    <img src="Ev.png"  width="500" title="Evaluation">
</center>

In [16]:
# These lines of code can be used to evaluate the model.
# Add them to ML_train() in the previously written code, after the classifier has been trained

acc = nltk.classify.accuracy(classifier,test_set) #Find accuracy of the model using the test set
print('The accuracy of the model is ', acc)

classifier.show_most_informative_features() #show the most informative features


Enter a word : poo
The accuracy of the model is  0.29350104821802936
Most Informative Features
             last letter = 'o'                 4 : 2      =     26.6 : 1.0
            first letter = 'e'                 3 : -3     =     16.4 : 1.0
             last letter = 'k'                -5 : -3     =     16.1 : 1.0
            first letter = 'r'                 4 : 3      =     13.0 : 1.0
             last letter = 'a'                 3 : -2     =     11.2 : 1.0
            first letter = 'j'                -4 : -1     =     10.3 : 1.0
             last letter = 'c'                 4 : -1     =      8.9 : 1.0
            first letter = 'f'                -4 : -1     =      8.9 : 1.0
            first letter = 'h'                 5 : 1      =      8.2 : 1.0
             last letter = 'h'                 5 : -1     =      8.0 : 1.0
Features extracted from the word :  {'first letter': 'p', 'last letter': 'o'}
Sentiment of the word :  -1


#### Note the low accuracy score. Can you implement some improvements to increase it?

In [None]:
import csv
import nltk
import random

def feature_extractor(word):
    """ This is my feature extractor. Given a word, it returns a dictionary of the first and last letter
    of the word given to it"""
    first_l = word[0]
    last_l = word[-1]
    dict_feature = {"first letter" : first_l,"last letter" : last_l}
    return dict_feature

def gen_featureset():
    """ This function creates a feature set for the word_sentiment.csv file"""
    SENTIMENT_CSV = r"word_sentiment.csv"
    with open(SENTIMENT_CSV,'rt', encoding='utf-8') as csvobj:
        ws_data = csv.reader(csvobj)
        featureset = list()
        for row in ws_data:
            w_feature = list()
            feature = feature_extractor(row[0])
            w_feature.append(feature)
            if int(row[1]) > 0:
                w_feature.append('positive')
            else:
                w_feature.append('negative')
            featureset.append(w_feature)
        random.shuffle(featureset)
        return featureset
            

def ML_train():
    """ This will train the classifier using the word sentiment feature set"""
    featureset = gen_featureset()
    train_set = featureset[:2000]
    test_set = featureset[2000:]
    classifier = nltk.NaiveBayesClassifier.train(train_set) #Train the NaiveBayes model using the training data set
    acc = nltk.classify.accuracy(classifier,test_set) #Find accuracy of the model using the test set
    print('Accuracy of the model: ',acc)
    classifier.show_most_informative_features() #show the most informative features
    return classifier

nb_class = ML_train()

word = input('Enter a word: ').lower()
f_wrd = feature_extractor(word)
sentiment = nb_class.classify(f_wrd)
print("Sentiment of the word : ", sentiment)





##### NOTE: Improvements made to increase accuracy
- Shuffle the corpus, so that the alphabetic ordering of the words no longer affects the train/test split
- Reduce the number of outcome classes by grouping them (i.e. replace the sentiment scores from -5 to 5 with the two labels 'positive' and 'negative')
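Both improvements can be seen on a tiny example. The rows below are invented and sorted alphabetically, as a dictionary-style CSV would be; `to_binary_label` is a hypothetical helper that mirrors the `int(row[1]) > 0` test in the code above, so a score of 0 counts as negative.

```python
import random

def to_binary_label(score):
    """Collapse an integer score in [-5, 5] into two classes (0 counts as negative)."""
    return "positive" if int(score) > 0 else "negative"

# Hypothetical rows sorted alphabetically, as in a dictionary-style CSV.
rows = [("abandon", "-2"), ("awesome", "4"), ("bad", "-3"),
        ("brilliant", "4"), ("yummy", "3"), ("zealous", "2")]

# Without shuffling, the last rows (the "test" slice) start only with late letters,
# so the "first letter" values seen in testing never appear in training.
print({word[0] for word, _ in rows[4:]})

random.seed(0)            # fixed seed so the shuffle is reproducible
shuffled = rows[:]
random.shuffle(shuffled)
labeled = [(word, to_binary_label(score)) for word, score in shuffled]
print(labeled)
```

With eleven possible scores the classifier has to separate many sparsely populated classes; with two labels each class has far more training examples.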

### Exercise 1

*Improve the feature extractor (by adding new features) so that the test accuracy goes up a bit. Can you reach 70% accuracy?*

In [4]:
import csv
import nltk
import random

def feature_extractor(word):
    """ This is my feature extractor. Given a word, it returns a dictionary of the first and last letter
    of the word given to it"""
    first_l = word[0]
    last_l = word[-1]
    dict_feature = {"first letter" : first_l,"last letter" : last_l}
    return dict_feature

def gen_featureset():
    """ This function creates a feature set for the word_sentiment.csv file"""
    SENTIMENT_CSV = r"word_sentiment.csv"
    with open(SENTIMENT_CSV,'rt', encoding='utf-8') as csvobj:
        ws_data = csv.reader(csvobj)
        featureset = list()
        for row in ws_data:
            w_feature = list()
            feature = feature_extractor(row[0])
            w_feature.append(feature)
            if int(row[1]) > 0:
                w_feature.append('positive')
            else:
                w_feature.append('negative')
            featureset.append(w_feature)
        random.shuffle(featureset)
        return featureset
            

def ML_train():
    """ This will train the classifier using the word sentiment feature set"""
    featureset = gen_featureset()
    train_set = featureset[:2000]
    test_set = featureset[2000:]
    classifier = nltk.NaiveBayesClassifier.train(train_set) #Train the NaiveBayes model using the training data set
    acc = nltk.classify.accuracy(classifier,test_set) #Find accuracy of the model using the test set
    print('Accuracy of the model: ',acc)
    classifier.show_most_informative_features() #show the most informative features
    return classifier

nb_class = ML_train()

word = input('Enter a word: ').lower()
f_wrd = feature_extractor(word)
sentiment = nb_class.classify(f_wrd)
print("Sentiment of the word : ", sentiment)


Accuracy of the model:  0.6331236897274634
Most Informative Features
            first letter = 'y'            positi : negati =      6.7 : 1.0
            first letter = 'd'            negati : positi =      5.8 : 1.0
             last letter = 'o'            positi : negati =      4.0 : 1.0
            first letter = 'j'            positi : negati =      3.8 : 1.0
            first letter = 'n'            negati : positi =      3.7 : 1.0
             last letter = 'w'            positi : negati =      3.4 : 1.0
             last letter = 'b'            negati : positi =      3.1 : 1.0
            first letter = 'u'            negati : positi =      2.7 : 1.0
            first letter = 'g'            positi : negati =      2.6 : 1.0
             last letter = 'k'            negati : positi =      2.5 : 1.0
Enter a word: exquisite
Sentiment of the word :  positive
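One possible direction for this exercise is to give the classifier more of the word to look at, for example short prefixes and suffixes. The sketch below is only a suggestion: the function name `richer_feature_extractor` and the choice of features are our own, and whether they raise the accuracy on word_sentiment.csv has to be checked by running the training code.

```python
def richer_feature_extractor(word):
    """Return first/last letter plus prefix, suffix and length features."""
    word = word.lower()
    # Slices (word[:1], word[-2:], ...) are safe even for empty or very short words.
    return {
        "first letter": word[:1],
        "last letter": word[-1:],
        "prefix2": word[:2],
        "suffix2": word[-2:],
        "suffix3": word[-3:],
        "length": len(word),
    }

print(richer_feature_extractor("exquisite"))
```

Adding too many features can also hurt: with only a few thousand training words, rare feature values may lead the model to overfit.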


### Exercise 2

Build a classifier model that can predict the sentiment of a sentence. Please use the TweetSentiment.csv file provided for training and testing. This file has a list of 5000 tweets labeled by their sentiment (positive or negative).

In [7]:
from nltk.data import LazyLoader
sent_tokenizer=LazyLoader("tokenizers/punkt/english.pickle")

import csv
import nltk
import random

def feature_extractor(sentence):
    """ Given a sentence, return a dictionary of its first, second and last word"""
    l_word = sentence.split()
    # Default to empty strings so that an empty sentence does not raise an IndexError
    first_w = second_w = last_w = ""
    if l_word:
        first_w = l_word[0]
        last_w = l_word[-1]
        # A one-word sentence reuses that word as its "second word"
        second_w = l_word[1] if len(l_word) >= 2 else l_word[0]
    dict_feature = {"first word": first_w, "second word": second_w, "last word": last_w}
    return dict_feature

def gen_featureset():
    SENTIMENT_CSV = r"TweetSentiment.csv"
    with open(SENTIMENT_CSV, 'rt', encoding='utf-8') as csvobj:
        ws_data = csv.reader(csvobj)
        featureset = []
        for row in ws_data:
            if row and len(row) >= 2:
                w_feature = list()
                feature = feature_extractor(row[1])
                if feature:  # Check if the feature is not None
                    w_feature.append(feature)
                    w_feature.append(row[0])
                    featureset.append(w_feature)
        random.shuffle(featureset)
        return featureset

def ML_train():
    """ This will train the classifier using the tweet sentiment feature set"""
    featureset = gen_featureset()
    train_set = featureset[:4000]
    test_set = featureset[4000:]
    classifier = nltk.NaiveBayesClassifier.train(train_set) #Train the NaiveBayes model using the training data set
    acc = nltk.classify.accuracy(classifier,test_set) #Find accuracy of the model using the test set
    print('Accuracy of the model: ',acc)
    classifier.show_most_informative_features() #show the most informative features
    return classifier

nb_class = ML_train()

# Do not lowercase the input: the training features come from the original, case-sensitive tweets
sentence = input('Enter a sentence: ')
f_wrd = feature_extractor(sentence)
sentiment = nb_class.classify(f_wrd)
print("Sentiment of the sentence : ", sentiment)



Accuracy of the model:  0.604
Most Informative Features
             second word = 'think'        negati : positi =      9.0 : 1.0
             second word = 'hate'         negati : positi =      7.7 : 1.0
             second word = 'Thanks'       positi : negati =      7.0 : 1.0
              first word = 'Its'          negati : positi =      6.3 : 1.0
               last word = 'work'         negati : positi =      6.3 : 1.0
             second word = 'miss'         negati : positi =      6.3 : 1.0
             second word = 'want'         negati : positi =      6.2 : 1.0
               last word = 'sleep'        negati : positi =      5.8 : 1.0
             second word = 'good'         positi : negati =      5.7 : 1.0
             second word = 'do'           negati : positi =      5.7 : 1.0
Enter a sentence: I love this girl
Sentiment of the sentence :  positive
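Using only the first, second and last word ignores most of a tweet. A common alternative is a "bag of words", where every word in the sentence becomes a feature. The sketch below is one way to write such an extractor; the function name and the simple regular expression are assumptions, and the result still has to be tested on TweetSentiment.csv.

```python
import re

def bag_of_words_features(sentence):
    """Map every lowercase word in the sentence to a boolean 'contains' feature."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return {f"contains({word})": True for word in words}

print(bag_of_words_features("I love this girl"))
```

Because the words are lowercased inside the extractor, training tweets and user input are treated the same way.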


#### Note: Development testing and error analysis

Using a separate dev-test set, we can generate a list of the errors that the classifier makes when predicting the sentiment. We can then examine individual error cases where the model predicted the wrong label, and try to determine what additional pieces of information would allow it to make the right decision (or which existing pieces of information are tricking it into making the wrong decision). The feature set can then be adjusted accordingly.
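Collecting the errors can be sketched as below. The helper `collect_errors` takes any classify function and feature extractor; with NLTK you would pass `classifier.classify` and your own `feature_extractor`. The toy rule, toy features and dev-test words here are invented for illustration.

```python
def collect_errors(classify, extract_features, dev_examples):
    """Return (true label, predicted label, text) for every misclassified example."""
    errors = []
    for text, label in dev_examples:
        guess = classify(extract_features(text))
        if guess != label:
            errors.append((label, guess, text))
    return sorted(errors)

# Hypothetical stand-ins for a feature extractor and a trained classifier.
def toy_features(word):
    return {"last letter": word[-1]}

def toy_classify(features):
    return "positive" if features["last letter"] in "oy" else "negative"

dev_set = [("happy", "positive"), ("ugly", "negative"),
           ("hero", "positive"), ("kind", "positive")]

for label, guess, text in collect_errors(toy_classify, toy_features, dev_set):
    print(f"correct={label:<9} guess={guess:<9} word={text}")
```

Reading through such a list often suggests new features, for example a suffix feature if many errors share the same word ending.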