# Week 4: Naive Bayes Classification (Part 2)

### Overview
This notebook builds on the activities in the previous notebooks on sentiment analysis.  In this lab, we will be putting everything together.  You will be focussing on the `movie_reviews` corpus with a view to investigating:

- Evaluation metrics for classifier performance
- Which is the best classifier from a set of possibilities on a given test set
- What is the impact of varying training data size? To what extent does increasing the quantity of training data improve classifier performance?



### Preliminaries 

In [None]:
import nltk
nltk.download('movie_reviews')
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import movie_reviews

>To access functionality defined in previous notebooks, copy the classes and functions defined in Week3Labs and Week4Labs into a `utils.py` file and then import it into the notebook.  There is a `utils.py` file included with these resources which you can update.  You should make sure that your classifier and `ConfusionMatrix` classes are defined in this file.

In [None]:
#only needed on Google Colab; comment out the next two lines when running locally
from google.colab import drive
drive.mount('/content/drive')

#you don't need this on jupyter notebook unless you want to import from a directory which is not the current working directory
import sys
sys.path.append('/content/drive/My Drive/NLE Notebooks 2021/Week4LabsSolutions/')

In [None]:
#import code to set up training and testing data, wordlist classifiers and NB classifiers

from utils import *

### Evaluating a Naïve Bayes classifier on test data
We are now ready to run our Naïve Bayes classifier on a set of test data. When we do this we want to return the accuracy of the classifier on that data, where accuracy is calculated as follows:

$$\frac{\mbox{number of test documents that the classifier classifies correctly}}
{\mbox{total number of test documents}}$$

In order to compute this accuracy score, we need to give the classifier **labelled** test data.
- This will be in the same format as the training data.
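
As a minimal sketch of this calculation, assume the classifier's predictions are available as a list of labels in the same order as the labelled test data. The function name `accuracy` and the toy labels below are illustrative only and are not part of `utils.py`.

```python
def accuracy(predictions, gold_labels):
    """Fraction of test documents whose predicted label matches the gold label."""
    if len(predictions) != len(gold_labels):
        raise ValueError("predictions and gold_labels must have the same length")
    if not gold_labels:
        raise ValueError("cannot compute accuracy on an empty test set")
    # Count the positions where the prediction agrees with the gold label
    correct = sum(1 for p, g in zip(predictions, gold_labels) if p == g)
    return correct / len(gold_labels)


# Made-up values for illustration: gold labels for 4 test documents and some predictions
gold = ["weather", "weather", "football", "football"]
predicted = ["weather", "football", "football", "football"]
print(accuracy(predicted, gold))  # 3 of 4 correct -> 0.75
```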

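>In the cell below, we set up a small training set (3 documents in the class `weather` and 2 in the class `football`) and a test set with 5 documents in the class `weather` and 5 documents in the class `football`.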

>Run this cell.

In [None]:
weather_sents_train = [
    "today it is raining",
    "looking cloudy today",
    "it is nice weather",
]

football_sents_train = [
    "city looking good",
    "advantage united",
]

weather_data_train = [(FreqDist(sent.split()), "weather") for sent in weather_sents_train] 
football_data_train = [(FreqDist(sent.split()), "football") for sent in football_sents_train]
train_data = weather_data_train + football_data_train

weather_sents_test = [
    "the weather today is nice",
    "it is raining cats and dogs",
    "the weather here is wet",
    "it was hot today",
    "rain due tomorrow",
]

football_sents_test = [
    "what a great goal that was",
    "poor defending by the city center back",
    "wow he missed a sitter",
    "united are a shambles",
    "shots raining down on the keeper",
]

weather_data_test = [(FreqDist(sent.split()), "weather") for sent in weather_sents_test] 
football_data_test = [(FreqDist(sent.split()), "football") for sent in football_sents_test]
test_data = weather_data_test + football_data_test



In [None]:
train_data

In [None]:
test_data

### Exercise 1
Train the NB classifier that you developed earlier on `train_data` and then test it on `test_data`.
Compute accuracy, precision, recall and F1 score.
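
For reference, the standard definitions of precision, recall and F1 for a chosen target class can be sketched as below. This sketch works directly from parallel lists of predicted and gold labels rather than through the `ConfusionMatrix` class in `utils.py`; the function name `precision_recall_f1` and the example labels are assumptions made for illustration.

```python
def precision_recall_f1(predictions, gold_labels, target):
    """Precision, recall and F1 for one target class, from parallel label lists."""
    pairs = list(zip(predictions, gold_labels))
    tp = sum(1 for p, g in pairs if p == target and g == target)  # true positives
    fp = sum(1 for p, g in pairs if p == target and g != target)  # false positives
    fn = sum(1 for p, g in pairs if p != target and g == target)  # false negatives

    # Guard against division by zero when the class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Made-up labels for illustration only
gold = ["weather", "weather", "weather", "football", "football"]
predicted = ["weather", "weather", "football", "weather", "football"]
p, r, f1 = precision_recall_f1(predicted, gold, target="weather")
print(p, r, f1)
```

Note that the scores depend on which class is treated as the target, so it is worth reporting them for both `weather` and `football`.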

### Exercise 2
Now, we want to run your NB classifier on a real problem - the classification of movie reviews as positive or negative.
* generate a training and test split of the data for movie_reviews (see Lab 3_1 / 3_2)
* train a nb_classifier on the training data
* test it on the test data

Compare the performance of your Naive Bayes classifier with the WordList classifiers that you developed last week.
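
The split function from Lab 3 is not reproduced here. As a reminder of the general idea only, a shuffled split can be sketched as follows; `split_data`, the 0.8 ratio and the seed are hypothetical choices for this sketch, and you should prefer the version you wrote in the earlier lab.

```python
import random


def split_data(items, train_ratio=0.8, seed=42):
    """Shuffle a copy of items and split it into (train, test) lists."""
    if not 0.0 < train_ratio < 1.0:
        raise ValueError("train_ratio must be strictly between 0 and 1")
    shuffled = list(items)                  # copy, so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)   # fixed seed makes the split repeatable
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


# Illustration with placeholder items standing in for documents
items = list(range(10))
train_items, test_items = split_data(items)
print(len(train_items), len(test_items))
```

For a balanced experiment, one option is to apply the split separately to the positive and negative reviews and then concatenate the results, so that both classes are represented in the same proportion in training and test data.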

### NLTK NB Classifier
Developing our own NB classifier is great for understanding how it works.  But, in practice, it is usually more convenient to use a standard one imported from a library.  NLTK provides a NB classifier (as do other libraries such as sklearn).  It can be imported and trained as follows.


In [None]:
from nltk.classify import NaiveBayesClassifier

#note that NaiveBayesClassifier.train() is a class method which returns the classifier object.
#this differs from our classifier and many others, which are first instantiated and then trained via an object method
#`training` should be your list of labelled training documents, i.e. (features, label) pairs
nltk_nb = NaiveBayesClassifier.train(training)

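This object also has a `.classify_many()` method, which takes a list of unlabelled documents (feature dictionaries) and returns a list of predicted labels: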

In [None]:
nltk_nb.classify_many(docs)
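
In the two cells above, `training` and `docs` are placeholders. As a minimal sketch of how the calls fit together, the example below uses a few of the weather/football sentences from earlier in the same bag-of-words format. The variable name `testing` and the alias `nltk_accuracy` are choices made for this sketch.

```python
from nltk.probability import FreqDist
from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy as nltk_accuracy

# Labelled training data: a list of (feature dictionary, label) pairs
training = [
    (FreqDist("today it is raining".split()), "weather"),
    (FreqDist("looking cloudy today".split()), "weather"),
    (FreqDist("city looking good".split()), "football"),
    (FreqDist("advantage united".split()), "football"),
]
nltk_nb = NaiveBayesClassifier.train(training)

# Labelled test data in the same format
testing = [
    (FreqDist("it is raining cats and dogs".split()), "weather"),
    (FreqDist("united are a shambles".split()), "football"),
]

# classify_many() takes only the feature dictionaries, not the labels
docs = [features for features, label in testing]
predictions = nltk_nb.classify_many(docs)
print(predictions)

# NLTK's accuracy helper takes the classifier and the labelled test data
print(nltk_accuracy(nltk_nb, testing))
```

With so little training data the predictions are not meaningful in themselves; the point is only the shape of the inputs and outputs.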

### Exercise 3 

Compare the performance of the NLTK NB classifier with the one that you wrote yourself.

### Extension exercises
* Investigate the impact of the amount of training data on the Naive Bayes classifiers.
* Research what kind of event model is used by the NLTK NB classifier by default and whether / how this can be changed.  Does this impact the performance?
* Find out about a NB classifier provided by a different library, e.g., sklearn.  Can you apply this to the `movie_reviews` data set?
* Find out about another machine learning method for classification (e.g., logistic regression).  Can you apply this to the `movie_reviews` data set?
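
For the first extension exercise, one possible way to organise the experiment is sketched below. It is a sketch under stated assumptions: `learning_curve` is a hypothetical helper, and `majority_baseline_score` is a stand-in scorer used only so that the example runs on its own. In your own experiment you would pass a function that trains a Naive Bayes classifier on the subset and returns its accuracy on the test data.

```python
import random
from collections import Counter


def learning_curve(train_data, test_data, train_and_score, sizes, seed=0):
    """Return a list of (training size, score) pairs for growing training subsets."""
    shuffled = list(train_data)
    random.Random(seed).shuffle(shuffled)
    results = []
    for size in sizes:
        subset = shuffled[:size]  # nested subsets: each larger sample contains the smaller ones
        results.append((size, train_and_score(subset, test_data)))
    return results


def majority_baseline_score(train_subset, test_data):
    """Stand-in scorer: predict the most frequent training label for every test item."""
    majority = Counter(label for _, label in train_subset).most_common(1)[0][0]
    correct = sum(1 for _, label in test_data if label == majority)
    return correct / len(test_data)


# Placeholder data for illustration: empty feature dictionaries with made-up labels
toy_train = [({}, "pos")] * 6 + [({}, "neg")] * 4
toy_test = [({}, "pos")] * 3 + [({}, "neg")] * 1

curve = learning_curve(toy_train, toy_test, majority_baseline_score, sizes=[2, 5, 10])
for size, score in curve:
    print(size, score)
```

Because a single random sample can be unrepresentative, it is sensible to repeat each training size with several seeds and average the scores before drawing conclusions.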