# Homework 1: Preprocessing and Text Classification

Student Name: Hongyu Chen

Student ID: 1062897

# General Info

<b>Due date</b>: Sunday, 5 Apr 2020 5pm

<b>Submission method</b>: Canvas submission

<b>Submission materials</b>: completed copy of this iPython notebook

<b>Late submissions</b>: -20% per day (both week and weekend days counted)

<b>Marks</b>: 10% of mark for class (with 9% on correctness + 1% on quality and efficiency of your code)

<b>Materials</b>: See [Using Jupyter Notebook and Python page](https://canvas.lms.unimelb.edu.au/courses/17601/pages/using-jupyter-notebook-and-python?module_item_id=1678430) on Canvas (under Modules>Resources) for information on the basic setup required for this class, including an iPython notebook viewer and the python packages NLTK, Numpy, Scipy, Matplotlib, Scikit-Learn, and Gensim. In particular, if you are not using a lab computer which already has it installed, we recommend installing all the data for NLTK, since you will need various parts of it to complete this assignment. You can also use any Python built-in packages, but do not use any other 3rd party packages (the packages listed above are all fine to use); if your iPython notebook doesn't run on the marker's machine, you will lose marks. <b> You should use Python 3</b>.  

To familiarize yourself with NLTK, here is a free online book:  Steven Bird, Ewan Klein, and Edward Loper (2009). <a href=http://nltk.org/book>Natural Language Processing with Python</a>. O'Reilly Media Inc. You may also consult the <a href=https://www.nltk.org/api/nltk.html>NLTK API</a>.

<b>Evaluation</b>: Your iPython notebook should run end-to-end without any errors in a reasonable amount of time, and you must follow all instructions provided below, including specific implementation requirements and instructions for what needs to be printed (please avoid printing output we don't ask for). You should edit the sections below where requested, but leave the rest of the code as is. You should leave the output from running your code in the iPython notebook you submit, to assist with marking. The amount each section is worth is given in parenthesis after the instructions. 

You will be marked not only on the correctness of your methods, but also the quality and efficency of your code: in particular, you should be careful to use Python built-in functions and operators when appropriate and pick descriptive variable names that adhere to <a href="https://www.python.org/dev/peps/pep-0008/">Python style requirements</a>. If you think it might be unclear what you are doing, you should comment your code to help the marker make sense of it.

<b>Updates</b>: Any major changes to the assignment will be announced via Canvas. Minor changes and clarifications will be announced on the discussion board; we recommend you check it regularly.

<b>Academic misconduct</b>: For most people, collaboration will form a natural part of the undertaking of this homework, and we encourge you to discuss it in general terms with other students. However, this ultimately is still an individual task, and so reuse of code or other instances of clear influence will be considered cheating. We will be checking submissions for originality and will invoke the University’s <a href="http://academichonesty.unimelb.edu.au/policy.html">Academic Misconduct policy</a> where inappropriate levels of collusion or plagiarism are deemed to have taken place.

# Overview

In this homework, you'll be working with a collection tweets. The task is to classify whether a tweet constitutes a rumour event. This homework involves writing code to preprocess data and perform text classification.

# 1. Preprocessing (5 marks)

**Instructions**: Run the code below to download the tweet corpus for the assignment. Note: the download may take some time. **No implementation is needed.**

In [1]:
import requests
import os
from pathlib import Path

fname = 'rumour-data.tgz'
data_dir = os.path.splitext(fname)[0] #'rumour-data'

my_file = Path(fname)
if not my_file.is_file():
    url = "https://github.com/jhlau/jhlau.github.io/blob/master/files/rumour-data.tgz?raw=true"
    r = requests.get(url)

    #Save to the current directory
    with open(fname, 'wb') as f:
        f.write(r.content)
        
print("Done. File downloaded:", my_file)


Done. File downloaded: rumour-data.tgz


**Instructions**: Run the code to extract the zip file. Note: the extraction may take a minute or two. **No implementation is needed.**

In [2]:
import tarfile

#decompress rumour-data.tgz
tar = tarfile.open(fname, "r:gz")
tar.extractall()
tar.close()

#remove superfluous files (e.g. .DS_store)
extra_files = []
for r, d, f in os.walk(data_dir):
    for file in f:
        if (file.startswith(".")):
            extra_files.append(os.path.join(r, file))
for f in extra_files:
    os.remove(f)

print("Extraction done.")

Extraction done.


### Question 1 (1.0 mark)

**Instructions**: The corpus data is in the *rumour-data* folder. It contains 2 sub-folders: *non-rumours* and *rumours*. As the names suggest, *rumours* contains all rumour-propagating tweets, while *non-rumours* has normal tweets. Within  *rumours* and *non-rumours*, you'll find some sub-folders, each named with an ID. Each of these IDs constitutes an 'event', where an event is defined as consisting a **source tweet** and its **reactions**.

An illustration of the folder structure is given below:

    rumour-data
        - rumours
            - 498254340310966273
                - reactions
                    - 498254340310966273.json
                    - 498260814487642112.json
                - source-tweet
                    - 498254340310966273.json
        - non-rumours

Now we need to gather the tweet messages for rumours and non-rumour events. As the individual tweets are stored in json format, we need to use a json parser to parse and collect the actual tweet message. The function `get_tweet_text_from_json(file_path)` is provided to do that.

**Task**: Complete the `get_events(event_dir)` function. The function should return **a list of events** for a particular class of tweets (e.g. rumours), and each event should contain the source tweet message and all reaction tweet messages.

**Check**: Use the assertion statements in *"For your testing"* below for the expected output.

In [14]:
import json

def get_tweet_text_from_json(file_path):
    with open(file_path) as json_file:
        data = json.load(json_file)
        return data["text"]
    
def get_events(event_dir):
    event_list = []
    for event in sorted(os.listdir(event_dir)):
        ###
        # Your answer BEGINS HERE
        ###
        subevent = []
        # first will be 'source-tweet', then 'reactions'
        current_path = os.path.join(event_dir, event)
        for sub in sorted(os.listdir(current_path))[::-1]:
            sub_current_path = os.path.join(current_path, sub)
            if sub == "source-tweet":
                for jsfile in os.listdir(sub_current_path):
                    subevent.append(get_tweet_text_from_json(os.path.join(sub_current_path, jsfile)))
            elif sub == "reactions":
                for jsfile in os.listdir(sub_current_path):
                    subevent.append(get_tweet_text_from_json(os.path.join(sub_current_path, jsfile)))
            else:
                print("No subfile named 'source-tweet' or 'reactions'")
                return
            
        event_list.append(subevent)             
               
        ###
        # Your answer ENDS HERE
        ###
        
    return event_list
    
#a list of events, and each event is a list of tweets (source tweet + reactions)    
rumour_events = get_events(os.path.join(data_dir, "rumours"))
nonrumour_events = get_events(os.path.join(data_dir, "non-rumours"))

print("Number of rumour events =", len(rumour_events))
print("Number of non-rumour events =", len(nonrumour_events))

Number of rumour events = 500
Number of non-rumour events = 1000


**For your testing:**

In [15]:
assert(len(rumour_events) == 500)
assert(len(nonrumour_events) == 1000)

### Question 2 (1.0 mark)

**Instructions**: Next we need to preprocess the collected tweets to create a bag-of-words representation. The preprocessing steps required here are: (1) tokenize each tweet into individual word tokens (using NLTK `TweetTokenizer`); and (2) remove stopwords (based on NLTK `stopwords`).

**Task**: Complete the `preprocess_events(event)` function. The function takes **a list of events** as input, and returns **a list of preprocessed events**. Each preprocessed event should have a dictionary of words and frequencies.

**Check**: Use the assertion statements in *"For your testing"* below for the expected output.

In [57]:
import nltk
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords
from collections import defaultdict

nltk.download('stopwords')
tt = TweetTokenizer()
stopwords = set(stopwords.words('english'))


def preprocess_events(events):
    ###
    # Your answer BEGINS HERE
    ###
    preprocessed_event = []
    for source_reatcion in events:
        single_event = []
        ## tokenization
        for ele in source_reatcion:
            single_event.extend(tt.tokenize(ele))
        ## remove stopwords
        filter_single_event = []
        for i in single_event:
            if i not in stopwords:
                filter_single_event.append(i)
        ## get the dict
        dic = {}
        for i in filter_single_event:
            if dic.get(i, None) == None:
                dic[i] = 1
            else:
                dic[i] += 1
        
        preprocessed_event.append(dic)
    
    return preprocessed_event
            
    ###
    # Your answer ENDS HERE
    ###

preprocessed_rumour_events = preprocess_events(rumour_events)
preprocessed_nonrumour_events = preprocess_events(nonrumour_events)

print("Number of preprocessed rumour events =", len(preprocessed_rumour_events))
print("Number of preprocessed non-rumour events =", len(preprocessed_nonrumour_events))


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Number of preprocessed rumour events = 500
Number of preprocessed non-rumour events = 1000


**For your testing**:

In [26]:
assert(len(preprocessed_rumour_events) == 500)
assert(len(preprocessed_nonrumour_events) == 1000)

**Instructions**: Hashtags (i.e. topic tags which start with #) pose an interesting tokenisation problem because they often include multiple words written without spaces or capitalization. Run the code below to collect all unique hashtags in the preprocessed data. **No implementation is needed.**



In [27]:
def get_all_hashtags(events):
    hashtags = set([])
    for event in events:
        for word, frequency in event.items():
            if word.startswith("#"):
                hashtags.add(word)
    return hashtags

hashtags = get_all_hashtags(preprocessed_rumour_events + preprocessed_nonrumour_events)
print("Number of hashtags =", len(hashtags))

Number of hashtags = 1829


### Question 3 (2.0 mark)

**Instructions**: Our task here to tokenize the hashtags, by implementing a reversed version of the MaxMatch algorithm discussed in class, where matching begins at the end of the hashtag and progresses backwards. NLTK has a list of words that you can use for matching, see starter code below. Be careful about efficiency with respect to doing word lookups. One extra challenge you have to deal with is that the provided list of words includes only lemmas: your MaxMatch algorithm should match inflected forms by converting them into lemmas using the NLTK lemmatizer before matching. When lemmatising a word, you also need to provide the part-of-speech tag of the word. You should use `nltk.tag.pos_tag` for doing part-of-speech tagging.

Note that the list of words is incomplete, and, if you are unable to make any longer match, your code should default to matching a single letter. Create a new list of tokenized hashtags (this should be a list of lists of strings) and use slicing to print out the last 20 hashtags in the list.

For example, given "#speakup", the algorithm should produce: \["#", "speak", "up"\]. And note that you do not need to delete the hashtag symbol ("#") from the tokenised outputs.

**Task**: Complete the `tokenize_hashtags(hashtags)` function by implementing a reversed MaxMatch algorithm. The function takes as input **a set of hashtags**, and returns **a dictionary** where key="hashtag" and value="a list of word tokens".

**Check**: Use the assertion statements in *"For your testing"* below for the expected output.

In [58]:
from nltk.corpus import wordnet
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('words')
lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()
words = set(nltk.corpus.words.words()) #a list of words provided by NLTK


def get_wordnet_POS(word):
    ## get the first POS of the word
    ## e.g. nltk.pos_tag("feet") -> [('f', 'JJ'), ('e', 'NN'), ('e', 'NN'), ('t', 'NN')]
    ##      tag -> "J"
    tag = nltk.pos_tag(word)[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    
    return tag_dict.get(tag, wordnet.NOUN)


def lemmatize_word(word):
    return lemmatizer.lemmatize(word, get_wordnet_POS(word))


def maxmatch(word, wordlist):
    if len(word) == 0: return wordlist
    if len(word) == 1:
        return wordlist + [word]
    for i in range(len(word)-1,-1,-1):
        if word[:i+1].lower() in words:
            return maxmatch(word[i+1:],wordlist+[word[:i+1]])
            
    # when can not match anything, take the first char as a single char
    return maxmatch(word[1:], wordlist+[word[0]])
        
def tokenize_hashtags(hashtags):
    ###
    # Your answer BEGINS HERE
    ###
    ht_dic = {}
    for it in hashtags:
        # not startwith '#'
        if it[0] != '#': continue
        # just '#'
        if len(it) == 1: ht_dic[it] = ["#"]
        else:
            clear_word = lemmatize_word(it[1:])
            if clear_word not in ht_dic:
                ht_dic[it] = ["#"] + maxmatch(clear_word, [])
    return ht_dic
    ###
    # Your answer ENDS HERE
    ###


tokenized_hashtags = tokenize_hashtags(hashtags)

print(list(tokenized_hashtags.items())[:20])

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!


[('#mightaswellbeCubaorChina', ['#', 'might', 'aswell', 'be', 'Cub', 'a', 'orC', 'hin', 'a']), ('#JustinBieber', ['#', 'Just', 'in', 'B', 'ie', 'ber']), ('#overkill', ['#', 'over', 'kill']), ('#OpFerguson', ['#', 'O', 'p', 'Fe', 'r', 'g', 'us', 'on']), ('#opinions', ['#', 'opinion']), ('#justsaying', ['#', 'just', 'saying']), ('#awesome', ['#', 'awesome']), ('#Saudi', ['#', 'Sa', 'u', 'di']), ('#jealous', ['#', 'jealous']), ('#EveryDay', ['#', 'EveryDay']), ('#building7', ['#', 'building', '7']), ('#DONTSHOOT', ['#', 'DONT', 'SHOOT']), ('#Insecurityinfrance', ['#', 'Insecurity', 'infra', 'n', 'ce']), ('#power', ['#', 'power']), ('#dangerousblackkids', ['#', 'dangerous', 'black', 'k', 'id', 's']), ('#TeinemanSquare', ['#', 'Te', 'in', 'em', 'an', 'Square']), ('#gross', ['#', 'gross']), ('#Murder', ['#', 'Murder']), ("#4U9525's", ['#', '4', 'U', '9', '5', '2', '5', "'", 's']), ('#BokoHaramKilled2000People', ['#', 'Bo', 'ko', 'Ha', 'ram', 'Kill', 'e', 'd', '2', '0', '0', '0', 'People'])]


**For your testing:**

In [39]:
assert(len(tokenized_hashtags) == len(hashtags))

### Question 4 (1.0 mark)

**Instructions**: Now that we have the tokenized hashtags, we need to go back and update the bag-of-words representation for each event.

**Task**: Complete the ``update_event_bow(events)`` function. The function takes **a list of preprocessed events**, and for each event, it looks for every hashtag it has and updates the bag-of-words dictionary with the tokenized hashtag tokens. Note: you do not need to delete the counts of the original hashtags when updating the bag-of-words (e.g., if a document has "#speakup":2 in its bag-of-words representation, you do not need to delete this hashtag and its counts).

In [59]:
def update_event_bow(events):
    ###
    # Your answer BEGINS HERE
    ###
    for event in events:
        ht_word = []
        # find all "#..." in this event
        for key,val in event.items():
            if key.startswith('#'):
                ht_word.append(key)
        # for each hashtag word
        for w in ht_word:
            if w not in tokenized_hashtags:
                print(w + 'not in tokenized_hashtags')
            else:
                tokens = tokenized_hashtags[w]
                # update BOW
                for t in tokens:
                    event[t] = event.get(t,0) + event[w]
                
    ###
    # Your answer ENDS HERE
    ###
            
update_event_bow(preprocessed_rumour_events)
update_event_bow(preprocessed_nonrumour_events)

print("Number of preprocessed rumour events =", len(preprocessed_rumour_events))
print("Number of preprocessed non-rumour events =", len(preprocessed_nonrumour_events))

{'well': 1, 'They': 3, 'less': 1, 'recording': 1, 'LOT': 1, 'protect': 1, '#Ferguson': 2, '-': 1, 's': 1, 'came': 1, 'g': 2, 'Protecting': 1, 'installed': 3, 'pro-gunners': 1, 'still': 2, "that's": 1, 'take': 1, 'would': 1, 'police': 1, 'ppl': 1, 'protecting': 1, 'audio': 1, 'Mike': 3, 'public': 1, 'AR': 1, 'patrol': 1, 'fire': 1, 'hardship': 1, 'guns': 3, '15s': 1, 'program': 1, 'squad': 2, 'rifles': 1, 'bothered': 1, "they're": 1, "That's": 1, '@TheAnonMessage': 8, 'Like': 1, '@BallerinaX': 1, 'slaughter': 1, '#': 8, '@DaleGunn3': 1, 'box': 1, '&': 3, '|': 2, 'chief': 1, 'riots': 1, 'skills': 1, 'interesting': 2, 'financial': 1, 'right': 1, 'Brown': 3, 'So': 3, 'laziness': 1, 'need': 1, 'disorganized': 1, '@runsammrun': 2, 'Clearly': 1, 'cams': 1, 'ready': 1, 'gun': 1, 'open': 2, '@unpreachedtruth': 14, 'afford': 7, 'never': 2, '@rockermom53': 2, 'claim': 1, '?': 5, ',': 6, 'Fe': 2, 'machine': 3, 'r': 2, 'carelessness': 1, 'video': 2, 'focus': 1, 'matter': 1, 'find': 1, 'night': 1, '

# Text Classification (4 marks)

### Question 5 (1.0 mark)

**Instructions**: Here we are interested to do text classification, to predict, given a tweet and its reactions, whether it is a rumour or not. The task here is to create training, development and test partitions from the preprocessed events and convert the bag-of-words representation into feature vectors.

**Task**: Using scikit-learn, create training, development and test partitions with a 60%/20%/20% ratio. Remember to preserve the ratio of rumour/non-rumour events for all your partitions. Next, turn the bag-of-words dictionary of each event into a feature vector, using scikit-learn `DictVectorizer`.

In [111]:
from sklearn.feature_extraction import DictVectorizer

vectorizer = DictVectorizer()

###
# Your answer BEGINS HERE
###

## get train_set, dev_set, test_set with ratio
def get_tdt(raw_data,train=0.6,dev=0.2,test=0.2):
    tra_len = int(train*len(raw_data))
    dev_len = int(dev*len(raw_data))
    train_set = raw_data[:tra_len]
    dev_set = raw_data[tra_len:tra_len+dev_len]
    test_set = raw_data[tra_len+dev_len:]
    return train_set, dev_set, test_set

rumor_tra_set, rumor_dev_set, rumor_tst_set = get_tdt(preprocessed_rumour_events, 0.6, 0.2, 0.2)
nonrumor_tra_set, nonrumor_dev_set, nonrumor_tst_set = get_tdt(preprocessed_nonrumour_events, 0.6, 0.2, 0.2)


## mix rumor set and nonrumor set as final train_set, dev_set, test_set
def mix_rumor_nonrumor(rumor, nonrumor):
    mixed_set, mixed_label = [], []
    for i in rumor:
        mixed_set.append(i)
        mixed_label.append('rumor')
    for j in nonrumor:
        mixed_set.append(j)
        mixed_label.append('nonrumor')
    
    return mixed_set, mixed_label

tra_set, tra_label = mix_rumor_nonrumor(rumor_tra_set, nonrumor_tra_set)
dev_set, dev_label = mix_rumor_nonrumor(rumor_dev_set, nonrumor_dev_set)
tst_set, tst_label = mix_rumor_nonrumor(rumor_tst_set, nonrumor_tst_set)

# Vectorize
tra_data = vectorizer.fit_transform(tra_set)
dev_data = vectorizer.fit_transform(dev_set)
tst_data = vectorizer.transform(tst_set)
###
# Your answer ENDS HERE
###

print("Vocabulary size =", len(vectorizer.vocabulary_))

Vocabulary size = 14097


### Question 6 (2.0 mark)

**Instructions**: Now, let's build some classifiers. Here, we'll be comparing Naive Bayes and Logistic Regression. For each, you need to first find a good value for their main regularisation (hyper)parameters, which you should identify using the scikit-learn docs or other resources. Use the development set you created for this tuning process; do **not** use cross-validation in the training set, or involve the test set in any way. You don't need to show all your work, but you do need to print out the accuracy with enough different settings to strongly suggest you have found an optimal or near-optimal choice. We should not need to look at your code to interpret the output.

**Task**: Implement two text classifiers: Naive Bayes and Logistic Regression. Tune the hyper-parameters of these classifiers and print the task performance for different hyper-parameter settings.

In [115]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

###
# Your answer BEGINS HERE
###

from sklearn import model_selection
#from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

models = [MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True),
        MultinomialNB(alpha=1.5, class_prior=None, fit_prior=True),
        MultinomialNB(alpha=2.0, class_prior=None, fit_prior=True),
        MultinomialNB(alpha=2.3, class_prior=None, fit_prior=True),
        MultinomialNB(alpha=2.5, class_prior=None, fit_prior=True),
        MultinomialNB(alpha=2.7, class_prior=None, fit_prior=True),
        MultinomialNB(alpha=3.0, class_prior=None, fit_prior=True),
        LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                           intercept_scaling=1, l1_ratio=None, max_iter=100,
                           multi_class='warn', n_jobs=None, penalty='l2',
                           random_state=None, solver='warn', tol=0.0001, verbose=0,
                           warm_start=False),
        LogisticRegression(C=2.0, class_weight=None, dual=False, fit_intercept=True,
                           intercept_scaling=1, l1_ratio=None, max_iter=200,
                           multi_class='warn', n_jobs=None, penalty='l2',
                           random_state=None, solver='warn', tol=0.0001, verbose=0,
                           warm_start=False),
        LogisticRegression(C=3.0, class_weight=None, dual=False, fit_intercept=True,
                           intercept_scaling=1, l1_ratio=None, max_iter=300,
                           multi_class='warn', n_jobs=None, penalty='l2',
                           random_state=None, solver='warn', tol=0.0001, verbose=0,
                           warm_start=False),
        LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                           intercept_scaling=1, l1_ratio=None, max_iter=400,
                           multi_class='warn', n_jobs=None, penalty='l2',
                           random_state=None, solver='warn', tol=0.0001, verbose=0,
                           warm_start=False)]

def do_multiple_10foldcrossvalidation(models,data,classifications):
    for model in models:
        predictions = model_selection.cross_val_predict(model, data, classifications, cv=10)
        print(model)
        print("accuracy:", accuracy_score(classifications,predictions))
        print(classification_report(classifications,predictions))
        
print("==============>>>")
print("These tune processes based on Development-set...\n")
do_multiple_10foldcrossvalidation(models, dev_data, dev_label)
###
# Your answer ENDS HERE
###

These tune processes based on Development-set...

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
accuracy: 0.87
              precision    recall  f1-score   support

    nonrumor       0.94      0.85      0.90       200
       rumor       0.76      0.90      0.82       100

    accuracy                           0.87       300
   macro avg       0.85      0.88      0.86       300
weighted avg       0.88      0.87      0.87       300

MultinomialNB(alpha=1.5, class_prior=None, fit_prior=True)
accuracy: 0.88
              precision    recall  f1-score   support

    nonrumor       0.92      0.90      0.91       200
       rumor       0.81      0.84      0.82       100

    accuracy                           0.88       300
   macro avg       0.86      0.87      0.87       300
weighted avg       0.88      0.88      0.88       300

MultinomialNB(alpha=2.0, class_prior=None, fit_prior=True)
accuracy: 0.8866666666666667
              precision    recall  f1-score   support

    n



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)
accuracy: 0.88
              precision    recall  f1-score   support

    nonrumor       0.87      0.97      0.92       200
       rumor       0.92      0.70      0.80       100

    accuracy                           0.88       300
   macro avg       0.89      0.83      0.86       300
weighted avg       0.88      0.88      0.88       300





LogisticRegression(C=2.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=200,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)
accuracy: 0.8733333333333333
              precision    recall  f1-score   support

    nonrumor       0.86      0.96      0.91       200
       rumor       0.91      0.69      0.78       100

    accuracy                           0.87       300
   macro avg       0.88      0.83      0.85       300
weighted avg       0.88      0.87      0.87       300





LogisticRegression(C=3.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=300,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)
accuracy: 0.87
              precision    recall  f1-score   support

    nonrumor       0.86      0.96      0.91       200
       rumor       0.91      0.68      0.78       100

    accuracy                           0.87       300
   macro avg       0.88      0.82      0.84       300
weighted avg       0.87      0.87      0.86       300





LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=400,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)
accuracy: 0.88
              precision    recall  f1-score   support

    nonrumor       0.87      0.97      0.92       200
       rumor       0.92      0.70      0.80       100

    accuracy                           0.88       300
   macro avg       0.89      0.83      0.86       300
weighted avg       0.88      0.88      0.88       300



### Question 7 (1.0 mark)

**Instructions**: Using the best settings you have found, compare the two classifiers based on performance in the test set. Print out both accuracy and macro-averaged F-score for each classifier. Be sure to label your output.

**Task**: Compute test performance in terms of accuracy and macro-averaged F-score for both Naive Bayes and Logistic Regression, using optimal hyper-parameter settings.

In [116]:
###
# Your answer BEGINS HERE
###
best_NB = MultinomialNB(alpha=2.5, class_prior=None, fit_prior=True)
best_LR = LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

def test_model(model, data, classifications):
    predictions = model_selection.cross_val_predict(model, data, classifications, cv=10)
    print(model)
    print("accuracy:", accuracy_score(classifications,predictions))
    print(classification_report(classifications,predictions))

    
print("==============>>>")
print("Below experiments based on Test-set...\n")
print("1> Test on MultinomialNB")
test_model(best_NB, tst_data, tst_label)
print("2> Test on LogisticRegression")
test_model(best_LR, tst_data, tst_label)
###
# Your answer ENDS HERE
###

Below experiments based on Test-set...

1> Test on MultinomialNB
MultinomialNB(alpha=2.5, class_prior=None, fit_prior=True)
accuracy: 0.7066666666666667
              precision    recall  f1-score   support

    nonrumor       0.71      0.96      0.81       200
       rumor       0.71      0.20      0.31       100

    accuracy                           0.71       300
   macro avg       0.71      0.58      0.56       300
weighted avg       0.71      0.71      0.65       300

2> Test on LogisticRegression




LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)
accuracy: 0.7366666666666667
              precision    recall  f1-score   support

    nonrumor       0.76      0.88      0.82       200
       rumor       0.65      0.45      0.53       100

    accuracy                           0.74       300
   macro avg       0.71      0.67      0.67       300
weighted avg       0.73      0.74      0.72       300

