## *Assignment 2: Naive Bayes Text classification*

Building a text classifier in python for movie review corpus in 2000 NLTK movie review corpus, you
should write a 600(+/- 50) words of short report to summarize:

1. Your proposed text classification method (you can use relevant NLTK text classification functions or other text classification functions)
2. Pre-processing methods such as removing stop words, punctuations.

3. Feature selection methods such as selection the 3000 most important words.

4. You should evaluate your model on 10 percent of movie review model, please report the precision, recall, accuracy, and F-score of each class.

My evaluation will based on the whether you can clearly demonstrate your method in text
classification, pre-processing, feature selection, and perform evaluation.
Please using the same dataset as we used in class 3: NLTK and sentiment analysis

In [20]:
import nltk
import random
import string
from nltk import re
import pandas as pd
import seaborn as sns
import en_core_web_sm
from IPython.display import display 
from collections import OrderedDict
from IPython.lib.pretty import pprint
from nltk.corpus import movie_reviews
from nltk.stem import WordNetLemmatizer
from spacy.lang.en.stop_words import STOP_WORDS

In [37]:
stemmer = WordNetLemmatizer() # initialisation of a lemmatiser which detaches lemmas from words
stop_words = STOP_WORDS #a list of stop-words frequently encountered in usual texts
digits = string.digits  #a list of digits from 0 to 9
global verbose #if the value is 1 - we output all information, if it is 0 - we output only particular information
verbose=1
pd.set_option('display.max_rows', 100)

In [22]:
def clean_word(raw_word):
    word=raw_word.lower()
    
    #Lemmatisation of words
    word=stemmer.lemmatize(word)
    
    #Elimination of stop-words or other words(punctuation, numbers, tiny words, detached letters) 
    #that make up a noise in input data
    if len(word)<=2 or any(map(str.isdigit, word)) or word in stop_words: 
        return None
    else:
        return word

In [23]:
def clean_review(review):
    for w in review[:]: #cleaning every word in the review
        word_res=clean_word(w) 
        if word_res!=None:
            review[review.index(w)]=word_res #returning back the cleaned word
        else: #removing noisy word from the review 
            review.remove(w) 
    return review

In [24]:
def create_documents():
    documents=[]
    #Collecting documents into one list
    for category in movie_reviews.categories():
        for fileid in movie_reviews.fileids(category):
            review=list(movie_reviews.words(fileid))
            documents.append((review, category))
    return documents

In [25]:
def get_most_common_words(word_features_number,all_words):
    if verbose==1:
        print("Selecting top-frequent words from all reviews in the training set...")
    all_words_len=len(all_words)
    all_words = nltk.FreqDist(all_words) #Creating a distribution of words' frequencies
    most_common_words=all_words.most_common(word_features_number) #Obtaining the most frequent words

    if verbose==1:
        print("Done.")
        print("Top-frequent words amount -",len(most_common_words))
        words_to_show=100
        
        #Building data table from obtained words
        cm = sns.light_palette("red", as_cmap=True)
        data=[[word,float(u'{:.4f}'.format(counts/all_words_len)),u'{} times'.format(counts)] 
        for word, counts in most_common_words[:words_to_show]]
    
        df = pd.DataFrame(data)
        df.index = range(1,words_to_show+1)
        df.columns = ["Word", "Frequency", "Counts"]
        s = df.style.background_gradient(cmap=cm)
        print("Showing first",words_to_show,"words...")
        display(s)
    return most_common_words

In [26]:
def feature_selector(most_common_words, word_types=None):
    nlp = en_core_web_sm.load() # loading language model package
    word_features=list()
    feature_types=list()
    if verbose==1:
        print("Selecting the most relevant word features using Part-of-Speach Tagging...")
    for word in most_common_words:
        docs = nlp(word[0])
#         if docs[0].pos_=="X":
#             print(docs[0].text,docs[0].pos_)

        #Skipping word features that do not correspond to required tags in the 'word_types' variable
        if word_types!= None and docs[0].pos_ not in word_types:
            continue
        else:
            word_features.append(docs[0].text) # We add only relevant features
            if verbose == 1:
                feature_types.append([docs[0].text,docs[0].pos_])
    if verbose == 1:
        print()
        print("Done")
        words_to_show=100
        print("Amount of selected by POS word features-",len(word_features))

        print("Showing first",words_to_show,"selected word features")
        df = pd.DataFrame(feature_types) #Building data table to represent selected by POS tagger word features 
        df.index = range(1,len(feature_types)+1)
        df.columns = ["Word", "Part-Of-Speech Tag"]
        display(df.head(words_to_show)) #display those features
    return word_features

In [27]:
def get_documents_feature_sets(documents, word_features):
    if verbose==1:
        print("Assembling word feature sets for each review")
    #Transforming reviews to feature sets using selected word features
    feature_sets = [(find_features(rev,word_features), category) for (rev, category) in documents] 
    if verbose==1:
        print()
        print("Done")
        print("Word feature sets amount -",len(feature_sets))        
        print("<<=========================Example of word feature set===========================>>")
        pprint(feature_sets[0], max_seq_length=50)
        print()
    return feature_sets

In [28]:
def find_features(document,word_features):
    words = set(document)
    features = {}
    for w in word_features:
        features[w] = (w in words) # We check if a certain feature exists in a current document
    return features

In [29]:
def calculate_metrics(cm):
    TP=cm._confusion[1][1] #Counting True Positive predictions
    FP=cm._confusion[1][0] #Counting False Positive predictions
    TN=cm._confusion[0][0] #Counting True Negative predictions
    FN=cm._confusion[0][1] #Counting False Positive predictions

    Recall=TP/(TP+FN) #Calculating Recall
    Precision=TP/(TP+FP) #Calculating Precision
    accuracy=(TP+TN)/(TP+TN+FP+FN) #Calculating accuracy
    F1score=2*(Recall * Precision) / (Recall + Precision) #Calculating F1 score
    if verbose==1:
        print("<<=======Accuracy Metrics========>>")
        print()
        print("Recall is - ",Recall)
        print("Precision is - ",Precision)
        print("accuracy is - ",accuracy)
        print("F1score is - ",F1score)
    return Recall,Precision,accuracy,F1score

In [30]:
def count_average_results(results):# A function which calculates average results among all launches of Naїve Bayes Classifier
    average_recall=0
    average_precision=0
    average_accuracy=0
    average_F1score=0
    for result in results:
        average_recall+=result[0]/len(results)
        average_precision+=result[1]/len(results)
        average_accuracy+=result[2]/len(results)
        average_F1score+=result[3]/len(results)
    return {"Average recall":"{:.3%}".format(average_recall), "Average Precision":"{:.3%}".format(average_precision), "Average accuracy":"{:.3%}".format(average_accuracy), "Average F1 score":"{:.3%}".format(average_F1score)}

In [31]:
def test_classification_system(tests,word_types=None, tests_number=5):
    print("Commencing of Naїve Bayes Classifier testing on ",tests_number,"tests")
    for t in tests: #Testing the classifier on different amounts of the most frequent words
        i=0
        av_results=[]
        while i<tests_number: #Launching the classifier multiple times to obtain an average accuracy for a certain test
            if verbose==1:
                print("Test #"+str(i+1))
            av_results.append(run_Naive_Bayes_classification_system(t,positive_reviews,negative_reviews,word_types))    
            i=i+1
        if t==tests[0]:
            print("<<=================Statistical data for accuracy of the classifier=================>>")
            print()
        print({str(t)+" top-frequent words":count_average_results(av_results), "Tests quantity":tests_number, "POS-tags":word_types})
           
        

In [32]:
def clean_review_set(review_set):
    cleaned_set=[]
    for (rev, category) in review_set:
        review=clean_review(rev)
        cleaned_set.append((review,category))
    return cleaned_set

In [33]:
def get_words_from_dataset(dataset):
    all_words=[]
    for review in dataset:
        all_words+=review[0]
    return all_words

In [34]:
def run_Naive_Bayes_classification_system(word_features_number,positive_reviews,negative_reviews,word_types=None):
    random.shuffle(positive_reviews)
    random.shuffle(negative_reviews)
    balanced_documents=[]

    #Making equal amounts of reviews in a dataset
    for i in range(0,int(len(positive_reviews))):
            balanced_documents.append(positive_reviews[i])
            balanced_documents.append(negative_reviews[i])

     # Separating training and testing sets from the whole dataset 
    training_set_separator=int(len(balanced_documents)*0.9)
    training_set = balanced_documents[:training_set_separator]
    testing_set = balanced_documents[training_set_separator:]
    
    #Extracting all words from the training corpus to prepare for feature selection 
    all_words=get_words_from_dataset(training_set)
    
    #Selecting the most frequent words from all words in the training dataset
    most_common_words=get_most_common_words(word_features_number,all_words)
    
    #Selecting the most relevant word features using Part-of-Speach Tagging
    word_features=feature_selector(most_common_words,word_types)

    #Building ready-to-work training and testing sets based on selected word features 
    training_feature_set=get_documents_feature_sets(training_set,word_features)
    testing_feature_set=get_documents_feature_sets(testing_set,word_features)
    
    #Feeding the classifier with the training data
    classifier = nltk.NaiveBayesClassifier.train(training_feature_set)
    
    #Preparing data for classification and further building of a confusion matrix
    testing_set_content=[i[0] for i in testing_feature_set]
    golden_label=[i[1] for i in testing_feature_set] #True classes of testing set
    
    #The classification itself
    tested_label=classifier.classify_many(testing_set_content)
    
    #Building of a confusion matrix
    cm = nltk.ConfusionMatrix(golden_label, tested_label)
    if verbose==1:
        print("<<=====================Classification results========================>>")
        print()
        print("Calculated Confusion Matrix:")
        print()
        print (cm)
        print("Classifier accuracy percent:", (nltk.classify.accuracy(classifier, testing_feature_set))*100)
        classifier.show_most_informative_features(50)
        print()
        
    #Calculating accuracy metrics from the given classification results
    recall,precision,accuracy,F1score=calculate_metrics(cm)
    return [recall,precision,accuracy,F1score]

In [35]:
print("Loading reviews...")
documents= create_documents()
print("Loaded.")
print ("Total amount of reviews -",len(documents))
if verbose==1:
    print("<<=========================Example of raw review===========================>>")
    pprint(documents[0], max_seq_length=50)
    print()
positive_reviews=documents[int(len(documents)/2):]
negative_reviews=documents[:int(len(documents)/2)]

print("Cleaning reviews by lemmatisation and elimination of stop-words")
positive_reviews=clean_review_set(positive_reviews)
negative_reviews=clean_review_set(negative_reviews)
print("Done.")
if verbose==1:
    print("<<=========================Example of cleaned review===========================>>")
    pprint(negative_reviews[0], max_seq_length=50)
    print()

Loading reviews...
Loaded.
Total amount of reviews - 2000
(['plot',
  ':',
  'two',
  'teen',
  'couples',
  'go',
  'to',
  'a',
  'church',
  'party',
  ',',
  'drink',
  'and',
  'then',
  'drive',
  '.',
  'they',
  'get',
  'into',
  'an',
  'accident',
  '.',
  'one',
  'of',
  'the',
  'guys',
  'dies',
  ',',
  'but',
  'his',
  'girlfriend',
  'continues',
  'to',
  'see',
  'him',
  'in',
  'her',
  'life',
  ',',
  'and',
  'has',
  'nightmares',
  '.',
  'what',
  "'",
  's',
  'the',
  'deal',
  '?',
  'watch',
  ...],
 'neg')

Cleaning reviews by lemmatisation and elimination of stop-words
Done.
(['plot',
  'teen',
  'couple',
  'church',
  'party',
  'drink',
  'drive',
  'accident',
  'guy',
  'girlfriend',
  'continues',
  'life',
  'nightmare',
  'deal',
  'watch',
  'movie',
  'sorta',
  'find',
  'critique',
  'mind',
  'fuck',
  'movie',
  'teen',
  'generation',
  'touch',
  'cool',
  'idea',
  'present',
  'bad',
  'package',
  'review',
  'harder',
  'write',
  

### Single launch of the Naїve Bayes Classifier with 3000 top-frequent words

In [38]:
test_classification_system([3000],["ADJ","VERB","ADV"],tests_number=1)

Commencing of Naїve Bayes Classifier testing on  1 tests
Test #1
Selecting top-frequent words from all reviews in the training set...
Done.
Top-frequent words amount - 3000
Showing first 100 words...


Unnamed: 0,Word,Frequency,Counts
1,film,0.018,9951 times
2,movie,0.0114,6291 times
3,character,0.0063,3490 times
4,like,0.0061,3389 times
5,time,0.0048,2662 times
6,scene,0.0044,2417 times
7,good,0.0039,2135 times
8,story,0.0038,2096 times
9,life,0.0031,1712 times
10,way,0.0031,1700 times


Selecting the most relevant word features using Part-of-Speach Tagging...

Done
Amount of selected by POS word features- 1049
Showing first 100 selected word features


Unnamed: 0,Word,Part-Of-Speech Tag
1,good,ADJ
2,know,VERB
3,little,ADJ
4,come,VERB
5,bad,ADJ
6,best,ADJ
7,look,VERB
8,play,VERB
9,great,ADJ
10,find,VERB


Assembling word feature sets for each review

Done
Word feature sets amount - 1800
({'good': True,
  'know': False,
  'little': False,
  'come': False,
  'bad': False,
  'best': False,
  'look': False,
  'play': False,
  'great': True,
  'find': False,
  'big': False,
  'want': False,
  'think': False,
  'better': False,
  'real': True,
  'seen': False,
  'going': True,
  'old': True,
  'long': False,
  'funny': False,
  'actually': False,
  'played': True,
  'turn': False,
  'original': False,
  'feel': False,
  'acting': True,
  'try': False,
  'away': True,
  'high': False,
  'watch': False,
  'start': True,
  'far': False,
  'making': False,
  'interesting': False,
  'hard': False,
  'begin': False,
  'tell': False,
  'special': False,
  'instead': False,
  'trying': False,
  'human': False,
  'black': False,
  'having': False,
  'run': False,
  'probably': False,
  'pretty': False,
  'given': True,
  'sure': False,
  'let': False,
  'looking': False,
  ...},
 'pos')

Assembling wo

### Experiments with different amount of top-frequent words for feature selection and training the Naїve Bayes Classifier

In [18]:
verbose=0 # Since we conduct here multiple tests we should output only average metrics results
test_classification_system([1000,2000,3000,4000,8000,10000,12000, 15000, 18000, 20000],["ADJ","VERB","ADV"],tests_number=3)

Commencing of Naїve Bayes Classifier testing on  3 tests

{'1000 top-frequent words': {'Average recall': '79.682%', 'Average Precision': '79.333%', 'Average accuracy': '79.500%', 'Average F1 score': '79.381%'}, 'Tests quantity': 3, 'POS-tags': ['ADJ', 'VERB', 'ADV']}
{'2000 top-frequent words': {'Average recall': '83.451%', 'Average Precision': '76.000%', 'Average accuracy': '80.333%', 'Average F1 score': '79.330%'}, 'Tests quantity': 3, 'POS-tags': ['ADJ', 'VERB', 'ADV']}
{'3000 top-frequent words': {'Average recall': '84.634%', 'Average Precision': '78.000%', 'Average accuracy': '81.833%', 'Average F1 score': '81.141%'}, 'Tests quantity': 3, 'POS-tags': ['ADJ', 'VERB', 'ADV']}
{'4000 top-frequent words': {'Average recall': '83.044%', 'Average Precision': '75.000%', 'Average accuracy': '79.833%', 'Average F1 score': '78.697%'}, 'Tests quantity': 3, 'POS-tags': ['ADJ', 'VERB', 'ADV']}
{'8000 top-frequent words': {'Average recall': '82.439%', 'Average Precision': '71.667%', 'Average acc

### Experiments with some combinations of Part-Of-Speach tags

In [42]:
#If None - we accept all tags
verbose=1
test_classification_system([8000],None,tests_number=1)

Commencing of Naїve Bayes Classifier testing on  1 tests
Test #1
Selecting top-frequent words from all reviews in the training set...
Done.
Top-frequent words amount - 8000
Showing first 100 words...


Unnamed: 0,Word,Frequency,Counts
1,film,0.0179,9949 times
2,movie,0.0113,6282 times
3,like,0.0062,3449 times
4,character,0.0062,3445 times
5,time,0.0048,2680 times
6,scene,0.0043,2407 times
7,good,0.004,2222 times
8,story,0.0038,2127 times
9,life,0.0031,1715 times
10,way,0.0031,1703 times


Selecting the most relevant word features using Part-of-Speach Tagging...

Done
Amount of selected by POS word features- 8000
Showing first 100 selected word features


Unnamed: 0,Word,Part-Of-Speech Tag
1,film,PROPN
2,movie,NOUN
3,like,SCONJ
4,character,NOUN
5,time,NOUN
6,scene,NOUN
7,good,ADJ
8,story,NOUN
9,life,NOUN
10,way,NOUN


Assembling word feature sets for each review

Done
Word feature sets amount - 1800
({'film': True,
  'movie': True,
  'like': True,
  'character': True,
  'time': True,
  'scene': True,
  'good': True,
  'story': True,
  'life': True,
  'way': True,
  'year': True,
  'thing': False,
  'doe': True,
  'plot': False,
  'come': True,
  'little': True,
  'know': True,
  'people': True,
  'bad': False,
  'man': False,
  'work': True,
  'director': True,
  'best': False,
  'performance': False,
  'don': True,
  'look': True,
  'new': True,
  'end': False,
  'doesn': True,
  'actor': True,
  'action': True,
  'love': False,
  'play': True,
  'great': True,
  'role': False,
  'star': False,
  'find': False,
  'audience': False,
  'big': True,
  'world': False,
  'day': True,
  'want': True,
  'think': True,
  'comedy': False,
  'guy': False,
  'seen': True,
  'real': True,
  'better': True,
  'going': True,
  'old': False,
  ...},
 'pos')

Assembling word feature sets for each review

Done
Word

In [43]:
verbose=0
test_classification_system([8000],None,tests_number=5)

Commencing of Naїve Bayes Classifier testing on  5 tests

{'8000 top-frequent words': {'Average recall': '82.281%', 'Average Precision': '69.600%', 'Average accuracy': '77.400%', 'Average F1 score': '75.266%'}, 'Tests quantity': 5, 'POS-tags': None}


In [46]:
verbose=1
test_classification_system([8000],["ADJ"],tests_number=1)

Commencing of Naїve Bayes Classifier testing on  1 tests
Test #1
Selecting top-frequent words from all reviews in the training set...
Done.
Top-frequent words amount - 8000
Showing first 100 words...


Unnamed: 0,Word,Frequency,Counts
1,film,0.0181,9965 times
2,movie,0.0114,6270 times
3,character,0.0063,3487 times
4,like,0.0061,3377 times
5,time,0.0049,2707 times
6,scene,0.0043,2397 times
7,good,0.004,2191 times
8,story,0.0038,2117 times
9,life,0.0031,1707 times
10,way,0.0031,1693 times


Selecting the most relevant word features using Part-of-Speach Tagging...

Done
Amount of selected by POS word features- 940
Showing first 100 selected word features


Unnamed: 0,Word,Part-Of-Speech Tag
1,good,ADJ
2,little,ADJ
3,bad,ADJ
4,best,ADJ
5,great,ADJ
6,big,ADJ
7,better,ADJ
8,old,ADJ
9,real,ADJ
10,funny,ADJ


Assembling word feature sets for each review

Done
Word feature sets amount - 1800
({'good': True,
  'little': True,
  'bad': False,
  'best': True,
  'great': False,
  'big': False,
  'better': False,
  'old': True,
  'real': False,
  'funny': True,
  'original': False,
  'interesting': False,
  'high': False,
  'human': False,
  'special': False,
  'sure': False,
  'black': False,
  'pretty': False,
  'second': False,
  'different': False,
  'dead': False,
  'true': False,
  'small': False,
  'entire': False,
  'main': False,
  'final': False,
  'wrong': False,
  'perfect': True,
  'open': False,
  'nice': False,
  'able': False,
  'worth': False,
  'entertaining': False,
  'short': False,
  'dark': False,
  'worst': False,
  'obvious': False,
  'beautiful': False,
  'fine': False,
  'simple': False,
  'present': False,
  'deep': False,
  'boring': False,
  'strong': False,
  'stupid': False,
  'possible': False,
  'romantic': False,
  'worse': False,
  'single': False,
  'emotional'

In [47]:
verbose=0
test_classification_system([8000],["ADJ"],tests_number=5)

Commencing of Naїve Bayes Classifier testing on  5 tests

{'8000 top-frequent words': {'Average recall': '80.111%', 'Average Precision': '73.600%', 'Average accuracy': '77.600%', 'Average F1 score': '76.544%'}, 'Tests quantity': 5, 'POS-tags': ['ADJ']}


In [48]:
verbose=1
test_classification_system([8000],["ADJ","NOUN",],tests_number=1)

Commencing of Naїve Bayes Classifier testing on  1 tests
Test #1
Selecting top-frequent words from all reviews in the training set...
Done.
Top-frequent words amount - 8000
Showing first 100 words...


Unnamed: 0,Word,Frequency,Counts
1,film,0.0181,10061 times
2,movie,0.0111,6179 times
3,character,0.0063,3494 times
4,like,0.0061,3395 times
5,time,0.0049,2695 times
6,scene,0.0044,2424 times
7,good,0.004,2200 times
8,story,0.0038,2112 times
9,life,0.0031,1726 times
10,way,0.0031,1703 times


Selecting the most relevant word features using Part-of-Speach Tagging...

Done
Amount of selected by POS word features- 2830
Showing first 100 selected word features


Unnamed: 0,Word,Part-Of-Speech Tag
1,movie,NOUN
2,character,NOUN
3,time,NOUN
4,scene,NOUN
5,good,ADJ
6,story,NOUN
7,life,NOUN
8,way,NOUN
9,year,NOUN
10,thing,NOUN


Assembling word feature sets for each review

Done
Word feature sets amount - 1800
({'movie': True,
  'character': True,
  'time': True,
  'scene': True,
  'good': True,
  'story': True,
  'life': True,
  'way': True,
  'year': False,
  'thing': True,
  'plot': True,
  'little': True,
  'people': False,
  'man': False,
  'work': True,
  'bad': False,
  'best': False,
  'director': True,
  'end': True,
  'performance': True,
  'action': True,
  'actor': True,
  'love': False,
  'great': True,
  'role': True,
  'world': False,
  'audience': False,
  'big': True,
  'day': False,
  'comedy': True,
  'real': False,
  'better': False,
  'old': False,
  'set': False,
  'funny': False,
  'fact': True,
  'point': True,
  'minute': False,
  'woman': False,
  'lot': False,
  'effect': True,
  'friend': True,
  'cast': True,
  'moment': True,
  'screen': True,
  'line': False,
  'original': True,
  'place': False,
  'family': False,
  'problem': False,
  ...},
 'pos')

Assembling word feature sets

In [49]:
verbose=0
test_classification_system([8000],["ADJ","NOUN",],tests_number=5)

Commencing of Naїve Bayes Classifier testing on  5 tests

{'8000 top-frequent words': {'Average recall': '82.666%', 'Average Precision': '69.400%', 'Average accuracy': '77.400%', 'Average F1 score': '75.368%'}, 'Tests quantity': 5, 'POS-tags': ['ADJ', 'NOUN']}


In [50]:
verbose=1
test_classification_system([8000],["ADJ","NOUN","VERB"],tests_number=1)

Commencing of Naїve Bayes Classifier testing on  1 tests
Test #1
Selecting top-frequent words from all reviews in the training set...
Done.
Top-frequent words amount - 8000
Showing first 100 words...


Unnamed: 0,Word,Frequency,Counts
1,film,0.018,10038 times
2,movie,0.0113,6269 times
3,character,0.0064,3548 times
4,like,0.0062,3429 times
5,time,0.0048,2696 times
6,scene,0.0044,2424 times
7,good,0.004,2206 times
8,story,0.0038,2117 times
9,way,0.0031,1722 times
10,life,0.0031,1714 times


Selecting the most relevant word features using Part-of-Speach Tagging...

Done
Amount of selected by POS word features- 4055
Showing first 100 selected word features


Unnamed: 0,Word,Part-Of-Speech Tag
1,movie,NOUN
2,character,NOUN
3,time,NOUN
4,scene,NOUN
5,good,ADJ
6,story,NOUN
7,way,NOUN
8,life,NOUN
9,year,NOUN
10,thing,NOUN


Assembling word feature sets for each review

Done
Word feature sets amount - 1800
({'movie': True,
  'character': True,
  'time': False,
  'scene': True,
  'good': False,
  'story': True,
  'way': True,
  'life': False,
  'year': True,
  'thing': False,
  'plot': False,
  'little': False,
  'come': False,
  'know': True,
  'people': False,
  'man': False,
  'bad': False,
  'work': True,
  'director': False,
  'performance': True,
  'end': False,
  'best': True,
  'look': False,
  'action': False,
  'actor': False,
  'play': False,
  'love': True,
  'great': False,
  'role': False,
  'find': False,
  'audience': False,
  'big': False,
  'world': False,
  'want': True,
  'day': False,
  'think': False,
  'comedy': False,
  'real': False,
  'better': False,
  'seen': False,
  'going': False,
  'old': False,
  'fact': False,
  'point': False,
  'set': True,
  'funny': False,
  'minute': False,
  'woman': False,
  'lot': True,
  'played': False,
  ...},
 'pos')

Assembling word feature set

In [51]:
verbose=0
test_classification_system([8000],["ADJ","NOUN","VERB"],tests_number=5)

Commencing of Naїve Bayes Classifier testing on  5 tests

{'8000 top-frequent words': {'Average recall': '82.189%', 'Average Precision': '73.400%', 'Average accuracy': '78.700%', 'Average F1 score': '77.518%'}, 'Tests quantity': 5, 'POS-tags': ['ADJ', 'NOUN', 'VERB']}


In [52]:
verbose=1
test_classification_system([8000],["ADJ","NOUN","VERB","ADV"],tests_number=1)

Commencing of Naїve Bayes Classifier testing on  1 tests
Test #1
Selecting top-frequent words from all reviews in the training set...
Done.
Top-frequent words amount - 8000
Showing first 100 words...


Unnamed: 0,Word,Frequency,Counts
1,film,0.0181,10044 times
2,movie,0.0114,6327 times
3,character,0.0063,3514 times
4,like,0.0062,3449 times
5,time,0.0048,2681 times
6,scene,0.0044,2428 times
7,good,0.004,2203 times
8,story,0.0038,2111 times
9,way,0.0031,1713 times
10,life,0.0031,1708 times


Selecting the most relevant word features using Part-of-Speach Tagging...

Done
Amount of selected by POS word features- 4427
Showing first 100 selected word features


Unnamed: 0,Word,Part-Of-Speech Tag
1,movie,NOUN
2,character,NOUN
3,time,NOUN
4,scene,NOUN
5,good,ADJ
6,story,NOUN
7,way,NOUN
8,life,NOUN
9,year,NOUN
10,thing,NOUN


Assembling word feature sets for each review

Done
Word feature sets amount - 1800
({'movie': True,
  'character': True,
  'time': True,
  'scene': True,
  'good': False,
  'story': True,
  'way': True,
  'life': True,
  'year': True,
  'thing': True,
  'plot': False,
  'come': True,
  'little': True,
  'people': True,
  'know': False,
  'bad': False,
  'man': True,
  'work': False,
  'director': True,
  'best': True,
  'performance': True,
  'end': True,
  'look': False,
  'action': True,
  'actor': False,
  'love': False,
  'play': False,
  'great': True,
  'role': True,
  'find': False,
  'audience': True,
  'big': False,
  'world': False,
  'want': True,
  'day': True,
  'think': True,
  'comedy': False,
  'real': True,
  'better': True,
  'seen': True,
  'going': True,
  'old': True,
  'fact': True,
  'funny': False,
  'set': False,
  'point': True,
  'actually': True,
  'long': True,
  'woman': True,
  'friend': False,
  ...},
 'pos')

Assembling word feature sets for each review

In [53]:
verbose=0
test_classification_system([8000],["ADJ","NOUN","VERB","ADV"],tests_number=5)

Commencing of Naїve Bayes Classifier testing on  5 tests

{'8000 top-frequent words': {'Average recall': '83.813%', 'Average Precision': '74.200%', 'Average accuracy': '79.900%', 'Average F1 score': '78.665%'}, 'Tests quantity': 5, 'POS-tags': ['ADJ', 'NOUN', 'VERB', 'ADV']}


In [54]:
#     tags=["ADJ","ADP","ADV","AUX","CONJ","CCONJ","DET","INTJ","NOUN","NUM","PART","PRON","PROPN","PUNCT","SCONJ","SYM","VERB","X","SPACE"]