# NLTK sentiment analysis

## Importing and looking through datasets

In [1]:
import mlflow
mlflow.autolog()

In [2]:
import nltk
nltk.download([
    "names",
    "stopwords",
    "state_union",
    "twitter_samples",
    "movie_reviews",
    "averaged_perceptron_tagger",
    "vader_lexicon",
    "punkt"
    ])

2023/08/17 22:29:20 INFO mlflow.tracking.fluent: Autologging successfully enabled for sklearn.
[nltk_data] Downloading package names to /Users/Bartek/nltk_data...
[nltk_data]   Package names is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/Bartek/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package state_union to
[nltk_data]     /Users/Bartek/nltk_data...
[nltk_data]   Package state_union is already up-to-date!
[nltk_data] Downloading package twitter_samples to
[nltk_data]     /Users/Bartek/nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!
[nltk_data] Downloading package movie_reviews to
[nltk_data]     /Users/Bartek/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/Bartek/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       

True

In [3]:
su_words = [w for w in nltk.corpus.state_union.words() if w.isalpha()]  # keep only purely alphabetic tokens

In [4]:
su_words[:10]

['PRESIDENT',
 'HARRY',
 'S',
 'TRUMAN',
 'S',
 'ADDRESS',
 'BEFORE',
 'A',
 'JOINT',
 'SESSION']

In [5]:
stopwords = nltk.corpus.stopwords.words("english")  # to filter out words like "a", "of", "the"

In [6]:
su_words = [w for w in su_words if w.lower() not in stopwords]

In [7]:
su_words[:10]

['PRESIDENT',
 'HARRY',
 'TRUMAN',
 'ADDRESS',
 'JOINT',
 'SESSION',
 'CONGRESS',
 'April',
 'Mr',
 'Speaker']
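
The filter above checks every word against a Python list, which is a linear scan per lookup. Below is a minimal sketch of the same two steps (keep alphabetic tokens, then drop stopwords case-insensitively) using a set for fast membership tests. The `demo_stopwords` set, the `demo_words` list and the `filter_words` helper are made up for illustration and only stand in for the NLTK corpora.

```python
# Hypothetical stand-ins for nltk.corpus.stopwords and state_union.words()
demo_stopwords = {"a", "s", "before", "the", "of"}
demo_words = ["PRESIDENT", "HARRY", "S", ".", "TRUMAN", "'", "S",
              "ADDRESS", "BEFORE", "A", "JOINT", "SESSION"]


def filter_words(words, stopword_set):
    """Keep alphabetic tokens whose lowercase form is not a stopword."""
    return [w for w in words if w.isalpha() and w.lower() not in stopword_set]


filtered = filter_words(demo_words, demo_stopwords)
print(filtered)
```

Comparing `w.lower()` against the stopword set is what removes the upper-case 'S' and 'A' while leaving the casing of the surviving words untouched.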

In [8]:
from pprint import pprint

decl_of_indep_text = """
The unanimous Declaration of the thirteen united States of America, When in the Course of human events, it becomes necessary for one people to dissolve the political bands which have connected them with another, and to assume among the powers of the earth, the separate and equal station to which the Laws of Nature and of Nature's God entitle them, a decent respect to the opinions of mankind requires that they should declare the causes which impel them to the separation.
We hold these truths to be self-evident, that all men are created equal, that they are endowed by their Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit of Happiness.--That to secure these rights, Governments are instituted among Men, deriving their just powers from the consent of the governed, --That whenever any Form of Government becomes destructive of these ends, it is the Right of the People to alter or to abolish it, and to institute new Government, laying its foundation on such principles and organizing its powers in such form, as to them shall seem most likely to effect their Safety and Happiness. Prudence, indeed, will dictate that Governments long established should not be changed for light and transient causes; and accordingly all experience hath shewn, that mankind are more disposed to suffer, while evils are sufferable, than to right themselves by abolishing the forms to which they are accustomed. But when a long train of abuses and usurpations, pursuing invariably the same Object evinces a design to reduce them under absolute Despotism, it is their right, it is their duty, to throw off such Government, and to provide new Guards for their future security.--Such has been the patient sufferance of these Colonies; and such is now the necessity which constrains them to alter their former Systems of Government. The history of the present King of Great Britain is a history of repeated injuries and usurpations, all having in direct object the establishment of an absolute Tyranny over these States. To prove this, let Facts be submitted to a candid world.
He has refused his Assent to Laws, the most wholesome and necessary for the public good.
He has forbidden his Governors to pass Laws of immediate and pressing importance, unless suspended in their operation till his Assent should be obtained; and when so suspended, he has utterly neglected to attend to them.
He has refused to pass other Laws for the accommodation of large districts of people, unless those people would relinquish the right of Representation in the Legislature, a right inestimable to them and formidable to tyrants only.
He has called together legislative bodies at places unusual, uncomfortable, and distant from the depository of their public Records, for the sole purpose of fatiguing them into compliance with his measures.
He has dissolved Representative Houses repeatedly, for opposing with manly firmness his invasions on the rights of the people.
He has refused for a long time, after such dissolutions, to cause others to be elected; whereby the Legislative powers, incapable of Annihilation, have returned to the People at large for their exercise; the State remaining in the mean time exposed to all the dangers of invasion from without, and convulsions within.
He has endeavoured to prevent the population of these States; for that purpose obstructing the Laws for Naturalization of Foreigners; refusing to pass others to encourage their migrations hither, and raising the conditions of new Appropriations of Lands.
He has obstructed the Administration of Justice, by refusing his Assent to Laws for establishing Judiciary powers.
He has made Judges dependent on his Will alone, for the tenure of their offices, and the amount and payment of their salaries.
He has erected a multitude of New Offices, and sent hither swarms of Officers to harrass our people, and eat out their substance.
He has kept among us, in times of peace, Standing Armies without the Consent of our legislatures.
He has affected to render the Military independent of and superior to the Civil power.
He has combined with others to subject us to a jurisdiction foreign to our constitution, and unacknowledged by our laws; giving his Assent to their Acts of pretended Legislation:
For Quartering large bodies of armed troops among us:
For protecting them, by a mock Trial, from punishment for any Murders which they should commit on the Inhabitants of these States:
For cutting off our Trade with all parts of the world:
For imposing Taxes on us without our Consent:
For depriving us in many cases, of the benefits of Trial by Jury:
For transporting us beyond Seas to be tried for pretended offences
For abolishing the free System of English Laws in a neighbouring Province, establishing therein an Arbitrary government, and enlarging its Boundaries so as to render it at once an example and fit instrument for introducing the same absolute rule into these Colonies:
For taking away our Charters, abolishing our most valuable Laws, and altering fundamentally the Forms of our Governments:
For suspending our own Legislatures, and declaring themselves invested with power to legislate for us in all cases whatsoever.
He has abdicated Government here, by declaring us out of his Protection and waging War against us.
He has plundered our seas, ravaged our Coasts, burnt our towns, and destroyed the lives of our people.
He is at this time transporting large Armies of foreign Mercenaries to compleat the works of death, desolation and tyranny, already begun with circumstances of Cruelty & perfidy scarcely paralleled in the most barbarous ages, and totally unworthy the Head of a civilized nation.
He has constrained our fellow Citizens taken Captive on the high Seas to bear Arms against their Country, to become the executioners of their friends and Brethren, or to fall themselves by their Hands.
He has excited domestic insurrections amongst us, and has endeavoured to bring on the inhabitants of our frontiers, the merciless Indian Savages, whose known rule of warfare, is an undistinguished destruction of all ages, sexes and conditions.
In every stage of these Oppressions We have Petitioned for Redress in the most humble terms: Our repeated Petitions have been answered only by repeated injury. A Prince whose character is thus marked by every act which may define a Tyrant, is unfit to be the ruler of a free people.
Nor have We been wanting in attentions to our Brittish brethren. We have warned them from time to time of attempts by their legislature to extend an unwarrantable jurisdiction over us. We have reminded them of the circumstances of our emigration and settlement here. We have appealed to their native justice and magnanimity, and we have conjured them by the ties of our common kindred to disavow these usurpations, which, would inevitably interrupt our connections and correspondence. They too have been deaf to the voice of justice and of consanguinity. We must, therefore, acquiesce in the necessity, which denounces our Separation, and hold them, as we hold the rest of mankind, Enemies in War, in Peace Friends.
We, therefore, the Representatives of the united States of America, in General Congress, Assembled, appealing to the Supreme Judge of the world for the rectitude of our intentions, do, in the Name, and by Authority of the good People of these Colonies, solemnly publish and declare, That these United Colonies are, and of Right ought to be Free and Independent States; that they are Absolved from all Allegiance to the British Crown, and that all political connection between them and the State of Great Britain, is and ought to be totally dissolved; and that as Free and Independent States, they have full Power to levy War, conclude Peace, contract Alliances, establish Commerce, and to do all other Acts and Things which Independent States may of right do. And for the support of this Declaration, with a firm reliance on the protection of divine Providence, we mutually pledge to each other our Lives, our Fortunes and our sacred Honor.
"""

## Tokenization and frequency distributions

In [9]:
pprint(nltk.word_tokenize(decl_of_indep_text), width=79, compact=True)

text_tokenized = nltk.word_tokenize(decl_of_indep_text)

['The', 'unanimous', 'Declaration', 'of', 'the', 'thirteen', 'united',
 'States', 'of', 'America', ',', 'When', 'in', 'the', 'Course', 'of', 'human',
 'events', ',', 'it', 'becomes', 'necessary', 'for', 'one', 'people', 'to',
 'dissolve', 'the', 'political', 'bands', 'which', 'have', 'connected', 'them',
 'with', 'another', ',', 'and', 'to', 'assume', 'among', 'the', 'powers', 'of',
 'the', 'earth', ',', 'the', 'separate', 'and', 'equal', 'station', 'to',
 'which', 'the', 'Laws', 'of', 'Nature', 'and', 'of', 'Nature', "'s", 'God',
 'entitle', 'them', ',', 'a', 'decent', 'respect', 'to', 'the', 'opinions',
 'of', 'mankind', 'requires', 'that', 'they', 'should', 'declare', 'the',
 'causes', 'which', 'impel', 'them', 'to', 'the', 'separation', '.', 'We',
 'hold', 'these', 'truths', 'to', 'be', 'self-evident', ',', 'that', 'all',
 'men', 'are', 'created', 'equal', ',', 'that', 'they', 'are', 'endowed', 'by',
 'their', 'Creator', 'with', 'certain', 'unalienable', 'Rights', ',', 'that',
 'am

In [10]:
text_tokenized[:10]

['The',
 'unanimous',
 'Declaration',
 'of',
 'the',
 'thirteen',
 'united',
 'States',
 'of',
 'America']

In [11]:
text = [w for w in text_tokenized if w.lower() not in stopwords]
text = [w for w in text if w.isalpha()]

In [12]:
text[:10]

['unanimous',
 'Declaration',
 'thirteen',
 'united',
 'States',
 'America',
 'Course',
 'human',
 'events',
 'becomes']

In [13]:
fd_decl_of_indep = nltk.FreqDist(text)
fd_su_words = nltk.FreqDist(su_words)

In [14]:
fd_decl_of_indep.most_common()[:10]

[('us', 11),
 ('States', 8),
 ('Laws', 8),
 ('people', 7),
 ('among', 5),
 ('powers', 5),
 ('Government', 5),
 ('right', 5),
 ('time', 5),
 ('Colonies', 4)]

In [15]:
fd_su_words.most_common()[:10]

[('must', 1568),
 ('people', 1291),
 ('world', 1128),
 ('year', 1097),
 ('America', 1076),
 ('us', 1049),
 ('new', 1049),
 ('Congress', 1014),
 ('years', 827),
 ('American', 784)]

In [16]:
fd_su_words["America"]

1076

In [17]:
fd_su_words["america"] # no KeyError raised here

0

In [18]:
fd_su_words["AMERICA"]

3

In [19]:
len([w.lower() for w in su_words])

180589

In [20]:
lower_fd_su_words = nltk.FreqDist([w.lower() for w in su_words])

In [21]:
lower_fd_su_words.most_common()[:10]

[('must', 1569),
 ('people', 1313),
 ('world', 1213),
 ('new', 1112),
 ('year', 1104),
 ('america', 1079),
 ('congress', 1078),
 ('us', 1049),
 ('government', 969),
 ('years', 829)]

In [22]:
lower_fd_su_words["america"] # as expected

1079
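
nltk.FreqDist is a subclass of collections.Counter, which explains both behaviours seen above: keys are case-sensitive, and looking up a missing key returns 0 instead of raising KeyError. A small sketch with a plain Counter and an invented word list (not the State of the Union data):

```python
from collections import Counter

# Invented sample; the real notebook uses the state_union corpus
demo_words = ["America", "AMERICA", "America", "people", "People"]

case_sensitive = Counter(demo_words)
case_folded = Counter(w.lower() for w in demo_words)

print(case_sensitive["America"])  # counted separately from "AMERICA"
print(case_sensitive["america"])  # missing key: 0, no KeyError
print(case_folded["america"])     # all spellings merged after lowercasing
```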

## Concordance and collocations analysis

In [23]:
su_text = nltk.Text(nltk.corpus.state_union.words())

In [24]:
su_text.concordance("states", lines=10)

Displaying 10 of 541 matches:
ues , in the Congress of the United States . Only yesterday , we laid to rest 
inate the world . While these great states have a special responsibility to en
on the obligations resting upon all states , large and small , not to use forc
 . The respon sibility of the great states is to serve and not to dominate the
 1946 To the Congress of the United States : A quarter century ago the Congres
g of the year 1946 finds the United States strong and deservedly confident . W
e citizenry . We have in the United States Government rich resources in inform
t that the nations come together as States in the Assembly and in the Security
s based on the policy of the United States that people be permitted to choose 
, the Soviet Union , and the United States conferred together in San Francisco


In [25]:
concordence_list = su_text.concordance_list("states", lines=10)
for line in concordence_list:
    print(line.line)

ues , in the Congress of the United States . Only yesterday , we laid to rest 
inate the world . While these great states have a special responsibility to en
on the obligations resting upon all states , large and small , not to use forc
 . The respon sibility of the great states is to serve and not to dominate the
 1946 To the Congress of the United States : A quarter century ago the Congres
g of the year 1946 finds the United States strong and deservedly confident . W
e citizenry . We have in the United States Government rich resources in inform
t that the nations come together as States in the Assembly and in the Security
s based on the policy of the United States that people be permitted to choose 
, the Soviet Union , and the United States conferred together in San Francisco


In [26]:
concordence_list[1].right_print

'have a special responsibility to en'
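
To make the idea of a concordance concrete, here is a rough keyword-in-context sketch in plain Python. It is not how NLTK implements concordance_list (NLTK trims context by character width, this sketch uses a token window); the `keyword_in_context` helper and the sample tokens are assumptions for illustration only.

```python
def keyword_in_context(tokens, keyword, width=3):
    """Return (left, match, right) strings for case-insensitive matches."""
    hits = []
    target = keyword.lower()
    for i, token in enumerate(tokens):
        if token.lower() == target:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append((" ".join(left), token, " ".join(right)))
    return hits


demo_tokens = "the Congress of the United States . Only yesterday".split()
hits = keyword_in_context(demo_tokens, "states")
for left, match, right in hits:
    print(f"{left} [{match}] {right}")
```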

In [27]:
decl_nltk_words = nltk.word_tokenize(decl_of_indep_text)
decl_nltk_words = [w.lower() for w in decl_nltk_words if w.isalpha()]
decl_nltk_text = nltk.Text(decl_nltk_words)
fd_decl_nltk_text = decl_nltk_text.vocab()

In [28]:
fd_decl_nltk_text.tabulate(10)

  the    of    to   and   for   our their   has    in    he 
   78    78    65    56    29    26    20    20    19    19 


In [29]:
su_words = [w for w in nltk.corpus.state_union.words() if w.isalpha()]  # reload the words, this time keeping stopwords
su_words_lower = [w.lower() for w in su_words]

In [30]:
finder = nltk.collocations.TrigramCollocationFinder.from_words(su_words_lower)

In [31]:
finder.ngram_fd.most_common(10)

[(('the', 'united', 'states'), 327),
 (('the', 'american', 'people'), 208),
 (('the', 'state', 'of'), 171),
 (('to', 'the', 'congress'), 164),
 (('state', 'of', 'the'), 164),
 (('of', 'the', 'world'), 155),
 (('of', 'the', 'union'), 155),
 (('of', 'the', 'united'), 146),
 (('in', 'the', 'world'), 136),
 (('the', 'federal', 'government'), 132)]
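
The counts in ngram_fd are frequencies of three consecutive words. As a rough illustration of that counting step only (the collocation finder also offers filtering and association measures, which are not shown), trigrams can be tallied with zip and Counter. The `count_trigrams` helper and the sample sentence are hypothetical.

```python
from collections import Counter


def count_trigrams(words):
    """Count every run of three consecutive words."""
    return Counter(zip(words, words[1:], words[2:]))


demo_words = "the state of the union is the state of the nation".split()
trigram_counts = count_trigrams(demo_words)
print(trigram_counts.most_common(2))
```

Because stopwords were kept in su_words_lower, the most frequent trigrams are dominated by function words such as 'the' and 'of', exactly as in the output above.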

## Using pretrained sentiment analyzer model

In [32]:
from nltk.sentiment import SentimentIntensityAnalyzer   # VADER model for sentiment analysis is better for short texts
sia = SentimentIntensityAnalyzer()
sia.polarity_scores("Shakespeare didn't know how to predict the outcome of his works")

{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}

In [33]:
tweets = [t.replace("://", "//") for t in nltk.corpus.twitter_samples.strings()]

In [34]:
from random import shuffle

def sia_tweet_is_positive(tweet: str) -> str:
    """Return 'Positive' if the tweet's compound score is above 0, otherwise 'Negative'."""
    if sia.polarity_scores(tweet)["compound"] > 0:
        return "Positive"
    return "Negative"

shuffle(tweets)
for tweet in tweets[:10]:
    print(sia_tweet_is_positive(tweet), "\n>", tweet, "\n\n")

Negative 
> Question Time crowd emerge as stars on a night of vicious attacks on leaders: Cameron, Miliband and Clegg all ... http//t.co/okONAfO2b2 


Negative 
> @__sabaa I'll say rumble in da kumble BC that's all I know :((( 


Positive 
> RT @NicolaSturgeon: If Miliband is going to let Tories in rather than work with SNP, we will definitely need lots of SNP MPs to protect Sco… 


Positive 
> So you can keep me inside of pocket of ur wripped jeans :) 


Positive 
> @JabongIndia ready and eagerly waiting for the next question. bring it on :) #JabongatPumaUrbanStampede #JabongatPumaUrbanStampede 


Negative 
> @AlanRoden What it boils down to, is the simple fact that Labour refuse to budge. SNP will use this in a way that pins them against Scots. 


Positive 
> RT @KevinJPringle: Gordon Brown says the Tories are an "anti-Scottish" party - so why did Labour form an alliance with them in #indyref? ht… 


Positive 
> Farage looks like he is waiting for the Krypton Factor to start. 


Nega
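
The labelling function above treats a compound score of exactly 0 (a fully neutral tweet) as negative. A possible refinement, sketched here under the assumption that the thresholds of +0.05 and -0.05 commonly suggested for VADER's compound score are appropriate for this data, is a three-way label. The `label_from_compound` helper is hypothetical and takes the score directly, so it runs without NLTK.

```python
def label_from_compound(compound: float, threshold: float = 0.05) -> str:
    """Map a VADER compound score in [-1, 1] to a three-way label."""
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"


for score in (0.6, 0.0, -0.4):
    print(score, label_from_compound(score))
```

It could be combined with the analyzer as label_from_compound(sia.polarity_scores(tweet)["compound"]).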

In [35]:
positive_reviews_ids = nltk.corpus.movie_reviews.fileids(categories=["pos"])
negative_reviews_ids = nltk.corpus.movie_reviews.fileids(categories=["neg"])
all_review_ids = positive_reviews_ids + negative_reviews_ids

In [36]:
from statistics import mean

def is_positive(review_id: str) -> bool:
    """Return True if the mean compound score over the review's sentences is positive."""
    reviews_text = nltk.corpus.movie_reviews.raw(review_id)
    sia_scores = [
        sia.polarity_scores(line)["compound"] for line in nltk.sent_tokenize(reviews_text)
    ]
    return mean(sia_scores) > 0

In [37]:
shuffle(all_review_ids)

pos_assignemnts_counter = 0
false_positive = 0
neg_assignemnts_counter = 0
false_negative = 0

for review_id in all_review_ids:
    if is_positive(review_id):
        if review_id in positive_reviews_ids:
            pos_assignemnts_counter += 1
        elif review_id in negative_reviews_ids:
            false_positive += 1
    else:
        if review_id in negative_reviews_ids:
            neg_assignemnts_counter += 1
        elif review_id in positive_reviews_ids:
            false_negative += 1

print(f"""
      Correct positive: {pos_assignemnts_counter}
      Wrong positive: {false_positive}
      Correct negative: {neg_assignemnts_counter}
      Wrong negative: {false_negative}
      """)
            
            


      Correct positive: 831
      Wrong positive: 551
      Correct negative: 449
      Wrong negative: 169
      


In [38]:
F"{(pos_assignemnts_counter + neg_assignemnts_counter)/ len(all_review_ids):.1%} accuracy"

'64.0% accuracy'
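
Accuracy alone hides how lopsided the errors are. The four counts printed above are enough to work out precision, recall and specificity for the positive class; the sketch below just redoes that arithmetic with the numbers copied from the output.

```python
# Counts copied from the evaluation output above
true_positive = 831   # "Correct positive"
false_positive = 551  # "Wrong positive"
true_negative = 449   # "Correct negative"
false_negative = 169  # "Wrong negative"

total = true_positive + false_positive + true_negative + false_negative
accuracy = (true_positive + true_negative) / total
precision = true_positive / (true_positive + false_positive)
recall = true_positive / (true_positive + false_negative)
specificity = true_negative / (true_negative + false_positive)

print(f"accuracy={accuracy:.1%} precision={precision:.1%} "
      f"recall={recall:.1%} specificity={specificity:.1%}")
```

Recall comes out at 83.1% but specificity at only 44.9%: the analyzer labels most reviews as positive, so more than half of the negative reviews are misclassified.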

## Selecting useful features

In [39]:
useless = set(nltk.corpus.stopwords.words("english"))  # a set makes the membership tests below fast
useless.update(w.lower() for w in nltk.corpus.names.words())

def skip_unwanted_words(pos_tuple):
    """Filter predicate: return True for (word, tag) pairs that should be kept."""
    word, tag = pos_tuple
    if not word.isalpha() or word in useless:
        return False
    if tag.startswith("NN"):  # drop nouns
        return False
    return True

positive_words = [word for word, tag in filter(
    skip_unwanted_words,
    nltk.pos_tag(nltk.corpus.movie_reviews.words(categories=["pos"]))
)]
negative_words = [word for word, tag in filter(
    skip_unwanted_words,
    nltk.pos_tag(nltk.corpus.movie_reviews.words(categories=["neg"]))
)]

In [40]:
positive_fd = nltk.FreqDist(positive_words)
negative_fd = nltk.FreqDist(negative_words)

common_set = set(positive_fd).intersection(negative_fd)

for word in common_set:
    del positive_fd[word]
    del negative_fd[word]

top_100_positive = {word for word, count in positive_fd.most_common(100)}
top_100_negative = {word for word, count in negative_fd.most_common(100)}
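
The cell above keeps only words that occur in one class and never in the other. A toy sketch with invented counts shows what that does, including a side effect worth knowing about.

```python
from collections import Counter

# Invented counts, not taken from the movie_reviews corpus
positive_counts = Counter({"good": 5, "superb": 3, "bad": 1})
negative_counts = Counter({"bad": 6, "awful": 2, "good": 2})

# Words seen in both classes are removed from both distributions
shared = set(positive_counts) & set(negative_counts)
for word in shared:
    del positive_counts[word]
    del negative_counts[word]

print(positive_counts)
print(negative_counts)
```

Note that "good" is dropped even though it is more frequent in the positive sample: a single occurrence in the other class is enough to exclude a word. On a large corpus this leaves mostly rare words in the top 100 lists.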

In [41]:
positive_bigram_finder = nltk.collocations.BigramCollocationFinder.from_words([
    w for w in nltk.corpus.movie_reviews.words(categories=["pos"])
    if w.isalpha() and w not in useless
])
negative_bigram_finder = nltk.collocations.BigramCollocationFinder.from_words([
    w for w in nltk.corpus.movie_reviews.words(categories=["neg"])
    if w.isalpha() and w not in useless
])

In [42]:
positive_bigram_finder.ngram_fd.most_common(10)

[(('special', 'effects'), 179),
 (('new', 'york'), 131),
 (('even', 'though'), 120),
 (('one', 'best'), 117),
 (('year', 'old'), 106),
 (('science', 'fiction'), 96),
 (('high', 'school'), 89),
 (('pulp', 'fiction'), 78),
 (('sci', 'fi'), 78),
 (('real', 'life'), 73)]

In [43]:
positive_trigram_finder = nltk.collocations.TrigramCollocationFinder.from_words([
    w for w in nltk.corpus.movie_reviews.words(categories=["pos"])
    if w.isalpha() and w not in useless
])
negative_trigram_finder = nltk.collocations.TrigramCollocationFinder.from_words([
    w for w in nltk.corpus.movie_reviews.words(categories=["neg"])
    if w.isalpha() and w not in useless
])

In [44]:
negative_trigram_finder.ngram_fd.most_common(10)

[(('know', 'last', 'summer'), 39),
 (('new', 'york', 'city'), 24),
 (('saturday', 'night', 'live'), 17),
 (('still', 'know', 'last'), 14),
 (('first', 'half', 'hour'), 12),
 (('little', 'known', 'facts'), 12),
 (('film', 'takes', 'place'), 11),
 (('little', 'creaky', 'still'), 11),
 (('creaky', 'still', 'better'), 11),
 (('still', 'better', 'staying'), 11)]

Function to extract features from a review and count how many of its words are among the top 100 positive words:

In [45]:
def find_features(input_text):
    features = {}
    words_counter = 0
    compound_scores = []
    positive_scores = []
    
    for phrase in nltk.sent_tokenize(input_text):
        for word in nltk.word_tokenize(phrase):
            if word.lower() in top_100_positive:
                words_counter += 1
        scores = sia.polarity_scores(phrase)  # score each sentence only once
        compound_scores.append(scores["compound"])
        positive_scores.append(scores["pos"])
        
    features["mean_compound"] = mean(compound_scores) + 1 # adding 1 to make sure the result is positive
    features["mean_positive"] = mean(positive_scores) 
    features["words_counter"] = words_counter
    
    return features

In [46]:
features = [
    (find_features(nltk.corpus.movie_reviews.raw(review)), "pos")
    for review in nltk.corpus.movie_reviews.fileids(categories=["pos"])
]
features.extend([
    (find_features(nltk.corpus.movie_reviews.raw(review)), "neg")
    for review in nltk.corpus.movie_reviews.fileids(categories=["neg"])
])

## Training a Naive Bayes classifier on the movie reviews

In [47]:
train_set = len(features) // 4  # number of examples used for training (a quarter of the data)
shuffle(features) 

nb_classifier = nltk.NaiveBayesClassifier.train(features[:train_set])
nb_classifier.show_most_informative_features(10)

Most Informative Features
           words_counter = 4                 pos : neg    =      4.7 : 1.0
           words_counter = 2                 pos : neg    =      3.3 : 1.0
           words_counter = 1                 pos : neg    =      1.7 : 1.0
           words_counter = 0                 neg : pos    =      1.6 : 1.0
           mean_positive = 0.10004761904761905    neg : pos    =      1.0 : 1.0
           mean_positive = 0.1095            neg : pos    =      1.0 : 1.0
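
The cell above shuffles the data without a seed and trains on the first quarter, so the split (and the table of informative features) changes on every run. A hedged sketch of a reproducible holdout split follows; the `split_features` helper, its default fraction and the toy data are assumptions, not part of the notebook.

```python
import random


def split_features(labeled_features, train_fraction=0.25, seed=0):
    """Shuffle a copy of the data with a fixed seed and split it into (train, test)."""
    shuffled = list(labeled_features)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Toy (feature dict, label) pairs in the shape NLTK classifiers expect
demo = [({"words_counter": i}, "pos" if i % 2 else "neg") for i in range(20)]
train, test = split_features(demo)
print(len(train), len(test))
```

With the real data, the held-out part could then be scored with nltk.classify.accuracy(nb_classifier, test).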


Testing the classifier on some example reviews:

In [48]:
example_review = """You'll have to have your wits about you and your brain fully switched on watching Oppenheimer as it could easily get away from a nonattentive viewer. This is intelligent filmmaking which shows it's audience great respect. It fires dialogue packed with information at a relentless pace and jumps to very different times in Oppenheimer's life continuously through it's 3 hour runtime. There are visual clues to guide the viewer through these times but again you'll have to get to grips with these quite quickly. This relentlessness helps to express the urgency with which the US attacked it's chase for the atomic bomb before Germany could do the same. An absolute career best performance from (the consistenly brilliant) Cillian Murphy anchors the film. This is a nailed on Oscar performance. In fact the whole cast are fantastic (apart maybe for the sometimes overwrought Emily Blunt performance). RDJ is also particularly brilliant in a return to proper acting after his decade or so of calling it in. The screenplay is dense and layered (I'd say it was a thick as a Bible), cinematography is quite stark and spare for the most part but imbued with rich, lucious colour in moments (especially scenes with Florence Pugh), the score is beautiful at times but mostly anxious and oppressive, adding to the relentless pacing. The 3 hour runtime flies by. All in all I found it an intense, taxing but highly rewarding watch. This is film making at it finest. A really great watch."""

In [49]:
example_review_features = find_features(example_review)
nb_classifier.classify(example_review_features)

'pos'

In [50]:
find_features(example_review)

{'mean_compound': 1.3346,
 'mean_positive': 0.19242857142857142,
 'words_counter': 1}

In [51]:
another_review = """The film's universe and settings are fantastic. The casting is really good too, with Gosling excelling in the role of Ken.
Regrettably, however, the film completely misses the chance to appeal to people who grew up with Barbie. Instead, we're left with a film that almost divides and pits men against women, in an attempt to make it a feminist and political film.
Instead of seeing a light, fun film, I saw a rather heavy-handed movie whose storyline is centered around patriarchy.
It's a real shame, because the film's potential is pretty amazing. I came away disappointed, but not totally disgusted either."""

In [52]:
another_review_features = find_features(another_review)
nb_classifier.classify(another_review_features)

'neg'

In [53]:
find_features(another_review)

{'mean_compound': 1.348642857142857,
 'mean_positive': 0.2512857142857143,
 'words_counter': 0}

## Comparing classifiers

In [54]:
from sklearn.naive_bayes import BernoulliNB, ComplementNB, MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

For this notebook, detailed configuration of the classifiers above is not required, so default options are used (apart from a higher max_iter for MLPClassifier).

In [55]:
all_classifiers = {
    "BernoulliNB": BernoulliNB(),
    "ComplementNB": ComplementNB(),
    "MultinomialNB": MultinomialNB(),
    "LogisticRegression": LogisticRegression(),  # instantiate, like the other entries
    "MLPClassifier": MLPClassifier(max_iter=2000),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "DecisionTreeClassifier": DecisionTreeClassifier(),
    "RandomForestClassifier": RandomForestClassifier(),
    "AdaBoostClassifier": AdaBoostClassifier(),
    "QuadraticDiscriminantAnalysis": QuadraticDiscriminantAnalysis(),
}

In [61]:
def run_classifiers_training(sa_classifiers):
    """Train each scikit-learn classifier on the training part and print its accuracy on the rest."""
    for name, sk_classifier in sa_classifiers.items():
        sa_classifier = nltk.classify.SklearnClassifier(sk_classifier)
        sa_classifier.train(features[:train_set])
        accuracy = nltk.classify.accuracy(sa_classifier, features[train_set:])
        print(f"{accuracy:.2%} - {name}")

In [63]:
run_classifiers_training(all_classifiers)

2023/08/17 22:45:34 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID 'b44a7de2ac77491f99e068e749663ebf', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow
