# Basics of NLP

## Natural Language Toolkit

**NLTK** - Natural Language Toolkit

- [NLTK](https://www.nltk.org/) is a Python library for natural language processing.
- NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.

## Preliminary Setup

In [1]:
import nltk

In [2]:
nltk.download()

NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> d

Download which package (l=list; x=cancel)?
  Identifier> l
Packages:
  [ ] abc................. Australian Broadcasting Commission 2006
  [ ] alpino.............. Alpino Dutch Treebank
  [ ] averaged_perceptron_tagger Averaged Perceptron Tagger
  [ ] averaged_perceptron_tagger_ru Averaged Perceptron Tagger (Russian)
  [ ] basque_grammars..... Grammars for Basque
  [ ] bcp47............... BCP-47 Language Tags
  [ ] biocreative_ppi..... BioCreAtIvE (Critical Assessment of Information
                           Extraction Systems in Biology)
  [ ] bllip_wsj_no_aux.... BLLIP Parser: WSJ Model
  [ ] book_grammars....... Grammars from NLTK Book
  [ ] brown............... Brown Corpus
  [ ] brown_tei........... Brown Corpus (TEI XML Version)


    Downloading collection 'all'
       | 
       | Downloading package abc to /root/nltk_data...
       |   Unzipping corpora/abc.zip.
       | Downloading package alpino to /root/nltk_data...
       |   Unzipping corpora/alpino.zip.
       | Downloading package averaged_perceptron_tagger to
       |     /root/nltk_data...
       |   Unzipping taggers/averaged_perceptron_tagger.zip.
       | Downloading package averaged_perceptron_tagger_ru to
       |     /root/nltk_data...
       |   Unzipping taggers/averaged_perceptron_tagger_ru.zip.
       | Downloading package basque_grammars to /root/nltk_data...
       |   Unzipping grammars/basque_grammars.zip.
       | Downloading package bcp47 to /root/nltk_data...
       | Downloading package biocreative_ppi to /root/nltk_data...
       |   Unzipping corpora/biocreative_ppi.zip.
       | Downloading package bllip_wsj_no_aux to /root/nltk_data...
       |   Unzipping models/bllip_wsj_no_aux.zip.
       | Downloading package book_grammars to


---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> q


True

## Keywords in NLTK


1. Tokenizing - splitting text into units: word tokenizers split it into words, sentence tokenizers into sentences.
2. Lexicons and corpora
3. Corpus (plural: corpora) - a body of text. E.g., medical journals, presidential speeches, or any collection of English-language text.
4. Lexicon - a dictionary (words and their meanings).

**Lexicons vs. Corpora:**

* A corpus is the text itself; a lexicon records what the words mean, and that meaning can depend on the domain.
* In a general English lexicon, "bull" is a kind of animal.
* In an investor's lexicon, a "bull" is someone who is positive about the market, as in "bull vs. bear", a well-known stock market reference.

## Tokenize

In [3]:
from nltk.tokenize import sent_tokenize, word_tokenize

In [4]:
example_text = "Hello Mr. Smith, how are you doing today? The weather is great, and Python is awesome. The sky is pinkish-blue. You shouldn't eat cardboard."

In [5]:
print(sent_tokenize(example_text))

['Hello Mr. Smith, how are you doing today?', 'The weather is great, and Python is awesome.', 'The sky is pinkish-blue.', "You shouldn't eat cardboard."]


In [6]:
print(word_tokenize(example_text))

['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The', 'weather', 'is', 'great', ',', 'and', 'Python', 'is', 'awesome', '.', 'The', 'sky', 'is', 'pinkish-blue', '.', 'You', 'should', "n't", 'eat', 'cardboard', '.']


In [7]:
for i in word_tokenize(example_text):
  print(i)

Hello
Mr.
Smith
,
how
are
you
doing
today
?
The
weather
is
great
,
and
Python
is
awesome
.
The
sky
is
pinkish-blue
.
You
should
n't
eat
cardboard
.


## Stop Words

Stop words are words that carry little meaning in the context of the text. They act as filler and can usually be ignored in natural language tasks.

**Example:**
* a
* the
* an
* do
* and, etc.

In [8]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [9]:
example_sentence = "This is an example showing off stop word filtration."

In [10]:
stop_words = set(stopwords.words("english"))

In [11]:
print(stop_words)

{'theirs', 'more', 'ourselves', 'its', 'his', 'we', 'while', "mightn't", 'ain', 'than', 'where', 'yours', 'wasn', 'then', 'but', 'why', 'above', 'between', 'those', 'of', 'hadn', 'both', "didn't", "you've", 'during', 'no', 'by', 'him', 'not', 'doing', "you'd", 'below', 'can', 'this', 'at', 'having', 'on', "couldn't", 'are', 'll', "she's", 'were', 'if', 'as', 'doesn', 'through', 'and', "hadn't", "shan't", "that'll", 'that', 'd', "wasn't", 'such', "it's", 'out', 'yourselves', 'other', "doesn't", 'whom', 'm', 'after', 'shan', 'herself', 'to', 'when', 'there', "wouldn't", 'who', 'so', 've', 'isn', 'myself', 'wouldn', 'is', 'any', 'each', 'what', 'in', 'the', 'them', 'off', "weren't", 'your', 'for', 'which', "shouldn't", 'under', 'again', "aren't", "you'll", 'most', 'himself', 'own', 'they', 'couldn', 'does', 'their', 'o', 'you', "hasn't", 'down', 'didn', 'with', 'or', 'was', 'she', 'don', 'from', 'some', 'should', 'i', 'he', "mustn't", 'been', 'mustn', 'hasn', 're', 'needn', 'had', 'haven'

In [12]:
words = word_tokenize(example_sentence)

In [13]:
for i in words:
  if i in stop_words:
    print(i)

is
an
off


In [14]:
for i in words:
  if i not in stop_words:
    print(i)

This
example
showing
stop
word
filtration
.


In [15]:
filtered_sentence = [w for w in words if w not in stop_words]

In [16]:
print(filtered_sentence)

['This', 'example', 'showing', 'stop', 'word', 'filtration', '.']
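
Note that `This` survives the filter above: the stop word list is lowercase and the membership test is case-sensitive. The following is a minimal sketch of a case-insensitive filter. To stay runnable without the NLTK data, it assumes a small hand-written set (`SMALL_STOP_WORDS`, a hypothetical stand-in for `stopwords.words("english")`) and a plain whitespace split instead of `word_tokenize`.

```python
# Hypothetical stand-in for set(stopwords.words("english")).
SMALL_STOP_WORDS = {"this", "is", "an", "off", "a", "the", "and"}


def remove_stop_words(tokens, stop_words=SMALL_STOP_WORDS):
    """Drop tokens whose lowercase form is a stop word."""
    return [t for t in tokens if t.lower() not in stop_words]


# Whitespace split is used here only to keep the sketch self-contained.
tokens = "This is an example showing off stop word filtration .".split()
filtered = remove_stop_words(tokens)
print(filtered)
```

With the real NLTK list, the same effect would come from comparing `w.lower()` against `stop_words` in the list comprehension.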


## Stemming

- **Stemming**, in Natural Language Processing (NLP), is the process of reducing a word to its stem (root form) by stripping affixes such as suffixes and prefixes, while keeping its core meaning.

- **Eg:**
  1. Writing, written, wrote (word forms) -> write (root word)
  2. Running, runs, ran (word forms) -> run (root word)
  3. Reading, read, reads (word forms) -> read (root word)

- A rule-based stemmer such as Porter only strips regular affixes (writing, runs, reads). Irregular forms such as wrote or ran are left unchanged and need lemmatization instead.
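
The suffix-stripping idea can be sketched in a few lines of plain Python. This is a deliberately naive illustration, not the Porter algorithm: the suffix list, the `min_stem_len` guard, and the function name `naive_stem` are all assumptions made for the example.

```python
# Suffixes are checked in this order; the first match is stripped.
SUFFIXES = ("ing", "ed", "er", "ly", "s")


def naive_stem(word, min_stem_len=3):
    """Strip one known suffix if at least min_stem_len characters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


for w in ["python", "pythoner", "pythoning", "pythoned", "pythons", "ran"]:
    print(w, "->", naive_stem(w))
```

The irregular form `ran` comes back unchanged, which shows the limit of pure suffix stripping.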

In [17]:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

In [18]:
ps = PorterStemmer()

In [19]:
example_words = ["python","pythoner","pythoning","pythoned","pythonly"]

In [20]:
for w in example_words:
  print(ps.stem(w))

python
python
python
python
pythonli


In [21]:
new_text = "It is important to be pythonly while you are pythoning with python. All pythoners have pythoned poorly at least once."

In [22]:
words = word_tokenize(new_text)

In [23]:
for w in words:
  print(ps.stem(w))

it
is
import
to
be
pythonli
while
you
are
python
with
python
.
all
python
have
python
poorli
at
least
onc
.


## Part-of-Speech Tagging

- Part-of-speech (POS) tagging is the process of labeling each word in a text with its part of speech (e.g., noun, verb, adjective). This helps algorithms understand the grammatical structure and meaning of a text and is an important step in natural language processing (NLP).

In [24]:
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer #PunktSentenceTokenizer is an unsupervised algorithm for tokenizing sentences

In [25]:
train_text = state_union.raw("2005-GWBush.txt")

In [26]:
sample_text = state_union.raw("2006-GWBush.txt")

In [27]:
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

In [28]:
tokenized = custom_sent_tokenizer.tokenize(sample_text)

In [29]:
def process_content():
  try:
    for i in tokenized:
      words = nltk.word_tokenize(i)
      tagged = nltk.pos_tag(words)
      print(tagged)

  except Exception as e:
    print(str(e))

In [30]:
process_content()

[('PRESIDENT', 'NNP'), ('GEORGE', 'NNP'), ('W.', 'NNP'), ('BUSH', 'NNP'), ("'S", 'POS'), ('ADDRESS', 'NNP'), ('BEFORE', 'IN'), ('A', 'NNP'), ('JOINT', 'NNP'), ('SESSION', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('CONGRESS', 'NNP'), ('ON', 'NNP'), ('THE', 'NNP'), ('STATE', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('UNION', 'NNP'), ('January', 'NNP'), ('31', 'CD'), (',', ','), ('2006', 'CD'), ('THE', 'NNP'), ('PRESIDENT', 'NNP'), (':', ':'), ('Thank', 'NNP'), ('you', 'PRP'), ('all', 'DT'), ('.', '.')]
[('Mr.', 'NNP'), ('Speaker', 'NNP'), (',', ','), ('Vice', 'NNP'), ('President', 'NNP'), ('Cheney', 'NNP'), (',', ','), ('members', 'NNS'), ('of', 'IN'), ('Congress', 'NNP'), (',', ','), ('members', 'NNS'), ('of', 'IN'), ('the', 'DT'), ('Supreme', 'NNP'), ('Court', 'NNP'), ('and', 'CC'), ('diplomatic', 'JJ'), ('corps', 'NN'), (',', ','), ('distinguished', 'JJ'), ('guests', 'NNS'), (',', ','), ('and', 'CC'), ('fellow', 'JJ'), ('citizens', 'NNS'), (':', ':'), ('Today', 'VB'), ('our', 'PRP$'), ('nat
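
The tags in the output above are Penn Treebank tags. As a reading aid, here is a small sketch that maps some of them to plain descriptions. The legend is partial (only tags that appear in the output above), and the names `PENN_TAGS` and `describe` are made up for this example.

```python
# Partial legend of Penn Treebank POS tags.
PENN_TAGS = {
    "NNP": "proper noun, singular",
    "NNS": "noun, plural",
    "NN": "noun, singular or mass",
    "IN": "preposition or subordinating conjunction",
    "DT": "determiner",
    "CD": "cardinal number",
    "JJ": "adjective",
    "CC": "coordinating conjunction",
    "PRP": "personal pronoun",
    "PRP$": "possessive pronoun",
    "POS": "possessive ending",
    "VB": "verb, base form",
}


def describe(tagged):
    """Replace each tag with its description, if the legend knows it."""
    return [(word, PENN_TAGS.get(tag, "unlisted tag")) for word, tag in tagged]


sample = [("Mr.", "NNP"), ("Speaker", "NNP"), ("members", "NNS"),
          ("of", "IN"), ("Congress", "NNP")]
for word, meaning in describe(sample):
    print(f"{word:10} {meaning}")
```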

## Chunking

- In NLP, **chunking** is the process of grouping the tagged words of a sentence into larger, meaningful phrases ("chunks"), such as noun phrases. A chunk grammar describes each chunk as a pattern over part-of-speech tags.

- The idea mirrors everyday chunking, where grouping bits of information makes them easier to handle and remember. For example, the items someone checks for before leaving the house (**house keys**, **car keys**, **cell phone**, and a **wallet** or **purse**) form a single "pockets" chunk.
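
A chunk grammar, such as the `ChunkGram` used in this section, is a regular expression over POS tags rather than over characters. The sketch below illustrates that idea in plain Python by joining the tags into one string and running an ordinary regex over it. It is a simplified illustration, not NLTK's `RegexpParser`. The helper `chunk_spans` and the hand-tagged sample sentence are assumptions made for the example.

```python
import re

# Same shape as the grammar in this section: optional adverbs, optional verbs,
# one or more proper nouns, then a singular noun.
CHUNK_PATTERN = r"(?:<RB.?>)*(?:<VB.?>)*(?:<NNP>)+<NN>"


def chunk_spans(tagged, pattern):
    """Return the word groups whose tag sequence matches the pattern."""
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)
    spans = []
    for match in re.finditer(pattern, tag_string):
        # Each "<" opens one tag, so counting them converts
        # character offsets back into token indices.
        start = tag_string.count("<", 0, match.start())
        end = tag_string.count("<", 0, match.end())
        spans.append([word for word, _ in tagged[start:end]])
    return spans


sample = [("quickly", "RB"), ("called", "VBD"), ("Senate", "NNP"),
          ("leadership", "NN"), ("today", "NN")]
print(chunk_spans(sample, CHUNK_PATTERN))
```

Because the pattern ends in `<NN>`, a run of proper nouns is only chunked when a singular noun follows it.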



In [31]:
train_text = state_union.raw("2005-GWBush.txt")

In [32]:
sample_text = state_union.raw("2006-GWBush.txt")

In [33]:
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

In [34]:
tokenized = custom_sent_tokenizer.tokenize(sample_text)

In [35]:
def process_content():
  try:
    for i in tokenized:
      words = nltk.word_tokenize(i)
      tagged = nltk.pos_tag(words)

      ChunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>} """
      ChunkParser = nltk.RegexpParser(ChunkGram)
      chunked = ChunkParser.parse(tagged)
      print(chunked)

  except Exception as e:
    print(str(e))

In [36]:
process_content()

Streaming output truncated to the last 5000 lines.
  I/PRP
  am/VBP
  confident/JJ
  in/IN
  our/PRP$
  plan/NN
  for/IN
  victory/NN
  ;/:
  I/PRP
  am/VBP
  confident/JJ
  in/IN
  the/DT
  will/MD
  of/IN
  the/DT
  Iraqi/NNP
  people/NNS
  ;/:
  I/PRP
  am/VBP
  confident/JJ
  in/IN
  the/DT
  skill/NN
  and/CC
  spirit/NN
  of/IN
  our/PRP$
  military/JJ
  ./.)
(S
  Fellow/NNP
  citizens/NNS
  ,/,
  we/PRP
  are/VBP
  in/IN
  this/DT
  fight/NN
  to/TO
  win/VB
  ,/,
  and/CC
  we/PRP
  are/VBP
  winning/VBG
  ./.)
(S (/( Applause/NNP ./. )/))
(S
  The/DT
  road/NN
  of/IN
  victory/NN
  is/VBZ
  the/DT
  road/NN
  that/WDT
  will/MD
  take/VB
  our/PRP$
  troops/NNS
  home/NN
  ./.)
(S
  As/IN
  we/PRP
  make/VBP
  progress/NN
  on/IN
  the/DT
  ground/NN
  ,/,
  and/CC
  Iraqi/NNP
  forces/NNS
  increasingly/RB
  take/VBP
  the/DT
  lead/NN
  ,/,
  we/PRP
  should/MD
  be/VB
  able/JJ
  to/TO
  further/JJ
  decrease/VB
  our/PRP$
  troop/NN
  levels/NNS
  --/:
  but

## Chinking

- Chinking is the process of removing a sequence of tokens from a chunk; the removed sequence is called a chink. Chink patterns, like chunk patterns, are regular expressions adapted to match sequences of POS (part-of-speech) tags. In the grammar, a chunk pattern is written inside `{ }` and a chink pattern inside `} {`.
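
The "chunk everything, then chink out some tags" strategy can be sketched without NLTK. The example below is an assumption-laden illustration: `chink` and `CHINK_TAG` are hypothetical names, and the sample sentence is tagged by hand. Tokens whose tag matches the chink pattern stay outside any chunk, and each remaining run of tokens becomes one chunk.

```python
import re
from itertools import groupby

# Tags to remove from chunks: verbs, prepositions, determiners, and "to".
CHINK_TAG = re.compile(r"VB.?|IN|DT|TO")


def chink(tagged):
    """Split a tagged sentence into chunks (lists) and chinked tokens (tuples)."""
    result = []
    is_chink = lambda pair: bool(CHINK_TAG.fullmatch(pair[1]))
    for chinked, group in groupby(tagged, key=is_chink):
        group = list(group)
        if chinked:
            result.extend(group)   # chinked tokens stay outside any chunk
        else:
            result.append(group)   # a run of remaining tokens forms one chunk
    return result


sample = [("Mr.", "NNP"), ("Speaker", "NNP"), ("of", "IN"),
          ("the", "DT"), ("Supreme", "NNP"), ("Court", "NNP")]
for item in chink(sample):
    print(item)
```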


In [37]:
train_text = state_union.raw("2005-GWBush.txt")

In [38]:
sample_text = state_union.raw("2006-GWBush.txt")

In [39]:
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

In [40]:
tokenized = custom_sent_tokenizer.tokenize(sample_text)

In [41]:
def process_content():
  try:
    for i in tokenized:
      words = nltk.word_tokenize(i)
      tagged = nltk.pos_tag(words)

      ChunkGram = r"""Chunk: {<.*>+}
                                   }<VB.? | IN | DT | TO>{"""
      ChunkParser = nltk.RegexpParser(ChunkGram)
      chunked = ChunkParser.parse(tagged)
      print(chunked)

  except Exception as e:
    print(str(e))

In [42]:
process_content()

(S
  (Chunk PRESIDENT/NNP GEORGE/NNP W./NNP BUSH/NNP 'S/POS ADDRESS/NNP)
  BEFORE/IN
  (Chunk A/NNP JOINT/NNP SESSION/NNP)
  OF/IN
  (Chunk THE/NNP CONGRESS/NNP ON/NNP THE/NNP STATE/NNP)
  OF/IN
  (Chunk
    THE/NNP
    UNION/NNP
    January/NNP
    31/CD
    ,/,
    2006/CD
    THE/NNP
    PRESIDENT/NNP
    :/:
    Thank/NNP
    you/PRP)
  all/DT
  (Chunk ./.))
(S
  (Chunk
    Mr./NNP
    Speaker/NNP
    ,/,
    Vice/NNP
    President/NNP
    Cheney/NNP
    ,/,
    members/NNS)
  of/IN
  (Chunk Congress/NNP ,/, members/NNS)
  of/IN
  the/DT
  (Chunk
    Supreme/NNP
    Court/NNP
    and/CC
    diplomatic/JJ
    corps/NN
    ,/,
    distinguished/JJ
    guests/NNS
    ,/,
    and/CC
    fellow/JJ
    citizens/NNS
    :/:)
  Today/VB
  (Chunk our/PRP$ nation/NN)
  lost/VBD
  a/DT
  beloved/VBN
  (Chunk ,/, graceful/JJ ,/, courageous/JJ woman/NN who/WP)
  called/VBD
  (Chunk America/NNP)
  to/TO
  (Chunk its/PRP$ founding/NN ideals/NNS and/CC)
  carried/VBD
  on/IN
  a/DT
  (Chunk noble/

## Named Entity Recognition

- Named Entity Recognition (NER) is a natural language processing task that involves identifying and classifying named entities in a given text. The named entities can be of various types, such as persons, organizations, locations, dates, monetary values, and so on.

- Named entity recognition can identify and categorize key pieces of information in unstructured text. Once an NER learning model has been trained on textual data and entity types, it automatically analyzes new unstructured text, categorizing named entities and semantic meaning based on its training.

**Named Entity Recognition Type Examples:**

| ENTITY TYPE    | EXAMPLE                  |
|----------------|--------------------------|
| ORGANIZATION   | Tata Prvt. Ltd.          |
| PERSON         | Ratan Tata               |
| LOCATION       | Colaba, Mumbai, India    |
| DATE           | 11/11/2020 or June 2020  |
| MONEY          | 100000 INR, $1000        |
| PERCENT        | twenty pct, 19.45%       |
| FACILITY       | Taj Mahal, Agra          |
| GPE            | ECR, Chennai             |

In [43]:
train_text = state_union.raw("2006-GWBush.txt")
sample_text = state_union.raw("2005-GWBush.txt")

In [44]:
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

In [45]:
tokenized = custom_sent_tokenizer.tokenize(sample_text)

In [46]:
def process_content():
  try:
    for i in tokenized[5:]:
      words = nltk.word_tokenize(i)
      tagged = nltk.pos_tag(words)

      namedEnt = nltk.ne_chunk(tagged, binary=True)
      print(namedEnt)

  except Exception as e:
    print(str(e))

In [47]:
process_content()

Streaming output truncated to the last 5000 lines.
  ./.)
(S (/( (NE Applause/NNP) ./. )/))
(S
  My/PRP$
  Clear/JJ
  Skies/NNPS
  legislation/NN
  will/MD
  cut/VB
  power/NN
  plant/NN
  pollution/NN
  and/CC
  improve/VB
  the/DT
  health/NN
  of/IN
  our/PRP$
  citizens/NNS
  ./.)
(S (/( (NE Applause/NNP) ./. )/))
(S
  And/CC
  my/PRP$
  budget/NN
  provides/VBZ
  strong/JJ
  funding/NN
  for/IN
  leading-edge/JJ
  technology/NN
  --/:
  from/IN
  hydrogen-fueled/JJ
  cars/NNS
  ,/,
  to/TO
  clean/VB
  coal/NN
  ,/,
  to/TO
  renewable/VB
  sources/NNS
  such/JJ
  as/IN
  ethanol/NN
  ./.)
(S (/( (NE Applause/NNP) ./. )/))
(S
  Four/CD
  years/NNS
  of/IN
  debate/NN
  is/VBZ
  enough/JJ
  :/:
  I/PRP
  urge/VBP
  (NE Congress/NNP)
  to/TO
  pass/VB
  legislation/NN
  that/WDT
  makes/VBZ
  (NE America/NNP)
  more/JJR
  secure/NN
  and/CC
  less/RBR
  dependent/JJ
  on/IN
  foreign/JJ
  energy/NN
  ./.)
(S (/( (NE Applause/NNP) ./. )/))
(S
  All/PDT
  these/DT
  prop

## Lemmatizing

- Lemmatization is a text pre-processing technique used in natural language processing (NLP) to reduce a word to its dictionary base form, so that different forms of the same word can be treated alike.
- For example, a lemmatization algorithm would reduce the word "better" to its base form, or lemma, "good".
- Stemming strips the last few characters from a word, which often produces incorrect meanings or spellings. Lemmatization considers how the word is used (for example, its part of speech) and converts it to a meaningful base form, called the lemma.
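
To make the contrast with stemming concrete, here is a toy dictionary-based lemmatizer. It is only a sketch of the lookup idea: `TOY_LEXICON` and `toy_lemmatize` are hypothetical, and a real lemmatizer such as `WordNetLemmatizer` also applies morphological rules on top of its lexicon. The point it illustrates is that the same word can map to different lemmas depending on the part of speech passed in.

```python
# Hypothetical miniature lexicon: (word, part of speech) -> lemma.
TOY_LEXICON = {
    ("better", "a"): "good",
    ("ran", "v"): "run",
    ("geese", "n"): "goose",
}


def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word itself if unknown."""
    return TOY_LEXICON.get((word, pos), word)


print(toy_lemmatize("better"))           # looked up as a noun: no entry found
print(toy_lemmatize("better", pos="a"))  # looked up as an adjective
print(toy_lemmatize("ran", "v"))
```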

In [48]:
from nltk.stem import WordNetLemmatizer

In [49]:
lemmatizer = WordNetLemmatizer()

In [50]:
print(lemmatizer.lemmatize("cats"))

cat


In [51]:
print(lemmatizer.lemmatize("cacti"))
print(lemmatizer.lemmatize("geese"))
print(lemmatizer.lemmatize("rocks"))
print(lemmatizer.lemmatize("python"))
print(lemmatizer.lemmatize("once"))

cactus
goose
rock
python
once


In [52]:
print(lemmatizer.lemmatize("better", pos = "a"))

good


In [53]:
print(lemmatizer.lemmatize("best", pos = "a"))

best


In [54]:
print(lemmatizer.lemmatize("ran", 'v'))

run


In [55]:
print(lemmatizer.lemmatize("ate", 'v'))

eat


## Corpora

- On Windows, the NLTK corpora are stored under the user's roaming application data folder, for example `C:\Users\shibi\AppData\Roaming\nltk_data\corpora` on the author's laptop.
- Many corpus datasets are available there for training and testing.

In [56]:
from nltk.corpus import gutenberg
from nltk.tokenize import sent_tokenize

In [57]:
sample = gutenberg.raw("bible-kjv.txt")

In [58]:
tok = sent_tokenize(sample)

In [59]:
print(tok[5:15])

['1:5 And God called the light Day, and the darkness he called Night.', 'And the evening and the morning were the first day.', '1:6 And God said, Let there be a firmament in the midst of the waters,\nand let it divide the waters from the waters.', '1:7 And God made the firmament, and divided the waters which were\nunder the firmament from the waters which were above the firmament:\nand it was so.', '1:8 And God called the firmament Heaven.', 'And the evening and the\nmorning were the second day.', '1:9 And God said, Let the waters under the heaven be gathered together\nunto one place, and let the dry land appear: and it was so.', '1:10 And God called the dry land Earth; and the gathering together of\nthe waters called he Seas: and God saw that it was good.', '1:11 And God said, Let the earth bring forth grass, the herb yielding\nseed, and the fruit tree yielding fruit after his kind, whose seed is\nin itself, upon the earth: and it was so.', '1:12 And the earth brought forth grass, and

## WordNet

- WordNet is useful for natural language processing tasks as it provides a structured lexical database, offering synonymy, semantic relations, and hierarchical organization, facilitating language understanding and analysis.

- The synonyms are grouped into synsets with short definitions and usage examples. It can thus be seen as a combination and extension of a dictionary and thesaurus.

In [60]:
from nltk.corpus import wordnet

In [61]:
syns = wordnet.synsets("program")

In [62]:
print(syns)

[Synset('plan.n.01'), Synset('program.n.02'), Synset('broadcast.n.02'), Synset('platform.n.02'), Synset('program.n.05'), Synset('course_of_study.n.01'), Synset('program.n.07'), Synset('program.n.08'), Synset('program.v.01'), Synset('program.v.02')]


In [63]:
print(syns[0].name())

plan.n.01


In [64]:
print(syns[0].lemmas()[0].name())

plan


In [65]:
print(syns[0].definition())

a series of steps to be carried out or goals to be accomplished


In [66]:
print(syns[0].examples())

['they drew up a six-step plan', 'they discussed plans for a new bond issue']


In [67]:
synonyms = []

antonyms = []

In [68]:
for syn in wordnet.synsets("good"):
  for l in syn.lemmas():
    synonyms.append(l.name())
    if l.antonyms():
      antonyms.append(l.antonyms()[0].name())

In [69]:
print("synonyms:", set(synonyms))
print("antonyms:", set(antonyms))

synonyms: {'skillful', 'dependable', 'salutary', 'dear', 'unspoiled', 'proficient', 'effective', 'skilful', 'trade_good', 'full', 'goodness', 'commodity', 'sound', 'undecomposed', 'good', 'estimable', 'ripe', 'unspoilt', 'beneficial', 'upright', 'honest', 'safe', 'soundly', 'right', 'respectable', 'honorable', 'in_effect', 'practiced', 'adept', 'in_force', 'serious', 'expert', 'well', 'thoroughly', 'secure', 'near', 'just'}
antonyms: {'evil', 'badness', 'ill', 'evilness', 'bad'}


In [70]:
w1 = wordnet.synset("ship.n.01")
w2 = wordnet.synset("boat.n.01")

In [71]:
print(w1.wup_similarity(w2))

0.9090909090909091


In [72]:
w3 = wordnet.synset("ship.n.01")
w4 = wordnet.synset("car.n.01")

print(w3.wup_similarity(w4))

0.6956521739130435


In [73]:
w5 = wordnet.synset("ship.n.01")
w6 = wordnet.synset("cat.n.01")

print(w5.wup_similarity(w6))

0.32


In [74]:
w7 = wordnet.synset("ship.n.01")
w8 = wordnet.synset("cactus.n.01")

print(w7.wup_similarity(w8))

0.38095238095238093
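
The scores above are Wu-Palmer similarities, which compare two concepts by the depth of their deepest shared ancestor: `2 * depth(lcs) / (depth(a) + depth(b))`. The sketch below applies that formula to a tiny hand-made taxonomy. The `PARENT` tree is a hypothetical stand-in for WordNet's hypernym hierarchy, so its numbers will not match the WordNet scores above.

```python
# Hypothetical toy taxonomy: child -> parent (the root has parent None).
PARENT = {
    "entity": None,
    "vehicle": "entity",
    "animal": "entity",
    "watercraft": "vehicle",
    "car": "vehicle",
    "ship": "watercraft",
    "boat": "watercraft",
    "cat": "animal",
}


def path_to_root(node):
    """Return the node and all of its ancestors, node first, root last."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def depth(node):
    return len(path_to_root(node))  # the root has depth 1


def wup(a, b):
    """Wu-Palmer similarity on the toy tree."""
    ancestors_b = set(path_to_root(b))
    # Walking up from a, the first node shared with b is the deepest one.
    lcs = next(n for n in path_to_root(a) if n in ancestors_b)
    return 2 * depth(lcs) / (depth(a) + depth(b))


print(wup("ship", "boat"))  # 0.75
print(wup("ship", "car"))
print(wup("ship", "cat"))
```

As in the WordNet results, concepts that share a deep ancestor (ship and boat) score higher than concepts that only meet near the root (ship and cat).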


## Text Classification

- Text classification is one of the most common tasks in NLP. It is the process of assigning a label or category to a given piece of text. For example, we can classify emails as spam or not spam, tweets as positive or negative, and articles as relevant or not relevant to a given topic.

- Here, sentiment analysis is performed using text classification.

In [75]:
import nltk
import random
from nltk.corpus import movie_reviews

In [76]:
documents = [(list(movie_reviews.words(fileid)), category)
            for category in movie_reviews.categories()
            for fileid in movie_reviews.fileids(category)]

The list comprehension above can also be written as an explicit loop in the traditional way, but the one-liner is more commonly used because it is more concise.

```
documents = []

for category in movie_reviews.categories():
  for fileid in movie_reviews.fileids(category):
    # append one (word list, category) tuple per review
    documents.append((list(movie_reviews.words(fileid)), category))
```



In [77]:
random.shuffle(documents)

In [78]:
print(documents[1])

(['plot', ':', 'odin', 'is', 'a', 'great', 'high', 'school', 'basketball', 'player', '.', 'he', "'", 's', 'dating', 'a', 'hot', 'girl', 'and', 'the', 'coach', 'loves', 'his', 'ass', '.', 'in', 'fact', ',', 'the', 'coach', 'even', 'admits', 'to', 'having', 'fatherly', 'feelings', 'towards', 'him', '.', 'unfortunately', ',', 'the', 'coach', "'", 's', 'real', 'son', ',', 'hugo', ',', 'isn', "'", 't', 'too', 'pleased', 'to', 'hear', 'that', '.', 'in', 'fact', ',', 'he', 'doesn', "'", 't', 'like', 'hearing', 'about', 'any', 'of', 'odin', "'", 's', 'triumphs', ',', 'as', 'they', 'generally', 'supersede', 'his', 'own', '.', 'so', 'what', 'does', 'he', 'set', 'out', 'to', 'do', '?', 'well', ',', 'let', "'", 's', 'just', 'say', 'that', 'he', 'starts', 'to', 'mess', 'with', 'people', "'", 's', 'heads', 'and', 'one', 'thing', 'leads', 'to', 'another', 'thing', 'which', 'leads', 'to', '.', '.', '.', 'well', ',', 'you', "'", 'll', 'see', '.', 'critique', ':', 'a', 'very', 'powerful', ',', 'thorough

In [79]:
all_words = []

In [80]:
for w in movie_reviews.words():
  all_words.append(w.lower())

In [81]:
all_words = nltk.FreqDist(all_words)

In [82]:
print(all_words.most_common(15))
print(all_words["stupid"])

[(',', 77717), ('the', 76529), ('.', 65876), ('a', 38106), ('and', 35576), ('of', 34123), ('to', 31937), ("'", 30585), ('is', 25195), ('in', 21822), ('s', 18513), ('"', 17612), ('it', 16107), ('that', 15924), ('-', 15595)]
253


## Words as Features

In [83]:
import nltk
import random
from nltk.corpus import movie_reviews

In [84]:
documents = [(list(movie_reviews.words(fileid)), category)
            for category in movie_reviews.categories()
            for fileid in movie_reviews.fileids(category)]

In [85]:
random.shuffle(documents)

In [86]:
all_words = []

In [87]:
for w in movie_reviews.words():
  all_words.append(w.lower())

In [88]:
all_words = nltk.FreqDist(all_words)

- FreqDist is a subclass of dict that holds a frequency distribution: it records how many times each sample occurs.
- Formally, a frequency distribution can be defined as a function mapping from each sample to the number of times that sample occurred as an outcome.
- Here, FreqDist is used to find the most common words by counting word frequencies in the movie_reviews corpus.
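
The counting behaviour can be tried without NLTK, since in NLTK 3 `FreqDist` builds on `collections.Counter`. The sketch below uses a plain `Counter` on a made-up sentence; the token list is an assumption for illustration only.

```python
from collections import Counter

# Made-up tokens standing in for movie_reviews.words().
tokens = "the plot was thin the acting was flat the end".split()

freq = Counter(w.lower() for w in tokens)

print(freq.most_common(2))  # [('the', 3), ('was', 2)]
print(freq["the"])          # 3
print(freq["stupid"])       # 0, missing samples count as zero
```

`most_common(n)` returns the `n` most frequent samples, which is the usual way to pick a vocabulary by frequency.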

In [89]:
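# keys() follows first-seen order, not frequency; all_words.most_common(3000) would pick the most frequent words
word_features = list(all_words.keys())[:3000]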

In [90]:
def find_features(document):
  words = set(document)
  features = {}
  for w in word_features:
    features[w] = (w in words)
    ## print(w, ":", features)

  return features

In [91]:
print((find_features(movie_reviews.words('neg/cv000_29416.txt'))))



In [92]:
featuresets = [(find_features(rev), category) for (rev, category) in documents]
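
To see what one feature set looks like, here is a small self-contained sketch of the same idea. It assumes a three-word vocabulary and a made-up review, and unlike the cell above it takes `word_features` as a parameter so that it runs on its own.

```python
def find_features(document, word_features):
    """Map each vocabulary word to True/False: does the document contain it?"""
    words = set(document)
    return {w: (w in words) for w in word_features}


word_features = ["great", "boring", "plot"]  # hypothetical tiny vocabulary
review = ["the", "plot", "was", "great"]     # hypothetical tokenized review

features = find_features(review, word_features)
print(features)  # {'great': True, 'boring': False, 'plot': True}
```

Each entry of `featuresets` pairs one such dictionary with the review's category label.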

## Naive Bayes

The Naive Bayes algorithm is a popular machine learning algorithm used for text classification tasks in Natural Language Processing (NLP). It is a probabilistic classifier that makes predictions based on the assumption that the features (in this case, the words in a text) are independent of each other given the class.

---
Naive Bayes classifier calculates the probability of an event in the following steps:
- **Step 1**: Calculate the prior probability for each class label.
- **Step 2**: Find the likelihood of each attribute for each class.
- **Step 3**: Put these values into Bayes' formula and calculate the posterior probability.

In sentiment analysis, the Naive Bayes algorithm is used to determine whether a movie review, or any other sentence fed to it, is positive or negative.

Bayes' theorem states that P(A|B) = P(B|A) x P(A) / P(B), where P(A|B) is the probability of event A happening given that event B has occurred ("|" reads as "given"), P(A) is the probability of event A, and P(B) is the probability of event B.
- The posterior probability P(y|X) can be calculated by first creating a frequency table for each attribute against the target, then converting the frequency tables into likelihood tables, and finally using the Naive Bayes equation to calculate the posterior probability for each class.

```
posterior = prior x likelihood / evidence

```

```
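
The three steps can be followed on a toy example. The sketch below is a hand-rolled calculation with made-up priors and likelihoods, not the implementation inside `nltk.NaiveBayesClassifier`. The names `priors`, `likelihood`, and `posterior` are assumptions for illustration.

```python
# Step 1: prior probability of each class (made-up numbers).
priors = {"pos": 0.5, "neg": 0.5}

# Step 2: P(word is present | class), also made up.
likelihood = {
    "pos": {"great": 0.6, "boring": 0.05},
    "neg": {"great": 0.1, "boring": 0.4},
}


def posterior(features):
    """Step 3: combine prior and likelihoods, then normalize by the evidence."""
    unnormalized = {}
    for label, prior in priors.items():
        p = prior
        for word, present in features.items():
            p_word = likelihood[label][word]
            # Naive assumption: each word contributes independently.
            p *= p_word if present else (1.0 - p_word)
        unnormalized[label] = p
    evidence = sum(unnormalized.values())
    return {label: p / evidence for label, p in unnormalized.items()}


result = posterior({"great": True, "boring": False})
print(result)
print(max(result, key=result.get))  # pos
```

A review that contains "great" but not "boring" ends up with most of the probability mass on the positive class.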

In [93]:
import nltk

In [94]:
print(len(featuresets))

2000


The 2000 feature sets are split into 1900 for training and 100 for testing.

In [95]:
training_set = featuresets[:1900]

testing_set = featuresets[1900:]

In [96]:
classifier = nltk.NaiveBayesClassifier.train(training_set)

In [97]:
print(" Naive Bayes Classifier accuracy percent:",(nltk.classify.accuracy(classifier, testing_set))*100)

 Naive Bayes Classifier accuracy percent: 87.0


In [98]:
classifier.show_most_informative_features(15)

Most Informative Features
                   sucks = True              neg : pos    =     10.4 : 1.0
                     ugh = True              neg : pos    =      9.8 : 1.0
                 idiotic = True              neg : pos    =      9.7 : 1.0
                 frances = True              pos : neg    =      8.8 : 1.0
           unimaginative = True              neg : pos    =      7.8 : 1.0
                  annual = True              pos : neg    =      7.5 : 1.0
              schumacher = True              neg : pos    =      7.1 : 1.0
                   kudos = True              pos : neg    =      6.5 : 1.0
                  regard = True              pos : neg    =      6.5 : 1.0
                  shoddy = True              neg : pos    =      6.4 : 1.0
             silverstone = True              neg : pos    =      6.4 : 1.0
               atrocious = True              neg : pos    =      6.3 : 1.0
                 cunning = True              pos : neg    =      6.2 : 1.0

## Save Classifier with Pickle

Pickle in Python is primarily used in serializing and deserializing a Python object structure. In other words, it's the process of converting a Python object into a byte stream to store it in a file/database, maintain program state across sessions, or transport data over the network.

In [99]:
import pickle

In [100]:
with open("naivebayes.pickle", "wb") as save_classifier:
  pickle.dump(classifier, save_classifier)

**Load Pickle file**

---

In [101]:
with open("naivebayes.pickle", "rb") as classifier_f:
  classifier_saved = pickle.load(classifier_f)

In [102]:
print(" Naive Bayes Classifier accuracy percent:",(nltk.classify.accuracy(classifier_saved, testing_set))*100)

 Naive Bayes Classifier accuracy percent: 87.0


---

- "wb" opens the file for writing in binary mode.
- "rb" opens the file for reading in binary mode.
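
The save and load steps can be tried end to end with a small stand-in object. In this sketch, the `model` dictionary is a hypothetical placeholder for a trained classifier, and a temporary directory is used so that no file is left behind.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a trained classifier.
model = {"sucks": "neg", "kudos": "pos"}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pickle")

    # "wb": serialize the object to a byte stream on disk.
    with open(path, "wb") as f:
        pickle.dump(model, f)

    # "rb": read the bytes back and rebuild an equal object.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == model)  # True
```

The `with` statement closes each file automatically, even if an error occurs while dumping or loading.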

---

## Scikit-Learn incorporation

- Scikit-Learn, also known as sklearn, is a Python library for machine learning and statistical modelling. It provides models for regression, classification, and clustering, along with statistical tools for analyzing these models.

In [103]:
# SklearnClassifier is a wrapper that lets scikit-learn estimators be used through NLTK's classifier interface
from nltk.classify.scikitlearn import SklearnClassifier

In [104]:
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC

In [105]:
# importing some scikitlearn algorithms

from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB

- **Multinomial Naive Bayes (MNB)**: Multinomial Naive Bayes is a very popular and efficient machine learning algorithm that is based on Bayes' theorem. It is commonly used for text classification tasks where we need to deal with discrete data like word counts in documents.

- **Gaussian Naive Bayes (GNB)**: Gaussian Naive Bayes is a probabilistic classification technique that assumes the continuous features within each class follow a normal (Gaussian) distribution. Like the other Naive Bayes variants, it treats each feature as contributing independently to the prediction.

- **Bernoulli Naive Bayes (BNB)**: Bernoulli Naive Bayes is a part of the Naive Bayes family. It is based on the Bernoulli Distribution and accepts only binary values, i.e., 0 or 1. If the features of the dataset are binary, then we can assume that Bernoulli Naive Bayes is the algorithm to be used.

In [106]:
MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)

## GNB_classifier = SklearnClassifier(GaussianNB())
## GNB_classifier.train(training_set)

BNB_classifier = SklearnClassifier(BernoulliNB())
BNB_classifier.train(training_set)

<SklearnClassifier(BernoulliNB())>

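`GNB_classifier` cannot be trained this way: `SklearnClassifier` hands the features to the estimator as a sparse matrix by default, while `GaussianNB` requires dense data (constructing the wrapper with `sparse=False` is the usual workaround). Training it as written produces the following error: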

```
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-133-967b6b7d4038> in <cell line: 2>()
      1 GNB_classifier = SklearnClassifier(GaussianNB())
----> 2 GNB_classifier.train(training_set)

6 frames
/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py in _ensure_sparse_format(spmatrix, accept_sparse, dtype, copy, force_all_finite, accept_large_sparse, estimator_name, input_name)
    520
    521     if accept_sparse is False:
--> 522         raise TypeError(
    523             "A sparse matrix was passed, but dense "
    524             "data is required. Use X.toarray() to "

TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.

```

In [107]:
print(" NLTK-NB_classifier accuracy percent:",(nltk.classify.accuracy(classifier_saved, testing_set))*100)

print(" MNB_classifier accuracy percent:",(nltk.classify.accuracy(MNB_classifier, testing_set))*100)

# print(" GNB_classifier accuracy percent:",(nltk.classify.accuracy(GNB_classifier, testing_set))*100)

print(" BNB_classifier accuracy percent:",(nltk.classify.accuracy(BNB_classifier, testing_set))*100)

 NLTK-NB_classifier accuracy percent: 87.0
 MNB_classifier accuracy percent: 87.0
 BNB_classifier accuracy percent: 87.0


In [108]:
LR_classifier = SklearnClassifier(LogisticRegression())
LR_classifier.train(training_set)

SGD_classifier = SklearnClassifier(SGDClassifier())
SGD_classifier.train(training_set)

SV_classifier = SklearnClassifier(SVC())
SV_classifier.train(training_set)

LSV_classifier = SklearnClassifier(LinearSVC())
LSV_classifier.train(training_set)

NSV_classifier = SklearnClassifier(NuSVC())
NSV_classifier.train(training_set)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


<SklearnClassifier(NuSVC())>

In [109]:
print(" LR_classifier accuracy percent:",(nltk.classify.accuracy(LR_classifier, testing_set))*100)

print(" SGD_classifier accuracy percent:",(nltk.classify.accuracy(SGD_classifier, testing_set))*100)

print(" SV_classifier accuracy percent:",(nltk.classify.accuracy(SV_classifier, testing_set))*100)

print(" LSV_classifier accuracy percent:",(nltk.classify.accuracy(LSV_classifier, testing_set))*100)

print(" NSV_classifier accuracy percent:",(nltk.classify.accuracy(NSV_classifier, testing_set))*100)

 LR_classifier accuracy percent: 84.0
 SGD_classifier accuracy percent: 84.0
 SV_classifier accuracy percent: 90.0
 LSV_classifier accuracy percent: 81.0
 NSV_classifier accuracy percent: 89.0


---

Abbreviations of the classifiers used above, for quick reference:

- NLTK-NB_classifier -> Naive Bayes Classifier in NLTK library
- MNB_classifier -> Multinomial Naive Bayes Classifier in Sklearn library
- GNB_classifier -> Gaussian Naive Bayes Classifier in Sklearn library
- BNB_classifier -> Bernoulli Naive Bayes Classifier in Sklearn library
- LR_classifier -> Logistic Regression Classifier in Sklearn library
- SGD_classifier -> Stochastic Gradient Descent Classifier in Sklearn library
- SV_classifier -> Support Vector Classifier in Sklearn library
- LSV_classifier -> Linear Support Vector Classifier in Sklearn library
- NSV_classifier -> Nu Support Vector Classifier in Sklearn library

---

## Combining Algorithms

The basic idea behind this approach is to leverage the strengths of different algorithms and combine their outputs into a more robust and reliable result. Several classifiers each predict a label, and a voting mechanism picks the final output: the label that receives the most votes. The confidence score is the share of classifiers that voted for the winning label.

- If the score is 100%, all the models agree on the same prediction.
- If the score is below 100%, some models disagree with the winning label. For example, if 6 out of 8 models vote for the winning label and 2 vote against it, the score is 75%.
- The score can never be 0%, because the winning label always has at least one vote. With two labels and an odd number of models, it is always above 50%.
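The confidence score is the share of votes that went to the winning label. A small worked example with hypothetical votes from seven classifiers:

```python
from statistics import mode

# Hypothetical votes from 7 classifiers for a single review.
votes = ["pos", "pos", "neg", "pos", "neg", "pos", "pos"]

winner = mode(votes)                           # most common label
confidence = votes.count(winner) / len(votes)  # share of votes for the winner

print(winner, round(confidence * 100, 1))  # pos 71.4
```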

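We will use the following seven models. `SV_classifier` is left out; with an odd number of voters, a two-label vote cannot end in a tie.
1. NLTK-NB_classifier
2. MNB_classifier
3. BNB_classifier
4. LR_classifier
5. SGD_classifier
6. LSV_classifier
7. NSV_classifier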

In [110]:
from nltk.classify import ClassifierI
from statistics import mode

In [111]:
class VoteClassifier(ClassifierI):

  def __init__(self, *classifiers):
    self._classifiers = classifiers

  def classify(self, features):
    votes = []
    for classifier in self._classifiers:
      v = classifier.classify(features)
      votes.append(v)
    return mode(votes)

  def confidence(self, features):
    votes = []
    for classifier in self._classifiers:
      v = classifier.classify(features)
      votes.append(v)
    choice_votes = votes.count(mode(votes))
    conf = choice_votes / len(votes)
    return conf
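
`classify` and `confidence` above each query every classifier, so asking for both doubles the work. A minimal sketch of an alternative, assuming only that each voter has a `classify(features)` method. The helper name `vote_with_confidence` and the stub voters are hypothetical and not part of NLTK.

```python
from collections import Counter

def vote_with_confidence(classifiers, features):
    """Return (winning_label, share_of_votes) from one pass over the voters."""
    votes = [c.classify(features) for c in classifiers]
    # most_common(1) gives the label with the highest count; on a tie it
    # returns the label that was seen first.
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)

class FixedVoter:
    """Stand-in classifier that always answers with the same label."""
    def __init__(self, label):
        self.label = label

    def classify(self, features):
        return self.label

voters = [FixedVoter("pos"), FixedVoter("neg"), FixedVoter("pos")]
label, conf = vote_with_confidence(voters, {"great": True})
print(label, conf)
```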

In [112]:
voted_classifier = VoteClassifier(classifier_saved,
                                  MNB_classifier,
                                  BNB_classifier,
                                  LR_classifier,
                                  SGD_classifier,
                                  LSV_classifier,
                                  NSV_classifier)

In [113]:
print("voted_classifier accuracy percent:",(nltk.classify.accuracy(voted_classifier, testing_set))*100)

voted_classifier accuracy percent: 88.0


In [114]:
print("Classification: ", voted_classifier.classify(testing_set[0][0]), "Confidence %: ", voted_classifier.confidence(testing_set[0][0]) * 100)
print("Classification: ", voted_classifier.classify(testing_set[1][0]), "confidence %: ", voted_classifier.confidence(testing_set[1][0]) * 100)
print("Classification: ", voted_classifier.classify(testing_set[2][0]), "confidence %: ", voted_classifier.confidence(testing_set[2][0]) * 100)
print("Classification: ", voted_classifier.classify(testing_set[3][0]), "confidence %: ", voted_classifier.confidence(testing_set[3][0]) * 100)
print("Classification: ", voted_classifier.classify(testing_set[4][0]), "confidence %: ", voted_classifier.confidence(testing_set[4][0]) * 100)
print("Classification: ", voted_classifier.classify(testing_set[5][0]), "confidence %: ", voted_classifier.confidence(testing_set[5][0]) * 100)

Classification:  neg Confidence %:  100.0
Classification:  neg confidence %:  100.0
Classification:  neg confidence %:  100.0
Classification:  neg confidence %:  100.0
Classification:  pos confidence %:  100.0
Classification:  neg confidence %:  100.0


## Investigating Bias

Everything up to this point sets the stage for investigating bias. Here we repeat the process from the section "**Words as features**", leaving out the parts that are not needed for this investigation, such as the random shuffle and printing the confidence. The sections before that are just preliminaries.

In [115]:
import nltk
import random
from nltk.corpus import movie_reviews
from nltk.classify.scikitlearn import SklearnClassifier

from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC

from nltk.classify import ClassifierI
from statistics import mode

import pickle

In [116]:
class VoteClassifier(ClassifierI):

  def __init__(self, *classifiers):
    self._classifiers = classifiers

  def classify(self, features):
    votes = []
    for c in self._classifiers:
      v = c.classify(features)
      votes.append(v)
    return mode(votes)

  def confidence(self, features):
    votes = []
    for c in self._classifiers:
      v = c.classify(features)
      votes.append(v)
    choice_votes = votes.count(mode(votes))
    conf = choice_votes / len(votes)
    return conf

In [117]:
documents = [(list(movie_reviews.words(fileid)), category)
            for category in movie_reviews.categories()
            for fileid in movie_reviews.fileids(category)]

In [118]:
all_words = []

In [119]:
for w in movie_reviews.words():
  all_words.append(w.lower())

In [120]:
all_words = nltk.FreqDist(all_words)

In [121]:
word_features = list(all_words.keys())[:3000]

In [122]:
def find_features(document):
  words = set(document)
  features = {}
  for w in word_features:
    features[w] = (w in words)

  return features

In [123]:
featuresets = [(find_features(rev), category) for (rev, category) in documents]

In [124]:
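# "positive" dataset: train on the first 1500 feature sets, test on the last 500
training_set = featuresets[:1500]
testing_set = featuresets[1500:]

# "negative" dataset: train on the last 1500 feature sets, test on the first 500
# (both splits sit in one cell, so this second assignment is the one in effect)
training_set = featuresets[500:]
testing_set = featuresets[:500]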

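- The positive and the negative dataset have the same sizes: 1500 feature sets for training and 500 for testing. They differ only in which end of `featuresets` is held out for testing.
- Because the shuffle is omitted, `featuresets` keeps the corpus order, with all reviews of one category before those of the other. Each 500-item test set therefore contains reviews of a single sentiment, which is what the names "positive" and "negative" refer to.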



```
- training_set in positive dataset -> first 1500 items (featuresets[:1500])
- testing_set in positive dataset -> last 500 items (featuresets[1500:])

- training_set in negative dataset -> last 1500 items (featuresets[500:])
- testing_set in negative dataset -> first 500 items (featuresets[:500])

```
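
To see why the missing shuffle matters, the sketch below mimics only the labels of `featuresets`. It assumes the corpus order that the list comprehension above produces, 1000 `neg` reviews followed by 1000 `pos` reviews, and counts the labels in each slice.

```python
from collections import Counter

# Assumed label order of the unshuffled featuresets: all "neg", then all "pos".
labels = ["neg"] * 1000 + ["pos"] * 1000

# "Positive" dataset: train on the first 1500, test on the last 500.
pos_train, pos_test = labels[:1500], labels[1500:]
# "Negative" dataset: train on the last 1500, test on the first 500.
neg_train, neg_test = labels[500:], labels[:500]

print(Counter(pos_train), Counter(pos_test))
print(Counter(neg_train), Counter(neg_test))
```

Under that assumption, each test set holds a single label while its training set is skewed 2:1 toward the other label, so the accuracy measures how well the classifiers recognise one sentiment only.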


In [125]:
classifier = nltk.NaiveBayesClassifier.train(training_set)

In [126]:
save_classifier = open("naivebayes.pickle", "wb")
pickle.dump(classifier, save_classifier)
save_classifier.close()

In [127]:
classifier_f = open("naivebayes.pickle", "rb")
classifier_saved = pickle.load(classifier_f)
classifier_f.close()

In [128]:
MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)

BNB_classifier = SklearnClassifier(BernoulliNB())
BNB_classifier.train(training_set)

LR_classifier = SklearnClassifier(LogisticRegression())
LR_classifier.train(training_set)

SGD_classifier = SklearnClassifier(SGDClassifier())
SGD_classifier.train(training_set)

LSV_classifier = SklearnClassifier(LinearSVC())
LSV_classifier.train(training_set)

NSV_classifier = SklearnClassifier(NuSVC())
NSV_classifier.train(training_set)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


<SklearnClassifier(NuSVC())>

In [129]:
print(" NLTK-NB_classifier accuracy percent:",(nltk.classify.accuracy(classifier_saved, testing_set))*100)

print(" MNB_classifier accuracy percent:",(nltk.classify.accuracy(MNB_classifier, testing_set))*100)

print(" BNB_classifier accuracy percent:",(nltk.classify.accuracy(BNB_classifier, testing_set))*100)

print(" LR_classifier accuracy percent:",(nltk.classify.accuracy(LR_classifier, testing_set))*100)

print(" SGD_classifier accuracy percent:",(nltk.classify.accuracy(SGD_classifier, testing_set))*100)

print(" LSV_classifier accuracy percent:",(nltk.classify.accuracy(LSV_classifier, testing_set))*100)

print(" NSV_classifier accuracy percent:",(nltk.classify.accuracy(NSV_classifier, testing_set))*100)

 NLTK-NB_classifier accuracy percent: 77.0
 MNB_classifier accuracy percent: 75.4
 BNB_classifier accuracy percent: 76.6
 LR_classifier accuracy percent: 69.19999999999999
 SGD_classifier accuracy percent: 70.6
 LSV_classifier accuracy percent: 69.19999999999999
 NSV_classifier accuracy percent: 60.6


In [130]:
voted_classifier = VoteClassifier(classifier_saved,
                                  MNB_classifier,
                                  BNB_classifier,
                                  LR_classifier,
                                  SGD_classifier,
                                  LSV_classifier,
                                  NSV_classifier)

In [131]:
print(" Voted_classifier accuracy percent:",(nltk.classify.accuracy(voted_classifier, testing_set))*100)

 Voted_classifier accuracy percent: 69.8


## Better Training Data

In [132]:
import nltk
import random
from nltk.corpus import movie_reviews
from nltk.classify.scikitlearn import SklearnClassifier

from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC

from nltk.classify import ClassifierI
from statistics import mode

import pickle

from nltk.tokenize import word_tokenize

In [133]:
class VoteClassifier(ClassifierI):

  def __init__(self, *classifiers):
    self._classifiers = classifiers

  def classify(self, features):
    votes = []
    for c in self._classifiers:
      v = c.classify(features)
      votes.append(v)
    return mode(votes)

  def confidence(self, features):
    votes = []
    for c in self._classifiers:
      v = c.classify(features)
      votes.append(v)
    choice_votes = votes.count(mode(votes))
    conf = choice_votes / len(votes)
    return conf

The default movie review dataset is comparatively small, so we will use a larger one.
- The dataset used below has roughly 5000 reviews for each sentiment, stored one review per line in `positive.txt` and `negative.txt`.

In [135]:
try:
    short_pos = open("positive.txt", "r", encoding="utf-8").read()
except UnicodeDecodeError:
    short_pos = open("positive.txt", "r", encoding="iso-8859-1").read()

try:
    short_neg = open("negative.txt", "r", encoding="utf-8").read()
except UnicodeDecodeError:
    short_neg = open("negative.txt", "r", encoding="iso-8859-1").read()

The `try`/`except` blocks above avoid an encoding error when reading the files: if a file cannot be decoded as UTF-8, it is read again as ISO-8859-1.
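
The same fallback can be wrapped in a helper so that both files share one code path and are closed after reading. This is a minimal sketch: the function name `read_text_with_fallback` is hypothetical, and the demo uses a temporary file rather than `positive.txt`.

```python
import os
import tempfile

def read_text_with_fallback(path, encodings=("utf-8", "iso-8859-1")):
    """Read a text file, trying each encoding in turn."""
    for encoding in encodings:
        try:
            # The with-block closes the file even if decoding fails.
            with open(path, "r", encoding=encoding) as f:
                return f.read()
        except UnicodeDecodeError:
            continue
    raise ValueError(f"could not decode {path} with any of {encodings}")

# Demo on a temporary file holding a byte that is invalid as UTF-8.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(b"a caf\xe9 scene\n")
demo_text = read_text_with_fallback(tmp.name)
os.remove(tmp.name)
print(demo_text)
```

With the notebook's files this would be called as `short_pos = read_text_with_fallback("positive.txt")`.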

In [136]:
documents = []

In [137]:
for r in short_pos.split('\n'):
  documents.append((r, "pos"))

for r in short_neg.split('\n'):
  documents.append((r, "neg"))

In [138]:
all_words = []

In [139]:
short_pos_words = word_tokenize(short_pos)
short_neg_words = word_tokenize(short_neg)

In [140]:
for w in short_pos_words:
  all_words.append(w.lower())

for w in short_neg_words:
  all_words.append(w.lower())

In [141]:
all_words = nltk.FreqDist(all_words)

In [142]:
word_features = list(all_words.keys())[:5000]
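
Note that `list(all_words.keys())[:5000]` takes the first 5000 distinct words in the order they were first seen, not the 5000 most frequent ones. If the most frequent words are what is wanted, `most_common` is one way to get them. `nltk.FreqDist` is a subclass of `collections.Counter`, so the sketch below uses a plain `Counter` on a made-up word list to stay self-contained.

```python
from collections import Counter

toy_words = ["the", "plot", "was", "dull", "dull", "dull", "but", "the", "cast", "shone"]
toy_counts = Counter(toy_words)  # stand-in for nltk.FreqDist(toy_words)

# Keys come back in order of first appearance.
first_seen = list(toy_counts.keys())[:2]
# most_common(n) returns (word, count) pairs sorted by count, highest first.
most_frequent = [word for word, _ in toy_counts.most_common(2)]

print(first_seen)      # ['the', 'plot']
print(most_frequent)   # ['dull', 'the']
```

Switching the notebook to `most_common` would change the feature set, so the accuracies reported here would no longer apply.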

In [143]:
def find_features(document):
  words = word_tokenize(document)
  features = {}
  for w in word_features:
    features[w] = (w in words)

  return features
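
One thing to watch in `find_features`: `word_features` was built from lower-cased tokens, while the document's tokens are compared as they are, so a capitalised word in a review never matches. The variant below is a sketch, not the code that produced the results in this notebook. It lower-cases the tokens and stores them in a set for faster lookups. To stay self-contained it takes the word list as a parameter and defaults to `str.split` as the tokenizer; `word_tokenize` could be passed instead.

```python
def find_features(document, word_features, tokenize=str.split):
    """Map each feature word to True/False depending on whether it occurs."""
    # Lower-case to match how word_features was built; a set makes each
    # membership test a constant-time lookup instead of a scan over a list.
    words = {token.lower() for token in tokenize(document)}
    return {w: (w in words) for w in word_features}

toy_features = find_features("Great cast , dull plot", ["great", "dull", "boring"])
print(toy_features)  # {'great': True, 'dull': True, 'boring': False}
```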

In [144]:
featuresets = [(find_features(rev), category) for (rev, category) in documents]

In [145]:
random.shuffle(featuresets)

In [146]:
training_set = featuresets[:10000]
testing_set = featuresets[10000:]

In [147]:
classifier = nltk.NaiveBayesClassifier.train(training_set)

In [148]:
save_classifier = open("naivebayes.pickle", "wb")
pickle.dump(classifier, save_classifier)
save_classifier.close()

In [149]:
classifier_f = open("naivebayes.pickle", "rb")
classifier_saved = pickle.load(classifier_f)
classifier_f.close()

In [150]:
MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)

BNB_classifier = SklearnClassifier(BernoulliNB())
BNB_classifier.train(training_set)

LR_classifier = SklearnClassifier(LogisticRegression())
LR_classifier.train(training_set)

SGD_classifier = SklearnClassifier(SGDClassifier())
SGD_classifier.train(training_set)

LSV_classifier = SklearnClassifier(LinearSVC())
LSV_classifier.train(training_set)

NSV_classifier = SklearnClassifier(NuSVC())
NSV_classifier.train(training_set)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


<SklearnClassifier(NuSVC())>

In [151]:
print(" NLTK-NB_classifier accuracy percent:",(nltk.classify.accuracy(classifier_saved, testing_set))*100)

print(" MNB_classifier accuracy percent:",(nltk.classify.accuracy(MNB_classifier, testing_set))*100)

print(" BNB_classifier accuracy percent:",(nltk.classify.accuracy(BNB_classifier, testing_set))*100)

print(" LR_classifier accuracy percent:",(nltk.classify.accuracy(LR_classifier, testing_set))*100)

print(" SGD_classifier accuracy percent:",(nltk.classify.accuracy(SGD_classifier, testing_set))*100)

print(" LSV_classifier accuracy percent:",(nltk.classify.accuracy(LSV_classifier, testing_set))*100)

print(" NSV_classifier accuracy percent:",(nltk.classify.accuracy(NSV_classifier, testing_set))*100)

 NLTK-NB_classifier accuracy percent: 74.84939759036145
 MNB_classifier accuracy percent: 73.79518072289156
 BNB_classifier accuracy percent: 75.0
 LR_classifier accuracy percent: 72.59036144578313
 SGD_classifier accuracy percent: 71.3855421686747
 LSV_classifier accuracy percent: 72.28915662650603
 NSV_classifier accuracy percent: 73.94578313253012


In [152]:
voted_classifier = VoteClassifier(classifier_saved,
                                  MNB_classifier,
                                  BNB_classifier,
                                  LR_classifier,
                                  SGD_classifier,
                                  LSV_classifier,
                                  NSV_classifier)

In [153]:
print(" Voted_classifier accuracy percent:",(nltk.classify.accuracy(voted_classifier, testing_set))*100)

 Voted_classifier accuracy percent: 74.09638554216868
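
Overall accuracy does not show whether the errors concentrate on one sentiment, which is what the bias investigation probed with single-sentiment test sets. A minimal sketch of a per-label breakdown on a mixed test set follows. The helper `accuracy_by_label` and the stub classifier are hypothetical; the helper assumes only a `classify(features)` method and `(features, label)` pairs.

```python
from collections import defaultdict

def accuracy_by_label(classifier, labeled_featuresets):
    """Map each true label to the fraction of its items classified correctly."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for features, label in labeled_featuresets:
        total[label] += 1
        if classifier.classify(features) == label:
            correct[label] += 1
    return {label: correct[label] / total[label] for label in total}

class AlwaysNeg:
    """Stand-in classifier that is fully biased toward "neg"."""
    def classify(self, features):
        return "neg"

toy_test_set = [({}, "neg"), ({}, "neg"), ({}, "pos"), ({}, "pos")]
scores = accuracy_by_label(AlwaysNeg(), toy_test_set)
print(scores)  # {'neg': 1.0, 'pos': 0.0}
```

With the notebook's objects this would be called as `accuracy_by_label(voted_classifier, testing_set)`.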
