# Natural Language Processing

## 1. [Tokenizing Words and Sentences with NLTK](https://pythonprogramming.net/tokenizing-words-sentences-nltk-tutorial/) | [video](https://www.bilibili.com/video/av15350826/)

In [3]:
from nltk.tokenize import sent_tokenize, word_tokenize

EXAMPLE_TEXT = "Hello Mr. Smith, how are you doing today? The weather is great, and Python is awesome. The sky is pinkish-blue. You shouldn't eat cardboard."

print(sent_tokenize(EXAMPLE_TEXT,"czech"))

['Hello Mr.', 'Smith, how are you doing today?', 'The weather is great, and Python is awesome.', 'The sky is pinkish-blue.', "You shouldn't eat cardboard."]


In [4]:
print(word_tokenize(EXAMPLE_TEXT))

['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The', 'weather', 'is', 'great', ',', 'and', 'Python', 'is', 'awesome', '.', 'The', 'sky', 'is', 'pinkish-blue', '.', 'You', 'should', "n't", 'eat', 'cardboard', '.']


## 2. [Stop words with NLTK](https://pythonprogramming.net/stop-words-nltk-tutorial/) | [video](https://www.bilibili.com/video/av15350868/)

In [6]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

example_sent = "This is a sample sentence, showing off the stop words filtration."

stop_words = set(stopwords.words('english'))
print "stop_words", stop_words

word_tokens = word_tokenize(example_sent)

filtered_sentence = [w for w in word_tokens if not w in stop_words]

filtered_sentence = []

for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)

print(word_tokens)
print(filtered_sentence)

stop_words set([u'all', u'just', u'being', u'over', u'both', u'through', u'yourselves', u'its', u'before', u'o', u'hadn', u'herself', u'll', u'had', u'should', u'to', u'only', u'won', u'under', u'ours', u'has', u'do', u'them', u'his', u'very', u'they', u'not', u'during', u'now', u'him', u'nor', u'd', u'did', u'didn', u'this', u'she', u'each', u'further', u'where', u'few', u'because', u'doing', u'some', u'hasn', u'are', u'our', u'ourselves', u'out', u'what', u'for', u'while', u're', u'does', u'above', u'between', u'mustn', u't', u'be', u'we', u'who', u'were', u'here', u'shouldn', u'hers', u'by', u'on', u'about', u'couldn', u'of', u'against', u's', u'isn', u'or', u'own', u'into', u'yourself', u'down', u'mightn', u'wasn', u'your', u'from', u'her', u'their', u'aren', u'there', u'been', u'whom', u'too', u'wouldn', u'themselves', u'weren', u'was', u'until', u'more', u'himself', u'that', u'but', u'don', u'with', u'than', u'those', u'he', u'me', u'myself', u'ma', u'these', u'up', u'will', u'be

## 3. [Stemming words with NLTK](https://pythonprogramming.net/stemming-nltk-tutorial/) | [video](https://www.bilibili.com/video/av15350897/)

In [8]:
from nltk.stem import PorterStemmer


ps = PorterStemmer()
example_words = ["python","pythoner","pythoning","pythoned","pythonly"]
for w in example_words:
    print(ps.stem(w))

python
python
python
python
pythonli


In [10]:
from nltk.tokenize import word_tokenize

new_text = "It is important to by very pythonly while you are pythoning with python. All pythoners have pythoned poorly at least once."
words = word_tokenize(new_text)

for w in words:
    print(ps.stem(w))

It
is
import
to
by
veri
pythonli
while
you
are
python
with
python
.
all
python
have
python
poorli
at
least
onc
.


## 4. [Part of Speech Tagging with NLTK](https://pythonprogramming.net/part-of-speech-tagging-nltk-tutorial/) | [video](https://www.bilibili.com/video/av15350929/)

In [13]:
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try:
        for i in tokenized[:5]:
            print i
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            print(tagged)

    except Exception as e:
        print(str(e))


process_content()

0.0402097902098 0.0735785953177 0.0383693045564 5720 299 230 22
0.000174825174825 0.00334448160535 0.0 5720 299 1 1
0.00437062937063 0.00334448160535 0.00442722744881 5720 299 25 1
0.000699300699301 0.00334448160535 0.000553403431101 5720 299 4 1
0.0013986013986 0.00334448160535 0.00129127467257 5720 299 8 1
0.0171328671329 0.0401337792642 0.0158642316916 5720 299 98 12
0.00122377622378 0.0066889632107 0.000922339051835 5720 299 7 2
0.00384615384615 0.0066889632107 0.00368935620734 5720 299 22 2
0.00157342657343 0.00334448160535 0.00147574248294 5720 299 9 1
0.000874125874126 0.0066889632107 0.000553403431101 5720 299 5 2
0.00314685314685 0.0133779264214 0.00258254934514 5720 299 18 4
0.00244755244755 0.0066889632107 0.00221361372441 5720 299 14 2
0.00192307692308 0.00334448160535 0.00184467810367 5720 299 11 1
0.00157342657343 0.00334448160535 0.00147574248294 5720 299 9 1
0.0138111888112 0.0200668896321 0.0134661501568 5720 299 79 6
0.00034965034965 0.00334448160535 0.000184467810367

## 5. [Chunking with NLTK](https://pythonprogramming.net/chunking-nltk-tutorial/) | [video](https://www.bilibili.com/video/av15353114/)

In [1]:
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try:
        for i in tokenized[:3]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            
            print(chunked)
            for subtree in chunked.subtrees(filter=lambda t: t.label() == 'Chunk'):
                print(subtree)

            chunked.draw()

    except Exception as e:
        print(str(e))

process_content()

0.0402097902098 0.0735785953177 0.0383693045564 5720 299 230 22
0.000174825174825 0.00334448160535 0.0 5720 299 1 1
0.00437062937063 0.00334448160535 0.00442722744881 5720 299 25 1
0.000699300699301 0.00334448160535 0.000553403431101 5720 299 4 1
0.0013986013986 0.00334448160535 0.00129127467257 5720 299 8 1
0.0171328671329 0.0401337792642 0.0158642316916 5720 299 98 12
0.00122377622378 0.0066889632107 0.000922339051835 5720 299 7 2
0.00384615384615 0.0066889632107 0.00368935620734 5720 299 22 2
0.00157342657343 0.00334448160535 0.00147574248294 5720 299 9 1
0.000874125874126 0.0066889632107 0.000553403431101 5720 299 5 2
0.00314685314685 0.0133779264214 0.00258254934514 5720 299 18 4
0.00244755244755 0.0066889632107 0.00221361372441 5720 299 14 2
0.00192307692308 0.00334448160535 0.00184467810367 5720 299 11 1
0.00157342657343 0.00334448160535 0.00147574248294 5720 299 9 1
0.0138111888112 0.0200668896321 0.0134661501568 5720 299 79 6
0.00034965034965 0.00334448160535 0.000184467810367

(S
  (Chunk PRESIDENT/NNP GEORGE/NNP W./NNP BUSH/NNP)
  'S/POS
  (Chunk ADDRESS/NNP)
  BEFORE/IN
  (Chunk A/NNP JOINT/NNP SESSION/NNP)
  OF/IN
  (Chunk THE/NNP CONGRESS/NNP ON/NNP THE/NNP STATE/NNP)
  OF/IN
  (Chunk THE/NNP UNION/NNP January/NNP)
  31/CD
  ,/,
  2006/CD
  (Chunk THE/NNP PRESIDENT/NNP)
  :/:
  (Chunk Thank/NNP)
  you/PRP
  all/DT
  ./.)
(Chunk PRESIDENT/NNP GEORGE/NNP W./NNP BUSH/NNP)
(Chunk ADDRESS/NNP)
(Chunk A/NNP JOINT/NNP SESSION/NNP)
(Chunk THE/NNP CONGRESS/NNP ON/NNP THE/NNP STATE/NNP)
(Chunk THE/NNP UNION/NNP January/NNP)
(Chunk THE/NNP PRESIDENT/NNP)
(Chunk Thank/NNP)


(S
  (Chunk Mr./NNP Speaker/NNP)
  ,/,
  (Chunk Vice/NNP President/NNP Cheney/NNP)
  ,/,
  members/NNS
  of/IN
  (Chunk Congress/NNP)
  ,/,
  members/NNS
  of/IN
  the/DT
  (Chunk Supreme/NNP Court/NNP)
  and/CC
  diplomatic/JJ
  corps/NN
  ,/,
  distinguished/JJ
  guests/NNS
  ,/,
  and/CC
  fellow/JJ
  citizens/NNS
  :/:
  Today/VB
  our/PRP$
  nation/NN
  lost/VBD
  a/DT
  beloved/VBN
  ,/,
  graceful/JJ
  ,/,
  courageous/JJ
  woman/NN
  who/WP
  (Chunk called/VBD America/NNP)
  to/TO
  its/PRP$
  founding/NN
  ideals/NNS
  and/CC
  carried/VBD
  on/IN
  a/DT
  noble/JJ
  dream/NN
  ./.)
(Chunk Mr./NNP Speaker/NNP)
(Chunk Vice/NNP President/NNP Cheney/NNP)
(Chunk Congress/NNP)
(Chunk Supreme/NNP Court/NNP)
(Chunk called/VBD America/NNP)


(S
  Tonight/NN
  we/PRP
  are/VBP
  comforted/VBN
  by/IN
  the/DT
  hope/NN
  of/IN
  a/DT
  glad/JJ
  reunion/NN
  with/IN
  the/DT
  husband/NN
  who/WP
  was/VBD
  taken/VBN
  so/RB
  long/RB
  ago/RB
  ,/,
  and/CC
  we/PRP
  are/VBP
  grateful/JJ
  for/IN
  the/DT
  good/JJ
  life/NN
  of/IN
  (Chunk Coretta/NNP Scott/NNP King/NNP)
  ./.)
(Chunk Coretta/NNP Scott/NNP King/NNP)


## 6. [Chinking with NLTK](https://pythonprogramming.net/chinking-nltk-tutorial/) | [video](https://www.bilibili.com/video/av15353145/)

In [1]:
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try:
        for i in tokenized[:3]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)

            chunkGram = r"""Chunk: {<.*>+}
                                    }<VB.?|IN|DT|TO>+{"""

            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)

            chunked.draw()

    except Exception as e:
        print(str(e))

process_content()

0.0402097902098 0.0735785953177 0.0383693045564 5720 299 230 22
0.000174825174825 0.00334448160535 0.0 5720 299 1 1
0.00437062937063 0.00334448160535 0.00442722744881 5720 299 25 1
0.000699300699301 0.00334448160535 0.000553403431101 5720 299 4 1
0.0013986013986 0.00334448160535 0.00129127467257 5720 299 8 1
0.0171328671329 0.0401337792642 0.0158642316916 5720 299 98 12
0.00122377622378 0.0066889632107 0.000922339051835 5720 299 7 2
0.00384615384615 0.0066889632107 0.00368935620734 5720 299 22 2
0.00157342657343 0.00334448160535 0.00147574248294 5720 299 9 1
0.000874125874126 0.0066889632107 0.000553403431101 5720 299 5 2
0.00314685314685 0.0133779264214 0.00258254934514 5720 299 18 4
0.00244755244755 0.0066889632107 0.00221361372441 5720 299 14 2
0.00192307692308 0.00334448160535 0.00184467810367 5720 299 11 1
0.00157342657343 0.00334448160535 0.00147574248294 5720 299 9 1
0.0138111888112 0.0200668896321 0.0134661501568 5720 299 79 6
0.00034965034965 0.00334448160535 0.000184467810367

## 7. [Named Entity Recognition with NLTK](https://pythonprogramming.net/named-entity-recognition-nltk-tutorial/) | [video](https://www.bilibili.com/video/av15353198/)

In [1]:
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try:
        for i in tokenized[:3]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            namedEnt = nltk.ne_chunk(tagged, binary=True)
            namedEnt.draw()
    except Exception as e:
        print(str(e))


process_content()

0.0402097902098 0.0735785953177 0.0383693045564 5720 299 230 22
0.000174825174825 0.00334448160535 0.0 5720 299 1 1
0.00437062937063 0.00334448160535 0.00442722744881 5720 299 25 1
0.000699300699301 0.00334448160535 0.000553403431101 5720 299 4 1
0.0013986013986 0.00334448160535 0.00129127467257 5720 299 8 1
0.0171328671329 0.0401337792642 0.0158642316916 5720 299 98 12
0.00122377622378 0.0066889632107 0.000922339051835 5720 299 7 2
0.00384615384615 0.0066889632107 0.00368935620734 5720 299 22 2
0.00157342657343 0.00334448160535 0.00147574248294 5720 299 9 1
0.000874125874126 0.0066889632107 0.000553403431101 5720 299 5 2
0.00314685314685 0.0133779264214 0.00258254934514 5720 299 18 4
0.00244755244755 0.0066889632107 0.00221361372441 5720 299 14 2
0.00192307692308 0.00334448160535 0.00184467810367 5720 299 11 1
0.00157342657343 0.00334448160535 0.00147574248294 5720 299 9 1
0.0138111888112 0.0200668896321 0.0134661501568 5720 299 79 6
0.00034965034965 0.00334448160535 0.000184467810367

## 8. [Lemmatizing with NLTK](https://pythonprogramming.net/lemmatizing-nltk-tutorial/) | [video](https://www.bilibili.com/video/av15353273/)

In [4]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("cats"))
print(lemmatizer.lemmatize("cacti"))
print(lemmatizer.lemmatize("geese"))
print(lemmatizer.lemmatize("rocks"))
print(lemmatizer.lemmatize("python"))
print(lemmatizer.lemmatize("better", pos="a"))
print(lemmatizer.lemmatize("best", pos="a"))
print(lemmatizer.lemmatize("dog"))
print(lemmatizer.lemmatize("run"))
print(lemmatizer.lemmatize("run",'v'))

cat
cactus
goose
rock
python
good
best
dog
run
run


## 9. [The corpora with NLTK](https://pythonprogramming.net/nltk-corpus-corpora-tutorial/) | [video](https://www.bilibili.com/video/av15353335/)

In [6]:
import nltk
print(nltk.__file__)

from nltk.tokenize import sent_tokenize, PunktSentenceTokenizer
from nltk.corpus import gutenberg

# sample text
sample = gutenberg.raw("bible-kjv.txt")

tok = sent_tokenize(sample)

for x in range(5):
    print(tok[x])

D:\Anaconda2\lib\site-packages\nltk\__init__.pyc


[The King James Bible]

The Old Testament of the King James Bible

The First Book of Moses:  Called Genesis


1:1 In the beginning God created the heaven and the earth.
1:2 And the earth was without form, and void; and darkness was upon
the face of the deep.
And the Spirit of God moved upon the face of the
waters.
1:3 And God said, Let there be light: and there was light.
1:4 And God saw the light, that it was good: and God divided the light
from the darkness.


## 10. [Wordnet with NLTK](https://pythonprogramming.net/wordnet-nltk-tutorial/) | [video](https://www.bilibili.com/video/av15353391/)

In [12]:
from nltk.corpus import wordnet
syns = wordnet.synsets("program")
print(syns[0].name())
print(syns[0].lemmas()[0].name())
print(syns[0].definition())
print(syns[0].examples())

synonyms = []
antonyms = []

for syn in wordnet.synsets("good"):
    for l in syn.lemmas():
        synonyms.append(l.name())
        if l.antonyms():
            antonyms.append(l.antonyms()[0].name())

print(set(synonyms))
print(set(antonyms))

plan.n.01
plan
a series of steps to be carried out or goals to be accomplished
[u'they drew up a six-step plan', u'they discussed plans for a new bond issue']
set([u'beneficial', u'right', u'secure', u'just', u'unspoilt', u'respectable', u'good', u'goodness', u'dear', u'salutary', u'ripe', u'expert', u'skillful', u'in_force', u'proficient', u'unspoiled', u'dependable', u'soundly', u'honorable', u'full', u'undecomposed', u'safe', u'adept', u'upright', u'trade_good', u'sound', u'in_effect', u'practiced', u'effective', u'commodity', u'estimable', u'well', u'honest', u'near', u'skilful', u'thoroughly', u'serious'])
set([u'bad', u'badness', u'ill', u'evil', u'evilness'])
0.909090909091


In [14]:
w1 = wordnet.synset('ship.n.01')
w2 = wordnet.synset('boat.n.01')
print(w1.wup_similarity(w2))
w1 = wordnet.synset('ship.n.01')
w2 = wordnet.synset('car.n.01')
print(w1.wup_similarity(w2))
w1 = wordnet.synset('ship.n.01')
w2 = wordnet.synset('cat.n.01')
print(w1.wup_similarity(w2))

0.909090909091
0.695652173913
0.32


## 11. [Text Classification with NLTK](https://pythonprogramming.net/text-classification-nltk-tutorial/) | [video](https://www.bilibili.com/video/av15355245/)


In [15]:
import nltk
import random
from nltk.corpus import movie_reviews

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

random.shuffle(documents)

print(documents[1])

all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)
print(all_words.most_common(15))
print(all_words["stupid"])

([u'here', u'is', u'a', u'film', u'that', u'is', u'so', u'unexpected', u',', u'so', u'scary', u',', u'and', u'so', u'original', u'that', u'it', u'caught', u'me', u'off', u'guard', u'and', u'threw', u'me', u'for', u'a', u'loop', u'.', u'okay', u',', u'it', u'isn', u"'", u't', u'quite', u'original', u',', u'considering', u'it', u'is', u'a', u'sequel', u'to', u'the', u'box', u'office', u'hit', u'species', u',', u'but', u'it', u'certainly', u'is', u'smart', u'.', u'most', u'films', u'of', u'this', u'genre', u'are', u'reminiscent', u'of', u'those', u'cheesy', u'b', u'-', u'horror', u'films', u'from', u'the', u'50s', u'and', u'60s', u',', u'and', u'some', u'even', u'become', u'them', u'.', u'however', u',', u'as', u'we', u'learned', u'with', u'the', u'1995', u'small', u'-', u'budget', u'horror', u'/', u'sci', u'-', u'fi', u'film', u',', u'sometimes', u'expectations', u'can', u'be', u'shattered', u'.', u'a', u'lot', u'of', u'criticism', u'has', u'gone', u'against', u'this', u'film', u'(', u'f

[(u',', 77717), (u'the', 76529), (u'.', 65876), (u'a', 38106), (u'and', 35576), (u'of', 34123), (u'to', 31937), (u"'", 30585), (u'is', 25195), (u'in', 21822), (u's', 18513), (u'"', 17612), (u'it', 16107), (u'that', 15924), (u'-', 15595)]
253


## 12. [Converting words to Features with NLTK](https://pythonprogramming.net/words-as-features-nltk-tutorial/) | [video](https://www.bilibili.com/video/av15355355/)

In [2]:
import nltk
import random
from nltk.corpus import movie_reviews

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

random.shuffle(documents)

all_words = []

for w in movie_reviews.words():
    all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)

word_features = list(all_words.keys())[:3000]

word_features

[u'sonja',
 u'askew',
 u'woods',
 u'spiders',
 u'bazooms',
 u'hanging',
 u'francesca',
 u'comically',
 u'localized',
 u'disobeying',
 u'hennings',
 u'canet',
 u'scold',
 u'originality',
 u'caned',
 u'rickman',
 u'slothful',
 u'wracked',
 u'stipulate',
 u'capoeira',
 u'rawhide',
 u'taj',
 u'bringing',
 u'unsworth',
 u'liaisons',
 u'grueling',
 u'sommerset',
 u'wooden',
 u'wednesday',
 u'broiled',
 u'circuitry',
 u'crotch',
 u'elgar',
 u'stereotypical',
 u'shows',
 u'gavan',
 u'rebuilding',
 u'snuggles',
 u'francesco',
 u'feasibility',
 u'miniatures',
 u'gorman',
 u'woody',
 u'consenting',
 u'scraped',
 u'inanimate',
 u'errors',
 u'reopens',
 u'cooking',
 u'fonzie',
 u'opportunists',
 u'islamic',
 u'joely',
 u'designing',
 u'numeral',
 u'succumb',
 u'shocks',
 u'chins',
 u'crooned',
 u'jubilantly',
 u'rocque',
 u'ching',
 u'china',
 u'shandling',
 u'confronts',
 u'wiseguy',
 u'natured',
 u'existentialist',
 u'kids',
 u'uplifting',
 u'controversy',
 u'crowdpleasing',
 u'neurologist',
 u's

In [3]:
def find_features(document):
    words = set(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)

    return features
print((find_features(movie_reviews.words('neg/cv000_29416.txt'))))



In [4]:
featuresets = [(find_features(rev), category) for (rev, category) in documents]
featuresets

## 13. [Naive Bayes Classifier with NLTK](https://pythonprogramming.net/naive-bayes-classifier-nltk-tutorial/) | [video](https://www.bilibili.com/video/av15355406/)

In [5]:
# set that we'll train our classifier with
training_set = featuresets[:1900]

# set that we'll test against.
testing_set = featuresets[1900:]

classifier = nltk.NaiveBayesClassifier.train(training_set)
print("Classifier accuracy percent:",(nltk.classify.accuracy(classifier, testing_set))*100)

('Classifier accuracy percent:', 68.0)


In [6]:
classifier.show_most_informative_features(15)

Most Informative Features
               insulting = True              neg : pos    =     10.7 : 1.0
                  doubts = True              pos : neg    =      9.5 : 1.0
                    sans = True              neg : pos    =      8.4 : 1.0
              mediocrity = True              neg : pos    =      7.8 : 1.0
                 wasting = True              neg : pos    =      7.8 : 1.0
            refreshingly = True              pos : neg    =      7.6 : 1.0
               dismissed = True              pos : neg    =      6.9 : 1.0
             bruckheimer = True              neg : pos    =      6.4 : 1.0
                   wires = True              neg : pos    =      6.4 : 1.0
                  fabric = True              pos : neg    =      6.3 : 1.0
             overwhelmed = True              pos : neg    =      6.3 : 1.0
                     ugh = True              neg : pos    =      5.9 : 1.0
               uplifting = True              pos : neg    =      5.8 : 1.0

## 14. [Saving Classifiers with NLTK](https://pythonprogramming.net/pickle-classifier-save-nltk-tutorial/) | [video](https://www.bilibili.com/video/av15355460/)

In [9]:
import pickle

save_classifier = open("naivebayes.pickle","wb")
pickle.dump(classifier, save_classifier)
save_classifier.close()

classifier_f = open("naivebayes.pickle", "rb")
classifier = pickle.load(classifier_f)
classifier_f.close()

classifier.show_most_informative_features(15)

Most Informative Features
               insulting = True              neg : pos    =     10.7 : 1.0
                  doubts = True              pos : neg    =      9.5 : 1.0
                    sans = True              neg : pos    =      8.4 : 1.0
              mediocrity = True              neg : pos    =      7.8 : 1.0
                 wasting = True              neg : pos    =      7.8 : 1.0
            refreshingly = True              pos : neg    =      7.6 : 1.0
               dismissed = True              pos : neg    =      6.9 : 1.0
             bruckheimer = True              neg : pos    =      6.4 : 1.0
                   wires = True              neg : pos    =      6.4 : 1.0
                  fabric = True              pos : neg    =      6.3 : 1.0
             overwhelmed = True              pos : neg    =      6.3 : 1.0
                     ugh = True              neg : pos    =      5.9 : 1.0
               uplifting = True              pos : neg    =      5.8 : 1.0

## 15. [Scikit-Learn Sklearn with NLTK](https://pythonprogramming.net/sklearn-scikit-learn-nltk-tutorial/) | [video](https://www.bilibili.com/video/av15355745/)

In [11]:
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB,BernoulliNB
from sklearn.linear_model import LogisticRegression,SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC

print("Original Naive Bayes Algo accuracy percent:", (nltk.classify.accuracy(classifier, testing_set))*100)
classifier.show_most_informative_features(15)

MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
print("MNB_classifier accuracy percent:", (nltk.classify.accuracy(MNB_classifier, testing_set))*100)

BernoulliNB_classifier = SklearnClassifier(BernoulliNB())
BernoulliNB_classifier.train(training_set)
print("BernoulliNB_classifier accuracy percent:", (nltk.classify.accuracy(BernoulliNB_classifier, testing_set))*100)

LogisticRegression_classifier = SklearnClassifier(LogisticRegression())
LogisticRegression_classifier.train(training_set)
print("LogisticRegression_classifier accuracy percent:", (nltk.classify.accuracy(LogisticRegression_classifier, testing_set))*100)

SGDClassifier_classifier = SklearnClassifier(SGDClassifier())
SGDClassifier_classifier.train(training_set)
print("SGDClassifier_classifier accuracy percent:", (nltk.classify.accuracy(SGDClassifier_classifier, testing_set))*100)

SVC_classifier = SklearnClassifier(SVC())
SVC_classifier.train(training_set)
print("SVC_classifier accuracy percent:", (nltk.classify.accuracy(SVC_classifier, testing_set))*100)

LinearSVC_classifier = SklearnClassifier(LinearSVC())
LinearSVC_classifier.train(training_set)
print("LinearSVC_classifier accuracy percent:", (nltk.classify.accuracy(LinearSVC_classifier, testing_set))*100)

NuSVC_classifier = SklearnClassifier(NuSVC())
NuSVC_classifier.train(training_set)
print("NuSVC_classifier accuracy percent:", (nltk.classify.accuracy(NuSVC_classifier, testing_set))*100)

('Original Naive Bayes Algo accuracy percent:', 68.0)
Most Informative Features
               insulting = True              neg : pos    =     10.7 : 1.0
                  doubts = True              pos : neg    =      9.5 : 1.0
                    sans = True              neg : pos    =      8.4 : 1.0
              mediocrity = True              neg : pos    =      7.8 : 1.0
                 wasting = True              neg : pos    =      7.8 : 1.0
            refreshingly = True              pos : neg    =      7.6 : 1.0
               dismissed = True              pos : neg    =      6.9 : 1.0
             bruckheimer = True              neg : pos    =      6.4 : 1.0
                   wires = True              neg : pos    =      6.4 : 1.0
                  fabric = True              pos : neg    =      6.3 : 1.0
             overwhelmed = True              pos : neg    =      6.3 : 1.0
                     ugh = True              neg : pos    =      5.9 : 1.0
               uplif

('MNB_classifier accuracy percent:', 68.0)


('BernoulliNB_classifier accuracy percent:', 69.0)


('LogisticRegression_classifier accuracy percent:', 66.0)


('SGDClassifier_classifier accuracy percent:', 63.0)


('SVC_classifier accuracy percent:', 44.0)


('LinearSVC_classifier accuracy percent:', 55.00000000000001)


('NuSVC_classifier accuracy percent:', 61.0)


## 16. [Combining Algorithms with NLTK](https://pythonprogramming.net/combine-classifier-algorithms-nltk-tutorial/) | [video](https://www.bilibili.com/video/av15355927/)

In [1]:
import nltk
import random
from nltk.corpus import movie_reviews
from nltk.classify.scikitlearn import SklearnClassifier
import pickle

from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC

from nltk.classify import ClassifierI
from statistics import mode


class VoteClassifier(ClassifierI):
    def __init__(self, *classifiers):
        self._classifiers = classifiers

    def classify(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)
        return mode(votes)

    def confidence(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)

        choice_votes = votes.count(mode(votes))
        conf = choice_votes / len(votes)
        return conf

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

random.shuffle(documents)

all_words = []

for w in movie_reviews.words():
    all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)

word_features = list(all_words.keys())[:3000]

def find_features(document):
    words = set(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)

    return features

#print((find_features(movie_reviews.words('neg/cv000_29416.txt'))))

featuresets = [(find_features(rev), category) for (rev, category) in documents]
        
training_set = featuresets[:1900]
testing_set =  featuresets[1900:]

#classifier = nltk.NaiveBayesClassifier.train(training_set)

classifier_f = open("naivebayes.pickle","rb")
classifier = pickle.load(classifier_f)
classifier_f.close()




print("Original Naive Bayes Algo accuracy percent:", (nltk.classify.accuracy(classifier, testing_set))*100)
classifier.show_most_informative_features(15)

MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
print("MNB_classifier accuracy percent:", (nltk.classify.accuracy(MNB_classifier, testing_set))*100)

BernoulliNB_classifier = SklearnClassifier(BernoulliNB())
BernoulliNB_classifier.train(training_set)
print("BernoulliNB_classifier accuracy percent:", (nltk.classify.accuracy(BernoulliNB_classifier, testing_set))*100)

LogisticRegression_classifier = SklearnClassifier(LogisticRegression())
LogisticRegression_classifier.train(training_set)
print("LogisticRegression_classifier accuracy percent:", (nltk.classify.accuracy(LogisticRegression_classifier, testing_set))*100)

SGDClassifier_classifier = SklearnClassifier(SGDClassifier())
SGDClassifier_classifier.train(training_set)
print("SGDClassifier_classifier accuracy percent:", (nltk.classify.accuracy(SGDClassifier_classifier, testing_set))*100)

##SVC_classifier = SklearnClassifier(SVC())
##SVC_classifier.train(training_set)
##print("SVC_classifier accuracy percent:", (nltk.classify.accuracy(SVC_classifier, testing_set))*100)

LinearSVC_classifier = SklearnClassifier(LinearSVC())
LinearSVC_classifier.train(training_set)
print("LinearSVC_classifier accuracy percent:", (nltk.classify.accuracy(LinearSVC_classifier, testing_set))*100)

NuSVC_classifier = SklearnClassifier(NuSVC())
NuSVC_classifier.train(training_set)
print("NuSVC_classifier accuracy percent:", (nltk.classify.accuracy(NuSVC_classifier, testing_set))*100)


voted_classifier = VoteClassifier(classifier,
                                  NuSVC_classifier,
                                  LinearSVC_classifier,
                                  SGDClassifier_classifier,
                                  MNB_classifier,
                                  BernoulliNB_classifier,
                                  LogisticRegression_classifier)

print("voted_classifier accuracy percent:", (nltk.classify.accuracy(voted_classifier, testing_set))*100)

print("Classification:", voted_classifier.classify(testing_set[0][0]), "Confidence %:",voted_classifier.confidence(testing_set[0][0])*100)
print("Classification:", voted_classifier.classify(testing_set[1][0]), "Confidence %:",voted_classifier.confidence(testing_set[1][0])*100)
print("Classification:", voted_classifier.classify(testing_set[2][0]), "Confidence %:",voted_classifier.confidence(testing_set[2][0])*100)
print("Classification:", voted_classifier.classify(testing_set[3][0]), "Confidence %:",voted_classifier.confidence(testing_set[3][0])*100)
print("Classification:", voted_classifier.classify(testing_set[4][0]), "Confidence %:",voted_classifier.confidence(testing_set[4][0])*100)
print("Classification:", voted_classifier.classify(testing_set[5][0]), "Confidence %:",voted_classifier.confidence(testing_set[5][0])*100)

('Original Naive Bayes Algo accuracy percent:', 89.0)
Most Informative Features
               insulting = True              neg : pos    =     10.7 : 1.0
                  doubts = True              pos : neg    =      9.5 : 1.0
                    sans = True              neg : pos    =      8.4 : 1.0
              mediocrity = True              neg : pos    =      7.8 : 1.0
                 wasting = True              neg : pos    =      7.8 : 1.0
            refreshingly = True              pos : neg    =      7.6 : 1.0
               dismissed = True              pos : neg    =      6.9 : 1.0
             bruckheimer = True              neg : pos    =      6.4 : 1.0
                   wires = True              neg : pos    =      6.4 : 1.0
                  fabric = True              pos : neg    =      6.3 : 1.0
             overwhelmed = True              pos : neg    =      6.3 : 1.0
                     ugh = True              neg : pos    =      5.9 : 1.0
               uplif

('MNB_classifier accuracy percent:', 74.0)


('BernoulliNB_classifier accuracy percent:', 74.0)


('LogisticRegression_classifier accuracy percent:', 70.0)


('SGDClassifier_classifier accuracy percent:', 64.0)


('LinearSVC_classifier accuracy percent:', 70.0)


('NuSVC_classifier accuracy percent:', 71.0)


('voted_classifier accuracy percent:', 75.0)
('Classification:', u'pos', 'Confidence %:', 0)
('Classification:', u'pos', 'Confidence %:', 0)
('Classification:', u'neg', 'Confidence %:', 100)


('Classification:', u'pos', 'Confidence %:', 100)
('Classification:', u'neg', 'Confidence %:', 0)
('Classification:', u'neg', 'Confidence %:', 0)


## 17. [Investigating bias with NLTK](https://pythonprogramming.net/investigating-nltk-tutorial/) | [video](https://www.bilibili.com/video/av15356098/)

## 18. [Improving Training Data for sentiment analysis with NLTK](https://pythonprogramming.net/new-data-set-training-nltk-tutorial/) | [video](https://www.bilibili.com/video/av15356221/)

In [5]:
import codecs

import nltk
import random
from nltk.corpus import movie_reviews
from nltk.classify.scikitlearn import SklearnClassifier
import pickle

from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC

from nltk.classify import ClassifierI
from statistics import mode

from nltk.tokenize import word_tokenize


class VoteClassifier(ClassifierI):
    def __init__(self, *classifiers):
        self._classifiers = classifiers

    def classify(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)
        return mode(votes)

    def confidence(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)

        choice_votes = votes.count(mode(votes))
        conf = choice_votes / len(votes)
        return conf
        
short_pos = open("nlp/hello_nltk/short_reviews/positive.txt","r").read()
short_neg = open("nlp/hello_nltk/short_reviews/negative.txt","r").read()

documents = []

for r in short_pos.split('\n'):
    documents.append( (r, "pos") )

for r in short_neg.split('\n'):
    documents.append( (r, "neg") )


all_words = []

short_pos_words = word_tokenize(short_pos)
short_neg_words = word_tokenize(short_neg)

for w in short_pos_words:
    all_words.append(w.lower())

for w in short_neg_words:
    all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)

word_features = list(all_words.keys())[:5000]

def find_features(document):
    words = word_tokenize(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)

    return features

#print((find_features(movie_reviews.words('neg/cv000_29416.txt'))))

featuresets = [(find_features(rev), category) for (rev, category) in documents]

random.shuffle(featuresets)

# positive data example:      
training_set = featuresets[:10000]
testing_set =  featuresets[10000:]

##
### negative data example:      
##training_set = featuresets[100:]
##testing_set =  featuresets[:100]


classifier = nltk.NaiveBayesClassifier.train(training_set)
print("Original Naive Bayes Algo accuracy percent:", (nltk.classify.accuracy(classifier, testing_set))*100)
classifier.show_most_informative_features(15)

MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
print("MNB_classifier accuracy percent:", (nltk.classify.accuracy(MNB_classifier, testing_set))*100)

BernoulliNB_classifier = SklearnClassifier(BernoulliNB())
BernoulliNB_classifier.train(training_set)
print("BernoulliNB_classifier accuracy percent:", (nltk.classify.accuracy(BernoulliNB_classifier, testing_set))*100)

LogisticRegression_classifier = SklearnClassifier(LogisticRegression())
LogisticRegression_classifier.train(training_set)
print("LogisticRegression_classifier accuracy percent:", (nltk.classify.accuracy(LogisticRegression_classifier, testing_set))*100)

SGDClassifier_classifier = SklearnClassifier(SGDClassifier())
SGDClassifier_classifier.train(training_set)
print("SGDClassifier_classifier accuracy percent:", (nltk.classify.accuracy(SGDClassifier_classifier, testing_set))*100)

##SVC_classifier = SklearnClassifier(SVC())
##SVC_classifier.train(training_set)
##print("SVC_classifier accuracy percent:", (nltk.classify.accuracy(SVC_classifier, testing_set))*100)

LinearSVC_classifier = SklearnClassifier(LinearSVC())
LinearSVC_classifier.train(training_set)
print("LinearSVC_classifier accuracy percent:", (nltk.classify.accuracy(LinearSVC_classifier, testing_set))*100)

NuSVC_classifier = SklearnClassifier(NuSVC())
NuSVC_classifier.train(training_set)
print("NuSVC_classifier accuracy percent:", (nltk.classify.accuracy(NuSVC_classifier, testing_set))*100)


voted_classifier = VoteClassifier(
                                  NuSVC_classifier,
                                  LinearSVC_classifier,
                                  MNB_classifier,
                                  BernoulliNB_classifier,
                                  LogisticRegression_classifier)

print("voted_classifier accuracy percent:", (nltk.classify.accuracy(voted_classifier, testing_set))*100)

UnicodeDecodeError: 'utf8' codec can't decode byte 0xf3 in position 4645: invalid continuation byte

## 19. [Creating a module for Sentiment Analysis with NLTK](https://pythonprogramming.net/sentiment-analysis-module-nltk-tutorial/) | [video](https://www.bilibili.com/video/av15356391/)

In [None]:
import nltk
import random
#from nltk.corpus import movie_reviews
from nltk.classify.scikitlearn import SklearnClassifier
import pickle
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC
from nltk.classify import ClassifierI
from statistics import mode
from nltk.tokenize import word_tokenize



class VoteClassifier(ClassifierI):
    def __init__(self, *classifiers):
        self._classifiers = classifiers

    def classify(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)
        return mode(votes)

    def confidence(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)

        choice_votes = votes.count(mode(votes))
        conf = choice_votes / len(votes)
        return conf
    
short_pos = open("short_reviews/positive.txt","r").read()
short_neg = open("short_reviews/negative.txt","r").read()

# move this up here
all_words = []
documents = []


#  j is adject, r is adverb, and v is verb
#allowed_word_types = ["J","R","V"]
allowed_word_types = ["J"]

for p in short_pos.split('\n'):
    documents.append( (p, "pos") )
    words = word_tokenize(p)
    pos = nltk.pos_tag(words)
    for w in pos:
        if w[1][0] in allowed_word_types:
            all_words.append(w[0].lower())

    
for p in short_neg.split('\n'):
    documents.append( (p, "neg") )
    words = word_tokenize(p)
    pos = nltk.pos_tag(words)
    for w in pos:
        if w[1][0] in allowed_word_types:
            all_words.append(w[0].lower())



save_documents = open("pickled_algos/documents.pickle","wb")
pickle.dump(documents, save_documents)
save_documents.close()


all_words = nltk.FreqDist(all_words)


word_features = list(all_words.keys())[:5000]


save_word_features = open("pickled_algos/word_features5k.pickle","wb")
pickle.dump(word_features, save_word_features)
save_word_features.close()


def find_features(document):
    words = word_tokenize(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)

    return features

featuresets = [(find_features(rev), category) for (rev, category) in documents]

random.shuffle(featuresets)
print(len(featuresets))

testing_set = featuresets[10000:]
training_set = featuresets[:10000]


classifier = nltk.NaiveBayesClassifier.train(training_set)
print("Original Naive Bayes Algo accuracy percent:", (nltk.classify.accuracy(classifier, testing_set))*100)
classifier.show_most_informative_features(15)

###############
save_classifier = open("pickled_algos/originalnaivebayes5k.pickle","wb")
pickle.dump(classifier, save_classifier)
save_classifier.close()

MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
print("MNB_classifier accuracy percent:", (nltk.classify.accuracy(MNB_classifier, testing_set))*100)

save_classifier = open("pickled_algos/MNB_classifier5k.pickle","wb")
pickle.dump(MNB_classifier, save_classifier)
save_classifier.close()

BernoulliNB_classifier = SklearnClassifier(BernoulliNB())
BernoulliNB_classifier.train(training_set)
print("BernoulliNB_classifier accuracy percent:", (nltk.classify.accuracy(BernoulliNB_classifier, testing_set))*100)

save_classifier = open("pickled_algos/BernoulliNB_classifier5k.pickle","wb")
pickle.dump(BernoulliNB_classifier, save_classifier)
save_classifier.close()

LogisticRegression_classifier = SklearnClassifier(LogisticRegression())
LogisticRegression_classifier.train(training_set)
print("LogisticRegression_classifier accuracy percent:", (nltk.classify.accuracy(LogisticRegression_classifier, testing_set))*100)

save_classifier = open("pickled_algos/LogisticRegression_classifier5k.pickle","wb")
pickle.dump(LogisticRegression_classifier, save_classifier)
save_classifier.close()


LinearSVC_classifier = SklearnClassifier(LinearSVC())
LinearSVC_classifier.train(training_set)
print("LinearSVC_classifier accuracy percent:", (nltk.classify.accuracy(LinearSVC_classifier, testing_set))*100)

save_classifier = open("pickled_algos/LinearSVC_classifier5k.pickle","wb")
pickle.dump(LinearSVC_classifier, save_classifier)
save_classifier.close()


##NuSVC_classifier = SklearnClassifier(NuSVC())
##NuSVC_classifier.train(training_set)
##print("NuSVC_classifier accuracy percent:", (nltk.classify.accuracy(NuSVC_classifier, testing_set))*100)


SGDC_classifier = SklearnClassifier(SGDClassifier())
SGDC_classifier.train(training_set)
print("SGDClassifier accuracy percent:",nltk.classify.accuracy(SGDC_classifier, testing_set)*100)

save_classifier = open("pickled_algos/SGDC_classifier5k.pickle","wb")
pickle.dump(SGDC_classifier, save_classifier)
save_classifier.close()

In [7]:
import nltk
import random
#from nltk.corpus import movie_reviews
from nltk.classify.scikitlearn import SklearnClassifier
import pickle
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC
from nltk.classify import ClassifierI
from statistics import mode
from nltk.tokenize import word_tokenize



class VoteClassifier(ClassifierI):
    def __init__(self, *classifiers):
        self._classifiers = classifiers

    def classify(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)
        return mode(votes)

    def confidence(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)

        choice_votes = votes.count(mode(votes))
        conf = choice_votes / len(votes)
        return conf


documents_f = open("pickled_algos/documents.pickle", "rb")
documents = pickle.load(documents_f)
documents_f.close()




word_features5k_f = open("pickled_algos/word_features5k.pickle", "rb")
word_features = pickle.load(word_features5k_f)
word_features5k_f.close()


def find_features(document):
    words = word_tokenize(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)

    return features



featuresets_f = open("pickled_algos/featuresets.pickle", "rb")
featuresets = pickle.load(featuresets_f)
featuresets_f.close()

random.shuffle(featuresets)
print(len(featuresets))

testing_set = featuresets[10000:]
training_set = featuresets[:10000]



open_file = open("pickled_algos/originalnaivebayes5k.pickle", "rb")
classifier = pickle.load(open_file)
open_file.close()


open_file = open("pickled_algos/MNB_classifier5k.pickle", "rb")
MNB_classifier = pickle.load(open_file)
open_file.close()



open_file = open("pickled_algos/BernoulliNB_classifier5k.pickle", "rb")
BernoulliNB_classifier = pickle.load(open_file)
open_file.close()


open_file = open("pickled_algos/LogisticRegression_classifier5k.pickle", "rb")
LogisticRegression_classifier = pickle.load(open_file)
open_file.close()


open_file = open("pickled_algos/LinearSVC_classifier5k.pickle", "rb")
LinearSVC_classifier = pickle.load(open_file)
open_file.close()


open_file = open("pickled_algos/SGDC_classifier5k.pickle", "rb")
SGDC_classifier = pickle.load(open_file)
open_file.close()




voted_classifier = VoteClassifier(
                                  classifier,
                                  LinearSVC_classifier,
                                  MNB_classifier,
                                  BernoulliNB_classifier,
                                  LogisticRegression_classifier)




def sentiment(text):
    feats = find_features(text)
    return voted_classifier.classify(feats),voted_classifier.confidence(feats)

IOError: [Errno 2] No such file or directory: 'pickled_algos/documents.pickle'

In [None]:
print(sentiment("This movie was awesome! The acting was great, plot was wonderful, and there were pythons...so yea!"))
print(sentiment("This movie was utter junk. There were absolutely 0 pythons. I don't see what the point was at all. Horrible movie, 0/10"))

In [6]:
## 20. []() | []()

In [None]:
## 21. []() | []()