# Regular Expressions and Naive Bayes Classification

### 1. Regular Expressions

- Regular expressions are useful for extracting information from text.
- Set of “rules” to identify or match a particular sequence of characters.
- Most text in utf-8 or utf-16: letters, digits, punctuation and symbols
- In Python, mainly through library `re`

In [2]:
# Set Directory
import os
# os.getcwd()
os.chdir('/Users/ea025/Desktop/Python Camp - Alexander/PythonCamp2024/Day06/Lecture')

In [3]:
# For regular expressions
import re 

* As a demonstration, we will work with Obama's 2008 concession speech from New Hampshire primary.
* Read in the sample text, and remember:
    * `readlines` makes a list of each line break in file

In [4]:
with open("obama-nh.txt", "r") as f:
  text = f.readlines()

* Let's take a look at how this file is structured 

In [5]:
print(text[0])
print(text[1])
# print(text[2])

# print(text[0:3])

I want to congratulate Senator Clinton on a hard-fought victory here in

New Hampshire.



* We can also join all lines into one string

In [6]:
alltext = ''.join(text) 

* What could we have done at the outset instead?

In [7]:
with open("obama-nh.txt", "r") as f:
  alltext = f.read()

#### 1.1 Useful functions from the `re` module:

- `re.findall`: Return all non-overlapping matches of pattern 
            in string, as a list of strings
- `re.split`: Split string by the occurrences of pattern.
- `re.match`: Search the beginning of the string for a
          regular expression and return the first occurrence.
          Returns a match object.
- `re.search`: Like re.match, but will check all lines of the input string.
- `re.compile`: Compile a regular expression pattern into a regular 
            expression object, which can be used for matching using
            match(), search() and other methods

Source: https://docs.python.org/3/library/re.html

Let's run some examples!

* Both lines find all instances of "Yes we can"

In [12]:
# re.findall(pattern = "Yes we can", string= alltext) 
re.findall("Yes we can", alltext) 

['Yes we can',
 'Yes we can',
 'Yes we can',
 'Yes we can',
 'Yes we can',
 'Yes we can',
 'Yes we can',
 'Yes we can',
 'Yes we can']

* Here we find all instances of "American"

In [13]:
re.findall("American", alltext)

['American', 'American', 'American', 'American']

* ...And of all breaklines

In [27]:
re.findall("\n", alltext)[0:4] #...but only print 4

['\n', '\n', '\n', '\n']

* Example of `re.split()`

In [31]:
re.split("and", alltext)[7:12]

[" in\nspirit - who have never before participated in politics - turn out in\nnumbers we've never seen because they know in their hearts that this\ntime must be different.\n\nThere is something happening when people vote not just for the party\nthey belong to but the hopes they hold in common - that whether we are\nrich or poor; black or white; Latino or Asian; whether we hail from Iowa\nor New Hampshire, Nevada or South Carolina, we are ready to take this\ncountry in a fundamentally new direction. That is what's happening in\nAmerica right now. Change is what's happening in America.\n\nYou can be the new majority who can lead this nation out of a long\npolitical darkness - Democrats, Independents ",
 ' Republicans who are\ntired of the division ',
 ' distraction that has clouded Washington; who\nknow that we can disagree without being disagreeable; who underst',
 '\nthat if we mobilize our voices to challenge the money ',
 " influence\nthat's stood in our way "]

#### 1.2 Backslash Characters

* Regular expressions use the backslash character `\` to indicte special forms, or to allow special characters to be used without invoking their special meaning. 
* This collides with Python's usage of the same character for the same purpose in string literals 

* How do we find the literal character `\` in our file?
* First 2 will give errors

In [8]:
# re.findall("\", alltext)
# re.findall("\\", alltext)
re.findall("\\\\", alltext)

['\\']

##### Instead of typing 4 backslashes every time we need to find one...

* Another way to address this is to use Python's *raw string notation* for regular expression patterns.
* This looks like this`r""`
* Backslashes are not handled in any special way in a string prefixed with `r`.  

##### So equivalently:

In [9]:
re.findall(r"\\", alltext)

['\\']

##### We can also see it in action here:

* Prints an actual linebreak

In [39]:
print("\n")





* Prints the character "\n"

In [38]:
print(r"\n")

\n


* Also prints the character "\n"

In [10]:
print("\\n")

\n


#### 1.3 Basic special characters

* `\d`: any decimal digit, equivalent to [0-9]

In [42]:
re.findall(r"\d", alltext) 

['9', '1', '1']

* `\D`: any character that is NOT a decimal digit, equivalent to ^[0-9]

In [45]:
# re.findall(r"\D", alltext) 

* `[]` can be used to indicate *a set* of characters
* Line below returns all instances of *each* of the characters in `[]`

In [54]:
re.findall("[ar]", alltext)[0:10] 

['a', 'r', 'a', 'a', 'a', 'r', 'a', 'a', 'r', 'r']

* All instances of the form "char1 to char2" in `[char1-char2]`

In [57]:
re.findall("[a-d]", alltext)[0:10] 

['a', 'c', 'a', 'a', 'a', 'a', 'a', 'd', 'c', 'a']

* Returns all characters, `^`: *except* for those of the form "char 1 to char 2" in [^char1-char2]

In [59]:
re.findall("[^a-z]", alltext)[0:20] 

['I',
 ' ',
 ' ',
 ' ',
 ' ',
 'S',
 ' ',
 'C',
 ' ',
 ' ',
 ' ',
 '-',
 ' ',
 ' ',
 ' ',
 '\n',
 'N',
 ' ',
 'H',
 '.']

* All characters and digits (alphanumeric)

In [63]:
re.findall("[a-zA-Z0-9]", alltext)[0:19]

['I',
 'w',
 'a',
 'n',
 't',
 't',
 'o',
 'c',
 'o',
 'n',
 'g',
 'r',
 'a',
 't',
 'u',
 'l',
 'a',
 't',
 'e']

* `\w`: Any alphanumeric, one word character

In [66]:
re.findall(r"\w", alltext)[0:19] # same as above

['I',
 'w',
 'a',
 'n',
 't',
 't',
 'o',
 'c',
 'o',
 'n',
 'g',
 'r',
 'a',
 't',
 'u',
 'l',
 'a',
 't',
 'e']

*  `\W`: non-alphanumeric, the inverse of `\w`

In [69]:
re.findall(r"\W", alltext)[0:15] # same as re.findall(r"[^a-zA-Z0-9]", alltext)

[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '-', ' ', ' ', ' ', '\n', ' ', '.']

* `\s`: whitespace

In [72]:
re.findall(r"\s", alltext)[0:5]

[' ', ' ', ' ', ' ', ' ']

* `\S`: *non*-whitespace characters

In [208]:
re.findall(r"\S", alltext)[0:7]

['I', 'w', 'a', 'n', 't', 't', 'o']

* `.` any character (include white spaces, except a newline)

In [79]:
re.findall(".", alltext)[0:10]

['I', ' ', 'w', 'a', 'n', 't', ' ', 't', 'o', ' ']

 * `\` is an escape character (`.` has a special use)

In [82]:
re.findall(r"\.", alltext)[0:10]

['.', '.', '.', '.', '.', '.', '.', '.', '.', '.']

* `?`: Makes the preceding expression optional; match 0 or 1 repetitions of the preceding expression

In [84]:
re.findall("Am?", alltext)[0:5] # This would match A or Am where m is optional

['A', 'A', 'Am', 'Am', 'A']

* `+`: match 1 or more repetitions of the preceding expression

In [88]:
re.findall(r"\d+", alltext)
# re.findall("am+", alltext)

['9', '11']

* `*`: match 0 or more repetitions of the preceding expression

In [90]:
re.findall("am*", alltext)[0:8] # match a, am, or a followed by any number of m's 

['a', 'a', 'a', 'a', 'a', 'a', 'am', 'a']

* Get any word that starts with America

In [91]:
re.findall(r"America[a-z]*", alltext) 

['America',
 'Americans',
 'America',
 'America',
 'American',
 'Americans',
 'America',
 'America',
 'Americans',
 'America',
 'America']

* `{m}` specifies exactly m copies of the previous expression should be matched

In [93]:
# {x} exactly x times (numbers with exact number of digits)
re.findall(r"\d{2}", alltext) 

['11']

* `{m,n}` matches from m to n repetitions of the preceding expression, while attempting to match as many repetitions as possible

In [99]:
re.findall("o{2,3}", alltext) 

['oo', 'oo', 'oo', 'oo', 'oo', 'oo', 'oo', 'oo']

- There are so many more special characters
- Regex can be super powerful and complicated 
- Use parenthese to group things together when using operators like `+`, `*`, `?`, `^`

##### Short Exercise: 
How would we grab 10/10 and 19/18 as they appear in the text using `re.findall()`? 

In [100]:
x = "Hi 10/10 hello 19/18 asdf 7/6 and 1/10 or 10/1 "

<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>

##### Answer

In [102]:
re.findall(r"\d{2}/\d{2}", x) 

['10/10', '19/18']

#### 1.4 `re.split()`

##### Split string by the occurrences of pattern. 

In [11]:
# splits at 'or', deletes 'or'
re.split("or", alltext)[0:4]

['I want to congratulate Senat',
 ' Clinton on a hard-fought vict',
 "y here in\nNew Hampshire.\n\nA few weeks ago, no one imagined that we'd have accomplished what we did\nhere tonight. F",
 ' most of this campaign, we were far behind, and we\nalways knew our climb would be steep. But in rec']

In [13]:
re.split("America*", alltext)[0:3]

["I want to congratulate Senator Clinton on a hard-fought victory here in\nNew Hampshire.\n\nA few weeks ago, no one imagined that we'd have accomplished what we did\nhere tonight. For most of this campaign, we were far behind, and we\nalways knew our climb would be steep. But in record numbers, you came\nout and spoke up for change. And with your voices and your votes, you\nmade it clear that at this moment - in this election - there is\nsomething happening in ",
 '.\n\nThere is something happening when men and women in Des Moines and\nDavenport; in Lebanon and Concord come out in the snows of January to\nwait in lines that stretch block after block because they believe in\nwhat this country can be.\n\nThere is something happening when ',
 "ns who are young in age and in\nspirit - who have never before participated in politics - turn out in\nnumbers we've never seen because they know in their hearts that this\ntime must be different.\n\nThere is something happening when people vote no

#### 1.5 `re.compile()`

##### Compile a regular expression pattern into a RE object, which can then be used for matching using the `match()` and `search()` methods. 

In [14]:
keyword = re.compile("America[a-z]*")

In [16]:
# search file for keyword in line by line version
for l in text: 
    if keyword.search(l): # reuse the RE here
        print(l)

something happening in America.

There is something happening when Americans who are young in age and in

America right now. Change is what's happening in America.

Our new American majority can end the outrage of unaffordable,

working Americans who deserve it.

is a challenge that should unite America and the world against the

But in the unlikely story that is America, there has never been anything

we can't, generations of Americans have responded with a simple creed

remember that there is something happening in America; that we are not

nation; and together, we will begin the next great chapter in America's



* Create a regex object

In [17]:
pattern = re.compile(r'\d+')

In [18]:
pattern.findall(alltext) # equivalent to the earlier but longer version using RE

['9', '11']

In [119]:
# pattern.split(alltext)

#### 1.6 `re.MULTILINE` or `re.M`



##### When specified, it helps to search across lines in a single string. 

In [120]:
mline = "python\nis\nfun"
print(mline)

python
is
fun


I want to search for "fun" in the third line, where it starts with an "f"

- We can use `^` to search the start of a string
- Be careful, `^` when used in `[]` means negating characters
- `$` can be used to match the end of a string

In [133]:
re.findall(r"^f\w*", mline)

[]

In [131]:
# re.findall("^f\w*", mline, re.M)
re.findall(r"^f\w*", mline, re.MULTILINE)

['fun']

#### Short Exercise: 

What does the following code search for? 

In [15]:
re.findall(r"^.*\.$", alltext, re.MULTILINE)[0:15]

['New Hampshire.',
 'something happening in America.',
 'what this country can be.',
 'time must be different.',
 "America right now. Change is what's happening in America.",
 'fulfill.',
 'working Americans who deserve it.',
 'can do this with our new majority.',
 'weapons; climate change and poverty; genocide and disease.',
 'ideas. And all are patriots who serve this country honorably.',
 'the people who love this country, can do to change it.',
 "That's why tonight belongs to you.",
 'believed in our improbable journey and rallied so many others to join.',
 'in the weeks to come.',
 'offering the people of this nation false hope.']

### 2. Naive Bayes Classification

* A conditional probability model that assigns probabilities for each class $k$ to an observation based on its $n$ features or $p(C_k|x_1, ...x_n)$.
* The central assumption: all $n$ features are independent of each other, conditional on the class/category $C_k$.
* Algorithm relies on Bayes' Theorem + a decision rule $$p(C_k|\mathbf{x}) = \dfrac{p(C_k)p(\mathbf{x}|C_k)}{p(\mathbf{x})}$$

Read more about it: https://en.wikipedia.org/wiki/Naive_Bayes_classifier

##### Why/When Naive Bayes

* Great first classifier to try, relatively fast, requires less data than other classifiers, and can be very accurate provided assumptions hold
* Popular in text classification problems

More resources: https://www.analyticsvidhya.com/blog/2017/09/naive-bayes-explained/

#### 2.1 Installation and Import Libraries

In [21]:
!pip3 install nltk
%pip install nltk

Collecting nltk
  Using cached nltk-3.9.1-py3-none-any.whl.metadata (2.9 kB)
Collecting click (from nltk)
  Using cached click-8.1.7-py3-none-any.whl.metadata (3.0 kB)
Collecting joblib (from nltk)
  Using cached joblib-1.4.2-py3-none-any.whl.metadata (5.4 kB)
Collecting regex>=2021.8.3 (from nltk)
  Using cached regex-2024.7.24-cp312-cp312-macosx_10_9_x86_64.whl.metadata (40 kB)
Collecting tqdm (from nltk)
  Using cached tqdm-4.66.5-py3-none-any.whl.metadata (57 kB)
Using cached nltk-3.9.1-py3-none-any.whl (1.5 MB)
Using cached regex-2024.7.24-cp312-cp312-macosx_10_9_x86_64.whl (282 kB)
Using cached click-8.1.7-py3-none-any.whl (97 kB)
Using cached joblib-1.4.2-py3-none-any.whl (301 kB)
Using cached tqdm-4.66.5-py3-none-any.whl (78 kB)
Installing collected packages: tqdm, regex, joblib, click, nltk
Successfully installed click-8.1.7 joblib-1.4.2 nltk-3.9.1 regex-2024.7.24 tqdm-4.66.5

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.

In [91]:
import nltk
nltk.download('names')
from nltk.corpus import names
import random

[nltk_data] Downloading package names to /Users/ea025/nltk_data...
[nltk_data]   Package names is already up-to-date!


Docs for this library: https://www.nltk.org/api/nltk.classify.naivebayes.html

* Create a list of tuples with names

In [92]:
names = ([(name, 'male') for name in names.words('male.txt')] +
        [(name, 'female') for name in names.words('female.txt')])

In [93]:
names[0:20]

[('Aamir', 'male'),
 ('Aaron', 'male'),
 ('Abbey', 'male'),
 ('Abbie', 'male'),
 ('Abbot', 'male'),
 ('Abbott', 'male'),
 ('Abby', 'male'),
 ('Abdel', 'male'),
 ('Abdul', 'male'),
 ('Abdulkarim', 'male'),
 ('Abdullah', 'male'),
 ('Abe', 'male'),
 ('Abel', 'male'),
 ('Abelard', 'male'),
 ('Abner', 'male'),
 ('Abraham', 'male'),
 ('Abram', 'male'),
 ('Ace', 'male'),
 ('Adair', 'male'),
 ('Adam', 'male')]

##### Now, we shuffle

In [94]:
random.shuffle(names)
names[0:5]

[('Sutton', 'male'),
 ('Caryn', 'female'),
 ('Kai', 'female'),
 ('Elana', 'female'),
 ('Lindy', 'male')]

#### 2.2 Split Training and Test Sets

In [95]:
len(names) # N total observations

7944

* Define training and test set sizes

In [96]:
train_size = 5000

* Split train and test objects

In [97]:
train_names = names[:train_size]
test_names = names[train_size:]

#### 2.3 Define Features

* A simple feature: get the last letter of the name

In [98]:
def g_features1(name):
  return {'last_letter': name[-1]}

Tips: Python functions can return multiple values

In [99]:
# Quick break — some syntax:
def return_two():
  return 5, 10

# When a method returns two values, we can use this format: 
x, y = return_two()
x, y

(5, 10)

#### 2.4 Data Preparation

Loop over names, and return tuple of dictionary and label

In [100]:
train_set = [(g_features1(n), g) for (n, g) in train_names]
test_set = [(g_features1(n), g) for (n,g) in test_names]

In [212]:
# train_set[0]

#### 2.5 Train the Classifier

##### Run the naive Bayes classifier for the train set

In [101]:
classifier = nltk.NaiveBayesClassifier.train(train_set)

#### 2.6 Test your Classifier

* Apply the classifier to some names

In [102]:
classifier.classify(g_features1('Alma'))

'female'

In [103]:
classifier.classify(g_features1('Masanori'))

'female'

In [125]:
classifier.classify(g_features1('Kat'))

'male'

* Get probabilities

In [105]:
classifier.prob_classify(g_features1('Alma')).prob("female")

0.9820962255991424

In [108]:
classifier.prob_classify(g_features1('Masanori')).prob("male")

0.13697792502794687

##### We can check the overall accuracy with our test set. 

More on accuracy, F1, precision, recall: https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9

In [109]:
print(nltk.classify.accuracy(classifier, test_set))

0.7605298913043478


#### 2.7 Feature Attribution

* Lets see what is driving this

In [113]:
classifier.show_most_informative_features(5)

Most Informative Features
             last_letter = 'a'            female : male   =     32.7 : 1.0
             last_letter = 'k'              male : female =     26.4 : 1.0
             last_letter = 'v'              male : female =     11.7 : 1.0
             last_letter = 'f'              male : female =     11.0 : 1.0
             last_letter = 'p'              male : female =     10.6 : 1.0


Let's be smarter and add more features!

In [114]:
# What all are we including now?
def g_features2(name):
  features = {}
  features["firstletter"] = name[0].lower()
  features["lastletter"] = name[-1].lower()
  for letter in 'abcdefghijklmnopqrstuvwxyz':
      features["count(%s)" % letter] = name.lower().count(letter)
      features["has(%s)" % letter] = (letter in name.lower())
  return features

In [115]:
 g_features2('Alma')

{'firstletter': 'a',
 'lastletter': 'a',
 'count(a)': 2,
 'has(a)': True,
 'count(b)': 0,
 'has(b)': False,
 'count(c)': 0,
 'has(c)': False,
 'count(d)': 0,
 'has(d)': False,
 'count(e)': 0,
 'has(e)': False,
 'count(f)': 0,
 'has(f)': False,
 'count(g)': 0,
 'has(g)': False,
 'count(h)': 0,
 'has(h)': False,
 'count(i)': 0,
 'has(i)': False,
 'count(j)': 0,
 'has(j)': False,
 'count(k)': 0,
 'has(k)': False,
 'count(l)': 1,
 'has(l)': True,
 'count(m)': 1,
 'has(m)': True,
 'count(n)': 0,
 'has(n)': False,
 'count(o)': 0,
 'has(o)': False,
 'count(p)': 0,
 'has(p)': False,
 'count(q)': 0,
 'has(q)': False,
 'count(r)': 0,
 'has(r)': False,
 'count(s)': 0,
 'has(s)': False,
 'count(t)': 0,
 'has(t)': False,
 'count(u)': 0,
 'has(u)': False,
 'count(v)': 0,
 'has(v)': False,
 'count(w)': 0,
 'has(w)': False,
 'count(x)': 0,
 'has(x)': False,
 'count(y)': 0,
 'has(y)': False,
 'count(z)': 0,
 'has(z)': False}

* Run for train set

In [116]:
train_set = [(g_features2(n), g) for (n,g) in train_names]

* Run for test set

In [117]:
test_set = [(g_features2(n), g) for (n,g) in test_names]

* Run new classifier

In [118]:
classifier_new = nltk.NaiveBayesClassifier.train(train_set)

* Check the overall accuracy with test set

In [119]:
print(nltk.classify.accuracy(classifier_new, test_set))

0.7764945652173914


* Lets see what is driving this

In [120]:
classifier_new.show_most_informative_features(20)

Most Informative Features
              lastletter = 'a'            female : male   =     32.7 : 1.0
              lastletter = 'k'              male : female =     26.4 : 1.0
              lastletter = 'v'              male : female =     11.7 : 1.0
              lastletter = 'f'              male : female =     11.0 : 1.0
              lastletter = 'p'              male : female =     10.6 : 1.0
              lastletter = 'd'              male : female =      9.0 : 1.0
              lastletter = 'r'              male : female =      8.4 : 1.0
              lastletter = 'o'              male : female =      8.0 : 1.0
              lastletter = 'm'              male : female =      7.6 : 1.0
                count(a) = 3              female : male   =      6.4 : 1.0
                count(l) = 3              female : male   =      5.8 : 1.0
              lastletter = 'b'              male : female =      5.0 : 1.0
              lastletter = 'g'              male : female =      4.8 : 1.0

* Worse? Better? How can we refine?
* Lets look at the errors from this model and see if we can do better

In [122]:
errors = []
for (name, label) in test_names:
  guess = classifier.classify(g_features2(name))
  if guess != label:
    prob = classifier.prob_classify(g_features2(name)).prob(guess)
    errors.append((label, guess, prob, name))

In [123]:
 for (label, guess, prob, name) in sorted(errors):
   print('correct={} guess={} prob={:.2f} name={}'.format(label, guess, prob, name))

correct=male guess=female prob=0.63 name=Abbot
correct=male guess=female prob=0.63 name=Abdel
correct=male guess=female prob=0.63 name=Abdul
correct=male guess=female prob=0.63 name=Abdulkarim
correct=male guess=female prob=0.63 name=Abe
correct=male guess=female prob=0.63 name=Abel
correct=male guess=female prob=0.63 name=Abner
correct=male guess=female prob=0.63 name=Abraham
correct=male guess=female prob=0.63 name=Adair
correct=male guess=female prob=0.63 name=Adger
correct=male guess=female prob=0.63 name=Adlai
correct=male guess=female prob=0.63 name=Adolf
correct=male guess=female prob=0.63 name=Adolfo
correct=male guess=female prob=0.63 name=Adolph
correct=male guess=female prob=0.63 name=Adrian
correct=male guess=female prob=0.63 name=Ahmed
correct=male guess=female prob=0.63 name=Alain
correct=male guess=female prob=0.63 name=Albert
correct=male guess=female prob=0.63 name=Alberto
correct=male guess=female prob=0.63 name=Albrecht
correct=male guess=female prob=0.63 name=Alden


What could we do to improve it? (Lab Assignment)

##### Now lets look at some bigger documents
* This may take a while to download.

In [126]:
from nltk.corpus import movie_reviews
nltk.download('movie_reviews')

[nltk_data] Downloading package movie_reviews to
[nltk_data]     /Users/ea025/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


True

In [127]:
# list of tuples
# ([words], label)
documents = [(list(movie_reviews.words(fileid)), category)
              for category in movie_reviews.categories()
              for fileid in movie_reviews.fileids(category)]


In [128]:
# type(documents[0])
# type(documents)
documents[0][1] # only neg & pos

'neg'

In [129]:
random.shuffle(documents)

* Dictionary of words and number of instances

In [130]:
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
len(all_words)

39768

In [131]:
all_words

FreqDist({',': 77717, 'the': 76529, '.': 65876, 'a': 38106, 'and': 35576, 'of': 34123, 'to': 31937, "'": 30585, 'is': 25195, 'in': 21822, ...})

* Check the frequency of `,`

In [74]:
all_words[',']

77717

In [75]:
word_features = [k for k in all_words.keys() if all_words[k] > 5]

In [76]:
len(word_features)

13214

* Define function to get document features

In [132]:
def document_features(document):
  document_words = set(document)
  features = {}
  for word in word_features:
      features['contains(%s)' % word] = (word in document_words)
  return features

In [133]:
 # document_features(['This', 'is', 'a', 'horrible', 'movie'])

{'contains(plot)': False,
 'contains(:)': False,
 'contains(two)': False,
 'contains(teen)': False,
 'contains(couples)': False,
 'contains(go)': False,
 'contains(to)': False,
 'contains(a)': True,
 'contains(church)': False,
 'contains(party)': False,
 'contains(,)': False,
 'contains(drink)': False,
 'contains(and)': False,
 'contains(then)': False,
 'contains(drive)': False,
 'contains(.)': False,
 'contains(they)': False,
 'contains(get)': False,
 'contains(into)': False,
 'contains(an)': False,
 'contains(accident)': False,
 'contains(one)': False,
 'contains(of)': False,
 'contains(the)': False,
 'contains(guys)': False,
 'contains(dies)': False,
 'contains(but)': False,
 'contains(his)': False,
 'contains(girlfriend)': False,
 'contains(continues)': False,
 'contains(see)': False,
 'contains(him)': False,
 'contains(in)': False,
 'contains(her)': False,
 'contains(life)': False,
 'contains(has)': False,
 'contains(nightmares)': False,
 'contains(what)': False,
 "contains(')": F

In [134]:
 # document_features(movie_reviews.words('pos/cv957_8737.txt'))

{'contains(plot)': True,
 'contains(:)': True,
 'contains(two)': True,
 'contains(teen)': False,
 'contains(couples)': False,
 'contains(go)': False,
 'contains(to)': True,
 'contains(a)': True,
 'contains(church)': False,
 'contains(party)': False,
 'contains(,)': True,
 'contains(drink)': False,
 'contains(and)': True,
 'contains(then)': True,
 'contains(drive)': False,
 'contains(.)': True,
 'contains(they)': True,
 'contains(get)': True,
 'contains(into)': True,
 'contains(an)': True,
 'contains(accident)': False,
 'contains(one)': True,
 'contains(of)': True,
 'contains(the)': True,
 'contains(guys)': False,
 'contains(dies)': False,
 'contains(but)': True,
 'contains(his)': True,
 'contains(girlfriend)': True,
 'contains(continues)': False,
 'contains(see)': False,
 'contains(him)': True,
 'contains(in)': True,
 'contains(her)': False,
 'contains(life)': False,
 'contains(has)': True,
 'contains(nightmares)': False,
 'contains(what)': True,
 "contains(')": True,
 'contains(s)': T

* Now we have tuple of `({features}, label)`

In [135]:
train_docs = documents[:1000]
test_docs = documents[1000:1500]
train_set = [(document_features(d), c) for (d,c) in train_docs]
test_set = [(document_features(d), c) for (d,c) in test_docs]
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [136]:
print(nltk.classify.accuracy(classifier, test_set))

0.788


In [137]:
classifier.show_most_informative_features(10)

Most Informative Features
   contains(wonderfully) = True              pos : neg    =     20.4 : 1.0
        contains(finest) = True              pos : neg    =     14.0 : 1.0
     contains(stupidity) = True              neg : pos    =     12.1 : 1.0
        contains(symbol) = True              pos : neg    =      9.7 : 1.0
        contains(turkey) = True              neg : pos    =      9.0 : 1.0
   contains(outstanding) = True              pos : neg    =      8.5 : 1.0
     contains(laughably) = True              neg : pos    =      8.4 : 1.0
     contains(addresses) = True              pos : neg    =      8.2 : 1.0
        contains(admits) = True              pos : neg    =      8.2 : 1.0
      contains(chilling) = True              pos : neg    =      8.2 : 1.0


In [None]:
# Copyright of the original version:

# Copyright (c) 2014 Matt Dickenson
# 
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
# 
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
# 
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.
