# Text Classification - Lab

## Introduction

In this lab, we'll use everything we've learned so far to build a model that can classify a text document as one of many possible classes!

## Objectives

You will be able to:

- Perform classification using a text dataset, using sensible preprocessing, tokenization, and feature engineering scheme 
- Use scikit-learn text vectorizers to fit and transform text data into a format to be used in a ML model 



# Getting Started

For this lab, we'll be working with the classic **_Newsgroups Dataset_**, which is available as a training data set in `sklearn.datasets`. This dataset contains many different articles that fall into 1 of 20 possible classes. Our goal will be to build a classifier that can accurately predict the class of an article based on the features we create from the article itself!

Let's get started. Run the cell below to import everything we'll need for this lab. 

In [1]:
import nltk
from nltk.corpus import stopwords
import string
from nltk import word_tokenize, FreqDist
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.datasets import fetch_20newsgroups
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
import pandas as pd
import numpy as np
np.random.seed(0)

Now, we need to fetch our dataset. Run the cell below to download all the newsgroups articles and their corresponding labels. If this is the first time working with this dataset, scikit-learn will need to download all of the articles from an external repository -- the cell below may take a little while to run. 

The actual dataset is quite large. To save us from extremely long runtimes, we'll work with only a subset of the classes. Here is a list of all the possible classes:

<img src='classes.png'>

For this lab, we'll only work with the following five:

* `'alt.atheism'`
* `'comp.windows.x'`
* `'rec.sport.hockey'`
* `'sci.crypt'`
* `'talk.politics.guns'`

In the cell below:

* Create a list called `categories` that contains the five newsgroups classes listed above, as strings 
* Get the training set by calling `fetch_20newsgroups()` and passing in the following parameters:
    * `subset='train'`
    * `categories=categories`
    * `remove=('headers', 'footers', 'quotes')` -- this is so that the model can't overfit to metadata included in the articles that sometimes acts as a dead-giveaway as to what class the article belongs to  
* Get the testing set as well by passing in the same parameters, with the exception of `subset='test` 

In [2]:
categories = ['alt.atheism','comp.windows.x','rec.sport.hockey','sci.crypt','talk.politics.guns']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories,remove=('headers','footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(categories=categories,remove=('headers','footers', 'quotes'))

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


Great! Let's break apart the data and the labels, and then inspect the class names to see what the actual newsgroups are.

In the cell below:

* Grab the data from `newsgroups_train.data` and store it in the appropriate variable  
* Grab the labels from `newsgroups_train.target` and store it in the appropriate variable  
* Grab the label names from `newsgroups_train.target_names` and store it in the appropriate variable  
* Display the `label_names` so that we can see the different classes of articles that we're working with, and confirm that we grabbed the right ones  

In [3]:
data = newsgroups_train.data
target = newsgroups_test.data
label_names = newsgroups_train.target_names
label_names

['alt.atheism',
 'comp.windows.x',
 'rec.sport.hockey',
 'sci.crypt',
 'talk.politics.guns']

Finally, let's check the shape of `data` to see what our data looks like. We can do this by checking the `.shape` attribute of `newsgroups_train.filenames`.

Do this now in the cell below.

In [5]:
# Your code here
newsgroups_train.filenames.shape

(2814,)

Our dataset contains 2,814 different articles spread across the five classes we chose. 

### Cleaning and Preprocessing Our Data

Now that we have our data, the fun part begins. We'll need to begin by preprocessing and cleaning our text data. As you've seen throughout this section, preprocessing text data is a bit more challenging that working with more traditional data types because there's no clear-cut answer for exactly what sort of preprocessing and cleaning we need to do. Before we can begin cleaning and preprocessing our text data, we need to make some decisions about things such as:

* Do we remove stop words or not?
* Do we stem or lemmatize our text data, or leave the words as is?
* Is basic tokenization enough, or do we need to support special edge cases through the use of regex?
* Do we use the entire vocabulary, or just limit the model to a subset of the most frequently used words? If so, how many?
* Do we engineer other features, such as bigrams, or POS tags, or Mutual Information Scores?
* What sort of vectorization should we use in our model? Boolean Vectorization? Count Vectorization? TF-IDF? More advanced vectorization strategies such as Word2Vec?


These are all questions that we'll need to think about pretty much anytime we begin working with text data. 

Let's get right into it. We'll start by getting a list of all of the english stopwords, and concatenating them with a list of all the punctuation. 

In the cell below:

* Get all the english stopwords from `nltk` 
* Get all of the punctuation from `string.punctuation`, and convert it to a list 
* Add the two lists together. Name the result `stopwords_list` 
* Create another list containing various types of empty strings and ellipses, such as `["''", '""', '...', '``']`. Add this to our `stopwords_list`, so that we won't have tokens that are only empty quotes and such  

In [7]:
nltk.download('stopwords')
stopwords_list = stopwords.words('english') + list(string.punctuation)
stopwords_list += ["''", '""', '...', '``']

[nltk_data] Downloading package stopwords to /home/enthusiastic-
[nltk_data]     constructor-3429/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Great! We'll leave these alone for now, until we're ready to remove stop words after the tokenization step. 

Next, let's try tokenizing our dataset. In order to save ourselves some time, we'll write a function to clean our dataset, and then use Python's built-in `map()` function to clean every article in the dataset at the same time. 

In the cell below, complete the `process_article()` function. This function should:

* Take in one parameter, `article` 
* Tokenize the article using the appropriate function from `nltk` 
* Lowercase every token, remove any stopwords found in `stopwords_list` from the tokenized article, and return the results 

In [11]:
def process_article(article):
    token = nltk.word_tokenize(article)
    stop_words_removed = [i.lower() for i in token if i.lower() not in stopwords_list]
    return stop_words_removed

Now that we have this function, let's go ahead and preprocess our data, and then move into exploring our dataset. 

In the cell below:

* Use Python's `map()` function and pass in two parameters: the `process_article` function and the `data`. Make sure to wrap the whole map statement in a `list()`.

**_Note:_** Running this cell may take a minute or two!

In [12]:
nltk.download('punkt')
processed_data = list(map(process_article, data))

[nltk_data] Downloading package punkt to /home/enthusiastic-
[nltk_data]     constructor-3429/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Great. Now, let's inspect the first article in `processed_data` to see how it looks. 

Do this now in the cell below.

In [13]:
processed_data[0]

['note',
 'trial',
 'updates',
 'summarized',
 'reports',
 '_idaho',
 'statesman_',
 'local',
 'nbc',
 'affiliate',
 'television',
 'station',
 'ktvb',
 'channel',
 '7',
 'randy',
 'weaver/kevin',
 'harris',
 'trial',
 'update',
 'day',
 '4',
 'friday',
 'april',
 '16',
 '1993',
 'fourth',
 'day',
 'trial',
 'synopsis',
 'defense',
 'attorney',
 'gerry',
 'spence',
 'cross-examined',
 'agent',
 'cooper',
 'repeated',
 'objections',
 'prosecutor',
 'ronald',
 'howen',
 'spence',
 'moved',
 'mistrial',
 'denied',
 'day',
 'marked',
 'caustic',
 'cross-examination',
 'deputy',
 'marshal',
 'larry',
 'cooper',
 'defense',
 'attorney',
 'gerry',
 'spence',
 'although',
 'spence',
 'explicitly',
 'stated',
 'one',
 'angle',
 'stategy',
 'must',
 'involve',
 'destroying',
 'credibility',
 'agent',
 'cooper',
 'cooper',
 'government',
 "'s",
 'eyewitness',
 'death',
 'agent',
 'degan',
 'spence',
 'attacked',
 'cooper',
 "'s",
 'credibility',
 'pointing',
 'discrepancies',
 'cooper',
 "'s",
 '

Now, let's move onto exploring the dataset a bit more. Let's start by getting the total vocabulary size of the training dataset. We can do this by creating a `set` object and then using it's `.update()` method to iteratively add each article. Since it's a set, it will only contain unique words, with no duplicates. 

In the cell below:

* Create a `set()` object called `total_vocab` 
* Iterate through each tokenized article in `processed_data` and add it to the set using the set's `.update()` method 
* Once all articles have been added, get the total number of unique words in our training set by taking the length of the set 

In [None]:
processed_data[:5]

total_vocab = set()
for note in processed_data:
    for word in note:
        print(word)

note
trial
updates
summarized
reports
_idaho
statesman_
local
nbc
affiliate
television
station
ktvb
channel
7
randy
weaver/kevin
harris
trial
update
day
4
friday
april
16
1993
fourth
day
trial
synopsis
defense
attorney
gerry
spence
cross-examined
agent
cooper
repeated
objections
prosecutor
ronald
howen
spence
moved
mistrial
denied
day
marked
caustic
cross-examination
deputy
marshal
larry
cooper
defense
attorney
gerry
spence
although
spence
explicitly
stated
one
angle
stategy
must
involve
destroying
credibility
agent
cooper
cooper
government
's
eyewitness
death
agent
degan
spence
attacked
cooper
's
credibility
pointing
discrepancies
cooper
's
statements
last
september
made
court
cooper
conceded
things
compressed
seconds
's
difficult
remember
went
first
cooper
acknowledged
carried
9mm
colt
commando
submachine
gun
silenced
barrel
thought
colt
commando
revolver
cooper
continued
stating
federal
agents
specific
plans
use
weapon
started
kill
weaver
's
dog
spence
asked
seven
cartridges
could
f

--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
_/_/
_/
_/_/
_/_/_/_/
_/_/_/_/
_/_/
_/_/_/
_/
_/
_/
_/
_/
_/
_/
_/
_/
_/
_/
_/
_/
_/
_/
_/
_/
_/
_/
_/
_/_/_/
_/
_/_/_/_/
_/
_/_/_/
_/_/_/
_/_/
_/
_/
_/
_/
_/
_/
_/
_/
_/
_/
_/
_/
_/
_/
_/
_/
_/
_/
_/
_/
_____
_/_/_/
_/_/_/_/
_/
_/
_/_/_/_/
_/_/_/_/
_/
_/
_/_/_/
/____/
marc
foster
r.s.h
contact
oklahoma
city
blazers
1993
central
hockey
university
oklahoma
geography
department
league
adams
cup
internet
mfoster
geohub.gcn.uoknor.edu
champions
mfoster
alliant.backbone.uoknor.edu
placed
chl
mailing
list
send
email
either
address
export.lcs.mit.edu
/pub/sunkbd..930314.tar.z
/maf
march/april
version
x
journal
open
front
'll
working
programming
x-clients
summer
since
n't
much
experience
programming
x
thought
issue
might
helpful
section
debugging
section
40
common
errors
programming
x
end
errors
section
following
references
tutorials
x
programming
style
rosenthal
david
simple
x11
client
program
proceedings
winter
1988
usenix
conference
1988
lemke
d.

started
performing
better
offensively
converted
center
wing
although
lefty
sanderson
went
left
wing
righty
yake
went
right
side
biggest
disappointment
hands
john
cullen
cullen
disasterous
77
point
season
last
year
first
full
season
trade
cullen
started
season
summer
back
surgery
fell
flat
face
appropriate
since
spent
whaler
career
flat
ass
whining
cullen
scored
9
point
19
games
clubhouse
malcontent
commanded
powerplay
9
success
percentage
21
sanderson
sulked
way
town
worst
4
year
4m
contract
three
years
left
run
one
would
give
2nd
round
draft
pick
maple
leafs
offered
hartford
honorable
mention
steve
konroyd
also
subpar
signing
3
year
2.1m
contract
eric
weinrich
showed
flashes
competence
overall
played
poorly
jim
mckenzie
much
better
hockey
player
two
seasons
ago
frank
pietrangelo
seemed
play
well
sean
burke
extended
period
got
make
number
starts
row
according
osf/motif
style
guide
one
use
cursor
shapes
give
user
visual
clue
happening
expected
hourglass
cursor
shown
application
busy
cau

pts
pt
--
--
--
--
--
--
--
--
--
--
feb.
41
61
3-
8-
1
7
pts
year
246
268
march
02
la
6
vs
calgary
2
16,005
28-29-
7
63
pts
pt
04
la
8
vs
ottawa
6
16,005
29-29-
7
65
pts
pt
06
la
6
vs
edmonton
1
16,005
30-29-
7
67
pts
pt
09
la
3
ny
rangers
4
18,200
30-30-
7
67
pts
pt
11
la
3
pittsburgh
4
16,164
30-31-
7
67
pts
pt
13
philadelphia
postponed
1
period
1-1
due
weather
resch
4/1
14
buffalo
postponed
due
weather
rescheduled
3/15
15
la
4
buffalo
2
13,799
31-31-
7
69
pts
pt
16
la
8
vs
winnipeg
4
16,005
32-31-
7
71
pts
pt
18
la
7
vs
ny
islanders
4
16,005
33-31-
7
73
pts
pt
20
la
3
vs
st
louis
2
16,005
34-31-
7
75
pts
pt
24
la
2
vancouver
6
16,150
34-32-
7
75
pts
5
26
la
4
edmonton
1
17,503
35-32-
7
77
pts
5
28
la
3
winnipeg
3
15,566
35-32-
8
78
pts
pt
29
la
9
detroit
3
19,875
36-32-
8
80
pts
5
31
la
5
toronto
5
15,720
36-32-
9
81
pts
pt
--
--
--
--
--
--
--
--
--
--
mar
71
47
9-
3-
2
20
pts
year
317
315
april
01
la
3
philadelphia
1
17,380
37-32-
9
83
pts
pt
rescheduled
03
la
0
vs
minnesota
3
16

7/90
two
versions
xprompt
posted
comp.sources.x
latter
unauthorized
rewrite
r.
forsman
thoth
reef.cis.ufl.edu
1/91
version
xmenu
available
comp.sources.x
worked
likely
re-released
xp-1.1.tar.z
xpick-1.0.tar.z
xzap-1.0.tar.z
export
's
contrib/
tools
gerry.tomlinson
newcastle.ac.uk
act
x
versions
simple
display
choice-making
tools
k
p
4/92
xtpanel
lets
user
build
panel
containing
interactive
objects
buttons
sliders
text
fields
etc.
either
command
line
using
simple
scripting
language
available
anonymous
ftp
hanauma.stanford.edu
36.51.0.16
pub/x/xtpanel.tar.z
may
also
found
alt.sources
archives
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
subject
90
get
x-based
debugger
xdbx
x
interface
dbx
debugger
available
via
ftp
export
current
1/91
version
2.1
patchlevel
2
x
interface
gdb
called
xxgdb
like
xdbx
2.1.2
part
comp.sources.x
volume
11
2/91
xxgdb-1.06.tar.z
export
mxgdb
motif
interface
gdb
jim
tsillas
jtsillas
bubba.ma30.bull.com
v

tips
imake
sasun1.epfl.ch
pub/imakefile.1.z
1/91
12/91
5/92
8/92
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
subject
107
get
imake
versions
distributed
r4
r5
releases
earlier
version
distributed
x11r3
release
third-party
toolkits
redistribute
versions
imake
along
implementations
template
configuration
files
real
standards
configuration
files
although
*current*
contributed
software
expects
templates
distributed
x11r5
export
contains
r5
distribution
unpacked
pick
imake
without
picking
entire
distribution
stand-alone
version
imake
one
stemming
x11r5
ftp.germany.eu.net
pub/x11/misc/imake/imake-pure.tar.z
192.76.144.75
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
subject
108
program
imakefile
makefile
r4
r5
installed
system
run
xmkmf
script
runs
imake
correct
arguments
output
makefile
configured
system
based
imakefile
run
make
use
new
makefile
compile
program
--
--
--
--


name
x
build
shell
script
entry
's
title
possibly
followed
x
digit
followed
'.sh
x
x
needed
entry
's
remarks
indicate
entry
must
x
changed
order
deal
new
filenames
x
x
5
build
file
source
resulting
executable
x
treated
read-only
files
entry
needs
modify
files
x
make
modify
copy
appropriate
file
x
occurs
state
entry
's
remarks
x
x
6
entries
compiled
ansi
c
compiler
rejected
x
use
common
c
k
r
extensions
permitted
long
x
cause
compile
errors
ansi
c
compilers
x
x
7
program
must
original
work
programs
must
x
public
domain
copyrighted
programs
rejected
x
x
8
entries
must
received
prior
07-may-93
0:00
utc
utc
x
essentially
equivalent
greenwich
mean
time
email
entries
x
x
apple
pyramid
sun
uunet
hoptoad
obfuscate
x
obfuscate
toad.com
x
x
request
message
use
subject
'ioccc
entry
x
x
possible
request
hold
emailing
entries
x
1-mar-93
0:00
utc
early
entries
accepted
however
x
attempt
email
confirmation
first
author
x
entries
received
1-mar-93
0:00
utc
x
x
9
person
may
submit
8
entries
per
contest

files
x
please
say
--
-remark
--
section
try
avoid
x
touching
original
build
source
binary
files
x
arrange
make
copies
files
intend
modify
x
allow
people
re-generate
entry
scratch
x
x
remember
entry
may
built
without
build
file
x
typically
incorporate
build
lines
makefile
x
build
file
must
exist
say
--
-remark
--
section
x
x
entry
needs
special
info
files
uuencode
x
--
-info
--
sections
case
multiple
info
files
x
use
multiple
--
-info
--
sections
info
files
needed
x
skip
--
-info
--
section
x
x
info
files
intended
input
detailed
information
x
fit
well
--
-remark
--
section
example
x
entry
implements
compiler
might
want
provide
sample
x
programs
user
compile
entry
might
want
include
x
lengthy
design
document
might
appropriate
x
'hints
file
x
x
info
files
used
supplement
entry
x
example
info
files
may
provide
sample
input
detailed
x
information
entry
supplemental
x
entry
require
exist
x
x
cases
info
files
might
renamed
avoid
name
x
conflicts
info
files
renamed
reason
x
say
--
-remark
--


message
would
obtain
legal
authorization
normally
court
order
wiretap
first
place
would
present
documentation
authorization
two
entities
responsible
safeguarding
keys
obtain
keys
device
used
drug
smugglers
key
split
two
parts
stored
separately
order
ensure
security
key
escrow
system
q
run
key-escrow
data
banks
two
key-escrow
data
banks
run
two
independent
entities
point
department
justice
administration
yet
determine
agencies
oversee
key-escrow
data
banks
q
strong
security
device
sure
strong
security
system
secure
many
voice
encryption
systems
readily
available
today
algorithm
remain
classified
protect
security
key
escrow
system
willing
invite
independent
panel
cryptography
experts
evaluate
algorithm
assure
potential
users
unrecognized
vulnerabilities
q
whose
decision
propose
product
national
security
council
justice
department
commerce
department
key
agencies
involved
decision
approach
endorsed
president
vice
president
appropriate
cabinet
officials
q
consulted
congress
industry
on-goi

xservers
seen
demands
applications
exceed
size
colormap
solution
usually
1
avoid
colormap-greedy
apps
2
display
applications
color
icing
cake
monochrome
mode
rather
color
option
application
doug
shaker
voice
415/572-0200
fax
415/572-1300
email
dshaker
qualix.com
mail
qualix
group
1900
s.
norfolk
st.
224
san
mateo
ca
94403
think
many
reading
group
would
also
benefit
knowing
deviant
view
_as
've
articulated
above_
may
true
view
khomeini
basic
principles
islam
non-muslim
readers
group
see
far
simple
basics
islam
views
face
_not_
contradiction
basics
islam
subtle
issues
seems
sects
exist
islam
explicitly
proscribed
qur'an
opinion
considering
human
substance
metaphysical
fundamentally
different
human
_is_
heretical
notion
one
proscribed
islam
absolutely
would
interested
discussing
privately
interested
hearing
one
might
try
make
concept
error-free
sinless
human
beings
philosophically
consistent
teachings
qur'an
however
_prima
facie_
attemptsa
highly
susceptible
degenerating
monkery
explicitl

deleted
gxor
draw
one
stay
gxclear
gxxor
entire
pixmap
screen
note
pretty
effective
way
animation
ever
need
replace
gxclear
gxxor
--
--
--
--
--
joe
hildebrand
hildjj
fuentez.com
software
engineer
fuentez
systems
concepts
703
273-1447
pocklington
wanted
wake
powers
holding
political
office
northlands
business
community
oilers
current
lease
arrangement
state
yearly
basis
likely
operating
loss
based
normal
hockey
revenues
expenses
good
thing
better
complain
early
make
city
aware
potential
looming
crisis
begins
lose
millions
millions
dollars
would
truly
jeopardize
franchise
pocklington
's
first
option
sell
move
sell
minority
share
team
realize
appreciated
value
team
get
better
arena
deal
either
northlands
via
new
building
pocklington
probably
n't
going
get
exactly
wants
ultimately
probably
get
enough
sell
someone
probably
get
enough
lot
risks
moving
team
also
one
remember
peter
puck
's
principle
better
spend
people
's
money
one
's
possible
chips
released
phones
whatever
vulnerable
phsyica

think
would
also
hold
unbeaten
streak
regular
season
games
looking
postscript
tex
version
paper
called
public-key
cryptography
written
james
nechvatal
security
technology
group
national
computer
systems
laboratory
national
institute
standards
technology
gaithersburg
md
20899
december
1990
version
obtained
plain
text
symbolic
character
formatting
lost
last
year
us
suffered
almost
10,000
wrongful
accidental
deaths
handguns
alone
fbi
statistics
year
uk
suffered
35
deaths
scotland
yard
statistics
population
uk
1/5
us
10,000
35
5
weighted
population
us
57x
many
handgun-related
deaths
uk
brits
n't
make
murdering
57x
many
people
baseball
bats
snip
examine
figures
stabbing
favourite
closely
followed
striking
punching
kicking
many
people
burnt
death
britain
shot
death
take
look
'll
see
means
people
shot
death
great
britain
list
killings
name
religion
1
iran-iraq
war
1,000,000
2
civil
war
sudan
1,000,000
3
riots
india-pakistan
1947
1,000,000
4
massacares
bangladesh
1971
1,000,000
5
inquistions
a

--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
team
biggest
biggest
team
mvp
surprise
disappointment
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
--
boston
bruins
oates
d.sweeney
wesley
buffalo
sabres
lafontaine
mogilny
audette
jinx
calgary
flames
roberts
reichel
petit
chicago
blackhawks
roenick
ruuttu
goulet
detroit
red
wings
yzerman
chaisson
kozlov
edmonton
oilers
manson
buchberger
mellanby
hartford
whalers
sanderson
cassells
corriveau
los
angeles
kings
robitaille
donnelly
hrudey
minnesota
north
stars
modano
tinordi
expected
back
broten
montreal
canadiens
muller
lebeau
savard
new
jersey
devils
stevens
semak
maclean
new
york
islanders
turgeon
king
finally
marois
new
york
rangers
messier
kovalev
bourque
ottawa
senators
maciver
baker
jelinek
philadelphia
flyers
lindros/recchi
fedyk/galley
eklund
pittsburgh
penguins
lemieux
tocchet
even
jagr
quebec
nordiques
sakic/ricci
kovalenko
pearso

go
get
copy
army
's
1969
improvised
munitions
manual
see
easy
make
functional
firearm
paying
10
inconspicuous
parts
local
k-mart
nose
drew
n't
sign
blank
checks
doug
foxvog
says
weapons
mass
destruction
means
cbw
nukes
sarah
brady
says
weapons
mass
destruction
means
street
sweeper
shotguns
semi-automatic
sks
rifles
john
lawrence
rutledge
says
weapons
mass
destruction
immediately
follows
rutledge
mean
term
--
nanaimo
vancouver
island
furriners
got
abc
coverage
komo
probably
depends
cable
company
started
switching
cbc
abc
broadcasts
finally
settled
abc
ca
n't
stand
whitman
al
michaels
decent
job
followed
play
pretty
well
knew
players'
names
made
couple
rookie
mistakes
noticed
one
thing
surprised
never
attempted
explain
offside
rule
am-i-paranoid-or-is-this-really-happening
department
fights
game
couple
occasions
looked
like
fight
start
times
abc
cut
away
show
closeup
coach
mcnall
something
abc
decided
adopt
spicer
policy
pitt
vs.
penn
state
controversy
deleted
would
n't
sufficient
cause


200,706,000
2,394
11.93
1940
132,122,000
2,375
17.98
1969
202,677,000
2,309
11.39
1941
133,402,000
2,396
17.96
1970
204,879,000
2,406
11.74
1942
134,860,000
2,678
19.86
1971
207,661,000
2,360
11.36
1943
136,739,000
2,282
16.69
1972
209,896,000
2,442
11.63
1944
138,397,000
2,392
17.28
1973
211,909,000
2,618
12.35
1945
139,928,000
2,385
17.04
1974
213,854,000
2,613
12.22
1946
141,389,000
2,801
19.81
1975
215,854,000
2,380
11.03
1947
144,126,000
2,439
16.92
1976
218,035,000
2,059
9.44
1948
146,631,000
2,191
14.94
1977
220,239,000
1,982
9.00
1949
149,188,000
2,330
15.62
1978
222,585,000
1,806
8.11
1950
151,684,000
2,174
14.33
1979
225,055,000
2,004
8.90
1951
154,287,000
2,247
14.56
1980
227,757,000
1,955
8.58
1952
156,954,000
2,210
14.08
1981
230,138,000
1,871
8.13
1953
159,565,000
2,277
14.27
1982
232,520,000
1,756
7.55
1954
162,391,000
2,271
13.98
1983
234,799,000
1,695
7.22
1955
165,275,000
2,120
12.83
1984
237,001,000
1,668
7.04
1956
168,221,000
2,202
13.09
1985
239,279,000
1,649
6.89


hitting
takers
issue
problem
offence
taken
sad
lover
sport
team
metro
new
york
area
decade
still
exist
non-entity
're
getting
sabres-bruins
replacement
game
probably
devils-penguins
game
played
scny
islanders-caps
overflow
game
sca
scny
plus
sabres-bruins
ends
early
'll
go
devils-penguins
game
assuming
espn
follows
previous
patterns
got
last
minute
islanders-rangers
overtime
two
weeks
ago
espn
's
coverage
started
last
night
accident
one
writers
pointed
could
've
gone
wild
hog
wrestling
evening
instead
gld
posted
ago
without
code
excerpts
noone
able
help
problem
main_win.win
fine
create
detail_win.win
receive
's
initial
expose
events
main_win.win
receives
event
relevent
calls
main_win.win
xcreatesimplewindow
mydisplay
defaultrootwindow
mydisplay
myhint.x
myhint.y
myhint.width
myhint.height
main_win.line_thick
fg
bg
xsetstandardproperties
mydisplay
main_win.win
main_win.text
main_win.text
none
argv
argc
myhint
main_win.gc
xcreategc
mydisplay
main_win.win
0
0
xmapraised
mydisplay
detail_w

cryptanalysis
texts
show
break
's
possible
methods
hands
expert
's
ten
times
much
text
key
see
example
gaines
gai44
sinkov
sin66
send
encrypted
mail
unix
pgp
ripem
pem
's
one
popular
method
using
des
command
cat
file
compress
des
private_key
uuencode
mail
meanwhile
de
jure
internet
standard
works
called
pem
privacy
enhanced
mail
described
rfcs
1421
1424
join
pem
mailing
list
contact
pem-dev-request
tis.com
beta
version
pem
tested
time
writing
also
two
programs
available
public
domain
encrypting
mail
pgp
ripem
available
ftp
newsgroup
alt.security.pgp
alt.security.ripem
faq
well
pgp
commonly
used
outside
usa
since
uses
rsa
algorithm
without
license
rsa
's
patent
valid
least
primarily
usa
ripem
commonly
used
inside
usa
since
uses
rsaref
freely
available
within
usa
available
shipment
outside
usa
since
programs
use
secret
key
algorithm
encrypting
body
message
pgp
used
idea
ripem
uses
des
rsa
encrypting
message
key
able
interoperate
freely
although
repeated
calls
understand
's
formats
algori

put
words
mouth
let
suggest
settle
bother
following
postings
might
consider
developing
style
imitation
sincerest
form
flattery
quite
sure
flattery
intention
cordially
always
rm
request
would
like
see
charley
wingate
respond
charley
challenges
judging
e-mail
appear
quite
clear
mr.
wingate
intends
continue
post
tangential
unrelated
articles
ingoring
challenges
last
two
re-postings
challenges
noted
perhaps
dozen
posts
mr.
wingate
none
answered
single
challenge
seems
unmistakable
mr.
wingate
hopes
questions
go
away
level
best
change
subject
given
seems
rather
common
net.theist
tactic
would
like
suggest
impress
upon
desire
answers
following
manner
1
ignore
future
articles
mr.
wingate
address
challenges
answers
explictly
announces
refuses
--
--
2
must
respond
one
articles
include
within
something
similar
following
please
answer
questions
posed
charley
challenges
really
'm
looking
humiliate
anyone
want
honest
answers
would
n't
think
honesty
would
much
ask
devout
christian
would
nevermind
rhet

would
need
band
together
quickly
hence
small
fast
response
means
mechanized
infantry
finally
militia
fighting
equipment
needed
responses
federal
army
'm
convinced
minumum
infantry
relying
state
local
militias
functions
however
unless
missed
something
source
suggested
aside
hardware
rng
seems
available
unguessable
intruder
unix
fresh-booted
i/o
buffers
related
network
traffic
believe
solution
basically
uses
strategy
without
requiring
reach
kernel
sources
statistics
filesystems
easily
quickly
obtained
output
rusage
system
call
also
exec
finger
one
favorite
heavily-used
systems
though
take
several
seconds
cf
source
code
ripem
ripem.msu.edu
ah
core
question
let
suggest
scenario
grant
god
exists
uses
revelation
communicate
humans
said
revelation
taking
form
paraphrased
words
'this
infinitely
powerful
deity
grabs
poor
schmuck
makes
take
dictation
hides
away
hundred
years
exists
human
personally
experienced
revelation
person
observes
revelations
seem
contain
elements
contradict
rather
strongl

would
permit
police
state
function
required
suddenly
find
living
one
change
attitude
part
government
constitution
built
men
risk
lives
ensure
freedom
country
designed
system
make
difficult
tyranny
arise
instance
one
reasons
fourth
amendment
put
make
harder
government
try
make
smuggling
crime
think
jest
john
hancock
made
money
smuggling
rum
drug
think
government
everyones
keys
escrow
fbi
gets
pet
wiretap
without
leaving
office
scheme
coup
happens
every
day
around
world
within
hours
everyone
country
might
oppose
tyrants
monitored
closely
ever
possible
without
tools
place
tyranny
stand
tools
like
place
tyrannical
dictatorship
could
actually
successfully
imposed
give
government
tools
enslave
maybe
trust
bill
clinton
willing
tell
trust
every
government
ever
arise
u.s.
hereafter
willing
make
leap
faith
'm
political
dissident
acutely
aware
happens
political
dissidents
world
world
could
killed
beliefs
call
amnesty
international
time
find
happens
dissidents
world
seperates
u.s.
places
thin
piec

judge
value
basis
criticizing
values
ennumerated
bible
purposes
imputed
god
grounds
behavior
reliogious
condemned
seems
judging
values
motivate
others
action
standard
conduct
measured
nature
serve
purpose
law
nature
invoke
establish
values
c.s
lewis
tells
us
argument
main
reason
abandoned
atheism
became
christian
argument
severely
flawed
values
golden
rule
rational
basis
others
like
basic
idea
wanting
live
probably
roots
way
brains
wired
lewis
ignored
real
possiblity
natural
selection
could
also
favour
altruistic
behaviour
morality
well
indeed
humans
evolved
better
better
building
using
tools
also
became
better
killing
logical
necessity
evolution
could
favour
knew
use
tools
one's
people
bible
reveals
quite
nicely
morality
early
jews
beyond
simple
set
rules
hold
people
together
one
god
god
care
much
people
nations
time
nt
things
quite
different
jews
rule
_empire_
could
longer
simply
ignore
gentiles
new
situation
required
new
morality
along
new
religion
born
mutation
meme
pool
since
ever

prohibits
drinking
person
says
muslim
proceeds
drink
bottle
beer
mean
islam
teaches
people
drink
beer
course
idiot
would
think
see
point
judging
islam
capitalist
terms
capitalism
ideology
based
largely
assumption
people
want
maximise
wealth
--
assumption
opposition
islamic
teachings
say
islam
bad
capitalist
pretty
unthinking
--
islam
pretend
capitalist
try
capitalist
mean
islam
support
free-market
--
general
--
parts
capitalism
opposed
islam
understand
one
postulate
numerous
reasons
theory
islam
secularist
capitalist
etc
etc
selim
give
clear
historical
example
show
fallacy
views
think
obviously
islam
lack
education
power
large
part
history
islamic
world
powerful
significant
section
history
islamic
world
foremost
sciences
say
islam
example
anti-education
completely
absurd
try
blame
situation
islam
--
history
shows
conclusion
false
instead
must
reasons
situation
well
selim
viewpoint
women
islam
makes
question
extent
knowledge
islam
really
think
knowledgeable
enough
able
judge
whether
mus

greater
authentication
posted
information
hand
advances
ensuring
anonymity
remailers
forthcoming
see
privacy
_____
2.1
privacy
internet
generally
privacy
multiple
connotations
society
perhaps
even
internet
cyberspace
take
mean
exclusive
use
access
account
data
stored
directed
email
encounter
arbitrary
restrictions
searches
words
others
may
obtain
data
associated
account
without
permission
ideas
probably
fairly
limiting
liberal
scope
internet
users
consider
private
domains
users
n't
expect
want
privacy
expect
demand
_____
2.2
privacy
un
important
internet
somewhat
debatable
inflammatory
topic
arousing
passionate
opinions
internet
take
privacy
granted
rudely
surprised
find
tenuous
nonexistent
governments
rules
protect
privacy
illegal
search
seizure
clause
u.s.
constitution
adopted
others
many
antithetical
laws
prohibiting
secret
communications
allowing
wiretapping
rules
generally
carry
internet
specific
rules
governing
however
legal
repercussions
global
internet
still
largely
unknown
unt

safeguards
internet
growing
become
completely
global
international
superhighway
data
traffic
inevitably
entail
data
voice
messages
postal
mail
many
items
extremely
personal
nature
computer
items
many
people
consider
completely
private
local
hard
drives
literally
inches
global
network
connections
also
sensitive
industrial
business
information
exchanged
networks
currently
volume
may
conceivably
merge
internet
would
agree
basic
sensitive
uses
internet
significant
mechanisms
currently
place
ensure
much
privacy
new
standards
calling
uniform
introduction
privacy
enhanced
mail
pem
uses
encryption
technologies
ensure
privacy
privacy
protection
automatic
may
significantly
improve
safeguards
technology
extremely
destructive
privacy
surreptitious
surveilance
overwhelmingly
effective
protecting
e.g
encryption
government
agencies
opposed
unlimited
privacy
general
believe
lawfully
forfeited
cases
criminal
conduct
e.g
court-authorized
wiretapping
however
powerful
new
technologies
protect
privacy
comp

said
's
arrived
asked
whether
bobby
's
real
betcha
welcome
alt.atheism
rest
assured
gets
worse
pearls
wisdom
bobby
reproduce
anyone
keith
keeping
big
file
stuff
allah
's
infinite
wisdom
universe
created
nothing
saying
became
therefore
allah
exists
--
bobby
mozumder
proving
existence
allah
1
wait
said
humans
rarely
reasonable
n't
contradict
atheism
everything
explained
logic
reason
contradiction
atheism
proves
false
--
bobby
mozumder
proving
existence
allah
2
plus
believer
would
contradictory
quran
allah
exist
--
bobby
mozumder
proving
existence
allah
3
one
thing
relates
among
navy
men
get
tatoos
say
mom
love
mom
makes
virile
men
compare
homos
raised
study
get
point
--
bobby
mozumder
islamically
rigorous
alt.atheism
mmmmm
quality
*and*
quantity
new
voice
islam
pbuh
cheers
simon
know
least
one
person
list
says
first
heard
clipper
friday
morning
newspaper
another
already
fired
letter
protest
nist
point
suspect
list
interesting
various
reasons
represent
cabal
put
proposal
together
yes
othe

atheists
strong
atheists
atheists
believe
non-existence
gods
others
limit
atheism
specific
gods
christian
god
rather
making
flat-out
denials
n't
disbelieving
god
thing
believing
n't
exist
definitely
disbelief
proposition
means
one
believe
true
believing
something
true
equivalent
believing
false
one
may
simply
idea
whether
true
brings
us
agnosticism
agnosticism
term
'agnosticism
coined
professor
huxley
meeting
metaphysical
society
1876
defined
agnostic
someone
disclaimed
strong
atheism
believed
ultimate
origin
things
must
cause
unknown
unknowable
thus
agnostic
someone
believes
know
sure
whether
god
exists
words
slippery
things
language
inexact
beware
assuming
work
someone
's
philosophical
point
view
simply
fact
calls
atheist
agnostic
example
many
people
use
agnosticism
mean
weak
atheism
use
word
atheism
referring
strong
atheism
beware
also
word
atheist
many
shades
meaning
difficult
generalize
atheists
say
sure
atheists
n't
believe
god
example
certainly
case
atheists
believe
science
best

### Exploring Data With Frequency Distributions

Great -- our processed dataset contains 46,990 unique words! 

Next, let's create a frequency distribution to see which words are used the most! 

In order to do this, we'll need to concatenate every article into a single list, and then pass this list to `FreqDist()`. 

In the cell below:

* Create an empty list called `articles_concat` 
* Iterate through `processed_data` and add every article it contains to `articles_concat` 
* Pass `articles_concat` as input to `FreqDist()`  
* Display the top 200 most used words  

In [None]:
articles_concat = None

In [None]:
articles_freqdist = None


At first glance, none of these words seem very informative -- for most of the words represented here, it would be tough to guess if a given word is used equally among all five classes, or is disproportionately represented among a single class. This makes sense, because this frequency distribution represents all the classes combined. This tells us that these words are probably the least important, as they are most likely words that are used across multiple classes, thereby providing our model with little actual signal as to what class they belong to. This tells us that we probably want to focus on words that appear heavily in articles from a given class, but rarely appear in articles from other classes. You may recall from previous lessons that this is exactly where **_TF-IDF Vectorization_** really shines!

### Vectorizing with TF-IDF

Although NLTK does provide functionality for vectorizing text documents with TF-IDF, we'll make use of scikit-learn's TF-IDF vectorizer, because we already have experience with it, and because it's a bit easier to use, especially when the models we'll be feeding the vectorized data into are from scikit-learn, meaning that we don't have to worry about doing any extra processing to ensure they play nicely together. 

Recall that in order to use scikit-learn's `TfidfVectorizer()`, we need to pass in the data as raw text documents -- the `TfidfVectorizer()` handles the count vectorization process on it's own, and then fits and transforms the data into TF-IDF format. 

This means that we need to:

* Import `TfidfVectorizer` from `sklearn.feature_extraction.text` and instantiate `TfidfVectorizer()` 
* Call the vectorizer object's `.fit_transform()` method and pass in our `data` as input. Store the results in `tf_idf_data_train` 
* Also create a vectorized version of our testing data, which can be found in `newsgroups_test.data`. Store the results in `tf_idf_data_test`. 


**_NOTE:_** When transforming the test data, use the `.transform()` method, not the `.fit_transform()` method, as the vectorizer has already been fit to the training data. 

In [None]:
# Import TfidfVectorizer

In [None]:
vectorizer = None

In [None]:
tf_idf_data_train = None

In [None]:
tf_idf_data_test = None

### Modeling Our Data

Great! We've now preprocessed and explored our dataset, let's take a second to see what our data looks like in vectorized form. 

In the cell below, get the shape of `tf_idf_data`.

In [16]:
# Your code here

(2814, 36622)

Our vectorized data contains 2,814 articles, with 36,622 unique words in the vocabulary. However, the vast majority of these columns for any given article will be zero, since every article only contains a small subset of the total vocabulary. Recall that vectors mostly filled with zeros are referred to as **_Sparse Vectors_**. These are extremely common when working with text data. 

Let's check out the average number of non-zero columns in the vectors. Run the cell below to calculate this average. 

In [None]:
non_zero_cols = tf_idf_data_train.nnz / float(tf_idf_data_train.shape[0])
print("Average Number of Non-Zero Elements in Vectorized Articles: {}".format(non_zero_cols))

percent_sparse = 1 - (non_zero_cols / float(tf_idf_data_train.shape[1]))
print('Percentage of columns containing 0: {}'.format(percent_sparse))

As we can see from the output above, the average vectorized article contains 107 non-zero columns. This means that 99.7% of each vector is actually zeroes! This is one reason why it's best not to create your own vectorizers, and rely on professional packages such as scikit-learn and NLTK instead -- they contain many speed and memory optimizations specifically for dealing with sparse vectors. This way, we aren't wasting a giant chunk of memory on a vectorized dataset that only has valid information in 0.3% of it. 

Now that we've vectorized our dataset, let's create some models and fit them to our vectorized training data. 

In the cell below:

* Instantiate `MultinomialNB()` and `RandomForestClassifier()`. For random forest, set `n_estimators` to `100`. Don't worry about tweaking any of the other parameters  
* Fit each to our vectorized training data 
* Create predictions for our training and test sets
* Calculate the `accuracy_score()` for both the training and test sets (you'll find our training labels stored within the variable `target`, and the test labels stored within `newsgroups_test.target`) 

In [None]:
nb_classifier = None
rf_classifier = None

In [None]:

nb_train_preds = None
nb_test_preds = None

In [None]:

rf_train_preds = None
rf_test_preds = None

In [None]:
nb_train_score = None
nb_test_score = None
rf_train_score = None
rf_test_score = None

print("Multinomial Naive Bayes")
print("Training Accuracy: {:.4} \t\t Testing Accuracy: {:.4}".format(nb_train_score, nb_test_score))
print("")
print('-'*70)
print("")
print('Random Forest')
print("Training Accuracy: {:.4} \t\t Testing Accuracy: {:.4}".format(rf_train_score, rf_test_score))

### Interpreting Results

**_Question:_** Interpret the results seen above. How well did the models do? How do they compare to random guessing? How would you describe the quality of the model fit?

Write your answer below:

In [None]:
# Your answer here

# Summary

In this lab, we used our NLP skills to clean, preprocess, explore, and fit models to text data for classification. This wasn't easy -- great job!!