<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px" />
# NLP II: Tokenizing/Lemmatization, Sentiment Analysis and Word2Vec
---

#### Before we begin, try running this:

In [None]:
import nltk

In [2]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [3]:
lemmatizer.lemmatize("cats")

'cat'

If you ran into issues with the above:

1. Open a Jupyter notebook and run `import nltk`.
    - If `import nltk` does not work, NLTK itself is not installed. Head to http://www.nltk.org/install.html, follow the instructions for your computer, then repeat this step.
    - If this runs without issue, fantastic! Move to step 2.
2. Run `nltk.download()`. A new window will pop up outside your Jupyter notebook. (It may be hidden behind other windows.)
3. Once this window opens, click `all`, then `download`. Once this is done, restart your Jupyter notebook.
4. Run:
```python
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize("cats")
```

    - If this returns `'cat'`, then fantastic! You’re done.
    - If not, head to http://www.nltk.org/install.html and follow instructions for your computer, then go back to step 1.

In [5]:
#!pip install gensim

### Kick-Off

Generally when we get text data, strings aren't broken out into individual words or even sentences. We might have a full tweet, full chapter of a book, or full .pdf file all in one long string.

This afternoon, we're diving into the practical side of NLP: taking this data and breaking it into words that we can then turn into $n$-grams or TF-IDF features.

A couple things to note before beginning:
1. NLP describes how we can get unstructured data into structured form. That does not mean the tools we use today work to the exclusion of other methods.
2. You can and should include other variables in your model!

#### Agenda
1. Pre-Processing
    - Break strings into words.
    - Combine words.
2. Sentiment Analysis
3. Word Vectors (if time)

In [6]:
spam = 'Hello,\nI saw your contact information on LinkedIn. I have carefully read through your profile and you seem to have an outstanding personality. This is one major reason why I am in contact with you. My name is Mr. Valery Grayfer Chairman of the Board of Directors of PJSC "LUKOIL". I am 86 years old and I was diagnosed with cancer 2 years ago. I will be going in for an operation later this week. I decided to WILL/Donate the sum of 8,750,000.00 Euros(Eight Million Seven Hundred And Fifty Thousand Euros Only etc. etc.'

In [7]:
spam

'Hello,\nI saw your contact information on LinkedIn. I have carefully read through your profile and you seem to have an outstanding personality. This is one major reason why I am in contact with you. My name is Mr. Valery Grayfer Chairman of the Board of Directors of PJSC "LUKOIL". I am 86 years old and I was diagnosed with cancer 2 years ago. I will be going in for an operation later this week. I decided to WILL/Donate the sum of 8,750,000.00 Euros(Eight Million Seven Hundred And Fifty Thousand Euros Only etc. etc.'

### Pre-Processing 

- Tokenizing
- Regular Expressions
- Lemmatizing
- Stemming
- Additional Things (e.g., removing HTML)

#### Tokenizing

When we "tokenize" data, we take it and split it up into distinct chunks based on some pattern.
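
As a rough illustration of what "splitting on a pattern" means, the sketch below compares plain whitespace splitting with a regular expression that keeps only runs of word characters. It uses only the standard library, and the `sample` string is made up for this example.

```python
import re

# Made-up sample text, for illustration only.
sample = "Hello, I am 86 years old."

# Splitting on whitespace leaves punctuation attached to the words.
whitespace_tokens = sample.lower().split()

# Matching runs of word characters (\w+) drops the punctuation instead.
word_tokens = re.findall(r"\w+", sample.lower())

print(whitespace_tokens)
print(word_tokens)
```

The second approach is the same pattern the `RegexpTokenizer` cell below relies on.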

In [8]:
#Before we can lemmatize our spam string we need to tokenize it.

from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+') ## We'll talk about this in a moment.

In [9]:
spam_tokens = tokenizer.tokenize(spam.lower())

In [10]:
spam_tokens

['hello',
 'i',
 'saw',
 'your',
 'contact',
 'information',
 'on',
 'linkedin',
 'i',
 'have',
 'carefully',
 'read',
 'through',
 'your',
 'profile',
 'and',
 'you',
 'seem',
 'to',
 'have',
 'an',
 'outstanding',
 'personality',
 'this',
 'is',
 'one',
 'major',
 'reason',
 'why',
 'i',
 'am',
 'in',
 'contact',
 'with',
 'you',
 'my',
 'name',
 'is',
 'mr',
 'valery',
 'grayfer',
 'chairman',
 'of',
 'the',
 'board',
 'of',
 'directors',
 'of',
 'pjsc',
 'lukoil',
 'i',
 'am',
 '86',
 'years',
 'old',
 'and',
 'i',
 'was',
 'diagnosed',
 'with',
 'cancer',
 '2',
 'years',
 'ago',
 'i',
 'will',
 'be',
 'going',
 'in',
 'for',
 'an',
 'operation',
 'later',
 'this',
 'week',
 'i',
 'decided',
 'to',
 'will',
 'donate',
 'the',
 'sum',
 'of',
 '8',
 '750',
 '000',
 '00',
 'euros',
 'eight',
 'million',
 'seven',
 'hundred',
 'and',
 'fifty',
 'thousand',
 'euros',
 'only',
 'etc',
 'etc']

#### Regular Expressions

Regular Expressions, or RegEx, are a very helpful way for us to detect patterns in text. 
- This is a tool you should be aware of, but you'll learn more about it on Friday!

In [11]:
import re

In [12]:
for i in spam_tokens:
    print(re.findall(r'\d+', i), i)

[] hello
[] i
[] saw
[] your
[] contact
[] information
[] on
[] linkedin
[] i
[] have
[] carefully
[] read
[] through
[] your
[] profile
[] and
[] you
[] seem
[] to
[] have
[] an
[] outstanding
[] personality
[] this
[] is
[] one
[] major
[] reason
[] why
[] i
[] am
[] in
[] contact
[] with
[] you
[] my
[] name
[] is
[] mr
[] valery
[] grayfer
[] chairman
[] of
[] the
[] board
[] of
[] directors
[] of
[] pjsc
[] lukoil
[] i
[] am
['86'] 86
[] years
[] old
[] and
[] i
[] was
[] diagnosed
[] with
[] cancer
['2'] 2
[] years
[] ago
[] i
[] will
[] be
[] going
[] in
[] for
[] an
[] operation
[] later
[] this
[] week
[] i
[] decided
[] to
[] will
[] donate
[] the
[] sum
[] of
['8'] 8
['750'] 750
['000'] 000
['00'] 00
[] euros
[] eight
[] million
[] seven
[] hundred
[] and
[] fifty
[] thousand
[] euros
[] only
[] etc
[] etc


In a regular expression, `\d+` matches one or more consecutive digits. The code above therefore checked each token in `spam_tokens` for digits and printed any matches next to the token.

A `RegexpTokenizer` splits a string into substrings using a regular expression. (You could also say it uses RegEx to tokenize.)

The following example is pulled from [this site](http://www.nltk.org/_modules/nltk/tokenize/regexp.html).

In [13]:
s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."

In [14]:
print(s)

Good muffins cost $3.88
in New York.  Please buy me
two of them.

Thanks.


In [15]:
tokenizer_1 = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')

In [16]:
tokenizer_1.tokenize(s)

['Good',
 'muffins',
 'cost',
 '$3.88',
 'in',
 'New',
 'York',
 '.',
 'Please',
 'buy',
 'me',
 'two',
 'of',
 'them',
 '.',
 'Thanks',
 '.']

In [17]:
tokenizer_2 = RegexpTokenizer(r'\s+', gaps=True)

In [18]:
tokenizer_2.tokenize(s)

['Good',
 'muffins',
 'cost',
 '$3.88',
 'in',
 'New',
 'York.',
 'Please',
 'buy',
 'me',
 'two',
 'of',
 'them.',
 'Thanks.']

In [19]:
capword_tokenizer = RegexpTokenizer(r'[A-Z]\w+')

In [20]:
capword_tokenizer.tokenize(s)

['Good', 'New', 'York', 'Please', 'Thanks']
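
For comparison, the same three results can be reproduced with Python's built-in `re` module: `re.findall` plays the role of a pattern-based tokenizer, and `re.split` plays the role of `gaps=True`. This is a sketch of the idea, not a description of how NLTK implements `RegexpTokenizer` internally.

```python
import re

s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."

# The pattern describes the tokens themselves (like tokenizer_1).
tokens_1 = re.findall(r"\w+|\$[\d\.]+|\S+", s)

# The pattern describes the gaps between tokens (like gaps=True in tokenizer_2).
# Empty strings are filtered out in case the text starts or ends with whitespace.
tokens_2 = [t for t in re.split(r"\s+", s) if t]

# Only words that start with a capital letter (like capword_tokenizer).
cap_tokens = re.findall(r"[A-Z]\w+", s)

print(tokens_1)
print(tokens_2)
print(cap_tokens)
```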

#### Lemmatizing

When we "lemmatize" data, we take words and attempt to return their *lemma*, or the base/dictionary form of a word.

In [21]:
tokens_lem = [lemmatizer.lemmatize(i) for i in spam_tokens]

In [22]:
tokens_lem

['hello',
 'i',
 'saw',
 'your',
 'contact',
 'information',
 'on',
 'linkedin',
 'i',
 'have',
 'carefully',
 'read',
 'through',
 'your',
 'profile',
 'and',
 'you',
 'seem',
 'to',
 'have',
 'an',
 'outstanding',
 'personality',
 'this',
 'is',
 'one',
 'major',
 'reason',
 'why',
 'i',
 'am',
 'in',
 'contact',
 'with',
 'you',
 'my',
 'name',
 'is',
 'mr',
 'valery',
 'grayfer',
 'chairman',
 'of',
 'the',
 'board',
 'of',
 'director',
 'of',
 'pjsc',
 'lukoil',
 'i',
 'am',
 '86',
 'year',
 'old',
 'and',
 'i',
 'wa',
 'diagnosed',
 'with',
 'cancer',
 '2',
 'year',
 'ago',
 'i',
 'will',
 'be',
 'going',
 'in',
 'for',
 'an',
 'operation',
 'later',
 'this',
 'week',
 'i',
 'decided',
 'to',
 'will',
 'donate',
 'the',
 'sum',
 'of',
 '8',
 '750',
 '000',
 '00',
 'euro',
 'eight',
 'million',
 'seven',
 'hundred',
 'and',
 'fifty',
 'thousand',
 'euro',
 'only',
 'etc',
 'etc']

In [23]:
paired = list(zip(spam_tokens, tokens_lem))

In [24]:
paired

[('hello', 'hello'),
 ('i', 'i'),
 ('saw', 'saw'),
 ('your', 'your'),
 ('contact', 'contact'),
 ('information', 'information'),
 ('on', 'on'),
 ('linkedin', 'linkedin'),
 ('i', 'i'),
 ('have', 'have'),
 ('carefully', 'carefully'),
 ('read', 'read'),
 ('through', 'through'),
 ('your', 'your'),
 ('profile', 'profile'),
 ('and', 'and'),
 ('you', 'you'),
 ('seem', 'seem'),
 ('to', 'to'),
 ('have', 'have'),
 ('an', 'an'),
 ('outstanding', 'outstanding'),
 ('personality', 'personality'),
 ('this', 'this'),
 ('is', 'is'),
 ('one', 'one'),
 ('major', 'major'),
 ('reason', 'reason'),
 ('why', 'why'),
 ('i', 'i'),
 ('am', 'am'),
 ('in', 'in'),
 ('contact', 'contact'),
 ('with', 'with'),
 ('you', 'you'),
 ('my', 'my'),
 ('name', 'name'),
 ('is', 'is'),
 ('mr', 'mr'),
 ('valery', 'valery'),
 ('grayfer', 'grayfer'),
 ('chairman', 'chairman'),
 ('of', 'of'),
 ('the', 'the'),
 ('board', 'board'),
 ('of', 'of'),
 ('directors', 'director'),
 ('of', 'of'),
 ('pjsc', 'pjsc'),
 ('lukoil', 'lukoil'),
 ('i'

In [25]:
for i in paired:
    if i[0] != i[1]:
        print(i)

('directors', 'director')
('years', 'year')
('was', 'wa')
('years', 'year')
('euros', 'euro')
('euros', 'euro')


We can also do this on individual words.

In [26]:
lemmatizer.lemmatize('computation')

'computation'

#### Stemming

When we "stem" data, we take words and attempt to return a base form of the word. Stemming tends to be cruder than lemmatization because it strips suffixes by rule rather than looking words up in a dictionary. The Porter stemmer used below is described in [Porter's 1980 paper](https://www.cs.toronto.edu/~frank/csc2501/Readings/R2_Porter/Porter-1980.pdf).

In [27]:
from nltk.stem.porter import PorterStemmer

In [28]:
# Create p_stemmer of class PorterStemmer
p_stemmer = PorterStemmer()

In [29]:
stem_spam = [p_stemmer.stem(i) for i in spam_tokens]

In [30]:
stem_spam

['hello',
 'i',
 'saw',
 'your',
 'contact',
 'inform',
 'on',
 'linkedin',
 'i',
 'have',
 'care',
 'read',
 'through',
 'your',
 'profil',
 'and',
 'you',
 'seem',
 'to',
 'have',
 'an',
 'outstand',
 'person',
 'thi',
 'is',
 'one',
 'major',
 'reason',
 'whi',
 'i',
 'am',
 'in',
 'contact',
 'with',
 'you',
 'my',
 'name',
 'is',
 'mr',
 'valeri',
 'grayfer',
 'chairman',
 'of',
 'the',
 'board',
 'of',
 'director',
 'of',
 'pjsc',
 'lukoil',
 'i',
 'am',
 '86',
 'year',
 'old',
 'and',
 'i',
 'wa',
 'diagnos',
 'with',
 'cancer',
 '2',
 'year',
 'ago',
 'i',
 'will',
 'be',
 'go',
 'in',
 'for',
 'an',
 'oper',
 'later',
 'thi',
 'week',
 'i',
 'decid',
 'to',
 'will',
 'donat',
 'the',
 'sum',
 'of',
 '8',
 '750',
 '000',
 '00',
 'euro',
 'eight',
 'million',
 'seven',
 'hundr',
 'and',
 'fifti',
 'thousand',
 'euro',
 'onli',
 'etc',
 'etc']

In [31]:
paired_stem = list(zip(spam_tokens, stem_spam))

In [32]:
paired_stem

[('hello', 'hello'),
 ('i', 'i'),
 ('saw', 'saw'),
 ('your', 'your'),
 ('contact', 'contact'),
 ('information', 'inform'),
 ('on', 'on'),
 ('linkedin', 'linkedin'),
 ('i', 'i'),
 ('have', 'have'),
 ('carefully', 'care'),
 ('read', 'read'),
 ('through', 'through'),
 ('your', 'your'),
 ('profile', 'profil'),
 ('and', 'and'),
 ('you', 'you'),
 ('seem', 'seem'),
 ('to', 'to'),
 ('have', 'have'),
 ('an', 'an'),
 ('outstanding', 'outstand'),
 ('personality', 'person'),
 ('this', 'thi'),
 ('is', 'is'),
 ('one', 'one'),
 ('major', 'major'),
 ('reason', 'reason'),
 ('why', 'whi'),
 ('i', 'i'),
 ('am', 'am'),
 ('in', 'in'),
 ('contact', 'contact'),
 ('with', 'with'),
 ('you', 'you'),
 ('my', 'my'),
 ('name', 'name'),
 ('is', 'is'),
 ('mr', 'mr'),
 ('valery', 'valeri'),
 ('grayfer', 'grayfer'),
 ('chairman', 'chairman'),
 ('of', 'of'),
 ('the', 'the'),
 ('board', 'board'),
 ('of', 'of'),
 ('directors', 'director'),
 ('of', 'of'),
 ('pjsc', 'pjsc'),
 ('lukoil', 'lukoil'),
 ('i', 'i'),
 ('am', 'am'

In [33]:
for i in paired_stem:
    if i[0] != i[1]:
        print(i)

('information', 'inform')
('carefully', 'care')
('profile', 'profil')
('outstanding', 'outstand')
('personality', 'person')
('this', 'thi')
('why', 'whi')
('valery', 'valeri')
('directors', 'director')
('years', 'year')
('was', 'wa')
('diagnosed', 'diagnos')
('years', 'year')
('going', 'go')
('operation', 'oper')
('this', 'thi')
('decided', 'decid')
('donate', 'donat')
('euros', 'euro')
('hundred', 'hundr')
('fifty', 'fifti')
('euros', 'euro')
('only', 'onli')


We can also do this on individual words.

In [34]:
p_stemmer.stem('computer')

'comput'

In [35]:
p_stemmer.stem('computation')

'comput'
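
To see why rule-based stemming produces non-words, here is a toy suffix-stripping stemmer. It is not the Porter algorithm: the suffix list and the minimum stem length are assumptions chosen for this illustration, whereas the real algorithm applies several ordered phases of rules with conditions on the remaining stem.

```python
# Hypothetical suffix rules, checked in order; NOT the Porter rule set.
SUFFIXES = ("ation", "ing", "er", "ed", "s")

def crude_stem(word, min_stem_length=2):
    """Strip the first matching suffix if enough of the word remains."""
    for suffix in SUFFIXES:
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem_length:
            return stem
    return word

for word in ["computer", "computation", "years", "was", "is"]:
    print(word, "->", crude_stem(word))
```

Because the rules never consult a dictionary, the result need not be a real word (`comput`, `wa`), yet related words such as "computer" and "computation" still collapse to the same stem.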

# Let's start with a very simple example

Let's build a function that can classify a small amount of text, such as a tweet, as positive or negative.

What words tell us whether certain text is positive?

In [36]:
theTweet = "We have some delightful new food in the cafeteria. Awesome!!!"

In [37]:
# Let's come up with a list of positive and negative words we might run into in one tweet
positive_words = ['awesome', 'lit', 'sweet', 'delightful', 'dank']
negative_words = ['poop', 'awful', 'bad', 'whack', 'grody']

In [38]:
#Tokenize

import re
theTokens = re.findall(r'\b\w[\w-]*\b', theTweet.lower())

In [39]:
theTokens

['we',
 'have',
 'some',
 'delightful',
 'new',
 'food',
 'in',
 'the',
 'cafeteria',
 'awesome']

In [40]:
numPosWords = 0
for i in theTokens:
    if i in positive_words:
        numPosWords += 1

In [41]:
numPosWords

2

In [42]:
numNegWords = 0
for i in theTokens:
    if i in negative_words:
        numNegWords += 1

In [43]:
numNegWords

0

In [44]:
# return a percentage
numWords = len(theTokens)
percent_positive = numPosWords / numWords
percent_negative = numNegWords / numWords

In [45]:
print('Positive: ' + '{:.0%}'.format(percent_positive) +
     ' Negative: ' + '{:.0%}'.format(percent_negative))

Positive: 20% Negative: 0%
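
The cells above can be collected into the function we set out to build. This is a minimal sketch: the name `score_sentiment` and its return format are choices made for this example, not part of any library.

```python
import re

positive_words = ['awesome', 'lit', 'sweet', 'delightful', 'dank']
negative_words = ['poop', 'awful', 'bad', 'whack', 'grody']

def score_sentiment(text, positive=positive_words, negative=negative_words):
    """Return the share of tokens that are positive words and negative words."""
    tokens = re.findall(r'\b\w[\w-]*\b', text.lower())
    if not tokens:                      # avoid dividing by zero on empty text
        return 0.0, 0.0
    num_pos = sum(token in positive for token in tokens)
    num_neg = sum(token in negative for token in tokens)
    return num_pos / len(tokens), num_neg / len(tokens)

pos_share, neg_share = score_sentiment(
    "We have some delightful new food in the cafeteria. Awesome!!!")
print('Positive: {:.0%} Negative: {:.0%}'.format(pos_share, neg_share))
```

Wrapping the steps in one function makes it easy to score many tweets in a loop and to swap in different word lists.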


**Check:** What are some shortcomings of this method?

# Sorting Positive from Negative Reviews

The easiest way to do sentiment classification is to train a model on data we've already labeled.

We'll begin by reviewing the basic NLP techniques we learned yesterday and use them to build a sentiment analyzer from IMDB movie reviews. This code-along is adapted from Kaggle's tutorial, available [here](https://www.kaggle.com/c/word2vec-nlp-tutorial#part-1-for-beginners-bag-of-words).


## Import the Data

In [46]:
import pandas as pd       
train = pd.read_csv("labeledTrainData.tsv", header=0, 
                    delimiter="\t", quoting=3)

In [47]:
#What are we looking at? Someone describe the columns
train.head()

Unnamed: 0,id,sentiment,review
0,"""5814_8""",1,"""With all this stuff going down at the moment ..."
1,"""2381_9""",1,"""\""The Classic War of the Worlds\"" by Timothy ..."
2,"""7759_3""",0,"""The film starts with a manager (Nicholas Bell..."
3,"""3630_4""",0,"""It must be assumed that those who praised thi..."
4,"""9495_8""",1,"""Superbly trashy and wondrously unpretentious ..."


There are a few steps we'll take to clean up the text data before it's ready for processing:

- Remove the HTML code artifacts from the text
- Remove punctuation
- Remove stopwords (what are these?)


## Step One: Remove HTML code artifacts

Fortunately, we can use Beautiful Soup to remove the HTML artifacts from our corpus.

In [48]:
from bs4 import BeautifulSoup             

# Initialize the BeautifulSoup object on a single movie review     
example1 = BeautifulSoup(train['review'][0], 'lxml')
# Print the raw review and then the output of get_text(), for 
# comparison
print(train['review'][0])
print('\n')
print(example1.get_text())

"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally sta

## Step Two: Remove Punctuation

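Punctuation can be removed using regular expressions. The pattern below replaces every character that is not a letter with a space, so digits are removed along with the punctuation.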

In [49]:
# Use regular expressions to do a find-and-replace
letters_only = re.sub("[^a-zA-Z]",           # The pattern to search for
                      " ",                   # The pattern to replace it with
                      example1.get_text() )  # The text to search
print(letters_only)

 With all this stuff going down at the moment with MJ i ve started listening to his music  watching the odd documentary here and there  watched The Wiz and watched Moonwalker again  Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent  Moonwalker is part biography  part feature film which i remember going to see at the cinema when it was originally released  Some of it has subtle messages about MJ s feeling towards the press and also the obvious message of drugs are bad m kay Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring  Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him The actual feature film bit when it finally starts is only on for    mi

In [50]:
letters_only[0:50]

' With all this stuff going down at the moment with'

In [51]:
# Let's also take this time to convert everything to lowercase 
# and split into individual words.
lower_case = letters_only.lower()
words = lower_case.split()

In [52]:
words[:10]

['with',
 'all',
 'this',
 'stuff',
 'going',
 'down',
 'at',
 'the',
 'moment',
 'with']

## Step Three: Remove Stop Words

If you didn't complete the NLTK download you may run into some issues here.

In [53]:
import nltk
#nltk.download()  # Download text data sets, including stop words. Uncomment this if you did not download

In [54]:
from nltk.corpus import stopwords

In [55]:
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [57]:
# Remove stop words from "words"
# (Build the set once; calling stopwords.words() for every word is slow.)
stops = set(stopwords.words('english'))
words = [w for w in words if w not in stops]

## Step Four: Combine our cleaning into one function

**Check**: Why should we do everything with one function?

In [58]:
def review_to_words(raw_review):
    # Function to convert a raw review to a string of words
    # The input is a single string (a raw movie review), and 
    # the output is a single string (a preprocessed movie review)
    #
    # 1. Remove HTML
    review_text = BeautifulSoup(raw_review, 'lxml').get_text()
    #
    # 2. Remove non-letters        
    letters_only = re.sub('[^a-zA-Z]',' ', review_text)
    #
    # 3. Convert to lower case, split into individual words
    words = letters_only.lower().split()
    #
    # 4. In Python, searching a set is much faster than searching
    #   a list, so convert the stop words to a set
    stops = set(stopwords.words('english'))
    # 
    # 5. Remove stop words
    meaningful_words = [w for w in words if not w in stops]
    #
    # 6. Join the words back into one string separated by space, 
    # and return the result.
    return(' '.join(meaningful_words))
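
Step 4 in the function above converts the stop words to a set. The sketch below checks that a list and a set give the same filtered result and times both; the short `stop_list` and the repeated `words` list are stand-ins made up for this example, and the timings will vary by machine.

```python
import timeit

# Stand-in stop words and tokens (assumptions for this example).
stop_list = ['i', 'the', 'a', 'an', 'and', 'of', 'to', 'in', 'is', 'was'] * 15
stop_set = set(stop_list)
words = ['the', 'film', 'was', 'a', 'delight', 'to', 'watch'] * 1000

def filter_words(words, stops):
    """Keep only the words that are not in `stops`."""
    return [w for w in words if w not in stops]

# The result is the same either way ...
same = filter_words(words, stop_list) == filter_words(words, stop_set)

# ... but each `in` test scans the list, while the set uses a hash lookup.
list_time = timeit.timeit(lambda: filter_words(words, stop_list), number=20)
set_time = timeit.timeit(lambda: filter_words(words, stop_set), number=20)
print(same, round(list_time, 4), round(set_time, 4))
```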


## Step Five (Finally!) Applying our Function

In [59]:
# Get the number of reviews based on the dataframe column size
num_reviews = train['review'].size

# Initialize an empty list to hold the clean reviews
clean_train_reviews = []

print(num_reviews)

25000


In [60]:
print("Cleaning and parsing the training set movie reviews...\n")
for i in range(0, num_reviews):
    # If the index is evenly divisible by 1000, print a message
    if((i+1) % 1000 == 0):
        print("Review %d of %d\n" % ( i+1, num_reviews ))                                                                    
    clean_train_reviews.append( review_to_words( train["review"][i] ))

Cleaning and parsing the training set movie reviews...







Review 1000 of 25000

Review 2000 of 25000

Review 3000 of 25000

Review 4000 of 25000

Review 5000 of 25000

Review 6000 of 25000

Review 7000 of 25000

Review 8000 of 25000

Review 9000 of 25000

Review 10000 of 25000

Review 11000 of 25000

Review 12000 of 25000

Review 13000 of 25000

Review 14000 of 25000

Review 15000 of 25000

Review 16000 of 25000

Review 17000 of 25000

Review 18000 of 25000

Review 19000 of 25000

Review 20000 of 25000

Review 21000 of 25000

Review 22000 of 25000

Review 23000 of 25000

Review 24000 of 25000

Review 25000 of 25000



## Our data is finally ready.....

In [61]:
from sklearn.feature_extraction.text import CountVectorizer

# Initialize the "CountVectorizer" object, which is scikit-learn's
# bag of words tool.  
vectorizer = CountVectorizer(analyzer = "word",   \
                             tokenizer = None,    \
                             preprocessor = None, \
                             stop_words = None,   \
                             max_features = 5000) 

# fit_transform() does two functions: First, it fits the model
# and learns the vocabulary; second, it transforms our training data
# into feature vectors. The input to fit_transform should be a list of 
# strings.
train_data_features = vectorizer.fit_transform(clean_train_reviews)

# Numpy arrays are easy to work with, so convert the result to an 
# array
train_data_features = train_data_features.toarray()


In [63]:
print(train_data_features.shape)

(25000, 5000)
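
To make the `(25000, 5000)` shape concrete: each row is one review and each column counts one vocabulary word. The pure-Python sketch below builds the same kind of matrix for three made-up documents. It mimics the idea behind `CountVectorizer` with `max_features`, not scikit-learn's actual implementation, and the helper name `bag_of_words` is hypothetical.

```python
from collections import Counter

# Made-up documents, already cleaned and lowercased.
docs = ["great movie great cast", "bad movie", "great fun"]

def bag_of_words(docs, max_features=None):
    """Return (vocabulary, count matrix) for whitespace-tokenized docs."""
    totals = Counter(word for doc in docs for word in doc.split())
    # Keep the most frequent words across the corpus, then sort alphabetically.
    kept = [word for word, _ in totals.most_common(max_features)]
    vocab = sorted(kept)
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        rows.append([counts[word] for word in vocab])
    return vocab, rows

vocab, matrix = bag_of_words(docs, max_features=2)
print(vocab)
print(matrix)
```

Here the matrix has 3 rows (documents) and 2 columns (the two most frequent words), just as our training matrix has 25,000 rows and 5,000 columns.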


In [64]:
# scikit-learn 1.0+ provides get_feature_names_out(); older versions use get_feature_names()
vocab = vectorizer.get_feature_names_out()
print(vocab)



### Now we have an array that we can use for classification

In [65]:
from sklearn.neighbors import KNeighborsClassifier

In [66]:
clf = KNeighborsClassifier(n_neighbors=5)

In [67]:
clf.fit(train_data_features, train['sentiment'])
#this will take a while....

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

In [None]:
## In general, we DO NOT want to fit and score on the same data!
#clf.score(train_data_features, train['sentiment'])
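
One common way to respect that warning is to hold out part of the labeled data before fitting. The sketch below does the split with the standard library only; `holdout_split`, its parameters, and the toy data are hypothetical names made up for this example. In practice, scikit-learn's `train_test_split` from `sklearn.model_selection` does the same job.

```python
import random

def holdout_split(rows, labels, test_fraction=0.25, seed=42):
    """Shuffle the indices, then split rows and labels into train and test parts."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = int(len(rows) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(rows, train_idx), pick(rows, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))

# Toy stand-ins for train_data_features and train['sentiment'].
rows = [[i, i % 3] for i in range(8)]
labels = [i % 2 for i in range(8)]

X_train, X_test, y_train, y_test = holdout_split(rows, labels)
print(len(X_train), len(X_test))
```

The classifier would then be fit on `X_train` and `y_train` and scored on `X_test` and `y_test`, which it has never seen.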

# Word Vectors

Earlier, you learned fairly simple methods of transforming text into numerical formats (count vectorizing, hash vectorizing, tf-idf transforming). Today we're going to get a high-level introduction to one of the coolest techniques in Natural Language Processing: word vectors.

For this example we will be working with a larger set of the data we used above.

In [None]:
train = pd.read_csv("labeledTrainData.tsv", header=0, delimiter="\t", quoting=3, encoding='utf-8')
test = pd.read_csv("testData.tsv", header=0, delimiter="\t", quoting=3, encoding='utf-8')
unlabeled_train = pd.read_csv("unlabeledTrainData.tsv", encoding='utf-8', header=0, delimiter="\t", quoting=3)

# When dealing with NLP you will often run into encoding errors.  
# The best way to address them is to pass an encoding parameter to pandas when you read in the data

Now we'll clean and prepare our data much as we did before.

In [None]:
def review_to_wordlist( review, remove_stopwords=False ):
    # Function to convert a document to a sequence of words,
    # optionally removing stop words.  Returns a list of words.
    #
    # 1. Remove HTML
    review_text = BeautifulSoup(review, 'lxml').get_text()
    #  
    # 2. Remove non-letters
    review_text = re.sub("[^a-zA-Z]"," ", review_text)
    #
    # 3. Convert words to lower case and split them
    words = review_text.lower().split()
    #
    # 4. Optionally remove stop words (false by default)
    if remove_stopwords:
        stops = set(stopwords.words("english"))
        words = [w for w in words if not w in stops]
    #
    # 5. Return a list of words
    return(words)


The Word2Vec implementation we'll use today expects a list of sentences, where each sentence is a list of words, so we will use a sentence **tokenizer** to split each review into sentences first.

In [None]:
# Load the punkt tokenizer
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

# Define a function to split a review into parsed sentences
def review_to_sentences( review, tokenizer, remove_stopwords=False ):
    # Function to split a review into parsed sentences. Returns a 
    # list of sentences, where each sentence is a list of words
    #
    # 1. Use the NLTK tokenizer to split the paragraph into sentences
    raw_sentences = tokenizer.tokenize(review.strip())
    #
    # 2. Loop over each sentence
    sentences = []
    for raw_sentence in raw_sentences:
        # If a sentence is empty, skip it
        if len(raw_sentence) > 0:
            # Otherwise, call review_to_wordlist to get a list of words
            sentences.append( review_to_wordlist( raw_sentence, \
              remove_stopwords ))
    #
    # Return the list of sentences (each sentence is a list of words,
    # so this returns a list of lists)
    return sentences
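
The nested-list shape is easy to lose track of, so here is a dependency-free sketch of the output format. The regex sentence splitter is a naive stand-in for the Punkt tokenizer (it would mis-split abbreviations such as "Mr."), and both `toy_review` and `naive_review_to_sentences` are made up for this example.

```python
import re

toy_review = "I loved it. The cast was great! Would watch again."

def naive_review_to_sentences(review):
    """Split on ., ! or ? and return one list of lowercase words per sentence."""
    raw_sentences = re.split(r"[.!?]+", review.strip())
    sentences = []
    for raw_sentence in raw_sentences:
        words = re.sub("[^a-zA-Z]", " ", raw_sentence).lower().split()
        if words:                       # skip empty fragments
            sentences.append(words)
    return sentences

print(naive_review_to_sentences(toy_review))
```

Each inner list is one sentence, which is the unit Word2Vec uses when it looks at the words surrounding a target word.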


Now we apply both our functions to prepare the data

In [None]:
%%time 
# This took me about 15 minutes

sentences = []  # Initialize an empty list of sentences

print("Parsing sentences from training set")
for review in train["review"]:
    sentences += review_to_sentences(review, tokenizer)

print("Parsing sentences from unlabeled set")
for review in unlabeled_train["review"]:
    sentences += review_to_sentences(review, tokenizer)

In [None]:
#Now how much data do we have?

print(len(sentences))

In [None]:
sentences[0]

## Vectorizing our Words

This will take a very long time, so start running the cell first. While it runs, we'll talk about what the result will be.

In [None]:
%%time 
#This took me about 5 minutes

# Import the built-in logging module and configure it so that Word2Vec 
# creates nice output messages
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',\
    level=logging.INFO)

# Set values for various parameters
num_features = 300    # Word vector dimensionality                      
min_word_count = 40   # Minimum word count                        
num_workers = 4       # Number of threads to run in parallel
context = 10          # Context window size                                                                                    
downsampling = 1e-3   # Downsample setting for frequent words

# Initialize and train the model (this will take some time)
from gensim.models import word2vec
print("Training model...")
# Note: gensim 4.0 renamed `size` to `vector_size`; on gensim 3.x pass size=num_features.
model = word2vec.Word2Vec(sentences, workers=num_workers, \
            vector_size=num_features, min_count = min_word_count, \
            window = context, sample = downsampling)

# If you don't plan to train the model any further, calling 
# init_sims will make the model much more memory-efficient.
# (On gensim 4.x this call is deprecated and can be omitted.)
model.init_sims(replace=True)

# It can be helpful to create a meaningful model name and 
# save the model for later use. You can load it later using Word2Vec.load()
model_name = "300features_40minwords_10context"
model.save(model_name)

Suppose I said "man plus royalty." What does that look like?

"King minus man."

"King minus man plus woman."

![example](https://adriancolyer.files.wordpress.com/2016/04/word2vec-distributed-representation.png?w=1132)

We can go far beyond this.

What about "dollar minus United States plus Europe?"

What about "Russia minus Moscow plus China?"

What is word2vec?

> Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words. Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space.


Put more plainly - 

It's a way of 'abstracting' the meaning of a word into numbers by distributing its meaning as a series of weights across the elements of a vector.

There's a very thorough discussion here: https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/

With enough training observations we can use these vectors to interpret speech patterns and define words.
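
The arithmetic can be tried by hand on toy vectors. In the sketch below, the three dimensions (royalty, maleness, femaleness) and every number are invented for illustration; a trained Word2Vec model learns hundreds of dimensions that are not individually interpretable like this.

```python
import math

# Invented 3-d vectors: (royalty, maleness, femaleness).
vectors = {
    "king":   [1.0, 1.0, 0.0],
    "queen":  [1.0, 0.0, 1.0],
    "man":    [0.0, 1.0, 0.0],
    "woman":  [0.0, 0.0, 1.0],
    "prince": [0.8, 1.0, 0.0],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def analogy(positive, negative):
    """Add the positive vectors, subtract the negative ones, return the nearest other word."""
    target = [0.0, 0.0, 0.0]
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    candidates = [w for w in vectors if w not in positive + negative]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

# "King minus man plus woman"
print(analogy(positive=["king", "woman"], negative=["man"]))
```

With a trained gensim model, the analogous query is `model.wv.most_similar(positive=['king', 'woman'], negative=['man'])`, though the answer depends on what the training corpus contains.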

## So what cool tricks can we do with word vectors?

In [None]:
# Which word doesn't match the others?
# (The query methods live on model.wv; calling them on the model itself fails in gensim 4.x.)
model.wv.doesnt_match("man woman child kitchen".split())

In [None]:
model.wv.doesnt_match("france england germany berlin".split())

In [None]:
model.wv.doesnt_match("paris berlin london austria".split())
# We are limited by the size of our training set

In [None]:
model.wv.most_similar("awful")

In [None]:
model.wv.most_similar("brisket")

# Extras

* [NLTK Sentiment Demo](http://text-processing.com/demo/sentiment/)
* [NLTK Sentiment Example Code](http://www.nltk.org/howto/sentiment.html)
* [FiveThirtyEight Article](https://fivethirtyeight.com/features/dissecting-trumps-most-rabid-online-following/) using a technique similar to Word2Vec to analyze Reddit users in the `r/thedonald` subreddit