In [2]:
import spacy
import nltk

## Normalizing Text

The best data is data that's consistent - textual data usually isn't. But we can make it that way by normalizing it. To do this, we can do a number of things. 

In [3]:
#lowercase text
raw = "OMG, Natural Language Processing is SO cool and I'm really enjoying this workshop!"
tokens = nltk.word_tokenize(raw)
tokens = [i.lower() for i in tokens]
print(tokens)

['omg', ',', 'natural', 'language', 'processing', 'is', 'so', 'cool', 'and', 'i', "'m", 'really', 'enjoying', 'this', 'workshop', '!']


### Stemming

But we can do more! 

#### What is Stemming?

Stemming is the process of converting the words of a sentence to its non-changing portions. In the example of amusing, amusement, and amused above, the stem would be amus.

#### Types of Stemmers

You're probably wondering how do I convert a series of words to its stems. Luckily, NLTK has a few built-in and established stemmers available for you to use! They work slightly differently since they follow different rules - which you use depends on whatever you happen to be working on. 

First, let's try the Lancaster Stemmer: 

In [5]:
lancaster = nltk.LancasterStemmer()
stem_lanc = [lancaster.stem(i) for i in tokens]
stem_lanc

['omg',
 ',',
 'nat',
 'langu',
 'process',
 'is',
 'so',
 'cool',
 'and',
 'i',
 "'m",
 'real',
 'enjoy',
 'thi',
 'workshop',
 '!']

Another options is the Lancaster Stemmer:

In [6]:
porter = nltk.PorterStemmer()
stem_porter = [porter.stem(i) for i in tokens]
stem_porter

['omg',
 ',',
 'natur',
 'languag',
 'process',
 'is',
 'so',
 'cool',
 'and',
 'i',
 "'m",
 'realli',
 'enjoy',
 'thi',
 'workshop',
 '!']

## Difference b/w porter and lancaster

**Porter**: Most commonly used stemmer without a doubt, also one of the most gentle stemmers.  the most computationally intensive of the algorithms. It is also the oldest stemming algorithm by a large margin.

**Lancaster**: Very aggressive stemming algorithm, sometimes to a fault. With porter a, the stemmed representations are usually fairly intuitive to a reader, not so with Lancaster, as many shorter words will become totally obfuscated. The fastest algorithm here, and will reduce your working set of words hugely, but if you want more distinction, not the tool you would want.

Notice how "natural" maps to "natur" instead of "nat" and "really" maps to "realli" instead of "real" in the last stemmer. 

### Lemmatization

#### What is Lemmatization?

The WordNet lemmatizer only removes affixes if the resulting word is in its dictionary. Affixes are an addition to the base form or stem of a word in order to modify its meaning or create a new word. This additional checking process makes the lemmatizer slower than the above stemmers. Notice that it doesn't handle lying, but it converts women to woman.



#### WordNetLemmatizer

Once again, NLTK is awesome and has a built in lemmatizer for us to use: 

In [8]:
from nltk import WordNetLemmatizer

lemma = nltk.WordNetLemmatizer()
text = "Women in technology are amazing at coding"
ex = [i.lower() for i in text.split()]

lemmas = [lemma.lemmatize(i) for i in ex]
lemmas

['woman', 'in', 'technology', 'are', 'amazing', 'at', 'coding']

**Note**:

Another normalization task involves identifying non-standard words including numbers, abbreviations, and dates, and mapping any such tokens to a special vocabulary. For example, every decimal number could be mapped to a single token 0.0, and every acronym could be mapped to AAA. This keeps the vocabulary small and improves the accuracy of many language modeling tasks.

## Sentiment Analyses

Sentiment analysis involves building a system to collect and determine the emotional tone behind words. This is important because it allows you to gain an understanding of the attitudes, opinions and emotions of the people in your data.

### Preparing the Data 

To accomplish sentiment analysis computationally, we have to use techniques that will allow us to learn from data that's already been labeled. 

So what's the first step? Formatting the data so that we can actually apply NLP techniques. 

In [9]:
import nltk

def format_sentence(sent):
    return({word: True for word in nltk.word_tokenize(sent)})

Here, `format_sentence` changes a piece of text, in this case a tweet, into a dictionary of words mapped to True booleans. Though not obvious from this function alone, this will eventually allow us to train  our prediction model by splitting the text into its tokens, i.e. <i>tokenizing</i> the text.

{'!': True, 'animals': True, 'are': True, 'the': True, 'ever': True, 'Dogs': True, 'best': True}

In [16]:
pos = []
with open("./pos_tweets.txt") as f:
    for i in f: 
        pos.append([format_sentence(i), 'pos'])

In [17]:
pos

[[{"''": True,
   "'m": True,
   ',': True,
   '.': True,
   ':': True,
   'Ballads': True,
   'Cellos': True,
   'Genius': True,
   'I': True,
   '``': True,
   'and': True,
   'by': True,
   'called': True,
   'cheer': True,
   'down': True,
   'iPod': True,
   'listening': True,
   'love': True,
   'music': True,
   'my': True,
   'myself': True,
   'of': True,
   'playlist': True,
   'taste': True,
   'to': True,
   'up': True,
   'when': True},
  'pos'],
 [{"''": True,
   '.': True,
   '...': True,
   'Wanted': True,
   '``': True,
   'darn': True,
   'good': True,
   'it': True,
   'just': True,
   'movie': True,
   'pretty': True,
   'the': True,
   'was': True,
   'watched': True},
  'pos'],
 [{"'m": True, 'I': True, '``': True, 'happy': True, 'now': True}, 'pos'],
 [{'!': True,
   "''": True,
   "'ll": True,
   "'m": True,
   '--': True,
   ':': True,
   'ALL': True,
   'AT': True,
   'DONT': True,
   "FOUL'..certainly": True,
   'TIMES': True,
   '``': True,
   'and': True,
 

In [11]:
neg = []
with open("./neg_tweets.txt") as f:
    for i in f: 
        neg.append([format_sentence(i), 'neg'])

In [18]:
neg

[[{"''": True,
   '.': True,
   '?': True,
   '@': True,
   'London': True,
   'What': True,
   '``': True,
   'a': True,
   'boy': True,
   'busy': True,
   'do': True,
   'evening': True,
   'iggigg': True,
   'in': True,
   'is': True,
   'me': True,
   'see': True,
   'this': True,
   'to': True,
   'too': True},
  'neg'],
 [{'!': True,
   "''": True,
   ',': True,
   '...': True,
   '2010': True,
   '?': True,
   'Ah': True,
   'BROWNS': True,
   'GO': True,
   'I': True,
   'LETS': True,
   'Lebron': True,
   'SUCK': True,
   '``': True,
   'also': True,
   'and': True,
   'are': True,
   'cavs': True,
   'city': True,
   'feeling': True,
   'going': True,
   'got': True,
   'home': True,
   'in': True,
   'lose': True,
   'lost': True,
   'must': True,
   'my': True,
   'sinking': True,
   'this': True,
   'to': True,
   'we': True,
   'well': True,
   'why': True},
  'neg'],
 [{"''": True,
   '?': True,
   'BGT': True,
   'Brothers': True,
   'Cardiff': True,
   'Chuckle': True

#### Training and Test

Next, we'll split the labeled data we have into two pieces, one that can "train" data and the other to give us insight on how well our model is performing. The training data will inform our model on which features are most important.

In [14]:
training = pos[:int((.89)*len(pos))] + neg[:int((.89)*len(neg))]

In [15]:
test = pos[int((.1)*len(pos)):] + neg[int((.1)*len(neg)):]

## Building a Classifier
All NLTK classifiers work with feature structures, which can be simple dictionaries mapping a feature name to a feature value. In this example, we’ve used a simple bag of words model where every word is a feature name with a value of True.

In [21]:
from nltk.classify import NaiveBayesClassifier

classifier = NaiveBayesClassifier.train(training)

In [22]:
classifier.show_most_informative_features()

Most Informative Features
                      no = True              neg : pos    =     20.3 : 1.0
                 awesome = True              pos : neg    =     18.7 : 1.0
                headache = True              neg : pos    =     18.0 : 1.0
                    love = True              pos : neg    =     14.2 : 1.0
               beautiful = True              pos : neg    =     12.7 : 1.0
                      Hi = True              pos : neg    =     12.7 : 1.0
                    glad = True              pos : neg    =      9.7 : 1.0
                     fan = True              pos : neg    =      9.7 : 1.0
                   Thank = True              pos : neg    =      9.7 : 1.0
                    lost = True              neg : pos    =      9.4 : 1.0


The above reads like this: For every time the word "no" appeared in a positive tweet, it appeared 20 times in a negative tweet.

Let's see how our model is doing at this point. 

In [100]:
example1 = "Today is Monday."

print(classifier.classify(format_sentence(example1)))

neg


In [101]:
example2 = "this workshop is awful."

print(classifier.classify(format_sentence(example2)))

neg


In [102]:
from nltk.classify.util import accuracy
print(accuracy(classifier, test))

0.9518005540166204


In [29]:
from nltk.metrics.scores import precision, recall

In [103]:
import collections
refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)
 
for i, (feats, label) in enumerate(test):
    refsets[label].add(i)
    observed = classifier.classify(feats)
    testsets[observed].add(i)
 

In [37]:
print ('pos precision:',precision(refsets['pos'], testsets['pos']))
print ('pos recall:', recall(refsets['pos'], testsets['pos']))

print ('neg precision:', precision(refsets['neg'], testsets['neg']))
print ('neg recall:', recall(refsets['neg'], testsets['neg']))



pos precision: 0.8901830282861897
pos recall: 0.9622302158273381
neg precision: 0.9825581395348837
neg recall: 0.9471577261809447


## Final Words 

Going back to our original sentiment analysis, we could have improved our model in a lot of ways by applying some of techniques we just went through. The twitter data is seemingly messy and inconsistent, so if we really wanted to get a highly accurate model, we could have done some preprocessing on the tweets to clean it up.

Secondly, the way in which we built our classifier could have been improved. Our feature extraction was relatively simple and could have been improved by using a bigram model rather than the bag of words model. We could have also fixed our Bayes Classifier so that it only took the most frequent words into considerations. 

In [38]:

today:Supervised learning
- entity extraction etc.
- Logistic, svm

tomorrow: 
- topic modelling 



SyntaxError: invalid syntax (<ipython-input-38-996d479a0e30>, line 2)

# Bag of Words Model 

In order to perform machine learning on text documents, we first need to turn the text content into numerical feature vectors.

**Bags of words:**
The most intuitive way to do so is the bags of words representation:

Assign a fixed integer id to each word occurring in any document of the training set (for instance by building a dictionary from words to integer indices).

For each document #i, count the number of occurrences of each word w and store it in X[i, j] as the value of feature #j where j is the index of word w in the dictionary
The bags of words representation implies that n_features is the number of distinct words in the corpus: this number is typically larger than 100,000.

## Countvectorizer 

Text preprocessing, tokenizing and filtering of stopwords are included in a high level component that is able to build a dictionary of features and transform documents to feature vectors.

We are going to use the popular 20 newsgroups data: 

**Description:**
The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of our knowledge, it was originally collected by Ken Lang, probably for his paper “Newsweeder: Learning to filter netnews,” though he does not explicitly mention this collection. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.

In [58]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

categories = ['alt.atheism', 'soc.religion.christian','comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(subset='train',categories=categories, shuffle=True, random_state=42)

count_vect = CountVectorizer()

X_train_counts = count_vect.fit_transform(twenty_train.data)

In [59]:
X_train_counts.shape

(2257, 35788)

In [60]:
X_train_counts

<2257x35788 sparse matrix of type '<class 'numpy.int64'>'
	with 365886 stored elements in Compressed Sparse Row format>

In [61]:
X_train_counts.shape

(2257, 35788)

## From occurrences to frequencies
Occurrence count is a good start but there is an issue: longer documents will have higher average count values than shorter documents, even though they might talk about the same topics.

To avoid these potential discrepancies it suffices to divide the number of occurrences of each word in a document by the total number of words in the document: these new features are called tf for Term Frequencies.

Another refinement on top of tf is to downscale weights for words that occur in many documents in the corpus and are therefore less informative than those that occur only in a smaller portion of the corpus.

This downscaling is called tf–idf for “Term Frequency times Inverse Document Frequency”. Here we are modelling with the assumption that words that appear less frequently are more important. 

In [104]:
from sklearn.feature_extraction.text import TfidfTransformer
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)

  if hasattr(X, 'dtype') and np.issubdtype(X.dtype, np.float):


In [105]:
X_train_tf.shape

(2257, 35788)

## Revisiting Sentiment Analyses

In [106]:
import pandas as pd
train_data = pd.read_csv("intermediate_train_data.tsv",sep='\t', header=None)
test_data = pd.read_csv("intermediate_test_data.csv",sep='delimiter', header=None)

  This is separate from the ipykernel package so we can avoid doing imports until


In [107]:
test_data.columns = ["Text"]
train_data.columns = ["Sentiment","Text"]

In [90]:
train_data.tail()

Unnamed: 0,Sentiment,Text
30454,0,Brokeback Mountain was boring.
30455,0,So Brokeback Mountain was really depressing.
30456,0,"As I sit here, watching the MTV Movie Awards, ..."
30457,0,Ok brokeback mountain is such a horrible movie.
30458,0,"Oh, and Brokeback Mountain was a terrible movie."


In [108]:
test_data.head()

Unnamed: 0,Text
0,Then we had stupid trivia about San Francisco ...
1,"This means we beat out schools like MIT, which..."
2,i'm off to harvard square bitches..
3,"I'm a big fan of Lakers, so I kind of have all..."
4,seattle sucks!!!...


### Preparing the Data

To implement our bag-of-words linear classifier, we need our data in a format that allows us to feed it in to the classifer. Using CountVectorizer i, we can convert the text documents to a matrix of token counts.

We need to remove punctuations, lowercase, remove stop words, and stem words. All these steps can be directly performed by CountVectorizer if we pass the right parameter values. We can do this as follows. 

We first create a stemmer, using the Porter Stemmer implementation.

In [74]:
from sklearn.feature_extraction.text import CountVectorizer        
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def stem_tokens(tokens, stemmer):
    stemmed = [stemmer.stem(item) for item in tokens]
    return(stemmed)

Here, we have our tokenizer, which removes non-letters and stems:

In [79]:
import re
def tokenize(text):
    text = re.sub("[^a-zA-Z]", " ", text)
    tokens = nltk.word_tokenize(text)
    stems = stem_tokens(tokens, stemmer)
    return(stems)

Here we initialize the vectoriser with the CountVectorizer class, making sure to pass our tokenizer and stemmers as parameters, remove stop words, and lowercase all characters.

In [80]:
vectorizer = CountVectorizer(
    analyzer = 'word',
    tokenizer = tokenize,
    lowercase = True,
    stop_words = 'english',
    max_features = 85
)

Next, we use the `fit_transform()` method to transform our corpus data into feature vectors. Since the input needed is a list of strings, we concatenate all of our training and test data. 

In [81]:
features = vectorizer.fit_transform(
    train_data.Text.tolist() + test_data.Text.tolist())

In [82]:
#creating an array for easier use
features_nd = features.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

### Linear Classifier

Finally, we begin building our classifier. Earlier we learned what a bag-of-words model. Here, we'll be using a similar model, but with some modifications. To refresh your mind, this kind of model simplifies text to a multi-set of terms frequencies. 

So first we'll split our training data to get an evaluation set. As we mentioned before, we'll use validation to split the data. sklearn has a built-in method that will do this for us. All we need to do is provide the data and assign a training percentage (in this case, 75%).

In [85]:
from sklearn.cross_validation import train_test_split

X_train, X_test, y_train, y_test  = train_test_split(
        features_nd[0:len(train_data)], 
        train_data.Sentiment,
        train_size=0.75, 
        random_state=1234)

In [92]:
from sklearn.linear_model import LogisticRegression
log_model = LogisticRegression()
log_model = log_model.fit(X=X_train, y=y_train)
y_pred = log_model.predict(X_test)

In [93]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

             precision    recall  f1-score   support

          0       0.98      0.98      0.98      3114
          1       0.99      0.99      0.99      4501

avg / total       0.99      0.99      0.99      7615



In [96]:
log_model = LogisticRegression()
log_model = log_model.fit(X=features_nd[0:len(train_data)], y=train_data.Sentiment)
test_pred = log_model.predict(features_nd[len(train_data):])

In [98]:
import random
spl = random.sample(range(len(test_pred)), 10)
for text, sentiment in zip(test_data.Text[spl], test_pred[spl]):
    print (sentiment, text)

0 TAKE THAT STUPID UCLA!!!!!!!..
0 wow i hate boston after being there on the island.
1 ps i LOVE toyota and yeh you need a HIS AND HERS....
0 DON MCGILL TOYOTA IS HORRIBLE.
0 Stupid Seattle!!
1 Set in London's delightful canal district Little Venice, The Colonnade Hotel is.....
1 I LOVE San Francisco, it is one of my favorite cities.
1 I love UCLA but miss everyone from back home.
0 I miss London...
1 I love UCLA but miss everyone from back home.
