# Natural Language Processing (NLP) Part 2

## Time to pick up where we left off

**Goals:**

- Finish text classification lesson by using stemming and lemmatization in our vectorizers
- Build a simple text summarizer
- How to find similar documents with cosine similarity and clustering

In [1]:
#Imports
from time import time
import pandas as pd
pd.set_option("max.colwidth", 500)
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sb
from sklearn.cross_validation import train_test_split, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.grid_search import GridSearchCV, RandomizedSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.metrics import silhouette_score
from sklearn.decomposition import PCA, TruncatedSVD, NMF
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize, wordpunct_tokenize
from nltk.tokenize import TreebankWordTokenizer
from nltk.tag import pos_tag
from nltk.corpus import stopwords
from string import punctuation
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.util import ngrams
from textblob import TextBlob



## Text Classification continued

To wrap our text classification section, we're going to learn how to incorporate stemming and lemmatization in our vectorizers. 

In [2]:
#Load in yelp review data

path = "../data/NLP_data/yelp.csv"

yelp = pd.read_csv(path, encoding='unicode-escape')

yelp.head()

Unnamed: 0,business_id,date,review_id,stars,text,type,user_id,cool,useful,funny
0,9yKzy9PApeiPPOUJEtnvkg,2011-01-26,fWKvX83p0-ka4JS3dc6E5A,5,My wife took me here on my birthday for breakfast and it was excellent. The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure. Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning. It looked like the place fills up pretty quickly so the earlier you get here the better.\r\n\r\nDo yourself a favor and get their Bloody Mary. It was phenomenal and simply the best I've ever had. I'm pretty sure they only use ing...,review,rLtl8ZkDX5vH5nAx9C3q5Q,2,5,0
1,ZRJwVLyzEJq1VAihDhYiow,2011-07-27,IjZ33sJrzXqU-0X6U8NwyA,5,"I have no idea why some people give bad reviews about this place. It goes to show you, you can please everyone. They are probably griping about something that their own fault...there are many people like that.\r\n\r\nIn any case, my friend and I arrived at about 5:50 PM this past Sunday. It was pretty crowded, more than I thought for a Sunday evening and thought we would have to wait forever to get a seat but they said we'll be seated when the girl comes back from seating someone else. We we...",review,0a2KyEL0d3Yb1V6aivbIuQ,0,0,0
2,6oRAC4uyJCsJl1X0WZpVSA,2012-06-14,IESLBzqUCLdSzSqm0eCSxQ,4,love the gyro plate. Rice is so good and I also dig their candy selection :),review,0hT2KtfLiobPvh6cDC8JQg,0,1,0
3,_1QQZuf4zZOyFCvXc0o6Vg,2010-05-27,G-WvGaISbqqaMHlNnByodA,5,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!! It's very convenient and surrounded by a lot of paths, a desert xeriscape, baseball fields, ballparks, and a lake with ducks.\r\n\r\nThe Scottsdale Park and Rec Dept. does a wonderful job of keeping the park clean and shaded. You can find trash cans and poopy-pick up mitts located all over the park and paths.\r\n\r\nThe fenced in area is huge to let the dogs run, play, and sniff!",review,uZetl9T0NcROGOyFfughhg,1,2,0
4,6ozycU1RpktNG2-1BroVtw,2012-01-05,1uJFq2r5QfJG_6ExMRCaGw,5,"General Manager Scott Petello is a good egg!!! Not to go into detail, but let me assure you if you have any issues (albeit rare) speak with Scott and treat the guy with some respect as you state your case and I'd be surprised if you don't walk out totally satisfied as I just did. Like I always say..... ""Mistakes are inevitable, it's how we recover from them that is important""!!!\r\n\r\nThanks to Scott and his awesome staff. You've got a customer for life!! .......... :^)",review,vYmM4KTsC8ZfQBg-j5MWkw,0,0,0


In [3]:
# Create a new DataFrame called yelp_best_worst that only contains the 5-star and 1-star reviews
yelp_best_worst = yelp[(yelp.stars==5) | (yelp.stars==1)]

In [4]:
# define X and y
X = yelp_best_worst.text
y = yelp_best_worst.stars

#Null accuracy
print y.value_counts(normalize=True)

# split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

5    0.816691
1    0.183309
Name: stars, dtype: float64


In [5]:
#Look at the analyzer section of the CountVectorizer doc strings
CountVectorizer()

CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [None]:
# analyzer = "word"... let's focus on this. Let's set it to a function instead!

The analyzer argument allows us to upload our function to transform/tokenize the words in our corpura

In [16]:
# define a function that accepts text and returns a list of stems
def word_tokenize_stem(text):
    words = TextBlob(text).words
    stemmer = SnowballStemmer("english")
    
    return [stemmer.stem(word) for word in words]

# define a function that accepts text and returns a list of lemons (noun version)

def word_tokenize_lemma(text):
    words = TextBlob(text).words
    return [word.lemmatize() for word in words]

# define a function that accepts text and returns a list of lemons (verb version)
def word_tokenize_lemma_verb(text):
    words = TextBlob(text).words
    return [word.lemmatize(pos = "v") for word in words]

Let's try our three new functions with both count and tfidf vectorizers. 
<br>
- First let's create a function that takes in an initialized but unfit vectorizer as an argument.
- Fit and transforms training data using the vectorizer
- Transforms the testing data
- Fits naive bayes model on training data.
- Evaluate it on the training and testing data.
- Prints the number of features and scores

In [8]:
def text_model_evaluator(vect):
    X_train_dtm = vect.fit_transform(X_train)
    X_test_dtm = vect.transform(X_test)
    nb = MultinomialNB()
    nb.fit(X_train_dtm, y_train)
    print "Features: ", X_train_dtm.shape[1]
    print "Training Score: ", nb.score(X_train_dtm, y_train)
    print "Testing Score: ", nb.score(X_test_dtm, y_test)

In [9]:
#Intialize Count Vectorizer with stop_words set to english and analyzer to word_tokenize_stem

vect = CountVectorizer(stop_words="english", analyzer=word_tokenize_stem)

#Pass vectorizer into function
text_model_evaluator(vect)

Features:  13273
Training Score:  0.970626631854
Testing Score:  0.924657534247


In [17]:
#Intialize Count Vectorizer with stop_words set to english and analyzer to word_tokenize_lemma

vect = CountVectorizer(stop_words="english", analyzer=word_tokenize_lemma)

#Pass vectorizer into function
text_model_evaluator(vect)

Features:  20599
Training Score:  0.974216710183
Testing Score:  0.904109589041


In [14]:
#Intialize Count Vectorizer with stop_words set to english and analyzer to word_tokenize_lemma_verb

vect = CountVectorizer(stop_words="english", analyzer=word_tokenize_lemma_verb)

#Pass vectorizer into function
text_model_evaluator(vect)

Features:  19431
Training Score:  0.974216710183
Testing Score:  0.906066536204


How do you interpret these results? Let's try it again with tfidf

In [None]:
#stem is more accurate here, and more thorough. But there's still overfitting..

In [18]:
#Intialize Tfidf Vectorizer with stop_words set to english and analyzer to word_tokenize_stem

vect = TfidfVectorizer(stop_words="english", analyzer=word_tokenize_stem)

#Pass vectorizer into function
text_model_evaluator(vect)

Features:  13273
Training Score:  0.816906005222
Testing Score:  0.819960861057


In [19]:
#Intialize Tfidf Vectorizer with stop_words set to english and analyzer to word_tokenize_lemma

vect = TfidfVectorizer(stop_words= "english", analyzer=word_tokenize_lemma)

#Pass vectorizer into function
text_model_evaluator(vect)

Features:  20599
Training Score:  0.817232375979
Testing Score:  0.819960861057


In [20]:
#Intialize Tfidf Vectorizer with stop_words set to english and analyzer to word_tokenize_lemma

vect = TfidfVectorizer(stop_words= "english", analyzer=word_tokenize_lemma_verb)

#Pass vectorizer into function
text_model_evaluator(vect)

Features:  19431
Training Score:  0.817232375979
Testing Score:  0.819960861057


How do the tfidf vectorizers compare to counts?

In [None]:
# The results are much worse, but it's not overfit : p 

Grid search time. Let's grid search objects that incorporate all of the analyzer functions for count and tfidf vectorizers. In addition we'll do the same for randomized search.

Countvectorizer gridsearch

In [21]:
#Make pipeline for countvectorizer and naive bayes model
pipe_cv = make_pipeline(CountVectorizer(), MultinomialNB())

#Intialize parameters for count vectorizer
param_grid_cv = {}
param_grid_cv["countvectorizer__max_features"] = [1000, 2500 ,5000, 7500,10000]
param_grid_cv["countvectorizer__ngram_range"] = [(1,1), (1,2), (2,2)]
param_grid_cv["countvectorizer__lowercase"] = [True, False]
param_grid_cv["countvectorizer__binary"] = [True, False]
param_grid_cv["countvectorizer__analyzer"] = ["word", word_tokenize_stem,
                                              word_tokenize_lemma, word_tokenize_lemma_verb]

In [None]:
#Grid search object, this will run over 1000 models! 

grid_cv = GridSearchCV(pipe_cv, param_grid_cv, cv = 5, scoring = "accuracy")

#intialize time stamp
t = time()
#fit grid search object
grid_cv.fit(X, y)
#Print time elapsed
print time() - t

In [None]:
#Best parameters
print grid_cv.best_params_
#Best score
print grid_cv.best_score_

Tfidfvectorizer gridsearch

In [22]:
#Make pipeline for tfidfvectorizer and naive bayes model
pipe_tf = make_pipeline(TfidfVectorizer(), MultinomialNB())


#Intialize parameters for tfidf vectorizer
param_grid_tf = {}
param_grid_tf["tfidfvectorizer__max_features"] = [1000, 2500 ,5000, 7500,10000]
param_grid_tf["tfidfvectorizer__ngram_range"] = [(1,1), (1,2), (2,2)]
param_grid_tf["tfidfvectorizer__lowercase"] = [True, False]
param_grid_tf["tfidfvectorizer__binary"] = [True, False]
param_grid_tf["tfidfvectorizer__analyzer"] = ["word", word_tokenize_stem,
                                              word_tokenize_lemma, word_tokenize_lemma_verb]

In [None]:
#Grid search object

grid_tf = GridSearchCV(pipe_tf, param_grid_tf, cv = 5, scoring = "accuracy")

#intialize time stamp
t = time()
#fit grid search object
grid_tf.fit(X, y)
#Print time elapsed
print time() - t

Countvectorizer randomized search

In [None]:
#Randomized grid search with n_iter = 5
randsearch_cv = RandomizedSearchCV(pipe_cv, n_iter = 5,
                        param_distributions = param_grid_cv, cv = 5, scoring = "accuracy")

#Time the code 

t = time()

#Fit grid on data
randsearch_cv.fit(X, y)

#Print time difference

print time() - t

In [None]:
#Best params
print randsearch_cv.best_params_
#Best score
print randsearch_cv.best_score_

Tfidfvectorizer randomized search

In [None]:
#Randomized grid search with n_iter = 10
randsearch_tf = RandomizedSearchCV(pipe_tf, n_iter = 10,
                        param_distributions = param_grid_tf, cv = 5, scoring = "accuracy")

#Time the code 

t = time()

#Fit grid on data
randsearch_tf.fit(X, y)

#Print time difference

print time() - t

In [None]:
#Best params
print randsearch_tf.best_params_
#Best score
print randsearch_tf.best_score_

This wraps up text classification. Now onto the rest of the lesson.

## Summarizing text

We're going to build a very simple summarizer that uses tfidf scores on a corpura of data science and artificial intelligence articles

In [23]:
#Load in data

path = "../data/NLP_data/ds_articles.csv"

#We're only be using the text and title columns
articles = pd.read_csv(path, usecols=["text", "title"], encoding="utf-8")

#Drop nulls
articles.dropna(inplace=True)

#Reset index
articles.reset_index(inplace=True, drop=True)

articles.head()

Unnamed: 0,text,title
0,One of the greatest difficulties that companies wishing to become more analytical have encountered over the last several years is finding good analysts and data scientists. A considerable amount of printer’s ink has been spilled into articles over this issue. Many of them mention consultants’ or analyst firms’ projections about how many quantitative analysts or data scientists will be needed in our society and conclude that it will be incredibly difficult to find them.\n\nI always thought th...,What Data Scientist Shortage? Get Serious and Get Talent
1,"Within soccer’s nascent analytics movement, one metric dominates most discussions. It’s called Expected Goals or xG. Models for calculating xG differ, but the underlying concept is the same. In a nutshell, xG takes a shot’s characteristics – distance from goal, angle from goal, root cause, etc. – and assigns a probability that said shot will result in a goal. Accounting for these probabilities reveals which team creates better scoring opportunities. Given a season of data, xG analysis is a p...","xG, Soccer Analytics of Bundesliga in R"
2,"The company’s adjacent market opportunities are growing at a CAGR of 18% for the next five years.\n\nQualcomm (NASDAQ: QCOM) announced a few days ago that its subsidiary Qualcomm Technologies will offer OEMs its first machine learning SDK for running their own neural network models on devices powered by Snapdragon 820 SoCs. The devices include smartphones, cars and drones among many others. Gary Brotman, director of product management, Qualcomm Technologies, said:\n\nWith the introduction of...",Qualcomm: Taking Artificial Intelligence To A New Level
3,"How Web, Tech Companies Use GPUs to Put Deep Learning at Your Fingertips\n\nGPUs have helped researchers spark a deep-learning revolution that’s given computers super-human capabilities.\n\nThey’ve already enabled breakthrough results on the industry-standard ImageNet benchmark. They’re powering Facebook’s “Big Sur” deep learning computing platform. They’re also accelerating major advances in deep learning across a broad range of fields.\n\nGPUs have become the go-to technology for training ...",How Companies Use GPUs to Put Deep Learning at Your Fingertips
4,"White House technology policy adviser Kristen Honey urged government and industry IT leaders to support the open data movement and showcase their work at two upcoming data innovation events.\n\nSpeaking Wednesday to a standing-room-only audience at the annual Data Innovation Summit in Washington, Honey highlighted a number of the administration’s open data initiatives, dating back to 2009, that are leading to innovative advances in medicine, agriculture, energy, transportation and education....",White House official urges IT leaders to join open data efforts


In [24]:
articles.shape

(1418, 2)

In [25]:
#Info
articles.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1418 entries, 0 to 1417
Data columns (total 2 columns):
text     1418 non-null object
title    1418 non-null object
dtypes: object(2)
memory usage: 22.2+ KB


In [None]:
# tfidf is a score of words based on how many times it showed up & ranks them w/ a higher value. 
# lower score are stock or useless words.

In [29]:
#Intialize tfidf with stop_words = english, max_features = 1000, and stem analzyer 

tfidf = TfidfVectorizer(stop_words="english", max_features=1000,
                        analyzer=word_tokenize_stem)

#Fit and transform the text using the tfidf vectorizer
text = articles.text
dtm = tfidf.fit_transform(text)

#Assign tokens to features
features = tfidf.get_feature_names()

In [30]:
#Create a dataframe of features and their idf scores
idfscores = pd.DataFrame()
idfscores["tokens"] = features
idfscores["scores"] = tfidf.idf_

In [31]:
#Top ten most imporant words
idfscores.sort_values(by = "scores", ascending = False).head(10)

Unnamed: 0,tokens,scores
999,⭐️,7.56456
937,var,6.311798
2,0.0,5.772801
4,1.0,4.961871
133,blockchain,4.594146
639,pdf,4.365887
625,p,4.345685
822,split,4.345685
422,https,4.197265
824,sql,4.146834


In [32]:
#Top ten least imporant words
idfscores.sort_values(by = "scores", ascending = True).head(10)

Unnamed: 0,tokens,scores
896,to,1.012766
875,the,1.015625
599,of,1.020649
20,a,1.024975
72,and,1.025697
438,in,1.033683
469,is,1.049848
355,for,1.049848
874,that,1.067786
977,with,1.070051


Let's our summarizer function that will randomly select an article to summarize. By summarize, I mean show the top five words with the highest tfidf values

In [37]:
def summarize():
    
    index = np.random.choice(articles.index, 1)[0]
    article = text.iloc[index]
    
    word_scores = {}
    for word in TextBlob(article).words:
        word = word.lower()
        if word in features:
            word_scores[word] = dtm[index, features.index(word)]
    print "TOP SCORING WORD: "
    top_scores = sorted(word_scores.items(), key=lambda x:x[1], reverse = True)[:5]
    
    for word, score in top_scores:
        print word
        
    print "\n", articles.title[index]
    
    print "\n\n\n", article 
    
    

In [41]:
#Give it a go
summarize()

TOP SCORING WORD: 
microsoft
the
that
and
app

Silicon Valley’s Artificial Intelligence Marathon Is On



Photo

Artificial intelligence. Chatbots. Messaging. Sound familiar?

These were some of the themes that Google brought up at its annual developer conference on Wednesday. At the event, the Silicon Valley company introduced an Internet-connected speaker called Google Home that is powered by A.I. and a new messaging app called Allo, among other things.

These are also some of the very same topics that have come up at developer conferences held by Microsoft and Facebook this year. In March, Microsoft spent time talking about A.I. and bots, which are the pieces of software that can be used to produce new methods of interaction with computers, like chat interfaces. A month later, Facebook said it was opening up its Messenger messaging app so developers could create chatbots for the service.

If there’s a certain sameness to it all, it illustrates how tech behemoths are all moving into 

## Text Similarity with Cosine Similarity and Clustering

### Cosine Similarity

![ew](https://i2.wp.com/dataaspirant.com/wp-content/uploads/2015/04/cosine.png?w=697)
<br><br>
" Cosine similarity metric finds the normalized dot product of the two attributes. By determining the cosine similarity, we would effectively try to find the cosine of the angle between the two objects. The cosine of 0° is 1, and it is less than 1 for any other angle.

It is thus a judgement of orientation and not magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors at 90° have a similarity of 0, and two vectors diametrically opposed have a similarity of -1, independent of their magnitude.

Cosine similarity is particularly used in positive space, where the outcome is neatly bounded in (0,1). One of the reasons for the popularity of cosine similarity is that it is very efficient to evaluate, especially for sparse vectors."
<br>
Source: [Dataaspirant](http://dataaspirant.com/2015/04/11/five-most-popular-similarity-measures-implementation-in-python/)

In [42]:
#Diy cosine similarity function

def square_rooted(x):

    return round(np.sqrt(sum([a*a for a in x])),3)
 
def cosine_similarity_function(x,y):

    numerator = sum(a*b for a,b in zip(x,y))
    denominator = square_rooted(x)*square_rooted(y)
    return round(numerator/float(denominator),3)
 
vec1 = [3, 45, 7, 2]
vec2 = [2, 54, 13, 15]
print cosine_similarity_function(vec1, vec2)

0.972


Derive matrix of similarities between all the data science articles documents.

In [44]:
#Calculate cosine distance for each pair of documents
dist = cosine_similarity(dtm.toarray())

In [45]:
#make it a dataframe
dist_df = pd.DataFrame(dist)

#Shape
dist_df.shape

(1418, 1418)

Let's compare some articles!

In [46]:
#Index position of article
index = 239

In [52]:
#Assign titles column to titles variable

titles = articles.title

#Print title
print titles[index]

#print article
print "\n ************************************************ \n" , text[index]


10 Popular TV Shows on Data Science and Artificial Intelligence

 ************************************************ 
Introduction

The development of full artificial intelligence could spell the end of human race. – Stephen Hawking

The world is now rapidly moving towards achieving this finest technology breakthrough ever. It is expected that AI would enrich humans with more power and opportunities. Another group of people (including Stephen Hawking and Elon Musk) believe that this might lead to human destruction (if not handled carefully).

I think, it’s too early for us to envisage such uncertain future. Good news is, companies like Google, Microsoft, Baidu have already started creating products based on AI. It won’t be long enough to experience the influence of AI in our daily lives.

Accidentally, my exploration of AI started with movie ‘Her’. The influence was so powerful that I ended up creating an infographic on 10 Movies on Data Science and Machine Learning. May be a ~ 2 hours m

We need to take the index value and use it grab the column of the scores between every article and the one at index 935

In [48]:
#Pass index value into dataframe
dist_column = dist_df[index]

In [49]:
#Get the index values of the 5 

closest_index = dist_column.nlargest(6).index[1:].tolist()

In [50]:
#Pass index values into titles and print them

for i in titles.iloc[closest_index].tolist():
    print i

Why algorithms will be at the core of our AI-powered future, and why you should care
Why We Need More Women Taking Part In The AI Revolution
How AI Is Already Changing Business
The Non-Technical Guide to Machine Learning & Artificial Intelligence
Will a machine replace me?


In [51]:
#Pass index values into titles and but don't print
text.iloc[closest_index]

1130    For the second year in a row, We Are Social had the privilege of presenting at Vivid Sydney this year. Already one of the world’s leading festivals, Vivid is a bit like SXSW: an amalgam of inspiring people, creative work, and fresh ideas across a variety of themes and topics.\n\nFor our 2017 keynote, I joined forces with We Are Social’s Sydney MD, Suzie Shaw, to explore the impact that algorithms and machine learning are having on every aspect of our lives, and what that means for marketing....
1000    Lolita Taub\n\nBy Samantha Walravens & Heather Cabot\n\nIn 2011, entrepreneur and investor Marc Andreessen wrote his famous ,"Why Software Is Eating the World" in the Wall Street Journal. Today, that story would more likely read, "Why Artificial Intelligence Is Eating the World." The market for artificial intelligence (AI) technologies-- from voice and image recognition to chat bots to self-driving cars-- is hot. A Narrative Science survey found last year that 38% of enterprises 

### Clustering

It is standard practice to cluster with tfidf data instead of the count vectorized data

In [53]:
#Intialize clustering algorithm with 4 clusters and fit it on dtm
km = KMeans(n_clusters= 4)

#Fit algorithm
km.fit(dtm)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=4, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [54]:
#Check out silhouette score
silhouette_score(dtm, km.labels_)

0.011012143631564825

In [55]:
#Assign labels to articles dataframe 

articles["cluster"] = km.labels_

Print 5 randomly selected headlines from each cluster

In [63]:
#Cluster 0
for i in articles[articles.cluster == 0].sample(n = 10).title.tolist():
    print i

[Cheat Sheet] Python Basics For Data Science
Phase Change Co-Processors Extend Memory Technologies To Computing
Airbnb open sources data-science-sharing platform
If you want to learn Data Science, take a few of these statistics classes
Facebook’s Artificial Intelligence Research lab releases open source fastText on GitHub
7 Visualizations You Should Learn in R
HBase Key Design with OpenTSDB #WhiteboardWalkthrough
How Data Scientist Skills and Qualifications Differ From Those of BI Analysts and Statisticians
DeepMind just published a mind blowing paper: PathNet.
Resources to Start Learning R Language – Beginner @ Data Science – Medium


In [62]:
#Cluster 1
for i in articles[articles.cluster == 1].sample(n = 10).title.tolist():
    print i

What’s next for blockchain and cryptocurrency
Machine Learning Is Redefining The Enterprise In 2016
OracleVoice: Machine Learning Stands To Transform The Way We Communicate
Heart Failure Identification Algorithms Developed Using EHR Data
Samsung Will Invest $1.2 Billion Into US For 'Internet Of Things'
3 Industries That Will Be Transformed By AI, Machine Learning And Big Data In The Next Decade
Machine learning methods (infographic)
Seattle is paying the most for its engineers
Artificial Intelligence in the Next-Gen Automobile
Artificial Intelligence Is Not [Only] All About Robots


In [61]:
#Cluster 2
for i in articles[articles.cluster == 2].sample(n = 10).title.tolist():
    print i

3 Steps To Jumpstart A Machine Learning Strategy
The age of analytics: Competing in a data-driven world
Statistics and Machine Learning
Cognitive Analytics Answers the Question: What's Interesting in Your Data?
Internet Of Things (IoT): 5 Essential Ways Every Company Should Use It
Are Your Predictive Models like Broken Clocks?
The Skills You Need to Become a Data Scientist
Hold Your Machine Learning and AI Models Accountable
White House: Want data science with impact? Spend ‘a ridiculous amount of time’ with people
5 Amazing Things Big Data Helps Us To Predict Now -- Plus What's On The Horizon


In [60]:
#Cluster 3
for i in articles[articles.cluster == 3].sample(n = 10).title.tolist():
    print i

Training a deep learning model to steer a car in 99 lines of code
This Super Bowl Experiment Proves Machine Learning Still Needs a Helping Hand From Humanity
IBM Advances Neuromorphic Computing for Deep Learning
Ten Myths About Machine Learning – Pedro Domingos – Medium
How Artificial Intelligence Will Change Everything
Artificial intelligence is about the people, not the machines
Machine learning and AI: Can it be the butler that can change user experience?
FLYR: Data Science Shaking Up Travel Industry
Deep Learning Isn't a Dangerous Magic Genie. It's Just Math
Answers to dozens of data science job interview questions


What do you think the clusters are? Is it easy decipher? Ignore the silhouette score, does it pass the eye test?

In [None]:
# it's hard to interpret. Doesn't really pass the eye test.. 

Let's try this exercise again but this time we'll cluster the cosine distances.

In [65]:
#Intialize clustering algorithm with 4 clusters
km2 = KMeans(n_clusters= 4)


#fit it on dist array
km2.fit(dist)


KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=4, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [66]:
#Check out silhouette score
silhouette_score(dist, km2.labels_)

0.28199394634494024

Print 5 randomly selected headlines from each cluster

In [67]:
#Assign new labels to data frame

articles["cluster_dist"] = km2.labels_

In [68]:
articles.cluster.value_counts()

3    506
1    419
0    255
2    238
Name: cluster, dtype: int64

In [69]:
#Cluster 0
for i in articles[articles.cluster_dist == 0].sample(n = 7).title.tolist():
    print i

IBM partnership puts Watson in your ear to help you at work
Spark-based machine learning for capturing word meanings
Five New Machine Learning Tools To Make Your Software Intelligent
AI, IoT biggest disruptors in 2017
IBM Invests in R Programming Language for Data Science; Joins R Consortium
7 Steps to Mastering SQL for Data Science
Thinking About Cloud Data Analytics? Consider This


In [70]:
#Cluster 1, Investment/Business related? #FORBESmagazine??
for i in articles[articles.cluster_dist == 1].sample(n = 7).title.tolist():
    print i

The Future of Machine Learning in Finance
Artificial intelligence, machine learning, deep learning and more
7 More Steps to Mastering Machine Learning With Python
A Technical Primer On Causality – adam kelleher – Medium
Samsung Will Invest $1.2 Billion Into US For 'Internet Of Things'
Predictive Analytics And Machine Learning AI In The Retail Supply Chain
How Machine Learning, Big Data And AI Are Changing Healthcare Forever


In [71]:
#Cluster 2, 
for i in articles[articles.cluster_dist == 2].sample(n = 7).title.tolist():
    print i

Machine learning and big data know it wasn’t you who just swiped your credit card
Salaries by Roles in Data Science and Business Intelligence
Deep Learning - A Non-Technical Introduction
A Complete Tutorial to learn Data Science in R from Scratch
How to define two-dimensional array in python
Deep Learning - A Non-Technical Introduction
Machine learning in our daily lives


In [72]:
#Cluster 3, maybe more educational?
for i in articles[articles.cluster_dist == 3].sample(n = 7).title.tolist():
    print i

The speech age
Learning from Imbalanced Classes
10 tips for getting started with machine learning
Top 10 Big Data Challenges – A Serious Look at 10 Big Data V’s
Three Things Your Organization Must Consider To Prepare For Artificial Intelligence
From Big Data to Artificial Intelligence: The Next Digital Disruption
Behind the Dream of Data Work as it Could Be


Are the results better?

# Resources


My fake news classifer article: https://opendatascience.com/blog/how-to-build-a-fake-news-classification-model/
<br>
My data science topic modeling article: https://opendatascience.com/blog/how-to-analyze-articles-about-data-science-using-data-science/
<br><br>
**Regular Expressions**
- https://www.dataquest.io/blog/regular-expressions-data-scientists/
- https://www.datacamp.com/community/tutorials/python-regular-expression-tutorial
- https://www.oreilly.com/ideas/an-introduction-to-regular-expressions


**NLP Tutorials**

- https://github.com/bonzanini/nlp-tutorial
- https://github.com/totalgood/pycon-2016-nlp-tutorial

**Text similarity:**
- https://janav.wordpress.com/2013/10/27/tf-idf-and-cosine-similarity/
- http://blog.christianperone.com/2013/09/machine-learning-cosine-similarity-for-vector-space-models-part-iii/
- http://billchambers.me/tutorials/2014/12/22/cosine-similarity-explained-in-python.html
- Explains why text similarity uses cosine similarity -> https://www.quora.com/What-are-the-mechanics-of-cosine-similarity-in-natural-language-processing

**Text classification:**
- Another fake news tutorial - > https://www.datacamp.com/community/tutorials/scikit-learn-fake-news
- http://nlpforhackers.io/text-classification/
- http://zacstewart.com/2015/04/28/document-classification-with-scikit-learn.html
- https://github.com/javedsha/text-classification
- https://towardsdatascience.com/machine-learning-nlp-text-classification-using-scikit-learn-python-and-nltk-c52b92a7c73a
- https://bbengfort.github.io/tutorials/2016/05/19/text-classification-nltk-sckit-learn.html


**Text clustering:**

- Great tutorial -> http://brandonrose.org/clustering
- http://nlpforhackers.io/recipe-text-clustering/
- https://pythonprogramminglanguage.com/kmeans-text-clustering/
- http://mccormickml.com/2015/08/05/document-clustering-example-in-scikit-learn/


**Word Embeddings/Word2Vec**

- https://chatbotsmagazine.com/introduction-to-word-embeddings-55734fd7068a
- https://www.springboard.com/blog/introduction-word-embeddings/
- http://ruder.io/word-embeddings-1/
- https://www.slideshare.net/BhaskarMitra3/a-simple-introduction-to-word-embeddings
- https://github.com/fastai/word-embeddings-workshop


**Topic Modeling**

- http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/
- https://blog.bigml.com/2016/11/16/introduction-to-topic-models/
- http://nbviewer.jupyter.org/github/ogrisel/notebooks/blob/master/nmf_topics.ipynb?create=1
- https://www.youtube.com/watch?v=ZgyA1Q2ywbM
- https://www.youtube.com/watch?v=SjRss8Uk6mQ
- https://github.com/derekgreene/topic-model-tutorial

# Lab time

Pick a text dataset to spend the rest of class working. There are three other datasets in the NLP_data that you can work with: pitchfork album reviews, fake/real news, deadspin, and political lean. Make sure to unzip political lean or fake news. You can also continue to work with the datasets we've already used (data science, yelp, spam.)

<br>

For the rest of class apply supervised or unsupervised learning techniques to the dataset of your choice. 

- Build a model that can differentiate between good/bad review, real/fake news, or liberal/conservative leaning or a model that 

- Predict how many page views a deadspin can get based on its headlines and tags.

- Ignore the labels and attempt cluster the articles.

- Have fun with the summarizer!!

<br>

Be prepared to share your results at the end of class.


In [74]:
#Load in yelp review data

path = "../data/NLP_data/yelp.csv"

yelp = pd.read_csv(path, encoding = 'unicode-escape')

yelp.head()

Unnamed: 0,business_id,date,review_id,stars,text,type,user_id,cool,useful,funny
0,9yKzy9PApeiPPOUJEtnvkg,2011-01-26,fWKvX83p0-ka4JS3dc6E5A,5,My wife took me here on my birthday for breakfast and it was excellent. The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure. Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning. It looked like the place fills up pretty quickly so the earlier you get here the better.\r\n\r\nDo yourself a favor and get their Bloody Mary. It was phenomenal and simply the best I've ever had. I'm pretty sure they only use ing...,review,rLtl8ZkDX5vH5nAx9C3q5Q,2,5,0
1,ZRJwVLyzEJq1VAihDhYiow,2011-07-27,IjZ33sJrzXqU-0X6U8NwyA,5,"I have no idea why some people give bad reviews about this place. It goes to show you, you can please everyone. They are probably griping about something that their own fault...there are many people like that.\r\n\r\nIn any case, my friend and I arrived at about 5:50 PM this past Sunday. It was pretty crowded, more than I thought for a Sunday evening and thought we would have to wait forever to get a seat but they said we'll be seated when the girl comes back from seating someone else. We we...",review,0a2KyEL0d3Yb1V6aivbIuQ,0,0,0
2,6oRAC4uyJCsJl1X0WZpVSA,2012-06-14,IESLBzqUCLdSzSqm0eCSxQ,4,love the gyro plate. Rice is so good and I also dig their candy selection :),review,0hT2KtfLiobPvh6cDC8JQg,0,1,0
3,_1QQZuf4zZOyFCvXc0o6Vg,2010-05-27,G-WvGaISbqqaMHlNnByodA,5,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!! It's very convenient and surrounded by a lot of paths, a desert xeriscape, baseball fields, ballparks, and a lake with ducks.\r\n\r\nThe Scottsdale Park and Rec Dept. does a wonderful job of keeping the park clean and shaded. You can find trash cans and poopy-pick up mitts located all over the park and paths.\r\n\r\nThe fenced in area is huge to let the dogs run, play, and sniff!",review,uZetl9T0NcROGOyFfughhg,1,2,0
4,6ozycU1RpktNG2-1BroVtw,2012-01-05,1uJFq2r5QfJG_6ExMRCaGw,5,"General Manager Scott Petello is a good egg!!! Not to go into detail, but let me assure you if you have any issues (albeit rare) speak with Scott and treat the guy with some respect as you state your case and I'd be surprised if you don't walk out totally satisfied as I just did. Like I always say..... ""Mistakes are inevitable, it's how we recover from them that is important""!!!\r\n\r\nThanks to Scott and his awesome staff. You've got a customer for life!! .......... :^)",review,vYmM4KTsC8ZfQBg-j5MWkw,0,0,0


In [None]:
# Goal: Cluster yelp text to see if we can find which star rating they belong to..

In [77]:
X = yelp.text
y = yelp.stars

In [None]:
#see RandomSearchCV for tfidf parameters below. This is the best combination!

{'tfidfvectorizer__analyzer': <function __main__.word_tokenize_lemma>,
 'tfidfvectorizer__binary': True,
 'tfidfvectorizer__lowercase': True,
 'tfidfvectorizer__max_features': 1000,
 'tfidfvectorizer__ngram_range': (2, 2)}

In [216]:
#Intialize tfidf with the above parameters

tfidf = TfidfVectorizer(stop_words="english", max_features=1000, ngram_range = (2, 2),
                        analyzer=word_tokenize_lemma, binary = True, lowercase = True)

#Fit and transform the text using the tfidf vectorizer
text = articles.text
dtm = tfidf.fit_transform(text)

#Assign tokens to features
features = tfidf.get_feature_names()

In [217]:
#Create a dataframe of features and their idf scores
idfscores = pd.DataFrame()
idfscores["tokens"] = features
idfscores["scores"] = tfidf.idf_

In [220]:
idfscores.sort_values(by = "scores", ascending = False).head(20)
#the words changed here. Previously, it was "bagel", "donuts", "nail" etc. More nouncs, now they're more like verbs.

Unnamed: 0,tokens,scores
864,talking,3.478584
307,described,3.478584
776,scenario,3.478584
571,low,3.478584
463,healthcare,3.478584
140,allowing,3.478584
902,told,3.478584
156,announced,3.478584
930,unit,3.478584
387,extract,3.478584


In [219]:
idfscores.sort_values(by="scores", ascending=True).head(10)

Unnamed: 0,tokens,scores
899,to,1.01348
879,the,1.017775
632,of,1.020649
105,a,1.024975
155,and,1.025697
493,in,1.037334
521,is,1.050589
414,for,1.060272
878,that,1.070051
977,with,1.074596


In [142]:
def summarize():
    #Randomly choose index value
    index = np.random.choice(articles.index, 1)[0]
    article = text.iloc[index]
    # create a dictionary of words and their TF-IDF scores
    word_scores = {}
    for word in TextBlob(article).words:
        word = word.lower()
        if word in features:
            word_scores[word] = dtm[index, features.index(word)]
            
   # print words with the top 5 TF-IDF scores
    print 'TOP 10 SCORING WORDS BY REVIEW:'
    top_scores = sorted(word_scores.items(), key=lambda x: x[1], reverse=True)[:10]
    for word, score in top_scores:
        print word   
        
    #Print title of article
    print "\nSTAR RATING: ", yelp.stars[index]
    
    #Print the text of article
#     print article

In [121]:
summarize()

TOP 10 SCORING WORDS BY REVIEW:
pie
neighborhood
finish
the
pork
spot
both
perfect
meal
make

STAR RATING:  5


In [122]:
summarize()

TOP 10 SCORING WORDS BY REVIEW:
mom
beef
without
and
was
we
were
over
i
for

STAR RATING:  4


In [123]:
summarize()

TOP 10 SCORING WORDS BY REVIEW:
shrimp
the
fine
fish
i
they
and
review
lot
after

STAR RATING:  3


In [144]:
summarize()

TOP 10 SCORING WORDS BY REVIEW:
care
she
how
i
young
they
drop
would
about
15

STAR RATING:  2


In [158]:
summarize()

TOP 10 SCORING WORDS BY REVIEW:
the
waiter
we
dessert
our
explain
unless
meal
to
a

STAR RATING:  1


In [159]:
dist = cosine_similarity(dtm.toarray())

In [160]:
#make it a dataframe
dist_df = pd.DataFrame(dist)

#Shape
dist_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,9990,9991,9992,9993,9994,9995,9996,9997,9998,9999
0,1.0,0.265631,0.14016,0.150543,0.157744,0.31024,0.366842,0.186357,0.209724,0.133153,...,0.185904,0.206033,0.299977,0.263117,0.233961,0.223305,0.273634,0.252683,0.170309,0.165041
1,0.265631,1.0,0.095235,0.178524,0.248129,0.394907,0.329775,0.193584,0.193605,0.138281,...,0.180837,0.253761,0.329982,0.246323,0.348538,0.417698,0.29604,0.274741,0.25062,0.228771
2,0.14016,0.095235,1.0,0.102721,0.067641,0.134734,0.162686,0.053588,0.026819,0.024263,...,0.089237,0.099667,0.124686,0.135001,0.090076,0.098566,0.123984,0.131283,0.061732,0.079356
3,0.150543,0.178524,0.102721,1.0,0.124986,0.185893,0.21469,0.113088,0.084104,0.108312,...,0.163275,0.230002,0.253871,0.128041,0.249466,0.115347,0.177156,0.221494,0.160121,0.108506
4,0.157744,0.248129,0.067641,0.124986,1.0,0.190418,0.205416,0.17519,0.141164,0.13247,...,0.118275,0.154964,0.237319,0.119964,0.255874,0.131317,0.1844,0.205218,0.175987,0.096345


In [None]:
#Intialize clustering algorithm with 4 clusters
km_yelp = KMeans(n_clusters=5)

#fit it on dist array

km_yelp.fit(dist)

In [None]:
#Intialize range of cluster values from 2 to 16
cluster_range = range(2, 7)

#Intialize list to store inertia scores

i_scores = []

#Iterate over cluster range, fit models and add score to s_scores

for cluster in cluster_range:
    model = KMeans(n_clusters=cluster)
    model.fit(dist)
    
    score = model.inertia_
    i_scores.append(score)
    
#Plot clusters versus scores

plt.figure(figsize=(10, 7))
plt.plot(cluster_range, i_scores, linewidth = 6, alpha = .8, c = "g")
plt.xlabel("Cluster Values")
plt.ylabel("Inertia Scores");

In [163]:
#Check out silhouette score
silhouette_score(dist, km_yelp.labels_)

0.14864134505266238

In [164]:
yelp["cluster_dist"] = km_yelp.labels_

In [173]:
#Cluster 0
for i in yelp[yelp.cluster_dist == 0].sample(n=5).stars.tolist():
    print (i)

5
4
5
4
3


In [167]:
#Cluster 1
for i in yelp[yelp.cluster_dist == 1].sample(n=5).stars.tolist():
    print (i)

4
5
1
4
4


In [168]:
#Cluster 2
for i in yelp[yelp.cluster_dist == 2].sample(n=5).stars.tolist():
    print (i)

2
2
5
4
4


In [172]:
#Cluster 3
for i in yelp[yelp.cluster_dist == 3].sample(n=5).stars.tolist():
    print (i)

4
5
5
4
2


In [170]:
#Cluster 4
for i in yelp[yelp.cluster_dist == 4].sample(n=5).stars.tolist():
    print (i)

4
4
1
4
5


In [180]:
# Create a new DataFrame called yelp_best_worst that only contains the 5-star and 1-star reviews
yelp_best_worst = yelp[(yelp.stars==5) | (yelp.stars==1)]

In [181]:
# define X and y
X = yelp_best_worst.text
y = yelp_best_worst.stars

#Null accuracy
print y.value_counts(normalize=True)

# split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

5    0.816691
1    0.183309
Name: stars, dtype: float64


In [182]:
#Make pipeline for countvectorizer and naive bayes model
pipe_cv = make_pipeline(CountVectorizer(), MultinomialNB())

#Intialize parameters for count vectorizer
param_grid_cv = {}
param_grid_cv["countvectorizer__max_features"] = [1000, 2500 ,5000, 7500,10000]
param_grid_cv["countvectorizer__ngram_range"] = [(1,1), (1,2), (2,2)]
param_grid_cv["countvectorizer__lowercase"] = [True, False]
param_grid_cv["countvectorizer__binary"] = [True, False]
param_grid_cv["countvectorizer__analyzer"] = ["word", word_tokenize_stem,
                                              word_tokenize_lemma, word_tokenize_lemma_verb]

In [183]:
#Randomized grid search with n_iter = 5
randsearch_cv = RandomizedSearchCV(pipe_cv, n_iter = 5,
                        param_distributions = param_grid_cv, cv = 5, scoring = "accuracy")

#Time the code 

t = time()

#Fit grid on data
randsearch_cv.fit(X, y)

#Print time difference

print time() - t

407.359891891


In [185]:
randsearch_cv.best_params_

{'countvectorizer__analyzer': <function __main__.word_tokenize_lemma>,
 'countvectorizer__binary': True,
 'countvectorizer__lowercase': False,
 'countvectorizer__max_features': 5000,
 'countvectorizer__ngram_range': (2, 2)}

In [186]:
randsearch_cv.best_score_

0.9287812041116006

In [178]:
#Make pipeline for tfidfvectorizer and naive bayes model
pipe_tf = make_pipeline(TfidfVectorizer(), MultinomialNB())


#Intialize parameters for tfidf vectorizer
param_grid_tf = {}
param_grid_tf["tfidfvectorizer__max_features"] = [1000, 2500 ,5000, 7500,10000]
param_grid_tf["tfidfvectorizer__ngram_range"] = [(1,1), (1,2), (2,2)]
param_grid_tf["tfidfvectorizer__lowercase"] = [True, False]
param_grid_tf["tfidfvectorizer__binary"] = [True, False]
param_grid_tf["tfidfvectorizer__analyzer"] = ["word", word_tokenize_stem,
                                              word_tokenize_lemma, word_tokenize_lemma_verb]

In [195]:
#Randomized grid search with n_iter = 10
randsearch_tf = RandomizedSearchCV(pipe_tf, n_iter = 10,
                        param_distributions = param_grid_tf, cv = 5, scoring = "accuracy")

#Time the code 

t = time()

#Fit grid on data
randsearch_tf.fit(X, y)

#Print time difference

print time() - t

3835.05065179


In [196]:
randsearch_tf.best_params_

{'tfidfvectorizer__analyzer': <function __main__.word_tokenize_lemma>,
 'tfidfvectorizer__binary': True,
 'tfidfvectorizer__lowercase': True,
 'tfidfvectorizer__max_features': 1000,
 'tfidfvectorizer__ngram_range': (2, 2)}

In [198]:
randsearch_tf.best_score_

0.8607440039158101