![Image of Venturenix](http://www.venturenix.com//assets/images/venture-nix-logo.png)

# Lecture 10: Natural Language Processing using NLTK
<font color=grey> by Anthony@Venturenix </font>

## Learning Objectives:
- Understand what NLP is and where it is applied
- Use NLTK to perform a simple sentiment analysis on Yelp review data

## 1. Examples using NLP

- **Text Classification and Categorization**: News summary
- **Speech recognition and generation**: Speech-to-Text, Text-to-Speech
- **Q&A**: Chatbot
- **Machine translation**: Google Translate
- **Information extraction and retrieval**: Search Engine
- **Assistive technologies**: Auto-complete
- **Sentiment analysis**: Chatbot

## 2. Install NLTK (https://www.nltk.org/)
- Tutorial: https://media.readthedocs.org/pdf/nltk/latest/nltk.pdf
- Install: `sudo pip install -U nltk`

## 3. Exercise: Sentiment Analysis
Dataset: "RecSys2013: Yelp Business Rating Prediction" (https://www.kaggle.com/c/yelp-recsys-2013/data)

In [7]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns 

import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords 
import string
import sklearn

#matplotlib inline

[nltk_data] Downloading package punkt to C:\Users\Ben
[nltk_data]     Wong\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package stopwords to C:\Users\Ben
[nltk_data]     Wong\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


In [46]:
! pip install xgboost

Collecting xgboost
  Downloading xgboost-1.4.2-py3-none-win_amd64.whl (97.8 MB)
Installing collected packages: xgboost
Successfully installed xgboost-1.4.2


In [47]:
import xgboost as xgb

In [16]:
df_reviews = pd.read_csv(r'C:\Users\Ben Wong\Desktop\Venturenix Lab\Data Collection\Yelp\Yelp.csv')

In [18]:
df_reviews.head()

| Unnamed: 0 | business_id | date | review_id | stars | text | type | user_id | cool | useful | funny |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9yKzy9PApeiPPOUJEtnvkg | 2011-01-26 | fWKvX83p0-ka4JS3dc6E5A | 5 | My wife took me here on my birthday for breakf... | review | rLtl8ZkDX5vH5nAx9C3q5Q | 2 | 5 | 0 |
| 1 | ZRJwVLyzEJq1VAihDhYiow | 2011-07-27 | IjZ33sJrzXqU-0X6U8NwyA | 5 | I have no idea why some people give bad review... | review | 0a2KyEL0d3Yb1V6aivbIuQ | 0 | 0 | 0 |
| 2 | 6oRAC4uyJCsJl1X0WZpVSA | 2012-06-14 | IESLBzqUCLdSzSqm0eCSxQ | 4 | love the gyro plate. Rice is so good and I als... | review | 0hT2KtfLiobPvh6cDC8JQg | 0 | 1 | 0 |
| 3 | _1QQZuf4zZOyFCvXc0o6Vg | 2010-05-27 | G-WvGaISbqqaMHlNnByodA | 5 | Rosie, Dakota, and I LOVE Chaparral Dog Park!!... | review | uZetl9T0NcROGOyFfughhg | 1 | 2 | 0 |
| 4 | 6ozycU1RpktNG2-1BroVtw | 2012-01-05 | 1uJFq2r5QfJG_6ExMRCaGw | 5 | General Manager Scott Petello is a good egg!!!... | review | vYmM4KTsC8ZfQBg-j5MWkw | 0 | 0 | 0 |


### EDA

In [24]:
stop_words_list = stopwords.words('english')
punctuation = list(string.punctuation)
# Tokens to discard: English stop words plus punctuation marks
filter_list = set(stop_words_list + punctuation)

def token_filter(text):
    """Tokenize a review, lowercase it, and drop stop words and punctuation."""
    tokens = nltk.word_tokenize(text)
    tokens = [i.lower() for i in tokens]
    list_of_text = [i for i in tokens if i not in filter_list]
    return list_of_text
    



## NLP Flow

- Text preprocessing
    - Tokenization
    - Noise removal
    - Word standardization, e.g. stemming
- Text feature engineering
    - Entity and object parsing, topic modeling, named entity recognition, n-grams, phrase detection
    - Statistical features
        - TF-IDF
        - Frequency counting
    - Word embedding
- Modelling
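
The preprocessing steps in the list above can be sketched as follows. This is a minimal illustration, not the notebook's `token_filter`: it assumes a regular-expression tokenizer and a tiny hand-picked stop-word set (`STOP_WORDS` and `preprocess` are hypothetical names), and uses NLTK's `PorterStemmer` for the standardization step.

```python
import re
from nltk.stem import PorterStemmer

# Tiny illustrative stop-word set (an assumption; NLTK's English list is much longer)
STOP_WORDS = {"the", "a", "is", "and", "was", "were", "i", "it"}

stemmer = PorterStemmer()

def preprocess(text):
    # 1. Tokenization: lowercase, then keep runs of letters (drops punctuation and digits)
    tokens = re.findall(r"[a-z]+", text.lower())
    # 2. Noise removal: discard stop words
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # 3. Standardization: reduce each word to its stem
    return [stemmer.stem(t) for t in tokens]

print(preprocess("The waiters were friendly and the desserts were amazing!"))
```

Stemming is rule-based, so some stems are not dictionary words; that is usually acceptable because the stems are only used as feature keys.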

### Text Processing

#### First taste of Sentiment Analysis

In [4]:
import seaborn as sns

In [12]:
# First taste of Sentiment Analysis
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/Work/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


#### Back to tokenization

### Vectorize the reviews
After converting the reviews into lists of tokens, we need to transform each review into a numeric vector that can be used as input for our machine learning models.

We will use two vectorizers from sklearn:
- CountVectorizer: represents each review by its word counts
- TfidfVectorizer: represents each review by its TF-IDF weights

### CountVectorizer
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
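
To see what `CountVectorizer` produces, here is a small sketch on three made-up mini reviews (not taken from the Yelp data). It uses the default analyzer rather than `token_filter`, so no stop words are removed.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three made-up mini reviews (illustrative only)
docs = ["good food good service", "bad food", "good service"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

# vocabulary_ maps each term to its column index; sort the terms by that index
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(terms)             # ['bad', 'food', 'good', 'service']
print(counts.toarray())  # rows: [0 1 2 1], [1 1 0 0], [0 0 1 1]
```

Each column is one vocabulary term and each cell is the number of times that term occurs in the document.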

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer



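In [27]:
# Class work: Remove punctuation and Stop words
X_text = df_reviews['text']
y = df_reviews['stars']

In [28]:
# token_filter lowercases each review and drops stop words and punctuation
vectorizer = CountVectorizer(analyzer=token_filter, max_features=100)
X_fit = vectorizer.fit_transform(X_text)
# get_feature_names_out() requires scikit-learn >= 1.0; older versions use get_feature_names()
x_df = pd.DataFrame(X_fit.toarray(), columns=vectorizer.get_feature_names_out())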

In [30]:
scaler = StandardScaler()
scaler.fit(x_df)
X_norm = scaler.transform(x_df)

In [31]:
X_train, X_test, y_train, y_test = train_test_split(X_norm, y, test_size=0.33, random_state=42)

In [32]:
lr = LogisticRegression()

In [33]:
lr.fit(X_train,y_train)

LogisticRegression()

In [34]:
# Put the true ratings and the model's predictions for the training rows side by side
df_result = pd.DataFrame(y_train)
df_result['predicted'] = lr.predict(X_train)

In [35]:
df_result

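| Unnamed: 0 | stars | predicted |
|---|---|---|
| 8371 | 5 | 5 |
| 5027 | 3 | 1 |
| 9234 | 4 | 2 |
| 3944 | 5 | 5 |
| 6862 | 4 | 5 |
| ... | ... | ... |
| 5734 | 4 | 4 |
| 5191 | 3 | 3 |
| 5390 | 4 | 5 |
| 860 | 5 | 5 |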


In [42]:
from sklearn.metrics import accuracy_score

In [43]:
y_actual = df_result['stars']
y_predicted = df_result['predicted']

In [44]:
accuracy_score(y_actual,y_predicted)

0.4892537313432836
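
The accuracy above is computed on the training rows (`y_train` against predictions for `X_train`), so it measures fit rather than generalization. A minimal sketch of scoring both splits, assuming synthetic stand-in data from `make_classification` (the names `X_demo`, `y_demo` and `model` are hypothetical; in the notebook the same pattern would use `X_test` and `y_test`):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; replace with X_norm and y from the notebook)
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

train_acc = accuracy_score(y_tr, model.predict(X_tr))  # rows the model has already seen
test_acc = accuracy_score(y_te, model.predict(X_te))   # held-out rows
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```

The held-out score is the one to compare across models, since the training score tends to be optimistic.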

### TF-IDF

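TF-IDF stands for *term frequency-inverse document frequency*. It weights a term by how often it appears in a document, discounted by how common the term is across all documents.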

**TF: Term Frequency**

*TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).*

**IDF: Inverse Document Frequency**

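*IDF(t) = log_e(Total number of documents / Number of documents with term t in it).*

**TF-IDF: the product of the two**

*TF-IDF(t) = TF(t) × IDF(t).* A term scores highly when it is frequent in a document but rare across the corpus.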
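
The TF and IDF definitions can be checked by hand on a toy corpus; multiplying them gives the TF-IDF weight of a term in a document. The sketch below assumes three made-up, already-tokenized documents and hypothetical helper names (`tf`, `idf`, `tf_idf`). Note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each row by default, so its numbers will differ from this textbook version.

```python
import math

# Made-up, already-tokenized documents (illustrative only)
docs = [
    ["good", "food", "good", "service"],
    ["bad", "food"],
    ["good", "service"],
]

def tf(term, doc):
    # Share of the document's tokens that are `term`
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Natural log of (number of documents / number of documents containing `term`);
    # assumes the term occurs in at least one document
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "good": TF = 2/4, IDF = ln(3/2)  ->  about 0.203
print(round(tf_idf("good", docs[0], docs), 3))
# "bad": TF = 1/2, IDF = ln(3/1)   ->  about 0.549
print(round(tf_idf("bad", docs[1], docs), 3))
```

"bad" appears in only one document, so it receives a higher weight than "good" even though both have the same term frequency in their documents.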



#### TF-IDF

In [41]:
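# Assumption: tfidf mirrors the CountVectorizer cell above (same analyzer and feature cap)
tfidf = TfidfVectorizer(analyzer=token_filter, max_features=100)
tfidf_fit = tfidf.fit_transform(X_text)
# get_feature_names_out() requires scikit-learn >= 1.0; older versions use get_feature_names()
counts_tf = pd.DataFrame(tfidf_fit.toarray(),
                         columns=tfidf.get_feature_names_out())

# Show the 20 terms with the highest total TF-IDF weight
counts_tf.T.sum(axis = 1).reset_index().sort_values(by=0, ascending=False).head(20)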

### WORD2VEC (using pretrained GloVe vectors)

In [43]:
from gensim.models import Word2Vec
import gensim.downloader



In [44]:
glove_vectors = gensim.downloader.load('glove-twitter-50')

In [45]:
glove_vectors.most_similar('twitter')

[('facebook', 0.9354465007781982),
 ('fb', 0.9216365814208984),
 ('tweet', 0.8943408727645874),
 ('instagram', 0.8829589486122131),
 ('chat', 0.8607304096221924),
 ('tumblr', 0.8572155833244324),
 ('tweets', 0.8523542881011963),
 ('tl', 0.8520059585571289),
 ('internet', 0.8418295979499817),
 ('timeline', 0.8414894342422485)]
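
The second value in each pair returned by `most_similar` is a cosine similarity between word vectors. A minimal sketch of that measure, assuming hypothetical 3-dimensional vectors in `toy_vectors` (the real `glove-twitter-50` vectors have 50 dimensions and different values):

```python
import numpy as np

def cosine_similarity(u, v):
    # Dot product divided by the product of the vector lengths; ranges from -1 to 1
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical vectors for illustration only
toy_vectors = {
    "twitter": np.array([0.9, 0.1, 0.3]),
    "facebook": np.array([0.8, 0.2, 0.4]),
    "pizza": np.array([0.1, 0.9, 0.0]),
}

for word in ("facebook", "pizza"):
    sim = cosine_similarity(toy_vectors["twitter"], toy_vectors[word])
    print(word, round(sim, 3))
```

Vectors pointing in nearly the same direction score close to 1, which is why related words rank at the top of the list.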

In [46]:
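def text_process_w2v(review, vector_lens):
    # Average the GloVe vectors of a review's tokens into one fixed-length vector
    filter_tokens = list(string.punctuation) + nltk.corpus.stopwords.words("english")
    # Lowercase before filtering so capitalised stop words (e.g. "The") are removed too
    word_tokens = [i.lower() for i in nltk.word_tokenize(review)]
    word_tokens = [i for i in word_tokens if i not in filter_tokens]
    word_vector = np.zeros(vector_lens)
    for i in word_tokens:
        if i in glove_vectors:
            word_vector += glove_vectors[i]
    # Guard against reviews with no remaining tokens (avoids division by zero)
    return word_vector / max(len(word_tokens), 1)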

In [47]:
X_df = []
for i in X_text:
    X_df.append(text_process_w2v(i, 50))
X_df = np.array(X_df)

In [48]:
scaler = StandardScaler()
X = scaler.fit_transform(X_df)
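
The reports that follow contain only two classes (0 and 1), while `y` was defined earlier as the 1 to 5 star rating, so the ratings must have been converted to a binary label in a step that is not shown. One plausible way to do this, assuming reviews with 4 or 5 stars count as positive (the threshold is a guess, not confirmed by the notebook):

```python
import pandas as pd

# Made-up star ratings standing in for df_reviews['stars']
stars = pd.Series([5, 3, 4, 1, 2, 5])

# Assumption: 4-5 stars -> positive (1), 1-3 stars -> negative (0)
y_binary = (stars >= 4).astype(int)
print(y_binary.tolist())  # [1, 0, 1, 0, 0, 1]
```

In the notebook this would become `y = (df_reviews['stars'] >= 4).astype(int)` before the train/test split.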

In [49]:
X_train, X_test, y_train, y_test = train_test_split(X, y)
clf = LogisticRegression(random_state=0).fit(X_train, y_train)
y_pred = clf.predict(X_test)

In [50]:
confusion_matrix(y_test, y_pred)

array([[ 402,  393],
       [ 165, 1540]])

In [51]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.71      0.51      0.59       795
           1       0.80      0.90      0.85      1705

    accuracy                           0.78      2500
   macro avg       0.75      0.70      0.72      2500
weighted avg       0.77      0.78      0.77      2500
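
The figures in this report follow directly from the confusion matrix printed before it. As a worked check, using scikit-learn's convention that rows are actual classes and columns are predicted classes:

```python
# Confusion matrix from the logistic-regression run: rows = actual, columns = predicted
cm = [[402, 393],
      [165, 1540]]

tn, fp = cm[0]  # actual class 0: predicted 0, predicted 1
fn, tp = cm[1]  # actual class 1: predicted 0, predicted 1

precision_0 = tn / (tn + fn)  # of everything predicted 0, how much was really 0
recall_0 = tn / (tn + fp)     # of everything really 0, how much was found
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
accuracy = (tn + tp) / (tn + fp + fn + tp)

print(f"class 0: precision={precision_0:.2f} recall={recall_0:.2f}")  # 0.71, 0.51
print(f"class 1: precision={precision_1:.2f} recall={recall_1:.2f}")  # 0.80, 0.90
print(f"accuracy={accuracy:.2f}")                                     # 0.78
```

The low recall for class 0 means roughly half of the class 0 reviews are predicted as class 1, which the overall accuracy alone does not reveal.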



In [52]:
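# XGBClassifier lives in the xgboost module imported earlier as `xgb`;
# use a separate name for the model so the module is not shadowed
xgb_clf = xgb.XGBClassifier().fit(X_train, y_train)
y_pred = xgb_clf.predict(X_test)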





In [53]:
confusion_matrix(y_test, y_pred)

array([[ 394,  401],
       [ 191, 1514]])

In [54]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.67      0.50      0.57       795
           1       0.79      0.89      0.84      1705

    accuracy                           0.76      2500
   macro avg       0.73      0.69      0.70      2500
weighted avg       0.75      0.76      0.75      2500

