# Data description & Problem statement: 
I will use the Yelp Review Data Set from Kaggle. Each observation in this dataset is a review of a particular business by a particular user. The "stars" column is the rating (1 through 5 stars, higher is better) that the reviewer assigned to the business. The "cool" column is the number of "cool" votes the review received from other Yelp users; the "useful" and "funny" columns count the corresponding votes in the same way. The goal here is to discover and cluster the topics of the Yelp reviews.

# Workflow:
- Load the dataset
- Data cleaning (e.g. remove formatting and punctuation)
- Basic data exploration
- Text vectorization, using "Bag of Words" technique
- Use "Latent Dirichlet Allocation" for document clustering (i.e. topic modeling)
- Determine, sort and print the most important words/features for each topic
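
The data cleaning step in the workflow above could look roughly like the sketch below. This is only an illustration: `clean_text` is a hypothetical helper (not part of the notebook), and the exact cleaning rules (lowercasing, dropping HTML-like tags, replacing punctuation with spaces) are assumptions about what "formatting and punctuation" covers.

```python
import re
import string

# Translation table that maps every ASCII punctuation character to a space
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def clean_text(text):
    """Hypothetical helper: lowercase, drop HTML-like tags and punctuation."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)        # remove HTML-like tags
    text = text.translate(_PUNCT_TO_SPACE)      # punctuation -> spaces
    return re.sub(r"\s+", " ", text).strip()    # collapse whitespace and newlines

sample = "Great <b>pizza</b>!!!\nFriendly staff, will come back."
print(clean_text(sample))
```

Note that `CountVectorizer` already lowercases the text and ignores punctuation by default, so a helper like this mainly matters when the raw reviews contain markup or other formatting.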

In [1]:
import sklearn
import numpy as np
import scipy as sc
import pandas as pd
import matplotlib.pyplot as plt

import seaborn as sns
from sklearn import preprocessing
%matplotlib inline

from scipy import stats

import warnings
warnings.filterwarnings("ignore")

# we install and import the spacy package for some advanced tokenization techniques:
import spacy

# we also install and import mglearn package (using !pip install mglearn) for some interesting visualization of results:
import mglearn

In [2]:
ls

 Volume in drive C is OS
 Volume Serial Number is 3EA9-93A4

 Directory of C:\Users\rhash\Documents\Datasets\NLP projects (sklearn & Spark)

09/11/2018  02:13 PM    <DIR>          .
09/11/2018  02:13 PM    <DIR>          ..
09/11/2018  02:11 PM    <DIR>          .ipynb_checkpoints
09/10/2018  11:29 AM    <DIR>          aclImdb
09/10/2018  11:57 AM    <DIR>          cache
08/02/2018  06:00 PM           100,912 Dataset_Challenge_Dataset_Agreement.pdf
09/11/2018  12:54 AM           149,226 IMDb review (positive vs negative reviews, in sklearn).ipynb
09/11/2018  10:02 AM             8,797 IMDb review (topic modeling, sklearn).ipynb
04/18/2011  02:53 PM             5,868 readme
09/11/2018  10:53 AM           198,102 sms filteration (ham vs spam, sklearn).ipynb
09/11/2018  10:54 AM            27,708 sms filteration (topic modeling, sklearn)-Copy1.ipynb
03/15/2011  10:36 PM           477,907 SMSSpamCollection
09/11/2018  02:13 PM           222,900 Yelp review (1 vs 5 star, sklearn).ipynb
09/1

# Load and prepare the text data:

In [3]:
reviews = pd.read_csv('yelp_review.csv')

In [8]:
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(max_features=500, 
                       stop_words="english",
                       ngram_range=(1, 1),
                       max_df=0.3)

X = vect.fit_transform(reviews['text'][0:100000])
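
To make the effect of `stop_words` and `max_df` more concrete, here is a minimal sketch on a tiny made-up corpus. The documents and the names `toy_docs`, `toy_vect` and `toy_X` are illustrative assumptions, and `max_df=0.5` is used instead of 0.3 because the toy corpus has only three documents.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Tiny made-up corpus, only for illustration
toy_docs = [
    "The pizza was great and the service was friendly",
    "Service was slow but the pizza was delicious",
    "Nice room and a great bar in Vegas",
]

# max_df=0.5 drops every term that appears in more than half of the documents
toy_vect = CountVectorizer(stop_words="english", max_df=0.5)
toy_X = toy_vect.fit_transform(toy_docs)

print(toy_X.shape)                    # (number of documents, vocabulary size)
print(sorted(toy_vect.vocabulary_))   # terms that survived the filtering
```

Terms such as "pizza" and "service" occur in two of the three documents, so they exceed the `max_df` threshold and are removed, just as very common words are dropped from the review vocabulary above.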

# document clustering with Latent Dirichlet Allocation:  LDA

In [10]:
from sklearn.decomposition import LatentDirichletAllocation

# n_components sets the number of topics (it replaced the older n_topics argument)
lda = LatentDirichletAllocation(n_components=5,
                                learning_method="batch",
                                max_iter=24,
                                random_state=42)

# We build the model and transform the data in one step  
document_topics = lda.fit_transform(X)
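
Each row of the array returned by `fit_transform` is the topic distribution of one document, so the dominant topic of a document is the column with the largest value. The sketch below illustrates this on a small made-up corpus; the documents, the two-topic setting and the names `toy_docs`, `toy_lda`, `doc_topics` and `dominant` are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Small made-up corpus, only for illustration
toy_docs = [
    "pizza burger fries chicken delicious",
    "burger fries chicken menu dinner",
    "room bar night vegas people",
    "vegas room bar nice night",
]
counts = CountVectorizer().fit_transform(toy_docs)

toy_lda = LatentDirichletAllocation(n_components=2,
                                    learning_method="batch",
                                    max_iter=10,
                                    random_state=0)
doc_topics = toy_lda.fit_transform(counts)   # shape: (n_documents, n_topics)

# Index of the most probable topic for each document
dominant = doc_topics.argmax(axis=1)

print(doc_topics.shape)
print(np.round(doc_topics, 2))
print(dominant)
```

The same `argmax(axis=1)` call on `document_topics` would assign each of the reviews to one of the 5 topics.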

In [11]:
# For each topic (a row in the components_), sort the features in descending order of weight
sorting = np.argsort(lda.components_, axis=1)[:, ::-1] 

# Get the feature names from the vectorizer 
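# (get_feature_names_out() requires scikit-learn >= 1.0; older versions use get_feature_names())
feature_names = np.array(vect.get_feature_names_out())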

In [13]:
# Print out the 5 topics: 
mglearn.tools.print_topics(topics=range(5), feature_names=feature_names,   
                           sorting=sorting, topics_per_chunk=5, n_words=10)

topic 0       topic 1       topic 2       topic 3       topic 4       
--------      --------      --------      --------      --------      
time          ordered       service       like          like          
just          restaurant    staff         chicken       vegas         
said          service       friendly      just          room          
service       pizza         recommend     ve            nice          
did           came          best          really        bar           
got           menu          love          burger        just          
told          table         amazing       fries         really        
didn          server        time          try           night         
like          delicious     ve            delicious     pretty        
don           dinner        definitely    don           people        
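
If mglearn is not available, the same kind of listing can be produced with plain NumPy. The sketch below uses a small hand-made weight matrix in place of `lda.components_`; `top_words`, `toy_components` and `toy_names` are hypothetical names introduced only for this illustration.

```python
import numpy as np

def top_words(components, feature_names, n_words=3):
    """Return the n_words highest-weighted feature names for each topic row."""
    order = np.argsort(components, axis=1)[:, ::-1]   # descending by weight
    return [[str(feature_names[i]) for i in row[:n_words]] for row in order]

# Hand-made stand-ins for lda.components_ and the vectorizer's feature names
toy_components = np.array([[0.1, 5.0, 2.0, 0.3],
                           [4.0, 0.2, 0.1, 3.0]])
toy_names = np.array(["bar", "pizza", "menu", "room"])

for k, words in enumerate(top_words(toy_components, toy_names, n_words=2)):
    print(f"topic {k}: {', '.join(words)}")
```

Calling `top_words(lda.components_, feature_names, n_words=10)` should give the same word lists as the table above, one list per topic.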


