![title](img/reaction.jpg)

In [1]:
import numpy as np # for arithmetic/ logical operations
import matplotlib.pyplot as plt # for plotting graphs 
import pandas as pd # for loading and manipulating the data

# Loading the CSV file

In [2]:
dataset = pd.read_csv('datasets/App_Reviews_50.csv')
dataset.head()

Unnamed: 0,id,percentage,review
0,1,9,I am disappointed with this app I thought that...
1,2,7,This game is only trying to make money So stup...
2,3,91,Really fun app I love it its very fun you shou...
3,4,99,User friendly. Good mix of games play daily. ...
4,5,50,Soopa fast Great app but it allways has my tea...


In [3]:
# initialize the Sentiment column with a placeholder value
dataset["Sentiment"] = "default value"
dataset.head()

Unnamed: 0,id,percentage,review,Sentiment
0,1,9,I am disappointed with this app I thought that...,default value
1,2,7,This game is only trying to make money So stup...,default value
2,3,91,Really fun app I love it its very fun you shou...,default value
3,4,99,User friendly. Good mix of games play daily. ...,default value
4,5,50,Soopa fast Great app but it allways has my tea...,default value


In [4]:
# label each review: 0 (negative) if percentage < 50, otherwise 1 (positive)
# assigning the whole column at once avoids the chained indexing
# (dataset['Sentiment'][i] = ...) that triggers pandas' SettingWithCopyWarning
dataset['Sentiment'] = (dataset['percentage'] >= 50).astype(int)


In [5]:
# gives the top 5 rows of the dataset
dataset.head()

Unnamed: 0,id,percentage,review,Sentiment
0,1,9,I am disappointed with this app I thought that...,0
1,2,7,This game is only trying to make money So stup...,0
2,3,91,Really fun app I love it its very fun you shou...,1
3,4,99,User friendly. Good mix of games play daily. ...,1
4,5,50,Soopa fast Great app but it allways has my tea...,1


In [6]:
# select all the rows and review column
dataset.iloc[:,2:3].head()

Unnamed: 0,review
0,I am disappointed with this app I thought that...
1,This game is only trying to make money So stup...
2,Really fun app I love it its very fun you shou...
3,User friendly. Good mix of games play daily. ...
4,Soopa fast Great app but it allways has my tea...


# Text Preprocessing Part
        1. Cleaning the text of all unwanted characters
        2. Lowercasing the text
        3. Splitting the text into words
        4. Removing stop words
        5. Stemming

![title](img/dirty_data.jpg)
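As a rough illustration of the cleaning, lowercasing, and splitting steps, here is a minimal sketch that uses only the standard library. The helper name `clean_text` and the tiny `STOP_WORDS` set are assumptions made for this example; the notebook itself relies on NLTK's full English stop-word list and on `PorterStemmer`, and stemming is left out here.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only;
# the notebook uses NLTK's full English list instead.
STOP_WORDS = {"i", "am", "with", "this", "it", "is", "the", "to", "of", "a"}

def clean_text(text):
    """Clean, lowercase, split, and drop stop words; stemming is omitted."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)  # replace non-letters with spaces
    lowered = letters_only.lower()                   # lowercase everything
    tokens = lowered.split()                         # split on whitespace
    return [t for t in tokens if t not in STOP_WORDS]

tokens = clean_text("User friendly. Good mix of games play daily.")
print(tokens)
```

Punctuation is replaced by spaces rather than deleted, so words separated only by a full stop or comma do not get glued together.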

In [7]:
import re
import nltk
# downloading the stop words
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

[nltk_data] Downloading package stopwords to /home/mukesh/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [8]:
corpus = [] # collection of all cleansed texts
ps = PorterStemmer() # create the stemmer once, outside the loop
stop_words = set(stopwords.words('english')) # build the stop-word set once

for i in range(0, 50):
    print("Before: ")
    print(dataset['review'][i])
    review = re.sub('[^a-zA-Z]', ' ', dataset['review'][i]) # keep letters only
    review = review.lower()
    review = review.split()
    # drop stop words, then stem the remaining words
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    print("")
    print("After: ")
    print(review)
    corpus.append(review)

Before: 
I am disappointed with this app I thought that the app would show you how to do it but it is just YouTube videos that crash my phone. I am going to have to get ride of it.

After: 
disappoint app thought app would show youtub video crash phone go get ride
Before: 
This game is only trying to make money So stupid you have to pay real money for new hero's learn how to make apps rookies

After: 
game tri make money stupid pay real money new hero learn make app rooki
Before: 
Really fun app I love it its very fun you should buy it but there is some glitches but i rate 5 over all.

After: 
realli fun app love fun buy glitch rate
Before: 
User friendly. Good mix of games play daily.  like fireworks when win a game.  Thanks.

After: 
user friendli good mix game play daili like firework win game thank
Before: 
Soopa fast Great app but it allways has my team getting beat Sort that out  there'll be another 50,000 downloading  LIVE SCORES,, 

After: 
soopa fast great app allway team get 


After: 
use good game lag fastest wifi use decent game would said star month ago barli play lag
Before: 
If I could rate no stars I would. No instructions

After: 
could rate star would instruct
Before: 
Login fail This app keeps telling me my login credentials are wrong, which it is not, removing from my fone as of now.

After: 
login fail app keep tell login credenti wrong remov fone
Before: 
Crashed right in the middle of the 1st game playin and all was good, then it just crashed, VERY FRUSTRATING

After: 
crash right middl st game playin good crash frustrat
Before: 
Need help I emailed the support already, and waiting for response. This app closed itself after some black screen, really disappointed.

After: 
need help email support alreadi wait respons app close black screen realli disappoint
Before: 
Attention please When i tried to open it, it takes too long to load eventho my Internet showed excellent .. gtMalaysia n using Samsung S5 ... please fix

After: 
attent pleas tri ope

# Why are we doing text processing?

- We have to extract features from the text.
- The number of features should not be huge.
- If too many features are extracted: more words → more columns → more sparsity → more model complexity → more computation time → slower responses.
- If fewer features are extracted: fewer words → fewer columns → less sparsity → less model complexity → less computation time → faster responses.
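To make the effect on feature count concrete, the sketch below compares the number of distinct tokens in the second review before and after cleaning, using the Before/After strings printed by the preprocessing cell. Counting whitespace-separated tokens is a simplification chosen for this example, not how the vectorizer tokenizes.

```python
# Second review from the preprocessing output, before and after cleaning.
raw = ("This game is only trying to make money So stupid you have to pay "
       "real money for new hero's learn how to make apps rookies")
cleaned = "game tri make money stupid pay real money new hero learn make app rooki"

# Each distinct token would become one column in a bag-of-words matrix.
raw_vocab = set(raw.split())
cleaned_vocab = set(cleaned.split())

print(len(raw_vocab), "distinct tokens before cleaning")
print(len(cleaned_vocab), "distinct tokens after cleaning")
```

For this single review the candidate columns drop from 21 to 12. Across the whole corpus the saving also comes from stemming, which maps variants such as "app" and "apps" to the same column.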

![title](img/Picture2.png)

![title](img/classifier_creation.png)

# 1. Creating Bag of Words Model
# Feature Extraction
    - Sparse matrix
        a matrix in which most entries are zero
    In this section,
        - Rows – reviews of the apps
        - Columns – meaningful and influential words extracted from the reviews
    For an ideal ML model,
        - sparsity should be reduced as much as possible

![title](img/Picture4.png)
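The idea of rows as reviews and columns as words can be sketched by hand with the standard library. This is only an approximation of what a count-based vectorizer produces, not scikit-learn's implementation; the two documents are cleaned reviews taken from the preprocessing output, and the variable names are chosen for this example.

```python
from collections import Counter

# Two already-cleaned reviews from the preprocessing output.
docs = [
    "realli fun app love fun buy glitch rate",
    "could rate star would instruct",
]

# Vocabulary: every distinct word, sorted so each word gets a fixed column.
vocab = sorted({word for doc in docs for word in doc.split()})

# One row per review, one column per word, each cell a word count.
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[word] for word in vocab])

# Sparsity: the share of cells that are zero.
zeros = sum(row.count(0) for row in matrix)
sparsity = zeros / (len(matrix) * len(vocab))

print(vocab)
for row in matrix:
    print(row)
print(f"sparsity = {sparsity:.2f}")
```

Even with two short reviews, 10 of the 22 cells are zero. With 50 reviews and 100 columns the share of zeros is typically much higher, which is why limiting the vocabulary matters.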


In [9]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 100) # creating an object for the class CountVectorizer
print(cv)
print(type(cv))

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=100, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
<class 'sklearn.feature_extraction.text.CountVectorizer'>


In [10]:
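# fit the vectorizer, then transform the corpus into X (the independent variables)
X = cv.fit_transform(corpus).toarray()
# note: get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out()
X_features = cv.get_feature_names()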

print("The first review is: ", corpus[0])
print("Extracted features from X (independent set): ")
print(X_features)
print("No of features extracted")
print(len(X_features))
len(X)

The first review is:  disappoint app thought app would show youtub video crash phone go get ride
Extracted features from X (independent set): 
['add', 'amaz', 'angri', 'anoth', 'app', 'attent', 'awesom', 'bad', 'best', 'better', 'brain', 'close', 'coin', 'could', 'craft', 'crash', 'disappoint', 'download', 'enough', 'error', 'even', 'ever', 'excel', 'fast', 'fieldrunn', 'first', 'fix', 'forum', 'free', 'freez', 'friendli', 'frontier', 'fun', 'galaxi', 'game', 'get', 'go', 'good', 'got', 'great', 'hard', 'hate', 'help', 'instal', 'issu', 'keep', 'know', 'lag', 'learn', 'least', 'level', 'like', 'load', 'lot', 'love', 'make', 'mayb', 'min', 'mine', 'money', 'month', 'much', 'need', 'new', 'old', 'one', 'open', 'pay', 'phone', 'play', 'pleas', 'problem', 'program', 'rate', 'real', 'realli', 'respons', 'run', 'show', 'someth', 'spend', 'star', 'start', 'still', 'support', 'time', 'tri', 'uninstal', 'updat', 'use', 'user', 'visual', 'wast', 'way', 'win', 'within', 'work', 'would', 'wrong', 

50

In [11]:
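# dependent variable: the Sentiment column (column position 3) for the 50 reviews
y = dataset.iloc[0:50, 3].values
print(y)
y = y.astype('int') # make sure the labels are integers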

[0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0]


# 2. Splitting dataset into training dataset and test dataset

![title](img/dataset_splitting.png)

In [12]:
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state = 0)
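
To make the 70/30 split concrete, here is a minimal sketch of a shuffled hold-out split in plain Python. It is not scikit-learn's implementation and will not reproduce the same row assignment; the helper name `holdout_split` is hypothetical. With 50 rows and a 0.30 test share it yields 35 training rows and 15 test rows, which matches the 15 labels printed for `y_test` later on.

```python
import math
import random

def holdout_split(rows, test_size=0.30, seed=0):
    """Shuffle row indices, then hold out a test_size share for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)       # seeded, so the split is repeatable
    n_test = math.ceil(test_size * len(rows))  # 0.30 * 50 rows -> 15 test rows
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

rows = list(range(50))  # stand-in for the 50 reviews
train_rows, test_rows = holdout_split(rows)
print(len(train_rows), len(test_rows))
```

Fixing the seed plays the same role as `random_state = 0` in the cell above: rerunning the split gives the same partition.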

In [13]:
# map the count vectors back to words to see what each split contains
# words with a nonzero count in each training review
hX_train = cv.inverse_transform(X_train)
# words with a nonzero count in each test review
hX_test = cv.inverse_transform(X_test)

In [14]:
print("Training set - Reviews and their corresponding like or dislike")
print(hX_train[0:15])
print(y_train[0:15])

Training set - Reviews and their corresponding like or dislike
[array(['best', 'ever', 'game', 'good', 'know', 'love', 'still', 'wrong'],
      dtype='<U10'), array(['app', 'best', 'even', 'fast', 'forum', 'friendli', 'great',
       'user'], dtype='<U10'), array(['app', 'keep', 'wrong'], dtype='<U10'), array(['attent', 'excel', 'fix', 'load', 'open', 'pleas', 'show', 'tri',
       'use'], dtype='<U10'), array(['game', 'mine'], dtype='<U10'), array(['angri', 'best', 'fieldrunn', 'first', 'game', 'mayb', 'need',
       'time'], dtype='<U10'), array(['app', 'good', 'least', 'month', 'one', 'pleas', 'realli', 'updat',
       'visual', 'year'], dtype='<U10'), array(['craft', 'game', 'get', 'hate', 'mine', 'rate', 'work'],
      dtype='<U10'), array(['anoth', 'awesom', 'ever', 'game', 'keep', 'know', 'love'],
      dtype='<U10'), array(['angri', 'bad', 'load', 'old', 'open', 'year'], dtype='<U10'), array(['game', 'get', 'great', 'hard', 'lot', 'money', 'work'],
      dtype='<U10'), array(['

# 3. Import and fit classifier to training set

![title](img/fornlp.png)

In [15]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn import tree

In [16]:
# Fitting Decision Tree to the Training set
classifier = tree.DecisionTreeClassifier()
classifier.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

# 4. Make predictions on test set

In [17]:
y_pred = classifier.predict(X_test)
print(hX_test)
print(y_test)
print(y_pred)

[array(['app', 'better', 'download', 'free', 'frontier', 'updat', 'work'],
      dtype='<U10'), array(['app', 'friendli', 'love', 'use', 'user', 'wast'], dtype='<U10'), array(['amaz', 'app', 'fun', 'much'], dtype='<U10'), array(['close', 'coin', 'enough', 'game', 'get', 'got', 'level', 'need',
       'play'], dtype='<U10'), array(['app', 'fun', 'love', 'rate', 'realli'], dtype='<U10'), array(['game', 'know', 'make', 'time', 'wast'], dtype='<U10'), array(['disappoint', 'fix', 'game', 'level', 'new', 'start', 'way'],
      dtype='<U10'), array(['app', 'could', 'user'], dtype='<U10'), array(['add', 'game', 'great', 'love', 'pay'], dtype='<U10'), array(['anoth', 'app', 'download', 'fast', 'get', 'great'], dtype='<U10'), array(['anoth', 'crash', 'freez', 'galaxi', 'instal', 'least', 'show',
       'star', 'time', 'tri'], dtype='<U10'), array(['game', 'keep', 'need', 'play', 'uninstal'], dtype='<U10'), array(['game', 'like', 'love', 'play', 'way'], dtype='<U10'), array(['freez', 'keep'], dty

# 5. Evaluating performance with different metrics

![title](img/confu.png)

In [18]:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
cm

array([[8, 1],
       [1, 5]])
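
Several common metrics can be read straight off this matrix. The sketch below assumes scikit-learn's layout, where rows are actual classes and columns are predicted classes, and treats class 1 (positive sentiment) as the positive class; the numbers are copied from the output above.

```python
# Confusion matrix from the cell above: rows are actual classes (0, 1),
# columns are predicted classes (0, 1).
cm = [[8, 1],
      [1, 5]]

tn, fp = cm[0]  # actual 0: predicted 0 (true negatives), predicted 1 (false positives)
fn, tp = cm[1]  # actual 1: predicted 0 (false negatives), predicted 1 (true positives)

accuracy = (tp + tn) / (tp + tn + fp + fn)  # 13 / 15
precision = tp / (tp + fp)                  # 5 / 6
recall = tp / (tp + fn)                     # 5 / 6
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

With only 15 test reviews, a single misclassification moves accuracy by almost 7 percentage points, so these figures are best read as a rough indication.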

In [19]:
# show the words found in each test review
print("the test set are")
print(hX_test[0:15])

the test set are
[array(['app', 'better', 'download', 'free', 'frontier', 'updat', 'work'],
      dtype='<U10'), array(['app', 'friendli', 'love', 'use', 'user', 'wast'], dtype='<U10'), array(['amaz', 'app', 'fun', 'much'], dtype='<U10'), array(['close', 'coin', 'enough', 'game', 'get', 'got', 'level', 'need',
       'play'], dtype='<U10'), array(['app', 'fun', 'love', 'rate', 'realli'], dtype='<U10'), array(['game', 'know', 'make', 'time', 'wast'], dtype='<U10'), array(['disappoint', 'fix', 'game', 'level', 'new', 'start', 'way'],
      dtype='<U10'), array(['app', 'could', 'user'], dtype='<U10'), array(['add', 'game', 'great', 'love', 'pay'], dtype='<U10'), array(['anoth', 'app', 'download', 'fast', 'get', 'great'], dtype='<U10'), array(['anoth', 'crash', 'freez', 'galaxi', 'instal', 'least', 'show',
       'star', 'time', 'tri'], dtype='<U10'), array(['game', 'keep', 'need', 'play', 'uninstal'], dtype='<U10'), array(['game', 'like', 'love', 'play', 'way'], dtype='<U10'), array(['fre

In [20]:
print("y_test")
print(y_test)

print("y_pred")
print(y_pred)

y_test
[0 1 1 0 1 0 0 0 1 1 0 0 0 0 1]
y_pred
[0 1 0 0 1 0 0 0 1 1 0 0 1 0 1]


In [21]:
print("Accuracy score:")
print(classifier.score(X_test, y_test))

Accuracy score:
0.8666666666666667
