# Shopping App Ratings for Google Play Store and Apple App Store Users

## Introduction

Users download apps for various purposes. With online shopping on the rise due to the Covid-19 pandemic, improving the shopping experience has become more important than before. With that in mind, which features should we focus on to improve a shopping app?

More specifically, the questions to be answered:

- How do the app ratings differ across different shopping apps?
- Is there any specific group of users we can look out for to improve the app?
- Are there any specific improvements we can make to further improve the user experience of the app?

To explore and answer the above questions, we will scrape reviews from the Google Play Store and Apple App Store and conduct analysis and modelling.

In [58]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.metrics import matthews_corrcoef, confusion_matrix, roc_auc_score
from sklearn.metrics import classification_report, roc_curve, cohen_kappa_score

from sklearn.model_selection import train_test_split
from tensorflow.keras import Model
from tensorflow.keras.preprocessing.text import Tokenizer                    
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Input, Embedding, Conv1D, GlobalMaxPooling1D, Flatten
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow import concat
from tensorflow.keras.metrics import Accuracy, CategoricalAccuracy, CosineSimilarity
from tensorflow.keras.regularizers import l1_l2, l2, l1
# Aliased so it does not shadow sklearn's confusion_matrix imported above
from tensorflow.math import confusion_matrix as tf_confusion_matrix
from tensorflow_addons.metrics import CohenKappa, MatthewsCorrelationCoefficient, F1Score, MultiLabelConfusionMatrix
from tensorflow.keras.utils import to_categorical, plot_model
from sklearn.preprocessing import LabelEncoder
from sklearn.utils import class_weight

In [3]:
df = pd.read_csv('../data/reviews_Model.csv')

In [4]:
df.head()

Unnamed: 0,rating,date,app,store,review,clean_content,adj,noun,verb,emoji,...,pos_score,compound_score,language,month,dayofweek,hour,minute,text_len,word_count,category
0,5,2020-09-16 20:26:28,shoppee,google,Orders mostly came early and products are good.,order come early product good,good,order product,come,,...,0.293,0.4404,en,9,3,20,26,47,8,Delivery
1,4,2020-09-16 20:13:46,shoppee,google,Good and convenient,good convenient,good convenient,,,,...,0.592,0.4404,en,9,3,20,13,19,3,Convenient App
2,4,2020-09-16 20:11:18,shoppee,google,My first purchase experience...Happy with purc...,purchase experience happy purchase thks,first happy,purchase experience purchase,,,...,0.286,0.34,en,9,3,20,11,57,7,Consumer Satisfaction
3,5,2020-09-16 20:08:54,shoppee,google,A lot of items at a very good deal.,lot item good deal,good,lot item deal,,,...,0.285,0.4927,en,9,3,20,8,35,9,Variety & Price
4,5,2020-09-16 19:37:21,shoppee,google,Delivery is fast,delivery fast,fast,delivery,,,...,0.0,0.0,en,9,3,19,37,16,3,Delivery


In [5]:
# Keep only reviews rated 3 stars or lower
df = df[df['rating'] <= 3]

In [6]:
df['category'].value_counts(normalize = True)

Account Issues            0.198107
App Issues                0.189493
Poor In-app Events        0.133468
Poor Customer Service     0.095600
Payment Issue             0.092419
Delivery Issues           0.083029
Refund                    0.080624
Poor Seller Feedback      0.057345
Product Issue             0.049119
Product Listing Issues    0.020796
Name: category, dtype: float64

In [7]:
#Checking null values
df.isna().sum()[df.isna().sum() > 0]

adj       2872
noun       640
verb      1159
emoji    12267
dtype: int64

In [8]:
# Remove rows where clean_content is missing
df = df[df['clean_content'].notna()]
df.reset_index(drop = True, inplace = True)
print(f'Null values left in df: {df.clean_content.isna().sum()}')
print(f'Number of rows left: {df.shape[0]}')

Null values left in df: 0
Number of rows left: 12887


## Train Test Split Data

In [43]:
X = df['clean_content']
y = df['category']

lb = LabelEncoder()

In [44]:
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    test_size = 0.2, 
                                                    random_state = 42,
                                                    stratify = y)

In [45]:
print(f'X_train rows: {X_train.shape[0]}, X_test rows: {X_test.shape[0]}')
print(f'y_train rows: {y_train.shape[0]}, y_test rows: {y_test.shape[0]}')

X_train rows: 10309, X_test rows: 2578
y_train rows: 10309, y_test rows: 2578


## Convolutional Neural Network

### Tokenize Features

In [46]:
tokenizer = Tokenizer(num_words=9000)
tokenizer.fit_on_texts(X_train)

X_train = tokenizer.texts_to_sequences(X_train)
X_test = tokenizer.texts_to_sequences(X_test)
# Add 1 because index 0 is reserved for padding
vocab_size = len(tokenizer.word_index) + 1
train_word_index = tokenizer.word_index

# Length every review is padded or truncated to (independent of num_words)
maxlen = 9000

y_train_label = lb.fit_transform(y_train)
y_test_label = lb.transform(y_test)

y_train_dummy = to_categorical(y_train_label)
y_test_dummy = to_categorical(y_test_label)

X_train = pad_sequences(X_train, padding='post', maxlen=maxlen)
X_test = pad_sequences(X_test, padding='post', maxlen=maxlen)

embedding_dim = 100

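# Embedding matrix with one row per vocabulary index, initialised to zeros (no pretrained vectors are loaded)
train_embedding_weights = np.zeros((len(train_word_index)+1, embedding_dim))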
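
The padded length `maxlen` and the vocabulary cap `num_words` are separate settings, and a padded length far above the typical review length mostly adds padding for the convolutions to scan. A minimal sketch of deriving a padding length from the observed token counts, assuming a hypothetical helper `suggest_maxlen` and an illustrative percentile cut-off (neither is part of the original notebook):

```python
import numpy as np

def suggest_maxlen(sequences, percentile=95):
    """Return a padding length that covers `percentile` % of the sequences."""
    lengths = np.array([len(seq) for seq in sequences])
    return int(np.ceil(np.percentile(lengths, percentile)))

# Toy token-id sequences standing in for tokenizer.texts_to_sequences(X_train)
toy_sequences = [
    [1, 2],
    [3, 4, 5],
    [6],
    [7, 8, 9, 10],
    [11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
]
maxlen_suggestion = suggest_maxlen(toy_sequences, percentile=80)
print(maxlen_suggestion)
```

Sequences longer than the suggested length would be truncated by `pad_sequences`, so the percentile is a trade-off between speed and lost tokens.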

In [47]:
weights = class_weight.compute_class_weight(class_weight='balanced',
                                            classes=np.unique(y_train),
                                            y=y_train)


In [48]:
weights = dict(enumerate(weights))
weights

{0: 0.5048481880509305,
 1: 0.5275844421699079,
 2: 1.204322429906542,
 3: 1.0817418677859392,
 4: 1.0455375253549695,
 5: 0.7492005813953488,
 6: 1.744331641285956,
 7: 2.0373517786561264,
 8: 4.817289719626168,
 9: 1.2405535499398315}
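
The `'balanced'` heuristic gives each class the weight `n_samples / (n_classes * class_count)`, so rarer categories receive larger weights. A small sketch that reproduces the formula with NumPy on toy labels (the helper name and label values are illustrative only):

```python
import numpy as np

def balanced_class_weights(labels):
    """Map each class to n_samples / (n_classes * class_count)."""
    classes, counts = np.unique(labels, return_counts=True)
    weights = len(labels) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# Class 0 is twice as frequent as classes 1 and 2
toy_labels = np.array([0, 0, 0, 0, 1, 1, 2, 2])
toy_weights = balanced_class_weights(toy_labels)
print(toy_weights)
```

The notebook's `weights` dictionary has the same form: encoded label mapped to weight.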

### Adding Sequence for Network

In [59]:
def ConvNet(embeddings, max_sequence_length, num_words, embedding_dim, labels_index):
 
    embedding_layer = Embedding(num_words,
                            embedding_dim,
                            weights=[embeddings],
                            input_length=max_sequence_length,
                            trainable=False)
    
    sequence_input = Input(shape=(max_sequence_length,), dtype='int32')
    embedded_sequences = embedding_layer(sequence_input)
    convs = []
    filter_sizes = [2,3,4,5,6]
    for filter_size in filter_sizes:
        l_conv = Conv1D(filters=200, 
                        kernel_size=filter_size, 
                        activation='relu')(embedded_sequences)
        l_pool = GlobalMaxPooling1D()(l_conv)
        convs.append(l_pool)
    l_merge = concat(convs, axis=1)
    x = Dropout(0.1)(l_merge)  
    x = Dense(128, activation='relu')(x)
    x = Dropout(0.2)(x)
    preds = Dense(labels_index, activation='sigmoid')(x)
    model = Model(sequence_input, preds)
    model.compile(loss='binary_crossentropy',
                  optimizer='adam',
                  metrics=['acc'])
    model.summary()
    return model
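
The output layer above pairs `sigmoid` with `binary_crossentropy`, which scores each of the 10 categories independently. For a target where every review has exactly one category, a `softmax` output with `categorical_crossentropy` is the more common pairing. A NumPy sketch of the difference, using illustrative helper functions and toy logits rather than the notebook's model:

```python
import numpy as np

def softmax(logits):
    """Normalise logits into probabilities that sum to 1 across classes."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def sigmoid(logits):
    """Squash each logit independently into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-logits))

def categorical_crossentropy(one_hot, probs):
    """Mean negative log-likelihood of the true class."""
    return float(-np.mean(np.sum(one_hot * np.log(probs), axis=-1)))

toy_logits = np.array([[2.0, 0.5, -1.0]])
toy_target = np.array([[1.0, 0.0, 0.0]])

# Softmax rows sum to 1; sigmoid outputs are not constrained to
print(softmax(toy_logits).sum(), sigmoid(toy_logits).sum())
print(categorical_crossentropy(toy_target, softmax(toy_logits)))
```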

In [60]:
model = ConvNet(train_embedding_weights, 
                maxlen, 
                len(train_word_index)+1, 
                embedding_dim, 
                10)

Model: "functional_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_6 (InputLayer)            [(None, 9000)]       0                                            
__________________________________________________________________________________________________
embedding_5 (Embedding)         (None, 9000, 100)    947100      input_6[0][0]                    
__________________________________________________________________________________________________
conv1d_20 (Conv1D)              (None, 8999, 200)    40200       embedding_5[0][0]                
__________________________________________________________________________________________________
conv1d_21 (Conv1D)              (None, 8998, 200)    60200       embedding_5[0][0]                
_______________________________________________________________________________________
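
The shapes and parameter counts in the summary can be checked by hand. A sketch of the arithmetic, assuming Keras' default `'valid'` padding and stride 1 for `Conv1D`; the helper names are illustrative, and the vocabulary size of 9471 is inferred from the embedding parameter count rather than stated in the notebook:

```python
def conv1d_output_length(input_length, kernel_size):
    """Output length of a stride-1 Conv1D with 'valid' padding."""
    return input_length - kernel_size + 1

def conv1d_param_count(kernel_size, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return kernel_size * in_channels * filters + filters

def embedding_param_count(vocab_size, embedding_dim):
    """One embedding vector per vocabulary index."""
    return vocab_size * embedding_dim

maxlen, embedding_dim, filters = 9000, 100, 200
for kernel_size in (2, 3):
    print(kernel_size,
          conv1d_output_length(maxlen, kernel_size),
          conv1d_param_count(kernel_size, embedding_dim, filters))

# 9471 is inferred as 947100 / 100 from the summary above
print(embedding_param_count(9471, embedding_dim))
```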

In [63]:
plot_model(model, "my_first_model.png", show_shapes = True)

('Failed to import pydot. You must `pip install pydot` and install graphviz (https://graphviz.gitlab.io/download/), ', 'for `pydotprint` to work.')


### Training Neural Network

In [64]:
num_epochs = 10
batch_size = 32
hist = model.fit(X_train, 
                 y_train_dummy, 
                 epochs=num_epochs, 
                 validation_data = (X_test,y_test_dummy) , 
                 shuffle=True, 
                 batch_size=batch_size)

Epoch 1/10
 16/323 [>.............................] - ETA: 19:56 - loss: 0.6679 - acc: 0.1973

KeyboardInterrupt: 

### Accuracy of Network

In [None]:
predictions = np.argmax(model.predict(X_test), axis=-1)
predictions

In [None]:
# evaluate
loss, acc = model.evaluate(X_test, y_test_dummy, verbose=0)
print('Test Accuracy: %f' % (acc*100))
print(f'MCC Score: {matthews_corrcoef(y_test_label, predictions)}')
print(f'Kappa Score: {cohen_kappa_score(y_test_label, predictions)}')
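
Both MCC and kappa account for agreement expected by chance, which matters when the categories are imbalanced. As an illustration of what `cohen_kappa_score` measures, a minimal NumPy sketch (the function name and toy labels are hypothetical) that computes Cohen's kappa from observed versus chance agreement:

```python
import numpy as np

def cohen_kappa(y_true, y_pred, n_classes):
    """Cohen's kappa: (p_observed - p_expected) / (1 - p_expected)."""
    cm = np.zeros((n_classes, n_classes), dtype=float)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    total = cm.sum()
    p_observed = np.trace(cm) / total
    # Chance agreement from the row (actual) and column (predicted) marginals
    p_expected = (cm.sum(axis=1) * cm.sum(axis=0)).sum() / total ** 2
    return (p_observed - p_expected) / (1 - p_expected)

toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 0, 1, 2, 2, 2]
kappa = cohen_kappa(toy_true, toy_pred, n_classes=3)
print(round(kappa, 3))
```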

### Heatmap of Neural Network

In [None]:
fig, ax = plt.subplots(figsize=(10, 8))
plt.title('Convolutional Neural Network', fontdict = {'fontsize': 15})
ax = sns.heatmap(confusion_matrix(y_test_label, predictions), annot=True, fmt="d", cmap='Oranges')
ax.set_ylabel('Actual Value')
ax.set_xlabel('Predicted Value');

### Plotting ROC AUC Curve

### Check Misclassified Reviews

In [None]:
lb.inverse_transform(predictions)

In [None]:
# Create DataFrame with column for predicted values.
results = pd.DataFrame(lb.inverse_transform(predictions), columns=['predicted'], index = y_test.index)

# Create column for observed values.
results['actual'] = y_test
results['review'] = df['review']
results['clean_content'] = df['clean_content']

# Find all indices where predicted and true results 
# aren't the same, then save in an array.
ms_class = results[results['predicted']!= results['actual']]
ms_class.head(10)
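
To see which categories are confused most often, the misclassified rows can be tallied by (actual, predicted) pair. A minimal sketch on toy labels, using a hypothetical helper `top_confusions` that is not part of the original notebook:

```python
from collections import Counter

def top_confusions(actual, predicted, n=3):
    """Return the n most frequent (actual, predicted) pairs that disagree."""
    pairs = Counter((a, p) for a, p in zip(actual, predicted) if a != p)
    return pairs.most_common(n)

toy_actual = ['Refund', 'Refund', 'App Issues', 'Payment Issue', 'Refund']
toy_predicted = ['Payment Issue', 'Payment Issue', 'App Issues', 'Refund', 'Refund']
confusions = top_confusions(toy_actual, toy_predicted)
print(confusions)
```

With the notebook's `results` frame, `results['actual']` and `results['predicted']` could be passed in directly.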

In [None]:
# save model and architecture to single file
model.save('/content/drive/My Drive/Colab Notebooks/2.model.h5')
print('Saved model to disk')

In [None]:
from tensorflow.keras.models import load_model
 
# load model
model = load_model('/content/drive/My Drive/Colab Notebooks/2.model.h5')
# summarize model.
model.summary()

In [None]:
plot_model(model, "my_first_model.png", show_shapes = True)