![wordcloud](wordcloud.png)

As a Data Scientist working for a mobile app company, you often apply product analytics to understand user behavior, uncover patterns, and identify which features work well and which do not. Recently, the number of negative reviews on Google Play has increased, and the app's rating has dropped as a result. The team has asked you to analyze the situation and make sense of the negative reviews.

It's up to you to apply K-means clustering from scikit-learn and NLP techniques through NLTK to sort text data from negative reviews in the Google Play Store into categories!

## The Data

A dataset has been shared with a sample of reviews and their respective scores (from 1 to 5) in the Google Play Store. A summary and preview are provided below.

### reviews.csv

| Column     | Description              |
|------------|--------------------------|
| `'content'` | Content (text) of each review. |
| `'score'` | Score assigned to the review by the user as an integer (from 1 to 5). |

In [1]:
# Import necessary libraries
import re
import statistics
from collections import Counter
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [2]:
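# Download necessary files from NLTK (uncomment on first run):
# punkt -> Tokenization
# stopwords -> Stop words removal
# wordnet -> Lemmatization (needed by WordNetLemmatizer)
# nltk.download("punkt")
# nltk.download("stopwords")
# nltk.download("wordnet")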

In [3]:
# Load the reviews dataset and preview it
reviews = pd.read_csv("reviews.csv")
reviews

Unnamed: 0,content,score
0,I cannot open the app anymore,1
1,I have been begging for a refund from this app...,1
2,Very costly for the premium version (approx In...,1
3,"Used to keep me organized, but all the 2020 UP...",1
4,Dan Birthday Oct 28,1
...,...,...
12490,"I really like the planner, it helps me achieve...",5
12491,😁****😁,5
12492,Very useful apps. You must try it,5
12493,Would pay for this if there were even more add...,5


In [4]:
# Subset for only 1 and 2 score; take a copy so that later column
# assignments do not trigger SettingWithCopyWarning
reviews = reviews[reviews['score'].isin([1, 2])].copy()
reviews

Unnamed: 0,content,score
0,I cannot open the app anymore,1
1,I have been begging for a refund from this app...,1
2,Very costly for the premium version (approx In...,1
3,"Used to keep me organized, but all the 2020 UP...",1
4,Dan Birthday Oct 28,1
...,...,...
11940,I loved it until I realized that the very feat...,2
11941,Gave it a test run and tried out the notificat...,2
11942,"Looks great but since installing, my device on...",2
11943,This app looked good until I had to purchase i...,2
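
Later cells add new columns to this filtered frame. One way to make that unambiguous for pandas is to take an explicit copy when filtering. The following is a minimal sketch on a made-up three-row frame; the names `df`, `negative`, and the `length` column are hypothetical and only serve the illustration.

```python
import pandas as pd

# Toy stand-in for the reviews data (values are invented for illustration)
df = pd.DataFrame({"content": ["bad app", "ok", "great"], "score": [1, 2, 5]})

# Filter to scores 1 and 2 and take an independent copy of the result
negative = df[df["score"].isin([1, 2])].copy()

# Adding a column now modifies only the copy, not the original frame
negative["length"] = negative["content"].str.len()

print(negative)
```

Without `.copy()`, pandas may warn that the assignment targets a slice of another DataFrame, because it cannot tell whether the original was meant to change.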


To reveal the main topics from app reviews, you'll perform these tasks:

- Preprocess the negative reviews (reviews with a score of 1 or 2) by tokenizing the text, removing stop words and non-alpha characters. Save the results in a pandas DataFrame called `preprocessed_reviews`.

- Vectorize the cleaned negative reviews using TF-IDF and store the matrix in a variable called `tfidf_matrix`.

- Apply K-means clustering to `tfidf_matrix` to group the reviews into five categories. Store the predicted labels in a list called `categories`.

- For each unique cluster label, find the most frequent term. Store the results in a pandas DataFrame called `topic_terms` with at least three columns to store the label assigned from K-means, the identified term, and its frequency.
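
The vectorize-then-cluster steps above can be sketched on a toy corpus before running them on the real data. This is only an illustration under stated assumptions: the four `toy_reviews` are invented, two clusters are used instead of five, and scikit-learn's built-in English stop word list stands in for the NLTK list used in the notebook.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented example reviews (not taken from reviews.csv)
toy_reviews = [
    "too many ads, ads everywhere",
    "ads pop up after every task",
    "app crashes when I open it",
    "the app will not open after the update",
]

# Turn each review into a TF-IDF weighted vector
# (stop_words="english" is scikit-learn's list, not NLTK's)
vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(toy_reviews)

# Group the vectors into two clusters; a fixed seed keeps runs repeatable
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
categories = kmeans.fit_predict(tfidf_matrix).tolist()

print(tfidf_matrix.shape)
print(categories)
```

Each entry of `categories` is the cluster label for the review at the same position in `toy_reviews`.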

In [5]:
reviews.describe()

Unnamed: 0,score
count,4850.0
mean,1.483299
std,0.499773
min,1.0
25%,1.0
50%,1.0
75%,2.0
max,2.0


In [6]:
## FUNCTION DEFINITIONS

from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

# Build these once instead of once per word
STOP_WORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()
LEMMATIZER = WordNetLemmatizer()

def preprocess_text(text: str) -> str:
    """
    Tokenizes text, then drops stop words and non-alphabetic tokens
    :param text: raw review text
    :return: cleaned text, tokens joined by single spaces
    """
    tokens = word_tokenize(text)
    # keep alphabetic tokens that are not stop words
    return ' '.join(word for word in tokens if word.isalpha() and word.lower() not in STOP_WORDS)

def lemmatize_stem(text: str, stem: bool) -> str:
    """
    Stems or lemmatizes each whitespace-separated word
    :param text: cleaned text
    :param stem: True to apply Porter stemming, False to lemmatize words as verbs
    :return: stemmed or lemmatized text
    """
    words = text.split()
    if stem:
        return " ".join(STEMMER.stem(word) for word in words)
    return " ".join(LEMMATIZER.lemmatize(word, pos='v') for word in words)

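def cluster_reviews(matrix):
    """
    Groups the rows of a TF-IDF matrix into five clusters with K-means
    :param matrix: TF-IDF matrix, one row per review
    :return: array of cluster labels, one per review
    """
    km = KMeans(n_clusters=5, random_state=42)
    return km.fit_predict(matrix)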

def most_common_word(tokens: list) -> list:
    """
    Gets the single most common word
    :param tokens: list of words
    :return: one-element list holding a (word, count) tuple
    """
    return Counter(tokens).most_common(1)

def get_most_frequent(review, groupby_column):
    """
    Gets the most frequent word per cluster into a dataframe
    :param review: DataFrame with a 'cleaned_tokenize_reviews' column and a cluster label column
    :param groupby_column: name of the cluster label column
    :return: DataFrame with one row per cluster: label, most frequent word, and its count
    """
    review = review.groupby(groupby_column)['cleaned_tokenize_reviews'].apply(lambda x: ' '.join(x).split()).reset_index()
    word_freq = review['cleaned_tokenize_reviews'].apply(most_common_word)
    word_freq = [(word, count) for sublist in word_freq for (word, count) in sublist]
    review['most frequent word'] = [word for (word, count) in word_freq]
    review['Frequency of word'] = [count for (word, count) in word_freq]
    return review.drop(columns=['cleaned_tokenize_reviews'])

def top_terms(matrix, labels, vectorizer, num_cluster):
    """
    Finds the term with the highest summed TF-IDF weight in each cluster
    """
    # Predicted cluster labels, one per review
    categories = np.asarray(labels)

    # Get the feature names (terms) from the vectorizer
    terms = vectorizer.get_feature_names_out()

    # List to save the top term for each cluster
    topic_terms_list = []

    for cluster in range(num_cluster):
        # Get indices of reviews in the current cluster
        cluster_indices = np.flatnonzero(categories == cluster)

        # Sum the tf-idf scores for each term in the cluster
        cluster_tfidf_sum = matrix[cluster_indices].sum(axis=0)
        cluster_term_freq = np.asarray(cluster_tfidf_sum).ravel()

        # Get the index of the term with the largest summed weight
        top_term_index = cluster_term_freq.argmax()

        # Append rows to the topic_terms DataFrame with three fields:
        # - category: label / cluster assigned from K-means
        # - term: the identified top term
        # - frequency: term's weight for the category
        topic_terms_list.append(
            {
                "category": cluster,
                "term": terms[top_term_index],
                "frequency": cluster_term_freq[top_term_index],
            }
        )

    # Pandas DataFrame to store results from this step
    topic_terms = pd.DataFrame(topic_terms_list)

    # Output the final result
    return topic_terms

In [7]:
# apply to reviews
reviews['cleaned_tokenize_reviews'] = reviews['content'].apply(preprocess_text)
reviews

Unnamed: 0,content,score,cleaned_tokenize_reviews
0,I cannot open the app anymore,1,open app anymore
1,I have been begging for a refund from this app...,1,begging refund app month nobody replying
2,Very costly for the premium version (approx In...,1,costly premium version approx Indian Rupees pe...
3,"Used to keep me organized, but all the 2020 UP...",1,Used keep organized UPDATES made mess things c...
4,Dan Birthday Oct 28,1,Dan Birthday Oct
...,...,...,...
11940,I loved it until I realized that the very feat...,2,loved realized feature got download first plac...
11941,Gave it a test run and tried out the notificat...,2,Gave test run tried notifications hear thing A...
11942,"Looks great but since installing, my device on...",2,Looks great since installing device lasts half...
11943,This app looked good until I had to purchase i...,2,app looked good purchase get week view everyti...


In [8]:
# Copy to new dataframe
preprocessed_reviews = reviews[['cleaned_tokenize_reviews']].copy()

In [9]:
preprocessed_reviews['cleaned_tokenize_reviews_stem'] = preprocessed_reviews['cleaned_tokenize_reviews'].apply(lambda x: lemmatize_stem(x, True))
preprocessed_reviews['cleaned_tokenize_reviews_lem'] = preprocessed_reviews['cleaned_tokenize_reviews'].apply(lambda x: lemmatize_stem(x, False))

In [10]:
preprocessed_reviews

Unnamed: 0,cleaned_tokenize_reviews,cleaned_tokenize_reviews_stem,cleaned_tokenize_reviews_lem
0,open app anymore,open app anymor,open app anymore
1,begging refund app month nobody replying,beg refund app month nobodi repli,beg refund app month nobody reply
2,costly premium version approx Indian Rupees pe...,costli premium version approx indian rupe per ...,costly premium version approx Indian Rupees pe...
3,Used keep organized UPDATES made mess things c...,use keep organ updat made mess thing cud u lea...,Used keep organize UPDATES make mess things cu...
4,Dan Birthday Oct,dan birthday oct,Dan Birthday Oct
...,...,...,...
11940,loved realized feature got download first plac...,love realiz featur got download first place av...,love realize feature get download first place ...
11941,Gave test run tried notifications hear thing A...,gave test run tri notif hear thing also save n...,Gave test run try notifications hear thing Als...
11942,Looks great since installing device lasts half...,look great sinc instal devic last half long ne...,Looks great since instal device last half long...
11943,app looked good purchase get week view everyti...,app look good purchas get week view everytim c...,app look good purchase get week view everytime...


In [11]:
# Vectorize non lemmatized/stemmed reviews
tfid_na = TfidfVectorizer()
tfid_mat = tfid_na.fit_transform(preprocessed_reviews['cleaned_tokenize_reviews'])
tfid_mat

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 68707 stored elements and shape (4850, 7054)>

In [12]:
# Vectorize stemmed reviews
tfid_stm = TfidfVectorizer()
tfid_stm_mat = tfid_stm.fit_transform(preprocessed_reviews['cleaned_tokenize_reviews_stem'])
tfid_stm_mat

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 66877 stored elements and shape (4850, 4986)>
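
The stemmed matrix has 4986 columns against 7054 for the unstemmed text, because stemming maps inflected forms of a word onto a single token. A small sketch with NLTK's `PorterStemmer` shows the effect on a few words that appear in the reviews; the list `words` and the two vocabulary sets are hypothetical names used only for this illustration.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# A handful of words that occur in the cleaned reviews
words = ["Looks", "looked", "installing", "replying"]

# PorterStemmer lowercases by default before stripping suffixes
stems = [stemmer.stem(word) for word in words]

# Compare the number of distinct tokens before and after stemming
vocabulary_before = {word.lower() for word in words}
vocabulary_after = set(stems)

print(stems)
print(len(vocabulary_before), "->", len(vocabulary_after))
```

Both "Looks" and "looked" collapse to the same stem, so they share one TF-IDF column instead of two.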

In [13]:
# Vectorize lemmatized reviews
tfid_lem = TfidfVectorizer()
tfid_lem_mat = tfid_lem.fit_transform(preprocessed_reviews['cleaned_tokenize_reviews_lem'])
tfid_lem_mat

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 67430 stored elements and shape (4850, 6013)>

In [14]:
# Cluster each TF-IDF matrix into five groups
non_cluster = cluster_reviews(tfid_mat)
stem_cluster = cluster_reviews(tfid_stm_mat)
lem_cluster = cluster_reviews(tfid_lem_mat)

In [15]:
reviews['non_cluster'] = non_cluster
reviews['stem_cluster'] = stem_cluster
reviews['lem_cluster'] = lem_cluster


In [16]:
top_terms(tfid_mat, non_cluster, tfid_na, 5)

Unnamed: 0,category,term,frequency
0,0,work,72.540212
1,1,use,64.014781
2,2,good,55.18223
3,3,ads,49.210517
4,4,app,189.542772


In [17]:
top_terms(tfid_lem_mat, lem_cluster, tfid_lem, 5)

Unnamed: 0,category,term,frequency
0,0,work,103.467471
1,1,app,178.556566
2,2,version,61.960493
3,3,good,38.519915
4,4,task,97.293399


In [18]:
# Non stemmed clusters
non_reviews = get_most_frequent(reviews, 'non_cluster')
non_reviews

Unnamed: 0,non_cluster,most frequent word,Frequency of word
0,0,work,300
1,1,use,333
2,2,version,319
3,3,ads,221
4,4,app,2239


In [19]:
# Stemmed clusters
stem_reviews = get_most_frequent(reviews, 'stem_cluster')
stem_reviews

Unnamed: 0,stem_cluster,most frequent word,Frequency of word
0,0,app,324
1,1,app,1698
2,2,tasks,441
3,3,ads,217
4,4,app,446


In [20]:
# Lemmatized clusters
lem_reviews = get_most_frequent(reviews, 'lem_cluster')
lem_reviews

Unnamed: 0,lem_cluster,most frequent word,Frequency of word
0,0,work,250
1,1,app,2022
2,2,app,528
3,3,good,42
4,4,tasks,424
