![wordcloud](wordcloud.png)

As a Data Scientist working for a mobile app company, you regularly use product analytics to understand user behavior, uncover patterns, and identify which features work well and which do not. Recently, the number of negative reviews on Google Play has increased, and as a consequence the app's rating has been dropping. The team has asked you to analyze the situation and make sense of the negative reviews.

It's up to you to apply K-means clustering from scikit-learn and NLP techniques through NLTK to sort text data from negative reviews in the Google Play Store into categories!

## The Data

A dataset has been shared with a sample of reviews and their respective scores (from 1 to 5) in the Google Play Store. A summary and preview are provided below.

# reviews.csv

| Column     | Description              |
|------------|--------------------------|
| `'content'` | Content (text) of each review. |
| `'score'` | Score assigned to the review by the user as an integer (from 1 to 5). |

In [76]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [77]:
# Download necessary files from NLTK:
# punkt -> Tokenization
# stopwords -> Stop words removal
nltk.download("punkt")
nltk.download("stopwords")

[nltk_data] Downloading package punkt to /home/repl/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/repl/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [78]:
# Load the reviews dataset and preview it
reviews = pd.read_csv("reviews.csv")
reviews.head()

Unnamed: 0,content,score
0,I cannot open the app anymore,1
1,I have been begging for a refund from this app...,1
2,Very costly for the premium version (approx In...,1
3,"Used to keep me organized, but all the 2020 UP...",1
4,Dan Birthday Oct 28,1


# negative reviews

In [79]:
# Keep only the text of reviews scored 1 or 2 (treated here as negative)
negative_reviews_tmp = reviews.loc[reviews["score"] <= 2, "content"]

# clean the negative reviews

In [80]:
import re  # the standard-library module is sufficient for this pattern
stop_words = set(stopwords.words('english'))
# Function to preprocess text
def preprocess_text(text):
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Remove punctuation and special characters
    tokens = word_tokenize(text)
    # Remove stop words and non-alphabetic tokens using isalpha()
    tokens_clean = [word for word in tokens if word not in stop_words and word.isalpha()]
    return " ".join(tokens_clean)

# Apply preprocessing to the negative reviews
clean_tokens = negative_reviews_tmp.apply(preprocess_text)
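
To make the effect of the cleaning step concrete, here is a minimal sketch that mirrors `preprocess_text` on a single made-up review. It is an approximation, not the notebook's function: `DEMO_STOP_WORDS` is a hypothetical stand-in for NLTK's English stop-word list, and whitespace splitting replaces `word_tokenize` so the example runs without any NLTK downloads. It also shows why a term such as `cant` can surface later: the apostrophe is deleted, so "can't" collapses into a single token.

```python
import re

# Hypothetical stand-in for NLTK's English stop-word list, kept tiny so the
# example runs without downloading any NLTK data.
DEMO_STOP_WORDS = {"i", "the", "to", "a", "is", "it"}


def preprocess_demo(text):
    """Lowercase, strip non-letters, and drop stop words (whitespace tokenization)."""
    text = text.lower()
    # Apostrophes, digits, and punctuation are deleted, not replaced by spaces
    text = re.sub(r"[^a-z\s]", "", text)
    tokens = [word for word in text.split() if word not in DEMO_STOP_WORDS]
    return " ".join(tokens)


sample = "I can't open the app anymore!!! Version 2.0 is broken."
cleaned = preprocess_demo(sample)
print(cleaned)  # cant open app anymore version broken
```

If contractions should stay recognizable, one option is to expand them (for example "can't" to "cannot") before stripping punctuation; that would be a change to the original approach.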

# store the cleaned reviews in a DataFrame

In [81]:
preprocessed_reviews = pd.DataFrame({"review": clean_tokens})


# tfidf

In [82]:
from sklearn.feature_extraction.text import TfidfVectorizer

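# Turn the cleaned reviews into a sparse TF-IDF matrix
# (one row per review, one column per vocabulary term)
tf = TfidfVectorizer()
tfidf_matrix = tf.fit_transform(preprocessed_reviews["review"])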

# kmeans

In [83]:
from sklearn.cluster import KMeans

In [84]:
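# Cluster the reviews into 5 groups. No random_state is set, so the cluster
# labels (and the top terms reported below) can change between runs.
clust_kmeans = KMeans(n_clusters=5)
lab = clust_kmeans.fit_predict(tfidf_matrix)
categories = lab.tolist()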

In [85]:
preprocessed_reviews["cat"] = categories
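
The clustering call above does not fix `random_state`, so cluster numbering can differ from run to run. The following sketch, on made-up documents, shows one way to make the assignment repeatable by passing `random_state` (and an explicit `n_init`) to `KMeans`. The documents, the choice of two clusters, and the seed value 42 are all assumptions for illustration; none of them come from the original analysis.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up cleaned reviews, used only for illustration
demo_docs = [
    "app crashes after update",
    "app stopped working after update",
    "cant open app since update",
    "premium version too expensive",
    "want refund premium version",
    "refund expensive subscription",
]

X_demo = TfidfVectorizer().fit_transform(demo_docs)

# Same seed and settings, fitted twice
labels_a = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_demo)
labels_b = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_demo)

# Identical seeds give identical cluster assignments
print((labels_a == labels_b).all())
```

Fixing the seed only makes a run repeatable; it does not indicate whether the chosen number of clusters suits the data.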

# top term per cluster

In [86]:
terms = tf.get_feature_names_out()

In [87]:
top_terms = []

In [88]:
clust_kmeans.n_clusters

5

In [89]:
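for cl in range(clust_kmeans.n_clusters):
    # Row positions of the reviews assigned to this cluster
    cl_indices = [i for i, label in enumerate(categories) if label == cl]
    # Sum each term's TF-IDF weight over the cluster's reviews
    cluster_tfidf_sum = tfidf_matrix[cl_indices].sum(axis=0)
    cluster_term_freq = np.asarray(cluster_tfidf_sum).ravel()
    # Position of the term with the largest summed weight
    top_term_index = cluster_term_freq.argsort()[::-1][0]
    top_terms.append(
        {
            "category": cl,
            "term": terms[top_term_index],
            # Summed TF-IDF weight, not a raw word count
            "frequency": cluster_term_freq[top_term_index],
        }
    )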

In [90]:
# Pandas DataFrame to store results from this step
topic_terms = pd.DataFrame(top_terms)

# Output the final result
print(topic_terms)

   category     term   frequency
0         0      app  190.646949
1         1  version   69.373145
2         2     work   58.631976
3         3     good   36.057851
4         4     cant   43.360541
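
A single top term per cluster can be hard to interpret (for example, `app` says little about what users complain about). As a possible extension, the sketch below returns the top `n_top` terms per cluster using the same sum-and-sort idea as the loop above. It runs on a small dense, hand-written weight matrix; `top_terms_per_cluster` and all `demo_*` values are hypothetical and are not part of the original notebook.

```python
import numpy as np

# Hypothetical toy inputs: 4 documents x 5 terms of TF-IDF-like weights
demo_vocab = np.array(["app", "crash", "price", "refund", "update"])
demo_weights = np.array([
    [0.6, 0.8, 0.0, 0.0, 0.0],
    [0.5, 0.7, 0.0, 0.0, 0.5],
    [0.0, 0.0, 0.8, 0.6, 0.0],
    [0.1, 0.0, 0.7, 0.7, 0.0],
])
demo_labels = np.array([0, 0, 1, 1])


def top_terms_per_cluster(weights, labels, vocab, n_top=3):
    """Return {cluster: [terms]} ranked by summed weight within each cluster."""
    result = {}
    for cluster in np.unique(labels):
        # Sum each term's weight over the documents in this cluster
        summed = np.asarray(weights[labels == cluster].sum(axis=0)).ravel()
        # Indices of the n_top largest sums, in descending order
        top_idx = summed.argsort()[::-1][:n_top]
        result[int(cluster)] = [str(vocab[i]) for i in top_idx]
    return result


demo_top = top_terms_per_cluster(demo_weights, demo_labels, demo_vocab, n_top=2)
print(demo_top)  # {0: ['crash', 'app'], 1: ['price', 'refund']}
```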
