![wordcloud](wordcloud.png)

As a data scientist at a mobile app company, you regularly apply product analytics to understand user behavior, uncover patterns, and identify which features work well and which do not. Recently, the number of negative reviews on Google Play has increased, and the app's rating has been dropping as a result. The team has asked you to analyze the situation and make sense of the negative reviews.

It's up to you to apply K-means clustering from scikit-learn and NLP techniques through NLTK to sort text data from negative reviews in the Google Play Store into categories!

## The Data

A dataset has been shared with a sample of reviews and their respective scores (from 1 to 5) in the Google Play Store. A summary and preview are provided below.

### reviews.csv

| Column     | Description              |
|------------|--------------------------|
| `'content'` | Content (text) of each review. |
| `'score'` | Score assigned to the review by the user as an integer (from 1 to 5). |

In [24]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [25]:
# Download necessary files from NLTK:
# punkt -> Tokenization
# stopwords -> Stop words removal
nltk.download("punkt")
nltk.download("stopwords")

[nltk_data] Downloading package punkt to /home/repl/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/repl/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [26]:
# Load the reviews dataset and preview it
reviews = pd.read_csv("reviews.csv")
reviews.head()

|   | content | score |
|---|---------|-------|
| 0 | I cannot open the app anymore | 1 |
| 1 | I have been begging for a refund from this app... | 1 |
| 2 | Very costly for the premium version (approx In... | 1 |
| 3 | Used to keep me organized, but all the 2020 UP... | 1 |
| 4 | Dan Birthday Oct 28 | 1 |


## Preprocess the negative reviews

In [27]:
# Create helper functions
def preprocess_reviews(reviews_df):
    """
    Preprocess negative reviews (score <= 2) by tokenizing the text, removing stop words and non-alpha characters.
    
    Parameters:
    reviews_df (pandas.DataFrame): DataFrame containing the 'content' and 'score' columns.
    
    Returns:
    pandas.DataFrame: Preprocessed negative reviews.
    """
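    # Filter for negative reviews; copy so the column assignments below
    # operate on an independent DataFrame (avoids SettingWithCopyWarning)
    negative_reviews = reviews_df[reviews_df['score'] <= 2].copy()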
    
    # Tokenize the text
    negative_reviews['tokens'] = negative_reviews['content'].apply(word_tokenize)
    
    # Remove stop words
    stop_words = set(stopwords.words('english'))
    negative_reviews['tokens'] = negative_reviews['tokens'].apply(lambda x: [word for word in x if word.lower() not in stop_words])
    
    # Remove non-alpha characters
    negative_reviews['tokens'] = negative_reviews['tokens'].apply(lambda x: [word for word in x if word.isalpha()])
    
    # Convert tokens back to text
    negative_reviews['preprocessed_content'] = negative_reviews['tokens'].apply(lambda x: ' '.join(x))
    
    return negative_reviews[['preprocessed_content', 'score']]


def get_tfidf_matrix(preprocessed_reviews_df):
    """
    Vectorize the preprocessed negative reviews using TF-IDF.
    
    Parameters:
    preprocessed_reviews_df (pandas.DataFrame): DataFrame containing the 'preprocessed_content' column.
    
    Returns:
    scipy.sparse.csr_matrix: TF-IDF matrix (one row per review, one column per term).
    """
    # Create a TF-IDF vectorizer
    tfidf_vectorizer = TfidfVectorizer()
    
    # Fit and transform the preprocessed content
    tfidf_matrix = tfidf_vectorizer.fit_transform(preprocessed_reviews_df['preprocessed_content'])
    
    return tfidf_matrix

from collections import Counter

def cluster_and_get_topics(tfidf_matrix, num_clusters=5, reviews_df=None):
    """
    Apply K-Means clustering to the TF-IDF matrix and find the most frequent term in each cluster.

    Parameters:
    tfidf_matrix (scipy.sparse.csr_matrix): TF-IDF matrix.
    num_clusters (int): Number of clusters to create.
    reviews_df (pandas.DataFrame, optional): DataFrame with the 'preprocessed_content' column,
        row-aligned with tfidf_matrix. Defaults to the notebook-level `preprocessed_reviews`.

    Returns:
    list: Predicted cluster labels for each review.
    pandas.DataFrame: DataFrame containing the cluster label, most frequent term, and its frequency.
    """
    # Fall back to the notebook-level DataFrame when none is passed explicitly
    if reviews_df is None:
        reviews_df = preprocessed_reviews

    # Apply K-Means clustering
    kmeans = KMeans(n_clusters=num_clusters, random_state=42)
    labels = kmeans.fit_predict(tfidf_matrix)

    # Convert the predicted labels to a list
    categories = labels.tolist()

    # Find the most frequent term in each cluster (counts are case-sensitive)
    topic_terms = []
    for cluster_id in range(num_clusters):
        cluster_reviews = reviews_df.loc[labels == cluster_id, 'preprocessed_content']
        all_terms = ' '.join(cluster_reviews).split()
        most_common = Counter(all_terms).most_common(1)[0]
        topic_terms.append((cluster_id, most_common[0], most_common[1]))

    # Create a DataFrame with the results
    topic_terms_df = pd.DataFrame(topic_terms, columns=['cluster', 'term', 'frequency'])

    return categories, topic_terms_df

In [28]:
# Apply the preprocessing function to the reviews DataFrame
preprocessed_reviews = preprocess_reviews(reviews)
display(preprocessed_reviews.head())

|   | preprocessed_content | score |
|---|----------------------|-------|
| 0 | open app anymore | 1 |
| 1 | begging refund app month nobody replying | 1 |
| 2 | costly premium version approx Indian Rupees pe... | 1 |
| 3 | Used keep organized UPDATES made mess things c... | 1 |
| 4 | Dan Birthday Oct | 1 |
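
To make the three preprocessing steps concrete, here is a dependency-free sketch of what happens to a single review. It is not the notebook's implementation: whitespace splitting stands in for `word_tokenize`, and `STOP_WORDS` is a tiny hypothetical list chosen so that the example reproduces two rows of the preview above. The notebook itself uses NLTK's full English stop-word list.

```python
# Hypothetical, tiny stop-word list for illustration only; the notebook
# uses stopwords.words('english') from NLTK instead.
STOP_WORDS = {"i", "the", "a", "to", "for", "from", "this", "have", "been", "cannot"}


def clean_review(text, stop_words=STOP_WORDS):
    """Tokenize, drop stop words case-insensitively, keep alphabetic tokens."""
    # Stand-in for word_tokenize: split on whitespace
    tokens = text.split()
    # Step 1: remove stop words (compared in lowercase, original case kept)
    kept = [t for t in tokens if t.lower() not in stop_words]
    # Step 2: remove tokens that contain digits or punctuation
    kept = [t for t in kept if t.isalpha()]
    # Step 3: join the surviving tokens back into one string
    return " ".join(kept)


print(clean_review("I cannot open the app anymore"))  # open app anymore
print(clean_review("Dan Birthday Oct 28"))            # Dan Birthday Oct
```

Note that the original casing survives preprocessing ("Used", "UPDATES" in the preview). That is harmless for the next step because `TfidfVectorizer` lowercases its input by default.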


## Vectorize the cleaned negative reviews

In [29]:
# Get the TF-IDF matrix
tfidf_matrix = get_tfidf_matrix(preprocessed_reviews)
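
For intuition about what the TF-IDF matrix contains, the following pure-Python sketch computes the weights for three toy documents. It assumes the default `TfidfVectorizer` settings (raw term counts, smoothed IDF of the form `ln((1 + n) / (1 + df)) + 1`, and L2 row normalization). The function name `tfidf_rows` and the toy documents are illustrative, and whitespace splitting is a simplification of the vectorizer's own tokenizer.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return (vocabulary, rows); each row is an L2-normalized TF-IDF vector."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = [counts[t] * idf[t] for t in vocab]
        # Scale the row to unit Euclidean length
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows


# Toy documents, not taken from reviews.csv
docs = ["open app anymore", "begging refund app", "calendar sync broken"]
vocab, rows = tfidf_rows(docs)
first = dict(zip(vocab, rows[0]))
# "app" occurs in two documents, so it weighs less than "open" in the first row
print(first["app"] < first["open"])  # True
```

The same effect applies to the real data: a word such as "app" that appears in most reviews receives a low weight, so rarer, more specific words drive the clustering.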

## K-Means Clustering and Topic Terms

In [30]:
categories, topic_terms = cluster_and_get_topics(tfidf_matrix, num_clusters=5)
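
As a rough picture of what `KMeans` does internally, here is a minimal sketch of Lloyd's algorithm on dense toy points. It differs from scikit-learn in several ways that should be read as assumptions of the sketch: centroids are initialized from the first `k` points rather than with k-means++, the loop runs for a fixed number of iterations, and the input is a list of lists rather than a sparse TF-IDF matrix.

```python
def kmeans(points, k, n_iter=10):
    """Plain Lloyd's algorithm on lists of floats; returns (labels, centroids)."""
    # Simplistic initialization: use the first k points as centroids
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest centroid by squared Euclidean distance
        labels = [
            min(range(k), key=lambda c: sum((x - m) ** 2 for x, m in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy 2-D points forming two obvious groups
points = [[0.0, 0.1], [5.0, 5.1], [0.2, 0.0], [5.2, 4.9], [0.1, 0.2]]
labels, centroids = kmeans(points, k=2)
print(labels)  # [0, 1, 0, 1, 0]
```

In the notebook, each "point" is one row of the TF-IDF matrix, so reviews that share heavily weighted terms end up near the same centroid.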

In [31]:
display(categories[:5], topic_terms)

[3, 3, 1, 3, 3]

|   | cluster | term | frequency |
|---|---------|------|-----------|
| 0 | 0 | account | 221 |
| 1 | 1 | app | 401 |
| 2 | 2 | calendar | 435 |
| 3 | 3 | app | 2173 |
| 4 | 4 | good | 36 |
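
In the result above, "app" is the most frequent term in both cluster 1 and cluster 3, so a single top term does not tell those two clusters apart. One possible refinement, sketched below under assumptions, is to list several terms per cluster and skip words that dominate the whole corpus. The helper `top_terms_per_cluster`, its `skip` parameter, and the toy data are hypothetical and not part of the original notebook.

```python
from collections import Counter


def top_terms_per_cluster(texts, labels, n_terms=3, skip=()):
    """Return {cluster_id: [(term, count), ...]} with the most frequent terms."""
    counters = {}
    for text, label in zip(texts, labels):
        # Lowercase so that "App" and "app" are counted together
        terms = [t for t in text.lower().split() if t not in skip]
        counters.setdefault(label, Counter()).update(terms)
    return {label: counter.most_common(n_terms) for label, counter in sorted(counters.items())}


# Toy data, not taken from reviews.csv
texts = ["app crashes login", "app login fails", "calendar sync app", "calendar sync slow"]
labels = [0, 0, 1, 1]
result = top_terms_per_cluster(texts, labels, n_terms=2, skip={"app"})
print(result[0][0])  # ('login', 2)
```

On the notebook's data, the call would look like `top_terms_per_cluster(preprocessed_reviews['preprocessed_content'], categories, skip={'app'})`. Which words belong in `skip` is a judgment call that depends on the corpus.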
