<a href="https://colab.research.google.com/github/TrishAcharya/Projects/blob/main/Play_store_app_review_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -   Play store app review analysis



##### **Project Type**    - Exploratory Data Analysis
##### **Contribution**    - Individual

# **Project Summary -**

Play store app review analysis

The goal of this Exploratory Data Analysis (EDA) project is to uncover key insights from the Play Store app data and customer reviews that can drive success in app development. By understanding factors such as app ratings, size, category, pricing, and user sentiment, we aim to provide actionable insights for app developers to improve user engagement, boost ratings, and optimize their app offerings for the Android market.


# **GitHub Link -**

https://github.com/TrishAcharya/Projects/blob/main/Play_store_app_review_analysis.ipynb

# **Problem Statement**


Explore and analyse the Play Store app data and user reviews to discover the key factors responsible for app engagement and success.


#### **Define Your Business Objective?**

The business objective is to gain actionable insights that can drive key decisions in areas like product development, marketing, customer satisfaction, and overall business strategy:
1. Understand Market Trends and Demand
2. Improve App Features and Functionality
3. Enhance User Engagement and Retention
4. Improve Customer Support and Service
5. Predict Future Success or Failure


# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that the following questions are answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
## Dataset link
## https://drive.google.com/file/d/1xdS1IfPeP8p-3aeGQ-W7hYN9yq8K7Bcd/view?usp=sharing
## https://drive.google.com/file/d/1AZbZtxH6KXrhLTj9OagXKUoWdPzjzgnp/view?usp=sharing

### Dataset Loading

In [None]:
# Upload 'Play Store Data.csv' and 'User Reviews.csv' from the local machine
from google.colab import files
uploaded = files.upload()


In [None]:
# Run the upload again if the second CSV was not selected in the previous cell
from google.colab import files
uploaded = files.upload()

In [None]:
playstore_data = pd.read_csv('/content/Play Store Data.csv')
playstore_data

In [None]:
reviews_data = pd.read_csv('/content/User Reviews.csv')
reviews_data

### Dataset First View

In [None]:
# Dataset First Look
playstore_data.head()          # head() returns the first 5 rows by default

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
playstore_data.shape

### Dataset Information

In [None]:
# Dataset Info
playstore_data.info()      # info() lists each column's dtype and non-null count

In [None]:
reviews_data.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
playstore_data.duplicated().sum()   # number of fully duplicated rows

In [None]:
reviews_data.duplicated().sum()     # number of fully duplicated rows

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print(playstore_data.isnull().sum())

In [None]:
# Visualizing the missing values
sns.heatmap(playstore_data.isnull(), cbar=False)

### What did you know about your dataset?

There are two datasets provided to draw the actionable insights.

The Play Store app data contains the columns App, Category, Rating, Reviews, Size, Installs, Type, Price, Content Rating, Genres, Last Updated, Current version and Android Version. It has 10841 rows and 13 columns, of which 483 rows are duplicates.

The reviews data contains the columns App, Translated_Review, Sentiment, Sentiment_Polarity and Sentiment_Subjectivity.
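
The row, column, duplicate and missing-value figures quoted above can be gathered in one place. The following is a minimal sketch on a toy frame; `summarize_frame` and the sample values are hypothetical and are not part of the original notebook.

```python
import pandas as pd

def summarize_frame(df: pd.DataFrame) -> dict:
    """Return row/column counts, fully duplicated rows, and missing cells per column."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "duplicate_rows": int(df.duplicated().sum()),
        "missing_by_column": df.isnull().sum().to_dict(),
    }

# Toy stand-in for playstore_data (illustrative values only)
toy = pd.DataFrame({
    "App": ["A", "A", "B", "C"],
    "Rating": [4.1, 4.1, None, 3.9],
})

summary = summarize_frame(toy)
print(summary)
```

Calling the same helper on `playstore_data` and `reviews_data` would give a comparable overview of both datasets.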

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
playstore_data.columns

In [None]:
reviews_data.columns

In [None]:
# Dataset Describe
playstore_data.describe(include='all')

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for x in playstore_data.columns:
  print(f"No. of unique values in {x} is {playstore_data[x].nunique()}.")

## 3. ***Data Wrangling***

Data wrangling is the process of transforming and cleaning messy data so that it is ready for analysis.

### Data Wrangling Code

In [None]:
# code to make dataset analysis ready.
# Merge the Play Store data with user reviews based on the 'App' column
merged_df = pd.merge(playstore_data, reviews_data, on='App', how='left')

# Display the merged dataset
print(merged_df.head())       # head() shows the first 5 rows by default

In [None]:
## Data Cleaning
## Now that the datasets are merged, clean the data by handling missing values, duplicates, etc.
# Check the missing values in the merged dataset
print(merged_df.isnull().sum())

# Drop rows with missing 'Translated_Review' or 'Sentiment_Polarity' (or any other critical columns)
merged_df.dropna(subset=['Translated_Review', 'Sentiment_Polarity'], inplace=True)

# Remove duplicate reviews for the same app (if needed)
merged_df.drop_duplicates(subset=['App', 'Translated_Review'], inplace=True)

# Check again for missing values after cleaning
print(merged_df.isnull().sum())

In [None]:
## Sentiment Aggregation
## To get a high-level sentiment overview for each app--  aggregate the sentiment data
# by calculating the average sentiment polarity or subjectivity for each app.
# Group by 'App' and calculate average sentiment polarity and subjectivity
sentiment_summary = merged_df.groupby('App').agg({
    'Sentiment_Polarity': 'mean',  # Average sentiment polarity
    'Sentiment_Subjectivity': 'mean'  # Average sentiment subjectivity
}).reset_index()

# Display the sentiment summary
print(sentiment_summary.head())


In [None]:
# Sentiment vs. Rating: Compare the sentiment of reviews (e.g., sentiment polarity) with the app rating to see if there's any correlation.
# Correlation between sentiment polarity and rating
sentiment_vs_rating = merged_df[['Rating', 'Sentiment_Polarity']].corr()
print(sentiment_vs_rating)

In [None]:
# Sentiment Distribution by App Category: You can analyze how sentiment varies across different categories of apps.
# Sentiment distribution by category
sentiment_by_category = merged_df.groupby('Category').agg({
    'Sentiment_Polarity': 'mean'
}).sort_values('Sentiment_Polarity', ascending=False)

print(sentiment_by_category.head())

In [None]:
# Top Apps Based on Average Sentiment: Identify the top-rated apps based on sentiment or average review sentiment polarity.
# Top 5 apps with the highest average sentiment polarity
top_positive_apps = sentiment_summary.sort_values('Sentiment_Polarity', ascending=False).head(5)
print(top_positive_apps)

# Top 5 apps with the lowest average sentiment polarity
top_negative_apps = sentiment_summary.sort_values('Sentiment_Polarity').head(5)
print(top_negative_apps)

### What all manipulations have you done and insights you found?

To make the datasets analysis-ready, the two datasets were first merged on the common column 'App'. The merged data was then cleaned: missing values were counted, rows with a missing 'Translated_Review' or 'Sentiment_Polarity' were dropped, and duplicate reviews for the same app were removed.
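
The merge and cleaning steps can be sketched on toy data as follows. One variation is shown that the notebook itself does not apply: duplicate app listings are dropped before the merge, and `validate="one_to_many"` is used so pandas raises an error if an app still appears more than once on the left side. The frames `apps` and `reviews` and their values are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the two datasets (illustrative values only)
apps = pd.DataFrame({
    "App": ["A", "A", "B"],                 # "A" is listed twice on purpose
    "Category": ["GAME", "GAME", "TOOLS"],
    "Rating": [4.5, 4.5, 3.8],
})
reviews = pd.DataFrame({
    "App": ["A", "A", "B", "B"],
    "Translated_Review": ["Great", "Great", None, "Slow"],
    "Sentiment_Polarity": [0.8, 0.8, None, -0.3],
})

# Keep one row per app so each review is not repeated for every duplicate listing
apps_unique = apps.drop_duplicates(subset="App")

# validate raises MergeError if the left side still has repeated keys
merged = pd.merge(apps_unique, reviews, on="App", how="left", validate="one_to_many")

# Same cleaning steps as the notebook: drop empty reviews, then repeated reviews
merged = merged.dropna(subset=["Translated_Review", "Sentiment_Polarity"])
merged = merged.drop_duplicates(subset=["App", "Translated_Review"])

print(merged)
```

Without the de-duplication step, every review of "A" would be matched to both listings of "A", so the merged frame would contain each review twice before the later `drop_duplicates` call removes the copies.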

Next came sentiment aggregation: to get a high-level sentiment overview, the data was grouped by 'App' and the average sentiment polarity and subjectivity were calculated for each app.

The sentiment of the reviews was then compared with the app rating to check for any correlation (Sentiment vs. Rating).

Sentiment was also analyzed across the different app categories (Sentiment Distribution by App Category).

Finally, the apps were ranked by average review sentiment polarity to find the top 5 apps with the highest average polarity and the top 5 apps with the lowest.

INSIGHTS--
After cleaning, the missing-value check reports zero missing values, which means the dataset is complete with respect to the critical columns. With no missing data in those columns, the analysis is more reliable.
Removing duplicate reviews ensures that repeated entries do not distort the analysis, so that sentiment, app ratings, and other metrics are based on unique user feedback and the conclusions are more representative of the broader user base.


Sentiment Polarity: This represents the overall emotional tone of the reviews for each app, where:

*   A positive value (close to 1) indicates positive sentiment.
*   A negative value (close to -1) indicates negative sentiment.
*   A value close to 0 indicates neutral sentiment.

Sentiment Subjectivity: This measures the degree to which the sentiment is subjective or opinion-based, with:

*   A value close to 0 indicating more objective or factual feedback.
*   A value close to 1 indicating more subjective or opinionated feedback.
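
As an illustration of how a polarity score maps to a coarse label, here is a minimal sketch. The function `label_polarity` and the cut at exactly 0 are assumptions made for this example; the dataset's own `Sentiment` column may have been derived with a different rule.

```python
import pandas as pd

def label_polarity(polarity: float) -> str:
    """Map a polarity score in [-1, 1] to a coarse label (assumed cut at exactly 0)."""
    if polarity > 0:
        return "Positive"
    if polarity < 0:
        return "Negative"
    return "Neutral"

# Illustrative scores only
scores = pd.Series([0.47, 0.0, -0.25])
labels = scores.apply(label_polarity)
print(labels.tolist())
```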

"10 Best Foods for You" has a positive sentiment polarity of 0.465906, indicating that users generally feel positively about the app. It also has a moderate sentiment subjectivity of 0.493254, meaning users' feedback is a balance of opinion-based and factual content.

"11st" has a positive sentiment polarity of 0.185943, which is on the lower end of positive sentiment, indicating somewhat neutral-to-positive feedback.

Most of the apps in the dataset seem to have a positive overall sentiment (average polarity above 0). This is a good sign, indicating that users are generally happy with the apps they are reviewing.
The range of Sentiment_Polarity values indicates that while most apps have positive sentiment, some apps have lower or more neutral sentiment, suggesting room for improvement in user experience or functionality.

The Sentiment_Subjectivity values are all relatively high, above 0.4 for most apps, indicating that user feedback tends to be more opinion-based rather than objective.

The correlation coefficient between Rating and Sentiment_Polarity is 0.111607, which is a very weak positive correlation. This indicates that there is no strong linear relationship between the rating given to an app and the sentiment polarity of its reviews. A higher rating does not necessarily mean more positive sentiment in the reviews, and lower ratings do not automatically correlate with negative sentiment.
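
In the merged frame an app's rating is repeated on every one of its review rows, so the row-level correlation gives heavily reviewed apps more weight. A minimal sketch of an app-level alternative, on hypothetical toy values, averages the reviews first so that each app counts once:

```python
import pandas as pd

# Toy merged frame: one row per review, rating repeated for every review of the same app
toy = pd.DataFrame({
    "App": ["A", "A", "A", "B", "C", "C"],
    "Rating": [4.5, 4.5, 4.5, 3.0, 4.0, 4.0],
    "Sentiment_Polarity": [0.6, 0.2, 0.4, -0.1, 0.3, 0.1],
})

# Row-level: the quantity that merged_df[['Rating', 'Sentiment_Polarity']].corr() measures
row_level = toy["Rating"].corr(toy["Sentiment_Polarity"])

# App-level: one row per app, with the mean polarity of its reviews
per_app = toy.groupby("App").agg(
    Rating=("Rating", "first"),
    Sentiment_Polarity=("Sentiment_Polarity", "mean"),
)
app_level = per_app["Rating"].corr(per_app["Sentiment_Polarity"])

print(f"row-level r = {row_level:.3f}, app-level r = {app_level:.3f}")
```

The two coefficients can differ noticeably, so it helps to state which one is being reported.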





## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
# errorbar=None replaces the deprecated ci=None (seaborn 0.12 and later)
sns.barplot(x='Category', y='Rating', data=merged_df, errorbar=None)

# Rotate the x-axis labels for better readability
plt.xticks(rotation=90)
plt.title("Average Rating by App Category")
plt.show()

##### 1. Why did you pick the specific chart?

X-axis: Categories (e.g., Games, Social, Productivity).
Y-axis: Average rating of apps in each category.
The plot helps compare ratings across categories and shows which categories tend to have better user ratings.


##### 2. What is/are the insight(s) found from the chart?

The bar plot shows that the Auto & Vehicles category has the highest average rating.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
sns.histplot(merged_df['Rating'], bins=10, kde=True)

plt.title("Distribution of Ratings for All Apps")
plt.xlabel("Rating")
plt.ylabel("Frequency")
plt.show()

##### 1. Why did you pick the specific chart?

A histogram shows the distribution of ratings across all apps. This is useful for identifying whether ratings are skewed and for understanding their overall spread. It also helps to visualize whether ratings are clustered around a certain value.

##### 2. What is/are the insight(s) found from the chart?

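The histogram shows how the ratings are distributed.
The ratings are left-skewed: most values sit at the high end of the scale, with a longer tail towards low ratings.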
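
The direction of the skew can also be checked numerically. The sketch below uses hypothetical toy ratings; a negative value from `Series.skew()` and a mean below the median both point to a left-skewed distribution.

```python
import pandas as pd

# Toy ratings: most values sit near the top of the scale, with one low outlier
ratings = pd.Series([2.0, 4.1, 4.2, 4.3, 4.4, 4.4, 4.5, 4.5, 4.6, 4.7])

skewness = ratings.skew()   # negative values indicate a longer left tail
mean, median = ratings.mean(), ratings.median()

print(f"skew = {skewness:.2f}, mean = {mean:.2f}, median = {median:.2f}")
```

On the notebook's data the same check would be `merged_df['Rating'].skew()`.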

#### Chart - 3

In [None]:
# Chart - 3 visualization code
sns.scatterplot(x='Rating', y='Sentiment_Polarity', data=merged_df)

plt.title("Rating vs Sentiment Polarity")
plt.xlabel("Rating")
plt.ylabel("Sentiment Polarity")
plt.show()

##### 1. Why did you pick the specific chart?

A scatter plot helps to visualize the relationship between Rating and Sentiment Polarity. Does a more positive sentiment correlate with a higher rating?

##### 2. What is/are the insight(s) found from the chart?

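The points are scattered across the graph with no clear trend, which is consistent with the very weak correlation (0.111607) between Rating and Sentiment_Polarity found earlier.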

#### Chart - 4

In [None]:
# Chart - 4 visualization code
app_type_counts = merged_df['Type'].value_counts()

# Create a pie chart
app_type_counts.plot.pie(autopct='%1.1f%%', startangle=90, colors=['skyblue', 'red'])

plt.title("Proportion of Free vs Paid Apps")
plt.ylabel('')  # Hides the y-axis label
plt.show()


##### 1. Why did you pick the specific chart?

A pie chart shows the proportion of Free vs. Paid apps. This is a quick way to see how many apps are free versus paid in the Play Store.

##### 2. What is/are the insight(s) found from the chart?

The pie chart displays the proportion of Free vs. Paid apps, which gives an immediate view of the distribution of app types in the dataset.
Most of the apps are free.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
sns.boxplot(x='Category', y='Rating', data=merged_df)

plt.title("Rating Distribution by App Category")
plt.xticks(rotation=90)
plt.show()

##### 1. Why did you pick the specific chart?

A box plot will show the distribution of ratings within each app category, including median, interquartile range (IQR), and potential outliers. This helps to understand the spread of ratings within different categories.

##### 2. What is/are the insight(s) found from the chart?

This graph shows the median rating and the outliers for each category.
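
The medians and outliers drawn by the box plot can also be listed as numbers. This is a minimal sketch on hypothetical toy ratings; `iqr_outliers` is an illustrative helper that applies the 1.5 × IQR rule, which matches seaborn's default `whis=1.5`.

```python
import pandas as pd

# Toy ratings for two categories (illustrative values only)
toy = pd.DataFrame({
    "Category": ["GAME"] * 6 + ["TOOLS"] * 6,
    "Rating": [4.4, 4.5, 4.5, 4.6, 4.6, 2.0,
               3.8, 3.9, 4.0, 4.0, 4.1, 4.2],
})

def iqr_outliers(group: pd.Series) -> pd.Series:
    """Return the values that fall outside the 1.5 * IQR fences."""
    q1, q3 = group.quantile(0.25), group.quantile(0.75)
    spread = q3 - q1
    mask = (group < q1 - 1.5 * spread) | (group > q3 + 1.5 * spread)
    return group[mask]

medians = toy.groupby("Category")["Rating"].median()
outliers = {
    name: iqr_outliers(grp).tolist()
    for name, grp in toy.groupby("Category")["Rating"]
}

print(medians)
print(outliers)
```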

#### Chart - 6

In [None]:
# Chart - 6 visualization code
## Visualise the top 5 and bottom 5 apps by average sentiment polarity
## (matplotlib and seaborn are already imported at the top of the notebook)

# Top Positive Apps
top_positive_apps = sentiment_summary.sort_values('Sentiment_Polarity', ascending=False).head(5)

# Top Negative Apps
top_negative_apps = sentiment_summary.sort_values('Sentiment_Polarity').head(5)

# Combine the data for plotting
top_apps = pd.concat([top_positive_apps[['App', 'Sentiment_Polarity']],
                      top_negative_apps[['App', 'Sentiment_Polarity']]])

# Plot
plt.figure(figsize=(10, 6))
sns.barplot(x='Sentiment_Polarity', y='App', data=top_apps, palette='magma')
plt.title('Top Positive and Negative Apps Based on Sentiment Polarity')
plt.xlabel('Sentiment Polarity')
plt.ylabel('App')
plt.show()

##### 1. Why did you pick the specific chart?

Bar charts are well-suited for displaying categorical data because they allow you to clearly see differences.

##### 2. What is/are the insight(s) found from the chart?

The top 5 and bottom 5 apps by average sentiment polarity are easy to identify from this chart.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights help to decide which apps need more innovation, creativity and focus to appeal to more people. The top-performing apps can also be studied to learn which factors make them popular and what more can be done.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
average_sentiment_by_category = merged_df.groupby('Category')['Sentiment_Polarity'].mean().sort_values(ascending=False)
sns.barplot(x=average_sentiment_by_category.index, y=average_sentiment_by_category.values)
plt.xticks(rotation=90)
plt.title('Average Sentiment Polarity by Category')
plt.xlabel('Category')
plt.ylabel('Average Sentiment Polarity')
plt.show()

##### 1. Why did you pick the specific chart?

A bar plot helps in comparing values across categories.

Y-axis has Average Sentiment Polarity

X-axis has Category related data

##### 2. What is/are the insight(s) found from the chart?

The maximum average sentiment polarity is in the Comics category, followed by Events and Auto & Vehicles.
The minimum average polarity is in the Games and Social categories.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
# Sentiment distribution of reviews for the apps
sns.countplot(data=merged_df, x='Sentiment')
plt.title('Sentiment Distribution of Reviews')
plt.xlabel('Sentiment')
plt.ylabel('Number of Reviews')
plt.show()

##### 1. Why did you pick the specific chart?

A count plot shows the number of reviews for each sentiment.

##### 2. What is/are the insight(s) found from the chart?

The majority of the reviews have positive sentiment. Negative reviews come next but are far fewer than the positive ones, and a small number of reviews are neutral.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

It confirms to apps with positive reviews that they are on track, shows apps with neutral reviews where there is scope for improvement, and signals that apps with negative reviews need to focus on that feedback and change accordingly.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
# Bar plot for Count of Apps by Category
plt.figure(figsize=(10, 6))
sns.countplot(data=merged_df, x='Category', palette='viridis')
plt.title('Number of Apps by Category')
plt.xlabel('Category')
plt.ylabel('Count of Apps')
plt.xticks(rotation=90)
plt.show()

##### 1. Why did you pick the specific chart?

A count plot shows the number of apps in each category.

X-axis has Category of the apps

Y-axis has Count of apps.

##### 2. What is/are the insight(s) found from the chart?

The Games category has the highest count, followed by the Family category,
and the Comics category has the lowest count.
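
Because the count plot is drawn on the merged frame, each bar counts review rows, so a heavily reviewed app contributes many rows to its category. A minimal sketch on hypothetical toy data shows how the number of distinct apps per category can be computed alongside the row count:

```python
import pandas as pd

# Toy merged frame: app "A" has three review rows
toy = pd.DataFrame({
    "App": ["A", "A", "A", "B", "C"],
    "Category": ["GAME", "GAME", "GAME", "FAMILY", "FAMILY"],
})

review_rows = toy["Category"].value_counts()            # what a countplot on the merged frame shows
unique_apps = toy.groupby("Category")["App"].nunique()  # distinct apps per category

print(review_rows)
print(unique_apps)
```

Here GAME leads by review rows while FAMILY leads by distinct apps, which is why the two counts are worth reporting separately.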

#### Chart - 10

In [None]:
# Chart - 10 visualization code
## Genre-wise Average Rating
plt.figure(figsize=(10, 5))
sns.barplot(data=merged_df, x='Genres', y='Rating', estimator='mean')
plt.title('Average Rating by Genre')
plt.xticks(rotation=90)
plt.show()

##### 1. Why did you pick the specific chart?

To compare the different genres on the basis of their average rating.

#### Chart - 11

In [None]:
# Chart - 11 visualization code
## App Type Distribution (Free vs Paid)
plt.figure(figsize=(10, 6))
sns.countplot(data=merged_df, x='Type')
plt.title('App Type Distribution (Free vs Paid)')
plt.show()

##### 1. Why did you pick the specific chart?

This count plot shows the number of free and paid apps.

##### 2. What is/are the insight(s) found from the chart?

Most of the apps are available free of cost, which improves their chances of being installed and used, whereas paid apps are far fewer in number.

#### Chart - 12

In [None]:
# Chart - 12 visualization code

# Ensure 'Installs' is numeric: strip ',' and '+' (astype(str) keeps this cell safe to re-run)
merged_df['Installs'] = merged_df['Installs'].astype(str).str.replace(r'[,+]', '', regex=True).astype(int)

# Average installs for each app type (Free vs Paid)
install_by_type = merged_df.groupby('Type')['Installs'].mean().reset_index()

# Plotting the bar chart for Free vs Paid Apps and their average installs
plt.figure(figsize=(10, 6))
sns.barplot(data=install_by_type, x='Type', y='Installs', palette='viridis')
plt.title('Average Installs for Free vs Paid Apps')
plt.xlabel('App Type')
plt.ylabel('Average Installs')
plt.show()

##### 1. Why did you pick the specific chart?

This chart reinforces the insight from the previous chart by comparing the average installs of free and paid apps.

##### 2. What is/are the insight(s) found from the chart?






Free apps have the most installs, whereas paid apps are installed by a smaller, more specific group of users.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

An app is more likely to be installed and used when it is free, so a free app is the better way to give users a first taste of the product.
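
The install comparison can be made more robust in two ways: parsing `Installs` with `pd.to_numeric(..., errors="coerce")` so that a malformed entry becomes NaN instead of raising, and comparing one row per app so that heavily reviewed apps are not counted many times. The sketch below uses hypothetical toy values, and the "Unknown" entry is an invented example of a malformed cell.

```python
import pandas as pd

# Toy merged frame (illustrative values only); app "A" has two review rows
toy = pd.DataFrame({
    "App": ["A", "A", "B", "C", "D"],
    "Type": ["Free", "Free", "Free", "Paid", "Paid"],
    "Installs": ["1,000,000+", "1,000,000+", "10,000+", "1,000+", "Unknown"],
})

# Strip ',' and '+', then coerce anything unparseable to NaN
toy["Installs"] = pd.to_numeric(
    toy["Installs"].astype(str).str.replace(r"[,+]", "", regex=True),
    errors="coerce",
)

# One row per app, then mean and median installs per type (NaN values are skipped)
per_app = toy.drop_duplicates(subset="App")
installs_by_type = per_app.groupby("Type")["Installs"].agg(["mean", "median"])

print(installs_by_type)
```

Reporting the median next to the mean helps because install counts are heavily skewed by a few very popular apps.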

#### Chart - 13

In [None]:
# Chart - 13 visualization code
# Rating Distribution by Content Rating
plt.figure(figsize=(8, 4))
sns.boxplot(data=merged_df, x='Content Rating', y='Rating', palette='Set3')
plt.title('Rating Distribution by Content Rating')
plt.xlabel('Content Rating')
plt.ylabel('Rating')
plt.xticks(rotation=45)
plt.show()


##### 1. Why did you pick the specific chart?

This box plot is used to identify outliers in app ratings within each content rating.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

# Step 1: Convert 'Reviews' to string first, then remove non-numeric characters like commas, plus signs
merged_df['Reviews'] = merged_df['Reviews'].astype(str).str.replace(r'[^\d]', '', regex=True)

# Step 2: Convert 'Reviews' column to float, handling any conversion errors (in case of invalid strings)
merged_df['Reviews'] = pd.to_numeric(merged_df['Reviews'], errors='coerce')

# Step 3: Convert 'Size' to megabytes ('M' values are already in MB, 'k' values are kilobytes)
def convert_size(size):
    if isinstance(size, str):
        if 'M' in size:
            return float(size.replace('M', ''))  # Already in megabytes
        elif 'k' in size:
            return float(size.replace('k', '')) / 1024  # Kilobytes to megabytes
    return None  # Non-string values or sizes without a recognised unit

merged_df['Size'] = merged_df['Size'].apply(convert_size)

# Step 4: Check for NaN values in relevant columns and drop rows with NaN values
merged_df = merged_df.dropna(subset=['Reviews', 'Rating', 'Sentiment_Polarity', 'Size'])

# Step 5: Calculate the correlation matrix
correlation_data = merged_df[['Reviews', 'Rating', 'Sentiment_Polarity', 'Size']]
correlation_matrix = correlation_data.corr()

# Step 6: Plot the correlation heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Heatmap of Numerical Features')
plt.show()


##### 1. Why did you pick the specific chart?

The heatmap shows the correlation between the numerical features of the dataset.

##### 2. What is/are the insight(s) found from the chart?

The values in the heatmap are read as follows --
corr close to 1 : strong positive relationship
corr close to 0 : no linear relationship
corr close to -1 : strong negative relationship
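
To read the heatmap as numbers, the feature pairs can be ranked by the absolute value of their correlation. This is a minimal sketch on hypothetical toy values, not the notebook's data:

```python
from itertools import combinations

import pandas as pd

# Toy numerical features (illustrative values only)
toy = pd.DataFrame({
    "Reviews": [10, 200, 3000, 40000, 500000],
    "Rating": [3.5, 4.0, 4.1, 4.4, 4.6],
    "Size": [50.0, 12.0, 33.0, 8.0, 21.0],
})

corr = toy.corr()

# Visit each unordered pair of columns once and sort by correlation strength
pairs = sorted(
    ((a, b, corr.loc[a, b]) for a, b in combinations(corr.columns, 2)),
    key=lambda item: abs(item[2]),
    reverse=True,
)

for a, b, r in pairs:
    print(f"{a} vs {b}: r = {r:.3f}")
```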

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
# Step 1: Plot a pairplot to visualize relationships between numerical features
sns.pairplot(merged_df[['Reviews', 'Rating', 'Sentiment_Polarity', 'Size']])
plt.suptitle('Pairplot of Numerical Features', y=1.02)
plt.show()

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

To help the client achieve their business objectives based on the insights from the Play Store dataset, the following suggestions can be made:
1. Identify and Focus on High-Rating Categories
To improve the quality and reputation of apps.
Based on the analysis of categories with the highest average ratings, the client should consider investing in or developing apps within high-performing categories.
2. Target Free vs. Paid Apps
To maximize revenue by determining the best app type for each segment.
Free apps are performing well in terms of installs, ratings, and user engagement, so consider offering in-app purchases or ads to monetize them.
Develop paid apps only if it is ensured that the app offers substantial value that justifies the price.
3. Improve Sentiment and User Feedback
To enhance user satisfaction and improve app ratings.
Analyze the sentiment and feedback from reviews to understand user pain points and what users like. Pay attention to apps with high subjectivity in their reviews.
4. Focus on Apps with High Engagement (Reviews vs. Rating)
To increase user engagement and interaction.
Focus on apps that have high engagement (i.e., a large number of reviews) but lower ratings. These apps can be improved by addressing user complaints and fixing issues. By improving their ratings, these apps can attract more installs.
5. Track and Adapt to Changing User Preferences
To stay relevant by adapting to market trends.
The client should regularly track the "Last Updated" data to ensure apps are maintained and updated based on the latest user feedback, trends, and technological advancements.
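
Suggestion 4 can be turned into a simple filter. The sketch below is one possible approach on hypothetical toy data; the cut-offs (top half by review count, rating below 4.0) are assumptions chosen for illustration and would need tuning on the real dataset.

```python
import pandas as pd

# Toy app-level frame (illustrative values only)
toy = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Reviews": [120000, 300, 95000, 4000],
    "Rating": [3.4, 4.6, 4.5, 3.1],
})

# Hypothetical cut-offs: at or above the median review count, rating below 4.0
review_cutoff = toy["Reviews"].quantile(0.5)
rating_cutoff = 4.0

candidates = toy[(toy["Reviews"] >= review_cutoff) & (toy["Rating"] < rating_cutoff)]
print(candidates)
```

Apps returned by this filter already have an engaged audience, so addressing the complaints in their reviews is likely to reach many users.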

# **Conclusion**

Combining the Play Store data with the user review data helped to gain a deeper understanding of how apps perform and what users think about them. The combined insights help identify trends such as which categories of apps are the most popular, what app features users like or dislike, and how ratings relate to user feedback. This allows the client to make informed decisions about improving app quality, targeting the right audience, and optimizing performance. The analysis gives a clearer picture of app success factors, which can lead to better user experiences and business growth.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***