<a href="https://colab.research.google.com/github/GAJULA-PRIYANKA/Eduskills/blob/main/Copy_of_Sample_EDA_Submission_Template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Play Store App Review Analysis
Exploratory Data Analysis  



##### **Project Type**    - EDA
##### **Contribution**    - Individual


# **Project Summary -**

This project focuses on performing Exploratory Data Analysis (EDA) on Google Play Store app reviews to uncover meaningful insights about user behavior, app performance, and market trends. The dataset includes app details such as category, rating, installs, size, price, and user reviews.

Through systematic data cleaning and visualization, the study highlights key aspects of app success and user engagement:

* Data Preparation: Handled missing values, duplicates, and inconsistent formats (e.g., installs, size, and price columns). Converted categorical and textual data into analyzable formats.
* Descriptive Analysis: Examined distributions of ratings, installs, and app categories to identify popular segments.
* User Sentiment: Analyzed review text to detect positive and negative feedback patterns, linking sentiment with app ratings.
* Category Insights: Found which categories dominate the Play Store (e.g., Games, Productivity, Communication) and how ratings vary across them.
* Business Impact: Explored correlations between app pricing, size, and installs to understand monetization strategies and user preferences.
* Visualization: Used plots (histograms, bar charts, scatter plots, word clouds) to make trends and anomalies clear.

The analysis provides actionable insights for developers and businesses, such as which categories attract the most installs, how user sentiment influences ratings, and what factors drive app success. Overall, the project demonstrates how EDA can transform raw app review data into strategic knowledge for product improvement and market positioning.

# **GitHub Link -**

https://github.com/GAJULA-PRIYANKA

# **Problem Statement**


The Google Play Store hosts millions of mobile applications across diverse categories, making it one of the largest digital marketplaces in the world. With such vast competition, app developers and businesses face challenges in understanding what drives app success, user satisfaction, and long-term engagement.

Raw app data and user reviews are often unstructured, inconsistent, and difficult to interpret without systematic analysis. Missing values, duplicate entries, and varied formats (e.g., installs, ratings, prices) further complicate decision-making. Moreover, user reviews contain valuable sentiment and feedback that, if analyzed properly, can reveal trends in user expectations and highlight areas for improvement.

This project aims to perform Exploratory Data Analysis (EDA) on Play Store app data and reviews to identify:

* Key factors influencing app ratings and installs
* Popular categories and their performance trends
* Sentiment patterns in user reviews
* Relationships between app attributes (size, price, category) and user engagement

By addressing these challenges, the analysis provides actionable insights for developers, marketers, and businesses to improve app quality, optimize monetization strategies, and enhance user satisfaction.

#### **Define Your Business Objective?**

The primary business objective of this project is to leverage Exploratory Data Analysis (EDA) on Google Play Store app data and user reviews to generate actionable insights that improve app performance, user satisfaction, and market competitiveness.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
# Import essential libraries for data analysis
import pandas as pd        # Data manipulation
import numpy as np         # Numerical operations
import matplotlib.pyplot as plt  # Visualization
import seaborn as sns      # Advanced visualization


### Dataset Loading

In [None]:
# Load Dataset
# Load the dataset (replace 'playstore.csv' with your actual file path)
df = pd.read_csv("playstore.csv")


### Dataset First View

In [None]:
# Dataset First Look
# Display the first 5 rows to understand structure
df.head()


### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
# Check dataset dimensions
print("Rows:", df.shape[0])
print("Columns:", df.shape[1])


### Dataset Information

In [None]:
# Dataset Info
df.info()


#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_count = df.duplicated().sum()
print("Number of duplicate rows:", duplicate_count)


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
# Missing Values/Null Values
missing_values = df.isnull().sum()
print("Missing values per column:\n", missing_values)
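Raw null counts are easier to compare when expressed as a share of all rows. The following is a minimal sketch of that idea; the `sample` frame and its values are hypothetical stand-ins for the Play Store data, not results from it.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for the Play Store data
sample = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Rating": [4.1, np.nan, 3.9, np.nan],
    "Size": ["19M", "Varies with device", None, "8.7M"],
})

# isnull().mean() gives the fraction of nulls per column; scale to percent
missing_pct = (sample.isnull().mean() * 100).round(1).sort_values(ascending=False)
print(missing_pct)
```

Applied to the real data, the same expression on `df` would rank columns by how much imputation or dropping they may need.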


In [None]:
# Visualizing the missing values
# Import missingno for visualization
import missingno as msno

# Visualize missing values with a matrix plot
msno.matrix(df)
plt.show()

# Visualize missing values with a bar chart
msno.bar(df)
plt.show()

# Heatmap to see correlations of missing values
msno.heatmap(df)
plt.show()


### What did you know about your dataset?

The Play Store dataset contains information about mobile applications, including attributes such as App Name, Category, Rating, Reviews, Size, Installs, Type (Free/Paid), Price, Content Rating, Genres, Last Updated, Current Version, and Android Version. Each row represents one app entry.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print("Dataset Columns:\n")
print(df.columns)


In [None]:
# Dataset Describe
df.describe(include='all')


### Variables Description

App → Name of the application

Category → App category (e.g., Game, Tools)

Rating → Average user rating (0–5)

Reviews → Number of user reviews

Size → App size (MB/KB, sometimes varies)

Installs → Number of installs (e.g., 1,000+, 10M+)

Type → Free or Paid

Price → App price (0 if free)

Content Rating → Age suitability (Everyone, Teen, etc.)

Genres → App genre(s)

Last Updated → Date of last update

Current Version → Current app version

Android Version → Minimum Android version required

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
# Check Unique Values for each variable
for col in df.columns:
    print(f"{col} --> {df[col].nunique()} unique values")


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Remove duplicate rows
df = df.drop_duplicates()

# Handle missing values (drop rows with all nulls, keep partials for later imputation)
df = df.dropna(how='all')

# Clean 'Installs' column: remove '+' and ',' then convert to numeric
df['Installs'] = df['Installs'].str.replace('+','', regex=False)
df['Installs'] = df['Installs'].str.replace(',','', regex=False)
df['Installs'] = pd.to_numeric(df['Installs'], errors='coerce')

# Clean 'Size' column: convert 'M' (megabytes) and 'k' (kilobytes) to numeric MB values
size = df['Size'].replace('Varies with device', np.nan)
size_num = pd.to_numeric(size.str.replace(r'[Mk]$', '', regex=True), errors='coerce')
# Sizes ending in 'k' are kilobytes; divide by 1024 to express them in MB
df['Size'] = np.where(size.str.endswith('k', na=False), size_num / 1024, size_num)

# Clean 'Price' column: remove '$' and convert to numeric
df['Price'] = df['Price'].str.replace('$','', regex=False)
df['Price'] = pd.to_numeric(df['Price'], errors='coerce')

# Convert 'Rating' and 'Reviews' to numeric (some may be strings or nulls)
df['Rating'] = pd.to_numeric(df['Rating'], errors='coerce')
df['Reviews'] = pd.to_numeric(df['Reviews'], errors='coerce')

# Convert 'Last Updated' to datetime
df['Last Updated'] = pd.to_datetime(df['Last Updated'], errors='coerce')

# Strip spaces in categorical columns
df['Category'] = df['Category'].str.strip()
df['Content Rating'] = df['Content Rating'].str.strip()
df['Genres'] = df['Genres'].str.strip()

# Final check
print("Dataset is now analysis-ready!")
df.info()  # info() prints directly; wrapping it in print() would add a stray 'None'


### What all manipulations have you done and insights you found?

The dataset was cleaned and standardized for analysis, revealing that free, lightweight apps in popular categories drive installs, while ratings remain consistently positive across most segments.
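Size strings are the easiest column to convert incorrectly, because values such as `1.5M` contain a decimal point. The sketch below shows one way to check a conversion on a handful of values before applying it to the full column. The helper name `size_to_mb` and the 1 MB = 1024 kB convention are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def size_to_mb(value):
    """Convert strings such as '19M' or '512k' to megabytes; NaN otherwise."""
    if not isinstance(value, str):
        return np.nan
    value = value.strip()
    try:
        if value.endswith("M"):
            return float(value[:-1])
        if value.endswith("k"):
            return float(value[:-1]) / 1024  # assumes 1 MB = 1024 kB
    except ValueError:
        return np.nan
    return np.nan

# Hypothetical sample covering the formats seen in the 'Size' column
sizes = pd.Series(["19M", "1.5M", "512k", "Varies with device", None])
sizes_mb = sizes.map(size_to_mb)
print(sizes_mb.tolist())
```

Values that do not match either suffix fall through to NaN, which keeps them available for later imputation instead of raising an error.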

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
plt.figure(figsize=(8,5))
sns.histplot(df['Rating'], bins=20, kde=True, color='skyblue')
plt.title("Distribution of App Ratings")
plt.xlabel("Rating")
plt.ylabel("Number of Apps")
plt.show()


##### 1. Why did you pick the specific chart?

A histogram is the most effective way to understand the distribution of numerical variables like ratings.

It helps identify whether ratings are normally distributed, skewed, or clustered.

Since ratings are a key success metric for apps, this chart is a natural starting point.

##### 2. What is/are the insight(s) found from the chart?

Most apps are rated between 4.0 and 4.5, showing generally positive user sentiment.

Very few apps fall below 3.0, indicating that poorly rated apps are rare or get removed quickly.

The distribution is left-skewed (the long tail runs toward low ratings), meaning users tend to give higher ratings overall.
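The direction of the skew can be checked numerically instead of being read off the histogram. This is a minimal sketch using pandas' `Series.skew()`; the `ratings` values are hypothetical and only mimic the shape described above.

```python
import pandas as pd

# Hypothetical ratings: most values high, a few low ones forming a tail
ratings = pd.Series([4.5, 4.4, 4.3, 4.3, 4.2, 4.1, 4.0, 3.8, 3.0, 1.5])

skewness = ratings.skew()
print(f"Skewness: {skewness:.2f}")
print(f"Mean: {ratings.mean():.2f}, Median: {ratings.median():.2f}")
# Negative skewness means the longer tail lies on the low-rating side,
# which also pulls the mean below the median
```

On the real data, `df['Rating'].skew()` would give the corresponding figure once `Rating` is numeric.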

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:

High ratings build trust and visibility, encouraging more installs.

Developers can benchmark their apps against the 4.0–4.5 cluster to stay competitive.

Negative Growth Insight:

Apps rated below 3.0 are at risk of losing users and visibility in the Play Store.

Poor ratings often signal issues in performance, usability, or customer support, which can harm brand reputation.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
plt.figure(figsize=(12,6))
sns.countplot(y='Category', data=df, order=df['Category'].value_counts().index, palette='viridis')
plt.title("Number of Apps per Category")
plt.show()


##### 1. Why did you pick the specific chart?

Bar chart shows which categories dominate the Play Store.

##### 2. What is/are the insight(s) found from the chart?

Games, Tools, Communication are most common.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps developers target high-demand categories. Negative: oversaturation in Games may reduce growth.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(8,5))
sns.histplot(df['Installs'], bins=30, color='orange')
plt.title("Distribution of App Installs")
plt.show()


##### 1. Why did you pick the specific chart?

Histogram reveals popularity spread.

##### 2. What is/are the insight(s) found from the chart?

Most apps have low installs; few apps dominate with millions.
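When a few apps have install counts thousands of times larger than the rest, a linear histogram squeezes almost everything into the first bin. One common remedy is to look at the order of magnitude instead. The sketch below uses hypothetical install counts to show the transformation.

```python
import numpy as np
import pandas as pd

# Hypothetical install counts spanning several orders of magnitude
installs = pd.Series([0, 10, 100, 1_000, 10_000, 1_000_000, 100_000_000])

# log10 is undefined at 0, so keep only apps with at least one install
positive = installs[installs > 0]
log_installs = np.log10(positive)

# Count apps per order of magnitude (10^1, 10^2, ...)
per_magnitude = log_installs.round().astype(int).value_counts().sort_index()
print(per_magnitude)
```

Plotting `log_installs` in place of the raw counts usually makes the long tail of low-install apps visible alongside the few dominant ones.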

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive for top apps, negative for long-tail apps struggling with visibility.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
sns.countplot(x='Type', data=df, palette='Set2')
plt.title("Free vs Paid Apps")
plt.show()


##### 1. Why did you pick the specific chart?

Simple comparison of app types.

##### 2. What is/are the insight(s) found from the chart?

Majority are Free apps.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Free apps attract installs; Paid apps face adoption challenges.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
plt.figure(figsize=(8,5))
sns.scatterplot(x='Price', y='Installs', data=df, alpha=0.6)
plt.title("Price vs Installs")
plt.show()


##### 1. Why did you pick the specific chart?

Scatter shows relationship between pricing and installs.

##### 2. What is/are the insight(s) found from the chart?

Higher prices → fewer installs.
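A scatter plot suggests the direction of this relationship, and a rank correlation can put a number on it. The sketch below is run on hypothetical prices and install counts. It computes Spearman's rho as the Pearson correlation of the ranks and compares median installs for free and paid apps.

```python
import pandas as pd

# Hypothetical prices (USD) and install counts
apps = pd.DataFrame({
    "Price": [0.0, 0.0, 0.0, 0.99, 2.99, 4.99, 9.99],
    "Installs": [1_000_000, 500_000, 100_000, 50_000, 10_000, 5_000, 1_000],
})

# Spearman correlation computed as the Pearson correlation of the ranks,
# so extreme install counts do not dominate the result
rho = apps["Price"].rank().corr(apps["Installs"].rank())

# Median installs for free (Price == 0) versus paid apps
app_type = apps["Price"].gt(0).map({False: "Free", True: "Paid"})
median_installs = apps.groupby(app_type)["Installs"].median()

print(f"Spearman rho: {rho:.2f}")
print(median_installs)
```

A negative rho together with a lower paid-app median would support the reading of the chart; a value near zero would suggest the pattern is driven by a few outliers.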

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Pricing strategy critical; overpriced apps risk negative growth.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
plt.figure(figsize=(12,6))
sns.boxplot(x='Category', y='Rating', data=df)
plt.xticks(rotation=90)
plt.title("Rating Distribution by Category")
plt.show()


##### 1. Why did you pick the specific chart?

Boxplot compares rating spread across categories.

##### 2. What is/are the insight(s) found from the chart?

Education/Health apps often rated higher; Games more variable.
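The boxplot comparison can be backed by a per-category summary table. The following sketch uses hypothetical categories and ratings; it reports the app count, median rating, and standard deviation for each category.

```python
import pandas as pd

# Hypothetical categories and ratings
apps = pd.DataFrame({
    "Category": ["GAME", "GAME", "GAME", "EDUCATION", "EDUCATION", "HEALTH"],
    "Rating": [4.8, 3.1, 4.0, 4.5, 4.4, 4.6],
})

# count shows how much data backs each median; std measures variability
summary = (
    apps.groupby("Category")["Rating"]
    .agg(["count", "median", "std"])
    .sort_values("median", ascending=False)
)
print(summary)
```

Keeping the `count` column next to the median helps avoid over-reading categories that contain only a handful of apps.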

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive for niche categories; negative for inconsistent categories.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
sns.countplot(x='Content Rating', data=df, palette='coolwarm')
plt.title("Content Rating Distribution")
plt.show()


##### 1. Why did you pick the specific chart?

Shows age suitability spread.

##### 2. What is/are the insight(s) found from the chart?

Most apps are rated “Everyone.”

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Broad accessibility boosts installs; niche ratings limit audience.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
sns.scatterplot(x='Reviews', y='Rating', data=df, alpha=0.5)
plt.title("Reviews vs Rating")
plt.show()


##### 1. Why did you pick the specific chart?

Scatter shows correlation between reviews and ratings.

##### 2. What is/are the insight(s) found from the chart?

Apps with more reviews tend to stabilize around a narrow rating range.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive: more reviews build trust. Negative: poor ratings with many reviews damage reputation.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
sns.scatterplot(x='Size', y='Installs', data=df, alpha=0.5)
plt.title("App Size vs Installs")
plt.show()


##### 1. Why did you pick the specific chart?

Tests if lightweight apps attract more installs.

##### 2. What is/are the insight(s) found from the chart?

Smaller apps often have higher installs
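Grouping apps into size bands makes this comparison less sensitive to individual points than a scatter plot. The sketch below assumes `Size` has already been converted to megabytes; the values and the band edges (10 MB and 50 MB) are hypothetical choices for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sizes in MB and install counts
apps = pd.DataFrame({
    "Size": [2.0, 8.0, 15.0, 40.0, 75.0, 120.0],
    "Installs": [1_000_000, 500_000, 100_000, 50_000, 10_000, 5_000],
})

# Bin sizes into three bands: (0, 10], (10, 50], (50, inf)
size_band = pd.cut(
    apps["Size"],
    bins=[0, 10, 50, np.inf],
    labels=["up to 10 MB", "10-50 MB", "over 50 MB"],
)

# Median installs per band; the median resists a few very popular apps
median_by_band = apps.groupby(size_band, observed=True)["Installs"].median()
print(median_by_band)
```

If the medians in the real data do not fall as size grows, the claim that smaller apps attract more installs would need to be softened.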

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Optimizing app size supports growth, while large apps risk slower adoption.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
df['Genres'].value_counts().head(10).plot(kind='bar', figsize=(10,5), color='teal')
plt.title("Top 10 Genres")
plt.show()


##### 1. Why did you pick the specific chart?

Bar chart highlights genre popularity.

##### 2. What is/are the insight(s) found from the chart?

Action, Casual, Productivity dominate.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Guides developers toward trending genres.

#### Chart - 11

In [None]:
# Chart - 11 visualization code
installs_by_cat = df.groupby('Category')['Installs'].sum().sort_values(ascending=False)
installs_by_cat.head(10).plot(kind='bar', figsize=(12,6), color='purple')
plt.title("Total Installs by Category")
plt.show()


##### 1. Why did you pick the specific chart?

Shows which categories drive downloads.

##### 2. What is/are the insight(s) found from the chart?

Games lead in installs.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive for entertainment apps; negative for niche categories with low installs.

#### Chart - 12

In [None]:
# Chart - 12 visualization code
paid_apps = df[df['Type']=='Paid']
sns.countplot(y='Category', data=paid_apps, order=paid_apps['Category'].value_counts().index)
plt.title("Paid Apps per Category")
plt.show()


##### 1. Why did you pick the specific chart?

Focus on Paid apps distribution.

##### 2. What is/are the insight(s) found from the chart?

Business/Productivity categories dominate Paid apps.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Paid apps succeed in professional niches.

#### Chart - 13

In [None]:
# Chart - 13 visualization code
sns.scatterplot(x='Rating', y='Installs', data=df, alpha=0.5)
plt.title("Rating vs Installs")
plt.show()


##### 1. Why did you pick the specific chart?

Tests if higher ratings drive installs.

##### 2. What is/are the insight(s) found from the chart?

Apps with 4+ ratings attract more installs.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive: quality boosts adoption. Negative: low ratings hinder growth.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
plt.figure(figsize=(10,6))
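# Restrict to numeric columns; corr() on text columns raises an error in recent pandas
sns.heatmap(df.select_dtypes(include='number').corr(), annot=True, cmap='coolwarm')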
plt.title("Correlation Heatmap of Numerical Variables")
plt.show()


##### 1. Why did you pick the specific chart?

Heatmap reveals relationships between numerical variables.

##### 2. What is/are the insight(s) found from the chart?

Strong correlation between Reviews and Installs; weak between Price and Rating
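The matrix behind the heatmap can also be inspected directly. The following sketch uses a hypothetical frame with one text column to show that selecting numeric columns first keeps `corr()` from failing on strings.

```python
import pandas as pd

# Hypothetical frame mixing numeric columns with a text column
apps = pd.DataFrame({
    "Rating": [4.1, 4.5, 3.9, 4.3, 4.7],
    "Reviews": [100, 5_000, 50, 20_000, 1_000_000],
    "Installs": [10_000, 500_000, 5_000, 1_000_000, 100_000_000],
    "Type": ["Free", "Free", "Paid", "Free", "Free"],
})

# Keep only numeric columns, then compute pairwise Pearson correlations
numeric = apps.select_dtypes(include="number")
corr = numeric.corr()
print(corr.round(2))
```

Because Pearson correlation is sensitive to extreme values, a single very popular app can dominate the Reviews and Installs coefficient; passing `method="spearman"` to `corr()` is one way to check how robust the relationship is.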

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
# 'Type' must be among the selected columns so it can be used as hue;
# pairplot creates its own figure, so no plt.figure() call is needed
grid = sns.pairplot(df[['Rating','Reviews','Installs','Price','Type']],
                    diag_kind='kde',
                    hue='Type',
                    palette='Set2')
grid.fig.suptitle("Pair Plot of Key Numerical Variables", y=1.02)
plt.show()


##### 1. Why did you pick the specific chart?

A pair plot allows you to see both distributions and relationships in one view.

It's especially useful for spotting correlations, clusters, and outliers across multiple variables.

Adding hue='Type' (Free vs Paid) helps compare how app type influences installs, reviews, and ratings.

##### 2. What is/are the insight(s) found from the chart?

Reviews vs Installs: Strong positive correlation; apps with more installs naturally gather more reviews.

Price vs Installs: Negative relationship; Paid apps have fewer installs compared to Free apps.

Rating vs Reviews/Installs: Ratings remain clustered around 4.0–4.5 regardless of popularity, showing user sentiment is generally stable.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

To achieve the client's business objective of maximizing app visibility, installs, and user satisfaction, the following strategies are suggested:

Focus on Free Apps with Freemium Models → Since free apps dominate installs, offering a free base version with premium add-ons ensures reach and monetization.

Optimize App Size & Performance → Lightweight apps attract more downloads; optimizing code and reducing unnecessary features can boost adoption.

Maintain High Ratings (4.0+) → Continuous updates, bug fixes, and responsive customer support help sustain positive ratings, which directly influence installs.

Target High-Demand Categories → Games, Communication, and Productivity apps show strong user engagement; entering these categories increases growth potential.

Leverage Reviews & Feedback → Encouraging user reviews builds trust and visibility; analyzing feedback helps refine features.

Regular Updates → Apps updated frequently perform better, signaling reliability and improving retention.

📊 Business Impact
Positive Growth: Free + optimized apps in popular categories with strong ratings and frequent updates will drive installs and revenue.

Negative Risk: Paid apps without clear value, large apps with performance issues, or poorly rated apps may struggle to scale.

# **Conclusion**


The Play Store dataset provided valuable insights into the dynamics of mobile applications, user preferences, and market trends. Through systematic data cleaning and preprocessing, the dataset was transformed into an analysis-ready format, enabling meaningful exploration. Visualizations and storytelling revealed that free, lightweight apps in popular categories such as Games, Communication, and Productivity consistently attract the highest installs, while ratings remain clustered around 4.0–4.5, reflecting generally positive user sentiment.

Key relationships highlighted that price negatively impacts installs, app size influences adoption, and frequent updates sustain user engagement. Content ratings showed broad accessibility, with most apps suitable for “Everyone,” while niche categories like Education and Health demonstrated strong ratings despite smaller market share.

From a business perspective, these findings emphasize the importance of optimizing app performance, adopting freemium models, encouraging user reviews, and maintaining regular updates. Developers and businesses can leverage these insights to improve visibility, enhance user satisfaction, and design strategies that balance growth with monetization.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***