# **Project Name**    - Play Store App Review Analysis



##### **Project Type**    - EDA
##### **Contribution**    - Individual


# **Project Summary -**

The Play Store App Review Analysis project aims to leverage Exploratory Data Analysis (EDA) to uncover insights from a dataset comprising app names, categories, ratings, reviews, sizes, installs, types, prices, content ratings, genres, last updated dates, current versions, and Android version requirements. Using Python libraries such as NumPy, Pandas, Matplotlib, and Seaborn, we will clean the data, handle missing values, and perform various analyses to understand app performance, user feedback, and market trends. The goal is to identify key factors influencing app ratings, popular app categories, and common user concerns, providing actionable insights for app developers to enhance their apps and user satisfaction.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


The Google Play Store hosts millions of apps across various categories, and user reviews play a crucial role in determining an app's success and visibility. However, the vast amount of review data available makes it challenging for developers to extract meaningful insights manually. This project seeks to address the following questions:

*   What are the key factors that influence app ratings and user reviews on the Play Store?
*   How do app characteristics such as category, size, type (free or paid), and content rating impact their performance and popularity?
*   What common themes and issues can be identified from user reviews that can help developers improve their apps?
*   How do trends over time, such as the frequency of updates, affect user satisfaction and app ratings?

By analyzing a comprehensive dataset of Play Store app reviews and related attributes, we aim to provide actionable insights and recommendations for app developers to enhance their apps, improve user satisfaction, and increase their market competitiveness.

#### **Define Your Business Objective?**

The primary business objective of the Play Store App Review Analysis project is to empower app developers and stakeholders with data-driven insights to improve app quality, enhance user satisfaction, and optimize app store performance. Specifically, the project aims to:

*   Improve App Ratings: Identify the key factors influencing app ratings and user feedback to help developers understand what drives positive and negative reviews.
*   Enhance User Experience: Analyze common themes and issues in user reviews to provide actionable recommendations for improving app features and usability.
*   Optimize App Performance: Examine the impact of app characteristics such as category, size, type, and content rating on their performance to guide strategic decisions on app development and marketing.
*   Increase Market Competitiveness: Utilize trend analysis to understand the effects of updates and other temporal factors on app ratings and installs, helping developers stay competitive in a dynamic market.
*   Support Data-Driven Decision Making: Provide comprehensive, visual, and easily interpretable insights that enable developers to make informed decisions based on user feedback and market trends.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
data=pd.read_csv('/content/Play Store Data.csv')
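
Since the guidelines ask for exception handling, the loading step could be wrapped in a small helper. This is only a sketch: the function name `load_play_store_csv` is hypothetical, and the path is the one used in the cell above.

```python
from pathlib import Path

import pandas as pd


def load_play_store_csv(path):
    """Read the CSV at `path`, failing early with a clear message if it is missing."""
    csv_path = Path(path)
    if not csv_path.is_file():
        raise FileNotFoundError(f"Dataset not found: {csv_path}")
    return pd.read_csv(csv_path)


# Usage: report a missing file instead of stopping with a long traceback
try:
    data = load_play_store_csv('/content/Play Store Data.csv')
except FileNotFoundError as err:
    print(err)
```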

### Dataset First View

In [None]:
data.head()

### Dataset Rows & Columns count

In [None]:
data.shape

### Dataset Information

In [None]:
data.info()

#### Duplicate Values

In [None]:
data.duplicated().sum()

#### Missing Values/Null Values

In [None]:
data.isnull().sum()

In [None]:
import missingno
missingno.bar(data,color='#0e69b3')

### What did you know about your dataset?

The Play Store app review dataset consists of 10,841 entries and 13 columns, providing a comprehensive view of various app attributes. The columns include app name, category, rating, reviews, size, installs, type (free or paid), price, content rating, genres, last updated date, current version, and required Android version. Most columns are of type Object, except for Rating, which is Float64. The dataset has some missing values in the Rating, Type, Content Rating, Current Ver, and Android Ver columns, which will require appropriate handling. Data types for columns like Reviews, Size, Installs, Price, and dates need conversion for accurate analysis. The dataset uses approximately 1.1 MB of memory.
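
To judge how much handling each column needs, the raw null counts can be turned into percentages. The sketch below uses a small made-up frame rather than the real dataset, and the helper name `missing_summary` is an assumption for illustration.

```python
import numpy as np
import pandas as pd


def missing_summary(frame):
    """Return the count and percentage of missing values for columns that have any."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Made-up sample with gaps in Rating and Type
sample = pd.DataFrame({
    "App": ["a", "b", "c", "d"],
    "Rating": [4.1, np.nan, 3.9, np.nan],
    "Type": ["Free", "Paid", None, "Free"],
})
print(missing_summary(sample))
```

On the real data, the same helper could be called as `missing_summary(data)`.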

## ***2. Understanding Your Variables***

In [None]:
#dataset columns
data.columns

In [None]:
# Dataset Describe
data.describe(include='object')

### Variables Description

The Play Store app review dataset comprises 10,841 entries with 13 columns, each representing an attribute of an app. These attributes include app names, categories, ratings, review counts, sizes, installs, types (free or paid), prices, content ratings, genres, last updated dates, current versions, and required Android versions. The dataset reveals a diverse range of apps, with the most frequent app being "ROBLOX" and the most common category being "FAMILY." Reviews, sizes, installs, prices, and dates are currently stored as objects, requiring conversion for analysis. Notably, the "Varies with device" label appears frequently in the size and current version fields. The dataset shows that the majority of apps are free and have a content rating suitable for everyone.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
data['Category'].unique()

In [None]:
data['Reviews'].unique()

In [None]:
data['Size'].unique()

In [None]:
data['Type'].unique()

In [None]:
data['Price'].unique()

In [None]:
data['Content Rating'].unique()

In [None]:
data['Genres'].unique()

In [None]:
data['Last Updated'].unique()

In [None]:
data['Current Ver'].unique()

In [None]:
data['Android Ver'].unique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
df=data.copy()

In [None]:
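# Count how many Reviews values are purely numeric strings
df['Reviews'].str.isnumeric().sum()

In [None]:
# Show the rows whose Reviews value is not numeric
df[~df['Reviews'].str.isnumeric()]

In [None]:
# Drop the malformed row found above (position 10472)
df=df.drop(df.index[10472])

In [None]:
# Confirm that no non-numeric Reviews values remain
df[~df['Reviews'].str.isnumeric()]

In [None]:
# Reviews can now be converted to integers
df['Reviews']=df['Reviews'].astype(int)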
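
Dropping a row by a hard-coded position works for this file but would break if the data changed. A more general sketch, shown here on made-up values, lets `pd.to_numeric` flag whatever cannot be parsed:

```python
import pandas as pd

# Made-up sample: the middle Reviews value is not a plain number
sample = pd.DataFrame({"App": ["a", "b", "c"], "Reviews": ["159", "3.0M", "967"]})

# errors="coerce" turns unparseable values into NaN instead of raising
parsed = pd.to_numeric(sample["Reviews"], errors="coerce")

bad_rows = sample[parsed.isna()]          # rows to inspect or drop
clean = sample[parsed.notna()].assign(Reviews=parsed.dropna().astype(int))

print(bad_rows)
print(clean)
```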

In [None]:
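# Convert Size to kilobytes. Appending '000' to values such as '8.7M' would give 8.7, so multiply instead.
size=df['Size'].replace('Varies with device',np.nan)
is_mb=size.str.endswith('M',na=False)
size=size.str.rstrip('Mk').astype(float)
# Values given in 'M' are scaled by 1000; values given in 'k' are kept as they are
df['Size']=size.where(~is_mb,size*1000)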

In [None]:
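# Strip '+', ',' and '$' so Installs and Price can be parsed as numbers
chars=['+',',','$']
cols=['Installs','Price']
for item in chars:
    for col in cols:
        # regex=False treats '+' and '$' as literal characters, not regex operators
        df[col]=df[col].str.replace(item,'',regex=False)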

In [None]:
df['Installs']=df['Installs'].astype('int')
df['Price']=df['Price'].astype('float')

In [None]:
df['Last Updated']=pd.to_datetime(df['Last Updated'])
df['Day']=df['Last Updated'].dt.day
df['Month']=df['Last Updated'].dt.month
df['Year']=df['Last Updated'].dt.year

In [None]:
df.info()

In [None]:
df[df.duplicated('App')].shape

### The dataset contains some duplicate app records

In [None]:
# Drop duplicate apps, keeping only the first record for each app
df=df.drop_duplicates(subset=['App'],keep='first')
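
Keeping the first record is one possible policy. An alternative, sketched below on made-up data, keeps the copy with the most reviews on the assumption that it is the most recent snapshot of the app; that assumption is not verified against this dataset.

```python
import pandas as pd

# Made-up sample: app "x" appears twice with different review counts
sample = pd.DataFrame({"App": ["x", "x", "y"], "Reviews": [10, 25, 7]})

deduped = (
    sample.sort_values("Reviews", ascending=False)  # most-reviewed copy comes first
          .drop_duplicates(subset="App", keep="first")
          .sort_index()                             # restore the original row order
)
print(deduped)
```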

In [None]:
df.shape

In [None]:
df.duplicated().sum()

In [None]:
df['Category'].value_counts()

In [None]:
df['Type'].value_counts()

In [None]:
df['Content Rating'].value_counts()

In [None]:
df['Genres'].value_counts()

In [None]:
df[df['Rating']==5]

In [None]:
# Highest-rated apps in the Comics genre
df[(df['Genres']=='Comics') & (df['Rating']==5)]

In [None]:
# First five apps with a 5.0 rating in the Education genre
df[(df['Genres']=='Education') & (df['Rating']==5)].head(5)

In [None]:
# First five games with a 5.0 rating
df[(df['Category']=='GAME') & (df['Rating']==5)].head(5)

In [None]:
df['Installs'].nlargest()

In [None]:
# Most-downloaded apps on the Play Store (first five in the 1,000,000,000+ installs bracket)
df[df['Installs']==1000000000].head(5)

In [None]:
df['Installs'].nsmallest()

In [None]:
# Apps with zero installs
df[df['Installs']==0]

In [None]:
df.groupby('Type')['Rating'].mean()

In [None]:
df.groupby('Content Rating')['Rating'].mean()

In [None]:
df.groupby('Genres')['Rating'].mean()

In [None]:
df.groupby('Category')['Rating'].mean()

In [None]:
pd.crosstab(df['Type'],df['Content Rating'])

In [None]:
df.groupby(df['Type'])['Installs'].mean()

In [None]:
df['Rating'].mean()

In [None]:
df['Rating'].median()

In [None]:
df['Rating'].skew()

##### Rating is negatively skewed, i.e. left-skewed: a long tail of low ratings stretches to the left of the peak.
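
A quick way to see what negative skew looks like is to compare the mean and the median on a small sample. The values below are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up ratings: five high scores and one very low score
ratings = pd.Series([4.5, 4.4, 4.6, 4.3, 4.7, 1.5])

print("mean:  ", ratings.mean())    # pulled down by the single low rating
print("median:", ratings.median())  # barely affected by it
print("skew:  ", ratings.skew())    # negative for a left-skewed sample
```

In this sample the low rating drags the mean below the median, and `skew()` returns a negative value.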

In [None]:
df['Rating'].kurt()

##### A kurtosis of 5.12 for the 'Rating' column indicates that the distribution of ratings is leptokurtic: it has a sharper peak and heavier tails than a normal distribution. Note that pandas `kurt()` reports excess kurtosis, so a normal distribution would score 0.

In [None]:
df['Rating'].var()

##### A variance of 0.288 for the 'Rating' column indicates that the ratings stay relatively close to the mean, suggesting low to moderate variability.

In [None]:
df['Rating'].std()

##### The standard deviation of 0.537 for the 'Rating' column means that a typical rating lies about 0.537 points from the mean rating. This relatively small value suggests that the ratings are fairly tightly clustered around the mean, implying low to moderate variability.
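
The variance and the standard deviation reported above are two views of the same quantity, since the standard deviation is the square root of the variance. The check below uses the reported figures plus a small made-up sample.

```python
import math

import pandas as pd

variance = 0.288  # variance reported for the Rating column
std = math.sqrt(variance)
print(round(std, 3))  # 0.537

# The same identity on a small made-up sample
ratings = pd.Series([4.5, 4.4, 4.6, 4.3, 4.7, 1.5])
assert math.isclose(ratings.std() ** 2, ratings.var())
```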

In [None]:
df['Installs'].mean()

In [None]:
df['Installs'].median()

In [None]:
df['Installs'].max()

In [None]:
df['Installs'].min()

In [None]:
df['Installs'].mode()

In [None]:
df['Installs'].var()

In [None]:
df.describe()

### What all manipulations have you done and insights you found?

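##### I converted Reviews, Size, Installs, and Price to numeric types, split Last Updated into day, month, and year, and removed the malformed row and the duplicate apps. I then applied various statistical measures to understand the dataset and looked up the highest-rated apps in selected genres and categories.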

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
plt.figure(figsize=(4,4))
sns.kdeplot(df['Rating'],fill=True,color='#985ab8')

##### 1. Why did you pick the specific chart?

To understand the distribution of the Rating column.

##### 2. What is/are the insight(s) found from the chart?

The Rating data is left-skewed: most ratings sit toward the high end of the scale.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Most app ratings fall in the 4 to 5 range, which is a good sign from a business perspective.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
sns.kdeplot(df['Size'],fill=True,color='#985ab8')

##### 1. Why did you pick the specific chart?

To understand the distribution of app sizes on the Play Store.

##### 2. What is/are the insight(s) found from the chart?

In the plot, most app sizes fall in the 20,000 to 40,000 KB range.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
sns.kdeplot(data=df,x='Price',fill=True,color='#985ab8')

###### Most apps are priced at 0, i.e. they are free.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
sns.kdeplot(df['Month'],fill=True,color='#985ab8')

###### Most apps were last updated in calendar months 4 to 8 (April to August).
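
Because Month only takes the integer values 1 to 12, a plain count per month can be easier to read than a smoothed density. The sketch below uses made-up dates rather than the real `Last Updated` column.

```python
import pandas as pd

# Made-up update dates
dates = pd.to_datetime(pd.Series(["2018-07-01", "2018-07-15", "2018-05-03", "2017-12-24"]))

# Number of updates per calendar month, ordered from January to December
per_month = dates.dt.month.value_counts().sort_index()
print(per_month)
```

On the real data, `df['Month'].value_counts().sort_index().plot(kind='bar')` would draw the same counts as a bar chart.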

#### Chart - 5

In [None]:
# Chart - 5 visualization code
sns.kdeplot(df['Year'],fill=True,color='#985ab8')

### Most of the apps were last updated in 2018.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
df['Category'].value_counts().plot(kind='bar', color='#296acc')

##### 1. Why did you pick the specific chart?

A bar chart is chosen for visualizing the value counts of the Category column because it effectively represents the frequency distribution of categorical data.

##### 2. What is/are the insight(s) found from the chart?

The bar chart of the Play Store app categories reveals that the FAMILY category dominates with 1,832 entries, followed by GAME with 959 and TOOLS with 827, highlighting a significant focus on family-oriented, gaming, and utility apps.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The high number of apps in categories like FAMILY, GAME, and TOOLS suggests significant market saturation. New entrants in these categories may face intense competition, making it harder to gain visibility and attract users.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
sns.countplot(data=df,x='Type',palette="Set2")
plt.show()

##### 1. Why did you pick the specific chart?

A count plot was chosen to visualize the distribution of app types (free vs. paid) because it shows the number of records in each category of a categorical variable.

##### 2. What is/are the insight(s) found from the chart?

Most of the apps are free.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
df['Content Rating'].value_counts().plot(kind='bar',color='#ba49a3')

#### The Content Rating distribution chart demonstrates that the majority of Play Store apps are designed for a general audience (Everyone), with a significant number targeting teens and a smaller portion for mature audiences.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
selected_genres = ['Tools','Entertainment','Education','Business','Lifestyle', 'Action']

# Filter the DataFrame to include only the selected genres
filtered_df = df[df['Genres'].isin(selected_genres)]

# Plotting the bar chart for the selected genres
filtered_df['Genres'].value_counts().plot(kind='bar', color='#9a4bdb')

plt.xlabel('Genres')
plt.ylabel('Count')
plt.show()

##### The distribution of the six selected genres illustrates a broad spectrum of user interests on the Play Store. Among them, Tools, Entertainment, and Education have the most apps, indicating strong supply in utility, leisure, and learning.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
df_cat_installs = df.groupby(['Category'])['Installs'].sum().sort_values(ascending = False).reset_index()
df_cat_installs.Installs = df_cat_installs.Installs/1000000000 #converting into billions
df2 = df_cat_installs.head(10)
# plt.figure(figsize = (8,9))
ax = sns.barplot(data = df2,x ='Installs',y = 'Category',palette='tab10')
ax.set_xlabel('No. of Installations in Billions')
ax.set_title("Most Popular Categories in Play Store", size = 20)
plt.show()

#### Game is the most popular category by total number of installs.

#### Chart - 11

In [None]:
# Chart - 11 visualization code
sns.scatterplot(data=df, x='Rating', y='Reviews', color='#ba49a3')

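##### Apps with the largest review counts tend to have higher ratings. The scatter plot shows an association only; it does not show that a higher rating causes more reviews.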

#### Chart - 12

In [None]:
# Chart - 12 visualization code
sns.boxplot(data=df,x='Price')

##### A box plot is used to find outliers. The largest price is 400, and the outliers lie on the right side.
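
Seaborn's box plot draws its whiskers at 1.5 times the interquartile range (IQR) by default, so the points it marks as outliers can also be listed directly. The prices below are made up for illustration.

```python
import pandas as pd

# Made-up prices: mostly free or cheap apps plus one very expensive app
prices = pd.Series([0, 0, 0, 0, 0.99, 1.99, 2.99, 4.99, 399.99])

q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are the points a box plot draws individually
outliers = prices[(prices < lower) | (prices > upper)]
print(outliers)
```

One caveat of this rule: when more than three quarters of the values are 0, both quartiles are 0, the fences collapse to 0, and every non-zero price is flagged as an outlier.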

#### Chart - 13

In [None]:
# Chart - 13 visualization code
sns.boxplot(data=df,x='Rating')

#### A box plot is used to find outliers. The highest rating is 5 and the lowest is 1. The outliers lie on the left side, and the median is 4.3.

#### Chart - 14

In [None]:
# Chart - 14 visualization code
sns.boxplot(data=df,x='Year')

##### The outliers lie on the left side. The latest year in the dataset is 2018 and the earliest is 2010.

#### Chart - 15 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
numeric_df = df.select_dtypes(include=['float64', 'int64'])
corr_matrix = numeric_df.corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f')

##### 1. Why did you pick the specific chart?

A heatmap is a two-dimensional visualization that represents the magnitude of each value as a color. It was chosen because it makes the pairwise correlations between all numeric columns easy to compare at a glance.

##### 2. What is/are the insight(s) found from the chart?

Reviews and Installs have a correlation coefficient of about 0.63, i.e. there is a moderately strong positive linear relationship between Reviews and Installs.
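
A Pearson coefficient is not a percentage of dependence. Squaring it gives the share of variance in one variable that a straight-line fit on the other explains. The sketch below applies this to the 0.63 read from the heatmap and then computes a coefficient on made-up data.

```python
import numpy as np

r = 0.63  # coefficient for Reviews and Installs read from the heatmap
print(round(r ** 2, 2))  # 0.4

# Made-up data to show how a coefficient is computed
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
r_xy = np.corrcoef(x, y)[0, 1]
print(round(r_xy, 2))  # 0.8
```

So a coefficient of 0.63 corresponds to roughly 40% of the variance being explained by a linear relationship.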

#### Chart - 16 - Pair Plot

In [None]:
# Pair Plot visualization code
sns.pairplot(df, diag_kind='kde', markers='o', hue=None)

A pair plot shows a scatter plot for every pair of numeric variables and a density plot for each individual variable on the diagonal.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?

*   Improving App Ratings: Enhancing user satisfaction and app quality to achieve higher ratings.
*   Identifying Key Factors Affecting Ratings: Understanding what influences app ratings the most.
*   Increasing App Downloads and User Engagement: Boosting the number of downloads and user engagement by analyzing app features, reviews, and ratings.

1. Improving App Ratings

Solution:

*   Address Outliers: Since the distribution has high kurtosis, indicating more extreme ratings, it's crucial to analyze and address the causes of low ratings (outliers). Common reasons for low ratings often include bugs, crashes, poor user interface, or unmet expectations.
*   Enhance Quality Control: Implement rigorous quality assurance and user testing processes to identify and fix issues before they affect users.
*   Customer Feedback Loop: Actively seek and respond to user feedback. Implement a feedback mechanism within the app to quickly identify and resolve issues that users face.

2. Identifying Key Factors Affecting Ratings

Solution:

*   Analyze Reviews: Perform sentiment analysis on user reviews to identify common themes and issues that impact ratings. Look for frequent mentions of specific problems or praise that correlate with high or low ratings.
*   Feature Analysis: Examine the correlation between app features and ratings. Identify which features users rate highly and which features are commonly associated with lower ratings.
*   Update Frequency: Determine the impact of update frequency on ratings. Regular updates that introduce new features or improvements can positively influence ratings.

3. Increasing App Downloads and User Engagement

Solution:

*   Optimize App Store Listing: Ensure that the app description, screenshots, and promotional videos effectively highlight the app’s features and benefits. Positive ratings and reviews should be prominently displayed.
*   Marketing Strategies: Use targeted marketing strategies to reach potential users who are likely to rate the app highly. Consider using social media, influencer partnerships, and app store advertisements.
*   User Retention Strategies: Implement strategies to keep users engaged, such as loyalty programs, regular updates, and engaging content. Higher engagement often leads to better ratings as users feel more connected to the app.

# **Conclusion**

The Play Store data analysis project aimed to enhance app performance by understanding and leveraging app ratings. The statistical analysis revealed a high kurtosis of 5.12, indicating a leptokurtic distribution with more extreme values, a variance of 0.288, suggesting low to moderate variability, and a standard deviation of 0.537, confirming tight clustering around the mean. To improve app ratings, it is crucial to address outliers by enhancing quality control and establishing a robust user feedback loop. Identifying key factors affecting ratings involves sentiment analysis of user reviews and assessing the impact of specific features and update frequency. Increasing downloads and engagement can be achieved by optimizing the app store listing, targeted marketing, and user retention strategies. Monitoring competitors, making data-driven decisions, and educating users are essential for continuous improvement. This data-driven approach is vital for improving user satisfaction, increasing ratings, and driving more downloads and engagement in the competitive app marketplace.






