# **Project Name**    - Play Store App Review Analysis



##### **Project Type**    - EDA
##### **Contribution**    - Individual


# **Project Summary -**

The Play Store apps dataset offers significant potential for driving success in the app development industry. By analyzing this data, developers can derive actionable insights to optimize their strategies and effectively capture the Android market.

Each app in the dataset includes various attributes, such as category, rating, size, and more. Additionally, a separate dataset contains customer reviews of these Android apps. Through comprehensive exploration and analysis, this study aims to identify the key factors contributing to app engagement and success.


In this project, I employed an Exploratory Data Analysis (EDA) approach to analyze the data, utilizing visual techniques to uncover trends and patterns. This process involved examining statistical summaries and graphical representations to validate assumptions and gain deeper insights into the data.

# **GitHub Link -**

https://github.com/AshwiniSuryakar09/Play-Store-App-Review-Analysis-

# **Problem Statement**




The objective of this analysis is to understand the key factors influencing app success on the Google Play Store. This includes identifying the most popular app categories, determining whether the majority of apps are free or paid, assessing the importance of app ratings, and exploring how these ratings are affected by whether an app is paid or free. Additionally, the analysis aims to uncover which app categories have the highest number of installations, how app counts vary across different genres, and the correlation between user reviews and app ratings. By gaining insights into these areas, developers and marketers can make data-driven decisions to optimize app development and marketing strategies, thereby enhancing user engagement and maximizing revenue potential on the platform.

#### **Define Your Business Objective?**

The Play Store apps data has enormous potential to drive app-making businesses to success. Actionable insights can be drawn for developers to work on and capture the Android market.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure each chart answers the following format.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

### Dataset Loading

In [None]:
# mount google drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
data = pd.read_csv('Play Store Data.csv')
user = pd.read_csv('User Reviews.csv')

### Dataset First View

In [None]:
# Dataset First Look for playstore
data.head()

In [None]:
data.tail()

In [None]:
# Dataset First Look for user
user.head()

In [None]:
user.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count for playstore
data.shape

In [None]:
# dataset rows and column for user
user.shape

### Dataset Information

In [None]:
# Dataset Info for playstore
data.info()

In [None]:
# Dataset Info for user
user.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count for playstore
data.duplicated().sum()

In [None]:
# Dataset Duplicate Value Count For User
user.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count for playstore data
data.isnull().sum()

In [None]:
# Visualizing the missing values for playstore data
sns.heatmap(data.isnull(),cbar= False)

In [None]:
# Missing values /Null Values Count for User reviews
user.isnull().sum()

In [None]:
# Visualizing the missing values for user
sns.heatmap(user.isnull(),cbar = False)

### What did you know about your dataset?

There are two datasets:

**1. data dataset (Play Store apps)**:-
   * 13 feature columns and 10841 observations (rows).
   * 483 duplicate rows.
   * Some columns contain missing (null) values.

**2. user dataset (user reviews)**:-

   * 5 feature columns and 64295 observations (rows).
   * 33616 duplicate rows.
   * Some columns contain missing (null) values.
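
The counts above can be gathered in one place with a small helper. This is a minimal sketch: `summarize_frame` is a hypothetical name, and the toy frame only stands in for the real `data` and `user` frames.

```python
import numpy as np
import pandas as pd


def summarize_frame(df: pd.DataFrame) -> dict:
    """Return row/column counts, duplicate rows and total missing cells."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "duplicate_rows": int(df.duplicated().sum()),
        "missing_cells": int(df.isnull().sum().sum()),
    }


# Toy frame with illustrative values only (not taken from the real dataset)
toy = pd.DataFrame({
    "App": ["A", "A", "B", "C"],
    "Rating": [4.1, 4.1, np.nan, 3.9],
})
summary = summarize_frame(toy)
print(summary)
```

In the notebook, the same helper could be called as `summarize_frame(data)` and `summarize_frame(user)`.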
   

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns for playstore
data.columns

In [None]:
# dataset columns for user
user.columns

In [None]:
# Dataset Describe for playstore data
data.describe()

In [None]:
# Dataset describe for user
user.describe()

### Variables Description

**A. Variable Description for playstore dataset** :

1.App- The name of the application

2.Category - The primary category to which the app belongs

3.Rating - The overall rating of the app

4.Reviews - The total number of reviews for application

5.Size - The Size of the application

6.Installs - The total number of times app installed

7.Type - Whether the app is free or paid

8.Price - The price of the app (0 for free apps)

9.Content Rating - The age group app is targeted

10.Genres - Additional categories or genres the app belongs to apart
from main category

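11.Last Updated - The date on which the app was last updated

12.Current Ver - The current version of the app

13.Android Ver - The minimum Android version the app requires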

**B. Variable Description for user reviews dataset :**

1.App - The name of the mobile application being reviewed

2.Translated_Review - The user review of the app, which has been preprocessed

3.Sentiment - The sentiment of the user review, categorized as Positive,
Negative or Neutral

4.Sentiment_Polarity - A numeric score representing the sentiment polarity of the review,ranging from -1 to +1

5.Sentiment_Subjectivity - A numeric score indicating the subjectivity of the review ranging from 0 (objective) to 1(subjective)

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable for play store
for i in data.columns.tolist():
  print("No. of Unique values in",i,"is",data[i].nunique())

In [None]:
# Check Unique values for each variable for user reviews
for i in user.columns.tolist():
  print("No.of unique values in",i,"is",user[i].nunique())

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
## convert reviews type as integer type
# Strip an 'M' (millions) suffix such as '3.0M' and scale only those rows
reviews = data['Reviews'].astype(str).str.strip()
in_millions = reviews.str.endswith('M')
reviews = reviews.str.rstrip('M').astype(float)
reviews = reviews.where(~in_millions, reviews * 1_000_000)
data['Reviews'] = reviews.astype(int)

In [None]:
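#drop $ sign from price column
data['Price'] = data['Price'].astype(str).str.replace('$', '', regex=False)  # literal '$', not the regex end-of-string anchor
data['Price'] = data['Price'].replace(r'^\s*$', np.nan, regex=True)  # blank strings become NaN
data['Price'] = pd.to_numeric(data['Price'], errors='coerce')  # any remaining non-numeric value becomes NaN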

In [None]:
#in installs column we remove the ',' and '+' signs and convert it into integer data type
data['Installs'] = data['Installs'].str.replace(',', '', regex=False)
data['Installs'] = data['Installs'].str.replace('+', '', regex=False)  # literal '+'; a bare '+' is not a valid regex
data['Installs'] = data['Installs'].str.replace('Free', '0', regex=False) # Replace "Free" with '0'
data = data.astype({"Installs": int})

In [None]:
#In size column apps size in mb and kb
data["Size"].unique()

In [None]:
#convert all app sizes to MB and drop some signs before converting to float data type
#note: 'Varies with device' is replaced with a placeholder value of 20 MB
data['Size'] = data['Size'].apply(lambda x: str(x).replace('Varies with device', '20') if 'Varies with device' in str(x) else x)
data['Size'] = data['Size'].apply(lambda x: str(x).replace('M', '') if 'M' in str(x) else x)
data['Size'] = data['Size'].apply(lambda x: str(x).replace(',', '') if ',' in str(x) else x) # Replace ',' with ''
data['Size'] = data['Size'].apply(lambda x: float(str(x).replace('k', '')) / 1000 if 'k' in str(x) else x)
data['Size'] = data['Size'].apply(lambda x: float(str(x).replace('+', '')) if '+' in str(x) else x) # Remove '+'
data['Size'] = data['Size'].apply(lambda x: float(x))

The app sizes in the dataset are given in both MB and KB. For ease of processing, the entire Size column is converted to MB.
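
The unit conversion described above can also be written as a single function. This is a sketch under stated assumptions: `size_to_mb` is a hypothetical helper, and unlike the notebook (which substitutes 20 for 'Varies with device') it leaves unparseable values as NaN so that they do not influence later statistics.

```python
import numpy as np
import pandas as pd


def size_to_mb(value):
    """Convert strings such as '19M' or '201k' to megabytes; NaN if unparseable."""
    text = str(value).strip().replace(",", "")
    if text.endswith("M"):
        return float(text[:-1])
    if text.endswith("k"):
        return float(text[:-1]) / 1000  # same kB-to-MB factor as the notebook
    return np.nan  # e.g. 'Varies with device'


# Toy values in the same string formats as the Size column
toy_sizes = pd.Series(["19M", "201k", "Varies with device", "8.7M"])
size_mb = toy_sizes.apply(size_to_mb)
print(size_mb.tolist())
```

Whether to keep NaN or a placeholder is a judgment call; NaN keeps the placeholder from appearing as a real 20 MB app in the size charts.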

In [None]:
#After converted
data["Size"].unique()

In [None]:
#data info after cleaning
data.info()

In [None]:
data.tail()

In [None]:
#unique apps
df=data["App"].nunique()
df

In [None]:
#drop the duplicate apps (copy so that later edits do not touch the original frame)
right_df=data.drop_duplicates(subset='App').copy()
right_df.head()

In [None]:
#null values in rating-1463
right_df.isnull().sum()

The 'Rating' column has 1463 rows with null values. We replace these nulls with the mean of the overall 'Rating' values.

In [None]:
#replace the null value
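# Assign the filled column back; fillna(..., inplace=True) on a selected column may not update the frame
right_df['Rating'] = right_df['Rating'].fillna(4.19)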
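
The fill value in the cell above is hard-coded. As a hedged alternative, the mean can be computed from the column itself so that it stays in sync with the data. The toy frame and the names `toy` and `mean_rating` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing rating (illustrative values only)
toy = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Rating": [4.0, np.nan, 4.5, 3.5],
})

# mean() skips NaN by default, so only observed ratings contribute
mean_rating = round(toy["Rating"].mean(), 2)

# Assign back instead of using inplace=True on a column selection
toy["Rating"] = toy["Rating"].fillna(mean_rating)
print(mean_rating, toy["Rating"].isnull().sum())
```

If the ratings contain extreme values, `median()` is a drop-in replacement for `mean()` that is less sensitive to them.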

In [None]:
#After cleaning
right_df.isnull().sum()

**After the cleaning process we now have a clean dataset.**

In [None]:
#some values count
right_df.describe()

In [None]:
#list of category
number_of_category=right_df["Category"].unique()
print(number_of_category)

In [None]:
right_df.head()

In [None]:
right_df.tail()

**Check Unique Values for each variable**:-

In [None]:
right_df["App"].nunique()

In [None]:
right_df["Category"].nunique()

In [None]:
right_df["Type"].nunique()

In [None]:
right_df["Price"].nunique()

In [None]:
right_df["Price"].max()

In [None]:
right_df["Genres"].nunique()

In [None]:
#data based on rating vs price
price_df=right_df[["App","Price","Type","Rating","Size"]]
price_df.head()

In [None]:
#apps content rating
grouped = right_df[['App','Content Rating']]
grouped

In [None]:
#apps age group
age_grouped= grouped.rename(columns={'Content Rating': 'age_group'})
age_grouped

In [None]:
#android version of apps
# Name the index and the counts explicitly so the result is the same across pandas versions
versions = right_df["Android Ver"].value_counts().rename_axis('Android Ver').reset_index(name='count')
versions

In [None]:
#total apps based on category
df=right_df.groupby('Category')['App'].nunique().reset_index(name="Total Apps")
df=df.sort_values(by=['Total Apps'],ascending=False)
df.head()

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
#pie chart on price of apps
plt.figure(figsize=(8,6))
plt.title("Apps on price",fontsize = 14)
# Calculate the number of unique values in 'Type' column
num_slices = len(price_df.Type.value_counts())
# Create an explode tuple with the same length as the number of slices
explode = (0.1,) + (0,) * (num_slices - 1)
plt.pie(price_df.Type.value_counts(), labels=price_df.Type.value_counts().index,autopct='%1.1f%%',startangle=90,explode=explode)
plt.show()

##### 1. Why did you pick the specific chart?

Pie charts show the parts-to-whole relationship

Pie charts are often used in business. Examples include showing percentages of types of customers, percentage of revenue from different products, and profits from different countries.

##### 2. What is/are the insight(s) found from the chart?


From the plot we can infer that the majority of the apps in the Play Store are free.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The pie chart shows that most of the listed apps are free (approx. 92%), while paid apps make up only approx. 7%. Because free alternatives dominate, paid apps may see fewer installations.
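
The percentages behind a pie chart like this can be checked numerically. A minimal sketch on toy data (the series below is illustrative and does not reproduce the real shares):

```python
import pandas as pd

# Toy 'Type' column: three free apps and one paid app
toy_types = pd.Series(["Free", "Free", "Free", "Paid"], name="Type")

# normalize=True returns fractions instead of counts
share = toy_types.value_counts(normalize=True).mul(100).round(1)
print(share)
```

In the notebook, the same expression could be applied to `price_df['Type']`.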

#### Chart - 2

In [None]:
#age group on apps
plt.figure(figsize=(10,6))
plt.title("Apps per age group",fontsize = 16)
plt.pie(age_grouped.age_group.value_counts(), labels=age_grouped.age_group.value_counts().index,autopct='%1.1f%%',startangle=180,explode=(0.1, 0,0,0,1,2.5))
plt.show()

##### 1. Why did you pick the specific chart?

Pie charts show the parts-to-whole relationship

So here we display the percentage of Apps with different age group restrictions

##### 2. What is/are the insight(s) found from the chart?


From the above plot, we can see that the Everyone category has the highest number of apps.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The pie chart shows that most apps have no age restriction, which means they can reach more users; age-restricted apps have a smaller potential user base.

#### Chart - 3 : Bar Plot

In [None]:
#bar graph on apps category
plt.figure(figsize=(20,6))
plt.title("Total Apps on category",fontsize=16)
sns.barplot(data=df, x="Category", y="Total Apps", hue="Category", palette="rainbow", legend=False)
plt.xticks(rotation= 90)
plt.xlabel('Categories',fontsize=16)
plt.ylabel('Total Apps',fontsize=16)
plt.show()

##### 1. Why did you pick the specific chart?

Bar charts should be used when we are showing segments of information.

So, here I want to show the segment of apps category that's why I chose a bar graph.

##### 2. What is/are the insight(s) found from the chart?

From this plot we can see that most of the apps in the Play Store belong to the 'Family', 'Game' and 'Tools' categories.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The graph shows that most apps belong to the Family, Game and Tools categories, while categories such as Beauty and Comics have far fewer apps on the Play Store. Listing an app in a less crowded category such as Beauty or Comics may therefore mean less competition.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
#top apps by category
category= df.sort_values(by='Total Apps',ascending = False)
top_five_category= category.head()
least_five_category =category.tail()

In [None]:
#top category
top_five_category.reset_index()

In [None]:
least_five_category = least_five_category.drop(least_five_category[least_five_category['Category'] == '1.9'].index)

In [None]:
least_five_category['Category'] = least_five_category['Category'].replace('1.9', 'Unknown')

In [None]:
least_five_category.reset_index()

In [None]:
#bar graph of top five category
fig = plt.figure(figsize=(7,5))
sns.barplot(data=top_five_category,x= "Category" ,y= "Total Apps",hue="Category", palette="rainbow", legend=False)
plt.title("Top five categories of listed apps",fontsize= 16)
plt.xlabel("Category",fontsize=16)
plt.ylabel("Total Apps",fontsize=16)
plt.show()

##### 1. Why did you pick the specific chart?

Top Five categories on play store where most of the apps are listed.

1-Family

2-Game

3-Tools

4-Business

5-Medical

#### Chart - 5 : Bar Plot

In [None]:
# Chart - 5 visualization code
least_five_category.reset_index()

In [None]:
# Bar PLot
figure= plt.figure(figsize=(7,5))
sns.barplot(data=least_five_category,x="Category",y="Total Apps",hue ="Category",palette="rainbow",legend= False)
plt.title("Least Five categories of apps",fontsize = 16)
plt.xlabel("Category",fontsize = 16)
plt.ylabel("Total Apps",fontsize= 16)
plt.show()

##### 1. Why did you pick the specific chart?

The categories on the Play Store where the fewest apps are listed.

1-Beauty

2-Comics

3-Parenting

4-Events

#### Chart - 6 : Bar Plot

In [None]:
# Chart - 6 visualization code
#number of installs based on category
categories = right_df.groupby('Category')["Installs"].sum().reset_index()
category_installs_sum_df = categories.sort_values(by='Installs',ascending= False)
category_installs_sum_df

In [None]:
# Bar plot
figure = plt.figure(figsize=(10,6))
sns.barplot(data=category_installs_sum_df,x ="Category",y="Installs",hue="Installs",palette="rainbow",legend= False)
plt.xticks(rotation=90)
plt.title("Total Installs based on category",fontsize=16)
plt.xlabel("Category",fontsize=16)
plt.ylabel("Installs",fontsize=16)
plt.show()

##### 1. Why did you pick the specific chart?

Bar charts are used when we are showing segments of information.

##### 2. What is/are the insight(s) found from the chart?

From this plot of the number of installs for each category, we can see that most of the apps being downloaded and installed are from the 'Game' and 'Communication' categories.

Comparing the last two plots, the largest number of apps in the Google Play Store fall under the Family, Game and Tools categories, but installations tell a different story: the most-installed apps fall under Game, Communication and Tools.
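
The contrast between "most listed" and "most installed" can be made explicit by ranking each category on both measures side by side. This is a sketch on toy data; the frame, its values and the column names `total_apps`, `total_installs`, `apps_rank` and `installs_rank` are assumptions for illustration.

```python
import pandas as pd

# Toy data: FAMILY has the most apps, COMMUNICATION the most installs
toy = pd.DataFrame({
    "App": ["a", "b", "c", "d", "e", "f"],
    "Category": ["FAMILY", "FAMILY", "FAMILY", "GAME", "GAME", "COMMUNICATION"],
    "Installs": [1_000, 5_000, 100, 1_000_000, 500_000, 10_000_000],
})

# One row per category with both measures
per_category = toy.groupby("Category").agg(
    total_apps=("App", "nunique"),
    total_installs=("Installs", "sum"),
)

# Rank 1 = largest value on each measure
per_category["apps_rank"] = (
    per_category["total_apps"].rank(ascending=False, method="min").astype(int)
)
per_category["installs_rank"] = (
    per_category["total_installs"].rank(ascending=False, method="min").astype(int)
)
print(per_category.sort_values("installs_rank"))
```

Categories whose two ranks differ widely are the ones where app supply and user demand diverge.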

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

From the above visualization, listing a game on the Play Store looks promising, whereas categories such as Events and Beauty attract far fewer installs and may lead to slower growth for the app and the business.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
#data frame df1
df1 =right_df[["App","Category","Rating","Reviews","Installs","Price","Type","Genres","Content Rating","Size"]]
df1

In [None]:
# Box plot on average rating
plt.figure(figsize=(18,15))
plt.title("Average rating on Category",fontsize=20)
sns.boxplot(y="Category",x ='Rating',data =df1,hue="Category",palette="rainbow",width=0.5)
plt.xticks(np.arange(0,5.5, 0.5), rotation=45, fontsize=16)
plt.xlim(0.5, 5)
plt.xlabel("Rating",fontsize=20)
plt.ylabel("Category",fontsize=20)
plt.show()

##### 1. Why did you pick the specific chart?

A box plot, also known as a box-and-whisker plot, summarizes a set of data values through its minimum, first quartile, median, third quartile and maximum.

##### 2. What is/are the insight(s) found from the chart?

From this plot, most of the apps in the Play Store have a rating higher than 4, typically in the range of 4 to 4.5.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The box plot shows that most apps have ratings between 4 and 4.5. The highest- and lowest-rated categories are Parenting and Video Players respectively.
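
The quartiles that a box plot draws can also be read off as numbers, which makes comparisons between categories easier to state precisely. A minimal sketch on toy ratings (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy ratings for two categories
toy = pd.DataFrame({
    "Category": ["GAME"] * 4 + ["TOOLS"] * 4,
    "Rating": [4.0, 4.2, 4.4, 4.6, 3.0, 3.5, 4.0, 4.5],
})

# First quartile, median and third quartile per category;
# unstack() turns the quantile level into columns 0.25, 0.5 and 0.75
quartiles = (
    toy.groupby("Category")["Rating"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
print(quartiles)
```

Sorting `quartiles` by the `0.5` column would list categories from lowest to highest median rating.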

#### Chart - 8

In [None]:
# Average app ratings
data['Rating'].value_counts().plot.bar(figsize=(20,8),color='m')
plt.xlabel('Average rating',fontsize = 15 )
plt.ylabel('Number of apps', fontsize = 15)
plt.title('Average rating of apps in Playstore', fontsize = 20)
plt.legend()

##### 1. Why did you pick the specific chart?

The bar chart is chosen to display the distribution of app ratings across different rating values. It is effective for visualizing how many apps fall into each rating category, making it easy to identify trends and distributions in app ratings.

##### 2. What is/are the insight(s) found from the chart?

The chart shows the distribution of app ratings, indicating how many apps fall into each rating category. It reveals the overall quality of apps on the Play Store, with more apps in higher rating ranges suggesting better consumer satisfaction and quality.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, insights from the chart can guide strategies for improving app quality and competitive positioning, but there are risks associated with high competition and quality gaps.

#### Chart - 9 : Paid Apps

In [None]:
# Creating a df containing only paid apps
paid_df=data[data['Type']=='Paid']
# Number of apps that can be installed at a particular price
paid_df.groupby('Price')['App'].count().sort_values(ascending= False).plot.bar(figsize = (20,6), color = 'crimson')


##### 1. Why did you pick the specific chart?

The bar chart is chosen to visualize the distribution of the number of apps available at different price points. A bar chart is effective here because it clearly shows the frequency of apps across various price ranges, allowing you to easily compare how many apps are available at each price level.

##### 2. What is/are the insight(s) found from the chart?

Overall, the chart helps understand the distribution of app prices and can guide decisions related to pricing and market positioning.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights help set competitive prices, target market segments, and align with consumer preferences.

#### Chart - 10 : Size Of Applications

In [None]:
# Function to group the apps based on its size in MB
def size_apps(var):
  try:
    if var < 1:
      return 'Below 1'
    elif var >= 1 and var <10:
      return '1-10'
    elif var >= 10 and var <20:
      return '10-20'
    elif var >= 20 and var <30:
      return '20-30'
    elif var >= 30 and var <40:
      return '30-40'
    elif var >= 40 and var <50:
      return '40-50'
    elif var >= 50 and var <60:
      return '50-60'
    elif var >= 60 and var <70:
      return '60-70'
    elif var >= 70 and var <80:
      return '70-80'
    elif var >= 80 and var <90:
      return '80-90'
    else:
      return '90 and above'
  except TypeError:  # non-numeric values cannot be compared and are returned unchanged
    return var

In [None]:
data['size_group']=data['Size'].apply(lambda x : size_apps(x))
data.head()
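
The same grouping can be expressed with `pd.cut`, which avoids the long `if`/`elif` chain. This is a sketch of an alternative, not the notebook's method; `edges`, `labels` and the toy sizes are assumptions chosen to mirror the bins of `size_apps`.

```python
import numpy as np
import pandas as pd

# Bin edges in MB: [0, 1), [1, 10), [10, 20), ..., [80, 90), [90, inf)
edges = [0, 1] + list(range(10, 100, 10)) + [np.inf]
labels = (
    ["Below 1", "1-10"]
    + [f"{lo}-{lo + 10}" for lo in range(10, 90, 10)]
    + ["90 and above"]
)

# Toy sizes in MB (illustrative values only)
toy_sizes = pd.Series([0.2, 1.0, 9.9, 10.0, 45.3, 90.0, 120.0])

# right=False makes each bin include its lower edge and exclude its upper edge
size_group = pd.cut(toy_sizes, bins=edges, labels=labels, right=False)
print(size_group.tolist())
```

Because the result is an ordered categorical, `value_counts(sort=False)` keeps the groups in size order when plotting.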

In [None]:
# no of apps belonging to each size group
data['size_group'].value_counts().plot.barh(figsize=(20,8),color='r').invert_yaxis()
plt.title("Number of apps in different size groups", size=20)
plt.ylabel('App size in MB', size=15)
plt.xlabel('No of apps', size=15)
plt.legend()

In [None]:
# average number of app installs in each category
data.groupby('size_group')['Installs'].mean().sort_values(ascending= False).plot.barh(figsize=(20,8),color='sandybrown').invert_yaxis()
plt.title("Average number of app installs (In 10 millions)", fontsize=20)
plt.ylabel('App Size In MB', fontsize=15)
plt.xlabel('Average No Of App Installs', fontsize=15)
plt.show()

##### 1. Why did you pick the specific chart?

I used bar plots to show how apps are distributed across size groups, which makes it easy to compare the number of apps in each group and to spot patterns in the data. The second plot shows the average number of installs in each size group.

##### 2. What is/are the insight(s) found from the chart?

The charts show which size groups contain the most and the fewest apps, and which size groups have the highest average number of installs.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, it can help target marketing efforts, guide product development, and inform investment in popular categories.

#### Chart - 11

In [None]:
#scatterplot of app rating against size, coloured by type (free/paid)
plt.figure(figsize=(12, 6))
sns.scatterplot(data=price_df,x=price_df['Size'],y=price_df['Rating'],hue=price_df['Type'], s=115, alpha=0.7)
plt.xlabel("Size(MB)",fontsize=16)
plt.xlim(0,110)
plt.ylabel("Rating(out of 5)",fontsize=16)
plt.ylim(0,6)
plt.title("Apps rating ,size and type",fontsize=16)
plt.legend(title='App Type', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()

##### 1. Why did you pick the specific chart?

Scatter plots place data points on a horizontal and a vertical axis to show how much one variable is affected by another.

##### 2. What is/are the insight(s) found from the chart?

From this scatter plot, we can infer that the majority of free apps are small in size and have a high rating, while paid apps are spread fairly evenly in terms of size and rating.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Most apps are free, small in size and highly rated, while paid apps are spread fairly evenly across size and rating. This suggests that smaller apps may attract more installations and reviews.

#### Chart - 12

In [None]:
#dataset on rating, reviews,installs,size and price
df2=df1[["Rating","Reviews","Installs","Size","Price"]]
df2

In [None]:
#df2 information
df2.info()

In [None]:
#apps based on rating,reviews,installs,size and price
rating_df = df2.groupby('Rating').sum().reset_index()
rating_df.head()

In [None]:
# Chart - 12 visualization code
#graph on size,rating,review etc
# Create a figure with 4 subplots
fig, axes = plt.subplots(1, 4, figsize=(15, 4))

axes[0].plot(rating_df['Rating'], rating_df['Reviews'], 'r')
axes[0].set_xlabel('Rating')
axes[0].set_ylabel('Reviews')
axes[0].set_title('Reviews Per Rating',fontsize=16)

axes[1].plot(rating_df['Rating'], rating_df['Size'], 'g')
axes[1].set_xlabel('Rating')
axes[1].set_ylabel('Size')
axes[1].set_title('Size Per Rating',fontsize=16)

axes[2].plot(rating_df['Rating'], rating_df['Installs'], 'y')
axes[2].set_xlabel('Rating')
axes[2].set_ylabel('Installs (e+10)')
axes[2].set_title('Installs Per Rating',fontsize=16)

axes[3].plot(rating_df['Rating'], rating_df['Price'], 'm')
axes[3].set_xlabel('Rating')
axes[3].set_ylabel('Price')
axes[3].set_title('Price Per Rating',fontsize=16)

plt.tight_layout(pad=2)
plt.show()

##### 1. Why did you pick the specific chart?

A line chart displays the evolution of one or several numeric variables.

##### 2. What is/are the insight(s) found from the chart?

From the above plots, we can infer that apps in the higher rating range of 4.0 - 4.7 have the largest number of reviews, total size, and installs. Price does not show a direct relationship with rating: pricing fluctuates even in the high-rating range.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Apps rated 4.0 - 4.7 account for most of the reviews, size, and installs, while price fluctuates regardless of rating. From this we draw the insight that an app's popularity and rating appear to be more closely tied to its number of reviews, its size, and its number of installs than to its price.

#### Chart - 13 : Android version based on each category

In [None]:
# Collapse the detailed 'Android Ver' strings into major-version buckets
android_buckets = {
    '1.0': ['Varies with device', '1.0', '1.0 and up', '1.5 and up', '1.6 and up'],
    '2.0': ['2.0 and up', '2.0.1 and up', '2.1 and up', '2.2 and up', '2.2 - 7.1.1', '2.3 and up', '2.3.3 and up'],
    '3.0': ['3.0 and up', '3.1 and up', '3.2 and up'],
    '4.0': ['4.0 and up', '4.0.3 and up', '4.0.3 - 7.1.1', '4.1 and up', '4.1 - 7.1.1', '4.2 and up', '4.3 and up', '4.4', '4.4 and up', '4.4W and up'],
    '5.0': ['5.0 - 6.0', '5.0 - 7.1.1', '5.0 - 8.0', '5.0 and up', '5.1 and up'],
    '6.0': ['6.0 and up'],
    '7.0': ['7.0 - 7.1.1', '7.0 and up', '7.1 and up'],
    '8.0': ['8.0 and up'],
}
version_map = {raw: major for major, raws in android_buckets.items() for raw in raws}
# Assign the result back instead of calling replace(..., inplace=True) on a column; missing values go to '1.0'
data['Android Ver'] = data['Android Ver'].replace(version_map).fillna('1.0')

In [None]:
print(data.groupby('Category')['Android Ver'].value_counts())
Type_cat = data.groupby('Category')['Android Ver'].value_counts().unstack().plot.bar(figsize=(25,8), width=2)
plt.xticks()
plt.show()

##### 1. Why did you pick the specific chart?

I used a bar plot to compare the distribution of Android versions across app categories.

##### 2. What is/are the insight(s) found from the chart?

The plot shows that the majority of the apps run on Android version 4.0 and up.



#### Chart - 14 : Correlation Heatmap

In [None]:
#apps based on rating,reviews,installs,size and price
rating_df = df2.groupby('Rating').sum().reset_index()
rating_df

In [None]:
# Chart - 14 visualization code
import matplotlib.pyplot as plt
import seaborn as sns
fig, axes = plt.subplots(figsize=(8, 8))
plt.title("Correlation of variables using a heatmap", fontsize=16)
sns.heatmap(rating_df.corr(), ax=axes, annot=True, linewidths=0.1, fmt='.2f', square=True)
plt.show()

##### 1. Why did you pick the specific chart?

Because it makes patterns easily readable and highlights the differences and variations in the same data.

##### 2. What is/are the insight(s) found from the chart?

The heatmap visualizes the correlation between the variables. Some observations from the map are:

1-The correlation between reviews and price is very low

2-The correlation between reviews and installs is very high

3-Price vs rating and rating vs installs are moderately correlated

4-Price has a very low correlation with both reviews and installs

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

From the above heatmap, installs, size, and reviews are positively correlated with one another, which is a positive signal for business, whereas price has a weak or negative relationship with installs and reviews.
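
The heatmap in this section is computed on `rating_df`, where every column has been summed per rating value. Correlations on such aggregated rows can differ from correlations measured app by app, so it may be worth comparing both. The sketch below uses a toy frame with illustrative values; `per_app_corr` and `aggregated_corr` are hypothetical names.

```python
import pandas as pd

# Toy per-app data (illustrative values only)
toy = pd.DataFrame({
    "Rating": [4.0, 4.0, 4.5, 4.5, 3.0],
    "Reviews": [100, 200, 5_000, 7_000, 10],
    "Installs": [1_000, 5_000, 100_000, 500_000, 100],
    "Price": [0.0, 0.99, 0.0, 0.0, 4.99],
})

# Pearson correlation computed across individual apps
per_app_corr = toy.corr()

# Pearson correlation computed after summing per rating value,
# which mirrors how rating_df is built in the notebook
aggregated_corr = toy.groupby("Rating").sum().reset_index().corr()

print(per_app_corr.round(2))
print(aggregated_corr.round(2))
```

In the notebook, the per-app version would correspond to `df2.corr()`; either matrix can be passed to `sns.heatmap`.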

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
g=sns.pairplot(rating_df)
g.fig.suptitle("Distribution and relation of data ", y=1.00,fontsize=16)
plt.show()

##### 1. Why did you pick the specific chart?

A pair plot helps identify which pairs of features best explain a relationship between two variables or form the most separated clusters. It can also suggest simple classification rules by showing where straight lines would separate the data.

##### 2. What is/are the insight(s) found from the chart?

1.Relation between reviews and price is very low

2.Relation between reviews and installs is very high

3.Price vs rating and rating vs installs are moderately related

4.Price has a very low relation with both reviews and installs

#### Chart - 16 : Scatter plot


In [None]:
plt.figure(figsize=(14,7))
sns.scatterplot(data=user,x='Sentiment_Subjectivity', y='Sentiment_Polarity',hue="Sentiment")
plt.title("Is Sentiment_Subjectivity proportional to Sentiment_Polarity?",fontsize=16)
plt.show()

### 1. Why did you pick the specific chart?

Because Scatter plots are used to plot data points on a horizontal and a vertical axis in an attempt to show how much one variable is affected by another.

##### 2. What is/are the insight(s) found from the chart?

The scatter plot shows that sentiment subjectivity is not always proportional to sentiment polarity, although in most cases the two behave proportionally when the variance is very high or very low.
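
One hedged way to check a claim like this numerically is to compare subjectivity with the magnitude of polarity, since polarity is signed and strongly negative reviews can be just as subjective as strongly positive ones. The sketch below uses a tiny made-up frame with the same two column names as `user`; the values are illustrative and not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the `user` reviews frame; values are invented.
toy_reviews = pd.DataFrame({
    "Sentiment_Polarity":     [1.0, 0.25, -0.8, 0.0, 0.5, -0.3],
    "Sentiment_Subjectivity": [0.9, 0.30, 0.75, 0.0, 0.6, 0.4],
})

def polarity_subjectivity_corr(df: pd.DataFrame) -> dict:
    """Pearson correlation of subjectivity with signed and with absolute polarity."""
    clean = df.dropna(subset=["Sentiment_Polarity", "Sentiment_Subjectivity"])
    subj = clean["Sentiment_Subjectivity"]
    pol = clean["Sentiment_Polarity"]
    return {
        "signed": subj.corr(pol),          # direction-sensitive
        "absolute": subj.corr(pol.abs()),  # strength of opinion only
    }

result = polarity_subjectivity_corr(toy_reviews)
print(result)
```

If the absolute-polarity correlation is clearly higher than the signed one, that would support reading the scatter plot as "more subjective reviews tend to be more strongly polarized in either direction".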

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Sentiment subjectivity is not always proportional to sentiment polarity, but in most cases the two behave proportionally when the variance is very high or very low.

#### Chart - 18 - Pie Chart


In [None]:
# Sentiment percentage
sentiment_counts = user.Sentiment.value_counts()
plt.figure(figsize=(8, 6))
plt.title("Sentiment percentage", fontsize=16)
# explode needs one offset per slice; (0, 0.1, 0) assumes exactly three sentiment classes
plt.pie(sentiment_counts, labels=sentiment_counts.index, autopct='%1.1f%%', startangle=90, explode=(0, 0.1, 0))
plt.show()

##### 1. Why did you pick the specific chart?
A pie chart helps organize and show data as a percentage of a whole.

##### 2. What is/are the insight(s) found from the chart?
The plot shows that the number of positive reviews is far higher than the number of negative or neutral ones.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The plot shows that positive reviews (64%) far outnumber negative (22%) and neutral ones.
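
The percentages read off the pie chart can also be computed directly. This is a minimal sketch on an invented series: the label spellings `Positive`, `Negative`, and `Neutral` are assumptions about the `Sentiment` column, and the counts are made up. It also builds the `explode` offsets from the labels, so the pie call would not depend on there being exactly three classes.

```python
import pandas as pd

# Toy stand-in for user["Sentiment"]; labels are assumed, counts are invented.
sentiments = pd.Series(
    ["Positive"] * 6 + ["Negative"] * 3 + ["Neutral"] * 1, name="Sentiment"
)

def sentiment_shares(series: pd.Series) -> pd.Series:
    """Percentage of each sentiment label, ignoring missing values."""
    return (series.dropna().value_counts(normalize=True) * 100).round(1)

shares = sentiment_shares(sentiments)

# One offset per slice, highlighting the (assumed) "Negative" label.
explode = [0.1 if label == "Negative" else 0 for label in shares.index]
print(shares)
```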

## **5. Solution to Business Objective**

**Free Apps**: Free apps generally have higher installations and ratings in the Android market, suggesting greater user engagement.

**Popular Categories**: Categories like GAME, SOCIAL, COMMUNICATION, and TOOL have the most installs, ratings, and reviews, reflecting current Android user trends.

**App Size**: The median app size on the Play Store is 12 MB, indicating a benchmark for app optimization.
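
A median like this depends on how the raw size strings are converted to numbers. The sketch below shows one possible conversion, assuming the raw column holds values such as `19M`, `512k`, and `Varies with device`, and assuming 1 MB = 1024 kB; the helper `size_to_mb` and the sample values are hypothetical, not the notebook's own code.

```python
import numpy as np
import pandas as pd

def size_to_mb(size) -> float:
    """Convert strings such as '19M' or '512k' to megabytes; NaN if unparseable."""
    text = str(size).strip()
    try:
        if text.endswith("M"):
            return float(text[:-1])
        if text.endswith("k"):
            return float(text[:-1]) / 1024  # assumption: 1 MB = 1024 kB
    except ValueError:
        pass
    return np.nan

# Invented sample values in the assumed raw format.
raw_sizes = pd.Series(["19M", "8.7M", "512k", "Varies with device", "25M"])
sizes_mb = raw_sizes.map(size_to_mb)
median_mb = sizes_mb.median()  # NaN entries are skipped by default
print(median_mb)
```

Entries that cannot be parsed become `NaN` and are left out of the median rather than being counted as zero, which would otherwise pull the benchmark down.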

# **Conclusion**

The Google Play Store Apps report provides valuable insights into current app trends within the Play Store. Based on the visualizations, it is evident that the most popular apps, in terms of user installations, belong to categories such as GAME, COMMUNICATION, and TOOL, despite there being fewer apps available in these categories compared to FAMILY. The popularity of these apps is likely due to their ability to entertain or assist users effectively. Additionally, this trend suggests that developers in these categories prioritize the quality of their apps over quantity.

Moreover, the charts indicate that apps with high user ratings, typically above 4.0, often have a substantial number of reviews and installations. Although there are occasional spikes in app size and price, these outliers do not suggest that highly rated apps are generally large or expensive. Instead, these spikes are likely attributable to a few exceptional cases. Furthermore, the apps with the most reviews predominantly come from categories such as SOCIAL, COMMUNICATION, and GAME, including popular apps like Facebook, WhatsApp Messenger, Instagram, Messenger – Text and Video Chat for Free, and Clash of Clans.

While categories such as GAME, SOCIAL, COMMUNICATION, and TOOL currently dominate in terms of installs, ratings, and reviews, they do not appear among the top five most expensive app categories in the store, which are primarily FINANCE and LIFESTYLE. In conclusion, the current trend in the Android market is largely driven by apps that provide assistance, enable communication, or offer entertainment.

- Percentage of free apps = ~92%
- Percentage of apps with no age restrictions = ~82%
- Most competitive category: Family
- Category with the highest number of installs: Game
- Category with the highest average app installs: Communication
- Percentage of apps that are top rated = ~80%
- The majority of free apps are small in size and have high ratings.
- Paid apps, by contrast, are distributed fairly evenly in terms of size and rating.
- In the user reviews dataset, positive reviews (64%) far outnumber negative (22%) and neutral ones.
- The apps run on Android version (Android_Ver) 4.0 and up.
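
Several of the figures above separate the category with the most apps, the category with the highest total installs, and the category with the highest average installs per app. The sketch below illustrates how those three can differ, using invented rows; the column names `Category`, `Type`, and `Installs` are assumptions about the cleaned apps frame.

```python
import pandas as pd

# Invented rows; column names follow an assumed Play Store schema.
apps = pd.DataFrame({
    "Category": ["GAME", "GAME", "COMMUNICATION", "FAMILY", "FAMILY", "FAMILY"],
    "Type":     ["Free", "Free", "Free", "Free", "Paid", "Free"],
    "Installs": [1_000_000, 5_000_000, 5_000_000, 50_000, 10_000, 100_000],
})

# Share of free apps, as a percentage.
free_pct = (apps["Type"] == "Free").mean() * 100

# Per-category app count, total installs, and mean installs per app.
by_category = apps.groupby("Category")["Installs"].agg(["count", "sum", "mean"])

most_apps = by_category["count"].idxmax()        # most crowded category
most_installs = by_category["sum"].idxmax()      # highest total installs
highest_average = by_category["mean"].idxmax()   # highest installs per app
print(round(free_pct, 1), most_apps, most_installs, highest_average)
```

In this toy data the three answers differ, which mirrors the pattern reported above: a crowded category does not necessarily lead in total installs, and a category with few but very popular apps can lead on the average.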






