# **Project Name**    - **Play Store App Review Analysis**



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Name              - Ashish Anil Sarang**


# **Project Summary -**

Play Store, also branded as the Google Play Store and formerly Android Market, is a digital distribution service operated and developed by Google. Applications are available through Play Store either free of charge or at a cost. They can be downloaded directly on an Android device through the proprietary Play Store mobile app.

We are provided with two datasets: one contains information about the apps, and the other consists of user reviews and their sentiments about the apps. Our goal is to analyze the datasets and visualize the trends and relations between app features. There are many questions an app developer could come across while developing an app, and our study will help in answering those questions. Our analysis is divided into three phases: understanding the data, data preparation and data visualization.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


Given the vast repository of data available from the Play Store, the challenge is to formulate a comprehensive analysis that identifies key factors influencing app engagement and success. Specifically, this entails exploring app attributes such as category, rating, and size, as well as analyzing user reviews to gain insights into consumer preferences and sentiments. The objective is to uncover actionable insights that app developers can leverage to optimize their strategies, enhance app performance, and effectively capture the Android market.








#### **Define Your Business Objective?**

The wealth of data available from the Play Store presents a promising opportunity for app developers to propel their businesses towards success. By mining this data, developers can extract actionable insights crucial for capturing the Android market. Each app entry is rich with information such as category, rating, and size, providing a comprehensive view of the app landscape. Complementing this dataset is another containing customer reviews, offering invaluable feedback on app performance. Through thorough exploration and analysis of these datasets, key factors driving app engagement and success can be uncovered, empowering developers to make informed decisions and optimize their strategies for maximum impact in the competitive app market.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that for each and every chart the following format is answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime

import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
#Loading Dataset and storing them as a Pandas dataframe
apps_df = pd.read_csv("/content/drive/MyDrive/Alma Better EDA files/Play Store Data.csv")
reviews_df = pd.read_csv("/content/drive/MyDrive/Alma Better EDA files/User Reviews.csv")

In [None]:
#copying the dataframe
data_apps = apps_df.copy()

### Dataset First View

In [None]:
#Let us look at first 20 rows
data_apps.head(n = 20)


### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
data_apps.shape

### Dataset Information

In [None]:
# Dataset Info
data_apps.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
len(data_apps[data_apps['App'].duplicated()])

In [None]:
# dropping duplicates (the first row for each app name is kept)
data_apps.drop_duplicates("App", inplace=True)
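
By default `drop_duplicates` keeps the first occurrence of each app name and discards the later ones. A minimal sketch on a made-up frame (the app names and review counts are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy frame with one repeated app name (illustrative values only)
toy = pd.DataFrame({
    "App": ["Alpha", "Beta", "Alpha"],
    "Reviews": [10, 5, 12],
})

# Count the repeated names, then keep only the first row for each app
n_duplicates = toy["App"].duplicated().sum()
deduped = toy.drop_duplicates("App")
```

Here the second "Alpha" row is dropped even though it has more reviews, so the order of the rows decides which duplicate survives.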

#### Missing Values/Null Values

In [None]:
# checking the columns which have missing values
columns_with_missing_values = data_apps.columns[data_apps.isnull().any()]
data_apps[columns_with_missing_values].isnull().sum()

In [None]:
# Visualizing the missing values
# Count the missing values per column instead of typing them in by hand
missing_counts = data_apps.isnull().sum()
missing_counts = missing_counts[missing_counts > 0]

# Creating a bar plot with Seaborn
plt.figure(figsize=(10, 6))  # Adjust the figure size if needed
sns.barplot(x=missing_counts.index, y=missing_counts.values)

# Adding labels and title
plt.xlabel('Columns')
plt.ylabel('Number of Missing Values')
plt.title('Count of Missing Values in Columns')

# Rotating x-axis labels for better readability
plt.xticks(rotation=45)

# Showing the plot
plt.show()





Dataset contains many Null or missing values.

*   [ 'Rating' ] --> contains 1463 missing values.
*   [ 'Type' ] --> contains 1 missing value.
*   [ 'Content Rating' ] --> contains 1 missing value.
*   [ 'Current Ver' ] --> contains 8 missing values.
*   [ 'Android Ver' ] --> contains 3 missing values.



Creating a function to gather more information about these attributes of dataset

In [None]:
def printinfo():
    df = pd.DataFrame(index=data_apps.columns)
    df['data_type'] = data_apps.dtypes
    df['null_count'] = data_apps.isnull().sum()
    df['unique_count'] = data_apps.nunique()
    return df

In [None]:
printinfo()

Insights about data :

1.   We have missing value counts of all the attributes
2.   We have unique value counts of all the attributes
3.   We have datatypes of all the attributes






### What did you know about your dataset?

Understanding the dataset is crucial for conducting effective analysis. It involves familiarizing oneself with the structure, contents, and characteristics of the data to make informed decisions and perform meaningful operations.

# **Data Cleaning**

**Missing Value**

In [None]:
# looking at missing values in column Content Rating
data_apps[data_apps['Content Rating'].isna()]


We can clearly see that row 10472 has a missing value under the column Content Rating. Hence we drop this entire row to eliminate the missing value.

In [None]:
#dropping the row containing missing value
data_apps.dropna(subset = ['Content Rating'], inplace=True)

**Missing Values**

In [None]:
#checking rows containing missing values under ['Type']
data_apps[data_apps['Type'].isnull()]

In [None]:

#lets check unique value counts
data_apps['Type'].value_counts()

So ['Type'] column has 2 unique values :

*   Free
*   Paid



In [None]:
#lets find most frequent value/ mode
data_apps['Type'].mode()
print(f"The mode/ most frequent value is '{data_apps['Type'].mode()[0]}'")

The mode/ most frequent value is 'Free'

So we have only one row with a missing value under the column 'Type'. After checking the price of that app, I found that its price is '0'. Since most of the apps are also free, we can fill the missing value with 'Free'.

In [None]:
# filling missing data in Type column
data_apps['Type'] = data_apps['Type'].fillna('Free')
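
The reasoning above (price is '0', so the app is free) can also be written as a condition, so that only zero-priced rows are filled. A minimal sketch on a made-up frame, assuming Price is still stored as text at this stage:

```python
import numpy as np
import pandas as pd

# Toy frame: one row has no Type but a price of '0' (illustrative values)
toy = pd.DataFrame({
    "Type": ["Free", "Paid", np.nan],
    "Price": ["0", "$4.99", "0"],
})

# Only fill a missing Type with 'Free' when the listed price is zero
missing_and_free = toy["Type"].isna() & (toy["Price"] == "0")
toy.loc[missing_and_free, "Type"] = "Free"
```

A missing Type on a row with a non-zero price would be left untouched and could then be inspected by hand.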


**Missing Values**

In [None]:
#checking rows with missing values
data_apps[data_apps['Rating'].isnull()]

There are 1463 values under 'Rating' which are missing. We cannot fill these values manually because they are very unpredictable.

So let's look at the distribution of this column:

In [None]:
# plot distribution of rating (histogram with a KDE curve)
sns.histplot(data_apps['Rating'], kde=True)

The Rating column shows a negatively skewed distribution, as most of the values are concentrated towards the right side of the plot. In skewed distributions, the median is the best measure of central tendency because it is unaffected by extreme outliers or non-symmetric distributions of scores.

We can replace the missing values with the median value of the column.

In [None]:
# calculating median value
median_value = data_apps['Rating'].median()

In [None]:
# replacing the missing values with median value
data_apps['Rating'] = data_apps['Rating'].fillna(median_value)
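
To illustrate why the median is preferred over the mean for a left-skewed column, here is a small sketch on made-up ratings (the values are illustrative, not drawn from the dataset):

```python
import pandas as pd

# Made-up ratings: mostly high, a few very low values, two missing
ratings = pd.Series([4.5, 4.4, 4.6, 4.3, 4.5, 1.0, 1.5, None, None])

mean_value = ratings.mean()      # pulled down by the low outliers
median_value = ratings.median()  # stays near the bulk of the data

# Assign the result back instead of filling in place
filled = ratings.fillna(median_value)
```

The two low ratings drag the mean well below the typical value, while the median stays at 4.4, which is why it is the safer fill value here.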

Some columns are unnecessary and will not be of much use in the analysis.

So we drop them.

In [None]:
#dropping unnecessary columns
data_apps.drop(['Current Ver' , 'Android Ver'], axis= 1, inplace= True)

So, after eliminating all the missing and duplicate values and dropping all the unnecessary columns, our dataframe looks like:

In [None]:
printinfo()

All the columns have a *null_count* of *0*. So the dataset no longer contains any missing values and is cleaned.

# **Transforming Data**

As you can see from the above dataframe, columns like Reviews, Size, Installs and Price should have an int or float datatype, so let's convert them to their respective correct types.

Let's start with changing Reviews column from object to integer.

In [None]:
# changing data type of column Reviews
data_apps['Reviews'] = data_apps.Reviews.astype(int)

To convert the Installs column from object to integer type:

*   First, we need to remove the + symbol from these values,
*   then remove the ',' symbol from the numbers.



In [None]:
#removing special characters
data_apps['Installs'] = data_apps['Installs'].apply(lambda x: x.strip('+'))

In [None]:
#removing special characters
data_apps['Installs'] = data_apps['Installs'].apply(lambda x: x.replace(',', ''))

In [None]:
#Now we can convert [' Installs' ] column from string type to integer type
data_apps['Installs'] = data_apps['Installs'].astype(int)
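
The same three steps can be collapsed into one vectorized expression. A minimal sketch on a few made-up install strings in the format used by the Installs column:

```python
import pandas as pd

# Sample values in the same text format as the Installs column
installs = pd.Series(["10,000+", "500+", "1,000,000+", "0"])

# Strip every '+' and ',' in one vectorized step, then convert to integers
installs_clean = installs.str.replace(r"[+,]", "", regex=True).astype(int)
```

Using the `.str` accessor avoids a Python-level `lambda` per row, which tends to be faster and easier to read on larger frames.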

We need to remove the '$' symbol from ['Price'] and convert the column from object to float.

In [None]:
# removing $ and changing the type of Price column
data_apps['Price'] = data_apps['Price'].apply(lambda x : x.strip('$'))
data_apps['Price'] = data_apps['Price'].astype(float)

Next, let us look at the unique values of the ['Size'] column.

In [None]:
#checking unique values
data_apps['Size'].unique()

The Size column uses the suffixes M and k, which denote megabytes (MB) and kilobytes (kB).

*   We drop the M suffix by replacing it with '000',
*   so that all size values are expressed in kilobytes.


In [None]:
#Dropping the M symbol by replacing with the value '000':
data_apps['Size'] = data_apps['Size'].str.replace("M","000")

Replacing the k with "":

In [None]:
#Replacing the k with "":
data_apps['Size'] = data_apps['Size'].str.replace("k","")

Some apps' sizes vary with device and we cannot predict their exact value, so it is better to drop the rows that have 'Varies with device' under the 'Size' column.

In [None]:
data_apps = data_apps[data_apps['Size'] != 'Varies with device']

In [None]:
# changing the type of Size column
data_apps['Size'] = data_apps['Size'].astype(float)
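
One caveat with plain text replacement: a decimal value such as '8.7M' becomes '8.7000', which parses as 8.7 rather than 8700. A minimal sketch of a numeric conversion that handles decimals; `size_to_kb` is a hypothetical helper, not part of the original notebook, and it assumes 1 MB = 1000 kB to match the '000' convention used above:

```python
import numpy as np
import pandas as pd

def size_to_kb(size):
    """Hypothetical helper: convert '19M', '8.7M' or '201k' to kilobytes."""
    if size.endswith("M"):
        return float(size[:-1]) * 1000
    if size.endswith("k"):
        return float(size[:-1])
    return np.nan  # e.g. 'Varies with device'

sizes = pd.Series(["19M", "8.7M", "201k", "Varies with device"])
sizes_kb = sizes.apply(size_to_kb)

# What plain text replacement yields for a decimal megabyte value
naive = float("8.7M".replace("M", "000"))
```

Returning `NaN` for 'Varies with device' keeps those rows available for the other columns, and they can still be dropped with `dropna(subset=['Size'])` if needed.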

Now looking at the Last Updated column, it contains the date on which the app was last updated. It is of object type, so we have to convert it to datetime format.

In [None]:
# converting 'Last Updated' dtype from object to datetime
# the dates are written like 'January 7, 2018', i.e. '%B %d, %Y'
data_apps['Last Updated'] = pd.to_datetime(data_apps['Last Updated'], format='%B %d, %Y')

# Final Look at the dataset

In [None]:
# resetting the index
data_apps.reset_index(drop= True, inplace= True)

In [None]:
#new shape
data_apps.shape

In [None]:
#basic info
data_apps.info()

In [None]:
#glimpse of dataset's first 5 rows
data_apps.head()

In [None]:
#glimpse of dataset's last 5 rows
data_apps.tail()

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
data_apps.columns

In [None]:
# Dataset Describe
data_apps.describe()

### Variables Description


*   After cleaning the dataset we have 8432 apps
*   Shape of the dataset is (8432,11) , that means 8432 rows and 11 columns
*   There are no missing values in this dataset.
*   Most of the apps are free of cost and very few are paid. The maximum price for an app is $400.
*   The average rating for an app is 4.18, the maximum rating is 5.00 and the minimum rating is 1.00.
*   The average app size is 18372 kB. The minimum recorded size is 1 kB, while the maximum is 100000 kB, i.e. 100 MB.

*   Some apps have been installed more than 1 billion times, while others have not been installed even once.
*   Apps receive about 120,655 reviews on average (1.206553e+05), while some apps have no reviews at all.





## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

# Distribution of Paid and Free Apps

In [None]:
# Calculate the distribution of app types (Free vs. Paid)
data = data_apps['Type'].value_counts()

# Labels for the pie chart, taken from the counts so they stay aligned with the slices
labels = data.index

# Create a pie chart
plt.figure(figsize=(8,8))
colors = ["#00EE76","#7B8895"]
explode=(0.01,0.1)
plt.pie(data, labels = labels, colors = colors, autopct='%.2f%%', explode=explode, textprops={'fontsize': 15})
plt.title('Distribution of Paid and Free Apps', size=15, loc='center')
plt.legend()

# Show the pie chart
plt.show()


##### 1. Why did you pick the specific chart?

The pie chart is effective for displaying the distribution of categorical data, such as the proportion of Free and Paid apps, in a visually appealing manner.

##### 2. What is/are the insight(s) found from the chart?

The chart illustrates that Free apps make up a significant majority (92.2%) of the dataset, while Paid apps constitute a smaller portion (7.8%).
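
The percentages quoted above can also be computed directly rather than read off the chart. A minimal sketch on a made-up Type column (nine free apps and one paid app, purely illustrative):

```python
import pandas as pd

# Made-up Type column: 9 free apps and 1 paid app
types = pd.Series(["Free"] * 9 + ["Paid"])

# normalize=True returns fractions; multiply by 100 for percentages
share = types.value_counts(normalize=True) * 100
```

On the real data, `data_apps['Type'].value_counts(normalize=True) * 100` would give the shares shown in the pie chart.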

##### 3. Will the gained insights help creating a positive business impact?

Are there any insights that lead to negative growth? Justify with specific reason

The insight that most apps are free can help businesses understand the prevailing market trend towards free apps. For developers planning monetization strategies, this insight suggests that offering free apps with in-app purchases or ads might align better with user expectations.


*   Negative Trend - However, if a business model heavily relies on paid apps, this insight could indicate a challenging market landscape in terms of competition and user preferences for free offerings.




#### Chart - 2

# Number of Apps per Category

This chart shows the number of apps in each category, helping to identify which categories have the most apps.

In [None]:
# get the number of apps for each category

# Calculate the number of apps for each category
app_count_per_category = data_apps['Category'].value_counts().sort_values(ascending=False)

# Set the style for the plot
sns.set_style('darkgrid')

# Create a figure and set its size
plt.figure(figsize=(10, 5))

# Use a count plot to visualize the number of apps per category
sns.countplot(x='Category', data=data_apps, order=app_count_per_category.index, palette="tab10")

# Set the title and labels for the plot
plt.title('Number of Apps Per Category')
plt.xticks(rotation=90)
plt.ylabel('Number of Apps')

# Show the plot
plt.show()


##### 1. Why did you pick the specific chart?

I chose a count plot because it effectively shows the frequency of apps within each category, providing a clear overview of app distribution across different categories.








##### 2. What is/are the insight(s) found from the chart?




The chart reveals that, out of 33 categories, 'Family', 'Game', and 'Tools' have the highest number of apps, indicating that these categories are popular among app developers.

The EVENTS and BEAUTY categories have the fewest apps.




##### 3. Will the gained insights help creating a positive business impact?

Are there any insights that lead to negative growth? Justify with specific reason


These insights can be valuable for business planning and strategy. Knowing that 'Family', 'Game', and 'Tools' are popular categories, businesses can focus their development efforts on these categories to target a larger user base. However, there are no insights in this chart that directly suggest negative growth.

#### Chart - 3

# Distribution of apps over ratings

In [None]:
# Kernel density estimate (KDE) plot of ratings
plt.figure(figsize=(12,6))
sns.set_style('whitegrid')
plt.xlabel("Rating")
plt.ylabel("Density")
graph = sns.kdeplot(data_apps.Rating, color="Blue", fill=True)
plt.title('Distribution of Rating',size = 20)

##### 1. Why did you pick the specific chart?

A KDE plot gives a smooth estimate of the probability density of a single numerical variable, here Rating, so the shape, peak and skew of the distribution are easy to read without depending on a choice of histogram bins.

##### 2. What is/are the insight(s) found from the chart?

*   Distribution of ratings is negatively skewed.
*   We see that the majority of apps lie between 4.0 and 4.7 rating.
*   Most of the ratings are above 3.5, which means most of the apps on Play Store are liked by the users.





##### 3. Will the gained insights help creating a positive business impact?

Are there any insights that lead to negative growth? Justify with specific reason

Yes, the insight that many apps have high ratings suggests that users are satisfied with a substantial portion of the apps available. This positive sentiment can contribute to a positive business impact by encouraging more downloads, better user reviews, and potentially increased revenue from app purchases or advertisements. However, it's essential to further analyze other factors like user reviews, app functionality, and competition to make informed business decisions.

#### Chart - 4

# Different Distributions in User Review Data





In [None]:
# Plotting the three charts
plt.figure(figsize=(15, 5))

# Distribution of Sentiment in User Reviews
plt.subplot(1, 3, 1)
sns.countplot(x='Sentiment', data=reviews_df)
plt.title('Distribution of Sentiment in User Reviews')
plt.xlabel('Sentiment')
plt.ylabel('Count')

# Distribution of Sentiment Polarity in User Reviews
plt.subplot(1, 3, 2)
sns.histplot(reviews_df['Sentiment_Polarity'], bins=30, kde=True)
plt.title('Distribution of Sentiment Polarity in User Reviews')
plt.xlabel('Sentiment Polarity')
plt.ylabel('Frequency')

# Distribution of Sentiment Subjectivity in User Reviews
plt.subplot(1, 3, 3)
sns.histplot(reviews_df['Sentiment_Subjectivity'], bins=30, kde=True)
plt.title('Distribution of Sentiment Subjectivity in User Reviews')
plt.xlabel('Sentiment Subjectivity')
plt.ylabel('Frequency')

plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

I chose to combine the countplot for sentiment distribution, and histograms for sentiment polarity and subjectivity to provide a comprehensive view of user sentiments in the reviews data. This combination allows us to understand both the distribution of sentiment categories and the detailed distribution of sentiment scores.








##### 2. What is/are the insight(s) found from the chart?


*   The countplot shows that the majority of user reviews have a positive sentiment, followed by negative and neutral sentiments. This indicates that overall, users are more inclined to express positive opinions about the apps.
*   The histogram of sentiment polarity reveals that the sentiment scores are concentrated around the range of -0.2 to 0.5, suggesting a mix of slightly negative to positive sentiments in the reviews.

*   The histogram of sentiment subjectivity indicates that most reviews exhibit a moderate level of subjectivity, falling within the range of 0.4 to 0.6.





##### 3. Will the gained insights help creating a positive business impact?

Are there any insights that lead to negative growth? Justify with specific reason

These insights can help businesses understand the overall sentiment distribution in user reviews, identify trends in sentiment polarity and subjectivity, and make informed decisions regarding product improvements or marketing strategies based on customer feedback.

Analyzing sentiment polarity helps identify areas for improvement or strengths, while negative sentiments highlight potential issues needing attention.




#### Chart - 5

In [None]:
# counting values
content_rating_counts = data_apps['Content Rating'].value_counts()

# storing the categories and their counts as lists
x1_axis = content_rating_counts.index.tolist()
y1_axis = content_rating_counts.tolist()

In [None]:
# Calculate the distribution of app content ratings
data = data_apps['Content Rating'].value_counts()
# Take the labels from the counts so each slice keeps its own name
labels = data.index

# Create a pie chart for content ratings with adjusted settings
plt.figure(figsize=(6, 6))
explode = (0, 0.1, 0.1, 0.1, 0.0, 1.3)[:len(data)]
colors = ['C4', 'r', 'c', 'g', 'm', 'k'][:len(data)]
plt.pie(data, labels=labels, colors=colors, autopct='%.2f%%', explode=explode, textprops={'fontsize': 10})
plt.title('Content Rating Distribution', size=20, loc='center')
plt.legend()

# Show the pie chart
plt.show()

##### 1. Why did you pick the specific chart?

The pie chart effectively represents the distribution of categorical data, such as different content ratings, making it suitable for showcasing the proportions of each content rating category.

##### 2. What is/are the insight(s) found from the chart?

A majority of the apps (82%) in the Play Store can be used by everyone. The remaining apps have various age restrictions. The Teen category has the second highest percentage (10.7%).


##### 3. Will the gained insights help creating a positive business impact?

Are there any insights that lead to negative growth? Justify with specific reason.

These insights can inform decision-making related to content creation, marketing strategies, and audience targeting based on content rating preferences, potentially leading to a positive impact on user engagement and app success.

#### Chart - 6

# What are the Top 10 installed apps in any category?

In [None]:

# function to plot the 10 most-installed apps in a given category
def Top10_inst_app_in_any_cat(category):
    # category names are stored in upper case, e.g. 'SPORTS'
    category = category.upper()
    top10 = data_apps[data_apps['Category'] == category]
    top10apps = top10.sort_values(by='Installs', ascending=False).head(10)
    plt.figure(figsize=(15,7))
    plt.title('Top 10 Installed Apps',size = 20);
    graph = sns.barplot(x = top10apps['App'], y = top10apps['Installs'])
    graph.set_xticklabels(graph.get_xticklabels(), rotation= 45, horizontalalignment='right');

In [None]:
Top10_inst_app_in_any_cat('Sports')
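
The selection step inside the function can be tested separately from the plotting. A minimal sketch on a made-up frame; `top_installed` is a hypothetical helper used only for illustration:

```python
import pandas as pd

# Toy frame with illustrative apps and install counts
toy = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Category": ["SPORTS", "SPORTS", "GAME", "SPORTS"],
    "Installs": [100, 5000, 900, 50],
})

def top_installed(df, category, n=10):
    """Hypothetical helper: the n most-installed apps in one category."""
    in_category = df[df["Category"] == category.upper()]
    return in_category.nlargest(n, "Installs")[["App", "Installs"]]

top_sports = top_installed(toy, "Sports", n=2)
```

Keeping the filtering in its own function makes it easy to reuse the same table for a bar plot or for a printed summary.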

##### 1. Why did you pick the specific chart?

A bar plot offers a straightforward and intuitive way to compare the number of installs across the top apps in a category, here Sports, making it a suitable choice for this analysis.

##### 2. What is/are the insight(s) found from the chart?


*   From the above graph, we can see that in the Sports category 8 Ball Pool, 3D Bowling, Dream League Soccer 2018 and FIFA Soccer have the highest installs.
*   In the same way, by passing different category names to the function, we can get the top 10 installed apps in any category.



#### Chart - 7

# Does rating change with increasing price?

In [None]:
# setting plot size and background
plt.figure(figsize=(10,6))
sns.set_style("ticks")

# scatterplot Price v/s Rating
sns.regplot(x="Price", y="Rating", data=data_apps,
            line_kws={"color":"r", "alpha":0.7, "lw":5})

# set title
plt.title("Influence of Price on Rating");


##### 1. Why did you pick the specific chart?

A regression plot overlays a fitted line on the scatter of Price against Rating, which makes any linear trend between the two variables easy to see.








##### 2. What is/are the insight(s) found from the chart?


*   There is a negative relation between price and ratings.
*   Looking at the plot, as price increases, ratings decrease.



#### Chart - 8

# Does the size of an app influence the number of downloads?

In [None]:
# scatter plot Size v/s Installs
fig, ax = plt.subplots(figsize=(10,6))

sns.scatterplot(x="Size", y="Installs", data=data_apps)
plt.title("Size v/s Installs")

In [None]:
# distribution of installs (histogram with a KDE curve)
sns.histplot(data_apps['Installs'], kde=True)

Installs ranges from 0 to more than 1,000,000,000, while Size never exceeds 100,000 kB (100 MB). Installs is therefore several orders of magnitude larger than Size.

We can observe that Installs is highly skewed, which makes it difficult to spot any correlation between Size and Installs in the plot above. To compare the two on a sensible scale, we need a feature transformation.

The Log Transform is one of the most popular Transformation techniques out there. It is primarily used to convert a skewed distribution to a normal distribution/less-skewed distribution. In this transform, we take the log of the values in a column and use these values as the column instead.

This transformation reduces the impact of both too-low as well as too-high values.

In [None]:
# Log transformation of column Installs
# log1p computes log(1 + x), so apps with 0 installs map to 0 instead of -inf
data_apps["log_installs"] = np.log1p(data_apps["Installs"])
data_apps.head(2)
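
To see the effect of the transform on skew, here is a small sketch on made-up install counts. It uses `np.log1p`, i.e. log(1 + x), because the plain logarithm of 0 is negative infinity and some apps have zero installs:

```python
import numpy as np
import pandas as pd

# Made-up install counts spanning many orders of magnitude
installs = pd.Series([0, 10, 1_000, 1_000_000, 1_000_000_000])

# log1p maps 0 installs to 0 and compresses the very large values
log_installs = np.log1p(installs)

# Sample skewness before and after the transform
raw_skew = installs.skew()
log_skew = log_installs.skew()
```

The raw counts are dominated by the single largest value, while the transformed values are spread far more evenly, which is what makes the scatter plot against Size readable.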

In [None]:
# scatter plot Size v/s log_installs
plt.figure(figsize=(10,6))
sns.scatterplot(x="Size", y="log_installs", data=data_apps)

# set title
plt.title("Size v/s log_installs");


##### 1. Why did you pick the specific chart?

Using a scatter plot for reviewing size versus installs offers a comprehensive and intuitive way to analyze the relationship between these two variables, providing valuable insights into user behavior and app performance.

##### 2. What is/are the insight(s) found from the chart?


*   Apps with a Size between 0 and 20 are installed more, as observed from above.
*   Looking at the plot, large apps are installed less than small apps.
*   Users prefer apps that require less space and load faster. We can conclude that app size may influence the number of installations of the app.




#### Chart - 9

# Are app updates important?

In [None]:
# getting year of update
data_apps['Update_Year'] = pd.DatetimeIndex(data_apps['Last Updated']).year
data_apps.head(2)

In [None]:

# stripplot Update_year v/s log_installs
plt.figure(figsize=(12,8))

sns.stripplot(data =data_apps, x="Update_Year", y="log_installs",
              jitter=0.3, size=5, linewidth=.2)

plt.title("Update_year v/s log_installs");

##### 1. Why did you pick the specific chart?

Strip plot for reviewing update year versus log installs offers a comprehensive and intuitive way to analyze the relationship between these two variables, providing valuable insights into user engagement and app performance over time.

##### 2. What is/are the insight(s) found from the chart?


*   We can see from the plot above that most apps were last updated in the most recent years, and those apps also tend to have more installs.
*   Very few apps were last updated in 2010, 2011 or 2012.
*   This suggests that developers who keep improving their app over time have a better chance of success.






#### Chart - 10

# Is sentiment_subjectivity proportional to sentiment_polarity?

In [None]:
# Create a scatter plot for sentiment analysis
plt.figure(figsize=(15, 10))
sns.scatterplot(x = reviews_df['Sentiment_Subjectivity'],y =  reviews_df['Sentiment_Polarity'],
                hue=reviews_df['Sentiment'], edgecolor='white', palette='inferno')
plt.title("Google Play Store Reviews Sentiment Analysis", fontsize=20)
plt.xlabel("Sentiment Subjectivity")
plt.ylabel("Sentiment Polarity")
plt.legend(title="Sentiment")
plt.show()



##### 1. Why did you pick the specific chart?

A scatter plot is suitable for showing the relationship between two numerical variables, such as sentiment subjectivity and sentiment polarity, and allows us to observe patterns or correlations.

##### 2. What is/are the insight(s) found from the chart?

The scatter plot shows that while sentiment subjectivity and sentiment polarity generally have a proportional relationship in many cases, there are instances where this relationship is not strictly proportional, indicating varied sentiments within reviews.
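
One way to put a number on this relationship is the Pearson correlation. The sketch below uses made-up scores, not the review data, and also checks the absolute polarity, on the assumption that subjective reviews may be strongly positive or strongly negative:

```python
import pandas as pd

# Made-up sentiment scores (illustrative only)
reviews = pd.DataFrame({
    "Sentiment_Subjectivity": [0.0, 0.2, 0.5, 0.6, 0.9, 1.0],
    "Sentiment_Polarity": [0.0, 0.1, -0.4, 0.5, -0.8, 1.0],
})

# Correlation between subjectivity and signed polarity
corr = reviews["Sentiment_Subjectivity"].corr(reviews["Sentiment_Polarity"])

# Correlation between subjectivity and the strength of polarity
abs_polarity = reviews["Sentiment_Polarity"].abs()
corr_abs = reviews["Sentiment_Subjectivity"].corr(abs_polarity)
```

In this toy example the signed correlation is weak while the correlation with absolute polarity is strong. Running the same two lines on `reviews_df` (after dropping missing values) would show whether the real reviews behave the same way.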

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


It provides a visual representation of how sentiment subjectivity and polarity are distributed across reviews, helping to identify trends or anomalies in sentiment analysis and understand the overall sentiment patterns in the data.

#### Chart - 11

We can represent the ratings in a better way if we group the ratings between certain intervals. Here, we can group the rating as follows:

*   4-5: Top rated
*   3-4: Above average
*   2-3: Average
*   1-2: Below average

Lets create a new column 'Rating group' in the main dataframe and apply these filters.

In [None]:
# Defining a function Rating_app to group the ratings as mentioned above
def Rating_app(val):
  '''
  This function categorises a rating from 1 to 5
  as Top rated, Above Average, Average or Below Average.
  Boundary values (4, 3 and 2) fall into the higher group.
  '''
  if val>=4:
    return 'Top rated'
  elif val>=3:
    return 'Above Average'
  elif val>=2:
    return 'Average'
  else:
    return 'Below Average'

Let's apply the Rating_app function to the Rating column and save the output in a new column named Rating_group in the main dataframe.

In [None]:
# Applying the Rating_app function
data_apps['Rating_group']=data_apps['Rating'].apply(Rating_app)
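
The same grouping can be expressed with `pd.cut`, which makes the interval boundaries explicit. A minimal sketch on made-up ratings, assuming each boundary value (2, 3, 4) belongs to the higher group:

```python
import pandas as pd

# Made-up ratings, including values that sit exactly on a boundary
ratings = pd.Series([1.0, 2.0, 2.9, 3.0, 3.9, 4.0, 5.0])

# right=False gives left-closed bins: [1, 2), [2, 3), [3, 4), [4, 5.01)
groups = pd.cut(
    ratings,
    bins=[1, 2, 3, 4, 5.01],
    labels=["Below Average", "Average", "Above Average", "Top rated"],
    right=False,
)
```

The upper edge is set slightly above 5 so that a rating of exactly 5.0 still falls into the last bin. The result is an ordered categorical, so bar charts keep the groups in a sensible order.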

In [None]:
# Average app ratings
data_apps['Rating_group'].value_counts().plot.bar(figsize=(15,5), color = 'royalblue')
plt.xlabel('Rating Group')
plt.ylabel('Number of apps')
plt.title('Average app ratings')
plt.xticks(rotation=0)
plt.legend()


##### 1. Why did you pick the specific chart?

The specific chart, a bar plot, is chosen to visualize the distribution of app ratings across different rating groups, providing a clear comparison of the number of apps falling into each rating category.

##### 2. What is/are the insight(s) found from the chart?

The chart reveals that a significant number (7000+) of apps fall into the 'Top rated' category, indicating a considerable proportion of highly rated apps in the dataset. The distribution across the other categories ('Above Average' (2000), 'Average', 'Below Average') provides insights into the overall quality and diversity of app ratings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Yes, the insights gained from this chart can help businesses understand the distribution of app ratings and identify areas of strength (Top rated apps) as well as areas for potential improvement (Average and Below Average apps).

#### Chart - 12

# Exploring App Ratings by Size and Category

In [None]:

# Set the seaborn style to whitegrid and remove the grid from axes
sns.set_style("whitegrid", {'axes.grid' : False})

# Create an lmplot to visualize the relationship between Rating and Size across different categories
sns.lmplot(y='Rating', x='Size', data=data_apps, col="Category", hue="Category", col_wrap=4,
           line_kws={'color': 'red'}, height=4, aspect=1.2)
plt.suptitle('Exploring App Ratings by Size and Category', y=1.02)

# Display the plot
plt.show()


##### 1. Why did you pick the specific chart?

This lmplot visualizes how the Rating of apps varies with their Size, segmented by different app categories.

Each subplot represents one app category, and the red regression line in that subplot shows the fitted linear trend between Size and Rating for that category, allowing for category-specific insights.

##### 2. What is/are the insight(s) found from the chart?

The chart reveals varying trends in the relationship between app size and ratings across different app categories. For instance, in some categories, larger apps may have higher ratings, while in others, smaller apps might be favored by users. This insight can guide decisions on app development and marketing strategies tailored to each category's user preferences.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Yes, the insights derived from this chart can have a positive business impact by informing decisions related to app design, functionality, and marketing. By understanding how app size influences user ratings within specific categories, businesses can optimize their apps to better meet user expectations, leading to improved user satisfaction and potentially higher app ratings, which in turn can drive downloads, user retention, and overall business success.
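
The per-category trends in the lmplot can be summarized numerically with the correlation between Size and Rating inside each category. The block below is a sketch on a toy frame, not the real `data_apps` data; the category names and values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for data_apps (illustrative values only, not the real data)
toy_apps = pd.DataFrame({
    "Category": ["GAME"] * 4 + ["TOOLS"] * 4,
    "Size":     [10.0, 20.0, 30.0, 40.0, 5.0, 10.0, 15.0, 20.0],
    "Rating":   [3.8, 4.0, 4.2, 4.4, 4.5, 4.3, 4.1, 3.9],
})

# Pearson correlation between Size and Rating within each category
size_rating_corr = pd.Series(
    {
        category: group["Size"].corr(group["Rating"])
        for category, group in toy_apps.groupby("Category")
    },
    name="size_rating_corr",
).sort_values()

print(size_rating_corr)
```

A value near zero means the regression line in that subplot is essentially flat, so sorting this series is a quick way to find the categories where size appears to matter most.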

#### Chart - 13

# Percentage of Review Sentiments

In [None]:
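# Get the count of each sentiment category (sorted from most to least frequent)
sentiment_counts = reviews_df['Sentiment'].value_counts()
counts = list(sentiment_counts)

# Build the labels from the same index so that each label always matches its count
labels = [f'{sentiment} Reviews' for sentiment in sentiment_counts.index]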

# Create the pie chart with exploded segments, shadow, and percentage labels
plt.figure(figsize=(8, 8))
plt.pie(counts, labels=labels, explode=[0.01, 0.05, 0.05], shadow=True, autopct="%.2f%%")

# Set the title and turn off axis labels
plt.title('Percentage of Review Sentiments')
plt.axis('off')

# Add a legend outside the plot area to the right
plt.legend(bbox_to_anchor=(1, 0.8), loc='center left')

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

A pie chart is suitable for displaying the distribution of categories (positive, negative, neutral) in a visually intuitive way, making it easy to grasp the proportions of each sentiment category.

##### 2. What is/are the insight(s) found from the chart?

From this visualization, it's evident that a significant majority of reviews are positive, indicating a generally favorable sentiment towards the apps.

Negative and neutral reviews make up much smaller shares, at about 21% and 14.6% respectively.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


This visualization helps in understanding the overall sentiment polarity of the reviews. The higher percentage of positive reviews suggests that users are generally satisfied with the apps, while the presence of negative and neutral reviews indicates areas where improvements or attention might be needed.
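
The percentages shown on the pie chart can be checked directly as a table. This is a small sketch on toy labels standing in for `reviews_df['Sentiment']`; the counts are invented for illustration.

```python
import pandas as pd

# Toy stand-in for reviews_df['Sentiment'] (illustrative values only)
sentiments = pd.Series(
    ["Positive"] * 6 + ["Negative"] * 3 + ["Neutral"] * 1, name="Sentiment"
)

# Share of each sentiment in percent; missing sentiments are excluded by default
sentiment_share = sentiments.value_counts(normalize=True).mul(100).round(2)
print(sentiment_share)
```

Because `value_counts` drops missing values unless `dropna=False` is passed, the shares are computed over labelled reviews only.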

**Apps with the highest number of positive reviews**

In [None]:

# positive reviews
positive_rev_df=reviews_df[reviews_df['Sentiment']=='Positive']
# Count positive reviews per app directly so that the axis labels are plain app names
positive_rev_df['App'].value_counts().nlargest(10).plot.barh(figsize=(6,4),color='seagreen').invert_yaxis()
plt.title("Top 10 positive review apps")
plt.xlabel('Total number of positive reviews')
plt.legend()

**Apps with the highest number of negative reviews**

In [None]:
# negative reviews
negative_rev_df=reviews_df[reviews_df['Sentiment']=='Negative']
# Count negative reviews per app directly so that the axis labels are plain app names
negative_rev_df['App'].value_counts().nlargest(10).plot.barh(figsize=(6,4),color='crimson').invert_yaxis()
plt.title("Top 10 negative review apps")
plt.xlabel('Total number of negative reviews')
plt.legend()
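
Raw counts favor apps that simply have many reviews, so an app can appear in both top-10 lists. One possible complement, sketched below on toy data, is to rank apps by the share of their reviews that are negative. The `min_reviews` threshold is a hypothetical parameter, not something taken from the analysis above.

```python
import pandas as pd

# Toy stand-in for reviews_df (illustrative values only)
toy_reviews = pd.DataFrame({
    "App": ["A"] * 10 + ["B"] * 4 + ["C"] * 1,
    "Sentiment": ["Positive"] * 8 + ["Negative"] * 2
                 + ["Positive"] * 1 + ["Negative"] * 3
                 + ["Negative"] * 1,
})

# Rows: apps, columns: sentiments, values: share of that app's reviews
sentiment_share = pd.crosstab(
    toy_reviews["App"], toy_reviews["Sentiment"], normalize="index"
)

# Keep only apps with enough reviews for the share to be meaningful
min_reviews = 3  # hypothetical threshold
review_counts = toy_reviews["App"].value_counts()
enough_reviews = review_counts.reindex(sentiment_share.index) >= min_reviews
eligible = sentiment_share.loc[enough_reviews]

# Apps ranked by the fraction of their reviews that are negative
ranked = eligible["Negative"].sort_values(ascending=False)
print(ranked)
```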


#### Chart - 14 - Correlation Heatmap

In [None]:
# Exclude non-numeric columns from correlation calculation
numeric_data_apps = data_apps.select_dtypes(include=['number'])

# Calculate the correlation matrix for numeric columns
print(numeric_data_apps.corr())


In [None]:
fig, axes = plt.subplots(figsize=(8, 8))
sns.heatmap(numeric_data_apps.corr(), ax=axes, annot=True, linewidths=0.1, fmt='.2f', square=True)
plt.show()


##### 1. Why did you pick the specific chart?

I chose a heatmap of the correlation matrix because it visually represents the correlation between numerical variables in a clear and concise manner. The color gradients help in quickly identifying the strength and direction of the relationships between variables.

##### 2. What is/are the insight(s) found from the chart?


*   There is a strong positive correlation between the Reviews and Installs columns. This is expected: the more installs an app has, the larger its user base and the more reviews its users leave.

*   Price is slightly negatively correlated with Rating, Reviews, and Installs. As the price of an app increases, its average rating, number of reviews, and installs tend to fall slightly.

*   Rating is slightly positively correlated with Installs and Reviews. Apps with a higher average user rating tend to have slightly more installs and reviews.
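
The heatmap uses pandas' default Pearson correlation, which measures linear association and can be dominated by a few very large values in skewed columns such as Installs and Reviews. As a hedged cross-check, a rank-based (Spearman) correlation can be computed with the same `corr` method. The sketch below uses toy numbers, not the real data.

```python
import pandas as pd

# Toy, heavily skewed data (illustrative values only)
toy = pd.DataFrame({
    "Installs": [10, 100, 1_000, 10_000, 1_000_000],
    "Reviews":  [1, 50, 60, 70, 80],
})

# Pearson: linear association, sensitive to extreme values
pearson = toy.corr(method="pearson")

# Spearman: association between ranks, so any monotonic relationship scores highly
spearman = toy.corr(method="spearman")

print(pearson.loc["Installs", "Reviews"], spearman.loc["Installs", "Reviews"])
```

If the two methods disagree noticeably on the real columns, that is a hint that the relationship is monotonic but not linear, or that outliers are driving the Pearson value.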






#### Chart - 15 - Pair Plot

In [None]:
# Select the relevant columns from data_apps
pairwise_df = data_apps[['Rating', 'Size', 'Installs', 'Reviews', 'Price', 'Type']].copy()

# Log-transform the heavily skewed Installs and Reviews columns (base 10 for both, so the axes are comparable)
pairwise_df['Installs'] = np.log10(pairwise_df['Installs'])
pairwise_df['Reviews'] = np.log10(pairwise_df['Reviews'])

# Create a pairplot with hue based on 'Type' column
p = sns.pairplot(pairwise_df, hue='Type')

# Set title for the pairplot
p.fig.suptitle("Pairwise Plot - Rating, Size, Installs, Reviews, Price", x=0.5, y=1.0, fontsize=12)

# Show the pairplot
plt.show()


##### 1. Why did you pick the specific chart?

A pair plot is used to find which pairs of features best explain a relationship between two variables, or which pairs form the most clearly separated clusters.








##### 2. What is/are the insight(s) found from the chart?


*   The pair plot shows every pairwise combination of the quantitative variables, which makes it possible to look for evident patterns or relationships between the features.
*   It also helps in forming simple classification models, since it shows where straight lines could separate the groups in the dataset (here, the two values of Type).
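
One caveat for the log-transformed axes in the pair plot: a plain logarithm maps a count of 0 to negative infinity. If the data contains apps with zero installs or zero reviews (an assumption, not checked here), `np.log1p` is a common alternative. The sketch below uses toy values; note that `log1p` uses the natural logarithm rather than base 10.

```python
import numpy as np
import pandas as pd

# Toy install counts including a zero (illustrative values only)
installs = pd.Series([0, 10, 1_000, 1_000_000], name="Installs")

# A plain log maps 0 to -inf (NumPy would also emit a divide-by-zero warning)
with np.errstate(divide="ignore"):
    plain_log = np.log10(installs)

# log1p computes log(1 + x), so 0 stays 0 and every result is finite
safe_log = np.log1p(installs)

print(pd.DataFrame({"log10": plain_log, "log1p": safe_log}))
```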



## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

# Business Solution for App Performance and User Satisfaction


1.   **Category Optimization for Maximum Impact**
*   Focus resources and marketing efforts on top-performing categories like Communication and Social, which exhibit high user engagement, positive sentiment, and significant installs.
*   Tailor app features, updates, and promotional campaigns to resonate with user preferences within these dominant categories, ensuring maximum impact and user satisfaction.



2.   **Review and Ratings Enhancement**
*   Implement in-app prompts and incentives for satisfied users to leave reviews, while promptly addressing negative feedback to demonstrate responsiveness and commitment to user satisfaction.



3.   **Pricing Strategy Refinement**
*   Optimize pricing strategies based on the slight negative correlation (-0.09) between app prices and ratings/reviews.
*   Conduct A/B testing and introduce flexible pricing tiers or promotional offers to attract price-sensitive users without compromising perceived value, thereby enhancing user acquisition and retention.



4.   **Sentiment-Driven Feature Enhancements**
*   Prioritize feature enhancements based on sentiment analysis insights to address user satisfaction and pain points effectively.
*   Capitalize on positive sentiment themes to strengthen app features that resonate with users, while addressing negative sentiments to improve overall user experience and mitigate churn.


5.   **Marketing Messaging Alignment**
*   Align marketing messaging with sentiment trends to reinforce positive perceptions and attract new users.
*   Ensure consistency between marketing campaigns and user sentiments, maintaining authenticity and trust to drive user engagement and loyalty.



6.   **Continuous Monitoring and Agile Iterations**
*   Implement a robust feedback loop and agile development approach to continuously monitor user feedback, sentiment trends, and app performance metrics.
*   Iterate and adapt app features, marketing strategies, and pricing models based on real-time insights, ensuring ongoing improvement, relevance, and competitiveness in the Google Play Store ecosystem.



# **Conclusion**

In conclusion, the analysis underscores the importance of understanding user preferences, focusing on app quality, and aligning with market trends to succeed in the competitive landscape of the Google Play Store. By leveraging these insights, developers and businesses can make informed decisions to optimize app performance, enhance user satisfaction, and drive sustainable growth in the Android app market.


