<a href="https://colab.research.google.com/github/FG3242/Capstone-Project--1-Play-Store-App-Review-Analysis/blob/main/Team_Project_Play_Store_App_EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - 



##### **Project Type**    - EDA
##### **Contribution**    - Team
##### **Team Member 1 - Ankur Kumar** 
##### **Team Member 2 - Pal Bharti**
##### **Team Member 3 - Amanjeet Kumar Singh**


# **Project Summary -**

The Play Store App EDA (Exploratory Data Analysis) project is an in-depth analysis of the Google Play Store app dataset. The primary objective of this project is to gain insights into the various aspects of apps available on the Play Store and understand the factors that contribute to their popularity and success.

The project begins with data collection, where a comprehensive dataset containing information about different apps, such as their categories, ratings, reviews, sizes, and download counts, is gathered. This dataset serves as the foundation for the subsequent analysis.

During the exploratory data analysis phase, various statistical and visual techniques are employed to understand the distribution, trends, and patterns within the dataset. Key metrics such as the most popular app categories, average ratings, and the relationship between app size and download counts are examined. Moreover, the project aims to uncover any outliers, missing values, or data quality issues that may impact the analysis.

Overall, the Play Store App EDA project aims to provide valuable insights into the Play Store app ecosystem. The findings can be utilized by app developers, marketers, and stakeholders to make informed decisions regarding app development, marketing strategies, and monetization.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


The Play Store App EDA (Exploratory Data Analysis) project addresses the need for comprehensive insights into the Google Play Store app ecosystem. The availability of millions of apps on the Play Store presents app developers, marketers, and stakeholders with numerous challenges when it comes to understanding user preferences, identifying successful app features, and making data-driven decisions.

The problem this project aims to solve is the lack of a comprehensive analysis of the Play Store app dataset. While there is abundant data available, there is a need to extract meaningful insights and patterns from it. The project focuses on addressing the following key questions:

1. App Categorization
2. App Ratings and Reviews
3. App Size and Download Counts
4. App Features and Popularity


By addressing these questions through exploratory data analysis, this project aims to provide actionable insights that can guide app developers, marketers, and stakeholders in making informed decisions. The analysis will enable them to understand user preferences, identify successful app attributes, and develop effective strategies for app development, marketing, and monetization.


#### **Define Your Business Objective?**

The business objective of the Play Store App EDA (Exploratory Data Analysis) project is to leverage the insights gained from the analysis to drive informed decision-making and strategic planning within the app development and marketing ecosystem. The primary goals of this project are as follows:

* App Development Strategy
* Marketing and User Acquisition
* Monetization Strategies
* Competitive Analysis


# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required. 
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits. 
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is followed for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule. 

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns  
import plotly.express as px

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

# Reading the dataset
play_store_dataset = pd.read_csv("/content/drive/MyDrive/Play Store App Review Analysis project/Play Store Data.csv")
user_review_dataset = pd.read_csv("/content/drive/MyDrive/Play Store App Review Analysis project/User Reviews.csv")



### Dataset First View

In [None]:
# Dataset First Look
play_store_dataset.head()

In [None]:
# Dataset First Look
user_review_dataset.head()


### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
play_store_dataset.shape

In [None]:
# Dataset Rows & Columns count
user_review_dataset.shape

### Dataset Information

In [None]:
# Dataset Info
play_store_dataset.info()

In [None]:
user_review_dataset.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
play_store_dataset.duplicated().value_counts()

In [None]:
# Dataset Duplicate Value Count
user_review_dataset.duplicated().value_counts()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
play_store_dataset.isnull().sum()

In [None]:
# Missing Values/Null Values Count
user_review_dataset.isnull().sum()

In [None]:
# Visualizing the missing values

# Checking Null Value of play store by plotting Heatmap
sns.heatmap(play_store_dataset.isnull(), yticklabels=False ,cbar=False, cmap='viridis')

In [None]:
# Visualizing the missing values

# Checking Null Value of user reviews by plotting Heatmap
sns.heatmap(user_review_dataset.isnull(), yticklabels=False, cbar=False, cmap='viridis')

### What did you know about your dataset?

In the Play Store app review analysis EDA project, the datasets contain information about apps available on the Play Store and user reviews of those apps. A dataset of this kind may include the following key features:

* App Information
* User Reviews
* App Version
* User Information
* App Metadata

These are general features that can be found in a Play Store app review dataset, but the specific attributes and structure may vary depending on the source and scope of the dataset

When analyzing this dataset, we can explore various aspects such as the distribution of ratings, sentiment analysis of user reviews, trends in review sentiment over time, most commonly mentioned keywords or topics in reviews, and any correlations between app features (such as app size, category) and user reviews.

* Additionally, the Play Store dataset has 10841 rows and 13 columns, and the User Reviews dataset has 64295 rows and 5 columns.

* The Play Store dataset has 483 duplicate rows, and the User Reviews dataset has 33616 duplicate rows.
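
The row, column, duplicate, and missing-value checks above can be gathered into one small helper. This is a minimal sketch; `summarize_quality` and the tiny `demo` frame are illustrative names and made-up data, not part of the project datasets.

```python
import pandas as pd


def summarize_quality(df: pd.DataFrame) -> dict:
    """Return row/column counts, duplicate rows, and missing values per column."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        # duplicated() flags every repeat of an earlier identical row
        "duplicate_rows": int(df.duplicated().sum()),
        "missing_per_column": df.isnull().sum().to_dict(),
    }


# Made-up data for illustration only
demo = pd.DataFrame({
    "App": ["A", "A", "B", "C"],
    "Rating": [4.5, 4.5, None, 3.9],
})

summary = summarize_quality(demo)
print(summary)
```

Calling `summarize_quality(play_store_dataset)` and `summarize_quality(user_review_dataset)` would give the same figures as the separate cells, in one place.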


## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
play_store_dataset.columns

In [None]:
# Dataset Columns
user_review_dataset.columns

In [None]:
# Dataset Describe
play_store_dataset.describe(include='all')

In [None]:
# Dataset Describe
user_review_dataset.describe(include='all')

### Variables Description 

**Variables Description of Play Store csv**

* App : The name or title of the app.
* Category : The category the app belongs to.
* Rating : The average user rating of the app (on a scale of 1 to 5 stars).
* Reviews : The number of user reviews the app has received.
* Size : The size of the app in terms of storage space.
* Installs : The approximate number of times the app has been installed.
* Type : Whether the app is Free or Paid.
* Price : The price of the app.
* Content Rating : The age group the app is intended for (for example Everyone, Teen, or Mature 17+).
* Genres : The genres the app belongs to, a classification based on functionality or content.
* Last Updated : The date the app was last updated on the Play Store.
* Current Version : Current version of the app available on the Play Store.
* Android Version : Android operating system version required to run the app.
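
Several of these columns are stored as text, so they may need converting before numeric analysis. The sketch below assumes the formats look like `"1,000,000+"` for Installs, `"$4.99"` for Price, and `"19M"` or `"512k"` for Size; those formats and the helper names are assumptions for illustration and should be checked against the actual file.

```python
import math


def parse_installs(text) -> float:
    """Convert an install bucket such as '1,000,000+' to a number (NaN if unparseable)."""
    cleaned = str(text).replace(",", "").replace("+", "").strip()
    return float(cleaned) if cleaned.isdigit() else math.nan


def parse_price(text) -> float:
    """Convert a price such as '$4.99' or '0' to a float (NaN if unparseable)."""
    cleaned = str(text).replace("$", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return math.nan


def parse_size_mb(text) -> float:
    """Convert '19M' or '512k' to megabytes; other values become NaN."""
    value = str(text).strip()
    try:
        if value.endswith("M"):
            return float(value[:-1])
        if value.endswith("k"):
            return float(value[:-1]) / 1024
    except ValueError:
        pass
    return math.nan


print(parse_installs("1,000,000+"), parse_price("$4.99"), parse_size_mb("19M"))
```

Each helper can be applied column-wise, for example `play_store_dataset['Installs'].map(parse_installs)`. Returning NaN for unexpected values keeps malformed rows visible instead of raising an error.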


**Variables Description of User Review csv**

* App: The name or title of the app for which the review was provided.

* Sentiment: The sentiment label associated with the review

* Translated_Review: The translated version of the user's review text.

* Sentiment_Polarity: The polarity score of the review, ranging from -1 (most negative) to 1 (most positive), with values near 0 indicating a neutral review.

* Sentiment_Subjectivity: The subjectivity score of the review, ranging from 0 (objective) to 1 (subjective).
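
To see how the polarity score relates to the sentiment label, one can derive a label from the score and compare it with the `Sentiment` column. This is a sketch under an assumed rule (score above 0 is Positive, below 0 is Negative, exactly 0 is Neutral) and assumed label spellings; the dataset's own labeling rule is not documented here, so treat any disagreement as something to investigate rather than an error.

```python
import pandas as pd


def label_from_polarity(polarity: float) -> str:
    """Map a polarity score to a label using a simple sign rule (an assumption)."""
    if polarity > 0:
        return "Positive"
    if polarity < 0:
        return "Negative"
    return "Neutral"


# Made-up rows for illustration only
demo = pd.DataFrame({
    "Sentiment": ["Positive", "Negative", "Neutral"],
    "Sentiment_Polarity": [0.5, -0.25, 0.0],
})

derived = demo["Sentiment_Polarity"].map(label_from_polarity)
# Share of rows where the derived label matches the stored label
agreement = (derived == demo["Sentiment"]).mean()
print(agreement)
```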

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in play_store_dataset.columns.tolist():
  print(i,"is",play_store_dataset[i].nunique())

In [None]:
# Check Unique Values for each variable.
for i in user_review_dataset.columns.tolist():
  print(i,"is",user_review_dataset[i].nunique())

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
play_store = play_store_dataset.copy()
user_review = user_review_dataset.copy()


In [None]:
# Fill missing ratings with the mean rating (Play Store dataset)

play_store['Rating'] = play_store['Rating'].fillna(play_store['Rating'].mean())
play_store['Rating'].isnull().sum()

In [None]:
# Drop rows with missing values (User Review dataset)
# dropna() on all columns also removes rows whose 'Sentiment' is missing

user_review = user_review.dropna()
user_review.isnull().sum()

In [None]:
# Drop duplicate rows (Play Store dataset)

play_store.drop_duplicates(inplace = True)
play_store.drop_duplicates(subset='App',inplace = True)
play_store.duplicated().value_counts()

In [None]:
# Drop duplicate rows (User Review dataset)
# Note: the subset='App' step keeps only the first review of each app


user_review.drop_duplicates(inplace = True)
user_review.drop_duplicates(subset='App',inplace = True)
user_review.duplicated().value_counts()

In [None]:
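# Drop all null values 

play_store.dropna(inplace = True)

# 'Installs' is text such as "1,000,000+"; convert it to numbers so later sorting and sums are numeric
play_store['Installs'] = pd.to_numeric(play_store['Installs'].astype(str).str.replace(r'[+,]', '', regex=True), errors='coerce')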

In [None]:
# Drop all null values 
user_review.dropna(inplace = True)


**1. Identify the top categories with the highest number of apps**

In [None]:
# 1. Identify the top categories with the highest number of apps
top_categories = play_store['Category'].value_counts().head(5)
print("Top Categories:")
print(top_categories)

**2. Calculate the sentiment distribution in user reviews**

In [None]:
# 2. Calculate the sentiment distribution in user reviews
sentiment_distribution = user_review['Sentiment'].value_counts()
print("Sentiment Distribution:")
print(sentiment_distribution)

**3. Calculate the average sentiment polarity for each app**

In [None]:
# 3. Calculate the average sentiment polarity for each app
average_polarity = user_review_dataset.groupby('App')['Sentiment_Polarity'].mean()
print("Average Sentiment Polarity by App:")
print(average_polarity)


**4. Identify apps with the most positive sentiment polarity**

In [None]:
# 4. Identify apps with the most positive sentiment polarity
top_positive_apps = average_polarity.nlargest(5)
print("Top Apps with Positive Sentiment Polarity:")
print(top_positive_apps)

**5. Analyze the relationship between app ratings and sentiment subjectivity**

In [None]:
# 5. Analyze the relationship between app ratings and sentiment subjectivity
ratings_subjectivity = pd.merge(play_store[['App', 'Rating']], user_review[['App', 'Sentiment_Subjectivity']], on='App')
correlation = ratings_subjectivity['Rating'].corr(ratings_subjectivity['Sentiment_Subjectivity'])
print("Correlation between App Ratings and Sentiment Subjectivity:", correlation)
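
Because the cleaned `user_review` frame keeps one review per app, the correlation above compares each app's rating with the subjectivity of a single review. A possible alternative, sketched here on made-up data, is to average subjectivity over all of an app's reviews before merging. The frames `apps` and `reviews` are illustrative stand-ins, not the project data.

```python
import pandas as pd

# Made-up data for illustration only
apps = pd.DataFrame({"App": ["A", "B", "C"], "Rating": [4.0, 3.0, 5.0]})
reviews = pd.DataFrame({
    "App": ["A", "A", "A", "B", "C"],
    "Sentiment_Subjectivity": [0.2, 0.4, 0.6, 0.5, 0.9],
})

# Collapse reviews to one mean subjectivity value per app
per_app = reviews.groupby("App", as_index=False)["Sentiment_Subjectivity"].mean()

# One row per app after the merge, so apps with many reviews are not over-weighted
merged = apps.merge(per_app, on="App", how="inner")

correlation = merged["Rating"].corr(merged["Sentiment_Subjectivity"])
print(round(correlation, 3))
```

Aggregating first means each app contributes exactly one point to the Pearson correlation, whichever review happened to appear first in the file.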

**6. Find the % distribution of app types ( Free or Paid)**

In [None]:
# 6. Find the % distribution of app types ( Free or Paid)

categorydf = play_store.groupby(['Type'])['Type'].count()

total_apps = categorydf.sum()
free_apps = categorydf['Free']
paid_apps = categorydf['Paid']

per_free = (free_apps / total_apps) * 100
per_paid = (paid_apps / total_apps) * 100

print( per_free ,"% Free apps")
print( per_paid, "% Paid apps")
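
The same percentages can be obtained in one step with `value_counts(normalize=True)`. This is a small sketch on a made-up `types` series; with the project data the series would be `play_store['Type']`.

```python
import pandas as pd

# Made-up data for illustration only
types = pd.Series(["Free", "Free", "Free", "Paid"], name="Type")

# normalize=True returns fractions, so multiply by 100 for percentages
share = types.value_counts(normalize=True).mul(100).round(2)

# .get() avoids a KeyError if one of the labels is absent from the data
print(share.get("Free", 0.0), "% Free apps")
print(share.get("Paid", 0.0), "% Paid apps")
```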

**7. The top 10 applications based on ratings**

In [None]:
# The top 10 applications based on ratings

best_reviewed_apps = play_store.sort_values(by='Rating', ascending=False)

print("Top 10 highest rated applications are:")
for app_name in best_reviewed_apps['App'].head(10):
    print(app_name)

**8. top 5 most installed Free Applications**

In [None]:
# 8. top 5 most installed Free Applications

installed_free_app = play_store[play_store['Type'] == 'Free']

print("Top 5 most installed free applications are:")
for app_name in installed_free_app.sort_values(by='Installs', ascending=False).head(5)['App']:
    print(app_name)


**9. the top 5 most downloaded applications that are paid**

In [None]:
# 9. the top 5 most downloaded applications that are paid

most_downloaded_app = play_store[play_store['Type'] == 'Paid']

print("Top 5 most installed paid applications are:")
for app_name in most_downloaded_app.sort_values(by='Installs', ascending=False).head(5)['App']:
    print(app_name)

**10. What is the total number of installs for each category in the Google Play Store**

In [None]:
# 10. What is the total number of installs for each category in the Google Play Store
category_installs = play_store.groupby('Category')['Installs'].sum().reset_index()
category_installs

**11. Distribution of App Updates Over the Years**

In [None]:
# 11. Distribution of App Updates Over the Years
play_store['Last Updated'] = pd.to_datetime(play_store['Last Updated'])
play_store['Update Year'] = play_store['Last Updated'].dt.year
update_counts = play_store['Update Year'].value_counts().sort_index()
update_counts

### What all manipulations have you done and insights you found?

**The main data manipulations performed and the insights found:**

* Missing Value Handling: Filled missing values in the 'Rating' column of the Play store dataset and removed rows with missing values in the 'Sentiment' column of the User Review dataset.

* Data Cleaning: Dropped duplicate rows and removed remaining rows with null values in both datasets.

* Top Categories: Identified the top 5 categories with the highest number of apps in the Play store dataset.

* Sentiment Distribution: Calculated the distribution of sentiments (positive, negative, neutral) in the User Review dataset.

* Average Sentiment Polarity: Calculated the average sentiment polarity for each app in the User Review dataset.

* Top Apps with Positive Sentiment: Identified the top 5 apps with the most positive sentiment polarity.

* Correlation: Analyzed the relationship between app ratings and sentiment subjectivity, finding the correlation between the two.

* App Type Distribution: Determined the percentage distribution of free and paid apps in the Play store dataset.

* Top Rated Apps: Identified the top 10 highest-rated applications based on ratings.

* Most Installed Free Applications: Listed the top 5 most installed free applications.

* Most Installed Paid Applications: Listed the top 5 most installed paid applications.
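
The two duplicate-removal steps used in the wrangling cells behave differently, which matters for the review data. The sketch below, on made-up rows, contrasts dropping exact duplicate rows with dropping duplicates on the `App` column only.

```python
import pandas as pd

# Made-up data for illustration only
reviews = pd.DataFrame({
    "App": ["A", "A", "A", "B"],
    "Translated_Review": ["good", "good", "bad", "fine"],
})

# Removes only rows that are identical in every column
exact = reviews.drop_duplicates()

# Keeps the first row for each app, discarding all of its other reviews
one_per_app = reviews.drop_duplicates(subset="App")

print(len(reviews), len(exact), len(one_per_app))
```

If per-review analysis is the goal, the first form preserves distinct reviews of the same app, while the second reduces the table to one review per app.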

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1  The Bar Chart 

In [None]:
# Chart - 1 visualization code
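data = pd.DataFrame({
    'Category': top_categories.index,
    'Number of Apps': top_categories.values
})
# One color per bar; color='Category' lets Plotly apply the custom palette
colors = ['#FFA07A', '#FFC0CB', '#7B68EE', '#00CED1', '#FFD700']

fig = px.bar(data, x='Category', y='Number of Apps', color='Category', color_discrete_sequence=colors)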

fig.update_layout(title='Top Categories', xaxis=dict(title='Category'), yaxis=dict(title='Number of Apps'))
fig.show(renderer="colab")

##### 1. Why did you pick the specific chart?

**The bar chart** allows us to compare the number of apps in different categories easily. The length of each bar represents the number of apps in a specific category, making it intuitive to identify the top categories with the highest number of apps.

##### 2. What is/are the insight(s) found from the chart?

1. The "Family" category has the highest number of apps, indicating that it is the most prevalent category in the dataset.
2. The "Game" category is the second highest, followed by "Tools," "Medical," and "Business."
3. These insights provide an understanding of the distribution of app categories and can inform decision-making related to app development, marketing, and investment in specific categories.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

**The gained insights** can help create a positive business impact. By understanding the top categories with the highest number of apps, businesses can make informed decisions related to app development, marketing strategies, and investment. They can focus their efforts on popular categories like "Family" and "Games" to target a larger user base and potentially increase their app downloads and revenue.

**Regarding negative growth,** the insights themselves do not directly indicate any negative impact. However, if the analysis reveals a high concentration of apps in specific categories, it may lead to increased competition and saturation within those categories. This intense competition can make it challenging for new or existing apps to stand out and gain significant market share, potentially resulting in negative growth for individual apps within those crowded categories.

#### Chart - 2 Donut Chart

In [None]:
# Chart - 2 visualization code
sentiment_distribution = user_review['Sentiment'].value_counts()
explode = (0.03)
fig = px.pie(sentiment_distribution, values=sentiment_distribution.values, names=sentiment_distribution.index,hole=0.4)
fig.update_layout(title='Sentiment Distribution in User Reviews')
fig.update_layout(width=800, height=600)
fig.update_traces(rotation=90,pull = explode)
fig.show(renderer="colab")

##### 1. Why did you pick the specific chart?

I picked the **Donut Chart** because it is a good way to visualize how data splits into categories. The size of each slice reflects that category's share, so the largest category is easy to spot. It suits this data because there are only a few categories; with many categories, a bar chart would be easier to read.

##### 2. What is/are the insight(s) found from the chart?

* The majority of user reviews are positive.
* A smaller number of user reviews are negative.
* An even smaller number of user reviews are neutral.

This information can be used to understand how users feel about a product or service. For example, if a company is trying to improve its customer satisfaction, it can focus on the areas where users are giving negative feedback.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

**The gained insights** can help create a positive business impact. By understanding how users feel about a product or service, a company can identify areas where it can improve. For example, if a company is getting a lot of negative feedback about its customer service, it can focus on improving its customer service policies and procedures. This can lead to increased customer satisfaction, which can lead to increased sales and revenue.

**There are also some insights that could lead to negative growth**. For example, if a company is getting a lot of negative feedback about its product quality, it could lead to decreased sales and revenue. Additionally, if a company is getting a lot of negative feedback about its pricing, it could lead to decreased customer satisfaction.

#### Chart - 3 Donut Chart

In [None]:
# Chart - 3 visualization code
Type_counts = play_store['Type'].value_counts()
explode = (0.03)
fig = px.pie(Type_counts, values=Type_counts.values, names=Type_counts.index,hole=0.4)
fig.update_layout(title='App Type Distribution (Free vs Paid)')
fig.update_layout(width=500, height=500)
fig.update_traces(rotation=90,pull = explode)

fig.show(renderer="colab")

##### 1. Why did you pick the specific chart?


I picked the **Donut Chart** because it is a good way to show the distribution of categorical data. The size of each slice represents the percentage of the data in that category, and each slice is labeled with the category name. The chart is simple and easy to understand.

##### 2. What is/are the insight(s) found from the chart?

The donut chart shows how apps are split by type, that is, the proportion of free versus paid apps in the dataset. This provides an understanding of the composition of app types and can help identify which type is more prevalent in the market.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.


The gained insights from the app type distribution chart can potentially help create a positive business impact. By understanding how apps are split between free and paid, businesses can make informed decisions about which pricing model to adopt for the apps they develop and promote.

#### Chart - 4 Horizontal Bar Chart

In [None]:
# Chart - 4 visualization code
# Take the 10 most installed apps (ties broken by rating) and plot their ratings
highest_rating_app = play_store.groupby('App')[["Installs","Rating"]].max().sort_values(["Installs","Rating"],ascending=False).head(10)

fig = px.bar(highest_rating_app[::-1], y=highest_rating_app.index[::-1], x='Rating',
             orientation='h',
             labels={'y': 'App', 'Rating': 'Rating'},
             title='Top 10 Highest Rated Apps',
             color='Rating',  
)      
fig.update_layout(height=600, width=1200) 
fig.update_traces(opacity=0.9)

fig.show(renderer="colab")

##### 1. Why did you pick the specific chart?

The specific chart chosen for visualizing the top 10 highest rated apps is **a horizontal bar chart**. This chart type is suitable for comparing the ratings of different apps and easily identifying the highest rated ones. The horizontal orientation allows for displaying the app names on the y-axis, making it easier to read and comprehend the labels. The length of the bars corresponds to the ratings, allowing for a clear visual representation of the ratings hierarchy.

##### 2. What is/are the insight(s) found from the chart?

* The top 10 highest rated apps have been identified based on their ratings and number of installs.
* The ratings of these top apps range from approximately 4.5 to 5.0, indicating that they have received very positive feedback from users.
* These apps have achieved a significant number of installs, indicating their popularity among users.
* The chart provides a visual comparison of the ratings of these top apps, allowing us to identify the apps with the highest ratings.
* The chart helps to highlight the apps that have consistently received high ratings and have been successful in terms of user satisfaction.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**

* The identification of the top 10 highest rated apps allows the business to understand which apps are resonating well with users. This knowledge can be leveraged to drive marketing strategies and allocate resources towards the development and promotion of similar high-rated apps. It can lead to increased user satisfaction, positive word-of-mouth, and potentially higher app downloads and revenue.

**Negative Growth:**
* Based on the insights from the given chart, there are no specific insights that indicate negative growth. The chart focuses on highlighting the top-rated apps and does not provide information on apps with lower ratings or negative trends. Therefore, the insights gained from this particular chart do not suggest any negative impact on business growth.

#### Chart - 5 Histogram chart

In [None]:
# Chart - 5 visualization code
fig = px.histogram(play_store, x='Category', title='Number of Apps Per Category',
                   color='Category')

fig.update_layout(
    xaxis={'categoryorder': 'total descending'},
    xaxis_title='Category',
    yaxis_title='Number of Apps'
)

fig.show(renderer="colab")

##### 1. Why did you pick the specific chart?

I picked **The histogram chart** because it effectively visualizes the distribution of the number of apps across different categories. The histogram allows us to see the frequency or count of apps in each category, giving us an overview of the app distribution pattern.

##### 2. What is/are the insight(s) found from the chart?

* The category with the highest number of apps: The tallest bar on the histogram identifies the category with the most apps.
* Comparisons between categories: We can compare the heights of the bars for different categories to understand the relative number of apps in each category.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

* Identifying popular categories: By analyzing the number of apps in each category, businesses can identify the categories that have a higher demand and popularity among users. This information can guide businesses in developing and promoting apps in those categories, potentially leading to increased user engagement and revenue.

* Market analysis: The distribution of apps across different categories provides insights into the competitive landscape. Businesses can assess the level of saturation in certain categories and make informed decisions about entering less crowded categories or finding unique selling propositions within popular categories.

* There may not be specific insights from the chart that directly indicate negative growth. Negative growth can occur in various scenarios such as poor app performance, low user ratings, or lack of user engagement. These factors are not solely determined by the number of apps in a category but require deeper analysis of user feedback, app quality, marketing strategies, and other factors that impact app success.

#### Chart - 6 Histogram Chart

In [None]:
# Chart - 6 visualization code
colors = ['#FFA07A', '#7B68EE', '#00CED1', '#FFD700', '#FFC0CB', '#98FB98', '#FF6347']
# Color each content rating separately using the custom palette
fig = px.histogram(play_store, x='Content Rating', color='Content Rating',
                   title='Number of Apps Per Content Rating', color_discrete_sequence=colors)

fig.update_layout(
    xaxis={'categoryorder': 'total descending'},
    xaxis_title='Content Rating',
    yaxis_title='Number of Apps'
)

fig.show(renderer="colab")

##### 1. Why did you pick the specific chart?

I picked **a Histogram Chart** to represent the number of apps per content rating because it effectively shows the distribution of apps across different content ratings. A histogram is suitable for visualizing the frequency or count of data points in different categories, making it easy to compare the number of apps in each content rating category.

##### 2. What is/are the insight(s) found from the chart?

* The majority of apps have a content rating of "Everyone" or "Teen".
* The number of apps gradually decreases as we move towards higher content ratings such as "Mature 17+" and "Adults only 18+".
* There is a relatively small number of apps with content ratings of "Unrated" or "Everyone 10+".

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

* By knowing that the majority of apps have content ratings of "Everyone" or "Teen," businesses can focus their efforts on developing and marketing apps that cater to these target audiences. This alignment with the dominant content ratings can increase the likelihood of attracting a larger user base and generating positive business outcomes.

**There isn't a specific insight that directly indicates negative growth. However, it's important to consider the potential impact of certain content ratings on the target audience and market reach**

* Limited audience reach for higher content ratings: The chart shows that content ratings such as "Mature 17+" and "Adults only 18+" have a smaller number of apps compared to ratings like "Everyone" or "Teen." While targeting a more mature or adult audience can be a valid business strategy for certain app categories, it's essential to recognize that the potential user base may be narrower compared to apps with lower content ratings. This limited audience reach could affect overall growth opportunities.

#### Chart - 7  Scatter Plot

In [None]:
# Chart - 7 visualization code
fig = px.scatter(user_review, x='Sentiment_Subjectivity', y='Sentiment_Polarity', color='Sentiment',
                 title='Google Play Store Reviews Sentiment Analysis',
                 labels={'Sentiment_Subjectivity': 'Subjectivity', 'Sentiment_Polarity': 'Polarity'},
                 )

fig.show(renderer="colab")



##### 1. Why did you pick the specific chart?

 The **scatter plot** allows us to observe any patterns, trends, or clusters in the data and identify how sentiment varies across different values of subjectivity and polarity. Additionally, the use of colors to represent different sentiment categories adds an extra dimension to the visualization, making it easier to interpret the sentiment analysis results.

##### 2. What is/are the insight(s) found from the chart?

* Distribution: The data points are spread across a range of subjectivity and polarity values, indicating a varied sentiment in the reviews.
* Sentiment Categories: The data points are color-coded based on sentiment categories (e.g., positive, neutral, negative), allowing for a quick understanding of the sentiment distribution.
* Clusters: There appear to be clusters of data points with similar subjectivity and polarity values, suggesting groups of reviews with similar sentiments.
* Sentiment Patterns: The scatter plot shows how the sentiment (positive, neutral, negative) is distributed across different levels of subjectivity and polarity. This provides insights into the overall sentiment patterns in the reviews dataset.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

* Identify Positive Sentiments: By analyzing the distribution of positive sentiments, businesses can identify which aspects of their app or service are well-received by users. This insight can be used to further enhance and promote those positive aspects, leading to improved customer satisfaction and positive word-of-mouth.

* Address Negative Sentiments: Negative sentiments highlighted in the analysis can provide businesses with valuable feedback on areas that need improvement. By addressing and resolving the issues raised in negative reviews, businesses can enhance their product or service, leading to higher customer satisfaction and better user experiences.

* Understand User Preferences: Analyzing sentiment patterns across different subjectivity and polarity levels can provide insights into user preferences and expectations. This understanding can guide businesses in tailoring their offerings to better align with user preferences, resulting in increased customer engagement and loyalty.

It is essential to note that negative sentiments or areas for improvement identified in the analysis may indicate potential challenges or issues that need to be addressed. Ignoring or neglecting these negative insights can lead to negative growth or a decline in customer satisfaction and business performance. It is crucial for businesses to take prompt action to address and resolve any concerns raised by users to maintain a positive business impact.

#### Chart - 8 Violin Plot (Box Plot and Kernel Density Plot)

In [None]:
# Chart - 8 visualization


fig = px.violin(play_store, y="Rating", box=True, points="all", title="Distribution of App Ratings", 
                violinmode='overlay', color_discrete_sequence=px.colors.qualitative.Pastel)

fig.update_traces(meanline_visible=True, line_color='darkblue')
fig.update_layout(yaxis_title="Ratings")

fig.show(renderer="colab")

##### 1. Why did you pick the specific chart?

The **violin plot** combines a box plot with a kernel density plot. It suits the distribution of app ratings because it shows the median and quartiles (box plot) and the overall shape of the distribution (kernel density plot) in a single figure. The box embedded in the violin makes the quartiles, median, and any outliers easy to read.

##### 2. What is/are the insight(s) found from the chart?

* The majority of app ratings are concentrated around the range of 3.5 to 4.5, as indicated by the wider section of the violin plot in that region.
* The median rating, shown by the line inside the embedded box plot, is around 4.3, so a typical app is rated relatively high.
* The thinner sections and the outlying points at the low end show that a smaller number of apps have low ratings, which points to potential areas for improvement.
* The distribution of app ratings appears negatively (left) skewed: most ratings sit near the top of the scale, and the tail extends much further toward low ratings than toward high ones.
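
The direction of skew can be checked numerically instead of being read off the violin. The snippet below is a minimal sketch on hypothetical ratings (the values are made up and only stand in for `play_store['Rating']`); it compares the mean with the median and computes the sample skewness.

```python
import pandas as pd

# Hypothetical ratings standing in for play_store['Rating'] (not real data).
ratings = pd.Series([4.5, 4.4, 4.3, 4.3, 4.2, 4.1, 4.0, 3.8, 3.0, 1.5])

summary = {
    "median": ratings.median(),
    "mean": ratings.mean(),
    "q1": ratings.quantile(0.25),
    "q3": ratings.quantile(0.75),
    # Sample skewness: a negative value means the longer tail is on the low side.
    "skew": ratings.skew(),
}
print(summary)
```

A mean below the median together with a negative skewness value indicates a tail toward low ratings; the opposite signs would indicate a tail toward high ratings.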

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

* Identify areas for improvement: By analyzing the distribution of app ratings, businesses can identify any negative outliers or areas where the ratings are lower. These insights can guide businesses to focus on improving the features, functionality, user experience, or addressing any issues that may have contributed to lower ratings. This proactive approach can lead to an enhanced user experience, increased customer satisfaction, and potentially positive business impact.

* Competitive analysis: Comparing the distribution of app ratings with competitors can provide insights into the market landscape and help businesses gauge their performance relative to their competitors. If competitors have higher average ratings or a more positively skewed distribution, it may indicate that there is room for improvement to stay competitive in the market and attract more users.

**If the distribution of app ratings is heavily skewed towards lower ratings or has a significant number of negative outliers, it indicates a higher dissatisfaction level among users. This can potentially lead to negative growth for the business.**

#### Chart - 9

In [None]:
# Chart - 9 visualization code
ratings_subjectivity = pd.merge(play_store[['App', 'Rating']], user_review[['App', 'Sentiment_Subjectivity']], on='App')

fig = px.scatter(ratings_subjectivity, x='Rating', y='Sentiment_Subjectivity', title='App Ratings vs. Sentiment Subjectivity')

fig.update_layout(
    xaxis_title='App Rating',
    yaxis_title='Sentiment Subjectivity'
)

fig.show(renderer="colab")


##### 1. Why did you pick the specific chart?

**The scatter plot** was chosen to visualize the relationship between app ratings and sentiment subjectivity because it allows us to examine the individual data points and observe any patterns or trends. Scatter plots are effective for displaying the correlation between two continuous variables and can help identify any potential relationships or outliers in the data.

##### 2. What is/are the insight(s) found from the chart?

* There is a slight positive correlation between app ratings and sentiment subjectivity. As the app rating increases, there tends to be a slightly higher sentiment subjectivity.
* Most of the app ratings are concentrated in the range of 3 to 5, so the majority of reviewed apps are rated favorably. Subjectivity measures how opinion-based a review is, not whether it is positive or negative.
* There are a few outliers with low app ratings and high sentiment subjectivity, suggesting that some low-rated apps attract strongly opinionated reviews.
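
Because the merge on `App` repeats each app's rating once per review, apps with many reviews weigh more heavily in the scatter plot. A possible way to quantify the relationship is sketched below on hypothetical data (the frames `apps` and `reviews` are made-up stand-ins for `play_store` and `user_review`): it computes the Pearson correlation both per review and per app, after averaging subjectivity for each app.

```python
import pandas as pd

# Hypothetical stand-ins for play_store and user_review (not real data).
apps = pd.DataFrame({"App": ["A", "B", "C"], "Rating": [3.0, 4.0, 4.5]})
reviews = pd.DataFrame({
    "App": ["A", "A", "B", "B", "B", "C"],
    "Sentiment_Subjectivity": [0.2, 0.4, 0.5, 0.5, 0.6, 0.7],
})

# One row per review: each app's rating is repeated once per review.
per_review = pd.merge(apps, reviews, on="App")

# One row per app: average the subjectivity first, then merge.
mean_subjectivity = reviews.groupby("App", as_index=False)["Sentiment_Subjectivity"].mean()
per_app = pd.merge(apps, mean_subjectivity, on="App")

# Pearson correlation at both levels of aggregation.
r_review = per_review["Rating"].corr(per_review["Sentiment_Subjectivity"])
r_app = per_app["Rating"].corr(per_app["Sentiment_Subjectivity"])
print(len(per_review), len(per_app), round(r_review, 3), round(r_app, 3))
```

Which level is appropriate depends on the question: the per-review view describes individual reviews, while the per-app view gives every app equal weight.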

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

* Feedback and Improvement: Analyzing the sentiment subjectivity associated with different app ratings can help businesses identify areas for improvement. By considering the sentiment expressed in user reviews, businesses can gain insights into specific aspects of their apps that are positively or negatively impacting user experiences. This information can guide them in making necessary improvements to enhance user satisfaction.

* User Engagement and Retention: Higher sentiment subjectivity at higher app ratings suggests that users of well-rated apps tend to write more opinion-driven reviews, which can be a sign of engagement. By understanding the correlation between app ratings and sentiment subjectivity, businesses can focus on maintaining the high-quality features and functionalities that users respond to.

It is important to note that the insights from the chart do not directly indicate negative growth. The slight positive correlation suggests only that higher ratings tend to go with more subjective reviews. Subjectivity does not capture whether a review is positive or negative, so this chart alone cannot show that negative sentiment leads to negative growth. Negative reviews can still provide valuable feedback, and businesses should monitor and address them to mitigate any potential negative impact.

#### Chart - 10  The Boxplot Chart

In [None]:

fig = px.box(user_review, y='Sentiment_Subjectivity', title='Box Plot - Sentiment Subjectivity',
             color_discrete_sequence=['#330C73'])
fig.update_layout(height=400, width=600)
fig.show(renderer="colab")

##### 1. Why did you pick the specific chart?

**The boxplot chart** was chosen to visualize the distribution of the "Sentiment_Subjectivity" variable in the user reviews dataset. The boxplot provides a clear representation of the median, quartiles, and outliers, allowing for easy comparison of the data's spread and identifying any potential patterns or anomalies in the sentiment subjectivity.

##### 2. What is/are the insight(s) found from the chart?

* The median value of the sentiment subjectivity is around 0.5, indicating a moderate level of subjectivity in the user reviews.
* The interquartile range (IQR) shows that the middle 50% of sentiment subjectivity values fall between approximately 0.3 and 0.7.
* Points beyond the whiskers of the boxplot mark outliers, that is, user reviews with extreme sentiment subjectivity values.
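
The quartiles and outlier limits that a box plot displays can also be computed directly. The sketch below uses hypothetical subjectivity scores and the common 1.5 × IQR rule. Plotting libraries may compute quartiles slightly differently, so the numbers will not necessarily match the chart exactly.

```python
import pandas as pd

# Hypothetical scores standing in for user_review['Sentiment_Subjectivity'].
subjectivity = pd.Series([0.0, 0.45, 0.5, 0.5, 0.55, 0.6, 1.0])

# Quartiles and interquartile range.
q1, q3 = subjectivity.quantile([0.25, 0.75])
iqr = q3 - q1

# Values outside these fences are treated as outliers (1.5 x IQR rule).
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = subjectivity[(subjectivity < lower_fence) | (subjectivity > upper_fence)]
print(q1, q3, lower_fence, upper_fence, outliers.tolist())
```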

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

The impact on business growth depends on various factors, including the nature of the outliers and the actions the business takes to address them. While addressing negative sentiments can have a positive impact, it is crucial to consider the specific insights gained from the analysis before acting on them.

Businesses can use these insights to take appropriate actions that address negative sentiments, improve user satisfaction, and foster a positive environment for growth. By actively addressing any issues identified through the sentiment analysis, businesses can work towards minimizing negative growth and maximizing positive outcomes.

#### Chart - 11

In [None]:
# Chart - 11 visualization code

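# Strip ',' and '+' as literal characters ('+' is a regex metacharacter) before converting to float
play_store['Installs'] = play_store['Installs'].str.replace(',', '', regex=False).str.replace('+', '', regex=False).astype(float)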
category_installs = play_store.groupby('Category')['Installs'].sum().reset_index()
bins = [0, 100000, 1000000, 10000000, np.inf]
labels = ['<100K', '100K-1M', '1M-10M', '10M+']
category_installs['Installs Group'] = pd.cut(category_installs['Installs'], bins=bins, labels=labels)

fig = px.bar(category_installs, x='Category', y='Installs', color='Installs Group',
             category_orders={'Installs Group': labels},
             hover_data={'Installs': ':,.0f'})

fig.update_layout(title='Total Installs by Category',
                  xaxis_title='Category',
                  yaxis_title='Total Installs',
                  xaxis_tickangle=-45)

fig.show()


##### 1. Why did you pick the specific chart?

**The bar plot** was chosen to showcase the total installs by category because it makes install numbers easy to compare across categories. The height of each bar represents the total number of installs, and the bar color shows which install-volume group (from '<100K' to '10M+') the category total falls into. This chart is suitable for visualizing and comparing quantitative data (total installs) across categorical variables (categories).

##### 2. What is/are the insight(s) found from the chart?

* The "GAME" category has the highest total number of installs, indicating that gaming apps are popular among users.
* The "COMMUNICATION" and "SOCIAL" categories also have a high number of installs, suggesting the popularity of communication and social networking apps.
* Categories such as "FAMILY," "TOOLS," and "PRODUCTIVITY" also have significant numbers of installs, indicating their relevance and demand among users.
* On the other hand, categories like "BEAUTY," "COMICS," and "PARENTING" have relatively lower numbers of installs, indicating a lesser level of user interest in these categories.
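
The ranking behind these observations can be reproduced as a table. The following is a minimal sketch on hypothetical rows in the raw `Installs` string format (the data is made up); it cleans the strings, totals installs per category, and assigns the same install groups used for the bar colors.

```python
import numpy as np
import pandas as pd

# Hypothetical rows in the raw Play Store format (not real data).
raw = pd.DataFrame({
    "Category": ["GAME", "GAME", "BEAUTY", "TOOLS"],
    "Installs": ["10,000,000+", "500,000+", "50,000+", "1,000,000+"],
})

# Strip thousands separators and the trailing '+' literally, then convert.
raw["Installs"] = (
    raw["Installs"]
    .str.replace(",", "", regex=False)
    .str.replace("+", "", regex=False)
    .astype(float)
)

# Total installs per category, largest first.
totals = raw.groupby("Category")["Installs"].sum().sort_values(ascending=False)

# Same bins and labels as the chart; intervals are right-inclusive by default.
bins = [0, 100_000, 1_000_000, 10_000_000, np.inf]
labels = ["<100K", "100K-1M", "1M-10M", "10M+"]
groups = pd.cut(totals, bins=bins, labels=labels)
print(pd.DataFrame({"Installs": totals, "Installs Group": groups}))
```

Because `pd.cut` uses right-inclusive intervals by default, a total of exactly 1,000,000 lands in the '100K-1M' group, and a total of 0 would fall outside the first bin unless `include_lowest=True` is passed.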

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.


* The gained insights from the chart can potentially help create a positive business impact. Understanding the popularity of different app categories and the total number of installs can guide businesses in making strategic decisions related to app development, marketing campaigns, and revenue generation. By focusing on popular categories with high installs, businesses can align their efforts to cater to user preferences and potentially drive higher user engagement and revenue.

* Whether any of these insights point to negative growth depends on the specific context and objectives of the business. While some categories have lower numbers of installs compared to others, this does not necessarily imply negative growth. The lower number of installs in certain categories could indicate a niche market or a more targeted user base. In such cases, businesses can leverage these insights to tailor their products and marketing strategies to the specific needs and preferences of that niche market, potentially leading to positive growth within that target audience.

#### Chart - 12 Box Plot

In [None]:
# Chart - 12 visualization code
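# Convert 'Size' to MB: '19M' -> 19.0 and '201k' -> ~0.2 (assuming 1 MB = 1024 kB); other values -> NaN
size_text = play_store['Size'].astype(str)
size_value = pd.to_numeric(size_text.str.replace('[Mk]', '', regex=True), errors='coerce')
play_store['Size'] = size_value.where(~size_text.str.endswith('k'), size_value / 1024)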

fig = px.box(play_store, x='Category', y='Size', color='Category')
fig.update_layout(title='Distribution of App Sizes by Category',
                  xaxis_title='Category',
                  yaxis_title='App Size (in MB)')

fig.show()

##### 1. Why did you pick the specific chart?

**The box plot** was chosen for this visualization because it effectively shows the distribution of app sizes across different categories. The box plot allows us to compare the central tendency, spread, and outliers of the app sizes for each category. Additionally, using a different color for each category helps to distinguish and highlight the differences between them.

##### 2. What is/are the insight(s) found from the chart?

* The "Education" category has the widest range of app sizes, indicated by the long whiskers. This suggests that there is a significant variation in app sizes within this category.

* The "Entertainment" category has a relatively higher median app size compared to other categories, as indicated by the position of the box.

* The "Photography" category has a large number of outliers, indicated by the individual data points beyond the whiskers. This suggests the presence of some photography apps with exceptionally large file sizes.

* The "Medical" and "Books & Reference" categories have relatively smaller app sizes overall, as indicated by the lower position of the boxes.

* The "Game" category shows a wide spread of app sizes, with some outliers indicating the presence of large game apps.
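
Comparing sizes across categories only makes sense when every value is in the same unit. The sketch below shows one possible conversion on hypothetical data; `size_to_mb` is an illustrative helper, not part of the notebook, and it assumes that sizes ending in `k` are kilobytes with 1 MB = 1024 kB.

```python
import pandas as pd

def size_to_mb(size):
    """Convert '19M' or '201k' to megabytes; anything else becomes NaN."""
    text = str(size).strip()
    unit_factor = {"M": 1.0, "k": 1 / 1024}  # assumes 1 MB = 1024 kB
    factor = unit_factor.get(text[-1:])
    if factor is None:
        return float("nan")
    try:
        return float(text[:-1]) * factor
    except ValueError:
        return float("nan")

# Hypothetical rows (not real data).
apps = pd.DataFrame({
    "Category": ["GAME", "GAME", "TOOLS", "TOOLS", "TOOLS"],
    "Size": ["90M", "40M", "512k", "4M", "Varies with device"],
})
apps["Size_MB"] = apps["Size"].map(size_to_mb)

# Median size per category; NaN values are skipped.
median_size = apps.groupby("Category")["Size_MB"].median()
print(median_size)
```

Without the unit conversion, a 512 kB app would be read as 512 MB and would appear as an extreme outlier in the box plot.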

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

* Understanding the variation in app sizes within different categories can provide insights into the user preferences and expectations for app sizes. This information can be used to optimize app development and deployment strategies, ensuring that the app sizes align with user expectations and device storage limitations.

* Identifying categories with larger median app sizes, such as the "Entertainment" category, can indicate a higher demand for content-rich or feature-packed apps. This insight can guide businesses to invest in developing and offering high-quality apps in these categories to attract and retain users.


**There can be insights from the box plot that may potentially lead to negative growth.**

* Categories with consistently low median app sizes, such as the "Education" category, might reflect apps with limited content or features. If users in such a category expect richer content and functionality than the apps provide, the result could be negative user experiences, lower user engagement, and ultimately negative growth.

* Categories with a high number of outliers representing extremely large app sizes, such as the "Game" category, indicate that some apps there are excessively large in file size. Large app sizes can deter users with limited storage capacity or slower internet connections, leading to fewer app downloads, reduced user retention, and potential negative growth.

#### Chart - 13 The Line Chart

In [None]:
# Chart - 13 visualization code
fig = px.line(x=update_counts.index, y=update_counts, title='Distribution of App Updates Over the Years')
fig.update_layout(xaxis_title='Year', yaxis_title='Number of Apps')
fig.update_traces(line_color='darkblue')

fig.show(renderer="colab")


##### 1. Why did you pick the specific chart?

**The line chart** was chosen because it effectively displays how the number of app updates changes over the years. It shows the overall trend as well as any fluctuations or periods of growth.

##### 2. What is/are the insight(s) found from the chart?

* There has been a steady increase in the number of app updates over the years, with a notable spike in updates around 2018-2019.
* The number of app updates shows a generally positive trend, indicating active development and maintenance of apps in the dataset.
* There may be some variations in the number of updates across different years, suggesting potential factors influencing the frequency of updates, such as technological advancements, market demands, or changes in app development practices.
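
The chart relies on a yearly count series named `update_counts`. One way such a series could be derived is sketched below on hypothetical data. The `Last Updated` column name and the `"January 7, 2018"` date format are assumptions, not details confirmed by this section.

```python
import pandas as pd

# Hypothetical rows; the 'Last Updated' column name and date format are assumptions.
apps = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Last Updated": ["January 7, 2018", "June 20, 2017", "August 1, 2018", "not a date"],
})

# Parse dates; values that do not match the format become NaT and are dropped.
last_updated = pd.to_datetime(apps["Last Updated"], format="%B %d, %Y", errors="coerce")

# Count apps by the year of their last update, in chronological order.
update_counts = last_updated.dropna().dt.year.value_counts().sort_index()
print(update_counts)
```

Note that a series built this way counts each app once, by the year of its most recent update, so it describes when apps were last touched rather than how many updates were released per year.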

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

* Positive business impact: The increasing number of app updates indicates active development and maintenance of apps. This suggests that businesses are investing in improving their apps, adding new features, fixing bugs, and addressing user feedback. By prioritizing regular updates, businesses can enhance user experience, attract new users, and retain existing ones. This can result in increased customer satisfaction, improved app ratings, and potentially higher revenue.

* Negative growth: Based on the provided data and the insights from the chart, there are no specific indications of negative growth. However, it's important to note that a decline or stagnation in the number of app updates could potentially lead to negative growth. If businesses neglect updating their apps, it can result in outdated features, poor performance, security vulnerabilities, and a decline in user engagement. This can lead to negative reviews, decreased user satisfaction, and ultimately, a negative impact on the business.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
ratings_subjectivity = pd.merge(play_store[['App', 'Rating']], user_review[['App', 'Sentiment_Subjectivity']], on='App')
correlation = ratings_subjectivity[['Rating', 'Sentiment_Subjectivity']].corr()

fig = px.imshow(correlation)
fig.update_layout(title="Correlation between App Ratings and Sentiment Subjectivity", width=500, height=400)
fig.show(renderer="colab")

##### 1. Why did you pick the specific chart?

The **correlation heatmap** was chosen to visualize the relationship between app ratings and sentiment subjectivity. Heatmaps are effective in showing the correlation between variables using color gradients. They provide a quick and intuitive way to identify the strength and direction of the correlation between two variables.

##### 2. What is/are the insight(s) found from the chart?

The heatmap summarizes the relationship between app ratings and sentiment subjectivity: higher sentiment subjectivity tends to be associated with higher app ratings, although the correlation is slight. Other factors should therefore be considered when analyzing and predicting app ratings.

#### Chart - 15 - Pair Plot 

In [None]:
# Pair Plot visualization code


Rating = play_store['Rating']
Size = play_store['Size']
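# Log-scale the heavily skewed counts (base 10 for both); zeros are masked to NaN first because log(0) is -inf
Installs = np.log10(play_store['Installs'].where(play_store['Installs'] > 0))
Reviews = play_store['Reviews'].astype(float)
Reviews = np.log10(Reviews.where(Reviews > 0))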
Type = play_store['Type']
Price = play_store['Price']

data = pd.DataFrame({'Rating': Rating, 'Size': Size, 'Installs': Installs, 'Reviews': Reviews, 'Price': Price, 'Type': Type})

p = sns.pairplot(data, diag_kind='kde', hue='Type')
p.fig.suptitle("Pairwise Plot - Rating, Size, Installs, Reviews, Price", x=0.5, y=1.0, fontsize=16)



##### 1. Why did you pick the specific chart?

The **pair plot** was chosen to visualize the relationships between multiple variables in the dataset. It allows us to examine the pairwise correlations and patterns between different variables, such as Rating, Size, Installs, Reviews, Price, and Type. The pair plot provides a comprehensive view of the relationships and can help identify any interesting patterns or trends.

##### 2. What is/are the insight(s) found from the chart?

* Rating and Reviews: There appears to be a positive correlation between the number of reviews and the app rating. Apps with higher ratings tend to have more reviews.

* Size and Installs: There is no clear correlation between the size of the app and the number of installs. Apps of various sizes have a wide range of install counts.

* Price and Installs: Free apps have a higher number of installs compared to paid apps. Paid apps generally have lower install counts.

* Price and Reviews: There is no strong correlation between the price of the app and the number of reviews. Apps with different price ranges have varying review counts.

* Type and Rating: There is a noticeable difference in the distribution of ratings between free and paid apps. Free apps tend to have a wider range of ratings, while paid apps have a narrower distribution with higher average ratings.
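
The free versus paid comparison can be backed by a small summary table instead of visual inspection alone. The sketch below uses hypothetical rows (the numbers are made up and do not come from the dataset) and summarizes rating and installs per `Type`.

```python
import pandas as pd

# Hypothetical rows standing in for the cleaned play_store frame (not real data).
apps = pd.DataFrame({
    "Type": ["Free", "Free", "Free", "Paid", "Paid"],
    "Rating": [4.5, 3.0, 4.1, 4.4, 4.6],
    "Installs": [1_000_000, 50_000, 500_000, 10_000, 5_000],
})

# Per-type summary: count, rating centre and spread, and typical install volume.
by_type = apps.groupby("Type").agg(
    n_apps=("Rating", "size"),
    mean_rating=("Rating", "mean"),
    rating_std=("Rating", "std"),
    median_installs=("Installs", "median"),
)
print(by_type)
```

The median is used for installs because install counts are heavily skewed, so a few very popular apps would dominate a mean.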

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ? 
Explain Briefly.

To achieve the business objective, we suggest the following to the client:

* Enhance User Experience: Focus on improving user experience by analyzing user reviews and ratings. Identify areas of improvement based on feedback to enhance app usability, features, and overall satisfaction.

* Optimize App Performance: Regularly monitor and analyze app performance metrics such as app size, update frequency, and response time. Optimize these factors to ensure smooth performance and efficient resource utilization.

* Targeted Marketing and Advertising: Utilize insights from the app category and user demographics to create targeted marketing and advertising campaigns. Tailor promotional activities to specific user segments to maximize reach and engagement.

* Continuous Innovation and Updates: Keep the app up-to-date with the latest trends and technologies. Continuously introduce new features, functionalities, and content to retain existing users and attract new ones.

* Monitor Competitor Landscape: Stay updated on the competition by monitoring similar apps in the market. Analyze their features, ratings, and user feedback to identify opportunities for differentiation and improvement.

* Incorporate User Feedback: Actively listen to user feedback through app reviews, ratings, and user surveys. Incorporate valuable suggestions and address user concerns promptly to foster a positive user experience.

* Build App Reputation: Strive to build a positive app reputation by maintaining a high rating, addressing user complaints, and providing exceptional customer support. A good reputation leads to increased user trust, which can result in higher user acquisition and retention rates.

* Analyze Monetization Strategies: Evaluate the effectiveness of monetization strategies such as in-app purchases, subscriptions, or ads. Analyze revenue generation patterns and user behavior to optimize monetization efforts while ensuring a positive user experience.

* Stay Compliant with Policies: Adhere to app store policies and guidelines to maintain a strong presence in the app market. Ensure compliance with privacy regulations, content guidelines, and data security measures.



# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***