# **Project Name**    -  **Exploratory Data Analysis on Google Play Store Dataset for App Development Insights**



##### **Project Type**    - EDA
##### **Contribution**    - Individual


# **Problem Statement**


#### **BUSINESS PROBLEM OVERVIEW**

The Google Play Store, formerly known as Android Market, serves as a significant digital distribution platform for various content types including apps, books, music, movies, and more. As the Android ecosystem continues to expand rapidly, analyzing Play Store data becomes crucial for extracting insights that benefit app development businesses.

This project conducts exploratory data analysis (EDA) on a Google Play Store dataset that describes Android apps through attributes such as category, rating, and size.

The objective is to derive actionable insights that can guide developers towards creating successful apps and capturing the Android market effectively.

#### **Define Your Business Objective?** 

This project aims to explore and analyze the Google Play Store dataset to uncover the essential factors influencing app engagement and success.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required. 
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits. 
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.


5. You have to create at least 20 logical & meaningful charts having important insights.

[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]







# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

### Dataset Loading

In [None]:
# Load Dataset
dataset=pd.read_csv("Play Store Data.csv")

### Dataset First View

In [None]:
# Dataset First view
dataset.head()

### Rows & Columns count

In [None]:
# Dataset Rows & Columns 
dataset.shape

In [None]:
# Get the datatype of each columns
dataset.dtypes

### Dataset Information

In [None]:
# Dataset Info
dataset.info()

In [None]:
# Statistical information of dataset
dataset.describe()

In [None]:
dataset.describe(include="all")

## ***2. Data Cleaning*** 
1. Missing values
2. Outliers
3. Checking datatypes
4. Duplicates

#### Handling Missing Values/Null Values

In [None]:
# Count of Missing Values/Null Values
# pandas only treats NaN/None as null. If missing values are encoded with other
# markers (e.g. "?", "_" or a blank space), replace them with NaN first, for example:
#     dataset.replace("?", np.nan, inplace=True)
# That step is not required here because the nulls are already NaN, so they are counted directly.
print(dataset.isnull().sum())
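
The comment above mentions replacing non-standard missing markers with NaN before counting. A minimal sketch of that step follows, using a small toy frame rather than the Play Store data; the list of placeholder tokens is an assumption and should be adapted to whatever markers a given file actually uses.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for a dataset with non-standard missing markers
# (illustrative values only, not taken from the Play Store data).
toy = pd.DataFrame({
    "Rating": ["4.1", "?", "3.9", " "],
    "Type": ["Free", "Paid", "_", "Free"],
})

# Tokens treated as missing here are an assumption; adjust to the data at hand.
placeholders = ["?", "_", "", " "]
normalized = toy.replace(placeholders, np.nan)

# After normalization, isna() sees every placeholder as a null value.
missing_per_column = normalized.isna().sum()
print(missing_per_column)
```

Once the markers are normalized this way, the same `isnull().sum()` call used in the cell above counts them correctly.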

In [None]:
# Visualizing the missing values
# Checking Null Value by plotting Heatmap
sns.heatmap(dataset.isna())
plt.title("Heatmap for detecting Null Values")
plt.show()


In [None]:
# Drop rows with nulls in the columns that have only a few missing values
dataset.dropna(subset=["Type","Content Rating","Current Ver","Android Ver"],axis=0,inplace=True)

In [None]:
print(dataset.isnull().sum())

In [None]:
# Inspect "Rating", the column with the largest number of null values
dataset["Rating"]

##### Checking Outliers

In [None]:
# "Rating" is already a continuous (float) column, so it can be passed to a boxplot directly.
# Otherwise it would first need typecasting, e.g. dataset["Rating"] = dataset["Rating"].astype(float)

sns.boxplot(data=dataset,x="Rating")

In [None]:
# Replace NaN values with the median, as "Rating" is continuous and contains outliers
# Continuous data with outliers: replace with median
# Continuous data without outliers: replace with mean
# Categorical data: replace with mode

median = dataset["Rating"].median()
# Assign the result back; fillna(inplace=True) on a selected column is
# deprecated in recent pandas and may not update the DataFrame
dataset["Rating"] = dataset["Rating"].fillna(median)
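
The imputation rule in the comments (median with outliers, mean without, mode for categorical data) can be sketched as a small helper. The function name `choose_fill_value` and the use of the 1.5 × IQR rule to decide whether outliers are present are assumptions made for illustration; the notebook itself judges outliers visually from the boxplot.

```python
import pandas as pd

def choose_fill_value(series: pd.Series):
    """Pick a fill value: mode for non-numeric data, median when the
    1.5 * IQR rule flags outliers, mean otherwise."""
    values = series.dropna()
    if not pd.api.types.is_numeric_dtype(values):
        return values.mode().iloc[0]
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    has_outliers = ((values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)).any()
    return values.median() if has_outliers else values.mean()

# Toy ratings with one low outlier (1.0) and one missing value
ratings = pd.Series([4.0, 4.2, 4.4, 4.6, 1.0, None])
fill_value = choose_fill_value(ratings)
filled = ratings.fillna(fill_value)

# Toy categorical column: the most frequent label is used
types = pd.Series(["Free", "Paid", "Free", None])
type_fill = choose_fill_value(types)
```

In the toy data the value 1.0 lies below the lower IQR fence, so the helper falls back to the median, which mirrors the choice made for "Rating" above.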

In [None]:
# Cleaned dataset
print(dataset.isnull().sum())

In [None]:
# Visualize the cleaned dataset 
sns.heatmap(dataset.isna())
plt.title("Heatmap after handling Null Values")
plt.show()

#### Checking Datatypes

In [None]:
# Checking Datatypes
dataset.dtypes

##### 1) Change the type of "Reviews" to int type

In [None]:
dataset["Reviews"]=dataset["Reviews"].astype(int)

##### 2) Clean the 'Size' data and change the type 'object' to 'float'

In [None]:
# Show column "Size"
dataset["Size"]

In [None]:
# One record has the value '1,000+' in "Size"; drop it because it is unclear whether the unit is 'M' or 'k'

index = dataset[dataset['Size'] == '1,000+'].index
dataset.drop(axis=0, inplace=True, index=index)
sizes = [i for i in dataset['Size']]
def clean_sizes(sizes_list):
    """
    As sizes are represented in 'M' and 'k', we remove 'M'
    and convert 'k'/kilobytes into megabytes
    """
    cleaned_data = []
    for size in sizes_list:
        if 'M' in size:
            size = size.replace('M', '')
            size = float(size)
        elif 'k' in size:
            size = size.replace('k', '')
            size = float(size)
            size = size/1024  # 1 megabyte = 1024 kilobytes
        # representing 'Varies with device' with value 0
        elif 'Varies with device' in size:
            size = float(0)
        cleaned_data.append(size)
    return cleaned_data
dataset['Size'] = clean_sizes(sizes)

#Typecasting
dataset['Size'] = dataset['Size'].astype(float)
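
As an alternative to the loop above, the same conversion can be sketched with vectorized pandas string operations. Note one deliberate difference, which is an assumption of this sketch and not the notebook's choice: values without an 'M' or 'k' unit (such as 'Varies with device') become NaN instead of 0, so they do not look like zero-sized apps in later statistics. The helper name `size_to_mb` is hypothetical.

```python
import pandas as pd

def size_to_mb(size: pd.Series) -> pd.Series:
    """Convert strings such as '19M' or '512k' to megabytes; anything
    else (e.g. 'Varies with device') becomes NaN."""
    text = size.astype(str).str.strip()
    # Everything except the last character is the numeric part
    number = pd.to_numeric(text.str[:-1], errors="coerce")
    # The last character is the unit; unknown units map to NaN
    factor = text.str[-1].map({"M": 1.0, "k": 1 / 1024})
    return number * factor

sample = pd.Series(["19M", "512k", "Varies with device", "8.7M"])
size_mb = size_to_mb(sample)
print(size_mb)
```

Whether 0 or NaN is the better placeholder depends on the later analysis: NaN is skipped by `mean()` and by plots that call `dropna()`, whereas 0 is counted as a real value.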

In [None]:
# Check datatype
dataset.dtypes

##### 3) Clean the 'Installs' data and change the type 'object' to 'float'

In [None]:
# Show column "Installs"
dataset["Installs"]

In [None]:
# Remove "," and "+" from the values

installs = [i for i in dataset['Installs']]
def clean_installs(installs_list):
    cleaned_data = []
    for install in installs_list:
        if ',' in install:
            install = install.replace(',', '')
        if '+' in install:
            install = install.replace('+', '')
        install = int(install)
        cleaned_data.append(install)
    return cleaned_data
        
dataset['Installs'] = clean_installs(installs)

# Typecasting
dataset['Installs'] =dataset['Installs'].astype(float)

In [None]:
# Check datatype 
dataset.dtypes

##### 4) Clean the 'Price' data and change the type 'object' to 'float'

In [None]:
# Show column "Price"
dataset["Price"]

In [None]:
# Remove "$" from the values
prices = [i for i in dataset['Price']]

def clean_prices(prices_list):
    cleaned_data = []
    for price in prices_list:
        if '$' in price:
            price = price.replace('$', '')
        cleaned_data.append(price)
    return cleaned_data
dataset['Price'] = clean_prices(prices)

# Typecasting
dataset['Price'] = dataset['Price'].astype(float)
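
The 'Installs' and 'Price' cleaning steps above can also be written with pandas string methods instead of explicit loops. This is a minimal sketch on toy values that imitate the raw formats; it is an alternative formulation, not the code used in this notebook.

```python
import pandas as pd

# Toy values imitating the raw string formats of the two columns
raw = pd.DataFrame({
    "Installs": ["10,000+", "500+", "1,000,000+"],
    "Price": ["0", "$4.99", "$399.99"],
})

# regex=False treats ',', '+' and '$' as literal characters, not regex syntax
installs = (
    raw["Installs"]
    .str.replace(",", "", regex=False)
    .str.replace("+", "", regex=False)
    .astype(int)
)
price = raw["Price"].str.replace("$", "", regex=False).astype(float)
```

Passing `regex=False` matters here because both `+` and `$` have special meanings in regular expressions.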

In [None]:
# Show datatype
dataset.dtypes

In [None]:
# look at the random 10 records in the apps dataframe to verify the cleaned columns
dataset.sample(10)

#### Handle Duplicate

In [None]:
# Duplicate Value Count
len(dataset[dataset.duplicated()]) # or dataset.duplicated().sum() 

In [None]:
# Remove duplicates 
dataset.drop_duplicates(inplace=True) 
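
`drop_duplicates()` without arguments removes only rows that are identical in every column. If the same app can appear more than once with different values in some columns (an assumption, not something verified in this notebook), a stricter check on the app name may be useful. The sketch below uses toy data and keeps the row with the most reviews per app, which is one possible rule among several.

```python
import pandas as pd

# Toy data: 'Alpha' appears twice with different review counts
apps = pd.DataFrame({
    "App": ["Alpha", "Alpha", "Beta"],
    "Reviews": [120, 150, 40],
})

# Rows identical in every column vs. rows sharing only the app name
exact_duplicates = int(apps.duplicated().sum())
same_name_duplicates = int(apps.duplicated(subset="App").sum())

# Keep, for each app, the row with the highest review count
deduplicated = (
    apps.sort_values("Reviews", ascending=False)
    .drop_duplicates(subset="App", keep="first")
    .sort_index()
)
```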

In [None]:
'''# Save the cleaned dataset 
import os

# File path
file_path = r"G:\python csv\EDA module 2\cleaned_play_store_dataset.csv"

# Check if the file already exists
if not os.path.exists(file_path):
    # Create directories if they don't exist
    os.makedirs(os.path.dirname(file_path), exist_ok=True)

# Save the DataFrame to a CSV file
dataset.to_csv(file_path, header=True)
'''

### What did you know about your dataset?

This project performs exploratory analysis on a dataset of apps listed on the Google Play Store.

The primary objective is to rectify data inconsistencies, explore the data thoroughly, and extract actionable insights that help app developers refine their products and improve engagement within the Android market.

In the steps above, the dataset was loaded, inspected, and preprocessed. It contains details about apps available on the Google Play Store, such as category, rating, and size. The raw dataset has 10841 rows and 13 columns.

Summary statistics were computed for all columns to understand the distribution and characteristics of the data. The inspection identified duplicates, null values, and outliers within the dataset.

Data cleaning was then performed to address these issues: removing duplicate entries, imputing or removing null values, and accounting for outliers (here, by imputing "Rating" with the median).

With these steps complete, the dataset is ready for further analysis.


## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
dataset.columns

In [None]:
# Dataset Describe
dataset.describe(include='all')

### Variables Description 

* App : Name of the Android application

* Category:  Category or genre to which the app belongs (e.g., Games, Education, Productivity)

* Rating: Average user rating of the app (on a scale of 1 to 5)

* Reviews: Total number of user reviews/ratings received by the app

* Size: Storage space required to install the app; given in megabytes ('M') or kilobytes ('k') in the raw data and converted to megabytes (MB) during cleaning

* Installs: Total number of installations or downloads of the app from the Google Play Store

* Type: Indicates whether the app is free or needs payment (e.g., Free, Paid)

* Price: Price of the app if it is not free, in US dollars (the raw values carry a "$" prefix)

* Content Rating: The content rating assigned to the app based on its suitability for different age groups (e.g., Everyone, Teen, Mature)

* Genres: Additional genres or subcategories associated with the app

* Last Updated: Date when the app was last updated on the Google Play Store

* Current Ver: Current version number of the app

* Android Ver: The minimum required Android version to run the app

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in dataset.columns.tolist():
  print("No. of unique values in ",i,"is",dataset[i].nunique(),".")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Create a copy of the current dataset and assign it to df
df=dataset.copy()

In [None]:
# Show columns
df.columns

1) ##### Count positive and negative ratings 

In [None]:
# "Rating" is a continuous column, so isin() matches exact values only:
# positive counts ratings equal to 4.0 or 5.0, negative counts ratings equal to 1.0 or 2.0
positive_ratings = [4, 5]
negative_ratings = [1, 2]

# Count positive and negative ratings
positive_count = df['Rating'].isin(positive_ratings).sum()
negative_count = df['Rating'].isin(negative_ratings).sum()

# Calculate ratio
ratio_positive_to_negative = positive_count / negative_count

print("Ratio of positive to negative rating:", ratio_positive_to_negative)
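
Because "Rating" holds decimal values, matching on exact values leaves out ratings such as 4.5 or 1.8. A threshold-based variant is sketched below on toy data. Treating 4 and above as positive and 2 and below as negative is an assumption about the intended ranges, and applying it to the real data would give a different ratio from the one printed above.

```python
import pandas as pd

# Toy ratings, including decimal values that exact matching would miss
ratings = pd.Series([4.5, 4.0, 3.2, 1.8, 5.0, 2.0, 4.7])

# Inclusive thresholds: >= 4 is positive, <= 2 is negative (assumed ranges)
positive_count = int((ratings >= 4).sum())
negative_count = int((ratings <= 2).sum())

# Guard against division by zero when no rating falls in the negative range
ratio = positive_count / negative_count if negative_count else float("nan")
print("Ratio of positive to negative rating:", ratio)
```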

2) ##### Calculate Mean Ratings by App and classify apps into positive rating and negative rating

In [None]:
# Calculate mean rating by app
mean_ratings_by_app = df.groupby('App')['Rating'].mean().sort_values(ascending=False).reset_index()
print(mean_ratings_by_app)

In [None]:
# Apps having positive rating
positive_ratings_4_to_5 = mean_ratings_by_app[(mean_ratings_by_app['Rating'] >= 4) & (mean_ratings_by_app['Rating'] <= 5)]
print(positive_ratings_4_to_5)
count_positive_ratings_4_to_5 = positive_ratings_4_to_5.shape[0]

# Display the count
print("Count of positive ratings between 4 and 5:", count_positive_ratings_4_to_5)

In [None]:
# Apps having negative rating
negative_ratings_1_to_2 = mean_ratings_by_app[(mean_ratings_by_app['Rating'] >= 1) & (mean_ratings_by_app['Rating'] <= 2)]
print(negative_ratings_1_to_2)
count_negative_ratings_1_to_2 = negative_ratings_1_to_2.shape[0]

# Display the count
print("Count of negative ratings between 1 and 2:", count_negative_ratings_1_to_2)

3) ##### Calculate the age of the App

In [None]:
# Convert the 'Last Updated' column to datetime
df['Last Updated'] = pd.to_datetime(df['Last Updated'])
# 'App_age' is the number of days since the app was last updated
current_date = pd.Timestamp.now()  # Get the current date and time
df['App_age'] = (current_date - df['Last Updated']).dt.days
print(df.columns)
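
Measuring against the current date means the values change every time the notebook is run. A reproducible variant is sketched below; using the newest update date in the data as the reference point is a hypothetical choice for illustration, and the dates are toy values.

```python
import pandas as pd

# Toy update dates (illustrative only)
updates = pd.Series(["2018-01-07", "2018-08-01", "2018-06-20"])
last_updated = pd.to_datetime(updates)

# Hypothetical choice: measure staleness relative to the newest update in the data
reference_date = last_updated.max()
days_since_update = (reference_date - last_updated).dt.days
print(days_since_update)
```

With a fixed reference date, rerunning the notebook later produces the same values, which keeps any written insights consistent with the output.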

4) ##### Calculate Mean Ratings on Category

In [None]:
mean_ratings_by_category = df.groupby('Category')['Rating'].mean().sort_values(ascending=False)
print(mean_ratings_by_category.head(10))

5) ##### Count Number of Apps by Category and identify the category with the highest count of Apps with its review

In [None]:
# Count no. of Apps by category
App_count_by_category = df.groupby('Category').size().sort_values(ascending=False)
print(App_count_by_category)

In [None]:
# Find the category with the highest count of Apps
category_with_highest_count = App_count_by_category.idxmax()
print("Category with the highest count of apps:", category_with_highest_count)
print("Maximum count:", App_count_by_category.max())
# Get the mean number of reviews for the category with the highest count
mean_review_for_highest_count = df[df['Category'] == category_with_highest_count]['Reviews'].mean()
print("Mean number of reviews for the category:", mean_review_for_highest_count)

6) ##### Correlate top 10 categories, App Counts, Reviews, and Ratings

In [None]:

top_10_categories = App_count_by_category.nlargest(10)

# Print the top 10 categories with their counts
# print("Top 10 categories with the highest counts:", top_10_categories)

# Create a dictionary to store the data
data = {
    'Category': [],
    'Count': [],
    'Mean Rating': [],
    'Mean Review': []
}

# Iterate over the top 10 categories
for category in top_10_categories.index:
    # Filter the DataFrame to include only rows corresponding to the current category
    category_df = df[df['Category'] == category]
    
    # Calculate the mean rating and review for the current category
    mean_rating = category_df['Rating'].mean()
    mean_review = category_df['Reviews'].mean()
    
    # Print category, count, mean rating, and mean review
    #print(f"Category: {category}")
    #print(f"Count: {top_10_categories[category]}")
    #print(f"Mean Rating: {mean_rating:.2f}")
    #print(f"Mean Review: {mean_review:.2f}\n")
    
    # Alternatively, collect the results so they can be printed as a DataFrame
    # Append the data to the dictionary
    data['Category'].append(category)
    data['Count'].append(top_10_categories[category])
    data['Mean Rating'].append(mean_rating)
    data['Mean Review'].append(mean_review)

# Create a DataFrame from the dictionary
result_df = pd.DataFrame(data)

# Print the DataFrame
print(result_df)
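
The loop above can also be expressed as a single grouped aggregation. The sketch below uses toy data and pandas named aggregation; the output column names `Count`, `Mean_Rating`, and `Mean_Reviews` are chosen here for illustration.

```python
import pandas as pd

# Toy data standing in for the cleaned dataset
apps = pd.DataFrame({
    "Category": ["GAME", "GAME", "FAMILY", "FAMILY", "FAMILY", "TOOLS"],
    "Rating": [4.0, 4.4, 4.2, 4.0, 4.4, 3.8],
    "Reviews": [100, 300, 50, 150, 100, 20],
})

# One pass: app count, mean rating and mean review count per category,
# then keep the categories with the most apps (top 2 in this toy example)
summary = (
    apps.groupby("Category")
    .agg(
        Count=("Rating", "size"),
        Mean_Rating=("Rating", "mean"),
        Mean_Reviews=("Reviews", "mean"),
    )
    .nlargest(2, "Count")
    .reset_index()
)
print(summary)
```

On the real data, replacing `2` with `10` would give the same kind of table as `result_df`.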

7) ##### Find largest App in Each Category:

In [None]:
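# Group by (Category, App) and rank all pairs by size in one descending list.
# Note: head(10) therefore returns the 10 largest apps overall, each with its category
max_size_app_by_category = df.groupby(['Category','App'])['Size'].max().sort_values(ascending=False)
print(max_size_app_by_category)
print()
print("Top 10 largest apps with their categories:")
print(max_size_app_by_category.head(10))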
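
To get exactly one largest app per category, `idxmax` within each group is one option. This is a minimal sketch on toy data; on the real dataset, a category whose sizes are all missing would need to be handled before calling `idxmax`.

```python
import pandas as pd

# Toy data: two categories with two apps each
apps = pd.DataFrame({
    "Category": ["GAME", "GAME", "FINANCE", "FINANCE"],
    "App": ["Game A", "Game B", "Bank A", "Bank B"],
    "Size": [100.0, 45.0, 12.0, 30.0],
})

# idxmax returns, per category, the row label of the largest Size
largest_rows = apps.groupby("Category")["Size"].idxmax()
largest_per_category = apps.loc[largest_rows, ["Category", "App", "Size"]]
print(largest_per_category)
```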

8) ##### Calculate Total Number of Installs by App and Content Rating

In [None]:
total_installs_by_app_rating = df.groupby(['App','Content Rating'])['Installs'].sum()
#Sort the result in descending order
sorted_installs_by_app_rating = total_installs_by_app_rating.sort_values(ascending=False)

# Print the total installs by content rating in descending order
#print("Total installs by content rating (descending order):")
#print(sorted_installs_by_app_rating)

# Reset the index to convert the multi-index Series into a DataFrame
total_installs_by_app_rating_df = sorted_installs_by_app_rating.reset_index()

# Print the DataFrame
print("Total installs by content rating and app:")
print(total_installs_by_app_rating_df.head(10))

9) ##### Calculate the average rating for free and paid apps.

In [None]:
# Filter the dataset to include only free or paid apps
free_apps = df[df['Type'] == 'Free']
paid_apps = df[df['Type'] == 'Paid']
# Calculate the average ratings for free and paid apps
avg_rating_free_apps = free_apps['Rating'].mean()
avg_rating_paid_apps = paid_apps['Rating'].mean()
print("Average Rating for Free Apps:")
print(avg_rating_free_apps)
print("Average Rating for Paid Apps:")
print( avg_rating_paid_apps)

In [None]:
# Top 10 Free apps with rating and review
# Sort the free apps by rating (descending) and then by review (descending)
free_apps_sorted = free_apps.sort_values(by=['Rating', 'Reviews'], ascending=False)

# Select the top 10 free apps with their rating and review 
top_10_free_apps = free_apps_sorted.head(10)[['App', 'Rating', 'Reviews']]

# Display the top 10 free apps with their rating and review count
print("Top 10 Free Apps with Rating and Review Count:")
print(top_10_free_apps)

In [None]:
# Top 10 Paid apps with rating and review
# Sort the paid apps by rating (descending) and then by review (descending)
paid_apps_sorted = paid_apps.sort_values(by=['Rating', 'Reviews'], ascending=False)

# Select the top 10 paid apps with their rating and review 
top_10_paid_apps = paid_apps_sorted.head(10)[['App', 'Rating', 'Reviews']]

# Display the top 10 paid apps with their rating and review count
print("Top 10 Paid Apps with Rating and Review Count:")
print(top_10_paid_apps)

In [None]:
# Number of free and paid apps in each category
top_10_free_apps_by_category = free_apps.groupby('Category').size().sort_values(ascending=False).head(10)
top_10_paid_apps_by_category = paid_apps.groupby('Category').size().sort_values(ascending=False).head(10)
print("Number of free apps by category ")
print(top_10_free_apps_by_category)
print()
print("Number of paid apps by category ")
print(top_10_paid_apps_by_category)

10) ##### Group the content rating based on category 

In [None]:
grouped_data = df.groupby(['Category', 'Content Rating']).size()
print(grouped_data)

11) ##### Correlation among Paid apps, Category and Price

In [None]:
# Filter the dataset to include only paid apps
paid_apps = df[df['Price'] != 0]

# Group the paid apps by category
#paid_apps_by_category = paid_apps.groupby(["App",'Category']).size().sort_values(ascending=False)

# Select the 'Price','Category' and 'App' columns
category_price_app = paid_apps[['Price','Category','App','Rating']]

# Sort the DataFrame by the 'Price' column in descending order and keep the top 10
sorted_category_price_app = category_price_app.sort_values(by='Price', ascending=False).head(10)

# Print the result
print(" Top 10 Apps, Categories, Ratings Based on Price:")
print(sorted_category_price_app)

12) ##### Paid Apps with Highest Number of Reviews

In [None]:
# Filter the dataset to include only paid apps
paid_apps = df[df['Price'] != 0]

# Group the data by app and calculate the total number of reviews for each app
total_reviews_paid_apps = paid_apps.groupby('App')['Reviews'].sum()

# Sort the result in descending order
sorted_reviews_paid_apps = total_reviews_paid_apps.sort_values(ascending=False)

# Print the result
print("Paid Apps with Highest Number of Reviews:")
print(sorted_reviews_paid_apps.head(10))


### What all manipulations have you done and insights you found?

Processing the Google Play Store app data yielded insights into user preferences, market trends, and potential areas for app development:

1. The ratio of positive to negative ratings is about 28.89. The calculation counts only ratings that are exactly 4 or 5 as positive and exactly 1 or 2 as negative, so for every such negative rating there are roughly 29 positive ones. This high ratio suggests that ratings in the dataset are predominantly positive and that users generally hold a favourable opinion of the apps.

2. Comparing mean ratings per app shows which apps users perceive positively or negatively. 7739 apps have a mean rating between 4 and 5, while only 66 have a mean rating between 1 and 2. Most users therefore appear satisfied, which suggests that the apps in the dataset generally meet or exceed expectations for features, functionality, and user experience.

3. A new column, ‘App_age’, records the number of days between an app's last update and the present. It indicates how actively an app is maintained, which in turn reflects its level of support and engagement, its competitiveness in the market, and its potential for future success. This information can guide decisions about whether to invest in or continue using an app.

4. Mean ratings by category show how user satisfaction differs across categories. Categories with higher mean ratings may align more closely with user preferences and needs, which points developers towards the types of apps users are most satisfied with. Categories with lower mean ratings may present opportunities for improvement and innovation. The top 10 categories in this dataset average between roughly 4.2 and 4.4: events, education, art_and_design, books_and_reference, personalization, and parenting sit at about 4.4, while beauty, game, health_and_fitness, and social sit at about 4.2.

5. Counting apps by category describes the competitive landscape of the app market and highlights potential areas for new development and investment. In this dataset, ‘family’ is the category with the most apps (1,939), and its apps receive approximately 204,627 reviews on average. The high app count marks "family" as the most saturated category, suggesting a high level of competition. Together with the large average number of reviews, it also suggests strong demand for family-oriented apps, which may give developers opportunities to create new, high-quality family apps that meet user needs.

6. Combining app counts with mean ratings and mean review counts for the top 10 categories shows which categories are most successful and helps align app content with market trends and user preferences. "FAMILY" and "GAME" have the highest number of apps, as well as favourable mean ratings and review counts.

7. The largest apps may be resource-intensive, requiring more storage and potentially more processing power. The 10 largest apps in the output span categories such as game, finance, family, health and fitness, sports, and lifestyle, and each has a size of 100 MB. The list includes several games, such as "Stickman Legends: Shadow Wars", "Car Crash III Beam DH Real Damage Simulator 2018", "Hungry Shark Evolution", "Mini Golf King - Multiplayer Game", and "Miami crime simulator". Studying their features, user experience, and overall design can help explain their popularity and guide developers in creating new games that appeal to similar audiences.

8. Total installs by app and content rating show user preferences, which helps developers adjust their app development and marketing strategy.

9. The average rating is 4.19 for free apps and 4.26 for paid apps. The difference is small, and both values suggest general satisfaction among users. Developers can use this information to tailor their strategy to their target audience and app type.

   * Top 10 free apps: all have a perfect rating of 5.0 with varying numbers of reviews. Apps that combine a perfect rating with a higher review count, such as "Ríos de Fe" and "FD Calculator (EMI, SIP, RD & Loan Eligibility)", may be considered particularly reliable and consistent in quality, while perfect ratings backed by few reviews need further analysis to verify their authenticity. The apps span religious apps, calculators, photography, technology news, interview questions, and more, which suggests that high-quality free apps can succeed across a range of categories.
   * Top 10 paid apps: all have a perfect rating of 5.0, but review counts are low, ranging from 4 to 13. Perfect ratings are rare and can indicate quality, yet with so few reviews they may not represent broader user sentiment. The list includes watch faces, music-related apps, communication tools, and icon packs.

10. Grouping content ratings (e.g., Everyone, Teen, Mature 17+) by category shows the type of content typically found in each app category. It helps developers decide which categories and content ratings to target based on their audience's age and content preferences.

11. Comparing price across paid apps and categories shows how pricing varies between app categories and can guide decisions in app development and marketing. The 10 most expensive apps come from categories including lifestyle, family, and finance, with prices often around 399.99 USD or more. Lifestyle apps are the most prevalent in the list, suggesting a focus on luxury or exclusive experiences. Ratings range from 3.5 to 4.4, which indicates that a high price does not necessarily come with a higher rating. Many of the apps have names related to wealth, such as "I'm Rich" or "I Am Rich Pro", suggesting that they are marketed towards users interested in luxury or exclusivity.

12. The top 10 paid apps by number of reviews are mostly games, such as "Minecraft", "Hitman Sniper", and "Grand Theft Auto: San Andreas". All of them have a significant number of reviews, ranging from 97,890 to 4,751,900, which suggests a large user base and popularity among paid apps.

These are the manipulations performed on the Play Store dataset and the insights they produced.

 

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

##### Chart-1-Histogram of App ratings (Univariate)

In [None]:
# Create a figure
plt.figure(figsize=(8, 6))
# Plot the data using Seaborn's histplot function
sns.histplot(df['Rating'], bins=10, kde=True)
# Add title and labels
plt.title('Distribution of App Ratings')
plt.xlabel('Rating')
plt.ylabel('Count')
# Display the plot
plt.show()


##### 1. Why did you pick the specific chart?
Histograms are excellent for visualizing the distribution of continuous variables, such as app ratings.

##### 2. What is/are the insight(s) found from the chart?
It shows the distribution of app ratings in this dataset: most ratings fall within the range of 4 to 4.7.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth ? Justify with specific reason.  

High app ratings (e.g., in the range of 4.0-5.0) suggest overall user satisfaction with the apps. This helps developers understand which app features and functionalities resonate well with users, so they can focus on similar characteristics in future app development.

##### Chart-2-Count plot on Count of Apps per Category  (Univariate)

In [None]:
# Create a figure
plt.figure(figsize=(12, 8))
# Plot the data using Seaborn's countplot function
sns.countplot(y='Category', data=df, order=df['Category'].value_counts().index, palette='viridis')  # Customize color palette
# Add title and labels
plt.title('Count of Apps per Category')
plt.xlabel('Count')
plt.ylabel('Category')
# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?
A count plot was chosen for its ability to clearly convey categorical data, allow easy comparisons, and identify trends and areas of opportunity.

##### 2. What is/are the insight(s) found from the chart?
It gives a clear picture of how apps are distributed across categories in our dataset. The "Family" category has the highest number of apps, which suggests strong demand for family-oriented apps.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Focusing on family-oriented apps can be a profitable strategy due to the high demand in the category, but it is important for developers and businesses to balance their app portfolio across different categories.

##### Chart-3-Density plot of app size (Univariate)

In [None]:
# Create a figure
plt.figure(figsize=(8, 6))
# Plot the data using Seaborn's kdeplot function 
sns.kdeplot(df['Size'].dropna(), color='blue')  # Customize color
# Add title and labels
plt.title('Density Plot of App Size')
plt.xlabel('App Size (MB)')
plt.ylabel('Density')
# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?
Density plots provide a smooth, continuous representation of the distribution of a continuous variable, such as app size. The shape of the density plot can reveal skewness in the data, indicating whether app sizes are generally larger or smaller.

##### 2. What is/are the insight(s) found from the chart?
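The peak of the density plot indicates the most common app sizes in the dataset, which suggests the size range that developers typically target when creating apps. Note that apps whose size "Varies with device" were encoded as 0 during cleaning, so density near 0 partly reflects that placeholder rather than genuinely tiny apps.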

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Developers can aim to keep their app sizes within the most common size range to match user expectations and device capabilities. Apps that are significantly larger than the common size range may lead to negative user experiences, such as long download times or high storage usage. However, focusing too heavily on minimizing app size could lead to neglecting important app features or content that users value.


##### Chart-4-Bar plot on Category Vs. Installs (Bivariate)

In [None]:
# Group data by category and calculate sum and mean of installs
category_group = df.groupby('Category')['Installs'].agg(['sum', 'mean'])

# Create a figure
plt.figure(figsize=(10, 6))

# Sort data based on total installs (sum)
sorted_data = category_group['sum'].sort_values()

# Plot the data using Seaborn's barplot function with a color palette
sns.barplot(x=sorted_data, y=sorted_data.index, palette="viridis")

# Add title and labels
plt.title('Total Installs per Category')
plt.xlabel('Total Installs')
plt.ylabel('Category')

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?
A bar chart is chosen for its clear visual representation of differences across categories. Each bar corresponds to a category and its length represents the total installs, making it straightforward for the viewer to identify which categories have higher or lower download counts.

##### 2. What is/are the insight(s) found from the chart?
The chart reveals that app categories such as Games, Communication, and Social have the highest total downloads. These are the categories that are most popular among users. At the opposite end, the chart identifies the app categories with the lowest total downloads. These categories may be niche or less in demand.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Knowing which categories have the highest total downloads can guide businesses toward focusing on these high-potential areas.
However, over-focusing on high-performing categories could lead to neglecting niche markets, which might result in missed opportunities for differentiation and innovation.
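
The chart code computes both the sum and the mean of installs per category but plots only the sum. A category with many apps can lead on total installs while a smaller category leads on installs per app. The sketch below illustrates this with a few hypothetical rows (the numbers are invented for illustration, not drawn from the dataset).

```python
import pandas as pd

# Hypothetical rows standing in for the cleaned df
sample = pd.DataFrame({
    "Category": ["GAME", "GAME", "GAME", "GAME", "GAME", "SOCIAL", "SOCIAL", "TOOLS"],
    "Installs": [5_000_000, 1_000_000, 500_000, 10_000, 1_000,
                 5_000_000, 1_000_000, 100_000],
})

# Number of apps, total installs and mean installs for each category
summary = (
    sample.groupby("Category")["Installs"]
    .agg(n_apps="count", total="sum", mean="mean")
    .sort_values("total", ascending=False)
)
print(summary)

# The leader by total installs need not be the leader by installs per app
top_by_total = summary["total"].idxmax()
top_by_mean = summary["mean"].idxmax()
print(top_by_total, top_by_mean)
```

Comparing both rankings helps separate categories that are large because they contain many apps from categories where an individual app tends to reach more users.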

##### Chart-5-Violin plot on Content Rating Vs. Rating (Bivariate)

In [None]:

# Create a figure
plt.figure(figsize=(10, 6))

# Create a violin plot using Seaborn
sns.violinplot(x='Content Rating', y='Rating', data=df, palette='Set2')

# Add title and labels
plt.title('Violin Plot of App Ratings by Content Rating')
plt.xlabel('Content Rating')
plt.ylabel('Rating')

# Display the plot
plt.show()


##### 1. Why did you pick the specific chart?
A violin plot is chosen because it combines the best features of a box plot and a density plot. It displays the full distribution of app ratings within each content rating category, including the range, median, quartiles, and density.

##### 2. What is/are the insight(s) found from the chart?

The shape of the violin provides information about the density of app ratings in each content rating category. Areas where the plot is wider indicate higher density (i.e., most apps in that content rating category fall within that range of ratings). Since most content rating categories have ratings concentrated in the 4-5 range, the trend of high app ratings appears to be consistent across different content ratings.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

If certain content rating categories consistently receive higher app ratings, this insight can guide app developers to focus on creating apps that align with these content ratings. This can increase the chances of developing popular apps that receive high ratings and positive user feedback.

When certain content rating categories consistently receive lower app ratings, businesses may neglect these areas. However, this could be a missed opportunity to innovate and improve the quality of apps in those categories, which could ultimately lead to growth.
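
The median and quartiles that a violin plot draws can also be tabulated, which makes it easier to compare content rating categories side by side. This is a minimal sketch on hypothetical ratings (the rows are invented for illustration); with the real data the same `groupby` would be applied to `df`.

```python
import pandas as pd

# Hypothetical ratings standing in for df[['Content Rating', 'Rating']]
ratings = pd.DataFrame({
    "Content Rating": ["Everyone"] * 5 + ["Teen"] * 4 + ["Mature 17+"] * 3,
    "Rating": [4.5, 4.2, 3.9, 4.7, 4.4, 4.3, 4.0, 4.6, 3.5, 4.1, 3.8, 4.4],
})

# Quartiles of the rating within each content rating category
quartiles = (
    ratings.groupby("Content Rating")["Rating"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
quartiles.columns = ["q1", "median", "q3"]

# Interquartile range: the spread of the middle half of the ratings
quartiles["iqr"] = quartiles["q3"] - quartiles["q1"]
print(quartiles)
```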

##### Chart-6-Pie chart on Distribution of Content Ratings (Univariate)

In [None]:
# Count the number of apps in each 'Content Rating' category
content_rating_distribution = df['Content Rating'].value_counts()
# Total number of apps, used to convert pie percentages back to counts
total_apps = content_rating_distribution.sum()
# Create a pie chart
plt.figure(figsize=(8, 8))
plt.pie(
    content_rating_distribution,
    labels=content_rating_distribution.index,
    # round() avoids off-by-one counts caused by int() truncating floats
    autopct=lambda p: f'{int(round(p * total_apps / 100))} ({p:.1f}%)',
    startangle=140
)

# Set the title of the pie chart
plt.title('Distribution of Content Ratings')

# Show the plot
plt.show()


##### 1. Why did you pick the specific chart?

A pie chart is chosen to represent the distribution of content ratings because of its strength in displaying proportions of a whole, here the share of apps that falls into each content rating category.

##### 2. What is/are the insight(s) found from the chart?

The vast majority of apps, 8371 (80.9%), are rated "Everyone," indicating that most apps in the Google Play Store are designed to be accessible to all age groups and suitable for a general audience. The other content ratings account for the remainder:

*   "Teen" accounts for 1146 (11.1%) of apps.
*   "Mature 17+" accounts for 447 (4.3%) of apps.
*   "Everyone 10+" accounts for 375 (3.6%) of apps.
*   "Adult 18+" accounts for an insignificant portion of apps, 3 (0%).

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

The dominance of "Everyone" rated apps (80.9%) indicates that most apps are accessible to a broad audience, which can lead to higher potential downloads and user engagement. With 11% of apps rated "Teen," there is a moderate focus on the teenage market. The very low proportion of apps rated "Mature 17+" (4.3%) and "Adult 18+" (0%) may limit opportunities to cater to older audiences, potentially missing out on specific market segments. The concentration of apps rated "Everyone" could also lead to a lack of diverse app experiences for other age groups.
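
The counts and percentages quoted for this chart can be cross-checked with a few lines of pandas. The sketch below uses only the five counts reported above; it assumes these five categories make up the whole pie, which the text does not state explicitly. With the real data, `df['Content Rating'].value_counts()` and `value_counts(normalize=True)` would give the same table directly.

```python
import pandas as pd

# Counts as quoted in the insight for this chart
counts = pd.Series(
    {
        "Everyone": 8371,
        "Teen": 1146,
        "Mature 17+": 447,
        "Everyone 10+": 375,
        "Adult 18+": 3,
    },
    name="apps",
)

# Share of each content rating as a percentage of the listed apps
share = (counts / counts.sum() * 100).round(1)

table = pd.DataFrame({"apps": counts, "percent": share})
print(table)
```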

##### Chart-7-Bar plot on Count of Apps Vs. Android Version (Bivariate)

In [None]:
# Group data by 'Android Ver' and calculate the count of apps for each version
android_version_distribution = df['Android Ver'].value_counts().reset_index()
android_version_distribution.columns = ['Android Version', 'Count']

# Create a bar chart
plt.figure(figsize=(12, 8))
sns.barplot(
    data=android_version_distribution,
    x='Android Version',
    y='Count',
    palette='coolwarm'
)

# Set plot title and labels
plt.title('Count of Apps for Each Required Android Version')
plt.xlabel('Android Version')
plt.ylabel('Count of Apps')

# Rotate the x-axis labels for better readability
plt.xticks(rotation=45)

# Show the plot
plt.show()


##### 1. Why did you pick the specific chart? 

Bar plots are excellent for comparing quantities across different categories or groups. In this case, the required Android versions serve as groups, and the bar plot clearly shows how many apps are available for each version.

##### 2. What is/are the insight(s) found from the chart?

The bar plot shows the count of apps for each required Android version. More apps fall within version 4.1 and up, which indicates that these versions are widely supported and targeted by app developers.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Focusing on newer Android versions allows developers to utilize the latest features and APIs, potentially leading to more modern and feature-rich apps that can enhance user experience and satisfaction. Users on older versions of Android who cannot access newer apps may feel left out or frustrated, which could harm the reputation of apps.

##### Chart-8-Lineplot on Average App Rating Vs. Android Version (Bivariate)

In [None]:
# Create the line plot (seaborn aggregates ratings to their mean per version)
plt.figure(figsize=(12, 6))
sns.lineplot(
    data=df,
    x='Android Ver',
    y='Rating'
)

# Set the plot title and labels
plt.title('Average App Rating by Android Version')
plt.xlabel('Android Version')
plt.ylabel('Average Rating')

# Rotate x-axis labels for better readability (optional)
plt.xticks(rotation=90)

# Show the plot
plt.show()


##### 1. Why did you pick the specific chart?

A line plot was chosen to show clearly how average app ratings vary across different Android versions. Line plots are well suited to visualizing trends over a sequence, such as the progression of Android versions.
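
For the x-axis to read as a progression, the version labels need a numeric order, because plain string sorting places "10.0 and up" before "2.3 and up". The following is a minimal sketch of one way to order such labels; the labels are hypothetical examples in the style of the 'Android Ver' column, and `version_key` is a helper name introduced here for illustration.

```python
import re

import pandas as pd

# Hypothetical labels in the style of the 'Android Ver' column
versions = pd.Series([
    "4.1 and up", "2.3 and up", "Varies with device",
    "4.0.3 and up", "10.0 and up", "4.0 and up",
])


def version_key(label):
    """Return a sortable tuple of ints from the leading version number, or None."""
    match = re.match(r"(\d+(?:\.\d+)*)", label)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


labels = versions.dropna().unique()
numeric = [v for v in labels if version_key(v) is not None]
other = [v for v in labels if version_key(v) is None]

# Numeric versions first in ascending order, non-numeric labels at the end
ordered = sorted(numeric, key=version_key) + other

# An ordered categorical keeps this order when the data is sorted
as_category = pd.Categorical(versions, categories=ordered, ordered=True)
print(ordered)
```

Sorting the data by such an ordered categorical before plotting is one way to make the line follow the version sequence rather than an arbitrary label order.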

##### 2. What is/are the insight(s) found from the chart?

The average app rating falls within the range of 4 to 4.5 across every Android version. This indicates a relatively high and consistent level of satisfaction among users regardless of the Android version.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

The uniformity in ratings across Android versions suggests a stable market, which can provide developers with the confidence to invest in apps knowing that their user base will be satisfied across different versions of Android. Although average ratings are within a high range, there may still be areas for improvement in specific apps or categories. Developers should analyze individual app performance and user feedback for specific areas that need attention.

##### Chart-9-Scatter Plot on App Size vs. Rating with Number of Installs (Multivariate)

In [None]:
# Create a scatter plot with bubble size representing the number of installs
scatter = sns.scatterplot(
    data=df,
    x='Size',
    y='Rating',
    hue='Category',
    size='Installs',
    sizes=(10, 300),
    alpha=1,
    palette='viridis'
)

# Set plot title and labels
plt.title('Scatter Plot: App Size vs. Rating with Number of Installs')
plt.xlabel('App Size')
plt.ylabel('App Rating')

# Position the legend to the right of the plot
scatter.legend(loc='center left', bbox_to_anchor=(1, 0.5), ncol=3, frameon=True)

# Show the plot
plt.show()


##### 1. Why did you pick the specific chart?

A scatter plot with a third variable (bubble plot) is chosen for its ability to visualize relationships between multiple variables of the Google Play Store dataset, namely app size, rating, and number of installs, in a single plot.

##### 2. What is/are the insight(s) found from the chart?

The plot shows that apps with ratings in the range of 3.5 to 4.7 tend to have a high number of installs. This suggests that users may prefer apps within this rating range, leading to higher popularity and more downloads. There is also a trend of decreasing installs as app size increases, particularly within the rating range of 3.5 to 4.7. This may imply that users prefer smaller apps, possibly because they are less demanding on device resources (e.g., storage, memory) and can be downloaded and installed more quickly.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

By focusing on apps within the optimal rating range and smaller size, marketing campaigns can emphasize these attributes to appeal to user preferences. This can increase app visibility and boost downloads. While keeping app size small is generally preferred, over-reducing app size could lead to the removal of essential features or compromises in app quality. This may harm user experience and ultimately lead to lower app ratings and installs.
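
A trend read off a dense bubble plot, such as installs decreasing with size, can be cross-checked with a rank correlation and a grouped median. This is a minimal sketch on hypothetical values (invented for illustration), and a correlation of this kind describes association only, not cause. Spearman's correlation is computed here as the Pearson correlation of the ranks.

```python
import pandas as pd

# Hypothetical sizes (MB) and installs, standing in for df[['Size', 'Installs']]
sample = pd.DataFrame({
    "Size": [3.0, 8.0, 15.0, 25.0, 40.0, 60.0, 90.0],
    "Installs": [10_000_000, 5_000_000, 1_000_000, 1_000_000,
                 500_000, 100_000, 50_000],
})

# Spearman correlation: Pearson correlation computed on the ranks
spearman = sample["Size"].rank().corr(sample["Installs"].rank())

# Median installs within each size tercile
sample["size_band"] = pd.qcut(sample["Size"], q=3, labels=["small", "medium", "large"])
median_by_band = sample.groupby("size_band", observed=True)["Installs"].median()

print(f"Spearman correlation: {spearman:.2f}")
print(median_by_band)
```

Ranks and medians are used because install counts span several orders of magnitude, so a few very popular apps would otherwise dominate the result.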

##### Chart-10-Correlation Heatmap on Rating, Reviews, Size, Installs and Price (Multivariate)

In [None]:
# Select the continuous columns 
continuous_columns = ['Rating','Reviews','Size','Installs','Price']

# Calculate the correlation matrix for the selected columns
correlation_matrix = df[continuous_columns].corr()

# Create a figure
plt.figure(figsize=(10, 8))

# Create a heatmap using Seaborn
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')

# Add title
plt.title('Correlation Heatmap')

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?
A heatmap provides a visual representation of the correlations between continuous variables in the dataset. It uses a color scale to indicate the strength and direction (positive or negative) of the relationship between pairs of variables.

##### 2. What is/are the insight(s) found from the chart?
It shows the correlation coefficients between pairs of the columns "Rating", "Reviews", "Size", "Installs", and "Price". Positive correlations are indicated by warm colors (e.g., red), while negative correlations are indicated by cool colors (e.g., blue). It reveals a strong positive correlation between "Installs" and "Reviews", suggesting that apps with more installs tend to have more reviews. There is no correlation between "Installs" and "Size", meaning there is no linear relationship between them, while "Price" is negatively correlated with "Rating", "Reviews", "Size", and "Installs".

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

A heatmap can reveal relationships within the dataset. A strong correlation between certain variables might suggest an area of growth or opportunity that businesses can capitalize on.
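
To list the strongest relationships without reading them off the colors, the correlation matrix can be flattened into ranked pairs. The sketch below uses hypothetical numeric rows (invented for illustration) and keeps only the upper triangle so that each pair appears once and the diagonal is excluded.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric columns standing in for df[continuous_columns]
data = pd.DataFrame({
    "Rating": [4.1, 4.5, 3.9, 4.7, 4.3, 4.0],
    "Reviews": [120, 5400, 80, 98000, 2300, 150],
    "Installs": [10_000, 500_000, 5_000, 10_000_000, 100_000, 10_000],
    "Price": [0.0, 0.0, 2.99, 0.0, 0.0, 4.99],
})
corr = data.corr()

# Boolean mask that is True strictly above the diagonal
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# One row per column pair, ordered by absolute correlation
pairs = (
    corr.where(mask)
    .stack()
    .dropna()
    .sort_values(key=abs, ascending=False)
)
print(pairs)
```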


##### Chart-11-Sub plots for Reviews, Size, Installs and Price Vs. Rating

In [None]:
print(df.dtypes)

In [None]:
# Plot the graphs of reviews, size, installs and price per rating

# Summing every column with df.groupby('Rating').sum() raises a TypeError
# because datetime64 columns such as 'Last Updated' do not support sum,
# so only the numeric columns that are plotted below are summed.
rating_df = df.groupby('Rating')[['Reviews', 'Size', 'Installs', 'Price']].sum().reset_index()
# Create the subplots
fig, axes = plt.subplots(1, 4, figsize=(14, 4))

axes[0].plot(rating_df['Rating'], rating_df['Reviews'], 'r')
axes[0].set_xlabel('Rating')
axes[0].set_ylabel('Reviews')
axes[0].set_title('Reviews Per Rating')

axes[1].plot(rating_df['Rating'], rating_df['Size'], 'g')
axes[1].set_xlabel('Rating')
axes[1].set_ylabel('Size')
axes[1].set_title('Size Per Rating')

axes[2].plot(rating_df['Rating'], rating_df['Installs'], 'g')
axes[2].set_xlabel('Rating')
axes[2].set_ylabel('Installs (e+10)')
axes[2].set_title('Installs Per Rating')

axes[3].plot(rating_df['Rating'], rating_df['Price'], 'k')
axes[3].set_xlabel('Rating')
axes[3].set_ylabel('Price')
axes[3].set_title('Price Per Rating')

plt.tight_layout(pad=2)
plt.show()

##### 1. Why did you pick the specific chart?
Subplots can visualize multiple relationships within a single figure, allowing for easier comparison and analysis.

##### 2. What is/are the insight(s) found from the chart?
The line plots in the 1x4 subplot show how the totals of reviews, size, installs, and price relate to app rating.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

The gained insights from these subcharts can drive positive business impact through targeted improvements and informed decision-making. However, not acting on negative trends or ignoring user feedback can lead to negative growth. It is crucial to use these insights to guide strategic decisions, continuous improvement, and adaptation to evolving market demands.
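
Because the subplots use sums per rating, each line partly reflects how many apps share that rating rather than how a typical app behaves. A minimal sketch with hypothetical rows (invented for illustration) shows how the total and the mean can peak at different ratings:

```python
import pandas as pd

# Hypothetical rows standing in for df[['Rating', 'Reviews']]
sample = pd.DataFrame({
    "Rating": [3.0, 4.3, 4.3, 4.3, 4.3, 5.0],
    "Reviews": [900, 200, 300, 250, 250, 50],
})

# App count, total reviews and mean reviews for each rating value
per_rating = sample.groupby("Rating")["Reviews"].agg(
    n_apps="count", total="sum", mean="mean"
)
print(per_rating)

# Rating with the largest total versus rating with the largest mean
peak_total = per_rating["total"].idxmax()
peak_mean = per_rating["mean"].idxmax()
print(peak_total, peak_mean)
```

Plotting the mean alongside the sum would help separate the effect of app counts from per-app behaviour.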

##### Chart-12-Count plot-App distribution (Univariate)

In [None]:

# Set the figure size
plt.figure(figsize=(10, 5))
sns.countplot(data=df, x='Type', palette='Set2')

# Set the plot title
plt.title('Type Distribution')

# Set the y-axis label
plt.ylabel('Number of Apps')

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?
A count plot is chosen for visualizing the distribution of application types because it provides a straightforward and clear representation of the frequency of each category in a given column (in this case, the 'Type' column). This type of plot is excellent for showing how many instances of each app type (e.g., 'Free' or 'Paid') exist in the dataset.

##### 2. What is/are the insight(s) found from the chart?
It shows that 'Free' apps are more prevalent than 'Paid' apps in this dataset.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

The prevalence of 'Free' apps suggests a general trend in the app market towards offering apps at no cost to the user. This may indicate that users are more inclined to download and use free apps compared to paid ones. However, it is essential to strike a balance between offering free apps and establishing sustainable revenue streams to ensure long-term growth and success.
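
The claim that users are more inclined to download free apps can be examined by pairing the share of each type with its typical install count. This is a sketch on hypothetical rows (invented for illustration, not taken from the dataset):

```python
import pandas as pd

# Hypothetical rows standing in for df[['Type', 'Installs']]
sample = pd.DataFrame({
    "Type": ["Free", "Free", "Free", "Free", "Paid",
             "Free", "Free", "Paid", "Free", "Free"],
    "Installs": [1_000_000, 500_000, 100_000, 50_000, 10_000,
                 10_000_000, 5_000, 1_000, 1_000_000, 100_000],
})

# Percentage of apps of each type
type_share = sample["Type"].value_counts(normalize=True).mul(100).round(1)

# Median installs per type, which is less sensitive to a few very popular apps
median_installs = sample.groupby("Type")["Installs"].median()

print(type_share)
print(median_installs)
```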

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ? 
Explain Briefly.


**To achieve successful app development**

*	Focus on Positive User Experience
*	Target High-Demand Categories
*	Cater to Specific Age Groups
*	Monitor Free and Paid App Ratings
*	Leverage Popular and Reliable Apps
*	Optimize Pricing Strategies
*	Capitalize on Market Trends
*	Customize Content Ratings and Categories


# **Conclusion**

Through exploratory data analysis of Google Play Store datasets, valuable insights are obtained to guide app development businesses towards creating successful apps and capturing the Android market effectively. By understanding user preferences, app attributes, and customer sentiments, developers can make informed decisions to optimize app engagement and drive success in the competitive app market.


### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***