<a href="https://colab.research.google.com/github/Shubh18699/My-repository/blob/main/Play_Store_Data_EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Team Member 1 -** Shubham Pathak

# **Project Summary -**

Write the summary here within 500-600 words.

# **GitHub Link -**

https://github.com/Shubh18699

# **Problem Statement**


The aim of this EDA project is to gain insights into the Google Play Store ecosystem and user sentiments by analyzing two datasets: one containing information about apps available on the Google Play Store, and another containing user reviews for these apps. The project seeks to answer key questions such as:

**App Performance Analysis**: What are the characteristics of top-performing apps on the Google Play Store in terms of ratings, number of installs, and user reviews?

**Category Trends**: Which app categories are most popular among users? Are there any specific categories that tend to receive higher ratings or more positive user sentiments?

**Impact of App Attributes**: How do factors such as app size, price, content rating, and last update date correlate with user ratings and sentiments?

**User Sentiment Analysis**: What are the overall sentiments expressed by users in their reviews of Google Play Store apps? Are there any patterns or trends in sentiment polarity and subjectivity across different app categories or versions?

**App Improvement Opportunities**: Based on user reviews and sentiments, what are the common strengths and weaknesses of apps on the Google Play Store? What actionable insights can app developers derive to enhance user satisfaction and engagement?

#### **Define Your Business Objective?**

By conducting comprehensive exploratory data analysis on these datasets, I aim to provide valuable insights for app developers, marketers, and stakeholders in the mobile app industry to make informed decisions and optimize their strategies for app development, marketing, and user engagement.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
*   Will the gained insights help creating a positive business impact? Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

import numpy as np
import pandas as pd

### Dataset Loading

In [None]:
# Load Dataset

play_store_df = pd.read_csv("/content/Play Store Data.csv")

In [None]:
user_reviews_df = pd.read_csv("/content/User Reviews.csv")

### Dataset First View

In [None]:
# Dataset First Look

play_store_df.head(10)

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count

# shape returns a (rows, columns) tuple
rows, cols = play_store_df.shape

print("Number of Rows: ", rows)
print("Number of Columns: ", cols)

### Dataset Information

In [None]:
# Dataset Info

play_store_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

duplicate_apps_count = play_store_df.duplicated().sum()

# Check for duplicates in the user reviews dataset
duplicate_reviews_count = user_reviews_df.duplicated().sum()

print("Duplicate count in Play Store Dataset:", duplicate_apps_count)
print("Duplicate count in User Reviews Dataset:", duplicate_reviews_count)


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count

# Count missing/null values in the Google Play Store dataset
play_store_missing_values = play_store_df.isnull().sum()

# Count missing/null values in the user reviews dataset
user_reviews_missing_values = user_reviews_df.isnull().sum()

# Print the missing/null value counts for each dataset
print("Missing/Null Value Counts in Play Store Dataset:")
print(play_store_missing_values)
print("\nMissing/Null Value Counts in User Reviews Dataset:")
print(user_reviews_missing_values)

In [None]:
# Visualizing the missing values

import matplotlib.pyplot as plt

# Plot the number of missing values in each column of the Google Play Store dataset
plt.figure(figsize=(8, 3))
plt.title("Number of Missing Values in Google Play Store Dataset")
play_store_missing_values.plot(kind='bar')
plt.ylabel("Number of Missing Values")
plt.xlabel("Columns")
plt.xticks(rotation=45)
plt.show()

# Plot the number of missing values in each column of the User Reviews dataset
plt.figure(figsize=(8, 3))
plt.title("Number of Missing Values in User Reviews Dataset")
user_reviews_missing_values.plot(kind='bar')
plt.ylabel("Number of Missing Values")
plt.xlabel("Columns")
plt.xticks(rotation=45)
plt.show()

**Inference:** Here I've used bar plots to show the number of missing values in each column. These visualizations help me identify the columns with the most missing values, so I can make informed decisions about how to handle them in my data preprocessing steps.
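If a share of missing values per column is wanted rather than a raw count, one way to compute it is sketched below. This is a minimal illustration on made-up data: `toy_df` and its values are assumptions standing in for the real dataset, and the same two lines could be applied to `play_store_df` or `user_reviews_df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset; the values are invented for illustration.
toy_df = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Rating": [4.1, np.nan, 3.9, np.nan],
    "Size": ["19M", "14M", None, "25M"],
})

# isnull().mean() gives the fraction of missing entries per column;
# multiplying by 100 turns it into a percentage.
missing_pct = (toy_df.isnull().mean() * 100).round(1)

# Keep only the columns that actually have gaps, largest share first.
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct)
```

A percentage makes columns comparable across datasets of different sizes, which a raw count does not.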

### What did you know about your dataset?

This dataset contains information about mobile applications available on the Google Play Store. It includes attributes such as app name, category, rating, number of installs, price, content rating, size, last updated date, and more. Such datasets are often used for analyzing trends in app categories, popularity, user ratings, and other aspects of the app ecosystem.







## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns

play_store_columns = play_store_df.columns.tolist()
print("Columns present in the Play Store dataset: ", play_store_columns)

In [None]:
# Dataset Describe

play_store_df.describe()

**Inference:** This step provides descriptive statistics for the numeric columns in the Play Store dataset; by default, `describe()` skips non-numeric columns.
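Text columns can be summarized as well. The sketch below uses a made-up frame (`toy_df` and its values are assumptions, not the real dataset) to contrast the default call with `include="all"`, which adds the `unique`, `top` and `freq` rows for non-numeric columns.

```python
import pandas as pd

# Toy stand-in for the real dataset; the values are invented for illustration.
toy_df = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Category": ["GAME", "GAME", "TOOLS"],
    "Rating": [4.5, 3.8, 4.1],
})

# Default behaviour: only numeric columns are summarized.
numeric_summary = toy_df.describe()

# include="all" also reports unique/top/freq for non-numeric columns.
full_summary = toy_df.describe(include="all")
print(full_summary)
```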

### Variables Description

**App:** The name of the mobile application.

**Category:** The category to which the app belongs (e.g., Games, Education, Tools).

**Rating:** The overall user rating of the app.

**Reviews:** The number of user reviews for the app.

**Size:** The size of the app.

**Installs:** The number of times the app has been installed.

**Type:** Whether the app is free or paid.

**Price:** The price of the app, if it is not free.

**Content Rating:** The age group for which the app is intended.

**Genres:** Additional categorization of the app beyond the main category.

**Last Updated:** The date when the app was last updated.

**Current Ver:** The current version of the app.

**Android Ver:** The minimum required Android version to run the app.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

for column in play_store_df.columns:
    unique_values = play_store_df[column].nunique()
    print(f"Unique values in column '{column}':", unique_values)

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

# 1. Handling Missing Values
play_store_df.dropna(inplace=True)  # Dropping rows with missing values
user_reviews_df.fillna("", inplace=True)  # Filling missing values in user reviews with empty string

# 2. Data Type Conversion (if needed)
play_store_df['Last Updated'] = pd.to_datetime(play_store_df['Last Updated'])

# 3. Merging Datasets (if applicable)
merged_df = pd.merge(play_store_df, user_reviews_df, on='App', how = 'inner')

# 4. Feature Engineering (if needed)
merged_df['Year_Last_Updated'] = merged_df['Last Updated'].dt.year

merged_df.head()

### What all manipulations have you done and insights you found?

**Handling Missing Values:** In the Play Store dataset, I've dropped rows with missing values using the dropna() method to ensure data integrity.
In the user reviews dataset, I've filled missing values with an empty string using fillna("") to maintain the structure of the dataset. Because this is applied to every column, any numeric column that had gaps will afterwards hold a mix of numbers and strings.

**Data Type Conversion:** Converted the "Last Updated" column in the Play Store dataset to datetime format using pd.to_datetime() for easier manipulation and analysis.

**Merging Datasets:** Merged the Play Store dataset and user reviews dataset based on the common column "App" using the merge() function to combine information from both datasets into a single DataFrame. This allows for deeper analysis by correlating app attributes with user sentiments.

**Feature Engineering:** Created a new feature "Year_Last_Updated" by extracting the year from the "Last Updated" column. This enables analysis of app update trends over time.
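An inner merge silently drops apps that have no reviews and reviews whose app is missing from the app table. A hedged way to see how much is lost is to run an outer merge with `indicator=True` first. The sketch below uses made-up frames (`apps`, `reviews` and the `Review` column are hypothetical stand-ins for the two datasets).

```python
import pandas as pd

# Toy stand-ins for the two datasets; names and values are invented.
apps = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Category": ["TOOLS", "GAME", "GAME"],
})
reviews = pd.DataFrame({
    "App": ["A", "A", "D"],
    "Review": ["good", "bad", "ok"],
})

# An outer merge keeps every row from both sides; the _merge column
# records whether a row matched ("both") or came from one side only.
check = pd.merge(apps, reviews, on="App", how="outer", indicator=True)
counts = check["_merge"].value_counts()
print(counts)

# The inner merge keeps only the rows labelled "both" above.
merged = pd.merge(apps, reviews, on="App", how="inner")
print("Rows after inner merge:", len(merged))
```

Comparing the `left_only` and `right_only` counts with the `both` count shows how representative the merged DataFrame is of each source.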

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code

category_counts = play_store_df['Category'].value_counts()
plt.figure(figsize = (10, 6))
category_counts.plot(kind = 'bar', color = '#FF5733')
plt.title('App Category Distribution on Google Play Store')
plt.xlabel('App Category')
plt.ylabel('Number of Apps')
plt.xticks(rotation = 90)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

I picked a bar chart because it effectively visualizes the distribution of categorical data (app categories) by displaying the counts of apps in each category.

##### 2. What is/are the insight(s) found from the chart?

The chart shows how apps are distributed across categories on the Google Play Store, revealing which categories have more or fewer apps available.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, understanding the distribution of apps across categories can inform strategic decisions for targeting specific app categories with higher market demand, potentially leading to increased user engagement and revenue generation.

#### Chart - 2

In [None]:
# Chart - 2 visualization code

plt.figure(figsize=(8, 6))
plt.hist(play_store_df['Rating'], bins=20, color='skyblue', edgecolor='black')
plt.title('Distribution of App Ratings')
plt.xlabel('Rating')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a histogram to visualize the distribution of app ratings because it provides a clear representation of the frequency distribution of ratings across different intervals, allowing for easy identification of the most common rating ranges.

##### 2. What is/are the insight(s) found from the chart?

The histogram shows the range in which most app ratings cluster, which reflects the general satisfaction level among users, along with a few outliers at the extremes.






##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Yes, gaining insights from app ratings distribution can inform app developers about areas for improvement, potentially leading to better user satisfaction and increased downloads, ultimately resulting in a positive business impact.

#### Chart - 3

In [None]:
# Chart - 3 visualization code

app_installs = play_store_df.groupby('Category')['Installs'].sum().sort_values(ascending = False)
plt.figure(figsize = (10,6))
app_installs.plot(kind = 'bar', color = '#cc0066')
plt.title('App Category vs Installs')
plt.xlabel('App Category')
plt.ylabel('Installs')
plt.xticks(rotation = 90)
plt.show()

##### 1. Why did you pick the specific chart?


I chose a bar chart because it effectively compares the total number of installs across app categories, allowing for easy visual comparison.

##### 2. What is/are the insight(s) found from the chart?

The insight from the chart is that certain app categories have significantly higher numbers of installs compared to others, indicating their popularity among users on the Google Play Store.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the gained insights can help prioritize app categories with the highest number of installs, potentially leading to strategic decisions that maximize business growth and revenue.
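Summing installs per category only gives meaningful totals when `Installs` is numeric. If the column is stored as text such as `"10,000+"` (an assumption here, worth confirming with `play_store_df.info()`), it would need converting first. A minimal sketch on made-up data, where `toy_df` and `Installs_num` are hypothetical names:

```python
import pandas as pd

# Toy stand-in; the text format of Installs is an assumption for illustration.
toy_df = pd.DataFrame({
    "Category": ["GAME", "GAME", "TOOLS"],
    "Installs": ["1,000+", "500+", "10,000+"],
})

# Remove the thousands separators and the trailing plus sign.
cleaned = toy_df["Installs"].str.replace(r"[+,]", "", regex=True)

# Convert to numbers; anything unparseable becomes NaN instead of raising.
toy_df["Installs_num"] = pd.to_numeric(cleaned, errors="coerce")

# Total installs per category, largest first.
installs_by_category = (
    toy_df.groupby("Category")["Installs_num"]
    .sum()
    .sort_values(ascending=False)
)
print(installs_by_category)
```

Since a value like `"10,000+"` is a lower bound rather than an exact count, the resulting totals are best read as approximate.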

#### Chart - 4

In [None]:
# Chart - 4 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 5

In [None]:
# Chart - 5 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 6

In [None]:
# Chart - 6 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 7

In [None]:
# Chart - 7 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 8

In [None]:
# Chart - 8 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 9

In [None]:
# Chart - 9 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

In [None]:
# Chart - 10 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

Answer Here.

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***