<a href="https://colab.research.google.com/github/ShouryaKumar1996/EDA_Project/blob/main/EDA_Submission_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - EDA on Playstore app data




##### **Project Type**    - EDA on Playstore Dataset
##### **Contribution**    - Individual
##### **Prepared By**    - Shourya Kumar




# **Project Summary -**

In this project, the aim is to conduct an Exploratory Data Analysis (EDA) on a dataset comprising information from the Google Play Store. The dataset in question is divided into two main parts: "Play Store Data.csv" and "User Reviews.csv". The "Play Store Data.csv" file includes various details about apps available in the Google Play Store, such as app names, categories, ratings, reviews, size, installs, type (free or paid), price, content rating, genres, last updated, current version, and Android version. Meanwhile, the "User Reviews.csv" file contains user-generated reviews for these apps, providing insights into user sentiment and feedback.

This project's core objective is to extract insights and trends from the dataset that can help developers, marketers, and analysts understand the current app market landscape on the Google Play Store. It involves cleaning the data by handling missing values, converting non-numeric values to a numeric format suitable for analysis, and deriving additional metrics that offer deeper insight into app performance and user preferences. The project also answers a set of business questions, listed below, that draw out these insights.


# **GitHub Link -**

https://github.com/ShouryaKumar1996/EDA_Project

# **Business Questions -**

Q1. How are apps distributed across different categories in the Google Play Store?

A bar chart could be used to visualize the number of apps per category.

Q2. What is the relationship between app ratings and the number of reviews?

A scatter plot could help identify if apps with higher ratings generally receive more reviews.

Q3. How do app ratings vary across different content ratings (e.g., Everyone, Teen, Mature)?

A box plot could be utilized to compare the distribution of ratings across different content ratings.

Q4. What are the top 10 apps by number of installs in each category?

A horizontal bar chart could showcase the most popular apps in each category based on installs.

Q5. How does the size of an app affect its rating?

A scatter plot or hexbin plot can show the correlation between app size and its rating.

Q6. Is there a trend in the amount of free versus paid apps across different categories?

A stacked bar chart could illustrate the proportion of free vs. paid apps in each category.

Q7. What is the distribution of app prices in the paid apps category?

A histogram or KDE (Kernel Density Estimate) plot can provide insights into how app prices are distributed.

Q8. How frequently are apps updated, and does update frequency correlate with higher ratings?

A scatter plot comparing the last update date with ratings, possibly aggregating by year or month to identify trends.

Q9. What are the common words used in positive and negative reviews in the User Reviews dataset?

Word clouds could be used to visually represent the most common words in positive vs. negative reviews.

Q10. How does the sentiment polarity of reviews vary across top categories?

A box plot comparing sentiment polarity for different app categories could reveal which categories have more positive or negative reviews.

Q11. What is the relationship between the number of installs and the app's rating?

A scatter plot with a trend line could help visualize if more installs correlate with higher ratings.

Q12. How are apps rated across different genres within the same category?

A grouped bar chart could compare average ratings of apps within different genres of a single category.

Q13. What percentage of apps support different versions of Android, and how does this relate to app ratings?

A pie chart for Android version support and a scatter plot for rating by Android version requirement might uncover trends.

Q14. Are there any categories where paid apps significantly outperform free apps in terms of ratings?

A violin plot or grouped bar chart comparing ratings of free vs. paid apps across categories could provide insights.

Q15. How does the app's rating distribution differ between apps with in-app purchases and those without?

A histogram or box plot comparing the rating distributions for apps with and without in-app purchases could reveal differences in user satisfaction.

# STEPS:


Steps involved:

1.   Importing libraries
2.   Loading the dataset
3.   Data Cleaning
4.   Observations
5.   Exploratory Data Analysis
6.   Plotting
7.   Conclusion

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that each chart answers the following questions in this format.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]
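As a rough illustration of the "UBM" order, the sketch below works on a small made-up frame. The column names mirror the Play Store data, but the values and variable names are assumptions chosen only for the example. Each step produces the kind of summary table that a univariate, bivariate, or multivariate chart would then display.

```python
import pandas as pd

# Made-up sample rows; values are illustrative only
apps = pd.DataFrame({
    "Category": ["GAME", "GAME", "TOOLS", "TOOLS", "FAMILY"],
    "Type": ["Free", "Paid", "Free", "Free", "Paid"],
    "Rating": [4.5, 4.0, 3.5, 4.5, 5.0],
})

# U - univariate: one variable at a time (number of apps per category)
apps_per_category = apps["Category"].value_counts()

# B - bivariate: a numerical variable against a categorical one (mean rating per category)
mean_rating_by_category = apps.groupby("Category")["Rating"].mean()

# M - multivariate: rating broken down by category and type together
rating_by_category_and_type = apps.pivot_table(
    index="Category", columns="Type", values="Rating", aggfunc="mean"
)

print(apps_per_category)
print(mean_rating_by_category)
print(rating_by_category_and_type)
```

Combinations with no apps (for example a free FAMILY app in this sample) show up as NaN in the pivot table.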





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
from datetime import datetime
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

In [None]:
playstore_data = pd.read_csv('/content/drive/MyDrive/EDA_shourya/Play Store Data.csv')
user_review_data = pd.read_csv('/content/drive/MyDrive/EDA_shourya/User Reviews.csv')

### Dataset First View

In [None]:
# Dataset First Look
playstore_data.head(5)

In [None]:
playstore_data.tail(5)

In [None]:
playstore_data.info()

In [None]:
playstore_data.describe()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
playstore_data.shape,user_review_data.shape

### Dataset Information

In [None]:
# Dataset Info
user_review_data.head()

In [None]:
user_review_data.tail()

In [None]:
user_review_data.info()

In [None]:
user_review_data.describe()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_counts_ps= playstore_data.duplicated(keep=False).sum()
duplicate_counts_ur = user_review_data.duplicated(keep=False).sum()


print("No. of duplicate rows in Playstore data:", duplicate_counts_ps)
print("No. of duplicate rows in User review data:", duplicate_counts_ur)
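The counts above use `keep=False`, which flags every row that belongs to a duplicate group, including the first occurrence. The sketch below, on a made-up three-row frame (names and values are assumptions for illustration), contrasts this with the default `keep='first'`, which counts only the redundant repeats.

```python
import pandas as pd

# Made-up rows: the first two are identical, the third is unique
sample = pd.DataFrame({
    "App": ["A", "A", "B"],
    "Rating": [4.0, 4.0, 3.5],
})

# keep=False flags every member of a duplicate group
rows_in_duplicate_groups = sample.duplicated(keep=False).sum()

# The default keep='first' flags only the repeats after the first occurrence
redundant_rows = sample.duplicated().sum()

# drop_duplicates() removes exactly the redundant rows
deduplicated = sample.drop_duplicates()

print(rows_in_duplicate_groups, redundant_rows, len(deduplicated))
```

The number of rows that `drop_duplicates()` would remove therefore corresponds to the default count, not to the `keep=False` count.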




In [None]:

# Data preparation
duplicate_status = ['Duplicates', 'Unique']
counts = [playstore_data.duplicated(keep=False).sum(), (~playstore_data.duplicated(keep=False)).sum()]

# Plotting
plt.figure(figsize=(8, 6))
plt.bar(duplicate_status, counts, color=['red', 'green'])
plt.title('Duplicate vs Unique Rows (Playstore data)')
plt.ylabel('Count')
plt.show()


In [None]:

# Data preparation
duplicate_status = ['Duplicates', 'Unique']
counts = [user_review_data.duplicated(keep=False).sum(), (~user_review_data.duplicated(keep=False)).sum()]

# Plotting
plt.figure(figsize=(8, 6))
plt.bar(duplicate_status, counts, color=['red', 'green'])
plt.title('Duplicate vs Unique Rows (User review data)')
plt.ylabel('Count')
plt.show()


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
null_play=playstore_data.isna().sum()
null_play


In [None]:
null_user=user_review_data.isna().sum()
null_user

In [None]:
# Visualizing the missing values in the Playstore data
plt.figure(figsize=(10, 8))
null_play.plot(kind='bar', color='coral')
plt.title('Number of Null Values per Column (Playstore data)')
plt.ylabel('Count')
plt.xticks(rotation=45, ha='right')
plt.show()


In [None]:
# Visualizing the missing values in the User review data
plt.figure(figsize=(10, 8))
null_user.plot(kind='bar', color='coral')
plt.title('Number of Null Values per Column (User review data)')
plt.ylabel('Count')
plt.xticks(rotation=45, ha='right')
plt.show()


In [None]:
# Rows that are missing a rating, type, content rating, or Android version are removed from the analysis
key_columns = ['Rating', 'Type', 'Content Rating', 'Android Ver']
playstore_data.dropna(subset=key_columns, inplace=True)

# A missing version string is not critical, so it is labelled instead of dropping the row
playstore_data['Current Ver'] = playstore_data['Current Ver'].fillna('Not Available')
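When missing values are filled with a column's mode, the form of the assignment matters. The sketch below uses a small made-up frame (all names and values are assumptions for illustration) to contrast scalar assignment through a boolean row mask with a column-wise `fillna`. The first writes the value into every column of the selected rows, while the second changes only the named column.

```python
import numpy as np
import pandas as pd

# Made-up rows; the text columns are kept as object dtype so they can hold any value
sample = pd.DataFrame({
    "App": pd.Series(["A", "B", "C"], dtype=object),
    "Rating": [4.0, np.nan, 4.0],
    "Type": pd.Series(["Free", "Paid", "Free"], dtype=object),
})
rating_mode = sample["Rating"].mode().values[0]

# Row-mask assignment: every column of the matching rows receives the value
overwritten = sample.copy()
overwritten[overwritten["Rating"].isna()] = rating_mode

# Column-wise fill: only 'Rating' changes, the other columns keep their values
filled = sample.copy()
filled["Rating"] = filled["Rating"].fillna(rating_mode)

print(overwritten)
print(filled)
```

In the overwritten frame the second row carries the mode in 'App' and 'Type' as well, so such rows can be recognised by 'App' being equal to 'Rating'.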

In [None]:
playstore_data.isna().sum()

In [None]:
Data_to_drop = playstore_data[playstore_data['App'] == playstore_data['Rating']]
Data_to_drop

In [None]:
playstore_data.drop(Data_to_drop.index[:],axis=0,inplace=True)

In [None]:
user_review_data.dropna(subset=['Translated_Review','Sentiment','Sentiment_Polarity','Sentiment_Subjectivity'],inplace=True)
user_review_data.info()

In [None]:
playstore_data.info()



  1. Cleaning the 'Reviews' column: 'Reviews' should be numeric, so placeholder strings such as '4.1 and up' and 'Everyone', which point to data entry errors, are replaced with 0 and the column is converted to integers.

  2. Cleaning the 'Rating' column: the same kind of placeholder values are replaced with the mean rating of the dataset, and the column is converted to float so that ratings are in decimal form.

  3. Cleaning the 'Installs' column: the trailing '+' and the thousands separators are removed, and the column is converted to integers.

  4. Cleaning the 'Price (in $)' column: placeholder values are replaced with 0, so those apps are treated as free, the '$' sign is removed, and the column is converted to float.

  5. Extracting the 'Last Update Year' from the 'Last Updated' column.

In [None]:

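# Make the unit explicit in the column name
playstore_data.rename({'Price':'Price (in $)'},inplace=True,axis=1)

# Reviews: non-numeric entries become 0, then cast to int
playstore_data['Reviews'] = pd.to_numeric(playstore_data['Reviews'], errors='coerce').fillna(0).astype(int)

# Rating: non-numeric entries are replaced with the mean rating
numeric_rating = pd.to_numeric(playstore_data['Rating'], errors='coerce')
playstore_data['Rating'] = numeric_rating.fillna(numeric_rating.mean()).astype(float)

# Installs: drop the trailing '+' and the thousands separators, e.g. '10,000+' -> 10000
installs_text = playstore_data['Installs'].astype(str).str.replace(r'[+,]', '', regex=True)
playstore_data['Installs'] = pd.to_numeric(installs_text, errors='coerce').fillna(0).astype(int)

# Price: drop the '$' sign; non-numeric entries are treated as free (0)
price_text = playstore_data['Price (in $)'].astype(str).str.replace('$', '', regex=False)
playstore_data['Price (in $)'] = pd.to_numeric(price_text, errors='coerce').fillna(0.0)

# Last Update Year: the last four characters of 'Last Updated'
playstore_data['Last Update Year'] = playstore_data['Last Updated'].str[-4:].astype(int)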
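To check string conversions like the ones in the cell above on a few sample values, the sketch below uses plain Python. The helper names (`parse_installs`, `parse_price`, `last_update_year`) are hypothetical, and the sample formats ('10,000+', '$4.99', 'January 7, 2018') are assumptions about how the raw columns look.

```python
def parse_installs(text):
    """Turn an install bucket such as '10,000+' into an integer."""
    return int(text.replace(",", "").rstrip("+"))


def parse_price(text):
    """Turn a price such as '$4.99' or '0' into a float."""
    return float(text.replace("$", ""))


def last_update_year(text):
    """Take the four-digit year from the end of a date string."""
    return int(text[-4:])


# Assumed sample values, not taken from the dataset
print(parse_installs("10,000+"))
print(parse_price("$4.99"))
print(last_update_year("January 7, 2018"))
```

Stripping the '+' with `rstrip` rather than slicing off the last character keeps a value without the suffix, such as '0', intact.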



In [None]:
playstore_data.info()

In [None]:
user_review_data.info()

In [None]:
playstore_data.head(30)

### What did you know about your dataset?

Answer Here

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
playstore_data.columns

In [None]:
user_review_data.columns

In [None]:
# Dataset Describe
playstore_data.describe()

In [None]:
user_review_data.describe()

### Variables Description

Answer Here

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for column in playstore_data.columns:
    print(f"Unique values in '{column}':", playstore_data[column].nunique())


In [None]:
for column in user_review_data.columns:
    print(f"Unique values in '{column}':", user_review_data[column].nunique())


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

playstore_data.to_csv('/content/drive/MyDrive/EDA_shourya/Play Store Data Final.csv', index=False)


In [None]:
user_review_data.to_csv('/content/drive/MyDrive/EDA_shourya/user review data final.csv', index=False)

### What all manipulations have you done and insights you found?

Answer Here.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Q1: Distribution of apps across different categories
plt.figure(figsize=(12, 8))
category_count = playstore_data['Category'].value_counts()
sns.barplot(x=category_count, y=category_count.index, palette='viridis')
plt.title('Distribution of Apps Across Categories in Google Play Store')
plt.xlabel('Number of Apps')
plt.ylabel('Category')
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 2

In [None]:

# Q2: Relationship between app ratings and number of reviews
plt.figure(figsize=(10, 6))
sns.scatterplot(data=playstore_data, x='Rating', y='Reviews', alpha=0.5)
plt.title('Relationship Between App Ratings and Number of Reviews')
plt.xlabel('Rating')
plt.ylabel('Number of Reviews')
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 3

In [None]:

# Q3: App ratings vary across different content ratings
plt.figure(figsize=(10, 6))
sns.boxplot(data=playstore_data, x='Content Rating', y='Rating', palette='coolwarm')
plt.title('App Ratings Across Different Content Ratings')
plt.xlabel('Content Rating')
plt.ylabel('Rating')
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# Q4: Top 10 apps by installs per category
# Only the three largest categories are plotted to keep the output short.
selected_categories = playstore_data['Category'].value_counts().index[:3]

for category in selected_categories:
    top_apps = playstore_data[playstore_data['Category'] == category].nlargest(10, 'Installs')
    plt.figure(figsize=(10, 6))
    sns.barplot(data=top_apps, y='App', x='Installs', palette='autumn')
    plt.title(f'Top 10 Apps by Installs in {category}')
    plt.xlabel('Number of Installs')
    plt.ylabel('App')
    plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 5

In [None]:
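# Chart - 5 visualization code
# Q5: App Size vs. Rating
# Convert 'Size' to megabytes, assuming values are strings ending in 'M' (MB) or 'k' (kB);
# anything else (for example 'Varies with device') becomes NaN and is left out of the plot
size_text = playstore_data['Size'].astype(str)
size_number = pd.to_numeric(size_text.str[:-1], errors='coerce')
size_unit = size_text.str[-1]
size_mb = size_number.where(size_unit == 'M', size_number / 1024)
playstore_data['Size (MB)'] = size_mb.where(size_unit.isin(['M', 'k']))

plt.figure(figsize=(10, 6))
sns.scatterplot(data=playstore_data, x='Size (MB)', y='Rating', alpha=0.7)
plt.title('Correlation Between App Size and Rating')
plt.xlabel('App Size (MB)')
plt.ylabel('Rating')
plt.show()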



##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 6

In [None]:
# Chart - 6 visualization code

# Q6: Free vs. Paid Apps in Categories
# Recompute 'Type' from the cleaned price so that every app is labelled either 'Free' or 'Paid'
playstore_data['Type'] = playstore_data['Price (in $)'].apply(lambda x: 'Free' if x == 0 else 'Paid')
category_type_counts = playstore_data.groupby(['Category', 'Type']).size().unstack().fillna(0)
# DataFrame.plot creates its own figure, so the figure size is passed here
category_type_counts.plot(kind='bar', stacked=True, figsize=(14, 8), color=['skyblue', 'orange'])
plt.title('Number of Free vs. Paid Apps Across Categories')
plt.xlabel('Category')
plt.ylabel('Number of Apps')
plt.xticks(rotation=90)
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 7

In [None]:
# Chart - 7 visualization code
# Q7: Distribution of App Prices
plt.figure(figsize=(10, 6))
# Filter out free apps for the price distribution of paid apps
paid_apps = playstore_data[playstore_data['Type'] == 'Paid']
sns.histplot(paid_apps['Price (in $)'], kde=True, bins=30, color='purple')
plt.title('Distribution of App Prices in Paid Apps Category')
plt.xlabel('Price (in $)')
plt.ylabel('Frequency')
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 8

In [None]:
# Chart - 8 visualization code
# Q8: Update Frequency and Ratings
plt.figure(figsize=(10, 6))
# 'Last Update Year' holds the year of each app's most recent update (derived during cleaning)
sns.scatterplot(data=playstore_data, x='Last Update Year', y='Rating', alpha=0.5)
plt.title('Last Update Year vs. Rating')
plt.xlabel('Last Update Year')
plt.ylabel('Rating')
plt.show()
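Q8 suggests aggregating by year to make a trend easier to see than in a raw scatter plot. The sketch below shows one way to do that on made-up numbers. The frame `updates` and its values are assumptions for illustration, and the year of the last update is arguably only a proxy for how often an app is updated.

```python
import pandas as pd

# Made-up sample; column names follow the cleaned Play Store frame
updates = pd.DataFrame({
    "Last Update Year": [2016, 2016, 2017, 2018, 2018, 2018],
    "Rating": [3.5, 4.5, 4.0, 4.5, 4.0, 5.0],
})

# Mean rating and number of apps for each last-update year
yearly = updates.groupby("Last Update Year")["Rating"].agg(["mean", "count"])

print(yearly)
```

Keeping the count next to the mean helps judge how much weight a yearly average deserves, since years with few apps give noisy means.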


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 9

In [None]:
# Chart - 9 visualization code
from wordcloud import WordCloud

# Assuming 'Sentiment' is a column in 'user_review_data' that marks reviews as Positive, Negative, or Neutral
# And 'Translated_Review' is the column with the review text

# Filter positive and negative reviews
positive_reviews = ' '.join(user_review_data[user_review_data['Sentiment'] == 'Positive']['Translated_Review'].fillna(''))
negative_reviews = ' '.join(user_review_data[user_review_data['Sentiment'] == 'Negative']['Translated_Review'].fillna(''))

# Generate word clouds
plt.figure(figsize=(20, 10))
plt.subplot(1, 2, 1)
pos_wordcloud = WordCloud(width=800, height=400, background_color ='white').generate(positive_reviews)
plt.imshow(pos_wordcloud, interpolation='bilinear')
plt.title('Common Words in Positive Reviews')
plt.axis('off')

plt.subplot(1, 2, 2)
neg_wordcloud = WordCloud(width=800, height=400, background_color ='white').generate(negative_reviews)
plt.imshow(neg_wordcloud, interpolation='bilinear')
plt.title('Common Words in Negative Reviews')
plt.axis('off')
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

In [None]:
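# One row per app, so that repeated Play Store entries do not multiply the reviews in the merge
playstore_df = playstore_data[['App','Category']].drop_duplicates(subset='App')
user_review_df = user_review_data[['App','Sentiment','Sentiment_Polarity']]
# Attach a category to every review; reviews of apps absent from the Play Store data get NaN
new = pd.merge(playstore_df, user_review_df, how='right', on='App')
new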

In [None]:
# Chart - 10 visualization code
# Q10: Sentiment Polarity Across Categories
plt.figure(figsize=(14, 8))
top_categories = new['Category'].value_counts().nlargest(5).index
filtered_reviews = new[new['Category'].isin(top_categories)]
sns.boxplot(data=filtered_reviews, x='Category', y='Sentiment_Polarity', palette='Set2')
plt.title('Sentiment Polarity of Reviews Across Top Categories')
plt.xlabel('Category')
plt.ylabel('Sentiment Polarity')
plt.xticks(rotation=45)
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code
# Q11: Number of Installs vs. Rating
plt.figure(figsize=(10, 6))
# Assuming 'Installs' have been cleaned to numeric values
playstore_data['Log Installs'] = np.log1p(playstore_data['Installs'])  # Log transform to handle skewness
sns.scatterplot(data=playstore_data, x='Log Installs', y='Rating', alpha=0.5)
plt.title('Relationship Between Number of Installs and Rating')
plt.xlabel('Log of Number of Installs')
plt.ylabel('Rating')
plt.show()
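Q11 mentions a trend line. A minimal sketch of fitting one with `np.polyfit` follows, using made-up points in place of the real 'Log Installs' and 'Rating' columns. The arrays and their values are assumptions for illustration, and the real columns would need to be free of NaN before fitting.

```python
import numpy as np

# Made-up points standing in for log-transformed installs and ratings
log_installs = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
rating = np.array([3.8, 4.0, 4.1, 4.3, 4.4])

# Least-squares straight line: rating ~ slope * log_installs + intercept
slope, intercept = np.polyfit(log_installs, rating, deg=1)

# Fitted values along the line, ready to be drawn over the scatter plot
trend = slope * log_installs + intercept

print(slope, intercept)
```

The fitted values could be drawn over the scatter plot with `plt.plot(log_installs, trend)`. Seaborn's `regplot` is an alternative that fits and draws the line in one call.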


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code
# Due to complexity, this is a simplified version focusing on a single category for illustration.
plt.figure(figsize=(14, 8))
sample_category = 'GAME'
sample_genre_ratings = playstore_data[playstore_data['Category'] == sample_category]
sns.barplot(data=sample_genre_ratings, x='Genres', y='Rating', ci=None, palette='deep')
plt.title(f'Average Ratings Across Different Genres within {sample_category}')
plt.xlabel('Genres')
plt.xticks(rotation=90)
plt.ylabel('Average Rating')
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code
# Q13: Share of apps by minimum Android version
# 'Android Ver' is used as-is (e.g. '4.1 and up'), so every distinct string gets its own slice;
# grouping the strings into major versions beforehand would make the pie easier to read
android_version_counts = playstore_data['Android Ver'].value_counts(normalize=True) * 100
plt.figure(figsize=(10, 6))
android_version_counts.plot(kind='pie', autopct='%1.1f%%')
plt.title('Percentage of Apps by Android Version Support')
plt.ylabel('')  # Hide the y-label as it's not needed for pie charts
plt.show()
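One possible way to group the version strings before plotting is to keep only the leading major version number. The helper name `minimum_android_major` is hypothetical, and apart from '4.1 and up' the sample strings are assumptions about what the column may contain.

```python
import re


def minimum_android_major(text):
    """Return the leading major version from strings like '4.1 and up', or None if there is no number."""
    match = re.match(r"\s*(\d+)", str(text))
    return int(match.group(1)) if match else None


# Assumed sample values
for value in ["4.1 and up", "5.0 - 8.0", "Varies with device"]:
    print(value, "->", minimum_android_major(value))
```

Applied to the frame, for example `playstore_data['Android Ver'].apply(minimum_android_major)`, this would give a small set of integer groups whose `value_counts()` can be plotted in place of the raw strings.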


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Ratings of Free vs. Paid Apps by Category

In [None]:
# Chart - 14 visualization code
# Q14: Paid vs. Free Apps Ratings
plt.figure(figsize=(14, 8))
sns.boxplot(data=playstore_data, x='Category', y='Rating', hue='Type', palette='coolwarm')
plt.title('Ratings of Free vs. Paid Apps Across Categories')
plt.xlabel('Category')
plt.ylabel('Rating')
plt.xticks(rotation=90)
plt.legend(title='Type')
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Total Installs by Content Rating

In [None]:

# Summarizing total installs by content rating
installs_by_content_rating = playstore_data.groupby('Content Rating')['Installs'].sum().reset_index()

# Sorting the data to ensure the chart is ordered by the number of installs
installs_by_content_rating = installs_by_content_rating.sort_values('Installs', ascending=False)


In [None]:
plt.figure(figsize=(10, 6))
sns.barplot(data=installs_by_content_rating, x='Content Rating', y='Installs', palette='coolwarm')
plt.title('Total Installs by Content Rating')
plt.xlabel('Content Rating')
plt.ylabel('Total Number of Installs')
plt.xticks(rotation=45)
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

Answer Here.

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***