<a href="https://colab.research.google.com/github/Sumit4085/Data-Science/blob/main/Google_Play_Store_EDA_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -



##### **Project Type**    - EDA
##### **Contribution**    - Individual



# **Project Summary -**

The Google Play Store is a massive digital marketplace where users can download a wide variety of applications for their smartphones and tablets. With millions of apps available, understanding the trends and preferences of both users and developers is crucial for optimizing the app ecosystem. Exploratory Data Analysis (EDA) offers a powerful approach to uncover insights and patterns within the vast dataset of the Google Play Store.

The Google Play Store EDA project aims to explore and analyze the extensive dataset of app listings available on the platform. By conducting thorough analysis, we seek to address several key questions and challenges:

1. App Categories and Popularity:
   - What are the most popular app categories on the Google Play Store?
   - How does the popularity of different categories vary?
   - Are there specific categories that dominate the market?

2. App Ratings and Reviews:
   - How do user ratings affect the popularity and success of an app?
   - Are there correlations between the number of reviews and app ratings?
   

3. Pricing Strategies and Revenue:
   - How do pricing strategies influence app downloads and revenue generation?
   - Are paid apps more successful than free apps, or vice versa?
   - What are the most common price ranges for apps across different categories?

4. User Demographics and Preferences:
   - Who are the primary users of the Google Play Store?
   - What are the demographics of users who download specific types of apps?
   - Are there regional or cultural differences in app preferences?

5. App Size and Performance:
   - How does app size impact user downloads and retention?
   - Are users more likely to download smaller apps over larger ones?
   - Is there a correlation between app size and user ratings/performance?

6. Developer Trends and Behavior:
   - What are the characteristics of successful app developers on the Google Play Store?
   - How do factors such as app updates, developer reputation, and engagement influence app success?
   - Are there patterns in the behavior of developers regarding app pricing, updates, and user feedback?

Approach:
To address these questions and challenges, we will perform exploratory data analysis on a comprehensive dataset extracted from the Google Play Store. This analysis will involve data cleaning, visualization, and statistical techniques to uncover meaningful insights. We will utilize Python programming language along with libraries such as Pandas, Matplotlib, and Seaborn for data manipulation and visualization. Additionally, we will employ descriptive and inferential statistical methods to derive conclusions and make recommendations based on the findings.

Expected Outcome:
Through this EDA project, we aim to provide valuable insights into the dynamics of the Google Play Store ecosystem. By understanding app trends, user preferences, and developer behavior, stakeholders such as app developers, marketers, and platform administrators can make informed decisions to optimize their strategies and enhance the overall user experience on the Google Play Store.

# **GitHub Link -**

https://github.com/Sumit4085/Data-Science-Projects.git

# **Problem Statement**


Problem Statement:
Analyze the Google Play Store dataset to gain insights into app trends, user preferences, and factors influencing app ratings and downloads. The goal is to uncover patterns that can guide app developers and marketers in making informed decisions about app development, pricing strategies, and marketing campaigns. By conducting exploratory data analysis (EDA), we aim to identify correlations between various app attributes such as category, size, price, and rating, and understand their impact on user engagement and satisfaction. Additionally, we seek to explore geographical trends in app usage and popularity across different regions. Ultimately, the insights derived from this analysis will enable stakeholders to optimize their app offerings and maximize their success on the Google Play Store platform.

#### **Define Your Business Objective?**

Answer Here.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Importing all mandatory Libraries
import pandas as pd # Data manipulation and analysis
import numpy as np # Numerical operations on arrays
import matplotlib.pyplot as plt #  Data visualization
%matplotlib inline
import seaborn as sns # Statistical data visualization



### Dataset Loading

In [None]:
# importing drive from google drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# uploading Dataset (play store)
df = pd.read_csv('/content/drive/MyDrive/Play Store Data.csv')

In [None]:
# uploading dataset (play store user review)
df1 = pd.read_csv('/content/drive/MyDrive/User Reviews.csv')

### Dataset First View

In [None]:
# overview of top 5 data of df
df.head()

In [None]:
# overview of top 5 data of df1
df1.head()

### Dataset Rows & Columns count

In [None]:
# getting Rows & Columns count of dataset df
df.shape

In [None]:
# getting Rows & Columns count of dataset df1
df1.shape

### Dataset Information

In [None]:
# summary of data structure in detail (total rows and columns, nulls, dtypes) of dataset df
df.info()

In [None]:
# summary of data structure in detail (total rows and columns, nulls, dtypes) of dataset df1
df1.info()

#### Duplicate Values

In [None]:
# finding duplicates in df
df.duplicated().sum()

In [None]:
# finding duplicates in df1
df1.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# finding null values in df
df.isnull().sum()

In [None]:
# finding null values in df1
df1.isnull().sum()
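
Raw null counts are easier to judge next to the share of rows they represent. The sketch below shows one way to report both; the `missing_summary` helper and the `toy` frame are illustrative assumptions, not part of the project data.

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the null count and null percentage per column, largest first."""
    counts = frame.isnull().sum()
    percent = (counts / len(frame) * 100).round(2)
    summary = pd.DataFrame({'null_count': counts, 'null_percent': percent})
    return summary.sort_values('null_count', ascending=False)

# Hypothetical toy data standing in for df
toy = pd.DataFrame({
    'App': ['a', 'b', 'c', 'd'],
    'Rating': [4.1, np.nan, np.nan, 3.9],
    'Type': ['Free', 'Paid', None, 'Free'],
})
print(missing_summary(toy))
```

Calling `missing_summary(df)` or `missing_summary(df1)` would give the same view for the real datasets.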

In [None]:
# Visualizing the missing values in df
# Checking Null Value by plotting Heatmap
sns.heatmap(df.isnull(), cbar=False )
plt.title('Missing Values')
plt.xlabel('Categories')
plt.ylabel('No. of Rows')
plt.show()

In [None]:
# Visualizing the missing values in df1
# Checking Null Value by plotting Heatmap
sns.heatmap(df1.isnull(), cbar=False )
plt.title('Missing Values')
plt.xlabel('Categories')
plt.ylabel('No. of Rows')
plt.show()

### What did you know about your dataset?

Answer Here

## ***2. Understanding Your Variables***

In [None]:
# Dataset df Columns
df.columns

In [None]:
# Dataset df1 Columns
df1.columns

In [None]:
# Dataset df Describe
df.describe(include = 'all')

In [None]:
# Dataset df1 Describe
df1.describe(include = 'all')

### Variables Description

Here we have descriptions for all columns of both datasets (df and df1).

- **App**: Name of the application.
- **Category**: Category or genre of the application.
- **Rating**: Average user rating of the application.
- **Reviews**: Number of user reviews for the application.
- **Size**: Size of the application in terms of storage space.
- **Installs**: Number of times the application has been installed.
- **Type**: Type of the application (e.g., Free or Paid).
- **Price**: Price of the application (if it's a paid app).
- **Content Rating**: Content rating of the application based on audience suitability.
- **Genres**: Sub-genres or additional categorizations of the application.
- **Last Updated**: Date when the application was last updated.
- **Current Ver**: Current version of the application.
- **Android Ver**: Minimum Android version required to run the application.

Variable descriptions for the user reviews dataset (df1), which also has an **App** column:

- **Translated_Review**: Reviews of the application translated into a specific language.
- **Sentiment**: Overall sentiment of the translated review (e.g., Positive, Negative, Neutral).
- **Sentiment_Polarity**: Numeric value indicating the sentiment polarity (positive, negative, or neutral) of the translated review.
- **Sentiment_Subjectivity**: Numeric value indicating the subjectivity (opinionated vs. factual) of the translated review.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable for dataset df
df.nunique()

In [None]:
# Check Unique Values for each variable for dataset df1
df1.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

DATA CLEANING AND TRANSFORMATION.
--> Duplicates and null values were identified above. In this section we will remove them, clean each column, create new calculated columns where useful, and drop the columns that are of no use.

WILL CHECK BELOW THINGS TO EVERY COLUMN AND WILL PERFORM MANDATORY ACTIONS.

--> FOR CLEANING
* CHECKING DATATYPE
* CHECKING REPEATED VALUES
* CHECKING ABNORMALITIES

--> DATA TRANSFORMATION
* COLUMN DELETING
* COLUMN ADDING(CALCULATED FIELD)

FIRST WE WILL CLEAN AND MODEL DATASET df, THEN WILL CLEAN AND MODEL DATASET df1
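
The three cleaning checks listed above (dtype, repeated values, abnormalities) can be gathered in one small helper so that every column is inspected the same way. This is only a sketch: `inspect_column` and the `toy` frame are hypothetical names introduced for illustration.

```python
import pandas as pd

def inspect_column(frame: pd.DataFrame, column: str, max_values: int = 10) -> dict:
    """Collect dtype, null count, duplicate count and a sample of unique values."""
    series = frame[column]
    return {
        'dtype': str(series.dtype),
        'nulls': int(series.isnull().sum()),
        'duplicates': int(series.duplicated().sum()),
        # A short sample of unique values helps spot symbols such as '+' or ','
        'sample_values': series.dropna().unique()[:max_values].tolist(),
    }

# Hypothetical toy data standing in for df
toy = pd.DataFrame({'Installs': ['1,000+', '10+', '1,000+', None]})
report = inspect_column(toy, 'Installs')
print(report)
```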

CLEANING AND TRANSFORMING df

In [None]:
# finding null values
df.isnull().sum()

insights for null values for df

* Only a few values are null in most of the affected columns, so we will remove those rows. The Rating column has a large number of null values and is a numerical column, so we will fill it with the mean value.

In [None]:
# dropping rows with null values in the columns where nulls are few (subset passed as a list of column labels)
df = df.dropna(axis=0, subset=['Content Rating', 'Type', 'Android Ver', 'Current Ver'])

In [None]:
# finding duplicates in df
df.duplicated().sum()

insights for duplicate values in df

* The 'App' column contains most of the duplicates, so we will analyse that column and remove the duplicates after analysing.

In [None]:
# observing for 'App' column
df['App'].unique()

In [None]:
# finding duplicates in 'App' column, as 'App' should contain only unique values
df.duplicated(subset = 'App').sum()

# insights of column 'App'

* abnormalities: none
* repetition in data : yes, so will remove duplicates
* dtype : will remain same

In [None]:
# dropping duplicates from app column
df = df.drop_duplicates(subset = 'App')
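
`drop_duplicates(subset='App')` keeps the first occurrence of each app by default. If the repeated rows differ, for example in their review count, one alternative is to keep the row with the most reviews. The sketch below compares both on made-up data; it assumes `Reviews` is already numeric, which is not yet the case for `df` at this point in the notebook.

```python
import pandas as pd

# Hypothetical toy data: 'Maps' appears twice with different review counts
toy = pd.DataFrame({
    'App': ['Maps', 'Maps', 'Notes'],
    'Reviews': [120, 150, 30],
})

# Default behaviour: the first row per app survives
first_kept = toy.drop_duplicates(subset='App')

# Alternative: sort so the largest review count comes first, then deduplicate
most_reviewed = (
    toy.sort_values('Reviews', ascending=False)
       .drop_duplicates(subset='App')
       .sort_index()
)
print(most_reviewed)
```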

In [None]:
# observing column Category
df.Category.unique()

# insights of column 'Category'

* abnormalities: to replace '_AND_'  with ' & '
* repetition in data : none
* dtype : will remain same

In [None]:
# replacing '_AND_' with ' & ' in Category column
df['Category'] = df['Category'].str.replace('_AND_', ' & ')

In [None]:
# observing size column
df.Size.value_counts()

# insights of column 'Size'

* abnormalities: replace 'M' with '', replace 'Varies with device' with 'nan', and for values ending in 'k' remove the 'k' and divide by 1000 so that all values are in MB (megabytes)
* repetition in data : none
* dtype : will change to float

In [None]:
# replacing 'M' with ''
df['Size'] = df['Size'].apply(lambda x: x.replace('M','') if 'M' in x else x)

In [None]:
# replacing 'k' with '' and changing to float and then dividing by 1000 to convert data in MB
df['Size'] = df['Size'].apply(lambda x: float(x.replace('k','')) / 1000 if 'k' in x else x)

In [None]:
# casting 'Size' to string so that 'Varies with device' can be replaced with 'nan' (string methods do not work on the float values created above)
df['Size'] = df['Size'].astype('str')

In [None]:
# now replacing 'Varies with device' with 'nan'
df['Size'] = df['Size'].apply(lambda x: x.replace('Varies with device','nan') if 'Varies with device' in x else x)

In [None]:
# changing 'Size' column dtype back into float
df['Size'] = df['Size'].astype('float64')
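
The Size conversion above takes five separate steps. As a possible alternative, the same rules can be expressed in one function and applied in a single pass. The `size_to_mb` helper and `toy_sizes` series are hypothetical; the division by 1000 simply mirrors the factor used in the cells above.

```python
import numpy as np
import pandas as pd

def size_to_mb(value) -> float:
    """Convert strings such as '19M' or '201k' to megabytes; anything else becomes NaN."""
    value = str(value).strip()
    try:
        if value.endswith('M'):
            return float(value[:-1])
        if value.endswith('k'):
            return float(value[:-1]) / 1000  # same k-to-MB factor as used above
    except ValueError:
        pass
    # 'Varies with device' and any unparseable value fall through to NaN
    return np.nan

# Hypothetical toy data standing in for df['Size']
toy_sizes = pd.Series(['19M', '201k', 'Varies with device', '8.7M'])
converted = toy_sizes.apply(size_to_mb)
print(converted)
```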

In [None]:
# observing column Rating
df.Rating.unique()

# insights of Rating column

* abnormalities: found nan values; will compute the mean and use it to fill all of them.
* repetition in data : none
* dtype : remain same

In [None]:
# finding mean values to replace all null values in Rating column
df['Rating'].mean()

In [None]:
# filling all nan values with the column mean (rounded to one decimal place) instead of a hard-coded number
df['Rating'] = df['Rating'].fillna(round(df['Rating'].mean(), 1))

In [None]:
# observing column Reviews
df.Reviews.unique()

# insights of Reviews column

* abnormalities: none
* repetition in data : none
* dtype : changing dtype into int

In [None]:
# changed datatype
df['Reviews'] = df['Reviews'].astype('int64')

In [None]:
# observing Installs column
df.Installs.unique()

# insights of Installs column

* abnormalities: need to remove '+', and ',' to change dtype
* repetition in data : none
* dtype : changing dtype into int

In [None]:
# replacing '+' with '' (regex=False treats '+' as a literal character, not a regex quantifier)
df['Installs'] = df['Installs'].str.replace('+', '', regex=False)

In [None]:
# replacing ',' with ''
df['Installs'] = df['Installs'].str.replace(',','')

In [None]:
# changing dtype into int
df['Installs'] = df['Installs'].astype('int64')

In [None]:
# observing Type column
df.Type.unique()

# insights of Type column

* abnormalities: none
* repetition in data : none
* dtype : remain same

In [None]:
# observing Price column
df.Price.unique()

# insights of Price column

* abnormalities: remove '$' in order to change dtype
* repetition in data : none
* dtype : change to float

In [None]:
# replacing '$' with '' (regex=False treats '$' as a literal character, not a regex anchor)
df['Price'] = df['Price'].str.replace('$', '', regex=False)

In [None]:
# changing dtype into float
df['Price'] = df['Price'].astype('float64')
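
Installs and Price follow the same pattern: strip a few symbols, then convert to a number. A minimal sketch of a shared helper is shown below. The `to_number` function and the `toy` frame are hypothetical, and using `pd.to_numeric(..., errors='coerce')` in place of `astype` is a swapped-in choice so that unexpected values become NaN instead of raising an error.

```python
import pandas as pd

def to_number(series: pd.Series, strip_chars: str) -> pd.Series:
    """Remove each character in strip_chars literally, then convert to numbers."""
    cleaned = series.astype(str)
    for char in strip_chars:
        cleaned = cleaned.str.replace(char, '', regex=False)
    # errors='coerce' turns anything unparseable into NaN instead of raising
    return pd.to_numeric(cleaned, errors='coerce')

# Hypothetical toy data, including one malformed value per column
toy = pd.DataFrame({
    'Installs': ['1,000+', '50,000,000+', 'Free'],
    'Price': ['0', '$4.99', 'Everyone'],
})
installs = to_number(toy['Installs'], '+,')
price = to_number(toy['Price'], '$')
print(installs)
print(price)
```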

In [None]:
# observing Content Rating
df['Content Rating'].unique()

# insights of Content Rating column

* abnormalities: none
* repetition in data : none
* dtype : remain same

In [None]:
# observing 'Genres'
df.Genres.unique()

# insights of 'Genres' column

* We already have the 'Category' column, and the 'Genres' column is a mix of category and sub-category. The sub-category is mostly missing and has very low counts, so we can delete this column without any impact on the dataset.


In [None]:
# removing 'Genres' column
df = df.drop( columns= 'Genres')

In [None]:
# observing Last Updated column
df['Last Updated'].unique()

# insights of Last Updated column

* abnormalities: none
* repetition in data : none
* dtype : change to datetime

In [None]:
# changing dtype in datetime
df['Last Updated'] = pd.to_datetime(df['Last Updated'])
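
Parsing is more predictable when the date format is stated explicitly. The sketch below assumes the dates look like 'January 7, 2018'; that format, the `toy_dates` series and the derived `update_year` are assumptions for illustration and should be checked against the output of `df['Last Updated'].unique()`.

```python
import pandas as pd

# Hypothetical toy data; the last entry is deliberately malformed
toy_dates = pd.Series(['January 7, 2018', 'August 1, 2018', 'not a date'])

# An explicit format avoids guessing; errors='coerce' turns bad values into NaT
parsed = pd.to_datetime(toy_dates, format='%B %d, %Y', errors='coerce')

# A calculated field that could be used later, e.g. apps updated per year
update_year = parsed.dt.year
print(parsed)
```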

In [None]:
# observing Current Ver
df['Current Ver'].value_counts()

# insights of Current Ver column

* abnormalities: none
* repetition in data : none
* dtype : remain same

In [None]:
# observing Android Ver
df['Android Ver'].unique()

# insights of Android Ver column

* abnormalities: remove 'W' from '4.4W', and replace the upper-range values (such as '- 7.1.1') with 'and up'
* repetition in data : none
* dtype : remain same

In [None]:
# removing 'W' with ''
df['Android Ver'] = df['Android Ver'].str.replace('W', '')

In [None]:
# replacing '- 7.1.1' with 'and up' (regex=False so that '.' is matched literally)
df['Android Ver'] = df['Android Ver'].str.replace('- 7.1.1', 'and up', regex=False)

In [None]:
# replacing '- 8.0' with 'and up' (regex=False so that '.' is matched literally)
df['Android Ver'] = df['Android Ver'].str.replace('- 8.0', 'and up', regex=False)

In [None]:
# replacing '- 6.0' with 'and up' (regex=False so that '.' is matched literally)
df['Android Ver'] = df['Android Ver'].str.replace('- 6.0', 'and up', regex=False)
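
The replacements above list each upper bound by hand. A single regular expression could cover any 'X - Y' range in one step. This is a sketch under the assumption that ranges always end with a dotted version number; `RANGE_SUFFIX`, `normalise_android_ver` and the `toy` series are hypothetical names.

```python
import re
import pandas as pd

# Matches a trailing " - <version>" such as " - 7.1.1" or " - 8.0"
RANGE_SUFFIX = re.compile(r'\s*-\s*\d+(\.\d+)*$')

def normalise_android_ver(value: str) -> str:
    """Drop the 'W' marker and turn 'X - Y' ranges into 'X and up'."""
    value = value.replace('W', '')
    return RANGE_SUFFIX.sub(' and up', value)

# Hypothetical toy data standing in for df['Android Ver']
toy = pd.Series(['4.4W and up', '4.1 - 7.1.1', '5.0 - 8.0',
                 '2.3 and up', 'Varies with device'])
cleaned = toy.apply(normalise_android_ver)
print(cleaned.tolist())
```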

In [None]:
# ensuring all changes takes place
df.info()

In [None]:
# having a sneak peek to ensure changes
df.head()

CLEANING AND TRANSFORMING DATASET df1

In [None]:
# finding duplicates
df1.duplicated().sum()

insights for duplicates in df1

* there are 33616 duplicate rows, so will remove all of them

In [None]:
# dropping duplicates
df1 = df1.drop_duplicates()

In [None]:
# finding null values
df1.isnull().sum()

insights for null values in df1

* df1 has only 5 columns and 4 of them contain null values, so will remove all corresponding rows to get clean data

In [None]:
# dropping all null values
df1 = df1.dropna(axis = 0)

In [None]:
# observing App column of df1
df1['App'].unique()

# insights of App column

* abnormalities: none
* repetition in data : none
* dtype : remain same

In [None]:
# observing Translated_Review column of df1
df1['Translated_Review'].unique()

# insights of Translated_Review column

* abnormalities: none
* repetition in data : none
* dtype : remain same

In [None]:
# observing Sentiment column of df1
df1['Sentiment'].unique()

In [None]:
# checking the count of each sentiment label (value_counts() ignores null values by default)
df1['Sentiment'].value_counts()

# insights of Sentiment column

* abnormalities: none remaining; the rows with null values were already dropped above
* repetition in data : none
* dtype : remain same

In [None]:
# observing Sentiment_Polarity column of df1
df1['Sentiment_Polarity'].unique()

# insights of Sentiment_Polarity column

* abnormalities: none
* repetition in data : none
* dtype : remain same

In [None]:
# observing Sentiment_Subjectivity column of df1
df1['Sentiment_Subjectivity'].unique()

# insights of Sentiment_Subjectivity column

* abnormalities: none
* repetition in data : none
* dtype : remain same

### What all manipulations have you done and insights you found?

Answer Here.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 Category wise apps

In [None]:
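# to find no. of apps in each category
dc = df.groupby('Category')['App'].count().sort_values(ascending=False)
# passing categories and counts explicitly so that one bar is drawn per category
sns.barplot(x=dc.index, y=dc.values)
plt.xticks(rotation=90)
plt.ylabel('No. of apps')
plt.title('category wise no. of apps')
plt.show()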



##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 2 Top 10 categories review wise

In [None]:
# Chart - 2 total reviews of the top 10 categories
top10 = df.groupby('Category')['Reviews'].sum().sort_values(ascending=False).head(10)
sns.barplot(x=top10.index, y=top10.values)
plt.xticks(rotation=90)
plt.ylabel('Total reviews')
plt.title('top 10 categories by total reviews')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 3 Category wise total app installs

In [None]:
# Chart - 3 top 5 categories by total app installs
df5 = df.groupby('Category')['Installs'].sum().sort_values(ascending=False).head(5)
df5.plot(kind='pie', autopct="%1.2f%%", startangle=90)
plt.ylabel('')
plt.title('top 5 categories by total installs')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 4

In [None]:
# Chart - 4 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 5

In [None]:
# Chart - 5 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 6

In [None]:
# Chart - 6 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 7

In [None]:
# Chart - 7 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 8

In [None]:
# Chart - 8 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 9

In [None]:
# Chart - 9 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

In [None]:
# Chart - 10 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

Answer Here.

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***