# **EDA Report: Google Play Store Apps Dataset**
- ## **Author:** Asjad Ali
- ### **Email:** aliasjid009@gmail.com
- ### **Date:** 13/08/2023

> In this Exploratory Data Analysis (EDA) report, we will examine and summarize the main characteristics of the Google Play Store Apps dataset. The dataset contains details about various applications available on the Play Store and is sourced from Kaggle.

## **Dataset Overview**

- **Dataset Name:** Google PlayStore Apps
- **Dataset Size:** 210MB
- **Number of Apps:** 2.3 million+
- **Data Collection Date:** June 2021
- **Data Collection Method:** Python script (Scrapy)

## **Objective**

> The main objective of this project is to gain insights into customer demands and provide valuable information to developers, helping them popularize their products on the Google Play Store.

## **Data Analysis**

1. **Data Understanding:**

   > We will start by exploring the structure and contents of the dataset.
   > We will examine the variables, their types, and the overall data distribution.

1. **Data Quality Check:**

   > We will identify and handle any missing values, outliers, or inconsistencies in the data.
   > We will assess the quality and reliability of the dataset.

1. **Exploring Patterns:**

   > We will analyze the data to uncover patterns, trends, and correlations among variables.
   > We will generate visualizations and summary statistics to identify interesting insights.

1. **Variable Relationships:**

   > We will investigate the relationships between variables.
   > We will measure the strength and direction of correlations and assess the impact of one variable on another.

1. **Feature Selection:**

   > Based on our analysis, we will determine which features are most informative and relevant for predicting app popularity.
   > We will perform feature selection or dimensionality reduction techniques.

1. **Outlier Detection:**

   > We will identify any outliers or anomalies in the dataset.
   > We will examine extreme or unexpected observations that may require further investigation.

1. **Data Visualization:**

   > We will create visual representations such as plots, charts, or graphs to communicate our findings effectively.

## **Conclusion**

> Through this EDA report, we aim to gain insights, discover patterns, and uncover relationships within the Google Play Store Apps dataset. The analysis will provide valuable information to developers, enabling them to understand customer demands and popularize their applications on the Play Store.

For detailed access to the dataset, please follow this [link 🔗](https://www.kaggle.com/datasets/gauthamp10/google-playstore-apps).

Note: The analysis and findings presented in this report are based on the available dataset and the EDA techniques applied.

## **Import the libraries**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter(action = 'ignore', category = FutureWarning)

## **Data Preprocessing**
- Load the CSV file with pandas
- Create a DataFrame from the CSV file and understand the data present in the dataset
- Deal with the missing/null values

In [None]:
df = pd.read_csv("../Datasets/googleplaystore.csv")

- ### **Data Composition**

**View the first 5 rows of the data**

In [None]:
df.head()

- This gives us a first look at the kind of data we are going to deal with.

**Let's see the columns in the dataset**

In [None]:
df.columns

| Column Name         | Description                                                                   |
|---------------------|-------------------------------------------------------------------------------|
| **App Name**        | The name or title of the application available on the Google Play Store.    |
| **App Id**          | The unique identifier assigned to each application.                          |
| **Category**        | The category or genre to which the application belongs.                      |
| **Rating**          | The average user rating or feedback score received by the application.       |
| **Rating Count**    | The total number of user ratings received by the application.                |
| **Installs**        | The estimated number of times the application has been installed.           |
| **Minimum Installs**| The lower bound of the install bracket shown on the Play Store, i.e. the numeric form of **Installs**. |
| **Maximum Installs**| The maximum number of installations recorded for the application.           |
| **Free**            | Indicates whether the application is available for free or if it has a price. |
| **Price**           | The price of the application, if it is not available for free.               |
| **Currency**        | The currency in which the price is listed.                                   |
| **Size**            | The size of the application in terms of storage space.                       |
| **Minimum Android** | The minimum Android version required to run the application.                 |
| **Developer Id**    | The unique identifier assigned to the application developer.                |
| **Developer Website** | The website associated with the application developer.                   |
| **Developer Email** | The email address of the application developer.                             |
| **Released**        | The date when the application was initially released.                        |
| **Last Updated**    | The date when the application was last updated.                              |
| **Content Rating**  | The age-based rating or content suitability of the application.             |
| **Privacy Policy**  | The link to the privacy policy associated with the application.             |
| **Ad Supported**    | Indicates whether the application contains advertisements.                  |
| **In App Purchases** | Indicates whether the application offers in-app purchases.                  |
| **Editors Choice**  | Indicates whether the application has been selected as an editor's choice on the Play Store. |
| **Scraped Time**    | The date and time when the data was scraped or collected from the Play Store. |


**Important things to know**

In [None]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

**Shape or Dimensions of the data**

In [None]:
df.shape

- It shows that there are 2312944 rows and 24 columns in the dataset, which means that the dataset covers more than 2.3 million Google Play Store applications.

**Let's get some more information about the data**

In [None]:
df.info()

- It shows the number of entries in the data and the data types of the columns:
  - 4 bool type columns
  - 4 float type columns
  - 1 integer type column
  - 15 object type columns -> we will look at these in more detail later
- The dataset uses 361.8+ MB of memory
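
The dtype breakdown and memory figure read off `df.info()` can also be computed directly. The sketch below uses a small hypothetical frame (`sample`) in place of the real dataset, so the numbers it prints are illustrative only.

```python
import pandas as pd

# Hypothetical three-row frame standing in for the real dataset
sample = pd.DataFrame({
    "App Name": ["A", "B", "C"],
    "Rating": [4.5, None, 3.0],
    "Maximum Installs": [120, 15, 4000],
    "Free": [True, True, False],
})

# Count how many columns share each dtype
dtype_counts = sample.dtypes.astype(str).value_counts()
print(dtype_counts)

# Approximate memory footprint in MB, including the string contents
memory_mb = sample.memory_usage(deep=True).sum() / 1024**2
print(f"{memory_mb:.6f} MB")
```

On the real data, the same two expressions applied to `df` should reproduce the counts listed above.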

## **Descriptive Statistics**

In [None]:
df.describe()

- `describe()` summarizes only the numeric variables. It shows that only these 5 columns in the whole data are numeric: ***Rating, Rating Count, Minimum Installs, Maximum Installs, Price***
- The count of ***Rating*** (2.290016e+06) and ***Rating Count*** (2.290016e+06) is lower than that of the other columns, which shows that these 2 columns contain missing values
- Maximum number of ***Minimum Installs*** recorded for any application is 10 billion
- Maximum number of ***Maximum Installs*** recorded for any application is 12+ billion
- Maximum ***Rating*** for any application is 5
- Maximum ***Rating Count*** for any application is about 138.56 million
- Maximum ***Price*** for any application is 400. At this point, however, we cannot say which currency it is in.

**To display every column, we can use pandas' `set_option()` function**

In [None]:
pd.set_option('display.max_columns', None)

## **Missing Values**

In [None]:
missing_values = df.isnull().sum().sort_values(ascending=False)
print(missing_values)

- ***Developer Website*** has the highest number of missing values: 760835
- It also shows that many app developers do not provide a ***Developer Website***, a ***Privacy Policy*** (which is very important) or the ***Released*** date of the application.
- Missing values in ***Minimum Android*** mean that the developers do not state which Android version the application requires.
- Missing values in ***Size*** mean that they do not provide details about the application size.

### **Visualizing Missing/Null Values**

In [None]:
import matplotlib
matplotlib.rcParams['figure.figsize'] = (20,10)
sns.heatmap(df.isnull(), xticklabels=False, cbar=False, cmap='viridis')
plt.title('Number of Missing Values')

- The heatmap makes the missing values easy to see: the yellow lines mark the missing values in each column.

**Percentage of missing values**

In [None]:
missing_percentage = df.isnull().sum().sort_values(ascending=False)/len(df)*100
print(missing_percentage)

- Here, we can see the percentage of missing data in the columns.
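
As a compact alternative, the same percentages can be obtained from the mean of the boolean mask, since the mean of `True`/`False` values is the fraction of `True`. The frame below is a hypothetical stand-in for `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row frame with some missing entries
sample = pd.DataFrame({
    "Developer Website": ["https://a.example", None, None, None],
    "Rating": [4.1, np.nan, 3.9, 5.0],
    "App Id": ["a", "b", "c", "d"],
})

# isnull() gives True/False per cell; the column mean is the share of missing cells
missing_pct = sample.isnull().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```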

**Visualizing Percentage of Missing Values**

In [None]:
missing_percentage = missing_percentage[missing_percentage != 0]
import matplotlib
matplotlib.rcParams['figure.figsize'] = (20, 8)
sns.barplot(x=missing_percentage, y=missing_percentage.index)
plt.xticks(rotation=90)
plt.title('Percentage of Missing Values')

- Here, it is easier to visualize the percentage of missing values in the data.
- The columns with the highest percentage of missing values are:
    - ***Developer Website***
    - ***Privacy Policy***
- If we want, we can drop these columns from the dataset, or drop the rows that contain missing values. We cannot impute missing values in these columns because they are specific/unique to each application.
- We can impute or drop the few null values in columns like:
  - Size
  - Installs
  - Currency
  - Minimum Installs
  - App Name
  - Developer Id
  - Developer Email
- We can impute null values for the following columns, because they are important features:
  - Released
  - Rating
  - Minimum Android
  - Rating Count
- Whether to impute missing values depends on the end purpose of the data, and we normally perform imputation when we have a small amount of data
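
One possible way to impute a feature such as ***Rating*** is to fill each gap with the median of the app's own category. This is only a sketch of one strategy, shown on a hypothetical frame; the notebook itself does not commit to a specific imputation method.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two categories, each with one missing rating
sample = pd.DataFrame({
    "Category": ["Tools", "Tools", "Tools", "Education", "Education"],
    "Rating": [4.0, np.nan, 3.0, 5.0, np.nan],
})

# Median rating of each row's category, aligned with the original rows
category_median = sample.groupby("Category")["Rating"].transform("median")

# Fill only the missing ratings; existing ratings are left untouched
sample["Rating"] = sample["Rating"].fillna(category_median)
print(sample)
```

A group-wise median keeps the fill value close to comparable apps, but it also shrinks the variance of the column, so it is worth comparing results with and without imputation.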

**Dropping Missing Values from the Data**

In [None]:
df.dropna(subset=['Size', 'Installs', 'Currency', 'Developer Id', 'Developer Email', 'App Name', 'Minimum Installs'], inplace=True)

- Here, we have dropped rows that contain missing values from the dataset to clean our data.

In [None]:
df.isnull().sum().sort_values(ascending=False)

- Here, we can see that missing values are removed from most of the columns and only those columns are left behind where we want to impute missing values.

## **Data Cleaning**

**Let's check for duplicated values in the `App Name` column, since duplicates can also appear in columns we expect to be unique**

In [None]:
df['App Name'].duplicated().any()

- Here, `True` shows that the ***App Name*** column contains duplicated values, meaning that more than one app shares the same name.

**Now, let's see which rows are duplicated**

In [None]:
df['App Name'].value_counts()

- Here we can see that many app names appear more than once in this column.

**Before removing the duplicated values, let's check whether they are actually duplicates**

In [None]:
df[df['App Name'] == 'Tic Tac Toe']

- Here we can see that they are not actual duplicates: the name is the same, but other features such as App Id, Category, Installs and Rating differ.

**Based on `App Id`, we can check whether the data contains duplicated rows**

In [None]:
df['App Id'].value_counts()

- In the output above, the count for each ***App Id*** is 1. So there are apps that share a name but differ in their App Id, and there are no duplicated rows in the data.
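
Scanning `value_counts()` output works, but the same two checks can be reduced to single values. The sketch below uses hypothetical App Ids to show the idea.

```python
import pandas as pd

# Hypothetical rows: two different apps share a name but not an App Id
sample = pd.DataFrame({
    "App Name": ["Tic Tac Toe", "Tic Tac Toe", "Notes"],
    "App Id": ["com.a.tictactoe", "com.b.tictactoe", "com.c.notes"],
})

# Number of rows whose App Name already appeared earlier in the column
name_dupes = int(sample["App Name"].duplicated().sum())

# True only if every App Id occurs exactly once
id_is_unique = bool(sample["App Id"].is_unique)

print(name_dupes, id_is_unique)
```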

### **Explore Different Variables**

1. Installs

In [None]:
df['Installs'].unique()

- It shows the number of installations of different apps.
- Highest number of installations recorded for any app is 1 billion+
- Lowest number of installations is 0+
- Here `+` marks a lower bound: the Play Store reports installs in brackets, so `1,000+` means at least 1,000 installs.

**Convert `Installs` from `object` to `int` datatype**
> As mentioned before, the object dtypes were left to be dealt with later. Here we deal with this one, along with its commas (,) and plus signs (+).

In [None]:
# Strip the trailing '+' and the thousands separators, then cast to integer
df['Installs'] = df['Installs'].str.replace('+', '', regex=False)
df['Installs'] = df['Installs'].str.replace(',', '', regex=False)
df['Installs'] = df['Installs'].astype(np.int64)

In [None]:
df['Installs'].unique()

- As we can see from the output, the commas (,) and plus signs (+) have been removed from the install values, and the column type has been converted from object to int.
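
The same cleaning can be wrapped in a small helper that also tolerates missing values. `parse_installs` is a hypothetical name used for illustration, and the nullable `Int64` dtype is an assumption made so that missing entries survive the conversion.

```python
import pandas as pd

def parse_installs(series):
    """Turn strings like '1,000+' into nullable integers."""
    # Remove every '+' and ',' in one pass
    cleaned = series.str.replace(r"[+,]", "", regex=True)
    # Unparseable or missing entries become NaN instead of raising
    return pd.to_numeric(cleaned, errors="coerce").astype("Int64")

raw = pd.Series(["1,000+", "0+", "5,000,000+", None])
installs = parse_installs(raw)
print(installs)
```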

2. Currency

In [None]:
df['Currency'].unique()

- The output shows which currency codes appear in the dataset.
- List of currency codes found in the data:
  - **USD:** United States Dollar
  - **XXX:** This is often used as a placeholder or code for transactions involving no specific currency.
  - **CAD:** Canadian Dollar
  - **EUR:** Euro (used by many countries in the European Union)
  - **INR:** Indian Rupee
  - **VND:** Vietnamese Dong
  - **GBP:** British Pound Sterling
  - **BRL:** Brazilian Real
  - **KRW:** South Korean Won
  - **TRY:** Turkish Lira
  - **RUB:** Russian Ruble
  - **SGD:** Singapore Dollar
  - **AUD:** Australian Dollar
  - **PKR:** Pakistani Rupee
  - **ZAR:** South African Rand

3. Size

In [None]:
df['Size'].unique()

- From the above output, we can see that the size of the App can be in GB, MB and KB.

**Let's convert App size to MB**

In [None]:
df['Size'] = df['Size'].apply(lambda x: str(x).replace('M', '') if 'M' in str(x) else x)

- Here, we first removed `M` from each value in the ***Size*** column. For example, a value that was 10M is now just 10.

In [None]:
df['Size'] = df['Size'].apply(lambda x: float(str(x).replace('k', ''))/1024 if 'k' in str(x) else x)

- Here we hit a value that does not match the rest of the data: `1,018`, whose comma makes `float()` fail. The comma is most likely a thousands separator (1,018 KB), so we remove it from the **Size** value rather than drop the row.

In [None]:
df['Size'] = df['Size'].apply(lambda x: str(x).replace(',', '') if ',' in str(x) else x)

- After this, we have to run the previous cell again, which converts the values in *KB* to *MB*.

**Again convert `Size` to float**

- Here we encounter another problem: for some apps the ***Size*** is `Varies with device`. You may drop these rows or replace the value with 0. Here, I replace it with 0.

In [None]:
df['Size'] = df['Size'].apply(lambda x:str(x).replace('Varies with device', '0') if 'Varies with device' in str(x) else x)

- Here we have replaced the `Varies with device` values with `0`.

**Now let's convert `Size` to float**

In [None]:
# df['Size'] = df['Size'].apply(lambda x:float(x))

- Here we are again encountering an issue that there are values in GBs like 1.5G

**Convert GBs to MBs**

In [None]:
df['Size'] = df['Size'].apply(lambda x: float(str(x).replace('G', ''))*1024 if 'G' in str(x) else x)

- Now we have converted the data in GBs to MBs.

**Now convert `Size` to float**

In [None]:
df['Size'] = df['Size'].apply(lambda x:float(x))

In [None]:
df.dtypes['Size']

In [None]:
print(max(df['Size']))

- Conclusion from the ***Size*** column
  - We converted GB, MB and KB values to MB to get a consistent view.
  - The largest app size is 1536 MB, leaving aside applications whose size varies with device.
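
The step-by-step conversion above can also be written as one parsing function applied to the raw column. This is a sketch under two assumptions: commas are thousands separators, and `Varies with device` is mapped to NaN rather than 0 (a deliberate difference from the cells above, so that unknown sizes do not look like zero-byte apps). `size_to_mb` is a hypothetical helper name.

```python
import math

# Multipliers that convert each unit suffix to megabytes
_UNIT_TO_MB = {"k": 1 / 1024, "M": 1.0, "G": 1024.0}

def size_to_mb(raw):
    """Convert strings such as '512k', '10M' or '1.5G' to a size in MB."""
    text = str(raw).strip().replace(",", "")
    # Anything without a known unit suffix (e.g. 'Varies with device') is unknown
    if not text or text[-1] not in _UNIT_TO_MB:
        return math.nan
    try:
        return float(text[:-1]) * _UNIT_TO_MB[text[-1]]
    except ValueError:
        return math.nan

for value in ["10M", "512k", "1.5G", "1,018k", "Varies with device"]:
    print(value, "->", size_to_mb(value))
```

On the raw column this would be used as `df['Size'].apply(size_to_mb)`, before any of the in-place replacements above have been run.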

4. Minimum Android

In [None]:
df['Minimum Android'].unique()

- It shows the different ***Minimum Android*** versions required to run a Play Store application. If the minimum version is, say, 4.1, then writing `and up` adds nothing, so we can remove it.

**Removing ` and up` from Minimum Android Values**

In [None]:
df['Minimum Android'] = df['Minimum Android'].str.replace(' and up', '')

In [None]:
df['Minimum Android'].unique()

- Some values also carry a `W` suffix, which marks an Android version designed specifically for wearable devices. For example, `4.4W` is a wearable API level released before Android 5.0 (Lollipop) as an update that added the Android Wear APIs. This version is exclusive to smartwatches and other wearable devices, so `4.4W` is not the same as the regular Android 4.4 version designed for smartphones and tablets.
- So this column contains not only plain version numbers but also wearable versions.
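
If the wearable versions need to be analysed separately, one option is to split the column into a wearable flag and a base version. The sample values below are hypothetical, and the regular expression is an assumption about how the versions are formatted.

```python
import pandas as pd

# Hypothetical values as they might look after removing ' and up'
versions = pd.Series(["4.1", "4.4W", "Varies with device", "5.0 - 8.0", None])

# Flag wearable-only versions; missing values count as not wearable
is_wear = versions.str.endswith("W", na=False)

# Keep the leading dotted version number, e.g. '4.4' from '4.4W'
base_version = versions.str.extract(r"^(\d+(?:\.\d+)*)", expand=False)

print(pd.DataFrame({"raw": versions, "is_wear": is_wear, "base": base_version}))
```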

5. Content Rating

In [None]:
df['Content Rating'].unique()

- It shows which age group each app is intended for, based on its content.
- Here:
  - Everyone -> Content is generally suitable for all ages. May contain minimal cartoon, fantasy or mild violence and/or infrequent use of mild language.
  - Teen -> Content is generally suitable for ages 13 and up. May contain violence, suggestive themes, crude humor, minimal blood, simulated gambling, and/or infrequent use of strong language.
  - Mature 17+ -> Content is generally suitable for ages 17 and up. May contain intense violence, blood and gore, sexual content, and/or strong language.
  - Everyone 10+ -> Content is generally suitable for ages 10 and up. May contain more cartoon, fantasy or mild violence, mild language and/or minimal suggestive themes.
  - Adults only 18+ -> Content is suitable only for adults. May include graphic depictions of sex and/or violence.
  - Unrated -> Indicates possible exposure to unfiltered/uncensored user-generated content, including user-to-user communications and media sharing via online platforms.

6. Released

In [None]:
df['Released'].unique()

- From here, we can see how long ago an app was released and work out in which year the most apps were released.

**Imputing null values in `Released`**

In [None]:
# impute null values here and check the year with the most apps released
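
The year check mentioned for ***Released*** could be sketched as follows. The date format `%b %d, %Y` (for example `Feb 26, 2020`) is an assumption about how the column is written, and the values are hypothetical; rows that fail to parse are left as missing rather than imputed.

```python
import pandas as pd

# Hypothetical release dates, one of them missing
released = pd.Series(["Feb 26, 2020", "May 21, 2020", "Aug 9, 2019", None])

# Parse to datetimes; unparseable or missing entries become NaT
release_date = pd.to_datetime(released, format="%b %d, %Y", errors="coerce")

# Count apps per release year (NaT rows are ignored by value_counts)
apps_per_year = release_date.dt.year.value_counts()
busiest_year = int(apps_per_year.idxmax())

print(apps_per_year)
print("Year with the most releases:", busiest_year)
```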

7. Last Updated

In [None]:
df['Last Updated']

- It shows when an app was last updated. Together with ***Released***, it lets us check whether an app has been updated since its release.
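
A minimal sketch of that comparison, again assuming both columns use the `%b %d, %Y` format and using hypothetical rows:

```python
import pandas as pd

# Hypothetical apps: the first was never updated, the second was
sample = pd.DataFrame({
    "Released": ["Feb 26, 2020", "Aug 9, 2019"],
    "Last Updated": ["Feb 26, 2020", "Jan 1, 2021"],
})

released = pd.to_datetime(sample["Released"], format="%b %d, %Y", errors="coerce")
updated = pd.to_datetime(sample["Last Updated"], format="%b %d, %Y", errors="coerce")

# Days between the release and the most recent update
days_to_last_update = (updated - released).dt.days

# An app whose last update is its release date has never been updated
never_updated = days_to_last_update.eq(0)

print(days_to_last_update.tolist(), never_updated.tolist())
```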

8. Privacy Policy

In [None]:
df['Privacy Policy']

- This column holds the link to each application's privacy policy. Providing one is very important because it tells users what data the application collects and how that data is used.

9. Scraped Time

In [None]:
df['Scraped Time']

- From here we can see when the data for a particular app was scraped. We can also calculate how long the scraping took and when most of the data was collected. It also tells us that any changes made to an app after this time are not reflected in the dataset.

10. Free

In [None]:
df['Free']

- This column shows which apps are free and which apps are paid with True and False respectively. We can convert it to paid and free from True and False for easy understanding.

**Paid and Free Apps**

In [None]:
df['Type'] = np.where(df['Free']==True, 'Free', 'Paid')
df.drop(['Free'], axis=1, inplace=True)

In [None]:
df['Type']

In [None]:
num_free_apps = len(df[df['Type'] == 'Free'])
print(num_free_apps)

- From here we can read Free/Paid directly instead of True/False.
- Free apps far outnumber paid apps on the Google Play Store: there are 2267616 free apps.
- We can also check whether free or paid apps have more installations.
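
The free-versus-paid comparison suggested above can be done with a single `groupby`. The frame below is hypothetical; on the real data the same call would be applied to `df`.

```python
import pandas as pd

# Hypothetical apps with their type and install counts
sample = pd.DataFrame({
    "Type": ["Free", "Free", "Paid", "Paid", "Free"],
    "Installs": [1000, 500000, 100, 5000, 10],
})

# Number of apps, total installs and median installs per type
installs_by_type = sample.groupby("Type")["Installs"].agg(["count", "sum", "median"])
print(installs_by_type)
```

The median is included because install counts are heavily skewed, so totals alone can be dominated by a handful of very popular apps.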

**Dealing with Content Rating**

In [None]:
df['Content Rating'].value_counts()

- From here we can see that most apps on the Google Play Store are rated for everyone, and the fewest apps are rated Adults only 18+
- From here, we can calculate which content rating has the most installs
- We could also use `ANOVA` to test whether the groups differ significantly. It is mostly useful when the differences between groups are too small to judge by eye, and whether to run it depends on the requirements of our company or stakeholder.
- Here we can merge the ratings into main categories to represent the data more clearly, like
  - Adults only 18+ -> Adults
  - Everyone 10+ -> Teen
  - Unrated -> Everyone
  - Mature 17+ -> Adults

In [None]:
# Collapse the six Play Store ratings into three broader groups
content_rating_map = {
    'Unrated': 'Everyone',
    'Adults only 18+': 'Adults',
    'Mature 17+': 'Adults',
    'Everyone 10+': 'Teen',
}
df['Content Rating'] = df['Content Rating'].replace(content_rating_map)

In [None]:
df['Content Rating'].unique()

11. Rating

In [None]:
df['Rating'].unique()

- It shows different ratings to different apps given by users.
- Maximum rating is 5
- Here we can also calculate the number of apps with maximum rating that is 5.

12. Rating Count

In [None]:
df['Rating Count']

In [None]:
# Series.max() skips missing values (NaN) in the column
max_rating_count = df['Rating Count'].max()
max_rating_count

In [None]:
max_rating_rows = df[df['Rating Count'] == max_rating_count]

# Extract App Name and App Id from the filtered rows
app_name = max_rating_rows['App Name'].iloc[0]
app_id = max_rating_rows['App Id'].iloc[0]
print(app_name)
print(app_id)

- It represents the number of people who have rated an app
- It shows that the maximum rating count for any app is 138557570.0
- It shows that the app with the maximum rating count is `Whatsapp` with App Id `com.whatsapp`
- We can also divide it into different categories for better understanding

**Rating Count Categories**

In [None]:
# Default label covers apps whose Rating Count is 0 or missing
df['Rating Type'] = 'NoRatingProvided'
df.loc[(df['Rating Count'] > 0) & (df['Rating Count'] <= 10000), 'Rating Type'] = 'Less than 10k'
df.loc[(df['Rating Count'] > 10000) & (df['Rating Count'] <= 500000), 'Rating Type'] = 'Between 10k and 500k'
# No hard-coded upper bound: everything above 500k falls in the last bucket
df.loc[df['Rating Count'] > 500000, 'Rating Type'] = 'More than 500k'
df['Rating Type'].value_counts()

- Here we have again tidied up the data: ***Rating Count*** is converted into categories, and the number of apps in each category is:
    - Less than 10k -> 1192855
    - NoRatingProvided -> 1082645
    - Between 10k and 500k -> 35779
    - More than 500k -> 1665
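
The same bucketing can be expressed with `pd.cut`, which keeps the bin edges and labels in one place. This is an alternative sketch on hypothetical counts, not the cell that produced the numbers above.

```python
import numpy as np
import pandas as pd

# Hypothetical rating counts, including a zero and a missing value
rating_count = pd.Series([0, 25, 10_000, 10_001, 750_000, np.nan])

# Right-closed bins: (0, 10k], (10k, 500k], (500k, inf)
bins = [0, 10_000, 500_000, np.inf]
labels = ["Less than 10k", "Between 10k and 500k", "More than 500k"]
rating_type = pd.cut(rating_count, bins=bins, labels=labels)

# Zero and missing counts fall outside every bin, so label them explicitly
rating_type = rating_type.cat.add_categories("NoRatingProvided").fillna("NoRatingProvided")

print(rating_type.value_counts())
```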

## **Important Questions Related to the Data**

### **1. What are the top 10 categories of Apps on Google Playstore?**

In [None]:
df['Category'].unique()

In [None]:
# Count apps per category; naming both columns explicitly keeps this working across pandas versions
top_category = df['Category'].value_counts().rename_axis('Category').reset_index(name='count')
top_category

In [None]:
# Plot the ten categories with the most apps (top_category is already sorted by count)
top_10_categories = top_category.head(10)
plt.figure(figsize=(16, 8))
plt.xticks(rotation=60)
plt.title('Top 10 Categories of Apps')
sns.barplot(x='Category', y='count', data=top_10_categories)

plt.show()

- From the above dataframe, you can see that the top 10 categories are:
  1. Education -> 241090
  2. Music & Audio -> 154906
  3. Tools -> 143988
  4. Business -> 143771
  5. Entertainment -> 138276
  6. Lifestyle -> 118331
  7. Books & Reference -> 116728
  8. Personalization -> 89210
  9. Health & Fitness -> 83510
  10. Productivity -> 79698
- Combined with the install totals computed in the next question, it also tells us that `Tools` has fewer apps than `Education` or `Music & Audio` yet the largest number of installations. So there is an opportunity for businesses to invest in this category.

### **2. Which of the top 10 categories are installed the most?**

In [None]:
Category_installs = df.groupby(['Category'])[['Installs']].sum()
print(Category_installs)

In [None]:
top_category_installs = pd.merge(top_category, Category_installs, on='Category')
top_category_installs.head()

- Total installs of the five largest categories, ordered by installs:
    1. Tools -> 71440271217
    2. Entertainment -> 17108396833
    3. Music & Audio -> 14239401798
    4. Education -> 5983815847
    5. Business -> 5236661902

#### **Top 10 most Installed Categories**

In [None]:
top_10_category_installs = top_category_installs.head(10).sort_values(by=['Installs'], ascending=False)
import matplotlib
matplotlib.rcParams['figure.figsize'] = (20,8)
plt.title("Top 10 most Installed Categories")
sns.barplot(x=top_10_category_installs.Category, y=top_10_category_installs.Installs)

According to our analysis, this is how the 10 largest categories rank by total installs:
1. Tools
2. Productivity
3. Entertainment
4. Music & Audio
5. Personalization
6. Lifestyle
7. Education
8. Business
9. Books & Reference
10. Health & Fitness
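
Because app count and install totals tell different stories, it can help to put both in one table and add installs per app. The sketch below uses hypothetical data; the column names `app_count`, `total_installs` and `installs_per_app` are introduced only for illustration.

```python
import pandas as pd

# Hypothetical apps: Education has more apps, Tools has more installs
sample = pd.DataFrame({
    "App Id": ["a", "b", "c", "d", "e"],
    "Category": ["Education", "Education", "Education", "Tools", "Tools"],
    "Installs": [100, 50, 10, 5000, 1000],
})

# One row per category with its app count and total installs
summary = sample.groupby("Category").agg(
    app_count=("App Id", "size"),
    total_installs=("Installs", "sum"),
)

# Average installs per app highlights crowded versus in-demand categories
summary["installs_per_app"] = summary["total_installs"] / summary["app_count"]
summary = summary.sort_values("installs_per_app", ascending=False)
print(summary)
```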

In [None]:
plt.figure(figsize=(8,6))
# Largest single-app install count per category, highest first
data = df.groupby('Category')['Maximum Installs'].max().sort_values(ascending=False)
data = data.head(10)
labels = data.keys()
plt.pie(data, labels=labels, autopct='%1.1f%%')
plt.title('Top 10 Categories with Maximum Installs')
plt.show()

- From here we can see the ten categories whose most-installed app has the highest install count

### **3. Which is the highest rated category?**

In [None]:
# Mean rating per category, highest first (missing ratings are skipped by mean())
category_ratings = df.groupby('Category')['Rating'].mean().sort_values(ascending=False)

# The first index label is the category with the highest mean rating
highest_rated_category = category_ratings.index[0]

# Print the result
print("The highest rated category is:", highest_rated_category)

- It prints the category with the highest mean rating. Filtering to `Rating == 5.0` before averaging would give every category a mean of exactly 5.0, so the sort would simply return the first category alphabetically, ***Action***.

In [None]:
plt.figure(figsize=(14,7))
plt.title("Highest Rated Category")
sns.barplot(x='Category', y='Rating', data=df)
plt.xticks(rotation=90)
plt.show()

- It gives a clearer picture of the average rating in each category.

### **4. Which Category has the most Paid and Free apps?**

In [None]:
# Filter the dataframe to include only the rows where the Type column is "Free" or "Paid"
filtered_df = df[df['Type'].isin(['Free', 'Paid'])]

# Group the resulting dataframe by the Category column and count the number of rows for each category where the Type column is "Free" or "Paid"
grouped_df = filtered_df.groupby(['Category', 'Type']).size().reset_index(name='Count')

# Pivot the resulting dataframe to have the categories as rows and the types as columns
pivoted_df = grouped_df.pivot(index='Category', columns='Type', values='Count').reset_index()

# Sort the resulting dataframe by the count of free apps in descending order
sorted_free_df = pivoted_df.sort_values('Free', ascending=False)

# Sort the resulting dataframe by the count of paid apps in descending order
sorted_paid_df = pivoted_df.sort_values('Paid', ascending=False)

# Select the first row of the resulting dataframe for the category with the most free apps
most_free_category = sorted_free_df.iloc[0]['Category']

# Select the first row of the resulting dataframe for the category with the most paid apps
highest_paid_category = sorted_paid_df.iloc[0]['Category']

# Print the results
print("The category with the most free apps is:", most_free_category)
print("The category with the most paid apps is:", highest_paid_category)

- It shows that ***Education*** is the category with the most free and paid apps.

In [None]:
# Filter the dataframe to include only the rows where the Type column is "Free" or "Paid"
filtered_df = df[df['Type'].isin(['Free', 'Paid'])]

# Group the resulting dataframe by the Category column and count the number of rows for each category where the Type column is "Free" or "Paid"
grouped_df = filtered_df.groupby(['Category', 'Type']).size().reset_index(name='Count')

# Pivot the resulting dataframe to have the categories as rows and the types as columns
pivoted_df = grouped_df.pivot(index='Category', columns='Type', values='Count').reset_index()

# Sort the resulting dataframe by the count of free apps in descending order
sorted_free_df = pivoted_df.sort_values('Free', ascending=False)

# Sort the resulting dataframe by the count of paid apps in descending order
sorted_paid_df = pivoted_df.sort_values('Paid', ascending=False)

# Print the resulting dataframes
print("Category-wise count of free apps:")
print(sorted_free_df)

print("\nCategory-wise count of paid apps:")
print(sorted_paid_df)

- From here we can see the exact count of paid and free apps in each category 

In [None]:
# Create a cross-tabulation of the "Category" and "Type" columns
ct = pd.crosstab(df['Category'], df['Type'])

# Reshape to long format: one row per (Category, Type) pair with its count
stacked_df = ct.stack().reset_index(name='Count')

# Create a grouped bar plot of the resulting dataframe with Seaborn
sns.set_style("whitegrid")
sns.barplot(data=stacked_df, x='Category', y='Count', hue='Type', palette="rocket")

# Add labels and title
plt.xticks(rotation=90)
plt.xlabel('Category')
plt.ylabel('Count')
plt.title('Free vs Paid Apps in All Categories')

# Show the plot
plt.show()

### **5. How does the size of the application impact installations?**

In [None]:
plt.figure(figsize=(18,9))
plt.xticks(rotation=60, fontsize=9)
plt.title("Impact of Application size on Installations")
sns.scatterplot(x='Size', y='Installs', hue='Type', data=df)

### **6. What is the impact of Content Rating on Maximum Installations?**

In [None]:
plt.figure(figsize=(12,6))
sns.scatterplot(data=df, x='Maximum Installs', y='Rating Count', hue='Content Rating')
plt.title('Content Rating and Maximum Installations')

### **7. How many apps are available in each category?**

In [None]:
plt.figure(figsize=(16, 12))
plt.xticks(rotation=90)
plt.title('Number of Apps in each category')
sns.barplot(y='Category', x='count', data=top_category_installs)

plt.show()

- From here we can easily visualize the number of apps in each category.
- Top 5 categories with most number of apps
  1. Education
  2. Music & Audio
  3. Tools
  4. Business
  5. Entertainment

### **8. How many apps have a rating above a certain threshold (e.g., 4.0)?**

In [None]:
# Filter the dataframe
filtered_df = df[df['Rating'] > 4.0]

# Count the number of rows
total_apps_above_4_rating = len(filtered_df)

# Print the result
print("Total number of apps with a rating above 4.0:", total_apps_above_4_rating)

- It shows that there are 750285 Apps with rating above 4.0
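
Since the question is phrased for "a certain threshold", the count can be wrapped in a small function and reused for other cut-offs. `count_above` is a hypothetical helper, and the ratings below are made up for illustration.

```python
import numpy as np
import pandas as pd

def count_above(ratings, threshold):
    """Number of ratings strictly greater than the threshold (NaN never counts)."""
    return int((ratings > threshold).sum())

# Hypothetical ratings: one unrated app (0.0) and one missing value
ratings = pd.Series([0.0, 3.9, 4.0, 4.1, 5.0, np.nan])

above_4 = count_above(ratings, 4.0)

# Share of apps above the threshold among apps that have a rating at all
rated = ratings[ratings > 0]
share_above_4 = count_above(rated, 4.0) / len(rated)

print(above_4, share_above_4)
```

Reporting the share among rated apps alongside the raw count avoids the unrated apps (rating 0) diluting the picture.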

In [None]:
plt.figure(figsize=(12,6))
sns.kdeplot(df.Rating, color='Blue', fill=True)  # `fill` replaces the deprecated `shade` argument
plt.xlabel('Rating')
plt.ylabel('Density')
plt.title('Distribution of Rating')

- It shows a peak at zero, meaning that a large share of apps have not been rated at all, while the apps that are rated mostly score between 4 and 5. We can visualize this more clearly with the help of a histplot.

In [None]:
plt.figure(figsize=(12,6))
sns.histplot(df.Rating, kde=True ,bins=5)
plt.title('Distribution of Rating')

- Here we can see more clearly that many apps have no rating, but when apps are rated, the rating is most often between 4 and 5

### **9. What are the top 5 Free Apps based on highest ratings and installs?**

In [None]:
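# Free apps with at least 5 million installs, ranked by rating
free_apps = df[(df.Type=='Free')&(df.Installs>=5000000)]
free_apps = free_apps.groupby('App Name')['Rating'].max().sort_values(ascending=False)
# Keep only the top 5 so the plot below matches its title
free_apps = free_apps.head(5)
free_apps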

In [None]:
# category_type_installs = df.groupby(['Category', 'Type'])[['Installs']].sum().reset_index()
# category_type_installs['log_Installs'] = np.log10(category_type_installs['Installs'])

In [None]:
plt.figure(figsize=(18,7))
plt.title("Top 5 Free Rated Apps")
sns.lineplot(x=free_apps.values, y=free_apps.index, color = 'Red')

### **10. What are the top 5 Paid Apps based on highest ratings and installs?**

In [None]:
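# Paid apps with at least 5 million installs, ranked by rating
paid_apps = df[(df.Type=='Paid')&(df.Installs>=5000000)]
paid_apps = paid_apps.groupby('App Name')['Rating'].max().sort_values(ascending=False)
# Keep only the top 5 so the plot below matches its title
paid_apps = paid_apps.head(5)
paid_apps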

In [None]:
plt.figure(figsize=(18,7))
plt.title("Top 5 Paid Rated Apps")
sns.lineplot(x=paid_apps.values, y=paid_apps.index, color = 'Blue')

In [None]:
df.head()

### **Heat map to see the correlation between different features**

In order to see the correlation, we have to drop the categorical columns from the DataFrame

In [None]:
df.drop(columns=['Currency', 'Developer Id', 'Developer Email', 'Last Updated', 'Scraped Time', 'Category', 'App Id', 'Content Rating', 'Rating Type', 'App Name', 'Minimum Android', 'Released', 'Privacy Policy', 'Developer Website', 'Type'], inplace=True)

In [None]:
df.dropna(subset=['Rating', 'Rating Count'], inplace=True)

In [None]:
df.corr()

In [None]:
plt.figure(figsize=(20,10))
plt.title("Heatmap")
sns.heatmap(df.corr(), cbar=True, yticklabels=True, annot=True, cmap='viridis')
plt.show()

- It gives a complete overview of the correlations between the important features
  - There is a slightly positive correlation between ***Installs*** and ***Rating Count***, meaning that apps with more ratings tend to have more installs.
  - There is a negative correlation between ***Price*** and ***Installs***, meaning that more expensive apps tend to have fewer installs.
  - There is a negative correlation between ***Size*** and ***Installs***, meaning that larger apps tend to have fewer installs.
  - ***Ad Supported*** and ***In App Purchases*** are correlated with ***Rating***. Note that ***Ad Supported*** means the app shows advertisements, not that it offers customer support. The correlation suggests that apps monetised through ads or in-app purchases (such as subscription plans) tend to engage more customers.
  - ***Editors Choice*** is also correlated with ***Rating Count***
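
Install counts span many orders of magnitude, so a Pearson correlation on the raw values can be dominated by a few very large apps. One common remedy, shown here as a sketch on hypothetical numbers, is to correlate on a log scale; the `+ 1` inside the logarithm is an assumption made to keep apps with zero installs defined.

```python
import numpy as np
import pandas as pd

# Hypothetical apps whose installs grow tenfold while rating counts grow linearly
sample = pd.DataFrame({
    "Installs": [10, 100, 1_000, 10_000, 100_000],
    "Rating Count": [1, 2, 3, 4, 5],
})

# Pearson correlation on the raw install counts
r_raw = sample["Installs"].corr(sample["Rating Count"])

# Pearson correlation after a log10 transform of the installs
log_installs = np.log10(sample["Installs"] + 1)
r_log = log_installs.corr(sample["Rating Count"])

print(f"raw: {r_raw:.3f}  log-scaled: {r_log:.3f}")
```

If the log-scaled correlation is much stronger than the raw one, the relationship is closer to multiplicative than linear, which is worth keeping in mind when reading the heatmap above.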

## **Conclusion:**
- Many people do not give a rating, but the people who do tend to give a 4+ rating.
- Most of the installations are made by teens, and most of them are video players and editors.
- The size of an application affects its installations: larger apps tend to have fewer installs.
- People have mostly installed free apps, and the availability of free apps is also high.
- In-app purchases are correlated with rating count, which suggests that offering subscription plans may help to engage customers.
- Most apps available on the Google Play Store are in the Education category, but the most installations are in the Tools category. So there is an opportunity for businesses to invest in this category.