<a href="https://www.kaggle.com/code/gaurobsaha/google-app?scriptVersionId=218830398" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Google Playstore Data 
### Complete EDA

## About Dataset

>`Description`

This dataset was taken from Kaggle. Get the dataset: [Link](https://www.kaggle.com/datasets/lava18/google-play-store-apps)

>`Context`

While many public datasets (on Kaggle and the like) provide Apple App Store data, there are not many counterpart datasets available for Google Play Store apps anywhere on the web. On digging deeper, I found out that the iTunes App Store page deploys a nicely indexed appendix-like structure that allows simple and easy web scraping. The Google Play Store, on the other hand, uses sophisticated modern-day techniques (like dynamic page loading) built on jQuery, which makes scraping more challenging.

>`Content`

Each app (row) has values for category, rating, size, and more.

>`Acknowledgements`

This information is scraped from the Google Play Store. This app information would not be available without it.

>`Inspiration`

The Play Store apps data has enormous potential to drive app-making businesses to success. Actionable insights can be drawn for developers to work on and capture the Android market!

### 1. Importing Libraries

In [24]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
%matplotlib inline 
#this is for jupyter notebook to show the plot in the notebook itself instead of opening a new window for the plot


### 2. Data Loading, Exploration, and Cleaning

In [25]:
data=pd.read_csv('/kaggle/input/google-play-store-apps/googleplaystore.csv') # Read the dataset
pd.set_option('display.max_columns',None) # this is to display all the columns in the dataframe
pd.set_option('display.max_rows',None) # this is to display all the rows in the dataframe
warnings.filterwarnings('ignore') # Hide all warnings
data.head() # show the first 5 rows

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


In [26]:
data.shape

(10841, 13)

In [27]:
print(f"The number of rows in the dataset is {data.shape[0]}, and the number of columns is {data.shape[1]}." )

The number of rows in the dataset is 10841, and the number of columns is 13.


In [28]:
# Not enough; let's look at the columns and their data types using the detailed info function
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10841 non-null  object 
 1   Category        10841 non-null  object 
 2   Rating          9367 non-null   float64
 3   Reviews         10841 non-null  object 
 4   Size            10841 non-null  object 
 5   Installs        10841 non-null  object 
 6   Type            10840 non-null  object 
 7   Price           10841 non-null  object 
 8   Content Rating  10840 non-null  object 
 9   Genres          10841 non-null  object 
 10  Last Updated    10841 non-null  object 
 11  Current Ver     10833 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(1), object(12)
memory usage: 1.1+ MB


In [29]:
# name of the columns
data.columns

Index(['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type',
       'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver',
       'Android Ver'],
      dtype='object')

# Observations
---
1. There are 10841 rows and 13 columns in the dataset.

2. The columns are of different data types.

3. The columns in the datasets are:

   >'App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type',   'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver',   'Android Ver'

4. There are some missing values in the dataset, which we will examine in detail and deal with later in the notebook.

5. Some columns are of object data type but should be numeric; we will convert them later in the notebook.

   >'Size', 'Installs', 'Price'

In [30]:
# summary statistics of the numerical columns
data.describe()

Unnamed: 0,Rating
count,9367.0
mean,4.193338
std,0.537431
min,1.0
25%,4.0
50%,4.3
75%,4.5
max,19.0


# Observations:
---
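- Only one column (`Rating`) is numeric; all the others are object data type (according to pandas). `Size`, `Installs`, and `Price` are numeric in nature, so we must convert them to a numeric data type in the data wrangling process.
- The maximum `Rating` is 19.0, which is outside the 1 to 5 rating scale, so at least one row looks malformed and is worth inspecting.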
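The summary above reports a maximum `Rating` of 19.0, which cannot be a valid star rating. A minimal sketch of how such rows could be located, using a small hypothetical frame (`toy`) in place of the real dataset; the 1 to 5 bounds are an assumption about the rating scale:

```python
import pandas as pd

# Hypothetical stand-in for the real data: one row has an impossible rating
toy = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Rating": [4.1, 19.0, None],
})

# Flag ratings that are present but outside the assumed 1 to 5 scale;
# missing ratings are left alone here
out_of_range = toy["Rating"].notna() & ~toy["Rating"].between(1, 5)
print(toy[out_of_range])

# Keep only the rows whose rating is missing or within range
toy_valid = toy[~out_of_range].reset_index(drop=True)
print(len(toy_valid))
```

Inspecting the flagged rows before dropping them is usually worthwhile, since an out-of-range value can hint that the whole row is misaligned.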


---
- Let's clean the Size column first

In [31]:
# check unique values in the Size column
data["Size"].unique()

array(['19M', '14M', '8.7M', '25M', '2.8M', '5.6M', '29M', '33M', '3.1M',
       '28M', '12M', '20M', '21M', '37M', '2.7M', '5.5M', '17M', '39M',
       '31M', '4.2M', '7.0M', '23M', '6.0M', '6.1M', '4.6M', '9.2M',
       '5.2M', '11M', '24M', 'Varies with device', '9.4M', '15M', '10M',
       '1.2M', '26M', '8.0M', '7.9M', '56M', '57M', '35M', '54M', '201k',
       '3.6M', '5.7M', '8.6M', '2.4M', '27M', '2.5M', '16M', '3.4M',
       '8.9M', '3.9M', '2.9M', '38M', '32M', '5.4M', '18M', '1.1M',
       '2.2M', '4.5M', '9.8M', '52M', '9.0M', '6.7M', '30M', '2.6M',
       '7.1M', '3.7M', '22M', '7.4M', '6.4M', '3.2M', '8.2M', '9.9M',
       '4.9M', '9.5M', '5.0M', '5.9M', '13M', '73M', '6.8M', '3.5M',
       '4.0M', '2.3M', '7.2M', '2.1M', '42M', '7.3M', '9.1M', '55M',
       '23k', '6.5M', '1.5M', '7.5M', '51M', '41M', '48M', '8.5M', '46M',
       '8.3M', '4.3M', '4.7M', '3.3M', '40M', '7.8M', '8.8M', '6.6M',
       '5.1M', '61M', '66M', '79k', '8.4M', '118k', '44M', '695k', '1.6M',
     

In [32]:
# checking for unique values and their counts in the Size column
data["Size"].value_counts()

Size
Varies with device    1695
11M                    198
12M                    196
14M                    194
13M                    191
15M                    184
17M                    160
19M                    154
26M                    149
16M                    149
25M                    143
20M                    139
21M                    138
10M                    136
24M                    136
18M                    133
23M                    117
22M                    114
29M                    103
27M                     97
28M                     95
30M                     84
33M                     79
3.3M                    77
37M                     76
35M                     72
31M                     70
2.9M                    69
2.5M                    68
2.3M                    68
2.8M                    65
3.4M                    65
32M                     63
34M                     63
3.7M                    63
40M                     62
3.9M                   

### Observations with the "Size" column
1. "Varies with device": Some values are not a size at all
2. "k": Some values are in kilobytes
3. "M": Some values are in megabytes

In [33]:
## checking for null values in the "Size" column
data["Size"].isnull().sum()

0

### We want to confirm whether there are only 3 types of values in the "Size" column

In [34]:
# find the values in size column which has 'M' in it
data["Size"].loc[data["Size"].str.contains("M")].value_counts().sum()

8829

In [None]:
# find the values in size column which has 'k' in it
data["Size"].loc[data["Size"].str.contains("k")].value_counts().sum()

In [36]:
# find the values in size column which has 'Varies with device' in it
data["Size"].loc[data["Size"].str.contains("Varies with device")].value_counts().sum()

1695

In [37]:
# Total Values in Size column
len(data)

10841

In [38]:
## Sum of the three counted groups
8829+316+1695

10840

# Observation
- We have 8829 values in M units
- We have 316 values in k units
- We have 1695 values of 'Varies with device'
- These add up to 10840, one short of the 10841 rows, so one value matches none of the three patterns.

> Let's convert the M and k units into bytes, remove the M and k from the values, convert them into a numeric data type, and replace 'Varies with device' with NaN
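The three separate counts above can also be obtained in a single pass, which makes any value that fits none of the patterns stand out. This is only a sketch on a hypothetical sample (`sizes`); the helper name `size_kind` and the stray value `"1,000+"` are illustrative assumptions, not taken from the notebook:

```python
import pandas as pd

# Hypothetical sample of raw Size strings, including one malformed value
sizes = pd.Series(["19M", "8.7M", "201k", "Varies with device", "1,000+"])

def size_kind(value):
    """Label a raw size string as 'M', 'k', 'varies' or 'other'."""
    if value == "Varies with device":
        return "varies"
    if value.endswith("M"):
        return "M"
    if value.endswith("k"):
        return "k"
    return "other"

kinds = sizes.map(size_kind)
print(kinds.value_counts())

# Anything labelled 'other' needs a closer look before conversion
print(sizes[kinds == "other"].tolist())
```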

In [39]:
# check the head of the size column before the conversion
data["Size"].head()

0     19M
1     14M
2    8.7M
3     25M
4    2.8M
Name: Size, dtype: object

In [40]:
# convert the size column to numeric: multiply by 1024 if the value ends in 'k' and by 1024*1024 if it ends in 'M'
def convert_size(size):
    if isinstance(size, str):
        if size.endswith('k'):
            return float(size[:-1]) * 1024
        if size.endswith('M'):
            return float(size[:-1]) * 1024 * 1024
        # 'Varies with device' and any other string that is not a size become NaN,
        # so the resulting column is fully numeric
        return np.nan
    return size


In [41]:
# Call the above function
data['Size'] = data['Size'].apply(convert_size)
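`convert_size` handles one value at a time through `apply`. For comparison, here is a vectorized sketch of the same conversion, assuming every valid size is a number followed by a single `k` or `M`; the sample Series and the names `parts` and `factors` are hypothetical:

```python
import numpy as np
import pandas as pd

sizes = pd.Series(["19M", "201k", "Varies with device"])

# Split each value into a number and a unit letter; non-matching rows become NaN
parts = sizes.str.extract(r"^(?P<number>\d+(?:\.\d+)?)(?P<unit>[kM])$")

# Bytes per unit, using the same 1024-based factors as convert_size
factors = {"k": 1024, "M": 1024 * 1024}

size_in_bytes = parts["number"].astype(float) * parts["unit"].map(factors)
print(size_in_bytes)
```

Because the pattern is anchored at both ends, anything that is not a plain size ends up as NaN rather than staying behind as a string.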

In [42]:
## check the size column after the conversion
data["Size"]

0         19922944.0
1         14680064.0
2          9122611.2
3         26214400.0
4          2936012.8
5          5872025.6
6         19922944.0
7         30408704.0
8         34603008.0
9          3250585.6
10        29360128.0
11        12582912.0
12        20971520.0
13        22020096.0
14        38797312.0
15         2831155.2
16         5767168.0
17        17825792.0
18        40894464.0
19        32505856.0
20        14680064.0
21        12582912.0
22         4404019.2
23         7340032.0
24        24117248.0
25         6291456.0
26        26214400.0
27         6396313.6
28         4823449.6
29         4404019.2
30         9646899.2
31         5452595.2
32        11534336.0
33        11534336.0
34         4404019.2
35         9646899.2
36        25165824.0
37               NaN
38        11534336.0
39         9856614.4
40        15728640.0
41        10485760.0
42               NaN
43         1258291.2
44        12582912.0
45        25165824.0
46        27262976.0
47         83

In [43]:
# changing the "Size" column name
data.rename(columns={"Size": "Size_in_bytes"}, inplace=True)


In [45]:
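# creating a new column "Size_in_Mb" from the byte values
# pd.to_numeric(..., errors="coerce") turns any value that is still a string into NaN,
# which avoids "TypeError: unsupported operand type(s) for /: 'str' and 'int'"
data["Size_in_Mb"] = pd.to_numeric(data["Size_in_bytes"], errors="coerce") / (1024*1024)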

In [None]:
data.head()

# Observations with "Installs" column
- The total number of values in the Installs column is 10841 and there are no null values in the column.
- The main problem here is the + sign and the thousands separators (,) in the values.

- However, one value, 0, has no plus sign.

- Let's remove the + and , characters from the values and convert the column into a numeric data type.


In [None]:
#checking for null values
data["Installs"].isnull().sum()

In [None]:
# checking for unique values
data["Installs"].unique()

In [None]:
# checking the number of unique values
data["Installs"].nunique()

In [None]:
# checking for unique values and their counts
data["Installs"].value_counts()

In [None]:
# datatype of installs column before conversion
data["Installs"].dtype

In [None]:
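# removing '+' and ',' from the column (regex=False treats '+' as a literal character)
data["Installs"] = data["Installs"].str.replace('+', '', regex=False)
data["Installs"] = data["Installs"].str.replace(',', '', regex=False)
data["Installs"]

In [None]:
# converting the column to a numeric dtype; errors="coerce" turns any value
# that is still non-numeric into NaN instead of raising an error
data["Installs"] = pd.to_numeric(data["Installs"], errors="coerce")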
data["Installs"].value_counts()

In [None]:
# this will show the data type of the column
data['Installs'].dtype 
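The cleaning steps for `Installs` can be tried in isolation first. This is a sketch on a hypothetical Series; the stray value `"Free"` is an assumed example of a non-numeric entry, included to show what `errors="coerce"` does:

```python
import pandas as pd

installs = pd.Series(["10,000+", "500,000+", "0", "Free"])

cleaned = (
    installs
    .str.replace("+", "", regex=False)   # literal plus sign, not a regex quantifier
    .str.replace(",", "", regex=False)   # thousands separators
)

# errors="coerce" maps anything that is still non-numeric to NaN
installs_num = pd.to_numeric(cleaned, errors="coerce")
print(installs_num)
```

If every value converts cleanly, the result has an integer dtype; a single NaN is enough to make the column float.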


> We can generate a new column based on the installation values, which will be helpful in our analysis.


In [None]:
# data before adding new column
data.head()

In [None]:
data['Installs'].max() # this will show the maximum value of the column


In [None]:
# making a new column called 'Installs_category' which will have the category of the installs
bins = [-1, 0, 10, 1000, 10000, 100000, 1000000, 10000000, 10000000000]
labels=['No_Install', 'Very low', 'Low', 'Moderate', 'More than moderate', 'High', 'Very High', 'Top Notch']
data['Installs_category'] = pd.cut(data['Installs'], bins=bins, labels=labels)

In [None]:
# check the value counts of the new column
data['Installs_category'].value_counts()
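To see how `pd.cut` assigns the labels, it helps to run the same `bins` and `labels` on a few hand-picked numbers. The intervals are closed on the right by default, so a boundary value such as 10 falls into the lower bucket. The `sample` values below are hypothetical:

```python
import pandas as pd

bins = [-1, 0, 10, 1000, 10000, 100000, 1000000, 10000000, 10000000000]
labels = ['No_Install', 'Very low', 'Low', 'Moderate', 'More than moderate',
          'High', 'Very High', 'Top Notch']

# Hand-picked install counts, several of them sitting exactly on a bin edge
sample = pd.Series([0, 10, 1000, 10000, 5000000, 1000000000])
categories = pd.cut(sample, bins=bins, labels=labels)

print(pd.DataFrame({"Installs": sample, "Installs_category": categories}))
```

The first bin starts at -1 so that an install count of exactly 0 is captured by the `(-1, 0]` interval.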


In [None]:
## Data after adding the new column
data.head()

# Observation for Price column

In [None]:
# checking for missing values
data["Price"].isnull().sum()

In [None]:
# checking the value counts of the "Price" column
data["Price"].value_counts()

In [None]:
# check the unique values in the 'Price' column
data['Price'].unique()

> We need to confirm if the values in the Price column are only with $ sign or not.


In [None]:
# count the values having $ in the 'Price' column
data['Price'].loc[data['Price'].str.contains(r'\$')].value_counts().sum()

In [None]:
# count the values in the 'Price' column that contain 0 but do not contain the $ sign
data['Price'].loc[(data['Price'].str.contains('0')) & (~data['Price'].str.contains(r'\$'))].value_counts().sum()

- Now we can confirm that the Price column holds either values in $ or the value 0, as 800+10041=10841 total values.
- The only problem is the $ sign; let's remove it and convert the column into a numeric data type.


> Removing '$' sign

In [None]:
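# removing '$' from the column (regex=False treats '$' as a literal character, not as the end-of-string anchor)
data["Price"] = data["Price"].str.replace('$', '', regex=False)
data["Price"].value_counts()

In [None]:
# converting "Price" column to float; errors="coerce" turns any value that is still non-numeric into NaN
data["Price"] = pd.to_numeric(data["Price"], errors="coerce")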
data["Price"]

In [None]:
# this will show the data type of the column
data['Price'].dtype 
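The `$` character deserves some care because it is also the end-of-string anchor in regular expressions. A short sketch on hypothetical price strings, showing the literal replacement followed by the numeric conversion:

```python
import pandas as pd

prices = pd.Series(["0", "$4.99", "$399.99"])

# regex=False treats '$' as a literal character; as a regex it would mean "end of string"
price_num = pd.to_numeric(prices.str.replace("$", "", regex=False), errors="coerce")

print(price_num)
print(price_num.dtype)
```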


In [None]:
data.describe()

In [None]:
# Printing min,max and mean of the Price column using f string
print(f"Minimum price is {data['Price'].min()}")
print(f"Maximum price is {data['Price'].max()}")
print(f"Mean price is {data['Price'].mean():.2f}")


# Missing Values

In [None]:
## Missing values per column
data.isnull().sum()

In [None]:
# missing value percentage
round(data.isnull().sum()/data.shape[0]*100,2)

In [None]:
# total number of missing values
data.isnull().sum().sum()
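The counts and percentages computed in the cells above can be combined into one summary table, which is often easier to read than separate outputs. A sketch on a hypothetical frame (`toy`); the column names `missing_count` and `missing_pct` are my own:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Rating": [4.1, np.nan, 3.9, np.nan],
    "Type": ["Free", "Paid", None, "Free"],
    "App": ["A", "B", "C", "D"],
})

# isnull().mean() is the fraction of missing values per column
missing = pd.DataFrame({
    "missing_count": toy.isnull().sum(),
    "missing_pct": (toy.isnull().mean() * 100).round(2),
}).sort_values("missing_count", ascending=False)

print(missing)
```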

In [None]:
# plot missing values
sns.heatmap(data.isnull())
plt.show()


In [None]:
## Bar plot for missing values
# make figure size
plt.figure(figsize=(16, 6))

# plot the null values by their percentage in each column
missing_percentage = data.isnull().sum() / len(data) * 100
missing_percentage.plot(kind='bar')

# add the labels
plt.xlabel('Columns')
plt.ylabel('Percentage')
plt.title('Percentage of Missing Values in each Column')


In [None]:
## Bar plot for the columns with less than 1% missing values
# make figure size
plt.figure(figsize=(16, 6))
missing_percentage[missing_percentage<1].plot(kind='bar')
# add the labels
plt.xlabel('Columns')
plt.ylabel('Percentage')
plt.title('Percentage of Missing Values in each Column without the rating column')

In [None]:
data.isnull().sum().sort_values(ascending=False) # this will show the number of null values in each column in descending order


In [None]:
## this will show the percentage of null values in each column
(data.isnull().sum() / len(data) * 100).sort_values(ascending=False) 


# 2.3. Dealing with the missing values
---
We cannot simply impute the Rating column, as it may be directly linked with the Installs column. To test this hypothesis, we need to plot the Rating column against the Installs and Size columns and test the relationship statistically using the Pearson correlation.

In [None]:
data.describe()

In [None]:
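# Make a correlation matrix of numeric columns
plt.figure(figsize=(16, 10)) # make figure size  
numeric_cols = ['Rating', 'Reviews', 'Size_in_bytes', 'Installs', 'Price'] # make a list of numeric columns
# 'Reviews' is still stored as object, so coerce every listed column to numeric first
data[numeric_cols] = data[numeric_cols].apply(pd.to_numeric, errors='coerce')
sns.heatmap(data[numeric_cols].corr(), annot=True) # plot the correlation matrix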

In [None]:
# we can also calculate the correlation matrix using pandas
data[numeric_cols].corr() # this will show the correlation matrix

In [None]:
# we can calculate the pearson correlation coefficient using scipy as well as follows

# this is to install scipy if you have not done it before
# pip install scipy 
from scipy import stats

# remove rows containing NaN values (important for calculating Pearson's R)
data_clean = data.dropna()

# calculate Pearson's R between Reviews and Installs
pearson_r, _ = stats.pearsonr(data_clean['Reviews'], data_clean['Installs'])
print(f"Pearson's R between Reviews and Installs: {pearson_r:.4f}")

In [None]:
## null values for all columns
data.isnull().sum()

> Before going ahead, let's remove the rows with missing values in the Current Ver, Android Ver, Category, Type and Genres columns, as they are very few in number and removing them will not affect our analysis.

In [None]:
# # length before removing null values
print(f"Length of the dataframe before removing null values: {len(data)}")

In [None]:
# remove the rows having null values in the 'Current Ver', 'Android Ver', 'Category', 'Type' and 'Genres' column
data.dropna(subset=['Current Ver', 'Android Ver', 'Category', 'Type', 'Genres'], inplace=True)

In [None]:
# length after removing null values
print(f"Length of the dataframe after removing null values: {len(data)}")

We have removed 12 rows having null values in the `Current Ver`, `Android Ver`, `Category`, `Type` and `Genres` columns.


In [None]:
# let's check the null values again
data.isnull().sum().sort_values(ascending=False)

In [None]:
data.columns

In [None]:
# use groupby function to find the trend of Rating in each Installs_category
data.groupby('Installs_category')['Rating'].describe()

In [None]:
data.groupby('Installs_category')['Rating'].count()

In [None]:
data[["Rating","Installs_category"]].head()

In [None]:
data.groupby('Installs_category')['Rating'].mean()

In [None]:
# in which Install_category the Rating has NaN values
data['Installs_category'].loc[data['Rating'].isnull()].value_counts()

In [None]:
# plot the boxplot of Rating in each Installs_category
plt.figure(figsize=(16, 6)) # make figure size
sns.boxplot(x='Installs_category', y='Rating', hue='Installs_category', data=data) # plot the boxplot
# add the text of number of null values in each category
plt.text(0, 3.5, 'Null values: 14')
plt.text(1, 3.5, 'Null values: 874')
plt.text(2, 3.5, 'Null values: 86')
plt.text(3, 3.5, 'Null values: 31')
plt.text(4, 3.5, 'Null values: 3')
plt.text(5, 3.5, 'Null values: 0')
plt.text(6, 3.5, 'Null values: 0')
plt.text(7, 3.5, 'Null values: 0')

In [None]:
# before imputing the number of missing values in the 'Rating' column
data["Rating"].isnull().sum()

In [None]:
# Mean values for each 'Installs_category'
mean_values = {
    'No_Install': 0.0,
    'Very low': 4.637037,
    'Low': 4.170970,
    'Moderate': 4.035417,
    'More than moderate': 4.093255,
    'High': 4.207525,
    'Very High': 4.287076,
    'Top Notch': 4.374396
}

In [None]:
# Function to fill missing 'Rating' based on 'Installs_category'
def fill_rating(row):
    if pd.isnull(row['Rating']):
        category = row['Installs_category']
        return mean_values.get(category, row['Rating'])
    return row['Rating']

In [None]:
# Apply the function to fill missing ratings
data['Rating'] = data.apply(fill_rating, axis=1)

In [None]:
## after imputing: the number of missing values in the 'Rating' column
data["Rating"].isnull().sum()
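The imputation above relies on a hand-typed dictionary of group means. As a possible alternative, the group means can be computed and applied in one step with `groupby(...).transform("mean")`. This is a sketch on a hypothetical frame, not the notebook's own method:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Installs_category": ["Low", "Low", "Low", "High", "High"],
    "Rating": [4.0, np.nan, 3.0, 4.5, np.nan],
})

# Mean rating of each group, broadcast back to every row of that group
group_mean = toy.groupby("Installs_category")["Rating"].transform("mean")

# Fill only the missing ratings; existing values are left untouched
toy["Rating"] = toy["Rating"].fillna(group_mean)
print(toy)
```

One difference to keep in mind: a group with no ratings at all has a NaN mean, so its rows stay missing here, whereas the dictionary above assigns 0.0 to `No_Install`.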

- There are no null values in the Reviews column.


In [None]:
# let's plot the same plots for Reviews column as well
plt.figure(figsize=(16, 6)) # make figure size
sns.boxplot(x='Installs_category', y= 'Reviews', data=data) # plot the boxplot

- The data looks heavily skewed, so let's apply a log transformation to make the categories comparable.


In [None]:
# let's plot the same plots for Reviews column as well
#Without log transformation, the large range of the data would make it hard to visualize lower categories like "No Install," "Very Low," etc., 
# as their values are close to zero compared to the large numbers in the "Top Notch" category. 
# A log scale makes it easier to compare data across categories that have both small and large values.
plt.figure(figsize=(16, 6)) # make figure size
sns.boxplot(x='Installs_category', y= np.log10(data['Reviews']), data=data) # plot the boxplot
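One detail to watch with a log transformation is that `np.log10(0)` evaluates to `-inf`, so apps with zero reviews can distort the plot. A common workaround, shown here as a sketch on hypothetical review counts, is to add one before taking the log:

```python
import numpy as np
import pandas as pd

reviews = pd.Series([0, 9, 99, 999999])

# Shift by one so that a count of zero maps to 0 instead of -inf
log_reviews = np.log10(reviews + 1)
print(log_reviews)
```

The shift barely changes large values but keeps every transformed value finite.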

- We also draw the scatter plot of the Rating and Review columns with the Installs column


In [None]:
# Draw a scatter plot between Rating, Reviews and Installs
plt.figure(figsize=(16, 6)) # make figure size
sns.scatterplot(x='Rating', y='Reviews', hue='Installs_category', data=data) # plot the scatter plot

- It doesn't show any clear trend: Rating is confined to the narrow 1 to 5 range while Reviews spans several orders of magnitude, so most points are squashed toward the bottom of the plot.
- Let's try with Reviews and Installs

In [None]:
# plot reviews and installs in a scatter plot
plt.figure(figsize=(16, 6)) # make figure size
sns.scatterplot(x='Reviews', y='Installs', data=data) # plot the scatter plot

- We did not see any trend; the values span several orders of magnitude, so we need to rescale the data before plotting. Let's try a log transformation.


In [None]:
# plot reviews and installs in a scatter plot
plt.figure(figsize=(16, 6)) # make figure size
sns.scatterplot(x=np.log10(data['Reviews']), y=np.log10(data['Installs']), data=data) # plot the scatter plot

- Now we see a slight trend, but Installs is reported in buckets (10+, 1,000+, 10,000+, etc.), so the values are discrete rather than continuous and the trend is only faintly visible. Let's add a regression line to see the trend.

In [None]:
# plot reviews and installs in a scatter plot with trend line
# lmplot is a figure-level function that creates its own figure, so the size is set through height and aspect
sns.lmplot(x='Reviews', y='Installs', data=data, height=6, aspect=16/6) # plot the scatter plot with trend line

- Here, we can see a nice trend, which shows that number of Reviews increases with the number of Installs, which is quite obvious.


# Observation
- We can see that most of the null values in the Rating column belong to apps in the No_Install to Moderate install categories, which makes sense: an app with few installs has few ratings and reviews.

> But wait, we have to check for the duplicates as well, as they can affect our analysis

# Duplicates
- Removing duplicates is one of the most important parts of the data wrangling process; we must remove them in order to get correct insights from the data.

- Duplicates can skew statistical measures such as the mean, median, and standard deviation, and can lead to over-representation of certain data points.

In [None]:
# find duplicate if any
data.duplicated().sum()

- This shows us total duplicates, but we can also check based on the app name, as we know that every app has a unique name.

In [None]:
# finding how many duplicate values are there in the "App" column
data["App"].duplicated().sum()

- There are 1181 duplicate app names

- let's check for number of duplicates in each column using a for loop and printing the output





In [None]:
# let's check for number of duplicates
for col_name in data.columns:
    print(f"Number of duplicates in {col_name} column are: {data[col_name].duplicated().sum()}")

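- Every column contains repeated values (many apps share a category or a rating, for example), so per-column counts do not identify duplicate records. The reliable way to find duplicates is to compare whole rows.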



In [None]:
# number of the duplicate values in the dataset
print(f"Number of duplicates in dataset are: {data.duplicated().sum()}")

In [None]:
# printing the duplicate values
#data[data['App'].duplicated(keep=False)].sort_values(by='App')

#data[data.duplicated()]

1. `data['App'].duplicated(keep=False)`:
This part checks for duplicate values in the 'App' column.
The parameter `keep=False` ensures that all occurrences of a duplicate value are marked as True. By default, only the second and later occurrences are marked; `keep=False` also flags the first occurrence.
So, it creates a Boolean Series where True marks all rows with duplicate values in the 'App' column, and False marks the unique values.

2. `data[data['App'].duplicated(keep=False)]`:
This filters the DataFrame, returning only the rows whose 'App' value appears more than once.

3. `.sort_values(by='App')`:
This sorts the filtered DataFrame by the 'App' column, so rows with the same app name sit next to each other and are easier to compare.
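The difference between the default behaviour and `keep=False`, and between name-level and row-level duplicates, is easiest to see on a tiny example. The frame below is hypothetical:

```python
import pandas as pd

toy = pd.DataFrame({
    "App": ["Maps", "Maps", "Notes", "Maps"],
    "Reviews": [100, 100, 5, 120],
})

# Default keep='first': only the later occurrences are flagged
print(toy["App"].duplicated().tolist())

# keep=False: every member of a duplicated group is flagged
print(toy["App"].duplicated(keep=False).tolist())

# Fully identical rows only: the last 'Maps' row differs in Reviews, so it is not counted
print(toy.duplicated().sum())

# drop_duplicates() with no arguments removes only the fully identical rows
exact_dedup = toy.drop_duplicates()
print(len(exact_dedup))
```

Rows that share an app name but differ in other columns survive a plain `drop_duplicates()`, which is why the name-level count is higher than the row-level count.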

In [None]:
# remove the duplicates
data.drop_duplicates(inplace=True)

In [None]:
# print the number of rows and columns after removing duplicates
print(f"Number of rows after removing duplicates: {data.shape[0]}")

- We have now removed 483 duplicates from the dataset and have 10346 rows left.


# Insights from data

Q1. Which category has the highest number of apps?

In [None]:
# which category has highest number of apps
data['Category'].value_counts().head(15) # this will show the top 15 categories with the highest number of apps

Q2. Which category has the highest number of installs?


In [None]:
## category with highest number of Installs
data.groupby('Category')['Installs'].sum().sort_values(ascending=False).head(15)

Q3. Which category has the highest number of reviews?


In [None]:
## Category with highest number of Reviews
data.groupby("Category")["Reviews"].sum().sort_values(ascending=False).head(15)

Q4. Which category has the highest rating?


In [None]:
## Category with highest average Rating
data.groupby("Category")["Rating"].mean().sort_values(ascending=False).head(30)
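The per-category questions above each use their own `groupby`. If a side-by-side view is useful, the same figures can be gathered with a single named aggregation. This is a sketch on a hypothetical frame; the output column names are my own:

```python
import pandas as pd

toy = pd.DataFrame({
    "App": ["a", "b", "c", "d", "e"],
    "Category": ["GAME", "GAME", "TOOLS", "TOOLS", "TOOLS"],
    "Installs": [1000, 5000, 100, 100, 10],
    "Reviews": [50, 200, 3, 4, 1],
    "Rating": [4.0, 4.5, 3.0, 4.0, 5.0],
})

# One row per category, one column per question
summary = toy.groupby("Category").agg(
    n_apps=("App", "count"),
    total_installs=("Installs", "sum"),
    total_reviews=("Reviews", "sum"),
    mean_rating=("Rating", "mean"),
)

print(summary.sort_values("total_installs", ascending=False))
```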

In [None]:
# plot the rating distribution
plt.figure(figsize=(16, 6)) # make figure size
sns.kdeplot(data['Rating'], color="blue", fill=True) # plot the distribution plot (fill replaces the deprecated shade argument)

Q4. Number of different Ratings for each category.

In [None]:
data.groupby("Category")["Rating"].value_counts()

Q5. What is the number of paid and free apps?

In [None]:
data["Type"].value_counts()

In [None]:
# plot number of installs for free vs paid apps make a bar plot
plt.figure(figsize=(16, 6)) # make figure size
sns.barplot(x='Type', y='Installs', data=data) # plot the bar plot

Q6. How many installs did Free and Paid apps get in total?

In [None]:
data.groupby("Type")["Installs"].sum()

In [None]:
# show scatter plot as well where x-axis is Installs and y-axis is Price and hue is Type
plt.figure(figsize=(16, 6)) # make figure size
sns.scatterplot(x='Installs', y='Price', hue='Type', data=data) # plot the scatter plot

In [None]:
# plot installs and price in a scatter plot with trend lines
# lmplot is a figure-level function that creates its own figure, so the size is set through height and aspect
sns.lmplot(x='Installs', y='Price', hue='Type', data=data, height=6, aspect=16/6) # plot the scatter plot with trend line

Q7. What is the mean app size for each Installs_category?

In [None]:
data.groupby("Installs_category")["Size_in_Mb"].mean()

In [None]:
# Check if there is any impact of size on installs
# make a bar plot of Size_in_Mb vs Installs_category
plt.figure(figsize=(16, 6)) # make figure size
sns.barplot(x='Installs_category', y='Size_in_Mb', data=data) # plot the bar plot

Q8. Which content rating is most popular in installs?


In [None]:
data['Content Rating'].value_counts() # this will show the value counts of each content rating

In [None]:
# plot the bar plot of Content Rating vs Installs
plt.figure(figsize=(16, 6)) # make figure size
sns.barplot(x='Content Rating', y='Installs', data=data) # plot the bar plot

Q9. How many apps in each category have the Everyone content rating?


In [None]:
data['Category'].loc[data['Content Rating'] == 'Everyone'].value_counts()

Q10. How many apps got 5 rating?

In [None]:
len(data[data["Rating"]==5.0])


Q11. How many paid apps got 5 rating?

In [None]:
len(data[(data["Rating"] == 5.0) & (data["Type"] == "Paid")])


Q12. Give the top rated 10 apps in Paid category

In [None]:
data[data['Type'] == 'Paid'].sort_values(by='Rating', ascending=False).head(10)

In [None]:
# plot top 5 rated paid apps
plt.figure(figsize=(16, 6)) # make figure size
sns.barplot(x='App', y='Rating', data=data[data['Type'] == 'Paid'].sort_values(by='Rating', ascending=False).head(5)) # plot the bar plot

Q12. How many Free apps got 5 rating?

In [None]:
len(data[(data["Rating"] == 5.0) & (data["Type"] == "Free")])


In [None]:
# plot top 5 rated free apps
plt.figure(figsize=(26, 6)) # make figure size
sns.barplot(x='App', y='Rating', data=data[data['Type'] == 'Free'].sort_values(by='Rating', ascending=False).head(5)) # plot the bar plot

Q13. Give the top rated 10 apps in Free category

In [None]:
data[data['Type'] == 'Free'].sort_values(by='Rating', ascending=False).head(10)


Q14. What are the top 5 FREE apps with highest number of reviews?



In [None]:
data[data['Type'] == 'Free'].sort_values(by='Reviews', ascending=False).head(5)


In [None]:
# Plot top 10 Free apps with highest number of reviews
plt.figure(figsize=(16, 6)) # make figure size
sns.barplot(x='App', y='Reviews', data=data[data['Type'] == 'Free'].sort_values(by='Reviews', ascending=False).head(10)) # plot the bar plot

Q15. What are the top 5 Paid apps with highest number of reviews?


In [None]:
data[data['Type'] == 'Paid'].sort_values(by='Reviews', ascending=False).head(5)


In [None]:
# Plot top 5 Paid apps with highest number of reviews
plt.figure(figsize=(16, 6)) # make figure size
sns.barplot(x='App', y='Reviews', data=data[data['Type'] == 'Paid'].sort_values(by='Reviews', ascending=False).head(5)) # plot the bar plot

Q16. How does the distribution of app ratings vary across different categories?

In [None]:
# Filter out rows with missing ratings or categories (if any)
data_cleaned = data.dropna(subset=['Rating', 'Category'])

# Set the size of the plot for better readability
plt.figure(figsize=(12, 8))

# Create a box plot to show the distribution of ratings by category
sns.boxplot(x='Category', y='Rating', data=data_cleaned)
plt.xticks(rotation=90)  # Rotate the category labels for readability
plt.title('Distribution of App Ratings Across Different Categories')
plt.xlabel('App Category')
plt.ylabel('Rating')

Q17. Which categories have the highest average app ratings?

In [None]:
data.groupby("Category")["Rating"].mean().sort_values(ascending=False)

Q18. What is the relationship between the number of reviews and app ratings?

In [None]:
# Calculate the correlation coefficient between 'Reviews' and 'Rating'
correlation = data['Reviews'].corr(data['Rating'])
print(f"Correlation coefficient between number of reviews and app ratings: {correlation:.2f}")

- We can see, there is no direct correlation.
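A Pearson coefficient measures linear association only, so it can understate a relationship when one variable spans several orders of magnitude. The sketch below uses constructed toy values, chosen so that the rating is exactly linear in the log of the review count; it illustrates the effect and says nothing about the real dataset:

```python
import numpy as np
import pandas as pd

# Constructed values: Rating rises by 0.5 for every tenfold increase in Reviews
toy = pd.DataFrame({
    "Reviews": [1, 10, 100, 1000, 10000],
    "Rating": [3.0, 3.5, 4.0, 4.5, 5.0],
})

# Pearson correlation on the raw scale and on the log10 scale
raw_r = toy["Reviews"].corr(toy["Rating"])
log_r = np.log10(toy["Reviews"]).corr(toy["Rating"])

print(f"raw scale: {raw_r:.2f}")
print(f"log scale: {log_r:.2f}")
```

Comparing both coefficients on the real data would show whether the weak raw correlation hides a relationship on the log scale.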

In [None]:
# Create a scatter plot with a trend line (regression line) using seaborn
plt.figure(figsize=(10, 6))

# Use seaborn's regplot to create the plot with a trend line
sns.regplot(x='Reviews', y='Rating', data=data, scatter_kws={'alpha':0.3}, line_kws={'color':'red'}, logx=True)

plt.title('Relationship Between Number of Reviews and App Ratings with Trend Line')
plt.xlabel('Number of Reviews (Log Scale)')
plt.ylabel('App Rating')

plt.tight_layout()
plt.show()

In [None]:
sns.lmplot(x='Reviews', y='Rating', data=data_cleaned, col='Category', height=4, aspect=0.7)
