# Google Play Store EDA
> The dataset was downloaded from [https://www.kaggle.com/datasets/lava18/google-play-store-apps?resource=download](https://www.kaggle.com/datasets/lava18/google-play-store-apps?resource=download)

In [27]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np


In [28]:
# reading the dataset
df = pd.read_csv('../Datasets/googleplaystore.csv')

In [82]:
# show all columns and rows when printing a DataFrame
pd.set_option("display.max_columns",None)
pd.set_option("display.max_rows",None)

## Meaning of each column
1. App: Name of the app
2. Category: Category of the app, e.g. Game, Music
3. Rating: Rating of the app on a 0-5 scale (higher is better)
4. Reviews: Number of reviews of the app
5. Size: Size of the app, stored as text with an `M` (megabytes) or `k` (kilobytes) suffix, or as `Varies with device`
6. Installs: Number of installs (downloads) of the app, stored as text such as `10,000+`
7. Type: Type of the app, e.g. Paid, Free
8. Price: Price of the app in USD
9. Content Rating: Target audience of the app, e.g. Everyone, Teen, Mature
10. Genres: Genres of the app, e.g. Action, Adventure, Puzzle
11. Last Updated: Date the app was last updated
12. Current Ver: Current version of the app
13. Android Ver: Minimum Android version required to install the app


In [29]:
# import ydata_profiling as yp
# report  = yp.ProfileReport(df)
# report.to_file(output_file='../Outputs/googleplaystore_report.html')

In [70]:
df.sample(2)

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
6197,ScorePal,TOOLS,4.5,1439,15M,"10,000+",Free,0,Everyone,Tools,"July 9, 2018",2.10.7,4.3 and up
8976,SportLook,PHOTOGRAPHY,2.5,1283,8.2M,"100,000+",Free,0,Everyone,Photography,"June 21, 2018",1.2.7,4.1 and up


## Columns stored as text that should be converted to numeric
1. Size
2. Installs
3. Price
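
The `Size` values mix units (`15M`, `253k`) with a non-numeric placeholder (`Varies with device`), so a plain cast to float does not work. One possible way to parse them is sketched below on a few made-up values. The helper name `size_to_mb` and the 1 MB = 1024 kB convention are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def size_to_mb(size):
    """Convert a size string such as '15M' or '253k' to megabytes.

    Hypothetical helper: assumes 1 MB = 1024 kB and returns NaN for
    anything it cannot parse (for example 'Varies with device').
    """
    if not isinstance(size, str):
        return np.nan
    size = size.strip()
    try:
        if size.endswith("M"):
            return float(size[:-1])
        if size.endswith("k"):
            return float(size[:-1]) / 1024
    except ValueError:
        pass
    return np.nan


# Made-up sample values in the same format as the Size column
sample = pd.Series(["15M", "8.2M", "512k", "Varies with device"])
sizes_mb = sample.map(size_to_mb)
print(sizes_mb.tolist())
```

Returning NaN for unparseable values keeps the conversion and the later imputation as two separate steps.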

In [43]:
df["Size"].value_counts()



Size
Varies with device    1695
11M                    198
12M                    196
14M                    194
13M                    191
                      ... 
253k                     1
992k                     1
658k                     1
73k                      1
246k                     1
Name: count, Length: 461, dtype: int64

In [32]:
df2 = df.copy()

In [33]:
# replacing "Varies with device" with NaN
df2["Size"] = df2["Size"].replace("Varies with device", np.nan)


In [97]:
# finding percentage of missing values in each column
df2.isnull().sum() / len(df2) * 100

App               0.000000
Category          0.000000
Rating            0.000000
Reviews           0.000000
Size              0.000000
Installs          0.000000
Type              0.000000
Price             0.000000
Content Rating    0.000000
Genres            0.000000
Last Updated      0.000000
Current Ver       0.073794
Android Ver       0.018448
dtype: float64
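
The same percentages can also be computed as the mean of the boolean missing-value mask, since the mean of `True`/`False` values is the share of `True`. A minimal sketch on a made-up frame (the name `toy` and its values are illustrative only):

```python
import numpy as np
import pandas as pd

# Made-up frame with a few gaps
toy = pd.DataFrame({
    "Rating": [4.5, np.nan, 3.9, np.nan],
    "Type": ["Free", "Paid", None, "Free"],
    "App": ["a", "b", "c", "d"],
})

# isna() gives a boolean mask; its column-wise mean is the missing share
missing_pct = toy.isna().mean() * 100

# Sorting puts the columns that need the most attention first
missing_pct = missing_pct.sort_values(ascending=False)
print(missing_pct)
```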

In [62]:
# converting the Size column to MB: strip the "k"/"M" suffix and divide kB values by 1024
is_kb = df2["Size"].str.endswith("k", na=False)
df2["Size"] = df2["Size"].str.rstrip("kM").astype(float)
df2.loc[is_kb, "Size"] = df2.loc[is_kb, "Size"] / 1024

In [63]:
# Replacing NaN values in the Size column with the column mean
df2["Size"] = df2["Size"].fillna(df2["Size"].mean())


In [66]:
# Replacing NaN values in the Rating column with the column mean
df2["Rating"] = df2["Rating"].fillna(df2["Rating"].mean())
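
Recent pandas versions deprecate calling `fillna(..., inplace=True)` on a selected column, because the selection may behave as a copy. The sketch below shows two patterns that avoid this, on a made-up frame (the name `toy` and its values are illustrative only):

```python
import numpy as np
import pandas as pd

# Made-up frame with one gap per column
toy = pd.DataFrame({
    "Size": [10.0, np.nan, 20.0],
    "Rating": [4.0, 5.0, np.nan],
})

# Pattern 1: assign the filled column back to the frame
toy["Size"] = toy["Size"].fillna(toy["Size"].mean())

# Pattern 2: fill on the whole frame with a per-column mapping
toy = toy.fillna({"Rating": toy["Rating"].mean()})
print(toy)
```

Mean imputation keeps the column mean unchanged but shrinks its variance, which is worth keeping in mind when reading later plots.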


In [100]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10841 non-null  object 
 1   Category        10841 non-null  object 
 2   Rating          10841 non-null  float64
 3   Reviews         10841 non-null  int64  
 4   Size            10841 non-null  float64
 5   Installs        10841 non-null  int64  
 6   Type            10841 non-null  object 
 7   Price           10841 non-null  float64
 8   Content Rating  10841 non-null  object 
 9   Genres          10841 non-null  object 
 10  Last Updated    10841 non-null  object 
 11  Current Ver     10841 non-null  object 
 12  Android Ver     10841 non-null  object 
dtypes: float64(3), int64(2), object(8)
memory usage: 1.1+ MB


In [69]:
# Removing "+" and "," from the Installs column (regex=False treats "+" as a literal character)
df2["Installs"] = df2["Installs"].str.replace("+", "", regex=False)
df2["Installs"] = df2["Installs"].str.replace(",", "", regex=False)

In [73]:
df2["Installs"] = df2["Installs"].astype(int)

In [75]:
# Removing the "$" sign from the Price column (regex=False treats "$" as a literal character)
df2["Price"] = df2["Price"].str.replace("$", "", regex=False)

In [76]:
df2["Price"] = df2["Price"].astype(float)
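
The `Installs` and `Price` cleanup can be sketched end to end on made-up values. Using `pd.to_numeric(..., errors="coerce")` instead of `astype` is a swapped-in choice here, not what the notebook does: it turns any value that still fails to parse into NaN rather than raising an error.

```python
import pandas as pd

# Made-up values in the same format as the Installs and Price columns
toy = pd.DataFrame({
    "Installs": ["10,000+", "100,000+", "5+"],
    "Price": ["0", "$4.99", "$0.99"],
})

# regex=False treats "+" and "$" as literal characters, not regex operators
installs = (
    toy["Installs"]
    .str.replace("+", "", regex=False)
    .str.replace(",", "", regex=False)
)
toy["Installs"] = pd.to_numeric(installs, errors="coerce")

price = toy["Price"].str.replace("$", "", regex=False)
toy["Price"] = pd.to_numeric(price, errors="coerce")
print(toy.dtypes)
```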

In [91]:
# Filling missing values in the Category column
df2["Category"] = df2["Category"].fillna("Smart Home Device")

In [99]:
# Filling missing values in the Genres column
df2["Genres"] = df2["Genres"].fillna("Digital Photo Frame")


In [92]:
df2["Category"].loc[10470:10480]

10470        COMMUNICATION
10471      PERSONALIZATION
10472    Smart Home Device
10473                TOOLS
10474        COMMUNICATION
10475                TOOLS
10476            LIFESTYLE
10477      PERSONALIZATION
10478               FAMILY
10479        COMMUNICATION
10480               FAMILY
Name: Category, dtype: object

In [96]:
# Replacing NaN values from type column with the mode of the column
df2["Type"] = df2["Type"].fillna(df2["Type"].mode()[0])

In [98]:
# Replacing NaN values in the Current Ver and Android Ver columns with the mode of each column
df2["Current Ver"] = df2["Current Ver"].fillna(df2["Current Ver"].mode()[0])
df2["Android Ver"] = df2["Android Ver"].fillna(df2["Android Ver"].mode()[0])
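
Mode imputation for several text columns can be written as a loop. A minimal sketch on a made-up frame follows (the name `toy` and its values are illustrative only). `mode()` ignores NaN by default and returns the most frequent values in sorted order, so `[0]` picks the first one when there is a tie.

```python
import pandas as pd

# Made-up frame with one gap per column
toy = pd.DataFrame({
    "Type": ["Free", "Free", None, "Paid"],
    "Android Ver": ["4.1 and up", None, "4.1 and up", "5.0 and up"],
})

for col in ["Type", "Android Ver"]:
    # Most frequent non-missing value in this column
    most_common = toy[col].mode()[0]
    toy[col] = toy[col].fillna(most_common)

print(toy)
```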


In [101]:
df2.to_excel("../Datasets/norm_googleplaystore.xlsx")