## GOOGLE PLAY STORE ANALYSIS PROJECT

Understand the trends among applications available on the Google Play Store, with a focus on identifying trending apps where advertisements are likely to yield the most profit.
Analyze detailed information on apps in the Google Play Store to discover insights on app features and the current state of the Android app market.

### LIBRARIES 

In [1]:
# Libraries for reading and manipulating data
import numpy as np
import pandas as pd

# Libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

#display the graphs
%matplotlib inline

#display values to 2 decimal places
pd.set_option("display.float_format", lambda x: "%.2f" % x) 

### READING THE DATASET

In [2]:
#read the csv file into a dataframe
data = pd.read_csv("google_play_store.csv")

#create a copy of the data
df = data.copy()

#display first five rows of the dataset
df.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


In [3]:
#total number of rows and columns of the dataset
df.shape

(10841, 13)

### DESCRIPTIVE SUMMARY OF DATASET:
- There are a total of **10,841** rows and **13** columns in the dataset.
- The names of the apps are specified under the **App** column.
- The apps are grouped into different categories and displayed under the **Category** column.
- The **Rating** and **Reviews** columns hold the average user rating and the number of user reviews for each app.
- The **Size** column gives the size of each app as a string with a unit suffix (for example `19M` for megabytes), and the **Installs** column gives the number of downloads as a bucketed string (for example `10,000+`).
- The apps have been grouped into either Paid or Free under the **Type** column and their corresponding prices shown in the **Price** column.
- The **Content Rating** column describes the age group of the users for which the app has been developed.
- **Genres** provides information on the different groupings of the apps. However, this information is similar to the one provided in the Category column.
- The **Last Updated** column shows the date of the most recent update released for each app.
- The **Current Ver** column shows the current version of each app and the **Android Ver** column shows the minimum version of the Android OS the app is compatible with.

### DATASET INFORMATION


In [4]:
#displays the datatypes of the columns in the dataset
df.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10841 non-null  object 
 1   Category        10841 non-null  object 
 2   Rating          9367 non-null   float64
 3   Reviews         10841 non-null  object 
 4   Size            10841 non-null  object 
 5   Installs        10841 non-null  object 
 6   Type            10840 non-null  object 
 7   Price           10841 non-null  object 
 8   Content Rating  10840 non-null  object 
 9   Genres          10841 non-null  object 
 10  Last Updated    10841 non-null  object 
 11  Current Ver     10833 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(1), object(12)
memory usage: 1.1+ MB


- There are **twelve (12)** object datatype columns and **one (1)** numeric (float64) column, **Rating**.
- The dataset occupies approximately **1.1 MB** of memory.
- Some rows have missing data, since the **Rating**, **Type**, **Content Rating**, **Current Ver** and **Android Ver** columns have a non-null count lower than the total number of rows. This will be treated in the data cleaning process.
- Some columns have numeric values but are represented as object datatype. They will be converted to more appropriate datatypes for analysis.
- Columns that will not be used for this analysis will be dropped.
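
The missing data noted above can also be counted directly rather than read off the `df.info()` output. The following is a minimal sketch on a small toy frame; the frame `toy` and its values are illustrative assumptions, not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Play Store data (values are illustrative only)
toy = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Rating": [4.1, np.nan, 3.9, np.nan],
    "Type": ["Free", "Paid", None, "Free"],
})

# Count of missing values per column, largest first
missing_counts = toy.isnull().sum().sort_values(ascending=False)

# Share of rows missing in each column, as a percentage
missing_pct = (toy.isnull().mean() * 100).round(1)

print(missing_counts)
print(missing_pct)
```

Applied to `df`, the same two expressions would list which columns need attention during cleaning.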

#### DROP REDUNDANT COLUMNS

In [5]:
#list of columns to drop from dataset
col_to_drop = ['Type','Genres','Last Updated','Current Ver','Android Ver']

#drop listed columns
df = df.drop(columns = col_to_drop)

#verify results of the drop
df.columns

Index(['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Price',
       'Content Rating'],
      dtype='object')

- The five listed columns have been dropped; eight columns remain.

#### COLUMN CONVERSION

In [6]:
#inspect rows with a missing Content Rating
df[df['Content Rating'].isnull()]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Price,Content Rating
10472,Life Made WI-Fi Touchscreen Photo Frame,1.9,19.0,3.0M,"1,000+",Free,Everyone,


In [7]:
#columns to check before converting to a numeric datatype
col_to_check = ['Reviews']

#collect the rows whose value cannot be parsed as a number
non_numeric_values = {}
for col in col_to_check:
    non_numeric_values[col] = df.loc[pd.to_numeric(df[col], errors='coerce').isnull()]

#display result in tabular format
non_numeric_df = pd.concat(non_numeric_values, ignore_index = True)
non_numeric_df

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Price,Content Rating
0,Life Made WI-Fi Touchscreen Photo Frame,1.9,19.0,3.0M,"1,000+",Free,Everyone,
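
The check above relies on `pd.to_numeric(..., errors='coerce')`, which replaces unparseable values with NaN instead of raising an error. A minimal sketch on a toy Series (the values are illustrative assumptions) shows how this isolates the offending entries:

```python
import pandas as pd

# Toy column: numeric strings plus one malformed entry (illustrative values)
reviews = pd.Series(["159", "967", "3.0M", "87510"], name="Reviews")

# errors='coerce' turns anything unparseable into NaN instead of raising
parsed = pd.to_numeric(reviews, errors="coerce")

# Rows where parsing failed are the non-numeric entries
bad_rows = reviews[parsed.isnull()]
print(bad_rows)
```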


In [8]:
#drop the misaligned row found above
df = df.drop(df[df['Reviews'] == '3.0M'].index)

#convert Reviews column to float datatype
df['Reviews'] = df['Reviews'].astype(float)

In [9]:
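#Strip the unit suffixes from the Size column and convert it to float
#Note: 'M' (megabytes) and 'k' (kilobytes) are both stripped without rescaling, and 'Varies with device' becomes 0
df['Size'] = df['Size'].str.replace('+', '', regex=False) #treat + as a literal character, not a regex pattern
df['Size'] = df['Size'].str.replace('M', '', regex=False).str.replace('Varies with device', '0', regex=False).str.replace('k', '', regex=False).astype(float)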
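
The conversion above removes the `M` and `k` suffixes without rescaling, so kilobyte and megabyte values end up on the same numeric scale. If sizes need to be compared across apps, a unit-aware conversion may be preferable. The sketch below is one possible alternative, not the approach used in this notebook: `size_to_mb` is a hypothetical helper, the factor of 1024 kB per MB is an assumption, and `Varies with device` is mapped to NaN rather than 0.

```python
import numpy as np
import pandas as pd

def size_to_mb(size: pd.Series) -> pd.Series:
    """Convert strings such as '19M' or '512k' to megabytes (hypothetical helper)."""
    size = size.str.strip()
    # Numeric part; anything unparseable (e.g. 'Varies with device') becomes NaN
    number = pd.to_numeric(size.str.replace(r"[Mk]$", "", regex=True), errors="coerce")
    # Assumption: 1 MB = 1024 kB; use 1000 if decimal units are preferred
    factor = np.where(size.str.endswith("k", na=False), 1 / 1024, 1.0)
    return number * factor

# Toy values for illustration only
toy_sizes = pd.Series(["19M", "8.7M", "512k", "Varies with device"])
converted = size_to_mb(toy_sizes)
print(converted)
```

Keeping unknown sizes as NaN stops them from pulling averages toward zero, at the cost of having to handle missing values later.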

In [10]:
#Converting string values in Installs column and converting column to float type
df['Installs'] = df['Installs'].str.replace(',', '', regex=False).str.replace('+', '', regex=False).astype(float)

In [11]:
#Converting string values in Price column and converting column to float type
df['Price'] = df['Price'].str.replace('$', '', regex=False).astype(float)

In [12]:
#verify the conversion of the column datatypes
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10840 entries, 0 to 10840
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10840 non-null  object 
 1   Category        10840 non-null  object 
 2   Rating          9366 non-null   float64
 3   Reviews         10840 non-null  float64
 4   Size            10840 non-null  float64
 5   Installs        10840 non-null  float64
 6   Price           10840 non-null  float64
 7   Content Rating  10840 non-null  object 
dtypes: float64(5), object(3)
memory usage: 762.2+ KB


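- The column conversions succeeded and the misaligned row was dropped: 10,840 rows remain, and **Reviews**, **Size**, **Installs** and **Price** are now float64.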
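
The same verification can be done programmatically instead of by reading the `df.info()` output. The following is a minimal sketch on a toy frame that reuses the cleaned column names; the frame `cleaned` and its values are illustrative assumptions.

```python
import pandas as pd
from pandas.api.types import is_float_dtype

# Toy frame with the same column names as the cleaned data (illustrative values)
cleaned = pd.DataFrame({
    "App": ["A", "B"],
    "Rating": [4.1, 3.9],
    "Reviews": [159.0, 967.0],
    "Size": [19.0, 14.0],
    "Installs": [10000.0, 500000.0],
    "Price": [0.0, 0.0],
})

numeric_cols = ["Rating", "Reviews", "Size", "Installs", "Price"]

# Columns that are not float; an empty list means every conversion took effect
not_float = [col for col in numeric_cols if not is_float_dtype(cleaned[col])]
print(not_float)
```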