# Exploring a DataFrame of Google Play Store Applications

## Load Libraries and Data

In [1]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Import data
df = pd.read_csv("googleplaystore.csv")

## Data Preprocessing and Cleaning

In [3]:
# Inspect column names, non-null counts and dtypes with .info()
# (.info() prints its report itself and returns None, so no print() is needed)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10841 non-null  object 
 1   Category        10841 non-null  object 
 2   Rating          9367 non-null   float64
 3   Reviews         10841 non-null  object 
 4   Size            10841 non-null  object 
 5   Installs        10841 non-null  object 
 6   Type            10840 non-null  object 
 7   Price           10841 non-null  object 
 8   Content Rating  10840 non-null  object 
 9   Genres          10841 non-null  object 
 10  Last Updated    10841 non-null  object 
 11  Current Ver     10833 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(1), object(12)
memory usage: 1.1+ MB


In [4]:
# Drop rows with missing values; .copy() gives an independent frame
# so that later column assignments do not raise SettingWithCopyWarning
df1 = df.dropna().copy()
df1.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9360 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             9360 non-null   object 
 1   Category        9360 non-null   object 
 2   Rating          9360 non-null   float64
 3   Reviews         9360 non-null   object 
 4   Size            9360 non-null   object 
 5   Installs        9360 non-null   object 
 6   Type            9360 non-null   object 
 7   Price           9360 non-null   object 
 8   Content Rating  9360 non-null   object 
 9   Genres          9360 non-null   object 
 10  Last Updated    9360 non-null   object 
 11  Current Ver     9360 non-null   object 
 12  Android Ver     9360 non-null   object 
dtypes: float64(1), object(12)
memory usage: 1023.8+ KB
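
Before dropping rows, it can help to see which columns account for the missing values; in the `.info()` output for the raw data, `Rating` has noticeably fewer non-null entries than the other columns. The sketch below uses a small made-up frame (the name `toy` and its values are illustrative assumptions, not taken from the dataset) to show `isna().sum()` next to `dropna()`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for the real data
toy = pd.DataFrame({
    "App": ["a", "b", "c", "d"],
    "Rating": [4.1, np.nan, 3.9, np.nan],
    "Type": ["Free", "Paid", None, "Free"],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# dropna() removes every row that has at least one missing value;
# .copy() makes the result an independent frame that is safe to modify
toy_clean = toy.dropna().copy()
print(len(toy), "->", len(toy_clean), "rows")
```

Running the same `isna().sum()` call on `df` would show how much of the row loss comes from each column.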


In [5]:
# Drop columns that are not used below; reassigning instead of using
# inplace=True avoids modifying a frame that pandas flags as a copy
df1 = df1.drop(columns=['Current Ver', 'Android Ver', 'Size', 'Genres', 'Last Updated'])
df1.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9360 entries, 0 to 10840
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             9360 non-null   object 
 1   Category        9360 non-null   object 
 2   Rating          9360 non-null   float64
 3   Reviews         9360 non-null   object 
 4   Installs        9360 non-null   object 
 5   Type            9360 non-null   object 
 6   Price           9360 non-null   object 
 7   Content Rating  9360 non-null   object 
dtypes: float64(1), object(7)
memory usage: 658.1+ KB


In [6]:
# Correcting data types
df1.dtypes

App                object
Category           object
Rating            float64
Reviews            object
Installs           object
Type               object
Price              object
Content Rating     object
dtype: object

In [7]:
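# Cast 'Installs' and 'Price' to string so that symbols can be stripped in the next step
df1['Installs'] = df1['Installs'].astype(str)
df1['Price'] = df1['Price'].astype(str)
# Parse 'Reviews' as numbers; values that cannot be parsed become NaN
df1['Reviews'] = pd.to_numeric(df1['Reviews'], errors='coerce')
# Encode 'Type' as 1 for free apps and 0 otherwise
df1['Type'] = (df1['Type'].str.lower() == 'free').astype(int)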
df1.dtypes

App                object
Category           object
Rating            float64
Reviews             int64
Installs           object
Type                int64
Price              object
Content Rating     object
dtype: object
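
Because `errors='coerce'` silently replaces unparseable strings with NaN, it may be worth counting how many values a conversion loses. A minimal sketch on made-up values (the `raw` series is hypothetical and only illustrates the check):

```python
import pandas as pd

# Hypothetical raw strings; "3.0M" is not a valid number
raw = pd.Series(["159", "967", "3.0M", "87510"])

# Unparseable entries are converted to NaN instead of raising an error
parsed = pd.to_numeric(raw, errors="coerce")

# Values that were present before the conversion but are NaN afterwards
n_failed = int(parsed.isna().sum() - raw.isna().sum())
print("values lost in conversion:", n_failed)

# Show the offending raw strings
print(raw[parsed.isna()])
```

If the count is zero, as the `int64` dtype of `Reviews` above suggests for that column, the conversion lost nothing.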

In [8]:
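# Strip the '+' from 'Installs' and the '$' from 'Price'
# (regex=False treats both symbols as literal characters)
df1['Installs'] = df1['Installs'].str.replace('+', '', regex=False)
df1['Price'] = df1['Price'].str.replace('$', '', regex=False)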


In [9]:
# Convert 'Installs' and 'Price' back to numeric types;
# errors='coerce' turns any string that still fails to parse into NaN
df1['Installs'] = pd.to_numeric(df1['Installs'], errors='coerce')
df1['Price'] = pd.to_numeric(df1['Price'], errors='coerce')


In [10]:
df1.head()

|   | App | Category | Rating | Reviews | Installs | Type | Price | Content Rating |
|---|---|---|---|---|---|---|---|---|
| 0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | NaN | 1 | 0.0 | Everyone |
| 1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | NaN | 1 | 0.0 | Everyone |
| 2 | U Launcher Lite – FREE Live Cool Themes, Hide ... | ART_AND_DESIGN | 4.7 | 87510 | NaN | 1 | 0.0 | Everyone |
| 3 | Sketch - Draw & Paint | ART_AND_DESIGN | 4.5 | 215644 | NaN | 1 | 0.0 | Teen |
| 4 | Pixel Draw - Number Art Coloring Book | ART_AND_DESIGN | 4.3 | 967 | NaN | 1 | 0.0 | Everyone |
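
The preview above has no usable values in `Installs`, which means `pd.to_numeric(..., errors='coerce')` produced NaN for those rows. One plausible cause, assuming the raw strings carry thousands separators (for example `10,000+`), is that only the `+` was stripped before the conversion. The sketch below uses made-up values to show the difference; whether it applies here depends on the actual contents of the column.

```python
import pandas as pd

# Hypothetical install counts with thousands separators and a '+' suffix
installs = pd.Series(["10,000+", "500,000+", "5,000,000+"])

# Removing only '+' leaves strings such as "10,000", which do not parse
plus_only = pd.to_numeric(
    installs.str.replace("+", "", regex=False), errors="coerce"
)

# Removing both '+' and ',' leaves plain digit strings
cleaned = (
    installs.str.replace("+", "", regex=False)
            .str.replace(",", "", regex=False)
)
both = pd.to_numeric(cleaned, errors="coerce")

print(plus_only.isna().all())
print(both.tolist())
```

Checking `df1['Installs'].isna().sum()` after the conversion would confirm whether the whole column was affected.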


## Exploratory Data Analysis

### Statistics and Summary

A first pass with `df.agg` on the numeric columns and `df['column_name'].value_counts()` on the categorical columns gives a preliminary understanding of the data.
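
A minimal sketch of both calls, assuming a small made-up frame (the name `toy` and all values are hypothetical; only the column names mirror the cleaned data):

```python
import pandas as pd

# Hypothetical miniature frame with one categorical and two numeric columns
toy = pd.DataFrame({
    "Category": ["GAME", "GAME", "TOOLS", "FAMILY"],
    "Rating": [4.5, 4.0, 3.5, 4.0],
    "Reviews": [100, 300, 50, 150],
})

# agg() applies several summary functions to each numeric column at once;
# the result has one row per function and one column per input column
summary = toy[["Rating", "Reviews"]].agg(["mean", "median", "min", "max"])
print(summary)

# value_counts() counts how often each category occurs, most frequent first
category_counts = toy["Category"].value_counts()
print(category_counts)
```

On the real data, the same pattern would apply to columns such as `Rating`, `Reviews` and `Price` for `agg`, and `Category` or `Content Rating` for `value_counts`.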

### Visualisation (Univariate Analysis)

### Visualisation (Bivariate Analysis)

### Correlations
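
As a possible starting point for this section, pairwise Pearson correlations between numeric columns can be computed with `DataFrame.corr()`. The frame below is made up for illustration: the column names mirror the cleaned data, but the values are hypothetical and chosen so that the result is easy to check by eye.

```python
import pandas as pd

# Hypothetical values: Reviews rises with Rating, Price falls with Rating
toy = pd.DataFrame({
    "Rating": [3.0, 3.5, 4.0, 4.5],
    "Reviews": [10, 20, 30, 40],
    "Price": [4.0, 3.0, 2.0, 1.0],
})

# corr() defaults to the Pearson coefficient and returns a square matrix
numeric_cols = ["Rating", "Reviews", "Price"]
corr = toy[numeric_cols].corr()
print(corr.round(2))
```

Selecting the numeric columns explicitly keeps text columns such as `App` or `Category` out of the calculation.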