### In this Jupyter Notebook we will explore the various relationships in the Google Play Store dataset
### using numpy, pandas, seaborn and matplotlib.

In [20]:
# import relevant packages
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Load the two CSV files with pd.read_csv()

In [21]:
google = pd.read_csv("googleplaystore.csv")
google_user = pd.read_csv("googleplaystore_user_reviews.csv")


Inspect the first and last rows of the data with DataFrame.head() and DataFrame.tail()

In [22]:
google.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


In [23]:
google_user.tail()

Unnamed: 0,App,Translated_Review,Sentiment,Sentiment_Polarity,Sentiment_Subjectivity
64290,Houzz Interior Design Ideas,,,,
64291,Houzz Interior Design Ideas,,,,
64292,Houzz Interior Design Ideas,,,,
64293,Houzz Interior Design Ideas,,,,
64294,Houzz Interior Design Ideas,,,,


Use DataFrame.info() to inspect data types and non-null counts

In [24]:
google.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10841 non-null  object 
 1   Category        10841 non-null  object 
 2   Rating          9367 non-null   float64
 3   Reviews         10841 non-null  object 
 4   Size            10841 non-null  object 
 5   Installs        10841 non-null  object 
 6   Type            10840 non-null  object 
 7   Price           10841 non-null  object 
 8   Content Rating  10840 non-null  object 
 9   Genres          10841 non-null  object 
 10  Last Updated    10841 non-null  object 
 11  Current Ver     10833 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(1), object(12)
memory usage: 1.1+ MB


In [25]:
google_user.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64295 entries, 0 to 64294
Data columns (total 5 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   App                     64295 non-null  object 
 1   Translated_Review       37427 non-null  object 
 2   Sentiment               37432 non-null  object 
 3   Sentiment_Polarity      37432 non-null  float64
 4   Sentiment_Subjectivity  37432 non-null  float64
dtypes: float64(2), object(3)
memory usage: 2.5+ MB


Check which columns contain missing data, to help decide how to deal with it.

In [58]:
google.isnull().any()

App               False
Category          False
Rating             True
Reviews           False
Size              False
Installs          False
Type               True
Price             False
Content Rating     True
Genres            False
Last Updated      False
Current Ver        True
Android Ver        True
dtype: bool
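
`isnull().any()` only reports whether a column has gaps, not how many. A minimal sketch of counting them is shown here on a small made-up frame (`toy` and its values are invented for illustration and are not part of the Play Store data), so that it runs on its own:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; the values are invented for illustration.
toy = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Rating": [4.1, np.nan, np.nan, 3.9],
    "Type": ["Free", "Paid", None, "Free"],
})

# Number of missing values per column.
missing_count = toy.isnull().sum()

# Share of rows that are missing in each column, as a percentage.
missing_pct = toy.isnull().mean() * 100

summary = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
print(summary)
```

The same two calls should work unchanged on `google`, where the counts ought to agree with the non-null counts reported by `info()`.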

Select 20 random rows with missing data to inspect

In [63]:
google[google.isnull().any(axis=1)].sample(20)

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
10769,FQ Magazine,LIFESTYLE,,1,12M,100+,Free,0,Everyone,Lifestyle,"December 12, 2016",1.0,4.1 and up
4139,F Length Sim (no Ads),PHOTOGRAPHY,,0,1.7M,10+,Paid,$2.00,Everyone,Photography,"April 29, 2018",1.0,4.0.3 and up
9329,EG Groups,EVENTS,,0,1.1M,10+,Free,0,Everyone,Events,"October 15, 2017",1.0.0,4.0 and up
4564,Tutorials for R Programming Offline,FAMILY,,2,4.4M,500+,Free,0,Everyone,Education,"December 4, 2017",1.0,4.0 and up
9924,Konferencija.eu,BUSINESS,,3,1.6M,10+,Free,0,Everyone,Business,"March 21, 2017",1.0.2,4.0 and up
9235,ec.tv,LIFESTYLE,,0,8.6M,50+,Free,0,Everyone,Lifestyle,"November 11, 2016",1.0,4.1 and up
10361,FG Wallet,FINANCE,,9,12M,"1,000+",Free,0,Everyone,Finance,"June 6, 2018",2.1,6.0 and up
9557,Explora con el Chavo,FAMILY,,0,Varies with device,"100,000+",Free,0,Everyone,Educational,"June 26, 2018",Varies with device,Varies with device
5162,Hilltop AH,MEDICAL,,0,29M,100+,Free,0,Everyone,Medical,"January 10, 2018",300000.0.96,4.0.3 and up
7167,CD - Teach me ABC English L1,FAMILY,,2,63M,500+,Free,0,Everyone,Education,"June 18, 2017",1.0.0,4.0 and up


In [64]:
# Drop rows, then columns, that are entirely empty; rows with only some missing values are kept
google2 = google.dropna(how='all').dropna(how='all', axis=1)
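
The `how='all'` argument matters here: it removes a row or column only when every value in it is missing. The sketch below contrasts it with the default `how='any'` on a made-up frame (`toy`, `all_dropped` and `any_dropped` are hypothetical names used only for this illustration):

```python
import numpy as np
import pandas as pd

# Invented data: row 1 is entirely empty, and so is the "Empty" column.
toy = pd.DataFrame({
    "App": ["A", np.nan, "C"],
    "Rating": [4.1, np.nan, np.nan],
    "Empty": [np.nan, np.nan, np.nan],
})

# how="all": drop only the rows, then only the columns, that are entirely missing.
all_dropped = toy.dropna(how="all").dropna(how="all", axis=1)

# how="any" (the default): drop every row that has at least one missing value.
any_dropped = toy.dropna(how="any")

print(all_dropped.shape)
print(any_dropped.shape)
```

Judging by the `info()` output, no column of `google` is entirely empty and "App" is never null, so `google2` most likely has the same shape as `google`.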

In [66]:
# Show only the columns that contain at least one missing value
google2.loc[:, google2.isnull().any()].sample(20)

Unnamed: 0,Rating,Type,Content Rating,Current Ver,Android Ver
2534,3.6,Free,Everyone,0.0.1,4.0 and up
493,4.4,Free,Mature 17+,1.639,4.0 and up
8175,4.6,Free,Everyone,2.1.69,4.0.3 and up
7991,2.2,Free,Everyone,1.0,3.2 and up
9953,3.8,Free,Everyone,4.0.4,4.1 and up
5438,4.2,Free,Everyone,1.7.7,2.1 and up
4444,3.8,Free,Everyone,8.3.2,4.0.3 and up
8138,4.5,Free,Everyone,6.2.0,4.1 and up
3286,4.5,Free,Everyone,Varies with device,Varies with device
4998,4.0,Free,Everyone,Varies with device,Varies with device


Upon inspection, the columns that contain null values are "Current Ver", "Android Ver", "Content Rating", "Type" and "Rating".
Of these, "Current Ver", "Android Ver", "Content Rating" and "Type" have only a few missing values (8, 3, 1 and 1 of 10,841 rows respectively), so we can safely drop
the corresponding rows without affecting the end result.
However, the vast majority of the missing data is in the "Rating" column (1,474 rows, about 13.6%), so simply dropping those rows would affect the end result,
not to mention any potential relationship between "Rating" and the other variables.
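
The first half of this plan can be sketched as follows, assuming the column names shown in the `info()` output. The frame `toy` is a made-up stand-in for `google`, and `sparse_cols` and `cleaned` are hypothetical names. How to treat the missing ratings is deliberately left open.

```python
import numpy as np
import pandas as pd

# Invented rows that mimic the missing-value pattern of the real data.
toy = pd.DataFrame({
    "App": ["A", "B", "C", "D", "E"],
    "Rating": [4.1, np.nan, 3.9, np.nan, 4.5],
    "Type": ["Free", "Free", None, "Paid", "Free"],
    "Content Rating": ["Everyone", "Teen", "Everyone", "Everyone", None],
    "Current Ver": ["1.0", "2.1", "1.2", "1.0", "3.0"],
    "Android Ver": ["4.1 and up", "4.0 and up", "4.4 and up", None, "5.0 and up"],
})

# Columns with only a handful of gaps: drop the affected rows.
sparse_cols = ["Current Ver", "Android Ver", "Content Rating", "Type"]
cleaned = toy.dropna(subset=sparse_cols)

# Rows with a missing Rating are kept for now.
print(cleaned.isnull().sum())
```

Passing `subset=` restricts the check to the listed columns, so a row that is missing only its "Rating" survives the drop.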