# Google Play Store Analysis
Notes:
- The dataset used was compiled around 2017/2018
- The dataset does not represent every single app in the Google Play Store during this period
- The listings are those Google Play served to the person who scraped them, so they reflect that user's geographical location (USA) and browsing behaviour
- Each app's size has been converted to a float in MB; where the size was missing, it has been replaced by the average size for that app's category
- 'Installs' is not an exact install count: Google only reports a threshold, so an app with 123,456 installs is shown as 100,000+ (the + has been removed)
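
The size and install preprocessing described in the notes was done before `apps.csv` was produced, and the original code is not shown. As a rough sketch of how such steps could look in pandas (the toy frame and its values are made up, only the column names come from the dataset):

```python
import pandas as pd

# Hypothetical toy data; column names mirror the dataset, values are invented.
toy = pd.DataFrame({
    "Category": ["GAME", "GAME", "TOOLS", "TOOLS"],
    "Size_MBs": [10.0, None, 4.0, 6.0],
})

# Fill each missing size with the mean size of that app's category.
category_mean = toy.groupby("Category")["Size_MBs"].transform("mean")
toy["Size_MBs"] = toy["Size_MBs"].fillna(category_mean)

# Strip the thousands separators and the trailing '+' from install thresholds.
installs = pd.Series(["100,000+", "1,000+"])
installs_clean = installs.str.replace(r"[,+]", "", regex=True).astype("int64")
```

Because the imputed sizes are category averages, they should not be read as measurements of individual apps.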

## Exploratory questions:
- How competitive are the different app categories (e.g., Games, Lifestyle, Weather)?
- Which app category offers compelling opportunities based on its popularity?
- How many downloads would you give up by making your app paid vs. free?
- How much can you reasonably charge for a paid app?
- Which paid apps have had the highest revenue?
- How many paid apps will recoup their development costs based on their sales revenue?

## Data exploration and cleaning

In [7]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

In [143]:
data = pd.read_csv('apps.csv', thousands=',')

In [145]:
data.head()

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver
0,Ak Parti Yardım Toplama,SOCIAL,,0,8.7,0,Paid,$13.99,Teen,Social,"July 28, 2017",4.1 and up
1,Ain Arabic Kids Alif Ba ta,FAMILY,,0,33.0,0,Paid,$2.99,Everyone,Education,"April 15, 2016",3.0 and up
2,Popsicle Launcher for Android P 9.0 launcher,PERSONALIZATION,,0,5.5,0,Paid,$1.49,Everyone,Personalization,"July 11, 2018",4.2 and up
3,Command & Conquer: Rivals,FAMILY,,0,19.0,0,,0,Everyone 10+,Strategy,"June 28, 2018",Varies with device
4,CX Network,BUSINESS,,0,10.0,0,Free,0,Everyone,Business,"August 6, 2018",4.1 and up


In [147]:
data.tail()

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver
10836,Subway Surfers,GAME,4.5,27723193,76.0,1000000000,Free,0,Everyone 10+,Arcade,"July 12, 2018",4.1 and up
10837,Subway Surfers,GAME,4.5,27724094,76.0,1000000000,Free,0,Everyone 10+,Arcade,"July 12, 2018",4.1 and up
10838,Subway Surfers,GAME,4.5,27725352,76.0,1000000000,Free,0,Everyone 10+,Arcade,"July 12, 2018",4.1 and up
10839,Subway Surfers,GAME,4.5,27725352,76.0,1000000000,Free,0,Everyone 10+,Arcade,"July 12, 2018",4.1 and up
10840,Subway Surfers,GAME,4.5,27711703,76.0,1000000000,Free,0,Everyone 10+,Arcade,"July 12, 2018",4.1 and up


In [149]:
data.dtypes

App                object
Category           object
Rating            float64
Reviews             int64
Size_MBs          float64
Installs            int64
Type               object
Price              object
Content_Rating     object
Genres             object
Last_Updated       object
Android_Ver        object
dtype: object

In [151]:
data.sample()

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver
1538,DT Baby Cam,TOOLS,1.7,10,23.0,500,Free,0,Everyone,Tools,"December 5, 2017",2.2 and up


In [153]:
data.shape

(10841, 12)

In [155]:
data.isnull().values.any()

True

In [157]:
data.isna().value_counts()

App    Category  Rating  Reviews  Size_MBs  Installs  Type   Price  Content_Rating  Genres  Last_Updated  Android_Ver
False  False     False   False    False     False     False  False  False           False   False         False          9365
                 True    False    False     False     False  False  False           False   False         False          1473
                 False   False    False     False     False  False  False           False   False         True              2
                 True    False    False     False     True   False  False           False   False         False             1
Name: count, dtype: int64

### Remove rows containing missing data:

In [160]:
data.dropna(inplace=True)

In [162]:
data.shape

(9365, 12)
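
The `isna().value_counts()` output above shows that almost all of the 1,476 dropped rows are missing only a `Rating`. A per-column count makes that kind of pattern easier to read before deciding to drop rows. The following is a minimal sketch on a made-up three-row frame, not the notebook's data:

```python
import pandas as pd

# Hypothetical toy data with gaps in two different columns.
toy = pd.DataFrame({
    "App": ["a", "b", "c"],
    "Rating": [4.5, None, 3.9],
    "Type": ["Free", "Paid", None],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Number of rows that dropna() would discard (any column missing).
rows_dropped = len(toy) - len(toy.dropna())
```

If only one column is needed for a given question, `dropna(subset=[...])` would keep more rows than dropping on every column.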

In [164]:
data.columns

Index(['App', 'Category', 'Rating', 'Reviews', 'Size_MBs', 'Installs', 'Type',
       'Price', 'Content_Rating', 'Genres', 'Last_Updated', 'Android_Ver'],
      dtype='object')

In [166]:
data.drop(labels=['Android_Ver','Last_Updated'], axis="columns", inplace=True)

In [168]:
data.sample()

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
1423,HON. B.J. ACS COLLEGE ALE,FAMILY,5.0,3,1.8,100,Free,0,Mature 17+,Education


### Fix data types:

In [188]:
data.dtypes

App                object
Category           object
Rating            float64
Reviews             int64
Size_MBs          float64
Installs            int64
Type               object
Price              object
Content_Rating     object
Genres             object
dtype: object

In [190]:
data.Price

21           0
28       $1.49
47       $0.99
82           0
99           0
         ...  
10836        0
10837        0
10838        0
10839        0
10840        0
Name: Price, Length: 9365, dtype: object

In [222]:
data.Price = data.Price.replace({'\\$': ''}, regex=True)

In [224]:
data.Price

21          0
28       1.49
47       0.99
82          0
99          0
         ... 
10836       0
10837       0
10838       0
10839       0
10840       0
Name: Price, Length: 9365, dtype: object

In [226]:
data.dtypes

App                object
Category           object
Rating            float64
Reviews             int64
Size_MBs          float64
Installs            int64
Type               object
Price              object
Content_Rating     object
Genres             object
dtype: object

In [230]:
data.Price = data.Price.astype('float64')

In [232]:
data.dtypes

App                object
Category           object
Rating            float64
Reviews             int64
Size_MBs          float64
Installs            int64
Type               object
Price             float64
Content_Rating     object
Genres             object
dtype: object
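
The two steps above (strip the `$`, then cast) can also be combined. The sketch below uses `pd.to_numeric` with `errors="coerce"`, which turns any value that still fails to parse into `NaN` instead of raising. The sample series is hypothetical, and the `'Everyone'` entry is an invented malformed value included only to show the coercion:

```python
import pandas as pd

# Hypothetical price strings; 'Everyone' stands in for a malformed entry.
prices = pd.Series(["0", "$1.49", "$0.99", "Everyone"])

# Remove the literal dollar sign, then convert; unparseable values become NaN.
cleaned = pd.to_numeric(prices.str.replace("$", "", regex=False), errors="coerce")
```

Passing `regex=False` treats `$` as a plain character, so no escaping is needed.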

### Check for and remove duplicates:

In [244]:
duplicated = data[data.duplicated()]

In [252]:
duplicated.sort_values('App').head(10)

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
6371,10 Best Foods for You,HEALTH_AND_FITNESS,4.0,2490,3.8,500000,Free,0.0,Everyone 10+,Health & Fitness
7159,1800 Contacts - Lens Store,MEDICAL,4.7,23160,26.0,1000000,Free,0.0,Everyone,Medical
2604,2017 EMRA Antibiotic Guide,MEDICAL,4.4,12,3.8,1000,Paid,16.99,Everyone,Medical
5140,21-Day Meditation Experience,HEALTH_AND_FITNESS,4.4,11506,15.0,100000,Free,0.0,Everyone,Health & Fitness
946,420 BZ Budeze Delivery,MEDICAL,5.0,2,11.0,100,Free,0.0,Mature 17+,Medical
9760,8fit Workouts & Meal Planner,HEALTH_AND_FITNESS,4.6,115721,67.0,10000000,Free,0.0,Everyone,Health & Fitness
2482,A Manual of Acupuncture,MEDICAL,3.5,214,68.0,1000,Paid,33.99,Everyone,Medical
6439,A&E - Watch Full Episodes of TV Shows,ENTERTAINMENT,4.0,29706,19.0,1000000,Free,0.0,Teen,Entertainment
6443,A&E - Watch Full Episodes of TV Shows,ENTERTAINMENT,4.0,29706,19.0,1000000,Free,0.0,Teen,Entertainment
3465,AAFP,MEDICAL,3.8,63,24.0,10000,Free,0.0,Everyone,Medical


In [254]:
duplicated.sort_values('App').tail(10)

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
6837,join.me - Simple Meetings,BUSINESS,4.0,6989,7.2,1000000,Free,0.0,Everyone,Business
4761,"muzmatch: Muslim & Arab Singles, Marriage & Da...",DATING,3.8,5164,25.0,100000,Free,0.0,Mature 17+,Dating
7783,mySugr: the blood sugar tracker made just for you,MEDICAL,4.6,21189,36.0,1000000,Free,0.0,Everyone,Medical
9783,osmino Wi-Fi: free WiFi,TOOLS,4.2,134203,4.1,10000000,Free,0.0,Everyone,Tools
10400,slither.io,GAME,4.4,5235294,36.0,100000000,Free,0.0,Everyone,Action
9635,"theScore: Live Sports Scores, News, Stats & Vi...",SPORTS,4.4,133833,34.0,10000000,Free,0.0,Everyone 10+,Sports
9634,"theScore: Live Sports Scores, News, Stats & Vi...",SPORTS,4.4,133833,34.0,10000000,Free,0.0,Everyone 10+,Sports
9632,"theScore: Live Sports Scores, News, Stats & Vi...",SPORTS,4.4,133825,34.0,10000000,Free,0.0,Everyone 10+,Sports
10156,trivago: Hotels & Travel,TRAVEL_AND_LOCAL,4.2,219848,12.0,50000000,Free,0.0,Everyone,Travel & Local
10159,trivago: Hotels & Travel,TRAVEL_AND_LOCAL,4.2,219848,12.0,50000000,Free,0.0,Everyone,Travel & Local


In [256]:
data_cleaned = data.drop_duplicates(subset=['App','Type','Price'])

In [258]:
data_cleaned.shape

(8197, 10)
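
Note that `data.duplicated()` with no arguments flags only rows that match on every column, whereas the `subset` passed to `drop_duplicates` also removes rows for the same app whose `Reviews` count differs slightly. A small sketch of the difference, using a toy frame loosely modelled on the rows shown above (app names shortened, not the full dataset):

```python
import pandas as pd

# Toy data: three scrapes of one app (one with a different review count)
# plus one unrelated app.
toy = pd.DataFrame({
    "App": ["theScore", "theScore", "theScore", "trivago"],
    "Reviews": [133833, 133833, 133825, 219848],
    "Type": ["Free"] * 4,
    "Price": [0.0] * 4,
})

# Rows identical to an earlier row in every column.
exact_dupes = toy.duplicated().sum()

# Rows repeating an earlier (App, Type, Price) combination.
subset_dupes = toy.duplicated(subset=["App", "Type", "Price"]).sum()

# Keeps the first occurrence of each combination by default.
deduped = toy.drop_duplicates(subset=["App", "Type", "Price"])
```

Which of the near-duplicate rows survives depends on row order, since the default `keep="first"` retains the earliest one.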