## Google Play Store Apps Dataset Cleaning and Feature Engineering

Web-scraped data on roughly 10,000 Play Store apps, used here to analyse the Android market. Source: https://www.kaggle.com/lava18/google-play-store-apps

Fields in the data:
* App: Application name
* Category: Category to which the app belongs
* Rating: Overall user rating of the app
* Reviews: Number of user reviews for the app
* Size: Size of the app
* Installs: Number of user downloads/installs for the app
* Type: Paid or Free
* Price: Price of the app
* Content Rating: Age group the app is targeted at (for example Everyone, Teen, Mature 17+)
* Genres: An app can belong to multiple genres (apart from its main category). For example, a musical family game will belong to Music, Game, Family genres.
* Last Updated: Date when the app was last updated on Play Store
* Current Ver: Current version of the app available on Play Store
* Android Ver: Minimum required Android version
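
The `Genres` field packs several genres into one string separated by `;` (for example `Art & Design;Pretend Play`). A minimal sketch of splitting it into one row per genre, using a small hypothetical frame rather than the real CSV and assuming a pandas version that provides `explode` (0.25 or later):

```python
import pandas as pd

# Toy rows (hypothetical subset) that mimic the Genres format in the dataset.
toy = pd.DataFrame({
    'App': ['Coloring book moana', 'Sketch - Draw & Paint'],
    'Genres': ['Art & Design;Pretend Play', 'Art & Design'],
})

# Split on ';' into a list per app, then give each genre its own row.
toy['Genre_list'] = toy['Genres'].str.split(';')
genres_long = toy.explode('Genre_list')[['App', 'Genre_list']]
print(genres_long)
```

The long form makes per-genre counts a plain `value_counts()` call; the column name `Genre_list` is an illustrative choice, not something the notebook defines.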

In [1]:
import pandas as pd

# Load the scraped data and drop exact duplicate rows up front
playstore_data = pd.read_csv('./googleplaystore.csv', parse_dates=['Last Updated']).drop_duplicates()
playstore_data.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


In [2]:
playstore_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10358 entries, 0 to 10840
Data columns (total 13 columns):
App               10358 non-null object
Category          10358 non-null object
Rating            8893 non-null float64
Reviews           10358 non-null object
Size              10358 non-null object
Installs          10358 non-null object
Type              10357 non-null object
Price             10358 non-null object
Content Rating    10357 non-null object
Genres            10358 non-null object
Last Updated      10358 non-null object
Current Ver       10350 non-null object
Android Ver       10355 non-null object
dtypes: float64(1), object(12)
memory usage: 1.1+ MB
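
`Last Updated` is still reported as `object` above even though `parse_dates` was requested, which usually means at least one value in the column could not be parsed as a date. A hedged sketch of converting explicitly and surfacing the unparseable values; the sample strings are illustrative (the last one is a made-up bad value), and the format string is an assumption based on the dates shown in the preview:

```python
import pandas as pd

# Two dates in the style of the preview plus one hypothetical bad value.
dates = pd.Series(['January 7, 2018', 'August 1, 2018', 'not a date'])

# errors='coerce' turns anything that does not match the format into NaT.
parsed = pd.to_datetime(dates, format='%B %d, %Y', errors='coerce')

# The original strings behind the NaT entries are the rows worth inspecting.
bad_values = dates[parsed.isna()]
print(parsed)
print(bad_values.tolist())
```

Inspecting `bad_values` before dropping anything shows whether the problem is a formatting variant or a genuinely malformed row.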


In [3]:
playstore_data.nunique()

App               9660
Category            34
Rating              40
Reviews           6002
Size               462
Installs            22
Type                 3
Price               93
Content Rating       6
Genres             120
Last Updated      1378
Current Ver       2832
Android Ver         33
dtype: int64

In [4]:
playstore_data.isna().sum()

App                  0
Category             0
Rating            1465
Reviews              0
Size                 0
Installs             0
Type                 1
Price                0
Content Rating       1
Genres               0
Last Updated         0
Current Ver          8
Android Ver          3
dtype: int64

In [5]:
playstore_data.Size.unique()

array(['19M', '14M', '8.7M', '25M', '2.8M', '5.6M', '29M', '33M', '3.1M',
       '28M', '12M', '20M', '21M', '37M', '2.7M', '5.5M', '17M', '39M',
       '31M', '4.2M', '7.0M', '23M', '6.0M', '6.1M', '4.6M', '9.2M',
       '5.2M', '11M', '24M', 'Varies with device', '9.4M', '15M', '10M',
       '1.2M', '26M', '8.0M', '7.9M', '56M', '57M', '35M', '54M', '201k',
       '3.6M', '5.7M', '8.6M', '2.4M', '27M', '2.5M', '16M', '3.4M',
       '8.9M', '3.9M', '2.9M', '38M', '32M', '5.4M', '18M', '1.1M', '2.2M',
       '4.5M', '9.8M', '52M', '9.0M', '6.7M', '30M', '2.6M', '7.1M',
       '3.7M', '22M', '7.4M', '6.4M', '3.2M', '8.2M', '9.9M', '4.9M',
       '9.5M', '5.0M', '5.9M', '13M', '73M', '6.8M', '3.5M', '4.0M',
       '2.3M', '7.2M', '2.1M', '42M', '7.3M', '9.1M', '55M', '23k', '6.5M',
       '1.5M', '7.5M', '51M', '41M', '48M', '8.5M', '46M', '8.3M', '4.3M',
       '4.7M', '3.3M', '40M', '7.8M', '8.8M', '6.6M', '5.1M', '61M', '66M',
       '79k', '8.4M', '118k', '44M', '695k', '1.6M', '6.2M

In [6]:
playstore_data.Size.value_counts()

Varies with device    1526
11M                    188
13M                    186
12M                    186
14M                    182
15M                    174
17M                    155
26M                    145
16M                    143
19M                    135
10M                    133
25M                    131
20M                    131
21M                    130
24M                    129
18M                    124
23M                    109
22M                    108
29M                     95
27M                     94
28M                     92
30M                     84
33M                     78
3.3M                    76
37M                     72
31M                     69
35M                     68
2.5M                    68
2.3M                    68
2.9M                    67
                      ... 
608k                     1
191k                     1
20k                      1
785k                     1
892k                     1
511k                     1
6

In [7]:
# Drop the malformed row whose Size reads '1,000+'; .copy() avoids chained-assignment warnings below
playstore_data = playstore_data[playstore_data.Size != '1,000+'].copy()
# Express every size in MB: '201k' -> '201e-3', '19M' -> '19'
playstore_data['Size'] = playstore_data['Size'].apply(lambda x: str(x).replace('k', 'e-3').strip('M'))
# errors='coerce' turns the remaining string 'Varies with device' into NaN
playstore_data['Size(M)'] = pd.to_numeric(playstore_data['Size'], errors='coerce')
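
The cell above relies on a string trick: replacing `k` with `e-3` so that `'201k'` is read as `201e-3`, i.e. 0.201. A more explicit sketch of the same conversion follows. `size_to_mb` is a hypothetical helper, not part of the notebook, and it assumes the same factor of 1000 kB per MB that the `e-3` substitution implies:

```python
import math

def size_to_mb(size):
    """Convert Play Store size strings such as '19M' or '201k' to megabytes.

    Hypothetical helper: returns NaN for anything it cannot parse,
    such as 'Varies with device'.
    """
    text = str(size).strip()
    try:
        if text.endswith('M'):
            return float(text[:-1])
        if text.endswith('k'):
            return float(text[:-1]) / 1000  # same factor as the 'e-3' trick
    except ValueError:
        pass
    return math.nan

for raw in ['19M', '8.7M', '201k', 'Varies with device']:
    print(raw, '->', size_to_mb(raw))
```

It could be applied with `playstore_data['Size'].apply(size_to_mb)`. The explicit suffix check also rejects stray values instead of depending on what `to_numeric` happens to coerce.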

In [8]:
playstore_data['Installs'].unique()

array(['10,000+', '500,000+', '5,000,000+', '50,000,000+', '100,000+',
       '50,000+', '1,000,000+', '10,000,000+', '5,000+', '100,000,000+',
       '1,000,000,000+', '1,000+', '500,000,000+', '50+', '100+', '500+',
       '10+', '1+', '5+', '0+', '0'], dtype=object)

In [9]:
# Strip the trailing '+' and the thousands separators so the values can be parsed as numbers
playstore_data['Installs'] = playstore_data['Installs'].apply(lambda x: x.strip('+').replace(',',''))
playstore_data.Installs = pd.to_numeric(playstore_data.Installs)
playstore_data.Reviews = pd.to_numeric(playstore_data.Reviews)
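
The same clean-up can be written with vectorised string methods instead of `apply`. A minimal sketch on a few values taken from the `unique()` output above, assuming a pandas version where `Series.str.replace` accepts the `regex` argument:

```python
import pandas as pd

installs = pd.Series(['10,000+', '1,000,000,000+', '0+', '0'])

# Remove '+' and ',' in one pass, then convert the digit strings to integers.
cleaned = pd.to_numeric(installs.str.replace(r'[+,]', '', regex=True))
print(cleaned.tolist())
```

Each value is the lower edge of an install bucket ('10,000+' means at least 10,000), so the resulting numbers are floors rather than exact counts.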

In [10]:
playstore_data.Type.unique()

array(['Free', 'Paid', nan], dtype=object)

In [11]:
playstore_data.isna().sum()

App                  0
Category             0
Rating            1465
Reviews              0
Size                 0
Installs             0
Type                 1
Price                0
Content Rating       0
Genres               0
Last Updated         0
Current Ver          8
Android Ver          2
Size(M)           1526
dtype: int64

In [12]:
playstore_data['Price'].unique()

array(['0', '$4.99', '$3.99', '$6.99', '$1.49', '$2.99', '$7.99', '$5.99',
       '$3.49', '$1.99', '$9.99', '$7.49', '$0.99', '$9.00', '$5.49',
       '$10.00', '$24.99', '$11.99', '$79.99', '$16.99', '$14.99', '$1.00',
       '$29.99', '$12.99', '$2.49', '$10.99', '$1.50', '$19.99', '$15.99',
       '$33.99', '$74.99', '$39.99', '$3.95', '$4.49', '$1.70', '$8.99',
       '$2.00', '$3.88', '$25.99', '$399.99', '$17.99', '$400.00', '$3.02',
       '$1.76', '$4.84', '$4.77', '$1.61', '$2.50', '$1.59', '$6.49',
       '$1.29', '$5.00', '$13.99', '$299.99', '$379.99', '$37.99',
       '$18.99', '$389.99', '$19.90', '$8.49', '$1.75', '$14.00', '$4.85',
       '$46.99', '$109.99', '$154.99', '$3.08', '$2.59', '$4.80', '$1.96',
       '$19.40', '$3.90', '$4.59', '$15.46', '$3.04', '$4.29', '$2.60',
       '$3.28', '$4.60', '$28.99', '$2.95', '$2.90', '$1.97', '$200.00',
       '$89.99', '$2.56', '$30.99', '$3.61', '$394.99', '$1.26', '$1.20',
       '$1.04'], dtype=object)

In [13]:
# Strip the currency symbol, convert to a number, then drop the raw string columns
playstore_data.Price = playstore_data.Price.apply(lambda x: x.strip('$'))
playstore_data['Price($)'] = pd.to_numeric(playstore_data.Price)
playstore_data = playstore_data.drop(columns = ['Size','Price'])

In [14]:
playstore_data[playstore_data['Current Ver'].isnull()]

Unnamed: 0,App,Category,Rating,Reviews,Installs,Type,Content Rating,Genres,Last Updated,Current Ver,Android Ver,Size(M),Price($)
15,Learn To Draw Kawaii Characters,ART_AND_DESIGN,3.2,55,5000,Free,Everyone,Art & Design,"June 6, 2018",,4.2 and up,2.7,0.0
1553,Market Update Helper,LIBRARIES_AND_DEMO,4.1,20145,1000000,Free,Everyone,Libraries & Demo,"February 12, 2013",,1.5 and up,0.011,0.0
6322,Virtual DJ Sound Mixer,TOOLS,4.2,4010,500000,Free,Everyone,Tools,"May 10, 2017",,4.0 and up,8.7,0.0
6803,BT Master,FAMILY,,0,100,Free,Everyone,Education,"November 6, 2016",,1.6 and up,0.222,0.0
7333,Dots puzzle,FAMILY,4.0,179,50000,Paid,Everyone,Puzzle,"April 18, 2018",,4.0 and up,14.0,0.99
7407,Calculate My IQ,FAMILY,,44,10000,Free,Everyone,Entertainment,"April 3, 2017",,2.3 and up,7.2,0.0
7730,UFO-CQ,TOOLS,,1,10,Paid,Everyone,Tools,"July 4, 2016",,2.0 and up,0.237,0.99
10342,La Fe de Jesus,BOOKS_AND_REFERENCE,,8,1000,Free,Everyone,Books & Reference,"January 31, 2017",,3.0 and up,0.658,0.0


In [15]:
playstore_data['Current Ver'].unique()

array(['1.0.0', '2.0.0', '1.2.4', ..., '1.0.612928', '0.3.4', '2.0.148.0'], dtype=object)

In [16]:
playstore_data['Android Ver'].unique()

array(['4.0.3 and up', '4.2 and up', '4.4 and up', '2.3 and up',
       '3.0 and up', '4.1 and up', '4.0 and up', '2.3.3 and up',
       'Varies with device', '2.2 and up', '5.0 and up', '6.0 and up',
       '1.6 and up', '1.5 and up', '2.1 and up', '7.0 and up',
       '5.1 and up', '4.3 and up', '4.0.3 - 7.1.1', '2.0 and up',
       '3.2 and up', '4.4W and up', '7.1 and up', '7.0 - 7.1.1',
       '8.0 and up', '5.0 - 8.0', '3.1 and up', '2.0.1 and up',
       '4.1 - 7.1.1', nan, '5.0 - 6.0', '1.0 and up', '2.2 - 7.1.1',
       '5.0 - 7.1.1'], dtype=object)

In [17]:
playstore_data['Android Ver'].value_counts()

4.1 and up            2379
4.0.3 and up          1451
4.0 and up            1337
Varies with device    1221
4.4 and up             894
2.3 and up             643
5.0 and up             546
4.2 and up             387
2.3.3 and up           279
2.2 and up             239
3.0 and up             237
4.3 and up             235
2.1 and up             133
1.6 and up             116
6.0 and up              58
7.0 and up              42
3.2 and up              36
2.0 and up              32
5.1 and up              22
1.5 and up              20
4.4W and up             11
3.1 and up              10
2.0.1 and up             7
8.0 and up               6
7.1 and up               3
5.0 - 8.0                2
1.0 and up               2
4.0.3 - 7.1.1            2
2.2 - 7.1.1              1
5.0 - 6.0                1
7.0 - 7.1.1              1
5.0 - 7.1.1              1
4.1 - 7.1.1              1
Name: Android Ver, dtype: int64

In [18]:
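# Keep only the minimum version: '4.0.3 and up' -> '4.0.3', '5.0 - 8.0' -> '5.0', '4.4W and up' -> '4.4'
playstore_data['Android Ver req'] = playstore_data['Android Ver'].apply(lambda x: str(x).split('and')[0].split('-')[0].replace('W','').strip())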
playstore_data['Android Ver req'].value_counts()

4.1                   2380
4.0.3                 1453
4.0                   1337
Varies with device    1221
4.4                    905
2.3                    643
5.0                    550
4.2                    387
2.3.3                  279
2.2                    240
3.0                    237
4.3                    235
2.1                    133
1.6                    116
6.0                     58
7.0                     43
3.2                     36
2.0                     32
5.1                     22
1.5                     20
3.1                     10
2.0.1                    7
8.0                      6
7.1                      3
1.0                      2
nan                      2
Name: Android Ver req, dtype: int64
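
`Android Ver req` is still a string, so sorting it compares text rather than versions. A hedged sketch of turning the minimum version into a tuple of integers that compares correctly; `min_android_version` is a hypothetical helper, not part of the notebook:

```python
import re

def min_android_version(text):
    """Return the leading version in an 'Android Ver' string as a tuple of ints.

    Hypothetical helper: returns None for 'Varies with device' and missing values.
    """
    match = re.match(r'\s*(\d+(?:\.\d+)*)', str(text))
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split('.'))

for raw in ['4.0.3 and up', '4.4W and up', '5.0 - 8.0', 'Varies with device']:
    print(raw, '->', min_android_version(raw))
```

Tuples compare element by element, so `(4, 0, 3) < (4, 1)` holds as expected, which is not guaranteed for the raw strings.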

In [19]:
playstore_data.describe()

Unnamed: 0,Rating,Reviews,Installs,Size(M),Price($)
count,8892.0,10357.0,10357.0,8831.0,10357.0
mean,4.187877,405904.6,14157760.0,21.287788,1.0308
std,0.522377,2696778.0,80239550.0,22.540247,16.278625
min,1.0,0.0,0.0,0.0085,0.0
25%,4.0,32.0,1000.0,4.7,0.0
50%,4.3,1680.0,100000.0,13.0,0.0
75%,4.5,46416.0,1000000.0,29.0,0.0
max,5.0,78158310.0,1000000000.0,100.0,400.0


In [20]:
playstore_data.App.value_counts()

ROBLOX                                                9
8 Ball Pool                                           7
Bubble Shooter                                        6
Helix Jump                                            6
Zombie Catchers                                       6
Granny                                                5
Subway Surfers                                        5
Angry Birds Classic                                   5
Duolingo: Learn Languages Free                        5
Zombie Tsunami                                        5
slither.io                                            5
Candy Crush Saga                                      5
Temple Run 2                                          5
Farm Heroes Saga                                      5
Bowmasters                                            5
Sniper 3D Gun Shooter: Free Shooting Games - FPS      4
Google Photos                                         4
Word Search                                     

In [21]:
playstore_data.groupby('App').filter(lambda x: len(x)>1)

Unnamed: 0,App,Category,Rating,Reviews,Installs,Type,Content Rating,Genres,Last Updated,Current Ver,Android Ver,Size(M),Price($),Android Ver req
1,Coloring book moana,ART_AND_DESIGN,3.9,967,500000,Free,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up,14.000,0.0,4.0.3
23,Mcqueen Coloring pages,ART_AND_DESIGN,,61,100000,Free,Everyone,Art & Design;Action & Adventure,"March 7, 2018",1.0.0,4.1 and up,7.000,0.0,4.1
36,UNICORN - Color By Number & Pixel Art Coloring,ART_AND_DESIGN,4.7,8145,500000,Free,Everyone,Art & Design;Creativity,"August 2, 2018",1.0.9,4.4 and up,24.000,0.0,4.4
42,Textgram - write on photos,ART_AND_DESIGN,4.4,295221,10000000,Free,Everyone,Art & Design,"July 30, 2018",Varies with device,Varies with device,,0.0,Varies with device
139,Wattpad 📖 Free Books,BOOKS_AND_REFERENCE,4.6,2914724,100000000,Free,Teen,Books & Reference,"August 1, 2018",Varies with device,Varies with device,,0.0,Varies with device
143,Amazon Kindle,BOOKS_AND_REFERENCE,4.2,814080,100000000,Free,Teen,Books & Reference,"July 27, 2018",Varies with device,Varies with device,,0.0,Varies with device
145,Dictionary - Merriam-Webster,BOOKS_AND_REFERENCE,4.5,454060,10000000,Free,Everyone,Books & Reference,"May 18, 2018",Varies with device,Varies with device,,0.0,Varies with device
146,NOOK: Read eBooks & Magazines,BOOKS_AND_REFERENCE,4.5,155446,10000000,Free,Teen,Books & Reference,"April 25, 2018",Varies with device,Varies with device,,0.0,Varies with device
155,Oxford Dictionary of English : Free,BOOKS_AND_REFERENCE,4.1,363934,10000000,Free,Everyone,Books & Reference,"July 11, 2018",9.1.363,4.1 and up,7.100,0.0,4.1
157,Spanish English Translator,BOOKS_AND_REFERENCE,4.2,87873,10000000,Free,Teen,Books & Reference,"May 28, 2018",Varies with device,Varies with device,,0.0,Varies with device


In [22]:
# Remove redundant rows: duplicates of an app match in every other column and differ only in Reviews, so keep the row with the most reviews

m_idx = playstore_data.groupby('App')['Reviews'].idxmax()
playstore_data = playstore_data.loc[m_idx].reset_index(drop=True)
playstore_data.head()
# Potential issue: the same app may be listed under more than one category, and only one of those rows survives

Unnamed: 0,App,Category,Rating,Reviews,Installs,Type,Content Rating,Genres,Last Updated,Current Ver,Android Ver,Size(M),Price($),Android Ver req
0,"""i DT"" Fútbol. Todos Somos Técnicos.",SPORTS,,27,500,Free,Everyone,Sports,"October 7, 2017",0.22,4.1 and up,3.6,0.0,4.1
1,+Download 4 Instagram Twitter,SOCIAL,4.5,40467,1000000,Free,Everyone,Social,"August 2, 2018",5.03,4.1 and up,22.0,0.0,4.1
2,- Free Comics - Comic Apps,COMICS,3.5,115,10000,Free,Mature 17+,Comics,"July 13, 2018",5.0.12,5.0 and up,9.1,0.0,5.0
3,.R,TOOLS,4.5,259,10000,Free,Everyone,Tools,"September 16, 2014",1.1.06,1.5 and up,0.203,0.0,1.5
4,/u/app,COMMUNICATION,4.7,573,10000,Free,Mature 17+,Communication,"July 3, 2018",4.2.4,4.1 and up,53.0,0.0,4.1
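
A small sketch of what the `idxmax` step does, and of how the category caveat could be checked before rows are discarded. The frame below is a toy example with made-up values, not rows from the dataset:

```python
import pandas as pd

# Hypothetical duplicates: the same app listed twice with different review counts.
toy = pd.DataFrame({
    'App': ['ROBLOX', 'ROBLOX', 'Helix Jump', 'Helix Jump'],
    'Category': ['GAME', 'FAMILY', 'GAME', 'GAME'],
    'Reviews': [4447388, 4450890, 1497361, 1500999],
})

# Same idea as the cell above: keep the row with the most reviews per app.
keep_idx = toy.groupby('App')['Reviews'].idxmax()
deduped = toy.loc[keep_idx].reset_index(drop=True)

# Apps that appear under more than one category lose that information.
cats_per_app = toy.groupby('App')['Category'].nunique()
multi_category = cats_per_app[cats_per_app > 1].index.tolist()

print(deduped)
print(multi_category)
```

Running the category count on the real frame before deduplicating would show how many apps are affected by the caveat.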


In [23]:
playstore_data['Revenue_est($)'] = playstore_data['Price($)']*playstore_data['Installs']
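
As a worked example of this estimate, take 'Dots puzzle' from the missing-version listing earlier: a paid app priced at $0.99 in the 50,000+ installs bucket.

```python
price = 0.99        # Price($) of 'Dots puzzle'
installs = 50_000   # '50,000+' bucket, i.e. at least 50,000 installs

revenue_est = price * installs
print(f'{revenue_est:,.2f}')  # formatted to two decimals: 49,500.00
```

Because `Installs` is the lower edge of a bucket and the price is a single snapshot, the figure is only a rough estimate. It also leaves out in-app purchases, which means every free app gets an estimate of 0.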

In [24]:
playstore_data.to_csv('./gpsa.csv')