# Title 2
## Subtitle

In [1]:
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
apple_store = pd.read_csv("AppleStore.csv")
google_store = pd.read_csv("googleplaystore.csv")

# Markdown to describe the data

In [3]:
print("**Apple Store Dataset Columns:", '\n')
for c in apple_store.columns:
    print(c)

**Apple Store Dataset Columns: 

id
track_name
size_bytes
currency
price
rating_count_tot
rating_count_ver
user_rating
user_rating_ver
ver
cont_rating
prime_genre
sup_devices.num
ipadSc_urls.num
lang.num
vpp_lic


In [4]:
print("**Google Play Store Dataset Columns:", '\n')
for c in google_store.columns:
    print(c)

**Google Play Store Dataset Columns: 

App
Category
Rating
Reviews
Size
Installs
Type
Price
Content Rating
Genres
Last Updated
Current Ver
Android Ver


# CLEANING PROCESS STAGE 1

The analysis focuses on free apps, and not all columns are needed for it. Based on the column listings above, these are the key <span style="color:green">**properties to keep**</span> so that the merged file is consistent:
- App Name/IDs
- App Size (bytes)
- Installs
- Count of total Ratings
- Ratings (total)
- Category/Genre/Prime Genre
- Price

The following related properties will be <span style="color:red">**dropped**</span>:
- **Related to Versions**
- **Related to Devices/Softwares**

The final outcome of the analysis concerns categories and different ways of counting users, not technical details of the apps themselves. For this reason, all of this information will be dropped.

The <span style="color:blue">**cleaning process**</span> is going to follow the order:
- Renaming columns
- Dropping/Reordering columns
- Implementing proper data-types to the columns
- Inspecting NaN values and any other wrong inputs
- Dropping paid apps
- Dealing with duplicates

_All done separately, first for Apple's dataset and then for Google's._
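The steps listed above can be sketched as a single helper. This is a minimal illustration covering only the rename, drop, free-app filter, and deduplication steps; `clean_store`, `rename_map`, and the toy frame are hypothetical names introduced for the example, not code from this notebook.

```python
import pandas as pd


def clean_store(df, rename_map):
    """Keep and rename the mapped columns, keep free apps, drop duplicate names."""
    # Selecting with the mapping's keys both drops unused columns and reorders them.
    out = df[list(rename_map)].rename(columns=rename_map)
    # Keep only free apps (assumes the price column is already numeric).
    out = out[out["price"] == 0]
    # Sort so that, among duplicated names, the most-rated entry comes first.
    out = out.sort_values("rating_count", ascending=False)
    return out.drop_duplicates(subset="app_name", keep="first").reset_index(drop=True)


# Hypothetical toy data, for illustration only.
toy = pd.DataFrame({
    "track_name": ["A", "A", "B", "C"],
    "price": [0.0, 0.0, 1.99, 0.0],
    "rating_count_tot": [10, 50, 7, 3],
    "ver": ["1.0", "2.0", "1.1", "3.0"],
})
cleaned = clean_store(
    toy,
    {"track_name": "app_name", "price": "price", "rating_count_tot": "rating_count"},
)
print(cleaned)
```

The cells below perform the same steps one at a time so each intermediate result can be inspected.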

## APPLE

In [5]:
apple_crop = apple_store.copy()[["id", "track_name", "prime_genre", "size_bytes", "price", "rating_count_tot", "user_rating"]]
apple_crop.columns = ["id", "app_name", "genre", "size_bytes", "price", "rating_count", "rating"]
apple_crop.head()

Unnamed: 0,id,app_name,genre,size_bytes,price,rating_count,rating
0,284882215,Facebook,Social Networking,389879808,0.0,2974676,3.5
1,389801252,Instagram,Photo & Video,113954816,0.0,2161558,4.5
2,529479190,Clash of Clans,Games,116476928,0.0,2130805,4.5
3,420009108,Temple Run,Games,65921024,0.0,1724546,4.5
4,284035177,Pandora - Music & Radio,Music,130242560,0.0,1126879,4.0


In [6]:
apple_crop.shape

(7197, 7)

In [7]:
for c in apple_crop.columns:
    print(c, ':', '\t', apple_crop[c].dtype)

id : 	 int64
app_name : 	 object
genre : 	 object
size_bytes : 	 int64
price : 	 float64
rating_count : 	 int64
rating : 	 float64


In [8]:
for c in apple_crop.columns:
    print(c, '\n', apple_crop[c].isna().value_counts(), '\n')

id 
 False    7197
Name: id, dtype: int64 

app_name 
 False    7197
Name: app_name, dtype: int64 

genre 
 False    7197
Name: genre, dtype: int64 

size_bytes 
 False    7197
Name: size_bytes, dtype: int64 

price 
 False    7197
Name: price, dtype: int64 

rating_count 
 False    7197
Name: rating_count, dtype: int64 

rating 
 False    7197
Name: rating, dtype: int64 
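
The same check can be condensed into a single call: `DataFrame.isna().sum()` returns the number of NaN values per column. A small sketch on a made-up frame (the `toy` data is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing rating.
toy = pd.DataFrame({"app_name": ["A", "B"], "rating": [4.5, np.nan]})

# One NaN count per column, instead of one value_counts() per column.
nan_counts = toy.isna().sum()
print(nan_counts)
```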



In [9]:
apple_free = apple_crop.copy()[apple_crop['price'] == 0]
print(apple_free.shape)
apple_free.head()

(4056, 7)


Unnamed: 0,id,app_name,genre,size_bytes,price,rating_count,rating
0,284882215,Facebook,Social Networking,389879808,0.0,2974676,3.5
1,389801252,Instagram,Photo & Video,113954816,0.0,2161558,4.5
2,529479190,Clash of Clans,Games,116476928,0.0,2130805,4.5
3,420009108,Temple Run,Games,65921024,0.0,1724546,4.5
4,284035177,Pandora - Music & Radio,Music,130242560,0.0,1126879,4.0


In [10]:
apple_free.duplicated("app_name", keep=False).value_counts()

False    4052
True        4
dtype: int64

In [11]:
apple_free[apple_free.duplicated("app_name", keep=False)]

Unnamed: 0,id,app_name,genre,size_bytes,price,rating_count,rating
2948,1173990889,Mannequin Challenge,Games,109705216,0.0,668,3.0
4442,952877179,VR Roller Coaster,Games,169523200,0.0,107,3.5
4463,1178454060,Mannequin Challenge,Games,59572224,0.0,105,4.0
4831,1089824278,VR Roller Coaster,Games,240964608,0.0,67,3.5


In [12]:
apple_free.sort_values("rating_count", ascending=False, inplace=True)
apple_unique = apple_free.drop_duplicates(subset="app_name", keep="first", inplace=False, ignore_index=True)
apple_unique.duplicated("app_name", keep=False).value_counts()

False    4054
dtype: int64

In [13]:
apple_unique[apple_unique['app_name'] == "VR Roller Coaster"]

Unnamed: 0,id,app_name,genre,size_bytes,price,rating_count,rating
2625,952877179,VR Roller Coaster,Games,169523200,0.0,107,3.5
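
Sorting by `rating_count` before calling `drop_duplicates(keep="first")` is what makes the most-rated entry survive, as the VR Roller Coaster row above shows. An equivalent way to express the same choice, sketched here with the two duplicated pairs from the earlier output, is `groupby(...).idxmax()`. This is an alternative illustration, not what the notebook runs.

```python
import pandas as pd

# The four duplicated rows shown earlier (name and rating count only).
dupes = pd.DataFrame({
    "app_name": ["Mannequin Challenge", "VR Roller Coaster",
                 "Mannequin Challenge", "VR Roller Coaster"],
    "rating_count": [668, 107, 105, 67],
})

# For each app name, find the row label with the highest rating count.
best_rows = dupes.groupby("app_name")["rating_count"].idxmax()
kept = dupes.loc[best_rows].reset_index(drop=True)
print(kept)
```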


In [14]:
apple_final = apple_unique.copy()[["app_name", "genre", "size_bytes", "rating_count", "rating"]]
apple_final.head()

Unnamed: 0,app_name,genre,size_bytes,rating_count,rating
0,Facebook,Social Networking,389879808,2974676,3.5
1,Instagram,Photo & Video,113954816,2161558,4.5
2,Clash of Clans,Games,116476928,2130805,4.5
3,Temple Run,Games,65921024,1724546,4.5
4,Pandora - Music & Radio,Music,130242560,1126879,4.0


## GOOGLE

In [15]:
google_crop = google_store.copy()[["App", "Category", "Genres", "Size", "Price", "Installs", "Reviews", "Rating"]]
google_crop.columns = ["app_name", "category", "genre", "size_bytes", "price", "installs", "rating_count", "rating"]
google_crop.head()

Unnamed: 0,app_name,category,genre,size_bytes,price,installs,rating_count,rating
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,Art & Design,19M,0,"10,000+",159,4.1
1,Coloring book moana,ART_AND_DESIGN,Art & Design;Pretend Play,14M,0,"500,000+",967,3.9
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,Art & Design,8.7M,0,"5,000,000+",87510,4.7
3,Sketch - Draw & Paint,ART_AND_DESIGN,Art & Design,25M,0,"50,000,000+",215644,4.5
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,Art & Design;Creativity,2.8M,0,"100,000+",967,4.3


In [16]:
for c in google_crop.columns:
    print(c, ':', '\t', google_crop[c].dtype)

app_name : 	 object
category : 	 object
genre : 	 object
size_bytes : 	 object
price : 	 object
installs : 	 object
rating_count : 	 object
rating : 	 float64


In [17]:
try:
    google_crop['price'] = google_crop['price'].str.replace('$', '', regex=False).astype(float)
except ValueError:
    print("There's some price value which is not a price")

There's some price value which is not a price


In [18]:
google_store[google_store['Price'] == "Everyone"]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
10472,Life Made WI-Fi Touchscreen Photo Frame,1.9,19.0,3.0M,"1,000+",Free,0,Everyone,,"February 11, 2018",1.0.19,4.0 and up,


In [19]:
google_crop.loc[~google_crop['price'].str.startswith("$"), 'price'].value_counts()

0           10040
Everyone        1
Name: price, dtype: int64

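Row 10472 is malformed: its values are shifted one column to the left, so `Category` holds a rating and `Price` holds the content rating "Everyone". Since it is a single entry with a low rating count, the decision is to drop it.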

In [20]:
google_crop.drop(10472, axis=0, inplace=True)
try:
    google_crop['price'] = google_crop['price'].str.replace('$', '', regex=False).astype(float)
except ValueError:
    print("There's some price value which is not a price")
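
A try/except only reports that some value failed to convert. To see which values failed, `pd.to_numeric(..., errors="coerce")` turns unparseable entries into NaN so they can be selected directly. A minimal sketch on made-up prices (the `prices` Series is hypothetical):

```python
import pandas as pd

# Hypothetical price strings, including one value that is not a price.
prices = pd.Series(["0", "$4.99", "Everyone", "$0.99"])

# Strip the currency sign, then coerce anything non-numeric to NaN.
numeric = pd.to_numeric(prices.str.replace("$", "", regex=False), errors="coerce")

# The offending raw values are the ones that became NaN.
bad_values = prices[numeric.isna()]
print(bad_values)
```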

In [21]:
for c in google_crop.columns:
    print(c, ':', '\t', google_crop[c].dtype)

app_name : 	 object
category : 	 object
genre : 	 object
size_bytes : 	 object
price : 	 float64
installs : 	 object
rating_count : 	 object
rating : 	 float64


In [22]:
try:
    google_crop['rating_count'] = google_crop['rating_count'].astype(int)
except ValueError:
    print("There's some value which is not rating_count")

In [23]:
google_crop.loc[~google_crop['size_bytes'].str.endswith("M"), 'size_bytes'].value_counts()

Varies with device    1695
79k                      3
375k                     3
201k                     3
118k                     3
                      ... 
371k                     1
329k                     1
953k                     1
913k                     1
203k                     1
Name: size_bytes, Length: 279, dtype: int64

In [24]:
# def convert_round_sizes(siz):
#     if '.' not in siz:
#         return siz.replace('M', '000000').replace('k', '000')
#     elif '.0' in siz:
#         return siz.replace('.0M', '000000').replace('.0k', '000')
#     else:
#         return siz

In [25]:
# google_crop['size_bytes'] = google_crop['size_bytes'].apply(convert_round_sizes)
# google_crop.loc[google_crop['size_bytes'].str.contains('.', regex=False), 'size_bytes'].value_counts()

In [26]:
# google_crop['size_bytes'] = google_crop['size_bytes'].replace('Varies with device', np.nan).str.replace(r'\.(\d)', '{0}', regex=True)
# print(google_crop['size_bytes'].head(20))
# print(google_store['Size'].head(20))

In [27]:
# def convert_broken_sizes(siz):
#     if '.' in siz:
#         pattern = re.compile(r'\.(\d)')
#         decimal = re.search(pattern, siz).group(1)
#         converted = siz.replace(pattern, decimal)
#         converted = converted.replace("M", '00000').replace('k', '00')
#         return converted
#     else:
#         return siz

In [28]:
# def convert_broken_sizes(siz):
#     if '.' in siz:
#         converted = siz.replace('.', '').replace("M", '00000').replace('k', '00')
#         return converted
#     else:
#         return siz

In [29]:
# test = '031.34'
# pat = re.compile(r'\.(\d)')
# print(re.search(pat, test).group(1))

In [30]:
# google_crop['size_bytes'] = google_crop['size_bytes'].apply(convert_broken_sizes)
# google_crop['size_bytes'].head(20)

In [31]:
def convert_sizes(siz):
    # First, if there is no '.', replace M and k with the matching number of zeros.
    if '.' not in siz:
        return siz.replace('M', '000000').replace('k', '000')
    # Second, if the number is round but written as a float (e.g. '3.0M'), do the same as above.
    elif '.0' in siz:
        return siz.replace('.0M', '000000').replace('.0k', '000')
    # Third, if there is a non-zero decimal digit, keep it and pad with one zero fewer.
    # This assumes sizes have a single decimal digit.
    elif '.' in siz:
        return siz.replace('.', '').replace("M", '00000').replace('k', '00')
    # Lastly, if some case is still not covered, return the value unchanged.
    else:
        return siz

In [32]:
google_crop['size_bytes'] = google_crop['size_bytes'].apply(convert_sizes)
print(google_crop['size_bytes'].head(20))
print('\n')
print(google_store['Size'].head(20))

0     19000000
1     14000000
2      8700000
3     25000000
4      2800000
5      5600000
6     19000000
7     29000000
8     33000000
9      3100000
10    28000000
11    12000000
12    20000000
13    21000000
14    37000000
15     2700000
16     5500000
17    17000000
18    39000000
19    31000000
Name: size_bytes, dtype: object


0      19M
1      14M
2     8.7M
3      25M
4     2.8M
5     5.6M
6      19M
7      29M
8      33M
9     3.1M
10     28M
11     12M
12     20M
13     21M
14     37M
15    2.7M
16    5.5M
17     17M
18     39M
19     31M
Name: Size, dtype: object
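
`convert_sizes` works by string substitution, which relies on every size having at most one decimal digit. A sketch of a numeric alternative follows, under the same assumption that `M` means 10^6 bytes and `k` means 10^3 bytes. `parse_size` is a hypothetical helper, not part of this notebook, and it returns `None` for anything it cannot parse.

```python
import re

# Multipliers matching the decimal convention used by convert_sizes.
UNITS = {"M": 1_000_000, "k": 1_000}
SIZE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)([Mk])")


def parse_size(text):
    """Return the size in bytes as an int, or None if text is not '<number><M|k>'."""
    match = SIZE_PATTERN.fullmatch(text.strip())
    if match is None:
        return None
    number, unit = match.groups()
    # round() guards against floating-point rounding artifacts in the product.
    return round(float(number) * UNITS[unit])


for raw in ["19M", "8.7M", "201k", "Varies with device"]:
    print(raw, "->", parse_size(raw))
```

Because the multiplication is numeric, values with more than one decimal digit would be handled the same way as the rest.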


In [33]:
google_crop['size_bytes'] = google_crop['size_bytes'].str.replace('Varies with device', '0').astype(int)
print(google_crop['size_bytes'].describe())
print('\n')
print(google_crop['size_bytes'].value_counts(bins=10))

count    1.084000e+04
mean     1.815209e+07
std      2.217061e+07
min      0.000000e+00
25%      2.600000e+06
50%      9.200000e+06
75%      2.600000e+07
max      1.000000e+08
Name: size_bytes, dtype: float64


(-100000.001, 10000000.0]    5755
(10000000.0, 20000000.0]     1698
(20000000.0, 30000000.0]     1176
(30000000.0, 40000000.0]      656
(40000000.0, 50000000.0]      481
(50000000.0, 60000000.0]      338
(60000000.0, 70000000.0]      240
(90000000.0, 100000000.0]     205
(70000000.0, 80000000.0]      168
(80000000.0, 90000000.0]      123
Name: size_bytes, dtype: int64
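
Replacing "Varies with device" with `0` keeps the column integer, but those 1695 zeros are counted in the summary above (note that `min` is 0). A sketch of the alternative, assuming unknown sizes should be excluded from statistics, is to map them to NaN instead. The `sizes` Series is a made-up example.

```python
import pandas as pd

# Hypothetical sizes already converted to byte strings, plus one unknown.
sizes = pd.Series(["19000000", "8700000", "Varies with device"])

# Unknown sizes as 0: the zero is included in every statistic.
as_zero = sizes.replace("Varies with device", "0").astype(int)

# Unknown sizes as NaN: anything non-numeric is coerced to NaN,
# and pandas skips NaN in mean(), min(), describe(), and similar.
as_nan = pd.to_numeric(sizes, errors="coerce")

print(as_zero.mean(), as_nan.mean())
```

The trade-off is that the NaN version becomes a float column, so it can no longer be cast with `astype(int)`.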
