<h2>Import dependencies</h2>

In [474]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt

# Set some Pandas options
pd.set_option('display.notebook_repr_html', False)
pd.set_option('display.max_columns', 20)
pd.set_option('display.max_rows', 25)

<h2>Import data</h2>

We continue working on the cleaned version of the data.

In [475]:
apps_df = pd.read_csv("./data/googleplaystore.csv")
apps_df.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,7-Jan-18,1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,15-Jan-18,2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,1-Aug-18,1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,8-Jun-18,Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,20-Jun-18,1.1,4.4 and up


<h2>Explore Data Types</h2>

In [476]:
print("Shape of data (rows,columns): ",apps_df.shape)
print(apps_df.dtypes)

Shape of data (rows,columns):  (10841, 13)
App                object
Category           object
Rating            float64
Reviews            object
Size               object
Installs           object
Type               object
Price              object
Content Rating     object
Genres             object
Last Updated       object
Current Ver        object
Android Ver        object
dtype: object


All of the columns have dtype object except for Rating, which is float64. We will convert the columns that hold numeric data (Reviews, Size, Installs and Price) to numeric types and keep the others as strings.

<h3>Reviews</h3>

We first check whether all of the values are actually numeric.

In [477]:
apps_df.Reviews.str.isnumeric().sum()

10840

One of the 10841 values is non-numeric.

In [478]:
apps_df[~apps_df.Reviews.str.isnumeric()]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
10472,Life Made WI-Fi Touchscreen Photo Frame,1.9,19.0,3.0M,"1,000+",Free,0,Everyone,,11-Feb-18,1.0.19,4.0 and up,


This app has a rating of 19.0, which doesn't make sense. Its category, size and price attributes also hold implausible values, so we simply remove the entire row.

In [479]:
# .copy() avoids a SettingWithCopyWarning on the column assignments that follow
apps_df = apps_df[apps_df.Reviews.str.isnumeric()].copy()

Finally, we convert the Reviews column's type to numeric.

In [480]:
apps_df.Reviews=pd.to_numeric(apps_df.Reviews)
print(apps_df.dtypes)

App                object
Category           object
Rating            float64
Reviews             int64
Size               object
Installs           object
Type               object
Price              object
Content Rating     object
Genres             object
Last Updated       object
Current Ver        object
Android Ver        object
dtype: object
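
As an alternative to the `str.isnumeric()` check used above, the following is a minimal sketch based on `pd.to_numeric(..., errors="coerce")`, which turns unparseable entries into NaN so that they can be located before any rows are dropped. The toy Series `reviews` is made up for illustration and is not taken from the dataset.

```python
import pandas as pd

# Toy data (hypothetical values): one entry is not a plain integer string
reviews = pd.Series(["159", "967", "3.0M", "87510"])

# errors="coerce" converts anything unparseable to NaN instead of raising
parsed = pd.to_numeric(reviews, errors="coerce")

# Entries that failed to parse are the ones that became NaN
bad_rows = reviews[parsed.isna()]
print(bad_rows)

# Keep only the parseable entries and cast them to integers
clean_reviews = parsed.dropna().astype("int64")
print(clean_reviews.dtype)
```

This variant also flags values such as negative numbers or decimals differently from `str.isnumeric()`, so the two checks are not strictly interchangeable.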


<h3>Size</h3>

The size attribute should also be numeric. 

In [481]:
apps_df.Size.value_counts()

Varies with device    1695
11M                    198
12M                    196
14M                    194
13M                    191
                      ... 
157k                     1
485k                     1
232k                     1
696k                     1
924k                     1
Name: Size, Length: 461, dtype: int64

Most of the values end with either the suffix M or the suffix k. We can replace these with the exponents e+6 and e+3 respectively (that is, multiply by 10^6 and 10^3) so that the strings can be parsed as numbers.

In [482]:
# regex=False treats 'k' and 'M' as literal characters
apps_df.Size=apps_df.Size.str.replace('k','e+3',regex=False)
apps_df.Size=apps_df.Size.str.replace('M','e+6',regex=False)
apps_df.Size.head()

0     19e+6
1     14e+6
2    8.7e+6
3     25e+6
4    2.8e+6
Name: Size, dtype: object

We now make sure that all of the values are numeric before converting the entire column.

In [483]:
def is_numeric(v):
    """Return True if v can be parsed as a float."""
    try:
        float(v)
        return True
    except (ValueError, TypeError):
        return False

temp=apps_df.Size.apply(is_numeric)
apps_df.Size[~temp].value_counts()

Varies with device    1695
Name: Size, dtype: int64

There are many rows (1695) whose size attribute has the value "Varies with device". We will simply change all of them to NaN.

In [484]:
apps_df.Size=apps_df.Size.replace('Varies with device',np.nan)

Finally, we convert the column's type to numeric.

In [485]:
apps_df.Size=pd.to_numeric(apps_df.Size)
print(apps_df.dtypes)

App                object
Category           object
Rating            float64
Reviews             int64
Size              float64
Installs           object
Type               object
Price              object
Content Rating     object
Genres             object
Last Updated       object
Current Ver        object
Android Ver        object
dtype: object
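
The same conversion can also be sketched as a single explicit function instead of string replacement followed by `pd.to_numeric`. The helper `size_to_number` and the `MULTIPLIERS` table below are hypothetical names introduced for illustration, and the toy Series only mimics the formats seen in the Size column.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) mirroring the formats seen in the Size column
sizes = pd.Series(["19M", "8.7M", "157k", "Varies with device"])

# Multiplier for each suffix, following the 10^3 / 10^6 convention used above
MULTIPLIERS = {"k": 1e3, "M": 1e6}

def size_to_number(value):
    """Return the size as a float, or NaN when no numeric size is given."""
    suffix = value[-1]
    if suffix in MULTIPLIERS:
        return float(value[:-1]) * MULTIPLIERS[suffix]
    # Anything without a known suffix (e.g. "Varies with device") becomes NaN
    return np.nan

size_numeric = sizes.apply(size_to_number)
print(size_numeric)
```

Keeping the suffix table in one place makes it easy to extend if other units were to appear, at the cost of being slower than vectorized string methods on large columns.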


<h3>Installs</h3>

We first check the unique values of this column.

In [486]:
apps_df.Installs.value_counts()

1,000,000+        1579
10,000,000+       1252
100,000+          1169
10,000+           1054
1,000+             907
5,000,000+         752
100+               719
500,000+           539
50,000+            479
5,000+             477
100,000,000+       409
10+                386
500+               330
50,000,000+        289
50+                205
5+                  82
500,000,000+        72
1+                  67
1,000,000,000+      58
0+                  14
0                    1
Name: Installs, dtype: int64

All of the values are either plain numbers or numbers followed by a '+' sign, with commas as thousands separators. We can convert them by removing the '+' signs and the commas.

In [487]:
apps_df.Installs=apps_df.Installs.apply(lambda x: x.strip('+'))
apps_df.Installs=apps_df.Installs.apply(lambda x: x.replace(',',''))
apps_df.Installs.value_counts()

1000000       1579
10000000      1252
100000        1169
10000         1054
1000           907
5000000        752
100            719
500000         539
50000          479
5000           477
100000000      409
10             386
500            330
50000000       289
50             205
5               82
500000000       72
1               67
1000000000      58
0               15
Name: Installs, dtype: int64

Now that everything seems to be in order, we can safely convert this column's type to numeric.

In [488]:
apps_df.Installs=pd.to_numeric(apps_df.Installs)
print(apps_df.dtypes)

App                object
Category           object
Rating            float64
Reviews             int64
Size              float64
Installs            int64
Type               object
Price              object
Content Rating     object
Genres             object
Last Updated       object
Current Ver        object
Android Ver        object
dtype: object
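
The two `apply` calls above can also be written with pandas' vectorized string methods. The following is a minimal sketch on made-up values, assuming that every entry consists only of digits, commas and an optional trailing '+'.

```python
import pandas as pd

# Toy data (hypothetical values) in the same format as the Installs column
installs = pd.Series(["1,000,000+", "500+", "0+", "0"])

# Remove every character that is not a digit, then convert to integers
cleaned = installs.str.replace(r"[^0-9]", "", regex=True)
installs_num = pd.to_numeric(cleaned)
print(installs_num.tolist())
```

Note that this pattern would silently strip any unexpected characters as well, so it is worth inspecting `value_counts()` first, as done above.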


<h3>Price</h3>

We first get a feel for the format of this column.

In [489]:
apps_df.Price.value_counts()

0           10040
$0.99         148
$2.99         129
$1.99          73
$4.99          72
            ...  
$4.29           1
$379.99         1
$1.76           1
$3.02           1
$389.99         1
Name: Price, Length: 92, dtype: int64

All of the values are either 0 or are numbers prefixed with the dollar sign. We will begin by removing the dollar signs.

In [490]:
apps_df.Price=apps_df.Price.apply(lambda x: x.strip('$'))
apps_df.Price.value_counts()

0          10040
0.99         148
2.99         129
1.99          73
4.99          72
           ...  
200.00         1
25.99          1
2.95           1
3.90           1
4.29           1
Name: Price, Length: 92, dtype: int64

Finally, we make sure that all of the values are numeric, then convert the column's dtype.

In [491]:
temp=apps_df.Price.apply(lambda x: is_numeric(x))
apps_df.Price[~temp].value_counts()

Series([], Name: Price, dtype: int64)

In [492]:
apps_df.Price=pd.to_numeric(apps_df.Price)
print(apps_df.dtypes)

App                object
Category           object
Rating            float64
Reviews             int64
Size              float64
Installs            int64
Type               object
Price             float64
Content Rating     object
Genres             object
Last Updated       object
Current Ver        object
Android Ver        object
dtype: object
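
For comparison, here is a minimal sketch of the same Price clean-up using `str.lstrip`, which removes the dollar sign only from the start of each string. The toy Series is made up for illustration.

```python
import pandas as pd

# Toy data (hypothetical values) in the same format as the Price column
prices = pd.Series(["0", "$0.99", "$379.99", "$4.29"])

# lstrip("$") only touches leading dollar signs; "0" is left unchanged
prices_num = pd.to_numeric(prices.str.lstrip("$"))
print(prices_num)
```

Since free apps are stored as "0", the resulting column mixes integers and decimals, so pandas stores it as float64.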


<h2>Finish</h2>

Finally, we save the dataframe into a new CSV file.

In [507]:
apps_df.to_csv("./data/googleplaystore_clean.csv",index=False)
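
To confirm that the numeric dtypes survive the round trip through CSV, one could reload the file and inspect the dtypes again. The sketch below uses a small made-up frame and an in-memory buffer instead of a file on disk; the column values are hypothetical.

```python
import io

import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with the dtypes produced by the cleaning steps
df = pd.DataFrame({
    "App": ["A", "B"],
    "Reviews": [159, 967],
    "Size": [19e6, np.nan],
    "Price": [0.0, 0.99],
})

# Write to an in-memory buffer rather than a file on disk
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

# Reload and check that the numeric columns are still numeric
reloaded = pd.read_csv(buffer)
print(reloaded.dtypes)
```

Missing sizes are written as empty fields and read back as NaN, so the Size column stays float64.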