# Team 2 - Google Play Store

![](https://www.brandnol.com/wp-content/uploads/2019/04/Google-Play-Store-Search.jpg)

_For more information about the dataset, read [here](https://www.kaggle.com/lava18/google-play-store-apps)._

## Your tasks
- Name your team!
- Read the source and do some quick research to understand more about the dataset and its topic
- Clean the data
- Perform Exploratory Data Analysis on the dataset
- Analyze the data more deeply and extract insights
- Visualize your analysis on Google Data Studio
- Present your work in front of the class and guests next Monday

## Submission Guide
- Create a GitHub repository for your project
- Upload the dataset (.csv file) and the Jupyter Notebook to your GitHub repository. In the Jupyter Notebook, **include the link to your Google Data Studio report**.
- Submit your work through this [Google Form](https://forms.gle/oxtXpGfS8JapVj3V8).

## Tips for Data Cleaning, Manipulation & Visualization
- See [our tips for Data Cleaning, Manipulation & Visualization](https://hackmd.io/cBNV7E6TT2WMliQC-GTw1A).

_____________________________

## Some Hints for This Dataset:
- There are lots of null values. How should we handle them?
- The `Installs` and `Size` columns contain some strange values. Can you identify them?
- Values in the `Size` column come in different formats (`M`, `k`). And what about the value `Varies with device`?
- The `Price` column is not stored in the right data type
- And more...


In [1]:
# Start your code here!
%matplotlib inline
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

## RESEARCH 

In [34]:
df = pd.read_csv('google-play-store.csv')

In [35]:
df.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


In [36]:
df.shape

(10841, 13)

In [37]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10841 non-null  object 
 1   Category        10841 non-null  object 
 2   Rating          9367 non-null   float64
 3   Reviews         10841 non-null  object 
 4   Size            10841 non-null  object 
 5   Installs        10841 non-null  object 
 6   Type            10840 non-null  object 
 7   Price           10841 non-null  object 
 8   Content Rating  10840 non-null  object 
 9   Genres          10841 non-null  object 
 10  Last Updated    10841 non-null  object 
 11  Current Ver     10833 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(1), object(12)
memory usage: 1.1+ MB


### NULL

In [38]:
print("Does the data contain NULL values?")
df.isnull().values.any()

Does the data contain NULL values?


True

In [39]:
print('Number of null values in each column: ')
for col in df.columns:
    print(col, ': ', df[col].isnull().sum())

Number of null values in each column: 
App :  0
Category :  0
Rating :  1474
Reviews :  0
Size :  0
Installs :  0
Type :  1
Price :  0
Content Rating :  1
Genres :  0
Last Updated :  0
Current Ver :  8
Android Ver :  3
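
The per-column loop above can also be written as a single pandas call. The following is a minimal sketch on a tiny stand-in frame; the `sample` values are hypothetical and only mimic the kind of gaps seen in the real data.

```python
import pandas as pd

# Hypothetical stand-in for the real frame, with gaps in Rating and Type
sample = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Rating": [4.1, None, None],
    "Type": ["Free", None, "Paid"],
})

# Count nulls per column in one call and list the worst columns first
null_counts = sample.isnull().sum().sort_values(ascending=False)
print(null_counts)
```

Sorting the counts puts the columns that need the most attention (here `Rating`) at the top.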


In [40]:
print('Apps that contain NULL data: ')
df[df.isnull().any(axis=1)]
# df[df['Android Ver'].isnull()]

Apps that contain NULL data: 


Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
15,Learn To Draw Kawaii Characters,ART_AND_DESIGN,3.2,55,2.7M,"5,000+",Free,0,Everyone,Art & Design,"June 6, 2018",,4.2 and up
23,Mcqueen Coloring pages,ART_AND_DESIGN,,61,7.0M,"100,000+",Free,0,Everyone,Art & Design;Action & Adventure,"March 7, 2018",1.0.0,4.1 and up
113,Wrinkles and rejuvenation,BEAUTY,,182,5.7M,"100,000+",Free,0,Everyone 10+,Beauty,"September 20, 2017",8.0,3.0 and up
123,Manicure - nail design,BEAUTY,,119,3.7M,"50,000+",Free,0,Everyone,Beauty,"July 23, 2018",1.3,4.1 and up
126,Skin Care and Natural Beauty,BEAUTY,,654,7.4M,"100,000+",Free,0,Teen,Beauty,"July 17, 2018",1.15,4.1 and up
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10824,Cardio-FR,MEDICAL,,67,82M,"10,000+",Free,0,Everyone,Medical,"July 31, 2018",2.2.2,4.4 and up
10825,Naruto & Boruto FR,SOCIAL,,7,7.7M,100+,Free,0,Teen,Social,"February 2, 2018",1.0,4.0 and up
10831,payermonstationnement.fr,MAPS_AND_NAVIGATION,,38,9.8M,"5,000+",Free,0,Everyone,Maps & Navigation,"June 13, 2018",2.0.148.0,4.0 and up
10835,FR Forms,BUSINESS,,0,9.6M,10+,Free,0,Everyone,Business,"September 29, 2016",1.1.5,4.0 and up


### Duplication

In [41]:
print('Shape df: ', df.shape)
print('Check for duplicate apps')
# df.count(0)['App']
df['App'].nunique()

Shape df:  (10841, 13)
Check for duplicate apps


9660
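
With 10,841 rows but only 9,660 unique app names, 1,181 rows repeat a name that already appeared. A possible way to locate and collapse such rows is sketched below. The two Minecraft review counts are taken from the sorted output later in this notebook; the "keep the row with the most reviews" policy is an assumption, not something the notebook applies.

```python
import pandas as pd

# Small stand-in frame; the duplicated app mirrors the Minecraft rows in the data
sample = pd.DataFrame({
    "App": ["Minecraft", "Minecraft", "Sketch - Draw & Paint"],
    "Reviews": [2376564, 2375336, 215644],
})

# Rows whose App name already appeared earlier in the frame
dupe_mask = sample.duplicated(subset="App", keep="first")
n_duplicates = int(dupe_mask.sum())

# Assumed policy: keep the row with the most reviews for each app
deduped = (
    sample.sort_values("Reviews", ascending=False)
          .drop_duplicates(subset="App", keep="first")
)
print(n_duplicates, len(deduped))
```

Deduplicating before ranking avoids the same app appearing twice in a "top N" table.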

#### Category

In [64]:
print('NULL value', df['Category'].isnull().values.any())

df['Category'].value_counts(sort= True, dropna= False)

NULL value False


FAMILY                 1972
GAME                   1144
TOOLS                   843
MEDICAL                 463
BUSINESS                460
PRODUCTIVITY            424
PERSONALIZATION         392
COMMUNICATION           387
SPORTS                  384
LIFESTYLE               382
FINANCE                 366
HEALTH_AND_FITNESS      341
PHOTOGRAPHY             335
SOCIAL                  295
NEWS_AND_MAGAZINES      283
SHOPPING                260
TRAVEL_AND_LOCAL        258
DATING                  234
BOOKS_AND_REFERENCE     231
VIDEO_PLAYERS           175
EDUCATION               156
ENTERTAINMENT           149
MAPS_AND_NAVIGATION     137
FOOD_AND_DRINK          127
HOUSE_AND_HOME           88
LIBRARIES_AND_DEMO       85
AUTO_AND_VEHICLES        85
WEATHER                  82
ART_AND_DESIGN           65
EVENTS                   64
PARENTING                60
COMICS                   60
BEAUTY                   53
Name: Category, dtype: int64

#### Rating

In [43]:
print('NULL value', df['Rating'].isnull().values.any())
df.loc[(df['Rating']>5.0) | (df['Rating']<0)]

NULL value True


Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
10472,Life Made WI-Fi Touchscreen Photo Frame,1.9,19.0,3.0M,"1,000+",Free,0,Everyone,,"February 11, 2018",1.0.19,4.0 and up,


#### Reviews

In [44]:
df.loc[~df['Reviews'].str.contains(r'\d+')]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver


#### Size

In [45]:
# print(df.loc[~df['Size'].str.contains(r'\d+\.*\d+(M|k|m|K)')])
print(df['Size'].loc[~df['Size'].str.contains(r'\d+\.*\d+(M|k|m|K)')].value_counts())
df.loc[df['Size'] == '1,000+']

Varies with device    1695
1,000+                   1
Name: Size, dtype: int64




Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
10472,Life Made WI-Fi Touchscreen Photo Frame,1.9,19.0,3.0M,"1,000+",Free,0,Everyone,,"February 11, 2018",1.0.19,4.0 and up,


#### Installs

In [46]:
print('NULL value', df['Installs'].isnull().values.any())
print(df.loc[df['Installs'].str.contains(r'^\d{1,3}(,\d{3})*$')])
df.loc[~df['Installs'].str.contains(r'^\d{1,3}(,\d{3})*\+*')]

NULL value False
                            App Category  Rating Reviews                Size  \
9148  Command & Conquer: Rivals   FAMILY     NaN       0  Varies with device   

     Installs Type Price Content Rating    Genres   Last Updated  \
9148        0  NaN     0   Everyone 10+  Strategy  June 28, 2018   

             Current Ver         Android Ver  
9148  Varies with device  Varies with device  




Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
10472,Life Made WI-Fi Touchscreen Photo Frame,1.9,19.0,3.0M,"1,000+",Free,0,Everyone,,"February 11, 2018",1.0.19,4.0 and up,


#### Type

In [65]:
print('NULL value', df['Type'].isnull().values.any())

print(df['Type'].value_counts(dropna= False))
df.loc[df['Type'] == '0']

NULL value True
Free    10039
Paid      800
NaN         1
Name: Type, dtype: int64


Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver


#### Price

In [32]:
print('NULL value', df['Price'].isnull().values.any())

df.loc[~df['Price'].str.contains(r'(\${0,1}\d+)')]

# df.loc[df['Price'].str.contains(r'^\d')]

NULL value False




Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver,Category_int,Type_int,Content_Rating_int,Genres_int


#### Content Rating

In [66]:
print('NULL value', df['Content Rating'].isnull().values.any())

print(df['Content Rating'].value_counts(dropna= False))
print(df.loc[df['Content Rating'].isnull()])
df.loc[df['Content Rating'] == 'Unrated']

NULL value False
Everyone           8714
Teen               1208
Mature 17+          499
Everyone 10+        414
Adults only 18+       3
Unrated               2
Name: Content Rating, dtype: int64
Empty DataFrame
Columns: [App, Category, Rating, Reviews, Size, Installs, Type, Price, Content Rating, Genres, Last Updated, Current Ver, Android Ver]
Index: []


Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
7312,Best CG Photography,FAMILY,,1,2.5M,500+,Free,0,Unrated,Entertainment,"June 24, 2015",5.2,3.0 and up
8266,DC Universe Online Map,TOOLS,4.1,1186,6.4M,"50,000+",Free,0,Unrated,Tools,"February 27, 2012",1.3,2.3.3 and up


#### Genres

In [67]:
print('NULL value', df['Genres'].isnull().values.any())

print(df['Genres'].value_counts(dropna= False))
df.loc[df['Genres'].str.contains(r'.*\d+.*')]

NULL value False
Tools                       842
Entertainment               623
Education                   549
Medical                     463
Business                    460
                           ... 
Comics;Creativity             1
Lifestyle;Pretend Play        1
Card;Brain Games              1
Entertainment;Education       1
Communication;Creativity      1
Name: Genres, Length: 119, dtype: int64


Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver


#### Last Updated

In [51]:
print('NULL value', df['Last Updated'].isnull().values.any())

# df.loc[~(df['Last Updated'].str.contains(r'(^\d{1,2}-\w{3}-d{2})') | df['Last Updated'].str.contains(r'(^\w+)') )]
df.loc[~df['Last Updated'].str.contains(r'(^\w+\s\d{1,2},\s\d{4})')]
# February 11, 2018

NULL value False




Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
10472,Life Made WI-Fi Touchscreen Photo Frame,1.9,19.0,3.0M,"1,000+",Free,0,Everyone,,"February 11, 2018",1.0.19,4.0 and up,
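
Instead of a regular expression, the date format can be checked by attempting to parse it. This is a rough sketch, assuming every valid entry follows the `Month D, YYYY` pattern seen in the data; the third value, `1.0.19`, is what row 10472 holds in `Last Updated` because its fields are shifted.

```python
import pandas as pd

# Two well-formed dates and the shifted value from row 10472
dates = pd.Series(["January 7, 2018", "June 20, 2018", "1.0.19"])

# errors="coerce" turns anything that does not match the format into NaT
parsed = pd.to_datetime(dates, format="%B %d, %Y", errors="coerce")

# Original strings that failed to parse
bad_rows = dates[parsed.isna()]
print(bad_rows.tolist())
```

The parsed column is also directly usable for time-based analysis, which a regex check alone does not provide.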


#### Current Ver

In [52]:
print('NULL value', df['Current Ver'].isnull().values.any())
# cur = df.loc[~df['Current Ver'].isnull()]
df['Current Ver'].loc[~df['Current Ver'].str.contains(r'\d+(\.\d+)*', na = False)].value_counts()

NULL value True




Varies with device            1459
Final                            2
BlueOrange                       1
Initial                          1
Natalia Studio Development       1
HTTPs                            1
closed                           1
Copyright                        1
Gratis                           1
Human Dx                         1
DH-Security Camera               1
Public.Heal                      1
MONEY                            1
App copyright                    1
newversion                       1
opción de cerrar                 1
KM                               1
Name: Current Ver, dtype: int64

#### Android Ver

In [53]:
print('NULL value', df['Android Ver'].isnull().values.any())
df['Android Ver'].loc[~df['Android Ver'].str.contains(r'\d+(\.\d+)*( and up)', na = False)].value_counts()

NULL value True


Varies with device    1362
4.4W and up             12
4.0.3 - 7.1.1            2
5.0 - 8.0                2
4.1 - 7.1.1              1
5.0 - 6.0                1
7.0 - 7.1.1              1
2.2 - 7.1.1              1
5.0 - 7.1.1              1
Name: Android Ver, dtype: int64
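
Most of the non-matching values above still start with a usable version number (`4.4W and up`, `5.0 - 8.0`). One way to pull out the minimum supported version is sketched here; the helper name `min_android_version` is hypothetical, and it expects a string, so NaN entries would need to be handled before calling it.

```python
import re

def min_android_version(text):
    """Return the leading dotted version number as a string, or None if absent."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    return match.group(0) if match else None

for value in ["4.0.3 and up", "4.4W and up", "5.0 - 8.0", "Varies with device"]:
    print(value, "->", min_android_version(value))
```

Returning `None` for `Varies with device` keeps the "unknown" case explicit instead of mixing it with real versions.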

## CLEAN DATA

In [2]:
df = pd.read_csv('google-play-store.csv')
print(df.iloc[10472])

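# Drop the row with many erroneous values (its fields are shifted one column to the left)

df = df.drop(10472, axis = 0)
# reset_index returns a new DataFrame; the result is only displayed here, so df keeps its original index labels
df.reset_index(drop = True)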

App               Life Made WI-Fi Touchscreen Photo Frame
Category                                              1.9
Rating                                                 19
Reviews                                              3.0M
Size                                               1,000+
Installs                                             Free
Type                                                    0
Price                                            Everyone
Content Rating                                        NaN
Genres                                  February 11, 2018
Last Updated                                       1.0.19
Current Ver                                    4.0 and up
Android Ver                                           NaN
Name: 10472, dtype: object


Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10835,Sya9a Maroc - FR,FAMILY,4.5,38,53M,"5,000+",Free,0,Everyone,Education,"July 25, 2017",1.48,4.1 and up
10836,Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4,3.6M,100+,Free,0,Everyone,Education,"July 6, 2018",1.0,4.1 and up
10837,Parkinson Exercices FR,MEDICAL,,3,9.5M,"1,000+",Free,0,Everyone,Medical,"January 20, 2017",1.0,2.2 and up
10838,The SCP Foundation DB fr nn5n,BOOKS_AND_REFERENCE,4.5,114,Varies with device,"1,000+",Free,0,Mature 17+,Books & Reference,"January 19, 2015",Varies with device,Varies with device


In [3]:
print('Shape: ', df.shape)
for col in df.columns:
    print(col, ': ', df[col].isnull().sum())
    

Shape:  (10840, 13)
App :  0
Category :  0
Rating :  1474
Reviews :  0
Size :  0
Installs :  0
Type :  1
Price :  0
Content Rating :  0
Genres :  0
Last Updated :  0
Current Ver :  8
Android Ver :  2


In [4]:
df.index[df['Type'].isnull()].tolist()

[9148]

#### NULL Type

In [5]:
print(df['Type'].value_counts(dropna= False))
df.loc[df['Type'].isnull()]

Free    10039
Paid      800
NaN         1
Name: Type, dtype: int64


Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
9148,Command & Conquer: Rivals,FAMILY,,0,Varies with device,0,,0,Everyone 10+,Strategy,"June 28, 2018",Varies with device,Varies with device


In [6]:
def checkTypePrice():
    errorL = []
    for index, row in df.iterrows():
        if row['Type']== "Free":
            if row['Price'] !='0':
                print('Error Free')
                errorL.append(index)
        elif row['Type'] == "Paid":
            if row['Price'] =='0':
                print('Error Paid')
                errorL.append(index)
        else:
            print('NaN', row['Price'])
            errorL.append(index)
    return errorL

errorL = checkTypePrice()
print(errorL)

NaN 0
[9148]
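
The row-by-row check above can be expressed with boolean masks, which avoids `iterrows()` on roughly ten thousand rows. This is a minimal sketch on hypothetical sample rows, assuming `Price` is still stored as a string at this point.

```python
import pandas as pd

# Hypothetical rows: one consistent Free, one consistent Paid,
# one missing Type, and one Paid app priced at 0
sample = pd.DataFrame({
    "Type": ["Free", "Paid", None, "Paid"],
    "Price": ["0", "$4.99", "0", "0"],
})

is_zero = sample["Price"] == "0"
# Consistent rows: Free with price 0, or Paid with a non-zero price
consistent = ((sample["Type"] == "Free") & is_zero) | ((sample["Type"] == "Paid") & ~is_zero)

# Index labels of rows that break the rule (missing Type counts as an error)
error_index = sample.index[~consistent].tolist()
print(error_index)
```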


In [7]:
# The only row with a missing Type (index 9148) has Price 0, so treat it as Free.
# A single .loc call avoids chained assignment and the SettingWithCopyWarning.
df.loc[df['Type'].isnull(), 'Type'] = 'Free'


In [8]:
df['Type'].value_counts(dropna= False)

Free    10040
Paid      800
Name: Type, dtype: int64

In [9]:
print('Shape: ', df.shape)
for col in df.columns:
    print(col, ': ', df[col].isnull().sum())

Shape:  (10840, 13)
App :  0
Category :  0
Rating :  1474
Reviews :  0
Size :  0
Installs :  0
Type :  0
Price :  0
Content Rating :  0
Genres :  0
Last Updated :  0
Current Ver :  8
Android Ver :  2


#### NULL Current Ver, Android Ver

In [10]:
df = df[(~df['Current Ver'].isnull()) &  (~df['Android Ver'].isnull())]
df

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10836,Sya9a Maroc - FR,FAMILY,4.5,38,53M,"5,000+",Free,0,Everyone,Education,"July 25, 2017",1.48,4.1 and up
10837,Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4,3.6M,100+,Free,0,Everyone,Education,"July 6, 2018",1.0,4.1 and up
10838,Parkinson Exercices FR,MEDICAL,,3,9.5M,"1,000+",Free,0,Everyone,Medical,"January 20, 2017",1.0,2.2 and up
10839,The SCP Foundation DB fr nn5n,BOOKS_AND_REFERENCE,4.5,114,Varies with device,"1,000+",Free,0,Mature 17+,Books & Reference,"January 19, 2015",Varies with device,Varies with device


In [11]:
print('Shape: ', df.shape)
for col in df.columns:
    print(col, ': ', df[col].isnull().sum())

Shape:  (10830, 13)
App :  0
Category :  0
Rating :  1470
Reviews :  0
Size :  0
Installs :  0
Type :  0
Price :  0
Content Rating :  0
Genres :  0
Last Updated :  0
Current Ver :  0
Android Ver :  0


In [12]:
# Categories
# print(df['Category'].value_counts())
cateVal = df['Category'].unique()
cateDict = dict()
for i in range(0, len(cateVal)):
    cateDict[cateVal[i]] = i
df['Category_int'] = df['Category'].map(cateDict).astype(int)

# Type
# print(df['Type'].value_counts())
TypeVal = df['Type'].unique()
TypeDict = dict()
for i in range(0, len(TypeVal)):
    TypeDict[TypeVal[i]] = i
df['Type_int'] = df['Type'].map(TypeDict).astype(int)

# Content Rating
# print(df['Content Rating'].value_counts())
# df.loc[df['Content Rating'].isnull()]
Content_RatingVal = df['Content Rating'].unique()
# print(Content_RatingVal)
Content_RatingDict = dict()
for i in range(0, len(Content_RatingVal)):
    Content_RatingDict[Content_RatingVal[i]] = i
df['Content_Rating_int'] = df['Content Rating'].map(Content_RatingDict).astype(int)


# Genres
# print(df['Genres'].value_counts())
GenresVal = df['Genres'].unique()
GenresDict = dict()
for i in range(0, len(GenresVal)):
    GenresDict[GenresVal[i]] = i
df['Genres_int'] = df['Genres'].map(GenresDict).astype(int)
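
Each of the four blocks above builds a value-to-integer dictionary in order of first appearance. `pd.factorize` produces the same kind of mapping in one call; a small sketch on made-up rows follows. Note that the resulting integers are arbitrary labels, so they carry no ordering.

```python
import pandas as pd

# Hypothetical rows using category names that occur in the dataset
sample = pd.DataFrame({
    "Category": ["ART_AND_DESIGN", "BEAUTY", "ART_AND_DESIGN", "FAMILY"],
})

# codes: integer label per row; uniques: the values in order of first appearance
codes, uniques = pd.factorize(sample["Category"])
sample["Category_int"] = codes
print(sample)
```

The same two lines could be repeated for `Type`, `Content Rating` and `Genres`.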


In [28]:
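# Size
# Note: the multipliers below are 10,000,000 for 'M' and 100,023 for 'k', not 1,000,000 and 1,000,
# so the resulting Size values (as seen in later outputs) are not true byte counts.
def convertSize(size):
    if ('M' in size) or ('m' in size):
        return float(size[:-1])*10000000
    if ('k' in size) or ('K' in size):
        return float(size[:-1])*100023
    # 'Varies with device' has no numeric size; -1 is used as a sentinel
    return -1
df["Size"] = df['Size'].map(convertSize)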
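
The multipliers used in `convertSize` (10000000 for `M`, 100023 for `k`) do not correspond to standard units, and the `-1` sentinel is treated as a real size by later correlations. A possible alternative is sketched below. It assumes decimal units (1M = 1,000,000 bytes, 1k = 1,000 bytes) and uses NaN for unknown sizes; the function name `size_to_bytes` and the `201k` input are hypothetical.

```python
import math

def size_to_bytes(size):
    """Convert strings such as '19M' or '201k' to bytes; anything else becomes NaN."""
    units = {"M": 1_000_000, "K": 1_000}  # decimal units are an assumption
    suffix = size[-1:].upper()
    if suffix in units:
        try:
            return float(size[:-1]) * units[suffix]
        except ValueError:
            return math.nan
    # 'Varies with device' and malformed values carry no numeric size
    return math.nan

for value in ["19M", "201k", "Varies with device"]:
    print(value, "->", size_to_bytes(value))
```

With NaN in place of `-1`, pandas skips unknown sizes when computing means and correlations instead of counting them as tiny apps.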



In [53]:
#Price
import re
def convertPrice(price):
    price1 = re.compile(r'^\d.*')
    if price == '0':
        return 0
    elif price1.search(price):
#         print('1: ',price)
        return float(price)
    else:
#         print('2: ',price[1:])
        return float(price[1:])
    
df['Price'] = df['Price'].map(convertPrice)
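
Since the only non-numeric character in `Price` appears to be a leading `$`, the three branches above can be collapsed into one. A minimal sketch, with the helper name `price_to_float` being hypothetical:

```python
def price_to_float(price):
    """Strip an optional leading '$' and convert the rest to float."""
    return float(price.lstrip("$"))

for value in ["0", "$4.99", "399.99"]:
    print(value, "->", price_to_float(value))

# A vectorized pandas equivalent would be along the lines of:
#   df['Price'].str.lstrip('$').astype(float)
```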


In [54]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10830 entries, 0 to 10840
Data columns (total 17 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   App                 10830 non-null  object 
 1   Category            10830 non-null  object 
 2   Rating              10830 non-null  float64
 3   Reviews             10830 non-null  int64  
 4   Size                10830 non-null  float64
 5   Installs            10830 non-null  int64  
 6   Type                10830 non-null  object 
 7   Price               10830 non-null  float64
 8   Content Rating      10830 non-null  object 
 9   Genres              10830 non-null  object 
 10  Last Updated        10830 non-null  object 
 11  Current Ver         10830 non-null  object 
 12  Android Ver         10830 non-null  object 
 13  Category_int        10830 non-null  int32  
 14  Type_int            10830 non-null  int32  
 15  Content_Rating_int  10830 non-null  int32  
 16  Genr

In [14]:
# Installs
df['Installs'] = [int(i[:-1].replace(',','')) if i != '0' else 0 for i in df['Installs']]
print(df['Installs'].loc[9148])


0
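
The list comprehension above needs a special case for `'0'` because it always drops the last character. Stripping only a trailing `+` handles both forms uniformly; this is a small sketch, and the helper name `installs_to_int` is hypothetical.

```python
def installs_to_int(text):
    """Turn '10,000+' into 10000; a bare '0' stays 0."""
    return int(text.rstrip("+").replace(",", ""))

for value in ["10,000+", "1,000,000+", "0"]:
    print(value, "->", installs_to_int(value))
```

Keep in mind that the result is the lower bound of an install bucket, not an exact count.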


In [15]:
df['Reviews'] = [int(i) for i in df['Reviews']]

In [29]:
df.reset_index(drop = True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10830 entries, 0 to 10840
Data columns (total 17 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   App                 10830 non-null  object 
 1   Category            10830 non-null  object 
 2   Rating              10830 non-null  float64
 3   Reviews             10830 non-null  int64  
 4   Size                10830 non-null  float64
 5   Installs            10830 non-null  int64  
 6   Type                10830 non-null  object 
 7   Price               10830 non-null  object 
 8   Content Rating      10830 non-null  object 
 9   Genres              10830 non-null  object 
 10  Last Updated        10830 non-null  object 
 11  Current Ver         10830 non-null  object 
 12  Android Ver         10830 non-null  object 
 13  Category_int        10830 non-null  int32  
 14  Type_int            10830 non-null  int32  
 15  Content_Rating_int  10830 non-null  int32  
 16  Genr

In [17]:
print('Shape: ', df.shape)
for col in df.columns:
    print(col, ': ', df[col].isnull().sum())

Shape:  (10830, 17)
App :  0
Category :  0
Rating :  1470
Reviews :  0
Size :  0
Installs :  0
Type :  0
Price :  0
Content Rating :  0
Genres :  0
Last Updated :  0
Current Ver :  0
Android Ver :  0
Category_int :  0
Type_int :  0
Content_Rating_int :  0
Genres_int :  0


In [18]:
# Rows whose Rating is missing
ratingNaN = df.loc[df['Rating'].isnull()]
ratingNaN

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver,Category_int,Type_int,Content_Rating_int,Genres_int
23,Mcqueen Coloring pages,ART_AND_DESIGN,,61,70000000.0,100000,Free,0,Everyone,Art & Design;Action & Adventure,"March 7, 2018",1.0.0,4.1 and up,0,0,0,3
113,Wrinkles and rejuvenation,BEAUTY,,182,57000000.0,100000,Free,0,Everyone 10+,Beauty,"September 20, 2017",8.0,3.0 and up,2,0,2,5
123,Manicure - nail design,BEAUTY,,119,37000000.0,50000,Free,0,Everyone,Beauty,"July 23, 2018",1.3,4.1 and up,2,0,0,5
126,Skin Care and Natural Beauty,BEAUTY,,654,74000000.0,100000,Free,0,Teen,Beauty,"July 17, 2018",1.15,4.1 and up,2,0,1,5
129,"Secrets of beauty, youth and health",BEAUTY,,77,29000000.0,10000,Free,0,Mature 17+,Beauty,"August 8, 2017",2.0,2.3 and up,2,0,3,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10824,Cardio-FR,MEDICAL,,67,820000000.0,10000,Free,0,Everyone,Medical,"July 31, 2018",2.2.2,4.4 and up,19,0,0,85
10825,Naruto & Boruto FR,SOCIAL,,7,77000000.0,100,Free,0,Teen,Social,"February 2, 2018",1.0,4.0 and up,20,0,1,86
10831,payermonstationnement.fr,MAPS_AND_NAVIGATION,,38,98000000.0,5000,Free,0,Everyone,Maps & Navigation,"June 13, 2018",2.0.148.0,4.0 and up,32,0,0,103
10835,FR Forms,BUSINESS,,0,96000000.0,10,Free,0,Everyone,Business,"September 29, 2016",1.1.5,4.0 and up,4,0,0,7


In [19]:
# Fill missing ratings with the overall mean, rounded to one decimal.
# A single .loc call avoids chained assignment and the SettingWithCopyWarning.
df.loc[df['Rating'].isnull(), 'Rating'] = round(df['Rating'].mean(), 1)
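
Filling about 1,470 missing ratings with one global value pulls the distribution toward the mean. The sketch below shows the same overall-mean fill with `fillna`, next to a per-category mean as one alternative. The per-category variant is an assumption for comparison, not the method used in this notebook, and the sample values are made up.

```python
import pandas as pd

# Hypothetical rows: one rated and one unrated app per category
sample = pd.DataFrame({
    "Category": ["BEAUTY", "BEAUTY", "MEDICAL", "MEDICAL"],
    "Rating": [4.0, None, 3.0, None],
})

# Overall mean, rounded to one decimal
overall = sample["Rating"].fillna(round(sample["Rating"].mean(), 1))

# Alternative: mean of the rated apps within the same category
by_category = sample["Rating"].fillna(
    sample.groupby("Category")["Rating"].transform("mean")
)
print(overall.tolist(), by_category.tolist())
```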


## MANIPULATION

In [20]:
print("Correlation between Rating and the other columns")
for col in df.columns:
    if col != 'Rating':
        try: 
            print(col, ": ",df['Rating'].corr(df[col]) )
        except TypeError:
            pass

Correlation between Rating and the other columns
Reviews :  0.06765714113155663
Size :  0.04234414280851663
Installs :  0.05078170529967981
Category_int :  -0.02441243421332733
Type_int :  0.03597058097712477
Content_Rating_int :  0.0003994398109627491
Genres_int :  -0.025892534373927845


In [64]:
df.to_csv('file1.csv')

In [55]:
df['Total'] = df['Installs']*df['Price']

In [58]:
df = df.sort_values('Total', axis = 0, ascending = False)
df

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver,Category_int,Type_int,Content_Rating_int,Genres_int,Total
2241,Minecraft,FAMILY,4.5,2376564,-1.0,10000000,Paid,6.99,Everyone 10+,Arcade;Action & Adventure,"July 24, 2018",1.5.2.1,Varies with device,18,1,2,70,69900000.0
4347,Minecraft,FAMILY,4.5,2375336,-1.0,10000000,Paid,6.99,Everyone 10+,Arcade;Action & Adventure,"July 24, 2018",1.5.2.1,Varies with device,18,1,2,70,69900000.0
5351,I am rich,LIFESTYLE,3.8,3547,18000000.0,100000,Paid,399.99,Everyone,Lifestyle,"January 12, 2018",2.0,4.0.3 and up,16,1,0,29,39999000.0
5356,I Am Rich Premium,FINANCE,4.1,1867,47000000.0,50000,Paid,399.99,Everyone,Finance,"November 12, 2017",1.6,4.0 and up,11,1,0,24,19999500.0
4034,Hitman Sniper,GAME,4.6,408292,290000000.0,10000000,Paid,0.99,Mature 17+,Action,"July 12, 2018",1.7.110758,4.1 and up,17,1,3,36,9900000.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10438,Dolphin and fish coloring book,FAMILY,3.9,2249,-1.0,500000,Free,0.00,Everyone,Art & Design;Creativity,"May 15, 2018",Varies with device,4.1 and up,18,0,0,2,0.0
10420,Wallpapers Volvo FH Truck,PERSONALIZATION,4.2,4,140000000.0,100,Free,0.00,Teen,Personalization,"July 31, 2016",1.0,2.3.3 and up,26,0,1,93,0.0
10421,HD Jigsaw Volvo FH Trucks,FAMILY,4.2,2,76000000.0,500,Free,0.00,Teen,Puzzle,"November 29, 2016",1.0,4.0 and up,18,0,1,38,0.0
10422,Jigsaw Puzzles Volvo FH Trucks,FAMILY,4.2,3,78000000.0,1000,Free,0.00,Teen,Puzzle,"December 1, 2016",1.0,4.0 and up,18,0,1,38,0.0


In [60]:
df.sort_values(by = 'Price', axis = 0, ascending = False)

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver,Category_int,Type_int,Content_Rating_int,Genres_int,Total
4367,I'm Rich - Trump Edition,LIFESTYLE,3.6,275,73000000.0,10000,Paid,400.00,Everyone,Lifestyle,"May 3, 2018",1.0.1,4.1 and up,16,1,0,29,4000000.0
4362,💎 I'm rich,LIFESTYLE,3.8,718,260000000.0,10000,Paid,399.99,Everyone,Lifestyle,"March 11, 2018",1.0.0,4.4 and up,16,1,0,29,3999900.0
5362,I Am Rich Pro,FAMILY,4.4,201,27000000.0,5000,Paid,399.99,Everyone,Entertainment,"May 30, 2017",1.54,1.6 and up,18,1,0,19,1999950.0
5359,I am rich(premium),FINANCE,3.5,472,96522195.0,5000,Paid,399.99,Everyone,Finance,"May 1, 2017",3.4,4.4 and up,11,1,0,24,1999950.0
9934,I'm Rich/Eu sou Rico/أنا غني/我很有錢,LIFESTYLE,4.2,0,400000000.0,0,Paid,399.99,Everyone,Lifestyle,"December 1, 2017",MONEY,4.1 and up,16,1,0,29,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3649,"GO Weather - Widget, Theme, Wallpaper, Efficient",WEATHER,4.5,1422858,-1.0,50000000,Free,0.00,Everyone,Weather,"August 3, 2018",Varies with device,Varies with device,29,0,0,99,0.0
3650,Info BMKG,WEATHER,4.3,21404,76000000.0,1000000,Free,0.00,Everyone,Weather,"June 15, 2017",2.2,4.4 and up,29,0,0,99,0.0
3651,Weather From DMI/YR,WEATHER,4.3,2143,-1.0,100000,Free,0.00,Everyone,Weather,"July 31, 2018",Varies with device,Varies with device,29,0,0,99,0.0
3652,wetter.com - Weather and Radar,WEATHER,4.2,189313,380000000.0,10000000,Free,0.00,Everyone,Weather,"August 6, 2018",Varies with device,Varies with device,29,0,0,99,0.0


In [62]:
print(df['Size'].corr(df['Price']))
print(df['Reviews'].corr(df['Price']))

-0.01274882315869419


-0.009673305680829114

In [63]:
# Number of free apps: all rows minus those with a positive price
freeCount = df.shape[0] - df.loc[df['Price']>0.0].shape[0]
print(freeCount)

10033


## VISUALIZATION

Link Google Data Studio: Excel https://husteduvn-my.sharepoint.com/:x:/g/personal/nga_dk173280_sis_hust_edu_vn/EZCqkO7r3TdIl4LBoNIXkkABuh7N6CaVKKlDPczulTY5Gg?e=sC6XKQ

In [66]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10830 entries, 2241 to 10840
Data columns (total 18 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   App                 10830 non-null  object 
 1   Category            10830 non-null  object 
 2   Rating              10830 non-null  float64
 3   Reviews             10830 non-null  int64  
 4   Size                10830 non-null  float64
 5   Installs            10830 non-null  int64  
 6   Type                10830 non-null  object 
 7   Price               10830 non-null  float64
 8   Content Rating      10830 non-null  object 
 9   Genres              10830 non-null  object 
 10  Last Updated        10830 non-null  object 
 11  Current Ver         10830 non-null  object 
 12  Android Ver         10830 non-null  object 
 13  Category_int        10830 non-null  int32  
 14  Type_int            10830 non-null  int32  
 15  Content_Rating_int  10830 non-null  int32  
 16  G