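# Data Analysis of the Google Play Store Apps

## / Introduction

Most of us are familiar with mobile app stores. The Google Play Store is Google's official app store. It lists hundreds of thousands of apps, with more than 2 billion downloads, and offers phone users a very wide choice of applications, which makes it very popular.

This dataset (googleplaystore.csv) contains data on apps in the Google Play Store. It is part of the Kaggle dataset [Google Play Store Apps](https://www.kaggle.com/lava18/google-play-store-apps), which holds information on 10k+ apps in the Google Play Store.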

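Meaning of the variables in the data:
```
App: name of the app, string variable.
Category: category the app belongs to, string variable.
Rating: user rating of the app, numeric variable.
Reviews: number of user reviews the app has received, numeric variable.
Size: storage space the app occupies, string variable.
Installs: number of times users have installed or downloaded the app, string variable.
Type: paid or free, categorical variable.
Price: price, string variable.
Content Rating: age rating group assigned by the store based on content - Children / Mature 21+ / Adult, categorical variable.
Genres: genre; an app can belong to several genres, such as music, game, family, string variable.
Last Updated: date of the app's most recent update, string variable.
Current Ver: current version of the app, string variable.
Android Ver: minimum Android version required to install the app, string variable.

```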

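## / Project Guide

The analysis workflow for this project is provided, but you will write all of the code yourself. If you cannot complete this project, your current skills are not yet sufficient for the Data Analyst (Advanced) Nanodegree. In that case we recommend taking the Data Analyst (Introductory) Nanodegree first to master the prerequisites for the advanced course.

Documenting the analysis process is also an important part of a data analysis report. You can insert Markdown cells wherever needed to record the key steps and reasoning of your analysis. For example: what characteristics does the data have, what do the statistics mean, what conclusions can you draw from the visualizations, what is the next step of the analysis, and why perform it? If you cannot do this, you will not pass this project either.


> **Tip**: Quoted sections like this one provide practical guidance to help students understand and use Jupyter notebook.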

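## / Posing Questions

In this project you will perform an exploratory data analysis in the role of a data analyst and learn the basic workflow of the data analysis process. Before analyzing the data, think about a few things you want to know about apps in the Google store. For example, what characterizes the most popular (most downloaded) apps? Which apps have higher ratings?

**Question**: Write down the questions you are interested in, and make sure they can be answered with the available data.
(To make the exercise effective, make sure your report contains 2 visualizations and 1 correlation analysis.)

**Answer**: Replace this text with your answer!


After posing the questions, we will load the data and explore it to answer them.

> **Tip**: Double-click the box above and the text will change: all formatting is removed so that you can edit the text block. The block is written in [Markdown](http://daringfireball.net/projects/markdown/syntax), a plain-text syntax that formats text with headings, links, italics and so on. You will also use Markdown in the Nanodegree courses. After editing, press **Shift** + **Enter** or **Shift** + **Return** to run the box and render the formatted text.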

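# Data Assessment and Cleaning
## / Understanding the Data

> **Tip**: Running a code box works much like rendering the Markdown box above. Click the code box and press **Shift** + **Enter** or **Shift** + **Return**, or select the code box and click the **Run** button in the toolbar. While a code box is running, an asterisk appears to the left of the cell, i.e. `In [*]:`. When execution finishes, the asterisk turns into a number, such as `In [1]`. If the code produces output, it appears as `Out [1]:`, where the number matches the one in "In".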

In [107]:
# Run this cell first so that Chinese characters display correctly in the visualizations
!rm -rf ~/.cache/matplotlib/fontList.json
!wget http://d.xiazaiziti.com/en_fonts/fonts/s/SimHei.ttf -O /opt/conda/lib/python3.6/site-packages/matplotlib/mpl-data/fonts/ttf/SimHei.ttf
import matplotlib.pyplot as plt

plt.rcParams['font.sans-serif']=['SimHei'] # display Chinese labels correctly
plt.rcParams['axes.unicode_minus']=False # display the minus sign correctly

/bin/sh: wget: command not found


In [168]:
# TO DO: load packages
import pandas as pd
import pprint as pp

In [169]:
# TO DO: load the dataset
df = pd.read_csv('googleplaystore.csv')

In [170]:
# TO DO: check the dataset general info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
App               10841 non-null object
Category          10841 non-null object
Rating            9367 non-null float64
Reviews           10841 non-null object
Size              10841 non-null object
Installs          10841 non-null object
Type              10840 non-null object
Price             10841 non-null object
Content Rating    10840 non-null object
Genres            10841 non-null object
Last Updated      10841 non-null object
Current Ver       10833 non-null object
Android Ver       10838 non-null object
dtypes: float64(1), object(12)
memory usage: 1.1+ MB


In [171]:
df.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


In [172]:
df[['Category','Genres']].sample(100)

Unnamed: 0,Category,Genres
4654,SOCIAL,Social
1001,ENTERTAINMENT,Entertainment
10051,TOOLS,Tools
4679,GAME,Action
7187,TOOLS,Tools
1009,EVENTS,Events
4911,TOOLS,Tools
6439,FINANCE,Finance
5676,NEWS_AND_MAGAZINES,News & Magazines
4503,FAMILY,Entertainment


In [173]:
# TO DO: clean the data (optional: only if there are problems)
df.isnull().sum()

App                  0
Category             0
Rating            1474
Reviews              0
Size                 0
Installs             0
Type                 1
Price                0
Content Rating       1
Genres               0
Last Updated         0
Current Ver          8
Android Ver          3
dtype: int64

In [174]:
df.columns

Index(['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type',
       'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver',
       'Android Ver'],
      dtype='object')

## / function
### // drop_column(df,columns)

In [175]:
## build function
def drop_column(df, columns):
    '''
    input:
    1. df: DataFrame to modify
    2. columns: list of features (columns) to drop

    check:
    drops a list of features that are not needed and
    prints extra information indicating which columns were dropped;
    dropping data is always sensitive, so a log is needed.

    output: None
    prints full information about the deleted features.
    '''
    ## proceed
    dflen = len(df.columns)
    df.drop(columns, axis=1, inplace=True)
    ### inside a function, use inplace=True rather than reassignment so that the caller's df is modified
    ## check
    print('---- proceding ----')
    print('- drop {} columns: {} '.format(len(columns), columns))
    print('- remain {} columns'.format(len(df.columns)))
    print('- success : {}'.format(len(columns) + len(df.columns) == dflen))

## / cleaning
### // drop_column

In [176]:
help(drop_column)

Help on function drop_column in module __main__:

drop_column(df, columns)
    input:
    1. df: DataFrame to modify
    2. columns: list of features (columns) to drop

    check:
    drops a list of features that are not needed and
    prints extra information indicating which columns were dropped;
    dropping data is always sensitive, so a log is needed.

    output: None
    prints full information about the deleted features.
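
The note inside `drop_column` about `inplace=True` can be checked on a small example. The sketch below uses a made-up toy DataFrame and three hypothetical helper functions (none of them are part of the notebook) to show that reassigning `df` inside a function leaves the caller's object unchanged, while an in-place drop or a returned copy has a visible effect.

```python
import pandas as pd

def drop_by_reassignment(df, columns):
    # Rebinds the local name only; the caller's DataFrame is left untouched.
    df = df.drop(columns, axis=1)

def drop_in_place(df, columns):
    # Mutates the object the caller passed in.
    df.drop(columns, axis=1, inplace=True)

def drop_and_return(df, columns):
    # Non-mutating alternative: the caller decides what to do with the result.
    return df.drop(columns, axis=1)

# Toy data, made up for illustration only.
toy = pd.DataFrame({'App': ['a', 'b'], 'Size': ['1M', '2M'], 'Rating': [4.0, 4.5]})

drop_by_reassignment(toy, ['Size'])
cols_after_reassignment = toy.columns.tolist()   # 'Size' is still present

trimmed = drop_and_return(toy, ['Size'])
cols_of_returned = trimmed.columns.tolist()      # 'Size' is gone in the copy

drop_in_place(toy, ['Size'])
cols_after_inplace = toy.columns.tolist()        # 'Size' is gone in toy itself
```

Returning a new DataFrame is often easier to reason about than mutating in place, but either approach works as long as the caller knows which one is in use.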



In [177]:
dfnew = df.copy()

In [178]:
drop_list = ['Category','Current Ver',
             'Content Rating','Android Ver','Size']

In [179]:
drop_column(dfnew,drop_list)

---- proceding ----
- drop 5 columns: ['Category', 'Current Ver', 'Content Rating', 'Android Ver', 'Size'] 
- remain 8 columns
- success : True


In [180]:
dfnew.sample()

Unnamed: 0,App,Rating,Reviews,Installs,Type,Price,Genres,Last Updated
5974,BC Wildfire,3.3,27,"5,000+",Free,0,Tools,"May 31, 2018"


### // drop_duplicate

In [181]:
dfnew[dfnew.duplicated()]

Unnamed: 0,App,Rating,Reviews,Installs,Type,Price,Genres,Last Updated
229,Quick PDF Scanner + OCR FREE,4.2,80805,"5,000,000+",Free,0,Business,"February 26, 2018"
236,Box,4.2,159872,"10,000,000+",Free,0,Business,"July 31, 2018"
239,Google My Business,4.4,70991,"5,000,000+",Free,0,Business,"July 24, 2018"
256,ZOOM Cloud Meetings,4.4,31614,"10,000,000+",Free,0,Business,"July 20, 2018"
261,join.me - Simple Meetings,4.0,6989,"1,000,000+",Free,0,Business,"July 16, 2018"
265,Box,4.2,159872,"10,000,000+",Free,0,Business,"July 31, 2018"
266,Zenefits,4.2,296,"50,000+",Free,0,Business,"June 15, 2018"
267,Google Ads,4.3,29313,"5,000,000+",Free,0,Business,"July 30, 2018"
268,Google My Business,4.4,70991,"5,000,000+",Free,0,Business,"July 24, 2018"
269,Slack,4.4,51507,"5,000,000+",Free,0,Business,"August 2, 2018"


In [182]:
dfnew.duplicated().sum()

489

In [183]:
dfnew.shape

(10841, 8)

In [184]:
dfnew.drop_duplicates(inplace=True)      
dfnew.shape

(10352, 8)

### plus More Duplicates
- After dropping exact duplicates, some rows still share an App name but differ in Reviews
- For example, Facebook

In [185]:
dfnew.query('App == "Facebook"')

Unnamed: 0,App,Rating,Reviews,Installs,Type,Price,Genres,Last Updated
2544,Facebook,4.1,78158306,"1,000,000,000+",Free,0,Social,"August 3, 2018"
3943,Facebook,4.1,78128208,"1,000,000,000+",Free,0,Social,"August 3, 2018"


In [220]:
## this time, look only at rows whose App value is duplicated
## and flag every duplicated row (keep=False), not just the later occurrences
dfnew.duplicated(subset='App',keep=False).sum()

1211
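
The difference between `keep=False` and the default `keep='first'` explains why the notebook counts 1211 rows here but 692 rows later with the default setting. A minimal sketch on made-up toy data:

```python
import pandas as pd

# Toy data, made up for illustration only.
toy = pd.DataFrame({
    'App': ['A', 'A', 'B', 'C', 'C', 'C'],
    'Reviews': [10, 12, 5, 7, 7, 9],
})

# Default keep='first': the first occurrence of each App is not flagged,
# so this counts only the surplus rows that drop_duplicates would remove.
surplus_rows = int(toy.duplicated(subset='App').sum())

# keep=False: every row that shares its App name with another row is flagged.
rows_involved = int(toy.duplicated(subset='App', keep=False).sum())
```

Here `surplus_rows` counts one extra row for A and two for C, while `rows_involved` counts all five rows of A and C.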

In [225]:
# solution1
## for convenience, build a new DataFrame holding only the duplicated rows
dfdup = dfnew[dfnew.duplicated(subset='App',keep=False)]
dfdup.head()

Unnamed: 0,App,Rating,Reviews,Installs,Type,Price,Genres,Last Updated
1,Coloring book moana,3.9,967,"500,000+",Free,0,Art & Design;Pretend Play,"January 15, 2018"
23,Mcqueen Coloring pages,,61,"100,000+",Free,0,Art & Design;Action & Adventure,"March 7, 2018"
36,UNICORN - Color By Number & Pixel Art Coloring,4.7,8145,"500,000+",Free,0,Art & Design;Creativity,"August 2, 2018"
42,Textgram - write on photos,4.4,295221,"10,000,000+",Free,0,Art & Design,"July 30, 2018"
139,Wattpad 📖 Free Books,4.6,2914724,"100,000,000+",Free,0,Books & Reference,"August 1, 2018"


In [226]:
dfdup.query('App == "ASOS"')

Unnamed: 0,App,Rating,Reviews,Installs,Type,Price,Genres,Last Updated
2771,ASOS,4.7,181798,"10,000,000+",Free,0,Shopping,"July 30, 2018"
2800,ASOS,4.7,181823,"10,000,000+",Free,0,Shopping,"July 30, 2018"


In [236]:
dfdup.query('App == "ASOS"').sort_values(by='Reviews')

Unnamed: 0,App,Rating,Reviews,Installs,Type,Price,Genres,Last Updated
2771,ASOS,4.7,181798,"10,000,000+",Free,0,Shopping,"July 30, 2018"
2800,ASOS,4.7,181823,"10,000,000+",Free,0,Shopping,"July 30, 2018"


In [239]:
dfdup.query('App == "ASOS"').sort_values(by='Reviews', ascending=False)

Unnamed: 0,App,Rating,Reviews,Installs,Type,Price,Genres,Last Updated
2800,ASOS,4.7,181823,"10,000,000+",Free,0,Shopping,"July 30, 2018"
2771,ASOS,4.7,181798,"10,000,000+",Free,0,Shopping,"July 30, 2018"


In [252]:
# next idea: find the index of the max-Reviews row for every app, build dfmax, then remove dfdup from dfnew and add dfmax
max_index = dfdup.query('App == "ASOS"').sort_values(by='Reviews', ascending=False)[:1].index[0]
max_index_list = []   # not named `list`, which would shadow the built-in
max_index_list.append(max_index)
max_index

2800

In [256]:
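# stop this attempt here: the approach is too complicated; since it relies on sorting anyway, sort the whole dataset instead
# and then call drop_duplicates directly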

In [267]:
# solution2
dfnew.shape

(10352, 8)

In [268]:
dfnew.query('App == "ASOS"')

Unnamed: 0,App,Rating,Reviews,Installs,Type,Price,Genres,Last Updated
2771,ASOS,4.7,181798,"10,000,000+",Free,0,Shopping,"July 30, 2018"
2800,ASOS,4.7,181823,"10,000,000+",Free,0,Shopping,"July 30, 2018"


In [269]:
dftest = dfnew.sort_values(by=['App','Reviews'], ascending=False)

In [271]:
# the row order has changed
dftest.query('App == "ASOS"')

Unnamed: 0,App,Rating,Reviews,Installs,Type,Price,Genres,Last Updated
2800,ASOS,4.7,181823,"10,000,000+",Free,0,Shopping,"July 30, 2018"
2771,ASOS,4.7,181798,"10,000,000+",Free,0,Shopping,"July 30, 2018"


In [278]:
dftest.shape

(10352, 8)

In [279]:
dftest.duplicated(subset='App').sum()

692

In [280]:
dftest.drop_duplicates(subset='App',inplace=True)
dftest.shape

(9660, 8)

In [281]:
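# apply the same steps to dfnew
# note: Reviews is still an object (string) column at this point, so the sort compares text rather than numbers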
dfnew.sort_values(by=['App','Reviews'], ascending=False, inplace=True)
dfnew.drop_duplicates(subset='App',inplace=True)
dfnew.shape

(9660, 8)
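
Because `Reviews` is still a string column when the sort above runs, rows are ordered lexicographically, which can keep the wrong duplicate when the review counts have different lengths. The sketch below, on made-up toy data, contrasts the string sort with a numeric sort. The helper column `ReviewsNum` is a hypothetical name, and `errors='coerce'` is used on the assumption that unparseable values such as `3.0M` should become NaN rather than raise.

```python
import pandas as pd

# Toy data, made up for illustration only; Reviews are strings, as in the raw CSV.
toy = pd.DataFrame({
    'App': ['X', 'X', 'Y', 'Z'],
    'Reviews': ['9', '10000', '42', '3.0M'],
})

# String sort: '9' > '10000' lexicographically, so the smaller count is kept for X.
by_string = (toy.sort_values(by=['App', 'Reviews'], ascending=False)
                .drop_duplicates(subset='App'))

# Numeric sort: unparseable values such as '3.0M' become NaN instead of raising.
toy['ReviewsNum'] = pd.to_numeric(toy['Reviews'], errors='coerce')
by_number = (toy.sort_values(by=['App', 'ReviewsNum'], ascending=False)
                .drop_duplicates(subset='App'))

kept_by_string = by_string.set_index('App').loc['X', 'Reviews']
kept_by_number = by_number.set_index('App').loc['X', 'Reviews']
```

For the ASOS and Facebook rows shown above the two counts have the same number of digits, so the string sort happens to give the same order as a numeric one.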

### // drop_null
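How to handle missing values depends on the situation. Options include deleting rows, filling with the mean, and others. Here the rows are dropped.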

In [282]:
dfnew.isnull().sum()

App                0
Rating          1463
Reviews            0
Installs           0
Type               1
Price              0
Genres             0
Last Updated       0
dtype: int64

In [283]:
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html
dfnew.dropna(inplace=True)
dfnew.shape

(8197, 8)

In [284]:
dfnew.isnull().sum()

App             0
Rating          0
Reviews         0
Installs        0
Type            0
Price           0
Genres          0
Last Updated    0
dtype: int64
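
Dropping every row with a missing value is simple, but in this notebook it removes 1463 rows that lack only a `Rating`. As a rough comparison of the two options mentioned in this section, the sketch below applies both to a made-up toy DataFrame. Whether mean imputation is appropriate for the real data is a judgment call that this sketch does not settle.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration only.
toy = pd.DataFrame({
    'App': ['a', 'b', 'c', 'd'],
    'Rating': [4.0, np.nan, 5.0, 3.0],
    'Type': ['Free', 'Free', None, 'Paid'],
})

# Option 1: drop every row that has any missing value (what the notebook does).
dropped = toy.dropna()

# Option 2: drop only rows missing Type, then fill missing Rating with the mean
# of the remaining ratings.
filled = toy.dropna(subset=['Type']).copy()
filled['Rating'] = filled['Rating'].fillna(filled['Rating'].mean())
```

Option 2 keeps more rows, at the cost of pulling the rating distribution toward its mean, which can affect later correlation results.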

# Exploratory Data Analysis
## function
### // check_value(df,columns)

In [285]:
# build function
def check_value(df, columns):
    '''
    input:
    1. df: DataFrame to examine
    2. columns: list of features (columns) to examine

    check:
    for each specified feature (column), prints its ten most frequent values,
    so that the value distribution of categorical features is visible
    even when the DataFrame has too many columns to display in full.

    output: None
    prints the value distribution of each specified feature.
    '''
    c = 1
    for i in columns:
        print('\n- columns #{} : {:-<8}'.format(c, i))
        print(df[i].value_counts().nlargest(10))
        ### nlargest is very handy
        c += 1
    pp.pprint('----checking complete----')

In [286]:
check_value(dfnew,dfnew.columns)


- columns #1 : App-----
SNCF                                   1
Runtastic Mountain Bike GPS Tracker    1
CPU-Z                                  1
DV Youth                               1
Bi en Línea                            1
Home Workout for Men - Bodybuilding    1
Results for FL Lottery                 1
Add Watermark Free                     1
funny Image Comments for FB            1
SofaScore Live Score                   1
Name: App, dtype: int64

- columns #2 : Rating--
4.4    897
4.3    897
4.5    849
4.2    811
4.6    683
4.1    621
4.0    513
4.7    438
3.9    359
3.8    286
Name: Rating, dtype: int64

- columns #3 : Reviews-
2     82
3     76
4     74
5     74
1     67
7     62
6     60
8     55
12    51
10    45
Name: Reviews, dtype: int64

- columns #4 : Installs
1,000,000+     1416
100,000+       1095
10,000+         986
10,000,000+     934
1,000+          697
5,000,000+      608
500,000+        504
50,000+         457
5,000+          425
100+            303
Name: Insta

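- Based on the output, go back and repeat the cleaning
    - drop the row whose Type is 0
    - drop Last Updated (it does not seem important, because the other data is unrelated to time)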

In [287]:
drop_list = ['Last Updated']
drop_column(dfnew,drop_list)

---- proceding ----
- drop 1 columns: ['Last Updated'] 
- remain 7 columns
- success : True


## / update dataframe
### // drop data (iter)

In [288]:
dfnew.query('Type == "0"')

Unnamed: 0,App,Rating,Reviews,Installs,Type,Price,Genres
10472,Life Made WI-Fi Touchscreen Photo Frame,19.0,3.0M,Free,0,Everyone,"February 11, 2018"


In [299]:
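# the output above shows that this row's Type should be Free; the whole row is shifted by one column
## on inspection its values are unreliable, so delete the row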
dfnew.query('Type == "0"')

Unnamed: 0,App,Rating,Reviews,Installs,Type,Price,Genres


In [294]:
drop_index_list = dfnew.query('Type == "0"').index[0]

In [295]:
drop_index_list

10472

In [296]:
dfnew.shape

(8197, 7)

In [297]:
dfnew.drop(drop_index_list, axis=0, inplace=True)

In [298]:
dfnew.shape

(8196, 7)
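
The shifted record was found by reading the value counts, but the same kind of problem can also be searched for with explicit range and membership checks. A minimal sketch on made-up toy data, assuming that `Free` and `Paid` are the only legitimate `Type` values and that ratings lie between 1 and 5:

```python
import pandas as pd

# Toy data, made up for illustration; the last row mimics the shifted record above.
toy = pd.DataFrame({
    'App': ['ok app', 'paid app', 'shifted app'],
    'Rating': [4.1, 3.9, 19.0],
    'Type': ['Free', 'Paid', '0'],
})

VALID_TYPES = ['Free', 'Paid']   # assumption: the only legitimate Type values

bad_type = ~toy['Type'].isin(VALID_TYPES)
# assumption: ratings lie in [1, 5]; a missing rating would also be flagged here
bad_rating = ~toy['Rating'].between(1, 5)

suspect = toy[bad_type | bad_rating]
cleaned = toy.drop(suspect.index)
```

Inspecting `suspect` before dropping anything keeps the deletion step as deliberate as the logged column drops earlier in the notebook.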

### // change data type

In [300]:
dfnew.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8196 entries, 882 to 8532
Data columns (total 7 columns):
App         8196 non-null object
Rating      8196 non-null float64
Reviews     8196 non-null object
Installs    8196 non-null object
Type        8196 non-null object
Price       8196 non-null object
Genres      8196 non-null object
dtypes: float64(1), object(6)
memory usage: 512.2+ KB


In [303]:
dfnew.sample()

Unnamed: 0,App,Rating,Reviews,Installs,Type,Price,Genres
9524,Video Maker with Photo and Music,4.3,4871,"1,000,000+",Free,0,Photography


- type conversion plan
    - Reviews: object -> int
    - Installs: object -> category
    - Type: object -> category
    - Price: object -> category
    - Genres: object -> category

In [304]:
dfnew.Reviews = dfnew.Reviews.astype(int)

In [305]:
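# the conversion below raises an error: values such as "10,000+" cannot be cast to int
## based on the earlier look at the data, Installs is better treated as categorical (the plan above has been updated)
#dfnew.Installs = dfnew.Installs.astype(int)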

In [306]:
dfnew.Installs = dfnew.Installs.astype('category')
dfnew.Type = dfnew.Type.astype('category')
dfnew.Price = dfnew.Price.astype('category')
dfnew.Genres = dfnew.Genres.astype('category')
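## note: 'category' must be passed as a quoted string
## a category column is used much like an object column, but it takes less memory when the data is large and has few distinct values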
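
Treating `Installs` as a plain category loses the ordering of its levels, which matters for questions about popularity. As an alternative, the labels can be parsed into lower-bound integers, or kept as an ordered categorical. The sketch below uses toy values in the same format as the `Installs` column shown earlier; the names `installs_min` and `levels` are hypothetical.

```python
import pandas as pd

# Toy values in the same string format as the Installs column.
installs = pd.Series(['10,000+', '500,000+', '1,000,000,000+', '100+', '10,000+'])

# Strip the thousands separators and the trailing '+', then convert to integers.
installs_min = (installs.str.replace(',', '', regex=False)
                        .str.replace('+', '', regex=False)
                        .astype('int64'))

# Build the category order from the parsed values, smallest first.
levels = (pd.DataFrame({'label': installs, 'value': installs_min})
            .drop_duplicates()
            .sort_values('value')['label']
            .tolist())
installs_cat = pd.Categorical(installs, categories=levels, ordered=True)
```

With an ordered categorical, comparisons and sorting follow the install count instead of alphabetical order, while the original labels stay readable in plots.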

In [307]:
dfnew.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8196 entries, 882 to 8532
Data columns (total 7 columns):
App         8196 non-null object
Rating      8196 non-null float64
Reviews     8196 non-null int64
Installs    8196 non-null category
Type        8196 non-null category
Price       8196 non-null category
Genres      8196 non-null category
dtypes: category(4), float64(1), int64(1), object(1)
memory usage: 298.0+ KB


## persistence

In [308]:
dfnew.to_pickle('googleplaystore_select.pickle.xz', compression='xz')
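
One practical effect of saving to pickle rather than CSV is that dtypes such as `category` survive the round trip. A minimal sketch on made-up data, writing to a temporary directory (the file names are hypothetical):

```python
import os
import tempfile

import pandas as pd

# Toy data, made up for illustration only.
toy = pd.DataFrame({'App': ['a', 'b'], 'Type': ['Free', 'Paid']})
toy['Type'] = toy['Type'].astype('category')

with tempfile.TemporaryDirectory() as tmp:
    pickle_path = os.path.join(tmp, 'toy.pickle.xz')
    csv_path = os.path.join(tmp, 'toy.csv')

    toy.to_pickle(pickle_path, compression='xz')
    toy.to_csv(csv_path, index=False)

    from_pickle = pd.read_pickle(pickle_path, compression='xz')
    from_csv = pd.read_csv(csv_path)

pickle_dtype = str(from_pickle['Type'].dtype)   # the category dtype is preserved
csv_dtype = str(from_csv['Type'].dtype)         # read back as a generic string dtype
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.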

## EDA

In [309]:
df = pd.read_pickle('googleplaystore_select.pickle.xz', compression='xz')
df.sample()

Unnamed: 0,App,Rating,Reviews,Installs,Type,Price,Genres
5148,Kimbrough AH,5.0,5,100+,Free,0,Medical


In [310]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8196 entries, 882 to 8532
Data columns (total 7 columns):
App         8196 non-null object
Rating      8196 non-null float64
Reviews     8196 non-null int64
Installs    8196 non-null category
Type        8196 non-null category
Price       8196 non-null category
Genres      8196 non-null category
dtypes: category(4), float64(1), int64(1), object(1)
memory usage: 298.0+ KB


In [311]:
df.describe()

Unnamed: 0,Rating,Reviews
count,8196.0,8196.0
mean,4.173084,255248.3
std,0.536522,1985679.0
min,1.0,1.0
25%,4.0,126.0
50%,4.3,3004.0
75%,4.5,43719.5
max,5.0,78158310.0


In [162]:
dftest.drop.all()[df['App'] == 'Facebook' and df['Reviews'] != df.loc['Facebook','Reviews'].max()]

AttributeError: 'function' object has no attribute 'all'
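
The cell above fails because `DataFrame.drop` is a method, so `.all()` cannot be called on it. Two further problems would surface next: the Python keyword `and` cannot combine two boolean Series, and `df.loc['Facebook', 'Reviews']` looks up an index label rather than a value in the `App` column. The cell appears to aim at removing the Facebook rows whose `Reviews` is not the maximum. A minimal sketch of that idea with boolean masks, using the two Facebook review counts shown earlier plus one made-up row:

```python
import pandas as pd

toy = pd.DataFrame({
    'App': ['Facebook', 'Facebook', 'Other'],   # 'Other' is a made-up row
    'Reviews': [78158306, 78128208, 10],
})

is_facebook = toy['App'] == 'Facebook'
facebook_max = toy.loc[is_facebook, 'Reviews'].max()

# Element-wise & (not the keyword `and`), with the comparison parenthesised.
stale = is_facebook & (toy['Reviews'] != facebook_max)
result = toy[~stale]
```

This handles a single app. The sort followed by `drop_duplicates(subset='App')` used earlier in the notebook covers all apps at once.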

In [None]:
# In exploratory data analysis, make sure to use both statistics and visualizations


In the exploratory analysis, make sure you record the key steps and the reasoning behind them. You can insert code cells and Markdown cells yourself to organize your report.
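
For the correlation analysis the project asks for, one possible starting point is to compare a linear and a rank-based coefficient between `Rating` and `Reviews`. The `describe()` output above shows that `Reviews` is heavily skewed (mean about 255,000 against a median of 3,004), so a rank-based measure may be more informative. The sketch below uses made-up toy data and computes the Spearman coefficient as the Pearson correlation of the ranks.

```python
import pandas as pd

# Toy data, made up for illustration only.
toy = pd.DataFrame({
    'Rating': [3.5, 4.0, 4.2, 4.5, 4.7],
    'Reviews': [12, 150, 3000, 45000, 2000000],
})

# Pearson: strength of the linear association, sensitive to extreme values.
pearson = toy['Rating'].corr(toy['Reviews'])

# Spearman: Pearson correlation of the ranks, which captures monotonic association.
spearman = toy['Rating'].rank().corr(toy['Reviews'].rank())
```

In this toy example the relationship is perfectly monotonic, so the rank-based coefficient is 1 while the linear one is noticeably lower. Neither coefficient says anything about causation.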

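# Drawing Conclusions

**Question**: Can the analysis above answer the questions you posed? What conclusions can you draw from it?

**Answer**: Replace this text with your answer!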

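# Reflection

**Question**: Is the logic of your analysis and conclusions rigorous? Is there room for improvement? You can consider the following angles:
1. Is the dataset complete, and does it contain all the data you want to analyze?
2. When processing the data, could your operations (for example, deleting or filling missing values) have affected the conclusions?
3. Are there other variables, not present in this data, that would help your analysis?
4. When drawing conclusions, did you confuse correlation with causation?

**Answer**: Replace this text with your answer!

Congratulations on completing this project! This is only a sample of the data analysis process: posing questions, wrangling the data, exploring the data, and drawing conclusions. In the Data Analyst (Advanced) Nanodegree you will learn more advanced methods and techniques. If you are interested, we encourage you to continue with the later courses and build more advanced data analysis skills!

> To share the results with others, besides giving them a copy of the Jupyter Notebook (.ipynb) file, we can export the notebook output in a form that even people without Python installed can open. From the "File" menu at the top left, go to the "Download as" submenu, then choose a more widely viewable format such as HTML (.html). You may need additional packages or software to perform these exports.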