# 6.1 What Is Data Tidying
## 6.1.1 Definition of Data
![Data semantics](数据的语义.jpg)

## 6.1.2 Tidy Data

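Tidy data is a standard way of mapping the meaning of a dataset to its structure. Whether a dataset is tidy or messy depends on how its rows, columns, and tables are matched up with observations, variables, and types.

In tidy data:

* Each variable forms a column.
* Each observation, with all of its attributes, forms a row.
* Each type of observational unit forms a table.

Messy data typically shows one or more of the following five common problems:

* Column headers are values, not variable names.
* Multiple variables are stored in one column.
* Variables are stored in both rows and columns.
* Multiple types of observational units are stored in the same table.
* A single observational unit is stored in multiple tables.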
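
To make the difference concrete, here is a minimal sketch that writes the same made-up counts in a messy layout and in a tidy layout. The names `messy`, `tidy`, and `total_by_income`, as well as the values, are invented for this illustration and are not taken from the datasets used later.

```python
import pandas as pd

# Made-up counts, laid out with income brackets as column headers (messy).
messy = pd.DataFrame(
    {"religion": ["A", "B"], "<$10k": [27, 12], "$10-20k": [34, 27]}
)

# The same facts in tidy form: each variable is a column, each observation is a row.
tidy = pd.DataFrame(
    {
        "religion": ["A", "A", "B", "B"],
        "income": ["<$10k", "$10-20k", "<$10k", "$10-20k"],
        "freq": [27, 34, 12, 27],
    }
)

# In the tidy layout, a question about one variable is a plain column operation.
total_by_income = tidy.groupby("income")["freq"].sum()
print(total_by_income)
```

In the messy layout the income bracket is spread across the column headers, so the same question would require iterating over columns instead of grouping by a single `income` column.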

# 6.2 Data Tidying in Practice
## 6.2.1 Column Headers Are Values, Not Variable Names

In [1]:
# Directory that holds the data files
data_src = r'Y:\BaiduNetdiskWorkspace\data_analysis\Python数据分析\data'

In [2]:
import numpy as np
import pandas as pd
data = pd.read_csv(data_src+"\\pew-raw.csv")

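###### Melting data

The `melt` function takes the following parameters:

* `frame`: the DataFrame to process.
* `id_vars`: the columns to keep as they are.
* `value_vars`: the columns to unpivot, whose names become the values of a new variable column.
* `var_name`: the name of the column that holds the former column names.
* `value_name`: the name of the column that holds the values.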
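
The parameters can be tried out on a tiny table first. The sketch below uses a made-up frame (`wide`, with hypothetical columns `name`, `a`, `b`, `c`) to show what `value_vars` selects and which column names pandas falls back to when `var_name` and `value_name` are omitted.

```python
import pandas as pd

# Hypothetical wide table used only for illustration.
wide = pd.DataFrame({
    "name": ["x", "y"],
    "a": [1, 2],
    "b": [3, 4],
    "c": [5, 6],
})

# Only "a" and "b" are melted; "c" is dropped because it appears
# in neither id_vars nor value_vars.
long_ab = pd.melt(wide, id_vars=["name"], value_vars=["a", "b"],
                  var_name="key", value_name="val")

# Without var_name / value_name, pandas uses the default names
# "variable" and "value", and melts every column not in id_vars.
long_default = pd.melt(wide, id_vars=["name"])

print(long_ab)
print(long_default.columns.tolist())
```

When `value_vars` is left out, every column that is not listed in `id_vars` is melted, which is the form used in the cells that follow.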


In [3]:
data.head()

Unnamed: 0,religion,<$10k,$10-20k,$20-30k,$30-40k,$40-50k,$50-75k
0,Agnostic,27,34,60,81,76,137
1,Atheist,12,27,37,52,35,70
2,Buddhist,27,21,30,34,33,58
3,Catholic,418,617,732,670,638,1116
4,Dont know/refused,15,14,15,11,10,35


In [4]:
# id_vars lists the columns that are already tidy;
# var_name names the new column that holds the former column headers;
# value_name names the column that holds the numeric values.
data.melt(id_vars='religion', var_name='income', value_name='freq')

Unnamed: 0,religion,income,freq
0,Agnostic,<$10k,27
1,Atheist,<$10k,12
2,Buddhist,<$10k,27
3,Catholic,<$10k,418
4,Dont know/refused,<$10k,15
5,Evangelical Prot,<$10k,575
6,Hindu,<$10k,1
7,Historically Black Prot,<$10k,228
8,Jehovahs Witness,<$10k,20
9,Jewish,<$10k,19


In [5]:
# Besides melt(), the same reshaping can be done with stack().
df = data.set_index('religion')
df = df.stack()
print(df.index)
df.index = df.index.rename('income',level= 1)
df.name = 'freq'
df = df.reset_index()
df.head()

MultiIndex([(                'Agnostic',   ' <$10k'),
            (                'Agnostic', ' $10-20k'),
            (                'Agnostic',  '$20-30k'),
            (                'Agnostic',  '$30-40k'),
            (                'Agnostic', ' $40-50k'),
            (                'Agnostic',  '$50-75k'),
            (                 'Atheist',   ' <$10k'),
            (                 'Atheist', ' $10-20k'),
            (                 'Atheist',  '$20-30k'),
            (                 'Atheist',  '$30-40k'),
            (                 'Atheist', ' $40-50k'),
            (                 'Atheist',  '$50-75k'),
            (                'Buddhist',   ' <$10k'),
            (                'Buddhist', ' $10-20k'),
            (                'Buddhist',  '$20-30k'),
            (                'Buddhist',  '$30-40k'),
            (                'Buddhist', ' $40-50k'),
            (                'Buddhist',  '$50-75k'),
            (               

Unnamed: 0,religion,income,freq
0,Agnostic,<$10k,27
1,Agnostic,$10-20k,34
2,Agnostic,$20-30k,60
3,Agnostic,$30-40k,81
4,Agnostic,$40-50k,76


In [6]:
# Load the data
df = pd.read_csv(data_src+"\\billboard.csv",encoding="mac_latin2")

In [7]:
df.head()

Unnamed: 0,year,artist.inverted,track,time,genre,date.entered,date.peaked,x1st.week,x2nd.week,x3rd.week,...,x67th.week,x68th.week,x69th.week,x70th.week,x71st.week,x72nd.week,x73rd.week,x74th.week,x75th.week,x76th.week
0,2000,Destiny's Child,Independent Women Part I,3:38,Rock,2000-09-23,2000-11-18,78,63.0,49.0,...,,,,,,,,,,
1,2000,Santana,"Maria, Maria",4:18,Rock,2000-02-12,2000-04-08,15,8.0,6.0,...,,,,,,,,,,
2,2000,Savage Garden,I Knew I Loved You,4:07,Rock,1999-10-23,2000-01-29,71,48.0,43.0,...,,,,,,,,,,
3,2000,Madonna,Music,3:45,Rock,2000-08-12,2000-09-16,41,23.0,18.0,...,,,,,,,,,,
4,2000,"Aguilera, Christina",Come On Over Baby (All I Want Is You),3:38,Rock,2000-08-05,2000-10-14,57,47.0,45.0,...,,,,,,,,,,


In [8]:
# Every column from x1st.week onward holds a rank, so we can melt them like this:
df = df.melt(id_vars= ['year','artist.inverted','track','time','genre','date.entered','date.peaked'],var_name='week',value_name='rank')

In [9]:
df.head()

Unnamed: 0,year,artist.inverted,track,time,genre,date.entered,date.peaked,week,rank
0,2000,Destiny's Child,Independent Women Part I,3:38,Rock,2000-09-23,2000-11-18,x1st.week,78.0
1,2000,Santana,"Maria, Maria",4:18,Rock,2000-02-12,2000-04-08,x1st.week,15.0
2,2000,Savage Garden,I Knew I Loved You,4:07,Rock,1999-10-23,2000-01-29,x1st.week,71.0
3,2000,Madonna,Music,3:45,Rock,2000-08-12,2000-09-16,x1st.week,41.0
4,2000,"Aguilera, Christina",Come On Over Baby (All I Want Is You),3:38,Rock,2000-08-05,2000-10-14,x1st.week,57.0


In [10]:
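# Then use str.extract to pull the week number out of the week column and convert it to int.
# A raw string keeps \d from being treated as an invalid escape sequence.
df['week'] = df['week'].str.extract(r'(\d+)', expand=False).astype(int)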

In [11]:
df.head()

Unnamed: 0,year,artist.inverted,track,time,genre,date.entered,date.peaked,week,rank
0,2000,Destiny's Child,Independent Women Part I,3:38,Rock,2000-09-23,2000-11-18,1,78.0
1,2000,Santana,"Maria, Maria",4:18,Rock,2000-02-12,2000-04-08,1,15.0
2,2000,Savage Garden,I Knew I Loved You,4:07,Rock,1999-10-23,2000-01-29,1,71.0
3,2000,Madonna,Music,3:45,Rock,2000-08-12,2000-09-16,1,41.0
4,2000,"Aguilera, Christina",Come On Over Baby (All I Want Is You),3:38,Rock,2000-08-05,2000-10-14,1,57.0
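
To see what the extraction step does in isolation, here is a minimal sketch on a hand-written Series. The name `weeks` and its three labels are chosen for illustration; they follow the `x<N>.week` pattern of the column headers shown above.

```python
import pandas as pd

# A few labels in the same style as the melted week column.
weeks = pd.Series(["x1st.week", "x2nd.week", "x76th.week"])

# The capture group (\d+) grabs the first run of digits in each label;
# expand=False returns a Series instead of a one-column DataFrame.
week_numbers = weeks.str.extract(r"(\d+)", expand=False).astype(int)
print(week_numbers.tolist())  # [1, 2, 76]
```

`astype(int)` works here because every label contains digits. A label without digits would yield a missing value, and the integer conversion would then fail.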


## 6.2.2 Multiple Variables Stored in One Column

In [14]:
df = pd.read_csv(data_src+"\\tb-raw.csv")

In [15]:
df.head()

Unnamed: 0,country,year,m014,m1524,m2534,m3544,m4554,m5564,m65,mu,f014
0,AD,2000,0.0,0.0,1.0,0.0,0,0,0.0,,
1,AE,2000,2.0,4.0,4.0,6.0,5,12,10.0,,3.0
2,AF,2000,52.0,228.0,183.0,149.0,129,94,80.0,,93.0
3,AG,2000,0.0,0.0,0.0,0.0,0,0,1.0,,1.0
4,AL,2000,2.0,19.0,21.0,14.0,24,19,16.0,,3.0
