# Chapter 2 Pandas Basics

With the Python basics in place, this chapter covers the basics of Pandas.  
Reference (Datawhale): https://datawhalechina.github.io/joyful-pandas/build/html/%E7%9B%AE%E5%BD%95/ch2.html

In [1]:
import numpy as np
import pandas as pd
pd.__version__

'1.1.3'

## 1. Reading and Writing Files
### 1.1 Reading Files
(1) Read a csv file: pd.read_csv('direction'), where the argument is the file path;  
(2) Read a txt file: pd.read_table(' ');  
(3) Read an Excel file: pd.read_excel(' ');

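Besides the path, optional parameters can be passed when reading a file, for example:  
(4) The first row is used as the column names by default; if the file has no header row, pass header=None;  
(5) Use certain columns as the index: index_col=['colnames'];  
(6) Read only certain columns: usecols=['colnames'];  
(7) Columns to be parsed as dates: parse_dates=['colnames'];  
(8) The number of rows to read: nrows=n;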
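
As a minimal sketch of options (4) to (8), the snippet below reads a small in-memory CSV rather than a file on disk. The sample text and the column names (`id`, `date`, `score`, `note`) are made up for illustration; only the `pd.read_csv` parameters come from the list above.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file.
csv_text = (
    "id,date,score,note\n"
    "a,2020-01-01,1.5,x\n"
    "b,2020-01-02,2.5,y\n"
    "c,2020-01-03,3.5,z\n"
)

df_opts = pd.read_csv(
    io.StringIO(csv_text),
    index_col=["id"],                 # use the id column as the index
    usecols=["id", "date", "score"],  # read only these columns
    parse_dates=["date"],             # parse date as datetime
    nrows=2,                          # read only the first two data rows
)
print(df_opts)
print(df_opts.dtypes)

# header=None treats the first line as data and numbers the columns 0, 1, ...
df_raw = pd.read_csv(io.StringIO(csv_text), header=None)
print(df_raw.shape)
```

With `header=None` the header line becomes an ordinary data row, so `df_raw` has one more row than the file has records.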

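For txt files in particular, read_table splits on tabs by default; if the file uses a different delimiter:  
(9) pass sep='separator' together with engine='python';  
*Note: with engine='python', a separator longer than one character is interpreted as a regular expression, so special characters such as | must be escaped.*   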
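
The escaping rule can be illustrated with a short sketch. The text content and the `||||` delimiter below are hypothetical; the point is only that each `|` has to be escaped because the separator is treated as a regular expression.

```python
import io
import pandas as pd

# Hypothetical text content whose fields are separated by "||||".
txt = "col1||||col2\nTS||||This is an apple.\nGQ||||My name is Bob.\n"

# "|" is a regex metacharacter, so each one is escaped with a backslash.
df_txt = pd.read_table(io.StringIO(txt), sep=r"\|\|\|\|", engine="python")
print(df_txt)
```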

### 1.2 Writing Data
(1) csv: df_csv.to_csv(' ');  
(2) Excel: df_excel.to_excel(' ');  
(3) txt: df_txt.to_csv(' ', sep='\t');
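
A small round trip can confirm that the tab-separated form in (3) reads back cleanly. This is a sketch that writes to an in-memory buffer instead of a file; the table and the choice of `index=False` (which leaves the row index out of the output) are assumptions for illustration.

```python
import io
import pandas as pd

df_out = pd.DataFrame({"Number": [1, 2, 3], "EN": ["one", "two", "three"]})

# Write tab-separated text to a buffer; index=False omits the row index.
buf = io.StringIO()
df_out.to_csv(buf, sep="\t", index=False)
text = buf.getvalue()
print(text)

# Read it back with the same separator to confirm nothing was lost.
df_back = pd.read_csv(io.StringIO(text), sep="\t")
print(df_back.equals(df_out))
```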



## 2. Basic Data Structures  
Pandas stores one-dimensional data in a Series and two-dimensional data in a DataFrame.  
### 2.1 Series  
A Series has four components: data, index, dtype, and name.


In [2]:
S=pd.Series(data=[1,2,3], index=['one','two','three'], dtype='object', name='Series1')
S

one      1
two      2
three    3
Name: Series1, dtype: object

In [3]:
print(S.values, S.index, S.dtype, S.name) #Notes: no S.data

[1 2 3] Index(['one', 'two', 'three'], dtype='object') object Series1


In [4]:
print(S['one']) #use index to get the value

1


### 2.2 DataFrame
A DataFrame adds a second dimension, so it has a column index in addition to the row index of a Series.

In [5]:
D=pd.DataFrame(data=[[1,'one'], [2,'two'], [3,'three']], index=['n1','n2','n3'], columns=['Number','EN'])
D

Unnamed: 0,Number,EN
n1,1,one
n2,2,two
n3,3,three


In [6]:
# Alternatively, create the DataFrame from a dict that maps each column name to its values
D1=pd.DataFrame(data={'Number':[1,2,3], 'EN':['one','two','three']}, index=['n1','n2','n3'])
D1

Unnamed: 0,Number,EN
n1,1,one
n2,2,two
n3,3,three


In [7]:
print(D.values, D.index, D.columns, D.dtypes, D.shape)

[[1 'one']
 [2 'two']
 [3 'three']] Index(['n1', 'n2', 'n3'], dtype='object') Index(['Number', 'EN'], dtype='object') Number     int64
EN        object
dtype: object (3, 2)


In [8]:
print(D['EN']) # [] with a column name selects a column; use loc/iloc to select rows

n1      one
n2      two
n3    three
Name: EN, dtype: object
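
To make the difference between column and row selection concrete, here is a short sketch on the same small table. It rebuilds `D` so that it runs on its own.

```python
import pandas as pd

D = pd.DataFrame(
    data=[[1, "one"], [2, "two"], [3, "three"]],
    index=["n1", "n2", "n3"],
    columns=["Number", "EN"],
)

col = D["EN"]                 # a column, returned as a Series
row_by_label = D.loc["n2"]    # a row selected by its index label
row_by_pos = D.iloc[1]        # the same row selected by integer position
cell = D.loc["n3", "Number"]  # a single value: row label, then column label

print(col.tolist())
print(row_by_label.tolist())
print(cell)
```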


## 3. Common Basic Functions


In [9]:
df=pd.read_csv('/Users/jie/Documents/Python/joyful-pandas-master/data/learn_pandas.csv')
df.head()

Unnamed: 0,School,Grade,Name,Gender,Height,Weight,Transfer,Test_Number,Test_Date,Time_Record
0,Shanghai Jiao Tong University,Freshman,Gaopeng Yang,Female,158.9,46.0,N,1,2019/10/5,0:04:34
1,Peking University,Freshman,Changqiang You,Male,166.5,70.0,N,1,2019/9/4,0:04:20
2,Shanghai Jiao Tong University,Senior,Mei Sun,Male,188.9,89.0,N,2,2019/9/12,0:05:22
3,Fudan University,Sophomore,Xiaojuan Sun,Female,,41.0,N,2,2020/1/3,0:04:08
4,Fudan University,Sophomore,Gaojuan You,Male,174.0,74.0,N,2,2019/11/6,0:05:22


In [10]:
#only need the first 7 cols
df=df[df.columns[:7]]
df.head()

Unnamed: 0,School,Grade,Name,Gender,Height,Weight,Transfer
0,Shanghai Jiao Tong University,Freshman,Gaopeng Yang,Female,158.9,46.0,N
1,Peking University,Freshman,Changqiang You,Male,166.5,70.0,N
2,Shanghai Jiao Tong University,Senior,Mei Sun,Male,188.9,89.0,N
3,Fudan University,Sophomore,Xiaojuan Sun,Female,,41.0,N
4,Fudan University,Sophomore,Gaojuan You,Male,174.0,74.0,N


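### 3.1 Summary Functions  
(1) head(n): the first n rows, five by default;  
(2) tail(n): the last n rows, five by default;  
(3) info(): information about the table;  
(4) describe(): basic descriptive statistics of the table (for numeric columns);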

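### 3.2 Summary Statistics
df.sum(), df.mean(), df.max(), df.min(), df.median(), df.std(), df.var();   
df.quantile() # e.g. 0.25, 0.5, 0.75;  
df.count() # number of non-missing values;  
df.idxmax() # index of the maximum;   
An axis argument can be added in the parentheses: axis=0 aggregates down each column, axis=1 aggregates across each row. 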
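
The effect of `axis` and of missing values is easiest to see on a tiny table. The numbers below are hypothetical and the frame is named `demo` to keep it apart from the `df` used in these notes.

```python
import pandas as pd

# Hypothetical numeric table; the second Height value is missing.
demo = pd.DataFrame(
    {"Height": [158.9, None, 188.9, 174.0], "Weight": [46.0, 70.0, 89.0, 74.0]}
)

print(demo.mean())                      # axis=0 (default): one value per column
print(demo.mean(axis=1))                # axis=1: one value per row
print(demo.count())                     # non-missing values per column
print(demo.idxmax())                    # index label of each column's maximum
print(demo.quantile([0.25, 0.5, 0.75]))
```

Missing values are skipped by default, so the row with a missing Height still gets a mean from its Weight alone.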

### 3.3 Unique-Value Functions
(1) unique: the unique values; nunique: the number of unique values;


In [11]:
#e.g.:
print(df['Grade'].unique())
print(df['Grade'].nunique())

['Freshman' 'Senior' 'Sophomore' 'Junior']
4


(2) value_counts: the unique values and their counts;

In [12]:
#e.g.:
print(df['Grade'].value_counts())

Junior       59
Senior       55
Freshman     52
Sophomore    34
Name: Grade, dtype: int64


(3) drop_duplicates(): drop duplicate rows;  
it takes the parameter keep=' ' # 'first' (default), 'last', or False.  
(4) duplicated: returns True/False to mark whether each record is a duplicate;

In [13]:
#e.g.:
df.drop_duplicates(['Gender','Transfer'], keep='last')

Unnamed: 0,School,Grade,Name,Gender,Height,Weight,Transfer
147,Peking University,Senior,Juan You,Male,169.2,69.0,
150,Tsinghua University,Junior,Chengpeng You,Male,170.7,70.0,Y
169,Tsinghua University,Junior,Chengquan Qin,Female,160.7,52.0,Y
194,Peking University,Senior,Yanmei Qian,Female,160.3,49.0,
197,Shanghai Jiao Tong University,Senior,Chengqiang Chu,Female,153.9,45.0,N
199,Tsinghua University,Sophomore,Chunpeng Lv,Male,155.7,51.0,N


In [14]:
df.duplicated(['Gender','Transfer']).head()

0    False
1    False
2     True
3     True
4     True
dtype: bool

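### 3.4 Replacement Functions
(1) replace: replace certain values in a column;
the replacement can be specified explicitly, or taken from the preceding value (ffill) or the following value (bfill).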

In [15]:
#e.g.:
print(df['Gender'].replace({'Female':'f','Male':'m'}).head())
print(df['Gender'].replace(['Female'], method='ffill').head())

0    f
1    m
2    m
3    f
4    m
Name: Gender, dtype: object
0    Female
1      Male
2      Male
3      Male
4      Male
Name: Gender, dtype: object


(2) where: replaces the values where the condition is False;  
(3) mask: replaces the values where the condition is True;

In [16]:
#e.g.: 
print(df['Gender'].where(df['Gender']=='Female','f').head())
print(df['Gender'].mask(df['Gender']=='Female','f').head())

0    Female
1         f
2         f
3    Female
4         f
Name: Gender, dtype: object
0       f
1    Male
2    Male
3       f
4    Male
Name: Gender, dtype: object


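(4) Numeric replacement: round(), abs(), clip(lower, upper);   
Exercise: clip can only truncate out-of-range values to the boundary values; to replace them with custom values instead, consider where or mask: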

In [17]:
#e.g.:
s1=pd.Series([-1,1.2345,100,-50])
print(s1.clip(0,2))
print(s1.mask(s1<0, -1).mask(s1>2, 10))

0    0.0000
1    1.2345
2    2.0000
3    0.0000
dtype: float64
0    -1.0000
1     1.2345
2    10.0000
3    -1.0000
dtype: float64


### 3.5 Sorting Functions
(1) Sort by value: sort_values('colnames'); # ascending=True (default) / False;   
(2) Sort by index: sort_index(level=['levelnames']); # level takes index level names, not column names; ascending=True (default) / False;

In [18]:
#e.g.:
print(df.sort_values(['Weight','Height'], ascending=[False,False]).head())
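# note: level refers to index levels; df has a default integer index with no 'Grade' level, so the rows below are not ordered by Grade
print(df.sort_index(level=['Grade'], ascending=[False]).head())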

                            School     Grade            Name Gender  Height  \
2    Shanghai Jiao Tong University    Senior         Mei Sun   Male   188.9   
38               Peking University  Freshman       Qiang Han   Male   185.3   
23   Shanghai Jiao Tong University    Senior     Qiang Zheng   Male   183.9   
134  Shanghai Jiao Tong University    Senior      Gaoli Zhao   Male   186.5   
99               Peking University  Freshman  Changpeng Zhao   Male   181.3   

     Weight Transfer  
2      89.0        N  
38     87.0        N  
23     87.0        N  
134    83.0        N  
99     83.0        N  
                          School      Grade            Name  Gender  Height  \
0  Shanghai Jiao Tong University   Freshman    Gaopeng Yang  Female   158.9   
1              Peking University   Freshman  Changqiang You    Male   166.5   
2  Shanghai Jiao Tong University     Senior         Mei Sun    Male   188.9   
3               Fudan University  Sophomore    Xiaojuan Sun  Female    
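
Since `level` refers to index levels, sorting by Grade through `sort_index` needs Grade to be in the index first. The sketch below uses a hypothetical four-row table that mirrors a few of the columns above; `set_index` is my addition and not part of the original cell.

```python
import pandas as pd

# Hypothetical rows mirroring some columns of the table above.
demo = pd.DataFrame(
    {
        "Grade": ["Freshman", "Senior", "Freshman", "Junior"],
        "Name": ["B", "A", "A", "C"],
        "Weight": [46.0, 89.0, 70.0, 52.0],
    }
)

# Move Grade and Name into the index, then sort on those index levels.
indexed = demo.set_index(["Grade", "Name"])
by_levels = indexed.sort_index(level=["Grade", "Name"], ascending=[False, True])
print(by_levels)
```

Here Grade is sorted in descending order and, within each Grade, Name in ascending order.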

### 3.6 The apply Method
apply is commonly used to iterate over the rows or columns of a DataFrame: define a function, then apply it to each row or each column.

In [19]:
#e.g.:
df[['Weight','Height']].apply(lambda x: x.max(), axis=0)

Weight     89.0
Height    193.9
dtype: float64
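
The example above iterates over columns (axis=0). As a sketch of the row-wise case, the snippet below applies a function with axis=1 to a hypothetical table; the BMI formula is only an illustrative row-wise computation.

```python
import pandas as pd

demo = pd.DataFrame({"Height": [158.9, 166.5, 188.9], "Weight": [46.0, 70.0, 89.0]})

# axis=0: the function receives each column as a Series.
col_max = demo.apply(lambda x: x.max(), axis=0)

# axis=1: the function receives each row as a Series.
bmi = demo.apply(lambda row: row["Weight"] / (row["Height"] / 100) ** 2, axis=1)

print(col_max)
print(bmi.round(2))
```

For simple arithmetic like this, operating on whole columns directly (`demo["Weight"] / (demo["Height"] / 100) ** 2`) gives the same values and is usually faster than a row-wise apply.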

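## 4. Window Functions
Define an interval of fixed size (the window) and slide it along the data, so that the same function is computed over an interval of the same length at every position.

### 4.1 Rolling Windows
Call rolling with the window parameter to set the window size: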

In [20]:
#e.g.:
s2=pd.Series([1,2,3,4,5])
roller=s2.rolling(window=2) # the key step!
roller.sum()
# for each position, sum the two values that end at that position

0    NaN
1    3.0
2    5.0
3    7.0
4    9.0
dtype: float64

In [21]:
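# shift, diff, and pct_change are also regarded as window-like functions
# shift moves values by position; it can be used to difference a time series
print(s2.shift(1))
# pct_change computes the growth rate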
print(s2.pct_change())

0    NaN
1    1.0
2    2.0
3    3.0
4    4.0
dtype: float64
0         NaN
1    1.000000
2    0.500000
3    0.333333
4    0.250000
dtype: float64
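
The link between shift, diff, and pct_change can be checked directly. This sketch rebuilds the same five-element series under the name `s` and derives both results from shift.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

diff_builtin = s.diff(1)             # x[t] - x[t-1]
diff_manual = s - s.shift(1)         # the same result, built from shift
growth_manual = s / s.shift(1) - 1   # the growth rate that pct_change computes

print(diff_builtin.equals(diff_manual))
print((s.pct_change() - growth_manual).abs().max())
```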


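Exercise: by default a rolling window covers the current element and the ones before it; for a window that instead covers the current element and the ones after it, consider: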

In [22]:
s2.rolling(window=2).sum().shift(-1)

0    3.0
1    5.0
2    7.0
3    9.0
4    NaN
dtype: float64
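
The shift-based answer generalizes to any window size by shifting up by `window - 1` positions. As a cross-check, the sketch below compares it with a second approach, reversing the series before and after rolling; the reversal trick is my own alternative, not part of the exercise answer.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])
window = 2

# Backward-looking sums, shifted up by window - 1 positions.
by_shift = s.rolling(window=window).sum().shift(-(window - 1))

# Alternative: reverse the series, roll, and reverse back.
by_reverse = s[::-1].rolling(window=window).sum()[::-1]

print(by_shift.equals(by_reverse))
print(by_shift.tolist())
```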

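### 4.2 Expanding Windows
expanding: a window of dynamic length. For a sequence of length n, the window grows from length 1 to length n, and the same computation is applied at each step.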

In [23]:
#e.g.:
s2.expanding().sum()

0     1.0
1     3.0
2     6.0
3    10.0
4    15.0
dtype: float64

Exercise: use the expanding object to implement the typical expanding-style functions cummax, cumsum, and cumprod:

In [24]:
#cummax:
s2.expanding().max()

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
dtype: float64

In [25]:
#cumsum:
s2.expanding().sum()

0     1.0
1     3.0
2     6.0
3    10.0
4    15.0
dtype: float64

In [26]:
#cumprod:
# define a product function, then apply it over the expanding window
s2.expanding().apply(lambda x: x.prod())
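
The three exercise answers can be verified against the built-in cumulative methods. This is a sketch on the same five-element series, rebuilt here as `s`; the product is computed by applying a function to each expanding window.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

exp_max = s.expanding().max()
exp_sum = s.expanding().sum()
# The product is applied through a custom function on each window.
exp_prod = s.expanding().apply(lambda x: x.prod())

# The expanding results are floats, so the cumulative ones are cast before comparing.
print(exp_max.equals(s.cummax().astype(float)))
print(exp_sum.equals(s.cumsum().astype(float)))
print(exp_prod.equals(s.cumprod().astype(float)))
```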

## 5. Exercises
EX1: Pokémon dataset

In [27]:
df1=pd.read_csv('/Users/jie/Documents/Python/joyful-pandas-master/data/pokemon.csv')
df1.head()

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80
4,4,Charmander,Fire,,309,39,52,43,60,50,65


In [28]:
#1. 
df1_1=df1[['HP','Attack','Defense','Sp. Atk','Sp. Def','Speed']].sum(axis=1)
print((df1['Total']==df1_1).value_counts()) #all Trues

True    800
dtype: int64


In [29]:
#2.
df1_2=df1.drop_duplicates(['#'], keep='first')

In [30]:
#2.a
print('Number of distinct Type 1 values: ', df1_2['Type 1'].nunique())
print('Three most common Type 1 values: \n', df1_2['Type 1'].value_counts().head(3))

Number of distinct Type 1 values:  18
Three most common Type 1 values: 
 Water     105
Normal     93
Grass      66
Name: Type 1, dtype: int64


In [31]:
#2.b
df1_2_b=df1_2[['Type 1', 'Type 2']].drop_duplicates()
df1_2_b

Unnamed: 0,Type 1,Type 2
0,Grass,Poison
4,Fire,
6,Fire,Flying
9,Water,
13,Bug,
...,...,...
773,Rock,Fairy
778,Ghost,Grass
790,Flying,Dragon
797,Psychic,Ghost


In [32]:
#2.c
df1_2_c1=[a+' '+b for a in df1_2['Type 1'].unique() for b in df1_2['Type 1'].unique()]
df1_2_c2=[a+' '+b for a in df1_2_b['Type 1'] for b in df1_2_b['Type 2'].replace(np.nan, '')]
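# Type 2 has missing values, so this kept failing with 'can only concatenate str (not "float") to str' until the missing values were replaced
# In the reference answer, the first list df1_2_c1 (aka full) also includes an extra '' as a second element, so that it matches the entries produced by replacing the missing values
# caution: the nested loop for df1_2_c2 pairs every Type 1 with every Type 2 instead of pairing them row by row (as zip would), which is why the difference below is empty
# take the set difference of the two lists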
set(df1_2_c1).difference(set(df1_2_c2))

set()
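
An empty set here means no combination is reported as missing, which comes from looping over both columns independently. A minimal sketch of the row-wise pairing, on a hypothetical four-row miniature of the two columns, could look as follows; `fillna('')` is used in place of `replace(np.nan, '')` and the variable names are my own.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the Type 1 / Type 2 columns.
types = pd.DataFrame(
    {"Type 1": ["Grass", "Fire", "Fire", "Water"],
     "Type 2": ["Fire", np.nan, "Water", np.nan]}
)

kinds = types["Type 1"].unique().tolist()
# Every conceivable combination, including "no second type" (empty string).
full = {a + " " + b for a in kinds for b in kinds + [""]}
# Only the combinations that occur, paired row by row with zip.
seen = {a + " " + b for a, b in zip(types["Type 1"], types["Type 2"].fillna(""))}

missing = full.difference(seen)
print(len(full), len(seen), len(missing))
```

In this toy table 12 combinations are conceivable, 4 occur, and 8 are missing.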

In [33]:
#3.
#3.a
df1_3_a=df1['Attack'].mask(df1['Attack']>120, 'high').mask(df1['Attack']<50, 'low').mask((df1['Attack']>=50) & (df1['Attack']<=120), 'mid')
df1_3_a

0       low
1       mid
2       mid
3       mid
4       mid
       ... 
795     mid
796    high
797     mid
798    high
799     mid
Name: Attack, Length: 800, dtype: object
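
The chained mask calls work because every condition is evaluated on the original Attack column. An alternative sketch, not the approach used above, is `np.select`, which lists the conditions once and assigns a default to everything else. The sample values are hypothetical.

```python
import numpy as np
import pandas as pd

attack = pd.Series([49, 62, 130, 50, 120], name="Attack")

# Conditions are checked in order; rows matching none of them get the default.
labels = np.select([attack < 50, attack > 120], ["low", "high"], default="mid")
binned = pd.Series(labels, index=attack.index, name="Attack")
print(binned.tolist())
```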

In [34]:
#3.b
df1_3_b1=df1['Type 1'].replace({x:str.upper(x) for x in df1['Type 1']})
df1_3_b1

0        GRASS
1        GRASS
2        GRASS
3        GRASS
4         FIRE
        ...   
795       ROCK
796       ROCK
797    PSYCHIC
798    PSYCHIC
799       FIRE
Name: Type 1, Length: 800, dtype: object

In [35]:
df1_3_b2=df1['Type 1'].apply(lambda x: str.upper(x))
df1_3_b2

0        GRASS
1        GRASS
2        GRASS
3        GRASS
4         FIRE
        ...   
795       ROCK
796       ROCK
797    PSYCHIC
798    PSYCHIC
799       FIRE
Name: Type 1, Length: 800, dtype: object

In [36]:
#3.c
a=[(df1.iloc[:,i]-df1[['HP','Attack','Defense','Sp. Atk','Sp. Def','Speed']].median(axis=1)).abs() for i in [5,6,7,8,9,10]]
df1['Deviation']=np.max(a,axis=0)
df1.sort_values('Deviation', ascending=False)

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Deviation
230,213,Shuckle,Bug,Rock,505,20,10,230,10,230,5,215.0
121,113,Chansey,Normal,,450,250,5,5,35,105,50,207.5
261,242,Blissey,Normal,,540,255,10,10,75,135,55,190.0
333,306,AggronMega Aggron,Steel,,630,70,140,230,60,80,50,155.0
224,208,SteelixMega Steelix,Steel,Ground,610,75,125,230,55,95,30,145.0
...,...,...,...,...,...,...,...,...,...,...,...,...
143,132,Ditto,Normal,,288,48,48,48,48,48,48,0.0
165,151,Mew,Psychic,,600,100,100,100,100,100,100,0.0
255,236,Tyrogue,Fighting,,210,35,35,35,35,35,35,0.0
206,191,Sunkern,Grass,,180,30,30,30,30,30,30,0.0
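
The same deviation can also be computed without the list comprehension or positional column numbers. The sketch below is an alternative formulation on a hypothetical two-row table with only three stat columns; `sub(..., axis=0)` subtracts each row's median from that row.

```python
import pandas as pd

# Hypothetical stats for two creatures, using three of the six columns.
stats = pd.DataFrame(
    {"HP": [20, 48], "Attack": [10, 48], "Defense": [230, 48]},
    index=["Shuckle-like", "Ditto-like"],
)

row_median = stats.median(axis=1)
# sub(..., axis=0) aligns the medians with the rows before subtracting.
deviation = stats.sub(row_median, axis=0).abs().max(axis=1)
print(deviation.sort_values(ascending=False))
```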


EX2: Exponentially weighted window

In [37]:
#1.
np.random.seed(0)
s2_1=pd.Series(np.random.randint(-1,2,30).cumsum())
s2_1.head()

0   -1
1   -1
2   -2
3   -2
4   -2
dtype: int64

In [38]:
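# I kept trying to do this with a lambda, but it grew so long that I confused myself; also, the sum was evaluated over the whole series before expanding was applied...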
#s2_1.expanding().apply(lambda i: (((1-0.2)**np.array([(29-i) for i in range(30)]))*(np.array([s2_1[i] for i in range(30)])).sum()/((1-0.2)**np.array([(29-i) for i in range(30)])).sum())

In [39]:
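# Pasting the reference answer; the benefit of defining a function first appears to be that expanding runs first and the function is then evaluated on each window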
def ewm_func(x, alpha=0.2):
    win = (1-alpha)**np.arange(x.shape[0])[::-1]
    res = (win*x).sum()/win.sum()
    return res
s2_1.expanding().apply(ewm_func).head()

0   -1.000000
1   -1.000000
2   -1.409836
3   -1.609756
4   -1.725845
dtype: float64

In [40]:
#2. 
s2_1.rolling(window=5).apply(ewm_func).head()

0         NaN
1         NaN
2         NaN
3         NaN
4   -1.725845
dtype: float64
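
As a sanity check on the expanding version, its result should agree with the built-in `ewm` using the same alpha, since the default `adjust=True` uses the same normalized weights. The sketch below repeats `ewm_func` and the seeded series so that it runs on its own.

```python
import numpy as np
import pandas as pd

def ewm_func(x, alpha=0.2):
    # Weights (1 - alpha)**k, with k = 0 for the most recent value.
    win = (1 - alpha) ** np.arange(x.shape[0])[::-1]
    return (win * x).sum() / win.sum()

np.random.seed(0)
s = pd.Series(np.random.randint(-1, 2, 30).cumsum())

manual = s.expanding().apply(ewm_func)
builtin = s.ewm(alpha=0.2).mean()  # adjust=True is the default

print(np.allclose(manual, builtin))
```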

Summary: my own approach was too narrow; I hope to broaden my thinking. Keep going!