<a href="https://colab.research.google.com/github/smiledinisa/data_python_analysis/blob/master/pandas001.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chapter 5: Getting Started with pandas

![Alt text](https://img-blog.csdnimg.cn/20200807151248699.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2dhb2N1aTg4Mw==,size_16,color_FFFFFF,t_70)

In [None]:
from pandas import Series, DataFrame
import pandas as pd


## Introduction to pandas Data Structures

### Series
A Series consists of an array of data (a NumPy array) and an associated array of data labels, called its index.

In [None]:
obj = Series([4,7,-5,3])
obj

0    4
1    7
2   -5
3    3
dtype: int64

In [None]:
obj.values

array([ 4,  7, -5,  3])

In [None]:
obj.index

RangeIndex(start=0, stop=4, step=1)

In [None]:
obj2 = Series([4,7,5,3], index=['a', 'b', 'c', 'd'])

In [None]:
obj2

a    4
b    7
c    5
d    3
dtype: int64

In [None]:
obj2['d']

3

In [None]:
# Create a Series from a dict.
sdata = {'ohi':3500, 'texa':7100, 'orgen':1900, 'utah':4000}
obj3 = Series(sdata)

In [None]:
obj3

ohi      3500
texa     7100
orgen    1900
utah     4000
dtype: int64

In [None]:
states = ['aiaang','ohi',  'orgen', 'utah']
obj4 = Series(sdata, index=states)

In [None]:
obj4

aiaang       NaN
ohi       3500.0
orgen     1900.0
utah      4000.0
dtype: float64

In [None]:
# NaN marks a missing value.
# Use isnull and notnull to detect missing data.


In [None]:
pd.isnull(obj4)

aiaang     True
ohi       False
orgen     False
utah      False
dtype: bool

In [None]:
pd.notnull(obj4)

aiaang    False
ohi        True
orgen      True
utah       True
dtype: bool

In [None]:
obj4.isnull()

aiaang     True
ohi       False
orgen     False
utah      False
dtype: bool

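All three forms above work: the top-level functions pd.isnull and pd.notnull, and the instance method obj4.isnull().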
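
As a small self-contained sketch of the checks above (the Series is rebuilt here so the block runs on its own), the function form and the method form give identical results. `isna` and `notna` are aliases of `isnull` and `notnull` in pandas.

```python
import numpy as np
import pandas as pd

# Rebuild a Series like obj4, with one missing entry.
s = pd.Series([np.nan, 3500.0, 1900.0, 4000.0],
              index=['aiaang', 'ohi', 'orgen', 'utah'])

# Function form and method form return the same boolean Series.
same_null = pd.isnull(s).equals(s.isnull())
same_notnull = pd.notnull(s).equals(s.notnull())

# isna/notna are aliases of isnull/notnull.
same_alias = s.isna().equals(s.isnull())

# A boolean mask can be summed to count the missing values.
n_missing = int(s.isnull().sum())
print(same_null, same_notnull, same_alias, n_missing)
```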

In [None]:
obj3

ohi      3500
texa     7100
orgen    1900
utah     4000
dtype: int64

In [None]:
obj4

aiaang       NaN
ohi       3500.0
orgen     1900.0
utah      4000.0
dtype: float64

In [None]:
obj3+obj4

aiaang       NaN
ohi       7000.0
orgen     3800.0
texa         NaN
utah      8000.0
dtype: float64
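
The NaN entries in `obj3 + obj4` come from index alignment: a label that is present in only one operand (`texa`) or already missing in one (`aiaang`) yields NaN. A minimal sketch, using the same data, of how `Series.add` with `fill_value` treats a value missing on one side as 0 instead:

```python
import pandas as pd

sdata = {'ohi': 3500, 'texa': 7100, 'orgen': 1900, 'utah': 4000}
obj3 = pd.Series(sdata)
obj4 = pd.Series(sdata, index=['aiaang', 'ohi', 'orgen', 'utah'])

# Plain addition: NaN wherever either side has no value for the label.
plain = obj3 + obj4

# add with fill_value: a value missing on one side only counts as 0.
# A label missing on both sides (aiaang) still gives NaN.
filled = obj3.add(obj4, fill_value=0)

print(plain)
print(filled)
```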

Both a Series object and its index have a name attribute.


In [None]:
obj4.name = 'population'
obj4.index.name = 'state'

In [None]:
obj4

state
aiaang       NaN
ohi       3500.0
orgen     1900.0
utah      4000.0
Name: population, dtype: float64

### DataFrame

In [None]:
import numpy as np
import pandas as pd
from pandas import Series
from pandas import DataFrame
# Build a DataFrame from a dict of equal-length lists.
data = {'state': ['Ohio','Ohio','Ohio','Nevada','Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)

In [None]:
frame

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9


In [None]:
DataFrame(data, columns=['year', 'state', 'pop'])

   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9


In [None]:
frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'], index=['one', 'two', 'three', 'four', 'five'])

In [None]:
frame2

       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7   NaN
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4   NaN
five   2002  Nevada  2.9   NaN


In [None]:
frame2['year']

one      2000
two      2001
three    2002
four     2001
five     2002
Name: year, dtype: int64

In [None]:
frame2.index

Index(['one', 'two', 'three', 'four', 'five'], dtype='object')

In [None]:
frame2.year

one      2000
two      2001
three    2002
four     2001
five     2002
Name: year, dtype: int64

In [None]:
frame2.state

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object

In [None]:
frame2['state']

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object
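
Attribute-style access such as `frame2.year` only works when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute. The `pop` column used here is such a clash, because `DataFrame.pop` is a method. A short sketch with a cut-down version of the data:

```python
import pandas as pd

frame = pd.DataFrame({'year': [2000, 2001], 'pop': [1.5, 1.7]})

# 'year' does not collide with anything, so both forms return the column.
same = frame.year.equals(frame['year'])

# 'pop' collides with the DataFrame.pop method: attribute access
# returns the bound method, not the column.
attr_is_method = callable(frame.pop)

# Bracket indexing always returns the column.
col = frame['pop']
print(same, attr_is_method, list(col))
```

For this reason bracket indexing is the safer default when column names are not under your control.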

In [None]:
# Rows can also be retrieved by label with the loc attribute (and by position with iloc).
frame2.loc['three']

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object

In [None]:
# The empty 'debt' column can be filled by assigning a scalar.
frame2['debt'] = 100
frame2

       year   state  pop  debt
one    2000    Ohio  1.5   100
two    2001    Ohio  1.7   100
three  2002    Ohio  3.6   100
four   2001  Nevada  2.4   100
five   2002  Nevada  2.9   100


In [None]:
frame2['debt'] = np.arange(5.)
frame2

       year   state  pop  debt
one    2000    Ohio  1.5   0.0
two    2001    Ohio  1.7   1.0
three  2002    Ohio  3.6   2.0
four   2001  Nevada  2.4   3.0
five   2002  Nevada  2.9   4.0


In [None]:
# Note: when a list or array is assigned to a column, its length must match the DataFrame's length.
# When a Series is assigned, its labels are realigned exactly to the DataFrame's index,

# and NaN is inserted wherever a label is missing.

val = pd.Series([1.5, 2.5, -3.6], index=['one', 'three', 'five'])
frame2['debt'] = val

frame2

# The result looks like this:

       year   state  pop  debt
one    2000    Ohio  1.5   1.5
two    2001    Ohio  1.7   NaN
three  2002    Ohio  3.6   2.5
four   2001  Nevada  2.4   NaN
five   2002  Nevada  2.9  -3.6


In [None]:
# Assigning to a column that does not exist creates it.
frame2['eastern'] = frame2.state == 'Ohio'
frame2
# Note: new columns cannot be created with the frame2.eastern attribute syntax.

       year   state  pop  debt  eastern
one    2000    Ohio  1.5   1.5     True
two    2001    Ohio  1.7   NaN     True
three  2002    Ohio  3.6   2.5     True
four   2001  Nevada  2.4   NaN    False
five   2002  Nevada  2.9  -3.6    False


In [None]:
frame2.columns

Index(['year', 'state', 'pop', 'debt', 'eastern'], dtype='object')

In [None]:
del frame2['eastern']

In [None]:
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

In [None]:
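# Note: a column cannot be deleted with attribute syntax either (del frame2.xxx); use del frame2['xxx'].
# In the pandas version used here, the column returned by indexing a DataFrame is a view on the underlying data, not a copy.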



In [None]:
# Another common form of data is a nested dict of dicts:
# the outer keys become the columns and the inner keys become the row index.

pop = {'Nevada':{2001: 2.4, 2002: 2.9},
       'Ohio':{2000:1.5, 2001:1.7, 2002:3.6}}
       

In [None]:
pop

{'Nevada': {2001: 2.4, 2002: 2.9}, 'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

In [None]:
frame3 = DataFrame(pop)
frame3

      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2000     NaN   1.5


In [None]:
frame3.T

        2001  2002  2000
Nevada   2.4   2.9   NaN
Ohio     1.7   3.6   1.5


In [None]:
pd.DataFrame(pop, index=[2001, 2002, 2003])

      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2003     NaN   NaN


In [None]:
frame3.index.name = 'year'
frame3.columns.name = 'state'

In [None]:
frame3

state  Nevada  Ohio
year
2001      2.4   1.7
2002      2.9   3.6
2000      NaN   1.5


In [None]:
frame3.values  # a two-dimensional ndarray

array([[2.4, 1.7],
       [2.9, 3.6],
       [nan, 1.5]])

In [None]:
frame2

       year   state  pop  debt
one    2000    Ohio  1.5   1.5
two    2001    Ohio  1.7   NaN
three  2002    Ohio  3.6   2.5
four   2001  Nevada  2.4   NaN
five   2002  Nevada  2.9  -3.6


In [None]:
frame2.values

array([[2000, 'Ohio', 1.5, 1.5],
       [2001, 'Ohio', 1.7, nan],
       [2002, 'Ohio', 3.6, 2.5],
       [2001, 'Nevada', 2.4, nan],
       [2002, 'Nevada', 2.9, -3.6]], dtype=object)
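
The `dtype=object` in the last result appears because the columns of `frame2` hold different types, so the array falls back to a dtype that can hold all of them. A small sketch, with shortened data, contrasting this with an all-numeric frame:

```python
import pandas as pd

mixed = pd.DataFrame({'year': [2000, 2001], 'state': ['Ohio', 'Nevada']})
numeric = pd.DataFrame({'Nevada': [2.4, 2.9], 'Ohio': [1.7, 3.6]})

# Mixed int/str columns: the 2-D array has to use the generic object dtype.
mixed_dtype = mixed.values.dtype

# All-float columns: the array keeps a numeric dtype.
numeric_dtype = numeric.values.dtype

print(mixed_dtype, numeric_dtype)
```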

All the kinds of data that can be passed to the DataFrame constructor:

![Possible data inputs to the DataFrame constructor](https://img-blog.csdnimg.cn/2020080722105763.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2dhb2N1aTg4Mw==,size_16,color_FFFFFF,t_70)

### Index Objects
pandas’s Index objects are responsible for holding the axis labels and other metadata
(like the axis name or names). Any array or other sequence of labels you use when
constructing a Series or DataFrame is internally converted to an Index:

In [None]:
frame3 = DataFrame({'Nevada': {2001: 2.4, 2002: 2.9}, 'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}})

In [None]:
frame3

      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2000     NaN   1.5


In [None]:
frame3.columns

Index(['Nevada', 'Ohio'], dtype='object')

In [None]:
# Like a set, an Index supports membership tests.
'Ohio' in frame3.columns

True

In [None]:
2001 in frame3.index

True

In [None]:
dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])
dup_labels  # unlike a set, an Index can contain duplicate labels

Index(['foo', 'foo', 'bar', 'bar'], dtype='object')

The methods and properties of Index objects:

![Index methods and properties](https://img-blog.csdnimg.cn/20200808095245189.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2dhb2N1aTg4Mw==,size_16,color_FFFFFF,t_70)

In [None]:
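# delete returns a new Index with the entry at position 2 removed;
# dup_labels itself is unchanged because an Index is immutable.
dup_labels.delete(2)
dup_labels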

Index(['foo', 'foo', 'bar', 'bar'], dtype='object')

In [None]:
dup_labels.delete(3)
dup_labels

Index(['foo', 'foo', 'bar', 'bar'], dtype='object')
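
The unchanged outputs above follow from Index objects being immutable: `delete` returns a new Index and leaves the original alone. A minimal sketch of both facts:

```python
import pandas as pd

dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])

# delete returns a new Index; keep the result in order to use it.
shorter = dup_labels.delete(2)

# Assigning into an Index is not allowed and raises TypeError.
try:
    dup_labels[1] = 'baz'
    mutated = True
except TypeError:
    mutated = False

print(list(shorter), list(dup_labels), mutated)
```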

In [None]:
print(frame3)
# Likewise, frame3.index.delete(2) returns a new Index and does not change frame3.
frame3.index.delete(2)
frame3

      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2000     NaN   1.5


      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2000     NaN   1.5


## Essential Functionality
This section covers the fundamental mechanics of interacting with the data contained in a Series or DataFrame.

### Reindexing


In [None]:
obj = pd.Series([4.5, 7.2, -5.3, 3.6],index=['d', 'b', 'a', 'c'])
obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

In [None]:
# Calling reindex on this Series rearranges the data according to the new index; labels not already present get NaN.
obj2 = obj.reindex(['a','b','c','d','e'])

In [None]:
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

In [None]:
# The method keyword fills in values when reindexing; 'ffill' forward-fills from the previous label.
obj3 = pd.Series(['blue', 'purple', 'yellow'],index=[0,2,4])
obj3

0      blue
2    purple
4    yellow
dtype: object

In [None]:
obj3.reindex(range(6), method='ffill')  # a range object as the new index


0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object


With a DataFrame, reindex can change the row index, the columns, or both.

In [None]:
import numpy as np
frame = pd.DataFrame(np.arange(9).reshape((3,3)), index=['a', 'b', 'c'], columns=['Ohio', 'Texas', 'California'])
frame

   Ohio  Texas  California
a     0      1           2
b     3      4           5
c     6      7           8


In [None]:
frame2 = frame.reindex(['a','c','d','b','e'])

In [None]:
frame2

   Ohio  Texas  California
a   0.0    1.0         2.0
c   6.0    7.0         8.0
d   NaN    NaN         NaN
b   3.0    4.0         5.0
e   NaN    NaN         NaN


In [None]:
# Reindex the columns with the columns keyword.
frame3= frame.reindex(index=['a','c','d','b','e'], columns=['Texas', 'Ohio', 'California', 'Utah'])
frame3

   Texas  Ohio  California  Utah
a    1.0   0.0         2.0   NaN
c    7.0   6.0         8.0   NaN
d    NaN   NaN         NaN   NaN
b    4.0   3.0         5.0   NaN
e    NaN   NaN         NaN   NaN


The function arguments:

frame.reindex(labels=None, index=None, columns=None, axis=None, method=None, copy=True, level=None, fill_value=nan, limit=None, tolerance=None)

![reindex function arguments](https://img-blog.csdnimg.cn/20200808103429470.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2dhb2N1aTg4Mw==,size_16,color_FFFFFF,t_70)
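
Two of the `reindex` arguments, sketched on the same small frame: `fill_value` replaces the NaN that reindexing would otherwise introduce, and `columns` reorders or adds columns in the same call.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'b', 'c'],
                     columns=['Ohio', 'Texas', 'California'])

# Without fill_value, the new row 'd' is filled with NaN.
with_nan = frame.reindex(['a', 'b', 'c', 'd'])

# With fill_value, new row and column labels get that value instead of NaN.
with_zero = frame.reindex(index=['a', 'b', 'c', 'd'],
                          columns=['Texas', 'Utah'],
                          fill_value=0)
print(with_nan)
print(with_zero)
```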

### Dropping Entries from an Axis
Entries are dropped along a given axis with the drop method:

obj.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')

In [None]:
# Using the drop method.
obj= pd.Series(np.arange(5.0),index=['a','b','c','d','e'])
obj

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [None]:
new_obj = obj.drop('a')

In [None]:
new_obj

b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [None]:
# with DataFrame
data = pd.DataFrame(np.arange(16).reshape(4,4), index=['Ohio', 'Colorada', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data

          one  two  three  four
Ohio        0    1      2     3
Colorada    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


In [None]:
data.drop(['Colorada', 'Ohio'])

          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15


In [None]:
# axis=0 (the default) drops from the row index.
# To drop columns, pass axis=1 or axis='columns'.
data.drop(['one','two'], axis= 1)

          three  four
Ohio          2     3
Colorada      6     7
Utah         10    11
New York     14    15


In [None]:
# Equivalently, with the columns keyword:
data.drop(columns=['one', 'two'])

          three  four
Ohio          2     3
Colorada      6     7
Utah         10    11
New York     14    15


In [None]:
# Note: drop defaults to inplace=False, so it returns a new Series or DataFrame and leaves the original unchanged.
# To modify the original object itself, pass inplace=True.
data

          one  two  three  four
Ohio        0    1      2     3
Colorada    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


In [None]:
data.drop(index=['Ohio'], columns=['two'], inplace=True)


In [None]:
data

          one  three  four
Colorada    4      6     7
Utah        8     10    11
New York   12     14    15
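
Two details of the `drop` signature shown at the start of this section, as a rough sketch on a rebuilt copy of the data: with the default `errors='raise'` a missing label raises `KeyError`, while `errors='ignore'` skips it; and a call with `inplace=True` returns `None`, so its result should not be assigned. The label `'Texas'` is a made-up one that is not in the index.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape(4, 4),
                    index=['Ohio', 'Colorada', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# Default errors='raise': dropping a label that does not exist fails.
try:
    data.drop(index=['Texas'])
    raised = False
except KeyError:
    raised = True

# errors='ignore': missing labels are skipped, existing ones are dropped.
kept = data.drop(index=['Texas', 'Ohio'], errors='ignore')

# inplace=True modifies data itself and returns None.
result = data.drop(columns=['two'], inplace=True)

print(raised, list(kept.index), result, list(data.columns))
```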


### **Indexing, Selection, and Filtering**


#### Indexing

In [None]:
# Indexing a Series works like indexing a NumPy array, except that index labels can be used as well as integers.
obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
obj

a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64

In [None]:
obj['a']

0.0

In [None]:
obj['d']

3.0

In [None]:
obj[['a','c']]

a    0.0
c    2.0
dtype: float64

In [None]:
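obj.iloc[[1, 3]]  # select by position; plain obj[[1, 3]] relies on a positional fallback that recent pandas deprecates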

b    1.0
d    3.0
dtype: float64

In [None]:
obj[obj < 2]

a    0.0
b    1.0
dtype: float64

In [None]:
# Label slicing differs from normal Python slicing: the endpoint is included.
obj['a':'c']

a    0.0
b    1.0
c    2.0
dtype: float64

In [None]:
# Assigning to a label slice modifies the corresponding values.
obj['a':'c'] = 100

In [None]:
obj

a    100.0
b    100.0
c    100.0
d      3.0
dtype: float64
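
A short sketch of the slicing difference: slicing by label includes the endpoint, while slicing by integer position (here through `iloc`) follows the usual Python rule and excludes it.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

by_label = obj['a':'c']      # endpoint 'c' is included
by_position = obj.iloc[0:2]  # position 2 is excluded

print(list(by_label.index), list(by_position.index))
```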

Indexing into a DataFrame with a single label or a list of labels retrieves one or more columns.

In [None]:
data = pd.DataFrame(np.arange(16).reshape(4,4), index=['Ohio', 'Colorada', 'Utah', 'NewYork'],
                    columns= ['one', 'two', 'trhee', 'four'])
data

          one  two  trhee  four
Ohio        0    1      2     3
Colorada    4    5      6     7
Utah        8    9     10    11
NewYork    12   13     14    15


In [None]:
data.loc['Ohio']

one      0
two      1
trhee    2
four     3
Name: Ohio, dtype: int64

In [None]:
data['two']

Ohio         1
Colorada     5
Utah         9
NewYork     13
Name: two, dtype: int64

In [None]:
data[['trhee','two']]

          trhee  two
Ohio          2    1
Colorada      6    5
Utah         10    9
NewYork      14   13


In [None]:
data[:2]

          one  two  trhee  four
Ohio        0    1      2     3
Colorada    4    5      6     7


In [None]:
data[:1]

      one  two  trhee  four
Ohio    0    1      2     3


In [None]:
data[data['trhee'] > 5 ]

          one  two  trhee  four
Colorada    4    5      6     7
Utah        8    9     10    11
NewYork    12   13     14    15


In [None]:
# The row-selection syntax data[:2] is provided as a convenience.


In [None]:
# boolean DataFrame
data < 5


            one    two  trhee   four
Ohio       True   True   True   True
Colorada   True  False  False  False
Utah      False  False  False  False
NewYork   False  False  False  False


In [None]:
data[data < 5] = 0  # boolean-mask assignment, as with a NumPy array

In [None]:
data

          one  two  trhee  four
Ohio        0    0      0     0
Colorada    0    5      6     7
Utah        8    9     10    11
NewYork    12   13     14    15


#### Selection with loc and iloc


``.iloc[]`` is primarily integer position based (from ``0`` to
``length-1`` of the axis), but may also be used with a boolean
array.

Allowed inputs are:

- An integer, e.g. ``5``.
- A list or array of integers, e.g. ``[4, 3, 0]``.
- A slice object with ints, e.g. ``1:7``.
- A boolean array.
- A ``callable`` function with one argument (the calling Series or
  DataFrame) and that returns valid output for indexing (one of the above).
  This is useful in method chains, when you don't have a reference to the
  calling object, but would like to base your selection on some value.

``.loc[]`` is primarily label based, but may also be used with a
boolean array.

Allowed inputs are:

- A single label, e.g. ``5`` or ``'a'``, (note that ``5`` is
  interpreted as a *label* of the index, and **never** as an
  integer position along the index).
- A list or array of labels, e.g. ``['a', 'b', 'c']``.
- A slice object with labels, e.g. ``'a':'f'``.
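
The label-versus-position distinction matters most when the index itself holds integers. A minimal sketch, using a made-up Series whose integer labels are deliberately out of order, which also shows the callable form mentioned for `iloc`:

```python
import pandas as pd

# Integer labels that do not match the positions 0, 1, 2.
ser = pd.Series(['x', 'y', 'z'], index=[2, 0, 1])

by_label = ser.loc[0]      # the row whose label is 0
by_position = ser.iloc[0]  # the first row, whatever its label

# A callable receives the calling object and returns a valid indexer,
# here the positions of the last two rows.
last_two = ser.iloc[lambda s: [len(s) - 2, len(s) - 1]]

print(by_label, by_position, list(last_two))
```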


In [86]:
# As the docstrings above suggest, iloc selects by integer position and loc selects by label.
data.loc['Colorada', ['two','trhee']]

two      5
trhee    6
Name: Colorada, dtype: int64

In [85]:
data.iloc[2,[3,0,1]]

four    11
one      8
two      9
Name: Utah, dtype: int64

In [87]:
data.iloc[2]

one       8
two       9
trhee    10
four     11
Name: Utah, dtype: int64

In [88]:
# slice 
data.loc[:'Utah', 'two']

Ohio        0
Colorada    5
Utah        9
Name: two, dtype: int64

In [89]:
data.loc['Utah', 'two']

9

In [90]:
data.iloc[:,:3]

          one  two  trhee
Ohio        0    0      0
Colorada    0    5      6
Utah        8    9     10
NewYork    12   13     14


![Alt text](https://img-blog.csdnimg.cn/20200808120252118.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2dhb2N1aTg4Mw==,size_16,color_FFFFFF,t_70)