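* # Pandas Basics
1. Understanding the basic elements of a DataFrame
2. The data structure of a DataFrame
3. What a Series is, and method chaining on a Series
4. Modifying a DataFrame's column names, row names, and index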

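* ### What is a DataFrame

 pandas has two basic data structures, Series and DataFrame. A Series is a one-dimensional, array-like object made up of a set of data and an associated set of data labels (its index).

 A DataFrame has both a row index and a column index, and can be viewed as a dictionary of Series that share the same row index.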
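
To make the "dictionary of Series" view concrete, here is a minimal sketch. The labels `a`, `b`, `c` and the values are toy data chosen for illustration, not rows from the retail file.

```python
import pandas as pd

# Two Series that share the same index labels (toy data)
quantity = pd.Series([6, 8, 3], index=["a", "b", "c"], name="Quantity")
unit_price = pd.Series([2.55, 2.75, 4.95], index=["a", "b", "c"], name="UnitPrice")

# Building a DataFrame from a dict of Series: the keys become column labels
df = pd.DataFrame({"Quantity": quantity, "UnitPrice": unit_price})

# Selecting one column hands back a Series that carries the shared row index
col = df["Quantity"]
print(type(col).__name__)
print(list(df.index))
print(list(df.columns))
```

Each column can be pulled out by its key, much like a dictionary lookup, and every column is aligned on the same row labels.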

In [1]:
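# Import the pandas package
import pandas as pd
# Cap the displayed column width and the number of rows shown in the notebook; anything beyond is replaced by an ellipsis
pd.set_option('display.max_colwidth',20)
pd.set_option('display.max_rows',8)
# Root directory where the data files live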
data_source = r"Y:\BaiduNetdiskWorkspace\data_analysis\Python数据分析\data"

In [2]:
retal_data = pd.read_csv(r'{}\Online_Retail_Fake.csv'.format(data_source))

In [14]:
retal_data[['InvoiceNo']]

Unnamed: 0,InvoiceNo
0,536365
1,536365
2,536365
3,536365
...,...
541906,581587
541907,581587
541908,581587
541909,581587


In [15]:
retal_data[1:3]

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
1,536365,71053,WHITE METAL LANTERN,6,2010/12/1 8:26,,17850.0,United Kingdom
2,536365,84406B,,8,2010/12/1 8:26,2.75,17850.0,United Kingdom


In [3]:
retal_data.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HE...,6,2010/12/1 8:26,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010/12/1 8:26,,17850.0,United Kingdom
2,536365,84406B,,8,2010/12/1 8:26,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FL...,6,2010/12/1 8:26,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTI...,6,2010/12/1 8:26,3.39,17850.0,United Kingdom


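## A DataFrame has a vertical axis (Index) and a horizontal axis (Columns). pandas borrows NumPy's naming, where 0 and 1 denote the vertical and horizontal axes respectively. The Index is therefore axis 0 of the DataFrame: a reduction called with axis=0 runs down the rows and yields one result per column, while axis=1 runs across the columns and yields one result per row.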
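
The axis convention is easiest to see on a small frame. This sketch uses made-up numbers and `sum()` as the example reduction; other reductions follow the same convention.

```python
import pandas as pd

# Toy frame with three rows and two numeric columns
df = pd.DataFrame({"Quantity": [6, 8, 3], "UnitPrice": [2.0, 3.0, 5.0]})

# axis=0 collapses the row index: one result per column
per_column = df.sum(axis=0)

# axis=1 collapses the columns: one result per row
per_row = df.sum(axis=1)

print(per_column)
print(per_row)
```

The result of `axis=0` is indexed by the column labels, and the result of `axis=1` is indexed by the row labels.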

In [4]:
index = retal_data.index
colunms = retal_data.columns
data = retal_data.values

In [5]:
index

RangeIndex(start=0, stop=541910, step=1)

In [6]:
colunms

Index(['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'InvoiceDate',
       'UnitPrice', 'CustomerID', 'Country'],
      dtype='object')

In [7]:
data

array([['536365', '85123A', 'WHITE HANGING HEART T-LIGHT HOLDER', ...,
        2.55, 17850.0, 'United Kingdom'],
       ['536365', '71053', 'WHITE METAL LANTERN', ..., nan, 17850.0,
        'United Kingdom'],
       ['536365', '84406B', nan, ..., 2.75, 17850.0, 'United Kingdom'],
       ...,
       ['581587', '23255', 'CHILDRENS CUTLERY CIRCUS PARADE', ..., 4.15,
        12680.0, 'France'],
       ['581587', '22138', 'BAKING SET 9 PIECE RETROSPOT ', ..., 4.95,
        12680.0, 'France'],
       ['581587', '22138', 'Wrong booking', ..., 4.95, 12680.0, 'France']],
      dtype=object)

In [8]:
index_value = index.values
index_value

array([     0,      1,      2, ..., 541907, 541908, 541909], dtype=int64)

In [9]:
cv = colunms.values
cv

array(['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'InvoiceDate',
       'UnitPrice', 'CustomerID', 'Country'], dtype=object)
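
The three pieces shown above (index, columns, values) are enough to describe a DataFrame. As a rough sketch on toy data, they can be pulled apart and put back together. `to_numpy()` is used here as an alternative to `.values` for obtaining the underlying array.

```python
import pandas as pd

# Toy frame mixing a string column and an integer column
df = pd.DataFrame({"InvoiceNo": ["536365", "536366"], "Quantity": [6, 8]})

# Pull the frame apart into its three components
index = df.index
columns = df.columns
data = df.to_numpy()

# Mixed column types force the array to the generic object dtype
print(data.dtype)

# Reassemble a frame from the components
rebuilt = pd.DataFrame(data, index=index, columns=columns)
print(rebuilt)
```

Note that the rebuilt frame holds object-typed columns, because the per-column dtypes are lost once everything is packed into a single array.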

# ![pandas data types](pandas的数据类型.jpg)

In [11]:
retal_data.dtypes

InvoiceNo       object
StockCode       object
Description     object
Quantity         int64
InvoiceDate     object
UnitPrice      float64
CustomerID     float64
Country         object
dtype: object
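
`CustomerID` shows up as `float64` even though its values look like whole numbers. One likely reason (an assumption, since the source does not say) is that the column contains missing values: the default integer dtype cannot hold NaN, so pandas falls back to float. A small sketch with toy values:

```python
import pandas as pd

# Integers without gaps stay an integer dtype
ids = pd.Series([17850, 12680])
print(ids.dtype)

# A single missing value pushes the column to float64
with_missing = pd.Series([17850, None])
print(with_missing.dtype)

# The nullable "Int64" extension dtype keeps integers alongside missing values
nullable = pd.Series([17850, None], dtype="Int64")
print(nullable.dtype)
```

Whether converting to `Int64` is worthwhile depends on how the column is used later; it is shown here only to illustrate the dtype behaviour.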

In [12]:
# Get the number of columns
retal_data.columns.size

8

In [13]:
## A Series is the building block of a DataFrame: each column is one Series.
## For example, access the Country column
ct = retal_data['Country']
ct

0         United Kingdom
1         United Kingdom
2         United Kingdom
3         United Kingdom
               ...      
541906            France
541907            France
541908            France
541909            France
Name: Country, Length: 541910, dtype: object

In [14]:
## Access ct's name, length (actually the size attribute), and dtype
print(ct.name,ct.size,ct.dtype,sep='---')

Country---541910---object


In [15]:
## Check the type of the Country column
type(ct)

pandas.core.series.Series

In [16]:
## isnull() flags missing values; summing the flags counts them
ct.isnull().sum()

1

In [17]:
ct.fillna(0).isnull().sum()

0
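
The last two cells chain Series methods: each call returns a new object, so the next method can be applied directly. A minimal sketch on a toy Series follows; the `"Unknown"` fill value is an illustrative choice, not the one used above.

```python
import numpy as np
import pandas as pd

# Toy Series with one missing entry
ct = pd.Series(["United Kingdom", np.nan, "France"], name="Country")

# isnull() gives a boolean Series; sum() counts the True values
n_missing = ct.isnull().sum()

# fillna() returns a new Series and leaves ct untouched
filled = ct.fillna("Unknown")
n_missing_after = filled.isnull().sum()

# A longer chain: fill gaps, upper-case the text, count each value
top = ct.fillna("Unknown").str.upper().value_counts()

print(n_missing, n_missing_after)
print(top)
```

Because `fillna()` does not modify `ct` in place, the original Series still has its missing value after the chain runs.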

* ## Modifying the index and column names

In [18]:
gapminder = pd.read_csv(data_source+"/gapminder.csv")

In [19]:
gapminder.head()

Unnamed: 0.1,Unnamed: 0,1800,1801,1802,1803,1804,1805,1806,1807,1808,...,2008,2009,2010,2011,2012,2013,2014,2015,2016,Life expectancy
0,0,,,,,,,,,,...,,,,,,,,,,Abkhazia
1,1,28.21,28.2,28.19,28.18,28.17,28.16,28.15,28.14,28.13,...,,,,,,,,,,Afghanistan
2,2,,,,,,,,,,...,,,,,,,,,,Akrotiri and Dhe...
3,3,35.4,35.4,35.4,35.4,35.4,35.4,35.4,35.4,35.4,...,,,,,,,,,,Albania
4,4,28.82,28.82,28.82,28.82,28.82,28.82,28.82,28.82,28.82,...,,,,,,,,,,Algeria


In [20]:
col = gapminder.columns
col.size

219

In [21]:
col.values

array(['Unnamed: 0', '1800', '1801', '1802', '1803', '1804', '1805',
       '1806', '1807', '1808', '1809', '1810', '1811', '1812', '1813',
       '1814', '1815', '1816', '1817', '1818', '1819', '1820', '1821',
       '1822', '1823', '1824', '1825', '1826', '1827', '1828', '1829',
       '1830', '1831', '1832', '1833', '1834', '1835', '1836', '1837',
       '1838', '1839', '1840', '1841', '1842', '1843', '1844', '1845',
       '1846', '1847', '1848', '1849', '1850', '1851', '1852', '1853',
       '1854', '1855', '1856', '1857', '1858', '1859', '1860', '1861',
       '1862', '1863', '1864', '1865', '1866', '1867', '1868', '1869',
       '1870', '1871', '1872', '1873', '1874', '1875', '1876', '1877',
       '1878', '1879', '1880', '1881', '1882', '1883', '1884', '1885',
       '1886', '1887', '1888', '1889', '1890', '1891', '1892', '1893',
       '1894', '1895', '1896', '1897', '1898', '1899', '1900', '1901',
       '1902', '1903', '1904', '1905', '1906', '1907', '1908', '1909',
       '

In [22]:
gapminder = gapminder.set_index('Life expectancy')

In [23]:
gapminder

Unnamed: 0_level_0,Unnamed: 0,1800,1801,1802,1803,1804,1805,1806,1807,1808,...,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016
Life expectancy,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Abkhazia,0,,,,,,,,,,...,,,,,,,,,,
Afghanistan,1,28.21,28.2,28.19,28.18,28.17,28.16,28.15,28.14,28.13,...,,,,,,,,,,
Akrotiri and Dhekelia,2,,,,,,,,,,...,,,,,,,,,,
Albania,3,35.40,35.4,35.40,35.40,35.40,35.40,35.40,35.40,35.40,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Zambia,256,,,,,,,,,,...,49.0,51.1,52.3,53.1,53.7,54.7,55.6,56.3,56.7,57.10
Zimbabwe,257,,,,,,,,,,...,46.4,47.3,48.0,49.1,51.6,54.2,55.7,57.0,59.3,61.69
Åland,258,,,,,,,,,,...,,,,,,,,,,
South Sudan,259,,,,,,,,,,...,55.5,55.6,55.8,56.0,55.9,56.0,56.0,56.1,56.1,56.10


In [24]:
gapminder = gapminder.reset_index('Life expectancy')

In [25]:
gapminder

Unnamed: 0.1,Life expectancy,Unnamed: 0,1800,1801,1802,1803,1804,1805,1806,1807,...,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016
0,Abkhazia,0,,,,,,,,,...,,,,,,,,,,
1,Afghanistan,1,28.21,28.2,28.19,28.18,28.17,28.16,28.15,28.14,...,,,,,,,,,,
2,Akrotiri and Dhe...,2,,,,,,,,,...,,,,,,,,,,
3,Albania,3,35.40,35.4,35.40,35.40,35.40,35.40,35.40,35.40,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
776,Zambia,256,,,,,,,,,...,49.0,51.1,52.3,53.1,53.7,54.7,55.6,56.3,56.7,57.10
777,Zimbabwe,257,,,,,,,,,...,46.4,47.3,48.0,49.1,51.6,54.2,55.7,57.0,59.3,61.69
778,Åland,258,,,,,,,,,...,,,,,,,,,,
779,South Sudan,259,,,,,,,,,...,55.5,55.6,55.8,56.0,55.9,56.0,56.0,56.1,56.1,56.10
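
`set_index()` and `reset_index()` are inverses of each other for a single column. The sketch below repeats the round trip on a two-row toy frame that reuses the 2016 values for Zambia and Zimbabwe from the output above.

```python
import pandas as pd

# Two rows taken from the table above
df = pd.DataFrame({"Life expectancy": ["Zambia", "Zimbabwe"],
                   "2016": [57.10, 61.69]})

# Move the column into the row index, enabling label-based lookup
indexed = df.set_index("Life expectancy")
zambia_2016 = indexed.loc["Zambia", "2016"]

# Move the index back into an ordinary column
restored = indexed.reset_index()

# drop=True discards the index instead of restoring it as a column
dropped = indexed.reset_index(drop=True)

print(zambia_2016)
print(list(restored.columns))
print(list(dropped.columns))
```

Both methods return new DataFrames by default, which is why the cells above assign the result back to `gapminder`.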


In [26]:
col_name= {
    'Life expectancy':'location','Unnamed: 0':'index'
}

## rename() changes the column names here; it can also relabel the index
gapminder = gapminder.rename(columns= col_name)
gapminder

Unnamed: 0,location,index,1800,1801,1802,1803,1804,1805,1806,1807,...,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016
0,Abkhazia,0,,,,,,,,,...,,,,,,,,,,
1,Afghanistan,1,28.21,28.2,28.19,28.18,28.17,28.16,28.15,28.14,...,,,,,,,,,,
2,Akrotiri and Dhe...,2,,,,,,,,,...,,,,,,,,,,
3,Albania,3,35.40,35.4,35.40,35.40,35.40,35.40,35.40,35.40,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
776,Zambia,256,,,,,,,,,...,49.0,51.1,52.3,53.1,53.7,54.7,55.6,56.3,56.7,57.10
777,Zimbabwe,257,,,,,,,,,...,46.4,47.3,48.0,49.1,51.6,54.2,55.7,57.0,59.3,61.69
778,Åland,258,,,,,,,,,...,,,,,,,,,,
779,South Sudan,259,,,,,,,,,...,55.5,55.6,55.8,56.0,55.9,56.0,56.0,56.1,56.1,56.10
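
The same `rename()` call accepts an `index=` mapping for row labels. A short sketch on a one-row toy frame; the new label `"row_0"` is a made-up name for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Life expectancy": ["Zambia"], "Unnamed: 0": [256]})

# Rename columns through a mapping of old name -> new name
renamed = df.rename(columns={"Life expectancy": "location", "Unnamed: 0": "index"})

# Rename row labels the same way through index=
relabeled = renamed.rename(index={0: "row_0"})

# Keys that match no existing label are ignored by default
unchanged = df.rename(columns={"no_such_column": "x"})

print(list(renamed.columns))
print(list(relabeled.index))
print(list(unchanged.columns))
```

Since unmatched keys are skipped silently, a typo in the mapping leaves the frame as it was, so it is worth checking the resulting column list.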


In [27]:
## Alternatively, index and columns can be converted to lists and modified there.
index = gapminder.index
colunms = gapminder.columns
index = index.to_list()
colunms = colunms.to_list()

colunms[0] = 'test'
## Assigning a list to the DataFrame's columns or index attribute replaces the labels
gapminder.columns = colunms
gapminder


Unnamed: 0,test,index,1800,1801,1802,1803,1804,1805,1806,1807,...,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016
0,Abkhazia,0,,,,,,,,,...,,,,,,,,,,
1,Afghanistan,1,28.21,28.2,28.19,28.18,28.17,28.16,28.15,28.14,...,,,,,,,,,,
2,Akrotiri and Dhe...,2,,,,,,,,,...,,,,,,,,,,
3,Albania,3,35.40,35.4,35.40,35.40,35.40,35.40,35.40,35.40,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
776,Zambia,256,,,,,,,,,...,49.0,51.1,52.3,53.1,53.7,54.7,55.6,56.3,56.7,57.10
777,Zimbabwe,257,,,,,,,,,...,46.4,47.3,48.0,49.1,51.6,54.2,55.7,57.0,59.3,61.69
778,Åland,258,,,,,,,,,...,,,,,,,,,,
779,South Sudan,259,,,,,,,,,...,55.5,55.6,55.8,56.0,55.9,56.0,56.0,56.1,56.1,56.10
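
The detour through a list is needed because an Index object cannot be modified element by element. The following sketch, on a toy frame, shows the two failure modes to expect: item assignment on the Index, and assigning a list of the wrong length.

```python
import pandas as pd

df = pd.DataFrame({"location": ["Zambia"], "index": [256], "2016": [57.1]})

# An Index is immutable, so direct item assignment fails
index_is_immutable = False
try:
    df.columns[0] = "test"
except TypeError:
    index_is_immutable = True

# Editing a list copy and assigning it back works
cols = df.columns.to_list()
cols[0] = "test"
df.columns = cols

# The new list must have exactly one label per column
length_error = None
try:
    df.columns = ["only_one"]
except ValueError as exc:
    length_error = str(exc)

print(list(df.columns))
print(index_is_immutable, length_error is not None)
```

Unlike `rename()`, this approach replaces every label at once, so the order of the list has to match the existing column order.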


* ## Adding, modifying, or deleting columns

In [28]:
## Data processing sometimes requires adding, modifying, or deleting columns; first, read in the file
retail_data = pd.read_csv(data_source+"\\Online_Retail_Fake.csv")

In [29]:
retail_data.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HE...,6,2010/12/1 8:26,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010/12/1 8:26,,17850.0,United Kingdom
2,536365,84406B,,8,2010/12/1 8:26,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FL...,6,2010/12/1 8:26,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTI...,6,2010/12/1 8:26,3.39,17850.0,United Kingdom


In [30]:
## As the table shows, there are Quantity and UnitPrice columns but no total price, so we add a Total_Price column
retail_data['Total_Price'] = retail_data['Quantity']*retail_data['UnitPrice']
retail_data[['Total_Price','Quantity','UnitPrice']]

Unnamed: 0,Total_Price,Quantity,UnitPrice
0,15.30,6,2.55
1,,6,
2,22.00,8,2.75
3,20.34,6,3.39
...,...,...,...
541906,16.60,4,4.15
541907,16.60,4,4.15
541908,14.85,3,4.95
541909,14.85,3,4.95
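
Row 1 in the output has no `Total_Price` because its `UnitPrice` is missing: arithmetic between columns is element-wise and a NaN operand gives a NaN result. A minimal sketch with three rows copied from the table above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Quantity": [6, 6, 8],
                   "UnitPrice": [2.55, np.nan, 2.75]})

# Element-wise product, aligned on the row index
df["Total_Price"] = df["Quantity"] * df["UnitPrice"]

# The row with a missing price yields a missing total
missing_total = pd.isna(df.loc[1, "Total_Price"])

print(df)
print(missing_total)
```

Assigning to a new key with `df[...] = ...` appends the column at the right-hand end of the frame.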


In [31]:
# Get the position of the UnitPrice column
UnitPrice_col_index = retail_data.columns.get_loc('UnitPrice')
# UnitPrice_col_index
# Use insert() to place the Total_Price values at the chosen position, as a new column named new_Total_Price
retail_data.insert(UnitPrice_col_index+1,'new_Total_Price',
                  value = retail_data['Total_Price'])


In [32]:
retail_data.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,new_Total_Price,CustomerID,Country,Total_Price
0,536365,85123A,WHITE HANGING HE...,6,2010/12/1 8:26,2.55,15.3,17850.0,United Kingdom,15.3
1,536365,71053,WHITE METAL LANTERN,6,2010/12/1 8:26,,,17850.0,United Kingdom,
2,536365,84406B,,8,2010/12/1 8:26,2.75,22.0,17850.0,United Kingdom,22.0
3,536365,84029G,KNITTED UNION FL...,6,2010/12/1 8:26,3.39,20.34,17850.0,United Kingdom,20.34
4,536365,84029E,RED WOOLLY HOTTI...,6,2010/12/1 8:26,3.39,20.34,17850.0,United Kingdom,20.34
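
Two properties of `insert()` explain why the cell above uses a different column name and does not assign the result: it modifies the frame in place and returns `None`, and by default it refuses a name that already exists. A small sketch on a one-row toy frame:

```python
import pandas as pd

df = pd.DataFrame({"Quantity": [6], "UnitPrice": [2.55],
                   "Country": ["United Kingdom"]})

# Position right after UnitPrice
pos = df.columns.get_loc("UnitPrice") + 1

# insert() changes df itself; its return value is None
result = df.insert(pos, "Total_Price", df["Quantity"] * df["UnitPrice"])

# Inserting the same column name again is rejected by default
duplicate_rejected = False
try:
    df.insert(pos, "Total_Price", 0)
except ValueError:
    duplicate_rejected = True

print(list(df.columns))
print(result, duplicate_rejected)
```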


In [33]:
# Deleting a column uses the drop() function
retail_data.drop('new_Total_Price',axis=1,inplace=True)


In [34]:
retail_data

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,Total_Price
0,536365,85123A,WHITE HANGING HE...,6,2010/12/1 8:26,2.55,17850.0,United Kingdom,15.30
1,536365,71053,WHITE METAL LANTERN,6,2010/12/1 8:26,,17850.0,United Kingdom,
2,536365,84406B,,8,2010/12/1 8:26,2.75,17850.0,United Kingdom,22.00
3,536365,84029G,KNITTED UNION FL...,6,2010/12/1 8:26,3.39,17850.0,United Kingdom,20.34
...,...,...,...,...,...,...,...,...,...
541906,581587,23254,CHILDRENS CUTLER...,4,2011/12/9 12:50,4.15,12680.0,France,16.60
541907,581587,23255,CHILDRENS CUTLER...,4,2011/12/9 12:50,4.15,12680.0,France,16.60
541908,581587,22138,BAKING SET 9 PIE...,3,2011/12/9 12:50,4.95,12680.0,France,14.85
541909,581587,22138,Wrong booking,3,2011/12/9 12:50,4.95,12680.0,France,14.85


In [35]:
## Besides drop(), del can also delete a column
del retail_data['Total_Price']

In [36]:
retail_data

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HE...,6,2010/12/1 8:26,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010/12/1 8:26,,17850.0,United Kingdom
2,536365,84406B,,8,2010/12/1 8:26,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FL...,6,2010/12/1 8:26,3.39,17850.0,United Kingdom
...,...,...,...,...,...,...,...,...
541906,581587,23254,CHILDRENS CUTLER...,4,2011/12/9 12:50,4.15,12680.0,France
541907,581587,23255,CHILDRENS CUTLER...,4,2011/12/9 12:50,4.15,12680.0,France
541908,581587,22138,BAKING SET 9 PIE...,3,2011/12/9 12:50,4.95,12680.0,France
541909,581587,22138,Wrong booking,3,2011/12/9 12:50,4.95,12680.0,France
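
The two deletion styles differ in one respect worth noting: `drop()` returns a new frame unless `inplace=True` is passed, whereas `del` always changes the frame in place. A sketch on a toy frame, which also shows the `errors="ignore"` option for columns that may already be gone:

```python
import pandas as pd

df = pd.DataFrame({"Quantity": [6], "UnitPrice": [2.55], "Total_Price": [15.3]})

# drop() without inplace=True leaves df untouched and returns a copy
dropped = df.drop(columns=["Total_Price"])
still_there = "Total_Price" in df.columns

# del removes the column from df itself
del df["Total_Price"]

# Dropping a column that no longer exists raises KeyError ...
missing_raises = False
try:
    df.drop(columns=["Total_Price"])
except KeyError:
    missing_raises = True

# ... unless errors="ignore" is passed
safe = df.drop(columns=["Total_Price"], errors="ignore")

print(list(dropped.columns), still_there)
print(list(df.columns), missing_raises)
```

`drop(columns=[...])` is equivalent to `drop([...], axis=1)` as used above.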


In [37]:
# Besides ordinary arithmetic, comparison (boolean) operations can also be applied to a column
# For example:
retail_data['logic_test'] = retail_data['UnitPrice']>4
retail_data

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,logic_test
0,536365,85123A,WHITE HANGING HE...,6,2010/12/1 8:26,2.55,17850.0,United Kingdom,False
1,536365,71053,WHITE METAL LANTERN,6,2010/12/1 8:26,,17850.0,United Kingdom,False
2,536365,84406B,,8,2010/12/1 8:26,2.75,17850.0,United Kingdom,False
3,536365,84029G,KNITTED UNION FL...,6,2010/12/1 8:26,3.39,17850.0,United Kingdom,False
...,...,...,...,...,...,...,...,...,...
541906,581587,23254,CHILDRENS CUTLER...,4,2011/12/9 12:50,4.15,12680.0,France,True
541907,581587,23255,CHILDRENS CUTLER...,4,2011/12/9 12:50,4.15,12680.0,France,True
541908,581587,22138,BAKING SET 9 PIE...,3,2011/12/9 12:50,4.95,12680.0,France,True
541909,581587,22138,Wrong booking,3,2011/12/9 12:50,4.95,12680.0,France,True


In [39]:
retail_data['logic_test'].sum()

144262
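
Summing the boolean column works because `True` counts as 1 and `False` as 0. Comparisons against NaN evaluate to `False`, which matches row 1 in the table above. The sketch below uses four toy prices and also shows the mask used as a row filter.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"UnitPrice": [2.55, np.nan, 4.15, 4.95]})

# Element-wise comparison gives a boolean Series; NaN > 4 is False
mask = df["UnitPrice"] > 4

# sum() counts the True values, mean() gives their share
count = mask.sum()
share = mask.mean()

# The same mask selects the matching rows
expensive = df[mask]

print(mask.tolist())
print(count, share)
print(expensive)
```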