pandas offers two ways to select specific rows and columns: the "[]" operator, known as the indexing operator, and the .loc and .iloc indexers.

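# 4.1 Selecting rows and columns with .loc and .iloc
    Series and DataFrame objects are very flexible to index. The .loc indexer accesses data by index label, while the .iloc indexer accesses data by integer position.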
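As a minimal sketch of the difference, the following uses a small made-up Series whose index holds string labels. The labels and values are illustrative assumptions and do not come from the datasets used below.

```python
import pandas as pd

# Hypothetical toy data: string labels as the index
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

by_label = s.loc['b']      # select by index label
by_position = s.iloc[1]    # select by integer position (0-based)

print(by_label, by_position)
```

Both lookups reach the same element here, one through its label and one through its position.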

In [1]:
import pandas as pd

In [2]:
# Set the maximum number of rows and columns to display
pd.set_option('display.max_rows',10)
pd.set_option('display.max_columns',10)


In [3]:
# Data source directory
data_source = r"Y:\BaiduNetdiskWorkspace\data_analysis\Python数据分析\data"

In [4]:
# Load the data
data = pd.read_csv(data_source+"\\Online_Retail_Fake.csv")

In [5]:
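# Fill missing unit prices with the column mean; assign the result back instead of
# using inplace=True on a selected column, which newer pandas versions warn about
data['UnitPrice'] = data['UnitPrice'].fillna(data['UnitPrice'].mean())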

In [6]:
data['UnitPrice'].isnull().sum()

0
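The fill step can be tried on a tiny frame first. This is only a sketch on made-up prices (the frame name prices and its values are assumptions); it relies on mean() skipping missing values by default.

```python
import pandas as pd

# Hypothetical toy data with one missing unit price
prices = pd.DataFrame({'UnitPrice': [2.0, float('nan'), 4.0]})

# mean() ignores NaN by default, so the fill value is (2.0 + 4.0) / 2
prices['UnitPrice'] = prices['UnitPrice'].fillna(prices['UnitPrice'].mean())

missing = prices['UnitPrice'].isnull().sum()
print(prices['UnitPrice'].tolist(), missing)
```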

In [7]:
data.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010/12/1 8:26,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010/12/1 8:26,4.611117,17850.0,United Kingdom
2,536365,84406B,,8,2010/12/1 8:26,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010/12/1 8:26,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010/12/1 8:26,3.39,17850.0,United Kingdom


In [8]:
data['Total_Price'] = data['Quantity']*data['UnitPrice']

In [9]:
data.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,Total_Price
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010/12/1 8:26,2.55,17850.0,United Kingdom,15.3
1,536365,71053,WHITE METAL LANTERN,6,2010/12/1 8:26,4.611117,17850.0,United Kingdom,27.666699
2,536365,84406B,,8,2010/12/1 8:26,2.75,17850.0,United Kingdom,22.0
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010/12/1 8:26,3.39,17850.0,United Kingdom,20.34
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010/12/1 8:26,3.39,17850.0,United Kingdom,20.34


## 4.1.1 Selecting rows in a Series and DataFrame

In [10]:
Country = data['Country']

In [11]:
Country.head()

0    United Kingdom
1    United Kingdom
2    United Kingdom
3    United Kingdom
4    United Kingdom
Name: Country, dtype: object

In [12]:
# The [] operator above pulled one column out of data as a Series; to get specific rows by position, use .iloc[]
print(Country.iloc[2])
print(Country.iloc[4])
print(Country.iloc[541908])

United Kingdom
United Kingdom
France


In [13]:
# To access several rows at once, pass a list to .iloc
print(Country.iloc[[1,2,3,541908]])

1         United Kingdom
2         United Kingdom
3         United Kingdom
541908            France
Name: Country, dtype: object


In [14]:
# Besides lists, .iloc also supports slices
print(Country.iloc[2:549900:299])

2         United Kingdom
301       United Kingdom
600       United Kingdom
899       United Kingdom
1198      United Kingdom
               ...      
540594    United Kingdom
540893    United Kingdom
541192           Belgium
541491    United Kingdom
541790           Germany
Name: Country, Length: 1813, dtype: object
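The slice above uses a stop value larger than the Series. A small sketch on made-up data (the Series s is an assumption) suggests why that works: positional slices are clipped to the available range, whereas a single out-of-range position raises an error.

```python
import pandas as pd

# Hypothetical toy data: ten values at positions 0..9
s = pd.Series(range(10))

# The stop (100) exceeds the length, so the slice is clipped to the end
every_third = s.iloc[2:100:3]
print(every_third.tolist())

# A single position outside the range is not clipped
try:
    s.iloc[100]
    raised = False
except IndexError:
    raised = True
print(raised)
```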


In [15]:
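# .iloc always selects by integer position, whatever the index contains; when the index holds labels such as strings, as in the data below, use .loc to select by label.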
college = pd.read_csv(data_source+"\\College.csv")

In [16]:
college.head()

Unnamed: 0.1,Unnamed: 0,Private,Apps,Accept,Enroll,...,Terminal,S.F.Ratio,perc.alumni,Expend,Grad.Rate
0,Abilene Christian University,Yes,1660,1232,721,...,78,18.1,12,7041,60
1,Adelphi University,Yes,2186,1924,512,...,30,12.2,16,10527,56
2,Adrian College,Yes,1428,1097,336,...,66,12.9,30,8735,54
3,Agnes Scott College,Yes,417,349,137,...,97,7.7,37,19016,59
4,Alaska Pacific University,Yes,193,146,55,...,72,11.9,2,10922,15


In [17]:
# Give the unnamed first column a meaningful name
col_name = {'Unnamed: 0':'college_name'}

In [18]:
college.rename(columns=col_name,inplace=True)

In [19]:
# Use the college name as the index; do this after the rename,
# because set_index removes the column from college.columns
college.set_index('college_name',inplace=True)
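set_index moves a column out of columns and into the index, so a later rename(columns=...) no longer sees it. The sketch below shows this on a made-up frame (the frame df and its values are assumptions) together with one way to name the index afterwards, rename_axis.

```python
import pandas as pd

# Hypothetical frame with an unnamed first column, as read_csv might produce
df = pd.DataFrame({'Unnamed: 0': ['A College', 'B College'], 'Apps': [100, 200]})

df = df.set_index('Unnamed: 0')
in_columns = 'Unnamed: 0' in df.columns   # the column is now the index
print(in_columns)

# Rename the index itself rather than a column
df = df.rename_axis('college_name')
print(df.index.name)
```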

In [None]:
college

In [None]:
college.head(20)

In [None]:
# Index and slice by label with .loc
college.loc['Abilene Christian University':'Angelo State University':2]

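## 4.1.2 Selecting rows and columns at the same time
    Rows and columns can be selected at the same time with code of the following form:
          df.iloc[rows,columns]
          df.loc[rows,columns]
    Here rows is the row selection, given as either a list or a slice, and columns is the column selection, given the same way.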
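A minimal sketch of both forms on a made-up frame follows (the frame df, its labels, and its values are assumptions). It also shows the one difference to watch for: label slices in .loc include both endpoints, while positional slices in .iloc exclude the stop.

```python
import pandas as pd

# Hypothetical toy data with string row labels
df = pd.DataFrame(
    {'Private': ['Yes', 'No', 'Yes', 'Yes'],
     'Apps': [11, 22, 33, 44],
     'Accept': [1, 2, 3, 4]},
    index=['w', 'x', 'y', 'z'],
)

by_label = df.loc['x':'y', :'Apps']   # rows x..y and columns up to Apps, endpoints included
by_position = df.iloc[1:3, :2]        # rows 1..2 and columns 0..1, stop excluded

print(by_label.equals(by_position))
```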

In [None]:
college

In [None]:
# Select rows and columns with .loc
college.loc['Adelphi University':'Alaska Pacific University',              # rows
            :'Enroll']                                                     # columns

In [None]:
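# .iloc selects rows and columns in the same [rows, columns] form as .loc
college.iloc[1:4,:4]                                               # positional slices exclude the stop, so this returns rows 1 to 3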


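# 4.2 Boolean selection
    The previous section showed how to filter data by selecting rows and columns. In practice, boolean selection is the more common way to filter data.
    Basic idea:
        Applying a comparison or logical operation to a pandas Series or DataFrame produces a new Series or DataFrame of boolean values,
        which can then be used to filter and analyze the data.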
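The basic idea can be sketched in three lines on made-up scores (the Series scores and its values are assumptions): a comparison produces a boolean mask, and indexing with the mask keeps only the rows where it is True.

```python
import pandas as pd

# Hypothetical toy data
scores = pd.Series([90.0, 70.5, 88.2], index=['u1', 'u2', 'u3'])

mask = scores >= 85        # element-wise comparison gives a boolean Series
selected = scores[mask]    # keep only the entries where the mask is True

print(mask.tolist())
print(selected.index.tolist())
```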

In [None]:
# Load the data
cwur = pd.read_csv(data_source+"\\cwurData.csv")
cwur.head(10)

## 4.2.1 计算布尔值

In [None]:
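# Select the schools with score >= 85: first create a new boolean column

loc = cwur.columns.get_loc('score')  # get the position of the score column with columns.get_loc

cwur.insert(loc= loc+1,column='score_bool',value=cwur['score']>=85)   # then insert the new column right after it with insert()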

In [None]:
cwur

In [None]:
# Count the schools with score >= 85 and compute their share
cwur['score_bool'].sum()
# mean() gives the share
cwur['score_bool'].mean()            # True=1, False=0, so mean() is the fraction of True values
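A small sketch on made-up flags (the Series flags is an assumption) makes the arithmetic concrete: three of four values are True, so the sum is 3 and the mean is 0.75.

```python
import pandas as pd

# Hypothetical toy data
flags = pd.Series([True, False, True, True])

count = flags.sum()    # True counts as 1, False as 0
share = flags.mean()   # fraction of True values

print(count, share)
```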

In [None]:
# Besides comparing against a single value, columns can be compared with each other to produce boolean values
cwur['bool_2'] = cwur[ 'quality_of_education']>cwur['alumni_employment']

In [None]:
cwur[['quality_of_education','alumni_employment']]

In [None]:
cwur

In [None]:
# Rows where score >= 85 and cwur['quality_of_education'] > cwur['alumni_employment'] both hold
(cwur['score']>=85) & \
(cwur[ 'quality_of_education']>cwur['alumni_employment'])            # each comparison needs its own (), because & binds more tightly than >= and >
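The role of the parentheses can be checked on a made-up integer frame (the frame df and its columns are assumptions). Without them, Python evaluates 85 & df['a'] first and then chains the comparisons, which for these integer columns ends in a ValueError about an ambiguous truth value.

```python
import pandas as pd

# Hypothetical toy data with integer columns
df = pd.DataFrame({'score': [90, 80, 95], 'a': [5, 9, 1], 'b': [3, 2, 4]})

# Correct: each comparison is wrapped in its own parentheses
both = (df['score'] >= 85) & (df['a'] > df['b'])
print(both.tolist())

# Without parentheses this parses as df['score'] >= (85 & df['a']) > df['b']
try:
    df['score'] >= 85 & df['a'] > df['b']
    failed = False
except ValueError:
    failed = True
print(failed)
```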

## 4.2.2 Filtering data with multiple conditions

In [None]:
## Build a boolean Series from several conditions
## For example, find the Chinese universities ranked in the world top 100
crit_1 = cwur['country']=='China'
crit_2 = cwur['world_rank']<=100

cwur['China world\'s rank before 100'] = crit_1 & crit_2
cwur['China world\'s rank before 100'].sum()

cwur[cwur['China world\'s rank before 100']]         # use the boolean values to filter the DataFrame
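A combined mask can also be passed to .loc together with a column selection, without first storing it as a helper column. The following is a sketch on a made-up ranking table; the frame ranks, its institution names, and its values are assumptions, not rows from cwurData.csv.

```python
import pandas as pd

# Hypothetical toy ranking table
ranks = pd.DataFrame({
    'institution': ['U1', 'U2', 'U3', 'U4'],
    'country': ['China', 'USA', 'China', 'China'],
    'world_rank': [50, 10, 300, 99],
})

crit_1 = ranks['country'] == 'China'
crit_2 = ranks['world_rank'] <= 100

# Boolean mask for the rows, list of labels for the columns
top = ranks.loc[crit_1 & crit_2, ['institution', 'world_rank']]
print(top)
```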

# ![Sorting by index](数据按照index排序.jpg)

In [22]:
sh = pd.read_csv(data_source+"\\sh.csv")

In [23]:
sh.head()

Unnamed: 0,date,open,high,close,low,volume,amount
0,2019-08-30,2907.383,2914.577,2886.237,2874.103,19395995100,224751169931
1,2019-08-29,2895.999,2898.605,2890.919,2878.588,17861308200,196332770521
2,2019-08-28,2901.627,2905.435,2893.756,2887.012,18309790300,201805050637
3,2019-08-27,2879.515,2919.644,2902.193,2879.406,20814179400,230999692857
4,2019-08-26,2851.016,2870.494,2863.567,2849.238,16989536300,191036667851


In [30]:
sh.set_index('date',inplace=True)

In [31]:
sh.head()

date,open,high,close,low,volume,amount
2019-08-30,2907.383,2914.577,2886.237,2874.103,19395995100,224751169931
2019-08-29,2895.999,2898.605,2890.919,2878.588,17861308200,196332770521
2019-08-28,2901.627,2905.435,2893.756,2887.012,18309790300,201805050637
2019-08-27,2879.515,2919.644,2902.193,2879.406,20814179400,230999692857
2019-08-26,2851.016,2870.494,2863.567,2849.238,16989536300,191036667851


In [32]:
sh.sort_index(inplace=True)

In [34]:
sh.head()

date,open,high,close,low,volume,amount
2009-01-05,1849.02,1880.716,1880.716,1844.094,6713671200,46100959232
2009-01-06,1878.827,1938.69,1937.145,1871.971,9906675200,69012570112
2009-01-07,1938.974,1948.233,1924.012,1920.515,9236008800,63931166720
2009-01-08,1890.242,1894.171,1878.181,1862.263,8037400000,55076814848
2009-01-09,1875.164,1909.349,1904.861,1875.164,7122477600,50131263488
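Once the index is sorted, a range of dates can be selected with a label slice. The sketch below uses a made-up three-row frame (the frame quotes and its values are assumptions) and relies on the dates being year-month-day strings, which sort chronologically as plain text.

```python
import pandas as pd

# Hypothetical toy data, newest date first like the raw file
quotes = pd.DataFrame(
    {'close': [3.0, 1.0, 2.0]},
    index=pd.Index(['2019-08-30', '2019-08-28', '2019-08-29'], name='date'),
)

quotes = quotes.sort_index()                      # oldest date first
window = quotes.loc['2019-08-28':'2019-08-29']    # label slice, both endpoints included

print(quotes.index.tolist())
print(window['close'].tolist())
```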
