# Learning Pandas

## 1. How to convert a list into a Pandas DataFrame

In [1]:
import pandas as pd
my_list = [('join', 25, 'male'), ('lisa', 30, 'female'), ('david', 18, 'male')]
df = pd.DataFrame(my_list,columns=['Name', 'age', 'gender'])

In [2]:
print(df)

    Name  age  gender
0   join   25    male
1   lisa   30  female
2  david   18    male


## 2. How to read data from a CSV file into a Pandas DataFrame

In [3]:
# Assign to df, not pd, so the pandas module is not shadowed
df = pd.read_csv('/home/data/hdfs/anomaly_label.csv', encoding='utf-8')

In [4]:
print(df)

                         BlockId    Label
0       blk_-1608999687919862906   Normal
1        blk_7503483334202473044   Normal
2       blk_-3544583377289625738  Anomaly
3       blk_-9073992586687739851   Normal
4        blk_7854771516489510256   Normal
...                          ...      ...
575056   blk_1019720114020043203   Normal
575057  blk_-2683116845478050414   Normal
575058   blk_5595059397348477632   Normal
575059   blk_1513937873877967730   Normal
575060  blk_-9128742458709757181  Anomaly

[575061 rows x 2 columns]


## 3. How to write a Pandas DataFrame to a MySQL database

In [2]:
import pandas as pd
from sqlalchemy import create_engine
df = pd.DataFrame({"class":["Grade 1","Grade 2","Grade 3","Grade 4"],
               "boys":[25,23,27,30],
               "girls":[19,17,20,20]})
engine = create_engine('mysql+mysqlconnector://root:123456@127.0.0.1:3306/test_pandas')
df.to_sql("classes", engine)  # returns the number of rows written

4
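
The cell above needs a running MySQL server. As a minimal sketch that avoids that dependency, the same `to_sql` call can be tried against an in-memory SQLite database from the standard library. The frame `df_demo` and the table name `classes_demo` are illustrative assumptions, not part of the original notebook.

```python
import sqlite3

import pandas as pd

# Hypothetical data, used only for this sketch
df_demo = pd.DataFrame({"class": ["Grade 1", "Grade 2"], "boys": [25, 23]})

# pandas accepts a sqlite3 connection directly, so no server is needed
conn = sqlite3.connect(":memory:")
df_demo.to_sql("classes_demo", conn, index=False, if_exists="replace")

# Read the table back to confirm that the rows were written
round_trip = pd.read_sql("SELECT * FROM classes_demo", conn)
conn.close()
print(round_trip)
```

Passing `index=False` keeps the DataFrame index out of the table, and `if_exists="replace"` lets the cell be re-run without an error about an existing table.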

## 4. How to check the number of rows and columns of a Pandas DataFrame

In [4]:
import pandas as pd
df = pd.DataFrame({'A':[1,2,3],'B':[4,5,6],'C':[7,8,9]})
print(df.shape)  # (rows, columns)
print(df)

(3, 3)
   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9

## 5. How to view the column names of a Pandas DataFrame?

In [1]:
import pandas as pd
data = {'name':['alex','box','chery'],'age':[18,20,12]}
df = pd.DataFrame(data)
df

    name  age
0   alex   18
1    box   20
2  chery   12


In [2]:
print(df.columns)

Index(['name', 'age'], dtype='object')


## 6. How to view the index of a Pandas DataFrame?

In [3]:
import pandas as pd
data = {'name':['alex','box','chery'],'age':[18,20,12]}
df = pd.DataFrame(data)
df

    name  age
0   alex   18
1    box   20
2  chery   12


In [4]:
print(df.index)

RangeIndex(start=0, stop=3, step=1)


## 7. How to import the Pandas library and check its version

In [7]:
import pandas as pd
print(pd.__version__)

2.1.0


## 8. How to read data from a CSV file and create a Pandas DataFrame

In [2]:
import pandas as pd
df = pd.read_csv('../../log/loglizer/data/HDFS/anomaly_label.csv',encoding='utf-8')
df.head(3)

                    BlockId    Label
0  blk_-1608999687919862906   Normal
1   blk_7503483334202473044   Normal
2  blk_-3544583377289625738  Anomaly


In [3]:
df.tail(3)

                         BlockId    Label
575058   blk_5595059397348477632   Normal
575059   blk_1513937873877967730   Normal
575060  blk_-9128742458709757181  Anomaly


## 9. How to view the data types of a Pandas DataFrame

In [4]:
import pandas as pd
data = {'name':['alex','bob','chery'],'age':[10,12,13]}
df = pd.DataFrame(data)
df

    name  age
0   alex   10
1    bob   12
2  chery   13


In [5]:
print(df.dtypes)

name    object
age      int64
dtype: object


## 10. How to view summary statistics of a Pandas DataFrame?

In [6]:
import pandas as pd
df = pd.DataFrame({'A':[1,2,3,4,5],'B':[2.1,4.2,6.3,8.4,10.5],'C':['a','b','a','b','a']})
df

   A     B  C
0  1   2.1  a
1  2   4.2  b
2  3   6.3  a
3  4   8.4  b
4  5  10.5  a


In [7]:
suf = df.describe()
print(suf)

              A          B
count  5.000000   5.000000
mean   3.000000   6.300000
std    1.581139   3.320392
min    1.000000   2.100000
25%    2.000000   4.200000
50%    3.000000   6.300000
75%    4.000000   8.400000
max    5.000000  10.500000
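
By default `describe()` summarizes only numeric columns, which is why column `C` is absent from the output above. A minimal sketch, using a small hypothetical frame `df_demo`, of asking for every column with `include='all'`:

```python
import pandas as pd

# Hypothetical frame with one numeric and one string column
df_demo = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                        'C': ['a', 'b', 'a', 'b', 'a']})

# include='all' adds count/unique/top/freq rows for non-numeric columns
full = df_demo.describe(include='all')
print(full)
```

Statistics that do not apply to a column (for example `mean` for strings) are filled with `NaN`.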


## 11. How to select rows of a Pandas DataFrame

In [8]:
import pandas as pd
df = pd.DataFrame({'Name':['Alice','Bob','Charlie'],
                   'Age':[25,30,35],
                   'City':['New York','Paris','London']})
df

      Name  Age      City
0    Alice   25  New York
1      Bob   30     Paris
2  Charlie   35    London


In [9]:
first_row = df.loc[0]
first_row

Name       Alice
Age           25
City    New York
Name: 0, dtype: object

In [10]:
first_two = df.loc[0:1]
first_two

    Name  Age      City
0  Alice   25  New York
1    Bob   30     Paris


In [14]:
sub = df.loc[[0,2],['Name','Age']]
sub

      Name  Age
0    Alice   25
2  Charlie   35


## 12. How to select columns of a Pandas DataFrame

In [15]:
import pandas as pd
df = pd.DataFrame({'Name':['Alice','Bob','Charlie'],
                   'Age':[25,30,35],
                   'City':['New York','Paris','London']})
df

      Name  Age      City
0    Alice   25  New York
1      Bob   30     Paris
2  Charlie   35    London


In [16]:
df['Name']

0      Alice
1        Bob
2    Charlie
Name: Name, dtype: object

In [21]:
df[['Name','City']]

      Name      City
0    Alice  New York
1      Bob     Paris
2  Charlie    London


In [26]:
df.iloc[:,0:2]

      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35


## 13. How to select rows and columns of a Pandas DataFrame

In [1]:
import pandas as pd
df = pd.DataFrame({'Name':['Alice','Bob','Charlie'],
                   'Age':[25,30,35],
                   'City':['New York','Paris','London']})
df

      Name  Age      City
0    Alice   25  New York
1      Bob   30     Paris
2  Charlie   35    London


In [3]:
sub = df.loc[[0,2],['Age','Name']]
sub

   Age     Name
0   25    Alice
2   35  Charlie


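`loc`: selects rows by index label (for example, the row whose index label is "A"). A `.loc` slice includes its end label.

`iloc`: selects rows by integer position (for example, the second row). An `.iloc` slice excludes its end position.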
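
The difference in slice endpoints is easiest to see when labels and positions are not the same values. A minimal sketch, assuming a hypothetical frame `df_demo` with string labels:

```python
import pandas as pd

# Hypothetical frame whose index labels differ from the integer positions
df_demo = pd.DataFrame({'Age': [25, 30, 35]}, index=['a', 'b', 'c'])

by_label = df_demo.loc['a':'b']    # label slice: end label 'b' is included
by_position = df_demo.iloc[0:1]    # position slice: end position 1 is excluded

print(len(by_label))      # 2 rows: 'a' and 'b'
print(len(by_position))   # 1 row: position 0 only
```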

In [5]:
sub1 = df.iloc[[0,2],[0,1]]
sub1

      Name  Age
0    Alice   25
2  Charlie   35


## 14. How to filter the rows of a Pandas DataFrame?

In [6]:
import pandas as pd
df = pd.DataFrame({'Name':['Alice','Bob','Charlie'],
                   'Age':[25,30,35],
                   'City':['New York','Paris','London']})
df

      Name  Age      City
0    Alice   25  New York
1      Bob   30     Paris
2  Charlie   35    London


In [7]:
bool_index = df['Age'] > 25
bool_index

0    False
1     True
2     True
Name: Age, dtype: bool

In [8]:
filt = df[bool_index]
filt

      Name  Age    City
1      Bob   30   Paris
2  Charlie   35  London


## 15. How to filter the rows and columns of a Pandas DataFrame?

In [9]:
import pandas as pd
df = pd.DataFrame({'Name':['Alice','Bob','Charlie'],
                   'Age':[25,30,35],
                   'City':['New York','Paris','London']})
df

      Name  Age      City
0    Alice   25  New York
1      Bob   30     Paris
2  Charlie   35    London


In [10]:
sub = df.loc[df['Age']>25,['Name','Age']]

In [11]:
sub

      Name  Age
1      Bob   30
2  Charlie   35


In [13]:
sub = df.loc[df['Name'] == 'Bob',['Age','City']]
sub

   Age   City
1   30  Paris


In [14]:
sub1 = df.iloc[[0,2],[0,1]]
sub1

      Name  Age
0    Alice   25
2  Charlie   35


In [15]:
sub1 = df.iloc[1,1:]

In [16]:
sub1

Age        30
City    Paris
Name: 1, dtype: object

## 16. How to sort a Pandas DataFrame by the values of a column?

In [17]:
import pandas as pd
df = pd.DataFrame({'Name':['Alice','Bob','Charlie'],
                   'Age':[25,30,35],
                   'City':['New York','Paris','London']})
df

      Name  Age      City
0    Alice   25  New York
1      Bob   30     Paris
2  Charlie   35    London


In [18]:
df_sort = df.sort_values('Age')
df_sort

      Name  Age      City
0    Alice   25  New York
1      Bob   30     Paris
2  Charlie   35    London


In [23]:
df_sorted = df.sort_values('Name', ascending = True)
df_sorted

      Name  Age      City
0    Alice   25  New York
1      Bob   30     Paris
2  Charlie   35    London


In [24]:
df_sorted = df.sort_values(['Age','Name'])
df_sorted

      Name  Age      City
0    Alice   25  New York
1      Bob   30     Paris
2  Charlie   35    London
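
The sample data in this section is already in ascending order for both `Age` and `Name`, so the three sorts above return the rows unchanged. As a small sketch on a hypothetical frame `df_demo`, a descending sort makes the effect of `sort_values` visible:

```python
import pandas as pd

# Hypothetical frame mirroring the Name/Age columns used above
df_demo = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                        'Age': [25, 30, 35]})

# ascending=False reverses the order; the original index labels are kept
oldest_first = df_demo.sort_values('Age', ascending=False)
print(oldest_first)

# With several keys, ascending can be given per key as a list
by_two_keys = df_demo.sort_values(['Age', 'Name'], ascending=[False, True])
```

Calling `reset_index(drop=True)` on the result renumbers the rows if the old labels are not needed.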


## 17. How to pivot a Pandas DataFrame?

In [25]:
import pandas as pd
df = pd.DataFrame({'Product':['A','B','C','A','B','C','A','B','C'],
                   'SalesDate':['2022-01-01','2022-01-01','2022-01-01',
                                '2022-01-02','2022-01-02','2022-01-02',
                                '2022-01-03','2022-01-03','2022-01-03'],
                                'SalesAmount':[100,200,150,50,75,125,300,250,200]})
df

  Product   SalesDate  SalesAmount
0       A  2022-01-01          100
1       B  2022-01-01          200
2       C  2022-01-01          150
3       A  2022-01-02           50
4       B  2022-01-02           75
5       C  2022-01-02          125
6       A  2022-01-03          300
7       B  2022-01-03          250
8       C  2022-01-03          200


In [27]:
df_pivot = df.pivot_table(index='Product',columns='SalesDate',values='SalesAmount',aggfunc='sum')
df_pivot

SalesDate  2022-01-01  2022-01-02  2022-01-03
Product
A                 100          50         300
B                 200          75         250
C                 150         125         200


## 18. How to aggregate a Pandas DataFrame?

In [28]:
import pandas as pd
df = pd.DataFrame({'Product':['A','B','C','A','B','C','A','B','C'],
                   'SalesDate':['2022-01-01','2022-01-01','2022-01-01',
                                '2022-01-02','2022-01-02','2022-01-02',
                                '2022-01-03','2022-01-03','2022-01-03'],
                                'SalesAmount':[100,200,150,50,75,125,300,250,200]})
df

  Product   SalesDate  SalesAmount
0       A  2022-01-01          100
1       B  2022-01-01          200
2       C  2022-01-01          150
3       A  2022-01-02           50
4       B  2022-01-02           75
5       C  2022-01-02          125
6       A  2022-01-03          300
7       B  2022-01-03          250
8       C  2022-01-03          200


In [30]:
aggs_df = df.groupby('Product')['SalesAmount'].agg(['sum','mean','max'])
aggs_df

         sum        mean  max
Product
A        450  150.000000  300
B        525  175.000000  250
C        475  158.333333  200


## 19. How to merge Pandas DataFrames?

In [31]:
import pandas as pd
# Column names are Chinese: 编号 (ID), 语文 (Chinese), 数学 (math), 英语 (English)
df1 = pd.DataFrame({'编号':['mr001','mr002','mr003'],
                    '语文':[110,105,109],
                    '数学':[105,88,120],
                    '英语':[99,115,130]})
df1

      编号   语文   数学   英语
0  mr001  110  105   99
1  mr002  105   88  115
2  mr003  109  120  130


In [33]:
# 体育 means physical education
df2 = pd.DataFrame({'编号':['mr002','mr001','mr003','mr004'],
                    '体育':[34.5,39.7,38,45]})
df2

      编号    体育
0  mr002  34.5
1  mr001  39.7
2  mr003  38.0
3  mr004  45.0


In [36]:
# Align text output when column names contain double-width East Asian characters
pd.set_option('display.unicode.east_asian_width',True)
df_merge = pd.merge(df1,df2,on='编号')
df_merge

    编号  语文  数学  英语  体育
0  mr001   110   105    99  39.7
1  mr002   105    88   115  34.5
2  mr003   109   120   130  38.0
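
`pd.merge` performs an inner join by default, which is why `mr004` (present only in `df2`) does not appear in the result above. A minimal sketch of keeping unmatched keys with `how='outer'`, using hypothetical frames `left` and `right` with English column names:

```python
import pandas as pd

# Hypothetical frames; 'id' plays the role of the shared key column
left = pd.DataFrame({'id': ['mr001', 'mr002', 'mr003'],
                     'math': [105, 88, 120]})
right = pd.DataFrame({'id': ['mr002', 'mr001', 'mr003', 'mr004'],
                      'pe': [34.5, 39.7, 38, 45]})

# how='outer' keeps keys found in either frame;
# indicator=True adds a _merge column naming the source of each row
outer = pd.merge(left, right, on='id', how='outer', indicator=True)
print(outer)
```

Rows that exist in only one frame get `NaN` in the columns that come from the other frame.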


In [37]:
cont_df = pd.concat([df1,df2],axis=0)
cont_df

    编号   语文   数学   英语  体育
0  mr001  110.0  105.0   99.0   NaN
1  mr002  105.0   88.0  115.0   NaN
2  mr003  109.0  120.0  130.0   NaN
0  mr002    NaN    NaN    NaN  34.5
1  mr001    NaN    NaN    NaN  39.7
2  mr003    NaN    NaN    NaN  38.0
3  mr004    NaN    NaN    NaN  45.0


## 20. How to delete a column from a Pandas DataFrame?

In [39]:
import pandas as pd
data = {
    'name':['Jack','Sarah','Mike','David'],
    'age':[24,30, 21, 29],
    'height':[175,165,180,170]
}
df = pd.DataFrame(data=data)
df

    name  age  height
0   Jack   24     175
1  Sarah   30     165
2   Mike   21     180
3  David   29     170


In [40]:
df.drop('height',axis=1,inplace=True)
df

    name  age
0   Jack   24
1  Sarah   30
2   Mike   21
3  David   29


## 21. How to add a row to a Pandas DataFrame?

In [41]:
import pandas as pd
data = {
    'name':['Jack','Sarah','Mike','David'],
    'age':[24,30, 21, 29],
    'height':[175,165,180,170]
}
df = pd.DataFrame(data=data)
df

    name  age  height
0   Jack   24     175
1  Sarah   30     165
2   Mike   21     180
3  David   29     170


In [42]:
new_row = {'name':'jeames','age':28,'height':181}
df.loc[len(df)] = new_row

In [43]:
df

     name  age  height
0    Jack   24     175
1   Sarah   30     165
2    Mike   21     180
3   David   29     170
4  jeames   28     181


## 22. How to delete a row from a Pandas DataFrame?

In [44]:
df.drop(1,inplace=True)
df

     name  age  height
0    Jack   24     175
2    Mike   21     180
3   David   29     170
4  jeames   28     181


## 23. How to select a range of rows by position in a Pandas DataFrame?

In [48]:
new_df = df[2:4]
new_df

     name  age  height
3   David   29     170
4  jeames   28     181


## 24. How to select a range of rows by index label in a Pandas DataFrame?

In [49]:
data = {
    'name':['Jack','Sarah','Mike','David'],
    'age':[24,30, 21, 29],
    'height':[175,165,180,170]
}
data

{'name': ['Jack', 'Sarah', 'Mike', 'David'],
 'age': [24, 30, 21, 29],
 'height': [175, 165, 180, 170]}

In [50]:
df = pd.DataFrame(data)
df

    name  age  height
0   Jack   24     175
1  Sarah   30     165
2   Mike   21     180
3  David   29     170


In [51]:
df.set_index('name',inplace=True)
new_df = df.loc['Sarah':'David']
new_df

       age  height
name
Sarah   30     165
Mike    21     180
David   29     170


## 25. How to select rows by a condition in a Pandas DataFrame?

In [52]:
df = pd.DataFrame(data)
df

    name  age  height
0   Jack   24     175
1  Sarah   30     165
2   Mike   21     180
3  David   29     170


In [54]:
new_df = df[df['height']>170]
new_df

   name  age  height
0  Jack   24     175
2  Mike   21     180


## 26. How to sort a Pandas DataFrame by a column?

In [55]:
df

    name  age  height
0   Jack   24     175
1  Sarah   30     165
2   Mike   21     180
3  David   29     170


In [56]:
new_df = df.sort_values(by='age',ascending=True)
new_df

    name  age  height
2   Mike   21     180
0   Jack   24     175
3  David   29     170
1  Sarah   30     165


## 27. How to compute the sum, mean, median, standard deviation, and variance of a column in a Pandas DataFrame?

In [64]:
df

    name  age  height
0   Jack   24     175
1  Sarah   30     165
2   Mike   21     180
3  David   29     170


In [58]:
total_height = df['height'].sum()

In [59]:
total_height

690

In [60]:
avg_height = df['height'].mean()

In [61]:
avg_height

172.5

In [65]:
median_height = df['height'].median()
median_height

172.5

In [66]:
std_height = df['height'].std()
std_height

6.454972243679028

In [67]:
var_height = df['height'].var()
var_height

41.666666666666664

## 32. How to find the maximum and minimum values in a Pandas DataFrame?

In [68]:
import pandas as pd
data = {
    'name':['Jack','Sarah','Mike','David','Zoe'],
    'age':[24,30, 21, 29, 28],
    'height':[175,165,180,170, 172]
}
df = pd.DataFrame(data)
df

    name  age  height
0   Jack   24     175
1  Sarah   30     165
2   Mike   21     180
3  David   29     170
4    Zoe   28     172


In [69]:
max_age = df['age'].max()
max_age

30

In [70]:
min_age = df['age'].min()
min_age

21

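## 33. How to find the maximum and minimum values in a specific row of a Pandas DataFrame?

In [71]:
# df.loc[2, 'height'] is a single cell, so its max and min are the same value.
# Take the numeric columns of row 2 (Mike) instead and compare across them.
row = df[['age', 'height']].loc[2]
max_value = row.max()
min_value = row.min()
max_value,min_value

(180, 21)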

## 34. How to rename columns in a Pandas DataFrame?

In [72]:
# Column names are Chinese: 名字 (name), 年龄 (age), 身高 (height)
data = {
    '名字':['Jack','Sarah','Mike','David','Zoe'],
    '年龄':[24,30, 21, 29, 28],
    '身高':[175,165,180,170, 172]
}
df = pd.DataFrame(data)

In [74]:
df.rename(columns={'名字':'姓名'},inplace=True)
df,df.columns

(    姓名  年龄  身高
 0   Jack    24   175
 1  Sarah    30   165
 2   Mike    21   180
 3  David    29   170
 4    Zoe    28   172,
 Index(['姓名', '年龄', '身高'], dtype='object'))

## 35. How to replace specific values in a Pandas DataFrame?

In [75]:
data = {'name':['Jack','Sarah','Mike','David']}
df = pd.DataFrame(data)
df

    name
0   Jack
1  Sarah
2   Mike
3  David


In [76]:
df['name'] = df['name'].replace(to_replace=r'ck',value='bb',regex=True)
df

    name
0   Jabb
1  Sarah
2   Mike
3  David


## 36. How to replace specific values with missing values in a Pandas DataFrame?

In [93]:
df = pd.DataFrame({'A':[1,2,3,4,5],'B':['a','b','c','d','e'],'C':[0,1,2,3,4]})
df

   A  B  C
0  1  a  0
1  2  b  1
2  3  c  2
3  4  d  3
4  5  e  4


In [94]:
import numpy as np
df = df.replace(3,np.nan)

In [95]:
df

     A  B    C
0  1.0  a  0.0
1  2.0  b  1.0
2  NaN  c  2.0
3  4.0  d  NaN
4  5.0  e  4.0


In [96]:
df.isnull()

       A      B      C
0  False  False  False
1  False  False  False
2   True  False  False
3  False  False   True
4  False  False  False


## 37. How to fill missing values in a Pandas DataFrame?

In [97]:
df.fillna(value=0,inplace=True)
df

     A  B    C
0  1.0  a  0.0
1  2.0  b  1.0
2  0.0  c  2.0
3  4.0  d  0.0
4  5.0  e  4.0


In [98]:
df = pd.DataFrame({'A':[1,2,np.nan,4],'B':[5,np.nan,7,8]})
df

     A    B
0  1.0  5.0
1  2.0  NaN
2  NaN  7.0
3  4.0  8.0


In [99]:
df.ffill(inplace=True)  # fillna(method='ffill') is deprecated as of pandas 2.1
df

     A    B
0  1.0  5.0
1  2.0  5.0
2  2.0  7.0
3  4.0  8.0


In [101]:
df = pd.DataFrame({'A':[1,2,np.nan,4],'B':[5,np.nan,7,8]})
df = df.bfill()  # older, deprecated form: df = df.fillna(method='bfill')
df

     A    B
0  1.0  5.0
1  2.0  7.0
2  4.0  7.0
3  4.0  8.0


In [115]:
df = pd.DataFrame({'A':[1,2,np.nan,4],'B':[5,np.nan,7,8]})
print(df)
df = df.fillna(value={'A':-1,'B':-2})

     A    B
0  1.0  5.0
1  2.0  NaN
2  NaN  7.0
3  4.0  8.0


In [116]:
print(df)

     A    B
0  1.0  5.0
1  2.0 -2.0
2 -1.0  7.0
3  4.0  8.0


## 38. How to drop missing values from a Pandas DataFrame?

In [117]:
df = pd.DataFrame({'A':[1,2,None,4],'B':[5,None,7,8],'C':[9,10,11,None]})
df

     A    B     C
0  1.0  5.0   9.0
1  2.0  NaN  10.0
2  NaN  7.0  11.0
3  4.0  8.0   NaN


In [118]:
df_dropna = df.dropna()
df_dropna

     A    B    C
0  1.0  5.0  9.0


In [120]:
df_dropna_columns = df.dropna(axis=1)  # every column contains a NaN, so all are dropped
df_dropna_columns

Empty DataFrame
Columns: []
Index: [0, 1, 2, 3]


In [121]:
df_dropna_subset_b = df.dropna(subset=['B'])  # drop only the rows where B is NaN
df_dropna_subset_b

     A    B     C
0  1.0  5.0   9.0
2  NaN  7.0  11.0
3  4.0  8.0   NaN


## 39. How to use aggregate functions in Pandas?

In [122]:
data = {
    'Name':['Tom','Tom','Mary','Mary','Jack','Jack'],
    'Subject':['Math','English','Math','English','Math','English'],
    'Score':[80,70,85,75,90,95]
}
df = pd.DataFrame(data)
df

   Name  Subject  Score
0   Tom     Math     80
1   Tom  English     70
2  Mary     Math     85
3  Mary  English     75
4  Jack     Math     90
5  Jack  English     95


In [123]:
grouped = df.groupby(['Name','Subject']).mean()
grouped

              Score
Name Subject
Jack English   95.0
     Math      90.0
Mary English   75.0
     Math      85.0
Tom  English   70.0
     Math      80.0


## 40. How to create an empty DataFrame in Pandas?

In [125]:
df = pd.DataFrame()
print(df.empty)

True


## 41. How to group and aggregate in Pandas?

In [126]:
data = {
    'Name':['Tom','Tom','Mary','Mary','Jack','Jack'],
    'Subject':['Math','English','Math','English','Math','English'],
    'Score':[80,70,85,75,90,95]
}
df = pd.DataFrame(data)
df

   Name  Subject  Score
0   Tom     Math     80
1   Tom  English     70
2  Mary     Math     85
3  Mary  English     75
4  Jack     Math     90
5  Jack  English     95


In [127]:
grouped = df.groupby(['Name'])['Score'].agg(['mean','max','min','count'])
grouped

      mean  max  min  count
Name
Jack  92.5   95   90      2
Mary  80.0   85   75      2
Tom   75.0   80   70      2


## 42. How to convert data types in Pandas?

In [2]:
import pandas as pd
data = {
    'Name':['Tom','Tom','Mary','Mary','Jack','Jack'],
    'Subject':['Math','English','Math','English','Math','English'],
    'Score':['80','70','85','75','90','95']
}
df = pd.DataFrame(data)
df

   Name  Subject Score
0   Tom     Math    80
1   Tom  English    70
2  Mary     Math    85
3  Mary  English    75
4  Jack     Math    90
5  Jack  English    95


In [3]:
print(df.dtypes)

Name       object
Subject    object
Score      object
dtype: object


In [4]:
df['Score'] = df['Score'].astype(int)

In [5]:
print(df.dtypes)

Name       object
Subject    object
Score       int32
dtype: object


In [6]:
df

   Name  Subject  Score
0   Tom     Math     80
1   Tom  English     70
2  Mary     Math     85
3  Mary  English     75
4  Jack     Math     90
5  Jack  English     95
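
`astype(int)` raises an error as soon as one value cannot be parsed as an integer. A minimal sketch of a more tolerant conversion with `pd.to_numeric`, assuming a hypothetical Series `scores` that contains one unparseable entry:

```python
import pandas as pd

# Hypothetical score column; 'n/a' cannot be parsed as a number
scores = pd.Series(['80', '70', 'n/a'])

# errors='coerce' turns unparseable values into NaN instead of raising
numeric = pd.to_numeric(scores, errors='coerce')
print(numeric)
print(numeric.dtype)
```

Because `NaN` is a floating-point value, the result is a float column rather than an integer one.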


## 43. How to use iteration methods in Pandas?

In [7]:
data = {'Name':['Tom','Mary','Jack'],
        'Age':[20,25,30],
        'Gender':['M','F','M']}
df = pd.DataFrame(data)
df

   Name  Age Gender
0   Tom   20      M
1  Mary   25      F
2  Jack   30      M


In [8]:
for index,row in df.iterrows():
    print(index,row)

0 Name      Tom
Age        20
Gender      M
Name: 0, dtype: object
1 Name      Mary
Age         25
Gender       F
Name: 1, dtype: object
2 Name      Jack
Age         30
Gender       M
Name: 2, dtype: object
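
`iterrows()` builds a new Series for every row, which is convenient but slow on large frames. As a sketch of a common alternative, `itertuples()` yields lightweight namedtuples instead; the frame `df_demo` below is a hypothetical stand-in for the data above:

```python
import pandas as pd

# Hypothetical frame with the same Name/Age columns as above
df_demo = pd.DataFrame({'Name': ['Tom', 'Mary', 'Jack'],
                        'Age': [20, 25, 30]})

# Each row is a namedtuple; the index is exposed as row.Index
for row in df_demo.itertuples(index=True):
    print(row.Index, row.Name, row.Age)

# Column values are read as attributes
names = [row.Name for row in df_demo.itertuples()]
```

For column-wise calculations, vectorized operations such as `df_demo['Age'] + 1` are usually preferable to any explicit loop.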


## 44. How to use cross-tabulation in Pandas?

In [9]:
data = {
    'Name':['Tom','Tom','Mary','Mary','Jack','Jack'],
    'Subject':['Math','English','Math','English','Math','English'],
    'Score':['80','70','85','75','90','95']
}
df = pd.DataFrame(data)

In [10]:
df

   Name  Subject Score
0   Tom     Math    80
1   Tom  English    70
2  Mary     Math    85
3  Mary  English    75
4  Jack     Math    90
5  Jack  English    95


In [11]:
cross = pd.crosstab(df['Name'],df['Subject'])
cross

Subject  English  Math
Name
Jack           1     1
Mary           1     1
Tom            1     1


## 45. How to reshape data in Pandas?

In [12]:
pivot = df.pivot(index='Name',columns='Subject',values='Score')
pivot

Subject English Math
Name
Jack         95   90
Mary         75   85
Tom          70   80


## 46. How to use the `factorize` function in Pandas?

In [14]:
data = {'Name':['Tom','Mary','Jack','Tom','Mary'],
        'Gender':['M','F','M','M','F']}
df = pd.DataFrame(data)
df

   Name Gender
0   Tom      M
1  Mary      F
2  Jack      M
3   Tom      M
4  Mary      F


In [15]:
factorized,_ = pd.factorize(df['Gender'])  # integer codes; the unique values are discarded
df['GenderFactor'] = factorized
df

   Name Gender  GenderFactor
0   Tom      M             0
1  Mary      F             1
2  Jack      M             0
3   Tom      M             0
4  Mary      F             1


## 47. How to create random dates in Pandas?

In [16]:
import numpy as np
start_date = '2020-01-01'
end_date = '2020-12-31'
date_range = pd.date_range(start=start_date, end=end_date)
# Draw 10 dates at random (with replacement) from the range
sampled_dates = pd.Series(np.random.choice(date_range,size=10))
sampled_dates

0   2020-01-26
1   2020-02-26
2   2020-03-09
3   2020-07-31
4   2020-07-17
5   2020-08-04
6   2020-05-06
7   2020-09-03
8   2020-07-05
9   2020-08-02
dtype: datetime64[ns]

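## 48. How to use string functions in Pandas?

String methods are available through the `.str` accessor of a Series:

- `str.upper()`: converts letters to uppercase
- `str.lower()`: converts letters to lowercase
- `str.strip()`: removes leading and trailing whitespace
- `str.replace()`: replaces one substring with another
- `str.contains()`: checks whether each string contains a given substring
- `str.startswith()`: checks whether each string starts with a given prefix
- `str.endswith()`: checks whether each string ends with a given suffix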

In [19]:
data = {'name':['Alex','Bob','Charlie','David','Emily'],
        'city':['AUSTIN','HoUSTON','DALLAS ','Austin','houston ']}
df = pd.DataFrame(data=data)
df

      name      city
0     Alex    AUSTIN
1      Bob   HoUSTON
2  Charlie   DALLAS 
3    David    Austin
4    Emily  houston 


In [20]:
df['city'] = df['city'].str.strip().str.lower().str.replace('houston','Houston')
df

      name     city
0     Alex   austin
1      Bob  Houston
2  Charlie   dallas
3    David   austin
4    Emily  Houston
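
The cell above exercises only stripping, lowercasing, and replacement. As a short sketch of the case-conversion and matching methods, using a hypothetical Series `cities`:

```python
import pandas as pd

# Hypothetical city names, used only for this sketch
cities = pd.Series(['Austin', 'Houston', 'Dallas'])

upper = cities.str.upper()                       # uppercase every string
has_us = cities.str.contains('us', case=False)   # case-insensitive substring test
starts_a = cities.str.startswith('A')            # prefix test
ends_n = cities.str.endswith('n')                # suffix test

print(upper.tolist())
print(has_us.tolist())
```

The three matching methods return boolean Series, so they can be used directly as row filters, for example `cities[ends_n]`.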


## 49. How to split strings in Pandas?

In [21]:
data = {
    'Name':['Tom,Lee','Mary,Smith','Jack,Wang'],
    'Address':['Beijing,China','New York,USA','Shanghai,China']
}
df = pd.DataFrame(data=data)
df

         Name         Address
0     Tom,Lee   Beijing,China
1  Mary,Smith    New York,USA
2   Jack,Wang  Shanghai,China


In [22]:
df[['FirstName','LastName']] = df['Name'].str.split(',',expand=True)
df[['city','state']] = df['Address'].str.split(',',expand=True)
df

         Name         Address FirstName LastName      city  state
0     Tom,Lee   Beijing,China       Tom      Lee   Beijing  China
1  Mary,Smith    New York,USA      Mary    Smith  New York    USA
2   Jack,Wang  Shanghai,China      Jack     Wang  Shanghai  China
