# name:shen
# time:2023/3/8
# content:

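# 2.2 Basic Data Structures
> pandas has two basic data storage structures: ```Series```, which stores one-dimensional values, and ```DataFrame```, which stores two-dimensional values. Many attributes and methods are defined on both structures.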

![](./Images/data.png)

## 1.Series
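> A Series generally consists of four parts: the values ```data```, the index ```index```, the storage type ```dtype```, and the name of the series ```name```. The index can also be given a name of its own, which is empty by default.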



In [1]:
import pandas as pd

In [2]:
s = pd.Series(data = [100,'a',{'dic1':5}],
                index = pd.Index([1,2,3],name='my_idx'),
                dtype = 'object',
                name = 'my_name')

In [3]:
s

my_idx
1            100
2              a
3    {'dic1': 5}
Name: my_name, dtype: object

In [4]:
s.index

Int64Index([1, 2, 3], dtype='int64', name='my_idx')

In [5]:
s.values

array([100, 'a', {'dic1': 5}], dtype=object)

In [6]:
s.dtype

dtype('O')
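
The cells above read the index, values, and dtype. As a minimal sketch, the remaining parts (the Series name and the index name) can be read the same way. The variable names `series_name`, `index_name`, `series_shape`, and `unnamed` are introduced here only for illustration.

```python
import pandas as pd

# Same Series as in the cells above
s = pd.Series(data=[100, 'a', {'dic1': 5}],
              index=pd.Index([1, 2, 3], name='my_idx'),
              dtype='object',
              name='my_name')

series_name = s.name        # name of the Series itself
index_name = s.index.name   # name attached to the index
series_shape = s.shape      # length of the Series as a 1-tuple

print(series_name)   # my_name
print(index_name)    # my_idx
print(series_shape)  # (3,)

# A Series built without an index name leaves that name as None
unnamed = pd.Series([1, 2, 3])
print(unnamed.index.name)  # None
```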

In [7]:
s1 = pd.Series(
    data = [67,78,75],
    index = pd.Index(["Math","Chinese","English"],name = "Subject")
)

In [8]:
s1

Subject
Math       67
Chinese    78
English    75
dtype: int64

In [9]:
s2 = pd.Series(
    data=["Chinese","Math","English"]
)

In [10]:
s2

0    Chinese
1       Math
2    English
dtype: object

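## 2.DataFrame (a collection of list data with the same features and length can be described with a DataFrame)
> A DataFrame adds a column index on top of a Series. A data frame can be constructed from two-dimensional data together with row and column indexes: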




In [11]:
## DataFrame creation example 1:
data = [[1,'a',1.2],
        [2,'b',2.2],
        [3,'c',3.2]]

In [12]:
df = pd.DataFrame(
    data = data,
    index = ['row_0','row_1','row_2'],
    columns = ['col_0','col_1','col_2'],
)

In [13]:
df

| | col_0 | col_1 | col_2 |
|---|---|---|---|
| row_0 | 1 | a | 1.2 |
| row_1 | 2 | b | 2.2 |
| row_2 | 3 | c | 3.2 |


In [14]:
## DataFrame creation example 2:
data = {
    'col_0':[1,2,3],
    'col_1':['a','b','c'],
    'col_2':[1.2,2.2,3.2]
}

In [15]:
df = pd.DataFrame(
    data = data,
    index = ['row_0','row_1','row_2']
)

In [16]:
df

| | col_0 | col_1 | col_2 |
|---|---|---|---|
| row_0 | 1 | a | 1.2 |
| row_1 | 2 | b | 2.2 |
| row_2 | 3 | c | 3.2 |


In [17]:
df.col_0

row_0    1
row_1    2
row_2    3
Name: col_0, dtype: int64

In [18]:
df['col_0']

row_0    1
row_1    2
row_2    3
Name: col_0, dtype: int64

In [19]:
df[['col_0','col_2']]

| | col_0 | col_2 |
|---|---|---|
| row_0 | 1 | 1.2 |
| row_1 | 2 | 2.2 |
| row_2 | 3 | 3.2 |


In [20]:
df.iloc[1:2]

| | col_0 | col_1 | col_2 |
|---|---|---|---|
| row_1 | 2 | b | 2.2 |


In [21]:
df.iloc[1]

col_0      2
col_1      b
col_2    2.2
Name: row_1, dtype: object

In [22]:
## df.iloc[row selector, column selector]
df.iloc[:,2]

row_0    1.2
row_1    2.2
row_2    3.2
Name: col_2, dtype: float64

In [23]:
## Slice both rows and columns with iloc
df.iloc[1:3,1:3]

| | col_1 | col_2 |
|---|---|---|
| row_1 | b | 2.2 |
| row_2 | c | 3.2 |


In [24]:
df.values

array([[1, 'a', 1.2],
       [2, 'b', 2.2],
       [3, 'c', 3.2]], dtype=object)
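
As a small sketch of the other structural attributes of a DataFrame, the frame from the cells above can be inspected as follows. `numeric_values` is a name introduced only for this illustration. It shows that selecting only numeric columns gives `.values` a numeric dtype instead of `object`.

```python
import pandas as pd

# Same frame as in the cells above
df = pd.DataFrame(
    data={'col_0': [1, 2, 3],
          'col_1': ['a', 'b', 'c'],
          'col_2': [1.2, 2.2, 3.2]},
    index=['row_0', 'row_1', 'row_2'],
)

print(df.shape)          # (3, 3)
print(list(df.index))    # ['row_0', 'row_1', 'row_2']
print(list(df.columns))  # ['col_0', 'col_1', 'col_2']
print(df.dtypes)         # one dtype per column

# Mixing int and float columns upcasts to float; no object dtype is needed
numeric_values = df[['col_0', 'col_2']].values
print(numeric_values.dtype)  # float64
```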

## Exercises (refer to the pandas cheat sheet)
> 1.iloc
> 2.loc (row labels first, then column labels)
> 3.iat
> 4.at
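
As a starting point for the exercises, here is a minimal sketch of the four accessors on the small frame used earlier. The result names (`by_position`, `by_label`, and so on) are made up for this example. In all four accessors the row selector comes first and the column selector second.

```python
import pandas as pd

df = pd.DataFrame(
    data={'col_0': [1, 2, 3],
          'col_1': ['a', 'b', 'c'],
          'col_2': [1.2, 2.2, 3.2]},
    index=['row_0', 'row_1', 'row_2'],
)

# iloc: integer positions, the end of a slice is excluded
by_position = df.iloc[0:2, 0:2]

# loc: labels, the end of a slice is included
by_label = df.loc['row_0':'row_1', 'col_0':'col_1']

# iat / at: access to a single scalar by position / by label
scalar_by_position = df.iat[1, 2]
scalar_by_label = df.at['row_1', 'col_2']

print(by_position.equals(by_label))         # True
print(scalar_by_position, scalar_by_label)  # 2.2 2.2
```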

# Common Basic Functions
* 1. Summary functions

In [25]:
import pandas as pd

In [26]:
df = pd.read_csv('./data/learn_pandas.csv')

In [27]:
df.columns

Index(['School', 'Grade', 'Name', 'Gender', 'Height', 'Weight', 'Transfer',
       'Test_Number', 'Test_Date', 'Time_Record'],
      dtype='object')

In [28]:
df.head()  # first five rows

| | School | Grade | Name | Gender | Height | Weight | Transfer | Test_Number | Test_Date | Time_Record |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Shanghai Jiao Tong University | Freshman | Gaopeng Yang | Female | 158.9 | 46.0 | N | 1 | 2019/10/5 | 0:04:34 |
| 1 | Peking University | Freshman | Changqiang You | Male | 166.5 | 70.0 | N | 1 | 2019/9/4 | 0:04:20 |
| 2 | Shanghai Jiao Tong University | Senior | Mei Sun | Male | 188.9 | 89.0 | N | 2 | 2019/9/12 | 0:05:22 |
| 3 | Fudan University | Sophomore | Xiaojuan Sun | Female | NaN | 41.0 | N | 2 | 2020/1/3 | 0:04:08 |
| 4 | Fudan University | Sophomore | Gaojuan You | Male | 174.0 | 74.0 | N | 2 | 2019/11/6 | 0:05:22 |


In [29]:
df.tail() # last five rows

| | School | Grade | Name | Gender | Height | Weight | Transfer | Test_Number | Test_Date | Time_Record |
|---|---|---|---|---|---|---|---|---|---|---|
| 195 | Fudan University | Junior | Xiaojuan Sun | Female | 153.9 | 46.0 | N | 2 | 2019/10/17 | 0:04:31 |
| 196 | Tsinghua University | Senior | Li Zhao | Female | 160.9 | 50.0 | N | 3 | 2019/9/22 | 0:04:03 |
| 197 | Shanghai Jiao Tong University | Senior | Chengqiang Chu | Female | 153.9 | 45.0 | N | 1 | 2020/1/5 | 0:04:48 |
| 198 | Shanghai Jiao Tong University | Senior | Chengmei Shen | Male | 175.3 | 71.0 | N | 2 | 2020/1/7 | 0:04:58 |
| 199 | Tsinghua University | Sophomore | Chunpeng Lv | Male | 155.7 | 51.0 | N | 1 | 2019/11/6 | 0:05:05 |


In [30]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   School       200 non-null    object 
 1   Grade        200 non-null    object 
 2   Name         200 non-null    object 
 3   Gender       200 non-null    object 
 4   Height       183 non-null    float64
 5   Weight       189 non-null    float64
 6   Transfer     188 non-null    object 
 7   Test_Number  200 non-null    int64  
 8   Test_Date    200 non-null    object 
 9   Time_Record  200 non-null    object 
dtypes: float64(2), int64(1), object(7)
memory usage: 15.8+ KB


In [31]:
df.describe()

| | Height | Weight | Test_Number |
|---|---|---|---|
| count | 183.0 | 189.0 | 200.0 |
| mean | 163.218033 | 55.015873 | 1.645 |
| std | 8.608879 | 12.824294 | 0.722207 |
| min | 145.4 | 34.0 | 1.0 |
| 25% | 157.15 | 46.0 | 1.0 |
| 50% | 161.9 | 51.0 | 1.5 |
| 75% | 167.5 | 65.0 | 2.0 |
| max | 193.9 | 89.0 | 3.0 |


* 2. Feature statistics functions
> Many statistical functions are defined on Series and DataFrame. The most common are sum, mean, median, var, std, max, and min.


In [32]:
# Select the numeric columns used for the statistics below
df_demo = df[['Height', 'Weight']]

In [None]:
df_demo

In [None]:
df_demo.mean()

In [None]:
df_demo.max()

In [None]:
df_demo.quantile(0.75)

In [None]:
df_demo.count()

In [None]:
df_demo.idxmax()
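
The statistical functions above can be tried without the CSV file. The sketch below uses a small hypothetical sample that borrows only the column names `Height` and `Weight`; the numbers are made up and are not rows of `learn_pandas.csv`. It shows that missing values are skipped by default and that `idxmax` returns an index label, not a value.

```python
import numpy as np
import pandas as pd

# Hypothetical sample, not taken from learn_pandas.csv
df_demo = pd.DataFrame({
    'Height': [158.9, 166.5, 188.9, np.nan, 174.0],
    'Weight': [46.0, 70.0, 89.0, 41.0, 74.0],
})

mean_values = df_demo.mean()    # NaN is skipped by default
max_values = df_demo.max()      # column-wise maximum
q75 = df_demo.quantile(0.75)    # 75% quantile with linear interpolation
non_null = df_demo.count()      # number of non-missing values per column
idx_of_max = df_demo.idxmax()   # index label where each column is largest

print(non_null)
print(idx_of_max)
```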

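# Practice 1
* Compute the mean, maximum, and minimum of height and weight for each school
* Compute the male-to-female ratio for each school
* Count the number of students in each Grade for each school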

In [35]:
df

| | School | Grade | Name | Gender | Height | Weight | Transfer | Test_Number | Test_Date | Time_Record |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Shanghai Jiao Tong University | Freshman | Gaopeng Yang | Female | 158.9 | 46.0 | N | 1 | 2019/10/5 | 0:04:34 |
| 1 | Peking University | Freshman | Changqiang You | Male | 166.5 | 70.0 | N | 1 | 2019/9/4 | 0:04:20 |
| 2 | Shanghai Jiao Tong University | Senior | Mei Sun | Male | 188.9 | 89.0 | N | 2 | 2019/9/12 | 0:05:22 |
| 3 | Fudan University | Sophomore | Xiaojuan Sun | Female | NaN | 41.0 | N | 2 | 2020/1/3 | 0:04:08 |
| 4 | Fudan University | Sophomore | Gaojuan You | Male | 174.0 | 74.0 | N | 2 | 2019/11/6 | 0:05:22 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 195 | Fudan University | Junior | Xiaojuan Sun | Female | 153.9 | 46.0 | N | 2 | 2019/10/17 | 0:04:31 |
| 196 | Tsinghua University | Senior | Li Zhao | Female | 160.9 | 50.0 | N | 3 | 2019/9/22 | 0:04:03 |
| 197 | Shanghai Jiao Tong University | Senior | Chengqiang Chu | Female | 153.9 | 45.0 | N | 1 | 2020/1/5 | 0:04:48 |
| 198 | Shanghai Jiao Tong University | Senior | Chengmei Shen | Male | 175.3 | 71.0 | N | 2 | 2020/1/7 | 0:04:58 |


In [34]:
df['School'].unique()


array(['Shanghai Jiao Tong University', 'Peking University',
       'Fudan University', 'Tsinghua University'], dtype=object)

In [33]:
df.query("School=='Peking University'")

| | School | Grade | Name | Gender | Height | Weight | Transfer | Test_Number | Test_Date | Time_Record |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Peking University | Freshman | Changqiang You | Male | 166.5 | 70.0 | N | 1 | 2019/9/4 | 0:04:20 |
| 9 | Peking University | Junior | Juan Xu | Female | 164.8 | NaN | N | 3 | 2019/10/5 | 0:04:05 |
| 20 | Peking University | Junior | Changjuan You | Female | 161.4 | 47.0 | N | 1 | 2019/10/5 | 0:04:08 |
| 29 | Peking University | Sophomore | Changmei Xu | Female | 151.6 | 43.0 | N | 2 | 2020/1/3 | 0:04:28 |
| 30 | Peking University | Senior | Changli Lv | Female | 148.7 | 41.0 | N | 2 | 2019/11/13 | 0:04:54 |
| 32 | Peking University | Freshman | Gaopeng Shi | Female | 162.9 | 48.0 | N | 1 | 2019/9/12 | 0:04:58 |
| 35 | Peking University | Freshman | Gaoli Zhao | Male | 175.4 | 78.0 | N | 2 | 2019/10/8 | 0:03:32 |
| 36 | Peking University | Freshman | Xiaojuan Qin | Male | NaN | 79.0 | Y | 1 | 2019/12/10 | 0:04:10 |
| 38 | Peking University | Freshman | Qiang Han | Male | 185.3 | 87.0 | N | 3 | 2020/1/7 | 0:03:58 |
| 45 | Peking University | Freshman | Quan Chu | Female | 154.7 | 43.0 | N | 1 | 2019/11/28 | 0:04:47 |
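
The `query` call above filters rows with an expression string. As a rough sketch on a few hypothetical rows (only the column names follow the data set), the same filter can also be written as a boolean mask. `by_query` and `by_mask` are illustrative names.

```python
import pandas as pd

# Hypothetical rows that only reuse column names of the data set
df = pd.DataFrame({
    'School': ['Peking University', 'Fudan University', 'Peking University'],
    'Weight': [70.0, 41.0, 47.0],
})

# Expression-string filter, as in the cell above
by_query = df.query("School == 'Peking University'")

# Equivalent boolean-mask filter
by_mask = df[df['School'] == 'Peking University']

print(by_query.equals(by_mask))  # True
print(list(by_query.index))      # [0, 2]
```

Both forms keep the original index labels of the matching rows.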


In [36]:
df['Weight'].max()

89.0

In [37]:
df['Weight'].mean()

55.01587301587302

In [38]:
df['Weight'].min()

34.0

In [39]:
df['Height'].max()

193.9

In [40]:
df['Height'].min()

145.4

In [41]:
df['Gender'].value_counts()

Female    141
Male       59
Name: Gender, dtype: int64

In [42]:
df['Grade'].value_counts()

Junior       59
Senior       55
Freshman     52
Sophomore    34
Name: Grade, dtype: int64
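
The cells above compute the statistics over the whole table, while the three tasks ask for results per school. One possible approach, sketched here on a small hypothetical frame (the rows are made up; only the column names follow `learn_pandas.csv`), is to group by `School` first. The names `stats`, `gender_share`, and `grade_counts` are introduced only for this example.

```python
import pandas as pd

# Hypothetical rows; only the column names follow learn_pandas.csv
df = pd.DataFrame({
    'School': ['Peking University', 'Peking University', 'Peking University',
               'Fudan University', 'Fudan University'],
    'Grade': ['Freshman', 'Senior', 'Freshman', 'Junior', 'Junior'],
    'Gender': ['Female', 'Male', 'Female', 'Male', 'Female'],
    'Height': [158.9, 166.5, 160.1, 174.0, 153.9],
    'Weight': [46.0, 70.0, 50.0, 74.0, 46.0],
})

# Task 1: mean, max, and min of height and weight per school
stats = df.groupby('School')[['Height', 'Weight']].agg(['mean', 'max', 'min'])

# Task 2: share of each gender within every school
gender_share = df.groupby('School')['Gender'].value_counts(normalize=True)

# Task 3: number of rows for every (School, Grade) pair
grade_counts = df.groupby(['School', 'Grade']).size()

print(stats)
print(gender_share)
print(grade_counts)
```

With the real data, the same three statements could be run on the `df` loaded from the CSV file instead of the hypothetical frame.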