URL: https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html

# Imports

In [2]:
pip install numpy

Collecting numpy
  Downloading numpy-2.2.2-cp311-cp311-win_amd64.whl.metadata (60 kB)
Downloading numpy-2.2.2-cp311-cp311-win_amd64.whl (12.9 MB)

In [4]:
pip install pandas

Collecting pandas
  Using cached pandas-2.2.3-cp311-cp311-win_amd64.whl.metadata (19 kB)
Collecting pytz>=2020.1 (from pandas)
  Downloading pytz-2025.1-py2.py3-none-any.whl.metadata (22 kB)
Collecting tzdata>=2022.7 (from pandas)
  Downloading tzdata-2025.1-py2.py3-none-any.whl.metadata (1.4 kB)
Using cached pandas-2.2.3-cp311-cp311-win_amd64.whl (11.6 MB)
Downloading pytz-2025.1-py2.py3-none-any.whl (507 kB)
Downloading tzdata-2025.1-py2.py3-none-any.whl (346 kB)
Installing collected packages: pytz, tzdata, pandas
Successfully installed pandas-2.2.3 pytz-2025.1 tzdata-2025.1
Note: you may need to restart the kernel to use updated packages.


In [5]:
import numpy as np
import pandas as pd

# Basic data structures in pandas
pandas provides two types of classes for handling data:
1. `Series`: a one-dimensional labeled array holding data of any type, such as integers, strings, Python objects, etc.
2. `DataFrame`: a two-dimensional data structure that holds data like a two-dimensional array or a table with rows and columns.

# Object creation
Creating a `Series` by passing a list of values, letting pandas create a default `RangeIndex`.

In [6]:
# pd.Series() builds a six-element Series from [1, 3, 5, np.nan, 6, 8]; np.nan marks a missing value.
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
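
The `float64` dtype in the output above comes from the missing value: `np.nan` is a float, so the integers next to it are upcast. A minimal sketch of that behavior follows; the names `ints` and `labeled`, and the index labels `"a"`, `"b"`, `"c"`, are illustrative assumptions rather than part of the original guide.

```python
import numpy as np
import pandas as pd

# Without a missing value, a list of Python integers is stored as int64.
ints = pd.Series([1, 3, 5])

# np.nan is a float, so its presence upcasts the whole Series to float64.
s = pd.Series([1, 3, 5, np.nan, 6, 8])

# Hypothetical labels passed via index= instead of the default RangeIndex.
labeled = pd.Series([1, 3, 5], index=["a", "b", "c"])

print(ints.dtype, s.dtype)
print(s.isna().sum())   # number of missing values in s
print(labeled["b"])     # lookup by label rather than by position
```

Passing `index=` is optional; when it is omitted pandas falls back to the `RangeIndex` shown in the output above.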

Creating a `DataFrame` by passing a NumPy array with a datetime index using `date_range()` and labeled columns:

In [7]:
# pd.date_range() generates six consecutive dates, 2013-01-01 through 2013-01-06, with a step of one day.
dates = pd.date_range("20130101", periods=6)
dates


DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [8]:
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
df

Unnamed: 0,A,B,C,D
2013-01-01,-1.095344,0.951838,-0.086141,0.737152
2013-01-02,0.331015,1.347802,-0.517901,-1.36863
2013-01-03,0.183115,-0.164717,0.133061,2.072672
2013-01-04,0.395993,-0.32799,-0.19245,1.952368
2013-01-05,0.266103,-0.266773,-0.70663,0.141684
2013-01-06,-0.182697,1.013587,-0.732756,-0.741913
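
Because `np.random.randn` draws fresh random numbers on every run, the values in the table above will differ each time the cell is executed. One way to get repeatable numbers is sketched below, assuming NumPy's `default_rng` generator; the seed `0` and the name `rng` are arbitrary choices made for illustration and do not reproduce the values shown above.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)

# A seeded generator returns the same draws on every run.
rng = np.random.default_rng(0)  # the seed value is an arbitrary assumption
df = pd.DataFrame(rng.standard_normal((6, 4)), index=dates, columns=list("ABCD"))

print(df.shape)
print(df.index[0], df.index[-1])
```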


Creating a `DataFrame` by passing a dictionary of objects where the keys are the column labels and the values are the column values:

In [10]:
df2 = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)
df2


Unnamed: 0,A,B,C,D,E,F
0,1.0,2013-01-02,1.0,3,test,foo
1,1.0,2013-01-02,1.0,3,train,foo
2,1.0,2013-01-02,1.0,3,test,foo
3,1.0,2013-01-02,1.0,3,train,foo


The columns of the resulting `DataFrame` have different dtypes:

In [11]:
df2.dtypes

A          float64
B    datetime64[s]
C          float32
D            int32
E         category
F           object
dtype: object
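
In the dictionary constructor, scalar values are broadcast to the length set by the array-like values, which is why `A`, `B`, and `F` repeat on all four rows of `df2`. The sketch below rebuilds a smaller frame and then filters columns by dtype with `DataFrame.select_dtypes()`; the frame `small` and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

small = pd.DataFrame(
    {
        "x": 1.0,                                # scalar, broadcast to every row
        "y": np.array([3] * 4, dtype="int32"),   # array-like, fixes the length at 4 rows
        "z": pd.Categorical(["test", "train", "test", "train"]),
    }
)

# Keep only the numeric columns (float64 and int32 here).
numeric = small.select_dtypes(include="number")

print(len(small))
print(list(numeric.columns))
print(small["z"].cat.categories.tolist())
```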

# Viewing data

Use `DataFrame.head()` and `DataFrame.tail()` to view the top and bottom rows of the frame respectively:

In [16]:
df.head()

Unnamed: 0,A,B,C,D
2013-01-01,-1.095344,0.951838,-0.086141,0.737152
2013-01-02,0.331015,1.347802,-0.517901,-1.36863
2013-01-03,0.183115,-0.164717,0.133061,2.072672
2013-01-04,0.395993,-0.32799,-0.19245,1.952368
2013-01-05,0.266103,-0.266773,-0.70663,0.141684


In [17]:
df.tail(3)

Unnamed: 0,A,B,C,D
2013-01-04,0.395993,-0.32799,-0.19245,1.952368
2013-01-05,0.266103,-0.266773,-0.70663,0.141684
2013-01-06,-0.182697,1.013587,-0.732756,-0.741913


Display the `DataFrame.index` or `DataFrame.columns`:

In [18]:
df.index

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [19]:
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

Return a NumPy representation of the underlying data with `DataFrame.to_numpy()` without the index or column labels:

In [20]:
df.to_numpy()

array([[-1.0953436 ,  0.95183837, -0.08614059,  0.73715167],
       [ 0.33101457,  1.34780201, -0.51790122, -1.36862978],
       [ 0.18311493, -0.16471708,  0.13306107,  2.07267151],
       [ 0.39599323, -0.32798959, -0.19244994,  1.95236788],
       [ 0.26610295, -0.26677276, -0.70663015,  0.14168352],
       [-0.18269738,  1.01358673, -0.73275591, -0.74191323]])
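
When every column shares one dtype, as in `df`, `to_numpy()` returns an array of that dtype. For a frame with mixed column types, pandas has to pick a single dtype that can hold every value, which often ends up being `object`. A small sketch with two hypothetical frames, `floats` and `mixed`:

```python
import numpy as np
import pandas as pd

# All columns are float64, so the resulting array is float64 as well.
floats = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]})
print(floats.to_numpy().dtype)

# A float column next to a string column forces a common object dtype.
mixed = pd.DataFrame({"A": [1.0, 2.0], "F": ["foo", "bar"]})
arr = mixed.to_numpy()
print(arr.dtype, arr.shape)
```

The conversion to a common dtype copies data, so on large mixed frames it can be noticeably more expensive than on a homogeneous one.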

`DataFrame.describe()` shows a quick statistical summary of your data:

In [21]:
df.describe()

Unnamed: 0,A,B,C,D
count,6.0,6.0,6.0,6.0
mean,-0.016969,0.425625,-0.350469,0.465555
std,0.566066,0.757478,0.354871,1.399593
min,-1.095344,-0.32799,-0.732756,-1.36863
25%,-0.091244,-0.241259,-0.659448,-0.521014
50%,0.224609,0.393561,-0.355176,0.439418
75%,0.314787,0.99815,-0.112718,1.648564
max,0.395993,1.347802,0.133061,2.072672
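
To see where these numbers come from, the sketch below recomputes a few rows of `describe()` on a tiny hypothetical frame `toy` that is small enough to check by hand. Two details worth knowing: `std` is the sample standard deviation (`ddof=1`), and the percentiles use linear interpolation by default.

```python
import numpy as np
import pandas as pd

# Hypothetical data, chosen so the statistics are easy to verify manually.
toy = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0]})
summary = toy.describe()

print(summary.loc["mean", "A"])   # (1 + 2 + 3 + 4) / 4
print(summary.loc["50%", "A"])    # median, halfway between 2 and 3
print(summary.loc["std", "A"])    # sample standard deviation (ddof=1)

# The same std computed directly with NumPy for comparison.
print(np.std(toy["A"].to_numpy(), ddof=1))
```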


Transposing your data:

In [22]:
df.T

Unnamed: 0,2013-01-01,2013-01-02,2013-01-03,2013-01-04,2013-01-05,2013-01-06
A,-1.095344,0.331015,0.183115,0.395993,0.266103,-0.182697
B,0.951838,1.347802,-0.164717,-0.32799,-0.266773,1.013587
C,-0.086141,-0.517901,0.133061,-0.19245,-0.70663,-0.732756
D,0.737152,-1.36863,2.072672,1.952368,0.141684,-0.741913


`DataFrame.sort_values()` sorts by values:

In [23]:
df.sort_values(by="B")  # sort rows by column B in ascending order

Unnamed: 0,A,B,C,D
2013-01-04,0.395993,-0.32799,-0.19245,1.952368
2013-01-05,0.266103,-0.266773,-0.70663,0.141684
2013-01-03,0.183115,-0.164717,0.133061,2.072672
2013-01-01,-1.095344,0.951838,-0.086141,0.737152
2013-01-06,-0.182697,1.013587,-0.732756,-0.741913
2013-01-02,0.331015,1.347802,-0.517901,-1.36863
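
Two common variations are sketched below on a hypothetical frame `toy`: reversing the order with `ascending=False`, and sorting by labels instead of values with `DataFrame.sort_index()`. Both return a new frame and leave the original untouched.

```python
import pandas as pd

# Hypothetical data with string row labels.
toy = pd.DataFrame({"A": [3, 1, 2], "B": [0.5, 1.5, -1.0]}, index=["x", "y", "z"])

# Largest B first.
by_b_desc = toy.sort_values(by="B", ascending=False)
print(by_b_desc.index.tolist())

# Sort the column labels themselves (axis=1) in reverse order.
cols_desc = toy.sort_index(axis=1, ascending=False)
print(cols_desc.columns.tolist())
```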


# Selection

## Getitem ([])
For a `DataFrame`, passing a single label selects a column and yields a `Series` equivalent to `df.A`:

In [24]:
df["A"]

2013-01-01   -1.095344
2013-01-02    0.331015
2013-01-03    0.183115
2013-01-04    0.395993
2013-01-05    0.266103
2013-01-06   -0.182697
Freq: D, Name: A, dtype: float64
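
Attribute access such as `df.A` only works when the column name is a valid Python identifier and does not collide with an existing `DataFrame` attribute; bracket access works for any label. A sketch with a hypothetical frame `toy` whose column names were picked to show both failure cases:

```python
import pandas as pd

toy = pd.DataFrame({"A": [1, 2], "count": [10, 20], "my col": [5, 6]})

# Bracket and attribute access agree for a simple name like "A".
print(toy["A"].equals(toy.A))

# "count" collides with the DataFrame.count method, so toy.count is the method,
# not the column; brackets still reach the data.
print(callable(toy.count))
print(toy["count"].tolist())

# A name containing a space cannot be written as an attribute at all.
print(toy["my col"].tolist())
```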

For a `DataFrame`, passing a slice (`:`) selects matching rows. An integer slice selects by position, and a slice of date strings selects by index label:

In [25]:
df[0:3]

Unnamed: 0,A,B,C,D
2013-01-01,-1.095344,0.951838,-0.086141,0.737152
2013-01-02,0.331015,1.347802,-0.517901,-1.36863
2013-01-03,0.183115,-0.164717,0.133061,2.072672


In [26]:
df["20130102":"20130104"]

Unnamed: 0,A,B,C,D
2013-01-02,0.331015,1.347802,-0.517901,-1.36863
2013-01-03,0.183115,-0.164717,0.133061,2.072672
2013-01-04,0.395993,-0.32799,-0.19245,1.952368
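
The two slices above treat their endpoints differently: the integer slice follows Python's convention and excludes the stop position, while the label slice keeps both endpoints. A minimal sketch, assuming a frame filled with `np.arange` values instead of random numbers so the result is deterministic:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

# Integer slice: positional, stop is excluded (rows 0, 1, 2).
by_position = df[0:3]
print(len(by_position))

# Label slice on the DatetimeIndex: both endpoints are kept.
by_label = df["20130102":"20130104"]
print(by_label.index.min().date(), by_label.index.max().date())
```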


## Selection by label

Selecting a row matching a label:

In [27]:
df.loc[dates[0]]  # the first row

A   -1.095344
B    0.951838
C   -0.086141
D    0.737152
Name: 2013-01-01 00:00:00, dtype: float64

Selecting all rows (`:`) with selected column labels:

In [29]:
df.loc[:, ["A", "B"]]  # all rows, columns A and B only

Unnamed: 0,A,B
2013-01-01,-1.095344,0.951838
2013-01-02,0.331015,1.347802
2013-01-03,0.183115,-0.164717
2013-01-04,0.395993,-0.32799
2013-01-05,0.266103,-0.266773
2013-01-06,-0.182697,1.013587


For label slicing, both endpoints are included:

In [30]:
df.loc["20130102":"20130104", ["A", "B"]]

Unnamed: 0,A,B
2013-01-02,0.331015,1.347802
2013-01-03,0.183115,-0.164717
2013-01-04,0.395993,-0.32799


Selecting a single row and column label returns a scalar:

In [31]:
df.loc[dates[0], "A"]

np.float64(-1.0953436001288754)

`DataFrame.at` gives fast access to a scalar (equivalent to the prior method):

In [32]:
df.at[dates[0], "A"]

np.float64(-1.0953436001288754)
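
The sketch below checks that `.loc` and `.at` return the same scalar and shows that `.at` can also assign to a single cell. It assumes a frame built from `np.arange` values so the numbers are predictable; the replacement value `-1.0` is arbitrary.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

# Both accessors read the same cell.
via_loc = df.loc[dates[0], "A"]
via_at = df.at[dates[0], "A"]
print(via_loc == via_at)

# .at also supports assignment to exactly one cell.
df.at[dates[0], "A"] = -1.0
print(df.loc[dates[0], "A"])
```

`.at` accepts only a single row label and a single column label, which is what lets it skip the more general lookup logic of `.loc`.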

## Selection by position

Select via the position of the passed integers:

In [33]:
df.iloc[3]

A    0.395993
B   -0.327990
C   -0.192450
D    1.952368
Name: 2013-01-04 00:00:00, dtype: float64
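
To connect positional and label-based selection, the sketch below confirms that `df.iloc[3]` and `df.loc[dates[3]]` pick the same row, and that integer slices in `.iloc` exclude the stop position as in NumPy and Python. The frame is filled with `np.arange` values, an assumption made so the example is deterministic.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

# The fourth row, reached by position and by label.
row_by_pos = df.iloc[3]
row_by_label = df.loc[dates[3]]
print(row_by_pos.equals(row_by_label))

# Rows 3 and 4, columns 0 and 1: the stop positions 5 and 2 are excluded.
block = df.iloc[3:5, 0:2]
print(block.shape)
```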