In [None]:
print("""
@File         : 01.ipynb
@Author(s)    : Stephen CUI
@LastEditor(s): Stephen CUI
@CreatedTime  : 2024-09-23 22:00:47
@Email        : cuixuanstephen@gmail.com
@Description  : Introduces some Series methods, the importance of data types, and the problem with sum on strings
""")

In [1]:
import numpy as np
import pandas as pd

In [3]:
# np.random.randint(70, 100, size=10)  # [70, 100)
g = np.random.default_rng(0)  # random-number generator object
g.integers(70, 101, 10)  # [70, 101)

s = pd.Series(g.integers(70, 101, 10))
s.index = "Sep Oct Nov Dec Jan Feb Mar Apr May Jun".split()
s

Sep     90
Oct     98
Nov     85
Dec     88
Jan    100
Feb     92
Mar     89
Apr     86
May     87
Jun     98
dtype: int64

In [5]:
g = np.random.default_rng(0)
months = "Sep Oct Nov Dec Jan Feb Mar Apr May Jun".split()
s = pd.Series(g.integers(70, 101, 10), index=months)
s

Sep    96
Oct    89
Nov    85
Dec    78
Jan    79
Feb    71
Mar    72
Apr    70
May    75
Jun    95
dtype: int64
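The two cells above print different values even though both seed the generator with 0. The earlier cell calls `g.integers` once before building the series, so the series receives the second draw from the stream. A minimal sketch of that behavior (the names `g1`, `g2`, `first`, `same`, and `second` are illustrative, not from the notebook):

```python
import numpy as np

# Two generators created with the same seed produce the same sequence.
g1 = np.random.default_rng(0)
g2 = np.random.default_rng(0)
first = g1.integers(70, 101, 10)  # upper bound 101 is exclusive, so values lie in [70, 100]
same = g2.integers(70, 101, 10)

# A second call on g1 continues the stream instead of restarting it.
second = g1.integers(70, 101, 10)

print(np.array_equal(first, same))   # same seed, same first draw
print(first.min() >= 70 and first.max() <= 100)
```

Re-creating the generator right before each draw, as cell `In [5]` does, is what keeps the series reproducible.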

In [6]:
s.mean()

81.0

In [7]:
s[:5].mean()  # a slice always returns an object of the same type, so this is still a Series

85.4

In [8]:
s[5:].mean()

76.6

It’s even better to use the `.loc` and `.iloc` accessors. Whereas `.loc` retrieves one or more elements based on the index labels, `.iloc` retrieves them based on integer position, which is what the default index looks like.

In [9]:
s.iloc[:5].mean()

85.4

In [10]:
s.loc['Sep':'Jan'].mean()

85.4

When you use `.loc`, the slice end is no longer “up to and not including” but rather “up to and including.” This makes logical sense because it’s not always obvious what “up to and not including” would mean for a custom index.

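> If the end label were excluded, you would need to know which label comes immediately before it, and with a custom index that is not something you can be expected to know.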
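The difference between the two slice conventions can be sketched on a small series. The names `scores`, `by_position`, and `by_label` are illustrative; the values are the first five scores from the series above.

```python
import pandas as pd

months = "Sep Oct Nov Dec Jan".split()
scores = pd.Series([96, 89, 85, 78, 79], index=months)

by_position = scores.iloc[0:2]      # positions 0 and 1; position 2 is excluded
by_label = scores.loc["Sep":"Nov"]  # labels Sep through Nov; Nov is included

print(list(by_position.index))  # ['Sep', 'Oct']
print(list(by_label.index))     # ['Sep', 'Oct', 'Nov']
```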

In [11]:
first_half_average = s.loc["Sep":"Jan"].mean()
second_half_average = s.loc["Feb":"Jun"].mean()
print(f"First half average: {first_half_average}")
print(f"Second half average: {second_half_average}")
print(f"Improvement: {second_half_average - first_half_average}")

First half average: 85.4
Second half average: 76.6
Improvement: -8.800000000000011


In [18]:
# What if the maximum occurs twice, e.g. the September and June scores are the same?
s.sort_values(ascending=False).index[0]
s[s == s.max()].index[0]
s.idxmax()

s.sort_values(ascending=False)[:5]

Sep    96
Jun    95
Oct    89
Nov    85
Jan    79
dtype: int64
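The question in the comment above (what happens when two months tie for the maximum) can be explored with a small made-up series. The data and the names `tied`, `first_label`, and `all_labels` are hypothetical. `idxmax` returns the label of the first occurrence, while a boolean mask keeps every label that reaches the maximum.

```python
import pandas as pd

# Hypothetical scores in which Sep and Jun tie for the maximum.
tied = pd.Series([96, 89, 85, 96], index=["Sep", "Oct", "Nov", "Jun"])

first_label = tied.idxmax()                            # label of the first maximum
all_labels = tied[tied == tied.max()].index.tolist()   # every label at the maximum

print(first_label)  # Sep
print(all_labels)   # ['Sep', 'Jun']
```

The sort-based variant, `sort_values(ascending=False).index[0]`, depends on how the sort orders equal values, so the mask is probably the safer choice when ties matter.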

In [19]:
s.round(-1)

Sep    100
Oct     90
Nov     80
Dec     80
Jan     80
Feb     70
Mar     70
Apr     70
May     80
Jun    100
dtype: int64

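> When `round` receives a positive integer argument, it rounds to that many digits after the decimal point. When it receives a negative integer argument, it rounds digits before the decimal point! Values exactly halfway between two candidates go to the even one, which is why 85 becomes 80 rather than 90 in the output above.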
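A short sketch of both directions of rounding, using a made-up float series (the name `values` and its contents are illustrative):

```python
import pandas as pd

values = pd.Series([123.456, 85.0, 95.0])

one_decimal = values.round(1)    # keep one digit after the decimal point
nearest_ten = values.round(-1)   # round to the nearest multiple of ten

print(one_decimal.tolist())   # [123.5, 85.0, 95.0]
print(nearest_ten.tolist())   # [120.0, 80.0, 100.0]
```

Note that 85.0 goes down to 80.0 while 95.0 goes up to 100.0: both are exact ties, and each moves to the neighbor whose tens digit is even.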

In [20]:
s.std(), np.std(s)

(9.72967967955095, 9.23038460737146)

By default, pandas divides not by `s.count()` but by `s.count() - 1` (that is, `ddof=1`). This is known as the sample standard deviation and is typically used when the data is a sample rather than the entire population. NumPy’s `std` defaults to `ddof=0`, which divides by the number of elements and gives the population standard deviation.

In [22]:
s.std(ddof=0), np.std(s)

(9.23038460737146, 9.23038460737146)
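To see where the two numbers come from, the calculation can be written out by hand. This is only a sketch: the series is rebuilt from the ten scores shown earlier, and `squared_deviations`, `sample_std`, and `population_std` are illustrative names.

```python
import numpy as np
import pandas as pd

s = pd.Series([96, 89, 85, 78, 79, 71, 72, 70, 75, 95])  # the ten scores from above

squared_deviations = (s - s.mean()) ** 2

# Sample standard deviation: divide by n - 1 (the pandas default, ddof=1).
sample_std = np.sqrt(squared_deviations.sum() / (s.count() - 1))
# Population standard deviation: divide by n (ddof=0).
population_std = np.sqrt(squared_deviations.sum() / s.count())

print(np.isclose(sample_std, s.std()))
print(np.isclose(population_std, s.std(ddof=0)))
```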

**When sums go wrong**

The `sum` method is useful, as you can imagine. You will likely want to use it on numeric series to combine the values. But it turns out that if you run `s.sum()` when `s` is a series of strings, the result is the strings concatenated together:

In [23]:
s = pd.Series("abcd efgh ijkl".split())
s.sum()

'abcdefghijkl'

In [26]:
s = pd.Series("1234 5678 9012".split())
s.sum()

'123456789012'
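The contrast is easier to see when the same data is summed before and after conversion. In this sketch the names `text_numbers`, `concatenated`, and `total` are illustrative, and `dtype=object` is passed explicitly as an assumption so the series matches the `dtype('O')` case discussed here.

```python
import pandas as pd

text_numbers = pd.Series("1234 5678 9012".split(), dtype=object)

concatenated = text_numbers.sum()        # strings are joined end to end
total = text_numbers.astype(int).sum()   # integers are added

print(repr(concatenated))  # '123456789012'
print(total)               # 15924
```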

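What if you want to change the `dtype` of a `Series` after creating it? You can’t assign to the `dtype` attribute; it’s read-only. Instead, you must create a new series from the existing one by calling the `astype` method: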

In [27]:
s = pd.Series("10 20 30".split())
s.dtype

dtype('O')

In [29]:
s = s.astype(np.int64)
s.dtype

dtype('int64')
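Because `astype` returns a new series, the original keeps its dtype unless the result is assigned back, as the cell above does with `s = s.astype(...)`. A minimal sketch, with the illustrative names `original` and `converted` and an explicit `dtype=object` as an assumption:

```python
import pandas as pd

original = pd.Series("10 20 30".split(), dtype=object)
converted = original.astype("int64")  # builds a new Series; original is untouched

print(original.dtype)    # object
print(converted.dtype)   # int64
print(converted.sum())   # 60
```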