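- **Note**: when a table is rendered, the index is shown in bold, and the index name sits one row lower than the regular column names.
- **Note 2**: `inplace=True` modifies the object in place, overwriting the original data instead of returning a new object.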
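The difference between the two calling styles can be seen in a small sketch. The DataFrame below is a made-up toy example, not data from this notebook.

```python
import pandas as pd

# Toy data, assumed purely for illustration
df = pd.DataFrame({"user": [2, 1], "amount": [5, 7]})

# Without inplace, sort_values returns a new DataFrame and leaves df untouched
sorted_copy = df.sort_values("user")
print(df["user"].tolist())           # [2, 1]
print(sorted_copy["user"].tolist())  # [1, 2]

# With inplace=True the call mutates df itself and returns None
result = df.sort_values("user", inplace=True)
print(result)                        # None
print(df["user"].tolist())           # [1, 2]
```

Because an `inplace=True` call returns `None`, assigning its result back to a variable is a common mistake.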

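## Dataset sources

- [UCI Machine Learning Repository](https://archive.ics.uci.edu/)
- [UEA time series classification archive](https://arxiv.org/abs/1811.00075)
  - FordA: engine-noise recordings from an automotive subsystem, commonly used to detect whether a fault is present.
  - ECG5000: electrocardiogram (ECG) signals, widely used for heart-health monitoring tasks.
  - GunPoint: a motion dataset containing two classes of hand movement.
- [UCR time series classification archive](https://www.cs.ucr.edu/~eamonn/time_series_data_2018/)
  - Yoga: time series extracted from images of actors performing yoga poses.
  - Wafer: sensor measurements recorded during semiconductor wafer processing, used to predict product quality (normal versus abnormal wafers).
  - Trace: a synthetic dataset that simulates instrumentation failures in a nuclear power plant, often used for anomaly and fault detection.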

In [48]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
absenteeism_at_work = fetch_ucirepo(id=445) 
  
# data (as pandas dataframes) 
X = absenteeism_at_work.data.features 
y = absenteeism_at_work.data.targets 
  
# metadata 
print(absenteeism_at_work.metadata) 
  
# variable information 
print(absenteeism_at_work.variables) 


{'uci_id': 445, 'name': 'Absenteeism at work', 'repository_url': 'https://archive.ics.uci.edu/dataset/445/absenteeism+at+work', 'data_url': 'https://archive.ics.uci.edu/static/public/445/data.csv', 'abstract': 'The database was created with records of absenteeism at work from July 2007 to July 2010 at a courier company in Brazil.', 'area': 'Business', 'tasks': ['Classification', 'Clustering'], 'characteristics': ['Multivariate', 'Time-Series'], 'num_instances': 740, 'num_features': 19, 'feature_types': ['Integer', 'Real'], 'demographics': ['Age', 'Education Level'], 'target_col': ['Absenteeism time in hours'], 'index_col': ['ID'], 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 2012, 'last_updated': 'Fri Mar 08 2024', 'dataset_doi': '10.24432/C5X882', 'creators': ['Andrea Martiniano', 'Ricardo Ferreira'], 'intro_paper': {'ID': 414, 'type': 'NATIVE', 'title': 'Application of a neuro fuzzy network in prediction of absenteeism at work', 'authors': 'A

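## Assembling a time series data collection

- `pandas.groupby().count()`: `count()` counts the non-null values in each column. If every column has the same number of non-null values within a group, the columns show identical values in the `groupby().count()` result.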
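A minimal sketch of this behaviour, using a made-up three-row table (the column names mirror `year_joined.csv`, but the values are assumptions): a missing value lowers the count for that column only.

```python
import numpy as np
import pandas as pd

# Toy table, assumed for illustration; user 0 has one missing userStats value
toy = pd.DataFrame({
    "user": [0, 0, 1],
    "userStats": ["gold", np.nan, "silver"],
    "yearJoined": [2015, 2015, 2016],
})

counts = toy.groupby("user").count()   # non-null values per column
sizes = toy.groupby("user").size()     # rows per group, NaN included
print(counts)
print(sizes)
```

For user 0, `count()` reports 1 for `userStats` and 2 for `yearJoined`, while `size()` reports 2 rows.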

In [49]:
import pandas as pd

df = pd.read_csv("../demo_code/Ch02/data/year_joined.csv")

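# groupby("user").count() returns one row per user with the non-null count of every other column
print(df.groupby("user").count())

# Validation: check whether any member has more than one record
# Result: each of the 1,000 members has exactly one membership status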
df.groupby("user").count().groupby("userStats").count()

      userStats  yearJoined
user                       
0             1           1
1             1           1
2             1           1
3             1           1
4             1           1
...         ...         ...
995           1           1
996           1           1
997           1           1
998           1           1
999           1           1

[1000 rows x 2 columns]


           yearJoined
userStats            
1                1000


In [50]:
# Check for "null weeks" (weeks in which a member opened no emails, i.e. emailsOpened is 0)
import pandas as pd

emails = pd.read_csv("../demo_code/Ch02/data/emails.csv")

print(emails[emails['emailsOpened'] < 1])

Empty DataFrame
Columns: [emailsOpened, user, week]
Index: []


In [51]:
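# The result above shows no recorded "null weeks", which is implausible:
# it is very unlikely that every member opened at least one email every week

# Inspect one member's history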
print(emails[emails.user == 998])

       emailsOpened   user                 week
25464           1.0  998.0  2017-12-04 00:00:00
25465           3.0  998.0  2017-12-11 00:00:00
25466           3.0  998.0  2017-12-18 00:00:00
25467           3.0  998.0  2018-01-01 00:00:00
25468           3.0  998.0  2018-01-08 00:00:00
25469           2.0  998.0  2018-01-15 00:00:00
25470           3.0  998.0  2018-01-22 00:00:00
25471           2.0  998.0  2018-01-29 00:00:00
25472           3.0  998.0  2018-02-05 00:00:00
25473           3.0  998.0  2018-02-12 00:00:00
25474           3.0  998.0  2018-02-19 00:00:00
25475           2.0  998.0  2018-02-26 00:00:00
25476           2.0  998.0  2018-03-05 00:00:00
25477           3.0  998.0  2018-03-12 00:00:00
25478           2.0  998.0  2018-03-19 00:00:00
25479           2.0  998.0  2018-03-26 00:00:00
25480           3.0  998.0  2018-04-02 00:00:00
25481           3.0  998.0  2018-04-09 00:00:00
25482           3.0  998.0  2018-04-16 00:00:00
25483           3.0  998.0  2018-04-30 0

- `pd.to_datetime()`: converts the strings in the `week` column to datetime objects.
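A short sketch of the conversion on two hand-written strings in the same format as the `week` column. Once parsed, the values support date arithmetic, which plain strings do not.

```python
import pandas as pd

# Two sample strings in the same format as the week column (assumed values)
weeks = pd.Series(["2017-12-04 00:00:00", "2017-12-11 00:00:00"])

parsed = pd.to_datetime(weeks)
gap_days = (parsed[1] - parsed[0]).days   # subtraction yields a Timedelta
print(parsed.dtype)
print(gap_days)                           # 7
```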

In [52]:
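# The output above shows that some weeks are missing.

# Convert the strings in the week column to datetime
emails['week'] = pd.to_datetime(emails['week'])
# Number of weeks between this member's first and last record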
(max(emails[emails.user == 998].week) - min(emails[emails.user == 998].week)).days/7

25.0

In [53]:
emails[emails.user == 998].shape

(24, 3)

The member has 24 rows, but there should be **26**: `max - min` spans 25 weeks, and counting both endpoints adds 1. The two absent rows are the *null weeks*.
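The same arithmetic can be checked on a small hypothetical history. The dates below are invented; `pd.date_range` builds the weeks that should exist, and `difference` lists the ones that are absent.

```python
import pandas as pd

# Hypothetical weekly history with two weeks missing (not the real member 998)
observed = pd.to_datetime(["2018-01-01", "2018-01-08", "2018-01-22", "2018-02-05"])

# Every week between the first and last observation, endpoints included
expected = pd.date_range(observed.min(), observed.max(), freq="7D")
missing = expected.difference(observed)

span_weeks = (observed.max() - observed.min()).days // 7
print(span_weeks + 1)                       # 6 expected rows
print(len(observed))                        # 4 observed rows
print(list(missing.strftime("%Y-%m-%d")))   # ['2018-01-15', '2018-01-29']
```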

In [54]:
# Cartesian product of all weeks and all users
complete_idx = pd.MultiIndex.from_product([list(set(emails.week)), list(set(emails.user))])
print(complete_idx)

MultiIndex([('2016-08-01',   1.0),
            ('2016-08-01',   3.0),
            ('2016-08-01',   5.0),
            ('2016-08-01',   6.0),
            ('2016-08-01',   9.0),
            ('2016-08-01',  10.0),
            ('2016-08-01',  14.0),
            ('2016-08-01',  16.0),
            ('2016-08-01',  20.0),
            ('2016-08-01',  21.0),
            ...
            ('2016-11-21', 973.0),
            ('2016-11-21', 977.0),
            ('2016-11-21', 982.0),
            ('2016-11-21', 984.0),
            ('2016-11-21', 987.0),
            ('2016-11-21', 991.0),
            ('2016-11-21', 992.0),
            ('2016-11-21', 993.0),
            ('2016-11-21', 995.0),
            ('2016-11-21', 998.0)],
           length=93247)


In [55]:
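# Reindex the original table against the full index and fill the gaps (null weeks become 0)

# Method chaining on the DataFrame:
# set_index(['week', 'user']): every record gets a compound index made of week and user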
all_email = emails.set_index(['week', 'user']).reindex(complete_idx, fill_value=0).reset_index()
all_email.columns = ['week', 'member', 'EmailsOpened']
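The reindexing step is easier to follow on a tiny table. The sketch below uses three invented rows, with column names following `emails.csv`, and shows how the one missing (week, user) pair appears with a count of 0.

```python
import pandas as pd

# Toy email log: user 2 has no row for the week of 2018-01-08
toy = pd.DataFrame({
    "week": pd.to_datetime(["2018-01-01", "2018-01-08", "2018-01-01"]),
    "user": [1, 1, 2],
    "emailsOpened": [3, 2, 1],
})

# Every (week, user) combination that could exist
full_idx = pd.MultiIndex.from_product(
    [toy["week"].unique(), toy["user"].unique()],
    names=["week", "user"],
)

# Rows absent from the original table are created with emailsOpened = 0
filled = toy.set_index(["week", "user"]).reindex(full_idx, fill_value=0).reset_index()
print(filled)
```

The result has four rows (2 weeks × 2 users), one more than the input.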

In [56]:
#all_email['week'] = all_email['week'] - pd.Timedelta(days=2)  # subtract two days

In [57]:
all_email

Unnamed: 0,week,member,EmailsOpened
0,2016-08-01,1.0,3.0
1,2016-08-01,3.0,0.0
2,2016-08-01,5.0,0.0
3,2016-08-01,6.0,0.0
4,2016-08-01,9.0,3.0
...,...,...,...
93242,2016-11-21,991.0,0.0
93243,2016-11-21,992.0,0.0
93244,2016-11-21,993.0,0.0
93245,2016-11-21,995.0,3.0


In [58]:
# Show the null weeks
all_email[(all_email.member == 998) & (all_email.EmailsOpened == 0.0)].sort_values('week')
# Note: when combining several conditions, wrap each one in parentheses

Unnamed: 0,week,member,EmailsOpened
21559,2015-02-09,998.0,0.0
10779,2015-02-16,998.0,0.0
27488,2015-02-23,998.0,0.0
5928,2015-03-02,998.0,0.0
72225,2015-03-09,998.0,0.0
...,...,...,...
66296,2017-11-13,998.0,0.0
33417,2017-11-20,998.0,0.0
14552,2017-11-27,998.0,0.0
33956,2017-12-25,998.0,0.0
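Why the parentheses matter can be shown with a three-row toy frame (values assumed for illustration). In Python, `&` binds more tightly than `==`, so each comparison has to be grouped explicitly.

```python
import pandas as pd

# Toy frame, assumed for illustration
toy = pd.DataFrame({"member": [998, 998, 7], "EmailsOpened": [0, 3, 0]})

# Each comparison is parenthesized so that it is evaluated before &
mask = (toy.member == 998) & (toy.EmailsOpened == 0)
print(toy[mask])

# Without parentheses the expression is parsed as
# toy.member == (998 & toy.EmailsOpened) == 0, which raises a ValueError
try:
    toy[toy.member == 998 & toy.EmailsOpened == 0]
    raised = False
except ValueError:
    raised = True
print(raised)
```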


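The analysis above locates the null weeks, but the weeks before a member joined were also filled with 0, which is meaningless. Each member's active date range therefore has to be determined first.

`df.agg(func, axis=0, *args, **kwargs)`

`agg()` performs **aggregation** on a DataFrame or Series. It can apply several aggregation functions at once, use different functions for different columns, and accept user-defined functions.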

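- Example (`range_func` stands for any user-defined aggregation; the definition below is an assumed placeholder):
  ```python
  # Assumed custom aggregation: spread between the largest and smallest value
  def range_func(s):
      return s.max() - s.min()

  result = df.agg({
      'A': ['sum', range_func],  # apply both sum and the custom function to column A
      'B': 'mean',               # mean of column B
      'C': ['min', 'max']        # minimum and maximum of column C
  })
  ```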
- Combined with `groupby()`, `agg()` applies the aggregation functions to each group separately
  - Example:
    ```python
    import pandas as pd

    df = pd.DataFrame({
        'group': ['A', 'A', 'B', 'B'],
        'value': [10, 20, 30, 40]
    })

    # Aggregate each group with groupby and agg
    result = df.groupby('group').agg({
        'value': ['sum', 'mean', 'max']
    })
    ```
    Output:
    ```
          value
            sum  mean max
    group
    A        30  15.0  20
    B        70  35.0  40
    ```
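When the two-level column header produced by a list of functions is inconvenient, named aggregation is one alternative. The sketch below reuses the toy `group`/`value` frame; the output names `total`, `average`, and `spread` are arbitrary choices made for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["A", "A", "B", "B"],
    "value": [10, 20, 30, 40],
})

# Named aggregation: output_name=(input_column, function)
summary = df.groupby("group").agg(
    total=("value", "sum"),
    average=("value", "mean"),
    spread=("value", lambda s: s.max() - s.min()),  # custom function
)
print(summary)
```

The result has flat columns `total`, `average`, and `spread`, one row per group.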

In [59]:
cutoff_dates = emails.groupby('user').week.agg(['min', 'max']).reset_index()
cutoff_dates            # each member's active date range (first and last recorded week)

Unnamed: 0,user,min,max
0,1.0,2015-06-29,2018-05-28
1,3.0,2018-03-05,2018-04-23
2,5.0,2017-06-05,2018-05-28
3,6.0,2016-12-05,2018-05-28
4,9.0,2016-07-18,2018-05-28
...,...,...,...
534,991.0,2016-10-24,2016-10-24
535,992.0,2015-02-09,2015-07-06
536,993.0,2017-09-11,2018-05-28
537,995.0,2016-09-05,2018-05-28


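`iterrows()` is the pandas method for iterating over a DataFrame **row by row**. It returns a generator that yields each row as a pair of the index label and the row data (a Series).

Usage:
```python
for index, row in df.iterrows():
    print(index, row)
```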
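A runnable sketch on a two-row toy frame (values and index labels are assumptions) shows what each iteration yields.

```python
import pandas as pd

# Toy frame with string index labels, assumed for illustration
df = pd.DataFrame({"user": [1, 3], "weeks": [152, 7]}, index=["a", "b"])

for index, row in df.iterrows():
    # index is the row label; row is a Series keyed by column name
    print(index, row["user"], row["weeks"])

totals = [row["user"] + row["weeks"] for _, row in df.iterrows()]
print(totals)   # [153, 10]
```

Since every row is materialized as a Series, `iterrows()` is slow on large frames, and a vectorized operation is usually preferable when one exists.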

In [60]:
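# For each member, drop the zero-filled rows that fall before their first
# recorded week or after their last recorded week
for _, row in cutoff_dates.iterrows():
    member = row['user']
    start_date = row['min']
    end_date = row['max']
    # Combine the conditions into a single mask; chaining two boolean indexers
    # triggers the "Boolean Series key will be reindexed" UserWarning
    is_member = all_email.member == member
    out_of_range = (all_email.week < start_date) | (all_email.week > end_date)
    all_email.drop(all_email[is_member & out_of_range].index, inplace=True)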
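The same trimming can be done without a Python-level loop. The following is one possible sketch, not the notebook's method: it joins each row with its member's cutoff dates and keeps only the rows inside the range. The toy frames `toy_email` and `toy_cutoffs` are invented stand-ins for `all_email` and `cutoff_dates`.

```python
import pandas as pd

# Invented stand-ins for all_email and cutoff_dates
toy_email = pd.DataFrame({
    "week": pd.to_datetime(["2018-01-01", "2018-01-08", "2018-01-15"] * 2),
    "member": [1, 1, 1, 2, 2, 2],
    "EmailsOpened": [0, 3, 2, 1, 0, 0],
})
toy_cutoffs = pd.DataFrame({
    "user": [1, 2],
    "min": pd.to_datetime(["2018-01-08", "2018-01-01"]),
    "max": pd.to_datetime(["2018-01-15", "2018-01-01"]),
})

# Attach each member's first and last week to every row, then filter once
bounded = toy_email.merge(toy_cutoffs, left_on="member", right_on="user", how="inner")
in_range = bounded["week"].between(bounded["min"], bounded["max"])  # inclusive bounds
trimmed = bounded.loc[in_range, ["week", "member", "EmailsOpened"]].reset_index(drop=True)
print(trimmed)
```

Member 1 keeps two rows and member 2 keeps one. On a table with tens of thousands of rows, a single merge and filter is typically much faster than dropping rows member by member.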
    


In [73]:
all_email['member'] = all_email['member'].astype(int)
all_email['EmailsOpened'] = all_email['EmailsOpened'].astype(int)

In [74]:
all_email

Unnamed: 0,week,member,EmailsOpened
0,2016-08-01,1,3
4,2016-08-01,9,3
6,2016-08-01,14,0
12,2016-08-01,28,3
13,2016-08-01,31,3
...,...,...,...
93229,2016-11-21,956,3
93230,2016-11-21,958,1
93231,2016-11-21,959,0
93241,2016-11-21,987,1


## Constructing a found time series

In [62]:
import pandas as pd

donation = pd.read_csv('../demo_code/Ch02/data/donations.csv')

In [63]:
donation.timestamp = pd.to_datetime(donation.timestamp)  # convert to datetime
donation.set_index('timestamp', inplace=True)
# Sum amount per user in weekly bins. 'W-MON' is a weekly frequency anchored on Monday:
# each bin ends on a Monday (inclusive) and is labeled with that Monday
agg_donations = donation.groupby('user').apply(
    lambda df: df.amount.resample('W-MON').sum().dropna().to_frame(name='amount')).reset_index()
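How `'W-MON'` assigns timestamps to bins can be verified on three invented donations. With pandas' default settings for weekly frequencies, each bin is closed and labeled on its right edge, which is the Monday.

```python
import pandas as pd

# Invented donations: 2018-01-02 is a Tuesday, 2018-01-08 a Monday, 2018-01-09 a Tuesday
toy = pd.DataFrame(
    {"amount": [10, 5, 20]},
    index=pd.to_datetime(["2018-01-02", "2018-01-08", "2018-01-09"]),
)

weekly = toy.amount.resample("W-MON").sum()
print(weekly)
```

The Tuesday and the following Monday fall into the bin labeled 2018-01-08 (total 15), while the next Tuesday opens the bin labeled 2018-01-15 (total 20).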



In [75]:
agg_donations['amount'] = agg_donations['amount'].astype(int)
agg_donations['user'] = agg_donations['user'].astype(int)

In [76]:
all_email

Unnamed: 0,week,member,EmailsOpened
0,2016-08-01,1,3
4,2016-08-01,9,3
6,2016-08-01,14,0
12,2016-08-01,28,3
13,2016-08-01,31,3
...,...,...,...
93229,2016-11-21,956,3
93230,2016-11-21,958,1
93231,2016-11-21,959,0
93241,2016-11-21,987,1


In [77]:
print(all_email.dtypes)
print(agg_donations.dtypes)

week            datetime64[ns]
member                   int64
EmailsOpened             int64
dtype: object
user                  int64
timestamp    datetime64[ns]
amount                int64
dtype: object


In [78]:
import pandas as pd

# all_email and agg_donations are the two DataFrames built above

# Make sure 'week' and 'timestamp' are datetime columns
all_email['week'] = pd.to_datetime(all_email['week'])
agg_donations['timestamp'] = pd.to_datetime(agg_donations['timestamp'])

# Left join: match member with user and week with timestamp
merged_df = pd.merge(all_email, agg_donations, how='left', left_on=['member', 'week'], right_on=['user', 'timestamp'])

# Inspect the result
print(merged_df.head())


        week  member  EmailsOpened  user  timestamp  amount
0 2016-08-01       1             3   1.0 2016-08-01     0.0
1 2016-08-01       9             3   NaN        NaT     NaN
2 2016-08-01      14             0   NaN        NaT     NaN
3 2016-08-01      28             3   NaN        NaT     NaN
4 2016-08-01      31             3  31.0 2016-08-01     0.0
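In the output above, rows without a matching donation carry `NaN`/`NaT`, and `user` has become a float column because integers cannot hold `NaN`. One way to tidy such a result is sketched below on two invented rows; `emails_toy` and `donations_toy` are hypothetical stand-ins, and `indicator=True` is used only to show which rows found a match.

```python
import pandas as pd

# Hypothetical stand-ins for all_email and agg_donations
emails_toy = pd.DataFrame({
    "member": [1, 9],
    "week": pd.to_datetime(["2016-08-01", "2016-08-01"]),
    "EmailsOpened": [3, 3],
})
donations_toy = pd.DataFrame({
    "user": [1],
    "timestamp": pd.to_datetime(["2016-08-01"]),
    "amount": [25],
})

merged = pd.merge(
    emails_toy, donations_toy, how="left",
    left_on=["member", "week"], right_on=["user", "timestamp"],
    indicator=True,  # adds a _merge column: 'both' or 'left_only'
)

# The right-hand keys duplicate member/week, so drop them and fill missing amounts with 0
merged = merged.drop(columns=["user", "timestamp"])
merged["amount"] = merged["amount"].fillna(0).astype(int)
print(merged)
```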


In [79]:
frames = []
for member, member_email in all_email.groupby('member'):
    # set_index returns a new frame; calling it with inplace=True on a slice, or on the
    # temporary returned by sort_values, would leave these variables without the index
    member_donations = agg_donations[agg_donations.user == member].set_index('timestamp')
    member_email = member_email.sort_values('week').set_index('week')

    # Align donations to email weeks on the datetime index; weeks without a donation get 0
    df = pd.merge(member_email, member_donations, how='left', left_index=True, right_index=True)
    df.fillna(0, inplace=True)

    frames.append(df.reset_index()[['member', 'week', 'EmailsOpened', 'amount']])

# Note: DataFrame.append was removed in pandas 2.0; concatenate with pd.concat instead
merged_df = pd.concat(frames, ignore_index=True)