- **注意** : 打印表格渲染的时候, 索引会加粗,并且会比正常的列名位置偏下
- **注意2**: `inplace=True` 的意思是原地覆盖原本的数据

## 数据集来源

- [UCI机器学习仓库](https://archive.ics.uci.edu/)
- [UEA时间序列分类仓库](https://arxiv.org/abs/1811.00075)
  - FordA：一个关于发动机故障的时间序列数据集，通常用于预测机器故障。
  - ECG5000：心电图（ECG）信号数据集，广泛用于心脏健康监测。
  - GunPoint：一个运动学数据集，包含了两种不同的动作类别。
- [UCR时间序列分类仓库](https://www.cs.ucr.edu/~eamonn/time_series_data_2018/)
  - Yoga：这个数据集包含了不同瑜伽动作的时间序列数据。
  - Wafer：一个半导体生产过程中温度数据的时间序列数据集，用于预测产品的质量。
  - Trace：一个关于电力消耗的时间序列数据集，用于进行异常检测

In [None]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
absenteeism_at_work = fetch_ucirepo(id=445) 
  
# data (as pandas dataframes) 
X = absenteeism_at_work.data.features 
y = absenteeism_at_work.data.targets 
  
# metadata 
print(absenteeism_at_work.metadata) 
  
# variable information 
print(absenteeism_at_work.variables) 


## 组装时间序列数据集合

- `pandas.groupby().count()`: 因为 count() 是对每一列非空值进行计数。如果数据框中每列的非空值数目在同一个分组下是相同的，那么它们在 groupby().count() 的结果中会显示相同的值

In [None]:
import pandas as pd

df = pd.read_csv("../demo_code/Ch02/data/year_joined.csv")

# df.groupby("user").count() 该结果返回原本表的形式
print(df.groupby("user").count())

# 对数据进行验证 检查是否存在一个会员，在某个等级下有多条记录
# 结果: 在所有的1000名会员中，每人只有一个等级
df.groupby("user").count().groupby("userStats").count()

In [None]:
# 检查是否有"空周"(会员在这一周内没有打开电子邮件（EmailsOpened为0）)
import pandas as pd

emails = pd.read_csv("../demo_code/Ch02/data/emails.csv")

print(emails[emails['emailsOpened'] < 1])

In [None]:
# 上面的结果表明: 没有"空周", 但是这种情况的概率很低

# 查看历史记录
print(emails[emails.user == 998])

- `pd.to_datetime()`：该函数将 week 列中的字符串转换为 datetime 对象。

In [None]:
# 上面的结果表明, 几周的数据缺失了。

# 将 week 列的字符串转换为 datetime 类型
emails['week'] = pd.to_datetime(emails['week'])
# 检查该会员总体记录中横跨几周
(max(emails[emails.user == 998].week) - min(emails[emails.user == 998].week)).days/7

In [None]:
emails[emails.user == 998].shape

结果表明, 有24行数据,但是应该为**26行**(注意max - min的结果应该加1), 所以少的两行就是*空周*

In [None]:
# 笛卡尔积
complete_idx = pd.MultiIndex.from_product([list(set(emails.week)), list(set(emails.user))])
print(complete_idx)

In [12]:
# 对原始表进行重新索引, 并填充缺失值(空周用0填充)

# df的链式操作:
# set_index(['week', 'user']) : 每个记录都会有一个由 week 和 user 组成的复合索引
all_email = emails.set_index(['week', 'user']).reindex(complete_idx, fill_value=0).reset_index()
all_email.columns = ['week', 'member', 'EmailsOpened']

In [None]:
all_email

In [None]:
# 展示空周
all_email[(all_email.member == 998) & (all_email.EmailsOpened == 0.0)].sort_values('week')
# 注意多条件每个条件要用()括起来

通过上面的分析,虽然定位了空周, 但是会员加入的前几周也被填充为0, 这是没有意义的, 所以应该确定每个会员的使用区间

`df.agg(func, axis=0, *args, **kwargs)`

对 DataFrame 或 Series 执行**聚合操作** 的强大工具, 允许对数据应用多个聚合函数，或者对不同的列使用不同的聚合函数,也可以使用自定义函数

- 举例:
  ```python
    result = df.agg({
        'A': ['sum', range_func],  # 对 A 列同时应用 sum 和自定义函数
        'B': 'mean',              # 对 B 列求均值
        'C': ['min', 'max']       # 对 C 列求最小值和最大值
    })
   ```
- `agg()` 与 `groupby()` 结合使用，可以对数据进行分组后应用不同的聚合函数
  - 举例:
    ```python
    df = pd.DataFrame({
        'group': ['A', 'A', 'B', 'B'],
        'value': [10, 20, 30, 40]
    })

    # 使用 groupby 和 agg 对分组数据进行聚合
    result = df.groupby('group').agg({
        'value': ['sum', 'mean', 'max']
    })
    ```
    输出
    ```
            value              
            sum mean max
    group                  
    A         30   15  20
    B         70   35  40

    ```

In [None]:
cutoff_dates = emails.groupby('user').week.agg(['min', 'max']).reset_index()
cutoff_dates            # 每个会员的使用区间

`iterrows()` 是 Pandas 中用于 **按行遍历** DataFrame 的一个方法。它返回一个生成器，可以逐行返回 DataFrame 的索引和行数据

使用形式:
```python
for index, row in df.iterrows():
```

In [45]:
# 删除每个会员的第一个非0值的前面的为0的行
for _, row in cutoff_dates.iterrows():
    member = row['user']
    start_date = row['min']
    end_date = row['max']
    all_email.drop(
        all_email[(all_email.member == member)]
        [(all_email.week < start_date)].index, inplace=True)
    all_email.drop(all_email[(all_email.member == member)]
                   [(all_email.week > end_date)].index, inplace=True)
    

  all_email[(all_email.member == member)]
  all_email.drop(all_email[(all_email.member == member)]
  all_email[(all_email.member == member)]
  all_email.drop(all_email[(all_email.member == member)]
  all_email[(all_email.member == member)]
  all_email.drop(all_email[(all_email.member == member)]
  all_email[(all_email.member == member)]
  all_email.drop(all_email[(all_email.member == member)]
  all_email[(all_email.member == member)]
  all_email.drop(all_email[(all_email.member == member)]
  all_email[(all_email.member == member)]
  all_email.drop(all_email[(all_email.member == member)]
  all_email[(all_email.member == member)]
  all_email.drop(all_email[(all_email.member == member)]
  all_email[(all_email.member == member)]
  all_email.drop(all_email[(all_email.member == member)]
  all_email[(all_email.member == member)]
  all_email.drop(all_email[(all_email.member == member)]
  all_email[(all_email.member == member)]
  all_email.drop(all_email[(all_email.member == member)]
  all_emai

In [46]:
all_email

Unnamed: 0,week,member,EmailsOpened
539,2016-05-30,1.0,3.0
545,2016-05-30,14.0,0.0
551,2016-05-30,28.0,3.0
552,2016-05-30,31.0,3.0
553,2016-05-30,33.0,3.0
...,...,...,...
93215,2015-06-15,929.0,2.0
93216,2015-06-15,932.0,2.0
93221,2015-06-15,940.0,1.0
93239,2015-06-15,982.0,0.0


## 构造找到的时间序列

In [47]:
import pandas as pd

donation = pd.read_csv('../demo_code/Ch02/data/donations.csv')

In [48]:
donation.timestamp = pd.to_datetime(donation.timestamp) # 转化为date_time形式
donation.set_index('timestamp', inplace=True)
agg_donations = donation.groupby('user').apply(
    lambda df : df.amount.resample('W-MON').sum().dropna().to_frame(name='amount')).reset_index()     # 按周对amount进行分组求和  'W-MON':按照以周一为开始的周进行重采样

  agg_donations = donation.groupby('user').apply(


In [49]:
agg_donations

Unnamed: 0,user,timestamp,amount
0,0.0,2015-03-30,25.0
1,0.0,2015-04-06,0.0
2,0.0,2015-04-13,0.0
3,0.0,2015-04-20,0.0
4,0.0,2015-04-27,0.0
...,...,...,...
32347,995.0,2017-09-11,0.0
32348,995.0,2017-09-18,0.0
32349,995.0,2017-09-25,0.0
32350,995.0,2017-10-02,1000.0


In [50]:
merged_df = pd.DataFrame()
for member, member_email in all_email.groupby('member'):
    member_donations = agg_donations[agg_donations.user == member]
    member_donations.set_index('timestamp', inplace=True)

    # member_email.set_index('week', inplace=True)

    member_email = all_email[all_email.member == member]
    member_email.sort_values('week').set_index('week', inplace=True)

    df = pd.merge(member_email, member_donations, how='left', left_index=True, right_index=True)

    df.fillna(0, inplace=True)

    #merged_df = merged_df.append(df.reset_index()[['member', 'week', 'EmailsOpened', 'amount']])
    # 注意: df.append已经弃用, 现在使用concat连接
    merged_df = pd.concat([merged_df, df.reset_index()[['member', 'week', 'EmailsOpened', 'amount']]], ignore_index=True)

In [51]:
merged_df

Unnamed: 0,member,week,EmailsOpened,amount
0,1.0,2016-05-30,3.0,0.0
1,1.0,2017-02-27,3.0,0.0
2,1.0,2016-03-14,3.0,0.0
3,1.0,2015-10-12,3.0,0.0
4,1.0,2017-08-07,3.0,0.0
...,...,...,...,...
30775,998.0,2018-04-23,0.0,0.0
30776,998.0,2018-01-08,3.0,0.0
30777,998.0,2018-03-26,2.0,0.0
30778,998.0,2018-05-14,3.0,0.0
