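Conclusion: the 7 audio features have little predictive power. The cells below try ridge regression on the raw targets, then logistic regression and random forest on bucketed rank and days-on-chart classes; in every case the results are close to random guessing.

Next step: consider encoding the artist name and nationality as semantic features and adding them to the training features.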
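
As a starting point for that follow-up, here is a minimal sketch of turning `Nationality` and `Artists` into numeric columns. It is only an assumption about what the encoding could look like: one-hot encoding and a frequency count are simple stand-ins for a true semantic embedding, and the tiny `demo` frame is made up rather than read from the real CSV.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

AUDIO = ['Danceability', 'Energy', 'Valence', 'Loudness',
         'Speechiness', 'Acousticness', 'Instrumentalness']

# Tiny made-up frame standing in for the real CSV (illustration only).
demo = pd.DataFrame({
    'Artists': ['A', 'B', 'A', 'C'],
    'Nationality': ['US', 'UK', 'US', 'KR'],
    **{f: [0.1, 0.5, 0.9, 0.3] for f in AUDIO},
})

# Frequency encoding: share of rows belonging to each artist
# (a hypothetical stand-in for a learned artist embedding).
demo['ArtistFreq'] = demo['Artists'].map(demo['Artists'].value_counts(normalize=True))

pre = ColumnTransformer(
    [
        ('nat', OneHotEncoder(handle_unknown='ignore'), ['Nationality']),
        ('num', 'passthrough', AUDIO + ['ArtistFreq']),
    ],
    sparse_threshold=0.0,  # always return a dense array
)
X_demo = pre.fit_transform(demo)
print(X_demo.shape)
```

On the real data, the artist frequencies would need to be computed on the training split only; otherwise information from the test rows leaks into the features.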

In [1]:
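# -*- coding: utf-8 -*-
"""
Training goal:
- Use 7 audio features to predict:
  1) DaysOnChart (days on chart; non-negative, right-skewed) -> log1p-transform y, then fit a linear model
  2) Rank (best historical rank; smaller is better, >= 1) -> log-transform y, then fit a linear model
  (the linear model is Ridge when USE_RIDGE is True, otherwise LinearRegression)

Outputs:
- Train/test MAE and R²
- Models saved to PredictRankAndDay/chart_models.pkl
- An example inference function, predict_from_features()
"""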

import os
import joblib
import numpy as np
import pandas as pd
from pathlib import Path

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, r2_score

# ============= Paths and constants =============
CSV = Path(r"C:\Users\Akari\OneDrive\Desktop\SEM 1\AML\Final CW\Datas\spotify_preprocess_dayAndRank.csv")
ART_DIR = Path(r"C:\Users\Akari\OneDrive\Desktop\SEM 1\AML\Final CW\PredictRankAndDay"); ART_DIR.mkdir(exist_ok=True)

FEATURES = ['Danceability','Energy','Valence','Loudness','Speechiness','Acousticness','Instrumentalness']
TARGET_DAYS = 'DaysOnChart'
TARGET_RANK = 'Rank'

# Whether to use Ridge (slightly more stable than plain least squares); set to False to disable
USE_RIDGE = True
RIDGE_ALPHA = 1.0  # try 0.1 / 1 / 10 for simple tuning

# ============= Load and clean =============
df = pd.read_csv(CSV)

# Keep the required columns and drop rows with missing values
need_cols = ['id','Title','Artists','Nationality'] + FEATURES + [TARGET_DAYS, TARGET_RANK]
df = df[need_cols].dropna(subset=FEATURES + [TARGET_DAYS, TARGET_RANK]).copy()

# Rank should be >= 1; clip at a lower bound of 1 to guard against stray 0 or negative values
df[TARGET_RANK] = df[TARGET_RANK].clip(lower=1)

# ============= Build features and targets (with transforms) =============
X = df[FEATURES].values

# y1: DaysOnChart -> log1p
y_days_raw = df[TARGET_DAYS].astype(float).values
y_days = np.log1p(y_days_raw)

# y2: Rank -> log
y_rank_raw = df[TARGET_RANK].astype(float).values
y_rank = np.log(y_rank_raw)

# ============= Train/test split =============
Xtr, Xte, y_days_tr, y_days_te, y_rank_tr, y_rank_te = train_test_split(
    X, y_days, y_rank, test_size=0.2, random_state=42
)

# ============= Define and train the models =============
def make_regressor():
    """Return a fresh, unfitted regressor with the configured settings."""
    return Ridge(alpha=RIDGE_ALPHA) if USE_RIDGE else LinearRegression()

days_model = make_regressor()
rank_model = make_regressor()

days_model.fit(Xtr, y_days_tr)
rank_model.fit(Xtr, y_rank_tr)

# ============= Evaluate (back-transformed to the original scale) =============
# Days
days_pred_te = np.expm1(days_model.predict(Xte))
days_true_te = np.expm1(y_days_te)

days_mae = mean_absolute_error(days_true_te, days_pred_te)
days_r2  = r2_score(days_true_te, days_pred_te)

# Rank
rank_pred_te = np.exp(rank_model.predict(Xte))
rank_true_te = np.exp(y_rank_te)

rank_mae = mean_absolute_error(rank_true_te, rank_pred_te)
rank_r2  = r2_score(rank_true_te, rank_pred_te)

print("=== Test metrics ===")
print(f"DaysOnChart  MAE = {days_mae:.3f}   R2 = {days_r2:.3f}")
print(f"Rank         MAE = {rank_mae:.3f}   R2 = {rank_r2:.3f}")

# ============= Also report on the training set (optional) =============
days_pred_tr = np.expm1(days_model.predict(Xtr))
rank_pred_tr = np.exp(rank_model.predict(Xtr))

print("\n=== Train metrics (for reference) ===")
print(f"DaysOnChart  MAE = {mean_absolute_error(np.expm1(y_days_tr), days_pred_tr):.3f}   R2 = {r2_score(np.expm1(y_days_tr), days_pred_tr):.3f}")
print(f"Rank         MAE = {mean_absolute_error(np.exp(y_rank_tr),   rank_pred_tr):.3f}   R2 = {r2_score(np.exp(y_rank_tr),   rank_pred_tr):.3f}")

# ============= Save models and metadata =============
bundle = {
    "days_model": days_model,
    "rank_model": rank_model,
    "features": FEATURES,
    "targets": {"days": TARGET_DAYS, "rank": TARGET_RANK},
    "transforms": {
        "days": "log1p <-> expm1",
        "rank": "log <-> exp",
        "rank_clip_min": 1.0
    },
    "use_ridge": USE_RIDGE,
    "ridge_alpha": RIDGE_ALPHA
}
joblib.dump(bundle, ART_DIR / "chart_models.pkl")
print(f"\n✅ Models saved to: {ART_DIR / 'chart_models.pkl'}")

# ============= Inference function (example) =============
def predict_from_features(feat_dict: dict):
    """
    feat_dict: dict of the 7 audio features, e.g.:
      {'Danceability':0.6,'Energy':0.8,'Valence':0.4,'Loudness':0.7,
       'Speechiness':0.05,'Acousticness':0.2,'Instrumentalness':0.0}
    """
    x = np.array([[feat_dict[f] for f in FEATURES]], dtype=float)
    days_hat = float(np.expm1(days_model.predict(x))[0])
    rank_hat = float(np.exp(  rank_model.predict(x))[0])
    # Clip to valid ranges
    days_hat = max(0.0, days_hat)
    rank_hat = max(1.0, rank_hat)
    return {"pred_DaysOnChart": days_hat, "pred_Rank": rank_hat}

# # Example:
# print(predict_from_features({'Danceability':0.6,'Energy':0.8,'Valence':0.4,'Loudness':0.7,'Speechiness':0.05,'Acousticness':0.2,'Instrumentalness':0.0}))


=== Test metrics ===
DaysOnChart  MAE = 51.289   R2 = -0.098
Rank         MAE = 52.725   R2 = -0.216

=== Train metrics (for reference) ===
DaysOnChart  MAE = 47.565   R2 = -0.102
Rank         MAE = 52.854   R2 = -0.241

✅ Models saved to: C:\Users\Akari\OneDrive\Desktop\SEM 1\AML\Final CW\PredictRankAndDay\chart_models.pkl
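
One detail in the metrics above deserves a note: R² is negative even on the training set. A least-squares or ridge fit with an intercept cannot score below 0 on its own training data in the space where it was fitted (here, the log scale), so the negative values come from the `expm1`/`exp` back-transformation, which pulls predictions toward the median of a right-skewed target rather than its mean. A minimal sketch on synthetic data follows; the lognormal target and the `TransformedTargetRegressor` wrapper are assumptions for illustration, not the notebook's data or code.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# Synthetic stand-in: 7 uniform features and a right-skewed target unrelated to them.
X_demo = rng.uniform(0.0, 1.0, size=(500, 7))
y_demo = np.round(rng.lognormal(mean=3.0, sigma=1.2, size=500))

# Fit Ridge on log1p(y); predict() applies expm1 automatically.
model = TransformedTargetRegressor(
    regressor=Ridge(alpha=1.0), func=np.log1p, inverse_func=np.expm1
)
model.fit(X_demo, y_demo)

# R² in the space the model was fitted in (log scale) ...
r2_log = r2_score(np.log1p(y_demo), model.regressor_.predict(X_demo))
# ... versus R² after back-transforming to the original scale.
r2_raw = r2_score(y_demo, model.predict(X_demo))

print(f"train R2 (log scale) = {r2_log:.3f}")
print(f"train R2 (raw scale) = {r2_raw:.3f}")
```

Reporting R² on the log scale next to the raw-scale MAE, or comparing against a constant baseline such as `DummyRegressor`, helps separate "the features carry no signal" from "the back-transform is biased".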


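Predicting the exact rank and days on chart directly works very poorly (R² is negative on both the train and test sets), so the task is recast as classification:

Rank is split into five classes: top 10, 11-50, 51-100, 101-150, and 151-200
DaysOnChart is split into five classes: <10, 10-30, 30-50, 50-100, and >100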
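
The class definitions above map onto `pd.cut` roughly as in the following sketch. The bin edges assume that ranks and day counts are integers, and the sample values are made up to show where boundary cases land.

```python
import numpy as np
import pandas as pd

rank_bins = [0, 10, 50, 100, 150, 200]
rank_labels = ["Top10", "11-50", "51-100", "101-150", "151-200"]

# An upper edge of 9 puts integer day counts 0..9 into "<10".
days_bins = [-np.inf, 9, 30, 50, 100, np.inf]
days_labels = ["<10", "10-30", "30-50", "50-100", ">100"]

sample_ranks = pd.Series([1, 10, 11, 50, 51, 200])
sample_days = pd.Series([0, 9, 10, 30, 31, 100, 101])

# right=True makes every interval closed on the right: (lo, hi]
rank_bucket = pd.cut(sample_ranks, bins=rank_bins, labels=rank_labels, right=True)
days_bucket = pd.cut(sample_days, bins=days_bins, labels=days_labels, right=True)

print(rank_bucket.astype(str).tolist())
print(days_bucket.astype(str).tolist())
```

Because the intervals are right-closed, a song with exactly 30 days lands in "10-30" and one with exactly 100 days lands in "50-100", so the overlapping label names are unambiguous in practice.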

Start with logistic regression.

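| Parameter      | Meaning                         | Recommended value                                      |
| -------------- | ------------------------------- | ------------------------------------------------------ |
| `solver`       | Optimizer                       | `'lbfgs'` (default; suits small to medium datasets)    |
| `multi_class`  | Multiclass strategy             | `'multinomial'` (softmax; recommended). Deprecated since scikit-learn 1.5, where `lbfgs` already uses the multinomial loss for multiclass targets |
| `max_iter`     | Maximum number of iterations    | 1000–5000 (increase if the solver does not converge)   |
| `C`            | Inverse regularization strength | Default 1.0; larger values mean weaker regularization  |
| `class_weight` | Class weights                   | `'balanced'` mitigates class imbalance                 |
| `penalty`      | Regularization type             | `'l2'` (default; stable)                               |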

rank_clf = LogisticRegression(
    multi_class="multinomial",
    class_weight="balanced",
    solver="lbfgs",
    C=0.5,           # stronger regularization than the default 1.0 (0.5 or 0.2 both work)
    max_iter=3000,
    random_state=42
)
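
Rather than fixing `C` by hand, the value can be chosen by cross-validation. A minimal sketch, assuming synthetic data from `make_classification` in place of the real features and macro-F1 as the selection metric; the grid values are illustrative. `multi_class` is left out here because `lbfgs` already fits a multinomial model for multiclass targets by default.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic 5-class problem with 7 features (stand-in for the audio features).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=7, n_informative=4, n_redundant=0,
    n_classes=5, n_clusters_per_class=1, random_state=42,
)

base_clf = LogisticRegression(
    class_weight="balanced", solver="lbfgs", max_iter=3000, random_state=42
)
param_grid = {"C": [0.05, 0.2, 0.5, 1.0, 5.0]}  # illustrative grid

search = GridSearchCV(
    base_clf,
    param_grid,
    scoring="f1_macro",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
search.fit(X_demo, y_demo)

print("best C:", search.best_params_["C"])
print(f"best CV macro-F1: {search.best_score_:.3f}")
```

If the cross-validated score barely moves across the grid, regularization strength is probably not the limiting factor.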

In [4]:
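# -*- coding: utf-8 -*-
"""
Multiclass training:
- Input features: 7 audio features (already scaled to 0~1)
- Task A: best rank (Rank) → 5 classes: Top10 / 11-50 / 51-100 / 101-150 / 151-200
- Task B: days on chart (DaysOnChart) → 5 classes: <10 / 10-30 / 30-50 / 50-100 / >100
- Model: logistic regression (multinomial) with class_weight='balanced' to handle class imbalance
- Evaluation: accuracy, macro-F1, classification report, confusion matrix
"""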

import os
import joblib
import numpy as np
import pandas as pd
from pathlib import Path

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, classification_report, confusion_matrix

# ============== Path configuration ==============
CSV = Path(r"C:\Users\Akari\OneDrive\Desktop\SEM 1\AML\Final CW\Datas\spotify_preprocess_dayAndRank.csv")
ART_DIR = Path(r"C:\Users\Akari\OneDrive\Desktop\SEM 1\AML\Final CW\PredictRankAndDay"); ART_DIR.mkdir(exist_ok=True)

FEATURES = ['Danceability','Energy','Valence','Loudness','Speechiness','Acousticness','Instrumentalness']

# ============== Load data ==============
df = pd.read_csv(CSV)

# Keep only the required columns and drop rows with missing values
need_cols = ['id','Title','Artists','Nationality'] + FEATURES + ['Rank','DaysOnChart']
df = df[need_cols].dropna(subset=FEATURES + ['Rank','DaysOnChart']).copy()

# Clip Rank to its valid range (1..200)
df['Rank'] = df['Rank'].clip(lower=1, upper=200).astype(int)
df['DaysOnChart'] = df['DaysOnChart'].astype(int)

# ============== Bucketing / labeling ==============
# Task A: Rank in 5 classes (1-10, 11-50, 51-100, 101-150, 151-200)
rank_bins   = [0, 10, 50, 100, 150, 200]
rank_labels = ["Top10", "11-50", "51-100", "101-150", "151-200"]
df['RankBucket'] = pd.cut(df['Rank'], bins=rank_bins, labels=rank_labels, right=True, include_lowest=True)

# Task B: DaysOnChart in 5 classes (<10, 10-30, 30-50, 50-100, >100)
# Actual intervals: (-inf, 9], (9, 30], (30, 50], (50, 100], (100, +inf), so 30 falls in "10-30" and 50 in "30-50"
days_bins   = [-np.inf, 9, 30, 50, 100, np.inf]
days_labels = ["<10", "10-30", "30-50", "50-100", ">100"]
df['DaysBucket'] = pd.cut(df['DaysOnChart'], bins=days_bins, labels=days_labels, right=True)

# Drop any rows that could not be bucketed
df = df.dropna(subset=['RankBucket','DaysBucket']).copy()

# ============== Features and labels ==============
X = df[FEATURES].values
y_rank = df['RankBucket'].astype(str).values
y_days = df['DaysBucket'].astype(str).values

# ============== Stratified split (preserves class proportions) ==============
X_tr, X_te, y_rank_tr, y_rank_te = train_test_split(
    X, y_rank, test_size=0.2, random_state=42, stratify=y_rank
)
# Split separately for the Days task (sharing indices with Rank would also work; a separate split is more direct)
X_tr_d, X_te_d, y_days_tr, y_days_te = train_test_split(
    X, y_days, test_size=0.2, random_state=42, stratify=y_days
)

# ============== Encode labels (convenient for saving and inference) ==============
le_rank = LabelEncoder().fit(y_rank_tr)
le_days = LabelEncoder().fit(y_days_tr)

y_rank_tr_enc = le_rank.transform(y_rank_tr)
y_rank_te_enc = le_rank.transform(y_rank_te)

y_days_tr_enc = le_days.transform(y_days_tr)
y_days_te_enc = le_days.transform(y_days_te)

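# ============== Define and train models (multiclass logistic regression, balanced class weights) ==============
rank_clf = LogisticRegression(
    multi_class="multinomial",
    class_weight="balanced",
    solver="lbfgs",      # lbfgs is reliable on small datasets; switch to saga if convergence is slow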
    max_iter=2000,
    random_state=42
)
rank_clf.fit(X_tr, y_rank_tr_enc)

days_clf = LogisticRegression(
    multi_class="multinomial",
    class_weight="balanced",
    solver="lbfgs",
    max_iter=2000,
    random_state=42
)
days_clf.fit(X_tr_d, y_days_tr_enc)

# ============== Evaluation (accuracy, macro-F1, report, confusion matrix) ==============
def evaluate(name, clf, Xte, yte_enc, label_encoder, label_order=None):
    # Resolve the default label order first so the report and the matrix agree
    if label_order is None:
        label_order = sorted(label_encoder.classes_.tolist())

    pred_enc = clf.predict(Xte)
    acc = accuracy_score(yte_enc, pred_enc)
    f1m = f1_score(yte_enc, pred_enc, average="macro")

    y_true = label_encoder.inverse_transform(yte_enc)
    y_pred = label_encoder.inverse_transform(pred_enc)
    print(f"\n=== {name} - Test ===")
    print(f"Accuracy = {acc:.4f}   Macro-F1 = {f1m:.4f}")
    print("\nClassification report:")
    print(classification_report(y_true, y_pred, labels=label_order))
    # Confusion matrix with labels shown in order
    cm = confusion_matrix(y_true, y_pred, labels=label_order)
    cm_df = pd.DataFrame(cm, index=[f"T:{l}" for l in label_order], columns=[f"P:{l}" for l in label_order])
    print("\nConfusion matrix:")
    print(cm_df)

evaluate("RankBucket", rank_clf, X_te, y_rank_te_enc, le_rank, label_order=rank_labels)
evaluate("DaysBucket", days_clf, X_te_d, y_days_te_enc, le_days, label_order=days_labels)

# ============== Save models and metadata ==============
bundle = {
    "features": FEATURES,
    "rank": {
        "model": rank_clf,
        "label_encoder": le_rank,
        "labels": rank_labels,
        "bins": rank_bins
    },
    "days": {
        "model": days_clf,
        "label_encoder": le_days,
        "labels": days_labels,
        "bins": days_bins
    }
}
joblib.dump(bundle, ART_DIR / "chart_cls_models.pkl")
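print(f"\n✅ Classification models saved to: {ART_DIR / 'chart_cls_models.pkl'}")

# ============== Inference function (also returns probabilities, which helps explain results in the report) ==============
def predict_buckets_from_features(feat_dict: dict):
    """
    Input: dict of the 7 audio features (0~1), e.g.:
      {'Danceability':0.6,'Energy':0.8,'Valence':0.4,'Loudness':0.7,
       'Speechiness':0.05,'Acousticness':0.2,'Instrumentalness':0.0}
    Output: predicted class and probability distribution for both tasks
    """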
    x = np.array([[feat_dict[f] for f in FEATURES]], dtype=float)

    # Rank
    pr_rank = rank_clf.predict_proba(x)[0]
    pred_rank_idx = int(np.argmax(pr_rank))
    pred_rank_label = le_rank.inverse_transform([pred_rank_idx])[0]
    rank_probs = {label: float(pr_rank[le_rank.transform([label])[0]]) for label in rank_labels}

    # Days
    pr_days = days_clf.predict_proba(x)[0]
    pred_days_idx = int(np.argmax(pr_days))
    pred_days_label = le_days.inverse_transform([pred_days_idx])[0]
    days_probs = {label: float(pr_days[le_days.transform([label])[0]]) for label in days_labels}

    return {
        "RankBucket": {"pred": pred_rank_label, "probs": rank_probs},
        "DaysBucket": {"pred": pred_days_label, "probs": days_probs},
    }

# Example:
# print(predict_buckets_from_features({'Danceability':0.6,'Energy':0.8,'Valence':0.4,'Loudness':0.7,'Speechiness':0.05,'Acousticness':0.2,'Instrumentalness':0.0}))





=== RankBucket - Test ===
Accuracy = 0.2019   Macro-F1 = 0.1983

Classification report:
              precision    recall  f1-score   support

       Top10       0.12      0.40      0.19       168
       11-50       0.24      0.16      0.20       431
      51-100       0.28      0.20      0.23       489
     101-150       0.23      0.23      0.23       393
     151-200       0.18      0.12      0.15       352

    accuracy                           0.20      1833
   macro avg       0.21      0.22      0.20      1833
weighted avg       0.23      0.20      0.20      1833


Confusion matrix:
           P:Top10  P:11-50  P:51-100  P:101-150  P:151-200
T:Top10         68       24        24         31         21
T:11-50        141       71        77         83         59
T:51-100       141       84        99        101         64
T:101-150      126       52        79         89         47
T:151-200       89       62        78         80         43

=== DaysBucket - Test ===
Accuracy = 0.271

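The logistic regression model extracted almost no discriminative information.
RankBucket accuracy is 0.2019, about the 0.20 expected from uniform guessing over five classes and below the 0.267 obtained by always predicting the largest class (51-100, 489 of 1833 test samples). The relationship between the audio features and chart tier / chart duration appears to be very weak.
(You can paste the results into GPT for a detailed analysis.)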
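
To make "close to random" concrete, it helps to compare against trivial baselines. The sketch below rebuilds the RankBucket test labels from the support column of the report above and scores a `DummyClassifier` on them. Fitting the dummy on these same labels is a shortcut; it assumes the stratified split gives the training set the same majority class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

labels = ["Top10", "11-50", "51-100", "101-150", "151-200"]
support = [168, 431, 489, 393, 352]  # test-set support from the report

# Rebuild a label vector with the same class counts; features are irrelevant to a dummy model.
y_test_demo = np.repeat(labels, support)
X_placeholder = np.zeros((len(y_test_demo), 1))

majority = DummyClassifier(strategy="most_frequent").fit(X_placeholder, y_test_demo)
pred = majority.predict(X_placeholder)

majority_acc = accuracy_score(y_test_demo, pred)
majority_f1 = f1_score(y_test_demo, pred, average="macro", zero_division=0)
uniform_acc = 1.0 / len(labels)  # expected accuracy of uniform random guessing

print(f"uniform-guess accuracy  = {uniform_acc:.3f}")
print(f"majority-class accuracy = {majority_acc:.3f}")
print(f"majority-class macro-F1 = {majority_f1:.3f}")
```

Against these numbers, the logistic regression's 0.2019 accuracy sits between uniform guessing and the majority baseline, while its macro-F1 of 0.1983 is higher than a majority-class predictor can reach, since that predictor scores zero F1 on four of the five classes.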

Switch to three classes
Rank: Top50 / 51–100 / >100

Days on chart: <30 / 30–100 / >100

In [13]:
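# -*- coding: utf-8 -*-
"""
Multiclass training (3-class version):
- Input features: 7 audio features (already normalized to 0~1)
- Task A: best rank (Rank) → 3 classes: Top50 / 51–100 / >100
- Task B: days on chart (DaysOnChart) → 3 classes: <30 / 30–100 / >100
- Model: logistic regression (multinomial) with class_weight='balanced' to handle class imbalance
- Evaluation: accuracy, macro-F1, classification report, confusion matrix
"""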

import os
import joblib
import numpy as np
import pandas as pd
from pathlib import Path

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, classification_report, confusion_matrix

# ============== Path configuration ==============
CSV = Path(r"C:\Users\Akari\OneDrive\Desktop\SEM 1\AML\Final CW\Datas\spotify_preprocess_dayAndRank.csv")
ART_DIR = Path(r"C:\Users\Akari\OneDrive\Desktop\SEM 1\AML\Final CW\PredictRankAndDay"); ART_DIR.mkdir(exist_ok=True)

FEATURES = ['Danceability','Energy','Valence','Loudness','Speechiness','Acousticness','Instrumentalness']

# ============== Load data ==============
df = pd.read_csv(CSV)

# Keep only the required columns and drop rows with missing values
need_cols = ['id','Title','Artists','Nationality'] + FEATURES + ['Rank','DaysOnChart']
df = df[need_cols].dropna(subset=FEATURES + ['Rank','DaysOnChart']).copy()

# Clip Rank to its valid range (1..200)
df['Rank'] = df['Rank'].clip(lower=1, upper=200).astype(int)
df['DaysOnChart'] = df['DaysOnChart'].astype(int)

# ============== Bucketing / labeling (3 classes) ==============
# Task A: Rank in 3 classes (Top50 / 51–100 / >100)
rank_bins   = [0, 50, 100, 200]
rank_labels = ["Top50", "51–100", ">100"]
df['RankBucket'] = pd.cut(df['Rank'], bins=rank_bins, labels=rank_labels, right=True, include_lowest=True)

# Task B: DaysOnChart in 3 classes (<30 / 30–100 / >100)
days_bins   = [-np.inf, 29, 100, np.inf]
days_labels = ["<30", "30–100", ">100"]
df['DaysBucket'] = pd.cut(df['DaysOnChart'], bins=days_bins, labels=days_labels, right=True)

# Drop any rows that could not be bucketed
df = df.dropna(subset=['RankBucket','DaysBucket']).copy()

# ============== Features and labels ==============
X = df[FEATURES].values
y_rank = df['RankBucket'].astype(str).values
y_days = df['DaysBucket'].astype(str).values

# ============== Stratified split (preserves class proportions) ==============
X_tr, X_te, y_rank_tr, y_rank_te = train_test_split(
    X, y_rank, test_size=0.2, random_state=42, stratify=y_rank
)
X_tr_d, X_te_d, y_days_tr, y_days_te = train_test_split(
    X, y_days, test_size=0.2, random_state=42, stratify=y_days
)

# ============== Encode labels (convenient for saving and inference) ==============
le_rank = LabelEncoder().fit(y_rank_tr)
le_days = LabelEncoder().fit(y_days_tr)

y_rank_tr_enc = le_rank.transform(y_rank_tr)
y_rank_te_enc = le_rank.transform(y_rank_te)

y_days_tr_enc = le_days.transform(y_days_tr)
y_days_te_enc = le_days.transform(y_days_te)

# ============== Define and train models (multiclass logistic regression, balanced class weights) ==============
rank_clf = LogisticRegression(
    multi_class="multinomial",
    class_weight="balanced",
    solver="lbfgs",
    max_iter=2000,
    random_state=42
)
rank_clf.fit(X_tr, y_rank_tr_enc)

days_clf = LogisticRegression(
    multi_class="multinomial",
    class_weight="balanced",
    solver="lbfgs",
    max_iter=2000,
    random_state=42
)
days_clf.fit(X_tr_d, y_days_tr_enc)

# ============== Evaluation function (accuracy, macro-F1, report, confusion matrix) ==============
def evaluate(name, clf, Xte, yte_enc, label_encoder, label_order=None):
    # Resolve the default label order first so the report and the matrix agree
    if label_order is None:
        label_order = sorted(label_encoder.classes_.tolist())

    pred_enc = clf.predict(Xte)
    acc = accuracy_score(yte_enc, pred_enc)
    f1m = f1_score(yte_enc, pred_enc, average="macro")

    y_true = label_encoder.inverse_transform(yte_enc)
    y_pred = label_encoder.inverse_transform(pred_enc)
    print(f"\n=== {name} - Test (3-class) ===")
    print(f"Accuracy = {acc:.4f}   Macro-F1 = {f1m:.4f}")
    print("\nClassification report:")
    print(classification_report(y_true, y_pred, labels=label_order))
    cm = confusion_matrix(y_true, y_pred, labels=label_order)
    cm_df = pd.DataFrame(cm, index=[f"T:{l}" for l in label_order], columns=[f"P:{l}" for l in label_order])
    print("\nConfusion matrix:")
    print(cm_df)

evaluate("RankBucket", rank_clf, X_te, y_rank_te_enc, le_rank, label_order=rank_labels)
evaluate("DaysBucket", days_clf, X_te_d, y_days_te_enc, le_days, label_order=days_labels)

# ============== Save models and metadata ==============
bundle = {
    "features": FEATURES,
    "rank": {
        "model": rank_clf,
        "label_encoder": le_rank,
        "labels": rank_labels,
        "bins": rank_bins
    },
    "days": {
        "model": days_clf,
        "label_encoder": le_days,
        "labels": days_labels,
        "bins": days_bins
    }
}
joblib.dump(bundle, ART_DIR / "chart_cls_3class_models.pkl")
print(f"\n✅ 3-class logistic regression models saved to: {ART_DIR / 'chart_cls_3class_models.pkl'}")

# ============== Inference function (with probability output) ==============
def predict_buckets_from_features(feat_dict: dict):
    """
    Input: dict of the 7 audio features (0~1), e.g.:
      {'Danceability':0.6,'Energy':0.8,'Valence':0.4,'Loudness':0.7,
       'Speechiness':0.05,'Acousticness':0.2,'Instrumentalness':0.0}
    Output: predicted class and probability distribution for both tasks
    """
    x = np.array([[feat_dict[f] for f in FEATURES]], dtype=float)

    # Rank
    pr_rank = rank_clf.predict_proba(x)[0]
    pred_rank_idx = int(np.argmax(pr_rank))
    pred_rank_label = le_rank.inverse_transform([pred_rank_idx])[0]
    rank_probs = {label: float(pr_rank[le_rank.transform([label])[0]]) for label in rank_labels}

    # Days
    pr_days = days_clf.predict_proba(x)[0]
    pred_days_idx = int(np.argmax(pr_days))
    pred_days_label = le_days.inverse_transform([pred_days_idx])[0]
    days_probs = {label: float(pr_days[le_days.transform([label])[0]]) for label in days_labels}

    return {
        "RankBucket": {"pred": pred_rank_label, "probs": rank_probs},
        "DaysBucket": {"pred": pred_days_label, "probs": days_probs},
    }

# Example:
# print(predict_buckets_from_features({'Danceability':0.6,'Energy':0.8,'Valence':0.4,
#                                     'Loudness':0.7,'Speechiness':0.05,
#                                     'Acousticness':0.2,'Instrumentalness':0.0}))





=== RankBucket - Test (3-class) ===
Accuracy = 0.3650   Macro-F1 = 0.3606

Classification report:
              precision    recall  f1-score   support

       Top50       0.37      0.47      0.41       599
      51–100       0.29      0.33      0.31       489
        >100       0.44      0.30      0.36       745

    accuracy                           0.36      1833
   macro avg       0.37      0.37      0.36      1833
weighted avg       0.38      0.36      0.36      1833


Confusion matrix:
          P:Top50  P:51–100  P:>100
T:Top50       281       169     149
T:51–100      193       163     133
T:>100        289       231     225

=== DaysBucket - Test (3-class) ===
Accuracy = 0.4539   Macro-F1 = 0.3646

Classification report:
              precision    recall  f1-score   support

         <30       0.78      0.49      0.60      1277
      30–100       0.20      0.19      0.20       294
        >100       0.20      0.56      0.30       262

    accuracy                           0

Next, try a random forest.

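| Parameter            | Purpose                            | Recommended range                          | Tuning advice                                                                  |
| -------------------- | ---------------------------------- | ------------------------------------------ | ------------------------------------------------------------------------------ |
| **n_estimators**     | Number of trees in the forest      | 100–1000                                   | More trees are more stable (but slower); 300~600 is usually enough.            |
| **max_depth**        | Maximum depth of each tree         | None / 10–20                               | Limits tree complexity to curb overfitting; None lets trees grow until the other stopping rules apply. |
| **min_samples_leaf** | Minimum samples per leaf           | 1–10                                       | Larger values give smoother trees; 2~5 is usually a safe choice.               |
| **max_features**     | Number of features tried per split | `'sqrt'`, `'log2'`, 0.5                    | Controls randomness; `'sqrt'` is the usual choice for classification.          |
| **class_weight**     | Class weight adjustment            | `'balanced'`, `'balanced_subsample'`, None | With imbalanced classes, `'balanced_subsample'` is suggested.                  |
| **random_state**     | Random seed                        | Any integer                                | Fixed to 42 here so results are reproducible.                                  |
| **n_jobs**           | Number of parallel jobs            | -1                                         | `-1` uses all CPU cores.                                                       |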
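
The ranges in the table can be explored with a randomized search instead of manual trial and error. A minimal sketch, assuming synthetic data from `make_classification` and macro-F1 as the selection metric; `n_estimators` is deliberately kept below the table's range so the example runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic 3-class problem with 7 features (stand-in for the audio features).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=7, n_informative=4, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=42,
)

# Candidate values follow the table; n_estimators is reduced for speed (assumption).
param_distributions = {
    "n_estimators": [50, 100],
    "max_depth": [None, 10, 20],
    "min_samples_leaf": [1, 2, 5, 10],
    "max_features": ["sqrt", "log2", 0.5],
    "class_weight": ["balanced", "balanced_subsample", None],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=5,            # number of sampled configurations
    scoring="f1_macro",
    cv=3,
    random_state=42,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"best CV macro-F1: {search.best_score_:.3f}")
```

On the real data, a larger `n_iter` and the full `n_estimators` range would be more appropriate, at the cost of longer run time.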


In [6]:
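# -*- coding: utf-8 -*-
"""
Nonlinear tree-model comparison experiment:
- Model: RandomForestClassifier (two models, one each for RankBucket / DaysBucket)
- Features: 7 audio features
- Labels: 5 classes (same as the logistic regression above)
- Evaluation: accuracy, macro-F1, classification report, confusion matrix
- Output: PredictRankAndDay/chart_cls_rf_models.pkl
"""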

import os
import joblib
import numpy as np
import pandas as pd
from pathlib import Path

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, classification_report, confusion_matrix

# ============== Path configuration ==============
CSV = Path(r"C:\Users\Akari\OneDrive\Desktop\SEM 1\AML\Final CW\Datas\spotify_preprocess_dayAndRank.csv")
ART_DIR = Path(r"C:\Users\Akari\OneDrive\Desktop\SEM 1\AML\Final CW\PredictRankAndDay"); ART_DIR.mkdir(exist_ok=True)

FEATURES = ['Danceability','Energy','Valence','Loudness','Speechiness','Acousticness','Instrumentalness']

# ============== Load data ==============
df = pd.read_csv(CSV)
need_cols = ['id','Title','Artists','Nationality'] + FEATURES + ['Rank','DaysOnChart']
df = df[need_cols].dropna(subset=FEATURES + ['Rank','DaysOnChart']).copy()
df['Rank'] = df['Rank'].clip(lower=1, upper=200).astype(int)
df['DaysOnChart'] = df['DaysOnChart'].astype(int)

# ============== Bucketing / labeling (same as before) ==============
# Rank: Top10 / 11-50 / 51-100 / 101-150 / 151-200
rank_bins   = [0, 10, 50, 100, 150, 200]
rank_labels = ["Top10", "11-50", "51-100", "101-150", "151-200"]
df['RankBucket'] = pd.cut(df['Rank'], bins=rank_bins, labels=rank_labels, right=True, include_lowest=True)

# Days: <10 / 10-30 / 30-50 / 50-100 / >100
days_bins   = [-np.inf, 9, 30, 50, 100, np.inf]
days_labels = ["<10", "10-30", "30-50", "50-100", ">100"]
df['DaysBucket'] = pd.cut(df['DaysOnChart'], bins=days_bins, labels=days_labels, right=True)

df = df.dropna(subset=['RankBucket','DaysBucket']).copy()

# ============== Features and labels ==============
X = df[FEATURES].values
y_rank = df['RankBucket'].astype(str).values
y_days = df['DaysBucket'].astype(str).values

# ============== Stratified split (preserves class proportions) ==============
X_tr, X_te, y_rank_tr, y_rank_te = train_test_split(
    X, y_rank, test_size=0.2, random_state=42, stratify=y_rank
)
X_tr_d, X_te_d, y_days_tr, y_days_te = train_test_split(
    X, y_days, test_size=0.2, random_state=42, stratify=y_days
)

# ============== Label encoding (saved so predictions can be decoded at inference) ==============
le_rank = LabelEncoder().fit(y_rank_tr)
le_days = LabelEncoder().fit(y_days_tr)
y_rank_tr_enc = le_rank.transform(y_rank_tr); y_rank_te_enc = le_rank.transform(y_rank_te)
y_days_tr_enc = le_days.transform(y_days_tr); y_days_te_enc = le_days.transform(y_days_te)

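# ============== Define the random forests (tune as needed) ==============
rf_params = dict(
    n_estimators=400,      # forest size
    max_depth=None,        # let trees grow freely (set e.g. 12 or 16 to limit complexity)
    min_samples_leaf=2,    # minimum samples per leaf; improves generalization
    class_weight="balanced_subsample",  # mitigates class imbalance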
    random_state=42,
    n_jobs=-1
)

rank_clf = RandomForestClassifier(**rf_params)
days_clf = RandomForestClassifier(**rf_params)

# Train
rank_clf.fit(X_tr, y_rank_tr_enc)
days_clf.fit(X_tr_d, y_days_tr_enc)

# ============== Evaluation function ==============
def evaluate(name, clf, Xte, yte_enc, label_encoder, label_order=None):
    # Resolve the default label order first so the report and the matrix agree
    if label_order is None:
        label_order = sorted(label_encoder.classes_.tolist())

    pred_enc = clf.predict(Xte)
    acc = accuracy_score(yte_enc, pred_enc)
    f1m = f1_score(yte_enc, pred_enc, average="macro")
    y_true = label_encoder.inverse_transform(yte_enc)
    y_pred = label_encoder.inverse_transform(pred_enc)

    print(f"\n=== {name} - Test (RandomForest) ===")
    print(f"Accuracy = {acc:.4f}   Macro-F1 = {f1m:.4f}")
    print("\nClassification report:")
    print(classification_report(y_true, y_pred, labels=label_order))
    cm = confusion_matrix(y_true, y_pred, labels=label_order)
    cm_df = pd.DataFrame(cm, index=[f"T:{l}" for l in label_order], columns=[f"P:{l}" for l in label_order])
    print("\nConfusion matrix:")
    print(cm_df)

evaluate("RankBucket", rank_clf, X_te, y_rank_te_enc, le_rank, label_order=rank_labels)
evaluate("DaysBucket", days_clf, X_te_d, y_days_te_enc, le_days, label_order=days_labels)

# ============== Feature importance (for interpretability) ==============
def print_feature_importance(name, clf):
    fi = clf.feature_importances_
    order = np.argsort(fi)[::-1]
    print(f"\n{name} - Feature Importance:")
    for i in order:
        print(f"  {FEATURES[i]:16s}  {fi[i]:.4f}")

print_feature_importance("RankBucket", rank_clf)
print_feature_importance("DaysBucket", days_clf)

# ============== Save models and metadata ==============
bundle = {
    "features": FEATURES,
    "rank": {
        "model": rank_clf,
        "label_encoder": le_rank,
        "labels": rank_labels,
        "bins": rank_bins
    },
    "days": {
        "model": days_clf,
        "label_encoder": le_days,
        "labels": days_labels,
        "bins": days_bins
    }
}
joblib.dump(bundle, ART_DIR / "chart_cls_rf_models.pkl")
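print(f"\n✅ Random forest classification models saved to: {ART_DIR / 'chart_cls_rf_models.pkl'}")

# ============== Inference function (with probabilities) ==============
def predict_buckets_from_features_rf(feat_dict: dict):
    """
    Input: the 7 audio features (0~1)
    Output: predicted class and probability distribution for both tasks
    """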
    x = np.array([[feat_dict[f] for f in FEATURES]], dtype=float)

    # Rank
    pr_rank = rank_clf.predict_proba(x)[0]
    pred_rank_idx = int(np.argmax(pr_rank))
    pred_rank_label = le_rank.inverse_transform([pred_rank_idx])[0]
    rank_probs = {lab: float(pr_rank[le_rank.transform([lab])[0]]) for lab in rank_labels}

    # Days
    pr_days = days_clf.predict_proba(x)[0]
    pred_days_idx = int(np.argmax(pr_days))
    pred_days_label = le_days.inverse_transform([pred_days_idx])[0]
    days_probs = {lab: float(pr_days[le_days.transform([lab])[0]]) for lab in days_labels}

    return {
        "RankBucket": {"pred": pred_rank_label, "probs": rank_probs},
        "DaysBucket": {"pred": pred_days_label, "probs": days_probs},
    }

# # Example:
# print(predict_buckets_from_features_rf({'Danceability':0.6,'Energy':0.8,'Valence':0.4,'Loudness':0.7,'Speechiness':0.05,'Acousticness':0.2,'Instrumentalness':0.0}))



=== RankBucket - Test (RandomForest) ===
Accuracy = 0.2897   Macro-F1 = 0.2846

Classification report:
              precision    recall  f1-score   support

       Top10       0.45      0.23      0.31       168
       11-50       0.29      0.32      0.31       431
      51-100       0.29      0.38      0.33       489
     101-150       0.28      0.25      0.26       393
     151-200       0.24      0.19      0.21       352

    accuracy                           0.29      1833
   macro avg       0.31      0.28      0.28      1833
weighted avg       0.29      0.29      0.29      1833


Confusion matrix:
           P:Top10  P:11-50  P:51-100  P:101-150  P:151-200
T:Top10         39       57        38         20         14
T:11-50         19      139       154         69         50
T:51-100        13      111       187         96         82
T:101-150        9       83       139         99         63
T:151-200        6       82       124         73         67

=== DaysBucket - Test (Rand

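Still poor: RankBucket accuracy rises to 0.2897, only slightly above the 0.267 majority-class baseline (489 of 1833 test samples).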
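
When accuracy is this close to a baseline, the impurity-based `feature_importances_` printed by the cell above are hard to interpret: they are normalized to sum to 1, so some features always look important even if none helps on held-out data. Permutation importance on the test split is a common cross-check. A minimal sketch on synthetic data follows; the dataset, the split, and the forest size are assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

FEATURE_NAMES = ['Danceability', 'Energy', 'Valence', 'Loudness',
                 'Speechiness', 'Acousticness', 'Instrumentalness']

# Synthetic stand-in for the 7 audio features and a 3-class label.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=7, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

rf = RandomForestClassifier(n_estimators=100, min_samples_leaf=2, random_state=42)
rf.fit(X_tr, y_tr)

# Shuffle one feature at a time on the held-out split and measure the drop in macro-F1.
result = permutation_importance(
    rf, X_te, y_te, scoring="f1_macro", n_repeats=10, random_state=42
)

for i in np.argsort(result.importances_mean)[::-1]:
    print(f"  {FEATURE_NAMES[i]:16s}  "
          f"{result.importances_mean[i]:+.4f} ± {result.importances_std[i]:.4f}")
```

On the real data, importances that stay within about one standard deviation of zero would support the conclusion that the audio features carry little signal.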


Switch to three classes
Rank: Top50 / 51–100 / >100

Days on chart: <30 / 30–100 / >100

In [9]:
# -*- coding: utf-8 -*-
"""
Three-class version (nonlinear random forest):
- RankBucket:  Top50 / 51–100 / >100
- DaysBucket:  <30 / 30–100 / >100
"""

import os
import joblib
import numpy as np
import pandas as pd
from pathlib import Path

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, classification_report, confusion_matrix

# ============== Path configuration ==============
CSV = Path(r"C:\Users\Akari\OneDrive\Desktop\SEM 1\AML\Final CW\Datas\spotify_preprocess_dayAndRank.csv")
ART_DIR = Path(r"C:\Users\Akari\OneDrive\Desktop\SEM 1\AML\Final CW\PredictRankAndDay"); ART_DIR.mkdir(exist_ok=True)

FEATURES = ['Danceability','Energy','Valence','Loudness','Speechiness','Acousticness','Instrumentalness']

# ============== Load data ==============
df = pd.read_csv(CSV)
need_cols = ['id','Title','Artists','Nationality'] + FEATURES + ['Rank','DaysOnChart']
df = df[need_cols].dropna(subset=FEATURES + ['Rank','DaysOnChart']).copy()
df['Rank'] = df['Rank'].clip(lower=1, upper=200).astype(int)
df['DaysOnChart'] = df['DaysOnChart'].astype(int)

# ============== Three-class bucketing ==============
# Rank: Top50 / 51–100 / >100
rank_bins   = [0, 50, 100, 200]
rank_labels = ["Top50", "51–100", ">100"]
df['RankBucket'] = pd.cut(df['Rank'], bins=rank_bins, labels=rank_labels, right=True, include_lowest=True)

# Days: <30 / 30–100 / >100
days_bins   = [-np.inf, 29, 100, np.inf]
days_labels = ["<30", "30–100", ">100"]
df['DaysBucket'] = pd.cut(df['DaysOnChart'], bins=days_bins, labels=days_labels, right=True)

df = df.dropna(subset=['RankBucket','DaysBucket']).copy()

# ============== Features and labels ==============
X = df[FEATURES].values
y_rank = df['RankBucket'].astype(str).values
y_days = df['DaysBucket'].astype(str).values

# ============== Stratified split (preserves class proportions) ==============
Xtr_r, Xte_r, y_rank_tr, y_rank_te = train_test_split(
    X, y_rank, test_size=0.2, random_state=42, stratify=y_rank
)
Xtr_d, Xte_d, y_days_tr, y_days_te = train_test_split(
    X, y_days, test_size=0.2, random_state=42, stratify=y_days
)

# ============== Label encoding ==============
le_rank = LabelEncoder().fit(y_rank_tr)
le_days = LabelEncoder().fit(y_days_tr)
y_rank_tr_enc = le_rank.transform(y_rank_tr); y_rank_te_enc = le_rank.transform(y_rank_te)
y_days_tr_enc = le_days.transform(y_days_tr); y_days_te_enc = le_days.transform(y_days_te)

# ============== Define the random forest models ==============
rf_params = dict(
    n_estimators=400,
    max_depth=None,
    min_samples_leaf=2,
    class_weight="balanced_subsample",
    random_state=42,
    n_jobs=-1
)
rank_clf = RandomForestClassifier(**rf_params)
days_clf = RandomForestClassifier(**rf_params)

rank_clf.fit(Xtr_r, y_rank_tr_enc)
days_clf.fit(Xtr_d, y_days_tr_enc)

# ============== Evaluation function ==============
def evaluate(name, clf, Xte, yte_enc, label_encoder, label_order=None):
    pred_enc = clf.predict(Xte)
    acc = accuracy_score(yte_enc, pred_enc)
    f1m = f1_score(yte_enc, pred_enc, average="macro")
    y_true = label_encoder.inverse_transform(yte_enc)
    y_pred = label_encoder.inverse_transform(pred_enc)
    print(f"\n=== {name} - Test (RandomForest, 3-class) ===")
    print(f"Accuracy = {acc:.4f}   Macro-F1 = {f1m:.4f}")
    print("\nClassification report:")
    print(classification_report(y_true, y_pred, labels=label_order))
    cm = confusion_matrix(y_true, y_pred, labels=label_order)
    cm_df = pd.DataFrame(cm, index=[f"T:{l}" for l in label_order], columns=[f"P:{l}" for l in label_order])
    print("\nConfusion matrix:")
    print(cm_df)

evaluate("RankBucket", rank_clf, Xte_r, y_rank_te_enc, le_rank, label_order=rank_labels)
evaluate("DaysBucket", days_clf, Xte_d, y_days_te_enc, le_days, label_order=days_labels)

# ============== Feature importance ==============
def show_feature_importance(name, clf):
    fi = clf.feature_importances_
    order = np.argsort(fi)[::-1]
    print(f"\n{name} - Feature Importance:")
    for i in order:
        print(f"  {FEATURES[i]:16s}  {fi[i]:.4f}")

show_feature_importance("RankBucket", rank_clf)
show_feature_importance("DaysBucket", days_clf)

# ============== Save models ==============
bundle = {
    "features": FEATURES,
    "rank": {
        "model": rank_clf,
        "label_encoder": le_rank,
        "labels": rank_labels,
        "bins": rank_bins
    },
    "days": {
        "model": days_clf,
        "label_encoder": le_days,
        "labels": days_labels,
        "bins": days_bins
    }
}
joblib.dump(bundle, ART_DIR / "chart_cls_rf_3class_models.pkl")
print(f"\n✅ 3-class random forest models saved to: {ART_DIR / 'chart_cls_rf_3class_models.pkl'}")

# ============== Inference function ==============
def predict_3class_from_features(feat_dict: dict):
    x = np.array([[feat_dict[f] for f in FEATURES]], dtype=float)

    pr_rank = rank_clf.predict_proba(x)[0]
    pred_rank_idx = int(np.argmax(pr_rank))
    pred_rank_label = le_rank.inverse_transform([pred_rank_idx])[0]
    rank_probs = {lab: float(pr_rank[le_rank.transform([lab])[0]]) for lab in rank_labels}

    pr_days = days_clf.predict_proba(x)[0]
    pred_days_idx = int(np.argmax(pr_days))
    pred_days_label = le_days.inverse_transform([pred_days_idx])[0]
    days_probs = {lab: float(pr_days[le_days.transform([lab])[0]]) for lab in days_labels}

    return {
        "RankBucket": {"pred": pred_rank_label, "probs": rank_probs},
        "DaysBucket": {"pred": pred_days_label, "probs": days_probs},
    }

# # Example:
# print(predict_3class_from_features({'Danceability':0.6,'Energy':0.8,'Valence':0.4,
#                                     'Loudness':0.7,'Speechiness':0.05,'Acousticness':0.2,
#                                     'Instrumentalness':0.0}))



=== RankBucket - Test (RandomForest, 3-class) ===
Accuracy = 0.4463   Macro-F1 = 0.4138

Classification report:
              precision    recall  f1-score   support

       Top50       0.45      0.43      0.44       599
      51–100       0.34      0.22      0.26       489
        >100       0.48      0.61      0.54       745

    accuracy                           0.45      1833
   macro avg       0.42      0.42      0.41      1833
weighted avg       0.43      0.45      0.43      1833


Confusion matrix:
          P:Top50  P:51–100  P:>100
T:Top50       258        86     255
T:51–100      142       106     241
T:>100        170       121     454

=== DaysBucket - Test (RandomForest, 3-class) ===
Accuracy = 0.6738   Macro-F1 = 0.3656

Classification report:
              precision    recall  f1-score   support

         <30       0.73      0.92      0.81      1277
      30–100       0.30      0.09      0.14       294
        >100       0.24      0.11      0.15       262

    accuracy