#### Identifying Age-Related Conditions

#### Description

##### Goal

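&emsp;&emsp;Predict whether a person has any of three medical conditions. Build a model, trained on measurements of health characteristics, that predicts whether the person has one or more of the three conditions (Class 1) or none of them (Class 0).

&emsp;&emsp;Determining whether someone has these conditions normally requires a long, invasive process of collecting information from the patient. A predictive model can shorten this process and keep patient details private, by collecting the key characteristics related to the conditions and then encoding them.

&emsp;&emsp;The aim is to discover relationships between measurements of certain characteristics and potential patient conditions.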

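##### Context

&emsp;&emsp;They say age is just a number, but a whole host of health issues come with aging. From heart disease and dementia to hearing loss and arthritis, aging is a risk factor for numerous diseases and complications. The growing field of bioinformatics includes research into interventions that can help slow and reverse biological aging and prevent major age-related ailments. Data science can play a role in developing new methods to solve problems with diverse data, even when the number of samples is small.

&emsp;&emsp;Models such as XGBoost and random forest are currently used to predict medical conditions, but their performance is not yet good enough. For critical problems where lives are at stake, models need to make correct predictions reliably and consistently across different cases.

&emsp;&emsp;The task is to use measurements of health characteristics to address key problems in bioinformatics: based on minimal training, create a model that predicts whether a person has any of three medical conditions, with the aim of improving on existing methods.

&emsp;&emsp;The work helps advance the growing field of bioinformatics and explores new methods for solving complex problems with diverse data.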

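#### Evaluation metric

&emsp;&emsp;Submissions are evaluated using a balanced logarithmic loss. The overall effect is that each class is roughly equally important to the final score.

&emsp;&emsp;Each observation belongs to either Class 0 or Class 1. For each observation, a probability must be submitted for each class. The formula is then: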

$$LogLoss=\frac{-\frac{1}{N_0}\sum_{i=1}^{N_0}y_{0i}\log p_{0i}-\frac{1}{N_1}\sum_{i=1}^{N_1}y_{1i}\log p_{1i}}{2}$$

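&emsp;&emsp;where $N_{c}$ is the number of samples in class $c$, $\log$ is the natural logarithm, $y_{ci}=1$ if sample $i$ belongs to class $c$ and $0$ otherwise, and $p_{ci}$ is the predicted probability that sample $i$ belongs to class $c$.

&emsp;&emsp;The submitted probabilities for a given row need not sum to $1$, because they are rescaled before scoring (each row is divided by its row sum). To avoid the extremes of the log function, each predicted probability $p$ is replaced with $\max(\min(p,1-10^{-15}),10^{-15})$.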
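
&emsp;&emsp;As a minimal sketch of the scoring steps described above (row rescaling, clipping, then averaging the per-class losses), the following NumPy function works on a two-column array of `class_0`/`class_1` probabilities. The function name and the toy inputs are assumptions made for illustration; this is not the competition's scoring code.

```python
import numpy as np

def balanced_log_loss_two_column(y_true, probs, eps=1e-15):
    """Balanced log loss for an (n, 2) array of class_0/class_1 probabilities."""
    y_true = np.asarray(y_true)
    probs = np.asarray(probs, dtype=float)
    # Rescale each row so that it sums to 1, then clip away from 0 and 1.
    probs = probs / probs.sum(axis=1, keepdims=True)
    probs = np.clip(probs, eps, 1 - eps)
    # Mean negative log-probability of the true class, computed per class.
    loss_0 = -np.mean(np.log(probs[y_true == 0, 0]))
    loss_1 = -np.mean(np.log(probs[y_true == 1, 1]))
    # Both classes contribute equally, whatever their sizes.
    return (loss_0 + loss_1) / 2

y = np.array([0, 0, 0, 1])          # toy labels: three Class 0, one Class 1
uniform = np.full((4, 2), 0.5)      # the "0.5, 0.5" baseline prediction
print(balanced_log_loss_two_column(y, uniform))  # log(2), about 0.693
```

&emsp;&emsp;Because each class is averaged separately, the single Class 1 sample carries as much weight as the three Class 0 samples together.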


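#### Submission file

&emsp;&emsp;For each Id in the test set, you must predict a probability for each of the two classes. The file should contain a header and have the following format: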

> Id,class_0,class_1
> 
> 00eed32682bb,0.5,0.5
> 
> 010ebe33f668,0.5,0.5
> 
> 02fa521e1838,0.5,0.5
> 
> 040e15f562a2,0.5,0.5
> 
> 046e85c7cc7f,0.5,0.5
> 
> ...
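
&emsp;&emsp;A file in this format can be produced with pandas along the following lines. This is only a sketch: the two Ids are taken from the sample above, the 0.5 probabilities are placeholders, and the CSV is written to an in-memory buffer instead of a real `submission.csv`.

```python
import io
import pandas as pd

# Ids from the sample above; the probabilities are placeholder values.
ids = ["00eed32682bb", "010ebe33f668"]
p1 = [0.5, 0.5]

submission = pd.DataFrame({"Id": ids, "class_1": p1})
# class_0 is the complement of class_1 and must come second in the file.
submission.insert(1, "class_0", 1 - submission["class_1"])

buffer = io.StringIO()
submission.to_csv(buffer, index=False)  # index=False keeps the header as Id,class_0,class_1
csv_text = buffer.getvalue()
print(csv_text)
```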



In [147]:
import warnings
warnings.filterwarnings("ignore")

1. Read the data and do a simple exploratory data analysis (EDA)

In [148]:
import pandas as pd
import os

# Display option: set the maximum number of columns to None so that all columns are shown
pd.set_option('display.max_columns', None)

In [149]:
INPUT_PATH = 'input/'
# train = pd.read_csv(INPUT_PATH + 'train.csv',index_col=[0]) # (617, 58)
train = pd.read_csv(INPUT_PATH + 'train.csv') # (617, 58)
# test = pd.read_csv(INPUT_PATH+'test.csv', index_col=[0]) # (5, 57)
test = pd.read_csv(INPUT_PATH+'test.csv') # (5, 57)
greeks = pd.read_csv(INPUT_PATH+'greeks.csv') # (617, 6)
# sample = pd.read_csv(os.path.join(INPUT_PATH, 'sample_submission.csv'), index_col=[0])
# index_col=[0] would use the first column as the index instead of the default integer index
sample = pd.read_csv(os.path.join(INPUT_PATH, 'sample_submission.csv'))

EJ is a categorical column.

In [150]:
train['EJ'].dtype

dtype('O')

In [151]:
train['EJ'].value_counts()
# dtype: int64 means the counts are stored as 64-bit integers.

B    395
A    222
Name: EJ, dtype: int64

In [152]:
test['EJ'].value_counts()

A    5
Name: EJ, dtype: int64

EJ has no missing values, so its object values are replaced with the integers 0 and 1.

In [153]:
train['EJ'] = train['EJ'].replace({'A':0, 'B':1})
test['EJ'] = test['EJ'].replace({'A':0, 'B':1})

In [154]:
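# Join on Id: train keeps its default integer index, so concatenating it with
# greeks.set_index('Id') along axis=1 would not line the rows up.
df_train = train.merge(greeks, on='Id')
target_col = 'Class'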

In [155]:
import matplotlib.pyplot as plt
import seaborn as sns

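A kernel density estimate (KDE) plot visualizes the distribution of a continuous variable. It represents the variable's probability density function as a smooth curve, which gives an intuitive view of how the data are distributed.

The main uses of a KDE plot are:

1. Distribution shape: a KDE plot shows the shape of the distribution, for example whether it is symmetric and whether it has one peak or several. The height and variation of the curve reveal the broad features of the distribution.

2. Peaks and modes: a KDE plot shows the peaks (modes) of the distribution. The position and number of peaks indicate where the data are mainly concentrated.

3. Tails: a KDE plot also shows the tails of the distribution, that is, the regions far from the peaks. How steep the curve is near a peak and how far the tails extend indicate the spread of the data and the likelihood of outliers.

4. Comparing distributions: drawing the KDE curves of different groups or conditions on the same axes allows the distributions to be compared, which helps reveal differences or similarities between groups.

In short, a KDE plot gives a visual summary of a distribution's shape, peaks, and tails, and supports comparison between groups. This helps in understanding the data, spotting anomalies, and comparing groups.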
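
To make the idea concrete, here is a small from-scratch sketch of a one-dimensional Gaussian KDE in NumPy, evaluated for two groups so that their curves can be compared. The data are synthetic, and the function name and the fixed bandwidth of 0.4 are assumptions for illustration; plotting libraries such as seaborn choose the bandwidth automatically.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate on `grid`."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # One standardized distance per (grid point, sample) pair.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    # Average the kernels and divide by the bandwidth so the curve integrates to 1.
    return kernels.mean(axis=1) / bandwidth

rng = np.random.default_rng(0)
# Synthetic feature values for two groups (not competition data).
group_0 = rng.normal(loc=0.0, scale=1.0, size=300)
group_1 = rng.normal(loc=3.0, scale=1.0, size=300)

grid = np.linspace(-6, 9, 601)
density_0 = gaussian_kde_1d(group_0, grid, bandwidth=0.4)
density_1 = gaussian_kde_1d(group_1, grid, bandwidth=0.4)

step = grid[1] - grid[0]
print(density_0.sum() * step, density_1.sum() * step)      # both areas are close to 1
print(grid[density_0.argmax()], grid[density_1.argmax()])  # peak location of each group
```

Plotting `density_0` and `density_1` against `grid` gives the kind of per-class comparison that `sns.kdeplot(..., hue=target_col)` draws.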

In [156]:
# def plot_distribution(data):
#     num_cols = 4
#     num_rows = (len(train.columns) - 2) // num_cols
#     fig, axes = plt.subplots(
#         nrows=num_rows, ncols=num_cols, figsize=(16, 4*num_rows))
#     sns.set(font_scale=1.2, style='whitegrid')

#     for i, col_name in enumerate(train.columns):
#         if col_name not in ['Class', 'Id']:
#             ax = axes[i // num_cols, i % num_cols]
#             sns.kdeplot(data=data, x=col_name, hue=target_col,
#                         ax=ax, fill=True, alpha=0.5, linewidth=2)
#             ax.set_title(
#                 f'{col_name.title()} Distribution by {target_col.title()}', fontsize=14)
#             ax.set_xlabel(col_name.title(), fontsize=14)
#             ax.set_ylabel(target_col.title(), fontsize=14)
#             ax.tick_params(axis='both', which='major', labelsize=12)
#             ax.legend([1, 0], title=target_col.title(), fontsize=12)

#     plt.tight_layout()
#     plt.show()


# plot_distribution(df_train)


In [157]:
import numpy as np

In [158]:
# def plot_heatmap(df, title):
#     # Create a mask for the upper triangle (including the diagonal)
#     mask = np.zeros_like(df.astype(float).corr())
#     mask[np.triu_indices_from(mask)] = True
#     colormap = plt.cm.RdBu_r
#     plt.figure(figsize=(40, 40))
#     plt.title(f'{title} Correlation of Features', fontweight='bold', y=1.02, size=18)
#     sns.heatmap(df.astype(float).corr(), linewidths=0.1, vmax=1.0, vmin=-1.0,
#                 square=True, cmap=colormap, linecolor='white', annot=True, annot_kws={"size": 7, "weight": "bold"},
#                 mask=mask)
# plot_heatmap(train, title='Data')

###### Hierarchical clustering dendrogram

In [159]:
# from scipy.cluster import hierarchy
# from scipy.cluster.hierarchy import dendrogram, linkage
# from scipy.spatial.distance import squareform

# def hierarchical_clustering(data):
#     fig, ax = plt.subplots(1, 1, figsize=(18, 10), dpi=120)
#     correlations = data.corr()
#     converted_corr = 1 - np.abs(correlations)
#     # 'complete': complete-linkage (maximum-distance) merging strategy
#     Z = linkage(squareform(converted_corr), 'complete')
#     dn = dendrogram(Z, labels=data.columns, ax=ax, above_threshold_color='#ff0000', orientation='right')
#     hierarchy.set_link_color_palette(None)
#     plt.grid(axis='x')
#     plt.title('Hierarchical clustering, Dendrogram', fontsize=18, fontweight='bold')
#     plt.show()
# # axis=1 drops target_col from the columns rather than from the rows (axis=0)
# hierarchical_clustering(train.drop(target_col, axis=1))

In [160]:
# def plot_target_feature(df_train, target_col, figsize=(16,5), palette='colorblind', name='Train'):
    
#     df_train = df_train.fillna('Nan')
    
#     fig, ax = plt.subplots(1, 2, figsize = figsize)
#     ax = ax.flatten()

#     # Pie chart
#     ax[0].pie(
#         df_train[target_col].value_counts(), 
#         shadow=True, 
#         explode=[0.05] * len(df_train[target_col].unique()),
#         autopct='%1.f%%',
#         textprops={'size': 10, 'color': 'white'},
#         colors=sns.color_palette(palette, len(df_train[target_col].unique()))
#     )

#     # Bar plot
#     sns.countplot(
#         data=df_train, 
#         y=target_col, 
#         ax=ax[1], 
#         palette=palette
#     )
#     ax[1].yaxis.label.set_size(16)
#     plt.yticks(fontsize=12)
#     ax[1].set_xlabel('Count', fontsize=18)
#     plt.xticks(fontsize=12)

#     fig.suptitle(f'{target_col} in {name} Dataset', fontsize=18, fontweight='bold')
#     plt.tight_layout()

#     # Show the plot
#     plt.show()
    
# plot_target_feature(df_train, 'Alpha', figsize=(16,5), palette='colorblind', name='greeks')
# plot_target_feature(df_train, 'Beta', figsize=(16,5), palette='colorblind', name='greeks')
# plot_target_feature(df_train, 'Gamma', figsize=(16,5), palette='colorblind', name='greeks')
# plot_target_feature(df_train, 'Delta', figsize=(16,5), palette='colorblind', name='greeks')

In [161]:
# def plot_boxplot(df, hue, title='', drop_cols=[], n_cols=3):
#     sns.set_style('whitegrid')
#     cols = df.columns.drop([hue] + drop_cols)
#     n_rows = (len(cols) - 1) // n_cols + 1
#     fig, axes = plt.subplots(nrows=n_rows, ncols=n_cols, figsize=(20, 3*n_rows))
#     for i, var_name in enumerate(cols):
#         row = i // n_cols
#         col = i % n_cols
#         ax = axes[row, col]
#         sns.boxplot(data=df, x=hue, y=var_name, ax=ax, showmeans=True, 
#                     meanprops={"marker":"s","markerfacecolor":"white", "markeredgecolor":"blue", "markersize":"5"})
#         ax.set_title(f'{var_name} by {hue}')
#         ax.set_xlabel('')
#     fig.suptitle(f'{title} Boxplot by {hue}', fontweight='bold', fontsize=16, y=1.00)
#     plt.tight_layout()
#     plt.show()

# plot_boxplot(df_train, hue='Alpha', title='Data Set', drop_cols=['Beta', 'Gamma', 'Delta', 'Epsilon', target_col], n_cols=7)
# plot_boxplot(df_train, hue='Beta', title='Data Set', drop_cols=['Alpha', 'Gamma', 'Delta', 'Epsilon', target_col], n_cols=7)
# plot_boxplot(df_train, hue='Delta', title='Data Set', drop_cols=['Alpha', 'Beta', 'Gamma', 'Epsilon', target_col], n_cols=7)
# plot_boxplot(df_train, hue='Gamma', title='Data Set', drop_cols=['Alpha', 'Beta', 'Delta', 'Epsilon', target_col], n_cols=4)

In [182]:
feature_columns = [n for n in train.columns if n != 'Class' and n != 'Id']
x= train[feature_columns]
y = train['Class']

In [None]:
def balanced_log_loss(y_true, y_pred):
    # y_true: correct labels (0 or 1)
    # y_pred: predicted probabilities of class 1
    y_true = np.asarray(y_true)
    # number of observations in each class
    N_0 = np.sum(1 - y_true)
    N_1 = np.sum(y_true)
    # clip the predicted probabilities to avoid log(0)
    p_1 = np.clip(y_pred, 1e-15, 1 - 1e-15)
    p_0 = 1 - p_1
    # summed log loss for each class
    log_loss_0 = -np.sum((1 - y_true) * np.log(p_0))
    log_loss_1 = -np.sum(y_true * np.log(p_1))
    # average within each class, then average the two classes,
    # which matches the competition formula for any class ratio
    return (log_loss_0 / N_0 + log_loss_1 / N_1) / 2

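Cross-validation objects at different levels play different roles in model selection and evaluation. The inner cross-validation is used to select the model's sub-parameters or tune other configuration, the middle cross-validation is used to select hyperparameters, and the outer cross-validation is used to evaluate the model's overall performance and generalization. This layered structure helps select the best model configuration and hyperparameters more accurately and assess the model's predictive ability.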

In [162]:
from sklearn.model_selection import KFold as KF, GridSearchCV
cv_outer = KF(n_splits = 10, shuffle=True, random_state=42)
cv_inner = KF(n_splits = 5, shuffle=True, random_state=42)
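
The layering described above can be sketched with scikit-learn by wrapping a `GridSearchCV` (the tuning loop) inside `cross_val_score` (the evaluation loop). This is a two-level illustration only: the synthetic data, the `LogisticRegression` model, and the grid over `C` are assumptions chosen to keep the example small, not the setup used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Synthetic data standing in for the real features (illustrative only).
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=42)

inner = KFold(n_splits=3, shuffle=True, random_state=42)
outer = KFold(n_splits=5, shuffle=True, random_state=42)

# Inner loop: choose the regularization strength C on each outer training fold.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring="neg_log_loss",
    cv=inner,
)

# Outer loop: score the tuned model on folds that the search never saw.
outer_scores = cross_val_score(search, X_demo, y_demo, scoring="neg_log_loss", cv=outer)
print(len(outer_scores), -outer_scores.mean())  # number of outer folds, mean log loss
```

Keeping the tuning inside each outer training fold prevents the held-out fold from influencing the chosen hyperparameters.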

In [165]:
from sklearn.decomposition import PCA
from tabpfn import TabPFNClassifier
from sklearn.impute import SimpleImputer
import xgboost

In [166]:
class Ensemble():
    def __init__(self):
        # Fill missing values with the column median (strategy='median')
        self.imputer = SimpleImputer(missing_values=np.nan, strategy='median')
        self.classifiers = [xgboost.XGBClassifier(
        ), TabPFNClassifier(N_ensemble_configurations=64)]

    def fit(self, X, y):
        y = y.values
        # Map the class labels to the integers 0..n_classes-1
        unique_classes, y = np.unique(y, return_inverse=True)
        self.classes_ = unique_classes
        # EJ was already encoded as 0/1 when the data was loaded, so it is not
        # re-encoded here; re-encoding only in fit would leave predict_proba
        # working with a different encoding.
        X = self.imputer.fit_transform(X)
#         X = normalize(X,axis=0)
        for classifier in self.classifiers:
            if isinstance(classifier, TabPFNClassifier):
                classifier.fit(X, y, overwrite_warning=True)
            else:
                classifier.fit(X, y)

    def predict_proba(self, x):
        x = self.imputer.transform(x)
#         x = normalize(x,axis=0)
        # Average the class probabilities over all classifiers
        probabilities = np.stack([classifier.predict_proba(x)
                                 for classifier in self.classifiers])
        averaged_probabilities = np.mean(probabilities, axis=0)
        # Estimated number of class-0 instances and of all other instances
        class_0_est_instances = averaged_probabilities[:, 0].sum()
        others_est_instances = averaged_probabilities[:, 1:].sum()
        # Weighted probabilities based on class imbalance
        new_probabilities = averaged_probabilities * \
            np.array([[1/(class_0_est_instances if i == 0 else others_est_instances)
                     for i in range(averaged_probabilities.shape[1])]])
        # Renormalize so that each row sums to 1
        return new_probabilities / np.sum(new_probabilities, axis=1, keepdims=True)
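
The class reweighting at the end of `predict_proba` can be tried in isolation. The sketch below repeats the same arithmetic on a small made-up probability matrix; the helper name `rebalance` and the numbers are assumptions for illustration.

```python
import numpy as np

def rebalance(probabilities):
    """Divide class 0 by its estimated count and all other classes by theirs, then renormalize."""
    mass_0 = probabilities[:, 0].sum()        # estimated number of class-0 rows
    mass_rest = probabilities[:, 1:].sum()    # estimated number of all other rows
    weights = np.full(probabilities.shape[1], 1.0 / mass_rest)
    weights[0] = 1.0 / mass_0
    reweighted = probabilities * weights
    return reweighted / reweighted.sum(axis=1, keepdims=True)

# Made-up averaged probabilities for four samples and two classes.
avg = np.array([[0.9, 0.1],
                [0.8, 0.2],
                [0.7, 0.3],
                [0.4, 0.6]])
balanced = rebalance(avg)
print(balanced.round(3))
```

Here class 0 holds most of the predicted mass, so after reweighting every row gives more probability to the rarer class; for example, the row `[0.7, 0.3]` becomes `[0.5, 0.5]`.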

In [184]:
def training(model, x, y, y_meta):
    outer_results = list()
    best_loss = np.inf
    split = 0
    splits = cv_inner.get_n_splits()
    for train_idx, val_idx in tqdm(cv_inner.split(x), total=splits):
        split += 1
        x_train, x_val = x.iloc[train_idx], x.iloc[val_idx]
        # Fit on the multi-class labels (y_meta), validate on the binary labels (y)
        y_train, y_val = y_meta.iloc[train_idx], y.iloc[val_idx]
        model.fit(x_train, y_train)
        y_pred = model.predict_proba(x_val)
        # Collapse to two columns: P(class 0) and P(any other class)
        probabilities = np.concatenate(
            (y_pred[:, :1], np.sum(y_pred[:, 1:], 1, keepdims=True)), axis=1)
        p0 = probabilities[:, :1]
        # Snap confident predictions to exactly 0 or 1
        p0[p0 > 0.86] = 1
        p0[p0 < 0.14] = 0
        # Hard labels: predict class 1 when P(class 0) is below 0.5
        y_p = (p0[:, 0] < 0.5).astype(int)
        loss = balanced_log_loss(y_val, y_p)

        if loss < best_loss:
            # Note: this keeps a reference to the same object, which is refit on every split
            best_model = model
            best_loss = loss
            print('best_model_saved')
        outer_results.append(loss)
        print('>val_loss=%.5f, split = %d' % (loss, split))
    print('LOSS: %.5f' % (np.mean(outer_results)))
    return best_model

In [168]:
from datetime import datetime
# Convert the Epsilon dates to ordinal day numbers; 'Unknown' becomes NaN
times = greeks.Epsilon.copy()
times[greeks.Epsilon != 'Unknown'] = greeks.Epsilon[greeks.Epsilon != 'Unknown'].map(lambda x: datetime.strptime(x,'%m/%d/%Y').toordinal())
times[greeks.Epsilon == 'Unknown'] = np.nan
times = times.astype(float)
train_pred_and_time = pd.concat((train, times), axis=1)
# EJ in test was already encoded as 0/1 above, so it is used as-is
test_predictors = test[feature_columns].copy()
# Test rows have no date: give them one day after the latest training date
test_pred_and_time = np.concatenate((test_predictors, np.zeros((len(test_predictors), 1)) + train_pred_and_time.Epsilon.max() + 1), axis=1)
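
The date handling in this cell can be checked on a tiny example. The sketch below uses made-up `Epsilon` values in the same month/day/year format; only the conversion pattern mirrors the cell above.

```python
from datetime import datetime

import numpy as np
import pandas as pd

# Made-up Epsilon values, including one unknown date.
epsilon = pd.Series(["3/19/2019", "Unknown", "9/25/2020"])

known = epsilon != "Unknown"
# Start from all-NaN floats, then fill in the ordinal day number of each known date.
ordinals = pd.Series(np.nan, index=epsilon.index, dtype=float)
ordinals.loc[known] = epsilon[known].map(
    lambda s: datetime.strptime(s, "%m/%d/%Y").toordinal()
)

# Rows without a date (such as test rows) get one day past the latest known date.
test_time = ordinals.max() + 1
print(ordinals.tolist(), test_time)
```

The ordinal is a plain day count, so differences between two values are differences in days, and the missing entry stays `NaN` for the imputer to fill.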

In [169]:
from imblearn.over_sampling import RandomOverSampler
# Oversampling; random_state=42 makes the result reproducible
ros = RandomOverSampler(random_state=42)
# Build a class-balanced training set
train_ros, y_ros = ros.fit_resample(train_pred_and_time, greeks.Alpha)
print('Original dataset shape:{}'.format(greeks.shape))
print('greeks Alpha value counts:\n{}'.format(greeks.Alpha.value_counts()))
print('Resample dataset shape:{}'.format(train_ros.shape))
print('y_ros value counts:\n{}'.format(y_ros.value_counts()))

Original dataset shape:(617, 6)
greeks Alpha value counts:
A    509
B     61
G     29
D     18
Name: Alpha, dtype: int64
Resample dataset shape:(2036, 59)
y_ros value counts:
B    509
A    509
D    509
G    509
Name: Alpha, dtype: int64
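
To show where the 2036 rows come from, the following from-scratch sketch does random oversampling with NumPy: every class keeps its original rows and is topped up, by sampling with replacement, to the size of the largest class. It illustrates the idea only and is not imblearn's implementation; the function name is an assumption.

```python
import numpy as np

def random_oversample(labels, random_state=42):
    """Return row positions that bring every class up to the majority-class count."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(random_state)
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()
    positions = []
    for label in classes:
        idx = np.flatnonzero(labels == label)
        # Keep every original row, then draw the shortfall with replacement.
        extra = rng.choice(idx, size=target - len(idx), replace=True)
        positions.append(np.concatenate([idx, extra]))
    return np.concatenate(positions)

# Same class counts as greeks.Alpha above: A=509, B=61, G=29, D=18.
alpha = np.repeat(["A", "B", "G", "D"], [509, 61, 29, 18])
picked = random_oversample(alpha)
print(len(picked))  # 4 classes x 509 rows = 2036
print(dict(zip(*np.unique(alpha[picked], return_counts=True))))
```

Because the minority rows are exact duplicates, copies of the same sample can land in both the training and validation folds of a later cross-validation, which tends to make validation scores look better than they are.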


In [179]:
from tqdm.notebook import tqdm
x_ros = train_ros.drop(['Class', 'Id'],axis=1) # features, 2036 x 57
y_ = train_ros.Class # labels
yt = Ensemble()

In [185]:
m = training(yt, x_ros, y_, y_ros)
# y_.value_counts()/y_.shape[0]

[Output truncated: tqdm progress bar, followed by a printed preview of the x_train feature table (columns AB, AF, AH, AM, AR, AX, ...).]

In [174]:
# y_pred = m.predict_proba(test_pred_and_time)
# probabilities = np.concatenate((y_pred[:,:1], np.sum(y_pred[:,1:], 1, keepdims=True)), axis=1)
# p0 = probabilities[:,:1]
# p0[p0 > 0.74] = 1
# p0[p0 < 0.26] = 0

In [175]:
# submission = pd.DataFrame(test["Id"], columns=["Id"])
# submission["class_0"] = p0
# submission["class_1"] = 1 - p0
# submission.to_csv('submission.csv',index=False)
# submission_df = pd.read_csv('submission.csv')
# submission_df