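# Start here
### This is the main modeling program. Before running it, run the data preprocessing program "Text_pre_processor" to preprocess the competition data, including train.csv and validation.csv
### Written on 64-bit Windows 7
### Runtime environment: Python 3.8 with Jupyter; CUDA support is required
### Execution: run each of the code cells below in order (or use "Run All")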

## 1 Set up the environment

In [None]:
try:
    from simpletransformers.classification import ClassificationModel
except ImportError:
    !pip install simpletransformers
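
Because the notebook requires CUDA, it can help to confirm that a GPU is visible before building the model. The following is a minimal sketch, not part of the original notebook; the helper name `cuda_status` is hypothetical, and it assumes PyTorch is the backend installed alongside simpletransformers.

```python
def cuda_status():
    """Return a short description of CUDA availability (hypothetical helper)."""
    try:
        import torch  # assumption: PyTorch is installed in this environment
    except ImportError:
        return "torch is not installed"
    if torch.cuda.is_available():
        return f"CUDA available: {torch.cuda.get_device_name(0)}"
    return "CUDA not available; the model would need use_cuda=False"

print(cuda_status())
```

If the check reports that CUDA is not available, the `use_cuda=True` setting used in section 3 is unlikely to work on this machine.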

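## 2 Load the dataset (training and test data)
#### Note: the data file 'data/Summary-tra.csv' is the competition's raw data after preprocessing with "Text_pre_processor". Make sure the data has been preprocessed before running the code below. Loading the raw data directly instead should not affect the model's score much.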

In [1]:
def load_data(data_csv):
    from sklearn.model_selection import train_test_split
    import pandas as pd
    from sklearn.preprocessing import LabelEncoder
    import numpy as np
    import os
    import pickle
    df=pd.read_csv(data_csv,header=0)
    
    if 'id' in df.columns:
        del df['id']
    df.dropna(subset=['category'],inplace=True)
    # build the 'context' column from title + summary when it is missing
    if 'context' not in df.columns:
        df.dropna(thresh=2,inplace=True)
        df.fillna(value='', inplace=True)
        df['context']=df['title']+df['summary']
        del df['title']
        del df['summary']
    label_encoder='outputs/category_encoder.pkl'
    if not os.path.exists(label_encoder):
        if not os.path.exists('outputs'):
            os.makedirs('outputs')
        le = LabelEncoder()
        df['category'] = le.fit_transform(df['category'])
        output = open(label_encoder, 'wb')
        pickle.dump(le, output)
        output.close()
    else:
        # reuse the saved encoder (no refit) so label ids match the trained model
        with open(label_encoder, 'rb') as f:
            le = pickle.load(f)
        df['category'] = le.transform(df['category'])
    df.rename(columns={'context':'text','category':'labels'},inplace=True)
    X,Y=df['text'],df['labels']
    x_train, x_test, y_train,y_test = train_test_split(X, Y,test_size=0.3, random_state=10,stratify=Y)
    train_df=pd.concat([x_train,y_train],axis=1)
    test_df=pd.concat([x_test,y_test],axis=1)
    
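    # NOTE: the full dataset is returned as both the training and the
    # evaluation set, so the 70/30 split above is unused and the scores in
    # section 4 are measured on data the model was trained on.
    tra=pd.concat([X,Y],axis=1)
    return tra,tra
    # To evaluate on held-out data, return the split instead:
    # return train_df,test_df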
train_df,eval_df=load_data('data/Summary-tra.csv')  # load the training and evaluation data
print(train_df.shape,eval_df.shape)  # print the shapes, then a sample row
print(train_df.values[1])
#train_df.to_csv('data/split-train-title.csv',index=0)
#eval_df.to_csv('data/split-test-title.csv',index=0)

(102832, 2) (102832, 2)
['传热学主要介绍了导热、对流和辐射等课程内容的相关概念、定律、公式等。全书的重点是相关内容的例题详解和补充习题。带有答案的补充习题可帮助读者自我评估学习状况。'
 17]
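
The pickled category encoder only stays consistent with the trained model if it is reused without refitting. The sketch below illustrates why in plain Python. It mimics, rather than calls, scikit-learn's `LabelEncoder`, which assigns codes by the sorted order of the classes seen at fit time; the function names and category values are made up for illustration.

```python
def fit_classes(labels):
    """Return the sorted unique labels, mirroring what LabelEncoder stores in classes_."""
    return sorted(set(labels))

def transform(classes, labels):
    """Map each label to its index in the fitted class list."""
    index = {name: i for i, name in enumerate(classes)}
    return [index[name] for name in labels]

train_labels = ["art", "history", "physics", "history"]  # hypothetical categories
subset = ["history", "physics"]                          # later data missing "art"

saved = fit_classes(train_labels)                      # fitted once, then saved
codes_reused = transform(saved, subset)                # reuse the saved mapping
codes_refit = transform(fit_classes(subset), subset)   # refit on the subset

print(codes_reused)  # [1, 2]
print(codes_refit)   # [0, 1]
```

Refitting on data that lacks a category shifts every later code, so the ids would no longer line up with the labels the model was trained on.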


## 3 Build and train the model; for available models see https://huggingface.co/models

In [2]:
from simpletransformers.classification import ClassificationModel
import warnings

warnings.filterwarnings('ignore')
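# Create a ClassificationModel
# model_type can be one of ['bert', 'xlnet', 'xlm', 'roberta', 'distilbert'].
# To load a previously saved model instead of the default one, set model_name to the path of the directory containing the saved model.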
model = ClassificationModel('xlnet', 'hfl/chinese-xlnet-mid', num_labels=22,use_cuda=True,
                         args={'learning_rate':5e-5, 'num_train_epochs': 10,'use_early_stopping':True,'save_eval_checkpoints':False,
                               'reprocess_input_data': True, 'overwrite_output_dir': True,'save_model_every_epoch':False,'n_gpu':-1,
                              'max_seq_length':64, 'train_batch_size': 64,'best_model_dir':'outputs/final_best'})
# Train the model
model.train_model(train_df)

Some weights of the model checkpoint at hfl/chinese-xlnet-mid were not used when initializing XLNetForSequenceClassification: ['lm_loss.bias', 'lm_loss.weight']
- This IS expected if you are initializing XLNetForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLNetForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of XLNetForSequenceClassification were not initialized from the model checkpoint at hfl/chinese-xlnet-mid and are newly initialized: ['logits_proj.weight', 'sequence_summary.summary.bias', 'sequence_summary.summary.weight', 'logits_proj.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for pre

  0%|          | 0/102832 [00:00<?, ?it/s]

Epoch:   0%|          | 0/10 [00:00<?, ?it/s]

Running Epoch 0 of 10:   0%|          | 0/1607 [00:00<?, ?it/s]

Running Epoch 1 of 10:   0%|          | 0/1607 [00:00<?, ?it/s]

Running Epoch 2 of 10:   0%|          | 0/1607 [00:00<?, ?it/s]

Running Epoch 3 of 10:   0%|          | 0/1607 [00:00<?, ?it/s]

Running Epoch 4 of 10:   0%|          | 0/1607 [00:00<?, ?it/s]

Running Epoch 5 of 10:   0%|          | 0/1607 [00:00<?, ?it/s]

Running Epoch 6 of 10:   0%|          | 0/1607 [00:00<?, ?it/s]

Running Epoch 7 of 10:   0%|          | 0/1607 [00:00<?, ?it/s]

Running Epoch 8 of 10:   0%|          | 0/1607 [00:00<?, ?it/s]

Running Epoch 9 of 10:   0%|          | 0/1607 [00:00<?, ?it/s]

(16070, 0.15064805236368184)
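
The step counts in the training output follow directly from the dataset size and the batch size. As a quick check of the arithmetic, using only numbers that appear in this notebook:

```python
import math

train_samples = 102832   # rows reported by load_data
train_batch_size = 64    # 'train_batch_size' in the model args
epochs = 10              # 'num_train_epochs' in the model args

# The last batch of an epoch may be partial, hence the ceiling.
steps_per_epoch = math.ceil(train_samples / train_batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)  # 1607 16070
```

This matches the 1607 steps shown per epoch and the global step of 16070 in the tuple returned by `train_model`; the second value in that tuple appears to be the average training loss.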

## 4 Evaluate model performance

In [3]:
from sklearn.metrics import f1_score, accuracy_score

# Define one of the evaluation metrics: micro-averaged F1
def f1_multiclass(labels, preds):
    return f1_score(labels, preds, average='micro')

# Evaluate
result, model_outputs, wrong_predictions = model.eval_model(eval_df, f1=f1_multiclass, acc=accuracy_score)
print(result)

  0%|          | 0/102832 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/12854 [00:00<?, ?it/s]

{'mcc': 0.9993067969145869, 'f1': 0.9994165240392096, 'acc': 0.9994165240392096, 'eval_loss': 0.001716306218328234}
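
The `f1` and `acc` values above are identical. That is expected: for single-label multiclass classification, every wrong prediction counts as one false positive (for the predicted class) and one false negative (for the true class), so micro-averaged F1 reduces to accuracy. A small pure-Python sketch with made-up labels shows this; it reimplements the metric for illustration and is not the scikit-learn code.

```python
def micro_f1(labels, preds):
    """Micro-averaged F1: pool TP, FP and FN over all classes."""
    tp = fp = fn = 0
    for cls in set(labels) | set(preds):
        for y, p in zip(labels, preds):
            if p == cls and y == cls:
                tp += 1
            elif p == cls and y != cls:
                fp += 1
            elif y == cls and p != cls:
                fn += 1
    return 2 * tp / (2 * tp + fp + fn)

def accuracy(labels, preds):
    """Fraction of predictions that match the true label."""
    return sum(y == p for y, p in zip(labels, preds)) / len(labels)

labels = [0, 1, 2, 2, 1, 0]   # made-up true labels
preds = [0, 2, 2, 2, 1, 1]    # made-up predictions, 4 of 6 correct

print(micro_f1(labels, preds), accuracy(labels, preds))
```

A macro-averaged F1 (`average='macro'`) would weight each of the 22 classes equally and could therefore differ from accuracy on imbalanced data.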


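## 5 Load the model and predict labels for the test set
### Note: before loading the validation data, run the data preprocessing program on it. The validation set loaded here is "data/Summary-val.csv"
#### To load a previously saved model instead of the default one, set model_name to the path of the directory containing the saved model.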

In [4]:
import pandas as pd
from simpletransformers.classification import ClassificationModel
model = ClassificationModel('xlnet', 'outputs', num_labels=22,use_cuda=True,
                         args={'learning_rate':5e-5, 'num_train_epochs': 10,'use_early_stopping':True,
                               'reprocess_input_data': True, 'overwrite_output_dir': True,'save_model_every_epoch':False,
                              'max_seq_length': 64, 'train_batch_size': 64})

df=pd.read_csv('data/Summary-val.csv')
if 'context' not in df.columns:
    df.fillna(value='', inplace=True)
    df['context']=df['title']+df['summary']
pre_df=list(df['context'])
predictions, raw_outputs = model.predict(pre_df)
import pickle

# Map the predicted label ids back to category names with the saved encoder
with open('outputs/category_encoder.pkl', 'rb') as f:
    le = pickle.load(f)
predictions=pd.DataFrame(le.inverse_transform(predictions),columns=['label'])
save_df=pd.concat([df['id'],predictions],axis=1)
save_df.to_csv('outputs/submission.csv',index=False)
print('\adone!')  # '\a' rings the terminal bell

  0%|          | 0/34045 [00:00<?, ?it/s]

  0%|          | 0/4256 [00:00<?, ?it/s]

done!
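
The prediction cell keeps only `predictions` and ignores `raw_outputs`. Assuming `raw_outputs` holds one row of unnormalized class scores (logits) per input text, a softmax turns a row into probabilities, which gives a rough confidence for each predicted label. The sketch below uses made-up scores for three classes and plain Python rather than the model's actual output.

```python
import math

def softmax(logits):
    """Convert one row of raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

row = [2.0, 0.5, -1.0]  # made-up scores for 3 classes
probs = softmax(row)

# The predicted class is the index with the highest probability.
predicted = max(range(len(probs)), key=probs.__getitem__)
print(predicted, round(probs[predicted], 3))
```

Low-confidence rows found this way could be reviewed by hand before submitting.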
