## 2023 DataLab Cup 1: Predicting News Popularity (Text Feature Engineering)
##### Competition for CS565600 Deep Learning
* Team: 瑜旋學姊教我DL
* Members: 112062531 王興彥, 112062559 邱仁緯, 112062632 林沁璿
* Public: 0.59617
* Private: 0.59894

In [1]:
import os
import warnings
import pandas as pd

%matplotlib inline

warnings.filterwarnings("ignore")

os.makedirs("output", exist_ok=True)

df_train = pd.read_csv('csv/train.csv')
df_test = pd.read_csv('csv/test.csv')

## Data Preprocessing
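1. Use BeautifulSoup to parse the HTML page content so that individual features are easy to access.
2. Identify the features that can be extracted and that might affect the prediction.
    * title: the article title.
    * title_len: the length of the title.
    * topic: the article topics.
    * date_time: the publication time, split into its components with the dateutil parser. Some records have no time information.
        * Workaround: find a record with the same topic and popularity as the record that lacks time information, and use its time as the default value.
    * content_len: the length of the article.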
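
The workaround for missing publication times can be sketched as follows. This is only an illustration under stated assumptions: the toy DataFrame, its column names (`topic`, `popularity`, `date_time`), and the helper `fill_missing_time` are hypothetical and do not come from the competition data. The notebook itself hard-codes a single default timestamp rather than running a lookup like this.

```python
import pandas as pd
from dateutil import parser

# Hypothetical toy records; column names are assumptions for illustration.
df = pd.DataFrame({
    'topic': ['tech', 'tech', 'world'],
    'popularity': [1, 1, -1],
    'date_time': ['Fri, 15 Feb 2013 15:49:08', None, 'Wed, 19 Jun 2013 09:00:00'],
})

def fill_missing_time(frame):
    """Fill a missing date_time with the first known value from a record
    that shares the same topic and popularity."""
    filled = frame.copy()
    # 'first' returns the first non-null value within each group.
    fallback = frame.groupby(['topic', 'popularity'])['date_time'].transform('first')
    filled['date_time'] = frame['date_time'].fillna(fallback)
    return filled

df_filled = fill_missing_time(df)

# dateutil splits the timestamp into the components used as features.
parsed = parser.parse(df_filled.loc[1, 'date_time'])
print(parsed.year, parsed.month, parsed.day, parsed.isoweekday(), parsed.hour)
```

A record whose (topic, popularity) group has no known timestamp at all would stay missing in this sketch and would still need a global default.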
    

In [2]:
import re
from bs4 import BeautifulSoup
from dateutil import parser

def preprocessor(text):
    
    text = BeautifulSoup(text, 'html.parser')
    
    title = text.body.h1.string.lower().strip()
    title_len = len(title)   
    
    l_list = text.body.find('footer', {'class': 'article-topics'}).find_all('a')
    topic_list = [a.string.strip().lower() for a in l_list]
    topic = ' '.join([re.sub(r'\s+', '_', t) for t in topic_list])
    
    article_info = text.head.find('div', {'class': 'article-info'})
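    try:
        date_time = article_info.time['datetime']
    except (AttributeError, KeyError, TypeError):
        # No timestamp in the page: fall back to the default timestamp.
        date_time = 'Fri, 15 Feb 2013 15:49:08'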

    parse_time = parser.parse(date_time)
    year = parse_time.year
    month = parse_time.month
    day = parse_time.day
    weekday = parse_time.isoweekday()
    hour = parse_time.hour
    minute = parse_time.minute
    second = parse_time.second

    content = text.body.find('section', {'class': 'article-content'}).get_text()
    content_len = len(content)

    return title, title_len, topic, day, month, year, weekday, hour, minute, second, content_len

In [3]:
feature_train = []
feature_test = []

for text in df_train['Page content']:
    feature_train.append(preprocessor(text))
for text in df_test['Page content']:
    feature_test.append(preprocessor(text))

df_train_new = pd.DataFrame(feature_train,
    columns=['Title', 'Title_Len', 'Topic', 'Day', 'Month', 'Year',
             'Weekday', 'Hour', 'Minute', 'Second', 'Content_Len']
)
df_test_new = pd.DataFrame(feature_test,
    columns=['Title', 'Title_Len', 'Topic', 'Day', 'Month', 'Year',
             'Weekday', 'Hour', 'Minute', 'Second', 'Content_Len']
)

df_train_new.to_csv('csv/df_train_new.csv', index=False)
df_test_new.to_csv('csv/df_test_new.csv', index=False)

In [4]:
df_train_new = pd.read_csv('csv/df_train_new.csv')
df_test_new = pd.read_csv('csv/df_test_new.csv')

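3. Fill in default values for records with missing information.
    * Topic: fill with None.
    * Content_Len: only one record is affected; we looked up the original article directly and found it is roughly 100 characters long.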

In [4]:
df_train_sel = df_train_new.drop(columns=['Title', 'Minute', 'Second'])
df_test_sel = df_test_new.drop(columns=['Title', 'Minute', 'Second'])


df_train_sel['Topic'] = df_train_sel['Topic'].fillna('None')
df_train_sel['Content_Len'] = df_train_sel['Content_Len'].fillna(100)
df_test_sel['Topic'] = df_test_sel['Topic'].fillna('None')
df_test_sel['Content_Len'] = df_test_sel['Content_Len'].fillna(100)
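
One likely reason Topic contains missing values after reloading the CSV is that an empty string written by `to_csv` is read back as NaN by `read_csv` under its default settings. That this is the cause here is an assumption, not something the notebook states. The toy DataFrame below is made up for illustration.

```python
import io
import pandas as pd

# Hypothetical toy data: the second article has no topic (empty string).
toy = pd.DataFrame({'Topic': ['tech apple', ''], 'Content_Len': [1200, 800]})

# Round-trip through CSV in memory instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# The empty topic is now reported as missing.
print(reloaded['Topic'].isna().tolist())

# Same default as in the notebook: replace missing topics with 'None'.
reloaded['Topic'] = reloaded['Topic'].fillna('None')
print(reloaded['Topic'].tolist())
```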

4. Apply word stemming to Topic with the Natural Language Toolkit (NLTK).

In [5]:
import numpy as np
from nltk.stem.porter import PorterStemmer

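def tokenizer_stem(text):
    # ColumnTransformer passes each row as a 1-element array when the
    # column is selected with a list, so unwrap it first.
    if isinstance(text, np.ndarray):
        text = text[0]
    porter = PorterStemmer()
    return [porter.stem(word) for word in re.split(r'\s+', text.strip())]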

5. Use ColumnTransformer to transform the Topic column so that it can be used in the later pipeline.

    <https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html>

In [6]:
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer

trans = ColumnTransformer(
    [('Topic', CountVectorizer(tokenizer=tokenizer_stem, lowercase=False), [1])], n_jobs=-1, remainder='passthrough')
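
To make the effect of this transformer concrete, here is a minimal sketch on a toy array. The data, the tokenizer `split_tokens` (a stemming-free stand-in for `tokenizer_stem`), and the `sparse_threshold=0` and `token_pattern=None` settings are assumptions added for illustration; they are not part of the notebook's configuration.

```python
import re
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer

def split_tokens(doc):
    # Selecting the column with a list ([1]) gives CountVectorizer a 2-D
    # slice, so each "document" arrives as a 1-element array.
    if isinstance(doc, np.ndarray):
        doc = doc[0]
    return re.split(r'\s+', doc.strip())

# Hypothetical rows: [Title_Len, Topic, Hour]
X_toy = np.array([[12, 'tech apple', 9],
                  [30, 'world', 17],
                  [25, 'tech', 21]], dtype=object)

ct = ColumnTransformer(
    [('topic', CountVectorizer(tokenizer=split_tokens, lowercase=False,
                               token_pattern=None), [1])],
    remainder='passthrough',   # keep the numeric columns unchanged
    sparse_threshold=0,        # force a dense result so it is easy to print
)
out = ct.fit_transform(X_toy)

vocab = sorted(ct.named_transformers_['topic'].vocabulary_)
print(vocab)      # one count column per topic token
print(out.shape)  # token columns first, then the passthrough columns
print(out[0])
```

`lowercase=False` matters here: with lowercasing enabled, CountVectorizer would call `.lower()` on the 1-element array before the tokenizer gets a chance to unwrap it.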

6. Split the training data into a training set and a validation set, holding out 30% of the data for validation.

In [7]:
from sklearn.model_selection import train_test_split

X = df_train_sel.values
y = (df_train['Popularity'].values == 1).astype(int)

X_test = df_test_sel.values

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=0)

In [8]:
from sklearn.model_selection import cross_validate
from sklearn.metrics import roc_auc_score

def training(clf):
    cv = cross_validate(clf, X, y, scoring='roc_auc', return_train_score=True, return_estimator=True)
    print('train score: {:.5f}'.format(np.mean(cv['train_score'])))
    print('valid score: {:.5f}'.format(np.mean(cv['test_score'])))

    clf.fit(X_train, y_train)
    print('train score: {:.5f}'.format(roc_auc_score(y_train, clf.predict_proba(X_train)[:, 1])))
    print('valid score: {:.5f}'.format(roc_auc_score(y_valid, clf.predict_proba(X_valid)[:, 1])))
    
    return clf
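
Both evaluations in `training` use ROC AUC, which depends only on how the predicted scores rank positive examples against negative ones. That is why the positive-class probability from `predict_proba(...)[:, 1]` is passed in rather than hard class labels. A small sketch with made-up labels and scores:

```python
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted probabilities for the positive class.
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]

# 3 of the 4 (positive, negative) pairs are ranked correctly.
auc = roc_auc_score(y_true, y_prob)
print(auc)  # 0.75

# Rescaling the scores keeps their order, so the AUC does not change.
auc_rescaled = roc_auc_score(y_true, [p / 10 for p in y_prob])
print(auc_rescaled)  # 0.75
```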

## Model

1. We compared several classifiers (RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier, XGBClassifier, CatBoostClassifier, LGBMClassifier, VotingClassifier) and picked the one with the best result, which was LGBMClassifier.

2. We searched the hyperparameters with GridSearchCV, but the results were unsatisfactory, so we chose learning_rate and n_estimators by hand.
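
For reference, a grid search over `learning_rate` and `n_estimators` inside a pipeline might look like the sketch below. It is not the grid the team actually ran: the synthetic data, the grid values, the `StandardScaler` step, and the use of scikit-learn's `GradientBoostingClassifier` in place of `LGBMClassifier` are all assumptions chosen to keep the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features and labels.
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

pipe = Pipeline([('scale', StandardScaler()),
                 ('clf', GradientBoostingClassifier(random_state=0))])

# Parameters of a pipeline step are addressed as '<step name>__<parameter>'.
param_grid = {
    'clf__learning_rate': [0.01, 0.1],
    'clf__n_estimators': [20, 50],
}

search = GridSearchCV(pipe, param_grid, scoring='roc_auc', cv=3)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(round(search.best_score_, 3))
```

The score reported by `best_score_` is the mean cross-validated AUC, so it is comparable to the cross-validation "valid score" printed by `training`, not to the hold-out score.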

In [11]:
from sklearn.pipeline import Pipeline
from lightgbm import LGBMClassifier

lgbm = Pipeline([
    ('ct', trans),
    ('clf', LGBMClassifier(n_jobs=-1, verbose=0, random_state=0,
                           learning_rate=0.01, n_estimators=270)),
])
lgbm = training(lgbm)

train score: 0.67383
valid score: 0.60283
train score: 0.68009
valid score: 0.59993


In [16]:
y_score = lgbm.predict_proba(X_test)[:, 1]
df_pred = pd.DataFrame({'Id': df_test['Id'], 'Popularity': y_score})
df_pred.to_csv('output/test_pred.csv', index=False)

## Conclusion

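This Text Feature Engineering competition taught us a lot, from basic data processing to the use of several libraries. The biggest takeaway was finally understanding what the instructor said at the beginning: good data has a very large impact on a model, so data preprocessing is a crucial part of ML. We felt this clearly when selecting features and doing feature extraction. More data is not always better; what matters is choosing representative features and engineering them well.

Beyond the points above, one more thing to reflect on is that we were too fixated on the public score during the competition. Although the final private score turned out even better than the public one, that fixation kept us from selecting our best result, 0.60232.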