# DataLab Cup 1: Predicting News Popularity
Team Name: 沒liao恩宇  
Team Member:  110062802 呂宸漢 110062552 周伯宇 110062560 林子鵑  


## Data Loader
Load the training and testing data.

In [1]:
import pandas as pd

df_train = pd.read_csv('./dataset/train.csv')
df_test = pd.read_csv('./dataset/test.csv')


## Preprocessing: Data Cleaning
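BeautifulSoup is a Python library for parsing HTML and XML documents. We use it to pull the relevant tags out of the raw data (news content) and build the fields used in this competition.  
Each field is extracted and cleaned as follows:  
- **title**: the article title 
- **author**: the article author (noise such as `by` and `&` is removed, and the lower-cased author name is kept) 
- **channel**: the channel the article belongs to (the `data-channel` attribute of `article`)
- **topic**: the topics the article belongs to (the links in the `article-topics` footer)
- **date_time**: the publication date and time (weekday, day, month, year, hour, minute, second); if the article has no such information, it defaults to 'Wed, 10 Oct 2014 15:00:43' (note that 10 Oct 2014 was in fact a Friday, so the weekday in this default does not match its date)
- **see_also**: the number of "see also" markers in the article
- **content_len**: the length of the article text in characters
- **num_image**: the number of images in the article (`find_all` counts the occurrences of the HTML tag `img`)
- **num_a**: the number of hyperlinks to other pages, files, email addresses, or other URLs (`find_all` counts the occurrences of the HTML tag `a`)
    
After preprocessing with BeautifulSoup, we obtain 15 columns in total (see the table below).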
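
As a minimal sketch of the author clean-up described in the list above, the snippet below covers only the simple case where the byline contains no commas. The function name `clean_author` and the sample bylines are illustrative assumptions, not part of the original notebook.

```python
import re


def clean_author(raw):
    """Normalise a byline into space-separated, underscore-joined author names.

    Simplified sketch: handles bylines without commas only.
    """
    # Collapse runs of whitespace and lower-case the byline.
    author = re.sub(r'\s+', ' ', raw.strip().lower())
    # Drop a leading "by ".
    if author.startswith('by '):
        author = author[3:]
    # Treat " and " and HTML entities such as "&amp;" as a plain "&" separator.
    author = re.sub(r'&.*;', '&', author.replace(' and ', ' & '))
    # One token per author, with the words of each name joined by "_".
    names = re.split(r'\s*&\s*', author)
    return ' '.join(re.sub(r'\s+', '_', name) for name in names)


print(clean_author('By  Sam Laird and Christina Warren'))
# -> sam_laird christina_warren
```

Joining each name with `_` keeps a full name together as one token when the column is later split on whitespace.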


In [2]:
import re
from bs4 import BeautifulSoup


def preprocessor(text):
    soup = BeautifulSoup(text, 'html.parser')

    # find title
    title = soup.body.h1.string.strip().lower()

    # find author
    article_info = soup.head.find('div', {'class': 'article-info'})
    author_name = article_info.find('span', {'class': 'author_name'})
    if author_name is not None:
        author = author_name.get_text()
    elif article_info.span is not None:
        author = article_info.span.string
    else:
        author = article_info.a.string

    # clean author: collapse whitespace, drop a leading "by ", unify separators to "&"
    author = re.sub(r'\s+', ' ', author.strip().lower())
    if author.startswith('by '):
        author = author[3:]
    author = re.sub('&.*;', '&', author.replace(' and ', ' & '))

    # split into individual authors and join the words of each name with "_"
    author_list = []
    if author.find(',') == -1:
        author_list = re.split(r'\s*&\s*', author)
    else:
        authors = re.split(r'\s*,\s*', author)
        if authors[-1].find('&') == -1 or len(authors[-1].split('&')[-1].strip().split()) > 3:
            author_list.append(authors[0])
        else:
            author_list += authors[:-1]
            author_list += re.split(r'\s*&\s*', authors[-1])
    author = ' '.join([re.sub(r'\s+', '_', a) for a in author_list])

    # find channel
    channel = soup.body.article['data-channel'].strip().lower()

    # find topic
    a_list = soup.body.find('footer', {'class': 'article-topics'}).find_all('a')
    topic_list = [a.string.strip().lower() for a in a_list]
    topic = ' '.join([re.sub(r'\s+', '_', t) for t in topic_list])

    # find datetime (fall back to a fixed default when the <time> tag or its attribute is missing)
    try:
        date_time = article_info.time['datetime']
    except (TypeError, KeyError):
        date_time = 'Wed, 10 Oct 2014 15:00:43'
    match_obj = re.search(r'(\w+),\s+(\d+)\s+(\w+)\s+(\d+)\s+(\d+):(\d+):(\d+)', date_time)
    day, date, month, year, hour, minute, second = match_obj.groups()
    day, month = day.lower(), month.lower()

    # find content
    content = soup.body.find('section', {'class': 'article-content'}).get_text()
    content_len = len(content)

    # find see also
    num_see_also = len(re.findall('see also', content.lower()))

    # find image
    num_image = len(soup.body.find_all('img'))

    # find a
    num_a = len(soup.body.find_all('a'))

    return title, author, channel, topic, day, date, month, year, \
        hour, minute, second, content_len, num_see_also, num_image, num_a


feature_list = []
for text in df_train['Page content']:
    feature_list.append(preprocessor(text))
for text in df_test['Page content']:
    feature_list.append(preprocessor(text))

df_combine = pd.DataFrame(
    feature_list,
    columns=['Title', 'Author', 'Channel', 'Topic', 'Day', 'Date', 'Month', 'Year',
             'Hour', 'Minute', 'Second', 'Content_Len', 'Num_See_Also', 'Num_Image', 'Num_A']
)


In [3]:
df_combine.head()


| | Title | Author | Channel | Topic | Day | Date | Month | Year | Hour | Minute | Second | Content_Len | Num_See_Also | Num_Image | Num_A |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | nasa's grand challenge: stop asteroids from de... | clara_moskowitz | world | asteroid asteroids challenge earth space u.s. ... | wed | 19 | jun | 2013 | 15 | 4 | 30 | 3591 | 4 | 1 | 21 |
| 1 | google's new open source patent pledge: we won... | christina_warren | tech | apps_and_software google open_source opn_pledg... | thu | 28 | mar | 2013 | 17 | 40 | 55 | 1843 | 1 | 1 | 16 |
| 2 | ballin': 2014 nfl draft picks get to choose th... | sam_laird | entertainment | entertainment nfl nfl_draft sports television | wed | 7 | may | 2014 | 19 | 15 | 20 | 6646 | 1 | 1 | 9 |
| 3 | cameraperson fails deliver slapstick laughs | sam_laird | watercooler | sports video videos watercooler | fri | 11 | oct | 2013 | 2 | 26 | 50 | 1821 | 1 | 0 | 11 |
| 4 | nfl star helps young fan prove friendship with... | connor_finnegan | entertainment | entertainment instagram instagram_video nfl sp... | thu | 17 | apr | 2014 | 3 | 31 | 43 | 8921 | 1 | 51 | 14 |


## Preprocessing: Feature Extraction
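Next, we perform **Feature Extraction** on the fields produced by preprocessing for each article.
The time information is handled as follows:  
Day (Mon, Tue, Wed, Thu, Fri, Sat, Sun) is mapped to the integers 1-7;
Month (Jan, Feb, Mar, ..., Nov, Dec) is mapped to the integers 1-12.

Finally, after experimenting with different feature combinations, we found that Minute and Second from the time information, together with Title, Channel, Num_See_Also, Num_Image, and Num_A, did not improve the overall score. We therefore dropped these features, and the remaining columns form the final training input (see the table below).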
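
The date handling above can be sketched on a single timestamp string. This is only an illustration: the helper name `encode_datetime`, the sample timestamp, and the conversion of the numeric parts to `int` are assumptions added here, while the regular expression and the two mappings follow the notebook.

```python
import re

DAY_MAP = {'mon': 1, 'tue': 2, 'wed': 3, 'thu': 4, 'fri': 5, 'sat': 6, 'sun': 7}
MONTH_MAP = {'jan': 1, 'feb': 2, 'mar': 3, 'apr': 4, 'may': 5, 'jun': 6,
             'jul': 7, 'aug': 8, 'sep': 9, 'oct': 10, 'nov': 11, 'dec': 12}

# Weekday, day of month, month name, year, then hh:mm:ss.
PATTERN = re.compile(r'(\w+),\s+(\d+)\s+(\w+)\s+(\d+)\s+(\d+):(\d+):(\d+)')


def encode_datetime(date_time):
    """Parse a timestamp and keep only the time features used for training."""
    day, date, month, year, hour, _minute, _second = PATTERN.search(date_time).groups()
    return {
        'Day': DAY_MAP[day.lower()],        # weekday name -> 1..7
        'Date': int(date),
        'Month': MONTH_MAP[month.lower()],  # month name -> 1..12
        'Year': int(year),
        'Hour': int(hour),
    }


print(encode_datetime('Wed, 19 Jun 2013 15:04:30'))
# -> {'Day': 3, 'Date': 19, 'Month': 6, 'Year': 2013, 'Hour': 15}
```

Minute and Second are parsed but discarded, mirroring the decision to drop them from the final feature set.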

In [4]:
day_map = {'mon': 1, 'tue': 2, 'wed': 3,
           'thu': 4, 'fri': 5, 'sat': 6, 'sun': 7}
month_map = {'jan': 1, 'feb': 2, 'mar': 3, 'apr': 4, 'may': 5, 'jun': 6,
             'jul': 7, 'aug': 8, 'sep': 9, 'oct': 10, 'nov': 11, 'dec': 12}

df_copy = df_combine.copy()
df_copy['Day'] = df_copy['Day'].map(day_map)
df_copy['Month'] = df_copy['Month'].map(month_map)

df_copy = df_copy.drop(columns=['Title', 'Channel', 'Minute', 'Second', 'Num_See_Also', 'Num_Image', 'Num_A'])


In [5]:
df_copy.head()


| | Author | Topic | Day | Date | Month | Year | Hour | Content_Len |
|---|---|---|---|---|---|---|---|---|
| 0 | clara_moskowitz | asteroid asteroids challenge earth space u.s. ... | 3 | 19 | 6 | 2013 | 15 | 3591 |
| 1 | christina_warren | apps_and_software google open_source opn_pledg... | 4 | 28 | 3 | 2013 | 17 | 1843 |
| 2 | sam_laird | entertainment nfl nfl_draft sports television | 3 | 7 | 5 | 2014 | 19 | 6646 |
| 3 | sam_laird | sports video videos watercooler | 5 | 11 | 10 | 2013 | 2 | 1821 |
| 4 | connor_finnegan | entertainment instagram instagram_video nfl sp... | 4 | 17 | 4 | 2014 | 3 | 8921 |


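## Preprocessing: Tokenization and Lemmatization
For the text fields (Topic, Author), we tokenize the text into individual words and use CountVectorizer to turn the word counts into features. Author names are split on whitespace only, while Topic words are additionally passed through WordNetLemmatizer, which reduces each word to its base form and is more precise than stemming.  
  
ColumnTransformer applies these steps to the specified columns, so that each model can be trained on a different set of features.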
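
To make the tokenize-then-count idea concrete, here is a small standard-library sketch. It reproduces only the regular-expression clean-up applied to Topic and uses a hand-built count matrix as a stand-in for CountVectorizer; the lemmatization step is left out, and the names `normalize_topics` and `docs` are illustrative assumptions.

```python
import re
from collections import Counter


def normalize_topics(text):
    """Apply the regex clean-up used for Topic, without the lemmatizer."""
    text = re.sub(r"(\w+)'\w+", r'\1', text)  # "nasa's" -> "nasa"
    text = re.sub(r'\.', '', text)            # "u.s." -> "us"
    text = re.sub(r'[^\w]+', ' ', text)       # other punctuation -> space
    return re.split(r'\s+', text.strip())


docs = ["asteroid asteroids u.s. nasa's", "asteroid open_source"]
tokens = [normalize_topics(d) for d in docs]

# Vocabulary = sorted set of all tokens; each row counts token occurrences.
vocab = sorted({t for doc in tokens for t in doc})
counts = [[Counter(doc)[word] for word in vocab] for doc in tokens]

print(vocab)   # -> ['asteroid', 'asteroids', 'nasa', 'open_source', 'us']
print(counts)  # -> [[1, 1, 1, 0, 1], [1, 0, 0, 1, 0]]
```

Without lemmatization, `asteroid` and `asteroids` occupy separate columns; a lemmatizer would typically merge such plural forms into one feature.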

In [6]:
import numpy as np
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')
nltk.download('omw-1.4')


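def tokenizer(text):
    # a column selected with a list such as [0] arrives as one-element arrays
    if isinstance(text, np.ndarray):
        text = text[0]
    return re.split(r'\s+', text.strip())


def tokenizer_wnl(text):
    if isinstance(text, np.ndarray):
        text = text[0]
    # drop possessive/contraction suffixes, e.g. "nasa's" -> "nasa"
    text = re.sub(r"(\w+)'\w+", r'\1', text)
    # remove periods so that "u.s." becomes "us"
    text = re.sub(r'\.', '', text)
    # replace any remaining non-word characters with a space
    text = re.sub(r'[^\w]+', ' ', text)
    wnl = WordNetLemmatizer()
    return [wnl.lemmatize(s) for s in re.split(r'\s+', text.strip())]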


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Eric\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\Eric\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [7]:
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer

trans_forest = ColumnTransformer(
    [('Author', CountVectorizer(tokenizer=tokenizer, lowercase=False), [0]),
     ('Topic', CountVectorizer(tokenizer=tokenizer_wnl, lowercase=False), [1])],
    n_jobs=-1,
    remainder='passthrough'
)

trans_other = ColumnTransformer(
    [('Author', 'drop', [0]),
     ('Topic', CountVectorizer(tokenizer=tokenizer_wnl, lowercase=False), [1])],
    n_jobs=-1,
    remainder='passthrough'
)


## Model Training and Selection
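Once the features are ready, we move on to model training, using experiments to choose good hyperparameters and model combinations before predicting on the testing set.
We first split the training set into 80% training and 20% validation data, then tune the hyperparameters and check whether training is overfitting.
  
The models tried in this competition are LightGBM Classifier, Random Forest Classifier, XGBoost Classifier, and CatBoost Classifier. We experimented with different combinations of these models and compared the results. The final model combines LightGBM, Random Forest, and CatBoost in a Voting Classifier, which is then applied to the testing data to obtain the predictions.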
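
The train and valid scores in this section are ROC AUC values (the `training` helper in this notebook calls `roc_auc_score`). As a rough, standard-library sketch of what that number measures, the hypothetical function `roc_auc` below computes the fraction of (positive, negative) pairs that the model ranks correctly, counting ties as half. It is meant as an illustration, not a replacement for the scikit-learn implementation.

```python
def roc_auc(y_true, y_score):
    """ROC AUC as the probability that a positive outranks a negative."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    # Count correctly ordered pairs; a tie contributes half a pair.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Three of the four positive/negative pairs are ordered correctly.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # -> 0.75
# A model that ranks every positive above every negative scores 1.0.
print(roc_auc([0, 0, 1, 1], [0.1, 0.2, 0.7, 0.9]))   # -> 1.0
```

Seen this way, a train score of 1.0 next to a valid score near 0.59 means the training examples are ranked perfectly while unseen ones are ranked only somewhat better than chance (0.5), which is the overfitting pattern to watch for.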
  
The features and models behind the final prediction are detailed below:  
- **LightGBM (learning_rate=0.009, n_estimators=300)**  
features: Topic, Day, Date, Month, Year, Hour, Content_Len (Topic is processed by CountVectorizer with tokenizer_wnl)  
[train score] 0.67156  
[valid score] 0.59803  

- **Random Forest (n_estimators=300)**  
features: Author, Topic, Day, Date, Month, Year, Hour, Content_Len (Author and Topic are processed by CountVectorizer with tokenizer and tokenizer_wnl, respectively)  
[train score] 1.00000  
[valid score] 0.58640  
  
- **CatBoost (n_estimators=290, learning_rate=0.06)**  
features: Topic, Day, Date, Month, Year, Hour, Content_Len (Topic is processed by CountVectorizer with tokenizer_wnl)  
[train score] 0.68520  
[valid score] 0.59027
  
- **Voting (voting='soft', weights=[1, 0.2, 0.05])**  
Soft voting over the three models above  
[train score] 0.93632  
[valid score] 0.59879  
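
To show what the weights `[1, 0.2, 0.05]` mean under soft voting, the sketch below computes the weighted average of the positive-class probabilities from three models. The function name `soft_vote` and the sample probabilities are made up for illustration; only the weights come from the notebook.

```python
def soft_vote(probas, weights):
    """Weighted average of per-model probabilities for one sample."""
    total = sum(weights)
    return sum(p * w for p, w in zip(probas, weights)) / total


weights = [1, 0.2, 0.05]  # LightGBM, Random Forest, CatBoost

# Hypothetical positive-class probabilities from the three models.
probas = [0.6, 0.9, 0.4]
print(round(soft_vote(probas, weights), 4))  # -> 0.64

# Share of the final vote held by each model.
shares = [round(w / sum(weights), 2) for w in weights]
print(shares)  # -> [0.8, 0.16, 0.04]
```

With these weights LightGBM contributes 80% of the final probability, so the other two models act mainly as small corrections.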


In [8]:
from sklearn.model_selection import train_test_split

X_train_raw = df_copy.values[:df_train.shape[0]]
y_train_raw = (df_train['Popularity'].values == 1).astype(int)
X_test = df_copy.values[df_train.shape[0]:]

X_train, X_valid, y_train, y_valid = train_test_split(
    X_train_raw, y_train_raw, test_size=0.2, random_state=0)


In [9]:
from sklearn.model_selection import cross_validate
from sklearn.metrics import roc_auc_score


def training(clf):
    cv_results = cross_validate(clf, X_train_raw, y_train_raw,
                                scoring='roc_auc', return_train_score=True, return_estimator=True)
    print('train score: {:.5f} (+/-{:.5f})'.format(
        np.mean(cv_results['train_score']), np.std(cv_results['train_score'])))
    print('valid score: {:.5f} (+/-{:.5f})'.format(
        np.mean(cv_results['test_score']), np.std(cv_results['test_score'])))

    clf.fit(X_train, y_train)
    print('train score: {:.5f}'.format(roc_auc_score(
        y_train, clf.predict_proba(X_train)[:, 1])))
    print('valid score: {:.5f}'.format(roc_auc_score(
        y_valid, clf.predict_proba(X_valid)[:, 1])))
    return clf


### LightGBM Classifier
[Reference](https://lightgbm.readthedocs.io/en/v3.3.3/pythonapi/lightgbm.LGBMClassifier.html#lightgbm.LGBMClassifier)

In [10]:
from sklearn.pipeline import Pipeline
from lightgbm import LGBMClassifier

lgbm = Pipeline([('ct', trans_other),
                 ('clf', LGBMClassifier(random_state=0, learning_rate=0.009, n_estimators=300))])
lgbm = training(lgbm)


train score: 0.67002 (+/-0.00237)
valid score: 0.60292 (+/-0.00819)
train score: 0.67156
valid score: 0.59803


### Random Forest Classifier
[Reference](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)

In [11]:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

forest = Pipeline([('ct', trans_forest),
                   ('clf', RandomForestClassifier(n_jobs=-1, random_state=0, n_estimators=300))])
forest = training(forest)


train score: 1.00000 (+/-0.00000)
valid score: 0.58562 (+/-0.01088)
train score: 1.00000
valid score: 0.58640


### XGBoost Classifier
[Reference](https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.XGBClassifier)

In [12]:
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier

xgboost = Pipeline([('ct', trans_other),
                    ('clf', XGBClassifier(verbosity=0, n_estimators=300))])
xgboost = training(xgboost)


train score: 0.81768 (+/-0.00317)
valid score: 0.58165 (+/-0.01044)
train score: 0.81421
valid score: 0.56472


### CatBoost Classifier
[Reference](https://catboost.ai/en/docs/concepts/python-reference_catboostclassifier)

In [13]:
from sklearn.pipeline import Pipeline
from catboost import CatBoostClassifier

catboost = Pipeline([('ct', trans_other),
                     ('clf', CatBoostClassifier(verbose=False, eval_metric='AUC', n_estimators=290, learning_rate=0.06))])
catboost = training(catboost)


train score: 0.68785 (+/-0.00297)
valid score: 0.59684 (+/-0.00972)
train score: 0.68520
valid score: 0.59027


### Voting Classifier
[Reference](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html)

In [14]:
from sklearn.ensemble import VotingClassifier
voting = VotingClassifier([('lgbm', lgbm), ('forest', forest), ('catboost', catboost)],
                          voting='soft', weights=[1, 0.2, 0.05])
voting = training(voting)


train score: 0.93903 (+/-0.00294)
valid score: 0.60309 (+/-0.00960)
train score: 0.93632
valid score: 0.59879


## Testing Data Prediction

In [15]:
best_model = voting

y_score = best_model.predict_proba(X_test)[:, 1]
df_pred = pd.DataFrame({'Id': df_test['Id'], 'Popularity': y_score})
df_pred.to_csv('test_pred.csv', index=False)


## Results on Kaggle
After many rounds of trying different feature extraction and preprocessing methods, selecting models, and tuning hyperparameters, we obtained the following scores on Kaggle:  
[Public Score]  0.60181 (first place)  
[Private Score] 0.59755 (second place)  


## Conclusion
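This text feature engineering competition was quite challenging. Besides parsing potentially useful features out of the news articles with a variety of techniques, we also had to encode the numeric data and preprocess the text data. The lab covered many ways to preprocess text (such as stripping HTML tags, normalizing words, and computing word frequency and importance), but processing only the usual article title, author, and content makes it hard to reach a good prediction result.  
  
We tried many features and feature combinations, including some rather imaginative feature extraction, as well as different encodings for the numeric data. We found that time information is very important for this popularity prediction task. We also found that, for some features, OneHotEncoder does not necessarily give the best result, which was an interesting finding.  
  
Finally, for model selection, LightGBM gave a considerable performance improvement while also speeding up training. The features and hyperparameters differ from model to model, and it took many experiments to reach first place on the Public Leaderboard.  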
