# **Reliable Intelligence Identification on Vietnamese SNSs (ReINTEL)**

## **Introduction**

This challenge aims to identify whether a piece of information shared on social network sites (SNSs) is reliable or unreliable. With the rapid growth of SNSs such as Facebook, Zalo, and Lotus, there are approximately 65 million Vietnamese users, with annual growth of 2.7 million in the most recent year, as reported by Digital 2020 [6]. SNSs have become an essential means for users not only to connect with friends but also to freely create and share diverse information [2, 5], e.g., news. With this freedom, some users spread unreliable information for personal purposes, which harms the online community. Detecting whether news spreading on SNSs is reliable or unreliable has gained significant attention recently [1, 3, 4]. This shared task therefore targets identifying the reliability of information shared on Vietnamese SNSs. It gives participants who are interested in the problem an opportunity to contribute their knowledge to improving the online community for social good.

## **Data Format**

Each instance includes the following main attributes, with or without a binary target label:

* id: unique id for a news post on SNSs

* uid: the anonymized id of the owner

* text: the text content of the news 

* timestamp: the time when the news is posted 

* image_links: image URLs associated with the news

* nb_likes: the number of likes the news has received

* nb_comments: the number of comments the news has received

* nb_shares: the number of shares the news has received

* label: a manually annotated label which marks the news as potentially unreliable (1: unreliable, 0: reliable)
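
The notebook cells further down load a preprocessed table whose columns (`num_char`, `num_url`, `num_hashtag`, `hour`, `weekday`, `day`, `month`, and others) are not raw attributes. The preprocessing step itself is not shown in this document, so the following is only a sketch of how a few of those columns might be derived from `text` and `timestamp`. The helper name `derive_basic_features`, the URL regex, the Unix-epoch-seconds timestamp format, and the `-1` placeholder for missing timestamps are all assumptions made for illustration.

```python
import re

import pandas as pd

# Assumption: a URL is anything starting with http:// or https:// up to whitespace.
URL_PATTERN = re.compile(r"https?://\S+")


def derive_basic_features(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: derive simple count and time features from raw posts."""
    out = pd.DataFrame(index=raw.index)
    text = raw["text"].fillna("")

    out["num_char"] = text.str.len()
    out["num_url"] = text.apply(lambda s: len(URL_PATTERN.findall(s)))
    out["num_hashtag"] = text.str.count("#")

    # Assumption: timestamps are Unix epoch seconds (UTC); unparseable or
    # missing values become NaT and are then encoded as -1.
    ts = pd.to_datetime(raw["timestamp"], unit="s", errors="coerce")
    out["hour"] = ts.dt.hour.fillna(-1).astype(int)
    out["weekday"] = ts.dt.weekday.fillna(-1).astype(int)
    out["day"] = ts.dt.day.fillna(-1).astype(int)
    out["month"] = ts.dt.month.fillna(-1).astype(int)
    return out


# Toy input: one post with a hashtag and a URL, one post without a timestamp.
raw = pd.DataFrame({
    "text": ["Tin moi #covid https://example.com", "Khong co lien ket"],
    "timestamp": [1587740400, None],
})
features = derive_basic_features(raw)
print(features)
```

The remaining columns (`num_post`, `pixel`, `num_image`) would need the user id and image metadata, and are left out of this sketch.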

## **Our results**

Our model (a weighted ensemble of SVM and LightGBM) achieved an AUC score of 0.9523 and took 1st place on the private test set.

Competition link: https://competitions.codalab.org/competitions/27232

<p align="center">
<img src="https://imgur.com/hoBON9b.png">
</p>

In [1]:
from google.colab import drive
import os
drive.mount('/content/gdrive')
path = "/content/gdrive/My Drive/Fake News Detection"
os.chdir(path)

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


## **Setup**

In [2]:
!pip install catboost



In [3]:
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, roc_auc_score
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

## **Load data**

In [4]:
cols = ['post_message','num_char','num_url','num_hashtag','num_post','num_like',
        'num_cmt','num_share','pixel','num_image','hour','weekday','day','month','label']
        
df_train = pd.read_csv('dataset/raw/train.csv', usecols=cols) 
df_test = pd.read_csv('dataset/raw/test.csv', usecols=cols)

In [5]:
print(df_train.info())
df_train

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3788 entries, 0 to 3787
Data columns (total 15 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   post_message  3788 non-null   object
 1   label         3788 non-null   int64 
 2   num_char      3788 non-null   int64 
 3   num_url       3788 non-null   int64 
 4   num_hashtag   3788 non-null   int64 
 5   num_post      3788 non-null   int64 
 6   num_like      3788 non-null   int64 
 7   num_cmt       3788 non-null   int64 
 8   num_share     3788 non-null   int64 
 9   pixel         3788 non-null   int64 
 10  num_image     3788 non-null   int64 
 11  hour          3788 non-null   int64 
 12  weekday       3788 non-null   int64 
 13  day           3788 non-null   int64 
 14  month         3788 non-null   int64 
dtypes: int64(14), object(1)
memory usage: 444.0+ KB
None


| | post_message | label | num_char | num_url | num_hashtag | num_post | num_like | num_cmt | num_share | pixel | num_image | hour | weekday | day | month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Thủ_tướng CANADA đã chính_thức thông_báo : các... | 1 | 116 | 0 | 0 | 1 | 37 | 3 | 2 | 0 | 0 | 15 | 6 | 24 | 4 |
| 1 | Sửa Nghị_định 20 : Hồi_tố hàng nghìn tỷ đồng_b... | 0 | 99 | 0 | 0 | 42 | 1 | 0 | 0 | 289080 | 0 | 2 | 2 | 20 | 4 |
| 2 | Luật_sư cho rằng việc khai_báo nhỏ_giọt gây kh... | 0 | 3276 | 1 | 0 | 45 | 930 | 12 | 160 | 0 | 1 | 13 | 1 | 15 | 3 |
| 3 | “ Thiên_tai , Nhân_họa hay khủng_hoảng niềm ti... | 0 | 2668 | 0 | 0 | 1 | 2700 | 156 | 2100 | 0 | 0 | 18 | 2 | 3 | 2 |
| 4 | Sẽ bán đấu_giá và cấp biển số xe theo sở_thích... | 1 | 154 | 0 | 0 | 1 | 68 | 31 | 4 | 0 | 0 | 22 | 3 | 21 | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3783 | Theo thông_tin từ Sở Y_tế Thái_Nguyên , bệnh_n... | 0 | 1606 | 0 | 0 | 28 | 53000 | 5500 | 4600 | 0 | 0 | 18 | 1 | 29 | 3 |
| 3784 | 🔥 BN91 ( 43 tuổi ) hiện vẫn đang trong tình_tr... | 0 | 2007 | 0 | 0 | 1 | 28 | 1 | 2 | 421200 | 1 | 20 | 6 | 10 | 4 |
| 3785 | Tin_Sét_Đánh Vào Đầu_ChinaZi Thượng_Viện Hoa_K... | 1 | 1768 | 0 | 0 | 1 | 6 | 0 | 3 | 0 | 2 | 7 | 6 | 22 | 5 |
| 3786 | Trung_Quốc không có thêm ca lây_nhiễm virus SA... | 0 | 490 | 0 | 0 | 58 | 5801 | 53 | 32 | 0 | 1 | 3 | 2 | 25 | 5 |


In [6]:
print(df_test.info())
df_test

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 948 entries, 0 to 947
Data columns (total 15 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   post_message  948 non-null    object
 1   label         948 non-null    int64 
 2   num_char      948 non-null    int64 
 3   num_url       948 non-null    int64 
 4   num_hashtag   948 non-null    int64 
 5   num_post      948 non-null    int64 
 6   num_like      948 non-null    int64 
 7   num_cmt       948 non-null    int64 
 8   num_share     948 non-null    int64 
 9   pixel         948 non-null    int64 
 10  num_image     948 non-null    int64 
 11  hour          948 non-null    int64 
 12  weekday       948 non-null    int64 
 13  day           948 non-null    int64 
 14  month         948 non-null    int64 
dtypes: int64(14), object(1)
memory usage: 111.2+ KB
None


| | post_message | label | num_char | num_url | num_hashtag | num_post | num_like | num_cmt | num_share | pixel | num_image | hour | weekday | day | month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | HÃY XÔNG KHI BỊ NHIỄM_VIRUS ( Tin thực_tế từ n... | 1 | 3701 | 0 | 0 | 1 | 2700 | 640 | 12000 | 0 | 0 | 21 | 2 | 30 | 3 |
| 1 | Tôi cũng vài lần rơi vào trạng_thái đãng_trí ,... | 0 | 170 | 0 | 0 | 1 | 7 | 0 | 0 | 0 | 0 | 3 | 4 | 10 | 6 |
| 2 | Thêm 4 tỉnh_thành sẽ cho học_sinh đi học trở_l... | 0 | 67 | 0 | 0 | 58 | 3180 | 47 | 116 | 518400 | 1 | 15 | 4 | 22 | 4 |
| 3 | Sự_việc phân_chia vùng còn chưa ngã_ngũ thì ở ... | 1 | 122 | 0 | 0 | 1 | 20 | 11 | 2 | 0 | 2 | 18 | 5 | 4 | 6 |
| 4 | 16 h33p ngày 27/5 ( giờ_địa_phương ) , SpaceX ... | 0 | 146 | 0 | 0 | 64 | 13 | 1 | 3 | 0 | 0 | 11 | 4 | 27 | 5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 943 | TƯỚC QUÂN_TỊCH MỘT CẢNH_SÁT GIAO_THÔNG LIÊN_QU... | 0 | 485 | 0 | 0 | 58 | 11703 | 334 | 198 | 269312 | 1 | 11 | 1 | 24 | 5 |
| 944 | Đây không phải là " ngăn sông cấm chợ " ! | 0 | 41 | 0 | 0 | 1 | 5005 | 8 | 1722 | 0 | 1 | 14 | 3 | 31 | 3 |
| 945 | " Ghen Cô Vy " là 1 dự_án sáng_tạo của Viện Sứ... | 0 | 315 | 0 | 0 | 28 | 22000 | 1000 | 5500 | 0 | 0 | -1 | -1 | -1 | -1 |
| 946 | Chiều 15-4 , tại cuộc họp giao_ban trực_tuyến ... | 0 | 228 | 1 | 0 | 2 | 1 | 0 | 0 | 0 | 0 | 20 | 4 | 15 | 4 |


## **Data Normalization**

In [7]:
numerical_cols = ['num_char','num_url','num_hashtag','num_post','num_like','num_cmt',
                  'num_share','pixel','num_image','hour','weekday','day','month']

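# Fit each scaler on the training split only, then apply it to both splits,
# so that no test-set statistics are used during fitting.
for col in numerical_cols:
    scale = StandardScaler().fit(df_train[[col]])
    df_train[col] = scale.transform(df_train[[col]])
    df_test[col] = scale.transform(df_test[[col]])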
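
Because `StandardScaler` standardizes every column independently, the per-column loop should give the same result as fitting a single scaler on all numerical columns at once. The sketch below checks this on a small made-up frame; the toy values and the variable names (`loop_train`, `joint`, and so on) are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data standing in for df_train / df_test (values are made up).
train = pd.DataFrame({"num_like": [1.0, 3.0, 5.0], "num_share": [0.0, 10.0, 20.0]})
test = pd.DataFrame({"num_like": [3.0, 7.0], "num_share": [10.0, 40.0]})
cols = ["num_like", "num_share"]

# Variant 1: one scaler per column, fitted on the training split only.
loop_train, loop_test = train.copy(), test.copy()
for col in cols:
    scaler = StandardScaler().fit(train[[col]])
    loop_train[col] = scaler.transform(train[[col]]).ravel()
    loop_test[col] = scaler.transform(test[[col]]).ravel()

# Variant 2: a single scaler fitted on all columns of the training split.
joint = StandardScaler().fit(train[cols])
joint_train = joint.transform(train[cols])
joint_test = joint.transform(test[cols])

# Both variants agree, since means and variances are computed column-wise.
same_train = np.allclose(loop_train[cols].to_numpy(), joint_train)
same_test = np.allclose(loop_test[cols].to_numpy(), joint_test)
print(same_train, same_test)
```

Note that test values are scaled with the training mean and standard deviation, so scaled test columns do not necessarily have zero mean or unit variance.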

In [8]:
df_train

| | post_message | label | num_char | num_url | num_hashtag | num_post | num_like | num_cmt | num_share | pixel | num_image | hour | weekday | day | month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Thủ_tướng CANADA đã chính_thức thông_báo : các... | 1 | -0.446748 | -0.372198 | -0.223096 | -0.472197 | -0.289728 | -0.216099 | -0.282523 | -0.445261 | -0.368489 | 0.628546 | 1.019241 | 0.960062 | -0.107563 |
| 1 | Sửa Nghị_định 20 : Hồi_tố hàng nghìn tỷ đồng_b... | 0 | -0.458510 | -0.372198 | -0.223096 | 1.467858 | -0.294094 | -0.219293 | -0.283350 | 0.649050 | -0.368489 | -1.444186 | -0.929717 | 0.497905 | -0.107563 |
| 2 | Luật_sư cho rằng việc khai_báo nhỏ_giọt gây kh... | 0 | 1.739626 | 0.609617 | -0.223096 | 1.609813 | -0.181430 | -0.206519 | -0.217160 | -0.445261 | 0.567685 | 0.309664 | -1.416957 | -0.079792 | -0.741229 |
| 3 | “ Thiên_tai , Nhân_họa hay khủng_hoảng niềm ti... | 0 | 1.318957 | -0.372198 | -0.223096 | -0.472197 | 0.033225 | -0.053227 | 0.585398 | -0.445261 | -0.368489 | 1.106868 | -0.929717 | -1.466264 | -1.374895 |
| 4 | Sẽ bán đấu_giá và cấp biển số xe theo sở_thích... | 1 | -0.420456 | -0.372198 | -0.223096 | -0.472197 | -0.285969 | -0.186293 | -0.281695 | -0.445261 | -0.368489 | 1.744632 | -0.442477 | 0.613444 | -0.107563 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3783 | Theo thông_tin từ Sở Y_tế Thái_Nguyên , bệnh_n... | 0 | 0.584169 | -0.372198 | -0.223096 | 0.805400 | 6.133324 | 5.635587 | 1.619623 | -0.445261 | -0.368489 | 1.106868 | -1.416957 | 1.537759 | -0.741229 |
| 3784 | 🔥 BN91 ( 43 tuổi ) hiện vẫn đang trong tình_tr... | 0 | 0.861617 | -0.372198 | -0.223096 | -0.472197 | -0.290820 | -0.218228 | -0.282523 | 1.149190 | 0.567685 | 1.425750 | 1.019241 | -0.657488 | -0.107563 |
| 3785 | Tin_Sét_Đánh Vào Đầu_ChinaZi Thượng_Viện Hoa_K... | 1 | 0.696255 | -0.372198 | -0.223096 | -0.472197 | -0.293488 | -0.219293 | -0.282109 | -0.445261 | 1.503858 | -0.646981 | 1.019241 | 0.728984 | 0.526103 |
| 3786 | Trung_Quốc không có thêm ca lây_nhiễm virus SA... | 0 | -0.187980 | -0.372198 | -0.223096 | 2.224953 | 0.409297 | -0.162873 | -0.270112 | -0.445261 | 0.567685 | -1.284745 | -0.929717 | 1.075602 | 0.526103 |


In [9]:
df_test

| | post_message | label | num_char | num_url | num_hashtag | num_post | num_like | num_cmt | num_share | pixel | num_image | hour | weekday | day | month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | HÃY XÔNG KHI BỊ NHIỄM_VIRUS ( Tin thực_tế từ n... | 1 | 2.033680 | -0.372198 | -0.223096 | -0.472197 | 0.033225 | 0.462002 | 4.680926 | -0.445261 | -0.368489 | 1.585191 | -0.929717 | 1.653298 | -0.741229 |
| 1 | Tôi cũng vài lần rơi vào trạng_thái đãng_trí ,... | 0 | -0.409385 | -0.372198 | -0.223096 | -0.472197 | -0.293367 | -0.219293 | -0.283350 | -0.445261 | -0.368489 | -1.284745 | 0.044762 | -0.657488 | 1.159769 |
| 2 | Thêm 4 tỉnh_thành sẽ cho học_sinh đi học trở_l... | 0 | -0.480650 | -0.372198 | -0.223096 | 2.224953 | 0.091437 | -0.169260 | -0.235362 | 1.517140 | 0.567685 | 0.628546 | 0.044762 | 0.728984 | -0.107563 |
| 3 | Sự_việc phân_chia vùng còn chưa ngã_ngũ thì ở ... | 1 | -0.442596 | -0.372198 | -0.223096 | -0.472197 | -0.291790 | -0.207583 | -0.282523 | -0.445261 | 1.503858 | 1.106868 | 0.532002 | -1.350724 | 1.159769 |
| 4 | 16 h33p ngày 27/5 ( giờ_địa_phương ) , SpaceX ... | 0 | -0.425991 | -0.372198 | -0.223096 | 2.508863 | -0.292639 | -0.218228 | -0.282109 | -0.445261 | -0.368489 | -0.009218 | 0.044762 | 1.306680 | 0.526103 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 943 | TƯỚC QUÂN_TỊCH MỘT CẢNH_SÁT GIAO_THÔNG LIÊN_QU... | 0 | -0.191440 | -0.372198 | -0.223096 | 2.224953 | 1.125058 | 0.136258 | -0.201439 | 0.574218 | 0.567685 | -0.009218 | -1.416957 | 0.960062 | 0.526103 |
| 944 | Đây không phải là " ngăn sông cấm chợ " ! | 0 | -0.498639 | -0.372198 | -0.223096 | -0.472197 | 0.312763 | -0.210777 | 0.429024 | -0.445261 | 0.567685 | 0.469105 | -0.442477 | 1.768838 | -0.741229 |
| 945 | " Ghen Cô Vy " là 1 dự_án sáng_tạo của Viện Sứ... | 0 | -0.309061 | -0.372198 | -0.223096 | 0.805400 | 2.373820 | 0.845231 | 1.991943 | -0.445261 | -0.368489 | -1.922509 | -2.391436 | -1.928421 | -3.275893 |
| 946 | Chiều 15-4 , tại cuộc họp giao_ban trực_tuyến ... | 0 | -0.369256 | 0.609617 | -0.223096 | -0.424879 | -0.294094 | -0.219293 | -0.283350 | -0.445261 | -0.368489 | 1.425750 | 0.044762 | -0.079792 | -0.107563 |


## **Helper functions**

In [10]:
def get_metrics(y_test, y_pred_proba):
    """Print accuracy, ROC-AUC, macro-F1, and the confusion matrix.

    ROC-AUC uses the raw probabilities; the other metrics threshold them at 0.5.
    """
    y_pred = y_pred_proba >= 0.5
    print('ACCURACY_SCORE: ', round(accuracy_score(y_test, y_pred), 4), '\n')
    print('ROC_AUC_SCORE: ', round(roc_auc_score(y_test, y_pred_proba), 4), '\n')
    print('F1_SCORE: ', round(f1_score(y_test, y_pred, average='macro'), 4), '\n')
    print('CONFUSION_MATRIX:\n', confusion_matrix(y_test, y_pred), '\n')

In [11]:
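# Each record is one row of the feature array: column 0 holds the post text,
# the remaining columns hold the scaled numerical features.
def get_numeric_data(x):
    return [record[1:].astype(float) for record in x]

def get_text_data(x):
    return [record[0] for record in x]

# Wrap the selectors so they can be used as steps in a scikit-learn pipeline.
transfomer_numeric = FunctionTransformer(get_numeric_data)
transformer_text = FunctionTransformer(get_text_data)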

In [12]:
pipeline = Pipeline([
    ('features', FeatureUnion([
            ('numeric_features', Pipeline([
                ('selector', transfomer_numeric)
            ])),
            ('text_features', Pipeline([
                ('selector', transformer_text),
                ('tfidf', TfidfVectorizer(max_features=100000, ngram_range=(1,2))),
                #('svd', TruncatedSVD(n_components = 512, random_state=42))
            ]))
    ])),
    #('clf', LGBMClassifier())
])
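
To make the behaviour of this `FeatureUnion` concrete, here is a minimal sketch on a three-row toy array. It mirrors the structure of the pipeline above (a numeric selector side by side with a TF-IDF branch) but uses made-up data, no `max_features` cap, and hypothetical selector names (`get_numeric_columns`, `get_text_column`). The point is only that the output width is the number of numeric columns plus the TF-IDF vocabulary size.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer


def get_numeric_columns(x):
    # Hypothetical selector: everything after column 0, as a float matrix.
    return np.array([record[1:] for record in x], dtype=float)


def get_text_column(x):
    # Hypothetical selector: column 0 holds the raw text.
    return [record[0] for record in x]


# Toy rows: one text column followed by two already-scaled numeric columns.
toy = np.array([
    ["tin gia lan truyen", 0.5, -1.2],
    ["tin that da xac_minh", -0.3, 0.8],
    ["tin gia bi go", 1.1, 0.0],
], dtype=object)

toy_union = FeatureUnion([
    ("numeric_features", FunctionTransformer(get_numeric_columns)),
    ("text_features", Pipeline([
        ("selector", FunctionTransformer(get_text_column)),
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ])),
])

toy_features = toy_union.fit_transform(toy)

# Look up the fitted TF-IDF step to read the learned vocabulary size.
fitted_tfidf = toy_union.transformer_list[1][1].named_steps["tfidf"]
vocab_size = len(fitted_tfidf.vocabulary_)
n_numeric = toy.shape[1] - 1

print(toy_features.shape, n_numeric, vocab_size)
```

With the notebook's settings, the same reasoning gives 13 numerical columns plus at most 100000 TF-IDF features per row.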

## **Data preparation**

In [13]:
X_train = df_train.drop(['label'], axis=1).to_numpy()
X_test = df_test.drop(['label'], axis=1).to_numpy()

In [14]:
X_train = pipeline.fit_transform(X_train)
X_test = pipeline.transform(X_test)
y_train = df_train['label'].values
y_test = df_test['label'].values

print(X_train.shape)
print(X_test.shape)

(3788, 100013)
(948, 100013)


## **Train and evaluate**

In [15]:
model = LGBMClassifier()
model.fit(X_train, y_train)
# Keep the predicted probability of the positive class (1: unreliable).
y_pred_proba = model.predict_proba(X_test)[:, 1]
get_metrics(y_test, y_pred_proba)

ACCURACY_SCORE:  0.904 

ROC_AUC_SCORE:  0.9449 

F1_SCORE:  0.8076 

CONFUSION_MATRIX:
 [[764  17]
 [ 74  93]] 
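
The accuracy and macro-F1 values above can be recomputed directly from the confusion matrix, which also exposes the recall on the unreliable class. The sketch below does that arithmetic, assuming scikit-learn's layout (rows are true labels, columns are predicted labels, in the order 0 then 1).

```python
import numpy as np

# Confusion matrix printed above: rows = true label, columns = predicted label.
cm = np.array([[764, 17],
               [74, 93]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()

# Per-class F1 written as 2*TP / (2*TP + FP + FN), taking each class in turn
# as the "positive" one, then averaged without weighting (macro).
f1_unreliable = 2 * tp / (2 * tp + fp + fn)
f1_reliable = 2 * tn / (2 * tn + fn + fp)
macro_f1 = (f1_unreliable + f1_reliable) / 2

# Share of truly unreliable posts that the model flags at the 0.5 threshold.
recall_unreliable = tp / (tp + fn)

print(round(accuracy, 4), round(macro_f1, 4), round(recall_unreliable, 4))
# → 0.904 0.8076 0.5569
```

At the 0.5 threshold the model finds 93 of the 167 unreliable posts, so the high accuracy is driven largely by the majority (reliable) class.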



## **Score table**

In [16]:
list_model = [DecisionTreeClassifier(), LogisticRegression(), 
              RandomForestClassifier(), SVC(kernel='linear', probability=True), 
              LGBMClassifier(), XGBClassifier(), CatBoostClassifier(verbose=200)] 

list_model_name, list_acc_score, list_f1_score, list_roc_auc = [], [], [], []

In [17]:
for model in list_model:
    print(f"{type(model).__name__} .....\n")
    model.fit(X_train, y_train)
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    list_model_name.append(type(model).__name__)
    list_acc_score.append(accuracy_score(y_test, y_pred_proba >= 0.5))
    list_f1_score.append(f1_score(y_test, y_pred_proba >= 0.5, average='macro'))
    list_roc_auc.append(roc_auc_score(y_test, y_pred_proba))

DecisionTreeClassifier .....

LogisticRegression .....

RandomForestClassifier .....

SVC .....

LGBMClassifier .....

XGBClassifier .....

CatBoostClassifier .....

Learning rate set to 0.018193
0:	learn: 0.6835726	total: 1.07s	remaining: 17m 49s
200:	learn: 0.2925928	total: 2m 31s	remaining: 10m 1s
400:	learn: 0.2475734	total: 4m 59s	remaining: 7m 26s
600:	learn: 0.2060787	total: 7m 28s	remaining: 4m 57s
800:	learn: 0.1708495	total: 9m 56s	remaining: 2m 28s
999:	learn: 0.1474028	total: 12m 23s	remaining: 0us


In [18]:
table_cols = {'Model name': list_model_name, 'Accuracy score': list_acc_score, 
              'Macro-F1 score': list_f1_score, 'ROC-AUC score': list_roc_auc}
table = pd.DataFrame(table_cols) 
table = table.sort_values(by=['ROC-AUC score'], ascending=False).reset_index(drop=True)
table

| | Model name | Accuracy score | Macro-F1 score | ROC-AUC score |
|---|---|---|---|---|
| 0 | LGBMClassifier | 0.904008 | 0.807636 | 0.944904 |
| 1 | CatBoostClassifier | 0.888186 | 0.752432 | 0.939851 |
| 2 | SVC | 0.894515 | 0.794119 | 0.934346 |
| 3 | XGBClassifier | 0.89135 | 0.763514 | 0.932092 |
| 4 | LogisticRegression | 0.874473 | 0.704482 | 0.92063 |
| 5 | RandomForestClassifier | 0.837553 | 0.540992 | 0.896551 |
| 6 | DecisionTreeClassifier | 0.836498 | 0.709361 | 0.703048 |
