# Reliable Intelligence Identification on Vietnamese SNSs (ReINTEL)

## Introduction

This challenge aims to determine whether a piece of information shared on social network sites (SNSs) is reliable or unreliable. With the rapid growth of SNSs such as Facebook, Zalo, and Lotus, there are approximately 65 million Vietnamese users, with an annual growth of 2.7 million in the most recent year, as reported by Digital 2020 [6]. SNSs have become an essential means for users not only to connect with friends but also to freely create and share diverse information [2, 5], e.g., news. Given this freedom, a number of users spread unreliable information for personal purposes, which harms the online community. Detecting whether news spreading on SNSs is reliable or unreliable has gained significant attention recently [1, 3, 4]. This shared task therefore targets identifying the reliability of information shared on Vietnamese SNSs. It gives participants who are interested in the problem an opportunity to contribute their knowledge to improve the online community for social good.

## Data Format

Each instance includes the following 8 attributes, with or without a binary target label:

* id: unique id for a news post on SNSs

* uid: the anonymized id of the owner

* text: the text content of the news 

* timestamp: the time when the news is posted 

* image_links: image URLs associated with the news

* nb_likes: the number of likes the news has received

* nb_comments: the number of comments the news has received

* nb_shares: the number of shares the news has received

* label: a manually annotated label which marks the news as potentially unreliable (1: unreliable, 0: reliable)
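
As a rough illustration of the record layout above, the attributes could be held in a small typed structure. This is only a sketch: the `Post` class, the `parse_post` helper, the chosen Python types (for example, `timestamp` as Unix epoch seconds), and the sample record are all assumptions made for illustration, not part of the official data specification.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Post:
    """One instance, mirroring the attribute list above (types are assumed)."""
    id: str
    uid: str
    text: str
    timestamp: float              # assumed here to be Unix epoch seconds
    image_links: List[str]
    nb_likes: int
    nb_comments: int
    nb_shares: int
    label: Optional[int] = None   # 1: unreliable, 0: reliable, None if unlabeled


def parse_post(raw: dict) -> Post:
    """Build a Post from a raw dict; `label` may be absent in unlabeled data."""
    label = raw.get("label")
    return Post(
        id=str(raw["id"]),
        uid=str(raw["uid"]),
        text=raw["text"],
        timestamp=float(raw["timestamp"]),
        image_links=list(raw.get("image_links") or []),
        nb_likes=int(raw["nb_likes"]),
        nb_comments=int(raw["nb_comments"]),
        nb_shares=int(raw["nb_shares"]),
        label=None if label is None else int(label),
    )


# Hypothetical record, invented for illustration only.
example = parse_post({
    "id": "1", "uid": "u42", "text": "Example post", "timestamp": 1600000000,
    "image_links": None, "nb_likes": "12", "nb_comments": 3, "nb_shares": 0,
})
print(example)
```

Keeping `label` optional lets the same structure represent both labeled training instances and unlabeled test instances.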

## Our results

Our model (a weighted ensemble of SVM and LightGBM) achieved an AUC score of 0.9523 and took 1st place on the private test.

Competition link: https://competitions.codalab.org/competitions/27232

<p align="center">
<img src="https://imgur.com/hoBON9b.png">
</p>
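
One common way to build a weighted ensemble is to take a weighted average of each model's predicted probabilities. The sketch below assumes that approach; the `weighted_ensemble` helper, the sample probabilities, and the weights are hypothetical and are not the values used in the competition.

```python
import numpy as np


def weighted_ensemble(probas, weights):
    """Weighted average of per-model probability vectors."""
    probas = np.asarray(probas, dtype=float)     # shape: (n_models, n_samples)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()            # normalize so weights sum to 1
    return weights @ probas


# Hypothetical positive-class probabilities for three posts from two models.
svm_proba = np.array([0.9, 0.2, 0.6])
lgbm_proba = np.array([0.7, 0.1, 0.4])

# Placeholder weights, chosen only for illustration.
blended = weighted_ensemble([svm_proba, lgbm_proba], weights=[0.4, 0.6])
print(blended)
```

In practice the weights would typically be tuned on the validation set, for example by picking the combination that maximizes ROC-AUC.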

## Setup

In [1]:
import pandas as pd
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, roc_auc_score
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

## Load Dataset
The data is split into training and validation sets at an 8:2 ratio.

This dataset has already been preprocessed and its features extracted. We used the following 14 features, plus the `label` target column:

post_message, num_char, num_url, num_hashtag, num_post, num_like,
num_cmt, num_share, pixel, num_image, hour, weekday, day, month, label
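
The preprocessing code is not included here, so the following is only a guess at how some of these features might be derived from the raw `text` and `timestamp` attributes. The `extract_basic_features` helper, the regular expressions, and the assumption that the timestamp is in Unix epoch seconds (interpreted as UTC) are all hypothetical. Features whose definitions are not given in this document (`num_post`, `pixel`, `num_image`) are left out.

```python
import re
from datetime import datetime, timezone

URL_PATTERN = re.compile(r"https?://\S+")
HASHTAG_PATTERN = re.compile(r"#\w+")


def extract_basic_features(text, timestamp):
    """Derive count and calendar features from one post."""
    posted_at = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    # Strip URLs before counting hashtags so URL fragments (#section) are not counted.
    text_without_urls = URL_PATTERN.sub(" ", text)
    return {
        "num_char": len(text),
        "num_url": len(URL_PATTERN.findall(text)),
        "num_hashtag": len(HASHTAG_PATTERN.findall(text_without_urls)),
        "hour": posted_at.hour,
        "weekday": posted_at.weekday(),   # Monday = 0, Sunday = 6
        "day": posted_at.day,
        "month": posted_at.month,
    }


# Hypothetical post text and timestamp, for illustration only.
features = extract_basic_features("Check https://example.com #news #vn", 1600000000)
print(features)
```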

In [2]:
cols = ['post_message','num_char','num_url','num_hashtag','num_post','num_like',
        'num_cmt','num_share','pixel','num_image','hour','weekday','day','month','label']

df_train = pd.read_csv('../input/reintel-preprocessed/train.csv', usecols=cols) 
df_test = pd.read_csv('../input/reintel-preprocessed/test.csv', usecols=cols)

In [3]:
print(df_train.info())
df_train

In [4]:
print(df_test.info())
df_test

## Data Normalization

In [5]:
numerical_cols = ['num_char','num_url','num_hashtag','num_post','num_like','num_cmt',
                  'num_share','pixel','num_image','hour','weekday','day','month']

for col in numerical_cols:
    scale = StandardScaler().fit(df_train[[col]])
    df_train[col] = scale.transform(df_train[[col]])  
    df_test[col] = scale.transform(df_test[[col]])
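
The loop above fits each scaler on the training column only and reuses the same statistics for the test column. A minimal NumPy sketch of the same idea, using made-up numbers, shows what that means for the resulting values:

```python
import numpy as np

# Toy data: two numeric features, three training rows, one test row.
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

# Statistics come from the training data only.
mean = train.mean(axis=0)
std = train.std(axis=0)   # population std (ddof=0), as StandardScaler uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std   # reuses the training mean and std

print(train_scaled.mean(axis=0))    # approximately zero for every column
print(test_scaled)
```

Because the test set is scaled with the training statistics, its columns are not guaranteed to have zero mean or unit variance, which is expected and avoids leaking test information into the preprocessing.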

In [6]:
df_train

In [7]:
df_test

## Helper Functions

In [8]:
def get_metrics(y_test, y_pred_proba):
    """Print accuracy, ROC-AUC, macro-F1, and the confusion matrix."""
    # Hard labels use a fixed 0.5 threshold; ROC-AUC uses the raw probabilities.
    y_pred = y_pred_proba >= 0.5
    print('ACCURACY_SCORE: ', round(accuracy_score(y_test, y_pred), 4), '\n')
    print('ROC_AUC_SCORE: ', round(roc_auc_score(y_test, y_pred_proba), 4), '\n')
    print('F1_SCORE: ', round(f1_score(y_test, y_pred, average='macro'), 4), '\n')
    print('CONFUSION_MATRIX:\n', confusion_matrix(y_test, y_pred), '\n')
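
To make the difference between the metrics concrete, here is a small worked example on made-up labels and probabilities. Accuracy and macro-F1 depend on the 0.5 threshold used in `get_metrics`, while ROC-AUC depends only on how the probabilities rank the posts.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Toy labels and predicted probabilities, for illustration only.
y_true = np.array([0, 0, 1, 1])
y_proba = np.array([0.1, 0.4, 0.35, 0.8])

# Threshold-dependent metrics work on hard labels.
y_pred = (y_proba >= 0.5).astype(int)
accuracy = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro")

# ROC-AUC is computed from the raw probabilities.
auc = roc_auc_score(y_true, y_proba)

print(accuracy, macro_f1, auc)
```

Here the third post (true label 1, probability 0.35) falls below the threshold and is misclassified, which lowers accuracy and macro-F1; ROC-AUC is affected only because that post is ranked below one of the negative posts.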

In [9]:
def get_numeric_data(x):
    return [record[1:].astype(float) for record in x]

def get_text_data(x):
    return [record[0] for record in x]
    
transformer_numeric = FunctionTransformer(get_numeric_data)
transformer_text = FunctionTransformer(get_text_data)

In [10]:
pipeline = Pipeline([
    ('features', FeatureUnion([
            ('numeric_features', Pipeline([
                ('selector', transformer_numeric)
            ])),
            ('text_features', Pipeline([
                ('selector', transformer_text),
                ('tfidf', TfidfVectorizer(max_features=100000, ngram_range=(1,2))),
                #('svd', TruncatedSVD(n_components = 512, random_state=42))
            ]))
    ])),
    #('clf', LGBMClassifier())
])
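
Conceptually, the `FeatureUnion` above places the numeric columns side by side with the TF-IDF columns in a single feature matrix. The sketch below reproduces that concatenation by hand on two made-up rows, where column 0 holds the text and the remaining columns hold already-scaled numbers; the data and variable names are hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy rows: text in column 0, numeric features in the remaining columns.
rows = np.array([
    ["tin giả lan truyền nhanh", 0.5, -1.2],
    ["tin tức đáng tin cậy", -0.3, 0.8],
], dtype=object)

# Same selection logic as get_text_data and get_numeric_data.
texts = [record[0] for record in rows]
numeric = np.array([record[1:].astype(float) for record in rows])

# Unigram and bigram TF-IDF, as in the pipeline's vectorizer.
tfidf = TfidfVectorizer(ngram_range=(1, 2))
text_matrix = tfidf.fit_transform(texts)

# Numeric block first, then the TF-IDF block, kept sparse.
features = hstack([csr_matrix(numeric), text_matrix]).tocsr()
print(features.shape)
```

The resulting matrix has one column per numeric feature followed by one column per TF-IDF term, which is why the transformed training matrix is much wider than the original data frame.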

## Data Preparation

In [11]:
X_train = df_train.drop(['label'], axis=1).to_numpy()
X_test = df_test.drop(['label'], axis=1).to_numpy()

In [12]:
X_train = pipeline.fit_transform(X_train)
X_test = pipeline.transform(X_test)
y_train = df_train['label'].values
y_test = df_test['label'].values

print(X_train.shape)
print(X_test.shape)

## Train and evaluate

In [13]:
model = LGBMClassifier()
model.fit(X_train, y_train)
y_pred_proba = model.predict_proba(X_test)[:, 1]
get_metrics(y_test, y_pred_proba)

## Score Table

In [14]:
list_model = [DecisionTreeClassifier(), LogisticRegression(solver='liblinear'), 
              RandomForestClassifier(), SVC(kernel = 'linear', probability = True), 
              LGBMClassifier(), XGBClassifier(), CatBoostClassifier(verbose = 200)] 

list_model_name, list_acc_score, list_f1_score, list_roc_auc = [], [], [], []

In [15]:
for model in list_model:
    print(f"{type(model).__name__} .....\n")
    model.fit(X_train, y_train)
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    list_model_name.append(type(model).__name__)
    list_acc_score.append(accuracy_score(y_test, y_pred_proba>=0.5))
    list_f1_score.append(f1_score(y_test, y_pred_proba>=0.5, average='macro'))
    list_roc_auc.append(roc_auc_score(y_test, y_pred_proba))

In [16]:
table_cols = {'Model name': list_model_name, 'Accuracy score': list_acc_score, 
              'Macro-F1 score': list_f1_score, 'ROC-AUC score': list_roc_auc}
table = pd.DataFrame(table_cols) 
table = table.sort_values(by=['ROC-AUC score'], ascending=False).reset_index(drop=True)
table