# Topic: Kaggle Campaign - Fake News Detection Challenge KDD 2020
Description: Using LightGBM and XGBoost models to detect fake news. Dataset: https://www.kaggle.com/c/fakenewskdd2020

## Step1: Import the needed tools and datasets
First, import the required tools: pandas and NumPy for handling dataframes and arrays, scikit-learn's vectorizers for converting text to vectors, XGBoost and LightGBM for building the models, and scikit-learn's metrics for measuring model performance.
____
After loading the data, we can see that the training set has two columns of interest. The "text" column is the input variable X. The "label" column is the target variable Y, where 1 represents fake and 0 represents true.

In [1]:
#import library
import pandas as pd
import numpy as np
import xgboost as xgb
import lightgbm as lgb
from sklearn import metrics
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction import text
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
#load datasets (train/test are tab-separated; pass the separator by keyword)
df_train = pd.read_csv("train.csv", sep="\t", encoding='utf-8', header=0)
df_test = pd.read_csv("test.csv", sep="\t", encoding='utf-8', header=0)
df_sub = pd.read_csv("sample_submission.csv", encoding='utf-8', header=0)
df_train

|      | text | label |
|------|------|-------|
| 0 | Get the latest from TODAY Sign up for our news... | 1 |
| 1 | 2d Conan On The Funeral Trump Will Be Invited... | 1 |
| 2 | It’s safe to say that Instagram Stories has fa... | 0 |
| 3 | Much like a certain Amazon goddess with a lass... | 0 |
| 4 | At a time when the perfect outfit is just one ... | 0 |
| ... | ... | ... |
| 4982 | The storybook romance of WWE stars John Cena a... | 0 |
| 4983 | The actor told friends he’s responsible for en... | 0 |
| 4984 | Sarah Hyland is getting real. The Modern Fami... | 0 |
| 4985 | Production has been suspended on the sixth and... | 0 |


## Step2: Data Pre-processing
Separate X and Y for the training and test sets. After setting stop words to remove uninformative words, convert the text to TF-IDF vectors so that each document is represented by numeric features. Setting `max_features=1800` keeps only the 1800 terms with the highest term frequency across the corpus, and their TF-IDF values form the input variable X.

In [3]:
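#set x&y train and test
x_train = df_train['text']
y_train = df_train['label'].tolist()
x_test = df_test['text']
#NOTE: y_test comes from sample_submission.csv; if that file only holds placeholder
#labels (as Kaggle sample submissions usually do), the scores below do not measure true accuracy
y_test = pd.to_numeric(df_sub['label']).tolist()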
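
Kaggle competitions generally do not publish the true test labels, so one way to get a score that reflects real accuracy is to hold out part of the labelled training data for validation. The sketch below is a minimal, standard-library illustration of a stratified split. `stratified_holdout`, `val_fraction`, and the toy labels are names introduced here for illustration and are not part of the original notebook. In practice `sklearn.model_selection.train_test_split` with its `stratify` argument does the same job.

```python
import random
from collections import defaultdict

def stratified_holdout(labels, val_fraction=0.2, seed=99):
    """Return (train_idx, val_idx) keeping class proportions roughly equal."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # shuffle within each class
        n_val = round(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])           # first part goes to validation
        train_idx.extend(indices[n_val:])         # the rest stays in training
    return sorted(train_idx), sorted(val_idx)

# Toy labels standing in for df_train['label'] (6 true, 4 fake).
toy_labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
train_idx, val_idx = stratified_holdout(toy_labels, val_fraction=0.5)
print("train:", train_idx)
print("validation:", val_idx)
```

The returned indices could then be used to slice both the text and the labels, so that the models are scored against labels that are known to be correct.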

In [4]:
#set stop words
#recent scikit-learn versions expect stop_words as a list (or the string 'english')
stopwords = list(text.ENGLISH_STOP_WORDS)

In [5]:
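#transform text to vector by Tfidf
vectorizer = TfidfVectorizer(
            norm='l2',                #L2-normalize each document vector
            stop_words=stopwords,
            max_features=1800         #keep the 1800 most frequent terms
            )

#fit the vocabulary and IDF weights on the training text only,
#then reuse them on the test text so both matrices share the same feature columns
X_train = vectorizer.fit_transform(x_train).toarray()
X_test = vectorizer.transform(x_test).toarray()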
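
To show why the vectorizer should be fitted on the training text and only applied to the test text, here is a simplified pure-Python sketch of TF-IDF. It assumes whitespace tokenization and no stop-word removal, and it uses the smoothed IDF and L2 normalization that `TfidfVectorizer` applies by default. `fit_tfidf`, `transform_tfidf`, and the toy sentences are illustrative names and data, not part of the notebook.

```python
import math
from collections import Counter

def fit_tfidf(docs, max_features):
    """Learn a vocabulary and smoothed IDF weights from the training docs."""
    tokenized = [doc.lower().split() for doc in docs]
    corpus_counts = Counter(tok for toks in tokenized for tok in toks)
    # Keep the terms with the highest total count across the corpus.
    vocab = sorted(term for term, _ in corpus_counts.most_common(max_features))
    n_docs = len(docs)
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    return vocab, idf

def transform_tfidf(docs, vocab, idf):
    """Map docs onto the already learned vocabulary; unknown words are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row))
        rows.append([v / norm if norm else 0.0 for v in row])
    return rows

toy_train = ["fake news spreads fast", "real news travels slowly", "fake claims spread fast"]
toy_test = ["unseen words only", "fake news"]

vocab, idf = fit_tfidf(toy_train, max_features=3)
toy_X_train = transform_tfidf(toy_train, vocab, idf)
toy_X_test = transform_tfidf(toy_test, vocab, idf)   # same columns as toy_X_train
print(vocab)
print(toy_X_test)
```

Because the test documents are projected onto the vocabulary learned from the training documents, column `j` means the same term in both matrices. Fitting a second vocabulary on the test text would break that correspondence.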

## Step3: Construct Model
In this case, we use two models, XGBoost and LightGBM. The parameters below were chosen after several trial runs.
### XGBoost
1. The learning rate is set to 0.5 (0.1 performed worse in my trials).
2. The number of trees is set to 100 (accuracy did not improve with more trees).
3. The maximum tree depth is set to 6 (10 was too deep and 5 too shallow in my trials).
4. The objective is set to `binary:logistic`, that is, logistic loss for binary classification.
5. `gamma`, the minimum loss reduction required to make a further split, is set to 5.
___
### LightGBM
1. The learning rate is set to 0.5 (0.1 performed worse in my trials).
2. The number of leaves per tree is set to 50 (100 was too high).
3. The number of trees is set to 120 (120 was slightly better than 100).
4. `max_bin`, the maximum number of histogram bins used to bucket feature values, is set to 200.
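
The manual trial and error described above can be made repeatable with a small loop over candidate settings. The following is only a sketch: `evaluate` is a hypothetical callback that would train a model with the given parameters and return a validation score, and `toy_evaluate` is a made-up stand-in so the example runs without any data. scikit-learn's `GridSearchCV` offers the same idea with cross-validation built in.

```python
from itertools import product

def grid_search(param_grid, evaluate):
    """Try every parameter combination and return (best_params, best_score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Candidate values taken from the trials mentioned in the lists above.
xgb_grid = {"learning_rate": [0.1, 0.5], "max_depth": [5, 6, 10]}

def toy_evaluate(params):
    # Made-up score for illustration only; a real version would fit
    # xgb.XGBClassifier(**params) and return accuracy on held-out data.
    return -abs(params["learning_rate"] - 0.5) - abs(params["max_depth"] - 6) / 10

best_params, best_score = grid_search(xgb_grid, toy_evaluate)
print(best_params)
```

Recording every combination and its score this way makes it easier to justify a choice such as a depth of 6 over 5 or 10.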

In [6]:
#applying XGBoost model

#set parameters
XGB_Classifier = xgb.XGBClassifier(learning_rate=0.5,           #shrinkage per boosting step
                              n_estimators=100,                 #number of trees
                              max_depth=6,                      #maximum tree depth
                              gamma=5,                          #minimum loss reduction to split
                              objective='binary:logistic',
                              random_state=99
                              )
#training model
XGB_Classifier.fit(X_train, y_train)
#predicting
Xgb_pred = XGB_Classifier.predict(X_test).astype(int)


In [7]:
#reviewing model performance
Xgb_accuracy = accuracy_score(y_test, Xgb_pred)
Xgb_precision = metrics.precision_score(y_test, Xgb_pred)
Xgb_recall = metrics.recall_score(y_test, Xgb_pred)
Xgb_F_measure = metrics.f1_score(y_test, Xgb_pred)

print("Accuracy: %f" % Xgb_accuracy)
print("Precision: %f" % Xgb_precision)
print("Recall: %f" % Xgb_recall)
print("F_measure: %f" % Xgb_F_measure)

Accuracy: 0.530072
Precision: 0.540052
Recall: 0.338736
F_measure: 0.416335


In [8]:
XGBC_report = classification_report(y_test, Xgb_pred)
print(XGBC_report)

              precision    recall  f1-score   support

           0       0.53      0.72      0.61       630
           1       0.54      0.34      0.42       617

    accuracy                           0.53      1247
   macro avg       0.53      0.53      0.51      1247
weighted avg       0.53      0.53      0.51      1247
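
To make the relationship between these numbers explicit, the four summary metrics can be recomputed from confusion-matrix counts. The counts below are not printed by the notebook. They are back-derived here from the reported XGBoost precision, recall, and class supports (617 fake, 630 true), so treat them as an inferred assumption. `binary_metrics` is a helper name introduced for this illustration.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 for the positive class (label 1)."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    return accuracy, precision, recall, f1

# Counts inferred from the XGBoost output above (assumption, not notebook output):
# 209 fake articles caught, 178 true articles flagged as fake,
# 408 fake articles missed, 452 true articles correctly passed.
accuracy, precision, recall, f1 = binary_metrics(tp=209, fp=178, fn=408, tn=452)

print("Accuracy: %f" % accuracy)
print("Precision: %f" % precision)
print("Recall: %f" % recall)
print("F_measure: %f" % f1)
```

Read this way, a recall of 0.34 for class 1 means that roughly two thirds of the articles labelled fake were predicted as true, while an accuracy near 0.53 on two almost equally sized classes is close to what random guessing would give.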



In [9]:
#applying LightGBM model

#set parameters
LGB_Classifier = lgb.LGBMClassifier(
                      learning_rate=0.5,    #shrinkage per boosting step
                      num_leaves=50,        #maximum leaves per tree
                      n_estimators=120,     #number of trees
                      max_bin=200,          #maximum histogram bins per feature
                      random_state=99,
                      device='cpu'
                      )
#training model
LGB_Classifier.fit(X_train, y_train)
#predicting
Lgb_pred = LGB_Classifier.predict(X_test).astype(int)

In [10]:
#reviewing model performance
Lgb_accuracy = accuracy_score(y_test, Lgb_pred)
Lgb_precision = metrics.precision_score(y_test, Lgb_pred)
Lgb_recall = metrics.recall_score(y_test, Lgb_pred)
Lgb_F_measure = metrics.f1_score(y_test, Lgb_pred)

print("Accuracy: %f" % Lgb_accuracy)
print("Precision: %f" % Lgb_precision)
print("Recall: %f" % Lgb_recall)
print("F_measure: %f" % Lgb_F_measure)

Accuracy: 0.512430
Precision: 0.514019
Recall: 0.267423
F_measure: 0.351812


In [11]:
LGBC_report = classification_report(y_test, Lgb_pred)
print(LGBC_report)

              precision    recall  f1-score   support

           0       0.51      0.75      0.61       630
           1       0.51      0.27      0.35       617

    accuracy                           0.51      1247
   macro avg       0.51      0.51      0.48      1247
weighted avg       0.51      0.51      0.48      1247

