# Classifying fake news using supervised learning with NLP

We use the scikit-learn package for supervised learning.

We use bag-of-words models or TF-IDF models as features to build supervised learning data from text.

![](Image/Image11.jpg)

## Data Preprocessing

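1. Import the required packages: pandas, plus `train_test_split` and `CountVectorizer` from sklearn.
2. Load the data into a DataFrame.
3. Split the data into a training set and a testing set.
4. Create a sklearn `CountVectorizer` and pass it `stop_words = 'english'`. (Much like Gensim, it converts text into bag-of-words vectors, and it can also remove stop words.)
5. Call the `CountVectorizer` object's `fit_transform` method to turn the training set into word-count vectors: it learns the vocabulary (an id for each word) and records how many times each word appears in each document.
6. Every token that `CountVectorizer` produces is a feature for this classification problem.
7. Because the training features were transformed into word vectors, the testing data must go through the same transformation. Use `transform` (not `fit_transform`) so that the vocabulary learned from the training set is reused.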

Example:

![](Image/Image12.jpg)
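
To make the fit/transform steps above concrete, here is a minimal plain-Python sketch of the bag-of-words idea. It is not scikit-learn's implementation: the toy corpus, the one-word stop list, and the helper names `tokenize`, `fit_vocabulary`, and `transform` are all assumptions for illustration.

```python
from collections import Counter

# Hypothetical toy corpus; the real notebook uses the news articles in df["text"].
train_docs = ["the cat sat", "the cat ate the fish"]
test_docs = ["the dog ate"]

STOP_WORDS = {"the"}  # tiny stand-in for a built-in English stop-word list


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [tok for tok in text.lower().split() if tok not in STOP_WORDS]


def fit_vocabulary(docs):
    """Map each distinct training token to a column index (sorted alphabetically)."""
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def transform(docs, vocabulary):
    """Count token occurrences per document; tokens outside the vocabulary are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows


vocabulary = fit_vocabulary(train_docs)        # learned from the training data only
count_train = transform(train_docs, vocabulary)
count_test = transform(test_docs, vocabulary)  # same columns as count_train

print(vocabulary)
print(count_train)
print(count_test)
```

The word "dog" appears only in the test document, so it has no column. This is why the test set is transformed with the vocabulary fitted on the training set.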

### Import dataset 

This dataset has four columns:
1. Identifier
2. Title (news headline)
3. Text (news body)
4. Label (whether the article is fake news: FAKE or REAL)

In [14]:
import pandas as pd
import numpy as np

df = pd.read_csv("Datasets/fake_or_real_news.csv")

Exercise 1: implement CountVectorizer (the results are integers, because it returns the number of times each word appears)

In [4]:
# Import the necessary modules
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Print the head of df
print(df.head())

# Create a series to store the labels: y
y = df.label

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(df["text"], y, test_size = 0.33, random_state = 53)

# Initialize a CountVectorizer object: count_vectorizer
count_vectorizer = CountVectorizer(stop_words = 'english')

# Transform the training data using only the 'text' column values: count_train 
count_train = count_vectorizer.fit_transform(X_train)

# Transform the test data using only the 'text' column values: count_test 
count_test = count_vectorizer.transform(X_test)

# Print the first 10 features of the count_vectorizer
print(count_vectorizer.get_feature_names()[:10])

   Unnamed: 0                                              title  \
0        8476                       You Can Smell Hillary’s Fear   
1       10294  Watch The Exact Moment Paul Ryan Committed Pol...   
2        3608        Kerry to go to Paris in gesture of sympathy   
3       10142  Bernie supporters on Twitter erupt in anger ag...   
4         875   The Battle of New York: Why This Primary Matters   

                                                text label  
0  Daniel Greenfield, a Shillman Journalism Fello...  FAKE  
1  Google Pinterest Digg Linkedin Reddit Stumbleu...  FAKE  
2  U.S. Secretary of State John F. Kerry said Mon...  REAL  
3  — Kaydee King (@KaydeeKing) November 9, 2016 T...  FAKE  
4  It's primary day in New York and front-runners...  REAL  
['00', '000', '0000', '00000031', '000035', '00006', '0001', '0001pt', '000ft', '000km']


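The `get_feature_names()` method returns the column names after the transformation. (scikit-learn 1.0 deprecated it in favor of `get_feature_names_out()`, and 1.2 removed it.)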

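Exercise 2: try another sklearn model, TfidfVectorizer (the results are floats, because tf-idf weights each token's term frequency by its inverse document frequency)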

In [5]:
# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize a TfidfVectorizer object: tfidf_vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words="english", max_df=0.7)

# Transform the training data: tfidf_train 
tfidf_train = tfidf_vectorizer.fit_transform(X_train)

# Transform the test data: tfidf_test 
tfidf_test = tfidf_vectorizer.transform(X_test)

# Print the first 10 features
print(tfidf_vectorizer.get_feature_names()[:10])

# Print the first 5 vectors of the tfidf training data
print(tfidf_train.A[:5])

['00', '000', '0000', '00000031', '000035', '00006', '0001', '0001pt', '000ft', '000km']
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


In the example above, `.A` is shorthand for `.toarray()`: it converts the sparse matrix into a dense NumPy array.
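
To show where the float values come from, here is a rough plain-Python sketch of a tf-idf weighting. It assumes the smoothed idf formula `ln((1 + n) / (1 + df)) + 1` followed by L2 row normalization, which matches my understanding of `TfidfVectorizer`'s defaults. The counts and vocabulary are made up.

```python
import math

# Hypothetical term counts for three documents over a three-word vocabulary.
vocabulary = ["election", "fraud", "senate"]
counts = [
    [2, 0, 1],
    [1, 1, 0],
    [1, 0, 0],
]

n_docs = len(counts)

# Document frequency: in how many documents does each term appear?
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocabulary))]

# Smoothed idf: ln((1 + n) / (1 + df)) + 1
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]


def tfidf_row(row):
    """Weight raw counts by idf, then scale the row to unit Euclidean length."""
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm if norm else 0.0 for v in weighted]


tfidf = [tfidf_row(row) for row in counts]

for term, w in zip(vocabulary, idf):
    print(f"{term}: idf = {w:.3f}")
for row in tfidf:
    print([round(v, 3) for v in row])
```

A term that appears in every document ("election" here) gets the lowest idf, so rarer terms carry more weight in each row.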

Exercise 3: compare the features built by CountVectorizer and TfidfVectorizer

In [6]:
# Create the CountVectorizer DataFrame: count_df
count_df = pd.DataFrame(count_train.A, columns=count_vectorizer.get_feature_names())

# Create the TfidfVectorizer DataFrame: tfidf_df
tfidf_df = pd.DataFrame(tfidf_train.A, columns=tfidf_vectorizer.get_feature_names())

# Print the head of count_df
print(count_df.head())

# Print the head of tfidf_df
print(tfidf_df.head())

# Calculate the difference in columns: difference
difference = set(count_df.columns) - set(tfidf_df.columns)
print(difference)

# Check whether the DataFrames are equal
print(count_df.equals(tfidf_df))

   00  000  0000  00000031  000035  00006  0001  0001pt  000ft  000km  ...  \
0   0    0     0         0       0      0     0       0      0      0  ...   
1   0    0     0         0       0      0     0       0      0      0  ...   
2   0    0     0         0       0      0     0       0      0      0  ...   
3   0    0     0         0       0      0     0       0      0      0  ...   
4   0    0     0         0       0      0     0       0      0      0  ...   

   حلب  عربي  عن  لم  ما  محاولات  من  هذا  والمرضى  ยงade  
0    0     0   0   0   0        0   0    0        0      0  
1    0     0   0   0   0        0   0    0        0      0  
2    0     0   0   0   0        0   0    0        0      0  
3    0     0   0   0   0        0   0    0        0      0  
4    0     0   0   0   0        0   0    0        0      0  

[5 rows x 56922 columns]
    00  000  0000  00000031  000035  00006  0001  0001pt  000ft  000km  ...  \
0  0.0  0.0   0.0       0.0     0.0    0.0   0.0     0.0    

---


## Naive Bayes Classifier

It uses a conditional probability model to predict the outcome for a given input.

It is a simple and effective tool to build a fake news classifier.

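Example: Given a particular piece of data, how likely is a particular outcome? <br/>
Given a movie description, decide which genre the movie belongs to based on the words it contains. <br/>
Given a news article, decide whether it is fake news based on the words it contains. <br/>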

### Naive Bayes Classifier in Sklearn

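The Multinomial Naive Bayes model is well suited to the word vectors produced by CountVectorizer, because MultinomialNB is designed for integer count features.

As a result, Multinomial Naive Bayes may not perform as well on float features such as the output of TfidfVectorizer. If I use TfidfVectorizer to turn text into word vectors, I can try Multinomial Naive Bayes first; if the results are disappointing, I can build the classifier with an SVM or a linear model instead.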
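
To illustrate why count features fit this model, here is a minimal plain-Python sketch of a multinomial naive Bayes classifier. It is a simplified stand-in, not scikit-learn's code: the three-word count vectors, the labels, and the function names `fit_multinomial_nb` and `predict` are hypothetical.

```python
import math

# Hypothetical count vectors (columns: "election", "shocking", "senate") and labels.
X_train = [
    [0, 3, 0],
    [1, 2, 0],
    [2, 0, 2],
    [1, 0, 3],
]
y_train = ["FAKE", "FAKE", "REAL", "REAL"]


def fit_multinomial_nb(X, y, alpha=1.0):
    """Return log priors and smoothed per-class log likelihoods of each feature."""
    classes = sorted(set(y))
    n_features = len(X[0])
    log_prior, log_likelihood = {}, {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        # Total count of each feature within class c
        feature_totals = [sum(r[j] for r in rows) for j in range(n_features)]
        denom = sum(feature_totals) + alpha * n_features
        log_likelihood[c] = [math.log((t + alpha) / denom) for t in feature_totals]
    return log_prior, log_likelihood


def predict(x, log_prior, log_likelihood):
    """Pick the class with the highest log posterior (up to a constant)."""
    scores = {
        c: log_prior[c] + sum(n * lp for n, lp in zip(x, log_likelihood[c]))
        for c in log_prior
    }
    return max(scores, key=scores.get)


log_prior, log_likelihood = fit_multinomial_nb(X_train, y_train, alpha=1.0)
print(predict([0, 2, 0], log_prior, log_likelihood))
print(predict([1, 0, 2], log_prior, log_likelihood))
```

Each word count multiplies that word's log likelihood, so a word that appears twice counts as two independent pieces of evidence. That reading is natural for integer counts and less so for tf-idf weights.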

---

### Hands-on: model the fake news classifier with Multinomial Naive Bayes (CountVectorizer output is integers)

In [7]:
# import library and function
from sklearn.naive_bayes import MultinomialNB

# Evaluate model performance
from sklearn import metrics

In [10]:
# Initialize the naive Bayes classifier
nb_classifier = MultinomialNB()

# Fit the model
nb_classifier.fit(count_train, y_train)

# Predict (classify)
pred = nb_classifier.predict(count_test)

# Evaluate the predictions (accuracy)
print("Accuracy: "+str(metrics.accuracy_score(y_test, pred)))    # pass the true labels first, then the predictions

# Evaluate the model (confusion matrix)
metrics.confusion_matrix(y_test, pred, labels = ['FAKE', "REAL"])

Accuracy: 0.893352462936394


array([[ 865,  143],
       [  80, 1003]], dtype=int64)

### Fitting MultinomialNB to the TfidfVectorizer output (floats)

Check whether the **TfidfVectorizer** output is a good fit for MultinomialNB

In [12]:
# Create a Multinomial Naive Bayes classifier: nb_classifier
nb_classifier = MultinomialNB()

# Fit the classifier to the training data
nb_classifier.fit(tfidf_train, y_train)

# Create the predicted tags: pred
pred = nb_classifier.predict(tfidf_test)

# Calculate the accuracy score: score
score = metrics.accuracy_score(y_test, pred)
print("Accuracy: "+str(score))

# Calculate the confusion matrix: cm
cm = metrics.confusion_matrix(y_test, pred, labels = ["FAKE","REAL"])
print(cm)

Accuracy: 0.8565279770444764
[[ 739  269]
 [  31 1052]]
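
The two confusion matrices can be read more closely with a few lines of plain Python. This sketch copies the matrices from the outputs above and recomputes accuracy plus per-class recall; the helper names `accuracy` and `recall` are my own.

```python
# Confusion matrices copied from the outputs above; rows are true labels and
# columns are predicted labels, both in the order FAKE, REAL.
count_cm = [[865, 143], [80, 1003]]
tfidf_cm = [[739, 269], [31, 1052]]


def accuracy(cm):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


def recall(cm, i):
    """Share of true class-i samples that were predicted as class i."""
    return cm[i][i] / sum(cm[i])


for name, cm in [("CountVectorizer", count_cm), ("TfidfVectorizer", tfidf_cm)]:
    print(name)
    print("  accuracy:   ", round(accuracy(cm), 4))
    print("  FAKE recall:", round(recall(cm, 0), 4))
    print("  REAL recall:", round(recall(cm, 1), 4))
```

With the tf-idf features, FAKE recall falls from about 0.86 to about 0.73 while REAL recall rises from about 0.93 to about 0.97, so the lower accuracy comes mostly from fake articles being labeled REAL.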


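### Testing different alpha values

`alpha` is the additive (Laplace/Lidstone) smoothing parameter of MultinomialNB. Its default is 1.0; here we try values from 0 to 0.9.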

In [15]:
# Create the list of alphas: alphas
alphas = np.arange(0, 1, 0.1)

# Define train_and_predict()
def train_and_predict(alpha):
    # Instantiate the classifier: nb_classifier
    nb_classifier = MultinomialNB(alpha = alpha)
    # Fit to the training data
    nb_classifier.fit(tfidf_train, y_train)
    # Predict the labels: pred
    pred = nb_classifier.predict(tfidf_test)
    # Compute accuracy: score
    score = metrics.accuracy_score(y_test, pred)
    return score

# Iterate over the alphas and print the corresponding score
for alpha in alphas:
    print('Alpha: ', alpha)
    print('Score: ', train_and_predict(alpha))
    print()

Alpha:  0.0
Score:  0.8813964610234337

Alpha:  0.1
Score:  0.8976566236250598

Alpha:  0.2
Score:  0.8938307030129125

Alpha:  0.30000000000000004
Score:  0.8900047824007652

Alpha:  0.4
Score:  0.8857006217120995

Alpha:  0.5
Score:  0.8842659014825442

Alpha:  0.6000000000000001
Score:  0.874701099952176

Alpha:  0.7000000000000001
Score:  0.8703969392635102

Alpha:  0.8
Score:  0.8660927785748446

Alpha:  0.9
Score:  0.8589191774270684



Warning raised for `alpha = 0`, which scikit-learn considers too small (output truncated):

  'setting alpha = %.1e' % _ALPHA_MIN)
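
A small sketch of what alpha does, assuming the usual additive smoothing formula `(count + alpha) / (total + alpha * vocabulary size)`. The token counts are hypothetical and `smoothed_prob` is an illustrative helper, not a scikit-learn function.

```python
import math

# Hypothetical per-class token counts: "senate" never appears in this class.
token_counts = {"election": 3, "shocking": 5, "senate": 0}
total = sum(token_counts.values())
vocab_size = len(token_counts)


def smoothed_prob(count, alpha):
    """Additive smoothing: (count + alpha) / (total + alpha * vocabulary size)."""
    return (count + alpha) / (total + alpha * vocab_size)


for alpha in [0.0, 0.1, 0.5, 1.0]:
    p_unseen = smoothed_prob(token_counts["senate"], alpha)
    p_seen = smoothed_prob(token_counts["shocking"], alpha)
    # log(0) is undefined, so report -inf explicitly for the unsmoothed case
    log_unseen = math.log(p_unseen) if p_unseen > 0 else float("-inf")
    print(f"alpha={alpha}: P(senate)={p_unseen:.4f}, "
          f"log P(senate)={log_unseen:.3f}, P(shocking)={p_seen:.4f}")
```

With `alpha = 0` an unseen token gets probability zero, and its logarithm is undefined. Larger alpha values pull all token probabilities toward a uniform distribution.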


### Inspecting the model

In [16]:
# Get the class labels: class_labels
class_labels = nb_classifier.classes_

# Extract the features: feature_names
feature_names = tfidf_vectorizer.get_feature_names()

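# Zip the feature names together with the coefficient array and sort by weights: feat_with_weights
# (coef_ was deprecated for MultinomialNB in scikit-learn 0.24 and later removed;
# for this two-class model, feature_log_prob_[1] holds the same values)
feat_with_weights = sorted(zip(nb_classifier.coef_[0], feature_names))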

# Print the first class label and the top 20 feat_with_weights entries
print(class_labels[0], feat_with_weights[:20])

# Print the second class label and the bottom 20 feat_with_weights entries
print(class_labels[1], feat_with_weights[-20:])

FAKE [(-11.316312804238807, '0000'), (-11.316312804238807, '000035'), (-11.316312804238807, '0001'), (-11.316312804238807, '0001pt'), (-11.316312804238807, '000km'), (-11.316312804238807, '0011'), (-11.316312804238807, '006s'), (-11.316312804238807, '007'), (-11.316312804238807, '007s'), (-11.316312804238807, '008s'), (-11.316312804238807, '0099'), (-11.316312804238807, '00am'), (-11.316312804238807, '00p'), (-11.316312804238807, '00pm'), (-11.316312804238807, '014'), (-11.316312804238807, '015'), (-11.316312804238807, '018'), (-11.316312804238807, '01am'), (-11.316312804238807, '020'), (-11.316312804238807, '023')]
REAL [(-7.742481952533027, 'states'), (-7.717550034444668, 'rubio'), (-7.703583809227384, 'voters'), (-7.654774992495461, 'house'), (-7.649398936153309, 'republicans'), (-7.6246184189367, 'bush'), (-7.616556675728881, 'percent'), (-7.545789237823644, 'people'), (-7.516447881078008, 'new'), (-7.448027933291952, 'party'), (-7.411148410203476, 'cruz'), (-7.410910239085596, 'st
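
For a two-class MultinomialNB, `coef_[0]` holds the log probabilities of one class only, so the low end of the sorted list is mostly rare tokens rather than FAKE indicators. One possible alternative, sketched here with made-up numbers, is to rank tokens by the difference between the two classes' log probabilities. The values below are hypothetical and do not come from the notebook's model.

```python
# Hypothetical per-class log probabilities for four tokens (not taken from the
# notebook's model); in scikit-learn these would come from feature_log_prob_.
feature_names = ["0000", "shocking", "senate", "said"]
log_prob_fake = [-11.3, -6.1, -8.9, -5.0]
log_prob_real = [-11.3, -8.4, -6.2, -4.9]

# Positive ratio: the token is relatively more likely under REAL;
# negative ratio: relatively more likely under FAKE.
log_ratio = sorted(
    (real - fake, name)
    for name, fake, real in zip(feature_names, log_prob_fake, log_prob_real)
)

print("Most FAKE-leaning:", log_ratio[0])
print("Most REAL-leaning:", log_ratio[-1])
```

A rare token such as "0000" has a very low log probability in both classes, so its ratio is near zero and it no longer tops either list.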



---

# Various problems in NLP

1. Translation
2. Sentiment Analysis
3. Language Biases (if a model is trained on biased word vectors, the model may be biased too)