# Types of Boosting

Boosting ensembles train models in sequence, with each model correcting the mistakes of the models before it. I am going to use a library called CatBoost to implement boosting on our chosen data set. First, I need to install CatBoost.
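
To make the idea of sequential error correction concrete, here is a minimal sketch of boosting on a toy regression problem: each shallow tree is fitted to the residuals left by the ensemble so far. This is not how CatBoost is implemented internally. The helper names (`fit_boosted_stumps`, `predict_boosted`), the toy data and the learning rate are all assumptions made for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_boosted_stumps(X, y, n_rounds=20, learning_rate=0.5):
    """Fit depth-1 trees in sequence, each on the current residuals."""
    base = float(np.mean(y))              # start from a constant prediction
    prediction = np.full(len(y), base)
    trees = []
    for _ in range(n_rounds):
        residuals = y - prediction        # mistakes of the ensemble so far
        tree = DecisionTreeRegressor(max_depth=1, random_state=0)
        tree.fit(X, residuals)            # the next model learns those mistakes
        prediction = prediction + learning_rate * tree.predict(X)
        trees.append(tree)
    return base, trees


def predict_boosted(base, trees, X, learning_rate=0.5):
    """Sum the constant start value and every tree's scaled correction."""
    prediction = np.full(X.shape[0], base)
    for tree in trees:
        prediction = prediction + learning_rate * tree.predict(X)
    return prediction


# Hypothetical toy data: a noiseless sine curve
rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(200, 1))
y_toy = np.sin(X_toy[:, 0])

for n in (1, 5, 20):
    base, trees = fit_boosted_stumps(X_toy, y_toy, n_rounds=n)
    error = np.mean((y_toy - predict_boosted(base, trees, X_toy)) ** 2)
    print(n, "rounds, training MSE:", round(float(error), 4))
```

The training error should shrink as rounds are added, because every new tree only has to explain what the previous ones got wrong.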

In [6]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(tf_idf_matrix, df_output, test_size = 0.3, random_state=42)

In [3]:
import numpy as np
import pandas as pd

# %pip installs into the environment of the running kernel
%pip install catboost
%pip install ipywidgets
!jupyter nbextension enable --py widgetsnbextension



Enabling notebook extension jupyter-js-widgets/extension...
      - Validating: OK


In [8]:
import pandas as pd
df = pd.read_csv('news.csv')
df.head(6335)

Unnamed: 0.1,Unnamed: 0,title,text,label
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL
...,...,...,...,...
6330,4490,State Department says it can't find emails fro...,The State Department told the Republican Natio...,REAL
6331,8062,The ‘P’ in PBS Should Stand for ‘Plutocratic’ ...,The ‘P’ in PBS Should Stand for ‘Plutocratic’ ...,FAKE
6332,8622,Anti-Trump Protesters Are Tools of the Oligarc...,Anti-Trump Protesters Are Tools of the Oligar...,FAKE
6333,4021,"In Ethiopia, Obama seeks progress on peace, se...","ADDIS ABABA, Ethiopia —President Obama convene...",REAL


In [2]:

import numpy as np
import pandas as pd

df = pd.read_csv('news.csv')
df['news'] = df['title'] + ' ' + df['text']
convert_to_binary = {'REAL':1,'FAKE':0}
df['label'] = df['label'].map(convert_to_binary)
df = df.drop([df.columns[0],df.columns[1],df.columns[2]],axis=1)
df = df.reindex(columns=['news','label'])

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re

stop_words = stopwords.words('english')
stop_words.extend(['the','it','in'])
WNL = WordNetLemmatizer()

for index, row in df.iterrows():
    filtered_article = ''
    article = row['news']
    article = re.sub(r'[^\w\s]', '', article)
    words = [word.lower() for word in nltk.word_tokenize(article)]
    words = [word for word in words if not word in stop_words]
    words_lemmatized = []
    for word in words:
        if word == 'us':
            words_lemmatized.append(word)
        else:
            words_lemmatized.append(WNL.lemmatize(word))
    filtered_article = " ".join([word for word in words_lemmatized])
    df.loc[index, 'news'] = filtered_article
    
df.head()


# Vectorize the cleaned articles with TF-IDF

df_input = df['news']
df_output = df['label']

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
tf_idf_matrix = vectorizer.fit_transform(df_input)
tf_idf_matrix

<6335x80967 sparse matrix of type '<class 'numpy.float64'>'
	with 1762247 stored elements in Compressed Sparse Row format>
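
To see what the vectorizer produces on a scale that can be read by eye, here is a small sketch on a made-up three-document corpus. The corpus and the names `toy_corpus`, `toy_vectorizer` and `toy_matrix` are assumptions for illustration; only `TfidfVectorizer` itself comes from the cell above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini corpus standing in for df_input
toy_corpus = [
    "budget furlough scheme extended",
    "election interview cut short",
    "budget interview",
]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_corpus)

# One row per document, one column per distinct word in the corpus
print(toy_matrix.shape)

# Mapping from each word to its column index
print(sorted(toy_vectorizer.vocabulary_.items(), key=lambda item: item[1]))

# Only the words that occur in a document get a stored value,
# which is why the result is a sparse matrix
total_cells = toy_matrix.shape[0] * toy_matrix.shape[1]
print(toy_matrix.nnz, "stored values out of", total_cells, "cells")
```

The same logic explains the output above: 6335 rows for the articles, 80967 columns for the distinct words, and only a small fraction of the cells actually stored.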

Let's split the data into a training set and a validation set.

In [3]:
from sklearn.model_selection import train_test_split
# Keep 75% of the rows for training and hold out the rest for validation
x_train, x_validation, y_train, y_validation = train_test_split(tf_idf_matrix, df_output, train_size=0.75, random_state=42)
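
As a quick check of what `train_size=0.75` means in row counts, the sketch below runs the same call on placeholder arrays with the 6335 rows reported for the TF-IDF matrix. The placeholder arrays and their names are assumptions; they only stand in for `tf_idf_matrix` and `df_output`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 6335  # number of articles reported for the TF-IDF matrix

# Placeholders with the right number of rows (not the real features or labels)
placeholder_features = np.zeros((n_rows, 1))
placeholder_labels = np.arange(n_rows) % 2

x_tr, x_val, y_tr, y_val = train_test_split(
    placeholder_features, placeholder_labels, train_size=0.75, random_state=42
)

# Rows used for fitting versus rows held out for validation
print(x_tr.shape[0], x_val.shape[0])
```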

In [6]:
from catboost import CatBoostClassifier
from sklearn.metrics import accuracy_score



ModuleNotFoundError: No module named 'catboost'

In [None]:
model = CatBoostClassifier(custom_loss =['Accuracy'],random_seed=42,logging_level='Silent')
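
Since the import failed above, the model was never fitted in this notebook. The sketch below shows how fitting and scoring could look once CatBoost is installed. It runs on synthetic data rather than the news data, guards the import so that it does nothing when the package is missing, and the function name `fit_catboost_on_toy_data` and the `iterations=50` setting are assumptions of mine.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def fit_catboost_on_toy_data():
    """Return validation accuracy, or None if catboost is not installed."""
    try:
        from catboost import CatBoostClassifier
    except ImportError:
        return None

    # Synthetic stand-in for the news features (assumption, not the real data)
    X, y = make_classification(n_samples=300, n_features=10, random_state=42)
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, train_size=0.75, random_state=42
    )

    model = CatBoostClassifier(
        iterations=50,              # kept small so the sketch runs quickly
        custom_loss=['Accuracy'],
        random_seed=42,
        logging_level='Silent',
        allow_writing_files=False,  # avoid creating a catboost_info folder
    )
    model.fit(X_tr, y_tr, eval_set=(X_val, y_val))

    predictions = np.asarray(model.predict(X_val)).ravel().astype(int)
    return accuracy_score(y_val, predictions)


print(fit_catboost_on_toy_data())
```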

## Using the Boosting Classifiers from scikit-learn

Besides CatBoost, scikit-learn ships its own boosting classifiers. I am going to explore two of them: AdaBoost and the Gradient Boosting Classifier. It will be interesting to see the difference between the accuracy of the CatBoostClassifier and the classifiers from scikit-learn.

In [9]:
from sklearn.ensemble import AdaBoostClassifier
ABC = AdaBoostClassifier()
ABC.fit(x_train,y_train)
Accuracy = ABC.score(x_test,y_test)
print(Accuracy)

0.859375


In [12]:
from sklearn.ensemble import GradientBoostingClassifier
GBC = GradientBoostingClassifier()
GBC.fit(x_train,y_train)
accuracy = GBC.score(x_test,y_test)
print(accuracy)

0.8974175035868006


AdaBoost gives us an accuracy of 86%, which is noticeably higher than that of the decision tree classifier. GradientBoostingClassifier does better still, at about 90%. I now want to test both of these classifiers on unseen data.
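
A single train/test split can flatter or penalise a model by chance, so one way to compare the two classifiers more fairly is cross-validation. The sketch below is only an illustration on synthetic data: `X_demo`, `y_demo`, the reduced `n_estimators=25` and the choice of 5 folds are assumptions, and the scores it prints say nothing about the news data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the TF-IDF features and labels (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=42)

candidates = {
    "AdaBoost": AdaBoostClassifier(n_estimators=25, random_state=42),
    "GradientBoosting": GradientBoostingClassifier(n_estimators=25, random_state=42),
}

cv_scores = {}
for name, clf in candidates.items():
    # Five accuracy values per model, one for each held-out fold
    scores = cross_val_score(clf, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_scores[name] = scores
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

On the real data the same loop could be run with `tf_idf_matrix` and `df_output` in place of the synthetic arrays, at the cost of a longer run time.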

## Testing on Unseen Data

In [13]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re

stop_words = stopwords.words('english')
stop_words.extend(['the','it','in'])
WNL = WordNetLemmatizer()
    

def article_preprocessor (article):
    filtered_article = ''
    article = re.sub(r'[^\w\s]', '', article)
    words = [word.lower() for word in nltk.word_tokenize(article)]
    words = [word for word in words if not word in stop_words]
    words_lemmatized = []
    for word in words:
        if word == 'us':
            words_lemmatized.append(word)
        else:
            words_lemmatized.append(WNL.lemmatize(word))
    filtered_article = " ".join([word for word in words_lemmatized])
    return filtered_article
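
The punctuation step is worth a closer look, because it changes some tokens in a way that is easy to miss. The sketch below uses only the standard library; the example headline is made up, and the whitespace split is a simplified stand-in for `word_tokenize`.

```python
import re

headline = "The U.S. can't wait!"

# Same substitution as in article_preprocessor: drop every character
# that is neither a word character nor whitespace
stripped = re.sub(r'[^\w\s]', '', headline)

# Lower-case and split on whitespace (simplified stand-in for word_tokenize)
tokens = stripped.lower().split()

print(stripped)
print(tokens)
```

After this step "U.S." has become the token "us". My assumption is that this is why the function leaves "us" untouched instead of passing it to the lemmatizer.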

In [14]:
def AdaBoost_classifier (list_of_articles):
    
    # Pre-process the articles
    articles_pp = [article_preprocessor(article) for article in list_of_articles]
    # Series.append was removed in pandas 2.0, so use pd.concat instead
    new_input = pd.concat([df_input, pd.Series(articles_pp)])
    tf_idf_matrix = vectorizer.fit_transform(new_input)
    orig_data_matrix = tf_idf_matrix[:len(df_input)]
    new_data_matrix = tf_idf_matrix[len(df_input):]
    x_train, x_test, y_train, y_test = train_test_split(orig_data_matrix, df_output, random_state=42)
    ABC = AdaBoostClassifier()
    ABC.fit(x_train, y_train)
    accuracy = ABC.score(x_test,y_test)
    print('AdaBoost accuracy: ' + str(accuracy))
    predictions = ABC.predict(new_data_matrix)
    
    return predictions

def GradientBoost_classifier (list_of_articles):
    
    # Pre-process the articles
    articles_pp = [article_preprocessor(article) for article in list_of_articles]
    # Series.append was removed in pandas 2.0, so use pd.concat instead
    new_input = pd.concat([df_input, pd.Series(articles_pp)])
    tf_idf_matrix = vectorizer.fit_transform(new_input)
    orig_data_matrix = tf_idf_matrix[:len(df_input)]
    new_data_matrix = tf_idf_matrix[len(df_input):]
    x_train, x_test, y_train, y_test = train_test_split(orig_data_matrix, df_output, random_state=42)
    GBC = GradientBoostingClassifier()
    GBC.fit(x_train, y_train)
    accuracy = GBC.score(x_test,y_test)
    print('Gradient Boosting accuracy: ' + str(accuracy))
    # The model can now classify the new data
    predictions = GBC.predict(new_data_matrix)
    
    return predictions
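
Both functions refit the vectorizer and retrain the model every time they are called. An alternative, sketched below, is to fit once and then call `transform` on new articles so that they are mapped onto the vocabulary learned from the training data. This is a different design from the functions above, not a description of them; the tiny corpus, the names `demo_vectorizer`, `demo_model` and `classify_new_articles`, and `n_estimators=10` are all assumptions.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny made-up corpus standing in for df_input and df_output (assumption)
train_texts = [
    "chancellor extends furlough scheme budget",
    "state department says emails missing",
    "secret plot exposed shocking truth",
    "shocking miracle cure they hide",
]
train_labels = [1, 1, 0, 0]  # 1 = REAL, 0 = FAKE, as in convert_to_binary

# Fit the vocabulary and the model once, on the training text only
demo_vectorizer = TfidfVectorizer()
train_matrix = demo_vectorizer.fit_transform(train_texts)
demo_model = GradientBoostingClassifier(n_estimators=10, random_state=42)
demo_model.fit(train_matrix, train_labels)


def classify_new_articles(preprocessed_articles):
    """Vectorize with the already-fitted vocabulary, then predict."""
    new_matrix = demo_vectorizer.transform(preprocessed_articles)  # no refit
    return demo_model.predict(new_matrix)


new_matrix = demo_vectorizer.transform(["budget furlough unseenword"])
# Same number of columns as the training matrix; unknown words are ignored
print(new_matrix.shape[1] == train_matrix.shape[1])
print(classify_new_articles(["budget furlough unseenword"]))
```

With this layout the new articles cannot influence the vocabulary or the IDF weights, and the model does not need to be retrained for every batch of predictions.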

In [17]:
# The top news story on the BBC
bbc_news_article = '''The furlough scheme will be extended until the end of September by the chancellor in the Budget later.
Rishi Sunak said the scheme - which pays 80% of employees' wages for the hours they cannot work in the pandemic - would help millions through "the challenging months ahead".
Some 600,000 more self-employed people will also be eligible for government help as access to grants is widened.
But Labour said the support schemes should have been extended "months ago".
Mr Sunak will outline a three-point plan to support people through the coming months, rebuild the economy and "fix" the public finances in the wake of the pandemic when he delivers his statement to the Commons at about 12:30 GMT.
But he has warned of tough economic times ahead and there are reports that he plans to raise some taxes.'''

# Here's a fake news article from the New York Mag
fake_article = '''Twelve days out from judgment day in an election in which he continues to trail badly, President Trump continues to hammer home an issue that will surely resonate with that small slice of still-undecided voters: his supposedly unfair treatment at the hands of CBS’s Lesley Stahl. After two days of promising to release unedited footage of an as-yet-unaired 60 Minutes interview, during which he walked out prematurely because he was upset with Stahl’s line of questioning, the president finally followed through on Thursday. Throughout the interview, Stahl presses Trump on issues from health care (the president says he hopes the Supreme Court strikes down Obamacare, a politically toxic position) to his derogatory comments about Anthony Fauci (Trump claims he was misinterpreted) to his false claims that the Obama campaign spied on him. The tone is of an adversarial back-and-forth, well within normal journalistic bounds. Nevertheless, Trump continuously claims that Joe Biden hasn’t been given similar treatment by CBS and cuts the proceedings short.'''

In [18]:
articles = [bbc_news_article,fake_article]
print(AdaBoost_classifier(articles))
print(GradientBoost_classifier(articles))

AdaBoost accuracy: 0.8832070707070707
[1 0]
Gradient Boosting accuracy: 0.8958333333333334
[1 0]


As we can see, the AdaBoost classifier, which scored 88% on the held-out split, correctly identified the real article (1) and the fake article (0). The Gradient Boosting classifier, which scored 90%, also identified both articles correctly.
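
The printed arrays use the numeric encoding from the preprocessing cell, so a small sketch for turning them back into readable labels may help. The names `convert_to_label` and `readable` are assumptions; `convert_to_binary` is the mapping defined earlier.

```python
# Same mapping as in the preprocessing cell
convert_to_binary = {'REAL': 1, 'FAKE': 0}

# Invert it so that predictions can be read as labels
convert_to_label = {value: key for key, value in convert_to_binary.items()}

predictions = [1, 0]  # the values printed by both classifiers
readable = [convert_to_label[p] for p in predictions]
print(readable)
```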