# Real or Fake? Detecting Fake news.

Fake news is rife. It is false information, presented with little to no evidence, that is spread to mislead, harm or influence its readers. In this notebook, we will explore datasets of genuine and fake news articles through graphs and examples such as word frequencies, bi-grams/tri-grams and word clouds. We will then apply classifiers to these datasets to see how well a machine can distinguish between the two classes. Finally, we will evaluate these models using metrics such as precision, recall, F1 score and ROC.

Afterwards, the best model will be used to create a web application where users can input news articles and let the model predict whether the article is real or fake.

The dataset for this project is from Kaggle. Click [Here](https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset/notebooks)

In [None]:
# Import our visual libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

# Import text cleaning libraries
import re
from bs4 import BeautifulSoup
import nltk

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


## Import the data

In [None]:
# Input full file path for True news dataset
real = pd.read_csv("/kaggle/input/fake-and-real-news-dataset/True.csv")
real.head()

In [None]:
# Input full file path for Fake news dataset
fake = pd.read_csv("/kaggle/input/fake-and-real-news-dataset/Fake.csv")
fake.head()

Let's check the number of entries in each dataset.

In [None]:
print("Real news count: " + str(len(real)))
print("Fake news count: " + str(len(fake)))
print("Total available entries: " + str(len(real) + len(fake)))

In [None]:
# Merge the datasets
real["label"] = 1
fake["label"] = 0

frame = [real, fake]
df = pd.concat(frame)

df.head()

In [None]:
df.tail()

In [None]:
# Check to see if any columns have missing values
df.isnull().sum()

In [None]:
# Get basic information on dataset
df.info()

In [None]:
# Reset the index so that indexes from the two original datasets do not overlap
df.reset_index(drop=True, inplace=True)

In [None]:
df.columns

Check an example of a fake news text:

In [None]:
df["text"][40000]

You can probably tell from reading this article that it is a sensationalist piece designed to prey on the fears of the reader. Examples include the mention of graphic imagery and writing that does not read as professional.

Example of genuine news text:

In [None]:
df["text"][10000]

Here we can see the piece is well written by a professional writer, and the subject is politics. These may be naive assumptions drawn from reading only two brief texts. In the next section, we will dive deeper into the data to better understand what differentiates genuine and fake news.

## Exploratory Data Analysis

**Label Count**

In [None]:
# 0 for fake
# 1 for true
sns.set(style="darkgrid")
sns.countplot(x=df["label"])

In [None]:
df["subject"].value_counts()

**Subject count by Label**

In [None]:
# Chart to show count of subject by label
plt.figure(figsize=(12,9))
sns.set(style="darkgrid")
sns.countplot(x=df["subject"], hue=df["label"])

Looking at the chart above, genuine news has the subjects **politicsNews** and **worldnews**. Fake news has the subjects **News, politics, Government News, left-news, US_News** and **Middle-east**.

**Publish date analysis**

In [None]:
set(df["date"])

The date column contains http links and news article titles. Month names may be abbreviated, or an entry may contain only the month.

In [None]:
# Filter out dates with http links
httpremove = "http"
filter1 = df["date"].str.contains(httpremove)
df_ = df[filter1]
df_

In [None]:
df_["text"][30775]

In [None]:
# Rows with invalid dates and meaningless text
df_ = df[df["date"].apply(lambda x: len(x) > 20)]
df_

It's okay to remove entries that are not considered news: all the text in this dataset contains http links and left-behind code.
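
The length check used in this notebook is a quick heuristic. As an alternative, here is a minimal sketch that asks pandas whether each value can be parsed as a date at all. The helper name `is_parseable_date` and the sample values are assumptions made for illustration; they are not taken from the dataset.

```python
import pandas as pd

def is_parseable_date(value):
    """Return True when pandas can parse the value as a date."""
    # Parse one value at a time so each string is interpreted on its own,
    # whatever format the other rows use.
    parsed = pd.to_datetime(str(value).strip(), errors="coerce")
    return pd.notna(parsed)

# Hypothetical sample mimicking the kinds of values seen in the date column
sample_dates = pd.Series([
    "December 31, 2017 ",
    "19-Feb-18",
    "Dec 31, 2017",
    "https://example.com/some-article",
])

date_mask = sample_dates.apply(is_parseable_date)
print(sample_dates[date_mask].tolist())
```

Parsing row by row is slower than a vectorized call, but it avoids assuming that every row shares one date format.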

In [None]:
# Want to remove entries which are not dates
df = df[df["date"].apply(lambda x: len(x) < 20)]
df.head()

In [None]:
# 44888 entries after removing non-dates
df["title"].count()

In [None]:
df_ = df.copy()

In [None]:
df_["date"]

In [None]:
# Transform dates to datetime
# Use to_period('M') to get datetime to month
df_['date'] = pd.to_datetime(df_['date']).dt.to_period('M')
df_.head()

In [None]:
# Check count of articles by Year
# Over half of the articles are from 2017
df_["date"].apply(lambda x: (str(x)[:4])).value_counts()

In [None]:
# Get number of articles by Year-Month
# Change date type to string from datetime format
df_["date"] = df_["date"].apply(lambda x: (str(x)[:7]))
df_.head()

In [None]:
# DataFrame of article counts by Year-Month
# Name the columns explicitly rather than relying on value_counts() defaults
year_month = df_["date"].value_counts().sort_index().rename_axis("index").reset_index(name="date")
year_month["index"] = year_month["index"].astype(str)
year_month

In [None]:
# Count of articles by Month
plt.figure(figsize=(12,9))
plt.bar(year_month["index"], year_month["date"])
plt.xticks(rotation=45)
plt.xlabel("Year and Month")
plt.ylabel("Count")
plt.title("Count of articles by Month/Year")
plt.show()

**Count by Month-Year**

In [None]:
# Count articles by Month-Year for each label
df_1 = df_[df_["label"]==1]
df_0 = df_[df_["label"]==0]
# Name the count columns explicitly rather than relying on value_counts() defaults
df_1 = df_1["date"].value_counts().sort_index().rename("true").to_frame()
df_0 = df_0["date"].value_counts().sort_index().rename("false").to_frame()

new_df = df_1.join(df_0, how='outer')
new_df = new_df.rename_axis("index").reset_index()
new_df

In [None]:
# Plot Count of articles by Month
plt.figure(figsize=(15,7))
plt.plot(new_df["index"], new_df["true"], label="True")
plt.plot(new_df["index"], new_df["false"], color="red", label="Fake")
plt.xticks(rotation=45)
plt.legend(facecolor='white')
plt.xlabel("Year and Month")
plt.ylabel("Count")
plt.title("Count of articles by Month/Year")
plt.show()

Looking at the distribution of news articles by Month-Year, we can see that fake news was published from 2015-03 to 2018-02, and true news from 2016-01 to 2017-12. The majority of the real news articles were from August to November of 2017, with at least 2500-3000 each month, making up over half of the total real news dataset.

The bulk of fake news articles were collected between January 2016 and August 2017, with around 700-1000 each month.

**Length of text (by characters)**

In [None]:
# Create side-by-side histograms of True and Fake news text
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(14,9))

true_length = df[df["label"]==1]["text"].str.len()
ax1.set_title("True text")
ax1.hist(true_length, color="blue")

fake_length = df[df["label"]==0]["text"].str.len()
ax2.set_title("Fake text")
ax2.hist(fake_length, color="red")

fig.suptitle("Length of text (by characters)")
plt.show()

True and fake news texts have different distributions: true news texts are mostly around 2500 characters long, while fake news texts are mostly around 5000 characters.
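
Histograms give a visual impression; summary statistics per label make the same comparison in numbers. The sketch below uses a hypothetical four-row frame (`toy`) in place of the notebook's `df`, so the values it prints say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical miniature of the notebook's DataFrame (label 1 = true, 0 = fake)
toy = pd.DataFrame({
    "text": ["short piece", "a somewhat longer genuine article",
             "fake", "a much, much longer fabricated story here"],
    "label": [1, 1, 0, 0],
})

# Character length per article, then summary statistics per label
toy["text_length"] = toy["text"].str.len()
length_summary = toy.groupby("label")["text_length"].agg(["median", "mean", "max"])
print(length_summary)
```

The same two lines should work on `df` itself, assuming the `text` and `label` columns are still present.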

In [None]:
df["title"].count()

**Length of title (by characters)**

In [None]:
# Create side-by-side histograms of True and Fake news titles
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(14,9))

true_length = df[df["label"]==1]["title"].str.len()
ax1.set_title("True titles")
ax1.hist(true_length, color="blue")

fake_length = df[df["label"]==0]["title"].str.len()
ax2.set_title("Fake titles")
ax2.hist(fake_length, color="red")

fig.suptitle("Length of title (by characters)")
plt.show()

Again, the distributions for true and fake news are different. True news titles are mostly 60-80 characters long, while fake news titles tend to be longer at 75-125 characters.

## Data preparation

We need to clean the data to get the most out of the word clouds. This involves converting words to lower case, removing stopwords such as "a" or "as", and removing hyperlinks to other sites. Let's merge the text and title columns together and remove the other columns, because we will concentrate only on the text itself and not the subject matter or release date.
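
To make these steps concrete, here is a minimal, self-contained sketch of the idea on a single sentence. The function `simple_clean` and the tiny `DEMO_STOP_WORDS` set are assumptions for illustration only; the notebook's own cleaning functions use NLTK's stopword list and BeautifulSoup.

```python
import re

# Tiny illustrative stop-word set; the notebook uses NLTK's English list instead
DEMO_STOP_WORDS = {"a", "as", "the", "is", "to", "of"}

def simple_clean(text):
    """Lower-case, drop links and punctuation, then remove stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # remove http/https links
    text = re.sub(r"[^\w\s]", " ", text)        # replace punctuation with spaces
    words = [w for w in text.split() if w not in DEMO_STOP_WORDS]
    return " ".join(words)

print(simple_clean("The Senate is set to vote, as reported: https://example.com/story"))
```

Keeping the stop words in a set rather than a list makes each membership check cheap, which can matter on tens of thousands of articles.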

In [None]:
# Merge text and title columns, remove title, subject and date.
df['text'] = df['title'] + " " + df['text']
del df['title']
del df['subject']
del df['date']

In [None]:
# Check if anything missing after cleaning
df.isna().sum()

In [None]:
df.head()

In [None]:
# Stopwords carry little meaning, so removing them has little effect on the context of the text
# The stopwords in this list are already in lower case
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
print(stop_words)

In [None]:
# Use this to strip punctuation, including apostrophes and abbreviation marks
# df_["news"] = df_['news'].str.replace('[^\w\s]','')
def remove_apostrophe_abbrev(text):
    return re.sub(r'[^\w\s]', '', text)

# Function to remove stopwords
def remove_stop_words(text):
    clean_text = []
    for word in text.split():
        if word.strip().lower() not in stop_words:
            clean_text.append(word)
            
    return " ".join(clean_text)


In [None]:
# Text example of stop words removed
remove_stop_words(df["text"][0])

In [None]:
# Functions for cleaning text

# Remove HTML tags with a parser; using regex for this is a bad idea
def remove_html_tags(text):
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()

# Remove links in text, http and https
# s? in the regex makes the s optional, so both http and https match
def remove_links(text):
    return re.sub(r'https?:\/\/\S+', '', text)

# Remove possessive 's from text
# Two kinds of apostrophes are found in the data
def remove_possessive_pronoun(text):
    return re.sub("’s|'s", '', text)

# Remove round brackets and their contents
def remove_between_brackets(text):
    return re.sub(r'\([^)]*\)', '', text)

# Remove square brackets and their contents
def remove_between_square_brackets(text):
    return re.sub(r'\[[^]]*\]', '', text)

# Remove curly brackets and their contents
# Useful to remove any JavaScript scripts, though these are
# already removed with the HTML tags above
def remove_between_curly_brackets(text):
    return re.sub(r'\{[^}]*\}', '', text)

# Replace newlines with a space so adjacent words are not glued together
def remove_n_space(text):
    return re.sub(r'\n', ' ', text)

def text_cleaner(text):
    text = remove_html_tags(text)
    text = remove_links(text)
    text = remove_possessive_pronoun(text)
    # Strips all punctuation, brackets included, so the three
    # bracket removers below find nothing left to match
    text = remove_apostrophe_abbrev(text)
    text = remove_between_brackets(text)
    text = remove_between_square_brackets(text)
    text = remove_between_curly_brackets(text)
    text = remove_n_space(text)
    text = remove_stop_words(text)
    
    return text

In [None]:
# Apply cleaning functions to text
df["text"] = df["text"].apply(text_cleaner)

### Word Cloud

In [None]:
from wordcloud import WordCloud, STOPWORDS 

**Word Cloud for true news**

In [None]:
# True news wordcloud
plt.figure(figsize=(15,15))
wordcloud = WordCloud(max_words = 1000 , width = 1600 , 
                      height = 800 , stopwords = STOPWORDS).generate(" ".join(df[df["label"] == 1].text))

plt.axis("off")
plt.imshow(wordcloud)

**Word Cloud for fake news**

In [None]:
# Fake news wordcloud
plt.figure(figsize=(15,15))
wordcloud = WordCloud(max_words = 1000 , width = 1600 , 
                      height = 800 , stopwords = STOPWORDS).generate(" ".join(df[df["label"] == 0].text))

plt.axis("off")
plt.imshow(wordcloud)

You can tell from both word clouds that there are no distinct words which can determine whether the text is more likely to be true or fake. Both seem to include words associated with U.S. politics and U.S. politicians.

Okay, let's explore frequency of words in text.
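
Before running this on the full corpus, here is a minimal sketch of how `collections.Counter` counts tokens. The three snippets in `docs` are made up for illustration and stand in for the cleaned article text.

```python
from collections import Counter

# Hypothetical cleaned snippets standing in for the article text
docs = [
    "trump said senate vote",
    "senate vote delayed",
    "trump said white house",
]

# Join all documents, split on whitespace, and count each token
word_counts = Counter(" ".join(docs).split())
print(word_counts.most_common(3))
```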

**Frequent words**

In [None]:
from collections import Counter

In [None]:
# Get 25 most common words in True news
true_corpus = pd.Series(" ".join(df[df["label"] == 1].text))[0].split()

counter = Counter(true_corpus)
true_common = counter.most_common(25)
true_common = dict(true_common)
true_common

In [None]:
# as a dataframe
true_common_df = pd.DataFrame(true_common.items(), columns = ["words", "count"])
true_common_df.set_index("words")

In [None]:
# Histogram of 25 true common words
plt.figure(figsize=(12,9))
plt.bar(true_common.keys(), true_common.values())
plt.xticks(rotation=45)
plt.xlabel("Common words")
plt.ylabel("Count")
plt.title("25 Most common words (True)")
plt.show()

In [None]:
# Get 25 most common words in Fake news
fake_corpus = pd.Series(" ".join(df[df["label"] == 0].text))[0].split()

counter = Counter(fake_corpus)
fake_common = counter.most_common(25)
fake_common = dict(fake_common)
fake_common

In [None]:
# as a dataframe
fake_common_df = pd.DataFrame(fake_common.items(), columns = ["words", "count"])
fake_common_df.set_index("words")

In [None]:
# Histogram of 25 fake common words
plt.figure(figsize=(12,9))
plt.bar(fake_common.keys(), fake_common.values(), color="red")
plt.xticks(rotation=45)
plt.xlabel("Common words")
plt.ylabel("Count")
plt.title("25 Most common words (Fake)")
plt.show()

So, from these lists of common words, the news leans towards U.S. politics, and common words alone do not tell us what is true news and what is fake news. This makes fake news all the more dangerous and harmful if taken at face value.

### Bigrams and Trigrams

Let's look further into the text with bigrams and trigrams to find common two-word and three-word sequences.
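
As a quick illustration of what an n-gram is, the sketch below builds bigrams and trigrams from a made-up sentence using only the standard library. The helper `make_ngrams` is an assumption for illustration; the notebook itself uses `nltk.util.ngrams`.

```python
from collections import Counter

def make_ngrams(tokens, n):
    """Return consecutive n-token tuples from a token list."""
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "the white house said the white house would respond".split()

bigram_counts = Counter(make_ngrams(tokens, 2))
trigram_counts = Counter(make_ngrams(tokens, 3))

print(bigram_counts.most_common(2))
print(trigram_counts.most_common(1))
```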

**Bigrams**

In [None]:
from nltk.util import ngrams

Bigram: True news

In [None]:
# Find most common bigrams in True news
text = pd.Series(" ".join(df[df["label"] == 1].text))[0]
tokenizer = nltk.RegexpTokenizer(r"\w+")
token = tokenizer.tokenize(text)

# ngrams set to 2
counter = Counter(ngrams(token,2))
most_common = counter.most_common(25)
most_common = dict(most_common)
most_common

In [None]:
# as a dataframe
true_common_bi = pd.DataFrame(most_common.items(), columns = ["bigram", "count"])
true_common_bi["bigram"] = true_common_bi["bigram"].apply(lambda x: " ".join(x))
true_common_bi

In [None]:
# Histogram of 25 common bigrams for True news
plt.figure(figsize=(12,9))
plt.bar(true_common_bi["bigram"], true_common_bi["count"]) # can do tuples
plt.xticks(rotation=90)
plt.xlabel("Common bigram")
plt.ylabel("Count")
plt.title("25 Most common bigram (True)")
plt.show()

Bigram: Fake news

In [None]:
# Find most common bigrams in Fake news
text = pd.Series(" ".join(df[df["label"] == 0].text))[0]
tokenizer = nltk.RegexpTokenizer(r"\w+")
token = tokenizer.tokenize(text)

# ngrams set to 2
counter = Counter(ngrams(token,2))
most_common = counter.most_common(25)
most_common = dict(most_common)
most_common

In [None]:
# as a dataframe
fake_common_bi = pd.DataFrame(most_common.items(), columns = ["bigram", "count"])
fake_common_bi["bigram"] = fake_common_bi["bigram"].apply(lambda x: " ".join(x))
fake_common_bi

In [None]:
# Histogram of 25 common bigrams for Fake news
plt.figure(figsize=(12,9))
plt.bar(fake_common_bi["bigram"], fake_common_bi["count"], color="red") # can do tuples
plt.xticks(rotation=90)
plt.xlabel("Common bigram")
plt.ylabel("Count")
plt.title("25 Most common bigram (Fake)")
plt.show()

From these fake news bigrams, it's interesting to see the news sources Fox News, realDonaldTrump and 21st Century appear. For real news, Reuters is more frequent as a news source.

**Trigrams**

Trigrams: True news

In [None]:
# Find most common trigrams in True news
text = pd.Series(" ".join(df[df["label"] == 1].text))[0]
tokenizer = nltk.RegexpTokenizer(r"\w+")
token = tokenizer.tokenize(text)

# ngrams set to 3
counter = Counter(ngrams(token,3))
most_common = counter.most_common(25)
most_common = dict(most_common)
most_common

In [None]:
# as a dataframe
true_common_tri = pd.DataFrame(most_common.items(), columns = ["trigram", "count"])
true_common_tri["trigram"] = true_common_tri["trigram"].apply(lambda x: " ".join(x))
true_common_tri

In [None]:
# Histogram of 25 common trigrams for True news
plt.figure(figsize=(12,9))
plt.bar(true_common_tri["trigram"], true_common_tri["count"])
plt.xticks(rotation=90)
plt.xlabel("Common trigram")
plt.ylabel("Count")
plt.title("25 Most common trigram (True)")
plt.show()

Trigram: Fake News

In [None]:
# Find most common trigrams in Fake news
text = pd.Series(" ".join(df[df["label"] == 0].text))[0]
tokenizer = nltk.RegexpTokenizer(r"\w+")
token = tokenizer.tokenize(text)

# ngrams set to 3
counter = Counter(ngrams(token,3))
most_common = counter.most_common(25)
most_common = dict(most_common)
most_common

In [None]:
# as a dataframe
fake_common_tri = pd.DataFrame(most_common.items(), columns = ["trigram", "count"])
fake_common_tri["trigram"] = fake_common_tri["trigram"].apply(lambda x: " ".join(x))
fake_common_tri

In [None]:
# Histogram of 25 common trigrams for Fake news
plt.figure(figsize=(12,9))
plt.bar(fake_common_tri["trigram"], fake_common_tri["count"], color="red")
plt.xticks(rotation=90)
plt.xlabel("Common trigram")
plt.ylabel("Count")
plt.title("25 Most common trigram (Fake)")
plt.show()

## Lemmatization

In [None]:
from nltk.stem import WordNetLemmatizer


lemma = WordNetLemmatizer()

In [None]:
# Lemmatizer example
print(lemma.lemmatize("boys"))

In [None]:
# Function to perform lemmatization on text
def lemmatize_text(text):
    tokenize_text = nltk.word_tokenize(text)
    lemmatize_words = [lemma.lemmatize(word) for word in tokenize_text]
    join_text = ' '.join(lemmatize_words)
    
    return join_text

# Example sentence on function
lemmatize_text("There once was a boy named Naruto who was possessed by a Nine-Tailed Demon Fox")

In [None]:
# Copy main df dataset and lemmatize the text
lemmatized_df = df.copy()
lemmatized_df["text"] = lemmatized_df["text"].apply(lemmatize_text)
lemmatized_df.head()

In [None]:
lemmatized_df["text"][0]

After much-needed data cleaning and extensive exploratory analysis, let's move on to training machine learning models on the dataset to see how well they can predict whether the text is real or fake.

## Machine Learning

### Training models

We will begin training models and experiment with different vector methods, CountVectorizer and TF-IDF, with lemmatized text. We will use the following models:
* Logistic Regression
* Naive Bayes
* Support Vector Machine
* Random Forest
* Gradient Boosting
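
To show how the two vector methods differ, here is a minimal sketch on a made-up corpus. `train_docs` and `test_docs` are hypothetical stand-ins for `X_train` and `X_test`, and the `TfidfVectorizer` settings are an assumption rather than a configuration taken from this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus; in the notebook these would be X_train and X_test
train_docs = ["senate vote delayed", "senate vote passed", "shocking video goes viral"]
test_docs = ["senate video"]

count_vec = CountVectorizer(max_features=10000).fit(train_docs)
tfidf_vec = TfidfVectorizer(max_features=10000).fit(train_docs)

# Both vectorizers learn the same vocabulary but weight the terms differently
count_matrix = count_vec.transform(test_docs)
tfidf_matrix = tfidf_vec.transform(test_docs)

print(count_matrix.toarray())
print(tfidf_matrix.toarray().round(2))
```

In the TF-IDF row, "video" gets a larger weight than "senate" because it appears in fewer training documents, whereas the raw counts treat both words the same.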

In [None]:
# Machine Learning models to import
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix, classification_report

In [None]:
# Random state at 42 for reproducibility
# Go for 80:20 train:test set
test_size = 0.20
X_train, X_test, y_train, y_test = train_test_split(lemmatized_df["text"], lemmatized_df["label"], test_size=test_size, random_state=42)

**CountVectorizer**

In [None]:
# Fit CountVectorizer on X_train only, then transform both X_train and X_test
cv_train = CountVectorizer(max_features=10000).fit(X_train)
X_vec_train = cv_train.transform(X_train)
X_vec_test = cv_train.transform(X_test)

In [None]:
X_vec_train

**Logistic Regression (CountVectorizer)**

In [None]:
# Training on Logistic Regression model
# set LR parameter max_iter = 4000 to avoid error
lr = LogisticRegression(max_iter = 4000)
lr.fit(X_vec_train, y_train)
predicted_value = lr.predict(X_vec_test)
lr_accuracy_value = roc_auc_score(y_test, predicted_value)

In [None]:
# Logistic Regression Test ROC 99.68% on just lemmatized text
print("ROC: " + str(lr_accuracy_value*100) + "%")
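
A note on the metric: `roc_auc_score` is called here on hard 0/1 predictions, which reduces the ROC curve to a single operating point. A common alternative is to pass the predicted probability of the positive class instead. The sketch below compares the two on made-up one-feature data; `X_demo`, `y_demo` and `clf` are assumptions for illustration, and the numbers it prints are unrelated to the notebook's results.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Hypothetical one-feature data, just large enough to fit a classifier
X_demo = [[0.0], [0.2], [0.4], [0.5], [0.6], [0.8], [1.0], [0.3]]
y_demo = [0, 0, 0, 1, 1, 1, 1, 1]

clf = LogisticRegression().fit(X_demo, y_demo)

# Score computed from hard class labels versus from predicted probabilities
auc_from_labels = roc_auc_score(y_demo, clf.predict(X_demo))
auc_from_scores = roc_auc_score(y_demo, clf.predict_proba(X_demo)[:, 1])

print("AUC from hard labels:  ", auc_from_labels)
print("AUC from probabilities:", auc_from_scores)
```

In the notebook, the equivalent change would be to score `lr.predict_proba(X_vec_test)[:, 1]`; the reported figures would then need to be regenerated.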

In [None]:
conmat = confusion_matrix(y_test, predicted_value)
print(conmat)
print(classification_report(y_test, predicted_value))

In [None]:
# Visual of confusion matrix of Logistic Regression
fig = plt.subplot()
sns.heatmap(conmat, annot=True, ax=fig)
fig.set_ylabel('y_test')
fig.set_xlabel('predicted values')
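
To connect the confusion matrix with the precision, recall and F1 columns of the classification report, here is a minimal sketch that derives them by hand for the "true" class. The counts in `conmat_example` are hypothetical and are not the notebook's results.

```python
# Hypothetical counts laid out like sklearn's confusion_matrix output:
# rows are the actual class, columns the predicted class, class order [0, 1]
conmat_example = [[90, 10],   # actual fake: 90 predicted fake, 10 predicted true
                  [5, 95]]    # actual true: 5 predicted fake, 95 predicted true

tn, fp = conmat_example[0]
fn, tp = conmat_example[1]

precision = tp / (tp + fp)          # share of "true" predictions that are correct
recall = tp / (tp + fn)             # share of actual true articles that are found
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 3), round(recall, 3), round(f1, 3))
```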

**Multinomial Naive Bayes (CountVectorizer)**

In [None]:
# Training on Naive Bayes model
# Quick at training
nb = MultinomialNB()
nb.fit(X_vec_train, y_train)
predicted_value = nb.predict(X_vec_test)
nb_accuracy_value = roc_auc_score(y_test, predicted_value)

In [None]:
# Naive Bayes Training ROC 95.21% lemmatized text
print("ROC: " + str(roc_auc_score(y_train, nb.predict(X_vec_train))*100) + "%")

In [None]:
# Naive Bayes Test ROC 95.22% lemmatized text
print("ROC: " + str(nb_accuracy_value*100) + "%")

In [None]:
conmat = confusion_matrix(y_test, predicted_value)
print(conmat)
print(classification_report(y_test, predicted_value))

In [None]:
# Visual of confusion matrix of Naive Bayes
fig = plt.subplot()
sns.heatmap(conmat, annot=True, ax=fig)
fig.set_ylabel('y_test')
fig.set_xlabel('predicted values')

**Support Vector Machines (CountVectorizer)**

In [None]:
# Training on Support Vector Machine
# Very slow at training
# time complexity O(no.features * no.of samples**2)
svm = SVC()
svm.fit(X_vec_train, y_train)
predicted_value = svm.predict(X_vec_test)
svm_accuracy_value = roc_auc_score(y_test, predicted_value)

In [None]:
# SVM Training ROC 99.9% lemmatized text
# svm predict on training set took very long time
print("ROC: " + str(roc_auc_score(y_train, svm.predict(X_vec_train))*100) + "%")

In [None]:
# SVM Test ROC 99.53% lemmatized text
print("ROC: " + str(svm_accuracy_value*100) + "%")

In [None]:
conmat = confusion_matrix(y_test, predicted_value)
print(conmat)
print(classification_report(y_test, predicted_value))

In [None]:
# Visual of confusion matrix of SVM
fig = plt.subplot()
sns.heatmap(conmat, annot=True, ax=fig)
fig.set_ylabel('y_test')
fig.set_xlabel('predicted values')

**Random Forest (CountVectorizer)**

In [None]:
# Training on Random Forest
rf = RandomForestClassifier()
rf.fit(X_vec_train, y_train)
predicted_value = rf.predict(X_vec_test)
rf_accuracy_value = roc_auc_score(y_test, predicted_value)

In [None]:
# Random Forest Training ROC 100% lemmatized text
print("ROC: " + str(roc_auc_score(y_train, rf.predict(X_vec_train))*100) + "%")

In [None]:
# Random Forest Test ROC 99.67% lemmatized text
print("ROC: " + str(rf_accuracy_value*100) + "%")

In [None]:
conmat = confusion_matrix(y_test, predicted_value)
print(conmat)
print(classification_report(y_test, predicted_value))

In [None]:
# Visual of confusion matrix of Random Forest
fig = plt.subplot()
sns.heatmap(conmat, annot=True, ax=fig)
fig.set_ylabel('y_test')
fig.set_xlabel('predicted values')

**Gradient Boosting (CountVectorizer)**

In [None]:
# Training on Gradient Boosting 
gbc = GradientBoostingClassifier()
gbc.fit(X_vec_train, y_train)
predicted_value = gbc.predict(X_vec_test)
gbc_accuracy_value = roc_auc_score(y_test, predicted_value)

In [None]:
# Gradient Boost Training ROC 99.64% lemmatized text
print("ROC: " + str(roc_auc_score(y_train, gbc.predict(X_vec_train))*100) + "%")

In [None]:
# Gradient Boost Test ROC 99.5% lemmatized text
print("ROC: " + str(gbc_accuracy_value*100) + "%")

In [None]:
conmat = confusion_matrix(y_test, predicted_value)
print(conmat)
print(classification_report(y_test, predicted_value))

In [None]:
# Visual of confusion matrix of Gradient Boosting
fig = plt.subplot()
sns.heatmap(conmat, annot=True, ax=fig)
fig.set_ylabel('y_test')
fig.set_xlabel('predicted values')

So it looks like most of the models did fairly well. Comparing the difference between training and test scores, it seems **Gradient Boosting** did pretty well. When the training score is 100%, the model may be overfitting, which could lead to new news data being predicted incorrectly.
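
The comparison can be made explicit with a little arithmetic. The sketch below takes the training and test ROC figures noted in the cell comments above and computes the gap for each model. Logistic Regression is left out because no training figure was recorded for it.

```python
# Training and test ROC scores (in %) as reported in the cell comments above
reported_scores = {
    "Naive Bayes": (95.21, 95.22),
    "SVM": (99.9, 99.53),
    "Random Forest": (100.0, 99.67),
    "Gradient Boosting": (99.64, 99.5),
}

# Gap = training score minus test score; a large positive gap hints at overfitting
gaps = {name: round(train - test, 2) for name, (train, test) in reported_scores.items()}

for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name}: {gap:+.2f} percentage points")
```

Naive Bayes has the smallest gap but also the lowest scores overall, so the gap is best read alongside the test score rather than on its own.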

Now let's save the Gradient Boosting model.

## Saving the model

In [None]:
import pickle

In [None]:
# Save model
model_file = "gbc.pkl"
with open(model_file,mode='wb') as model_f:
    pickle.dump(gbc,model_f)

In [None]:
# Open the model, print result for sanity check
with open("gbc.pkl",mode='rb') as model_f:
    model = pickle.load(model_f)
    predict = model.predict(X_vec_test)
    result = roc_auc_score(y_test, predict)
    print("result:",result*100, "%")
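
One thing to keep in mind for the web application: the saved model expects input that has been transformed by the same fitted CountVectorizer, so the vectorizer needs to be saved as well. Here is a minimal sketch of bundling the two together. It uses made-up data, an in-memory buffer instead of a file, and `MultinomialNB` as a lightweight stand-in for the Gradient Boosting model; all names in it are assumptions for illustration.

```python
import io
import pickle

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data; in the notebook these would be cv_train and gbc
texts = ["senate vote passed", "shocking video goes viral",
         "senate committee hearing", "you will not believe this video"]
labels = [1, 0, 1, 0]

vectorizer = CountVectorizer().fit(texts)
model = MultinomialNB().fit(vectorizer.transform(texts), labels)

# Bundle both objects so the app can vectorize new text exactly as in training
buffer = io.BytesIO()          # stands in for a file opened with mode='wb'
pickle.dump({"vectorizer": vectorizer, "model": model}, buffer)

buffer.seek(0)
bundle = pickle.load(buffer)
new_article = ["senate vote delayed"]
prediction = bundle["model"].predict(bundle["vectorizer"].transform(new_article))
print(prediction)
```

New articles would also need to go through the same cleaning and lemmatization steps before being vectorized.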

**What have we learnt**
* How to clean a dataset by removing tags, punctuation and stopwords
* How to use lemmatization to reduce inflected words to a common base form
* How to produce visualizations like countplots and word clouds
* How to look for frequent words and sequences of words (bi-grams, tri-grams)
* How to train machine learning models by using CountVectorizer on a text dataset
* How to evaluate a model's training and test accuracy, classification report and confusion matrix
* How to save a model as a pkl file for reuse.