## Load the Modules

In [1]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
import gensim.downloader as api

## Splitting the data into 2 parts - train and test data

In [2]:
# Load the cleaned dataset
data = pd.read_csv('cleaned_news_data.csv')

# Combine title_clean and text_clean as the input for the model
data['combined_text'] = data['title_clean'] + ' ' + data['text_clean']

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(data['combined_text'], data['label'], test_size=0.2, random_state=42)

## We'll be using the same 3 text representation techniques as shown previously
Bag of Words, TF-IDF and GloVe

## Create a CountVectorizer object
Common English words (e.g., "a", "an", "the") are removed from the text. The vectorizer then keeps the 10,000 most frequent remaining words to build the feature vectors.
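
As a rough illustration of what these two settings do, here is a minimal sketch on a made-up three-sentence corpus (the sentences and the `toy_` names are assumptions for this example, not part of the news dataset). It keeps only the 3 most frequent terms instead of 10,000 so the effect is visible.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration only
toy_docs = [
    "the senate passed the budget",
    "the budget vote shocked the senate",
    "a budget vote went viral",
]

# Same idea as in the notebook, but with max_features=3 instead of 10000
toy_vectorizer = CountVectorizer(stop_words='english', max_features=3)
toy_counts = toy_vectorizer.fit_transform(toy_docs)

# Terms that survive stop-word removal and the frequency cut-off
print(sorted(toy_vectorizer.vocabulary_))
# One row per document, one column per surviving term (alphabetical order)
print(toy_counts.toarray())
```

Stop words such as "the" never reach the vocabulary, and terms that occur only once are dropped by the `max_features` cut-off.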

## The 'vectorizer' object is then used to fit and transform the training data (X_train) and transform the test data (X_test) into BoW feature vectors.

In [3]:
# Bag of Words
vectorizer = CountVectorizer(stop_words='english', max_features=10000)
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)

## Create a TfidfVectorizer object 
It uses the same stop_words and max_features parameters as before. The difference between CountVectorizer and TfidfVectorizer is that the latter weights each word by its Term Frequency-Inverse Document Frequency (TF-IDF), a measure that reflects how important a word is to a document relative to the entire corpus.
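
To make the weighting concrete, the following sketch uses a made-up corpus in which "budget" appears in every document and the other words appear in only one. The corpus and variable names are assumptions for illustration; default TfidfVectorizer settings are used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for illustration: "budget" occurs in every document
toy_docs = [
    "budget senate vote",
    "budget pope endorsement",
    "budget election result",
]

toy_tfidf = TfidfVectorizer()
weights = toy_tfidf.fit_transform(toy_docs).toarray()
vocab = toy_tfidf.vocabulary_

# Every word occurs once in the first document, so any difference in weight
# comes from the inverse document frequency part of TF-IDF
w_common = weights[0, vocab['budget']]
w_rare = weights[0, vocab['senate']]
print(f"budget: {w_common:.3f}, senate: {w_rare:.3f}")
```

The word shared by all documents receives a lower weight than the word that is specific to one document, which is the behaviour described above.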

## The 'vectorizer_tfidf' object is then used to fit and transform the training data (X_train) and transform the test data (X_test) into TF-IDF feature vectors. 
These feature vectors are stored in the variables 'X_train_tfidf' and 'X_test_tfidf'.

In [4]:
# TfidfVectorizer
vectorizer_tfidf = TfidfVectorizer(stop_words='english', max_features=10000)
X_train_tfidf = vectorizer_tfidf.fit_transform(X_train)
X_test_tfidf = vectorizer_tfidf.transform(X_test)

## Loads the pre-trained GloVe model with 100-dimensional word vectors

The get_average_glove_vector() function is defined, which takes a text string and the GloVe model as input arguments. Inside the function, the text is split into words. For each word, if it exists in the GloVe model, the corresponding word vector is extracted. If there are no valid word vectors in the text, the function returns a zero vector of the same size as the GloVe model's vector size. Otherwise, the function computes the average of all the word vectors in the text and returns this average vector as the final representation for the input text.

## The function get_average_glove_vector() is applied to each text in the training data (X_train) and test data (X_test) using list comprehensions. 
The resulting arrays of average GloVe vectors are then converted to NumPy arrays and stored in the variables 'X_train_glove' and 'X_test_glove'.
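
The averaging logic can be tried without downloading GloVe. The sketch below swaps in a hypothetical dictionary of 3-dimensional vectors (`toy_vectors`) for the real model, and `average_vector` is an illustrative stand-in for get_average_glove_vector(), not the notebook's function.

```python
import numpy as np

# Hypothetical 3-dimensional "embeddings" standing in for the real GloVe model
toy_vectors = {
    "budget": np.array([1.0, 0.0, 2.0]),
    "vote": np.array([3.0, 2.0, 0.0]),
}
toy_dim = 3

def average_vector(text, vectors, dim):
    """Return the mean vector of the known words, or zeros if none are known."""
    found = [vectors[word] for word in text.split() if word in vectors]
    if not found:
        return np.zeros(dim)
    return np.mean(found, axis=0)

# The unknown word "xyzzy" is skipped; the two known vectors are averaged
avg_known = average_vector("budget vote xyzzy", toy_vectors, toy_dim)
# No known words at all, so the zero vector is returned
avg_unknown = average_vector("xyzzy plugh", toy_vectors, toy_dim)
print(avg_known, avg_unknown)
```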

In [5]:
# Pre-trained word embeddings (GloVe)
glove_model = api.load('glove-wiki-gigaword-100')

def get_average_glove_vector(text, model):
    words = text.split()
    word_vectors = [model[word] for word in words if word in model]
    if not word_vectors:
        return np.zeros(model.vector_size)
    return np.mean(word_vectors, axis=0)

X_train_glove = np.array([get_average_glove_vector(text, glove_model) for text in X_train])
X_test_glove = np.array([get_average_glove_vector(text, glove_model) for text in X_test])



## The choice of using these four models to detect fake news can be attributed to their unique strengths and properties that make them suitable for text classification tasks.

### Logistic Regression: 
Logistic Regression is a simple and efficient linear model that performs well on binary classification tasks. It is easy to interpret and understand, which makes it a popular choice for text classification problems like fake news detection. Logistic Regression can handle large feature sets, such as the ones found in our text data, and can be easily regularized to prevent overfitting.

### Random Forest: 
Random Forest is an ensemble learning method that constructs multiple decision trees and aggregates their results. This approach offers robustness and accuracy in classification tasks, as it leverages the power of multiple individual models. Random Forest can handle high-dimensional data and automatically performs feature selection, which is beneficial when dealing with text data that has numerous features. Additionally, it is resistant to overfitting and provides a measure of feature importance, aiding in interpretability.

### Multinomial Naive Bayes: 
Multinomial Naive Bayes is a probabilistic classifier based on Bayes' theorem, specifically designed to handle discrete features like word counts or term frequencies in text data. It assumes that the features are conditionally independent, which simplifies the computation and allows for fast training. This model is easy to implement and has been widely used in text classification tasks, such as spam filtering and sentiment analysis, making it a reasonable choice for fake news detection.

### Support Vector Machines (SVM): 
SVM is a powerful and flexible classifier that can handle both linear and non-linear classification problems. It works by finding the optimal hyperplane that separates the data points of different classes, maximizing the margin between them. SVM is particularly well-suited for high-dimensional data, such as text, and can be used with different kernel functions to model complex relationships between features. Additionally, it provides good generalization performance and can be fine-tuned using regularization parameters to prevent overfitting.

## Two dictionaries: models and inputs. 
These store the machine learning models and the input feature types for the classification task. The dictionaries are used later to run multiple experiments, allowing for easy comparison of model performance across input features.

In [6]:
# Define models and input types
models = {
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
    'Random Forest': RandomForestClassifier(random_state=42),
    'Multinomial Naive Bayes': MultinomialNB(),
    'SVM': SVC(random_state=42)
}

inputs = {
    'Bag of Words': (X_train_bow, X_test_bow),
    'TF-IDF': (X_train_tfidf, X_test_tfidf),
    'GloVe': (X_train_glove, X_test_glove)
}



## Iterate over different input types (Bag of Words, TF-IDF, and GloVe) and machine learning models (Logistic Regression, Random Forest, Multinomial Naive Bayes, and SVM) 
Each model is trained and evaluated on each input type. The objective is to compare the performance of different models using different input feature types for a given text classification task.

## Note: GloVe is not suitable for use with Multinomial Naive Bayes, and is not considered in our analysis. WHY?
1) Value Range: Multinomial Naive Bayes models word counts, so it requires non-negative input features (in practice, fractional values such as TF-IDF weights also work). GloVe, however, produces continuous-valued, dense vectors that include both positive and negative values, and scikit-learn's MultinomialNB raises an error when it is given negative inputs.

2) Feature Independence: GloVe embeddings capture semantic and syntactic relationships between words. As a result, the word vectors are not independent of one another. This violates the assumption of feature independence in the Multinomial Naive Bayes model, which can lead to suboptimal performance.
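
The value-range point can be checked directly. This is a minimal sketch with made-up dense features (`X_dense`, `y_toy`) standing in for averaged embeddings; it only shows that MultinomialNB refuses negative inputs.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Small made-up dense features with negative values, as averaged embeddings usually have
X_dense = np.array([[0.2, -0.5], [-0.1, 0.4], [0.3, 0.1], [-0.4, -0.2]])
y_toy = np.array([0, 1, 0, 1])

try:
    MultinomialNB().fit(X_dense, y_toy)
    fit_failed = False
except ValueError as err:
    # scikit-learn validates that the inputs are non-negative before fitting
    fit_failed = True
    print(f"MultinomialNB rejected the input: {err}")
```

If a Naive Bayes baseline on embeddings were wanted, GaussianNB is the scikit-learn variant intended for continuous features; it is not used in this notebook.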

In [7]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Collect one row of evaluation metrics per (input type, model) pair
results = []

for input_name, (X_train_input, X_test_input) in inputs.items():
    for model_name, model in models.items():
        # GloVe produces continuous word embeddings that can be negative, while
        # Multinomial Naive Bayes expects non-negative count-like features
        # (e.g., term frequencies or TF-IDF weights), so this combination is skipped.
        if model_name == 'Multinomial Naive Bayes' and input_name == 'GloVe':
            continue

        model.fit(X_train_input, y_train)
        y_pred = model.predict(X_test_input)

        results.append({
            'Input Type': input_name,
            'Model': model_name,
            'Accuracy': accuracy_score(y_test, y_pred),
            'Precision': precision_score(y_test, y_pred),
            'Recall': recall_score(y_test, y_pred),
            'F1 Score': f1_score(y_test, y_pred)
        })

# DataFrame.append was deprecated in pandas 1.4 and removed in 2.0,
# so the DataFrame is built once from the collected rows
performance_df = pd.DataFrame(
    results,
    columns=['Input Type', 'Model', 'Accuracy', 'Precision', 'Recall', 'F1 Score']
)

print(performance_df)


      Input Type                    Model  Accuracy  Precision    Recall  F1 Score
0   Bag of Words      Logistic Regression  0.994353   0.995044  0.993168  0.994105
1   Bag of Words            Random Forest  0.996612   0.996233  0.996702  0.996467
2   Bag of Words  Multinomial Naive Bayes  0.943867   0.941773  0.941107  0.941440
3   Bag of Words                      SVM  0.992772   0.993858  0.991048  0.992451
4         TF-IDF      Logistic Regression  0.985995   0.981538  0.989399  0.985453
5         TF-IDF            Random Forest  0.996612   0.996700  0.996231  0.996466
6         TF-IDF  Multinomial Naive Bayes  0.928507   0.929387  0.920848  0.925098
7         TF-IDF                      SVM  0.993336   0.991776  0.994346  0.993060
8          GloVe      Logistic Regression  0.934041   0.929191  0.933569  0.931375
9          GloVe            Random Forest  0.946917   0.951879  0.936631  0.944194
10         GloVe                      SVM  0.939349   0.935211  0.938516  0.936861


## Interactive grouped bar chart with buttons to switch between the different evaluation metrics (Accuracy, Precision, Recall, and F1 Score). 
The chart will display the performance comparison of different models and input types.

In [15]:
import plotly.graph_objects as go

# Model names for the x-axis. A separate name is used so that the `models`
# dictionary defined earlier (and reused in the optimisation step) is not overwritten.
model_names = ['Logistic Regression', 'Random Forest', 'Multinomial Naive Bayes', 'SVM']
metric_names = ['Accuracy', 'Precision', 'Recall', 'F1 Score']

performance_data = {
    'Bag of Words': {
        'Accuracy': [0.994353, 0.996612, 0.943867, 0.992772],
        'Precision': [0.995044, 0.996233, 0.941773, 0.993858],
        'Recall': [0.993168, 0.996702, 0.941107, 0.991048],
        'F1 Score': [0.994105, 0.996467, 0.941440, 0.992451]
    },
    'TF-IDF': {
        'Accuracy': [0.985995, 0.996612, 0.928507, 0.993336],
        'Precision': [0.981538, 0.996700, 0.929387, 0.991776],
        'Recall': [0.989399, 0.996231, 0.920848, 0.994346],
        'F1 Score': [0.985453, 0.996466, 0.925098, 0.993060]
    },
    'GloVe': {
        'Accuracy': [0.934041, 0.946917, None, 0.939349],
        'Precision': [0.929191, 0.951879, None, 0.935211],
        'Recall': [0.933569, 0.936631, None, 0.938516],
        'F1 Score': [0.931375, 0.944194, None, 0.936861]
    }
}

fig = go.Figure()

# One trace per (technique, metric) pair; only the Accuracy traces start visible.
# The loop variable is called `scores` so the `data` DataFrame is not overwritten.
for technique, scores in performance_data.items():
    for metric in metric_names:
        values = scores[metric]
        fig.add_trace(go.Bar(
            x=model_names,
            y=values,
            name=f'{technique} {metric}',
            text=values,
            textposition='auto',
            hoverinfo='x+text',
            visible=metric == 'Accuracy',
        ))

# Each button needs one visibility flag per trace (3 techniques x 4 metrics = 12),
# listed in the same order in which the traces were added.
buttons = []
for metric in metric_names:
    buttons.append(dict(
        label=metric,
        method="update",
        args=[{"visible": [metric == m for _ in performance_data for m in metric_names]}]
    ))

fig.update_layout(
    title='Performance Comparison of Text Classification Models',
    updatemenus=[dict(
        type="buttons",
        showactive=True,
        buttons=buttons,
    )],
    xaxis_title='Models',
    yaxis_title='Score',
    barmode='group'
)

fig.show()

## The performance metrics presented are:

1) Accuracy: The proportion of correct predictions out of the total predictions made.

2) Precision: The proportion of true positives out of all predicted positives (true positives + false positives).

3) Recall: The proportion of true positives out of all actual positives (true positives + false negatives).

4) F1 Score: The harmonic mean of precision and recall, providing a single metric that balances both.
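
The four definitions above can be worked through by hand on a small example. The labels below are made up for illustration (1 is treated as the positive class); the hand-computed values are then compared against the scikit-learn functions used in this notebook.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Made-up true labels and predictions; 1 is the positive class
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# Count the four outcomes of a binary classifier
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_hat))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_hat))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_hat))  # false negatives
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_hat))  # true negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
print(accuracy_score(y_true, y_hat), precision_score(y_true, y_hat),
      recall_score(y_true, y_hat), f1_score(y_true, y_hat))
```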

## Analysis of the results:
### Bag of Words:
1) Logistic Regression: The model achieved 99.44% accuracy, with 99.50% precision, 99.32% recall, and an F1 score of 99.41%. This model performed exceptionally well on the dataset.

2) Random Forest: The model achieved 99.66% accuracy, with 99.62% precision, 99.67% recall, and an F1 score of 99.65%. Together with the TF-IDF Random Forest, which reached the same accuracy, this was the best-performing model across all text representation techniques.

3) Multinomial Naive Bayes: The model achieved 94.39% accuracy, with 94.18% precision, 94.11% recall, and an F1 score of 94.14%. This model had the lowest performance among the Bag of Words models.

4) SVM: The model achieved 99.28% accuracy, with 99.39% precision, 99.10% recall, and an F1 score of 99.25%. This model also performed well, but not as well as Random Forest.

### TF-IDF:
1) Logistic Regression: The model achieved 98.60% accuracy, with 98.15% precision, 98.94% recall, and an F1 score of 98.55%. This model performed well but not as well as its Bag of Words counterpart.

2) Random Forest: The model achieved 99.66% accuracy, with 99.67% precision, 99.62% recall, and an F1 score of 99.65%. This model performed at a similar level to the Bag of Words Random Forest model.

3) Multinomial Naive Bayes: The model achieved 92.85% accuracy, with 92.94% precision, 92.08% recall, and an F1 score of 92.51%. This model performed the worst among all the models and text representation techniques.

4) SVM: The model achieved 99.33% accuracy, with 99.18% precision, 99.43% recall, and an F1 score of 99.31%. This model performed better with TF-IDF than with Bag of Words.

### GloVe:
1) Logistic Regression: The model achieved 93.40% accuracy, with 92.92% precision, 93.36% recall, and an F1 score of 93.14%. This model had the lowest performance among the GloVe models.

2) Random Forest: The model achieved 94.69% accuracy, with 95.19% precision, 93.66% recall, and an F1 score of 94.42%. This model performed the best among the GloVe models.

3) SVM: The model achieved 93.93% accuracy, with 93.52% precision, 93.85% recall, and an F1 score of 93.69%. This model performed better than Logistic Regression but not as well as Random Forest, and not as well as the SVM with Bag of Words or TF-IDF input.

## The top models for each input type:

#### Bag of Words:
Random Forest with an accuracy of 0.996612, precision of 0.996233, recall of 0.996702, and F1 score of 0.996467.

#### TF-IDF:
Random Forest with an accuracy of 0.996612, precision of 0.996700, recall of 0.996231, and F1 score of 0.996466.

#### GloVe:
Random Forest with an accuracy of 0.946917, precision of 0.951879, recall of 0.936631, and F1 score of 0.944194.

## In summary, the Random Forest model achieved the highest performance with both the Bag of Words and TF-IDF text representations, reaching an accuracy of 99.66% and an F1 score of 99.65% in each case. GloVe-based models generally underperformed their Bag of Words and TF-IDF counterparts. Among the classification models, Multinomial Naive Bayes had the lowest performance on both representations it was evaluated on (Bag of Words and TF-IDF).


## Thus, we can consider using the Random Forest model with either the Bag of Words or TF-IDF input types. (before optimisation)

## -------------------------------------------------------------------------------------------------------------------------------
## To Perform Optimisation using analysis of performance above.

## Reason for choice in parameters in optimisation -
#### Logistic Regression:
1) 'C': This parameter is the inverse of the regularization strength. Smaller values result in stronger regularization, which can help avoid overfitting. Since our model's performance is already quite high, the chosen range starts at the default value of 1 and includes larger values such as 100 to explore less regularized models.

2) 'penalty': This parameter determines the type of regularization applied to the model (L1, L2). Different penalties can lead to different feature selection behavior in the model, which might affect the model's performance.

3) 'solver': This parameter defines the optimization algorithm used for training the model. Different solvers can perform differently depending on the problem and data size.

#### Random Forest:
1) 'n_estimators': This parameter controls the number of trees in the forest. Increasing the number of trees can lead to better performance but also requires more computational resources. Since our model is already performing well, we've focused on a range of higher values.

2) 'max_depth': This parameter defines the maximum depth of each tree. Limiting the depth can help prevent overfitting. We have included a range of values from no limit (None) to moderately deep trees (40).

3) 'min_samples_split': This parameter determines the minimum number of samples required to split an internal node. Higher values help prevent overfitting but can lead to underfitting if too high.

4) 'min_samples_leaf': This parameter controls the minimum number of samples required to be at a leaf node. Increasing this value can help prevent overfitting by creating less complex trees.

#### Multinomial Naive Bayes:
1) 'alpha': This parameter is a smoothing parameter (Laplace or Lidstone smoothing) applied to the model to handle unseen features in the test data. A range of values is provided to help find the best balance between overfitting and underfitting.

#### SVM:
1) 'C': This is the regularization parameter, similar to the one in logistic regression: smaller values mean stronger regularization. It controls the trade-off between fitting the training data closely and keeping a wide margin, which affects how well the model generalizes. We've chosen a range of values to explore different levels of regularization.

2) 'kernel': This parameter defines the kernel function used by the SVM. Different kernel functions can lead to different decision boundaries and affect the model's performance.

3) 'gamma': This parameter is the kernel coefficient for the 'rbf' kernel (and for 'poly' and 'sigmoid'); the 'linear' kernel ignores it. It controls how far the influence of a single training example reaches and therefore the shape of the decision boundary. 'scale' and 'auto' are two built-in ways of deriving its value from the data, which can impact the model's performance.
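
To get a feel for how large each search space is, the sketch below counts the parameter combinations with scikit-learn's ParameterGrid. The dictionaries repeat the values used in the search cell of this notebook; `search_spaces` and `grid_sizes` are names made up for this example.

```python
from sklearn.model_selection import ParameterGrid

# Same candidate values as in the notebook's parameter grids
search_spaces = {
    'Logistic Regression': {
        'C': [1, 10, 100],
        'penalty': ['l1', 'l2'],
        'solver': ['liblinear', 'saga'],
    },
    'Random Forest': {
        'n_estimators': [100, 200],
        'max_depth': [None, 40],
        'min_samples_split': [2, 5],
        'min_samples_leaf': [1, 2],
    },
    'Multinomial Naive Bayes': {'alpha': [0.1, 1, 5]},
    'SVM': {
        'C': [1, 10, 100],
        'kernel': ['linear', 'rbf'],
        'gamma': ['scale', 'auto'],
    },
}

# Number of distinct parameter combinations per model
grid_sizes = {name: len(ParameterGrid(space)) for name, space in search_spaces.items()}
print(grid_sizes)
```

Because the search cell uses RandomizedSearchCV with n_iter=5, at most five of these combinations are tried per model and input type, so the reported best parameters are the best of a sample rather than of the full grid.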

In [9]:
from sklearn.model_selection import RandomizedSearchCV

# Define the parameter grids for each model
param_grids = {
    'Logistic Regression': {
        'model': models['Logistic Regression'],
        'params': {
                    'C': [1, 10, 100],
                    'penalty': ['l1', 'l2'],
                    'solver': ['liblinear', 'saga']
                  }  # Define the appropriate parameter grid
    },
    'Random Forest': {
        'model': models['Random Forest'],
        'params': {
                    'n_estimators': [100, 200],
                    'max_depth': [None, 40],
                    'min_samples_split': [2, 5],
                    'min_samples_leaf': [1, 2]
                  }  # Define the appropriate parameter grid
    },
    'Multinomial Naive Bayes': {
        'model': models['Multinomial Naive Bayes'],
        'params': {
                    'alpha': [0.1, 1, 5]
                  }  # Define the appropriate parameter grid
    },
    'SVM': {
        'model': models['SVM'],
        'params': {
                    'C': [1, 10, 100],
                    'kernel': ['linear', 'rbf'],
                    'gamma': ['scale', 'auto']
                  }  # Define the appropriate parameter grid
    }
}

optimized_models = {}

for input_name, (X_train_input, X_test_input) in inputs.items():
    optimized_models[input_name] = {}
    print(f'===== {input_name} =====')
    for name, model_grid in param_grids.items():
        if name == 'Multinomial Naive Bayes' and input_name == 'GloVe':
            # Skip Multinomial Naive Bayes for GloVe, whose embeddings contain negative values
            continue
        grid_search = RandomizedSearchCV(estimator=model_grid['model'], param_distributions=model_grid['params'], cv=3, scoring='accuracy', n_jobs=-1, n_iter=5)
        grid_search.fit(X_train_input, y_train)
        best_model = grid_search.best_estimator_
        y_pred_best_model = best_model.predict(X_test_input)

        optimized_models[input_name][name] = best_model

        accuracy_best_model = accuracy_score(y_test, y_pred_best_model)
        precision_best_model = precision_score(y_test, y_pred_best_model)
        recall_best_model = recall_score(y_test, y_pred_best_model)
        f1_best_model = f1_score(y_test, y_pred_best_model)

        print(f'Optimized {name}:')
        print(f'Accuracy: {accuracy_best_model:.2f}')
        print(f'Precision: {precision_best_model:.2f}')
        print(f'Recall: {recall_best_model:.2f}')
        print(f'F1 Score: {f1_best_model:.2f}')
        print(f'Best Parameters: {grid_search.best_params_}')
        print()

===== Bag of Words =====
Optimized Logistic Regression:
Accuracy: 0.99
Precision: 0.99
Recall: 0.99
F1 Score: 0.99
Best Parameters: {'solver': 'liblinear', 'penalty': 'l2', 'C': 10}

Optimized Random Forest:
Accuracy: 1.00
Precision: 1.00
Recall: 1.00
F1 Score: 1.00
Best Parameters: {'n_estimators': 200, 'min_samples_split': 5, 'min_samples_leaf': 1, 'max_depth': None}





Optimized Multinomial Naive Bayes:
Accuracy: 0.94
Precision: 0.95
Recall: 0.94
F1 Score: 0.94
Best Parameters: {'alpha': 0.1}

Optimized SVM:
Accuracy: 0.99
Precision: 1.00
Recall: 0.99
F1 Score: 0.99
Best Parameters: {'kernel': 'rbf', 'gamma': 'scale', 'C': 10}

===== TF-IDF =====




Optimized Logistic Regression:
Accuracy: 0.99
Precision: 0.99
Recall: 0.99
F1 Score: 0.99
Best Parameters: {'solver': 'saga', 'penalty': 'l1', 'C': 100}

Optimized Random Forest:
Accuracy: 1.00
Precision: 1.00
Recall: 1.00
F1 Score: 1.00
Best Parameters: {'n_estimators': 200, 'min_samples_split': 5, 'min_samples_leaf': 1, 'max_depth': 40}





Optimized Multinomial Naive Bayes:
Accuracy: 0.93
Precision: 0.94
Recall: 0.92
F1 Score: 0.93
Best Parameters: {'alpha': 0.1}

Optimized SVM:
Accuracy: 0.99
Precision: 0.99
Recall: 0.99
F1 Score: 0.99
Best Parameters: {'kernel': 'linear', 'gamma': 'auto', 'C': 10}

===== GloVe =====
Optimized Logistic Regression:
Accuracy: 0.94
Precision: 0.93
Recall: 0.94
F1 Score: 0.93
Best Parameters: {'solver': 'liblinear', 'penalty': 'l2', 'C': 10}

Optimized Random Forest:
Accuracy: 0.95
Precision: 0.95
Recall: 0.93
F1 Score: 0.94
Best Parameters: {'n_estimators': 200, 'min_samples_split': 5, 'min_samples_leaf': 1, 'max_depth': None}

Optimized SVM:
Accuracy: 0.94
Precision: 0.93
Recall: 0.94
F1 Score: 0.94
Best Parameters: {'kernel': 'rbf', 'gamma': 'auto', 'C': 100}



Note: The warnings emitted during the search can be ignored here. RandomizedSearchCV still completes in these cases and returns the best parameters among the combinations it tried.
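
One warning that can be expected with these settings comes from the Multinomial Naive Bayes grid: it has only three candidate values, fewer than n_iter=5. The sketch below reproduces that case with ParameterSampler, the class RandomizedSearchCV uses to draw candidates. Whether this was the only warning in the original run is an assumption; `alpha_space` is a name made up for this example.

```python
import warnings
from sklearn.model_selection import ParameterSampler

# The alpha grid from the notebook: 3 candidates, but 5 iterations requested
alpha_space = {'alpha': [0.1, 1, 5]}

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    sampled = list(ParameterSampler(alpha_space, n_iter=5, random_state=0))

# Only as many candidates as exist are returned, and a UserWarning explains why
print(len(sampled))
for w in caught:
    print(w.category.__name__, w.message)
```

In this situation the search simply evaluates every candidate once, so the result is the same as an exhaustive grid search over alpha.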

In [14]:
import plotly.graph_objects as go

# Create the data
bow_models = ['Logistic Regression', 'Random Forest', 'Multinomial Naive Bayes', 'SVM']
bow_acc = [0.99, 1.00, 0.94, 0.99]
bow_prec = [0.99, 1.00, 0.95, 1.00]
bow_rec = [0.99, 1.00, 0.94, 0.99]
bow_f1 = [0.99, 1.00, 0.94, 0.99]
bow_best_params = [
    "{'solver': 'liblinear', 'penalty': 'l2', 'C': 10}",
    "{'n_estimators': 200, 'min_samples_split': 5, 'min_samples_leaf': 1, 'max_depth': None}",
    "{'alpha': 0.1}",
    "{'kernel': 'rbf', 'gamma': 'scale', 'C': 10}"
]

tfidf_models = ['Logistic Regression', 'Random Forest', 'Multinomial Naive Bayes', 'SVM']
tfidf_acc = [0.99, 1.00, 0.93, 0.99]
tfidf_prec = [0.99, 1.00, 0.94, 0.99]
tfidf_rec = [0.99, 1.00, 0.92, 0.99]
tfidf_f1 = [0.99, 1.00, 0.93, 0.99]
tfidf_best_params = [
    "{'solver': 'saga', 'penalty': 'l1', 'C': 100}",
    "{'n_estimators': 200, 'min_samples_split': 5, 'min_samples_leaf': 1, 'max_depth': 40}",
    "{'alpha': 0.1}",
    "{'kernel': 'linear', 'gamma': 'auto', 'C': 10}"
]

glove_models = ['Logistic Regression', 'Random Forest', 'SVM']
glove_acc = [0.94, 0.95, 0.94]
glove_prec = [0.93, 0.95, 0.93]
glove_rec = [0.94, 0.93, 0.94]
glove_f1 = [0.93, 0.94, 0.94]
glove_best_params = [
    "{'solver': 'liblinear', 'penalty': 'l2', 'C': 10}",
    "{'n_estimators': 200, 'min_samples_split': 5, 'min_samples_leaf': 1, 'max_depth': None}",
    "{'kernel': 'rbf', 'gamma': 'auto', 'C': 100}"
]

# Create the bar plots
fig = go.Figure()
fig.add_trace(go.Bar(x=bow_models, y=bow_acc, name='Accuracy', marker_color='rgb(102, 153, 255)'))
fig.add_trace(go.Bar(x=bow_models, y=bow_prec, name='Precision', marker_color='rgb(255, 102, 102)'))
fig.add_trace(go.Bar(x=bow_models, y=bow_rec, name='Recall', marker_color='rgb(102, 255, 178)'))
fig.add_trace(go.Bar(x=bow_models, y=bow_f1, name='F1 Score', marker_color='rgb(255, 178, 102)'))

# Update the layout
fig.update_layout(
    title='Optimised Model Evaluation',
    xaxis=dict(title='Model'),
    yaxis=dict(title='Score'),
    barmode='group',
    bargap=0.15,
    bargroupgap=0.1,
)

# Add the dropdown menu. Each option restyles all four traces (one per metric).
# The best parameters are passed as customdata, one entry per bar, so they show
# up on hover without being drawn on the bars.
hover_template = '%{x}<br>Score: %{y:.2f}<br>Best Parameters: %{customdata}'

fig.update_layout(
    updatemenus=[
        dict(
            buttons=[
                dict(
                    args=[{
                        'y': [bow_acc, bow_prec, bow_rec, bow_f1],
                        'x': [bow_models] * 4,
                        'customdata': [bow_best_params] * 4,
                        'hovertemplate': hover_template
                    }],
                    label='Bag of Words',
                    method='update'
                ),
                dict(
                    args=[{
                        'y': [tfidf_acc, tfidf_prec, tfidf_rec, tfidf_f1],
                        'x': [tfidf_models] * 4,
                        'customdata': [tfidf_best_params] * 4,
                        'hovertemplate': hover_template
                    }],
                    label='TF-IDF',
                    method='update'
                ),
                dict(
                    args=[{
                        'y': [glove_acc, glove_prec, glove_rec, glove_f1],
                        'x': [glove_models] * 4,
                        'customdata': [glove_best_params] * 4,
                        'hovertemplate': hover_template
                    }],
                    label='GloVe',
                    method='update'
                )
            ],
            direction='down',
            pad={'r': 10, 't': 10},
            showactive=True,
            x=0.1,
            xanchor='left',
            y=1.1,
            yanchor='top'
        ),
    ]
)

fig.show()


## Analysis of the results:
### Bag of Words:
1) Optimized Logistic Regression: Achieved an accuracy of 99%, precision of 99%, recall of 99%, and an F1 score of 99%.

2) Optimized Random Forest: Achieved an accuracy, precision, recall, and F1 score that are all reported as 1.00 (scores are printed to two decimal places). Together with the TF-IDF Random Forest, this is the best-performing model among all models and text representation techniques.

3) Optimized Multinomial Naive Bayes: Achieved an accuracy of 94%, precision of 95%, recall of 94%, and an F1 score of 94%.

4) Optimized SVM: Achieved an accuracy of 99%, precision of 100%, recall of 99%, and an F1 score of 99%.

### TF-IDF:
1) Optimized Logistic Regression: Achieved an accuracy of 99%, precision of 99%, recall of 99%, and an F1 score of 99%.

2) Optimized Random Forest: Achieved an accuracy, precision, recall, and F1 score that are all reported as 1.00 (scores are printed to two decimal places).

3) Optimized Multinomial Naive Bayes: Achieved an accuracy of 93%, precision of 94%, recall of 92%, and an F1 score of 93%.

4) Optimized SVM: Achieved an accuracy of 99%, precision of 99%, recall of 99%, and an F1 score of 99%.

### GloVe:
1) Optimized Logistic Regression: Achieved an accuracy of 94%, precision of 93%, recall of 94%, and an F1 score of 93%.

2) Optimized Random Forest: Achieved an accuracy of 95%, precision of 95%, recall of 93%, and an F1 score of 94%.

3) Optimized SVM: Achieved an accuracy of 94%, precision of 93%, recall of 94%, and an F1 score of 94%.
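
Because the optimised scores are printed with two decimal places, a reported 1.00 does not necessarily mean that every test article was classified correctly. As a small check, the Random Forest accuracy from the table before optimisation already prints as 1.00 at that precision:

```python
# Random Forest accuracy with Bag of Words before optimisation (from the results table)
baseline_rf_accuracy = 0.996612

two_decimals = f'{baseline_rf_accuracy:.2f}'
four_decimals = f'{baseline_rf_accuracy:.4f}'

print(two_decimals)   # same format as the optimisation output
print(four_decimals)  # more digits show the remaining errors
```

Printing the optimised metrics with four or more decimals would make it possible to tell whether the search actually improved on the baseline.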

## In conclusion, the best model for this task is the Random Forest with Bag of Words input: its accuracy, precision, recall, and F1 score are all reported as 1.00 at two decimal places. The Random Forest with TF-IDF input reports the same rounded scores, and both outperform the other models and the GloVe representation. These results show the importance of selecting an appropriate text representation technique and classification model for a given task.

## The best model is the Random Forest with Bag of Words input -
## Save best model

In [16]:
import joblib
best_model = optimized_models['Bag of Words']['Random Forest']

#Save model to file
joblib.dump(best_model, 'best_model_rf_bow.pkl')

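# Refit the Bag of Words vectorizer on the same training data and with the same
# settings as before, so its vocabulary matches the one the model was trained on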
vectorizer = CountVectorizer(stop_words='english', max_features=10000)
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)

# Save the vectorizer
joblib.dump(vectorizer, 'vectorizer.pkl')

['vectorizer.pkl']

## Load the model to make predictions. 
Tackling the problem - detection of fake news.
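
The model and the vectorizer are saved as two separate files and must always be loaded together. One possible alternative, shown here only as a sketch on made-up data, is to wrap both in a scikit-learn Pipeline and save a single file. The `toy_` names and the temporary file path are assumptions for this example, not what the notebook does.

```python
import os
import tempfile

import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Tiny made-up training set (1 = real, 0 = fake), for illustration only
toy_texts = ["senate passes budget", "pope endorses candidate shock",
             "budget vote delayed", "shock claim goes viral"]
toy_labels = [1, 0, 1, 0]

# The pipeline vectorizes raw text and classifies it in one object
toy_pipeline = make_pipeline(
    CountVectorizer(),
    RandomForestClassifier(n_estimators=10, random_state=42),
)
toy_pipeline.fit(toy_texts, toy_labels)

# Save and reload through a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_pipeline.pkl')
    joblib.dump(toy_pipeline, path)
    restored = joblib.load(path)

# The restored pipeline accepts raw text and reproduces the original predictions
same_predictions = list(restored.predict(toy_texts)) == list(toy_pipeline.predict(toy_texts))
print(same_predictions)
```

With this layout the vectorizer and the classifier cannot get out of sync, at the cost of refitting the vectorizer whenever the pipeline is retrained.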

In [1]:
import joblib
# Load the saved model
loaded_model = joblib.load('best_model_rf_bow.pkl')

# Load the preprocessor
vectorizer = joblib.load('vectorizer.pkl')

## Checking that the loaded model reproduces the test performance

In [26]:
# Use the loaded model to make predictions
X_test_transformed = vectorizer.transform(X_test)
y_pred_loaded_model = loaded_model.predict(X_test_transformed)
# Calculate the performance metrics
accuracy_loaded_model = accuracy_score(y_test, y_pred_loaded_model)
precision_loaded_model = precision_score(y_test, y_pred_loaded_model)
recall_loaded_model = recall_score(y_test, y_pred_loaded_model)
f1_loaded_model = f1_score(y_test, y_pred_loaded_model)

print(f'Accuracy: {accuracy_loaded_model:.2f}')
print(f'Precision: {precision_loaded_model:.2f}')
print(f'Recall: {recall_loaded_model:.2f}')
print(f'F1 Score: {f1_loaded_model:.2f}')

Accuracy: 1.00
Precision: 1.00
Recall: 1.00
F1 Score: 1.00


In [13]:
# Create the data
metrics = ['Accuracy', 'Precision', 'Recall', 'F1 Score']
scores = [1.00, 1.00, 1.00, 1.00]

# Create the bar plot
fig = go.Figure()
fig.add_trace(go.Bar(x=metrics, y=scores, marker_color='rgb(102, 153, 255)'))

# Update the layout
fig.update_layout(
    title='Model Evaluation',
    xaxis=dict(title='Metric'),
    yaxis=dict(title='Score'),
    bargap=0.15,
)

# Add annotations to the bars
annotations = []
for i in range(len(metrics)):
    annotations.append(dict(x=metrics[i], y=scores[i]+0.05, text=str(scores[i]), showarrow=False))
fig.update_layout(annotations=annotations)

# Show the figure
fig.show()


## Predicting news found online

## News 1:
https://www.reuters.com/article/us-usa-fiscal-idUSKBN1EP0LK - copy paste sample from start to end. 
## News 2:
https://web.archive.org/web/20161115024211/http://wtoe5news.com/us-election/pope-francis-shocks-world-endorses-donald-trump-for-president-releases-statement/ - copy paste sample from start to end. 

In [2]:
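def predict_fake_news(sample_text):
    # Convert the raw text into Bag of Words features with the fitted vectorizer
    features = vectorizer.transform([sample_text])

    # predict() returns one label per input text, so take the first (and only) one
    prediction = loaded_model.predict(features)[0]
    print(prediction)

    # Label 1 is treated as real news, any other label as fake
    if prediction == 1:
        return "Real"
    else:
        return "Fake"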

# Prompt the user to enter the sample news text repeatedly
while True:
    sample_text = input("Enter your sample news text (type 0 to exit): ")

    if sample_text == "0":
        break

    result = predict_fake_news(sample_text)
    
    print(f"The given news is {result}.")

Enter your sample news text (type 0 to exit): WASHINGTON (Reuters) - The head of a conservative Republican faction in the U.S. Congress, who voted this month for a huge expansion of the national debt to pay for tax cuts, called himself a “fiscal conservative” on Sunday and urged budget restraint in 2018.  In keeping with a sharp pivot under way among Republicans, U.S. Representative Mark Meadows, speaking on CBS’ “Face the Nation,” drew a hard line on federal spending, which lawmakers are bracing to do battle over in January.  When they return from the holidays on Wednesday, lawmakers will begin trying to pass a federal budget in a fight likely to be linked to other issues, such as immigration policy, even as the November congressional election campaigns approach in which Republicans will seek to keep control of Congress.  President Donald Trump and his Republicans want a big budget increase in military spending, while Democrats also want proportional increases for non-defense “discret

## Prediction Outcomes are Correct!!

## News 1: As U.S. budget fight looms, Republicans flip their fiscal script
True story. Thomson Reuters is dedicated to upholding the Trust Principles and to preserving its independence, integrity, and freedom from bias in the gathering and dissemination of information and news.

## News 2: In 2016, a story circulated that Pope Francis made an unprecedented and shocking endorsement of Donald Trump for president.
This story is completely false.
The original story can be traced back to a satire website, but it took off from there and became viral.
There were also other versions of this fake story claiming Pope Francis instead endorsed Hillary Clinton and Bernie Sanders for president.