# Proof of Concept Project: News Article Classification
---
## Introduction
The goal of this proof of concept project is to classify news articles into different categories based on their headlines. The dataset contains 422,937 news stories collected by a web aggregator between March 10th, 2014 and August 10th, 2014. The news categories include business, science and technology, entertainment, and health. Articles referring to the same news item are categorized together.

## Project Steps

1. **Data Gathering**
   - Collect the newspaper articles dataset containing headlines, URLs, and categories.

2. **Pre-processing Functions**
   - Set up pre-processing functions to clean and prepare the text data for classification.
   - Create a stop word list to remove common words with low significance.
   - Handle stop-words and bigrams to improve data quality.
   - Apply general contractions on text to standardize language.
   - Tokenize the text and apply lemmatization or stemming as necessary to reduce words to their base forms.
   - Remove punctuation to focus on word meanings.

3. **Classifier Model Building**
   - Build the classifier model using multiple techniques found during research.
   - Compare the accuracy of different models to select the best approach.
   - Use a basic bag-of-words approach for initial classification.
   - Implement word embedding and Word2Vec to enhance accuracy.

4. **Model Improvement and Iteration**
   - Tweak contractions and pre-processing functions as needed.
   - Rerun the model with updated settings until accuracy reaches an optimal level.
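
As a rough illustration of the bag-of-words baseline listed in the steps above, the sketch below trains a small classifier on a handful of made-up headlines. The headlines, the two category names, and the choice of `MultinomialNB` are assumptions for demonstration only; they are not the project's data or its final model.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy headlines and labels (invented for this sketch, not taken from the dataset)
train_headlines = [
    "stocks rally as bank profits rise",
    "bank shares fall on weak profits",
    "new vaccine trial shows promise",
    "doctors warn of flu outbreak",
]
train_labels = ["business", "business", "health", "health"]

# Bag of words: each headline becomes a vector of word counts,
# which the Naive Bayes classifier then uses to pick a category
baseline = make_pipeline(CountVectorizer(), MultinomialNB())
baseline.fit(train_headlines, train_labels)

predictions = baseline.predict(["bank profits rise", "flu vaccine trial"])
print(list(predictions))
```

Swapping `MultinomialNB` for another estimator in the same pipeline is one simple way to compare models on identical features.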

## Conclusion
This proof of concept project aims to classify news articles into categories based on their headlines. By following the outlined steps, we can pre-process the data, build and refine a classifier model using different techniques, and compare their accuracy. Word embeddings such as Word2Vec are intended to improve the model further by capturing word meaning and context.


In [3]:
# -*- coding: utf-8 -*-
"""
Created on Fri Jun 30 10:13:51 2023

@author: jkelly3 & sduplessis
"""

#Import libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn import tree
from sklearn import metrics, feature_extraction, feature_selection, model_selection, naive_bayes, pipeline, manifold, preprocessing
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.multiclass import OneVsRestClassifier
from collections import Counter

import re

#nltk.download()
import nltk

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.tokenize import word_tokenize

import string
import contractions as cns  # provides cns.fix, which the pre-processing functions call

from nltk.sentiment.vader import SentimentIntensityAnalyzer

import gensim.models
from sklearn.manifold import TSNE
#from adjustText import adjust_text

import xgboost as xgb




In [4]:
#Download corpora
def download_Data(file_location):

    '''
    Download and read a CSV file into a pandas DataFrame. Give the Function a string with the absolute path to your csv,
    or alternatively give it the name of the csv if the file is in the same directory as this script.

    Args:
    file_location (str) : The file path or URL of the CSV file to be downloaded and read.

    Returns:
    df (pandas.DataFrame) : A DataFrame containing the data read from the CSV file.

    '''

    return pd.read_csv(file_location)


In [None]:
# Testing on data
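# Raw string so that the backslashes in the Windows path are not read as escape sequences
data_loc = r'C:\Users\conkennedy\OneDrive\Desktop\textClassification\aggregated_data.csv'
df = download_Data(data_loc)
df.head()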

# Data Exploration:
---
Data exploration is the first step in data analysis involving the use of data visualization tools and statistical techniques to uncover data set characteristics and initial patterns.

It is a crucially important step for both the viewer of the analysis and the data scientist. Humans are visual learners, able to process visual data much more easily than numerical data. Consequently, it's challenging for data scientists to review thousands of rows of data points and infer meaning without assistance.


## Function: `barplot_target_category`

This function generates a bar plot of category counts for a specified DataFrame column. It can optionally relabel categories through a replacement dictionary before counting, and the colormap can be changed inside the function body.

### Parameters

- **`df1`** (*pandas.DataFrame*): The original DataFrame for analysis.
- **`field`** (*str*): The column containing categories to visualize.
- **`field_replacement_dict`** (*dict*, optional): A dictionary for category replacements. Default is *None*.

### Returns

- **None**
---

By using this function, you can conveniently visualize the distribution of categories within a specific DataFrame column. The function allows for easy handling of categories, provides options for customization, and is particularly beneficial for exploratory data analysis.
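
The counting step behind the plot can be sketched on a toy column. The column name `CATEGORY`, the single-letter codes, and the readable labels below are assumptions made for this example; substitute whatever your dataset actually uses.

```python
import pandas as pd

# Toy data: the column name and single-letter codes are assumptions for this sketch
toy = pd.DataFrame({"CATEGORY": ["b", "t", "b", "e", "m", "b", "t"]})
readable = {
    "b": "business",
    "t": "science and technology",
    "e": "entertainment",
    "m": "health",
}

# Relabel the codes, then count how often each category occurs
# (value_counts sorts the result with the most frequent category first)
counts = toy["CATEGORY"].replace(readable).value_counts()
print(counts)
```

The resulting Series has the category labels as its index and the counts as its values, which is the shape `plt.bar` expects for its two positional arguments.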


In [None]:
#Exploratory Data Analysis

def barplot_target_category(df1, field, field_replacement_dict=None):

    '''
    Create a bar plot to visualize the counts of categories in a DataFrame column.

    Args:
        df1 (pandas.DataFrame): The original DataFrame.
        field (str): The name of the column in the DataFrame containing the categories to be plotted.
        field_replacement_dict (dict, optional): A dictionary to map category replacements if needed. Default is None.

    Returns:
        None
    '''

    df = df1.copy()

    if field_replacement_dict:
        df[field] = df[field].replace(field_replacement_dict)

    counts = df[field].value_counts()
    num_categories = len(counts)
    color_palette = plt.get_cmap("tab10")  # Change 'tab10' to any other colormap you prefer

    # Create the bar plot
    plt.bar(counts.index, counts.values, color=color_palette(range(num_categories)))

    # Customize the plot
    plt.xlabel('Category')
    plt.ylabel('Count')
    plt.title('Counts of Categories')
    plt.xticks(rotation=45)

    # Display the plot
    plt.show()



## Analysis of N-grams

## Function: `_get_top_ngram`

This function retrieves the top n-grams (word sequences) from a corpus based on their frequency. It can be called on the unprocessed target column or after processing.

### Parameters

- **`corpus`** (*list*): A list of strings representing the text corpus.
- **`n`** (*int*, optional): The n-gram size. Default is None, corresponding to unigrams (single words).

### Returns

- **list**: A list of tuples containing the top n-grams and their frequencies.

---

## Function: `plot_top_ngrams_barchart`

This function creates a bar chart to visualize the top n-grams (word sequences) in the given text.

### Parameters

- **`text`** (*pandas.Series*): A pandas Series containing the text data.
- **`n`** (*int*, optional): The n-gram size. Default is 2.


---

By utilizing these functions, you can efficiently analyze the distribution of n-grams within text data. The `_get_top_ngram` function calculates the most frequent n-grams, while `plot_top_ngrams_barchart` visualizes these n-grams using a bar chart.
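
To make the counting concrete, here is a minimal sketch of bigram counting with `CountVectorizer` on three invented headlines. The corpus is an assumption for illustration; the approach mirrors the frequency calculation described above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three made-up headlines (assumed data for this sketch)
corpus = [
    "apple unveils new phone",
    "apple unveils new tablet",
    "new phone sales rise",
]

# ngram_range=(2, 2) keeps only bigrams (pairs of adjacent words)
vec = CountVectorizer(ngram_range=(2, 2)).fit(corpus)

# Sum each bigram's column across all documents to get corpus-wide totals
totals = vec.transform(corpus).sum(axis=0)

# Pair every bigram with its total and sort from most to least frequent
bigram_freq = sorted(
    ((ngram, int(totals[0, idx])) for ngram, idx in vec.vocabulary_.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
print(bigram_freq[:3])
```

Because n-grams are built from adjacent words within each document, the corpus must stay as one string per document rather than one flat list of words.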

In [None]:
#Analysis of N-grams:

def _get_top_ngram(corpus, n=None):

    '''
    Get the top n-grams (word sequences) in a corpus based on their frequency.

    Args:
        corpus (list): A list of strings representing the text corpus.
        n (int, optional): The n-gram size. Default is None, which corresponds to unigrams (single words).

    Returns:
        list: A list of tuples containing the top n-grams and their frequencies.
    '''

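    # Treat n=None as unigrams, since CountVectorizer needs integer bounds for ngram_range
    if n is None:
        n = 1

    # Create a CountVectorizer object to convert text into a bag of words representation
    vec = CountVectorizer(ngram_range=(n, n)).fit(corpus)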

    # Transform the corpus into a bag of words
    bag_of_words = vec.transform(corpus)

    # Sum the word frequencies across the entire corpus
    sum_words = bag_of_words.sum(axis=0)

    # Create a list of tuples containing each n-gram and its frequency
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]

    # Sort the list in descending order based on the n-gram frequency
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)

    # Return the top n-grams (defaulting to the top 10)
    return words_freq[:10]

# Plot the N-grams:

def plot_top_ngrams_barchart(text, n=2):

    '''
    Create a bar chart to visualize the top n-grams (word sequences) in the given text.

    Args:
        text (pandas.Series): A pandas Series containing the text data.
        n (int, optional): The n-gram size. Default is 2.

    Returns:
        None
    '''

    # Get the set of English stopwords
    stop = set(stopwords.words('english'))

    # Remove stopwords from each document, keeping one string per document
    # so that n-grams are still built from adjacent words
    corpus = [
        " ".join(word for word in str(doc).split() if word.lower() not in stop)
        for doc in text
    ]

    # Get the top n-grams using the _get_top_ngram function (it already returns the top 10)
    top_n_bigrams = _get_top_ngram(corpus, n)

    # Extract the n-grams and their frequencies from the result
    x, y = map(list, zip(*top_n_bigrams))

    # Create a bar plot using seaborn's barplot function
    sns.barplot(x=y, y=x)

    # Show the plot
    plt.show()

#plot_top_ngrams_barchart(df['TITLE'],2)

#plot_top_ngrams_barchart(df['CLEAN_TEXT'],3)


# Functions Used in Data Manipulation and Preprocessing
---
These functions collectively support the preparation and preprocessing of text data for classification tasks, incorporating steps such as forming corpora, handling contractions, removing stopwords, cleaning text, and engineering features for improved model accuracy. Use `help()` on the functions or read the docstrings for more information.

## Form a Corpus List from a target Column:

1. **`form_corpora_list(df, field, is_list=False)`**
   
    - **Description:** Extracts a list of words from a DataFrame column for text analysis.
   
    - **Parameters:**
        - `df` (pandas DataFrame): The DataFrame containing the data.
        - `field` (str): The name of the column from which the corpus will be extracted.
        - `is_list` (bool): Flag indicating if the column contains lists of words or string type variables.
   
    - **Returns:** A list containing the words from the specified column (corpus).
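
The two input shapes that `is_list` distinguishes can be sketched on a toy DataFrame. The column names `TITLE` and `TOKENS` and their contents are assumptions for this example.

```python
import pandas as pd

# Toy DataFrame with one column of strings and one column of token lists (assumed data)
toy = pd.DataFrame({
    "TITLE": ["stocks rally today", "flu season starts"],
    "TOKENS": [["stocks", "rally"], ["flu"]],
})

# Column of strings: split each string on whitespace, then flatten
corpus_from_strings = [word for words in toy["TITLE"].str.split() for word in words]

# Column of lists: explode gives one row per list element, which flattens directly
corpus_from_lists = toy["TOKENS"].explode().tolist()

print(corpus_from_strings)
print(corpus_from_lists)
```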

In [None]:
# Form the Corpora and Analyse frequency of non-stop words:
def form_corpora_list(df, field, is_list = False):

    '''
    Extract a corpus list from a DataFrame column. Give this function the pandas dataframe and the field/column that
    you are using to form the corpus of your analysis. If that column is tokenized e.g., in a list format, include the
    is_list flag as true, if it is a column of strings, make the flag false.

    Args:
    df (pandas.DataFrame) : The DataFrame containing the data.

    field (str) : The name of the column in the DataFrame from which the corpus will be extracted.

    is_list (bool), optional (default=False) :
        If True, assumes that the specified column contains lists of words.
        If False, assumes that the specified column contains sentences or text.

    Returns:
    list (list) : A list containing the words or elements from the specified column (corpus).
    '''

    if is_list:
        return df[field].explode().tolist()

    # Split each string on whitespace, then flatten the nested lists into one corpus list
    column_split = df[field].str.split().tolist()
    corpus = [word for words in column_split for word in words]

    return corpus


## Function to define the specific contractions in the dataset:
2. **`specific_contractions()`**
   
    - **Description:** Provides a dictionary of specific contractions and their replacements.
   
    - **Returns:** A dictionary of contraction patterns and their corresponding replacements.

In [None]:
def specific_contractions():

    '''
    Get a dictionary of specific contractions and their corresponding replacements.

    Returns:
    general phrases (dict) : A dictionary containing regular expression patterns as keys and their corresponding replacements
    as values.

    '''

    GENERAL_PHRASES = {
        #covid
        r"\bcovid19\b": "covid",
        r"\bcovid 19\b": "covid",
        r"\bcoronavirus\b": "covid",
        r"\bcorona virus\b": "covid",
        r"\b[Uu]\.?[Ss]\.?'?s?\b ": "US ",
        r"\bU\.?K\.?\b" : "United Kingdom",
        r"&" : "and"}

    return GENERAL_PHRASES


## Replace the Custom Contractions defined in previous Function:
3. **`replace_custom_contractions(text)`**
   
    - **Description:** Replaces specific contractions in the given text with their corresponding expansions.
   
    - **Parameters:** `text` (str): The input text with contractions to be replaced.
   
    - **Returns:** The input text with specific contractions replaced.
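
The replacement mechanism is a loop of regular-expression substitutions over a dictionary. The sketch below shows the idea with two illustrative patterns; the helper name `apply_replacements` and the sample sentence are hypothetical and chosen only for this example.

```python
import re

# Two illustrative patterns (hypothetical, chosen for this sketch)
replacements = {
    r"\bcovid ?19\b": "covid",  # "covid19" or "covid 19" -> "covid"
    r"&": "and",
}

def apply_replacements(text, patterns):
    """Apply each regex substitution in turn and return the rewritten text."""
    for pattern, replacement in patterns.items():
        text = re.sub(pattern, replacement, text)
    return text

cleaned = apply_replacements("Mergers & acquisitions slow as covid 19 spreads", replacements)
print(cleaned)  # Mergers and acquisitions slow as covid spreads
```

Dictionaries preserve insertion order in Python 3.7+, so patterns are applied in the order they are written; this matters when one replacement could create or destroy a match for a later pattern.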

In [None]:
def replace_custom_contractions(text):

    '''
    Replace specific contractions in the given text with their corresponding expansions.

    Args:
    text (str) : The input text where contractions will be replaced.

    Returns:
    text (str) : The input text with specific contractions replaced.

    '''

    dict_contractions = specific_contractions()

    for pattern, replacement in dict_contractions.items():
        text = re.sub(pattern, replacement, text)

    return text


# Keep the original misspelled name as an alias, since the pre-processing functions call it
replace_custom_constractions = replace_custom_contractions



## Form a set of stopwords with the built-in nltk stopwords and any additional stopwords identified:
4. **`custom_stopwords()`**
   
    - **Description:** Provides a set of custom stopwords by combining NLTK's English stopwords with additional stopwords.
   
    - **Returns:** A set of custom stopwords for text processing.
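
The union-then-filter pattern can be sketched without downloading any NLTK data. Note that this example swaps in scikit-learn's built-in `ENGLISH_STOP_WORDS` in place of NLTK's list, and the extra stopwords and sample tokens are assumptions for illustration.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Stand-in base list: scikit-learn's English stop words (the notebook itself uses NLTK's list)
base_stopwords = set(ENGLISH_STOP_WORDS)

# Hypothetical domain-specific additions
extra_stopwords = {"says", "report"}

# Union the two sets, then drop any token that appears in the combined set
combined = base_stopwords.union(extra_stopwords)
tokens = ["the", "minister", "says", "rates", "rise"]
filtered = [token for token in tokens if token not in combined]
print(filtered)
```

Using a set rather than a list keeps each membership test constant time, which adds up when filtering hundreds of thousands of headlines.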

In [None]:
#Import stopwords
def custom_stopwords():

    '''
    Get a set of custom stopwords by combining NLTK's English stopwords with additional stopwords.

    Returns:
        set (set): A set containing custom stopwords for text processing.
    '''

    # Start from NLTK's built-in English stopword list (stopwords is imported at the top of the notebook)
    stop_words = set(stopwords.words('english'))

    # Placeholder to include regular expressions and custom stopwords later
    extra = set()

    return stop_words.union(extra)


## Sentiment Analysis with TextBlob

The provided functions enable sentiment analysis using the TextBlob library, offering a convenient way to assess text sentiment and subjectivity.

### Function: `get_polarity`

This function calculates the polarity (sentiment) of a given text using TextBlob.

- **Parameters**:  
  - `text` (*str*): The input text.

- **Returns**:  
  - *float*: The polarity score ranging from -1.0 (negative) to 1.0 (positive). If sentiment analysis fails, it returns 0.0.

### Function: `get_subjectivity`

This function calculates the subjectivity of a given text using TextBlob.

- **Parameters**:  
  - `text` (*str*): The input text.

- **Returns**:  
  - *float*: The subjectivity score ranging from 0.0 (objective) to 1.0 (subjective). If sentiment analysis fails, it returns 0.0.

These functions offer a streamlined approach for analyzing sentiment and subjectivity of text using TextBlob. They can be valuable for tasks like understanding customer feedback sentiment, monitoring social media sentiment, and assessing the subjective nature of textual content. They are called in the feature engineering function.


In [None]:
from textblob import TextBlob

def get_polarity(text):
    """
    Calculate the polarity (sentiment) of a given text using TextBlob.

    Args:
        text (str): The input text.

    Returns:
        float: The polarity score ranging from -1.0 (negative) to 1.0 (positive).
               If sentiment analysis fails, returns 0.0.
    """
    try:
        textblob = TextBlob(text)
        polarity = textblob.sentiment.polarity
    except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
        polarity = 0.0
    return polarity

def get_subjectivity(text):
    """
    Calculate the subjectivity of a given text using TextBlob.

    Args:
        text (str): The input text.

    Returns:
        float: The subjectivity score ranging from 0.0 (objective) to 1.0 (subjective).
               If sentiment analysis fails, returns 0.0.
    """
    try:
        textblob = TextBlob(text)
        subjectivity = textblob.sentiment.subjectivity
    except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
        subjectivity = 0.0
    return subjectivity


## Preprocess Function:
Please note that scikit-learn's text vectorizers (for example `CountVectorizer`) can perform some of this processing, such as lower-casing, tokenization, and stop-word removal, but I wanted to provide a custom function that can be tailored to any NLP project as a template.
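
To show how a chain of regex substitutions transforms a headline, here is a reduced sketch that uses only the standard library. The helper name `basic_clean` and the sample headline are hypothetical; the patterns are a subset of the kind of cleaning a custom function performs (ordinals, currency amounts, leftover punctuation, digits, whitespace).

```python
import re

def basic_clean(text):
    """Apply a short sequence of regex clean-up steps and lower-case the result."""
    text = re.sub(r"\b\d+(st|nd|rd|th)\b", "", text)   # ordinals such as 2nd, 3rd
    text = re.sub(r"[$€¥£]\d+[^ ]*", "", text)          # currency amounts such as $312M
    text = re.sub(r"[^a-zA-Z0-9\s]+", "", text)         # any remaining punctuation
    text = re.sub(r"\d+", "", text)                     # any remaining digits
    text = re.sub(r"\s+", " ", text).strip()            # collapse repeated whitespace
    return text.lower()

cleaned_headline = basic_clean("Twitter 2nd quarter revenue hits $312M, up 124%")
print(cleaned_headline)  # twitter quarter revenue hits up
```

The order of the steps matters: the currency pattern has to run before punctuation is stripped, otherwise the `$` sign it relies on would already be gone.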


In [None]:
def pre_process(text, is_lemming = False, is_stemming = False, str_rejoin = False):

    '''
    Preprocess the text, giving options for lemmatization, stemming and rejoining string from list as optional flags.

    Args:
        text (str): The string input text that is being preprocessed.
        is_lemming (bool): The flag to control if the text should be lemmatized - default is False.
        is_stemming (bool): The flag to control if the text should be stemmed - default is False.
        str_rejoin (bool): The flag to control if the text should be rejoined from a list at the end of the preprocessing - default is False.

    Returns:
        list or str: The list of preprocessed tokens, or a single string if str_rejoin is True.

    '''


    # Remove the basic contractions with the contractions package:
    text =  replace_custom_constractions(text)
    text = cns.fix(text)

    # Remove 1st, 2nd, 3rd etc.:

    text = re.sub(r"\b\d+(st|nd|rd|th)\b", '', text)

    # Remove the Amounts of money e.g. 100k, 200M etc.:
    text = re.sub(r"\b\d+(\.\d+)?[kMBGTPEZY]\b", '', text)


    # Strip the possessive "'s" while keeping the word itself:
    text = re.sub(r"\b(\w+)'s\b", r"\1", text)

    # remove currency numbers:
    text = re.sub(r"[$€¥£]\d+[^ ]*", '', text)


    # Remove punctuation where the punctuation is internal - add space instead to maintain word breakpoint:
    pattern = r"(?<=\w)[!\"#$%&'()*+,-./:;<=>?@[\\\]^_`{|}~](?=\w)"
    text = re.sub(pattern, ' ', text)

    # Remove any remaining non-alphanumeric characters (e.g. leading or trailing punctuation):
    text = re.sub(r"[^a-zA-Z0-9\s]+", '', text)

    # Remove Numbers:
    text =  re.sub(r'\d+', '', text)


    # Remove multiple spaces where they occur:
    text = re.sub(r'^\s+', '', text)
    text = re.sub(r'\s+', ' ', text)

    # Make the Text Lower
    text = text.lower()

    # Tokenize the text:
    text = nltk.word_tokenize(text)

    # Remove set of default stopwords with stopwords library:
    stop = custom_stopwords()
    text = [word for word in text if word not in stop]

    # Lemmatisation:
    if is_lemming:

        lem = WordNetLemmatizer()
        text = [lem.lemmatize(w) for w in text]

    # Stemming
    if is_stemming:

        ps = PorterStemmer()
        text = [ps.stem(w) for w in text]

    # If you want the tokens to be made back into a string:
    if str_rejoin:
        text = " ".join(text)


    return text


## Preprocessing Function applied to a Dataframe:
5. **`data_frame_pre_process(df_input, field, is_lemming=False, is_stemming=False, str_rejoin=False)`**
   
    - **Description:** Preprocesses text data in a DataFrame column.
   
    - **Parameters:**
        - `df_input` (pandas DataFrame): The DataFrame containing the text data.
        - `field` (str): The column containing the text data.
        - `is_lemming` (bool): Flag for lemmatization.
        - `is_stemming` (bool): Flag for stemming.
        - `str_rejoin` (bool): Flag to rejoin tokens into a string.
   
    - **Returns:** A DataFrame with preprocessed text data in a new 'CLEAN_TEXT' column.
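
The DataFrame-level pattern (drop missing rows, copy, then apply a cleaning function to one column) can be sketched as follows. The stand-in cleaner `simple_clean` and the toy data are assumptions; it is far simpler than the full pre-processing described here.

```python
import pandas as pd

def simple_clean(text):
    """Hypothetical stand-in for a full cleaning function: lower-case and split on whitespace."""
    return text.lower().split()

# Toy data with one missing title (assumed for this sketch)
raw = pd.DataFrame({"TITLE": ["Stocks Rally", None, "Flu Season"]})

# Drop rows with missing values and copy, so the caller's DataFrame is left untouched
clean = raw.dropna().copy()
clean["CLEAN_TEXT"] = clean["TITLE"].apply(simple_clean)
print(clean)
```

Working on a copy avoids pandas' chained-assignment warnings and keeps the raw data available for comparison after cleaning.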

In [None]:
def data_frame_pre_process(df_input, field, is_lemming = False, is_stemming = False, str_rejoin = False):

    '''
    Preprocess the text data in a DataFrame column.

    Args:
        df_input (pandas.DataFrame): The DataFrame containing the text data.
        field (str): The name of the column in the DataFrame containing the text data.
        is_lemming (bool): The flag to control if the text should be lemmatized - default is False.
        is_stemming (bool): The flag to control if the text should be stemmed - default is False.
        str_rejoin (bool): The flag to control if the text should be rejoined from a list at the end of the preprocessing - default is False.

    Returns:
        pandas.DataFrame: A DataFrame with the preprocessed text data in a new column named 'CLEAN_TEXT'.
    '''

    # Remove NA values from the dataset if present:
    df = df_input.dropna().copy()

    # Remove the basic contractions with the contractions package:
    df['CLEAN_TEXT'] = df[field].apply(lambda x: replace_custom_constractions(x))
    df['CLEAN_TEXT'] = df['CLEAN_TEXT'].apply(lambda x: cns.fix(x))

    # Remove 1st, 2nd, 3rd etc.:

    df['CLEAN_TEXT'] = df['CLEAN_TEXT'].apply(lambda x : re.sub(r"\b\d+(st|nd|rd|th)\b", '', x))

    # Remove the Amounts of money e.g. 100k, 200M etc.:
    df['CLEAN_TEXT'] = df['CLEAN_TEXT'].apply(lambda x : re.sub(r"\b\d+(\.\d+)?[kMBGTPEZY]\b", '', x))


    # Strip the possessive "'s" while keeping the word itself:
    df['CLEAN_TEXT'] = df['CLEAN_TEXT'].apply(lambda x : re.sub(r"\b(\w+)'s\b", r"\1", x))

    # remove currency numbers:
    df['CLEAN_TEXT'] = df['CLEAN_TEXT'].apply(lambda x : re.sub(r"[$€¥£]\d+[^ ]*", '', x))


    # Remove punctuation where the punctuation is internal - add space instead to maintain word breakpoint:
    pattern = r"(?<=\w)[!\"#$%&'()*+,-./:;<=>?@[\\\]^_`{|}~](?=\w)"
    df['CLEAN_TEXT'] = df['CLEAN_TEXT'].apply(lambda x: re.sub(pattern, ' ', x))

    # Remove any remaining non-alphanumeric characters (e.g. leading or trailing punctuation):
    df['CLEAN_TEXT'] = df['CLEAN_TEXT'].apply(lambda x : re.sub(r"[^a-zA-Z0-9\s]+", '', x))

    # Remove Numbers:
    df['CLEAN_TEXT'] = df['CLEAN_TEXT'].apply(lambda x : re.sub(r'\d+', '', x))


    # Remove multiple spaces where they occur:
    df['CLEAN_TEXT'] = df['CLEAN_TEXT'].apply(lambda x : re.sub(r'^\s+', '', x))
    df['CLEAN_TEXT'] = df['CLEAN_TEXT'].apply(lambda x : re.sub(r'\s+', ' ', x))

    # Make the Text Lower
    df['CLEAN_TEXT'] = df['CLEAN_TEXT'].apply(lambda x : x.lower())

    # Tokenize the text:
    df['CLEAN_TEXT'] = df['CLEAN_TEXT'].apply(lambda x : nltk.word_tokenize(x))

    # Remove set of default stopwords with stopwords library:
    stop = custom_stopwords()
    df['CLEAN_TEXT'] = df['CLEAN_TEXT'].apply(lambda x : [word for word in x if word not in stop])

    # Lemmatisation:
    if is_lemming:

        lem = WordNetLemmatizer()
        df['CLEAN_TEXT'] = df['CLEAN_TEXT'].apply(lambda x : [lem.lemmatize(w) for w in x])

    # Stemming
    if is_stemming:

        ps = PorterStemmer()
        df['CLEAN_TEXT'] = df['CLEAN_TEXT'].apply(lambda x : [ps.stem(w) for w in x])

    # If you want the tokens to be made back into a string:
    if str_rejoin:
        df['CLEAN_TEXT'] = df['CLEAN_TEXT'].apply(lambda x : " ".join(x))

    return df


## Feature Engineering:
---
Feature engineering is the process of selecting, manipulating, and transforming raw data into features that can be used in supervised learning. In order to make machine learning work well on new tasks, it might be necessary to design and train better features.

In this case, the engineered features did not improve accuracy, but the example is kept as a template. It can be applied to non-text classification problems, or to text problems where the features engineered below are more informative. For example, classifying complaints might benefit from punctuation counts or sentence length.

6. **`feature_engineer(df, target, unwanted=False)`**
   
    - **Description:** Performs feature engineering on the DataFrame for improved classification accuracy.
   
    - **Parameters:**
        - `df` (pandas DataFrame): The DataFrame containing pre-processed text data.
        - `target` (str): Name of the target categorical label.
        - `unwanted` (list): List of unwanted columns to drop.
   
    - **Returns:**
        - `X` (pandas DataFrame): DataFrame with added features.
        - `y` (numpy array): Label encoded target.
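
A few of the simpler features can be sketched on a two-row DataFrame. The column names `TITLE` and `CATEGORY` and the sample rows are assumptions for this example, and only the length and punctuation features plus the label encoding are shown.

```python
import string

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data (assumed for this sketch)
toy = pd.DataFrame({
    "TITLE": ["Stocks rally!", "Is flu season over?"],
    "CATEGORY": ["business", "health"],
})

X = toy[["TITLE"]].copy()
X["WORD_COUNT"] = X["TITLE"].str.split().str.len()            # words per title
X["CHARACTER_COUNT"] = X["TITLE"].str.len()                   # characters per title
X["AVERAGE_LENGTH"] = X["CHARACTER_COUNT"] / X["WORD_COUNT"]  # characters per word
X["PUNCTUATION_COUNT"] = X["TITLE"].apply(
    lambda title: sum(ch in string.punctuation for ch in title)
)

# LabelEncoder maps the sorted class names to the integers 0, 1, ...
y = LabelEncoder().fit_transform(toy["CATEGORY"])
print(X)
print(y)
```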


In [None]:
def feature_engineer(df,target, unwanted = False):

    """
    Perform feature engineering on the given DataFrame for increased accuracy in classification.

    Args:
        df (pandas.DataFrame): The DataFrame containing pre-processed text data.
        target (str): The String name of the target categorical Label.
        unwanted (list): The list of unwanted columns Strings to drop

    Returns:
        X (pandas.DataFrame): The DataFrame with added features based on feature engineering.
        y (numpy.ndarray): Label encoding of the categorical target.

    Note:
        This function adds various features to the DataFrame including word count, character count,
        diversity score, punctuation count, polarity, subjectivity, counts of specific parts of speech (POS) tags,
        and more.

    """
    y = df[target]


    # Feature Engineering for increased accuracy in classification:
    # Now we have processed and pre-processed text in our dataframe. Let's start making features from the above data.
    # Always drop the target; copy the unwanted list so the caller's argument is not modified.
    drop_cols = (list(unwanted) if unwanted else []) + [target]
    X = df.drop(drop_cols, axis = 1)

    # Feature 1 - Length of the input OR count of the words in the statement (Vocab size).
    X['WORD_COUNT'] = X['TITLE'].apply(lambda x: len(str(x).split()))  # Feature 1

    # Feature 2 - Count of characters in a statement
    X['CHARACTER_COUNT'] = X['TITLE'].apply(lambda x: len(str(x)))  # Feature 2

    # Feature 3 - Diversity_score i.e. Average length of words used in statement
    X['AVERAGE_LENGTH'] = X['CHARACTER_COUNT'] / X['WORD_COUNT']  # Feature 3

    # Feature 4: Count of punctuations in the input.
    X['PUNCTUATION_COUNT'] = X['TITLE'].apply(lambda x: len([w for w in str(x) if w in string.punctuation]))  # Feature 4

    # Features 5 and 6 - Sentiment polarity and subjectivity of the statement (via TextBlob)
    X['polarity'] = X['TITLE'].apply(get_polarity)  # Feature 5: Polarity
    X['subjectivity'] = X['TITLE'].apply(get_subjectivity)  # Feature 6: Subjectivity

    # Join all titles into one string, space-separated so that words from adjacent titles do not merge
    all_text = " ".join(df['TITLE'].astype(str))

    tokenized_all_text = word_tokenize(all_text)  # tokenize the text

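    # Adding POS Tags to tokenized words
    list_of_tagged_words = nltk.pos_tag(tokenized_all_text)
    # Lower-case the words so they can be matched against the lower-cased titles below
    set_pos = set((word.lower(), tag) for word, tag in list_of_tagged_words)  # set of (word, POS tag) pairs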

    # Counting specific POS tags
    nouns = ['NN', 'NNS', 'NNP', 'NNPS']  # POS tags of nouns
    list_of_words = set(map(lambda tuple_2: tuple_2[0], filter(lambda tuple_2: tuple_2[1] in nouns, set_pos)))
    X['NOUN'] = X['TITLE'].apply(lambda x: len([w for w in str(x).lower().split() if w in list_of_words]))  # Feature 7

    # Counting pronouns
    pronouns = ['PRP', 'PRP$', 'WP', 'WP$']  # POS tags of pronouns
    list_of_words = set(map(lambda tuple_2: tuple_2[0], filter(lambda tuple_2: tuple_2[1] in pronouns, set_pos)))
    X['PRONOUN_COUNT'] = X['TITLE'].apply(lambda x: len([w for w in str(x).lower().split() if w in list_of_words]))  # Feature 8

    # Counting verbs
    verbs = ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']  # POS tags of verbs
    list_of_words = set(map(lambda tuple_2: tuple_2[0], filter(lambda tuple_2: tuple_2[1] in verbs, set_pos)))
    X['VERBS_COUNT'] = X['TITLE'].apply(lambda x: len([w for w in str(x).lower().split() if w in list_of_words]))  # Feature 9

    # Counting adverbs
    adverbs = ['RB', 'RBR', 'RBS', 'WRB']  # POS tags of adverbs
    list_of_words = set(map(lambda tuple_2: tuple_2[0], filter(lambda tuple_2: tuple_2[1] in adverbs, set_pos)))
    X['ADVERBS_COUNT'] = X['TITLE'].apply(lambda x: len([w for w in str(x).lower().split() if w in list_of_words]))  # Feature 10

    # Counting adjectives
    adjectives = ['JJ', 'JJR', 'JJS']  # POS tags of adjectives
    list_of_words = set(map(lambda tuple_2: tuple_2[0], filter(lambda tuple_2: tuple_2[1] in adjectives, set_pos)))
    X['ADJECTIVE_COUNT'] = X['TITLE'].apply(lambda x: len([w for w in str(x).lower().split() if w in list_of_words]))  # Feature 11

    encoder = LabelEncoder()
    y = encoder.fit_transform(y)

    return X,y

# Further Data Exploration on Prepared Dataset:
---
Now that we have our target text processed, we should take another exploratory look at the data in its final form before modelling:

## Word Frequency Analysis without stopwords

This function conducts a word frequency analysis on a corpus within a DataFrame column and provides visualization of the results.

## Function: `word_frequency_analysis`

### Parameters

- **`df`** (*pandas.DataFrame*): The DataFrame containing the text data.
- **`field`** (*str*): The column name in the DataFrame with the text data.
- **`amount`** (*int* or `'all'`): The number of most common words to analyze. Use `'all'` to analyze all words.
- **`is_list`** (*bool*, optional): Set to *True* if the specified column contains lists of words. Default is *False*.
- **`dict_wanted`** (*bool*, optional): Set to *True* to return a dictionary of word frequencies. Default is *False*.

### Returns

- **None** or **dict**: If *dict_wanted* is *True*, returns a dictionary of word frequencies. Otherwise, displays a bar plot.

### Procedure

1. **Form Corpora List**: Obtain a list of text corpora using the *form_corpora_list* function, considering the *is_list* parameter.

2. **Count Most Common Words**: Utilize the *Counter* class to count the most common words within the corpus.

3. **Initialize Lists**: Prepare lists to hold word and frequency information.

4. **Determine Range**: Based on the *amount* parameter, decide the range for analysis (up to specified count or all words).

5. **Extract Information**: Extract word and frequency data from the most common list based on the determined range.

6. **Check for Dictionary**: If *dict_wanted* is *True*, return a dictionary of word frequencies.

7. **Create Bar Plot**: If *dict_wanted* is *False*, generate a bar plot using seaborn's *barplot* function and display it.

---

By utilizing this function, you can conduct word frequency analysis on textual data in a DataFrame column. The function allows customization of the number of most common words to analyze and provides the option to return a dictionary of word frequencies. The resulting bar plot aids in visualizing the frequency distribution of words.
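
The counting at the heart of this procedure relies on `collections.Counter`. The sketch below runs it on a tiny made-up corpus; the word list is an assumption for illustration.

```python
from collections import Counter

# A small flattened corpus (assumed data for this sketch)
corpus = ["stocks", "rally", "stocks", "flu", "rally", "stocks"]

# most_common(n) returns the n most frequent (word, count) pairs, highest count first
counter = Counter(corpus)
top_two = counter.most_common(2)

# Converting the pairs to a dict gives a word -> frequency lookup
frequency_dict = dict(top_two)
print(top_two)
print(frequency_dict)
```

Calling `most_common()` with no argument returns every word, which corresponds to the `'all'` setting of the `amount` parameter.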


In [None]:
# Word Frequency Analysis without stopwords:

def word_frequency_analysis(df, field, amount, is_list = False, dict_wanted = False):

    '''
    Perform word frequency analysis on a corpus in a DataFrame column and visualize the results.

    Args:
        df (pandas.DataFrame): The DataFrame containing the text data.
        field (str): The name of the column in the DataFrame containing the text data.
        amount (int or 'all'): The number of most common words to analyze. Set to 'all' for all words.
        is_list (bool, optional): Set to True if the specified column contains lists of words. Default is False.
        dict_wanted (bool, optional): Set to True to return a dictionary of word frequencies. Default is False.

    Returns:
        None or dict: If dict_wanted is True, returns a dictionary of word frequencies. Otherwise, displays a bar plot.
    '''

    corpus = form_corpora_list(df, field, is_list)

    # Count most common words in the corpus
    counter = Counter(corpus)
    most_common_list = counter.most_common()

    # Initialize lists to store word and frequency information
    list_a, list_b = [], []

    # Determine the range for analysis based on 'amount' parameter
    if amount == 'all':
        range_limit = len(most_common_list)
    else:
        range_limit = amount

    # Extract word and frequency information based on range_limit
    for word, count in most_common_list[:range_limit]:
        list_a.append(word)
        list_b.append(count)

    # Check if a dictionary of word frequencies is wanted
    if dict_wanted:
        return {list_a[i]: list_b[i] for i in range(len(list_a))}

    # Create a bar plot using seaborn's barplot function
    sns.barplot(x=list_b, y=list_a)
    plt.show()

## Word Frequency Analysis with Specific Word Frequency Request

This function performs word frequency analysis on a corpus within a DataFrame column and displays the frequency of specific words using a dictionary.

## Function: `Word_frequency_request`

### Parameters

- **`df`** (*pandas.DataFrame*): The DataFrame containing the text data.
- **`field`** (*str*): The column name in the DataFrame with the text data.
- **`test_list`** (*list*, optional): A list of words to check the frequency for. Default is an empty list.
- **`is_list`** (*bool*, optional): Set to *True* if the specified column contains lists of words. Default is *False* (for strings).

### Returns

- **None**

### Procedure

1. **Word Frequency Analysis**: Invoke the *word_frequency_analysis* function to retrieve a dictionary of word frequencies from the corpus.

2. **Iterate Over Test Words**: Iterate through each word in the *test_list*.

3. **Check for Presence**: For each word, check if it exists in the word frequency dictionary.

4. **Print Frequency**: If the word is present, print its frequency. If not, indicate that the word is not present in the corpus.

---

By using this function, you can analyze the word frequency within a DataFrame column and retrieve the frequency of specific words. The function utilizes the previously defined *word_frequency_analysis* function and provides a convenient way to check the occurrence of specific words in the corpus.
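
The lookup itself can be sketched independently of the DataFrame. The helper name `lookup_frequencies` and the toy dictionary below are hypothetical, and the sketch returns the counts rather than printing them so the result can be reused:

```python
def lookup_frequencies(word_dict, test_list):
    """Hypothetical helper: map each requested word to its count, or None if absent."""
    return {word: word_dict.get(word) for word in test_list}


# Toy frequency dictionary standing in for the output of word_frequency_analysis
word_dict = {"apple": 3, "google": 2, "stocks": 1}

result = lookup_frequencies(word_dict, ["apple", "ebola"])
print(result)  # {'apple': 3, 'ebola': None}
```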


In [None]:
def Word_frequency_request(df, field, test_list = [], is_list = False):

    '''
    Perform word frequency analysis on a corpus in a DataFrame column and display the frequency of specific words
    using a dictionary.

    Args:
        df (pandas.DataFrame): The DataFrame containing the text data.
        field (str): The name of the column in the DataFrame containing the text data.
        test_list (list, optional): A list of words to check the frequency for. Default is an empty list.
        is_list (bool, optional): Set to True if the specified column contains lists of words. Default is False (for strings).

    Returns:
        None
    '''

    # Pass the string 'all' (not the built-in all) so that every word is included
    word_dict = word_frequency_analysis(df, field, 'all', is_list, True)

    for word in test_list:

        if word in word_dict.keys():
            print(f"Frequency of {word}: {word_dict[word]}")

        else:
            print(f"{word} is not present in this corpus")


## Sentiment Analysis with VADER

This function performs sentiment analysis on a given text using the VADER (Valence Aware Dictionary and Sentiment Reasoner) sentiment analysis tool.

## Function: `sentiment_vader`

### Parameters

- **`text`** (*str*): The input text for sentiment analysis.
- **`sid`** (*nltk.sentiment.vader.SentimentIntensityAnalyzer*): The VADER SentimentIntensityAnalyzer instance.

### Returns

- A sentiment label with the highest score from the VADER analysis ('pos', 'neg', or 'neu').

### Procedure

1. **Polarity Score Calculation**: Calculate the polarity scores using the VADER SentimentIntensityAnalyzer.

2. **Select the Maximum Score**: Identify the sentiment label with the highest score.

### Function: `plot_sentiment_barchart`

This function plots a bar chart to visualize the distribution of sentiments in the given text data.

### Parameters

- **`text`** (*pandas.Series*): A pandas Series containing the text data.

### Returns

- **None**

### Procedure

1. **Sentiment Analysis**: Initialize the VADER SentimentIntensityAnalyzer and analyze sentiment for each text using the `sentiment_vader` function.

2. **Plot Distribution**: Create a bar plot to visualize the sentiment distribution.

By using these functions, you can perform sentiment analysis on text data and visualize the distribution of sentiments. The `sentiment_vader` function computes sentiment labels, and the `plot_sentiment_barchart` function offers an easy way to visualize the sentiment distribution in the provided text data. Remember to have the `nltk` library installed and download the VADER lexicon using `nltk.download('vader_lexicon')` if needed.
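
The label selection can be illustrated without loading the lexicon. The scores below are made-up values in the shape that `polarity_scores` returns (`neg`, `neu`, `pos`, `compound`), and `pick_label` is a hypothetical name; unlike the notebook function, this sketch leaves the input dictionary unmodified:

```python
def pick_label(scores):
    """Return 'pos', 'neg' or 'neu', whichever scores highest, ignoring 'compound'."""
    candidates = {key: value for key, value in scores.items() if key != "compound"}
    return max(candidates, key=candidates.get)


# Hypothetical scores, not real VADER output
example_scores = {"neg": 0.0, "neu": 0.42, "pos": 0.58, "compound": 0.64}

label = pick_label(example_scores)
print(label)  # pos
```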


In [None]:
#Sentiment analysis

def sentiment_vader(text, sid):

    """
    Analyze the sentiment of a given text using the VADER (Valence Aware Dictionary and sEntiment Reasoner) sentiment analysis tool.

    Args:
        text (str): The input text for sentiment analysis.
        sid (nltk.sentiment.vader.SentimentIntensityAnalyzer): The VADER SentimentIntensityAnalyzer instance.

    Returns:
        str: The sentiment label with the highest score from the VADER analysis (either 'pos', 'neg', or 'neu').
    """

    # Polarity score returns dictionary
    ss = sid.polarity_scores(text)
    ss.pop('compound')
    return max(ss, key=ss.get)


def plot_sentiment_barchart(text):

    """
    Plot a bar chart to visualize the distribution of sentiments in the given text data.

    Args:
        text (pandas.Series): A pandas Series containing the text data.

    Returns:
        None
    """
    # Initialize the VADER SentimentIntensityAnalyzer - #nltk.download('vader_lexicon') if necessary:
    sid = SentimentIntensityAnalyzer()

    # Analyze sentiment for each text using the sentiment_vader function
    sentiment = text.map(lambda x: sentiment_vader(x, sid=sid))

    # Create a bar plot to visualize the sentiment distribution
    counts = sentiment.value_counts()
    plt.bar(counts.index, counts.values)
    plt.show()


#plot_sentiment_barchart(df_test['CLEAN_TEXT'])

# Data Preparation: Forming Feature Matrix and Target Vector
---
Now that we have prepared the data and looked at some of its patterns and behaviour, it's time to form our feature matrix and our target vector. This function prepares the feature matrix (*X*) and target vector (*y*) for a given DataFrame and target column. It also supports removal of unwanted columns and optional feature engineering.

## Function: `Form_X_y`

### Parameters

- **`df`** (*pandas.DataFrame*): The DataFrame containing the data.
- **`target`** (*str*): The target column name.
- **`unwanted`** (*list of str* or *False*, optional): The column names to remove from the feature matrix. Default is *False* (no columns are removed).
- **`feat_eng_flag`** (*bool*, optional): Set to *True* to enable feature engineering. Default is *False*.

### Returns

- **X** (*pandas.DataFrame*): The feature matrix.
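- **y** (*pandas.Series*): The target vector. It is returned as a *numpy.array* of integer codes only when the label-encoding line in the function body is enabled.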

### Procedure

1. **Data Preprocessing (Prerequisite)**: The DataFrame is expected to have been loaded and cleaned beforehand using the `download_Data` and `data_frame_pre_process` functions; `Form_X_y` does not call them itself.

2. **Feature Engineering (Optional)**: If `feat_eng_flag` is *True*, invoke the `feature_engineer` function to obtain feature-engineered *X* and *y*, display the time taken, and return immediately.

3. **Obtain Target Vector**: Extract the target column (*target*) as the target vector (*y*) and keep the remaining columns as *X*.

4. **Remove Unwanted Columns (Optional)**: If *unwanted* is given, drop the listed columns from *X*.

5. **Label Encoding**: A *LabelEncoder* is created, but the encoding call is currently commented out, so *y* is returned with its original labels.

By using this function, you can easily prepare the feature matrix and target vector for machine learning tasks. You can also opt for optional unwanted column removal and feature engineering if required.
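
A minimal sketch of the split and the optional label encoding, assuming a hypothetical four-row DataFrame with made-up category codes (the real column contents are not shown here):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical miniature of the news DataFrame
toy_df = pd.DataFrame({
    "CLEAN_TEXT": ["fed raise rate", "new phone launch", "film award night", "bank profit fall"],
    "URL": ["u1", "u2", "u3", "u4"],
    "CATEGORY": ["b", "t", "e", "b"],
})

# Target vector and feature matrix, dropping the target and an unwanted column
y_raw = toy_df["CATEGORY"]
X = toy_df.drop(columns=["CATEGORY", "URL"])

# LabelEncoder assigns integer codes in sorted order of the class labels
encoder = LabelEncoder()
y = encoder.fit_transform(y_raw)

print(list(X.columns))         # ['CLEAN_TEXT']
print(list(encoder.classes_))  # ['b', 'e', 't']
print(y.tolist())              # [0, 2, 1, 0]
```

Keeping the fitted encoder around makes it possible to map predictions back to the original labels with `encoder.inverse_transform`.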


In [None]:
def Form_X_y(df, target, unwanted = False, feat_eng_flag = False):

    """
    Prepare the feature matrix (X) and target vector (y) for a given DataFrame and target column.

    Args:
        df (pandas.DataFrame): The DataFrame containing the data.
        target (str): The target column name.
        unwanted (list of str or False, optional): Column names to remove from the feature matrix. Default is False (no columns are removed).
        feat_eng_flag (bool, optional): Set to True to enable feature engineering. Default is False.

    Returns:
        X (pandas.DataFrame): The feature matrix.
        y (numpy.array): The target vector.
    """

    # If feat_eng_flag is True, perform feature engineering and return X and y:
    if feat_eng_flag:

        start = time.time()
        X, y = feature_engineer(df, target, unwanted)
        print('Time took Feature Engineer: ', time.time() - start, 'seconds')
        return X,y

    # Extract the target vector y from the specified column:
    y = df[target]
    X = df.drop(target, axis = 1).copy()

    # If unwanted is True, remove specified columns from the DataFrame:
    if unwanted:
        X = X.drop(unwanted, axis = 1)

    # Label encoding is currently disabled, so y keeps its original labels.
    # Uncomment the last line below to encode y as integers with LabelEncoder:
    encoder = LabelEncoder()
    #y = encoder.fit_transform(y)
    return X, y


#X,y = Form_X_y(df, 'CATEGORY', ['URL','PUBLISHER','STORY','HOSTNAME','TIMESTAMP'], False)

# Defining the Grid Search Functions to Tune the Hyperparameters of the Models
---
Hyperparameter tuning is a critical step in the machine learning pipeline to optimize model performance. GridSearch is a method that systematically searches through a specified hyperparameter space to find the combination of hyperparameters that yields the best model performance.

## Steps for Using GridSearch:

1. **Define Hyperparameter Space**: Identify the hyperparameters you want to tune and specify the range or values for each hyperparameter.

2. **Select Estimator**: Choose the machine learning algorithm you want to use and create an instance of it.

3. **Create GridSearchCV Object**: Instantiate the `GridSearchCV` class from the scikit-learn library. Provide the estimator, hyperparameter grid, cross-validation strategy, and other necessary parameters.

4. **Fit GridSearchCV**: Fit the `GridSearchCV` object to your training data. This will perform an exhaustive search over the hyperparameter space, evaluating each combination using cross-validation.

5. **Access Best Estimator and Parameters**: After the search is complete, access the best estimator (model with the best hyperparameters) and the associated hyperparameters.

6. **Evaluate Model**: Evaluate the best model on a separate validation set or test set to get an accurate estimate of its performance.

Using GridSearch, you can efficiently explore the hyperparameter space and identify the best configuration for your machine learning model, leading to improved model performance.
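
The six steps above can be sketched end to end on synthetic data. The dataset, the grid values, and the choice of Logistic Regression here are illustrative assumptions, not the settings used for the news data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the real feature matrix and labels
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Steps 1-3: hyperparameter space, estimator, and search object
param_grid = {"C": [0.1, 1, 10]}
estimator = LogisticRegression(max_iter=1000)
search = GridSearchCV(estimator, param_grid=param_grid, cv=3, scoring="accuracy")

# Step 4: exhaustive search with 3-fold cross-validation on the training split
search.fit(X_train, y_train)

# Step 5: best estimator (refit on the whole training split) and its parameters
best_model = search.best_estimator_
best_params = search.best_params_

# Step 6: evaluation on data that the search never saw
test_accuracy = best_model.score(X_test, y_test)
print(best_params, round(test_accuracy, 2))
```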


## Multinomial Naive Bayes Classifier with Hyperparameter Tuning

This function fits a Multinomial Naive Bayes classifier using grid search for hyperparameter tuning.

## Function: `Fit_Multinomial_Grid`

### Parameters

- **`training_set`** (*array-like*): The training data.
- **`training_classifications`** (*array-like*): The corresponding class labels.

### Returns

- **`optimised`**: The optimized Multinomial Naive Bayes classifier.

### Procedure

1. **Define Hyperparameters**: Prepare a list of hyperparameter values to be explored during grid search. In this case, the hyperparameter is `'alpha'`.

2. **Create GridSearchCV Object**: Construct a GridSearchCV object using the MultinomialNB estimator. Set the parameter grid as the defined hyperparameter values, the number of cross-validation folds (cv), and configure error handling.

3. **Fit Grid Search**: Fit the grid search to the training data. The best estimator is determined by cross-validation performance.

4. **Retrieve Optimized Estimator**: Retrieve the best estimator (optimized classifier) from the grid search results.

5. **Display Optimized Parameters**: Print the optimized parameters obtained from the grid search.

6. **Return Optimized Estimator**: Return the optimized Multinomial Naive Bayes classifier.

---

By using this function, you can automatically fine-tune hyperparameters for a Multinomial Naive Bayes classifier using grid search. The grid search process explores various `'alpha'` values, helping to optimize the classifier's performance on the training data.
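
For intuition about what the grid is tuning: `alpha` is the additive smoothing term in the per-class word probabilities, (count + alpha) / (total + alpha * vocabulary size). The sketch below applies that formula by hand to hypothetical word counts for a single class; `smoothed_probs` is an illustrative helper, not part of scikit-learn:

```python
def smoothed_probs(counts, alpha):
    """Additively smoothed word probabilities for one class."""
    total = sum(counts.values())
    vocab_size = len(counts)
    return {word: (count + alpha) / (total + alpha * vocab_size)
            for word, count in counts.items()}


# Hypothetical word counts observed in one class
toy_counts = {"stock": 3, "market": 1, "virus": 0}

probs_small = smoothed_probs(toy_counts, alpha=0.1)
probs_large = smoothed_probs(toy_counts, alpha=10)

# A small alpha keeps unseen words close to zero probability;
# a large alpha pulls all words towards a uniform distribution
print(round(probs_small["virus"], 3), round(probs_large["virus"], 3))
```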


In [None]:
def Fit_Multinomial_Grid(training_set, training_classifications):
    """
    Fit a Multinomial Naive Bayes classifier using grid search for hyperparameter tuning.

    Args:
        training_set (array-like): The training data.
        training_classifications (array-like): The corresponding class labels.

    Returns:
        optimised: The optimized Multinomial Naive Bayes classifier.
    """
    # Define hyperparameter values to search
    parameters = {'alpha': [0.00001, 0.00005, 0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50]}

    # Create a GridSearchCV object with MultinomialNB estimator
    clf = GridSearchCV(estimator=naive_bayes.MultinomialNB(),
                       param_grid=parameters,
                       cv=3,
                       refit=True,
                       error_score=0,
                       n_jobs=-1)

    # Fit the grid search to the training data
    clf.fit(training_set, training_classifications)

    # Get the best estimator (optimized classifier)
    optimised = clf.best_estimator_
    print("Optimised Parameters from grid search: ",optimised)

    return optimised


## Logistic Regression Classifier with Hyperparameter Tuning

This function fits a Logistic Regression classifier using grid search for hyperparameter tuning.

## Function: `Fit_Logistic_Regression_Grid`

### Parameters

- **`training_set`** (*array-like*): The training data.
- **`training_classifications`** (*array-like*): The corresponding class labels.

### Returns

- **`optimised`**: The optimized Logistic Regression classifier.

### Procedure

1. **Define Hyperparameters**: Prepare a list of hyperparameter values to be explored during grid search. In this case, the hyperparameter is `'C'`.

2. **Create GridSearchCV Object**: Construct a GridSearchCV object using the LogisticRegression estimator (with balanced class weights and an L2 penalty). Set the parameter grid as the defined hyperparameter values, the number of cross-validation folds (cv), and specify `'accuracy'` as the scoring metric.

3. **Fit Grid Search**: Fit the grid search to the training data. The best estimator is determined by cross-validation performance.

4. **Retrieve Optimized Estimator**: Retrieve the best estimator (optimized classifier) from the grid search results.

5. **Display Optimized Parameters**: Print the optimized parameters obtained from the grid search.

6. **Return Optimized Estimator**: Return the optimized Logistic Regression classifier.

---

By using this function, you can automatically fine-tune hyperparameters for a Logistic Regression classifier using grid search. The grid search process explores various `'C'` values and leverages class weights for balancing, helping to optimize the classifier's performance on the training data.
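
The `class_weight='balanced'` setting weights each class by `n_samples / (n_classes * class_count)`, so rarer classes count for more in the loss. A small sketch of that formula on hypothetical labels, with `balanced_class_weights` as an illustrative helper name:

```python
import numpy as np


def balanced_class_weights(y):
    """Weights computed as n_samples / (n_classes * count of each class)."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))


# Hypothetical imbalanced labels: three samples of class 0, one of class 1
toy_labels = np.array([0, 0, 0, 1])

weights = balanced_class_weights(toy_labels)
print(weights)  # class 0 gets about 0.667, class 1 gets 2.0
```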


In [None]:
def Fit_Logistic_Regression_Grid(training_set, training_classifications):
    """
    Fit a Logistic Regression classifier using grid search for hyperparameter tuning.

    Args:
        training_set (array-like): The training data.
        training_classifications (array-like): The corresponding class labels.

    Returns:
        optimised: The optimized Logistic Regression classifier.
    """
    # Define hyperparameter values to search
    parameters = {'C': [ 0.01,0.05, 0.1,0.5,1,5,10]}

    # Create a GridSearchCV object with Logistic Regression estimator
    clf = GridSearchCV(estimator = LogisticRegression(class_weight = 'balanced', penalty = 'l2', max_iter=10000),
                       param_grid=parameters,
                       cv=3,
                       refit=True,
                       scoring = 'accuracy',
                       n_jobs=-1)

    # Fit the grid search to the training data
    clf.fit(training_set, training_classifications)

    # Get the best estimator (optimized classifier)
    optimised = clf.best_estimator_
    print("Optimised Parameters from grid search: ",optimised)

    return optimised


## One-vs-Rest Logistic Regression Classifier with Hyperparameter Tuning

This function fits a One-vs-Rest Logistic Regression classifier using grid search for hyperparameter tuning.

## Function: `Fit_One_VS_Rest_Log_Regression_Grid`

### Parameters

- **`training_set`** (*array-like*): The training data.
- **`training_classifications`** (*array-like*): The corresponding class labels.

### Returns

- **`optimised`**: The optimized One-vs-Rest Logistic Regression classifier.

### Procedure

1. **Define Hyperparameters**: Prepare a list of hyperparameter values to be explored during grid search. In this case, the hyperparameter is `'estimator__alpha'`.

2. **Create One-vs-Rest Model**: Construct a OneVsRestClassifier model using the SGDClassifier with `'log_loss'` loss and `'l1'` penalty.

3. **Create GridSearchCV Object**: Build a GridSearchCV object using the OneVsRest model. Set the parameter grid as the defined hyperparameter values, the number of cross-validation folds (cv), and specify `'accuracy'` as the scoring metric.

4. **Fit Grid Search**: Fit the grid search to the training data. The best estimator is determined by cross-validation performance.

5. **Retrieve Optimized Estimator**: Retrieve the best estimator (optimized classifier) from the grid search results.

6. **Display Optimized Parameters**: Print the optimized parameters obtained from the grid search.

7. **Return Optimized Estimator**: Return the optimized One-vs-Rest Logistic Regression classifier.

---

By using this function, you can automatically fine-tune hyperparameters for a One-vs-Rest Logistic Regression classifier using grid search. The grid search process explores various `'estimator__alpha'` values and leverages the One-vs-Rest strategy, helping to optimize the classifier's performance on the training data.


In [None]:
def Fit_One_VS_Rest_Log_Regression_Grid(training_set, training_classifications):
    """
    Fit a OneVsRest Logistic classifier using grid search for hyperparameter tuning.

    Args:
        training_set (array-like): The training data.
        training_classifications (array-like): The corresponding class labels.

    Returns:
        optimised: The optimized OneVSRest classifier.
    """
    # Define hyperparameter values to search
    parameters  = {"estimator__alpha": [10**-5, 10**-3, 10**-1, 10**1, 10**2]}

    model = OneVsRestClassifier(SGDClassifier(loss='log_loss',penalty='l1'))

    # Create a GridSearchCV object with the OneVsRest (SGDClassifier) estimator
    clf = GridSearchCV(model,
                       param_grid=parameters,
                       cv=3,
                       refit=True,
                       scoring = 'accuracy',
                       n_jobs=-1)

    # Fit the grid search to the training data
    clf.fit(training_set, training_classifications)

    # Get the best estimator (optimized classifier)
    optimised = clf.best_estimator_
    print("Optimised Parameters from grid search: ",optimised)

    return optimised


# Grid Search for Support Vector Machine (SVM) Model Optimization

This function performs a grid search to optimize hyperparameters for a Support Vector Machine (SVM) classification model. Grid search is a technique used to systematically search through a specified range of hyperparameters to find the combination that produces the best performance.

## Function: `Fit_SVM_Grid`

### Parameters

- **`training_set`** (*array-like*): The training dataset features.
- **`training_classifications`** (*array-like*): The corresponding class labels for the training dataset.

### Returns

- **`optimised`** (*SVC*): The optimized SVM classifier with the best hyperparameters.

---

Using this function, you can search through combinations of `C` and `gamma` to find the configuration that results in the highest cross-validated accuracy for your SVM classification model.
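
One detail matters when the tuned SVM is later evaluated with `predict_proba`, as `fit_BoW_model` does: `SVC` provides probability estimates only when it is constructed with `probability=True`. A minimal sketch on synthetic data (the dataset and sizes are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic two-class data standing in for the TF-IDF features
X, y = make_classification(n_samples=60, n_features=5, random_state=0)

# probability=True fits an additional calibration step via internal cross-validation
proba_svc = SVC(probability=True, random_state=0).fit(X, y)

# One row per sample, one column per class
proba = proba_svc.predict_proba(X[:3])
print(proba.shape)  # (3, 2)
```

The calibration step makes fitting noticeably slower, which is worth keeping in mind on a dataset of this size.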


In [None]:
def Fit_SVM_Grid(training_set, training_classifications):

    # Define hyperparameter values to search
    parameters = {'C': [1, 10], 'gamma': [0.001, 0.01, 1]}

    # probability=True enables predict_proba, which fit_BoW_model calls during evaluation
    model = SVC(probability=True)

    # Create a GridSearchCV object with the SVC estimator
    clf = GridSearchCV(model,
                       param_grid=parameters,
                       cv=3,
                       refit=True,
                       scoring = 'accuracy',
                       n_jobs=-1)

    # Fit the grid search to the training data
    clf.fit(training_set, training_classifications)

    # Get the best estimator (optimized classifier)
    optimised = clf.best_estimator_
    print("Optimised Parameters from grid search: ",optimised)

    return optimised

# Grid Search for Random Forest Model Optimization

This function performs a grid search to optimize hyperparameters for a Random Forest classification model. The search covers the number of trees (`n_estimators`) and the maximum tree depth (`max_depth`), using entropy as the split criterion.

## Function: `Fit_Random_Forest_Grid`

### Parameters

- **`training_set`** (*array-like*): The training dataset features.
- **`training_classifications`** (*array-like*): The corresponding class labels for the training dataset.

### Returns

- **`optimised`** (*RandomForestClassifier*): The optimized Random Forest classifier with the best hyperparameters.

---

Using this function, you can search through combinations of `n_estimators` and `max_depth` to find the configuration that results in the highest cross-validated accuracy for your Random Forest classification model.
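
To gauge the cost of the search in `Fit_Random_Forest_Grid`, the candidate combinations can be enumerated with `ParameterGrid`. This sketch reuses the grid from that function and assumes the same `cv=3`:

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in Fit_Random_Forest_Grid
parameters = {"n_estimators": [1, 5, 10], "max_depth": [1, 3, 5]}

grid = list(ParameterGrid(parameters))
n_candidates = len(grid)

# Each candidate is fitted once per fold; refit=True adds one final fit on all training data
n_fits = n_candidates * 3

print(n_candidates, n_fits)  # 9 27
```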


In [None]:
def Fit_Random_Forest_Grid(training_set, training_classifications):

    # Define hyperparameter values to search

    parameters = {'n_estimators': [1, 5, 10],
                  'max_depth': [1, 3, 5]}

    model = RandomForestClassifier(criterion = 'entropy')

    # Create a GridSearchCV object with the RandomForestClassifier estimator
    clf = GridSearchCV(model,
                       param_grid=parameters,
                       cv=3,
                       refit=True,
                       scoring = 'accuracy',
                       n_jobs=-1)

    # Fit the grid search to the training data
    clf.fit(training_set, training_classifications)

    # Get the best estimator (optimized classifier)
    optimised = clf.best_estimator_
    print("Optimised Parameters from grid search: ",optimised)

    return optimised


## Fit the Bag of Words (BoW) Classification Model

This function fits a classification model using the Bag of Words (BoW) approach with the option to choose between Logistic Regression, Multinomial Naive Bayes, One-vs-Rest, SVM, or Random Forest classifiers. It also evaluates the model and displays evaluation metrics, a confusion matrix, ROC curves, and precision-recall curves.

## Function: `fit_BoW_model`

### Parameters

- **`X`** (*pandas.DataFrame*): The input DataFrame containing feature data.
- **`y`** (*np.array*): The label-encoded categorical target.
- **`field`** (*str*): The name of the column in the DataFrame containing the text data.
- **`target`** (*str*): The name of the column in the DataFrame containing the target variable.
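- **`model_name`** (*str*): The name of the classification model to use: `'Logistic Regression'`, `'Multinomial Naive Bayes'`, `'OneVsRest'`, `'SVM'`, or `'Random Forest'`.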

### Returns

- **None** (Prints evaluation metrics and plots confusion matrix, ROC curve, and precision-recall curve.)

### Procedure

1. **Split Training and Test Sets**: Split the DataFrame into training and test sets.

2. **TF-IDF Vectorization**: Utilize the TfidfVectorizer to convert the training corpus into TF-IDF features.

3. **Feature Selection**: Perform feature selection using the Chi-squared test to select relevant features for each category.

4. **Re-fit Vectorizer**: Re-fit the vectorizer with the selected features.

5. **Choose Classifier**: Depending on the specified *model_name*, choose the Logistic Regression, Multinomial Naive Bayes, One-vs-Rest, SVM, or Random Forest classifier, each tuned by its respective grid search function.

6. **Train the Classifier**: Train the chosen classifier using a pipeline with the vectorizer.

7. **Test the Model**: Test the trained model on the test data and calculate predicted probabilities.

8. **Evaluate Model**:
   - Calculate accuracy, AUC, and display a classification report.
   - Plot a confusion matrix.
   - Plot ROC curves for each class.
   - Plot precision-recall curves for each class.

---

By utilizing this function, you can fit and evaluate a classification model using the Bag of Words (BoW) approach. The function offers flexibility in choosing the classification model and provides a comprehensive evaluation of the model's performance, aiding in assessing its effectiveness on the given data.


Note: The 'CLEAN_TEXT' column contains the preprocessed text data, and the 'CATEGORY' column contains the target variable with different categories.

The printed metrics summarise the model's performance in terms of accuracy, precision, recall, and the area under the ROC curve. The confusion matrix shows, for each true class, how many instances were assigned to each predicted class, so correct predictions lie on the diagonal and misclassifications off it. The ROC curve represents the trade-off between the true positive rate (recall) and the false positive rate, and the precision-recall curve shows the trade-off between precision and recall for each class. Together, these outputs help assess the model's ability to correctly classify instances of each category and identify the trade-offs between different evaluation metrics.
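
Per-class ROC curves for a multi-class problem are computed one class against the rest. A minimal sketch with hypothetical labels and predicted probabilities (here the toy predictions separate the classes perfectly, so each area is 1.0):

```python
import numpy as np
from sklearn import metrics
from sklearn.preprocessing import label_binarize

# Hypothetical class labels and predicted probabilities (columns follow `classes`)
classes = np.array(["b", "e", "m"])
y_true = np.array(["b", "e", "m", "b", "e", "m"])
proba = np.array([
    [0.7, 0.2, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
    [0.5, 0.4, 0.1],
    [0.3, 0.5, 0.2],
    [0.2, 0.2, 0.6],
])

# One indicator column per class, as required by roc_curve
y_onehot = label_binarize(y_true, classes=classes)

per_class_auc = {}
for i, cls in enumerate(classes):
    fpr, tpr, _ = metrics.roc_curve(y_onehot[:, i], proba[:, i])
    per_class_auc[str(cls)] = metrics.auc(fpr, tpr)

print(per_class_auc)
```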


In [None]:

# Fit the BoW classification model:

def fit_BoW_model(X, y, field, target, model_name):
    """
    Fit a classification model using the Bag of Words (BoW) approach with TF-IDF features and the classifier selected by model_name.

    Args:
        X (pandas.DataFrame): The input DataFrame containing Feature Data.
        y (np.array): The Label Encoded Categorical Target.
        field (str): The name of the column in the DataFrame containing the text data.
        target (str): The name of the column in the DataFrame containing the target variable.
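        model_name (str): The classifier to use: 'Logistic Regression', 'Multinomial Naive Bayes', 'OneVsRest', 'SVM', or 'Random Forest'.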

    Returns:
        None (Prints evaluation metrics and plots confusion matrix, ROC curve, and precision-recall curve.)
    """


    # Split the DataFrame into training and test sets
    X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2)

    ## TF-IDF (advanced variant of BoW)
    vectorizer = feature_extraction.text.TfidfVectorizer(max_features=20000, ngram_range=(1,2))

    # Fit the vectorizer on the training corpus
    corpus = X_train[field]
    vectorizer.fit(corpus)
    X_train = vectorizer.transform(corpus)
    print("TF-IDF Vectorization Shape:", X_train.shape)
    dic_vocabulary = vectorizer.vocabulary_

    # Feature Selection:
    X_names = vectorizer.get_feature_names_out()
    p_value_limit = 0.95
    df_features = pd.DataFrame()
    for cat in np.unique(y_train):
        # Perform Chi-squared test for feature selection
        chi2, p = feature_selection.chi2(X_train, y_train==cat)
        df_features = pd.concat([df_features,pd.DataFrame(
                       {'feature':X_names, 'score':1-p, 'y':cat})])
        df_features = df_features.sort_values(['y','score'],
                        ascending=[True,False])
        df_features = df_features[df_features['score']>p_value_limit]
    X_names = df_features['feature'].unique().tolist()
    print("Selected Features Count:", len(X_names))

    # Show most relevant vectors in each category:
    for cat in np.unique(y_train):
        print(f"# {cat}:")
        print(f"  . Selected Features: {len(df_features[df_features['y']==cat])}")
        print(f"  . Top Features: {', '.join(df_features[df_features['y']==cat]['feature'].values[:10])}")
        print()

    # Re-fit the vectorizer with the new selected features:
    vectorizer = feature_extraction.text.TfidfVectorizer(vocabulary=X_names)
    vectorizer.fit(corpus)
    X_train = vectorizer.transform(corpus)
    dic_vocabulary = vectorizer.vocabulary_

    # Choose the classifier according to model_name (each helper runs its own grid search)
    if model_name == 'Logistic Regression':
        classifier = Fit_Logistic_Regression_Grid(X_train, y_train)
    elif model_name == 'Multinomial Naive Bayes':
        classifier = Fit_Multinomial_Grid(X_train, y_train)
    elif model_name == 'OneVsRest':
        classifier = Fit_One_VS_Rest_Log_Regression_Grid(X_train, y_train)
    elif model_name == 'SVM':
        classifier = Fit_SVM_Grid(X_train, y_train)
    elif model_name == 'Random Forest':
        classifier = Fit_Random_Forest_Grid(X_train, y_train)
    else:
        print("Model name not recognised. Use 'Logistic Regression', 'Multinomial Naive Bayes', 'OneVsRest', 'SVM', or 'Random Forest'.")
        return

    # Train the Classifier using a pipeline
    model = pipeline.Pipeline([('vectorizer', vectorizer), ('classifier', classifier)])
    model['classifier'].fit(X_train, y_train)

    # Test the model on the test data
    X_test = X_test[field].values
    predicted = model.predict(X_test)
    predicted_prob = model.predict_proba(X_test)

    classes = np.unique(y_test)
    y_test_array = pd.get_dummies(y_test, drop_first=False).values

    ## Calculate Accuracy, Precision, Recall, and AUC
    accuracy = metrics.accuracy_score(y_test, predicted)
    auc = metrics.roc_auc_score(y_test, predicted_prob, multi_class="ovr")

    print("Accuracy:", round(accuracy,2))
    print("AUC:", round(auc,2))
    print("Classification Report:")
    print(metrics.classification_report(y_test, predicted))

    ## Plot Confusion Matrix
    cm = metrics.confusion_matrix(y_test, predicted)
    fig, ax = plt.subplots()
    sns.heatmap(cm, annot=True, fmt='d', ax=ax, cmap=plt.cm.Blues, cbar=False)
    ax.set(xlabel="Predicted", ylabel="True", xticklabels=classes, yticklabels=classes, title="Confusion Matrix")
    plt.yticks(rotation=0)

    fig, ax = plt.subplots(nrows=1, ncols=2)
    # Plot ROC (one curve per class, one-vs-rest)
    for i in range(len(classes)):
        fpr, tpr, thresholds = metrics.roc_curve(y_test_array[:,i], predicted_prob[:, i])
        ax[0].plot(fpr, tpr, lw=3, label='{0} (area={1:0.2f})'.format(classes[i], metrics.auc(fpr, tpr)))
    # Chance-level diagonal for reference
    ax[0].plot([0,1], [0,1], color='navy', lw=3, linestyle='--')
    ax[0].set(xlim=[-0.05, 1.0], ylim=[0.0, 1.05],
            xlabel="False Positive Rate",
            ylabel="True Positive Rate (Recall)",
            title="Receiver Operating Characteristic (ROC) Curve")
    ax[0].legend(loc="lower right")
    ax[0].grid(True)

    # Plot Precision-Recall Curve
    for i in range(len(classes)):
        precision, recall, thresholds = metrics.precision_recall_curve(y_test_array[:,i], predicted_prob[:,i])
        ax[1].plot(recall, precision, lw=3, label='{0} (area={1:0.2f})'.format(classes[i], metrics.auc(recall, precision)))
        ax[1].set(xlim=[0.0,1.05], ylim=[0.0,1.05], xlabel='Recall', ylabel="Precision", title="Precision-Recall Curve")
        ax[1].legend(loc="best")
        ax[1].grid(True)

    box = ax[1].get_position()
    box.x0 = box.x0 + 0.1
    box.x1 = box.x1 + 0.1
    ax[1].set_position(box)
    plt.show()

# Example usage of the function

#fit_BoW_model(X, y, 'CLEAN_TEXT', 'CATEGORY','SVM')

# Model Comparisons

### Multinomial Naive Bayes

**Optimised Parameters from grid search:**  
MultinomialNB(alpha=0.1)

**Accuracy:** 0.92 | **AUC:** 0.99

|           | Precision | Recall  | F1-Score | Support |
|-----------|-----------|---------|----------|---------|
| Class 0   | 0.89      | 0.90    | 0.89     | 23142   |
| Class 1   | 0.95      | 0.96    | 0.96     | 30407   |
| Class 2   | 0.95      | 0.87    | 0.91     | 9160    |
| Class 3   | 0.89      | 0.89    | 0.89     | 21775   |
| **Average** | **0.92**  | **0.92** | **0.92** | **84484** |

**Macro Avg:** Precision: 0.92 | Recall: 0.91 | F1-Score: 0.91  

**Weighted Avg:** Precision: 0.92 | Recall: 0.92 | F1-Score: 0.92 | Support: 84484
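
The macro and weighted averages differ only in how the per-class scores are combined: the macro average treats every class equally, while the weighted average weights each class by its support. As a rough check, the sketch below recomputes both for precision from the per-class values in the table above (the variable names are illustrative, not part of the notebook).

```python
import numpy as np

# Per-class precision and support copied from the Multinomial Naive Bayes table
precision = np.array([0.89, 0.95, 0.95, 0.89])
support = np.array([23142, 30407, 9160, 21775])

# Macro average: unweighted mean over the classes
macro_precision = precision.mean()

# Weighted average: mean over the classes, weighted by class support
weighted_precision = np.average(precision, weights=support)

print("Macro precision:", round(float(macro_precision), 2))
print("Weighted precision:", round(float(weighted_precision), 2))
```

The two averages drift apart when a small class scores very differently from the large ones, which is why both are reported.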

---

### Logistic Regression

**Optimised Parameters from grid search:**  
LogisticRegression(C=5, class_weight='balanced', max_iter=10000)

**Accuracy:** 0.94 | **AUC:** 0.99

|           | Precision | Recall  | F1-Score | Support |
|-----------|-----------|---------|----------|---------|
| Class 0   | 0.92      | 0.91    | 0.91     | 23170   |
| Class 1   | 0.97      | 0.97    | 0.97     | 30631   |
| Class 2   | 0.89      | 0.94    | 0.92     | 9122    |
| Class 3   | 0.92      | 0.91    | 0.92     | 21561   |
| **Average** | **0.93**  | **0.94** | **0.93** | **84484** |

**Macro Avg:** Precision: 0.92 | Recall: 0.93 | F1-Score: 0.93  

**Weighted Avg:** Precision: 0.94 | Recall: 0.94 | F1-Score: 0.94 | Support: 84484

---

### One vs Rest Logistic Regression

**Optimised Parameters from grid search:**  
OneVsRestClassifier(estimator=SGDClassifier(alpha=1e-05, loss='log_loss', penalty='l1'))

**Accuracy:** 0.93 | **AUC:** 0.99

|           | Precision | Recall  | F1-Score | Support |
|-----------|-----------|---------|----------|---------|
| Class 0   | 0.90      | 0.91    | 0.90     | 23237   |
| Class 1   | 0.94      | 0.97    | 0.96     | 30408   |
| Class 2   | 0.95      | 0.88    | 0.91     | 9158    |
| Class 3   | 0.91      | 0.90    | 0.91     | 21681   |
| **Average** | **0.93**  | **0.93** | **0.92** | **84484** |

**Macro Avg:** Precision: 0.93 | Recall: 0.91 | F1-Score: 0.92  

**Weighted Avg:** Precision: 0.93 | Recall: 0.93 | F1-Score: 0.92 | Support: 84484


---
# Introduction to Word Embeddings with Word2Vec
---

Word embeddings represent each word in a vocabulary as a dense, real-valued vector. These vectors capture context, similarity, and relationships between words, and Word2Vec is one widely used technique for learning them.

## What are Word Embeddings?

Word embeddings are vector representations of words that capture meaning, context, and syntactic relationships. Word2Vec, developed by a team led by Tomas Mikolov at Google in 2013, learns these vectors with a shallow neural network.

## The Need for Word Embeddings

Consider two similar sentences: "Have a good day" and "Have a great day." They differ by one word, yet their meanings are nearly the same. If we build the vocabulary V = {Have, a, good, great, day} and encode each word as a one-hot vector, we get:

- Have = [1,0,0,0,0]
- a = [0,1,0,0,0]
- good = [0,0,1,0,0]
- great = [0,0,0,1,0]
- day = [0,0,0,0,1]

Geometrically, each word lies along its own axis of a five-dimensional space, so every pair of distinct words is orthogonal. This makes 'good' and 'great' exactly as different as 'day' and 'have', which does not reflect their meanings.

The goal is to position words with similar context close in space. Mathematically, their cosine similarity should be close to 1, indicating a small angle between vectors.
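
To make the contrast concrete, here is a minimal sketch that computes cosine similarity for the one-hot vectors above and for a pair of dense vectors. The dense vectors are invented for illustration only; they are not taken from any trained model.

```python
import numpy as np

vocab = ["Have", "a", "good", "great", "day"]
# Row i of the identity matrix is the one-hot vector for word i
one_hot = {word: np.eye(len(vocab))[i] for i, word in enumerate(vocab)}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Any two distinct one-hot vectors are orthogonal
sim_good_great = cosine_similarity(one_hot["good"], one_hot["great"])
sim_day_have = cosine_similarity(one_hot["day"], one_hot["Have"])

# Hypothetical dense embeddings, made up for this example
dense = {
    "good": np.array([0.9, 0.1, 0.3]),
    "great": np.array([0.8, 0.2, 0.35]),
}
sim_dense = cosine_similarity(dense["good"], dense["great"])

print("one-hot good vs great:", sim_good_great)
print("one-hot day vs Have:", sim_day_have)
print("dense good vs great:", round(sim_dense, 3))
```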

## Word2Vec and Contextual Similarity

Word2Vec learns embeddings by training a shallow neural network on a text corpus. In its skip-gram variant the model learns to predict the surrounding context words from a target word (the CBOW variant does the reverse, predicting the target from its context). As a result, words that appear in similar contexts end up close together in the vector space.
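
As a rough illustration of what "context" means in the skip-gram formulation, the sketch below builds (target, context) training pairs with a sliding window. The helper `skipgram_pairs` and the window size of 1 are assumptions made for this example; this shows only how pairs are formed, not the training itself.

```python
def skipgram_pairs(tokens, window=1):
    """Return (target, context) pairs for each token and its neighbours within `window`."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

pairs_good = skipgram_pairs(["have", "a", "good", "day"])
pairs_great = skipgram_pairs(["have", "a", "great", "day"])

# Collect the context words seen for each of the two target words
contexts_good = sorted(c for t, c in pairs_good if t == "good")
contexts_great = sorted(c for t, c in pairs_great if t == "great")

print(contexts_good)
print(contexts_great)
```

Because "good" and "great" are paired with the same context words here, a model trained on such pairs is pushed to give them similar vectors.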

In summary, word embeddings provide richer word representations than one-hot encodings. Word2Vec is a popular technique for learning embeddings, allowing words with similar meanings and contexts to be represented as vectors close in space.

---

# Load Google's Pre-trained Word2Vec Model

This function loads the Word2Vec model trained on a Google News corpus. Google's pre-trained Word2Vec model contains word embeddings learned from a vast amount of text data from the Google News dataset. These embeddings capture semantic relationships between words, enabling various natural language processing tasks.

## Function: `load_Google_word2vec`

### Returns

- **`model`** (*gensim.models.KeyedVectors*): The loaded Word2Vec model.

### Procedure

1. **Load Model**: The function loads the model with `gensim.models.KeyedVectors.load_word2vec_format`, passing `binary=True` because `GoogleNews-vectors-negative300.bin` is stored in the binary word2vec format.

2. **Model Source**: The model is trained on an extensive Google corpus, making it a valuable resource for understanding word semantics.

By using this function, you can easily load Google's pre-trained Word2Vec model and utilize the pre-learned word embeddings for various natural language processing tasks. The example provided at the end demonstrates how to call the function and load the model.


In [None]:
def load_Google_word2vec():

    """
    Load the Word2Vec model trained on a Google News corpus.

    Returns:
        gensim.models.KeyedVectors: The loaded Word2Vec model.

    """

    # Load Word2Vec model (trained on an enormous Google corpus)
    model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary = True)
    return model

#model = load_Google_word2vec()

## Visualize Word Vectors using t-SNE Dimensionality Reduction

This function visualizes word vectors using the t-SNE (t-Distributed Stochastic Neighbor Embedding) dimensionality reduction technique. It takes a DataFrame containing text data, a specific column from the DataFrame, and a pre-trained word vectors model as inputs. The t-SNE visualization is used to display word vectors in a two-dimensional space while preserving their similarity relationships.

## Function: `show_vectors_tsne`

### Parameters

- **`df`** (*pandas.DataFrame*): The DataFrame containing the text data.
- **`field`** (*str*): The name of the column in the DataFrame containing the text data.
- **`model`** (*gensim.models.KeyedVectors*): The pre-trained word vectors model.

### Returns

- **None** (Displays t-SNE visualization of word vectors).

### Procedure

1. **Form Corpus List**: Generate a list of words from the specified column of the DataFrame.
2. **Filter Words**: Filter out words that are not present in the model's vocabulary.
3. **Combine Vectors and Words**: Create a dictionary that maps filtered words to their corresponding word vectors.
4. **Create DataFrame**: Convert the dictionary to a DataFrame for further processing.
5. **t-SNE Dimensionality Reduction**: Use t-SNE to reduce the first 400 word vectors to two dimensions.
6. **Plot Visualization**: Plot a scatterplot using Seaborn to visualize the reduced word vectors.
7. **Add Labels to Points**: Label every tenth point on the plot, using the `adjust_text` function to handle overlapping labels.

By using this function, you can visually explore the relationships between word vectors in a two-dimensional space, gaining insights into word similarities and clusters. The example usage provided at the end demonstrates how to call the function with appropriate arguments.


In [None]:
def show_vectors_tsne(df, field, model):
    """
    Visualize word vectors using t-SNE dimensionality reduction technique.

    Args:
        df (pandas.DataFrame): The DataFrame containing the text data.
        field (str): The name of the column in the DataFrame containing the text data.
        model (gensim.models.KeyedVectors): The pre-trained word vectors model.

    Returns:
        None (Displays t-SNE visualization of word vectors).
    """

    # Form the list of corpus from the specified field
    corpus = form_corpora_list(df, field, False)

    # Filter out words not present in the model's vocabulary
    vector_list = [model[word] for word in corpus if word in model.key_to_index]
    words_filtered = [word for word in corpus if word in model.key_to_index]

    # Combine words and their vectors into a dictionary
    word_vec_zip = zip(words_filtered, vector_list)
    word_vec_dict = dict(word_vec_zip)

    # Convert the dictionary to a DataFrame
    df_word_vec = pd.DataFrame.from_dict(word_vec_dict, orient='index')

    # Perform t-SNE dimensionality reduction on (at most) the first 400 words
    n_words = min(400, len(df_word_vec))
    # Perplexity must be smaller than the number of samples
    tsne = TSNE(n_components=2, init='random', random_state=10, perplexity=min(100, n_words - 1))
    tsne_df = tsne.fit_transform(df_word_vec[:n_words])

    # Plot the t-SNE visualization
    sns.set()
    fig, ax = plt.subplots(figsize=(11.7, 8.27))
    sns.scatterplot(x=tsne_df[:, 0], y=tsne_df[:, 1], alpha=0.5)

    # Label every tenth point; adjustText handles overlapping text
    texts = []
    words_to_plot = list(np.arange(0, n_words, 10))
    for word in words_to_plot:
        texts.append(plt.text(tsne_df[word, 0], tsne_df[word, 1], df_word_vec.index[word], fontsize=14))

    adjust_text(texts, force_points=0.4, force_text=0.4, expand_points=(2, 1), expand_text=(1, 2),
                arrowprops=dict(arrowstyle="-", color='black', lw=0.5))

    plt.show()

# Example usage
#show_vectors_tsne(df, 'CLEAN_TEXT', model)


# Document Vector Generation using Word Vectors

This function generates a document vector by calculating the mean of word vectors for words in a given document. It utilizes pre-trained word vectors from a Word2Vec model to represent the document in a continuous vector space.

## Function: `document_vector`

### Parameters

- **`word2vec_model`** (*gensim.models.KeyedVectors*): The pre-trained word vectors model.
- **`doc`** (*list*): List of words in the document.

### Returns

- **`numpy.ndarray`**: A document vector representing the mean of word vectors in the document. If no words are found in the model, it returns a vector of zeros.

### Procedure

1. **Filtering Out-of-Vocabulary Words**: Remove words from the document that are not present in the Word2Vec model's vocabulary.
2. **Calculating Mean Vector**: Calculate the mean vector of word vectors for the remaining words in the document using `np.mean` along the specified axis (axis=0).
3. **Return Document Vector**: Return the generated document vector.

This function turns a document into a single continuous vector by averaging its word vectors. The representation is useful for natural language processing tasks such as document classification, clustering, and similarity calculations.


In [None]:
def document_vector(word2vec_model, doc):

    """
    Generate a document vector by calculating the mean of word vectors in the given document.

    Args:
        word2vec_model (gensim.models.KeyedVectors): The pre-trained word vectors model.
        doc (list): List of words in the document.

    Returns:
        numpy.ndarray: A document vector representing the mean of word vectors in the document.
                       If no words are found in the model, returns a vector of zeros.
    """

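    # Remove out-of-vocabulary words (use the model passed in, not a global)
    doc = [word for word in doc if word in word2vec_model.key_to_index]

    # No known words: fall back to a zero vector, as documented above
    if not doc:
        return np.zeros(word2vec_model.vector_size)
    return np.mean(word2vec_model[doc], axis=0)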
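
As an illustration of the averaging step, the sketch below uses a small hand-made dictionary of 3-dimensional vectors in place of the real Word2Vec model. The dictionary `toy_vectors`, its values, and the helper `toy_document_vector` are hypothetical and exist only for this example.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for the 300-dimensional Word2Vec vectors
toy_vectors = {
    "stocks": np.array([1.0, 0.0, 2.0]),
    "rally": np.array([3.0, 2.0, 0.0]),
}

def toy_document_vector(vectors, doc, size=3):
    """Average the vectors of the known words; return zeros if none are known."""
    known = [vectors[word] for word in doc if word in vectors]
    if not known:
        return np.zeros(size)
    return np.mean(known, axis=0)

# "zzz" is out of vocabulary and is ignored by the average
vec = toy_document_vector(toy_vectors, ["stocks", "rally", "zzz"])
empty = toy_document_vector(toy_vectors, ["zzz"])

print(vec)
print(empty)
```

Averaging discards word order, so two headlines with the same words in a different order receive the same document vector.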




# Document Vector Representation Check using Word Vectors

This function checks whether a document has a vector representation in a given Word2Vec model's vocabulary. It helps identify documents that contain only out-of-vocabulary words and, therefore, cannot be represented by word vectors.

## Function: `has_vector_representation`

### Parameters

- **`word2vec_model`** (*gensim.models.KeyedVectors*): The pre-trained word vectors model.
- **`doc`** (*list*): List of words in the document.

### Returns

- **`bool`**: True if the document has a vector representation in the model, False otherwise.

### Procedure

1. **Checking Vector Representation**: The function checks each word in the document against the Word2Vec model's vocabulary using the `key_to_index` attribute. It returns True if at least one word is in the vocabulary and False if none are, including when the document is empty.

This function can be used to filter out documents that cannot be represented using the provided Word2Vec model's vocabulary. By applying it before document vector generation, you can ensure that only valid documents are used for further analysis and processing.


In [None]:
# Function that will help us drop documents that have no word vectors in word2vec
def has_vector_representation(word2vec_model, doc):

    """
    Check if a document has vector representation in the given word vectors model.

    Args:
        word2vec_model (gensim.models.KeyedVectors): The pre-trained word vectors model.
        doc (list): List of words in the document.

    Returns:
        bool: True if the document has vector representation, False otherwise.
    """


    # True as soon as one word of the document is in the vocabulary
    return any(word in word2vec_model.key_to_index for word in doc)

# Document Filtering based on a Condition

This function filters out documents from a corpus based on a given condition. It allows for efficient removal of unwanted documents, helping to clean and refine the dataset before further analysis.

## Function: `filter_docs`

### Parameters

- **`corpus`** (*list*): List of documents.
- **`texts`** (*list*): List of text data corresponding to each document.
- **`condition_on_doc`** (*function*): A function that takes a document and returns a boolean indicating whether the document should be retained or removed.

### Returns

- **`tuple`**: A tuple containing the filtered corpus and filtered texts (if provided).

### Procedure

1. **Initial Corpus Size**: Calculate the initial number of documents in the corpus.

2. **Filtering Texts (if provided)**: If a list of texts is provided, iterate through each text and corresponding document in the corpus. For each pair, apply the `condition_on_doc` function to check whether the document should be retained or removed. If the condition is met, keep the text; otherwise, discard it.

3. **Filtering Corpus**: Iterate through the entire corpus and retain only the documents that satisfy the condition specified by the `condition_on_doc` function.

4. **Display Removed Count**: Calculate the difference between the initial number of documents and the final number of documents in the filtered corpus. Print the count of removed documents.

5. **Return Filtered Data**: Return a tuple containing the filtered corpus and the filtered texts (if provided).

By using this function, you can filter out documents from a corpus based on specific criteria while keeping the corpus and its texts aligned. This is particularly useful for data preprocessing and cleaning tasks.


In [None]:
# Filter out documents
def filter_docs(corpus, texts, condition_on_doc):

    """
    Filter out documents from a corpus based on a given condition.

    Args:
        corpus (list): List of documents.
        texts (list): List of text data corresponding to each document.
        condition_on_doc (function): A function that takes a document and returns a boolean indicating whether the
                                    document should be retained or removed.

    Returns:
        tuple: A tuple containing the filtered corpus and filtered texts (if provided).
    """

    number_of_docs = len(corpus)

    if texts is not None:
        texts = [text for (text, doc) in zip(texts, corpus)
                 if condition_on_doc(doc)]

    corpus = [doc for doc in corpus if condition_on_doc(doc)]

    print("{} docs removed".format(number_of_docs - len(corpus)))

    return (corpus, texts)
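
A small usage sketch of this filtering pattern follows. It assumes a toy vocabulary set in place of the Word2Vec model, and it defines a compact stand-in, `filter_parallel`, so that the snippet runs on its own; both names are hypothetical.

```python
def filter_parallel(corpus, texts, keep):
    """Keep only the (doc, text) pairs whose doc satisfies `keep`."""
    kept = [(doc, text) for doc, text in zip(corpus, texts) if keep(doc)]
    docs = [doc for doc, _ in kept]
    kept_texts = [text for _, text in kept]
    return docs, kept_texts

# Toy vocabulary standing in for the Word2Vec model's key_to_index
toy_vocab = {"apple", "shares", "fall"}

corpus = [["apple", "shares", "fall"], ["xqzt", "blorp"], []]
titles = ["Apple shares fall", "Xqzt blorp", ""]

# Keep documents with at least one in-vocabulary word
docs, kept_titles = filter_parallel(
    corpus, titles, lambda doc: any(word in toy_vocab for word in doc)
)

print(len(corpus) - len(docs), "docs removed")
print(kept_titles)
```

Filtering both lists with the same condition keeps each cleaned document paired with its original title.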


In [None]:
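def Form_word2Vec_Xy(df, pre_clean_feature, clean_feature, target):
    """
    Build the feature matrix X and target y from averaged Word2Vec document vectors.

    Relies on the globally loaded Word2Vec `model` (see `load_Google_word2vec`).
    """

    titles_list = [title for title in df[pre_clean_feature]]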
    corpus = [c_title for c_title in df[clean_feature]]

    # Remove docs that don't include any words in W2V's vocab
    corpus, titles_list = filter_docs(corpus, titles_list, lambda doc: has_vector_representation(model, doc))

    # Filter out any empty docs
    corpus, titles_list = filter_docs(corpus, titles_list, lambda doc: (len(doc) != 0))
    # Build one averaged vector per document
    x = []
    for doc in corpus:
        x.append(document_vector(model, doc))

    X = np.array(x) # list to array

    from sklearn.decomposition import PCA

    pca = PCA(n_components='mle', svd_solver='full', random_state=10)

    # X is the array with our 300-dimensional vectors
    reduced_vecs = pca.fit_transform(X)
    df_w_vectors = pd.DataFrame(reduced_vecs)

    df_w_vectors['Title'] = titles_list

    main_w_vectors = pd.concat((df_w_vectors, df), axis=1)

    # Get rid of vectors that couldn't be matched with the main_df
    main_w_vectors.dropna(axis=0, inplace=True)

    # Drop all non-numeric, non-dummy columns, for feeding into the models
    cols_to_drop = ['Title','ID', 'URL', 'STORY', 'HOSTNAME', 'TIMESTAMP', 'PUBLISHER','TITLE', 'CLEAN_TEXT']

    data = main_w_vectors.drop(columns=cols_to_drop, axis = 1)

    X, y = Form_X_y(data, target)

    return X, y



In [None]:
# Fit the Word2Vec classification model:

def fit_Word2Vec_model(X, y, field, target, model_name):

    """
    Fit and evaluate a classification model using word embeddings (Word2Vec) on the given dataset.

    Args:
        X (pandas.DataFrame): The DataFrame containing the input features.
        y (pandas.Series): The Series containing the target labels.
        field (str): The name of the column in the DataFrame containing the word embedding data.
        target (str): The name of the target column.
        model_name (str): The name of the classification model to be used.

    Returns:
        None (Displays model evaluation metrics and visualization).
    """

    # Split the DataFrame into training and test sets
    X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2)

    # Choose the classifier model based on model_name
    classifier =''
    if model_name == 'Logistic Regression':
        classifier = Fit_Logistic_Regression_Grid(X_train, y_train)
    elif model_name == 'Multinomial Naive Bayes':
        classifier = Fit_Multinomial_Grid(X_train, y_train)
    elif model_name == 'OneVsRest':
        classifier = Fit_One_VS_Rest_Log_Regression_Grid(X_train, y_train)
    elif model_name == 'SVM':
        classifier = Fit_SVM_Grid(X_train, y_train)
    elif model_name == 'Random Forest':
        classifier = Fit_Random_Forest_Grid(X_train, y_train)
    else:
        print("Model name not recognised. Use one of: 'Logistic Regression', 'Multinomial Naive Bayes', 'OneVsRest', 'SVM', 'Random Forest'.")
        return

    # Train the Classifier using a pipeline
    model = pipeline.Pipeline([('classifier', classifier)])
    model['classifier'].fit(X_train, y_train)

    # Test the model on the test data
    predicted = model.predict(X_test)
    predicted_prob = model.predict_proba(X_test)

    classes = np.unique(y_test)
    y_test_array = pd.get_dummies(y_test, drop_first=False).values

    ## Calculate Accuracy, Precision, Recall, and AUC
    accuracy = metrics.accuracy_score(y_test, predicted)
    auc = metrics.roc_auc_score(y_test, predicted_prob, multi_class="ovr")

    print("Accuracy:", round(accuracy,2))
    print("AUC:", round(auc,2))
    print("Classification Report:")
    print(metrics.classification_report(y_test, predicted))

    ## Plot Confusion Matrix
    cm = metrics.confusion_matrix(y_test, predicted)
    fig, ax = plt.subplots()
    sns.heatmap(cm, annot=True, fmt='d', ax=ax, cmap=plt.cm.Blues, cbar=False)
    ax.set(xlabel="Predicted", ylabel="True", xticklabels=classes, yticklabels=classes, title="Confusion Matrix")
    plt.yticks(rotation=0)

    fig, ax = plt.subplots(nrows=1, ncols=2)
    # Plot one-vs-rest ROC curves, one per class
    for i in range(len(classes)):
        fpr, tpr, thresholds = metrics.roc_curve(y_test_array[:,i], predicted_prob[:, i])
        ax[0].plot(fpr, tpr, lw=3, label='{0} (area={1:0.2f})'.format(classes[i], metrics.auc(fpr, tpr)))
    # Diagonal reference line for a random classifier
    ax[0].plot([0,1], [0,1], color='navy', lw=3, linestyle='--')
    ax[0].set(xlim=[-0.05, 1.0], ylim=[0.0, 1.05],
            xlabel="False Positive Rate",
            ylabel="True Positive Rate (Recall)",
            title="Receiver Operating Characteristic (ROC) Curve")
    ax[0].legend(loc="lower right")
    ax[0].grid(True)

    # Plot Precision-Recall Curve
    for i in range(len(classes)):
        precision, recall, thresholds = metrics.precision_recall_curve(y_test_array[:,i], predicted_prob[:,i])
        ax[1].plot(recall, precision, lw=3, label='{0} (area={1:0.2f})'.format(classes[i], metrics.auc(recall, precision)))
        ax[1].set(xlim=[0.0,1.05], ylim=[0.0,1.05], xlabel='Recall', ylabel="Precision", title="Precision-Recall Curve")
        ax[1].legend(loc="best")
        ax[1].grid(True)

    box = ax[1].get_position()
    box.x0 = box.x0 + 0.1
    box.x1 = box.x1 + 0.1
    ax[1].set_position(box)
    plt.show()

#fit_Word2Vec_model(X, y, 'CLEAN_TEXT', 'CATEGORY','Random Forest')

# Model Comparisons

## Model with Word Embedding

### Logistic Regression Model

**Optimised Parameters from grid search:**  
LogisticRegression(C=0.05, class_weight='balanced', max_iter=10000)

**Accuracy:** 0.79 | **AUC:** 0.93

|           | Precision | Recall  | F1-Score | Support |
|-----------|-----------|---------|----------|---------|
| Class b   | 0.78      | 0.76    | 0.77     | 23069   |
| Class e   | 0.88      | 0.85    | 0.87     | 30700   |
| Class m   | 0.65      | 0.78    | 0.71     | 9037    |
| Class t   | 0.76      | 0.75    | 0.76     | 21660   |
| **Average** | **0.80**  | **0.79** | **0.79** | **84466** |

**Macro Avg:** Precision: 0.77 | Recall: 0.79 | F1-Score: 0.78  

**Weighted Avg:** Precision: 0.80 | Recall: 0.79 | F1-Score: 0.79 | Support: 84466

---

### One-vs-Rest Classifier

**Optimised Parameters from grid search:**  
OneVsRestClassifier(estimator=SGDClassifier(alpha=1e-05, loss='log_loss', penalty='l1'))

**Accuracy:** 0.80 | **AUC:** 0.93

|           | Precision | Recall  | F1-Score | Support |
|-----------|-----------|---------|----------|---------|
| Class b   | 0.76      | 0.80    | 0.78     | 23095   |
| Class e   | 0.84      | 0.89    | 0.87     | 30579   |
| Class m   | 0.80      | 0.66    | 0.72     | 9077    |
| Class t   | 0.79      | 0.73    | 0.76     | 21715   |
| **Average** | **0.80**  | **0.80** | **0.80** | **84466** |

**Macro Avg:** Precision: 0.79 | Recall: 0.77 | F1-Score: 0.78  

**Weighted Avg:** Precision: 0.80 | Recall: 0.80 | F1-Score: 0.80 | Support: 84466

---

### Random Forest Classifier

**Optimised Parameters from grid search:**  
RandomForestClassifier(criterion='entropy', max_depth=5, n_estimators=10)

**Accuracy:** 0.63 | **AUC:** 0.85

|           | Precision | Recall  | F1-Score | Support |
|-----------|-----------|---------|----------|---------|
| Class b   | 0.64      | 0.61    | 0.63     | 23172   |
| Class e   | 0.59      | 0.92    | 0.71     | 30538   |
| Class m   | 0.86      | 0.16    | 0.27     | 9108    |
| Class t   | 0.75      | 0.44    | 0.56     | 21648   |
| **Average** | **0.71**  | **0.53** | **0.54** | **84466** |

**Macro Avg:** Precision: 0.71 | Recall: 0.53 | F1-Score: 0.54  

**Weighted Avg:** Precision: 0.67 | Recall: 0.63 | F1-Score: 0.60 | Support: 84466

---

All three embedding-based models (Logistic Regression, One-vs-Rest, and Random Forest) perform worse than the initial bag-of-words models: the best embedding model reaches 0.80 accuracy and 0.93 AUC, against 0.92 to 0.94 accuracy and 0.99 AUC for bag of words. Despite the semantic information carried by the word embeddings, the averaged document vectors do not match bag-of-words performance on this classification task.


In [None]:
def main():
    barplot_target_category(df, 'CATEGORY', {'b': 'BUSINESS', 't': 'TECHNOLOGY', 'm': 'MEDICAL', 'e': 'ENTERTAINMENT'})

In [None]:
if __name__ == "__main__":
    main()


In [None]:
!pip install bert-tensorflow

In [None]:
import pandas as pd
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
import tensorflow_hub as hub
from datetime import datetime
from sklearn.model_selection import train_test_split
import os

print("tensorflow version : ", tf.__version__)
print("tensorflow_hub version : ", hub.__version__)

import bert
from bert import run_classifier
from bert import optimization
from bert import tokenization
from bert import modeling

In [None]:
!pip install tensorflow.contrib

In [None]:
!pip install tensorflow_hub

# LSTM

In [16]:
import numpy as np
import pandas as pd
from keras.layers import Dense, Embedding, LSTM, SpatialDropout1D
from keras.models import Sequential
from sklearn.feature_extraction.text import CountVectorizer
from keras.preprocessing.text import Tokenizer
from keras_preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical
from keras.callbacks import EarlyStopping
from tensorflow import keras

In [None]:
from google.colab import files
data_to_load = files.upload()

In [17]:
df = download_Data('uci-news-aggregator.csv')
df = data_frame_pre_process(df, 'TITLE', True, False, False)

In [None]:
data= df.sample(frac = 0.75)

In [None]:
# Maximum vocabulary size (the most frequent words are kept).
MAX_NB_WORDS = 50000
# Maximum number of tokens kept per headline; shorter sequences are padded.
MAX_SEQUENCE_LENGTH = 250
# Dimensionality of the learned embedding vectors.
EMBEDDING_DIM = 100
tokenizer = Tokenizer(num_words=MAX_NB_WORDS, lower=True)
tokenizer.fit_on_texts(data['CLEAN_TEXT'].values)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

Found 40326 unique tokens.


In [None]:
X = tokenizer.texts_to_sequences(data['CLEAN_TEXT'].values)
X = pad_sequences(X, maxlen=MAX_SEQUENCE_LENGTH)
print('Shape of data tensor:', X.shape)


Shape of data tensor: (316813, 250)


In [None]:
Y = pd.get_dummies(data['CATEGORY']).values
print('Shape of label tensor:', Y.shape)

Shape of label tensor: (316813, 4)


In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.3, random_state = 42)
print(X_train.shape,Y_train.shape)
print(X_test.shape,Y_test.shape)

(221769, 250) (221769, 4)
(95044, 250) (95044, 4)


In [None]:
model = Sequential()
model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=X.shape[1]))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(4, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [None]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 250, 100)          5000000   
                                                                 
 lstm (LSTM)                 (None, 100)               80400     
                                                                 
 dense (Dense)               (None, 4)                 404       
                                                                 
Total params: 5080804 (19.38 MB)
Trainable params: 5080804 (19.38 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


In [None]:
%load_ext tensorboard

The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard


In [None]:
epochs = 5
batch_size = 128

history = model.fit(X_train, Y_train, epochs=epochs, batch_size=batch_size,validation_split=0.3)

Epoch 1/5
  93/2426 [>.............................] - ETA: 51:53 - loss: 1.1950 - accuracy: 0.4906

In [None]:
%tensorboard --logdir logs

In [None]:
accr = model.evaluate(X_test,Y_test)
print('Test set\n  Loss: {:0.3f}\n  Accuracy: {:0.3f}'.format(accr[0], accr[1]))

In [None]:
plt.title('Loss')
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='validation')
plt.legend()
plt.show()

In [None]:
plt.title('Accuracy')
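# The model was compiled with metrics=['accuracy'], so the history keys are 'accuracy' and 'val_accuracy'
plt.plot(history.history['accuracy'], label='train')
plt.plot(history.history['val_accuracy'], label='validation')
plt.legend()
plt.show()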

In [None]:
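def check(new_title):
    # texts_to_sequences expects a list of texts, so wrap a single title
    if isinstance(new_title, str):
        new_title = [new_title]
    seq = tokenizer.texts_to_sequences(new_title)
    padded = pad_sequences(seq, maxlen=MAX_SEQUENCE_LENGTH)
    pred = model.predict(padded)
    # Column order matches pd.get_dummies on CATEGORY (alphabetical)
    labels = ['b', 'e', 'm', 't']
    print(pred)
    print('\n')
    category_names = {'b': 'Business', 'm': 'Medical', 'e': 'Entertainment', 't': 'Technology'}
    # Return the category of the first title
    return category_names[labels[np.argmax(pred[0])]]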