# Genre classification of historical newspaper texts

In this notebook we will train and test two different classifiers on a collection of texts from [historical New Zealand newspapers](https://paperspast.natlib.govt.nz/newspapers). Our aim is to build genre classification models that are independent of topic, so we will use features based on the structure and layout of the text (for example line widths), linguistic features (such as the frequency of certain parts-of-speech), and other text statistics.

The data used in this notebook is originally sourced from the [National Library of New Zealand's Papers Past open data](https://natlib.govt.nz/about-us/open-data/papers-past-metadata/papers-past-newspaper-open-data-pilot/dataset-papers-past-newspaper-open-data-pilot). It consists of a small dataset of articles that have been pre-labelled with their genre and includes features related to line widths and offsets that have been extracted from the [METS/ALTO XML files](https://veridiansoftware.com/knowledge-base/metsalto/) for each newspaper.

We will use [spaCy](https://spacy.io/) and [textstat](https://pypi.org/project/textstat/) to extract additional features and add them to our dataframe. We will then use [scikit-learn](https://scikit-learn.org/stable/) to train and test our models.

<div style="border:1px solid black;margin-top:1em;padding:0.5em;">
    <strong>Task 0:</strong> Throughout the notebook there are defined tasks for you to do. Watch out for them - they will have a box around them like this! Make sure you take some notes as you go.
</div>

![National Library Papers Past](https://images.ctfassets.net/pwv49hug9jad/6tW2XbQ3rwBfOYilgpZmVQ/468368a1454e2201958401cab2ea7d79/guides-pp-open-data-feature-image.jpg?fm=webp)

[Image source: natlib.govt.nz](https://natlib.govt.nz/about-us/open-data/papers-past-metadata/papers-past-newspaper-open-data-pilot/get-started-papers-past-newspaper-open-data-pilot)

## Setup

We need to make sure the libraries we will need in this notebook are installed (you only need to run this cell once):

In [None]:
!pip install scikit-learn --upgrade
!pip install textstat
# !pip install seaborn

Now import the required libraries. 

In [None]:
import sys

import pandas as pd
import numpy as np
import math
import pickle
import re

# Classifier training and evaluation
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier 
from sklearn.tree import plot_tree

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import roc_curve, auc, roc_auc_score, confusion_matrix
from sklearn.metrics import RocCurveDisplay

# Feature extraction
import spacy
import textstat
from collections import Counter

import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords

# Plotting
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(
    context="notebook",
    style="whitegrid",
    font="sans-serif",
    font_scale=1
    )

In [None]:
# spacy.cli.download("en_core_web_sm")  # uncomment if needed
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

## Load and explore the dataset

In [None]:
# Load the dataframe
filepath = "paperspast_4genres_20240502.csv"
df = pd.read_csv(filepath, index_col = [0])

In [None]:
# View the count of articles by genre
display(df.groupby(["genre"])["genre"].count())

In [None]:
# View first ten rows of the dataframe
df.head(10)

In [None]:
# Because the ID of the newspaper 'Northern Advocate' is 'NA', this has been read-in as NaN
# View the problem by selecting rows where 'newspaper' column equals 'Northern Advocate'
# Look at the 'newspaper_id' column

mask = df["newspaper"] == "Northern Advocate"
display(df.loc[mask])

In [None]:
# We can fix this by filling the newspaper_id column for our selected rows with the correct code 'NA'
df.loc[mask, "newspaper_id"] = df.loc[mask, "newspaper_id"].fillna("NA")
display(df.loc[mask])

In [None]:
# Let's look at the distribution of the articles in our dataset by newspaper

plt.figure(figsize = (10, 4))
sample_papers_unique = df["newspaper"].nunique()
print("-----------------------------------------------------") 
print(f"Number of newspaper titles in sample dataset: {sample_papers_unique}") 
print("-----------------------------------------------------") 
print("") 

ax_1 = sns.countplot(x = "newspaper", 
                     data = df, 
                     order = df["newspaper"].value_counts().index, 
                     color = "#32a5fc")
ax_1.set_xlabel("Newspaper", fontsize = 12)
ax_1.set_ylabel("Count of articles", fontsize = 12, labelpad = 9)
ax_1.set_title("Distribution of articles by newspaper", fontsize = 14)
sns.despine(top = True, right = True, left = True, bottom = False, offset = None, trim = False)
plt.xticks(rotation = 90, fontsize = 11)
plt.yticks(fontsize = 11)
plt.show()

In [None]:
# Now let's view the distribution of the articles in our dataset by year

plt.figure(figsize = (10, 4))
annual_df = (df.groupby(df["year"])
             ["text"].count().reset_index())
ax_2 = sns.barplot(x = "year", 
                   y = "text", 
                   data = annual_df, 
                   color = "#32a5fc")
ax_2.set_xlabel("Year", fontsize = 12, labelpad = 14)
ax_2.set_ylabel("Count of articles", fontsize = 12, labelpad = 9)
ax_2.set_title("Distribution of articles in dataset by year", fontsize = 14)
sns.despine(top = True, right = True, left = True, bottom = False, offset = None, trim = False)
plt.xticks(rotation = 90, fontsize = 11)
plt.yticks(fontsize = 11)
plt.show()

In [None]:
# We can display the full text of a selected article by its row position in the dataframe
selected_index = 431

print(f"\nGenre: {df['genre'].values[selected_index]}\n")
print("==============\n")
print(f"Title:\t\t{df['title'].values[selected_index]}")
print(f"Newspaper:\t{df['newspaper'].values[selected_index]}")
print(f"Date:\t\t{df['day'].values[selected_index]} / {df['month'].values[selected_index]} / {df['year'].values[selected_index]}\n")
print(f"{df['text'].values[selected_index]}\n")
print(f"You can view the scanned article on the Papers Past website. "
      f"Follow the link below to see the scan of the original article.\n{df['article_web'].values[selected_index]}\n")

## Data cleaning

You'll see from the above that there can be symbols and punctuation in the text that are the result of [OCR](https://en.wikipedia.org/wiki/Optical_character_recognition) errors. We will run a simple cleaner function over the text column of the dataframe to improve this and add the cleaned text to a new column. Before we remove punctuation, we will count the sentences and add this feature to the dataframe.  
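
To make the cleaning step concrete, here is a minimal sketch that applies the same regular expression and punctuation counts to a single string. The sample text is made up for illustration and is not taken from the dataset.

```python
import re

# Hypothetical OCR-like sample; not from the Papers Past dataset
sample = 'He went np the hill, "quickly"! Was it 18?9?'

# Count punctuation before it is stripped out
counts = {
    "q_marks": sample.count("?"),
    "double_quotes": sample.count('"'),
    "exclam": sample.count("!"),
}

# Keep only runs of ASCII letters and digits, joined by single spaces
pattern = re.compile(r"[A-Za-z0-9]{1,50}")
clean = " ".join(pattern.findall(sample))

print(counts)
print(clean)
```

Note that this kind of cleaning cannot repair OCR substitutions such as 'np' for 'up', and it splits a token wherever a symbol interrupts it (here '18?9' becomes '18 9').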

In [None]:
def cleaner(df, column_name):
    """
    Given a dataframe column of OCR text, count the sentences, question marks, quotation marks,
    exclamation marks, and apostrophes and store these counts in new columns.
    
    Remove symbols using a regular expression and create a clean text column.

    Return the updated dataframe.
    """
    # A column of sentence count is added to the dataframe before punctuation is removed.
    df["sentence_count"] = df[column_name].apply(lambda x: textstat.sentence_count(x))

    # Count occurrences of specific characters and add to new columns
    df["freq_q_marks"] = df[column_name].apply(lambda x: x.count("?"))
    df["freq_double_quotes"] = df[column_name].apply(lambda x: x.count('"'))
    df["freq_exclam"] = df[column_name].apply(lambda x: x.count("!"))
    df["freq_apost"] = df[column_name].apply(lambda x: x.count("'"))

    # Regex pattern for only alphanumeric text
    pattern = re.compile(r"[A-Za-z0-9]{1,50}")
    df["clean_text"] = df[column_name].str.findall(pattern).str.join(" ")
    
    return df

In [None]:
df = cleaner(df, "text")

In [None]:
# Let's look at that same text after 'cleaning'

# We can display the full text of a selected article by index
print(f"\nGenre: {df['genre'].values[selected_index]}\n")
print("==============\n")
print(f"Title:\t\t{df['title'].values[selected_index]}")
print(f"Newspaper:\t{df['newspaper'].values[selected_index]}")
print(f"Date:\t\t{df['day'].values[selected_index]} / {df['month'].values[selected_index]} / {df['year'].values[selected_index]}\n")
print(df["clean_text"].values[selected_index])

<div style="border:1px solid black;margin-top:1em;padding:0.5em;">
    <strong>Task 1:</strong> We have done a very simple clean-up of the text but, as you can see, there are still problems. You might see incorrect words like 'np' instead of 'up', capital letters in the wrong place, numbers where there should be letters, and more. Think about the impact this might have on our model and discuss it with your classmates or tutors. 
</div>

In [None]:
# You can see our additional columns have been added to the end of our dataframe
# Scroll across to the right if they are not visible
df.head(5)

## Feature extraction: linguistic features and text statistics

The following cells extract parts-of-speech and text statistic features and add them to the dataframe. For efficiency, the texts are [processed for parts-of-speech tagging](https://spacy.io/usage/processing-pipelines) as a stream using spaCy's [nlp.pipe](https://spacy.io/usage/processing-pipelines#processing). This allows the texts to be buffered in batches instead of one-by-one.
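
The relative frequency of a part of speech is simply its count divided by the number of tokens in the text. The sketch below shows that calculation on a short, hand-tagged sentence. The tokens and tags are assumptions for illustration (in the notebook they come from spaCy), and the helper name `pos_relfreq` is hypothetical.

```python
from collections import Counter

def pos_relfreq(tagged_tokens, pos_tags):
    """Return {tag: count / total tokens} for each tag in pos_tags."""
    total = len(tagged_tokens)
    if total == 0:
        # A choice made for this sketch: report 0.0 for empty texts
        return {tag: 0.0 for tag in pos_tags}
    counts = Counter(tag for _, tag in tagged_tokens if tag in pos_tags)
    return {tag: counts.get(tag, 0) / total for tag in pos_tags}

# Hypothetical hand-tagged tokens (a tagger would normally supply these)
tagged = [
    ("She", "PRON"), ("walked", "VERB"), ("slowly", "ADV"), ("to", "ADP"),
    ("the", "DET"), ("old", "ADJ"), ("wharf", "NOUN"), ("and", "CCONJ"),
    ("he", "PRON"), ("followed", "VERB"),
]

selected_tags = ["ADJ", "ADV", "NOUN", "NUM", "PRON", "PROPN", "VERB"]
relfreq = pos_relfreq(tagged, selected_tags)
print(relfreq)
```

Tags that never occur still get an entry of 0.0 here, so every text ends up with the same set of features. An article whose cleaned text is empty has a word count of zero, so it is worth checking the relative frequency columns for missing or infinite values.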

In [None]:
# Run this cell to define the list of POS tags to count (you don't need to change anything here)
# We will use a selection of Universal POS tags: https://universaldependencies.org/u/pos/ 

pos_tags = [
            "ADJ",    # adjective
            "ADV",    # adverb
            "NOUN",   # noun
            "NUM",    # numeral
            "PRON",   # pronoun
            "PROPN",  # proper noun
            "VERB",   # verb
            ]

In [None]:
def process_text(df, pos_tags):
    """
    Given a pandas dataframe with a column called "clean_text"
    and a list of Universal parts of speech tags, add columns for 
    a list of tokens, word count, relative frequencies 
    of the given parts of speech (using spaCy), relative frequencies 
    of stopwords, and relative frequencies of monosyllabic words 
    and selected punctuation. 

    Return the dataframe with the additional columns.
    """
    stop = stopwords.words("english")
 
    token_list = []
    pos_counts = []
    input_col = "clean_text"

    # spaCy pipeline to count POS
    nlp_text_pipe = nlp.pipe(df[input_col], batch_size = 20)

    for doc in nlp_text_pipe:
        token_list.append([token.text for token in doc if not token.is_punct and not token.is_space]) 
        pos_counts.append(Counter(token.pos_ for token in doc if token.pos_ in pos_tags))
    
    df["tokens"] = token_list
    df["word_count"] = df["tokens"].apply(lambda x: len(x))
    df["stopwords_count"] = df[input_col].apply(lambda x: len([i for i in x.split() if i.lower() in stop]))
    df["stopword_relfreq"] = df["stopwords_count"] / df["word_count"]

    pos_columns = set().union(*pos_counts)

    # Compute the relative frequencies of each part-of-speech tag
    for pos in pos_columns:
        df[pos + "_relfreq"] = [count.get(pos, 0) for count in pos_counts] / df["word_count"]

    # Add monosyllabic words relative frequency using the textstat library
    # Add relative frequencies of the punctuation marks counted earlier
    df["monosyll_count"] = df[input_col].apply(lambda x: textstat.monosyllabcount(x)) 
    df["monosyll_relfreq"] = df["monosyll_count"] / df["word_count"]
    df["q_marks_relfreq"] = df["freq_q_marks"]  / df["word_count"]
    df["double_quotes_relfreq"] = df["freq_double_quotes"] / df["word_count"]
    df["exclam_relfreq"] = df["freq_exclam"]  / df["word_count"]
    df["apost_relfreq"] = df["freq_apost"] / df["word_count"]

    # Drop count columns that are no longer required (they've been converted to relative frequencies)
    df.drop(columns=["tokens", 
                     "monosyll_count", 
                     "stopwords_count", 
                     "freq_q_marks", 
                     "freq_double_quotes", 
                     "freq_exclam", 
                     "freq_apost"], axis = 1, inplace = True)

    return df

In [None]:
# Run the function to extract text features and add them to the dataframe
# This might take a little while to run
df = process_text(df, pos_tags)

In [None]:
# Inspect the first few rows of the dataframe to see the features that have been added
# Scroll to the right
pd.set_option("display.max_columns", None)
df.head(5)

In [None]:
# We can examine use of a selected POS for a given dataframe index

pos_var = 'PRON'
my_ind = 4

#---------------------------------------------------------------------------------------------------#

doc = nlp(df.iloc[my_ind]["clean_text"])
my_text = df.iloc[my_ind]["clean_text"]
stop = stopwords.words("english")
my_pos = [token.text for token in doc if token.pos_ == pos_var]
my_stopwords = [text for text in my_text.split() if text.lower() in stop]

print(f"------------------\nIndex: {my_ind}\n------------------")
display(df.iloc[[my_ind]])  # Select by position, consistent with the .iloc lookups in this cell
print(f"\n------------------")
print(f"Word count: {df.iloc[my_ind]['word_count']}")
print(f"\n------------------")
print(f"{pos_var} relative frequency: {df.iloc[my_ind][f'{pos_var}_relfreq']:.3f}")
print(f"\n------------------")
print(f"Article title:\t\t\t{df.iloc[my_ind]['title']}")
print(f"Scanned newspaper issue:\t{df.iloc[my_ind]['newspaper_web']}\n")
print(f"------------------\nClean text:\n")
print(df.iloc[my_ind][f"clean_text"])
print(f"\n------------------")
print(f"{pos_var} (count = {len(my_pos)}):\n")
print(my_pos)
print("\n------------------\n"
        f"Stopwords (count = {len(my_stopwords)}):\n")
print(my_stopwords)

In [None]:
# We can also inspect summary statistics for all our numerical data 
# We will use these later to explore the misclassified articles

df.describe()

<div style="border:1px solid black;margin-top:1em;padding:0.5em;">
    <strong>Task 2:</strong> Why is exploratory data analysis (EDA) using techniques such as visualising the data and examining descriptive statistics important? What can it reveal? Discuss with your classmates or tutors. 
</div>

In [None]:
# Inspect the full list of columns in our dataframe, and see their data types
display(df.dtypes)

## Specify features to include in the model

* We now need to specify the features we want to include in our model. For example, if we know that two features are highly correlated, we can choose to include only one of them.
* You can include or remove features from the model to explore the impact of different combinations of features on the performance of the classifier.
* **Use the default list shown below first, then experiment to see what effect the changes have on the model**
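
One way to check whether two features are highly correlated is to compute their Pearson correlation coefficient. The sketch below does this in plain Python on made-up values. The numbers and the helper name `pearson` are assumptions for illustration only.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical feature values for five articles
avg_line_width = [30.0, 32.0, 35.0, 50.0, 52.0]
max_line_width = [34.0, 36.0, 39.0, 54.0, 56.0]  # always avg + 4
word_count = [120, 800, 95, 400, 310]

r_widths = pearson(avg_line_width, max_line_width)
r_words = pearson(avg_line_width, word_count)

print(f"avg vs max line width: {r_widths:.3f}")
print(f"avg line width vs word count: {r_words:.3f}")
```

A coefficient close to 1 or -1 suggests the two features carry much the same information, so keeping both may add little. For a whole dataframe, pandas provides the same calculation through `DataFrame.corr()` applied to the numeric feature columns.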

In [None]:
# List of features to include in the model 
# Place cursor in the text and press Ctrl + / to comment or uncomment the line

features = [
            "avg_line_width",
            # "min_line_width",
            # "max_line_width",
            # "line_width_range",
            "avg_line_offset",
            # "max_line_offset",
            # "min_line_offset",
            # "sentence_count",
            "word_count",
            # "stopword_relfreq",
            "VERB_relfreq",
            "ADV_relfreq",
            "PRON_relfreq",
            "ADJ_relfreq",
            "PROPN_relfreq",
            "NOUN_relfreq",
            "NUM_relfreq",
            "monosyll_relfreq",
            # "q_marks_relfreq",
            # "double_quotes_relfreq",
            # "exclam_relfreq",
            # "apost_relfreq",
    
            # We will code our target genre as a binary class '1' and the other genres as '0'
            # Do not remove this feature from the set
            "binary_class"  
           ]

## Set the target genre

* We will specify the genre we want to predict with the binary classifier. 
* The selected genre will be labelled as 1 in the binary classification model, with the other classes labelled as 0.

In [None]:
# Select from:
# FamilyNotice     
# Fiction          
# LetterToEditor    
# Poetry         

target_genre = "Poetry"

<div style="border:1px solid black;margin-top:1em;padding:0.5em;">
    <strong>Task 3:</strong> Train and test classifiers for each of the four genres and take note of the results in a separate document. Which combination of genre and classifier achieved the best metrics and which was the worst? Discuss the results with your classmates or tutors.
</div>

## Split the data into train and test sets

* Run the cells below to split the data into train and test data sets.
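
The split used in this notebook is a simple random split. When the target genre is much rarer than the other genres, a random split can leave the test set with very few positive examples. scikit-learn's `train_test_split` accepts a `stratify` argument to keep the class proportions the same in both sets. The plain-Python sketch below illustrates what stratification does. It is not the method used in this notebook, and the labels and the helper name `stratified_split` are made up.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.3, seed=7):
    """Split positions into train and test, keeping class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_size)  # per-class share of the test set
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical labels: 10 articles of the target genre, 30 of other genres
labels = [1] * 10 + [0] * 30
train_idx, test_idx = stratified_split(labels)

print(len(train_idx), len(test_idx))
print(sum(labels[i] for i in test_idx), "positive articles in the test set")
```

Because the positions are returned alongside the split, the predictions can later be matched back to the original rows, which is the same reason the notebook passes the dataframe indices through `train_test_split`.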

In [None]:
def train_test_data(df, features, target_genre):
    """
    Given the dataframe, features to include in the model,
    and the target genre, split the data into 
    training and test sets and use the dataframe indices to 
    save the order of the split
    """
    
    df["binary_class"] = np.where(df["genre"]== target_genre, 1, 0)
    model_df = df.filter(features, axis=1)
    indices = df.index.values

    # Extract the explanatory variables in X and the target variable in y
    y = model_df.binary_class.copy()
    X = model_df.drop(["binary_class"], axis=1)

    # Train test split 
    # Use the indices to save the order of the split.
    # https://stackoverflow.com/questions/48947194/add-randomforestclassifier-predict-proba-results-to-original-dataframe
    X_train, X_test, indices_train, indices_test = train_test_split(X, 
                                                                    indices, 
                                                                    test_size = .3,    # This value changes the proportion of data held out for the test set
                                                                    random_state = 7)  # You can change the random state to change the allocation of docs to the training and test sets
    
    # Select the labels by index label so they line up with the rows of X_train and X_test
    y_train, y_test = y.loc[indices_train], y.loc[indices_test]
    
    return X_train, X_test, y_train, y_test, indices_train, indices_test

In [None]:
X_train, X_test, y_train, y_test, indices_train, indices_test = train_test_data(df, features, target_genre)

In [None]:
X_train.head(10)

In [None]:
y_train.head(10)

## Train and Test a Logistic Regression Classifier 
* [Logistic regression](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression) is a binary classification method popular for its computational efficiency and interpretability.
* Run the cells below to train and test a logistic regression classifier for our selected genre.
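
The evaluation metrics reported for the classifiers in this notebook are all derived from counts of true and false positives and negatives. As a minimal sketch, they can be computed by hand as follows. The labels and predictions are made up, and the helper name `binary_metrics` is hypothetical.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = target genre)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    # Returning 0.0 when a metric is undefined is a choice made for this sketch
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(pairs)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical test labels and predictions for ten articles
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In this example accuracy is higher than precision and recall because most articles belong to the negative class. This is why accuracy alone can be misleading when the target genre is a minority of the dataset.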

In [None]:
def log_reg_binary(X_train, X_test, y_train, y_test, target_genre):
    """
    Train a logistic regression model to classify the selected genre
    """ 
    pipe = Pipeline([("scl", StandardScaler()),
                     ("clf", LogisticRegression())]) 
    pipe.fit(X_train, y_train)  
    
    y_pred_train = pipe.predict(X_train)
    y_pred_test = pipe.predict(X_test)
    y_prob_train = pipe.predict_proba(X_train)
    y_prob_test = pipe.predict_proba(X_test)
        
    accuracy_result = accuracy_score(y_test, y_pred_test)
    precision_result = precision_score(y_test, y_pred_test)
    recall_result = recall_score(y_test, y_pred_test)
    f1_result = f1_score(y_test, y_pred_test)
    auroc_result = roc_auc_score(y_test, y_prob_test[:, 1])  # Use probabilities for AUROC

    print("-----------------------------------------------")
    print(f"Binary Classification - Logistic Regression")
    print(f"Target genre: {target_genre}")
    print("-----------------------------------------------")
    print()
    print(f"Accuracy = {accuracy_result:.3f}")
    print(f"Precision = {precision_result:.3f}")
    print(f"Recall = {recall_result:.3f}")
    print(f"F1 Score = {f1_result:.3f}")
    print(f"AUROC Score = {auroc_result:.3f}")
    
    RocCurveDisplay.from_predictions(y_test, y_prob_test[:, 1])  # Use probabilities for ROC curve
    plt.title("ROC curve: Logistic Regression")
    plt.show()
    
    print()
    print("-----------------------------------------------")
    print(f"Model coefficients \nwith log odds (logit) converted to odds ratio\nfor improved interpretability\n")
    print(f"Target genre: {target_genre}")
    print("-----------------------------------------------")
    
    # Get coefficients (log odds or logit)
    log_odds = pipe.named_steps["clf"].coef_[0]
    # Convert log odds to odds ratio
    odds = np.exp(log_odds)
    
    return y_pred_train, y_pred_test, y_prob_train, y_prob_test, log_odds, odds

In [None]:
def genres_binary_lr(df, target_genre, X_train, X_test, y_train, y_test, indices_train, indices_test):
    """
    Train and test the model, and return the dataframe
    with appended predictions.
    """

    y_pred_train, y_pred_test, y_prob_train, y_prob_test, log_odds, odds = log_reg_binary(X_train, 
                                                                                          X_test,
                                                                                          y_train,
                                                                                          y_test,
                                                                                          target_genre
                                                                                         )

    # Add the predictions to a copy of the original dataframe
    df_new = df.copy()
    df_new.loc[indices_train,"pred_train"] = y_pred_train
    df_new.loc[indices_test,"pred_test"] = y_pred_test
    df_new.loc[indices_train,"prob_0_train"] = y_prob_train[:,0]
    df_new.loc[indices_test,"prob_0_test"] = y_prob_test[:,0]
    df_new.loc[indices_train,"prob_1_train"] = y_prob_train[:,1]
    df_new.loc[indices_test,"prob_1_test"] = y_prob_test[:,1]   

    # Sort the dataframe by probability of being the given genre
    df_new = df_new.sort_values(by="prob_1_test", ascending=False)  
    
    # Create a dataframe with both log odds and odds
    lr_odds_df = pd.DataFrame({
        "feature": X_train.columns,
        "log odds (logit)": log_odds,
        "odds ratio": odds
    })
    
    # Sort the dataframe by odds in descending order
    lr_odds_df = lr_odds_df.sort_values(by="odds ratio", ascending=False)
    
    # Reset the index for cleaner display
    lr_odds_df = lr_odds_df.reset_index(drop=True)
       
    return df_new, lr_odds_df

In [None]:
lr_preds_df, lr_odds_df = genres_binary_lr(df, target_genre, X_train, X_test, y_train, y_test, indices_train, indices_test)

# Explore the model coefficients
display(lr_odds_df)

### Interpreting the logistic regression model
A benefit of logistic regression is that it is relatively easy to interpret compared to other classifiers. We can extract the coefficients of the features in the final model (using the 'coef_' attribute) to see which features were the strongest predictors of the positive class (in our case, the selected genre). 

Each coefficient extracted using 'coef_' is the change in the log odds (logit) that an observation belongs to the positive class for a one-unit increase in that feature. Because our pipeline standardises the features with StandardScaler before fitting the model, one unit here means one standard deviation of the feature. To interpret the coefficients more easily, we can convert them to odds ratios by exponentiating them. An odds ratio greater than 1 represents a positive association and can be interpreted as follows:

**"For every unit increase in {feature}, the odds that the observation is {positive class} are multiplied by {odds ratio} when all other variables are held constant."**

An odds ratio less than 1 represents a negative association. To describe it in a similar way to the above, we need to take 1 / odds ratio. For example:

"For every unit increase in {feature}, the odds that the observation **is** {positive class} are **divided** by {1 / odds ratio} when all other variables are held constant."

**When interpreting the model coefficients it is important to consider the influence of features that may be correlated with each other (multicollinearity). These features will have similar predictive relationships to the outcome and therefore the sign and value of the coefficients should be interpreted with caution.** 

You can read more about calculating and interpreting the coefficients of regression models in this [Towards Data Science](https://towardsdatascience.com/interpreting-coefficients-in-linear-and-logistic-regression-6ddf1295f6f1) article. 
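
As a worked example of the conversion, the sketch below turns two made-up coefficients into odds ratios and shows how a one-unit increase in a feature changes a predicted probability. The coefficient values, the starting probability and the helper names are assumptions for illustration.

```python
import math

def odds_ratio(coef):
    """Convert a logistic regression coefficient (log odds) to an odds ratio."""
    return math.exp(coef)

def shift_probability(p, coef):
    """Probability after a one-unit increase in a feature with this coefficient."""
    odds = p / (1 - p)                 # probability -> odds
    new_odds = odds * math.exp(coef)   # odds are multiplied by the odds ratio
    return new_odds / (1 + new_odds)   # odds -> probability

coef_pos, coef_neg = 0.7, -0.7  # hypothetical coefficients

print(f"Odds ratio for {coef_pos}: {odds_ratio(coef_pos):.3f}")
print(f"Odds ratio for {coef_neg}: {odds_ratio(coef_neg):.3f}")
print(f"1 / odds ratio for {coef_neg}: {1 / odds_ratio(coef_neg):.3f}")

# Starting from a probability of 0.2 (odds of 0.25)
new_p = shift_probability(0.2, coef_pos)
print(f"Probability after one unit increase: {new_p:.3f}")
```

The odds ratio multiplies the odds, not the probability. Here the odds roughly double while the probability moves from 0.2 to about one in three.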

## Train and Test a Decision Tree classifier 
* [Decision Tree](https://scikit-learn.org/stable/modules/tree.html) methods are useful because they are easy to apply and interpret. However, the results can be sensitive to small changes in the dataset, and they tend to perform less well on imbalanced datasets (is this a problem with our dataset?).
* Run the cells below to train and test a Decision Tree classifier for our selected genre and compare the results to the logistic regression model.
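
A decision tree chooses each split by looking for the feature threshold that best separates the classes. By default, scikit-learn's `DecisionTreeClassifier` scores candidate splits using Gini impurity. The sketch below illustrates the idea on made-up values for a single feature. It is a simplified illustration rather than the library's implementation, and the helper names are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 for a pure node, 0.5 for an even two-class mix."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(values, labels, threshold):
    """Weighted Gini impurity after splitting on value <= threshold."""
    left = [lab for val, lab in zip(values, labels) if val <= threshold]
    right = [lab for val, lab in zip(values, labels) if val > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Hypothetical average line widths: target genre (1) has short lines, others (0) long
avg_line_width = [20, 22, 25, 40, 42, 45]
labels = [1, 1, 1, 0, 0, 0]

print(f"Impurity before splitting: {gini(labels):.3f}")
for threshold in (23, 30):
    score = split_impurity(avg_line_width, labels, threshold)
    print(f"Split at <= {threshold}: weighted impurity {score:.3f}")
```

The threshold of 30 separates the two classes completely, so its weighted impurity is 0 and it would be preferred over the threshold of 23. As in the plotted tree, rows that satisfy the condition go to the left branch.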

In [None]:
def dt_binary(X_train, X_test, y_train, y_test, target_genre, features):
    """
    Train a decision tree to classify the selected genre
    """ 
    pipe = Pipeline([("clf", 
                      DecisionTreeClassifier(random_state=343, 
                                             max_depth = 3 # Limiting the depth of the tree can help to prevent overfitting
                                            )
                     )]) 
    pipe.fit(X_train, y_train)

    y_pred_train = pipe.predict(X_train)
    y_pred_test = pipe.predict(X_test)
    
    y_prob_train = pipe.predict_proba(X_train) 
    y_prob_test = pipe.predict_proba(X_test) 
        
    accuracy_result = accuracy_score(y_test, y_pred_test)
    precision_result = precision_score(y_test, y_pred_test)
    recall_result = recall_score(y_test, y_pred_test)
    f1_result = f1_score(y_test, y_pred_test)
    auroc_result = roc_auc_score(y_test, y_prob_test[:, 1])  # Use probabilities for AUROC

    print("-----------------------------------------------")
    print(f"Binary Classification - Decision Tree")
    print(f"{target_genre}")
    print("-----------------------------------------------")
    print()
    print(f"Accuracy = {accuracy_result:.3f}")
    print(f"Precision = {precision_result:.3f}")
    print(f"Recall = {recall_result:.3f}")
    print(f"F1 Score = {f1_result:.3f}")
    print(f"AUROC Score = {auroc_result:.3f}")
    
    RocCurveDisplay.from_predictions(y_test, y_prob_test[:, 1])  # Use probabilities for ROC curve
    plt.title("ROC curve: Decision Tree")
    plt.show()

    plt.figure(figsize=(20,10))
    print(f"\n\nInterpreting the decision tree: if the condition in the box (node) is TRUE, take the LEFT branch. If FALSE, take the RIGHT.\n")
    plot_tree(pipe["clf"], 
              feature_names=list(X_train.columns),  # Use the columns the tree was actually trained on
              class_names=["Other", target_genre],
              filled=True,
              impurity=False,
              rounded=True,
              fontsize=14
              )
    plt.show()
    
    return y_pred_train, y_pred_test, y_prob_train, y_prob_test

In [None]:
def genres_binary_dt(df, target_genre, X_train, X_test, y_train, y_test, indices_train, indices_test, features):
    """
    Train and test the model, and return the dataframe
    with appended predictions.
    """
    
    y_pred_train, y_pred_test, y_prob_train, y_prob_test = dt_binary(X_train, 
                                                                     X_test, 
                                                                     y_train, 
                                                                     y_test, 
                                                                     target_genre,
                                                                     features)

    # Add the predictions to a copy of the original dataframe
    df_new = df.copy()
    df_new.loc[indices_train,"pred_train"] = y_pred_train
    df_new.loc[indices_test,"pred_test"] = y_pred_test
    df_new.loc[indices_train,"prob_0_train"] = y_prob_train[:,0]
    df_new.loc[indices_test,"prob_0_test"] = y_prob_test[:,0]
    df_new.loc[indices_train,"prob_1_train"] = y_prob_train[:,1]
    df_new.loc[indices_test,"prob_1_test"] = y_prob_test[:,1]    

    # Sort the dataframe by probability of being the given genre
    df_new = df_new.sort_values(by="prob_1_test", ascending=False)  
    
    return df_new

In [None]:
dt_preds_df = genres_binary_dt(df, target_genre, X_train, X_test, y_train, y_test, indices_train, indices_test, features)

## Inspect incorrectly classified texts

We can explore which texts were incorrectly classified by the two models. Run the cell below to display dataframes of the misclassified texts.
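
Misclassifications come in two kinds: false positives (another genre predicted as the target genre) and false negatives (the target genre predicted as something else). Separating them can make it easier to see why the model went wrong. The sketch below shows the distinction on made-up labels. The helper name `error_types` is hypothetical.

```python
def error_types(y_true, y_pred):
    """Return the positions of false positives and false negatives."""
    pairs = list(enumerate(zip(y_true, y_pred)))
    false_pos = [i for i, (t, p) in pairs if t == 0 and p == 1]
    false_neg = [i for i, (t, p) in pairs if t == 1 and p == 0]
    return false_pos, false_neg

# Hypothetical true labels and predictions for five test articles
y_true = [1, 0, 1, 0, 0]
y_pred = [1, 1, 0, 0, 0]

false_pos, false_neg = error_types(y_true, y_pred)
print("False positives at positions:", false_pos)
print("False negatives at positions:", false_neg)
```

In the dataframes of misclassified texts, a row with 'binary_class' of 0 and 'pred_test' of 1 is a false positive, and a row with 'binary_class' of 1 and 'pred_test' of 0 is a false negative.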

In [None]:
pd.set_option("display.max_columns", None)

lr_misclass = lr_preds_df.loc[(lr_preds_df["binary_class"] != lr_preds_df["pred_test"]) & (lr_preds_df["pred_test"] >= 0)]
lr_misclass = lr_misclass.filter(["date", 
                                  "newspaper_id", 
                                  "newspaper", 
                                  "article_id", 
                                  "title", 
                                  "text", 
                                  "clean_text",
                                  "genre", 
                                  "binary_class", 
                                  "pred_test", 
                                  "article_web"], 
                                  axis=1
                                ).reset_index(drop=True)

print(f"\nMisclassified texts for Logistic Regression model (lr)")
print(f"{target_genre}")
print("========================================================\n")
display(lr_misclass)

dt_misclass = dt_preds_df.loc[(dt_preds_df["binary_class"] != dt_preds_df["pred_test"]) & (dt_preds_df["pred_test"] >= 0)]
dt_misclass = dt_misclass.filter(["date", 
                                  "newspaper_id", 
                                  "newspaper", 
                                  "article_id", 
                                  "title", 
                                  "text", 
                                  "clean_text",
                                  "genre", 
                                  "binary_class", 
                                  "pred_test", 
                                  "article_web"], 
                                  axis=1
                                ).reset_index(drop=True)

print(f"\nMisclassified texts for Decision Tree model (dt)")
print(f"{target_genre}")
print("========================================================\n")
display(dt_misclass)

### Display the full text, the feature values, and the newspaper web address of a selected misclassification by model and index

In [None]:
# Select the model and index number of the misclassified text

# Enter 'lr' or 'dt'
model_code = "lr"
selected_index = 0

##################################################################

if model_code == "lr":
    article_title = f"{lr_misclass['title'].values[selected_index]}"
    article_text = f"{lr_misclass['text'].values[selected_index]}"
    print("\nTitle:")
    print("--------------")
    print(f"{article_title}")
    print("\nOriginal OCR text:")
    print("--------------")
    print(article_text)
    print("\nCleaned text:")
    print("--------------")
    print(lr_misclass["clean_text"].values[selected_index])
    print("\nView the scanned article on the Papers Past website")
    print(lr_misclass['article_web'].values[selected_index]) 

elif model_code == "dt":
    article_title = f"{dt_misclass['title'].values[selected_index]}"
    article_text = f"{dt_misclass['text'].values[selected_index]}"
    print("\nTitle:")
    print("--------------")
    print(f"{article_title}")
    print("\nOriginal OCR text:")
    print("--------------")
    print(article_text)
    print("\nCleaned text:")
    print("--------------")
    print(dt_misclass["clean_text"].values[selected_index])
    print("\nView the scanned article on the Papers Past website")
    print(f"{dt_misclass['article_web'].values[selected_index]}\n") 
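else:
    # No valid model selected, so there is no article to look up
    article_title = article_text = None
    print("\nPlease enter either 'lr' or 'dt' for the model code")

# Display the feature values for the selected article (only if a valid model code was given)
if article_title is not None:
    mask = (df['title'] == article_title) & (df['text'] == article_text)
    display(df.loc[mask])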

<div style="border:1px solid black;margin-top:1em;padding:0.5em;">
    <strong>Task 4:</strong> Examine some of the misclassified texts. Why do you think they were misclassified? Examine the coefficients of the logistic regression model or the decision tree nodes, and the feature values. You can compare the feature values for the important logistic regression features in the selected article to the overall dataset distribution for that feature shown in the summary statistics (see cell below). Discuss with your classmates or tutors.
</div>

In [None]:
# Inspect summary statistics for the whole dataset to compare to the misclassified text
df.describe()