ML Final Project


Import the required libraries


In [None]:
import pandas as pd
import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from collections import Counter
import matplotlib.pyplot as plt
import seaborn as sns
import string
import re
import matplotlib.pyplot as plt


In [None]:
!pip install contractions

Load the dataset

In [None]:
df = pd.read_csv('/content/drive/MyDrive/Final ML/22204097.csv')

Here, we define our function with which we shall preprocess and tokenize our text


In [None]:
def preprocess_and_tokenize(data):
    stop_words = set(stopwords.words('english'))
    word_tokens = word_tokenize(data)
    filtered_text = [word for word in word_tokens if word.casefold() not in stop_words]
    return filtered_text

Analyze the most commonly occuring words for each category

In [82]:

def analyze():
  for category in df['category'].unique():
      words = []
      for row in df[df['category'] == category][['headline', 'short_description']].values:
          words.extend(preprocess_and_tokenize(" ".join(row)))
      most_common_terms = Counter(words).most_common(10)
      print(f"For {category}, the most common terms are: {most_common_terms}")
#analyze()


From our output, it seems that there are missing values which requires us to clean our data.


In [None]:
df['headline'] = df['headline'].fillna('')
df['short_description'] = df['short_description'].fillna('')


Remove any rows with missing text


In [None]:
df = df.dropna(subset=['headline', 'short_description'])


Attempt to Analyze the most commonly occuring words for each category again

In [None]:
analyze()

Our output doesn't seem to actually align with obtaining the most common words. This seems to be a side effect of tokenization.
Let's revise the preprocessing function to exclude punctuation by using 'string.punctuation'.

In [None]:


def preprocess_and_tokenize(data):
    stop_words = set(stopwords.words('english'))
    stop_words.update(set(string.punctuation))
    word_tokens = word_tokenize(data)
    filtered_text = [word for word in word_tokens if word.casefold() not in stop_words]
    return filtered_text


Rerun the analysis


In [None]:
analyze()

While relatively better, our output is still including unwanted strings such as "'s", "n't" etc
We shall use the 're' module to remove non-alphanumeric characters from our strings. We shall then use 'contractions' library. This is so we can split contractions into separate words.

In [None]:
def preprocess_and_tokenize(data):
    stop_words = set(stopwords.words('english'))
    stop_words.update(set(string.punctuation))
    expanded_words = []
    for word in data.split():
        expanded_words.append(contractions.fix(word))
    data = ' '.join(expanded_words)
    data = re.sub(r'\W', ' ', data)
    data = re.sub(r'\s+', ' ', data)
    word_tokens = word_tokenize(data)
    filtered_text = [word for word in word_tokens if word.casefold() not in stop_words]
    return filtered_text

An example of what is going on in this function: "don't" is converted to "do not".
Run analyze() again

In [None]:
analyze()

This output is much more representative of the answer we are looking for: the most occuring words in each category.

Let's make a visual representation of our findings


In [None]:
def plot_common_words(category):
    words = []
    for row in df[df['category'] == category][['headline', 'short_description']].values:
        words.extend(preprocess_and_tokenize(" ".join(row)))
    most_common_terms = dict(Counter(words).most_common(10))

    plt.figure(figsize=(10, 5))
    plt.bar(most_common_terms.keys(), most_common_terms.values())
    plt.title(f'Most Common Words in {category}')
    plt.xlabel('Words')
    plt.ylabel('Frequency')
    plt.show()

for category in df['category'].unique():
    if pd.isna(category):
        continue
    plot_common_words(category)


Let's analyze some other features and their relationships with labels.
For starters, we could analyze the length of the headlines and the descriptions for each category.

In [None]:
df['headline_length'] = df['headline'].apply(lambda x: len(word_tokenize(x)))
df['description_length'] = df['short_description'].apply(lambda x: len(word_tokenize(x)))

plt.figure(figsize=(12,6))
sns.boxplot(data=df, x='category', y='headline_length')
plt.title('Headline length by category')
plt.show()

plt.figure(figsize=(12,6))
sns.boxplot(data=df, x='category', y='description_length')
plt.title('Description length by category')
plt.show()


Observation:


Let's perform a date-time analysis to see if we can discover any interesting insights.

Trend Analysis to see if there are any noticeable trends.

In [None]:

df['date'] = pd.to_datetime(df['date'])
df.groupby([df.date.dt.year, 'category']).size().unstack().plot(kind='line', subplots=True)


Seasonality Analysis to see if we can look for patterns that recur at regular intervals.

In [None]:
df.groupby([df.date.dt.month, 'category']).size().unstack().plot(kind='line', subplots=True)


**TASK 2**


Splitting the dataset into training, development and test sets.
We shall be looking to split our data into 60% for training, 20% for validation and 20% for testing.

In [None]:
from sklearn.model_selection import train_test_split

df = df.dropna()

# Split into training and temp
df_train, df_temp = train_test_split(df, test_size=0.4, random_state=42, stratify=df['category'])

# Split temp into validation and test
df_valid, df_test = train_test_split(df_temp, test_size=0.5, random_state=42, stratify=df_temp['category'])

# Save the datasets into csv files
df_train.to_csv('train.csv', index=False)
df_valid.to_csv('valid.csv', index=False)
df_test.to_csv('test.csv', index=False)


In [None]:
testes = pd.read_csv('test.csv')
testes.head()

We might need to remove NaN values.

In [None]:
df = df.dropna(subset=['category'])


In [None]:
# Split into training and tempe
df_train, df_temp = train_test_split(df, test_size=0.4, random_state=42, stratify=df['category'],shuffle=True)

# Split temp into valid and test
df_valid, df_test = train_test_split(df_temp, test_size=0.5, random_state=42, stratify=df_temp['category'],shuffle=True)

# Save the datasets into csv files
df_train.to_csv('train.csv', index=False)
df_valid.to_csv('valid.csv', index=False)
df_test.to_csv('test.csv', index=False)


In [None]:
df_train['headline'] = df_train['headline'].fillna('')
df_train['short_description'] = df_train['short_description'].fillna('')

df_valid['headline'] = df_valid['headline'].fillna('')
df_valid['short_description'] = df_valid['short_description'].fillna('')


We can now load data and apply preprocessing steps

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Load the training and validation datasets
df_train = pd.read_csv('train.csv')
df_valid = pd.read_csv('valid.csv')
df_train['headline'] = df_train['headline'].fillna('')
df_train['short_description'] = df_train['short_description'].fillna('')

df_valid['headline'] = df_valid['headline'].fillna('')
df_valid['short_description'] = df_valid['short_description'].fillna('')

'''
# Initialize a TF-IDF Vectorizer
vectorizer = TfidfVectorizer(lowercase=True, stop_words='english')

# Fit and transform the training data
X_train = vectorizer.fit_transform(df_train['headline'] + ' ' + df_train['short_description'])

# Transform the validation data
X_valid = vectorizer.transform(df_valid['headline'] + ' ' + df_valid['short_description'])
'''
# Create a TF-IDF Vectorizer instance
vectorizer = TfidfVectorizer(ngram_range=(1, 2),
                             stop_words='english',
                             max_df=0.5,
                             min_df=2,
                             max_features=5000)

# Fit and transform the training data
X_train = vectorizer.fit_transform(df_train['headline'] + ' ' + df_train['short_description'])

# Transform the validation data
X_valid = vectorizer.transform(df_valid['headline'] + ' ' + df_valid['short_description'])

# Get the labels
y_train = df_train['category']
y_valid = df_valid['category']


In [None]:
y_test = df_test['category']

Building binary classification models

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Initialize the classifiers
clf1 = LogisticRegression(random_state=42)
clf2 = LinearSVC(random_state=42)

# Fit the classifiers
clf1.fit(X_train, y_train)
clf2.fit(X_train, y_train)

# Predict the labels of the validation set
y_pred1 = clf1.predict(X_valid)
y_pred2 = clf2.predict(X_valid)


Building a classifier using deep learning

In [None]:
pip install tensorflow


In [None]:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'


In [None]:
import tensorflow as tf
print(tf.__version__)


In [None]:
!pip install --upgrade tensorflow


In [None]:
from sklearn.preprocessing import LabelEncoder

# create an encoder
le = LabelEncoder()

# fit and transform y_train with the encoder
y_train = le.fit_transform(y_train)

# transform y_valid with the encoder
y_valid = le.transform(y_valid)


In [None]:
y_test = le.transform(y_test)

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM

y_train = y_train.astype(float)
y_valid = y_valid.astype(float)


# Tokenize the texts
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(df_train['headline'] + ' ' + df_train['short_description'])

X_train = tokenizer.texts_to_sequences(df_train['headline'] + ' ' + df_train['short_description'])
X_valid = tokenizer.texts_to_sequences(df_valid['headline'] + ' ' + df_valid['short_description'])

# Pad the sequences
X_train = pad_sequences(X_train, maxlen=100)
X_valid = pad_sequences(X_valid, maxlen=100)

# Build the model
model = Sequential()
model.add(Embedding(5000, 64))
model.add(LSTM(64, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))

# Compile the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train the model
model.fit(X_train, y_train, batch_size=32, epochs=10, validation_data=(X_valid, y_valid))

# Save the model
model.save('text_classifier.h5')


In [None]:
X_test = tokenizer.texts_to_sequences(df_test['headline'] + ' ' + df_test['short_description'])
X_test = pad_sequences(X_test, maxlen=100)

The first thing we need to make sure is to have a clear understanding of the problem we have. It's a binary classification task. There's multiple factors that go into choosing our primary metric. Which mainly depends on the specifics of our problem and the distribution of our classes. Which is why we need to ask the question. Are the classes balanced?

In [None]:
class_counts = df['category'].value_counts()
print(class_counts)


Our classes are not balanced! Considering the classes are imbalanced, we want to balance the importance of precision and recall. Thus we shall choose F1-score.

Considering we have an imbalanced dataset, and the task we have is text classification, we can consider the F1 score in the range of 0.8 to 0.9 to be an appropriate benchmark.

Let's train a Logistic Regression model and an SVM model using Scikit-Learn.

Here we shall convert our text data into numericals so that our models can be worked on.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Combine into single text feature
df_train['text'] = df_train['headline'] + ' ' + df_train['short_description']
df_valid['text'] = df_valid['headline'] + ' ' + df_valid['short_description']
df_test['text'] = df_test['headline'] + ' ' + df_test['short_description']

# Initialize a TfidfVectorizer
vectorizer = TfidfVectorizer(max_features=5000)

# Fit the vectorizer and transform the text data
X_train_tfidf = vectorizer.fit_transform(df_train['text'])
X_valid_tfidf = vectorizer.transform(df_valid['text'])

#Test
X_test_tfidf = vectorizer.fit_transform(df_test['text'])

Now train the logistic regression model


In [None]:
from sklearn.linear_model import LogisticRegression

# Initialize the Logistic Regression model
log_reg = LogisticRegression()

# Fit the model
log_reg.fit(X_train_tfidf, y_train)


Now train the SVM model


In [None]:
from sklearn import svm

# Initialize the SVM model
svm_model = svm.SVC()

# Fit the model
svm_model.fit(X_train_tfidf, y_train)


Let's save our models using joblib

In [None]:
from joblib import dump

# Save the models
dump(log_reg, 'log_reg.joblib')
dump(svm_model, 'svm_model.joblib')


We can load our new models like this.

In [None]:
from joblib import load


model1 = load('log_reg.joblib')
model2 = load('svm_model.joblib')


Load our deep learning model


In [None]:
from tensorflow.keras.models import load_model

dl_model = load_model('text_classifier.h5')


Let's start making our Predictions


In [None]:
# Make predictions on the TF-IDF data
train_preds_model1 = model1.predict(X_train_tfidf)
valid_preds_model1 = model1.predict(X_valid_tfidf)

train_preds_model2 = model2.predict(X_train_tfidf)
valid_preds_model2 = model2.predict(X_valid_tfidf)

# For the deep learning model, use the tokenized sequences
train_preds_dl_model = (dl_model.predict(X_train) > 0.5).astype("int32")
valid_preds_dl_model = (dl_model.predict(X_valid) > 0.5).astype("int32")



In [None]:
# Make predictions on the TF-IDF data
test_preds_model1 = model1.predict(X_test_tfidf)


test_preds_model2 = model2.predict(X_test_tfidf)


# For the deep learning model, use the tokenized sequences
test_preds_dl_model = (dl_model.predict(X_test) > 0.5).astype("int32")


Calculate F1-Score


In [None]:
from sklearn.metrics import f1_score

# Calculate F1-score for the train data
f1_train_model1 = f1_score(y_train, train_preds_model1)
f1_train_model2 = f1_score(y_train, train_preds_model2)
f1_train_dl_model = f1_score(y_train, train_preds_dl_model)

# Calculate F1-score for the test data
f1_test_model1 = f1_score(y_test, test_preds_model1)
f1_test_model2 = f1_score(y_test, test_preds_model2)
f1_test_dl_model = f1_score(y_test, test_preds_dl_model)

# Calculate F1-score for the validation data
f1_valid_model1 = f1_score(y_valid, valid_preds_model1)
f1_valid_model2 = f1_score(y_valid, valid_preds_model2)
f1_valid_dl_model = f1_score(y_valid, valid_preds_dl_model)

print(f'Model 1 - F1 Score: Train {f1_train_model1}, Validation {f1_valid_model1}, Test {f1_test_model1}')
print(f'Model 2 - F1 Score: Train {f1_train_model2}, Validation {f1_valid_model2}, Test {f1_test_model2}')
print(f'Deep Learning Model - F1 Score: Train {f1_train_dl_model}, Validation {f1_valid_dl_model}, Test {f1_test_dl_model}')


Each model shows a different performance.

**Logistic Regression Model**

High F1 Score on training set:      **0.9400352733686067**

Reduced F1 Score on Validation Set: **0.8789986091794159**

**SVM Model**

High F1 Score on Training Set: **0.99370012599748**

Reduced F1 Score on Validation Set: **0.9093369418132612**

**Deep Learning Model**

High F1 Score on Training Set: **1.0**

Relatively high F1 Score on Validation Set: **0.9145728643216079**

Observation

Considering how the drop from training to validation is pretty high in SVM and Deep Learning model, it seems to indicate some level of overfitting. Logistic Regression Model is performing relatively well.

**Error Analysis**

We need to understand where our models are making mistakes.

First, we identify incorrect predictions.

Since we are doing a binary classification task, our predictions should be binary as well.

In [None]:
# Formatting
train_preds_model1_bin = [1 if pred > 0.5 else 0 for pred in train_preds_model1]
train_preds_model2_bin = [1 if pred > 0.5 else 0 for pred in train_preds_model2]
train_preds_dl_model_bin = [1 if pred > 0.5 else 0 for pred in train_preds_dl_model]


Now let's find out the incorrect predictions for each model. np.where should help us

In [None]:
# Incorrect predictions

import numpy as np

incorrect_preds_model1_indices = np.where(y_valid != valid_preds_model1)
incorrect_preds_model2_indices = np.where(y_valid != valid_preds_model2)
incorrect_preds_dl_model_indices = np.where(y_valid != valid_preds_dl_model)


In [None]:
print(y_valid.shape)
print(valid_preds_model1.shape)
print(valid_preds_model2.shape)
print(valid_preds_dl_model.shape)


Analyze common errors. Perhaps there might be a systematic issue with our data or preprocessing steps?

In [None]:
# Common incorrect predictions
common_errors = np.intersect1d(incorrect_preds_model1_indices, np.intersect1d(incorrect_preds_model2_indices, incorrect_preds_dl_model_indices))

# Print common errors
print('Number of common errors:', len(common_errors))


There are 65 common errors. Let's try to take a closer look and understand the error examples


In [None]:
# Print some error examples
for i in list(common_errors)[:5]:
    print(f"Text: {df_valid.loc[i, 'headline']} {df_valid.loc[i, 'short_description']}")
    print(f"Actual label: {df_valid.loc[i, 'category']}")
    print(f"Predicted label (Model 1): {train_preds_model1[i]}")
    print(f"Predicted label (Model 2): {train_preds_model2[i]}")
    print(f"Predicted label (Deep Learning Model): {train_preds_dl_model[i]}\n")


Observation

From our error analysis, we seem to find that there have been significant misclassifications. All of them belong to the same 'WORLDPOST' category. Perhaps the errors occur because of geopolitical terms due to how complex the topic is. Or perhaps it is occuring due to cultural contexts. For example, the mention of names like Charlie or Fidel. Our models seem to be struggling with certain types of articles. Therefore, I assume that we would require more training data and that it is not an error in preprocessing.

In [None]:
from sklearn.utils import class_weight

# Calculate the weights for each class
weights = class_weight.compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)

# Create a dictionary mapping each class to its weight
class_weights = dict(enumerate(weights))

Random Forest

Let's explore a new model


In [None]:
from sklearn.ensemble import RandomForestClassifier
import joblib

rf_model = RandomForestClassifier(random_state=42, n_estimators = 1000, max_depth= 40, max_features ='sqrt')
rf_model.fit(X_train_tfidf, y_train)

valid_preds_rf_model = rf_model.predict(X_valid_tfidf)

rf_model_f1 = f1_score(y_valid, valid_preds_rf_model, average='weighted')
print('Random Forest - F1 Score: Validation', rf_model_f1)

# Save to file
joblib_file = "rf_model.pkl"
joblib.dump(rf_model, joblib_file)


Load RF

In [None]:
from joblib import load

model3 = load('rf_model.pkl')

In [None]:
test_preds_model3 = model3.predict(X_test_tfidf)
f1_test_model3 = f1_score(y_test, test_preds_model3)
f1_test_model3

To improve our model further, we shall be performing GridSearch.

In [None]:
from sklearn.model_selection import GridSearchCV


Define a parameter grid, a dictionary of parameters that helps us in optimization

In [None]:
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],

}


Initialize the GridSearchCV object

In [None]:
rf = RandomForestClassifier(random_state=42)
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, scoring='f1_weighted', cv=5, n_jobs=-1)


Fit GridSearchCV

In [None]:
grid_search.fit(X_train_tfidf, y_train)


Let's evaluate so we can see the ebst parameters and the best score

In [None]:
print("Best Parameters: ", grid_search.best_params_)
print("Best Score: ", grid_search.best_score_)


Extract the best estimator

In [None]:
best_rf_model = grid_search.best_estimator_
valid_preds_best_rf_model = best_rf_model.predict(X_valid_tfidf)
test_preds_best_rf_model = best_rf_model.predict(X_test_tfidf)

joblib_file = "best_random_forest.pkl"
joblib.dump(best_rf_model, joblib_file)

Evaluation on F1 score

In [None]:
f1_valid_model4 = f1_score(y_valid, valid_preds_best_rf_model, average='weighted')
f1_valid_model4

Evaluation on F1 Score

In [None]:
f1_test_model4 = f1_score(y_test, test_preds_best_rf_model)
f1_test_model4

We shall be running grid search over SVM as well


In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn import svm


In [None]:
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [1, 0.1, 0.01, 0.001],
    'kernel': ['rbf', 'linear', 'poly']
}


Initializing GridSearchCV

In [None]:
svc = svm.SVC()
grid_search = GridSearchCV(estimator=svc, param_grid=param_grid, scoring='f1_weighted', cv=5, n_jobs=-1)


Fit GridSearchCV

In [None]:
grid_search.fit(X_train_tfidf, y_train)


Evaluate GridSearchCV

In [None]:
print("Best Parameters: ", grid_search.best_params_)
print("Best Score: ", grid_search.best_score_)


Make our predictions


In [None]:
best_svc_model = grid_search.best_estimator_
valid_preds_best_svc_model = best_svc_model.predict(X_valid_tfidf)

from joblib import dump

# save the model
dump(best_svc_model, 'best_svc_model.joblib')


Evaluate the performance


In [None]:
svc_best_model_f1 = f1_score(y_valid, valid_preds_best_svc_model, average='weighted')
print('Optimized SVC - F1 Score: Validation', svc_best_model_f1)


From this, we can observe tat we have an improvement from 0.90 to approx 0.96 in F1 Score. This suggests that our SVM model is now performing much better in terms of both precision and recall.

**Deep learning Model**

For this, we shall experiment by adding more layers and changing the type of the layers.

In [None]:
from sklearn.utils import class_weight

# Calculate the weights for each class
weights = class_weight.compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)

# Create a dictionary mapping each class to its weight
class_weights = dict(enumerate(weights))

In [None]:
from sklearn.preprocessing import LabelEncoder

# create an encoder
le = LabelEncoder()

# fit and transform y_train with the encoder
y_train = le.fit_transform(y_train)

# transform y_valid with the encoder
y_valid = le.transform(y_valid)


In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM
import numpy as np

y_train = y_train.astype(float)
y_valid = y_valid.astype(float)


# Tokenize the texts
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(df_train['headline'] + ' ' + df_train['short_description'])

X_train = tokenizer.texts_to_sequences(df_train['headline'] + ' ' + df_train['short_description'])
X_valid = tokenizer.texts_to_sequences(df_valid['headline'] + ' ' + df_valid['short_description'])

# Pad the sequences
X_train = pad_sequences(X_train, maxlen=100)
X_valid = pad_sequences(X_valid, maxlen=100)

X = np.concatenate((X_train, X_valid))
y = np.concatenate((y_train, y_valid))

num_words = 5000
embedding_dim = 100
max_length = 100

# Build the model
model = Sequential()
model.add(Embedding(input_dim=num_words, output_dim=embedding_dim, input_length=max_length))
model.add(LSTM(64, return_sequences=True))
model.add(LSTM(32))
model.add(Dense(1, activation='sigmoid'))

# Compile the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train the model
model.fit(X_train, y_train, batch_size=32, epochs=5, validation_data=(X_valid, y_valid), class_weight=class_weights)

# Save the model
model.save('text_classifier_DL2.h5')


In [None]:
train_preds_dl_model = (model.predict(X_train) > 0.5).astype("int32")
valid_preds_dl_model = (model.predict(X_valid) > 0.5).astype("int32")

In [None]:
f1_train_dl_model = f1_score(y_train, train_preds_dl_model)
f1_valid_dl_model = f1_score(y_valid, valid_preds_dl_model)

In [None]:
print(f'Deep Learning Model - F1 Score: Train {f1_train_dl_model}, Validation {f1_valid_dl_model}')

We can observe that we have improved from 0.9145728643216079 to 0.9182879377431907 on our validation data.

**Cross Validation**

Applying cross-validation for our Random forest and SVM model

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm

# Perform cross validation
rf_scores = cross_val_score(best_rf_model, X, y, cv=5, scoring='f1', n_jobs=-1)

print(f'Random Forest cross-validation F1 score: {rf_scores.mean()}')



In [None]:
cv_preds_dl_model = (model.predict(X) > 0.5).astype("int32")

In [None]:
f1_cv_dl_model = f1_score(y, cv_preds_dl_model)

In [None]:
print(f'Deep Learning Model - F1 Score: Cross Validation {f1_cv_dl_model}')

Our Cross Validation has a score of 0.98.
Thus making this our best model.

We shall now use this on test data

In [None]:
test_preds_dl_model = (model.predict(X_test) > 0.5).astype("int32")

In [None]:
f1_test_dl_model = f1_score(y_test, test_preds_dl_model)
print(f'Deep Learning Model - F1 Score: Test Data {f1_test_dl_model}')

Our test results are slightly reduced compared to our cross validation data.

Let's train this data on both training and validation data now


In [None]:
model.fit(X, y, batch_size=32, epochs=5, validation_data=(X_valid, y_valid), class_weight=class_weights)

# Save the model
model.save('text_classifier_DL3.h5')

Now apply it to test set

In [None]:
test_preds_dl_model = (model.predict(X_test) > 0.5).astype("int32")
f1_test_dl_model = f1_score(y_test, test_preds_dl_model)
print(f'Deep Learning Model - F1 Score: Test Data {f1_test_dl_model}')

Training the model with more data has given us a slight increase in our score from 0.917 to 0.918