# Capstone Project: Topic Modelling of Academic Journals (Model-Based Systems Engineering)

# 06: Topic Modeling Evaluation

In this notebook, we will perform the following actions:
1. Evaluation of the topic modeling output

## Import Libraries

In [1]:
# Import the necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pickle
import seaborn as sns

from wordcloud import WordCloud

# Set all columns and rows to be displayed
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)



## Import Data

In [2]:
# Import the data for modelling
journals = pd.read_csv('../data/journals_topics.csv')

In [3]:
# Take a quick look at the data
journals.head()

Unnamed: 0,title,abstract,year,tokens,topic,0,1,2,3,4,5,6,7
0,Model-based Design Process for the Early Phase...,This paper presents an approach for a model-ba...,2017,paper present approach planning process early ...,3,0.134677,0.04933,0.121864,0.157459,0.01872,0.16562,0.034708,0.056888
1,Model Based Systems Engineering using VHDL-AMS,The purpose of this paper is to contribute to ...,2013,purpose paper contribute definition ( ) approa...,6,0.166052,0.053813,0.101767,0.040338,0.02592,0.180074,0.204835,0.046355
2,Code Generation Approach Supporting Complex Sy...,Code generation is an effective way to drive t...,2022,code generation effective way drive complex de...,5,0.048002,0.023436,0.040045,0.012378,0.006812,0.11011,0.013624,0.015338
3,Model based systems engineering as enabler for...,"Product complexity is steadily increasing, cus...",2021,"product complexity steadily increasing , custo...",3,0.14437,0.04502,0.161951,0.335414,0.020165,0.169411,0.040359,0.061398
4,Electric Drive Vehicle Development and Evaluat...,To reduce development time and introduce techn...,2014,reduce development time introduce technology f...,0,0.48726,0.042662,0.097253,0.059448,0.029092,0.165888,0.037608,0.06911


## Labels Identified from the Topic Model

Here, we will take a look at the key words extracted for each topic, as well as the top few articles of each topic to properly identify and assess intent behind each topic. We will then apply this information as well as our domain knowledge to generate a proper label for each topic.

|Topic| Top 10 Key Words for Topic| Topic Label Using Domain Knowledge|
|-----|---------------------------|-----------------------------------|
|Topic 0| cubesat, vehicle, spacecraft, satelite, requirement, nasa, modeling, submarine, payload, electric vehicle| Application of MBSE in Projects|
|Topic 1| sysml, modeling, simulation, modeling language, uml, language sysml, diagram, modeling language sysml, software, specification||
|Topic 2| ontology, research, reuse, paper, industry, knowledge, semantic, tool, modeling, database||
|Topic 3| development, product development, production, process, manufacturing, industrial, iot, product line, toolchain, development process||
|Topic 4| reliability, safety analysis, fmea, fault tree, design safety, safety artifact, medical device, reliability analysis, failure mode, safety critical||
|Topic 5| mechatronic, inspection, inspection equipment, production scheduling, modeling, constraint, business rule, validation, property verification, mecahtronic product||
|Topic 6| requirement, design, engineer, specification, hcd, wfrequirements, text-based requirement, cm process, property-based requirement, methodology||
|Topic 7| digital twin, cyber, resilience, mbsecps, simplexity test-bed, security threat, vulnerability, twin technology, risk assessment, cpg||


In [15]:
# Take a look at the top 5 articles for Topic 0
journals[['title', 'abstract', '0']].sort_values(by = '0', ascending=False).head()

Unnamed: 0,title,abstract,0
719,Forward Design of Landing Gear Retraction Syst...,"In this paper, according to the forward design...",1.0
422,The Economic Benefits of Human Performance Mod...,Abstract Human performance modeling (HPM) is a...,1.0
582,MBSE approach to support and formalize mission...,This paper deals with the application of a Mod...,1.0
52,Model-Based Systems Engineering approach for t...,The Origins Spectral Interpretation Resource I...,1.0
820,The Digital (Mission) Twin: an Integrating Con...,Top-down decomposition of a complex system of ...,1.0


In [17]:
# Take a look at the top 5 articles for Topic 1
journals[['title', 'abstract', '1']].sort_values(by = '1', ascending=False).head()

Unnamed: 0,title,abstract,1
788,Simulating SysML transportation models,Model-based Systems Engineering (MBSE) promise...,1.0
186,Simulation of Database Interactions for Early ...,Digitized enterprise processes often encompass...,1.0
375,Frenemies: OPM and SysML Together in an MBSE M...,Abstract A Frenemy is “a person with whom one ...,1.0
554,Creating SysML views from an OPM model,Conceptual modeling is key to model-based syst...,1.0
561,Verification of embedded system's specificatio...,The authors propose an extension of SysML whic...,1.0


In [18]:
# Take a look at the top 5 articles for Topic 2
journals[['title', 'abstract', '2']].sort_values(by = '2', ascending=False).head()

Unnamed: 0,title,abstract,2
133,The Ontology of Systems Engineering: Towards a...,The goal of implementing an enterprise digital...,1.0
301,Towards Developing Metrics to Evaluate Digital...,Abstract Model-based systems engineering (MBSE...,1.0
32,Maturity assessment of Systems Engineering reu...,To enable the transition towards Model-Based S...,1.0
591,Can ontologies prevent MBSE models from becomi...,Model-based systems engineering provides a sof...,1.0
233,Evolving Model-Based Systems Engineering Ontol...,Abstract Model-Based Systems Engineering (MBSE...,1.0


In [19]:
# Take a look at the top 5 articles for Topic 3
journals[['title', 'abstract', '3']].sort_values(by = '3', ascending=False).head()

Unnamed: 0,title,abstract,3
182,Automated Derivation of Optimal Production Seq...,"Customer specific, individual products nowaday...",1.0
678,Achieving flexibility in business process mode...,Companies nowadays are under increasing pressu...,1.0
610,Product-Process MBSE and Dysfunctional Analysi...,The integration of new technologies and the in...,1.0
618,Model-based Systems Engineering Process for Su...,Combining the benefits of model-based systems ...,1.0
13,Model Based Systems Engineering in Modular Des...,Many industries have to react progressively to...,1.0


In [20]:
# Take a look at the top 5 articles for Topic 4
journals[['title', 'abstract', '4']].sort_values(by = '4', ascending=False).head()

Unnamed: 0,title,abstract,4
593,An Integrated System Design and Safety Framewo...,Safety analysis is often performed independent...,1.0
130,AADL-Based safety analysis using formal method...,Model-based engineering tools are increasingly...,1.0
572,Obtaining Fault Trees Through SysML Diagrams: ...,Reliability analysis provides fundamental resu...,1.0
634,Dynamic Fault Tree Generation for Safety-Criti...,Systems are getting increasingly complex and c...,1.0
757,KAUSAL: A New Methodological Approach for Mode...,The increasing complexity of modern products i...,1.0


In [21]:
# Take a look at the top 5 articles for Topic 5
journals[['title', 'abstract', '5']].sort_values(by = '5', ascending=False).head()

Unnamed: 0,title,abstract,5
667,A Practitioner’s Guide to Optimizing the Inter...,As model-based approaches to Systems Engineeri...,1.0
587,Requirement analysis of inspection equipment f...,Quality control is an essential part in the pr...,1.0
191,Modelling and Simulation for the Integrated De...,The development of mechatronic systems involve...,1.0
676,Practicing modelling in manufacturing,This paper is about understanding modelling pr...,1.0
677,Efficient recognition of finite satisfiability...,Models lie at the heart of the emerging Model-...,1.0


In [22]:
# Take a look at the top 5 articles for Topic 6
journals[['title', 'abstract', '6']].sort_values(by = '6', ascending=False).head()

Unnamed: 0,title,abstract,6
849,Deploying model-based systems engineering with...,Prerequisites - to be provided by Customer. Pr...,1.0
450,Toward a property based requirements theory: S...,"Abstract In this paper, we outline a property-...",1.0
215,7.2.3 On enabling a model-based systems engine...,Abstract This paper considers the requirements...,1.0
721,Structural Rules for an Intelligent Advisor to...,Requirements define the problem boundaries wit...,1.0
93,A decision-making framework for selecting an M...,The increasing system complexity due to techno...,1.0


In [23]:
# Take a look at the top 5 articles for Topic 7
journals[['title', 'abstract', '7']].sort_values(by = '7', ascending=False).head()

Unnamed: 0,title,abstract,7
167,Subsystem selection for digital twin developme...,Digital twins are virtual representations of s...,1.0
668,Augmenting MBSE with Digital Twin Technology: ...,Model-Based Systems Engineering (MBSE) require...,1.0
796,Integrating Machine Learning in Digital Twins ...,Fueled by the opportunities for supporting man...,1.0
487,Enabling design of agile security in the IOT w...,Design for system security within the IOT is a...,1.0
446,Microreactor Testbed Automation through Digita...,Abstract Digital engineering is the practice o...,1.0


1. Draw the distribution of % of each topic
2. Draw the yearly trending of each topic

## Preprocessing the Data by Topic

Here, we will preprocess the data once again by using CountVectorizer to tokenize and count our words. This is so that we can compare each topic for similar high frequency words. We will add these words as stop words so that our classification models can be trained on a more distinct set of words. 

In [None]:
# Define an empty list to store my topics
topic_dfs = []

# Iterate using a for loop to extra each topic and assign them into a separate dataframe
for topic_id in range(8):
    topic_dfs.append(journals.loc[journals['topic'] == topic_id])

topic_0, topic_1, topic_2, topic_3, topic_4, topic_5, topic_6, topic_7 = topic_dfs

# Create list containing all the topics dataframes
list_of_topics = [topic_0, topic_1, topic_2, topic_3, topic_4, topic_5, 
                  topic_6, topic_7]

# Create empty list to store tokens for each topic
topic_tokens = []

# Iterate using a for loop to vectorize and store each topic tokens into a separate dataframe
for topic in list_of_topics:
    cvec = CountVectorizer(lowercase=False, ngram_range=(1,3))
    topic_matrix = cvec.fit_transform(topic['tokens'])
    topic_tokens_df = pd.DataFrame(topic_matrix.todense(), columns=cvec.get_feature_names_out())
    topic_tokens.append(topic_tokens_df)
    
topic_tokens_0, topic_tokens_1, topic_tokens_2, topic_tokens_3, topic_tokens_4, topic_tokens_5, topic_tokens_6, topic_tokens_7 = topic_tokens

## Visualize the High Frequency Words for Each Topic

Here, we will visualize the high frequency words for each topic, to identify common words between the topics. These common words will be added to our stop words.

In [None]:
# Create a dataframe with the top 50 words for each topic
top_50_words_topic_0 = topic_tokens_0.sum().sort_values(ascending=False).head(50)
top_50_words_topic_1 = topic_tokens_1.sum().sort_values(ascending=False).head(50)
top_50_words_topic_2 = topic_tokens_2.sum().sort_values(ascending=False).head(50)
top_50_words_topic_3 = topic_tokens_3.sum().sort_values(ascending=False).head(50)
top_50_words_topic_4 = topic_tokens_4.sum().sort_values(ascending=False).head(50)
top_50_words_topic_5 = topic_tokens_5.sum().sort_values(ascending=False).head(50)
top_50_words_topic_6 = topic_tokens_6.sum().sort_values(ascending=False).head(50)
top_50_words_topic_7 = topic_tokens_7.sum().sort_values(ascending=False).head(50)

In [None]:
# Convert the dataframes into sets of words
word_sets = [set(df.index) for df in [top_50_words_topic_0, top_50_words_topic_1, 
                                      top_50_words_topic_2, top_50_words_topic_3,
                                      top_50_words_topic_4, top_50_words_topic_5,
                                      top_50_words_topic_6, top_50_words_topic_7]]

# Find common words using set intersection
common_words = set.intersection(*word_sets)

# Print the common words
print(common_words)

In [None]:
# Define the list of lists to plot
list_to_plot = [top_50_words_topic_0, top_50_words_topic_1, top_50_words_topic_2,
                top_50_words_topic_3, top_50_words_topic_4, top_50_words_topic_5,
                top_50_words_topic_6, top_50_words_topic_7]

# Create a 4x2 grid for the bar plots
fig, axes = plt.subplots(4, 2, figsize=(12, 30))

# Iterate over the list of lists and plot the bar plots
for i, ax in enumerate(axes.flatten()):
    sorted_data = list_to_plot[i].sort_values(ascending=True)
    ax.barh(sorted_data.index, sorted_data.values)
    ax.set_title(f'Topic {i}')
    ax.tick_params(axis='y', rotation=0)

# Adjust spacing between subplots
plt.tight_layout()

# Show the plot
plt.show()

## Definition of Stop Words

Based on the data exploration above, we have identified some additional stop words to remove. This will make each topic more distinct to train the classification model.

In [None]:
# define the stop words
stop_words = ['tool', 'paper', 'method', 'architecture', 'design', 'modeling', 
              'development', 'approach', 'process']

## Train-Test Split

In [None]:
# Set up data for modeling
X = journals['tokens']
y = journals['topic']

In [None]:
# Check the distribution of the y values
y.value_counts()

In [None]:
# Check the distribution of the y values
y.value_counts(normalize=True)

As the data set is imbalanced, we will need to stratify it. 

In [None]:
# Split the data into the training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size = 0.25, 
                                                    stratify = y, 
                                                    random_state=42)

Because our dataset is imbalanced, we will use SMOTE to oversample the minority classes.

## Classification Modeling

In this section, we'll begin with the classificatin modeling. 

## Model 0: Base Model

We will calculate the baseline score first. This is done by creating a dummy model to act as a baseline mode. All models trained later on can compare against this score to base its performance.

In [None]:
# Create dummy model to get baseline scores
dummy_class = DummyClassifier(strategy='stratified')

# Fit the model
dummy_class.fit(X_train, y_train)

In [None]:
# Make predictions on the test set
y_pred = dummy_class.predict(X_test)

# Evaluate the performance of the DummyClassifier
print(classification_report(y_test, y_pred))

## Model 1: Naive Bayes Model (CountVectorizer)

The Naive Bayes model is a probabilistic classifier based on Bayes' theorem. It heavily relies on one simplifying assumption, which is that we assume our features are indepedent from one another. 

In [None]:
# Set up pipeline
pipe_nb_cvec = Pipeline([
    ('cvec', CountVectorizer(stop_words=stop_words, lowercase=False, 
                             ngram_range=(1,3))),
    ('smote', SMOTE(random_state=42)),
    ('nb', MultinomialNB())    
])

# Set up pipeline parameters
pipe_nb_cvec_params = {
    'cvec__max_features' : [250, 500, 1000, 2000],
    'nb__alpha' : [0.2, 0.5, 1],
    'nb__fit_prior' : [False, True]
}

# Perform gridsearch on the pipeline
gs_nb_cvec = GridSearchCV(pipe_nb_cvec, pipe_nb_cvec_params, cv=5)

# Fit the model
gs_nb_cvec.fit(X_train, y_train)

In [None]:
# Check what's the best parameters
gs_nb_cvec.best_params_

In [None]:
# Predict the classification on the test set
y_pred = gs_nb_cvec.predict(X_test)

# Evaluate the performance of the model
print(classification_report(y_test, y_pred))

## Model 2: Naive Bayes Model (TF-IDF)

The Naive Bayes model is a probabilistic classifier based on Bayes' theorem. It heavily relies on one simplifying assumption, which is that we assume our features are indepedent from one another. 

In [None]:
# Set up pipeline
pipe_nb_tfidf = Pipeline([
    ('tfidf', TfidfVectorizer(lowercase=False, ngram_range=(1,3))),
    ('smote', SMOTE(random_state=42)),
    ('nb', MultinomialNB())    
])

# Set up pipeline parameters
pipe_nb_tfidf_params = {
    'tfidf__max_features' : [250, 500, 1000],
    'nb__alpha' : [0.1, 0.2, 0.5, 1],
    'nb__fit_prior' : [False, True]
}

# Perform gridsearch on the pipeline
gs_nb_tfidf = GridSearchCV(pipe_nb_tfidf, pipe_nb_tfidf_params, cv=5)

# Fit the model
gs_nb_tfidf.fit(X_train, y_train)

In [None]:
# Check what's the best parameters
gs_nb_tfidf.best_params_

In [None]:
# Predict the classification on the test set
y_pred = gs_nb_cvec.predict(X_test)

# Evaluate the performance of the model
print(classification_report(y_test, y_pred))

## Model 3: Logistic Regression (CountVectorizer)

The logistic regression gives us the probability of a feature being in each class by using a link function to "bend" of line of best fit into a curve of best fit to match the values we're interested in. This link funciton is known as the logit link.

In [None]:
# Set up pipeline
pipe_log_cvec = Pipeline([
    ('cvec', CountVectorizer(lowercase=False, ngram_range=(1,3))),
    #('smote', SMOTE(random_state=42)),
    ('log', LogisticRegression(solver='liblinear'))    
])

# Set up pipeline parameters
pipe_log_cvec_params = {
    'cvec__max_features' : [250, 500, 1000],
    'log__penalty' : ['l1', 'l2']
}

# Perform gridsearch on the pipeline
gs_log_cvec = GridSearchCV(pipe_log_cvec, pipe_log_cvec_params, cv=5)

# Fit the model
gs_log_cvec.fit(X_train, y_train)

In [None]:
# Check what's the best parameters
gs_log_cvec.best_params_

In [None]:
# Predict the classification on the test set
y_pred = gs_log_cvec.predict(X_test)

# Evaluate the performance of the model
print(classification_report(y_test, y_pred))

## Model 4: Logistic Regression (TF-IDF)

The logistic regression gives us the probability of a feature being in each class by using a link function to "bend" of line of best fit into a curve of best fit to match the values we're interested in. This link funciton is known as the logit link.

In [None]:
# Set up pipeline
pipe_log_tfidf = Pipeline([
    ('tfidf', TfidfVectorizer(lowercase=False, ngram_range=(1,3))),
    ('smote', SMOTE(random_state=42)),
    ('log', LogisticRegression(solver='liblinear'))     
])

# Set up pipeline parameters
pipe_log_tfidf_params = {
    'tfidf__max_features' : [250, 500, 1000],
    'log__penalty' : ['l1', 'l2']
}

# Perform gridsearch on the pipeline
gs_log_tfidf = GridSearchCV(pipe_log_tfidf, pipe_log_tfidf_params, cv=5)

# Fit the model
gs_log_tfidf.fit(X_train, y_train)

In [None]:
# Check what's the best parameters
gs_log_tfidf.best_params_

In [None]:
# Predict the classification on the test set
y_pred = gs_log_tfidf.predict(X_test)

# Evaluate the performance of the model
print(classification_report(y_test, y_pred))

## Model 5: K-Nearest Neighbors (CountVectorizer)

The KNN model is a non-parametric method that uses the nearest neighbor's classification to assign a class membership. 

In [None]:
# Set up pipeline
pipe_knn_cvec = Pipeline([
    ('cvec', CountVectorizer(lowercase=False, ngram_range=(1,3))),
    ('smote', SMOTE(random_state=42)),
    ('knn', KNeighborsClassifier())    
])

# Set up pipeline parameters
pipe_knn_cvec_params = {
    'cvec__max_features' : [250, 500, 1000],
    'knn__n_neighbors' : [5, 10, 15, 20, 25],
    'knn__p' : [1, 2]
}

# Perform gridsearch on the pipeline
gs_knn_cvec = GridSearchCV(pipe_knn_cvec, pipe_knn_cvec_params, cv=5)

# Fit the model
gs_knn_cvec.fit(X_train, y_train)

In [None]:
# Check what's the best parameters
gs_knn_cvec.best_params_

In [None]:
# Predict the classification on the test set
y_pred = gs_knn_cvec.predict(X_test)

# Evaluate the performance of the model
print(classification_report(y_test, y_pred))

## Model 6: K-Nearest Neighbors (TF-IDF)

The KNN model is a non-parametric method that uses the nearest neighbor's classification to assign a class membership. 

In [None]:
# Set up pipeline
pipe_knn_tfidf = Pipeline([
    ('tfidf', TfidfVectorizer(lowercase=False, ngram_range=(1,3))),
    #('smote', SMOTE(random_state=42)),
    ('knn', KNeighborsClassifier())     
])

# Set up pipeline parameters
pipe_knn_tfidf_params = {
    'tfidf__max_features' : [250, 500, 1000],
    'knn__n_neighbors' : [5, 10, 15, 20, 25],
    'knn__p' : [1, 2]
}

# Perform gridsearch on the pipeline
gs_knn_tfidf = GridSearchCV(pipe_knn_tfidf, pipe_knn_tfidf_params, cv=5)

# Fit the model
gs_knn_tfidf.fit(X_train, y_train)

In [None]:
# Check what's the best parameters
gs_knn_tfidf.best_params_

In [None]:
# Predict the classification on the test set
y_pred = gs_knn_tfidf.predict(X_test)

# Evaluate the performance of the model
print(classification_report(y_test, y_pred))

## Model 7: Random Forest (CountVectorizer)

The Random Forest Classifier is an emsemble method that combines the predictions of other smaller models. Each of the smaller model is a decision tree. 

In [None]:
# Set up pipeline
pipe_rf_cvec = Pipeline([
    ('cvec', CountVectorizer(lowercase=False, ngram_range=(1,3))),
    ('smote', SMOTE(random_state=42)),
    ('rf', RandomForestClassifier())    
])

# Set up pipeline parameters
pipe_rf_cvec_params = {
    'cvec__max_features' : [250, 500, 1000],
    'rf__max_depth' : [1, 2, 3, 4, 5, 10, 15],
    'rf__n_estimators' : [100, 150, 200]
}

# Perform gridsearch on the pipeline
gs_rf_cvec = GridSearchCV(pipe_rf_cvec, pipe_rf_cvec_params, cv=5)

# Fit the model
gs_rf_cvec.fit(X_train, y_train)

In [None]:
# Check what's the best parameters
gs_rf_cvec.best_params_

In [None]:
# Predict the classification on the test set
y_pred = gs_rf_cvec.predict(X_test)

# Evaluate the performance of the model
print(classification_report(y_test, y_pred))

## Model 8: Random Forest (TF-IDF)

The Random Forest Classifier is an emsemble method that combines the predictions of other smaller models. Each of the smaller model is a decision tree. 

In [None]:
# Set up pipeline
pipe_rf_tfidf = Pipeline([
    ('tfidf', TfidfVectorizer(lowercase=False, ngram_range=(1,3))),
    ('smote', SMOTE(random_state=42)),
    ('rf', RandomForestClassifier())     
])

# Set up pipeline parameters
pipe_rf_tfidf_params = {
    'tfidf__max_features' : [250, 500],
    'rf__max_depth' : [1, 2, 3, 4, 5],
    'rf__n_estimators' : [100, 150, 200]
}

# Perform gridsearch on the pipeline
gs_rf_tfidf = GridSearchCV(pipe_rf_tfidf, pipe_rf_tfidf_params, cv=5)

# Fit the model
gs_rf_tfidf.fit(X_train, y_train)

In [None]:
# Check what's the best parameters
gs_rf_tfidf.best_params_

In [None]:
# Predict the classification on the test set
y_pred = gs_rf_tfidf.predict(X_test)

# Evaluate the performance of the model
print(classification_report(y_test, y_pred))

## Model 9: Support Vector Machine (CountVectorizer)

The Support Vector Machine is a classification model that predicts the categorical vairables. They belong to a wider class of models called discriminant models. 

In [None]:
# Set up pipeline
pipe_svc_cvec = Pipeline([
    ('cvec', CountVectorizer(lowercase=False, ngram_range=(1,3))),
    #('smote', SMOTE(random_state=42)),
    ('svc', SVC())    
])

# Set up pipeline parameters
pipe_svc_cvec_params = {
    'cvec__max_features' : [250, 500, 1000],
    'svc__C' : [0.1, 1, 10, 50, 100]
}

# Perform gridsearch on the pipeline
gs_svc_cvec = GridSearchCV(pipe_svc_cvec, pipe_svc_cvec_params, cv=5)

# Fit the model
gs_svc_cvec.fit(X_train, y_train)

In [None]:
# Check what's the best parameters
gs_svc_cvec.best_params_

In [None]:
# Predict the classification on the test set
y_pred = gs_svc_cvec.predict(X_test)

# Evaluate the performance of the model
print(classification_report(y_test, y_pred))

## Model 10: Support Vector Machine (TF-IDF)

The Support Vector Machine is a classification model that predicts the categorical vairables. They belong to a wider class of models called discriminant models. 

In [None]:
# Set up pipeline
pipe_svc_tfidf = Pipeline([
    ('tfidf', TfidfVectorizer(lowercase=False, ngram_range=(1,3))),
    #('smote', SMOTE(random_state=42)),
    ('svc', SVC())     
])

# Set up pipeline parameters
pipe_svc_tfidf_params = {
    'tfidf__max_features' : [250, 500, 1000],
    'svc__C' : [0.1, 1, 10, 50, 100]
}

# Perform gridsearch on the pipeline
gs_svc_tfidf = GridSearchCV(pipe_svc_tfidf, pipe_svc_tfidf_params, cv=5)

# Fit the model
gs_svc_tfidf.fit(X_train, y_train)

In [None]:
# Check what's the best parameters
gs_svc_tfidf.best_params_

In [None]:
# Predict the classification on the test set
y_pred = gs_svc_tfidf.predict(X_test)

# Evaluate the performance of the model
print(classification_report(y_test, y_pred))

## Model Evaluation

|Model   | Model Type          |Precision |Recall | F1 | 
|------------|---------------------|----|----------|-------------|
|Model 0     |Dummy Classifier     |0.14|0.13|0.14|
|Model 1     |Naive Bayes (Cvec)   |0.79|0.74|0.76|
|Model 2     |Naive Bayes (TF-IDF) |0.79|0.74|0.76|
|Model 3     |Logistic Regression (Cvec)|0.75|0.71|0.71|
|Model 4     |Logistic Regression (TF-IDF)|0.73|0.75|0.74|
|Model 5     |KNN Classifier (Cvec)|0.39|0.44|0.34|
|Model 6     |KNN Classifier (TF-IDF)|0.68|0.61|0.62|
|Model 7     |Random Forest (Cvec)|0.63|0.67|0.63|
|Model 8     |Random Forest (TF-IDF)|0.67|0.71|0.67|
|Model 9     |Support Vector Machine (Cvec)|0.74|0.60|0.64|
|Model 10    |Support Vector Machine (TF-IDF)|0.78|0.63|0.66|

As all our classes are important, we will be using the macro average F1 score to identify the best performing model. Furthermore, the macro average is used compared to the weighted average as this score method will heavily penalize the minority classes if the model does not perform well. This is because we are using an imbalanced dataset and we expect the minority classes to not perform as well as the majority class.  

### Best Performing Model

From the F1 scores, we can tell that the Naive Bayes model performs the best, with an F1 score of 0.76. This is followed by the Logistic Regression Model, which has an F1 score of 0.74. As such, we will choose the Naive Bayes model as our production model and further explore it by looking at the confusion matrix.