Predicting Subjects From Articles Supervised

Written and developed by Eric Detjen

This repository contains a machine-learning model that classifies articles into different subjects. The model uses article metadata sourced from arXiv's condensed matter physics articles. The project leverages a variety of techniques, including TF-IDF for text feature extraction, date feature extraction, and a Random Forest classifier for model training. Experiment logging and tracking are handled using Capital One's open-source framework, Rubicon-ML.

For ease of viewing, I have included the model.ipynb file in the README here. If desired, this and all other files are available in this repo.

Table of Contents

  1. Initialize Rubicon and Project
  2. Read and Adjust Data
  3. Feature Extraction
  4. Data Preparation
  5. Model Training
  6. Logging
  7. Saving
  8. Model Evaluation
  9. Results Summary
  10. Accuracy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from scipy.sparse import hstack
from sklearn.ensemble import RandomForestClassifier
from collections import defaultdict
from joblib import dump
import pandas as pd
from IPython.display import display
from rubicon_ml import Rubicon
from datetime import datetime
import json
from colorama import Fore, Style
from dateutil.relativedelta import relativedelta
import csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

Initialize Rubicon and Project

We initialize Rubicon for logging and create a project for Article Classification.

rubicon = Rubicon(persistence="filesystem", root_dir="./rubicon-root")
project = rubicon.get_or_create_project("Article Classification")

# Log the experiment
experiment = project.log_experiment(
    model_name="Random Forest",
    tags=["text classification", "NLP"]
)

Read and Adjust Data

Read the data

We test accuracy by training on 80% of the training data and holding out the remaining 20%. The model predicts subjects for the held-out articles, and we compare those predictions against the true subjects for the same 20%.

with open("train.json", "r") as file:
    data = json.load(file)

with open("test.json", "r") as file:
    final_test_data = json.load(file)

# Split the data into 80% for training and 20% for testing
train_data, test_data = train_test_split(data, test_size=0.2, random_state=42)
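
As a rough illustration of what the split does, the sketch below shuffles a small list of made-up records with a fixed seed and holds out the last 20%. It uses only the standard library and is not how train_test_split is implemented internally; the helper name holdout_split and the toy records are assumptions made for this example.

```python
import random

def holdout_split(records, test_size=0.2, seed=42):
    """Shuffle a copy of `records` and split it into (train, test) lists."""
    shuffled = list(records)               # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # a fixed seed makes the split repeatable
    n_test = round(len(shuffled) * test_size)
    split_at = len(shuffled) - n_test
    return shuffled[:split_at], shuffled[split_at:]

# Hypothetical toy records standing in for the articles in train.json
toy_articles = [{"title": f"paper {i}", "subject": "demo"} for i in range(10)]
toy_train, toy_test = holdout_split(toy_articles)
print(len(toy_train), len(toy_test))  # 8 2
```

Fixing the seed plays the same role as random_state=42 above: the same articles end up in the held-out set on every run, so accuracy numbers stay comparable between experiments.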

Feature Extraction

We extract features using TF-IDF for article abstracts and titles and also extract date features.

The TF-IDF is key here since it will yield a higher rating for words that are frequent in the specific text it is analyzing yet rare across all documents. This allows the model to focus on the words that are important and not common words like “and” or “the”.
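
To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF on three made-up titles. It uses the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation, which is my understanding of TfidfVectorizer's default behaviour; treat it as an illustration, not a drop-in replacement. The function name tfidf and the toy documents are assumptions, and no stop-word removal is applied so that the effect on "the" stays visible.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Term count times smoothed inverse document frequency
        raw = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        # Scale each document's weights to unit Euclidean length
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weights.append({term: w / norm for term, w in raw.items()})
    return weights

# Hypothetical toy titles, not taken from the dataset
toy_docs = [
    "the spin glass transition",
    "the superconducting transition",
    "the quantum hall effect",
]
toy_weights = tfidf(toy_docs)
first = toy_weights[0]
print(first["spin"] > first["transition"] > first["the"])  # True
```

In the first title, "spin" appears in one document, "transition" in two, and "the" in all three, so their weights fall in that order.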

max_features_abstract = 1000
stop_words = 'english'
max_features_title = 500
# Extract the month and year from each article's date
def extract_date_features(data):
    months = []
    years = []
    for article in data:
        article_date = datetime.strptime(article['date'], '%Y-%m-%dT%H:%M:%S.%fZ')
        months.append(article_date.month)
        years.append(article_date.year)
    return months, years

# Initialize the vectorizers and the label encoder
vectorizer_abstract = TfidfVectorizer(max_features=max_features_abstract, stop_words=stop_words)
vectorizer_title = TfidfVectorizer(max_features=max_features_title, stop_words=stop_words)
label_encoder = LabelEncoder()

# Extract date features for train and test data
train_months, train_years = extract_date_features(train_data)
test_months, test_years = extract_date_features(test_data)

# Using TF-IDF for abstract and title feature extraction for train and test data
train_abstract_features = vectorizer_abstract.fit_transform([article['abstract'] for article in train_data])
train_title_features = vectorizer_title.fit_transform([article['title'] for article in train_data])

test_abstract_features = vectorizer_abstract.transform([article['abstract'] for article in test_data])
test_title_features = vectorizer_title.transform([article['title'] for article in test_data])
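
For the date features, the snippet below shows what the format string in extract_date_features parses. The sample timestamp is hypothetical; I am assuming the dataset's dates look like this because that is the format the code expects.

```python
from datetime import datetime

# Hypothetical timestamp matching the format string used in extract_date_features
sample_date = "2015-03-09T14:30:00.000Z"
parsed = datetime.strptime(sample_date, "%Y-%m-%dT%H:%M:%S.%fZ")

# Only the month and year are kept as model features
sample_month, sample_year = parsed.month, parsed.year
print(sample_month, sample_year)  # 3 2015
```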

Data Preparation

I now combine all the extracted features using hstack and encode the labels using label_encoder.

# Combining all features
X_train = hstack([train_abstract_features, train_title_features, [[month, year] for month, year in zip(train_months, train_years)]])
X_test = hstack([test_abstract_features, test_title_features, [[month, year] for month, year in zip(test_months, test_years)]])

# Encoding the target variable 
y_train = label_encoder.fit_transform([article['subject'] for article in train_data])
y_test = label_encoder.transform([article['subject'] for article in test_data])
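
The two steps above can be pictured with plain Python lists. This is only a sketch of the idea: the real code works on sparse matrices via hstack, and LabelEncoder handles the mapping itself (to my knowledge it also assigns integers in sorted label order). The helper fit_label_mapping and all toy values are assumptions for illustration.

```python
def fit_label_mapping(labels):
    """Map each distinct label to an integer, assigned in sorted order."""
    classes = sorted(set(labels))
    return classes, {label: index for index, label in enumerate(classes)}

# Hypothetical subjects; the real ones come from article['subject']
toy_subjects = ["superconductivity", "quantum gases", "superconductivity", "materials science"]
classes, to_index = fit_label_mapping(toy_subjects)
encoded = [to_index[subject] for subject in toy_subjects]
decoded = [classes[index] for index in encoded]  # the inverse transform
print(encoded)  # [2, 1, 2, 0]

# Column-wise stacking: each row keeps its text features and gains [month, year]
toy_text_features = [[0.1, 0.0], [0.0, 0.7], [0.3, 0.3], [0.0, 0.0]]
toy_dates = [[5, 2010], [11, 2012], [1, 2015], [7, 2018]]
stacked = [text + date for text, date in zip(toy_text_features, toy_dates)]
print(stacked[0])  # [0.1, 0.0, 5, 2010]
```

The number of rows stays the same after stacking; only the number of columns grows.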

Model Training

I initialize and train the Random Forest Classifier.

I chose the random forest classifier because it is a robust supervised learning model: averaging many decision trees makes it less prone to overfitting than a single tree, and it handles the high-dimensional feature set we have well. It is also fairly interpretable, which allows us to log meaningful metrics and parameters to Rubicon.

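# Initializing and training the Random Forest classifier (hyperparameter values match those in the Rubicon logs)
n_estimators, random_state, verbose, n_jobs = 200, 92, 1, -1
rf_classifier = RandomForestClassifier(n_estimators=n_estimators, random_state=random_state, verbose=verbose, n_jobs=n_jobs)
rf_classifier.fit(X_train, y_train)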
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 10 concurrent workers.
[Parallel(n_jobs=-1)]: Done  30 tasks      | elapsed:   24.4s
[Parallel(n_jobs=-1)]: Done 180 tasks      | elapsed:  1.9min
[Parallel(n_jobs=-1)]: Done 200 out of 200 | elapsed:  2.1min finished
[Parallel(n_jobs=10)]: Using backend ThreadingBackend with 10 concurrent workers.
[Parallel(n_jobs=10)]: Done  30 tasks      | elapsed:    0.3s
[Parallel(n_jobs=10)]: Done 180 tasks      | elapsed:    1.6s
[Parallel(n_jobs=10)]: Done 200 out of 200 | elapsed:    1.8s finished


Logging

We use Capital One's Rubicon-ML to log the model's important parameters.

experiment = project.log_experiment(
    model_name="Random Forest",
    tags=["text classification", "NLP"]
)

# Log parameters dynamically from the trained RandomForestClassifier object
parameters_to_log = [
    "n_estimators", "max_depth", "min_samples_split", "min_samples_leaf",
    "min_weight_fraction_leaf", "max_features", "max_leaf_nodes",
    "min_impurity_decrease", "min_impurity_split", "bootstrap",
    "oob_score", "n_jobs", "random_state", "verbose", "warm_start"
]

for param_name in parameters_to_log:
    param_value = getattr(rf_classifier, param_name, "Not set")
    experiment.log_parameter(name=param_name, value=param_value)


Saving

I use joblib's dump function to save the trained model so it does not have to be retrained on every run. The file exceeds GitHub's size limit, so it is unfortunately not in the repo. The model only takes about 6 minutes to train locally, though.

model_path = "random_forest_model.joblib"
dump(rf_classifier, model_path)

Model Evaluation

Here we evaluate the model's performance on the held-out test set.

As before, the model is trained on 80% of the training data, and its predictions for the remaining 20% are compared against the true subjects for those same articles.
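
The accuracy reported in this section is the fraction of held-out articles whose predicted subject matches the true one. A minimal sketch of that calculation, using made-up labels and a hypothetical helper named accuracy (the notebook itself relies on accuracy_score from scikit-learn):

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of positions where the prediction equals the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# Hypothetical labels for five held-out articles
toy_true = ["superconductivity", "quantum gases", "materials science",
            "quantum gases", "superconductivity"]
toy_pred = ["superconductivity", "quantum gases", "statistical mechanics",
            "superconductivity", "superconductivity"]
print(accuracy(toy_true, toy_pred))  # 0.6
```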

# Predict subjects for the test data
y_pred = rf_classifier.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
model_path, accuracy

experiment.log_metric(name="Accuracy_Score_5", value=accuracy)  

# Optionally, log additional text or configurations
#experiment.log_parameter(name="TF-IDF Max Features", value=f"{max_features_abstract} for abstract, {max_features_title} for title")
[Parallel(n_jobs=10)]: Using backend ThreadingBackend with 10 concurrent workers.
[Parallel(n_jobs=10)]: Done  30 tasks      | elapsed:    0.3s
[Parallel(n_jobs=10)]: Done 180 tasks      | elapsed:    1.4s
[Parallel(n_jobs=10)]: Done 200 out of 200 | elapsed:    1.5s finished

<rubicon_ml.client.metric.Metric at 0x2ae3b4280>

Results Summary

I display the test results here. For each true subject, the table shows how often the model gets it wrong and which subjects it mistakenly assigns instead.
This is key for giving meaningful insight into what is causing the confusion.

decoded_predictions = label_encoder.inverse_transform(y_pred)

# Extract the actual labels from the test data
actual_labels = [article['subject'] for article in test_data]

# Compare the predicted labels to the actual labels
results_comparison = list(zip(actual_labels, decoded_predictions))

# Initialize dictionaries to keep track of all predictions and incorrect predictions for each unique "Actual" value.
total_predictions = defaultdict(int)
incorrect_predictions = defaultdict(lambda: defaultdict(int))

df_rows = []

# Populate the dictionaries with data.
for actual, predicted in results_comparison:
    total_predictions[actual] += 1
    if actual != predicted:
        incorrect_predictions[actual][predicted] += 1

# Add rows to the DataFrame list.
for actual, predicted_dict in incorrect_predictions.items():
    total = total_predictions[actual]
    correct = total - sum(predicted_dict.values())
    correct_percentage = (correct / total) * 100

    first_row = True
    for predicted, count in predicted_dict.items():
        if first_row:
            # The first row of each group carries the per-subject summary
            df_rows.append({
                'Actual': actual,
                'Predicted': predicted,
                'Count': count,
                'Correct Percentage': f"{correct_percentage:.2f}%",
                'Total Articles': total
            })
            first_row = False
        else:
            df_rows.append({
                'Actual': '',
                'Predicted': predicted,
                'Count': count,
                'Correct Percentage': '',
                'Total Articles': ''
            })

    # Add a separator row
    df_rows.append({
        'Actual': '',
        'Predicted': '',
        'Count': '',
        'Correct Percentage': '',
        'Total Articles': ''
    })

pd.set_option('display.max_rows', None)

# Create a DataFrame from the list of rows.
df = pd.DataFrame(df_rows)

# Display the DataFrame.
display(df)


| | Actual | Predicted | Count | Correct Percentage | Total Articles |
| 0 | quantum physics | mesoscale and nanoscale physics | 193 | 32.22% | 841 |
| 1 | | quantum gases | 67 | | |
| 2 | | materials science | 30 | | |
| 3 | | superconductivity | 30 | | |
| 4 | | disordered systems and neural networks | 4 | | |
| 5 | | statistical mechanics | 138 | | |
| 6 | | strongly correlated electrons | 84 | | |
| 7 | | other condensed matter | 14 | | |
| 8 | | soft condensed matter | 6 | | |
| 9 | | condensed matter | 4 | | |
| 11 | disordered systems and neural networks | statistical mechanics | 682 | 23.60% | 1449 |
| 12 | | superconductivity | 19 | | |
| 13 | | mesoscale and nanoscale physics | 112 | | |
| 14 | | materials science | 103 | | |
| 15 | | strongly correlated electrons | 91 | | |
| 16 | | soft condensed matter | 68 | | |
| 17 | | quantum gases | 9 | | |
| 18 | | quantum physics | 5 | | |
| 19 | | high energy physics - theory | 1 | | |
| 20 | | condensed matter | 9 | | |
| 21 | | physics - physics and society | 1 | | |
| 22 | | other condensed matter | 7 | | |
| 24 | mesoscale and nanoscale physics | superconductivity | 231 | 74.87% | 5527 |
| 25 | | materials science | 579 | | |
| 26 | | other condensed matter | 20 | | |
| 27 | | statistical mechanics | 190 | | |
| 28 | | strongly correlated electrons | 220 | | |
| 29 | | quantum gases | 19 | | |
| 30 | | soft condensed matter | 29 | | |
| 31 | | quantum physics | 50 | | |
| 32 | | condensed matter | 34 | | |
| 33 | | disordered systems and neural networks | 16 | | |
| 34 | | high energy physics - theory | 1 | | |
| 36 | other condensed matter | mesoscale and nanoscale physics | 198 | 20.72% | 1144 |
| 37 | | statistical mechanics | 205 | | |
| 38 | | strongly correlated electrons | 118 | | |
| 39 | | quantum gases | 72 | | |
| 40 | | materials science | 238 | | |
| 41 | | superconductivity | 28 | | |
| 42 | | quantum physics | 12 | | |
| 43 | | condensed matter | 2 | | |
| 44 | | soft condensed matter | 28 | | |
| 45 | | disordered systems and neural networks | 4 | | |
| 46 | | high energy physics - theory | 1 | | |
| 47 | | physics - chemical physics | 1 | | |
| 49 | soft condensed matter | materials science | 217 | 60.15% | 2454 |
| 50 | | statistical mechanics | 549 | | |
| 51 | | other condensed matter | 50 | | |
| 52 | | strongly correlated electrons | 30 | | |
| 53 | | mesoscale and nanoscale physics | 43 | | |
| 54 | | condensed matter | 44 | | |
| 55 | | physics - biological physics | 1 | | |
| 56 | | superconductivity | 15 | | |
| 57 | | disordered systems and neural networks | 16 | | |
| 58 | | quantum gases | 13 | | |
| 60 | superconductivity | mesoscale and nanoscale physics | 105 | 84.85% | 3643 |
| 61 | | statistical mechanics | 43 | | |
| 62 | | strongly correlated electrons | 246 | | |
| 63 | | condensed matter | 20 | | |
| 64 | | quantum gases | 21 | | |
| 65 | | materials science | 74 | | |
| 66 | | quantum physics | 10 | | |
| 67 | | other condensed matter | 25 | | |
| 68 | | high energy physics - theory | 1 | | |
| 69 | | soft condensed matter | 6 | | |
| 70 | | disordered systems and neural networks | 1 | | |
| 72 | materials science | statistical mechanics | 244 | 65.53% | 4549 |
| 73 | | mesoscale and nanoscale physics | 713 | | |
| 74 | | soft condensed matter | 150 | | |
| 75 | | strongly correlated electrons | 322 | | |
| 76 | | condensed matter | 31 | | |
| 77 | | physics - chemical physics | 1 | | |
| 78 | | superconductivity | 76 | | |
| 79 | | disordered systems and neural networks | 15 | | |
| 80 | | quantum physics | 8 | | |
| 81 | | nuclear theory | 1 | | |
| 82 | | other condensed matter | 4 | | |
| 83 | | quantum gases | 1 | | |
| 84 | | physics - optics | 1 | | |
| 85 | | high energy physics - theory | 1 | | |
| 87 | strongly correlated electrons | mesoscale and nanoscale physics | 508 | 69.34% | 4883 |
| 88 | | materials science | 370 | | |
| 89 | | superconductivity | 373 | | |
| 90 | | statistical mechanics | 153 | | |
| 91 | | condensed matter | 25 | | |
| 92 | | other condensed matter | 10 | | |
| 93 | | quantum gases | 26 | | |
| 94 | | disordered systems and neural networks | 8 | | |
| 95 | | quantum physics | 11 | | |
| 96 | | soft condensed matter | 10 | | |
| 97 | | high energy physics - theory | 3 | | |
| 99 | statistical mechanics | condensed matter | 84 | 74.64% | 4870 |
| 100 | | strongly correlated electrons | 284 | | |
| 101 | | quantum physics | 61 | | |
| 102 | | soft condensed matter | 306 | | |
| 103 | | superconductivity | 55 | | |
| 104 | | materials science | 122 | | |
| 105 | | other condensed matter | 64 | | |
| 106 | | disordered systems and neural networks | 93 | | |
| 107 | | mesoscale and nanoscale physics | 124 | | |
| 108 | | quantum gases | 37 | | |
| 109 | | physics - physics and society | 2 | | |
| 110 | | mathematical physics | 1 | | |
| 111 | | high energy physics - theory | 2 | | |
| 113 | condensed matter | materials science | 152 | 54.36% | 2360 |
| 114 | | strongly correlated electrons | 194 | | |
| 115 | | mesoscale and nanoscale physics | 159 | | |
| 116 | | statistical mechanics | 393 | | |
| 117 | | superconductivity | 64 | | |
| 118 | | soft condensed matter | 66 | | |
| 119 | | disordered systems and neural networks | 19 | | |
| 120 | | other condensed matter | 18 | | |
| 121 | | quantum gases | 8 | | |
| 122 | | high energy physics - theory | 2 | | |
| 123 | | quantum physics | 2 | | |
| 125 | quantum gases | strongly correlated electrons | 113 | 67.97% | 946 |
| 126 | | statistical mechanics | 63 | | |
| 127 | | other condensed matter | 8 | | |
| 128 | | mesoscale and nanoscale physics | 56 | | |
| 129 | | materials science | 21 | | |
| 130 | | condensed matter | 1 | | |
| 131 | | superconductivity | 18 | | |
| 132 | | quantum physics | 14 | | |
| 133 | | disordered systems and neural networks | 5 | | |
| 134 | | high energy physics - theory | 1 | | |
| 135 | | soft condensed matter | 3 | | |
| 137 | physics - atomic physics | quantum gases | 17 | 1.28% | 78 |
| 138 | | mesoscale and nanoscale physics | 9 | | |
| 139 | | other condensed matter | 7 | | |
| 140 | | superconductivity | 1 | | |
| 141 | | statistical mechanics | 9 | | |
| 142 | | strongly correlated electrons | 5 | | |
| 143 | | quantum physics | 3 | | |
| 144 | | condensed matter | 4 | | |
| 145 | | materials science | 21 | | |
| 146 | | soft condensed matter | 1 | | |
| 148 | high energy physics - theory | mesoscale and nanoscale physics | 53 | 13.07% | 589 |
| 149 | | condensed matter | 102 | | |
| 150 | | strongly correlated electrons | 72 | | |
| 151 | | superconductivity | 36 | | |
| 152 | | statistical mechanics | 215 | | |
| 153 | | soft condensed matter | 8 | | |
| 154 | | materials science | 11 | | |
| 155 | | disordered systems and neural networks | 2 | | |
| 156 | | quantum physics | 4 | | |
| 157 | | quantum gases | 7 | | |
| 158 | | other condensed matter | 2 | | |
| 160 | high energy physics - phenomenology | condensed matter | 17 | 0.72% | 138 |
| 161 | | statistical mechanics | 49 | | |
| 162 | | mesoscale and nanoscale physics | 15 | | |
| 163 | | superconductivity | 25 | | |
| 164 | | high energy physics - theory | 3 | | |
| 165 | | strongly correlated electrons | 11 | | |
| 166 | | quantum physics | 2 | | |
| 167 | | disordered systems and neural networks | 2 | | |
| 168 | | materials science | 7 | | |
| 169 | | soft condensed matter | 4 | | |
| 170 | | quantum gases | 1 | | |
| 171 | | other condensed matter | 1 | | |
| 173 | physics - computational physics | statistical mechanics | 30 | 0.00% | 83 |
| 174 | | materials science | 26 | | |
| 175 | | soft condensed matter | 16 | | |
| 176 | | strongly correlated electrons | 3 | | |
| 177 | | mesoscale and nanoscale physics | 7 | | |
| 178 | | quantum physics | 1 | | |
| 180 | physics - fluid dynamics | soft condensed matter | 51 | 0.00% | 91 |
| 181 | | statistical mechanics | 30 | | |
| 182 | | disordered systems and neural networks | 1 | | |
| 183 | | materials science | 4 | | |
| 184 | | mesoscale and nanoscale physics | 2 | | |
| 185 | | superconductivity | 2 | | |
| 186 | | high energy physics - theory | 1 | | |
| 188 | mathematics - probability | statistical mechanics | 57 | 0.00% | 63 |
| 189 | | disordered systems and neural networks | 5 | | |
| 190 | | quantum physics | 1 | | |
| 192 | physics - chemical physics | strongly correlated electrons | 7 | 0.00% | 140 |
| 193 | | statistical mechanics | 29 | | |
| 194 | | materials science | 62 | | |
| 195 | | soft condensed matter | 19 | | |
| 196 | | condensed matter | 5 | | |
| 197 | | mesoscale and nanoscale physics | 17 | | |
| 198 | | high energy physics - theory | 1 | | |
| 200 | physics - physics and society | statistical mechanics | 125 | 9.49% | 158 |
| 201 | | materials science | 4 | | |
| 202 | | disordered systems and neural networks | 13 | | |
| 203 | | soft condensed matter | 1 | | |
| 205 | nonlinear sciences - chaotic dynamics | statistical mechanics | 117 | 1.10% | 182 |
| 206 | | soft condensed matter | 5 | | |
| 207 | | disordered systems and neural networks | 8 | | |
| 208 | | mesoscale and nanoscale physics | 16 | | |
| 209 | | strongly correlated electrons | 3 | | |
| 210 | | condensed matter | 22 | | |
| 211 | | quantum physics | 5 | | |
| 212 | | other condensed matter | 2 | | |
| 213 | | materials science | 2 | | |
| 215 | nonlinear sciences - pattern formation and sol... | materials science | 12 | 0.00% | 74 |
| 216 | | superconductivity | 6 | | |
| 217 | | statistical mechanics | 24 | | |
| 218 | | quantum gases | 13 | | |
| 219 | | strongly correlated electrons | 2 | | |
| 220 | | other condensed matter | 3 | | |
| 221 | | soft condensed matter | 7 | | |
| 222 | | condensed matter | 4 | | |
| 223 | | high energy physics - theory | 1 | | |
| 224 | | mesoscale and nanoscale physics | 2 | | |
| 226 | physics - biological physics | soft condensed matter | 48 | 0.72% | 139 |
| 227 | | statistical mechanics | 61 | | |
| 228 | | materials science | 18 | | |
| 229 | | mesoscale and nanoscale physics | 7 | | |
| 230 | | disordered systems and neural networks | 4 | | |
| 232 | nonlinear sciences - adaptation and self-organ... | disordered systems and neural networks | 7 | 0.00% | 58 |
| 233 | | statistical mechanics | 39 | | |
| 234 | | condensed matter | 8 | | |
| 235 | | materials science | 2 | | |
| 236 | | physics - physics and society | 1 | | |
| 237 | | strongly correlated electrons | 1 | | |
| 239 | physics - optics | soft condensed matter | 8 | 0.00% | 154 |
| 240 | | mesoscale and nanoscale physics | 49 | | |
| 241 | | materials science | 74 | | |
| 242 | | statistical mechanics | 11 | | |
| 243 | | quantum gases | 2 | | |
| 244 | | quantum physics | 2 | | |
| 245 | | strongly correlated electrons | 2 | | |
| 246 | | other condensed matter | 1 | | |
| 247 | | superconductivity | 3 | | |
| 248 | | disordered systems and neural networks | 1 | | |
| 249 | | condensed matter | 1 | | |
| 251 | general relativity and quantum cosmology | materials science | 2 | 1.64% | 61 |
| 252 | | strongly correlated electrons | 3 | | |
| 253 | | quantum gases | 5 | | |
| 254 | | statistical mechanics | 23 | | |
| 255 | | mesoscale and nanoscale physics | 6 | | |
| 256 | | soft condensed matter | 9 | | |
| 257 | | quantum physics | 6 | | |
| 258 | | other condensed matter | 2 | | |
| 259 | | condensed matter | 1 | | |
| 260 | | superconductivity | 1 | | |
| 261 | | high energy physics - theory | 2 | | |
| 263 | quantitative biology - populations and evolution | statistical mechanics | 57 | 0.00% | 67 |
| 264 | | disordered systems and neural networks | 2 | | |
| 265 | | soft condensed matter | 4 | | |
| 266 | | materials science | 2 | | |
| 267 | | physics - physics and society | 2 | | |
| 269 | quantitative biology - biomolecules | statistical mechanics | 22 | 0.00% | 49 |
| 270 | | soft condensed matter | 19 | | |
| 271 | | mesoscale and nanoscale physics | 1 | | |
| 272 | | materials science | 7 | | |
| 274 | mathematical physics | statistical mechanics | 169 | 3.06% | 294 |
| 275 | | quantum physics | 11 | | |
| 276 | | quantum gases | 9 | | |
| 277 | | disordered systems and neural networks | 9 | | |
| 278 | | mesoscale and nanoscale physics | 23 | | |
| 279 | | strongly correlated electrons | 32 | | |
| 280 | | materials science | 19 | | |
| 281 | | condensed matter | 2 | | |
| 282 | | soft condensed matter | 4 | | |
| 283 | | other condensed matter | 2 | | |
| 284 | | high energy physics - theory | 2 | | |
| 285 | | superconductivity | 3 | | |
| 287 | nuclear theory | strongly correlated electrons | 15 | 1.11% | 90 |
| 288 | | statistical mechanics | 37 | | |
| 289 | | quantum gases | 7 | | |
| 290 | | materials science | 10 | | |
| 291 | | superconductivity | 9 | | |
| 292 | | mesoscale and nanoscale physics | 3 | | |
| 293 | | other condensed matter | 2 | | |
| 294 | | soft condensed matter | 3 | | |
| 295 | | condensed matter | 2 | | |
| 296 | | quantum physics | 1 | | |
| 298 | high energy physics - lattice | statistical mechanics | 65 | 5.93% | 118 |
| 299 | | condensed matter | 23 | | |
| 300 | | superconductivity | 3 | | |
| 301 | | quantum gases | 4 | | |
| 302 | | strongly correlated electrons | 9 | | |
| 303 | | mesoscale and nanoscale physics | 3 | | |
| 304 | | disordered systems and neural networks | 1 | | |
| 305 | | high energy physics - theory | 3 | | |
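
A table this long is easier to scan when the largest confusions are pulled to the top. The sketch below ranks (actual, predicted) pairs by how often they occur, using a Counter. The helper name top_confusions and the toy labels are assumptions; in the notebook, actual_labels and decoded_predictions would be the natural inputs.

```python
from collections import Counter

def top_confusions(actual, predicted, n=3):
    """Return the n most frequent (actual, predicted) pairs where the two differ."""
    pairs = Counter((a, p) for a, p in zip(actual, predicted) if a != p)
    return pairs.most_common(n)

# Hypothetical labels, not taken from the results above
toy_actual = ["quantum physics", "quantum physics", "quantum physics",
              "quantum gases", "materials science"]
toy_predicted = ["statistical mechanics", "statistical mechanics", "quantum physics",
                 "quantum physics", "materials science"]

worst_pair, worst_count = top_confusions(toy_actual, toy_predicted)[0]
print(worst_pair, worst_count)  # ('quantum physics', 'statistical mechanics') 2
```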

Display the Logs

Finally, we display the logs from the Rubicon system.

# Get the project
project = rubicon.get_project(name="Article Classification")

# Loop through the experiments and print details
for experiment in project.experiments():
    print(f"Experiment ID: {experiment.id}")
    print(f"Model Name: {experiment.model_name}")
    for param in experiment.parameters():
        print(f"  - {param.name}: {param.value}")
    for metric in experiment.metrics():
        print(f"  - {metric.name}: {metric.value}")
Experiment ID: f967a29a-0c03-4270-8a70-766802a02193
Model Name: Random Forest
Experiment ID: 08961d24-105a-415f-b78f-40bdc4e03a5e
Model Name: Random Forest
Experiment ID: 924925f4-ee26-4b89-b27e-6631c79de01d
Model Name: Random Forest
  - n_estimators: 200
  - max_depth: None
  - min_samples_split: 2
  - min_samples_leaf: 1
  - min_weight_fraction_leaf: 0.0
  - max_features: sqrt
  - max_leaf_nodes: None
  - min_impurity_decrease: 0.0
  - min_impurity_split: Not set
  - bootstrap: True
  - oob_score: False
  - n_jobs: -1
  - random_state: 92
  - verbose: 1
  - warm_start: False
  - Accuracy: 0.6119800521364616
  - Accuracy_Score: 0.6119800521364616
  - New_Accuracy_Score: 0.6119800521364616
  - Accuracy_Score_1: 0.6119800521364616
  - Accuracy_Score_2: 0.6119800521364616
Experiment ID: 6ed79741-4d0b-49b4-a1be-d2f7cd025851
Model Name: Random Forest
Experiment ID: 1a36363e-6313-4d6a-b0ab-6e818a5bdf92
Model Name: Random Forest
  - n_estimators: 200
  - max_depth: None
  - min_samples_split: 2
  - min_samples_leaf: 1
  - min_weight_fraction_leaf: 0.0
  - max_features: sqrt
  - max_leaf_nodes: None
  - min_impurity_decrease: 0.0
  - min_impurity_split: Not set
  - bootstrap: True
  - oob_score: False
  - n_jobs: -1
  - random_state: 92
  - verbose: 1
  - warm_start: False
  - Accuracy_Score_2: 0.6119800521364616
Experiment ID: 3b95e82b-c76a-45cb-b1c6-0d91a0728618
Model Name: Random Forest
  - n_estimators: 200
  - max_depth: None
  - min_samples_split: 2
  - min_samples_leaf: 1
  - min_weight_fraction_leaf: 0.0
  - max_features: sqrt
  - max_leaf_nodes: None
  - min_impurity_decrease: 0.0
  - min_impurity_split: Not set
  - bootstrap: True
  - oob_score: False
  - n_jobs: -1
  - random_state: 92
  - verbose: 1
  - warm_start: False
  - Accuracy_Score_2: 0.6119800521364616
Experiment ID: 228131f9-da73-4ffc-823f-b85ff225321c
Model Name: Random Forest
Experiment ID: 624c6e9e-59c3-4464-b86e-bc7d1677b39a
Model Name: Random Forest
Experiment ID: a6f9b5e1-f475-41b5-86fd-ed6683daa9b8
Model Name: Random Forest
  - n_estimators: 200
  - max_depth: None
  - min_samples_split: 2
  - min_samples_leaf: 1
  - min_weight_fraction_leaf: 0.0
  - max_features: sqrt
  - max_leaf_nodes: None
  - min_impurity_decrease: 0.0
  - min_impurity_split: Not set
  - bootstrap: True
  - oob_score: False
  - n_jobs: -1
  - random_state: 92
  - verbose: 1
  - warm_start: False
  - Accuracy_Score_2: 0.6119800521364616
