## Contents
1. Setup & Imports  
2. Load Data
3. Data Checks
4. Preprocessing & Pipeline
5. Training & Evaluation  
6. Model Export
7. Manually Test Model
8. Streamlit
9. MLFlow

## 1. Setup & Imports

In [152]:
# !pip install xgboost

# !pip freeze > requirements.txt
# !pip install -r requirements.txt

In [153]:
import pandas as pd
import numpy as np
import re
import nltk
import mlflow
import mlflow.sklearn

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression, SGDClassifier, RidgeClassifier, PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, f1_score
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC, LinearSVC
from sklearn.naive_bayes import MultinomialNB, ComplementNB
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer

# Download necessary NLTK data
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('omw-1.4')
nltk.download('punkt_tab')


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\gpietersen\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\gpietersen\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\gpietersen\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\gpietersen\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\gpietersen\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [154]:
stopwords.words('english')

['a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 "he'd",
 "he'll",
 'her',
 'here',
 'hers',
 'herself',
 "he's",
 'him',
 'himself',
 'his',
 'how',
 'i',
 "i'd",
 'if',
 "i'll",
 "i'm",
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it'd",
 "it'll",
 "it's",
 'its',
 'itself',
 "i've",
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'on

## 2. Load Data

In [155]:
train = pd.read_csv("Data/processed/train.csv")
test = pd.read_csv("Data/processed/test.csv")

# Apply post-inspection cleanup: drop rows with missing values
train.dropna(inplace=True)
test.dropna(inplace=True)

display(train.head(), test.head())

Unnamed: 0,headlines,description,content,url,category
0,RBI revises definition of politically-exposed ...,The central bank has also asked chairpersons a...,The Reserve Bank of India (RBI) has changed th...,https://indianexpress.com/article/business/ban...,business
1,NDTV Q2 net profit falls 57.4% to Rs 5.55 cror...,NDTV's consolidated revenue from operations wa...,Broadcaster New Delhi Television Ltd on Monday...,https://indianexpress.com/article/business/com...,business
2,"Akasa Air ‘well capitalised’, can grow much fa...",The initial share sale will be open for public...,Homegrown server maker Netweb Technologies Ind...,https://indianexpress.com/article/business/mar...,business
3,India’s current account deficit declines sharp...,The current account deficit (CAD) was 3.8 per ...,India’s current account deficit declined sharp...,https://indianexpress.com/article/business/eco...,business
4,"States borrowing cost soars to 7.68%, highest ...",The prices shot up reflecting the overall high...,States have been forced to pay through their n...,https://indianexpress.com/article/business/eco...,business


Unnamed: 0,headlines,description,content,url,category
0,NLC India wins contract for power supply to Ra...,State-owned firm NLC India Ltd (NLCIL) on Mond...,State-owned firm NLC India Ltd (NLCIL) on Mond...,https://indianexpress.com/article/business/com...,business
1,SBI Clerk prelims exams dates announced; admit...,SBI Clerk Prelims Exam: The SBI Clerk prelims ...,SBI Clerk Prelims Exam: The State Bank of Indi...,https://indianexpress.com/article/education/sb...,education
2,"Golden Globes: Michelle Yeoh, Will Ferrell, An...","Barbie is the top nominee this year, followed ...","Michelle Yeoh, Will Ferrell, Angela Bassett an...",https://indianexpress.com/article/entertainmen...,entertainment
3,"OnePlus Nord 3 at Rs 27,999 as part of new pri...",New deal makes the OnePlus Nord 3 an easy purc...,"In our review of the OnePlus Nord 3 5G, we pra...",https://indianexpress.com/article/technology/t...,technology
4,Adani family’s partners used ‘opaque’ funds to...,Citing review of files from multiple tax haven...,Millions of dollars were invested in some publ...,https://indianexpress.com/article/business/ada...,business


## 3. Data Checks

In [156]:
# Null checks
print("Null counts (train):")
print(train.isnull().sum())
print("Null counts (test):")
print(test.isnull().sum())

Null counts (train):
headlines      0
description    0
content        0
url            0
category       0
dtype: int64
Null counts (test):
headlines      0
description    0
content        0
url            0
category       0
dtype: int64


Check how frequently each category occurs

In [157]:
display(train.groupby('category').size())
display(test.groupby('category').size())
        

category
business         1120
education        1520
entertainment     960
sports            640
technology       1280
dtype: int64

category
business         400
education        400
entertainment    400
sports           400
technology       400
dtype: int64

Flag articles whose category name does not appear verbatim in the content (`contains` is False for these rows)

In [158]:
train['test'] = train['content'] #train['headlines'] + train['content'] + train['description'] + train['url']
train['contains'] = train.apply(
    lambda row: str(row['category']).lower() in str(row['test']).lower(),
    axis=1
)

train

Unnamed: 0,headlines,description,content,url,category,test,contains
0,RBI revises definition of politically-exposed ...,The central bank has also asked chairpersons a...,The Reserve Bank of India (RBI) has changed th...,https://indianexpress.com/article/business/ban...,business,The Reserve Bank of India (RBI) has changed th...,True
1,NDTV Q2 net profit falls 57.4% to Rs 5.55 cror...,NDTV's consolidated revenue from operations wa...,Broadcaster New Delhi Television Ltd on Monday...,https://indianexpress.com/article/business/com...,business,Broadcaster New Delhi Television Ltd on Monday...,False
2,"Akasa Air ‘well capitalised’, can grow much fa...",The initial share sale will be open for public...,Homegrown server maker Netweb Technologies Ind...,https://indianexpress.com/article/business/mar...,business,Homegrown server maker Netweb Technologies Ind...,False
3,India’s current account deficit declines sharp...,The current account deficit (CAD) was 3.8 per ...,India’s current account deficit declined sharp...,https://indianexpress.com/article/business/eco...,business,India’s current account deficit declined sharp...,True
4,"States borrowing cost soars to 7.68%, highest ...",The prices shot up reflecting the overall high...,States have been forced to pay through their n...,https://indianexpress.com/article/business/eco...,business,States have been forced to pay through their n...,False
...,...,...,...,...,...,...,...
5515,"Samsung sends out invites for ‘Unpacked 2024’,...",Samsung is most likely to announce next-genera...,Samsung plans to reveal the next-generation fl...,https://indianexpress.com/article/technology/t...,technology,Samsung plans to reveal the next-generation fl...,False
5516,Google Pixel 8 Pro accidentally appears on off...,The Pixel 8 Pro will most likely carry over it...,Google once again accidentally gave us a glimp...,https://indianexpress.com/article/technology/m...,technology,Google once again accidentally gave us a glimp...,False
5517,Amazon ad on Google Search redirects users to ...,Clicking on the real looking Amazon ad will op...,A new scam seems to be making rounds on the in...,https://indianexpress.com/article/technology/t...,technology,A new scam seems to be making rounds on the in...,False
5518,"Elon Musk’s X, previously Twitter, now worth l...","Elon Musk's X, formerly Twitter, has lost more...",More than a year after Elon Musk acquired Twit...,https://indianexpress.com/article/technology/s...,technology,More than a year after Elon Musk acquired Twit...,False
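
The row-level flag is easier to read once aggregated per category. A minimal sketch in plain Python, assuming a small hand-made list of `(category, content)` pairs; the rows and the helper name `category_hit_rate` are hypothetical and not taken from the dataset.

```python
from collections import defaultdict

# Hypothetical (category, content) pairs standing in for the real rows
rows = [
    ("business", "Business sentiment improved this quarter."),
    ("business", "The broadcaster reported a lower net profit."),
    ("sports", "The team won the final on Sunday."),
]

def category_hit_rate(rows):
    """Share of rows per category whose content mentions the category name."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for category, content in rows:
        totals[category] += 1
        hits[category] += category.lower() in content.lower()
    return {category: hits[category] / totals[category] for category in totals}

rates = category_hit_rate(rows)
print(rates)
```

On the notebook's DataFrame the same summary should be available with `train.groupby('category')['contains'].mean()`.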


Enable MLflow autologging to record model runs

In [159]:
mlflow.autolog()

2025/10/19 22:23:09 INFO mlflow.tracking.fluent: Autologging successfully enabled for sklearn.
2025/10/19 22:23:09 INFO mlflow.tracking.fluent: Autologging successfully enabled for xgboost.
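
MLflow's own log messages in this notebook suggest pointing runs at a tracking server, either through the `MLFLOW_TRACKING_URI` environment variable or through `mlflow.set_tracking_uri`. A minimal sketch of the environment-variable route, assuming a server is listening at the `http://localhost:5000` address quoted in those messages.

```python
import os

# Assumption: an MLflow tracking server runs at this address (taken from the
# log messages). setdefault keeps any value already set in the shell.
os.environ.setdefault("MLFLOW_TRACKING_URI", "http://localhost:5000")

tracking_uri = os.environ["MLFLOW_TRACKING_URI"]
print(f"MLflow runs will be sent to: {tracking_uri}")
```

Setting the variable before the first MLflow call keeps the notebook code itself free of hard-coded server addresses.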


In [160]:
train['data'] = train['content'] #train['headlines'] + train['category'] + train['content']
test['data'] = test['content'] #test['headlines'] + test['category'] + test['content'] 

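# Take explicit copies so later column assignments do not raise SettingWithCopyWarning
train = train[['data', 'category']].copy()
test = test[['data', 'category']].copy()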

display(train.head(), test.head())

Unnamed: 0,data,category
0,The Reserve Bank of India (RBI) has changed th...,business
1,Broadcaster New Delhi Television Ltd on Monday...,business
2,Homegrown server maker Netweb Technologies Ind...,business
3,India’s current account deficit declined sharp...,business
4,States have been forced to pay through their n...,business


Unnamed: 0,data,category
0,State-owned firm NLC India Ltd (NLCIL) on Mond...,business
1,SBI Clerk Prelims Exam: The State Bank of Indi...,education
2,"Michelle Yeoh, Will Ferrell, Angela Bassett an...",entertainment
3,"In our review of the OnePlus Nord 3 5G, we pra...",technology
4,Millions of dollars were invested in some publ...,business


## 4. Preprocessing & Pipeline

In [161]:
# 1. Improved Preprocessing Function

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def process_text_pro(text: str) -> str:
    # 1. Lowercase
    text = text.lower()
    # 2. Remove URLs, mentions, hashtags
    text = re.sub(r"http\S+|www\S+|@\w+|#\w+", " ", text)
    # 3. Remove non-alphabetic characters (keep spaces)
    text = re.sub(r'[^a-z\s]', ' ', text)
    # 4. Collapse multiple spaces
    text = re.sub(r'\s+', ' ', text).strip()
    # 5. Tokenize
    tokens = word_tokenize(text)
    # 6. Remove stopwords
    tokens = [word for word in tokens if word not in stop_words]
    # 7. Lemmatize
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    return ' '.join(tokens)

train['data'] = train['data'].apply(process_text_pro)
test['data'] = test['data'].apply(process_text_pro)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  train['data'] = train['data'].apply(process_text_pro)


In [162]:
train['data']

0       reserve bank india rbi changed definition poli...
1       broadcaster new delhi television ltd monday re...
2       homegrown server maker netweb technology india...
3       india current account deficit declined sharply...
4       state forced pay nose weekly auction debt tues...
                              ...                        
5515    samsung plan reveal next generation flagship g...
5516    google accidentally gave u glimpse upcoming fl...
5517    new scam seems making round internet legitimat...
5518    year elon musk acquired twitter billion platfo...
5519    apple begun rolling io update eligible iphones...
Name: data, Length: 5520, dtype: object
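
To see what the regex stages of `process_text_pro` do on their own, here is a minimal sketch that repeats steps 1 to 4 with the standard library only. Tokenization, stopword removal, and lemmatization are left out, and the sample sentence and the name `clean_text` are made up for illustration.

```python
import re

def clean_text(text: str) -> str:
    """Steps 1-4 of the preprocessing: lowercase, strip URLs/mentions/hashtags,
    drop non-alphabetic characters, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"http\S+|www\S+|@\w+|#\w+", " ", text)
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

sample = "RBI's Q2 update: rates at 6.5%! See https://example.com #economy @rbi"
cleaned = clean_text(sample)
print(cleaned)  # rbi s q update rates at see
```

Digits and punctuation are discarded, so tokens such as `Q2` or `6.5%` lose their numeric part before the vectorizer ever sees them.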

## 5. Training & Evaluation

In [163]:
X_train = train['data']
y_train = train['category']

X_test = test['data']
y_test = test['category']


# Initialise TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))  # Unigrams and bigrams; adjust max_features as needed

# Fit and transform the text data
X_tfidf = tfidf_vectorizer.fit_transform(X_train)
Xt_tfidf = tfidf_vectorizer.transform(X_test)

# Display the shape of the resulting sparse matrix
print(f"TF-IDF Matrix Shape: {X_tfidf.shape}")

# Display the vocabulary (optional)
print("Vocabulary:", tfidf_vectorizer.get_feature_names_out())

# Initialise the Logistic Regression model
model = LogisticRegression(max_iter=1000, random_state=42)

# Train the model on the training set
model.fit(X_tfidf, y_train)

# Print training accuracy
train_accuracy = model.score(X_tfidf, y_train)
print(f"Training Accuracy: {train_accuracy:.2f}")

# Make predictions on the held-out test set
y_pred = model.predict(Xt_tfidf)

# Compute accuracy on the held-out test set (reported as validation accuracy)
test_accuracy = accuracy_score(y_test, y_pred)
print(f"Validation Accuracy: {test_accuracy:.2f}")

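# Generate a classification report. Rows are ordered by sorted label, so let
# scikit-learn infer the names; y_test.unique() (order of appearance) would mislabel rows.
print("Classification Report:\n", classification_report(y_test, y_pred))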

2025/10/19 22:23:27 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID 'd94c825df78a4665b6dde9d3066c0a93', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow


TF-IDF Matrix Shape: (5520, 5000)
Vocabulary: ['aadhaar' 'aamir' 'aamir khan' ... 'zoom' 'zoya' 'zoya akhtar']
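
The vocabulary above mixes single words (`aamir`) with word pairs (`aamir khan`) because of `ngram_range=(1, 2)`. A minimal sketch of how unigrams and bigrams are generated from a token list; the helper `word_ngrams` is illustrative and not part of scikit-learn.

```python
def word_ngrams(tokens, ngram_range=(1, 2)):
    """Return all word n-grams for n between the two bounds (inclusive)."""
    low, high = ngram_range
    ngrams = []
    for n in range(low, high + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams

features = word_ngrams("aamir khan film".split())
print(features)  # ['aamir', 'khan', 'film', 'aamir khan', 'khan film']
```

With `max_features=5000`, only the most frequent of these candidate terms across the corpus are kept as columns.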


1. Set the MLFLOW_TRACKING_URI environment variable to the desired tracking URI. `export MLFLOW_TRACKING_URI=http://localhost:5000`
2. Set the tracking URI programmatically by calling `mlflow.set_tracking_uri`. `mlflow.set_tracking_uri('http://localhost:5000')`


Training Accuracy: 0.99
Validation Accuracy: 0.97
Classification Report:
                precision    recall  f1-score   support

     business       0.98      0.96      0.97       400
    education       0.99      0.99      0.99       400
entertainment       1.00      0.98      0.99       400
       sports       0.99      0.97      0.98       400
   technology       0.92      0.98      0.95       400

     accuracy                           0.97      2000
    macro avg       0.98      0.97      0.98      2000
 weighted avg       0.98      0.97      0.98      2000



In [None]:

logreg_param_grid = {
    'tfidf__ngram_range': [(1,1), (1,2)],
    'tfidf__min_df': [1, 2, 3],
    'tfidf__max_features': [20000, 40000, None],
    'clf__C': [0.5, 1, 2, 5],
    'clf__penalty': ['l2'],                # (liblinear/saga support l1 too if desired)
    'clf__solver': ['liblinear', 'saga']   # liblinear fits one-vs-rest; saga fits a multinomial model
}

linearsvc_param_grid = {
    'tfidf__ngram_range': [(1,1), (1,2)],
    'tfidf__max_features': [20000, 40000],
    'clf__C': [0.1, 1, 5, 10]
}

nb_param_grid = {
    'tfidf__ngram_range': [(1,1), (1,2)],
    'tfidf__max_features': [20000, 40000],
    'clf__alpha': [0.1, 0.5, 1.0]
}

sgd_param_grid = {
    'tfidf__ngram_range': [(1,1), (1,2)],
    'clf__alpha': [1e-4, 1e-3, 1e-2],
    'clf__penalty': ['l2', 'l1'],
    'clf__loss': ['hinge', 'log_loss']  # hinge ~ SVM, log_loss ~ logistic regression
}

xgb_param_grid = {
    'tfidf__max_features': [10000, 20000],
    'clf__n_estimators': [100, 200],
    'clf__max_depth': [4, 6, 8],
    'clf__learning_rate': [0.05, 0.1]
}


rf_param_grid = {
    'tfidf__max_features': [10000, 20000],
    'clf__n_estimators': [200, 400],
    'clf__max_depth': [None, 20, 40]
}

ridge_param_grid = {
    'tfidf__ngram_range': [(1,1), (1,2)],
    'tfidf__max_features': [20000, 40000],
    'clf__alpha': [0.5, 1.0, 2.0, 5.0],
    'clf__class_weight': [None, 'balanced']
}

pac_param_grid = {
    'tfidf__ngram_range': [(1,1), (1,2)],
    'clf__C': [0.1, 0.5, 1, 2, 5],
    'clf__loss': ['hinge', 'squared_hinge'],
    'clf__class_weight': [None, 'balanced'],
    'clf__max_iter': [2000]
}

cnb_param_grid = {
    'tfidf__ngram_range': [(1,1), (1,2)],
    'tfidf__max_features': [20000, 40000],
    'clf__alpha': [0.1, 0.5, 1.0]
}

models = {
    'logreg': (LogisticRegression(max_iter=2000), logreg_param_grid),
    'linearsvc': (LinearSVC(), linearsvc_param_grid),
    'nb': (MultinomialNB(), nb_param_grid),
    'sgd': (SGDClassifier(), sgd_param_grid),
    'rf': (RandomForestClassifier(), rf_param_grid),
    'ridge': (RidgeClassifier(), ridge_param_grid),
    'pac': (PassiveAggressiveClassifier(), pac_param_grid),
    'cnb': (ComplementNB(), cnb_param_grid),
}

results = []
best_models = {}

for name, (estimator, params) in models.items():
    pipe = Pipeline([
        ('tfidf', TfidfVectorizer(stop_words='english')),
        ('clf', estimator)
    ])
    grid = GridSearchCV(pipe, params, cv=5, n_jobs=-1, scoring='f1_macro')
    grid.fit(X_train, y_train)
    best_models[name] = grid.best_estimator_
    results.append({
        'model': name,
        'best_cv_f1': grid.best_score_,
        'best_params': grid.best_params_
    })
    print(f"{name:10s} | best CV f1: {grid.best_score_:.4f} | {grid.best_params_}")

results_df = pd.DataFrame(results).sort_values('best_cv_f1', ascending=False)
display(results_df)


2025/10/19 22:24:49 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID '1b28c16d2f4349fbb5a1a4758a8590a4', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow
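
The grid search above is scored with `f1_macro`, the unweighted mean of the per-class F1 scores, so small classes count as much as large ones. A minimal pure-Python sketch of that metric on made-up labels; the function name `macro_f1` and the toy data are illustrative only.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, with F1 = 2*TP / (2*TP + FP + FN)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denominator = 2 * tp + fp + fn
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)

# Toy labels: one 'a' article is misclassified as 'b'
y_true = ["a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "b", "b", "b", "c"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
print(f"accuracy: {accuracy:.4f}, macro F1: {score:.4f}")
```

Here the per-class F1 scores are 0.8, 0.8, and 1.0, so macro F1 differs from plain accuracy even on six samples.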


## 6. Model Export

In [107]:
import pickle

with open('pickled_files/model_and_vectorizer.pkl', 'wb') as f:
    pickle.dump({'model': model, 'vectorizer': tfidf_vectorizer}, f)
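
To check that an exported bundle of this shape can be used again, here is a minimal round-trip sketch. It assumes a tiny made-up corpus and an in-memory buffer in place of the `.pkl` file; only the dictionary keys `model` and `vectorizer` mirror the cell above.

```python
import io
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny hypothetical corpus, just enough to fit a model
texts = ["bank rate hike", "stock market profit", "cricket match win", "football league goal"]
labels = ["business", "business", "sports", "sports"]

vectorizer = TfidfVectorizer()
model = LogisticRegression(max_iter=1000).fit(vectorizer.fit_transform(texts), labels)

# Serialize the model together with its vectorizer, then load it back
buffer = io.BytesIO()
pickle.dump({"model": model, "vectorizer": vectorizer}, buffer)
buffer.seek(0)
bundle = pickle.load(buffer)

new_text = ["cricket league win"]
before = model.predict(vectorizer.transform(new_text))
after = bundle["model"].predict(bundle["vectorizer"].transform(new_text))
print(before, after)
```

Keeping the fitted vectorizer next to the model matters because the model only understands the column layout of that exact vocabulary. Pickle files should only be loaded from trusted sources.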

In [None]:
# import streamlit as st
 
# st.write("""
# # My first app
# # Hello *world!*
# """)




## 7. Manually Test Model

In [90]:
input_text = input("Enter your text here: ")

In [91]:
pt = process_text_pro(input_text)
pt

'educational aspect'

In [92]:
vect_t = tfidf_vectorizer.transform([pt])

In [93]:
vect_t

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 2 stored elements and shape (1, 5000)>

In [94]:
model.predict(vect_t)

array(['education'], dtype=object)

### Test the model in a loop

In [77]:
while True:
    input_text = input("Enter text to classify (End/Quit to stop): ")

    # Stop on "end" or "quit" (case-insensitive)
    if input_text.strip().lower() in {"end", "quit"}:
        break

    pt = process_text_pro(input_text)
    vect_t = tfidf_vectorizer.transform([pt])
    p_class = model.predict(vect_t)
    print(f"{input_text} #Classification: {p_class}")



sproting conduct #Classification: ['education']
sporting conduct #Classification: ['technology']
sports manager #Classification: ['sports']
its a sporting liability #Classification: ['technology']


KeyboardInterrupt: Interrupted by user
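
Very short inputs give the model only a handful of features to work with, which may explain some of the unstable predictions above. A minimal sketch for checking how many tokens of a processed input are known to a vocabulary. The toy vocabulary and the helper `vocab_coverage` are hypothetical; in the notebook, `tfidf_vectorizer.vocabulary_` could be passed instead, and this check looks at single words only, not bigrams.

```python
def vocab_coverage(processed_text, vocabulary):
    """Return the known tokens and the share of tokens found in the vocabulary."""
    tokens = processed_text.split()
    known = [token for token in tokens if token in vocabulary]
    share = len(known) / len(tokens) if tokens else 0.0
    return known, share

# Hypothetical vocabulary mapping terms to column indices
toy_vocabulary = {"sport": 0, "manager": 1, "education": 2}

known, share = vocab_coverage("sport manager", toy_vocabulary)
print(known, share)

unknown_known, unknown_share = vocab_coverage("sporting conduct", toy_vocabulary)
print(unknown_known, unknown_share)
```

An input with zero coverage is transformed into an all-zero vector, so the prediction then depends only on the model's intercepts.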

## 8. Streamlit

In [132]:
!streamlit run Streamlit/base_app.py
# http://localhost:8508/

^C


## 9. MLFlow

In [None]:
!mlflow ui  
#  http://127.0.0.1:5000/

^C


2025/10/19 22:07:41 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID '35749d280e0446ba9fac91f432b0a0de', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow
2025/10/19 22:09:50 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID 'd03ad9b9e35b45298d6fd1fa95bd6dd0', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow


ridge      | best CV f1: 0.9823 | {'clf__alpha': 1.0, 'clf__class_weight': 'balanced', 'tfidf__max_features': 40000, 'tfidf__ngram_range': (1, 2)}


2025/10/19 22:11:51 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID '50b531749bfa4cf5b83c70c6eae794cd', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow


pac        | best CV f1: 0.9834 | {'clf__C': 1, 'clf__class_weight': 'balanced', 'clf__loss': 'squared_hinge', 'clf__max_iter': 2000, 'tfidf__ngram_range': (1, 2)}


cnb        | best CV f1: 0.9819 | {'clf__alpha': 0.1, 'tfidf__max_features': 40000, 'tfidf__ngram_range': (1, 2)}


Unnamed: 0,model,best_cv_f1,best_params
1,pac,0.983427,"{'clf__C': 1, 'clf__class_weight': 'balanced',..."
0,ridge,0.982316,"{'clf__alpha': 1.0, 'clf__class_weight': 'bala..."
2,cnb,0.98186,"{'clf__alpha': 0.1, 'tfidf__max_features': 400..."


In [None]:
pipe = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('clf', LogisticRegression(max_iter=2000))
])

param_grid = {
    'tfidf__ngram_range': [(1,1), (1,2)],
    'tfidf__min_df': [1, 2, 3],
    'tfidf__max_features': [20000, 40000, None],
    'clf__C': [0.5, 1, 2, 5],
    'clf__penalty': ['l2'],                # (liblinear/saga support l1 too if desired)
    'clf__solver': ['liblinear', 'saga']   # liblinear fits one-vs-rest; saga fits a multinomial model
}

grid = GridSearchCV(
    estimator=pipe,
    param_grid=param_grid,
    scoring='accuracy',
    cv=5,
    n_jobs=-1,
    verbose=1
)
grid.fit(X_train, y_train)

print("Best params:", grid.best_params_)
print("Best CV accuracy:", grid.best_score_)

# Evaluate on the hold-out validation set
y_val_pred = grid.predict(X_test)
print("Validation accuracy:", accuracy_score(y_test, y_val_pred))
print(classification_report(y_test, y_val_pred))

# Optional: GridSearchCV (refit=True by default) has already refit best_estimator_
# on all of X_train, so this extra fit only repeats that step
best_model = grid.best_estimator_
best_model.fit(X_train, y_train)

# Predict on the test set
test_preds = best_model.predict(X_test)

# If you want a submission file:
out = test_df[['url']].copy()  # or any ID column you have; 'url' is available
out['predicted_category'] = test_preds
out.to_csv("predictions.csv", index=False)
print("Saved predictions to predictions.csv")


2025/10/19 14:17:30 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID 'f53be7814934477885cc01096be632da', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow


Fitting 5 folds for each of 144 candidates, totalling 720 fits
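
The "144 candidates, totalling 720 fits" figure follows directly from the grid: the number of candidates is the product of the option counts, multiplied by the number of CV folds. A minimal sketch that reproduces the arithmetic for the logistic-regression grid; the helper `n_candidates` is illustrative.

```python
from math import prod

def n_candidates(param_grid):
    """Number of parameter combinations in an exhaustive grid search."""
    return prod(len(values) for values in param_grid.values())

# Same grid as in the cell above
param_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'tfidf__min_df': [1, 2, 3],
    'tfidf__max_features': [20000, 40000, None],
    'clf__C': [0.5, 1, 2, 5],
    'clf__penalty': ['l2'],
    'clf__solver': ['liblinear', 'saga'],
}

cv_folds = 5
candidates = n_candidates(param_grid)
total_fits = candidates * cv_folds
print(f"{candidates} candidates x {cv_folds} folds = {total_fits} fits")
```

Running this kind of count before launching a search gives a rough idea of its cost, since every added option multiplies the total.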


Best params: {'clf__C': 5, 'clf__penalty': 'l2', 'clf__solver': 'liblinear', 'tfidf__max_features': 40000, 'tfidf__min_df': 3, 'tfidf__ngram_range': (1, 2)}
Best CV accuracy: 0.983016557088904
Validation accuracy: 0.9855072463768116
               precision    recall  f1-score   support

     business       0.98      0.97      0.98       224
    education       1.00      0.99      1.00       304
entertainment       1.00      1.00      1.00       192
       sports       0.98      0.97      0.98       128
   technology       0.97      0.98      0.98       256

     accuracy                           0.99      1104
    macro avg       0.99      0.98      0.98      1104
 weighted avg       0.99      0.99      0.99      1104



2025/10/19 14:30:53 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID '097fa99c2b6f414287860400d20d8b8d', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow


Saved predictions to predictions.csv


In [130]:
# train_df['text'].head(200)
train_df['category']
# test_df['text'].head(200)

0         business
1         business
2         business
3         business
4         business
           ...    
5515    technology
5516    technology
5517    technology
5518    technology
5519    technology
Name: category, Length: 5520, dtype: object