# TF-IDF Models — split first (Kaggle)

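This notebook loads the Twitter climate change sentiment dataset from Kaggle via `kagglehub`, splits it into training and test sets, and only then applies `RandomOverSampler` to the training set. The test set therefore keeps its original class distribution, and no oversampled copies leak into it.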


In [1]:
import pandas as pd
import numpy as np
import re
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
from imblearn.over_sampling import RandomOverSampler

In [2]:
# Install and import kagglehub
try:
    import kagglehub
    from kagglehub import KaggleDatasetAdapter
except ImportError:  # install kagglehub only when it is actually missing
    import sys, subprocess

    subprocess.check_call(
        [sys.executable, "-m", "pip", "install", "kagglehub[pandas-datasets]"]
    )
    import kagglehub
    from kagglehub import KaggleDatasetAdapter

# Load Kaggle dataset
file_path = "twitter_sentiment_data.csv"
df = kagglehub.load_dataset(
    KaggleDatasetAdapter.PANDAS,
    "edqian/twitter-climate-change-sentiment-dataset",
    file_path,
)

# Select required columns by exact name
df = df[["message", "sentiment"]]

# Drop sentiment '2' (supports both numeric 2 and string '2')
if df["sentiment"].dtype.kind in {"i", "u", "f"}:
    df = df[df["sentiment"] != 2]
else:
    df = df[df["sentiment"].astype(str) != "2"]

df.head()


|   | message | sentiment |
|---|---|---|
| 0 | @tiniebeany climate change is an interesting h... | -1 |
| 1 | RT @NatGeoChannel: Watch #BeforeTheFlood right... | 1 |
| 2 | Fabulous! Leonardo #DiCaprio's film on #climat... | 1 |
| 3 | RT @Mick_Fanning: Just watched this amazing do... | 1 |
| 5 | Unamshow awache kujinga na iko global warming ... | 0 |


In [3]:
# NLTK prerequisites
nltk.download("stopwords")
nltk.download("wordnet")
nltk.download("omw-1.4")

[nltk_data] Downloading package stopwords to /Users/nafis/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/nafis/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /Users/nafis/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [4]:
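# Preprocess and split
# Build the stopword set, stemmer, and lemmatizer once instead of on every call.
STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()
LEMMATIZER = WordNetLemmatizer()


def preprocess(text):
    """Keep letters only, lowercase, drop stopwords, then stem and lemmatize."""
    text = re.sub(r"[^a-zA-Z]", " ", str(text)).lower()
    words = [w for w in text.split() if w not in STOP_WORDS]
    # Stemming runs first, so the lemmatizer sees stems, not the original words.
    words = [STEMMER.stem(w) for w in words]
    words = [LEMMATIZER.lemmatize(w) for w in words]
    return " ".join(words)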


df["message"] = df["message"].apply(preprocess)
X = df["message"]
y = df["sentiment"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

In [5]:
tfidf = TfidfVectorizer()
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)
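
The vectorizer above is fitted on the training text only and then reused to transform the test text, so the vocabulary and IDF weights come from training data alone. The sketch below illustrates that behaviour on three made-up documents (the documents and variable names are assumptions for illustration, not the notebook's data): a word that appears only in the test document gets no column.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents, not taken from the dataset.
train_docs = ["climate change real", "global warming hoax"]
test_docs = ["climate emergency real"]

vec = TfidfVectorizer()
X_tr = vec.fit_transform(train_docs)  # learns vocabulary and IDF from train only
X_te = vec.transform(test_docs)       # reuses the learned vocabulary

print(sorted(vec.vocabulary_))        # vocabulary learned from the training docs
print("emergency" in vec.vocabulary_)  # unseen test word has no column
print(X_tr.shape, X_te.shape)         # same number of columns in both matrices
```

Calling `fit_transform` on the test set instead would build a different vocabulary, and the resulting columns would no longer line up with what the models were trained on.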

In [6]:
oversampler = RandomOverSampler(random_state=42)
X_train_res, y_train_res = oversampler.fit_resample(X_train_tfidf, y_train)
X_train_res.shape, X_test_tfidf.shape

((55170, 44806), (6934, 44806))
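
To make the "split first, then oversample" order concrete, here is a minimal standard-library sketch. The `random_oversample` helper is a hypothetical, hand-rolled stand-in for `RandomOverSampler`, and the toy rows and the every-fifth-row split are assumptions for illustration. It shows that the resampled training set ends up balanced while the test rows stay untouched.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=42):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        extras = [rng.choice(members) for _ in range(target - len(members))]
        out_rows.extend(members + extras)
        out_labels.extend([label] * target)
    return out_rows, out_labels


# Toy data: 20 made-up "tweets" with an imbalanced label distribution.
rows = [f"tweet {i}" for i in range(20)]
labels = [1] * 14 + [0] * 4 + [-1] * 2

# Split first (every fifth row goes to the test set) ...
test_idx = set(range(0, 20, 5))
train_rows = [r for i, r in enumerate(rows) if i not in test_idx]
train_labels = [y for i, y in enumerate(labels) if i not in test_idx]
test_rows = [r for i, r in enumerate(rows) if i in test_idx]

# ... then oversample the training part only.
res_rows, res_labels = random_oversample(train_rows, train_labels)

print(Counter(train_labels))  # imbalanced class counts before resampling
print(Counter(res_labels))    # equal class counts after resampling
print(set(res_rows).isdisjoint(test_rows))  # no test row appears in the training data
```

If the oversampling ran before the split, copies of the same row could land on both sides of it, and the test scores would be inflated.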

## Models


In [7]:
# Logistic Regression
logreg = LogisticRegression(random_state=42)
logreg.fit(X_train_res, y_train_res)
y_pred = logreg.predict(X_test_tfidf)
print("Logistic Regression:")
print(classification_report(y_test, y_pred))

Logistic Regression:
              precision    recall  f1-score   support

          -1       0.56      0.66      0.60       824
           0       0.57      0.54      0.55      1538
           1       0.85      0.84      0.85      4572

    accuracy                           0.75      6934
   macro avg       0.66      0.68      0.67      6934
weighted avg       0.76      0.75      0.75      6934



STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=100).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
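
The warning above says the solver stopped at the default `max_iter=100` before converging, so the reported scores come from a fit that had not fully converged. One remedy the warning itself suggests is a higher iteration limit. The sketch below tries that on synthetic data; the `make_classification` dataset and the value `1000` are assumptions for illustration, not the notebook's data or a tuned setting.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# Synthetic three-class data standing in for the TF-IDF matrix.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_classes=3, random_state=42
)

clf = LogisticRegression(max_iter=1000, random_state=42)

# Turn a ConvergenceWarning into an error so a silent non-convergence cannot slip by.
with warnings.catch_warnings():
    warnings.simplefilter("error", category=ConvergenceWarning)
    clf.fit(X_demo, y_demo)

# n_iter_ holds the iterations actually used; converged fits stay below max_iter.
converged = bool((clf.n_iter_ < clf.max_iter).all())
print("iterations used:", clf.n_iter_, "converged:", converged)
```

Applied to the notebook, this would mean passing a larger `max_iter` to `LogisticRegression` in cell 7; the metrics may then differ somewhat from those shown above.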


In [8]:
# Random Forest
rfc = RandomForestClassifier(random_state=42)
rfc.fit(X_train_res, y_train_res)
y_pred = rfc.predict(X_test_tfidf)
print("Random Forest:")
print(classification_report(y_test, y_pred))

Random Forest:
              precision    recall  f1-score   support

          -1       0.78      0.42      0.54       824
           0       0.58      0.51      0.54      1538
           1       0.80      0.90      0.85      4572

    accuracy                           0.76      6934
   macro avg       0.72      0.61      0.65      6934
weighted avg       0.75      0.76      0.74      6934



In [9]:
# Multinomial Naive Bayes
nb = MultinomialNB()
nb.fit(X_train_res, y_train_res)
y_pred = nb.predict(X_test_tfidf)
print("Multinomial Naive Bayes:")
print(classification_report(y_test, y_pred))

Multinomial Naive Bayes:
              precision    recall  f1-score   support

          -1       0.44      0.76      0.55       824
           0       0.57      0.45      0.50      1538
           1       0.86      0.80      0.83      4572

    accuracy                           0.72      6934
   macro avg       0.62      0.67      0.63      6934
weighted avg       0.74      0.72      0.72      6934



`MultinomialNB` in `sklearn` is designed for discrete count data such as word counts, and it requires non-negative feature values. `TfidfVectorizer` produces floating-point TF-IDF scores rather than integer counts, but those scores are always non-negative, so fitting on `X_train_res` and predicting on `X_test_tfidf` raises no error.

Although the multinomial model is theoretically defined for counts, it often works with fractional TF-IDF features in practice, and this is a common approach in text classification.
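
The point about non-negativity can be checked directly. The sketch below uses four made-up documents (an assumption for illustration, not the notebook's data). It confirms that the TF-IDF matrix is fractional but non-negative and that `MultinomialNB` fits on it, then shows that the same model rejects a matrix shifted into negative values.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy documents and labels.
docs = [
    "climate change is real",
    "climate hoax claim",
    "global warming is real",
    "warming hoax",
]
labels = [1, -1, 1, -1]

X = TfidfVectorizer().fit_transform(docs)

# TF-IDF scores are non-negative floats, not integer counts.
has_fractional = bool((X.data % 1 != 0).any())
print("min value:", X.min(), "fractional values:", has_fractional)

# Fitting on fractional, non-negative features works.
nb_demo = MultinomialNB().fit(X, labels)

# Shifting the features below zero makes MultinomialNB raise a ValueError.
X_negative = X.toarray() - 1.0
try:
    MultinomialNB().fit(X_negative, labels)
    rejected = False
except ValueError:
    rejected = True
print("negative input rejected:", rejected)
```

This is why transformations that can produce negative features, such as centering the data, do not combine with `MultinomialNB`.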
