# Text Classification with TF-IDF and Machine Learning

This notebook builds and evaluates a binary text classifier that separates questions about cars from questions about sleep. It uses TF-IDF (Term Frequency-Inverse Document Frequency) vectorization and compares several scikit-learn classifiers.

## Main Steps

1. **Data Loading**
   - Load training data for two categories: 'cars' and 'sleep' from JSON files
   - Load separate test data for both categories

2. **Data Preprocessing**
   - Extract 'question' field from the test data JSON structure

3. **Text Processing**
   - Lowercase the text, remove punctuation, and tokenize it
   - Lemmatization: reduce words to their base or dictionary form
   - Stop words are removed later, inside `TfidfVectorizer(stop_words='english')`

4. **Feature Extraction**
   - Use TF-IDF vectorization to convert text data into numerical features

5. **Model Building**
   - Create a pipeline that combines TF-IDF vectorization with a classifier
   - Experiment with different classifiers:
     * Logistic Regression
     * Random Forest
     * Naive Bayes
     * Support Vector Machine (SVM)

6. **Model Evaluation**
   - Split data into training and validation sets
   - Use cross-validation to assess model performance
   - Evaluate models using accuracy and a classification report (per-class precision, recall, and F1-score)

7. **Hyperparameter Tuning**
   - Use GridSearchCV to find the best hyperparameters for the chosen model

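8. **Final Model Selection and Testing**
   - Retrain the best model on the combined training and validation data
   - Evaluate that model on the separate test datasets
   - Refit on all available data, including the test sets, and save the pipeline with joblib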
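
To make step 4 concrete, here is a minimal pure-Python sketch of TF-IDF weighting. It assumes the smoothed IDF and L2 normalization that scikit-learn's `TfidfVectorizer` applies by default. The three toy documents and the helper name `tfidf_vectors` are invented for illustration; the notebook itself relies on `TfidfVectorizer`.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, vectors) with one L2-normalized TF-IDF dict per document."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {term: freq * idf[term] for term, freq in tf.items()}
        # Scale each document vector to unit Euclidean length
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()})
    return sorted(df), vectors

docs = ["car engine oil", "car tire pressure", "sleep cycle"]
vocab, vectors = tfidf_vectors(docs)
for doc, vec in zip(docs, vectors):
    print(doc, {t: round(w, 3) for t, w in vec.items()})
```

A term shared by several documents, such as `car`, ends up with a lower weight than a term unique to one document, such as `engine`.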

In [1]:
BASE_PATH = ".."

In [34]:
import pandas as pd # type: ignore

import nltk # type: ignore
from nltk.tokenize import word_tokenize # type: ignore
from nltk.stem import WordNetLemmatizer # type: ignore
import string

from sklearn.feature_extraction.text import TfidfVectorizer # type: ignore
from sklearn.pipeline import Pipeline # type: ignore
from sklearn.metrics import accuracy_score, classification_report # type: ignore
from sklearn.model_selection import train_test_split # type: ignore
from sklearn.model_selection import cross_val_score # type: ignore
from sklearn.model_selection import GridSearchCV # type: ignore

# models
from sklearn.linear_model import LogisticRegression # type: ignore
from sklearn.ensemble import RandomForestClassifier # type: ignore
from sklearn.naive_bayes import MultinomialNB # type: ignore
from sklearn.svm import SVC # type: ignore

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/stepantytarenko/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/stepantytarenko/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/stepantytarenko/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [5]:
# Load the datasets
cars_train = pd.read_json(f'{BASE_PATH}/data/cars_qa.json')
sleep_train = pd.read_json(f'{BASE_PATH}/data/sleep_qa.json')
# Despite the "training_" prefix in their file names, these two files serve as the held-out test sets
cars_test = pd.read_json(f'{BASE_PATH}/data/training_qna_car.json')
sleep_test = pd.read_json(f'{BASE_PATH}/data/training_qna_sleep.json')

In [16]:
cars_test['question'] = cars_test['qna'].apply(lambda x: x['question'])
sleep_test['question'] = sleep_test['qna'].apply(lambda x: x['question'])
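
The test files appear to nest each question under a `qna` key, which is why the cell above extracts it with `apply`. The sketch below performs the same step on made-up records. The `answer` key and the record contents are assumptions, since only `qna` and `question` are visible in the notebook.

```python
import json

# Hypothetical records mimicking the assumed shape of the test files
raw = '''
[
  {"qna": {"question": "How often should I change my oil?", "answer": "..."}},
  {"qna": {"question": "Why do I wake up tired?", "answer": "..."}}
]
'''
records = json.loads(raw)

# Same idea as cars_test['qna'].apply(lambda x: x['question'])
questions = [record["qna"]["question"] for record in records]
print(questions)
```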

In [17]:
# Create the lemmatizer once instead of on every call
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    """Lowercase, strip punctuation, tokenize, and lemmatize a single string."""
    # Convert to lowercase
    text = text.lower()

    # Remove ASCII punctuation (apostrophes too, so "what's" becomes "whats")
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Tokenize the text
    tokens = word_tokenize(text)

    # Lemmatize words (lemmatize() treats each token as a noun by default)
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]

    # Join the tokens back into a single string
    return ' '.join(lemmatized_tokens)

# Apply preprocessing to the datasets
cars_train['preprocessed_question'] = cars_train['question'].apply(preprocess_text)
sleep_train['preprocessed_question'] = sleep_train['question'].apply(preprocess_text)
cars_test['preprocessed_question'] = cars_test['question'].apply(preprocess_text)
sleep_test['preprocessed_question'] = sleep_test['question'].apply(preprocess_text)
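
The first two steps of `preprocess_text` can be checked without NLTK. The sketch below uses an invented sentence and a hypothetical helper name, `lowercase_and_strip`. It shows that `string.punctuation` also removes apostrophes and hyphens, so contractions and hyphenated words are merged rather than split.

```python
import string

def lowercase_and_strip(text):
    """Lowercase the text and delete every ASCII punctuation character."""
    return text.lower().translate(str.maketrans('', '', string.punctuation))

sample = "What's the best anti-lock brake system?"
cleaned = lowercase_and_strip(sample)
print(cleaned)
```

Here `What's` becomes `whats` and `anti-lock` becomes `antilock`. That is harmless for this task, but worth knowing when inspecting the learned vocabulary.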


In [19]:
# Prepare the training data
X = pd.concat([cars_train['preprocessed_question'], sleep_train['preprocessed_question']], axis=0)
y = [1] * len(cars_train) + [0] * len(sleep_train)

# Prepare the test data
X_test = pd.concat([cars_test['preprocessed_question'], sleep_test['preprocessed_question']], axis=0)
y_test = [1] * len(cars_test) + [0] * len(sleep_test)

In [20]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

In [21]:
# Create the TF-IDF and Logistic Regression pipeline
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('clf', LogisticRegression(random_state=42))
])

pipeline.fit(X_train, y_train)

In [22]:
y_pred = pipeline.predict(X_val)

In [23]:
accuracy = accuracy_score(y_val, y_pred)
print(f"Accuracy: {accuracy:.2f}")

Accuracy: 1.00


In [24]:
print("\nClassification Report:")
print(classification_report(y_val, y_pred, target_names=['Sleep', 'Cars']))


Classification Report:
              precision    recall  f1-score   support

       Sleep       1.00      1.00      1.00        28
        Cars       1.00      1.00      1.00        35

    accuracy                           1.00        63
   macro avg       1.00      1.00      1.00        63
weighted avg       1.00      1.00      1.00        63



In [26]:
y_pred = pipeline.predict(X_test)

In [27]:
test_accuracy = accuracy_score(y_test, y_pred)
print(f"Test Accuracy: {test_accuracy:.2f}")

Test Accuracy: 0.94


In [28]:
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Sleep', 'Cars']))


Classification Report:
              precision    recall  f1-score   support

       Sleep       0.96      0.93      0.94        27
        Cars       0.93      0.96      0.94        26

    accuracy                           0.94        53
   macro avg       0.94      0.94      0.94        53
weighted avg       0.94      0.94      0.94        53
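
The rounded figures in this report are consistent with 25 of 27 Sleep questions and 25 of 26 Cars questions being classified correctly. Those counts are inferred from the report rather than printed by the notebook, so treat them as an assumption. The sketch below recomputes the metrics from them.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Assumed confusion counts: rows are true classes, columns are predictions
#                 pred Sleep  pred Cars
# true Sleep          25          2
# true Cars            1         25
sleep = precision_recall_f1(tp=25, fp=1, fn=2)
cars = precision_recall_f1(tp=25, fp=2, fn=1)
accuracy = (25 + 25) / 53

print("Sleep:", [round(v, 2) for v in sleep])
print("Cars: ", [round(v, 2) for v in cars])
print("Accuracy:", round(accuracy, 2))
```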



In [32]:
def full_pipeline(model, X_train, y_train, X_val, y_val, X_test, y_test):
    pipeline = Pipeline([
        ('tfidf', TfidfVectorizer(stop_words='english')),
        ('clf', model)
    ])
    
    cross_val_scores = cross_val_score(pipeline, X_train, y_train, cv=5)
    print(f"{model.__class__.__name__} Cross-Validation Accuracy: {cross_val_scores.mean():.2f} (+/- {cross_val_scores.std() * 2:.2f})")
    
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_val)
    accuracy = accuracy_score(y_val, y_pred)
    print(f"{model.__class__.__name__} Validation Accuracy: {accuracy:.2f}")
    print("\nValidationClassification Report:")
    print(classification_report(y_val, y_pred, target_names=['Sleep', 'Cars']))
    
    y_pred = pipeline.predict(X_test)
    test_accuracy = accuracy_score(y_test, y_pred)
    print(f"{model.__class__.__name__} Test Accuracy: {test_accuracy:.2f}")
    print("\nTest Classification Report:")
    print(classification_report(y_test, y_pred, target_names=['Sleep', 'Cars']))
    return pipeline
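
`cross_val_score` returns one accuracy per fold, and the function above summarizes them as the mean plus or minus two standard deviations. The sketch below reproduces that summary on hypothetical fold scores. The five values are invented, not taken from this notebook's run.

```python
import statistics

# Hypothetical accuracies from a 5-fold cross-validation
fold_scores = [1.00, 1.00, 0.97, 1.00, 0.97]

mean = statistics.fmean(fold_scores)
# NumPy's .std() defaults to the population standard deviation (ddof=0),
# which statistics.pstdev matches
std = statistics.pstdev(fold_scores)

summary = f"Cross-Validation Accuracy: {mean:.2f} (+/- {std * 2:.2f})"
print(summary)
```

With only five folds, the two-standard-deviation band is a rough indication of spread rather than a formal confidence interval.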


In [33]:
models = [
    LogisticRegression(random_state=42),
    RandomForestClassifier(random_state=42),
    MultinomialNB(),
    SVC(random_state=42)
]

for model in models:
    full_pipeline(model, X_train, y_train, X_val, y_val, X_test, y_test)

LogisticRegression Cross-Validation Accuracy: 1.00 (+/- 0.00)
LogisticRegression Validation Accuracy: 1.00

ValidationClassification Report:
              precision    recall  f1-score   support

       Sleep       1.00      1.00      1.00        28
        Cars       1.00      1.00      1.00        35

    accuracy                           1.00        63
   macro avg       1.00      1.00      1.00        63
weighted avg       1.00      1.00      1.00        63

LogisticRegression Test Accuracy: 0.94

Test Classification Report:
              precision    recall  f1-score   support

       Sleep       0.96      0.93      0.94        27
        Cars       0.93      0.96      0.94        26

    accuracy                           0.94        53
   macro avg       0.94      0.94      0.94        53
weighted avg       0.94      0.94      0.94        53

RandomForestClassifier Cross-Validation Accuracy: 0.99 (+/- 0.03)
RandomForestClassifier Validation Accuracy: 1.00

ValidationClassificat [output truncated]

In [36]:
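# optimize performance of LogisticRegression

# Each solver supports only some penalties, so list compatible groups
# instead of crossing every penalty with every solver.
C_values = [0.01, 0.1, 1, 10]
param_grid = [
    {'clf__C': C_values, 'clf__penalty': ['l2'],
     'clf__solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']},
    {'clf__C': C_values, 'clf__penalty': ['l1'],
     'clf__solver': ['liblinear', 'saga']},
    # elasticnet requires saga and an l1_ratio (0.5 is an assumed value)
    {'clf__C': C_values, 'clf__penalty': ['elasticnet'],
     'clf__solver': ['saga'], 'clf__l1_ratio': [0.5]},
]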
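
Not every penalty works with every `LogisticRegression` solver, so a full cross product of penalties and solvers contains combinations that scikit-learn rejects at fit time. The sketch below counts them for four `C` values, three penalties, and five solvers. The `SUPPORTED` dictionary is a hand-written summary of the scikit-learn documentation, not a library API.

```python
from itertools import product

C_VALUES = [0.01, 0.1, 1, 10]
PENALTIES = ['l1', 'l2', 'elasticnet']
SOLVERS = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']

# Penalties each solver accepts (hand-written summary of the documentation)
SUPPORTED = {
    'newton-cg': {'l2'},
    'lbfgs': {'l2'},
    'liblinear': {'l1', 'l2'},
    'sag': {'l2'},
    'saga': {'l1', 'l2', 'elasticnet'},
}

# Every (C, penalty, solver) triple in the full cross product
candidates = list(product(C_VALUES, PENALTIES, SOLVERS))
# Keep only the triples whose penalty the solver supports
valid = [(c, p, s) for c, p, s in candidates if p in SUPPORTED[s]]

print(f"{len(candidates)} candidates, {len(valid)} with a compatible penalty and solver")
print(f"Fits with cv=5: {len(candidates) * 5}")
```

Even among the compatible triples, `elasticnet` additionally needs an `l1_ratio` value, so a grid that omits it cannot evaluate that penalty.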

In [None]:
# create the pipeline
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('clf', LogisticRegression(random_state=42))
])

# perform grid search
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)

# print the best parameters and best score
print("Best parameters: ", grid_search.best_params_)
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))

In [38]:
print('Best parameters: ', grid_search.best_params_)

Best parameters:  {'clf__C': 1, 'clf__penalty': 'l2', 'clf__solver': 'newton-cg'}


In [40]:
# merge train and validation data
X_train_val = pd.concat([X_train, X_val], axis=0)
# y_train and y_val are plain Python lists here, so + concatenates them
y_train_val = y_train + y_val

In [43]:
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('clf', LogisticRegression(random_state=42, C=1, penalty='l2', solver='newton-cg'))
])
pipeline.fit(X_train_val, y_train_val)
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Sleep', 'Cars']))

Accuracy: 0.96

Classification Report:
              precision    recall  f1-score   support

       Sleep       1.00      0.93      0.96        27
        Cars       0.93      1.00      0.96        26

    accuracy                           0.96        53
   macro avg       0.96      0.96      0.96        53
weighted avg       0.96      0.96      0.96        53



In [45]:
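# Combine train, validation, and test data for the final fit.
# No held-out data remains afterwards, so this model is not evaluated again.
X_final = pd.concat([X_train, X_val, X_test], axis=0)
y_final = y_train + y_val + y_test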

In [46]:
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('clf', LogisticRegression(random_state=42, C=1, penalty='l2', solver='newton-cg'))
])
pipeline.fit(X_final, y_final)

In [49]:
# save the pipeline
import joblib
joblib.dump(pipeline, f'{BASE_PATH}/models/logistic_regression_model.pkl')

['../models/logistic_regression_model.pkl']
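
A saved pipeline can later be restored with `joblib.load`. The sketch below is self-contained: it trains a tiny stand-in pipeline on invented sentences and writes it to a temporary directory instead of the notebook's `models` folder. In real use, new questions should first pass through the same `preprocess_text` function, because the pipeline was fitted on preprocessed text.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny invented corpus standing in for the real training data (1 = cars, 0 = sleep)
texts = ["engine oil change", "tire pressure brake", "insomnia bedtime nap", "dream sleep cycle"]
labels = [1, 1, 0, 0]

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('clf', LogisticRegression(random_state=42)),
])
pipeline.fit(texts, labels)

# Round-trip the fitted pipeline through a file on disk
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'demo_model.pkl')
    joblib.dump(pipeline, model_path)
    restored = joblib.load(model_path)

# The restored pipeline vectorizes and classifies in one call
predictions = restored.predict(["brake engine", "bedtime dream"])
print(predictions)
```

Because the vectorizer and classifier are stored together, the loading code needs no separate vocabulary or feature-extraction step.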