# ARI 510 Phase 2b

Project Phase 2b Competition Participation
Ryan Smith
Project competing in: Zoiya's team's project.

From the official description: Your task is to download our Reddit posts dataset and train a model to predict whether they're HIPAA violations. 

The data consists of Reddit posts labeled yes or no, so this is a binary text classification task.

# The Models

I will use SVM and Logistic Regression, which are both good choices for text classification.

# Model 1: Logistic Regression

Logistic regression is a popular and effective model for text classification tasks. It's a simple, yet powerful algorithm that's particularly well-suited for binary classification problems, where the goal is to categorize text into one of two classes (e.g., spam or not spam, positive or negative sentiment). Despite its name, logistic regression is actually a classification algorithm, not a regression one. It works by estimating the probability of a text belonging to a specific class using a logistic function.
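To make the probability estimate described above concrete, here is a minimal sketch of the logistic function applied to a linear score. The weights, bias, and feature values are hypothetical and chosen only for illustration; they are not taken from the trained model in this notebook.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function: maps any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    """Probability of the positive class for one document's feature vector."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)

# Hypothetical weights for three TF-IDF features (illustration only).
weights = [2.0, -1.0, 0.5]
bias = -0.25
features = [0.4, 0.1, 0.0]

p_yes = predict_proba(weights, bias, features)
label = "yes" if p_yes >= 0.5 else "no"
print(f"P(yes) = {p_yes:.3f} -> predicted label: {label}")
```

The 0.5 threshold here is the usual default decision rule for a binary classifier; training consists of choosing the weights and bias so that these probabilities match the labels as closely as possible.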

## Hyperparameters

The hyperparameters we will work with are:

C_values: The inverse of regularization strength. Smaller values penalize complex models more heavily, balancing between fitting the training data well and avoiding overfitting.

max_iter_values:  Sets the maximum number of iterations for the optimization algorithm to find the best model parameters.

solvers:  Specifies the algorithm used to find the best-fitting parameters for the model.

penalties: Determines the type of regularization used to prevent overfitting by shrinking the model's weights.
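Not every solver accepts every penalty, so a grid over these hyperparameters has to skip some pairs. The sketch below builds the valid combinations up front. The compatibility table reflects the scikit-learn `LogisticRegression` documentation as I understand it, and the helper name `valid_combinations` is hypothetical; check the table against the installed version before relying on it.

```python
from itertools import product

# Penalties each solver accepts in scikit-learn's LogisticRegression.
# (None means no regularization; older releases spelled it 'none'.)
SUPPORTED_PENALTIES = {
    "liblinear": {"l1", "l2"},
    "lbfgs": {"l2", None},
    "newton-cg": {"l2", None},
    "sag": {"l2", None},
    "saga": {"l1", "l2", "elasticnet", None},
}

def valid_combinations(C_values, max_iter_values, solvers, penalties):
    """Yield parameter dicts, skipping solver/penalty pairs that are not supported."""
    for C, max_iter, solver, penalty in product(C_values, max_iter_values, solvers, penalties):
        if penalty in SUPPORTED_PENALTIES[solver]:
            yield {"C": C, "max_iter": max_iter, "solver": solver, "penalty": penalty}

grid = list(valid_combinations(
    C_values=[0.01, 1.0, 100.0],
    max_iter_values=[100],
    solvers=["liblinear", "lbfgs", "saga"],
    penalties=["l1", "l2"],
))
print(f"{len(grid)} valid combinations")
for params in grid[:5]:
    print(params)
```

Each dict can be passed straight to the model with `LogisticRegression(**params)`, which keeps the compatibility check out of the training loop.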

## Default Values

In [8]:
# A Basic Logistic Regression Model for Text Classification
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load datasets with a different encoding
train_df = pd.read_csv('zoiya/train.csv', encoding='ISO-8859-1')
dev_df = pd.read_csv('zoiya/dev.csv', encoding='ISO-8859-1')
test_df = pd.read_csv('zoiya/test_data.csv', encoding='ISO-8859-1')

# Preprocess text data
vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(train_df['Features'])
X_dev = vectorizer.transform(dev_df['Features'])
X_test = vectorizer.transform(test_df['Features'])

y_train = train_df['Label']
y_dev = dev_df['Label']

# Define hyperparameters
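# C = 1.0  # Inverse of regularization strength (scikit-learn default)
# max_iter = 100  # Maximum number of iterations (scikit-learn default)
# solver = 'liblinear'  # Solver to use (not the default; LogisticRegression() below uses 'lbfgs')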

# Build the model
# model = LogisticRegression(C=C, max_iter=max_iter, solver=solver)

# Build the model with default values
model = LogisticRegression()

# Train the model
model.fit(X_train, y_train)

# Validate the model
y_dev_pred = model.predict(X_dev)
dev_accuracy = accuracy_score(y_dev, y_dev_pred)
print(f'Validation Accuracy: {dev_accuracy}')

# Test the model
y_test_pred = model.predict(X_test)

# Output predictions
with open('preds.txt', 'w') as f:
    for pred in y_test_pred:
        f.write(pred + '\n')

Validation Accuracy: 0.5


More verbose version of the above code.

In [10]:
# Verbose version
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Load datasets with a different encoding
train_df = pd.read_csv('zoiya/train.csv', encoding='ISO-8859-1')
dev_df = pd.read_csv('zoiya/dev.csv', encoding='ISO-8859-1')
test_df = pd.read_csv('zoiya/test_data.csv', encoding='ISO-8859-1')

print("Datasets loaded successfully.")
print(f"Training set size: {train_df.shape}")
print(f"Validation set size: {dev_df.shape}")
print(f"Test set size: {test_df.shape}")

# Preprocess text data
print("Preprocessing text data...")
vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(train_df['Features'])
X_dev = vectorizer.transform(dev_df['Features'])
X_test = vectorizer.transform(test_df['Features'])

y_train = train_df['Label']
y_dev = dev_df['Label']

print("Text data preprocessed.")
print(f"Number of features: {X_train.shape[1]}")

# Define hyperparameters
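# C = 1.0  # Inverse of regularization strength (scikit-learn default)
# max_iter = 100  # Maximum number of iterations (scikit-learn default)
# solver = 'liblinear'  # Solver to use (not the default; LogisticRegression() below uses 'lbfgs')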

# Build the model
print("Building the Logistic Regression model...")
# model = LogisticRegression(C=C, max_iter=max_iter, solver=solver)
model = LogisticRegression()

# Train the model
print("Training the model...")
model.fit(X_train, y_train)
print("Model training completed.")

# Validate the model
print("Validating the model...")
y_dev_pred = model.predict(X_dev)
dev_accuracy = accuracy_score(y_dev, y_dev_pred)
print(f'Validation Accuracy: {dev_accuracy}')
print("Validation Classification Report:")
print(classification_report(y_dev, y_dev_pred))

# Test the model
print("Testing the model...")
y_test_pred = model.predict(X_test)

# Output predictions
print("Writing predictions to preds.txt...")
with open('preds.txt', 'w') as f:
    for pred in y_test_pred:
        f.write(pred + '\n')
print("Predictions written to preds.txt.")

Datasets loaded successfully.
Training set size: (80, 2)
Validation set size: (10, 2)
Test set size: (10, 1)
Preprocessing text data...
Text data preprocessed.
Number of features: 1873
Building the Logistic Regression model...
Training the model...
Model training completed.
Validating the model...
Validation Accuracy: 0.5
Validation Classification Report:
              precision    recall  f1-score   support

          no       0.44      1.00      0.62         4
         yes       1.00      0.17      0.29         6

    accuracy                           0.50        10
   macro avg       0.72      0.58      0.45        10
weighted avg       0.78      0.50      0.42        10

Testing the model...
Writing predictions to preds.txt...
Predictions written to preds.txt.


## Tuning, with three hyperparameter values that are not default

In [12]:
# we are using C_values, max_iter_values, and solvers as our hyperparameters
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Load datasets with a different encoding
print("Loading datasets...")
train_df = pd.read_csv('zoiya/train.csv', encoding='ISO-8859-1')
dev_df = pd.read_csv('zoiya/dev.csv', encoding='ISO-8859-1')
test_df = pd.read_csv('zoiya/test_data.csv', encoding='ISO-8859-1')

print("Datasets loaded successfully.")
print(f"Training set size: {train_df.shape}")
print(f"Validation set size: {dev_df.shape}")
print(f"Test set size: {test_df.shape}")

# Preprocess text data
print("Preprocessing text data...")
vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(train_df['Features'])
X_dev = vectorizer.transform(dev_df['Features'])
X_test = vectorizer.transform(test_df['Features'])

y_train = train_df['Label']
y_dev = dev_df['Label']

print("Text data preprocessed.")
print(f"Number of features: {X_train.shape[1]}")

# Define hyperparameters
C_values = [0.01, 0.1, 1.0, 10.0, 100.0]  # Different values for regularization strength
max_iter_values = [100, 200, 300]  # Different values for maximum number of iterations
solvers = ['liblinear', 'lbfgs', 'saga']  # Different solvers

best_accuracy = 0
best_params = {}

# Experiment with different hyperparameters
for C in C_values:
    for max_iter in max_iter_values:
        for solver in solvers:
            print(f"Training model with C={C}, max_iter={max_iter}, solver={solver}...")
            model = LogisticRegression(C=C, max_iter=max_iter, solver=solver)
            model.fit(X_train, y_train)
            y_dev_pred = model.predict(X_dev)
            dev_accuracy = accuracy_score(y_dev, y_dev_pred)
            print(f'Validation Accuracy: {dev_accuracy}')
            if dev_accuracy > best_accuracy:
                best_accuracy = dev_accuracy
                best_params = {'C': C, 'max_iter': max_iter, 'solver': solver}

print(f"Best hyperparameters: {best_params}")
print(f"Best validation accuracy: {best_accuracy}")

# Train the model with the best hyperparameters
model = LogisticRegression(**best_params)
model.fit(X_train, y_train)

# Validate the model
y_dev_pred = model.predict(X_dev)
dev_accuracy = accuracy_score(y_dev, y_dev_pred)
print(f'Validation Accuracy: {dev_accuracy}')
print("Validation Classification Report:")
print(classification_report(y_dev, y_dev_pred))

# Test the model
y_test_pred = model.predict(X_test)

# Output predictions
print("Writing predictions to preds.txt...")
with open('preds.txt', 'w') as f:
    for pred in y_test_pred:
        f.write(pred + '\n')
print("Predictions written to preds.txt.")

Loading datasets...
Datasets loaded successfully.
Training set size: (80, 2)
Validation set size: (10, 2)
Test set size: (10, 1)
Preprocessing text data...
Text data preprocessed.
Number of features: 1873
Training model with C=0.01, max_iter=100, solver=liblinear...
Validation Accuracy: 0.4
Training model with C=0.01, max_iter=100, solver=lbfgs...
Validation Accuracy: 0.4
Training model with C=0.01, max_iter=100, solver=saga...
Validation Accuracy: 0.4
Training model with C=0.01, max_iter=200, solver=liblinear...
Validation Accuracy: 0.4
Training model with C=0.01, max_iter=200, solver=lbfgs...
Validation Accuracy: 0.4
Training model with C=0.01, max_iter=200, solver=saga...
Validation Accuracy: 0.4
Training model with C=0.01, max_iter=300, solver=liblinear...
Validation Accuracy: 0.4
Training model with C=0.01, max_iter=300, solver=lbfgs...
Validation Accuracy: 0.4
Training model with C=0.01, max_iter=300, solver=saga...
Validation Accuracy: 0.4
Training model with C=0.1, max_iter=100



The best configuration from this search produced the following results:

Validation Classification Report:
              precision    recall  f1-score   support

          no       0.67      1.00      0.80         4
         yes       1.00      0.67      0.80         6

    accuracy                           0.80        10
   macro avg       0.83      0.83      0.80        10
weighted avg       0.87      0.80      0.80        10

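These results are better than those from the default values (0.80 versus 0.50 validation accuracy). Tuning on this dataset is somewhat challenging because the validation set has only 10 data points: each prediction moves accuracy by 0.1, so many settings produce identical scores. This one does show a good increase, though.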
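The numbers in the report above can be reproduced by hand from the confusion counts, which also shows why scores on 10 validation posts are so coarse. This is a sketch: the counts are inferred from the precision, recall, and support columns rather than read from the model, and the helper `prf` is hypothetical.

```python
def prf(tp: int, fp: int, fn: int):
    """Precision, recall, and F1 from raw counts, returning 0.0 when undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Counts inferred from the report above: all 4 'no' posts are classified
# correctly, and 4 of the 6 'yes' posts are classified correctly.
no_correct, yes_correct, yes_missed = 4, 4, 2

# For class 'yes': the 2 missed posts are false negatives; there are no false positives.
yes_scores = prf(tp=yes_correct, fp=0, fn=yes_missed)
# For class 'no': the 2 missed 'yes' posts were predicted 'no', so they count as false positives.
no_scores = prf(tp=no_correct, fp=yes_missed, fn=0)

accuracy = (no_correct + yes_correct) / 10
print("no :", [round(v, 2) for v in no_scores])
print("yes:", [round(v, 2) for v in yes_scores])
print("accuracy:", accuracy)

# With 10 validation posts, accuracy can only take 11 distinct values.
possible = [k / 10 for k in range(11)]
print(possible)
```

Because a single changed prediction shifts accuracy by a full 0.1, small differences between hyperparameter settings are hard to distinguish from noise on a validation set this size.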

## Tuning, new hyperparameter 'penalty'

In [13]:
# penalty is added to the hyperparameters
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Load datasets with a different encoding
print("Loading datasets...")
train_df = pd.read_csv('zoiya/train.csv', encoding='ISO-8859-1')
dev_df = pd.read_csv('zoiya/dev.csv', encoding='ISO-8859-1')
test_df = pd.read_csv('zoiya/test_data.csv', encoding='ISO-8859-1')

print("Datasets loaded successfully.")
print(f"Training set size: {train_df.shape}")
print(f"Validation set size: {dev_df.shape}")
print(f"Test set size: {test_df.shape}")

# Preprocess text data
print("Preprocessing text data...")
vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(train_df['Features'])
X_dev = vectorizer.transform(dev_df['Features'])
X_test = vectorizer.transform(test_df['Features'])

y_train = train_df['Label']
y_dev = dev_df['Label']

print("Text data preprocessed.")
print(f"Number of features: {X_train.shape[1]}")

# Define hyperparameters
C_values = [0.01, 0.1, 1.0, 10.0, 100.0]  # Different values for regularization strength
max_iter_values = [100, 200, 300]  # Different values for maximum number of iterations
solvers = ['liblinear', 'lbfgs', 'saga']  # Different solvers
penalties = ['l1', 'l2']  # Different penalties

best_accuracy = 0
best_params = {}

# Experiment with different hyperparameters
for C in C_values:
    for max_iter in max_iter_values:
        for solver in solvers:
            for penalty in penalties:
                # 'liblinear' and 'saga' solvers support 'l1' penalty
                if solver in ['liblinear', 'saga'] or penalty == 'l2':
                    print(f"Training model with C={C}, max_iter={max_iter}, solver={solver}, penalty={penalty}...")
                    model = LogisticRegression(C=C, max_iter=max_iter, solver=solver, penalty=penalty)
                    model.fit(X_train, y_train)
                    y_dev_pred = model.predict(X_dev)
                    dev_accuracy = accuracy_score(y_dev, y_dev_pred)
                    print(f'Validation Accuracy: {dev_accuracy}')
                    print("Validation Classification Report:")
                    report = classification_report(y_dev, y_dev_pred)
                    print(report)
                    if dev_accuracy > best_accuracy:
                        best_accuracy = dev_accuracy
                        best_params = {'C': C, 'max_iter': max_iter, 'solver': solver, 'penalty': penalty}

print(f"Best hyperparameters: {best_params}")
print(f"Best validation accuracy: {best_accuracy}")

# Train the model with the best hyperparameters
model = LogisticRegression(**best_params)
model.fit(X_train, y_train)

# Validate the model
y_dev_pred = model.predict(X_dev)
dev_accuracy = accuracy_score(y_dev, y_dev_pred)
print(f'Validation Accuracy: {dev_accuracy}')
print("Validation Classification Report:")
report = classification_report(y_dev, y_dev_pred)
print(report)

# Test the model
y_test_pred = model.predict(X_test)

# Output predictions
print("Writing predictions to preds.txt...")
with open('preds.txt', 'w') as f:
    for pred in y_test_pred:
        f.write(pred + '\n')
print("Predictions written to preds.txt.")

Loading datasets...
Datasets loaded successfully.
Training set size: (80, 2)
Validation set size: (10, 2)
Test set size: (10, 1)
Preprocessing text data...
Text data preprocessed.
Number of features: 1873
Training model with C=0.01, max_iter=100, solver=liblinear, penalty=l1...
Validation Accuracy: 0.4
Validation Classification Report:
              precision    recall  f1-score   support

          no       0.40      1.00      0.57         4
         yes       0.00      0.00      0.00         6

    accuracy                           0.40        10
   macro avg       0.20      0.50      0.29        10
weighted avg       0.16      0.40      0.23        10

Training model with C=0.01, max_iter=100, solver=liblinear, penalty=l2...
Validation Accuracy: 0.4
Validation Classification Report:
              precision    recall  f1-score   support

          no       0.40      1.00      0.57         4
         yes       0.00      0.00      0.00         6

    accuracy                          

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))

Validation Accuracy: 0.6
Validation Classification Report:
              precision    recall  f1-score   support

          no       0.50      1.00      0.67         4
         yes       1.00      0.33      0.50         6

    accuracy                           0.60        10
   macro avg       0.75      0.67      0.58        10
weighted avg       0.80      0.60      0.57        10

Training model with C=100.0, max_iter=100, solver=liblinear, penalty=l2...
Validation Accuracy: 0.8
Validation Classification Report:
              precision    recall  f1-score   support

          no       0.67      1.00      0.80         4
         yes       1.00      0.67      0.80         6

    accuracy                           0.80        10
   macro avg       0.83      0.83      0.80        10
weighted avg       0.87      0.80      0.80        10

Training model with C=100.0, max_iter=100, solver=lbfgs, penalty=l2...
Validation Accuracy: 0.8
Validation Classification Report:
              precision



## Tuning

In [14]:
# New settings for the hyperparameters
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Load datasets with a different encoding
print("Loading datasets...")
train_df = pd.read_csv('zoiya/train.csv', encoding='ISO-8859-1')
dev_df = pd.read_csv('zoiya/dev.csv', encoding='ISO-8859-1')
test_df = pd.read_csv('zoiya/test_data.csv', encoding='ISO-8859-1')

print("Datasets loaded successfully.")
print(f"Training set size: {train_df.shape}")
print(f"Validation set size: {dev_df.shape}")
print(f"Test set size: {test_df.shape}")

# Preprocess text data
print("Preprocessing text data...")
vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(train_df['Features'])
X_dev = vectorizer.transform(dev_df['Features'])
X_test = vectorizer.transform(test_df['Features'])

y_train = train_df['Label']
y_dev = dev_df['Label']

print("Text data preprocessed.")
print(f"Number of features: {X_train.shape[1]}")

# Define hyperparameters
C_values = [0.001, 0.005, 0.01]  # Much smaller values for regularization strength
max_iter_values = [50, 75, 100]  # Smaller values for maximum number of iterations
solvers = ['newton-cg', 'sag']  # Different solvers
penalties = ['l2', 'none']  # Different penalties, including no regularization

best_accuracy = 0
best_params = {}

# Experiment with different hyperparameters
for C in C_values:
    for max_iter in max_iter_values:
        for solver in solvers:
            for penalty in penalties:
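                # This filter is carried over from the previous cell. With the solvers
                # above ('newton-cg', 'sag') it only lets penalty='l2' through, so the
                # 'none' setting is never actually trained in this run.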
                if solver in ['liblinear', 'saga'] or penalty == 'l2':
                    print(f"Training model with C={C}, max_iter={max_iter}, solver={solver}, penalty={penalty}...")
                    model = LogisticRegression(C=C, max_iter=max_iter, solver=solver, penalty=penalty)
                    model.fit(X_train, y_train)
                    y_dev_pred = model.predict(X_dev)
                    dev_accuracy = accuracy_score(y_dev, y_dev_pred)
                    print(f'Validation Accuracy: {dev_accuracy}')
                    print("Validation Classification Report:")
                    report = classification_report(y_dev, y_dev_pred)
                    print(report)
                    if dev_accuracy > best_accuracy:
                        best_accuracy = dev_accuracy
                        best_params = {'C': C, 'max_iter': max_iter, 'solver': solver, 'penalty': penalty}

print(f"Best hyperparameters: {best_params}")
print(f"Best validation accuracy: {best_accuracy}")

# Train the model with the best hyperparameters
model = LogisticRegression(**best_params)
model.fit(X_train, y_train)

# Validate the model
y_dev_pred = model.predict(X_dev)
dev_accuracy = accuracy_score(y_dev, y_dev_pred)
print(f'Validation Accuracy: {dev_accuracy}')
print("Validation Classification Report:")
report = classification_report(y_dev, y_dev_pred)
print(report)

# Test the model
y_test_pred = model.predict(X_test)

# Output predictions
print("Writing predictions to preds.txt...")
with open('preds.txt', 'w') as f:
    for pred in y_test_pred:
        f.write(pred + '\n')
print("Predictions written to preds.txt.")

Loading datasets...
Datasets loaded successfully.
Training set size: (80, 2)
Validation set size: (10, 2)
Test set size: (10, 1)
Preprocessing text data...
Text data preprocessed.
Number of features: 1873
Training model with C=0.001, max_iter=50, solver=newton-cg, penalty=l2...
Validation Accuracy: 0.4
Validation Classification Report:
              precision    recall  f1-score   support

          no       0.40      1.00      0.57         4
         yes       0.00      0.00      0.00         6

    accuracy                           0.40        10
   macro avg       0.20      0.50      0.29        10
weighted avg       0.16      0.40      0.23        10

Training model with C=0.001, max_iter=50, solver=sag, penalty=l2...
Validation Accuracy: 0.4
Validation Classification Report:
              precision    recall  f1-score   support

          no       0.40      1.00      0.57         4
         yes       0.00      0.00      0.00         6

    accuracy                           0.40 

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))

These drastic changes produced poor results:

Validation Accuracy: 0.4
Validation Classification Report:
              precision    recall  f1-score   support

          no       0.40      1.00      0.57         4
         yes       0.00      0.00      0.00         6

    accuracy                           0.40        10
   macro avg       0.20      0.50      0.29        10
weighted avg       0.16      0.40      0.23        10

I tested other settings as well, but including the code for every tuning run would be too much for this report. The earlier configuration with a validation accuracy of 0.8 is the best I could get. The current one is the most drastic combination I tried, and it resulted in very poor scores.
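The 0.4 report above, with recall 1.00 for 'no' and 0.00 for 'yes', is what a classifier produces when it predicts 'no' for every post. The sketch below compares that with the other constant prediction. The label list is reconstructed from the support column (4 'no', 6 'yes'), and the helper name is hypothetical.

```python
from collections import Counter

# Validation labels reconstructed from the 'support' column: 4 'no' and 6 'yes'.
# (The order is arbitrary; only the counts matter here.)
y_dev = ["no"] * 4 + ["yes"] * 6

def constant_baseline_accuracy(labels, constant):
    """Accuracy of a classifier that predicts the same label for every post."""
    return sum(1 for y in labels if y == constant) / len(labels)

for constant in ("no", "yes"):
    acc = constant_baseline_accuracy(y_dev, constant)
    print(f"always predict {constant!r}: accuracy = {acc}")

majority_label, _ = Counter(y_dev).most_common(1)[0]
print("majority class on the validation set:", majority_label)
```

A plausible reading, though not verified here, is that the very small C values regularize the weights so heavily that the model falls back to a single class. Comparing against these constant baselines is a quick way to spot that situation.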

# Model 2: SVM

Support Vector Machines (SVMs) are powerful algorithms used in supervised learning for text classification tasks. They work by finding an optimal hyperplane that maximally separates different classes in a high-dimensional space. In the context of text classification, each document is represented as a point in this space, and the SVM aims to find the hyperplane that best separates documents belonging to different categories (e.g., spam vs. ham, positive vs. negative sentiment). The "support vectors" are the data points closest to the hyperplane and have the most influence on its position. SVMs are effective for text classification because they can handle high-dimensional data and are good at capturing complex relationships between words and classes. 
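As a rough illustration of the separating hyperplane, the sketch below scores a few points against a fixed hyperplane and measures their distance to it. The weight vector, offset, and documents are hypothetical; a real SVM learns `w` and `b` from the training data, and this example covers only the linear case.

```python
import math

def decision_function(w, b, x):
    """Signed score w.x + b; its sign picks the side of the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def distance_to_hyperplane(w, b, x):
    """Geometric distance from x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(decision_function(w, b, x)) / norm

# Hypothetical hyperplane and documents in a 2-feature space (illustration only).
w, b = [3.0, 4.0], -5.0
docs = {"doc_a": [2.0, 1.0], "doc_b": [0.0, 0.5], "doc_c": [1.0, 0.75]}

for name, x in docs.items():
    score = decision_function(w, b, x)
    label = "yes" if score >= 0 else "no"
    dist = distance_to_hyperplane(w, b, x)
    print(f"{name}: score={score:+.2f}, distance={dist:.2f}, label={label}")
```

In a trained SVM, the points with the smallest distance to the hyperplane play the role of support vectors, and training chooses the hyperplane that makes that smallest distance as large as possible.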

## Hyperparameters

The hyperparameters we will work with are:

C_values: Controls the penalty for misclassifications, balancing between maximizing the margin and minimizing training errors.

kernels: Specifies the type of function used to map data points to a higher-dimensional space where they might be linearly separable.

gammas:  Determines the influence of a single training example, affecting the decision boundary's curvature and model complexity.
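The two string settings for gamma used in this notebook, 'scale' and 'auto', resolve to numbers computed from the feature matrix. The sketch below reproduces those formulas as documented for scikit-learn's `SVC` and shows their effect on an RBF kernel value. The toy matrix and helper names are hypothetical, and plain Python lists stand in for the sparse TF-IDF matrix.

```python
import math

def variance(values):
    """Population variance, matching the convention of NumPy's default var()."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def gamma_auto(n_features):
    """gamma='auto' in scikit-learn's SVC: 1 / n_features."""
    return 1.0 / n_features

def gamma_scale(X):
    """gamma='scale' in scikit-learn's SVC: 1 / (n_features * X.var())."""
    n_features = len(X[0])
    flat = [v for row in X for v in row]
    return 1.0 / (n_features * variance(flat))

def rbf_kernel(x, y, gamma):
    """RBF kernel value exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

# Tiny hypothetical feature matrix (3 documents, 2 features).
X = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]

for name, gamma in (("auto", gamma_auto(len(X[0]))), ("scale", gamma_scale(X))):
    k = rbf_kernel(X[0], X[1], gamma)
    print(f"gamma={name}: value={gamma:.3f}, K(x0, x1)={k:.4f}")
```

A larger gamma makes the kernel value fall off faster with distance, so each training example influences a smaller neighborhood and the decision boundary can curve more. Note that gamma only matters for kernels such as 'rbf'; a 'linear' kernel ignores it.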

## Default Values

In [16]:
# Default values test for the hyperparameters
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

# Load datasets with a different encoding
print("Loading datasets...")
train_df = pd.read_csv('zoiya/train.csv', encoding='ISO-8859-1')
dev_df = pd.read_csv('zoiya/dev.csv', encoding='ISO-8859-1')
test_df = pd.read_csv('zoiya/test_data.csv', encoding='ISO-8859-1')

print("Datasets loaded successfully.")
print(f"Training set size: {train_df.shape}")
print(f"Validation set size: {dev_df.shape}")
print(f"Test set size: {test_df.shape}")

# Preprocess text data
print("Preprocessing text data...")
vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(train_df['Features'])
X_dev = vectorizer.transform(dev_df['Features'])
X_test = vectorizer.transform(test_df['Features'])

y_train = train_df['Label']
y_dev = dev_df['Label']

print("Text data preprocessed.")
print(f"Number of features: {X_train.shape[1]}")

# Define hyperparameters
C_values = [1.0]  # Default value for regularization strength
kernels = ['rbf']  # Default kernel
gammas = ['scale']  # Default gamma value

best_accuracy = 0
best_params = {}

# Experiment with different hyperparameters
for C in C_values:
    for kernel in kernels:
        for gamma in gammas:
            print(f"Training model with C={C}, kernel={kernel}, gamma={gamma}...")
            model = SVC(C=C, kernel=kernel, gamma=gamma)
            model.fit(X_train, y_train)
            y_dev_pred = model.predict(X_dev)
            dev_accuracy = accuracy_score(y_dev, y_dev_pred)
            print(f'Validation Accuracy: {dev_accuracy}')
            print("Validation Classification Report:")
            report = classification_report(y_dev, y_dev_pred)
            print(report)
            if dev_accuracy > best_accuracy:
                best_accuracy = dev_accuracy
                best_params = {'C': C, 'kernel': kernel, 'gamma': gamma}

print(f"Best hyperparameters: {best_params}")
print(f"Best validation accuracy: {best_accuracy}")

# Train the model with the hyperparameters
model = SVC(**best_params)
model.fit(X_train, y_train)

# Validate the model
y_dev_pred = model.predict(X_dev)
dev_accuracy = accuracy_score(y_dev, y_dev_pred)
print(f'Validation Accuracy: {dev_accuracy}')
print("Validation Classification Report:")
report = classification_report(y_dev, y_dev_pred)
print(report)

# Test the model
y_test_pred = model.predict(X_test)

# Output predictions
print("Writing predictions to preds.txt...")
with open('preds.txt', 'w') as f:
    for pred in y_test_pred:
        f.write(pred + '\n')
print("Predictions written to preds.txt.")

Loading datasets...
Datasets loaded successfully.
Training set size: (80, 2)
Validation set size: (10, 2)
Test set size: (10, 1)
Preprocessing text data...
Text data preprocessed.
Number of features: 1873
Training model with C=1.0, kernel=rbf, gamma=scale...
Validation Accuracy: 0.5
Validation Classification Report:
              precision    recall  f1-score   support

          no       0.44      1.00      0.62         4
         yes       1.00      0.17      0.29         6

    accuracy                           0.50        10
   macro avg       0.72      0.58      0.45        10
weighted avg       0.78      0.50      0.42        10

Best hyperparameters: {'C': 1.0, 'kernel': 'rbf', 'gamma': 'scale'}
Best validation accuracy: 0.5
Validation Accuracy: 0.5
Validation Classification Report:
              precision    recall  f1-score   support

          no       0.44      1.00      0.62         4
         yes       1.00      0.17      0.29         6

    accuracy                      

These produced decent values:

Validation Accuracy: 0.5
Validation Classification Report:
              precision    recall  f1-score   support

          no       0.44      1.00      0.62         4
         yes       1.00      0.17      0.29         6

    accuracy                           0.50        10
   macro avg       0.72      0.58      0.45        10
weighted avg       0.78      0.50      0.42        10

Interestingly, these are considerably better than the results from our tuned values, which are shown in the next section.

## Tuning

In [None]:
# SVM tuning with a pseudo grid search over C, kernel, and gamma
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

# Load datasets with a different encoding
print("Loading datasets...")
train_df = pd.read_csv('zoiya/train.csv', encoding='ISO-8859-1')
dev_df = pd.read_csv('zoiya/dev.csv', encoding='ISO-8859-1')
test_df = pd.read_csv('zoiya/test_data.csv', encoding='ISO-8859-1')

print("Datasets loaded successfully.")
print(f"Training set size: {train_df.shape}")
print(f"Validation set size: {dev_df.shape}")
print(f"Test set size: {test_df.shape}")

# Preprocess text data
print("Preprocessing text data...")
vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(train_df['Features'])
X_dev = vectorizer.transform(dev_df['Features'])
X_test = vectorizer.transform(test_df['Features'])

y_train = train_df['Label']
y_dev = dev_df['Label']

print("Text data preprocessed.")
print(f"Number of features: {X_train.shape[1]}")

# Define hyperparameters
C_values = [0.001, 0.005, 0.01]  # Different values for regularization strength
kernels = ['linear', 'rbf']  # Different kernels
gammas = ['scale', 'auto']  # Different gamma values

best_accuracy = 0
best_params = {}

# Experiment with different hyperparameters
# Pseudo grid search
for C in C_values:
    for kernel in kernels:
        for gamma in gammas:
            print(f"Training model with C={C}, kernel={kernel}, gamma={gamma}...")
            model = SVC(C=C, kernel=kernel, gamma=gamma)
            model.fit(X_train, y_train)
            y_dev_pred = model.predict(X_dev)
            dev_accuracy = accuracy_score(y_dev, y_dev_pred)
            print(f'Validation Accuracy: {dev_accuracy}')
            print("Validation Classification Report:")
            report = classification_report(y_dev, y_dev_pred)
            print(report)
            if dev_accuracy > best_accuracy:
                best_accuracy = dev_accuracy
                best_params = {'C': C, 'kernel': kernel, 'gamma': gamma}

print(f"Best hyperparameters: {best_params}")
print(f"Best validation accuracy: {best_accuracy}")

# Train the model with the best hyperparameters
model = SVC(**best_params)
model.fit(X_train, y_train)

# Validate the model
y_dev_pred = model.predict(X_dev)
dev_accuracy = accuracy_score(y_dev, y_dev_pred)
print(f'Validation Accuracy: {dev_accuracy}')
print("Validation Classification Report:")
report = classification_report(y_dev, y_dev_pred)
print(report)

# Test the model
y_test_pred = model.predict(X_test)

# Output predictions
print("Writing predictions to preds.txt...")
with open('preds.txt', 'w') as f:
    for pred in y_test_pred:
        f.write(pred + '\n')
print("Predictions written to preds.txt.")

Loading datasets...
Datasets loaded successfully.
Training set size: (80, 2)
Validation set size: (10, 2)
Test set size: (10, 1)
Preprocessing text data...
Text data preprocessed.
Number of features: 1873
Training model with C=0.001, kernel=linear, gamma=scale...
Validation Accuracy: 0.4
Validation Classification Report:
              precision    recall  f1-score   support

          no       0.40      1.00      0.57         4
         yes       0.00      0.00      0.00         6

    accuracy                           0.40        10
   macro avg       0.20      0.50      0.29        10
weighted avg       0.16      0.40      0.23        10

Training model with C=0.001, kernel=linear, gamma=auto...
Validation Accuracy: 0.4
Validation Classification Report:
              precision    recall  f1-score   support

          no       0.40      1.00      0.57         4
         yes       0.00      0.00      0.00         6

    accuracy                           0.40        10
   macro avg   

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))

## Tuning

In [18]:
# Some different hyperparameter settings with a pseudo grid search
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

# Load datasets with a different encoding
print("Loading datasets...")
train_df = pd.read_csv('zoiya/train.csv', encoding='ISO-8859-1')
dev_df = pd.read_csv('zoiya/dev.csv', encoding='ISO-8859-1')
test_df = pd.read_csv('zoiya/test_data.csv', encoding='ISO-8859-1')

print("Datasets loaded successfully.")
print(f"Training set size: {train_df.shape}")
print(f"Validation set size: {dev_df.shape}")
print(f"Test set size: {test_df.shape}")

# Preprocess text data
print("Preprocessing text data...")
vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(train_df['Features'])
X_dev = vectorizer.transform(dev_df['Features'])
X_test = vectorizer.transform(test_df['Features'])

y_train = train_df['Label']
y_dev = dev_df['Label']

print("Text data preprocessed.")
print(f"Number of features: {X_train.shape[1]}")

# Define hyperparameters
C_values = [0.1, 1.0, 10.0]  # Inverse regularization strength (smaller C = stronger regularization)
kernels = ['linear', 'poly']  # Kernels to compare
gammas = ['auto', 0.1, 1.0]  # Kernel coefficient; used by 'poly', ignored by 'linear'

best_accuracy = 0
best_params = {}

# Experiment with different hyperparameters
# Pseudo grid search: a hand-written loop over every combination
for C in C_values:
    for kernel in kernels:
        for gamma in gammas:
            print(f"Training model with C={C}, kernel={kernel}, gamma={gamma}...")
            model = SVC(C=C, kernel=kernel, gamma=gamma)
            model.fit(X_train, y_train)
            y_dev_pred = model.predict(X_dev)
            dev_accuracy = accuracy_score(y_dev, y_dev_pred)
            print(f'Validation Accuracy: {dev_accuracy}')
            print("Validation Classification Report:")
            report = classification_report(y_dev, y_dev_pred)
            print(report)
            if dev_accuracy > best_accuracy:
                best_accuracy = dev_accuracy
                best_params = {'C': C, 'kernel': kernel, 'gamma': gamma}

print(f"Best hyperparameters: {best_params}")
print(f"Best validation accuracy: {best_accuracy}")

# Train the model with the best hyperparameters
model = SVC(**best_params)
model.fit(X_train, y_train)

# Validate the model
y_dev_pred = model.predict(X_dev)
dev_accuracy = accuracy_score(y_dev, y_dev_pred)
print(f'Validation Accuracy: {dev_accuracy}')
print("Validation Classification Report:")
report = classification_report(y_dev, y_dev_pred)
print(report)

# Test the model
y_test_pred = model.predict(X_test)

# Output predictions
print("Writing predictions to preds.txt...")
with open('preds.txt', 'w') as f:
    for pred in y_test_pred:
        f.write(f"{pred}\n")  # f-string also works if labels are not plain strings
print("Predictions written to preds.txt.")

Loading datasets...
Datasets loaded successfully.
Training set size: (80, 2)
Validation set size: (10, 2)
Test set size: (10, 1)
Preprocessing text data...
Text data preprocessed.
Number of features: 1873
Training model with C=0.1, kernel=linear, gamma=auto...
Validation Accuracy: 0.4
Validation Classification Report:
              precision    recall  f1-score   support

          no       0.40      1.00      0.57         4
         yes       0.00      0.00      0.00         6

    accuracy                           0.40        10
   macro avg       0.20      0.50      0.29        10
weighted avg       0.16      0.40      0.23        10

Training model with C=0.1, kernel=linear, gamma=0.1...
Validation Accuracy: 0.4
Validation Classification Report:
              precision    recall  f1-score   support

          no       0.40      1.00      0.57         4
         yes       0.00      0.00      0.00         6

    accuracy                           0.40        10
   macro avg       0.

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))

Tuning with the pseudo grid search raised the SVM's validation accuracy to 0.8, which ties the best score obtained across both models.

Best hyperparameters: {'C': 1.0, 'kernel': 'linear', 'gamma': 'auto'}
Best validation accuracy: 0.8
Validation Accuracy: 0.8
Validation Classification Report:
              precision    recall  f1-score   support

          no       0.67      1.00      0.80         4
         yes       1.00      0.67      0.80         6

    accuracy                           0.80        10
   macro avg       0.83      0.83      0.80        10
weighted avg       0.87      0.80      0.80        10
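
The figures in this report can be checked by hand. The sketch below rebuilds them from confusion counts inferred from the reported recall and support values (4 of 4 "no" posts and 4 of 6 "yes" posts classified correctly). The counts and the helper name `prf` are assumptions for illustration, not output from the notebook.

```python
# Confusion counts inferred from the report above (an assumption, not notebook output):
# all 4 "no" posts were predicted "no"; 4 of the 6 "yes" posts were predicted "yes".
counts = {
    "no": {"tp": 4, "fp": 2, "fn": 0},
    "yes": {"tp": 4, "fp": 0, "fn": 2},
}

def prf(tp, fp, fn):
    """Return (precision, recall, f1) for one class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

per_class = {label: prf(**c) for label, c in counts.items()}
support = {label: c["tp"] + c["fn"] for label, c in counts.items()}
total = sum(support.values())

# Accuracy is the share of all posts that were classified correctly.
accuracy = sum(c["tp"] for c in counts.values()) / total
# Macro average: plain mean over classes. Weighted average: mean weighted by support.
macro_precision = sum(p for p, _, _ in per_class.values()) / len(per_class)
weighted_precision = sum(per_class[l][0] * support[l] for l in per_class) / total

print(f"accuracy={accuracy:.2f}")
print(f"macro precision={macro_precision:.2f}")
print(f"weighted precision={weighted_precision:.2f}")
```

The weighted average leans toward the "yes" class because it has 6 of the 10 validation posts, which is why the weighted precision is higher than the macro precision here.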

Logistic regression also topped out at this score, so either model could be submitted to the competition.

The pseudo grid search, marked in the code comments, simply loops over every combination of the listed hyperparameter values. I believe that, for this dataset, this score is probably as good as it will get.
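
As a minimal sketch of the same idea, the three nested loops can be written with `itertools.product`. The function `grid_search`, the `score_fn` argument, and the `toy_score` placeholder are hypothetical names introduced here. In the notebook the scoring step is the SVC fit followed by `accuracy_score` on the dev set.

```python
from itertools import product

# Same grid as the tuning cell.
C_values = [0.1, 1.0, 10.0]
kernels = ['linear', 'poly']
gammas = ['auto', 0.1, 1.0]

def grid_search(score_fn):
    """Try every (C, kernel, gamma) combination and keep the best score.

    score_fn is a hypothetical callable that trains a model with the given
    settings and returns its validation accuracy.
    """
    best_score, best_params, tried = 0, {}, 0
    # product() iterates in the same order as the nested for-loops:
    # C outermost, gamma innermost.
    for C, kernel, gamma in product(C_values, kernels, gammas):
        tried += 1
        score = score_fn(C=C, kernel=kernel, gamma=gamma)
        if score > best_score:  # strict ">" keeps the first of any tied settings
            best_score = score
            best_params = {'C': C, 'kernel': kernel, 'gamma': gamma}
    return best_params, best_score, tried

# Placeholder scorer for demonstration only; these numbers are made up.
def toy_score(C, kernel, gamma):
    return 0.8 if (C >= 1.0 and kernel == 'linear') else 0.4

best_params, best_score, tried = grid_search(toy_score)
print(tried, best_params, best_score)
```

Because the comparison is strict, a later combination that only ties the best score does not replace it. For a linear kernel, where gamma does not change the model, the reported gamma is therefore just the first value in the list.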

Overall, moderate regularization (C = 1.0), a simple linear kernel, and TF-IDF vectorization of the text account for the SVM's performance on this dataset. The gamma setting plays no role in the winning configuration, because scikit-learn's SVC ignores gamma when the kernel is linear. With only 10 validation examples the 0.8 accuracy is a coarse estimate, but it matches the best result from logistic regression, which makes this model a reasonable candidate for submission to the competition.
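
To put a rough number on how much a 10-post validation set can say, the sketch below computes an approximate 95% Wilson score interval for 8 correct predictions out of 10. The function name `wilson_interval` is introduced here for illustration, and the interval is a standard textbook approximation rather than anything computed in the notebook.

```python
import math

def wilson_interval(correct, n, z=1.96):
    """Approximate 95% Wilson score interval for an accuracy of correct/n."""
    p = correct / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# 8 of the 10 validation posts were classified correctly.
low, high = wilson_interval(correct=8, n=10)
print(f"0.8 accuracy on 10 posts: roughly {low:.2f} to {high:.2f}")
```

The interval is wide, so the tie between the SVM and logistic regression at 0.8 should be read as "no clear difference on this data" rather than as evidence that the two models are equally good.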