# ARI 510 Phase 2b

Project Phase 2b Competition Participation
Ryan Smith
Project competing in: chawki's team's project.

Here is the official description of the task:

In this competition, participants will develop models to classify customer reviews into sentiment categories: Positive, Neutral, or Negative. This challenge is ideal for anyone interested in natural language processing and machine learning.

# The Models

We will use an SVM and a Random Forest.

Support Vector Machines (SVMs) classify text by finding a boundary between categories. Imagine drawing a line to separate documents about sports from documents about cooking; an SVM looks for the line (or hyperplane in higher dimensions) with the largest margin, that is, the greatest distance to the nearest training examples of each class. SVMs tend to work well in high-dimensional spaces, which is typical of text represented by word frequencies or embeddings. They can also handle non-linear relationships between words and categories by using kernel functions, which implicitly map the data into a space where the classes are easier to separate. This makes them suitable for a range of text classification tasks, from sentiment analysis to topic categorization.
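
As a minimal sketch of the idea, the snippet below fits a linear SVM (no kernel) on bag-of-words counts. The four reviews are hypothetical toy data, not the competition set, and `LinearSVC` with `C=1.0` is only one reasonable choice of estimator.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy reviews, not the competition data
texts = [
    "great product, works perfectly",
    "love it, excellent quality",
    "terrible, broke after one day",
    "awful quality, do not buy",
]
labels = ["Positive", "Positive", "Negative", "Negative"]

# Bag-of-words counts feed a linear SVM that looks for a
# maximum-margin hyperplane between the two classes
svm_model = make_pipeline(CountVectorizer(), LinearSVC(C=1.0))
svm_model.fit(texts, labels)

# Signed score of each review relative to the separating hyperplane;
# the sign decides the predicted class
margins = svm_model.decision_function(texts)
train_preds = svm_model.predict(texts)
print(list(train_preds))
```

Reviews with a larger absolute score in `margins` lie further from the boundary, so the model is more confident about them.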

Random Forest is an ensemble learning method that combines the predictions of many decision trees. Imagine a group of experts, each of whom has seen a different random sample of the reviews and pays attention to a different random subset of the words. Each expert (decision tree) makes its own prediction, and the final classification is determined by a majority vote or by averaging the predictions. Because the individual trees' errors tend to cancel out, the ensemble is usually less prone to overfitting the training data than a single tree, and training parallelizes well because the trees are built independently. Random Forests can capture non-linear relationships and interactions between words, which makes them useful for tasks like spam detection, news categorization, and author identification.
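
The averaging step can be checked directly. In the sketch below, scikit-learn's `RandomForestClassifier` is fitted on hypothetical toy reviews (not the competition data), and the forest's class probabilities are compared with the mean of the individual trees' probabilities. The values `n_estimators=25` and `random_state=0` are arbitrary.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy reviews, not the competition data
texts = [
    "great product, works perfectly",
    "love it, excellent quality",
    "it is okay, nothing special",
    "average product, does the job",
    "terrible, broke after one day",
    "awful quality, do not buy",
]
labels = ["Positive", "Positive", "Neutral", "Neutral", "Negative", "Negative"]

# Dense bag-of-words matrix so each tree can be queried directly
X = CountVectorizer().fit_transform(texts).toarray()

forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X, labels)

# Each fitted tree reports its own class probabilities ...
per_tree = np.stack([tree.predict_proba(X) for tree in forest.estimators_])

# ... and the forest's output is the average over all trees
averaged = per_tree.mean(axis=0)
forest_proba = forest.predict_proba(X)
print(np.allclose(averaged, forest_proba))
```

The predicted label is then the class with the highest averaged probability.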

# Model 1

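SVM (the cells in this section fit scikit-learn's `LogisticRegression`, a linear classifier, in place of an SVM estimator)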

## Default Values

In [30]:
# Default values for hyperparameters
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Load datasets
print("Loading datasets...")
train_data = pd.read_csv('chawki/data/train_data.csv')
test_data = pd.read_csv('chawki/data/test_data.csv')
print("Datasets loaded.")

# Handle missing values
print("Handling missing values...")
train_data['Review Text'].fillna('', inplace=True)
train_data.dropna(subset=['Ground_Truth'], inplace=True)
test_data['Review Text'].fillna('', inplace=True)
print("Missing values handled.")

# Features and labels
print("Extracting features and labels...")
X_train = train_data['Review Text']
y_train = train_data['Ground_Truth']
X_test = test_data['Review Text']
print("Features and labels extracted.")

# Split the training data for validation
print("Splitting data into training and validation sets...")
X_train_split, X_val, y_train_split, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
print("Data split completed.")

# Hyperparameters
C = 1.0
max_iter = 1000
solver = 'lbfgs'

# Create a model pipeline
print("Creating model pipeline...")
model = make_pipeline(CountVectorizer(), LogisticRegression(C=C, max_iter=max_iter, solver=solver, multi_class='multinomial'))
print("Model pipeline created.")

# Train the model
print("Training the model...")
model.fit(X_train_split, y_train_split)
print("Model training completed.")

# Validate the model
print("Validating the model...")
y_val_pred = model.predict(X_val)
val_accuracy = accuracy_score(y_val, y_val_pred)
print(f'Validation Accuracy: {val_accuracy}')
print("Classification Report:")
print(classification_report(y_val, y_val_pred))

# Test the model
print("Testing the model...")
y_test_pred = model.predict(X_test)
print("Model testing completed.")

# Save predictions to preds.txt
print("Saving predictions to preds.txt...")
with open('chawki/preds.txt', 'w') as f:
    for pred in y_test_pred:
        f.write(f"{pred}\n")
print("Predictions saved.")

Loading datasets...
Datasets loaded.
Handling missing values...
Missing values handled.
Extracting features and labels...
Features and labels extracted.
Splitting data into training and validation sets...
Data split completed.
Creating model pipeline...
Model pipeline created.
Training the model...
Model training completed.
Validating the model...
Validation Accuracy: 0.6666666666666666
Classification Report:
              precision    recall  f1-score   support

    Negative       0.00      0.00      0.00         4
     Neutral       0.50      0.25      0.33         8
    Positive       0.71      0.92      0.80        24

    accuracy                           0.67        36
   macro avg       0.40      0.39      0.38        36
weighted avg       0.58      0.67      0.61        36

Testing the model...
Model testing completed.
Saving predictions to preds.txt...
Predictions saved.


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  train_data['Review Text'].fillna('', inplace=True)
  test_data['Review Text'].fillna('', inplace=True)
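
The warning above recommends assigning the filled column back rather than calling `fillna(..., inplace=True)` on the selected column. A minimal sketch of that pattern, using a hypothetical two-row frame in place of `train_data`:

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame standing in for train_data
df = pd.DataFrame({"Review Text": ["good value", np.nan]})

# Assign the filled column back instead of calling fillna(..., inplace=True)
# on the intermediate df["Review Text"] object
df["Review Text"] = df["Review Text"].fillna("")

remaining_missing = int(df["Review Text"].isna().sum())
print(remaining_missing)
```

The same assignment form would apply to both `train_data` and `test_data` in the cells of this notebook.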


## Tuning

In [8]:
# Modified values for hyperparameters
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Load datasets
print("Loading datasets...")
train_data = pd.read_csv('chawki/data/train_data.csv')
test_data = pd.read_csv('chawki/data/test_data.csv')
print("Datasets loaded.")

# Handle missing values
print("Handling missing values...")
train_data['Review Text'].fillna('', inplace=True)
train_data.dropna(subset=['Ground_Truth'], inplace=True)
test_data['Review Text'].fillna('', inplace=True)
print("Missing values handled.")

# Features and labels
print("Extracting features and labels...")
X_train = train_data['Review Text']
y_train = train_data['Ground_Truth']
X_test = test_data['Review Text']
print("Features and labels extracted.")

# Split the training data for validation
print("Splitting data into training and validation sets...")
X_train_split, X_val, y_train_split, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
print("Data split completed.")

# Hyperparameters
C = 0.01
max_iter = 3000
solver = 'saga'

# Create a model pipeline
print("Creating model pipeline...")
model = make_pipeline(CountVectorizer(), LogisticRegression(C=C, max_iter=max_iter, solver=solver, multi_class='multinomial'))
print("Model pipeline created.")

# Train the model
print("Training the model...")
model.fit(X_train_split, y_train_split)
print("Model training completed.")

# Validate the model
print("Validating the model...")
y_val_pred = model.predict(X_val)
val_accuracy = accuracy_score(y_val, y_val_pred)
print(f'Validation Accuracy: {val_accuracy}')
print("Classification Report:")
print(classification_report(y_val, y_val_pred))

# Test the model
print("Testing the model...")
y_test_pred = model.predict(X_test)
print("Model testing completed.")

# Save predictions to preds.txt
print("Saving predictions to preds.txt...")
with open('chawki/preds.txt', 'w') as f:
    for pred in y_test_pred:
        f.write(f"{pred}\n")
print("Predictions saved.")

Loading datasets...
Datasets loaded.
Handling missing values...
Missing values handled.
Extracting features and labels...
Features and labels extracted.
Splitting data into training and validation sets...
Data split completed.
Creating model pipeline...
Model pipeline created.
Training the model...
Model training completed.
Validating the model...
Validation Accuracy: 0.6666666666666666
Classification Report:
              precision    recall  f1-score   support

    Negative       0.00      0.00      0.00         4
     Neutral       0.00      0.00      0.00         8
    Positive       0.67      1.00      0.80        24

    accuracy                           0.67        36
   macro avg       0.22      0.33      0.27        36
weighted avg       0.44      0.67      0.53        36

Testing the model...
Model testing completed.
Saving predictions to preds.txt...
Predictions saved.
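
The report above shows recall 1.00 for Positive and 0.00 for the other two classes, which is what a model that labels every review Positive would produce. The sketch below rebuilds the validation labels from the support column (4, 8, 24) and scores such an always-Positive predictor. It is an illustration of the arithmetic, not a rerun of the model.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Validation labels rebuilt from the support column: 4 / 8 / 24
y_true = ["Negative"] * 4 + ["Neutral"] * 8 + ["Positive"] * 24
# A classifier that answers "Positive" for every review
y_pred = ["Positive"] * len(y_true)

accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 scores precision as 0 for classes that are never predicted
macro_p, macro_r, macro_f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(round(accuracy, 2), round(macro_p, 2), round(macro_r, 2), round(macro_f1, 2))
```

The rounded values agree with the accuracy and macro avg rows of the report, so the 0.67 accuracy here is simply the share of Positive reviews in the validation split (24 of 36).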



## Tuning

In [18]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Load datasets
print("Loading datasets...")
train_data = pd.read_csv('chawki/data/train_data.csv')
test_data = pd.read_csv('chawki/data/test_data.csv')
print("Datasets loaded.")

# Handle missing values
print("Handling missing values...")
train_data['Review Text'].fillna('', inplace=True)
train_data.dropna(subset=['Ground_Truth'], inplace=True)
test_data['Review Text'].fillna('', inplace=True)
print("Missing values handled.")

# Features and labels
print("Extracting features and labels...")
X_train = train_data['Review Text']
y_train = train_data['Ground_Truth']
X_test = test_data['Review Text']
print("Features and labels extracted.")

# Split the training data for validation
print("Splitting data into training and validation sets...")
X_train_split, X_val, y_train_split, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
print("Data split completed.")

# Hyperparameters
C = 1.0
max_iter = 10000
solver = 'saga'
penalty = 'l1'

# Create a model pipeline
print("Creating model pipeline...")
model = make_pipeline(CountVectorizer(), LogisticRegression(C=C, max_iter=max_iter, solver=solver, multi_class='multinomial', penalty=penalty))
print("Model pipeline created.")

# Train the model
print("Training the model...")
model.fit(X_train_split, y_train_split)
print("Model training completed.")

# Validate the model
print("Validating the model...")
y_val_pred = model.predict(X_val)
val_accuracy = accuracy_score(y_val, y_val_pred)
print(f'Validation Accuracy: {val_accuracy}')
print("Classification Report:")
print(classification_report(y_val, y_val_pred))

# Test the model
print("Testing the model...")
y_test_pred = model.predict(X_test)
print("Model testing completed.")

# Save predictions to preds.txt
print("Saving predictions to preds.txt...")
with open('chawki/preds.txt', 'w') as f:
    for pred in y_test_pred:
        f.write(f"{pred}\n")
print("Predictions saved.")

Loading datasets...
Datasets loaded.
Handling missing values...
Missing values handled.
Extracting features and labels...
Features and labels extracted.
Splitting data into training and validation sets...
Data split completed.
Creating model pipeline...
Model pipeline created.
Training the model...
Model training completed.
Validating the model...
Validation Accuracy: 0.6111111111111112
Classification Report:
              precision    recall  f1-score   support

    Negative       0.00      0.00      0.00         4
     Neutral       0.25      0.12      0.17         8
    Positive       0.68      0.88      0.76        24

    accuracy                           0.61        36
   macro avg       0.31      0.33      0.31        36
weighted avg       0.51      0.61      0.55        36

Testing the model...
Model testing completed.
Saving predictions to preds.txt...
Predictions saved.
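
The runs above change `C`, `max_iter`, `solver`, and `penalty` by hand, one cell at a time. One way to sweep a parameter systematically is scikit-learn's `GridSearchCV`. The sketch below is a minimal example on hypothetical toy reviews (not the competition data); the grid values, `cv=2`, and the choice of macro F1 as the score are all assumptions made for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Hypothetical toy reviews, not the competition data
texts = [
    "great product", "love it", "excellent quality", "works perfectly",
    "it is okay", "nothing special", "average product", "does the job",
    "terrible product", "broke quickly", "awful quality", "do not buy",
]
labels = ["Positive"] * 4 + ["Neutral"] * 4 + ["Negative"] * 4

pipeline = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))

# make_pipeline names each step after its lowercased class name, so the
# classifier's C is addressed as "logisticregression__C"
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

# Macro F1 weights the three classes equally, unlike plain accuracy
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="f1_macro")
search.fit(texts, labels)

best_c = search.best_params_["logisticregression__C"]
print(best_c, round(search.best_score_, 3))
```

Cross-validated scoring also avoids depending on a single 36-review validation split when comparing settings.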


# Model 2

Random Forest

## Default Values

In [None]:
# Default values for hyperparameters
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Load datasets
print("Loading datasets...")
train_data = pd.read_csv('chawki/data/train_data.csv')
test_data = pd.read_csv('chawki/data/test_data.csv')
print("Datasets loaded.")

# Handle missing values
print("Handling missing values...")
train_data['Review Text'].fillna('', inplace=True)
train_data.dropna(subset=['Ground_Truth'], inplace=True)
test_data['Review Text'].fillna('', inplace=True)
print("Missing values handled.")

# Features and labels
print("Extracting features and labels...")
X_train = train_data['Review Text']
y_train = train_data['Ground_Truth']
X_test = test_data['Review Text']
print("Features and labels extracted.")

# Split the training data for validation
print("Splitting data into training and validation sets...")
X_train_split, X_val, y_train_split, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
print("Data split completed.")

# Hyperparameters for RandomForestClassifier
n_estimators = 100
max_depth = None
random_state = None

# Create a model pipeline
print("Creating model pipeline...")
model = make_pipeline(CountVectorizer(), RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth, random_state=random_state))
print("Model pipeline created.")

# Train the model
print("Training the model...")
model.fit(X_train_split, y_train_split)
print("Model training completed.")

# Validate the model
print("Validating the model...")
y_val_pred = model.predict(X_val)
val_accuracy = accuracy_score(y_val, y_val_pred)
print(f'Validation Accuracy: {val_accuracy}')
print("Classification Report:")
print(classification_report(y_val, y_val_pred))

# Test the model
print("Testing the model...")
y_test_pred = model.predict(X_test)
print("Model testing completed.")

# Save predictions to preds.txt
print("Saving predictions to preds.txt...")
with open('chawki/preds.txt', 'w') as f:
    for pred in y_test_pred:
        f.write(f"{pred}\n")
print("Predictions saved.")

Loading datasets...
Datasets loaded.
Handling missing values...
Missing values handled.
Extracting features and labels...
Features and labels extracted.
Splitting data into training and validation sets...
Data split completed.
Creating model pipeline...
Model pipeline created.
Training the model...
Model training completed.
Validating the model...
Validation Accuracy: 0.6666666666666666
Classification Report:
              precision    recall  f1-score   support

    Negative       0.00      0.00      0.00         4
     Neutral       0.00      0.00      0.00         8
    Positive       0.67      1.00      0.80        24

    accuracy                           0.67        36
   macro avg       0.22      0.33      0.27        36
weighted avg       0.44      0.67      0.53        36

Testing the model...
Model testing completed.
Saving predictions to preds.txt...
Predictions saved.



## Tuning

In [29]:
# Modified values for hyperparameters
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Load datasets
print("Loading datasets...")
train_data = pd.read_csv('chawki/data/train_data.csv')
test_data = pd.read_csv('chawki/data/test_data.csv')
print("Datasets loaded.")

# Handle missing values
print("Handling missing values...")
train_data['Review Text'].fillna('', inplace=True)
train_data.dropna(subset=['Ground_Truth'], inplace=True)
test_data['Review Text'].fillna('', inplace=True)
print("Missing values handled.")

# Features and labels
print("Extracting features and labels...")
X_train = train_data['Review Text']
y_train = train_data['Ground_Truth']
X_test = test_data['Review Text']
print("Features and labels extracted.")

# Split the training data for validation
print("Splitting data into training and validation sets...")
X_train_split, X_val, y_train_split, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
print("Data split completed.")

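# Hyperparameters for RandomForestClassifier
n_estimators = 20000
max_depth = 5
random_state = 42
# Defined but never passed to RandomForestClassifier below, so the recorded
# run used scikit-learn's defaults for both of these
min_samples_split = 100
min_samples_leaf = 50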

# Create a model pipeline
print("Creating model pipeline...")
model = make_pipeline(CountVectorizer(), RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth, random_state=random_state))
print("Model pipeline created.")

# Train the model
print("Training the model...")
model.fit(X_train_split, y_train_split)
print("Model training completed.")

# Validate the model
print("Validating the model...")
y_val_pred = model.predict(X_val)
val_accuracy = accuracy_score(y_val, y_val_pred)
print(f'Validation Accuracy: {val_accuracy}')
print("Classification Report:")
print(classification_report(y_val, y_val_pred))

# Test the model
print("Testing the model...")
y_test_pred = model.predict(X_test)
print("Model testing completed.")

# Save predictions to preds.txt
print("Saving predictions to preds.txt...")
with open('chawki/preds.txt', 'w') as f:
    for pred in y_test_pred:
        f.write(f"{pred}\n")
print("Predictions saved.")

Loading datasets...
Datasets loaded.
Handling missing values...
Missing values handled.
Extracting features and labels...
Features and labels extracted.
Splitting data into training and validation sets...
Data split completed.
Creating model pipeline...
Model pipeline created.
Training the model...
Model training completed.
Validating the model...
Validation Accuracy: 0.6666666666666666
Classification Report:
              precision    recall  f1-score   support

    Negative       0.00      0.00      0.00         4
     Neutral       0.00      0.00      0.00         8
    Positive       0.67      1.00      0.80        24

    accuracy                           0.67        36
   macro avg       0.22      0.33      0.27        36
weighted avg       0.44      0.67      0.53        36

Testing the model...
Model testing completed.
Saving predictions to preds.txt...
Predictions saved.
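
In the cell above, `min_samples_split` and `min_samples_leaf` are assigned but not passed to `RandomForestClassifier`, so they had no effect on this run. A minimal sketch of wiring them in, assuming that was the intent; `n_estimators` is reduced here only to keep the sketch quick, and the model is built but not fitted.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

n_estimators = 200  # smaller than the 20000 above, only to keep the sketch quick
max_depth = 5
random_state = 42
min_samples_split = 100
min_samples_leaf = 50

model = make_pipeline(
    CountVectorizer(),
    RandomForestClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        random_state=random_state,
        # These two only take effect when passed explicitly
        min_samples_split=min_samples_split,
        min_samples_leaf=min_samples_leaf,
    ),
)

# Confirm the classifier step actually carries the intended settings
forest_params = model.named_steps["randomforestclassifier"].get_params()
print(forest_params["min_samples_split"], forest_params["min_samples_leaf"])
```

With a training split of roughly 144 reviews, a leaf minimum of 50 would allow each tree only a few splits, so results with these settings may differ from the report above.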


