# Problem 1: Bag-of-Words Feature Representation
In this notebook, we explore the Bag-of-Words (BoW) representation for text data in two common variants: raw term counts (CountVectorizer) and TF-IDF weighting (TfidfTransformer).


In [49]:
# Import necessary libraries
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.utils import shuffle

In [50]:
x_train_df = pd.read_csv('../data_reviews/x_train.csv')
y_train_df = pd.read_csv('../data_reviews/y_train.csv')

# Convert the columns to lists/arrays
x_train_unshuffled = x_train_df['text'].values.tolist()
y = y_train_df['is_positive_sentiment'].values

# Shuffle both X and y together
X_train, y_train = shuffle(x_train_unshuffled, y, random_state=42)

# Check the lengths
print(f'Length of tr_list_of_sentences: {len(x_train_unshuffled)}')
print(f'Length of y_train: {len(y)}')

Length of tr_list_of_sentences: 2400
Length of y_train: 2400


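To process the training data, we use scikit-learn's CountVectorizer, whose default tokenizer lowercases the text and drops punctuation. The fitted vectorizer is reused unchanged to transform the test data. We restrict the vocabulary to unigrams (ngram_range=(1, 1)) to keep the BoW representation easy to interpret. Note that the vectorizer in the next cell does not set stop_words or min_df, so stop words are kept and rare words are not filtered; removing stop words and excluding words that appear in fewer than 5 documents would require stop_words='english' and min_df=5.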

In [51]:
# Creating the pipeline
vectorizer = CountVectorizer(ngram_range=(1, 1))

pipeline = Pipeline([
    ('vectorizer', vectorizer),
    ('tfidf', TfidfTransformer(smooth_idf=True, use_idf=True, sublinear_tf=True)),
    ('classifier', LogisticRegression(max_iter=1000, class_weight='balanced'))
])
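
To make the vectorizer options discussed above concrete, here is a minimal sketch on a made-up three-sentence corpus (not the review data). The name `toy_vectorizer` and the choice of `min_df=2` are assumptions for illustration only; the real data would use `min_df=5` as described in the text.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration (not taken from the review data)
docs = [
    "The food was great!",
    "The service was great, the food was not.",
    "Not great.",
]

# stop_words='english' removes common English words;
# min_df=2 keeps only terms that occur in at least 2 documents
toy_vectorizer = CountVectorizer(ngram_range=(1, 1), stop_words='english', min_df=2)
counts = toy_vectorizer.fit_transform(docs)

# Vocabulary terms ordered by their column index in the count matrix
vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(vocab)
print(counts.toarray())
```

On this toy corpus, "service" is dropped by min_df because it occurs in only one document. The built-in English stop-word list also removes "not", which is worth keeping in mind for sentiment data, where negation carries signal.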

For hyperparameter tuning, we use scikit-learn's GridSearchCV with 5-fold cross-validation to select the Logistic Regression hyperparameters, scoring each candidate by ROC AUC (scoring='roc_auc'). We compare two solvers: lbfgs, a quasi-Newton method, and saga, a stochastic average gradient method. The grid also varies the vocabulary size (max_features), the inverse regularization strength C, and the penalty type.

In [52]:
param_grid = {
    'classifier__solver': ['lbfgs', 'saga'],
    'vectorizer__max_features': [100, 500, 1000, 10000, 100000],
    'classifier__C': [0.01, 0.1, 1, 10, 100],
    'classifier__penalty': ['l1', 'l2', 'elasticnet']
}


grid_search = GridSearchCV(
    pipeline,
    param_grid,
    cv=5,
    scoring='roc_auc'
)



grid_search.fit(X_train, y_train)

375 fits failed out of a total of 750.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
125 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/manuelpena/micromamba/envs/cs135_env/lib/python3.10/site-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/manuelpena/micromamba/envs/cs135_env/lib/python3.10/site-packages/sklearn/pipeline.py", line 405, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/Users/manuelpena/micromamba/envs/cs135_env/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py", line 1162, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
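
A likely reading of the warning above is that half of the grid combines a solver with a penalty it cannot handle: lbfgs accepts only the L2 penalty (or none), and the elastic-net penalty needs saga together with an explicit l1_ratio. The sketch below shows one way to restrict the grid to valid combinations by passing GridSearchCV a list of dicts. The name `valid_param_grid`, the `l1_ratio` values, and the helper `count_candidates` are assumptions for illustration, and the exact solver/penalty rules may differ across scikit-learn versions.

```python
from itertools import product

C_VALUES = [0.01, 0.1, 1, 10, 100]
MAX_FEATURES = [100, 500, 1000, 10000, 100000]

# Hypothetical replacement grid: one dict per valid solver/penalty family
valid_param_grid = [
    {   # lbfgs: L2 penalty only
        'classifier__solver': ['lbfgs'],
        'classifier__penalty': ['l2'],
        'classifier__C': C_VALUES,
        'vectorizer__max_features': MAX_FEATURES,
    },
    {   # saga: L1 and L2 penalties
        'classifier__solver': ['saga'],
        'classifier__penalty': ['l1', 'l2'],
        'classifier__C': C_VALUES,
        'vectorizer__max_features': MAX_FEATURES,
    },
    {   # elastic net: saga plus an explicit l1_ratio (values assumed here)
        'classifier__solver': ['saga'],
        'classifier__penalty': ['elasticnet'],
        'classifier__l1_ratio': [0.25, 0.5, 0.75],
        'classifier__C': C_VALUES,
        'vectorizer__max_features': MAX_FEATURES,
    },
]

def count_candidates(grid):
    """Count the parameter combinations in a list-of-dicts grid."""
    return sum(len(list(product(*d.values()))) for d in grid)

n_candidates = count_candidates(valid_param_grid)
print(f"{n_candidates} candidates, {n_candidates * 5} fits with cv=5")
```

This list can be passed as the second argument of GridSearchCV in place of `param_grid`; every candidate then corresponds to a configuration the solver accepts, so no cross-validation scores are set to nan for this reason.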
  

In [53]:
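# Predict on the training set (these are training metrics, not a held-out estimate)
y_pred = grid_search.best_estimator_.predict(X_train)

print(classification_report(y_train, y_pred))
print(confusion_matrix(y_train, y_pred))

# Predicted probability of the positive class, used to compute the AUC
y_pred_prob = grid_search.best_estimator_.predict_proba(X_train)[:, 1]

auc = roc_auc_score(y_train, y_pred_prob)

print(f'AUC: {auc:.4f}')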

              precision    recall  f1-score   support

           0       0.99      1.00      0.99      1200
           1       1.00      0.99      0.99      1200

    accuracy                           0.99      2400
   macro avg       0.99      0.99      0.99      2400
weighted avg       0.99      0.99      0.99      2400

[[1196    4]
 [  15 1185]]
AUC: 0.9987
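
Because the metrics above are computed on the same data the model was fit on, they are likely optimistic. In the notebook itself, `grid_search.best_score_` already holds the mean cross-validated AUC of the selected candidate. As a minimal sketch of the same idea, the example below computes an out-of-fold AUC with cross_val_predict on a tiny made-up dataset; the texts, labels, and the simplified `model` pipeline are assumptions for illustration, not the tuned pipeline.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import make_pipeline

# Tiny invented dataset: 6 positive and 6 negative snippets
texts = [
    "great meal", "loved the service", "excellent phone",
    "really good movie", "works great", "good value",
    "terrible meal", "hated the service", "awful phone",
    "really bad movie", "stopped working", "bad value",
]
labels = np.array([1] * 6 + [0] * 6)

# Simplified stand-in for the tuned pipeline
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

# Each sample is scored by a model that never saw it during fitting
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
proba = cross_val_predict(model, texts, labels, cv=cv, method='predict_proba')[:, 1]

auc = roc_auc_score(labels, proba)
print(f'Out-of-fold AUC: {auc:.4f}')
```

With so few samples the number itself is not meaningful; the point is only that every prediction comes from a fold in which that sample was held out.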


In [54]:
# Load the test data
x_test_df = pd.read_csv('../data_reviews/x_test.csv')

# Get the predicted probabilities for the positive class
y_test_pred_prob = grid_search.best_estimator_.predict_proba(x_test_df['text'])[:, 1]

# Save the probabilities to a plain-text file
with open('../data_reviews/yproba1_test.txt', 'w') as f:
    for prob in y_test_pred_prob:
        f.write(f"{prob:.6f}\n")  # Formatting to six decimal places

print("Probabilistic predictions saved to '../data_reviews/yproba1_test.txt'.")

Probabilistic predictions saved to '../data_reviews/yproba1_test.txt'.
