# SWE3011_41 Task1

**Supervised Text Classification using traditional machine learning methods**

1. Complete all the functions given.
2. Conduct various experiments including hyper-parameter tuning, cross validation, etc.
3. Write a report on the analysis of experiment results.  


**0. Installation**

In [2]:
pip install datasets

Collecting datasets
  Downloading datasets-2.15.0-py3-none-any.whl (521 kB)
Collecting pyarrow-hotfix (from datasets)
  Downloading pyarrow_hotfix-0.6-py3-none-any.whl (7.9 kB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)
Installing collected packages: pyarrow-hotfix, dill, multiprocess, datasets
Successfully installed datasets-2.15.0 dill-0.3.7 multiprocess-0.70.15 pyarrow-hotfix-0.6

**1. Load Dataset**

Evaluation should be done using the **provided test dataset**.

In [3]:
from datasets import load_dataset

train_ds = load_dataset("glue", "sst2", split="train")

# Evaluation should be done using test_ds
test_ds = load_dataset("csv", data_files="./test_dataset.csv")['train']

**2. Preparing Dataset**

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [5]:
def transform_data(X_train, X_test):
    """
    Input:
    - X_train, X_test: Series containing the text data for training and testing respectively.

    Output:
    - X_train_tfidf, X_test_tfidf: Transformed text data in TF-IDF format for training and testing respectively.
    - vectorizer: Fitted TfidfVectorizer object.
    """
    #########################################
    # TODO: Convert the text data to TF-IDF format and return the transformed data and the vectorizer
    # Create a TfidfVectorizer
    vectorizer = TfidfVectorizer()

    # Fit and transform X_train
    X_train_tfidf = vectorizer.fit_transform(X_train)

    # Transform X_test
    X_test_tfidf = vectorizer.transform(X_test)
    #########################################
    return X_train_tfidf, X_test_tfidf, vectorizer

In [6]:
X_train, y_train = train_ds['sentence'], train_ds['label']
X_test, y_test = test_ds['sentence'], test_ds['label']
X_train_tfidf, X_test_tfidf, vectorizer = transform_data(X_train, X_test)
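
To make the fit/transform split in `transform_data` concrete, here is a minimal sketch on a made-up three-sentence corpus (the `toy_*` names and sentences are illustrative assumptions, not part of the assignment data). The vocabulary is learned from the training text only, so a word that appears only in the test text is ignored rather than given a new column.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, used only to illustrate the behaviour
toy_train = ["a great movie", "a dull movie"]
toy_test = ["a great soundtrack"]

toy_vectorizer = TfidfVectorizer()
toy_train_tfidf = toy_vectorizer.fit_transform(toy_train)  # learns vocabulary and IDF weights
toy_test_tfidf = toy_vectorizer.transform(toy_test)        # reuses them, learns nothing new

# The default tokenizer drops one-character tokens such as "a"
vocab = sorted(toy_vectorizer.vocabulary_)
print(vocab)
print(toy_train_tfidf.shape, toy_test_tfidf.shape)

# "soundtrack" was never seen during fitting, so only "great" is non-zero
print(toy_test_tfidf.nnz)
```

Both matrices have the same number of columns, which is what lets a classifier trained on one be applied to the other.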

**3. Train**

In [7]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB

In [8]:
def logistic_regression(X_train_tfidf, y_train):
    """
    Input:
    - X_train_tfidf: Transformed text data in TF-IDF format for training.
    - y_train: Series containing the labels for training.

    Output:
    - clf: Trained Logistic Regression model.
    """
    #########################################
    # Define a logistic regression classifier with max_iter
    clf = LogisticRegression(max_iter=1000)

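    # Define hyperparameters to tune.
    # Note: the default 'lbfgs' solver does not support the 'l1' penalty, so every
    # 'l1' candidate fails to fit and is scored as nan; only 'l2' is effectively compared.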
    param_grid = {
        'C': [0.001, 0.01, 0.1, 1, 10, 100],
        'penalty': ['l1', 'l2']
    }

    # Perform grid search with cross-validation to find the best hyperparameters
    grid_search = GridSearchCV(clf, param_grid, cv=5, scoring='accuracy')
    grid_search.fit(X_train_tfidf, y_train)

    # Print the best parameters
    print("Best Parameters: ", grid_search.best_params_)

    # Get the best model with the optimal hyperparameters
    clf = grid_search.best_estimator_

    # Refit on the entire training data. This is redundant: with the default
    # refit=True, GridSearchCV has already refit best_estimator_ on all of it.
    clf.fit(X_train_tfidf, y_train)

    #########################################
    return clf
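
One caveat with the function above: the TF-IDF matrix was fitted on the whole training set before cross-validation, so each validation fold has already influenced the vocabulary and IDF weights. A possible alternative, sketched here on a made-up corpus (the `toy_*` names, the sentences, and the reduced grid are assumptions for illustration), is to put the vectorizer and the classifier in a `Pipeline` so that the vectorizer is refitted inside every fold.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 6 positive and 6 negative snippets
toy_texts = [
    "a great film", "truly wonderful acting", "great fun overall",
    "wonderful and moving", "a great story", "moving and great",
    "a dull film", "truly terrible acting", "dull and boring",
    "terrible pacing overall", "a boring story", "boring and dull",
]
toy_labels = [1] * 6 + [0] * 6

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Step parameters are addressed as <step name>__<parameter>
param_grid = {"clf__C": [0.1, 1, 10]}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(toy_texts, toy_labels)  # raw text in; TF-IDF is fitted per fold

print(search.best_params_)
print(search.predict(["a wonderful film"]))
```

With this layout the fitted pipeline accepts raw sentences directly, so the test set would not need a separate `transform` call.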

In [9]:
def random_forest(X_train_tfidf, y_train):
    """
    Input:
    - X_train_tfidf: Transformed text data in TF-IDF format for training.
    - y_train: Series containing the labels for training.

    Output:
    - clf: Trained Random Forest classifier.
    """
    #########################################
    # Define a Random Forest classifier
    clf = RandomForestClassifier()

    # Define hyperparameters to tune
    param_grid = {
        'n_estimators': [100, 200, 300],  # Number of trees in the forest
        'max_depth': [None, 10, 20],  # Maximum depth of the tree
    }

    # Perform grid search with cross-validation to find the best hyperparameters
    grid_search = GridSearchCV(clf, param_grid, cv=5, scoring='accuracy')
    grid_search.fit(X_train_tfidf, y_train)

    # Print the best parameters
    print("Best Parameters: ", grid_search.best_params_)

    # Get the best model with the optimal hyperparameters
    clf = grid_search.best_estimator_

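    # Refit on the entire training data. GridSearchCV (refit=True by default) has
    # already done this; without a fixed random_state the second fit builds a
    # slightly different forest.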
    clf.fit(X_train_tfidf, y_train)

    #########################################
    return clf

In [10]:
def naive_bayes_classifier(X_train_tfidf, y_train):
    """
    Input:
    - X_train_tfidf: Transformed text data in TF-IDF format for training.
    - y_train: Series containing the labels for training.

    Output:
    - clf: Trained Multinomial Naive Bayes classifier.
    """
    #########################################
    # Define a Multinomial Naive Bayes classifier
    clf = MultinomialNB()

    # Define hyperparameters to tune
    param_grid = {
        'alpha': [0.1, 0.5, 1.0, 2.0]  # Smoothing parameter (Laplace/Lidstone smoothing)
    }

    # Perform grid search with cross-validation to find the best hyperparameters
    grid_search = GridSearchCV(clf, param_grid, cv=5, scoring='accuracy')
    grid_search.fit(X_train_tfidf, y_train)

    # Print the best parameters
    print("Best Parameters: ", grid_search.best_params_)

    # Get the best model with the optimal hyperparameters
    clf = grid_search.best_estimator_

    # Refit on the entire training data. This is redundant: with the default
    # refit=True, GridSearchCV has already refit best_estimator_ on all of it.
    clf.fit(X_train_tfidf, y_train)

    #########################################
    return clf
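
The three functions above print only the best parameters. For the report it may help to see how every candidate scored, which `GridSearchCV` keeps in `cv_results_`. The sketch below uses synthetic count features (`X_demo`, `y_demo` are made-up stand-ins, not the SST-2 data) and prints the mean and standard deviation of the fold accuracies for each `alpha`.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Synthetic non-negative features, standing in for a TF-IDF matrix
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 5, size=(120, 10))
y_demo = (X_demo[:, 0] > X_demo[:, 1]).astype(int)

search = GridSearchCV(
    MultinomialNB(),
    {"alpha": [0.1, 0.5, 1.0, 2.0]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

# One row per candidate: parameters, mean accuracy, spread across the 5 folds
results = search.cv_results_
for params, mean, std in zip(
    results["params"], results["mean_test_score"], results["std_test_score"]
):
    print(params, f"{mean:.3f} +/- {std:.3f}")

# With refit=True (the default) the best candidate is already trained on all data
print(search.best_params_, f"{search.best_score_:.3f}")
```

Reporting the spread alongside the mean shows whether the gap between two candidates is larger than the fold-to-fold variation.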

In [11]:
clf = logistic_regression(X_train_tfidf, y_train)

30 fits failed out of a total of 60.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
30 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py", line 1162, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py", line 54, in _check_solver
    raise ValueError(
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.

        nan 0.888031          nan 0.90

Best Parameters:  {'C': 100, 'penalty': 'l2'}
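
The warning above means half of the grid was never evaluated: the default `lbfgs` solver rejects `penalty='l1'`, so all 30 `l1` fits (6 values of `C` times 5 folds) scored nan. One possible fix, shown here as a sketch on synthetic data rather than a rerun of the experiment, is to pick a solver that supports both penalties, such as `liblinear` (or `saga`). The `X_demo`/`y_demo` data and the shortened grid are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data, standing in for the TF-IDF features
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)

param_grid = {"C": [0.1, 1, 10], "penalty": ["l1", "l2"]}

# liblinear handles both 'l1' and 'l2' for binary problems
search = GridSearchCV(
    LogisticRegression(solver="liblinear", max_iter=1000),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

scores = search.cv_results_["mean_test_score"]
print(np.isnan(scores).any())  # no candidate should be nan now
print(search.best_params_)
```

With the solver changed, the selected `penalty` and `C` on the real data may differ from the values printed above, so the experiment would need to be rerun before updating the report.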


In [19]:
clf_rf = random_forest(X_train_tfidf, y_train)

Best Parameters:  {'max_depth': None, 'n_estimators': 200}
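
Because `RandomForestClassifier()` is created without a seed, rerunning this cell can select different parameters and give a different test accuracy. A minimal sketch of how fixing `random_state` makes a forest repeatable, using synthetic data (`X_demo`, `y_demo`, and the seed value 42 are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, standing in for the TF-IDF features
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)

# Two forests built with the same seed draw the same bootstrap samples and splits
rf_a = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_demo, y_demo)
rf_b = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_demo, y_demo)

same_predictions = np.array_equal(rf_a.predict(X_demo), rf_b.predict(X_demo))
print(same_predictions)
```

Passing `n_jobs=-1` to the classifier or to `GridSearchCV` is also worth considering, since a 9-candidate grid with 5 folds trains 45 forests.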


In [12]:
clf_nb = naive_bayes_classifier(X_train_tfidf, y_train)

Best Parameters:  {'alpha': 0.1}


**4. Evaluation**

In [13]:
from sklearn.metrics import accuracy_score, classification_report

In [14]:
def evaluate_model(clf, X_test_tfidf, y_test):
    """
    Input:
    - clf: Trained classifier (any fitted scikit-learn estimator with a predict method).
    - X_test_tfidf: Transformed text data in TF-IDF format for testing.
    - y_test: Series containing the labels for testing.

    Output:
    - None (This function will print the evaluation results.)
    """
    #########################################
    # TODO: Evaluate the model and print the results
    # Predict the labels using the trained classifier
    y_pred = clf.predict(X_test_tfidf)

    # Calculate the accuracy
    accuracy = accuracy_score(y_test, y_pred)

    #########################################
    print(f"Accuracy: {accuracy:.2f}")
    print("Classification Report:")
    print(classification_report(y_test, y_pred))

In [15]:
evaluate_model(clf, X_test_tfidf, y_test)

Accuracy: 0.77
Classification Report:
              precision    recall  f1-score   support

           0       0.83      0.72      0.77        54
           1       0.72      0.83      0.77        46

    accuracy                           0.77       100
   macro avg       0.77      0.77      0.77       100
weighted avg       0.78      0.77      0.77       100
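
To show where the numbers in this report come from, the sketch below recomputes them by hand. The confusion counts are not printed by the notebook; they are inferred here as the only integer counts consistent with the rounded recalls and the supports (39 of 54 for class 0, 38 of 46 for class 1), so treat them as an assumption.

```python
# Confusion counts inferred from the report above (assumption, see lead-in)
tn, fp = 39, 15  # true class 0: 39 predicted as 0, 15 predicted as 1 (support 54)
fn, tp = 8, 38   # true class 1: 8 predicted as 0, 38 predicted as 1 (support 46)


def precision_recall_f1(correct, predicted_total, actual_total):
    """Per-class metrics from raw counts."""
    precision = correct / predicted_total  # of everything predicted as this class
    recall = correct / actual_total        # of everything that truly is this class
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


accuracy = (tn + tp) / (tn + fp + fn + tp)
class0 = precision_recall_f1(tn, tn + fn, tn + fp)
class1 = precision_recall_f1(tp, tp + fp, tp + fn)

print(f"accuracy: {accuracy:.2f}")
print("class 0 (precision recall f1):", " ".join(f"{v:.2f}" for v in class0))
print("class 1 (precision recall f1):", " ".join(f"{v:.2f}" for v in class1))
```

Under these counts the model mislabels 15 negative sentences as positive but only 8 positive sentences as negative, which is why class 0 has the higher precision and class 1 the higher recall.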



In [20]:
evaluate_model(clf_rf, X_test_tfidf, y_test)

Accuracy: 0.71
Classification Report:
              precision    recall  f1-score   support

           0       0.76      0.69      0.72        54
           1       0.67      0.74      0.70        46

    accuracy                           0.71       100
   macro avg       0.71      0.71      0.71       100
weighted avg       0.71      0.71      0.71       100



In [18]:
evaluate_model(clf_nb, X_test_tfidf, y_test)

Accuracy: 0.70
Classification Report:
              precision    recall  f1-score   support

           0       0.74      0.69      0.71        54
           1       0.66      0.72      0.69        46

    accuracy                           0.70       100
   macro avg       0.70      0.70      0.70       100
weighted avg       0.70      0.70      0.70       100
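
When comparing the three accuracies in the report, keep in mind that the test set has only 100 examples. As a rough guide, the sketch below computes a normal-approximation 95% interval for each reported accuracy. The helper `accuracy_interval` is a hypothetical name introduced here, and the approximation ignores that all three models were scored on the same examples.

```python
import math


def accuracy_interval(accuracy, n, z=1.96):
    """Normal-approximation 95% interval for an accuracy measured on n examples."""
    half_width = z * math.sqrt(accuracy * (1 - accuracy) / n)
    return max(0.0, accuracy - half_width), min(1.0, accuracy + half_width)


# Accuracies as printed by evaluate_model above
reported = {"logistic regression": 0.77, "random forest": 0.71, "naive Bayes": 0.70}

for name, acc in reported.items():
    low, high = accuracy_interval(acc, n=100)
    print(f"{name}: {acc:.2f} (approx. 95% interval {low:.2f} to {high:.2f})")
```

Each interval is roughly 8 to 9 points wide on either side, so the three intervals overlap; the ranking of the models on this test set should be stated with that uncertainty in mind.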

