### Comparing Models and Vectorization Strategies for Text Classification

This Try-It activity weighs the strengths and weaknesses of different estimators and vectorization strategies for a text classification problem. To compare these components, use the `Pipeline` and `GridSearchCV` objects in Scikit-Learn to try different combinations of vectorizers and estimators. For each combination, also use the `.cv_results_` attribute to examine how long the estimator takes to fit the data.
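As a rough illustration of the workflow described above, the sketch below wires a vectorizer and an estimator into a `Pipeline`, searches over one hyperparameter with `GridSearchCV`, and reads the fit time from `.cv_results_`. The toy corpus, the labels, and the grid values are invented for the example and are not part of the activity's data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus and labels, made up purely for illustration
texts = [
    "chest pain and shortness of breath", "irregular heartbeat on exam",
    "heart murmur noted", "elevated blood pressure reading",
    "itchy rash on both arms", "dry skin with redness",
    "skin lesion biopsy performed", "eczema flare on hands",
]
labels = ["cardio"] * 4 + ["derm"] * 4

pipe = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# Step parameters are addressed as <step name>__<parameter name>
grid = GridSearchCV(pipe, {"classifier__C": [0.1, 1.0, 10.0]}, cv=2, scoring="accuracy")
grid.fit(texts, labels)

# cv_results_ holds one entry per parameter combination;
# best_index_ picks out the row for the winning combination
best_fit_time = grid.cv_results_["mean_fit_time"][grid.best_index_]
print(grid.best_params_, grid.best_score_, best_fit_time)
```

The `mean_fit_time` value is the fit time averaged over the cross-validation folds, in seconds, so it can be compared across pipelines that use the same `cv` setting.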

### The Data

The dataset below comes from [Kaggle]() and is named the "ColBert Dataset"; it was created for this [paper](https://arxiv.org/pdf/2004.12765.pdf). You are to use the text column to classify whether or not the text is humorous. It is loaded and displayed below.


In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.preprocessing import StandardScaler
import re

import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [2]:
df = pd.read_csv('/content/sample_data/mtsamples.csv')

In [3]:
df.head()

| | Unnamed: 0 | description | medical_specialty | sample_name | transcription | keywords |
|---|---|---|---|---|---|---|
| 0 | 0 | A 23-year-old white female presents with comp... | Allergy / Immunology | Allergic Rhinitis | SUBJECTIVE:, This 23-year-old white female pr... | allergy / immunology, allergic rhinitis, aller... |
| 1 | 1 | Consult for laparoscopic gastric bypass. | Bariatrics | Laparoscopic Gastric Bypass Consult - 2 | PAST MEDICAL HISTORY:, He has difficulty climb... | bariatrics, laparoscopic gastric bypass, weigh... |
| 2 | 2 | Consult for laparoscopic gastric bypass. | Bariatrics | Laparoscopic Gastric Bypass Consult - 1 | HISTORY OF PRESENT ILLNESS: , I have seen ABC ... | bariatrics, laparoscopic gastric bypass, heart... |
| 3 | 3 | 2-D M-Mode. Doppler. | Cardiovascular / Pulmonary | 2-D Echocardiogram - 1 | 2-D M-MODE: , ,1. Left atrial enlargement wit... | cardiovascular / pulmonary, 2-d m-mode, dopple... |
| 4 | 4 | 2-D Echocardiogram | Cardiovascular / Pulmonary | 2-D Echocardiogram - 2 | 1. The left ventricular cavity size and wall ... | cardiovascular / pulmonary, 2-d, doppler, echo... |


In [4]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df['transcription'], df['medical_specialty'], test_size=0.2, random_state=42)

In [8]:
# Create a list of stop words
stop_words = set(stopwords.words('english'))

# Create a stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

#### Task


**Text preprocessing:** As a preprocessing step, apply both `stemming` and `lemmatizing` to normalize your text before classifying. For each technique, use both the `CountVectorizer` and the `TfidfVectorizer`, along with their options for stop words and max features, to prepare the text data for your estimator.

**Classification:** Once you have prepared the text data with the stemming and lemmatizing techniques, consider `LogisticRegression`, `DecisionTreeClassifier`, and `MultinomialNB` as classification algorithms for the data. Compare their performance in terms of accuracy and speed.

Share your results as a table that lists the best version of each estimator, a dictionary of its best parameters, and its best score.

In [18]:
# Define a function to preprocess the text
def preprocess_text(text):
    """
    Preprocesses text using stemming and lemmatization, stop word removal, lowercasing,
    and punctuation removal.

    Args:
        text (str): The text to preprocess.

    Returns:
        str: The preprocessed text.
    """
    text = text.lower()  # Lowercase
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    words = text.split()
    words = [stemmer.stem(lemmatizer.lemmatize(word)) for word in words if word not in stop_words]
    return ' '.join(words)

In [19]:
# Define the pipelines
pipelines = [
    # CountVectorizer + Logistic Regression
    Pipeline([
        ('vectorizer', CountVectorizer(stop_words='english', max_features=1500, max_df=0.75)),
        ('scaler', StandardScaler(with_mean=False)),
        ('classifier', LogisticRegression(random_state=42))
    ]),
    # TfidfVectorizer + Logistic Regression
    Pipeline([
        ('vectorizer', TfidfVectorizer(stop_words='english', max_features=1500, max_df=0.75)),
        ('scaler', StandardScaler(with_mean=False)),
        ('classifier', LogisticRegression(random_state=42))
    ]),
    # CountVectorizer + Decision Tree
    Pipeline([
        ('vectorizer', CountVectorizer(stop_words='english', max_features=1500, max_df=0.75)),
        ('scaler', StandardScaler(with_mean=False)),
        ('classifier', DecisionTreeClassifier(random_state=42))
    ]),
    # TfidfVectorizer + Decision Tree
    Pipeline([
        ('vectorizer', TfidfVectorizer(stop_words='english', max_features=1500, max_df=0.75)),
        ('scaler', StandardScaler(with_mean=False)),
        ('classifier', DecisionTreeClassifier(random_state=42))
    ]),
    # CountVectorizer + Bayes
    Pipeline([
        ('vectorizer', CountVectorizer(stop_words='english', max_features=1500, max_df=0.75)),
        ('scaler', StandardScaler(with_mean=False)),
        ('classifier', MultinomialNB())
    ]),
    # TfidfVectorizer + Bayes
    Pipeline([
        ('vectorizer', TfidfVectorizer(stop_words='english', max_features=1500, max_df=0.75)),
        ('scaler', StandardScaler(with_mean=False)),
        ('classifier', MultinomialNB())
    ]),
]

In [20]:
param_grids = [
    # CountVectorizer + Logistic Regression
    {'classifier__C': [0.3, 0.4, 0.5]},
    # TfidfVectorizer + Logistic Regression
    {'classifier__C': [0.15, 0.2, 0.25]},
    # CountVectorizer + Decision Tree
    {'classifier__max_depth': [200, 220, 250], 'classifier__min_samples_split': [50, 60, 70]},
    # TfidfVectorizer + Decision Tree
    {'classifier__max_depth': [150, 175, 200], 'classifier__min_samples_split': [70, 80, 90]},
    # CountVectorizer + Bayes
    {'classifier__alpha': [0.004, 0.005, 0.006]},
    # TfidfVectorizer + Bayes
    {'classifier__alpha': [0.004, 0.005, 0.006]}
]
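As an alternative to keeping one pipeline per vectorizer, `GridSearchCV` can treat the vectorizer step itself as a hyperparameter. The sketch below shows the idea on an invented toy corpus; the `alpha` values are arbitrary and the step names simply mirror the ones used in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus and labels, made up purely for illustration
texts = [
    "chest pain and shortness of breath", "irregular heartbeat on exam",
    "heart murmur noted", "elevated blood pressure reading",
    "itchy rash on both arms", "dry skin with redness",
    "skin lesion biopsy performed", "eczema flare on hands",
]
labels = ["cardio"] * 4 + ["derm"] * 4

pipe = Pipeline([
    ("vectorizer", CountVectorizer()),  # placeholder; the grid swaps this step
    ("classifier", MultinomialNB()),
])

# Using a step name as a key lets the search replace the whole step
param_grid = {
    "vectorizer": [
        CountVectorizer(stop_words="english"),
        TfidfVectorizer(stop_words="english"),
    ],
    "classifier__alpha": [0.01, 0.1, 1.0],
}

grid = GridSearchCV(pipe, param_grid, cv=2, scoring="accuracy")
grid.fit(texts, labels)
print(type(grid.best_params_["vectorizer"]).__name__, grid.best_params_["classifier__alpha"])
```

This evaluates two vectorizers times three `alpha` values in one search. The trade-off is that every vectorizer shares the same classifier grid, whereas separate pipelines allow a different grid for each combination.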

### Results will be in the format below:

In [21]:
# Set up a dictionary to store the results
results = {
    'model': [
        'Logistic Regression (CountVectorizer)',
        'Logistic Regression (TfidfVectorizer)',
        'Decision Tree (CountVectorizer)',
        'Decision Tree (TfidfVectorizer)',
        'Naive Bayes (CountVectorizer)',
        'Naive Bayes (TfidfVectorizer)'
    ],
    'best_params': ['' for _ in range(6)],
    'best_score': ['' for _ in range(6)],
    'mean_fit_time': ['' for _ in range(6)]
}

In [22]:
pd.set_option('display.max_colwidth', None)  # Set to None for unlimited width

# Train and evaluate each pipeline with GridSearchCV
for i, (pipeline, param_grid) in enumerate(zip(pipelines, param_grids)):
    grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
    grid_search.fit(X_train, y_train)

    results['best_params'][i] = str(grid_search.best_params_)
    results['best_score'][i] = grid_search.best_score_
    results['mean_fit_time'][i] = grid_search.cv_results_['mean_fit_time'][grid_search.best_index_]
    print(f"{results['model'][i]}:")
    print(f"  Best params: {results['best_params'][i]}")
    print(f"  Best cross-validation score: {results['best_score'][i]:.4f}")
    print(f"  Mean fit time: {results['mean_fit_time'][i]:.2f}s")
    print()


# Create a Pandas DataFrame from the results
results_df = pd.DataFrame(results).set_index('model')

# Print the results
results_df



ValueError: 
All the 15 fits failed.
It is very likely that your model is misconfigured.
You can try to debug the error by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
15 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/sklearn/model_selection/_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.11/dist-packages/sklearn/base.py", line 1389, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/sklearn/pipeline.py", line 654, in fit
    Xt = self._fit(X, y, routed_params, raw_params=params)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/sklearn/pipeline.py", line 588, in _fit
    X, fitted_transformer = fit_transform_one_cached(
                            ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/joblib/memory.py", line 312, in __call__
    return self.func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/sklearn/pipeline.py", line 1551, in _fit_transform_one
    res = transformer.fit_transform(X, y, **params.get("fit_transform", {}))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/sklearn/base.py", line 1389, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/sklearn/feature_extraction/text.py", line 1376, in fit_transform
    vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary_)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/sklearn/feature_extraction/text.py", line 1263, in _count_vocab
    for feature in analyze(doc):
                   ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/sklearn/feature_extraction/text.py", line 99, in _analyze
    doc = decoder(doc)
          ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/sklearn/feature_extraction/text.py", line 232, in decode
    raise ValueError(
ValueError: np.nan is an invalid document, expected byte or unicode string.
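
The traceback ends with `np.nan is an invalid document`, which suggests that at least one row in the `transcription` column is missing. A minimal sketch of one possible remedy is to drop those rows before splitting and fitting. The toy frame below is invented to stand in for the real data; replacing missing text with an empty string would be another option.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Toy frame standing in for the real data; one transcription is missing
toy = pd.DataFrame({
    "transcription": ["chest pain on exertion", np.nan, "itchy rash on arm"],
    "medical_specialty": ["Cardiovascular / Pulmonary", "Bariatrics", "Dermatology"],
})

# Count the missing documents before deciding how to handle them
n_missing = toy["transcription"].isna().sum()
print(f"Missing transcriptions: {n_missing}")

# Drop rows whose text is missing so features and labels stay aligned
clean = toy.dropna(subset=["transcription"])

# The vectorizer now receives only strings
matrix = CountVectorizer().fit_transform(clean["transcription"])
print(matrix.shape)
```

Applied to this notebook, the same `dropna(subset=['transcription'])` call on `df` would go before `train_test_split`, so that the features and labels are filtered together.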
