In [None]:
#locally install the package
! pip install ..\dist\quick_sentiments-0.1.6-py3-none-any.whl

## Main Function pipeline

`run_pipeline` is the second main function; it runs the whole pipeline. Once the data has been cleaned with `pre_process`, it can be passed to `run_pipeline`. The function takes the DataFrame, the vectorization method, the machine learning method, the name of the pre-processed text column, and the name of the target column. It also accepts a flag that turns on hyperparameter tuning for the machine learning method. By default, tuning is set to False.

In [11]:
import polars as pl
df = pl.read_csv("training_data/train.csv", encoding="utf-8")
df = df[1:500]

In [10]:
from quick_sentiments import pre_process
from quick_sentiments import run_pipeline


In [12]:
response_column = "reviewText" 
sentiment_column = "sentiment"
df = df.with_columns(
    pl.col(response_column).map_elements(lambda x: pre_process(x, remove_brackets=True)).alias("processed")  # pass any extra pre_process options inside the lambda
)

This is how the function is called:

def run_pipeline(

    vectorizer_name: str,  <- name the vectorization method ("BOW", "tf", "tfidf", "wv")

    model_name: str,     <- name the machine learning method ("logit", "rf", "XGB")

    df: Union[pl.DataFrame, pd.DataFrame],  <- the dataframe that contains the pre-processed text data and the target variable

    text_column_name: str,  <- the column name of the pre-processed text data

    sentiment_column_name: str,  <- the column name of the target variable

    perform_tuning: bool = False <- whether to perform hyperparameter tuning or not, default is False
    
):
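
The parameters above make it easy to try several vectorizer and model pairs in one loop. The sketch below is only an illustration: `compare_combinations` is a hypothetical helper, not part of `quick_sentiments`, and it assumes the function it is given accepts the keyword arguments listed in the signature and returns a dictionary with an `accuracy` entry. The pipeline function is passed in as `run_fn` so the helper itself has no dependency on the package.

```python
from itertools import product
from typing import Any, Callable, Dict, List, Sequence, Tuple

# Option names taken from the signature comments above.
VECTORIZERS = ("BOW", "tf", "tfidf", "wv")
MODELS = ("logit", "rf", "XGB")


def compare_combinations(
    run_fn: Callable[..., Dict[str, Any]],
    df: Any,
    text_column_name: str,
    sentiment_column_name: str,
    vectorizers: Sequence[str] = VECTORIZERS,
    models: Sequence[str] = MODELS,
    perform_tuning: bool = False,
) -> List[Tuple[str, str, float]]:
    """Hypothetical helper: run every vectorizer/model pair, rank by accuracy."""
    scores = []
    for vectorizer_name, model_name in product(vectorizers, models):
        result = run_fn(
            vectorizer_name=vectorizer_name,
            model_name=model_name,
            df=df,
            text_column_name=text_column_name,
            sentiment_column_name=sentiment_column_name,
            perform_tuning=perform_tuning,
        )
        # Assumes the returned dictionary exposes an "accuracy" entry.
        scores.append((vectorizer_name, model_name, result["accuracy"]))
    # Highest accuracy first.
    return sorted(scores, key=lambda row: row[2], reverse=True)
```

With the package installed, a call might look like `compare_combinations(run_pipeline, df, "processed", "sentiment", models=["logit", "rf"])`, which leaves out the slower XGB option.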

`run_pipeline` can be used once the data has been pre-processed with `pre_process`. It returns a dictionary with the results of model training and evaluation. The dictionary contains the following keys:
- `model_object`: the trained model, used for prediction later
- `vectorizer_name`: the name of the vectorizer applied to the text data, used for prediction later
- `vectorizer_object`: the fitted vectorizer object, used for prediction later
- `label_encoder`: the fitted label encoder, needed to decode the labels back to their original form
- `y_test`: the true labels of the test data
- `y_pred`: the predicted labels of the test data
- `accuracy`: the accuracy of the model on the test data
- `report`: the classification report of the model on the test data

    
    return {
        "model_object": trained_model_object,
        "vectorizer_name": vectorizer_name,
        "vectorizer_object": fitted_vectorizer_object,
        "label_encoder": label_encoder,
        "y_test": y_test,
        "y_pred": y_pred,
        "accuracy": accuracy_score(y_test, y_pred),
        "report": classification_report(y_test, y_pred, output_dict=True, target_names=label_encoder.classes_)
    }
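
Because `report` is built with `output_dict=True`, it is a nested dictionary rather than a printed table, so individual metrics can be read out programmatically. The sketch below shows one way to do that. `per_class_rows` and `weakest_class` are hypothetical helpers, and `example_result` is a hand-written stand-in that reuses the rounded BOW + logit numbers printed later in this notebook; it is not real pipeline output.

```python
from typing import Any, Dict, List, Tuple


def per_class_rows(report: Dict[str, Any]) -> List[Tuple[str, float, float]]:
    """Return (label, f1-score, support) per class, skipping summary rows."""
    rows = []
    for label, metrics in report.items():
        # "accuracy" maps to a plain float; "macro avg" and "weighted avg"
        # are summaries rather than classes.
        if not isinstance(metrics, dict) or label.endswith(" avg"):
            continue
        rows.append((label, metrics["f1-score"], metrics["support"]))
    return rows


def weakest_class(report: Dict[str, Any]) -> str:
    """Return the class label with the lowest F1 score."""
    return min(per_class_rows(report), key=lambda row: row[1])[0]


# Stand-in for a returned dictionary (values hand-copied and rounded).
example_result = {
    "accuracy": 0.66,
    "report": {
        "NEGATIVE": {"precision": 0.50, "recall": 0.18, "f1-score": 0.27, "support": 33},
        "POSITIVE": {"precision": 0.68, "recall": 0.91, "f1-score": 0.78, "support": 64},
        "accuracy": 0.66,
    },
}

print(weakest_class(example_result["report"]))  # NEGATIVE
```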

#### Running BOW and Logistic Regression


No tuning

In [13]:
dt = run_pipeline(
    vectorizer_name="BOW",  # one of: BOW, tf, tfidf, wv
    model_name="logit",  # one of: logit, rf, XGB (XGB is slow, so it is not recommended for routine runs)
    df=df,
    text_column_name="processed",  # column holding the pre-processed text
    sentiment_column_name="sentiment",
    perform_tuning=False,  # set to True for hyperparameter tuning; it takes longer and
                           # may run out of memory on large datasets
)

--- Running Pipeline for Bow + Logit ---
Labels encoded: Original -> ['NEGATIVE' 'POSITIVE'], Encoded -> [0 1]
1. Vectorizing entire dataset (X)...
   - Generating Bag-of-Words features...
2. Splitting data into train/test...
3. Training and predicting...
   - Training Logistic Regression with default parameters (no hyperparameter tuning)...
   - Model trained with default parameters.
Best model parameters: {'C': 1.0, 'class_weight': None, 'dual': False, 'fit_intercept': True, 'intercept_scaling': 1, 'l1_ratio': None, 'max_iter': 100, 'multi_class': 'deprecated', 'n_jobs': None, 'penalty': 'l2', 'random_state': 42, 'solver': 'lbfgs', 'tol': 0.0001, 'verbose': 0, 'warm_start': False}
4. Evaluating model...

Classification Report:
              precision    recall  f1-score   support

    NEGATIVE       0.50      0.18      0.27        33
    POSITIVE       0.68      0.91      0.78        64

    accuracy                           0.66        97
   macro avg       0.59      0.54      0.52

With parameter tuning

In [14]:
dt = run_pipeline(
    vectorizer_name="BOW",  # one of: BOW, tf, tfidf, wv
    model_name="logit",  # one of: logit, rf, XGB (XGB is slow, so it is not recommended for routine runs)
    df=df,
    text_column_name="processed",  # column holding the pre-processed text
    sentiment_column_name="sentiment",
    perform_tuning=True,  # hyperparameter tuning is on; it takes longer and
                          # may run out of memory on large datasets
)

--- Running Pipeline for Bow + Logit ---
Labels encoded: Original -> ['NEGATIVE' 'POSITIVE'], Encoded -> [0 1]
1. Vectorizing entire dataset (X)...
   - Generating Bag-of-Words features...
2. Splitting data into train/test...
3. Training and predicting...
   - Starting Logistic Regression training with GridSearchCV for hyperparameter tuning...
   - Using default parameter grid for tuning: {'solver': ['liblinear', 'lbfgs'], 'C': [0.1, 1.0, 10.0], 'class_weight': [None, 'balanced'], 'max_iter': [500, 1000]}
Fitting 5 folds for each of 24 candidates, totalling 120 fits

   - Best Hyperparameters found:
{'C': 10.0, 'class_weight': 'balanced', 'max_iter': 500, 'solver': 'liblinear'}
   - Best Cross-Validation Score (F1-weighted): 0.6243
Best model parameters: {'C': 10.0, 'class_weight': 'balanced', 'dual': False, 'fit_intercept': True, 'intercept_scaling': 1, 'l1_ratio': None, 'max_iter': 500, 'multi_class': 'deprecated', 'n_jobs': None, 'penalty': 'l2', 'random_state': 42, 'solver': 'libli
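
The candidate and fit counts in the log above follow directly from the size of the parameter grid. Below is a minimal sketch of that arithmetic, assuming a grid given as a single dictionary and the 5 folds reported in the log. `count_fits` is a hypothetical helper written for this note, not part of `quick_sentiments` or scikit-learn.

```python
from math import prod
from typing import Any, Dict, List, Tuple


def count_fits(param_grid: Dict[str, List[Any]], cv_folds: int) -> Tuple[int, int]:
    """Return (number of candidates, number of cross-validation fits)."""
    # One candidate per combination of values, so multiply the list lengths.
    n_candidates = prod(len(values) for values in param_grid.values())
    return n_candidates, n_candidates * cv_folds


# Grid copied from the log line above.
logit_grid = {
    "solver": ["liblinear", "lbfgs"],
    "C": [0.1, 1.0, 10.0],
    "class_weight": [None, "balanced"],
    "max_iter": [500, 1000],
}

print(count_fits(logit_grid, cv_folds=5))  # (24, 120)
```

Each extra value in any list multiplies the total, which is why tuning takes noticeably longer than a single default fit.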

#### Using TF-IDF and Random Forest

In [15]:
dt = run_pipeline(
    vectorizer_name="tfidf",
    model_name="rf",
    df=df,
    text_column_name="processed",
    sentiment_column_name="sentiment",
    perform_tuning=False,
)

--- Running Pipeline for Tfidf + Rf ---
Labels encoded: Original -> ['NEGATIVE' 'POSITIVE'], Encoded -> [0 1]
1. Vectorizing entire dataset (X)...
   - Generating TF-IDF features...
2. Splitting data into train/test...
3. Training and predicting...
   - Training Random Forest with default parameters (no hyperparameter tuning)...
   - Model trained with default parameters.
4. Evaluating model...

Classification Report:
              precision    recall  f1-score   support

    NEGATIVE       0.50      0.03      0.06        33
    POSITIVE       0.66      0.98      0.79        64

    accuracy                           0.66        97
   macro avg       0.58      0.51      0.42        97
weighted avg       0.61      0.66      0.54        97

True labels distribution: Counter({np.int64(1): 64, np.int64(0): 33})
Predicted labels distribution: Counter({np.int64(1): 95, np.int64(0): 2})
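
The two distributions above help put the 0.66 accuracy in context: the model predicts the positive class for 95 of the 97 test rows. The sketch below computes the accuracy that always predicting the most frequent true label would give, using the counts printed above. `majority_baseline_accuracy` is a hypothetical helper, not part of the package.

```python
from collections import Counter


def majority_baseline_accuracy(true_counts: Counter) -> float:
    """Accuracy obtained by always predicting the most frequent true label."""
    total = sum(true_counts.values())
    return max(true_counts.values()) / total


# Counts copied from the output above (1 = POSITIVE, 0 = NEGATIVE).
true_counts = Counter({1: 64, 0: 33})
pred_counts = Counter({1: 95, 0: 2})

baseline = majority_baseline_accuracy(true_counts)
positive_share = pred_counts[1] / sum(pred_counts.values())

print(round(baseline, 2))        # 0.66
print(round(positive_share, 2))  # 0.98
```

A baseline of about 0.66 matches the reported accuracy, which suggests the untuned Random Forest adds little over always predicting POSITIVE on this sample. The low NEGATIVE recall in the report points the same way.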


In [16]:
dt = run_pipeline(
    vectorizer_name="tfidf",
    model_name="rf",
    df=df,
    text_column_name="processed",
    sentiment_column_name="sentiment",
    perform_tuning=True,
)

--- Running Pipeline for Tfidf + Rf ---
Labels encoded: Original -> ['NEGATIVE' 'POSITIVE'], Encoded -> [0 1]
1. Vectorizing entire dataset (X)...
   - Generating TF-IDF features...
2. Splitting data into train/test...
3. Training and predicting...
   - Starting Random Forest training with GridSearchCV for hyperparameter tuning...
   - Using default parameter grid for tuning: {'n_estimators': [100, 200, 300], 'max_depth': [10, 20, None], 'min_samples_split': [2, 5], 'min_samples_leaf': [1, 2], 'class_weight': [None, 'balanced']}
Fitting 5 folds for each of 72 candidates, totalling 360 fits

   - Best Hyperparameters found:
{'class_weight': 'balanced', 'max_depth': None, 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 100}
   - Best Cross-Validation Score (F1-weighted): 0.6001
4. Evaluating model...

Classification Report:
              precision    recall  f1-score   support

    NEGATIVE       0.36      0.55      0.43        33
    POSITIVE       0.68      0.50      0.5