## Main Function: make_predictions

`make_predictions` is the third main function in the pipeline. Once the data has been cleaned with `pre_process`, it can be passed to `run_pipeline`, which returns several values, including the trained model, the vectorizer, the vector representation, and the accuracy. `make_predictions` uses these values to make predictions on new data and returns a `polars` DataFrame with the predictions.

In [1]:
import polars as pl
df = pl.read_csv("training_data/train.csv", encoding="utf-8")
df = df[1:500]

In [3]:
from quick_sentiments import pre_process
from quick_sentiments import run_pipeline
from quick_sentiments import make_predictions


In [4]:
response_column = "reviewText" 
sentiment_column = "sentiment"
df = df.with_columns(
    # Any extra pre_process options go inside the lambda passed to map_elements.
    pl.col(response_column).map_elements(lambda x: pre_process(x, remove_brackets=True)).alias("processed")
)

After pre-processing the data, suppose we run the pipeline with Word2Vec (`wv`) as the vectorization method and logistic regression (`logit`) as the machine learning method. The `run_pipeline` function returns the model, vectorizer, vector representation, and accuracy, among other values.

In [15]:
dt = run_pipeline(
    vectorizer_name="wv",  # options: BOW, tf, tfidf, wv
    model_name="logit",  # options: logit, rf, XGB (XGB is slow; not recommended for routine use)
    df=df,
    text_column_name="processed",  # column holding the pre-processed text
    sentiment_column_name="sentiment",  # column holding the sentiment labels
    perform_tuning=False,  # set to True for hyperparameter tuning; this takes longer
                           # and may run out of memory on large datasets
)

--- Running Pipeline for Wv + Logit ---
Labels encoded: Original -> ['NEGATIVE' 'POSITIVE'], Encoded -> [0 1]
1. Vectorizing entire dataset (X)...
Using already loaded Word2Vec model.
2. Splitting data into train/test...
3. Training and predicting...
   - Training Logistic Regression with default parameters (no hyperparameter tuning)...
   - Model trained with default parameters.
Best model parameters: {'C': 1.0, 'class_weight': None, 'dual': False, 'fit_intercept': True, 'intercept_scaling': 1, 'l1_ratio': None, 'max_iter': 100, 'multi_class': 'deprecated', 'n_jobs': None, 'penalty': 'l2', 'random_state': 42, 'solver': 'lbfgs', 'tol': 0.0001, 'verbose': 0, 'warm_start': False}
4. Evaluating model...

Classification Report:
              precision    recall  f1-score   support

    NEGATIVE       0.70      0.42      0.53        33
    POSITIVE       0.75      0.91      0.82        64

    accuracy                           0.74        97
   macro avg       0.73      0.67      0.68     

Now, based on the best accuracy, we can select the model and vectorizer to use on new data. The `make_predictions` function takes the model, vectorizer, label encoder, and new data as input and returns a DataFrame with the predictions. The new data must be pre-processed the same way as the training data.
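Choosing by best accuracy can be done by hand, or with a few lines of Python if several combinations were tried. The following is a minimal sketch under stated assumptions: the `runs` list and its `"accuracy"` key are hypothetical bookkeeping for this example (the source does not show how `run_pipeline` exposes accuracy), and only the `wv` + `logit` figure comes from the classification report above.

```python
# Hypothetical bookkeeping for several run_pipeline calls.
# Only the wv + logit accuracy (0.74) is taken from the report above;
# the other two values are placeholders, not measured results.
runs = [
    {"vectorizer_name": "tfidf", "model_name": "logit", "accuracy": 0.70},  # placeholder
    {"vectorizer_name": "wv", "model_name": "logit", "accuracy": 0.74},  # from the report
    {"vectorizer_name": "tf", "model_name": "rf", "accuracy": 0.68},  # placeholder
]

# Pick the run with the highest accuracy.
best_run = max(runs, key=lambda run: run["accuracy"])
print(best_run["vectorizer_name"], best_run["model_name"], best_run["accuracy"])
```

With imbalanced classes, as in the report above, it may be worth comparing the macro-averaged F1 score as well as accuracy before settling on a combination.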

In [19]:
new_data = pl.read_csv("new_data/test.csv", encoding="ISO-8859-1")
new_data = new_data[1:250]
new_data = new_data.with_columns(
    # Use the same pre_process options as for the training data.
    pl.col(response_column).map_elements(lambda x: pre_process(x, remove_brackets=True)).alias("processed")
)

#### make_predictions


In [20]:
make_predictions(
    new_data=new_data,
    text_column_name="processed",
    vectorizer=dt["vectorizer_object"],
    best_model=dt["model_object"],
    label_encoder=dt["label_encoder"],
    prediction_column_name="sentiment_predictions"  # Optional custom name
)

| movieid | reviewerName | isTopCritic | reviewText | processed | sentiment_predictions |
| --- | --- | --- | --- | --- | --- |
| str | str | bool | str | str | str |
| "terminator_kat… | "Brian Chaney" | false | "Philip Noyce's… | "philip noyce d… | "POSITIVE" |
| "james_bond_lab… | "Danielle Parke… | false | "It wouldn't do… | "would nt say p… | "POSITIVE" |
| "v_quest_han_so… | "Brittany Lane" | false | "Pig is not exa… | "pig exactly ar… | "POSITIVE" |
| "enigma_hulk_su… | "Justin Willis" | false | "An imaginative… | "imaginative no… | "POSITIVE" |
| "infinite_elega… | "Carla Guzman" | false | "Life happens..… | "life happens l… | "POSITIVE" |
| "travis_bickle_… | "Kathy Wade" | false | "You can't hire… | "ca nt hire jud… | "POSITIVE" |
| "jack_torrance_… | "Diana Black" | false | "certainly riva… | "certainly riva… | "POSITIVE" |
| "rick_blaine_ne… | "Hunter Castill… | false | "&apos;Avatar&a… | "apos avatar ap… | "POSITIVE" |
| "the_joker_hann… | "Shawn Bautista… | false | "Rock of Ages i… | "rock age sound… | "POSITIVE" |
| "gandalf_the_gr… | "Tara Rich" | false | "It might just … | "might ultimate… | "POSITIVE" |
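
A prediction call that takes a vectorizer, a model, and a label encoder typically chains three steps: transform the text, predict encoded classes, and map them back to the original labels. The sketch below illustrates that general pattern with scikit-learn on made-up toy data. It is an assumption about the overall flow, not the actual `quick_sentiments` implementation, and `predict_labels` is a hypothetical helper name.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

# Toy training data, invented for this example.
train_texts = [
    "great movie loved it",
    "wonderful acting great story",
    "terrible plot boring",
    "awful boring waste",
]
train_labels = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]

# Fit the three objects that a later prediction step needs.
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(train_labels)  # NEGATIVE -> 0, POSITIVE -> 1
vectorizer = CountVectorizer()  # bag-of-words features
X = vectorizer.fit_transform(train_texts)
model = LogisticRegression(random_state=42).fit(X, y)


def predict_labels(texts, vectorizer, model, label_encoder):
    """Hypothetical helper: vectorize, predict, then decode the labels."""
    features = vectorizer.transform(texts)  # reuse the fitted vocabulary
    encoded = model.predict(features)  # integer class ids
    return label_encoder.inverse_transform(encoded).tolist()


predicted = predict_labels(["great story", "boring waste"], vectorizer, model, label_encoder)
print(predicted)
```

The key point is that the new text goes through `transform`, not `fit_transform`, so it is encoded with the vocabulary learned during training, and the label encoder turns the numeric classes back into `NEGATIVE` and `POSITIVE`.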
