# Baseline: Logistic Regression + TF-IDF

This notebook builds a strong baseline for review sentiment classification using classical methods: TF-IDF features and logistic regression.

In [None]:
# !pip install skl2onnx==1.12.0 onnxruntime==1.13.1 protobuf==3.20.1

## Imports

In [None]:
import os
from pathlib import Path

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import Pipeline

In [None]:
import onnxruntime as rt
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import StringTensorType

In [None]:
SEED = 42

## Paths

In [None]:
SAVED_MODELS_PATH = "saved_models"

In [None]:
relative_path = os.path.join("../../../", "data")

In [None]:
sentiment_analysis_data_path = os.path.join(relative_path, "3_sentiment_analysis")

In [None]:
Path(SAVED_MODELS_PATH).mkdir(parents=True, exist_ok=True)

## Data

### Loading data

In [None]:
reviews = pd.read_parquet(
    os.path.join(sentiment_analysis_data_path, "split_reviews.parquet")
)
reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 206537 entries, 0 to 206536
Data columns (total 3 columns):
 #   Column     Non-Null Count   Dtype   
---  ------     --------------   -----   
 0   sentiment  206537 non-null  category
 1   review     206537 non-null  object  
 2   fold       206537 non-null  object  
dtypes: category(1), object(2)
memory usage: 3.3+ MB


In [None]:
# Take explicit copies so the column assignments below modify independent frames
# instead of slices of `reviews` (which triggers SettingWithCopyWarning).
train = reviews[reviews["fold"] == "train"].copy()
test = reviews[reviews["fold"] == "test"].copy()

In [None]:
# Replace literal "<p>" tags with a space; regex=False treats the pattern as plain text.
train["review"] = train["review"].str.replace("<p>", " ", regex=False)
test["review"] = test["review"].str.replace("<p>", " ", regex=False)


In [None]:
X_train, X_test, y_train, y_test = (
    train["review"].values.tolist(),
    test["review"].values.tolist(),
    train["sentiment"].values.tolist(),
    test["sentiment"].values.tolist(),
)

len(X_train), len(X_test), len(y_train), len(y_test)

(185883, 20654, 185883, 20654)

# Investigation

### Text encoding

For the baseline model, I've decided to start with TF-IDF and logistic regression.

#### Hyperparameter Investigation

##### `lowercase`

In [None]:
vectorizer = CountVectorizer(lowercase=False)
vectors_wo_lowercase = vectorizer.fit_transform(X_train)

print(
    f"The size of the train dataset is {vectors_wo_lowercase.shape} with lowercase turned off"
)

In [None]:
vectorizer = CountVectorizer()
vectors_w_lowercase = vectorizer.fit_transform(X_train)

print(
    f"The size of the train dataset is {vectors_w_lowercase.shape} with lowercase turned on"
)

In [None]:
vectors_wo_lowercase.shape[1] - vectors_w_lowercase.shape[1]

Turning lowercasing off makes the vocabulary more than 100,000 terms larger than with lowercasing on, so we stick with the default `lowercase=True`.
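
To make the effect of `lowercase` concrete, here is a minimal sketch on a made-up three-document corpus (the documents are hypothetical and not taken from the review dataset). It only shows that case variants of the same word become separate vocabulary entries when lowercasing is off.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used only for illustration.
docs = ["Great film", "great FILM", "Bad film"]

# With lowercase=False, "Great"/"great" and "FILM"/"film" are distinct terms.
cased = CountVectorizer(lowercase=False).fit(docs)
# With lowercase=True (the default), case variants collapse into one term.
uncased = CountVectorizer(lowercase=True).fit(docs)

cased_vocab = sorted(cased.vocabulary_)
uncased_vocab = sorted(uncased.vocabulary_)

print(len(cased_vocab), cased_vocab)
print(len(uncased_vocab), uncased_vocab)
```

On this toy corpus the cased vocabulary has five entries and the lowercased one has three; on real reviews the gap is much larger, as measured above.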

##### `max_df` and `min_df`

`min_df` is used for removing terms that appear **too infrequently**. For example:

 - `min_df = 0.01` means "ignore terms that appear in **less than 1% of the documents**".
 - `min_df = 5` means "ignore terms that appear in **less than 5 documents**".  
 
The default `min_df` is `1`, which means "ignore terms that appear in **less than 1 document**".  
Thus, the default setting does not ignore any terms.

`max_df` is used for removing terms that appear **too frequently**, also known as "corpus-specific stop words". For example:

 - `max_df = 0.50` means "ignore terms that appear in **more than 50% of the documents**".
 - `max_df = 25` means "ignore terms that appear in **more than 25 documents**".  
 
The default `max_df` is `1.0`, which means "ignore terms that appear in **more than 100% of the documents**".  
Thus, the default setting does not ignore any terms.
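
The two thresholds can be sketched on a tiny corpus. The documents below are invented for illustration; the document frequencies are `film`: 4, `good`: 2, `plot`: 2, and 1 for every other term.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus of four documents.
docs = [
    "film good plot",
    "film bad plot",
    "film good actors",
    "film boring",
]

# Integer min_df=2: keep only terms that occur in at least 2 documents.
min_vocab = sorted(CountVectorizer(min_df=2).fit(docs).vocabulary_)

# Float max_df=0.75: drop terms that occur in more than 75% of documents.
# Here "film" is in 4 of 4 documents (100%), so it is removed.
max_vocab = sorted(CountVectorizer(max_df=0.75).fit(docs).vocabulary_)

print(min_vocab)
print(max_vocab)
```

An integer threshold is an absolute document count, while a float is a fraction of the corpus, so the same parameter behaves very differently depending on its type.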

In [None]:
vectorizer.get_feature_names_out()[:50]

array(['00', '000', '0000', '00000', '000000',
       '000000000000000000попкорн000000000000', '000000000000001',
       '000000000000на', '00000000000во', '00000000000данной',
       '00000000000есть000000000000000',
       '00000000000есть000000000000000000', '0000000000жевать',
       '0000000000ненавижу00000000', '00000000016', '000000000надо',
       '000000000разговаривать0000000000', '00000000визуальная',
       '00000001', '000001', '00000громко', '00000точек', '00001',
       '00007', '0001', '0002', '000доктора', '000какой',
       '000косметические', '000р', '000теряются', '001', '002', '003',
       '00381', '006', '007', '00в', '00вых', '00е', '00м', '00по', '00с',
       '00седьмого', '00х', '00ые', '00ых', '01', '011', '013'],
      dtype=object)

Without a limit on the vocabulary, it fills up with very rare tokens, such as long digit strings glued to words, so it is worth pruning.  
To do that, we need to choose the `min_df` and `max_df` thresholds.

In [None]:
vectorizer = CountVectorizer(min_df=0.8)
vectors = vectorizer.fit_transform(X_train)
vectors.shape

CPU times: total: 39.3 s
Wall time: 39.3 s


(186063, 7)

In [None]:
vectorizer.get_feature_names_out()

array(['как', 'на', 'не', 'но', 'то', 'что', 'это'], dtype=object)

These words appear in at least 80% of all reviews, which is expected: they are common Russian function words.

In [None]:
MIN_DF = 0.01
vectorizer = CountVectorizer(min_df=MIN_DF)
vectors = vectorizer.fit_transform(X_train)

print(
    f"The size of the train dataset is {vectors.shape} with lowercase turned on and min_df={MIN_DF}"
)

The size of the train dataset is (186063, 3284) with lowercase turned on and min_df=0.01
CPU times: total: 39.4 s
Wall time: 39.4 s


In [None]:
vectorizer.get_feature_names_out()[:50]

array(['10', '100', '11', '12', '13', '15', '16', '18', '20', '2012',
       '21', '30', '3d', '40', '50', '60', '70', '80', '90', 'dc',
       'marvel', 'of', 'the', 'абсолютно', 'аватар', 'автор', 'автора',
       'авторов', 'авторы', 'аж', 'актер', 'актера', 'актерам',
       'актерами', 'актерах', 'актеров', 'актером', 'актерская',
       'актерский', 'актерского', 'актерской', 'актерскую', 'актеры',
       'актриса', 'актрисы', 'актёр', 'актёра', 'актёров', 'актёрская',
       'актёрский'], dtype=object)

In [None]:
MIN_DF = 0.01
MAX_DF = 0.9
vectorizer = CountVectorizer(min_df=MIN_DF, max_df=MAX_DF)
vectors = vectorizer.fit_transform(X_train)

print(
    f"The size of the train dataset is {vectors.shape} with lowercase turned on and min_df={MIN_DF} and max_df={MAX_DF}"
)

The size of the train dataset is (186063, 3281) with lowercase turned on and min_df=0.01 and max_df=0.9
CPU times: total: 39.2 s
Wall time: 39.2 s


##### `ngram_range`

The lower and upper boundary of the range of n-values for different n-grams to be extracted.  
All values of n such that min_n ≤ n ≤ max_n will be used.   

For example, an `ngram_range` of `(1, 1)` means only unigrams, `(1, 2)` means unigrams and bigrams, and `(2, 2)` means only bigrams.
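
A small sketch of how the range changes the extracted features, using a single invented sentence (not from the dataset):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical one-document corpus.
docs = ["very good film"]


def vocab_for(ngram_range):
    """Return the sorted vocabulary extracted with the given ngram_range."""
    vectorizer = CountVectorizer(ngram_range=ngram_range)
    return sorted(vectorizer.fit(docs).vocabulary_)


unigrams = vocab_for((1, 1))
uni_and_bigrams = vocab_for((1, 2))
bigrams_only = vocab_for((2, 2))
up_to_trigrams = vocab_for((1, 3))

print(unigrams)
print(uni_and_bigrams)
print(bigrams_only)
print(up_to_trigrams)
```

Each increase of the upper bound adds new features on top of the shorter n-grams, which is why vocabulary size and vectorization time grow quickly with `ngram_range`.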

In [None]:
NGRAM_RANGE = (1, 3)
vectorizer = CountVectorizer(ngram_range=NGRAM_RANGE, min_df=MIN_DF)
train_vectors = vectorizer.fit_transform(X_train)

# Report the shape of `train_vectors`, not the stale `vectors` from the previous cell.
print(
    f"The size of the train dataset is {train_vectors.shape} with lowercase turned on and min_df={MIN_DF} and ngram_range={NGRAM_RANGE}"
)

The size of the train dataset is (186063, 3281) with lowercase turned on and min_df=0.01 and ngram_range=(1, 3)
CPU times: total: 5min 35s
Wall time: 7min 7s


In [None]:
vectorizer.get_feature_names_out()[:50]

array(['10', '10 лет', '100', '11', '12', '13', '15', '16', '18', '20',
       '2012', '21', '30', '3d', '40', '50', '60', '70', '80', '90', 'dc',
       'marvel', 'of', 'the', 'абсолютно', 'абсолютно все',
       'абсолютно не', 'аватар', 'автор', 'автора', 'авторов', 'авторы',
       'аж', 'актер', 'актера', 'актерам', 'актерами', 'актерах',
       'актеров', 'актером', 'актерская', 'актерская игра', 'актерский',
       'актерский состав', 'актерского', 'актерской', 'актерской игры',
       'актерскую', 'актерскую игру', 'актеры'], dtype=object)

# Modelling

In [None]:
stages = []

## Vectorizing reviews with TF-IDF

In [None]:
vectorizer_params = {
    "min_df": 0.01,
    "ngram_range": (1, 2),
    "max_features": 10_000,
}

review_vectorizer = TfidfVectorizer(**vectorizer_params)

In [None]:
stages.append(("vectorizer", review_vectorizer))

## LogReg

In [None]:
log_reg = LogisticRegression(
    C=1, random_state=SEED, n_jobs=-1, solver="saga", max_iter=10_000
)

In [None]:
stages.append(("classifier", log_reg))

## Training

In [None]:
pipe = Pipeline(stages)

In [None]:
pipe.fit(X_train, y_train)

Pipeline(steps=[('vectorizer',
                 TfidfVectorizer(max_features=10000, min_df=0.01,
                                 ngram_range=(1, 2))),
                ('classifier',
                 LogisticRegression(C=1, max_iter=10000, n_jobs=-1,
                                    random_state=42, solver='saga'))])

### Evaluation

In [None]:
pred_labels = pipe.predict(X_test)

In [None]:
averaging = "micro"
f1 = f1_score(y_test, pred_labels, average=averaging)

In [None]:
print(f"F1 score with {averaging}-averaging is {f1.round(3)}")

F1 score with micro-averaging is 0.801
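
A note on the metric: for single-label classification, micro-averaged F1 pools true positives, false positives and false negatives over all classes, which makes it equal to plain accuracy. The sketch below checks this on made-up labels (the class names `pos`, `neg`, `neu` are assumptions for illustration, not the dataset's actual categories) and contrasts it with macro-averaging, which weights every class equally.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels, used only to illustrate the averaging modes.
y_true = ["pos", "neg", "neu", "pos", "neg", "pos"]
y_pred = ["pos", "neg", "pos", "pos", "neu", "pos"]

micro = f1_score(y_true, y_pred, average="micro")
macro = f1_score(y_true, y_pred, average="macro")
accuracy = accuracy_score(y_true, y_pred)

print(f"micro F1 = {micro:.3f}, accuracy = {accuracy:.3f}, macro F1 = {macro:.3f}")
```

If the classes are imbalanced, reporting macro F1 next to the micro score can show whether the minority classes are handled well.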


# ONNX

## Converting

In [None]:
pipe_path = os.path.join(SAVED_MODELS_PATH, "TfIdfLogRegSentiment.onnx")

In [None]:
initial_type = [('input', StringTensorType([None, 1]))]
seps = {
    TfidfVectorizer: {
        "separators": [
            ' ', '.', '\\?', ',', ';', ':', '!',
            '\\(', '\\)', '\n', '"', "'",
            "-", "\\[", "\\]", "@"
        ]
    }
}

In [None]:
model_onnx = convert_sklearn(
    pipe,
    "tfidf",
    initial_types=initial_type,
    options=seps,
    target_opset=12,
)



## Saving 

In [None]:
with open(pipe_path, "wb") as f:
    f.write(model_onnx.SerializeToString())

## Comparing results

In [None]:
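sess = rt.InferenceSession(pipe_path)

input_name = sess.get_inputs()[0].name
label_name = sess.get_outputs()[0].name

# One review per row, matching the [None, 1] input shape declared at conversion time.
# The loop variable is named `review` to avoid shadowing the built-in `input`.
inputs = {input_name: [[review] for review in X_test]}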

In [None]:
pred_onx = sess.run(None, inputs)

In [None]:
averaging = "micro"
f1 = f1_score(y_test, pred_onx[0], average=averaging)

In [None]:
print(f"F1 score with {averaging}-averaging is {f1.round(3)}")

F1 score with micro-averaging is 0.72
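
The ONNX model scores lower than the scikit-learn pipeline (0.72 vs 0.801). One plausible explanation, which this notebook does not verify, is that the `separators`-based tokenizer in the converted model splits text differently from the default `token_pattern` of `TfidfVectorizer`. A first step toward diagnosing this is to measure how often the two models agree and which samples differ. The helper below is a hypothetical sketch on toy labels; in the notebook it could be called with `pred_labels` and `pred_onx[0]`.

```python
import numpy as np


def agreement_rate(labels_a, labels_b):
    """Return the fraction of positions where two label sequences match."""
    a = np.asarray(labels_a)
    b = np.asarray(labels_b)
    if a.shape != b.shape:
        raise ValueError("Both label sequences must have the same length")
    return float(np.mean(a == b))


# Hypothetical predictions standing in for the sklearn and ONNX outputs.
sklearn_preds = ["pos", "neg", "pos", "neu"]
onnx_preds = ["pos", "neg", "neu", "neu"]

rate = agreement_rate(sklearn_preds, onnx_preds)
# Indices of samples where the two models disagree, for manual inspection.
mismatches = np.flatnonzero(
    np.asarray(sklearn_preds) != np.asarray(onnx_preds)
).tolist()

print(f"Agreement: {rate:.2f}, mismatching indices: {mismatches}")
```

Looking at the reviews behind the mismatching indices should show whether the differences come from tokenization (for example punctuation or digits inside words) or from something else.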
