# Title: SENTIMENT ANALYSIS

# A Brief Intro


In today's digital age, where information is abundant and easily accessible, sentiment analysis plays a crucial role in understanding public opinion. Using natural language processing (NLP) and machine learning, we set out to classify the sentiment of textual data. This post walks through the project from start to finish.

# Data collecting

Sentiment analysis involves determining the sentiment or emotional tone behind a piece of text, such as a review, tweet, or customer feedback. By leveraging NLP techniques, we aim to automate sentiment classification, enabling organizations to gain insights from large volumes of textual data more efficiently.

To embark on our sentiment analysis project, we needed a diverse and sizable dataset. We turned to Hugging Face, a widely-used platform for natural language processing, which provides access to a wide range of pre-trained models and datasets. Hugging Face's library allowed us to easily access and download a rich collection of text data suitable for training and evaluation purposes.

First, we install the `datasets` library from Hugging Face with pip. After that, we load the "yelp_review_full" dataset and convert its training split into a pandas DataFrame named `df`.

In [None]:
# Install the Hugging Face datasets library
!pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


About the dataset:

["yelp_review_full"](https://huggingface.co/datasets/yelp_review_full?fbclid=IwAR3BsF0ayfB05ft3k7HT54Talr7SiG0RKJs3-yqFqc1aCgoXAgL5miyPo1w)
is a dataset of Yelp reviews and their labels (the score associated with each review, between 1 and 5). It was constructed by Xiang Zhang ([xiang.zhang@nyu.edu](mailto:xiang.zhang@nyu.edu)) from the Yelp Dataset Challenge 2015 data.

The training split contains 650,000 reviews (the test split has another 50,000), which gives us a large vocabulary for predicting sentiment.

To load the dataset, we need to import `load_dataset` from `datasets`




In [None]:
from datasets import load_dataset
import pandas as pd
data = load_dataset("yelp_review_full")




In [None]:
data

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 650000
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 50000
    })
})

In [None]:
df = pd.DataFrame(data['train'])

In [None]:
df

| | label | text |
|---|---|---|
| 0 | 4 | dr. goldberg offers everything i look for in a... |
| 1 | 1 | Unfortunately, the frustration of being Dr. Go... |
| 2 | 3 | Been going to Dr. Goldberg for over 10 years. ... |
| 3 | 3 | Got a letter in the mail last week that said D... |
| 4 | 0 | I don't know what Dr. Goldberg was like before... |
| ... | ... | ... |
| 649995 | 4 | I had a sprinkler that was gushing... pipe bro... |
| 649996 | 0 | Phone calls always go to voicemail and message... |
| 649997 | 0 | Looks like all of the good reviews have gone t... |
| 649998 | 4 | I was able to once again rely on Yelp to provi... |


The raw dataset has 5 labels (0 to 4). We group them into 3 main labels, mapping 0 and 1 to negative, 2 to neutral, and 3 and 4 to positive:

-1:   negative sentiment

0:   neutral sentiment

1:   positive sentiment


In [None]:
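# Map the 5 raw labels (0-4) onto 3 sentiment classes in a single pass.
# Assigning the result back avoids chained in-place replace on a column,
# which recent pandas versions warn about.
df['label'] = df['label'].map({0: -1, 1: -1, 2: 0, 3: 1, 4: 1})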

In [None]:
import plotly.express as px

def plot_label_distribution(df):
    # Count the number of occurrences of each label
    label_counts = df['label'].value_counts()
    # Create a bar chart to visualize the label distribution
    fig1 = px.bar(x=label_counts.index, y=label_counts.values, labels={'x': 'Label', 'y': 'Count'},
                  title='Label Distribution')
    fig1.show()

    # Calculate the percentage of each label
    label_ratios = label_counts / len(df) * 100

    # Create a pie chart to visualize the label ratios
    fig2 = px.pie(values=label_ratios.values, names=label_ratios.index, title='Label Ratios')
    fig2.show()

In [None]:
plot_label_distribution(df)

The raw dataset is large, and running every model on all of it would take a long time. We therefore keep a random 20% of the rows within each label group, which shrinks the dataset while preserving the label distribution. The code below does this.

In [None]:
import numpy as np

# Separate the DataFrame into three groups based on the label
group1 = df[df['label'] == -1]
group2 = df[df['label'] == 0]
group3 = df[df['label'] == 1]

# Define the function to delete a certain percentage of text within each group
def delete_percentage(group):
    num_to_delete = int(len(group) * 0.8)  # Calculate the number of text to delete (80%)
    indices_to_delete = np.random.choice(group.index, size=num_to_delete, replace=False)
    return group.drop(indices_to_delete)

# Apply the function to each group
group1_filtered = delete_percentage(group1)
group2_filtered = delete_percentage(group2)
group3_filtered = delete_percentage(group3)

# Concatenate the filtered groups back into a single DataFrame
df = pd.concat([group1_filtered, group2_filtered, group3_filtered])
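
As an alternative sketch of the same idea, pandas can draw the per-label sample directly with `groupby(...).sample(...)`. The toy DataFrame and the `random_state` value below are assumptions for illustration, not part of the original notebook; fixing the seed makes the subsample repeatable, which the `np.random.choice` version above is not.

```python
import pandas as pd

# Toy frame standing in for the relabeled reviews (values are made up).
toy = pd.DataFrame({
    "label": [-1] * 10 + [0] * 5 + [1] * 10,
    "text": [f"review {i}" for i in range(25)],
})

# Keep 20% of the rows inside each label group.
# random_state is an arbitrary seed chosen so the result is repeatable.
sampled = (
    toy.groupby("label")
       .sample(frac=0.2, random_state=42)
       .reset_index(drop=True)
)

# Each group shrinks by the same factor, so the label ratios are preserved.
print(sampled["label"].value_counts().sort_index())
```

Because every group is sampled with the same fraction, the 2:1:2 ratio of the toy labels is unchanged after sampling.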

In [None]:
df = df.reset_index()

In [None]:
df = df.drop(columns = ['index'])

In [None]:
df

| | label | text |
|---|---|---|
| 0 | -1 | I'm writing this review to give you a heads up... |
| 1 | -1 | Owning a driving range inside the city limits ... |
| 2 | -1 | Used to go there for tires, brakes, etc. Thei... |
| 3 | -1 | Last summer I had an appointment to get new ti... |
| 4 | -1 | I will start by saying we have a nice new deck... |
| ... | ... | ... |
| 129995 | 1 | Nice atmosphere. I expected this to be more of... |
| 129996 | 1 | This used be Cathy House. . Now The Jade. . It... |
| 129997 | 1 | Best hidden secret in Vegas..... Great selecti... |
| 129998 | 1 | MACARONS!!!! I've died and gone to heaven. \n\... |


# EDA


After gathering our dataset, we performed exploratory data analysis (EDA) to gain insights into the characteristics and distribution of the data. This stage involved statistical analysis, data visualization, and identifying any patterns or outliers within the dataset. EDA enabled us to understand the composition of the dataset and make informed decisions during subsequent stages of the project.

In [None]:
# Generally describe statistics for all columns in DataFrame, regardless of their data type.
df.describe(include="all")

| | label | text |
|---|---|---|
| count | 130000.0 | 130000 |
| unique | | 130000 |
| top | | In all fairness - I did not see this dentist b... |
| freq | | 1 |
| mean | 0.0 | |
| std | 0.894431 | |
| min | -1.0 | |
| 25% | -1.0 | |
| 50% | 0.0 | |
| 75% | 1.0 | |


As `count` and `unique` show, neither column has missing values (both `text` and `label` have 130,000 entries), and the `text` column contains no duplicate reviews (130,000 unique values).

In [None]:
len(df)

130000

In [None]:
# Handle missing values
df.isnull().sum()

label    0
text     0
dtype: int64

Our dataset has no null values, so no missing-value handling is needed.

Moving on to the plots, the pie chart shows 40% negative, 40% positive, and 20% neutral reviews. The three labels are therefore imbalanced: negative and positive each have more than 50,000 reviews, while neutral has fewer than 30,000. This imbalance could reduce the accuracy of our predictive model. Let's see what we can do about it next.

In [None]:
plot_label_distribution(df)

# Data Preprocessing

Data pre-processing plays a crucial role in natural language processing (NLP) tasks. When working with text data, several pre-processing steps are typically applied to improve the quality and relevance of the data: tokenization, lowercasing, removing punctuation, handling stop words, stemming or lemmatization, and handling numerical values (all of which appear in the cleaning process below). These steps took up nearly 70% of our entire process, and they prepare the text for analysis and modeling, improving the accuracy and effectiveness of the models we build later.

## Data Cleaning

The libraries imported below support natural language processing (NLP) and text preprocessing. Three concepts are central to the cleaning step:

- **Stopwords** are common words that do not carry significant meaning in a text, such as "a," "an," "the," "is," and so on.

-  **Lemmatization** is the process of normalizing words to their base or dictionary form. It helps convert words into their base form, for example, transforming "running" into "run."

- **Stemming** is the process of stripping affixes (mostly suffixes) from words to reduce them to a stem, which may not be a dictionary word (see "schedul" and "appoint" in the cleaned output below).

In [None]:
import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer, WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from tqdm.notebook import tqdm


nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

The data cleaning process proceeds as follows:

1. Convert the text to lowercase.
2. Remove @mentions.
3. Remove URLs and HTML tags.
4. Remove #hashtags.
5. Remove special characters and numbers.
6. Remove punctuation.
7. Tokenize the text using `nltk.word_tokenize`.
8. Remove stopwords (commonly used words) and words of length 2 or less.
9. Perform stemming to reduce words to their root form.
10. Perform lemmatization to convert words to their base form.
11. Join the remaining tokens and return the cleaned text.

In [None]:
class cleaning_data:
  def __init__(self, text=''):
    self.text = text

  def clean_text(self, text):
      try:
        # Lowercase
        text = text.lower()

        # Remove @
        text = re.sub(r'@[^\s]+', ' ', text)

        # Remove URL and HTML code
        text = re.sub(r'http\S+|www\S+', '', text)
        text = re.sub(r'<.*?>+', ' ', text)

        # Remove #hashtags
        text = re.sub(r'#[^\s]+', ' ', text)

        # Remove special characters and numbers (keep only letters and spaces)
        text = re.sub(r"[^A-Za-z ]", " ", text)

        # Remove punctuation
        text = text.translate(str.maketrans('', '', string.punctuation))

        text = text.strip()

        # Tokenization
        tokens = nltk.word_tokenize(text)

        stopwords_lst = set(stopwords.words('english'))
        tokens = [word for word in tokens if word not in stopwords_lst and len(word) > 2]

        # Stemming
        stemmer = SnowballStemmer(language='english')
        tokens = [stemmer.stem(word) for word in tokens]

        # Lemmatization
        lemmatizer = WordNetLemmatizer()
        tokens = [lemmatizer.lemmatize(word) for word in tokens]

        cleaned_text = " ".join(tokens)

        return cleaned_text
      except Exception as e:
        print(str(e))
        return ""

  def fit_transform(self, corpus) -> list:
    return [self.clean_text(text) for text in corpus]
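
To see what the regular-expression steps do in isolation, here is a minimal sketch that applies them to a single made-up review. It assumes the character filter is meant to keep only letters and spaces, and it stops before stopword removal, stemming, and lemmatization so that it needs nothing beyond the standard library. The sample string is invented for illustration.

```python
import re

# Made-up review text used only for illustration.
sample = "Loved it!! Visit http://example.com <b>NOW</b> @yelp #bestfood 5 stars"

text = sample.lower()                          # lowercase
text = re.sub(r'@[^\s]+', ' ', text)           # drop @mentions
text = re.sub(r'http\S+|www\S+', '', text)     # drop URLs
text = re.sub(r'<.*?>+', ' ', text)            # drop HTML tags
text = re.sub(r'#[^\s]+', ' ', text)           # drop #hashtags
text = re.sub(r'[^A-Za-z ]', ' ', text)        # keep letters and spaces only

# Whitespace split stands in for nltk.word_tokenize in this sketch.
tokens = text.split()
print(tokens)  # ['loved', 'it', 'visit', 'now', 'stars']
```

In the full pipeline, the length filter would then drop "it", and the NLTK stopword list and stemmer would process the remaining tokens.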

In [None]:
cleaner = cleaning_data()
clean_data = cleaner.fit_transform(df['text'])

In [None]:
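# Name the column explicitly so later cells can use clean_data['cleaned_text']
clean_data = pd.DataFrame(clean_data, columns=['cleaned_text'])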
clean_data

| | cleaned_text |
|---|---|
| 0 | fair see dentist schedul appoint wait minut op... |
| 1 | take avoid place cost time starv energi cook g... |
| 2 | worst dental experi life butcher husband went ... |
| 3 | even though food eat park right abomin smile l... |
| 4 | one good thing say michael wide varieti hobbi ... |
| ... | ... |
| 129995 | mean sandwich usual locat new forum food court... |
| 129996 | use cathi hous jade improv dim sum way better ... |
| 129997 | one multitud babi boomer grew slinki silli put... |
| 129998 | macaron die gone heaven nthey light fluffi lit... |


In [None]:
clean_data.isnull().sum()

cleaned_text    0
dtype: int64

## Using TF-IDF from scikit-learn

### Definition of TF-IDF

Using TF-IDF (Term Frequency-Inverse Document Frequency) from the sklearn library is a powerful technique for text data analysis. TF-IDF assigns weights to individual words in a document based on their frequency and importance within the entire corpus. The sklearn library provides an easy-to-use implementation of TF-IDF, allowing users to transform raw text data into a numerical representation that can be used for machine learning tasks. By calculating the TF-IDF scores, we can identify the most significant words in a document while downweighting common words that appear across multiple documents. This helps in capturing the essence of each document and highlighting the distinguishing features of the text. The TF-IDF representation is particularly useful for tasks such as text classification, information retrieval, and document similarity analysis. By leveraging the sklearn implementation of TF-IDF, we can efficiently process and analyze textual data, enabling more accurate and insightful NLP applications.

**TF(t,d) = Term Frequency(t,d):** is calculated by dividing the number of occurrences of the word in the document by the total number of words in the document.

$$tf(t,d) = \frac{\text{count of t in d}}{\text{number of words in d}}$$

**IDF(t,D) = Inverse Document Frequency(t,D):** measures the importance of term $t$ across all documents ($D$). We obtain it by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient.

$$idf(t,D) = \log\left(\frac{|D|}{|\{d \in D : t \in d\}|}\right)$$

- $|D|$: the total number of documents

- $|\{d \in D : t \in d\}|$: the number of documents containing the term $t$

**TF-IDF(t,d,D) = Term Frequency(t,d) × Inverse Document Frequency(t,D)**:

$$TF\text{-}IDF(t,d,D) = tf(t,d)\times idf(t,D)$$

Words with a high TF-IDF value are those that:

- occur many times in the document ($d$), so $tf(t,d)$ is high

- occur in few documents of the corpus ($D$), so $idf(t,D)$ is high

For that reason, TF-IDF helps us filter out common words and retain the high-value ones (considered the keywords of the text).
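
The formulas above can be checked by hand on a tiny corpus. The sketch below implements them literally in plain Python; the three toy documents and the helper names `tf`, `idf`, and `tf_idf` are made up for illustration. Note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each row by default, so its numbers will differ from this textbook version.

```python
import math

# Toy tokenized corpus (made up for illustration).
docs = [
    ["good", "food", "good", "service"],
    ["bad", "food"],
    ["good", "price"],
]

def tf(term, doc):
    # Occurrences of the term divided by the number of words in the document.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # log(total documents / documents containing the term).
    # Assumes the term occurs in at least one document.
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

for term in ["good", "food", "service"]:
    print(term, round(tf_idf(term, docs[0], docs), 4))
```

In the first document, "service" scores higher than "good" even though "good" occurs twice, because "service" appears in only one of the three documents.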

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
tfidf_vectorizer = TfidfVectorizer(min_df=2, max_df=0.8)
X = tfidf_vectorizer.fit_transform(clean_data['cleaned_text'])

We now apply the TF-IDF transformation to the cleaned text. The `fit_transform()` method fits the vectorizer on the provided text and returns a sparse matrix `X` in which each row represents a document and each column corresponds to a unique word. The values are the TF-IDF scores of each word in each document. With `min_df=2`, words that appear in fewer than 2 documents are dropped; with `max_df=0.8`, words that appear in more than 80% of the documents are dropped.

In [None]:
X.shape

(130000, 40576)
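
To make the effect of the document-frequency thresholds concrete, here is a small sketch on a made-up four-document corpus using the same `min_df=2, max_df=0.8` settings. The corpus and variable names are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (made up): "price" occurs in only one document.
corpus = [
    "good food good service",
    "bad food",
    "good price",
    "bad service",
]

vec = TfidfVectorizer(min_df=2, max_df=0.8)
X_toy = vec.fit_transform(corpus)

# vocabulary_ maps each kept word to its column index.
vocab = sorted(vec.vocabulary_)
print(X_toy.shape)  # (4, 4)
print(vocab)        # ['bad', 'food', 'good', 'service']
```

"price" is dropped because it appears in fewer than 2 documents, so the matrix has 4 columns rather than 5. The same mechanism is what limits the real matrix to 40,576 columns.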

# Data modeling

**First of all**, we need to split the data into two sets: train and test.

The `train_test_split` function from sklearn.model_selection is used to split the data into two subsets: the training data and the test data. In this case:

`X`: is the matrix (or array) of features of the original data.

`y`: is the array (or vector) of labels corresponding to the original data.

`test_size`: is the proportion of data used for the test set. In this case, 0.3 (or 30%) of the original data is randomly chosen to create the test set.

`random_state`: is a value to ensure that the random data splitting can be reproduced. When providing a specific integer (such as 42), the results will be consistent if the code is run again.


In [None]:
from sklearn.model_selection import train_test_split
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((91000, 40576), (39000, 40576), (91000,), (39000,))

After splitting, **the training data** comprises 91,000 samples with 40,576 features, while **the test data** comprises 39,000 samples with the same number of features. The training and test label vectors have 91,000 and 39,000 entries respectively.
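
Since the labels are imbalanced, one optional variant is to pass `stratify=y` so that the train and test sets keep the same label proportions. The notebook itself does not do this; the sketch below uses a made-up label vector with the same 40/20/40 ratio to show the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels with the same 40% / 20% / 40% ratio as the subsampled data.
y_toy = np.array([-1] * 40 + [0] * 20 + [1] * 40)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder features

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

# With stratification, the 30 test samples keep the 40/20/40 ratio.
labels, counts = np.unique(y_te, return_counts=True)
print(dict(zip(labels.tolist(), counts.tolist())))  # {-1: 12, 0: 6, 1: 12}
```

Without `stratify`, the proportions in each split vary with the random seed, which matters most for the smaller neutral class.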


# GridSearchCV and Cross Validation

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn import ensemble, metrics, model_selection, svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, f1_score, make_scorer,
                             precision_score, recall_score)
from sklearn.model_selection import (GridSearchCV, cross_val_score,
                                     cross_validate, train_test_split)
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

We define a function called `hyperparamtune` that performs hyperparameter tuning using grid search cross-validation.

Hyperparameter tuning aims to find the optimal values for these hyperparameters that result in the best performance of the model on unseen data.

The function takes the following parameters:

`classifier`: The machine learning classifier model to be tuned.

`param_grid`: A dictionary specifying the hyperparameter values to be explored.

`metric`: The evaluation metric used to score the models during grid search.

`verbose_value`: An integer value determining the verbosity level of the grid search process.

`cv`: An integer or cross-validation generator specifying the number of folds for cross-validation.


In [None]:
def hyperparamtune(classifier, param_grid, metric, verbose_value, cv):
  '''
  Use GridSearchCV to find the best hyperparameters for a classifier (model).
  Fits on the global X_train and y_train defined above.
  => returns the fitted grid search object and the best estimator's parameters
  '''
  model = model_selection.GridSearchCV(
          estimator=classifier,
          param_grid=param_grid,
          scoring=metric,
          verbose=verbose_value,
          cv=cv)

  model.fit(X_train, y_train)
  # Mean cross-validated score of the best hyperparameter combination
  print(f"Best Score {model.best_score_}")
  print("Best hyperparameter set:")
  best_parameters = model.best_estimator_.get_params()
  for param_name in sorted(param_grid.keys()):
      print(f"\t{param_name}: {best_parameters[param_name]}")
  return model, best_parameters
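
The following is a self-contained sketch of the same grid-search pattern on synthetic data, so it can be run without the Yelp matrices. The helper name `tune`, the synthetic dataset, and the small grid are assumptions for illustration; unlike `hyperparamtune`, this variant takes the data as arguments instead of reading global variables.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

def tune(classifier, param_grid, X, y, metric="accuracy", cv=3):
    # Try every combination in param_grid with cv-fold cross-validation.
    search = GridSearchCV(estimator=classifier, param_grid=param_grid,
                          scoring=metric, cv=cv)
    search.fit(X, y)
    # best_params_ holds only the tuned keys, not every estimator parameter.
    return search, search.best_params_

# Synthetic stand-in data (not the review features).
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

search, best = tune(LogisticRegression(max_iter=500), {"C": [0.1, 1.0, 10]},
                    X_demo, y_demo)
print(best, round(search.best_score_, 3))
```

Passing `X` and `y` explicitly makes the helper reusable for other splits, at the cost of a slightly longer call.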

## Logistic Regression

Logistic regression is a statistical model that uses the logistic function, also called the sigmoid function (the inverse of the logit function), to relate the input x to the output y. The logistic function maps any real-valued x to a value between 0 and 1:

$$ f(x) = \frac{1}{1+e^{-x}}$$

By training a logistic regression model on labeled text data, the algorithm learns the relationship between the features and the target variable. It estimates the coefficients that maximize the likelihood of observing the given text samples and their corresponding labels.

Once the model is trained, it can be used to classify new, unseen text data. The logistic regression model calculates the probability of each class or outcome based on the learned coefficients and the input text's features. The predicted class is determined by selecting the class with the highest probability.
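
As a rough numeric illustration of the prediction step, the sketch below evaluates the sigmoid and then picks the most probable of three classes using softmax, a common multiclass generalization of the sigmoid. The scores are made-up numbers, and whether scikit-learn applies softmax or a one-vs-rest scheme depends on its version and settings, so treat this as an assumption rather than a description of the fitted model.

```python
import numpy as np

def sigmoid(x):
    # Logistic function: maps any real number into (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def softmax(scores):
    # Subtract the max before exponentiating for numerical stability.
    exp = np.exp(scores - np.max(scores))
    return exp / exp.sum()

classes = np.array([-1, 0, 1])        # negative, neutral, positive
scores = np.array([0.2, -1.0, 1.5])   # hypothetical linear scores, one per class

probs = softmax(scores)
predicted = classes[np.argmax(probs)]  # class with the highest probability

print(sigmoid(0.0))   # 0.5
print(predicted)      # 1
```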

Below is the param_gd dictionary which is used to define a grid of hyperparameters for hyperparameter tuning in machine learning models. Each key in the dictionary represents a specific hyperparameter, and the corresponding value is a list of potential values to be explored during the tuning process.

`penalty`: This hyperparameter controls the type of regularization used in the model. In this case, it is set to "l2", which indicates L2 regularization (Ridge regularization). L2 regularization adds a penalty term to the loss function, encouraging the model to have smaller coefficients.

`C`: This hyperparameter represents the inverse of the regularization strength. It determines how much regularization is applied to the model. The potential values [0.01, 0.1, 1.0, 10] specify different levels of regularization. Smaller values of C correspond to stronger regularization, meaning the model will be more constrained.

- The choice of specific values like 0.01, 0.1, 1.0, and 10 for the hyperparameter C in machine learning models, particularly in logistic regression and linear models, is often based on convention and empirical experience. They are often used as default or common values for exploring different levels of regularization and have been shown to work well in many scenarios.

`tol`: This hyperparameter sets the tolerance for convergence of the optimization algorithm. It defines the minimum change in the loss function between iterations to consider the model as converged. The list of potential values [0.0001, 0.001, 0.01] represents different levels of tolerance, with smaller values indicating a higher precision for convergence.

- Setting a smaller tol value ensures that the optimization algorithm converges to a more precise solution. However, decreasing tol can lead to longer computational times since the algorithm needs to iterate more times to achieve convergence. By choosing values like 0.0001, 0.001, and 0.01, we strike a balance between achieving reasonable convergence precision and computational efficiency.

`max_iter`: This hyperparameter determines the maximum number of iterations allowed for the optimization algorithm to converge. The values [100, 200] specify different maximum iteration limits. If the algorithm does not converge within the specified number of iterations, it will stop and return the current solution.
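
When the iteration limit is reached before convergence, scikit-learn emits a `ConvergenceWarning`. The sketch below shows one way to detect that programmatically on synthetic data; the helper name `converged`, the dataset, and the two `max_iter` values are assumptions chosen for illustration, not settings from this project.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (not the review features).
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)

def converged(max_iter):
    # Fit while recording warnings, then check for a ConvergenceWarning.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        clf = LogisticRegression(max_iter=max_iter).fit(X_demo, y_demo)
    hit_limit = any(issubclass(w.category, ConvergenceWarning) for w in caught)
    # n_iter_ reports how many iterations the solver actually ran.
    return not hit_limit, int(clf.n_iter_.max())

print(converged(1))     # too few iterations: the solver stops at the limit
print(converged(1000))  # generous limit: the solver stops on its own
```

Comparing `n_iter_` with `max_iter` after a fit is a quick way to tell whether a larger limit would have changed the result.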

In [None]:
from sklearn.linear_model import LogisticRegression
# Convert label type to integers if necessary
y_train = y_train.astype(int)
param_gd={"penalty":["l2"],
         "C":[0.01,0.1,1.0,10],
         "tol":[0.0001,0.001,0.01],
         "max_iter":[100,200]}
model_log, best_param = hyperparamtune(LogisticRegression(), param_gd, "accuracy", 3, cv = 5)

Fitting 5 folds for each of 24 candidates, totalling 120 fits
[CV 1/5] END C=0.01, max_iter=100, penalty=l2, tol=0.0001;, score=0.689 total time=   4.3s
[CV 2/5] END C=0.01, max_iter=100, penalty=l2, tol=0.0001;, score=0.687 total time=   4.5s
[CV 3/5] END C=0.01, max_iter=100, penalty=l2, tol=0.0001;, score=0.687 total time=   5.6s
[CV 4/5] END C=0.01, max_iter=100, penalty=l2, tol=0.0001;, score=0.688 total time=   4.6s
[CV 5/5] END C=0.01, max_iter=100, penalty=l2, tol=0.0001;, score=0.691 total time=   4.7s
[CV 1/5] END C=0.01, max_iter=100, penalty=l2, tol=0.001;, score=0.689 total time=   5.4s
[CV 2/5] END C=0.01, max_iter=100, penalty=l2, tol=0.001;, score=0.687 total time=   4.5s
[CV 3/5] END C=0.01, max_iter=100, penalty=l2, tol=0.001;, score=0.687 total time=   5.2s
[CV 4/5] END C=0.01, max_iter=100, penalty=l2, tol=0.001;, score=0.688 total time=   4.8s
[CV 5/5] END C=0.01, max_iter=100, penalty=l2, tol=0.001;, score=0.691 total time=   4.4s
[CV 1/5] END C=0.01, max_iter=100

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[CV 5/5] END C=0.1, max_iter=100, penalty=l2, tol=0.0001;, score=0.740 total time=  12.7s
[CV 1/5] END C=0.1, max_iter=100, penalty=l2, tol=0.001;, score=0.741 total time=  11.3s
[CV 2/5] END C=0.1, max_iter=100, penalty=l2, tol=0.001;, score=0.743 total time=  13.8s
[CV 3/5] END C=0.1, max_iter=100, penalty=l2, tol=0.001;, score=0.742 total time=  10.0s
[CV 4/5] END C=0.1, max_iter=100, penalty=l2, tol=0.001;, score=0.737 total time=  10.8s




[CV 5/5] END C=0.1, max_iter=100, penalty=l2, tol=0.001;, score=0.740 total time=  12.7s
[CV 1/5] END C=0.1, max_iter=100, penalty=l2, tol=0.01;, score=0.741 total time=  10.5s
[CV 2/5] END C=0.1, max_iter=100, penalty=l2, tol=0.01;, score=0.743 total time=  12.6s
[CV 3/5] END C=0.1, max_iter=100, penalty=l2, tol=0.01;, score=0.742 total time=   9.1s
[CV 4/5] END C=0.1, max_iter=100, penalty=l2, tol=0.01;, score=0.737 total time=  10.7s




[CV 5/5] END C=0.1, max_iter=100, penalty=l2, tol=0.01;, score=0.740 total time=  12.4s
[CV 1/5] END C=0.1, max_iter=200, penalty=l2, tol=0.0001;, score=0.741 total time=  11.1s
[CV 2/5] END C=0.1, max_iter=200, penalty=l2, tol=0.0001;, score=0.743 total time=  12.6s
[CV 3/5] END C=0.1, max_iter=200, penalty=l2, tol=0.0001;, score=0.742 total time=   8.6s
[CV 4/5] END C=0.1, max_iter=200, penalty=l2, tol=0.0001;, score=0.737 total time=  11.1s
[CV 5/5] END C=0.1, max_iter=200, penalty=l2, tol=0.0001;, score=0.740 total time=  14.1s
[CV 1/5] END C=0.1, max_iter=200, penalty=l2, tol=0.001;, score=0.741 total time=  10.9s
[CV 2/5] END C=0.1, max_iter=200, penalty=l2, tol=0.001;, score=0.743 total time=  12.3s
[CV 3/5] END C=0.1, max_iter=200, penalty=l2, tol=0.001;, score=0.742 total time=   9.7s
[CV 4/5] END C=0.1, max_iter=200, penalty=l2, tol=0.001;, score=0.737 total time=  10.4s
[CV 5/5] END C=0.1, max_iter=200, penalty=l2, tol=0.001;, score=0.740 total time=  12.1s
[CV 1/5] END C=0.

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[CV 1/5] END C=1.0, max_iter=100, penalty=l2, tol=0.0001;, score=0.755 total time=  12.8s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[CV 2/5] END C=1.0, max_iter=100, penalty=l2, tol=0.0001;, score=0.760 total time=  12.9s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[CV 3/5] END C=1.0, max_iter=100, penalty=l2, tol=0.0001;, score=0.756 total time=  14.0s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[CV 4/5] END C=1.0, max_iter=100, penalty=l2, tol=0.0001;, score=0.750 total time=  13.1s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[CV 5/5] END C=1.0, max_iter=100, penalty=l2, tol=0.0001;, score=0.752 total time=  13.4s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[CV 1/5] END C=1.0, max_iter=100, penalty=l2, tol=0.001;, score=0.755 total time=  13.2s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[CV 2/5] END C=1.0, max_iter=100, penalty=l2, tol=0.001;, score=0.760 total time=  12.8s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[CV 3/5] END C=1.0, max_iter=100, penalty=l2, tol=0.001;, score=0.756 total time=  12.3s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[CV 4/5] END C=1.0, max_iter=100, penalty=l2, tol=0.001;, score=0.750 total time=  13.0s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[CV 5/5] END C=1.0, max_iter=100, penalty=l2, tol=0.001;, score=0.752 total time=  13.4s
[CV 1/5] END C=1.0, max_iter=100, penalty=l2, tol=0.01;, score=0.755 total time=  12.9s
[CV 2/5] END C=1.0, max_iter=100, penalty=l2, tol=0.01;, score=0.760 total time=  13.1s
[CV 3/5] END C=1.0, max_iter=100, penalty=l2, tol=0.01;, score=0.756 total time=  12.0s
[CV 4/5] END C=1.0, max_iter=100, penalty=l2, tol=0.01;, score=0.750 total time=  12.8s
[CV 5/5] END C=1.0, max_iter=100, penalty=l2, tol=0.01;, score=0.752 total time=  13.3s
[CV 1/5] END C=1.0, max_iter=200, penalty=l2, tol=0.0001;, score=0.755 total time=  25.4s
[CV 2/5] END C=1.0, max_iter=200, penalty=l2, tol=0.0001;, score=0.759 total time=  25.8s
[CV 3/5] END C=1.0, max_iter=200, penalty=l2, tol=0.0001;, score=0.756 total time=  24.9s
[CV 4/5] END C=1.0, max_iter=200, penalty=l2, tol=0.0001;, score=0.750 total time=  27.0s
[CV 5/5] END C=1.0, max_iter=200, penalty=l2, tol=0.0001;, score=0.752 total time=  29.4s
[CV 1/5] END C=1.0, max_iter=200, penalty=l2, tol=0.001;, score=0.755 total time=  27.5s
[CV 2/5] END C=1.0, max_iter=200, penalty=l2, tol=0.001;, score=0.759 total time=  26.8s
[CV 3/5] END C=1.0, max_iter=200, penalty=l2, tol=0.001;, score=0.756 total time=  25.6s
[CV 4/5] END C=1.0, max_iter=200, penalty=l2, tol=0.001;, score=0.750 total time=  25.8s
[CV 5/5] END C=1.0, max_iter=200, penalty=l2, tol=0.001;, score=0.752 total time=  25.5s
[CV 1/5] END C=1.0, max_iter=200, penalty=l2, tol=0.01;, score=0.755 total time=  25.9s
[CV 2/5] END C=1.0, max_iter=200, penalty=l2, tol=0.01;, score=0.759 total time=  24.8s
[CV 3/5] END C=1.0, max_iter=200, penalty=l2, tol=0.01;, score=0.756 total time=  25.2s
[CV 4/5] END C=1.0, max_iter=200, penalty=l2, tol=0.01;, score=0.750 total time=  25.2s
[CV 5/5] END C=1.0, max_iter=200, penalty=l2, tol=0.01;, score=0.752 total time=  25.8s
[CV 1/5] END C=10, max_iter=100, penalty=l2, tol=0.0001;, score=0.738 total time=  13.5s
[CV 2/5] END C=10, max_iter=100, penalty=l2, tol=0.0001;, score=0.747 total time=  12.6s
[CV 3/5] END C=10, max_iter=100, penalty=l2, tol=0.0001;, score=0.744 total time=  11.4s
[CV 4/5] END C=10, max_iter=100, penalty=l2, tol=0.0001;, score=0.734 total time=  12.3s
[CV 5/5] END C=10, max_iter=100, penalty=l2, tol=0.0001;, score=0.735 total time=  12.9s
[CV 1/5] END C=10, max_iter=100, penalty=l2, tol=0.001;, score=0.738 total time=  15.1s
[CV 2/5] END C=10, max_iter=100, penalty=l2, tol=0.001;, score=0.747 total time=  12.9s
[CV 3/5] END C=10, max_iter=100, penalty=l2, tol=0.001;, score=0.744 total time=  11.4s
[CV 4/5] END C=10, max_iter=100, penalty=l2, tol=0.001;, score=0.734 total time=  12.0s
[CV 5/5] END C=10, max_iter=100, penalty=l2, tol=0.001;, score=0.735 total time=  12.5s
[CV 1/5] END C=10, max_iter=100, penalty=l2, tol=0.01;, score=0.738 total time=  13.1s
[CV 2/5] END C=10, max_iter=100, penalty=l2, tol=0.01;, score=0.747 total time=  12.8s
[CV 3/5] END C=10, max_iter=100, penalty=l2, tol=0.01;, score=0.744 total time=  12.6s
[CV 4/5] END C=10, max_iter=100, penalty=l2, tol=0.01;, score=0.734 total time=  12.3s
[CV 5/5] END C=10, max_iter=100, penalty=l2, tol=0.01;, score=0.735 total time=  12.3s
[CV 1/5] END C=10, max_iter=200, penalty=l2, tol=0.0001;, score=0.735 total time=  25.9s
[CV 2/5] END C=10, max_iter=200, penalty=l2, tol=0.0001;, score=0.736 total time=  25.0s
[CV 3/5] END C=10, max_iter=200, penalty=l2, tol=0.0001;, score=0.737 total time=  23.3s
[CV 4/5] END C=10, max_iter=200, penalty=l2, tol=0.0001;, score=0.731 total time=  25.4s
[CV 5/5] END C=10, max_iter=200, penalty=l2, tol=0.0001;, score=0.736 total time=  24.9s
[CV 1/5] END C=10, max_iter=200, penalty=l2, tol=0.001;, score=0.735 total time=  26.1s
[CV 2/5] END C=10, max_iter=200, penalty=l2, tol=0.001;, score=0.736 total time=  25.0s
[CV 3/5] END C=10, max_iter=200, penalty=l2, tol=0.001;, score=0.737 total time=  23.3s
[CV 4/5] END C=10, max_iter=200, penalty=l2, tol=0.001;, score=0.731 total time=  26.7s
[CV 5/5] END C=10, max_iter=200, penalty=l2, tol=0.001;, score=0.736 total time=  25.6s
[CV 1/5] END C=10, max_iter=200, penalty=l2, tol=0.01;, score=0.735 total time=  25.7s
[CV 2/5] END C=10, max_iter=200, penalty=l2, tol=0.01;, score=0.736 total time=  25.2s
[CV 3/5] END C=10, max_iter=200, penalty=l2, tol=0.01;, score=0.737 total time=  24.9s
[CV 4/5] END C=10, max_iter=200, penalty=l2, tol=0.01;, score=0.731 total time=  24.8s
[CV 5/5] END C=10, max_iter=200, penalty=l2, tol=0.01;, score=0.736 total time=  24.7s
Best Score {0.7546857142857142}
Best hyperparameter set:
	C: 1.0
	max_iter: 100
	penalty: l2
	tol: 0.0001
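
The best score reported above is the mean of the five fold scores for the winning candidate. As a rough check, the sketch below averages the fold scores printed in the log. It assumes that the scores are the same for every `tol` value (as the visible rows suggest) and uses the three-decimal values from the verbose output, so the result only matches the reported score up to rounding. The dictionary keys are labels chosen here for illustration.

```python
from statistics import mean

# Fold scores copied from the grid-search log (rounded to 3 decimals there).
fold_scores = {
    "C=1.0, max_iter=100": [0.755, 0.760, 0.756, 0.750, 0.752],
    "C=1.0, max_iter=200": [0.755, 0.759, 0.756, 0.750, 0.752],
    "C=10, max_iter=100": [0.738, 0.747, 0.744, 0.734, 0.735],
    "C=10, max_iter=200": [0.735, 0.736, 0.737, 0.731, 0.736],
}

# Mean cross-validation score per candidate, then the candidate with the highest mean.
mean_scores = {name: mean(scores) for name, scores in fold_scores.items()}
best = max(mean_scores, key=mean_scores.get)

for name, score in mean_scores.items():
    print(f"{name}: {score:.4f}")
print("best:", best)
```

The candidate with `C=1.0` and `max_iter=100` comes out on top, in line with the best hyperparameter set printed above.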


## Random Forest

Random Forest builds an ensemble of decision trees. Each tree is trained on a bootstrap sample of the training data (rows drawn with replacement), and only a random subset of the features is considered at each split, so the trees differ from one another. The forest then combines the predictions of all trees (a majority vote, in the classic formulation), which tends to reduce variance and improve the generalization of sentiment predictions compared with a single tree.
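
To make the ensemble idea concrete, here is a deliberately tiny sketch in plain Python. It is not scikit-learn's implementation: each "tree" is a one-split decision stump, the data are made-up two-feature points, and all function names are hypothetical. It only illustrates the three ingredients: bootstrap sampling, a random feature per learner, and majority voting.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (the bagging step)."""
    return [rng.choice(rows) for _ in rows]

def fit_stump(rows, feature):
    """Stand-in for a decision tree: split one feature at its mean value
    and predict the majority label on each side of the split."""
    threshold = sum(x[feature] for x, _ in rows) / len(rows)
    left = Counter(y for x, y in rows if x[feature] <= threshold)
    right = Counter(y for x, y in rows if x[feature] > threshold)
    overall = Counter(y for _, y in rows).most_common(1)[0][0]
    left_label = left.most_common(1)[0][0] if left else overall
    right_label = right.most_common(1)[0][0] if right else overall
    return feature, threshold, left_label, right_label

def predict_stump(stump, x):
    feature, threshold, left_label, right_label = stump
    return left_label if x[feature] <= threshold else right_label

def fit_forest(rows, n_trees, rng):
    """Train n_trees stumps, each on its own bootstrap sample and random feature."""
    n_features = len(rows[0][0])
    forest = []
    for _ in range(n_trees):
        sample = bootstrap_sample(rows, rng)
        feature = rng.randrange(n_features)
        forest.append(fit_stump(sample, feature))
    return forest

def predict_forest(forest, x):
    """Majority vote over the individual stump predictions."""
    votes = Counter(predict_stump(stump, x) for stump in forest)
    return votes.most_common(1)[0][0]

# Toy data: (features, label); class 0 sits near (0, 0), class 1 near (5, 5).
rows = [((0.1, 0.3), 0), ((0.5, 0.2), 0), ((0.2, 0.6), 0),
        ((4.8, 5.1), 1), ((5.2, 4.9), 1), ((5.0, 5.5), 1)]

forest = fit_forest(rows, n_trees=25, rng=random.Random(0))
print(predict_forest(forest, (0.0, 0.0)))
print(predict_forest(forest, (5.0, 5.0)))
```

A single stump trained on an unlucky bootstrap sample can be wrong, but the vote across many stumps is far more stable, which is the motivation for the ensemble.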

In [None]:
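from sklearn.ensemble import RandomForestClassifier

# Search grid: 3 * 5 * 2 * 3 * 2 * 2 = 360 candidate combinations.
param_gd = {
    "n_estimators": [100, 200, 300],        # number of trees in the forest
    "max_depth": [11, 13, 17, 19, 23],      # maximum depth of each tree
    "criterion": ["gini", "entropy"],       # split quality measure
    "min_samples_split": [3, 7, 11],        # minimum samples needed to split a node
    "min_samples_leaf": [3, 5],             # minimum samples required in a leaf
    "max_features": ["sqrt", "log2"],       # features considered at each split
}

# Same tuning helper as used for the other models, scored on accuracy.
model_rf, best_param_rf = hyperparamtune(RandomForestClassifier(), param_gd, "accuracy", 10, 5)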

Fitting 5 folds for each of 360 candidates, totalling 1800 fits
[CV 1/5; 1/360] START criterion=gini, max_depth=11, max_features=sqrt, min_samples_leaf=3, min_samples_split=3, n_estimators=100
[CV 1/5; 1/360] END criterion=gini, max_depth=11, max_features=sqrt, min_samples_leaf=3, min_samples_split=3, n_estimators=100;, score=0.644 total time=  10.8s
[CV 2/5; 1/360] START criterion=gini, max_depth=11, max_features=sqrt, min_samples_leaf=3, min_samples_split=3, n_estimators=100
[CV 2/5; 1/360] END criterion=gini, max_depth=11, max_features=sqrt, min_samples_leaf=3, min_samples_split=3, n_estimators=100;, score=0.641 total time=  10.7s
[CV 3/5; 1/360] START criterion=gini, max_depth=11, max_features=sqrt, min_samples_leaf=3, min_samples_split=3, n_estimators=100
[CV 3/5; 1/360] END criterion=gini, max_depth=11, max_features=sqrt, min_samples_leaf=3, min_samples_split=3, n_estimators=100;, score=0.642 total time=  10.5s
[CV 4/5; 1/360] START criterion=gini, max_depth=11, max_features=sqrt
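
The candidate and fit counts reported at the top of this log can be checked by enumerating the grid. The sketch below uses only the standard library; `n_folds = 5` is taken from the "5 folds" in the log line.

```python
from itertools import product

# Same grid as in the cell above.
param_gd = {
    "n_estimators": [100, 200, 300],
    "max_depth": [11, 13, 17, 19, 23],
    "criterion": ["gini", "entropy"],
    "min_samples_split": [3, 7, 11],
    "min_samples_leaf": [3, 5],
    "max_features": ["sqrt", "log2"],
}

# Every combination of one value per hyperparameter is one candidate.
candidates = list(product(*param_gd.values()))
n_folds = 5

print(len(candidates))            # 360 candidates
print(len(candidates) * n_folds)  # 1800 fits
```

With roughly 10 seconds per fit for the smallest configuration, a grid of this size is expensive, which is worth keeping in mind when choosing the value ranges.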

## Naive Bayes

Naive Bayes applies Bayes' theorem, which mathematically calculates the posterior probability of a class given the observed evidence. The formula is as follows:

$$P(sentiment|words) = \frac{P(sentiment) * P(words|sentiment)}{P(words)}$$

Here,

- $P(sentiment|words)$ is the posterior probability of a $sentiment$ given the observed $words$.

- $P(sentiment)$ is the prior probability of the $sentiment$.

- $P(words|sentiment)$ is the likelihood of observing the $words$ given the $sentiment$.

- $P(words)$ is the probability of the observed $words$ regardless of the $sentiment$.

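The "naive" part is the assumption that words are conditionally independent given the sentiment, so $P(words|sentiment)$ factorizes into a product of per-word probabilities. By estimating the prior probabilities and these per-word likelihoods from the training data, we can classify new documents by selecting the class with the highest posterior probability. Since $P(words)$ is the same for every class, it can be ignored when comparing classes.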
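
The procedure can be sketched from scratch in a few lines of plain Python. This is a minimal illustration, not the scikit-learn `MultinomialNB` implementation used below; the toy reviews, the `pos`/`neg` labels, and the function names are all made up for the example. It works in log space to avoid multiplying many small probabilities and uses additive smoothing (`alpha`) so that unseen word and class pairs do not get zero probability.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, alpha=1.0):
    """docs is a list of (tokens, label) pairs.
    Returns log-priors, smoothed log-likelihoods and the vocabulary."""
    label_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    for tokens, label in docs:
        word_counts[label].update(tokens)
    vocab = {word for tokens, _ in docs for word in tokens}

    # log P(sentiment): share of training documents per class.
    log_prior = {c: math.log(n / len(docs)) for c, n in label_counts.items()}

    # log P(word | sentiment) with additive smoothing.
    log_lik = {}
    for c in label_counts:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_lik[c] = {w: math.log((word_counts[c][w] + alpha) / total) for w in vocab}
    return log_prior, log_lik, vocab

def predict_nb(tokens, log_prior, log_lik, vocab):
    """Pick the class with the highest log-posterior (up to the constant P(words))."""
    scores = {
        c: log_prior[c] + sum(log_lik[c][w] for w in tokens if w in vocab)
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Toy training set (hypothetical reviews).
docs = [
    ("great food great service".split(), "pos"),
    ("friendly staff great place".split(), "pos"),
    ("terrible food rude staff".split(), "neg"),
    ("bad service terrible place".split(), "neg"),
]

log_prior, log_lik, vocab = train_nb(docs, alpha=1.0)
print(predict_nb("great friendly service".split(), log_prior, log_lik, vocab))
print(predict_nb("terrible rude staff".split(), log_prior, log_lik, vocab))
```

Words that never appeared in training are simply skipped at prediction time in this sketch.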

Below we tune a Multinomial Naive Bayes model with cross-validated grid search. The `param_grid_nb` dictionary contains two hyperparameters:

`alpha`: The additive smoothing parameter of the Multinomial Naive Bayes model. The candidate values are [0.1, 0.5, 1.0, 2.0], representing different levels of smoothing. Smoothing prevents zero probabilities for words that never occur with a given class in the training data; smaller values of alpha mean less smoothing.

- Smaller values such as 0.1 and 0.5 apply lighter smoothing, so the model relies more on the observed word frequencies. Larger values such as 1.0 and 2.0 apply stronger smoothing, which pulls the per-class word probabilities toward a uniform distribution and reduces the influence of the observed frequencies.

`fit_prior`: A boolean that controls whether class prior probabilities are learned from the training data. The candidate values are [True, False]; with False, a uniform prior over the classes is used instead.
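
For reference, the smoothed likelihood estimate that `alpha` controls is usually written as below. This follows the formula given in the scikit-learn documentation for `MultinomialNB`; the symbols $N_{w,c}$, $N_c$ and $|V|$ are notation introduced here, not taken from the notebook.

$$\hat{P}(w|sentiment = c) = \frac{N_{w,c} + \alpha}{N_c + \alpha |V|}$$

Here, $N_{w,c}$ is the count of word $w$ in training documents of class $c$, $N_c$ is the total word count of class $c$, and $|V|$ is the vocabulary size. With $\alpha = 0$ a word never seen with class $c$ would get probability zero; as $\alpha$ grows, every estimate moves toward $1/|V|$.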

In [None]:
param_grid_nb = {
    'alpha': [0.1, 0.5, 1.0, 2.0],
    'fit_prior': [True, False]
}

model_nb, best_params_nb = hyperparamtune(MultinomialNB(), param_grid_nb, 'accuracy', 3, cv=5)

Fitting 5 folds for each of 8 candidates, totalling 40 fits
[CV 1/5] END .........alpha=0.1, fit_prior=True;, score=0.699 total time=   0.0s
[CV 2/5] END .........alpha=0.1, fit_prior=True;, score=0.696 total time=   0.0s
[CV 3/5] END .........alpha=0.1, fit_prior=True;, score=0.702 total time=   0.0s
[CV 4/5] END .........alpha=0.1, fit_prior=True;, score=0.699 total time=   0.1s
[CV 5/5] END .........alpha=0.1, fit_prior=True;, score=0.701 total time=   0.0s
[CV 1/5] END ........alpha=0.1, fit_prior=False;, score=0.664 total time=   0.1s
[CV 2/5] END ........alpha=0.1, fit_prior=False;, score=0.669 total time=   0.0s
[CV 3/5] END ........alpha=0.1, fit_prior=False;, score=0.671 total time=   0.0s
[CV 4/5] END ........alpha=0.1, fit_prior=False;, score=0.666 total time=   0.0s
[CV 5/5] END ........alpha=0.1, fit_prior=False;, score=0.664 total time=   0.0s
[CV 1/5] END .........alpha=0.5, fit_prior=True;, score=0.700 total time=   0.0s
[CV 2/5] END .........alpha=0.5, fit_prior=True;,

## Evaluation metrics

- TP (True Positive) is the number of positive samples that are correctly classified as positive.
- TN (True Negative) is the number of negative samples that are correctly classified as negative.
- FP (False Positive) is the number of negative samples that are misclassified as positive.
- FN (False Negative) is the number of positive samples that are misclassified as negative.

1. Precision:
    - Precision measures the ratio of correctly classified positive samples to the total number of samples classified as positive.
    - Formula: Precision = TP / (TP + FP)
    - High precision shows that when the model predicts positive it is usually right, i.e. few negative samples are misclassified as positive.

2. Recall (coverage of actual positives):
    - Recall (also known as Sensitivity or True Positive Rate) measures the ratio of correctly classified positive samples to the total number of actual positive samples.
    - Formula: Recall = TP / (TP + FN)
    - High recall shows that the model finds most of the positive samples, i.e. few positive samples are missed.

3. Accuracy:
    - Accuracy is the ratio between the number of correct predictions and the total number of samples in the test set.
    - Formula: Accuracy = (TP + TN) / (TP + TN + FP + FN)

4. F1-score (harmonic mean of Precision and Recall):
    - F1-score combines Precision and Recall into a single number.
    - F1-score is high only when both Precision and Recall are high.
    - Formula: F1-score = 2 * (Precision * Recall) / (Precision + Recall)
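
The definitions above are stated for a binary problem. With more than two classes, each class is treated in turn as the "positive" class and the per-class scores are averaged, which is what `average='macro'` does in the evaluation cell below. The following sketch reproduces that calculation in plain Python on a made-up three-class example; the function names and the toy labels are assumptions for illustration, and undefined ratios (a class that is never predicted) are set to 0.

```python
from collections import Counter

def per_class_counts(y_true, y_pred):
    """Count TP, FP and FN for every class in a multi-class setting."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    return tp, fp, fn

def macro_scores(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = per_class_counts(y_true, y_pred)
    precisions, recalls, f1s = [], [], []
    for c in labels:
        p = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        r = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }

# Toy labels: class 2 is never predicted, so its precision, recall and F1 are 0.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 0]
scores = macro_scores(y_true, y_pred)
print(scores)
```

In this example accuracy is 4/6, while the macro scores are noticeably lower because the class that is never predicted counts as much as the others. This is why macro averages can sit well below accuracy when a model neglects some classes.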

In [None]:
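from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Tuned models from the previous sections, evaluated on the same train/test split
models = [model_log, model_rf, model_nb]
model_names = ['Logistic Regression', 'Random Forest', 'Naive Bayes']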

results = []

for model, model_name in zip(models, model_names):
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    accuracy_train = accuracy_score(y_train, y_train_pred)
    precision_train = precision_score(y_train, y_train_pred, average='macro')
    recall_train = recall_score(y_train, y_train_pred, average='macro')
    f1_train = f1_score(y_train, y_train_pred, average='macro')

    accuracy_test = accuracy_score(y_test, y_test_pred)
    precision_test = precision_score(y_test, y_test_pred, average='macro')
    recall_test = recall_score(y_test, y_test_pred, average='macro')
    f1_test = f1_score(y_test, y_test_pred, average='macro')

    result = {
        'Model': model_name,
        'Accuracy (Train)': accuracy_train,
        'Precision (Train)': precision_train,
        'Recall (Train)': recall_train,
        'F1-score (Train)': f1_train,
        'Accuracy (Test)': accuracy_test,
        'Precision (Test)': precision_test,
        'Recall (Test)': recall_test,
        'F1-score (Test)': f1_test
    }

    results.append(result)

In [None]:
df_results = pd.DataFrame(results)
df_results

| | Model | Accuracy (Train) | Precision (Train) | Recall (Train) | F1-score (Train) | Accuracy (Test) | Precision (Test) | Recall (Test) | F1-score (Test) |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.797651 | 0.762185 | 0.737623 | 0.743058 | 0.798826 | 0.760699 | 0.737253 | 0.7424 |
| 1 | Random Forest | 0.699384 | 0.800481 | 0.579051 | 0.517138 | 0.70174 | 0.801965 | 0.58024 | 0.518398 |
| 2 | Naive Bayes | 0.743266 | 0.701072 | 0.681599 | 0.684964 | 0.739281 | 0.692635 | 0.676136 | 0.678751 |


In [None]:
#Visualize
import plotly.graph_objects as go

# Create a list of model names
model_names = ['Logistic Regression', 'Random Forest', 'Naive Bayes']

# Create the chart objects for train and test metrics
fig_train = go.Figure()
fig_test = go.Figure()

# Iterate through each metric and add one group of bars per metric to the train and test charts
metrics = ['Accuracy', 'Precision', 'Recall', 'F1-score']
for metric in metrics:
    # Get the metric values for each model
    train_values = df_results[f'{metric} (Train)']
    test_values = df_results[f'{metric} (Test)']

    # Add bars for the metric on the train chart
    fig_train.add_trace(go.Bar(x=model_names, y=train_values, name=f'{metric} (Train)'))

    # Add bars for the metric on the test chart
    fig_test.add_trace(go.Bar(x=model_names, y=test_values, name=f'{metric} (Test)'))

# Configure the train chart layout
fig_train.update_layout(
    title='Performance Metrics Comparison (Train)',
    xaxis_title='Model',
    yaxis_title='Score',
    barmode='group',
    legend=dict(x=0.85, y=1),
    autosize=False,
    width=800,
    height=500
)

# Configure the test chart layout
fig_test.update_layout(
    title='Performance Metrics Comparison (Test)',
    xaxis_title='Model',
    yaxis_title='Score',
    barmode='group',
    legend=dict(x=0.85, y=1),
    autosize=False,
    width=800,
    height=500
)

# Show the train chart
fig_train.show()

# Show the test chart
fig_test.show()

In natural language processing (NLP) classification tasks, accuracy, precision, recall, and F1-score should be read together rather than in isolation. Here is an overview of how each model performed on these metrics:

**Logistic Regression:**

- The Logistic Regression model has the highest accuracy of the three models on both the training and test sets (accuracy_train = 0.797651, accuracy_test = 0.798826).
- Precision and recall are also relatively high on both sets, and the train and test scores are almost identical, so there is no sign of overfitting.
- The macro-averaged precision, recall, and F1-score on the test set (about 0.76, 0.74, and 0.74) are lower than the accuracy, which suggests that some classes are classified less well than others.

Recommendation: The Logistic Regression model is a good choice for this NLP task, especially when accuracy is an important factor.

**Random Forest:**

- The model performs worse than Logistic Regression, with lower accuracy on both the training and test sets (accuracy_train = 0.699384, accuracy_test = 0.701740).
- Macro precision is high (about 0.80), but recall (about 0.58) and F1-score (about 0.52) are much lower, which suggests the model favors some classes and misses many samples of the others.

Recommendation: Random Forest is probably not the best choice for this NLP task. Its scores are low on the training set as well as the test set, which points to underfitting.

**Naive Bayes:**

- The model performs reasonably well, with good accuracy on both the training and test sets (accuracy_train = 0.743266, accuracy_test = 0.739281).
- Precision, recall, and F1-score are close to each other (about 0.68 to 0.69 on the test set), indicating balanced performance across classes.

Recommendation: The Naive Bayes model is a promising choice for this NLP task, especially when there is a need to balance precision and recall.

Based on these observations, `Logistic Regression` and `Naive Bayes` appear to be the better choices for this NLP task, considering their overall performance, balanced precision and recall, and consistency between training and test data.
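The evaluation code above computes the F1-score with `average='macro'` (and presumably precision and recall in the same way). To make it clearer why macro-averaged scores can sit well below accuracy, here is a minimal pure-Python sketch. The function name `macro_scores` and the toy labels are invented for illustration and are not taken from the project data.

```python
def macro_scores(y_true, y_pred):
    """Return (accuracy, macro precision, macro recall, macro F1)."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # A score is defined as 0 when its denominator is 0
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    n = len(labels)
    # Macro averaging: every class counts equally, regardless of its size
    return accuracy, sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy example: a classifier that always predicts the majority class
y_true = ["pos"] * 8 + ["neg"] * 2
y_pred = ["pos"] * 10

accuracy, precision, recall, f1 = macro_scores(y_true, y_pred)
print(f"{accuracy:.2f} {precision:.2f} {recall:.2f} {f1:.2f}")
# prints: 0.80 0.40 0.50 0.44
```

Accuracy looks acceptable here because the majority class dominates, while the macro scores expose that one class is never predicted. A similar effect may explain the gap between accuracy and macro F1 for Random Forest in the results table.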

# Pipeline

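Wrapping the TF-IDF vectorizer and the model in a single scikit-learn `Pipeline` helps organize the data processing and model building workflow. The same fitted transformation is applied at training and prediction time, which ensures consistency, and the whole workflow can be saved and reused as one object, which saves time and makes the project easier to adapt and scale.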
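As a rough illustration of the idea, the sketch below chains a `TfidfVectorizer` and a `LogisticRegression` on a handful of made-up reviews. The texts, labels, and the `max_iter=1000` setting are assumptions chosen for the example and are not the project's data or tuned hyperparameters.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up reviews for illustration only (1 = positive, 0 = negative)
texts = [
    "great food and friendly staff",
    "terrible service, never again",
    "loved the place, great coffee",
    "awful food and rude staff",
]
labels = [1, 0, 1, 0]

# Step 1 turns raw text into TF-IDF features, step 2 classifies them
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("model", LogisticRegression(max_iter=1000)),
])

# fit() fits the vectorizer and the classifier in one call
pipeline.fit(texts, labels)

# predict() accepts raw text: the fitted vectorizer is applied automatically
predictions = pipeline.predict(["great staff", "rude and awful"])
print(predictions)
```

Because the vectorizer lives inside the pipeline, new text never has to be transformed by hand before prediction.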

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

In [None]:
import pickle

# Specify the path to the pickle file
file_path = 'logistic.pkl'

# Open the pickle file in binary read mode
with open(file_path, 'rb') as file:
    # Load the pickled model object
    model_log = pickle.load(file)

# Display the loaded model
model_log

In [None]:
tf = TfidfVectorizer()

In [None]:
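def create_pipeline(model, data, label, tf):
  '''
  Build and fit a pipeline that chains the vectorizer and the best estimator.

  model: fitted search object exposing best_estimator_ (e.g. a fitted GridSearchCV)
  data: list of text (corpus)
  label: target labels (e.g. a DataFrame column)
  tf: vectorizer used as the first pipeline step
  '''
  pipeline = Pipeline([
    ('tfidf', tf),
    ('model', model.best_estimator_)
  ])
  pipeline.fit(data, label)
  return pipeline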

In [None]:
log_pipeline = create_pipeline(model_log, clean_data1, clean_data['label'], tf)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
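The warning above comes from the logistic regression solver stopping at its iteration limit (scikit-learn's default `max_iter` is 100). One possible way to address it, sketched here on made-up data, is to raise `max_iter` on the pipeline's `model` step through `set_params`. The value 1000 is an arbitrary assumption and the toy texts are not the project's corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up data for illustration only
toy_texts = ["good meal", "bad meal", "good service", "bad service"]
toy_labels = [1, 0, 1, 0]

toy_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("model", LogisticRegression()),  # default max_iter=100
])

# "<step name>__<parameter>" reaches into a pipeline step
toy_pipeline.set_params(model__max_iter=1000)
toy_pipeline.fit(toy_texts, toy_labels)

# n_iter_ reports how many iterations the solver actually used
iterations_used = int(toy_pipeline.named_steps["model"].n_iter_.max())
print(iterations_used)
```

If `iterations_used` stays below the limit, the solver converged. On the full dataset a larger limit may still be needed, or a different solver may converge faster.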


In [None]:
rf_pipeline = create_pipeline(model_rf, clean_data1, clean_data['label'], tf)

In [None]:
naive_pipeline = create_pipeline(model_nb, clean_data1, clean_data['label'], tf)

In [None]:
# Save each fitted pipeline to its own pickle file
pipelines = {'logistic': log_pipeline, 'randomforest': rf_pipeline, 'naivebayes': naive_pipeline}
for name, pipeline in pipelines.items():
  with open(f'pipeline_{name}.pkl', 'wb') as file:
    pickle.dump(pipeline, file)
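A pickled pipeline can later be reloaded and applied directly to raw text. The following is a minimal sketch of that round trip, assuming a small Naive Bayes pipeline trained on made-up reviews and a temporary file named `pipeline_demo.pkl`; both are hypothetical and stand in for the project's real pipelines and files.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up reviews for illustration only (1 = positive, 0 = negative)
texts = ["amazing pizza", "cold and bland pizza", "amazing staff", "bland soup"]
labels = [1, 0, 1, 0]

demo_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("model", MultinomialNB()),
])
demo_pipeline.fit(texts, labels)

new_reviews = ["amazing soup", "cold pizza"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pipeline_demo.pkl")  # hypothetical file name

    # Save the fitted pipeline (vectorizer and model together)
    with open(path, "wb") as file:
        pickle.dump(demo_pipeline, file)

    # Load it back, as a separate script or service would
    with open(path, "rb") as file:
        restored_pipeline = pickle.load(file)

# The restored pipeline gives the same predictions as the original
original_predictions = list(demo_pipeline.predict(new_reviews))
restored_predictions = list(restored_pipeline.predict(new_reviews))
print(original_predictions == restored_predictions)
```

Pickle files should only be loaded from trusted sources and, ideally, with the same scikit-learn version that produced them.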

# References

[Machine Learning Co Ban](https://machinelearningcoban.com/tabml_book/ch_model/random_forest.html)

[Logistic Regression](https://www.ibm.com/topics/logistic-regression#:~:text=Related%20solutions-,What%20is%20logistic%20regression%3F,given%20dataset%20of%20independent%20variables)

[Naive Bayes](https://en.wikipedia.org/wiki/Naive_Bayes_classifier)

[Grid Search](https://www.mygreatlearning.com/blog/gridsearchcv/)

[Cross-Validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)#:~:text=Cross%2Dvalidation%20is%20a%20resampling,model%20will%20perform%20in%20practice.)

Group 10

| Full name | Student ID | Contribution (%) |
| -------- | -------- | -------- |
| Huỳnh Lưu Vĩnh Phong | 21280103 | 100% |
| Nguyễn Hải Ngọc Huyền | 21280091 | 100% |
| Trịnh Minh Anh | 21280005 | 100% |
| Tạ Hoàng Kim Thy | 21280083 | 100% |
| Nguyễn Lưu Phương Ngọc Lam | 21280096 | 100% |