# Scikit-Learn

This notebook trains two scikit-learn classifiers and tracks each one as an MLflow experiment:

1. [Logistic Regression](#Experiment-1-(Logistic-Regression))
2. [Decision Tree](#Experiment-2-(Decision-Tree))

**Author:** BrenoAV

**Last Date Modified:** 2/2/2024

# Load Dataset

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("data.csv", sep="\t", encoding="utf-8")

In [3]:
df

| Unnamed: 0 | sentence | target | source |
|---|---|---|---|
| 0 | So there is no way for me to plug it in here i... | 0 | amazon |
| 1 | Good case, Excellent value. | 1 | amazon |
| 2 | Great for the jawbone. | 1 | amazon |
| 3 | Tied to charger for conversations lasting more... | 0 | amazon |
| 4 | The mic is great. | 1 | amazon |
| ... | ... | ... | ... |
| 2743 | I think food should have flavor and texture an... | 0 | yelp |
| 2744 | Appetite instantly gone. | 0 | yelp |
| 2745 | Overall I was not impressed and would not go b... | 0 | yelp |
| 2746 | The whole experience was underwhelming, and I ... | 0 | yelp |


## Split dataset

In [4]:
from sklearn.model_selection import train_test_split

In [5]:
X_train, X_test, y_train, y_test = train_test_split(df["sentence"], 
                                                    df["target"], 
                                                    test_size=0.2, 
                                                    random_state=123)

80% Train / 20% Test

In [6]:
X_train.shape, y_train.shape

((2198,), (2198,))

In [7]:
X_test.shape, y_test.shape

((550,), (550,))
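
The shapes above follow from how `train_test_split` handles a fractional `test_size`: the test count is rounded up and the training set takes the remainder. A quick check, assuming the 2,748 rows implied by the two shapes (2198 + 550):

```python
import math

n_samples = 2198 + 550  # total rows, inferred from the shapes printed above
test_size = 0.2

# scikit-learn rounds the test count up and gives the rest to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```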

# Preprocessing

In [8]:
from sklearn.feature_extraction.text import CountVectorizer

In [9]:
count_vectorizer = CountVectorizer(min_df=1, lowercase=True)
count_vectorizer.fit(X_train)  # important: fit on the training split only, so no test vocabulary leaks in
X_train_encoded = count_vectorizer.transform(X_train)
X_test_encoded = count_vectorizer.transform(X_test)
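
To see what fitting on the training split only means in practice, here is a minimal sketch on made-up sentences (not taken from `data.csv`): a word that appears only in the test text has no column in the learned vocabulary and is silently dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy sentences for illustration only
toy_train = ["The mic is great", "Great case"]
toy_test = ["The battery is great"]

toy_vectorizer = CountVectorizer(min_df=1, lowercase=True)
toy_vectorizer.fit(toy_train)  # the vocabulary comes from the training text only

vocabulary = sorted(toy_vectorizer.vocabulary_)  # columns are in alphabetical order
toy_test_counts = toy_vectorizer.transform(toy_test).toarray()[0].tolist()

print(vocabulary)       # ['case', 'great', 'is', 'mic', 'the']
print(toy_test_counts)  # [0, 1, 1, 0, 1] -> "battery" is not counted
```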

In [10]:
import pickle

with open("count_vectorizer.pkl", "wb") as f:
    pickle.dump(count_vectorizer, f)
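
At inference time the pickled vectorizer has to be loaded back before new text can be encoded. A minimal round-trip sketch, using a toy vectorizer and a temporary file (the file name is hypothetical) instead of the notebook's `count_vectorizer.pkl`:

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import CountVectorizer

# Toy vectorizer standing in for the one fitted above
toy_vectorizer = CountVectorizer().fit(["good case", "bad mic"])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "toy_vectorizer.pkl"  # hypothetical file name
    with open(path, "wb") as f:
        pickle.dump(toy_vectorizer, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored object carries the same vocabulary and encodes text the same way
same_vocabulary = restored.vocabulary_ == toy_vectorizer.vocabulary_
encoded = restored.transform(["good mic"]).toarray()[0].tolist()

print(same_vocabulary, encoded)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.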

In [11]:
X_train_encoded

<2198x4529 sparse matrix of type '<class 'numpy.int64'>'
	with 24039 stored elements in Compressed Sparse Row format>

In [12]:
X_test_encoded

<550x4529 sparse matrix of type '<class 'numpy.int64'>'
	with 5563 stored elements in Compressed Sparse Row format>
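
The two outputs above show why a sparse format is used: almost every cell of the document-term matrix is zero. A quick calculation from the numbers reported for `X_train_encoded`:

```python
n_rows, n_cols = 2198, 4529  # shape of X_train_encoded reported above
stored = 24039               # stored (non-zero) entries reported above

density = stored / (n_rows * n_cols)
avg_terms_per_sentence = stored / n_rows

print(f"{density:.4%}")                 # share of cells that hold a count
print(f"{avg_terms_per_sentence:.1f}")  # distinct vocabulary terms per sentence, on average
```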

# MLflow

In [13]:
import mlflow
from mlflow.data.pandas_dataset import PandasDataset

mlflow.set_tracking_uri(uri="http://127.0.0.1:8080")

In [14]:
dataset: PandasDataset = mlflow.data.from_pandas(df, source="data.csv")



## Experiment 1 (Logistic Regression)

In [15]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from scipy.stats import uniform
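
`scipy.stats.uniform(loc, scale)` is parameterized by a start and a width rather than a minimum and a maximum, so the `uniform(loc=0, scale=4)` used for `C` in this experiment covers the interval from 0 to 4. A small sketch that draws samples to confirm the range (the sample size and seed are arbitrary):

```python
from scipy.stats import uniform

c_distribution = uniform(loc=0, scale=4)  # support is [loc, loc + scale] = [0, 4]

samples = c_distribution.rvs(size=1000, random_state=123)

print(samples.min() >= 0.0, samples.max() <= 4.0)
print(c_distribution.mean())  # midpoint of the interval
```

Because `LogisticRegression` requires a strictly positive `C`, a lower bound slightly above zero (for example `uniform(loc=0.01, scale=4)`) would be one way to rule out draws extremely close to zero.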

In [16]:
experiment_name = "sentiment_analysis_logistic_regression"

experiment_tags = {
    "nlp.framework": "Scikit Learn",
    "nlp.encoding": "CountVectorizer",
    "nlp.model": "Logistic Regression",
    "nlp.task": "Sentiment Analysis"
}

mlflow.create_experiment(name=experiment_name, 
                         artifact_location="mlartifacts",
                         tags=experiment_tags)

'899577663967136178'

In [17]:
mlflow.set_experiment(experiment_name=experiment_name)  # The experiment ID could be used instead

param_distributions = dict(
    C=uniform(loc=0, scale=4),  # C is sampled uniformly from [0, 4]
    solver=["lbfgs", "liblinear"]
)


# Cross validation
clf = RandomizedSearchCV(estimator=LogisticRegression(random_state=123),
                         param_distributions=param_distributions,
                         refit=True,
                         n_iter=20,
                         cv=5,
                         verbose=2,
                         random_state=123)
search = clf.fit(X=X_train_encoded, y=y_train)

best_params = search.best_params_

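# refit=True already retrained the best estimator on the full training set,
# so clf can be evaluated directly on the held-out test set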

y_pred = clf.predict(X_test_encoded)

accuracy = accuracy_score(y_true=y_test, y_pred=y_pred)
precision = precision_score(y_true=y_test, y_pred=y_pred)
recall = recall_score(y_true=y_test, y_pred=y_pred)
f1 = f1_score(y_true=y_test, y_pred=y_pred)

metrics = {
    "accuracy": accuracy,
    "precision": precision,
    "recall": recall,
    "f1": f1
}

run_name = "_".join([f"{k}_{v}" for k, v in best_params.items()])

with mlflow.start_run(run_name=run_name):
    # Log the hyperparameters
    mlflow.log_params(best_params)

    # Log the metrics
    mlflow.log_metrics(metrics)

    # Log the dataset
    mlflow.log_input(dataset, context="training")

    # Log the model
    mlflow.sklearn.log_model(sk_model=clf, 
                             artifact_path="logistic_regression_model", 
                             input_example=X_train_encoded)

    # Log the fitted vectorizer (saved above as count_vectorizer.pkl)
    mlflow.log_artifact("count_vectorizer.pkl", artifact_path="mlartifacts")

Fitting 5 folds for each of 20 candidates, totalling 100 fits
[CV] END .................C=2.7858767423914466, solver=lbfgs; total time=   0.0s
[CV] END .................C=2.7858767423914466, solver=lbfgs; total time=   0.0s
[CV] END .................C=2.7858767423914466, solver=lbfgs; total time=   0.0s
[CV] END .................C=2.7858767423914466, solver=lbfgs; total time=   0.0s
[CV] END .................C=2.7858767423914466, solver=lbfgs; total time=   0.0s
[CV] END .................C=1.7138837047473028, solver=lbfgs; total time=   0.0s
[CV] END .................C=1.7138837047473028, solver=lbfgs; total time=   0.0s
[CV] END .................C=1.7138837047473028, solver=lbfgs; total time=   0.0s
[CV] END .................C=1.7138837047473028, solver=lbfgs; total time=   0.0s
[CV] END .................C=1.7138837047473028, solver=lbfgs; total time=   0.0s
[CV] END ..............C=2.205259076331565, solver=liblinear; total time=   0.0s
[CV] END ..............C=2.205259076331565, sol
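
The four metrics logged for each run can be checked by hand on a small example. The sketch below uses made-up labels (not results from this notebook) and compares the confusion-matrix formulas with scikit-learn's functions:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up labels for illustration: 1 = positive, 0 = negative
y_true_toy = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred_toy = [1, 1, 0, 0, 0, 1, 1, 0]

pairs = list(zip(y_true_toy, y_pred_toy))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives
tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives

manual = {
    "accuracy": (tp + tn) / len(pairs),
    "precision": tp / (tp + fp),
    "recall": tp / (tp + fn),
    "f1": 2 * tp / (2 * tp + fp + fn),
}
from_sklearn = {
    "accuracy": accuracy_score(y_true_toy, y_pred_toy),
    "precision": precision_score(y_true_toy, y_pred_toy),
    "recall": recall_score(y_true_toy, y_pred_toy),
    "f1": f1_score(y_true_toy, y_pred_toy),
}

print(manual)
print(from_sklearn)
```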



## Experiment 2 (Decision Tree)

In [18]:
from sklearn.tree import DecisionTreeClassifier

In [19]:
experiment_name = "sentiment_analysis_decision_tree"

experiment_tags = {
    "nlp.framework": "Scikit Learn",
    "nlp.encoding": "CountVectorizer",
    "nlp.model": "Decision Tree",
    "nlp.task": "Sentiment Analysis"
}

mlflow.create_experiment(name=experiment_name, 
                         tags=experiment_tags)

'420298245417662379'
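
The decision-tree search space in the next cell is a finite grid, so `RandomizedSearchCV` cannot draw 20 distinct candidates from it and falls back to trying every combination. Counting the combinations explains the "6 candidates, totalling 30 fits" line in the output:

```python
from itertools import product

criteria = ["gini", "entropy"]
max_depths = range(1, 4)  # 1, 2, 3
cv_folds = 5

candidates = list(product(criteria, max_depths))
total_fits = len(candidates) * cv_folds

print(len(candidates), total_fits)
```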

In [20]:
# NOTE: this cell repeats Experiment 1 and could be turned into a function

mlflow.set_experiment(experiment_name=experiment_name)  # The experiment ID could be used instead

param_distributions = dict(
    criterion=["gini", "entropy"],
    max_depth=range(1, 4),
)

# Cross validation
clf = RandomizedSearchCV(estimator=DecisionTreeClassifier(random_state=123),
                         param_distributions=param_distributions,
                         refit=True,
                         n_iter=20,
                         cv=5, 
                         verbose=2,
                         random_state=123)
search = clf.fit(X=X_train_encoded, y=y_train)

best_params = search.best_params_

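# refit=True already retrained the best estimator on the full training set,
# so clf can be evaluated directly on the held-out test set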

y_pred = clf.predict(X_test_encoded)

accuracy = accuracy_score(y_true=y_test, y_pred=y_pred)
precision = precision_score(y_true=y_test, y_pred=y_pred)
recall = recall_score(y_true=y_test, y_pred=y_pred)
f1 = f1_score(y_true=y_test, y_pred=y_pred)

metrics = {
    "accuracy": accuracy,
    "precision": precision,
    "recall": recall,
    "f1": f1
}

run_name = "_".join([f"{k}_{v}" for k, v in best_params.items()])

with mlflow.start_run(run_name=run_name):
    # Log the hyperparameters
    mlflow.log_params(best_params)

    # Log the metrics
    mlflow.log_metrics(metrics)

    # Log the dataset
    mlflow.log_input(dataset, context="training")

    # Log the model
    mlflow.sklearn.log_model(sk_model=clf, 
                             artifact_path="decision_tree_model", 
                             input_example=X_train_encoded)

    # Log the fitted vectorizer (saved above as count_vectorizer.pkl)
    mlflow.log_artifact("count_vectorizer.pkl", artifact_path="mlartifacts")



Fitting 5 folds for each of 6 candidates, totalling 30 fits
[CV] END ........................criterion=gini, max_depth=1; total time=   0.0s
[CV] END ........................criterion=gini, max_depth=1; total time=   0.0s
[CV] END ........................criterion=gini, max_depth=1; total time=   0.0s
[CV] END ........................criterion=gini, max_depth=1; total time=   0.0s
[CV] END ........................criterion=gini, max_depth=1; total time=   0.0s
[CV] END ........................criterion=gini, max_depth=2; total time=   0.0s
[CV] END ........................criterion=gini, max_depth=2; total time=   0.0s
[CV] END ........................criterion=gini, max_depth=2; total time=   0.0s
[CV] END ........................criterion=gini, max_depth=2; total time=   0.0s
[CV] END ........................criterion=gini, max_depth=2; total time=   0.0s
[CV] END ........................criterion=gini, max_depth=3; total time=   0.0s
[CV] END ........................criterion=gini, 



This Jupyter Notebook was created by BrenoAV. For any inquiries or feedback, please feel free to create an issue on [GitHub](https://github.com/BrenoAV/NLP-Sentiment-Analysis/issues) 📣.