<h1>Baseline Model</h1>
<p>Logistic Regression was chosen as the baseline model for its simplicity, interpretability, and efficiency on binary classification tasks.<br> Because the model is linear, its scores show how far a simple decision boundary over the text features can separate each class, which gives a reference point for judging more complex models.</p>

<h3>Imports</h3>

In [1]:
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
import joblib

<h3>Loading train dataset</h3>

In [2]:
# Load train data (append .sample(n=1000) for a quick trial run on a subset)
train_dataset = pd.read_csv('kmaml223/train.csv')

# Label columns to predict; one binary classifier is trained per label
feature_columns = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

In [3]:
train_dataset.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate,comment_text_cleaned
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0,explanation edits made username hardcore metal...
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0,d'aww match background colour 'm seemingly stu...
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0,hey man 'm really trying edit war 's guy const...
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0,ca n't make real suggestion improvement wonder...
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0,sir hero chance remember page 's
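<p>Since every label column is a 0/1 flag, it can help to check how often each label is set before training. The sketch below uses a tiny stand-in DataFrame with invented values; on the real data the same two expressions would be applied to <code>train_dataset</code>.</p>

```python
import pandas as pd

label_columns = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

# Tiny stand-in for train_dataset; the values are invented for illustration
toy = pd.DataFrame(
    [
        [0, 0, 0, 0, 0, 0],
        [1, 0, 1, 0, 1, 0],
        [0, 0, 0, 0, 0, 0],
        [1, 1, 1, 0, 1, 0],
        [0, 0, 0, 0, 0, 0],
    ],
    columns=label_columns,
)

# The mean of a 0/1 column is the fraction of positive examples
positive_rate = toy[label_columns].mean().sort_values(ascending=False)
print(positive_rate)

# Labels set per comment: 0 means a clean comment, more than 1 means multi-label
labels_per_comment = toy[label_columns].sum(axis=1)
print(labels_per_comment.value_counts().sort_index())
```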


<h3>Splitting train dataset</h3>

In [4]:
# Split train_dataset into training and testing subsets
train_subset, test_subset = train_test_split(train_dataset, test_size=0.2, random_state=42)

# Replace missing cleaned comments with empty strings. Assigning the result
# avoids in-place modification of a column taken from a DataFrame slice,
# which triggers warnings and may have no effect in recent pandas versions.
X_train = train_subset['comment_text_cleaned'].fillna('')
X_test = test_subset['comment_text_cleaned'].fillna('')
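<p>The <code>fillna('')</code> step matters because a text vectorizer expects every document to be a string. A small sketch with an invented toy column, showing that an empty comment simply becomes a row without any non-zero entries:</p>

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy column standing in for 'comment_text_cleaned'; values are invented
comments = pd.Series(["hey man edit war", np.nan, "sir hero chance"])

# Replace missing comments with empty strings before vectorizing
filled = comments.fillna('')

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(filled)

# Count the non-zero TF-IDF entries in each row; the empty comment has none
nonzero_per_row = matrix.getnnz(axis=1)
print(nonzero_per_row)
```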

<h3>Creating a Logistic Regression Model for every label</h3>

<p>Let's save the fitted vectorizer and every trained model to disk as well!</p>
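<p>The next cell fits the vectorizer on the training subset only and then reuses it on the test subset. A minimal sketch of why that distinction matters, using invented toy texts: the vocabulary is learned during <code>fit_transform</code>, and words that appear only at <code>transform</code> time are ignored.</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy texts, invented for illustration
train_texts = ["edit war page", "page hero edit"]
test_texts = ["hero of the unseen page"]

vectorizer = TfidfVectorizer()
train_matrix = vectorizer.fit_transform(train_texts)  # learns vocabulary and IDF weights
test_matrix = vectorizer.transform(test_texts)        # reuses them unchanged

# The vocabulary contains training words only
vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)

# Both matrices share the same columns; unseen test words contribute nothing
print(train_matrix.shape, test_matrix.shape, test_matrix.nnz)
```

<p>Fitting on the training subset alone keeps the test subset from influencing the features, so the reported scores stay an honest estimate.</p>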

In [5]:
models = {}

# Preprocess text data, initialize and fit TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

# Save the TfidfVectorizer
tfidf_filename = "models/baseline/tfidf_vectorizer.pkl"
joblib.dump(tfidf_vectorizer, tfidf_filename)
print("TfidfVectorizer saved as tfidf_vectorizer.pkl")

for feature in feature_columns:
    print(f"===== Predicting for '{feature}' =====")
    
    # Binary targets (0/1) for the current label
    y_train = train_subset[feature]
    y_test = test_subset[feature]
    
    # Initialize and train Logistic Regression model
    logreg = LogisticRegression(max_iter=1000)
    logreg.fit(X_train_tfidf, y_train)

    # Predict toxicity on test data
    predictions = logreg.predict(X_test_tfidf)

    # Evaluate the model
    accuracy = accuracy_score(y_test, predictions)
    f1 = f1_score(y_test, predictions)
    
    print(f"Accuracy for '{feature}': {accuracy}")
    print(f"F1 Score for '{feature}': {f1}")

    # Store the trained model in the dictionary
    models[feature] = logreg

    # Save the trained model
    model_filename = f"models/baseline/{feature}_model.pkl"
    joblib.dump(logreg, model_filename)
    print(f"Model for '{feature}' saved as {model_filename}")
    print()

TfidfVectorizer saved as tfidf_vectorizer.pkl
===== Predicting for 'toxic' =====
Accuracy for 'toxic': 0.9568854770484099
F1 Score for 'toxic': 0.7291338582677166
Model for 'toxic' saved as models/baseline/toxic_model.pkl

===== Predicting for 'severe_toxic' =====
Accuracy for 'severe_toxic': 0.9906626977910074
F1 Score for 'severe_toxic': 0.3739495798319328
Model for 'severe_toxic' saved as models/baseline/severe_toxic_model.pkl

===== Predicting for 'obscene' =====
Accuracy for 'obscene': 0.9763747454175152
F1 Score for 'obscene': 0.7376478775226166
Model for 'obscene' saved as models/baseline/obscene_model.pkl

===== Predicting for 'threat' =====
Accuracy for 'threat': 0.9977126742910857
F1 Score for 'threat': 0.21505376344086022
Model for 'threat' saved as models/baseline/threat_model.pkl

===== Predicting for 'insult' =====
Accuracy for 'insult': 0.9692307692307692
F1 Score for 'insult': 0.6254767353165522
Model for 'insult' saved as models/baseline/insult_model.pkl

===== Predict
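<p>The scores above pair very high accuracy with a much lower F1 for labels such as 'severe_toxic' and 'threat'. One plausible explanation, assuming positives are rare for these labels, is that accuracy is dominated by the many negative examples. The sketch below reproduces the effect on synthetic labels that are invented for illustration.</p>

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Synthetic labels, invented for illustration: 10 positives out of 1000
y_true = np.zeros(1000, dtype=int)
y_true[:10] = 1

# A model that finds only 2 of the 10 positives and raises no false alarms
y_pred = np.zeros(1000, dtype=int)
y_pred[:2] = 1

accuracy = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
print(f"accuracy={accuracy:.3f}, f1={f1:.3f}")
```

<p>Here accuracy is 0.992 while F1 is about 0.333, because precision is 1.0 but recall is only 0.2. For rare labels, F1 is therefore the more informative of the two numbers.</p>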

<p>Yay!</p>
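<p>Once saved, the vectorizer and a model can be loaded again for inference. The sketch below is self-contained: it trains on invented toy data and writes to a temporary directory, with file names that only mirror the notebook's naming scheme. With the real artifacts, the paths under <code>models/baseline/</code> would be used instead.</p>

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training data, invented for illustration (1 = toxic, 0 = clean)
texts = ["you are an idiot", "stupid idiot", "thanks for the edit", "great page thanks"]
labels = [1, 1, 0, 0]

vectorizer = TfidfVectorizer()
model = LogisticRegression(max_iter=1000).fit(vectorizer.fit_transform(texts), labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file names mirroring the notebook's naming scheme
    vectorizer_path = os.path.join(tmp_dir, "tfidf_vectorizer.pkl")
    model_path = os.path.join(tmp_dir, "toxic_model.pkl")
    joblib.dump(vectorizer, vectorizer_path)
    joblib.dump(model, model_path)

    # Reload both artifacts, as a separate inference script would
    loaded_vectorizer = joblib.load(vectorizer_path)
    loaded_model = joblib.load(model_path)

new_comments = ["what an idiot", "thanks a lot"]
features = loaded_vectorizer.transform(new_comments)  # transform only, never refit
probabilities = loaded_model.predict_proba(features)[:, 1]  # P(label = 1)
print(probabilities)
```

<p>The vectorizer has to be the same fitted object that was used in training, since the model's coefficients are tied to its vocabulary and column order.</p>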