# Logistic Regression Model Training

This notebook trains a logistic regression model on the cleaned review dataset.
Logistic regression is a widely used classification algorithm for both binary and multiclass tasks.
The goal is to predict each review's `rating` (an integer label) from its `review_text`.

In [1]:
# Import required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
import joblib

In [2]:
# Load the cleaned training data, using the first column (review_id) as the index
df = pd.read_csv('cleaned_train.csv', header=0, index_col=0)
df.head()

review_id,review_text,rating
e8cb23191d6c27e930243a08ff826395,realli meant get landlin check one reason coul...,4
953dfd48b372f081b5f82ce1def753f7,updat make maximum ride movi look terribl http...,4
48509a6f6128d4f2ca243e04a0cdc896,feel like ive read mani urban fantasi book get...,3
a09f7ff4eca0c8c2fbaacf4baf6b114f,reread decemb simpli fantast read full humor m...,5
93b0128f768ee9c1af8864f566e3a7b6,big ass dnf ughhh im mad pick even care book o...,1


In [3]:
df_subset = df

In [4]:
# Replace missing values in 'review_text' with empty strings before splitting
df_subset['review_text'] = df_subset['review_text'].fillna('')

# Split the data into features (X) and target (y)
X = df_subset['review_text']
y = df_subset['rating']  # Assuming 'rating' is an integer between 0 and 5

# Convert the rating to integers (if it's not already)
y = y.astype(int)

In [5]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Defensive re-fill of missing values after the split (expected to be a no-op if already filled)
X_train = X_train.fillna('')
X_test = X_test.fillna('')

# Print the sizes of the training and testing sets
print("Training set size:", len(X_train))
print("Testing set size:", len(X_test))

Training set size: 504000
Testing set size: 126000
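
The split above is a plain random split. As an optional variant that this notebook does not use, `train_test_split` also accepts a `stratify` argument that keeps the class proportions the same in both sets. The sketch below uses hypothetical toy labels (`toy_X`, `toy_y`), not the review data.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy imbalanced labels (hypothetical): 80 reviews rated 4, 20 rated 1
toy_X = [f"review {i}" for i in range(100)]
toy_y = [4] * 80 + [1] * 20

# stratify=toy_y asks for the same label proportions in both splits
_, _, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

print(Counter(toy_y_train))
print(Counter(toy_y_test))
```

Whether stratification changes the results on the real dataset is not something this notebook measures.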


## Vectorization 
This section converts the review text into numerical features that a machine learning model can consume.
The chosen technique is Term Frequency-Inverse Document Frequency (TF-IDF), which weights each term by how often it appears
in a document and discounts terms that are common across the entire dataset.

The `TfidfVectorizer` from scikit-learn performs the TF-IDF vectorization.
Its `max_features` parameter caps the number of terms kept in the vocabulary; only the most frequent terms across the corpus are retained.

In [6]:
vectorizer = TfidfVectorizer(max_features=5000)
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)
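
To illustrate how `max_features` and the `fit_transform` / `transform` pair behave, here is a minimal sketch on a toy corpus. The documents and the names `train_docs`, `test_docs`, and `toy_vectorizer` are made up for illustration and are not part of the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical, not taken from the dataset)
train_docs = [
    "great book love it",
    "terribl book hate it",
    "love the charact great plot",
]
test_docs = ["great plot but unseen words"]

# Keep only the 3 most frequent terms across the training corpus
toy_vectorizer = TfidfVectorizer(max_features=3)

# fit_transform learns the vocabulary and IDF weights from the training text
train_matrix = toy_vectorizer.fit_transform(train_docs)

# transform reuses them; terms outside the learned vocabulary are ignored
test_matrix = toy_vectorizer.transform(test_docs)

print(sorted(toy_vectorizer.vocabulary_))
print(train_matrix.shape, test_matrix.shape)
```

Fitting only on the training text, as the notebook does, keeps information from the test set out of the vocabulary and IDF weights.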

## Model Training
This cell builds and trains a Logistic Regression model on the TF-IDF vectorized text data. The model is configured with the following parameters:

- **C (Regularization Parameter):** Set to 2. `C` is the inverse of the regularization strength, so larger values mean weaker regularization.
- **Max Iterations:** The solver is capped at 1000 iterations during training.
- **n_jobs:** Set to -1 to use all available CPU cores for parallel computation.

The Logistic Regression model is trained on the TF-IDF vectorized training data (`X_train_vectorized`) with corresponding labels (`y_train`).

In [7]:
# Build and train a Logistic Regression model
lr = LogisticRegression(C=2, max_iter=1000, n_jobs=-1)  # You can experiment with different parameters
lr.fit(X_train_vectorized, y_train)
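
To make the role of `C` more concrete, the sketch below fits the same model with a small and a large `C` on synthetic data and compares the size of the learned coefficients. The data from `make_classification` and the helper `coef_norm` are hypothetical stand-ins, not the notebook's TF-IDF features or tuning procedure.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data (hypothetical stand-in for the TF-IDF features)
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)


def coef_norm(C):
    """Fit a model with the given C and return the norm of its coefficients."""
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X_toy, y_toy)
    return float(np.linalg.norm(model.coef_))


strong_reg = coef_norm(0.01)   # small C: strong regularization
weak_reg = coef_norm(100.0)    # large C: weak regularization

print(strong_reg, weak_reg)
```

The smaller `C` shrinks the coefficients more, which is why `C` is usually tuned rather than fixed in advance.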

In [8]:
# Make predictions on the test set
y_pred = lr.predict(X_test_vectorized)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Print classification report
print("Classification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.5233968253968254
Classification Report:
               precision    recall  f1-score   support

           0       0.51      0.28      0.36      4327
           1       0.46      0.29      0.36      4035
           2       0.42      0.28      0.34     10187
           3       0.46      0.41      0.44     26590
           4       0.49      0.61      0.54     43754
           5       0.63      0.63      0.63     37107

    accuracy                           0.52    126000
   macro avg       0.50      0.42      0.44    126000
weighted avg       0.52      0.52      0.52    126000
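
The `macro avg` and `weighted avg` rows can be reproduced approximately from the per-class rows. The sketch below does this for recall, using the rounded values printed in the report, so the last digit may differ slightly from the report itself.

```python
# Per-class recall and support, copied from the report above (rounded to 2 decimals)
recall = [0.28, 0.29, 0.28, 0.41, 0.61, 0.63]
support = [4327, 4035, 10187, 26590, 43754, 37107]

total = sum(support)

# Macro average: unweighted mean over classes, so rare classes count as much as common ones
macro_recall = sum(recall) / len(recall)

# Weighted average: mean weighted by each class's support
weighted_recall = sum(r * s for r, s in zip(recall, support)) / total

print(total)
print(macro_recall, weighted_recall)
```

Support-weighted recall is the same quantity as overall accuracy, which is why the two agree in the report. The lower macro average reflects the weaker recall on the less frequent ratings 0 to 2.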



In [9]:
# Save the trained model and the vectorizer
joblib.dump(lr, 'logistic_regression_model_v3.joblib')
joblib.dump(vectorizer, 'tfidf_vectorizer_v3.joblib')

['tfidf_vectorizer_v3.joblib']
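
For later inference, both artifacts need to be loaded together, since the model expects features produced by the same fitted vectorizer. The sketch below shows the save and load round trip with `joblib` on a toy model in a temporary directory; the texts, labels, and file names are hypothetical, and with the real artifacts the paths would be the `.joblib` files saved above.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for the notebook's fitted vectorizer and model (hypothetical data)
texts = ["love great book", "hate terribl book", "great fantast read", "terribl bore read"]
labels = [5, 1, 5, 1]
toy_vectorizer = TfidfVectorizer().fit(texts)
toy_model = LogisticRegression(max_iter=1000).fit(toy_vectorizer.transform(texts), labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "toy_model.joblib")
    vectorizer_path = os.path.join(tmp_dir, "toy_vectorizer.joblib")
    joblib.dump(toy_model, model_path)
    joblib.dump(toy_vectorizer, vectorizer_path)

    # Load both artifacts back from disk
    loaded_model = joblib.load(model_path)
    loaded_vectorizer = joblib.load(vectorizer_path)

# New text must go through the loaded vectorizer before prediction
new_reviews = ["great read", "terribl book"]
original_pred = toy_model.predict(toy_vectorizer.transform(new_reviews))
loaded_pred = loaded_model.predict(loaded_vectorizer.transform(new_reviews))

print(loaded_pred)
```

New reviews would also need the same text cleaning that produced `cleaned_train.csv` before being vectorized.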

# END

The cell below was used during development to fine-tune the hyperparameters.