# Model 1: Logistic Regression
**Logistic regression** is a foundational statistical model used primarily for binary classification, where it predicts the probability of an event from a set of input variables. Despite its name, it is a classification model: it passes a linear combination of the inputs through the logistic (sigmoid) function, which maps the result to a probability between 0 and 1. This makes the output straightforward to interpret, because the predicted probability directly indicates how likely an instance is to belong to a given class. Logistic regression is computationally efficient, which makes it suitable for large datasets, and, unlike classifiers such as Gaussian naive Bayes, it does not assume that the input variables follow a particular distribution. It is commonly applied to tasks such as predicting customer churn, identifying spam emails, and detecting fraudulent transactions. Its linear decision boundary limits how well it can capture complex relationships, but its simplicity, interpretability, and solid performance make it a standard baseline in both statistical analysis and machine learning.
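
To make the mapping from a linear score to a probability concrete, here is a minimal sketch in plain Python. The weights, bias, and feature values are made up for illustration and are not taken from the model trained in this notebook.

```python
import math


def sigmoid(z: float) -> float:
    """Map a real-valued score to a probability in (0, 1).

    The two branches avoid overflow in math.exp for large |z|.
    """
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_probability(features, weights, bias):
    """Probability of the positive class for one instance."""
    # Linear part: weighted sum of the inputs plus a bias term.
    z = sum(w * x for w, x in zip(weights, features)) + bias
    # Logistic part: squash the score into a probability.
    return sigmoid(z)


# Hypothetical two-feature example (values chosen only for illustration).
weights = [1.5, -2.0]
bias = 0.25
features = [0.8, 0.1]

p = predict_probability(features, weights, bias)
label = "positive" if p >= 0.5 else "negative"
print(f"P(positive) = {p:.4f} -> predicted class: {label}")
```

A score of 0 corresponds to a probability of exactly 0.5, which is why the default decision boundary sits where the linear score changes sign.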

## Data Preprocessing
Data preprocessing transforms raw data into a clean, usable format by handling missing values and outliers and by bringing features onto a consistent scale through normalization or standardization. It also includes feature extraction and selection to improve dataset quality. For the text data used here, preprocessing consists of tokenization, stopword removal, lemmatization, and TF-IDF vectorization.

In [6]:
# Import the necessary libraries
import pandas as pd
import numpy as np
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Load the IMDb movie reviews dataset
df = pd.read_csv('IMDB Dataset.csv.zip')

In [7]:
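# Download the NLTK resources needed for tokenization, stopword removal, and lemmatization
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

stop_words = set(stopwords.words('english'))

# Lowercase and tokenize the text, then drop English stopwords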
def tokenize_text(text):
    tokens = word_tokenize(text.lower())
    tokens = [t for t in tokens if t not in stop_words]
    return ' '.join(tokens)

df['review'] = df['review'].apply(tokenize_text)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [8]:
# Apply lemmatization to each token
lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    tokens = text.split()
    tokens = [lemmatizer.lemmatize(t) for t in tokens]
    return ' '.join(tokens)

df['review'] = df['review'].apply(lemmatize_text)


In [9]:
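# Convert the cleaned reviews to TF-IDF features (at most 5,000 terms)
vectorizer = TfidfVectorizer(max_features=5000)

# NOTE: the vectorizer is fitted on every review, including the rows that
# later become the validation split.
X_train = vectorizer.fit_transform(df['review'])
y_train = df['sentiment']

# NOTE: X_test / y_test are the full dataset again, not a held-out test set,
# so scores computed on them include the rows the model is trained on.
X_test = vectorizer.transform(df['review'])
y_test = df['sentiment']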

In [10]:
# Split the data into training (80%) and validation (20%) sets
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
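
The cells above fit the TF-IDF vectorizer on all reviews before splitting, and reuse the full dataset as `X_test`. One way to keep held-out data fully separate is to split the raw text first and let a scikit-learn pipeline fit the vectorizer on the training rows only. The sketch below is an alternative arrangement, not the notebook's method, and uses a small made-up corpus in place of `df['review']` and `df['sentiment']`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for the IMDb reviews and labels.
texts = [
    "great movie loved it", "wonderful acting great plot",
    "loved the wonderful story", "great fun loved every minute",
    "terrible movie hated it", "awful acting boring plot",
    "hated the boring story", "terrible waste awful film",
]
labels = ["positive"] * 4 + ["negative"] * 4

# Split the raw text first, so held-out rows never influence fitting.
texts_train, texts_test, labels_train, labels_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training texts only, then the classifier.
model = make_pipeline(
    TfidfVectorizer(max_features=5000),
    LogisticRegression(max_iter=1000),
)
model.fit(texts_train, labels_train)

# On held-out text the vectorizer only transforms; it is not refitted.
test_predictions = model.predict(texts_test)
print("Held-out accuracy:", model.score(texts_test, labels_test))
```

With this arrangement, a score on `texts_test` reflects reviews the model has never seen, either during vectorizer fitting or during training.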

## Model Building
Building a logistic regression model for binary classification involves the following steps. The data is split into a training set and a held-out set. A logistic regression model is trained on the training set, and its performance is evaluated on the held-out data using metrics such as accuracy, precision, and recall. Hyperparameter tuning can then be performed to optimize the model. This process helps produce a predictive model that generalizes to unseen data.
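
Hyperparameter tuning is mentioned above but not carried out in this notebook. As one possible approach, the sketch below searches over the regularization strength `C` with `GridSearchCV`. The synthetic data from `make_classification` and the candidate values for `C` are assumptions chosen for illustration, not settings used by the author.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification data standing in for the TF-IDF features.
X, y = make_classification(n_samples=300, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Candidate values for the inverse regularization strength C (illustrative).
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

# 5-fold cross-validation on the training split only.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

print("Best C:", search.best_params_["C"])
print("Cross-validated accuracy:", search.best_score_)
# The held-out split is scored once, after the search has finished.
print("Held-out accuracy:", search.score(X_te, y_te))
```

Smaller values of `C` mean stronger regularization; cross-validating on the training split keeps the held-out split untouched until the final check.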

In [11]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)

y_pred = log_reg.predict(X_val)
print("Logistic Regression Accuracy:", log_reg.score(X_val, y_val))

Logistic Regression Accuracy: 0.8885


In [13]:
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# Map the string labels to 1 (positive) and 0 (negative) for the metric functions
y_val_numeric = np.where(y_val == 'positive', 1, 0)
y_pred_numeric = np.where(y_pred == 'positive', 1, 0)

print("Logistic Regression Accuracy:", log_reg.score(X_val, y_val))
print("Precision:", precision_score(y_val_numeric, y_pred_numeric))
print("Recall:", recall_score(y_val_numeric, y_pred_numeric))
print("F1-score:", f1_score(y_val_numeric, y_pred_numeric))


Logistic Regression Accuracy: 0.8885
Precision: 0.8790571870170015
Recall: 0.9029569358999802
F1-score: 0.8908467939304944


In [19]:
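from sklearn.metrics import confusion_matrix, classification_report

# Predict on X_test, which is the full dataset (training rows included).
# Despite the variable name, these predictions are string labels.
y_pred_numeric = log_reg.predict(X_test)

# Compute the confusion matrix and classification report
print("confusion_matrix")
print(confusion_matrix(y_test, y_pred_numeric))
print("\nclassification_report")
print(classification_report(y_test, y_pred_numeric))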

confusion_matrix
[[22412  2588]
 [ 2058 22942]]

classification_report
              precision    recall  f1-score   support

    negative       0.92      0.90      0.91     25000
    positive       0.90      0.92      0.91     25000

    accuracy                           0.91     50000
   macro avg       0.91      0.91      0.91     50000
weighted avg       0.91      0.91      0.91     50000



## Detailed Output Analysis of Logistic Regression Model

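The logistic regression model achieved an overall accuracy of 0.90708, or approximately 90.71%, on the evaluation set: the confusion matrix shows (22,412 + 22,942) / 50,000 correct predictions. Because this set is the full 50,000-review dataset, which includes the 40,000 reviews used for training, the figure is optimistic. On the 10,000 held-out validation reviews, the accuracy was 0.8885.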

The classification report provides more detailed performance metrics:

### Negative Class Performance
- **Precision**: 0.92
- **Recall**: 0.90
- **F1-score**: 0.91
- **Support**: 25,000

For the negative class, the model has a high precision of 92%, meaning that 92% of the instances predicted as negative are actually negative. The recall is also strong at 90%, indicating that the model correctly identifies 90% of the actual negative instances. The F1-score, which is the harmonic mean of precision and recall, is 0.91, reflecting the model's solid performance for the negative class.

### Positive Class Performance
- **Precision**: 0.90
- **Recall**: 0.92
- **F1-score**: 0.91
- **Support**: 25,000

For the positive class, the model's precision is 90%, indicating that 90% of the instances predicted as positive are indeed positive. The recall is slightly higher at 92%, meaning the model successfully identifies 92% of the actual positive instances. The F1-score for the positive class is also 0.91, demonstrating the model's balanced performance in identifying positive instances.

### Overall Performance
- **Accuracy**: 0.91
- **Macro Average Precision**: 0.91
- **Macro Average Recall**: 0.91
- **Macro Average F1-score**: 0.91
- **Weighted Average Precision**: 0.91
- **Weighted Average Recall**: 0.91
- **Weighted Average F1-score**: 0.91

The overall accuracy of 0.91 (rounded) shows that the model classifies most instances correctly. The macro average is the unweighted mean of each metric across the two classes. The weighted average weights each class by its support; because both classes have 25,000 instances, it is identical to the macro average here. Both sets of averages indicate balanced performance across the two classes.
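
The figures in the classification report can be reproduced by hand from the confusion matrix printed earlier. The sketch below assumes scikit-learn's convention that rows are actual classes and columns are predicted classes, in the order negative, positive.

```python
# Confusion matrix copied from the notebook output.
# Rows = actual class, columns = predicted class, order: [negative, positive].
cm = [[22412, 2588],
      [2058, 22942]]

tn, fp = cm[0]  # actual negative: predicted negative / predicted positive
fn, tp = cm[1]  # actual positive: predicted negative / predicted positive
total = tn + fp + fn + tp

accuracy = (tn + tp) / total

# Positive class
precision_pos = tp / (tp + fp)
recall_pos = tp / (tp + fn)
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

# Negative class
precision_neg = tn / (tn + fn)
recall_neg = tn / (tn + fp)
f1_neg = 2 * precision_neg * recall_neg / (precision_neg + recall_neg)

print(f"accuracy: {accuracy:.5f}")
print(f"negative: precision={precision_neg:.2f} recall={recall_neg:.2f} f1={f1_neg:.2f}")
print(f"positive: precision={precision_pos:.2f} recall={recall_pos:.2f} f1={f1_pos:.2f}")
```

Rounded to two decimals, these values match the per-class rows of the classification report.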

### Summary
In summary, the logistic regression model demonstrates strong and balanced performance, with an accuracy of about 90.7% on the full dataset and 88.85% on the held-out validation split. It performs similarly for the negative and positive classes, with precision, recall, and F1-scores between 0.90 and 0.92 in the classification report. Since the full-dataset figures include the training reviews, the validation accuracy is the better estimate of how the model will perform on unseen reviews. Overall, the model is a solid baseline for this classification task.

In [20]:
import joblib

# Save the Logistic Regression model and TF-IDF vectorizer
joblib.dump(log_reg, 'log_reg.pkl')
joblib.dump(vectorizer, 'vectorizer.pkl')

['vectorizer.pkl']
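
Once saved, the model and vectorizer can be loaded back and applied to new text. The sketch below is self-contained: it trains on a small made-up corpus and writes to a temporary directory instead of the `log_reg.pkl` and `vectorizer.pkl` files above, so the data and paths are assumptions for illustration. In practice, new reviews would need the same tokenization, stopword removal, and lemmatization as the training data before being vectorized.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy training data (already lowercased and cleaned).
train_texts = [
    "great movie loved", "wonderful acting great plot",
    "terrible movie hated", "awful acting boring plot",
]
train_labels = ["positive", "positive", "negative", "negative"]

toy_vectorizer = TfidfVectorizer(max_features=5000)
toy_model = LogisticRegression(max_iter=1000)
toy_model.fit(toy_vectorizer.fit_transform(train_texts), train_labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "log_reg.pkl")
    vectorizer_path = os.path.join(tmp_dir, "vectorizer.pkl")

    # Save both objects: the model is only usable with its fitted vectorizer.
    joblib.dump(toy_model, model_path)
    joblib.dump(toy_vectorizer, vectorizer_path)

    # Load them back, as a separate script or service would.
    loaded_model = joblib.load(model_path)
    loaded_vectorizer = joblib.load(vectorizer_path)

# Transform new text with the loaded vectorizer (never refit it here).
new_reviews = ["loved wonderful movie"]
features = loaded_vectorizer.transform(new_reviews)

prediction = loaded_model.predict(features)
probabilities = loaded_model.predict_proba(features)
print("Predicted label:", prediction[0])
print("Class order:", list(loaded_model.classes_))
print("Class probabilities:", probabilities[0])
```

The columns of `predict_proba` follow the order of `classes_`, which is why the sketch prints both.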