In [None]:
""""Theoretical Questions

1. What is Logistic Regression, and how does it differ from Linear Regression?
Logistic Regression is used for classification problems (predicting categorical outcomes like 0/1, Yes/No).

Linear Regression predicts a continuous numeric value.

In Logistic Regression, the output is probabilistic (between 0 and 1), whereas in Linear Regression it can be any real number.

2. What is the mathematical equation of Logistic Regression?
The logistic regression equation is:
P(y=1∣x)=σ(z)= 1/1+e −z
where
𝑧
=
𝛽
0
+
𝛽1𝑥1+𝛽2𝑥2+...+𝛽𝑛𝑥𝑛

3. Why do we use the Sigmoid function in Logistic Regression?
The sigmoid function squashes input values into the range (0, 1), making it suitable for modeling probabilities.

It ensures that the output can be interpreted as a probability.

4. What is the cost function of Logistic Regression?
The cost function is the Log Loss (Binary Cross-Entropy):
J(θ)=− m/1
i=1∑m
(y
(i)
 log(
y
^
(i)
 )+(1−y
(i)
 )log(1−
y
^
(i)
 ))
where
𝑦
^
y
^
​
  is the predicted probability.

5. What is Regularization in Logistic Regression? Why is it needed?
Regularization adds a penalty to the cost function to prevent overfitting.

It discourages the model from fitting noise by keeping model parameters (weights) small.

6. Explain the difference between Lasso, Ridge, and Elastic Net regression.
Lasso (L1 Regularization): Shrinks some coefficients to zero — good for feature selection.

Ridge (L2 Regularization): Shrinks coefficients smoothly but never exactly zero — good for handling multicollinearity.

Elastic Net: A combination of L1 and L2 — useful when there are many correlated features.

7. When should we use Elastic Net instead of Lasso or Ridge?
Use Elastic Net when:

Features are highly correlated.

You want both feature selection and model stability.

Lasso alone might randomly pick one feature among correlated ones.

8. What is the impact of the regularization parameter (λ) in Logistic Regression?
λ (lambda) controls the strength of regularization.

High λ → More regularization → Simpler model (may underfit).

Low λ → Less regularization → Complex model (may overfit).

9. What are the key assumptions of Logistic Regression?
Linearity between independent variables and the log odds (not the dependent variable).

No extreme multicollinearity.

Observations are independent.

Large sample size preferred for stable estimates.

10. What are some alternatives to Logistic Regression for classification tasks?
Decision Trees

Random Forest

Support Vector Machines (SVM)

Gradient Boosting (e.g., XGBoost)

Neural Networks

K-Nearest Neighbors (KNN)

11. What are Classification Evaluation Metrics?
Accuracy

Precision

Recall

F1 Score

ROC-AUC

Confusion Matrix

Log Loss

12. How does class imbalance affect Logistic Regression?
The model may become biased towards the majority class.

Can lead to poor performance on the minority class.

Solutions: Use class weights, oversampling (SMOTE), undersampling.

13. What is Hyperparameter Tuning in Logistic Regression?
Selecting the best hyperparameters (like λ, penalty type, solver) to optimize model performance.

Done using Grid Search, Random Search, or Bayesian Optimization.

14. What are different solvers in Logistic Regression? Which one should be used?
liblinear: Good for small datasets, supports L1 and L2.

newton-cg: Good for L2 penalty, large datasets.

sag: Faster for large datasets, only supports L2.

saga: Supports L1, L2, and Elastic Net; good for large datasets.

Choice:

Small dataset → liblinear

Large dataset → saga

15. How is Logistic Regression extended for multiclass classification?
One-vs-Rest (OvR): One classifier per class vs all others.

Softmax (Multinomial Logistic Regression): Generalizes logistic regression for multiple classes directly.

16. What are the advantages and disadvantages of Logistic Regression?
Advantages:

Simple and fast.

Interpretable coefficients.

Works well when the classes are linearly separable.

Disadvantages:

Struggles with non-linear problems.

Sensitive to outliers.

Assumes independence between features.

17. What are some use cases of Logistic Regression?
Email spam detection.

Credit scoring.

Disease diagnosis (e.g., diabetes prediction).

Customer churn prediction.

Marketing campaign response prediction.

18. What is the difference between Softmax Regression and Logistic Regression?
Logistic Regression: Used for binary classification.

Softmax Regression: Used for multi-class classification, outputs probabilities for each class.

19. How do we choose between One-vs-Rest (OvR) and Softmax for multiclass classification?
If speed and simplicity are needed → OvR.

If better probability estimation and full class consideration are needed → Softmax.

20. How do we interpret coefficients in Logistic Regression?
Each coefficient represents the change in the log-odds of the outcome for a one-unit increase in the predictor, holding all other predictors constant.

Exponentiating a coefficient gives the odds ratio.

"""""

In [None]:
#Practical

#1. Load dataset, split, train Logistic Regression, print accuracy

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))


Accuracy: 1.0


In [None]:
#2. Apply L1 regularization (Lasso)

model = LogisticRegression(penalty='l1', solver='saga', max_iter=1000)
model.fit(X_train, y_train)
print("Accuracy with L1 regularization:", model.score(X_test, y_test))


Accuracy with L1 regularization: 1.0




In [None]:
#3. Train with L2 regularization (Ridge)

model = LogisticRegression(penalty='l2', max_iter=1000)
model.fit(X_train, y_train)
print("Accuracy with L2 regularization:", model.score(X_test, y_test))
print("Coefficients:", model.coef_)


Accuracy with L2 regularization: 1.0
Coefficients: [[-0.40538546  0.86892246 -2.2778749  -0.95680114]
 [ 0.46642685 -0.37487888 -0.18745257 -0.72127133]
 [-0.06104139 -0.49404358  2.46532746  1.67807247]]


In [None]:
#4. Train Logistic Regression with Elastic Net

model = LogisticRegression(penalty='elasticnet', solver='saga', l1_ratio=0.5, max_iter=1000)
model.fit(X_train, y_train)
print("Accuracy with Elastic Net:", model.score(X_test, y_test))


Accuracy with Elastic Net: 1.0




In [None]:
#5. Multiclass Logistic Regression using OvR

model = LogisticRegression(multi_class='ovr', max_iter=1000)
model.fit(X_train, y_train)
print("OvR Accuracy:", model.score(X_test, y_test))


OvR Accuracy: 0.9555555555555556




In [None]:
#6. Apply GridSearchCV to tune C and penalty

from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.1, 1, 10], 'penalty': ['l1', 'l2'], 'solver': ['liblinear', 'saga']}
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)
print("Best accuracy:", grid.best_score_)




Best parameters: {'C': 1, 'penalty': 'l1', 'solver': 'saga'}
Best accuracy: 0.9619047619047618




In [None]:
#7. Evaluate using Stratified K-Fold

from sklearn.model_selection import StratifiedKFold, cross_val_score

skf = StratifiedKFold(n_splits=5)
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=skf)

print("Average Stratified K-Fold Accuracy:", scores.mean())


Average Stratified K-Fold Accuracy: 0.9733333333333334


In [25]:
#8. Load dataset from CSV and apply Logistic Regression

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder # Import LabelEncoder


df = pd.read_csv('/dev/CSV/amazon1.csv')
# Check for the actual name of the target column
print(df.columns)  # Print the column names to identify the target column

# Replace 'review_score' with the actual target column name from the printed columns
target_column_name = 'target Rate' # Example: 'rating'

X = df.drop(target_column_name, axis=1)
y = df[target_column_name]

# Identify columns with non-numeric data
categorical_cols = X.select_dtypes(include=['object']).columns

# Create a LabelEncoder object
label_encoder = LabelEncoder()

# Apply label encoding to each categorical column
for col in categorical_cols:
    X[col] = label_encoder.fit_transform(X[col])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

print("Accuracy:", model.score(X_test, y_test))

Index(['product_id', 'product_name', 'category', 'discounted_price',
       'actual_price', 'discount_percentage', 'rating', 'target Rate',
       'rating_count', 'about_product', 'user_id', 'user_name', 'review_id',
       'review_title', 'review_content', 'img_link', 'product_link'],
      dtype='object')
Accuracy: 0.23863636363636365


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


In [26]:
#9. Apply RandomizedSearchCV for tuning

from sklearn.model_selection import RandomizedSearchCV

param_dist = {'C': [0.1, 1, 10], 'penalty': ['l1', 'l2'], 'solver': ['liblinear', 'saga']}
random_search = RandomizedSearchCV(LogisticRegression(max_iter=1000), param_distributions=param_dist, cv=5, n_iter=5)
random_search.fit(X_train, y_train)

print("Best parameters:", random_search.best_params_)
print("Best accuracy:", random_search.best_score_)




Best parameters: {'solver': 'liblinear', 'penalty': 'l2', 'C': 1}
Best accuracy: 0.24292682926829268


In [None]:
# Install the required package using the appropriate Colab command

from sklearn.multiclass import OneVsOneClassifier

ovo_model = OneVsOneClassifier(LogisticRegression(max_iter=1000))
ovo_model.fit(X_train, y_train)

print("OvO Accuracy:", ovo_model.score(X_test, y_test))


In [None]:
#11. Visualize confusion matrix

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

model.fit(X_train, y_train)
y_pred = model.predict(X_test)

cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(cm).plot()
#11. Visualize confusion matrix

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt  # Import pyplot

# ... (rest of your code) ...

# Before plotting, set the font family to a known available font on your system
plt.rcParams['font.family'] = 'Arial'  # Or any other font you have installed

# Now try to plot the confusion matrix
cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(cm).plot()
plt.show()  # Display the plot

In [37]:
#12. Evaluate using Precision, Recall, and F1-Score
from sklearn.metrics import precision_score, recall_score, f1_score

print("Precision:", precision_score(y_test, y_pred, average='weighted'))
print("Recall:", recall_score(y_test, y_pred, average='weighted'))
print("F1 Score:", f1_score(y_test, y_pred, average='weighted'))


Precision: 0.24567056314183247
Recall: 0.23863636363636365
F1 Score: 0.22957003984566463




In [38]:
#13. Train on imbalanced data using class weights

model = LogisticRegression(class_weight='balanced', max_iter=1000)
model.fit(X_train, y_train)
print("Accuracy with class weights:", model.score(X_test, y_test))


Accuracy with class weights: 0.19318181818181818


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


In [None]:
#14. Train on Titanic dataset, handle missing values

df = pd.read_csv('titanic.csv')
df.fillna(df.mean(), inplace=True)

X = df[['Pclass', 'Age', 'SibSp', 'Fare']]
y = df['Survived']

X_train, X_test, y_train, y_test = train_test_split(X, y)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

print("Titanic dataset accuracy:", model.score(X_test, y_test))


In [40]:
#15. Apply feature scaling before Logistic Regression

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

print("Accuracy with scaling:", model.score(X_test, y_test))


Accuracy with scaling: 0.5818181818181818


In [None]:
#16. Evaluate using ROC-AUC score

from sklearn.metrics import roc_auc_score

y_prob = model.predict_proba(X_test)[:, 1]
print("ROC-AUC Score:", roc_auc_score(y_test, y_prob))


In [42]:
#17. Train with custom learning rate (C=0.5)

model = LogisticRegression(C=0.5, max_iter=1000)
model.fit(X_train, y_train)

print("Accuracy with C=0.5:", model.score(X_test, y_test))


Accuracy with C=0.5: 0.5295454545454545


In [43]:
#18. Identify important features

feature_importance = pd.Series(model.coef_[0], index=df.columns[:-1])
print("Feature Importance:\n", feature_importance)


Feature Importance:
 product_id             0.008829
product_name          -0.200362
category               0.422459
discounted_price      -0.244244
actual_price          -0.149932
discount_percentage   -0.491425
rating                 1.480221
target Rate            0.576105
rating_count           0.017573
about_product          0.226476
user_id               -0.249185
user_name             -0.071216
review_id             -0.416970
review_title          -0.399488
review_content         0.273859
img_link              -0.183259
dtype: float64


In [44]:
#19. Evaluate using Cohen’s Kappa Score

from sklearn.metrics import cohen_kappa_score

print("Cohen's Kappa Score:", cohen_kappa_score(y_test, y_pred))


Cohen's Kappa Score: 0.01713946234962127


In [None]:
#20. Visualize Precision-Recall Curve

from sklearn.metrics import precision_recall_curve, PrecisionRecallDisplay

precision, recall, _ = precision_recall_curve(y_test, y_prob)
PrecisionRecallDisplay(precision=precision, recall=recall).plot()


In [46]:
#21. Train with different solvers and compare accuracy

for solver in ['liblinear', 'saga', 'lbfgs']:
    model = LogisticRegression(solver=solver, max_iter=1000)
    model.fit(X_train, y_train)
    print(f"Accuracy with {solver}: {model.score(X_test, y_test)}")


Accuracy with liblinear: 0.3409090909090909
Accuracy with saga: 0.5818181818181818
Accuracy with lbfgs: 0.5818181818181818


In [47]:
#22. Evaluate using Matthews Correlation Coefficient (MCC)

from sklearn.metrics import matthews_corrcoef

print("MCC Score:", matthews_corrcoef(y_test, y_pred))


MCC Score: 0.01731363428223641


In [48]:
#23. Train on raw vs standardized data and compare

# Raw data
model_raw = LogisticRegression(max_iter=1000)
model_raw.fit(X_train, y_train)
raw_acc = model_raw.score(X_test, y_test)

# Standardized data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_train_scaled, X_test_scaled, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3)

model_scaled = LogisticRegression(max_iter=1000)
model_scaled.fit(X_train_scaled, y_train)
scaled_acc = model_scaled.score(X_test_scaled, y_test)

print("Raw accuracy:", raw_acc)
print("Scaled accuracy:", scaled_acc)


Raw accuracy: 0.5818181818181818
Scaled accuracy: 0.5136363636363637


In [49]:
#24. Find optimal C using cross-validation

from sklearn.model_selection import cross_val_score

Cs = [0.01, 0.1, 1, 10, 100]
best_c = 0
best_score = 0

for c in Cs:
    model = LogisticRegression(C=c, max_iter=1000)
    score = cross_val_score(model, X, y, cv=5).mean()
    if score > best_score:
        best_score = score
        best_c = c

print(f"Best C: {best_c}, Best Score: {best_score}")


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logist

Best C: 0.1, Best Score: 0.22457337883959044


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


In [50]:
#25. Save and load Logistic Regression model using joblib

import joblib

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Save model
joblib.dump(model, 'logistic_model.pkl')

# Load model
loaded_model = joblib.load('logistic_model.pkl')
print("Loaded model accuracy:", loaded_model.score(X_test, y_test))


Loaded model accuracy: 0.16136363636363638
