#Logistic Regression

* It is a classification algorithm
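* It passes a linear combination of the features through the sigmoid function, which outputs a probability between 0 and 1; probabilities above a chosen threshold (commonly 0.5) are classified as 1 and the rest as 0
* It works best when the classes are (approximately) linearly separable, i.e. they can be separated by a line (or, with more than two features, a hyperplane)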
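
To make the sigmoid-and-threshold idea concrete, here is a minimal sketch. The scores and the 0.5 threshold are assumptions chosen for illustration; they are not taken from a fitted model.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear scores (w*x + b) for five inputs
scores = np.array([-4.0, -1.0, 0.5, 1.0, 4.0])
probabilities = sigmoid(scores)

# Assumed decision threshold of 0.5: probabilities at or above it become class 1
threshold = 0.5
labels = (probabilities >= threshold).astype(int)

print(probabilities.round(3))
print(labels)
```

Negative scores map to probabilities below 0.5 (class 0) and positive scores to probabilities above 0.5 (class 1).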

Pros:
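* It is less prone to overfitting than more complex algorithms.
* In principle it can be fitted on unscaled features. In practice, scikit-learn's LogisticRegression applies L2 regularization by default, so standardizing features that sit on very different scales usually helps.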

Cons:
* It may not perform well when the relationship between the features and the target is non-linear.
* It works best with numerical features and may require encoding or transformation of categorical variables.

Example:

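Consider a dataset with two features: "Age" (ranging from 0 to 100) and "Income" (ranging from 10,000 to 1,000,000). Logistic regression can be fitted on these features as they are, and the estimated coefficients are then expressed in the original units of each feature. Because the two ranges differ by several orders of magnitude, however, a regularized model fitted with an iterative solver (scikit-learn's default) may converge slowly or penalize the features unevenly, so standardizing them first is usually the safer choice.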
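
A minimal sketch of the Age/Income scenario follows. The data is synthetic and the labelling rule is invented purely so the example has a target; `StandardScaler` and `make_pipeline` are one common way to standardize before fitting, not the only one.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, purely illustrative data: Age in [0, 100], Income in [10,000, 1,000,000]
rng = np.random.default_rng(0)
age = rng.uniform(0, 100, size=200)
income = rng.uniform(10_000, 1_000_000, size=200)
features = np.column_stack([age, income])

# Hypothetical labelling rule, invented only so the example has a target
target = (age / 100 + income / 1_000_000 > 1.0).astype(int)

# Standardize each feature to zero mean and unit variance, then fit the model
scaled_model = make_pipeline(StandardScaler(), LogisticRegression())
scaled_model.fit(features, target)

new_people = np.array([[25, 40_000], [60, 900_000]])
predictions = scaled_model.predict(new_people)
print(predictions)
```

Keeping the scaler inside the pipeline means new inputs are scaled with the statistics learned from the training data.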

In [1]:
import numpy as np
from sklearn.linear_model import LogisticRegression

# One input feature, reshaped to a column; each row maps to one output_y value
input_feature_x = np.array([4, 12, 16, 20, 24, 36, 40]).reshape(-1, 1)
output_y = np.array([0, 0, 0, 1, 0, 1, 1])

logistic_model = LogisticRegression()
logistic_model.fit(input_feature_x, output_y)  # independent features, target values

# Predict the class for the input values 13 and 16
prediction_array = np.array([13, 16]).reshape(-1, 1)
logistic_model.predict(prediction_array)

array([0, 0])
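
The class labels printed above come from probabilities produced by the sigmoid. As a sketch, `predict_proba` can be used to inspect them; the data is the same toy data as in the cell above, refitted here so the snippet stands alone.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Same toy data as the cell above
input_feature_x = np.array([4, 12, 16, 20, 24, 36, 40]).reshape(-1, 1)
output_y = np.array([0, 0, 0, 1, 0, 1, 1])

logistic_model = LogisticRegression()
logistic_model.fit(input_feature_x, output_y)

prediction_array = np.array([13, 16]).reshape(-1, 1)

# Each row is [P(class 0), P(class 1)] for one input
probabilities = logistic_model.predict_proba(prediction_array)
print(probabilities)

# predict() returns the class with the higher probability
most_likely = logistic_model.classes_[probabilities.argmax(axis=1)]
print(most_likely)
```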

#Naive Bayes
* It is based on Bayes’s theorem.
* It assumes that the features are conditionally independent of one another, given the class
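
Written out, the two points above look roughly as follows. The symbols are introduced here for illustration: $y$ is the class and $x_1, \dots, x_n$ are the feature values.

```latex
\begin{align*}
P(y \mid x_1, \dots, x_n)
  &= \frac{P(y)\, P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)}
  && \text{(Bayes' theorem)} \\
P(x_1, \dots, x_n \mid y)
  &= \prod_{i=1}^{n} P(x_i \mid y)
  && \text{(independence assumption)} \\
\hat{y}
  &= \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
  && \text{(predicted class)}
\end{align*}
```

The denominator is the same for every class, so it can be dropped when choosing the most likely class.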

Pros:

* It trains and predicts very quickly.
* It handles multi-class prediction problems naturally.
* It can outperform other models when training data is limited, provided the independence assumption holds.
* It has few hyperparameters, making it easy to implement and tune.

Cons:

* It assumes that all the features are independent of one another given the class. Truly independent features are rare in real-world datasets, so this assumption is usually unrealistic.

Example:

In a sentiment analysis task where the goal is to classify movie reviews as positive or negative, a Naive Bayes classifier can count the occurrences of different words in the reviews and estimate the likelihood of a review being positive or negative from the observed word frequencies in each class.
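
A minimal sketch of this word-count approach is shown below. The six reviews are invented for illustration, and `MultinomialNB` with `CountVectorizer` is one reasonable choice for count features; it is an assumption here, not something prescribed by the task.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical miniature review set, invented for illustration only
reviews = [
    "great acting and a great story",
    "wonderful film, loved it",
    "loved the story, great fun",
    "boring plot and terrible acting",
    "terrible film, hated it",
    "hated the boring story",
]
sentiments = ["positive", "positive", "positive",
              "negative", "negative", "negative"]

# CountVectorizer turns each review into word counts;
# MultinomialNB estimates per-class word likelihoods from those counts
sentiment_model = make_pipeline(CountVectorizer(), MultinomialNB())
sentiment_model.fit(reviews, sentiments)

new_reviews = ["a great and wonderful story", "boring and terrible"]
predicted_sentiments = sentiment_model.predict(new_reviews)
print(predicted_sentiments)
```

Words seen mostly in positive reviews ("great", "wonderful") pull a new review towards the positive class, and vice versa.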

In [2]:
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Six 2-D points: three in class 1, three in class 2
input_x = np.array([[-2, -1], [-3, -1], [-4, -3], [1, 1], [3, 2], [6, 4]])
output_y = np.array([1, 1, 1, 2, 2, 2])

nb_classifier = GaussianNB()
nb_classifier.fit(input_x, output_y)

prediction_array = [[8, 5]]
nb_classifier.predict(prediction_array)  # the classifier assigns this input to class label 2

array([2])

In [3]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
X, y = load_iris(return_X_y=True) #returns the feature matrix X and the target vector y
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=142) #used 25% of data for testing
nb_classifier = GaussianNB() #assumes each feature follows a Gaussian (normal) distribution within each class
nb_classifier.fit(X_train, y_train) #train the classifier using the training data
prediction_results = nb_classifier.predict(X_test)  #predictions on the test data
print(prediction_results) #prints the array of predicted class labels

[0 1 1 2 1 1 0 0 2 1 1 1 2 0 1 0 2 1 1 2 2 1 0 1 2 1 2 2 0 1 2 1 2 1 2 2 1
 2]


#Performance Metrics

* Confusion Matrix
  * TP, TN, FP (Type-1 error), FN (Type-2 error)
  * It is particularly useful when dealing with imbalanced classes
  * It can be visually represented as a heatmap
* Precision
  * focuses on Type-1 error only
  * range is 0 to 1
  * tells the correctness of positive predictions
  * TP/(TP+FP)
  * a high precision score means few positive predictions are wrong (low FP); on its own it does not show that the model performs well, since recall may still be low
  * a low precision score can result from an imbalanced dataset or from poorly tuned model hyperparameters
* Recall
  * focuses on Type-2 error only
  * TP/(TP+FN)
  * high recall means low FN
  * recall can be low for the same reasons as precision: an imbalanced dataset or poorly tuned hyperparameters
* Accuracy
  * correct predictions/total no. of predictions
  * (TP+TN)/(TP+FP+TN+FN)
* F1 Score
  * combines both precision and recall (harmonic mean of both)
  * 2 * (precision * recall)/(precision+recall)
  * a high F1 score means precision and recall are both high
  * a low F1 score alone does not show whether precision, recall, or both are low
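
The point about imbalanced classes can be illustrated with a small sketch. The labels are hypothetical: 95 negatives, 5 positives, and a degenerate "model" that always predicts the majority class.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical imbalanced data: 95 negatives and 5 positives
actual_labels = [0] * 95 + [1] * 5

# A degenerate "model" that always predicts the majority class
predicted_labels = [0] * 100

accuracy = accuracy_score(actual_labels, predicted_labels)
# zero_division=0 avoids a warning when there are no positive predictions
precision = precision_score(actual_labels, predicted_labels, zero_division=0)
recall = recall_score(actual_labels, predicted_labels)

print("Accuracy:", accuracy)    # 0.95, although no positive case is ever found
print("Precision:", precision)  # 0.0, there are no positive predictions at all
print("Recall:", recall)        # 0.0, all 5 positives are missed
```

Accuracy alone looks excellent here, which is why precision, recall, and the confusion matrix are worth checking on imbalanced data.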

In [4]:
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix, accuracy_score
actual_labels = [1, 0, 1, 0, 1, 0, 0, 1, 1, 0] # Actual labels of the data
predicted_labels = [1, 0, 1, 0, 0, 1, 1, 1, 0, 1] # Predicted labels from a model

cm = confusion_matrix(actual_labels, predicted_labels) # Calculate confusion matrix
precision = precision_score(actual_labels, predicted_labels)
recall = recall_score(actual_labels, predicted_labels)
f1 = f1_score(actual_labels, predicted_labels)
accuracy = accuracy_score(actual_labels, predicted_labels)


print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
print("Accuracy:", accuracy) #model correctly predicts the labels for 50% of the instances.
print("Confusion Matrix:")
print(cm) #rows are actual classes, columns are predicted classes: [[TN, FP], [FN, TP]]

Precision: 0.5
Recall: 0.6
F1 Score: 0.5454545454545454
Accuracy: 0.5
Confusion Matrix:
[[2 3]
 [2 3]]
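
To connect the printed scores to the formulas listed earlier, the sketch below unpacks the confusion matrix for the same labels and recomputes each metric by hand. It relies on scikit-learn's binary layout `[[TN, FP], [FN, TP]]`.

```python
from sklearn.metrics import confusion_matrix

actual_labels = [1, 0, 1, 0, 1, 0, 0, 1, 1, 0]
predicted_labels = [1, 0, 1, 0, 0, 1, 1, 1, 0, 1]

# For binary labels, scikit-learn orders the matrix as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(actual_labels, predicted_labels).ravel()

precision = tp / (tp + fp)                          # TP/(TP+FP)
recall = tp / (tp + fn)                             # TP/(TP+FN)
accuracy = (tp + tn) / (tp + fp + tn + fn)          # (TP+TN)/(TP+FP+TN+FN)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("Precision:", precision)
print("Recall:", recall)
print("Accuracy:", accuracy)
print("F1 Score:", f1)
```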
