# 2.1.2.3. LOGISTIC REGRESSION
## DEFINITION
**Logistic regression** is a supervised machine learning algorithm used for classification. It is a **statistical method that models the probability of an event occurring as a function of a set of independent variables.**

## DATASET
Logistic regression is a supervised learning algorithm, which means it requires labeled data for training. The data consists of two types of variables:

- **Dependent variable**: The target variable that we want to predict. It is a categorical variable; in the most common (binary) case it takes one of two values, such as yes or no, or spam or not spam.
- **Independent variables**: The input variables that influence the outcome of the dependent variable. They can be numerical or categorical. Examples include age, gender, income, and education.

## GOAL
The goal of logistic regression is to find the coefficients that best describe the relationship between the independent variables and the probability of each outcome. Unlike linear regression, which models the target directly as a linear function of the inputs, logistic regression passes a linear combination of the inputs through a sigmoid function (also called a logistic function). The sigmoid function is an S-shaped curve that maps any real value to a value between 0 and 1, so its output can be read as a probability.
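As a minimal sketch of the idea above, the snippet below implements the sigmoid in plain Python and applies it to a weighted sum of inputs. The helper names `sigmoid` and `predict_probability`, as well as the example weights and bias, are illustrative assumptions and do not come from any library.

```python
import math


def sigmoid(z: float) -> float:
    """Map a real number z to a value between 0 and 1."""
    # Branch on the sign of z so math.exp never receives a large positive
    # argument, which would raise OverflowError.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


def predict_probability(features, weights, bias):
    """Probability of class 1: sigmoid of the linear combination of inputs."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical coefficients, chosen only for illustration.
weights = [0.8, -0.4]
bias = -0.2

print(sigmoid(0.0))  # 0.5
print(predict_probability([1.0, 2.0], weights, bias))
```

A linear combination of exactly zero maps to a probability of 0.5; larger positive values move the output toward 1 and larger negative values move it toward 0.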

## CLASSIFICATION PROCESS
The logistic regression model can be used to make predictions by plugging in the values of the independent variables and calculating the probability of the event occurring. If the probability is greater than 0.5, we predict that the event will occur (class 1). If the probability is less than or equal to 0.5, we predict that the event will not occur (class 0).
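The decision rule described above can be sketched in a few lines. The function name `classify` and the probability values are hypothetical; the 0.5 cutoff follows the text, and the second call shows how a stricter threshold changes the predicted labels.

```python
def classify(probabilities, threshold=0.5):
    """Return 1 where the probability exceeds the threshold, otherwise 0."""
    return [1 if p > threshold else 0 for p in probabilities]


# Hypothetical predicted probabilities for four observations.
probs = [0.12, 0.50, 0.51, 0.97]

print(classify(probs))                 # [0, 0, 1, 1]
print(classify(probs, threshold=0.9))  # [0, 0, 0, 1]
```

Note that a probability of exactly 0.5 is assigned to class 0, matching the "less than or equal to 0.5" rule.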

To find the values of the coefficients that best fit the training data, we use a method called maximum likelihood estimation (MLE). MLE chooses the coefficients that maximize the likelihood function, which measures how probable the observed labels are under the model. Maximizing the likelihood is equivalent to minimizing the log loss (cross-entropy) between the predicted probabilities and the actual outcomes.
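To make maximum likelihood estimation concrete, here is a rough sketch that fits the coefficients by gradient descent on the average negative log-likelihood. Gradient descent is only one possible optimizer and is an assumption of this sketch; libraries such as scikit-learn use other solvers. The synthetic data, the "true" coefficients, the learning rate, and all function names are made up for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def negative_log_likelihood(w, b, X, y):
    """Negative log-likelihood of labels y under the logistic model."""
    p = sigmoid(X @ w + b)
    eps = 1e-12  # guards against log(0)
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))


def fit_by_gradient_descent(X, y, learning_rate=0.1, n_steps=2000):
    """Minimize the average negative log-likelihood with plain gradient descent."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_steps):
        error = sigmoid(X @ w + b) - y  # gradient of the loss w.r.t. the linear term
        w -= learning_rate * (X.T @ error) / n_samples
        b -= learning_rate * error.mean()
    return w, b


# Synthetic data drawn from a logistic model with made-up coefficients.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_w, true_b = np.array([2.0, -1.0]), 0.5
y = (rng.random(200) < sigmoid(X @ true_w + true_b)).astype(float)

w, b = fit_by_gradient_descent(X, y)
print("NLL at zero coefficients:", negative_log_likelihood(np.zeros(2), 0.0, X, y))
print("NLL after fitting:       ", negative_log_likelihood(w, b, X, y))
print("Estimated coefficients:  ", w, b)
```

The negative log-likelihood after fitting should be lower than at the all-zero starting point, which is the same as saying the likelihood has increased.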

## ADVANTAGES
- It is easy to implement and interpret.
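- It can handle both numerical and categorical variables, provided the categorical ones are encoded numerically (for example, with one-hot encoding).
- It has relatively few parameters, so it can perform well with a modest number of observations when the number of features is small.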
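As an illustration of mixing numerical and categorical inputs, the following sketch one-hot encodes a categorical column and scales a numerical one before fitting. The tiny data frame, its column names, and its values are entirely hypothetical; the preprocessing classes are standard scikit-learn components, but this is only one reasonable way to arrange them.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one numerical and one categorical feature.
data = pd.DataFrame({
    "age": [22, 35, 47, 51, 29, 62, 41, 38],
    "education": ["school", "college", "college", "graduate",
                  "school", "graduate", "college", "school"],
    "purchased": [0, 0, 1, 1, 0, 1, 1, 0],
})
features = data[["age", "education"]]
target = data["purchased"]

# Scale the numerical column and one-hot encode the categorical column.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["education"]),
])

model = make_pipeline(preprocess, LogisticRegression())
model.fit(features, target)

# One row per observation, one column per class (0 and 1).
probs = model.predict_proba(features)
print(probs.round(3))
```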

## DISADVANTAGES
- It assumes a linear relationship between the independent variables and the log-odds of the event occurring.
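- It may overfit when there are many features relative to the number of observations, and underfit when the features cannot capture the true decision boundary.
- Highly correlated (multicollinear) independent variables can make the coefficient estimates unstable and hard to interpret.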
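One common way to limit overfitting and to stabilize coefficients under multicollinearity is regularization. The sketch below, which assumes scikit-learn's default L2 penalty, compares the size of the fitted coefficients for a strong and a weak setting of the `C` parameter (smaller `C` means stronger regularization). The helper name `coefficient_norm` and the chosen `C` values are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)


def coefficient_norm(C):
    """Fit a scaled logistic regression and return the L2 norm of its coefficients."""
    model = make_pipeline(StandardScaler(), LogisticRegression(C=C, max_iter=5000))
    model.fit(X, y)
    return np.linalg.norm(model.named_steps["logisticregression"].coef_)


strong_norm = coefficient_norm(C=0.01)  # strong regularization
weak_norm = coefficient_norm(C=100.0)   # weak regularization

print("Coefficient norm with C=0.01:", strong_norm)
print("Coefficient norm with C=100: ", weak_norm)
```

Stronger regularization shrinks the coefficients toward zero, trading a little bias for lower variance.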

## APPLICATIONS
- **Medical diagnosis**: Logistic regression can be used to diagnose diseases based on symptoms and test results.
- **Customer segmentation**: Logistic regression can be used to classify customers into predefined groups based on their demographics, preferences, behavior, etc.
- **Fraud detection**: Logistic regression can be used to detect fraudulent transactions based on patterns and anomalies in the data.
- **Natural language processing**: Logistic regression can be used to perform tasks such as sentiment analysis, text classification, part-of-speech tagging, etc.

## CONCLUSION
In conclusion, logistic regression is a simple, interpretable, and widely used tool for solving classification problems in machine learning. Its coefficients describe how each independent variable shifts the log-odds of the outcome, which makes the model easy to explain. Its main limitations are the assumption of a linear relationship with the log-odds and its sensitivity to multicollinearity and to the choice of features, so these need to be checked carefully. Logistic regression is used across many domains that require prediction and decision making.

## HANDS-ON: LOGISTIC REGRESSION

### 1. IMPORTS

In [1]:
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import warnings
# Suppress all warnings; note that this also hides any ConvergenceWarning raised by the solver.
warnings.filterwarnings("ignore")

### 2. DATASET

In [2]:
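# Load the breast cancer dataset: X holds the features, y the binary labels (0 = malignant, 1 = benign).
breast_cancer = load_breast_cancer()
X = breast_cancer.data
y = breast_cancer.target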

### 3. PREPROCESSING

In [3]:
# Hold out 30% of the samples for testing; random_state fixes the split for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

### 4. LOGISTIC REGRESSION MODEL

In [4]:
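# Fit a logistic regression classifier with default settings on the training split.
clf = LogisticRegression()
clf.fit(X_train, y_train)
# Predict class labels for the held-out test split.
y_pred = clf.predict(X_test)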
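The features in this dataset are on very different scales, and with default settings the solver may not fully converge on unscaled data. As an optional variation (not part of the original notebook), the sketch below standardizes the features inside a pipeline before fitting. The variable names `scaled_clf` and `scaled_acc` are illustrative, and the resulting accuracy may differ from the value reported in the next section.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# The scaler is fitted on the training split only, then applied to the test split,
# so no information from the test data leaks into training.
scaled_clf = make_pipeline(StandardScaler(), LogisticRegression())
scaled_clf.fit(X_train, y_train)

scaled_acc = accuracy_score(y_test, scaled_clf.predict(X_test))
print("Accuracy with scaling:", scaled_acc)
```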

### 5. ACCURACY SCORE

In [5]:
acc = accuracy_score(y_test, y_pred)
print("Accuracy:", acc)

Accuracy: 0.9707602339181286
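Accuracy alone does not show which kinds of errors the model makes. As a hedged extension of the evaluation step, the sketch below prints a confusion matrix and a per-class report. It refits a model so that it runs on its own, and it assumes a scaled pipeline rather than the exact classifier above, so its numbers need not match the accuracy shown.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

breast_cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    breast_cancer.data, breast_cancer.target, test_size=0.3, random_state=42
)

model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes (0 = malignant, 1 = benign).
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Precision, recall, and F1 score for each class.
print(classification_report(y_test, y_pred, target_names=breast_cancer.target_names))
```

In a medical setting, the off-diagonal cells matter most: a malignant case predicted as benign is usually a costlier error than the reverse.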

