In [None]:
import matplotlib.pyplot as plt
import numpy as np
import math
from sklearn.linear_model import LogisticRegression
import pandas as pd

## 1. Logistic regression

**Question** What is logistic regression? Does it really solve regression problems?

*Our answer*
It solves the binary classification problem (one class is the success, the other is the failure). It is based on two functions (odds and logit), from which we estimate the probability.

Logistic regression is based on the logit function, which is defined as log(odds), where odds is the ratio of the probability of "one" (p) to the probability of "zero" (1-p):

$$odds(p) = \frac{p}{1-p}$$ <br/>
$$logit(p) = ln(odds(p)) = ln(\frac{p}{1-p})$$

**Example** We draw one card at random from a full deck of cards. We define success as drawing a heart. What are the values of the probability of success, the odds and the logit in this example?
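The card example can be worked through numerically. The sketch below assumes a standard 52-card deck with 13 hearts and no jokers (the text only says "full deck"), and applies the odds and logit definitions given above.

```python
import math

# Assumption: a standard 52-card deck with 13 hearts and no jokers.
p = 13 / 52             # probability of drawing a heart
odds = p / (1 - p)      # 0.25 / 0.75 = 1/3
logit = math.log(odds)  # natural logarithm of the odds

print(f"p = {p:.4f}, odds = {odds:.4f}, logit = {logit:.4f}")
# prints: p = 0.2500, odds = 0.3333, logit = -1.0986
```

Odds below 1 (and therefore a negative logit) mean that failure is more likely than success.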

**Task** Generate in Python two plots visualising the relation between:
* probability and odds
* probability and logit

What is the value range of these 3 functions (probability, odds and logit)?

In [None]:
probabilities = np.linspace(0.01, 0.99, 100)
odds = probabilities / (1 - probabilities)
logit = np.log(odds)


plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)
plt.plot(probabilities, odds, label="Odds", color='b')
plt.title("Probability and Odds")
plt.xlabel("Probability")
plt.ylabel("Odds")
plt.grid(True)

plt.subplot(1, 2, 2)
plt.plot(probabilities, logit, label="Logit", color='r')
plt.title("Probability and Logit")
plt.xlabel("Probability")
plt.ylabel("Logit")
plt.grid(True)

plt.tight_layout()
plt.show()


## 2. Simple logistic regression
Let's start with simple logistic regression - with only one independent variable. We will work on the example from the English Wikipedia (https://en.wikipedia.org/wiki/Logistic_regression). We want to build a logistic regression model for estimating the probability of passing an exam based on the time spent studying:


|Hours|0.50|0.75|1.00|1.25|1.50|1.75|1.75|2.00|2.25|2.50|2.75|3.00|3.25|3.50|4.00|4.25|4.50|4.75|5.00|5.50|
|-----|-----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
|Pass |0    |0   |0   |0   |0   |0   |1   |0   |1   |0   |1   |0   |1   |0   |1   |1   |1   |1   |1   |1   |

1 - student passed, 0 - student failed

In [None]:
hours = np.array([0.5, 0.75, 1, 1.25, 1.5, 1.75, 1.75, 2, 2.25, 2.5, 2.75, 3, 3.25, 3.5, 4, 4.25, 4.5, 4.75, 5, 5.5])
passed = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1])
plt.figure(figsize=(12, 6))
plt.scatter(hours, passed, color='r', label="Passed (1) or Failed (0)")
plt.show()

Let's use the scatter plot to visualize our data. Can you intuitively find a threshold, i.e. the amount of study time after which students usually pass the test?

*Our answer*

Studying for more than 2.5 h gives a good chance of passing, and studying for more than 4 h gives (almost) certainty.

We define the logit function as a linear combination of the independent variables. In the simple regression case we have only one variable x:

$$logit(p) = ln(\frac{p}{1-p}) = \beta_0 + \beta_1\cdot x,$$
so the estimated odds value is equal to:
$$\frac{p}{1-p} = e^{\beta_0 + \beta_1\cdot x}.$$

The $\beta$ coefficients are estimated with the maximum likelihood estimation method (https://online.stat.psu.edu/stat415/lesson/1/1.2).
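To make the maximum likelihood step less abstract, here is a minimal sketch that maximizes the log-likelihood of the study-hours data with Newton's method in plain NumPy. The choice of Newton's method, the zero starting point and the fixed iteration count are assumptions made for illustration, not the procedure any particular library uses. Note also that scikit-learn's `LogisticRegression` applies L2 regularization by default, so its coefficients will generally differ somewhat from these unregularized estimates.

```python
import numpy as np

hours = np.array([0.5, 0.75, 1, 1.25, 1.5, 1.75, 1.75, 2, 2.25, 2.5,
                  2.75, 3, 3.25, 3.5, 4, 4.25, 4.5, 4.75, 5, 5.5])
passed = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1])

# Design matrix: a column of ones (for beta_0) and the study hours (for beta_1).
X = np.column_stack([np.ones_like(hours), hours])
beta = np.zeros(2)  # assumed starting point

for _ in range(25):  # assumed fixed number of Newton iterations
    p = 1 / (1 + np.exp(-X @ beta))               # current predicted probabilities
    gradient = X.T @ (passed - p)                 # gradient of the log-likelihood
    hessian = X.T @ (X * (p * (1 - p))[:, None])  # negative Hessian of the log-likelihood
    beta = beta + np.linalg.solve(hessian, gradient)

# Gradient at the final estimate; it should be close to zero at the maximum.
p = 1 / (1 + np.exp(-X @ beta))
final_gradient = X.T @ (passed - p)

print("beta_0, beta_1:", beta)
print("gradient norm:", np.linalg.norm(final_gradient))
```

A gradient norm close to zero indicates that the iteration has reached a stationary point of the log-likelihood.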

**Task** Use `LogisticRegression` from scikit-learn to obtain the regression coefficients in our problem.

In [None]:
model = LogisticRegression()
model.fit(hours.reshape(-1, 1), passed)

intercept = model.intercept_[0]
coefficient = model.coef_[0][0]

print(f"Intercept: {intercept}")
print(f"Coefficient: {coefficient}")

**Question** Having the coefficients, how can we determine the probability that a student who studied for 4.25h will pass the test?

*Yes, we even have the formula for it given above. And the function is implemented below!*
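The formula above gives the odds rather than the probability itself, so one more step is needed. Writing $z = \beta_0 + \beta_1\cdot x$ as a shorthand (the symbol $z$ is introduced here only for brevity), solving the odds equation for $p$ gives:

$$\frac{p}{1-p} = e^{z} \;\Rightarrow\; p = (1-p)\cdot e^{z} \;\Rightarrow\; p\cdot(1 + e^{z}) = e^{z}$$
$$p = \frac{e^{z}}{1 + e^{z}} = \frac{1}{1 + e^{-z}} = \frac{1}{1 + e^{-(\beta_0 + \beta_1\cdot x)}}$$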

**Task** Write the body of a function which calculates this probability given the $\beta_0$ and $\beta_1$ coefficients and the study time (x). Using this function and the coefficients obtained in the previous task, answer the previous question (the probability of passing the test after 4.25 hours of studying). <br/>
_Tip: np.exp() function can be useful_

In [None]:
def calculate_probability(beta_0, beta_1, x):
    return 1 / (1 + np.exp(-(beta_0 + beta_1 * x)))

# Use the regression coefficients obtained in the previous task
calculate_probability(intercept, coefficient, 4.25)

**Task** Let's visualize the probability function p(studying hours): to your previous scatter plot, add a line showing the relation between studying time and the probability of passing. Can you see something interesting about this function? Do you know what this type of function is called?

*A sigmoid (sigmoid function): its shape resembles a stretched letter S.*

In [None]:
x_values = np.linspace(0, 6, 100)
y_values = calculate_probability(intercept, coefficient, x_values)

plt.figure(figsize=(12, 6))
plt.scatter(hours, passed, color='red', label='Observed Data (Pass=1, Fail=0)')
plt.plot(x_values, y_values, color='blue', label='Logistic Function (Probability)', linewidth=2)
plt.title("Probability of Passing vs Study Hours")
plt.xlabel("Study hours")
plt.ylabel("Probability of passing")
plt.legend()  # show the labels defined above
plt.show()

Let's check whether the probability which we calculated is the same as the one returned by our fitted logistic regression model from scikit-learn. Use the predict_proba function to find the probability of passing after studying for 4.25h. <br/>
_Tip: predict_proba requires 2D array or list of lists as input, so [[4.25]] should be passed_

In [None]:
study_time = np.array([[4.25]])
probabilities = model.predict_proba(study_time)
print(probabilities)

Sometimes we do not need the probabilities, just the more probable class (whether the student will pass or not). To obtain this class, use the predict function on the model:

In [None]:
prediction = model.predict(study_time)
print(prediction)  # 1 means passed, 0 means failed
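The relation between `predict` and `predict_proba` can be checked directly. The sketch below refits the study-hours model so that it is self-contained, and compares `predict` with thresholding the probability of class 1 at 0.5. The evaluation points in `grid` are arbitrary values chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

hours = np.array([0.5, 0.75, 1, 1.25, 1.5, 1.75, 1.75, 2, 2.25, 2.5,
                  2.75, 3, 3.25, 3.5, 4, 4.25, 4.5, 4.75, 5, 5.5])
passed = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1])

model = LogisticRegression().fit(hours.reshape(-1, 1), passed)

# Arbitrary study times chosen for illustration (2D input, one column).
grid = np.array([[1.0], [2.0], [3.0], [4.25]])

p_pass = model.predict_proba(grid)[:, 1]            # column 1 holds P(class 1)
labels_from_threshold = (p_pass > 0.5).astype(int)  # pick class 1 when P > 0.5
labels_from_predict = model.predict(grid)

print(np.column_stack([grid.ravel(), p_pass, labels_from_predict]))
```

For a binary problem like this one, the two label arrays should agree, since the more probable class is the one whose probability exceeds 0.5.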

## 3. Multiple logistic regression

A more common real-world case is multiple logistic regression, where the value of a binary class depends on more than one factor. In this part we will work on a dataset describing whether a given person is diabetic or not.

Let's start with reading the data from "diabetes_scaled.csv" file. The data have already been scaled.

In [None]:
data = pd.read_csv("diabetes_scaled.csv")
data.head()

First, we add the intercept column (value equal to 1 for all rows), just to make further analysis easier.

In [None]:
# Add a constant column of ones
data['intercept'] = 1
data.head()

Next, we divide our data into train and test sets. We hold out 100 cases as the test set (the code below samples them at random with a fixed `random_state`); the remaining ones will be the training examples.

In [None]:
test_data = data.sample(n=100, random_state=0)  # Random 100 rows
train_data = data.drop(test_data.index)

Use the scikit-learn implementation of logistic regression to build the model on the training set. What are the obtained regression coefficients?

In [None]:
X_train = train_data.drop(columns=['Class'])
y_train = train_data['Class']
model = LogisticRegression()
model.fit(X_train, y_train)
intercept = model.intercept_[0]
coefficients = model.coef_[0]
print(intercept)
print(coefficients)
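With several independent variables, the logit becomes $\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$, and the probability is again the sigmoid of that sum. The sketch below checks this against `predict_proba`. Because "diabetes_scaled.csv" is not available here, it uses synthetic data from `make_classification` as a stand-in; the sample size and number of features are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the diabetes data (assumed sizes, for illustration only).
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

model = LogisticRegression().fit(X, y)

# Linear combination of the features plus the intercept: the estimated logit.
z = model.intercept_[0] + X @ model.coef_[0]
p_manual = 1 / (1 + np.exp(-z))          # sigmoid of the logit
p_sklearn = model.predict_proba(X)[:, 1]  # probability of class 1

print(np.allclose(p_manual, p_sklearn))
```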

Now, evaluate your model on the test examples. Classify the test examples, find the probability of diabetes for each test case, and compute the overall accuracy score.

In [None]:
from sklearn.metrics import accuracy_score

X_test = test_data.drop(columns=['Class'])
y_test = test_data['Class']
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)

The accuracy score is 0.85, although this may be inflated or deflated by the `random_state` that determines which rows end up in the test and training sets ;)
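One way to see how much the score depends on the split is to repeat the train/test split with several seeds and look at the spread of accuracies. The sketch below does this on synthetic data from `make_classification`, used as a stand-in because "diabetes_scaled.csv" is not available here. The dataset size, the number of features and the ten seeds are arbitrary assumptions, and `train_test_split` replaces the `sample`/`drop` approach used above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the diabetes data (assumed sizes, for illustration only).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

scores = []
for seed in range(10):  # ten different random splits
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=100, random_state=seed  # hold out 100 rows each time
    )
    model = LogisticRegression().fit(X_tr, y_tr)
    scores.append(accuracy_score(y_te, model.predict(X_te)))

print(f"mean = {np.mean(scores):.3f}, std = {np.std(scores):.3f}")
print(f"min = {np.min(scores):.3f}, max = {np.max(scores):.3f}")
```

The gap between the minimum and maximum gives a rough idea of how far a single-split accuracy can drift purely because of which rows land in the test set.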