# Logistic regression with Scikit-Learn

Start by importing `LogisticRegression` from the `sklearn.linear_model` module and creating a `LogisticRegression` object.

```py
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
```

We then train (or `fit`) our model using the `.fit()` method, which takes two parameters. The first is a matrix of features (one row per sample, one column per feature), and the second is a vector of class labels (one label per sample).

```py
model.fit(features, labels)
```
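
As a minimal sketch of the input shapes `.fit()` expects, the snippet below fits a model on a tiny made-up dataset. The `features` and `labels` arrays and their values are illustrative assumptions, not data from this lesson.

```py
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: one feature (hours studied) for six students
features = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])  # shape (6, 1)
# One label per sample: 1 = passed, 0 = failed
labels = np.array([0, 0, 0, 1, 1, 1])                            # shape (6,)

model = LogisticRegression()
model.fit(features, labels)

print(features.shape, labels.shape)  # (6, 1) (6,)
print(model.classes_)                # [0 1]
```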

When we fit the model, `sklearn` runs an iterative, gradient-based optimizer (the `lbfgs` solver by default) that repeatedly updates the coefficients of our model in order to minimize the log-loss, plus a regularization penalty.
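
To illustrate the idea of repeatedly updating coefficients to reduce the log-loss, here is a minimal NumPy sketch of plain gradient descent. It is not scikit-learn's actual optimizer (its default solver is `lbfgs`, with regularization); the data, learning rate, and number of iterations are assumptions chosen for illustration.

```py
import numpy as np

def sigmoid(z):
    """Map log-odds to probabilities."""
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, p):
    """Average negative log-likelihood; clip to avoid log(0)."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical, already-scaled single feature and binary labels
X = np.array([[-1.5], [-1.0], [-0.5], [0.5], [1.0], [1.5]])
y = np.array([0, 0, 1, 0, 1, 1])

coef = np.zeros(X.shape[1])
intercept = 0.0
learning_rate = 0.1  # assumed step size

losses = []
for _ in range(200):
    p = sigmoid(X @ coef + intercept)
    losses.append(log_loss(y, p))
    error = p - y
    # Gradient of the mean log-loss with respect to coef and intercept
    coef -= learning_rate * (X.T @ error) / len(y)
    intercept -= learning_rate * error.mean()

print(losses[0], losses[-1])  # the loss shrinks as the coefficients are updated
```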

Once trained, we can access the `coef_` attribute, an array holding one coefficient per feature (for binary classification its shape is `(1, n_features)`), and the `intercept_` attribute, which holds the intercept, the `b_0` value.
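
A short sketch of what these attributes look like, assuming a small made-up dataset with two features. It also checks that the log-odds computed by hand from `coef_` and `intercept_` match the model's `decision_function()` output.

```py
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: eight samples, two features
features = np.array([[0.2, 1.0], [0.5, 0.0], [1.0, 1.0], [1.5, 0.0],
                     [2.0, 2.0], [2.5, 1.0], [3.0, 3.0], [3.5, 2.0]])
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression().fit(features, labels)

print(model.coef_.shape)       # (1, 2): one row, one column per feature
print(model.intercept_.shape)  # (1,)

# Log-odds computed by hand: features times coefficients, plus the intercept
log_odds = features @ model.coef_[0] + model.intercept_[0]
print(np.allclose(log_odds, model.decision_function(features)))  # True
```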

We can also predict whether new data points belong to the positive class using the `.predict()` method. It takes a matrix of features as a parameter and returns a vector of labels, 1 or 0, one for each sample. For binary classification, `sklearn` uses a classification threshold of 0.5.

```py
model.predict(features)
```
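
The 0.5 threshold can be checked with a small sketch: thresholding the positive-class probabilities at 0.5 should reproduce the labels returned by `.predict()`. The data below is made up for illustration.

```py
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: one feature, six samples
features = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
labels = np.array([0, 0, 0, 1, 1, 1])
model = LogisticRegression().fit(features, labels)

# Hypothetical new samples to classify
new_samples = np.array([[0.8], [2.0], [3.8]])
predicted = model.predict(new_samples)

# Apply the 0.5 cutoff to the positive-class probability by hand
positive_proba = model.predict_proba(new_samples)[:, 1]
thresholded = (positive_proba > 0.5).astype(int)

print(np.array_equal(predicted, thresholded))  # True
```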

If we are more interested in the predicted probability of the data samples belonging to the positive class than in the predicted class itself, we can use the `.predict_proba()` method. It takes a matrix of features as a parameter and returns an array with one row per sample and one column per class. Each value is a probability between 0 and 1, and for binary classification the second column holds the probability of the positive class.

```py
model.predict_proba(features)
```
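
A minimal sketch of the shape of the `.predict_proba()` output, using made-up data. The columns follow the order of `model.classes_`, so column 1 is the positive class here.

```py
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: one feature, six samples
features = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
labels = np.array([0, 0, 0, 1, 1, 1])
model = LogisticRegression().fit(features, labels)

probabilities = model.predict_proba(features)
print(probabilities.shape)  # (6, 2): one column per class in model.classes_
print(model.classes_)       # [0 1]

# Each row sums to 1; column 1 is the probability of the positive class
positive_proba = probabilities[:, 1]
print(np.allclose(probabilities.sum(axis=1), 1.0))  # True
```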
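
`sklearn`'s Logistic Regression implementation applies regularization by default, so feature data should be normalized before fitting. Otherwise the penalty affects features differently depending on the scale of their values.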
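
One common way to normalize features is `StandardScaler` from `sklearn.preprocessing`. The sketch below assumes made-up raw features on very different scales; the lesson's own `exam` data is already scaled, so this step is shown only for illustration.

```py
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical raw features on very different scales:
# hours studied and a practice-test score out of 1000
raw_features = np.array([[1.0, 200.0], [2.0, 450.0], [3.0, 300.0],
                         [4.0, 800.0], [5.0, 650.0], [6.0, 900.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

# Rescale each column to mean 0 and standard deviation 1
scaler = StandardScaler()
features_scaled = scaler.fit_transform(raw_features)

model = LogisticRegression().fit(features_scaled, labels)

# New data must be transformed with the same fitted scaler before predicting
new_scaled = scaler.transform(np.array([[3.5, 500.0]]))
print(model.predict(new_scaled))
```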


```py
import numpy as np
from sklearn.linear_model import LogisticRegression
from exam import (
    hours_studied_scaled,
    passed_exam,
    exam_features_scaled_train,
    exam_features_scaled_test,
    passed_exam_2_train,
    passed_exam_2_test,
    guessed_hours_scaled,
)

# Create logistic regression model
model = LogisticRegression()

# Train the model using hours_studied_scaled as the training features and passed_exam as the training labels.
model.fit(hours_studied_scaled, passed_exam)

# Save the model coefficients and intercept here
calculated_coefficients = model.coef_
print(calculated_coefficients) # [[1.71391157]]
intercept = model.intercept_
print(intercept) # [-0.24783765]

# Predict the probabilities of passing for next semester's students based on the number of hours studied
passed_predictions = model.predict_proba(guessed_hours_scaled)

# Create a new model trained on two features: the students' estimated
# hours spent studying and the number of previous math courses they have taken
model_2 = LogisticRegression()
model_2.fit(exam_features_scaled_train, passed_exam_2_train)

# Predict whether the students will pass here
passed_predictions_2 = model_2.predict(exam_features_scaled_test)
print(passed_predictions_2) # [1 1 1 1 1]

```

## Feature importance

Since our data is normalized, all features vary over the same range. Given this, we can compare the magnitudes and signs of the feature coefficients to determine which features have the greatest impact on class prediction, and whether that impact is positive or negative.

 - Features with large positive coefficients increase the probability of a data sample belonging to the positive class as the feature value increases

 - Features with large negative coefficients decrease the probability of a data sample belonging to the positive class as the feature value increases

 - Features with small coefficients, positive or negative, have minimal impact on the probability of a data sample belonging to the positive class
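
The comparison described above can be sketched as sorting features by the absolute value of their coefficients. The feature names and the randomly generated data below are hypothetical, built so that one feature matters far more than the others.

```py
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical data on a common scale: 200 samples, 3 features
feature_names = ["feature_a", "feature_b", "feature_c"]
X = rng.normal(size=(200, 3))

# By construction, the label depends strongly on feature_a,
# weakly (and negatively) on feature_b, and not at all on feature_c
score = 3.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200)
y = (score > 0).astype(int)

model = LogisticRegression().fit(X, y)
coefficients = model.coef_[0]

# Rank features from largest to smallest coefficient magnitude
order = np.argsort(-np.abs(coefficients))
for index in order:
    print(f"{feature_names[index]}: {coefficients[index]:+.2f}")
```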


```py
import codecademylib3_seaborn
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from exam import exam_features_scaled, passed_exam_2

# The two features in the model are the number of hours studied
# and the number of previous math courses taken.
# Train a sklearn logistic regression model on the normalized exam data
model_2 = LogisticRegression()
model_2.fit(exam_features_scaled, passed_exam_2)

# Get the fitted coefficients (a 2D array with a single row)
coefficients = model_2.coef_

# Convert the array to a nested list with NumPy's tolist() method
# and take the first row so it can be plotted
coefficients = coefficients.tolist()[0]

# Plot a bar graph with matplotlib's plt.bar() method
plt.bar([1, 2], coefficients)
plt.xticks([1, 2], ['hours studied', 'math courses taken'])
plt.xlabel('feature')
plt.ylabel('coefficient')

plt.show()
```

![Log reg](img/logistic-regression-14.png)

## Logistic Regression review

The output of a Logistic Regression model is a probability that ranges from 0 to 1, whereas the output of a Linear Regression model ranges from -∞ to +∞.

Logistic Regression is used to perform binary classification, predicting whether a data sample belongs to the positive (present) class, labeled 1, or the negative (absent) class, labeled 0.

The Sigmoid Function maps the log-odds (the sum of the feature values multiplied by their coefficients, plus the intercept) to the range (0, 1), providing the probability of a sample belonging to the positive class.
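
As a quick illustration, the sigmoid function can be written in a few lines of NumPy. The log-odds values below are arbitrary examples.

```py
import numpy as np

def sigmoid(z):
    """Map log-odds to probabilities between 0 and 1."""
    return 1 / (1 + np.exp(-z))

# Arbitrary example log-odds: strongly negative, zero, strongly positive
log_odds = np.array([-4.0, 0.0, 4.0])
probabilities = sigmoid(log_odds)
print(probabilities)  # approximately [0.018, 0.5, 0.982]

# Taking log(p / (1 - p)) recovers the original log-odds
recovered = np.log(probabilities / (1 - probabilities))
print(np.allclose(recovered, log_odds))  # True
```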

A loss function measures how well a machine learning model makes predictions. The loss function of Logistic Regression is log-loss.
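
A small sketch of the log-loss calculation, compared against `sklearn.metrics.log_loss`. The labels and predicted probabilities are made up for illustration.

```py
import numpy as np
from sklearn.metrics import log_loss

# Hypothetical true labels and predicted positive-class probabilities
y_true = np.array([1, 0, 1, 1])
y_prob = np.array([0.9, 0.2, 0.6, 0.4])

# Log-loss: average of -[y*log(p) + (1-y)*log(1-p)] over all samples
manual = -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

print(round(manual, 4))                              # 0.4389
print(np.isclose(manual, log_loss(y_true, y_prob)))  # True
```

Confident, correct predictions (such as 0.9 for a true label of 1) contribute little to the loss, while the less confident 0.4 for a true label of 1 contributes the most.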

A Classification Threshold is the probability cutoff at which a data sample is classified as belonging to the positive class rather than the negative class. The standard cutoff for Logistic Regression is 0.5, but the threshold can be higher or lower depending on the nature of the data and the situation.
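
One way to apply a threshold other than 0.5 is to compare the output of `.predict_proba()` against a cutoff of our choosing. The helper `classify()`, the data, and the 0.8 cutoff below are hypothetical and only meant as a sketch.

```py
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: one feature, eight samples with overlapping classes
features = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
labels = np.array([0, 0, 1, 0, 1, 0, 1, 1])
model = LogisticRegression().fit(features, labels)

positive_proba = model.predict_proba(features)[:, 1]

def classify(probabilities, threshold):
    """Label a sample 1 when its positive-class probability reaches the threshold."""
    return (probabilities >= threshold).astype(int)

default_labels = classify(positive_proba, 0.5)
strict_labels = classify(positive_proba, 0.8)

# A higher threshold can only keep or reduce the number of positive predictions
print(default_labels.sum(), strict_labels.sum())
```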

Scikit-learn has a Logistic Regression implementation that allows you to fit a model to your data, find the feature coefficients, and make predictions on new data samples.

The coefficients determined by a Logistic Regression model can be used to interpret the relative importance of each feature in predicting the class of a data sample.

Check out `sklearn`'s [Breast Cancer Dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html) as an exercise.
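
As a possible starting point for that exercise, the sketch below loads the dataset, scales the features, and fits a model. The train/test split settings and the use of a pipeline are assumptions, not part of the lesson.

```py
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# In this dataset, target 0 is malignant and target 1 is benign
X, y = load_breast_cancer(return_X_y=True)

# Assumed split: 80% training, 20% test, stratified by class
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale the features, then fit the logistic regression model
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
pipeline.fit(X_train, y_train)

accuracy = pipeline.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.3f}")

# Coefficients of the fitted model, one per scaled feature
coefficients = pipeline.named_steps["logisticregression"].coef_[0]
print(coefficients.shape)  # (30,)
```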