# Solvers ⚙️

In this exercise, you will investigate the effects of different `solvers` on `LogisticRegression` models.

👇 Run the code below

In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv("data.csv")

df.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,sulphates,alcohol,quality rating
0,9.47,5.97,7.36,10.17,6.84,9.15,9.78,9.52,10.34,8.8,6
1,10.05,8.84,9.76,8.38,10.15,6.91,9.7,9.01,9.23,8.8,7
2,10.59,10.71,10.84,10.97,9.03,10.42,11.46,11.25,11.34,9.06,4
3,11.0,8.44,8.32,9.65,7.87,10.92,6.97,11.07,10.66,8.89,8
4,12.12,13.44,10.35,9.95,11.09,9.38,10.22,9.04,7.68,11.38,3


- The dataset consists of different wines 🍷
- The features describe different properties of the wines 
- The target 🎯 is a quality rating given by an expert

## 1. Target engineering

In this section, you are going to transform the ratings into a binary target.

👇 How many observations are there for each rating?

In [None]:
# YOUR CODE HERE

df['quality rating'].value_counts().sort_index()  # Number of observations per rating

👇 Create `y` by transforming the target into a binary classification task where quality ratings below 6 are bad [0], and ratings of 6 and above are good [1]

In [2]:
# YOUR CODE HERE

# Ratings below 6 become 0 (bad), ratings of 6 and above become 1 (good)
df['y'] = np.where(df['quality rating'] < 6, 0, 1)

👇 Check the class balance of the new binary target

In [None]:
# YOUR CODE HERE

df['y'].value_counts(normalize=True)  # Share of each class in the binary target
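
The two checks above (observations per rating, then class balance of the binary target) can be sketched on a small example. The `toy` DataFrame and its ratings are made up for illustration and are not the wine dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings, for illustration only (not the wine dataset)
toy = pd.DataFrame({"quality rating": [3, 5, 6, 6, 7, 8, 4, 9]})

# Ratings below 6 map to 0 (bad), ratings of 6 and above map to 1 (good)
toy["y"] = np.where(toy["quality rating"] < 6, 0, 1)

# Observations per rating, then the share of each binary class
rating_counts = toy["quality rating"].value_counts().sort_index()
class_balance = toy["y"].value_counts(normalize=True)

print(rating_counts)
print(class_balance)
```

`value_counts(normalize=True)` returns proportions instead of counts, which makes an imbalanced target easy to spot.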

Create your `X` by scaling the features. This will allow for fair comparison of different solvers.

In [3]:
# YOUR CODE HERE
X = df.drop(columns=['y', 'quality rating'])  # Create feature set


In [11]:
from sklearn.preprocessing import MinMaxScaler

for col in X.columns:
    scaler = MinMaxScaler()  # Instantiate a MinMaxScaler for this column
    scaler.fit(X[[col]])  # Fit the scaler to the column
    X[col] = scaler.transform(X[[col]])  # Replace the column with its scaled values

# After the loop, `scaler` only holds the fit of the last column
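
As an alternative to one scaler per column, a single `MinMaxScaler` can be fitted on the whole feature matrix and reused later on new observations. Below is a minimal sketch, assuming hypothetical `train` and `new_row` DataFrames with made-up values (not the wine data).

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical training features and one new observation (illustration only)
train = pd.DataFrame({"a": [0.0, 5.0, 10.0], "b": [100.0, 150.0, 200.0]})
new_row = pd.DataFrame({"a": [2.5], "b": [175.0]})

# One scaler learns the min and max of every column at once
toy_scaler = MinMaxScaler()
train_scaled = pd.DataFrame(toy_scaler.fit_transform(train), columns=train.columns)

# The same fitted scaler is reused on new data, so each column
# is scaled with its own training min and max
new_scaled = pd.DataFrame(toy_scaler.transform(new_row), columns=new_row.columns)

print(new_scaled)
```

Keeping one fitted scaler for all columns avoids having to track which scaler belongs to which feature at prediction time.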

In [5]:
# YOUR CODE HERE
y=df['y']

## 2. LogisticRegression solvers

👇 Logistic Regression models can be optimized using different **solvers**. Find out:
- Which is the `fastest_solver`?
- What can you say about their respective precision scores?

`solvers = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']`
 
For more information on these 5 solvers, check out [this Stack Overflow thread](https://stackoverflow.com/questions/38640109/logistic-regression-python-solvers-defintions)

In [6]:
# YOUR CODE HERE

# Import the model
from sklearn.linear_model import LogisticRegression

# Import cross-validation
from sklearn.model_selection import cross_validate

# Instantiate model
#modelog = LogisticRegression(max_iter = 1000)



In [12]:
X.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,sulphates,alcohol
0,0.531348,0.285244,0.265966,0.504968,0.229879,0.363248,0.451878,0.432173,0.557503,0.413523
1,0.576803,0.420113,0.459984,0.34327,0.412348,0.123932,0.442488,0.370948,0.435926,0.413523
2,0.619122,0.507989,0.547292,0.577236,0.350606,0.498932,0.649061,0.639856,0.667032,0.432028
3,0.651254,0.401316,0.343573,0.457995,0.286659,0.55235,0.122066,0.618247,0.592552,0.419929
4,0.739028,0.636278,0.50768,0.485095,0.464168,0.387821,0.503521,0.37455,0.266156,0.597153


In [13]:
solvers = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']

for sol in solvers:
    # 5-fold cross-validate the model (cv=5)
    modelog = LogisticRegression(solver=sol)
    cv_results = cross_validate(modelog, X, y, cv=5)
    print(f'{sol} : {cv_results["fit_time"].mean()}')


newton-cg : 0.568980073928833
lbfgs : 0.765258502960205
liblinear : 0.3079834461212158
sag : 0.7270970344543457
saga : 1.342588186264038
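
The loop above only reports fit times. To also compare precision, `cross_validate` accepts a `scoring` argument. Here is a minimal sketch on synthetic data from `make_classification`; the data, the `summary` dictionary and `max_iter=1000` are assumptions made for illustration, not part of the exercise.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in data (assumption: this is not the wine dataset)
X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=0)

summary = {}
for sol in ["newton-cg", "lbfgs", "liblinear", "sag", "saga"]:
    model = LogisticRegression(solver=sol, max_iter=1000)
    cv = cross_validate(model, X_toy, y_toy, cv=5, scoring="precision")
    # fit_time is in seconds; test_score holds the precision of each fold
    summary[sol] = (cv["fit_time"].mean(), cv["test_score"].mean())

for sol, (fit_time, precision) in summary.items():
    print(f"{sol:<10} fit_time={fit_time:.4f}s precision={precision:.3f}")
```

Fit times vary from run to run and from machine to machine, so the ranking of solvers is best judged over several runs.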


In [9]:
# 5-fold cross-validate the model (cv=5)
modelog = LogisticRegression(solver='sag')
cv_results = cross_validate(modelog, X, y, cv=5)
cv_results['fit_time'].mean()



1.8026379108428956

In [None]:
%%time
# Instantiate model
modelog = LogisticRegression(max_iter = 1000)
modelog.fit(X,y)



In [None]:
# YOUR ANSWER
fastest_solver = "liblinear"

<details>
    <summary>☝️ Intuition</summary>

All solvers should produce similar precision scores because the Logistic Regression cost function is convex: it has a single global minimum, which all 5 solvers reach. For very complex cost functions, such as those in Deep Learning, different solvers may stop at different values of the loss function.

</details> 
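
One way to see this intuition in practice is to fit the same model with two solvers and compare the learned coefficients. The sketch below uses synthetic data; the tight `tol` and large `max_iter` are assumptions chosen so that both solvers run close to convergence.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption: this is not the wine dataset)
X_toy, y_toy = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

coefs = {}
for sol in ["newton-cg", "lbfgs"]:
    model = LogisticRegression(solver=sol, tol=1e-8, max_iter=5000).fit(X_toy, y_toy)
    coefs[sol] = model.coef_.ravel()

# Largest absolute difference between the two coefficient vectors
max_gap = np.abs(coefs["newton-cg"] - coefs["lbfgs"]).max()
print(max_gap)
```

A small gap suggests both solvers stopped near the same minimum, which is why their precision scores end up so close.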

###  🧪 Test your code

In [None]:
from nbresult import ChallengeResult

result = ChallengeResult('solvers',
                         fastest_solver=fastest_solver
                         )
result.write()
print(result.check())

## 3. Stochastic Gradient Descent

Logistic Regression models can also be optimized via Stochastic Gradient Descent.

👇 Evaluate a Logistic Regression model optimized via **Stochastic Gradient Descent**. How do its precision score and training time compare to those of the models trained in section 2?


<details>
<summary>💡 Hint</summary>

- If you are stuck, look at the [SGDClassifier doc](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html)!

</details>



In [14]:
# YOUR CODE HERE

from sklearn.linear_model import SGDClassifier

# 10-Fold Cross validate model

# Logistic loss; this option is named 'log_loss' in scikit-learn 1.1 and later
modelog = SGDClassifier(loss='log')
cv_results = cross_validate(modelog, X, y, cv=10)
cv_results['fit_time'].mean()

0.29161131381988525

☝️ The SGD model should have the shortest training time, for similar performance. This is a direct effect of performing each update of the Gradient Descent on a single data point rather than on the whole dataset.
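
To make the "one data point per update" idea concrete, here is a minimal NumPy sketch of SGD for logistic regression. It is an illustration under simplifying assumptions (synthetic data, no intercept, no regularization, a fixed `learning_rate`), not a reproduction of what `SGDClassifier` does internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data with a known linear structure (illustration only)
X_toy = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y_toy = (X_toy @ true_w + rng.normal(scale=0.5, size=200) > 0).astype(float)

def log_loss(w, X, y):
    """Mean logistic loss of weights w on (X, y)."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

w = np.zeros(3)
learning_rate = 0.1
initial_loss = log_loss(w, X_toy, y_toy)

for epoch in range(5):                     # one epoch = one pass over all rows
    for i in rng.permutation(len(X_toy)):  # one update per single observation
        p_i = 1.0 / (1.0 + np.exp(-(X_toy[i] @ w)))
        w -= learning_rate * (p_i - y_toy[i]) * X_toy[i]

final_loss = log_loss(w, X_toy, y_toy)
print(initial_loss, final_loss)
```

Each update only touches one row, so it is cheap; the price is a noisier path towards the minimum than with full-batch gradient descent.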

## 4. Predictions

👇 Use the best model to predict the binary quality (0 or 1) of the following wine. Store your
- `predicted_class`
- `predicted_proba_of_class`

In [17]:
new_data = pd.read_csv('new_data.csv')

new_data.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,sulphates,alcohol
0,9.54,13.5,12.35,8.78,14.72,9.06,9.67,10.15,11.17,12.17


In [18]:
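for col in new_data:
    # Caution: `scaler` is the last scaler fitted in the loop above (the one for `alcohol`),
    # so every column here is scaled with the min and max of `alcohol`
    new_data[col] = scaler.transform(new_data[[col]])  # Use scaler to transform data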

In [19]:
new_data.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,sulphates,alcohol
0,0.466192,0.748043,0.666192,0.4121,0.834875,0.432028,0.475445,0.509609,0.582206,0.653381


In [None]:
# YOUR CODE HERE

In [20]:
# Fit the model on the full training set
# (`modelog` is still the SGDClassifier defined in section 3)
modelog.fit(X, y)

SGDClassifier(loss='log')

In [None]:
X.head()

In [21]:
probab = modelog.predict_proba(new_data)
probability = probab[0][0]

In [22]:
probab

array([[0.98348325, 0.01651675]])

In [23]:
predicted_class = modelog.predict(new_data)
predicted_proba_of_class = probab[0][0]
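
The cell above hard-codes column `0` of `predict_proba`, which matches the predicted class only when that class is the first entry of `classes_`. A minimal sketch that looks the column up instead, on synthetic data (`X_toy`, `toy_class` and `toy_proba_of_class` are hypothetical names used for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption: this is not the wine dataset)
X_toy, y_toy = make_classification(n_samples=200, n_features=4, random_state=0)
model = LogisticRegression().fit(X_toy, y_toy)

sample = X_toy[:1]                    # a single row, kept 2-D
proba = model.predict_proba(sample)   # shape (1, 2), columns follow model.classes_
toy_class = model.predict(sample)[0]  # scalar class label

# Column index of the predicted class, then its probability
class_index = list(model.classes_).index(toy_class)
toy_proba_of_class = proba[0][class_index]

print(toy_class, toy_proba_of_class)
```

Looking up the column through `classes_` keeps the probability tied to the predicted label, whichever class the model picks.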

# 🏁  Check your code and push your notebook

In [24]:
from nbresult import ChallengeResult

result = ChallengeResult('new_data_prediction',
    predicted_class=predicted_class,
    predicted_proba_of_class=predicted_proba_of_class
)
result.write()
print(result.check())

platform linux -- Python 3.8.6, pytest-6.2.4, py-1.10.0, pluggy-0.13.1 -- /home/nandosoq/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /home/nandosoq/code/Nandosoq/data-challenges/05-ML/04-Under-the-hood/02-Solvers
plugins: anyio-3.2.1, dash-1.21.0
collecting ... collected 2 items

tests/test_new_data_prediction.py::TestNewDataPrediction::test_predicted_class PASSED [ 50%]
tests/test_new_data_prediction.py::TestNewDataPrediction::test_predicted_proba PASSED [100%]



💯 You can commit your code:

git add tests/new_data_prediction.pickle

git commit -m 'Completed new_data_prediction step'

git push origin master
