# Solvers ⚙️

In [59]:
import pandas as pd
import time
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import precision_score


In this exercise, you will investigate the effects of different `solvers` on `LogisticRegression` models.

👇 Run the code below to import the dataset

In [60]:
df = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/04-Under-the-Hood/solvers_dataset.csv")
df.head()


Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,sulphates,alcohol,quality rating
0,9.47,5.97,7.36,10.17,6.84,9.15,9.78,9.52,10.34,8.8,6
1,10.05,8.84,9.76,8.38,10.15,6.91,9.7,9.01,9.23,8.8,7
2,10.59,10.71,10.84,10.97,9.03,10.42,11.46,11.25,11.34,9.06,4
3,11.0,8.44,8.32,9.65,7.87,10.92,6.97,11.07,10.66,8.89,8
4,12.12,13.44,10.35,9.95,11.09,9.38,10.22,9.04,7.68,11.38,3


- The dataset consists of different wines 🍷
- The features describe different properties of the wines 
- The target 🎯 is a quality rating given by an expert

## 1. Target engineering

In this section, you are going to transform the ratings into a binary target.

👇 How many observations are there for each rating?

In [61]:
rating_counts = df['quality rating'].value_counts()

print(rating_counts)


10    10143
5     10124
1     10090
2     10030
8      9977
6      9961
9      9955
7      9954
4      9928
3      9838
Name: quality rating, dtype: int64


❓ Create `y` by transforming the target into a binary classification task where quality ratings below 6 are bad [0], and ratings of 6 and above are good [1]

In [62]:
df['y'] = df['quality rating'].apply(lambda x: 1 if x >= 6 else 0)

print(df[['quality rating', 'y']].head())


   quality rating  y
0               6  1
1               7  1
2               4  0
3               8  1
4               3  0


❓ Check the class balance of the new binary target

In [63]:
df['y'] = df['quality rating'].apply(lambda x: 1 if x >= 6 else 0)

class_balance = df['y'].value_counts()

print(class_balance)


0    50010
1    49990
Name: y, dtype: int64


❓ Create your `X` by normalising the features. This will allow for fair comparison of different solvers.

In [64]:
df['y'] = df['quality rating'].apply(lambda x: 1 if x >= 6 else 0)

X = df.drop(['quality rating', 'y'], axis=1)
y = df['y']

scaler = StandardScaler()
X_normalized = scaler.fit_transform(X)

print(pd.DataFrame(X_normalized, columns=X.columns).head())


   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
0      -0.788603         -1.528461    -1.733180        0.461130  -1.526653   
1      -0.346860         -0.462069    -0.158290       -0.783868   0.117066   
2       0.064417          0.232757     0.550411        1.017553  -0.439117   
3       0.376684         -0.610695    -1.103224        0.099454  -1.015163   
4       1.229704          1.247129     0.228871        0.308113   0.583862   

   free sulfur dioxide  total sulfur dioxide   density  sulphates   alcohol  
0            -0.852381             -0.221393 -0.478387   0.340231 -0.489833  
1            -3.102634             -0.301357 -0.986972  -0.769429 -0.489833  
2             0.423432              1.457850  1.246811   1.339925 -0.307387  
3             0.925720             -3.030126  1.067310   0.660133 -0.426679  
4            -0.621328              0.218409 -0.957055  -2.318954  1.320591  


## 2. LogisticRegression solvers

❓ Logistic Regression models can be optimized using different **solvers**. Make a comparison of the available solvers':
- Fit time - which solver is **the fastest**?
- Precision - **how different** are their respective precision scores?

Available solvers for Logistic Regression are `['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']`
 
For more information on these 5 solvers, check out [this Stack Overflow thread](https://stackoverflow.com/questions/38640109/logistic-regression-python-solvers-defintions)

In [65]:
scaler = StandardScaler()
X_normalized = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_normalized, y, test_size=0.2, random_state=42)

solvers = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']

fit_times = {}
precision_scores = {}

for solver in solvers:
    model = LogisticRegression(solver=solver, max_iter=1000)

    start_time = time.time()
    model.fit(X_train, y_train)
    fit_time = time.time() - start_time
    fit_times[solver] = fit_time

    y_pred = model.predict(X_test)

    precision = precision_score(y_test, y_pred)
    precision_scores[solver] = precision

print("fit times")
for solver, time_taken in fit_times.items():
    print(f"{solver}: {time_taken:.4f} seconds")

print("\precision scores")
for solver, precision in precision_scores.items():
    print(f"{solver}: {precision:.4f}")


Fit Times:
newton-cg: 0.3992 seconds
lbfgs: 0.0764 seconds
liblinear: 0.0809 seconds
sag: 0.5301 seconds
saga: 0.9103 seconds

Precision Scores:
newton-cg: 0.8801
lbfgs: 0.8801
liblinear: 0.8801
sag: 0.8801
saga: 0.8801


In [66]:
# YOUR ANSWER
fastest_solver = "lbfgs"


<details>
    <summary>ℹ️ Click here for our interpretation</summary>

All solvers should produce similar precision scores because our cost-function is "easy" enough to have a global minimum which is found by all 5 solvers. For very complex cost-functions such as in Deep Learning, different solvers may stopping at different values of the loss function.

**The wine dataset**
    
If you check feature importance with sklearn's <a href="https://scikit-learn.org/stable/modules/generated/sklearn.inspection.permutation_importance.html">permutation_importance</a> on the current dataset, you'll see many features result in almost 0 importance. Liblinear solver successively moves only along *one* direction at a time, regularizing the others with L1 regularization (a.k.a, setting their beta to 0), which might provide a good fit for a dataset where many features are not that important in predicting the target.

❗️There is a cost to searching for the best solver. Sticking with the default (`lbfgs`) may save the most time overall, sklearn provides you this grid for an idea of which solver to choose to start off with: 

<img src="https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/04-Under-the-Hood/solvers-chart.png" width=700>



</details> 

###  🧪 Test your code

In [67]:
from nbresult import ChallengeResult

result = ChallengeResult(
    'solvers',
    fastest_solver=fastest_solver
)
result.write()
print(result.check())



platform darwin -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /Users/reecepalmer/.pyenv/versions/3.10.6/envs/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /Users/reecepalmer/Code/RPalmr/05-ML/04-Under-the-hood/data-solvers/tests
plugins: asyncio-0.19.0, dash-2.14.0, typeguard-2.13.3, anyio-3.6.2, hydra-core-1.3.2
asyncio: mode=strict
[1mcollecting ... [0mcollected 1 item

test_solvers.py::TestSolvers::test_fastest_solver [32mPASSED[0m[32m                 [100%][0m



💯 You can commit your code:

[1;32mgit[39m add tests/solvers.pickle

[32mgit[39m commit -m [33m'Completed solvers step'[39m

[32mgit[39m push origin master



## 3. Stochastic Gradient Descent

Logistic Regression models can also be optimized via Stochastic Gradient Descent.

❓ Evaluate a Logistic Regression model optimized via **Stochastic Gradient Descent**. How do its precision score and training time compare to the performance of the models trained in section 2?


<details>
<summary>💡 Hint</summary>

- If you are stuck, look at the [SGDClassifier doc](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html)!

</details>



In [68]:
sgd_model = SGDClassifier(loss='log_loss', max_iter=1000, random_state=42)

start_time = time.time()
sgd_model.fit(X_train, y_train)
fit_time_sgd = time.time() - start_time

y_pred_sgd = sgd_model.predict(X_test)

precision_sgd = precision_score(y_test, y_pred_sgd)

print("fit time: {fit_time_sgd:.4f} seconds")
print(f"precision score: {precision_sgd:.4f}")


fit time: {fit_time_sgd:.4f} seconds
precision score: 0.8650


☝️ The SGD model should have one of the shortest times (maybe even shorter than `liblinear`), for similar performance. This is a direct effect of performing each epoch of the Gradient Descent on a single row as opposed to loading 100k rows into memory at a time.

## 4. Predictions

❓ Use the best model (balanced with short fit time and high precision) to predict the binary quality (0 or 1) of the following wine. Store your:
- `predicted_class`
- `predicted_proba_of_class` (i.e if your model predicted a class of 1 what is the probability it believes 1 to be the class should be between 0 and 1)

In [69]:
new_wine = pd.read_csv('https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/04-Under-the-Hood/solvers_new_wine.csv')
new_wine


Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,sulphates,alcohol
0,9.54,13.5,12.35,8.78,14.72,9.06,9.67,10.15,11.17,12.17


In [70]:
best_model = SGDClassifier(loss='log_loss', max_iter=1000, random_state=42)

best_model.fit(X_train, y_train)

X_new_wine_normalized = scaler.transform(new_wine)

predicted_class = best_model.predict(X_new_wine_normalized)[0]

predicted_proba_of_class = best_model.predict_proba(X_new_wine_normalized)[0][predicted_class]


# 🏁  Check your code and push your notebook

In [71]:
from nbresult import ChallengeResult

result = ChallengeResult(
    'new_data_prediction',
    predicted_class=predicted_class,
    predicted_proba_of_class=predicted_proba_of_class
)
result.write()
print(result.check())



platform darwin -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /Users/reecepalmer/.pyenv/versions/3.10.6/envs/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /Users/reecepalmer/Code/RPalmr/05-ML/04-Under-the-hood/data-solvers/tests
plugins: asyncio-0.19.0, dash-2.14.0, typeguard-2.13.3, anyio-3.6.2, hydra-core-1.3.2
asyncio: mode=strict
[1mcollecting ... [0mcollected 2 items

test_new_data_prediction.py::TestNewDataPrediction::test_predicted_class [32mPASSED[0m[32m [ 50%][0m
test_new_data_prediction.py::TestNewDataPrediction::test_predicted_proba [32mPASSED[0m[32m [100%][0m



💯 You can commit your code:

[1;32mgit[39m add tests/new_data_prediction.pickle

[32mgit[39m commit -m [33m'Completed new_data_prediction step'[39m

[32mgit[39m push origin master

