In [62]:
%load_ext autoreload
%autoreload 2


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


# `Logit` on Orders - A warm-up challenge (~1h)

## Select features

🎯 Let's figure out the impact of `wait_time` and `delay_vs_expected` on very `good/bad reviews`

👉 Using our `orders` training_set, we will run two `multivariate logistic regressions`:
- `logit_one` to predict `dim_is_one_star` 
- `logit_five` to predict `dim_is_five_star`.

 

In [74]:
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import numpy as np


👉 Import your dataset:

In [64]:
from olist.order import Order
orders = Order().get_training_data(with_distance_seller_customer=True)


In [65]:
orders.head()


Unnamed: 0,order_id,wait_time,expected_wait_time,delay_vs_expected,order_status,dim_is_five_star,dim_is_one_star,review_score,number_of_products,number_of_sellers,price,freight_value,distance_seller_customer
0,e481f51cbdc54678b7cc49136f2d6af7,8.436574,15.544063,0.0,delivered,0,0,4,1,1,29.99,8.72,18.063837
1,53cdb2fc8bc7dce0b6741e2150273451,13.782037,19.137766,0.0,delivered,0,0,4,1,1,118.7,22.76,856.29258
2,47770eb9100c2d0c44946d9cf07ec65d,9.394213,26.639711,0.0,delivered,1,0,5,1,1,159.9,19.22,514.130333
3,949d5b44dbf5de918fe9c16f97b45f8a,13.20875,26.188819,0.0,delivered,1,0,5,1,1,45.0,27.2,1822.800366
4,ad21c59c0840e6cb83a9ceb5573f8159,2.873877,12.112049,0.0,delivered,1,0,5,1,1,19.9,8.72,30.174037


👉 Select in a list which features you want to use:

⚠️ Make sure you are not creating data leakage (i.e. selecting features that are derived from the target)

💡 To isolate the impact of `wait_time` and `delay_vs_expected`, we need to control for the impact of other features. Include in your list all features that may be relevant.

In [66]:
data = pd.DataFrame(orders)

features = [
    'wait_time',
    'delay_vs_expected',
    'number_of_products',
    'number_of_sellers',
    'distance_seller_customer',
]


🕵🏻 Check the `multicollinearity` of your features, using the `VIF index`.

* It shouldn't be too high (< 10 preferably) to ensure that we can trust the partial regression coefficients and their associated `p-values` 
* Do not forget to standardize your data ! 
    * A `VIF Analysis` is made by regressing a feature vs. the other features...
    * So you want to `remove the effect of scale` so that your features have an equal importance before running any linear regression!
    
    
📚 <a href="https://www.statisticshowto.com/variance-inflation-factor/">Statistics How To - Variance Inflation Factor</a>

📚  <a href="https://online.stat.psu.edu/stat462/node/180/">PennState - Detecting Multicollinearity Using Variance Inflation Factors</a>
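💡 As a rough illustration of what the VIF measures, the sketch below regresses each feature on the others and computes `1 / (1 - R²)` with plain NumPy. The helper name `vif_scores` and the synthetic columns are assumptions made for this example; they are not part of the challenge. Note that `statsmodels`' `variance_inflation_factor` does not add an intercept by itself, which is one more reason to feed it centered (standardized) features.

```python
import numpy as np

def vif_scores(X):
    """Return the VIF of each column of X (2-D array, one column per feature)."""
    X = np.asarray(X, dtype=float)
    scores = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Add an intercept so that R^2 is measured around the mean of y
        A = np.column_stack([np.ones(len(y)), others])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        residuals = y - A @ beta
        r_squared = 1 - residuals.var() / y.var()
        scores.append(1 / (1 - r_squared))
    return np.array(scores)

# Synthetic data (hypothetical): x2 is almost a copy of x1, x3 is independent
rng = np.random.default_rng(0)
x1 = rng.normal(size=1000)
x2 = x1 + rng.normal(scale=0.1, size=1000)
x3 = rng.normal(size=1000)

scores = vif_scores(np.column_stack([x1, x2, x3]))
print(scores.round(2))
```

The two nearly identical columns should get a VIF far above 10, while the independent one stays close to 1.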

⚖️ Standardizing:

In [88]:
orders_data = orders[['wait_time', 'delay_vs_expected', 'number_of_products', 'distance_seller_customer', 'number_of_sellers']].copy()

# Use a dedicated loop variable so the `features` list defined above is not overwritten
for col in orders_data.columns:
    mu = orders_data[col].mean()
    sigma = orders_data[col].std()
    orders_data[col] = (orders_data[col] - mu) / sigma

orders_data.head(5)


Unnamed: 0,wait_time,delay_vs_expected,number_of_products,distance_seller_customer,number_of_sellers
0,-0.431192,-0.161781,-0.264595,-0.979475,-0.112544
1,0.134174,-0.161781,-0.264595,0.429743,-0.112544
2,-0.329907,-0.161781,-0.264595,-0.145495,-0.112544
3,0.07354,-0.161781,-0.264595,2.054621,-0.112544
4,-1.019535,-0.161781,-0.264595,-0.959115,-0.112544
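💡 The same z-score standardization can be written without a loop. The sketch below uses NumPy on a small made-up array (the values are hypothetical stand-ins for two feature columns); with a DataFrame, the equivalent expression would be `(df - df.mean()) / df.std()`.

```python
import numpy as np

# Hypothetical values standing in for two feature columns
X = np.array([
    [8.4, 18.1],
    [13.8, 856.3],
    [9.4, 514.1],
    [13.2, 1822.8],
    [2.9, 30.2],
])

mu = X.mean(axis=0)
sigma = X.std(axis=0, ddof=1)  # ddof=1 matches the pandas default for .std()
X_scaled = (X - mu) / sigma    # broadcasting applies the operation column by column

print(X_scaled.round(3))
```

After scaling, every column has a mean of 0 and a standard deviation of 1, so no feature dominates because of its unit.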


In [68]:
orders_scaled['dim_is_five_star'] = orders['dim_is_five_star']
orders_scaled.head()


Unnamed: 0,wait_time,delay_vs_expected,number_of_products,number_of_sellers,distance_seller_customer,dim_is_five_star
0,-0.431192,-0.161781,-0.264595,-0.112544,-0.979475,0
1,0.134174,-0.161781,-0.264595,-0.112544,0.429743,0
2,-0.329907,-0.161781,-0.264595,-0.112544,-0.145495,1
3,0.07354,-0.161781,-0.264595,-0.112544,2.054621,1
4,-1.019535,-0.161781,-0.264595,-0.112544,-0.959115,1


In [89]:
orders_data['dim_is_one_star'] = orders['dim_is_one_star']
orders_data.head()


Unnamed: 0,wait_time,delay_vs_expected,number_of_products,distance_seller_customer,number_of_sellers,dim_is_one_star
0,-0.431192,-0.161781,-0.264595,-0.979475,-0.112544,0
1,0.134174,-0.161781,-0.264595,0.429743,-0.112544,0
2,-0.329907,-0.161781,-0.264595,-0.145495,-0.112544,0
3,0.07354,-0.161781,-0.264595,2.054621,-0.112544,0
4,-1.019535,-0.161781,-0.264595,-0.959115,-0.112544,0


In [90]:
orders_data['review_score'] = orders['review_score']
orders_data.head()


Unnamed: 0,wait_time,delay_vs_expected,number_of_products,distance_seller_customer,number_of_sellers,dim_is_one_star,review_score
0,-0.431192,-0.161781,-0.264595,-0.979475,-0.112544,0,4
1,0.134174,-0.161781,-0.264595,0.429743,-0.112544,0,4
2,-0.329907,-0.161781,-0.264595,-0.145495,-0.112544,0,5
3,0.07354,-0.161781,-0.264595,2.054621,-0.112544,0,5
4,-1.019535,-0.161781,-0.264595,-0.959115,-0.112544,0,5


👉 Run your VIF Analysis to detect potential multicollinearity:

In [92]:
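from statsmodels.stats.outliers_influence import variance_inflation_factor as vif

# Compute the VIF on the features only: drop the target columns added above
X = orders_data.drop(columns=['dim_is_one_star', 'review_score'])

df = pd.DataFrame()
df['features'] = X.columns

df['vif_index'] = [vif(X.values, i) for i in range(X.shape[1])]

round(df.sort_values(by='vif_index', ascending=False), 2)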

## Logistic Regressions

👉 Fit two `Logistic Regression` models:
- `logit_one` to predict `dim_is_one_star` 
- `logit_five` to predict `dim_is_five_star`.

`Logit 1️⃣`

In [93]:
logit_one = smf.logit(formula='dim_is_one_star ~ wait_time + distance_seller_customer + number_of_sellers + number_of_products', data=orders_scaled).fit()
logit_one.summary()


Optimization terminated successfully.
         Current function value: 0.274749
         Iterations 7


Dep. Variable:,dim_is_one_star,No. Observations:,95872.0
Model:,Logit,Df Residuals:,95867.0
Method:,MLE,Df Model:,4.0
Date:,"Thu, 02 Nov 2023",Pseudo R-squ.:,0.1411
Time:,11:06:06,Log-Likelihood:,-26341.0
converged:,True,LL-Null:,-30669.0
Covariance Type:,nonrobust,LLR p-value:,0.0

,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,-2.4943,0.013,-192.672,0.000,-2.520,-2.469
wait_time,0.8886,0.011,78.709,0.000,0.866,0.911
distance_seller_customer,-0.2371,0.013,-18.521,0.000,-0.262,-0.212
number_of_sellers,0.1826,0.008,23.137,0.000,0.167,0.198
number_of_products,0.2470,0.009,27.291,0.000,0.229,0.265


In [94]:
# Odds multiplier for one extra standard deviation of wait_time, read from the fitted model
odd_one = np.exp(logit_one.params['wait_time'])
odd_one / (1 + odd_one)

In [79]:
# Odds ratio of a 1-star review for one extra standard deviation of wait_time
inc_odd_wait = np.exp(logit_one.params['wait_time'])
inc_odd_wait

`Logit 5️⃣`

In [95]:
logit_five = smf.logit(formula='dim_is_five_star ~ wait_time + distance_seller_customer + number_of_sellers + number_of_products', data=orders_scaled).fit()
logit_five.summary()


Optimization terminated successfully.
         Current function value: 0.639097
         Iterations 5


Dep. Variable:,dim_is_five_star,No. Observations:,95872.0
Model:,Logit,Df Residuals:,95867.0
Method:,MLE,Df Model:,4.0
Date:,"Thu, 02 Nov 2023",Pseudo R-squ.:,0.0547
Time:,11:06:10,Log-Likelihood:,-61272.0
converged:,True,LL-Null:,-64817.0
Covariance Type:,nonrobust,LLR p-value:,0.0

,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,0.3704,0.007,54.242,0.000,0.357,0.384
wait_time,-0.6582,0.009,-69.784,0.000,-0.677,-0.640
distance_seller_customer,0.1307,0.008,16.983,0.000,0.116,0.146
number_of_sellers,-0.1452,0.008,-18.483,0.000,-0.161,-0.130
number_of_products,-0.1300,0.008,-17.304,0.000,-0.145,-0.115


In [96]:
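# Odds multiplier for one extra standard deviation of wait_time, read from the fitted model
odd_five = np.exp(logit_five.params['wait_time'])
odd_five / (1 + odd_five)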

In [97]:
# Odds ratio of a 5-star review for one extra standard deviation of wait_time
inc_odd_wait = np.exp(logit_five.params['wait_time'])
inc_odd_wait
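💡 To read the logit coefficients as odds ratios, exponentiate them. The sketch below is a minimal example that copies the `Intercept` and `wait_time` coefficients from the two summaries above and assumes, as the use of `orders_scaled` suggests, that the features were standardized, so "one unit" means one standard deviation. The variable names are chosen for this example only.

```python
import math

# Coefficients copied from the logit_one and logit_five summaries above
coef_wait_one = 0.8886
coef_wait_five = -0.6582
intercept_one = -2.4943
intercept_five = 0.3704

# Odds ratio: factor applied to the odds for one extra standard deviation of wait_time
odds_ratio_one = math.exp(coef_wait_one)
odds_ratio_five = math.exp(coef_wait_five)

def sigmoid(x):
    """Convert log-odds into a probability."""
    return 1 / (1 + math.exp(-x))

# Predicted probability when every standardized feature sits at its mean (0)
baseline_one = sigmoid(intercept_one)
baseline_five = sigmoid(intercept_five)

print(f"1-star: odds x{odds_ratio_one:.2f} per std of wait_time, baseline p = {baseline_one:.3f}")
print(f"5-star: odds x{odds_ratio_five:.2f} per std of wait_time, baseline p = {baseline_five:.3f}")
```

With these numbers, one extra standard deviation of `wait_time` multiplies the odds of a 1-star review by roughly 2.4 and the odds of a 5-star review by roughly 0.52, all other features held constant.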

💡 It's time to analyse the results of these two logistic regressions:

- Interpret the partial coefficients in your own words.
- Check their statistical significances with `p-values`
- Do you notice any differences between `logit_one` and `logit_five` in terms of coefficient importances?

In [98]:
# Among the following sentences, store the ones that are true in the list below

a = "delay_vs_expected influences five_star ratings even more than one_star ratings"
b = "wait_time influences five_star ratings even more than one_star"

your_answer = [a]


🧪 __Test your code__

In [99]:
from nbresult import ChallengeResult

result = ChallengeResult('logit',
    answers = your_answer
)
result.write()
print(result.check())



platform darwin -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /Users/reecepalmer/.pyenv/versions/3.10.6/envs/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /Users/reecepalmer/Code/RPalmr/04-Decision-Science/04-Logistic-Regression/data-logit/tests
plugins: asyncio-0.19.0, dash-2.14.0, typeguard-2.13.3, anyio-3.6.2, hydra-core-1.3.2
asyncio: mode=strict
collecting ... collected 1 item

test_logit.py::TestLogit::test_question PASSED                           [100%]



💯 You can commit your code:

git add tests/logit.pickle

git commit -m 'Completed logit step'

git push origin master



<details>
    <summary>- <i>Explanations and advanced concepts </i> -</summary>


> _All other things being equal, the `delay factor` tends to increase the chances of getting stripped of the 5-star even more than it affects the chances of a 1-star review, probably because 1-star reviews really target bad products themselves, not bad deliveries._
    
❗️ However, to be totally rigorous, we have to be **more careful when comparing coefficients from two different models**, because **they might not be based on similar populations**!
    We have 2 sub-populations here (people who gave 1-stars and people who gave 5-stars), and they may exhibit intrinsically different behavior patterns. It may well be that "happy people" (who tend to give 5-stars easily) are less sensitive than "grumpy people" (who shoot 1-stars like Lucky Luke) when it comes to "delay" or "price"...

</details>


## Logistic vs. Linear ?

👉 Compare the coefficients obtained from:
- A `Logistic Regression` to explain `dim_is_five_star`
- A `Linear Regression` to explain `review_score` 

Make sure to use the same set of features for both regressions.  

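⚠️ Check that both sets of coefficients tell "the same story".

The coefficients tell a consistent story across all three models. A longer `wait_time` lowers both the probability of a 5-star review and the expected review score, and raises the probability of a 1-star review. `number_of_sellers` and `number_of_products` point in the same direction as `wait_time`, while `distance_seller_customer` has a smaller effect in the opposite direction: once `wait_time` is held constant, a larger distance goes with slightly better reviews.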

In [100]:
linear_model = smf.ols(formula="review_score ~ wait_time + distance_seller_customer + number_of_sellers + number_of_products", data=orders_scaled).fit()

logit_one_coefficients = logit_one.params
logit_five_coefficients = logit_five.params
linear_coefficients = linear_model.params

coefficient_comparison = pd.DataFrame({
    'Logit_One_Coefficients': logit_one_coefficients,
    'Logit_Five_Coefficients': logit_five_coefficients,
    'Linear_Regression_Coefficients': linear_coefficients,
})

print("Coefficient Comparison:")
print(coefficient_comparison)


Coefficient Comparison:
                          Logit_One_Coefficients  Logit_Five_Coefficients  \
Intercept                              -2.494327                 0.370397   
wait_time                               0.888580                -0.658178   
distance_seller_customer               -0.237066                 0.130664   
number_of_sellers                       0.182646                -0.145167   
number_of_products                      0.247010                -0.130000   

                          Linear_Regression_Coefficients  
Intercept                                       4.155509  
wait_time                                      -0.480912  
distance_seller_customer                        0.110718  
number_of_sellers                              -0.132171  
number_of_products                             -0.127928  
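💡 One simple way to check "the same story" is to compare the signs of the coefficients. The sketch below copies the values from the comparison table above; the dictionary layout and the `same_story` name are choices made for this example. A feature is consistent when it pushes `dim_is_five_star` and `review_score` in the same direction, and `dim_is_one_star` in the opposite one.

```python
# Coefficients copied from the comparison table above:
# (logit_one, logit_five, linear regression)
coefs = {
    "wait_time": (0.888580, -0.658178, -0.480912),
    "distance_seller_customer": (-0.237066, 0.130664, 0.110718),
    "number_of_sellers": (0.182646, -0.145167, -0.132171),
    "number_of_products": (0.247010, -0.130000, -0.127928),
}

same_story = {}
for name, (one, five, linear) in coefs.items():
    five_matches_linear = (five > 0) == (linear > 0)
    one_is_opposite = (one > 0) != (five > 0)
    same_story[name] = five_matches_linear and one_is_opposite

for name, consistent in same_story.items():
    print(f"{name}: {'consistent' if consistent else 'inconsistent'}")
```

Only the signs are compared here: the magnitudes are not directly comparable, since logit coefficients are expressed in log-odds and linear coefficients in review-score points.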


🏁 Congratulations! 

💾 Don't forget to commit and push your `logit.ipynb` notebook !