In [None]:
%load_ext autoreload
%autoreload 2

# `Logit` on Orders - A warm-up challenge (~1h)

## Select features

🎯 Let's figure out the impact of `wait_time` and `delay_vs_expected` on very `good/bad reviews`

👉 Using our `orders` training_set, we will run two `multivariate logistic regressions`:
- `logit_one` to predict `dim_is_one_star` 
- `logit_five` to predict `dim_is_five_star`.

 

In [4]:
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

👉 Import your dataset:

In [5]:
from olist.order import Order
orders = Order().get_training_data(with_distance_seller_customer=True)

👉 Select in a list which features you want to use:

⚠️ Make sure you are not creating data leakage (i.e. selecting features that are derived from the target)

💡 To isolate the impact of `wait_time` and `delay_vs_expected`, we need to control for the impact of other features. Include in your list all features that may be relevant.

In [14]:
print("Available columns:")
print(orders.columns.tolist())

Available columns:
['order_id', 'wait_time', 'expected_wait_time', 'delay_vs_expected', 'order_status', 'dim_is_five_star', 'dim_is_one_star', 'review_score', 'number_of_items', 'number_of_sellers', 'price', 'freight_value']


In [11]:
features = [
    'wait_time', 
    'delay_vs_expected', 
    'price',
    'freight_value',
]

X = orders[features].dropna()
y_one_star = orders.loc[X.index, 'dim_is_one_star']
y_five_star = orders.loc[X.index, 'dim_is_five_star']
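
A quick guard against the data leakage mentioned above can be sketched as follows. It assumes that `review_score`, `dim_is_one_star` and `dim_is_five_star` are the only columns derived from the review (this is read from the column list above, not confirmed by the `olist` package), and the name `target_derived` is purely illustrative.

```python
# Columns that are the target itself or computed from the review
# (assumption based on the column list printed above)
target_derived = {"review_score", "dim_is_one_star", "dim_is_five_star"}

# Same feature list as in the cell above
features = ["wait_time", "delay_vs_expected", "price", "freight_value"]

# Any overlap would mean the model can "see" the answer
leaky = sorted(target_derived.intersection(features))
assert not leaky, f"Leaky features selected: {leaky}"
print("No target-derived feature selected.")
```

If a target-derived column ever slips into `features`, the assertion fails and names it.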

🕵🏻 Check the `multicollinearity` of your features, using the `VIF index`.

* It shouldn't be too high (< 10 preferably) to ensure that we can trust the partial regression coefficients and their associated `p-values`
* Do not forget to standardize your data!
    * A `VIF Analysis` is made by regressing a feature vs. the other features...
    * So you want to `remove the effect of scale` so that your features have an equal importance before running any linear regression!
    
    
📚 <a href="https://www.statisticshowto.com/variance-inflation-factor/">Statistics How To - Variance Inflation Factor</a>

📚  <a href="https://online.stat.psu.edu/stat462/node/180/">PennState - Detecting Multicollinearity Using Variance Inflation Factors</a>

⚖️ Standardizing:

In [15]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = pd.DataFrame(
    scaler.fit_transform(X), 
    columns=X.columns, 
    index=X.index
)

👉 Run your VIF Analysis to analyze the potential multicollinearities:

In [16]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif_data = pd.DataFrame()
vif_data["Feature"] = X_scaled.columns
vif_data["VIF"] = [variance_inflation_factor(X_scaled.values, i) 
                   for i in range(X_scaled.shape[1])]
print(vif_data)

             Feature       VIF
0          wait_time  2.070003
1  delay_vs_expected  2.012947
2              price  1.203142
3      freight_value  1.255015
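
To make the idea of "regressing a feature vs. the other features" concrete, here is a minimal sketch that computes the VIF by hand as `1 / (1 - R²)`. It uses only `numpy` and synthetic data, not the `orders` dataset. `manual_vif` is a hypothetical helper written for this illustration, not a `statsmodels` function. On standardized data it should closely match `variance_inflation_factor`.

```python
import numpy as np

def manual_vif(X):
    """Return the VIF of each column of X, computed as 1 / (1 - R^2)."""
    X = np.asarray(X, dtype=float)
    # Standardize: zero mean and unit variance for every column
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    vifs = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Least-squares fit of column j on the remaining columns
        # (no intercept needed because every column is centered)
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        residuals = target - others @ beta
        r_squared = 1 - (residuals ** 2).sum() / (target ** 2).sum()
        vifs.append(1 / (1 - r_squared))
    return np.array(vifs)

# Synthetic example: b is built from a, c is independent of both
rng = np.random.default_rng(0)
a = rng.normal(size=1000)
b = a + rng.normal(scale=0.5, size=1000)
c = rng.normal(size=1000)

vifs = manual_vif(np.column_stack([a, b, c]))
print(vifs)
```

The two correlated columns get a noticeably higher VIF than the independent one, which stays close to 1. The same reading applies to the table above: all four features are well below the threshold of 10.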


## Logistic Regressions

👉 Fit two `Logistic Regression` models:
- `logit_one` to predict `dim_is_one_star` 
- `logit_five` to predict `dim_is_five_star`.

`Logit 1️⃣`

In [18]:
X_with_const = sm.add_constant(X_scaled)

logit_one = sm.Logit(y_one_star, X_with_const).fit()
print(logit_one.summary())

Optimization terminated successfully.
         Current function value: 0.282085
         Iterations 7
                           Logit Regression Results                           
Dep. Variable:        dim_is_one_star   No. Observations:                96353
Model:                          Logit   Df Residuals:                    96348
Method:                           MLE   Df Model:                            4
Date:                Wed, 17 Sep 2025   Pseudo R-squ.:                  0.1179
Time:                        14:33:16   Log-Likelihood:                -27180.
converged:                       True   LL-Null:                       -30814.
Covariance Type:            nonrobust   LLR p-value:                     0.000
                        coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------
const                -2.4066      0.012   -194.029      0.000      -2.431      -2.382
wait_tim

`Logit 5️⃣`

In [19]:
logit_five = sm.Logit(y_five_star, X_with_const).fit()
print(logit_five.summary())

Optimization terminated successfully.
         Current function value: 0.642535
         Iterations 7
                           Logit Regression Results                           
Dep. Variable:       dim_is_five_star   No. Observations:                96353
Model:                          Logit   Df Residuals:                    96348
Method:                           MLE   Df Model:                            4
Date:                Wed, 17 Sep 2025   Pseudo R-squ.:                 0.04958
Time:                        14:34:25   Log-Likelihood:                -61910.
converged:                       True   LL-Null:                       -65140.
Covariance Type:            nonrobust   LLR p-value:                     0.000
                        coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------
const                 0.3377      0.007     47.562      0.000       0.324       0.352
wait_tim

💡 It's time to analyse the results of these two logistic regressions:

- Interpret the partial coefficients in your own words.
- Check their statistical significances with `p-values`
- Do you notice any differences between `logit_one` and `logit_five` in terms of coefficient importances?

In [20]:
# Compare coefficients
comparison = pd.DataFrame({
    'one_star_coef': logit_one.params,
    'five_star_coef': logit_five.params,
    'one_star_pvalue': logit_one.pvalues,
    'five_star_pvalue': logit_five.pvalues
})
print(comparison)


                   one_star_coef  five_star_coef  one_star_pvalue  \
const                  -2.406562        0.337681     0.000000e+00   
wait_time               0.523720       -0.423737    1.098684e-273   
delay_vs_expected       0.362836       -0.498358     3.931975e-92   
price                   0.043565        0.023619     3.038135e-05   
freight_value           0.111959       -0.062670     1.071778e-26   

                   five_star_pvalue  
const                  0.000000e+00  
wait_time              0.000000e+00  
delay_vs_expected     2.932241e-104  
price                  1.872614e-03  
freight_value          7.398488e-15  
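
One way to put the partial coefficients into words is to convert them to odds ratios with `exp(coef)`. The sketch below copies the coefficients from the table above. Because the models were fitted on standardized features, each ratio is the multiplicative change in the odds for a one-standard-deviation increase in that feature, all else being equal. The dictionary names are illustrative only.

```python
import math

# Coefficients copied from the comparison table above (standardized features)
coefs = {
    "wait_time":         {"one_star": 0.523720, "five_star": -0.423737},
    "delay_vs_expected": {"one_star": 0.362836, "five_star": -0.498358},
    "price":             {"one_star": 0.043565, "five_star": 0.023619},
    "freight_value":     {"one_star": 0.111959, "five_star": -0.062670},
}

# exp(coef) = factor applied to the odds per +1 standard deviation
odds_ratios = {
    feature: {model: math.exp(coef) for model, coef in by_model.items()}
    for feature, by_model in coefs.items()
}

for feature, ratios in odds_ratios.items():
    print(f"{feature:<18} one_star x{ratios['one_star']:.3f}   "
          f"five_star x{ratios['five_star']:.3f}")
```

A ratio above 1 means the feature raises the odds of that review outcome, and a ratio below 1 means it lowers them. Comparing the absolute values of the coefficients shows which model reacts more strongly to a given feature.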


In [21]:
# Among the following sentences, store the ones that are true in the list below

a = "delay_vs_expected influences five_star ratings even more than one_star ratings"
b = "wait_time influences five_star ratings even more than one_star"

your_answer = [a]

🧪 __Test your code__

In [22]:
from nbresult import ChallengeResult

result = ChallengeResult('logit',
    answers = your_answer
)
result.write()
print(result.check())


platform linux -- Python 3.12.9, pytest-8.3.4, pluggy-1.5.0 -- /home/emtenan/.pyenv/versions/3.12.9/envs/lewagon/bin/python
cachedir: .pytest_cache
rootdir: /home/emtenan/code/Emtenan12/04-Decision-Science/04-Logistic-Regression/data-logit/tests
plugins: typeguard-4.4.2, anyio-4.10.0
collecting ... collected 1 item

test_logit.py::TestLogit::test_question PASSED                           [100%]



💯 You can commit your code:

git add tests/logit.pickle

git commit -m 'Completed logit step'

git push origin master



<details>
    <summary>- <i>Explanations and advanced concepts </i> -</summary>


> _All other things being equal, the `delay factor` tends to increase the chances of getting stripped of the 5-star even more than it affects the chances of 1-star reviews. Probably because 1-stars are really targeting bad products themselves, not bad deliveries_
    
❗️ However, to be totally rigorous, we have to be **more careful when comparing coefficients from two different models**, because **they might not be based on similar populations**!
    We have 2 sub-populations here (people who gave 1-stars and people who gave 5-stars), and they may exhibit intrinsically different behavior patterns. It may well be that "happy people" (who tend to give 5-stars easily) are less sensitive than "grumpy people" (who shoot 1-stars like Lucky Luke) when it comes to "delay" or "price"...

</details>


🏁 Congratulations! 

💾 Don't forget to commit and push your `logit.ipynb` notebook!