In [1]:
%load_ext autoreload
%autoreload 2

# `Logit` on Orders - A warm-up challenge (~1h)

## Select features

🎯 Let's figure out the impact of `wait_time` and `delay_vs_expected` on very good (5-star) and very bad (1-star) reviews.

👉 Using our `orders` training_set, we will run two `multivariate logistic regressions`:
- `logit_one` to predict `dim_is_one_star` 
- `logit_five` to predict `dim_is_five_star`.

 

In [26]:
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import math

👉 Import your dataset:

In [3]:
from olist.order import Order
orders = Order().get_training_data(with_distance_seller_customer=True)

👉 Select, in a list, the features you want to use:

⚠️ Make sure you are not creating data leakage (i.e. selecting features that are derived from the target)

In [5]:
orders.columns

Index(['order_id', 'wait_time', 'expected_wait_time', 'delay_vs_expected',
       'order_status', 'dim_is_five_star', 'dim_is_one_star', 'review_score',
       'number_of_products', 'number_of_sellers', 'price', 'freight_value',
       'distance_seller_customer'],
      dtype='object')

In [6]:
# YOUR CODE HERE
features = ["wait_time","delay_vs_expected", "number_of_products"]

🕵🏻 Check the `multicollinearity` of your features, using the `VIF index`.

* It shouldn't be too high (< 10 preferably) to ensure that we can trust the partial regression coefficients and their associated `p-values`
* Do not forget to standardize your data!
    * A `VIF Analysis` is made by regressing a feature vs. the other features...
    * So you want to `remove the effect of scale` so that your features have an equal importance before running any linear regression!
    
    
📚 <a href="https://www.statisticshowto.com/variance-inflation-factor/">Statistics How To - Variance Inflation Factor</a>

📚  <a href="https://online.stat.psu.edu/stat462/node/180/">PennState - Detecting Multicollinearity Using Variance Inflation Factors</a>
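
To make the definition concrete, here is a minimal, library-free sketch of the VIF in the two-feature case. When one feature is regressed on a single other feature, the R² of that regression equals the squared Pearson correlation, so VIF = 1 / (1 - r²). The helper names (`pearson_r`, `vif_two_features`) and the toy numbers are illustrative assumptions, not part of the challenge.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def vif_two_features(x, y):
    """VIF of x given y (two-feature case only): 1 / (1 - r^2)."""
    r = pearson_r(x, y)
    return 1.0 / (1.0 - r ** 2)


# Toy data (made up for illustration): the second feature grows with the first.
toy_wait = [5.0, 8.0, 12.0, 20.0, 30.0]
toy_delay = [0.0, 0.0, 0.0, 2.0, 9.0]
print(round(vif_two_features(toy_wait, toy_delay), 2))

# Uncorrelated features give the minimum possible VIF of 1.
print(vif_two_features([1, 2, 3, 4], [1, -1, -1, 1]))
```

With more than two features, r² is replaced by the R² of regressing one feature on all the others. As far as I know, `variance_inflation_factor` in `statsmodels` runs that regression on the columns you pass without adding an intercept, which is one more reason to center (standardize) the features first.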

⚖️ Standardizing:

In [12]:
# YOUR CODE HERE

def standardize(L, dfi):
    df = dfi.copy()
    for col in L:
        mu = df[col].mean()
        sigma = df[col].std()
        df[col] = (df[col]-mu)/sigma
    return df
    


In [15]:
features = ["wait_time","delay_vs_expected", "number_of_products"]
df_feat = standardize(features, orders)
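
A quick sanity check for any standardization step is that each transformed column ends up with a mean of about 0 and a standard deviation of about 1. The sketch below shows this on a plain Python list; `zscore` and the toy values are hypothetical and only mirror what `standardize` does column by column.

```python
import statistics


def zscore(values):
    """Standardize a list: subtract the mean, divide by the sample std (ddof=1)."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)  # sample standard deviation, like pandas' default .std()
    return [(v - mu) / sigma for v in values]


toy_wait_time = [8.4, 13.8, 9.4, 13.2, 2.9]  # made-up values, in days
z = zscore(toy_wait_time)

# After standardizing, the mean is ~0 and the sample std is ~1.
print(round(statistics.mean(z), 6), round(statistics.stdev(z), 6))
```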

In [14]:
orders

Unnamed: 0,order_id,wait_time,expected_wait_time,delay_vs_expected,order_status,dim_is_five_star,dim_is_one_star,review_score,number_of_products,number_of_sellers,price,freight_value,distance_seller_customer
0,e481f51cbdc54678b7cc49136f2d6af7,8.436574,15.544063,0.0,delivered,0,0,4,1,1,29.99,8.72,18.063837
1,53cdb2fc8bc7dce0b6741e2150273451,13.782037,19.137766,0.0,delivered,0,0,4,1,1,118.70,22.76,856.292580
2,47770eb9100c2d0c44946d9cf07ec65d,9.394213,26.639711,0.0,delivered,1,0,5,1,1,159.90,19.22,514.130333
3,949d5b44dbf5de918fe9c16f97b45f8a,13.208750,26.188819,0.0,delivered,1,0,5,1,1,45.00,27.20,1822.800366
4,ad21c59c0840e6cb83a9ceb5573f8159,2.873877,12.112049,0.0,delivered,1,0,5,1,1,19.90,8.72,30.174037
...,...,...,...,...,...,...,...,...,...,...,...,...,...
95875,9c5dedf39a927c1b2549525ed64a053c,8.218009,18.587442,0.0,delivered,1,0,5,1,1,72.00,13.08,69.481037
95876,63943bddc261676b46f01ca7ac2f7bd8,22.193727,23.459051,0.0,delivered,0,0,4,1,1,174.90,20.10,474.098245
95877,83c1379a015df1e13d02aae0204711ab,24.859421,30.384225,0.0,delivered,1,0,5,1,1,205.99,65.02,968.051192
95878,11c177c8e97725db2631073c19f07b62,17.086424,37.105243,0.0,delivered,0,0,2,2,1,359.98,81.18,370.146853


👉 Run your VIF Analysis to check for potential multicollinearity:

In [18]:
# YOUR CODE HERE
from statsmodels.stats.outliers_influence import variance_inflation_factor as vif
df = pd.DataFrame()
df["features"] = df_feat[features].columns
df["vif_index"] = [vif(df_feat[features].values, i) for i in range(df_feat[features].shape[1])]
round(df.sort_values(by="vif_index", ascending = False),2)


Unnamed: 0,features,vif_index
0,wait_time,1.98
1,delay_vs_expected,1.97
2,number_of_products,1.0


## Logistic Regressions

👉 Fit two `Logistic Regression` models:
- `logit_one` to predict `dim_is_one_star` 
- `logit_five` to predict `dim_is_five_star`.

`Logit 1️⃣`

In [22]:
# YOUR CODE HERE
logit_one = smf.logit(formula = 'dim_is_one_star ~ wait_time + delay_vs_expected + number_of_products', \
                      data = df_feat).fit()
logit_one.summary()

Optimization terminated successfully.
         Current function value: 0.277057
         Iterations 7


0,1,2,3
Dep. Variable:,dim_is_one_star,No. Observations:,95872.0
Model:,Logit,Df Residuals:,95868.0
Method:,MLE,Df Model:,3.0
Date:,"Thu, 27 Jan 2022",Pseudo R-squ.:,0.1339
Time:,11:29:32,Log-Likelihood:,-26562.0
converged:,True,LL-Null:,-30669.0
Covariance Type:,nonrobust,LLR p-value:,0.0

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,-2.4402,0.013,-192.083,0.000,-2.465,-2.415
wait_time,0.5783,0.015,39.338,0.000,0.549,0.607
delay_vs_expected,0.3416,0.018,19.175,0.000,0.307,0.377
number_of_products,0.3008,0.008,35.440,0.000,0.284,0.317


`Logit 5️⃣`

In [23]:
# YOUR CODE HERE
logit_five = smf.logit(formula = 'dim_is_five_star ~ wait_time + delay_vs_expected + number_of_products', \
                      data = df_feat).fit()
logit_five.summary()

Optimization terminated successfully.
         Current function value: 0.639466
         Iterations 7


0,1,2,3
Dep. Variable:,dim_is_five_star,No. Observations:,95872.0
Model:,Logit,Df Residuals:,95868.0
Method:,MLE,Df Model:,3.0
Date:,"Thu, 27 Jan 2022",Pseudo R-squ.:,0.05416
Time:,11:30:31,Log-Likelihood:,-61307.0
converged:,True,LL-Null:,-64817.0
Covariance Type:,nonrobust,LLR p-value:,0.0

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,0.3371,0.007,47.166,0.000,0.323,0.351
wait_time,-0.4470,0.010,-44.530,0.000,-0.467,-0.427
delay_vs_expected,-0.4952,0.023,-21.378,0.000,-0.541,-0.450
number_of_products,-0.1740,0.007,-23.717,0.000,-0.188,-0.160
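
The `Pseudo R-squ.` reported in both summaries is, in `statsmodels`, McFadden's pseudo R-squared: `1 - Log-Likelihood / LL-Null`. The sketch below recomputes it from the rounded values printed in the tables, so small differences in the last digit are expected. It also checks that the `Current function value` in the fit log appears to be the average negative log-likelihood per observation. The dictionary layout and function name are assumptions made for illustration.

```python
# Values copied from the two summary tables above (already rounded by statsmodels).
summaries = {
    "logit_one": {"llf": -26562.0, "llnull": -30669.0, "nobs": 95872},
    "logit_five": {"llf": -61307.0, "llnull": -64817.0, "nobs": 95872},
}


def mcfadden_pseudo_r2(llf, llnull):
    """McFadden's pseudo R-squared: 1 - (model log-likelihood / null log-likelihood)."""
    return 1.0 - llf / llnull


for name, s in summaries.items():
    pseudo_r2 = mcfadden_pseudo_r2(s["llf"], s["llnull"])
    avg_neg_loglike = -s["llf"] / s["nobs"]  # compare with "Current function value"
    print(name, round(pseudo_r2, 4), round(avg_neg_loglike, 4))
```

Both models explain only a modest share of the null deviance, and the one-star model fits noticeably better than the five-star one with these three features.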


💡 It's time to analyse the results of these two logistic regressions:

- Interpret the partial coefficients in your own words.
- Check their statistical significance with `p-values`
- Do you notice any differences between `logit_one` and `logit_five` in terms of coefficient importance?
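
One common way to read logit coefficients is through odds ratios: a coefficient `c` means that a one-unit increase in the feature multiplies the odds of the outcome by `exp(c)`. Because the features were standardized, one unit here is one standard deviation. The sketch below uses the rounded coefficients from the two summary tables above; the dictionary and function names are illustrative.

```python
import math

# Coefficients copied from the two summary tables above.
logit_one_coefs = {"wait_time": 0.5783, "delay_vs_expected": 0.3416, "number_of_products": 0.3008}
logit_five_coefs = {"wait_time": -0.4470, "delay_vs_expected": -0.4952, "number_of_products": -0.1740}


def odds_ratios(coefs):
    """Map each log-odds coefficient to its odds ratio, exp(coef)."""
    return {name: math.exp(value) for name, value in coefs.items()}


for label, coefs in [("one star", logit_one_coefs), ("five star", logit_five_coefs)]:
    for name, ratio in odds_ratios(coefs).items():
        # ratio > 1: the odds go up; ratio < 1: the odds go down.
        print(f"{label:9s} {name:20s} odds x {ratio:.2f}")
```

Comparing absolute coefficient sizes across the two models, `delay_vs_expected` is larger in `logit_five` (0.4952 vs. 0.3416) while `wait_time` is larger in `logit_one` (0.5783 vs. 0.4470).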

In [39]:
def odd_to_p(odd):
    # Convert odds to a probability: p = odds / (1 + odds)
    return odd/(1+odd)

odd_to_p(0.62)

0.38271604938271603

In [40]:
(odd_to_p(math.exp(logit_one.params.delay_vs_expected)), odd_to_p(math.exp(logit_five.params.delay_vs_expected)))

(0.5845897139382945, 0.37866838782538803)

In [41]:
(odd_to_p(math.exp(logit_one.params.wait_time)), odd_to_p(math.exp(logit_five.params.wait_time)))

(0.6406746434584986, 0.3900855506778553)
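
Note that `odd_to_p(math.exp(coef))` is the sigmoid of the coefficient, so the values above can be read as the probability reached when starting from a 50% baseline and adding one standard deviation to the feature. The effect on the probability depends on the starting point, though. As a sketch, using the rounded intercept and `wait_time` coefficient from the `logit_one` summary, here is the same one-standard-deviation shift starting from the model's own baseline (all standardized features at 0, i.e. at their mean):

```python
import math


def sigmoid(log_odds):
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))


# Rounded values copied from the logit_one summary table.
intercept = -2.4402
coef_wait_time = 0.5783

p_baseline = sigmoid(intercept)                      # all standardized features at 0
p_plus_one_sd = sigmoid(intercept + coef_wait_time)  # wait_time one std above its mean

print(round(p_baseline, 3), round(p_plus_one_sd, 3))
```

The odds are multiplied by the same factor in both cases, but the change in probability is much smaller near a low baseline than near 50%.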

In [42]:
# Among the following sentences, store the ones that are true in the list below

a = "delay_vs_expected influences five_star ratings even more than one_star ratings"
b = "wait_time influences five_star ratings even more than one_star"

your_answer = [a]

🧪 __Test your code__

In [43]:
from nbresult import ChallengeResult

result = ChallengeResult('logit',
    answers = your_answer
)
result.write()
print(result.check())

platform darwin -- Python 3.8.12, pytest-6.2.5, py-1.11.0, pluggy-1.0.0 -- /Users/humbert/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /Users/humbert/code/HumbertMonnot/data-challenges/04-Decision-Science/04-Logistic-Regression/01-Logit
plugins: anyio-3.4.0, dash-2.0.0
collecting ... collected 1 item

tests/test_logit.py::TestLogit::test_question PASSED                     [100%]



💯 You can commit your code:

git add tests/logit.pickle

git commit -m 'Completed logit step'

git push origin master


<details>
    <summary>- <i>Explanations and advanced concepts </i> -</summary>


> _All other things being equal, the `delay factor` tends to increase the chances of losing a 5-star review even more than it affects the chances of getting a 1-star review. Probably because 1-star reviews really target bad products themselves, not bad deliveries._
    
❗️ However, to be totally rigorous, we have to be **more careful when comparing coefficients from two different models**, because **they might not be based on similar populations**!
    We have 2 sub-populations here (people who gave 1-star reviews and people who gave 5-star reviews), and they may exhibit intrinsically different behavior patterns. It may well be that "happy people" (who tend to give 5 stars easily) are less sensitive than "grumpy people" (who shoot 1-star reviews like Lucky Luke) when it comes to "delay" or "price"...

</details>


## Logistic vs. Linear?

👉 Compare:
- the regression coefficients obtained from the `Logistic Regression`
- with the regression coefficients obtained through a `Linear Regression`
- on `review_score`, using the same features.

⚠️ Check that both sets of coefficients tell "the same story".

> YOUR ANSWER HERE

In [44]:
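# YOUR CODE HERE
# Note: this fits OLS on the binary `dim_is_one_star` target (a linear probability model),
# not on `review_score` as the prompt above suggests.
lin_1 = smf.ols(formula = 'dim_is_one_star ~ wait_time + delay_vs_expected + number_of_products', \
                      data = df_feat).fit()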
lin_1.summary()

0,1,2,3
Dep. Variable:,dim_is_one_star,R-squared:,0.119
Model:,OLS,Adj. R-squared:,0.119
Method:,Least Squares,F-statistic:,4312.0
Date:,"Thu, 27 Jan 2022",Prob (F-statistic):,0.0
Time:,11:48:30,Log-Likelihood:,-13527.0
No. Observations:,95872,AIC:,27060.0
Df Residuals:,95868,BIC:,27100.0
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,0.0977,0.001,108.512,0.000,0.096,0.099
wait_time,0.0626,0.001,49.513,0.000,0.060,0.065
delay_vs_expected,0.0410,0.001,32.439,0.000,0.039,0.044
number_of_products,0.0374,0.001,41.522,0.000,0.036,0.039

0,1,2,3
Omnibus:,45225.968,Durbin-Watson:,2.004
Prob(Omnibus):,0.0,Jarque-Bera (JB):,264784.61
Skew:,2.246,Prob(JB):,0.0
Kurtosis:,9.79,Cond. No.,2.39


In [45]:
lin_5 = smf.ols(formula = 'dim_is_five_star ~ wait_time + delay_vs_expected + number_of_products', \
                      data = df_feat).fit()
lin_5.summary()

0,1,2,3
Dep. Variable:,dim_is_five_star,R-squared:,0.061
Model:,OLS,Adj. R-squared:,0.061
Method:,Least Squares,F-statistic:,2070.0
Date:,"Thu, 27 Jan 2022",Prob (F-statistic):,0.0
Time:,11:49:10,Log-Likelihood:,-64919.0
No. Observations:,95872,AIC:,129800.0
Df Residuals:,95868,BIC:,129900.0
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,0.5921,0.002,384.945,0.000,0.589,0.595
wait_time,-0.1211,0.002,-56.011,0.000,-0.125,-0.117
delay_vs_expected,0.0075,0.002,3.485,0.000,0.003,0.012
number_of_products,-0.0378,0.002,-24.574,0.000,-0.041,-0.035

0,1,2,3
Omnibus:,472463.795,Durbin-Watson:,1.997
Prob(Omnibus):,0.0,Jarque-Bera (JB):,12559.925
Skew:,-0.382,Prob(JB):,0.0
Kurtosis:,1.4,Cond. No.,2.39
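
One simple way to check whether the two families of models "tell the same story" is to compare the signs of their coefficients, since the magnitudes are on different scales (log-odds for the logit models, probability points for OLS on a binary target). The sketch below uses the coefficients exactly as printed in the four summary tables above; the dictionary layout and the `sign_disagreements` helper are illustrative assumptions.

```python
# Coefficients copied from the four summary tables above.
coefs = {
    "logit_one": {"wait_time": 0.5783, "delay_vs_expected": 0.3416, "number_of_products": 0.3008},
    "lin_1": {"wait_time": 0.0626, "delay_vs_expected": 0.0410, "number_of_products": 0.0374},
    "logit_five": {"wait_time": -0.4470, "delay_vs_expected": -0.4952, "number_of_products": -0.1740},
    "lin_5": {"wait_time": -0.1211, "delay_vs_expected": 0.0075, "number_of_products": -0.0378},
}


def sign_disagreements(logit_coefs, ols_coefs):
    """Return the features whose coefficients have opposite signs in the two models."""
    return [name for name in logit_coefs if logit_coefs[name] * ols_coefs[name] < 0]


print(sign_disagreements(coefs["logit_one"], coefs["lin_1"]))    # → []
print(sign_disagreements(coefs["logit_five"], coefs["lin_5"]))   # → ['delay_vs_expected']
```

With the values as printed, the one-star models agree on every sign, while `delay_vs_expected` is negative in `logit_five` but slightly positive in `lin_5`. That discrepancy is worth investigating before concluding that both models tell the same story.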


🏁 Congratulations! 

💾 Don't forget to commit and push your `logit.ipynb` notebook!