# Logit Orders - A warm-up challenge (~1h)

Let's figure out the impact of `wait_time` and `delay_vs_expected` on very good and very bad reviews.

Using our `orders` training_set, we will run two multivariate logistic regressions (`logit_one` and `logit_five`) to predict `dim_is_one_star` and `dim_is_five_star` respectively.

 

In [1]:
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

❓ Import your dataset

In [2]:
# import orders dataset
from olist.order import Order
orders = Order().get_training_data(with_distance_seller_customer=True)

In [9]:
orders['order_status'].min()

'delivered'

❓ Select which features you want to use (avoid data-leaks)

In [47]:
features = ['wait_time','expected_wait_time','delay_vs_expected','dim_is_five_star','dim_is_one_star',
            'review_score','number_of_products','number_of_sellers','price','freight_value','distance_seller_customer']

data_leaks = ['order_status']

❓ Check the multicollinearity of your features using the `VIF index`. It shouldn't be too high (< 10 preferably) to ensure we can trust the partial regression coefficients and their associated `p-values`.

In [48]:
#All = ['wait_time', 'expected_wait_time', 'delay_vs_expected', 'number_of_products', 'number_of_sellers', 'price', 'freight_value']
## Collinearity check

from statsmodels.stats.outliers_influence import variance_inflation_factor

X_variables = orders[features]

vif_data = pd.DataFrame()
vif_data["feature"] = X_variables.columns
vif_data["VIF"] = [variance_inflation_factor(X_variables.values, i) for i in range(len(X_variables.columns))]
vif_data



| | feature | VIF |
|---|---|---|
| 0 | wait_time | 8.736345 |
| 1 | expected_wait_time | 13.234131 |
| 2 | delay_vs_expected | 2.564483 |
| 3 | dim_is_five_star | 8.008839 |
| 4 | dim_is_one_star | 3.46063 |
| 5 | review_score | 59.946357 |
| 6 | number_of_products | 7.569385 |
| 7 | number_of_sellers | 45.194485 |
| 8 | price | 1.730428 |
| 9 | freight_value | 3.557825 |
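
For intuition, the VIF of a feature is `1 / (1 - R²)`, where `R²` comes from regressing that feature on all the others. The sketch below computes it directly from that definition with NumPy, on synthetic data (the arrays `a`, `b`, `c` and the helper `vif` are made up for illustration, not part of the challenge). Unlike the cell above, it adds an intercept to each auxiliary regression; as far as I know, statsmodels' `variance_inflation_factor` uses the matrix exactly as given, so a constant column usually has to be added by hand.

```python
import numpy as np

def vif(X):
    """VIF of each column of X: 1 / (1 - R^2) from regressing it on the others plus an intercept."""
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        # Design matrix: intercept + every column except column j
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1 - resid.var() / y.var()
        out.append(1 / (1 - r2))
    return np.array(out)

# Synthetic example: c is almost a copy of a, b is independent
rng = np.random.default_rng(0)
a = rng.normal(size=1000)
b = rng.normal(size=1000)
c = a + 0.1 * rng.normal(size=1000)

vifs = vif(np.column_stack([a, b, c]))
print(vifs)  # a and c should come out far above 10, b close to 1
```

On this toy data the two near-duplicate columns get a very large VIF while the independent one stays near 1, which is the pattern to look for in the table above.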


In [49]:
features = ['wait_time','expected_wait_time','delay_vs_expected','dim_is_five_star','dim_is_one_star'
            ,'number_of_products','price','freight_value','distance_seller_customer']

data_leaks = ['order_status']

colinearity =['review_score', 'number_of_sellers']

❓ Fit two LOGIT models (`logit_one` and `logit_five`) to predict `dim_is_one_star` and `dim_is_five_star`

In [50]:
#Info definition

logit_models = ['dim_is_five_star','dim_is_one_star']

features_mask = ['wait_time','expected_wait_time','delay_vs_expected',
            'number_of_products','price','freight_value','distance_seller_customer']

features_mask_2 = ['wait_time','expected_wait_time','delay_vs_expected',
            'number_of_products','price','distance_seller_customer']

In [52]:
logit_one = smf.logit(formula=f'dim_is_one_star ~ {("+").join(features_mask)}', data=orders).fit()
logit_one.summary()

Optimization terminated successfully.
         Current function value: 0.281278
         Iterations 7


| | | | |
|---|---|---|---|
| Dep. Variable: | dim_is_one_star | No. Observations: | 96525.0 |
| Model: | Logit | Df Residuals: | 96517.0 |
| Method: | MLE | Df Model: | 7.0 |
| Date: | Thu, 05 Aug 2021 | Pseudo R-squ.: | 0.1382 |
| Time: | 12:12:25 | Log-Likelihood: | -27150.0 |
| converged: | True | LL-Null: | -31505.0 |
| Covariance Type: | nonrobust | LLR p-value: | 0.0 |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -3.5530 | 0.041 | -85.836 | 0.000 | -3.634 | -3.472 |
| wait_time | 0.0837 | 0.002 | 41.269 | 0.000 | 0.080 | 0.088 |
| expected_wait_time | -0.0205 | 0.002 | -11.202 | 0.000 | -0.024 | -0.017 |
| delay_vs_expected | 0.0339 | 0.004 | 7.912 | 0.000 | 0.026 | 0.042 |
| number_of_products | 0.5638 | 0.018 | 30.571 | 0.000 | 0.528 | 0.600 |
| price | 0.0003 | 5.24e-05 | 4.816 | 0.000 | 0.000 | 0.000 |
| freight_value | -0.0005 | 0.001 | -0.870 | 0.384 | -0.002 | 0.001 |
| distance_seller_customer | -0.0002 | 2.35e-05 | -9.226 | 0.000 | -0.000 | -0.000 |


In [53]:
logit_one = smf.logit(formula=f'dim_is_one_star ~ {("+").join(features_mask_2)}', data=orders).fit()
logit_one.summary()

Optimization terminated successfully.
         Current function value: 0.281282
         Iterations 7


| | | | |
|---|---|---|---|
| Dep. Variable: | dim_is_one_star | No. Observations: | 96525.0 |
| Model: | Logit | Df Residuals: | 96518.0 |
| Method: | MLE | Df Model: | 6.0 |
| Date: | Thu, 05 Aug 2021 | Pseudo R-squ.: | 0.1382 |
| Time: | 12:12:28 | Log-Likelihood: | -27151.0 |
| converged: | True | LL-Null: | -31505.0 |
| Covariance Type: | nonrobust | LLR p-value: | 0.0 |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -3.5471 | 0.041 | -86.846 | 0.000 | -3.627 | -3.467 |
| wait_time | 0.0836 | 0.002 | 41.262 | 0.000 | 0.080 | 0.088 |
| expected_wait_time | -0.0206 | 0.002 | -11.262 | 0.000 | -0.024 | -0.017 |
| delay_vs_expected | 0.0340 | 0.004 | 7.931 | 0.000 | 0.026 | 0.042 |
| number_of_products | 0.5559 | 0.016 | 34.575 | 0.000 | 0.524 | 0.587 |
| price | 0.0002 | 4.88e-05 | 4.825 | 0.000 | 0.000 | 0.000 |
| distance_seller_customer | -0.0002 | 2.29e-05 | -9.635 | 0.000 | -0.000 | -0.000 |
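
To see what these coefficients mean in practice, the sketch below plugs the rounded values from the summary just above into the logistic function and compares two orders. Both orders are hypothetical (the feature values are invented for illustration), and because `price` and `distance_seller_customer` are heavily rounded in the printed table, the probabilities are only indicative.

```python
import math

# Rounded coefficients copied from the logit_one summary above
coef = {
    "Intercept": -3.5471,
    "wait_time": 0.0836,
    "expected_wait_time": -0.0206,
    "delay_vs_expected": 0.0340,
    "number_of_products": 0.5559,
    "price": 0.0002,
    "distance_seller_customer": -0.0002,
}

def predict_proba(order):
    """Probability of a 1-star review: sigmoid of the linear predictor."""
    z = coef["Intercept"] + sum(coef[name] * value for name, value in order.items())
    return 1 / (1 + math.exp(-z))

# Two hypothetical orders that differ only in wait_time and delay_vs_expected
on_time = {"wait_time": 10, "expected_wait_time": 20, "delay_vs_expected": 0,
           "number_of_products": 1, "price": 100, "distance_seller_customer": 500}
late = {**on_time, "wait_time": 40, "delay_vs_expected": 20}

p_on_time = predict_proba(on_time)
p_late = predict_proba(late)
print(f"on time: {p_on_time:.3f}, late: {p_late:.3f}")
```

The late order should come out with a much higher predicted chance of a 1-star review, which matches the positive signs on `wait_time` and `delay_vs_expected`.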


❓Interpret your results:

- Interpret the partial coefficients in your own words.
- Check their statistical significance with `p-values`
- Do you notice any differences between `logit_one` and `logit_five` in terms of coefficient importance?

In [54]:
logit_five = smf.logit(formula=f'dim_is_five_star ~ {("+").join(features_mask)}', data=orders).fit()
logit_five.summary()

Optimization terminated successfully.
         Current function value: 0.638984
         Iterations 7


| | | | |
|---|---|---|---|
| Dep. Variable: | dim_is_five_star | No. Observations: | 96525.0 |
| Model: | Logit | Df Residuals: | 96517.0 |
| Method: | MLE | Df Model: | 7.0 |
| Date: | Thu, 05 Aug 2021 | Pseudo R-squ.: | 0.05647 |
| Time: | 12:12:36 | Log-Likelihood: | -61678.0 |
| converged: | True | LL-Null: | -65370.0 |
| Covariance Type: | nonrobust | LLR p-value: | 0.0 |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 1.2470 | 0.026 | 48.082 | 0.000 | 1.196 | 1.298 |
| wait_time | -0.0583 | 0.001 | -43.627 | 0.000 | -0.061 | -0.056 |
| expected_wait_time | 0.0076 | 0.001 | 7.618 | 0.000 | 0.006 | 0.010 |
| delay_vs_expected | -0.0832 | 0.005 | -16.107 | 0.000 | -0.093 | -0.073 |
| number_of_products | -0.3361 | 0.015 | -21.940 | 0.000 | -0.366 | -0.306 |
| price | 9.26e-05 | 3.66e-05 | 2.531 | 0.011 | 2.09e-05 | 0.000 |
| freight_value | -4.374e-05 | 0.000 | -0.103 | 0.918 | -0.001 | 0.001 |
| distance_seller_customer | 0.0001 | 1.46e-05 | 7.618 | 0.000 | 8.25e-05 | 0.000 |


In [60]:
logit_five = smf.logit(formula=f'dim_is_five_star ~ {("+").join(features_mask_2)}', data=orders).fit()
logit_five.summary()

logit_five.params

Optimization terminated successfully.
         Current function value: 0.638984
         Iterations 7


Intercept                   1.247448
wait_time                  -0.058321
expected_wait_time          0.007608
delay_vs_expected          -0.083200
number_of_products         -0.336809
price                       0.000091
distance_seller_customer    0.000111
dtype: float64
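
Logit coefficients are changes in log-odds per unit of the feature, so exponentiating them gives the multiplicative change in the odds. The sketch below does this for `wait_time` and `delay_vs_expected`, using the coefficients printed above for the two models fitted on `features_mask_2`; the dictionary layout and the helper name `odds_ratio` are only illustrative.

```python
import math

# Coefficients copied from the outputs above (log-odds per unit of the feature)
coefs = {
    "logit_one": {"wait_time": 0.0836, "delay_vs_expected": 0.0340},
    "logit_five": {"wait_time": -0.058321, "delay_vs_expected": -0.083200},
}

def odds_ratio(coef):
    """Multiplicative change in the odds for a one-unit increase of the feature."""
    return math.exp(coef)

odds_ratios = {
    model: {feature: odds_ratio(c) for feature, c in features.items()}
    for model, features in coefs.items()
}

for model, ratios in odds_ratios.items():
    for feature, ratio in ratios.items():
        print(f"{model:10s} {feature:18s} odds x {ratio:.3f} per extra unit")
```

An odds ratio above 1 means the outcome becomes more likely as the feature grows, and below 1 less likely. Comparing how far each ratio sits from 1 for the same feature across the two models is one way to approach the true/false statements in the next cell.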

In [58]:
# Among the following sentences, store the ones that are true in the list below

a = "delay_vs_expected influences five_star ratings even more than one_star ratings"
b = "wait_time influences five_star ratings even more than one_star"

your_answer = [a]

🧪 __Test your code__

In [59]:
from nbresult import ChallengeResult

result = ChallengeResult('logit',
    answers = your_answer
)
result.write()
print(result.check())

platform linux -- Python 3.8.6, pytest-6.2.4, py-1.10.0, pluggy-0.13.1 -- /home/nandosoq/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /home/nandosoq/code/Nandosoq/data-challenges/04-Decision-Science/04-Logistic-Regression/01-Logit
plugins: anyio-3.2.1, dash-1.21.0
collecting ... collected 1 item

tests/test_logit.py::TestLogit::test_question PASSED                     [100%]



💯 You can commit your code:

git add tests/logit.pickle

git commit -m 'Completed logit step'

git push origin master


<details>
    <summary>Explanations</summary>


> _All other things being equal, a delay tends to reduce the chances of a 5-star review even more than it increases the chances of a 1-star review. This is probably because 1-star reviews mostly target bad products themselves, not bad deliveries._
    
</details>


❓ How do these regression coefficients compare with an OLS on `review_score` using the same features? Double check that both OLS and Logit analyses tell approximately "the same story".

In [None]:
models = ['review_score']

features_mask = ['wait_time','expected_wait_time','delay_vs_expected',
            'number_of_products','price','freight_value','distance_seller_customer']



In [62]:
# standardize features (transform them into their respective z-scores)

df = orders[features_mask]
cols = list(df.columns)

df_zscore  = pd.DataFrame()

for col in cols:
    col_zscore = col + '_zscore'
    df_zscore[col_zscore] = (df[col] - df[col].mean())/df[col].std()
    
df_zscore['review_score'] = orders['review_score']
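
The loop above can also be written as a single vectorized expression, since pandas broadcasts the column means and standard deviations across rows. The sketch below shows this on a tiny made-up DataFrame (`toy` and its values are hypothetical, not the `orders` data).

```python
import pandas as pd

# Hypothetical toy data standing in for orders[features_mask]
toy = pd.DataFrame({
    "wait_time": [5.0, 10.0, 15.0, 30.0],
    "price": [20.0, 50.0, 80.0, 250.0],
})

# Subtract each column's mean, divide by its (sample) standard deviation,
# then rename the columns the same way as the loop does
toy_zscore = ((toy - toy.mean()) / toy.std()).add_suffix("_zscore")

print(toy_zscore.mean().round(6))  # each column is centred on 0
print(toy_zscore.std().round(6))   # each column has unit standard deviation
```

Both versions rely on the default `DataFrame.std()`, which uses the sample standard deviation (`ddof=1`).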

In [64]:
# Create and train model4


features_z_score = [feature+'_zscore' for feature in features_mask]


model4 = smf.ols(formula = f'review_score ~ {("+").join(features_z_score)}', data=df_zscore).fit()

model4.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | review_score | R-squared: | 0.139 |
| Model: | OLS | Adj. R-squared: | 0.139 |
| Method: | Least Squares | F-statistic: | 2223.0 |
| Date: | Thu, 05 Aug 2021 | Prob (F-statistic): | 0.0 |
| Time: | 12:43:21 | Log-Likelihood: | -154830.0 |
| No. Observations: | 96525 | AIC: | 309700.0 |
| Df Residuals: | 96517 | BIC: | 309800.0 |
| Df Model: | 7 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 4.1420 | 0.004 | 1069.360 | 0.000 | 4.134 | 4.150 |
| wait_time_zscore | -0.4865 | 0.007 | -71.437 | 0.000 | -0.500 | -0.473 |
| expected_wait_time_zscore | 0.0870 | 0.005 | 17.771 | 0.000 | 0.077 | 0.097 |
| delay_vs_expected_zscore | -0.0158 | 0.006 | -2.583 | 0.010 | -0.028 | -0.004 |
| number_of_products_zscore | -0.1690 | 0.004 | -38.463 | 0.000 | -0.178 | -0.160 |
| price_zscore | -0.0038 | 0.004 | -0.902 | 0.367 | -0.012 | 0.005 |
| freight_value_zscore | -0.0030 | 0.005 | -0.605 | 0.545 | -0.013 | 0.007 |
| distance_seller_customer_zscore | 0.0693 | 0.005 | 14.146 | 0.000 | 0.060 | 0.079 |

| | | | |
|---|---|---|---|
| Omnibus: | 18347.558 | Durbin-Watson: | 2.009 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 34478.595 |
| Skew: | -1.183 | Prob(JB): | 0.0 |
| Kurtosis: | 4.725 | Cond. No. | 3.47 |
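
One quick way to check whether OLS and Logit tell "the same story" is to compare coefficient signs. The sketch below uses the values printed in the summaries above for the three models fitted on `features_mask`, keeping only the features that are significant in the OLS (so `price` and `freight_value` are left out). The OLS coefficients are on z-scored features while the logit ones are on raw features, so only the signs are comparable here, not the magnitudes.

```python
# Coefficients copied from the summaries above
ols = {"wait_time": -0.4865, "expected_wait_time": 0.0870,
       "delay_vs_expected": -0.0158, "number_of_products": -0.1690,
       "distance_seller_customer": 0.0693}
logit_five_coefs = {"wait_time": -0.0583, "expected_wait_time": 0.0076,
                    "delay_vs_expected": -0.0832, "number_of_products": -0.3361,
                    "distance_seller_customer": 0.0001}
logit_one_coefs = {"wait_time": 0.0837, "expected_wait_time": -0.0205,
                   "delay_vs_expected": 0.0339, "number_of_products": 0.5638,
                   "distance_seller_customer": -0.0002}

def sign(x):
    """Return -1, 0 or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)

# A feature that raises review_score should raise P(5 stars) and lower P(1 star)
same_as_five = all(sign(ols[f]) == sign(logit_five_coefs[f]) for f in ols)
opposite_of_one = all(sign(ols[f]) == -sign(logit_one_coefs[f]) for f in ols)

print("OLS agrees with logit_five:", same_as_five)
print("OLS mirrors logit_one:", opposite_of_one)
```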


### 🏁 Congratulations! Don't forget to commit and push your notebook