In [1]:
%load_ext autoreload
%autoreload 2

# `Logit` on Orders - A warm-up challenge (~1h)

## Select features

🎯 Let's figure out the impact of `wait_time` and `delay_vs_expected` on very good and very bad reviews.

👉 Using our `orders` training_set, we will run two `multivariate logistic regressions`:
- `logit_one` to predict `dim_is_one_star` 
- `logit_five` to predict `dim_is_five_star`.

 

In [41]:
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler



👉 Import your dataset:

In [42]:
from olist.order import Order
orders = Order().get_training_data(with_distance_seller_customer=True)

👉 Select in a list which features you want to use:

⚠️ Make sure you are not creating data leakage (i.e. selecting features that are derived from the target)

💡 To isolate the impact of `wait_time` and `delay_vs_expected`, we need to control for the impact of other features. Include in your list all features that may be relevant.

In [59]:
features = ['wait_time','delay_vs_expected','number_of_products','number_of_sellers','price','freight_value','distance_seller_customer']


In [71]:
from statsmodels.stats.outliers_influence import variance_inflation_factor


ft = features + ['dim_is_one_star', 'dim_is_five_star', 'review_score']


data = orders[ft]


vif = pd.DataFrame()
vif["Feature"] = features
vif["VIF"] = [variance_inflation_factor(data[features].values, i) for i in range(len(features))]

vif


| | Feature | VIF |
|---|---|---|
| 0 | wait_time | 6.910956 |
| 1 | delay_vs_expected | 2.229765 |
| 2 | number_of_products | 7.59751 |
| 3 | number_of_sellers | 9.105615 |
| 4 | price | 1.727967 |
| 5 | freight_value | 3.526939 |
| 6 | distance_seller_customer | 2.910732 |


🕵🏻 Check the `multicollinearity` of your features, using the `VIF index`.

* It shouldn't be too high (< 10 preferably) to ensure that we can trust the partial regression coefficients and their associated `p-values`
* Do not forget to standardize your data!
    * A `VIF Analysis` is made by regressing each feature against the other features...
    * So you want to `remove the effect of scale` so that your features have equal importance before running any linear regression!
    
    
📚 <a href="https://www.statisticshowto.com/variance-inflation-factor/">Statistics How To - Variance Inflation Factor</a>

📚  <a href="https://online.stat.psu.edu/stat462/node/180/">PennState - Detecting Multicollinearity Using Variance Inflation Factors</a>

⚖️ Standardizing:

In [72]:


scaler = StandardScaler()
data = data.copy()  # work on an explicit copy to avoid pandas' SettingWithCopyWarning
data[features] = scaler.fit_transform(data[features])


vif = pd.DataFrame()
vif["Feature"] = features
vif["VIF"] = [variance_inflation_factor(data[features].values, i) for i in range(len(features))]


👉 Run your VIF analysis to check for potential multicollinearity:

In [73]:
vif

| | Feature | VIF |
|---|---|---|
| 0 | wait_time | 2.624944 |
| 1 | delay_vs_expected | 2.2135 |
| 2 | number_of_products | 1.371316 |
| 3 | number_of_sellers | 1.093349 |
| 4 | price | 1.208582 |
| 5 | freight_value | 1.673727 |
| 6 | distance_seller_customer | 1.44204 |


## Logistic Regressions

👉 Fit two `Logistic Regression` models:
- `logit_one` to predict `dim_is_one_star` 
- `logit_five` to predict `dim_is_five_star`.

`Logit 1️⃣`

In [74]:
logit_one = smf.logit(formula='dim_is_one_star ~ ' + ' + '.join(features), data=data).fit()

Optimization terminated successfully.
         Current function value: 0.273582
         Iterations 7


`Logit 5️⃣`

In [75]:
logit_five = smf.logit(formula='dim_is_five_star ~ ' + ' + '.join(features), data=data).fit()

Optimization terminated successfully.
         Current function value: 0.636830
         Iterations 7


💡 It's time to analyse the results of these two logistic regressions:

- Interpret the partial coefficients in your own words.
- Check their statistical significance with `p-values`
- Do you notice any differences between `logit_one` and `logit_five` in terms of coefficient importance?
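
One way to read logit coefficients is to exponentiate them into odds ratios. The sketch below uses synthetic data with made-up effect sizes (the `toy` DataFrame and its coefficients are assumptions for illustration only, not the `orders` dataset): a coefficient below zero gives an odds ratio below 1, meaning that a one-unit increase in the feature (one standard deviation, if the features are standardized) multiplies the odds of the outcome by that factor, all else being equal.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 5_000

# Hypothetical standardized features
toy = pd.DataFrame({
    "wait_time": rng.normal(size=n),
    "delay_vs_expected": rng.normal(size=n),
})

# Made-up "true" log-odds, used only to simulate a binary target
log_odds = 0.3 - 0.5 * toy["wait_time"] - 0.4 * toy["delay_vs_expected"]
probability = 1 / (1 + np.exp(-log_odds))
toy["dim_is_five_star"] = (rng.random(n) < probability).astype(int)

model = smf.logit(
    "dim_is_five_star ~ wait_time + delay_vs_expected", data=toy
).fit(disp=0)

# Coefficients are on the log-odds scale; exp() turns them into odds ratios
report = pd.DataFrame({
    "coef": model.params,
    "odds_ratio": np.exp(model.params),
    "p_value": model.pvalues,
})
print(report.round(3))
```

The same three columns can be built from any fitted `smf.logit` result through its `params` and `pvalues` attributes.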

In [76]:
# Among the following sentences, store the ones that are true in the list below

a = "delay_vs_expected influences five_star ratings even more than one_star ratings"
b = "wait_time influences five_star ratings even more than one_star"

your_answer = [a]

🧪 __Test your code__

In [77]:
from nbresult import ChallengeResult

result = ChallengeResult('logit',
    answers = your_answer
)
result.write()
print(result.check())


platform darwin -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /Users/Dylan.Lamaison/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /Users/Dylan.Lamaison/code/DylanLamaison/04-Decision-Science/04-Logistic-Regression/data-logit/tests
plugins: asyncio-0.19.0, anyio-3.6.2
asyncio: mode=strict
collecting ... collected 1 item

test_logit.py::TestLogit::test_question PASSED                           [100%]



💯 You can commit your code:

git add tests/logit.pickle

git commit -m 'Completed logit step'

git push origin master



<details>
    <summary>- <i>Explanations and advanced concepts </i> -</summary>


> _All other things being equal, the `delay factor` tends to increase the chances of losing the 5-star rating even more than it affects the chances of a 1-star review. This is probably because 1-star reviews mostly target bad products themselves, not bad deliveries._
    
❗️ However, to be totally rigorous, we have to be **more careful when comparing coefficients from two different models**, because **they might not be based on similar populations**!
    We have 2 sub-populations here (people who gave 1 star and people who gave 5 stars), and they may exhibit intrinsically different behavior patterns. It may well be that "happy people" (who tend to give 5 stars easily) are less sensitive than "grumpy people" (who shoot 1-stars like Lucky Luke) when it comes to "delay" or "price"...

</details>


## Logistic vs. Linear ?

👉 Compare the coefficients obtained from:
- A `Logistic Regression` to explain `dim_is_five_star`
- A `Linear Regression` to explain `review_score` 

Make sure to use the same set of features for both regressions.  

⚠️ Check that both sets of coefficients tell "the same story".
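
A possible way to check the "same story" is to put both sets of coefficients side by side and compare their signs and p-values. Magnitudes are not directly comparable, since logit coefficients are on the log-odds scale while OLS coefficients are in review-score points. The sketch below runs on synthetic data: the `toy` DataFrame, its two features, and the way the score is simulated are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 5_000

# Hypothetical standardized features; only wait_time drives the score here
toy = pd.DataFrame({
    "wait_time": rng.normal(size=n),
    "price": rng.normal(size=n),
})
latent = 4.0 - 0.5 * toy["wait_time"] + rng.normal(scale=1.0, size=n)
toy["review_score"] = latent.round().clip(1, 5)
toy["dim_is_five_star"] = (toy["review_score"] == 5).astype(int)

# Same right-hand side for both regressions
rhs = "wait_time + price"
logistic = smf.logit(f"dim_is_five_star ~ {rhs}", data=toy).fit(disp=0)
linear = smf.ols(f"review_score ~ {rhs}", data=toy).fit()

comparison = pd.DataFrame({
    "logit_coef": logistic.params,
    "logit_p_value": logistic.pvalues,
    "ols_coef": linear.params,
    "ols_p_value": linear.pvalues,
}).drop(index="Intercept")

# Both models "agree" on a feature when the coefficients share the same sign
comparison["same_sign"] = np.sign(comparison["logit_coef"]) == np.sign(comparison["ols_coef"])
print(comparison.round(3))
```

A sign disagreement on a feature whose p-values are large in both models (like `price` in this toy setup, which has no true effect) is not evidence of a different story.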

> YOUR ANSWER HERE

In [82]:
logistic = smf.logit(formula='dim_is_five_star ~ ' + ' + '.join(features), data=data).fit()

logistic.summary()



Optimization terminated successfully.
         Current function value: 0.636830
         Iterations 7


| | | | |
|---|---|---|---|
| Dep. Variable: | dim_is_five_star | No. Observations: | 95872.0 |
| Model: | Logit | Df Residuals: | 95864.0 |
| Method: | MLE | Df Model: | 7.0 |
| Date: | Thu, 27 Jul 2023 | Pseudo R-squ.: | 0.05806 |
| Time: | 11:50:29 | Log-Likelihood: | -61054.0 |
| converged: | True | LL-Null: | -64817.0 |
| Covariance Type: | nonrobust | LLR p-value: | 0.0 |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 0.3388 | 0.007 | 47.340 | 0.000 | 0.325 | 0.353 |
| wait_time | -0.5219 | 0.012 | -44.693 | 0.000 | -0.545 | -0.499 |
| delay_vs_expected | -0.4330 | 0.024 | -18.414 | 0.000 | -0.479 | -0.387 |
| number_of_products | -0.1365 | 0.008 | -16.331 | 0.000 | -0.153 | -0.120 |
| number_of_sellers | -0.1427 | 0.008 | -18.219 | 0.000 | -0.158 | -0.127 |
| price | 0.0208 | 0.008 | 2.721 | 0.007 | 0.006 | 0.036 |
| freight_value | 0.0053 | 0.009 | 0.590 | 0.555 | -0.012 | 0.023 |
| distance_seller_customer | 0.0872 | 0.008 | 10.509 | 0.000 | 0.071 | 0.103 |


In [83]:
linear = smf.ols(formula='review_score ~ ' + ' + '.join(features), data=data).fit()

linear.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | review_score | R-squared: | 0.145 |
| Model: | OLS | Adj. R-squared: | 0.145 |
| Method: | Least Squares | F-statistic: | 2322.0 |
| Date: | Thu, 27 Jul 2023 | Prob (F-statistic): | 0.0 |
| Time: | 11:51:45 | Log-Likelihood: | -152580.0 |
| No. Observations: | 95872 | AIC: | 305200.0 |
| Df Residuals: | 95864 | BIC: | 305200.0 |
| Df Model: | 7 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 4.1555 | 0.004 | 1082.730 | 0.000 | 4.148 | 4.163 |
| wait_time | -0.4397 | 0.006 | -70.719 | 0.000 | -0.452 | -0.428 |
| delay_vs_expected | -0.0515 | 0.006 | -9.023 | 0.000 | -0.063 | -0.040 |
| number_of_products | -0.1297 | 0.004 | -28.861 | 0.000 | -0.139 | -0.121 |
| number_of_sellers | -0.1314 | 0.004 | -32.747 | 0.000 | -0.139 | -0.124 |
| price | -0.0029 | 0.004 | -0.696 | 0.486 | -0.011 | 0.005 |
| freight_value | 0.0043 | 0.005 | 0.868 | 0.385 | -0.005 | 0.014 |
| distance_seller_customer | 0.0967 | 0.005 | 20.984 | 0.000 | 0.088 | 0.106 |

| | | | |
|---|---|---|---|
| Omnibus: | 18749.799 | Durbin-Watson: | 2.009 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 37820.902 |
| Skew: | -1.175 | Prob(JB): | 0.0 |
| Kurtosis: | 4.986 | Cond. No. | 3.02 |


🏁 Congratulations! 

💾 Don't forget to commit and push your `logit.ipynb` notebook!