In [0]:
%load_ext autoreload
%autoreload 2

# `Logit` on Orders - A warm-up challenge (~1h)

🎯 Let's figure out the impact of `wait_time` and `delay_vs_expected` on very `good/bad reviews`

👉 Using our `orders` training_set, we will run two `multivariate logistic regressions`:
- `logit_one` to predict `dim_is_one_star` 
- `logit_five` to predict `dim_is_five_star`.

 

In [2]:
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

👉 Import your dataset:

In [3]:
from olist.order import Order
orders = Order().get_training_data(with_distance_seller_customer=True)

👉 Select which features you want to use:

⚠️ Make sure you are not creating data leakage (i.e. selecting features that are derived from the target)

In [4]:
orders.columns

Index(['order_id', 'wait_time', 'expected_wait_time', 'delay_vs_expected',
       'order_status', 'dim_is_five_star', 'dim_is_one_star', 'review_score',
       'number_of_products', 'number_of_sellers', 'price', 'freight_value',
       'distance_seller_customer'],
      dtype='object')
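
One way to guard against leakage is to drop every column derived from the review before choosing features. The sketch below is only an illustration based on the column list printed above: the two exclusion sets are assumptions (in particular, treating `order_id` and `order_status` as non-features is a judgment call, not something the dataset dictates).

```python
# Column names copied from the `orders.columns` output above.
all_columns = [
    "order_id", "wait_time", "expected_wait_time", "delay_vs_expected",
    "order_status", "dim_is_five_star", "dim_is_one_star", "review_score",
    "number_of_products", "number_of_sellers", "price", "freight_value",
    "distance_seller_customer",
]

# Assumption: these columns are the review itself or flags derived from it.
target_related = {"review_score", "dim_is_five_star", "dim_is_one_star"}
# Assumption: identifiers / status labels are not used as numeric features here.
excluded_by_assumption = {"order_id", "order_status"}

candidate_features = [
    c for c in all_columns
    if c not in target_related | excluded_by_assumption
]
print(candidate_features)
```

Any subset of `candidate_features` (for example `wait_time` and `delay_vs_expected`) can then be used on the right-hand side of a model without leaking the target.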

In [63]:
type(orders.order_id[1])

str

In [15]:
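# The two dim_is_* columns are the targets: they are included in the VIF check below,
# but the models further down use only wait_time and delay_vs_expected as regressors.
X = orders[['wait_time', 'delay_vs_expected', 'dim_is_five_star', 'dim_is_one_star']]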

🕵🏻 Check the `multicollinearity` of your features, using the `VIF index`.

* It shouldn't be too high (< 10 preferably) to ensure that we can trust the partial regression coefficients and their associated `p-values`
* Do not forget to standardize your data!
    * A `VIF Analysis` is made by regressing a feature vs. the other features...
    * So you want to `remove the effect of scale` so that your features have an equal importance before running any linear regression!
    
    
📚 <a href="https://www.statisticshowto.com/variance-inflation-factor/">Statistics How To - Variance Inflation Factor</a>

📚  <a href="https://online.stat.psu.edu/stat462/node/180/">PennState - Detecting Multicollinearity Using Variance Inflation Factors</a>
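
To make the "regress a feature vs. the other features" idea concrete, here is a minimal NumPy sketch of the textbook definition VIF_j = 1 / (1 - R_j²). The helper `vif_manual` and the synthetic columns `a`, `b`, `c` are hypothetical names invented for this illustration; they are not part of the `olist` package or of `statsmodels`.

```python
import numpy as np

def vif_manual(X: np.ndarray) -> np.ndarray:
    """Return VIF_j = 1 / (1 - R_j^2) for each column j of X."""
    n, p = X.shape
    vifs = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on an intercept plus all the other columns.
        A = np.column_stack([np.ones(n), others])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1.0 - resid.var() / y.var()
        vifs[j] = 1.0 / (1.0 - r2)
    return vifs

# Synthetic data (assumption): b is correlated with a, c is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=1000)
b = 0.7 * a + rng.normal(scale=0.7, size=1000)
c = rng.normal(size=1000)

X_demo = np.column_stack([a, b, c])
X_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)  # standardize

vif_values = vif_manual(X_demo)
print(dict(zip(["a", "b", "c"], vif_values.round(2))))
```

The correlated pair should show a clearly higher VIF than the independent column, whose VIF stays close to 1.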

⚖️ Standardizing:

In [14]:
wait_time_std = (orders['wait_time']-(orders['wait_time'].mean()))/(orders['wait_time'].std())
delay_vs_expected_std = (orders['delay_vs_expected']-(orders['delay_vs_expected'].mean()))/(orders['delay_vs_expected'].std())
five_star_std = (orders['dim_is_five_star']-(orders['dim_is_five_star'].mean()))/(orders['dim_is_five_star'].std())
one_star_std = (orders['dim_is_one_star']-(orders['dim_is_one_star'].mean()))/(orders['dim_is_one_star'].std())

In [16]:
X_std = X.copy()

for f in X.columns:
    mu = X[f].mean()
    sigma = X[f].std()
    # Vectorized z-score: same result as mapping (x - mu) / sigma over each value
    X_std[f] = (X[f] - mu) / sigma
    
X_std

| | wait_time | delay_vs_expected | dim_is_five_star | dim_is_one_star |
|---|---|---|---|---|
| 0 | -0.431192 | -0.161781 | -1.204841 | -0.328964 |
| 1 | 0.134174 | -0.161781 | -1.204841 | -0.328964 |
| 2 | -0.329907 | -0.161781 | 0.829977 | -0.328964 |
| 3 | 0.073540 | -0.161781 | 0.829977 | -0.328964 |
| 4 | -1.019535 | -0.161781 | 0.829977 | -0.328964 |
| ... | ... | ... | ... | ... |
| 95875 | -0.454309 | -0.161781 | 0.829977 | -0.328964 |
| 95876 | 1.023841 | -0.161781 | -1.204841 | -0.328964 |
| 95877 | 1.305780 | -0.161781 | 0.829977 | -0.328964 |
| 95878 | 0.483664 | -0.161781 | -1.204841 | -0.328964 |


👉 Run your VIF analysis to check for potential multicollinearity:

In [26]:
X_std.shape[1]
X_std.columns

Index(['wait_time', 'delay_vs_expected', 'dim_is_five_star',
       'dim_is_one_star'],
      dtype='object')

In [27]:
from statsmodels.stats.outliers_influence import variance_inflation_factor as vif

In [30]:
df = pd.DataFrame()
df["vif_index"] = [vif(X_std.values, i) for i in range(X_std.shape[1])]
df["features"] = X_std.columns
df

| | vif_index | features |
|---|---|---|
| 0 | 2.057922 | wait_time |
| 1 | 2.001394 | delay_vs_expected |
| 2 | 1.209669 | dim_is_five_star |
| 3 | 1.274586 | dim_is_one_star |


👉 Fit two `Logistic Regression` models:
- `logit_one` to predict `dim_is_one_star` 
- `logit_five` to predict `dim_is_five_star`.

`Logit 1️⃣`

In [35]:
logit1 = smf.logit(formula='dim_is_one_star ~ wait_time + delay_vs_expected', data=orders).fit();
logit1.summary()

Optimization terminated successfully.
         Current function value: 0.283196
         Iterations 7


| | | | |
|---|---|---|---|
| Dep. Variable: | dim_is_one_star | No. Observations: | 95872 |
| Model: | Logit | Df Residuals: | 95869 |
| Method: | MLE | Df Model: | 2 |
| Date: | Thu, 21 Oct 2021 | Pseudo R-squ.: | 0.1147 |
| Time: | 11:50:36 | Log-Likelihood: | -27151.0 |
| converged: | True | LL-Null: | -30669.0 |
| Covariance Type: | nonrobust | LLR p-value: | 0.0 |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -3.1933 | 0.024 | -131.073 | 0.000 | -3.241 | -3.146 |
| wait_time | 0.0591 | 0.002 | 38.510 | 0.000 | 0.056 | 0.062 |
| delay_vs_expected | 0.0723 | 0.004 | 19.082 | 0.000 | 0.065 | 0.080 |


`Logit 5️⃣`

In [33]:
logit5 = smf.logit(formula='dim_is_five_star ~ 1', data=orders).fit();
logit5.params

Optimization terminated successfully.
         Current function value: 0.676080
         Iterations 4


Intercept    0.372705
dtype: float64
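
Note that the formula `dim_is_five_star ~ 1` contains only an intercept, so the fit above recovers just the overall share of five-star orders (the sigmoid of 0.3727 is about 0.59). To measure the effect of `wait_time` and `delay_vs_expected`, both features would need to appear on the right-hand side, as in `logit1`. The sketch below shows that shape on synthetic data only: the data-generating coefficients and the names `demo` and `logit_five_demo` are invented for illustration and say nothing about the real `orders` results.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 10_000

# Synthetic stand-in for `orders` (assumption: all numbers below are made up).
wait_time = rng.gamma(shape=2.0, scale=6.0, size=n)
delay_vs_expected = np.clip(wait_time - 20.0, 0.0, None)
log_odds = 1.0 - 0.05 * wait_time - 0.2 * delay_vs_expected
p_five = 1.0 / (1.0 + np.exp(-log_odds))

demo = pd.DataFrame({
    "wait_time": wait_time,
    "delay_vs_expected": delay_vs_expected,
    "dim_is_five_star": rng.binomial(1, p_five),
})

# Same formula shape as logit1, with the five-star flag as the target.
logit_five_demo = smf.logit(
    "dim_is_five_star ~ wait_time + delay_vs_expected", data=demo
).fit(disp=0)
print(logit_five_demo.params)
```

On the real data, the same formula passed with `data=orders` would produce a summary directly comparable to the `logit1` table.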

💡 It's time to analyse the results of these two logistic regressions:

- Interpret the partial coefficients in your own words.
- Check their statistical significances with `p-values`
- Do you notice any differences between `logit_one` and `logit_five` in terms of coefficient importances?
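
Logit coefficients are on the log-odds scale, so exponentiating them is one common way to put them into words. The sketch below reuses the `logit1` coefficients printed in the summary above; the example order (a `wait_time` of 20 and a `delay_vs_expected` of 5, in whatever units the dataset uses) is a hypothetical input chosen only for illustration.

```python
import math

# Coefficients copied from the logit1 summary above (log-odds scale).
coefs = {"Intercept": -3.1933, "wait_time": 0.0591, "delay_vs_expected": 0.0723}

def sigmoid(z: float) -> float:
    """Map log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# exp(coef) = multiplicative change in the odds of a one-star review
# for a one-unit increase in the feature, holding the other feature fixed.
odds_ratios = {
    name: math.exp(beta) for name, beta in coefs.items() if name != "Intercept"
}

# Predicted one-star probability when both features equal zero.
baseline_p = sigmoid(coefs["Intercept"])

# Hypothetical order: wait_time = 20, delay_vs_expected = 5.
z = coefs["Intercept"] + coefs["wait_time"] * 20 + coefs["delay_vs_expected"] * 5
example_p = sigmoid(z)

print(odds_ratios)
print(round(baseline_p, 3), round(example_p, 3))
```

Read this way, each extra unit of `wait_time` multiplies the odds of a one-star review by roughly 1.06, and each extra unit of `delay_vs_expected` by roughly 1.07, all else being equal.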

In [0]:
# Among the following sentences, store the ones that are true in the list below

a = "delay_vs_expected influences five_star ratings even more than one_star ratings"
b = "wait_time influences five_star ratings even more than one_star"

your_answer = []

🧪 __Test your code__

In [0]:
from nbresult import ChallengeResult

result = ChallengeResult('logit',
    answers = your_answer
)
result.write()
print(result.check())

<details>
    <summary>- <i>Explanations</i> -</summary>


> _All other things being equal, the `delay factor` tends to increase the chances of losing a 5-star review even more than it affects the chances of a 1-star review, probably because 1-star reviews really target bad products themselves, not bad deliveries._
    
</details>


👉 Compare:
- the regression coefficients obtained from the `Logistic Regression`
- with the regression coefficients obtained through the `Linear Regression`
- on `review_score`, using the same features.

⚠️ Make sure both sets of coefficients tell "the same story".

**`Linear Regression`** of the Review score w.r.t. selected features :

1️⃣ Fit the Linear Regression:

In [0]:
# YOUR CODE HERE

2️⃣ Print its summary:

In [0]:
# YOUR CODE HERE

3️⃣ Print the summary of the `logit_five` 

In [0]:
# YOUR CODE HERE

4️⃣ Compare `logit_five` and `linear_regression` regression coefficients.

<details>
    <summary>- <i>Hints</i> -</summary>


* Plot a sorted horizontal bar chart of the regression coefficients for each model
* Plot them side-by-side!
    
</details>
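
As a rough illustration of the comparison described in the hints, the sketch below fits both models on standardized features and gathers their coefficients in one table. It runs on synthetic data: the data-generating numbers and the names `demo`, `ols_demo`, `logit_demo` and `comparison` are assumptions made for this example, not results from `orders`.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 10_000

# Synthetic stand-in for `orders` (assumption: all numbers below are made up).
wait_time = rng.gamma(2.0, 6.0, size=n)
delay = np.clip(wait_time - 20.0, 0.0, None)
latent = 4.6 - 0.04 * wait_time - 0.1 * delay + rng.normal(scale=1.0, size=n)

demo = pd.DataFrame({
    "wait_time": wait_time,
    "delay_vs_expected": delay,
    "review_score": np.clip(np.round(latent), 1, 5),
})
demo["dim_is_five_star"] = (demo["review_score"] == 5).astype(int)

# Standardize the features so coefficient magnitudes are comparable.
features = ["wait_time", "delay_vs_expected"]
demo[features] = (demo[features] - demo[features].mean()) / demo[features].std()

rhs = " + ".join(features)
ols_demo = smf.ols(f"review_score ~ {rhs}", data=demo).fit()
logit_demo = smf.logit(f"dim_is_five_star ~ {rhs}", data=demo).fit(disp=0)

# One row per feature, one column per model.
comparison = pd.DataFrame({
    "linear_on_review_score": ols_demo.params[features],
    "logit_on_five_star": logit_demo.params[features],
})
print(comparison)
```

Calling `comparison.plot(kind="barh", subplots=True, layout=(1, 2))` on such a table is one way to get the two bar charts side by side. The two columns are on different scales (review points vs. log-odds), so the comparison is about signs and ranking rather than raw magnitudes.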


In [0]:
# YOUR CODE HERE

<details>
    <summary><i> - Explanations -</i></summary>


* A side-by-side comparison of the linear regression on `review_score` and the logistic regression on `dim_is_five_star` clearly shows that: <br/>
    The most important feature for both `review_score` and `dim_is_five_star` is the same: `wait_time` (surprised? Probably not, but at least this is confirmed statistically!)
    
</details>

🏁 Congratulations! 

💾 Don't forget to commit and push your `logit.ipynb` notebook!