In [0]:
%load_ext autoreload
%autoreload 2

# `Logit` on Orders - A warm-up challenge (~1h)

🎯 Let's figure out the impact of `wait_time` and `delay_vs_expected` on very good and very bad reviews.

👉 Using our `orders` training_set, we will run two `multivariate logistic regressions`:
- `logit_one` to predict `dim_is_one_star` 
- `logit_five` to predict `dim_is_five_star`.

 

In [0]:
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

👉 Import your dataset:

In [0]:
from olist.order import Order
orders = Order().get_training_data(with_distance_seller_customer=True)

👉 Select which features you want to use:

⚠️ Make sure you are not creating data leakage (i.e. selecting features that are derived from the target)
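
One possible way to guard against leakage is to list the target-related columns explicitly and filter them out of the candidate features. The sketch below is only an illustration: `available_columns` is a hypothetical list (replace it with `list(orders.columns)`), and `order_id` and `distance_seller_customer` are assumed column names.

```python
# Columns named in this notebook that are the target or are derived from it
target_columns = ["dim_is_five_star", "dim_is_one_star", "review_score"]

# Hypothetical list of available columns; replace it with list(orders.columns)
available_columns = [
    "order_id", "wait_time", "delay_vs_expected",
    "dim_is_five_star", "dim_is_one_star", "review_score",
    "distance_seller_customer",
]

# An identifier such as order_id (assumed name) is not a meaningful feature either
non_feature_columns = target_columns + ["order_id"]

# Keep every column that is neither a target nor an identifier
features = [col for col in available_columns if col not in non_feature_columns]
print(features)
```

Keeping the excluded columns in a named list makes the leakage check easy to review later.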

In [0]:
orders.columns

In [0]:
# YOUR CODE HERE

🕵🏻 Check the `multicollinearity` of your features, using the `VIF index`.

* It shouldn't be too high (< 10 preferably) to ensure that we can trust the partial regression coefficients and their associated `p-values`
* Do not forget to standardize your data!
    * A `VIF Analysis` is made by regressing each feature against the other features...
    * So you want to `remove the effect of scale` so that your features have equal importance before running any linear regression!
    
    
📚 <a href="https://www.statisticshowto.com/variance-inflation-factor/">Statistics How To - Variance Inflation Factor</a>

📚  <a href="https://online.stat.psu.edu/stat462/node/180/">PennState - Detecting Multicollinearity Using Variance Inflation Factors</a>
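
A minimal sketch of the two steps (standardize, then compute one VIF per feature), run on synthetic data rather than on the real `orders` DataFrame. The generated values and the `distance_seller_customer` column name are assumptions made for illustration; `variance_inflation_factor` comes from `statsmodels.stats.outliers_influence`.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Synthetic stand-in for the selected features (NOT the real orders data)
rng = np.random.default_rng(42)
n = 1_000
wait_time = rng.gamma(shape=2.0, scale=6.0, size=n)
features = pd.DataFrame({
    "wait_time": wait_time,
    # Built from wait_time on purpose, so these two columns are correlated
    "delay_vs_expected": np.clip(wait_time - 15 + rng.normal(0, 3, n), 0, None),
    # Assumed column name, drawn independently of the other two
    "distance_seller_customer": rng.exponential(scale=500.0, size=n),
})

# z-score each column: subtract its mean, divide by its standard deviation
features_std = (features - features.mean()) / features.std()

# One VIF per feature: column i is regressed on all the other columns
vif = pd.DataFrame({
    "feature": features_std.columns,
    "vif": [
        variance_inflation_factor(features_std.values, i)
        for i in range(features_std.shape[1])
    ],
})
print(vif.sort_values("vif", ascending=False))
```

`variance_inflation_factor` does not add an intercept to its auxiliary regressions, which is one more reason to center the features before calling it.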

⚖️ Standardizing:

In [0]:
# YOUR CODE HERE

👉 Run your VIF Analysis to detect potential multicollinearity:

In [0]:
# YOUR CODE HERE

👉 Fit two `Logistic Regression` models:
- `logit_one` to predict `dim_is_one_star` 
- `logit_five` to predict `dim_is_five_star`.
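
As a rough illustration of the fitting step, the sketch below uses the `statsmodels` formula API on synthetic data. The data-generating process and its coefficients are made up purely to obtain plausible 0/1 targets, and only two features are used; with the real training set you would pass your own standardized features instead.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the standardized training set (NOT the real data)
rng = np.random.default_rng(0)
n = 2_000
data = pd.DataFrame({
    "wait_time": rng.normal(size=n),
    "delay_vs_expected": rng.normal(size=n),
})

def simulate_target(intercept, b_wait, b_delay):
    """Draw a 0/1 target from a made-up logistic model (illustration only)."""
    linear = intercept + b_wait * data["wait_time"] + b_delay * data["delay_vs_expected"]
    return rng.binomial(1, 1 / (1 + np.exp(-linear)))

data["dim_is_one_star"] = simulate_target(-2.0, 0.6, 0.3)
data["dim_is_five_star"] = simulate_target(0.3, -0.5, -0.4)

# Same right-hand side for both models, so the coefficients are comparable
formula_features = "wait_time + delay_vs_expected"
logit_one = smf.logit(f"dim_is_one_star ~ {formula_features}", data=data).fit(disp=0)
logit_five = smf.logit(f"dim_is_five_star ~ {formula_features}", data=data).fit(disp=0)

print(logit_one.params)
print(logit_five.params)
```

Calling `logit_one.summary()` or `logit_five.summary()` afterwards displays the coefficients together with their `p-values`.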

`Logit 1️⃣`

In [0]:
# YOUR CODE HERE

`Logit 5️⃣`

In [0]:
# YOUR CODE HERE

💡 It's time to analyse the results of these two logistic regressions:

- Interpret the partial coefficients in your own words.
- Check their statistical significance with `p-values`
- Do you notice any differences between `logit_one` and `logit_five` in terms of coefficient importance?
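
One way to put a logit coefficient into words is to convert it to an odds ratio. A logit coefficient is the change in log-odds for a one-unit increase of the feature, all other features held constant, so its exponential is the multiplicative change in the odds. The coefficient value below is hypothetical and only serves the arithmetic.

```python
import math

# Hypothetical coefficient of a standardized feature, for illustration only
coef = -0.5

# exp(coefficient) is the factor by which the odds are multiplied
# when the feature increases by one unit (one standard deviation here)
odds_ratio = math.exp(coef)
percent_change_in_odds = (odds_ratio - 1) * 100

print(f"odds ratio: {odds_ratio:.3f}")
print(f"change in odds: {percent_change_in_odds:.1f}%")
```

With this made-up value, a one-standard-deviation increase of the feature would cut the odds of the outcome by roughly 39%.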

In [0]:
# Among the following sentences, store the ones that are true in the list below

a = "delay_vs_expected influences five_star ratings even more than one_star ratings"
b = "wait_time influences five_star ratings even more more than one_star"

your_answer = []

🧪 __Test your code__

In [0]:
from nbresult import ChallengeResult

result = ChallengeResult('logit',
    answers = your_answer
)
result.write()
print(result.check())

<details>
    <summary>- <i>Explanations</i> -</summary>


> _All other things being equal, the `delay factor` tends to increase the chances of losing the 5-star rating even more than it affects the chances of a 1-star review, probably because 1-star reviews mostly target bad products themselves, not bad deliveries._
    
</details>


👉 Compare:
- the regression coefficients obtained from the `Logistic Regression`
- with the regression coefficients obtained from a `Linear Regression` on `review_score`, using the same features.

⚠️ Make sure both sets of coefficients tell "the same story".

**`Linear Regression`** of the review score w.r.t. the selected features:

1️⃣ Fit the Linear Regression:

In [0]:
# YOUR CODE HERE

2️⃣ Print its summary:

In [0]:
# YOUR CODE HERE

3️⃣ Print the summary of `logit_five`:

In [0]:
# YOUR CODE HERE

4️⃣ Compare `logit_five` and `linear_regression` regression coefficients.

<details>
    <summary>- <i>Hints</i> -</summary>


* Plot a sorted horizontal bar chart of the regression coefficients for each model
* Plot them side-by-side!
    
</details>
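
The hints above could be sketched as follows. Everything here runs on synthetic data with a made-up relationship between the features and the score, so the resulting bars say nothing about the real orders; only the plotting pattern is meant to carry over.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the standardized training set (NOT the real data)
rng = np.random.default_rng(1)
n = 2_000
data = pd.DataFrame({
    "wait_time": rng.normal(size=n),
    "delay_vs_expected": rng.normal(size=n),
})

# Made-up relationship, only there to produce scores between 1 and 5
satisfaction = (
    -0.8 * data["wait_time"] - 0.3 * data["delay_vs_expected"] + rng.normal(size=n)
)
data["review_score"] = np.clip(np.round(4 + satisfaction), 1, 5)
data["dim_is_five_star"] = (data["review_score"] == 5).astype(int)

# Fit both models on the same features
formula_features = "wait_time + delay_vs_expected"
linear_regression = smf.ols(f"review_score ~ {formula_features}", data=data).fit()
logit_five = smf.logit(f"dim_is_five_star ~ {formula_features}", data=data).fit(disp=0)

# One sorted horizontal bar chart per model, side by side
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
models = [
    (linear_regression, "Linear regression on review_score"),
    (logit_five, "Logistic regression on dim_is_five_star"),
]
for ax, (model, title) in zip(axes, models):
    model.params.drop("Intercept").sort_values().plot(kind="barh", ax=ax, title=title)
fig.tight_layout()
```

The two charts use different units (score points versus log-odds), so compare the ranking and the signs of the bars rather than their absolute lengths.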


In [0]:
# YOUR CODE HERE

<details>
    <summary><i> - Explanations -</i></summary>


* A side-by-side comparison of the linear regression on `review_score` and the logistic regression on `dim_is_five_star` clearly shows that: <br/>
    The most important feature for both `review_score` and `dim_is_five_star` is the same: `wait_time` (surprised? Probably not, but at least this is confirmed statistically!)
    
</details>

🏁 Congratulations! 

💾 Don't forget to commit and push your `logit.ipynb` notebook!