# Data analysis in the Amazon audit
Most of the materials in this notebook were kindly provided by Piotr Sapiezynski (www.sapiezynski.com)

#### The motivation behind the analysis
Read it here: [Amazon Puts Its Own “Brands” First](https://themarkup.org/amazons-advantage/2021/10/14/amazon-puts-its-own-brands-first-above-better-rated-products)

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
from statsmodels.tools.tools import add_constant

In [2]:
# this is the data from the markup audit, downloaded from their github: 
# https://github.com/the-markup/investigation-amazon-brands 
data = pd.read_csv('https://www.sapiezynski.com/cs4910/markup/data.csv')
data.head()

| Unnamed: 0 | search_term | placed_higher | stars_delta | reviews_delta | is_shipped_by_amazon | is_sold_by_amazon | is_amazon | is_top_clicked | random_noise | asin_1 | asin_2 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | #10 envelope | True | 0.0 | 1654.0 | 0 | 2 | 2 | 0 | -2.569736 | B06VVLD2GL | B01D0OANU4 |
| 1 | #6 envelope | True | 0.1 | 7844.0 | 0 | 2 | 2 | 2 | 0.951532 | B06X15WSLL | B07JNXMBSX |
| 2 | 1 inch binder | False | 0.0 | -9383.0 | 0 | 0 | -2 | 0 | 0.618294 | B00A45VF2S | B01BRGTWOA |
| 3 | 1% milk | False | 0.0 | -183.0 | 0 | 0 | 0 | -2 | 0.645435 | B07W5Z8SJ8 | B07WC9MMPD |
| 4 | 10 dollar gifts | True | 0.8 | 75410.0 | 0 | 2 | 0 | 0 | -1.075435 | B00F4CEHNK | B07FCNYND8 |


![features](https://mrkp-static-production.themarkup.org/graphics/amazon-methodology-product-comparison/1634143723511/assets/product-pairs-subtraction-desktop.png)

In [3]:
columns = ['stars_delta', 'reviews_delta', 'is_amazon',
        'is_shipped_by_amazon', 'is_sold_by_amazon',
        'is_top_clicked', 'placed_higher']

<div style="background-color:lightblue; padding:20px">
<h5>Exercise 1</h5>
Use a random forest classifier to predict `placed_higher` using the features `is_amazon`,`is_shipped_by_amazon`, `is_sold_by_amazon`,`is_top_clicked`, `stars_delta`, `reviews_delta`. What is the most important feature?
</div>

You should find that `is_amazon` is the most important feature. But how does that translate into the probability of being the top result? Random forests won't tell us, but logistic regression can.

Quick reminder on regression and explainability - here applied to the amazon dataset.

**Linear** regression has the following formula:

$$ y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... $$

This means that:
* when all variables are equal to 0, the outcome variable $y$ is equal to $\beta_0$, the intercept.
* a **unit** change in $x_1$ corresponds to $y$ changing by $\beta_1$ 

It might not be a powerful model, but it offers a clear interpretation/explanation in regression problems.
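To make the two bullet points concrete, here is a minimal sketch. The coefficients `beta_0`, `beta_1`, `beta_2` are made-up values chosen for illustration; they are not fitted to the Amazon data.

```python
# Hypothetical coefficients, chosen only for illustration (not fitted to any data)
beta_0, beta_1, beta_2 = 1.5, 0.8, -0.3

def linear_prediction(x1, x2):
    """Evaluate y = beta_0 + beta_1*x1 + beta_2*x2."""
    return beta_0 + beta_1 * x1 + beta_2 * x2

# With all variables at 0, the prediction is the intercept
y_at_zero = linear_prediction(0, 0)

# Increasing x1 by one unit (x2 held fixed) changes y by beta_1
unit_change = linear_prediction(3, 2) - linear_prediction(2, 2)

print("y at zero: %.2f" % y_at_zero)
print("change in y for a unit change in x1: %.2f" % unit_change)
```

The change is the same no matter which starting value of `x1` you pick, which is exactly what makes the linear model easy to explain.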

**Logistic** regression is mostly used for binary classification and it looks similar to the linear regression, but now the outcome variable is the logarithm of odds of success.

$$ ln(\frac{\pi}{1-\pi}) = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... $$

where $\pi$ is the probability of success, so $1-\pi$ is the probability of failure.

Then, a unit change in $x_1$ corresponds to a change of $\beta_1$ in the log odds, but that is not intuitive at all. Let's rewrite it to calculate the probability directly:

$$\begin{align}
ln(\frac{\pi}{1-\pi}) &= \beta_0 + \beta_1x_1 + \beta_2x_2 + ... \\
\frac{\pi}{1-\pi} &= e^{\beta_0 + \beta_1x_1 + \beta_2x_2 + ...} \\
{\pi} &= ({1-\pi})e^{\beta_0 + \beta_1x_1 + \beta_2x_2 + ...} \\
{\pi} &= e^{\beta_0 + \beta_1x_1 + \beta_2x_2 + ...}-\pi e^{\beta_0 + \beta_1x_1 + \beta_2x_2 + ...}\\
{\pi}(1+e^{\beta_0 + \beta_1x_1 + \beta_2x_2 + ...}) &= e^{\beta_0 + \beta_1x_1 + \beta_2x_2 + ...} \\
{\pi} &= \frac{e^{\beta_0 + \beta_1x_1 + \beta_2x_2 + ...}}{1+e^{\beta_0 + \beta_1x_1 + \beta_2x_2 + ...}}
\end{align}$$

So, while we cannot directly say that a unit change in $x_1$ corresponds to a fixed change in the probability of the outcome, we can just plug zeros and ones into the equation above and compute the changes.
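Here is a small sketch of that "plug in the values" idea. The helper name `probability` and the coefficients `beta_0`, `beta_1` are assumptions made up for this illustration; they do not come from the Amazon data.

```python
import math

def probability(log_odds):
    """Convert log odds z to a probability: pi = e^z / (1 + e^z)."""
    return math.exp(log_odds) / (1 + math.exp(log_odds))

# Hypothetical coefficients, for illustration only
beta_0, beta_1 = -1.0, 2.0

p_x0 = probability(beta_0)               # x1 = 0
p_x1 = probability(beta_0 + beta_1)      # x1 = 1
p_x2 = probability(beta_0 + 2 * beta_1)  # x1 = 2

print("P(success | x1=0) = %.3f" % p_x0)
print("P(success | x1=1) = %.3f" % p_x1)
print("P(success | x1=2) = %.3f" % p_x2)
# The same unit step in x1 moves the probability by different amounts
print("change 0->1: %.3f, change 1->2: %.3f" % (p_x1 - p_x0, p_x2 - p_x1))
```

The two steps in `x1` are the same size, but the change in probability is not, which is why we compute probabilities for specific scenarios instead of reading them off the coefficient.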

Let's load the data from scratch and do some cleaning. I'm dividing the indicator columns by two because The Markup coded them as -2, 0, and 2, so a "unit" change never actually occurs in the data. After dividing by two the values are -1, 0, and 1, and a unit change is meaningful.

In [4]:
data = pd.read_csv('https://www.sapiezynski.com/cs4910/markup/data.csv')
data['is_shipped_by_amazon'] /= 2
data['is_sold_by_amazon'] /= 2
data['is_amazon'] /= 2
data['is_top_clicked'] /= 2
dataset = data[columns]
train, test = train_test_split(dataset, test_size=0.2, stratify=dataset['placed_higher'])

Let's start with a model that only has the intercept (constant) and `is_amazon` as features.

In [None]:
# adding the intercept as a variable (add column of ones to the dataframe)
train = add_constant(train)
# selecting which features to train on
features = ['const', 'is_amazon']
# training a logistic regression model. First the dependent (outcome variable), then the features
log_reg = sm.Logit(train['placed_higher'], train[features]).fit()

In [None]:
log_reg.summary()

We will mostly be looking at the bottom table.

`const` (the intercept) translates to the probability if all other variables are equal to 0

$$ p = \frac{e^{\beta_{const}}}{1+e^{\beta_{const}}} $$

In [None]:
import numpy as np
# the beta coefficients are stored in .params, a pandas Series indexed by feature name
# get the constant out of the fitted model (label-based access; positional [0] is deprecated in recent pandas):
const = log_reg.params['const']

print("P (placed higher) = %.3f" %(np.e**const/(1+np.e**const)))

This should be very close to 50%. In our case if all variables are zero, it means that either:
1. The products are both from amazon
2. The products are both not from amazon

We don't have a way to differentiate them, so it's a coin toss (50%)

The other coefficients describe the change in the log odds that corresponds to a unit change in the variable.
* if the coefficient is positive, it means that an **increase** of the variable corresponds to an **increase** in the odds of the positive outcome
* if the coefficient is negative, it means that an **increase** of the variable corresponds to a **decrease** in the odds of the positive outcome
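As a quick sketch of why the sign matters: adding a coefficient to the log odds is the same as multiplying the odds by $e^{\beta}$. The value of `beta` and the even baseline odds below are assumptions picked for illustration only.

```python
import math

# Hypothetical coefficient, for illustration only
beta = 0.7

# A unit increase in the variable adds beta to the log odds,
# i.e. it multiplies the odds by e^beta
odds_multiplier = math.exp(beta)

baseline_odds = 1.0  # even odds, i.e. a probability of 0.5
new_odds = baseline_odds * odds_multiplier
new_probability = new_odds / (1 + new_odds)

print("odds multiplier: %.3f" % odds_multiplier)
print("probability after a unit increase: %.3f" % new_probability)
```

A positive `beta` gives a multiplier above 1 (odds go up), and a negative `beta` gives a multiplier below 1 (odds go down).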


So what happens when `is_amazon` is equal to 1, i.e. one product is an amazon product and the other isn't?
By looking at the `coef` we already know it's going to go up (because the coefficient is positive). But by how much? Let's calculate:


$$ p = \frac{e^{\beta_{const} + \beta_{is\_amazon}}}{1+e^{\beta_{const} + \beta_{is\_amazon}}} $$

In [None]:
# look the coefficients up by name rather than by position
numerator = np.e**(log_reg.params['const'] + log_reg.params['is_amazon'])
print("P (placed higher) = %.3f" %(numerator/(1+numerator)))

That means that in this dataset, when we have two products where one is from amazon and the other isn't, the amazon product is placed first about 93\% of the time.

Solved? Well no, because maybe this is just because amazon products are better. Let's include the star difference in the analysis to **control** for this:

In [None]:
features = ['const', 'is_amazon', 'stars_delta']
log_reg = sm.Logit(train['placed_higher'], train[features]).fit()
log_reg.summary()

Optimization terminated successfully.
         Current function value: 0.494329
         Iterations 6


| | | | |
|---|---|---|---|
| Dep. Variable: | placed_higher | No. Observations: | 1132 |
| Model: | Logit | Df Residuals: | 1129 |
| Method: | MLE | Df Model: | 2 |
| Date: | Mon, 07 Feb 2022 | Pseudo R-squ.: | 0.2868 |
| Time: | 11:21:33 | Log-Likelihood: | -559.58 |
| converged: | True | LL-Null: | -784.58 |
| Covariance Type: | nonrobust | LLR p-value: | 1.925e-98 |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 0.0867 | 0.073 | 1.185 | 0.236 | -0.057 | 0.230 |
| is_amazon | 2.4786 | 0.163 | 15.202 | 0.000 | 2.159 | 2.798 |
| stars_delta | -0.2737 | 0.241 | -1.138 | 0.255 | -0.745 | 0.198 |


Now notice the last two columns: together they give the 95% confidence interval. If one bound is positive and the other negative, the interval includes 0, which means we can't really tell whether the effect is positive or negative, so it's not statistically significant. It turns out that controlling for the star rating doesn't change anything: the `stars_delta` interval includes 0, while the `is_amazon` coefficient stays large and significant.
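Here is a minimal sketch of how to read the table. The numbers are copied by hand from the summary output above; the dictionary `summary` and the helper `interval_includes_zero` are names made up for this illustration, not part of statsmodels.

```python
import math

# Values copied from the summary table above: (coef, lower bound, upper bound)
summary = {
    'const':       (0.0867, -0.057, 0.230),
    'is_amazon':   (2.4786,  2.159, 2.798),
    'stars_delta': (-0.2737, -0.745, 0.198),
}

def interval_includes_zero(lower, upper):
    """True when the 95% confidence interval spans zero."""
    return lower <= 0 <= upper

for name, (coef, lower, upper) in summary.items():
    verdict = "not significant" if interval_includes_zero(lower, upper) else "significant"
    print("%-12s coef=%7.4f  [%6.3f, %6.3f]  %s" % (name, coef, lower, upper, verdict))

# Probability that the amazon product is placed higher when the star ratings are equal
log_odds = summary['const'][0] + summary['is_amazon'][0]
p_amazon = math.exp(log_odds) / (1 + math.exp(log_odds))
print("P (placed higher | is_amazon=1, stars_delta=0) = %.3f" % p_amazon)
```

Even after controlling for the star rating, the probability for the amazon product stays at roughly 93%.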

Let's add the review delta:

* **Write code to include `reviews_delta`**
* Is the `reviews_delta` *important*?

Let's include the rest of our variables:
* **Write code to include the other variables**
* Which variable is *important* among those included?
* How do you interpret its value? (*Hint*: look at the scenarios when this variable is equal to 0, and to 1)

What about the predictive performance? The Markup uses RandomForests with hyperparameter tuning to get the best prediction and ends up with 73.2% accuracy. 
* **Calculate accuracy for your model**
* Which model would you prefer, and why?
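As a reminder of what accuracy means for a model that outputs probabilities, here is a generic sketch. The function `accuracy` and the example numbers are made up for illustration. With a fitted statsmodels `Logit` model you would get the probabilities from `log_reg.predict(...)` on the held-out `test` split (after adding the constant column to it as well) and threshold them in the same way.

```python
def accuracy(predicted_probabilities, true_labels, threshold=0.5):
    """Fraction of rows where the thresholded probability matches the label."""
    predictions = [p >= threshold for p in predicted_probabilities]
    correct = sum(pred == bool(label) for pred, label in zip(predictions, true_labels))
    return correct / len(true_labels)

# Made-up probabilities and labels, for illustration only
example_probabilities = [0.93, 0.52, 0.48, 0.10, 0.75]
example_labels = [True, False, False, False, True]

print("accuracy = %.1f %%" % (100 * accuracy(example_probabilities, example_labels)))
```

The 0.5 threshold is an assumption; it is the usual default for a balanced binary outcome like `placed_higher`.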

The problem that we looked at so far is defined as follows: given the top two items, which one is going to be placed in the first position?

Now, let's solve a slightly different problem formulation (not covered in The Markup's write-up): given the characteristics of a product, how likely is it to be the top product on its search page?

<div style="background-color:lightblue; padding:20px">
<h5>Exercise 2: Interpreting the meaning of logistic regression coefficients</h5>
In this exercise you will interpret the values of the coefficients of a logistic regression model that tries to predict whether a product was the top product on its search page.

In particular, please answer these questions:
1. What is the probability that a hypothetical product with the lowest rating, no reviews, not sold/shipped/produced by amazon, and not among the top clicked products will be placed at the top?
1. An increase in which variables corresponds to a higher probability of being placed at the top?
1. How do you interpret the `stars` coefficient and its statistical significance?
1. If a product receives one more review without changing any of its other characteristics, how does it affect its chances of being placed at the top? 
    * How about 1000000 more reviews?
1. What is the probability that a five-star product with 1000 reviews, produced, sold, and shipped by amazon that was among the top clicked results will be placed on the top of the result list? 
    * Is this number higher, or lower than you expected? Does it change how you interpret the results of the model for only the top two positions? Why?

</div>

In [19]:
# let's get the data first:
dataset = pd.read_csv('https://www.sapiezynski.com/cs4910/markup/dataset_full.csv')
# the lowest star rating is 1, not 0, let's adjust the rating such that it starts at 0:
dataset['stars'] -= 1
dataset.head()

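| Unnamed: 0 | stars | reviews | is_amazon | is_sold_by_amazon | is_shipped_by_amazon | top_clicked | is_top |
|---|---|---|---|---|---|---|---|
| 0 | 3.8 | 2120.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 |
| 1 | 3.7 | 3747.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 2 | 3.8 | 6031.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 3 | 3.8 | 7017.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 4 | 3.8 | 455.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 |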


This dataset is an aggregation of all first search result pages from the Markup audit that had any amazon products on them. The sponsored results are removed.

In this dataset each row is a product with its star rating, the number of reviews, binary indicators of whether it's an amazon product, whether it's sold or shipped by amazon, and whether it was among the most clicked products, plus the outcome variable: whether it was the top product in its search results page.

We're looking at the first 24 results at most, so on average roughly one in 24 rows is the top-placed product:


In [None]:
print("Fraction of top placed products (is_top = 1): %.2f %%" %(dataset['is_top'].mean()* 100))

In [None]:
##################
## Your code here
###################