**Rationale** In this assignment, you will practice specifying, running, and interpreting regressions involving non-linear functional forms and heterogeneous effects. You will be using the datasets located [here](https://drive.google.com/drive/folders/1SRMp4QhyXxfjOiR_CnbNQ1zMI1AuCjyp?usp=sharing).

1. Avocado dataset
1. Starbucks campaign data

In [None]:
import pandas as pd, numpy as np, os, matplotlib.pyplot as plt
from statsmodels.formula import api as smf
from google.colab import drive
drive.mount('drive')

In [None]:
fpath = 'drive/MyDrive/Teaching/Analytics/Datasets/A7/' # change to your data folder
os.listdir(fpath)

# Problem 1 (5 points) Avocado prices redux 

**Only use data for conventional types and remove the data for the region TotalUS**

First, write a loop over each unique region to draw the following five scatter plots (the price term goes on the x-axis):
1. Total Volume vs. Average Price
1. Total Volume vs. Average Price$^2$
1. ln(Total Volume) vs. Average Price
1. Total Volume vs. ln(Average Price)
1. ln(Total Volume) vs. ln(Average Price)

For each of the above 5 plots, the syntax should look something like:
```
for r in df.region.unique():
    temp = df[df.region==r]
    plt.scatter(....) # fill this in
```

1. Which plot of the relationship between price and demand looks most linear?

Using the avocado dataset, estimate the following demand models while accounting for region effects (include region as an explanatory variable in each regression):

1. level - level
1. level - quadratic
1. level - log
1. log - level
1. log - log
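
As a rough illustration of the formula syntax for these five forms, the sketch below fits them on a small synthetic dataset. Everything in it is an assumption made for the example: the data are simulated, the column names `price`, `volume`, and `region` are placeholders, and the true elasticity of -1.2 is made up. With the avocado data you would substitute the real column names.

```python
import numpy as np
import pandas as pd
from statsmodels.formula import api as smf

# Synthetic data (an assumption for illustration, NOT the avocado dataset).
rng = np.random.default_rng(0)
n = 600
toy = pd.DataFrame({
    "region": rng.choice(["A", "B", "C"], size=n),
    "price": rng.uniform(0.5, 2.5, size=n),
})
region_shift = toy["region"].map({"A": 10.0, "B": 11.0, "C": 12.0})
# The simulated truth is log-log with a price elasticity of -1.2.
toy["volume"] = np.exp(
    region_shift - 1.2 * np.log(toy["price"]) + rng.normal(0, 0.1, size=n)
)

# C(region) adds a dummy for each region except a baseline.
formulas = {
    "level-level": "volume ~ price + C(region)",
    "quadratic": "volume ~ price + I(price**2) + C(region)",
    "level-log": "volume ~ np.log(price) + C(region)",
    "log-level": "np.log(volume) ~ price + C(region)",
    "log-log": "np.log(volume) ~ np.log(price) + C(region)",
}

fits = {}
for name, formula in formulas.items():
    fits[name] = smf.ols(formula, data=toy).fit()

# In the log-log fit, the coefficient on np.log(price) is the elasticity.
elasticity = fits["log-log"].params["np.log(price)"]
print(fits["log-log"].summary().tables[1])
```

The first word of each label refers to the dependent variable and the second to price, so "log-level" means ln(volume) regressed on price in levels.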

Answer the following:
1. Succinctly interpret the **price coefficient** for each model (pay attention to corresponding p-values as well). 
1. Based on the log-log model, is the demand for avocados price elastic or inelastic? What does this mean?
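
A small helper can make the elastic/inelastic distinction concrete. In a log-log model, the price coefficient is the price elasticity of demand: a 1% increase in price changes quantity by roughly that many percent. The function name `classify_elasticity` and the example coefficients below are hypothetical, not estimates from the avocado data.

```python
def classify_elasticity(beta_log_price: float) -> str:
    """Classify demand from a log-log price coefficient (the elasticity)."""
    magnitude = abs(beta_log_price)
    if magnitude > 1:
        # Quantity changes by more than 1% for a 1% price change.
        return "elastic"
    if magnitude < 1:
        # Quantity changes by less than 1% for a 1% price change.
        return "inelastic"
    return "unit elastic"


# Hypothetical coefficients, for illustration only.
for beta in (-1.5, -0.4, -1.0):
    print(f"elasticity {beta:+.1f}: {classify_elasticity(beta)}")
```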

In [None]:
# read in the dataset and replace column name spaces with underscores
avocado = pd.read_csv(fpath + 'avocado.csv', index_col = 0)
avocado.columns = [c.replace(' ', '_') for c in avocado.columns]

In [None]:
# keep only rows where region != 'TotalUS' and the type is conventional




In [None]:
# check the unique regions to make sure TotalUS is not in the region column


In [None]:
# check the unique types to make sure there are only conventional types


In [None]:
# write a loop to plot average price vs total volume and color each region differently
# e.g.:
# for r in avocado.region.unique():
#     temp = avocado.loc[avocado.region == r]
#     plt.scatter(..., ..., s= 1) # <- change this, make sure price is on the x-axis



In [None]:
# do the same, but now plot AveragePrice^2 vs Total_Volume


In [None]:
# do the same, but now plot ln(average price) vs. total volume
# remember np.log() takes the log



In [None]:
# do the same, but now plot AveragePrice vs ln(Total_Volume)


In [None]:
# do the same, but now plot ln(AveragePrice) vs ln(Total_Volume)



**EDIT THIS CELL**

Which plot seems to exhibit the most linear relationship within each region (i.e., dots of the same color fall closest to a straight line)?

The plot that seems to exhibit the most linear relationship is the ___________________ vs. ___________________ plot. This suggests that the ____________________ model may be most appropriate.

(*models can be level-level, quadratic, log-log, level-log, log-level.*)

In [None]:
# run the level-level regression of Total Volume on Average Price 
# while accounting for regions as an additional explanatory variable.


# print the regression table below



**Edit this cell**

For the level-level regression, explain the coefficient for the price variable:

1. Answer here:

In [None]:
# run the quadratic regression of Total Volume on Average Price and Average Price ^2
# while accounting for regions as an additional explanatory variable.



**Edit this cell**

For the quadratic regression, explain the marginal effect of price (a 1-unit change in price leads to...). Remember that when price changes, price$^2$ changes as well:

1. Answer here:
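
As a hint on the mechanics: if volume $= b_0 + b_1 \text{price} + b_2 \text{price}^2$ (plus region effects), the derivative with respect to price is $b_1 + 2 b_2 \text{price}$, so the marginal effect depends on the price at which it is evaluated. The sketch below evaluates that expression at a few prices. The function name and the coefficient values are hypothetical placeholders, not estimates from the avocado data.

```python
def quadratic_marginal_effect(b1: float, b2: float, price: float) -> float:
    """d(volume)/d(price) for volume = b0 + b1*price + b2*price**2."""
    return b1 + 2.0 * b2 * price


# Hypothetical coefficients, for illustration only.
b1_hat, b2_hat = -500.0, 100.0

for p in (1.0, 1.5, 2.0):
    effect = quadratic_marginal_effect(b1_hat, b2_hat, p)
    print(f"at price {p:.2f}, a small price increase changes volume by {effect:.1f} per unit")
```

A common choice is to report the marginal effect at the sample mean price, although other evaluation points are equally valid.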

In [None]:
# run the log-level regression of ln(Total Volume) on Average Price
# while accounting for regions as an additional explanatory variable.


# print the regression table below



**Edit this cell**

For the log-level regression, explain the coefficient for the price variable:

1. Answer here:



In [None]:
# run the level-log regression of Total Volume on ln(Average Price)
# while accounting for regions as an additional explanatory variable.



# print the regression table below


**Edit this cell**

For the level-log regression, explain the coefficient for the price variable:

1. Answer here:

In [None]:
# run the log-log regression of ln(Total Volume) on ln(Average Price)
# while accounting for regions as an additional explanatory variable.


# print the regression table below



**Edit this cell**

For the log-log regression, explain the meaning of the coefficient for the price variable:

1. Answer here: 

Based on the log-log regression, is the demand for avocados price elastic or inelastic? What does this mean?

1. Answer here:



# Problem 2 (5 points)

Use the Starbucks promotions data. Filter the data to use only the rows satisfying all of the following conditions:

1. Transaction amount more than 0 and less than 50.
1. Income is not missing (`df.income.notnull()`)

We suspect that the average transaction value is higher for individuals with higher incomes. We also suspect that the higher the offer difficulty (the minimum spend needed to redeem the offer), the higher the spend, although this effect might differ by income. Additionally, the offer type (buy one get one vs. discount) may affect the transaction amount, and this effect may also differ by income.

1. To capture all of these potential effects, run the following regression: 
$$
ln(\text{Trans Amt}) = \beta_0 + \beta_1 ln(\text{Inc}) + \beta_2 \text{difficulty} + \beta_3 \text{Disc Offer} + \beta_4 ln(\text{Inc})\times \text{difficulty} + \beta_5  ln(\text{Inc})\times \text{Disc Offer}+ e
$$

1. Succinctly interpret the regression results.
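
The filtering step and the interaction terms can be sketched on simulated data as shown below. This is a minimal illustration under stated assumptions, not the assignment solution: the data are synthetic, and the column names `amount`, `difficulty`, and `disc_offer` are placeholders that may not match the Starbucks file (only `income` appears in the instructions above). The sketch also assumes `disc_offer` is coded 1 for a discount offer and 0 for BOGO.

```python
import numpy as np
import pandas as pd
from statsmodels.formula import api as smf

# Synthetic data (an assumption for illustration, NOT the Starbucks dataset).
rng = np.random.default_rng(1)
n = 2000
toy = pd.DataFrame({
    "income": rng.uniform(30_000, 120_000, size=n),
    "difficulty": rng.choice([5, 7, 10, 20], size=n),
    "disc_offer": rng.integers(0, 2, size=n),  # assumed coding: 1 = discount, 0 = BOGO
})
toy["amount"] = np.exp(
    -3.0
    + 0.5 * np.log(toy["income"])
    + 0.01 * toy["difficulty"]
    + 0.2 * toy["disc_offer"]
    + rng.normal(0, 0.3, size=n)
)
toy.loc[toy.index[:50], "income"] = np.nan  # simulate some missing incomes

# Keep rows with 0 < amount < 50 and non-missing income.
keep = (toy["amount"] > 0) & (toy["amount"] < 50) & toy["income"].notnull()
sample = toy.loc[keep]

# In a formula, a * b expands to a + b + a:b, so this yields both main effects
# and both interactions with ln(income).
formula = (
    "np.log(amount) ~ np.log(income) * difficulty"
    " + np.log(income) * disc_offer"
)
res = smf.ols(formula, data=sample).fit()
print(res.params)
```

The interaction coefficients appear under names such as `np.log(income):difficulty`, which correspond to $\beta_4$ and $\beta_5$ in the equation.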

**Edit this cell**

Demonstrate your understanding of how to interpret the regression by filling in the blanks in the statements below:

1. In terms of the $\beta$'s in the equation above, a 1% increase in income when a discount offer is made and the qualification difficulty is \$10 leads to approximately a ________% change in transaction amount. 
1. In terms of the $\beta$'s in the equation above, a 1% increase in income when a BOGO offer is made and the qualification difficulty is \$5 leads to approximately a ________% change in transaction amount. 

**Remember** the solution here is not as simple as looking at a single coefficient. The effect of income depends on offer type and difficulty.

Note that you can rewrite the equation as:
$$
ln(\text{Trans Amt}) = \beta_0  + \beta_2 \text{difficulty} + \beta_3 \text{Disc Offer}  + \big[\beta_1 + \beta_4 \text{difficulty} + \beta_5 \text{Disc Offer}  \big] \times ln(\text{Inc})+ e
$$

Here, the entire effect of $ln(\text{Inc})$ on $ln(\text{Trans Amt})$ is captured by the expression $\big[\beta_1 + \beta_4 \text{difficulty} + \beta_5 \text{Disc Offer}  \big]$.
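
The bracketed expression can be evaluated with a few lines of arithmetic. The sketch below assumes that Disc Offer is coded 1 for a discount offer and 0 for BOGO. The function name, the coefficient values, and the two scenarios are hypothetical and chosen only to show the calculation.

```python
def income_elasticity(b1, b4, b5, difficulty, disc_offer):
    """Approximate % change in transaction amount per 1% change in income.

    Evaluates beta_1 + beta_4 * difficulty + beta_5 * Disc Offer.
    """
    return b1 + b4 * difficulty + b5 * disc_offer


# Hypothetical coefficient values, for illustration only.
b1, b4, b5 = 0.50, -0.01, 0.10

# Discount offer (disc_offer = 1) with a difficulty of 20.
discount_case = income_elasticity(b1, b4, b5, difficulty=20, disc_offer=1)

# BOGO offer (disc_offer = 0) with a difficulty of 7.
bogo_case = income_elasticity(b1, b4, b5, difficulty=7, disc_offer=0)

print(f"discount, difficulty 20: {discount_case:.2f}% per 1% income")
print(f"BOGO, difficulty 7: {bogo_case:.2f}% per 1% income")
```

Because the dependent variable and income are both in logs, the result reads directly as an elasticity.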

In [None]:
sb = pd.read_csv(fpath + 'starbucks_promos.csv', index_col=0) # read the starbucks data

In [None]:
# select rows with transactions >0 but <50 and income is not missing
# replace sb with the result of the selection



In [None]:
# run the regression here, store the result as the variable res



In [None]:
# print the result summary here, e.g. print(res.summary()):



**Edit this cell**

Demonstrate your understanding of how to interpret the regression by filling in the blanks in the statements below:

1. In terms of the estimated coefficients, a 1% increase in income when a discount offer is made and the qualification difficulty is \$10 leads to approximately a ________% **increase/decrease (choose one)** in transaction amount. 
1. In terms of the estimated coefficients, a 1% increase in income when a BOGO offer is made and the qualification difficulty is \$5 leads to approximately a ________% **increase/decrease (choose one)** in transaction amount. 

**Basically,** substitute the estimated coefficients for the $\beta$'s in the answer above.