# Prescriptive Models and Data Analytics Problem Set #3: Diff-in-diff

## Measuring the impact of online word-of-mouth

In [1]:
import pandas as pd
import statsmodels.formula.api as smf
import numpy as np

In [2]:
data = pd.read_csv('weibo_data.csv')
data.head()

Unnamed: 0,location,show_id,episode_num,censor_dummy,log_rating,log_tweet,av_tweets,day_id,mainland_dummy
0,Mainland China,1,1,1,0.475764,0.0,3.692308,33,1
1,Mainland China,1,2,0,0.468479,0.0,3.692308,34,1
2,Mainland China,1,3,0,0.581327,1.386294,3.692308,35,1
3,Mainland China,1,4,0,0.547851,0.0,3.692308,36,1
4,Mainland China,1,5,0,0.483728,1.386294,3.692308,37,1


### 1.1 Simple regression

**Question 1. Load the data and regress (log) ratings of each show onto the (log) number of tweets per episode. Do you think this regression gives you the causal effect of tweets on show viewership? If not, do you think your estimate will be biased upwards or downwards?**

In [3]:
reg_weibo  = smf.ols(formula = 'log_rating ~ log_tweet', data = data)
result = reg_weibo.fit()
print(result.summary())

                            OLS Regression Results                            
Dep. Variable:             log_rating   R-squared:                       0.111
Model:                            OLS   Adj. R-squared:                  0.111
Method:                 Least Squares   F-statistic:                     987.9
Date:                Wed, 28 Feb 2024   Prob (F-statistic):          2.00e-204
Time:                        23:25:41   Log-Likelihood:                -87.734
No. Observations:                7899   AIC:                             179.5
Df Residuals:                    7897   BIC:                             193.4
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.2664      0.003     81.566      0.0

This regression does not identify the causal effect of tweets on show viewership. The data are observational: the number of tweets per episode is not randomly assigned. Other factors, such as the quality of a show and its cast, affect both viewership and the volume of discussion. I believe the estimate is biased upwards because of positive omitted variable bias: better-quality shows tend to generate more discussion and also attract more viewers. As a result, the true effect of tweets on show viewership is likely smaller than the estimated coefficient.
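The direction of the bias can be illustrated with a small simulation. This is a sketch on synthetic data, not on `weibo_data.csv`: the unobserved `quality` variable, the true effect `true_beta`, and every other numeric value are assumptions made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
true_beta = 0.05  # assumed true effect of log tweets on log ratings

# Hypothetical unobserved show quality drives both tweets and ratings
quality = rng.normal(size=n)
log_tweet = 1.0 + 0.8 * quality + rng.normal(size=n)
log_rating = 0.3 + true_beta * log_tweet + 0.1 * quality + rng.normal(scale=0.1, size=n)

def ols(X, y):
    """Return OLS coefficients (intercept first) via least squares."""
    X = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Omitting quality pushes the tweet coefficient above true_beta
naive = ols(log_tweet, log_rating)[1]
# Controlling for quality recovers a value close to true_beta
controlled = ols(np.column_stack([log_tweet, quality]), log_rating)[1]

print(f"naive: {naive:.4f}, controlled: {controlled:.4f}, true: {true_beta}")
```

In this setup the naive coefficient overstates the effect because quality is positively related to both tweets and ratings, which is the same mechanism argued for the Weibo data.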

### 1.2 Geographic Diff-in-diff

**Question 1. During the time period of your data, the Chinese government blocked the entire Sina Weibo platform due to a political scandal for three days (a dummy for those three days called `censor_dummy` is included in the data). Assume that the censorship constitutes an exogenous shock that affected the number of tweets during the three days it lasted. You want to exploit this shock in order to analyze whether ratings decreased during the censorship.**

**(a) Run a regression of episode-level (log) ratings on show fixed effects and the censorship dummy using only data from mainland China. Interpret the coefficient on the censorship dummy. Is this result what you expected?**

In [4]:
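# .copy() makes an independent frame, so adding columns later does not trigger SettingWithCopyWarning
data_mainland = data[data['mainland_dummy'] == 1].copy()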

data_mainland.tail()

Unnamed: 0,location,show_id,episode_num,censor_dummy,log_rating,log_tweet,av_tweets,day_id,mainland_dummy
7894,Mainland China,193,57,0,0.232501,1.791759,11.517241,57,1
7895,Mainland China,193,58,0,0.23128,2.564949,11.517241,58,1
7896,Mainland China,193,59,0,0.262297,2.079442,11.517241,59,1
7897,Mainland China,193,60,0,0.202024,2.772589,11.517241,60,1
7898,Mainland China,193,61,0,0.165636,1.94591,11.517241,61,1


In [5]:
reg_eps  = smf.ols(formula = 'log_rating ~ censor_dummy + C(show_id)', data = data_mainland)
result = reg_eps.fit()
print(result.summary())

                            OLS Regression Results                            
Dep. Variable:             log_rating   R-squared:                       0.881
Model:                            OLS   Adj. R-squared:                  0.878
Method:                 Least Squares   F-statistic:                     294.5
Date:                Wed, 28 Feb 2024   Prob (F-statistic):               0.00
Time:                        23:25:41   Log-Likelihood:                 7841.0
No. Observations:                7899   AIC:                        -1.529e+04
Df Residuals:                    7705   BIC:                        -1.394e+04
Df Model:                         193                                         
Covariance Type:            nonrobust                                         
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
Intercept             0.5577      0.02

The coefficient on censor_dummy is -0.0122: holding the show fixed, log ratings were 0.0122 lower during the three-day censorship period than on days without censorship. This is the expected sign, since the exogenous shock of censorship shut down discussion on Sina Weibo and could thereby reduce show ratings.
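A coefficient on a dummy in a log-outcome regression can be converted into an approximate percentage change. The sketch below assumes that `log_rating` is the natural log of the rating, which the data description does not confirm; if it is a different transformation (for example log(1 + rating)), the conversion is only a rough guide.

```python
import math

coef = -0.0122  # censor_dummy coefficient reported above

# Exact percentage change implied by a shift of `coef` in a natural-log outcome
pct_change = (math.exp(coef) - 1) * 100

print(f"{pct_change:.2f}% change in ratings")
```

For coefficients this small, the exact conversion and the coefficient itself (read as a proportion) are nearly identical.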

**(b) Was it necessary to control for show fixed effects in the regression above? If you ran the regression without show fixed effects, how would the interpretation of the coefficient on the censorship dummy differ?**

In [6]:
reg_eps_2  = smf.ols(formula = 'log_rating ~ censor_dummy', data = data_mainland)
result = reg_eps_2.fit()
print(result.summary())

                            OLS Regression Results                            
Dep. Variable:             log_rating   R-squared:                       0.001
Model:                            OLS   Adj. R-squared:                  0.001
Method:                 Least Squares   F-statistic:                     6.093
Date:                Wed, 28 Feb 2024   Prob (F-statistic):             0.0136
Time:                        23:25:41   Log-Likelihood:                -550.22
No. Observations:                7899   AIC:                             1104.
Df Residuals:                    7897   BIC:                             1118.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept        0.3196      0.003    105.685   

Yes. Show fixed effects absorb all unobserved, time-invariant characteristics of each show that influence its ratings. With them, the regression compares the ratings of the same show during the censorship period with its ratings on uncensored days. Without show fixed effects, the coefficient on censor_dummy becomes positive (0.0286), which would naively suggest that ratings rise during the censorship. That estimate also picks up differences between the shows that happened to air during the censorship and those that aired on other days, so it reflects the composition of shows rather than the effect of the censorship.
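The sign flip can be reproduced on synthetic data. The sketch below is an illustration under made-up assumptions, not a description of the actual Weibo data: it supposes that only shows with above-median baseline ratings are on air during the censored days (days 30 to 32), and it uses within-show demeaning, which gives the same slope as including show dummies.

```python
import numpy as np

rng = np.random.default_rng(0)
n_shows, n_days = 50, 60
true_effect = -0.02  # assumed true censorship effect

baseline = rng.normal(0.3, 0.2, size=n_shows)  # show-specific rating level
show = np.repeat(np.arange(n_shows), n_days)
day = np.tile(np.arange(n_days), n_shows)

# Hypothetical airing pattern: less popular shows end before day 30
popular = baseline > np.median(baseline)
keep = popular[show] | (day < 30)
show, day = show[keep], day[keep]

censor = ((day >= 30) & (day <= 32)).astype(float)
y = baseline[show] + true_effect * censor + rng.normal(0, 0.01, size=show.size)

def slope(x, y):
    """Slope from a simple regression of y on x (with intercept)."""
    x, y = x - x.mean(), y - y.mean()
    return (x @ y) / (x @ x)

def demean_by(group, v):
    """Subtract each group's mean from v (the within transformation)."""
    means = np.bincount(group, weights=v) / np.bincount(group)
    return v - means[group]

pooled = slope(censor, y)                                    # no fixed effects
within = slope(demean_by(show, censor), demean_by(show, y))  # show fixed effects

print(f"pooled: {pooled:.4f}, within-show: {within:.4f}")
```

The pooled slope is positive because censored days contain only high-baseline shows, while the within-show slope stays close to the assumed negative effect.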

**(c) Run the same regression as in part (a), but use only data from Hong Kong (and not mainland China). Make sure to control for show fixed effects. Interpret the coefficient on the censorship dummy. Is this result what you expected?**

In [7]:
data_hk = data[data['mainland_dummy'] == 0]

data_hk.tail()

Unnamed: 0,location,show_id,episode_num,censor_dummy,log_rating,log_tweet,av_tweets,day_id,mainland_dummy
11422,hongkong,342,33,0,0.47611,,0.0,51,0
11423,hongkong,342,34,0,0.432756,,0.0,54,0
11424,hongkong,342,35,0,0.303211,,0.0,55,0
11425,hongkong,342,36,0,0.436511,,0.0,56,0
11426,hongkong,342,37,0,0.149712,,0.0,57,0


In [8]:
reg_eps_hk  = smf.ols(formula = 'log_rating ~ censor_dummy + C(show_id)', data = data_hk)
result = reg_eps_hk.fit()
print(result.summary())

                            OLS Regression Results                            
Dep. Variable:             log_rating   R-squared:                       0.974
Model:                            OLS   Adj. R-squared:                  0.973
Method:                 Least Squares   F-statistic:                     967.6
Date:                Wed, 28 Feb 2024   Prob (F-statistic):               0.00
Time:                        23:25:42   Log-Likelihood:                 1799.1
No. Observations:                3528   AIC:                            -3332.
Df Residuals:                    3395   BIC:                            -2512.
Df Model:                         132                                         
Covariance Type:            nonrobust                                         
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
Intercept             0.1792      0.02

The coefficient on censor_dummy is 0.0106: holding the show fixed, log ratings in Hong Kong were 0.0106 higher during the censorship period than on other days, a small increase. Its p-value (0.34) is above 0.05, so the coefficient is not statistically significant at the 5% level. This is expected: Hong Kong residents primarily use Twitter rather than Sina Weibo, so the censorship should not have a significant impact on viewership there.

**(d) Using data from both Hong Kong and mainland China, implement a difference-in-differences regression with mainland China as the treatment group and Hong Kong as the control group. In other words, you want to show that the censorship event had a differential effect in mainland China relative to Hong Kong. Make sure to control for show fixed effects. Interpret the relevant coefficients of this regression.**

In [9]:
reg_did = smf.ols(formula = 'log_rating ~ censor_dummy + mainland_dummy + C(show_id) + censor_dummy:mainland_dummy', data = data)
result = reg_did.fit()
print(result.summary())

                            OLS Regression Results                            
Dep. Variable:             log_rating   R-squared:                       0.964
Model:                            OLS   Adj. R-squared:                  0.963
Method:                 Least Squares   F-statistic:                     910.8
Date:                Wed, 28 Feb 2024   Prob (F-statistic):               0.00
Time:                        23:25:43   Log-Likelihood:                 9018.9
No. Observations:               11427   AIC:                        -1.738e+04
Df Residuals:                   11100   BIC:                        -1.498e+04
Df Model:                         326                                         
Covariance Type:            nonrobust                                         
                                  coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------
Intercept         

The coefficient on the interaction term, censor_dummy:mainland_dummy, measures the differential effect of the censorship on log ratings in mainland China relative to Hong Kong, where the Sina Weibo block should have had little effect. The interaction term is negative (-0.0227) and significant (p = 0.02 < 0.05): during the censorship, log ratings fell by 0.0227 more in mainland China than in Hong Kong. The coefficient on censor_dummy captures the change in Hong Kong during the censorship days, that is, any common shock unrelated to Weibo. If each show airs in only one region, mainland_dummy is collinear with the show fixed effects, so its own coefficient is not separately identified; the interaction term is unaffected.
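As a consistency check, the interaction coefficient can be compared with the difference between the two region-specific estimates from parts (a) and (c). The sketch assumes that each show airs in only one region, in which case the pooled regression separates into the two regional regressions and the two numbers should agree up to rounding.

```python
coef_mainland = -0.0122  # censor_dummy, mainland-only regression, part (a)
coef_hk = 0.0106         # censor_dummy, Hong Kong-only regression, part (c)
reported_interaction = -0.0227  # censor_dummy:mainland_dummy, part (d)

# Difference-in-differences: mainland change minus Hong Kong change
did = coef_mainland - coef_hk

print(f"difference of estimates: {did:.4f}, reported interaction: {reported_interaction}")
```

The small gap between the two figures is consistent with rounding in the reported coefficients.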

### 1.3 Across-show Diff-in-diff

**Question 1. The variable `av_tweets` denotes the average number of tweets associated with an episode of each show (outside of the censored time period). Therefore, this variable is show specific, but it does not vary over time. We can use this variable to capture the general level of social media interest in each show. Generate a set of three dummy variables based on the `av_tweets` variable: The first dummy is equal to one for shows with fewer than 5 tweets per episode, the second dummy is equal to one for shows with at least 5 but less than 100 tweets per episode, and the third dummy should be equal to one for shows with at least 100 tweets per episode.**

In [10]:
data_mainland.head()

Unnamed: 0,location,show_id,episode_num,censor_dummy,log_rating,log_tweet,av_tweets,day_id,mainland_dummy
0,Mainland China,1,1,1,0.475764,0.0,3.692308,33,1
1,Mainland China,1,2,0,0.468479,0.0,3.692308,34,1
2,Mainland China,1,3,0,0.581327,1.386294,3.692308,35,1
3,Mainland China,1,4,0,0.547851,0.0,3.692308,36,1
4,Mainland China,1,5,0,0.483728,1.386294,3.692308,37,1


In [17]:
# Activity-level dummies based on av_tweets: [0, 5), [5, 100), and [100, inf)
data_mainland.loc[:, 'data_lessthan_5'] = np.where(data_mainland['av_tweets'] < 5, 1, 0)
data_mainland.loc[:, 'data_between_5_and_100'] = np.where((data_mainland['av_tweets'] >= 5) & (data_mainland['av_tweets'] < 100), 1, 0)
data_mainland.loc[:, 'data_greaterthan_100'] = np.where(data_mainland['av_tweets'] >= 100, 1, 0)

In [18]:
data_mainland.head()

Unnamed: 0,location,show_id,episode_num,censor_dummy,log_rating,log_tweet,av_tweets,day_id,mainland_dummy,data_lessthan_5,data_between_5_and_100,data_greaterthan_100
0,Mainland China,1,1,1,0.475764,0.0,3.692308,33,1,1,0,0
1,Mainland China,1,2,0,0.468479,0.0,3.692308,34,1,1,0,0
2,Mainland China,1,3,0,0.581327,1.386294,3.692308,35,1,1,0,0
3,Mainland China,1,4,0,0.547851,0.0,3.692308,36,1,1,0,0
4,Mainland China,1,5,0,0.483728,1.386294,3.692308,37,1,1,0,0
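The three activity-level dummies can also be derived from a single bin index. The sketch below is an alternative to the `np.where` calls, not the method used in this notebook; the sample `av_tweets` values are made up to exercise the bin edges at 5 and 100.

```python
import numpy as np

# Hypothetical av_tweets values chosen to hit the bin edges
av_tweets = np.array([3.69, 4.99, 5.0, 11.52, 99.9, 100.0, 250.0])

# np.digitize returns 0 for x < 5, 1 for 5 <= x < 100, 2 for x >= 100
level = np.digitize(av_tweets, bins=[5, 100])

lessthan_5 = (level == 0).astype(int)
between_5_and_100 = (level == 1).astype(int)
greaterthan_100 = (level == 2).astype(int)

print(level)
```

A single index makes it easy to confirm that every show falls into exactly one group.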


**(a) Run three separate regressions for shows with less than 5 tweets per episode, shows with 5 to 100 tweets per episode and shows with at least 100 tweets. What do you find in terms of impact of the censorship event across the three regressions?**

In [19]:
reg_lessthan5 = smf.ols(formula = 'log_rating ~ censor_dummy + C(show_id)', data = data_mainland.loc[data_mainland['data_lessthan_5']==1])
result = reg_lessthan5.fit()
print(result.summary())

                            OLS Regression Results                            
Dep. Variable:             log_rating   R-squared:                       0.844
Model:                            OLS   Adj. R-squared:                  0.840
Method:                 Least Squares   F-statistic:                     203.3
Date:                Wed, 28 Feb 2024   Prob (F-statistic):               0.00
Time:                        23:26:04   Log-Likelihood:                 4134.6
No. Observations:                3405   AIC:                            -8091.
Df Residuals:                    3316   BIC:                            -7545.
Df Model:                          88                                         
Covariance Type:            nonrobust                                         
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
Intercept             0.5573      0.01

In [20]:
reg_between_5_and_100 = smf.ols(formula = 'log_rating ~ censor_dummy + C(show_id)', data = data_mainland.loc[data_mainland['data_between_5_and_100']==1])
result = reg_between_5_and_100.fit()
print(result.summary())

                            OLS Regression Results                            
Dep. Variable:             log_rating   R-squared:                       0.888
Model:                            OLS   Adj. R-squared:                  0.886
Method:                 Least Squares   F-statistic:                     363.1
Date:                Wed, 28 Feb 2024   Prob (F-statistic):               0.00
Time:                        23:26:08   Log-Likelihood:                 2922.7
No. Observations:                2945   AIC:                            -5717.
Df Residuals:                    2881   BIC:                            -5334.
Df Model:                          63                                         
Covariance Type:            nonrobust                                         
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
Intercept             0.8858      0.01

In [21]:
reg_greaterthan_100 = smf.ols(formula = 'log_rating ~ censor_dummy + C(show_id)', data = data_mainland.loc[data_mainland['data_greaterthan_100']==1])
result = reg_greaterthan_100.fit()
print(result.summary())

                            OLS Regression Results                            
Dep. Variable:             log_rating   R-squared:                       0.850
Model:                            OLS   Adj. R-squared:                  0.846
Method:                 Least Squares   F-statistic:                     203.8
Date:                Wed, 28 Feb 2024   Prob (F-statistic):               0.00
Time:                        23:26:11   Log-Likelihood:                 1090.7
No. Observations:                1549   AIC:                            -2095.
Df Residuals:                    1506   BIC:                            -1866.
Df Model:                          42                                         
Covariance Type:            nonrobust                                         
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
Intercept             0.7549      0.01

For shows with fewer than 5 tweets per episode, the p-value on censor_dummy is 0.1851. For shows with 5 to 100 tweets per episode, it is 0.5421. Both are above 0.05, so the censorship effect is not statistically significant at the 5% level for shows with fewer than 100 tweets per episode: there is no detectable effect on viewership.

For shows with at least 100 tweets per episode, the coefficient on censor_dummy is -0.0335: holding the show fixed, log ratings were 0.0335 lower during the censorship period than on other days. The p-value of 0.0034 is below 0.05, so the effect is significant at the 5% level and indicates a negative effect on viewership.

Therefore, only shows with at least 100 tweets per episode show a statistically significant drop in viewership during the censorship. This suggests that the censorship hurts ratings most for shows with substantial online word-of-mouth, which have the most social media discussion to lose.

**(b) Run a difference-in-difference regression that allows for the censorship event to have a different effect for three sets of shows with the three different activity levels defined above. Interpret the relevant coefficients.**

In [22]:
reg_did_3 = smf.ols(formula = 'log_rating ~ censor_dummy + C(show_id) + data_between_5_and_100 + data_greaterthan_100 + censor_dummy:data_between_5_and_100 + censor_dummy:data_greaterthan_100', data = data_mainland)
result = reg_did_3.fit()
print(result.summary())

                            OLS Regression Results                            
Dep. Variable:             log_rating   R-squared:                       0.881
Model:                            OLS   Adj. R-squared:                  0.878
Method:                 Least Squares   F-statistic:                     291.7
Date:                Wed, 28 Feb 2024   Prob (F-statistic):               0.00
Time:                        23:26:17   Log-Likelihood:                 7845.2
No. Observations:                7899   AIC:                        -1.530e+04
Df Residuals:                    7703   BIC:                        -1.393e+04
Df Model:                         195                                         
Covariance Type:            nonrobust                                         
                                          coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------------------
In

The coefficient on censor_dummy is -0.0069: for shows with fewer than 5 tweets per episode (the baseline group), log ratings were 0.0069 lower during the censorship.

The coefficient on the interaction term, censor_dummy:data_between_5_and_100, is 0.0027: for shows with 5 to 100 tweets per episode, the change in log ratings during the censorship is 0.0027 higher than for shows with fewer than 5 tweets per episode. This is in addition to the baseline censorship effect captured by censor_dummy. The group main effects do not vary within a show, so they are absorbed by the show fixed effects. The p-value of the interaction is 0.778, well above 0.05, so the difference is not statistically significant and may be due to chance.

The coefficient on the interaction term, censor_dummy:data_greaterthan_100, is -0.0266: for shows with at least 100 tweets per episode, the change in log ratings during the censorship is 0.0266 lower than for shows with fewer than 5 tweets per episode. Again, this is in addition to the baseline censorship effect captured by censor_dummy. Its p-value is 0.013, below 0.05, so the difference is statistically significant. The negative coefficient suggests that heavily discussed shows, whose viewers are likely more engaged online, lose the most when that discussion is shut down.
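The total censorship effect for each group is the baseline coefficient plus the relevant interaction. The short calculation below uses only the coefficients reported above; for the high-activity group the sum can be checked against the separate regression in part (a).

```python
baseline = -0.0069           # censor_dummy (shows with fewer than 5 tweets)
interaction_mid = 0.0027     # censor_dummy:data_between_5_and_100
interaction_high = -0.0266   # censor_dummy:data_greaterthan_100

# Total effect of censorship on log ratings, by activity level
total_low = baseline
total_mid = baseline + interaction_mid
total_high = baseline + interaction_high

print(f"low: {total_low:.4f}, mid: {total_mid:.4f}, high: {total_high:.4f}")
```

The high-activity total matches the -0.0335 estimated in the separate regression for shows with at least 100 tweets, as expected when each show belongs to exactly one group.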

**(c) Relate your findings across shows with different activity levels to the geographic difference-in-difference approach. Which regression is more informative regarding the impact of the censorship on ratings?**

Since our goal is to measure the impact of online word-of-mouth, the across-show difference-in-differences regression appears more informative. It shows how the impact of the censorship varies across shows with different levels of social media activity, which ties the drop in ratings directly to the amount of online discussion. Note, however, that it uses only mainland data and therefore cannot rule out shocks during the censorship days that affected heavily discussed shows for other reasons. In contrast, the geographic difference-in-differences regression compares changes in show ratings over time between a region exposed to the censorship and one that was not.

Therefore, the two approaches are complementary: both point to a negative effect of the censorship on show ratings.