**Building Nations through Shared Experiences: Evidence from African Football**

*By Emilio Depetris-Chauvin, Ruben Durante, and Filipe Campante*

Replicated by Saul Marenco

In [171]:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.discrete.discrete_model import Probit
from patsy import dmatrices
from tabulate import tabulate


In [172]:
### Table 1: Balance in Covariates ###

## Table 1 tests whether respondent characteristics that could be correlated with the outcomes of interest are also
## correlated with the timing of the interview. The characteristics tested are: gender, education, age, unemployment
## status, religious affiliation, ethnic majority status, rural residency, access to basic public goods in the
## respondent's area, gender, education and age of the interviewer, sharing a common language with the respondent,
## and whether the interviewer perceived some external influence on the respondent during the interview.

## Two different tests are done, comparing individuals before and after a match in general (Panel A), and comparing
## individuals before and after a victory (Panel B). All regressions include country-match fixed effects and
## cluster standard errors on "country_year_fe".

# Load dataset 1-afrobarometer_games_15days.dta
df = pd.read_stata("/Users/saulmg/Downloads/BuildingNationsThroughSharedExperiencesEvidenceFromAfricanFootball/115011-V1/AER-2018-0805_replication/0-replication-afrobarometer/1-Final/1-afrobarometer_games_15days.dta")

# Filter the dataset to only include observations in main sample
df = df[df["main_sample"] == 1]

# Respondent and interviewer characteristics
inv = ["male", "education", "age", "unemployed", "major_ethnicity", "rural",
       "religious_group_member", "public_goods", "same_language",
       "influenced_by_others", "male_interviewer", "education_interviewer", "age_interviewer"]


observations = []
means = []
panel_A_est = []
panel_A_std = []
panel_B_est = []
panel_B_std = []

data_table1 = []

# Loop through characteristics as dependent variables
for v in inv:
    
    ds = df.dropna(subset = [v, "post_match", "country_match_fe", "post_victory"])
    
    var_results = []
    
    var_results.append(v)
    
    # Record number of observations and means
    var_results.append(ds[v].count())
    var_results.append(round(ds[v].mean(), 2))
    
    # Panel A model
    model = smf.ols(f'{v} ~ post_match + C(country_match_fe)', data = ds).fit(cov_type = "cluster", cov_kwds = {"groups": ds["country_year_fe"]})
    
    # Panel A estimate and standard errors
    var_results.append(round(model.params["post_match"], 3))
    var_results.append(round(model.bse["post_match"], 3))
    
    # Panel B model
    model = smf.ols(f'{v} ~ post_victory + C(country_match_fe)', data = ds).fit(cov_type = "cluster", cov_kwds = {"groups": ds["country_year_fe"]})

    # Panel B estimate and standard errors
    var_results.append(round(model.params["post_victory"], 3))
    var_results.append(round(model.bse["post_victory"], 3))
    
    
    data_table1.append(var_results)
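
For readers unfamiliar with `cov_type = "cluster"`, the following is a minimal NumPy sketch of the cluster-robust (sandwich) variance that this option requests. The data are synthetic, the function name `cluster_robust_se` is made up for illustration, and the small-sample factor is an assumption about the statsmodels default that should be checked against its documentation; this is not the library's actual implementation.

```python
import numpy as np

def cluster_robust_se(X, y, groups):
    """OLS coefficients with cluster-robust (sandwich) standard errors."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    groups = np.asarray(groups)
    n, k = X.shape

    # "Bread": (X'X)^-1, and the OLS coefficients
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta

    # "Meat": sum over clusters of the outer product of X_g' e_g
    labels = np.unique(groups)
    meat = np.zeros((k, k))
    for g in labels:
        mask = groups == g
        score = X[mask].T @ resid[mask]
        meat += np.outer(score, score)

    # Small-sample factor (assumed to mirror the statsmodels default)
    n_groups = len(labels)
    correction = (n_groups / (n_groups - 1)) * ((n - 1) / (n - k))

    cov = correction * xtx_inv @ meat @ xtx_inv
    return beta, np.sqrt(np.diag(cov))

# Synthetic example: 30 clusters of 40 observations with a shared cluster shock
rng = np.random.default_rng(0)
n_clusters, per_cluster = 30, 40
groups = np.repeat(np.arange(n_clusters), per_cluster)
post = rng.integers(0, 2, size=groups.size)
shock = rng.normal(size=n_clusters)[groups]
y = 0.5 + shock + rng.normal(size=groups.size)   # true effect of `post` is zero
X = np.column_stack([np.ones(groups.size), post])

beta, se = cluster_robust_se(X, y, groups)
print(beta.round(3), se.round(3))
```

Note that `groups` has exactly one label per row of `X`, which is the same requirement the `groups` argument has in the loop above.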
    
    

In [173]:
headers = ["Covariate", "Observations", "Mean", "Panel-A Estimate", "Panel-A SDE", "Panel-B Estimate", "Panel-B SDE"]
table1 = tabulate(data_table1, headers)

In [174]:
print(table1)

### Comparing Results:
### All results match exactly Table 1's results in the paper including standard
### errors.

Covariate                 Observations    Mean    Panel-A Estimate    Panel-A SDE    Panel-B Estimate    Panel-B SDE
----------------------  --------------  ------  ------------------  -------------  ------------------  -------------
male                             37134    0.5                0.006          0.004               0.009          0.005
education                        37134    3.08              -0.136          0.15               -0.317          0.155
age                              37134   36.93               0.804          0.702               1.534          0.8
unemployed                       37134    0.3                0              0.013              -0.003          0.013
major_ethnicity                  37134    0.46              -0.016          0.046              -0.016          0.039
rural                            37134    0.61               0.098          0.074               0.173          0.076
religious_group_member           37005    0.42              -0.026

In [175]:
### Table 2: National Team’s Performance and Ethnic Identification ###

## Table 2 tests the relationship between national team performance and ethnic identification. Every regression uses
## the "ethnic_sentiment" dummy (1 for stronger ethnic than national sentiment, 0 otherwise) as the dependent
## variable; the columns differ in the controls they include. All observations fall within 15 days of a single match.
## "Post-game", "Post-victory", "Post-draw" and "Post-defeat" take value 1 when the individual is interviewed within
## 15 days after a game with the corresponding outcome, and 0 if within 15 days before a game.

# The same dataset as Table 1 is loaded.
df = pd.read_stata("/Users/saulmg/Downloads/BuildingNationsThroughSharedExperiencesEvidenceFromAfricanFootball/115011-V1/AER-2018-0805_replication/0-replication-afrobarometer/1-Final/1-afrobarometer_games_15days.dta")

# Filter the dataset to only include observations in main sample
df = df[df["main_sample"] == 1]

# pd.read_stata loads this labelled variable as text categories, so the labels are mapped to 0/1 by hand
df["ethnic_sentiment"] = df["ethnic_sentiment"].replace({
    "no in favor of ethnic": 0,
    "in favor of ethnic": 1 })
df["ethnic_sentiment"] = pd.to_numeric(df["ethnic_sentiment"], errors= "coerce")
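
The recoding step can be illustrated on a toy series. This is a sketch only: the two labels are the ones used in the cell above, while the `"don't know"` label and the variable names are hypothetical, added to show that labels missing from the mapping end up as NaN and are later removed by `dropna`. As a possible alternative, `pd.read_stata` accepts `convert_categoricals=False`, which returns the underlying numeric codes, but whether those codes are already 0/1 in this dataset has not been checked here.

```python
import pandas as pd

# The first two labels come from the notebook; "don't know" is a hypothetical
# extra label used only to show what happens to unmapped values.
raw = pd.Series(
    ["in favor of ethnic", "no in favor of ethnic", "don't know", None],
    dtype="category",
)
label_to_code = {"no in favor of ethnic": 0, "in favor of ethnic": 1}

# Cast to plain objects first, then map; anything not in the dictionary
# (including missing values) becomes NaN.
coded = raw.astype("object").map(label_to_code)
print(coded.tolist())
```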


In [176]:
## OLS 1 ##

## The first column regresses "ethnic_sentiment" on "post_match" (the "Post-game" indicator), controlling for
## country-match and language group x year dummies.

# Clean relevant variables
d1 = df.dropna(subset = ["ethnic_sentiment", "post_match", "country_match_fe", "language_year_id", "country_year_fe"])

# Clustering standard errors by "country_year_fe"
model = smf.ols("ethnic_sentiment ~ post_match + C(country_match_fe) + C(language_year_id)",
                data = d1).fit(cov_type = "cluster", cov_kwds = {"groups": d1["country_year_fe"]})

# Assign values
post_game_est1 = round(model.params["post_match"], 3)
post_game_sde1 = round(model.bse["post_match"], 3)
post_game_obs1 = int(model.nobs)
post_game_r1 = round(model.rsquared, 3)


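/opt/anaconda3/lib/python3.9/site-packages/statsmodels/regression/linear_model.py:1884: RuntimeWarning: invalid value encountered in sqrt
  return np.sqrt(np.diag(self.cov_params()))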
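
One diagnostic that may be worth running when this warning appears is a rank check of the design matrix, since overlapping sets of fixed-effect dummies can make it rank deficient. The sketch below uses hypothetical toy data in which one grouping is nested inside the other; it is not established here that this is what happens with `country_match_fe` and `language_year_id`, and the helper names are made up.

```python
import numpy as np

def dummies(labels):
    """One 0/1 column per distinct label (no category dropped)."""
    labels = np.asarray(labels)
    return (labels[:, None] == np.unique(labels)[None, :]).astype(float)

def drop_first(d):
    """Drop the first dummy column, as a treatment-coded C(...) term would."""
    return d[:, 1:]

# Hypothetical toy data: every fe_b group sits entirely inside one fe_a group
fe_a = np.array([0, 0, 0, 0, 1, 1, 1, 1])
fe_b = np.array([0, 0, 1, 1, 2, 2, 3, 3])

X = np.column_stack([
    np.ones(len(fe_a)),          # intercept
    drop_first(dummies(fe_a)),   # 1 column
    drop_first(dummies(fe_b)),   # 3 columns
])

n_cols = X.shape[1]
rank = np.linalg.matrix_rank(X)
print(n_cols, rank)   # prints "5 4": one column is a linear combination of others
```

When the rank is below the number of columns, some coefficients are not separately identified, and it would be reasonable to inspect how the software resolved that before trusting the reported standard errors.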


In [177]:
## OLS 2 ##

## The second column regresses "ethnic_sentiment" on "post_match", controlling for country-match,
## language group x year dummies and individual controls ("male", "age", "age_sq", "unemployed",
## "rural" and "education").

# Clean relevant variables
d2 = d1.dropna(subset = ["male", "age", "age_sq", "unemployed", "rural", "education"])

# Clustering standard errors by "country_year_fe"
model = smf.ols("ethnic_sentiment ~ post_match +male+age+age_sq+unemployed+rural+education + C(country_match_fe) +\
                C(language_year_id)",
                data = d2).fit(cov_type = "cluster", cov_kwds = {"groups": d2["country_year_fe"]})

# Assign values
post_game_est2 = round(model.params["post_match"], 3)
post_game_sde2 = round(model.bse["post_match"], 3)
post_game_obs2 = int(model.nobs)
post_game_r2 = round(model.rsquared, 3)
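
Because each column is fitted on a progressively filtered frame (`d1`, `d2`, ...), the cluster labels are safest when taken from the same frame as the data, so that there is exactly one label per fitted row. A small illustration with a hypothetical toy frame (all names below are invented for the example):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: two rows have a missing outcome
df_toy = pd.DataFrame({
    "y": [1.0, np.nan, 0.0, 1.0, np.nan, 0.0],
    "x": [0, 1, 0, 1, 0, 1],
    "cluster": ["a", "a", "b", "b", "c", "c"],
})
d_toy = df_toy.dropna(subset=["y", "x"])

groups_full = df_toy["cluster"]   # one label per original row
groups_used = d_toy["cluster"]    # one label per row that is actually fitted

print(len(d_toy), len(groups_full), len(groups_used))   # prints "4 6 4"
```

Only `groups_used` lines up row by row with `d_toy`; passing `groups_full` alongside `d_toy` would pair some observations with the wrong cluster or fail outright, depending on the library.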

In [178]:
## OLS 3 ##

## The third column regresses "ethnic_sentiment" on "post_match", controlling for country-match,
## language group x year dummies, individual controls ("male", "age", "age_sq", "unemployed",
## "rural" and "education"), and seasonal fixed effects.

# Clean relevant variables
d3 = d2.dropna(subset = ["dayweek", "day", "month"])

# Clustering standard errors by "country_year_fe"
model = smf.ols("ethnic_sentiment ~ post_match +male+age+age_sq+unemployed+rural+education + C(country_match_fe) +\
                C(language_year_id) + C(dayweek) + C(day) + C(month)" ,
                data = d3).fit(cov_type = "cluster", cov_kwds = {"groups": d3["country_year_fe"]})

# Assign values
post_game_est3 = round(model.params["post_match"], 3)
post_game_sde3 = round(model.bse["post_match"], 3)

post_game_obs3 = int(model.nobs)
post_game_r3 = round(model.rsquared, 3)

In [179]:
## OLS 4 ##

## The fourth column regresses "ethnic_sentiment" on "post_match" and "post_victory", controlling
## for country-match, language group x year dummies, individual controls ("male", "age", "age_sq",
## "unemployed", "rural" and "education"), and seasonal fixed effects.

# Clean relevant variables
d4 = d3.dropna(subset = ["post_victory"])

# Clustering standard errors by "country_year_fe"
model = smf.ols("ethnic_sentiment ~ post_match + post_victory +male+age+age_sq+unemployed+rural+education + C(country_match_fe) +\
                C(language_year_id) + C(dayweek) + C(day) + C(month)" ,
                data = d4).fit(cov_type = "cluster", cov_kwds = {"groups": d4["country_year_fe"]})

# Assign values
post_game_est4 = round(model.params["post_match"], 3)
post_game_sde4 = round(model.bse["post_match"], 3)

post_victory_est4 = round(model.params["post_victory"], 3)
post_victory_sde4 = round(model.bse["post_victory"], 3)

post_game_obs4 = int(model.nobs)
post_game_r4 = round(model.rsquared, 3)

In [180]:
## OLS 5 ##

## The fifth column regresses "ethnic_sentiment" on "post_victory", "post_draw" and "post_defeat",
## controlling for country-match, language group x year dummies, individual controls ("male", "age",
## "age_sq", "unemployed", "rural" and "education"), and seasonal fixed effects.

# Clean relevant variables
d5 = d4.dropna(subset = ["post_draw", "post_defeat"])

# Clustering standard errors by "country_year_fe"
model = smf.ols("ethnic_sentiment ~ post_victory + post_draw + post_defeat +male+age+age_sq+unemployed+rural+education +\
                C(country_match_fe) + C(language_year_id) + C(dayweek) + C(day) + C(month)" ,
                data = d5).fit(cov_type = "cluster", cov_kwds = {"groups": d5["country_year_fe"]})

# Assign values
post_victory_est5 = round(model.params["post_victory"], 3)
post_victory_sde5 = round(model.bse["post_victory"], 3)

post_draw_est5 = round(model.params["post_draw"], 3)
post_draw_sde5 = round(model.bse["post_draw"], 3)

post_defeat_est5 = round(model.params["post_defeat"], 3)
post_defeat_sde5 = round(model.bse["post_defeat"], 3)

post_victory_obs5 = int(model.nobs)
post_victory_r5 = round(model.rsquared, 3)

In [181]:
## OLS 6 ##

## The sixth column regresses "ethnic_sentiment" on "post_victory" alone, controlling for country-match,
## language group x year dummies, individual controls ("male", "age", "age_sq", "unemployed",
## "rural" and "education"), and seasonal fixed effects.

# No further cleaning needed: reuse the column 5 sample
d6 = d5

# Clustering standard errors by "country_year_fe"
model = smf.ols("ethnic_sentiment ~ post_victory +male+age+age_sq+unemployed+rural+education + C(country_match_fe) +\
                C(language_year_id) + C(dayweek) + C(day) + C(month)" ,
                data = d6).fit(cov_type = "cluster", cov_kwds = {"groups": d6["country_year_fe"]})

# Assign values
post_victory_est6 = round(model.params["post_victory"], 3)
post_victory_sde6 = round(model.bse["post_victory"], 3)

post_victory_obs6 = int(model.nobs)
post_victory_r6 = round(model.rsquared, 3)

In [182]:
## Probit 7 ##

## The seventh column regresses "ethnic_sentiment" on "post_victory", but instead of a linear model
## like the other columns, it uses a Probit (non-linear) model. It is meant to show that the results
## carry over to a model better suited to binary dependent variables such as "ethnic_sentiment".

# No further cleaning needed: reuse the column 6 sample
d7 = d6

# Define formula
formula = "ethnic_sentiment ~ post_victory +male+age+age_sq+unemployed+rural+education + C(country_match_fe) +\
            C(language_year_id) + C(dayweek) + C(day) + C(month)"

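# Build the outcome vector and design matrix from the formula
y, X = dmatrices(formula, d7, return_type = "dataframe")

# Clustering standard errors by "country_year_fe"; Probit is the class imported at the top,
# and the groups are aligned to the rows kept in X
model = Probit(y, X).fit(cov_type = "cluster", cov_kwds = {"groups": d7.loc[X.index, "country_year_fe"]})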

# Assign values
post_victory_est7 = round(model.params["post_victory"], 3)
post_victory_sde7 = round(model.bse["post_victory"], 3)

post_victory_obs7 = int(model.nobs)
post_victory_r7 = round(model.prsquared, 3)

  L = q*self.pdf(q*XB)/self.cdf(q*XB)


Optimization terminated successfully.
         Current function value: nan
         Iterations 16


In [183]:
# Assign values to Table 2
row1 = ["Post_game", post_game_est1, post_game_est2, post_game_est3, post_game_est4, "", "", ""]
row1sde = ["(Standard Error)", post_game_sde1, post_game_sde2, post_game_sde3, post_game_sde4, "", "", ""]

row2 = ["Post-victory", "", "", "", post_victory_est4, post_victory_est5, post_victory_est6,
        post_victory_est7]
row2sde = ["(Standard Error)",  "", "", "", post_victory_sde4, post_victory_sde5, post_victory_sde6,
           post_victory_sde7]

row3 = ["Post-draw", "", "", "", "", post_draw_est5, "", ""]
row3sde = ["(Standard Error)", "", "", "", "", post_draw_sde5, "", ""]

row4 = ["Post-defeat", "", "", "", "", post_defeat_est5, "", ""]
row4sde = ["(Standard Error)", "", "", "", "", post_defeat_sde5, "", ""]

row5 = ["Post-victory marginal effect", "", "", "", "", "", "", ""]
row5sde = ["(Standard Error)", "", "", "", "", "", "", ""]

row6 = ["Country × match FE", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes"]
row7 = ["Language × year FE", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes"]
row8 = ["Individual controls", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes"]
row9 = ["Seasonal FE", "No", "No", "Yes", "Yes", "Yes", "Yes", "Yes"]
row10 = ["Observations", post_game_obs1, post_game_obs2, post_game_obs3, post_game_obs4, post_victory_obs5,
         post_victory_obs6, post_victory_obs7]
row11 = ["R-squared", post_game_r1, post_game_r2, post_game_r3, post_game_r4, post_victory_r5, post_victory_r6,
         post_victory_r7]

data_table2 = [row1, row1sde, row2, row2sde, row3, row3sde, row4, row4sde, row5, row5sde, row6, row7, row8,
               row9, row10, row11]


headers = ["OLS (1)", "OLS (2)", "OLS (3)", "OLS (4)", "OLS (5)", "OLS (6)", "Probit (7)"]
table2 = tabulate(data_table2, headers)

print(table2)

### Comparing Results:
### Columns 2 through 6 match the paper in both estimates and standard errors, and their
### R-squared values agree with the paper's to within a hundredth.
### Column 1 matches in its estimate, but its standard error differs by a notable
### margin. It may have something to do with the warning "/opt/anaconda3/lib/python3.9/
### site-packages/statsmodels/regression/linear_model.py:1884: RuntimeWarning: invalid value encountered in sqrt
### return np.sqrt(np.diag(self.cov_params()))", which could point to a problem with the data that we
### were not able to solve. Still, since estimates and standard errors match for the other
### columns, this discrepancy is unusual. In addition, the number of observations in every column
### differs from the paper's, even after applying all the necessary data filters and dropping
### null values. Finally, column 7's Probit regression could not be replicated: translating it
### from Stata to Python was rather complicated, and the best attempt returned "nan" values.
### Because column 7 failed, we cannot report the "Post-victory marginal effect" or its
### standard error.

                              OLS (1)    OLS (2)    OLS (3)    OLS (4)    OLS (5)    OLS (6)    Probit (7)
----------------------------  ---------  ---------  ---------  ---------  ---------  ---------  ------------
Post_game                     -0.026     -0.029     -0.036     -0.001
(Standard Error)              31.277     0.014      0.014      0.016
Post-victory                                                   -0.052     -0.053     -0.053     nan
(Standard Error)                                               0.019      0.017      0.016      nan
Post-draw                                                                 -0.026
(Standard Error)                                                          0.039
Post-defeat                                                               -0.0
(Standard Error)                                                          0.017
Post-victory marginal effect
(Standard Error)
Country × match FE            Yes        Yes        Yes        Yes        Yes  