# Social Media Relationships with Ordinary Least Squares

**Linear models** in the form of **ordinary least squares regression** can often help with understanding relationships between variables. In this exercise you will analyze synthetic data which relates game players' characteristics to social media engagement. 

To start, execute the code in the cell below to import the packages you will need for this example. 

In [21]:
import numpy as np
import numpy.random as nr
from scipy.stats import truncnorm
import statsmodels.api as sm
import statsmodels.formula.api as smf
import pandas as pd

## matplotlib with display of graphs inline
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

The code in the cell below generates synthetic data with characteristics of game players and social media engagement. These variables are drawn from a multivariate Normal distribution, and the sampled values are then truncated to limits of 0.0 and 10.0 by setting any value outside that range to the nearest limit. Execute this code.

In [22]:
## Create data set as multivariate Normal
covariance = np.array([[1.0,0.7,0.7,0.0],
                      [0.7,1.0,0.6,0.7],
                      [0.7,0.6,1.0,0.2],
                      [0.0,0.7,0.2,1.0]])
effect_data = pd.DataFrame(nr.multivariate_normal(mean=[3.0,2.0,2.0,2.0], cov=covariance, size=500),
                           columns=['Fan','TimePlaying','SocialMedia','GameFamiliarity'])

## Truncate values to range 0.0 <= x <= 10.0
effect_data[effect_data < 0.0] = 0.0
effect_data[effect_data > 10.0] = 10.0
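
The truncation step above can also be written with the pandas `clip` method. The sketch below is a minimal illustration, assuming a seeded NumPy `Generator` (the seed `42` and the names `raw` and `clipped` are not part of the notebook) so that the result is reproducible.

```python
import numpy as np
import pandas as pd

# Assumption: a seeded Generator is used so this sketch is reproducible;
# the notebook cell above draws from numpy.random without a seed.
rng = np.random.default_rng(42)
covariance = np.array([[1.0, 0.7, 0.7, 0.0],
                       [0.7, 1.0, 0.6, 0.7],
                       [0.7, 0.6, 1.0, 0.2],
                       [0.0, 0.7, 0.2, 1.0]])
columns = ['Fan', 'TimePlaying', 'SocialMedia', 'GameFamiliarity']
raw = pd.DataFrame(rng.multivariate_normal(mean=[3.0, 2.0, 2.0, 2.0],
                                           cov=covariance, size=500),
                   columns=columns)

# Same effect as the two boolean-mask assignments: any value outside
# [0.0, 10.0] is set to the nearest limit.
clipped = raw.clip(lower=0.0, upper=10.0)

# Fraction of values in each column that were moved to a limit
print((raw != clipped).mean())
```

Only a small fraction of values should be affected, since the means sit two to three standard deviations above the lower limit.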

The code in the cell below transforms the TimePlaying variable by squaring the values. This transformation gives the transformed variable a positive skew. The code also rounds all values to two decimal places. Execute this code and examine the resulting data frame. 

In [23]:
## Square the values of TimePlaying to give positive skew
effect_data['TimePlaying'] = np.square(effect_data['TimePlaying'])

## And round all values to 2 decimal places
effect_data = effect_data.round(decimals=2)
effect_data

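| | Fan | TimePlaying | SocialMedia | GameFamiliarity |
|---|---|---|---|---|
| 0 | 2.45 | 2.41 | 2.46 | 1.92 |
| 1 | 3.36 | 4.80 | 1.52 | 1.96 |
| 2 | 2.15 | 2.22 | 1.45 | 2.05 |
| 3 | 1.83 | 2.04 | 2.86 | 2.19 |
| 4 | 2.18 | 2.64 | 2.57 | 2.56 |
| 5 | 3.21 | 11.54 | 3.40 | 4.14 |
| 6 | 3.80 | 4.74 | 2.53 | 1.16 |
| 7 | 2.70 | 1.42 | 1.28 | 1.02 |
| 8 | 4.26 | 9.61 | 3.46 | 2.03 |
| 9 | 1.15 | 0.83 | 1.30 | 2.46 |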
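
The claim that squaring introduces positive skew can be checked numerically. The sketch below is an illustration on stand-in data, not the notebook's own `effect_data`: it assumes a Normal(2, 1) sample floored at 0 (matching the TimePlaying marginal before squaring) and a seed chosen for reproducibility.

```python
import numpy as np
import pandas as pd

# Assumption: stand-in for the untransformed TimePlaying column,
# drawn as Normal(2, 1) and floored at 0.0
rng = np.random.default_rng(0)
time_playing = pd.Series(rng.normal(loc=2.0, scale=1.0, size=500)).clip(lower=0.0)
squared = np.square(time_playing)

# Sample skewness: near 0 for a symmetric distribution, positive for a long right tail
skew_before = time_playing.skew()
skew_after = squared.skew()
print(f"skew before squaring: {skew_before:.2f}")
print(f"skew after squaring:  {skew_after:.2f}")
```

The same `skew()` call can be applied to `effect_data['TimePlaying']` in the notebook.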


The code in the cell below fits a linear model of SocialMedia engagement as a function of TimePlaying. Execute this code and examine the results.   

> The code in the cell below uses the R style model formula. This modeling language was introduced in Chambers and Hastie, 1992, Statistical Models in S.     

> For a good [**cheatsheet and summary of the R modeling language**](http://faculty.chicagobooth.edu/richard.hahn/teaching/formulanotation.pdf) look at the posting by Richard Hahn of the Chicago Booth School.    

> Models are defined by an equation using the $\sim$ symbol to mean 'modeled by'. The variable to be modeled is always on the left, and the variables used to model it are on the right. This basic scheme can be written: 

$$dependent\ variable \sim independent\ variables$$

> For example, if the dependent variable (dv) is modeled by two independent variables (var1 and var2), with no interaction, the formula would be:
$$dv \sim var1 + var2$$

In [24]:
social_time_model = smf.ols(formula = 'SocialMedia ~ TimePlaying', data=effect_data).fit()
social_time_model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | SocialMedia | R-squared: | 0.328 |
| Model: | OLS | Adj. R-squared: | 0.327 |
| Method: | Least Squares | F-statistic: | 243.2 |
| Date: | Sun, 05 Jan 2020 | Prob (F-statistic): | 6.23e-45 |
| Time: | 18:03:14 | Log-Likelihood: | -581.89 |
| No. Observations: | 500 | AIC: | 1168.0 |
| Df Residuals: | 498 | BIC: | 1176.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 1.3627 | 0.055 | 24.878 | 0.000 | 1.255 | 1.470 |
| TimePlaying | 0.1330 | 0.009 | 15.594 | 0.000 | 0.116 | 0.150 |

| | | | |
|---|---|---|---|
| Omnibus: | 10.639 | Durbin-Watson: | 2.038 |
| Prob(Omnibus): | 0.005 | Jarque-Bera (JB): | 6.331 |
| Skew: | 0.081 | Prob(JB): | 0.0422 |
| Kurtosis: | 2.473 | Cond. No. | 10.3 |


The summary of the ordinary least squares model contains quite a lot of information. We will focus on only a few of these values:    
- The value of the coefficient is an estimate of the **effect size**. In this case the effect being measured is that of TimePlaying on SocialMedia. 
- The confidence interval of the coefficient value, which indicates the range of likely values for the effect size. It is important to keep in mind that any estimated effect size is uncertain and not exact. 
- The adjusted R-squared value, which is one minus the ratio of the variance of the residuals from the fitted model to the variance of the dependent variable. 
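
The three quantities listed above can also be read directly from the fitted results object rather than from the printed summary. The sketch below is a minimal example, assuming the synthetic data are regenerated with a seeded generator (the seed `42` and the name `data` are assumptions), so the printed numbers will differ from the summary shown above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Assumption: seeded regeneration of the synthetic data for reproducibility
rng = np.random.default_rng(42)
covariance = np.array([[1.0, 0.7, 0.7, 0.0],
                       [0.7, 1.0, 0.6, 0.7],
                       [0.7, 0.6, 1.0, 0.2],
                       [0.0, 0.7, 0.2, 1.0]])
data = pd.DataFrame(rng.multivariate_normal(mean=[3.0, 2.0, 2.0, 2.0],
                                            cov=covariance, size=500),
                    columns=['Fan', 'TimePlaying', 'SocialMedia', 'GameFamiliarity'])
data = data.clip(lower=0.0, upper=10.0)
data['TimePlaying'] = np.square(data['TimePlaying'])

model = smf.ols(formula='SocialMedia ~ TimePlaying', data=data).fit()

# Effect size estimate and its 95% confidence interval
coef = model.params['TimePlaying']
low, high = model.conf_int(alpha=0.05).loc['TimePlaying']
print(f"TimePlaying effect: {coef:.4f}  (95% CI {low:.3f} to {high:.3f})")

# Adjusted R-squared
print(f"Adjusted R-squared: {model.rsquared_adj:.3f}")
```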



> #### Summary of R-Squared

> - **R squared or $R^2$**, also known as the **coefficient of determination**,  
$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$  
where,   
$SS_{res} = \sum_{i=1}^n r_i^2$, or the sum of the squared residuals,   
$SS_{tot} = \sum_{i=1}^n (y_i - \bar{y})^2$, or the sum of the squared deviations of the label values from their mean $\bar{y}$.  

> In other words, $R^2$ is a measure of the reduction in the sum of squares between the mean-centered label values and the residuals. If the model has not reduced the sum of squares of the labels (a useless model!), $R^2 = 0$. On the other hand, if the model fits the data perfectly so all $r_i = 0$, then $R^2 = 1$. 

> - **Adjusted R squared or $R^2_{adj}$** is $R^2$ adjusted for degrees of freedom in the model,
$$R^2_{adj} = 1 - \frac{var(r)}{var(y)} = 1 - \frac{\frac{SS_{res}}{(n - p -1)}}{\frac{SS_{tot}}{(n-1)}}$$  
where,   
$var(r) = $ the variance of the residuals,   
$var(y) = $ the variance of the labels,   
$n = $ the number of samples or cases,   
$p = $ the number of model parameters, not counting the intercept.  

> The interpretation of $R^2_{adj}$ is the same as $R^2$. In many cases there will be little difference. However, if the number of parameters is large relative to the number of cases, $R^2$ will give an overly optimistic measure of model performance. In general, the difference between $R^2_{adj}$ and $R^2$ shrinks as the number of cases $n$ grows. However, even for 'big data' models there can be a noticeable difference if there are a large number of model parameters.   
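
The two formulas above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions: the helper names `r_squared` and `adjusted_r_squared` and the small data set are hypothetical, the total sum of squares is taken about the mean of the labels, and $p$ counts the predictors without the intercept.

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 = 1 - SS_res / SS_tot, with SS_tot taken about the mean of y."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(r2, n, p):
    """Adjust R^2 for n cases and p predictors (intercept not counted)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Hypothetical data: one predictor plus noise
rng = np.random.default_rng(2)
x = rng.normal(size=100)
y = 1.5 + 0.8 * x + rng.normal(scale=0.5, size=100)

# Least-squares line; polyfit returns the slope first, then the intercept
slope, intercept = np.polyfit(x, y, deg=1)
r2 = r_squared(y, intercept + slope * x)
print(f"R^2 = {r2:.3f}, adjusted R^2 = {adjusted_r_squared(r2, n=100, p=1):.3f}")

# Applying the adjustment to the R-squared values reported in this notebook (n = 500)
print(round(adjusted_r_squared(0.328, n=500, p=1), 3))  # 0.327, as in the first summary
print(round(adjusted_r_squared(0.478, n=500, p=2), 3))  # 0.476, as in the two-variable summary
```

With 500 cases and only one or two predictors the adjustment is tiny, which is why the two values in each summary nearly coincide.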


As a next step in the analysis we will add another variable to the model, Fan. Execute the code in the cell below and examine the summary of the new model. 

In [25]:
social_time_fan_model = smf.ols(formula='SocialMedia ~ TimePlaying + Fan', data=effect_data).fit()
social_time_fan_model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | SocialMedia | R-squared: | 0.478 |
| Model: | OLS | Adj. R-squared: | 0.476 |
| Method: | Least Squares | F-statistic: | 227.5 |
| Date: | Sun, 05 Jan 2020 | Prob (F-statistic): | 6.96e-71 |
| Time: | 18:03:15 | Log-Likelihood: | -518.78 |
| No. Observations: | 500 | AIC: | 1044.0 |
| Df Residuals: | 497 | BIC: | 1056.0 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 0.2901 | 0.102 | 2.846 | 0.005 | 0.090 | 0.490 |
| TimePlaying | 0.0463 | 0.010 | 4.427 | 0.000 | 0.026 | 0.067 |
| Fan | 0.4975 | 0.042 | 11.947 | 0.000 | 0.416 | 0.579 |

| | | | |
|---|---|---|---|
| Omnibus: | 4.342 | Durbin-Watson: | 1.947 |
| Prob(Omnibus): | 0.114 | Jarque-Bera (JB): | 3.701 |
| Skew: | 0.125 | Prob(JB): | 0.157 |
| Kurtosis: | 2.661 | Cond. No. | 25.1 |


Notice two changes between the first model and the model using two variables:
- The coefficient (effect) value for the variable Fan is large (0.4975), and the effect for TimePlaying is now quite small (0.0463, down from 0.1330). 
- The adjusted R-squared value is larger (0.476 versus 0.327), indicating this model explains more of the variance of the data.    
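
One way a coefficient can shrink like this is when two predictors are correlated and only one of them drives the response. The sketch below illustrates that mechanism on hypothetical data (the variables `x1`, `x2`, `y` and the helper `ols_coefficients` are assumptions, not part of the notebook); it is not a claim about how the notebook's data behave in every sample.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000

# Hypothetical predictors with a correlation of about 0.7
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + np.sqrt(1.0 - 0.7 ** 2) * rng.normal(size=n)

# The response depends on x1 only
y = 0.6 * x1 + rng.normal(scale=0.5, size=n)

def ols_coefficients(y, *predictors):
    """Least-squares coefficients, intercept first."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

b_x2_alone = ols_coefficients(y, x2)[1]
b_both = ols_coefficients(y, x2, x1)

print(f"x2 coefficient, x2 alone:     {b_x2_alone:.3f}")
print(f"x2 coefficient, x1 included:  {b_both[1]:.3f}")
print(f"x1 coefficient, x2 included:  {b_both[2]:.3f}")
```

On its own, `x2` picks up part of the effect of `x1` through their correlation; once `x1` is in the model, the `x2` coefficient falls toward zero.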

Another possibility is to model social media engagement by TimePlaying and GameFamiliarity. Execute the code in the cell below, examine the results, and compare them to the previous models. 

In [26]:
social_time_familarity_model = smf.ols(formula='SocialMedia ~ TimePlaying + GameFamiliarity', data=effect_data).fit()
social_time_familarity_model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | SocialMedia | R-squared: | 0.366 |
| Model: | OLS | Adj. R-squared: | 0.363 |
| Method: | Least Squares | F-statistic: | 143.4 |
| Date: | Sun, 05 Jan 2020 | Prob (F-statistic): | 6.65e-50 |
| Time: | 18:03:15 | Log-Likelihood: | -567.38 |
| No. Observations: | 500 | AIC: | 1141.0 |
| Df Residuals: | 497 | BIC: | 1153.0 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 1.6885 | 0.080 | 21.089 | 0.000 | 1.531 | 1.846 |
| TimePlaying | 0.1683 | 0.011 | 15.993 | 0.000 | 0.148 | 0.189 |
| GameFamiliarity | -0.2492 | 0.046 | -5.450 | 0.000 | -0.339 | -0.159 |

| | | | |
|---|---|---|---|
| Omnibus: | 4.621 | Durbin-Watson: | 2.02 |
| Prob(Omnibus): | 0.099 | Jarque-Bera (JB): | 3.36 |
| Skew: | 0.038 | Prob(JB): | 0.186 |
| Kurtosis: | 2.606 | Cond. No. | 17.6 |


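Compared with the model that uses TimePlaying and Fan, the R-squared value is smaller (0.366 versus 0.478), and the GameFamiliarity coefficient (-0.2492) is smaller in magnitude than the Fan coefficient (0.4975). Notice also that the sign of the GameFamiliarity effect is negative, even though GameFamiliarity and SocialMedia were generated with a positive covariance (0.2). Given these results, we say that GameFamiliarity is a **confounding effect**. In other words, adding GameFamiliarity confounds, or masks, the other effects but is not explanatory. 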
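
The negative sign can be traced back to the covariance matrix used to generate the data. The sketch below computes population regression slopes, $\Sigma_{xx}^{-1}\Sigma_{xy}$, directly from that matrix. It is a simplification: it ignores the squaring of TimePlaying and the truncation step, so the numbers are not expected to match the fitted summary, only the direction of the effect. The helper name `population_slopes` is hypothetical.

```python
import numpy as np

covariance = np.array([[1.0, 0.7, 0.7, 0.0],
                       [0.7, 1.0, 0.6, 0.7],
                       [0.7, 0.6, 1.0, 0.2],
                       [0.0, 0.7, 0.2, 1.0]])
names = ['Fan', 'TimePlaying', 'SocialMedia', 'GameFamiliarity']

def population_slopes(cov, names, target, predictors):
    """Slopes of the best linear predictor of target: solve Sigma_xx b = Sigma_xy."""
    idx = [names.index(p) for p in predictors]
    t = names.index(target)
    sigma_xx = cov[np.ix_(idx, idx)]   # covariance among the predictors
    sigma_xy = cov[idx, t]             # covariance of predictors with the target
    slopes = np.linalg.solve(sigma_xx, sigma_xy)
    return {p: float(b) for p, b in zip(predictors, slopes)}

alone = population_slopes(covariance, names, 'SocialMedia', ['GameFamiliarity'])
joint = population_slopes(covariance, names, 'SocialMedia',
                          ['TimePlaying', 'GameFamiliarity'])

print(alone)   # GameFamiliarity slope is 0.2 on its own
print(joint)   # about 0.90 for TimePlaying and -0.43 for GameFamiliarity
```

GameFamiliarity has a weak positive relationship with SocialMedia by itself, but a strong covariance (0.7) with TimePlaying, so once TimePlaying is in the model its slope turns negative.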

Given the above analysis, it seems that Fan is the variable that best explains social media engagement. This idea is easy to test, by creating a linear model of social media engagement as a function of Fan. Execute the code in the cell below to compute the model and display the summary. 

In [27]:
social_fan_model = smf.ols(formula='SocialMedia ~ Fan', data=effect_data).fit()
social_fan_model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | SocialMedia | R-squared: | 0.457 |
| Model: | OLS | Adj. R-squared: | 0.456 |
| Method: | Least Squares | F-statistic: | 419.8 |
| Date: | Sun, 05 Jan 2020 | Prob (F-statistic): | 4.04e-68 |
| Time: | 18:03:15 | Log-Likelihood: | -528.45 |
| No. Observations: | 500 | AIC: | 1061.0 |
| Df Residuals: | 498 | BIC: | 1069.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 0.1333 | 0.097 | 1.369 | 0.172 | -0.058 | 0.325 |
| Fan | 0.6254 | 0.031 | 20.489 | 0.000 | 0.565 | 0.685 |

| | | | |
|---|---|---|---|
| Omnibus: | 3.005 | Durbin-Watson: | 1.923 |
| Prob(Omnibus): | 0.223 | Jarque-Bera (JB): | 3.045 |
| Skew: | 0.164 | Prob(JB): | 0.218 |
| Kurtosis: | 2.802 | Cond. No. | 10.8 |


Notice the following points about this model:
- The effect size for Fan is large relative to effect sizes in other models.  
- The R-squared value is nearly as large as the best previous model, indicating that the Fan variable has good explanatory power. 

From this analysis, what can you conclude about the relationship between Fan, TimePlaying, and SocialMedia? 
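
One way to organize the evidence for this question is to put the four models side by side. The sketch below is a possible approach, assuming the synthetic data are regenerated with a seeded generator (the seed `42` and the names `data` and `comparison` are assumptions), so the values will differ somewhat from the summaries above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Assumption: seeded regeneration of the synthetic data for reproducibility
rng = np.random.default_rng(42)
covariance = np.array([[1.0, 0.7, 0.7, 0.0],
                       [0.7, 1.0, 0.6, 0.7],
                       [0.7, 0.6, 1.0, 0.2],
                       [0.0, 0.7, 0.2, 1.0]])
data = pd.DataFrame(rng.multivariate_normal(mean=[3.0, 2.0, 2.0, 2.0],
                                            cov=covariance, size=500),
                    columns=['Fan', 'TimePlaying', 'SocialMedia', 'GameFamiliarity'])
data = data.clip(lower=0.0, upper=10.0)
data['TimePlaying'] = np.square(data['TimePlaying'])

formulas = ['SocialMedia ~ TimePlaying',
            'SocialMedia ~ TimePlaying + Fan',
            'SocialMedia ~ TimePlaying + GameFamiliarity',
            'SocialMedia ~ Fan']

rows = []
for formula in formulas:
    fit = smf.ols(formula=formula, data=data).fit()
    row = {'formula': formula, 'adj_r2': fit.rsquared_adj, 'aic': fit.aic}
    # One column per predictor; NaN where a predictor is not in the model
    row.update(fit.params.drop('Intercept').to_dict())
    rows.append(row)

comparison = pd.DataFrame(rows).set_index('formula').round(3)
print(comparison)
```

Reading down the `adj_r2` column and across the coefficient columns shows which variables keep a large effect regardless of what else is in the model.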

##### Copyright 2020, Stephen F Elston. All rights reserved. 