# ZuCo x VAD Multiple Linear Regression Analysis

With this analysis we look for a correlation between the gaze features of all the words in the ZuCo dataset and their VAD (Valence, Arousal, Dominance) values.

## Import libraries and files

In [7]:
import pandas as pd
import statsmodels.api as sm


from sklearn.preprocessing import MinMaxScaler
# Make the parent directory importable so the local BackwardElimination module can be found.
# Index 0 is the script path (or '' in a REPL), so insert at position 1.
import sys
sys.path.insert(1, '../')
import BackwardElimination as be

## Take data from CSV files

In [2]:
vad_arousal = pd.read_csv(r'../Lexicons/NRC_VAD_Lexicon/NRC-VAD-Lexicon-a-scores.csv')
vad_valence = pd.read_csv(r'../Lexicons/NRC_VAD_Lexicon/NRC-VAD-Lexicon-v-scores.csv')
vad_dominance = pd.read_csv(r'../Lexicons/NRC_VAD_Lexicon/NRC-VAD-Lexicon-d-scores.csv')
zuco_cs = pd.read_csv(r'../Lexicons/ZuCo_words_dataset.csv')

In [3]:
zuco_ar_cs = pd.merge(zuco_cs, vad_arousal, how = 'inner', on = ['Word'])
zuco_ar_cs = zuco_ar_cs.sort_values(by=['Arousal']).reset_index(drop=True)
zuco_ar_cs = zuco_ar_cs.drop(['Word'], axis=1)

zuco_va_cs = pd.merge(zuco_cs, vad_valence, how = 'inner', on = ['Word'])
zuco_va_cs = zuco_va_cs.sort_values(by=['Valence']).reset_index(drop=True)
zuco_va_cs = zuco_va_cs.drop(['Word'], axis=1)

zuco_do_cs = pd.merge(zuco_cs, vad_dominance, how = 'inner', on = ['Word'])
zuco_do_cs = zuco_do_cs.sort_values(by=['Dominance']).reset_index(drop=True)
zuco_do_cs = zuco_do_cs.drop(['Word'], axis=1)
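
The three blocks above repeat the same merge, sort and drop steps. As a sketch, they could be wrapped in one helper; `merge_with_lexicon` is a hypothetical name and the toy data is made up purely to show that the inner join silently discards words missing from the lexicon.

```python
import pandas as pd


def merge_with_lexicon(words, lexicon, score_col):
    """Inner-join gaze features with one lexicon and sort by its score."""
    merged = pd.merge(words, lexicon, how="inner", on=["Word"])
    merged = merged.sort_values(by=[score_col]).reset_index(drop=True)
    return merged.drop(["Word"], axis=1)


# Toy data, made up for illustration only.
toy_words = pd.DataFrame({"Word": ["calm", "storm", "zzz"], "GD": [0.2, 0.5, 0.1]})
toy_lexicon = pd.DataFrame({"Word": ["storm", "calm"], "Arousal": [0.9, 0.1]})

# "zzz" has no lexicon entry, so only two rows survive the inner join.
toy_merged = merge_with_lexicon(toy_words, toy_lexicon, "Arousal")
print(toy_merged)
```

With the real data the equivalent call would be `merge_with_lexicon(zuco_cs, vad_arousal, 'Arousal')`, and likewise for Valence and Dominance.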

## Multiple Linear Regression between VAD and Gaze Features on words in ZuCo

### Analysis: Arousal as dependent and MPS, TRT, GD, FFD as independent

In [10]:
scaler = MinMaxScaler()
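
As a reminder of what the scaler does, `MinMaxScaler` with default settings maps each column independently to the range [0, 1] using `(x - min) / (max - min)`. The small check below uses made-up numbers, not ZuCo data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up values: two columns on very different scales.
toy = np.array([[10.0, 200.0], [20.0, 400.0], [40.0, 300.0]])

scaled = MinMaxScaler().fit_transform(toy)

# The same transformation written out by hand, column by column.
manual = (toy - toy.min(axis=0)) / (toy.max(axis=0) - toy.min(axis=0))

print(scaled)
print(np.allclose(scaled, manual))
```

Because the predictors are rescaled this way, each regression coefficient below is the expected change in the VAD score when a gaze feature goes from its minimum to its maximum observed value.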

In [14]:
# Keep only the four gaze features and the target column.
zuco_ar_cs = pd.DataFrame(zuco_ar_cs, columns=['MPS','TRT','GD','FFD','Arousal'])

# Scale the predictors to [0, 1], then add the intercept column.
x = zuco_ar_cs[['MPS','TRT','GD','FFD']]
x = pd.DataFrame(scaler.fit_transform(x), columns=['MPS','TRT','GD','FFD'])
y = zuco_ar_cs['Arousal']
x = sm.add_constant(x)

# Backward elimination prints the p-values at each step; we then summarize the final model.
model = be.backWardEliminationMLR(x,y)
model.summary()

const    9.135880e-61
MPS      3.239329e-01
TRT      7.606495e-01
GD       7.415110e-02
FFD      5.732004e-01
dtype: float64
 
const    7.878619e-61
MPS      2.883989e-01
GD       5.145309e-02
FFD      5.865315e-01
dtype: float64
 
const    6.363491e-63
MPS      2.814249e-01
GD       1.923564e-02
dtype: float64
 
const    1.854772e-194
GD        2.223395e-02
dtype: float64
 


| | | | |
|---|---|---|---|
| Dep. Variable: | Arousal | R-squared: | 0.005 |
| Model: | OLS | Adj. R-squared: | 0.004 |
| Method: | Least Squares | F-statistic: | 5.244 |
| Date: | Sat, 21 Nov 2020 | Prob (F-statistic): | 0.0222 |
| Time: | 11:32:37 | Log-Likelihood: | 389.22 |
| No. Observations: | 1002 | AIC: | -774.4 |
| Df Residuals: | 1000 | BIC: | -764.6 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 0.4416 | 0.012 | 37.733 | 0.000 | 0.419 | 0.465 |
| GD | 0.1001 | 0.044 | 2.290 | 0.022 | 0.014 | 0.186 |

| | | | |
|---|---|---|---|
| Omnibus: | 34.29 | Durbin-Watson: | 0.01 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 37.133 |
| Skew: | 0.465 | Prob(JB): | 8.64e-09 |
| Kurtosis: | 2.846 | Cond. No. | 8.92 |


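At a significance level of SL = 0.05, backward elimination keeps only gaze duration (GD, p = 0.022) as a predictor of Arousal. The association is statistically significant but very weak: the model explains about 0.5% of the variance in Arousal (R-squared = 0.005).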
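
To put the p-value in perspective, the effect size can be recovered from the F-statistic and the degrees of freedom reported in the summary, using the standard OLS identity R² = F·k / (F·k + df_resid), where k is the number of predictors. The sketch below only reuses numbers printed above; the helper name `r_squared_from_f` is ours.

```python
import math


def r_squared_from_f(f_stat, df_model, df_resid):
    """Recover R-squared from the overall F-statistic of an OLS fit."""
    return (f_stat * df_model) / (f_stat * df_model + df_resid)


# Values taken from the Arousal summary: F = 5.244, Df Model = 1, Df Residuals = 1000.
r2 = r_squared_from_f(5.244, 1, 1000)

# With a single predictor, the absolute Pearson correlation is sqrt(R-squared).
r = math.sqrt(r2)

print(f"R-squared = {r2:.4f}, |r| = {r:.3f}")
# prints: R-squared = 0.0052, |r| = 0.072
```

A correlation of about 0.07 reaches significance here mainly because of the sample size (1002 words), not because gaze duration explains much of the variation in Arousal.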

### Analysis: Valence as dependent and MPS, TRT, GD, FFD as independent

In [11]:
zuco_va_cs = pd.DataFrame(zuco_va_cs, columns=['MPS','TRT','GD','FFD','Valence'])
x = zuco_va_cs[['MPS','TRT','GD','FFD']]
x = pd.DataFrame(scaler.fit_transform(x), columns=['MPS','TRT','GD','FFD'])
y = zuco_va_cs['Valence']

x = sm.add_constant(x)
model = be.backWardEliminationMLR(x,y)
model.summary()

const    3.000476e-87
MPS      8.192093e-02
TRT      9.358091e-01
GD       7.299814e-01
FFD      2.398259e-01
dtype: float64
 
const    2.470918e-87
MPS      7.909354e-02
GD       6.381597e-01
FFD      2.362608e-01
dtype: float64
 
const    8.460427e-88
MPS      8.014381e-02
FFD      2.063254e-01
dtype: float64
 
const    5.419405e-98
MPS      6.785672e-02
dtype: float64
 
const    0.0
dtype: float64
 


| | | | |
|---|---|---|---|
| Dep. Variable: | Valence | R-squared: | -0.0 |
| Model: | OLS | Adj. R-squared: | -0.0 |
| Method: | Least Squares | F-statistic: | |
| Date: | Sat, 21 Nov 2020 | Prob (F-statistic): | |
| Time: | 11:00:09 | Log-Likelihood: | 229.88 |
| No. Observations: | 1002 | AIC: | -457.8 |
| Df Residuals: | 1001 | BIC: | -452.9 |
| Df Model: | 0 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 0.6176 | 0.006 | 101.573 | 0.000 | 0.606 | 0.630 |

| | | | |
|---|---|---|---|
| Omnibus: | 73.386 | Durbin-Watson: | 0.0 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 88.709 |
| Skew: | -0.681 | Prob(JB): | 5.46e-20 |
| Kurtosis: | 3.517 | Cond. No. | 1.0 |


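At SL = 0.05, backward elimination removes all four predictors (MPS, TRT, GD and FFD) and leaves an intercept-only model. We therefore find no significant linear relationship between Valence and the gaze features; the closest candidate is MPS (p = 0.068).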

### Analysis: Dominance as dependent and MPS, TRT, GD, FFD as independent

In [12]:
zuco_do_cs = pd.DataFrame(zuco_do_cs, columns=['MPS','TRT','GD','FFD','Dominance'])
x = zuco_do_cs[['MPS','TRT','GD','FFD']]
x = pd.DataFrame(scaler.fit_transform(x), columns=['MPS','TRT','GD','FFD'])
y = zuco_do_cs['Dominance']

x = sm.add_constant(x)
model = be.backWardEliminationMLR(x,y)
model.summary()

const    7.320637e-81
MPS      4.032543e-02
TRT      4.923203e-01
GD       8.312547e-03
FFD      4.722585e-01
dtype: float64
 
const    6.220076e-81
MPS      2.668924e-02
GD       6.534861e-03
FFD      5.008083e-01
dtype: float64
 
const    1.454908e-83
MPS      2.537672e-02
GD       7.622874e-04
dtype: float64
 


| | | | |
|---|---|---|---|
| Dep. Variable: | Dominance | R-squared: | 0.015 |
| Model: | OLS | Adj. R-squared: | 0.013 |
| Method: | Least Squares | F-statistic: | 7.823 |
| Date: | Sat, 21 Nov 2020 | Prob (F-statistic): | 0.000425 |
| Time: | 11:00:22 | Log-Likelihood: | 290.58 |
| No. Observations: | 1002 | AIC: | -575.2 |
| Df Residuals: | 999 | BIC: | -560.4 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 0.6109 | 0.029 | 21.342 | 0.000 | 0.555 | 0.667 |
| MPS | -0.1113 | 0.050 | -2.239 | 0.025 | -0.209 | -0.014 |
| GD | 0.1632 | 0.048 | 3.377 | 0.001 | 0.068 | 0.258 |

| | | | |
|---|---|---|---|
| Omnibus: | 30.242 | Durbin-Watson: | 0.033 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 17.428 |
| Skew: | -0.155 | Prob(JB): | 0.000164 |
| Kurtosis: | 2.433 | Cond. No. | 11.4 |


At SL = 0.05, backward elimination keeps mean pupil size (MPS, p = 0.025) and gaze duration (GD, p = 0.001) as predictors of Dominance. Both are statistically significant, but the relationship is weak: the model explains about 1.5% of the variance in Dominance (R-squared = 0.015).
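
To read the Dominance coefficients concretely, the fitted equation can be written out by hand. The sketch below hard-codes the coefficients from the summary table and expects MPS and GD already min-max scaled to [0, 1], as in the cell above; `predict_dominance` is a hypothetical helper, not part of the notebook.

```python
def predict_dominance(mps_scaled, gd_scaled):
    """Fitted equation from the summary: 0.6109 - 0.1113*MPS + 0.1632*GD."""
    return 0.6109 - 0.1113 * mps_scaled + 0.1632 * gd_scaled


# Extremes of the scaled predictors give the range of predicted Dominance.
lowest = predict_dominance(mps_scaled=1.0, gd_scaled=0.0)
highest = predict_dominance(mps_scaled=0.0, gd_scaled=1.0)
middle = predict_dominance(mps_scaled=0.5, gd_scaled=0.5)

print(f"{lowest:.4f} {middle:.4f} {highest:.4f}")
```

Even across the full observed range of both gaze features, the predicted Dominance moves only between roughly 0.50 and 0.77, which is consistent with the low R-squared.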