# PEARSON CORRELATION

In this notebook, I calculate the Pearson correlation coefficient between several variables by hand (without relying on the correlation functions that pandas or NumPy provide), as an exercise. The Pearson correlation coefficient, commonly denoted by r, measures the strength of the linear relationship between two variables and ranges from -1 to 1.
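
For reference, the sample coefficient for paired observations $(x_i, y_i)$, $i = 1, \dots, n$, is usually written as follows, where $\bar{x}$ and $\bar{y}$ are the sample means (the symbols are my own notation, not taken from the paper):

$$
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$

A value near 1 or -1 indicates a strong linear relationship, and a value near 0 indicates little or no linear relationship.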

For this exercise, I use the dataset published with the following paper: S. Lovdal, R. van der Hartigh, G. Azzopardi, "Injury Prediction in Competitive Runners With Machine Learning", International Journal of Sports Physiology and Performance, https://doi.org/10.1123/ijspp.2020-0518, 2021

The dataset provides information on the training routine of professional runners.

In [1]:
import pandas as pd
import numpy as np
import math

In [2]:
df_day_original=pd.read_csv('day_approach_maskedID_timeseries.csv')
df_day_original.head()

Unnamed: 0,nr. sessions,total km,km Z3-4,km Z5-T1-T2,km sprinting,strength training,hours alternative,perceived exertion,perceived trainingSuccess,perceived recovery,...,km Z5-T1-T2.6,km sprinting.6,strength training.6,hours alternative.6,perceived exertion.6,perceived trainingSuccess.6,perceived recovery.6,Athlete ID,injury,Date
0,1.0,5.8,0.0,0.6,1.2,0.0,0.0,0.11,0.0,0.18,...,0.0,0.0,0.0,1.0,0.1,0.0,0.15,0,0,0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.01,-0.01,-0.01,...,0.5,1.2,0.0,0.0,0.1,0.0,0.17,0,0,1
2,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.1,0.0,0.17,...,0.0,0.0,0.0,0.0,-0.01,-0.01,-0.01,0,0,2
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.01,-0.01,-0.01,...,0.0,0.0,1.0,0.0,0.1,0.0,0.17,0,0,3
4,1.0,0.0,0.0,0.0,0.0,0.0,1.08,0.08,0.0,0.18,...,0.0,0.0,0.0,0.0,0.11,0.0,0.17,0,0,4


In [3]:
df_day_original.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42766 entries, 0 to 42765
Data columns (total 73 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   nr. sessions                 42766 non-null  float64
 1   total km                     42766 non-null  float64
 2   km Z3-4                      42766 non-null  float64
 3   km Z5-T1-T2                  42766 non-null  float64
 4   km sprinting                 42766 non-null  float64
 5   strength training            42766 non-null  float64
 6   hours alternative            42766 non-null  float64
 7   perceived exertion           42766 non-null  float64
 8   perceived trainingSuccess    42766 non-null  float64
 9   perceived recovery           42766 non-null  float64
 10  nr. sessions.1               42766 non-null  float64
 11  total km.1                   42766 non-null  float64
 12  km Z3-4.1                    42766 non-null  float64
 13  km Z5-T1-T2.1   
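
Reading the non-null counts column by column is tedious with 73 columns. A more direct check could look like the sketch below. The toy frame and its values are made up for illustration; on the real data, `toy` would be replaced by `df_day_original`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_day_original (hypothetical values)
toy = pd.DataFrame({
    "total km": [5.8, 0.0, np.nan],
    "perceived exertion": [0.11, -0.01, 0.10],
})

# Count missing values per column, then in total
nan_per_column = toy.isna().sum()
total_nan = int(nan_per_column.sum())

print(nan_per_column)
print("total missing values:", total_nan)
```

A total of zero would mean that no column contains missing values.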

In [4]:
#The df.info() output above indicates that there are no NaN values in the df_day_original dataframe

"perceived exertion" is the athletes' own estimation of how tired they were after completing the main session of the day. I would expect this to have some sort of correlation with the number, intensity, and length of the training sessions performed that day. As an exercise, I will calculate the Pearson correlation coefficient of "perceived exertion" with total km, km Z3-4, km Z5-T1-T2, and km sprinting.

In [7]:
#At the moment, I do not need the aggregated data,
#so I will just keep the features for the day of the observation
df_day=pd.concat([df_day_original.iloc[:,0:8],df_day_original.iloc[:,-3:]], axis=1)
df_day.head()

Unnamed: 0,nr. sessions,total km,km Z3-4,km Z5-T1-T2,km sprinting,strength training,hours alternative,perceived exertion,Athlete ID,injury,Date
0,1.0,5.8,0.0,0.6,1.2,0.0,0.0,0.11,0,0,0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.01,0,0,1
2,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.1,0,0,2
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.01,0,0,3
4,1.0,0.0,0.0,0.0,0.0,0.0,1.08,0.08,0,0,4
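
Slicing by position with `iloc` depends on the column order of the CSV file. As an alternative, the same kind of selection can be written with column names, as in the sketch below. The toy frame, its values, and the `keep` list are assumptions for illustration only.

```python
import pandas as pd

# Toy frame with a few of the dataset's column names (values are made up)
toy = pd.DataFrame({
    "total km": [5.8, 0.0],
    "perceived exertion": [0.11, -0.01],
    "total km.1": [0.0, 16.4],
    "Athlete ID": [0, 0],
    "injury": [0, 0],
    "Date": [0, 1],
})

# Keep the columns named explicitly instead of by position
keep = ["total km", "perceived exertion", "Athlete ID", "injury", "Date"]
selected = toy[keep].copy()  # .copy() makes this an independent frame

print(selected.columns.tolist())
```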


In [8]:
#Also, if on one day total km=0.0, then on that day the athlete did not run; hence, I will remove such observations

In [9]:
df_day=df_day.loc[df_day['total km'] != 0.0]
df_day.reset_index(inplace=True, drop=True)
df_day.head()

Unnamed: 0,nr. sessions,total km,km Z3-4,km Z5-T1-T2,km sprinting,strength training,hours alternative,perceived exertion,Athlete ID,injury,Date
0,1.0,5.8,0.0,0.6,1.2,0.0,0.0,0.11,0,0,0
1,1.0,16.4,10.0,0.0,0.0,1.0,0.0,0.11,0,0,5
2,1.0,5.2,0.0,0.5,1.2,0.0,0.0,0.1,0,0,7
3,1.0,17.6,7.2,0.0,0.0,0.0,0.0,0.11,0,0,10
4,1.0,10.5,6.5,0.0,0.0,0.0,0.0,0.09,0,0,12
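
The filter and the index reset can also be chained into a single expression that returns a new dataframe, which avoids modifying a filtered slice in place. This is only a sketch on made-up values, and `toy` and `ran` are hypothetical names.

```python
import pandas as pd

# Toy frame (hypothetical values); the middle row is a rest day
toy = pd.DataFrame({
    "total km": [5.8, 0.0, 16.4],
    "perceived exertion": [0.11, -0.01, 0.11],
})

# Keep only the days with some running, and renumber the rows from 0
ran = toy.loc[toy["total km"] != 0.0].reset_index(drop=True)

print(ran)
```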


In [10]:
#I leave these columns out because here I only compare 'perceived exertion' with the running distances
df_day=df_day.drop(columns=['nr. sessions','strength training','hours alternative'])
df_day.head()

Unnamed: 0,total km,km Z3-4,km Z5-T1-T2,km sprinting,perceived exertion,Athlete ID,injury,Date
0,5.8,0.0,0.6,1.2,0.11,0,0,0
1,16.4,10.0,0.0,0.0,0.11,0,0,5
2,5.2,0.0,0.5,1.2,0.1,0,0,7
3,17.6,7.2,0.0,0.0,0.11,0,0,10
4,10.5,6.5,0.0,0.0,0.09,0,0,12


In [11]:
#easy way to find the Pearson correlation coefficient
df_day.corr(method='pearson')

Unnamed: 0,total km,km Z3-4,km Z5-T1-T2,km sprinting,perceived exertion,Athlete ID,injury,Date
total km,1.0,0.327697,0.203395,-0.044483,0.016976,-0.113952,-0.009608,-0.046742
km Z3-4,0.327697,1.0,-0.162256,-0.058333,0.110803,0.002418,-0.007363,0.025261
km Z5-T1-T2,0.203395,-0.162256,1.0,-0.030888,0.202711,-0.037881,0.015726,-0.015269
km sprinting,-0.044483,-0.058333,-0.030888,1.0,-0.008283,-0.071147,0.009874,-0.070084
perceived exertion,0.016976,0.110803,0.202711,-0.008283,1.0,0.317989,0.03613,0.511037
Athlete ID,-0.113952,0.002418,-0.037881,-0.071147,0.317989,1.0,0.006848,0.59556
injury,-0.009608,-0.007363,0.015726,0.009874,0.03613,0.006848,1.0,0.025961
Date,-0.046742,0.025261,-0.015269,-0.070084,0.511037,0.59556,0.025961,1.0


In [12]:
#the function below calculates the mean of the values in a series
def meanofcolumn(pd_series):
    
    sumofseries=0
    for i in pd_series:
        sumofseries=sumofseries+i
        
    meanofseries=sumofseries/len(pd_series)
    
    return meanofseries
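
A quick sanity check of a loop-based mean against `Series.mean()` could look like the sketch below. It redefines the loop under the hypothetical name `mean_by_loop` so that it runs on its own, and it uses the five 'total km' values from the first rows of df_day shown earlier.

```python
import math
import pandas as pd

def mean_by_loop(values):
    """Arithmetic mean computed with an explicit loop."""
    total = 0.0
    for v in values:
        total += v
    return total / len(values)

# 'total km' values from the first five rows of df_day
sample = pd.Series([5.8, 16.4, 5.2, 17.6, 10.5])

loop_mean = mean_by_loop(sample)
pandas_mean = sample.mean()

# Compare with a tolerance, since floating-point sums can differ in the last digits
print(math.isclose(loop_mean, pandas_mean))
```

Note that a loop like this divides by zero for an empty series, so it assumes at least one observation.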

In [13]:
# the function below calculates the Pearson correlation coefficient between two series

def pearsoncoeff(first_series, second_series):
    
    x_mean=meanofcolumn(first_series)
    y_mean=meanofcolumn(second_series)
    cross_sum=0   #running sum of (x-x_mean)*(y-y_mean)
    x_sq_sum=0    #running sum of (x-x_mean)**2
    y_sq_sum=0    #running sum of (y-y_mean)**2
    
    for x,y in zip(first_series, second_series):
        cross_sum=cross_sum+(x-x_mean)*(y-y_mean)
        x_sq_sum=x_sq_sum+(x-x_mean)**2
        y_sq_sum=y_sq_sum+(y-y_mean)**2
        
    #numerator: sum of cross products; denominator: square root of the product of the two sums of squares
    pearson_corr=cross_sum/math.sqrt(x_sq_sum*y_sq_sum)
    
    return pearson_corr
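
For comparison, the same formula can be written without an explicit loop using NumPy arrays. The sketch below is an alternative formulation, not a replacement for the exercise; `pearson_numpy` is a hypothetical name. It also guards against a constant input, for which the denominator is zero and the coefficient is undefined.

```python
import numpy as np

def pearson_numpy(x, y):
    """Pearson r from centred arrays; returns nan if either input is constant."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    denom = np.sqrt((dx ** 2).sum() * (dy ** 2).sum())
    if denom == 0.0:
        return float("nan")  # correlation is undefined for a constant series
    return float((dx * dy).sum() / denom)

x = [5.8, 16.4, 5.2, 17.6, 10.5]   # 'total km' from the first rows of df_day
y = [0.0, 10.0, 0.0, 7.2, 6.5]     # 'km Z3-4' from the same rows

r = pearson_numpy(x, y)

# np.corrcoef returns the full 2x2 correlation matrix; [0, 1] is r(x, y)
print(r, np.corrcoef(x, y)[0, 1])
```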

In [14]:
#according to the df.corr() method, the Pearson correlation between perceived exertion and km Z5-T1-T2 is 0.202711;
#let's see if my function gives the same value
pearsoncoeff(df_day['perceived exertion'], df_day['km Z5-T1-T2'])

0.20271063943971676

In [15]:
df_day.head()

Unnamed: 0,total km,km Z3-4,km Z5-T1-T2,km sprinting,perceived exertion,Athlete ID,injury,Date
0,5.8,0.0,0.6,1.2,0.11,0,0,0
1,16.4,10.0,0.0,0.0,0.11,0,0,5
2,5.2,0.0,0.5,1.2,0.1,0,0,7
3,17.6,7.2,0.0,0.0,0.11,0,0,10
4,10.5,6.5,0.0,0.0,0.09,0,0,12


In [20]:
for i in df_day.columns:
    print(pearsoncoeff(df_day['perceived exertion'], df_day[i]))

0.01697592963303097
0.11080325567999154
0.20271063943971676
-0.008283217459983849
1.0
0.3179888470498536
0.03613030976116077
0.5110369618774546


In [22]:
df_day.corr()['perceived exertion']

total km              0.016976
km Z3-4               0.110803
km Z5-T1-T2           0.202711
km sprinting         -0.008283
perceived exertion    1.000000
Athlete ID            0.317989
injury                0.036130
Date                  0.511037
Name: perceived exertion, dtype: float64

The last two outputs agree to the six decimal places that pandas displays, which confirms that my function reproduces the result of df.corr().
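
The comparison can also be automated rather than read off by eye. The sketch below is one way to do it, on a toy dataframe built from the first rows of df_day; `pearson_loop` is a compact stand-in written for this example, and `toy`, `manual`, and `reference` are hypothetical names.

```python
import math
import numpy as np
import pandas as pd

def pearson_loop(xs, ys):
    """Compact loop-based Pearson r, written for this sketch."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    cross = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    x_ss = sum((x - x_mean) ** 2 for x in xs)
    y_ss = sum((y - y_mean) ** 2 for y in ys)
    return cross / math.sqrt(x_ss * y_ss)

# Values taken from the first five rows of df_day
toy = pd.DataFrame({
    "total km": [5.8, 16.4, 5.2, 17.6, 10.5],
    "km Z3-4": [0.0, 10.0, 0.0, 7.2, 6.5],
    "perceived exertion": [0.11, 0.11, 0.10, 0.11, 0.09],
})

target = "perceived exertion"

# One coefficient per column, labelled with the column name
manual = pd.Series(
    {col: pearson_loop(toy[target], toy[col]) for col in toy.columns}
)
reference = toy.corr(method="pearson")[target]

# True if every manual coefficient matches pandas within floating-point tolerance
print(np.allclose(manual.to_numpy(), reference.to_numpy()))
```

Labelling each coefficient with its column name also makes the printed results easier to match against the df.corr() output.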