# Q1. Pearson correlation coefficient is a measure of the linear relationship between two variables. Suppose you have collected data on the amount of time students spend studying for an exam and their final exam scores. Calculate the Pearson correlation coefficient between these two variables and interpret the result.

1.    Organize our data: Create two lists or columns, one for the amount of time students spend studying and another for their corresponding final exam scores. Make sure both lists have the same number of entries.

2.    Calculate the means: Find the mean (average) of both lists. Let's call them X̄ (mean of study time) and Ȳ (mean of exam scores).

3.    Calculate the differences: Subtract the mean of each list from each corresponding entry. For each entry, subtract X̄ from the study time and subtract Ȳ from the exam score. This gives two new lists of differences, (X-X̄) and (Y-Ȳ).

4.    Multiply the differences: For each student, multiply the study-time difference by the exam-score difference from step 3. Create a new list with these products.

5.    Calculate the sum: Add up all the products from step 4. Let's call this sum Σ(X-X̄)(Y-Ȳ).

6.    Calculate the standard deviations: Find the sample standard deviation (the version that divides by n - 1) of both lists. Let's call them Sx (standard deviation of study time) and Sy (standard deviation of exam scores).

7.    Multiply the standard deviations: Multiply Sx and Sy together. Let's call this product SxSy.

8.    Calculate the Pearson correlation coefficient: Divide the sum from step 5 by the product from step 7. Finally, divide the result by the number of data points minus 1. The formula is:

      r = (Σ(X-X̄)(Y-Ȳ)) / ((n-1) * Sx * Sy)

      where r represents the Pearson correlation coefficient and n is the number of data points.
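
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the same five study-time and exam-score pairs used in the DataFrame later in this answer; the variable names are chosen here for readability and are not part of the original notebook.

```python
import numpy as np

# Sample data: the same values as the DataFrame used later in this answer
study_time = np.array([5, 10, 15, 20, 25], dtype=float)
exam_scores = np.array([60, 75, 85, 90, 95], dtype=float)
n = len(study_time)

# Step 2: means of both lists
x_bar = study_time.mean()
y_bar = exam_scores.mean()

# Step 3: differences from the means
dx = study_time - x_bar
dy = exam_scores - y_bar

# Numerator of the formula: sum of the products of the paired differences
sum_products = np.sum(dx * dy)

# Step 6: sample standard deviations (ddof=1 divides by n - 1)
s_x = study_time.std(ddof=1)
s_y = exam_scores.std(ddof=1)

# Step 8: r = Σ(X-X̄)(Y-Ȳ) / ((n-1) * Sx * Sy)
r = sum_products / ((n - 1) * s_x * s_y)
print(f"r = {r:.6f}")
```

The `ddof=1` argument matters: NumPy's default is the population standard deviation, which would not match the (n-1) factor in the formula.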
   
  Once we have calculated the Pearson correlation coefficient (r), we can interpret the result as follows:
   
    
*    If r is close to 1, it indicates a strong positive linear relationship. This means that as the amount of time students spend studying increases, their exam scores also tend to increase.

*    If r is close to -1, it indicates a strong negative linear relationship. This means that as the amount of time students spend studying increases, their exam scores tend to decrease.

*    If r is close to 0, it indicates a weak or no linear relationship. This means that study time and exam scores show little or no linear association, although a nonlinear relationship could still exist.

Formula to calculate the Pearson correlation coefficient (r) between two variables X and Y:

r = Σ((X - X̄)(Y - Ȳ)) / sqrt(Σ(X - X̄)² * Σ(Y - Ȳ)²)

*    X and Y: The variables for which we want to calculate the correlation coefficient.
*    X̄ and Ȳ: The means (averages) of variables X and Y, respectively.
*    Σ: The summation symbol, which represents the sum of values.
*    (X - X̄) and (Y - Ȳ): The differences between each data point and its corresponding mean.
*    sqrt: The square root function.
*    ²: The symbol for squaring a value.

To calculate the correlation coefficient, we need to compute the sum of the products of the differences between each data point and its corresponding mean for both X and Y. Then, divide that sum by the product of the square roots of the sums of the squared differences for X and Y.
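
This sum-of-squares form translates directly into plain Python. The function below is a sketch, not a replacement for a library routine; the name `pearson_r` and its error handling are assumptions made for this illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson r via Σ(X-X̄)(Y-Ȳ) / sqrt(Σ(X-X̄)² * Σ(Y-Ȳ)²)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 entries")
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    # Sum of the products of the paired differences from the means
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Sums of the squared differences for each variable
    sum_sq_x = sum((x - x_bar) ** 2 for x in xs)
    sum_sq_y = sum((y - y_bar) ** 2 for y in ys)
    if sum_sq_x == 0 or sum_sq_y == 0:
        raise ValueError("r is undefined when a variable is constant")
    return numerator / math.sqrt(sum_sq_x * sum_sq_y)

# Study time and exam scores from the example that follows
r = pearson_r([5, 10, 15, 20, 25], [60, 75, 85, 90, 95])
print(f"r = {r:.6f}")
```

No (n-1) factor appears here because it cancels between the numerator and the two standard deviations.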

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Create a sample DataFrame with study time and exam scores
data = {'Study Time': [5, 10, 15, 20, 25],
        'Exam Scores': [60, 75, 85, 90, 95]}

In [3]:
df = pd.DataFrame(data)

In [4]:
df

|    |   Study Time |   Exam Scores |
|---:|-------------:|--------------:|
|  0 |            5 |            60 |
|  1 |           10 |            75 |
|  2 |           15 |            85 |
|  3 |           20 |            90 |
|  4 |           25 |            95 |


In [5]:
# Calculate the Pearson correlation coefficient
correlation_coefficient = df['Study Time'].corr(df['Exam Scores'])

In [6]:
df["Pearson Correlation Coefficient:"] = correlation_coefficient

In [7]:
df

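|    |   Study Time |   Exam Scores |   Pearson Correlation Coefficient: |
|---:|-------------:|--------------:|-----------------------------------:|
|  0 |            5 |            60 |                           0.968665 |
|  1 |           10 |            75 |                           0.968665 |
|  2 |           15 |            85 |                           0.968665 |
|  3 |           20 |            90 |                           0.968665 |
|  4 |           25 |            95 |                           0.968665 |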


# Q2. Spearman's rank correlation is a measure of the monotonic relationship between two variables. Suppose you have collected data on the amount of sleep individuals get each night and their overall job satisfaction level on a scale of 1 to 10. Calculate the Spearman's rank correlation between these two variables and interpret the result.

Formula to calculate the Spearman's rank correlation coefficient (rho):

ρ = 1 - ((6 * Σd^2) / (n * (n^2 - 1)))

where:

   ρ represents the Spearman's rank correlation coefficient
   Σd^2 is the sum of the squared differences between the ranks of the paired data points
   n is the number of data points

In the formula, Σd^2 represents the sum of the squared differences between the ranks of the paired data points. It quantifies how far the two rankings are from agreeing perfectly: it is 0 when the ranks are identical and reaches its maximum, n * (n^2 - 1) / 3, when one ranking is the exact reverse of the other. The denominator term, n * (n^2 - 1), together with the factor 6, is therefore a normalizing constant that scales the ratio to lie between 0 and 2.

Subtracting this ratio from 1 gives the Spearman's rank correlation coefficient, which ranges from -1 to 1. A value of -1 indicates a perfect negative monotonic relationship, 1 indicates a perfect positive monotonic relationship, and 0 indicates no monotonic relationship between the variables. The formula is exact only when there are no tied ranks.
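
A short sketch of the formula on two extreme cases shows where the limits -1 and 1 come from. The helper name `spearman_from_ranks` is an assumption for this illustration, and the inputs are plain rank lists without ties rather than survey data.

```python
def spearman_from_ranks(rank_x, rank_y):
    """Shortcut formula for rho; valid when neither rank list contains ties."""
    n = len(rank_x)
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - (6 * sum_d_squared) / (n * (n ** 2 - 1))

ranks = [1, 2, 3, 4, 5]

# Identical rankings: every d is 0, so rho = 1
same_order = spearman_from_ranks(ranks, ranks)

# Fully reversed rankings: Σd² = 40 = n(n² - 1)/3, so rho = 1 - 2 = -1
reversed_order = spearman_from_ranks(ranks, ranks[::-1])

print(same_order, reversed_order)  # 1.0 -1.0
```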

Let's assume we have collected data from 10 individuals.

Amount of Sleep: [7, 6, 8, 5, 7, 6, 9, 8, 6, 5]

Job Satisfaction: [8, 5, 9, 4, 7, 6, 9, 8, 7, 5]

Now, calculate the Spearman's rank correlation coefficient:

1.    Rank the data for both variables (the smallest value gets rank 1; tied values share the average of the ranks they occupy):

Amount of Sleep (Ranked): [6.5, 4, 8.5, 1.5, 6.5, 4, 10, 8.5, 4, 1.5]

Job Satisfaction (Ranked): [7.5, 2.5, 9.5, 1, 5.5, 4, 9.5, 7.5, 5.5, 2.5]

2.    Calculate the difference between the ranks:

Rank Differences: [-1, 1.5, -1, 0.5, 1, 0, 0.5, 1, -1.5, -1]

3.    Square the differences:

Squared Differences: [1, 2.25, 1, 0.25, 1, 0, 0.25, 1, 2.25, 1]

4.    Sum up the squared differences:

Sum of Squared Differences = 10

5.    Calculate the Spearman's rank correlation coefficient (rho):

n = 10

rho = 1 - ((6 * 10) / (10 * (10^2 - 1)))

= 1 - (60 / 990)

= 1 - 0.061

= 0.939

Because both variables contain tied values, this shortcut formula is only approximate. The exact coefficient is the Pearson correlation of the two rank lists, which is 0.937 and is the value that spearmanr returns in the code below.

Interpretation:
The Spearman's rank correlation coefficient (rho) is approximately 0.94. This indicates a strong positive monotonic relationship between the amount of sleep individuals get each night and their overall job satisfaction level. In simpler terms, as the amount of sleep increases, job satisfaction tends to increase as well, and vice versa.
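
The ranking step can be cross-checked in code. The sketch below uses `scipy.stats.rankdata`, which by default gives tied values the average of the ranks they occupy, and then evaluates both the shortcut formula and the Pearson correlation of the ranks. The variable names are assumptions made for this illustration.

```python
import numpy as np
from scipy.stats import rankdata

sleep = [7, 6, 8, 5, 7, 6, 9, 8, 6, 5]
satisfaction = [8, 5, 9, 4, 7, 6, 9, 8, 7, 5]

# Average ranks for ties (rankdata's default method is 'average')
rank_sleep = rankdata(sleep)
rank_satisfaction = rankdata(satisfaction)

# Rank differences and their squared sum
d = rank_sleep - rank_satisfaction
sum_d_squared = float(np.sum(d ** 2))

# Shortcut formula: only approximate here because the data contain ties
n = len(sleep)
rho_shortcut = 1 - (6 * sum_d_squared) / (n * (n ** 2 - 1))

# Tie-robust definition: Pearson correlation of the two rank lists
rho_exact = float(np.corrcoef(rank_sleep, rank_satisfaction)[0, 1])

print("sleep ranks:       ", rank_sleep)
print("satisfaction ranks:", rank_satisfaction)
print(f"sum of d^2 = {sum_d_squared}, shortcut rho = {rho_shortcut:.3f}, exact rho = {rho_exact:.3f}")
```

The small gap between the two values comes entirely from the ties; with no ties the two computations agree.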

In [19]:
import pandas as pd
from scipy.stats import spearmanr

In [20]:
# Sample dataset
data = {
    'Amount of Sleep': [7, 6, 8, 5, 7, 6, 9, 8, 6, 5],
    'Job Satisfaction': [8, 5, 9, 4, 7, 6, 9, 8, 7, 5]
}

In [21]:
df = pd.DataFrame(data)

In [22]:
df

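|    |   Amount of Sleep |   Job Satisfaction |
|---:|------------------:|-------------------:|
|  0 |                 7 |                  8 |
|  1 |                 6 |                  5 |
|  2 |                 8 |                  9 |
|  3 |                 5 |                  4 |
|  4 |                 7 |                  7 |
|  5 |                 6 |                  6 |
|  6 |                 9 |                  9 |
|  7 |                 8 |                  8 |
|  8 |                 6 |                  7 |
|  9 |                 5 |                  5 |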


In [23]:
# Calculate Spearman's rank correlation coefficient
rho, p_value = spearmanr(df['Amount of Sleep'], df['Job Satisfaction'])

In [24]:
# Add correlation coefficient column to DataFrame
df['Spearman Correlation'] = rho

In [26]:
df

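|    |   Amount of Sleep |   Job Satisfaction |   Spearman Correlation |
|---:|------------------:|-------------------:|-----------------------:|
|  0 |                 7 |                  8 |               0.937346 |
|  1 |                 6 |                  5 |               0.937346 |
|  2 |                 8 |                  9 |               0.937346 |
|  3 |                 5 |                  4 |               0.937346 |
|  4 |                 7 |                  7 |               0.937346 |
|  5 |                 6 |                  6 |               0.937346 |
|  6 |                 9 |                  9 |               0.937346 |
|  7 |                 8 |                  8 |               0.937346 |
|  8 |                 6 |                  7 |               0.937346 |
|  9 |                 5 |                  5 |               0.937346 |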
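
The call to `spearmanr` also returns a p-value, which the cells above unpack but never display. A minimal sketch of reporting both numbers follows; the 0.05 significance level is a conventional choice assumed here, not something stated in the question.

```python
from scipy.stats import spearmanr

sleep = [7, 6, 8, 5, 7, 6, 9, 8, 6, 5]
satisfaction = [8, 5, 9, 4, 7, 6, 9, 8, 7, 5]

# spearmanr returns the coefficient and a two-sided p-value
rho, p_value = spearmanr(sleep, satisfaction)

alpha = 0.05  # assumed significance level
is_significant = p_value < alpha

print(f"rho = {rho:.6f}")
print(f"p-value = {p_value:.6f}, significant at {alpha}: {is_significant}")
```

With only 10 observations the p-value is approximate, so it is best read as a rough indication rather than a precise probability.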


# Q3. Suppose you are conducting a study to examine the relationship between the number of hours of exercise per week and body mass index (BMI) in a sample of adults. You collected data on both variables for 50 participants. Calculate the Pearson correlation coefficient and the Spearman's rank correlation between these two variables and compare the results.

1.    Pearson Correlation Coefficient (r):
    The Pearson correlation coefficient measures the linear relationship between two variables.

Step 1: Calculate the means of both variables:

*    Let's denote the number of hours of exercise as X, and BMI as Y.
*    Calculate the mean of X, denoted as X̄, and the mean of Y, denoted as Ȳ.

Step 2: Calculate the standard deviations of both variables:

*    Calculate the standard deviation of X, denoted as sX.
*    Calculate the standard deviation of Y, denoted as sY.

Step 3: Calculate the covariance between X and Y:

*    Calculate the sum of the products of the differences from the means:
     Σ[(Xi - X̄)(Yi - Ȳ)]
*    Divide this sum by (n-1), where n is the number of data points.

Step 4: Calculate the Pearson correlation coefficient (r):

*    Divide the covariance by the product of the standard deviations:
*    r = Σ[(Xi - X̄)(Yi - Ȳ)] / [(n-1) * sX * sY]

2.    Spearman's Rank Correlation (ρ):
     Spearman's rank correlation is a non-parametric measure of the monotonic relationship between two variables.

Step 1: Rank the data:

*    Rank the values of X and Y separately, from smallest to largest.
*    If there are ties, assign the average rank to the tied values.

Step 2: Calculate the difference in ranks for each pair of data points:

*    Calculate the differences in ranks, denoted as di, for each pair of data points (Xi, Yi).

Step 3: Calculate Spearman's rank correlation coefficient (ρ):

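*    Use the formula (exact when there are no tied ranks; with ties, compute the Pearson correlation of the two rank lists instead):
     ρ = 1 - [6 * Σ(di^2)] / [n * (n^2 - 1)]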
     
     
Now, once we have the data and calculate these coefficients, we can compare the results. The Pearson correlation coefficient measures the linear relationship between the variables, while Spearman's rank correlation measures the monotonic relationship (the direction and strength of the relationship, regardless of linearity). The Pearson correlation coefficient can range from -1 to +1, where -1 indicates a perfect negative linear relationship, +1 indicates a perfect positive linear relationship, and 0 indicates no linear relationship. Spearman's rank correlation also ranges from -1 to +1, with similar interpretations.
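
Both procedures can be sketched as two small functions: Pearson as covariance divided by the product of the sample standard deviations, and Spearman as the same computation applied to average ranks. The function names are assumptions for this illustration, and the demo uses only the first five pairs of the example data to stay short, so its values will differ from the full-sample results.

```python
import numpy as np
from scipy.stats import rankdata

def pearson_via_covariance(x, y):
    """r = cov(X, Y) / (sX * sY), using n - 1 in both covariance and std."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    covariance = np.cov(x, y, ddof=1)[0, 1]
    return covariance / (x.std(ddof=1) * y.std(ddof=1))

def spearman_via_ranks(x, y):
    """Rank both variables (average rank for ties), then apply Pearson."""
    return pearson_via_covariance(rankdata(x), rankdata(y))

# First five pairs of the example data (hours of exercise, BMI)
hours = [5, 8, 4, 6, 7]
bmi = [25, 27, 30, 22, 23]

r = pearson_via_covariance(hours, bmi)
rho = spearman_via_ranks(hours, bmi)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

When the two coefficients are close, as they are for the full sample in this answer, there is little sign of a monotonic but nonlinear pattern that Pearson would miss.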

In [5]:
import numpy as np
import pandas as pd
import scipy.stats as stats

In [6]:
# Example data (replace with actual data)
data = {
    'hours_of_exercise': [5, 8, 4, 6, 7, 3, 2, 9, 4, 6, 5, 7, 8, 9, 3, 2, 1, 6, 7, 4, 8, 5, 3, 9, 2, 4, 6, 7, 8, 5, 4, 3, 1, 2, 6, 5, 7, 8, 4, 2, 3, 5, 6, 7, 8, 4, 5, 6, 7, 3, 9],
    'bmi': [25, 27, 30, 22, 23, 28, 29, 26, 24, 20, 21, 26, 25, 22, 30, 28, 29, 25, 24, 27, 22, 23, 24, 25, 29, 26, 28, 23, 27, 22, 24, 26, 28, 27, 30, 29, 25, 23, 22, 24, 26, 27, 28, 30, 25, 23, 24, 26, 29, 22, 28]
}

In [7]:
df = pd.DataFrame(data)

In [8]:
# Pearson correlation coefficient
pearson_corr = df['hours_of_exercise'].corr(df['bmi'], method='pearson')

In [9]:
# Spearman's rank correlation
spearman_corr = df['hours_of_exercise'].corr(df['bmi'], method='spearman')

In [10]:
df["Pearson correlation coefficient:"] = pearson_corr

In [11]:
df["Spearman's rank correlation:"] = spearman_corr

In [12]:
df.head()

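|    |   hours_of_exercise |   bmi |   Pearson correlation coefficient: |   Spearman's rank correlation: |
|---:|--------------------:|------:|-----------------------------------:|-------------------------------:|
|  0 |                   5 |    25 |                          -0.228821 |                      -0.219207 |
|  1 |                   8 |    27 |                          -0.228821 |                      -0.219207 |
|  2 |                   4 |    30 |                          -0.228821 |                      -0.219207 |
|  3 |                   6 |    22 |                          -0.228821 |                      -0.219207 |
|  4 |                   7 |    23 |                          -0.228821 |                      -0.219207 |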


# Q4. A researcher is interested in examining the relationship between the number of hours individuals spend watching television per day and their level of physical activity. The researcher collected data on both variables from a sample of 50 participants. Calculate the Pearson correlation coefficient between these two variables.

The Pearson correlation coefficient measures the linear relationship between two variables and ranges from -1 to 1.

1.    Enter the number of hours spent watching television per day in one column, let's say column A, and the corresponding level of physical activity in another column, let's say column B.
2.    Select an empty cell where we want to calculate the correlation coefficient, let's say cell D1.
3.    Use the correlation function in Excel by typing the following formula: =CORREL(A1:A50, B1:B50)
4.    Press Enter to calculate the Pearson correlation coefficient.

The resulting value will be between -1 and 1. A positive correlation coefficient indicates a positive relationship between the variables, meaning that as the number of hours spent watching television per day increases, the level of physical activity also tends to increase. A negative correlation coefficient indicates an inverse relationship, where as the number of hours spent watching television per day increases, the level of physical activity tends to decrease. The closer the correlation coefficient is to -1 or 1, the stronger the relationship between the variables. A correlation coefficient close to 0 suggests a weak or no linear relationship.

In [2]:
import pandas as pd

In [22]:
# Create a DataFrame with the data
data = {'TV_hours': [4, 3, 2, 5, 6, 2, 1, 3, 4, 5, 6, 2, 3, 4, 5, 1, 2, 3, 4, 5, 2, 1, 3, 4, 2, 1, 3, 4, 5, 6, 2, 3, 4, 1, 2, 3, 4, 5, 6, 2, 3, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5],
        'Physical_activity': [60, 45, 30, 90, 120, 30, 15, 45, 60, 90, 120, 30, 45, 60, 90, 15, 30, 45, 60, 90, 30, 15, 45, 60, 30, 15, 45, 60, 90, 120, 30, 45, 60, 15, 30, 45, 60, 90, 120, 30, 45, 15, 30, 45, 60, 90, 120, 15, 30, 45, 60, 90]}


In [23]:
df = pd.DataFrame(data)

In [24]:
# Calculate the correlation coefficient
correlation_coefficient = df['TV_hours'].corr(df['Physical_activity'])

In [25]:
df["Pearson correlation coefficient:"] = correlation_coefficient

In [26]:
df.head()

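|    |   TV_hours |   Physical_activity |   Pearson correlation coefficient: |
|---:|-----------:|--------------------:|-----------------------------------:|
|  0 |          4 |                  60 |                           0.981398 |
|  1 |          3 |                  45 |                           0.981398 |
|  2 |          2 |                  30 |                           0.981398 |
|  3 |          5 |                  90 |                           0.981398 |
|  4 |          6 |                 120 |                           0.981398 |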


# Q5. A survey was conducted to examine the relationship between age and preference for a particular brand of soft drink. The survey results are shown below:

![image.png](attachment:image.png)

In [3]:
df = pd.DataFrame({
    "Age(Years)": [25, 42, 37, 19, 31, 28],
    "Soft Drink Preference": ['Coke', 'Pepsi', 'Mountain Dew', 'Coke', 'Pepsi', 'Coke']
})
df

|    |   Age(Years) | Soft Drink Preference   |
|---:|-------------:|:------------------------|
|  0 |           25 | Coke                    |
|  1 |           42 | Pepsi                   |
|  2 |           37 | Mountain Dew            |
|  3 |           19 | Coke                    |
|  4 |           31 | Pepsi                   |
|  5 |           28 | Coke                    |


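To compute a Pearson correlation, we first need to convert the categorical variable to a numerical one. We can do so by label encoding it. Note that soft drink preference is a nominal variable: the integer codes impose an arbitrary order (here alphabetical), so the resulting coefficient depends on the coding and should be interpreted with caution: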

In [4]:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
encoded_col = encoder.fit_transform(df['Soft Drink Preference'])
df_encoded = pd.DataFrame(encoded_col, columns=['Soft Drink Preference Enc'])
df_new = pd.concat([df, df_encoded], axis=1)
df_new = df_new.drop('Soft Drink Preference', axis=1)
df_new

|    |   Age(Years) |   Soft Drink Preference Enc |
|---:|-------------:|----------------------------:|
|  0 |           25 |                           0 |
|  1 |           42 |                           2 |
|  2 |           37 |                           1 |
|  3 |           19 |                           0 |
|  4 |           31 |                           2 |
|  5 |           28 |                           0 |


### Pearson Correlation Coefficient:

In [5]:
df_new.corr(method='pearson')

|                           |   Age(Years) |   Soft Drink Preference Enc |
|:--------------------------|-------------:|----------------------------:|
| Age(Years)                |     1.0      |                    0.769175 |
| Soft Drink Preference Enc |     0.769175 |                    1.0      |
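
One caveat with this approach can be shown in a few lines: the integer codes assigned to the brands are arbitrary, and a different assignment yields a different coefficient. In the sketch below, the first mapping reproduces the alphabetical codes shown in the encoded table, while the second mapping is a hypothetical alternative introduced only for comparison.

```python
import numpy as np

ages = [25, 42, 37, 19, 31, 28]
preferences = ['Coke', 'Pepsi', 'Mountain Dew', 'Coke', 'Pepsi', 'Coke']

# Same codes as in the encoded table above (alphabetical order)
alphabetical_coding = {'Coke': 0, 'Mountain Dew': 1, 'Pepsi': 2}
# Hypothetical alternative coding, equally arbitrary
alternative_coding = {'Coke': 0, 'Pepsi': 1, 'Mountain Dew': 2}

def pearson_with_coding(coding):
    """Encode the brands with the given mapping and correlate with age."""
    codes = [coding[brand] for brand in preferences]
    return float(np.corrcoef(ages, codes)[0, 1])

r_alphabetical = pearson_with_coding(alphabetical_coding)
r_alternative = pearson_with_coding(alternative_coding)

print(f"alphabetical coding: r = {r_alphabetical:.6f}")
print(f"alternative coding:  r = {r_alternative:.6f}")
```

Because the result depends on the coding, the coefficient says more about the chosen numbering than about age and brand preference themselves.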


# Q6. A company is interested in examining the relationship between the number of sales calls made per day and the number of sales made per week. The company collected data on both variables from a sample of 30 sales representatives. Calculate the Pearson correlation coefficient between these two variables.

In [2]:
from tabulate import tabulate

data = [[10, 150], [12, 180], [15, 200], [18, 220], [20, 240], [22, 260], [25, 280], [28, 300], [30, 320]]
headers = ["Sales Calls/Day", "Sales/Week"]
table = tabulate(data, headers=headers, tablefmt='pipe')

print(table)


|   Sales Calls/Day |   Sales/Week |
|------------------:|-------------:|
|                10 |          150 |
|                12 |          180 |
|                15 |          200 |
|                18 |          220 |
|                20 |          240 |
|                22 |          260 |
|                25 |          280 |
|                28 |          300 |
|                30 |          320 |


To calculate the Pearson correlation coefficient, we can use the following formula:

r = ∑(x - x̄)(y - ȳ) / √(∑(x - x̄)^2 * ∑(y - ȳ)^2)

where:

* r is the Pearson correlation coefficient

* x is the number of sales calls made per day

* y is the number of sales made per week

* x̄ is the mean of the number of sales calls made per day

* ȳ is the mean of the number of sales made per week

First, we need to calculate the mean of the number of sales calls made per day and the number of sales made per week.

In [5]:
x̄ = 180 / 9   # mean sales calls per day = 20
ȳ = 2150 / 9  # mean sales per week ≈ 238.89

Now, we can calculate the numerator of the formula:

∑(x - x̄)(y - ȳ) = (10 - 20)(150 - 238.89) + (12 - 20)(180 - 238.89) + ... + (30 - 20)(320 - 238.89) = 3140

Next, we need to calculate the sums of squared deviations of the number of sales calls made per day and the number of sales made per week.

∑(x - x̄)^2 = 100 + 64 + ... + 100 = 386

∑(y - ȳ)^2 = 7901.23 + 3467.90 + ... + 6579.01 ≈ 25688.89

Finally, we can calculate the Pearson correlation coefficient:

r = 3140 / (√386 * √25688.89) ≈ 3140 / 3148.95 ≈ 0.997


Therefore, the Pearson correlation coefficient between the number of sales calls made per day and the number of sales made per week is approximately 0.997, which agrees with the value pandas reports below. This indicates a very strong positive linear correlation between these two variables. In other words, as the number of sales calls made per day increases, the number of sales made per week tends to increase.
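
The intermediate quantities in this hand calculation can be recomputed directly from the nine data pairs in the table. The sketch below uses only the standard library; the variable names are assumptions made for this illustration.

```python
import math

calls = [10, 12, 15, 18, 20, 22, 25, 28, 30]           # sales calls per day
sales = [150, 180, 200, 220, 240, 260, 280, 300, 320]  # sales per week

n = len(calls)
x_bar = sum(calls) / n
y_bar = sum(sales) / n

# Numerator: sum of the products of the paired deviations
numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(calls, sales))

# Sums of squared deviations for each variable
sum_sq_x = sum((x - x_bar) ** 2 for x in calls)
sum_sq_y = sum((y - y_bar) ** 2 for y in sales)

r = numerator / math.sqrt(sum_sq_x * sum_sq_y)

print(f"x_bar = {x_bar:.2f}, y_bar = {y_bar:.2f}")
print(f"numerator = {numerator:.2f}")
print(f"sum_sq_x = {sum_sq_x:.2f}, sum_sq_y = {sum_sq_y:.2f}")
print(f"r = {r:.6f}")
```

Printing each piece separately makes it easy to spot which step of a manual calculation has gone wrong.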

In [6]:
import pandas as pd

In [7]:
# Create a DataFrame with the sales data
data = {'Sales Calls/Day': [10, 12, 15, 18, 20, 22, 25, 28, 30],
        'Sales/Week': [150, 180, 200, 220, 240, 260, 280, 300, 320]}

In [8]:
df = pd.DataFrame(data)

In [9]:
df

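|    |   Sales Calls/Day |   Sales/Week |
|---:|------------------:|-------------:|
|  0 |                10 |          150 |
|  1 |                12 |          180 |
|  2 |                15 |          200 |
|  3 |                18 |          220 |
|  4 |                20 |          240 |
|  5 |                22 |          260 |
|  6 |                25 |          280 |
|  7 |                28 |          300 |
|  8 |                30 |          320 |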


In [10]:
# Calculate Pearson correlation coefficient
correlation_coefficient = df['Sales Calls/Day'].corr(df['Sales/Week'])

In [11]:
df["Pearson correlation coefficient:"] = correlation_coefficient

In [12]:
df

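|    |   Sales Calls/Day |   Sales/Week |   Pearson correlation coefficient: |
|---:|------------------:|-------------:|-----------------------------------:|
|  0 |                10 |          150 |                           0.997157 |
|  1 |                12 |          180 |                           0.997157 |
|  2 |                15 |          200 |                           0.997157 |
|  3 |                18 |          220 |                           0.997157 |
|  4 |                20 |          240 |                           0.997157 |
|  5 |                22 |          260 |                           0.997157 |
|  6 |                25 |          280 |                           0.997157 |
|  7 |                28 |          300 |                           0.997157 |
|  8 |                30 |          320 |                           0.997157 |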
