# Q1. Pearson correlation coefficient is a measure of the linear relationship between two variables. Suppose you have collected data on the amount of time students spend studying for an exam and their final exam scores. Calculate the Pearson correlation coefficient between these two variables and interpret the result.

To calculate the Pearson correlation coefficient between two variables (in this case, the amount of time students spend studying for an exam and their final exam scores), you would follow these steps:

1. Calculate the mean (average) of the time spent studying and the mean of the exam scores.
2. For each student, subtract the respective mean from each value to get the deviations for both variables.
3. Square each deviation.
4. For each student, multiply the study-time deviation by the exam-score deviation.
5. Sum the squared deviations for each variable separately, and sum the products of deviations.
6. Divide the sum of the products by the square root of the product of the two sums of squared deviations.

The formula for the Pearson correlation coefficient (r) is:

$$ r = \frac{\sum{(x_i - \bar{x})(y_i - \bar{y})}}{\sqrt{\sum{(x_i - \bar{x})^2} \cdot \sum{(y_i - \bar{y})^2}}}$$

Where:
- $x_i$ and $y_i$ are individual data points for time spent studying and exam scores, respectively.
- $\bar{x}$ and $\bar{y}$ are the means of the time spent studying and exam scores, respectively.

The resulting value of $r$ will be between -1 and 1:
- $r = 1$ indicates a perfect positive linear correlation.
- $r = -1$ indicates a perfect negative linear correlation.
- $r = 0$ indicates no linear correlation (variables are not linearly related).

Interpretation of the result:
- If $r$ is close to 1, it indicates a strong positive linear relationship between the time spent studying and exam scores. This means that as the time spent studying increases, the exam scores tend to increase as well.
- If $r$ is close to -1, it indicates a strong negative linear relationship between the time spent studying and exam scores. This means that as the time spent studying increases, the exam scores tend to decrease.
- If $r$ is close to 0, it indicates that there is little to no linear relationship between the time spent studying and exam scores.
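
The steps and formula above can be sketched in plain Python. The study hours and exam scores below are hypothetical values chosen only for illustration, and `pearson_r` is a helper name introduced here; no real data was supplied with the question.

```python
import math

def pearson_r(x, y):
    """Pearson correlation computed directly from the deviation formula."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    # Step 1: means of both variables
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Step 2: deviations from the means
    dx = [xi - mean_x for xi in x]
    dy = [yi - mean_y for yi in y]
    # Steps 3-5: sums of squared deviations and of paired products
    sum_xx = sum(a * a for a in dx)
    sum_yy = sum(b * b for b in dy)
    sum_xy = sum(a * b for a, b in zip(dx, dy))
    # Step 6: ratio from the formula
    return sum_xy / math.sqrt(sum_xx * sum_yy)

# Hypothetical data: hours studied and exam scores for six students
study_hours = [2, 4, 5, 7, 8, 10]
exam_scores = [55, 62, 66, 74, 79, 90]

r = pearson_r(study_hours, exam_scores)
print(f"r = {r:.3f}")
```

For these made-up values $r$ comes out close to 1, which would be read as a strong positive linear relationship. The function divides by zero if either variable is constant, so real data should be checked for that case first.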



# Q2. Spearman's rank correlation is a measure of the monotonic relationship between two variables. Suppose you have collected data on the amount of sleep individuals get each night and their overall job satisfaction level on a scale of 1 to 10. Calculate the Spearman's rank correlation between these two variables and interpret the result.

Spearman's rank correlation is a statistical measure used to assess the strength and direction of a monotonic relationship between two variables. Unlike Pearson correlation, Spearman's rank correlation does not assume a linear relationship but focuses on whether the variables tend to move together in a consistent direction.

To calculate Spearman's rank correlation, follow these steps:

1. Rank the data for each variable separately, from lowest to highest.
2. Compute the difference in ranks for each pair of corresponding data points.
3. Square the rank differences.
4. Calculate the Spearman's rank correlation using the formula:

$$ \rho = 1 - \frac{6\sum{d_i^2}}{n(n^2 - 1)} $$

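Where:
- $ d_i $ represents the difference in ranks for each pair of corresponding data points.
- $ n $ is the number of data points.

This shortcut formula is exact only when there are no tied ranks. When ties occur, which is likely for a 1 to 10 satisfaction scale, assign each tied value the average of the ranks it spans and compute the Pearson correlation of the two rank columns instead.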

The resulting value of $ \rho $ will be between -1 and 1:
- $ \rho = 1 $ indicates a perfect monotonic positive relationship.
- $ \rho = -1 $ indicates a perfect monotonic negative relationship.
- $ \rho = 0 $ indicates no monotonic relationship (the variables do not consistently move together in either direction).

Interpretation of the result:
- If $ \rho $ is close to 1, it indicates a strong monotonic positive relationship between the amount of sleep individuals get and their overall job satisfaction. This means that as the amount of sleep increases, job satisfaction tends to increase as well.
- If $ \rho $ is close to -1, it indicates a strong monotonic negative relationship between the amount of sleep and job satisfaction. This means that as the amount of sleep increases, job satisfaction tends to decrease.
- If $ \rho $ is close to 0, it suggests that there is little to no monotonic relationship between the amount of sleep and job satisfaction.
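
The ranking steps can be sketched with pandas as follows. The sleep and satisfaction values are hypothetical and deliberately include tied satisfaction scores; the names `data`, `ranks`, `rho_shortcut`, and `rho_pandas` are introduced here for illustration only.

```python
import pandas as pd

# Hypothetical data: nightly sleep (hours) and job satisfaction (1-10)
data = pd.DataFrame({
    "sleep_hours": [5.0, 6.0, 6.5, 7.0, 7.5, 8.0, 8.5, 9.0],
    "job_satisfaction": [3, 4, 4, 6, 7, 7, 9, 8],
})

# Step 1: rank each variable; tied values receive the average of their ranks
ranks = data.rank(method="average")

# Steps 2-4: shortcut formula based on squared rank differences
d_squared = (ranks["sleep_hours"] - ranks["job_satisfaction"]) ** 2
n = len(data)
rho_shortcut = 1 - 6 * d_squared.sum() / (n * (n**2 - 1))

# Tie-aware version: Spearman's rho as computed by pandas
rho_pandas = data.corr(method="spearman").loc["sleep_hours", "job_satisfaction"]

# Equivalent view: Pearson correlation of the two rank columns
rho_from_ranks = ranks["sleep_hours"].corr(ranks["job_satisfaction"])

print(f"shortcut formula: {rho_shortcut:.4f}")
print(f"pandas spearman:  {rho_pandas:.4f}")
```

Because the satisfaction scores contain ties, the shortcut value and the pandas value differ slightly; without ties the two agree.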



# Q3. Suppose you are conducting a study to examine the relationship between the number of hours of exercise per week and body mass index (BMI) in a sample of adults. You collected data on both variables for 50 participants. Calculate the Pearson correlation coefficient and the Spearman's rank correlation between these two variables and compare the results.



**Pearson Correlation Coefficient (r):**

1. Calculate the mean ($\bar{X}$) and standard deviation ($s_X$) of the exercise hours ($X$).
2. Calculate the mean ($\bar{Y}$) and standard deviation ($s_Y$) of the BMI ($Y$).
3. Calculate the Pearson correlation coefficient using the formula:

$$ r = \frac{\sum{(X_i - \bar{X})(Y_i - \bar{Y})}}{ns_Xs_Y} $$

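Where:
- $X_i$ and $Y_i$ are individual data points for exercise hours and BMI, respectively.
- $n$ is the number of data points.
- $s_X$ and $s_Y$ are the population standard deviations (computed with divisor $n$). If sample standard deviations (divisor $n - 1$) are used, replace $n$ in the denominator with $n - 1$.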

**Spearman's Rank Correlation Coefficient ($\rho$):**

1. Rank the data for both variables separately, from lowest to highest.
2. Calculate the difference in ranks for each pair of corresponding data points.
3. Calculate the Spearman's rank correlation coefficient using the formula:

$$\rho = 1 - \frac{6\sum{d_i^2}}{n(n^2 - 1)} $$

Where:
- $d_i$ represents the difference in ranks for each pair of corresponding data points.
- $n$ is the number of data points.

After calculating both correlation coefficients, you can compare the results. Here are some considerations:

- If both coefficients are close to 1 or -1 and have similar values, the relationship between exercise hours and BMI is strong, monotonic, and approximately linear.
- If Pearson's $r$ is noticeably closer to 1 or -1 than Spearman's $\rho$, a few extreme observations may be inflating $r$, since a perfectly linear relationship is also perfectly monotonic; check the data for outliers.
- If Spearman's $\rho$ is closer to 1 or -1 than Pearson's $r$, the relationship is more consistently monotonic than linear, for example a curve that rises or falls steadily but not along a straight line.
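
To make the comparison concrete, the sketch below builds synthetic data for 50 participants in which BMI falls steeply over the first few hours of exercise and then levels off. The curve, its constants, and the name `df_demo` are assumptions made for illustration, not values from the study.

```python
import numpy as np
import pandas as pd

# Synthetic, hypothetical data: a monotonic but nonlinear (levelling-off) curve
exercise_hours = np.linspace(0, 10, 50)
bmi = 22 + 10 * np.exp(-0.5 * exercise_hours)

df_demo = pd.DataFrame({"exercise_hours": exercise_hours, "bmi": bmi})

# Both coefficients from the same DataFrame
pearson_r = df_demo.corr(method="pearson").loc["exercise_hours", "bmi"]
spearman_rho = df_demo.corr(method="spearman").loc["exercise_hours", "bmi"]

print(f"Pearson r:    {pearson_r:.3f}")
print(f"Spearman rho: {spearman_rho:.3f}")
```

Since BMI decreases at every step in this synthetic curve, Spearman's $\rho$ is -1, while Pearson's $r$ is weaker because the points do not lie on a straight line. Real measurements would include noise, so neither coefficient would be this clean.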



# Q4. A researcher is interested in examining the relationship between the number of hours individuals spend watching television per day and their level of physical activity. The researcher collected data on both variables from a sample of 50 participants. Calculate the Pearson correlation coefficient between these two variables.



1. Calculate the mean ($\bar{X}$) and standard deviation ($s_X$) of the hours spent watching television.
2. Calculate the mean ($\bar{Y}$) and standard deviation ($s_Y$) of the level of physical activity.
3. Calculate the Pearson correlation coefficient ($r$) using the formula:

$$ r = \frac{\sum{(X_i - \bar{X})(Y_i - \bar{Y})}}{ns_Xs_Y} $$

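Where:
- $X_i$ and $Y_i$ are individual data points for hours of television watched and level of physical activity, respectively.
- $n$ is the number of data points.
- $s_X$ and $s_Y$ are the population standard deviations (computed with divisor $n$). If sample standard deviations (divisor $n - 1$) are used, replace $n$ in the denominator with $n - 1$.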

This formula calculates the degree and direction of the linear relationship between the two variables. The resulting $r$ value will be between -1 and 1:

- $r = 1$ indicates a perfect positive linear correlation.
- $r = -1$ indicates a perfect negative linear correlation.
- $r = 0$ indicates no linear correlation (variables are not linearly related).

Interpretation of the Pearson correlation coefficient ($r$):
- If $r$ is close to 1, it indicates a strong positive linear relationship between the number of hours spent watching television and the level of physical activity. This means that as the time spent watching TV increases, the level of physical activity tends to increase as well.
- If $r$ is close to -1, it indicates a strong negative linear relationship between the number of hours spent watching television and the level of physical activity. This means that as the time spent watching TV increases, the level of physical activity tends to decrease.
- If $r$ is close to 0, it suggests that there is little to no linear relationship between the number of hours spent watching television and the level of physical activity.
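
The standard-deviation form of the formula depends on which standard deviation is used. The sketch below checks this numerically with NumPy on hypothetical TV and activity values; the arrays and variable names are assumptions introduced for illustration.

```python
import numpy as np

# Hypothetical data: daily TV hours and a physical activity score
tv_hours = np.array([1.0, 2.0, 2.5, 3.0, 4.0, 5.0, 6.0])
activity = np.array([8.0, 7.5, 6.0, 6.5, 4.0, 3.5, 2.0])

n = len(tv_hours)
cross = np.sum((tv_hours - tv_hours.mean()) * (activity - activity.mean()))

# Population standard deviations (ddof=0) pair with n in the denominator
r_population = cross / (n * tv_hours.std(ddof=0) * activity.std(ddof=0))

# Sample standard deviations (ddof=1) pair with n - 1 instead
r_sample = cross / ((n - 1) * tv_hours.std(ddof=1) * activity.std(ddof=1))

# Mixing n with sample standard deviations shrinks r by a factor (n - 1) / n
r_mixed = cross / (n * tv_hours.std(ddof=1) * activity.std(ddof=1))

# Reference value from NumPy's built-in correlation matrix
r_numpy = np.corrcoef(tv_hours, activity)[0, 1]

print(r_population, r_sample, r_mixed, r_numpy)
```

The first two results agree with `np.corrcoef`, while the mixed version is slightly too small in magnitude. For these made-up values $r$ is negative, meaning more TV time goes with less physical activity.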



# Q5. A survey was conducted to examine the relationship between age and preference for a particular brand of soft drink. The survey results are shown below:

    
    Age(Years)	Soft Drink Preference
    25	Coke
    42	Pepsi
    37	Mountain Dew
    19	Coke
    31	Pepsi
    28	Coke


In [1]:
import pandas as pd
df = pd.DataFrame({
    "Age(Years)": [25, 42, 37, 19, 31, 28],
    "Soft Drink Preference": ['Coke', 'Pepsi', 'Mountain Dew', 'Coke', 'Pepsi', 'Coke']
})

df

       Age(Years) Soft Drink Preference
    0          25                  Coke
    1          42                 Pepsi
    2          37          Mountain Dew
    3          19                  Coke
    4          31                 Pepsi
    5          28                  Coke


In [3]:
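from sklearn.preprocessing import LabelEncoder

# LabelEncoder assigns integer codes in alphabetical order
# (Coke=0, Mountain Dew=1, Pepsi=2); for a nominal variable this ordering is arbitrary
encoder = LabelEncoder()

df['soft_drink_encoded'] = encoder.fit_transform(df['Soft Drink Preference'])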

In [7]:
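# drop() returns a new DataFrame, so assign the result; this also leaves df unchanged
df_copy = df.drop('Soft Drink Preference', axis=1)

# Pearson correlation between age and the integer-coded preference
df_copy.corr(method='pearson')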


                        Age(Years)  soft_drink_encoded
    Age(Years)            1.000000            0.769175
    soft_drink_encoded    0.769175            1.000000
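
Brand preference is a nominal variable, so the integer codes carry no natural order and the coefficient above would change if the brands were coded differently. As an alternative (a swapped-in technique, not the method used in the cells above), the sketch below computes the correlation ratio $\eta^2$, the share of the variance in age that is accounted for by brand membership. The names `survey` and `correlation_ratio` are introduced here for illustration.

```python
import pandas as pd

# Survey data from the question, with shortened column names
survey = pd.DataFrame({
    "age": [25, 42, 37, 19, 31, 28],
    "brand": ["Coke", "Pepsi", "Mountain Dew", "Coke", "Pepsi", "Coke"],
})

def correlation_ratio(values, groups):
    """Eta squared: between-group sum of squares over total sum of squares."""
    grand_mean = values.mean()
    grouped = values.groupby(groups)
    ss_between = (grouped.size() * (grouped.mean() - grand_mean) ** 2).sum()
    ss_total = ((values - grand_mean) ** 2).sum()
    return ss_between / ss_total

eta_squared = correlation_ratio(survey["age"], survey["brand"])

# Mean age per brand, followed by the share of variance explained
print(survey.groupby("brand")["age"].mean())
print(f"eta squared = {eta_squared:.3f}")
```

With only six respondents, and a single one for Mountain Dew, any such measure is a description of this sample rather than evidence of a general pattern.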


# Q6. A company is interested in examining the relationship between the number of sales calls made per day and the number of sales made per week. The company collected data on both variables from a sample of 30 sales representatives. Calculate the Pearson correlation coefficient between these two variables.



In [8]:
# Only 10 illustrative records are used here, although the question describes a sample of 30
num_of_calls = [20, 30, 10, 40, 60, 80, 60, 50, 40, 100]
num_of_sales = [2, 3, 2, 5, 5, 6, 8, 5, 7, 20]

df = pd.DataFrame({
    'num_of_calls': num_of_calls,
    'num_of_sales': num_of_sales
})

df

       num_of_calls  num_of_sales
    0            20             2
    1            30             3
    2            10             2
    3            40             5
    4            60             5
    5            80             6
    6            60             8
    7            50             5
    8            40             7
    9           100            20


In [9]:
df.corr(method='pearson')

                  num_of_calls  num_of_sales
    num_of_calls      1.000000      0.824008
    num_of_sales      0.824008      1.000000
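
As a cross-check, the same coefficient can be recomputed from the deviation formula with NumPy. The sketch also recomputes it with the last record (100 calls, 20 sales) left out, as a simple way to see how much the largest observation contributes; that sensitivity check is an addition here, not part of the original analysis.

```python
import numpy as np

# Same values as in the cell above
calls = np.array([20, 30, 10, 40, 60, 80, 60, 50, 40, 100])
sales = np.array([2, 3, 2, 5, 5, 6, 8, 5, 7, 20])

# Pearson r from the deviation formula
dx = calls - calls.mean()
dy = sales - sales.mean()
r_manual = np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

# Same coefficient with the last representative left out
r_without_last = np.corrcoef(calls[:-1], sales[:-1])[0, 1]

print(f"r (all 10 records):      {r_manual:.6f}")
print(f"r (without last record): {r_without_last:.6f}")
```

The first value reproduces the 0.824008 reported by `df.corr`. The second is lower, which suggests that the single largest record accounts for part of the apparent strength of the relationship.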
