#### Q1. Pearson correlation coefficient is a measure of the linear relationship between two variables. Suppose you have collected data on the amount of time students spend studying for an exam and their final exam scores. Calculate the Pearson correlation coefficient between these two variables and interpret the result.

In [None]:
## Ans-

To calculate the Pearson correlation coefficient between the amount of time students spend studying for an exam and their final exam scores, we need to follow these steps:

1. Calculate the mean of the study times and the mean of the exam scores.

2. Calculate each observation's deviation from the mean for both variables.

3. For each student, multiply the two deviations together.

4. Sum these products of deviations.

5. Calculate the Pearson correlation coefficient using the formula:

r = (sum of products of deviations) / sqrt((sum of squared deviations of x) * (sum of squared deviations of y))

Equivalently, r = (sum of products of deviations) / ((n - 1) * s_x * s_y), where s_x and s_y are the sample standard deviations and n is the number of observations.
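
The steps above can be sketched in plain Python as follows. This is a minimal illustration, not a replacement for a library routine: the function name `pearson_r` and the small `hours` / `scores` lists are hypothetical and chosen only for this example. The denominator is the square root of the product of the two sums of squared deviations.

```python
import math

def pearson_r(x, y):
    """Pearson correlation computed step by step (illustrative helper)."""
    n = len(x)
    # Step 1: means of both variables
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Step 2: deviations from the mean
    dev_x = [xi - mean_x for xi in x]
    dev_y = [yi - mean_y for yi in y]
    # Steps 3 and 4: multiply the paired deviations and sum the products
    sum_products = sum(dx * dy for dx, dy in zip(dev_x, dev_y))
    # Step 5: divide by the square root of the product of the
    # sums of squared deviations
    ss_x = sum(dx * dx for dx in dev_x)
    ss_y = sum(dy * dy for dy in dev_y)
    return sum_products / math.sqrt(ss_x * ss_y)

# Hypothetical data, used only to exercise the function
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 61, 70, 74]

r = pearson_r(hours, scores)
print("r =", round(r, 4))
```

For data that lie exactly on a straight line, such as `pearson_r([1, 2, 3], [2, 4, 6])`, the function returns 1 up to floating-point rounding.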

Interpretation:

The Pearson correlation coefficient, denoted as "r," ranges from -1 to 1. 
A value of 1 indicates a perfect positive linear correlation: the points lie exactly on an upward-sloping line, so as one variable increases, the other also increases.
A value of -1 indicates a perfect negative linear correlation: as one variable increases, the other decreases.
A value of 0 indicates no linear correlation between the variables, although a nonlinear relationship may still exist.

#### Example through Python

In [None]:
Let's assume we have the following data for 5 students:

Student    Study Time (hours)    Exam Score (out of 100)
1          5                     85
2          3                     70
3          6                     90
4          4                     75
5          2                     60

To calculate the Pearson correlation coefficient between the study time and exam score variables, 
we can use the pearsonr function from the scipy.stats module in Python. Here's the code:



In [2]:
import scipy.stats as stats

study_time = [5, 3, 6, 4, 2]
exam_score = [85, 70, 90, 75, 60]

corr_coeff, p_value = stats.pearsonr(study_time, exam_score)

print("Pearson correlation coefficient:", corr_coeff)
print("p-value:", p_value)


Pearson correlation coefficient: 0.9933992677987827
p-value: 0.0006431193269336826


## Interpretation:

Since the Pearson correlation coefficient is positive and very close to 1,
we can say that there is a strong positive linear correlation between the amount of time spent studying and the final exam score.
This means that students who studied longer tended to score higher in this sample.

#### Q2. Spearman's rank correlation is a measure of the monotonic relationship between two variables. Suppose you have collected data on the amount of sleep individuals get each night and their overall job satisfaction level on a scale of 1 to 10. Calculate the Spearman's rank correlation between these two variables and interpret the result.

In [4]:
#Ans-

import scipy.stats as stats

# sample data on amount of sleep and job satisfaction
amount_of_sleep = [7, 8, 6, 9, 5]
job_satisfaction = [8, 9, 7, 10, 6]

# calculate Spearman's rank correlation
# spearmanr ranks the data internally, so the raw values can be passed directly
# (ranking them first with stats.rankdata gives the same result)
spearman_corr, p_value = stats.spearmanr(amount_of_sleep, job_satisfaction)

print("Spearman's rank correlation: ", spearman_corr)
print("p-value: ", p_value)


Spearman's rank correlation:  0.9999999999999999
p-value:  1.4042654220543672e-24


## Interpretation:

In [None]:
The Spearman's rank correlation coefficient is 1 (up to floating-point rounding),
which indicates a perfect positive monotonic relationship between the amount of sleep individuals get and their job satisfaction level in this sample.
This means that whenever the amount of sleep is higher, job satisfaction is higher as well. 

The reported p-value is 1.4042654220543672e-24, which is a very small number expressed in scientific notation. 
Taken at face value, it says that the observed correlation would be extremely unlikely if there were no true correlation between the variables, which would lead us to reject the null hypothesis of no correlation.
However, this p-value comes from an approximation that is not reliable for a sample of only five observations, so it should be interpreted with caution.
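
For a sample this small, the p-value can also be obtained exactly by enumerating every possible pairing of the ranks. The sketch below is a swapped-in technique (an exact permutation test), not what `stats.spearmanr` computes by default. The helper names `ranks` and `spearman_rho` are hypothetical, and both helpers assume there are no tied values, which holds for the example data.

```python
from itertools import permutations

def ranks(values):
    """Rank values from 1..n (assumes no ties, as in the example data)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman_rho(rank_x, rank_y):
    """Spearman's rho from two tie-free rank vectors."""
    n = len(rank_x)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

amount_of_sleep = [7, 8, 6, 9, 5]
job_satisfaction = [8, 9, 7, 10, 6]

rank_sleep = ranks(amount_of_sleep)
rank_satisfaction = ranks(job_satisfaction)
rho_observed = spearman_rho(rank_sleep, rank_satisfaction)

# Enumerate all 5! = 120 ways of pairing the satisfaction ranks with the
# sleep ranks and count how many give a rho at least as extreme as observed.
all_rhos = [spearman_rho(rank_sleep, list(p))
            for p in permutations(rank_satisfaction)]
as_extreme = sum(abs(r) >= abs(rho_observed) - 1e-12 for r in all_rhos)
exact_p = as_extreme / len(all_rhos)

print("rho:", rho_observed)
print("pairings at least as extreme:", as_extreme, "of", len(all_rhos))
print("exact two-sided p-value:", exact_p)
```

Only the perfectly increasing and perfectly decreasing pairings are as extreme as the observed one, so the exact two-sided p-value is 2/120. That is still below 0.05, but it is far less extreme than the approximate value printed above.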

#### Q3. Suppose you are conducting a study to examine the relationship between the number of hours of exercise per week and body mass index (BMI) in a sample of adults. You collected data on both variables for 50 participants. Calculate the Pearson correlation coefficient and the Spearman's rank correlation between these two variables and compare the results.

In [None]:
Ans-

Let's assume the following data for the number of hours of exercise per week and BMI values for 50 participants:

In [6]:
hours_of_exercise = [2, 3, 1, 4, 2, 5, 1, 6, 2, 3, 4, 5, 6, 1, 3, 2, 4, 5, 1, 6, 3, 2, 4, 1, 5, 
                     2, 3, 4, 1, 5, 6, 3, 2, 4, 5, 1, 6, 2, 3, 1, 4, 5, 2, 6, 1, 3, 4, 5, 2, 6]

bmi_values = [22, 23, 21, 25, 24, 26, 22, 27, 23, 23, 25, 26, 27, 21, 22, 24, 25, 26, 21, 28, 
              23, 24, 25, 22, 27, 23, 24, 26, 21, 28, 29, 24, 23, 25, 26, 21, 28, 22, 23, 21, 25,
              26, 24, 27, 22, 23, 25, 26, 24, 28]


import scipy.stats as stats

# Calculate Pearson correlation coefficient
pearson_corr_coef, _ = stats.pearsonr(hours_of_exercise, bmi_values)
print(f"Pearson correlation coefficient: {pearson_corr_coef:.3f}")

# Calculate Spearman's rank correlation
spearman_corr_coef, _ = stats.spearmanr(hours_of_exercise, bmi_values)
print(f"Spearman's rank correlation: {spearman_corr_coef:.3f}")


Pearson correlation coefficient: 0.940
Spearman's rank correlation: 0.940


In [None]:
As we can see, the Pearson and Spearman coefficients agree to three decimal places (0.940),
indicating a strong positive correlation between the two variables in this assumed dataset.
That is, as the hours of exercise increase, the BMI value also increases (which may not be true in the real world).

The Pearson correlation coefficient measures the strength of a linear relationship between two continuous variables,
while Spearman's rank correlation coefficient measures the strength of a monotonic relationship between two variables.

A monotonic relationship is one where, as one variable increases,
the other variable consistently increases or consistently decreases, but not necessarily at a constant rate.
Every linear relationship is monotonic, but not every monotonic relationship is linear. When the two coefficients are nearly equal, as here, it suggests that the relationship is roughly linear and is not distorted by outliers; a Spearman coefficient much larger than the Pearson coefficient would instead point to a monotonic but nonlinear relationship.
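
To see how the two coefficients can differ, the following sketch uses hypothetical data in which `y` always increases with `x` but at a rapidly growing rate. The helper `rank_no_ties` is an assumption made for this example and is only valid when all values are distinct. Spearman's coefficient is computed here as the Pearson correlation of the ranks.

```python
import numpy as np

def rank_no_ties(values):
    """Return 1-based ranks; valid only when all values are distinct."""
    order = np.argsort(values)
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    return ranks

# Hypothetical data: monotonic but strongly nonlinear
x = np.arange(1, 11)
y = np.exp(x)

# Pearson on the raw values measures linear association
pearson = np.corrcoef(x, y)[0, 1]
# Spearman is the Pearson correlation of the ranks
spearman = np.corrcoef(rank_no_ties(x), rank_no_ties(y))[0, 1]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Because the ordering of `y` matches the ordering of `x` exactly, Spearman's coefficient is 1, while the Pearson coefficient is noticeably lower because the points do not lie on a straight line.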

#### Q4. A researcher is interested in examining the relationship between the number of hours individuals spend watching television per day and their level of physical activity. The researcher collected data on both variables from a sample of 50 participants. Calculate the Pearson correlation coefficient between these two variables.

In [None]:
Ans-

Here is an example using Python to generate some random data and calculate the Pearson correlation coefficient. Because no random seed is set, the exact values change from run to run:

In [1]:
import numpy as np
from scipy.stats import pearsonr

# Generate random data for hours of TV watched and level of physical activity
tv_hours = np.random.normal(3, 1, 50)
activity_level = np.random.normal(60, 20, 50)

# Calculate the Pearson correlation coefficient and p-value
corr, p_value = pearsonr(tv_hours, activity_level)

print("Pearson correlation coefficient:", corr)
print("p-value:", p_value)


Pearson correlation coefficient: -0.1945801835271386
p-value: 0.1757132593584766


In [None]:
The Pearson correlation coefficient between TV hours and activity level is about -0.195, which is a weak negative correlation in this sample.
The p-value is about 0.176, which is greater than the conventional threshold of 0.05 for statistical significance. 
This means that we cannot reject the null hypothesis that there is no correlation between TV hours and activity level in the population, based on this sample.

This is the expected outcome here: the two variables were generated independently, so the true correlation is zero and the small negative value reflects sampling variation rather than a real tendency for activity to fall as TV hours rise.
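
One way to get a feel for this p-value is to repeat the random draw many times and count how often independent data produce a correlation at least as large in magnitude as the one reported above (about 0.1946). The sketch below does this under stated assumptions: the seed `0` and the number of simulations are arbitrary choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # seed chosen arbitrarily for reproducibility
n_participants = 50
n_simulations = 2000             # arbitrary; more simulations give a smoother estimate
observed_abs_r = 0.1946          # |r| from the output shown above

count_as_large = 0
for _ in range(n_simulations):
    # Same distributions as in the example; the two draws are independent
    tv_hours = rng.normal(3, 1, n_participants)
    activity_level = rng.normal(60, 20, n_participants)
    r = np.corrcoef(tv_hours, activity_level)[0, 1]
    if abs(r) >= observed_abs_r:
        count_as_large += 1

fraction = count_as_large / n_simulations
print(f"Share of simulations with |r| >= {observed_abs_r}: {fraction:.3f}")
```

The printed share should land in the neighbourhood of the reported p-value, which illustrates that a correlation of this size is unremarkable for 50 unrelated observations.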

In [None]:
#### Q5. A survey was conducted to examine the relationship between age and preference for a particular brand of soft drink. The survey results are shown below:

Age (years)    Soft drink preference
25             Coke
42             Pepsi
37             Mountain Dew
19             Coke
31             Pepsi
28             Coke

Ans-

The Pearson correlation coefficient measures the linear association between two numeric variables.
Age is numeric, but soft drink preference is a nominal categorical variable with no numeric value,
so a Pearson correlation between them is not meaningful.

Spearman's rank correlation requires that both variables can be ranked, i.e. that they are at least ordinal.
Age can be ranked, but soft drink preference has no natural order (Coke, Pepsi and Mountain Dew cannot be sorted from low to high).
Therefore, Spearman's rank correlation is not appropriate either.

Instead, we can group age into categories and use a chi-square test of independence to look for a relationship between the two:

In [2]:
import pandas as pd
from scipy.stats import chi2_contingency

# create a dataframe with the survey results
data = pd.DataFrame({
    'Age': [25, 42, 37, 19, 31, 28],
    'Preference': ['Coke', 'Pepsi', 'Mountain Dew', 'Coke', 'Pepsi', 'Coke']
})

# create a contingency table using pandas crosstab function
table = pd.crosstab(data['Age'] < 35, data['Preference'])

# add row and column labels; crosstab sorts the boolean index as False, True,
# so the first row is "35 and above" (Age < 35 is False)
table.index = ['35 and above', 'Below 35']
table.columns.name = 'Soft drink preference'

# perform the chi-square test
stat, p, dof, expected = chi2_contingency(table)

# interpret the p-value
alpha = 0.05
if p < alpha:
    print("There is evidence of a relationship between age and soft drink preference.")
else:
    print("There is no evidence of a relationship between age and soft drink preference.")

# display the contingency table
print(table)


There is no evidence of a relationship between age and soft drink preference.
Soft drink preference  Coke  Mountain Dew  Pepsi
35 and above              0             1      1
Below 35                  3             0      1


In [None]:
The chi-square test returns a p-value above 0.05, so the code reports no evidence of a relationship between age and soft drink preference.

The contingency table shows the frequency distribution of the survey data.
Of the four respondents below 35, three prefer Coke and one prefers Pepsi.
Of the two respondents aged 35 and above, one prefers Mountain Dew and one prefers Pepsi.

Based on this sample, we cannot conclude that age and soft drink preference are related.
However, with only six respondents every expected cell count is well below 5, so the chi-square approximation is unreliable and the test has very little power. 
Additionally, there may be other factors (such as gender or location) that could affect soft drink preference.
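
The chi-square statistic for this table can also be worked out by hand, which makes the expected counts visible. The sketch below is illustrative only. It rebuilds the counts from the survey data, uses the same 35-year cut-off, and relies on the fact that for exactly 2 degrees of freedom the chi-square tail probability has the closed form exp(-x/2). The variable names are assumptions made for this example.

```python
import math

ages = [25, 42, 37, 19, 31, 28]
preferences = ["Coke", "Pepsi", "Mountain Dew", "Coke", "Pepsi", "Coke"]

groups = ["Below 35", "35 and above"]
brands = sorted(set(preferences))

# Observed counts per age group and brand
observed = {g: {b: 0 for b in brands} for g in groups}
for age, brand in zip(ages, preferences):
    group = "Below 35" if age < 35 else "35 and above"
    observed[group][brand] += 1

n = len(ages)
row_totals = {g: sum(observed[g].values()) for g in groups}
col_totals = {b: sum(observed[g][b] for g in groups) for b in brands}

# Expected count under independence: row total * column total / n
expected = {g: {b: row_totals[g] * col_totals[b] / n for b in brands}
            for g in groups}

chi2 = sum((observed[g][b] - expected[g][b]) ** 2 / expected[g][b]
           for g in groups for b in brands)
dof = (len(groups) - 1) * (len(brands) - 1)

# Closed form for the upper tail, valid only when dof == 2
p_value = math.exp(-chi2 / 2)

for g in groups:
    print(g, "observed:", observed[g], "expected:", expected[g])
print("chi-square:", round(chi2, 3), "dof:", dof, "p-value:", round(p_value, 4))
```

The largest expected count is 2, far below the usual rule of thumb of 5 per cell, which is why the result should be read as a demonstration of the procedure and not as a reliable test.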

#### Alternatively, to calculate the Pearson and Spearman's rank correlation between the variables, we first need to convert the categorical variable to a numerical one. We can do so by encoding it:

In [4]:
data

   Age    Preference
0   25          Coke
1   42         Pepsi
2   37  Mountain Dew
3   19          Coke
4   31         Pepsi
5   28          Coke


In [8]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoded_col = encoder.fit_transform(data['Preference'])
df_encoded = pd.DataFrame(encoded_col, columns=['Preference Enc'])
df_new = pd.concat([data, df_encoded], axis=1)
df_new = df_new.drop('Preference', axis=1)
df_new

   Age  Preference Enc
0   25               0
1   42               2
2   37               1
3   19               0
4   31               2
5   28               0


In [10]:
import pandas as pd
from scipy.stats import pearsonr, spearmanr

# calculate Pearson correlation coefficient
corr, p_value = pearsonr(df_new['Age'], df_new['Preference Enc'])
print('Pearson correlation coefficient:', corr)
print('p-value:', p_value)

# calculate Spearman's rank correlation
rho, p_value = spearmanr(df_new['Age'], df_new['Preference Enc'])
print('Spearman rank correlation:', rho)
print('p-value:', p_value)


Pearson correlation coefficient: 0.7691751415594736
p-value: 0.07377098537821529
Spearman rank correlation: 0.8332380897952965
p-value: 0.03939551647885117


#### Note: a correlation between these two variables is not meaningful, because label encoding assigns the brands an arbitrary (alphabetical) order. This example only shows how the Pearson and Spearman coefficients can be computed mechanically after label encoding.
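
To illustrate why the encoded correlation is not meaningful, the sketch below tries every possible assignment of the codes 0, 1, 2 to the three brands and recomputes the Pearson coefficient each time. It is a minimal demonstration using the survey data. The `results` dictionary and the loop structure are assumptions made for this example.

```python
from itertools import permutations

import numpy as np

ages = [25, 42, 37, 19, 31, 28]
preferences = ["Coke", "Pepsi", "Mountain Dew", "Coke", "Pepsi", "Coke"]
brands = sorted(set(preferences))   # alphabetical, as LabelEncoder orders classes

results = {}
for codes in permutations(range(len(brands))):
    # Assign one of the six possible code orderings to the brands
    mapping = dict(zip(brands, codes))
    encoded = [mapping[p] for p in preferences]
    r = float(np.corrcoef(ages, encoded)[0, 1])
    results[codes] = r
    print(mapping, "->", round(r, 3))
```

The alphabetical assignment (Coke = 0, Mountain Dew = 1, Pepsi = 2) reproduces the coefficient of about 0.769 shown above, while reversing the codes flips its sign. The coefficient therefore reflects the choice of encoding as much as the data.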

#### Q6. A company is interested in examining the relationship between the number of sales calls made per day and the number of sales made per week. The company collected data on both variables from a sample of 30 sales representatives. Calculate the Pearson correlation coefficient between these two variables.

In [3]:
import numpy as np

# define the sales calls per day and sales per week variables
sales_calls = np.array([20, 30, 15, 25, 10, 15, 30, 20, 25, 30, 35, 40, 15, 10, 20, 25, 30, 15, 20, 10, 5, 10, 15, 20, 30, 35, 40, 20, 15, 25])
sales_per_week = np.array([3, 4, 2, 4, 1, 2, 4, 3, 4, 5, 6, 7, 2, 1, 3, 4, 5, 2, 3, 1, 0, 1, 2, 3, 4, 5, 6, 3, 2, 4])

# calculate the Pearson correlation coefficient using numpy's corrcoef function
r = np.corrcoef(sales_calls, sales_per_week)[0, 1]

# display the Pearson correlation coefficient
print("The Pearson correlation coefficient between sales calls and sales per week is:", r)


The Pearson correlation coefficient between sales calls and sales per week is: 0.9821371955305618


#### Interpretation

In [None]:
The Pearson correlation coefficient between sales calls and sales per week is 0.98. 
This value indicates a strong positive linear relationship between the variables.
That is, as the number of sales calls per day increases, the number of sales made per week also tends to increase. 
A scatter plot of the two variables would be expected to show a clear upward trend in the data points.

It is important to note that correlation does not imply causation, and other factors may also affect sales per week (e.g. product quality, pricing, marketing campaigns, etc.). 
Additionally, this is only a sample of 30 sales representatives, so the result may not generalize to the entire population of sales representatives.
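
As a numerical complement to the interpretation, the upward trend can be summarised by fitting a straight line to the same data. The sketch below uses `np.polyfit` with degree 1 (an ordinary least-squares line) and reports the squared correlation. This is an illustrative addition under the assumption that a linear summary is appropriate; it is not part of the original calculation.

```python
import numpy as np

sales_calls = np.array([20, 30, 15, 25, 10, 15, 30, 20, 25, 30, 35, 40, 15, 10, 20,
                        25, 30, 15, 20, 10, 5, 10, 15, 20, 30, 35, 40, 20, 15, 25])
sales_per_week = np.array([3, 4, 2, 4, 1, 2, 4, 3, 4, 5, 6, 7, 2, 1, 3,
                           4, 5, 2, 3, 1, 0, 1, 2, 3, 4, 5, 6, 3, 2, 4])

# Least-squares line: sales_per_week is approximately slope * sales_calls + intercept
slope, intercept = np.polyfit(sales_calls, sales_per_week, 1)

# Squared Pearson correlation: share of variance captured by the line
r = np.corrcoef(sales_calls, sales_per_week)[0, 1]
r_squared = r ** 2

print(f"slope:     {slope:.3f} additional weekly sales per extra daily call")
print(f"intercept: {intercept:.3f}")
print(f"r squared: {r_squared:.3f}")
```

A positive slope matches the sign of the correlation, and the squared correlation indicates how much of the variation in weekly sales the fitted line accounts for in this sample.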