In [None]:
# Q1. Pearson correlation coefficient is a measure of the linear relationship between two variables. Suppose
# you have collected data on the amount of time students spend studying for an exam and their final exam
# scores. Calculate the Pearson correlation coefficient between these two variables and interpret the result.

To calculate the Pearson correlation coefficient between the amount of time students spend studying for an exam and their 
final exam scores, we can follow these steps:

Collect the data on both variables (amount of time studying and final exam scores) for a sample of students.

Calculate the mean and standard deviation of both variables.

Calculate the covariance between the two variables using the formula:

cov(X,Y) = Σ[(xi - x_mean) * (yi - y_mean)] / (n - 1)

where xi and yi are the values of the two variables for the i-th observation, x_mean and y_mean are the means of the two variables, and n is the sample size.

Calculate the Pearson correlation coefficient using the formula:

r = cov(X,Y) / (s_x * s_y)

where s_x and s_y are the standard deviations of the two variables.

The Pearson correlation coefficient r ranges from -1 to +1. A value of +1 indicates a perfect positive 
linear relationship between the two variables, while a value of -1 indicates a perfect negative linear 
relationship. A value of 0 indicates no linear relationship between the two variables.

If the Pearson correlation coefficient between the amount of time students spend studying for an exam and
their final exam scores is, for example, r = 0.8, this indicates a strong positive linear relationship between 
the two variables. In other words, students who spend more time studying tend to have higher final exam scores. 
Conversely, if the Pearson correlation coefficient is r = -0.2, this indicates a weak negative linear relationship 
between the two variables. In this case, there is only a slight tendency for students who spend more time studying to have 
lower final exam scores, and most of the variation in scores is not explained by study time.
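
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a prescribed method: the six study-time and score values are hypothetical, made up only to show the arithmetic, and the variable names are assumptions.

```python
import numpy as np

# Hypothetical data for six students (illustrative values, not from a real study)
hours_studied = np.array([2.0, 4.0, 5.0, 7.0, 8.0, 10.0])
exam_scores = np.array([55.0, 62.0, 70.0, 78.0, 85.0, 93.0])

n = len(hours_studied)
x_mean = hours_studied.mean()
y_mean = exam_scores.mean()

# Sample covariance: sum of products of deviations, divided by n - 1
cov_xy = np.sum((hours_studied - x_mean) * (exam_scores - y_mean)) / (n - 1)

# Sample standard deviations (ddof=1 matches the n - 1 denominator above)
s_x = hours_studied.std(ddof=1)
s_y = exam_scores.std(ddof=1)

# Pearson correlation coefficient from the formula r = cov(X,Y) / (s_x * s_y)
r = cov_xy / (s_x * s_y)

# Cross-check against NumPy's built-in correlation matrix
r_numpy = np.corrcoef(hours_studied, exam_scores)[0, 1]

print(f"r from the formula: {r:.4f}")
print(f"r from np.corrcoef: {r_numpy:.4f}")
```

Both values should agree, since np.corrcoef computes the same quantity; for this made-up data r is close to +1.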

In [None]:
# Q2. Spearman's rank correlation is a measure of the monotonic relationship between two variables.
# Suppose you have collected data on the amount of sleep individuals get each night and their overall job
# satisfaction level on a scale of 1 to 10. Calculate the Spearman's rank correlation between these two
# variables and interpret the result.

To calculate Spearman's rank correlation between two variables, we need to first assign ranks to the values of each variable, from smallest to largest. 
The correlation coefficient then measures the extent to which the ranks of the two variables are related to each other.

For example, suppose we have collected data on the amount of sleep individuals get each night and their overall job satisfaction level on a scale of
1 to 10, and we have the following data:

Sleep (hours)	Job Satisfaction
7	8
6	6
8	9
5	4
7	7
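
We can calculate the ranks for each variable as follows. The two observations with 7 hours of sleep are tied, so each receives the average of ranks 3 and 4, i.e. 3.5:

Sleep (hours)	Rank	Job Satisfaction	Rank
7	3.5	8	4
6	2	6	2
8	5	9	5
5	1	4	1
7	3.5	7	3

The correlation coefficient is calculated as:

ρ = 1 - (6Σd^2)/(n(n^2-1))

where d is the difference between the ranks of the corresponding values of the two variables, and n is the sample size. This shortcut formula is exact only when there are no ties; with ties it is an approximation, and the exact value is the Pearson correlation of the ranks.

The rank differences are d = -0.5, 0, 0, 0, 0.5, so Σd^2 = 0.25 + 0 + 0 + 0 + 0.25 = 0.5. Using this formula, we can calculate the correlation coefficient as:

ρ = 1 - (6*0.5)/(5*(5^2-1))
ρ = 1 - (3/120)
ρ = 0.975

The tie-corrected value, computed as the Pearson correlation of the ranks, is approximately 0.9747, so the approximation is very close here.

The correlation coefficient ranges between -1 and +1, with values close to -1 indicating a strong negative correlation,
values close to +1 indicating a strong positive correlation, and values close to 0 indicating no monotonic correlation. In this case, 
the Spearman's rank correlation coefficient of about 0.975 suggests a very strong positive monotonic relationship between the amount of 
sleep individuals get each night and their overall job satisfaction level, although with only five observations the estimate is imprecise.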
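
The ranking and formula steps for the five observations in the table can be checked with SciPy. This is a minimal sketch: `rankdata` assigns tied values the average of their ranks by default, and `spearmanr` computes the coefficient directly from the raw data. The variable names are assumptions.

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

# Data from the example table
sleep_hours = [7, 6, 8, 5, 7]
job_satisfaction = [8, 6, 9, 4, 7]

# Rank each variable from smallest to largest; ties get the average rank
sleep_ranks = rankdata(sleep_hours)
job_ranks = rankdata(job_satisfaction)
print("Sleep ranks:", sleep_ranks)
print("Job satisfaction ranks:", job_ranks)

# Shortcut formula: rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))
d = sleep_ranks - job_ranks
n = len(sleep_hours)
rho_shortcut = 1 - 6 * np.sum(d**2) / (n * (n**2 - 1))

# SciPy computes the Pearson correlation of the ranks, which handles ties exactly
rho_scipy, p_value = spearmanr(sleep_hours, job_satisfaction)

print(f"rho (shortcut formula): {rho_shortcut:.4f}")
print(f"rho (scipy.stats.spearmanr): {rho_scipy:.4f}")
```

Because the sleep data contain a tie, the two values differ slightly; the shortcut formula is only exact when all ranks are distinct.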

In [4]:
# Q3. Suppose you are conducting a study to examine the relationship between the number of hours of
# exercise per week and body mass index (BMI) in a sample of adults. You collected data on both variables
# for 50 participants. Calculate the Pearson correlation coefficient and the Spearman's rank correlation
# between these two variables and compare the results.
import pandas as pd
from scipy.stats import pearsonr, spearmanr

# create a DataFrame with two variables: hours of exercise per week and BMI
df = pd.DataFrame({
    'hours_of_exercise': [3, 6, 5, 7, 4, 3, 2, 1, 5, 6, 7, 8, 9, 10, 5, 4, 2, 1, 3, 4, 5, 6, 7, 8, 4, 3, 2, 1, 5, 6],
    'BMI': [25.2, 29.8, 28.1, 30.2, 26.4, 24.3, 23.0, 22.1, 27.9, 30.1, 31.2, 32.4, 33.6, 35.8, 26.9, 25.7, 22.8, 21.5, 25.1, 26.5, 27.9, 29.2, 30.5, 31.8, 27.2, 25.1, 23.3, 21.9, 28.3, 29.4]
})

# calculate Pearson correlation coefficient and its p-value
corr, p_value = pearsonr(df['hours_of_exercise'], df['BMI'])
print("Pearson correlation coefficient: ", corr)
print("p-value: ", p_value)

# calculate Spearman's rank correlation and its p-value
rho, p_value = spearmanr(df['hours_of_exercise'], df['BMI'])
print("Spearman's rank correlation: ", rho)
print("p-value: ", p_value)

# The Pearson correlation coefficient measures the linear
# relationship between two continuous variables, while the Spearman's 
# rank correlation measures the monotonic relationship between two
# variables, whether linear or not.

# In this case, the Pearson correlation coefficient is about 0.99 and the
# p-value is far below 0.05, indicating a very strong positive linear 
# relationship between the number of hours of exercise per week and 
# BMI in this illustrative data. The Spearman's rank correlation is also
# about 0.99 with a p-value far below 0.05, indicating a very strong
# positive monotonic relationship. Since both coefficients are nearly
# identical, the relationship in this sample is well described as linear
# as well as monotonic.
# Note that the illustrative dataset above contains 30 observations
# rather than the 50 participants mentioned in the question.

Pearson correlation coefficient:  0.9928595631212117
p-value:  2.0999472594481966e-27
Spearman's rank correlation:  0.9911716102449414
p-value:  4.054519222930979e-26
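
As a complement to the comparison above, here is a minimal sketch of a case where the two coefficients diverge. The data are hypothetical (y = exp(x)), chosen only because the relationship is strictly increasing but far from a straight line.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical data: y grows exponentially with x (monotonic but not linear)
x = np.arange(1, 11, dtype=float)
y = np.exp(x)

r, _ = pearsonr(x, y)
rho, _ = spearmanr(x, y)

print(f"Pearson r:    {r:.3f}")
print(f"Spearman rho: {rho:.3f}")
```

Spearman's coefficient is 1 here because the ranks of x and y agree perfectly, while Pearson's coefficient is noticeably lower because the points do not lie on a line. When the two coefficients are close, as in the exercise and BMI data, a linear description is reasonable.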


In [1]:
# Q4. A researcher is interested in examining the relationship between the number of hours individuals
# spend watching television per day and their level of physical activity. The researcher collected data on
# both variables from a sample of 50 participants. Calculate the Pearson correlation coefficient between
# these two variables.
To calculate the Pearson correlation coefficient between the number of hours spent watching television per day and level of physical activity, 
we need to have data on both variables for each participant. Once we have that data, we can use the following formula to calculate the correlation coefficient:

r = (nΣXY - ΣXΣY) / sqrt((nΣX^2 - (ΣX)^2)(nΣY^2 - (ΣY)^2))

where:

n is the number of data points
X and Y are the two variables
Σ denotes summation over all data points
ΣX and ΣY are the sums of the X values and of the Y values
ΣXY is the sum of the product of X and Y for all data points
ΣX^2 is the sum of the squared values of X for all data points
ΣY^2 is the sum of the squared values of Y for all data points

Here is some example Python code that calculates the Pearson correlation coefficient for a sample dataset:
import numpy as np

# Define the data
hours_tv = [3, 4, 6, 2, 5, 1, 7, 8, 5, 3,
            2, 6, 1, 4, 5, 2, 7, 8, 4, 6,
            5, 3, 1, 2, 4, 6, 7, 8, 5, 3,
            2, 6, 1, 4, 5, 2, 7, 8, 4, 6,
            5, 3, 1, 2, 4, 6, 7, 8, 5, 3]
physical_activity = [2, 3, 4, 1, 3, 1, 4, 4, 3, 2,
                     1, 4, 1, 3, 3, 1, 4, 4, 2, 4,
                     3, 2, 1, 1, 2, 4, 4, 4, 3, 2,
                     1, 4, 1, 3, 3, 1, 4, 4, 2, 4,
                     3, 2, 1, 1, 2, 4, 4, 4, 3, 2]

# Calculate the correlation coefficient
r = np.corrcoef(hours_tv, physical_activity)[0, 1]

print("Pearson correlation coefficient: {:.2f}".format(r))

# In this example, we have data on the number of hours individuals 
# spend watching television per day and their level of physical activity 
# for 50 participants. We define two arrays, hours_tv and
# physical_activity, to store the data. We then use the np.corrcoef() 
# function from the NumPy library to calculate the correlation
# coefficient between the two variables. The result is printed
# to the console, rounded to two decimal places.

# Note that the Pearson correlation coefficient ranges from -1 to 1,
# with values of -1 indicating a perfect negative correlation, 0
# indicating no linear correlation, and 1 indicating a perfect positive 
# correlation. A value of 0.5, for example, would indicate a moderate 
# positive correlation between the two variables. For the sample data
# above, r = 0.95 indicates a very strong positive linear relationship.

Pearson correlation coefficient: 0.95


In [None]:
# Q5. A survey was conducted to examine the relationship between age and preference for a particular
# brand of soft drink. The survey results are shown below:

To calculate the covariance between age and soft drink preference, 
we need to first encode the categorical variable
"Soft drink preference" using label encoding.

Assuming that Coke = 0, Pepsi = 1, and Mountain Dew = 2, we can 
create the following dataset:

Age (years)	Soft drink preference
25	0
42	1
37	2
19	0
31	1
28	0

Next, we can calculate the covariance using the following formula:

covariance = [sum of (x - mean_x) * (y - mean_y)] / (n - 1)

where x is age, y is soft drink preference, mean_x is the mean age, 
mean_y is the mean soft drink preference, and n is the number of
observations.

Using this formula, we can calculate the covariance as follows:

mean_age = (25 + 42 + 37 + 19 + 31 + 28) / 6 = 182 / 6 = 30.3333
mean_preference = (0 + 1 + 2 + 0 + 1 + 0) / 6 = 0.6667

covariance = [(25 - 30.3333) * (0 - 0.6667) + (42 - 30.3333) * (1 - 0.6667) + (37 - 30.3333) * (2 - 0.6667) + (19 - 30.3333) * (0 - 0.6667) + (31 - 30.3333) * (1 - 0.6667) + (28 - 30.3333) * (0 - 0.6667)] / (6 - 1)
= (3.5556 + 3.8889 + 8.8889 + 7.5556 + 0.2222 + 1.5556) / 5
= 25.6667 / 5
= 5.1333

The positive covariance suggests that, under this encoding, older 
respondents tend to have higher preference codes. However, 
this result should be interpreted with caution: soft drink preference 
is a nominal variable, so the numeric codes are arbitrary, and a 
different assignment would change the magnitude and even the sign of 
the covariance. The sample size is also small and the data only includes
three brands of soft drinks.
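
The covariance for the six encoded observations can be recomputed with NumPy as a check. This is a minimal sketch: `np.cov` divides by n - 1 by default, matching the formula above. The reversed coding at the end is a hypothetical alternative, included only to show how much the result depends on the arbitrary label encoding.

```python
import numpy as np

# Encoded data from the table: Coke = 0, Pepsi = 1, Mountain Dew = 2
age = np.array([25, 42, 37, 19, 31, 28])
preference = np.array([0, 1, 2, 0, 1, 0])

print(f"mean age: {age.mean():.4f}")
print(f"mean preference: {preference.mean():.4f}")

# np.cov returns the 2x2 covariance matrix; the off-diagonal entry is cov(age, preference)
cov = np.cov(age, preference)[0, 1]
print(f"covariance: {cov:.4f}")

# Hypothetical alternative coding: Coke = 2, Pepsi = 1, Mountain Dew = 0
recoded = 2 - preference
cov_recoded = np.cov(age, recoded)[0, 1]
print(f"covariance with reversed coding: {cov_recoded:.4f}")
```

Reversing the codes flips the sign of the covariance, which is why a covariance computed on a label-encoded nominal variable says little about the underlying relationship.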






In [3]:
# Q6. A company is interested in examining the relationship between the number of sales calls made per day
# and the number of sales made per week. The company collected data on both variables from a sample of
# 30 sales representatives. Calculate the Pearson correlation coefficient between these two variables.


# To calculate the Pearson correlation coefficient between the number of sales calls made per day and the number of sales made per week, we need to use the formula:

# r = (n * Σxy - Σx * Σy) / sqrt((n * Σx^2 - (Σx)^2) * (n * Σy^2 - (Σy)^2))

# where:

# n is the sample size
# Σxy is the sum of the products of the corresponding values of the two variables
# Σx and Σy are the sums of the values of the two variables, respectively
# Σx^2 and Σy^2 are the sums of the squares of the values of the two variables, respectively
# Assuming the data is available in a pandas DataFrame named "sales_df" with columns "sales_calls" and "sales_made", we can calculate the Pearson correlation
# coefficient as follows:
    
import pandas as pd
import numpy as np

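# create illustrative random data for sales_calls and sales_made columns
# (no real data is available; the two columns are generated independently,
# and the values change on every run because no random seed is set)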
sales_calls = np.random.randint(low=20, high=100, size=30)
sales_made = np.random.randint(low=5, high=30, size=30)

# create dataframe with sales_calls and sales_made columns
sales_df = pd.DataFrame({'sales_calls': sales_calls, 'sales_made': sales_made})

n = len(sales_df)
x = sales_df['sales_calls']
y = sales_df['sales_made']

xy = x * y
sum_xy = np.sum(xy)
sum_x = np.sum(x)
sum_y = np.sum(y)
sum_x_squared = np.sum(x**2)
sum_y_squared = np.sum(y**2)

numerator = n * sum_xy - sum_x * sum_y
denominator = np.sqrt((n * sum_x_squared - sum_x**2) * (n * sum_y_squared - sum_y**2))

r = numerator / denominator
print("Pearson correlation coefficient:", r)

# The resulting Pearson correlation coefficient will range from -1 to 1, with a value of 0 indicating no linear correlation between the 
# two variables, a value close to 1 indicating a strong positive correlation, and a value close to -1 indicating a strong negative 
# correlation. The interpretation of the coefficient depends on the context of the problem and the strength and direction of the relationship between the variables.
# Here the coefficient is close to 0 (about 0.03) because the two columns were generated independently at random; real sales data would be needed to draw any conclusion.

Pearson correlation coefficient: 0.029315083550017353
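
The manual formula can be cross-checked against NumPy's built-in function. This is a minimal sketch under stated assumptions: the data are again random placeholders rather than real sales figures, the seed value 0 is arbitrary and used only so the run is reproducible, and `pearson_from_sums` is a hypothetical helper name.

```python
import numpy as np

# Placeholder data, generated with an arbitrary seed for reproducibility
rng = np.random.default_rng(0)
sales_calls = rng.integers(20, 100, size=30)
sales_made = rng.integers(5, 30, size=30)


def pearson_from_sums(x, y):
    """Pearson r via the computational formula based on sums."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    numerator = n * np.sum(x * y) - np.sum(x) * np.sum(y)
    denominator = np.sqrt(
        (n * np.sum(x**2) - np.sum(x) ** 2) * (n * np.sum(y**2) - np.sum(y) ** 2)
    )
    return numerator / denominator


r_manual = pearson_from_sums(sales_calls, sales_made)
r_numpy = np.corrcoef(sales_calls, sales_made)[0, 1]

print(f"r from the sum formula: {r_manual:.6f}")
print(f"r from np.corrcoef:     {r_numpy:.6f}")
```

The two values should match up to floating-point error. Converting the inputs to float before squaring also avoids integer overflow if the counts were much larger.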
