Q1. Pearson correlation coefficient is a measure of the linear relationship between two variables. Suppose
you have collected data on the amount of time students spend studying for an exam and their final exam
scores. Calculate the Pearson correlation coefficient between these two variables and interpret the result.

To calculate the Pearson correlation coefficient between the amount of time students spend studying for an exam and their final exam scores, we can use the formula for the Pearson correlation coefficient:

\[
r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2}}
\]

Where:

- \( X_i \) and \( Y_i \) are the individual data points for the two variables (study time and exam scores).
- \( \bar{X} \) and \( \bar{Y} \) are the means of the two variables.

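Let's assume we have collected the following data for 5 students:

- Study time (hours): [2, 3, 4, 5, 6]
- Exam scores: [55, 65, 75, 85, 95]

In this data each additional hour of study corresponds to exactly 10 more points (score = 10 × hours + 35), so the relationship is perfectly linear and increasing, and we expect \( r = 1 \) up to floating-point rounding. We'll use Python to confirm this: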

In [None]:
import numpy as np

# Data
study_time = np.array([2, 3, 4, 5, 6])
exam_scores = np.array([55, 65, 75, 85, 95])

# Calculate Pearson correlation coefficient
correlation_coefficient = np.corrcoef(study_time, exam_scores)[0, 1]

print("Pearson correlation coefficient:", correlation_coefficient)
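
As a cross-check, the Pearson formula given earlier can be evaluated term by term without NumPy. The sketch below is illustrative only; the helper name `pearson_r` is an assumption introduced here, not part of the original notebook.

```python
import math

# Data from the example above
study_time = [2, 3, 4, 5, 6]
exam_scores = [55, 65, 75, 85, 95]


def pearson_r(x, y):
    """Evaluate the Pearson formula term by term (illustrative helper)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Numerator: sum of the products of deviations from the means
    cross = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # Denominator: square root of the product of the sums of squared deviations
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    return cross / math.sqrt(ss_x * ss_y)


r_manual = pearson_r(study_time, exam_scores)
print("Pearson r from the formula:", r_manual)
```

For this data the numerator is 100 and the denominator is the square root of 10 × 1000, which is also 100, so the coefficient is 1: study time and exam score are perfectly positively correlated in this sample.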


Q2. Spearman's rank correlation is a measure of the monotonic relationship between two variables.
Suppose you have collected data on the amount of sleep individuals get each night and their overall job
satisfaction level on a scale of 1 to 10. Calculate the Spearman's rank correlation between these two
variables and interpret the result.

To calculate the Spearman's rank correlation between the amount of sleep individuals get each night and their overall job satisfaction level, we first need to rank the data for each variable and then apply the Spearman correlation formula. The formula for Spearman's rank correlation coefficient is:

\[
\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}
\]

where:
- \( d_i \) is the difference between the ranks of corresponding variables.
- \( n \) is the number of data points.

Assume we have collected the following data for 6 individuals:

- Sleep (hours): [7, 6, 8, 5, 6, 7]
- Job Satisfaction (1 to 10): [8, 7, 9, 6, 7, 8]

Let's calculate the Spearman's rank correlation step-by-step:

### Step 1: Rank the Data
1. Rank the sleep data (tied values share the average of the positions they occupy):
   - [7, 6, 8, 5, 6, 7]
   - Ranks: [4.5, 2.5, 6, 1, 2.5, 4.5]
   
2. Rank the job satisfaction data:
   - [8, 7, 9, 6, 7, 8]
   - Ranks: [4.5, 2.5, 6, 1, 2.5, 4.5]

### Step 2: Calculate the Differences between the Ranks (\( d_i \))
   - Rank differences: [4.5 - 4.5, 2.5 - 2.5, 6 - 6, 1 - 1, 2.5 - 2.5, 4.5 - 4.5]
   - \( d_i \): [0, 0, 0, 0, 0, 0]
   - \( d_i^2 \): [0, 0, 0, 0, 0, 0]

### Step 3: Apply the Spearman's Rank Correlation Formula
   - \( \sum d_i^2 = 0 \)
   - \( n = 6 \)
   - Spearman's rank correlation coefficient:

\[
\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} = 1 - \frac{6 \cdot 0}{6(6^2 - 1)} = 1 - 0 = 1
\]

### Step 4: Interpret the Result
A Spearman's rank correlation coefficient (\( \rho \)) of 1 indicates a perfect monotonic relationship between the amount of sleep and job satisfaction level. This means that there is a perfect rank-order relationship between sleep and job satisfaction in this dataset, implying that as the amount of sleep increases, job satisfaction also increases in a perfectly consistent manner.

### Summary
The Spearman's rank correlation coefficient between the amount of sleep and job satisfaction is 1, suggesting a perfect positive monotonic relationship. This means that higher amounts of sleep are consistently associated with higher job satisfaction levels in this dataset.
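
The hand calculation can be checked with SciPy. The sketch below uses `scipy.stats.rankdata`, whose default `method="average"` gives tied values the mean of the positions they occupy, and compares the simplified formula with `scipy.stats.spearmanr`. The variable names are chosen here for illustration.

```python
from scipy.stats import rankdata, spearmanr

# Data from the example above
sleep_hours = [7, 6, 8, 5, 6, 7]
job_satisfaction = [8, 7, 9, 6, 7, 8]

# Average ranks: tied values share the mean of their positions
sleep_ranks = rankdata(sleep_hours)
satisfaction_ranks = rankdata(job_satisfaction)
print("Sleep ranks:", sleep_ranks)
print("Satisfaction ranks:", satisfaction_ranks)

# Simplified formula: rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))
n = len(sleep_hours)
d_squared = (sleep_ranks - satisfaction_ranks) ** 2
rho_formula = 1 - 6 * d_squared.sum() / (n * (n**2 - 1))

# SciPy computes the Pearson correlation of the ranks
rho_scipy, p_value = spearmanr(sleep_hours, job_satisfaction)
print("rho (simplified formula):", rho_formula)
print("rho (scipy.stats.spearmanr):", rho_scipy)
```

Both routes agree here because every rank difference is zero. In general, the simplified formula is exact only when there are no ties, so with tied data the SciPy value is the safer one to report.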

Q3. Suppose you are conducting a study to examine the relationship between the number of hours of
exercise per week and body mass index (BMI) in a sample of adults. You collected data on both variables
for 50 participants. Calculate the Pearson correlation coefficient and the Spearman's rank correlation
between these two variables and compare the results.

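To calculate the Pearson correlation coefficient and the Spearman's rank correlation between the number of hours of exercise per week and body mass index (BMI) for 50 participants, we need the data for these variables. Since the actual data are not available, I'll generate a synthetic dataset with random values for exercise hours and BMI, then calculate both correlation coefficients. Because the two variables are generated independently of each other, both coefficients should come out close to zero and close to each other. With real data, a clear gap between the Pearson and Spearman values would suggest a monotonic but non-linear relationship or the influence of outliers.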

Here's the code to generate the data and perform the calculations:

In [2]:
import numpy as np
import pandas as pd
from scipy.stats import pearsonr, spearmanr

# Generate synthetic data
np.random.seed(0)  # For reproducibility
exercise_hours = np.random.randint(1, 10, 50)  # Random integers between 1 and 9
bmi = np.random.uniform(18.5, 35, 50)  # Random BMI values between 18.5 and 35

# Create a DataFrame
data = pd.DataFrame({
    'Exercise Hours': exercise_hours,
    'BMI': bmi
})

# Calculate Pearson correlation coefficient
pearson_corr, _ = pearsonr(data['Exercise Hours'], data['BMI'])

# Calculate Spearman's rank correlation coefficient
spearman_corr, _ = spearmanr(data['Exercise Hours'], data['BMI'])

print(f"Pearson correlation coefficient: {pearson_corr}")
print(f"Spearman's rank correlation coefficient: {spearman_corr}")


Pearson correlation coefficient: 0.0365075109617814
Spearman's rank correlation coefficient: 0.024562538764528918


In [3]:
# Preview the first rows and print the previously computed correlations
print(data.head())
print(f"Pearson correlation coefficient: {pearson_corr}")
print(f"Spearman's rank correlation coefficient: {spearman_corr}")


   Exercise Hours        BMI
0               6  22.168303
1               1  24.877068
2               4  33.392875
3               4  25.924175
4               8  28.615547
Pearson correlation coefficient: 0.0365075109617814
Spearman's rank correlation coefficient: 0.024562538764528918
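
To make the comparison between the two coefficients more concrete, the following sketch uses a small hypothetical dataset (not the exercise and BMI data) in which one variable grows exponentially with the other. The relationship is perfectly monotonic but not linear, so the two measures are expected to disagree.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical illustration data: y increases with x, but not linearly
x = np.arange(1, 11)
y = np.exp(x)

pearson_demo, _ = pearsonr(x, y)
spearman_demo, _ = spearmanr(x, y)

print(f"Pearson (linear association): {pearson_demo:.2f}")
print(f"Spearman (monotonic association): {spearman_demo:.2f}")
```

Spearman's coefficient is 1 because the rank order of `x` and `y` is identical, while Pearson's coefficient is noticeably lower because the points do not fall on a straight line.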


Q4. A researcher is interested in examining the relationship between the number of hours individuals
spend watching television per day and their level of physical activity. The researcher collected data on
both variables from a sample of 50 participants. Calculate the Pearson correlation coefficient between
these two variables.


To calculate the Pearson correlation coefficient between the number of hours individuals spend watching television per day and their level of physical activity, we need both measurements for each of the 50 participants.

Since the actual data are not available, I'll generate a synthetic dataset with random values for TV watching hours and physical activity level (here an assumed score from 0 to 9), and then calculate the Pearson correlation coefficient using Python. Because the two variables are generated independently, the coefficient is expected to be close to zero.

Here's the code to generate the data and perform the calculations:

In [4]:
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# Generate synthetic data
np.random.seed(0)  # For reproducibility
tv_hours = np.random.randint(1, 10, 50)  # Random integers between 1 and 9 for TV hours
physical_activity = np.random.randint(0, 10, 50)  # Random integers between 0 and 9 for physical activity level

# Create a DataFrame
data = pd.DataFrame({
    'TV Hours': tv_hours,
    'Physical Activity': physical_activity
})

# Calculate Pearson correlation coefficient
pearson_corr, _ = pearsonr(data['TV Hours'], data['Physical Activity'])

print("Sample data:")
print(data.head())
print(f"Pearson correlation coefficient: {pearson_corr}")


Sample data:
   TV Hours  Physical Activity
0         6                  8
1         1                  4
2         4                  1
3         4                  4
4         8                  9
Pearson correlation coefficient: 0.0013941958035968045
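
Independently generated columns will almost always give a coefficient near zero. If demonstration data should instead reflect a hypothesized inverse relationship between TV time and physical activity, one option is to build the dependence into the simulation. The sketch below is an assumption made purely for illustration: the linear form `10 - tv_hours` and the noise level are invented, not estimated from any study.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)  # Seeded generator for reproducibility

# TV hours: random integers from 1 to 9 for 50 participants
tv_hours = rng.integers(1, 10, 50)

# Assumed relationship (illustration only): activity falls as TV time rises,
# plus normally distributed noise
noise = rng.normal(0, 1.5, 50)
physical_activity = 10 - tv_hours + noise

r_negative, p_value = pearsonr(tv_hours, physical_activity)
print(f"Pearson correlation coefficient: {r_negative:.2f}")
print(f"p-value: {p_value:.4f}")
```

With this construction the coefficient comes out strongly negative, which is what a researcher would look for if more television time were associated with less physical activity.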


Q5. A survey was conducted to examine the relationship between age and preference for a particular
brand of soft drink. The survey results are shown below:

| Age (Years) | Soft drink Preference |
|-------------|-----------------------|
| 25          | Coke                  |
| 42          | Pepsi                 |
| 37          | Mountain dew          |
| 19          | Coke                  |
| 31          | Pepsi                 |
| 28          | Coke                  |

To examine the relationship between age and preference for a particular brand of soft drink, we can encode the brands as numbers and compute the Spearman's rank correlation coefficient, which measures the monotonic relationship between two ranked variables. Note that brand is a nominal variable with no natural order, so the numeric coding is arbitrary and the resulting coefficient depends on it. The analysis proceeds as follows:

1. Convert the soft drink preferences to ranks by assigning a number to each unique brand.
2. Rank the ages in ascending order.
3. Calculate the Spearman's rank correlation coefficient.

Let's proceed with the calculations:

Step 1: Convert Soft Drink Preferences to Ranks
We need to convert the categorical data (soft drink preference) to numeric ranks. Let's assign ranks arbitrarily:

- Coke: 1
- Pepsi: 2
- Mountain Dew: 3

Step 2: Rank the Ages
Rank the ages in ascending order.

Step 3: Create the DataFrame and Calculate Ranks

In [6]:
import pandas as pd
from scipy.stats import spearmanr

# Given data
data = {
    "Age": [25, 42, 37, 19, 31, 28],
    "Soft Drink Preference": ["Coke", "Pepsi", "Mountain dew", "Coke", "Pepsi", "Coke"]
}

# Create DataFrame
df = pd.DataFrame(data)

# Assign numerical values to soft drink preferences
soft_drink_map = {"Coke": 1, "Pepsi": 2, "Mountain dew": 3}
df["Soft Drink Preference Rank"] = df["Soft Drink Preference"].map(soft_drink_map)

# Rank the ages
df["Age Rank"] = df["Age"].rank()

# Calculate Spearman's rank correlation coefficient
spearman_corr, _ = spearmanr(df["Age Rank"], df["Soft Drink Preference Rank"])

print(df)
print(f"Spearman's rank correlation coefficient: {spearman_corr:.2f}")


   Age Soft Drink Preference  Soft Drink Preference Rank  Age Rank
0   25                  Coke                           1       2.0
1   42                 Pepsi                           2       6.0
2   37          Mountain dew                           3       5.0
3   19                  Coke                           1       1.0
4   31                 Pepsi                           2       4.0
5   28                  Coke                           1       3.0
Spearman's rank correlation coefficient: 0.83


Step 4: Calculate and Interpret Spearman's Rank Correlation Coefficient
The Spearman's rank correlation coefficient measures the monotonic relationship between the age ranks and the soft drink preference ranks.

Interpretation:
The calculated value is 0.83, which indicates a strong positive monotonic relationship between age and the coded soft drink preference. Under the coding Coke = 1, Pepsi = 2, Mountain Dew = 3, older respondents tend to have higher codes: the three youngest respondents (19, 25, 28) all chose Coke, while the three oldest (31, 37, 42) chose Pepsi or Mountain Dew. Because the brand codes were assigned arbitrarily, a different coding would yield a different coefficient, and with only six respondents the result should be read with caution.

Summary:
Using the provided survey data, we converted the categorical soft drink preferences into numerical codes and ranked the ages. The Spearman's rank correlation coefficient of 0.83 suggests that, in this small sample, brand preference differs by age, with younger respondents preferring Coke. The value depends on the arbitrary brand coding, so it does not reflect any inherent ordering among the brands.
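
Because the numeric codes assigned to the brands are arbitrary, it is worth checking how much the coefficient depends on them. The sketch below recomputes Spearman's coefficient under the coding used above and under one hypothetical alternative in which Pepsi and Coke swap codes. The helper `spearman_for` and the second coding are assumptions introduced for illustration.

```python
from scipy.stats import spearmanr

# Survey data from the question
ages = [25, 42, 37, 19, 31, 28]
brands = ["Coke", "Pepsi", "Mountain dew", "Coke", "Pepsi", "Coke"]

# Coding used in the analysis above
coding_a = {"Coke": 1, "Pepsi": 2, "Mountain dew": 3}
# Hypothetical alternative coding: Pepsi and Coke swapped
coding_b = {"Pepsi": 1, "Coke": 2, "Mountain dew": 3}


def spearman_for(coding):
    """Spearman's rho between age and brand under a given numeric coding."""
    codes = [coding[brand] for brand in brands]
    rho, _ = spearmanr(ages, codes)
    return rho


rho_a = spearman_for(coding_a)
rho_b = spearman_for(coding_b)
print(f"Coding A (Coke=1, Pepsi=2, Mountain dew=3): {rho_a:.2f}")
print(f"Coding B (Pepsi=1, Coke=2, Mountain dew=3): {rho_b:.2f}")
```

The first coding gives about 0.83 and the second about -0.28, so the sign and size of the coefficient are driven by the choice of codes. For a nominal variable such as brand, comparing the age distribution across the brand groups is usually a more defensible way to describe the relationship.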

Q6. A company is interested in examining the relationship between the number of sales calls made per day
and the number of sales made per week. The company collected data on both variables from a sample of
30 sales representatives. Calculate the Pearson correlation coefficient between these two variables.

To calculate the Pearson correlation coefficient between the number of sales calls made per day and the number of sales made per week, we need the data for these two variables for a sample of 30 sales representatives. Let's assume we have the following synthetic data for demonstration purposes.

Here's the step-by-step process to generate the data and calculate the Pearson correlation coefficient:

Step 1: Generate Synthetic Data

In [7]:
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# Generate synthetic data
np.random.seed(0)  # For reproducibility
sales_calls_per_day = np.random.randint(5, 15, 30)  # Random integers between 5 and 14 for sales calls per day
sales_per_week = np.random.randint(10, 50, 30)  # Random integers between 10 and 49 for sales per week

# Create a DataFrame
data = pd.DataFrame({
    'Sales Calls Per Day': sales_calls_per_day,
    'Sales Per Week': sales_per_week
})

# Calculate Pearson correlation coefficient
pearson_corr, _ = pearsonr(data['Sales Calls Per Day'], data['Sales Per Week'])

print("Sample data:")
print(data.head())
print(f"Pearson correlation coefficient: {pearson_corr:.2f}")


Sample data:
   Sales Calls Per Day  Sales Per Week
0                   10              28
1                    5              45
2                    8              34
3                    8              39
4                   12              29
Pearson correlation coefficient: 0.24


Using synthetic data for 30 sales representatives, we calculated the Pearson correlation coefficient to quantify the linear relationship between the number of sales calls made per day and the number of sales made per week. The resulting value of 0.24 indicates a weak positive linear relationship in this sample. Because the two columns were generated independently, this value reflects sampling variation rather than a real association. In general, a value close to 0 suggests a weak relationship, whereas values closer to -1 or 1 indicate stronger negative or positive relationships, respectively.

If you have actual data, you would replace the synthetic data generation step with the real dataset and follow the same calculation process.
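
A minimal sketch of that workflow is shown below. `pearsonr` returns the p-value alongside the coefficient, which the cell above discards as `_`; reporting both helps judge whether a weak coefficient could be due to chance. The helper name `pearson_report` is hypothetical, and the synthetic DataFrame stands in for a real dataset.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr


def pearson_report(df, x_col, y_col):
    """Hypothetical helper: return (r, p_value) for two DataFrame columns."""
    r, p_value = pearsonr(df[x_col], df[y_col])
    return r, p_value


# Same synthetic data as above; replace this DataFrame with the real dataset
np.random.seed(0)
data = pd.DataFrame({
    'Sales Calls Per Day': np.random.randint(5, 15, 30),
    'Sales Per Week': np.random.randint(10, 50, 30),
})

r, p_value = pearson_report(data, 'Sales Calls Per Day', 'Sales Per Week')
print(f"Pearson correlation coefficient: {r:.2f}")
print(f"p-value: {p_value:.3f}")
```

With real data, only the DataFrame construction changes; the call to the helper and the interpretation of the coefficient stay the same.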