Q1. Pearson correlation coefficient is a measure of the linear relationship between two variables. Suppose you have collected data on the amount of time students spend studying for an exam and their final exam scores. Calculate the Pearson correlation coefficient between these two variables and interpret the result.


Answer(Q1):

To calculate the Pearson correlation coefficient between the amount of time students spend studying for an exam and their final exam scores, we can use Python's NumPy library. The Pearson correlation coefficient, also denoted by "r," quantifies the strength and direction of the linear relationship between two continuous variables. It ranges from -1 to 1, where:

- r = 1: Perfect positive correlation (both variables increase together).
- r = -1: Perfect negative correlation (one variable increases while the other decreases).
- r ≈ 0: Little to no linear correlation (no clear linear relationship).

Let's assume we have collected the following data:

In [2]:
import numpy as np

# Sample data for studying time and final exam scores
studying_time = [5, 8, 3, 7, 4]
exam_scores = [80, 85, 60, 90, 75]

# Calculate the Pearson correlation coefficient
correlation_coefficient = np.corrcoef(studying_time, exam_scores)[0, 1]

print("Pearson Correlation Coefficient:")
print(correlation_coefficient)

Pearson Correlation Coefficient:
0.8797861641347269


Interpretation:

The Pearson correlation coefficient between the amount of time students spend studying for an exam and their final exam scores is approximately 0.880. This value is close to 1, indicating a strong positive linear correlation between the two variables.

Interpretation of the result:
- A positive correlation coefficient (r > 0) indicates that as the studying time increases, the final exam scores tend to increase as well. In other words, students who spend more time studying for the exam tend to achieve higher scores on the final exam.

- The magnitude of the correlation coefficient (0.880) indicates a relatively strong linear relationship between studying time and exam scores. The closer the absolute value of r is to 1, the stronger the linear relationship.

- However, it is important to note that correlation does not imply causation. While the correlation coefficient suggests a strong linear relationship, it does not imply that studying time causes higher exam scores. Other factors, such as students' inherent abilities, prior knowledge, and exam preparation strategies, may also play a role in determining exam scores.

In summary, the Pearson correlation coefficient of approximately 0.880 indicates a strong positive linear relationship between studying time and final exam scores. However, further analysis and consideration of other factors are necessary to draw any causal conclusions.
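As a cross-check, r can also be computed directly from its definition, r = Σ(x − x̄)(y − ȳ) / sqrt(Σ(x − x̄)² · Σ(y − ȳ)²). The sketch below is only an illustration on the same sample data; the helper name `pearson_r` is a hypothetical choice and is not part of NumPy.

```python
import numpy as np

# Same sample data as in the answer above
studying_time = np.array([5, 8, 3, 7, 4], dtype=float)
exam_scores = np.array([80, 85, 60, 90, 75], dtype=float)


def pearson_r(x, y):
    """Pearson's r from its definition (hypothetical helper, for illustration)."""
    dx = x - x.mean()  # deviations of x from its mean
    dy = y - y.mean()  # deviations of y from its mean
    # Sum of cross-products divided by the product of the two spread terms
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())


r_manual = pearson_r(studying_time, exam_scores)
r_numpy = np.corrcoef(studying_time, exam_scores)[0, 1]

print("Manual r:", round(r_manual, 3))
print("NumPy r: ", round(r_numpy, 3))
```

Both values should agree up to floating-point rounding, which confirms that `np.corrcoef` implements the same definition.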

Q2. Spearman's rank correlation is a measure of the monotonic relationship between two variables. Suppose you have collected data on the amount of sleep individuals get each night and their overall job satisfaction level on a scale of 1 to 10. Calculate the Spearman's rank correlation between these two variables and interpret the result.


Answer(Q2):

To calculate the Spearman's rank correlation between the amount of sleep individuals get each night and their overall job satisfaction level on a scale of 1 to 10, you can use Python's SciPy library, which provides the `spearmanr` function for this purpose. The Spearman's rank correlation coefficient, denoted by "ρ" (rho), measures the strength and direction of the monotonic relationship between two variables. It ranges from -1 to 1, where:

- ρ = 1: Perfect positive monotonic correlation (both variables increase together).
- ρ = -1: Perfect negative monotonic correlation (one variable increases while the other decreases).
- ρ ≈ 0: Little to no monotonic correlation (no clear monotonic relationship).

Let's assume we have collected the following data:

In [4]:
from scipy.stats import spearmanr

# Sample data for sleep duration and job satisfaction
sleep_duration = [7, 6, 5, 8, 7, 6, 9, 8, 6, 5]
job_satisfaction = [9, 7, 6, 8, 9, 7, 8, 6, 5, 4]

# Calculate the Spearman's rank correlation coefficient
correlation_coefficient, p_value = spearmanr(sleep_duration, job_satisfaction)

print("Spearman's Rank Correlation Coefficient:")
print(correlation_coefficient)

Spearman's Rank Correlation Coefficient:
0.6050424303135729


Interpretation:

The Spearman's rank correlation coefficient between the amount of sleep individuals get each night and their overall job satisfaction level is approximately 0.605. This value indicates a moderate positive monotonic correlation between the two variables.

Interpretation of the result:

- A positive Spearman's rank correlation coefficient (ρ > 0) suggests that as the amount of sleep increases, the job satisfaction level tends to increase as well. In other words, individuals who get more sleep generally tend to report higher job satisfaction levels.

- The magnitude of the correlation coefficient (0.605) indicates a moderate monotonic relationship between sleep duration and job satisfaction. The closer the absolute value of ρ is to 1, the stronger the monotonic relationship.

- As with any correlation, it is important to remember that correlation does not imply causation. The observed monotonic relationship between sleep duration and job satisfaction does not necessarily mean that one causes the other. Other factors may be influencing both variables, and further investigation is needed to draw any causal conclusions.

In summary, the Spearman's rank correlation coefficient of approximately 0.605 indicates a moderate positive monotonic relationship between the amount of sleep individuals get each night and their overall job satisfaction level. Individuals who report getting more sleep tend to have higher job satisfaction levels, but additional research is required to establish any cause-and-effect relationship.
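Spearman's ρ is equivalent to the Pearson correlation computed on the ranks of the data, with tied values sharing their average rank. The following sketch illustrates this on the same sample data, assuming SciPy's `rankdata` with its default tie handling; the variable names are chosen here for illustration.

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

# Same sample data as in the answer above
sleep_duration = [7, 6, 5, 8, 7, 6, 9, 8, 6, 5]
job_satisfaction = [9, 7, 6, 8, 9, 7, 8, 6, 5, 4]

# rankdata assigns average ranks to tied values by default
sleep_ranks = rankdata(sleep_duration)
satisfaction_ranks = rankdata(job_satisfaction)

# Pearson correlation of the ranks
rho_from_ranks = np.corrcoef(sleep_ranks, satisfaction_ranks)[0, 1]

# Direct Spearman calculation for comparison
rho_scipy, _ = spearmanr(sleep_duration, job_satisfaction)

print("Pearson on ranks:", round(rho_from_ranks, 3))
print("spearmanr:       ", round(rho_scipy, 3))
```

Because the sample contains many ties, the simplified formula based on squared rank differences would not give exactly the same value; the rank-based Pearson calculation does.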

Q3. Suppose you are conducting a study to examine the relationship between the number of hours of exercise per week and body mass index (BMI) in a sample of adults. You collected data on both variables for 50 participants. Calculate the Pearson correlation coefficient and the Spearman's rank correlation between these two variables and compare the results.


Answer(Q3):

To calculate the Pearson correlation coefficient and the Spearman's rank correlation between the number of hours of exercise per week and body mass index (BMI) in a sample of 50 participants, we can use Python's SciPy library. The Pearson correlation coefficient measures the strength and direction of the linear relationship, while the Spearman's rank correlation assesses the strength and direction of the monotonic relationship between the two variables.

Let's assume we have collected the following data for the number of hours of exercise per week and BMI for the 50 participants:



In [5]:
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Sample data for exercise hours per week and BMI
exercise_hours = [3, 4, 2, 5, 6, 4, 3, 2, 1, 4, 5, 6, 7, 3, 2, 4, 5, 6, 7, 8, 3, 4, 5, 2, 3,
                  4, 5, 3, 6, 7, 8, 5, 4, 3, 2, 1, 5, 6, 7, 4, 3, 2, 4, 5, 6, 7, 5, 4, 3, 2]

bmi = [23, 25, 22, 28, 30, 24, 23, 21, 20, 26, 28, 29, 32, 24, 22, 25, 27, 28, 31, 34, 23, 25,
       27, 22, 23, 25, 28, 23, 29, 32, 35, 28, 25, 23, 21, 19, 27, 30, 33, 26, 24, 22, 26, 28,
       30, 32, 27, 26, 25, 23]

# Calculate the Pearson correlation coefficient
pearson_corr, _ = pearsonr(exercise_hours, bmi)

# Calculate the Spearman's rank correlation coefficient
spearman_corr, _ = spearmanr(exercise_hours, bmi)

print("Pearson Correlation Coefficient:", pearson_corr)
print("Spearman's Rank Correlation Coefficient:", spearman_corr)

Pearson Correlation Coefficient: 0.9828734594696926
Spearman's Rank Correlation Coefficient: 0.9831906016770916


Comparison:

The Pearson correlation coefficient between the number of hours of exercise per week and BMI is approximately 0.983, indicating a very strong positive linear correlation. In this sample data, participants who exercise more hours per week also have higher BMI values.

The Spearman's rank correlation coefficient between exercise hours per week and BMI is also approximately 0.983, indicating a very strong positive monotonic correlation. The Spearman's rank correlation takes into account the ranks of the data rather than the actual values, and it is more suitable for variables that may not have a linear relationship. Here the two coefficients nearly coincide, which suggests that the relationship in this sample is close to linear as well as monotonic.

It is important to note that the direction of this association runs against the common expectation that more exercise goes with lower BMI; it reflects the assumed sample data rather than a general finding. Correlation coefficients alone do not imply causation, and other factors may influence BMI, such as diet, genetics, and other lifestyle habits.

In conclusion, both the Pearson correlation coefficient and the Spearman's rank correlation coefficient indicate a strong positive relationship between the number of hours of exercise per week and BMI in this sample of adults. Further research and analysis would be needed to understand the underlying factors affecting BMI and the role of exercise in influencing BMI.
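The two coefficients can differ noticeably when a relationship is monotonic but not linear. The sketch below uses made-up illustrative data (an exponential curve, not the study data) to show the effect: the ranks of `x` and `y` agree perfectly, while the raw values do not lie on a straight line.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Made-up illustrative data: y increases with x, but exponentially rather than linearly
x = np.arange(1, 11)
y = np.exp(x)

pearson_corr, _ = pearsonr(x, y)
spearman_corr, _ = spearmanr(x, y)

print("Pearson: ", round(pearson_corr, 3))
print("Spearman:", round(spearman_corr, 3))
```

Spearman's coefficient equals 1 here because the ordering is perfectly preserved, whereas Pearson's coefficient is lower because it measures only the linear component of the relationship.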

Q4. A researcher is interested in examining the relationship between the number of hours individuals spend watching television per day and their level of physical activity. The researcher collected data on both variables from a sample of 50 participants. Calculate the Pearson correlation coefficient between these two variables.

Answer(Q4):

To calculate the Pearson correlation coefficient between the number of hours individuals spend watching television per day and their level of physical activity in a sample of 50 participants, we can use Python's NumPy library. The Pearson correlation coefficient measures the strength and direction of the linear relationship between two continuous variables.

Let's assume we have collected the following data for the number of hours individuals spend watching television per day and their level of physical activity for the 50 participants:


In [6]:
import numpy as np

# Sample data for hours of TV watching per day and physical activity level
tv_hours = [3, 2, 4, 5, 2, 1, 3, 4, 2, 1, 3, 4, 5, 6, 2, 1, 3, 4, 5, 6, 3, 2, 4, 5, 6,
            2, 1, 3, 4, 5, 6, 3, 2, 4, 5, 6, 2, 1, 3, 4, 5, 6, 3, 2, 4, 5, 6, 2, 1, 3]

physical_activity = [30, 40, 20, 15, 45, 50, 35, 25, 45, 50, 30, 20, 15, 10, 40, 50, 30, 25, 15, 10,
                     35, 40, 20, 15, 10, 45, 50, 35, 25, 20, 15, 35, 40, 20, 15, 10, 45, 50, 30, 25,
                     20, 15, 35, 40, 20, 15, 10, 45, 50, 30]

# Calculate the Pearson correlation coefficient
correlation_coefficient = np.corrcoef(tv_hours, physical_activity)[0, 1]

print("Pearson Correlation Coefficient:")
print(correlation_coefficient)

Pearson Correlation Coefficient:
-0.9768847589438218


Interpretation:

The Pearson correlation coefficient between the number of hours individuals spend watching television per day and their level of physical activity is approximately -0.977. This value indicates a very strong negative linear correlation between the two variables.

Interpretation of the result:

- A negative Pearson correlation coefficient (r < 0) suggests that as the number of hours spent watching television per day increases, the level of physical activity tends to decrease. In other words, individuals who spend more time watching TV are likely to engage in less physical activity, and vice versa.

- The magnitude of the correlation coefficient (|r| ≈ 0.977) indicates a very strong linear relationship between TV watching hours and physical activity. The closer the absolute value of r is to 1, the stronger the linear relationship.

- The negative correlation suggests that there is an inverse association between TV watching and physical activity. This finding is consistent with the general understanding that individuals who spend more time being sedentary (watching TV) are less likely to engage in physical activities.

As always, it's essential to interpret correlation results in the context of the data and the study's objectives. Correlation does not imply causation, and other factors may influence the relationship between TV watching and physical activity levels. Additional research and consideration of potential confounding variables are necessary to draw any causal conclusions.
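If a significance test is wanted alongside the coefficient, one option is SciPy's `pearsonr`, which returns the coefficient together with a two-sided p-value. This is a sketch on the same sample data, not part of the original answer, and it assumes the usual conditions for the test hold.

```python
from scipy.stats import pearsonr

# Same sample data as in the answer above
tv_hours = [3, 2, 4, 5, 2, 1, 3, 4, 2, 1, 3, 4, 5, 6, 2, 1, 3, 4, 5, 6, 3, 2, 4, 5, 6,
            2, 1, 3, 4, 5, 6, 3, 2, 4, 5, 6, 2, 1, 3, 4, 5, 6, 3, 2, 4, 5, 6, 2, 1, 3]

physical_activity = [30, 40, 20, 15, 45, 50, 35, 25, 45, 50, 30, 20, 15, 10, 40, 50, 30, 25, 15, 10,
                     35, 40, 20, 15, 10, 45, 50, 35, 25, 20, 15, 35, 40, 20, 15, 10, 45, 50, 30, 25,
                     20, 15, 35, 40, 20, 15, 10, 45, 50, 30]

# pearsonr returns the coefficient and the two-sided p-value
r, p_value = pearsonr(tv_hours, physical_activity)

print(f"r = {r:.3f}")
print(f"p-value = {p_value:.3g}")
```

A small p-value indicates that a correlation this strong would be unlikely if the two variables were uncorrelated in the population; it still says nothing about causation.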

Q5. A survey was conducted to examine the relationship between age and preference for a particular brand of soft drink. The survey results are shown below:

![Screenshot 2023-08-02 at 12.39.18 PM.png](attachment:9553bbbe-a848-4d14-8bb8-ad3957ffb6aa.png)


Answer(Q5):

To examine the relationship between age and preference for a particular brand of soft drink with a correlation coefficient, we first need to encode the "Soft drink Preference" variable (a categorical variable) as numbers. Since there are three categories (Coke, Pepsi, and Mountain Dew), we can assign an integer code to each category. Brand is a nominal variable with no natural order, so the order of these codes is arbitrary and the resulting coefficient depends on it.

To analyze the relationship between age and soft drink preference based on the provided survey results, we can follow a step-by-step approach:

Step 1: Data Preparation
Organize the data into a table with two columns: "Age" and "Soft Drink Preference." The data is already presented in the required format.

| Age (years) | Soft Drink Preference |
|------------|-----------------------|
| 25         | Coke                  |
| 42         | Pepsi                 |
| 37         | Mountain Dew          |
| 19         | Coke                  |
| 31         | Pepsi                 |
| 28         | Coke                  |

Step 2: Data Visualization
Create a bar chart or pie chart to visualize the distribution of soft drink preferences among different age groups. However, since the dataset is small, it might be more appropriate to present the raw data directly rather than creating a chart.

Step 3: Descriptive Statistics
Calculate basic descriptive statistics for the age variable, such as the mean, median, and standard deviation.

- Mean age: (25 + 42 + 37 + 19 + 31 + 28) / 6 = 182 / 6 ≈ 30.33
- Median age: The ordered ages are 19, 25, 28, 31, 37, 42, so the median is 29.5 (the average of the two middle values, 28 and 31).
- Standard deviation: The sample standard deviation, which measures the spread of ages around the mean, is approximately 8.29.

Step 4: Soft Drink Preference Analysis
Count the number of individuals who prefer each soft drink and determine the percentage of people who prefer each brand.

- Number of people who prefer Coke: 3 (25, 19, and 28 years old)
- Number of people who prefer Pepsi: 2 (42 and 31 years old)
- Number of people who prefer Mountain Dew: 1 (37 years old)

Percentages:
- Percentage preferring Coke: (3/6) * 100 = 50%
- Percentage preferring Pepsi: (2/6) * 100 = 33.33% (rounded to two decimal places)
- Percentage preferring Mountain Dew: (1/6) * 100 = 16.67% (rounded to two decimal places)

Step 5: Interpretation
Based on the survey results, we observe that among the six individuals surveyed:

- 50% preferred Coke.
- 33.33% preferred Pepsi.
- 16.67% preferred Mountain Dew.

Regarding the relationship between age and soft drink preference, it is difficult to draw strong conclusions with such a small sample size. However, we can observe that Coke appears to be the most preferred brand among this small group of individuals.

To make more meaningful conclusions and generalize to a broader population, a larger and more diverse dataset is needed. With a larger sample size, statistical tests can be conducted to explore whether there is a significant relationship between age and soft drink preference.
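A simple descriptive way to relate the two variables is to summarize age within each brand. The sketch below does this with a pandas `groupby` on the six survey records; with groups this small it is an illustration only, not a statistical test.

```python
import pandas as pd

# The six survey records from the table above
df = pd.DataFrame({
    'Age (years)': [25, 42, 37, 19, 31, 28],
    'Soft Drink Preference': ['Coke', 'Pepsi', 'Mountain Dew', 'Coke', 'Pepsi', 'Coke'],
})

# Number of respondents and mean age for each brand
age_by_brand = df.groupby('Soft Drink Preference')['Age (years)'].agg(['count', 'mean'])

print(age_by_brand)
```

In this sample the respondents who prefer Coke are younger on average (24) than those who prefer Pepsi (36.5) or Mountain Dew (37), but each group contains at most three people.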

In [4]:
import pandas as pd

# Create the DataFrame with the survey data
data = {
    'Age (years)': [25, 42, 37, 19, 31, 28],
    'Soft Drink Preference': ['Coke', 'Pepsi', 'Mountain Dew', 'Coke', 'Pepsi', 'Coke']
}

df = pd.DataFrame(data)

# Step 2: Data Visualization (optional)
# You can create bar charts or pie charts to visualize the data if needed.

# Step 3: Descriptive Statistics
mean_age = df['Age (years)'].mean()
median_age = df['Age (years)'].median()
std_dev_age = df['Age (years)'].std()

print(f"Mean Age: {mean_age}")
print(f"Median Age: {median_age}")
print(f"Standard Deviation of Age: {std_dev_age}")

# Step 4: Soft Drink Preference Analysis
preference_counts = df['Soft Drink Preference'].value_counts()
preference_percentages = (preference_counts / len(df)) * 100

print("Soft Drink Preference Counts:")
print(preference_counts)

print("\nSoft Drink Preference Percentages:")
print(preference_percentages)

# Step 5: Interpretation
# Observe the results and make any relevant interpretations.

# Rebuild the survey data for the correlation step
data = {
    'Age (years)': [25, 42, 37, 19, 31, 28],
    'Soft drink Preference': ['Coke', 'Pepsi', 'Mountain Dew', 'Coke', 'Pepsi', 'Coke']
}

print("*"*50)

# Create a DataFrame from the data
df = pd.DataFrame(data)

# Encode 'Soft drink Preference' as integer codes (brand is nominal, so this order is arbitrary)
brand_ranks = {'Coke': 1, 'Pepsi': 2, 'Mountain Dew': 3}
df['Brand_Rank'] = df['Soft drink Preference'].map(brand_ranks)

# Calculate the Pearson correlation coefficient
correlation_coefficient = df['Age (years)'].corr(df['Brand_Rank'])

print(f"Pearson Correlation Coefficient: {correlation_coefficient:.3f}")


Mean Age: 30.333333333333332
Median Age: 29.5
Standard Deviation of Age: 8.286535263104035
Soft Drink Preference Counts:
Coke            3
Pepsi           2
Mountain Dew    1
Name: Soft Drink Preference, dtype: int64

Soft Drink Preference Percentages:
Coke            50.000000
Pepsi           33.333333
Mountain Dew    16.666667
Name: Soft Drink Preference, dtype: float64
**************************************************
Pearson Correlation Coefficient: 0.759


The above code creates a pandas DataFrame with the survey data, calculates the mean, median, and standard deviation of the ages, and computes the counts and percentages for each soft drink preference. With the encoding Coke = 1, Pepsi = 2, Mountain Dew = 3, the Pearson correlation coefficient between age and the encoded preference is 0.759. Because brand is a nominal variable, this value depends on the arbitrary order of the codes and should be interpreted with caution.
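To see how much the coefficient depends on the numeric codes given to the brands, the sketch below recomputes it under a second, equally arbitrary encoding. The name `encoding_b` and its values are assumptions made here for illustration.

```python
import pandas as pd

ages = pd.Series([25, 42, 37, 19, 31, 28])
brands = pd.Series(['Coke', 'Pepsi', 'Mountain Dew', 'Coke', 'Pepsi', 'Coke'])

encoding_a = {'Coke': 1, 'Pepsi': 2, 'Mountain Dew': 3}  # encoding used in the answer
encoding_b = {'Coke': 3, 'Pepsi': 1, 'Mountain Dew': 2}  # arbitrary alternative

# Series.corr computes the Pearson correlation by default
r_a = ages.corr(brands.map(encoding_a))
r_b = ages.corr(brands.map(encoding_b))

print(f"Encoding A: {r_a:.3f}")
print(f"Encoding B: {r_b:.3f}")
```

The sign of the coefficient flips between the two encodings, which shows that a correlation computed on codes for a nominal variable describes the chosen labels rather than a property of the data.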

Q6. A company is interested in examining the relationship between the number of sales calls made per day and the number of sales made per week. The company collected data on both variables from a sample of 30 sales representatives. Calculate the Pearson correlation coefficient between these two variables.


Answer(Q6):


To calculate the Pearson correlation coefficient between the number of sales calls made per day and the number of sales made per week for the sample of 30 sales representatives, you can use the following Python code with the Pandas and NumPy libraries.

First, make sure you have the required libraries installed:

In [12]:
!pip install pandas numpy



In [14]:
import pandas as pd
import numpy as np

# Sample data for the number of sales calls per day and the number of sales per week
data = {
    'Sales Calls per Day': [15, 18, 12, 20, 10, 25, 16, 22, 17, 13,
                            14, 21, 19, 23, 11, 24, 28, 26, 27, 9,
                            8, 29, 30, 31, 33, 32, 35, 34, 36, 37],
    'Sales per Week': [25, 30, 20, 32, 18, 38, 27, 35, 29, 21,
                       22, 33, 31, 36, 19, 37, 42, 40, 41, 17,
                       16, 43, 44, 45, 47, 46, 49, 48, 50, 51]
}

# Create a DataFrame from the data
df = pd.DataFrame(data)

# Calculate the Pearson correlation coefficient
correlation_coefficient = df['Sales Calls per Day'].corr(df['Sales per Week'])

print(f"Pearson Correlation Coefficient: {correlation_coefficient:.3f}")


Pearson Correlation Coefficient: 0.996


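The code calculates the Pearson correlation coefficient, which measures the strength and direction of the linear relationship between the two variables. The coefficient ranges from -1 to +1, where +1 indicates a perfect positive linear relationship, -1 indicates a perfect negative linear relationship, and 0 indicates no linear relationship between the two variables. For this sample data, r ≈ 0.996 indicates a very strong positive linear relationship: sales representatives who make more calls per day tend to record more sales per week. Correlation alone does not establish that making more calls causes more sales.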