Q1. Pearson correlation coefficient is a measure of the linear relationship between two variables. Suppose
you have collected data on the amount of time students spend studying for an exam and their final exam
scores. Calculate the Pearson correlation coefficient between these two variables and interpret the result.

Ans: The Pearson correlation coefficient, often denoted as "r," is a statistical measure that quantifies the strength and direction of the linear relationship between two continuous variables. It ranges from -1 to 1, where:

r = 1 indicates a perfect positive linear relationship.

r = -1 indicates a perfect negative linear relationship.

r ≈ 0 indicates little to no linear relationship.

To calculate the Pearson correlation coefficient between the amount of time students spend studying for an exam and their final exam scores, you would typically use statistical software or a programming language such as Python or R. Here, I'll provide a general interpretation of the result:

If the Pearson correlation coefficient (r) is close to 1, it suggests a strong positive linear relationship. In this context, it would mean that as students spend more time studying, their final exam scores tend to be higher.

If r is close to -1, it indicates a strong negative linear relationship. This would imply that as students spend more time studying, their final exam scores tend to be lower.

If r is close to 0, it suggests little to no linear relationship. In this case, study time and final exam scores are not linearly associated. However, non-linear relationships may still exist, and other factors might be influencing the exam scores.
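
To illustrate how r can be near 0 even when two variables are closely related, here is a minimal sketch with hypothetical data in which scores rise and then fall symmetrically around five hours of study. The data and the quadratic shape are assumptions made purely for illustration.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical data: scores peak at 5 hours of study and fall off symmetrically
hours = np.arange(0, 11)              # 0, 1, ..., 10
scores = 100 - (hours - 5) ** 2       # exact, but non-linear, dependence on hours

r, _ = pearsonr(hours, scores)
print("Pearson r for the symmetric curve:", round(r, 6))
```

Scores here are completely determined by study hours, yet the linear correlation is essentially zero because the rising and falling halves cancel out.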

Keep in mind that correlation does not imply causation. Even if a strong correlation is found between study time and exam scores, it does not necessarily mean that studying more causes higher scores. Other variables or factors may be at play, and further analysis or experiments would be needed to establish causation.

In [1]:
from scipy.stats import pearsonr

# Example data (replace with your actual data)
study_time = [10, 20, 30, 40, 50]
exam_scores = [60, 70, 80, 90, 100]

# Calculate Pearson correlation coefficient
corr_coefficient, p_value = pearsonr(study_time, exam_scores)

print("Pearson Correlation Coefficient:", corr_coefficient)

Pearson Correlation Coefficient: 1.0


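The output is the Pearson correlation coefficient, which you can interpret as described above. Here r = 1.0 because the example scores are an exact linear function of study time (score = study time + 50). Additionally, the p-value returned by pearsonr can be used to assess the statistical significance of the correlation.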
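
As a cross-check, the coefficient can also be computed directly from its definition: the sum of products of deviations from the mean, divided by the square root of the product of the summed squared deviations. The sketch below reuses the example study-time data; the variable names dx, dy, and r_manual are introduced here for illustration only.

```python
import numpy as np
from scipy.stats import pearsonr

study_time = np.array([10, 20, 30, 40, 50], dtype=float)
exam_scores = np.array([60, 70, 80, 90, 100], dtype=float)

# Deviations of each observation from its variable's mean
dx = study_time - study_time.mean()
dy = exam_scores - exam_scores.mean()

# r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
r_manual = np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

# Compare against SciPy's implementation
r_scipy, _ = pearsonr(study_time, exam_scores)

print("Manual r:", r_manual)
print("SciPy r: ", r_scipy)
```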

Q2. Spearman's rank correlation is a measure of the monotonic relationship between two variables.
Suppose you have collected data on the amount of sleep individuals get each night and their overall job
satisfaction level on a scale of 1 to 10. Calculate the Spearman's rank correlation between these two
variables and interpret the result.

Ans: Spearman's rank correlation coefficient, often denoted as "ρ" (rho), is a non-parametric measure of the strength and direction of the monotonic relationship between two variables. It is used when the relationship between variables is not necessarily linear but rather shows a consistent trend (increasing or decreasing) without adhering to the strict assumptions of linearity.

Spearman's rank correlation is calculated by first ranking the data for each variable and then calculating the Pearson correlation coefficient on the ranks. The interpretation of Spearman's rank correlation is as follows:

If ρ = 1, it indicates a perfect positive monotonic relationship. This means that as one variable increases, the other always increases as well, even if not linearly.

If ρ = -1, it indicates a perfect negative monotonic relationship. This means that as one variable increases, the other always decreases, following a consistent trend.

If ρ ≈ 0, it suggests little to no monotonic relationship between the two variables.
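
The rank-then-correlate definition above can be checked directly. The following sketch, using hypothetical sleep and satisfaction values, ranks each variable with scipy.stats.rankdata (which assigns average ranks to ties by default) and then applies pearsonr to the ranks; the result should agree with spearmanr.

```python
from scipy.stats import pearsonr, rankdata, spearmanr

# Hypothetical data for illustration
sleep_duration = [7, 5, 8, 6, 6, 7, 8]
job_satisfaction = [8, 6, 9, 5, 7, 8, 9]

# Step 1: convert raw values to ranks (tied values share their average rank)
sleep_ranks = rankdata(sleep_duration)
satisfaction_ranks = rankdata(job_satisfaction)

# Step 2: Pearson correlation computed on the ranks
rho_from_ranks, _ = pearsonr(sleep_ranks, satisfaction_ranks)

# Reference value from SciPy's Spearman implementation
rho_scipy, _ = spearmanr(sleep_duration, job_satisfaction)

print("Pearson on ranks:", rho_from_ranks)
print("spearmanr:       ", rho_scipy)
```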

In [2]:
from scipy.stats import spearmanr

# Example data (replace with your actual data)
sleep_duration = [7, 5, 8, 6, 6, 7, 8]
job_satisfaction = [8, 6, 9, 5, 7, 8, 9]

# Calculate Spearman's rank correlation coefficient
corr_coefficient, _ = spearmanr(sleep_duration, job_satisfaction)

print("Spearman's Rank Correlation Coefficient:", corr_coefficient)


Spearman's Rank Correlation Coefficient: 0.9346202568200739


The output is the Spearman's rank correlation coefficient (rho). For the example data it is about 0.93, a strong positive monotonic relationship. In general, interpret the result as follows:

If ρ is close to 1, it suggests a strong positive monotonic relationship, indicating that as individuals get more sleep, their job satisfaction tends to increase monotonically.

If ρ is close to -1, it suggests a strong negative monotonic relationship, indicating that as individuals get more sleep, their job satisfaction tends to decrease monotonically.

If ρ is close to 0, it suggests little to no monotonic relationship between the two variables. In this case, there may not be a consistent trend in how sleep duration relates to job satisfaction.

Keep in mind that Spearman's rank correlation is less sensitive to outliers than the Pearson coefficient, because it uses ranks rather than raw values, and it does not require the assumption of linearity. However, it only captures monotonic relationships and may miss other types of association (for example, a U-shaped pattern).
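
The point about outliers can be illustrated with a small sketch. The data below are hypothetical: a perfectly linear series in which one value is replaced by an extreme reading. Both coefficients drop, but the rank-based one drops far less.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical, perfectly linear data
x = np.arange(1, 11, dtype=float)     # 1, 2, ..., 10
y = 2 * x + 1

r_clean, _ = pearsonr(x, y)
rho_clean, _ = spearmanr(x, y)

# Replace one observation (at x = 5) with an extreme value
y_outlier = y.copy()
y_outlier[4] = 100.0

r_out, _ = pearsonr(x, y_outlier)
rho_out, _ = spearmanr(x, y_outlier)

print("Without outlier: Pearson =", round(r_clean, 3), " Spearman =", round(rho_clean, 3))
print("With outlier:    Pearson =", round(r_out, 3), " Spearman =", round(rho_out, 3))
```

Spearman's coefficient is affected only through the change in rank order, whereas Pearson's responds to the size of the extreme value itself.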

Q3. Suppose you are conducting a study to examine the relationship between the number of hours of
exercise per week and body mass index (BMI) in a sample of adults. You collected data on both variables
for 50 participants. Calculate the Pearson correlation coefficient and the Spearman's rank correlation
between these two variables and compare the results.

Ans: To examine the relationship between the number of hours of exercise per week and body mass index (BMI) in a sample of adults, you can calculate both the Pearson correlation coefficient and the Spearman's rank correlation coefficient. These two measures will provide insights into the linear and monotonic relationships between the variables, respectively. 

In [3]:
from scipy.stats import pearsonr, spearmanr

# Example data: 48 illustrative observations (the study describes 50); replace with your actual data
exercise_hours = [3, 5, 2, 4, 6, 2, 1, 3, 4, 5, 3, 2, 1, 6, 4, 5, 3, 2, 6, 1,
                  5, 3, 2, 4, 6, 3, 2, 1, 5, 4, 3, 2, 6, 4, 3, 2, 1, 5, 6, 3,
                  4, 2, 1, 5, 3, 6, 4, 2]

bmi = [26.1, 27.8, 28.5, 25.0, 24.3, 29.2, 30.0, 27.3, 26.9, 24.8, 26.5, 29.8,
       31.2, 24.1, 25.6, 26.7, 27.0, 29.5, 23.5, 30.4, 25.3, 27.1, 28.9, 25.6,
       23.9, 27.0, 28.8, 29.7, 24.2, 26.5, 27.4, 28.3, 23.0, 25.7, 23.8, 29.9,
       28.4, 25.4, 25.9, 30.1, 27.6, 24.7, 27.2, 30.5, 31.1, 25.8, 28.2, 23.7]

# Calculate Pearson correlation coefficient
pearson_corr_coeff, _ = pearsonr(exercise_hours, bmi)

# Calculate Spearman's rank correlation coefficient
spearman_corr_coeff, _ = spearmanr(exercise_hours, bmi)

print("Pearson Correlation Coefficient:", pearson_corr_coeff)
print("Spearman's Rank Correlation Coefficient:", spearman_corr_coeff)

Pearson Correlation Coefficient: -0.6452829482381308
Spearman's Rank Correlation Coefficient: -0.6399720554362738


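The output provides both the Pearson correlation coefficient and the Spearman's rank correlation coefficient. For the example data both are about -0.64, a moderate negative relationship, and their closeness suggests the trend is roughly linear as well as monotonic. In general, compare the results as follows: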

If the Pearson correlation coefficient is close to 1, it suggests a strong positive linear relationship between exercise hours and BMI. This means that as the number of exercise hours per week increases, BMI tends to increase linearly (which would be unusual but not impossible).

If the Spearman's rank correlation coefficient is close to 1, it indicates a strong positive monotonic relationship between exercise hours and BMI. This means that as the number of exercise hours per week increases, BMI tends to increase monotonically. The relationship may not be strictly linear but follows a consistent trend.

If both coefficients are close to 0, it suggests little to no linear or monotonic relationship between exercise hours and BMI.

If the Pearson correlation coefficient is close to -1 or the Spearman's rank correlation coefficient is close to -1, it suggests a strong negative relationship, indicating that as the number of exercise hours per week increases, BMI tends to decrease.

By comparing these two correlation coefficients, you can gain insights into both the linear and monotonic aspects of the relationship between exercise hours and BMI in your dataset.
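
To see when the two coefficients diverge, consider a relationship that is monotonic but clearly curved. The sketch below assumes a hypothetical pattern in which BMI falls quickly over the first few hours of exercise and then levels off; the baseline of 22, the scale of 10, and the decay rate of 0.5 are invented for illustration.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical data: BMI decreases with exercise, with diminishing returns
exercise_hours = np.arange(0, 11, dtype=float)        # 0, 1, ..., 10
bmi = 22 + 10 * np.exp(-0.5 * exercise_hours)         # strictly decreasing curve

r, _ = pearsonr(exercise_hours, bmi)
rho, _ = spearmanr(exercise_hours, bmi)

print("Pearson r:   ", round(r, 3))
print("Spearman rho:", round(rho, 3))
```

Because BMI never increases as exercise hours go up, Spearman's coefficient reaches -1, while Pearson's stays noticeably above -1 because the points do not lie on a straight line.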

Q4. A researcher is interested in examining the relationship between the number of hours individuals
spend watching television per day and their level of physical activity. The researcher collected data on
both variables from a sample of 50 participants. Calculate the Pearson correlation coefficient between
these two variables.

Ans: To calculate the Pearson correlation coefficient between the number of hours individuals spend watching television per day and their level of physical activity, you can use Python and the pearsonr function from the SciPy library. The Pearson correlation coefficient, denoted as "r," measures the strength and direction of the linear relationship between two continuous variables. 

In [4]:
from scipy.stats import pearsonr

# Example data: 48 illustrative observations (the study describes 50); replace with your actual data
tv_hours = [2.5, 3.0, 4.5, 2.0, 5.5, 6.0, 1.5, 3.5, 2.0, 4.0, 1.0, 2.5, 3.0, 4.0, 5.0, 1.5, 2.0, 3.5, 2.5, 3.0, 
            4.0, 5.0, 1.0, 2.0, 2.5, 3.5, 4.0, 5.0, 1.5, 2.0, 3.5, 2.0, 2.5, 3.0, 4.0, 5.5, 1.5, 2.0, 2.5, 3.0, 
            3.5, 4.0, 5.0, 1.5, 2.0, 2.5, 3.0, 4.0]

physical_activity = [3.0, 2.5, 1.5, 3.5, 1.0, 0.5, 4.0, 2.0, 3.5, 1.5, 4.5, 3.0, 2.5, 2.0, 1.0, 4.0, 3.5, 2.0, 3.0, 2.5, 
                    1.5, 1.0, 4.0, 3.5, 3.0, 2.0, 1.5, 1.0, 4.0, 3.5, 2.0, 3.0, 3.0, 2.5, 2.0, 1.0, 4.0, 3.5, 3.0, 2.5, 
                    2.0, 1.5, 1.0, 4.5, 4.0, 3.5, 3.0, 2.0]

# Calculate Pearson correlation coefficient
corr_coefficient, _ = pearsonr(tv_hours, physical_activity)

print("Pearson Correlation Coefficient:", corr_coefficient)

Pearson Correlation Coefficient: -0.9744286535988156


The output is the Pearson correlation coefficient (r) between the number of hours spent watching television per day and the level of physical activity, which measures the strength and direction of the linear relationship between these two variables. For the example data, r is about -0.97, a very strong negative linear relationship.

Interpretation:

If the Pearson correlation coefficient is close to 1, it suggests a strong positive linear relationship. In this context, it would mean that as individuals spend more hours watching television per day, their level of physical activity tends to increase linearly.

If the coefficient is close to -1, it indicates a strong negative linear relationship, implying that as individuals spend more hours watching television per day, their level of physical activity tends to decrease linearly.

If the coefficient is close to 0, it suggests little to no linear relationship between the two variables. In this case, the number of hours spent watching television per day is not strongly associated with the level of physical activity.

Please note that correlation does not imply causation, and other factors may be influencing the relationship between these two variables.
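
The interpretation rules above can be wrapped in a small helper. This is only a sketch: the function name describe_correlation and the cut-offs of 0.7 and 0.3 are assumptions chosen for illustration, since conventions for "strong" and "weak" vary between fields.

```python
def describe_correlation(r, strong=0.7, weak=0.3):
    """Return a verbal description of a Pearson coefficient.

    The default thresholds are illustrative assumptions, not a standard.
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")

    size = abs(r)
    if size < weak:
        return "little to no linear relationship"

    strength = "strong" if size >= strong else "moderate"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction} linear relationship"


# Value printed for the television example
print(describe_correlation(-0.9744286535988156))
```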

Q5. A survey was conducted to examine the relationship between age and preference for a particular
brand of soft drink. The survey results are shown below:

Age (Years)    Preference
25             Coke
42             Pepsi
37             Mountain dew
19             Coke
31             pepsi
28             Coke

Ans: 
* Step 1: Data Preparation

Organize the data into a structured format, such as a table with columns for "Age" and "Soft Drink Preference."
Ensure consistency in the naming of soft drink preferences (e.g., "Coke" and "Pepsi" should have consistent capitalization).

* Step 2: Data Exploration

Calculate basic statistics to understand the age distribution, such as the mean, median, and standard deviation of ages.
Examine the frequency of each soft drink preference to see which brand is most preferred in the survey.

* Step 3: Data Visualization

Create visualizations to explore the relationship between age and soft drink preference. For example, you can create a bar chart or a pie chart to visualize the distribution of preferences.
Consider creating a box plot or histogram of ages to understand their distribution.

* Step 4: Analyze the Relationship

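You can calculate summary statistics (e.g., mean age) for each soft drink preference category to see if there are any noticeable differences.
Pearson and Spearman coefficients do not apply directly here, because brand preference is a nominal variable with no natural order. To explore the relationship further, you can group ages into bands, build a contingency table, and apply a chi-squared test of association; alternatively, since age is numeric and preference is categorical, you can compare ages across brands with a one-way ANOVA or a Kruskal-Wallis test. However, with only six respondents, the results will not be conclusive.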

* Step 5: Interpretation

Interpret the results of your analysis. Depending on your findings, you can make conclusions about the observed relationships or lack thereof. For instance, you might determine if certain age groups tend to prefer one brand over another based on the survey data.

In [5]:
import pandas as pd

# Create a DataFrame from the provided data
data = {'Age (Years)': [25, 42, 37, 19, 31, 28],
        'Soft Drink Preference': ['Coke', 'Pepsi', 'Mountain Dew', 'Coke', 'Pepsi', 'Coke']}

df = pd.DataFrame(data)

# Calculate the mean age for each soft drink preference
mean_age_by_preference = df.groupby('Soft Drink Preference')['Age (Years)'].mean()

print(mean_age_by_preference)

Soft Drink Preference
Coke            24.0
Mountain Dew    37.0
Pepsi           36.5
Name: Age (Years), dtype: float64
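
As a sketch of the contingency-table idea from Step 4, ages can be grouped into bands and cross-tabulated against preference. The cut point of 30 years and the band labels are assumptions chosen for illustration, and with six respondents the table is descriptive only.

```python
import pandas as pd

df = pd.DataFrame({
    "Age (Years)": [25, 42, 37, 19, 31, 28],
    "Soft Drink Preference": ["Coke", "Pepsi", "Mountain Dew", "Coke", "Pepsi", "Coke"],
})

# Group ages into two bands: [0, 30) and [30, 120); the cut point is hypothetical
age_group = pd.cut(
    df["Age (Years)"],
    bins=[0, 30, 120],
    right=False,
    labels=["under 30", "30 and over"],
)

# Count respondents in each combination of age band and preference
contingency = pd.crosstab(age_group, df["Soft Drink Preference"])
print(contingency)
```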


Q6. A company is interested in examining the relationship between the number of sales calls made per day
and the number of sales made per week. The company collected data on both variables from a sample of
30 sales representatives. Calculate the Pearson correlation coefficient between these two variables.

Ans: To calculate the Pearson correlation coefficient between the number of sales calls made per day and the number of sales made per week for a sample of 30 sales representatives, you can use Python and the pearsonr function from the SciPy library. The Pearson correlation coefficient, denoted as "r," measures the strength and direction of the linear relationship between two continuous variables. 

In [6]:
from scipy.stats import pearsonr

# Example data (replace with your actual data)
sales_calls_per_day = [20, 25, 18, 30, 22, 15, 24, 28, 19, 32, 23, 17, 29, 21, 26, 27, 14, 31, 16, 20, 25, 18, 30, 22, 15, 24, 28, 19, 32, 23]
sales_per_week = [120, 145, 110, 150, 130, 100, 140, 155, 115, 160, 125, 105, 158, 135, 147, 162, 98, 155, 108, 122, 147, 112, 152, 128, 102, 143, 157, 114, 162, 130]

# Calculate Pearson correlation coefficient
corr_coefficient, _ = pearsonr(sales_calls_per_day, sales_per_week)

print("Pearson Correlation Coefficient:", corr_coefficient)

Pearson Correlation Coefficient: 0.9644686527552352


The output is the Pearson correlation coefficient (r) between the number of sales calls made per day and the number of sales made per week, which measures the strength and direction of the linear relationship between these two variables. For the example data, r is about 0.96, a very strong positive linear relationship.

Interpretation:

If the Pearson correlation coefficient is close to 1, it suggests a strong positive linear relationship. In this context, it would mean that as the number of sales calls made per day increases, the number of sales made per week tends to increase linearly.

If the coefficient is close to -1, it indicates a strong negative linear relationship, implying that as the number of sales calls made per day increases, the number of sales made per week tends to decrease linearly.

If the coefficient is close to 0, it suggests little to no linear relationship between the two variables. In this case, the number of sales calls made per day is not strongly associated with the number of sales made per week.

Please note that correlation does not imply causation, and other factors may be influencing the relationship between these two variables.
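
Beyond the coefficient itself, pearsonr also returns a p-value for the null hypothesis that the two variables are uncorrelated. The sketch below reuses the example sales data and compares the p-value with a significance level of 0.05, which is an assumed convention rather than a requirement.

```python
from scipy.stats import pearsonr

# Example data (same illustrative values as in the sales example)
sales_calls_per_day = [20, 25, 18, 30, 22, 15, 24, 28, 19, 32,
                       23, 17, 29, 21, 26, 27, 14, 31, 16, 20,
                       25, 18, 30, 22, 15, 24, 28, 19, 32, 23]
sales_per_week = [120, 145, 110, 150, 130, 100, 140, 155, 115, 160,
                  125, 105, 158, 135, 147, 162, 98, 155, 108, 122,
                  147, 112, 152, 128, 102, 143, 157, 114, 162, 130]

# Keep the p-value instead of discarding it
r, p_value = pearsonr(sales_calls_per_day, sales_per_week)

alpha = 0.05  # assumed significance level
is_significant = p_value < alpha

print("Pearson Correlation Coefficient:", r)
print("p-value:", p_value)
print("Significant at alpha = 0.05:", is_significant)
```

A small p-value indicates that a correlation this strong would be unlikely if the variables were truly uncorrelated; it says nothing about whether sales calls cause sales.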