# Feature Engineering - 6

## Q1. Pearson correlation coefficient is a measure of the linear relationship between two variables. Suppose you have collected data on the amount of time students spend studying for an exam and their final exam scores. Calculate the Pearson correlation coefficient between these two variables and interpret the result.

In [None]:
import numpy as np
from scipy.stats import pearsonr

# Example data
study_hours = np.array([2, 4, 6, 8, 10])
exam_scores = np.array([60, 65, 70, 80, 85])

corr, p_value = pearsonr(study_hours, exam_scores)
print(f"Pearson correlation coefficient: {corr:.2f}, p-value: {p_value:.4f}")

**Interpretation:**

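For this example data, r ≈ 0.99. A value this close to 1 indicates a strong positive linear relationship: students who studied longer tended to score higher. With only five observations the estimate is imprecise, and a high correlation on its own does not show that studying causes higher scores.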
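As a cross-check on the library call, the coefficient can also be computed directly from its definition, r = Σ(dx·dy) / √(Σdx² · Σdy²), where dx and dy are deviations from the means. The sketch below reuses the example data from the cell above; the names `dx`, `dy`, and `r_manual` are illustrative and not part of the original notebook.

```python
import numpy as np

study_hours = np.array([2, 4, 6, 8, 10])
exam_scores = np.array([60, 65, 70, 80, 85])

# Deviations of each variable from its own mean
dx = study_hours - study_hours.mean()
dy = exam_scores - exam_scores.mean()

# Pearson r: co-variation divided by the product of the individual spreads
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())
print(f"Manual Pearson r: {r_manual:.4f}")
```

The result should agree with the value returned by `pearsonr` up to floating-point rounding.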

## Q2. Spearman's rank correlation is a measure of the monotonic relationship between two variables. Suppose you have collected data on the amount of sleep individuals get each night and their overall job satisfaction level on a scale of 1 to 10. Calculate the Spearman's rank correlation between these two variables and interpret the result.

In [None]:
import numpy as np
from scipy.stats import spearmanr

# Example data
sleep_hours = np.array([6, 7, 5, 8, 6])
job_satisfaction = np.array([5, 7, 4, 9, 6])

corr, p_value = spearmanr(sleep_hours, job_satisfaction)
print(f"Spearman's rank correlation: {corr:.2f}, p-value: {p_value:.4f}")

**Interpretation:**

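For this example data, Spearman's ρ ≈ 0.97. A value this close to 1 indicates a strong positive monotonic relationship: individuals who sleep more tend to report higher job satisfaction, whether or not the increase is linear. As with any five-point sample, the estimate is imprecise.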
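Spearman's coefficient is the Pearson coefficient computed on ranks rather than raw values. The sketch below illustrates this with the same example data, using `scipy.stats.rankdata` (which assigns tied values their average rank by default) and `np.corrcoef`; the variable names are illustrative.

```python
import numpy as np
from scipy.stats import rankdata

sleep_hours = np.array([6, 7, 5, 8, 6])
job_satisfaction = np.array([5, 7, 4, 9, 6])

# Convert raw values to ranks; the two 6-hour sleepers share rank 2.5
sleep_ranks = rankdata(sleep_hours)
satisfaction_ranks = rankdata(job_satisfaction)

# Pearson correlation of the ranks is Spearman's rank correlation
rho_manual = np.corrcoef(sleep_ranks, satisfaction_ranks)[0, 1]
print(f"Spearman via ranks: {rho_manual:.4f}")
```

This should match the output of `spearmanr` on the raw values, since `spearmanr` handles ties the same way.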

## Q3. Suppose you are conducting a study to examine the relationship between the number of hours of exercise per week and body mass index (BMI) in a sample of adults. You collected data on both variables for 50 participants. Calculate the Pearson correlation coefficient and the Spearman's rank correlation between these two variables and compare the results.

In [None]:
import numpy as np
from scipy.stats import pearsonr, spearmanr

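# Synthetic stand-in for the 50 participants. The two variables are drawn
# independently, so both coefficients are expected to be close to zero.
np.random.seed(0)
exercise_hours = np.random.randint(0, 10, 50)  # integer hours, 0 to 9
bmi = np.random.normal(25, 3, 50)              # mean 25, standard deviation 3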

pearson_corr, _ = pearsonr(exercise_hours, bmi)
spearman_corr, _ = spearmanr(exercise_hours, bmi)
print(f"Pearson: {pearson_corr:.2f}, Spearman: {spearman_corr:.2f}")

**Comparison:**

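Pearson measures the strength of a linear relationship, while Spearman measures the strength of a monotonic relationship (it is Pearson computed on ranks). Similar values do not by themselves prove linearity; they only show no sign of strong non-linearity or influential outliers. When Spearman is clearly larger in magnitude than Pearson, the relationship may be monotonic but not linear, or outliers may be pulling Pearson down. Because the example data here is randomly generated with the two variables independent of each other, both coefficients should be close to zero, and any small difference between them is noise rather than evidence about exercise and BMI.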
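To see how the two coefficients can diverge, here is a small illustrative dataset (invented for this sketch, not taken from the exercise) in which `y` grows exponentially with `x`. The relationship is perfectly monotonic but far from linear.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical data: y always increases with x, but along a curve
x = np.arange(1, 11)
y = np.exp(x)

pearson_r, _ = pearsonr(x, y)
spearman_rho, _ = spearmanr(x, y)
print(f"Pearson: {pearson_r:.2f}, Spearman: {spearman_rho:.2f}")
```

Spearman comes out at 1 because the ranks of `x` and `y` agree exactly, while Pearson is noticeably lower because a straight line fits the curve poorly.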

## Q4. A researcher is interested in examining the relationship between the number of hours individuals spend watching television per day and their level of physical activity. The researcher collected data on both variables from a sample of 50 participants. Calculate the Pearson correlation coefficient between these two variables.

In [None]:
import numpy as np
from scipy.stats import pearsonr

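# Synthetic stand-in for the 50 participants. Both variables are drawn
# independently, so the correlation is expected to be close to zero.
np.random.seed(1)
tv_hours = np.random.randint(1, 6, 50)            # hours per day, 1 to 5
physical_activity = np.random.randint(1, 11, 50)  # activity level, 1 to 10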

corr, p_value = pearsonr(tv_hours, physical_activity)
print(f"Pearson correlation coefficient: {corr:.2f}, p-value: {p_value:.4f}")

## Q5. A survey was conducted to examine the relationship between age and preference for a particular brand of soft drink. The survey results are shown below:

Age (years): 25, 42, 37, 19, 31, 28

Soft drink preference: Coke, Pepsi, Mountain dew, Coke, Pepsi, Coke

In [None]:
from scipy.stats import spearmanr

age = [25, 42, 37, 19, 31, 28]
pref = ['Coke', 'Pepsi', 'Mountain dew', 'Coke', 'Pepsi', 'Coke']

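# Encode preferences as integers. The brands have no natural order, so this
# mapping is arbitrary and the coefficient below depends on it.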
pref_map = {'Coke': 0, 'Pepsi': 1, 'Mountain dew': 2}
pref_encoded = [pref_map[x] for x in pref]

corr, p_value = spearmanr(age, pref_encoded)
print(f"Spearman's rank correlation: {corr:.2f}, p-value: {p_value:.4f}")

**Interpretation:**

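Soft drink preference is a nominal (unordered) categorical variable, whereas Spearman's rank correlation requires data that is at least ordinal. The coefficient computed above therefore depends on the arbitrary integer codes assigned to the brands: a different mapping would give a different value, so it should not be read as the strength or direction of an association between age and preference. Methods designed for a numeric variable against a nominal one, such as one-way ANOVA or the Kruskal-Wallis test, are better suited to this question.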
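Brand labels such as Coke or Pepsi have no inherent order, so any numeric coding of them is arbitrary. One order-free alternative, swapped in here for illustration and not part of the original exercise, is the correlation ratio η²: the share of the variance in age that is explained by differences between brand groups. The sketch below computes it with NumPy only; `ss_total`, `ss_between`, and `eta_squared` are illustrative names.

```python
import numpy as np

age = np.array([25, 42, 37, 19, 31, 28])
pref = np.array(['Coke', 'Pepsi', 'Mountain dew', 'Coke', 'Pepsi', 'Coke'])

grand_mean = age.mean()

# Total variation of age around the overall mean
ss_total = ((age - grand_mean) ** 2).sum()

# Variation explained by brand: group size times squared distance
# of each group's mean age from the overall mean
ss_between = sum(
    age[pref == brand].size * (age[pref == brand].mean() - grand_mean) ** 2
    for brand in np.unique(pref)
)

eta_squared = ss_between / ss_total
print(f"Correlation ratio (eta squared): {eta_squared:.2f}")
```

The value does not change if the brands are relabeled or reordered. With only six respondents, and a single respondent in one group, it is best treated as descriptive rather than as evidence of a real effect.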

## Q6. A company is interested in examining the relationship between the number of sales calls made per day and the number of sales made per week. The company collected data on both variables from a sample of 30 sales representatives. Calculate the Pearson correlation coefficient between these two variables.

In [None]:
import numpy as np
from scipy.stats import pearsonr

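# Synthetic stand-in for the 30 sales representatives. Both variables are
# drawn independently, so the correlation is expected to be close to zero.
np.random.seed(2)
sales_calls = np.random.randint(10, 50, 30)  # calls per day, 10 to 49
sales_made = np.random.randint(1, 20, 30)    # sales per week, 1 to 19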

corr, p_value = pearsonr(sales_calls, sales_made)
print(f"Pearson correlation coefficient: {corr:.2f}, p-value: {p_value:.4f}")