# **Feature Engineering-6**

Q1. Pearson correlation coefficient is a measure of the linear relationship between two variables. Suppose
you have collected data on the amount of time students spend studying for an exam and their final exam
scores. Calculate the Pearson correlation coefficient between these two variables and interpret the result.

In [1]:
import numpy as np

# Data
study_time = np.array([3, 5, 2, 8, 6])
exam_scores = np.array([70, 80, 65, 88, 85])

# Pearson correlation coefficient
r = np.corrcoef(study_time, exam_scores)[0, 1]
print(f'Pearson correlation coefficient: {r}')


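Pearson correlation coefficient: 0.9774283038456142

Interpretation: r ≈ 0.98 indicates a very strong positive linear relationship. In this sample, students who studied longer tended to score higher. With only five observations the estimate is imprecise, and correlation alone does not show that studying causes higher scores.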
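
As a cross-check on the `np.corrcoef` call, the coefficient can also be computed directly from its definition, r = Σ(x − x̄)(y − ȳ) / √(Σ(x − x̄)² · Σ(y − ȳ)²). The sketch below reuses the Q1 data; the variable names `dx`, `dy`, `r_manual`, and `r_scipy` are introduced here for illustration, and `scipy.stats.pearsonr` is added only to obtain a two-sided p-value.

```python
import numpy as np
from scipy import stats

# Same data as in Q1
study_time = np.array([3, 5, 2, 8, 6])
exam_scores = np.array([70, 80, 65, 88, 85])

# Deviations of each observation from its variable's mean
dx = study_time - study_time.mean()
dy = exam_scores - exam_scores.mean()

# Pearson r from the definition: co-deviation over the product of spreads
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# scipy returns the same coefficient together with a two-sided p-value
r_scipy, p_value = stats.pearsonr(study_time, exam_scores)

print(f"r (manual): {r_manual:.4f}")
print(f"r (scipy):  {r_scipy:.4f}, p-value: {p_value:.4f}")
```

Both routes should agree with the value printed above; the p-value gives a rough sense of how surprising such an r would be if the true correlation were zero, though with n = 5 it should be read cautiously.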


Q2. Spearman's rank correlation is a measure of the monotonic relationship between two variables.
Suppose you have collected data on the amount of sleep individuals get each night and their overall job
satisfaction level on a scale of 1 to 10. Calculate the Spearman's rank correlation between these two
variables and interpret the result.

In [2]:
import scipy.stats as stats

# Data
sleep = np.array([6, 8, 5, 7, 4])
job_satisfaction = np.array([5, 7, 6, 8, 4])

# Spearman's rank correlation
rho, p_value = stats.spearmanr(sleep, job_satisfaction)
print(f"Spearman's rank correlation coefficient: {rho}")


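Spearman's rank correlation coefficient: 0.7999999999999999

Interpretation: rho ≈ 0.80 indicates a strong positive monotonic relationship. People who sleep more tend to report higher job satisfaction, although the ordering is not perfectly preserved (the person sleeping 5 hours reports a satisfaction of 6, while the person sleeping 6 hours reports 5). With five observations this is only a rough estimate.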
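
To make the "rank" part concrete, Spearman's rho is the Pearson correlation of the ranked data, and when there are no ties it also equals 1 − 6Σd² / (n(n² − 1)), where d is the difference between paired ranks. The sketch below reuses the Q2 data; the names `sleep_ranks`, `satisfaction_ranks`, `rho_from_ranks`, and `rho_shortcut` are introduced here for illustration.

```python
import numpy as np
from scipy import stats

# Same data as in Q2
sleep = np.array([6, 8, 5, 7, 4])
job_satisfaction = np.array([5, 7, 6, 8, 4])

# Replace each value by its rank (1 = smallest); this data has no ties
sleep_ranks = stats.rankdata(sleep)
satisfaction_ranks = stats.rankdata(job_satisfaction)

# Route 1: Pearson correlation computed on the ranks
rho_from_ranks = np.corrcoef(sleep_ranks, satisfaction_ranks)[0, 1]

# Route 2: shortcut formula, valid only when there are no tied ranks
d = sleep_ranks - satisfaction_ranks
n = len(sleep)
rho_shortcut = 1 - 6 * (d ** 2).sum() / (n * (n ** 2 - 1))

print(f"rho (Pearson on ranks): {rho_from_ranks:.4f}")
print(f"rho (shortcut formula): {rho_shortcut:.4f}")
```

Both values should match the `stats.spearmanr` result above.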


Q3. Suppose you are conducting a study to examine the relationship between the number of hours of
exercise per week and body mass index (BMI) in a sample of adults. You collected data on both variables
for 50 participants. Calculate the Pearson correlation coefficient and the Spearman's rank correlation
between these two variables and compare the results.

In [3]:
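# Synthetic placeholder data for 50 participants (no real data was provided).
# The two arrays are generated independently, and no seed is set, so results vary per run.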
exercise_hours = np.random.rand(50) * 10  # random hours between 0 and 10
BMI = np.random.rand(50) * 10 + 20  # random BMI between 20 and 30

# Pearson correlation
pearson_r = np.corrcoef(exercise_hours, BMI)[0, 1]

# Spearman correlation
spearman_r, _ = stats.spearmanr(exercise_hours, BMI)

print(f"Pearson correlation coefficient: {pearson_r}")
print(f"Spearman's rank correlation coefficient: {spearman_r}")


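Pearson correlation coefficient: 0.30622006702261867
Spearman's rank correlation coefficient: 0.3167827130852341

Comparison: the two coefficients are nearly identical (about 0.31 and 0.32), which suggests that neither outliers nor a strongly nonlinear monotonic pattern is driving the result. Because the data here are two independently generated random samples, there is no true relationship between them; the values reflect sampling variation and will differ on every run.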
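
The two coefficients diverge when a relationship is monotonic but not linear. The sketch below uses hypothetical data (not from the exercise study) in which `y` grows exponentially with `x`: the ranks line up perfectly, so Spearman's rho is 1, while Pearson's r is noticeably lower because the points do not fall on a straight line.

```python
import numpy as np
from scipy import stats

# Hypothetical data: y increases with x, but exponentially rather than linearly
x = np.arange(1, 11)
y = np.exp(x)

# Pearson measures linear association on the raw values
pearson_r, _ = stats.pearsonr(x, y)

# Spearman measures monotonic association through the ranks
spearman_rho, _ = stats.spearmanr(x, y)

print(f"Pearson r:    {pearson_r:.3f}")
print(f"Spearman rho: {spearman_rho:.3f}")
```

A large gap between the two, as in this example, is a hint to plot the data and consider a transformation before relying on Pearson's r.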


Q4. A researcher is interested in examining the relationship between the number of hours individuals
spend watching television per day and their level of physical activity. The researcher collected data on
both variables from a sample of 50 participants. Calculate the Pearson correlation coefficient between
these two variables.

In [4]:
# Synthetic placeholder data for 50 participants (generated independently, no seed set)
tv_hours = np.random.rand(50) * 10  # random hours between 0 and 10
physical_activity = np.random.rand(50) * 10  # random activity level between 0 and 10

# Pearson correlation
r = np.corrcoef(tv_hours, physical_activity)[0, 1]
print(f'Pearson correlation coefficient: {r}')


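Pearson correlation coefficient: -0.04739039290099749

Interpretation: r ≈ -0.05 indicates essentially no linear relationship, which is expected here because the two variables were generated independently.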
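
To get a feel for how much r fluctuates by chance alone, the simulation below repeats the Q4 setup many times with independent random data and summarizes the resulting coefficients. It is a minimal sketch: the seed, the number of repetitions, and the names `rng`, `n_repeats`, and `r_values` are assumptions made here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, chosen only for reproducibility
n_participants = 50
n_repeats = 2000

# Draw two independent samples per repetition and record their Pearson r
r_values = np.empty(n_repeats)
for i in range(n_repeats):
    tv_hours = rng.random(n_participants) * 10
    physical_activity = rng.random(n_participants) * 10
    r_values[i] = np.corrcoef(tv_hours, physical_activity)[0, 1]

print(f"Mean r across repetitions: {r_values.mean():.3f}")
print(f"Std of r across repetitions: {r_values.std():.3f}")
print(f"Share with |r| > 0.3: {(np.abs(r_values) > 0.3).mean():.3f}")
```

With n = 50, the coefficients scatter around zero with a standard deviation of roughly 1/√(n − 1) ≈ 0.14, so small values such as the one printed above are well within chance variation.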


Q5. A survey was conducted to examine the relationship between age and preference for a particular
brand of soft drink. The survey results are shown below:

In [6]:
import pandas as pd
from scipy.stats import chi2_contingency

# Data
data = {
    'Age': [25, 42, 37, 19, 31, 28],
    'Soft drink Preference': ['Coke', 'Pepsi', 'Mountain Dew', 'Coke', 'Pepsi', 'Coke']
}

# Convert to DataFrame
df = pd.DataFrame(data)

# Create age groups; pd.cut's default right-closed bins, (0, 20], (20, 30], ..., match the labels
bins = [0, 20, 30, 40, 50]
labels = ['0-20', '21-30', '31-40', '41-50']
df['Age Group'] = pd.cut(df['Age'], bins=bins, labels=labels)

# Create a contingency table
contingency_table = pd.crosstab(df['Age Group'], df['Soft drink Preference'])

# Chi-square test of independence
chi2, p, dof, expected = chi2_contingency(contingency_table)

print("Contingency Table:")
print(contingency_table)
print("\nChi-square Test Statistic:", chi2)
print("P-value:", p)
print("Degrees of Freedom:", dof)
print("Expected Frequencies Table:")
print(expected)


Contingency Table:
Soft drink Preference  Coke  Mountain Dew  Pepsi
Age Group                                       
0-20                      1             0      0
21-30                     2             0      0
31-40                     0             1      1
41-50                     0             0      1

Chi-square Test Statistic: 7.500000000000002
P-value: 0.27706844336610714
Degrees of Freedom: 6
Expected Frequencies Table:
[[0.5        0.16666667 0.33333333]
 [1.         0.33333333 0.66666667]
 [1.         0.33333333 0.66666667]
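 [0.5        0.16666667 0.33333333]]

Interpretation: with p ≈ 0.28 (greater than 0.05), the test does not reject independence between age group and soft drink preference. Every expected frequency is below 5, so with only six respondents the chi-square approximation is unreliable and the result should be treated as illustrative.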
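
A common rule of thumb for the chi-square test is that no more than about 20% of cells should have an expected count below 5 and none below 1. The sketch below applies that check to the observed counts from the contingency table printed earlier; the names `observed`, `n_small`, `share_small`, and `rule_ok` are introduced here for illustration, and the thresholds are the rule of thumb, not a hard requirement of `chi2_contingency`.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts copied from the contingency table
# (rows: age groups, columns: Coke, Mountain Dew, Pepsi)
observed = np.array([
    [1, 0, 0],
    [2, 0, 0],
    [0, 1, 1],
    [0, 0, 1],
])

chi2, p, dof, expected = chi2_contingency(observed)

# Rule-of-thumb check on the expected frequencies
n_small = int((expected < 5).sum())
share_small = n_small / expected.size
rule_ok = share_small <= 0.2 and expected.min() >= 1

print(f"Cells with expected count < 5: {n_small} of {expected.size}")
print(f"Smallest expected count: {expected.min():.3f}")
print(f"Rule of thumb satisfied: {rule_ok}")
```

When the check fails, as it does for a six-person sample, collecting more responses or merging sparse categories is usually preferable to reading much into the p-value.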


Q6. A company is interested in examining the relationship between the number of sales calls made per day
and the number of sales made per week. The company collected data on both variables from a sample of
30 sales representatives. Calculate the Pearson correlation coefficient between these two variables.

In [5]:
# Synthetic placeholder data for 30 sales representatives (generated independently, no seed set)
sales_calls = np.random.rand(30) * 10  # random calls between 0 and 10
sales_made = np.random.rand(30) * 50  # random sales between 0 and 50

# Pearson correlation
r = np.corrcoef(sales_calls, sales_made)[0, 1]
print(f'Pearson correlation coefficient: {r}')


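Pearson correlation coefficient: 0.027374656454734886

Interpretation: r ≈ 0.03 indicates essentially no linear relationship between calls and sales in this synthetic sample, as expected for independently generated data. Real call and sales records would be needed to draw any business conclusion.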


# **COMPLETE**