## Feature Engineering 5
**By Shahequa Modabbera**

### Q1. Pearson correlation coefficient is a measure of the linear relationship between two variables. 
### Suppose you have collected data on the amount of time students spend studying for an exam and their final exam scores. Calculate the Pearson correlation coefficient between these two variables and interpret the result.

In [2]:
import numpy as np
from scipy.stats import pearsonr

# generate sample data
studying_time = np.array([10, 8, 5, 7, 12, 9, 11, 6, 4, 8, 10, 7, 6, 3, 9, 11, 12, 10, 8, 7, 6, 5, 4, 3, 2, 5, 7, 9, 10, 11, 12, 3, 4, 6, 8, 9, 11, 12, 7, 5, 4, 6, 8, 10, 11, 9, 7, 6, 8, 10])
exam_scores = np.array([90, 80, 70, 75, 95, 85, 90, 65, 60, 75, 85, 80, 70, 50, 80, 85, 90, 95, 80, 75, 70, 65, 60, 55, 40, 60, 70, 80, 90, 95, 100, 50, 55, 65, 75, 80, 90, 95, 80, 70, 60, 65, 75, 85, 90, 85, 75, 70, 80, 90])

# calculate Pearson correlation coefficient and p-value
corr_coef, p_value = pearsonr(studying_time, exam_scores)

# print results
print(f"Pearson correlation coefficient: {corr_coef:.2f}")
print(f"P-value: {p_value:.2f}")

Pearson correlation coefficient: 0.96
P-value: 0.00


The Pearson correlation coefficient ranges from -1 to 1, where -1 indicates a perfect negative linear relationship, 0 indicates no linear relationship, and 1 indicates a perfect positive linear relationship. In this case, the coefficient is 0.96, which is positive and close to 1 and suggests a strong positive linear relationship between the amount of time spent studying and the final exam score. In other words, students who spend more time studying tend to score higher on the exam. The p-value prints as 0.00 only because it is rounded to two decimal places; the actual value is very small rather than exactly zero, so the correlation is statistically significant at any conventional level.
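To make the definition concrete, the sketch below computes the coefficient by hand as the sum of products of deviations divided by the square root of the product of the summed squared deviations, then checks it against `np.corrcoef`. The helper name `pearson_r` and the toy arrays `hours` and `scores` are assumptions made for illustration; they are not part of the exercise data.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson r: co-deviation of x and y scaled by their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean of x
    dy = y - y.mean()  # deviations from the mean of y
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Hypothetical toy data (not the study data above)
hours = [2, 4, 6, 8, 10]
scores = [50, 60, 65, 80, 95]

r_manual = pearson_r(hours, scores)
r_numpy = np.corrcoef(hours, scores)[0, 1]
print(f"manual: {r_manual:.4f}, numpy: {r_numpy:.4f}")
```

Both values should agree up to floating-point error, which is a quick way to confirm that a library call is computing the quantity described in the text.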

### Q2. Spearman's rank correlation is a measure of the monotonic relationship between two variables.
### Suppose you have collected data on the amount of sleep individuals get each night and their overall job satisfaction level on a scale of 1 to 10. Calculate the Spearman's rank correlation between these two variables and interpret the result.

In [3]:
import scipy.stats as stats
import numpy as np

# Sample data
sleep = np.array([8, 6, 7, 7, 5, 6, 7, 8, 7, 6])
satisfaction = np.array([9, 7, 8, 6, 5, 7, 8, 9, 8, 7])

# Calculate rank correlation coefficient
corr_coef, p_value = stats.spearmanr(sleep, satisfaction)

# Print results
print("Spearman's rank correlation coefficient:", corr_coef)
print("p-value:", p_value)


Spearman's rank correlation coefficient: 0.8432993810941914
p-value: 0.0021735421404129145


Interpretation: Spearman's rank correlation coefficient is 0.84, which indicates a strong positive monotonic relationship between nightly sleep and job satisfaction: individuals who sleep more tend to report higher satisfaction. The p-value (about 0.002) is below 0.05, so the correlation is statistically significant at the 5% level; a coefficient this large would be unlikely if the two variables were unrelated. With only 10 observations and several tied values, the estimate should still be treated with some caution.
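Spearman's coefficient can be read as the Pearson coefficient applied to ranks instead of raw values. The sketch below illustrates this on the same sleep and satisfaction data, using `scipy.stats.rankdata` (which assigns average ranks to ties by default). The variable names `sleep_ranks`, `satisfaction_ranks`, `rho_from_ranks`, and `rho_direct` are introduced here for illustration only.

```python
import numpy as np
from scipy.stats import pearsonr, rankdata, spearmanr

sleep = np.array([8, 6, 7, 7, 5, 6, 7, 8, 7, 6])
satisfaction = np.array([9, 7, 8, 6, 5, 7, 8, 9, 8, 7])

# Replace each value by its rank; tied values share the average rank
sleep_ranks = rankdata(sleep)
satisfaction_ranks = rankdata(satisfaction)

# Pearson on the ranks should match Spearman on the raw values
rho_from_ranks, _ = pearsonr(sleep_ranks, satisfaction_ranks)
rho_direct, _ = spearmanr(sleep, satisfaction)

print(f"Pearson on ranks: {rho_from_ranks:.4f}")
print(f"spearmanr:        {rho_direct:.4f}")
```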

### Q3. Suppose you are conducting a study to examine the relationship between the number of hours of exercise per week and body mass index (BMI) in a sample of adults. You collected data on both variables for 50 participants. Calculate the Pearson correlation coefficient and the Spearman's rank correlation between these two variables and compare the results.

In [4]:
import numpy as np
from scipy.stats import pearsonr, spearmanr

# generate random data for exercise hours and BMI
# (the two arrays are drawn independently and no seed is set, so the output changes on every run)
exercise_hours = np.random.normal(5, 2, 50)
bmi = np.random.normal(25, 5, 50)

# calculate Pearson correlation coefficient and p-value
pearson_corr, pearson_p = pearsonr(exercise_hours, bmi)

# calculate Spearman's rank correlation and p-value (kept separate from the Pearson p-value)
spearman_corr, spearman_p = spearmanr(exercise_hours, bmi)

print("Pearson correlation coefficient: {:.3f}".format(pearson_corr))
print("Spearman's rank correlation: {:.3f}".format(spearman_corr))

Pearson correlation coefficient: 0.163
Spearman's rank correlation: 0.158


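Both coefficients are small and positive (Pearson 0.163, Spearman 0.158), which points to at most a very weak association between weekly exercise hours and BMI. Because the two arrays were drawn independently from normal distributions, these values reflect sampling noise rather than a real relationship, and they will change on every run since no random seed is set. The two coefficients are close to each other, which is what one expects when there are no strong outliers and no pronounced non-linear monotonic pattern. Pearson measures the strength of a linear relationship on the raw values, whereas Spearman applies the same calculation to the ranks and therefore captures any monotonic relationship. A slightly larger Pearson value does not by itself show that the relationship is linear.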
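The difference between the two measures becomes visible when a relationship is monotonic but not linear. The sketch below uses a made-up example, `y = exp(x)`, which is not related to the exercise data: the ranks of `x` and `y` agree perfectly, so Spearman's coefficient reaches 1, while Pearson's coefficient stays noticeably lower.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical data: y increases strictly with x, but along a curve
x = np.arange(1, 11, dtype=float)
y = np.exp(x)

pearson_r, _ = pearsonr(x, y)      # penalised by the curvature
spearman_rho, _ = spearmanr(x, y)  # depends only on the rank order

print(f"Pearson:  {pearson_r:.3f}")
print(f"Spearman: {spearman_rho:.3f}")
```

A large gap between the two coefficients is therefore a hint to plot the data and look for curvature or outliers before choosing a measure.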

### Q4. A researcher is interested in examining the relationship between the number of hours individuals spend watching television per day and their level of physical activity. The researcher collected data on both variables from a sample of 50 participants. Calculate the Pearson correlation coefficient between these two variables.

In [10]:
import numpy as np
from scipy.stats import pearsonr

# Sample data: daily TV hours and physical activity level
# (each array holds 49 paired values, one short of the 50 participants in the question)
tv_hours = np.array([3, 2, 4, 5, 1, 3, 2, 4, 3, 1, 5, 4, 2, 1, 3, 4, 5, 2, 1, 4, 3, 2, 5, 1, 3, 4, 2, 1, 5, 3, 4, 2, 5, 1, 4, 3, 2, 1, 5, 4, 3, 2, 1, 5, 4, 3, 2, 1, 5])
activity_level = np.array([2, 3, 4, 3, 5, 2, 3, 4, 3, 2, 1, 2, 3, 5, 4, 3, 2, 5, 4, 3, 2, 1, 2, 3, 4, 5, 2, 3, 4, 5, 2, 1, 3, 4, 5, 2, 3, 4, 1, 5, 2, 3, 4, 5, 2, 3, 4, 1, 5])

# Calculate the Pearson correlation coefficient and its p-value
corr_coef, p_value = pearsonr(tv_hours, activity_level)

# Print the results
print("Pearson correlation coefficient: ", corr_coef)
print("p-value: ", p_value)

Pearson correlation coefficient:  -0.05523529619792324
p-value:  0.7062080321294039


The correlation coefficient of -0.055 is very close to zero, so there is essentially no linear relationship between daily TV hours and physical activity level in this sample. The sign is negative, but the magnitude is too small to support the claim that activity decreases as TV watching increases. The p-value of 0.71 is far above the 0.05 significance level, so we cannot reject the null hypothesis of zero correlation: a coefficient this small could easily arise from sampling variation alone even if the two variables were unrelated.
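The p-value can be traced back to a t-test on the coefficient. Under the usual assumption of normally distributed data, the statistic t = r * sqrt((n - 2) / (1 - r^2)) follows a Student t distribution with n - 2 degrees of freedom when the true correlation is zero. The sketch below plugs in the coefficient reported above and n = 49 (the length of the two arrays); `t_stat` and `p_manual` are names introduced for illustration.

```python
import numpy as np
from scipy import stats

# Coefficient reported in the output above; n is the length of the two arrays
r = -0.05523529619792324
n = 49

# Test statistic for H0: true correlation is zero
t_stat = r * np.sqrt((n - 2) / (1 - r ** 2))

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_stat), df=n - 2)

print(f"t = {t_stat:.3f}, p = {p_manual:.4f}")
```

The result should be close to the p-value printed by `pearsonr`, which shows that the significance depends on both the size of r and the sample size.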

### Q5. A survey was conducted to examine the relationship between age and preference for a particular brand of soft drink. The survey results are shown below:

| Age (years) | Soft drink preference |
|:------------|:----------------------|
| 25          | Coke                  |
| 42          | Pepsi                 |
| 37          | Mountain dew          |
| 19          | Coke                  |
| 31          | Pepsi                 |
| 28          | Coke                  |

The data show the age and soft drink preference of 6 individuals. To compute a correlation coefficient at all, the categorical variable "Soft drink preference" must first be converted into numbers. Label encoding is used here, although the integer codes it assigns to a nominal variable are arbitrary and carry no inherent order.

In [6]:
from sklearn.preprocessing import LabelEncoder

# Create a LabelEncoder object
le = LabelEncoder()

# Encode the "Soft drink Preference" variable
preference_encoded = le.fit_transform(['Coke', 'Pepsi', 'Mountain dew', 'Coke', 'Pepsi', 'Coke'])

# Print the encoded labels
print(preference_encoded)

[0 2 1 0 2 0]


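With the preference expressed as integers, we can calculate the Pearson correlation coefficient between age and soft drink preference using the pearsonr() function from the scipy.stats module. Note that the cell below does not reuse the LabelEncoder output (which is alphabetical: Coke = 0, Mountain dew = 1, Pepsi = 2); it uses a hand-assigned coding instead (Coke = 0, Pepsi = 1, Mountain dew = 2).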

In [7]:
from scipy.stats import pearsonr

age = [25, 42, 37, 19, 31, 28]
drink_preference = [0, 1, 2, 0, 1, 0]

corr, _ = pearsonr(age, drink_preference)
print("Pearson correlation coefficient:", corr)

Pearson correlation coefficient: 0.7587035441865058


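The Pearson correlation coefficient between age and the encoded soft drink preference is 0.76. Taken at face value, this is a positive correlation: with the coding Coke = 0, Pepsi = 1, Mountain dew = 2, older respondents tend to have higher codes. The result should be interpreted with care, however. Soft drink preference is a nominal variable, so the order of the codes is arbitrary and a different assignment of integers gives a different coefficient. With only 6 observations, the estimate is also very unstable.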
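The dependence on the coding can be checked directly. The sketch below computes Pearson's coefficient for two equally arbitrary codings of the same six answers, and then a coding-independent alternative, the correlation ratio (eta squared), which is the share of the variance in age explained by the preference groups. The names `coding_a`, `coding_b`, and `eta_squared` are introduced for illustration, and the choice of eta squared is a suggestion rather than part of the original exercise.

```python
import numpy as np
from scipy.stats import pearsonr

age = np.array([25, 42, 37, 19, 31, 28])
drinks = np.array(['Coke', 'Pepsi', 'Mountain dew', 'Coke', 'Pepsi', 'Coke'])

# Two arbitrary integer codings of the same nominal variable
coding_a = {'Coke': 0, 'Pepsi': 1, 'Mountain dew': 2}  # hand-assigned order
coding_b = {'Coke': 0, 'Mountain dew': 1, 'Pepsi': 2}  # alphabetical order

r_a, _ = pearsonr(age, [coding_a[d] for d in drinks])
r_b, _ = pearsonr(age, [coding_b[d] for d in drinks])

# Correlation ratio: between-group sum of squares over total sum of squares
grand_mean = age.mean()
ss_total = ((age - grand_mean) ** 2).sum()
ss_between = sum(
    (drinks == d).sum() * (age[drinks == d].mean() - grand_mean) ** 2
    for d in np.unique(drinks)
)
eta_squared = ss_between / ss_total

print(f"Pearson, coding A: {r_a:.4f}")
print(f"Pearson, coding B: {r_b:.4f}")
print(f"eta squared:       {eta_squared:.4f}")
```

The two Pearson values differ (roughly 0.76 and 0.77 here) even though the underlying answers are identical, while eta squared does not change when the categories are relabelled.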

### Q6. A company is interested in examining the relationship between the number of sales calls made per day and the number of sales made per week. The company collected data on both variables from a sample of 30 sales representatives. Calculate the Pearson correlation coefficient between these two variables.

In [8]:
import pandas as pd

# Creating example data
sales_calls = [12, 15, 18, 10, 8, 11, 16, 20, 22, 25, 14, 17, 19, 21, 24, 23, 13, 9, 7, 6, 4, 3, 2, 1, 5, 19, 17, 14, 11, 9]
sales_made = [4, 6, 8, 2, 1, 3, 7, 9, 10, 12, 5, 6, 7, 9, 11, 10, 4, 2, 1, 0, 0, 0, 0, 0, 0, 7, 6, 5, 3, 2]

# Creating a pandas DataFrame
data = pd.DataFrame({'Sales Calls': sales_calls, 'Sales Made': sales_made})

# Calculating the Pearson correlation coefficient
corr_coef = data['Sales Calls'].corr(data['Sales Made'], method='pearson')

print('Pearson correlation coefficient:', corr_coef)

Pearson correlation coefficient: 0.9808073526250323


Interpretation: The Pearson correlation coefficient is 0.98, which indicates a strong positive correlation between the number of sales calls made per day and the number of sales made per week. This suggests that as the number of sales calls increases, the number of sales made also tends to increase. However, correlation does not imply causation, so further analysis is needed to determine if there is a causal relationship between these variables.
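pandas' `Series.corr` returns only the coefficient. If a p-value is also wanted, one option is to pass the same columns to `scipy.stats.pearsonr`, as sketched below with the data from this question. The names `r_pandas`, `r_scipy`, and `p_value` are introduced for illustration.

```python
import pandas as pd
from scipy.stats import pearsonr

sales_calls = [12, 15, 18, 10, 8, 11, 16, 20, 22, 25, 14, 17, 19, 21, 24,
               23, 13, 9, 7, 6, 4, 3, 2, 1, 5, 19, 17, 14, 11, 9]
sales_made = [4, 6, 8, 2, 1, 3, 7, 9, 10, 12, 5, 6, 7, 9, 11,
              10, 4, 2, 1, 0, 0, 0, 0, 0, 0, 7, 6, 5, 3, 2]

data = pd.DataFrame({'Sales Calls': sales_calls, 'Sales Made': sales_made})

# Coefficient from pandas (Pearson is the default method)
r_pandas = data['Sales Calls'].corr(data['Sales Made'])

# Same coefficient from SciPy, together with a two-sided p-value
r_scipy, p_value = pearsonr(data['Sales Calls'], data['Sales Made'])

print(f"pandas r: {r_pandas:.4f}")
print(f"scipy r:  {r_scipy:.4f}, p-value: {p_value:.2e}")
```

Printing the p-value in scientific notation avoids the misleading `0.00` that a fixed two-decimal format produces for very small values.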