### Q1. Pearson correlation coefficient is a measure of the linear relationship between two variables. Suppose you have collected data on the amount of time students spend studying for an exam and their final exam scores. Calculate the Pearson correlation coefficient between these two variables and interpret the result.

In [1]:
import pandas as pd

In [2]:
df = pd.DataFrame({
    'Time_Studied': [4,3,5,7,6],
    'Marks_Obtained': [60 ,45, 70, 90, 83]
})

In [3]:
df

| | Time_Studied | Marks_Obtained |
|---|---|---|
| 0 | 4 | 60 |
| 1 | 3 | 45 |
| 2 | 5 | 70 |
| 3 | 7 | 90 |
| 4 | 6 | 83 |


The formula for finding Pearson's correlation between both the variables is:


![Screenshot 2023-12-20 204533.png](attachment:6f1b1502-43ae-4868-800f-5ef07aea86b2.png)
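
For reference, the usual textbook form of the sample Pearson coefficient is given here. It may differ in notation from the screenshot. Here $x_i$ stands for the study times, $y_i$ for the marks, $\bar{x}$ and $\bar{y}$ for their means, and $n$ for the number of students:

$$
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$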


We can also find the Pearson's correlation using Python:

In [4]:
df.corr(method='pearson')

| | Time_Studied | Marks_Obtained |
|---|---|---|
| Time_Studied | 1.0 | 0.993678 |
| Marks_Obtained | 0.993678 | 1.0 |
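
As a cross-check, the coefficient can also be computed by hand from the deviations of each value from its mean. This is a minimal sketch with NumPy; the variable names `dx`, `dy`, and `r_manual` are illustrative and not part of the original notebook.

```python
import numpy as np

# Data copied from the Q1 DataFrame
time_studied = np.array([4, 3, 5, 7, 6], dtype=float)
marks_obtained = np.array([60, 45, 70, 90, 83], dtype=float)

# Deviation of each observation from its variable's mean
dx = time_studied - time_studied.mean()
dy = marks_obtained - marks_obtained.mean()

# Pearson's r: sum of cross-products divided by the
# square root of the product of the sums of squares
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(round(r_manual, 6))  # 0.993678
```

The result agrees with the off-diagonal entry returned by `df.corr(method='pearson')`.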


**Interpretation:**

The Pearson correlation coefficient between "Time_Studied" and "Marks_Obtained" is 0.993678, which is very close to +1 and indicates a very strong positive linear relationship between the two variables.

The two off-diagonal entries of the matrix are the same coefficient shown twice, and the diagonal entries are each variable's correlation with itself (always 1). In this sample, as "Time_Studied" increases, "Marks_Obtained" increases almost linearly.

### Q2. Spearman's rank correlation is a measure of the monotonic relationship between two variables. Suppose you have collected data on the amount of sleep individuals get each night and their overall job satisfaction level on a scale of 1 to 10. Calculate the Spearman's rank correlation between these two variables and interpret the result.

In [5]:
df = pd.DataFrame({
    'Sleep_Amount': [6, 7, 5, 8, 7, 6, 5, 9, 8, 7],
    'Job_Satisfaction': [8, 9, 4, 7, 6, 5, 3, 10, 9, 8]
})

In [6]:
df

| | Sleep_Amount | Job_Satisfaction |
|---|---|---|
| 0 | 6 | 8 |
| 1 | 7 | 9 |
| 2 | 5 | 4 |
| 3 | 8 | 7 |
| 4 | 7 | 6 |
| 5 | 6 | 5 |
| 6 | 5 | 3 |
| 7 | 9 | 10 |
| 8 | 8 | 9 |
| 9 | 7 | 8 |


The formula for calculating Spearman's correlation coefficient between the two variables is:

![Screenshot 2023-12-20 210252.png](attachment:787b48ac-59bb-4aa4-9e6c-1276e5c8d26d.png)
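
For reference, Spearman's coefficient is Pearson's $r$ computed on the ranks $R(x_i)$ and $R(y_i)$ instead of the raw values. When no values are tied, it reduces to the familiar shortcut below, where $d_i = R(x_i) - R(y_i)$ and $n$ is the number of observations. The notation may differ from the screenshot.

$$
\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}
$$

The sleep and job satisfaction data contain tied values, so the shortcut is only approximate here. The rank-based definition is the one that applies exactly.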

We can also find Spearman's rank correlation using Python:

In [8]:
df['Sleep_Amount'].corr(df['Job_Satisfaction'], method='spearman')   

0.7976045521851899

In [9]:
df.corr(method='spearman')

| | Sleep_Amount | Job_Satisfaction |
|---|---|---|
| Sleep_Amount | 1.0 | 0.797605 |
| Job_Satisfaction | 0.797605 | 1.0 |
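
To make the "rank" part concrete, the sketch below ranks both columns, with tied values sharing the average of their ranks, and then takes the Pearson correlation of those ranks. The names `df_sleep`, `ranks`, `rho_from_ranks`, and `rho_direct` are illustrative and not part of the original notebook.

```python
import pandas as pd

df_sleep = pd.DataFrame({
    'Sleep_Amount': [6, 7, 5, 8, 7, 6, 5, 9, 8, 7],
    'Job_Satisfaction': [8, 9, 4, 7, 6, 5, 3, 10, 9, 8],
})

# Replace each value by its rank; tied values share the average of their ranks
ranks = df_sleep.rank(method='average')

# Pearson correlation computed on the ranks
rho_from_ranks = ranks['Sleep_Amount'].corr(ranks['Job_Satisfaction'], method='pearson')

# Direct Spearman computation for comparison
rho_direct = df_sleep.corr(method='spearman').loc['Sleep_Amount', 'Job_Satisfaction']

print(rho_from_ranks, rho_direct)
```

Both numbers should agree with the 0.797605 reported in the correlation matrix.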


**Interpretation:**

The correlation coefficient between Sleep Amount and Job Satisfaction is 0.797605, which is close to +1. This indicates a strong positive monotonic correlation between these two variables.

As the amount of sleep individuals get each night increases, their overall job satisfaction level tends to increase as well. Conversely, as the amount of sleep decreases, job satisfaction tends to decrease.

### Q3. Suppose you are conducting a study to examine the relationship between the number of hours of exercise per week and body mass index (BMI) in a sample of adults. You collected data on both variables for 50 participants. Calculate the Pearson correlation coefficient and the Spearman's rank correlation between these two variables and compare the results.

To calculate the Pearson correlation coefficient and the Spearman's rank correlation between the number of hours of exercise per week and body mass index (BMI) in a sample of 50 adults, we need to have data for both variables.

We can use Python to generate random data for both variables and then calculate the Pearson and Spearman's rank correlation between them:

In [11]:
import numpy as np

In [12]:
#generate random data
num_of_hours = np.random.normal(5, 2, 50) #50 random numbers of normal distribution with mean 5 and standard deviation 2
bmi = np.random.normal(25, 5, 50) #50 random numbers of normal distribution with mean 25 and standard deviation 5

df = pd.DataFrame({
    'num_of_hours': num_of_hours,
    'bmi': bmi
})

In [13]:
df.head()

| | num_of_hours | bmi |
|---|---|---|
| 0 | 6.325386 | 20.613008 |
| 1 | 4.798152 | 25.595971 |
| 2 | 5.899959 | 18.516436 |
| 3 | 4.854638 | 28.158847 |
| 4 | 3.495442 | 32.98829 |


**Note: the values above are randomly generated, so they may not be realistic, and they change every time the cell is re-run.**

**Pearson Correlation:**

In [14]:
df.corr(method='pearson')

| | num_of_hours | bmi |
|---|---|---|
| num_of_hours | 1.0 | -0.053217 |
| bmi | -0.053217 | 1.0 |


**Spearman's Rank Correlation:**

In [15]:
df.corr(method='spearman')

| | num_of_hours | bmi |
|---|---|---|
| num_of_hours | 1.0 | -0.037887 |
| bmi | -0.037887 | 1.0 |


**Results Comparison:**

The Pearson correlation coefficient measures the linear relationship between two variables. It ranges from -1 (perfect negative correlation) to +1 (perfect positive correlation), with 0 indicating no linear correlation. In this run, the Pearson correlation coefficient between num_of_hours and bmi is -0.053217, which is very close to 0 and indicates essentially no linear relationship between the two variables.

The Spearman correlation coefficient measures the monotonic relationship between two variables: it assesses the strength and direction of the relationship without assuming any particular functional form. It also ranges from -1 to +1, with 0 indicating no monotonic correlation. In this run, the Spearman correlation coefficient between num_of_hours and bmi is -0.037887, which is also very close to 0. The two measures agree, as expected, because the two columns were generated independently of each other.
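
Because the data are drawn without a fixed seed, the coefficients change every time the notebook is re-run, and any numbers quoted in the text can drift out of sync with the output. One possible way to keep them reproducible is a seeded generator. This is a sketch only: the seed value `42` and the names `rng`, `df_seeded`, `pearson_r`, and `spearman_rho` are arbitrary choices, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Seeded generator; the seed 42 is an arbitrary value chosen for illustration
rng = np.random.default_rng(42)

df_seeded = pd.DataFrame({
    'num_of_hours': rng.normal(5, 2, 50),  # mean 5, standard deviation 2
    'bmi': rng.normal(25, 5, 50),          # mean 25, standard deviation 5
})

pearson_r = df_seeded['num_of_hours'].corr(df_seeded['bmi'], method='pearson')
spearman_rho = df_seeded.corr(method='spearman').loc['num_of_hours', 'bmi']

print(f"Pearson: {pearson_r:.6f}, Spearman: {spearman_rho:.6f}")
```

Re-running this cell with the same seed gives the same two coefficients each time.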

### Q4. A researcher is interested in examining the relationship between the number of hours individuals spend watching television per day and their level of physical activity. The researcher collected data on both variables from a sample of 50 participants. Calculate the Pearson correlation coefficient between these two variables.

To calculate the Pearson correlation coefficient between the number of hours individuals spend watching television per day and their level of physical activity in a sample of 50 participants, we need data for both variables.

We can use Python to generate random data for both variables and then calculate the Pearson correlation coefficient:

In [16]:
# 50 random numbers from a normal distribution with mean 6 and standard deviation 4
num_of_hours = np.random.normal(6, 4, 50)

# 50 random numbers from a normal distribution with mean 10 and standard deviation 6
amount_of_activity = np.random.normal(10, 6, 50)

df = pd.DataFrame({
    'num_of_hours': num_of_hours,
    'physical_activity': amount_of_activity
})

In [17]:
df.head()

| | num_of_hours | physical_activity |
|---|---|---|
| 0 | 8.37006 | 5.388897 |
| 1 | 10.383614 | 10.87609 |
| 2 | -5.410175 | 22.177463 |
| 3 | 8.760097 | 9.813546 |
| 4 | 8.659425 | 13.065074 |


**Note: the values above are randomly generated, so they may not be realistic (for example, row 2 has a negative number of hours), and they change every time the cell is re-run.**

**Pearson Correlation:**

In [18]:
df.corr(method='pearson')

| | num_of_hours | physical_activity |
|---|---|---|
| num_of_hours | 1.0 | 0.057074 |
| physical_activity | 0.057074 | 1.0 |
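
A normal distribution is unbounded, so the simulated sample can contain impossible values such as a negative number of television hours. One possible workaround is to clip the draws to a plausible range before correlating them. This is a sketch under stated assumptions: the seed `0`, the 0 to 24 hour bounds, the lower bound of 0 for activity, and the names `tv_hours`, `activity`, `df_clipped`, and `r_clipped` are all illustrative choices, and clipping slightly distorts the distribution.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, assumed for reproducibility

# Same means and standard deviations as the original cell, clipped to plausible ranges
tv_hours = np.clip(rng.normal(6, 4, 50), 0, 24)     # hours per day must stay within [0, 24]
activity = np.clip(rng.normal(10, 6, 50), 0, None)  # activity level cannot be negative

df_clipped = pd.DataFrame({
    'num_of_hours': tv_hours,
    'physical_activity': activity,
})

r_clipped = df_clipped['num_of_hours'].corr(df_clipped['physical_activity'], method='pearson')
print(round(r_clipped, 6))
```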


### Q5. A survey was conducted to examine the relationship between age and preference for a particular brand of soft drink. The survey results are shown below:

In [19]:
df = pd.DataFrame({
    "Age(Years)": [25, 42, 37, 19, 31, 28],
    "Soft Drink Preference": ['Coke', 'Pepsi', 'Mountain Dew', 'Coke', 'Pepsi', 'Coke']
})

In [20]:
df

| | Age(Years) | Soft Drink Preference |
|---|---|---|
| 0 | 25 | Coke |
| 1 | 42 | Pepsi |
| 2 | 37 | Mountain Dew |
| 3 | 19 | Coke |
| 4 | 31 | Pepsi |
| 5 | 28 | Coke |


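To compute a Pearson correlation, we first need to convert the categorical variable to a numerical one. One way is label encoding, which maps each brand to an integer code. Note that `LabelEncoder` assigns the codes in alphabetical order (Coke = 0, Mountain Dew = 1, Pepsi = 2), so the resulting coefficient depends on this arbitrary ordering and should be interpreted with caution: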

In [23]:
from sklearn.preprocessing import LabelEncoder

In [25]:
encoder = LabelEncoder()
encoded_col = encoder.fit_transform(df['Soft Drink Preference'])
df_encoded = pd.DataFrame(encoded_col, columns=['Soft Drink Preference Enc'])
df_new = pd.concat([df, df_encoded], axis=1)
df_new = df_new.drop('Soft Drink Preference', axis=1)     

In [26]:
df_new

| | Age(Years) | Soft Drink Preference Enc |
|---|---|---|
| 0 | 25 | 0 |
| 1 | 42 | 2 |
| 2 | 37 | 1 |
| 3 | 19 | 0 |
| 4 | 31 | 2 |
| 5 | 28 | 0 |


In [27]:
df_new.corr(method='pearson')

| | Age(Years) | Soft Drink Preference Enc |
|---|---|---|
| Age(Years) | 1.0 | 0.769175 |
| Soft Drink Preference Enc | 0.769175 | 1.0 |
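
Since the brands have no natural order, an alternative worth considering is one 0/1 indicator column per brand, correlating age with each indicator separately. Pearson's r against a binary variable is the point-biserial correlation. This is a sketch, not the notebook's original method; the names `df_drinks`, `indicators`, and `age_vs_brand` are illustrative.

```python
import pandas as pd

df_drinks = pd.DataFrame({
    "Age(Years)": [25, 42, 37, 19, 31, 28],
    "Soft Drink Preference": ['Coke', 'Pepsi', 'Mountain Dew', 'Coke', 'Pepsi', 'Coke'],
})

# One 0/1 indicator column per brand, so no ordering between brands is implied
indicators = pd.get_dummies(df_drinks['Soft Drink Preference'], dtype=float)

# Pearson correlation of age with each indicator (point-biserial correlation)
age_vs_brand = indicators.corrwith(df_drinks['Age(Years)'])

print(age_vs_brand)
```

A negative value for a brand means its drinkers tend to be younger than the rest of the sample, and a positive value means they tend to be older. With only six respondents, any of these coefficients is a rough indication at best.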


### Q6. A company is interested in examining the relationship between the number of sales calls made per day and the number of sales made per week. The company collected data on both variables from a sample of 30 sales representatives. Calculate the Pearson correlation coefficient between these two variables.

This time, rather than generating 30 random values for each variable, I manually enter 10 values for each.

In [28]:
num_of_calls = [20, 30, 10, 40, 60, 80, 60, 50, 40, 100]
num_of_sales = [2, 3, 2, 5, 5, 6, 8, 5, 7, 20]

df = pd.DataFrame({
    'num_of_calls': num_of_calls,
    'num_of_sales': num_of_sales
})

In [29]:
df

| | num_of_calls | num_of_sales |
|---|---|---|
| 0 | 20 | 2 |
| 1 | 30 | 3 |
| 2 | 10 | 2 |
| 3 | 40 | 5 |
| 4 | 60 | 5 |
| 5 | 80 | 6 |
| 6 | 60 | 8 |
| 7 | 50 | 5 |
| 8 | 40 | 7 |
| 9 | 100 | 20 |
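
With the data in place, the Pearson coefficient for these ten representatives can be computed the same way as in the earlier questions. This is a minimal sketch; `df_sales` and `r_sales` are illustrative names chosen so that the existing `df` is not overwritten.

```python
import pandas as pd

df_sales = pd.DataFrame({
    'num_of_calls': [20, 30, 10, 40, 60, 80, 60, 50, 40, 100],
    'num_of_sales': [2, 3, 2, 5, 5, 6, 8, 5, 7, 20],
})

# Pearson correlation between calls per day and sales per week
r_sales = df_sales['num_of_calls'].corr(df_sales['num_of_sales'], method='pearson')

print(round(r_sales, 4))  # approximately 0.824
```

A value near 0.82 points to a strong positive linear relationship in this small sample. The last representative (100 calls, 20 sales) lies far from the others and contributes heavily to the coefficient.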
