### Q1. Pearson correlation coefficient is a measure of the linear relationship between two variables. Suppose you have collected data on the amount of time students spend studying for an exam and their final exam scores. Calculate the Pearson correlation coefficient between these two variables and interpret the result.

The Pearson correlation coefficient, also known as Pearson's r, is a statistical measure that quantifies the linear relationship between two continuous variables. It indicates both the direction (positive or negative) and the strength of the linear relationship between the variables. The Pearson correlation coefficient ranges from -1 to +1, where -1 represents a perfect negative linear correlation, +1 represents a perfect positive linear correlation, and 0 represents no linear correlation between the variables.

In [27]:
import pandas as pd
from scipy.stats import pearsonr

# Sample data: hours of study and corresponding exam scores
data = {
    'Study_Hours': [2, 3, 5, 6, 8, 10, 12],
    'Exam_Score': [50, 55, 65, 70, 80, 88, 95]
}

df = pd.DataFrame(data)

# Calculate Pearson correlation coefficient
corr_coefficient, p_value = pearsonr(df['Study_Hours'], df['Exam_Score'])

print("Pearson Correlation Coefficient:", corr_coefficient)
print("P-value:", p_value)


Pearson Correlation Coefficient: 0.9977590270960863
P-value: 4.560618812867042e-07


With r ≈ 0.998 and a p-value of about 4.6e-07, there is a very strong, statistically significant positive linear relationship between the number of hours studied and the exam score in this sample. As students study more, their scores tend to increase.
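
For reference, Pearson's r is the covariance of the two variables divided by the product of their standard deviations. The sketch below recomputes it by hand with NumPy for the same seven observations. The variable names are illustrative, and the result should agree with the `pearsonr` output above up to floating-point rounding.

```python
import numpy as np

# Same sample data as above
study_hours = np.array([2, 3, 5, 6, 8, 10, 12], dtype=float)
exam_score = np.array([50, 55, 65, 70, 80, 88, 95], dtype=float)

# Deviations of each observation from its variable's mean
dx = study_hours - study_hours.mean()
dy = exam_score - exam_score.mean()

# r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print("Manual Pearson r:", r_manual)
```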

### Q2. Spearman's rank correlation is a measure of the monotonic relationship between two variables. Suppose you have collected data on the amount of sleep individuals get each night and their overall job satisfaction level on a scale of 1 to 10. Calculate the Spearman's rank correlation between these two variables and interpret the result.

**Spearman's rank correlation coefficient**, often referred to as Spearman's rho (ρ), is a non-parametric measure of the monotonic relationship between two variables. It assesses the strength and direction of the association between two variables by ranking their values and then computing the Pearson correlation coefficient on the ranked data.

Unlike **Pearson's correlation**, which measures the linear relationship, Spearman's rank correlation captures any monotonic relationship between the variables. A monotonic relationship implies that as the values of one variable increase, the values of the other variable consistently increase or consistently decrease, but not necessarily at a constant rate.

Spearman's rank correlation coefficient ranges from -1 to +1, where -1 indicates a perfect negative monotonic relationship, +1 indicates a perfect positive monotonic relationship, and 0 suggests no monotonic relationship between the variables. It is commonly used when the data does not meet the assumptions of normality or when the relationship between variables is non-linear.
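
To illustrate the difference, here is a minimal sketch with made-up data in which `y` grows as the cube of `x`. The relationship is perfectly monotonic but not linear, so Spearman's ρ is 1 while Pearson's r stays below 1.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Made-up data: y increases with x, but not at a constant rate
x = np.arange(1, 11)
y = x ** 3

pearson_r, _ = pearsonr(x, y)
spearman_rho, _ = spearmanr(x, y)

print("Pearson r   :", pearson_r)     # below 1, because the relationship is curved
print("Spearman rho:", spearman_rho)  # 1 (up to rounding), because the ranks agree exactly
```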

In [28]:
import pandas as pd

amount_of_sleep = [7, 6, 8, 5, 6, 7, 8, 6, 5, 7]
job_satisfaction = [8, 7, 9, 3, 4, 7, 10, 7, 6, 8]

df = pd.DataFrame({"sleep hour": amount_of_sleep, "job satisfaction" : job_satisfaction})
df.head()

Unnamed: 0,sleep hour,job satisfaction
0,7,8
1,6,7
2,8,9
3,5,3
4,6,4


In [29]:
df.corr(method= 'spearman' )

Unnamed: 0,sleep hour,job satisfaction
sleep hour,1.0,0.914401
job satisfaction,0.914401,1.0


The Spearman correlation of about 0.91 is positive and close to 1, which indicates a strong positive monotonic relationship: in this sample, overall job satisfaction tends to increase with the number of hours of sleep.
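
Because Spearman's ρ is Pearson's r computed on ranks, the value above can be reproduced in two steps. The sketch below ranks both variables with `scipy.stats.rankdata` (tied values receive the average rank, which is the default) and then applies `pearsonr` to the ranks. The variable names for the ranks are illustrative.

```python
from scipy.stats import pearsonr, rankdata, spearmanr

amount_of_sleep = [7, 6, 8, 5, 6, 7, 8, 6, 5, 7]
job_satisfaction = [8, 7, 9, 3, 4, 7, 10, 7, 6, 8]

# Step 1: replace each value by its rank (tied values share the average rank)
sleep_ranks = rankdata(amount_of_sleep)
satisfaction_ranks = rankdata(job_satisfaction)

# Step 2: Pearson correlation of the ranks
rho_from_ranks, _ = pearsonr(sleep_ranks, satisfaction_ranks)

# Direct computation for comparison
rho_direct, p_value = spearmanr(amount_of_sleep, job_satisfaction)

print("Spearman rho via ranks:", rho_from_ranks)
print("Spearman rho direct   :", rho_direct)
print("P-value               :", p_value)
```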

### Q3. Suppose you are conducting a study to examine the relationship between the number of hours of exercise per week and body mass index (BMI) in a sample of adults. You collected data on both variables for 50 participants. Calculate the Pearson correlation coefficient and the Spearman's rank correlation between these two variables and compare the results.

In [30]:
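import numpy as np
import pandas as pd

# Seed NumPy's generator for reproducibility
# (random.seed() only affects Python's random module, not np.random)
np.random.seed(42)

# Generate random data for the number of hours of exercise per week
# (integers from 1 to 19; the upper bound of randint is exclusive)
exercise_hours = np.random.randint(1, 20, size=50)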

# Generate random data for body mass index (BMI) (between 15 and 35)
bmi_values = np.random.uniform(15, 35,size=50)

df = pd.DataFrame({"Excercise hours per week " : exercise_hours,
                   "BMI" : bmi_values
                  })
df.head()

Unnamed: 0,Excercise hours per week,BMI
0,15,18.688099
1,7,15.435407
2,9,15.831593
3,10,23.540984
4,2,33.574366


In [31]:
df.corr(method='pearson')

Unnamed: 0,Excercise hours per week,BMI
Excercise hours per week,1.0,-0.052401
BMI,-0.052401,1.0


In [32]:
df.corr(method = 'spearman')

Unnamed: 0,Excercise hours per week,BMI
Excercise hours per week,1.0,-0.047448
BMI,-0.047448,1.0


Based on the values above, the two methods differ very little: about -0.052 for Pearson and -0.047 for Spearman.

Both coefficients are close to 0, so there is little or no linear or monotonic relationship between the variables. This is expected, because the two columns were generated independently at random.
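
The same comparison can also be run with `scipy.stats`, which reports a p-value alongside each coefficient. This is a sketch under assumptions: it draws fresh data with NumPy's `default_rng` (the seed 42 is arbitrary), so the numbers will not match the tables above. Since both columns are again generated independently, both coefficients should still be small.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(42)  # arbitrary seed, so that reruns give the same draws

# Independent draws: exercise hours (integers 1 to 19) and BMI (between 15 and 35)
exercise_hours = rng.integers(1, 20, size=50)
bmi_values = rng.uniform(15, 35, size=50)

pearson_r, pearson_p = pearsonr(exercise_hours, bmi_values)
spearman_rho, spearman_p = spearmanr(exercise_hours, bmi_values)

print(f"Pearson r    = {pearson_r:.3f} (p = {pearson_p:.3f})")
print(f"Spearman rho = {spearman_rho:.3f} (p = {spearman_p:.3f})")
```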

### Q4. A researcher is interested in examining the relationship between the number of hours individuals spend watching television per day and their level of physical activity. The researcher collected data on both variables from a sample of 50 participants. Calculate the Pearson correlation coefficient between these two variables.

In [33]:
import numpy as np
import pandas as pd

# Seed NumPy's generator for reproducibility
# (random.seed() only affects Python's random module, not np.random)
np.random.seed(382)

# Generate random data for the number of hours of physical activity per day
# (integers from 1 to 17; the upper bound of randint is exclusive)
physical_activity_hours = np.random.randint(1, 18, size=50)

# Generate random data for the number of hours of watching television per day
# (integers from 1 to 17; the upper bound of randint is exclusive)
watching_televison_hours = np.random.randint(1, 18, size=50)

df = pd.DataFrame({"Physical Activity hours per day " : physical_activity_hours,
                   "Watching Televison hours per day" : watching_televison_hours
                  })
df.head()

Unnamed: 0,Physical Activity hours per day,Watching Televison hours per day
0,8,13
1,3,16
2,12,9
3,7,13
4,4,6


In [34]:
df.corr(method='pearson')

Unnamed: 0,Physical Activity hours per day,Watching Televison hours per day
Physical Activity hours per day,1.0,-0.091924
Watching Televison hours per day,-0.091924,1.0


In [35]:
df.corr(method='spearman')

Unnamed: 0,Physical Activity hours per day,Watching Televison hours per day
Physical Activity hours per day,1.0,-0.091461
Watching Televison hours per day,-0.091461,1.0


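Pearson's r is about -0.092 and Spearman's ρ is about -0.091, so there is no meaningful difference between the two correlation methods. Both values are negative but close to 0, which indicates essentially no linear or monotonic relationship between television time and physical activity in this randomly generated sample.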

### Q5. A survey was conducted to examine the relationship between age and preference for a particular brand of soft drink. The survey results are shown below:

![image.png](attachment:image.png)

In [36]:
df = pd.DataFrame({'Age(in Years)' : [25,42,37,19,31,28],
                   'Soft Drink Preference' : ["Coke","Pepsi","Mountain Dew","Coke","Pepsi","Coke"]
                  })
df

Unnamed: 0,Age(in Years),Soft Drink Preference
0,25,Coke
1,42,Pepsi
2,37,Mountain Dew
3,19,Coke
4,31,Pepsi
5,28,Coke


In [37]:
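# First, convert the soft drink preference to numeric columns so that correlations can be computed.
# One-hot encoding (OHE) creates one 0/1 indicator column per brand.
from sklearn.preprocessing import OneHotEncoder

encode = OneHotEncoder()
# Fit on a plain NumPy array so the generated feature names keep the generic "x0_" prefix
values = encode.fit_transform(df[['Soft Drink Preference']].to_numpy()).toarray()

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead
encode_df = pd.DataFrame(values, columns=encode.get_feature_names_out())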
df = pd.concat([df,encode_df],axis = 1)
df.drop(columns="Soft Drink Preference",inplace=True)
df

Unnamed: 0,Age(in Years),x0_Coke,x0_Mountain Dew,x0_Pepsi
0,25,1.0,0.0,0.0
1,42,0.0,0.0,1.0
2,37,0.0,1.0,0.0
3,19,1.0,0.0,0.0
4,31,0.0,0.0,1.0
5,28,1.0,0.0,0.0


In [38]:
df.corr(method="pearson")

Unnamed: 0,Age(in Years),x0_Coke,x0_Mountain Dew,x0_Pepsi
Age(in Years),1.0,-0.83724,0.394132,0.576439
x0_Coke,-0.83724,1.0,-0.447214,-0.707107
x0_Mountain Dew,0.394132,-0.447214,1.0,-0.316228
x0_Pepsi,0.576439,-0.707107,-0.316228,1.0


In [39]:
df.corr(method="spearman")

Unnamed: 0,Age(in Years),x0_Coke,x0_Mountain Dew,x0_Pepsi
Age(in Years),1.0,-0.87831,0.392792,0.621059
x0_Coke,-0.87831,1.0,-0.447214,-0.707107
x0_Mountain Dew,0.392792,-0.447214,1.0,-0.316228
x0_Pepsi,0.621059,-0.707107,-0.316228,1.0


From the Spearman correlations (based on only six respondents, so tentative) we observe:

Older respondents tend to prefer Pepsi: it has the highest positive correlation with age, about 0.62.

Older respondents may also lean towards Mountain Dew, but the correlation (about 0.39) is weaker than for Pepsi.

As age increases, the preference for Coke drops markedly: the correlation of about -0.88 is close to -1, indicating a strong negative association.
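
The Pearson correlation between a continuous variable and a 0/1 indicator is also known as the point-biserial correlation. As a cross-check, the sketch below computes it for age against a Coke indicator using `scipy.stats.pointbiserialr`. The indicator list is typed in by hand from the table above, and the result should match the Pearson entry for `x0_Coke`. The p-value it prints is based on only six respondents.

```python
from scipy.stats import pointbiserialr

age = [25, 42, 37, 19, 31, 28]
prefers_coke = [1, 0, 0, 1, 0, 1]  # 1 if the respondent chose Coke, else 0

# First argument is the binary variable, second is the continuous one
r_pb, p_value = pointbiserialr(prefers_coke, age)

print("Point-biserial r (age vs. Coke):", r_pb)
print("P-value:", p_value)
```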

### Q6. A company is interested in examining the relationship between the number of sales calls made per day and the number of sales made per week. The company collected data on both variables from a sample of 30 sales representatives. Calculate the Pearson correlation coefficient between these two variables.

In [40]:
import numpy as np
import pandas as pd
import random

# Generate synthetic data
np.random.seed(4323)  # for reproducibility


sales_calls = np.random.randint(20, 40, size=30)
sales_per_week = np.random.uniform(2000, 4000, size=30)
df = pd.DataFrame({"Sales Calls" :sales_calls,
                  "Weekly Sales" : sales_per_week})
df.head()

Unnamed: 0,Sales Calls,Weekly Sales
0,36,3234.082997
1,37,3657.311542
2,26,2014.340867
3,29,2495.656869
4,38,2223.092926


In [41]:
#Calculating Pearson by using Dataframe
df.corr(method='pearson')

Unnamed: 0,Sales Calls,Weekly Sales
Sales Calls,1.0,-0.101161
Weekly Sales,-0.101161,1.0


In [42]:
#Calculating Pearson by using Scipy Library
from scipy.stats import pearsonr

pearson_corr , p_value = pearsonr(sales_calls,sales_per_week)

print(f"Pearson Correlation Coefficient : {pearson_corr}")
print(f"P-value : {p_value} ")

Pearson Correlation Coefficient : -0.10116108915369341
P-value : 0.5947910499968548 
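
The coefficient is close to zero and the p-value is far above the usual 0.05 threshold, so these synthetic data give no evidence of a linear relationship between sales calls and weekly sales. This is expected, since the two columns were generated independently. The p-value itself can be recovered from r and the sample size n through the statistic `t = r * sqrt((n - 2) / (1 - r**2))` with `n - 2` degrees of freedom. The sketch below does this with the r printed above, assuming the two-sided test that `pearsonr` uses by default.

```python
import numpy as np
from scipy import stats

r = -0.10116108915369341  # Pearson r printed above
n = 30                    # number of sales representatives

# t statistic for testing the null hypothesis of no linear correlation
t_stat = r * np.sqrt((n - 2) / (1 - r ** 2))

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)

print("t statistic:", t_stat)
print("P-value    :", p_value)
```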
