## Q1. Pearson correlation coefficient is a measure of the linear relationship between two variables. Suppose you have collected data on the amount of time students spend studying for an exam and their final exam scores. Calculate the Pearson correlation coefficient between these two variables and interpret the result.

The Pearson correlation coefficient, also known as Pearson's r, is a statistical measure that quantifies the linear relationship between two continuous variables. It indicates both the direction (positive or negative) and the strength of the linear relationship between the variables. The Pearson correlation coefficient ranges from -1 to +1, where -1 represents a perfect negative linear correlation, +1 represents a perfect positive linear correlation, and 0 represents no linear correlation between the variables.
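
The definition above can be sketched directly in NumPy: r is the sum of the products of the deviations from the means, divided by the square root of the product of the summed squared deviations. The helper name `pearson_r` and the sample values are illustrative assumptions, not part of the exercise; in practice `DataFrame.corr(method="pearson")` does the same job.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's r computed from deviations about the mean (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations of x from its mean
    dy = y - y.mean()  # deviations of y from its mean
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Illustrative values only: a perfectly linear pair and its mirror image
x = [1, 2, 3, 4, 5]
r_up = pearson_r(x, [2, 4, 6, 8, 10])    # perfect positive linear relationship
r_down = pearson_r(x, [10, 8, 6, 4, 2])  # perfect negative linear relationship
print(r_up, r_down)
```

The two calls land on the ends of the range described above, +1 and -1.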

In [1]:
import pandas as pd
data={"time spent":[12,10,6,9,3,6,4,8,7],"scores":[99,95,70,85,40,70,60,83,78]}
df=pd.DataFrame(data)
df

Unnamed: 0,time spent,scores
0,12,99
1,10,95
2,6,70
3,9,85
4,3,40
5,6,70
6,4,60
7,8,83
8,7,78


In [2]:
df.corr(method="pearson").round(2)

Unnamed: 0,time spent,scores
time spent,1.0,0.96
scores,0.96,1.0

Interpretation: r = 0.96 indicates a very strong positive linear relationship between study time and exam score. In this sample, students who studied longer tended to score higher. The coefficient describes association only; it does not show that the extra study time caused the higher scores.


## Q2. Spearman's rank correlation is a measure of the monotonic relationship between two variables. Suppose you have collected data on the amount of sleep individuals get each night and their overall job satisfaction level on a scale of 1 to 10. Calculate the Spearman's rank correlation between these two variables and interpret the result.

Spearman's rank correlation coefficient, often referred to as Spearman's rho (ρ), is a non-parametric measure of the monotonic relationship between two variables. It assesses the strength and direction of the association between two variables by ranking their values and then computing the Pearson correlation coefficient on the ranked data.
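
A minimal sketch of that two-step recipe, assuming pandas is available: rank each variable, then take the Pearson correlation of the ranks. The values below are made up for illustration (with one tie in `x`), and tied values receive the average rank, which is the default of `Series.rank()`.

```python
import pandas as pd

# Illustrative data (hypothetical values, with one tie in x)
x = pd.Series([10, 20, 20, 40, 50])
y = pd.Series([1, 3, 2, 5, 4])

# Step 1: replace each value by its rank (ties share the average rank)
rx = x.rank()
ry = y.rank()

# Step 2: Pearson correlation computed on the ranks
rho_manual = rx.corr(ry, method="pearson")

# pandas' built-in Spearman correlation, for comparison
rho_builtin = x.corr(y, method="spearman")
print(rho_manual, rho_builtin)
```

The two printed values should agree up to floating-point error.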

Pearson's correlation measures the linear relationship, whereas Spearman's rank correlation captures any monotonic relationship between the variables. A monotonic relationship means that as the values of one variable increase, the values of the other variable consistently increase or consistently decrease, but not necessarily at a constant rate.
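
To illustrate the difference, here is a small sketch with synthetic data (an assumption for illustration, not part of the exercise): `y` grows exponentially with `x`, so the relationship is strictly increasing but far from linear.

```python
import numpy as np
import pandas as pd

x = pd.Series(np.arange(1, 11), dtype=float)
y = np.exp(x)  # strictly increasing in x, but not at a constant rate

pearson = x.corr(y, method="pearson")    # penalised by the curvature
spearman = x.corr(y, method="spearman")  # depends only on the ordering
print(round(pearson, 2), round(spearman, 2))
```

Because the ordering of `y` matches the ordering of `x` exactly, Spearman's coefficient reaches 1, while Pearson's coefficient stays noticeably below 1.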

Spearman's rank correlation coefficient ranges from -1 to +1, where -1 indicates a perfect negative monotonic relationship, +1 indicates a perfect positive monotonic relationship, and 0 suggests no monotonic relationship between the variables. It is commonly used when the data are ordinal, when they do not meet the normality assumption, or when the relationship between the variables is monotonic but non-linear.

In [3]:
import pandas as pd
data={"sleep":[8,9,4,6,4,7],"job satisfaction":[9,9,3,7,8,6]}
df=pd.DataFrame(data)
df

Unnamed: 0,sleep,job satisfaction
0,8,9
1,9,9
2,4,3
3,6,7
4,4,8
5,7,6


In [4]:
df.corr(method="spearman").round(2)

Unnamed: 0,sleep,job satisfaction
sleep,1.0,0.68
job satisfaction,0.68,1.0

Interpretation: ρ = 0.68 indicates a moderately strong positive monotonic relationship. Respondents who sleep more tend to report higher job satisfaction. With only six observations, two of which are tied in each variable, the estimate is imprecise and should be read with caution.


## Q3. Suppose you are conducting a study to examine the relationship between the number of hours of exercise per week and body mass index (BMI) in a sample of adults. You collected data on both variables for 50 participants. Calculate the Pearson correlation coefficient and the Spearman's rank correlation between these two variables and compare the results.

In [5]:
import pandas as pd
import numpy as np
np.random.seed(42)
exercise_hours=np.random.randint(1,15,size=50)
bmi=np.random.randint(16,35,size=50)
df=pd.DataFrame({"exercise hours per week":exercise_hours,"BMI":bmi})
df.head()

Unnamed: 0,exercise hours per week,BMI
0,7,18
1,4,20
2,13,34
3,11,22
4,8,24


In [6]:
df.corr(method="pearson").round(2)

Unnamed: 0,exercise hours per week,BMI
exercise hours per week,1.0,-0.05
BMI,-0.05,1.0


In [7]:
df.corr(method="spearman").round(2)

Unnamed: 0,exercise hours per week,BMI
exercise hours per week,1.0,-0.04
BMI,-0.04,1.0
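
Comparison: the two coefficients (-0.05 and -0.04) are both close to zero and nearly identical, so neither a linear nor a monotonic relationship shows up in this data. That is what one would expect, since the two columns were generated as independent random integers rather than measured. As a rough check, and assuming SciPy is installed, the sketch below regenerates the same data and reports a p-value alongside each coefficient.

```python
import numpy as np
from scipy import stats

# Regenerate the simulated data with the same seed and call order as the cell above
np.random.seed(42)
exercise_hours = np.random.randint(1, 15, size=50)
bmi = np.random.randint(16, 35, size=50)

# Each call returns the coefficient and a two-sided p-value
r, p_r = stats.pearsonr(exercise_hours, bmi)
rho, p_rho = stats.spearmanr(exercise_hours, bmi)

print(f"Pearson  r   = {r:.2f}, p = {p_r:.2f}")
print(f"Spearman rho = {rho:.2f}, p = {p_rho:.2f}")
```

A large p-value here would be consistent with the absence of any real association in the simulated columns.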


## Q4. A researcher is interested in examining the relationship between the number of hours individuals spend watching television per day and their level of physical activity. The researcher collected data on both variables from a sample of 50 participants. Calculate the Pearson correlation coefficient between these two variables.

In [8]:
import pandas as pd
import numpy as np
watching_television_hours=np.random.randint(1,12,size=50)
physical_activity_hours=np.random.randint(1,12,size=50)
df=pd.DataFrame({'watching television hours per day':watching_television_hours,'physical activity hours per day':physical_activity_hours})
df.head()

Unnamed: 0,watching television hours per day,physical activity hours per day
0,8,7
1,1,7
2,8,11
3,8,4
4,11,7


In [9]:
df.corr("pearson").round(2)

Unnamed: 0,watching television hours per day,physical activity hours per day
watching television hours per day,1.0,-0.03
physical activity hours per day,-0.03,1.0


In [10]:
df.corr("spearman").round(2)

Unnamed: 0,watching television hours per day,physical activity hours per day
watching television hours per day,1.0,-0.05
physical activity hours per day,-0.05,1.0

The Pearson coefficient of -0.03 indicates essentially no linear relationship between television time and physical activity in this data. This is expected, because the two columns are independent random draws rather than survey responses. The Spearman value (-0.05) is shown only for comparison.


## Q5. A survey was conducted to examine the relationship between age and preference for a particular brand of soft drink. The survey results are shown below:

In [11]:
df = pd.DataFrame({'Age(in Years)' : [25,42,37,19,31,28],
                   'Soft Drink Preference' : ["Coke","Pepsi","Mountain Dew","Coke","Pepsi","Coke"]
                  })
df

Unnamed: 0,Age(in Years),Soft Drink Preference
0,25,Coke
1,42,Pepsi
2,37,Mountain Dew
3,19,Coke
4,31,Pepsi
5,28,Coke


In [12]:
from sklearn.preprocessing import OneHotEncoder

In [13]:
encode=OneHotEncoder()
encoded=encode.fit_transform(df[['Soft Drink Preference']]).toarray()
encoded_df=pd.DataFrame(encoded,columns=encode.get_feature_names_out())
df=pd.concat([df,encoded_df],axis=1).drop(columns=['Soft Drink Preference'])
df

Unnamed: 0,Age(in Years),Soft Drink Preference_Coke,Soft Drink Preference_Mountain Dew,Soft Drink Preference_Pepsi
0,25,1.0,0.0,0.0
1,42,0.0,0.0,1.0
2,37,0.0,1.0,0.0
3,19,1.0,0.0,0.0
4,31,0.0,0.0,1.0
5,28,1.0,0.0,0.0
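
As an alternative sketch, the same indicator columns can be built without scikit-learn by using `pd.get_dummies`. The variable names `dummies` and `encoded` are illustrative; the `prefix` argument is chosen so the column names match the ones produced above.

```python
import pandas as pd

df = pd.DataFrame({
    "Age(in Years)": [25, 42, 37, 19, 31, 28],
    "Soft Drink Preference": ["Coke", "Pepsi", "Mountain Dew", "Coke", "Pepsi", "Coke"],
})

# One indicator column per brand, stored as floats like the OneHotEncoder output
dummies = pd.get_dummies(
    df["Soft Drink Preference"], prefix="Soft Drink Preference", dtype=float
)

# Keep the age column and append the indicator columns
encoded = pd.concat([df[["Age(in Years)"]], dummies], axis=1)
print(encoded)
```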


In [15]:
df.corr("spearman")

Unnamed: 0,Age(in Years),Soft Drink Preference_Coke,Soft Drink Preference_Mountain Dew,Soft Drink Preference_Pepsi
Age(in Years),1.0,-0.87831,0.392792,0.621059
Soft Drink Preference_Coke,-0.87831,1.0,-0.447214,-0.707107
Soft Drink Preference_Mountain Dew,0.392792,-0.447214,1.0,-0.316228
Soft Drink Preference_Pepsi,0.621059,-0.707107,-0.316228,1.0


In [16]:
df.corr("pearson")

Unnamed: 0,Age(in Years),Soft Drink Preference_Coke,Soft Drink Preference_Mountain Dew,Soft Drink Preference_Pepsi
Age(in Years),1.0,-0.83724,0.394132,0.576439
Soft Drink Preference_Coke,-0.83724,1.0,-0.447214,-0.707107
Soft Drink Preference_Mountain Dew,0.394132,-0.447214,1.0,-0.316228
Soft Drink Preference_Pepsi,0.576439,-0.707107,-0.316228,1.0
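
Interpretation: age correlates strongly and negatively with the Coke indicator (about -0.84 with Pearson, -0.88 with Spearman) and positively with the Pepsi and Mountain Dew indicators. In this sample the Coke drinkers are the three youngest respondents (19, 25 and 28). The correlations among the indicator columns themselves are negative by construction, since each respondent picks exactly one brand. With six respondents, all of these values are tentative. The Pearson correlation between a continuous variable and a 0/1 indicator is the point-biserial correlation; assuming SciPy is installed, the sketch below reproduces the age versus Coke value with `scipy.stats.pointbiserialr`.

```python
from scipy import stats

age = [25, 42, 37, 19, 31, 28]
prefers_coke = [1, 0, 0, 1, 0, 1]  # 1 where the stated preference is Coke

# Point-biserial correlation: Pearson's r with one binary variable
r_pb, p_value = stats.pointbiserialr(prefers_coke, age)
print(round(r_pb, 5))
```

The printed coefficient should match the `Age(in Years)` / `Soft Drink Preference_Coke` entry of the Pearson table above.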


## Q6. A company is interested in examining the relationship between the number of sales calls made per day and the number of sales made per week. The company collected data on both variables from a sample of 30 sales representatives. Calculate the Pearson correlation coefficient between these two variables.

In [18]:
import numpy as np
import pandas as pd
calls_per_day=np.random.randint(50,100,size=30)
sales_per_week=np.random.randint(2000,4000,size=30)
df=pd.DataFrame({"sales calls per day":calls_per_day,"sales per week":sales_per_week})
df.head()

Unnamed: 0,sales calls per day,sales per week
0,60,2053
1,77,3143
2,74,3696
3,99,3943
4,72,2627


In [19]:
df.corr(method="pearson")

Unnamed: 0,sales calls per day,sales per week
sales calls per day,1.0,0.192969
sales per week,0.192969,1.0

The coefficient of about 0.19 indicates a weak positive linear relationship between calls per day and sales per week. Because both columns were generated independently at random, this value reflects sampling noise in a sample of 30 rather than a real effect.
