# Feature Engineering Assignment

### Q1. Pearson correlation coefficient is a measure of the linear relationship between two variables. Suppose you have collected data on the amount of time students spend studying for an exam and their final exam scores. Calculate the Pearson correlation coefficient between these two variables and interpret the result.

In [2]:
import pandas as pd
import numpy as np

np.random.seed(42)

# Generate study time and exam scores
study_time = np.random.randint(1, 10, size=100)  # Study time in hours
exam_scores = study_time * 10  # Exam scores 

# Create the DataFrame
df = pd.DataFrame({'Study Time': study_time, 'Exam Score': exam_scores})

In [3]:
df.head()

Unnamed: 0,Study Time,Exam Score
0,7,70
1,4,40
2,8,80
3,5,50
4,7,70


In [4]:
df.corr(method='pearson')

Unnamed: 0,Study Time,Exam Score
Study Time,1.0,1.0
Exam Score,1.0,1.0


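The Pearson correlation coefficient between Study Time and Exam Score is 1.0, which indicates a perfect positive linear relationship: as Study Time increases, Exam Score increases in direct proportion. This is expected here because the exam scores were generated as exactly 10 times the study time; real measurements would typically contain noise and give a coefficient below 1.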
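Covariance and the Pearson coefficient are related but not the same quantity: the coefficient is the covariance divided by the product of the two standard deviations. The sketch below regenerates the same synthetic data and computes both, so the difference in scale is visible. The variable names `cov_xy`, `r_manual`, and `r_pandas` are introduced only for this illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(42)

# Same synthetic data as in the cell above
study_time = np.random.randint(1, 10, size=100)
exam_scores = study_time * 10
df = pd.DataFrame({'Study Time': study_time, 'Exam Score': exam_scores})

# Covariance is expressed in the units of both variables (hours x points)
cov_xy = df['Study Time'].cov(df['Exam Score'])

# Pearson r rescales the covariance by both sample standard deviations
r_manual = cov_xy / (df['Study Time'].std() * df['Exam Score'].std())

# Cross-check against the built-in pandas implementation
r_pandas = df['Study Time'].corr(df['Exam Score'], method='pearson')

print(f"covariance   = {cov_xy:.3f}")
print(f"r (manual)   = {r_manual:.3f}")
print(f"r (pandas)   = {r_pandas:.3f}")
```

The covariance is much larger than 1 because it scales with the units of the data, while the coefficient is always bounded between -1 and 1.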

### Q2. Spearman's rank correlation is a measure of the monotonic relationship between two variables. Suppose you have collected data on the amount of sleep individuals get each night and their overall job satisfaction level on a scale of 1 to 10. Calculate the Spearman's rank correlation between these two variables and interpret the result.

In [8]:
import pandas as pd
import numpy as np

np.random.seed(42)

## Generate random data
# Sleep duration in hours
sleep_duration = np.random.randint(4, 10, size=200)  
# Job satisfaction level (maximum of 10)
job_satisfaction = np.minimum(sleep_duration + np.random.randint(1, 4, size=200), 10) 


# Create the DataFrame
df = pd.DataFrame({'Sleep Duration': sleep_duration, 'Job Satisfaction Level': job_satisfaction})

# Display the DataFrame
df.head()

Unnamed: 0,Sleep Duration,Job Satisfaction Level
0,7,8
1,8,9
2,6,8
3,8,9
4,8,10


In [9]:
df.corr(method='spearman')

Unnamed: 0,Sleep Duration,Job Satisfaction Level
Sleep Duration,1.0,0.892691
Job Satisfaction Level,0.892691,1.0


The Spearman's rank correlation between Sleep Duration and Job Satisfaction Level is 0.892691. This value suggests a strong positive monotonic relationship between the two variables: individuals with a longer sleep duration tend to report higher job satisfaction levels. Spearman's coefficient measures how consistently one variable rises with the other, not whether the relationship is linear.
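Spearman's rank correlation can be understood as the Pearson correlation applied to the ranks of the data rather than to the raw values. The following sketch, which reuses the same synthetic data, illustrates that equivalence; `ranks`, `rho_from_ranks`, and `rho_pandas` are names chosen only for this example.

```python
import numpy as np
import pandas as pd

np.random.seed(42)

# Same synthetic data as in the cell above
sleep_duration = np.random.randint(4, 10, size=200)
job_satisfaction = np.minimum(sleep_duration + np.random.randint(1, 4, size=200), 10)
df = pd.DataFrame({'Sleep Duration': sleep_duration,
                   'Job Satisfaction Level': job_satisfaction})

# Replace each value by its rank; tied values share the average rank
ranks = df.rank(method='average')

# Pearson correlation of the ranks
rho_from_ranks = ranks['Sleep Duration'].corr(
    ranks['Job Satisfaction Level'], method='pearson')

# Built-in Spearman correlation for comparison
rho_pandas = df['Sleep Duration'].corr(
    df['Job Satisfaction Level'], method='spearman')

print(f"Pearson on ranks = {rho_from_ranks:.6f}")
print(f"Spearman         = {rho_pandas:.6f}")
```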

### Q3. Suppose you are conducting a study to examine the relationship between the number of hours of exercise per week and body mass index (BMI) in a sample of adults. You collected data on both variables for 50 participants. Calculate the Pearson correlation coefficient and the Spearman's rank correlation between these two variables and compare the results.

In [18]:
import pandas as pd
import numpy as np

np.random.seed(42)

# Generate random data
# Number of hours of exercise per week
exercise_hours = np.random.randint(1, 10, size=50)
# Body mass index (BMI) with a positive relationship
bmi = 20 + exercise_hours * 1.5  

# Create the DataFrame
df = pd.DataFrame({'Exercise Hours': exercise_hours, 'BMI': bmi})


In [19]:
df.head()

Unnamed: 0,Exercise Hours,BMI
0,7,30.5
1,4,26.0
2,8,32.0
3,5,27.5
4,7,30.5


In [20]:
df.corr(method='pearson')

Unnamed: 0,Exercise Hours,BMI
Exercise Hours,1.0,1.0
BMI,1.0,1.0


In [21]:
df.corr(method='spearman')

Unnamed: 0,Exercise Hours,BMI
Exercise Hours,1.0,1.0
BMI,1.0,1.0


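Both the Pearson and the Spearman coefficients equal 1.0. As the number of hours of exercise per week increases, the BMI of the participants in this dataset increases in a perfectly linear manner, and a perfectly linear increasing relationship is also perfectly monotonic, so the two measures agree. The result follows directly from how the data was generated (`bmi = 20 + exercise_hours * 1.5`) and says nothing about the relationship in a real sample.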
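The two coefficients only coincide like this when the relationship is linear. As a rough illustration, the sketch below uses a hypothetical monotonic but non-linear relationship (`y = exp(x)`, not part of the assignment data) where Spearman stays at 1 while Pearson drops below it.

```python
import numpy as np
import pandas as pd

# Hypothetical data: y always increases with x, but not along a straight line
x = np.arange(1, 11)
y = np.exp(x)
demo = pd.DataFrame({'x': x, 'y': y})

pearson_r = demo['x'].corr(demo['y'], method='pearson')
spearman_rho = demo['x'].corr(demo['y'], method='spearman')

print(f"Pearson  = {pearson_r:.3f}")   # below 1: the relationship is not linear
print(f"Spearman = {spearman_rho:.3f}")  # 1: the ordering is perfectly preserved
```

Comparing the two values is therefore a quick check for whether a monotonic relationship is also approximately linear.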

### Q4. A researcher is interested in examining the relationship between the number of hours individuals spend watching television per day and their level of physical activity. The researcher collected data on both variables from a sample of 50 participants. Calculate the Pearson correlation coefficient between these two variables.

In [35]:

import pandas as pd
import numpy as np

np.random.seed(42)


# Generate random data
tv_hours = np.random.randint(1, 6, size=50) 
physical_activity = 10 - tv_hours

# Create the DataFrame
df = pd.DataFrame({'TV Hours': tv_hours, 'Physical Activity': physical_activity})

df.head()

Unnamed: 0,TV Hours,Physical Activity
0,4,6
1,5,5
2,3,7
3,5,5
4,5,5


In [36]:
df.corr(method='pearson')

Unnamed: 0,TV Hours,Physical Activity
TV Hours,1.0,-1.0
Physical Activity,-1.0,1.0


A coefficient of -1.0 indicates a perfect negative linear relationship: as the number of hours individuals spend watching television per day increases, their level of physical activity decreases. The value is exact only because the data was generated as `physical_activity = 10 - tv_hours`.
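Since physical activity here is computed directly from TV hours, the coefficient of -1 is built into the data. A minimal sketch, assuming hypothetical Gaussian noise with standard deviation 1 (an assumption made only for this illustration), shows how the coefficient moves away from -1 once the relationship is no longer exact:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Same structure as the cell above, plus hypothetical measurement noise
tv_hours = rng.integers(1, 6, size=50)
noise = rng.normal(loc=0.0, scale=1.0, size=50)  # assumed noise level
physical_activity = 10 - tv_hours + noise

noisy = pd.DataFrame({'TV Hours': tv_hours,
                      'Physical Activity': physical_activity})

r_noisy = noisy['TV Hours'].corr(noisy['Physical Activity'], method='pearson')
print(f"Pearson r with noise = {r_noisy:.3f}")
```

The sign stays negative, but the magnitude falls below 1, which is closer to what a survey of real participants would be expected to show.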

### Q5. A survey was conducted to examine the relationship between age and preference for a particular brand of soft drink. The survey results are shown below:

In [25]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

Age=[25,42,37,19,31,28]
Drink = ['Coke', 'Pepsi', 'Mountain Dew', 'Coke', 'Pepsi', 'Coke']

encoder = OneHotEncoder()
encoded = encoder.fit_transform(np.array(Drink).reshape(-1, 1))

df = pd.DataFrame(data=encoded.toarray(), columns=encoder.get_feature_names_out(['Drink']))

In [27]:
df['Age']=Age

In [28]:
df

Unnamed: 0,Drink_Coke,Drink_Mountain Dew,Drink_Pepsi,Age
0,1.0,0.0,0.0,25
1,0.0,0.0,1.0,42
2,0.0,1.0,0.0,37
3,1.0,0.0,0.0,19
4,0.0,0.0,1.0,31
5,1.0,0.0,0.0,28
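The encoded table on its own does not yet describe the relationship between age and brand preference. One possible way to examine it, sketched here under the assumption that comparing group means and per-brand correlations is an acceptable reading of the question, is shown below. The names `survey`, `mean_age`, `dummies`, and `age_corr` are introduced only for this example, and with six respondents the numbers are illustrative at best.

```python
import pandas as pd

# Survey data as given above
survey = pd.DataFrame({
    'Age': [25, 42, 37, 19, 31, 28],
    'Drink': ['Coke', 'Pepsi', 'Mountain Dew', 'Coke', 'Pepsi', 'Coke'],
})

# Average age of respondents preferring each brand
mean_age = survey.groupby('Drink')['Age'].mean()
print(mean_age)

# Pearson correlation between Age and each 0/1 brand indicator
dummies = pd.get_dummies(survey['Drink'], prefix='Drink', dtype=float)
age_corr = dummies.corrwith(survey['Age'])
print(age_corr)
```

In this sample the Coke respondents are younger on average (24.0 years) than the Pepsi respondents (36.5 years), which shows up as a negative correlation between Age and the `Drink_Coke` indicator.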


### Q6. A company is interested in examining the relationship between the number of sales calls made per day and the number of sales made per week. The company collected data on both variables from a sample of 30 sales representatives. Calculate the Pearson correlation coefficient between these two variables.

In [39]:
import numpy as np
import pandas as pd

calls=np.random.randint(20,60,30)
sales=calls*0.5 + 20

df=pd.DataFrame({'No. of calls': calls,'Sales':sales})

df.head()

Unnamed: 0,No. of calls,Sales
0,33,36.5
1,59,49.5
2,56,48.0
3,40,40.0
4,54,47.0


In [40]:
df.corr(method='pearson')

Unnamed: 0,No. of calls,Sales
No. of calls,1.0,1.0
Sales,1.0,1.0


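The correlation coefficient of 1.0 indicates a perfect positive linear relationship between the two variables: representatives who make more calls per day record proportionally more sales per week. As in the earlier questions, the value is exact because the sales figures were generated as a linear function of the number of calls (`sales = calls * 0.5 + 20`).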

## The End