#### A researcher is conducting a study on the effects of different exercise regimens on blood pressure. The study involves 100 participants who are randomly assigned to one of three exercise groups: jogging, weightlifting, or yoga. Each participant's blood pressure is measured before and after the 6-week exercise program.
#### The researcher has collected the data and stored it in a CSV file. The file contains the following columns: Participant ID (numeric), Exercise group (text: "jogging", "weightlifting", or "yoga"), Pre-exercise systolic blood pressure (numeric), and Post-exercise systolic blood pressure (numeric).
#### The researcher wants to analyze the data using Python and NumPy. Complete the following tasks as part of the initial statistical analysis of the scenario above.

In [1]:
# import necessary modules
import csv
import random
import pandas as pd
import numpy as np
from scipy import stats

### Generate Synthetic Dataset on Exercise and Blood Pressure
#### 1. Create a Python script that generates a synthetic dataset matching the description of your study. The dataset should be saved as a CSV file named "exercise_data.csv".

In [2]:
# generate 100 participant ID numbers; random.sample draws without replacement, so no ID is duplicated
ids = random.sample(range(100), 100)

In [3]:
#generate pre- and post-exercise blood pressure
pre_bp = []
post_bp = []
#define a function that appends 100 random blood pressures in the range 70-150 (inclusive) to the list x
def bp_generator(x):
    for i in range(100):
        n = random.randint(70, 150)
        x.append(n)
        
bp_generator(pre_bp)
print(pre_bp)
bp_generator(post_bp)
print(post_bp)

[74, 97, 149, 108, 86, 104, 100, 84, 88, 150, 76, 148, 99, 91, 81, 115, 127, 146, 85, 149, 99, 124, 124, 139, 85, 147, 130, 91, 111, 88, 89, 102, 107, 111, 109, 117, 92, 105, 134, 137, 113, 144, 98, 131, 120, 145, 134, 139, 123, 81, 71, 81, 100, 138, 131, 143, 89, 132, 82, 101, 145, 104, 117, 139, 114, 99, 121, 113, 118, 111, 120, 105, 132, 148, 140, 105, 72, 87, 96, 138, 118, 105, 100, 70, 110, 90, 129, 150, 105, 125, 114, 113, 124, 148, 136, 136, 89, 138, 123, 118]
[126, 108, 93, 149, 149, 111, 74, 100, 121, 90, 138, 107, 112, 125, 122, 142, 89, 96, 140, 100, 91, 102, 101, 126, 131, 80, 136, 115, 117, 76, 95, 127, 123, 87, 88, 109, 122, 123, 75, 137, 78, 138, 116, 135, 140, 103, 115, 84, 84, 148, 111, 146, 89, 110, 150, 141, 139, 70, 142, 91, 145, 135, 90, 114, 85, 123, 129, 117, 125, 96, 97, 89, 104, 117, 118, 108, 132, 71, 88, 94, 136, 75, 126, 109, 101, 140, 104, 125, 94, 124, 132, 101, 130, 108, 126, 146, 125, 111, 86, 81]
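Since the task asks for NumPy, the same readings could also be drawn in one vectorized call. The following is a minimal sketch, not the notebook's method: the seed value and the `*_sketch` names are assumptions added for illustration, and seeding is only there so the dataset can be regenerated.

```python
import numpy as np

# Seeded generator so the synthetic data can be regenerated exactly.
rng = np.random.default_rng(seed=42)  # the seed value is an arbitrary choice

# 100 integer readings drawn uniformly from 70..150;
# endpoint=True makes the upper bound inclusive, like random.randint.
pre_bp_sketch = rng.integers(70, 150, size=100, endpoint=True)
post_bp_sketch = rng.integers(70, 150, size=100, endpoint=True)

print(pre_bp_sketch[:5], post_bp_sketch[:5])
```

Because the two arrays are drawn independently, pre and post readings are uncorrelated, just as in the loop-based version above.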


In [4]:
# create a list for groups
group_list = ["jogging", "yoga", "weightlifting"]
groups = []
#define a function that appends a randomly chosen group to the list x 100 times
def group_generator(x):
    for i in range(100):
        w = random.choice(group_list)
        x.append(w)
group_generator(groups)
print(groups)

['yoga', 'weightlifting', 'yoga', 'yoga', 'weightlifting', 'jogging', 'weightlifting', 'yoga', 'weightlifting', 'jogging', 'weightlifting', 'jogging', 'yoga', 'jogging', 'weightlifting', 'weightlifting', 'weightlifting', 'yoga', 'weightlifting', 'yoga', 'yoga', 'weightlifting', 'jogging', 'weightlifting', 'weightlifting', 'jogging', 'weightlifting', 'jogging', 'jogging', 'yoga', 'weightlifting', 'jogging', 'yoga', 'yoga', 'yoga', 'jogging', 'jogging', 'jogging', 'jogging', 'yoga', 'jogging', 'weightlifting', 'weightlifting', 'yoga', 'yoga', 'yoga', 'jogging', 'yoga', 'jogging', 'weightlifting', 'weightlifting', 'weightlifting', 'yoga', 'jogging', 'yoga', 'yoga', 'jogging', 'weightlifting', 'yoga', 'jogging', 'weightlifting', 'weightlifting', 'weightlifting', 'weightlifting', 'yoga', 'jogging', 'jogging', 'yoga', 'yoga', 'yoga', 'jogging', 'jogging', 'jogging', 'yoga', 'jogging', 'jogging', 'weightlifting', 'jogging', 'jogging', 'yoga', 'weightlifting', 'yoga', 'weightlifting', 'yoga', 
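Drawing each label independently leaves the group sizes to chance. As a hedged sketch of two alternatives (the seed and the names `groups_independent` and `groups_balanced` are assumptions for illustration), `random.choices` does the independent draw in one call, while repeating and shuffling the labels gives near-equal groups:

```python
import random
from collections import Counter

random.seed(0)  # arbitrary seed, only so the sketch is repeatable

group_list = ["jogging", "yoga", "weightlifting"]

# Independent draws: each participant gets one of the three groups at random,
# so group sizes will usually be unequal.
groups_independent = random.choices(group_list, k=100)

# Near-balanced alternative: repeat the labels to cover 100 slots, then shuffle.
groups_balanced = (group_list * 34)[:100]
random.shuffle(groups_balanced)

print(Counter(groups_independent))
print(Counter(groups_balanced))
```

Which one fits depends on whether the study design calls for simple randomization or for groups of roughly equal size.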

In [5]:
#zip the lists into rows (one tuple per participant) and build a dataframe with column names
data_dict = zip(ids, groups, pre_bp, post_bp)
df = pd.DataFrame(data_dict, columns = ["Participant ID", "Excercise Group", 
          "Pre-exercise systolic blood pressure", 
          "Post-exercise systolic blood pressure"])

In [6]:
#write dataframe to csv file
df.to_csv('excercise_data.csv', index=False)

### Highest Pre-Exercise Blood Pressure by Group

#### 2. Write a Python script to read the "exercise_data.csv" file and print the participant with the highest pre-exercise systolic blood pressure in each exercise group.

In [7]:
#read csv file using pandas
excercise_data = pd.read_csv('excercise_data.csv')

In [8]:
#group and sort data to see highest pre systolic BP in each group
pre_BP_groups= excercise_data.groupby('Excercise Group', sort=True)[['Pre-exercise systolic blood pressure',
                                                                     'Participant ID']]

In [9]:
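# print the maximum of each column per group
# note: max() is applied column by column, so the Participant ID shown is the largest ID
# in each group, not necessarily the participant with the highest pre systolic BP
print(pre_BP_groups.max())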

                 Pre-exercise systolic blood pressure  Participant ID
Excercise Group                                                      
jogging                                           150              95
weightlifting                                     148              97
yoga                                              150              99
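To recover the actual participant behind each group's highest reading, one option is `idxmax`, which returns the row label of the maximum rather than the maximum itself. This is a minimal sketch on a small hypothetical frame (`demo` and its values are made up; only the column names follow the notebook):

```python
import pandas as pd

# Tiny hypothetical frame using the same column names as the notebook.
demo = pd.DataFrame({
    "Participant ID": [1, 2, 3, 4, 5, 6],
    "Excercise Group": ["jogging", "jogging", "yoga", "yoga",
                        "weightlifting", "weightlifting"],
    "Pre-exercise systolic blood pressure": [120, 150, 149, 110, 148, 90],
})

# idxmax gives, per group, the row label of the highest reading ...
top_rows = demo.groupby("Excercise Group")["Pre-exercise systolic blood pressure"].idxmax()

# ... and .loc pulls the complete record for each of those rows.
highest_per_group = demo.loc[top_rows]
print(highest_per_group)
```

If several participants in a group tie for the highest value, `idxmax` keeps only the first one it encounters.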


#### Explore the DataFrame

In [10]:
#print head of data frame
excercise_data.head()

| | Participant ID | Excercise Group | Pre-exercise systolic blood pressure | Post-exercise systolic blood pressure |
|---|---|---|---|---|
| 0 | 89 | yoga | 74 | 126 |
| 1 | 31 | weightlifting | 97 | 108 |
| 2 | 76 | yoga | 149 | 93 |
| 3 | 81 | yoga | 108 | 149 |
| 4 | 34 | weightlifting | 86 | 149 |


In [11]:
# call info() on dataframe to ensure each parameter contains 100 values
excercise_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 4 columns):
 #   Column                                 Non-Null Count  Dtype 
---  ------                                 --------------  ----- 
 0   Participant ID                         100 non-null    int64 
 1   Excercise Group                        100 non-null    object
 2   Pre-exercise systolic blood pressure   100 non-null    int64 
 3   Post-exercise systolic blood pressure  100 non-null    int64 
dtypes: int64(3), object(1)
memory usage: 3.2+ KB


In [12]:
# use the value_counts() function to investigate the distribution of exercise groups
excercise_data.value_counts('Excercise Group')

Excercise Group
jogging          36
yoga             33
weightlifting    31
dtype: int64

### Extract the 5 Participants with Highest Blood Pressure

#### 3. Write a Python function that sorts the list based on blood pressure and displays the full record of the top 5.

In [13]:
# create a function that sorts dataframe x by column n (descending) and returns the top 5 rows; head() defaults to 5
def top5(x,n):
    return x.sort_values(n, ascending=False).head()

In [14]:
# print participants with top 5 pre-exercise BP
print("5 highest incidences of pre-exercise BP")
top5(excercise_data,'Pre-exercise systolic blood pressure')

5 highest incidences of pre-exercise BP


| | Participant ID | Excercise Group | Pre-exercise systolic blood pressure | Post-exercise systolic blood pressure |
|---|---|---|---|---|
| 87 | 2 | yoga | 150 | 125 |
| 9 | 24 | jogging | 150 | 90 |
| 19 | 84 | yoga | 149 | 100 |
| 2 | 76 | yoga | 149 | 93 |
| 73 | 51 | yoga | 148 | 117 |


In [15]:
# print participants with top 5 post-exercise BP
print("5 highest incidences of post-exercise BP")
top5(excercise_data, 'Post-exercise systolic blood pressure')

5 highest incidences of post-exercise BP


| | Participant ID | Excercise Group | Pre-exercise systolic blood pressure | Post-exercise systolic blood pressure |
|---|---|---|---|---|
| 54 | 9 | yoga | 131 | 150 |
| 3 | 81 | yoga | 108 | 149 |
| 4 | 34 | weightlifting | 86 | 149 |
| 49 | 78 | weightlifting | 81 | 148 |
| 51 | 56 | weightlifting | 81 | 146 |
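The sort-then-head pattern can also be written with `DataFrame.nlargest`, which makes the number of rows an explicit argument. A minimal sketch, where `top_n` and the `demo` values are hypothetical and not part of the notebook:

```python
import pandas as pd

def top_n(frame, column, n=5):
    """Return the n rows with the largest values in `column` (hypothetical helper)."""
    return frame.nlargest(n, column)

# Made-up readings, only to exercise the helper.
demo = pd.DataFrame({
    "Participant ID": [10, 11, 12, 13, 14, 15, 16],
    "Pre-exercise systolic blood pressure": [101, 150, 97, 149, 120, 148, 133],
})

top_five = top_n(demo, "Pre-exercise systolic blood pressure")
print(top_five)
```

The full record of each selected row is kept, as with `top5`, and rows come back in descending order of the chosen column.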


### Monthly Blood Pressure Changes

#### 4. Write a Python script that assumes that blood pressure measurements were taken monthly. Compute and print the average change in blood pressure for each exercise group. Note: This is hypothetical as the original study is for 6 weeks only.

In [16]:
#calculate the change from pre to post systolic BP for each participant (negative values mean a reduction)
change = excercise_data['Post-exercise systolic blood pressure'] - excercise_data['Pre-exercise systolic blood pressure']
#add column to dataframe
excercise_data['Change in BP'] = change

In [17]:
#Calculate and print average change for each group
excercise_data.groupby(['Excercise Group'])['Change in BP'].mean()

Excercise Group
jogging          -7.333333
weightlifting    12.483871
yoga             -7.333333
Name: Change in BP, dtype: float64

### Compare Pre- and Post-Exercise Blood Pressure

#### 5. Search for the top 5 participants from the pre-exercise ranking and find their post-exercise blood pressure. Produce a table that compares their pre- and post-exercise pressure and displays the difference.

In [18]:
top5(excercise_data, 'Pre-exercise systolic blood pressure')

| | Participant ID | Excercise Group | Pre-exercise systolic blood pressure | Post-exercise systolic blood pressure | Change in BP |
|---|---|---|---|---|---|
| 87 | 2 | yoga | 150 | 125 | -25 |
| 9 | 24 | jogging | 150 | 90 | -60 |
| 19 | 84 | yoga | 149 | 100 | -49 |
| 2 | 76 | yoga | 149 | 93 | -56 |
| 73 | 51 | yoga | 148 | 117 | -31 |
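The table above relies on the `Change in BP` column added in an earlier cell. A self-contained variant might compute the difference at the point of use and keep only the columns being compared. This is a sketch under assumed names (`demo`, `comparison`, `Difference`) and made-up readings:

```python
import pandas as pd

pre_col = "Pre-exercise systolic blood pressure"
post_col = "Post-exercise systolic blood pressure"

# Made-up readings for six hypothetical participants.
demo = pd.DataFrame({
    "Participant ID": [1, 2, 3, 4, 5, 6],
    pre_col: [150, 149, 120, 148, 147, 146],
    post_col: [125, 100, 130, 117, 150, 90],
})

# Select the five highest pre-exercise readings, then add post minus pre.
top = demo.nlargest(5, pre_col)
comparison = top.assign(Difference=top[post_col] - top[pre_col])[
    ["Participant ID", pre_col, post_col, "Difference"]
]
print(comparison)
```

A negative `Difference` means the reading dropped after the program.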


### Summary Statistics for Each Exercise Group
#### 6. Write a Python script to read the "exercise_data.csv" file and compute the measures of central tendency for each exercise group: mean, mode, standard deviation.

In [19]:
#print mean for groups
data = pd.read_csv('excercise_data.csv')
means = data.groupby(['Excercise Group'])[['Pre-exercise systolic blood pressure'
                                          ,'Post-exercise systolic blood pressure']].mean()
print(means)

                 Pre-exercise systolic blood pressure  \
Excercise Group                                         
jogging                                    114.277778   
weightlifting                              108.354839   
yoga                                       117.757576   

                 Post-exercise systolic blood pressure  
Excercise Group                                         
jogging                                     106.944444  
weightlifting                               120.838710  
yoga                                        110.424242  


In [20]:
#print mode for groups
modes =  data.groupby(['Excercise Group'])[['Pre-exercise systolic blood pressure',
                                            'Post-exercise systolic blood pressure']].apply(stats.mode)
print(modes)

Excercise Group
jogging          ([[105, 101]], [[4, 2]])
weightlifting     ([[81, 108]], [[3, 2]])
yoga               ([[99, 96]], [[2, 2]])
dtype: object
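The nested arrays returned by `stats.mode` are hard to read in a grouped result. One possible alternative is pandas' own `Series.mode`, sketched here on a small hypothetical frame (`demo` and `modes_demo` are illustrative names, and the values are made up):

```python
import pandas as pd

demo = pd.DataFrame({
    "Excercise Group": ["jogging"] * 4 + ["yoga"] * 4,
    "Pre-exercise systolic blood pressure": [105, 105, 110, 120, 99, 99, 130, 131],
})

# Series.mode() returns every most-frequent value in ascending order;
# taking .iloc[0] keeps the smallest one when several values tie.
modes_demo = demo.groupby("Excercise Group")["Pre-exercise systolic blood pressure"].agg(
    lambda s: s.mode().iloc[0]
)
print(modes_demo)
```

With only about 30 integer readings per group spread over 81 possible values, ties for the mode are likely, so a single reported mode should be read with caution.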


#### Standard deviations for groups

In [21]:
#Use numpy to calculate std; np.std defaults to ddof=0, i.e. the population standard deviation
jogging_std = np.std(data.groupby(['Excercise Group'])[['Pre-exercise systolic blood pressure',
                           'Post-exercise systolic blood pressure']].get_group('jogging'))
weightlifting_std =  np.std(data.groupby(['Excercise Group'])[['Pre-exercise systolic blood pressure',
                           'Post-exercise systolic blood pressure']].get_group('weightlifting'))
yoga_std =  np.std(data.groupby(['Excercise Group'])[['Pre-exercise systolic blood pressure',
                           'Post-exercise systolic blood pressure']].get_group('yoga'))

#print stds
print("Jogging standard deviations: " , jogging_std)
print("Weightlifting standard deviations: " , weightlifting_std)
print("Yoga standard deviations: " ,yoga_std)

Jogging standard deviations:  Pre-exercise systolic blood pressure     18.352431
Post-exercise systolic blood pressure    18.278622
dtype: float64
Weightlifting standard deviations:  Pre-exercise systolic blood pressure     24.325147
Post-exercise systolic blood pressure    21.612566
dtype: float64
Yoga standard deviations:  Pre-exercise systolic blood pressure     22.772964
Post-exercise systolic blood pressure    22.425676
dtype: float64


In [22]:
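#This is an easier way to print standard deviation for groups using the std() method of pandas; NumPy does not calculate
# on groups as cleanly. pandas defaults to ddof=1 (sample standard deviation), so these values are slightly
# larger than the np.std results above, which use ddof=0 (population standard deviation)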
std = (data.groupby(['Excercise Group'])[['Pre-exercise systolic blood pressure'
             ,'Post-exercise systolic blood pressure']]).std()
print(std)

                 Pre-exercise systolic blood pressure  \
Excercise Group                                         
jogging                                     18.612762   
weightlifting                               24.727243   
yoga                                        23.126054   

                 Post-exercise systolic blood pressure  
Excercise Group                                         
jogging                                      18.537906  
weightlifting                                21.969823  
yoga                                         22.773381  
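The two sets of standard deviations differ only in the divisor: `np.std` divides by n by default, while pandas' `std()` divides by n - 1. A small worked sketch on hypothetical readings (the values and variable names are assumptions) shows that they agree once `ddof` is set explicitly:

```python
import numpy as np
import pandas as pd

readings = pd.Series([110, 120, 130, 140])  # hypothetical values
n = len(readings)

population_std = np.std(readings.to_numpy())  # NumPy default ddof=0: divide by n
sample_std = readings.std()                   # pandas default ddof=1: divide by n - 1

# Setting ddof explicitly makes either library reproduce the other's result.
numpy_sample_std = np.std(readings.to_numpy(), ddof=1)
pandas_population_std = readings.std(ddof=0)

# The two definitions differ by the factor sqrt(n / (n - 1)).
ratio = sample_std / population_std
print(population_std, sample_std, ratio, np.sqrt(n / (n - 1)))
```

For the jogging group (n = 36) the factor is sqrt(36/35), about 1.014, which accounts for 18.35 versus 18.61 in the outputs above.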
