# Python Proficiency for Statistics


Generate Synthetic Dataset on Exercise and Blood Pressure

1.     Create a Python script that generates a synthetic dataset matching the description of your study. The dataset should be saved as a CSV file named "exercise_data.csv".

In [44]:
import pandas as pd
import numpy as np


# Number of participants
num_participants = 100

# Exercise groups
exercise_groups = ["jogging", "weightlifting", "yoga"]

# Generate synthetic data
participant_ids = np.arange(1, num_participants + 1)
exercise_group_data = np.random.choice(exercise_groups, num_participants)
pre_bp_data = np.random.normal(loc=120, scale=5, size=num_participants)
post_bp_data = np.random.normal(loc=120, scale=5, size=num_participants)

# Create a DataFrame
df = pd.DataFrame({
    'Participant ID': participant_ids,
    'Exercise group': exercise_group_data,
    'Pre-exercise systolic blood pressure': pre_bp_data,
    'Post-exercise systolic blood pressure': post_bp_data
})

# Save the dataset to a CSV file
df.to_csv('exercise_data.csv')

In this task, the script uses the NumPy library to generate random data representing participants, exercise groups, and blood pressure measurements before and after exercise. The dataset is then saved in a CSV file.

Code explanation:
The line of code "np.arange(1, num_participants + 1)" is a NumPy function call that creates an array containing a sequence of numbers. Let's break down what this line does:

    arange(): This is a function provided by NumPy that generates a sequence of numbers within a specified range.
    
The line of code "np.random.choice(exercise_groups, num_participants)" is a call to NumPy's random module that selects items randomly from the given exercise_groups array to create an array of exercise group assignments for the participants in the study. 
The line of code np.random.normal(loc=120, scale=5, size=num_participants) generates a random array of numbers sampled from a normal distribution.

        loc=120: This parameter specifies the mean of the normal distribution. In this case, it is set to 120, which means that the distribution is centered around 120. This number was chosen because the average blood pressure of a human being is 120.

    scale=5: This parameter specifies the standard deviation of the normal distribution.

subsequently, the table was created and saved
   


In [45]:
df

Unnamed: 0,Participant ID,Exercise group,Pre-exercise systolic blood pressure,Post-exercise systolic blood pressure
0,1,weightlifting,114.722231,101.692316
1,2,weightlifting,127.714390,122.006518
2,3,weightlifting,101.040352,123.625708
3,4,weightlifting,121.702984,133.684220
4,5,weightlifting,130.329639,131.881457
...,...,...,...,...
95,96,jogging,119.946026,110.574346
96,97,jogging,130.383888,117.781391
97,98,weightlifting,125.054241,124.216845
98,99,weightlifting,133.094729,115.647778


In [46]:
# Read the CSV file
df = pd.read_csv('exercise_data.csv')

# Group the data by exercise group
grouped_data = df.groupby('Exercise group')

# Iterate over each group and find the participant with the highest pre-exercise blood pressure
for group_name, group_data in grouped_data:
    max_pre_bp_index = group_data['Pre-exercise systolic blood pressure'].idxmax()
    max_pre_bp_participant = group_data.loc[max_pre_bp_index, 'Participant ID']
    max_pre_bp_value = group_data.loc[max_pre_bp_index, 'Pre-exercise systolic blood pressure']
    print(f"Highest pre-exercise systolic blood pressure in {group_name} group: Participant {max_pre_bp_participant} (BP: {max_pre_bp_value})")

Highest pre-exercise systolic blood pressure in jogging group: Participant 71 (BP: 137.6986242067697)
Highest pre-exercise systolic blood pressure in weightlifting group: Participant 60 (BP: 146.63182350119823)
Highest pre-exercise systolic blood pressure in yoga group: Participant 70 (BP: 140.100092893)


In this task, a Python script is used to read the CSV file containing the study data. The data is then grouped by exercise group using the Pandas library. Then, for each group, the participant with the highest pre-exercise blood pressure is identified and their details are displayed.
The line of code grouped_data = df.groupby('Exercise group') in Python with pandas creates a grouped object based on the 'Exercise group' column in the DataFrame 'df'. 

    .groupby('Exercise group'): This is a method provided by pandas that groups the DataFrame by a specified column, in this case, the 'Exercise group' column. It creates a DataFrameGroupBy object, which splits the original DataFrame into groups based on the unique values in the specified column.
    max_pre_bp_index = group_data['Pre-exercise systolic blood pressure'].idxmax(): This line finds the index of the row with the maximum value in the 'Pre-exercise systolic blood pressure' column within the current group (group_data).
    max_pre_bp_participant = group_data.loc[max_pre_bp_index, 'Participant ID']: This line retrieves the 'Participant ID' of the participant with the highest pre-exercise systolic blood pressure within the current group.

    max_pre_bp_value = group_data.loc[max_pre_bp_index, 'Pre-exercise systolic blood pressure']: This line retrieves the actual value of the highest pre-exercise systolic blood pressure within the current group.

In [47]:
# Sort the dataframe by pre-exercise systolic blood pressure and display the top 5 records
top_5 = df.nlargest(5, 'Pre-exercise systolic blood pressure')
print(top_5)



    Unnamed: 0  Participant ID Exercise group  \
59          59              60  weightlifting   
49          49              50  weightlifting   
69          69              70           yoga   
63          63              64           yoga   
70          70              71        jogging   

    Pre-exercise systolic blood pressure  \
59                            146.631824   
49                            144.147910   
69                            140.100093   
63                            138.115048   
70                            137.698624   

    Post-exercise systolic blood pressure  
59                             132.570600  
49                              97.599501  
69                             108.786194  
63                             132.251323  
70                             115.979333  


df.nlargest(5, 'Pre-exercise systolic blood pressure') selects the top 5 rows from the DataFrame df based on the values in the 'Pre-exercise systolic blood pressure' column.


    .nlargest(5, 'Pre-exercise systolic blood pressure'): This is a method provided by pandas DataFrame objects that selects the top n rows from the DataFrame based on the values in a specified column, in this case, 'Pre-exercise systolic blood pressure'. The nlargest() function is designed to efficiently retrieve the rows with the largest values in the specified column.

In [48]:
# Compute the average change in blood pressure for each exercise group
average_change = df.groupby('Exercise group').apply(lambda x: np.mean(x['Post-exercise systolic blood pressure'] - x['Pre-exercise systolic blood pressure']))
print("Average change in blood pressure for each exercise group:")
print(average_change)

Average change in blood pressure for each exercise group:
Exercise group
jogging          3.351721
weightlifting   -1.715461
yoga             1.088048
dtype: float64


average_change = df.groupby('Exercise group').apply(lambda x: np.mean(x['Post-exercise systolic blood pressure'] - x['Pre-exercise systolic blood pressure'])) computes the average change in systolic blood pressure for each exercise group in the DataFrame df

In [49]:
# Search for the top 5 participants from the pre-exercise data
top_5_pre_exercise = sorted_df.head()

# Find their post-exercise blood pressure
post_exercise_bp = df[df['Participant ID'].isin(top_5_pre_exercise['Participant ID'])][['Participant ID', 'Post-exercise systolic blood pressure']]

# Merge pre-exercise and post-exercise data
comparison_table = pd.merge(top_5_pre_exercise, post_exercise_bp, on='Participant ID', suffixes=('_pre', '_post'))

# Display the comparison table
print("Comparison of pre- and post-exercise blood pressure:")
print(comparison_table)

Comparison of pre- and post-exercise blood pressure:
   Participant ID Exercise group  Pre-exercise systolic blood pressure  \
0              87  weightlifting                            147.837774   
1              96           yoga                            144.338481   
2              43  weightlifting                            143.743294   
3              17  weightlifting                            143.741417   
4              73        jogging                            140.433205   

   Post-exercise systolic blood pressure_pre  \
0                                 136.837774   
1                                 136.338481   
2                                 132.743294   
3                                 129.741417   
4                                 132.433205   

   Post-exercise systolic blood pressure_post  
0                                  110.336876  
1                                  110.574346  
2                                  138.078762  
3                    

 The script extracts data from the five participants with the highest pre-exercise blood pressure, then merges this data with the post-exercise blood pressure measurements.

    df['Participant ID'].isin(top_5_pre_exercise['Participant ID']): This part of the code checks if the 'Participant ID' column in the DataFrame df is present in the 'Participant ID' column of another DataFrame or Series top_5_pre_exercise.

    [['Participant ID', 'Post-exercise systolic blood pressure']]: This part of the code specifies which columns to include in the filtered DataFrame. It selects the 'Participant ID' and 'Post-exercise systolic blood pressure' columns.
    
        pd.merge(): This is the function used to merge pandas DataFrames.

    suffixes=('_pre', '_post'): This parameter specifies the suffixes to be added to the columns from each DataFrame to distinguish them after the merge operation. In this case, '_pre' will be added to columns from the first DataFrame (top_5_pre_exercise) and '_post' will be added to columns from the second DataFrame (post_exercise_bp).

In [51]:
# Compute measures of central tendency for each exercise group
measures_of_central_tendency = df.groupby('Exercise group')['Post-exercise systolic blood pressure'].agg(['mean',  'std'])
print("Measures of central tendency for each exercise group:")
print(measures_of_central_tendency)

Measures of central tendency for each exercise group:
                      mean       std
Exercise group                      
jogging         122.178201  7.839167
weightlifting   118.951413  9.741809
yoga            120.111891  9.631482


The script uses Pandas to calculate central tendency measures (mean, standard deviation) of post-exercise blood pressure for each group.

    The mean is the average value in a dataset.
    
    Standard deviation measures how spread out your values are from the mean of your dataset. It calculates the typical distance of a data point from the mean.

    The mode is the most frequently occurring value in a dataset(I had difficulty calculating the mode).