## Healthy Bison: Lifestyle Choices and Their Impact On College Students
Name: M. Ashura Langford


Student ID: 004001453


___

### Data Overview
___
#### Dataset Information
Healthy Bison: [LifeStyle Choices and Their Impact On College Students](https://www.kaggle.com/datasets/charlottebennett1234/lifestyle-factors-and-their-impact-on-students)

**Real World Context:**
This data was collected to examine the relationship between student lifestyle patterns and academic performance (represented by GPA). Responses were gathered from 2,000 university students in Lisboa via a Google Form between August 2023 and May 2024. The dataset is intended to support analysis of how daily habits affect academic performance and student well-being, which makes it relevant to fields such as education, psychology, and health sciences.


#### Basic Data Exploration
**Explanation:** In cell [1], I began by loading and briefly examining the dataset. First, I imported the pandas library, a standard tool in Python for data handling and manipulation. I then read the file titled "student_lifestyle_dataset(psych).csv" into a pandas DataFrame named df and created a copy named data to use for subsequent analysis, so the original stays untouched. Using the *.shape* attribute (an attribute, not a function, so it takes no parentheses), I confirmed that the dataset contains 2,000 rows and 9 columns. In this context, the rows represent the individual participants, while the columns correspond to the variables being examined. Lastly, as directed, I used the *.head()* and *.tail()* methods to display the first and last five entries of the dataset, allowing for an initial review of the data structure and content.

**Column Names:**
* Student ID
* Study Hours Per Day
* Extracurricular Hours Per Day
* Sleep Hours Per Day
* Social Hours Per Day
* Physical Activity Hours Per Day
* Stress Level
* Gender
* Grades (GPA)


In [1]:
import pandas as pd
df = pd.read_csv('student_lifestyle_dataset(psych).csv')
print(df.head())
#copy to new data frame
data= df.copy()
#1. Number of rows and columns
print("\nShape:\n",data.shape)
#2. First rows and last rows of data
print("\nFirst 5 rows:")
print(data.head())

print("\nLast 5 rows:")
print(data.tail())

   Student_ID  Study_Hours_Per_Day  Extracurricular_Hours_Per_Day  \
0           1                  6.9                            3.8   
1           2                  5.3                            3.5   
2           3                  5.1                            3.9   
3           4                  6.5                            2.1   
4           5                  8.1                            0.6   

   Sleep_Hours_Per_Day  Social_Hours_Per_Day  Physical_Activity_Hours_Per_Day  \
0                  8.7                   2.8                              1.8   
1                  8.0                   4.2                              3.0   
2                  9.2                   1.2                              4.6   
3                  7.2                   1.7                              6.5   
4                  6.5                   2.2                              6.6   

  Stress_Level  Gender  Grades  
0     Moderate    Male    7.48  
1          Low  Female    6.88  
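
The load, copy, and shape steps described above can be sketched in a self-contained way. This is only an illustration: the two rows are the first rows printed above with a reduced set of columns, and `csv_text`, `df_demo`, and `working` are hypothetical names that stand in for the real file and DataFrames.

```python
import io

import pandas as pd

# Two rows taken from the printed output above, with fewer columns for brevity.
csv_text = """Student_ID,Study_Hours_Per_Day,Sleep_Hours_Per_Day,Stress_Level,Gender,Grades
1,6.9,8.7,Moderate,Male,7.48
2,5.3,8.0,Low,Female,6.88
"""

# read_csv accepts any file-like object, so StringIO stands in for the CSV file.
df_demo = pd.read_csv(io.StringIO(csv_text))

# .shape is an attribute that holds a (rows, columns) tuple.
n_rows, n_cols = df_demo.shape
print(f"{n_rows} rows, {n_cols} columns")

# .copy() returns an independent DataFrame: editing the copy leaves the original alone.
working = df_demo.copy()
working.loc[0, "Grades"] = 0.0
print(df_demo.loc[0, "Grades"], working.loc[0, "Grades"])
```

Working on a copy means any later mistake in wrangling can be undone by copying from the untouched original again.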

#### Data Types and Structure
**Explanation:**
Seven columns are numerical, stored as either float64 or int64.
**Note:** *Column titles including "per day" will be shortened.*


**Numerical:** Student ID (int64), Study Hours, Extracurricular Hours, Sleep Hours, Social Hours, Physical Activity Hours, and Grades (all float64)

Two columns are categorical and are stored with the object dtype.

**Categorical:** Stress Level & Gender

Based on this information, no data type conversion appears necessary: every column that should be numerical already is, and the two categorical columns are stored as text. Student ID is numerical by type, but it is an identifier rather than a measurement.

In [2]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 9 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   Student_ID                       2000 non-null   int64  
 1   Study_Hours_Per_Day              2000 non-null   float64
 2   Extracurricular_Hours_Per_Day    2000 non-null   float64
 3   Sleep_Hours_Per_Day              2000 non-null   float64
 4   Social_Hours_Per_Day             2000 non-null   float64
 5   Physical_Activity_Hours_Per_Day  2000 non-null   float64
 6   Stress_Level                     2000 non-null   object 
 7   Gender                           2000 non-null   object 
 8   Grades                           2000 non-null   float64
dtypes: float64(6), int64(1), object(2)
memory usage: 140.8+ KB
None
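
As a rough sketch of how the numerical/categorical split could be checked in code rather than read off the `info()` table, the snippet below uses `select_dtypes`. It also shows one possible way to shorten the "per day" column titles; the renaming rule and the names `sample`, `numeric_cols`, `categorical_cols`, and `short` are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd

# Two rows from the printed output above, with a reduced set of columns.
sample = pd.DataFrame({
    "Student_ID": [1, 2],
    "Study_Hours_Per_Day": [6.9, 5.3],
    "Sleep_Hours_Per_Day": [8.7, 8.0],
    "Stress_Level": ["Moderate", "Low"],
    "Gender": ["Male", "Female"],
    "Grades": [7.48, 6.88],
})

# Split the columns by dtype: numeric (int64/float64) versus everything else.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

# Hypothetical shortening rule: drop the "_Per_Day" suffix from column names.
short = sample.rename(columns=lambda name: name.replace("_Per_Day", ""))

print("Numerical:", numeric_cols)
print("Categorical:", categorical_cols)
print("Shortened:", short.columns.tolist())
```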


#### Descriptive Summary
**Explanation:** The summary statistics describe the central tendency, spread, and range of the numerical values in the dataset. The average study hours (~7.48) and the average sleep hours (~7.50) are both somewhat high, which suggests that the typical student in this sample has a relatively focused and well-rested lifestyle. The average GPA (7.78) is also moderately high. The average extracurricular time (1.99 hours) is low and the average social time (2.70 hours) is moderate, so students spend less time on these activities than on studying or sleeping. The statistics reported for Student ID can be ignored, since it is an identifier rather than a measurement.

In [3]:
print("\n Summary Statistics:\n",data.describe())


 Summary Statistics:
         Student_ID  Study_Hours_Per_Day  Extracurricular_Hours_Per_Day  \
count  2000.000000          2000.000000                    2000.000000   
mean   1000.500000             7.475800                       1.990100   
std     577.494589             1.423888                       1.155855   
min       1.000000             5.000000                       0.000000   
25%     500.750000             6.300000                       1.000000   
50%    1000.500000             7.400000                       2.000000   
75%    1500.250000             8.700000                       3.000000   
max    2000.000000            10.000000                       4.000000   

       Sleep_Hours_Per_Day  Social_Hours_Per_Day  \
count          2000.000000           2000.000000   
mean              7.501250              2.704550   
std               1.460949              1.688514   
min               5.000000              0.000000   
25%               6.200000              1.200000  
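
One optional refinement, sketched here under stated assumptions, is to drop the identifier column before summarizing and keep only the rows of the summary that are discussed. The values below are the five rows printed earlier with a reduced set of columns; `sample` and `summary` are illustrative names.

```python
import pandas as pd

# First five rows as printed earlier in the notebook (reduced set of columns).
sample = pd.DataFrame({
    "Student_ID": [1, 2, 3, 4, 5],
    "Study_Hours_Per_Day": [6.9, 5.3, 5.1, 6.5, 8.1],
    "Sleep_Hours_Per_Day": [8.7, 8.0, 9.2, 7.2, 6.5],
})

# Student_ID is an identifier, so its mean and std carry no meaning; drop it first.
# Then keep only the summary rows of interest.
summary = (
    sample.drop(columns="Student_ID")
    .describe()
    .loc[["mean", "50%", "min", "max"]]
)
print(summary.round(2))
```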


#### Missing or Duplicate Data
**Explanation:** This step is essential for cleaning the data and confirming its validity. The expression *.isnull().sum()* counts the missing values in each column, and *.duplicated().sum()* counts the rows that are exact duplicates of an earlier row. Based on the results, there are no missing or duplicated values.

In [4]:
#check for missing and duplicated data
print(data.isnull().sum())
print("\nDuplicated Data:",data.duplicated().sum())

Student_ID                         0
Study_Hours_Per_Day                0
Extracurricular_Hours_Per_Day      0
Sleep_Hours_Per_Day                0
Social_Hours_Per_Day               0
Physical_Activity_Hours_Per_Day    0
Stress_Level                       0
Gender                             0
Grades                             0
dtype: int64

Duplicated Data: 0
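
Note that `.duplicated()` with no arguments only flags rows that match on every column. A slightly stricter check, sketched below on made-up toy rows (not taken from the dataset), also looks for repeated student IDs using the `subset` parameter.

```python
import pandas as pd

# Toy rows invented for illustration: one repeated ID and one missing value.
toy = pd.DataFrame({
    "Student_ID": [1, 2, 2, 3],
    "Sleep_Hours_Per_Day": [8.7, 8.0, 7.5, None],
})

# Missing values per column.
missing_per_column = toy.isnull().sum()

# Rows identical to an earlier row across all columns.
full_row_duplicates = int(toy.duplicated().sum())

# Rows whose Student_ID already appeared, even if other values differ.
repeated_ids = int(toy.duplicated(subset="Student_ID").sum())

print(missing_per_column)
print("Full-row duplicates:", full_row_duplicates)
print("Repeated IDs:", repeated_ids)
```

In this toy example the full-row check reports nothing, while the ID check catches the second entry for student 2.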


### Data Wrangling
____

#### Add a New Column Using Existing Columns
**Reflection:** The primary purpose of this code is to create a new, meaningful variable that does not exist in the original dataset. It sums the hours each student spends on the four waking activities (studying, extracurriculars, socializing, and physical activity). Instead of having to look at four different columns, there is now a single column that represents the daily time spent outside of sleeping. This allows researchers to check its correlation with GPA (*Do students with high activity hours have lower grades?*). Activity loads could also be compared across categorical variables such as gender.

In [5]:
#Add new column to dataset
data['Total_Activity_Hours'] = data['Study_Hours_Per_Day'] + \
                               data['Extracurricular_Hours_Per_Day'] + \
                               data['Social_Hours_Per_Day'] + \
                               data['Physical_Activity_Hours_Per_Day']
print("First Five Rows with New Column:\n",data.head())

First Five Rows with New Column:
    Student_ID  Study_Hours_Per_Day  Extracurricular_Hours_Per_Day  \
0           1                  6.9                            3.8   
1           2                  5.3                            3.5   
2           3                  5.1                            3.9   
3           4                  6.5                            2.1   
4           5                  8.1                            0.6   

   Sleep_Hours_Per_Day  Social_Hours_Per_Day  Physical_Activity_Hours_Per_Day  \
0                  8.7                   2.8                              1.8   
1                  8.0                   4.2                              3.0   
2                  9.2                   1.2                              4.6   
3                  7.2                   1.7                              6.5   
4                  6.5                   2.2                              6.6   

  Stress_Level  Gender  Grades  Total_Activity_Hours  
0     Mod
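
An equivalent way to build the same column, shown here only as a sketch, is to list the activity columns once and sum across them row by row. The values are the first five rows printed above; `sample` and `activity_cols` are illustrative names.

```python
import pandas as pd

# First five rows as printed in the notebook output above.
sample = pd.DataFrame({
    "Study_Hours_Per_Day": [6.9, 5.3, 5.1, 6.5, 8.1],
    "Extracurricular_Hours_Per_Day": [3.8, 3.5, 3.9, 2.1, 0.6],
    "Social_Hours_Per_Day": [2.8, 4.2, 1.2, 1.7, 2.2],
    "Physical_Activity_Hours_Per_Day": [1.8, 3.0, 4.6, 6.5, 6.6],
})

activity_cols = [
    "Study_Hours_Per_Day",
    "Extracurricular_Hours_Per_Day",
    "Social_Hours_Per_Day",
    "Physical_Activity_Hours_Per_Day",
]

# axis=1 sums across the listed columns for each row (each student).
sample["Total_Activity_Hours"] = sample[activity_cols].sum(axis=1)
print(sample["Total_Activity_Hours"].round(1).tolist())
```

One difference to keep in mind: `sum(axis=1)` skips missing values by default, whereas chaining `+` would propagate them. Since this dataset has no missing values, both approaches give the same result here.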

#### Filter the Data
**Explanation:** These two filters let researchers focus on two key variables: Stress Level and Grades. I filtered for high stress levels so that I could see the participants who rated themselves as very stressed. From this filter, I can look at each high-stress participant's corresponding measures of daily life and begin to spot trends. For example, the first five rows of the high-stress filter suggest that these participants have a higher number of total activity hours as well as high GPAs. My second filter selects individuals with lower GPAs (below 6.0) to see whether a lower GPA goes along with fewer study hours. In the first five rows displayed under this filter, the participants do have fewer study hours than students with higher GPAs. Because these impressions come from only five rows per filter, they are starting points rather than conclusions.

In [6]:
#Filter 
high_stress = data[data['Stress_Level'] == 'High']

print("High Stress Filter:\n",high_stress.head())

GPA_low = data[data['Grades'] < 6.0]
print("\n\n Low GPA Filter:\n",GPA_low.head())


High Stress Filter:
     Student_ID  Study_Hours_Per_Day  Extracurricular_Hours_Per_Day  \
4            5                  8.1                            0.6   
6            7                  8.0                            0.7   
7            8                  8.4                            1.8   
10          11                  9.7                            3.6   
12          13                  6.4                            2.2   

    Sleep_Hours_Per_Day  Social_Hours_Per_Day  \
4                   6.5                   2.2   
6                   5.3                   5.7   
7                   5.6                   3.0   
10                  8.0                   2.5   
12                  5.7                   4.8   

    Physical_Activity_Hours_Per_Day Stress_Level Gender  Grades  \
4                               6.6         High   Male    8.78   
6                               4.3         High   Male    7.70   
7                               5.2         High   Male    8.0
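
Rather than reading trends off the first five rows, the same questions could be asked of every row with a grouped average. The sketch below uses made-up toy values (not taken from the dataset) purely to show the pattern; `toy`, `by_stress`, `low_gpa`, and `other` are illustrative names, and the 7.0 threshold is chosen only to suit the toy data.

```python
import pandas as pd

# Toy values invented for illustration only.
toy = pd.DataFrame({
    "Stress_Level": ["High", "High", "Low", "Low", "Moderate", "Moderate"],
    "Study_Hours_Per_Day": [8.0, 9.0, 5.0, 6.0, 7.0, 7.0],
    "Grades": [8.5, 8.7, 6.5, 6.9, 7.5, 7.7],
})

# Average study hours and grades within each stress level.
by_stress = toy.groupby("Stress_Level")[["Study_Hours_Per_Day", "Grades"]].mean()
print(by_stress)

# Compare study hours for low-GPA students against everyone else.
is_low = toy["Grades"] < 7.0
low_gpa = toy.loc[is_low, "Study_Hours_Per_Day"].mean()
other = toy.loc[~is_low, "Study_Hours_Per_Day"].mean()
print("Mean study hours, low GPA:", low_gpa, "| others:", other)
```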

#### Unique Values and Categories
**Explanation:** The *.unique()* method lists every distinct value in a column, which lets me spot typos or irregularities in my dataset. I applied it to both of my categorical columns: Stress Level and Gender. The output shows only the expected categories: two for Gender (Male, Female) and three for Stress Level (Low, Moderate, High).

In [7]:
gender_unique= data['Gender'].unique()
print(gender_unique)

stress_unique= data['Stress_Level'].unique()
print(stress_unique)

['Male' 'Female']
['Moderate' 'Low' 'High']
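
The visual check above can also be automated by comparing the observed values with the set of expected labels. The following is a minimal sketch on a made-up series that deliberately contains a lowercase typo; the real column showed no such problem, and `stress`, `expected`, and `cleaned` are illustrative names.

```python
import pandas as pd

# Made-up values with one deliberate typo ("high") to show what the check catches.
stress = pd.Series(["Moderate", "Low", "High", "high", "Moderate"], name="Stress_Level")
expected = {"Low", "Moderate", "High"}

# Any label outside the expected set is a candidate typo.
unexpected = set(stress.unique()) - expected
print("Unexpected labels:", unexpected)

# One possible cleanup: trim whitespace and standardize capitalization.
cleaned = stress.str.strip().str.title()
cleaned_counts = cleaned.value_counts()
print(cleaned_counts)
```

`value_counts()` adds the frequency of each category, which `.unique()` alone does not show.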


### Summary And Conclusion

**Initial Observations:** This dataset has several interesting features. The most notable is that the mean of Total Activity Hours (16.50) plus the mean of Sleep Hours (7.50) equals exactly 24 hours. The other rows of the summary table do not add to 24 (the two minimums add to 19 and the two maximums add to 29), but that is expected: the student with the least sleep is not the student with the least activity, so adding summary statistics column by column does not describe any single student. The two columns also share the same standard deviation (1.46), and each of the five rows displayed earlier totals 24 hours, which is consistent with the hour columns being constrained to fill a full day for every student. The second observation is that this population is very "busy". The mean for Study Hours is high at 7.5 hours, and the minimum is 5 hours. Furthermore, the mean for Physical Activity Hours is about 4.3 hours. A combined average of nearly 12 hours of studying and physical activity alone suggests a sample of high-performing or highly active students. My final observation is that there could be potential outliers in physical activity: the maximum value is 13 hours, while the 75th percentile is only 6.1 hours, so values this far above the bulk of the data could pull the mean upward. Overall, the dataset was very useful for exploring the relationship between how students spend their time and their grades.
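
To make the outlier observation concrete, one common rule of thumb flags values more than 1.5 interquartile ranges beyond the quartiles. The sketch below is an assumption about how that check could be done, not a step from the original analysis. The data are the five physical activity values printed earlier plus the 13-hour maximum cited above, so the quartiles here differ from those of the full dataset.

```python
import pandas as pd

# Five values from the printed rows plus the 13-hour maximum mentioned above.
physical = pd.Series([1.8, 3.0, 4.6, 6.5, 6.6, 13.0], name="Physical_Activity_Hours_Per_Day")

# Interquartile range and the conventional 1.5 * IQR fences.
q1 = physical.quantile(0.25)
q3 = physical.quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are flagged as potential outliers.
outliers = physical[(physical < lower_fence) | (physical > upper_fence)]
print("Fences:", round(lower_fence, 2), round(upper_fence, 2))
print("Flagged:", outliers.tolist())
```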

In [8]:
twentyfour_hour = data[['Total_Activity_Hours', 'Sleep_Hours_Per_Day']].describe()

twentyfour_hour['Sum_Check (Should be 24)'] = twentyfour_hour['Total_Activity_Hours'] + twentyfour_hour['Sleep_Hours_Per_Day']

print("Observation that Total Activity + Sleep = 24 hours:")
print(twentyfour_hour)

Observation that Total Activity + Sleep = 24 hours:
       Total_Activity_Hours  Sleep_Hours_Per_Day  Sum_Check (Should be 24)
count           2000.000000          2000.000000               4000.000000
mean              16.498750             7.501250                 24.000000
std                1.460949             1.460949                  2.921897
min               14.000000             5.000000                 19.000000
25%               15.200000             6.200000                 21.400000
50%               16.500000             7.500000                 24.000000
75%               17.800000             8.800000                 26.600000
max               19.000000            10.000000                 29.000000
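
Adding the rows of two `describe()` tables only gives a meaningful result for the mean. A more direct test of the 24-hour pattern is to add the hour columns for each student. The sketch below does this for the first five rows printed earlier; `sample`, `row_totals`, and `naive_min_sum` are illustrative names.

```python
import pandas as pd

# First five rows as printed earlier in the notebook.
sample = pd.DataFrame({
    "Study_Hours_Per_Day": [6.9, 5.3, 5.1, 6.5, 8.1],
    "Extracurricular_Hours_Per_Day": [3.8, 3.5, 3.9, 2.1, 0.6],
    "Sleep_Hours_Per_Day": [8.7, 8.0, 9.2, 7.2, 6.5],
    "Social_Hours_Per_Day": [2.8, 4.2, 1.2, 1.7, 2.2],
    "Physical_Activity_Hours_Per_Day": [1.8, 3.0, 4.6, 6.5, 6.6],
})

# Per-student total across all five hour columns.
row_totals = sample.sum(axis=1)

# Compare with a small tolerance because the values are floating point.
all_24 = bool(((row_totals - 24).abs() < 1e-9).all())
print("Every row totals 24 hours:", all_24)

# Adding the two minimums mixes different students, so it need not equal 24.
total_activity = row_totals - sample["Sleep_Hours_Per_Day"]
naive_min_sum = total_activity.min() + sample["Sleep_Hours_Per_Day"].min()
print("Minimum activity + minimum sleep:", round(naive_min_sum, 1))
```

Running the same row-wise check on the full DataFrame would show whether the constraint holds for all 2,000 students, which the summary table alone cannot confirm.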


In [9]:
data.to_csv('edited_student_lifestyle.csv', index= False)
print("Your edited file has been saved!")

Your edited file has been saved!
