
# Global Coffee Health Dataset
Link: https://www.kaggle.com/datasets/uom190346a/global-coffee-health-dataset/data

Description:

The GlobalCoffeeHealth dataset contains 10,000 synthetic records reflecting real-world patterns of coffee consumption, sleep behavior, and health outcomes across 20 countries. It includes demographics, daily coffee intake, caffeine levels, sleep duration and quality, BMI, heart rate, stress, physical activity, health issues, occupation, smoking, and alcohol consumption.

The dataset captures realistic correlations observed in research—such as caffeine’s impact on sleep, stress, and health—making it ideal for statistical analysis, predictive modeling, and lifestyle or wellness studies.

## Imports

In [3]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier


## Load the data

In [4]:
df = pd.read_csv("synthetic_coffee_health_10000.csv")

## Explore

In [5]:
df.head()

,ID,Age,Gender,Country,Coffee_Intake,Caffeine_mg,Sleep_Hours,Sleep_Quality,BMI,Heart_Rate,Stress_Level,Physical_Activity_Hours,Health_Issues,Occupation,Smoking,Alcohol_Consumption
0,1,40,Male,Germany,3.5,328.1,7.5,Good,24.9,78,Low,14.5,,Other,0,0
1,2,33,Male,Germany,1.0,94.1,6.2,Good,20.0,67,Low,11.0,,Service,0,0
2,3,42,Male,Brazil,5.3,503.7,5.9,Fair,22.7,59,Medium,11.2,Mild,Office,0,0
3,4,53,Male,Germany,2.6,249.2,7.3,Good,24.7,71,Low,6.6,Mild,Other,0,0
4,5,32,Female,Spain,3.1,298.0,5.3,Fair,24.1,76,Medium,8.5,Mild,Student,0,1


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 16 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   ID                       10000 non-null  int64  
 1   Age                      10000 non-null  int64  
 2   Gender                   10000 non-null  object 
 3   Country                  10000 non-null  object 
 4   Coffee_Intake            10000 non-null  float64
 5   Caffeine_mg              10000 non-null  float64
 6   Sleep_Hours              10000 non-null  float64
 7   Sleep_Quality            10000 non-null  object 
 8   BMI                      10000 non-null  float64
 9   Heart_Rate               10000 non-null  int64  
 10  Stress_Level             10000 non-null  object 
 11  Physical_Activity_Hours  10000 non-null  float64
 12  Health_Issues            4059 non-null   object 
 13  Occupation               10000 non-null  object 
 14  Smoking                

In [7]:
df['Occupation'].value_counts()

Occupation,count
Office,2073
Other,2038
Student,1968
Healthcare,1964
Service,1957


In [8]:
df['Country'].value_counts()

Country,count
Canada,543
India,524
Norway,523
China,521
UK,519
Sweden,513
South Korea,512
Finland,510
Italy,509
Switzerland,500


In [9]:
df['Sleep_Quality'].value_counts()

Sleep_Quality,count
Good,5637
Fair,2050
Excellent,1352
Poor,961


In [10]:
df['Health_Issues'].value_counts()

Health_Issues,count
Mild,3579
Moderate,463
Severe,17
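
The counts above cover only the 4,059 non-null entries, so the missing rows are invisible in the tally. A minimal sketch on a toy stand-in Series (the values are illustrative, not taken from the dataset) shows how `dropna=False` and `normalize=True` could surface both the missing share and the class imbalance:

```python
import pandas as pd

# Toy stand-in for the Health_Issues column; values are illustrative only.
health = pd.Series(["Mild", None, "Moderate", None, "Mild", None, "Severe", None])

# dropna=False keeps missing entries as their own row in the tally.
counts = health.value_counts(dropna=False)

# normalize=True reports shares instead of raw counts.
shares = health.value_counts(dropna=False, normalize=True)

print(counts)
print(shares.round(3))
```

On the real frame the same call would be `df['Health_Issues'].value_counts(dropna=False, normalize=True)`.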


In [11]:
df.isnull().sum()

,0
ID,0
Age,0
Gender,0
Country,0
Coffee_Intake,0
Caffeine_mg,0
Sleep_Hours,0
Sleep_Quality,0
BMI,0
Heart_Rate,0


In [12]:
df.duplicated().sum()

np.int64(0)

In [13]:
df.shape

(10000, 16)

## Preprocessing

In [14]:
# Fill missing values in 'Health_Issues' with the mode.
# Assign the result back to the column: chained inplace fillna triggers a
# FutureWarning and will stop working in pandas 3.0.
df['Health_Issues'] = df['Health_Issues'].fillna(df['Health_Issues'].mode()[0])
display(df.isnull().sum())


,0
ID,0
Age,0
Gender,0
Country,0
Coffee_Intake,0
Caffeine_mg,0
Sleep_Hours,0
Sleep_Quality,0
BMI,0
Heart_Rate,0
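
Mode imputation relabels all 5,941 missing rows (10,000 minus the 4,059 non-null entries) as `Mild`, which then dominates the target. If a missing value actually means "no reported health issue" (an assumption the notebook does not confirm), one alternative is to keep it as its own category. The sketch below uses a toy frame, a hypothetical `"None"` label, and a hypothetical four-level ordering:

```python
import pandas as pd

# Toy frame standing in for the real data; values are illustrative only.
df_demo = pd.DataFrame({"Health_Issues": ["Mild", None, "Moderate", None, "Severe"]})

# Assumption: a missing entry means "no reported issue", so it gets its own label.
df_demo["Health_Issues"] = df_demo["Health_Issues"].fillna("None")

# Hypothetical ordinal codes with the extra level placed below 'Mild'.
health_mapping_demo = {"None": 0, "Mild": 1, "Moderate": 2, "Severe": 3}
df_demo["Health_Issues_code"] = df_demo["Health_Issues"].map(health_mapping_demo)

print(df_demo)
```

Adopting this would change the class codes used by every later cell, so the label dictionaries in the prediction function would need the same four levels.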


In [15]:
stress_mapping = {'Low': 0, 'Medium': 1, 'High': 2}
df['Stress_Level'] = df['Stress_Level'].map(stress_mapping)
health_mapping = {'Mild': 0, 'Moderate': 1, 'Severe': 2}
df['Health_Issues'] = df['Health_Issues'].map(health_mapping)

display(df.head())

,ID,Age,Gender,Country,Coffee_Intake,Caffeine_mg,Sleep_Hours,Sleep_Quality,BMI,Heart_Rate,Stress_Level,Physical_Activity_Hours,Health_Issues,Occupation,Smoking,Alcohol_Consumption
0,1,40,Male,Germany,3.5,328.1,7.5,Good,24.9,78,0,14.5,0,Other,0,0
1,2,33,Male,Germany,1.0,94.1,6.2,Good,20.0,67,0,11.0,0,Service,0,0
2,3,42,Male,Brazil,5.3,503.7,5.9,Fair,22.7,59,1,11.2,0,Office,0,0
3,4,53,Male,Germany,2.6,249.2,7.3,Good,24.7,71,0,6.6,0,Other,0,0
4,5,32,Female,Spain,3.1,298.0,5.3,Fair,24.1,76,1,8.5,0,Student,0,1
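
`Series.map` with a dictionary silently returns NaN for any value that is not a key, so a misspelled or unexpected label would slip through as a missing target. A small check along these lines, shown on a toy Series with a deliberate lowercase typo, could catch that before training:

```python
import pandas as pd

stress_mapping = {"Low": 0, "Medium": 1, "High": 2}

# Toy input with one label ("high") that is not a key in the mapping.
raw = pd.Series(["Low", "Medium", "high", "High"])

mapped = raw.map(stress_mapping)

# Any value that failed to map shows up as NaN; list the offending labels.
unmapped = raw[mapped.isna()].unique().tolist()
print("Unmapped labels:", unmapped)
```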


In [16]:
df = pd.get_dummies(df, columns=['Gender', 'Country', 'Sleep_Quality','Occupation'], dummy_na=False, dtype=int)
display(df.head())

,ID,Age,Coffee_Intake,Caffeine_mg,Sleep_Hours,BMI,Heart_Rate,Stress_Level,Physical_Activity_Hours,Health_Issues,...,Country_USA,Sleep_Quality_Excellent,Sleep_Quality_Fair,Sleep_Quality_Good,Sleep_Quality_Poor,Occupation_Healthcare,Occupation_Office,Occupation_Other,Occupation_Service,Occupation_Student
0,1,40,3.5,328.1,7.5,24.9,78,0,14.5,0,...,0,0,0,1,0,0,0,1,0,0
1,2,33,1.0,94.1,6.2,20.0,67,0,11.0,0,...,0,0,0,1,0,0,0,0,1,0
2,3,42,5.3,503.7,5.9,22.7,59,1,11.2,0,...,0,0,1,0,0,0,1,0,0,0
3,4,53,2.6,249.2,7.3,24.7,71,0,6.6,0,...,0,0,0,1,0,0,0,1,0,0
4,5,32,3.1,298.0,5.3,24.1,76,1,8.5,0,...,0,0,1,0,0,0,0,0,0,1


In [17]:
display(df.dtypes)

,0
ID,int64
Age,int64
Coffee_Intake,float64
Caffeine_mg,float64
Sleep_Hours,float64
BMI,float64
Heart_Rate,int64
Stress_Level,int64
Physical_Activity_Hours,float64
Health_Issues,int64


In [18]:
df = df.drop(columns=["ID"])

In [19]:
df.head()

,Age,Coffee_Intake,Caffeine_mg,Sleep_Hours,BMI,Heart_Rate,Stress_Level,Physical_Activity_Hours,Health_Issues,Smoking,...,Country_USA,Sleep_Quality_Excellent,Sleep_Quality_Fair,Sleep_Quality_Good,Sleep_Quality_Poor,Occupation_Healthcare,Occupation_Office,Occupation_Other,Occupation_Service,Occupation_Student
0,40,3.5,328.1,7.5,24.9,78,0,14.5,0,0,...,0,0,0,1,0,0,0,1,0,0
1,33,1.0,94.1,6.2,20.0,67,0,11.0,0,0,...,0,0,0,1,0,0,0,0,1,0
2,42,5.3,503.7,5.9,22.7,59,1,11.2,0,0,...,0,0,1,0,0,0,1,0,0,0
3,53,2.6,249.2,7.3,24.7,71,0,6.6,0,0,...,0,0,0,1,0,0,0,1,0,0
4,32,3.1,298.0,5.3,24.1,76,1,8.5,0,0,...,0,0,1,0,0,0,0,0,0,1


In [20]:
# Define features (X) and target (y) for predicting 'Health_Issues'
X_health = df.drop(['Stress_Level', 'Health_Issues'], axis=1)  # Exclude both target variables from features
y_health = df['Health_Issues']

# Split the data into training and testing sets
X_train_health, X_test_health, y_train_health, y_test_health = train_test_split(X_health, y_health, test_size=0.2, random_state=42)

print("Health Issues Data Splits:")
print("X_train_health shape:", X_train_health.shape)
print("X_test_health shape:", X_test_health.shape)
print("y_train_health shape:", y_train_health.shape)
print("y_test_health shape:", y_test_health.shape)

Health Issues Data Splits:
X_train_health shape: (8000, 41)
X_test_health shape: (2000, 41)
y_train_health shape: (8000,)
y_test_health shape: (2000,)
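
With only 17 `Severe` rows in the whole dataset, a purely random split can leave very few of them in the test set. Passing `stratify` to `train_test_split` keeps class proportions roughly equal across the two splits. The sketch below uses synthetic labels with a made-up 950/45/5 imbalance rather than the notebook's data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels (950 / 45 / 5); not the notebook's actual data.
y_demo = np.array([0] * 950 + [1] * 45 + [2] * 5)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo preserves the class ratios in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

test_counts = np.bincount(y_te, minlength=3)
print("Test-set class counts:", test_counts)
```

In the notebook this would amount to adding `stratify=y_health` (or `stratify=y_stress`) to the existing call.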


In [21]:
from sklearn.ensemble import RandomForestClassifier
rf_model_health = RandomForestClassifier(random_state=42)

rf_model_health.fit(X_train_health, y_train_health)

print("Random Forest model for Health_Issues trained successfully!")

Random Forest model for Health_Issues trained successfully!


In [22]:
from sklearn.metrics import classification_report, accuracy_score

# Predict on the test set for Health_Issues
y_pred_health = rf_model_health.predict(X_test_health)

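# Evaluate the model for Health_Issues
print("Health Issues Model Evaluation:")
print("Accuracy:", accuracy_score(y_test_health, y_pred_health))
# zero_division=0 reports 0.00 for classes with no predicted samples without raising a warning
print(classification_report(y_test_health, y_pred_health, zero_division=0))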

Health Issues Model Evaluation:
Accuracy: 0.9945
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      1912
           1       0.94      0.93      0.93        83
           2       0.00      0.00      0.00         5

    accuracy                           0.99      2000
   macro avg       0.65      0.64      0.64      2000
weighted avg       0.99      0.99      0.99      2000
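
The 0.9945 accuracy hides the fact that class 2 (support 5) has a recall of 0.00; the macro average of 0.64 is the more telling number. A confusion matrix and balanced accuracy make this kind of gap explicit. The sketch below uses small hand-made label arrays, not the notebook's predictions:

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    classification_report,
    confusion_matrix,
)

# Hand-made labels: the rare class 2 is never predicted.
y_true = np.array([0] * 8 + [1] * 3 + [2] * 1)
y_pred = np.array([0] * 8 + [1, 1, 0] + [0])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])

# Balanced accuracy is the unweighted mean of per-class recall.
acc = accuracy_score(y_true, y_pred)
bal_acc = balanced_accuracy_score(y_true, y_pred)

print(cm)
print("Accuracy:", round(acc, 3), "Balanced accuracy:", round(bal_acc, 3))
print(classification_report(y_true, y_pred, labels=[0, 1, 2], zero_division=0))
```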





## Hyperparameter tuning for stress level

In [27]:
param_grid_stress = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}


In [28]:
# Define features (X) and target (y) for predicting 'Stress_Level'
X_stress = df.drop(['Stress_Level', 'Health_Issues'], axis=1)  # Exclude both target variables from features
y_stress = df['Stress_Level']

# Split the data into training and testing sets
X_train_stress, X_test_stress, y_train_stress, y_test_stress = train_test_split(X_stress, y_stress, test_size=0.2, random_state=42)


In [35]:
from sklearn.model_selection import GridSearchCV

# Instantiate GridSearchCV
grid_search_stress = GridSearchCV(estimator=RandomForestClassifier(random_state=42),
                                  param_grid=param_grid_stress,
                                  cv=5,
                                  n_jobs=-1)

# Fit GridSearchCV to the training data
grid_search_stress.fit(X_train_stress, y_train_stress)

In [36]:
print("Best hyperparameters for Stress_Level model:")
print(grid_search_stress.best_params_)

Best hyperparameters for Stress_Level model:
{'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 100}
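
`GridSearchCV` selects on the estimator's default score (accuracy for a classifier) unless `scoring` is set, and `best_params_` alone does not show how close the candidates were. As a sketch under assumed settings, the example below runs a deliberately tiny grid on synthetic data with `scoring="f1_macro"` and then ranks the candidates from `cv_results_`:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced 3-class problem; stands in for the real features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4,
    n_classes=3, weights=[0.7, 0.2, 0.1], random_state=0,
)

# Small illustrative grid; macro F1 weights every class equally.
search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [20, 40], "max_depth": [None, 5]},
    scoring="f1_macro",
    cv=3,
)
search.fit(X_demo, y_demo)

# One row per parameter combination, best first.
results = (
    pd.DataFrame(search.cv_results_)[
        ["params", "mean_test_score", "std_test_score", "rank_test_score"]
    ]
    .sort_values("rank_test_score")
)
print(results)
```

If the top few rows differ by less than their standard deviation, the chosen hyperparameters are effectively tied.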


In [37]:
from sklearn.model_selection import GridSearchCV

# Define parameter grid for Health_Issues model (can use the same or a different one)
param_grid_health = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Instantiate GridSearchCV
grid_search_health = GridSearchCV(estimator=RandomForestClassifier(random_state=42),
                                  param_grid=param_grid_health,
                                  cv=5,
                                  n_jobs=-1)

# Fit GridSearchCV to the training data
grid_search_health.fit(X_train_health, y_train_health)

print("GridSearchCV for Health_Issues completed.")
print("Best hyperparameters for Health_Issues model:")
print(grid_search_health.best_params_)

GridSearchCV for Health_Issues completed.
Best hyperparameters for Health_Issues model:
{'max_depth': 20, 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 100}


## Save the trained models


**Reasoning**:
Save the trained Random Forest models to files using joblib.



In [38]:
import joblib

joblib.dump(rf_model_health, 'rf_health_model.joblib')
joblib.dump(grid_search_stress.best_estimator_, 'rf_stress_model.joblib')

print("Random Forest models saved successfully!")

Random Forest models saved successfully!
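
A saved model is only usable if the caller can rebuild the exact feature columns it was trained on. One option, sketched here with a toy model and a hypothetical file name, is to store the column list next to the estimator in a single joblib file:

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy training data; column names are borrowed from the dataset for illustration.
X_demo = pd.DataFrame({"Age": [25, 40, 33, 51], "Smoking": [0, 1, 0, 1]})
y_demo = [0, 1, 0, 1]
model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_demo, y_demo)

# Bundle the estimator with the column order it expects.
bundle = {"model": model, "feature_columns": list(X_demo.columns)}
path = os.path.join(tempfile.mkdtemp(), "rf_demo_bundle.joblib")  # hypothetical name
joblib.dump(bundle, path)

# Later: load the bundle and select columns in the stored order before predicting.
restored = joblib.load(path)
preds = restored["model"].predict(X_demo[restored["feature_columns"]])
print(preds)
```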


## Create a prediction function


In [42]:
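import joblib

# Overwrite the earlier health model file with the tuned estimator from GridSearchCV
joblib.dump(grid_search_health.best_estimator_, 'rf_health_model.joblib')

# Save the tuned stress model again (same estimator as in the previous save cell)
joblib.dump(grid_search_stress.best_estimator_, 'rf_stress_model.joblib')

print("Random Forest models saved successfully!")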

Random Forest models saved successfully!


In [43]:
import joblib
import pandas as pd

def predict_health_stress(
    Age, Coffee_Intake, Caffeine_mg, Sleep_Hours, BMI, Heart_Rate,
    Physical_Activity_Hours, Smoking, Alcohol_Consumption,
    Gender, Country, Sleep_Quality, Occupation
):
    # Create a DataFrame from user input
    user_data = pd.DataFrame([{
        'Age': Age,
        'Coffee_Intake': Coffee_Intake,
        'Caffeine_mg': Caffeine_mg,
        'Sleep_Hours': Sleep_Hours,
        'BMI': BMI,
        'Heart_Rate': Heart_Rate,
        'Physical_Activity_Hours': Physical_Activity_Hours,
        'Smoking': Smoking,
        'Alcohol_Consumption': Alcohol_Consumption,
        'Gender': Gender,
        'Country': Country,
        'Sleep_Quality': Sleep_Quality,
        'Occupation': Occupation
    }])

    # Apply the same preprocessing as the training data.
    # Stress_Level and Health_Issues are targets, so only the categorical
    # features need one-hot encoding here.
    categorical_cols = ['Gender', 'Country', 'Sleep_Quality', 'Occupation']
    user_data = pd.get_dummies(user_data, columns=categorical_cols, dummy_na=False, dtype=int)

    # A single row only yields dummy columns for the categories it contains,
    # so add every missing training column and fill it with 0.
    # X_health (defined earlier in the notebook) already excludes both targets.
    training_cols = list(X_health.columns)
    for col in training_cols:
        if col not in user_data.columns:
            user_data[col] = 0

    # Reorder columns to match training data
    user_data = user_data[training_cols]

    # Load the trained models
    rf_health_model = joblib.load('rf_health_model.joblib')
    rf_stress_model = joblib.load('rf_stress_model.joblib')

    # Predict using the loaded models
    predicted_health_numeric = rf_health_model.predict(user_data)[0]
    predicted_stress_numeric = rf_stress_model.predict(user_data)[0]

    # Convert numerical predictions back to original labels
    health_issues_labels = {0: 'Mild', 1: 'Moderate', 2: 'Severe'}
    stress_level_labels = {0: 'Low', 1: 'Medium', 2: 'High'}

    predicted_health_label = health_issues_labels.get(predicted_health_numeric, 'Unknown')
    predicted_stress_label = stress_level_labels.get(predicted_stress_numeric, 'Unknown')

    return predicted_stress_label, predicted_health_label
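
The column-alignment step in the function (add missing dummy columns, then reorder) can also be written as a single `DataFrame.reindex` call, which additionally drops any dummy column the model never saw. A minimal sketch, using a shortened and hypothetical list of training columns:

```python
import pandas as pd

# Hypothetical, shortened stand-in for the real training column list.
training_cols_demo = ["Age", "Gender_Female", "Gender_Male", "Country_Brazil", "Country_Spain"]

# One user row, encoded the same way as in the prediction function.
row = pd.DataFrame([{"Age": 32, "Gender": "Female", "Country": "Spain"}])
encoded = pd.get_dummies(row, columns=["Gender", "Country"], dtype=int)

# reindex adds absent columns filled with 0 and enforces the training order.
aligned = encoded.reindex(columns=training_cols_demo, fill_value=0)
print(aligned)
```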

In [44]:
# Launch the Gradio interface
iface.launch(debug=True)

It looks like you are running Gradio on a hosted Jupyter notebook, which requires `share=True`. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
* Running on public URL: https://295cd5501e470c054d.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
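
The cell that defines `iface` is not among the cells shown here. As a rough sketch of what such a definition might look like, the example below wraps a prediction function in `gr.Interface`. Everything in it is an assumption: the stub function, its placeholder rule, the reduced set of inputs, and the output labels are hypothetical and do not reproduce the notebook's actual interface.

```python
def predict_stub(age, sleep_hours, gender):
    """Hypothetical stand-in for predict_health_stress with fewer inputs.

    The rule below is a placeholder, not the trained model.
    """
    stress = "High" if sleep_hours < 6 else "Low"
    return stress, "Mild"


def build_interface(fn):
    """Wrap fn in a Gradio Interface; gradio is imported only when called."""
    import gradio as gr

    return gr.Interface(
        fn=fn,
        inputs=[
            gr.Number(label="Age"),
            gr.Number(label="Sleep_Hours"),
            gr.Dropdown(choices=["Male", "Female"], label="Gender"),
        ],
        outputs=[
            gr.Textbox(label="Predicted Stress Level"),
            gr.Textbox(label="Predicted Health Issues"),
        ],
    )


# To launch in a notebook (requires gradio to be installed):
# iface = build_interface(predict_stub)
# iface.launch()
```

For the real function, each of the thirteen parameters of `predict_health_stress` would need a matching input component in the same order.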

