# 📋 Table of Contents
* [Import and first glance](#import)
* [Basic EDA](#eda)
* [Data Cleaning](#clean)
* [Target](#target)
* [Gradient Boosting Model](#model)
* [Linear Model](#linear)

In [1]:
# standard
import numpy as np
import pandas as pd
import time

# plots
import matplotlib.pyplot as plt
import seaborn as sns

# statistics
from scipy import stats

# ML tools
import h2o
from h2o.estimators import H2OGeneralizedLinearEstimator, H2OGradientBoostingEstimator

In [14]:
# Step 1: Load the dataset
# The training and test sets are read from CSV files in the Kaggle input directory.
# (pandas is already imported as pd in the first cell.)

# Load the dataset
df_train = pd.read_csv('/kaggle/input/exploring-predictive-health-factors/train.csv')
df_test = pd.read_csv('/kaggle/input/exploring-predictive-health-factors/test.csv')

# Display the first few rows
print(df_train.head())

   ID    Age  Weight_kg PCOS Hormonal_Imbalance Hyperandrogenism Hirsutism  \
0   0  20-25       64.0   No                 No               No        No   
1   1  15-20       55.0   No                 No               No        No   
2   2  15-20       91.0   No                 No               No       Yes   
3   3  15-20       56.0   No                 No               No        No   
4   4  15-20       47.0   No                Yes               No        No   

  Conception_Difficulty Insulin_Resistance Exercise_Frequency  \
0                    No                 No             Rarely   
1                    No                 No   6-8 Times a Week   
2                    No                 No             Rarely   
3                    No                 No   6-8 Times a Week   
4                    No                 No             Rarely   

                               Exercise_Type     Exercise_Duration  \
0  Cardio (e.g., running, cycling, swimming)            30 minutes   


<div style="background-color:#f8f9fa; padding:15px; border-radius:10px; border-left:5px solid #17a2b8;">
    <h2 style="color:#17a2b8;">📂 Dataset Overview</h2>
    <p style="font-size:16px; color:#333;">
        The training set has been loaded from <code>train.csv</code> and the test set from <code>test.csv</code>. The preview above shows the first few rows of the training data.
    </p>

   <ul style="font-size:16px; color:#333;">
        <li><strong>Features:</strong> Multiple health-related factors including <code>Age</code>, <code>Weight_kg</code>, and various exercise and hormonal imbalance indicators.</li>
        <li><strong>Target Variable:</strong> <code>PCOS</code> (Polycystic Ovary Syndrome) - a categorical feature with "Yes" or "No" values.</li>
        <li><strong>Feature Types:</strong> A mix of numerical (e.g., <code>Weight_kg</code>), categorical (e.g., <code>Exercise_Frequency</code>), and ordinal data (e.g., <code>Exercise_Benefit</code>).</li>
    </ul>

    <p style="font-size:16px; color:#333;">
        Understanding the dataset structure is crucial for feature engineering and model development. The next step involves handling missing values and data preprocessing.
    </p>
</div>


In [3]:
# Step 2: Understand the dataset
# Get a sense of the dataset's structure, including the number of rows, columns, and data types.
# # Check the shape of the dataset

# print(f"Dataset shape: {df_train.shape}")

# # Check column names and data types
# print(df_train.info())

# # Check for missing values
# print(df_train.isnull().sum())

<div style="background-color:#f8f9fa; padding:15px; border-radius:10px; border-left:5px solid #ffc107;">
    <h2 style="color:#ffc107;">📊 Dataset Structure & Missing Values</h2>
    <p style="font-size:16px; color:#333;">
        Before proceeding with data preprocessing, it is essential to understand the structure of our dataset, including the number of rows, columns, data types, and missing values.
    </p>

   <ul style="font-size:16px; color:#333;">
        <li><strong>Dataset Shape:</strong> The dataset contains <code>210</code> rows and <code>14</code> columns.</li>
        <li><strong>Data Types:</strong> 
            <ul>
                <li>1 integer column (<code>ID</code>), which is likely an identifier.</li>
                <li>1 float column (<code>Weight_kg</code>).</li>
                <li>12 categorical/object columns.</li>
            </ul>
        </li>
        <li><strong>Missing Values:</strong> Some columns have missing values, notably:
            <ul>
                <li><code>Hirsutism</code>: 5 missing values.</li>
                <li><code>Hyperandrogenism</code>: 3 missing values.</li>
                <li><code>Weight_kg</code>: 2 missing values.</li>
                <li>Other categorical columns also contain 1-2 missing values.</li>
            </ul>
        </li>
    </ul>

    <p style="font-size:16px; color:#333;">
        Addressing missing values will be an essential part of the data preprocessing pipeline. Potential strategies include imputation, dropping rows or columns with many missing values, or treating missing categorical values as a separate level.
    </p>
</div>
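
The strategies listed above can be sketched on a small stand-in frame. This is a minimal illustration, not the notebook's actual preprocessing: the `toy` frame and its values are made up, and the choice of median imputation for numeric columns plus an explicit `"MISSING"` level for categorical columns is an assumption.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_train (values are invented for illustration)
toy = pd.DataFrame({
    "Weight_kg": [64.0, 55.0, np.nan, 56.0, 47.0],
    "Hirsutism": ["No", "No", "Yes", np.nan, "No"],
})

# Number of missing values per column, largest first
missing_counts = toy.isnull().sum().sort_values(ascending=False)
print(missing_counts)

# Impute: median for numeric columns, explicit "MISSING" level for the rest
imputed = toy.copy()
for col in imputed.columns:
    if pd.api.types.is_numeric_dtype(imputed[col]):
        imputed[col] = imputed[col].fillna(imputed[col].median())
    else:
        imputed[col] = imputed[col].fillna("MISSING")

print(imputed)
```

Keeping missing categorical values as their own level avoids guessing an answer the respondent never gave.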


In [4]:
# '''Step 3: Basic Statistics
# Let's compute basic statistics for numerical and categorical columns.'''
# # Summary statistics for numerical columns
# print(df_train.describe())

# # Summary statistics for categorical columns
# print(df_train.describe(include='object'))

<div style="background-color:#f8f9fa; padding:15px; border-radius:10px; border-left:5px solid #28a745;">
    <h2 style="color:#28a745;">📈 Basic Statistical Summary</h2>
    <p style="font-size:16px; color:#333;">
        Understanding the distribution of numerical and categorical features is crucial for further analysis. Below is a summary of key statistics:
    </p>

   <h3 style="color:#28a745;">📊 Numerical Features</h3>
   <ul style="font-size:16px; color:#333;">
        <li><strong>Weight (kg):</strong> 
            <ul>
                <li>Mean: <code>56.16</code> kg</li>
                <li>Min: <code>20</code> kg, Max: <code>116</code> kg</li>
                <li>Standard Deviation: <code>12.57</code></li>
            </ul>
        </li>
        <li>The <code>ID</code> column is likely just an identifier and does not contribute to model learning.</li>
    </ul>

   <h3 style="color:#28a745;">📋 Categorical Features</h3>
   <ul style="font-size:16px; color:#333;">
        <li><strong>Age:</strong> 
            <ul>
                <li>Most common category: <code>20-25</code> (125 occurrences).</li>
            </ul>
        </li>
        <li><strong>PCOS Distribution:</strong> 
            <ul>
                <li>Majority class: <code>No</code> (164 occurrences) vs. <code>Yes</code> (46 occurrences).</li>
            </ul>
        </li>
        <li><strong>Most Frequent Exercise Pattern:</strong> 
            <ul>
                <li>Most respondents exercise <code>Rarely</code> (102 occurrences).</li>
                <li>Most common exercise type: <code>No Exercise</code> (90 occurrences).</li>
            </ul>
        </li>
    </ul>

    <p style="font-size:16px; color:#333;">
        The dataset has an imbalance in certain categorical variables, such as <code>PCOS</code> distribution and exercise habits. 
        Further analysis will be required to determine the impact of these factors on PCOS prediction.
    </p>
</div>


In [5]:
# '''Step 4: Visualize Distributions
# Visualizing the data helps us understand the distribution of features and their relationships.'''
# import matplotlib.pyplot as plt
# import seaborn as sns

# # Plot histograms for numerical features
# plt.figure(figsize=(10, 6))
# sns.histplot(df_train['Weight_kg'], kde=True, bins=20)
# plt.title('Distribution of Weight (kg)')
# plt.xlabel('Weight (kg)')
# plt.ylabel('Frequency')
# plt.show()

# # Plot bar chart for Exercise_Frequency
# plt.figure(figsize=(10, 6))
# sns.countplot(data=df_train, x='Exercise_Frequency', order=df_train['Exercise_Frequency'].value_counts().index)
# plt.title('Exercise Frequency Distribution')
# plt.xlabel('Exercise Frequency')
# plt.ylabel('Count')
# plt.xticks(rotation=45)
# plt.show()

<div style="background-color: #f7f7f7; padding: 20px; border-radius: 10px; border: 1px solid #ddd;">
    <h2 style="color: #2c3e50; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif;">Step 4: Visualize Distributions</h2>
    <p style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; color: #34495e;">
        <strong>Overview:</strong> This step involves visualizing the distributions of numerical and categorical features to better understand the data. Visualizations help identify patterns, trends, and potential outliers.
    </p>
    <p style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; color: #34495e;">
        <strong>Key Observations:</strong>
        <ul style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; color: #34495e;">
            <li><strong>Distribution of Weight (kg):</strong>
                <ul>
                    <li>The histogram shows that the weight distribution is approximately normal, with most individuals weighing between 40 kg and 70 kg.</li>
                    <li>There are a few outliers on the higher end, with weights exceeding 100 kg.</li>
                </ul>
            </li>
            <li><strong>Exercise Frequency Distribution:</strong>
                <ul>
                    <li>The most common exercise frequency is "Rarely," with approximately 100 individuals reporting this.</li>
                    <li>"1-2 Times a Week" is the second most common, with around 40 individuals.</li>
                    <li>Other categories like "Never," "3-4 Times a Week," and "6-8 Times a Week" have fewer individuals, with counts ranging from 20 to 30.</li>
                    <li>A few rare categories, such as "6-8 hours," "Less than usual," and "Less than 6 hours," occur only once each. They look like answers to other survey questions, so they are probably data-entry errors.</li>
                </ul>
            </li>
        </ul>
    </p>
    <p style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; color: #34495e;">
        <strong>Insights:</strong>
        <ul style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; color: #34495e;">
            <li>The weight distribution suggests that most individuals in the dataset fall within a healthy weight range, but there are some outliers that may need further investigation.</li>
            <li>The exercise frequency data indicates that a significant portion of the population exercises infrequently, which could be a critical factor in predicting PCOS.</li>
            <li>The presence of rare categories in exercise frequency may require grouping or handling to improve model performance.</li>
        </ul>
    </p>
    <p style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; color: #34495e;">
        <strong>Next Steps:</strong>
        <ul style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; color: #34495e;">
            <li><strong>Handle Outliers:</strong> Investigate and address outliers in the weight distribution, if necessary.</li>
            <li><strong>Group Rare Categories:</strong> Consider grouping rare exercise frequency categories into a single "Other" category to simplify the data.</li>
            <li><strong>Explore Relationships:</strong> Investigate the relationship between exercise frequency and PCOS to identify potential predictive patterns.</li>
        </ul>
    </p>
</div>
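
Grouping rare categories, as suggested in the next steps above, could look roughly like this. The helper `group_rare`, the threshold `min_count`, and the sample values are assumptions for illustration only.

```python
import pandas as pd

def group_rare(series: pd.Series, min_count: int = 5, other_label: str = "Other") -> pd.Series:
    """Replace categories seen fewer than min_count times with other_label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep values that are not rare; replace the rare ones
    return series.where(~series.isin(rare), other_label)

# Invented sample: two frequent levels and two one-off levels
freq = pd.Series(
    ["Rarely"] * 6 + ["1-2 Times a Week"] * 5 + ["6-8 hours", "Less than usual"]
)
grouped = group_rare(freq, min_count=5)
print(grouped.value_counts())
```

Missing values pass through unchanged, so they can still be handled separately.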

In [6]:
# '''Step 5: Analyze Target Variable
# The target variable is PCOS. Let's analyze its distribution'''

# # Count of PCOS vs Non-PCOS cases
# print(df_train['PCOS'].value_counts())

# # Plot the distribution of PCOS
# plt.figure(figsize=(6, 4))
# sns.countplot(data=df_train, x='PCOS')
# plt.title('Distribution of PCOS')
# plt.xlabel('PCOS')
# plt.ylabel('Count')
# plt.show()

<div style="background-color:#f8f9fa; padding:15px; border-radius:10px; border-left:5px solid #007bff;">
    <h2 style="color:#007bff;">🔍 PCOS Distribution Analysis</h2>
    <p style="font-size:16px; color:#333;">
        The target variable <strong>PCOS</strong> (Polycystic Ovary Syndrome) is analyzed to understand its distribution in the dataset.
        The class counts reveal a clear imbalance: <strong>non-PCOS</strong> cases (164) outnumber <strong>PCOS</strong> cases (46) by roughly 3.6 to 1, so only about 22% of the training rows are positive.
    </p>
    <ul style="font-size:16px; color:#333;">
        <li><strong>Majority Class:</strong> Non-PCOS</li>
        <li><strong>Minority Class:</strong> PCOS</li>
        <li><strong>Imbalance Implication:</strong> The imbalance may affect model performance, requiring techniques such as resampling, class weighting, or synthetic data generation.</li>
    </ul>
    <p style="font-size:16px; color:#333;">
        Addressing class imbalance is crucial for building an effective predictive model, ensuring fair representation of both classes.
    </p>
</div>
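
One of the options mentioned above, class weighting, can be sketched from the class counts reported earlier (164 `No`, 46 `Yes`). The formula `n_total / (n_classes * n_class)` is the common "balanced" heuristic; whether and how such weights would be passed to the model is left open here.

```python
# Class counts taken from the summary statistics above
counts = {"No": 164, "Yes": 46}

n_total = sum(counts.values())
n_classes = len(counts)

# "Balanced" heuristic: rarer classes receive proportionally larger weights
weights = {label: n_total / (n_classes * n) for label, n in counts.items()}
minority_share = counts["Yes"] / n_total

for label, w in weights.items():
    print(f"{label}: weight {w:.3f}")
print(f"Share of PCOS cases: {minority_share:.1%}")
```

With these weights, both classes contribute the same total weight to the loss.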

In [7]:
# '''Step 6: Correlation Analysis
# Let's check the correlation between numerical features and the target variable.'''

# # Convert categorical target to numerical (e.g., 'Yes' -> 1, 'No' -> 0)
# df_train['PCOS_numeric'] = df_train['PCOS'].apply(lambda x: 1 if x == 'Yes' else 0)

# # Compute correlation matrix on numeric columns only (object columns cannot be correlated)
# corr_matrix = df_train.select_dtypes(include='number').corr()

# # Plot heatmap
# plt.figure(figsize=(10, 8))
# sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f')
# plt.title('Correlation Matrix')
# plt.show()

<div style="background-color:#f8f9fa; padding:15px; border-radius:10px; border-left:5px solid #28a745;">
    <h2 style="color:#28a745;">📊 Correlation Analysis</h2>
    <p style="font-size:16px; color:#333;">
        The correlation matrix provides insights into the relationships between numerical features in the dataset, particularly with the target variable <strong>PCOS</strong>.
    </p>
    
   <ul style="font-size:16px; color:#333;">
        <li><strong>PCOS vs Weight (0.16):</strong> There is a weak positive correlation, so weight is at most slightly associated with PCOS in this sample.</li>
        <li><strong>ID Column:</strong> The ID column should be ignored because it is only an identifier and any correlation with other variables is not meaningful.</li>
        <li><strong>Feature Selection:</strong> The matrix covers only the numeric columns. Most features are categorical and must be assessed separately, and feature engineering or interaction terms may be necessary to improve predictive modeling.</li>
    </ul>

    <p style="font-size:16px; color:#333;">
        The weak correlations suggest that PCOS prediction may require advanced techniques such as non-linear models or additional feature extraction.
    </p>
</div>
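
Because the correlation matrix only covers numeric columns, the association between a categorical feature and the target needs a different tool. A minimal sketch using a chi-square test of independence from `scipy.stats`; the `toy` data is invented, and only the column names follow the dataset.

```python
import pandas as pd
from scipy import stats

# Invented sample with two Yes/No columns named like the dataset's
toy = pd.DataFrame({
    "Hormonal_Imbalance": ["Yes"] * 8 + ["No"] * 12,
    "PCOS": ["Yes"] * 6 + ["No"] * 2 + ["Yes"] * 2 + ["No"] * 10,
})

# Contingency table: rows = feature levels, columns = target levels
table = pd.crosstab(toy["Hormonal_Imbalance"], toy["PCOS"])
print(table)

# Chi-square test of independence between the feature and the target
chi2, p_value, dof, expected = stats.chi2_contingency(table)
print(f"chi2 = {chi2:.3f}, p = {p_value:.3f}, dof = {dof}")
```

With only 210 training rows, some cells of such tables will be small, so the p-values should be read with caution.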

In [8]:
# '''Step 7: Feature Engineering
# Let's preprocess the data for modeling. This includes:
# - Encoding categorical variables.
# - Handling missing values (if any).
# - Scaling numerical features.'''

# # One-hot encode categorical variables
# df_encoded = pd.get_dummies(df_train, columns=['Age', 'Exercise_Frequency'], drop_first=True)
# print(df_encoded.head())

# # Fill missing numeric values with the column median (if any)
# df_encoded.fillna(df_encoded.median(numeric_only=True), inplace=True)

# from sklearn.preprocessing import StandardScaler

# # Scale numerical features
# scaler = StandardScaler()
# numerical_features = ['Weight_kg']
# df_encoded[numerical_features] = scaler.fit_transform(df_encoded[numerical_features])
# print(df_encoded.head())

# DATA CLEANING

## AGE

In [15]:
# check categories of Age in train set
df_train.Age.value_counts()

20-25              125
15-20               50
Less than 20        18
35-44                4
25-30                4
45 and above         3
30-35                2
30-25                1
30-40                1
Less than 20-25      1
Name: Age, dtype: int64

In [16]:
# simplify age structure - training data
df_train['Age_Group'] = 'MISSING'
# 20-25
df_train.loc[df_train.Age=='20-25', 'Age_Group'] = '20-25'
# map all categories below 20 to level "lt20"
df_train.loc[df_train.Age=='15-20', 'Age_Group'] = 'lt20'
df_train.loc[df_train.Age=='Less than 20', 'Age_Group'] = 'lt20'
df_train.loc[df_train.Age=='Less than 20-25', 'Age_Group'] = 'lt20'
# map all categories above 25 to level "gt25"
df_train.loc[df_train.Age=='35-44', 'Age_Group'] = 'gt25'
df_train.loc[df_train.Age=='25-30', 'Age_Group'] = 'gt25'
df_train.loc[df_train.Age=='45 and above', 'Age_Group'] = 'gt25'
df_train.loc[df_train.Age=='30-35', 'Age_Group'] = 'gt25'
df_train.loc[df_train.Age=='30-25', 'Age_Group'] = 'gt25'
df_train.loc[df_train.Age=='30-40', 'Age_Group'] = 'gt25'
# check results
df_train['Age_Group'].value_counts()

20-25      125
lt20        69
gt25        15
MISSING      1
Name: Age_Group, dtype: int64

In [17]:
# check categories of Age in test set
df_test.Age.value_counts()

20-25              95
Less than 20       37
30-35               2
35-44               1
Less than 20-25     1
30-30               1
Less than 20)       1
25-25               1
50-60               1
22-25               1
20                  1
30-40               1
45-49               1
Name: Age, dtype: int64

In [18]:
# simplify age structure - test data
df_test['Age_Group'] = 'MISSING'
# 20-25
df_test.loc[df_test.Age=='20-25', 'Age_Group'] = '20-25'
df_test.loc[df_test.Age=='20', 'Age_Group'] = '20-25'
df_test.loc[df_test.Age=='22-25', 'Age_Group'] = '20-25'
df_test.loc[df_test.Age=='25-25', 'Age_Group'] = '20-25'
# map all categories below 20 to level "lt20"
df_test.loc[df_test.Age=='Less than 20', 'Age_Group'] = 'lt20'
df_test.loc[df_test.Age=='Less than 20-25', 'Age_Group'] = 'lt20'
df_test.loc[df_test.Age=='Less than 20)', 'Age_Group'] = 'lt20'
# map all categories above 25 to level "gt25"
df_test.loc[df_test.Age=='30-35', 'Age_Group'] = 'gt25'
df_test.loc[df_test.Age=='35-44', 'Age_Group'] = 'gt25'
df_test.loc[df_test.Age=='30-30', 'Age_Group'] = 'gt25'
df_test.loc[df_test.Age=='50-60', 'Age_Group'] = 'gt25'
df_test.loc[df_test.Age=='30-40', 'Age_Group'] = 'gt25'
df_test.loc[df_test.Age=='45-49', 'Age_Group'] = 'gt25'
# check results
df_test['Age_Group'].value_counts()

20-25      98
lt20       39
gt25        7
MISSING     1
Name: Age_Group, dtype: int64
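
The chains of `.loc` assignments above can also be written as one lookup table shared by both data sets. A minimal sketch, assuming the raw values listed in the two `value_counts()` outputs; `AGE_MAP` and `clean_age` are hypothetical names that do not appear in the original notebook.

```python
import pandas as pd

# Raw Age value -> simplified group, following the assignments above
AGE_MAP = {
    "20-25": "20-25", "20": "20-25", "22-25": "20-25", "25-25": "20-25",
    "15-20": "lt20", "Less than 20": "lt20",
    "Less than 20-25": "lt20", "Less than 20)": "lt20",
    "25-30": "gt25", "30-25": "gt25", "30-30": "gt25", "30-35": "gt25",
    "30-40": "gt25", "35-44": "gt25", "45-49": "gt25",
    "45 and above": "gt25", "50-60": "gt25",
}

def clean_age(age: pd.Series) -> pd.Series:
    """Map raw Age values to groups; unknown or missing values become 'MISSING'."""
    return age.map(AGE_MAP).fillna("MISSING")

# Small invented sample, including a missing and an unexpected value
sample = pd.Series(["20-25", "15-20", "45-49", None, "unknown"])
cleaned = clean_age(sample)
print(cleaned.tolist())
```

A single mapping keeps the train and test rules in sync, and any value not in the table is flagged as `MISSING` instead of being silently mislabeled.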

## EXERCISE TYPE

In [19]:
# check categories of Exercise_Type in train set
df_train.Exercise_Type.value_counts()

No Exercise                                                                                                                                                90
Cardio (e.g., running, cycling, swimming)                                                                                                                  51
Cardio (e.g.                                                                                                                                               25
Flexibility and balance (e.g., yoga, pilates)                                                                                                              16
Strength training (e.g., weightlifting, resistance exercises)                                                                                               6
Cardio (e.g., running, cycling, swimming), Strength training (e.g., weightlifting, resistance exercises)                                                    4
Cardio (e.g., running, cycling, swimming), Flexibili

In [20]:
# simplify Exercise_Type structure - training data
df_train['Exercise_Type_Clean'] = 'MISSING'

# replace values
df_train.loc[df_train.Exercise_Type=='No Exercise', 'Exercise_Type_Clean'] = 'No Exercise'
df_train.loc[df_train.Exercise_Type=='Cardio (e.g., running, cycling, swimming)', 'Exercise_Type_Clean'] = 'Cardio'
df_train.loc[df_train.Exercise_Type=='Cardio (e.g.', 'Exercise_Type_Clean'] = 'Cardio'
df_train.loc[df_train.Exercise_Type=='Flexibility and balance (e.g., yoga, pilates)', 'Exercise_Type_Clean'] = 'Flexibility'
df_train.loc[df_train.Exercise_Type=='Strength training (e.g., weightlifting, resistance exercises)', 'Exercise_Type_Clean'] = 'Strength'
df_train.loc[df_train.Exercise_Type=='Cardio (e.g., running, cycling, swimming), Strength training (e.g., weightlifting, resistance exercises)', 'Exercise_Type_Clean'] = 'Cardio+Strength'
df_train.loc[df_train.Exercise_Type=='Cardio (e.g., running, cycling, swimming), Flexibility and balance (e.g., yoga, pilates)', 'Exercise_Type_Clean'] = 'Cardio+Flexibility'
df_train.loc[df_train.Exercise_Type=='High-intensity interval training (HIIT)', 'Exercise_Type_Clean'] = 'HIIT'
df_train.loc[df_train.Exercise_Type=='Cardio (e.g., running, cycling, swimming), Strength training (e.g., weightlifting, resistance exercises), Flexibility and balance (e.g., yoga, pilates),', 'Exercise_Type_Clean'] = 'Cardio+Flexibility+Strength'
df_train.loc[df_train.Exercise_Type=='Strength training (e.g., weightlifting, resistance exercises), Flexibility and balance (e.g., yoga, pilates)', 'Exercise_Type_Clean'] = 'Strength'
df_train.loc[df_train.Exercise_Type=='Flexibility and balance (e.g., yoga, pilates), None', 'Exercise_Type_Clean'] = 'Flexibility'
df_train.loc[df_train.Exercise_Type=='Cardio (e.g., running, cycling, swimming), None', 'Exercise_Type_Clean'] = 'Cardio'
df_train.loc[df_train.Exercise_Type=='Strength training', 'Exercise_Type_Clean'] = 'Strength'
df_train.loc[df_train.Exercise_Type=='Strength training (e.g.', 'Exercise_Type_Clean'] = 'Strength'
df_train.loc[df_train.Exercise_Type=='Somewhat', 'Exercise_Type_Clean'] = 'Somewhat'
df_train.loc[df_train.Exercise_Type=='Flexibility and balance (e.g.', 'Exercise_Type_Clean'] = 'Flexibility'

# check results
df_train['Exercise_Type_Clean'].value_counts()

No Exercise           90
Cardio                77
Flexibility           18
Strength               9
Cardio+Strength        4
Cardio+Flexibility     4
MISSING                4
HIIT                   3
Somewhat               1
Name: Exercise_Type_Clean, dtype: int64

In [21]:
# check categories of Exercise_Type in test set
df_test.Exercise_Type.value_counts()

Cardio (e.g.                     65
No Exercise                      63
Flexibility and balance (e.g.     7
Strength training (e.g.           2
Strength training                 1
Yes Significantly                 1
No                                1
Sleep_Benefit                     1
Not Applicable                    1
Somewhat                          1
Strength (e.g.                    1
Name: Exercise_Type, dtype: int64

In [22]:
# simplify Exercise_Type structure - test data
df_test['Exercise_Type_Clean'] = 'MISSING'

# replace values
df_test.loc[df_test.Exercise_Type=='Cardio (e.g.', 'Exercise_Type_Clean'] = 'Cardio'
df_test.loc[df_test.Exercise_Type=='No Exercise', 'Exercise_Type_Clean'] = 'No Exercise'
df_test.loc[df_test.Exercise_Type=='Flexibility and balance (e.g.', 'Exercise_Type_Clean'] = 'Flexibility'
df_test.loc[df_test.Exercise_Type=='Strength training (e.g.', 'Exercise_Type_Clean'] = 'Strength'
df_test.loc[df_test.Exercise_Type=='Strength training', 'Exercise_Type_Clean'] = 'Strength'
df_test.loc[df_test.Exercise_Type=='Yes Significantly', 'Exercise_Type_Clean'] = 'Other'
df_test.loc[df_test.Exercise_Type=='No', 'Exercise_Type_Clean'] = 'No Exercise'
df_test.loc[df_test.Exercise_Type=='Sleep_Benefit', 'Exercise_Type_Clean'] = 'MISSING'
df_test.loc[df_test.Exercise_Type=='Not Applicable', 'Exercise_Type_Clean'] = 'MISSING'
df_test.loc[df_test.Exercise_Type=='Somewhat', 'Exercise_Type_Clean'] = 'Somewhat'
df_test.loc[df_test.Exercise_Type=='Strength (e.g.', 'Exercise_Type_Clean'] = 'Strength'
# check results
df_test['Exercise_Type_Clean'].value_counts()

Cardio         65
No Exercise    64
Flexibility     7
Strength        4
MISSING         3
Other           1
Somewhat        1
Name: Exercise_Type_Clean, dtype: int64
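
After cleaning, the train and test sets do not share exactly the same levels: for example, `Other` appears only in the test set and `HIIT` only in the training set. A small check like the following can surface such mismatches before modeling. The helper `level_diff` and the sample values are assumptions for illustration.

```python
import pandas as pd

def level_diff(train: pd.Series, test: pd.Series):
    """Return (levels only in train, levels only in test), each sorted."""
    train_levels = set(train.dropna().unique())
    test_levels = set(test.dropna().unique())
    return sorted(train_levels - test_levels), sorted(test_levels - train_levels)

# Invented samples standing in for the two Exercise_Type_Clean columns
train_col = pd.Series(["Cardio", "No Exercise", "HIIT", "Cardio"])
test_col = pd.Series(["Cardio", "No Exercise", "Other"])

only_train, only_test = level_diff(train_col, test_col)
print("Only in train:", only_train)
print("Only in test:", only_test)
```

Levels that occur only in the test set carry no training signal, so it may be worth mapping them to an existing level such as `MISSING`.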