
## Loading and Inspecting the Dataset
To begin the analysis, we need to explore the dataset structure and understand the type of information it contains.
This step helps identify potential data quality issues, missing values, and the nature of each feature before we start the cleaning and modeling process.

In this step, we will:
- **Load** the CSV file from the GitHub repository
- **Check** the dataset’s dimensions (number of rows and columns)
- **Preview** the first few rows to get an overview of the variables
- **Inspect** data types, value ranges, and basic statistical metadata
- **Identify** possible missing or inconsistent values

This initial inspection forms the foundation for further **data cleaning, feature engineering**, and **machine learning tasks**.

```python
import pandas as pd

# Load the dataset from GitHub
url = "https://raw.githubusercontent.com/Cerasela-b/health-and-lifestyle-analytics/main/data/health_lifestyle_dataset.csv"
df = pd.read_csv(url)

# Check dataset dimensions (number of rows and columns)
print("Dataset shape (rows, columns): ", df.shape)

print("\nFirst 5 rows: ")
print(df.head())

# Show column data types and non-null counts
print("\nDataset info: ")
df.info()
```

```text
Dataset shape (rows, columns):  (100000, 16)

First 5 rows: 
   id  age  gender   bmi  daily_steps  sleep_hours  water_intake_l  \
0   1   56    Male  20.5         4198          3.9             3.4   
1   2   69  Female  33.3        14359          9.0             4.7   
2   3   46    Male  31.6         1817          6.6             4.2   
3   4   32  Female  38.2        15772          3.6             2.0   
4   5   60  Female  33.6         6037          3.8             4.0   

   calories_consumed  smoker  alcohol  resting_hr  systolic_bp  diastolic_bp  \
0               1602       0        0          97          161           111   
1               2346       0        1          68          116            65   
2               1643       0        1          90          123            99   
3               2460       0        0          71          165            95   
4               3756       0        1          98          139            61   

   cholesterol  family_history  disea
```
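
The steps listed earlier also mention value ranges and inconsistent values, which `shape`, `head()`, and `info()` do not show directly. A minimal sketch of how those checks might look is given below. It assumes a small stand-in frame named `sample`, built from the five preview rows, so that it runs without network access; the plausibility bounds in `bounds` are hypothetical and chosen only for illustration. In the notebook itself, the same calls would be applied to the full `df`.

```python
import pandas as pd

# Stand-in built from the five preview rows shown above (assumption:
# used here only so the sketch is self-contained; use `df` in the notebook).
sample = pd.DataFrame({
    "age": [56, 69, 46, 32, 60],
    "bmi": [20.5, 33.3, 31.6, 38.2, 33.6],
    "sleep_hours": [3.9, 9.0, 6.6, 3.6, 3.8],
    "systolic_bp": [161, 116, 123, 165, 139],
    "diastolic_bp": [111, 65, 99, 95, 61],
})

# Summary statistics (count, mean, std, min, quartiles, max) per numeric column
summary = sample.describe()
print(summary.loc[["min", "max"]])

# Missing values per column and number of fully duplicated rows
missing_per_column = sample.isna().sum()
duplicate_rows = int(sample.duplicated().sum())

# Hypothetical plausibility bounds (inclusive), for illustration only
bounds = {"age": (0, 120), "sleep_hours": (0, 24), "bmi": (10, 70)}
out_of_range = {
    col: int((~sample[col].between(low, high)).sum())
    for col, (low, high) in bounds.items()
}

print(missing_per_column)
print("Duplicate rows:", duplicate_rows)
print("Out-of-range counts:", out_of_range)
```

Counting out-of-range values rather than dropping them keeps this step purely diagnostic; any decision about how to treat such rows belongs to the later cleaning stage.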


## Dataset Loading and Initial Inspection

The dataset was successfully loaded from GitHub and contains **100,000 rows** and **16 columns** describing various **health and lifestyle factors** such as age, BMI, sleep, activity level, and disease risk.

Every column is fully populated (no null values), which removes the need for imputation, though it does not by itself rule out implausible or inconsistent values.
The dataset mixes **numeric** and **categorical** variables: `gender` is stored as an object type, and several binary indicators are already encoded as integers.

**Key observations**:

- Data types: 12 integer, 3 float, 1 object (`gender`)
- Binary columns (`smoker`, `alcohol`, `family_history`) are already encoded as 0/1
- No missing data detected in any column
- Each record represents one individual’s lifestyle and health profile
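
The type counts and the 0/1 encoding noted above can be confirmed programmatically rather than read off the `info()` output. The following is a sketch under stated assumptions: `sample` is a stand-in holding a few columns from the five preview rows, `kind_of` is a hypothetical helper introduced for illustration, and `family_history` is left out because its values are not visible in the truncated preview. On the full `df`, the same logic would be expected to reproduce the 12 integer, 3 float, and 1 non-numeric split.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype

# Stand-in with a subset of columns from the preview rows (assumption:
# used only to keep the sketch self-contained; use `df` in the notebook).
sample = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "gender": ["Male", "Female", "Male", "Female", "Female"],
    "bmi": [20.5, 33.3, 31.6, 38.2, 33.6],
    "smoker": [0, 0, 0, 0, 0],
    "alcohol": [0, 1, 1, 0, 1],
})


def kind_of(series: pd.Series) -> str:
    """Hypothetical helper: classify a column as integer, float, or other."""
    if is_integer_dtype(series):
        return "integer"
    if is_float_dtype(series):
        return "float"
    return "other"


# Map each column to its kind, then count how many columns fall in each kind
column_kinds = {col: kind_of(sample[col]) for col in sample.columns}
kind_counts = pd.Series(column_kinds).value_counts().to_dict()
print(kind_counts)

# Flag any supposedly binary column containing values other than 0 or 1
binary_columns = ["smoker", "alcohol"]
non_binary = [
    col for col in binary_columns if not sample[col].isin([0, 1]).all()
]
print("Columns violating 0/1 encoding:", non_binary)
```

Classifying columns with `pandas.api.types` rather than comparing dtype names keeps the check independent of platform integer width and of how a given pandas version labels string columns.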

The data is well-structured and ready for **cleaning, preprocessing**, and **exploratory data analysis (EDA)** to identify relationships between lifestyle habits and disease risk.