# Data Exploration 
Data exploration involves summarising the dataset's main characteristics and visualising its distributions and relationships. This step helps identify patterns, trends, and potential issues such as missing values, laying the foundation for subsequent analysis and modelling.

## Importing Libraries
The required libraries are imported: `pandas` for handling the data, `os` for building the file path, `matplotlib` and `seaborn` for visualisation, and `LabelEncoder` from `scikit-learn` for encoding categorical columns.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
import os

## Loading the Dataset
The dataset is loaded from a CSV file from the `datasets` folder into a `pandas` `DataFrame`.

In [None]:
csv_path = os.path.join(os.getcwd(), "datasets", "obesityData.csv")

ob_df = pd.read_csv(csv_path) # ob_df --> obesity dataframe
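
If the notebook is started from a different working directory, `pd.read_csv` fails with a fairly generic error. A minimal sketch of a more defensive loader is shown here; the helper name `load_dataset` is hypothetical and not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_dataset(csv_path):
    """Read a CSV into a DataFrame, failing early if the file is missing."""
    csv_path = Path(csv_path)
    if not csv_path.is_file():
        raise FileNotFoundError(f"Dataset not found: {csv_path}")
    return pd.read_csv(csv_path)


# Hypothetical usage, assuming the notebook runs next to the `datasets` folder:
# ob_df = load_dataset(Path.cwd() / "datasets" / "obesityData.csv")
```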

## Exploring the Dataset
Initial exploration involves viewing the first few rows, checking the shape of the dataset, summarising basic statistics, and checking for missing values, data types, and duplicate rows. This helps us understand the structure and basic characteristics of the data.

In [None]:
# view the first few rows
ob_df.head()

In [None]:
# Check the shape of the dataset
print(f"The dataset contains {ob_df.shape[0]} rows and {ob_df.shape[1]} columns.")

In [None]:
# Generate descriptive statistics for the data
ob_df.describe()
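
By default, `describe()` summarises only the numeric columns when a DataFrame mixes numeric and text data. A minimal sketch of how the categorical columns could be summarised as well, using a small made-up DataFrame (`toy_df` and its values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy data standing in for the real DataFrame (values are made up)
toy_df = pd.DataFrame({
    "Age": [21, 23, 27, 22],
    "Gender": ["Female", "Male", "Male", "Male"],
})

# Default behaviour: numeric columns only
numeric_summary = toy_df.describe()

# Excluding numeric dtypes leaves the categorical columns, which are
# summarised by count, number of unique values, most common value, and its frequency
categorical_summary = toy_df.describe(exclude=["number"])
print(categorical_summary)
```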

In [None]:
# Check for missing values
ob_df.isnull().sum()
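
Raw counts of missing values are easier to judge alongside the share of rows they represent. A minimal sketch, again on made-up data (`toy_df` and `missing_summary` are illustrative names):

```python
import numpy as np
import pandas as pd

# Toy data with a few deliberately missing entries
toy_df = pd.DataFrame({
    "Age": [21.0, np.nan, 27.0, 22.0],
    "Weight": [64.0, 56.0, np.nan, np.nan],
})

# isnull() gives a boolean mask; its sum is the count and its mean is the fraction
missing_summary = pd.DataFrame({
    "missing_count": toy_df.isnull().sum(),
    "missing_pct": toy_df.isnull().mean() * 100,
})
print(missing_summary)
```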

In [None]:
# check data types
ob_df.dtypes

In [None]:
# Show duplicated rows (every repeat after the first occurrence)
ob_df[ob_df.duplicated()]
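
Displaying the duplicated rows shows what they look like; counting them and, if appropriate, dropping them is a common follow-up. Whether duplicates should be removed depends on how the data was collected, so the sketch below (on made-up data) is only one possible approach:

```python
import pandas as pd

# Toy data in which the second row repeats the first
toy_df = pd.DataFrame({
    "Age": [21, 21, 23],
    "Height": [1.62, 1.62, 1.80],
})

# duplicated() flags every repeat after the first occurrence
n_duplicates = toy_df.duplicated().sum()
print(f"Found {n_duplicates} duplicated row(s).")

# Keep the first occurrence of each row and rebuild a clean index
deduplicated_df = toy_df.drop_duplicates().reset_index(drop=True)
```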

## Data Visualisation 

### 1. **Distribution of Ages**

In [None]:
plt.figure(figsize=(10, 6))
sns.histplot(ob_df['Age'], bins=30, kde=True)
plt.title('Age Distribution')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

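### 2. **Distribution of BMI (Body Mass Index)**
BMI is calculated as weight in kilograms divided by the square of height in metres ($\text{BMI} = \frac{\text{Weight}}{\text{Height}^2}$). The calculation assumes that the `Weight` column is recorded in kilograms and the `Height` column in metres.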

In [None]:
ob_df['BMI'] = ob_df['Weight'] / (ob_df['Height'])**2

plt.figure(figsize=(10,6))
sns.histplot(ob_df['BMI'], bins=30, kde=True)
plt.title('BMI Distribution')
plt.xlabel('BMI')
plt.ylabel('Frequency')
plt.show()
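
As a quick worked check of the formula, a person weighing 70 kg at 1.75 m has a BMI of $70 / 1.75^2 \approx 22.86$. The sketch below reproduces this on made-up values and includes a simple unit check; the threshold of 3 is an assumption chosen only to catch heights accidentally stored in centimetres.

```python
import pandas as pd

# Toy values: weights in kilograms, heights in metres
toy_df = pd.DataFrame({"Weight": [70.0, 90.0], "Height": [1.75, 1.80]})

# Heights well above 3 would suggest centimetres rather than metres
assert toy_df["Height"].max() < 3, "Height does not look like metres"

toy_df["BMI"] = toy_df["Weight"] / toy_df["Height"] ** 2
print(toy_df["BMI"].round(2).tolist())  # [22.86, 27.78]
```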

### 3. **Box Plot of BMI by Gender**

In [None]:
plt.figure(figsize=(10,6))
sns.boxplot(x='Gender', y='BMI', data=ob_df)
plt.title('BMI by Gender')
plt.xlabel('Gender')
plt.ylabel('BMI')
plt.show()
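
The box plot can be complemented with the numbers behind it. A minimal sketch using `groupby` on made-up data (the column names match the notebook, the values do not):

```python
import pandas as pd

# Toy data standing in for the Gender and BMI columns
toy_df = pd.DataFrame({
    "Gender": ["Female", "Female", "Male", "Male"],
    "BMI": [22.0, 26.0, 24.0, 30.0],
})

# Median, minimum, and maximum BMI per gender
bmi_by_gender = toy_df.groupby("Gender")["BMI"].agg(["median", "min", "max"])
print(bmi_by_gender)
```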

### 4. **Correlation Heatmap**

In [None]:
# Convert categorical variables to numeric using Label Encoding
label_encoders = {}
ob_df_view = ob_df.copy()  # independent copy, so ob_df keeps its original categorical values
for column in ob_df.select_dtypes(include=['object']).columns:
    label_encoders[column] = LabelEncoder()
    ob_df_view[column] = label_encoders[column].fit_transform(ob_df[column])

# Calculate the correlation matrix
correlation_matrix = ob_df_view.corr()

# Plot the correlation heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()
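
Label encoding assigns integer codes in the sorted order of the labels, so for categories with no natural order the resulting correlations can be hard to interpret. One alternative, sketched here on made-up data, is to restrict the correlation matrix to columns that are numeric to begin with. This is a different technique from the label-encoding approach above, not a drop-in replacement.

```python
import pandas as pd

# Toy data with two numeric columns and one categorical column
toy_df = pd.DataFrame({
    "Age": [21, 23, 27, 22],
    "Weight": [60.0, 66.0, 78.0, 63.0],
    "Gender": ["Female", "Male", "Male", "Female"],
})

# numeric_only=True skips non-numeric columns instead of encoding them
numeric_corr = toy_df.corr(numeric_only=True)
print(numeric_corr)
```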

### 5. **Count Plot of Obesity Levels**
`NObeyesdad` is the target variable, which holds the obesity level.

In [None]:
plt.figure(figsize=(12, 6))
sns.countplot(x='NObeyesdad', hue='NObeyesdad', data=ob_df, palette='viridis', dodge=False, legend=False)
plt.title('Count of Each Obesity Level')
plt.xlabel('Obesity Level')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()
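
The same class balance can be read off numerically with `value_counts`. A minimal sketch; the labels `level_a`, `level_b`, and `level_c` are placeholders, not the actual categories in `NObeyesdad`:

```python
import pandas as pd

# Placeholder labels standing in for the real obesity levels
toy_levels = pd.Series(
    ["level_a", "level_b", "level_a", "level_c"],
    name="NObeyesdad",
)

# normalize=True returns the share of each class instead of the raw count
class_share = toy_levels.value_counts(normalize=True)
print(class_share)
```

A strongly uneven split here would be worth keeping in mind when choosing evaluation metrics for later modelling.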

### 6. **Pairplot of Selected Features**
The chosen features are `Age`, `Height`, `Weight`, and `BMI`.

In [None]:
selected_features = ['Age', 'Height', 'Weight',  'BMI']
sns.pairplot(ob_df[selected_features])
plt.show()

### 7. **Distribution of Family History with Overweight**

In [None]:
plt.figure(figsize=(8, 6))
sns.countplot(x='family_history_with_overweight', hue='family_history_with_overweight', data=ob_df, palette='viridis', dodge=False, legend=False)
plt.title('Family History with Overweight')
plt.xlabel('Family History')
plt.ylabel('Count')
plt.show()

### 8. **Count Plot for Frequent Consumption of High Caloric Food (FAVC)**

In [None]:
plt.figure(figsize=(8, 6))
sns.countplot(x='FAVC', data=ob_df)
plt.title('Count Plot of Frequent Consumption of High Caloric Food (FAVC)')
plt.xlabel('Frequent Consumption of High Caloric Food')
plt.ylabel('Count')
plt.show()