# Obesity: Data Analysis

The dataset has the following attributes. For each categorical attribute, the possible values are listed together with the number of records that take each value:

* Sex
    1. Male 712
    2. Female 898
* Age Values in integers (years)
* Height Values in integers (cm)
* Overweight/Obese Families
    1. Yes 266
    2. No 1344
* Consumption of Fast Food
    1. Yes 436
    2. No 1174
* Frequency of Consuming Vegetables
    1. Rarely 400
    2. Sometimes 708
    3. Always 502
* Number of Main Meals Daily
    1. 1-2 444
    2. 3 928
    3. 3+ 238
* Food Intake Between Meals
    1. Rarely 346
    2. Sometimes 564
    3. Usually 417
    4. Always 283
* Smoking
    1. Yes 492
    2. No 1118
* Liquid Intake Daily
    1. Amount smaller than one liter 456
    2. Within the range of 1 to 2 liters 523
    3. In excess of 2 liters 631
* Calculation Of Calorie Intake
    1. Yes 286
    2. No 1324
* Physical Exercise
    1. No physical activity 206
    2. In the range of 1-2 days 290
    3. In the range of 3-4 days 370
    4. In the range of 5-6 days 358
    5. 6+ days 386
* Schedule Dedicated to Technology
    1. Between 0 and 2 hours 382
    2. Between 3 and 5 hours 826
    3. Exceeding five hours 402
* Type of Transportation Used
    1. Automobile 660
    2. Motorbike 94
    3. Bike 116
    4. Public transportation 602
    5. Walking 138
* Class
    1. Underweight 73
    2. Normal 658
    3. Overweight 592
    4. Obesity 287

The data has already been sanitized, so it contains no extreme values, and it consists largely of categorical data. It is stored in a CSV file in which each row represents a person and each column represents a feature. The last column is the person's class, which is one of Underweight, Normal, Overweight, or Obesity.
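
As a quick sanity check on the class column, the class counts listed above can be turned into proportions. The sketch below is illustrative only: it hard-codes the four counts from the attribute list rather than reading the CSV, and the names `class_counts` and `class_shares` are hypothetical.

```python
# Class counts copied from the attribute list (not read from the CSV).
class_counts = {"Underweight": 73, "Normal": 658, "Overweight": 592, "Obesity": 287}

total = sum(class_counts.values())

# Share of each class as a percentage of all records.
class_shares = {name: 100 * count / total for name, count in class_counts.items()}

print("Total records:", total)
for name, share in class_shares.items():
    print(f"{name}: {share:.1f}%")
```

With these counts the total is 1610, and Normal and Overweight together account for roughly 78% of the records, so the classes are noticeably imbalanced.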

## Load Data

In [1]:
import numpy as np

# Read the CSV file as raw text
file_path = 'Obesity_Dataset/Obesity_Dataset.csv'
with open(file_path, 'r') as file:
    data = file.read()

# Split the data into rows and then columns (assuming the CSV uses commas)
lines = data.splitlines()
columns = [
    "Sex", "Age", "Height", "Overweight_Obese_Family", "Consumption_of_Fast_Food",
    "Frequency_of_Consuming_Vegetables", "Number_of_Main_Meals_Daily", "Food_Intake_Between_Meals",
    "Smoking", "Liquid_Intake_Daily", "Calculation_of_Calorie_Intake", "Physical_Exercise",
    "Schedule_Dedicated_to_Technology", "Type_of_Transportation_Used", "Class"
]

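rows = [line.split(',') for line in lines]

# At this point every field is still a string
data_np = np.array(rows)

# Convert to floats; this requires every field (categories included) to be
# numerically coded and the file to contain no header row
numeric_data = data_np.astype(float)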
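
Once the data is loaded, the per-category counts in the attribute list can be cross-checked by counting how often each code occurs in a column. The following is a minimal sketch, assuming the categories are stored as numeric codes (as the `astype(float)` conversion implies). The helper `category_counts` is a hypothetical name, and `sample` is a synthetic stand-in, not the real dataset.

```python
import numpy as np

def category_counts(column):
    """Return {category_code: count} for one column of numerically coded values."""
    codes, counts = np.unique(column, return_counts=True)
    return {int(code): int(count) for code, count in zip(codes, counts)}

# Synthetic stand-in for a coded column such as numeric_data[:, 0] (Sex).
sample = np.array([1.0, 2.0, 2.0, 1.0, 2.0])
print(category_counts(sample))  # {1: 2, 2: 3}
```

On the real array, a call such as `category_counts(numeric_data[:, 0])` should reproduce the counts listed for Sex if the coding matches the order in the attribute list, which is an assumption here.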

## Analyze Data

In [2]:
# Index of the column to analyze (1 = Age)
c = 1
col_data = numeric_data[:, c]
print("Column:", columns[c])

mean_value = np.mean(col_data)
median_value = np.median(col_data)
std_value = np.std(col_data)

Q1 = np.percentile(col_data, 25)
Q3 = np.percentile(col_data, 75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

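# Boolean mask of values outside the 1.5 * IQR fences (computed but not printed here)
outliers = (col_data < lower_bound) | (col_data > upper_bound)

# Print summary statistics (np.std defaults to the population standard deviation)
print("Mean of column:", mean_value)
print("Median of column:", median_value)
print("Standard deviation of column:", std_value)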

# np.unique already returns the distinct values in sorted order
sorted_data = np.unique(col_data)
lowest_values = sorted_data[:3]
highest_values = sorted_data[-3:]

print("Three lowest values in column:", lowest_values)
print("Three highest values in column:", highest_values)


Column: Age
Mean of column: 33.11552795031056
Median of column: 32.0
Standard deviation of column: 9.832021199648462
Three lowest values in column: [18. 19. 20.]
Three highest values in column: [52. 53. 54.]
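
The cell above builds an `outliers` mask from the 1.5 × IQR rule but never reports it. The sketch below shows one way that mask could be used to list the flagged values. The function name `iqr_outliers`, its parameter `k`, and the array `sample_ages` are hypothetical, and the sample values are synthetic rather than taken from the dataset.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR] and the two fences."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    mask = (values < lower) | (values > upper)
    return values[mask], (float(lower), float(upper))

# Synthetic ages with one implausible entry; not the real Age column.
sample_ages = np.array([18, 22, 25, 30, 32, 35, 41, 54, 95])
flagged, fences = iqr_outliers(sample_ages)
print("Fences:", fences)     # (1.0, 65.0)
print("Outliers:", flagged)  # [95.]
```

Applied to the real column, `iqr_outliers(col_data)` would be expected to return few or no values, given that the dataset is described as already sanitized.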
