## Dataset Explained

The dataset covers obesity levels together with eating habits and lifestyle indicators, with 2,111 entries and 17 columns. Here's an overview of the columns:

1. **Age**: Age of the individual (float).
2. **Gender**: Gender of the individual (object).
3. **Height**: Height of the individual in meters (float).
4. **Weight**: Weight of the individual in kilograms (float).
5. **CALC**: Consumption of alcohol (object) - categories such as 'no', 'Sometimes', 'Frequently'.
6. **FAVC**: Frequent consumption of high-caloric food (object) - 'yes' or 'no'.
7. **FCVC**: Frequency of consumption of vegetables (float).
8. **NCP**: Number of main meals (float).
9. **SCC**: Whether the individual monitors calorie consumption (object) - 'yes' or 'no'.
10. **SMOKE**: Smoking habit (object) - 'yes' or 'no'.
11. **CH2O**: Daily water consumption (float).
12. **family_history_with_overweight**: Family history of overweight (object) - 'yes' or 'no'.
13. **FAF**: Physical activity frequency (float).
14. **TUE**: Time using technology devices (float).
15. **CAEC**: Consumption of food between meals (object) - categories such as 'Sometimes', 'Frequently'.
16. **MTRANS**: Transportation used (object) - categories such as 'Public_Transportation', 'Walking'.
17. **NObeyesdad**: Obesity level (object) - categories such as 'Normal_Weight', 'Overweight_Level_I', etc.

This dataset contains a mix of continuous (float) and categorical (object) variables. The target variable is `NObeyesdad`, the obesity level of each individual, which the models in the later sections predict.
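The split between continuous and categorical columns can be checked directly from the dtypes. The following is a minimal sketch, assuming a three-row stand-in frame (the hypothetical `sample`, built from a few values shown in the raw table in the next section) rather than the full `obese.csv`:

```python
import pandas as pd

# Hypothetical three-row stand-in for obese.csv, using values from the raw table.
sample = pd.DataFrame({
    "Age": [21.0, 21.0, 23.0],
    "Gender": ["Female", "Female", "Male"],
    "Height": [1.62, 1.52, 1.80],
    "Weight": [64.0, 56.0, 77.0],
    "CALC": ["no", "Sometimes", "Frequently"],
    "NObeyesdad": ["Normal_Weight", "Normal_Weight", "Normal_Weight"],
})

# Numeric columns are the continuous features; everything else is categorical.
continuous_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("continuous:", continuous_cols)
print("categorical:", categorical_cols)
```

On the real data the same two calls should yield the float and object columns listed above, assuming the file loads with the dtypes described.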

## Actual Dataset

In [3]:
import pandas as pd
import numpy as np

# Load the raw data (obese.csv itself is not included here)
dataframe = pd.read_csv("obese.csv")
dataframe

Unnamed: 0,Age,Gender,Height,Weight,CALC,FAVC,FCVC,NCP,SCC,SMOKE,CH2O,family_history_with_overweight,FAF,TUE,CAEC,MTRANS,NObeyesdad
0,21.000000,Female,1.620000,64.000000,no,no,2.0,3.0,no,no,2.000000,yes,0.000000,1.000000,Sometimes,Public_Transportation,Normal_Weight
1,21.000000,Female,1.520000,56.000000,Sometimes,no,3.0,3.0,yes,yes,3.000000,yes,3.000000,0.000000,Sometimes,Public_Transportation,Normal_Weight
2,23.000000,Male,1.800000,77.000000,Frequently,no,2.0,3.0,no,no,2.000000,yes,2.000000,1.000000,Sometimes,Public_Transportation,Normal_Weight
3,27.000000,Male,1.800000,87.000000,Frequently,no,3.0,3.0,no,no,2.000000,no,2.000000,0.000000,Sometimes,Walking,Overweight_Level_I
4,22.000000,Male,1.780000,89.800000,Sometimes,no,2.0,1.0,no,no,2.000000,no,0.000000,0.000000,Sometimes,Public_Transportation,Overweight_Level_II
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2106,20.976842,Female,1.710730,131.408528,Sometimes,yes,3.0,3.0,no,no,1.728139,yes,1.676269,0.906247,Sometimes,Public_Transportation,Obesity_Type_III
2107,21.982942,Female,1.748584,133.742943,Sometimes,yes,3.0,3.0,no,no,2.005130,yes,1.341390,0.599270,Sometimes,Public_Transportation,Obesity_Type_III
2108,22.524036,Female,1.752206,133.689352,Sometimes,yes,3.0,3.0,no,no,2.054193,yes,1.414209,0.646288,Sometimes,Public_Transportation,Obesity_Type_III
2109,24.361936,Female,1.739450,133.346641,Sometimes,yes,3.0,3.0,no,no,2.852339,yes,1.139107,0.586035,Sometimes,Public_Transportation,Obesity_Type_III


## Data Cleaning

In [4]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Load the dataset
file_path = 'obese.csv'
dataset = pd.read_csv(file_path)

# Display initial info about the dataset
print("Initial Dataset Info:")
dataset.info()  # info() prints its summary directly and returns None

# Check for missing values
missing_values = dataset.isnull().sum()
print("\nMissing Values in each column:")
print(missing_values)

# Handle missing values (if any) - Here we'll just drop rows with missing values for simplicity
dataset.dropna(inplace=True)

# Convert categorical variables to numerical using Label Encoding
label_encoder = LabelEncoder()

# Columns to be label encoded
label_columns = ['Gender', 'CALC', 'FAVC', 'SCC', 'SMOKE',
                 'family_history_with_overweight', 'CAEC', 'MTRANS', 'NObeyesdad']

# Apply label encoding to these columns
for column in label_columns:
    dataset[column] = label_encoder.fit_transform(dataset[column])

# Normalize numerical features using StandardScaler
scaler = StandardScaler()

# Columns to be scaled
numerical_columns = ['Age', 'Height', 'Weight', 'FCVC', 'NCP', 'CH2O', 'FAF', 'TUE']

# Apply scaling
dataset[numerical_columns] = scaler.fit_transform(dataset[numerical_columns])

# Display cleaned dataset info
print("\nCleaned Dataset Info:")
dataset.info()  # info() prints its summary directly and returns None

# Display first few rows of the cleaned dataset
print("\nFirst few rows of the cleaned dataset:")
print(dataset.head())


Initial Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2111 entries, 0 to 2110
Data columns (total 17 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Age                             2111 non-null   float64
 1   Gender                          2111 non-null   object 
 2   Height                          2111 non-null   float64
 3   Weight                          2111 non-null   float64
 4   CALC                            2111 non-null   object 
 5   FAVC                            2111 non-null   object 
 6   FCVC                            2111 non-null   float64
 7   NCP                             2111 non-null   float64
 8   SCC                             2111 non-null   object 
 9   SMOKE                           2111 non-null   object 
 10  CH2O                            2111 non-null   float64
 11  family_history_with_overweight  2111 non-null   object 
 12  FAF         

In [5]:
# Save the cleaned dataset; the modelling cells reload it from this file
dataset.to_csv('cleane_obese.csv', index=False)
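The cleaning cell reuses a single `LabelEncoder`, so only the mapping for the last column survives and the integer codes cannot be turned back into labels. One possible alternative, sketched here under the assumption of a small stand-in frame (`sample` and the `encoders` dictionary are hypothetical names, not part of the notebook), keeps one encoder per column:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in with two of the categorical columns.
sample = pd.DataFrame({
    "Gender": ["Female", "Female", "Male"],
    "CALC": ["no", "Sometimes", "Frequently"],
})

encoders = {}           # one fitted encoder per column
encoded = sample.copy()
for column in sample.columns:
    encoder = LabelEncoder()
    encoded[column] = encoder.fit_transform(sample[column])
    encoders[column] = encoder

print(encoded)

# Codes follow the sorted order of the labels, so they can be decoded later.
decoded_calc = list(encoders["CALC"].inverse_transform(encoded["CALC"]))
print(decoded_calc)
```

`LabelEncoder` assigns codes in sorted label order, so the integers do not reflect any natural ordering such as 'no' < 'Sometimes' < 'Frequently'. Whether that matters depends on the model used afterwards.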

## Cleaned Data

In [6]:
dataset

Unnamed: 0,Age,Gender,Height,Weight,CALC,FAVC,FCVC,NCP,SCC,SMOKE,CH2O,family_history_with_overweight,FAF,TUE,CAEC,MTRANS,NObeyesdad
0,-0.522124,0,-0.875589,-0.862558,3,0,-0.785019,0.404153,0,0,-0.013073,1,-1.188039,0.561997,2,3,1
1,-0.522124,0,-1.947599,-1.168077,2,0,1.088342,0.404153,1,1,1.618759,1,2.339750,-1.080625,2,3,1
2,-0.206889,1,1.054029,-0.366090,1,0,-0.785019,0.404153,0,0,-0.013073,1,1.163820,0.561997,2,3,1
3,0.423582,1,1.054029,0.015808,1,0,1.088342,0.404153,0,0,-0.013073,0,1.163820,-1.080625,2,4,5
4,-0.364507,1,0.839627,0.122740,2,0,-0.785019,-2.167023,0,0,-0.013073,0,-1.188039,-1.080625,2,3,6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2106,-0.525774,0,0.097045,1.711763,2,1,1.088342,0.404153,0,0,-0.456705,1,0.783135,0.407996,2,3,4
2107,-0.367195,0,0.502844,1.800914,2,1,1.088342,0.404153,0,0,-0.004702,1,0.389341,-0.096251,2,3,4
2108,-0.281909,0,0.541672,1.798868,2,1,1.088342,0.404153,0,0,0.075361,1,0.474971,-0.019018,2,3,4
2109,0.007776,0,0.404927,1.785780,2,1,1.088342,0.404153,0,0,1.377801,1,0.151471,-0.117991,2,3,4


## KNN

In [7]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [8]:
# Load the cleaned data saved in the data cleaning step
dataframe = pd.read_csv("cleane_obese.csv")
dataframe

Unnamed: 0,Age,Gender,Height,Weight,CALC,FAVC,FCVC,NCP,SCC,SMOKE,CH2O,family_history_with_overweight,FAF,TUE,CAEC,MTRANS,NObeyesdad
0,-0.522124,0,-0.875589,-0.862558,3,0,-0.785019,0.404153,0,0,-0.013073,1,-1.188039,0.561997,2,3,1
1,-0.522124,0,-1.947599,-1.168077,2,0,1.088342,0.404153,1,1,1.618759,1,2.339750,-1.080625,2,3,1
2,-0.206889,1,1.054029,-0.366090,1,0,-0.785019,0.404153,0,0,-0.013073,1,1.163820,0.561997,2,3,1
3,0.423582,1,1.054029,0.015808,1,0,1.088342,0.404153,0,0,-0.013073,0,1.163820,-1.080625,2,4,5
4,-0.364507,1,0.839627,0.122740,2,0,-0.785019,-2.167023,0,0,-0.013073,0,-1.188039,-1.080625,2,3,6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2106,-0.525774,0,0.097045,1.711763,2,1,1.088342,0.404153,0,0,-0.456705,1,0.783135,0.407996,2,3,4
2107,-0.367195,0,0.502844,1.800914,2,1,1.088342,0.404153,0,0,-0.004702,1,0.389341,-0.096251,2,3,4
2108,-0.281909,0,0.541672,1.798868,2,1,1.088342,0.404153,0,0,0.075361,1,0.474971,-0.019018,2,3,4
2109,0.007776,0,0.404927,1.785780,2,1,1.088342,0.404153,0,0,1.377801,1,0.151471,-0.117991,2,3,4


In [9]:
from sklearn.model_selection import train_test_split
X = dataframe.drop(['NObeyesdad'], axis='columns')
y = dataframe.NObeyesdad

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [10]:
len(X_train)

1688

In [11]:

len(X_test)

423
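The two sizes follow from how `train_test_split` rounds: the test count is the ceiling of `test_size` times the number of rows, and the training set receives the rest. A small sketch that reproduces the counts on placeholder indices (the real rows are not needed for this check):

```python
import math
from sklearn.model_selection import train_test_split

n_rows = 2111      # number of entries reported for the dataset
test_size = 0.2

# Index placeholders stand in for the real rows; only the counts matter here.
train_idx, test_idx = train_test_split(
    list(range(n_rows)), test_size=test_size, random_state=1
)

expected_test = math.ceil(test_size * n_rows)   # 422.2 rounds up to 423
expected_train = n_rows - expected_test         # the remaining 1688 rows

print(len(train_idx), len(test_idx))  # 1688 423
```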

In [12]:
# Create a KNN (k-nearest neighbors) classifier
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=1)

In [13]:
knn.fit(X_train, y_train)  # fit the classifier on the training set


In [14]:
knn.score(X_test, y_test)  # mean accuracy on the held-out test set

0.8841607565011821
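The score above uses `n_neighbors=1` without comparing other values. One common way to pick `k` is cross-validation on the training set only. The sketch below assumes synthetic data from `make_classification` as a stand-in for the cleaned obesity data, and the candidate values of `k` are arbitrary choices:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption): 300 rows, 8 features, 3 classes.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# Mean 5-fold cross-validated accuracy for each candidate k, training set only.
cv_scores = {}
for k in (1, 3, 5, 7, 9):
    knn = KNeighborsClassifier(n_neighbors=k)
    cv_scores[k] = cross_val_score(knn, X_train, y_train, cv=5).mean()

# Refit with the best k and evaluate once on the untouched test set.
best_k = max(cv_scores, key=cv_scores.get)
best_knn = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
test_accuracy = best_knn.score(X_test, y_test)

print(cv_scores)
print(best_k, test_accuracy)
```

Keeping the test set out of the search means the final score is not biased by the choice of `k`.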

## Decision Tree

In [15]:
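import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the cleaned data
df = pd.read_csv('cleane_obese.csv')

# Separate the features from the target
x = df.drop(['NObeyesdad'], axis='columns')
y = df.NObeyesdad

# Hold out 20% of the rows for testing
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, random_state=20)

# Fit a decision tree with default settings and report test accuracy
dt = DecisionTreeClassifier()
dt.fit(x_train, y_train)

print(dt.score(x_test, y_test))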

0.9290780141843972
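A single accuracy figure does not show which obesity levels get confused with each other. A confusion matrix gives that per-class view. This is a minimal sketch on synthetic three-class data (an assumption, since the cleaned file is not available here), and the `random_state=0` on the tree is added only to make the sketch repeatable:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 300 rows, 8 features, 3 classes.
x, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=20
)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, train_size=0.8, random_state=20
)

dt = DecisionTreeClassifier(random_state=0)
dt.fit(x_train, y_train)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, dt.predict(x_test), labels=[0, 1, 2])
print(cm)

# The diagonal holds the correct predictions, so it recovers the accuracy.
accuracy_from_cm = cm.trace() / cm.sum()
print(accuracy_from_cm)
```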


## Logistic Regression

In [16]:
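import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Load the cleaned data
df = pd.read_csv('cleane_obese.csv')

# Separate the features from the target
x = df.drop('NObeyesdad', axis='columns')
y = df.NObeyesdad

# Standardize every feature column, including the label-encoded ones
# (the scaler is fit on the full dataset, before the split)
scaler = StandardScaler()
x = scaler.fit_transform(x)

# Train on half of the rows and test on the other half
X_train, X_test, y_train, y_test = train_test_split(x, y, train_size=0.5, random_state=1)

# Fit a multiclass logistic regression and report test accuracy
lr = LogisticRegression()
lr.fit(X_train, y_train)
lr.score(X_test, y_test)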

0.8626893939393939
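In the cell above the scaler sees every row before the split, so test-set statistics leak into the preprocessing. A possible variant wraps the scaler and the classifier in a pipeline so that scaling is learned from the training rows only. The sketch assumes synthetic data as a stand-in, and `max_iter=1000` is an arbitrary choice to give the solver room to converge:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 300 rows, 8 features, 3 classes.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.5, random_state=1
)

# fit() scales with training statistics only; score() reuses them on the test set.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(test_accuracy)

# The fitted scaler's means match the training data, not the full dataset.
fitted_scaler = model.named_steps["standardscaler"]
print(np.allclose(fitted_scaler.mean_, X_train.mean(axis=0)))
```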

## Random Forest

In [17]:
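import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Load the cleaned data
df = pd.read_csv('cleane_obese.csv')

# Separate the features from the target
x = df.drop('NObeyesdad', axis='columns')
y = df['NObeyesdad']

# Train on 60% of the rows and test on the remaining 40%
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.6, random_state=40)

# Fit a 100-tree random forest and report test accuracy
forest = RandomForestClassifier(n_estimators=100)
forest.fit(x_train, y_train)
forest.score(x_test, y_test)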

0.9526627218934911
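The four scores in this notebook come from different splits (test fractions of 0.2, 0.2, 0.5 and 0.4, with different seeds), so they are not directly comparable. A fairer comparison fits every model on one shared split. The sketch below assumes synthetic stand-in data, and the fixed `random_state` values and `max_iter` are assumptions added for repeatability:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 300 rows, 8 features, 3 classes.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=1
)

# One shared split so every model sees the same training and test rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

models = {
    "KNN (k=1)": KNeighborsClassifier(n_neighbors=1),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit each model and record its accuracy on the common test set.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)

for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

The ranking on synthetic data says nothing about the obesity dataset itself; the point is only the shared split and fixed seeds.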