# **Multi-Class Prediction of Obesity Risk Project**
#### By Angel Gutierrez Sanjuan

### Objective
In this project we analyze data related to individuals' weight using several algorithms, with the goal 
of observing and comparing their prediction results. My part of the project is to implement the KNN algorithm on the data set.

### K-Nearest Neighbor (KNN) Algorithm
KNN is a supervised learning algorithm that makes predictions based on nearby data points. To classify a new sample, it identifies the *k* closest points in the training set and predicts the most common class among them.
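
To make the idea concrete, here is a minimal pure-Python sketch of the nearest-neighbor vote. It is only an illustration: the helper name `knn_predict`, the toy points, and the two labels are hypothetical and are not taken from the obesity data set. The notebook itself uses scikit-learn's `KNeighborsClassifier`.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep only the k closest points
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote among the labels of those k points
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: (height in m, weight in kg) -> label
train_points = [(1.60, 50.0), (1.65, 55.0), (1.70, 60.0),
                (1.70, 95.0), (1.75, 100.0), (1.80, 110.0)]
train_labels = ["Normal", "Normal", "Normal", "Obese", "Obese", "Obese"]

print(knn_predict(train_points, train_labels, (1.68, 58.0), k=3))   # Normal
print(knn_predict(train_points, train_labels, (1.78, 105.0), k=3))  # Obese
```

In this toy example the weight differences dominate the distance because weight is on a much larger scale than height, which is one reason features are usually standardized before applying KNN.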


In [9]:
'''
Libraries needed

'''

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix


df = pd.read_csv("ObesityDataSet.csv")

df.head()


Unnamed: 0,Gender,Age,Height,Weight,family_history_with_overweight,FAVC,FCVC,NCP,CAEC,SMOKE,CH2O,SCC,FAF,TUE,CALC,MTRANS,NObeyesdad
0,Female,21.0,1.62,64.0,yes,no,2.0,3.0,Sometimes,no,2.0,no,0.0,1.0,no,Public_Transportation,Normal_Weight
1,Female,21.0,1.52,56.0,yes,no,3.0,3.0,Sometimes,yes,3.0,yes,3.0,0.0,Sometimes,Public_Transportation,Normal_Weight
2,Male,23.0,1.8,77.0,yes,no,2.0,3.0,Sometimes,no,2.0,no,2.0,1.0,Frequently,Public_Transportation,Normal_Weight
3,Male,27.0,1.8,87.0,no,no,3.0,3.0,Sometimes,no,2.0,no,2.0,0.0,Frequently,Walking,Overweight_Level_I
4,Male,22.0,1.78,89.8,no,no,2.0,1.0,Sometimes,no,2.0,no,0.0,0.0,Sometimes,Public_Transportation,Overweight_Level_II
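
The output above contains an `Unnamed: 0` column, which pandas creates when the first column of the CSV file has no header. This usually happens when a row index was saved along with the data. Assuming that is the case here (an assumption, since the file itself is not shown), the column is not a real measurement and could be excluded before modeling. The sketch below shows two ways to do that on a trimmed copy of the rows above; the results reported in this notebook were produced without this step.

```python
import io

import pandas as pd

# A few rows and columns from the output above, written as CSV text
# with an unnamed first column (trimmed for brevity).
sample_csv = """,Gender,Age,Height,Weight,NObeyesdad
0,Female,21.0,1.62,64.0,Normal_Weight
1,Female,21.0,1.52,56.0,Normal_Weight
2,Male,23.0,1.8,77.0,Normal_Weight
"""

# Reading as-is: pandas names the header-less column "Unnamed: 0"
raw = pd.read_csv(io.StringIO(sample_csv))
print(raw.columns[0])

# Option 1: treat the first column as the row index while reading
indexed = pd.read_csv(io.StringIO(sample_csv), index_col=0)

# Option 2: drop the column after loading
dropped = raw.drop(columns=["Unnamed: 0"])

print(list(indexed.columns))
print(list(dropped.columns))
```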


## Defining Abbreviations

- FAVC  : Frequent Consumption of High-Calorie Food
- FCVC  : Frequency of Consumption of Vegetables
- NCP   : Number of Main Meals
- CAEC  : Consumption of Food Between Meals
- CH2O  : Daily Consumption of Water
- CALC  : Consumption of Alcohol
- SCC   : Calorie Consumption Monitoring
- FAF   : Physical Activity Frequency
- TUE   : Time Using Technology Devices
- MTRANS: Transportation Used


In [13]:
'''
In this part we handle the categorical and the numeric values.
LabelEncoder is used to convert each category into a numerical value.
'''


encoders = {}
categorical_headers = ['Gender', 'family_history_with_overweight', 'FAVC', 'CAEC', 'SMOKE', 'SCC', 'CALC', 'MTRANS', 'NObeyesdad']

for column in categorical_headers:
    label_encoder = LabelEncoder()
    df[column] = label_encoder.fit_transform(df[column])
    encoders[column] = label_encoder

# Features and target

X = df.drop("NObeyesdad", axis= 1)
y = df["NObeyesdad"]

# Split the data into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize all features: fit the scaler on the training set only, then apply it to the test set
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

for k in range(1,10):
    knn = KNeighborsClassifier(n_neighbors = k)

    # trains model
    knn.fit(X_train_scaled,y_train)

    # prediction 
    y_pred = knn.predict(X_test_scaled)

    accuracy = accuracy_score(y_test, y_pred)
    print(f"Accuracy in k = {k}: {accuracy * 100:.2f}%")


Accuracy in k = 1: 82.51%
Accuracy in k = 2: 81.56%
Accuracy in k = 3: 81.80%
Accuracy in k = 4: 82.03%
Accuracy in k = 5: 82.03%
Accuracy in k = 6: 81.09%
Accuracy in k = 7: 79.91%
Accuracy in k = 8: 79.20%
Accuracy in k = 9: 78.49%
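
To illustrate the encoding step used in the cell above, here is a small pure-Python sketch of what label encoding does: the distinct categories are sorted and each one is assigned an integer code. The helper name `label_encode` is hypothetical and this is not scikit-learn's implementation, although `LabelEncoder` also assigns codes in sorted order. The sample values are the `CALC` entries from the first five rows shown earlier.

```python
def label_encode(values):
    """Map each distinct category to an integer code, assigned in sorted order."""
    classes = sorted(set(values))
    mapping = {category: code for code, category in enumerate(classes)}
    codes = [mapping[value] for value in values]
    return codes, mapping


# CALC values from the first five rows of the data set
calc = ["no", "Sometimes", "Frequently", "Frequently", "Sometimes"]

codes, mapping = label_encode(calc)
print(mapping)  # {'Frequently': 0, 'Sometimes': 1, 'no': 2}
print(codes)    # [2, 1, 0, 0, 1]
```

One thing worth keeping in mind: the codes follow string sort order, not the natural order of the categories (here `no` receives the largest code). Because KNN relies on distances, this arbitrary ordering can influence which points count as neighbors. One-hot encoding is a common alternative for such columns, though whether it helps on this data set would need to be tested.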


### Conclusions

Now that we have implemented the KNN algorithm, we can draw conclusions from the results. A smaller k value tends to make the model more sensitive to *noise*, which can lead to overfitting. Larger k values, on the other hand, make the model less sensitive to *noise*, but they can also cause underfitting. In our results, k = 1 gives the highest accuracy (82.51%), and accuracy declines steadily from k = 6 onward, reaching 78.49% at k = 9. Even so, k = 4 or k = 5 (both 82.03%) may be the better choice: they are within half a percentage point of the best accuracy while being less sensitive to individual noisy points.
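
One caveat with comparing k values by test accuracy is that the test set then influences the choice of k. A common alternative is to pick k by cross-validation on the training set and use the test set only once at the end. The sketch below shows the idea on synthetic data generated with `make_classification`; the data, the 5-fold setting, and the variable names are assumptions for illustration, so the numbers it prints say nothing about the obesity data set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the obesity data set)
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Mean 5-fold cross-validation accuracy on the training set for each k.
# The pipeline refits the scaler inside every fold, so no fold sees
# statistics computed from its own validation part.
cv_scores = {}
for k in range(1, 10):
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_scores[k] = cross_val_score(model, X_train, y_train, cv=5).mean()

best_k = max(cv_scores, key=cv_scores.get)
print(f"Best k by 5-fold cross-validation: {best_k}")

# Refit with the chosen k on the full training set, then evaluate once on the test set
final_model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=best_k))
final_model.fit(X_train, y_train)
print(f"Test accuracy: {final_model.score(X_test, y_test):.3f}")
```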

