This README gives an overview of a project that applies a Support Vector Machine (SVM) model to a dataset to estimate obesity levels from eating habits and physical condition. The independent variables are Gender, Age, Height, Weight, family_history_with_overweight, FAVC, FCVC, NCP, CAEC, Smoking, CH2O, SCC, FAF, TUE, CALC, and MTRANS (mode of transportation); the dependent variable is Obesity Category.

The dataset includes the following columns:
- Gender: Gender of the individuals (Categorical: 'Female' or 'Male').
- Age: Age of the individuals.
- Height: Height of the individuals.
- Weight: Weight of the individuals.
- family_history_with_overweight: Family history of overweight (Categorical: 'yes' or 'no').
- FAVC: Frequent consumption of high-calorie food (Categorical: 'no' or 'yes').
- FCVC: Frequency of consumption of vegetables.
- NCP: Number of main meals.
- CAEC: Consumption of food between meals (Categorical: 'Sometimes', 'Frequently', 'Always', or 'no').
- Smoking: Smoking habits (Categorical: 'no' or 'yes').
- CH2O: Daily water consumption.
- SCC: Calorie consumption monitoring (Categorical: 'no' or 'yes').
- FAF: Physical activity frequency.
- TUE: Time using technology devices.
- CALC: Consumption of alcohol (Categorical: 'no', 'Sometimes', 'Frequently', or 'Always').
- MTRANS: Mode of transportation (Categorical: 'Public_Transportation', 'Walking', 'Automobile', 'Motorbike', 'Bike').
- Obesity Category: Dependent variable with categories: 'Normal_Weight', 'Overweight_Level_I', 'Overweight_Level_II', 'Obesity_Type_I', 'Insufficient_Weight', 'Obesity_Type_II', 'Obesity_Type_III'.

The workflow was as follows:
- The dataset was split into a training set and a test set.
- The categorical columns ('Gender', 'family_history_with_overweight', 'FAVC', 'CAEC', 'Smoking', 'SCC', 'CALC', and 'MTRANS') were label encoded.
- The features were standardized using StandardScaler so that they have comparable scales, which matters for SVMs because the decision boundary depends on distances between samples.
- An SVM model was trained with a linear kernel.
- The random state was set to 0 for reproducibility.
- A confusion matrix was generated on the test set to assess model performance:
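
| Actual \ Predicted | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | 56 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 5 | 53 | 0 | 0 | 0 | 4 | 0 |
| 2 | 0 | 0 | 75 | 2 | 0 | 0 | 1 |
| 3 | 0 | 0 | 1 | 57 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 63 | 0 | 0 |
| 5 | 0 | 2 | 0 | 0 | 0 | 52 | 2 |
| 6 | 0 | 0 | 0 | 0 | 0 | 2 | 48 |
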
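
The steps listed above (label encoding, train/test split, standardization, linear-kernel SVM, confusion matrix) can be sketched roughly as follows. This is a minimal illustration assuming pandas and scikit-learn, not the project's actual code: the function name `train_linear_svm`, the 80/20 split ratio, the order of encoding and splitting, and the small synthetic data frame are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.svm import SVC


def train_linear_svm(df, target, test_size=0.2, random_state=0):
    """Hypothetical helper: encode, split, scale, and fit a linear-kernel SVM."""
    X = df.drop(columns=[target]).copy()
    # Label-encode every non-numeric feature column (e.g. Gender, CAEC, MTRANS).
    for col in X.select_dtypes(exclude="number").columns:
        X[col] = LabelEncoder().fit_transform(X[col])
    # Encode the class labels as integers 0..k-1.
    y = LabelEncoder().fit_transform(df[target])

    # test_size=0.2 is an assumption; the README does not state the ratio.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )

    # Fit the scaler on the training set only, then apply it to both sets.
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

    model = SVC(kernel="linear", random_state=random_state)
    model.fit(X_train, y_train)
    cm = confusion_matrix(y_test, model.predict(X_test))
    return model, scaler, cm


# Synthetic stand-in data with a few of the README's column names.
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "Gender": rng.choice(["Female", "Male"], size=n),
    "Age": rng.uniform(18, 60, size=n),
    "Weight": rng.uniform(45, 130, size=n),
})
demo["Obesity Category"] = np.where(
    demo["Weight"] > 85, "Obesity_Type_I", "Normal_Weight"
)

model, scaler, cm = train_linear_svm(demo, target="Obesity Category")
print(cm)  # rows: actual classes, columns: predicted classes
```

Fitting the scaler on the training portion only keeps test-set statistics out of training, which is the usual way to avoid an optimistic accuracy estimate.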
- Accuracy score: 0.9550827423167849
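- Classification report:

| Class | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| 0 | 0.92 | 1.00 | 0.96 | 56 |
| 1 | 0.96 | 0.85 | 0.91 | 62 |
| 2 | 0.99 | 0.96 | 0.97 | 78 |
| 3 | 0.97 | 0.98 | 0.97 | 58 |
| 4 | 1.00 | 1.00 | 1.00 | 63 |
| 5 | 0.90 | 0.93 | 0.91 | 56 |
| 6 | 0.94 | 0.96 | 0.95 | 50 |
| Accuracy | | | 0.96 | 423 |
| Macro avg | 0.95 | 0.96 | 0.95 | 423 |
| Weighted avg | 0.96 | 0.96 | 0.95 | 423 |
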
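
The accuracy and the per-class figures in the report follow directly from the confusion matrix. The sketch below recomputes them in plain Python as a cross-check, assuming rows are actual classes and columns are predicted classes (which agrees with the support column). The helper name `per_class_metrics` is hypothetical.

```python
# Confusion matrix as reported above: rows = actual class, columns = predicted class.
cm = [
    [56, 0, 0, 0, 0, 0, 0],
    [5, 53, 0, 0, 0, 4, 0],
    [0, 0, 75, 2, 0, 0, 1],
    [0, 0, 1, 57, 0, 0, 0],
    [0, 0, 0, 0, 63, 0, 0],
    [0, 2, 0, 0, 0, 52, 2],
    [0, 0, 0, 0, 0, 2, 48],
]


def per_class_metrics(cm):
    """Return (precision, recall, f1, support) for each class."""
    n = len(cm)
    rows = []
    for k in range(n):
        tp = cm[k][k]
        support = sum(cm[k])                         # samples that truly belong to class k
        predicted = sum(cm[i][k] for i in range(n))  # samples predicted as class k
        precision = tp / predicted if predicted else 0.0
        recall = tp / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        rows.append((precision, recall, f1, support))
    return rows


total = sum(sum(row) for row in cm)
correct = sum(cm[k][k] for k in range(len(cm)))
accuracy = correct / total
print(f"accuracy: {accuracy:.4f}")  # accuracy: 0.9551

metrics = per_class_metrics(cm)
for k, (p, r, f1, support) in enumerate(metrics):
    print(f"class {k}: precision={p:.2f} recall={r:.2f} f1={f1:.2f} support={support}")
```

For example, class 1 has 53 correct predictions out of 62 actual samples, giving the recall of 0.85 shown in the report.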
- K-fold cross-validation was used, resulting in a mean accuracy of 94.20% and a standard deviation of 1.34%.
- Grid Search was performed to find the best hyperparameters, yielding the following result:
  - Best Accuracy: 94.20%
  - Best Parameters: {'C': 1, 'kernel': 'linear'}
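
The cross-validation and grid search described above might look like the following sketch, assuming scikit-learn. The README states neither the number of folds nor the parameter grid, so `cv=10`, the candidate values of `C`, `gamma`, and the RBF alternative are assumptions for illustration, and the data is a synthetic stand-in generated with `make_classification`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data; the real project would use its own training set.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)

# Putting the scaler in a pipeline refits it inside every fold.
pipeline = make_pipeline(StandardScaler(), SVC(kernel="linear", random_state=0))

# K-fold cross-validation (cv=10 is an assumed fold count).
scores = cross_val_score(pipeline, X, y, cv=10, scoring="accuracy")
print(f"Mean accuracy: {scores.mean():.2%}")
print(f"Standard deviation: {scores.std():.2%}")

# Hypothetical grid: the README reports only the winning parameters.
param_grid = [
    {"svc__C": [0.25, 0.5, 0.75, 1], "svc__kernel": ["linear"]},
    {
        "svc__C": [0.25, 0.5, 0.75, 1],
        "svc__kernel": ["rbf"],
        "svc__gamma": [0.1, 0.5, 0.9],
    },
]
search = GridSearchCV(pipeline, param_grid, scoring="accuracy", cv=10)
search.fit(X, y)
print(f"Best accuracy: {search.best_score_:.2%}")
print(f"Best parameters: {search.best_params_}")
```

Because `best_score_` is itself a mean cross-validated accuracy, it coincides with the plain k-fold mean whenever the best parameters equal the ones already in use, which is consistent with both figures being 94.20% here.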
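
This project demonstrates that a linear-kernel SVM can estimate obesity levels from these features with high accuracy: about 95.5% on the test set and a mean of 94.20% under k-fold cross-validation. It also offers a starting point for examining which factors influence obesity.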