# Khipus.ai
## Introduction to Machine Learning
### Supervised Learning - Linear Regression
<span>© Copyright Notice 2025, Khipus.ai - All Rights Reserved.</span>

### Body Fat Prediction Assignment
### Name: (Please Enter Your Name Before Submitting)


## Assignment Instructions
Using the Body Fat dataset provided:
1. Import the `bodyfat.csv` file into a pandas DataFrame.
2. Perform a detailed data exploration.
3. Clean the dataset by addressing missing values.
4. Select relevant features for predicting body fat percentage.
5. Split the dataset into training and testing sets (80%-20% split).
6. Use a linear regression model to predict body fat percentage.
7. Evaluate the model performance using appropriate metrics.

## 1. Import the Body Fat Dataset

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.impute import SimpleImputer

In [2]:
# Load the Body Fat dataset
df = pd.read_csv('bodyfat.csv')
print("Dataset loaded successfully!")
print(f"Dataset shape: {df.shape}")

Dataset loaded successfully!
Dataset shape: (252, 15)


## 2. Data Exploration
Before working with the dataset, it's important to understand its structure, data types, and summary statistics. Let's explore the body fat data.


In [3]:
# Display the first few rows of the dataset
print("First 5 rows of the dataset:")
print(df.head())

First 5 rows of the dataset:
   Density  BodyFat  Age  Weight  Height  Neck  Chest  Abdomen    Hip  Thigh  \
0   1.0708     12.3   23  154.25   67.75  36.2   93.1     85.2   94.5   59.0   
1   1.0853      6.1   22  173.25   72.25  38.5   93.6     83.0   98.7   58.7   
2   1.0414     25.3   22  154.00   66.25  34.0   95.8     87.9   99.2   59.6   
3   1.0751     10.4   26  184.75   72.25  37.4  101.8     86.4  101.2   60.1   
4   1.0340     28.7   24  184.25   71.25  34.4   97.3    100.0  101.9   63.2   

   Knee  Ankle  Biceps  Forearm  Wrist  
0  37.3   21.9    32.0     27.4   17.1  
1  37.3   23.4    30.5     28.9   18.2  
2  38.9   24.0    28.8     25.2   16.6  
3  37.3   22.8    32.4     29.4   18.2  
4  42.2   24.0    32.2     27.7   17.7  


In [4]:
# Display basic information about the dataset
print("Dataset Info:")
print(df.info())
print("\nColumn names:")
print(df.columns.tolist())

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 252 entries, 0 to 251
Data columns (total 15 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Density  252 non-null    float64
 1   BodyFat  252 non-null    float64
 2   Age      252 non-null    int64  
 3   Weight   252 non-null    float64
 4   Height   252 non-null    float64
 5   Neck     252 non-null    float64
 6   Chest    252 non-null    float64
 7   Abdomen  252 non-null    float64
 8   Hip      252 non-null    float64
 9   Thigh    252 non-null    float64
 10  Knee     252 non-null    float64
 11  Ankle    252 non-null    float64
 12  Biceps   252 non-null    float64
 13  Forearm  252 non-null    float64
 14  Wrist    252 non-null    float64
dtypes: float64(14), int64(1)
memory usage: 29.7 KB
None

Column names:
['Density', 'BodyFat', 'Age', 'Weight', 'Height', 'Neck', 'Chest', 'Abdomen', 'Hip', 'Thigh', 'Knee', 'Ankle', 'Biceps', 'Forearm', 'Wrist']


In [5]:
# Display summary statistics
print("Summary Statistics:")
print(df.describe())

Summary Statistics:
          Density     BodyFat         Age      Weight      Height        Neck  \
count  252.000000  252.000000  252.000000  252.000000  252.000000  252.000000   
mean     1.055574   19.150794   44.884921  178.924405   70.148810   37.992063   
std      0.019031    8.368740   12.602040   29.389160    3.662856    2.430913   
min      0.995000    0.000000   22.000000  118.500000   29.500000   31.100000   
25%      1.041400   12.475000   35.750000  159.000000   68.250000   36.400000   
50%      1.054900   19.200000   43.000000  176.500000   70.000000   38.000000   
75%      1.070400   25.300000   54.000000  197.000000   72.250000   39.425000   
max      1.108900   47.500000   81.000000  363.150000   77.750000   51.200000   

            Chest     Abdomen         Hip       Thigh        Knee       Ankle  \
count  252.000000  252.000000  252.000000  252.000000  252.000000  252.000000   
mean   100.824206   92.555952   99.904762   59.405952   38.590476   23.102381   
std    

In [6]:
# Check for missing values
print("Missing values in each column:")
print(df.isnull().sum())
print(f"\nTotal missing values: {df.isnull().sum().sum()}")

Missing values in each column:
Density    0
BodyFat    0
Age        0
Weight     0
Height     0
Neck       0
Chest      0
Abdomen    0
Hip        0
Thigh      0
Knee       0
Ankle      0
Biceps     0
Forearm    0
Wrist      0
dtype: int64

Total missing values: 0


## 3. Data Cleaning
Real-world datasets often contain missing values, duplicate rows, or incorrect data. The check above found no missing values, so no imputation is needed here; next we confirm that the dataset contains no duplicate rows.


In [7]:
# Check for duplicates
print(f"Number of duplicate rows: {df.duplicated().sum()}")

Number of duplicate rows: 0
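
The summary statistics in Section 2 show a minimum `BodyFat` of 0.0 and a minimum `Height` of 29.5, and neither the missing-value check nor the duplicate check would flag values like these. The sketch below shows one possible way to handle such entries: mark out-of-range values as missing, drop rows whose target is missing, and fill a missing feature with the column median. The small `sample` frame and the `bounds` dictionary are hypothetical and only illustrate the idea; suitable limits are a judgment call.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame (made-up rows, not read from bodyfat.csv).
sample = pd.DataFrame({
    "BodyFat": [12.3, 0.0, 25.3, 10.4],
    "Height": [67.75, 72.25, 29.5, 72.25],
})

# Hypothetical plausibility bounds; adjust them to your own judgment.
bounds = {"BodyFat": (2.0, 60.0), "Height": (50.0, 90.0)}

cleaned = sample.copy()
for column, (low, high) in bounds.items():
    # Treat values outside the bounds as missing instead of trusting them.
    out_of_range = ~cleaned[column].between(low, high)
    cleaned.loc[out_of_range, column] = np.nan

n_flagged = int(cleaned.isnull().sum().sum())

# Rows without a valid target cannot be used for training, so drop them.
cleaned = cleaned.dropna(subset=["BodyFat"])

# Fill the remaining feature gap with the column median.
cleaned["Height"] = cleaned["Height"].fillna(cleaned["Height"].median())

print(f"Flagged values: {n_flagged}")
print(cleaned)
```

Dropping rows with an invalid target and imputing only the features keeps the model from being trained on a made-up label.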


## 4. Feature Selection
Choosing relevant features is crucial for building an effective machine learning model. Here we pick a set of body measurements as predictors and use `BodyFat` as the target variable.


In [8]:
# Select features based on correlation and domain knowledge
# We'll use multiple features that show good correlation with body fat
selected_features = ['Abdomen', 'Weight', 'Chest', 'Hip', 'Thigh', 'Neck', 'Age']

# Select the features from the DataFrame
X = df[selected_features]

# Select the target variable from the DataFrame
y = df['BodyFat']
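
The comment in the cell above says the features were chosen based on correlation, but the notebook does not show that step. A minimal sketch of how such a ranking could be computed is given below. It uses synthetic stand-in data with a made-up relationship (the `demo` frame is hypothetical); in the notebook you would call `.corr()` on `df` instead.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; in the notebook you would use `df` directly.
rng = np.random.default_rng(0)
abdomen = rng.normal(92, 10, size=200)
ankle = rng.normal(23, 1.5, size=200)
# Made-up relationship: body fat depends on abdomen, not on ankle.
body_fat = 19 + 0.6 * (abdomen - 92) + rng.normal(0, 2, size=200)
demo = pd.DataFrame({"BodyFat": body_fat, "Abdomen": abdomen, "Ankle": ankle})

# Pearson correlation of every feature with the target, strongest first.
correlations = (
    demo.corr()["BodyFat"]
    .drop("BodyFat")
    .sort_values(key=abs, ascending=False)
)
print(correlations)
```

Sorting by absolute value keeps strongly negative correlations near the top, since they are just as informative for a linear model as positive ones.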

## 5. Splitting Training and Test Data

In [9]:
# Split into training and test sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
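
It can help to confirm the sizes the split produces. The sketch below uses placeholder arrays (`X_demo` and `y_demo` are hypothetical) with the same number of rows as the dataset, 252. With `test_size=0.2`, scikit-learn rounds the test set up to 51 rows and leaves 201 for training.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as the dataset (252).
X_demo = np.arange(252 * 7).reshape(252, 7)
y_demo = np.arange(252)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)

# Fixing random_state makes the split reproducible across runs.
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
same_split = np.array_equal(X_te, X_te2)
print(f"Same split on repeat: {same_split}")
```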

## 6. Use Linear Regression Model to Predict Body Fat Percentage

### Train the Linear Regression Model
We will train a linear regression model using the training data to predict body fat percentage.

In [10]:
# Initialize the Linear Regression model
model = LinearRegression()

# Train the model on the training data
model.fit(X_train, y_train)
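
After fitting, the learned coefficients can be inspected through the `coef_` and `intercept_` attributes of `LinearRegression`. The sketch below uses a made-up, noise-free relationship (the data and the names `X_demo`, `y_demo`, and `demo_model` are hypothetical) so that the recovered values are easy to verify. In the notebook, the same pattern would apply to `model` and `selected_features`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X_demo = pd.DataFrame({
    "Abdomen": rng.normal(92, 10, size=100),
    "Weight": rng.normal(179, 29, size=100),
})
# Made-up, noise-free target so the true coefficients are known.
y_demo = 0.9 * X_demo["Abdomen"] - 0.1 * X_demo["Weight"] - 45.0

demo_model = LinearRegression().fit(X_demo, y_demo)

# Pair each coefficient with its feature name for readability.
coefficients = pd.Series(demo_model.coef_, index=X_demo.columns)
print(coefficients)
print(f"Intercept: {demo_model.intercept_:.2f}")
```

Each coefficient is expressed in target units per unit of its feature, so magnitudes are not directly comparable across features measured on different scales.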


### Evaluate the Model
We'll evaluate the model on the test set using the mean squared error (MSE), the root mean squared error (RMSE), and the R² score.

In [11]:

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print(f'Mean Squared Error: {mse}')
print(f'Root Mean Squared Error: {rmse}')
print(f'R² Score: {r2}')

Mean Squared Error: 17.035582481596958
Root Mean Squared Error: 4.1274183797619735
R² Score: 0.6337856989455327
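
To make the three metrics concrete, here is a small sketch that computes them by hand with NumPy. The helper name `regression_metrics` and the three-point example are hypothetical, chosen so the arithmetic is easy to follow.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (MSE, RMSE, R^2) computed from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)            # average squared error
    rmse = np.sqrt(mse)                      # same units as the target
    ss_res = np.sum(residuals ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    r2 = 1.0 - ss_res / ss_tot
    return mse, rmse, r2

# Made-up values: the residuals are -2, 2, and -3.
mse_demo, rmse_demo, r2_demo = regression_metrics([10, 20, 30], [12, 18, 33])
print(f"MSE={mse_demo:.4f}  RMSE={rmse_demo:.4f}  R2={r2_demo:.4f}")
```

Because RMSE is the square root of MSE, it is in the same units as the target, which here means percentage points of body fat.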


## 7. Model Interpretation and Conclusions

In [13]:
# Compare the first five test-set predictions with the actual values
print("Sample Predictions:")
print("-" * 50)
for i in range(5):
    actual = y_test.iloc[i]
    predicted = y_pred[i]
    print(f"Sample {i+1}:")
    print(f"  Actual Body Fat: {actual:.2f}%")
    print(f"  Predicted Body Fat: {predicted:.2f}%")
    print(f"  Difference: {abs(actual - predicted):.2f}%")
    print()

Sample Predictions:
--------------------------------------------------
Sample 1:
  Actual Body Fat: 19.20%
  Predicted Body Fat: 16.94%
  Difference: 2.26%

Sample 2:
  Actual Body Fat: 19.20%
  Predicted Body Fat: 16.51%
  Difference: 2.69%

Sample 3:
  Actual Body Fat: 28.00%
  Predicted Body Fat: 31.80%
  Difference: 3.80%

Sample 4:
  Actual Body Fat: 20.50%
  Predicted Body Fat: 16.44%
  Difference: 4.06%

Sample 5:
  Actual Body Fat: 16.70%
  Predicted Body Fat: 16.34%
  Difference: 0.36%
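
The five printed samples can also be summarized in a small table. The sketch below copies the values from the output above into a DataFrame (the name `samples` is just for illustration) and computes the signed error, the absolute error, and the mean absolute error over these five rows.

```python
import pandas as pd

# Values copied from the five printed samples above.
samples = pd.DataFrame({
    "actual": [19.20, 19.20, 28.00, 20.50, 16.70],
    "predicted": [16.94, 16.51, 31.80, 16.44, 16.34],
})

# A negative error means the model under-predicted body fat.
samples["error"] = samples["predicted"] - samples["actual"]
samples["abs_error"] = samples["error"].abs()

mean_abs_error = samples["abs_error"].mean()
n_under = int((samples["error"] < 0).sum())

print(samples.round(2))
print(f"Mean absolute error over these samples: {mean_abs_error:.2f}")
print(f"Under-predictions: {n_under} of {len(samples)}")
```

Five rows are too few to draw conclusions from, but keeping the sign of the error shows the direction of each miss, which the absolute difference hides.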



### Summary
Based on our linear regression analysis:

1. **Model Performance**: Our model achieved an R² score of approximately 0.63 on the test set, indicating that it explains about 63% of the variance in body fat percentage.

2. **Key Features**: The most important features for predicting body fat appear to be [based on your coefficient analysis].

3. **Model Accuracy**: The Root Mean Squared Error (RMSE) of about 4.13 percentage points suggests that our predictions typically deviate from the actual body fat percentage by roughly that amount.

4. **Practical Applications**: This model could be useful for fitness professionals and healthcare providers to estimate body fat percentage using easily measurable body dimensions.

**Next Steps**: To improve the model, we could:
- Try polynomial features
- Use regularization techniques (Ridge, Lasso)
- Explore other algorithms like Random Forest
- Collect more data
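
As one way to act on the regularization idea above, the sketch below compares plain linear regression with Ridge using 5-fold cross-validation. The data are synthetic, and the names `X_demo`, `y_demo`, and `candidates`, as well as the choice `alpha=1.0`, are assumptions made for illustration. In the notebook, `X` and `y` would take their place.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic data with a made-up linear signal plus noise.
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(120, 5))
true_weights = np.array([3.0, 0.0, -2.0, 0.0, 1.0])
y_demo = X_demo @ true_weights + rng.normal(0, 1.0, size=120)

candidates = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),  # alpha is an arbitrary starting value
}

# Mean R² across 5 folds gives a steadier estimate than one split.
cv_r2 = {
    name: cross_val_score(estimator, X_demo, y_demo, cv=5, scoring="r2").mean()
    for name, estimator in candidates.items()
}
for name, score in cv_r2.items():
    print(f"{name}: mean CV R² = {score:.3f}")
```

With only 252 rows in the dataset, a cross-validated score is less sensitive to which rows happen to fall in the test set than a single 80/20 split.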