# Age Prediction Project using Logistic Regression



## 1. Introduction

### Project Overview
The goal of this project is to predict an individual's age group from health and demographic features using logistic regression. Each record in the dataset describes one individual and is labeled with an age group. Logistic regression is a classification algorithm, which makes it a natural fit for assigning each individual to one of the groups.
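
For reference, binary logistic regression models the probability of the positive class with the sigmoid function. The symbols $\mathbf{w}$ (weights), $b$ (intercept), and $\mathbf{x}$ (feature vector) are introduced here for illustration and do not appear in the original notebook:

$$
P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^\top \mathbf{x} + b) = \frac{1}{1 + e^{-(\mathbf{w}^\top \mathbf{x} + b)}}
$$

A sample is assigned to class 1 when this probability exceeds 0.5, and to class 0 otherwise.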

### Libraries Used
- **numpy**: For numerical operations.
- **pandas**: For data manipulation and analysis.
- **matplotlib**: For data visualization.
- **seaborn**: For statistical data visualization.
- **scikit-learn** (`sklearn`): For machine learning tasks such as label encoding, splitting data, scaling, building the logistic regression model, and computing evaluation metrics.


In [1]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score


## 2. Dataset Overview

### Dataset Description

The dataset contains multiple features that can be used to predict the age group of individuals. The `Age_group` is the target variable, while other features are used for classification.

### Columns in the Dataset

The dataset has the following columns:

- **Age_group**: The target variable indicating the age group of the individual.

- **Other features**: `ID`, `Age`, `Gender`, `PAQ605`, `Body Mass Index`, `Blood Glucose after fasting`, `Diabetic or not`, `Respondent's Oral`, and `Blood Insulin Levels`.


## 3. Data Exploration

### Basic Analysis

We load the dataset and perform an initial check to understand its structure.

In [2]:
data = pd.read_csv(r"C:\Users\devad\Downloads\Age Prediction.csv")
data


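| Unnamed: 0 | ID | Age_group | Age | Gender | PAQ605 | Body Mass Index | Blood Glucose after fasting | Diabetic or not | Respondent's Oral | Blood Insulin Levels |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 73564 | Adult | 61 | 2 | 2 | 35.7 | 110 | 2 | 150 | 14.91 |
| 1 | 73568 | Adult | 26 | 2 | 2 | 20.3 | 89 | 2 | 80 | 3.85 |
| 2 | 73576 | Adult | 16 | 1 | 2 | 23.2 | 89 | 2 | 68 | 6.14 |
| 3 | 73577 | Adult | 32 | 1 | 2 | 28.9 | 104 | 2 | 84 | 16.15 |
| 4 | 73580 | Adult | 38 | 2 | 1 | 35.9 | 103 | 2 | 81 | 10.92 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2273 | 83711 | Adult | 38 | 2 | 2 | 33.5 | 100 | 2 | 73 | 6.53 |
| 2274 | 83712 | Adult | 61 | 1 | 2 | 30.0 | 93 | 2 | 208 | 13.02 |
| 2275 | 83713 | Adult | 34 | 1 | 2 | 23.7 | 103 | 2 | 124 | 21.41 |
| 2276 | 83718 | Adult | 60 | 2 | 2 | 27.4 | 90 | 2 | 108 | 4.99 |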
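
A few further structural checks are often useful at this stage. The sketch below is not part of the original notebook: it uses a small hypothetical frame named `sample`, built from a handful of rows and columns copied from the preview above, so that it runs without the CSV file. In the notebook itself the same calls could be made on `data`.

```python
import pandas as pd

# Hypothetical stand-in for `data`: five rows and four columns copied from the preview.
sample = pd.DataFrame({
    "ID": [73564, 73568, 73576, 73577, 73580],
    "Age_group": ["Adult"] * 5,
    "Age": [61, 26, 16, 32, 38],
    "Body Mass Index": [35.7, 20.3, 23.2, 28.9, 35.9],
})

print(sample.shape)    # (number of rows, number of columns)
print(sample.dtypes)   # data type of each column

# Class balance of the target: how many rows fall into each age group.
class_counts = sample["Age_group"].value_counts()
print(class_counts)

# Summary statistics (count, mean, std, min, quartiles, max) for numeric columns.
print(sample.describe())
```

Checking the class balance early matters because a heavily skewed target makes plain accuracy harder to interpret later on.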


### Null Values

The dataset does not contain any missing values, which means no imputation is required.

In [3]:
data.isnull().sum()


ID                             0
Age_group                      0
Age                            0
Gender                         0
PAQ605                         0
Body Mass Index                0
Blood Glucose after fasting    0
Diabetic or not                0
Respondent's Oral              0
Blood Insulin Levels           0
dtype: int64

### Convert Categorical Data to Numeric

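Since the target variable `Age_group` is categorical, we use `LabelEncoder` to convert it to integer codes. The encoder assigns codes in sorted order of the class labels, and the fitted mapping is available afterwards in `le.classes_`.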

In [4]:
le = LabelEncoder()
data['Age_group'] = le.fit_transform(data['Age_group'])
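
To see which integer stands for which label, the fitted encoder can be inspected. The sketch below is an illustration on made-up labels: `le_demo` is a hypothetical encoder, and `"Other"` is a placeholder class name, since only `Adult` is visible in the preview of the dataset.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up labels; "Other" is a placeholder, not a class name taken from the dataset.
labels = ["Adult", "Other", "Adult", "Other"]

le_demo = LabelEncoder()
codes = le_demo.fit_transform(labels)

# classes_ holds the labels in sorted order; the position of a label is its code.
mapping = {label: int(code)
           for label, code in zip(le_demo.classes_, le_demo.transform(le_demo.classes_))}
print(mapping)

# inverse_transform turns integer codes back into the original labels.
decoded = le_demo.inverse_transform(codes)
print(list(decoded))
```

Keeping the encoder object around makes it possible to translate model predictions back into readable age-group names.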


### Feature and Target Variable

The features are stored in `x` and the target variable in `y`. Note that `x` keeps every remaining column, including the row identifier `ID` and the raw `Age`.

In [5]:
x = data.drop(['Age_group'], axis=1)
y = data['Age_group']
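
The cell above keeps `ID` and `Age` among the features. As a variation that is not what the notebook does, the sketch below also drops those two columns, on the assumption that an identifier carries no real signal and that `Age` may reveal the age group directly. The frame `demo` and the names `x_alt` and `y_alt` are hypothetical.

```python
import pandas as pd

# Hypothetical miniature of the encoded dataset (two rows, a subset of columns).
demo = pd.DataFrame({
    "ID": [73564, 73568],
    "Age_group": [0, 0],
    "Age": [61, 26],
    "Gender": [2, 2],
    "Body Mass Index": [35.7, 20.3],
})

# Assumption: ID is only an identifier and Age is too closely tied to the target.
columns_to_exclude = ["ID", "Age"]

x_alt = demo.drop(columns=["Age_group"] + columns_to_exclude)
y_alt = demo["Age_group"]

print(list(x_alt.columns))
```

Training on `x_alt` instead of `x` would show how much of the model's performance comes from the remaining health measurements alone.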


## 4. Data Preprocessing

### Train-Test Split

We split the data into training and test sets (80% for training, 20% for testing) using `train_test_split`. Fixing `random_state=42` makes the split reproducible.

In [6]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
print("x_train", x_train.shape)
print("x_test", x_test.shape)
print("y_train", y_train.shape)
print("y_test", y_test.shape)


x_train (1822, 9)
x_test (456, 9)
y_train (1822,)
y_test (456,)
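
When the classes are imbalanced, a stratified split keeps the class proportions the same in both subsets. The notebook does not use this option; the sketch below shows it on synthetic data, with `X_demo` and `y_demo` as hypothetical stand-ins for `x` and `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 samples, 80 of class 0 and 20 of class 1.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([0] * 80 + [1] * 20)

# stratify=y_demo preserves the 80/20 class ratio in both the train and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("class 1 share in train:", y_tr.mean())
print("class 1 share in test:", y_te.mean())
```

Without stratification, the minority-class share in a small test set can drift by chance, which makes metrics less stable from one split to the next.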


### Feature Scaling


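We standardize the features using `StandardScaler`, which rescales each column to zero mean and unit variance. The scaler is fitted on the training set only and then applied to the test set, so no test-set statistics influence training. Standardization puts features with very different ranges (such as `ID` and `Gender`) on a comparable scale, which helps the logistic regression solver converge.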


In [7]:
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)
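
The effect of fitting on the training data and reusing those statistics for the test data can be checked on a tiny example. The arrays `train` and `test` below are made-up values, not rows from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values: two features on different scales.
train = np.array([[20.0, 80.0], [30.0, 100.0], [40.0, 120.0]])
test = np.array([[25.0, 90.0]])

sc = StandardScaler()
train_scaled = sc.fit_transform(train)   # learns mean and std from the training data
test_scaled = sc.transform(test)         # reuses the training mean and std

print(train_scaled.mean(axis=0))  # close to 0 for every column
print(train_scaled.std(axis=0))   # close to 1 for every column
print(test_scaled)                # generally not zero-mean: it uses training statistics
```

The test values are expressed in units of the training distribution, which is exactly what the model saw during fitting.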


## 5. Model Building

### Logistic Regression Model


We create a logistic regression model and train it using the scaled training data.

In [8]:
log_model = LogisticRegression()
log_model.fit(x_train_scaled, y_train)
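
An alternative way to organize the scaling and training steps is a scikit-learn pipeline, which guarantees that the scaler is always fitted on the same data as the model. This is a sketch on synthetic data from `make_classification`, not the notebook's own code; `pipe`, `X_demo`, and `y_demo` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data standing in for the real features.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The pipeline scales the inputs and then fits logistic regression in one call.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)

# score() applies the fitted scaler to X_te before predicting, then returns accuracy.
test_accuracy = pipe.score(X_te, y_te)
print("test accuracy:", test_accuracy)
```

With a pipeline there is no separate scaled copy of the data to keep in sync, which removes one common source of mistakes.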

### Model Predictions

After training the model, we use it to make predictions on the test data.

In [9]:
predictions = log_model.predict(x_test_scaled)
print(predictions)


[0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0
 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0
 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 0 0 1 0 0
 1 1 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0
 0 1 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 1 0 0 1 0 0 0 0 1 0 0 0 0
 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
 1 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0
 1 1 0 1 0 0 0 1 0 0 1 0]
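
Besides hard class labels, a fitted `LogisticRegression` can also report class probabilities through `predict_proba`. The sketch below illustrates the relationship between the two on synthetic data; `clf`, `X_demo`, and `y_demo` are hypothetical names, not objects from the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data with labels 0 and 1.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression().fit(X_demo, y_demo)

# One row per sample, one column per class; each row sums to 1.
proba = clf.predict_proba(X_demo[:5])
preds = clf.predict(X_demo[:5])

print(np.round(proba, 3))
print(preds)  # the predicted label is the class with the higher probability
```

Probabilities are helpful when a decision threshold other than 0.5 is needed, or when borderline cases deserve a closer look.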


## 6. Model Evaluation

### Confusion Matrix

We evaluate the model's performance using a confusion matrix, which counts correct and incorrect predictions per class. In scikit-learn's layout, rows correspond to the true classes and columns to the predicted classes.

In [10]:
conf_matrix = confusion_matrix(y_test, predictions)
print(conf_matrix)


[[382   0]
 [  1  73]]


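Treating class 1 as the positive class, the matrix shows 382 true negatives, 0 false positives, 1 false negative, and 73 true positives. The model misclassified a single test sample, predicting class 0 for an individual whose true class is 1.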
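
Because scikit-learn's `confusion_matrix` puts true labels on rows and predicted labels on columns, flattening a binary matrix yields the counts in the order tn, fp, fn, tp. The sketch below applies this to the matrix printed above and derives precision and recall for class 1; treating class 1 as the positive class is an assumption made for illustration.

```python
import numpy as np

# The confusion matrix printed above (rows: true class, columns: predicted class).
conf = np.array([[382, 0],
                 [1, 73]])

# For a binary problem, ravel() returns the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = conf.ravel()

precision = tp / (tp + fp)  # share of predicted class-1 samples that are truly class 1
recall = tp / (tp + fn)     # share of true class-1 samples that the model found

print("tn, fp, fn, tp:", tn, fp, fn, tp)
print("precision:", precision)
print("recall:", recall)
```

Precision and recall describe performance on the minority class separately, which accuracy alone does not.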

### Accuracy

The accuracy of the model is calculated as the percentage of correct predictions.

In [11]:
accuracy = accuracy_score(y_test, predictions) * 100
print("Model accuracy:", accuracy)


Model accuracy: 99.78070175438597
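
The reported accuracy can be reproduced by hand from the confusion matrix, and compared with a trivial baseline that always predicts the majority class. The baseline comparison is an addition for context and is not computed in the original notebook.

```python
# Cell counts taken from the confusion matrix shown earlier.
tn, fp, fn, tp = 382, 0, 1, 73
total = tn + fp + fn + tp

# Accuracy: correct predictions divided by all predictions, as a percentage.
accuracy_pct = (tn + tp) / total * 100

# Baseline: always predict class 0, which is correct for every true class-0 sample.
baseline_pct = (tn + fp) / total * 100

print("accuracy (%):", accuracy_pct)
print("majority-class baseline (%):", round(baseline_pct, 2))
```

Since class 0 makes up roughly 84% of the test set, an accuracy figure is best judged against that baseline rather than against 50%.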


## 7. Conclusion


### Summary

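In this project, we built a logistic regression model to predict the age group from the given features. The model achieved an accuracy of 99.78% on the test set. This figure should be read with caution: the raw `Age` column was kept as a feature, and since an age group is presumably derived from age, the near-perfect score likely reflects that direct relationship rather than predictive power from the other measurements.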

### Accuracy

- **Accuracy**: The model's test accuracy is 99.78% (455 of 456 test samples classified correctly).

### Improvements

- The model's performance could be improved by experimenting with other machine learning algorithms or adding more features.
- Feature engineering could further enhance the predictive power of the model.
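
One way to compare alternative algorithms fairly is k-fold cross-validation. The sketch below is only an illustration on synthetic data: the choice of a random forest as the second candidate, the 5 folds, and the names `candidates` and `cv_means` are all assumptions, not results or methods from this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data standing in for the real dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

# Hypothetical candidates: the scaled logistic regression and one alternative model.
candidates = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression()),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Mean accuracy over 5 folds for each candidate.
cv_means = {
    name: cross_val_score(model, X_demo, y_demo, cv=5).mean()
    for name, model in candidates.items()
}

for name, score in cv_means.items():
    print(name, round(score, 3))
```

Averaging over several folds gives a more stable estimate than a single train-test split, especially when one class is rare.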
### Final Thoughts

This project demonstrates the power of logistic regression in classification tasks and the importance of data preprocessing, including scaling and label encoding, in building accurate models.