## Healthcare Dataset Analysis and KNN Regression Model Report

## 1. Introduction

This report presents an in-depth analysis of the healthcare dataset and the implementation of a K-Nearest Neighbors (KNN) regression model. The objective is to analyze the dataset, perform preprocessing, build a predictive model, and evaluate its performance.

## 2. Libraries Used

The following libraries were used in the project:

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

## 3. Dataset Overview

The dataset used in this project is the healthcare dataset. It contains multiple attributes related to patient details and medical history, and the goal is to predict the Gender of the patients based on other attributes.

### 3.1 Number of Columns

The dataset consists of 15 columns:

- Name

- Age

- Gender

- Blood Type

- Medical Condition

- Date of Admission

- Doctor

- Hospital

- Insurance Provider

- Billing Amount

- Room Number

- Admission Type

- Discharge Date

- Medication

- Test Results

### 3.2 Relationship Between Columns

- Medical Condition, Blood Type, and Medication might have an impact on a patient's treatment and recovery.

- Date of Admission and Discharge Date could help analyze hospital stay duration.

- Doctor, Hospital, and Insurance Provider provide categorical information that might influence medical decisions.

- Gender is the target variable. It is a binary category, so it is label-encoded to integers before being passed to the KNN regressor.
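
The hospital stay duration mentioned above could be derived roughly as follows. This is a minimal sketch on a hypothetical three-row sample: the discharge dates are made up for illustration, and `Length of Stay` is an assumed column name that does not exist in the original dataset.

```python
import pandas as pd

# Hypothetical sample mirroring the two date columns; the discharge dates are invented.
sample = pd.DataFrame({
    "Date of Admission": ["2024-01-31", "2019-08-20", "2022-09-22"],
    "Discharge Date": ["2024-02-02", "2019-08-26", "2022-10-07"],
})

# Parse the strings as datetimes so they can be subtracted.
admitted = pd.to_datetime(sample["Date of Admission"])
discharged = pd.to_datetime(sample["Discharge Date"])

# "Length of Stay" is an illustrative column name, measured in whole days.
sample["Length of Stay"] = (discharged - admitted).dt.days
print(sample)
```

A numeric duration like this keeps the ordering information in the dates, which is lost if the two date columns are only label-encoded.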

## 4. Basic Analysis

The dataset was loaded and inspected with `head()`, `tail()`, `info()`, and `describe()`:

In [2]:
data = pd.read_csv(r"C:\Users\devad\Downloads\healthcare_dataset.csv")
print(data.head())
print(data.tail())
print(data.info())
print(data.describe())

            Name  Age  Gender Blood Type Medical Condition Date of Admission  \
0  Bobby JacksOn   30    Male         B-            Cancer        2024-01-31   
1   LesLie TErRy   62    Male         A+           Obesity        2019-08-20   
2    DaNnY sMitH   76  Female         A-           Obesity        2022-09-22   
3   andrEw waTtS   28  Female         O+          Diabetes        2020-11-18   
4  adrIENNE bEll   43  Female        AB+            Cancer        2022-09-19   

             Doctor                    Hospital Insurance Provider  \
0     Matthew Smith             Sons and Miller         Blue Cross   
1   Samantha Davies                     Kim Inc           Medicare   
2  Tiffany Mitchell                    Cook PLC              Aetna   
3       Kevin Wells  Hernandez Rogers and Vang,           Medicare   
4    Kathleen Hanna                 White-White              Aetna   

   Billing Amount  Room Number Admission Type Discharge Date   Medication  \
0    18856.281306    

### 4.1 Checking for Null Values

In [3]:
print(data.isnull().sum())

Name                  0
Age                   0
Gender                0
Blood Type            0
Medical Condition     0
Date of Admission     0
Doctor                0
Hospital              0
Insurance Provider    0
Billing Amount        0
Room Number           0
Admission Type        0
Discharge Date        0
Medication            0
Test Results          0
dtype: int64


The dataset has **no missing values**.

## 5. Data Preprocessing

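Since the dataset contains categorical variables, we used Label Encoding to convert them into numeric values. The columns that are already numeric (Age, Billing Amount, and Room Number) were left unchanged. Note that Label Encoding assigns integers in sorted order of the labels, so the resulting numbers carry an arbitrary ordering that a distance-based model such as KNN will treat as meaningful.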

In [4]:
from sklearn.preprocessing import LabelEncoder

# Columns holding text or date strings that need numeric codes
categorical_cols = [
    'Name', 'Gender', 'Blood Type', 'Medical Condition',
    'Date of Admission', 'Doctor', 'Hospital', 'Insurance Provider',
    'Admission Type', 'Discharge Date', 'Medication', 'Test Results',
]

# The encoder is refitted on each column in turn, as in the original cell
le = LabelEncoder()
for col in categorical_cols:
    data[col] = le.fit_transform(data[col])
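
Reusing a single encoder means only the mapping of the last column survives. If the original labels need to be recovered later, one option is to keep one encoder per column. The sketch below uses a hypothetical four-row sample, and `encoders` is an illustrative name rather than something from the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample with two of the categorical columns.
sample = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Blood Type": ["B-", "A+", "A-", "O+"],
})

encoders = {}  # illustrative: one fitted encoder per column
for col in sample.columns:
    encoders[col] = LabelEncoder()
    sample[col] = encoders[col].fit_transform(sample[col])

# classes_ is sorted, so each integer code is an index into this array.
print(encoders["Gender"].classes_)
# inverse_transform maps the codes back to the original labels.
print(encoders["Gender"].inverse_transform([0, 1]))
```

With this layout, `Female` maps to 0 and `Male` to 1, which is worth knowing when reading predictions for the encoded Gender column.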

## 6. Model Building

The `Gender` column is the target variable (`y`), while the remaining columns are features (`X`).

In [5]:
from sklearn.model_selection import train_test_split

x = data.drop(['Gender'], axis=1)
y = data['Gender']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

print("x_train:", x_train.shape)
print("x_test:", x_test.shape)
print("y_train:", y_train.shape)
print("y_test:", y_test.shape)

x_train: (44400, 14)
x_test: (11100, 14)
y_train: (44400,)
y_test: (11100,)
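
The shapes follow from `test_size=0.2`: 20% of the 55,500 rows (11,100) go to the test set and the remaining 44,400 to training. When the target is a category, as Gender is, passing `stratify` keeps the class proportions the same in both splits. The sketch below shows this on a small synthetic array; `X_demo` and `y_demo` are assumed stand-ins, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 20 rows, 2 features, perfectly balanced binary target.
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([0] * 10 + [1] * 10)

# stratify=y_demo preserves the 50/50 class balance in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(np.bincount(y_te))  # number of test rows per class
```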


### 6.1 Applying KNN Regression

In [6]:
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor

scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)

knn = KNeighborsRegressor()
knn.fit(x_train_scaled, y_train)
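
The scaler is fitted on the training rows only and then reused on the test rows, which avoids leaking test statistics into training. The same two steps can be bundled into a scikit-learn pipeline so that they cannot get out of sync. This is a minimal sketch on synthetic data; `X_demo`, `y_demo`, and the explicit `n_neighbors=5` (the scikit-learn default) are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; shapes and values are illustrative only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo[:, 0] * 2.0 + rng.normal(scale=0.1, size=200)

# fit() scales and fits on the training rows; predict() reuses the fitted scaler.
model = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
model.fit(X_demo[:150], y_demo[:150])
demo_predictions = model.predict(X_demo[150:])
print(demo_predictions.shape)
```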

## 7. Model Evaluation

### 7.1 Predictions

In [7]:
x_test_scaled = scaler.transform(x_test)
predictions = knn.predict(x_test_scaled)

### 7.2 R2 Score

In [8]:
from sklearn.metrics import r2_score
r2 = r2_score(y_test, predictions)
print("R2 score:", r2)

R2 score: -0.15728474321774266
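
R2 is defined as 1 minus the ratio of the residual sum of squares to the total sum of squares around the mean of the target. A model that always predicts that mean scores exactly 0, and anything worse scores below 0. The toy arrays below are made up to illustrate this and are unrelated to the healthcare data.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([0, 1, 0, 1])

# Always predicting the mean of y_true gives R2 = 0 by definition.
mean_pred = np.full(4, y_true.mean())
r2_mean = r2_score(y_true, mean_pred)
print(r2_mean)  # 0.0

# Predictions further from the truth than the mean give a negative R2:
# residual sum of squares = 4, total sum of squares = 1, so R2 = 1 - 4/1.
bad_pred = np.array([1, 0, 1, 0])
r2_bad = r2_score(y_true, bad_pred)
print(r2_bad)  # -3.0
```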


## 8. Conclusion

- The dataset contained **no missing values**, simplifying preprocessing.

- Categorical columns were encoded using Label Encoding.

- The KNN regression model was implemented.

- The **R2 Score** of the model is **-0.1573**, indicating that the model performs poorly.

- A negative R2 score means the model's predictions are worse than simply predicting the mean of the target for every patient.

- Because Gender is a categorical target, alternative models (such as Logistic Regression, Decision Trees, or a KNN classifier) should be explored for better performance.
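
Because the encoded Gender target only takes the values 0 and 1, one option worth comparing is to treat the task as classification and report accuracy in place of R2. The following is a hedged sketch on synthetic data, not a result for the healthcare dataset; `X_demo`, `y_demo`, and the choice of `KNeighborsClassifier` with 5 neighbors are assumptions.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: a binary label that depends on the first two features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Scale, then classify by majority vote among the 5 nearest neighbors.
clf = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
clf.fit(X_tr, y_tr)
demo_labels = clf.predict(X_te)
demo_accuracy = accuracy_score(y_te, demo_labels)
print(demo_accuracy)
```

Accuracy is directly interpretable for a two-class target: for a balanced target, a value near 0.5 means the features carry little information about the label.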

This report provides a comprehensive analysis and evaluation of the healthcare dataset using KNN regression.