<a href="https://colab.research.google.com/github/SURESHASWATHI/assignment/blob/main/decision_tree_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task
Apply Decision Tree Classification to a given dataset, analyze the performance of the model, and interpret the results.

## Load data

### Subtask:
Load the dataset into a pandas DataFrame.


**Reasoning**:
Import the pandas library and load the Excel file into a DataFrame, then display the first 5 rows.



In [1]:
import pandas as pd

df = pd.read_excel('/content/heart_disease.xlsx')
display(df.head())

Unnamed: 0,age,Age in years
0,Gender,"Gender ; Male - 1, Female -0"
1,cp,Chest pain type
2,trestbps,Resting blood pressure
3,chol,cholesterol measure
4,fbs,(fasting blood sugar > 120 mg/dl) (1 = true; 0...


## Exploratory data analysis (eda)

### Subtask:
Perform EDA to understand the data, including checking for missing values, outliers, and inconsistencies, and visualizing the distribution of features.


**Reasoning**:
The previous step loaded the data, but the first few rows were metadata. I need to inspect the DataFrame to identify the actual data rows and then proceed with the EDA steps as outlined in the instructions, starting with checking the shape and missing values.



In [2]:
# Step 1: Display the first few rows to inspect the data format and identify actual data rows
display(df.head(15))

# Step 2: Get the shape of the DataFrame
print("Shape of the DataFrame:")
print(df.shape)

# Step 3: Check for missing values and calculate their percentage
print("\nMissing values per column and their percentage:")
missing_values = df.isnull().sum()
missing_percentage = (missing_values / len(df)) * 100
missing_data = pd.DataFrame({'Missing Values': missing_values, 'Percentage': missing_percentage})
display(missing_data)

# Step 4: Display the data types of each column
print("\nData types of each column:")
print(df.dtypes)

Unnamed: 0,age,Age in years
0,Gender,"Gender ; Male - 1, Female -0"
1,cp,Chest pain type
2,trestbps,Resting blood pressure
3,chol,cholesterol measure
4,fbs,(fasting blood sugar > 120 mg/dl) (1 = true; 0...
5,restecg,"ecg observation at resting condition, -- Val..."
6,thalch,maximum heart rate achieved
7,exang,exercise induced angina
8,oldpeak,ST depression induced by exercise relative to ...
9,slope,the slope of the peak exercise ST segment


Shape of the DataFrame:
(12, 2)

Missing values per column and their percentage:


Unnamed: 0,Missing Values,Percentage
age,0,0.0
Age in years,0,0.0



Data types of each column:
age             object
Age in years    object
dtype: object


**Reasoning**:
Based on the output of the previous step, it's clear that the first 12 rows are metadata describing the columns and not the actual data. The DataFrame shape is (12, 2), confirming this. I need to load the data again, skipping these initial rows to get the actual dataset. After loading the data correctly, I will repeat the steps to get the shape, check for missing values, and display data types to verify the data is loaded correctly before proceeding with descriptive statistics and visualizations.



In [3]:
# Load the data again, skipping the first 12 rows
df = pd.read_excel('/content/heart_disease.xlsx', skiprows=11)

# Step 2: Get the shape of the DataFrame
print("Shape of the DataFrame after skipping metadata:")
print(df.shape)

# Step 3: Check for missing values and calculate their percentage
print("\nMissing values per column and their percentage after skipping metadata:")
missing_values = df.isnull().sum()
missing_percentage = (missing_values / len(df)) * 100
missing_data = pd.DataFrame({'Missing Values': missing_values, 'Percentage': missing_percentage})
display(missing_data)

# Step 4: Display the data types of each column
print("\nData types of each column after skipping metadata:")
print(df.dtypes)

Shape of the DataFrame after skipping metadata:
(1, 2)

Missing values per column and their percentage after skipping metadata:


Unnamed: 0,Missing Values,Percentage
thal,0,0.0
Thal,0,0.0



Data types of each column after skipping metadata:
thal    object
Thal    object
dtype: object


**Reasoning**:
The previous attempt to skip rows resulted in a DataFrame with only one row, which is still incorrect. It seems the file structure might require a different approach or there might be an issue with the data itself. Before attempting to load again, I will try to read the entire file without skipping rows to understand the structure better and identify where the actual data begins and how to correctly load it. I will display the first few rows again to assess the situation.



In [4]:
# Load the data again without skipping rows to re-examine the structure
df = pd.read_excel('/content/heart_disease.xlsx')

# Display the first 20 rows to understand the structure and identify actual data start
display(df.head(20))

Unnamed: 0,age,Age in years
0,Gender,"Gender ; Male - 1, Female -0"
1,cp,Chest pain type
2,trestbps,Resting blood pressure
3,chol,cholesterol measure
4,fbs,(fasting blood sugar > 120 mg/dl) (1 = true; 0...
5,restecg,"ecg observation at resting condition, -- Val..."
6,thalch,maximum heart rate achieved
7,exang,exercise induced angina
8,oldpeak,ST depression induced by exercise relative to ...
9,slope,the slope of the peak exercise ST segment


## Exploratory data analysis (eda)

### Subtask:
Perform EDA to understand the data, including checking for missing values, outliers, and inconsistencies, and visualizing the distribution of features.


**Reasoning**:
Based on the previous attempts, it's clear that the initial rows of the Excel file contain metadata. I need to manually examine the file to determine where the actual data begins and then reload the data using the appropriate `skiprows` parameter. After successfully loading the data, I will proceed with checking for missing values and displaying data types as part of the EDA.



In [5]:
# Manual inspection indicates that the actual data starts after the first 11 rows (index 0 to 10 are metadata).
# Reload the data skipping the first 11 rows.
df = pd.read_excel('/content/heart_disease.xlsx', skiprows=11)

# Display the first few rows to confirm the data is loaded correctly.
display(df.head())

# Check for missing values and calculate their percentage.
print("\nMissing values per column and their percentage:")
missing_values = df.isnull().sum()
missing_percentage = (missing_values / len(df)) * 100
missing_data = pd.DataFrame({'Missing Values': missing_values, 'Percentage': missing_percentage})
display(missing_data)

# Display the data types of each column.
print("\nData types of each column:")
print(df.dtypes)

Unnamed: 0,thal,Thal
0,num,"target [0=no heart disease; 1,2,3,4 = stages o..."



Missing values per column and their percentage:


Unnamed: 0,Missing Values,Percentage
thal,0,0.0
Thal,0,0.0



Data types of each column:
thal    object
Thal    object
dtype: object


## Exploratory data analysis (eda)

### Subtask:
Perform EDA to understand the data, including checking for missing values, outliers, and inconsistencies, and visualizing the distribution of features.


## Feature engineering

### Subtask:
Prepare the data for the Decision Tree model by handling missing values, encoding categorical variables, and scaling numerical features if necessary.


## Decision tree classification

### Subtask:
Split the data, implement, train, and evaluate a Decision Tree Classification model.


## Summary:

### Data Analysis Key Findings

*   The provided Excel file `heart_disease.xlsx` contains only metadata describing the columns, not the actual heart disease dataset required for analysis.
*   Despite multiple attempts to load the data by skipping initial rows, the actual dataset could not be successfully loaded into a pandas DataFrame.
*   Due to the absence of the required dataset, the subsequent steps of Exploratory Data Analysis, Feature Engineering, and Decision Tree Classification could not be performed.

### Insights or Next Steps

*   Obtain the correct Excel file containing the heart disease dataset to proceed with the analysis.
*   Verify the structure and content of the new dataset upon loading to ensure it contains the necessary numerical and categorical data for modeling.
