# DECISION TREE


Objective:

The objective of this assignment is to apply Decision Tree Classification to a given dataset, analyse the performance of the model, and interpret the results.

Tasks:

1. Data Preparation:
   - Load the dataset into your preferred data analysis environment (e.g., Python with libraries like Pandas and NumPy).
2. Exploratory Data Analysis (EDA):
   - Perform exploratory data analysis to understand the structure of the dataset.
   - Check for missing values, outliers, and inconsistencies in the data.
   - Visualize the distribution of features, including histograms, box plots, and correlation matrices.
3. Feature Engineering:
   - If necessary, perform feature engineering techniques such as encoding categorical variables, scaling numerical features, or handling missing values.
4. Decision Tree Classification:
   - Split the dataset into training and testing sets (e.g., using an 80-20 split).
   - Implement a Decision Tree Classification model using a library like scikit-learn.
   - Train the model on the training set and evaluate its performance on the testing set using appropriate evaluation metrics (e.g., accuracy, precision, recall, F1-score, ROC-AUC).
5. Hyperparameter Tuning:
   - Perform hyperparameter tuning to optimize the Decision Tree model. Experiment with different hyperparameters such as maximum depth, minimum samples split, and criterion.
6. Model Evaluation and Analysis:
   - Analyse the performance of the Decision Tree model using the evaluation metrics obtained.
   - Visualize the decision tree structure to understand the rules learned by the model and identify important features.

Interview Questions:

1. What are some common hyperparameters of decision tree models, and how do they affect the model's performance?
2. What is the difference between label encoding and one-hot encoding?
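
For the second interview question, a small sketch may make the contrast concrete. It assumes a toy column named `chest_pain` with made-up category values (neither the column name nor the values come from the dataset) and uses only pandas.

```python
import pandas as pd

# Hypothetical toy column; the values are illustrative, not taken from the dataset.
df = pd.DataFrame({"chest_pain": ["typical", "atypical", "non-anginal", "typical"]})

# Label encoding: each category is replaced by a single integer code.
# Codes follow alphabetical order here, which implies an ordering
# that may not be meaningful for nominal categories.
categories = sorted(df["chest_pain"].unique())
label_map = {name: code for code, name in enumerate(categories)}
label_encoded = df["chest_pain"].map(label_map)

# One-hot encoding: each category becomes its own 0/1 indicator column,
# so no ordering is implied, at the cost of more columns.
one_hot = pd.get_dummies(df["chest_pain"], prefix="chest_pain", dtype=int)

print(label_encoded.tolist())
print(one_hot)
```

Label encoding keeps a single column, while one-hot encoding produces one column per category, and every row has exactly one indicator set to 1.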




## 1. Data Preparation

In [4]:
import pandas as pd
import numpy as np

In [5]:
# Loading dataset
data = pd.read_excel(r'C:/Users/DELL/Desktop/DATAsets/heart_disease.xlsx')
data

        age                                       Age in years
0    Gender                       Gender ; Male - 1, Female -0
1        cp                                    Chest pain type
2  trestbps                             Resting blood pressure
3      chol                                cholesterol measure
4       fbs  (fasting blood sugar > 120 mg/dl) (1 = true; 0...
5   restecg    ecg observation at resting condition, -- Val...
6    thalch                        maximum heart rate achieved
7     exang                            exercise induced angina
8   oldpeak  ST depression induced by exercise relative to ...
9     slope          the slope of the peak exercise ST segment


In [7]:
pd.set_option('display.max_columns', None)  # Show all columns
pd.set_option('display.max_rows', None)  # Show all rows (if needed)

In [8]:
pd.set_option('display.width', 1000) 

In [10]:
data.info()  # info() prints its report directly and returns None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   age           12 non-null     object
 1   Age in years  12 non-null     object
dtypes: object(2)
memory usage: 324.0+ bytes


In [11]:
print(data.head())

        age                                       Age in years
0    Gender                       Gender ; Male - 1, Female -0
1        cp                                    Chest pain type
2  trestbps                             Resting blood pressure
3      chol                                cholesterol measure
4       fbs  (fasting blood sugar > 120 mg/dl) (1 = true; 0...


## 2. Exploratory Data Analysis (EDA)

In [13]:
# Inspect the dataset for missing values (count of nulls per column)
print(data.isnull().sum())

age             0
Age in years    0
dtype: int64
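
The check above counts missing values only. For the outlier part of the task, one common option is the interquartile-range (IQR) rule applied to numeric columns. The sketch below is an assumption about how that could look: the helper name `iqr_outlier_counts`, the multiplier `k=1.5`, and the toy values are all hypothetical and not taken from the dataset.

```python
import pandas as pd

def iqr_outlier_counts(df: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = df.select_dtypes(include="number")
    if numeric.empty:
        # No numeric columns (or no rows): nothing to check.
        return pd.Series(dtype="int64")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()

# Hypothetical toy frame: one numeric column with an extreme value, one text column.
toy = pd.DataFrame({"chol": [200, 210, 190, 205, 600], "label": list("abcde")})
counts = iqr_outlier_counts(toy)
print(counts)
```

Text columns are skipped by `select_dtypes`, so on a frame with only `object` columns the helper returns an empty result rather than raising.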


In [14]:
print(data.describe())    

           age                  Age in years
count       12                            12
unique      12                            12
top     Gender  Gender ; Male - 1, Female -0
freq         1                             1


### Visualize distributions and correlations

In [16]:
import matplotlib.pyplot as plt
import seaborn as sns

In [17]:
# Histograms
data.hist(figsize=(10, 8))
plt.show()


ValueError: hist method requires numerical or datetime columns, nothing to plot.
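
The error is consistent with the `info()` output earlier, where both columns have dtype `object`, so there is nothing numeric to plot. A minimal sketch of a guard is shown below; the helper name `plot_numeric_histograms` is hypothetical, and the small frame only mimics the text-only shape of the loaded data.

```python
import pandas as pd

def plot_numeric_histograms(df: pd.DataFrame) -> list:
    """Plot histograms for numeric columns and return the names that were plotted."""
    numeric_cols = df.select_dtypes(include="number").columns.tolist()
    if not numeric_cols:
        # Report the dtypes instead of letting DataFrame.hist raise ValueError.
        print("No numeric columns to plot; dtypes:", df.dtypes.astype(str).to_dict())
        return []
    import matplotlib.pyplot as plt  # imported lazily, only when plotting is possible
    df[numeric_cols].hist(figsize=(10, 8))
    plt.show()
    return numeric_cols

# Two text-only columns, mimicking the frame loaded above.
text_only = pd.DataFrame({
    "age": ["Gender", "cp"],
    "Age in years": ["Gender ; Male - 1, Female -0", "Chest pain type"],
})
plotted = plot_numeric_histograms(text_only)
```

An empty return value here is a hint that the frame holds descriptions rather than measurements, which is worth checking before any further plotting.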

In [None]:
data = pd.read_excel(r'C:/Users/DELL/Desktop/DATAsets/heart_disease.xlsx', engine='openpyxl')
data

In [None]:
print(data.shape) 

In [None]:
print(data.columns) 

In [None]:
xls = pd.ExcelFile(r'C:/Users/DELL/Desktop/DATAsets/heart_disease.xlsx')
print(xls.sheet_names)
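
If the workbook turns out to hold more than one sheet, the frame loaded so far may simply be a description sheet. `pd.read_excel(path, sheet_name=None)` returns a dict mapping each sheet name to a DataFrame, and one heuristic is to pick the sheet with the most numeric columns. The sketch below is an assumption, not the notebook's method: the helper `pick_data_sheet`, the sheet names `"Description"` and `"Data"`, and the numeric values are hypothetical, and the dict is built in memory so the example runs without the Excel file.

```python
import pandas as pd

def pick_data_sheet(sheets: dict) -> str:
    """Return the name of the sheet that has the most numeric columns."""
    def numeric_count(name: str) -> int:
        return sheets[name].select_dtypes(include="number").shape[1]
    return max(sheets, key=numeric_count)

# Stand-in for: sheets = pd.read_excel(path, sheet_name=None)
# Sheet names and numeric values below are hypothetical.
sheets = {
    "Description": pd.DataFrame({
        "age": ["Gender", "cp"],
        "Age in years": ["Gender ; Male - 1, Female -0", "Chest pain type"],
    }),
    "Data": pd.DataFrame({"age": [63, 41], "chol": [233, 204]}),
}

chosen = pick_data_sheet(sheets)
data_sheet = sheets[chosen]
print(chosen, data_sheet.shape)
```

With the real file, the names printed by `xls.sheet_names` would show which sheet to pass to `pd.read_excel(..., sheet_name=...)` directly.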

In [None]:
data = pd.read_excel(r'C:/Users/DELL/Desktop/DATAsets/heart_disease.xlsx', header=1)  # Use the second row as headers
print(data.head())

In [None]:
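from openpyxl import load_workbook  # needed for load_workbook below

wb = load_workbook(r'C:/Users/DELL/Desktop/DATAsets/heart_disease.xlsx')
ws = wb.active  # the workbook's currently active worksheet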

In [None]:
data = pd.read_excel(r'C:/Users/DELL/Desktop/DATAsets/heart_disease.xlsx', nrows=5000)
data

In [None]:
print(f"Rows: {data.shape[0]}, Columns: {data.shape[1]}")

In [None]:
print(data.columns)