# Decision Tree:
A decision tree is a tree-shaped diagram used to determine a course of action. Each branch of the tree represents a possible decision, occurrence, or reaction.
It is used to solve two kinds of problems:
- Classification: the tree learns a set of logical if-then conditions to assign samples to classes. Example: distinguishing three types of flowers based on certain features.
- Regression: when the target variable is numerical or continuous, a regression tree is fit to the target using the independent variables. For example, each split is chosen to minimize the sum of squared errors.
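
To make the regression case concrete, here is a minimal sketch of how a single split could be chosen by sum of squared errors (SSE). The helper names `sse` and `best_split` and the toy data are made up for illustration; library implementations are considerably more optimized.

```python
def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys):
    """Return (threshold, total_sse) with the lowest combined SSE.

    Candidate thresholds are midpoints between consecutive distinct x values.
    """
    pairs = list(zip(xs, ys))
    distinct = sorted(set(xs))
    best = None
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in pairs if x <= threshold]
        right = [y for x, y in pairs if x > threshold]
        total = sse(left) + sse(right)
        if best is None or total < best[1]:
            best = (threshold, total)
    return best


# Toy data (hypothetical): y jumps once x passes 3.
xs = [1, 2, 3, 4, 5, 6]
ys = [10.0, 11.0, 10.5, 20.0, 21.0, 19.5]
threshold, total_sse = best_split(xs, ys)
print(threshold, round(total_sse, 3))  # 3.5 1.667
```

Each side of the split predicts the mean of its own targets, so the threshold with the lowest combined SSE is the one that separates the two groups of values most cleanly.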

## Advantages:
 - Simple to understand, interpret, and visualize.
 - Little effort needed for data preparation.
 - It can handle both numerical and categorical data.
 - Non-linear relationships between parameters do not affect performance.


## Demerits:
 - Overfitting: occurs when the algorithm captures noise in the data.
 - High variance: the model can become unstable, because small variations in the data can produce a very different tree.
 - Low-bias tree: a highly complex tree tends to have low bias, which makes it difficult for the model to generalize to new data.

#### *Some Terms:*

- Entropy: a measure of randomness or unpredictability in a dataset.
- Information gain: a measure of the decrease in entropy after the dataset is split.
- Leaf node: carries the classification or final decision (the bottom node).
- Root node: the topmost decision node.


## Entropy Formula:

$$
H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)
$$
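
Here $p(x_i)$ is the fraction of samples belonging to class $x_i$ and $n$ is the number of classes. As a rough sketch, entropy and information gain can be computed by hand as follows. The function names and the toy labels are illustrative only, and information gain is taken as the parent entropy minus the size-weighted entropy of the child nodes.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Toy labels (hypothetical): four samples of each class.
parent = [0, 0, 0, 0, 1, 1, 1, 1]
perfect_split = [[0, 0, 0, 0], [1, 1, 1, 1]]  # each child is pure
useless_split = [[0, 0, 1, 1], [0, 0, 1, 1]]  # children mirror the parent

print(entropy(parent))                          # 1.0
print(information_gain(parent, perfect_split))  # 1.0
print(information_gain(parent, useless_split))  # 0.0
```

A tree built with `criterion="entropy"` prefers splits like the first one, where the children are much purer than the parent.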


### Use Case: Loan Repayment Prediction

In [2]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [3]:
df=pd.read_csv(r"..\Datasets\loan_data.csv")
print(df.shape)
df.info()

(9578, 14)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9578 entries, 0 to 9577
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   credit.policy      9578 non-null   int64  
 1   purpose            9578 non-null   object 
 2   int.rate           9578 non-null   float64
 3   installment        9578 non-null   float64
 4   log.annual.inc     9578 non-null   float64
 5   dti                9578 non-null   float64
 6   fico               9578 non-null   int64  
 7   days.with.cr.line  9578 non-null   float64
 8   revol.bal          9578 non-null   int64  
 9   revol.util         9578 non-null   float64
 10  inq.last.6mths     9578 non-null   int64  
 11  delinq.2yrs        9578 non-null   int64  
 12  pub.rec            9578 non-null   int64  
 13  not.fully.paid     9578 non-null   int64  
dtypes: float64(6), int64(7), object(1)
memory usage: 1.0+ MB


In [4]:
df.head()

| | credit.policy | purpose | int.rate | installment | log.annual.inc | dti | fico | days.with.cr.line | revol.bal | revol.util | inq.last.6mths | delinq.2yrs | pub.rec | not.fully.paid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | debt_consolidation | 0.1189 | 829.1 | 11.350407 | 19.48 | 737 | 5639.958333 | 28854 | 52.1 | 0 | 0 | 0 | 0 |
| 1 | 1 | credit_card | 0.1071 | 228.22 | 11.082143 | 14.29 | 707 | 2760.0 | 33623 | 76.7 | 0 | 0 | 0 | 0 |
| 2 | 1 | debt_consolidation | 0.1357 | 366.86 | 10.373491 | 11.63 | 682 | 4710.0 | 3511 | 25.6 | 1 | 0 | 0 | 0 |
| 3 | 1 | debt_consolidation | 0.1008 | 162.34 | 11.350407 | 8.1 | 712 | 2699.958333 | 33667 | 73.2 | 1 | 0 | 0 | 0 |
| 4 | 1 | credit_card | 0.1426 | 102.92 | 11.299732 | 14.97 | 667 | 4066.0 | 4740 | 39.5 | 0 | 1 | 0 | 0 |


*Handling categorical data*

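The categorical column "purpose" has to be converted to a numerical format, because the model cannot handle text categories directly. `pd.get_dummies` replaces the original column with one True/False (1/0) indicator column per category, and `drop_first=True` drops the first category to avoid redundancy: a row whose indicators are all False belongs to the dropped category.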


In [5]:
category=pd.get_dummies(df["purpose"],drop_first=True)
loans=pd.concat([df,category],axis=1).drop("purpose",axis=1)
loans.head()

| | credit.policy | int.rate | installment | log.annual.inc | dti | fico | days.with.cr.line | revol.bal | revol.util | inq.last.6mths | delinq.2yrs | pub.rec | not.fully.paid | credit_card | debt_consolidation | educational | home_improvement | major_purchase | small_business |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.1189 | 829.1 | 11.350407 | 19.48 | 737 | 5639.958333 | 28854 | 52.1 | 0 | 0 | 0 | 0 | False | True | False | False | False | False |
| 1 | 1 | 0.1071 | 228.22 | 11.082143 | 14.29 | 707 | 2760.0 | 33623 | 76.7 | 0 | 0 | 0 | 0 | True | False | False | False | False | False |
| 2 | 1 | 0.1357 | 366.86 | 10.373491 | 11.63 | 682 | 4710.0 | 3511 | 25.6 | 1 | 0 | 0 | 0 | False | True | False | False | False | False |
| 3 | 1 | 0.1008 | 162.34 | 11.350407 | 8.1 | 712 | 2699.958333 | 33667 | 73.2 | 1 | 0 | 0 | 0 | False | True | False | False | False | False |
| 4 | 1 | 0.1426 | 102.92 | 11.299732 | 14.97 | 667 | 4066.0 | 4740 | 39.5 | 0 | 1 | 0 | 0 | True | False | False | False | False | False |


`pd.concat()` glues pandas DataFrames together, and `axis=1` tells it to place them side by side (adding new columns) instead of stacking them top to bottom.
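
To show what the dummy encoding does to a single column, here is a plain-Python sketch of one-hot encoding with the first level dropped. It is not how pandas implements `get_dummies`. The helper `one_hot` is hypothetical, and it assumes levels are taken in sorted order, which matches the column order in the output above.

```python
def one_hot(values, drop_first=False):
    """Return one dict of {level: bool} per input value."""
    levels = sorted(set(values))
    if drop_first:
        levels = levels[1:]  # the dropped level is implied by all-False rows
    return [{level: value == level for level in levels} for value in values]


# Toy column (hypothetical) with three of the purpose categories.
purpose = ["debt_consolidation", "credit_card", "educational", "credit_card"]
rows = one_hot(purpose, drop_first=True)
for value, row in zip(purpose, rows):
    print(value, row)
```

With three categories only two indicator columns remain. The `credit_card` rows are all False, so no information is lost by dropping that column.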

In [6]:
# Splitting dataset into test and train set
from sklearn.model_selection import train_test_split
X=loans.drop("not.fully.paid",axis=1)
y=loans["not.fully.paid"]

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=40)


`X = loans.drop("not.fully.paid", axis=1)`
- Removes the target column "not.fully.paid" (the outcome we want to predict: 0 = fully paid, 1 = not fully paid).
- X becomes all other columns: the input features such as interest rate, installment, FICO score, and the purpose dummies. The model uses these to learn patterns.

`y = loans["not.fully.paid"]`
- Keeps only the target column: the output label we want to predict.
- y answers "Will this loan be fully paid?" for each row.

In [13]:
# Using decision tree

from sklearn.tree import DecisionTreeClassifier
detree=DecisionTreeClassifier(criterion="entropy",random_state=100,max_depth=8, min_samples_leaf=2)
detree.fit(X_train,y_train)
prediction=detree.predict(X_test)
print(prediction)

[0 1 0 ... 0 0 0]


In [14]:
# Checking performance of the model
from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(y_test,prediction))
print(classification_report(y_test,prediction))

[[2320   68]
 [ 461   25]]
              precision    recall  f1-score   support

           0       0.83      0.97      0.90      2388
           1       0.27      0.05      0.09       486

    accuracy                           0.82      2874
   macro avg       0.55      0.51      0.49      2874
weighted avg       0.74      0.82      0.76      2874



In [15]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test,prediction)*100

81.59359777313848

The decision tree reaches a high overall accuracy (81.59%) on the loan repayment data, but this hides a critical flaw caused by class imbalance: class 0 (repaid, 2,388 test samples) dominates class 1 (defaulted, 486 test samples, about 17%).

The confusion matrix [[2320, 68], [461, 25]] shows strong prediction for repaid loans (97% recall) but poor performance on defaults (5% recall, 461 false negatives). The model flags a few risky loans but still misses about 95% of defaults, so its accuracy comes mainly from predicting the majority class rather than from balanced learning.

Total samples: 2,874
- Class 0 (Repaid): 2,388 (83.1%) 
- Class 1 (Default): 486 (16.9%)

### Confusion matrix
[[2320   68]  ← 97% correct for repaid loans
 [ 461   25]] ← 5% detection of defaults 

High accuracy does not mean a good model when the classes are imbalanced.
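
The figures quoted above can be recomputed directly from the four cells of the confusion matrix. The sketch below uses only the counts printed earlier. The variable names are illustrative, and the "baseline" is a hypothetical classifier that always predicts class 0.

```python
# Confusion matrix cells from the output above (rows = actual, cols = predicted).
tn, fp = 2320, 68   # actual repaid:  predicted repaid / predicted default
fn, tp = 461, 25    # actual default: predicted repaid / predicted default

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
recall_repaid = tn / (tn + fp)       # share of repaid loans identified
recall_default = tp / (tp + fn)      # share of defaults caught
precision_default = tp / (tp + fp)   # share of default predictions that are right
baseline_accuracy = (tn + fp) / total  # always predicting "repaid"

print(f"accuracy           {accuracy:.4f}")           # 0.8159
print(f"recall (repaid)    {recall_repaid:.4f}")      # 0.9715
print(f"recall (default)   {recall_default:.4f}")     # 0.0514
print(f"precision (default){precision_default:.4f}")  # 0.2688
print(f"baseline accuracy  {baseline_accuracy:.4f}")  # 0.8309
```

Always predicting "repaid" would score about 83.1% accuracy on this test set, slightly above the tree's 81.6%. For imbalanced data like this, recall and precision on the default class say more about the model than accuracy does.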
