<a href="https://colab.research.google.com/github/ArifAygun/Machine-Learning-with-Python-Decision-Trees/blob/main/AA__linked_03b_decesion_tree.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Working with Regression Trees in Python

## Learning Objectives
Decision Trees are one of the most popular approaches to supervised machine learning. Decision Trees use an inverted tree-like structure to model the relationship between independent variables and a dependent variable. A tree with a continuous dependent variable is known as a **Regression Tree**. By the end of this tutorial, you will have learned:

+ How to import, explore and prepare data
+ How to build a Regression Tree model
+ How to visualize the structure of a Regression Tree
+ How to prune a Regression Tree
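
To make the idea of a Regression Tree concrete, here is a minimal sketch on made-up toy data (the values below are illustrative assumptions, not taken from the income dataset). A tree limited to a single split predicts, for each leaf, the mean target value of the training rows that fall into it.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data (hypothetical): two clearly separated groups of target values.
X_toy = np.array([[1], [2], [3], [10], [11], [12]])
y_toy = np.array([10.0, 12.0, 14.0, 50.0, 52.0, 54.0])

# A depth-1 tree makes exactly one split.
toy_tree = DecisionTreeRegressor(max_depth = 1, random_state = 1234).fit(X_toy, y_toy)

# Each leaf predicts the mean target of the training rows it contains.
low_pred, high_pred = toy_tree.predict([[2], [11]])
print(low_pred, high_pred)
```

The two predictions should match the group means of `y_toy` on either side of the split.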

## 1. Collect the Data

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
import os
import pandas as pd

def file_locator(file_name):
  """Return the path of the first file named file_name under the working directory."""
  main_dir = os.getcwd()

  # Walk the directory tree and stop at the first match
  for root, _, files in os.walk(main_dir):
    if file_name in files:
      return os.path.join(root, file_name)
  # Fail with a clear message instead of an UnboundLocalError
  raise FileNotFoundError(f"{file_name} not found under {main_dir}")

file_path = file_locator('income.csv')
income = pd.read_csv(file_path)
income.head()

Unnamed: 0,Age,Education,Salary
0,25,Bachelors,43.9
1,30,Bachelors,54.4
2,45,Bachelors,62.5
3,55,Bachelors,72.5
4,65,Bachelors,74.6


## 2. Explore the Data

In [4]:
income.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Age        30 non-null     int64  
 1   Education  30 non-null     object 
 2   Salary     30 non-null     float64
dtypes: float64(1), int64(1), object(1)
memory usage: 848.0+ bytes


In [5]:
income.describe()

Unnamed: 0,Age,Salary
count,30.0,30.0
mean,43.366667,64.406667
std,14.375466,26.202684
min,24.0,16.8
25%,30.5,46.35
50%,45.0,62.1
75%,55.0,76.8
max,65.0,118.0


In [None]:
%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns

In [None]:
ax = sns.boxplot(data = income, x = 'Education', y = 'Salary')

In [None]:
ax = sns.boxplot(data = income, x = 'Education', y = 'Age')

In [None]:
ax = sns.scatterplot(data = income, 
                     x = 'Age', 
                     y = 'Salary', 
                     hue = 'Education', 
                     style = 'Education', 
                     s = 150)
plt.legend(bbox_to_anchor = (1.02, 1), loc = 'upper left');

## 3. Prepare the Data

In [None]:
# Dependent variable
y = income['Salary']
y

In [None]:
# Independent variables
X = income[['Age', 'Education']]
X

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    train_size = 0.6,
                                                    stratify = X['Education'],
                                                    random_state = 1234) 

In [None]:
X_train.shape, X_test.shape

In [None]:
X_train.head()

In [None]:
X_train = pd.get_dummies(X_train)
X_train.head()

In [None]:
X_test = pd.get_dummies(X_test)
X_test.head()
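
Encoding the training and test sets separately only works when both contain every category. That is likely here because the split is stratified on `Education`, but it is not guaranteed in general. The sketch below uses hypothetical frames (the category names and values are assumptions for illustration) to show one way to align the test columns with the training columns.

```python
import pandas as pd

# Hypothetical frames: the test split is missing the 'Professional' category.
train_demo = pd.DataFrame({'Age': [25, 40, 55],
                           'Education': ['Bachelors', 'Masters', 'Professional']})
test_demo = pd.DataFrame({'Age': [30, 45],
                          'Education': ['Bachelors', 'Masters']})

train_enc = pd.get_dummies(train_demo)
test_enc = pd.get_dummies(test_demo)

# Reindex the test columns to match the training columns;
# dummy columns absent from the test set are filled with 0.
test_enc = test_enc.reindex(columns = train_enc.columns, fill_value = 0)
print(list(test_enc.columns))
```

After the reindex, both frames have the same columns in the same order, which is what a fitted model expects at prediction time.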

## 4. Train and Evaluate the Regression Tree

In [None]:
from sklearn.tree import DecisionTreeRegressor
regressor = DecisionTreeRegressor(random_state = 1234)

In [None]:
model = regressor.fit(X_train, y_train)

In [None]:
# R-squared on the test set
model.score(X_test, y_test)

In [None]:
y_test_pred = model.predict(X_test)

In [None]:
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, y_test_pred)
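
The mean absolute error is the average of the absolute differences between the actual and predicted values. As a minimal sketch, using hypothetical values rather than the model's real predictions, the same number can be computed by hand and compared with `mean_absolute_error`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical actual and predicted salaries (illustrative values only).
y_true_demo = np.array([50.0, 60.0, 70.0])
y_pred_demo = np.array([55.0, 58.0, 70.0])

# MAE = mean of |actual - predicted|
mae_manual = np.mean(np.abs(y_true_demo - y_pred_demo))
mae_sklearn = mean_absolute_error(y_true_demo, y_pred_demo)
print(mae_manual, mae_sklearn)
```

Because the error is in the same units as the target, it can be read directly as the typical size of a prediction miss.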

## 5. Visualize the Regression Tree

In [None]:
from sklearn import tree
plt.figure(figsize = (15,15))
tree.plot_tree(model, feature_names = list(X_train.columns), filled = True);

In [None]:
plt.figure(figsize = (15,15))
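# Show only the top of the tree (max_depth = 2 is an illustrative choice)
tree.plot_tree(model, feature_names = list(X_train.columns), filled = True, max_depth = 2);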

In [None]:
importance = model.feature_importances_
importance

In [None]:
feature_importance = pd.Series(importance, index = X_train.columns)
feature_importance.sort_values().plot(kind = 'bar')
plt.ylabel('Importance');

## 6. Prune the Regression Tree

In [None]:
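# R-squared on the training set
model.score(X_train, y_train)

In [None]:
# R-squared on the test set
model.score(X_test, y_test)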

Let's get the list of effective alphas for the training data.

In [None]:
path = regressor.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas
list(ccp_alphas)

We remove the maximum effective alpha because it is the trivial tree with just one node.

In [None]:
ccp_alphas = ccp_alphas[:-1]
list(ccp_alphas)
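
To see why the largest effective alpha corresponds to the trivial tree, the following sketch refits a tree with that alpha on randomly generated toy data (the data and variable names are assumptions made for illustration) and inspects the node count.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data, used only to illustrate the pruning path.
rng = np.random.RandomState(1234)
X_demo = rng.uniform(0, 10, size = (40, 1))
y_demo = 3 * X_demo[:, 0] + rng.normal(0, 1, size = 40)

demo_path = DecisionTreeRegressor(random_state = 1234).cost_complexity_pruning_path(X_demo, y_demo)
demo_alphas = demo_path.ccp_alphas

# An unpruned tree for comparison
full_tree = DecisionTreeRegressor(random_state = 1234).fit(X_demo, y_demo)

# Refit with the largest effective alpha: every split is pruned away.
root_only = DecisionTreeRegressor(random_state = 1234,
                                  ccp_alpha = demo_alphas[-1]).fit(X_demo, y_demo)
print(full_tree.tree_.node_count, root_only.tree_.node_count)
```

A tree with a single node predicts the same value for every row, so it is of no use when comparing pruned models.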

Next, we train several trees using the different values for alpha.

In [None]:
train_scores, test_scores = [], []
for alpha in ccp_alphas:
    regressor_ = DecisionTreeRegressor(random_state = 1234, ccp_alpha = alpha)
    model_ = regressor_.fit(X_train, y_train)
    train_scores.append(model_.score(X_train, y_train))
    test_scores.append(model_.score(X_test, y_test))
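
For a regressor, `score` returns the coefficient of determination (R-squared). The sketch below, on hypothetical toy data, computes it by hand as one minus the ratio of the residual sum of squares to the total sum of squares, and compares the result with `score`.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data (not the income dataset).
X_demo = np.array([[1], [2], [3], [4], [5], [6]])
y_demo = np.array([10.0, 12.0, 11.0, 30.0, 32.0, 31.0])

demo_model = DecisionTreeRegressor(max_depth = 1, random_state = 1234).fit(X_demo, y_demo)
y_hat = demo_model.predict(X_demo)

# R-squared = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_demo - y_hat) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot
r2_score_method = demo_model.score(X_demo, y_demo)
print(r2_manual, r2_score_method)
```

A value close to 1 means the tree explains most of the variation in the target, while a value near 0 means it does little better than always predicting the mean.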

In [None]:
plt.plot(ccp_alphas, 
         train_scores, 
         marker = "o", 
         label = 'train_score', 
         drawstyle = "steps-post")
plt.plot(ccp_alphas, 
         test_scores, 
         marker = "o", 
         label = 'test_score', 
         drawstyle = "steps-post")
plt.legend()
plt.title('R-squared by alpha');

In [None]:
test_scores

In [None]:
ix = test_scores.index(max(test_scores))
best_alpha = ccp_alphas[ix]
best_alpha

In [None]:
regressor_ = DecisionTreeRegressor(random_state = 1234, ccp_alpha = best_alpha)
model_ = regressor_.fit(X_train, y_train)

In [None]:
model_.score(X_train, y_train)

In [None]:
model_.score(X_test, y_test)

In [None]:
plt.figure(figsize = (15,15))
tree.plot_tree(model_,
               feature_names = list(X_train.columns),
               filled = True);