# Practice - Decision Tree Regressor
## Ana Sofía Hinojosa Bale

In [63]:
from sklearn.tree import DecisionTreeRegressor
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score

In [64]:
data = pd.read_csv('Advertising.csv')
data = data.drop(columns=['Unnamed: 0'])
x = data.drop(columns=['sales'])
y = data['sales']

## Linear Regression as a benchmark

In [65]:
scaler = StandardScaler()
x_scaled = scaler.fit_transform(x)
reg = LinearRegression()
reg.fit(x_scaled, y)
y_pred = reg.predict(x_scaled)
print("Linear Regression R^2:", r2_score(y, y_pred))

Linear Regression R^2: 0.8972106381789521


## Decision Tree

In [66]:
tree = DecisionTreeRegressor()
tree.fit(x_scaled, y)
y_pred_tree = tree.predict(x_scaled)
print("Decision Tree R^2:", r2_score(y, y_pred_tree))

Decision Tree R^2: 1.0


The $R^2$ of the Decision Tree Regressor is much higher than that of Linear Regression, but a score of 1.0 measured on the same data used for fitting most likely indicates overfitting. A train/test split is therefore used as a hold-out validation method.
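
For comparison, k-fold cross-validation averages the score over several splits instead of relying on a single one. The following is a minimal sketch, not part of the original analysis: it assumes synthetic data from `make_regression` as a stand-in for `Advertising.csv`, and the names `X_demo`, `y_demo`, `cv` and `cv_scores` are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Assumption: synthetic stand-in for the advertising data (200 rows, 3 features).
X_demo, y_demo = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=42)

# Five shuffled folds; each fold is used once as the held-out set.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(
    DecisionTreeRegressor(random_state=42), X_demo, y_demo, cv=cv, scoring="r2"
)

print("R^2 per fold:", cv_scores)
print("Mean R^2:", cv_scores.mean())
```

Because every score is computed on data the tree did not see during fitting, the mean is a less optimistic estimate than the in-sample $R^2$.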

## Train Test Split

In [67]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

In [68]:
red_cv = LinearRegression()
red_cv.fit(x_train, y_train)
y_train_pred = red_cv.predict(x_train)
y_test_pred = red_cv.predict(x_test)
print("Linear Regression Train R^2:", r2_score(y_train, y_train_pred))
print("Linear Regression Test R^2:", r2_score(y_test, y_test_pred))

Linear Regression Train R^2: 0.9055159502227753
Linear Regression Test R^2: 0.8609466508230366


In [69]:
tree_cv = DecisionTreeRegressor()
tree_cv.fit(x_train, y_train)
y_pred_tree_cv = tree_cv.predict(x_train)
y_pred_tree_cv_test = tree_cv.predict(x_test)
print("Decision Tree Train R^2:", r2_score(y_train, y_pred_tree_cv))
print("Decision Tree Test R^2:", r2_score(y_test, y_pred_tree_cv_test))

Decision Tree Train R^2: 1.0
Decision Tree Test R^2: 0.9279242545443336


Although the tree does show some overfitting (train $R^2$ of 1.0 versus about 0.93 on test), it is still a better model than linear regression, scoring higher on both train and test.
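
One common way to inspect the gap between train and test is to compare an unrestricted tree with a depth-limited one. This is only a sketch under stated assumptions: the data is synthetic (`make_regression`), the depth of 4 is an arbitrary illustrative choice, and `results`, `X_tr`, `X_te`, `y_tr` and `y_te` are hypothetical names that do not appear in the notebook.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Assumption: synthetic stand-in for the advertising data.
X_demo, y_demo = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

results = {}
for depth in (None, 4):  # None lets the tree grow until every leaf is pure
    model = DecisionTreeRegressor(max_depth=depth, random_state=42).fit(X_tr, y_tr)
    train_r2 = r2_score(y_tr, model.predict(X_tr))
    test_r2 = r2_score(y_te, model.predict(X_te))
    results[depth] = (train_r2, test_r2)
    print(f"max_depth={depth}: train R^2={train_r2:.3f}, test R^2={test_r2:.3f}")
```

An unrestricted tree can memorize the training set, which is why its train $R^2$ reaches 1.0. Limiting the depth lowers the train score, and whether the test score improves depends on the data.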

## Polynomial Features

In [70]:
poly = PolynomialFeatures(degree=2)
x_poly = poly.fit_transform(x_train)
x_poly_test = poly.transform(x_test)
poly_reg = LinearRegression()
poly_reg.fit(x_poly, y_train)
y_pred_poly = poly_reg.predict(x_poly)
y_pred_poly_test = poly_reg.predict(x_poly_test)
print("Polynomial Regression R^2:", r2_score(y_train, y_pred_poly))
print("Polynomial Regression Test R^2:", r2_score(y_test, y_pred_poly_test))

Polynomial Regression R^2: 0.9865054729019952
Polynomial Regression Test R^2: 0.9808386009966376


In [71]:
poly_tree = DecisionTreeRegressor()
poly_tree.fit(x_poly, y_train)
y_pred_poly_tree = poly_tree.predict(x_poly)
y_pred_poly_tree_test = poly_tree.predict(x_poly_test)
print("Polynomial Decision Tree R^2:", r2_score(y_train, y_pred_poly_tree))
print("Polynomial Decision Tree Test R^2:", r2_score(y_test, y_pred_poly_tree_test))

Polynomial Decision Tree R^2: 1.0
Polynomial Decision Tree Test R^2: 0.9179869820509541


As shown, the Decision Tree still achieves very good results despite some overfitting. With Polynomial Features, however, linear regression gives the better test result (about 0.98 versus 0.92) and is also more stable between train and test.
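
The scaling, polynomial expansion and linear fit used above can also be chained in a single scikit-learn `Pipeline`, so each transformer is fitted on the training data only. A minimal sketch, assuming synthetic data with an interaction term (the coefficients are made up for illustration) and the hypothetical names `poly_pipeline`, `X_demo` and `y_demo`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Assumption: synthetic data with an interaction between the first two features.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(200, 3))
y_demo = (3 + 0.05 * X_demo[:, 0] + 0.1 * X_demo[:, 1]
          + 0.001 * X_demo[:, 0] * X_demo[:, 1] + rng.normal(0, 0.5, size=200))

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# Same order as in the notebook: scale, expand to degree 2, then fit a linear model.
poly_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("poly", PolynomialFeatures(degree=2)),
    ("reg", LinearRegression()),
])
poly_pipeline.fit(X_tr, y_tr)

train_r2 = r2_score(y_tr, poly_pipeline.predict(X_tr))
test_r2 = r2_score(y_te, poly_pipeline.predict(X_te))
print("Pipeline Train R^2:", train_r2)
print("Pipeline Test R^2:", test_r2)
```

Swapping the last step for `DecisionTreeRegressor()` would give the tree variant with the same preprocessing.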