Perform the task below on MASTER_PhonesmartdataAll_CCI_AdvStats.csv and the wine dataset.
a. Apply a Decision Tree to each dataset, with and without pruning, and record your observations.

## Dataset1
MASTER_PhonesmartdataAll_CCI_AdvStats.csv

### 1.0 Data Preparation

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

In [2]:
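# Load the dataset and drop the 'Device' column if it is present
df = pd.read_csv("MASTER_PhonesmartdataAll_CCI_AdvStats.csv")
df = df.drop('Device', axis=1, errors='ignore')
# Treat blank-string cells as missing values, then drop incomplete rows
df.replace(" ", np.nan, inplace=True)
df.dropna(inplace=True)
# Cast these columns to integers (a value that fails to parse becomes NaN and makes astype(int) raise)
int_columns = ['Age', 'GenderNum', 'AutismQuotient', 'STAI', 'BRIEF_Total', 'DailyAvgMins', 'DailyAvePickups']
df[int_columns] = df[int_columns].apply(pd.to_numeric, errors='coerce').astype(int)
df.shape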

(124, 10)

In [3]:
# Split the dataset
X = df.drop('VS_RT_correct_Single', axis=1)  # Features: every column except the target
y = df['VS_RT_correct_Single']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### 1.1 Decision Tree without pruning  

In [4]:
# Apply Decision Tree without pruning
tree_10 = DecisionTreeRegressor(random_state=42)
tree_10.fit(X_train, y_train)

In [5]:
# Make predictions on the test set
y_pred_10 = tree_10.predict(X_test)
y_pred_10

array([ 723.45888889,  769.83111111,  723.45888889,  829.02291667,
        846.41547619,  678.77083333,  858.66458333,  858.66458333,
        769.83111111,  858.66458333,  662.88541667,  739.30952381,
        799.44333333,  815.01666667,  712.79444444,  884.825     ,
        828.64722222,  718.38888889,  762.45625   ,  739.30952381,
        710.68888889, 1039.3       ,  828.50555556,  819.12222224,
        678.77083333])

In [6]:
# Evaluate the performance without pruning
mse_10 = mean_squared_error(y_test, y_pred_10)
mae_10 = mean_absolute_error(y_test, y_pred_10)
r2_10 = r2_score(y_test, y_pred_10)

print(f"Mean Squared Error: {mse_10}")
print(f"Mean Absolute Error: {mae_10}")
print(f"R-squared: {r2_10}")

Mean Squared Error: 23491.5674975957
Mean Absolute Error: 78.23434127342166
R-squared: 0.146293820926653


The Mean Squared Error is 23491.57, the Mean Absolute Error is 78.23, and the R-squared value is 0.146. The low R-squared means the unpruned tree explains only about 15% of the variance in the test targets, so its predictions are only modestly better than always predicting the mean.

### 1.2 Decision Tree with pruning

In [7]:
# Apply Decision Tree with pre-pruning (early stopping via max_depth=3)
tree_11 = DecisionTreeRegressor(max_depth=3, random_state=42)
tree_11.fit(X_train, y_train)

In [8]:
# Make predictions on the test set
y_pred_11 = tree_11.predict(X_test)
y_pred_11

array([759.59720708, 759.59720708, 759.59720708, 840.22404237,
       840.22404237, 667.78189236, 840.22404237, 840.22404237,
       759.59720708, 840.22404237, 667.78189236, 840.22404237,
       759.59720708, 840.22404237, 759.59720708, 840.22404237,
       759.59720708, 840.22404237, 759.59720708, 840.22404237,
       759.59720708, 943.16719699, 840.22404237, 759.59720708,
       667.78189236])

In [9]:
# Evaluate the performance with pruning
mse_11 = mean_squared_error(y_test, y_pred_11)
mae_11 = mean_absolute_error(y_test, y_pred_11)
r2_11 = r2_score(y_test, y_pred_11)

print(f"Mean Squared Error: {mse_11}")
print(f"Mean Absolute Error: {mae_11}")
print(f"R-squared: {r2_11}")

Mean Squared Error: 21656.493779688928
Mean Absolute Error: 67.035243144959
R-squared: 0.2129821665294933


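The Mean Squared Error is 21656.49, the Mean Absolute Error is 67.04, and the R-squared value is 0.213. Compared to the unpruned model, the pruned model has a lower MSE and MAE and a higher R-squared. The fit is still weak, since the tree explains only about a fifth of the variance in the test targets, and with just 25 test samples the gap between the two models should be read with caution.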
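
To put these error values in context, the reported metrics can be related to one another. Because R-squared equals 1 - MSE / Var(y_test), each printed pair of MSE and R-squared implies the variance of the test targets, and therefore the error of a baseline that always predicts the test mean. The following is a small arithmetic sketch that uses only the numbers printed above; the helper name `implied_target_variance` is illustrative and not part of the notebook.

```python
import math

# Test-set metrics copied from the printed outputs above
mse_unpruned, r2_unpruned = 23491.5674975957, 0.146293820926653
mse_pruned, r2_pruned = 21656.493779688928, 0.2129821665294933


def implied_target_variance(mse, r2):
    """Invert R^2 = 1 - MSE / Var(y_test) to recover Var(y_test)."""
    return mse / (1.0 - r2)


# Both models were scored on the same y_test, so the two estimates should agree
var_unpruned = implied_target_variance(mse_unpruned, r2_unpruned)
var_pruned = implied_target_variance(mse_pruned, r2_pruned)

# RMSE is in the same units as the target, which makes it easier to read than MSE
rmse_unpruned = math.sqrt(mse_unpruned)
rmse_pruned = math.sqrt(mse_pruned)
# RMSE of a baseline that always predicts the mean of y_test
rmse_baseline = math.sqrt(var_unpruned)

print(f"Implied Var(y_test): {var_unpruned:.2f} vs {var_pruned:.2f}")
print(f"RMSE unpruned: {rmse_unpruned:.2f}")
print(f"RMSE pruned:   {rmse_pruned:.2f}")
print(f"RMSE baseline: {rmse_baseline:.2f}")
```

Comparing each model's RMSE with the baseline RMSE shows how much of the improvement over a constant prediction each tree actually delivers.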

## Dataset2 
Wine dataset, loaded with `datasets.load_wine()` from scikit-learn (178 samples, 13 features, 3 classes)

### 2.0 Data Preparation

In [10]:
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

In [11]:
wine = datasets.load_wine()
XX = wine.data
yy = wine.target
XX

array([[1.423e+01, 1.710e+00, 2.430e+00, ..., 1.040e+00, 3.920e+00,
        1.065e+03],
       [1.320e+01, 1.780e+00, 2.140e+00, ..., 1.050e+00, 3.400e+00,
        1.050e+03],
       [1.316e+01, 2.360e+00, 2.670e+00, ..., 1.030e+00, 3.170e+00,
        1.185e+03],
       ...,
       [1.327e+01, 4.280e+00, 2.260e+00, ..., 5.900e-01, 1.560e+00,
        8.350e+02],
       [1.317e+01, 2.590e+00, 2.370e+00, ..., 6.000e-01, 1.620e+00,
        8.400e+02],
       [1.413e+01, 4.100e+00, 2.740e+00, ..., 6.100e-01, 1.600e+00,
        5.600e+02]])

In [12]:
# Split the data into training and testing sets
XX_train, XX_test, yy_train, yy_test = train_test_split(XX, yy, test_size=0.2, random_state=42)

### 2.1 Decision Tree without pruning

In [13]:
# Apply Decision Tree without pruning
tree_20 = DecisionTreeClassifier(random_state=42)
tree_20.fit(XX_train, yy_train)

In [14]:
# Make predictions on the test set
y_pred_20 = tree_20.predict(XX_test)
y_pred_20

array([0, 0, 2, 0, 1, 0, 1, 2, 1, 2, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1,
       1, 2, 2, 2, 1, 1, 1, 0, 0, 1, 2, 0, 0, 0])

In [15]:
# Evaluate the performance without pruning
accuracy_20 = accuracy_score(yy_test, y_pred_20)
classification_20 = classification_report(yy_test, y_pred_20)

print(f"Accuracy: {accuracy_20}")
print(f"Classification:\n {classification_20}")

Accuracy: 0.9444444444444444
Classification:
               precision    recall  f1-score   support

           0       0.93      0.93      0.93        14
           1       0.93      1.00      0.97        14
           2       1.00      0.88      0.93         8

    accuracy                           0.94        36
   macro avg       0.95      0.93      0.94        36
weighted avg       0.95      0.94      0.94        36



- Accuracy: 94.44%
- Classification Report:
  - Class 0 (Precision: 93%, Recall: 93%)
  - Class 1 (Precision: 93%, Recall: 100%)
  - Class 2 (Precision: 100%, Recall: 88%)
- The unpruned Decision Tree is highly accurate on the test set, with high precision and recall for every class. Class 2 has only 8 test samples, so its 88% recall corresponds to a single missed sample (7 of 8 found). Perfect precision with lower recall means the model never labels another wine as class 2 but misses one true class 2 wine. On its own this is not evidence of overfitting, which would show up as a gap between training and test accuracy.
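
One way to check for overfitting directly is to compare training and test accuracy for the same tree. The sketch below refits the unpruned tree with the same split and seed used above; the variable names are illustrative, and the exact test accuracy may vary with the scikit-learn version.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same data, split, and seed as in the notebook cells above
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fully grown (unpruned) tree
tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)  # accuracy on data the tree has seen
test_acc = tree.score(X_test, y_test)     # accuracy on held-out data
gap = train_acc - test_acc                # a large gap would suggest overfitting

print(f"Train accuracy: {train_acc:.4f}")
print(f"Test accuracy:  {test_acc:.4f}")
print(f"Gap:            {gap:.4f}")
print(f"Depth: {tree.get_depth()}, leaves: {tree.get_n_leaves()}")
```

A fully grown tree typically fits its training data perfectly, so the size of the gap to the test accuracy, not the training accuracy itself, is the quantity to watch.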

### 2.2 Decision Tree with pruning

In [16]:
# Apply Decision Tree with pre-pruning (max_depth=3)
tree_21 = DecisionTreeClassifier(max_depth=3, random_state=42)
tree_21.fit(XX_train, yy_train)

In [17]:
# Make predictions on the test set
y_pred_21 = tree_21.predict(XX_test)
y_pred_21

array([0, 0, 2, 0, 1, 0, 1, 2, 1, 2, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1,
       1, 2, 2, 2, 1, 1, 1, 0, 0, 1, 2, 0, 0, 0])

In [18]:
# Evaluate the performance with pruning
accuracy_21 = accuracy_score(yy_test, y_pred_21)
classification_21 = classification_report(yy_test, y_pred_21)

print(f"Accuracy: {accuracy_21}")
print(f"Classification:\n {classification_21}")

Accuracy: 0.9444444444444444
Classification:
               precision    recall  f1-score   support

           0       1.00      0.93      0.96        14
           1       0.88      1.00      0.93        14
           2       1.00      0.88      0.93         8

    accuracy                           0.94        36
   macro avg       0.96      0.93      0.94        36
weighted avg       0.95      0.94      0.94        36



- Accuracy: 94.44%
- Classification Report:
  - Class 0 (Precision: 100%, Recall: 93%)
  - Class 1 (Precision: 88%, Recall: 100%)
  - Class 2 (Precision: 100%, Recall: 88%)
- The pruned Decision Tree reaches the same accuracy (34 of 36 test samples correct). Recall is unchanged for every class; only precision shifts: class 0 rises from 93% to 100% and class 1 falls from 93% to 88%. The two prediction arrays differ at a single test sample (index 11), which the unpruned tree labels class 0 and the pruned tree labels class 1. The two reports imply that its true class is 2, so both models misclassify it. On this split, limiting the depth to 3 therefore yields a simpler tree with equal accuracy, but the results do not show improved generalization.
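
Because the per-class precision and recall figures are derived from a handful of test samples, it can help to look at the confusion matrices and at the test indices where the two trees disagree. The following is a minimal sketch under the same split and seed as above; the dictionary and variable names are illustrative, and the exact counts may depend on the scikit-learn version.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

predictions = {}
matrices = {}
for name, depth in [("unpruned", None), ("max_depth=3", 3)]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    predictions[name] = model.predict(X_test)
    # Rows are true classes, columns are predicted classes
    matrices[name] = confusion_matrix(y_test, predictions[name])
    print(name)
    print(matrices[name])

# Test-set positions where the two trees give different labels
differing = np.flatnonzero(predictions["unpruned"] != predictions["max_depth=3"])
for i in differing:
    print(
        f"index {i}: true={y_test[i]}, "
        f"unpruned={predictions['unpruned'][i]}, "
        f"pruned={predictions['max_depth=3'][i]}"
    )
```

Off-diagonal entries show exactly which classes are confused with each other, which the summary percentages alone do not reveal.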

- Accuracy: Both models reach 94.44% test accuracy, so limiting the depth to 3 costs nothing on this split.
- Precision and Recall: Recall is identical across the two models. Precision differs only for class 0 (93% unpruned, 100% pruned) and class 1 (93% unpruned, 88% pruned), which reflects a single test sample being classified differently.
- Overfitting: Unpruned Decision Trees commonly overfit, but the test results here do not demonstrate it; the unpruned and pruned trees perform equally well on the held-out data.
- Generalization: The pruned tree is simpler and matches the unpruned tree's accuracy, which makes it the preferable model here, although a 36-sample test set is too small to separate the two reliably.
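
The models above are pruned by capping `max_depth`, which stops the tree from growing (pre-pruning). scikit-learn also supports post-pruning through the `ccp_alpha` parameter (minimal cost-complexity pruning), which grows the full tree and then removes the weakest branches. The sketch below is one possible way to pick `ccp_alpha` by 5-fold cross-validation on the training split; it is not the method used in the notebook, and the choice of 5 folds and the variable names are assumptions.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Candidate alphas: the effective alphas at which the full tree loses a subtree
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(
    X_train, y_train
)
alphas = path.ccp_alphas
if len(alphas) > 1:
    alphas = alphas[:-1]  # the largest alpha prunes the tree down to its root
alphas = np.clip(alphas, 0.0, None)  # guard against tiny negative rounding errors

# Mean cross-validated accuracy on the training split for each candidate alpha
cv_scores = [
    cross_val_score(
        DecisionTreeClassifier(ccp_alpha=a, random_state=42),
        X_train, y_train, cv=5,
    ).mean()
    for a in alphas
]
best_alpha = float(alphas[int(np.argmax(cv_scores))])

full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
pruned_tree = DecisionTreeClassifier(
    ccp_alpha=best_alpha, random_state=42
).fit(X_train, y_train)

print(f"Best ccp_alpha: {best_alpha:.5f}")
print(f"Nodes: full={full_tree.tree_.node_count}, pruned={pruned_tree.tree_.node_count}")
print(f"Test accuracy: full={full_tree.score(X_test, y_test):.4f}, "
      f"pruned={pruned_tree.score(X_test, y_test):.4f}")
```

Selecting `ccp_alpha` on the training split alone keeps the test set untouched, so the final test accuracy remains an honest estimate.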