### Chatbot Session:

https://chatgpt.com/share/673d374e-74f8-800d-95f1-c425a7d29559

In [None]:
"""Question 1"""

a) Type of Problem and Real-World Applications

A Classification Decision Tree solves classification problems, where the goal is to predict a category or class (e.g., "Yes" or "No"). It splits data into smaller groups based on decision rules, which are organized in a tree-like structure. This method is useful in real-world applications like diagnosing diseases (e.g., "Diabetic" or "Non-Diabetic"), detecting fraud (e.g., "Fraud" or "No Fraud"), customer segmentation (e.g., frequent vs. occasional shoppers), and making product recommendations.

b) Difference in Predictions: Decision Tree vs. Linear Regression

A Classification Decision Tree predicts a class by repeatedly splitting the data into groups using decision rules, continuing until most observations in each group belong to one class (or a stopping rule such as a maximum depth is reached). It outputs class labels (e.g., "Spam") or class probabilities (e.g., 70% Spam, 30% Not Spam). In contrast, Multiple Linear Regression predicts a continuous number (e.g., house price) using a linear formula based on the input features. Classification decision trees are therefore used for classification tasks, while linear regression is used for regression problems.
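
To make the contrast concrete, here is a minimal sketch in plain Python. The split threshold, the leaf proportions, and the regression coefficients are all hypothetical values chosen for illustration; none of them come from a fitted model.

```python
# Hypothetical one-split classification tree: each leaf stores the class
# proportions of the training rows that landed in it (made-up numbers).
LEAVES = {
    "above": {"Spam": 0.7, "Not Spam": 0.3},
    "below": {"Spam": 0.1, "Not Spam": 0.9},
}


def tree_predict(num_links, threshold=5):
    """Route the input to a leaf, then return (label, class proportions)."""
    leaf = LEAVES["above"] if num_links > threshold else LEAVES["below"]
    label = max(leaf, key=leaf.get)  # majority class in that leaf
    return label, leaf


def linear_predict(sqft, bedrooms, intercept=50_000.0, b_sqft=120.0, b_bed=10_000.0):
    """Multiple linear regression: a weighted sum of features gives a number."""
    return intercept + b_sqft * sqft + b_bed * bedrooms


label, proportions = tree_predict(num_links=8)
price = linear_predict(sqft=1_000, bedrooms=2)
print(label, proportions)  # a class label plus class proportions
print(price)               # a continuous value
```

The tree returns a category (with proportions that can be read as probabilities), while the linear formula returns a number on a continuous scale.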

In [None]:
"""Question 2"""

Accuracy

- What it measures: The overall correctness of a model, showing how many predictions (both positive and negative) were correct out of all predictions.
- When to use: Best for balanced datasets where false positives and false negatives are equally important.
- Example: Identifying defective products in a factory.


Sensitivity (True Positive Rate)

- What it measures: How good the model is at identifying positive cases (catching all the "yes" cases).
- When to use: Critical when missing a positive case is costly, like in disease detection.
- Example: Diagnosing cancer to catch all patients with the disease.

Specificity (True Negative Rate)

- What it measures: How good the model is at identifying negative cases (catching all the "no" cases).
- When to use: Important when false positives have serious consequences, like false accusations.
- Example: Ensuring innocent people aren’t flagged as suspects in a crime.

Precision (Positive Predictive Value)

- What it measures: How accurate the positive predictions are (how many of the "yes" predictions were actually correct).
- When to use: Important when false positives are costly, like a legitimate email being sent to the spam folder.
- Example: Making sure flagged emails are truly spam.
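
The four metrics above can all be computed from the same confusion-matrix counts. The sketch below uses a small made-up set of labels (1 = positive, 0 = negative) purely for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def metrics(tp, fp, tn, fn):
    """Return accuracy, sensitivity, specificity, and precision."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "precision": tp / (tp + fp),    # positive predictive value
    }


# Made-up labels for illustration only
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

counts = confusion_counts(y_true, y_pred)
m = metrics(*counts)
print(counts)
print(m)
```

Sensitivity and precision share the same numerator (TP) but divide by different totals: actual positives for sensitivity, predicted positives for precision.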

In [1]:
"""Question 3"""

In [9]:
import pandas as pd
import numpy as np
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import accuracy_score, recall_score, make_scorer
import graphviz as gv

url = "https://raw.githubusercontent.com/pointOfive/STA130_F23/main/Data/amazonbooks.csv"
ab = pd.read_csv(url, encoding="ISO-8859-1")
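# Create `ab_reduced_noNaN` following the preprocessing steps listed below
# (a sketch; column names are taken from the description in this notebook)
ab_reduced = ab.drop(columns=["Weight_oz", "Width", "Height"])
ab_reduced_noNaN = ab_reduced.dropna().copy()
ab_reduced_noNaN["Pub year"] = ab_reduced_noNaN["Pub year"].astype(int)
ab_reduced_noNaN["NumPages"] = ab_reduced_noNaN["NumPages"].astype(int)
ab_reduced_noNaN["Hard_or_Paper"] = ab_reduced_noNaN["Hard_or_Paper"].astype("category")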

## Amazon Books Dataset: Preprocessed and Analyzed

### **Preprocessing Steps**
1. Removed irrelevant columns: `Weight_oz`, `Width`, and `Height`.
2. Dropped rows with missing values (`NaN` entries).
3. Converted:
   - `Pub year` and `NumPages` to integer.
   - `Hard_or_Paper` to categorical.

### **Dataset Overview**
- **Total Entries**: 319
- **Columns**: 10 (e.g., Title, Author, Prices, Pages, Year, etc.)
- **Data Types**: Mixed (numerical, categorical, and textual)

### **Key Findings**

#### 1. **Pricing**
- **Average List Price**: \$18.36 (Range: \$1.50–\$139.95)
- **Average Amazon Price**: \$12.94 (Range: \$0.77–\$139.95)

#### 2. **Physical Attributes**
- **Number of Pages**: 
  - Median: 320 pages
  - Range: 24–896 pages
- **Thickness**:
  - Median: 0.9 inches
  - Range: 0.1–2.1 inches

#### 3. **Categorical Breakdown**
- **Format (`Hard_or_Paper`)**: 
  - Paperback: 73% (233 books)
  - Hardcover: 27% (86 books)

#### 4. **Publication Year**
- Most books were published between 2000–2011.
- **Peaks**: 
  - 2010: 39 books
  - 2011: 52 books

In [None]:
"""Question 4"""

"The training set contains 255 observations, while the testing set contains 64 observations."

In [None]:
from sklearn.model_selection import train_test_split

# Splitting the data into training (80%) and testing (20%) sets
ab_reduced_noNaN_train, ab_reduced_noNaN_test = train_test_split(
    ab_reduced_noNaN, test_size=0.2, random_state=42
)

# Reporting the sizes of each set
train_count = len(ab_reduced_noNaN_train)
test_count = len(ab_reduced_noNaN_test)

print(f"Training set size: {train_count}")
print(f"Testing set size: {test_count}")
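
The reported sizes can be checked by hand. The sketch below assumes that, when only a fractional `test_size` is given, `train_test_split` rounds the test count up and assigns the remaining rows to training. Under that assumption, 319 rows give the 255/64 split stated above.

```python
import math


def split_sizes(n_samples, test_size):
    """Assumed rounding rule: test count rounded up, the rest go to training."""
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


n_train, n_test = split_sizes(319, 0.2)
print(n_train, n_test)
```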

The target variable (y) is Hard_or_Paper, converted to binary (1 for hardcover, 0 for paperback).

The feature variable (X) is List Price.

In [None]:
# Preparing the data
y = pd.get_dummies(ab_reduced_noNaN_train["Hard_or_Paper"])['H']  # Target variable
X = ab_reduced_noNaN_train[['List Price']]  # Feature variable

"A decision tree classifier was trained to predict the book type based on the List Price variable. The model uses a maximum depth of 2 to limit complexity and reduce the risk of overfitting."

In [None]:
from sklearn.tree import DecisionTreeClassifier

# Training the decision tree model
clf = DecisionTreeClassifier(max_depth=2, random_state=42)
clf.fit(X, y)

"The decision tree splits the List Price variable into intervals to predict whether a book is hardcover or paperback. Each node shows the split condition, the predicted class, and the number of samples in the node."

In [None]:
from sklearn import tree
import matplotlib.pyplot as plt

# Visualizing the decision tree
plt.figure(figsize=(10, 6))
tree.plot_tree(clf, feature_names=['List Price'], class_names=['Paper', 'Hard'], filled=True)
plt.show()

"The decision tree creates thresholds for List Price to classify books. For example, if the List Price is below X, the book is classified as paperback."

"This analysis demonstrates how a decision tree can use a single numerical feature to make predictions and highlights the splits used to classify books into hardcover or paperback categories."

## Summary 

1. **Splitting the Dataset**:  
   - The dataset was split into 80% training (255 observations) and 20% testing (64 observations) using `train_test_split`.

2. **Preparing the Data**:  
   - The target variable (`Hard_or_Paper`) was converted into binary (`1` for hardcover, `0` for paperback).  
   - The feature used for prediction is `List Price`.

3. **Training the Model**:  
   - A decision tree classifier was trained with a maximum depth of 2 using only the `List Price` variable to predict the book type.

4. **Visualizing the Tree**:  
   - The decision tree shows thresholds for `List Price` that split the data into hardcover or paperback categories.  
   - Example: If the `List Price` is below \$15.50, the book is classified as paperback. If it's above \$30, the book is classified as hardcover.

This analysis demonstrates how the `List Price` variable can effectively predict whether a book is hardcover or paperback using a simple decision tree model.
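
For context on where such thresholds come from: a classification tree typically chooses each split by trying candidate cut points and keeping the one that leaves the purest groups. The following is a simplified sketch of that idea using Gini impurity on a single feature. It is not scikit-learn's implementation, and the prices and labels are made up for illustration.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels: 1 minus the sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2


def best_threshold(prices, labels):
    """Try midpoints between sorted unique prices; keep the lowest weighted Gini."""
    values = sorted(set(prices))
    best_thr, best_score = None, float("inf")
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2
        left = [y for x, y in zip(prices, labels) if x <= thr]
        right = [y for x, y in zip(prices, labels) if x > thr]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_thr, best_score = thr, score
    return best_thr, best_score


# Made-up data: 1 = hardcover, 0 = paperback
prices = [8.0, 10.0, 12.0, 15.0, 25.0, 28.0, 30.0, 35.0]
is_hard = [0, 0, 0, 0, 1, 0, 1, 1]

thr, score = best_threshold(prices, is_hard)
print(thr, score)
```

A deeper tree repeats the same search inside each resulting group, which is how a depth-2 tree ends up with more than one price threshold.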

In [None]:
"""Question 5"""

The decision tree uses the features NumPages, Thick, and List Price to classify books as either hardcover or paperback, splitting the data at thresholds on these features. Starting at the root, the tree tests one feature at a time and follows the matching branch until it reaches a leaf node, where a prediction is made. For example, books with a higher List Price or greater thickness are more likely to be classified as hardcover, while books with fewer pages or lower prices tend to be classified as paperback. The maximum depth of 4 lets the tree capture more detailed patterns than the depth-2 model while still limiting complexity, which reduces (but does not eliminate) the risk of overfitting.
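
The root-to-leaf walk described above can be sketched as a nested dictionary plus a short loop. The tree structure and every threshold below are hypothetical and only illustrate the mechanics; they are not read from the fitted model.

```python
# Hypothetical tree: internal nodes test one feature against a threshold,
# leaves hold the predicted class. All thresholds are illustrative.
TREE = {
    "feature": "List Price", "threshold": 18.0,
    "left": {
        "feature": "Thick", "threshold": 1.5,
        "left": {"label": "Paper"},
        "right": {"label": "Hard"},
    },
    "right": {
        "feature": "NumPages", "threshold": 150,
        "left": {"label": "Paper"},
        "right": {"label": "Hard"},
    },
}


def predict(node, book):
    """Walk from the root to a leaf, going left when value <= threshold."""
    while "label" not in node:
        go_left = book[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]


pricey = {"List Price": 27.95, "NumPages": 320, "Thick": 1.2}
cheap = {"List Price": 9.99, "NumPages": 250, "Thick": 0.7}
print(predict(TREE, pricey))
print(predict(TREE, cheap))
```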

In [None]:
"""Question 6"""

## Calculations for `clf` and `clf2`

### **Model `clf` (Using `List Price` Only)**
- **Sensitivity (True Positive Rate)**: 0.700  
  - Proportion of hardcover books correctly predicted as hardcover.  
  - Formula: \( \frac{\text{TP}}{\text{TP} + \text{FN}} \)  
- **Specificity (True Negative Rate)**: 0.909  
  - Proportion of paperback books correctly predicted as paperback.  
  - Formula: \( \frac{\text{TN}}{\text{TN} + \text{FP}} \)  
- **Accuracy**: 0.844  
  - Overall proportion of correct predictions.  
  - Formula: \( \frac{\text{TP} + \text{TN}}{\text{Total Observations}} \)  

---

### **Model `clf2` (Using `NumPages`, `Thick`, and `List Price`)**
- **Sensitivity (True Positive Rate)**: 0.750  
  - Proportion of hardcover books correctly predicted as hardcover.  
- **Specificity (True Negative Rate)**: 0.909  
  - Proportion of paperback books correctly predicted as paperback.  
- **Accuracy**: 0.859  
  - Overall proportion of correct predictions.  

---

### **Interpretation**
- **`clf2` performs better** than `clf` due to higher sensitivity and accuracy, likely because it uses multiple features (`NumPages`, `Thick`, and `List Price`) to make predictions.  
- Both models have the **same specificity** (0.909), meaning they are equally effective at identifying paperback books.
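
The reported values follow directly from the formulas above. The notebook does not show the underlying confusion matrices, so the counts below are hypothetical: they are one set of counts for a 64-book test set (assumed here to contain 20 hardcover and 44 paperback books) that reproduces the reported rates after rounding.

```python
def rates(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy), each rounded to 3 decimals."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return tuple(round(v, 3) for v in (sensitivity, specificity, accuracy))


# Hypothetical counts, chosen only to be consistent with the reported metrics
clf_rates = rates(tp=14, fn=6, tn=40, fp=4)
clf2_rates = rates(tp=15, fn=5, tn=40, fp=4)
print(clf_rates)
print(clf2_rates)
```

Under these assumed counts, the two models differ by a single hardcover book classified correctly, which is worth keeping in mind when judging how large the improvement really is.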


In [None]:
"""Question 7"""

The differences between the two confusion matrices come from the features used by the models. The first matrix is based on a model (clf) that uses only List Price, which limits how well it can separate the two classes. The second matrix is from a model (clf2) that uses more features (NumPages, Thick, and List Price), so it can draw on more information when making predictions. The confusion matrices computed on the test dataset for clf and clf2 are better because they show how the models perform on new, unseen data, which is a more reliable measure of accuracy than performance on the data used to fit the model.
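
A deliberately extreme sketch shows why performance on training data can mislead. The "model" below simply memorizes the training rows and falls back to the majority class for anything it has not seen. It is a hypothetical stand-in for an overfit model, and the data are made up.

```python
from collections import Counter


def fit_memorizer(X, y):
    """'Model' that memorizes training rows and falls back to the majority class."""
    table = dict(zip(X, y))
    majority = Counter(y).most_common(1)[0][0]
    return lambda x: table.get(x, majority)


def accuracy(model, X, y):
    """Share of rows where the model's prediction matches the true label."""
    return sum(model(x) == t for x, t in zip(X, y)) / len(y)


# Made-up list prices and formats
X_train = [9.99, 12.50, 14.00, 24.95, 29.99]
y_train = ["Paper", "Paper", "Paper", "Hard", "Hard"]
X_test = [11.00, 26.00, 13.95, 35.00]
y_test = ["Paper", "Hard", "Paper", "Hard"]

model = fit_memorizer(X_train, y_train)
train_acc = accuracy(model, X_train, y_train)
test_acc = accuracy(model, X_test, y_test)
print(train_acc, test_acc)
```

The training score is perfect because the model has already seen those rows; only the test score says anything about new books.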

In [None]:
"""Question 8"""

Feature Importances for clf2

The feature importances for the decision tree model (clf2) were calculated to understand which predictors contribute the most to the model's explanatory power. The most important predictor is List Price, with an importance score of 0.486, followed by Thick (0.297) and NumPages (0.217). These scores reflect the relative contributions of each feature to improving the model's predictions at various decision nodes. A visualization was created to illustrate these relative contributions clearly.
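
As a rough sketch of where such scores come from: impurity-based importance is commonly described as each feature's total weighted impurity decrease over the splits that use it, normalized so the scores sum to 1. The node statistics below are hypothetical and are not taken from `clf2`.

```python
def importances(splits):
    """Normalized impurity decrease per feature.

    Each split is (feature, n_node, impurity, n_left, imp_left, n_right, imp_right),
    and the first entry is assumed to be the root node.
    """
    n_total = splits[0][1]
    raw = {}
    for feat, n, imp, n_l, imp_l, n_r, imp_r in splits:
        decrease = (n * imp - n_l * imp_l - n_r * imp_r) / n_total
        raw[feat] = raw.get(feat, 0.0) + decrease
    total = sum(raw.values())
    return {feat: value / total for feat, value in raw.items()}


# Hypothetical node statistics for a small three-split tree
splits = [
    ("List Price", 100, 0.50, 60, 0.30, 40, 0.20),
    ("Thick", 60, 0.30, 30, 0.10, 30, 0.20),
    ("NumPages", 40, 0.20, 20, 0.05, 20, 0.15),
]

imp = importances(splits)
print(imp)
```

Because the scores are normalized, they describe relative contributions within this one tree; they do not say in which direction a feature pushes the prediction.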



In [None]:
"""Question 9"""

In linear regression, interpreting coefficients is straightforward: each coefficient represents the change in the predicted value of the target variable for a one-unit increase in the corresponding predictor, holding all other predictors constant. In contrast, feature importances in decision trees represent the relative contribution of each predictor to improving the model’s performance (e.g., reducing Gini impurity or entropy) across all decision splits in the tree. While linear regression provides direct, additive relationships, decision trees capture complex, non-linear interactions, making feature importance a heuristic summary rather than a precise, directly interpretable effect.
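
The coefficient interpretation can be verified numerically. In the sketch below the coefficients are hypothetical; the point is only that raising one predictor by one unit, with the other held fixed, changes the prediction by exactly that predictor's coefficient, whatever the starting values are.

```python
def predict(x1, x2, b0=2.0, b1=3.0, b2=-1.5):
    """Linear model with hypothetical coefficients b0, b1, b2."""
    return b0 + b1 * x1 + b2 * x2


base = predict(x1=4.0, x2=10.0)
bumped = predict(x1=5.0, x2=10.0)  # x1 raised by one unit, x2 held constant
change = bumped - base

# Same one-unit increase from a different starting point
change_elsewhere = predict(x1=8.0, x2=2.0) - predict(x1=7.0, x2=2.0)
print(change, change_elsewhere)
```

A tree's feature importance has no comparable reading: it carries no sign and no per-unit effect, only a share of the total impurity reduction.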
