#Question 1:

A classification decision tree is a machine learning model for classification problems: it assigns each observation to a category through a series of feature-based branching decisions. Typical applications include spam filtering, disease diagnosis, and customer churn prediction. Multiple linear regression, in contrast, is a statistical model for regression problems: it predicts a continuous value, such as a house price or a temperature.
A classification decision tree makes step-by-step decisions along a tree structure and outputs a discrete category label, whereas multiple linear regression produces a numerical result by computing a linearly weighted sum of the features, which suits scenarios where a continuous variable needs to be predicted.
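
The contrast between the two kinds of output can be sketched in a few lines. This is a minimal illustration only: the function names, the thresholds, and the regression weights are hypothetical and are not fitted to any data.

```python
def tree_classify(list_price, num_pages):
    """Hypothetical hand-written tree: each branch tests one feature against a threshold."""
    if list_price <= 20.0:
        if num_pages <= 300:
            return "paperback"
        return "hardcover"
    return "hardcover"


def linear_predict(features, weights, intercept):
    """Multiple linear regression prediction: intercept plus a weighted sum of features."""
    return intercept + sum(w * x for w, x in zip(weights, features))


label = tree_classify(list_price=15.0, num_pages=250)
price = linear_predict(features=[250, 1.0], weights=[0.02, 3.0], intercept=5.0)
print(label)  # a discrete category label
print(price)  # a continuous number
```

The tree can only return one of a fixed set of labels, while the linear model can return any real number.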

#Question 2:

First scenario (accuracy): tasks where both kinds of error matter about equally, e.g., a weather model predicting sunny or rainy days, where the user wants the model to be correct as often as possible overall. Since the cost of an error is similar for the positive and negative classes, it is reasonable to focus on the overall correctness of the predictions.

Second scenario (sensitivity): disease screening, where the goal is to identify as many diseased patients as possible. Because an undetected sick person (a false negative) can lead to serious consequences, high sensitivity is needed to reduce underdiagnosis.

Third scenario (specificity): cases where wrongly assigning the positive class is especially undesirable, such as courtroom evidence analysis to determine whether a suspect is guilty. Because a false positive may lead to the conviction of an innocent person, high specificity is needed so that true negatives are recognized as negative.

Fourth scenario (precision): spam classification, where the emails labeled as spam should be actual spam as often as possible. Since a false positive (a legitimate email mislabeled as spam) may cause users to miss important messages, high precision is needed so that a positive label can be trusted.
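
The metrics these scenarios appeal to (overall accuracy, sensitivity, specificity, and, for the spam case, precision) can all be computed from the four confusion-matrix counts. A minimal sketch, assuming hypothetical counts for a spam filter; the function name and the numbers are illustrative only.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, sensitivity, specificity, and precision from raw counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,                          # all correct / all cases
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,    # actual positives found
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,    # actual negatives found
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,      # predicted positives that are right
    }


# Hypothetical counts: 40 spam caught, 50 legitimate kept, 5 false alarms, 5 spam missed
metrics = classification_metrics(tp=40, tn=50, fp=5, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Sensitivity and specificity are each computed within one actual class, whereas precision is computed within the predicted positive class, which is why it is the natural measure when a positive label has to be trusted.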

#Question 3:

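import pandas as pd

# Load the Amazon Books dataset from a local path
file_path = "path/to/amazonbooks.csv"
ab = pd.read_csv(file_path, encoding="ISO-8859-1")

# Drop columns that are not needed for the analysis
ab = ab.drop(columns=["Weight_oz", "Width", "Height"])

# Remove rows with missing values
ab = ab.dropna()

# Set column types.
# Assumption: the CSV names the year column "Pub year" (with a space, like "List Price");
# ab["Pub_year"] raises KeyError: 'Pub_year' on this dataset.
ab["Pub year"] = ab["Pub year"].astype(int)
ab["NumPages"] = ab["NumPages"].astype(int)
ab["Hard_or_Paper"] = ab["Hard_or_Paper"].astype("category")

# Inspect the result (info() prints its summary directly)
ab.info()
print(ab.head())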


###Explanation:

Load the dataset:
Use pd.read_csv to load the Amazon Books dataset from a local path.

Remove unneeded columns:
Drop the Weight_oz, Width, and Height columns, which are not needed for the analysis.

Handle missing values:
Remove rows containing missing values (NaN) so that the models receive complete observations.

Set data types:
Convert the publication-year and NumPages columns to integer type (int) so that they are stored as whole numbers.
Convert the Hard_or_Paper column to a category type (category) to save memory and prepare for subsequent analysis.

Display an overview of the data:
Use info() to view data structure and type information.
Use head() to view the first few rows of data and verify that the preprocessing is correct.

#Question 4:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

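# Target: True for hardcover ("H"), False for paperback ("P").
# drop_first=True would drop the "H" column, so it is omitted here.
y = pd.get_dummies(ab["Hard_or_Paper"])["H"]

# Single predictor: list price
X = ab[["List Price"]]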

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = DecisionTreeClassifier(max_depth=2, random_state=42)
clf.fit(X_train, y_train)

plt.figure(figsize=(10, 6))
plot_tree(clf, feature_names=["List Price"], class_names=["Not Hard Cover", "Hard Cover"], filled=True)
plt.show()

print("Training set size:", len(X_train))
print("Testing set size:", len(X_test))


###Explanation:

Data splitting:
Use train_test_split to split the data into a training set (80%) and a test set (20%), and set random seeds to ensure repeatable results.

Classification decision tree model training:
The target variable is whether it is a hard cover book or not.
The feature variable is the price of the book.
Decision tree depth is set to 2 to limit model complexity.

Visualize the decision tree:
Use plot_tree to display the structure of the classification decision tree, showing how to predict whether a book is hard cover or not based on price.

Training and test set size:
Prints the number of observations in the training and test sets, ensuring that the splits conform to the 80/20 ratio.

#Question 5:
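from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

# X_train from Question 4 holds only "List Price", so rebuild the feature
# matrix from ab and split again with the same seed.
X = ab[["NumPages", "Thick", "List Price"]]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)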

clf2 = DecisionTreeClassifier(max_depth=4, random_state=42)
clf2.fit(X_train, y_train)

plt.figure(figsize=(12, 8))
plot_tree(clf2, feature_names=["NumPages", "Thick", "List Price"], 
          class_names=["Not Hard Cover", "Hard Cover"], filled=True)
plt.show()

print("Training set size:", len(X_train))
print("Testing set size:", len(X_test))

###Explanation:

How the model makes predictions:
The decision tree progressively splits the dataset based on the values of three features: NumPages, Thick, and List Price.

Each split is a binary test that compares one feature against a threshold:
If the condition is met, the observation goes into the left subtree.
Otherwise, it goes into the right subtree.
Ultimately, each leaf node is assigned a category (e.g., whether the book is hard cover).

What limiting the depth means:
Setting max_depth=4 limits the tree to at most four levels of splits, which reduces the risk of overfitting.
This helps keep the model interpretable and more likely to generalize.

Predictive Output:
The decision tree output is a category label (e.g., “hard cover” or “non-hard cover”).
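
The traversal described above can be sketched without any library. This is a minimal illustration: the tree below is written by hand, and its thresholds and labels are hypothetical rather than the ones a fitted model would learn.

```python
# Hypothetical depth-2 tree; thresholds and leaf labels are made up for illustration.
tree = {
    "feature": "List Price", "threshold": 18.0,
    "left": {"feature": "NumPages", "threshold": 200,
             "left": {"label": "paperback"},
             "right": {"label": "hardcover"}},
    "right": {"feature": "Thick", "threshold": 0.8,
              "left": {"label": "paperback"},
              "right": {"label": "hardcover"}},
}


def predict(node, book):
    """Walk from the root to a leaf: go left when feature <= threshold, else right."""
    while "label" not in node:
        branch = "left" if book[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]


book = {"List Price": 25.0, "NumPages": 320, "Thick": 1.1}
print(predict(tree, book))
```

Each prediction follows exactly one root-to-leaf path, so a depth limit of 4 means at most four threshold tests per book.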

#Question 6:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score
import matplotlib.pyplot as plt

# Load and preprocess the data as in Question 3
url = "https://raw.githubusercontent.com/pointofive/STA130_F23/main/Data/amazonbooks.csv"
ab = pd.read_csv(url, encoding="ISO-8859-1")
ab = ab.drop(columns=["Weight_oz", "Width", "Height"]).dropna()
# Assumption: the CSV names the year column "Pub year" (with a space, like "List Price");
# ab["Pub_year"] raised KeyError: 'Pub_year' when this cell was run.
ab["Pub year"] = ab["Pub year"].astype(int)
ab["NumPages"] = ab["NumPages"].astype(int)
ab["Hard_or_Paper"] = ab["Hard_or_Paper"].astype("category")

X = ab[["NumPages", "Thick", "List Price"]]
# drop_first=True would drop the "H" column, so it is omitted here
y = pd.get_dummies(ab["Hard_or_Paper"])["H"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# clf: depth-2 tree on List Price only (as in Question 4)
clf = DecisionTreeClassifier(max_depth=2, random_state=42)
clf.fit(X_train[["List Price"]], y_train)

# clf2: depth-4 tree on all three features (as in Question 5)
clf2 = DecisionTreeClassifier(max_depth=4, random_state=42)
clf2.fit(X_train, y_train)

y_pred_clf = clf.predict(X_test[["List Price"]])
y_pred_clf2 = clf2.predict(X_test)

cm_clf = confusion_matrix(y_test, y_pred_clf)
cm_clf2 = confusion_matrix(y_test, y_pred_clf2)

def calculate_metrics(cm):
    # Rows are actual classes, columns are predicted classes; index 1 is hardcover
    TP = cm[1, 1]
    TN = cm[0, 0]
    FP = cm[0, 1]
    FN = cm[1, 0]

    sensitivity = TP / (TP + FN) if (TP + FN) > 0 else 0
    specificity = TN / (TN + FP) if (TN + FP) > 0 else 0
    accuracy = (TP + TN) / (TP + TN + FP + FN)

    return sensitivity, specificity, accuracy

sensitivity_clf, specificity_clf, accuracy_clf = calculate_metrics(cm_clf)
sensitivity_clf2, specificity_clf2, accuracy_clf2 = calculate_metrics(cm_clf2)

print("Confusion Matrix for clf:")
print(cm_clf)
print(f"Sensitivity: {sensitivity_clf:.3f}, Specificity: {specificity_clf:.3f}, Accuracy: {accuracy_clf:.3f}\n")

print("Confusion Matrix for clf2:")
print(cm_clf2)
print(f"Sensitivity: {sensitivity_clf2:.3f}, Specificity: {specificity_clf2:.3f}, Accuracy: {accuracy_clf2:.3f}")

What This Code Does

Loads and Prepares Data:
Loads the Amazon Books dataset into ab and applies the same preprocessing as in Question 3.

Trains Two Models:
clf: A DecisionTreeClassifier with max_depth=2.
clf2: A DecisionTreeClassifier with max_depth=4.

Generates Predictions and Confusion Matrices:
Predictions for the test set are made using both models.
Confusion matrices are generated for both models.

Calculates Metrics:
Sensitivity, Specificity, and Accuracy are computed for both models.

Displays Results:
Prints the confusion matrices and metrics in a clear format.

#Question 7:

The differences between the two confusion matrices stem mainly from the predictor variables each model uses and from the complexity of the trees. The first model (clf) uses a single predictor, List Price, which carries too little information to separate hard cover from soft cover books reliably, so its classification accuracy is lower. The second model (clf2) uses three predictors, NumPages, Thick, and List Price, which provide richer information and let the model capture more of the relationship between the features and the class. In addition, clf is limited to a depth of 2, which can oversimplify the model and lead to underfitting, while clf2 has a depth of 4, which allows a finer partition of the data. As a result, the confusion matrix of clf2 is better, showing higher sensitivity, specificity, and overall accuracy.

#Question 8:

The feature importances reflect how much each predictor contributes to the splitting rules in the decision tree.
This allows us to understand which variables the model relies on most to make predictions and can guide future data collection or feature engineering efforts.
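
As a minimal sketch of how these values can be read off a fitted scikit-learn tree, the example below uses a small synthetic dataset rather than the Amazon Books data. The feature names "signal" and "noise" are hypothetical, and the label is constructed to depend only on the first feature.

```python
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration): the label depends only on
# the first feature; the second feature is unrelated to it.
X = [[i, (i * 7) % 5] for i in range(20)]
y = [int(i >= 10) for i in range(20)]

model = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# feature_importances_ holds one normalized value per feature; they sum to 1
importances = dict(zip(["signal", "noise"], model.feature_importances_))
for name, value in importances.items():
    print(f"{name}: {value:.3f}")
```

Because a single split on the first feature separates the classes perfectly, all of the importance goes to that feature and none to the unrelated one.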

#Question 9:

The coefficients of a linear regression model give the linear contribution of each predictor to the target: each coefficient is the change in the predicted target for a one-unit increase in that predictor, holding the other predictors fixed, so it conveys both direction and strength. In a decision tree, by contrast, the feature importance reflects the relative contribution of each variable to partitioning the data and reducing impurity (e.g., the Gini index or entropy). It carries no direction and implies no linear relationship, so it is better read as a measure of a variable's overall influence on the model. Linear regression coefficients are therefore more directly interpretable, whereas decision tree feature importances are better suited to summarizing complex nonlinear relationships.
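
The difference in direction can be seen on a toy example. This is a minimal sketch on made-up data in which the target falls by 2 for every unit increase in a single predictor; the variable names are illustrative.

```python
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Made-up data: y decreases by exactly 2 for each unit increase in x
X = [[x] for x in range(6)]
y = [10 - 2 * x for x in range(6)]

linear = LinearRegression().fit(X, y)
tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)

coef = linear.coef_[0]                     # signed: negative slope
importance = tree.feature_importances_[0]  # unsigned share of impurity reduction

print(f"coefficient: {coef:.2f}")
print(f"importance:  {importance:.2f}")
```

The coefficient reports that the relationship is negative and how steep it is, while the importance only reports that the tree relied on this feature for all of its splits.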

#Question 10:

Yes

###Chatbot Summary:

User: What is the difference between sensitivity and specificity in a classification model?

ChatBot: Sensitivity and specificity are metrics used to evaluate classification models:

Sensitivity (also called recall) measures the proportion of actual positives that are correctly identified. It is calculated as:

$$\text{Sensitivity} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Negatives (FN)}}$$

High sensitivity ensures that the model identifies most positive cases, which is important in situations like disease detection.

Specificity measures the proportion of actual negatives that are correctly identified. It is calculated as:

$$\text{Specificity} = \frac{\text{True Negatives (TN)}}{\text{True Negatives (TN)} + \text{False Positives (FP)}}$$

High specificity ensures that the model avoids falsely labeling negative cases as positive, which is important in scenarios like fraud detection.

User: How do feature importances in a decision tree differ from coefficients in linear regression?

ChatBot: Feature importances in a decision tree indicate the relative contribution of each predictor variable to the overall predictive power of the model, based on how much they reduce impurity (like Gini impurity or entropy) at each split. They do not provide directional information or an explicit mathematical relationship.

In contrast, linear regression coefficients quantify the linear relationship between each predictor and the target variable. Coefficients show both the magnitude and direction of the effect (positive or negative) that each predictor has on the outcome. Thus, linear regression provides more interpretable insights into variable effects, while feature importances are better for evaluating contributions in non-linear models.

User: Can you give me a real-world example where accuracy isn’t the best metric to evaluate a model?

ChatBot: Sure! Imagine a model for detecting rare diseases, where only 1% of the population has the disease. If the model always predicts "no disease," it will achieve 99% accuracy because most people are healthy. However, this model is useless because it fails to detect any positive cases. In such cases, metrics like sensitivity (to minimize false negatives) and specificity (to minimize false positives) are more appropriate to evaluate the model's performance.
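
The arithmetic behind this example can be worked through directly. A minimal sketch, assuming a hypothetical population of 1,000 people at 1% prevalence and a model that always predicts "no disease":

```python
population = 1000
diseased = 10                     # 1% prevalence
healthy = population - diseased   # 990

# A model that always predicts "no disease" never produces a positive
tp, fn = 0, diseased
tn, fp = healthy, 0

accuracy = (tp + tn) / population
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)

print(f"accuracy:    {accuracy:.2f}")     # 0.99
print(f"sensitivity: {sensitivity:.2f}")  # 0.00
print(f"specificity: {specificity:.2f}")  # 1.00
```

Accuracy and specificity both look excellent, yet the sensitivity of zero shows that not a single diseased person is detected.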

User: Can a classification decision tree be used for regression problems?

ChatBot: Yes, decision trees can also be used for regression problems. In this case, the model is called a DecisionTreeRegressor, and it predicts continuous target values instead of class labels. Instead of splitting nodes to minimize classification error (e.g., Gini impurity or entropy), it splits to minimize the variance or mean squared error (MSE) in the target variable within each split.
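
A minimal sketch of a regression tree in scikit-learn, using made-up data with two clearly separated groups; the numbers are illustrative only.

```python
from sklearn.tree import DecisionTreeRegressor

# Made-up data: small x values have targets near 1, large x values near 5
X = [[1], [2], [3], [10], [11], [12]]
y = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]

# A depth-1 tree makes a single split and predicts the mean target in each leaf
reg = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X, y)
predictions = reg.predict([[2], [11]])
print(predictions)
```

Each leaf predicts the average of the training targets that fall into it, so the output is a continuous value rather than a class label.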