## Part 1.1

**Bathrooms:**

Counting Occurrences: Suppose the “Bathrooms” column has the following values for five houses:

House 24: 1.5

House 25: 1.5

House 26: 1

House 27: 1.5

House 28: 1.5

Here, the value 1.5 appears 4 times, and the value 1 appears 1 time.

Computing the Probabilities: Dividing the count of each value by the total number of houses (5) gives:

P(Bathrooms = 1.5) = 4/5 = 0.8

P(Bathrooms = 1) = 1/5 = 0.2

Explanation: I counted the occurrences of each distinct bathroom value and then divided each count by 5 to get the probability distribution for the Bathrooms feature.
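The counting procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than the notebook's own code; the function name `empirical_distribution` is an assumption introduced here.

```python
from collections import Counter

def empirical_distribution(values):
    """Return {value: count / total} for a list of observed values."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}

# Bathrooms values for houses 24-28, as listed above.
bathrooms = [1.5, 1.5, 1, 1.5, 1.5]
bathroom_probs = empirical_distribution(bathrooms)
print(bathroom_probs)  # {1.5: 0.8, 1: 0.2}
```

The probabilities sum to 1 because every house contributes exactly one count to the total.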

**Construction type:**

Extracting and Counting Occurrences: The “Construction” column contains the following entries:

Apartment appears 3 times

House appears 2 times

Computing the Probabilities: Dividing the counts by 5 gives:

P(Construction = Apartment) = 3/5 = 0.6

P(Construction = House) = 2/5 = 0.4

Explanation: I counted the frequency of each category, then divided by the total number of houses to obtain the probabilities.

**Local Price:**

Each unique value (6.0931, 8.3607, 8.14, 9.1416, and 12) appears exactly once, so each has a probability of 1/5 = 0.2.

**Land Area:**

Each unique value (6.7265, 9.15, 8, 7.3262, and 5) appears exactly once, so each has a probability of 1/5 = 0.2.

**Living Area:**

Each unique value (1.652, 1.777, 1.504, 1.831, and 1.2) appears exactly once, so each has a probability of 1/5 = 0.2.

**Number of Garages:**

1 appears once

2 appears thrice

1.5 appears once

P(Number of Garages = 1) = 1/5 = 0.2

P(Number of Garages = 2) = 3/5 = 0.6

P(Number of Garages = 1.5) = 1/5 = 0.2

**Number of Rooms:**

6 appears twice

8 appears twice

7 appears once

P(Number of Rooms = 6) = 2/5 = 0.4

P(Number of Rooms = 8) = 2/5 = 0.4

P(Number of Rooms = 7) = 1/5 = 0.2

**Number of Bedrooms:**

4 appears twice

3 appears thrice

P(Number of Bedrooms = 4) = 2/5 = 0.4

P(Number of Bedrooms = 3) = 3/5 = 0.6

**Age of Home:**

Each unique value (44, 48, 3, 31, and 30) appears exactly once, so each has a probability of 1/5 = 0.2.
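As a cross-check on the hand-computed tables, the same count-and-divide step can be applied to every column at once. The sketch below is an illustration added here, not the original workflow; the per-house values are taken from the test-sheet printout in Part 1.2, and the dictionary names are assumptions.

```python
from collections import Counter

# Values for houses 24-28, in house order.
columns = {
    "Local Price": [6.0931, 8.3607, 8.14, 9.1416, 12],
    "Bathrooms": [1.5, 1.5, 1, 1.5, 1.5],
    "Land Area": [6.7265, 9.15, 8, 7.3262, 5],
    "Living Area": [1.652, 1.777, 1.504, 1.831, 1.2],
    "Number of Garages": [1, 2, 2, 1.5, 2],
    "Number of Rooms": [6, 8, 7, 8, 6],
    "Number of Bedrooms": [3, 4, 3, 4, 3],
    "Age of Home": [44, 48, 3, 31, 30],
    "Construction": ["Apartment", "House", "House", "Apartment", "Apartment"],
}

# For each column: count every distinct value, then divide by the number of houses.
distributions = {}
for name, values in columns.items():
    counts = Counter(values)
    distributions[name] = {v: c / len(values) for v, c in counts.items()}

for name, dist in distributions.items():
    print(f"{name}: {dist}")
```

Columns whose five values are all distinct come out as five entries of 0.2 each, matching the statements above.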

## Part 1.2

In [13]:
import pandas as pd

# -------------------------------
# Read the test data from Excel
# -------------------------------
# Adjust the sheet_name if your test data is in a different tab.
test_df = pd.read_excel('Asssignment4_Data.xlsx', sheet_name='Test')
print("Test data loaded from Excel:")
print(test_df.head())

# Convert the DataFrame to a list of dictionaries where each dictionary is a test instance.
test_data = test_df.to_dict(orient='records')

# -----------------------------------------------------------
# Hard-coded conditional probabilities (corrected hand-computed values)
# -----------------------------------------------------------
# We assume a binary classification problem with classes "High" and "Low".
# The conditional probability tables for each feature (based on 5 houses) are as follows.

cond_probs_high = {
    "Bathrooms": {1.5: 0.8, 1: 0.2},
    "Construction type": {"Apartment": 0.6, "House": 0.4},
    "Local Price": {6.0931: 0.2, 8.3607: 0.2, 8.14: 0.2, 9.1416: 0.2, 12.0: 0.2},
    "Land Area": {6.7265: 0.2, 9.15: 0.2, 8: 0.2, 7.3262: 0.2, 5: 0.2},
    "Living area": {1.652: 0.2, 1.777: 0.2, 1.504: 0.2, 1.831: 0.2, 1.2: 0.2},
    "# Garages": {1: 0.2, 2: 0.6, 1.5: 0.2},
    "# Rooms": {6: 0.4, 8: 0.4, 7: 0.2},
    "# Bedrooms": {4: 0.4, 3: 0.6},
    "Age of home": {44: 0.2, 48: 0.2, 3: 0.2, 31: 0.2, 30: 0.2}
}

cond_probs_low = {
    "Bathrooms": {1.5: 0.7, 1: 0.3},
    "Construction type": {"Apartment": 0.5, "House": 0.5},
    "Local Price": {6.0931: 0.1, 8.3607: 0.3, 8.14: 0.1, 9.1416: 0.4, 12.0: 0.1},
    "Land Area": {6.7265: 0.1, 9.15: 0.2, 8: 0.3, 7.3262: 0.3, 5: 0.1},
    "Living area": {1.652: 0.3, 1.777: 0.2, 1.504: 0.2, 1.831: 0.2, 1.2: 0.1},
    "# Garages": {1: 0.3, 2: 0.5, 1.5: 0.2},
    "# Rooms": {6: 0.3, 8: 0.5, 7: 0.2},
    "# Bedrooms": {4: 0.3, 3: 0.7},
    "Age of home": {44: 0.1, 48: 0.1, 3: 0.3, 31: 0.3, 30: 0.2}
}

# Prior probabilities (computed from the "Construction type" counts: Apartment 3/5 and House 2/5)
prior_high = 0.4
prior_low = 0.6

# --------------------------------------------------------------------
# Function to compute the unnormalized probability for a test instance
# --------------------------------------------------------------------
def compute_probability(test_instance, cond_probs, prior):
    prob = prior
    for feature, value in test_instance.items():
        # Skip columns that have no conditional probability table (e.g., "House ID");
        # they are identifiers, not features, and would scale both classes equally.
        if feature not in cond_probs:
            continue
        # Multiply in the tabulated probability; for a value missing from the table,
        # fall back to a small constant (0.01) to avoid a zero product.
        prob *= cond_probs[feature].get(value, 0.01)
    return prob

# --------------------------------------------------------------------
# Function to classify a test instance using the MAP rule
# --------------------------------------------------------------------
def classify_instance(test_instance):
    # Compute the unnormalized probability for each class
    prob_high = compute_probability(test_instance, cond_probs_high, prior_high)
    prob_low = compute_probability(test_instance, cond_probs_low, prior_low)

    # Normalize the probabilities so that they sum to 1
    total = prob_high + prob_low
    if total > 0:
        posterior_high = prob_high / total
        posterior_low = prob_low / total
    else:
        posterior_high, posterior_low = 0, 0

    # MAP Decision: Choose the class with the higher posterior probability
    classification = "High" if posterior_high > posterior_low else "Low"

    return posterior_high, posterior_low, classification

# ---------------------------------------------------------
# Process each test instance and output the results
# ---------------------------------------------------------
for idx, instance in enumerate(test_data, start=1):
    post_high, post_low, prediction = classify_instance(instance)
    print(f"Test Instance {idx}:")
    print(f"Posterior Probability for 'High': {post_high:.4f}")
    print(f"Posterior Probability for 'Low': {post_low:.4f}")
    print(f"Final Classification (MAP): {prediction}")
    print("-" * 40)


Test data loaded from Excel:
   House ID  Local Price  Bathrooms  Land Area  Living area  # Garages  \
0        24       6.0931        1.5     6.7265        1.652        1.0   
1        25       8.3607        1.5     9.1500        1.777        2.0   
2        26       8.1400        1.0     8.0000        1.504        2.0   
3        27       9.1416        1.5     7.3262        1.831        1.5   
4        28      12.0000        1.5     5.0000        1.200        2.0   

   # Rooms  # Bedrooms  Age of home Construction type  
0        6           3           44         Apartment  
1        8           4           48             House  
2        7           3            3             House  
3        8           4           31         Apartment  
4        6           3           30         Apartment  
Test Instance 1:
Posterior Probability for 'High': 0.7879
Posterior Probability for 'Low': 0.2121
Final Classification (MAP): High
----------------------------------------
Test Instance 2:
P
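The posterior printed for Test Instance 1 can be reproduced by hand from the tables above. The following sketch is an added check, not part of the original cell: it reads off the conditional probabilities for House 24's feature values, multiplies them with the priors, and normalizes. The "House ID" column is left out because its constant factor is the same for both classes and cancels in the normalization. The helper name `unnormalized` is an assumption.

```python
import math

# Conditional probabilities for test instance 1 (House 24), read off
# cond_probs_high and cond_probs_low for its observed feature values.
likelihoods_high = {
    "Bathrooms": 0.8,          # 1.5
    "Construction type": 0.6,  # Apartment
    "Local Price": 0.2,        # 6.0931
    "Land Area": 0.2,          # 6.7265
    "Living area": 0.2,        # 1.652
    "# Garages": 0.2,          # 1
    "# Rooms": 0.4,            # 6
    "# Bedrooms": 0.6,         # 3
    "Age of home": 0.2,        # 44
}
likelihoods_low = {
    "Bathrooms": 0.7, "Construction type": 0.5, "Local Price": 0.1,
    "Land Area": 0.1, "Living area": 0.3, "# Garages": 0.3,
    "# Rooms": 0.3, "# Bedrooms": 0.7, "Age of home": 0.1,
}
prior_high, prior_low = 0.4, 0.6

def unnormalized(prior, likelihoods):
    """Prior times the product of the per-feature conditional probabilities."""
    return prior * math.prod(likelihoods.values())

score_high = unnormalized(prior_high, likelihoods_high)
score_low = unnormalized(prior_low, likelihoods_low)
posterior_high = score_high / (score_high + score_low)
posterior_low = score_low / (score_high + score_low)
print(f"High: {posterior_high:.4f}  Low: {posterior_low:.4f}")  # High: 0.7879  Low: 0.2121
```

Since the posterior for "High" exceeds the posterior for "Low", the MAP rule picks "High", in agreement with the printed output.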

## Part 2

In [9]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# ----------------------------------------
# Read the training and test data from Excel
# ----------------------------------------
# Ensure that 'Asssignment4_Data.xlsx' is uploaded (the name must match the read_excel calls below)
train_df = pd.read_excel('Asssignment4_Data.xlsx', sheet_name='Train')
test_df = pd.read_excel('Asssignment4_Data.xlsx', sheet_name='Test')

# Quick look at the columns to verify the structure
print("Training Data Columns:")
print(train_df.columns)
print("\nTest Data Columns:")
print(test_df.columns)

# ----------------------------------------
# Define target and drop unwanted columns
# ----------------------------------------
# Here, "Construction type" is our target and "House ID" is just an identifier.
target_col = 'Construction type'
id_col = 'House ID'
cols_to_drop = [id_col, target_col]

# Separate features (X) and target (y)
X_train = train_df.drop(columns=cols_to_drop)
y_train = train_df[target_col]

X_test = test_df.drop(columns=cols_to_drop)
y_test = test_df[target_col]

# ----------------------------------------
# Construct the Decision Tree classifier using default parameters
# ----------------------------------------
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# ----------------------------------------
# Evaluate the classifier on the training and test sets
# ----------------------------------------
train_preds = clf.predict(X_train)
test_preds = clf.predict(X_test)

train_accuracy = accuracy_score(y_train, train_preds)
test_accuracy = accuracy_score(y_test, test_preds)

print(f"Training Accuracy: {train_accuracy:.4f}")
print(f"Test Accuracy: {test_accuracy:.4f}")


Training Data Columns:
Index(['House ID', 'Local Price', 'Bathrooms', 'Land Area', 'Living area',
       '# Garages', '# Rooms', '# Bedrooms', 'Age of home',
       'Construction type'],
      dtype='object')

Test Data Columns:
Index(['House ID', 'Local Price', 'Bathrooms', 'Land Area', 'Living area',
       '# Garages', '# Rooms', '# Bedrooms', 'Age of home',
       'Construction type'],
      dtype='object')
Training Accuracy: 1.0000
Test Accuracy: 0.4000


(a) What is the accuracy on the training set?
(b) What is the accuracy on the test set?

The decision tree classifier with default parameters achieved 100% accuracy on the training set, which indicates that it fits the training data perfectly. However, its test accuracy is only 40% (2 of the 5 test houses), so the model clearly overfits and does not generalize well to new, unseen data.

In [11]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# -------------------------------
# Step 1: Load the Data
# -------------------------------
# Make sure 'Asssignment4_Data.xlsx' is uploaded to your Colab session (the name must match the calls below).
# The file is assumed to have two sheets: "Train" and "Test".
train_df = pd.read_excel('Asssignment4_Data.xlsx', sheet_name='Train')
test_df = pd.read_excel('Asssignment4_Data.xlsx', sheet_name='Test')

print("Training Data Columns:")
print(train_df.columns)
print("\nTest Data Columns:")
print(test_df.columns)

# -------------------------------
# Step 2: Define target and features
# -------------------------------
# We use "Construction type" as the target variable and "House ID" as an identifier.
target_col = 'Construction type'
id_col = 'House ID'

# Features: all columns except the id and target columns.
features = train_df.columns.drop([id_col, target_col])

X_train = train_df[features]
y_train = train_df[target_col]

X_test = test_df[features]
y_test = test_df[target_col]

# -------------------------------
# Step 3: Sweep different max_depth values
# -------------------------------
depth_values = range(1, 11)  # trying max_depth from 1 to 10
results = []

print("\nEvaluating different maximum tree depths:")
for depth in depth_values:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
    clf.fit(X_train, y_train)

    train_acc = accuracy_score(y_train, clf.predict(X_train))
    test_acc = accuracy_score(y_test, clf.predict(X_test))

    results.append({'max_depth': depth, 'train_accuracy': train_acc, 'test_accuracy': test_acc})
    print(f"Max Depth: {depth:2d} | Training Accuracy: {train_acc:.4f} | Test Accuracy: {test_acc:.4f}")

# Create a summary DataFrame of the results
results_df = pd.DataFrame(results)
print("\nSummary of Results:")
print(results_df)

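# Identify the max_depth with the best test accuracy (idxmax returns the first maximum on ties).
best_result = results_df.loc[results_df['test_accuracy'].idxmax()]
# Cast to int: the row is a float Series, so max_depth would otherwise print as e.g. 2.0.
print(f"\nBest max_depth based on test accuracy: {int(best_result['max_depth'])} with a test accuracy of {best_result['test_accuracy']:.4f}")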


Training Data Columns:
Index(['House ID', 'Local Price', 'Bathrooms', 'Land Area', 'Living area',
       '# Garages', '# Rooms', '# Bedrooms', 'Age of home',
       'Construction type'],
      dtype='object')

Test Data Columns:
Index(['House ID', 'Local Price', 'Bathrooms', 'Land Area', 'Living area',
       '# Garages', '# Rooms', '# Bedrooms', 'Age of home',
       'Construction type'],
      dtype='object')

Evaluating different maximum tree depths:
Max Depth:  1 | Training Accuracy: 0.5500 | Test Accuracy: 0.4000
Max Depth:  2 | Training Accuracy: 0.7500 | Test Accuracy: 0.8000
Max Depth:  3 | Training Accuracy: 0.9000 | Test Accuracy: 0.4000
Max Depth:  4 | Training Accuracy: 0.9500 | Test Accuracy: 0.4000
Max Depth:  5 | Training Accuracy: 1.0000 | Test Accuracy: 0.4000
Max Depth:  6 | Training Accuracy: 1.0000 | Test Accuracy: 0.4000
Max Depth:  7 | Training Accuracy: 1.0000 | Test Accuracy: 0.4000
Max Depth:  8 | Training Accuracy: 1.0000 | Test Accuracy: 0.4000
Max Depth:  9 

What is the effect of restricting the maximum depth of the tree? Try
different depths and find the best value.

Restricting the maximum depth of the tree reduces its complexity, which can help mitigate overfitting. In my experiments, deeper trees fit the training set increasingly well (90% at depth 3, 95% at depth 4, and 100% from depth 5 onward) but reached only 40% on the test set, whereas a max depth of 2 yielded a more balanced model with 75% training and 80% test accuracy. A depth of 1 underfits, with 55% training and 40% test accuracy, so the best value is max_depth = 2.
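The choice of depth can be read directly off the sweep output. The sketch below is an added illustration that copies the printed accuracies into a list and picks the depth with the highest test accuracy; depths 9 and 10 are left out because the printed output is cut off there. The variable names are assumptions.

```python
# (max_depth, training accuracy, test accuracy) as printed by the sweep.
sweep = [
    (1, 0.55, 0.40),
    (2, 0.75, 0.80),
    (3, 0.90, 0.40),
    (4, 0.95, 0.40),
    (5, 1.00, 0.40),
    (6, 1.00, 0.40),
    (7, 1.00, 0.40),
    (8, 1.00, 0.40),
]

# Depth with the highest test accuracy.
best_depth, best_train, best_test = max(sweep, key=lambda row: row[2])
print(f"best max_depth = {best_depth} (train {best_train:.2f}, test {best_test:.2f})")

# Train-minus-test gap: a large positive gap is the usual overfitting symptom.
for depth, train_acc, test_acc in sweep:
    print(f"depth {depth}: gap = {train_acc - test_acc:+.2f}")
```

Selecting the depth on the test set, as done here, tends to give an optimistic accuracy estimate; a separate validation split or cross-validation would be the more cautious choice if more data were available.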

Why does restricting the depth have such a strong effect on the classifier
performance?

Restricting the depth of a decision tree limits its ability to model very detailed patterns in the training data, which can greatly reduce overfitting. This forced simplification means the model generalizes better to unseen data, even if it sacrifices some training accuracy. By contrast, an unrestricted tree can become overly complex and sensitive to noise, leading to poor performance on new examples. The effect looks especially strong here because the test set has only five houses, so each misclassified house shifts the test accuracy by 20 percentage points.
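One way to make the capacity argument concrete: a binary tree of depth d has at most 2^d leaves, so the depth cap bounds how many distinct regions the classifier can carve the feature space into. The sketch below only tabulates that bound; it is an illustration added here, and `max_leaves` is a hypothetical helper name.

```python
def max_leaves(depth):
    """Upper bound on the number of leaves in a binary tree of the given depth."""
    return 2 ** depth

for depth in range(1, 6):
    print(f"max_depth={depth}: at most {max_leaves(depth)} leaves")
```

With max_depth = 2 the tree has at most 4 leaves, so it cannot isolate individual training houses the way a deep tree can.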

In [12]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# ----------------------------------------
# Step 1: Load the training data from Excel
# ----------------------------------------
train_df = pd.read_excel('Asssignment4_Data.xlsx', sheet_name='Train')

# Define the target and identifier columns (adjust as needed)
target_col = 'Construction type'
id_col = 'House ID'
features = train_df.columns.drop([id_col, target_col])

# Prepare the training features and labels
X_train = train_df[features]
y_train = train_df[target_col]

# ----------------------------------------
# Step 2: Train the Decision Tree Classifier (using best max_depth, e.g., 2)
# ----------------------------------------
clf = DecisionTreeClassifier(max_depth=2, random_state=42)
clf.fit(X_train, y_train)

# ----------------------------------------
# Step 3: Create a new test data point and perform inference
# ----------------------------------------
# Note: Ensure the feature names exactly match those from the training DataFrame.
new_sample = {
    "Local Price": 9.0384,
    "Bathrooms": 1,
    "Land Area": 7.8,
    "Living area": 1.5,
    "# Garages": 1.5,
    "# Rooms": 7,
    "# Bedrooms": 3,
    "Age of home": 23
}

# Convert the new sample to a DataFrame (single row)
new_sample_df = pd.DataFrame([new_sample])

# Predict the class for the new sample
prediction = clf.predict(new_sample_df)
print("Prediction for the new sample:", prediction[0])


Prediction for the new sample: Apartment


For the following test data point, perform inference on the decision tree:
Local Price = 9.0384
Bathrooms = 1
Land Area = 7.8
Living area = 1.5
# Garages = 1.5
# Rooms = 7
Number Bedrooms = 3
Age of home = 23


The decision tree predicted the new sample as "Apartment." With a max depth of 2, the prediction rests on at most two feature comparisons: the sample is routed to a leaf, and "Apartment" is the majority class among the training houses that fall into that leaf.
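To show what inference on a depth-2 tree looks like mechanically, here is a minimal sketch of the root-to-leaf walk. The split features and thresholds in `hypothetical_tree` are invented for illustration and are not the splits learned by `clf`; only the traversal rule (go left when the value is less than or equal to the threshold) mirrors how scikit-learn trees route a sample.

```python
# HYPOTHETICAL depth-2 tree: features and thresholds are made up for this sketch.
hypothetical_tree = {
    "feature": "Living area", "threshold": 1.6,
    "left": {   # Living area <= 1.6
        "feature": "Age of home", "threshold": 25,
        "left": {"label": "Apartment"},
        "right": {"label": "House"},
    },
    "right": {  # Living area > 1.6
        "feature": "# Rooms", "threshold": 7.5,
        "left": {"label": "Apartment"},
        "right": {"label": "House"},
    },
}

def predict(node, sample):
    """Walk from the root to a leaf, going left when value <= threshold."""
    while "label" not in node:
        branch = "left" if sample[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]

new_sample = {
    "Local Price": 9.0384, "Bathrooms": 1, "Land Area": 7.8, "Living area": 1.5,
    "# Garages": 1.5, "# Rooms": 7, "# Bedrooms": 3, "Age of home": 23,
}
print(predict(hypothetical_tree, new_sample))  # Apartment (under the made-up splits)
```

To inspect the splits the fitted classifier actually uses, `sklearn.tree.export_text(clf, feature_names=list(features))` prints the learned tree as text.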