<img src="https://github.com/PKhosravi-CityTech/LightCnnRad/raw/main/Images/BioMindLogo.png" alt="BioMind AI Lab Logo" width="150" height="150" align="left" style="margin-bottom: 40px;"> **Repository Developed by Pegah Khosravi, Principal Investigator of the BioMind AI Lab**

Welcome to this repository! This notebook is designed to provide hands-on experience and foundational knowledge in machine learning. It is part of our journey to explore key ML concepts, algorithms, and applications. Whether you're a PhD student or a master's student, this repository aims to support your learning goals and encourage critical thinking about machine learning systems.


# Quiz 1

## Quiz Description

You will have a maximum of 30 minutes to complete the quiz. However, if you are experiencing high stress or require additional time, you may take up to 1 hour to finish.

## Quiz Structure
- 6 Multiple-choice questions
- 1 True/False question
- 1 Fill-in-the-blank question (with two blanks)
- 2 Essay questions

## Instructions
- For the multiple-choice questions, select one of the four options and write your answer as A, B, C, or D (capital letters preferred).
- For the fill-in-the-blank question, provide one word per blank, separated by a comma (e.g., word1, word2).
- For the essay questions, write your explanation and code in Google Colab or any Python-compatible platform.
- This is an open-book quiz, so you may use textbooks or other resources. Work independently, and do not use OpenAI's tools or similar AI assistants.

Follow these instructions carefully. Good luck!

In [None]:
import ipywidgets as widgets
from IPython.display import display, Markdown, clear_output

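# Function to display a question with a button that reveals its answer
def create_question(question_text, answer_text):
    button = widgets.Button(description="Click to Reveal Answer")
    output = widgets.Output()

    def reveal_answer(b):
        # Render the answer inside the output widget, replacing any earlier render
        with output:
            output.clear_output()
            display(Markdown(f"**Answer:** {answer_text}"))

    button.on_click(reveal_answer)

    # A single "\n" is only a soft break in Markdown, so the options would run
    # together. Bold each line separately and join the lines with two trailing
    # spaces (a hard line break) so every option renders on its own line.
    lines = [f"**{line}**" for line in question_text.split("\n")]
    display(Markdown("  \n".join(lines)), button, output)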

# List of questions and answers
qa_pairs = [
    ("Question 1: What is the key difference between generative and discriminative models?\n"
     "A) Generative models estimate P(Y∣X), while discriminative models estimate P(X∣Y)\n"
     "B) Discriminative models estimate P(Y∣X), while generative models estimate P(X,Y)\n"
     "C) Generative models only work with supervised learning, while discriminative models work with unsupervised learning\n"
     "D) Discriminative models learn the joint probability distribution of the data", "B"),

    ("Question 2: Which of the following regularization techniques helps prevent overfitting by adding the sum of squared weights to the loss function?\n"
     "A) L1 regularization\nB) Dropout\nC) L2 regularization\nD) Batch Normalization", "C"),

    ("Question 3: What is the main goal of managing the bias-variance tradeoff in machine learning?\n"
     "A) To increase the accuracy of a model\nB) To reduce both bias and variance simultaneously\n"
     "C) To find a balance between underfitting and overfitting\nD) To determine the optimal number of decision trees in an ensemble", "C"),

    ("Question 4: Which of the following statements about decision trees is FALSE?\n"
     "A) Decision trees can handle both numerical and categorical data\n"
     "B) Decision trees are prone to overfitting, especially with deep trees\n"
     "C) Decision trees use impurity measures such as Gini impurity and entropy\n"
     "D) Decision trees require feature scaling before training", "D"),

    ("Question 5: What is the main advantage of ensemble methods like Random Forest over a single decision tree?\n"
     "A) They reduce variance and improve generalization\n"
     "B) They are always more interpretable than a single decision tree\n"
     "C) They require less computational power than decision trees\n"
     "D) They eliminate the need for feature selection", "A"),

    ("Question 6: Which of the following is a boosting algorithm that iteratively trains models to correct the errors of previous models?\n"
     "A) Bagging\nB) AdaBoost\nC) K-Means\nD) Principal Component Analysis (PCA)", "B"),

    ("Question 7: True/False? Increasing the complexity of a model always leads to better performance on unseen data.", "False"),

    ("Question 8: Fill in the Blank -> Logistic Regression is commonly used for ______ problems where the target variable has two possible outcomes. "
     "Unlike linear regression, it applies the ______ function to map predictions to probability values between 0 and 1.", "classification, sigmoid"),

]

# Display all questions with interactive answer buttons
for question, answer in qa_pairs:
    create_question(question, answer)
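
As a companion to Questions 2 and 8, here is a minimal sketch (not part of the graded quiz) of the sigmoid function and an L2 penalty, using only the standard library. The function names `sigmoid` and `l2_penalty`, the regularization strength `lam`, and the example weights are illustrative assumptions, not values from any dataset in this notebook.

```python
import math

def sigmoid(z):
    """Map any real-valued score z to a value between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use an equivalent form that avoids overflow in exp()
    e = math.exp(z)
    return e / (1.0 + e)

def l2_penalty(weights, lam):
    """L2 (ridge) penalty: lam times the sum of squared weights."""
    return lam * sum(w * w for w in weights)

# Hypothetical weights, bias, and input, chosen only for illustration
weights = [0.5, -1.0]
bias = 0.5
x = [1.0, 1.0]

# Linear score, exactly as in linear regression: 0.5 - 1.0 + 0.5 = 0.0
score = sum(w * xi for w, xi in zip(weights, x)) + bias

# Logistic regression passes the score through the sigmoid: sigmoid(0) = 0.5
probability = sigmoid(score)

# The penalty that L2 regularization adds to the loss: 0.1 * (0.25 + 1.0)
penalty = l2_penalty(weights, lam=0.1)

print(f"score={score}, probability={probability}, L2 penalty={penalty:.3f}")
```

Larger weights increase the penalty quadratically, which is why L2 regularization pushes a model toward smaller weights without forcing them to exactly zero.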


In [None]:
import ipywidgets as widgets
from IPython.display import display, Markdown, clear_output

# Function to create a question-answer toggle button
def create_question(question_text, answer_text):
    button = widgets.Button(description="Click to Reveal Answer")
    output = widgets.Output()

    def reveal_answer(b):
        with output:
            output.clear_output()
            display(Markdown(f"**Answer:**\n\n{answer_text}"))

    button.on_click(reveal_answer)
    display(Markdown(question_text), button, output)

# Question 9 text, written as Markdown so the bullet list renders correctly
question_9 = (
    "**Question 9: Essay - Compare and Contrast Random Forest and XGBoost**\n\n"
    "- **Explain the key differences** between Random Forest and XGBoost in terms of their learning strategies.\n"
    "- **How does each method handle decision trees** and optimize performance?\n"
    "- **When would you prefer** to use Random Forest over XGBoost and vice versa?\n"
)

answer_9 = (
    "**Random Forest and XGBoost** are both ensemble learning algorithms that use decision trees as their base models, "
    "but they differ in how they build and combine these trees.\n\n"
    "**Random Forest** is based on **bagging** (Bootstrap Aggregating), where multiple decision trees are trained independently "
    "on bootstrap samples of the data, with a random subset of features considered at each split. The final prediction is made by "
    "averaging (for regression) or majority voting (for classification) across all trees. Each tree is typically grown deep without pruning; "
    "averaging many such trees reduces variance, at the cost of more computation than a single tree.\n\n"
    "**XGBoost (Extreme Gradient Boosting)**, on the other hand, uses **boosting**, meaning trees are built sequentially, "
    "with each new tree correcting the mistakes of the previous ones. Instead of training trees independently, XGBoost applies "
    "gradient boosting, where each new tree is fit to the errors left by the current ensemble (for squared error, the residuals). XGBoost also "
    "incorporates **regularization techniques (L1 and L2 penalties)** to reduce overfitting and parallelizes the split search within each tree, "
    "making it faster than traditional gradient boosting implementations, although the trees themselves are still built one after another.\n\n"
    "**When to Use Which?**\n\n"
    "- **Random Forest** is a good choice when you want a robust baseline with little tuning. Averaging independently trained trees reduces variance, and the trees can be trained in parallel.\n"
    "- **XGBoost** is preferred when predictive performance is critical and you can afford to tune hyperparameters, especially in large-scale applications and competitions. Its built-in regularization helps control overfitting.\n\n"
    "In summary, **Random Forest is easier to set up and tune, while XGBoost is often more accurate on complex problems** when tuned carefully. "
    "Because boosting is sequential, XGBoost cannot train its trees independently the way Random Forest can, though its optimizations make it faster than traditional gradient boosting methods."
)

# Display the question
create_question(question_9, answer_9)
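
To make the bagging-versus-boosting contrast concrete, here is a minimal NumPy sketch on synthetic one-dimensional regression data. It uses depth-1 regression trees (stumps) as base learners and is deliberately simplified: it is not how Random Forest or XGBoost are actually implemented (no feature subsampling, no regularization, no second-order gradients). All function names (`fit_stump`, `predict_stump`, `bagging_predict`, `boosting_predict`) and the data are hypothetical.

```python
import numpy as np

def fit_stump(x, y):
    """Fit a depth-1 regression tree: choose the threshold with the lowest squared error."""
    best = None
    # Exclude the largest value so both sides of every split are non-empty
    for t in np.unique(x)[:-1]:
        left, right = y[x <= t], y[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    if best is None:  # all x identical: fall back to a constant prediction
        return x[0], y.mean(), y.mean()
    return best[1], best[2], best[3]

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def bagging_predict(x, y, x_new, n_trees, rng):
    """Bagging: fit each stump independently on a bootstrap sample, then average."""
    preds = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(x), size=len(x))  # sample rows with replacement
        preds.append(predict_stump(fit_stump(x[idx], y[idx]), x_new))
    return np.mean(preds, axis=0)

def boosting_predict(x, y, x_new, n_trees, learning_rate):
    """Boosting: fit each stump to the residuals left by the stumps before it."""
    train_pred = np.full(len(x), y.mean())
    new_pred = np.full(len(x_new), y.mean())
    for _ in range(n_trees):
        stump = fit_stump(x, y - train_pred)  # residuals of the current ensemble
        train_pred += learning_rate * predict_stump(stump, x)
        new_pred += learning_rate * predict_stump(stump, x_new)
    return new_pred

# Synthetic data: a noisy sine curve (an assumption made for illustration)
rng = np.random.default_rng(42)
x = rng.uniform(0, 6, size=200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)

mse_bagging = np.mean((bagging_predict(x, y, x, 50, rng) - y) ** 2)
mse_boosting = np.mean((boosting_predict(x, y, x, 50, 0.3) - y) ** 2)
print(f"Training MSE, bagged stumps:  {mse_bagging:.3f}")
print(f"Training MSE, boosted stumps: {mse_boosting:.3f}")
```

With such weak base learners, averaging cannot remove their bias, whereas boosting reduces it step by step. This compares training error only, so it says nothing about generalization. Random Forest compensates by using deep trees, which have low bias and high variance, and averaging them to reduce that variance.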


In [None]:
import ipywidgets as widgets
from IPython.display import display, Markdown, clear_output

def create_question(question_text, answer_text):
    button = widgets.Button(description="Click to Reveal Answer")
    output = widgets.Output()

    def reveal_answer(b):
        with output:
            output.clear_output()
            display(Markdown(f"**Answer:**\n\n{answer_text}"))

    button.on_click(reveal_answer)
    display(Markdown(question_text), button, output)

question_10 = (
    "**Question 10: Essay & Code - Housing Price Classification**\n\n"
    "You are provided with a housing dataset containing the following features:\n"
    "- **RM**: Average number of rooms per dwelling.\n"
    "- **LSTAT**: Percentage of lower-status population in the area.\n"
    "- **PTRATIO**: Pupil-teacher ratio in the neighborhood.\n"
    "- **MEDV**: House price in dollars (target variable).\n\n"
    "**Task:**\n"
    "- Convert the `MEDV` column into a binary classification target:\n"
      "  - Calculate the **median** house price.\n"
      "  - If `MEDV` is above the median, classify it as `'High' (1)`.\n"
      "  - If `MEDV` is below or equal to the median, classify it as `'Low' (0)`.\n"
    "- Split the data into training (70%) and testing (30%) sets.\n"
    "- Train a **Decision Tree Classifier** with:\n"
      "  - Features: `RM`, `LSTAT`, `PTRATIO`.\n"
      "  - Max Depth: `3`.\n"
      "  - Random State: `42`.\n"
    "- Evaluate the model:\n"
      "  - Compute accuracy.\n"
      "  - Display the confusion matrix.\n"
      "  - Generate the classification report.\n"
    "- **Visualize** the Decision Tree structure.\n\n"
    "**Interpretation Questions:**\n"
    "- Which feature appears at the top of the tree?\n"
    "- What does this tell you about its importance?\n"
    "- Suggest one improvement to this model."
)

answer_10 = (
    "**Python Implementation**\n\n"
    "```python\n"
    "import numpy as np\n"
    "import pandas as pd\n"
    "import matplotlib.pyplot as plt\n"
    "from sklearn.model_selection import train_test_split\n"
    "from sklearn.tree import DecisionTreeClassifier, plot_tree\n"
    "from sklearn.metrics import accuracy_score, confusion_matrix, classification_report\n\n"
    "df = pd.read_csv('/content/housing.csv')\n"
    "median_price = df['MEDV'].median()\n"
    "df['PriceCategory'] = (df['MEDV'] > median_price).astype(int)\n"
    "y = df['PriceCategory']\n"
    "X = df[['RM', 'LSTAT', 'PTRATIO']]\n"
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)\n\n"
    "clf = DecisionTreeClassifier(max_depth=3, random_state=42)\n"
    "clf.fit(X_train, y_train)\n"
    "y_pred = clf.predict(X_test)\n\n"
    "print(f'Accuracy: {accuracy_score(y_test, y_pred):.4f}')\n"
    "print('\\nConfusion Matrix:\\n', confusion_matrix(y_test, y_pred))\n"
    "print('\\nClassification Report:\\n', classification_report(y_test, y_pred))\n\n"
    "plt.figure(figsize=(15, 8))\n"
    "plot_tree(clf, feature_names=list(X.columns), class_names=['Low', 'High'], filled=True)\n"
    "plt.show()\n"
    "```\n\n"
    "**Interpreting the Results**\n\n"
    "- **Which feature appears at the top?**\n"
    "  - LSTAT (Percentage of lower-status population in the area).\n\n"
    "- **What does this indicate?**\n"
    "  - The root split is the one that reduces impurity the most on the full training set, so LSTAT is the most informative "
    "feature for the first split. This alone does not prove it is the most important feature overall (check `clf.feature_importances_`). "
    "The split suggests that areas with a lower percentage of lower-status residents are more likely to have higher house prices.\n\n"
    "- **Suggested improvements:**\n"
    "  - Tune hyperparameters such as `max_depth` and `min_samples_leaf`, ideally with cross-validation.\n"
    "  - Use **Random Forest or XGBoost** for better generalization.\n"
    "  - Include additional features, if available, such as crime rate, highway accessibility, or property tax rate."
)

create_question(question_10, answer_10)


### 📥 Download `housing.csv` from Kaggle and Upload to Colab  
#### 🔗 Dataset Link: [Boston Housing Dataset](https://www.kaggle.com/datasets/schirmerchad/bostonhoustingmlnd)  


In [None]:
# Import necessary libraries
import numpy as np  # For numerical operations
import pandas as pd  # For handling data
import matplotlib.pyplot as plt  # For visualization
from sklearn.model_selection import train_test_split  # For splitting data
from sklearn.tree import DecisionTreeClassifier, plot_tree  # For decision tree classification and visualization
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report  # For model evaluation

# Load the housing dataset
df = pd.read_csv('/content/housing.csv')  # Read the dataset into a Pandas DataFrame

# Convert the 'MEDV' column into a binary classification target
median_price = df['MEDV'].median()  # Calculate the median house price
df['PriceCategory'] = (df['MEDV'] > median_price).astype(int)  # Convert to binary (1 if above median, 0 otherwise)

# Define the target variable (y) and features (X)
y = df['PriceCategory']  # Target variable: PriceCategory (High = 1, Low = 0)
X = df[['RM', 'LSTAT', 'PTRATIO']]  # Select features: RM (rooms), LSTAT (% lower-status population), PTRATIO (pupil-teacher ratio)

# Split the dataset into training (70%) and testing (30%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# 'random_state=42' ensures results are reproducible

# Train a Decision Tree Classifier
clf = DecisionTreeClassifier(max_depth=3, random_state=42)  # Create a decision tree model with max depth = 3
clf.fit(X_train, y_train)  # Train the model using the training dataset

# Make predictions on the test set
y_pred = clf.predict(X_test)  # Predict housing price category (High/Low) for test data

# Evaluate the model's performance
print(f'Accuracy: {accuracy_score(y_test, y_pred):.4f}')  # Print the accuracy of the model
print('\nConfusion Matrix:\n', confusion_matrix(y_test, y_pred))  # Rows = true class, columns = predicted class: [[TN, FP], [FN, TP]]
print('\nClassification Report:\n', classification_report(y_test, y_pred))  # Show precision, recall, and F1-score

# Visualize the trained Decision Tree model
plt.figure(figsize=(15, 8))  # Set figure size for better readability
plot_tree(clf, feature_names=list(X.columns), class_names=['Low', 'High'], filled=True)  # Plot the tree; a plain list of names is accepted by every scikit-learn version
plt.show()  # Display the visualization
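
The interpretation question about the root node can be made concrete with a small sketch of how a candidate split is scored. It assumes Gini impurity, which is the default criterion of `DecisionTreeClassifier`. The helper names (`gini`, `weighted_gini_after_split`), the thresholds, and the six toy rows are hypothetical and are not taken from `housing.csv`.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2

def weighted_gini_after_split(values, labels, threshold):
    """Weighted impurity of the two child nodes produced by the test `value <= threshold`."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Hypothetical toy data (NOT from housing.csv): six areas, label 1 = 'High'
lstat = [4.0, 5.0, 6.0, 15.0, 18.0, 20.0]
ptratio = [15.0, 20.0, 16.0, 19.0, 17.0, 21.0]
labels = [1, 1, 1, 0, 0, 0]

parent = gini(labels)  # a 50/50 node has the maximum binary impurity, 0.5
lstat_split = weighted_gini_after_split(lstat, labels, threshold=10.0)
ptratio_split = weighted_gini_after_split(ptratio, labels, threshold=18.0)

print(f"Parent impurity:          {parent:.3f}")
print(f"After LSTAT <= 10 split:   {lstat_split:.3f}")
print(f"After PTRATIO <= 18 split: {ptratio_split:.3f}")
```

In this toy data the LSTAT split separates the classes perfectly while the PTRATIO split leaves both children mixed, so a tree would place LSTAT at the root. The root position therefore reflects impurity reduction on the training data. For an importance measure aggregated over all nodes, inspect `clf.feature_importances_` on the fitted classifier.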
