# Assignment 2  
**Applied Machine Learning**

---

### 1. [20 pts]  
At a high level (i.e., without entering into mathematical details), please describe, compare, and contrast the following classifiers:  

- **Perceptron (textbook's version)**  
    - The Perceptron is a simple artificial neural network and one of the foundational algorithms in machine learning, mainly used for binary classification. It takes input features, applies weights and a bias, and passes the weighted sum through an activation function (a threshold in the classic formulation) to produce an output. During training, it updates its weights only when it misclassifies an example, gradually improving its predictions. It is fast and naturally uses numeric features, but it can only learn a linear decision boundary, so it converges only when the classes are linearly separable.
- **SVM**  
    - A supervised classifier that finds the maximum-margin decision boundary. With kernels it can capture non-linear structure. Training speed is moderate, and kernel SVMs can be slow on large datasets, but predictive strength and robustness are high with proper feature scaling and hyperparameter tuning. It naturally uses numeric, scaled features. It is a discriminative method that solves a convex optimization problem whose cost function is the hinge loss plus a regularization term.
- **Decision Tree**  
    - A decision tree is a non-linear, rule-based classifier that splits the dataset into smaller subsets using criteria such as Gini impurity or information gain. Each split is chosen greedily to maximize separation between the classes, and the model continues splitting until a stopping condition is met (for example, a maximum depth or a minimum number of samples per leaf). Decision trees are easy to interpret and can handle both numerical and categorical features without requiring feature scaling. However, they are prone to overfitting and can be unstable, since small changes in the data may lead to very different tree structures. A single tree trains quickly but has high variance, which makes it less robust than ensemble methods.
- **Random Forest** 
    - A random forest addresses the weaknesses of decision trees by combining many of them into an ensemble. It trains each tree on a random (bootstrap) sample of the data and considers only a random subset of features at each split. The final prediction is made by aggregating the results of all trees, using a majority vote for classification. This approach reduces variance and overfitting, leading to more robust models. Random forests work well with mixed feature types, require little preprocessing, and are less sensitive to noise. The main drawbacks are that they are slower to train and harder to interpret than a single decision tree.
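
The mistake-driven update described for the Perceptron can be sketched in a few lines of plain Python. This is a minimal illustration, not necessarily the textbook's exact variant: the function names, the labels in {-1, +1}, the learning rate, and the toy data (logical AND) are all assumptions made for the example.

```python
def train_perceptron(samples, labels, learning_rate=1.0, epochs=20):
    """Train a perceptron on labels in {-1, +1}; returns (weights, bias)."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        errors = 0
        for features, label in zip(samples, labels):
            activation = sum(w * x for w, x in zip(weights, features)) + bias
            prediction = 1 if activation >= 0 else -1
            if prediction != label:
                # Update only on a mistake: move the boundary toward
                # classifying this example correctly.
                weights = [w + learning_rate * label * x
                           for w, x in zip(weights, features)]
                bias += learning_rate * label
                errors += 1
        if errors == 0:
            # A full pass with no mistakes: the data has been separated.
            break
    return weights, bias


def predict(weights, bias, features):
    """Threshold the weighted sum to get a class label in {-1, +1}."""
    activation = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if activation >= 0 else -1


# Hypothetical, linearly separable toy data (logical AND).
samples = [[0, 0], [0, 1], [1, 0], [1, 1]]
labels = [-1, -1, -1, 1]

weights, bias = train_perceptron(samples, labels)
predictions = [predict(weights, bias, sample) for sample in samples]
print(predictions == labels)  # True
```

On data that is not linearly separable, the loop would simply stop after `epochs` passes without reaching zero errors.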

Some comparison criteria can be:  
- Speed?  
- Strength?  
- Robustness?  
- The feature type that the classifier naturally uses (e.g., relying on distance means that numerical features are naturally used)  
- Is it statistical?  
- Does the method solve an optimization problem? If yes, what is the cost function?  
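
For the last criterion, two of the cost functions named in the answers above (hinge loss for the SVM, Gini impurity for the decision tree) are short enough to write out. The sketch below is illustrative only: the function names are made up, the hinge loss is shown for a single example without the regularization term, and a decision tree minimizes impurity greedily, one split at a time, rather than solving a single global optimization problem.

```python
from collections import Counter


def hinge_loss(label, score):
    """Hinge loss for one example; label is -1 or +1, score is the raw decision value."""
    return max(0.0, 1.0 - label * score)


def gini_impurity(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    total = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((count / total) ** 2 for count in counts.values())


print(hinge_loss(+1, 2.0))   # 0.0 (correct and outside the margin)
print(hinge_loss(+1, 0.5))   # 0.5 (correct but inside the margin)
print(hinge_loss(-1, 0.5))   # 1.5 (misclassified)
print(gini_impurity(["a", "a", "b", "b"]))  # 0.5 (evenly mixed node)
print(gini_impurity(["a", "a", "a", "a"]))  # 0.0 (pure node)
```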

**Question:** Which one will be the first that you would try on your dataset?  

- On typical tabular data, I would try a Random Forest first as a baseline: it is robust to noise and outliers, handles non-linearities and feature interactions, and needs no feature scaling and little other preprocessing. The best choice still depends on the goal and on the type of data. If interpretability is the top requirement, I would add a shallow Decision Tree alongside the forest to produce a clear, human-readable explanation.


### 2. [20 pts]  
Define the following feature types and give example values from a dataset. You can pull examples from an existing dataset (like the Iris dataset) or you could write out a dataset yourself.  

(Hint: In order to give examples for each feature type, you will probably have to use more than one dataset.)  

- **Numerical**
  - Quantitative values you can add/average; continuous or discrete.
  *Examples:* Iris sepal length = 5.1 cm, petal width = 1.8 cm.

- **Nominal**
  - Categorical labels with **no natural order**.
  *Examples:* Iris species (setosa, versicolor, virginica), color (red, blue, green).

- **Date**
  - Calendar/time stamps representing when an event occurred. Often expanded into features like year, month, day-of-week, hour.
  *Examples:* order_date = 2025-09-03 and timestamp = 2024-07-15 13:05:22.

- **Text**
  - Unstructured strings of characters/tokens. Typically vectorized (e.g., bag-of-words, TF-IDF, embeddings).
  *Examples:* review_text = “Loved the food, will return”

- **Image**
  - Pixel arrays (grayscale or RGB); high-dimensional signals.
  *Examples:* MNIST digit image 28×28 (label “7”).

- **Dependent variable**
  - The target/response the model predicts (often denoted $y$).
  *Examples:* Iris species label, house_price = $345,000, and “Chance of Admit” in the Graduate Admission dataset.
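
Several of these feature types have to be converted to numbers before most classifiers can use them. The sketch below shows one common conversion for three of them, using only the standard library and the example values from the list above; the helper name `one_hot` and the simple tokenization are assumptions for illustration, not a recommended pipeline.

```python
from collections import Counter
from datetime import datetime

# Date: expand a timestamp into calendar features.
timestamp = datetime.strptime("2024-07-15 13:05:22", "%Y-%m-%d %H:%M:%S")
date_features = {
    "year": timestamp.year,
    "month": timestamp.month,
    "day_of_week": timestamp.weekday(),  # Monday is 0
    "hour": timestamp.hour,
}

# Nominal: one-hot encode, because the categories have no natural order.
species_levels = ["setosa", "versicolor", "virginica"]


def one_hot(value, levels):
    """Return a 0/1 indicator vector with a single 1 at the matching level."""
    return [1 if value == level else 0 for level in levels]


# Text: a bag-of-words count after lowercasing and dropping the comma.
review_text = "Loved the food, will return"
tokens = review_text.lower().replace(",", "").split()
bag_of_words = Counter(tokens)

print(date_features)
print(one_hot("versicolor", species_levels))  # [0, 1, 0]
print(bag_of_words["food"])                   # 1
```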



### 3. [20 pts]  
Using online resources, research and find other classifier performance metrics which are also as common as the accuracy metric. Provide the mathematical equations for them and explain in your own words the meaning of the different metrics you found.  

Note that providing mathematical equations might involve defining some more fundamental terms. For example, you should define **“False Positive”** if you answer with a metric that builds on that.  

- **True Positive (TP)**: predicted positive, and the actual class is positive.
- **True Negative (TN)**: predicted negative, and the actual class is negative.
- **False Positive (FP)**: predicted positive, but the actual class is negative.
- **False Negative (FN)**: predicted negative, but the actual class is positive.
- **Specificity**
    - Of all the actual negatives, what fraction did the model correctly identify as negative?
    $$
    \mathrm{Specificity} = \frac{TN}{TN + FP}
    $$
- **Precision**
    - Of all the examples the model predicted as positive, what fraction are actually positive? This metric matters most in use cases where false positives are costly.
    $$
    \mathrm{Precision} = \frac{TP}{TP + FP}
    $$
- **Recall**
    - Of all the actual positives, what fraction did the model find? This metric penalizes false negatives.
    $$
    \mathrm{Recall} = \frac{TP}{TP + FN}
    $$
- **F1 Score**
    - A single number that’s high only when both precision and recall are high. It penalizes an imbalance between them more than an arithmetic mean would.
    $$
    F_{1} = \frac{2\,\mathrm{Precision}\,\mathrm{Recall}}
                          {\mathrm{Precision}+\mathrm{Recall}}
                     = \frac{2TP}{2TP + FP + FN}
    $$
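
As a worked example, the metrics above can be computed directly from the four confusion-matrix counts. The counts below are made up for illustration, and `classification_metrics` is a hypothetical helper name; returning 0.0 when a denominator is zero is one possible convention, not the only one.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute common classifier metrics from confusion-matrix counts."""

    def safe_divide(numerator, denominator):
        # Assumed convention: report 0.0 when the metric is undefined.
        return numerator / denominator if denominator else 0.0

    return {
        "accuracy": safe_divide(tp + tn, tp + tn + fp + fn),
        "precision": safe_divide(tp, tp + fp),
        "recall": safe_divide(tp, tp + fn),
        "specificity": safe_divide(tn, tn + fp),
        "f1": safe_divide(2 * tp, 2 * tp + fp + fn),
    }


# Hypothetical counts: 100 examples, 60 actual positives and 40 actual negatives.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts, precision (0.800) is higher than recall (0.667), and the F1 score (0.727) falls between them but closer to the lower value.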



### 4. [40 pts]  Correlation Program From Scratch
**Task:** Display the correlation matrix where each row and column are the features.  

Additional questions to answer:  
- Should we use *'Serial no'*? Why or why not?  
    - We should **not** use Serial No. It is only a row identifier, unique for every applicant, so it adds no predictive signal to the model.
- Observe that the diagonal of this matrix should have all 1's; why is this?  
    - Because every variable is perfectly linearly related to itself, so its correlation with itself is exactly 1.
- Since the last column can be used as the target (dependent) variable, what do you think about the correlations between all the variables?  
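    - GRE Score, TOEFL Score, and CGPA are strongly correlated with each other (pairwise r between about 0.83 and 0.84), which suggests multicollinearity and overlapping information. Every feature is also positively correlated with Chance of Admit, ranging from about 0.55 (Research) to about 0.87 (CGPA).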
- Which variable should be the most important to try to predict *'Chance of Admit'*?
    - CGPA has the highest correlation with Chance of Admit (~0.873), so it’s the single most predictive feature by correlation.
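
The diagonal of 1's follows directly from the definition of the sample Pearson correlation. Writing $s_x$ for the sample standard deviation of $x$ (notation assumed here, not taken from the assignment), and assuming $x$ is not constant so that $s_x > 0$:

$$
r_{xy} = \frac{\mathrm{cov}(x, y)}{s_x \, s_y},
\qquad
r_{xx} = \frac{\mathrm{cov}(x, x)}{s_x \, s_x} = \frac{s_x^{2}}{s_x^{2}} = 1
$$

The covariance of a variable with itself is its variance, so the numerator and denominator cancel.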


In [1]:
import pandas as pd
admission_data = pd.read_csv('Admission_Predict.csv')
print(admission_data.shape)

admission_data.head()


(400, 9)


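| | Serial No. | GRE Score | TOEFL Score | University Rating | SOP | LOR | CGPA | Research | Chance of Admit |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 337 | 118 | 4 | 4.5 | 4.5 | 9.65 | 1 | 0.92 |
| 1 | 2 | 324 | 107 | 4 | 4.0 | 4.5 | 8.87 | 1 | 0.76 |
| 2 | 3 | 316 | 104 | 3 | 3.0 | 3.5 | 8.0 | 1 | 0.72 |
| 3 | 4 | 322 | 110 | 3 | 3.5 | 2.5 | 8.67 | 1 | 0.8 |
| 4 | 5 | 314 | 103 | 2 | 2.0 | 3.0 | 8.21 | 0 | 0.65 |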


In [2]:
import math

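# Drop the row identifier; it carries no predictive signal
admission_data_clean = admission_data.drop(columns=['Serial No.'])

# Cast every remaining column to float so the arithmetic below is uniform
X = admission_data_clean.astype(float)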

def mean(values):
    return sum(values) / len(values)

def stdev(values):
    average = mean(values)
    sample_variance = sum((value - average) ** 2 for value in values) / (len(values) - 1)
    return math.sqrt(sample_variance)

def covariance(x_values, y_values):
    x_mean = mean(x_values)
    y_mean = mean(y_values)

    centered_products_sum = 0.0
    for x_value, y_value in zip(x_values, y_values):
        centered_x = x_value - x_mean
        centered_y = y_value - y_mean
        centered_products_sum += centered_x * centered_y

    sample_covariance = centered_products_sum / (len(x_values) - 1)
    return sample_covariance

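def correlation(x_values, y_values):
    x_stdev = stdev(x_values)
    y_stdev = stdev(y_values)
    # Correlation is undefined for a constant column; return 0.0 here
    # (pandas `.corr()` reports NaN in that case)
    if x_stdev == 0 or y_stdev == 0:
        return 0.0
    return covariance(x_values, y_values) / (x_stdev * y_stdev)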

data = {feature_name: X[feature_name].tolist() for feature_name in X.columns}
corr_matrix = [
    [correlation(data[row_feature_name], data[col_feature_name]) for col_feature_name in X.columns]
    for row_feature_name in X.columns
]
corr_df = pd.DataFrame(corr_matrix, index=X.columns, columns=X.columns)
corr_df

| | GRE Score | TOEFL Score | University Rating | SOP | LOR | CGPA | Research | Chance of Admit |
|---|---|---|---|---|---|---|---|---|
| GRE Score | 1.0 | 0.835977 | 0.668976 | 0.612831 | 0.557555 | 0.83306 | 0.580391 | 0.80261 |
| TOEFL Score | 0.835977 | 1.0 | 0.69559 | 0.657981 | 0.567721 | 0.828417 | 0.489858 | 0.791594 |
| University Rating | 0.668976 | 0.69559 | 1.0 | 0.734523 | 0.660123 | 0.746479 | 0.447783 | 0.71125 |
| SOP | 0.612831 | 0.657981 | 0.734523 | 1.0 | 0.729593 | 0.718144 | 0.444029 | 0.675732 |
| LOR | 0.557555 | 0.567721 | 0.660123 | 0.729593 | 1.0 | 0.670211 | 0.396859 | 0.669889 |
| CGPA | 0.83306 | 0.828417 | 0.746479 | 0.718144 | 0.670211 | 1.0 | 0.521654 | 0.873289 |
| Research | 0.580391 | 0.489858 | 0.447783 | 0.444029 | 0.396859 | 0.521654 | 1.0 | 0.553202 |
| Chance of Admit | 0.80261 | 0.791594 | 0.71125 | 0.675732 | 0.669889 | 0.873289 | 0.553202 | 1.0 |


In [3]:
# output `.corr` matrix for comparison
real_corr_matrix = admission_data_clean.corr()

real_corr_matrix

| | GRE Score | TOEFL Score | University Rating | SOP | LOR | CGPA | Research | Chance of Admit |
|---|---|---|---|---|---|---|---|---|
| GRE Score | 1.0 | 0.835977 | 0.668976 | 0.612831 | 0.557555 | 0.83306 | 0.580391 | 0.80261 |
| TOEFL Score | 0.835977 | 1.0 | 0.69559 | 0.657981 | 0.567721 | 0.828417 | 0.489858 | 0.791594 |
| University Rating | 0.668976 | 0.69559 | 1.0 | 0.734523 | 0.660123 | 0.746479 | 0.447783 | 0.71125 |
| SOP | 0.612831 | 0.657981 | 0.734523 | 1.0 | 0.729593 | 0.718144 | 0.444029 | 0.675732 |
| LOR | 0.557555 | 0.567721 | 0.660123 | 0.729593 | 1.0 | 0.670211 | 0.396859 | 0.669889 |
| CGPA | 0.83306 | 0.828417 | 0.746479 | 0.718144 | 0.670211 | 1.0 | 0.521654 | 0.873289 |
| Research | 0.580391 | 0.489858 | 0.447783 | 0.444029 | 0.396859 | 0.521654 | 1.0 | 0.553202 |
| Chance of Admit | 0.80261 | 0.791594 | 0.71125 | 0.675732 | 0.669889 | 0.873289 | 0.553202 | 1.0 |
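
Rather than comparing the two matrices by eye, the agreement can be checked programmatically. The helper below is a minimal sketch with a hypothetical name (`matrices_match`) and an assumed tolerance; it works on plain nested lists so that it runs on its own, and the sample values are the GRE Score and TOEFL Score entries copied from the outputs above.

```python
import math


def matrices_match(first, second, tolerance=1e-6):
    """Return True if two nested lists have the same shape and agree element-wise."""
    if len(first) != len(second):
        return False
    for row_a, row_b in zip(first, second):
        if len(row_a) != len(row_b):
            return False
        for a, b in zip(row_a, row_b):
            if not math.isclose(a, b, abs_tol=tolerance):
                return False
    return True


# Top-left 2x2 block (GRE Score, TOEFL Score) of each matrix shown above.
from_scratch = [[1.0, 0.835977], [0.835977, 1.0]]
from_pandas = [[1.0, 0.835977], [0.835977, 1.0]]

print(matrices_match(from_scratch, from_pandas))  # True
```

In the notebook itself, the same check could be applied to the full matrices with `matrices_match(corr_df.values.tolist(), real_corr_matrix.values.tolist())`.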


# References
OpenAI. (2025). ChatGPT (GPT-5) [Large language model]. Retrieved from https://chat.openai.com/