### 1. 
[20 pts] At a high level (i.e., without entering into mathematical details), please describe,
compare, and contrast the following classifiers:
-  Perceptron (textbook's version)
-  SVM
-  Decision Tree
- Random Forest (you have to research a bit about this classifier)

Some comparison criteria can be:
-  Speed?
-  Strength?
-  Robustness?
-  The feature type that the classifier naturally uses (e.g. relying on distance means that numerical features are naturally used)
-  Is it statistical?
-  Does the method solve an optimization problem? If yes, what is the cost function?
Which one will be the first that you would try on your dataset?

**Ans:**

1. **Perceptron:**
- **Speed**: Generally fast due to its simplicity.
- **Strength**: Good for linearly separable problems but struggles with non-linear data.
- **Robustness**: Sensitive to noisy data and outliers.
- **Feature Type**: Works best with numerical features as it operates based on linear combinations of input features.
- **Statistical**: Not particularly, more geometric in nature.
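- **Optimization Problem**: Yes, loosely. It searches for a separating hyperplane by iteratively correcting the weights on misclassified examples, which amounts to minimizing a cost that sums over the misclassified points. Convergence is guaranteed only when the data is linearly separable.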
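
The error-driven update described above can be sketched in a few lines of plain Python. This is a minimal illustration, not necessarily the textbook's exact version: the function names, the `learning_rate` parameter, the {-1, +1} label encoding, and the toy AND data are all assumptions made for the example.

```python
def train_perceptron(samples, labels, epochs=20, learning_rate=1.0):
    """Train a perceptron on labels in {-1, +1}; returns (weights, bias)."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            # Update only when the example is misclassified (or on the boundary).
            if y * activation <= 0:
                weights = [w + learning_rate * y * xi for w, xi in zip(weights, x)]
                bias += learning_rate * y
    return weights, bias


def predict(weights, bias, x):
    """Classify x by which side of the hyperplane it falls on."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else -1


# Toy, linearly separable data (logical AND), made up for illustration.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [-1, -1, -1, 1]
w, b = train_perceptron(X, y)
print([predict(w, b, x) for x in X])  # [-1, -1, -1, 1]
```

On data that is not linearly separable, the same loop would keep updating until `epochs` runs out, which is one way to see the sensitivity to noise noted above.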

2. **SVM:** [1]
- **Speed**: Can be slow, especially with large datasets and when using kernel tricks.
- **Strength**: Very effective for high-dimensional spaces and for cases where the number of dimensions exceeds the number of samples.
- **Robustness**: Quite robust against overfitting, especially in high-dimensional space. Kernel SVMs can handle non-linear data well.
- **Feature Type**: Primarily numerical; kernel tricks allow it to handle non-linear relationships.
- **Statistical**: Yes, based on statistical learning theory.
- **Optimization Problem**: Yes, it solves a convex optimization problem to maximize the margin between classes.
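
For reference, one common way to write the convex problem mentioned above is the soft-margin linear SVM objective. The notation here is introduced only for illustration and does not come from the answer: \(w\) and \(b\) are the hyperplane parameters, \(y_i \in \{-1, +1\}\) are the labels, and \(C > 0\) is a regularization constant.

$$
\min_{w,\,b}\ \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{N} \max\bigl(0,\ 1 - y_i (w^\top x_i + b)\bigr)
$$

The first term favors a wide margin (the margin width is \(2 / \lVert w \rVert\)), and the second term, the hinge loss, penalizes points that are misclassified or fall inside the margin.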

3. **Decision Tree:** [2]
- **Speed**: Fast for training, but predictions can be slower with very deep trees.
- **Strength**: Can handle both numerical and categorical data well. Easy to interpret and understand.
- **Robustness**: Prone to overfitting, especially with complex trees.
- **Feature Type**: Can naturally handle a mix of numerical and categorical features without needing dummy variables.
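- **Statistical**: Yes, it uses statistical measures (like information gain or Gini impurity) to split the data.
- **Optimization Problem**: Not a single global one. The tree is grown greedily, choosing at each node the split that most reduces impurity.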
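
The split measures named above are simple to compute by hand. The sketch below assumes class labels stored in plain Python lists and a binary split into `left` and `right` children; the function names are chosen for this example only.

```python
from collections import Counter
from math import log2


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy of the class distribution, in bits."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted_child_entropy = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted_child_entropy


parent = ["yes", "yes", "no", "no"]
print(gini_impurity(parent))  # 0.5
print(entropy(parent))  # 1.0
print(information_gain(parent, ["yes", "yes"], ["no", "no"]))  # 1.0
```

A split that separates the classes perfectly, as in the last line, recovers all of the parent's entropy as gain.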

4. **Random Forest:** [2]
- **Speed**: Slower than Decision Trees due to the ensemble approach but can be parallelized.
- **Strength**: Very strong classifier due to the ensemble of decision trees which reduces overfitting and improves generalization.
- **Robustness**: More robust than a single decision tree, less sensitive to noisy data and outliers.
- **Feature Type**: Like decision trees, it can handle both numerical and categorical features effectively.
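- **Statistical**: Yes, it builds upon the statistical approach of decision trees and adds methods like bootstrapping and aggregation to improve performance.
- **Optimization Problem**: No global cost function. Each tree is grown greedily on a bootstrap sample, as in a single decision tree, and the trees' predictions are combined by majority vote.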
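
The bootstrapping and aggregation steps mentioned above can be illustrated without any library. This is only a sketch of those two ingredients under simple assumptions (rows as tuples, labels as strings, made-up toy data); a full random forest would also train a tree on each sample and consider a random subset of features at each split.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (the 'bootstrap' step)."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Return the most common predicted label (the 'aggregation' step)."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Toy (feature, label) rows, made up for illustration.
rows = [(5.1, "setosa"), (7.0, "versicolor"), (6.3, "virginica"), (4.9, "setosa")]
sample = bootstrap_sample(rows, rng)
print(len(sample) == len(rows))  # True
print(majority_vote(["setosa", "versicolor", "setosa"]))  # setosa
```

Because each tree sees a different resampled dataset, individual trees overfit in different ways, and voting averages much of that variance away.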

#### Which to Try First?
The choice of classifier to try first on a new dataset depends on the dataset's size, feature types, and the problem's complexity. If the dataset is high-dimensional and linearly separable, a Perceptron or SVM might be the first choice. For a dataset with mixed feature types and potential non-linear relationships, a Decision Tree or Random Forest might be more appropriate because both handle mixed features and non-linear structure, and a Random Forest additionally curbs overfitting through ensembling.

Ultimately, the best approach is to experiment with multiple models and compare their validation performance systematically.
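
One systematic way to compare models is k-fold cross-validation, where every model is trained and scored on the same train/validation splits. The helper below is a hypothetical, minimal index splitter written for this illustration; it does not shuffle, so in practice the data would usually be shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


for train, validation in k_fold_indices(n_samples=6, k=3):
    print(train, validation)
# [2, 3, 4, 5] [0, 1]
# [0, 1, 4, 5] [2, 3]
# [0, 1, 2, 3] [4, 5]
```

Each candidate classifier would be fit on `train` and evaluated on `validation` for every fold, and the averaged scores compared.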

### 2. 
[20 pts] Define the following feature types and give example values from a dataset. You
can pull examples from an existing dataset (like the Iris dataset) or you could write out a
dataset yourself. (Hint: In order to give examples for each feature type, you will probably
have to use more than one dataset.)

**Ans:**
- **Numerical:** Numerical features represent quantitative measurements and can be either discrete or continuous. Example from the Iris dataset: The 'sepal length' feature is a continuous numerical feature that quantifies the length of the sepal in centimeters. Example values could be 5.1, 4.9, etc.

- **Nominal:** Nominal (categorical) features represent categories or groups that do not have a specific order or ranking. Hypothetical example: A dataset of cars might have a 'color' feature that is nominal. Example values could be Red, Blue, Green, etc.

- **Date:** Date features represent dates and/or times, often used to mark events or records. Hypothetical example: A dataset of library book checkouts might include a 'checkout date' feature. Example values could be 2023-01-15, etc., in the format YYYY-MM-DD.

- **Text:** Text features contain textual data, often unstructured, such as sentences, paragraphs, or documents. Hypothetical example: A dataset of customer reviews might include a 'review' feature. Example values could be "Excellent service, will come again!", "Product not as described, very disappointed.", etc.

- **Image:** Image features consist of data in the form of images, which can be used for tasks like image classification, object detection, etc. Hypothetical example: A dataset for a facial recognition system might include an 'image' feature, where each record is an image file of a person's face. Example values could be image files with names like person1.jpg, person2.jpg, etc.

- **Dependent variable:** The variable that we are trying to predict or explain. Example from the Iris dataset: The 'species' feature is the dependent variable, indicating the species of Iris plant (Iris setosa, Iris virginica, Iris versicolor) for classification tasks.


### 3. 
[20 pts] Using online resources, research and find other classifier performance metrics
which are also as common as the accuracy metric. Provide the mathematical equations
for them and explain in your own words the meaning of the different metrics you found.
Note that providing mathematical equations might involve defining some more fundamental
terms, e.g. you should define “False Positive,” if you answer with a metric that builds on
that.

**Ans:**
1. **Precision**:
Precision measures the accuracy of positive predictions. It is the ratio of correctly predicted positive observations to the total predicted positives. The formula is given by:
$$
\text{Precision} = \frac{TP}{TP + FP}
$$
Where:
- \(TP\) (True Positives) is the number of positive instances correctly identified by the model.
- \(FP\) (False Positives) is the number of negative instances incorrectly labeled as positive by the model.

2. **Recall (Sensitivity or True Positive Rate)**:
Recall is the ratio of correctly predicted positive observations to all observations that are actually positive. It is defined as:
$$
\text{Recall} = \frac{TP}{TP + FN}
$$
Where:
- \(FN\) (False Negatives) is the number of positive instances incorrectly labeled as negative by the model.

3. **F1 Score** [3]:
The F1 Score is the harmonic mean of Precision and Recall, providing a single metric to assess a model's performance when you need a balance between precision and recall. The formula is:
$$
\text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
$$
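
The three formulas above can be checked on a small example. The sketch below assumes binary labels encoded as 0 and 1 and returns 0.0 when a denominator is zero, which is a convention chosen here rather than part of the definitions; the data is made up.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels encoded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 from the confusion counts."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up labels and predictions: TP=3, FP=1, FN=1, TN=3.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
print(confusion_counts(y_true, y_pred))  # (3, 1, 1, 3)
print(precision_recall_f1(y_true, y_pred))  # (0.75, 0.75, 0.75)
```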

4. **AUC-ROC** [3]:
The ROC (Receiver Operating Characteristic) curve visualizes the model's performance by plotting the true positive rate against the false positive rate at various classification thresholds. The AUC is the area under this curve and summarizes performance across all thresholds: it measures separability, i.e., how well the model distinguishes between the classes. An AUC of 1.0 indicates perfect separation, while 0.5 corresponds to random guessing. ROC curves are most informative when the class distribution is roughly balanced; with heavily imbalanced data, precision and recall often give a clearer picture.

- True Positive Rate (TPR): 
$$
\text{TPR} = \frac{TP}{TP + FN}
$$
- False Positive Rate (FPR): 
$$
\text{FPR} = \frac{FP}{FP + TN}
$$

Where \(TN\) (True Negatives) is the number of negative instances correctly identified by the model.
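
As an illustration, the AUC can be computed without drawing the curve by using a known equivalence: it equals the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below uses that pairwise definition; the function name and toy scores are assumptions for the example, and the quadratic loop is only suitable for small inputs.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correctly ranked pair
    return wins / (len(positives) * len(negatives))


# Made-up labels and predicted scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, scores))  # 0.75
```

Here three of the four positive/negative pairs are ranked correctly, giving 0.75.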

5. **Log Loss (Cross-Entropy Loss)** [4]:
Log Loss evaluates a classifier's predicted probabilities rather than its hard labels, penalizing confident but wrong predictions most heavily. It is defined for a binary classifier as:
$$
\text{Log Loss} = -\frac{1}{N} \sum_{i=1}^{N} [y_i \log(p_i) + (1 - y_i) \log(1 - p_i)]
$$
Where:
- \(N\) is the number of samples.
- \(y_i\) is the actual label of instance \(i\), which can be 0 or 1.
- \(p_i\) is the predicted probability that instance \(i\) is of class 1.
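
A direct translation of the formula above into Python might look like the following sketch. The `eps` clipping is an assumption added here to avoid taking `log(0)` when a predicted probability is exactly 0 or 1; it is not part of the definition.

```python
from math import log


def log_loss(y_true, p_pred, eps=1e-15):
    """Binary cross-entropy averaged over all samples."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to keep log() finite
        total += y * log(p) + (1 - y) * log(1 - p)
    return -total / len(y_true)


print(log_loss([1, 0], [0.9, 0.1]))  # confident and correct: small loss
print(log_loss([1, 0], [0.1, 0.9]))  # confident and wrong: large loss
```

The two calls use the same labels with mirrored probabilities, which shows how sharply the loss grows when the model is confidently wrong.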

### 4. 
Implement a correlation program from scratch to look at the correlations between the features of Admission_Predict.csv dataset file. (This Graduate Admission dataset, with 9 features and 500 data points, is not provided on Canvas; you have to download it from Kaggle by following the instructions in the module Jupyter notebook.) 
Remember, you are not allowed to use numpy functions such as mean(), stdev(), cov(), etc.
You may use DataFrame.corr() only to verify the correctness of your from-scratch matrix.
  
Display the correlation matrix where each row and column are the features. (Hint: this should be an 8 by 8 matrix.)

In [1]:
import pandas as pd
# Locate and load the data file
df = pd.read_csv('../../Desktop/APML/Datasets/Admission_Predict_Ver1.1.csv')

# Check dimensions
print(f'N rows={len(df)}, M columns={len(df.columns)}')
df.head()

N rows=500, M columns=9


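|   | Serial No. | GRE Score | TOEFL Score | University Rating | SOP | LOR | CGPA | Research | Chance of Admit |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 337 | 118 | 4 | 4.5 | 4.5 | 9.65 | 1 | 0.92 |
| 1 | 2 | 324 | 107 | 4 | 4.0 | 4.5 | 8.87 | 1 | 0.76 |
| 2 | 3 | 316 | 104 | 3 | 3.0 | 3.5 | 8.0 | 1 | 0.72 |
| 3 | 4 | 322 | 110 | 3 | 3.5 | 2.5 | 8.67 | 1 | 0.8 |
| 4 | 5 | 314 | 103 | 2 | 2.0 | 3.0 | 8.21 | 0 | 0.65 |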


In [2]:
def mean(values):
    """Arithmetic mean of a sequence of numbers."""
    return sum(values) / len(values)

def stddev(values, values_mean):
    """Sample standard deviation (divides by n - 1)."""
    return (sum((x - values_mean) ** 2 for x in values) / (len(values) - 1)) ** 0.5

def covariance(x, x_mean, y, y_mean):
    """Sample covariance of two equal-length sequences (divides by n - 1)."""
    # zip() pairs values by position, so this does not depend on the index labels
    total = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    return total / (len(x) - 1)

# Function to calculate Pearson's correlation coefficient
def pearson_correlation(x, y):
    mean_x, mean_y = mean(x), mean(y)
    covar = covariance(x, mean_x, y, mean_y)
    stddev_x, stddev_y = stddev(x, mean_x), stddev(y, mean_y)
    return covar / (stddev_x * stddev_y)

# Initialize an empty DataFrame for the correlation matrix.
# All 9 columns are kept here, including 'Serial No.', so that its correlations
# can be inspected; dropping it would give the 8 x 8 matrix from the hint.
features = df.columns
correlation_matrix = pd.DataFrame(index=features, columns=features)

# Calculate the correlation matrix
for col1 in features:
    for col2 in features:
        correlation_matrix.loc[col1, col2] = pearson_correlation(df[col1], df[col2])

# Print the DataFrame directly to display it as a table
print(correlation_matrix)

                  Serial No. GRE Score TOEFL Score University Rating  \
Serial No.               1.0 -0.103839   -0.141696         -0.067641   
GRE Score          -0.103839       1.0      0.8272          0.635376   
TOEFL Score        -0.141696    0.8272         1.0          0.649799   
University Rating  -0.067641  0.635376    0.649799               1.0   
SOP                -0.137352  0.613498     0.64441          0.728024   
LOR                -0.003694  0.524679    0.541563          0.608651   
CGPA               -0.074289  0.825878    0.810574          0.705254   
Research           -0.005332  0.563398    0.467012          0.427047   
Chance of Admit     0.008505  0.810351    0.792228          0.690132   

                        SOP      LOR       CGPA  Research Chance of Admit   
Serial No.        -0.137352 -0.003694 -0.074289 -0.005332         0.008505  
GRE Score          0.613498  0.524679  0.825878  0.563398         0.810351  
TOEFL Score         0.64441  0.541563  0.810574 

In [3]:
# Calculate the correlation matrix using pandas
correlation_matrix_pandas = df.corr()

# Display the correlation matrix from pandas for comparison
print(correlation_matrix_pandas)

                   Serial No.  GRE Score  TOEFL Score  University Rating  \
Serial No.           1.000000  -0.103839    -0.141696          -0.067641   
GRE Score           -0.103839   1.000000     0.827200           0.635376   
TOEFL Score         -0.141696   0.827200     1.000000           0.649799   
University Rating   -0.067641   0.635376     0.649799           1.000000   
SOP                 -0.137352   0.613498     0.644410           0.728024   
LOR                 -0.003694   0.524679     0.541563           0.608651   
CGPA                -0.074289   0.825878     0.810574           0.705254   
Research            -0.005332   0.563398     0.467012           0.427047   
Chance of Admit      0.008505   0.810351     0.792228           0.690132   

                        SOP      LOR       CGPA  Research  Chance of Admit   
Serial No.        -0.137352 -0.003694 -0.074289 -0.005332          0.008505  
GRE Score          0.613498  0.524679  0.825878  0.563398          0.810351  
TOEFL

- Should we use 'Serial No.'? Why or why not?

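**Ans:** No, we should not use 'Serial No.' as a feature. It is simply a unique identifier for each entry in the dataset and holds no intrinsic predictive value regarding the outcome. The correlation matrix supports this: its correlations with every other variable are close to zero (all between about -0.14 and 0.01).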

- Observe that the diagonal of this matrix should have all 1's; why is this?

**Ans:** The diagonal of a correlation matrix has all 1's because it represents the correlation of each variable with itself. Since the correlation measures the strength and direction of a linear relationship between two variables, the correlation of a variable with itself is always perfect, hence a coefficient of 1.
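
The same conclusion follows directly from the formula used in `pearson_correlation`. Writing \(r_{XX}\) for the correlation of a feature \(X\) with itself and \(s_X\) for its sample standard deviation (notation introduced here for illustration), and assuming the feature is not constant so that \(s_X > 0\):

$$
r_{XX} = \frac{\operatorname{cov}(X, X)}{s_X \, s_X} = \frac{s_X^2}{s_X^2} = 1
$$

The covariance of a variable with itself is its variance, which cancels exactly against the product of the two identical standard deviations.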

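- Since the last column can be used as the target (dependent) variable, what do you think about the correlations between all the variables?

**Ans:** Variables with higher absolute correlation with the target 'Chance of Admit' are more strongly linearly related to it, positively or negatively. In the matrix above, 'GRE Score' (0.81), 'TOEFL Score' (0.79), and 'University Rating' (0.69) all correlate strongly and positively with 'Chance of Admit', while 'Serial No.' (0.01) is essentially uncorrelated. The predictors are also strongly correlated with one another (for example, 'GRE Score' and 'TOEFL Score' at 0.83, and 'GRE Score' and 'CGPA' at 0.83), so they carry partly redundant information.

- Which variable should be the most important to try to predict 'Chance of Admit'?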

**Ans:** Based on the correlation matrix, the variable 'CGPA' has the highest correlation with 'Chance of Admit' (0.882413). This suggests that 'CGPA' likely would be the most important predictor in a predictive model for admission chances.

## References:

1. DeepAI. (n.d.). Support Vector Machine Definition. Retrieved from https://deepai.org/machine-learning-glossary-and-terms/support-vector-machine
2. IBM. (n.d.). Random Forest. Retrieved from https://www.ibm.com/topics/random-forest 
3. Erickson, B. J., & Kitamura, F. (2021). Magician’s Corner: 9. Performance Metrics for Machine Learning Models. *Radiology: Artificial Intelligence*, *3*(3). https://doi.org/10.1148/ryai.2021200126
4. Roberts, A. (2023, January 30). Binary Cross Entropy: Where To Use Log Loss In Model Monitoring. Arize Blog. https://arize.com/blog-course/binary-cross-entropy-log-loss/ 
