# 1. [20 pts] At a high level (i.e., without entering into mathematical details), please describe, compare, and contrast the following classifiers:
 - Perceptron (textbook's version)
 - SVM
 - Decision Tree
 - Random Forest (you have to research a bit about this classifier)


Some comparison criteria can be:
 - Speed?
 - Strength?
 - Robustness?
 - The feature type that the classifier naturally uses (e.g. relying on distance means that
numerical features are naturally used)
 - Is it statistical?
 - Does the method solve an optimization problem? If yes, what is the cost function?


Which one will be the first that you would try on your dataset?

### Answer 1
With an unknown dataset, I would likely try the SVM first.  If I knew that the dataset had relatively few features (fewer than, say, 20), then I would use a decision tree for its speed and simplicity.  I've seen decision trees perform very well on tasks that took deep neural networks incredible amounts of data and time; the paper in [6] is one such example.

- Perceptron: A linear classifier that adjusts its weights whenever it misclassifies a training sample, so it minimizes the error in predictions. It is not statistical, but it does solve an optimization (minimization) problem. It's a relatively weak and non-robust algorithm because it will never converge if the data isn't linearly separable.  The textbook's version has no kernel methods available, so it can't map the data into a higher-dimensional space where a separating hyperplane becomes evident.  The perceptron relies on numeric data.

- SVM: An algorithm for classification and regression.  A hard-margin linear SVM needs linearly separable data; a soft margin or the "kernel trick" relaxes that requirement. It maximizes the margin, meaning the distance between the decision boundary and the nearest samples of each class. SVMs perform better with high-dimensional data than decision trees [2]. It can use a variety of kernels: radial basis function, sigmoid, polynomial, and linear. It is not statistical, but it does solve an optimization (margin maximization) problem.

- Decision tree: An algorithm for classification.  It creates rules on a subset of, or all of, the features in the dataset. Great for nominal data and numeric data. The logic of a decision tree is easy to interpret, and it doesn't require feature scaling.  Not as robust as a random forest: it is more susceptible to overfitting and doesn't generalize as well [1]. It is statistical in that it's looking for things like averages/means, medians, and modes. Quick and easy to train if the dataset has few features or is small.

- Random forest: An ensemble learner for classification and regression. It generalizes well because each tree is grown on a random sample of the training data and features, and the trees' predictions are combined [1]. It is statistical because it is derived from decision trees. It is more time-consuming to train than a single decision tree, since it generates multiple decision trees.
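
The convergence claim in the perceptron bullet can be illustrated with a minimal sketch. It assumes a Rosenblatt-style update rule with a threshold at zero; the textbook's version may differ in details, and the function name `train_perceptron` and its parameters are hypothetical. AND is linearly separable, XOR is not.

```python
def train_perceptron(samples, labels, eta=1.0, max_epochs=50):
    """Train a simple perceptron; return (weights, bias, converged)."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(max_epochs):
        errors = 0
        for x, target in zip(samples, labels):
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            prediction = 1 if activation >= 0.0 else -1
            update = eta * (target - prediction)
            if update != 0.0:
                # Misclassified sample: nudge the boundary toward it.
                weights = [w + update * xi for w, xi in zip(weights, x)]
                bias += update
                errors += 1
        if errors == 0:
            # A full pass with no mistakes means the data was separated.
            return weights, bias, True
    return weights, bias, False


inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_labels = [-1, -1, -1, 1]  # linearly separable
xor_labels = [-1, 1, 1, -1]   # not linearly separable

_, _, and_converged = train_perceptron(inputs, and_labels)
_, _, xor_converged = train_perceptron(inputs, xor_labels)
print(f"AND converged: {and_converged}")
print(f"XOR converged: {xor_converged}")
```

On XOR the loop only stops because of the `max_epochs` cap, which is an assumption added here so the sketch terminates.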

# 2. [20 pts] Define the following feature types and give example values from a dataset. You can pull examples from an existing dataset (like the Iris dataset) or you could write out a dataset yourself. (Hint: In order to give examples for each feature type, you will probably have to use more than one dataset.)
 - Numerical
 - Nominal
 - Date
 - Text
 - Image
 - Dependent variable

### Answer 2
 - Numerical: characteristics of features represented by numerical values.  In the Iris dataset `petal length`, `petal width`, `sepal length`, `sepal width` are all numeric features.
 - Nominal: descriptive (categorical) features without numerical characteristics.  On Kaggle there is a dataset named [Vehicle Dataset](https://www.kaggle.com/datasets/nehalbirla/vehicle-dataset-from-cardekho).  The `Name` feature contains information on the make, model and trim level of the car, which are each features in themselves.
 - Date: These features indicate a time related to the sample.  They can be values such as `year`, `month`, `day`, `hour`, `minute`, `second`, `millisecond`, or a combination of all of those.  The [Vehicle Dataset](https://www.kaggle.com/datasets/nehalbirla/vehicle-dataset-from-cardekho) contains `year`. Another dataset with more date features is [Room Occupancy Estimation](https://archive.ics.uci.edu/dataset/864/room+occupancy+estimation) which contains `date` and `time` features.  `date` is formatted as YYYY/MM/DD, `time` is formatted as HH:MM:SS.
 - Text: these are sentences, or words.  They are used in tasks like sentiment analysis, or perhaps something like classifying a support ticket as "hardware issue" or "finance issue."  An example of a sentiment analysis dataset is the [Customer Feedback Dataset](https://www.kaggle.com/datasets/vishweshsalodkar/customer-feedback-dataset) on Kaggle.  The dataset shows only one feature, `Text, Sentiment, Source, Date/Time, User ID, Location, Confidence Score`, but on inspection we can see that the first field in this single feature is itself a `text` feature.  Another dataset rich with text is the [Support-tickets-classification](https://www.kaggle.com/datasets/aniketg11/supportticketsclassification) dataset on Kaggle.  It contains text features `title` and `body` which can be used to determine what team is responsible for the ticket.
 - Image: image features are just that: images.  From what I've read, they are typically stored as arrays of pixel intensities, often RGB channels with values from 0 to 255, though grayscale images use a single channel.  An example dataset is the [AI vs. Human-Generated Images](https://www.kaggle.com/datasets/alessandrasala79/ai-vs-human-generated-dataset).  This dataset contains a CSV with relative paths to image files; the feature is called `file_name`.  Another example of images in a dataset is [Olivetti Faces](https://scikit-learn.org/stable/datasets/real_world.html#olivetti-faces-dataset), which I reviewed on Scikit-Learn's website.  The dataset has an `images` feature with 400 samples, each of which is a 64x64 matrix of grayscale image values.
 - Dependent variable: this is a variable which is determined by features (or independent variables).  In the Iris dataset these are the species Iris-Setosa, Iris-Versicolour, Iris-Virginica.
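
As a small sketch of how these feature types might look in code, the record below is hypothetical: its values and the `fuel_categories` list are illustrative assumptions, not rows from the datasets linked above. It one-hot encodes a nominal value and parses `date` and `time` strings in the YYYY/MM/DD and HH:MM:SS formats described above.

```python
from datetime import datetime

# Hypothetical record; values are illustrative only.
record = {
    "sepal_length": 5.1,    # numerical
    "fuel": "Diesel",       # nominal
    "date": "2017/12/22",   # date, formatted as YYYY/MM/DD
    "time": "10:49:41",     # time, formatted as HH:MM:SS
}


def one_hot(value, categories):
    """Encode a nominal value as a 0/1 vector with one slot per category."""
    return [1 if value == category else 0 for category in categories]


fuel_categories = ["CNG", "Diesel", "Petrol"]  # assumed category list
fuel_encoded = one_hot(record["fuel"], fuel_categories)

# Combine the two strings and split them into numeric date parts.
timestamp = datetime.strptime(
    f'{record["date"]} {record["time"]}', "%Y/%m/%d %H:%M:%S"
)
date_features = [timestamp.year, timestamp.month, timestamp.day, timestamp.hour]

print(fuel_encoded)
print(date_features)
```

One-hot encoding is only one option for nominal data; it is used here because it does not impose an order on the categories.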



In [1]:
### Code 2

# Directly from https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_olivetti_faces.html#sklearn.datasets.fetch_olivetti_faces
from sklearn.datasets import fetch_olivetti_faces
olivetti_faces = fetch_olivetti_faces()
print(f'Shape of Olivetti faces dataset: {olivetti_faces.data.shape}')


Shape of Olivetti faces dataset: (400, 4096)


# 3. [20 pts] Using online resources, research and find other classifier performance metrics which are also as common as the accuracy metric. Provide the mathematical equations for them and explain in your own words the meaning of the different metrics you found. Note that providing mathematical equations might involve defining some more fundamental terms, e.g. you should define “False Positive,” if you answer with a metric that builds on that.

### Answer 3
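
As a minimal sketch of the kind of metrics this question asks about, assuming binary labels: precision, recall, and F1 are commonly reported alongside accuracy. They build on four counts. A true positive (TP) is a positive sample predicted positive, a false positive (FP) is a negative sample predicted positive, a false negative (FN) is a positive sample predicted negative, and a true negative (TN) is a negative sample predicted negative. Precision is TP / (TP + FP), recall is TP / (TP + FN), and F1 is the harmonic mean of the two. The labels below are made up for illustration and the function names are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for binary labels."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision(tp, fp):
    """Of the samples predicted positive, the fraction that really are."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Of the truly positive samples, the fraction that were found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Made-up labels for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
p = precision(tp, fp)
r = recall(tp, fn)
accuracy = (tp + tn) / len(y_true)
print(f"accuracy={accuracy:.3f} precision={p:.3f} recall={r:.3f} f1={f1_score(p, r):.3f}")
```

Returning 0.0 when a denominator is zero is a convention chosen for this sketch; other conventions exist.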

# 4. [40 pts] Implement a correlation program from scratch to look at the correlations between the features of Admission_Predict.csv dataset file. (This Graduate Admission dataset, with 9 features and 500 data points, is not provided on Canvas; you have to download it from Kaggle by following the instructions in the module Jupyter notebook.) Remember, you are not allowed to use numpy functions such as mean(), stdev(), cov(), etc. You may use DataFrame.corr() only to verify the correctness of your from-scratch matrix.
## Display the correlation matrix where each row and column are the features. (Hint: this
should be an 8 by 8 matrix.)
 - Should we use 'Serial no'? Why or why not?
 - Observe that the diagonal of this matrix should have all 1's; why is this?
 - Since the last column can be used as the target (dependent) variable, what do you
think about the correlations between all the variables?
 - Which variable should be the most important to try to predict 'Chance of Admit'?

### Answer 4
- We should not use serial number as it has no bearing on the applicant, it's an arbitrarily assigned number.
- Because the diagonal represents each vector, Xi, correlated with itself.  Since the two vectors are identical, they are perfectly correlated, which gives a coefficient of 1.
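- Every feature except `Serial No.` is positively correlated with `Chance of Admit`, ranging from 0.55 (`Research`) to 0.87 (`CGPA`), while `Serial No.` is essentially uncorrelated with it (0.04).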
- The most important variable for predicting 'Chance of Admit' is `CGPA`, because it has the highest correlation with the target (0.873).
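
The statement about the diagonal can be checked directly from the Pearson formula. This is a short derivation assuming the feature $x$ is not constant, so the denominator is non-zero. Substituting $y = x$:

$$
r_{xy} = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i}(x_i - \bar{x})^2}\,\sqrt{\sum_{i}(y_i - \bar{y})^2}}
\quad\Longrightarrow\quad
r_{xx} = \frac{\sum_{i}(x_i - \bar{x})^2}{\sqrt{\sum_{i}(x_i - \bar{x})^2}\,\sqrt{\sum_{i}(x_i - \bar{x})^2}} = 1
$$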



##### Note on my code:
I didn't understand whether we were supposed to provide the correlation matrix or just the correlation coefficient for each column. Please find both included. 
- functions are defined first
- data is read in
- my functions are applied, and followed by the analog in Pandas

In [110]:
"""
    Note to the grader: 
    - functions are defined first
    - data is read in
    - my functions are applied, and followed by the analog in Pandas
"""

import pandas as pd
import numpy as np




def vector_checker(vec1, vec2):
    """
    Returns TRUE if the vectors are of the same size, FALSE otherwise
    
    Parameters
    ----------
    vec1 : {numpy.ndarray}, shape = [n, 1]
      A vector
    vec2 : {numpy.ndarray}, shape = [n, 1]
      A vector
    
    Returns
    -------
    bool
    """
    if len(vec1) != len(vec2):
        print(f"Input vectors must be of same length. Received vec1 of length {len(vec1)}; vec2 of length {len(vec2)}.")
        return False
    if len(vec1) == 0:
        print("Vectors must have non-zero length")
        return False
    return True




def mean(vec):
    """
    Returns the mean (expected value) of a vector
    
    Parameters
    ----------
    vec : {numpy.ndarray}, shape = [n, 1]
      A vector
    
    Returns
    -------
    float
    """
    return np.sum(vec)/len(vec)




def std_dev(vec):
    """
    Returns the standard deviation of a vector
    
    Parameters
    ----------
    vec : {numpy.ndarray}, shape = [n, 1]
      A vector
    
    Returns
    -------
    ret : float
    """
    mu = mean(vec)
    ret = (vec - mu)**2
    ret = ret.sum()
    ret = np.sqrt(ret / len(vec))
    return ret




def variance(vec):
    """
    Returns the variance of a vector
    
    Parameters
    ----------
    vec : {numpy.ndarray}, shape = [n, 1]
      A vector
    
    Returns
    -------
    ret : float
    """
    mu = mean(vec)
    n = len(vec)
    
    ret = 0
    for i in range(len(vec)):
        ret += (vec[i] - mu)**2 / n
    return ret




def covariance(vec1, vec2):
    """
    Returns the covariance of vectors vec1 and vec2 
    
    Parameters
    ----------
    vec1 : {numpy.ndarray}, shape = [n, 1]
      A vector
    vec2 : {numpy.ndarray}, shape = [n, 1]
      A vector
    
    Returns
    -------
    ret : float
    """
    if not vector_checker(vec1, vec2):
        return "N/A"

    count = len(vec1)
    mu_vec1 = mean(vec1)
    mu_vec2 = mean(vec2)
    ret = 0
    for i in range(count):
        ret += (vec1[i] - mu_vec1)*(vec2[i] - mu_vec2)
    return ret / (count-1)




def dot_product(vec1, vec2):
    """
    Returns the dot product of vectors vec1 and vec2 
    
    Parameters
    ----------
    vec1 : {numpy.ndarray}, shape = [n, 1]
      A vector
    vec2 : {numpy.ndarray}, shape = [n, 1]
      A vector
    
    Returns
    -------
    ret : float
    """
    if not vector_checker(vec1, vec2):
        return "N/A"
    product = 0
    for i in range(len(vec1)):
        product += vec1[i] *  vec2[i]
    return product


def correlation_coefficient(vec1, vec2):
    """
    Returns the Pearson correlation coefficient for vec1 and vec2.  Uses the formula at
    https://en.wikipedia.org/wiki/Pearson_correlation_coefficient
    
    Parameters
    ----------
    vec1 : {numpy.ndarray}, shape = [n, 1]
      A vector
    vec2 : {numpy.ndarray}, shape = [n, 1]
      A vector
    
    Returns
    -------
    Pearson's correlation coefficient : float
    """
    mu_vec1 = mean(vec1)
    mu_vec2 = mean(vec2)
    numerator = dot_product(vec1 - mu_vec1, vec2 - mu_vec2)
    denominator = ( np.sum( (vec1 - mu_vec1)**2 ) )**.5 * ( np.sum( (vec2 - mu_vec2)**2 ) )**.5
    return numerator / denominator




def correlate_with_target(feature_data:np.ndarray=None, feature_labels:list=None, target:np.ndarray=None, target_label:str=None):
    """
    Prints each feature name and its correlation with the target.
    
    Parameters
    ----------
    feature_data : {numpy.ndarray}, shape = [n, m]
        An array of feature data, where each feature is in a column, and each row is a sample

    feature_labels: list {str}, shape = [1,m]
        A list of feature names, as strings

    target : {numpy.ndarray}, shape = [n, 1]
        A vector of the target data

    target_label: str
        The target column name.
    
    Returns
    -------
    None
    """
    print('{:<30}'.format('Feature'), end='|\t')
    print('{:<30}'.format(target_label))
    print('-' * 55)
    for i in range(len(feature_labels)):
        print(f'{feature_labels[i]:<30}', end='|\t')
        col = feature_data[:,i]
        print(f'{correlation_coefficient(col,target):<30.6f}')
    print(f'{target_label:<30}', end='|\t')
    print(f'{correlation_coefficient(target,target):<30.6f}')




def correlation_matrix(data:np.ndarray=None, labels:list=None):
    """
    Prints the correlation matrix for every column.
    
    Parameters
    ----------
    data : {numpy.ndarray}, shape = [n, m]
        An array of feature data, where each feature is in a column, and each row is a sample

    labels: list {str}, shape = [1,m]
        A list of column names, as strings
    
    Returns
    -------
    None
    """
    # header row: an empty corner cell followed by every column label
    print(f'{"":<20}', end='\t')
    for label in labels:
        print(f'{label:<20}', end='\t')
    print()
    # one row per column: the row label followed by its correlations
    for i in range(len(labels)):
        print(f'{labels[i]:<20}', end='\t')
        v1 = data[:,i]
        for j in range(len(labels)):
            v2 = data[:,j]
            print(f'{correlation_coefficient(v1,v2):<20.6f}', end='\t')
        print()

    return None

## Read the data
valid = pd.read_csv('datasets/Admission_Predict.csv')
labels = valid.columns.tolist()
data = np.loadtxt('datasets/Admission_Predict.csv',delimiter=',', skiprows=1)
feature_data = data[:,:8]
feature_labels = labels[:8]
target = data[:,data.shape[1]-1]
target_label = labels[len(labels)-1]


## Correlate each feature with the target
correlate_with_target(feature_data=feature_data, feature_labels=feature_labels, target=target, target_label=target_label)
df = pd.DataFrame(data, columns=labels)
print(f'\n\n{"-"*50}\nPandas DataFrame.corrwith():\n{"-"*50}\n{df.corrwith(df["Chance of Admit "])}')
print('\n\n\n')

## Print the correlation matrix
correlation_matrix(data,labels)
print(f'\n\n{"-"*50}\nPandas DataFrame.corr():\n{"-"*50}\n{df.corr()}')

Feature                       |	Chance of Admit               
-------------------------------------------------------
Serial No.                    |	0.042336                      
GRE Score                     |	0.802610                      
TOEFL Score                   |	0.791594                      
University Rating             |	0.711250                      
SOP                           |	0.675732                      
LOR                           |	0.669889                      
CGPA                          |	0.873289                      
Research                      |	0.553202                      
Chance of Admit               |	1.000000                      


--------------------------------------------------
Pandas DataFrame.corrwith():
--------------------------------------------------
Serial No.           0.042336
GRE Score            0.802610
TOEFL Score          0.791594
University Rating    0.711250
SOP                  0.675732
LOR                  0.669889
C

# References
1. Raschka, Sebastian, et al. Machine Learning with PyTorch and Scikit-Learn: Develop machine learning and deep learning models with Python. Packt Publishing Ltd, 2022.
2. IBM. What are support vector machines (SVMs)? https://www.ibm.com/think/topics/support-vector-machine. Last accessed 1 February, 2025.
3. IBM. What is random forest? https://www.ibm.com/think/topics/random-forest.  Last accessed 1 February, 2025.
4. IBM. What is a decision tree? https://www.ibm.com/think/topics/decision-trees. Last accessed 1 February, 2025.
5. Scikit-Learn. https://scikit-learn.org/stable/.
6. Azzouz, Elsayed Elsayed, and Asoke K. Nandi. "Automatic identification of digital modulation types." Signal processing 47.1 (1995): 55-69.
