# Module 7 Notes: Metrics and Model Development

## Metrics

Metrics should be unbiased, universal, and concise.

1. A way to obtain similar responses
2. A way to measure the performance
3. A way to measure prediction

For our sample analysis we will use K-Nearest Neighbors (`KNN`):
- K is an arbitrary pick
- Need a "base case"
- Compare the neighbors
- Sort the results
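
The steps above can be sketched in a few lines. This is only an illustration on made-up data: the movie titles, the years, the base year, and the helper names `year_distance` and `k_nearest` are assumptions for this example, not part of the course code.

```python
K = 3  # illustrative value only; K is an arbitrary pick

# Hypothetical toy data, used only for this sketch
movies = [
    {"title": "Movie A", "year": 1984},
    {"title": "Movie B", "year": 1999},
    {"title": "Movie C", "year": 1986},
    {"title": "Movie D", "year": 1970},
    {"title": "Movie E", "year": 1985},
]
base_year = 1985  # the "base case" every movie is compared against


def year_distance(base: int, other: int) -> int:
    """Distance between two years: smaller means more similar."""
    return abs(base - other)


def k_nearest(candidates: list, base: int, k: int) -> list:
    """Score every candidate against the base case, sort, keep the k closest."""
    scored = [(year_distance(base, m["year"]), m["title"]) for m in candidates]
    scored.sort(key=lambda pair: pair[0])  # stable sort: ties keep input order
    return scored[:k]


nearest = k_nearest(movies, base_year, K)
for distance, title in nearest:
    print(f"{title}: distance {distance}")
```

Any of the metrics in these notes can replace `year_distance`; only the scoring step changes, while the compare, sort, and keep-K steps stay the same.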

Data set for this analysis:
```bash
icarus.cs.weber.edu:~hvalle/cs4580/data/movies.csv
```

In [3]:
import pandas as pd
import numpy as np
import get_data as gt  # download and load data
import Levenshtein  # Levenshtein distance
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

# Constants
K = 10  # number of closest matches
BASE_CASE_ID = 88763  # IMDB id for 'Back to the Future'
SECOND_CASE_ID = 89530  # IMDB id for 'Mad Max Beyond Thunderdome'
BASE_YEAR = 1985  # release year of 'Back to the Future'

METRIC1_WT = 0.2  # weight for cosine similarity
METRIC2_WT = 0.8  # weight for weighted Jaccard similarity

### KNN-Euclidean Distance

The Euclidean distance is the straight-line distance between two points
in `N-dimensional` space.

Formula:

$$
d(p, q) = \sqrt{\sum_{i=1}^n (q_i - p_i)^2}
$$

where
- $p = (p_1, p_2, \dots, p_n)$
- $q = (q_1, q_2, \dots, q_n)$

#### Task:
Find the distance between these points:
- x = (0,0)
- y = (4,4)

Distance = 5.65685...
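
The task can be checked with a short sketch of the formula. The function name `euclidean` is an assumption for this example; it is a general N-dimensional version, not the year-based helper used in the course code.

```python
import math


def euclidean(p: tuple, q: tuple) -> float:
    """Euclidean distance between two points with the same number of dimensions."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))


x = (0, 0)
y = (4, 4)
distance = euclidean(x, y)  # sqrt(4**2 + 4**2) = sqrt(32)
print(round(distance, 5))  # 5.65685
```

With a single dimension, such as a release year, the formula reduces to the absolute difference between the two values.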

In [2]:
def euclidean_distance(base_case_year: int, comparator_year: int):
    """Euclidean distance between two years

    Args:
        base_case_year (int): Base case year
        comparator_year (int): Comparator year

    Returns:
        int: Absolute difference between the two years
    """
    return abs(base_case_year - comparator_year)

### KNN with Jaccard Similarity Index
The Jaccard index compares the members of two sets to determine which members are `shared` and which are `distinct`.
It measures the similarity between the two sets as a value from 0 (nothing shared) to 1 (identical sets).

$$
J(A, B) = \frac{|A \cap B|}{|A \cup B|}
$$
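
A minimal sketch of the index using Python sets. The function name `jaccard` and the two genre sets are illustrative assumptions, not values read from `movies.csv`.

```python
def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


# Illustrative genre sets
genres_a = {"Adventure", "Comedy", "Sci-Fi"}
genres_b = {"Action", "Adventure", "Sci-Fi"}

score = jaccard(genres_a, genres_b)  # 2 shared genres out of 4 distinct genres
print(score)  # 0.5
```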

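### KNN with Weighted Jaccard Similarity Index
The traditional Jaccard index works well when doing
`one-to-one` comparisons within a category.
It treats every member equally, so it is less suited to comparing one movie against a whole list of preferred movies.

One solution is the `weighted` version.
- build a dictionary for `each genre` of the movies in our preferred list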
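
One possible way to build and use such a dictionary is sketched below. The weighting scheme (genre counts, with unseen genres counted once in the denominator) and the names `build_genre_weights` and `weighted_jaccard` are assumptions made for this example; the course implementation may differ.

```python
from collections import Counter


def build_genre_weights(preferred_genres: list) -> Counter:
    """Count how often each genre appears across the preferred movies."""
    return Counter(genre for genres in preferred_genres for genre in genres)


def weighted_jaccard(weights: Counter, candidate_genres: list) -> float:
    """Weight of the shared genres divided by the weight of all genres involved.

    Assumption: a candidate genre that never appears in the preferred list
    adds 1 to the denominator, so the score stays between 0.0 and 1.0.
    """
    candidate = set(candidate_genres)
    shared = sum(w for genre, w in weights.items() if genre in candidate)
    unseen = len(candidate - set(weights))
    total = sum(weights.values()) + unseen
    return shared / total if total else 0.0


# Illustrative preferred list: each entry is one movie's genres
preferred = [
    ["Adventure", "Comedy", "Sci-Fi"],
    ["Action", "Adventure", "Sci-Fi"],
]
weights = build_genre_weights(preferred)
wj_score = weighted_jaccard(weights, ["Adventure", "Sci-Fi", "Drama"])
print(weights)
print(wj_score)  # shared weight 4 over total weight 7
```

Genres that occur more often in the preferred list contribute more to the score, which is the point of weighting.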

In [None]:
# see
def weighted_jaccard_weighted():
    pass


### KNN with Levenshtein Distance
The Levenshtein distance is the minimum number of single-character edits needed to transform
an initial sequence into a target sequence.

- It is used to determine the difference between two sequences (strings)
- It is the distance between two words (minimum number of single-character edits)
  - insertions, deletions, or substitutions

$$
D(i, j) = 
\begin{cases}
j & \text{if } i = 0 \\
i & \text{if } j = 0 \\
D(i-1, j-1) & \text{if } s[i] = t[j] \\
1 + \min \{D(i-1, j), D(i, j-1), D(i-1, j-1)\} & \text{if } s[i] \neq t[j]
\end{cases}
$$

#### For example:

Consider these strings:

- s = 'kitten'
- t = 'sitting'

Find the `Levenshtein` Distance
1. Substitute `k` with `s` in `kitten` -> `sitten` (1 substitution)
2. Substitute `e` with `i` in `sitten` -> `sittin` (1 substitution)
3. Insert `g` at the end of `sittin` -> `sitting` (1 insertion)

The result is 3 edits, so the distance is $3$.
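
The recurrence can be checked against this example with a small dynamic-programming sketch in plain Python. The function name `levenshtein` is an assumption for this example; it does not use the `Levenshtein` package.

```python
def levenshtein(s: str, t: str) -> int:
    """Minimum number of insertions, deletions, or substitutions turning s into t."""
    rows, cols = len(s) + 1, len(t) + 1
    # d[i][j] holds the distance between the first i chars of s and first j of t
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            if s[i - 1] == t[j - 1]:
                d[i][j] = d[i - 1][j - 1]  # characters match: no extra edit
            else:
                d[i][j] = 1 + min(
                    d[i - 1][j],      # deletion
                    d[i][j - 1],      # insertion
                    d[i - 1][j - 1],  # substitution
                )
    return d[-1][-1]


print(levenshtein("kitten", "sitting"))  # 3
```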

In [None]:
# see
def knn_levenshtein_title():
    pass

Need this package:
```bash
# The virtual environment must be running Python 3.11 or earlier
pip install Levenshtein
```

### KNN Cosine Similarity Distance

This is used to measure the cosine of the angle between two vectors in a 
multi-dimensional space. This is commonly used in text analysis to measure 
similarities between documents.

$$
\text{Cosine Similarity} = \cos(\theta)
= \frac{A \cdot B}{|A| |B|}
= \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \cdot \sqrt{\sum_{i=1}^{n} B_i^2}}
$$

Where
- $ A \cdot B$ is the dot product of vectors $A$ and $B$
- $|A|$ and $|B|$ are the magnitudes (or Euclidean norms) of vectors $A$ and $B$
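
The formula can be sketched directly on small numeric vectors. The function name `cosine` and the sample vectors are assumptions for this example; the course code applies scikit-learn's `cosine_similarity` to TF-IDF vectors instead.

```python
import math


def cosine(a: list, b: list) -> float:
    """Dot product of a and b divided by the product of their magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(cosine([1, 0], [0, 1]))        # perpendicular vectors
print(cosine([1, 2, 3], [2, 4, 6]))  # same direction
print(cosine([1, 0], [-1, 0]))       # opposite directions
```

The three calls cover the extremes: perpendicular vectors score 0, vectors pointing the same way score 1, and opposite vectors score -1.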

In [8]:
def cosine_and_weighted_jaccard(df: pd.DataFrame, plots: str, comparator_movie: pd.Series):
    # Perform the cosine similarity and weighted Jaccard metrics:
    cs_result = cosine_similarity_function(plots, comparator_movie["plot"])
    weighted_dictionary = _get_weighted_jaccard_similarity_dict(df)
    # Assumption: weighted_jaccard_similarity takes the genre-weight dictionary built above
    wjs_result = weighted_jaccard_similarity(
        weighted_dictionary, comparator_movie["genres"]
    )

    # Normalization:
    # The weighted Jaccard similarity result has a range from 0.0 to 1.0.
    # The cosine similarity result has a range from -1.0 to 1.0. We need to change the range for the cosine similarity result.
    # First, add 1 to the cosine similarity result so that it has a range from 0.0 to 2.0
    # Second, divide the result by 2.0 so that it has a range from 0.0 to 1.0:
    cs_result = (cs_result + 1) / 2.0

    # Weights:
    # Use a weight of 0.2 (20%) for the cosine similarity result:
    cs_result *= METRIC1_WT
    # Use a weight of 0.8 (80%) for the weighted Jaccard similarity result:
    wjs_result *= METRIC2_WT
    return wjs_result + cs_result

In [4]:
def cosine_similarity_function(base_case_plot, comparator_plot):
    # this line will convert the plots from strings to vectors in a single matrix:
    tfidf_vectorizer = TfidfVectorizer()
    tfidf_matrix = tfidf_vectorizer.fit_transform(
        (base_case_plot, comparator_plot))
    results = cosine_similarity(tfidf_matrix[0], tfidf_matrix[1])
    return results[0][0]

### KNN Combining Metrics and Filtering Conditions

Two main concerns with `filtering`:
- Making it too complicated (think hard SQL queries)
- Making it too strict (ending up with no results)

Combine `metrics` to generate `one` result:
- Weight each metric
    - Should metrics contribute equally? (50%-50%, 80%-20%)
- Normalize the metrics before combining them
    - Make sure they have the same range

For our example, we will use:
- `Cosine`: 20% weight, applied to the `plot`
- `Weighted Jaccard`: 80% weight, applied to the `genres`

In [6]:
# See
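def cosine_and_weighted_jaccard(df: pd.DataFrame, plots: str, comparator_movie: pd.Series):
    # Perform the cosine similarity and weighted Jaccard metrics:
    cs_result = cosine_similarity_function(plots, comparator_movie["plot"])
    weighted_dictionary = _get_weighted_jaccard_similarity_dict(df)
    # Assumption: weighted_jaccard_similarity takes the genre-weight dictionary built above
    wjs_result = weighted_jaccard_similarity(
        weighted_dictionary, comparator_movie["genres"]
    )
    # Normalization and weighting follow, as in the full definition in the cosine similarity section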

## Prediction Metrics

If I predict that it will snow tomorrow, to check my answer I have to wait until it's tomorrow and see if it snows.

A **prediction** is simply a guess about what is going to transpire. The simplest kind of prediction is a `yes` or `no` answer.

How do we measure the `accuracy` of the prediction?

New file:
```bash
accuracy_metric.py
```

### Confusion Matrix
A confusion matrix measures how well your classification model performs. The model can be `binary` or `multi-class`. Each entry in a confusion matrix represents a specific combination of `predicted vs actual` classes.

For binary classification, you have `four` parts:
- `True Positive (TP)`: Correctly predicted positive observations
- `True Negative (TN)`: Correctly predicted negative observations
- `False Positive (FP)`: Incorrectly predicted positive observations (also known as `Type I Error`)
- `False Negative (FN)`: Incorrectly predicted negative observations (also known as `Type II Error`)

The structure of the matrix is as follows: 

|       | Predicted Positive | Predicted Negative|
|-------|--------------------|-------------------|
|Actual Positive | True Positive (TP) | False Negative (FN) |
|Actual Negative | False Positive (FP) | True Negative (TN) |

Key metrics:
- **Accuracy** = $\frac{TP + TN}{TP + TN + FP + FN}$
- **Precision** = $\frac{TP}{TP + FP}$ (useful for imbalanced classes, for example many more yes's than no's)
- **Recall** (or **Sensitivity**) = $\frac{TP}{TP + FN}$
- **F1 Score** = $2 \times \frac{Precision \times Recall}{Precision + Recall}$ (the harmonic mean of Precision and Recall)
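
The four counts and the key metrics can be computed by hand on a small example. This is a sketch on made-up labels; the function names `confusion_counts` and `key_metrics` and the sample data are assumptions for this example, not the contents of the course files.

```python
def confusion_counts(actual: list, predicted: list) -> tuple:
    """Return (TP, TN, FP, FN) for binary labels where 1 = positive, 0 = negative."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1
        elif a == 0 and p == 0:
            tn += 1
        elif a == 0 and p == 1:
            fp += 1  # Type I error
        else:
            fn += 1  # Type II error
    return tp, tn, fp, fn


def key_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Accuracy, precision, recall, and F1 score, guarding against division by zero."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels for illustration
actual = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0]

counts = confusion_counts(actual, predicted)
results = key_metrics(*counts)
print(counts)   # (3, 3, 1, 1)
print(results)
```

Returning 0.0 when a denominator is zero is a convention chosen for this sketch; other conventions, such as raising an error, are equally reasonable.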

New python file:
```bash
confusion_matrix.py
```