# Using the Item-Item Collaborative Filtering algorithm (T10 - slide 24 to 29)

**Objective**: Implement the baseline for the Collaborative Filtering (CF) method.
1. Create a miniHashLsh where we can retrieve the K nearest neighbors of a molecule (based on Jaccard similarity).
2. Select some lines of the training data to test our estimator.
3. Implement the estimator.
    - Given a protein and a molecule ID
    - Retrieve the K nearest neighbors of the molecule (Molecules most similar to the given molecule, that have an activity value for the given protein)
    - Estimate the activity level as weighted average.
4. Test the estimator with the selected lines (notice that the line of the protein does not count as a neighbor, only as groundTruth).



VER O QUE FAZER COM AS SITUAÇÕES ONDE NÃO HÁ VIZINHOS (MAYBE GLOBAL BASELINE??)

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
# imports
import random

import pandas as pd

from helpers import *

In [3]:
# Constants

ACTIVITY_TRAIN = pd.read_csv('./activity_train.csv', names=['uniprot_id', 'mol_id', 'activity'])
ACTIVITY_TRAIN["mol_id"] = ACTIVITY_TRAIN["mol_id"].apply(remove_blank_space)

ACTIVITY_TEST = pd.read_csv('./activity_test_blanked.csv', names=['uniprot_id', 'mol_id', 'activity'])
ACTIVITY_TEST["mol_id"] = ACTIVITY_TEST["mol_id"].apply(remove_blank_space)

len(ACTIVITY_TRAIN), len(ACTIVITY_TEST)

(135711, 4628)

# 1. MiniHashLsh

In [4]:
from sim import *

# 2. Select some lines of the training data to test our estimator

Selected 33% of the lines to check the performance of the estimator.

In [5]:
random.seed(42)
random_indexes = random.sample(range(0, len(ACTIVITY_TRAIN)), len(ACTIVITY_TRAIN) // 33)
ACTIVITY_VAL = ACTIVITY_TRAIN.iloc[random_indexes]
ACTIVITY_VAL["predicted"] = [0] * len(ACTIVITY_VAL)
ACTIVITY_VAL

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  ACTIVITY_VAL["predicted"] = [0] * len(ACTIVITY_VAL)


Unnamed: 0,uniprot_id,mol_id,activity,predicted
29184,P20309,CHEMBL206127,4,0
6556,P08173,CHEMBL75880,3,0
72097,P32245,CHEMBL393789,5,0
64196,P30542,CHEMBL258755,1,0
58513,P29274,CHEMBL4566592,1,0
...,...,...,...,...
75096,P34969,CHEMBL2164342,5,0
15063,P0DMS8,CHEMBL375501,5,0
45946,P28222,CHEMBL1241546,9,0
129595,Q9H3N8,CHEMBL1915347,5,0


In [16]:
# use_case = ACTIVITY_VAL.iloc[0]
# type(use_case)

In [7]:
# knn = find_similar_keys(use_case["mol_id"], threshold=0.1)

In [73]:
def global_mean(use_case:pd.Series, ACTIVITY_TRAIN:pd.DataFrame):
    subset = ACTIVITY_TRAIN[ACTIVITY_TRAIN["uniprot_id"] == use_case["uniprot_id"]]
    return round(subset["activity"].mean())

def estimate_score(use_case: pd.Series, knn:dict, ACTIVITY_TRAIN:pd.DataFrame) -> int:

    # Retrieve the set of molecules that the protein has activity for
    uniprot_mols = set(ACTIVITY_TRAIN[ACTIVITY_TRAIN["uniprot_id"] == use_case["uniprot_id"]]["mol_id"].values)

    # Intercept the set of molecules that the protein has activity for and the set of molecules that are similar to the molecule
    sim_mols = set(knn.keys())
    intercept_mols = sim_mols.intersection(uniprot_mols)

    # Get a subset of the training data that contains only the molecules that are similar to the molecule
    subset = ACTIVITY_TRAIN[ACTIVITY_TRAIN["mol_id"].isin(intercept_mols)]

    if knn == {} or intercept_mols == {} or subset.empty:
        return global_mean(use_case, ACTIVITY_TRAIN)
    
    # Calculate the estimated activity
    num = 0
    den = 0
    for mol in intercept_mols:
        #      s_ij    *      r_jx
        num += knn[mol] * subset[subset["mol_id"] == mol]["activity"].values[0]
        #     s_ij
        den += knn[mol]


    return round(num / den)

In [74]:
list_preds = []
for i in range(len(ACTIVITY_VAL)):
    # Get the use case
    use_case = ACTIVITY_VAL.iloc[i]

    # Find the similar molecules
    knn = find_similar_keys(use_case["mol_id"], threshold=0.5)

    # Estimate the score
    pred = estimate_score(use_case, knn, ACTIVITY_TRAIN)

    # Append the prediction to the list
    list_preds.append(pred)

In [75]:
ACTIVITY_VAL["predicted"] = list_preds

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  ACTIVITY_VAL["predicted"] = list_preds


In [76]:
ACTIVITY_VAL

Unnamed: 0,uniprot_id,mol_id,activity,predicted
29184,P20309,CHEMBL206127,4,4
6556,P08173,CHEMBL75880,3,4
72097,P32245,CHEMBL393789,5,8
64196,P30542,CHEMBL258755,1,1
58513,P29274,CHEMBL4566592,1,1
...,...,...,...,...
75096,P34969,CHEMBL2164342,5,5
15063,P0DMS8,CHEMBL375501,5,6
45946,P28222,CHEMBL1241546,9,10
129595,Q9H3N8,CHEMBL1915347,5,6


In [79]:
# Contando o número de linhas onde A == B
num_iguais = (ACTIVITY_VAL['activity'] == ACTIVITY_VAL['predicted']).sum()

# Contando o número de linhas onde A != B
num_diferentes = (ACTIVITY_VAL['activity'] != ACTIVITY_VAL['predicted']).sum()

num_diff_por_um = (abs(ACTIVITY_VAL['activity'] - ACTIVITY_VAL['predicted']) == 1).sum()
num_diff_por_dois = (abs(ACTIVITY_VAL['activity'] - ACTIVITY_VAL['predicted']) == 2).sum()

print(f"Número de linhas onde activity == predicted: {num_iguais/len(ACTIVITY_VAL) * 100:.2f}%")
print(f"Número de linhas onde activity != predicted: {(num_diferentes)/len(ACTIVITY_VAL) * 100:.2f}%")

print()

print(f"Número de linhas onde |activity - predicted| == 1: {num_diff_por_um/len(ACTIVITY_VAL) * 100:.2f}%")
print(f"Número de linhas onde |activity - predicted| == 2: {num_diff_por_dois/len(ACTIVITY_VAL) * 100:.2f}%")

print()

print(f"Retirando diferenças por 1, falhados: {(num_diferentes - num_diff_por_um)/len(ACTIVITY_VAL) * 100:.2f}%")
print(f"Retirando diferenças por 1, acertados: {(num_iguais + num_diff_por_um)/len(ACTIVITY_VAL) * 100:.2f}%")

print() 

print(f"Retirando diferenças por 1 e 2, falhados: {(num_diferentes - num_diff_por_dois - num_diff_por_um)/len(ACTIVITY_VAL) * 100:.2f}%")
print(f"Retirando diferenças por 1 e 2, acertados: {(num_iguais + num_diff_por_um + num_diff_por_dois)/len(ACTIVITY_VAL) * 100:.2f}%")

Número de linhas onde activity == predicted: 24.54%
Número de linhas onde activity != predicted: 75.46%

Número de linhas onde |activity - predicted| == 1: 31.71%
Número de linhas onde |activity - predicted| == 2: 18.34%

Retirando diferenças por 1, falhados: 43.75%
Retirando diferenças por 1, acertados: 56.25%

Retirando diferenças por 1 e 2, falhados: 25.41%
Retirando diferenças por 1 e 2, acertados: 74.59%
