# MVD Exercise 11


## Part 1 - Collaborative Filtering (item-item)

Build the rating matrix from the lecture's item-item example on slide 24. Compute similarities with centered cosine similarity, as in the lecture, and select the two nearest neighbors. Verify that the result given in the lecture is correct, and implement the function so that any other prediction can be computed just as easily.
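For reference, the two quantities used in this part can be written as follows. The notation is an assumption chosen for this sketch and is not taken from the slides: $r_{ix}$ is the rating of item $i$ by user $x$, $U_i$ is the set of users who rated item $i$, $\bar r_i$ is the mean of item $i$'s known ratings, and $N(i;x)$ is the set of the nearest neighbors of item $i$ that user $x$ has rated.

$$
s_{ij} = \frac{\sum_{x \in U_i \cap U_j} (r_{ix} - \bar r_i)(r_{jx} - \bar r_j)}{\sqrt{\sum_{x \in U_i} (r_{ix} - \bar r_i)^2}\;\sqrt{\sum_{x \in U_j} (r_{jx} - \bar r_j)^2}},
\qquad
\hat r_{ix} = \frac{\sum_{j \in N(i;x)} s_{ij}\, r_{jx}}{\sum_{j \in N(i;x)} s_{ij}}
$$

The numerator of $s_{ij}$ only runs over users who rated both items, because a missing rating contributes 0 after centering.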

In [21]:
import numpy as np

In [22]:
data_table = [
    [1,    None, 3,    None, None, 5,    None, None, 5,    None, 4,  None],
    [None, None, 5,    4,    None, None, 4,    None, None, 2,    1,  3,  ],
    [2,    4,    None, 1,    2,    None, 3,    None, 4,    3,    5,  None],
    [None, 2,    4,    None, 5,    None, None, 4,    None, None, 2,  None],
    [None, None, 4,    3,    4,    2,    None, None, None, None, 2,  5   ],
    [1,    None, 3,    None, 3,    None, None, 2,    None, None, 4,  None],
]
data_table = np.array(data_table)
data_table[data_table == None] = 0
data_table = data_table.astype(float)
data_table

array([[1., 0., 3., 0., 0., 5., 0., 0., 5., 0., 4., 0.],
       [0., 0., 5., 4., 0., 0., 4., 0., 0., 2., 1., 3.],
       [2., 4., 0., 1., 2., 0., 3., 0., 4., 3., 5., 0.],
       [0., 2., 4., 0., 5., 0., 0., 4., 0., 0., 2., 0.],
       [0., 0., 4., 3., 4., 2., 0., 0., 0., 0., 2., 5.],
       [1., 0., 3., 0., 3., 0., 0., 2., 0., 0., 4., 0.]])

In [23]:
cos_sim = lambda a, b: np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

In [24]:
def normailize_matrix(matrix: np.ndarray) -> np.ndarray:
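    """
    Center each row of a ratings matrix so that the mean of its known ratings is 0.
    Zeros are treated as missing ratings and stay 0 in the result.
    :param matrix: ratings matrix in which 0 marks a missing rating
    :return: centered matrix
    """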
    mean_matrix = np.sum(matrix, 1) / np.count_nonzero(matrix, 1)
    result = matrix - mean_matrix.reshape(-1, 1)
    result[np.where(matrix == 0)] = 0
    return result
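To see what the centering step does, here is a minimal, self-contained sketch on a toy matrix. The helper name `center_rows` and the toy data are assumptions made for illustration only. Unlike the function above, this variant also guards against a row with no ratings at all.

```python
import numpy as np

def center_rows(ratings: np.ndarray) -> np.ndarray:
    """Subtract each row's mean over its nonzero entries; zeros (missing) stay zero."""
    rated = ratings != 0
    counts = rated.sum(axis=1, keepdims=True)
    # np.maximum avoids a division by zero for a row without any rating
    means = ratings.sum(axis=1, keepdims=True) / np.maximum(counts, 1)
    return np.where(rated, ratings - means, 0.0)

# hypothetical toy ratings: 2 items x 4 users, 0 = not rated
toy = np.array([[1.0, 0.0, 3.0, 5.0],
                [0.0, 4.0, 0.0, 2.0]])
centered = center_rows(toy)
print(centered)
```

Both toy rows have a mean of 3 over their known ratings, so each centered row sums to 0. A rating that equals its row mean also becomes 0, which is harmless here because such an entry contributes nothing to a dot product either way.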

In [None]:
def collaborative_filtering(row: int, col: int, matrix: np.ndarray, n_neighbors=2) -> np.float64:
    centered_matrix = normailize_matrix(matrix)

    n_rows, _ = matrix.shape

    # cosine similarity between the chosen row and every row of the centered matrix
    chosen_row = centered_matrix[row]
    cos_sims = [cos_sim(chosen_row, centered_matrix[r]) for r in range(n_rows)]

    # keep only the other rows that have a rating in the chosen column
    similarities = [
        (idx, sim) for idx, sim in enumerate(cos_sims)
        if idx != row and matrix[idx, col] != 0
    ]

    # pick the n neighbors with the highest similarity
    similarities.sort(key=lambda x: x[1], reverse=True)
    n_sims = similarities[:n_neighbors]

    # prediction = similarity-weighted average of the neighbors' ratings in the chosen column
    indices, sims = zip(*n_sims)
    sims = np.array(sims)
    indices = list(indices)
    predicted_rating = np.sum(sims * matrix[indices, col]) / sims.sum()

    return predicted_rating
    


In [28]:
collaborative_filtering(row=0, col=4, matrix=data_table)

np.float64(2.586406866934817)
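As a cross-check, the same prediction can be reproduced with a fully vectorized similarity matrix. This is only a sketch under the same assumptions as the notebook (0 marks a missing rating); the names `sim_matrix` and `predict` are introduced here for illustration.

```python
import numpy as np

ratings = np.array([
    [1, 0, 3, 0, 0, 5, 0, 0, 5, 0, 4, 0],
    [0, 0, 5, 4, 0, 0, 4, 0, 0, 2, 1, 3],
    [2, 4, 0, 1, 2, 0, 3, 0, 4, 3, 5, 0],
    [0, 2, 4, 0, 5, 0, 0, 4, 0, 0, 2, 0],
    [0, 0, 4, 3, 4, 2, 0, 0, 0, 0, 2, 5],
    [1, 0, 3, 0, 3, 0, 0, 2, 0, 0, 4, 0],
], dtype=float)

# center each row over its known (nonzero) ratings
rated = ratings != 0
means = ratings.sum(axis=1, keepdims=True) / rated.sum(axis=1, keepdims=True)
centered = np.where(rated, ratings - means, 0.0)

# all pairwise centered cosine similarities in one matrix product
unit = centered / np.linalg.norm(centered, axis=1, keepdims=True)
sim_matrix = unit @ unit.T

def predict(item: int, user: int, k: int = 2) -> float:
    """Similarity-weighted average over the k most similar items the user has rated."""
    candidates = np.flatnonzero(rated[:, user])
    candidates = candidates[candidates != item]
    top = candidates[np.argsort(sim_matrix[item, candidates])[::-1][:k]]
    weights = sim_matrix[item, top]
    return float(weights @ ratings[top, user] / weights.sum())

prediction = predict(item=0, user=4)
print(round(prediction, 4))  # agrees with the value above, 2.5864
```

Computing the whole similarity matrix once is convenient when many predictions are needed, at the cost of computing pairs that a single query never uses.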

## Bonus - Content-based Filtering

Download the Kaggle dataset [Spotify Recommendation system](https://www.kaggle.com/bricevergnou/spotify-recommendation). From the dataset you will need:

- data.csv = features of the individual tracks + the `liked` label used for classification

The task is to:

1. Load the data from the csv file.
2. Create a train (90 %) and test (10 %) split using the [sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) library.
3. Build a logistic regression model (your own implementation, or using sklearn or another library).
4. Evaluate the model on the test set (e.g. with the score method of LogisticRegression).

**To pass, the score must be higher than 89 %.**

Optional:
- plot predicted vs actual values
- try a different model as well

In [32]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.feature_selection import SelectKBest, f_classif

from pprint import pprint

In [33]:
data_file = "data.csv"
df = pd.read_csv(data_file)
df.head()

| | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms | time_signature | liked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.803 | 0.624 | 7 | -6.764 | 0 | 0.0477 | 0.451 | 0.000734 | 0.1 | 0.628 | 95.968 | 304524 | 4 | 0 |
| 1 | 0.762 | 0.703 | 10 | -7.951 | 0 | 0.306 | 0.206 | 0.0 | 0.0912 | 0.519 | 151.329 | 247178 | 4 | 1 |
| 2 | 0.261 | 0.0149 | 1 | -27.528 | 1 | 0.0419 | 0.992 | 0.897 | 0.102 | 0.0382 | 75.296 | 286987 | 4 | 0 |
| 3 | 0.722 | 0.736 | 3 | -6.994 | 0 | 0.0585 | 0.431 | 1e-06 | 0.123 | 0.582 | 89.86 | 208920 | 4 | 1 |
| 4 | 0.787 | 0.572 | 1 | -7.516 | 1 | 0.222 | 0.145 | 0.0 | 0.0753 | 0.647 | 155.117 | 179413 | 4 | 1 |


In [34]:
features = df.drop(columns=['liked'])
targets = df['liked']

train_data, test_data, train_targets, test_targets = train_test_split(features, targets, random_state=42, test_size=0.1)
print(f"DF: {df.shape}\nTEST: {test_data.shape}, {test_targets.shape}\nTRAIN: {train_data.shape}, {train_targets.shape}")

DF: (195, 14)
TEST: (20, 13), (20,)
TRAIN: (175, 13), (175,)


In [35]:
# score every feature with the ANOVA F-test and keep the k highest-scoring ones
selector = SelectKBest(score_func=f_classif, k=8)
best_train_data = selector.fit_transform(train_data, train_targets)

# the support indices refer to the columns the selector was fitted on, i.e. train_data
selected_features = train_data.columns[selector.get_support(indices=True)]

print("Best features:\n", train_data[selected_features].head())

Best features:
      danceability  loudness  speechiness  instrumentalness  valence    tempo  \
123         0.847    -2.901        0.305             0.000    0.633  142.012   
144         0.130    -5.888        0.095             0.368    0.334   60.631   
66          0.791    -9.805        0.420             0.000    0.492  130.027   
45          0.373    -5.016        0.122             0.906    0.340   97.346   
158         0.368   -36.759        0.035             0.922    0.085   69.363   

     duration_ms  time_signature  
123       190986               4  
144       272995               4  
66        170582               4  
45        211947               4  
158       254000               3  
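One possible alternative is to put the feature selection inside the pipeline, so that it is refitted on every training fold during cross-validation. The sketch below is not the notebook's approach. It uses synthetic data from `make_classification` as a stand-in for data.csv (an assumption made so the block runs on its own), so the printed score says nothing about the Spotify dataset.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# synthetic stand-in for data.csv (assumption): 195 rows, 13 features, binary target
X, y = make_classification(n_samples=195, n_features=13, n_informative=6, random_state=42)

# selection, scaling and the classifier are fitted together on each training fold
pipe = make_pipeline(
    SelectKBest(score_func=f_classif, k=8),
    MinMaxScaler(),
    LogisticRegression(),
)
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

With only 195 rows, a cross-validated mean is likely to be a steadier estimate than a single 20-row test split.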


In [36]:
# train the model
pipe = make_pipeline(MinMaxScaler(), LogisticRegression())
pipe.fit(best_train_data, train_targets)

# compute accuracy on the test set, reusing the fitted selector for the same features
best_test_data = selector.transform(test_data)
acc = pipe.score(best_test_data, test_targets)
print(f"Accuracy: {acc}")

Accuracy: 0.95
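For the optional comparison of predicted and actual values, a confusion matrix is one compact option. The following is a minimal sketch on synthetic data from `make_classification`, used here as a stand-in for data.csv (an assumption), so the counts it prints are not results for the Spotify dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# synthetic stand-in for data.csv (assumption): 195 rows, 13 features, binary target
X, y = make_classification(n_samples=195, n_features=13, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

model = make_pipeline(MinMaxScaler(), LogisticRegression()).fit(X_train, y_train)
predicted = model.predict(X_test)

# rows = actual class, columns = predicted class
cm = confusion_matrix(y_test, predicted, labels=[0, 1])
print(cm)
```

The diagonal holds the correctly classified tracks, so dividing its sum by the total gives the same accuracy that `score` reports.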


