# 1. Target similarity: cosine similarity of vectors

In [1]:
import numpy as np
import pandas as pd
import sklearn
# Generate some example vectors: three users, five books
vector_user1 = np.array([1, 2, 3, 4, 5])
vector_user2 = np.array([2, 3, 4, 5, 6])
vector_user3 = np.array([5, 4, 3, 2, 1])

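## Example 1: user-oriented. Vectorize each user's preferences for items, which can be used to compute the similarity between users

For example: a vector with 5 values represents **one user's ratings of 5 books.**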

In [2]:
import numpy as np
import pandas as pd
# Generate some example vectors: three users
vector_user1 = np.array([1, 2, 3, 4, 5])
vector_user2 = np.array([2, 3, 4, 5, 6])
vector_user3 = np.array([5, 4, 3, 2, 1])

# Collect the example vectors under their labels
user_vectors = {
    'vector_user1': vector_user1,
    'vector_user2': vector_user2,
    'vector_user3': vector_user3
}

In [3]:
def cosine_similarity(vector1, vector2):
    dot_product = np.dot(vector1, vector2)  # dot product
    norm_vector1 = np.linalg.norm(vector1)  # Euclidean norm of vector1
    norm_vector2 = np.linalg.norm(vector2)  # Euclidean norm of vector2

    similarity = dot_product / (norm_vector1 * norm_vector2)
    return similarity
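
For reference, the quantity computed by the function above can be written as follows. The symbols $\mathbf{a}$, $\mathbf{b}$ and $n$ are notation introduced here for illustration and do not appear in the notebook.

```latex
\cos(\mathbf{a}, \mathbf{b})
  = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
  = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \, \sqrt{\sum_{i=1}^{n} b_i^2}}
```

As a worked case, for `vector_user1 = [1, 2, 3, 4, 5]` and `vector_user3 = [5, 4, 3, 2, 1]` the dot product is 5 + 8 + 9 + 8 + 5 = 35 and both norms are $\sqrt{55}$, so the similarity is 35 / 55 ≈ 0.636364.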

In [4]:
def numpy_cos_similar(vectors):
    # Compute the cosine similarity matrix
    vector_array = np.array(list(vectors.values()))
    vector_labels = list(vectors.keys())

    num_vectors = len(vector_array)
    similarity_matrix = np.zeros((num_vectors, num_vectors))

    for i in range(num_vectors):
        for j in range(num_vectors):
            similarity_matrix[i, j] = cosine_similarity(vector_array[i], vector_array[j])

    # Wrap the similarity matrix in a DataFrame
    similarity_df = pd.DataFrame(similarity_matrix, index=vector_labels, columns=vector_labels)

    # Print the results
    print("User Vectors:")
    print(pd.DataFrame(vector_array, index=vector_labels))
    print("\nCosine Similarity Matrix:")
    print(similarity_df)

In [5]:
numpy_cos_similar(user_vectors)

User Vectors:
              0  1  2  3  4
vector_user1  1  2  3  4  5
vector_user2  2  3  4  5  6
vector_user3  5  4  3  2  1

Cosine Similarity Matrix:
              vector_user1  vector_user2  vector_user3
vector_user1      1.000000      0.994937      0.636364
vector_user2      0.994937      1.000000      0.710669
vector_user3      0.636364      0.710669      1.000000
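
The nested loop above can also be written without explicit Python loops. The following is a minimal sketch, assuming every vector has the same length and none is all zeros. The function name `cosine_similarity_matrix` is a hypothetical helper introduced here, not part of the notebook.

```python
import numpy as np


def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between the rows of a 2-D array."""
    matrix = np.asarray(vectors, dtype=float)
    # L2 norm of each row, kept as a column so it broadcasts across that row
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    # Assumption: no all-zero rows, otherwise this divides by zero
    unit_rows = matrix / norms
    # Dot products between unit vectors are the cosine similarities
    return unit_rows @ unit_rows.T


# Same ratings as the example users in the notebook
ratings = np.array([
    [1, 2, 3, 4, 5],  # vector_user1
    [2, 3, 4, 5, 6],  # vector_user2
    [5, 4, 3, 2, 1],  # vector_user3
])
similarity = cosine_similarity_matrix(ratings)
print(np.round(similarity, 6))
```

Normalizing each row once and taking a single matrix product avoids recomputing the norms for every pair, which starts to matter as the number of users grows.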


## Using scikit-learn on a NumPy matrix

In [None]:
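# Import under an alias so it does not shadow the cosine_similarity function defined above
from sklearn.metrics.pairwise import cosine_similarity as sk_cosine_similarity

# Stack the vectors into a matrix, one row per user
matrix = np.vstack([vector_user1, vector_user2, vector_user3])

# Compute the pairwise cosine similarity between rows
similarity_matrix = sk_cosine_similarity(matrix)

# Print the results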
print("Vectors:")
print(matrix)
print("\nCosine Similarity Matrix:")
print(similarity_matrix)

## Using pandas with scikit-learn

In [None]:

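# Import under an alias so it does not shadow the cosine_similarity function defined above
from sklearn.metrics.pairwise import cosine_similarity as sk_cosine_similarity

# Combine the vectors and labels into a DataFrame (one column per user)
df = pd.DataFrame(user_vectors)

# Transpose so each row is a user, then compute the cosine similarity
similarity_matrix = sk_cosine_similarity(df.T)

# Attach the labels to the similarity matrix
similarity_df = pd.DataFrame(similarity_matrix, index=df.columns, columns=df.columns)

# Print the results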
print("User Vectors:")
print(df)
print("\nCosine Similarity Matrix:")
print(similarity_df)

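## Example 2: item-oriented. Vectorize users' preferences for items to compute the similarity between items

For example: a vector with 3 values represents **3 users' ratings of one book.**
We simply transpose all the user vectors from above: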


In [7]:
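def get_item_vector(user_vectors):
    # user_vectors is a dict keyed by label, so work on its values;
    # indexing it with user_vectors[0] raises KeyError: 0
    user_vector_list = list(user_vectors.values())
    item_vectors = {}
    item_num = len(user_vector_list[0])  # Assuming all vectors have the same length

    for i in range(item_num):
        # Collect every user's rating of item i
        new_vector_i = np.array([user_vector[i] for user_vector in user_vector_list])
        item_vectors[f'item_{i+1}'] = new_vector_i
        print(f'Item {i+1} Vector:', new_vector_i)

    return item_vectors


In [8]:
item_vectors = get_item_vector(user_vectors)

Item 1 Vector: [1 2 5]
Item 2 Vector: [2 3 4]
Item 3 Vector: [3 4 3]
Item 4 Vector: [4 5 2]
Item 5 Vector: [5 6 1]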

In [None]:
item_vectors
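
The same item vectors can be obtained without a loop over indices. This is a minimal sketch, assuming the `user_vectors` dict has the same contents as in the notebook. The name `item_vectors_alt` and the `item_N` labels are chosen here for illustration. A DataFrame built from a dict of equal-length arrays has one column per user, so each of its rows is already an item vector.

```python
import numpy as np
import pandas as pd

# Same example data as the notebook's user_vectors dict
user_vectors = {
    'vector_user1': np.array([1, 2, 3, 4, 5]),
    'vector_user2': np.array([2, 3, 4, 5, 6]),
    'vector_user3': np.array([5, 4, 3, 2, 1]),
}

# One column per user, one row per book
ratings_df = pd.DataFrame(user_vectors)
# Illustrative item labels (an assumption, mirroring the item_N keys above)
ratings_df.index = [f'item_{i + 1}' for i in range(len(ratings_df))]

# Each row holds every user's rating of one item
item_vectors_alt = {label: row.to_numpy() for label, row in ratings_df.iterrows()}
print(item_vectors_alt['item_1'])
```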

# 2. Computing product recommendation scores from similarity

## 2.1 User CF: user-based collaborative filtering

User-Based Collaborative Filtering (User-Based CF):

In User-Based CF, recommendations are made based on the preferences and behaviors of users who are similar to the target user. The idea is that if two users have similar tastes or preferences, and one user likes an item that the other has not yet interacted with, it's likely that the second user will also like that item. The similarity between users is typically measured using metrics such as cosine similarity or Pearson correlation.

The steps involved in User-Based CF are as follows:

1. User Similarity Calculation: Compute the similarity between the target user and other users in the system.
2. Neighborhood Selection: Identify a set of users (neighborhood) who are most similar to the target user.
3. Rating Prediction: Predict the target user's preference for items by aggregating the ratings of the items from the selected neighborhood.

One drawback of User-Based CF is the scalability issue. As the number of users grows, calculating user similarities for all pairs can become computationally expensive.
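
The neighborhood selection step described above could look roughly like the sketch below. It assumes the similarities of the target user to every user are already available as one row of a similarity matrix. The helper `top_k_neighbors` and the choice of `k` are hypothetical and do not come from the notebook.

```python
import numpy as np


def top_k_neighbors(similarity_row, target_index, k):
    """Return the indices of the k users most similar to the target user."""
    scores = np.asarray(similarity_row, dtype=float).copy()
    # Exclude the target user, who always has similarity 1 with themselves
    scores[target_index] = -np.inf
    # Sort by descending similarity and keep the first k indices
    return np.argsort(-scores)[:k]


# Similarities of vector_user1 to (user1, user2, user3), taken from the matrix printed earlier
similarity_to_user1 = np.array([1.0, 0.994937, 0.636364])
neighbors = top_k_neighbors(similarity_to_user1, target_index=0, k=1)
print(neighbors)  # index 1, i.e. vector_user2
```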

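Question: we know user A's preference vector and want to recommend products that are likely to match A's tastes. How do we do that?

Reasonable assumption: users with similar interests should rate the same product similarly.

Approach: if other, similar users have already used a product and rated it, we can take the weighted score of their ratings as an estimate of how much user A is likely to like that product. The weights are their similarities to user A (the cosine similarity described above).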
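
A minimal sketch of this weighted estimate follows. The function `predict_rating`, the two neighbours and their numbers are all hypothetical. Dividing by the sum of the weights is one common way to keep the estimate on the original rating scale and is an assumption here, since the text only says "weighted score".

```python
import numpy as np


def predict_rating(similarities, neighbor_ratings):
    """Similarity-weighted average of the neighbours' ratings for one item."""
    weights = np.asarray(similarities, dtype=float)
    ratings = np.asarray(neighbor_ratings, dtype=float)
    # Assumption: the weights do not sum to zero
    return np.dot(weights, ratings) / np.sum(weights)


# Hypothetical data: two users similar to A who have both rated the same product
similarities_to_a = [0.9, 0.6]  # cosine similarity of each neighbour to user A
ratings_for_item = [5.0, 2.0]   # each neighbour's rating of the product
predicted = predict_rating(similarities_to_a, ratings_for_item)
print(round(predicted, 2))  # (0.9 * 5 + 0.6 * 2) / (0.9 + 0.6) = 3.8
```

The more similar neighbour pulls the estimate toward their own rating, which is the behaviour the assumption about similar users calls for.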

