
# Task
Implement a recommendation system using cosine similarity on the "anime.csv" dataset, including data preprocessing, feature extraction, recommendation function design, and evaluation using precision, recall, and F1-score. Also, explain the difference between user-based and item-based collaborative filtering and how collaborative filtering works.

## Load the dataset

### Subtask:
Load the `anime.csv` file into a pandas DataFrame.


**Reasoning**:
Import pandas, load the csv file into a dataframe and display the first 5 rows.



In [None]:
import pandas as pd

df_anime = pd.read_csv('anime.csv')
display(df_anime.head())

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama',"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266


## Handle missing values

### Subtask:
Identify and handle missing values in the dataset.


**Reasoning**:
Check for missing values in each column of the `df_anime` DataFrame and determine a strategy for handling them.



In [None]:
missing_values = df_anime.isnull().sum()
print("Missing values before handling:")
print(missing_values)

# Strategy:
# 'genre': Impute with 'Unknown' as genre is important for recommendations.
# 'type': Impute with the mode as it represents the most frequent type.
# 'episodes': No NaNs are reported for this object column (see the counts printed above), so fillna is only a no-op safeguard here.
# 'rating': Impute with the mean as rating is numerical.

df_anime['genre'] = df_anime['genre'].fillna('Unknown')
df_anime['type'] = df_anime['type'].fillna(df_anime['type'].mode()[0])
df_anime['episodes'] = df_anime['episodes'].fillna('Unknown')
df_anime['rating'] = df_anime['rating'].fillna(df_anime['rating'].mean())

print("\nMissing values after handling:")
print(df_anime.isnull().sum())

Missing values before handling:
anime_id      0
name          0
genre        62
type         25
episodes      0
rating      230
members       0
dtype: int64

Missing values after handling:
anime_id    0
name        0
genre       0
type        0
episodes    0
rating      0
members     0
dtype: int64
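
Because `episodes` is an `object` column, a placeholder string never registers as missing in `isnull()`. The following is a minimal sketch, assuming unknown episode counts are stored as a non-numeric string such as `'Unknown'` (an assumption about the file, not something the output above confirms). The toy frame and the `episodes_num` column are hypothetical stand-ins for `df_anime`.

```python
import pandas as pd

# Toy stand-in for df_anime['episodes']; the 'Unknown' placeholder is an assumption.
toy = pd.DataFrame({"episodes": ["1", "64", "Unknown", "24"]})

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
toy["episodes_num"] = pd.to_numeric(toy["episodes"], errors="coerce")

# The placeholder rows are now visible to the usual missing-value checks.
n_missing = int(toy["episodes_num"].isna().sum())
print(toy)
print("Non-numeric episode entries:", n_missing)
```

Whether to impute or drop such rows afterwards depends on whether `episodes` is used as a feature at all; the feature matrix built later in this notebook does not use it.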


## Explore the dataset

### Subtask:
Perform exploratory data analysis to understand the data distribution and characteristics.


**Reasoning**:
Display descriptive statistics, data types, and value counts for relevant columns to understand the data distribution and characteristics as requested in the instructions.



In [None]:
print("Descriptive statistics of numerical columns:")
display(df_anime.describe())

print("\nData types of each column:")
display(df_anime.info())

print("\nUnique values and counts for 'type':")
display(df_anime['type'].value_counts())

print("\nUnique values and counts for 'genre' (first 20):")
# Displaying only the first 20 as genre has many unique values
display(df_anime['genre'].value_counts().head(20))

Descriptive statistics of numerical columns:


Unnamed: 0,anime_id,rating,members
count,12294.0,12294.0,12294.0
mean,14058.221653,6.473902,18071.34
std,11455.294701,1.017096,54820.68
min,1.0,1.67,5.0
25%,3484.25,5.9,225.0
50%,10260.5,6.55,1550.0
75%,24794.5,7.17,9437.0
max,34527.0,10.0,1013917.0



Data types of each column:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12294 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   anime_id  12294 non-null  int64  
 1   name      12294 non-null  object 
 2   genre     12294 non-null  object 
 3   type      12294 non-null  object 
 4   episodes  12294 non-null  object 
 5   rating    12294 non-null  float64
 6   members   12294 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 672.5+ KB


None


Unique values and counts for 'type':


type,count
TV,3812
OVA,3311
Movie,2348
Special,1676
ONA,659
Music,488



Unique values and counts for 'genre' (first 20):


genre,count
Hentai,823
Comedy,523
Music,301
Kids,199
"Comedy, Slice of Life",179
Dementia,137
"Fantasy, Kids",128
Fantasy,114
"Comedy, Kids",112
Drama,107
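
The counts above treat each comma-separated combination as one value, so `"Comedy"` and `"Comedy, Kids"` are tallied separately. A minimal sketch of counting individual genres instead is shown here on a small hypothetical frame rather than on `df_anime`:

```python
import pandas as pd

# Hypothetical rows that mimic the combined genre strings in the dataset.
toy = pd.DataFrame({
    "genre": ["Comedy", "Comedy, Slice of Life", "Fantasy, Kids", "Comedy, Kids"]
})

# Split each combined string into a list, then give every genre its own row.
per_genre = toy["genre"].str.split(", ").explode().value_counts()
print(per_genre)
```

The same separator (`', '`) is used by the one-hot encoding in the feature engineering step, so the two views stay consistent.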


## Feature engineering

### Subtask:
Select relevant features for similarity calculation and preprocess them as needed (e.g., one-hot encoding for genres).


**Reasoning**:
Select the relevant features, apply one-hot encoding to 'genre' and 'type', and concatenate with the 'rating' column to create the feature matrix.



In [None]:
# Select relevant features
features_df = df_anime[['genre', 'type', 'rating']].copy()

# Apply one-hot encoding to 'genre'
genre_encoded = features_df['genre'].str.get_dummies(sep=', ')

# Apply one-hot encoding to 'type'
type_encoded = pd.get_dummies(features_df['type'], prefix='type')

# Concatenate the encoded features and the rating
feature_matrix = pd.concat([genre_encoded, type_encoded, features_df['rating']], axis=1)

# Display the resulting feature matrix
display(feature_matrix.head())

Unnamed: 0,Action,Adventure,Cars,Comedy,Dementia,Demons,Drama,Ecchi,Fantasy,Game,...,Vampire,Yaoi,Yuri,type_Movie,type_Music,type_ONA,type_OVA,type_Special,type_TV,rating
0,0,0,0,0,0,0,1,0,0,0,...,0,0,0,True,False,False,False,False,False,9.37
1,1,1,0,0,0,0,1,0,1,0,...,0,0,0,False,False,False,False,False,True,9.26
2,1,0,0,1,0,0,0,0,0,0,...,0,0,0,False,False,False,False,False,True,9.25
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,False,False,False,False,False,True,9.17
4,1,0,0,1,0,0,0,0,0,0,...,0,0,0,False,False,False,False,False,True,9.16
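
In the matrix above the genre and type columns are 0/1 flags, while `rating` ranges from roughly 1.67 to 10, so the rating column can carry far more weight in a cosine similarity than any single flag. One common remedy, sketched here as an option and not as what this notebook does, is to rescale `rating` to the [0, 1] range before concatenating. The toy frame and the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical miniature of the feature matrix: two genre flags plus a raw rating.
toy = pd.DataFrame({
    "Action": [1, 0, 1],
    "Drama": [0, 1, 0],
    "rating": [9.26, 9.37, 6.0],
})

# Min-max scaling maps the smallest rating to 0 and the largest to 1.
scaled = toy.copy()
r_min, r_max = toy["rating"].min(), toy["rating"].max()
scaled["rating"] = (toy["rating"] - r_min) / (r_max - r_min)

print(scaled)
```

After scaling, every column lies in the same range, so no single feature dominates the vector norm by magnitude alone.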


## Compute cosine similarity

### Subtask:
Calculate the cosine similarity matrix based on the selected features.


**Reasoning**:
Calculate the cosine similarity matrix based on the feature matrix.



In [None]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim = cosine_similarity(feature_matrix)
print("Cosine similarity matrix shape:", cosine_sim.shape)

Cosine similarity matrix shape: (12294, 12294)
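
Cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it by hand on two hypothetical items that share no genre, first without and then with an unscaled rating appended, to illustrate how one large-valued column can push the score toward 1. The helper `cosine` and the example vectors are illustrative and not part of the notebook.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two hypothetical items whose genre flags do not overlap at all.
x = [1, 0, 0, 1]
y = [0, 1, 1, 0]

# With flags only, the vectors are orthogonal.
no_rating = cosine(x, y)

# Appending raw ratings (8.0 and 7.0) makes the same two items look very similar.
with_rating = cosine(x + [8.0], y + [7.0])

print("flags only:      ", no_rating)
print("flags + rating:  ", round(with_rating, 3))
```

If similarities on the real matrix cluster near 1 for the same reason, a fixed relevance threshold will separate items poorly.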


## Build recommendation function

### Subtask:
Create a function that takes an anime title as input and returns a list of similar anime based on the cosine similarity matrix.


**Reasoning**:
Define the `get_recommendations` function to retrieve similar anime based on cosine similarity.



In [None]:
def get_recommendations(title, cosine_sim=cosine_sim, df=df_anime):
    """
    Gets recommendations for a given anime title based on cosine similarity.

    Args:
        title (str): The title of the anime.
        cosine_sim (np.ndarray): The pre-computed cosine similarity matrix.
        df (pd.DataFrame): The DataFrame containing anime information.

    Returns:
        list: A list of recommended anime titles, or an error message if the title is not found.
    """
    # Map each title to its row position, keeping the first row when a title repeats
    indices = pd.Series(range(len(df)), index=df['name'])
    indices = indices[~indices.index.duplicated(keep='first')]
    if title not in indices.index:
        return "Anime title not found in the dataset."
    idx = indices[title]

    # Get the pairwise similarity scores for all anime with that anime
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the anime based on similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Drop the query anime itself (with tied scores it is not always ranked first),
    # then keep the 10 most similar anime
    sim_scores = [s for s in sim_scores if s[0] != idx][:10]

    # Get the anime indices
    anime_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar anime titles
    return df['name'].iloc[anime_indices].tolist()

# Example usage (optional - for testing the function)
# print(get_recommendations('Naruto'))
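
The lookup-sort-slice pattern used by `get_recommendations` can also be written with `np.argsort`. The sketch below is a hypothetical variant on a tiny hand-made similarity matrix: `top_n_similar`, `names`, and `sim` are illustrative names, and returning an empty list for an unknown title is a design choice of this sketch, not the notebook's behavior.

```python
import numpy as np

def top_n_similar(sim_matrix, names, title, n=10):
    """Return up to n names most similar to title; empty list if title is unknown."""
    if title not in names:
        return []
    idx = names.index(title)
    # Negate the row so argsort yields highest similarity first; stable keeps tie order.
    order = np.argsort(-sim_matrix[idx], kind="stable")
    # Exclude the query item explicitly instead of assuming it is ranked first.
    return [names[i] for i in order if i != idx][:n]

names = ["A", "B", "C", "D"]
sim = np.array([
    [1.0, 0.9, 0.2, 0.5],
    [0.9, 1.0, 0.3, 0.4],
    [0.2, 0.3, 1.0, 0.8],
    [0.5, 0.4, 0.8, 1.0],
])

print(top_n_similar(sim, names, "A", n=2))
print(top_n_similar(sim, names, "Z"))
```

Returning a list in every case lets callers test `if not result` without comparing against an error string.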

## Evaluate the system

### Subtask:
Split the data, implement evaluation metrics (precision, recall, F1-score), and evaluate the recommendation system's performance.


**Reasoning**:
Split the data, implement evaluation metrics, and evaluate the recommendation system's performance.



In [None]:
from sklearn.model_selection import train_test_split
import numpy as np

# 1. Split the data into training and testing sets
# This recommender is content-based (it compares item attributes) and has no explicit user data,
# so we split the item catalogue itself.
train_df, test_df = train_test_split(df_anime, test_size=0.2, random_state=42)

# Re-compute the feature matrix and cosine similarity for the training data
# This is necessary because recommendations should be based on the training set.
train_features_df = train_df[['genre', 'type', 'rating']].copy()
train_genre_encoded = train_features_df['genre'].str.get_dummies(sep=', ')
train_type_encoded = pd.get_dummies(train_features_df['type'], prefix='type')
train_feature_matrix = pd.concat([train_genre_encoded, train_type_encoded, train_features_df['rating']], axis=1)
train_cosine_sim = cosine_similarity(train_feature_matrix)

# Update the get_recommendations function to use the training data and similarity matrix
def get_recommendations_train(title, cosine_sim=train_cosine_sim, df=train_df):
    """
    Gets recommendations for a given anime title based on cosine similarity from the training set.

    Args:
        title (str): The title of the anime.
        cosine_sim (np.ndarray): The pre-computed cosine similarity matrix for the training set.
        df (pd.DataFrame): The training DataFrame containing anime information.

    Returns:
        list: A list of recommended anime titles from the training set, or an error message if the title is not found.
    """
    # Find the positional row of the title in the training matrix (first match wins).
    # A pandas Series has no get_loc method, so positions are resolved on a reset index.
    names = df['name'].reset_index(drop=True)
    matches = names[names == title]
    if matches.empty:
        return "Anime title not found in the training dataset."
    pos = matches.index[0]

    sim_scores = list(enumerate(cosine_sim[pos]))

    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:11] # Get top 10 recommendations excluding itself

    anime_indices = [i[0] for i in sim_scores]

    # Return the names using the original index from the training dataframe
    return df['name'].iloc[anime_indices].tolist()


# 2. Define ground truth of "relevant" recommendations for each item in the test set
# We'll consider anime in the training set with a cosine similarity > threshold as relevant to a test item.
relevance_threshold = 0.8 # Example threshold

# Re-compute cosine similarity between test set items and training set items
# This is needed to find relevant items in the training set for each test item.
test_features_df = test_df[['genre', 'type', 'rating']].copy()
test_genre_encoded = test_features_df['genre'].str.get_dummies(sep=', ')
test_type_encoded = pd.get_dummies(test_features_df['type'], prefix='type')
test_feature_matrix = pd.concat([test_genre_encoded, test_type_encoded, test_features_df['rating']], axis=1)

# Align columns of test_feature_matrix with train_feature_matrix before calculating similarity
# This is crucial if one-hot encoding resulted in different columns (e.g., due to different unique values)
test_feature_matrix = test_feature_matrix.reindex(columns=train_feature_matrix.columns, fill_value=0)


# Calculate similarity between test items and train items
test_train_cosine_sim = cosine_similarity(test_feature_matrix, train_feature_matrix)


# 3. & 4. Calculate Precision, Recall, and F1-score for each test item
precision_list = []
recall_list = []
f1_list = []

for i in range(len(test_df)):
    test_anime_title = test_df.iloc[i]['name']

    # Get recommendations from the training set
    recommended_anime = get_recommendations_train(test_anime_title, cosine_sim=train_cosine_sim, df=train_df)

    # Handle the case where the test anime title is not in the training set.
    # train_test_split partitions rows, so this is the usual case: a test title is found
    # in train_df only when the dataset contains a duplicate name.
    # If the title is not found in training, we skip evaluation for this item.
    if recommended_anime == "Anime title not found in the training dataset.":
        continue

    # Define relevant anime in the training set for the current test item
    # Find the index of the current test anime in the test_df (needed for test_train_cosine_sim)
    original_test_index = test_df.iloc[i].name
    test_matrix_index = test_df.index.get_loc(original_test_index)

    # Get similarity scores between the current test item and all train items
    similarity_to_train = test_train_cosine_sim[test_matrix_index]

    # Identify relevant anime in the training set based on the threshold
    relevant_anime_indices_in_train = np.where(similarity_to_train > relevance_threshold)[0]

    # Get the names of relevant anime from the training set, excluding the test item itself if it somehow appears
    relevant_anime_names = train_df.iloc[relevant_anime_indices_in_train]['name'].tolist()

    # Ensure the test anime itself is not considered relevant if it exists in train_df
    if test_anime_title in relevant_anime_names:
        relevant_anime_names.remove(test_anime_title)


    # Convert recommended and relevant lists to sets for easier comparison
    recommended_set = set(recommended_anime)
    relevant_set = set(relevant_anime_names)

    # Calculate True Positives (TP), False Positives (FP), False Negatives (FN)
    # TP: Recommended and relevant
    tp = len(recommended_set.intersection(relevant_set))
    # FP: Recommended but not relevant
    fp = len(recommended_set - relevant_set)
    # FN: Relevant but not recommended
    fn = len(relevant_set - recommended_set)

    # Calculate Precision, Recall, and F1-score for the current test item
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

    precision_list.append(precision)
    recall_list.append(recall)
    f1_list.append(f1)

# 5. Calculate the average precision, recall, and F1-score
average_precision = np.mean(precision_list) if precision_list else 0
average_recall = np.mean(recall_list) if recall_list else 0
average_f1 = np.mean(f1_list) if f1_list else 0

# 6. Print the average precision, recall, and F1-score
print(f"\nAverage Precision: {average_precision:.4f}")
print(f"Average Recall: {average_recall:.4f}")
print(f"Average F1-score: {average_f1:.4f}")



Average Precision: 0.0000
Average Recall: 0.0000
Average F1-score: 0.0000


**Reasoning**:
All three metrics are 0.0000, which points to a problem in the evaluation logic. The most likely cause is that almost no test item is ever scored: `train_test_split` partitions the rows, so a test title appears in `train_df` only if the dataset contains a duplicate name. For every other test item `get_recommendations_train` returns the not-found message, the loop skips it, and the empty metric lists fall back to 0.

The relevance threshold of 0.8 is a second, independent choice worth revisiting, because relevance is defined by the same cosine similarity that produces the recommendations. The next cell addresses both points: it seeds the recommendations for each test item from the most similar training item, so every test item can be evaluated, and it lowers the relevance threshold to 0.5.



In [None]:
from sklearn.model_selection import train_test_split
import numpy as np

# 1. Split the data into training and testing sets
# This recommender is content-based (it compares item attributes) and has no explicit user data,
# so we split the item catalogue itself.
train_df, test_df = train_test_split(df_anime, test_size=0.2, random_state=42)

# Re-compute the feature matrix and cosine similarity for the training data
# This is necessary because recommendations should be based on the training set.
train_features_df = train_df[['genre', 'type', 'rating']].copy()
train_genre_encoded = train_features_df['genre'].str.get_dummies(sep=', ')
train_type_encoded = pd.get_dummies(train_features_df['type'], prefix='type')
train_feature_matrix = pd.concat([train_genre_encoded, train_type_encoded, train_features_df['rating']], axis=1)

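# Align columns of train_feature_matrix to handle potential differences in dummy variables.
# NOTE: test_feature_matrix here is the one left over from the previous cell; it is rebuilt further down.
# sorted() keeps the column order deterministic (a bare set has no guaranteed order).
all_cols = sorted(set(train_feature_matrix.columns) | set(test_feature_matrix.columns))
train_feature_matrix = train_feature_matrix.reindex(columns=all_cols, fill_value=0)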


train_cosine_sim = cosine_similarity(train_feature_matrix)

# Update the get_recommendations function to use the training data and similarity matrix
def get_recommendations_train(title, cosine_sim=train_cosine_sim, df=train_df):
    """
    Gets recommendations for a given anime title based on cosine similarity from the training set.

    Args:
        title (str): The title of the anime.
        cosine_sim (np.ndarray): The pre-computed cosine similarity matrix for the training set.
        df (pd.DataFrame): The training DataFrame containing anime information.

    Returns:
        list: A list of recommended anime titles from the training set, or an error message if the title is not found.
    """
    # Get the index of the anime that matches the title within the training dataframe
    # Use boolean indexing to find the row(s) with the matching title
    title_row = df[df['name'] == title]

    if title_row.empty:
        return "Anime title not found in the training dataset."

    # Get the index from the training dataframe
    idx_in_train_df = title_row.index[0]


    # Get the pairwise similarity scores for all anime with that anime in the training set
    # We need to get the positional index within the train_df's current index
    sim_scores = list(enumerate(cosine_sim[df.index.get_loc(idx_in_train_df)]))


    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:11] # Get top 10 recommendations excluding itself

    # Get the anime indices from the sorted similarity scores (these are indices within the training set)
    anime_indices = [i[0] for i in sim_scores]

    # Return the names using the original index from the training dataframe
    return df.iloc[anime_indices]['name'].tolist()


# 2. Define ground truth of "relevant" recommendations for each item in the test set
# We'll consider anime in the training set with a cosine similarity > threshold as relevant to a test item.
relevance_threshold = 0.5 # Lowered example threshold

# Re-compute cosine similarity between test set items and training set items
# This is needed to find relevant items in the training set for each test item.
test_features_df = test_df[['genre', 'type', 'rating']].copy()
test_genre_encoded = test_features_df['genre'].str.get_dummies(sep=', ')
test_type_encoded = pd.get_dummies(test_features_df['type'], prefix='type')
test_feature_matrix = pd.concat([test_genre_encoded, test_type_encoded, test_features_df['rating']], axis=1)

# Align columns of test_feature_matrix with train_feature_matrix before calculating similarity
# This is crucial if one-hot encoding resulted in different columns (e.g., due to different unique values)
test_feature_matrix = test_feature_matrix.reindex(columns=train_feature_matrix.columns, fill_value=0)

# Calculate similarity between test items and train items
test_train_cosine_sim = cosine_similarity(test_feature_matrix, train_feature_matrix)


# 3. & 4. Calculate Precision, Recall, and F1-score for each test item
precision_list = []
recall_list = []
f1_list = []

for i in range(len(test_df)):
    test_anime_title = test_df.iloc[i]['name']

    # Get recommendations from the training set
    # We need to find the closest anime in the training set to the current test anime to get recommendations
    # Find the index of the current test anime in the test_df (needed for test_train_cosine_sim)
    original_test_index = test_df.iloc[i].name
    test_matrix_index = test_df.index.get_loc(original_test_index)

    # Get similarity scores between the current test item and all train items
    similarity_to_train = test_train_cosine_sim[test_matrix_index]

    # Find the index of the most similar anime in the training set to the current test anime
    most_similar_train_index_in_matrix = np.argmax(similarity_to_train)
    most_similar_train_original_index = train_df.iloc[most_similar_train_index_in_matrix].name
    most_similar_train_title = train_df.loc[most_similar_train_original_index]['name']


    recommended_anime = get_recommendations_train(most_similar_train_title, cosine_sim=train_cosine_sim, df=train_df)


    # Handle the case where the most similar training anime title is not found (shouldn't happen with argmax but as a safeguard)
    if recommended_anime == "Anime title not found in the training dataset.":
        continue

    # Define relevant anime in the training set for the current test item
    # Identify relevant anime in the training set based on the threshold using the similarity to the test item
    relevant_anime_indices_in_train_matrix = np.where(similarity_to_train > relevance_threshold)[0]

    # Get the names of relevant anime from the training set based on their indices in the training feature matrix
    relevant_anime_names = train_df.iloc[relevant_anime_indices_in_train_matrix]['name'].tolist()


    # Convert recommended and relevant lists to sets for easier comparison
    recommended_set = set(recommended_anime)
    relevant_set = set(relevant_anime_names)

    # Calculate True Positives (TP), False Positives (FP), False Negatives (FN)
    # TP: Recommended and relevant
    tp = len(recommended_set.intersection(relevant_set))
    # FP: Recommended but not relevant
    fp = len(recommended_set - relevant_set)
    # FN: Relevant but not recommended
    fn = len(relevant_set - recommended_set)

    # Calculate Precision, Recall, and F1-score for the current test item
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

    precision_list.append(precision)
    recall_list.append(recall)
    f1_list.append(f1)

# 5. Calculate the average precision, recall, and F1-score
average_precision = np.mean(precision_list) if precision_list else 0
average_recall = np.mean(recall_list) if recall_list else 0
average_f1 = np.mean(f1_list) if f1_list else 0

# 6. Print the average precision, recall, and F1-score
print(f"\nAverage Precision: {average_precision:.4f}")
print(f"Average Recall: {average_recall:.4f}")
print(f"Average F1-score: {average_f1:.4f}")


Average Precision: 1.0000
Average Recall: 0.0010
Average F1-score: 0.0020
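
One reading of these figures, offered as a hypothesis and not a verified diagnosis: a precision of 1.0 with a recall near 0.001 is what results when all 10 recommendations fall inside a relevant set that covers almost the whole training split (roughly 9,835 items, i.e. 80% of 12,294 rows). The sketch below reproduces that arithmetic with a hypothetical helper, `precision_recall_f1`, on synthetic ID sets.

```python
def precision_recall_f1(recommended, relevant):
    """Set-based precision, recall and F1 for one query item."""
    recommended, relevant = set(recommended), set(relevant)
    tp = len(recommended & relevant)  # recommended and relevant
    precision = tp / len(recommended) if recommended else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Synthetic case: 10 recommendations, all inside a relevant set of 9,835 items.
recs = range(10)
relevant = range(9835)
p, r, f = precision_recall_f1(recs, relevant)
print(round(p, 4), round(r, 4), round(f, 4))
```

If that reading holds, the metrics mostly reflect how many items clear the threshold and say little about ranking quality. A ground truth that is independent of the similarity matrix (for example, held-out user ratings) would be more informative.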


## Interview Questions

### 1. Can you explain the difference between user-based and item-based collaborative filtering?

**User-Based Collaborative Filtering:**

*   **How it works:** This approach recommends items to a user based on the preferences of *similar users*. It finds users who have similar taste (e.g., they rated the same items similarly) to the target user. Then, it recommends items that those similar users liked but the target user hasn't seen or rated yet.
*   **Analogy:** "People who are like you liked this, so you might like it too."
*   **Pros:** Can recommend unexpected items outside of the user's past preferences, can handle new items relatively well once a few users rate them.
*   **Cons:** Can suffer from the "cold-start problem" for new users (difficult to find similar users), performance can degrade with a large number of users, and user preferences can change over time.

**Item-Based Collaborative Filtering:**

*   **How it works:** This approach recommends items to a user based on the similarity of *items* they have already liked or interacted with. It finds items that are similar to the items the target user has shown interest in.
*   **Analogy:** "If you liked this item, you might like this other item because they are similar."
*   **Pros:** Generally more stable than user-based filtering because item-to-item similarity changes more slowly than user preferences, scales better with a large number of users because the item similarity matrix can be precomputed, and can serve a new user after only a few interactions, since one or two liked items are enough to look up similar items.
*   **Cons:** Can suffer from the "cold-start problem" for new items (difficult to determine similarity without user interaction data), recommendations can be less diverse and stick to items similar to what the user already knows.
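
The two approaches differ only in which axis of the user-item matrix is compared. The sketch below uses a small hypothetical rating matrix `R` (rows are users, columns are items, 0 means "not rated") and a helper `cosine_rows` that is not part of the notebook. Treating unrated cells as 0 is a simplification made for brevity.

```python
import numpy as np

# Hypothetical ratings: 3 users x 4 items, 0 = not rated.
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
], dtype=float)

def cosine_rows(M):
    """Pairwise cosine similarity between the rows of M."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    unit = M / norms
    return unit @ unit.T

# User-based: compare rows (users) by the items they rated.
user_sim = cosine_rows(R)

# Item-based: compare columns (items) by the users who rated them.
item_sim = cosine_rows(R.T)

print("user-user similarity:\n", user_sim.round(2))
print("item-item similarity:\n", item_sim.round(2))
```

Here users 0 and 1 come out as close neighbors, as do items 0 and 1, because each pair shares a similar rating pattern.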

### 2. What is collaborative filtering, and how does it work?

**Collaborative Filtering:**

Collaborative filtering is a recommendation technique that predicts a user's interest in an item from the preferences of many other users. The core idea is that users who agreed on some items in the past are likely to agree on other items as well. The filtering is "collaborative" because the recommendations come from the pooled behavior of users, not from the attributes of the items.

**How it works:**

Collaborative filtering typically involves the following steps:

1.  **Data Collection:** Gather data on user interactions with items. This can be explicit (like ratings) or implicit (like purchase history, viewing time, clicks). This data is often represented as a user-item matrix.
2.  **Similarity Calculation:** Determine the similarity between users (user-based) or between items (item-based). Similarity is calculated based on their interaction patterns. Common similarity measures include:
    *   **Cosine Similarity:** Measures the cosine of the angle between two vectors (users or items) in a multi-dimensional space.
    *   **Pearson Correlation:** Measures the linear relationship between two sets of data (ratings of users or items).
    *   **Euclidean Distance:** Measures the straight-line distance between two points in a multi-dimensional space; a smaller distance means greater similarity.
3.  **Prediction/Recommendation:**
    *   **User-Based:** To predict a user's rating for an item, find users similar to the target user and average their ratings for that item (often weighted by similarity). Recommend items that similar users liked but the target user hasn't interacted with.
    *   **Item-Based:** To predict a user's rating for a target item, find the items the user has already rated that are most similar to the target, then average the user's ratings of those items, weighted by similarity. Recommend the unseen items with the highest predicted ratings.

In essence, collaborative filtering leverages the collective intelligence of users to provide personalized recommendations. It assumes that users who liked similar items in the past will likely like similar items in the future (item-based), or that users with similar tastes will like the same items (user-based).
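
The user-based prediction step described above can be sketched as a similarity-weighted average. The matrix `R` and the function `predict_user_based` are hypothetical, and the sketch omits refinements such as mean-centering or limiting the neighborhood to the top-k users.

```python
import numpy as np

# Hypothetical ratings: 3 users x 4 items, 0 = not rated.
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
], dtype=float)

def predict_user_based(R, user, item):
    """Predict R[user, item] as a similarity-weighted mean of other users' ratings."""
    # Only users who actually rated the item can contribute.
    others = [u for u in range(R.shape[0]) if u != user and R[u, item] > 0]
    if not others:
        return None
    target = R[user]
    # Cosine similarity between the target user and each contributing user.
    sims = np.array([
        target @ R[u] / (np.linalg.norm(target) * np.linalg.norm(R[u]))
        for u in others
    ])
    if sims.sum() == 0:
        return None
    ratings = R[others, item]
    return float(sims @ ratings / sims.sum())

# User 0 has not rated item 2; users 1 and 2 rated it 1 and 5.
pred = predict_user_based(R, user=0, item=2)
print(round(pred, 2))
```

The prediction lands closer to user 1's low rating than to user 2's high one, because user 0's rating pattern is much more similar to user 1's.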