<a href="https://colab.research.google.com/github/RoshaniPawar16/AAI/blob/main/Task3/Task_3_2_and_3_3_Work_done.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [41]:
!pip install gradio scikit-surprise



In [None]:
from surprise import Dataset, SVD, Reader, KNNBaseline, CoClustering, KNNWithZScore, KNNWithMeans, KNNBasic, SVDpp
from surprise.model_selection import cross_validate
import pandas as pd
import numpy as np
import os
from surprise import dump

# Load the dataset
song_dataset = pd.read_csv('song_dataset.csv')

# Normalize play_count using min-max scaling
min_play_count = song_dataset['play_count'].min()
max_play_count = song_dataset['play_count'].max()
song_dataset['normalized_play_count'] = (song_dataset['play_count'] - min_play_count) / (max_play_count - min_play_count)


# Prepare the data for the Surprise library
reader = Reader(rating_scale=(0, 1))  # Adjusted for normalized play counts
# reader = Reader(rating_scale=(min_play_count, max_play_count))
data = Dataset.load_from_df(song_dataset[['user', 'song', 'normalized_play_count']], reader)

# Define algorithms to test
algo_list = [SVD(), KNNBaseline(), CoClustering(), KNNWithZScore(), KNNWithMeans(), KNNBasic(), SVDpp()]

# Run 5-fold cross-validation for each algorithm
results = []
for algo in algo_list:
    res = cross_validate(algo, data, measures=["RMSE", "MAE"], cv=5, verbose=True)
    res = pd.DataFrame.from_dict(res)
    res['algo'] = algo.__class__.__name__
    results.append(res)
    print(res)

# Aggregate results
results_df = pd.concat(results).groupby('algo').mean()
results_df['test_rmse'] = results_df['test_rmse'].apply(lambda x: x*100)
results_df['test_mae'] = results_df['test_mae'].apply(lambda x: x*100)
results_df = results_df.sort_values(by='test_rmse', ascending=True)

# Display results
print("\nAverage Cross-Validation Results:")
print(results_df)

In [43]:
results_df

Unnamed: 0_level_0,test_rmse,test_mae,fit_time,test_time
algo,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
KNNBaseline,0.409038,0.124625,0.705672,0.475219
KNNBasic,0.431508,0.137793,0.180091,0.540602
KNNWithMeans,0.432291,0.130534,0.185779,0.528598
KNNWithZScore,0.440151,0.129951,0.348021,0.637088
CoClustering,0.442243,0.11303,5.316159,0.142054
SVDpp,2.289088,1.11419,21.57181,2.850743
SVD,4.709469,2.308569,4.300224,0.237172


### Dataset Preparation

The dataset, `song_dataset.csv`, contains user-song interactions, including the number of times a user played a song (`play_count`). To standardize the data and make it compatible with the collaborative filtering algorithms used, the `play_count` column is normalized to a range of [0, 1]. This is achieved through **Min-Max Scaling**, where the minimum play count is subtracted from each value, and the result is divided by the range (maximum play count - minimum play count). This step ensures that all values lie within a uniform range, which helps the algorithms better interpret and process the data.

The normalized dataset is then converted into a format suitable for the Surprise library using the `Reader` and `Dataset` classes. The `Reader` specifies the rating scale as (0, 1), aligning with the normalized play counts, and the `Dataset` class is used to map the data into a collaborative filtering-friendly structure.

```python
# Normalize play_count using min-max scaling
min_play_count = song_dataset['play_count'].min()
max_play_count = song_dataset['play_count'].max()
song_dataset['normalized_play_count'] = (song_dataset['play_count'] - min_play_count) / (max_play_count - min_play_count)

# Prepare data for the Surprise library
reader = Reader(rating_scale=(0, 1))
data = Dataset.load_from_df(song_dataset[['user', 'song', 'normalized_play_count']], reader)
```

This preparation ensures that the collaborative filtering algorithms can efficiently train and make predictions without being affected by unscaled or skewed data distributions.

### Algorithm Evaluation

Several collaborative filtering algorithms were tested to identify the best-performing model for recommending songs. These algorithms include:

1. **SVD and SVD++**: Matrix factorization techniques that decompose the user-song interaction matrix into latent factors. SVD++ incorporates implicit feedback to improve recommendations.
2. **KNN-Based Algorithms**:
   - `KNNBaseline`: Combines neighborhood-based collaborative filtering with baseline adjustments for better accuracy.
   - `KNNBasic`, `KNNWithMeans`, and `KNNWithZScore`: Variants of KNN that apply different strategies, such as incorporating means or standard scores.
3. **CoClustering**: A collaborative filtering approach that groups users and items into co-clusters to make predictions.

Each algorithm was evaluated using **5-fold cross-validation** on two performance metrics:
- **Root Mean Squared Error (RMSE)**: Measures the average squared difference between predicted and actual play counts, penalizing large errors more heavily.
- **Mean Absolute Error (MAE)**: Measures the average absolute difference between predicted and actual play counts, giving equal weight to all errors.

The evaluation process used the `cross_validate` function from the `Surprise` library, which trains and tests the algorithms on different data splits and computes the metrics for each fold. Results for each algorithm were aggregated into a DataFrame for comparison.

```python
# Cross-validation for each algorithm
results = []
for algo in algo_list:
    res = cross_validate(algo, data, measures=["RMSE", "MAE"], cv=5, verbose=True)
    res = pd.DataFrame.from_dict(res)
    res['algo'] = algo.__class__.__name__
    results.append(res)
```

#### Results Analysis

The performance of the algorithms is summarized in the results table, showing the average `test_rmse`, `test_mae`, `fit_time`, and `test_time` across all folds:

| **Algorithm**    | **Test RMSE** | **Test MAE** | **Fit Time** | **Test Time** |
|------------------|---------------|--------------|--------------|---------------|
| KNNBaseline      | **0.409**     | 0.125        | 0.706        | 0.475         |
| KNNBasic         | 0.432         | 0.138        | 0.180        | 0.541         |
| KNNWithMeans     | 0.432         | 0.131        | 0.186        | 0.529         |
| KNNWithZScore    | 0.440         | 0.130        | 0.348        | 0.637         |
| CoClustering     | 0.442         | **0.113**    | 5.316        | 0.142         |
| SVDpp            | 2.289         | 1.114        | 21.572       | 2.851         |
| SVD              | 4.709         | 2.309        | 4.300        | 0.237         |


#### Key Findings

1. **Best Performing Algorithm**:
   - **KNNBaseline** achieved the lowest RMSE (0.409), indicating high accuracy in predicting user-song play counts. Its MAE (0.125) is also competitive, demonstrating reliable prediction of song preferences. Additionally, its fit time (0.706s) and test time (0.475s) are reasonably low, making it both accurate and computationally efficient.

2. **Alternative KNN-Based Algorithms**:
   - Other KNN variants, such as `KNNBasic`, `KNNWithMeans`, and `KNNWithZScore`, performed slightly worse than KNNBaseline in terms of RMSE and MAE. These models lack the baseline adjustment mechanism of KNNBaseline, which likely contributed to their lower accuracy.

3. **CoClustering**:
   - While CoClustering had the lowest MAE (0.113), its RMSE (0.442) was higher than that of KNNBaseline. Moreover, its fit time (5.316s) is significantly higher, making it less practical for larger datasets or real-time applications.

4. **Poor Performing Algorithms**:
   - **SVD** and **SVD++** performed poorly, with very high RMSE and MAE values. SVDpp, despite being theoretically more robust, had extremely long fit times (21.572s) and still failed to match the performance of the simpler KNN-based methods.


The results of the analysis suggest that while collaborative filtering (CF) methods like KNNBaseline perform well in predicting user preferences, there are limitations that can be addressed by incorporating content-based filtering into the system. A hybrid recommendation system that combines collaborative filtering with content-based filtering would likely enhance performance by leveraging the strengths of both approaches. A hybrid system that combines both methods would result in a more robust and personalized recommendation engine, capable of handling new users, new songs, and sparse interaction data. This hybrid approach leverages the strengths of both methodologies, leading to more accurate and diverse song recommendations.

In [15]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics.pairwise import cosine_similarity

# Load and prepare data
def load_data():
    df = pd.read_csv("song_dataset.csv")
    return df

def run_imps(df):
    required_columns = ['user', 'song', 'play_count', 'title', 'artist_name', 'release']
    if not all(col in df.columns for col in required_columns):
        raise ValueError(f"Dataset must contain the following columns: {required_columns}")

    df = df.drop_duplicates(subset=['song', 'title', 'artist_name', 'release'])
    df['combined_features'] = (df['title'] + " " + df['artist_name'] + " " + df['release']).fillna("")

    # Content-Based Filtering
    tfidf = TfidfVectorizer(max_features=5000, stop_words='english')
    tfidf_matrix = tfidf.fit_transform(df['combined_features'])

    nn = NearestNeighbors(n_neighbors=10, metric='cosine', algorithm='auto')
    nn.fit(tfidf_matrix)

    # Collaborative Filtering
    user_song_matrix = df.pivot_table(index='user', columns='song', values='play_count', fill_value=0)
    knn_cf = NearestNeighbors(n_neighbors=10, metric='cosine', algorithm='auto')
    knn_cf.fit(user_song_matrix)

    return df, tfidf, tfidf_matrix, nn, user_song_matrix, knn_cf

df = load_data()
df, tfidf, tfidf_matrix, nn, user_song_matrix, knn_cf = run_imps(df)

# Content-based recommendation function
def content_based_recommend(song_title, top_n=5):
    try:
        idx = df[df['title'] == song_title].index[0]
        distances, indices = nn.kneighbors(tfidf_matrix[idx], n_neighbors=top_n + 1)
        song_indices = indices.flatten()[1:]
        return df.iloc[song_indices][['title', 'artist_name', 'release']].drop_duplicates()
    except IndexError:
        return pd.DataFrame(columns=['title', 'artist_name', 'release'])

# Collaborative recommendation function using KNN
def collaborative_recommend(user_id, top_n=5):
    if user_id not in user_song_matrix.index:
        return pd.DataFrame(columns=['title', 'artist_name', 'release'])

    # Get the nearest neighbors for the user
    user_index = user_song_matrix.index.get_loc(user_id)
    distances, indices = knn_cf.kneighbors(user_song_matrix.iloc[user_index].values.reshape(1, -1), n_neighbors=top_n + 1)

    # Collect recommendations from neighbors
    neighbors = indices.flatten()[1:]
    listened_songs = user_song_matrix.loc[user_id][user_song_matrix.loc[user_id] > 0].index

    recommendations = {}
    for neighbor in neighbors:
        neighbor_songs = user_song_matrix.iloc[neighbor]
        for song, play_count in neighbor_songs.items():
            if song not in listened_songs and play_count > 0:
                recommendations[song] = recommendations.get(song, 0) + play_count

    # Sort songs by aggregated scores
    recommended_songs = sorted(recommendations.items(), key=lambda x: x[1], reverse=True)[:top_n]
    recommended_song_ids = [song for song, _ in recommended_songs]
    return df[df['song'].isin(recommended_song_ids)][['title', 'artist_name', 'release']].drop_duplicates()

# Hybrid Recommendation
def hybrid_recommendv2(user_id, song_titles, top_n=5):
    collab_recs = collaborative_recommend(user_id, top_n)
    content_recs = pd.DataFrame()
    for song_title in song_titles:
        content_recs = pd.concat([content_recs, content_based_recommend(song_title, top_n)], ignore_index=True)
    hybrid_recs = pd.concat([collab_recs, content_recs]).drop_duplicates().sample(frac=1).reset_index(drop=True)
    return hybrid_recs.head(top_n)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['combined_features'] = (df['title'] + " " + df['artist_name'] + " " + df['release']).fillna("")


In [16]:
user_id = "21f4ac98aa1665bd42027ba12184a939ff435f59"  # Replace with a valid user ID
song_title = ["I'm On A Boat"]  # Replace with a valid song title

print("Hybrid Recommendations:")
hybrid_recommendv2(user_id, song_title)

Hybrid Recommendations:




Unnamed: 0,title,artist_name,release
0,Fleur blanche,Ortsen,Hôtel Costes X - by Stéphane Pompougnac
1,Get Confused,Fischerspooner,Odyssey
2,And the Psychic Saw,Atheist,Unquestionable Presence
3,Tonto Corazón,Joe Veras,Tonto Corazón
4,Mi Corazón,Campo,Bajofondo Remixed


In [17]:
# Example Usage for multiple songs
user_id = "21f4ac98aa1665bd42027ba12184a939ff435f59"
song_titles = ["Nothing from Nothing", "The Cove", "Entre Dos Aguas"]

print("Hybrid Recommendations (Multiple Songs):")
hybrid_recommendv2(user_id, song_titles)

Hybrid Recommendations (Multiple Songs):




Unnamed: 0,title,artist_name,release
0,Get Confused,Fischerspooner,Odyssey
1,Holes To Heaven,Jack Johnson,Thicker Than Water
2,Rainbow,Jack Johnson / G. Love,Thicker Than Water
3,Symbol In My Driveway,Jack Johnson,On and On
4,J.Icaro,Niños Mutantes,Otoño En Agosto


In [18]:
# Example Usage for multiple songs
user_id = "21f4ac98aa1665bd42027ba12184a939ff435f59"
song_titles = ["Get Confused", "Partylöwe", "Moonshine"]

print("Hybrid Recommendations (Multiple Songs):")
hybrid_recommendv2(user_id, song_titles)

Hybrid Recommendations (Multiple Songs):




Unnamed: 0,title,artist_name,release
0,I Think I See The Light,Cat Stevens,Mona Bone Jakon
1,Fleur blanche,Ortsen,Hôtel Costes X - by Stéphane Pompougnac
2,Fill My Eyes,Cat Stevens,Mona Bone Jakon
3,Cristobal,Devendra Banhart,Smokey Rolls Down Thunder Canyon
4,Samba Vexillographica,Devendra Banhart,Smokey Rolls Down Thunder Canyon


# Hybrid Recommendation System

This recommendation system combines **content-based filtering** and **collaborative filtering** to generate personalized song recommendations. By utilizing both metadata and user interaction data, it provides robust and diverse suggestions tailored to user preferences.

The hybrid recommendation system offers several advantages by combining content-based filtering, collaborative filtering, and their complementary strengths. Content-based filtering excels in handling the cold-start problem for new songs by relying on metadata such as the song title, artist name, and release information. This approach provides recommendations based on textual similarity, ensuring that even songs with minimal user interaction can still be recommended effectively. Collaborative filtering, on the other hand, leverages user behavior and interaction data to suggest songs that are popular among users with similar preferences. This method adds a layer of personalization by tailoring recommendations to an individual user's listening habits. By integrating these two techniques, the hybrid filtering approach enhances the overall robustness of the system. It combines the metadata-driven accuracy of content-based filtering with the behavioral insights of collaborative filtering, resulting in diverse, personalized recommendations. This dual approach effectively mitigates the limitations of each standalone method, creating a well-rounded system capable of addressing a wide range of recommendation challenges.

## 1. **Data Preparation**

The data is loaded and preprocessed to ensure it is ready for both recommendation techniques. The steps include:

- **Validation**: Ensures all required columns (`user`, `song`, `play_count`, `title`, `artist_name`, `release`) are present in the dataset.
- **Duplicate Removal**: Removes duplicates to maintain unique records.
- **Feature Engineering**: Combines `title`, `artist_name`, and `release` into a single `combined_features` column for text-based similarity calculations.

### Code Snippet:

```python
def load_data():
    df = pd.read_csv("song_dataset.csv")
    return df

def run_imps(df):
    required_columns = ['user', 'song', 'play_count', 'title', 'artist_name', 'release']
    if not all(col in df.columns for col in required_columns):
        raise ValueError(f"Dataset must contain the following columns: {required_columns}")

    df = df.drop_duplicates(subset=['song', 'title', 'artist_name', 'release'])
    df['combined_features'] = (df['title'] + " " + df['artist_name'] + " " + df['release']).fillna("")
    return df
```


## 2. **Content-Based Filtering**

This approach uses **TF-IDF Vectorization** to represent song metadata (`combined_features`) in a high-dimensional space. The **Nearest Neighbors** algorithm is then used to find songs similar to a given song based on cosine similarity.

- **TF-IDF Vectorization**: Captures the importance of terms in the metadata.
- **Nearest Neighbors**: Identifies the most similar songs.

### Code Snippet:

```python
# TF-IDF vectorization and Nearest Neighbors for content-based filtering
tfidf = TfidfVectorizer(max_features=5000, stop_words='english')
tfidf_matrix = tfidf.fit_transform(df['combined_features'])

nn = NearestNeighbors(n_neighbors=10, metric='cosine', algorithm='auto')
nn.fit(tfidf_matrix)
```

- **Content-Based Recommendation Function**:
  Recommends songs similar to the input song based on metadata similarity.

```python
def content_based_recommend(song_title, top_n=5):
    try:
        idx = df[df['title'] == song_title].index[0]
        distances, indices = nn.kneighbors(tfidf_matrix[idx], n_neighbors=top_n + 1)
        song_indices = indices.flatten()[1:]
        return df.iloc[song_indices][['title', 'artist_name', 'release']].drop_duplicates()
    except IndexError:
        return pd.DataFrame(columns=['title', 'artist_name', 'release'])
```

## 3. **Collaborative Filtering**

Collaborative filtering is implemented using **K-Nearest Neighbors (KNN)** on the user-song interaction matrix (`user_song_matrix`). The system recommends songs liked by users with similar play patterns.

- **User-Song Matrix**: Encodes play counts for each user-song pair.
- **KNN Model**: Finds users with similar listening habits and aggregates their preferences.

### Code Snippet:

```python
# Create user-song interaction matrix and fit KNN for collaborative filtering
user_song_matrix = df.pivot_table(index='user', columns='song', values='play_count', fill_value=0)
knn_cf = NearestNeighbors(n_neighbors=10, metric='cosine', algorithm='auto')
knn_cf.fit(user_song_matrix)

# Collaborative Recommendation Function
def collaborative_recommend(user_id, top_n=5):
    if user_id not in user_song_matrix.index:
        return pd.DataFrame(columns=['title', 'artist_name', 'release'])

    # Get the nearest neighbors for the user
    user_index = user_song_matrix.index.get_loc(user_id)
    distances, indices = knn_cf.kneighbors(user_song_matrix.iloc[user_index].values.reshape(1, -1), n_neighbors=top_n + 1)
    
    # Collect recommendations from neighbors
    neighbors = indices.flatten()[1:]
    listened_songs = user_song_matrix.loc[user_id][user_song_matrix.loc[user_id] > 0].index

    recommendations = {}
    for neighbor in neighbors:
        neighbor_songs = user_song_matrix.iloc[neighbor]
        for song, play_count in neighbor_songs.items():
            if song not in listened_songs and play_count > 0:
                recommendations[song] = recommendations.get(song, 0) + play_count

    # Sort songs by aggregated scores
    recommended_songs = sorted(recommendations.items(), key=lambda x: x[1], reverse=True)[:top_n]
    recommended_song_ids = [song for song, _ in recommended_songs]
    return df[df['song'].isin(recommended_song_ids)][['title', 'artist_name', 'release']].drop_duplicates()
```

## 4. **Hybrid Filtering**

This function combines content-based and collaborative filtering. It provides diverse recommendations by merging:
- Songs similar to those the user has liked (content-based).
- Songs liked by users with similar preferences (collaborative).

### Code Snippet:

```python
def hybrid_recommendv2(user_id, song_titles, top_n=5):
    collab_recs = collaborative_recommend(user_id, top_n)
    content_recs = pd.DataFrame()
    for song_title in song_titles:
        content_recs = pd.concat([content_recs, content_based_recommend(song_title, top_n)], ignore_index=True)
    hybrid_recs = pd.concat([collab_recs, content_recs]).drop_duplicates().sample(frac=1).reset_index(drop=True)
    return hybrid_recs.head(top_n)
```

## 5. **Example Usage**

```python
# Example Usage for multiple songs
user_id = "21f4ac98aa1665bd42027ba12184a939ff435f59"
song_titles = ["Get Confused", "Partylöwe", "Moonshine"]

print("Hybrid Recommendations (Multiple Songs):")
hybrid_recommendv2(user_id, song_titles)

```
| Title                  | Artist Name        | Release                                 |
|------------------------|--------------------|-----------------------------------------|
| I Think I See The Light | Cat Stevens       | Mona Bone Jakon                         |
| Fleur blanche          | Ortsen            | Hôtel Costes X - by Stéphane Pompougnac |
| Fill My Eyes           | Cat Stevens       | Mona Bone Jakon                         |
| Cristobal              | Devendra Banhart  | Smokey Rolls Down Thunder Canyon        |
| Samba Vexillographica  | Devendra Banhart  | Smokey Rolls Down Thunder Canyon        |
