# Music recommender system

Recommender systems are among the most widely used applications of machine learning. A **recommender** (or recommendation) **system** (or engine) is a filtering system whose aim is to predict the rating or preference a user would give to an item, e.g. a film, a product, or a song.

Which types of recommender are there?

There are two main types of recommender systems: 
- Content-based filters
- Collaborative filters
  
> Content-based filters predict what a user likes based on what that particular user has liked in the past. Collaborative filters, on the other hand, predict what a user likes based on what other, similar users have liked.

### 1) Content-based filters

Recommendations made with content-based recommenders can be seen as a user-specific classification problem: the classifier learns the user's likes and dislikes from the features of the song.

The most straightforward approach is **keyword matching**.

In short, the idea is to extract meaningful keywords from the description of a song the user likes, search for those keywords in other song descriptions to estimate how similar they are, and recommend the most similar songs to the user.
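To make keyword matching concrete, here is a minimal sketch that scores two descriptions by the overlap of their keyword sets (Jaccard similarity). The helper names `keywords` and `jaccard`, the tiny stop-word list, and the sample texts are illustrative assumptions, not part of this notebook's pipeline.

```python
import re

# Hypothetical, tiny stop-word list used only for this illustration.
STOP_WORDS = {"a", "an", "the", "of", "and", "in", "for", "to"}

def keywords(text: str) -> set:
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return {t for t in tokens if t not in STOP_WORDS}

def jaccard(a: str, b: str) -> float:
    """Share of keywords the two texts have in common (0.0 to 1.0)."""
    ka, kb = keywords(a), keywords(b)
    if not ka and not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)

liked = "a song of love and rain"
candidates = ["love in the rain", "dancing in the city"]

# Rank candidates by keyword overlap with the liked description.
ranked = sorted(candidates, key=lambda c: jaccard(liked, c), reverse=True)
print(ranked[0])
```

Plain set overlap treats every keyword as equally important, which is the main weakness that TF-IDF weighting addresses.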

*How is this performed?*

In our case, because we are working with text and words, **Term Frequency-Inverse Document Frequency (TF-IDF)** can be used for this matching process.
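As a rough illustration of what a TF-IDF score expresses, the sketch below computes the textbook form, tf × log(N / df), by hand on a toy corpus invented for this example. Note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from these; the ranking idea is the same.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration.
docs = ["love rain", "love sun", "love moon"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    """Textbook TF-IDF weights for one document: tf * log(N / df)."""
    counts = Counter(tokens)
    return {
        t: (c / len(tokens)) * math.log(n_docs / df[t])
        for t, c in counts.items()
    }

weights = tfidf(tokenized[0])
print(weights)
```

The word "love" appears in every document, so its weight drops to zero, while "rain" appears in only one document and keeps a positive weight.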
  
We'll go through the steps for generating a **content-based** music recommender system.

### Importing required libraries

First, we'll import all the required libraries.

In [2]:
import numpy as np
import pandas as pd

In [3]:
from typing import List, Dict

We have already used the **TF-IDF score** before, when performing Twitter sentiment analysis.

Likewise, we are going to use `TfidfVectorizer` from the scikit-learn package again.

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

### Dataset

In [5]:
import pyodbc
print(pyodbc.drivers())


['SQL Server', 'ODBC Driver 17 for SQL Server']


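Imagine that we have the [following dataset](https://www.kaggle.com/mousehead/songlyrics/data#).

This dataset contains the name, artist, and lyrics of *57650 songs in English*. The data was scraped from LyricsFreak.

Note that the cells below do not load that file: they read a `job` table of job postings from a SQL Server database, so wherever the text says "song", the code is working with a job `name`.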

In [6]:
import pyodbc
import csv
import pandas as pd

# SQL Server connection settings
server = 'job-hub-kltn.database.windows.net'
database = 'job-hub-database'
username = 'jobhub'
password = '28072002Thanh'
driver = '{ODBC Driver 17 for SQL Server}'

# Build the connection string
connection_string = f'DRIVER={driver};SERVER={server};DATABASE={database};UID={username};PWD={password}'

# Connect to the database
conn = pyodbc.connect(connection_string)

# Create a cursor object to execute SQL queries
cursor = conn.cursor()

# Example SQL query
query = """
SELECT * FROM job
"""
cursor.execute(query)

# Fetch all rows from the cursor
rows = cursor.fetchall()

# Write the data to a CSV file
with open('data_sql.csv', 'w', newline='', encoding='utf-8') as f:
    # Create the CSV writer object
    writer = csv.writer(f)
    
    # Write the column headers
    writer.writerow([column[0] for column in cursor.description])
    
    # Write the data rows
    writer.writerows(rows)

# Close the database connection
conn.close()

# Read the data from the CSV file into a DataFrame
job_df = pd.read_csv('data_sql.csv')
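The cell above embeds the database credentials directly in the notebook. One possible alternative, sketched here under stated assumptions, is to assemble the connection string from environment variables. The function name `build_connection_string` and the variable names (`DB_SERVER`, `DB_NAME`, `DB_USER`, `DB_PASSWORD`, `DB_DRIVER`) are hypothetical and not part of the original notebook.

```python
import os

def build_connection_string(env=os.environ) -> str:
    """Assemble an ODBC connection string from environment-style settings.

    The variable names used here are hypothetical; substitute whatever
    names your own deployment defines.
    """
    driver = env.get("DB_DRIVER", "{ODBC Driver 17 for SQL Server}")
    return (
        f"DRIVER={driver};"
        f"SERVER={env['DB_SERVER']};"
        f"DATABASE={env['DB_NAME']};"
        f"UID={env['DB_USER']};"
        f"PWD={env['DB_PASSWORD']}"
    )

# A plain dict stands in for os.environ so the example runs anywhere.
example_env = {
    "DB_SERVER": "example.database.windows.net",
    "DB_NAME": "example-db",
    "DB_USER": "reader",
    "DB_PASSWORD": "secret",
}
print(build_connection_string(example_env))
```

The resulting string can be passed to `pyodbc.connect` exactly as before. pandas also offers `pd.read_sql(query, conn)`, which may let you skip the intermediate CSV file.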


In [7]:
rows

[('00249d9f-88ea-4b81-a0ca-1923a8a75fce', '&lt;p&gt;- Thu nhập:\xa0&lt;/p&gt;&lt;p&gt;- Giáo viên Full time: Lương từ 10-18 Triệu/tháng&lt;/p&gt;&lt;p&gt;- Giáo viên Part time: Lương thỏa thuận&lt;/p&gt;&lt;p&gt;- Ứng viên có thể lựa chọn 1 trong 2 cơ sở để thuận tiện đi lại\xa0&lt;/p&gt;&lt;p&gt;- Mỗi lớp tối đa 8 - 10 học sinh và 1 trợ giảng hỗ trợ\xa0&lt;/p&gt;&lt;p&gt;- Đa số các lớp được cung cấp giáo trình, slide và phiếu bài tập về nhà&lt;/p&gt;&lt;p&gt;- Môi trường làm việc trẻ trung, năng động, tâm huyết với nghề.&lt;/p&gt;&lt;p&gt;- Được sắp xếp nhiều ca dạy để tăng thu nhập&lt;/p&gt;&lt;p&gt;- Được đào tạo góp ý để nâng cao chuyên môn và kỹ năng&lt;/p&gt;&lt;p&gt;- Được ưu đãi học phí cho người thân khi đăng ký học tại trung tâm&lt;/p&gt;&lt;p&gt;- Cơ hội thăng tiến được cân nhắc lên vị trí quản lý&lt;/p&gt;&lt;p&gt;\r\n&lt;/p&gt;', datetime.datetime(2024, 3, 15, 23, 45, 36), datetime.datetime(2024, 4, 1, 0, 0), '&lt;p&gt;- Giảng dạy ngữ pháp, kỹ năng đọc viết và quản lý trậ

In [8]:
job_df.head()

Unnamed: 0,job_id,benefit,created_at,deadline,description,experience,is_active,job_type,link,location,...,quantity,requirement,salary_range,status,time,updated_at,employer_user_id,position_position_id,total_view,user_id
0,00249d9f-88ea-4b81-a0ca-1923a8a75fce,&lt;p&gt;- Thu nhập: &lt;/p&gt;&lt;p&gt;- Giáo...,2024-03-15 23:45:36,2024-04-01 00:00:00,"&lt;p&gt;- Giảng dạy ngữ pháp, kỹ năng đọc viế...",2,True,CONTRACT,https://www.topcv.vn/viec-lam/giao-vien-tieng-...,"5/4 Lê Văn Chí, Linh Trung, Thủ Đức, Thành phố...",...,8,&lt;p&gt;- Tốt nghiệp chuyên ngành sư phạm Anh...,VND:10000000-18000000,0,5 hours,,e0f3045c-ae4f-40cf-83ba-d845f3c76364,2DF30267-DFC8-499D-8C43-FF8E436E8EF9,0,
1,00526e4f-c995-4bc8-8d4c-527387dbc113,&lt;ul&gt;&lt;li&gt;&lt;b&gt;Thu nhập hấp dẫn ...,2024-03-15 23:43:53,2024-03-31 00:00:00,&lt;ul&gt;&lt;li&gt;Tìm kiếm và tư vấn dịch vụ...,2,True,FULL_TIME,https://www.topcv.vn/viec-lam/nhan-vien-kinh-d...,"5/4 Lê Văn Chí, Linh Trung, Thủ Đức, Thành phố...",...,2,"&lt;ul&gt;&lt;li&gt;Trình độ cao đẳng trở lên,...",VND:10000000-30000000,0,6 hours,,d5abbf18-9ac8-44dc-ba71-5a768d1f3887,EBC24826-7EF7-4540-86B3-7CCD11481E21,0,
2,016d51fa-ca59-46f3-9dc1-3ca070026424,"&lt;p&gt;- Môi trường làm việc dân chủ, hiện đ...",2024-03-15 23:41:26,2024-03-20 00:00:00,&lt;p&gt;- Nắm bắt được xu hướng về content.\r...,2,False,CONTRACT,https://www.topcv.vn/viec-lam/tro-ly-tieng-tru...,"5/4 Lê Văn Chí, Linh Trung, Thủ Đức, Thành phố...",...,9,&lt;p&gt;- Ưu tiên sinh viên sắp và mới tốt ng...,VND:10000000-15000000,0,7 hours,2024-03-20 00:00:06.176000,99b11b9f-7288-4111-aae0-7e2f407163f9,c5606085-3ee9-4e1c-9810-daedf514994a,0,
3,018e75b3-dc53-4022-b06b-6beb23198a1a,&lt;p&gt;&lt;b&gt;- Mức lương: Lương cạnh tran...,2024-03-15 23:43:18,2024-04-14 00:00:00,"&lt;p&gt;- Lập hồ sơ thiết kế, triển khai hồ s...",2,True,CONTRACT,https://www.topcv.vn/viec-lam/kien-truc-su-luo...,"5/4 Lê Văn Chí, Linh Trung, Thủ Đức, Thành phố...",...,3,&lt;p&gt;&lt;b&gt;- Chỉ nhận Tốt nghiệp đại họ...,VND:0-15000000,0,8 hours,,c88d0cbe-0dd4-44d2-842a-c3497e081ead,235A69EE-10E0-40E7-B261-4EBA9B5A491E,0,
4,01a27fb6-f982-4b67-b0b6-c16a88b2eebb,&lt;p&gt;- Lương khởi điểm: 10 - 12tr/tháng (t...,2024-03-15 23:44:53,2024-04-15 00:00:00,"&lt;p&gt;- Biên soạn, tổng hợp tài liệu, xây d...",1,True,PART_TIME,https://www.topcv.vn/viec-lam/giao-vien-toan-t...,"5/4 Lê Văn Chí, Linh Trung, Thủ Đức, Thành phố...",...,4,"&lt;p&gt;- Có chuyên môn vững, kinh nghiệm giả...",VND:0-12000000,0,6 hours,,b44b8cc4-e962-45c9-8442-77c86db2f2d2,CBB96933-92E2-407C-89C7-AABA9279B9ED,0,


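The song dataset is large enough that one would resample it, keeping for example only 5000 random songs. The table loaded here has only 894 rows, as the `info()` output below shows, so no resampling is performed.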

In [9]:
job_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 894 entries, 0 to 893
Data columns (total 22 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   job_id                894 non-null    object 
 1   benefit               894 non-null    object 
 2   created_at            894 non-null    object 
 3   deadline              894 non-null    object 
 4   description           894 non-null    object 
 5   experience            894 non-null    int64  
 6   is_active             894 non-null    bool   
 7   job_type              894 non-null    object 
 8   link                  890 non-null    object 
 9   location              894 non-null    object 
 10  logo                  894 non-null    object 
 11  name                  894 non-null    object 
 12  quantity              894 non-null    int64  
 13  requirement           894 non-null    object 
 14  salary_range          894 non-null    object 
 15  status                8

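We can also notice the presence of `\n` in the text, so we are going to remove it.

In [10]:
# regex=True is set explicitly so that r'\n' matches newline characters on every pandas version
job_df['name'] = job_df['name'].str.replace(r'\n', '', regex=True)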
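The `rows` output above shows that the longer text columns (for example `benefit` and `description`) contain HTML-escaped markup such as `&lt;p&gt;`. The notebook only vectorizes `name`, so this step is optional, but if you wanted to feed those columns to TF-IDF, a cleaning helper along the following lines might be useful. The function name `clean_html_text` is an assumption made for this sketch.

```python
import html
import re

def clean_html_text(raw: str) -> str:
    """Decode HTML entities, drop tags, and collapse whitespace."""
    text = html.unescape(raw)             # '&lt;p&gt;' becomes '<p>'
    text = re.sub(r"<[^>]+>", " ", text)  # remove tags such as <p> or </li>
    text = text.replace("\xa0", " ")      # non-breaking space to regular space
    return re.sub(r"\s+", " ", text).strip()

# Sample fragment modelled on the `benefit` column shown above.
sample = "&lt;p&gt;- Thu nhập:\xa0&lt;/p&gt;&lt;p&gt;- Lương thỏa thuận&lt;/p&gt;\r\n"
print(clean_html_text(sample))
```

With pandas, the helper could be applied column-wise, for example `job_df['description'].apply(clean_html_text)`.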

After that, we use a TF-IDF vectorizer that calculates the TF-IDF score of each text, word by word (song lyrics in the description above; the job `name` in the code below).

Here, we pay particular attention to the arguments we can specify.

In [11]:
tfidf = TfidfVectorizer(analyzer='word', stop_words='english')

In [12]:
lyrics_matrix = tfidf.fit_transform(job_df['name'])

*How do we use this matrix for a recommendation?* 

We now need to calculate the similarity of one lyric to another. We are going to use **cosine similarity**.

We want to calculate the cosine similarity of each item with every other item in the dataset, so we simply pass `lyrics_matrix` as the only argument.

In [13]:
cosine_similarities = cosine_similarity(lyrics_matrix) 
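For intuition, the cosine similarity of two vectors a and b is a·b / (‖a‖ ‖b‖): 1 when they point in the same direction, 0 when they share no terms. The sketch below computes the same kind of pairwise matrix with plain NumPy on toy vectors invented for illustration. The function name `cosine_sim_matrix` is hypothetical, and this is not how scikit-learn implements it internally.

```python
import numpy as np

def cosine_sim_matrix(X: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0   # avoid division by zero for all-zero rows
    unit = X / norms          # scale every row to length 1
    return unit @ unit.T      # dot products of unit vectors are cosines

# Toy term-weight vectors (rows = items, columns = terms).
X = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 2.0]])
S = cosine_sim_matrix(X)
print(np.round(S, 3))
```

Because `TfidfVectorizer` L2-normalizes each row by default, the cosine similarity of its output reduces to a plain dot product between rows.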

Once we have the similarities, we'll store in a dictionary the names of the 50 most similar songs for each song in our dataset.

In [14]:
similarities = {}

In [15]:
for i in range(len(cosine_similarities)):
    # Sort row i by similarity and take the 51 highest-scoring indices, in descending order:
    # the item itself plus its 50 nearest neighbours.
    similar_indices = cosine_similarities[i].argsort()[:-52:-1]
    # Store (score, name, position id) for each neighbour,
    # skipping the first entry, which is normally the item itself.
    similarities[job_df['name'].iloc[i]] = [(cosine_similarities[i][x], job_df['name'][x], job_df['position_position_id'][x]) for x in similar_indices][1:]
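Dropping the first sorted entry assumes the query item always sorts first. When several rows have identical names, they all score 1.0 against each other, so the entry that gets dropped may be a duplicate rather than the item itself. A possible alternative, sketched here with a hypothetical helper `top_k_similar` and a toy matrix, is to exclude the query by index.

```python
import numpy as np

def top_k_similar(sim: np.ndarray, i: int, k: int) -> list:
    """Indices of the k items most similar to item i, excluding i itself."""
    order = np.argsort(sim[i])[::-1]  # indices sorted by descending similarity
    order = order[order != i]         # drop the query item by index, not by position
    return order[:k].tolist()

# Toy similarity matrix: items 0 and 1 are exact duplicates of each other.
sim = np.array([[1.0, 1.0, 0.2],
                [1.0, 1.0, 0.5],
                [0.2, 0.5, 1.0]])
print(top_k_similar(sim, 0, 2))
```

Note also that the dictionary above is keyed by name, so rows that share a name overwrite one another; keying by a unique column such as `job_id` would avoid that.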

After that, all the magic happens. We can use those similarity scores to access the most similar items and give a recommendation.

For that, we'll define our content-based recommender class.

In [16]:
class ContentBasedRecommender:
    def __init__(self, matrix):
        self.matrix_similar = matrix

    def _print_message(self, job, recom_job):
        rec_items = len(recom_job)
        
        print(f'The {rec_items} recommended job for {job} are:')
        for i in range(rec_items):
            print(f"Number {i+1}:")
            print(f"{recom_job[i][1]} by {recom_job[i][2]} with {round(recom_job[i][0], 3)} similarity score") 
            print("--------------------")
        
    def recommend(self, recommendation):
        # Get song to find recommendations for
        song = recommendation['song']
        # Get number of songs to recommend
        number_songs = recommendation['number_songs']
        # Take the top `number_songs` entries from the precomputed similarity lists
        recom_song = self.matrix_similar[song][:number_songs]
        # print each item
        self._print_message(job=song, recom_job=recom_song)
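The class above prints its results, which is convenient in a notebook but awkward if the recommendations need to be consumed by other code. A minimal variant that returns the list instead might look like the following. The class name `ListRecommender`, the `Neighbour` alias, and the toy data are assumptions made for this sketch.

```python
from typing import Dict, List, Tuple

# (similarity score, item name, position id), mirroring the tuples stored above.
Neighbour = Tuple[float, str, str]

class ListRecommender:
    """Hypothetical variant that returns recommendations instead of printing them."""

    def __init__(self, similar: Dict[str, List[Neighbour]]):
        self.similar = similar

    def recommend(self, name: str, k: int = 10) -> List[Neighbour]:
        # Unknown names yield an empty list rather than raising KeyError.
        return self.similar.get(name, [])[:k]

# Toy similarity dictionary, invented for illustration.
toy = {"Job A": [(0.9, "Job B", "id-b"), (0.4, "Job C", "id-c")]}
rec = ListRecommender(toy)
print(rec.recommend("Job A", k=1))
```

Returning plain tuples keeps the printing logic separate, so the same results can be formatted for a notebook or serialized for an API.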

Now, we instantiate the class.

In [17]:
recommedations = ContentBasedRecommender(similarities)

Then, we are ready to pick a song from the dataset and make a recommendation.

In [18]:
recommendation = {
    "song": job_df['name'].iloc[13],
    "number_songs": 10
}

In [19]:
recommendation

{'song': 'Kế Toán Trưởng', 'number_songs': 10}

In [20]:
recommedations.recommend(recommendation)

The 10 recommended job for Kế Toán Trưởng are:
Number 1:
Kế Toán Trưởng by 0CABFB66-08D0-4875-B8D0-D354FAD825A6 with 1.0 similarity score
--------------------
Number 2:
Kế Toán Trưởng by 47947EC3-CD18-4FB6-81F9-833C7F375208 with 1.0 similarity score
--------------------
Number 3:
Kế Toán Trưởng Thu Nhập Từ 20 - 30 Triệu by D5846B71-9143-4458-8DCE-1FFCD49AB3C2 with 0.697 similarity score
--------------------
Number 4:
Chuyên Viên Kế Toán by 06FF8A9B-71F5-4CFC-A71F-72020484B955 with 0.678 similarity score
--------------------
Number 5:
Kế Toán Bán Hàng by E93F133B-8B6F-45C1-9A32-C23D26676224 with 0.595 similarity score
--------------------
Number 6:
Kế Toán Trưởng Thu Nhập Từ 18 -20 Triệu, Tại Hà Nội by 42FC988E-29A8-4DF2-A693-E9C8D5256502 with 0.593 similarity score
--------------------
Number 7:
Kế Toán Trưởng - Lương Upto Đến 35 Triệu by 7459E7A6-FF41-4E2B-AA1A-8F0A52AF5F91 with 0.59 similarity score
--------------------
Number 8:
Kế Toán Thuế by 5ECE813D-0D54-48A6-B32D-A76717028E48 w