In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
# minimal packages to import
import pandas as pd
import numpy as np
import json
import matplotlib.pyplot as plt
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
import string
from tqdm import tqdm
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import spacy
from nltk.tokenize import word_tokenize
import regex as re  # third-party regex module (a superset of re), needed for \p{Sc} in the fees parsing
import requests
import heapq



# setting the filepath for the dataset, depending on where you saved them
scraping_courses = "/content/drive/MyDrive/MyDatasetFolder/your_file.csv"


In [65]:
# Load the dataframe
df_courses = pd.read_csv(scraping_courses)


## **2.0 Preprocessing**

2.0.0) Preprocessing the text

In [66]:
# Download NLTK resources
nltk.download('punkt')
nltk.download('stopwords')

# Load SnowballStemmer
snowstem = SnowballStemmer('english')

# Load stopwords to filter
lst_stopwords = set(stopwords.words('english'))

# Define the function to preprocess text
def preprocess_text(text):
    # Convert to lowercase first: the stopword list is lowercase, so a capitalised 'The' would otherwise not be removed
    text_lower = text.lower()

    # Tokenize, drop stopwords and non-alphanumeric tokens (punctuation), then stem
    stemmed_words = [snowstem.stem(word) for word in nltk.word_tokenize(text_lower) if word not in lst_stopwords and word.isalnum()]

    return stemmed_words

# Create column named 'descr_clean' and then apply preprocess_text to the 'Description' column in df_courses
df_courses['descr_clean'] = df_courses['Description'].apply(preprocess_text)

df_courses

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,Course Name,University Name,Faculty Name,Description,Fees,Modality,Duration,City,Country,Link,administration,descr_clean
0,3D Design for Virtual Environments - MSc,Glasgow Caledonian University,School of Engineering and Built Environment,3D visualisation and animation play a role in ...,Please see the university website for further ...,MSc,1 year full-time,Glasgow,United Kingdom,https://www.findamasters.com/masters-degrees/c...,On Campus,"[3d, visualis, anim, play, role, mani, area, p..."
1,Accounting and Finance - MSc,University of Leeds,Leeds University Business School,Businesses and governments rely on sound finan...,"UK: £18,000 (Total) International: £34,750 (To...",MSc,1 year full time,Leeds,United Kingdom,https://www.findamasters.com/masters-degrees/c...,On Campus,"[busi, govern, reli, sound, financi, knowledg,..."
2,"Accounting, Accountability & Financial Managem...",King’s College London,King’s Business School,"Our Accounting, Accountability & Financial Man...",Please see the university website for further ...,MSc,1 year FT,London,United Kingdom,https://www.findamasters.com/masters-degrees/c...,On Campus,"[account, account, financi, manag, msc, cours,..."
3,"Accounting, Financial Management and Digital B...",University of Reading,Henley Business School,Embark on a professional accounting career wit...,Please see the university website for further ...,MSc,1 year full time,Reading,United Kingdom,https://www.findamasters.com/masters-degrees/c...,On Campus,"[embark, profession, account, career, academ, ..."
4,Addictions MSc,King’s College London,"Institute of Psychiatry, Psychology and Neuros...",Join us for an online session for prospective ...,Please see the university website for further ...,MSc,One year FT,London,United Kingdom,https://www.findamasters.com/masters-degrees/c...,On Campus,"[join, us, onlin, session, prospect, student, ..."
...,...,...,...,...,...,...,...,...,...,...,...,...
5995,Materials Engineering,University of Padua,School of Engineering,The Master's degree Materials Engineering is a...,Our tuition fees will not exceed 2700 euros pe...,MSc,2 years,Padua,Italy,https://www.findamasters.com/masters-degrees/c...,On Campus,"[master, degre, materi, engin, interdisciplina..."
5996,Materials Engineering MSc,Swansea University,School of Engineering and Applied Sciences,The MSc in Materials Engineering provides you ...,Please visit our website for the Materials Eng...,MSc,1 year full-time; 2 years part-time; 3 years p...,Swansea,United Kingdom,https://www.findamasters.com/masters-degrees/c...,On Campus,"[msc, materi, engin, provid, deep, understand,..."
5997,Materials Engineering MSc by Research,Swansea University,School of Engineering and Applied Sciences,Swansea is one of the UK’s leading centres for...,Please visit our website for the Materials Eng...,"MSc, Research Only",1 year full-time; 2 years part-time,Swansea,United Kingdom,https://www.findamasters.com/masters-degrees/c...,On Campus,"[swansea, one, uk, lead, centr, materi, teach,..."
5998,"Materials Engineering with Industry, MSc",Swansea University,School of Engineering and Applied Sciences,Our MSc in Materials Engineering with Industry...,For current fees of the Materials Engineering ...,MSc,2 Years Full Time With a Year In Industry,Swansea,United Kingdom,https://www.findamasters.com/masters-degrees/c...,On Campus,"[msc, materi, engin, industri, cours, open, in..."


**2.0.1) Preprocessing the fees column**

To convert every fee to pounds (£), we fetch exchange rates from the exchangerate-api.com provider (https://www.exchangerate-api.com/).


In [67]:
# Fetch the exchange rates with GBP as the base currency
# The API key is kept in a separate text file; that is not strictly necessary here, but it is good practice

# Read API key from the file
with open('/content/drive/MyDrive/MyDatasetFolder/api_key_exchangerates', 'r') as api_key_file:
    api_key = api_key_file.read().strip()

# URL with the API key
url = f'https://v6.exchangerate-api.com/v6/{api_key}/latest/GBP'

# Making our request for the conversion rates
response = requests.get(url)
data_exchange = response.json()

with open('data_exchange.json', 'w') as json_file:
    json.dump(data_exchange, json_file)

# Read the JSON data from the file
with open('data_exchange.json', 'r') as json_file:
    loaded_data = json.load(json_file)

# Convert the loaded JSON into a pandas DataFrame
df = pd.DataFrame(loaded_data)

# Extract only the 'conversion_rates' column
conversion_rates = df['conversion_rates']
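
The `conversion_rates` field of the response is already a mapping from ISO 4217 code to rate, so a plain dictionary lookup is enough for the conversion. The following is a minimal sketch, assuming a response shaped like the one loaded above; `sample_response` and its rates are invented values for illustration (not real exchange rates), and `to_gbp` is a hypothetical helper.

```python
# Illustrative payload only: the rates below are made up for the example.
sample_response = {
    "result": "success",
    "conversion_rates": {"GBP": 1.0, "EUR": 1.25, "USD": 1.5},
}

rates = sample_response["conversion_rates"]

def to_gbp(amount, currency, rates):
    """Convert an amount to GBP; an unknown or missing currency is treated as GBP."""
    rate = rates.get(currency, 1.0)
    return round(amount / rate, 2)

print(to_gbp(2500, "EUR", rates))   # 2500 / 1.25
print(to_gbp(18000, None, rates))   # no currency detected, the rate falls back to 1.0
```

Falling back to a rate of 1.0 mirrors the `conversion_rates.get(currency, 1.0)` call used later in the notebook; it silently treats any unrecognised currency as pounds, which is worth keeping in mind when reading the converted fees.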


In [68]:
# Define a function to select the maximum fee from a list of fees
def max_value_fee(list_fees):
    filtered_fee_list = [int(fee) for fee in list_fees]
    if filtered_fee_list:
        max_fee = max(filtered_fee_list)
        max_fee = float(max_fee)
    else:
        # If the list is empty, set max_fee to None
        max_fee = None
    return max_fee

# Define a function to remove dots and commas from numbers in a string
def remove_dots_from_numbers(input_string):
    # Use regular expression to find numbers with dots or commas and remove the dots or commas
    result_string = re.sub(r'(\d)[.,](\d{3})', r'\1\2', input_string)
    return result_string

# Iterate over each row in df_courses
for index, row in df_courses.iterrows():
    # Extract the 'Fees' column as a string
    fees_string = str(row['Fees'])

    # Remove dots from numbers and specific year strings (These years tend to create problems)
    fees_string = remove_dots_from_numbers(fees_string)
    fees_string = fees_string.replace("2022", '').replace("2023", '').replace("2024", '')

    # Remove various punctuation, spaces, and special characters
    fees_string = fees_string.replace(".00 ", '').replace(".00€ ", '').replace(",00 ", '').replace(",00€ ", '').replace(".0 ", '').replace(",0 ", '').replace('.', '').replace(',', '').replace("'", '').replace(" ", '')

    # Use a regular expression to find currency symbols or ISO 4217 currency codes (the ones used as keys in our conversion_rates variable)
    matches_cur = re.findall(r'HKD(?:s)?|HK(?:s)?|\p{Sc}|euro(?:s)?|dollar(?:s)?|pound(?:s)?|EUR(?:s)?|USD(?:s)?|CHF(?:s)?|SEK(?:s)?|ISK(?:s)?|RMB(?:s)?|QR(?:s)?|GBP(?:s)?|QAR(?:s)?|JPY(?:s)?', fees_string, flags=re.IGNORECASE)

    # Use regular expression to find fees (excluding specific years) in the cleaned string
    matches_fee = re.findall(r'(?!2021|2022|2023|2024)\d{4,}', fees_string)

    # Call the max_value_fee function to get the maximum fee
    fees_float = max_value_fee(matches_fee)

    # Map the first currency match to its ISO 4217 code (the codes used as keys in our conversion_rates variable)
    # The lookup is case-insensitive because the regex above uses IGNORECASE
    currency_aliases = {
        'euro': 'EUR', 'euros': 'EUR', '€': 'EUR', 'eur': 'EUR', 'eurs': 'EUR',
        'dollar': 'USD', 'dollars': 'USD', '$': 'USD',
        'pound': 'GBP', 'pounds': 'GBP', '£': 'GBP',
        'rmb': 'CNY', 'qr': 'QAR', 'hk': 'HKD',
    }
    if matches_cur:
        token = matches_cur[0]
        # Anything not in the alias table is assumed to be an ISO code already
        currency = currency_aliases.get(token.lower(), token.upper())
    else:
        currency = None

    # Initialize fees_pound as None
    fees_pound = None

    # Convert only when a fee was found and it is below 100000: larger values come from a handful of parsing errors (fewer than 10 rows), which are simplest to discard this way
    if fees_float is not None and fees_float < 100000:
        fees_pound = round(fees_float / conversion_rates.get(currency, 1.0), 2)

    # Assign the calculated fees in pounds to the 'fees (£)' column
    df_courses.at[index, 'fees (£)'] = fees_pound

# Filter df_courses to include only rows where 'fees (£)' is not null
filtered_df = df_courses[(df_courses['fees (£)'].notnull())]
filtered_df


Unnamed: 0,Course Name,University Name,Faculty Name,Description,Fees,Modality,Duration,City,Country,Link,administration,descr_clean,fees (£)
1,Accounting and Finance - MSc,University of Leeds,Leeds University Business School,Businesses and governments rely on sound finan...,"UK: £18,000 (Total) International: £34,750 (To...",MSc,1 year full time,Leeds,United Kingdom,https://www.findamasters.com/masters-degrees/c...,On Campus,"[busi, govern, reli, sound, financi, knowledg,...",34750.00
5,Advanced Chemical Engineering - MSc,University of Leeds,School of Chemical and Process Engineering,The Advanced Chemical Engineering MSc at Leeds...,"UK: £13,750 (Total) International: £31,000 (To...",MSc,1 year full time,Leeds,United Kingdom,https://www.findamasters.com/masters-degrees/c...,On Campus,"[advanc, chemic, engin, msc, leed, build, core...",31000.00
7,Agricultural Sciences - MSc (Agriculture and F...,University of Helsinki,International Masters Degree Programmes,Goal of the pro­grammeWould you like to be inv...,Tuition fee per year (non-EU/EEA students): 15...,MSc,2 years,Helsinki,Finland,https://www.findamasters.com/masters-degrees/c...,On Campus,"[goal, like, involv, find, solut, futur, chall...",13026.49
8,"Agricultural, Environmental and Resource Econo...",University of Helsinki,International Masters Degree Programmes,Goal of the pro­grammeAre you looking forward ...,Tuition fee per year (non-EU/EEA students): 15...,MSc,2 years,Helsinki,Finland,https://www.findamasters.com/masters-degrees/c...,On Campus,"[goal, look, forward, futur, expert, agricultu...",13026.49
9,Air Quality Solutions - MSc,University of Leeds,Institute for Transport Studies,Up to 7 million people are estimated to die ev...,"UK: £12,500 (Total) International: £28,750 (To...",MSc,"1 year full time, 2 or 3 years part-time",Leeds,United Kingdom,https://www.findamasters.com/masters-degrees/c...,On Campus,"[7, million, peopl, estim, die, everi, year, d...",28750.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...
5983,Master's of Financial Technology (Fintech),Harbour.Space University,Masters Programmes,Harbour.Space's FinTech Master programme is de...,"€29,900/year","MBA, MSc",1 Year,Barcelona,Spain,https://www.findamasters.com/masters-degrees/c...,On Campus,"[fintech, master, programm, design, prepar, gr...",25966.13
5984,Master's of Front-end Development,Harbour.Space University,Masters Programmes,Front-end Development at Harbour.Space Univers...,"€29,900/year",MSc,1 year,Barcelona,Spain,https://www.findamasters.com/masters-degrees/c...,On Campus,"[develop, univers, provid, uniqu, environ, stu...",25966.13
5992,Materials and Molecular Modelling MSc,University College London,Department of Chemistry,Register your interest in graduate study at UC...,"Full time - £14,100",MSc,1 year full time,London,United Kingdom,https://www.findamasters.com/masters-degrees/c...,On Campus,"[regist, interest, graduat, studi, uclther, gr...",14100.00
5995,Materials Engineering,University of Padua,School of Engineering,The Master's degree Materials Engineering is a...,Our tuition fees will not exceed 2700 euros pe...,MSc,2 years,Padua,Italy,https://www.findamasters.com/masters-degrees/c...,On Campus,"[master, degre, materi, engin, interdisciplina...",2344.77
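
To make the parsing steps concrete, here is a small sketch that runs the same two regular expressions (thousands-separator removal and the four-or-more-digit search) on one `Fees` string. It is a simplification under stated assumptions: it uses the standard-library `re`, so the `\p{Sc}` currency class is replaced by an explicit `[£€$]` set, and `extract_max_fee` is a hypothetical helper name.

```python
import re

def strip_thousands_separators(text):
    # "34,750" or "34.750" becomes "34750"
    return re.sub(r'(\d)[.,](\d{3})', r'\1\2', text)

def extract_max_fee(fees_text):
    cleaned = strip_thousands_separators(fees_text)
    # Simplified stand-in for \p{Sc}: only three currency symbols are recognised
    symbols = re.findall(r'[£€$]', cleaned)
    # Digit runs of length >= 4; the lookahead rejects a match that begins with 2021-2024
    amounts = re.findall(r'(?!2021|2022|2023|2024)\d{4,}', cleaned)
    max_fee = float(max(int(a) for a in amounts)) if amounts else None
    return (symbols[0] if symbols else None), max_fee

print(extract_max_fee("UK: £18,000 (Total) International: £34,750 (Total)"))
print(extract_max_fee("Please see the university website for further details"))
```

Taking the maximum means that, when a home fee and an international fee are both listed, the higher one is the value kept.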


## **2.1. Conjunctive query**


2.1.1) Create your index!


In [69]:
############################################### DON'T RUN #######################################################
# Create Vocabulary first

# Collect the set of all distinct words
words = set()
for tokens in df_courses.descr_clean:
    words.update(tokens)

# Assign term_id to each word by creating an index
vocabulary = {}
unique_id = 1
for word in list(words):
  vocabulary[unique_id] = word
  unique_id += 1

# Save this file into a pickle file which I can later on retrieve
with open('/content/drive/MyDrive/MyDatasetFolder/vocabulary.pkl', 'wb') as pickle_file:
    pickle.dump(vocabulary, pickle_file)


In [None]:
############################################### DON'T RUN #######################################################
# Create Inverted Index

with open('/content/drive/MyDrive/MyDatasetFolder/vocabulary.pkl', 'rb') as pickle_file:
    vocabulary = pickle.load(pickle_file)

inverted_index = dict()
overall_progress = tqdm(total=len(vocabulary), desc="Building Inverted Index")

# Loop through the keys of the vocabulary
for i in list(vocabulary.keys()):
    inverted_index[i] = []
    for index, row in df_courses.iterrows():
        if vocabulary[i] in row['descr_clean']:
            inverted_index[i].append(index)

    overall_progress.update(1)  # Update the overall progress

overall_progress.close()

# Save under the same file name that the query functions load ('inverted_index.pkl')
with open('/content/drive/MyDrive/MyDatasetFolder/inverted_index.pkl', 'wb') as file:
    pickle.dump(inverted_index, file)
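
The cell above scans every row of the dataframe once per vocabulary term. The same kind of index could also be built in a single pass over the documents. The following is only a sketch on toy data: `toy_docs` is a hypothetical stand-in for the `descr_clean` column, and words are used directly as keys instead of `term_id`s to keep it short.

```python
from collections import defaultdict

# Hypothetical stemmed documents, keyed by document id
toy_docs = {
    0: ["fashion", "market", "manag"],
    1: ["materi", "engin"],
    2: ["fashion", "busi", "market"],
}

inverted = defaultdict(list)
for doc_id, tokens in toy_docs.items():
    for token in set(tokens):          # list each document once per term
        inverted[token].append(doc_id)

print(sorted(inverted["fashion"]))
print(sorted(inverted["engin"]))
```

Each token is touched once, so the cost grows with the total number of tokens rather than with vocabulary size times number of documents.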


2.1.2) Execute the query


In [72]:
# Retrieve the term_id of a word
def find_key_by_value(vocabulary, value):
  for key, val in vocabulary.items():
    if val == value:
      return key
  return None


def matching_documents_finder(query):
  # Load inverted index and vocabulary
  with open('/content/drive/MyDrive/MyDatasetFolder/inverted_index.pkl', 'rb') as pickle_file:
      inverted_index = pickle.load(pickle_file)

  with open('/content/drive/MyDrive/MyDatasetFolder/vocabulary.pkl', 'rb') as vocab_file:
      vocabulary = pickle.load(vocab_file)

  # Preprocess the query the same way as descr_clean so the two can be compared
  # Strip punctuation, tokenize and remove stopwords
  query = ''.join(char for char in query if char.isalnum() or char.isspace())
  stop_words = set(stopwords.words('english'))
  words = word_tokenize(query)
  query = ' '.join([word for word in words if word.lower() not in stop_words])

  # Stemming using SnowballStemmer
  snowstem = SnowballStemmer('english')
  query_words = [snowstem.stem(word) for word in query.lower().split()]

  # Initialize a list to store matching documents
  matching_documents = []

  # Find documents that contain all the words in the query
  for stemmed_word in query_words:
      if stemmed_word in vocabulary.values():
        # The word is in the vocabulary: append its documents as a set, so that set intersection can be used afterwards
        term_id = find_key_by_value(vocabulary, stemmed_word)
        matching_documents.append(set(inverted_index[term_id]))

  return matching_documents


def common_documents_finder(query):
  matching_documents = matching_documents_finder(query)
  # Start from an empty set so the function still returns when no query word is in the vocabulary
  common_documents = set()
  if matching_documents:
      common_documents = set.intersection(*matching_documents) # Set intersection keeps only the documents that contain all of the words

  return common_documents


def search_engine_with_vocabulary(query):
  common_documents = common_documents_finder(query)
  df_result = df_courses.loc[list(common_documents), ['Course Name', 'University Name', 'Description', 'Link']].copy()

  # Display the DataFrame
  return df_result

# Example usage
search_engine_with_vocabulary(input())


Fashion, AND MArketing!!


Unnamed: 0,Course Name,University Name,Description,Link
5122,Luxury Buying & Logistics - MSc,University for the Creative Arts,The business of moving and shipping luxury ite...,https://www.findamasters.com/masters-degrees/c...
3587,Fashion Forecasting & Data Analysis - MA/MSc,University for the Creative Arts,UCA's new MSc degree in Fashion Forecasting an...,https://www.findamasters.com/masters-degrees/c...
3588,Fashion Marketing Management - MSc/PgD/PgC,Cardiff Metropolitan University,The MSc Fashion Marketing Management degree at...,https://www.findamasters.com/masters-degrees/c...
4776,International Fashion Marketing - MSc,Glasgow Caledonian University,The global fashion industry is valued at over ...,https://www.findamasters.com/masters-degrees/c...
4777,International Fashion Marketing - MSc,Heriot-Watt University,MSc International Fashion Marketing (IFM) at H...,https://www.findamasters.com/masters-degrees/c...
4778,International Fashion Marketing MSc,Coventry University London,"Are you looking to move into fashion, accelera...",https://www.findamasters.com/masters-degrees/c...
4779,International Fashion Marketing MSc,York St John University,Combine fashion industry expertise with transf...,https://www.findamasters.com/masters-degrees/c...
140,Global Marketing & Communications - MA/MSc,University for the Creative Arts,"Refine your existing skills in marketing, comm...",https://www.findamasters.com/masters-degrees/c...
204,Master Programme in Fashion Marketing and Mana...,University of Boras,Fashion is one of the most complex and fascina...,https://www.findamasters.com/masters-degrees/c...
112,Fashion Business & Management - MA/MSc,University for the Creative Arts,If you are seeking a high-level career in mana...,https://www.findamasters.com/masters-degrees/c...
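
The conjunctive query boils down to intersecting the posting sets of the query terms. Here is a minimal sketch on a toy index; `toy_index` and its document ids are hypothetical. Note one deliberate difference from the functions above: an unknown term contributes an empty set here, so the result is empty, whereas the notebook code skips terms that are not in the vocabulary.

```python
# Hypothetical index: stemmed term -> set of document ids
toy_index = {
    "fashion": {1, 2, 3},
    "market": {2, 3, 4},
    "engin": {5},
}

def conjunctive_query(terms, index):
    # An unknown term has an empty posting set, which empties the intersection
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()

print(sorted(conjunctive_query(["fashion", "market"], toy_index)))
print(sorted(conjunctive_query(["fashion", "engin"], toy_index)))
```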


## **2.2. Conjunctive query & Ranking score**

2.2.1) Inverted index


In [73]:
# Build tfidf_data
tfidf = TfidfVectorizer(input='content', lowercase=False, tokenizer=lambda text: text)
results = tfidf.fit_transform(df_courses.descr_clean)
result_dense = results.todense()
tfidf_data = pd.DataFrame(result_dense.tolist(), index=df_courses.index, columns=tfidf.get_feature_names_out()) # Used to create the New Inverted Index
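
With its default settings, `TfidfVectorizer` weights a term by its raw count times a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, and then scales each document vector to unit L2 length. The sketch below reproduces that calculation in plain Python on three hypothetical token lists (`toy_docs`), purely to show what the numbers in `tfidf_data` mean; it is not a replacement for the vectorizer.

```python
import math
from collections import Counter

toy_docs = [["fashion", "market"], ["fashion", "engin"], ["materi", "engin"]]
n_docs = len(toy_docs)

# Document frequency: in how many documents each term occurs
doc_freq = Counter(term for doc in toy_docs for term in set(doc))

def tfidf_vector(tokens):
    counts = Counter(tokens)
    weights = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
               for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

def cosine(u, v):
    # Both vectors have unit length, so the dot product is the cosine similarity
    return sum(w * v.get(t, 0.0) for t, w in u.items())

vectors = [tfidf_vector(doc) for doc in toy_docs]
print(round(cosine(vectors[0], vectors[1]), 4))  # one shared term
print(cosine(vectors[0], vectors[2]))            # no shared term
```

Because the vectors are normalised, documents that share no term get a similarity of exactly zero, and a rarer shared term contributes more than a common one.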




In [None]:
############################################### DON'T RUN #######################################################
# 2.2.1 New Inverted index

inverted_index_2 = dict()
overall_progress = tqdm(total=len(list(inverted_index.keys())), desc="Building Inverted Index 2")

for term_id in list(inverted_index.keys()):
  values = tfidf_data.loc[tfidf_data[vocabulary[term_id]] > 0, [vocabulary[term_id]]] # Find values for which tfidf are > 0
  term = vocabulary[term_id] # Find the term
  values_list = list(zip(values.index, values[term]))
  inverted_index_2[term_id] = values_list
  overall_progress.update(1)  # Update the overall progress

overall_progress.close()

# Print only a summary: printing the whole index exceeds the notebook output rate limit
print(f"Indexed {len(inverted_index_2)} terms")

# Save pickle file
with open('/content/drive/MyDrive/MyDatasetFolder/inverted_index_2.pkl', 'wb') as file:
    pickle.dump(inverted_index_2, file)

print("Inverted index 2 saved to /content/drive/MyDrive/MyDatasetFolder/inverted_index_2.pkl")


Building Inverted Index 2: 100%|██████████| 8753/8753 [00:35<00:00, 245.76it/s]



2.2.2) Execute the query

In [74]:
# This allows retrieving the term_id from a word
def find_key_by_value(vocabulary, value):
    for key, val in vocabulary.items():
        if val == value:
            return key
    return None

def search_engine_similarity():
    # Input query
    query = input()

    # Load inverted indexes and vocabulary from pickled files
    with open('/content/drive/MyDrive/MyDatasetFolder/inverted_index.pkl', 'rb') as pickle_file:
        inverted_index = pickle.load(pickle_file)

    with open('/content/drive/MyDrive/MyDatasetFolder/inverted_index_2.pkl', 'rb') as pickle_file:
        inverted_index_2 = pickle.load(pickle_file)

    with open('/content/drive/MyDrive/MyDatasetFolder/vocabulary.pkl', 'rb') as vocab_file:
        vocabulary = pickle.load(vocab_file)

    # Tokenize and preprocess the input query
    query = ''.join(char for char in query if char.isalnum() or char.isspace())
    stop_words = set(stopwords.words('english'))
    words = word_tokenize(query)
    query = ' '.join([word for word in words if word.lower() not in stop_words])

    # Stem the query to its root form using SnowballStemmer
    snowstem = SnowballStemmer('english')
    query_text = ' '.join(snowstem.stem(word) for word in query.lower().split())
    query_words = [snowstem.stem(word) for word in query.lower().split()]

    # Initialize a list to store matching documents
    matching_documents = []

    # Find documents that contain all the stemmed words in the query
    for stemmed_word in query_words:
        if stemmed_word in vocabulary.values():
            # Retrieve the term_id from the vocabulary
            term_id = find_key_by_value(vocabulary, stemmed_word)
            matching_documents.append(set(inverted_index[term_id]))

    # Intersect the posting sets; keep the result as a list so the order is fixed for the similarity lookup
    common_documents = list(set.intersection(*matching_documents)) if matching_documents else []

    # Nothing to rank when no document contains every query term
    if not common_documents:
        return pd.DataFrame(columns=['Cosine Similarity', 'Course Name', 'University Name', 'Description', 'Link'])

    # Select the matching rows by index label (the inverted index stores labels, hence .loc)
    filtered_df = df_courses.loc[common_documents]
    documents_text = [' '.join(words) for words in filtered_df['descr_clean']]

    # Create a list of texts including the stemmed query and the descr_clean of matching documents
    all_texts = [query_text] + documents_text

    # Use TF-IDF vectorization to convert the texts into a matrix
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(all_texts)

    # Calculate cosine similarities between the query and each document
    cosine_similarities = cosine_similarity(tfidf_matrix[0], tfidf_matrix[1:]).flatten()

    # Create a min heap to store the top-k documents
    heap = []

    for i, document in enumerate(common_documents):
        # Push the document details along with the cosine similarity to the heap
        heapq.heappush(heap, (cosine_similarities[i], df_courses.loc[document, 'Course Name'], df_courses.loc[document, 'University Name'], df_courses.loc[document, 'Description'], df_courses.loc[document, 'Link']))

        # Keep only the top k = 5 documents: once the heap holds more than 5, pop the smallest element
        if len(heap) > 5:
            heapq.heappop(heap)

    # Convert the heap to a DataFrame
    heap_df = pd.DataFrame(heap, columns=['Cosine Similarity', 'Course Name', 'University Name', 'Description', 'Link'])

    # Sort the DataFrame based on the 'Cosine Similarity' column in descending order
    heap_df = heap_df.sort_values(by='Cosine Similarity', ascending=False)

    return heap_df

search_engine_similarity()

Fashion, AND MArketing!!


Unnamed: 0,Cosine Similarity,Course Name,University Name,Description,Link
4,0.42963,Fashion Marketing Management - MSc/PgD/PgC,Cardiff Metropolitan University,The MSc Fashion Marketing Management degree at...,https://www.findamasters.com/masters-degrees/c...
3,0.297942,International Fashion Marketing - MSc,Glasgow Caledonian University,The global fashion industry is valued at over ...,https://www.findamasters.com/masters-degrees/c...
2,0.276519,Master Programme in Fashion Marketing and Mana...,University of Boras,Fashion is one of the most complex and fascina...,https://www.findamasters.com/masters-degrees/c...
1,0.24516,International Fashion Marketing MSc,Coventry University London,"Are you looking to move into fashion, accelera...",https://www.findamasters.com/masters-degrees/c...
0,0.222708,International Fashion Marketing MSc,York St John University,Combine fashion industry expertise with transf...,https://www.findamasters.com/masters-degrees/c...
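
The ranking keeps the best results with a bounded min-heap: every candidate is pushed, and whenever the heap grows past k the smallest score is popped. A minimal sketch of that pattern, with hypothetical scores and a hypothetical `top_k` helper:

```python
import heapq

def top_k(scored_items, k=5):
    heap = []
    for score, name in scored_items:
        heapq.heappush(heap, (score, name))
        if len(heap) > k:
            heapq.heappop(heap)   # discard the current lowest score
    # The heap is not fully ordered, so sort the survivors from best to worst
    return sorted(heap, reverse=True)

toy_scores = [(0.43, "A"), (0.10, "B"), (0.30, "C"), (0.28, "D")]
print(top_k(toy_scores, k=2))
```

The heap never holds more than k + 1 items, so memory stays bounded however many documents match. For a one-off selection, `heapq.nlargest(k, scored_items)` gives the same result.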


## **3. Define a new score!**

In [75]:
# Download NLTK resources
nltk.download('punkt')
nltk.download('stopwords')

# Load SnowballStemmer
snowstem = SnowballStemmer('english')

# Load stopwords to filter
lst_stopwords = set(stopwords.words('english'))

# Create a column named 'title_clean' by applying preprocess_text to the 'Course Name' column in df_courses
df_courses['title_clean'] = df_courses['Course Name'].apply(preprocess_text)

df_courses

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,Course Name,University Name,Faculty Name,Description,Fees,Modality,Duration,City,Country,Link,administration,descr_clean,fees (£),title_clean
0,3D Design for Virtual Environments - MSc,Glasgow Caledonian University,School of Engineering and Built Environment,3D visualisation and animation play a role in ...,Please see the university website for further ...,MSc,1 year full-time,Glasgow,United Kingdom,https://www.findamasters.com/masters-degrees/c...,On Campus,"[3d, visualis, anim, play, role, mani, area, p...",,"[3d, design, virtual, environ, msc]"
1,Accounting and Finance - MSc,University of Leeds,Leeds University Business School,Businesses and governments rely on sound finan...,"UK: £18,000 (Total) International: £34,750 (To...",MSc,1 year full time,Leeds,United Kingdom,https://www.findamasters.com/masters-degrees/c...,On Campus,"[busi, govern, reli, sound, financi, knowledg,...",34750.00,"[account, financ, msc]"
2,"Accounting, Accountability & Financial Managem...",King’s College London,King’s Business School,"Our Accounting, Accountability & Financial Man...",Please see the university website for further ...,MSc,1 year FT,London,United Kingdom,https://www.findamasters.com/masters-degrees/c...,On Campus,"[account, account, financi, manag, msc, cours,...",,"[account, account, financi, manag, msc]"
3,"Accounting, Financial Management and Digital B...",University of Reading,Henley Business School,Embark on a professional accounting career wit...,Please see the university website for further ...,MSc,1 year full time,Reading,United Kingdom,https://www.findamasters.com/masters-degrees/c...,On Campus,"[embark, profession, account, career, academ, ...",,"[account, financi, manag, digit, busi, msc]"
4,Addictions MSc,King’s College London,"Institute of Psychiatry, Psychology and Neuros...",Join us for an online session for prospective ...,Please see the university website for further ...,MSc,One year FT,London,United Kingdom,https://www.findamasters.com/masters-degrees/c...,On Campus,"[join, us, onlin, session, prospect, student, ...",,"[addict, msc]"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5995,Materials Engineering,University of Padua,School of Engineering,The Master's degree Materials Engineering is a...,Our tuition fees will not exceed 2700 euros pe...,MSc,2 years,Padua,Italy,https://www.findamasters.com/masters-degrees/c...,On Campus,"[master, degre, materi, engin, interdisciplina...",2344.77,"[materi, engin]"
5996,Materials Engineering MSc,Swansea University,School of Engineering and Applied Sciences,The MSc in Materials Engineering provides you ...,Please visit our website for the Materials Eng...,MSc,1 year full-time; 2 years part-time; 3 years p...,Swansea,United Kingdom,https://www.findamasters.com/masters-degrees/c...,On Campus,"[msc, materi, engin, provid, deep, understand,...",,"[materi, engin, msc]"
5997,Materials Engineering MSc by Research,Swansea University,School of Engineering and Applied Sciences,Swansea is one of the UK’s leading centres for...,Please visit our website for the Materials Eng...,"MSc, Research Only",1 year full-time; 2 years part-time,Swansea,United Kingdom,https://www.findamasters.com/masters-degrees/c...,On Campus,"[swansea, one, uk, lead, centr, materi, teach,...",,"[materi, engin, msc, research]"
5998,"Materials Engineering with Industry, MSc",Swansea University,School of Engineering and Applied Sciences,Our MSc in Materials Engineering with Industry...,For current fees of the Materials Engineering ...,MSc,2 Years Full Time With a Year In Industry,Swansea,United Kingdom,https://www.findamasters.com/masters-degrees/c...,On Campus,"[msc, materi, engin, industri, cours, open, in...",,"[materi, engin, industri, msc]"


In [76]:
# Define the country_score dictionary
country_score = {
    'United Kingdom': 4,
    'USA': 4,
    'Finland': 3,
    'Netherlands': 3,
    'Germany': 3,
    'Iceland': 2,
    'Ireland': 2,
    'Portugal': 2,
    'France': 3,
    'Belgium': 2,
    'Canada': 3,
    'Hong Kong': 2,
    'Sweden': 2,
    'China': 2,
    'Estonia': 1,
    'Italy': 3,
    'New Zealand': 1,
    'Lithuania': 1,
    'Austria': 1,
    'Saudi Arabia': 1,
    'Hungary': 1,
    'Greece': 3,
    'Denmark': 1,
    'Croatia': 1,
    'Turkey': 2,
    'Romania': 1,
    'Spain': 2,
    'Japan': 3,
    'Czechia': 1,
    'Switzerland': 3,
    'Singapore': 2,
    'Australia': 3,
    'Qatar': 1,
    'Cyprus': 2,
    'Israel': 2,
    'Gibraltar': 1,
    'Luxembourg': 2,
    'India': 2,
    'Malaysia': 1,
    'Chile': 1,
    'Kazakhstan': 1,
    'United Arab Emirates': 1,
    'Jamaica': 1,
    'Norway': 1,
}
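
As a rough illustration of how such a country lookup can be combined with a title score and a duration score into one weighted sum, consider the sketch below. It is written under assumptions: `toy_country_score` is a two-entry subset of the dictionary above, `final_score` is a hypothetical helper, and the default weights (4, 1, 1) and the duration mapping are the values set in the scoring cell of this notebook.

```python
import re

toy_country_score = {"United Kingdom": 4, "Italy": 3}   # subset of the dictionary above
duration_score = {"1": 1, "12": 1, "2": 2, "24": 2}

def final_score(title_hits, country, duration_text,
                w_title=4, w_location=1, w_duration=1):
    # Countries missing from the lookup contribute 0
    location = toy_country_score.get(country, 0)
    # First number in the duration text; no number or an unmapped number scores 0
    match = re.search(r"\d+", duration_text)
    duration = duration_score.get(match.group(), 0) if match else 0
    return w_title * title_hits + w_location * location + w_duration * duration

print(final_score(2, "United Kingdom", "1 year full-time"))  # 4*2 + 4 + 1
print(final_score(0, "Elsewhere", "One year FT"))            # nothing matches
```

A duration written in words, such as "One year FT", contains no digit and therefore adds nothing to the score.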


In [81]:
def new_score(query):
  common_documents = list(common_documents_finder(query))
  df_result = df_courses.loc[common_documents, [
            'Course Name', 'University Name', 'Description', 'Fees', 'Modality', 'Duration', 'City', 'Country', 'descr_clean', 'fees (£)', 'title_clean']].copy()

  query = ''.join(char for char in query if char.isalnum() or char.isspace())
  stop_words = set(stopwords.words('english'))
  words = word_tokenize(query)
  query = ' '.join([word for word in words if word.lower() not in stop_words])

  # Stemming using SnowballStemmer
  snowstem = SnowballStemmer('english')
  query_words = [snowstem.stem(word) for word in query.lower().split()]


  # Count how many query words appear in each course title

  for i in range(len(common_documents)):
      count = 0  # Initialize count for each document
      for stemmed_word in query_words:
          if stemmed_word in df_result['title_clean'].iloc[i]:
              count += 1
      df_result.at[df_result.index[i], 'title_score'] = count  # Update count for the current document

  for i in range(len(common_documents)):
      count = 0  # Initialize count for each document

      # Check if the country is in the country_score dictionary
      country = df_result['Country'].iloc[i]
      if country in country_score:
          count += country_score[country]

      df_result.at[df_result.index[i], 'location_score'] = count

  # Extract the first number in 'Duration' (e.g. '1' from '1 year full time') into 'duration_numeric'
  df_result['duration_numeric'] = df_result['Duration'].str.extract(r'(\d+)', expand=False)

  # Map the extracted number (a string) to a duration score; unlisted values become NaN
  duration_score_dict = {'1': 1, '12': 1, '2': 2, '24': 2}  # Add more conditions as needed

  # Create a new column 'duration_score' based on the dictionary
  df_result['duration_score'] = df_result['duration_numeric'].map(duration_score_dict)
  return df_result


def calculate_final_score(title_score, location_score, duration_score, title_weight=1, location_weight=1, duration_weight=1):
    # Adjust the weights based on your preference
    weighted_title_score = title_score * title_weight
    weighted_location_score = location_score * location_weight
    weighted_duration_score = duration_score * duration_weight

    # Calculate the final score as the sum of weighted scores
    final_score = weighted_title_score + weighted_location_score + weighted_duration_score

    return final_score


def rank_courses_new_score(query, k=5):
    df_result = new_score(query)

    # Define weights for each score
    w_title = 4
    w_location = 1
    w_duration = 1

    # Fill NaN values in the individual scores with zeros
    df_result[['title_score', 'location_score', 'duration_score']] = df_result[['title_score', 'location_score', 'duration_score']].fillna(0)

    # Create a min heap to store the top-k documents
    heap = []

    for _, row in df_result.iterrows():
        # Calculate the final score using the weights
        final_score = calculate_final_score(row['title_score'], row['location_score'], row['duration_score'], title_weight=w_title, location_weight=w_location, duration_weight=w_duration)

        # Push the document details along with the final score to the heap
        heapq.heappush(heap, (final_score, row['Course Name'], row['University Name'], row['Description'], row['Fees'], row['Modality'], row['Duration'], row['City'], row['Country'], row['descr_clean'], row['fees (£)'], row['title_clean'], row['duration_numeric'], row['duration_score'], final_score))

        # If the heap size exceeds k, pop the smallest element
        if len(heap) > k:
            heapq.heappop(heap)

    # Convert the heap to a DataFrame
    heap_df = pd.DataFrame(heap, columns=['final_score_heap', 'Course Name', 'University Name', 'Description', 'Fees', 'Modality', 'Duration', 'City', 'Country', 'descr_clean', 'fees (£)', 'title_clean', 'duration_numeric', 'duration_score', 'final_score'])

    # Sort the DataFrame based on the 'final_score' column in descending order
    heap_df = heap_df.sort_values(by='final_score', ascending=False)

    return heap_df



rank_courses_new_score(input())


Fashion, AND MArketing!!


Unnamed: 0,final_score_heap,Course Name,University Name,Description,Fees,Modality,Duration,City,Country,descr_clean,fees (£),title_clean,duration_numeric,duration_score,final_score
1,13.0,International Fashion Marketing - MSc,Heriot-Watt University,MSc International Fashion Marketing (IFM) at H...,Please see the university website for further ...,"MSc, PGDip","1 year full time, 2 years part time",Edinburgh,United Kingdom,"[msc, intern, fashion, market, ifm, univers, e...",,"[intern, fashion, market, msc]",1,1.0,13.0
2,13.0,Fashion Marketing Management - MSc/PgD/PgC,Cardiff Metropolitan University,The MSc Fashion Marketing Management degree at...,"As a postgraduate, research or part-time stud...","PGCert, PGDip, MSc","1 year full time, 2 years part time",Cardiff,United Kingdom,"[msc, fashion, market, manag, degre, cardiff, ...",,"[fashion, market, manag]",1,1.0,13.0
3,13.0,International Fashion Marketing MSc,Coventry University London,"Are you looking to move into fashion, accelera...",UK Fees: 14250 ; 15450 (with Professional Prac...,MSc,"1 year full time, 18 months extended professio...",London,United Kingdom,"[look, move, fashion, acceler, fashion, career...",18950.0,"[intern, fashion, market, msc]",1,1.0,13.0
4,13.0,International Fashion Marketing MSc,York St John University,Combine fashion industry expertise with transf...,Please see the university website for further ...,MSc,"1 year full time, 2 years part time, 2 years f...",York,United Kingdom,"[combin, fashion, industri, expertis, transfer...",,"[intern, fashion, market, msc]",1,1.0,13.0
0,12.0,Master Programme in Fashion Marketing and Mana...,University of Boras,Fashion is one of the most complex and fascina...,Please see the university website for further ...,MSc,Full-time: 2 years,Boras,Sweden,"[fashion, one, complex, fascin, area, research...",,"[master, programm, fashion, market, manag]",2,2.0,12.0
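
Four of the five rows above tie at 13.0. When `heapq` compares tuples whose first elements are equal, it falls back to the next fields, so ties here are broken by course name and then university name. If a tie ever reached two fields of different types (for example a string against a missing value), the comparison would raise a `TypeError`. One way to sidestep this, sketched below with made-up rows and a hypothetical `top_k` helper, is to place a unique counter right after the score so that the comparison never reaches the payload.

```python
import heapq
from itertools import count

def top_k(scored_rows, k=5):
    """Return the k highest-scoring (score, payload) pairs, best first."""
    heap = []           # min-heap holding at most k entries
    tiebreak = count()  # unique, increasing integer per row
    for score, payload in scored_rows:
        entry = (score, next(tiebreak), payload)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        else:
            # push the new entry and drop the smallest one in a single step
            heapq.heappushpop(heap, entry)
    # highest score first; equal scores keep their input order
    ranked = sorted(heap, key=lambda e: (-e[0], e[1]))
    return [(score, payload) for score, _, payload in ranked]

# Illustrative rows only; the dict payloads are not orderable on their own.
rows = [
    (12.0, {'course': 'A'}),
    (13.0, {'course': 'B'}),
    (13.0, {'course': 'C'}),
    (9.0, {'course': 'D'}),
]
print(top_k(rows, k=2))
```

Because the counter is unique, two entries never compare equal on their first two fields, so the payload can hold lists, dicts or missing values safely.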


In the 'Fashion and Marketing' query example, the courses returned under both scores are closely related to the query. Although the description plays the larger role in matching the query, the presence of the query words in the title also carries a heavy weight in the new score, because a word in the title says more about the course than the same word in the description. In addition, the new score brings more 'prestigious' universities to the top.
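
To make the weighting concrete, the scores in the output above can be reproduced by hand. The sketch below assumes the query reduces to the stems `fashion` and `market` (punctuation and the stop word "and" are dropped, and `marketing` is assumed to stem to `market`, consistent with the `title_clean` column). The helper name `final_score_for` is hypothetical; the weights and the country and duration scores are taken from the code above.

```python
# Weights used in rank_courses_new_score
W_TITLE, W_LOCATION, W_DURATION = 4, 1, 1

def final_score_for(query_stems, title_stems, location_score, duration_score):
    """Weighted sum of the three component scores."""
    # title score = number of query stems that occur in the title
    title_score = sum(1 for stem in query_stems if stem in title_stems)
    return (W_TITLE * title_score
            + W_LOCATION * location_score
            + W_DURATION * duration_score)

query_stems = ['fashion', 'market']  # assumed stems of the example query

# Heriot-Watt University: United Kingdom (4), duration '1' (1)
uk = final_score_for(query_stems, ['intern', 'fashion', 'market', 'msc'], 4, 1)
# University of Boras: Sweden (2), duration '2' (2)
se = final_score_for(query_stems, ['master', 'programm', 'fashion', 'market', 'manag'], 2, 2)

print(uk, se)  # 13 12
```

In both rows the title contributes 8 points of the total, which is why title matches dominate the ranking when `w_title = 4`.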
