# Woofya Custom Recommendation System utilizing Woofya database and insights from GPT

### Packages
Several packages are required for this recommendation system:
- pandas provides general data manipulation and analysis functions.
- haversine (great-circle distance) and cosine_similarity (text relevance) are the two metrics used to score how suitable each location is for the user.
- re, stopwords, PorterStemmer and TfidfVectorizer handle the tokenisation and searching of keywords.

In [1]:
import pandas as pd
from haversine import haversine, Unit
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

### Dataset
The dataset used in this recommendation system is derived from the Woofya-supplied dataset, which has been cleaned and slimmed down, and it also includes insights on each location generated through the gpt-4o-mini API (this process is covered in 'gpt_pre_process_v2.ipynb'). Once loaded, the dataset has 262 rows (locations) and 23 columns of analysis and display information. Only locations with more than 500 characters of review text were accepted, with the aim of giving each location enough text for robust keyword matching.

In [2]:
# Load the CSV file. 
file_path = 'input_select.csv'
input_select = pd.read_csv(file_path)
# input_select << Uncomment this to view input dataset.

## Recommendation System Processes
This recommendation system is built from several functions: calulate_dist, search_preprocessing and keyword_search, followed by the main function recommeder_final (the names are given here exactly as they are spelled in the code).

In [3]:
# Define function to calculate the haversine distance for location related analysis and filtering.
def calulate_dist(row, user_lat, user_lon):
    location_point = (row['geometry.location.lat'], row['geometry.location.lng'])
    user_point = (user_lat, user_lon)
    return haversine(user_point, location_point, unit=Unit.KILOMETERS)
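
For readers without the haversine package installed, the great-circle calculation it performs can be sketched in plain Python. This is a minimal illustration rather than the package's actual implementation: the function name `haversine_km` and the Earth radius constant are assumptions made for this example, and the two coordinates are the ones used in the examples later in this notebook.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius for this sketch


def haversine_km(point_a, point_b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, point_a)
    lat2, lon2 = map(math.radians, point_b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # Haversine formula: a is the squared half-chord length between the points.
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


city_point = (-37.8124, 144.9623)      # user location from Example One
box_hill_point = (-37.8181, 145.1239)  # user location from Example Two
print(round(haversine_km(city_point, box_hill_point), 1))
```

The two points are roughly 14 km apart, which gives a feel for the scale of the f_dist values used in the examples.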

In [4]:
# Initialise stemmer and stop words, then create preprocessing function to retrieve only useful string data and tokenise each word.
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))
def search_preprocessing(text):
    text = re.sub(r'[^\w\s]', '', text.lower())
    tokens = [stemmer.stem(word) for word in text.split() if word not in stop_words]
    return tokens

In [5]:
# Fill any empty space in the text columns to be searched BEFORE indexing, so that missing values are not
# indexed as the token 'nan'. Then combine the columns into a single 'full_text' column.
text_cols = ['name', 'description', 'review_text', 'editorial_summary.overview']
input_select[text_cols] = input_select[text_cols].fillna('')
input_select['full_text'] = input_select['name'] + ' ' + input_select['description'] + ' ' + input_select['review_text'] + ' ' + input_select['editorial_summary.overview']

# Create an index that maps each token to the rows (locations) whose text contains it.
keyword_index = {}
for idx, text in input_select['full_text'].items():
    # Use a set so that each row is listed at most once per token.
    for token in set(search_preprocessing(text)):
        keyword_index.setdefault(token, []).append(idx)
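
The token-to-rows index built above is an inverted index. The sketch below shows the idea on three made-up documents; the helper names `simple_tokens` and `lookup` and the toy texts are hypothetical, and the tokeniser deliberately skips stemming and stop-word removal so that it runs without NLTK.

```python
import re


def simple_tokens(text):
    """Lowercase, strip punctuation and split on whitespace (no stemming or stop words)."""
    return re.sub(r'[^\w\s]', '', text.lower()).split()


# Hypothetical toy documents standing in for the 'full_text' column.
docs = {
    0: "Dog-friendly cafe with shade",
    1: "Quiet park with open field",
    2: "Cafe near the park",
}

# Map each token to the ids of the documents that contain it.
inverted_index = {}
for doc_id, text in docs.items():
    for token in set(simple_tokens(text)):
        inverted_index.setdefault(token, []).append(doc_id)


def lookup(query):
    """Return the ids of documents containing at least one query token."""
    hits = set()
    for token in simple_tokens(query):
        # .get avoids a KeyError for tokens that never occur in the documents.
        hits.update(inverted_index.get(token, []))
    return sorted(hits)


print(lookup("cafe"))
print(lookup("park zebra"))
```

Note that the punctuation-stripping regex, which is the same one used in search_preprocessing, turns "Dog-friendly" into the single token "dogfriendly" rather than two separate words.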

In [6]:
# Create function to search key words, taking the dataset as well as the key words searched by the user. It scores every
# location by the cosine similarity between the search phrase and the contents of 'full_text', then returns only the
# locations that contain at least one of the searched tokens.
def keyword_search(input_select, keyword):
    kw_tokens = search_preprocessing(keyword)
    result_inds = set()

    for token in kw_tokens:
        # Skip tokens that do not occur anywhere in the dataset (this avoids a KeyError).
        if token in keyword_index:
            result_inds.update(keyword_index[token])

    # Work on a copy so that the caller's DataFrame is not modified.
    results = input_select.copy()

    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(results['full_text'].tolist() + [keyword])

    # The last row of the matrix is the search phrase; compare it against every location.
    cosine_similarities = cosine_similarity(tfidf_matrix[-1], tfidf_matrix[:-1])
    results['cosine_similarity'] = cosine_similarities.flatten()

    # keyword_index stores DataFrame index labels, so select by label rather than by position.
    return results.loc[sorted(result_inds)]
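
To make the relevance score more concrete, here is a dependency-free sketch of TF-IDF weighting followed by cosine similarity. It is not scikit-learn's implementation: the function names and toy texts are hypothetical, the tokenisation is a plain whitespace split, and the smoothed IDF is only intended to resemble TfidfVectorizer's default, so exact values will differ from the notebook's scores.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Return one L2-normalised TF-IDF dict per text, using a smoothed IDF."""
    tokenised = [text.lower().split() for text in texts]
    n_docs = len(tokenised)
    # Number of texts in which each token appears at least once.
    doc_freq = Counter(token for tokens in tokenised for token in set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        vec = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1) for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def cosine(u, v):
    """Cosine similarity of two already-normalised sparse vectors (a dot product)."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


# Hypothetical toy texts; as in keyword_search, the query is appended as the last row.
docs = ["cafe with shade and coffee", "park with open field", "cafe near the park"]
query = "cafe shade"
*doc_vecs, query_vec = tfidf_vectors(docs + [query])
scores = [cosine(query_vec, d) for d in doc_vecs]
print(scores)
```

The first text shares two terms with the query and scores highest, the third shares one, and the second shares none and scores zero.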

## Main Recommendation System Function
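This is the function to be called when a user searches using this recommendation system. It takes the user's location (user_lat, user_lon), the dataset (input_select), the maximum distance the user will consider travelling in km (f_dist), a binary flag (1 = required, 0 = no preference) for each of wheelchair_accessible_entrance, food, serves_vegetarian_food, off_leash, dog_water, fenced_off, seating, parking, night_light, shade, wildlife, open_field and alcohol, and finally the keyword to be searched. Locations that lack a required feature or lie further away than f_dist are removed. The remaining locations are ranked by a weighted score combining proximity (weight 0.3) and keyword similarity (weight 0.7), and the top 10 are returned.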

In [7]:
def recommeder_final(user_lat, user_lon, input_select, f_dist, wheelchair_accessible_entrance, food, serves_vegetarian_food, off_leash, dog_water, fenced_off, seating, parking, night_light, shade, wildlife, open_field, alcohol, keyword):

    display_cols = ['name', 'rating', 'suburb', 'h_dist', 'description']
    # Relative importance of proximity and keyword relevance in the final ranking.
    weights = {
        'prox_score': 0.3,
        'cosine_similarity': 0.7
    }
    # Copy the search results so new columns can be added without a SettingWithCopyWarning.
    results = keyword_search(input_select, keyword).copy()
    if results.empty:
        # No location matched the keyword, so there is nothing to rank.
        return pd.DataFrame(columns=display_cols)

    # Distance from the user in km, and a proximity score that falls linearly from 1 (at the user) to 0 (at f_dist).
    results['h_dist'] = results.apply(calulate_dist, axis=1, user_lat=user_lat, user_lon=user_lon)
    results['prox_score'] = results['h_dist'].apply(lambda d: max(0, 1-d/f_dist) if f_dist > 0 else 0)

    user_reqs = {
        'wheelchair_accessible_entrance': wheelchair_accessible_entrance,
        'food': food,
        'serves_vegetarian_food': serves_vegetarian_food,
        'off_leash': off_leash,
        'dog_water': dog_water,
        'fenced_off': fenced_off,
        'seating': seating,
        'parking': parking,
        'night_light': night_light,
        'shade': shade,
        'wildlife': wildlife,
        'open_field': open_field,
        'alcohol': alcohol
    }

    # Build the mask on the index of results (not input_select) so that the two stay aligned.
    mask = pd.Series(True, index=results.index)
    for col, req in user_reqs.items():
        if req == 1:
            mask &= results[col] == 1
    filtered_locs = results[mask].copy()
    filtered_locs['weighted_score'] = (filtered_locs['prox_score'] * weights['prox_score'] + filtered_locs['cosine_similarity'] * weights['cosine_similarity'])
    filtered_locs = filtered_locs[filtered_locs['h_dist'] <= f_dist]
    sorted_locs = filtered_locs.sort_values(by='weighted_score', ascending=False)
    display_locs = sorted_locs[display_cols]
    return display_locs.head(10)
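
The ranking arithmetic can be checked by hand with a small standalone sketch. The weights and the proximity formula are taken from the function above; the helper names and the similarity value of 0.4 are hypothetical and chosen only for illustration.

```python
PROX_WEIGHT = 0.3  # weight of the proximity score, as in the weights dict
SIM_WEIGHT = 0.7   # weight of the cosine similarity, as in the weights dict


def prox_score(dist_km, max_dist_km):
    """Falls linearly from 1 at the user's position to 0 at max_dist_km and beyond."""
    if max_dist_km <= 0:
        return 0.0
    return max(0.0, 1 - dist_km / max_dist_km)


def weighted_score(dist_km, max_dist_km, similarity):
    """Combine proximity and keyword similarity into a single ranking score."""
    return PROX_WEIGHT * prox_score(dist_km, max_dist_km) + SIM_WEIGHT * similarity


# A location 0.5 km away, a 2 km limit and a made-up similarity of 0.4:
# proximity = 1 - 0.5 / 2 = 0.75, so the score is 0.3 * 0.75 + 0.7 * 0.4 = 0.505.
print(round(weighted_score(0.5, 2, 0.4), 3))
```

Because similarity carries the larger weight, a strongly matching location some distance away can outrank a weakly matching one nearby, which is why the result tables are not ordered by h_dist.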

In [8]:
# Example One: A busy day in the city with a small dog, looking for a dog-friendly cafe on your walk together.
# The search radius is 2 km and the only required feature is shade (shade=1); all other flags are 0.
example1 = recommeder_final(-37.8124, 144.9623, input_select, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 'dog-friendly cafe')
example1

| | name | rating | suburb | h_dist | description |
|---|---|---|---|---|---|
| 17 | Café Felice | 4.6 | Melbourne | 0.394666 | ['Best coffee in Melbourne CBD', 'Friendly and... |
| 14 | Hot Poppy cafe | 4.4 | North Melbourne | 1.435029 | ['Charming Atmosphere', 'Quality Organic Ingre... |
| 31 | Auction Rooms | 4.4 | North Melbourne | 1.574278 | ['Rustic-chic decor', 'Sunny courtyard seating... |
| 10 | Bell Street Coffee Window | 4.4 | Fitzroy | 1.908336 | ['Coffee Quality', 'Service Experience', 'Comm... |


In [9]:
# Example Two: Living in the eastern suburbs (Box Hill), looking to explore somewhere with wildlife and open space.
# Since it is the weekend, you are willing to spend more time driving.
# The search radius is 30 km and the required features are parking, shade and wildlife; all other flags are 0.
example2 = recommeder_final(-37.8181, 145.1239, input_select, 30, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 'natural beauty')
example2

| | name | rating | suburb | h_dist | description |
|---|---|---|---|---|---|
| 91 | Boronia Grove Reserve Dog Off Leash Area | 4.3 | Doncaster East | 3.147982 | ['Spacious grassy area for dogs', 'Quiet despi... |
| 106 | Nettleton Park Dog Off Leash Area | 4.3 | Glen Iris | 7.027979 | ['Dog-Friendly', 'Family-Friendly', 'Socializa... |
| 81 | Banksia Park Fenced Dog Park | 4.4 | Bulleen | 7.992896 | ['Fully fenced for off-lead play', 'Plenty of ... |
| 3 | Phillips Reserve | 4.5 | Brunswick East | 13.302307 | ['Family-friendly playground', 'Tranquil natur... |
| 2 | Port Melbourne Beach | 4.5 | Port Melbourne | 15.992423 | ['Stunning beauty', 'Family-friendly', 'Gentle... |
| 111 | Nortons Park | 4.5 | Wantirna South | 9.409527 | ['Dog-friendly', 'Picnic area', 'Scenic views'... |
| 83 | Rosanna Parklands | 4.5 | Rosanna | 10.018992 | ['Nature-friendly urban park', 'Ideal for dog ... |
| 131 | Packer Park Dog Off Leash Area | 4.4 | Carnegie | 10.772727 | ['Dog-friendly environment', 'Scenic pond and ... |
| 125 | David Cooper Park | 4.4 | Wantirna South | 10.841162 | ['Biking Friendly', 'Playground and BBQ Facili... |
| 93 | Warrandyte River Reserve Dog Off Leash Area | 4.6 | Warrandyte | 11.977229 | ['Dog-friendly', 'Scenic walking paths', 'Vibr... |


In [10]:
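# Save the DataFrame back to CSV. Note that this writes to the same path that was loaded in In [2], so it overwrites
# the input file, and the saved file now includes the added 'full_text' column.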
input_select.to_csv('input_select.csv', index=False)