# SafariHub - Tour Recommender System

## Business Understanding

### Overview
Tourism is a thriving industry in Kenya, and travelers often face the challenge of choosing the right destinations for their trips. Our project aims to address this problem by creating a recommendation system that assists users in discovering personalized tourist destinations in the country.

### Problem Statement

With so many options available, travelers often struggle to choose the destinations that best suit their trips, which makes personalized recommendations valuable. This project builds a recommendation system that suggests relevant destinations in Kenya based on user preferences and historical interactions.

#### Stakeholders
1. **Travelers**: They seek relevant recommendations based on their preferences, interests, and historical interactions.
2. **Tourism Agencies**: These organizations can enhance user experiences by providing tailored suggestions.
3. **Local Businesses**: Recommendations can drive footfall to local attractions, restaurants, and accommodations.

### Proposed Solution and Metrics of Success
We propose building a hybrid recommendation system that combines collaborative filtering and content-based approaches. Success metrics include accuracy, recall and precision scores.
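
For a recommender, precision and recall are usually measured on each user's top-k list. The following is a minimal sketch, assuming a set of held-out destinations that each user actually liked is available. The function name `precision_recall_at_k` and the sample lists are hypothetical and are not part of the project code.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Return (precision@k, recall@k) for a single user.

    recommended: ranked list of destination names, best first
    relevant:    set of destinations the user actually liked (held-out data)
    """
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 5 recommendations scored against 4 relevant destinations
recommended = ["Diani Beach", "Tsavo Park", "Fort Jesus Museum",
               "Lake Naivasha", "Mara Triangle"]
relevant = {"Diani Beach", "Mara Triangle", "Karen Blixen Museum", "Lake Naivasha"}

precision, recall = precision_recall_at_k(recommended, relevant, k=5)
print(f"precision@5 = {precision:.2f}, recall@5 = {recall:.2f}")
```

Averaging these values over all test users gives one figure per metric that can be compared against the project targets.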

### Objectives

- Build a collaborative filtering model to recommend destinations.
- Mitigate the cold-start problem by incorporating content-based features.
- Achieve a model recall score ≥ 80%.
- Achieve a model accuracy ≥ 80%.

### Challenges

1. **Data Quality and Diversity**:
   - Presence of missing values, outliers, or inaccuracies.
   - Ensuring diverse and representative data across different types of destinations (e.g., cities, beaches, historical sites) is essential.

2. **Cold-Start Problem**:
   - New users with limited interaction history pose a challenge. How do we recommend destinations for them?
   - Balancing collaborative filtering (based on user behavior) with content-based filtering (based on destination features) is critical.

3. **Scalability and Real-Time Recommendations**:
   - As the user base grows, the system must handle increased computational demands.
   - Providing real-time recommendations during user interactions requires efficient algorithms.

4. **User Engagement and Interpretability**:
   - Recommendations should align with user interests to keep them engaged.
   - Ensuring transparency and interpretability of the recommendation process is important.
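
One common way to handle the cold-start challenge listed above is a weighted blend in which the collaborative score gains weight as a user accumulates history. The sketch below is illustrative only: `hybrid_scores` and its `ramp` parameter are hypothetical names, and the linear ramp is an assumption rather than the method used in this project.

```python
import numpy as np


def hybrid_scores(cf_scores, content_scores, n_interactions, ramp=10):
    """Blend collaborative and content-based scores for one user.

    The collaborative weight grows linearly from 0 (no history) to 1 once
    the user has `ramp` or more interactions. `ramp` is a hypothetical
    tuning parameter, not a value taken from the project.
    """
    cf_scores = np.asarray(cf_scores, dtype=float)
    content_scores = np.asarray(content_scores, dtype=float)
    alpha = min(n_interactions / ramp, 1.0)
    return alpha * cf_scores + (1.0 - alpha) * content_scores


# Hypothetical scores for three destinations
cf = [0.9, 0.1, 0.5]
content = [0.2, 0.8, 0.6]

new_user = hybrid_scores(cf, content, n_interactions=0)      # content-based only
half_way = hybrid_scores(cf, content, n_interactions=5)      # equal blend
active_user = hybrid_scores(cf, content, n_interactions=25)  # collaborative only
print(new_user, half_way, active_user)
```

A brand-new user is ranked purely on destination features, while an active user is ranked purely on behaviour. Both score sets are assumed to be on a comparable scale before blending.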

### Conclusion
Our project has significant implications for travelers, tourism agencies, and local businesses. By solving this problem, we contribute to enhancing travel experiences and promoting local economies.


In [1]:
# Import modules
import json
import pandas as pd
import numpy as np


In [2]:
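# Load the JSON file (a forward slash avoids the invalid "\T" escape and works on any OS;
# UTF-8 is assumed for the export)
with open('data/Tripadvisor1.json', 'r', encoding='utf-8') as file: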
    data = json.load(file)

# Now you can access the data as a Python dictionary
data


[{'reviewTags': [{'text': 'meryl streep', 'reviews': 58},
   {'text': 'ngong hills', 'reviews': 104},
   {'text': 'finch hatton', 'reviews': 23},
   {'text': 'coffee plantation', 'reviews': 32},
   {'text': 'guided tour', 'reviews': 67},
   {'text': 'personal guide', 'reviews': 33},
   {'text': 'interesting history', 'reviews': 22},
   {'text': 'worth a visit', 'reviews': 78},
   {'text': 'interesting visit', 'reviews': 25},
   {'text': 'entrance fee', 'reviews': 36},
   {'text': 'gift shop', 'reviews': 37},
   {'text': 'movie', 'reviews': 523},
   {'text': 'africa', 'reviews': 840},
   {'text': 'machinery', 'reviews': 19},
   {'text': 'legacy', 'reviews': 21},
   {'text': 'restored', 'reviews': 19},
   {'text': 'artifacts', 'reviews': 43},
   {'text': 'insight', 'reviews': 37}],
  'category': 'attraction',
  'rating': 4,
  'numberOfReviews': 2281,
  'name': 'Karen Blixen Museum',
  'image': 'https://media-cdn.tripadvisor.com/media/photo-m/1280/16/af/ca/e8/the-karen-blixen-house.jpg',


In [3]:
# Function to extract review tags
def extract_review_tags(review_tags):
    tags = [tag['text'] for tag in review_tags]
    reviews = [tag['reviews'] for tag in review_tags]
    return tags, reviews

# Initialize lists to hold the data
names = []
categories = []
ratings = []
number_of_reviews = []
images = []
photo_counts = []
price_ranges = []
photos_list = []
price_levels = []
review_tags = []
review_counts = []

# Loop through each item in the data
for item in data:
    names.append(item['name'])
    categories.append(item['category'])
    ratings.append(item['rating'])
    number_of_reviews.append(item['numberOfReviews'])
    images.append(item['image'])
    photo_counts.append(item.get('photoCount'))
    price_ranges.append(item.get('priceRange'))
    photos_list.append(item['photos'])
    price_levels.append(item.get('priceLevel'))
    
    tags, reviews = extract_review_tags(item['reviewTags'])
    review_tags.append(tags)
    review_counts.append(reviews)

# Create a DataFrame
df = pd.DataFrame({
    'Name': names,
    'Category': categories,
    'Rating': ratings,
    'NumberOfReviews': number_of_reviews,
    'Image': images,
    'PhotoCount': photo_counts,
    'PriceRange': price_ranges,
    'Photos': photos_list,
    'PriceLevel': price_levels,
    'ReviewTags': review_tags,
    'ReviewCounts': review_counts
})

df.head()

Unnamed: 0,Name,Category,Rating,NumberOfReviews,Image,PhotoCount,PriceRange,Photos,PriceLevel,ReviewTags,ReviewCounts
0,Karen Blixen Museum,attraction,4.0,2281,https://media-cdn.tripadvisor.com/media/photo-...,934,,[https://media-cdn.tripadvisor.com/media/photo...,,"[meryl streep, ngong hills, finch hatton, coff...","[58, 104, 23, 32, 67, 33, 22, 78, 25, 36, 37, ..."
1,Hell's Gate National Park,attraction,4.5,949,https://media-cdn.tripadvisor.com/media/photo-...,1100,,[https://media-cdn.tripadvisor.com/media/photo...,,"[bike ride, the lion king, wild animals, tomb ...","[40, 21, 43, 11, 9, 11, 20, 7, 7, 43, 16, 7, 1..."
2,Tsavo Park,attraction,4.5,1741,https://media-cdn.tripadvisor.com/media/photo-...,4466,,[https://media-cdn.tripadvisor.com/media/photo...,,"[national park, great park, red soil, three da...","[34, 12, 11, 8, 8, 8, 275, 120, 67, 64, 80, 24..."
3,Hilton Garden Inn Nairobi Airport,hotel,4.5,519,https://media-cdn.tripadvisor.com/media/photo-...,392,$169 - $200,[https://media-cdn.tripadvisor.com/media/photo...,$$,"[close to the airport, nairobi airport, roofto...","[43, 27, 22, 7, 6, 5, 13, 7, 7, 5, 7, 9, 6, 10..."
4,PrideInn Flamingo Beach Resort & Spa,hotel,4.5,2922,https://media-cdn.tripadvisor.com/media/photo-...,2003,$128 - $218,[https://media-cdn.tripadvisor.com/media/photo...,$$,"[pavilion bar, guest relations, big swimming p...","[25, 110, 27, 81, 52, 43, 30, 27, 25, 23, 22, ..."
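
The cell above keeps tags and counts as two parallel lists per destination. For tag-level analysis, a long format with one row per (destination, tag) pair can be easier to work with. The sketch below uses `pandas.json_normalize` on two records shaped like the export, with values abbreviated from the output shown earlier. The names `sample` and `tags_long` are illustrative.

```python
import pandas as pd

# Two records shaped like the TripAdvisor export (abbreviated)
sample = [
    {"name": "Karen Blixen Museum", "category": "attraction",
     "reviewTags": [{"text": "ngong hills", "reviews": 104},
                    {"text": "movie", "reviews": 523}]},
    {"name": "Karibuni Lodge", "category": "hotel", "reviewTags": []},
]

# One row per tag; name and category are repeated alongside each tag
tags_long = pd.json_normalize(sample, record_path="reviewTags",
                              meta=["name", "category"])
print(tags_long)
```

Destinations with an empty `reviewTags` list contribute no rows here, so they would need to be joined back from the main DataFrame if they must stay visible. Records that lack the `reviewTags` key entirely would raise an error with this call and would need a default first.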


In [3]:
# Define the list of columns you expect
expected_columns = ["name", "category", "rating", "numberOfReviews", "image", "photoCount", "priceRange", "reviewTags", "photos", "priceLevel"]

# Create a DataFrame while filling missing keys with NaN
df1 = pd.DataFrame([
    {
        col: item.get(col, np.nan) for col in expected_columns
    }
    for item in data
])

# Display the features DataFrame
print("\nFeatures DataFrame:")
df1


Features DataFrame:


Unnamed: 0,name,category,rating,numberOfReviews,image,photoCount,priceRange,reviewTags,photos,priceLevel
0,Karen Blixen Museum,attraction,4.0,2281,https://media-cdn.tripadvisor.com/media/photo-...,934,,"[{'text': 'meryl streep', 'reviews': 58}, {'te...",[https://media-cdn.tripadvisor.com/media/photo...,
1,Hell's Gate National Park,attraction,4.5,949,https://media-cdn.tripadvisor.com/media/photo-...,1100,,"[{'text': 'bike ride', 'reviews': 40}, {'text'...",[https://media-cdn.tripadvisor.com/media/photo...,
2,Tsavo Park,attraction,4.5,1741,https://media-cdn.tripadvisor.com/media/photo-...,4466,,"[{'text': 'national park', 'reviews': 34}, {'t...",[https://media-cdn.tripadvisor.com/media/photo...,
3,Hilton Garden Inn Nairobi Airport,hotel,4.5,519,https://media-cdn.tripadvisor.com/media/photo-...,392,$169 - $200,"[{'text': 'close to the airport', 'reviews': 4...",[https://media-cdn.tripadvisor.com/media/photo...,$$
4,PrideInn Flamingo Beach Resort & Spa,hotel,4.5,2922,https://media-cdn.tripadvisor.com/media/photo-...,2003,$128 - $218,"[{'text': 'pavilion bar', 'reviews': 25}, {'te...",[https://media-cdn.tripadvisor.com/media/photo...,$$
...,...,...,...,...,...,...,...,...,...,...
586,iKWETA Safari Camp,hotel,4.5,383,https://media-cdn.tripadvisor.com/media/photo-...,486,$177 - $206,"[{'text': 'affordable luxury', 'reviews': 19},...",[https://media-cdn.tripadvisor.com/media/photo...,$$
587,GemSuites Riverside,hotel,4.5,61,https://media-cdn.tripadvisor.com/media/photo-...,170,$139 - $301,"[{'text': 'security team', 'reviews': 2}, {'te...",[https://media-cdn.tripadvisor.com/media/photo...,$$$
588,Mara Major Camp,hotel,5.0,34,https://media-cdn.tripadvisor.com/media/photo-...,133,$660 - $760,"[{'text': 'highly recommend staying here', 're...",[https://media-cdn.tripadvisor.com/media/photo...,$$$$
589,Villa Mandhari - Diani Beach,hotel,4.5,41,https://media-cdn.tripadvisor.com/media/photo-...,96,$139 - $234,"[{'text': 'pool', 'reviews': 11}, {'text': 'di...",[https://media-cdn.tripadvisor.com/media/photo...,$$


In [4]:

# Load the JSON file (forward slash for a portable path; UTF-8 is assumed for the export)
with open('data/Tripadvisor2.json', 'r', encoding='utf-8') as file:
    data2 = json.load(file)

# Now you can access the data as a Python dictionary
data2


[{'photos': ['https://media-cdn.tripadvisor.com/media/photo-o/09/80/6c/4b/saruni-samburu.jpg',
   'https://media-cdn.tripadvisor.com/media/photo-o/11/de/8e/6d/single-villa-5.jpg',
   'https://media-cdn.tripadvisor.com/media/photo-o/11/e0/d6/b8/saruni-samburu-suite.jpg',
   'https://media-cdn.tripadvisor.com/media/photo-o/2c/8b/00/2f/caption.jpg',
   'https://media-cdn.tripadvisor.com/media/photo-o/1d/73/81/b7/panoramic-overview-from.jpg',
   'https://media-cdn.tripadvisor.com/media/photo-o/1d/73/81/b9/panoramic-overview-from.jpg',
   'https://media-cdn.tripadvisor.com/media/photo-o/1d/73/81/ba/the-room.jpg',
   'https://media-cdn.tripadvisor.com/media/photo-o/23/a3/23/e1/caption.jpg',
   'https://media-cdn.tripadvisor.com/media/photo-o/23/a3/23/e2/caption.jpg',
   'https://media-cdn.tripadvisor.com/media/photo-o/23/a3/23/e3/caption.jpg',
   'https://media-cdn.tripadvisor.com/media/photo-o/1c/1d/fd/76/this-is-eric-with-our.jpg'],
  'reviewTags': [{'text': 'outdoor shower', 'reviews': 40

In [5]:
import pandas as pd
import numpy as np

# Define the list of columns you expect
expected_columns = ["name", "category", "rating", "numberOfReviews", "image", "photoCount", "priceRange", "reviewTags", "photos", "priceLevel"]

# Create a DataFrame while filling missing keys with NaN
df2 = pd.DataFrame([
    {
        col: item.get(col, np.nan) for col in expected_columns
    }
    for item in data2
])

# Display the resulting DataFrame
df2.head()


Unnamed: 0,name,category,rating,numberOfReviews,image,photoCount,priceRange,reviewTags,photos,priceLevel
0,Saruni Samburu,hotel,5.0,697,https://media-cdn.tripadvisor.com/media/photo-...,1178,"KES 45,419 - KES 97,327","[{'text': 'outdoor shower', 'reviews': 40}, {'...",[https://media-cdn.tripadvisor.com/media/photo...,$$$$
1,Fort Jesus Museum,attraction,4.0,994,https://media-cdn.tripadvisor.com/media/photo-...,900,,"[{'text': 'old town', 'reviews': 132}, {'text'...",[https://media-cdn.tripadvisor.com/media/photo...,
2,Karen Blixen Coffee Garden,attraction,4.5,602,https://media-cdn.tripadvisor.com/media/photo-...,159,,"[{'text': 'movie', 'reviews': 42}, {'text': 'a...",[https://media-cdn.tripadvisor.com/media/photo...,
3,Mara Triangle,attraction,5.0,830,https://media-cdn.tripadvisor.com/media/photo-...,1279,,"[{'text': 'the river', 'reviews': 57}, {'text'...",[https://media-cdn.tripadvisor.com/media/photo...,
4,Lake Naivasha,attraction,4.5,438,https://media-cdn.tripadvisor.com/media/photo-...,1077,,"[{'text': 'fish', 'reviews': 31}, {'text': 'bo...",[https://media-cdn.tripadvisor.com/media/photo...,


In [6]:
df2.shape

(1000, 10)

In [7]:
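# Load the JSON file (forward slash for a portable path; UTF-8 is assumed for the export,
# which avoids mis-decoded characters such as "Â" in the price strings)
with open('data/Tripadvisor3.json', 'r', encoding='utf-8') as file: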
    data3 = json.load(file)

# Now you can access the data as a Python dictionary
data3


[{'photos': ['https://media-cdn.tripadvisor.com/media/photo-o/16/9a/88/83/ngutuni-safari-lodge.jpg',
   'https://media-cdn.tripadvisor.com/media/photo-o/16/9a/8a/48/ngutuni-safari-lodge.jpg',
   'https://media-cdn.tripadvisor.com/media/photo-o/16/9a/87/53/ngutuni-safari-lodge.jpg',
   'https://media-cdn.tripadvisor.com/media/photo-o/2c/95/63/e2/caption.jpg',
   'https://media-cdn.tripadvisor.com/media/photo-o/12/30/66/db/ngutuni-safari-lodge.jpg',
   'https://media-cdn.tripadvisor.com/media/photo-o/11/d1/d3/51/p-20180107-130718-vhdr.jpg',
   'https://media-cdn.tripadvisor.com/media/photo-o/1c/05/d9/1e/img-20200914-wa0004-largejpg.jpg',
   'https://media-cdn.tripadvisor.com/media/photo-o/1d/31/12/0a/caption.jpg',
   'https://media-cdn.tripadvisor.com/media/photo-o/1d/31/12/0c/caption.jpg',
   'https://media-cdn.tripadvisor.com/media/photo-o/1a/91/5c/23/dscn6626-largejpg.jpg',
   'https://media-cdn.tripadvisor.com/media/photo-o/1a/91/5c/22/dscn6591-largejpg.jpg',
   'https://media-cdn.tr

In [8]:
# Define the list of columns you expect
expected_columns = ["name", "category", "rating", "numberOfReviews", "image", "photoCount", "priceRange", "reviewTags", "photos", "priceLevel"]

# Create a DataFrame while filling missing keys with NaN
df3 = pd.DataFrame([
    {
        col: item.get(col, np.nan) for col in expected_columns
    }
    for item in data3
])

# Display the resulting DataFrame
df3.head()


Unnamed: 0,name,category,rating,numberOfReviews,image,photoCount,priceRange,reviewTags,photos,priceLevel
0,Ngutuni Safari Lodge,hotel,4.5,327,https://media-cdn.tripadvisor.com/media/photo-...,653,"KES 8,046 - KES 9,603","[{'text': 'water hole', 'reviews': 41}, {'text...",[https://media-cdn.tripadvisor.com/media/photo...,$
1,Bamburi Beach Hotel,hotel,4.0,1319,https://media-cdn.tripadvisor.com/media/photo-...,1360,"KES 16,481 - KES 22,320","[{'text': 'bamburi beach', 'reviews': 138}, {'...",[https://media-cdn.tripadvisor.com/media/photo...,$$
2,Kozi Suites,hotel,4.0,26,https://media-cdn.tripadvisor.com/media/photo-...,22,"KES 4,672 - KES 7,137","[{'text': 'an early morning flight', 'reviews'...",[https://media-cdn.tripadvisor.com/media/photo...,$
3,Kiambethu Farm,attraction,5.0,233,https://media-cdn.tripadvisor.com/media/photo-...,291,,"[{'text': 'indigenous forest', 'reviews': 22},...",[https://media-cdn.tripadvisor.com/media/photo...,
4,Taita Hills Wildlife Sanctuary,attraction,5.0,105,https://media-cdn.tripadvisor.com/media/photo-...,301,,"[{'text': 'wildlife sanctuary', 'reviews': 3},...",[https://media-cdn.tripadvisor.com/media/photo...,


In [9]:
df3.shape

(1499, 10)

In [10]:

dataframes = pd.concat([df1, df2, df3], ignore_index=True)

dataframes.shape

(3090, 10)
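
The load, select-columns, and concatenate steps are repeated for each of the three files. They could be folded into one helper, sketched below under the assumption that every file is a UTF-8 JSON array of records. `load_listings` is a hypothetical name, and the demo writes two tiny temporary files rather than reading the real data.

```python
import json
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

EXPECTED_COLUMNS = ["name", "category", "rating", "numberOfReviews", "image",
                    "photoCount", "priceRange", "reviewTags", "photos", "priceLevel"]


def load_listings(paths, columns=EXPECTED_COLUMNS):
    """Read several JSON files and return one DataFrame with a fixed column set."""
    rows = []
    for path in paths:
        with open(path, "r", encoding="utf-8") as fh:
            records = json.load(fh)
        # Missing keys become NaN so every row has the same columns
        rows.extend({col: rec.get(col, np.nan) for col in columns} for rec in records)
    return pd.DataFrame(rows, columns=columns)


# Demo with two small temporary files standing in for the real exports
with tempfile.TemporaryDirectory() as tmp:
    p1 = Path(tmp) / "a.json"
    p2 = Path(tmp) / "b.json"
    p1.write_text(json.dumps([{"name": "Tsavo Park", "category": "attraction",
                               "rating": 4.5}]), encoding="utf-8")
    p2.write_text(json.dumps([{"name": "Kozi Suites", "category": "hotel",
                               "priceLevel": "$"}]), encoding="utf-8")
    combined = load_listings([p1, p2])

print(combined.shape)  # (2, 10)
```

Building a single list of rows and constructing the DataFrame once gives the same result as `pd.concat(..., ignore_index=True)` on per-file frames.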

In [11]:
# Convert unhashable types (like lists) to strings for duplicate checking
dataframes['reviewTags'] = dataframes['reviewTags'].apply(lambda x: str(x) if isinstance(x, list) else x)
dataframes['photos'] = dataframes['photos'].apply(lambda x: str(x) if isinstance(x, list) else x)

# Check for duplicates
duplicates = dataframes.duplicated()

# Display duplicates
print("Duplicate Rows:")
duplicates.value_counts()

Duplicate Rows:


False    2567
True      523
dtype: int64

In [12]:
# Check for duplicates
duplicates = dataframes.duplicated(keep=False)  # Keep all duplicates

# Display duplicated rows
duplicated_rows = dataframes[duplicates]
print("Duplicated Rows:")
duplicated_rows

Duplicated Rows:


Unnamed: 0,name,category,rating,numberOfReviews,image,photoCount,priceRange,reviewTags,photos,priceLevel
0,Karen Blixen Museum,attraction,4.0,2281,https://media-cdn.tripadvisor.com/media/photo-...,934,,"[{'text': 'meryl streep', 'reviews': 58}, {'te...",['https://media-cdn.tripadvisor.com/media/phot...,
1,Hell's Gate National Park,attraction,4.5,949,https://media-cdn.tripadvisor.com/media/photo-...,1100,,"[{'text': 'bike ride', 'reviews': 40}, {'text'...",['https://media-cdn.tripadvisor.com/media/phot...,
2,Tsavo Park,attraction,4.5,1741,https://media-cdn.tripadvisor.com/media/photo-...,4466,,"[{'text': 'national park', 'reviews': 34}, {'t...",['https://media-cdn.tripadvisor.com/media/phot...,
34,Nairobi National Park,attraction,4.5,3562,https://media-cdn.tripadvisor.com/media/photo-...,4502,,"[{'text': 'city skyline', 'reviews': 60}, {'te...",['https://media-cdn.tripadvisor.com/media/phot...,
35,Diani Beach,attraction,4.5,1955,https://media-cdn.tripadvisor.com/media/photo-...,1951,,"[{'text': 'white sand', 'reviews': 108}, {'tex...",['https://media-cdn.tripadvisor.com/media/phot...,
...,...,...,...,...,...,...,...,...,...,...
2645,Karibuni Lodge,hotel,5.0,35,https://media-cdn.tripadvisor.com/media/photo-...,21,,[],['https://media-cdn.tripadvisor.com/media/phot...,
2649,Hibiscus Guest House,hotel,4.5,63,https://media-cdn.tripadvisor.com/media/photo-...,32,,[],['https://media-cdn.tripadvisor.com/media/phot...,
2653,Camp Ndotto,hotel,5.0,11,https://media-cdn.tripadvisor.com/media/photo-...,44,,[],['https://media-cdn.tripadvisor.com/media/phot...,
2662,Jumeirah Beach Front Apartments,hotel,4.0,35,https://media-cdn.tripadvisor.com/media/photo-...,26,,"[{'text': 'view of the ocean', 'reviews': 2}, ...",['https://media-cdn.tripadvisor.com/media/phot...,


In [13]:
karen_blixen_museum_rows = duplicated_rows[duplicated_rows['name'] == 'Karen Blixen Museum']
print("Rows with name 'Karen Blixen Museum':")
karen_blixen_museum_rows


Rows with name 'Karen Blixen Museum':


Unnamed: 0,name,category,rating,numberOfReviews,image,photoCount,priceRange,reviewTags,photos,priceLevel
0,Karen Blixen Museum,attraction,4.0,2281,https://media-cdn.tripadvisor.com/media/photo-...,934,,"[{'text': 'meryl streep', 'reviews': 58}, {'te...",['https://media-cdn.tripadvisor.com/media/phot...,
617,Karen Blixen Museum,attraction,4.0,2281,https://media-cdn.tripadvisor.com/media/photo-...,934,,"[{'text': 'meryl streep', 'reviews': 58}, {'te...",['https://media-cdn.tripadvisor.com/media/phot...,
1621,Karen Blixen Museum,attraction,4.0,2281,https://media-cdn.tripadvisor.com/media/photo-...,934,,"[{'text': 'meryl streep', 'reviews': 58}, {'te...",['https://media-cdn.tripadvisor.com/media/phot...,


In [14]:
# Check for duplicates and drop them, keeping the first occurrence
dataframes = dataframes.drop_duplicates(keep='first')

# Display the updated DataFrame
print("DataFrame after dropping duplicates:")
dataframes.shape

DataFrame after dropping duplicates:


(2567, 10)
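
The approach above permanently converts `reviewTags` and `photos` to strings so that `duplicated()` can hash them. If the list values are needed later, one alternative is to compute the duplicate mask on a stringified copy and apply it to the original frame. This is a sketch with a hypothetical helper name, `drop_duplicate_rows`, run on toy data.

```python
import json

import pandas as pd

sample_df = pd.DataFrame({
    "name": ["Tsavo Park", "Tsavo Park", "Kozi Suites"],
    "rating": [4.5, 4.5, 4.0],
    "reviewTags": [[{"text": "red soil", "reviews": 11}],
                   [{"text": "red soil", "reviews": 11}],
                   []],
})


def drop_duplicate_rows(frame, list_columns):
    """Drop exact duplicate rows while leaving list-valued columns as lists."""
    key = frame.copy()
    for col in list_columns:
        # Serialise lists to a stable string so the rows become hashable
        key[col] = key[col].apply(
            lambda v: json.dumps(v, sort_keys=True) if isinstance(v, list) else v
        )
    mask = key.duplicated(keep="first")
    return frame.loc[~mask].reset_index(drop=True)


deduped = drop_duplicate_rows(sample_df, ["reviewTags"])
print(deduped.shape)  # (2, 3)
```

The returned frame keeps the first occurrence of each row, as in the cell above, and `reviewTags` still holds Python lists.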

In [16]:
dataframes.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2567 entries, 0 to 3089
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   name             2567 non-null   object 
 1   category         2567 non-null   object 
 2   rating           2564 non-null   float64
 3   numberOfReviews  2567 non-null   int64  
 4   image            2564 non-null   object 
 5   photoCount       2567 non-null   int64  
 6   priceRange       1487 non-null   object 
 7   reviewTags       2567 non-null   object 
 8   photos           2567 non-null   object 
 9   priceLevel       1487 non-null   object 
dtypes: float64(1), int64(2), object(7)
memory usage: 220.6+ KB
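
The summary shows `priceRange` and `priceLevel` populated for 1487 of 2567 rows, and the sample rows earlier suggest the gaps may sit mostly with attractions. A per-category share of missing values would confirm this. The sketch below runs on a toy frame; `listings` and `missing_share` are illustrative names, and the same expression could be applied to `dataframes`.

```python
import numpy as np
import pandas as pd

# Toy frame with the same kind of gaps as the combined data
listings = pd.DataFrame({
    "category": ["attraction", "attraction", "hotel", "hotel"],
    "rating": [4.0, np.nan, 4.5, 5.0],
    "priceRange": [np.nan, np.nan, "$169 - $200", np.nan],
})

# Fraction of missing values per column, split by category
missing_share = (
    listings.drop(columns="category")
    .isna()
    .groupby(listings["category"])
    .mean()
)
print(missing_share)
```

A share close to 1.0 for attractions would indicate that price fields are structurally absent for that category rather than randomly missing, which affects how they should be imputed or encoded.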
