# Capstone Project 2: Indonesia Tourism Recommender System

### Problem Statement:

Indonesia wants to boost its tourism industry using advanced machine-learning techniques. You’re tasked with using the tourism data collected by the Indonesian government to understand tourist preferences and build a recommender system to recommend places to tourists.

### Overview:

For effective marketing, it is essential to understand the customers (tourists) and their expectations. A recommender system is an effective way to augment existing marketing outreach to prospects. This project requires you to perform exploratory data analysis and build a recommender system.

### Input Dataset:

The government has provided you with three input files:

- tourism_with_id.xlsx: information on tourist attractions in 5 major cities of Indonesia.
- user.csv: demographic information about the users, used to make recommendations.
- tourism_rating.csv: 3 columns (the user, the place, and the rating given), used to build a recommendation system based on the ratings.
 

### Directions: 

1. Import all the datasets and perform a preliminary inspection. <br>
    a. Check for missing values and duplicates. <br>
    b. Remove any anomalies found in the data. <br>


2. Explore the data in depth to understand the tourism patterns. <br>
    a. Explore the user group that provides the tourism ratings by answering the following questions: <br>
        i. What is the age distribution of the users visiting the places and giving the ratings? <br>
        ii. Where do most of these users (tourists) come from? <br>
    b. Explore the locations and categories of tourist spots by answering the following questions: <br>
        i. What are the different categories of tourist spots? <br>
        ii. What kind of tourism is each city/location most famous or suitable for? <br>
        iii. Which city would be best for a nature enthusiast to visit? <br>
        iv. What is the average price/cost of these places? <br>
    c. Create a combined dataset of places and their user ratings. <br>
    d. Use this data to find the spots most loved by tourists. Also, which city has the most loved tourist spots? <br>
    e. Indonesia offers a wide range of tourist spots, from historical and cultural attractions to modern amusement parks. Which category of places is the most popular among tourists? <br>


3. Build a recommendation model for the tourists. <br>
    a. Use the above data to develop a collaborative filtering model, and use it to recommend other places to visit based on the tourist's current location (place name). <br>

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Import all the datasets and perform a preliminary inspection.

> *   Check for missing values and duplicates.
> *   Remove any anomalies found in the data.


In [None]:
tourism_with_id = pd.read_excel('tourism_with_id.xlsx')
tourism_rating = pd.read_csv('tourism_rating.csv')
user = pd.read_csv('user.csv')

In [None]:
tourism_with_id.head(2)

In [None]:
user.head(2)

In [None]:
tourism_rating.head(2)

In [None]:
tourism_with_id.info()

In [None]:
tourism_with_id.isna().sum()

In [None]:
tourism_with_id.columns

## Remove columns: 
Remove the excess or unnecessary columns from the dataset.

In [None]:
# Remove the excess columns from tourism_with_id
tourism_with_id.drop(columns = ['Unnamed: 11', 'Unnamed: 12', 'Time_Minutes', 'Coordinate'], inplace = True)

In [None]:
tourism_with_id.info()

In [None]:
user.info()

In [None]:
tourism_rating.info()
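
The directions also ask for a duplicate check, which `info()` and `isna()` do not cover. A minimal sketch, assuming a small made-up ratings table (`demo_rating` is hypothetical; in the notebook the same two calls would run on `tourism_rating`, `user`, and `tourism_with_id`):

```python
import pandas as pd

# Hypothetical stand-in for tourism_rating; the third row repeats the first.
demo_rating = pd.DataFrame({
    "User_Id": [1, 1, 1, 2],
    "Place_Id": [10, 11, 10, 10],
    "Place_Ratings": [4, 3, 4, 5],
})

# Count fully identical rows, then drop them, keeping the first occurrence.
n_duplicates = demo_rating.duplicated().sum()
demo_rating_clean = demo_rating.drop_duplicates().reset_index(drop=True)

print(n_duplicates)             # number of repeated rows
print(demo_rating_clean.shape)  # shape after removing them
```

Whether repeated (user, place) ratings are true anomalies or repeat visits is a judgment call that depends on how the data was collected.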

In [None]:
tourism_with_id.columns = tourism_with_id.columns.str.strip()

# 2. To understand the tourism highlights better, we should explore the data in depth.
## a. Explore the user group that provides the tourism ratings.
> 1. The age distribution of users visiting the places and giving the ratings.


In [None]:
# Create a bar plot and box plot to visualize the age distribution of the tourists visiting Indonesia.
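
One possible way to fill in this cell, sketched on made-up data. The column name `Age` is an assumption about `user.csv`, and `demo_user` is a hypothetical stand-in for the `user` table:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical ages; real values would come from user['Age'] (assumed column name).
demo_user = pd.DataFrame({"Age": [18, 21, 21, 25, 30, 30, 30, 34, 40]})

fig, (ax_bar, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

# Bar plot: number of users at each age, ordered by age.
age_counts = demo_user["Age"].value_counts().sort_index()
ax_bar.bar(age_counts.index, age_counts.values)
ax_bar.set_xlabel("Age")
ax_bar.set_ylabel("Number of users")
ax_bar.set_title("Age distribution")

# Box plot: median, quartiles, and outliers of the same column.
ax_box.boxplot(demo_user["Age"].to_numpy())
ax_box.set_ylabel("Age")
ax_box.set_title("Age spread")

fig.tight_layout()
```

The bar plot shows which ages dominate, while the box plot makes the median and any outliers easier to read.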



> 2. Where do most of these users (tourists) come from?

In [None]:
# Keep the text before the first comma as the city and trim stray whitespace
user['city'] = user['Location'].str.split(',').str[0].str.strip()

In [None]:
# Visualize the cities that most users come from
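
A minimal sketch of the counting step behind such a plot. The `Location` strings below are made up and only mimic the comma-separated shape implied by the split above; `demo_user` is hypothetical:

```python
import pandas as pd

# Hypothetical Location values with the city before the first comma.
demo_user = pd.DataFrame({
    "Location": ["Bekasi, Jawa Barat", "Semarang, Jawa Tengah",
                 "Bekasi, Jawa Barat", "Bogor, Jawa Barat"],
})
demo_user["city"] = demo_user["Location"].str.split(",").str[0].str.strip()

# Users per home city, most common first; keep the top 10 for plotting.
top_cities = demo_user["city"].value_counts().head(10)
print(top_cities)

# In a notebook, a horizontal bar chart can be drawn with:
# top_cities.sort_values().plot.barh()
```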


## b. Next, explore the locations and categories of tourist spots.

> 1. What are the different categories of tourist spots?

In [None]:
tourism_with_id.Category = tourism_with_id.Category.str.strip().str.capitalize()
#print(tourism_with_id.Category.unique())

In [None]:
# Visualize the number of visits for each tourism category to find the most popular category (type) of tourist spot.




> 2. What kind of tourism is each city/location most famous or suitable for?


In [None]:
# Set the colors used to represent the values in the graphs
color = ['seagreen', 'slateblue', 'darkred', 'saddlebrown']

In [None]:
# Visualize the distribution of the most famous category (type) of tourist spots in each city. 
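
A cross-tabulation is one way to get the numbers behind this plot. The sketch below uses a hypothetical `demo_places` table with only the `City` and `Category` columns; the rows are invented:

```python
import pandas as pd

# Hypothetical places; only the City and Category columns matter here.
demo_places = pd.DataFrame({
    "City": ["Jakarta", "Jakarta", "Jakarta", "Bandung", "Bandung", "Bandung"],
    "Category": ["Culture", "Culture", "Amusement park",
                 "Nature preserve", "Nature preserve", "Culture"],
})

# Rows are cities, columns are categories, cells are counts of places.
city_category = pd.crosstab(demo_places["City"], demo_places["Category"])

# The most common category in each city.
top_category = city_category.idxmax(axis=1)
print(top_category)

# In a notebook: city_category.plot.bar(stacked=True)
```

Note that `idxmax` returns only the first category when two are tied, so ties are worth checking in the full table.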



> 3. Which city would be best for a nature enthusiast to visit?

In [None]:
# Find the list of category types 
tourism_with_id.Category.unique()

In [None]:
vc = tourism_with_id[tourism_with_id.Category == "Nature preserve"].City.value_counts()
# Plot the percentage distribution of tourist spots in each city. In this case, only consider spots categorized as "Nature preserve"
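
A sketch of the percentage plot, assuming a made-up Series shaped like `vc` (counts of nature-preserve spots indexed by city). `demo_vc` and its numbers are hypothetical:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical counts shaped like `vc` (nature-preserve spots per city).
demo_vc = pd.Series({"Bandung": 6, "Yogyakarta": 3, "Jakarta": 1})

# Convert counts to percentages of all nature-preserve spots.
share_pct = demo_vc / demo_vc.sum() * 100

fig, ax = plt.subplots(figsize=(5, 5))
ax.pie(share_pct, labels=share_pct.index, autopct="%.1f%%")
ax.set_title("Share of nature-preserve spots by city")
```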


> 4. What is the average price/cost of these places?

In [None]:
# Plot the price distribution for these tourist spots
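
Before plotting, the averages themselves can be computed directly. This is a sketch on invented values; the column name `Price` is an assumption about `tourism_with_id`, and `demo_places` is hypothetical:

```python
import pandas as pd

# Hypothetical prices; 'Price' is an assumed column name and the values are made up.
demo_places = pd.DataFrame({
    "Category": ["Culture", "Culture", "Amusement park", "Nature preserve"],
    "Price": [0, 20000, 100000, 40000],
})

# Overall average and the average within each category.
overall_mean = demo_places["Price"].mean()
mean_by_category = demo_places.groupby("Category")["Price"].mean()
print(overall_mean)
print(mean_by_category)

# In a notebook: demo_places["Price"].plot.hist(bins=20)
```

If many places are free, the median may describe a typical price better than the mean.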


## c. To better understand the tourism ecosystem, create a combined dataset of places and their ratings.

In [None]:
tourism_rating.head(2)

In [None]:
tourism_with_id.head(2)

## Calculate weighted average ratings for each place

In [None]:
# Calculate the weighted average of the 'Place_Ratings' column for each place/location.
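
The notebook does not say which weights are intended. One reading is the plain per-place mean; another, sketched here as an assumption, shrinks each place's mean toward the global mean so that places with few ratings are not over-ranked. The smoothing constant `m` and the `demo_rating` table are hypothetical:

```python
import pandas as pd

# Hypothetical ratings: place 1 has four ratings, place 2 has one.
demo_rating = pd.DataFrame({
    "Place_Id": [1, 1, 1, 1, 2],
    "Place_Ratings": [4, 4, 4, 4, 5],
})

stats = demo_rating.groupby("Place_Id")["Place_Ratings"].agg(["mean", "count"])
global_mean = demo_rating["Place_Ratings"].mean()
m = 2  # hypothetical: number of "virtual" ratings at the global mean

# Damped mean: (count * mean + m * global_mean) / (count + m)
stats["weighted"] = (
    stats["count"] * stats["mean"] + m * global_mean
) / (stats["count"] + m)

# Same shape the later cells expect: one row per place, column Place_Ratings.
place_ratings = stats["weighted"].rename("Place_Ratings").reset_index()
print(place_ratings)
```

Setting `m = 0` reduces this to the plain mean, so the two readings are easy to compare.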


In [None]:
# Merge this new average place rating to the tourism_with_id table. Hint: Join on the Place_Id column. Check the head of the new table to confirm the join operation.
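
A sketch of the join on `Place_Id`, using two small hypothetical tables. The choice of a left join is an assumption, made so that places without any rating stay visible:

```python
import pandas as pd

# Hypothetical place table and per-place average ratings.
demo_places = pd.DataFrame({
    "Place_Id": [1, 2, 3],
    "Place_Name": ["Place A", "Place B", "Place C"],
})
demo_place_ratings = pd.DataFrame({
    "Place_Id": [1, 2],
    "Place_Ratings": [4.1, 3.2],
})

# Left join keeps every place, including ones that received no ratings (NaN).
merged = demo_places.merge(demo_place_ratings, on="Place_Id", how="left")
print(merged.head())
```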



## d. Use this data to figure out the spots that are most loved by the tourists.

In [None]:
# Highest-rated places first (place_ratings holds the per-place average computed above)
place_ratings.sort_values("Place_Ratings", ascending=False)

## Also, which city has the most loved tourist spots?

- Approach: treat places with an average rating above 3.5 as the most loved, then find the cities where most of these highly rated spots are located.

In [None]:
# Plot the percentage distribution of the cities with the most number of popular tourist spots. A popular tourist spot is defined as a place with an average rating greater than 3.5
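
The filtering and percentage step can be sketched as follows, on a hypothetical `demo_places` table that already carries the average rating per place:

```python
import pandas as pd

# Hypothetical places with their average ratings.
demo_places = pd.DataFrame({
    "City": ["Jakarta", "Jakarta", "Bandung", "Bandung", "Surabaya"],
    "Place_Ratings": [3.8, 3.2, 3.9, 3.6, 3.7],
})

# Popular spots: average rating strictly greater than 3.5.
popular = demo_places[demo_places["Place_Ratings"] > 3.5]

# Percentage of popular spots located in each city.
city_share_pct = popular["City"].value_counts(normalize=True) * 100
print(city_share_pct)

# In a notebook: city_share_pct.plot.pie(autopct="%.1f%%")
```

Grouping `popular` by `Category` instead of `City` gives the corresponding breakdown by category.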


__Observations:__
- Record your observations here.

## e. Indonesia offers a wide range of tourist spots, from historical and cultural attractions to modern amusement parks. Which category of places do users like the most?


- Again, take the places with an average rating above 3.5 and find the most liked categories.

- Amusement parks are the most liked, followed very closely by nature preserves.

In [None]:
# Plot the distribution of the popular tourist spots (average ratings > 3.5) across the tourist categories


# 3. Build a recommendation model for the tourists.

- Create a dataframe with information about these spots that includes the place ID, user rating, name, description, category, location, and price.


- Use the above data to develop a collaborative filtering model, and use it to recommend other places to visit based on the tourist's current location (place name).



In [None]:
# Create the dataframe for the recommender system.
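
One way to build `recom_data` is to join each rating to the attributes of the place it refers to. The tables below are hypothetical, and `Price` is an assumed column name:

```python
import pandas as pd

# Hypothetical ratings and place attributes.
demo_rating = pd.DataFrame({
    "User_Id": [1, 1, 2],
    "Place_Id": [10, 11, 10],
    "Place_Ratings": [4, 2, 5],
})
demo_places = pd.DataFrame({
    "Place_Id": [10, 11],
    "Place_Name": ["Place A", "Place B"],
    "Category": ["Culture", "Nature preserve"],
    "City": ["Jakarta", "Bandung"],
    "Price": [0, 25000],
})

# One row per (user, place) rating, enriched with the place attributes.
recom_data = demo_rating.merge(demo_places, on="Place_Id", how="inner")

# The pivot used in the next cell: users as rows, place names as columns.
ratings_data = (
    recom_data.groupby(["User_Id", "Place_Name"])["Place_Ratings"].mean().unstack()
)
print(ratings_data)
```

Cells for places a user never rated come out as `NaN`, which the later steps rely on.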


In [None]:
ratings_data = recom_data.groupby(['User_Id', 'Place_Name'])['Place_Ratings'].mean().unstack()

In [None]:
ratings_data

In [None]:
# Normalize user-item matrix
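
A common normalization, shown here as an assumption about what this cell intends, is to subtract each user's own mean rating so that generous and strict raters become comparable. `ratings_demo` is a hypothetical user-item matrix; `data_norm` is the name the cosine-similarity cell further down expects:

```python
import numpy as np
import pandas as pd

# Hypothetical user-item matrix: rows are users, columns are places.
ratings_demo = pd.DataFrame(
    {"Place A": [5, 2, np.nan], "Place B": [3, np.nan, 4], "Place C": [1, 4, 2]},
    index=pd.Index([1, 2, 3], name="User_Id"),
)

# Subtract each user's mean rating; mean() skips NaN by default.
data_norm = ratings_demo.sub(ratings_demo.mean(axis=1), axis=0)
print(data_norm)
```

After this step a positive value means the user rated the place above their personal average.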


In [None]:
# create a User similarity matrix using Pearson correlation
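
A minimal sketch of the Pearson step on a hypothetical normalized matrix. `user_similarity` is the name the later cells use; `data_norm_demo` and its values are made up:

```python
import pandas as pd

# Hypothetical normalized ratings: rows are users, columns are places.
data_norm_demo = pd.DataFrame(
    {"Place A": [2.0, 1.0, -2.0],
     "Place B": [0.0, 0.0, 0.0],
     "Place C": [-2.0, -1.0, 2.0]},
    index=pd.Index([1, 2, 3], name="User_Id"),
)

# DataFrame.corr() correlates columns, so transpose to put users in columns.
user_similarity = data_norm_demo.T.corr(method="pearson")
print(user_similarity)
```

With real data, `corr()` uses only the places both users rated, so pairs with very few shared places can produce unstable or missing values.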


In [None]:
# Cosine similarity, as an alternative to Pearson correlation
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
user_similarity_cosine = cosine_similarity(data_norm.fillna(0))
user_similarity_cosine

In [None]:
# Pick a user ID
picked_userid = 1
# Remove picked user ID from the candidate list
user_similarity.drop(index=picked_userid, inplace=True)
# Take a look at the data
user_similarity.head()

In [None]:
# Number of similar users
n = 10
# User similarity threshold
user_similarity_threshold = 0.3
# Get top n similar users
# Print out top n similar users
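
The selection can be sketched as a filter followed by a sort. The similarity values below are invented, and `user_similarity_demo` is a hypothetical stand-in for `user_similarity` after the picked user's row has been dropped:

```python
import pandas as pd

picked_userid = 1
# Hypothetical similarity matrix with the picked user's own row already removed.
user_similarity_demo = pd.DataFrame(
    {1: [0.9, 0.2, 0.5], 2: [1.0, 0.1, 0.3], 3: [0.1, 1.0, 0.4], 4: [0.3, 0.4, 1.0]},
    index=pd.Index([2, 3, 4], name="User_Id"),
)
n = 10
user_similarity_threshold = 0.3

# Column `picked_userid` holds every other user's similarity to the picked user.
candidates = user_similarity_demo[picked_userid]
similar_users = (
    candidates[candidates > user_similarity_threshold]
    .sort_values(ascending=False)
    .head(n)
)
print(similar_users)
```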


In [None]:
# List the places that the target user has visited and rated


In [None]:
# List the places that similar users visited and rated. 


In [None]:
# Remove the places already visited
# Take a look at the data
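
The three steps above (places the target user rated, places similar users rated, and the difference) can be sketched on a hypothetical normalized matrix. `similar_user_ids` stands in for the output of the similar-user step:

```python
import numpy as np
import pandas as pd

picked_userid = 1
similar_user_ids = [2, 4]  # hypothetical output of the similar-user step
data_norm_demo = pd.DataFrame(
    {"Place A": [1.0, 0.5, np.nan, np.nan],
     "Place B": [np.nan, 1.0, 0.0, -0.5],
     "Place C": [-1.0, np.nan, np.nan, 0.5],
     "Place D": [np.nan, np.nan, 1.0, np.nan]},
    index=pd.Index([1, 2, 3, 4], name="User_Id"),
)

# Places the target user has rated: drop columns that are NaN in their row.
picked_visited = data_norm_demo.loc[[picked_userid]].dropna(axis=1, how="all")

# Places rated by at least one similar user.
similar_visited = data_norm_demo.loc[similar_user_ids].dropna(axis=1, how="all")

# Keep only places the target user has not rated yet.
similar_unvisited = similar_visited.drop(
    columns=picked_visited.columns, errors="ignore"
)
print(similar_unvisited)
```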


In [None]:
# A dictionary to store item scores

# Convert dictionary to pandas dataframe
    
# Sort the places by score

# Display top m places
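
One possible scoring rule, shown as an assumption and not necessarily the intended one, is the similarity-weighted mean of the normalized ratings that similar users gave each candidate place. All inputs below are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical similarity scores and normalized ratings for unvisited places.
similar_users = pd.Series({2: 0.9, 4: 0.5})
similar_unvisited = pd.DataFrame(
    {"Place B": [1.0, -0.5], "Place D": [np.nan, 0.4]},
    index=pd.Index([2, 4], name="User_Id"),
)

# A dictionary to store item scores
item_score = {}
for place in similar_unvisited.columns:
    ratings = similar_unvisited[place].dropna()
    weights = similar_users.loc[ratings.index]
    # Similarity-weighted mean of the normalized ratings for this place.
    item_score[place] = (ratings * weights).sum() / weights.sum()

# Convert the dictionary to a dataframe and sort the places by score
ranked = (
    pd.DataFrame(list(item_score.items()), columns=["Place_Name", "score"])
    .sort_values("score", ascending=False)
    .reset_index(drop=True)
)

# Display top m places
m = 10
print(ranked.head(m))
```

Because the ratings are mean-centered, a positive score means similar users tended to rate the place above their own averages.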
