# Selection of Attractions - Entropy Weight Method
Ratings on lifestyle consumer guide websites differ only slightly between attractions, and some very popular attractions have few reviews, so ratings and review counts alone do not capture popularity. We therefore also use the data on attractions that users have recently visited in *Amap*, which reflects current trends.

We combine a total of six indicators into a composite "popularity" score and eliminate the attractions that score lowest. *The entropy weight method* uses information entropy, a tool from information theory, to derive a weight for each indicator and then produce a comprehensive score. We therefore use *the entropy weight method* to weight the six indicators. The specific steps are as follows:

## Data preparation
After preprocessing the attraction data in the previous step, we have the basic processed data, but two more steps are still needed:
1. Manually fix attraction names that failed to match, and delete attractions with fewer than 10 reviews, small attractions located inside larger ones, attractions that cannot be found on Amap, and so on.
2. Through the Amap app, obtain the data on attractions that Amap users have recently visited, as of March 1, 2024.

We therefore replace the previous data directly with the manually processed data. Then, in step 2.2, the entropy weight method is used to filter the attractions.

In [1]:
import pandas as pd
from tabulate import tabulate

def display_df_data(df, num_rows=3):
    """
    Display the first few rows of data in the dataframe.

    Parameters:
        df (dataframe): The dataframe which contains the data.
        num_rows (int): The number of rows to display initially. Default is 3.
    """
    # If the DataFrame has more rows than we want to display, insert an ellipsis
    if len(df) > num_rows + 1:
        ellipsis = pd.DataFrame([['...'] * len(df.columns)], columns=df.columns)
        df = pd.concat([df.head(num_rows), ellipsis, df.tail(1)], ignore_index=True)

    print(tabulate(df, headers='keys', tablefmt='psql', showindex=False))

attraction_file_path = '../data/processed/attractions_information.xlsx'

# Load the manually processed data; it replaces the output of the previous step
df_data = pd.read_excel('../data/processed/manually_processed_attractions_information/attractions_information_before_ewm.xlsx')

df_attraction = df_data.copy()

display_df_data(df_attraction)

+--------------+------------------+-------------------+-----------------+---------------+--------------------+------------------+
| Attraction   | visited_number   | ctrip_hot_score   | ctrip_comment   | ctrip_score   | dianping_comment   | dianping_score   |
|--------------+------------------+-------------------+-----------------+---------------+--------------------+------------------|
| 永利发财树   | 100000           | 4.4               | 69              | 4.7           | 861                | 4.9              |
| 澳门科学馆   | 24683            | 7.0               | 763             | 4.8           | 1240               | 4.9              |
| 主教山小堂   | 12235            | 4.3               | 152             | 4.7           | 352                | 4.9              |
| ...          | ...              | ...               | ...             | ...           | ...                | ...              |
| 融和门       | 151              | 1.7               | 29              | 4.5           | 6                 

## Data normalization
There are *six indicators* in total, and all of them are *positive indicators* (a larger value is better). Positive indicators are normalized as follows.\
For *n samples* and *m indicators*, $x_{ij}$ is the value of the j-th indicator of the i-th sample:

$$
x_{ij},\quad i=1,2,\cdots,n;\; j=1,2,\cdots,m
$$
Each indicator is min-max normalized. For simplicity, *the normalized data* $x^{\prime}_{ij}$ is still written as $x_{ij}$ in the following steps.

$$
\text{Positive indicators: }\quad x^{\prime}_{ij}=\frac{x_{ij}-\min\{x_{1j},\cdots,x_{nj}\}}{\max\{x_{1j},\cdots,x_{nj}\}-\min\{x_{1j},\cdots,x_{nj}\}}
$$

In [2]:
import numpy as np

# Select the column representing the indicator
columns = ['visited_number', 'ctrip_hot_score', 'ctrip_comment', 'ctrip_score', 'dianping_comment', 'dianping_score']

df_norm = df_attraction[columns].apply(lambda x: (x - np.min(x)) / (np.max(x) - np.min(x)))

display_df_data(df_norm)

+-----------------------+---------------------+-----------------------+--------------------+-----------------------+------------------+
| visited_number        | ctrip_hot_score     | ctrip_comment         | ctrip_score        | dianping_comment      | dianping_score   |
|-----------------------+---------------------+-----------------------+--------------------+-----------------------+------------------|
| 1.0                   | 0.4347826086956522  | 0.009083622762489982  | 0.8000000000000002 | 0.0558227190018196    | 1.0              |
| 0.24658890845070422   | 0.8115942028985507  | 0.10179000801496126   | 0.8666666666666666 | 0.08045230049389135   | 1.0              |
| 0.12206906209987196   | 0.42028985507246375 | 0.020170985840235106  | 0.8000000000000002 | 0.02274499610085781   | 1.0              |
| ...                   | ...                 | ...                   | ...                | ...                   | ...              |
| 0.0011903809218950063 | 0.04347826086956522 | 

## Calculate entropy value
*Information entropy* is the expected amount of information and can be understood as a measure of uncertainty.\
The greater the uncertainty, the greater the information entropy. It is calculated as follows:
### First, calculate the proportion of the i-th sample under the j-th indicator:
$$p_{ij}=\frac{x_{ij}}{\sum_{i=1}^nx_{ij}}$$
### Then calculate the entropy value of the j-th indicator:
$$e_j=-k\sum_{i=1}^np_{ij}\ln\left(p_{ij}\right),\quad k=\frac{1}{\ln(n)}>0,\quad\text{which ensures } e_j\geq0$$
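
As a quick sanity check of the entropy formula, the following minimal sketch computes $e_j$ for two extreme columns. The helper name `normalized_entropy` and the toy values are assumptions made for illustration; the sketch also treats $0\cdot\ln 0$ as 0 instead of adding a small epsilon.

```python
import math

def normalized_entropy(values):
    """Entropy e_j of one indicator column, scaled by k = 1 / ln(n)."""
    n = len(values)
    total = sum(values)
    proportions = [v / total for v in values]
    # Convention: 0 * ln(0) is treated as 0, so zero proportions are skipped
    return -sum(p * math.log(p) for p in proportions if p > 0) / math.log(n)

# Hypothetical columns: one spread evenly, one concentrated in a single sample
uniform = normalized_entropy([1.0, 1.0, 1.0, 1.0])
concentrated = normalized_entropy([1.0, 0.0, 0.0, 0.0])
print(uniform, concentrated)
```

The evenly spread column reaches the maximum entropy of 1, while the concentrated column has entropy 0. In the entropy weight method, an indicator with entropy close to 1 barely distinguishes the samples and therefore receives little weight.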


In [3]:
# In order to prevent log(0) from happening
epsilon = 1e-10

# Calculate the proportion
pij = df_norm / df_norm.sum()

# Calculate the entropy value
eij = -np.sum(pij * np.log(pij + epsilon), axis=0) / np.log(len(df_attraction))

# Output the information entropy of each indicator
print("Information Entropy of each indicator:")
print(eij)

Information Entropy of each indicator:
visited_number      0.697903
ctrip_hot_score     0.943221
ctrip_comment       0.576674
ctrip_score         0.992612
dianping_comment    0.633523
dianping_score      0.991459
dtype: float64


## Calculate the weight of each indicator
### Calculate the information entropy redundancy (difference):
$$d_j=1-e_j$$
### Calculate the weight of each indicator:
$$w_j=\frac{d_j}{\sum_{j=1}^md_j}$$

In [4]:
# Calculate the information entropy redundancy (difference)
dj = 1 - eij

# Calculate the weight of each indicator
wj = dj / dj.sum()

print("Weight of each indicator:")
print(wj)

Weight of each indicator:
visited_number      0.259398
ctrip_hot_score     0.048754
ctrip_comment       0.363492
ctrip_score         0.006344
dianping_comment    0.314678
dianping_score      0.007334
dtype: float64


## Calculate the overall score of each attraction and filter
### The comprehensive score of each attraction is obtained through weighted calculation
The higher *the comprehensive score*, the more *popular* the attraction is.
$$
s_i=\sum_{j=1}^mw_j\cdot x_{ij},\:i=1,\cdots,n
$$

*Comprehensive Score = \
0.259398 * visited_number + \
0.048754 * ctrip_hot_score + \
0.363492 * ctrip_comment + \
0.006344 * ctrip_score + \
0.314678 * dianping_comment + \
0.007334 * dianping_score*
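
To make the chain from raw indicators to the comprehensive score easier to follow, here is a compact sketch of the whole entropy weight method in one function. It is an illustration under assumptions, not the notebook's exact code: the function name `entropy_weight_scores` and the toy data are hypothetical, every column is assumed to be a non-constant positive indicator, and $0\cdot\ln 0$ is treated as 0 rather than handled with an epsilon.

```python
import numpy as np

def entropy_weight_scores(data):
    """Return (weights, scores) for an n-by-m array of positive indicators."""
    data = np.asarray(data, dtype=float)
    n = data.shape[0]
    # Min-max normalization, column by column (assumes no constant column)
    norm = (data - data.min(axis=0)) / (data.max(axis=0) - data.min(axis=0))
    # Proportion of each sample within its indicator
    p = norm / norm.sum(axis=0)
    # Entropy per indicator; zero proportions contribute nothing
    safe_p = np.where(p > 0, p, 1.0)
    e = -(p * np.log(safe_p)).sum(axis=0) / np.log(n)
    # Redundancy d_j = 1 - e_j, then weights w_j = d_j / sum(d_j)
    d = 1.0 - e
    w = d / d.sum()
    # Comprehensive score: weighted sum of the normalized indicators
    return w, norm @ w

# Hypothetical toy data: 4 attractions x 3 indicators (not from the real dataset)
toy = [[100, 4.5, 20],
       [50, 4.7, 300],
       [10, 4.6, 40],
       [5, 4.9, 10]]
weights, scores = entropy_weight_scores(toy)
print(weights.round(3), scores.round(3))
```

Because the weights sum to 1 and every normalized value lies in [0, 1], each score also lies in [0, 1].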

### Sort in descending order based on overall score and filter
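The comprehensive score can range from 0 to 1, which corresponds to 0 to 100 points when multiplied by 100. \
The higher the score, the more popular the attraction is. We keep the 53 highest-scoring attractions and then drop those with fewer than 10 Ctrip comments. After screening, *49 attractions* were obtained.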

In [5]:
# Calculate the overall score of each attraction
df_attraction['composite_score'] = df_norm.mul(wj).sum(axis=1)

# Sort by the value of the composite_score and store
df_attraction = df_attraction.sort_values(by='composite_score', ascending=False)

# Filter the attractions
df_attraction = df_attraction.head(53)
df_attraction = df_attraction[df_attraction['ctrip_comment'] >= 10]

display_df_data(df_attraction[['Attraction', 'composite_score']], 50)

df_attraction.to_excel(attraction_file_path, index=False)

+--------------------+-------------------+
| Attraction         |   composite_score |
|--------------------+-------------------|
| 大三巴牌坊         |         0.903271  |
| 澳门塔             |         0.671463  |
| 澳门渔人码头       |         0.410402  |
| 澳门妈祖阁         |         0.346698  |
| 玫瑰圣母堂         |         0.340415  |
| 议事亭前地         |         0.322961  |
| 永利发财树         |         0.313872  |
| 大炮台             |         0.296592  |
| 澳门科学馆         |         0.178682  |
| 澳门博物馆         |         0.173934  |
| 金莲花广场         |         0.137757  |
| 澳门大赛车博物馆   |         0.0994647 |
| 主教山小堂         |         0.0790538 |
| 民政总署大楼       |         0.0777964 |
| 郑家大屋           |         0.0762142 |
| 澳门回归贺礼陈列馆 |         0.0748734 |
| 加思栏花园         |         0.0736881 |
| 恋爱巷             |         0.0707756 |
| 澳门艺术博物馆     |         0.0705656 |
| 澳门永利喷泉       |         0.0684841 |
| 二龙喉公园         |         0.0678186 |
| 东望洋炮台及灯塔   |         0.0617945 |
| 澳门美高梅天幕广场 |         0.0608819 |
| 塔石广场   

### Heatmap of Tourist Attractions
To show the spatial distribution of the selected attractions, we draw a heat map.
1. After manual adjustment, put the crawled latitude and longitude data into `heatmapData.js`.
2. Create and display the heat map using the Amap API: initialize the map and configure the heat map plug-in through JavaScript, using the data defined in `heatmapData.js`.
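
One possible way to produce the contents of `heatmapData.js` is to serialize the points with Python's standard `json` module, which takes care of quoting. This is only a sketch: the helper `to_heatmap_js` and the sample coordinates are hypothetical, while the field names `lng`, `lat`, `count` and the variable name `heatmapData` follow the notebook.

```python
import json

def to_heatmap_js(records, var_name="heatmapData"):
    """Render records with lng, lat and composite_score as a JavaScript variable."""
    points = [
        {
            "lng": float(r["lng"]),
            "lat": float(r["lat"]),
            # Scale to 0-100 first, then round to the nearest integer
            "count": int(round(float(r["composite_score"]) * 100)),
        }
        for r in records
    ]
    return f"var {var_name} = {json.dumps(points)};"

# Hypothetical coordinates and score, for illustration only
sample = [{"lng": 113.54, "lat": 22.19, "composite_score": 0.29}]
print(to_heatmap_js(sample))
```

The returned string can be written to the `.js` file as-is, and the map page can then read the `heatmapData` variable.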

In [6]:
df_attraction = pd.read_excel('../data/processed/manually_processed_attractions_information/attractions_information_before_heatmap.xlsx')

heatmap_data = []

# Traverse each row of the DataFrame,
# convert the data in each row to the required format,
# and add it to the list
for index, row in df_attraction.iterrows():
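    # Scale to 0-100 before rounding; rounding first and then multiplying
    # can be truncated by int() because of floating-point error (e.g. 0.29 -> 28)
    count = int(round(row['composite_score'] * 100))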
    heatmap_data.append({
        "lng": row['lng'],
        "lat": row['lat'],
        "count": count
    })

# Open the heatmapData.js file in write mode and write the converted data
with open('../data/map/js/heatmapData.js', 'w') as f:
    f.write('var heatmapData = ')
    f.write(str(heatmap_data).replace("'", '"'))
    f.write(';')

df_attraction.to_excel(attraction_file_path, index=False)

**I have deployed the HTML page on GitHub; click to view: [Heatmap](https://stevencetanke.github.io/map/html/heatmap.html)**

# Comment Data Preprocessing
In the previous step, we screened out the *49* most popular attractions. Next, we collected the review data of these *49 attractions* on Ctrip with a crawler and obtained *20,907 pieces of data*. We then use Python to deduplicate the data and to remove invalid comments, special symbols, and emoticons from the comments. To make later model processing easier, the traditional Chinese characters in the comments are converted into simplified Chinese. Finally, Python's jieba package is used for Chinese word segmentation and stop word removal, in preparation for subsequent model building.

## Preparation
For clarity, I wrapped the main preprocessing operations in functions. The preparation work therefore consists of importing the required libraries and defining some helper functions.

In [7]:
import os
import re
import jieba
from opencc import OpenCC


def clean_comment(text):
    """
    Clean up special symbols and emoticons in comments.
    """
    text = re.sub(r'[^\w\s]', '', text)
    # Remove numbers
    text = re.sub(r'\d+', '', text)
    return text


def remove_invalid_comments(df):
    """
    Remove invalid comments.
    """
    # Remove empty comments or comments containing only spaces
    df = df[df['commentDetail'].str.strip().astype(bool)]
    # Remove comments of 5 characters or fewer
    df = df[df['commentDetail'].str.len() > 5]
    return df


def load_stopwords(stopwords_path):
    """
    Load stop word list
    """
    with open(stopwords_path, 'r', encoding='utf-8') as f:
        stopwords = {line.strip() for line in f}
    return stopwords


def convert_T_to_S(df):
    """
    Convert traditional Chinese characters in the 'commentDetail' column
        to simplified characters
    """
    # Create OpenCC objects for Traditional and Simplified Chinese conversion
    cc = OpenCC('t2s')  # t2s means converting from Traditional Chinese to Simplified

    # Convert traditional characters to simplified characters
    df['commentDetail'] = df['commentDetail'].apply(lambda x: cc.convert(x))
    return df


def process_comment(comment, stopwords):
    """
    Segment words and remove stop words
    """
    words = jieba.cut(comment, cut_all=False)
    filtered_words = [word for word in words if word not in stopwords]
    processed_comment = ' '.join(filtered_words)
    
    return processed_comment


## Data cleaning and preprocessing
Deduplicate the data, remove invalid comments, and strip special symbols and emoticons from the comment data. Traditional Chinese characters in the comments are also converted to simplified Chinese.
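
To see what the symbol-cleaning step does to a single comment, here is a small self-contained sketch that uses the same two regular expressions as `clean_comment`. The function name `strip_symbols` and the sample comment are made up for illustration.

```python
import re

def strip_symbols(text):
    """Drop punctuation, emoji and digits; keep letters, Chinese characters and whitespace."""
    # For str patterns in Python 3, \w is Unicode-aware and matches Chinese characters
    text = re.sub(r'[^\w\s]', '', text)
    # \d+ removes runs of digits
    return re.sub(r'\d+', '', text)

# Hypothetical comment containing punctuation, an emoji and a year
raw = "大三巴牌坊真的很壮观!!! 👍 2024年来的"
print(strip_symbols(raw))
```

Whitespace is kept, so the spaces that surrounded a removed emoji remain in the text. The later word segmentation step splits on them anyway.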

In [8]:
# Read all Excel files in comments folder
comment_raw_folder_path = '../data/raw/attractions_reviews'
comment_processed_file_path = '../data/processed/attractions_reviews.xlsx'

files = [f for f in os.listdir(comment_raw_folder_path) if f.endswith('.xlsx')]

# Create an empty DataFrame to store the merged data
df_comments = pd.DataFrame(columns=['Attraction', 'comment'])

# Process files containing reviews of individual attractions separately
for file in files:
    file_path = os.path.join(comment_raw_folder_path, file)
    df_comment = pd.read_excel(file_path, usecols=['commentDetail'])

    # Data preprocessing
    # Remove duplicates
    df_comment.drop_duplicates(subset=['commentDetail'], inplace=True)
    # Remove invalid comments
    df_comment = remove_invalid_comments(df_comment)
    # Clear special symbols and emoticons
    df_comment['commentDetail'] = df_comment['commentDetail'].apply(clean_comment)
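    # Convert traditional Chinese characters to simplified Chinese
    df_comment = convert_T_to_S(df_comment)

    # File names are expected to end with "评论.xlsx" ("评论" means "comments")
    attraction_name = file.replace("评论.xlsx", "")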
    merged_comments = df_comment['commentDetail'].str.cat(sep=' ')
    df_comments = pd.concat([df_comments, pd.DataFrame({'Attraction': [attraction_name], 'comment': [merged_comments]})], 
                        ignore_index=True)

## Text segmentation and stop word processing
Use Python's jieba package for Chinese word segmentation and stop word removal, in preparation for subsequent model building.

In [9]:
# Path to the stop word list
stopwords_path = '../data/tool/stopwords.txt'
stopwords = load_stopwords(stopwords_path)

df_comments['comment'] = df_comments['comment'].apply(lambda x: process_comment(x, stopwords))

df_comments.to_excel(comment_processed_file_path, index=False)

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\16247\AppData\Local\Temp\jieba.cache
Loading model cost 0.287 seconds.
Prefix dict has been built successfully.


Display one processed example: '大三巴牌坊'.

In [10]:
print("The processed comments about '大三巴牌坊':\n")

filtered_comments = df_comments[df_comments['Attraction'] == '大三巴牌坊'][['comment']]
print(filtered_comments.to_string(index=False))

The processed comments about '大三巴牌坊':

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 

**Data preprocessing ends.**