# Travel Reviews Rating Dataset

- __Abstract:__ Google reviews on attractions from 24 categories across Europe are considered. Google user ratings range from 1 to 5, and the average user rating per category is calculated.

- __Objective:__ Apply affinity analysis and the Apriori algorithm to recommend categories to users.

- __Problem:__ Transform a sparse matrix in which zeroes represent categories that were not reviewed by the user; maybe the user didn't pay attention to a category or doesn't like it, so no review is available. Using the Apriori algorithm and its variations, we can recommend categories that the user may like.


## Association Rule Mining

__IF-THEN relationship:__

If category A is favored by the customer, the rule estimates the chance that category B is also favored by the same customer, under the same transaction ID.
$$ \underset{\textbf{if}}{A} \Rightarrow \underset{\textbf{then}}{B} $$

__Antecedent (If):__ This is a group of items that are typically found together in the dataset.

__Consequent (Then):__ This is an item that comes along with an antecedent or a group of antecedents.


#### Ways To Measure Association
- __Support__
- __Confidence__
- __Lift__

#### A. Support
Fraction of transactions that contain both items A and B, where $f(A,B)$ is the number of such transactions and $N$ is the total number of transactions.
$$ Support = \frac{f(A,B)}{N}$$ 
 
__Note:__
   - Helps filter out the items with low frequency.
    

#### B. Confidence
Measures how often items A and B occur together, given the number of times A occurs.
$$ Confidence = \frac{f(A,B)}{f(A)}$$

#### C. Lift
Measures how much more often items A and B occur together than would be expected if A and B were independent.
$$Lift = \frac{supp(A,B)}{supp(A) \cdot supp(B)}$$

__Note:__
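 - The higher the lift, the stronger the association. A lift above 1 means A and B occur together more often than expected under independence.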
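
The three measures can be checked on a handful of transactions. The snippet below is a minimal sketch in plain Python; the toy transactions and the helper names `support`, `confidence` and `lift` are made up for illustration and are not part of the notebook's `Apriori.py`.

```python
# Toy transactions: each set holds the categories one user rated favorably.
# The data is invented purely for illustration.
transactions = [
    {"A", "B"},
    {"A", "B", "C"},
    {"A", "C"},
    {"B"},
    {"A", "B"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """supp(A, B) / supp(A)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def lift(antecedent, consequent, transactions):
    """supp(A, B) / (supp(A) * supp(B))."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / (
        support(antecedent, transactions) * support(consequent, transactions)
    )


supp_ab = support({"A", "B"}, transactions)
conf_ab = confidence({"A"}, {"B"}, transactions)
lift_ab = lift({"A"}, {"B"}, transactions)
print(supp_ab, conf_ab, lift_ab)
```

With these toy sets, A and B appear together in 3 of 5 transactions and each appears in 4 of 5, so the support is 3/5, the confidence is 3/4, and the lift is 0.6 / 0.64, slightly below 1.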

## Preprocessing Data

In [2]:
# Load Data
data_path='/home/fayssal/Desktop/DataScience/Machine-Learning/data/travel-review-rating/google_review_ratings.csv'
ratings = pd.read_csv(data_path)

ratings.head()

Unnamed: 0,User,Category 1,Category 2,Category 3,Category 4,Category 5,Category 6,Category 7,Category 8,Category 9,...,Category 16,Category 17,Category 18,Category 19,Category 20,Category 21,Category 22,Category 23,Category 24,Unnamed: 25
0,User 1,0.0,0.0,3.63,3.65,5.0,2.92,5.0,2.35,2.33,...,0.59,0.5,0.0,0.5,0.0,0.0,0.0,0.0,0.0,
1,User 2,0.0,0.0,3.63,3.65,5.0,2.92,5.0,2.64,2.33,...,0.59,0.5,0.0,0.5,0.0,0.0,0.0,0.0,0.0,
2,User 3,0.0,0.0,3.63,3.63,5.0,2.92,5.0,2.64,2.33,...,0.59,0.5,0.0,0.5,0.0,0.0,0.0,0.0,0.0,
3,User 4,0.0,0.5,3.63,3.63,5.0,2.92,5.0,2.35,2.33,...,0.59,0.5,0.0,0.5,0.0,0.0,0.0,0.0,0.0,
4,User 5,0.0,0.0,3.63,3.63,5.0,2.92,5.0,2.64,2.33,...,0.59,0.5,0.0,0.5,0.0,0.0,0.0,0.0,0.0,


__Categories with a value of 0 are considered not rated by the user, for an unknown reason. Sparse data takes up a significant amount of memory, because many cells do not contain any information about user ratings.__
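
One way to quantify the sparsity is to measure the share of rating cells equal to 0. The sketch below uses a small hypothetical frame (`toy`, with invented values) shaped like `ratings`; on the real data the same expression would be applied to the category columns.

```python
import pandas as pd

# Hypothetical miniature of the ratings table (values invented for illustration).
toy = pd.DataFrame(
    {
        "User": ["User 1", "User 2", "User 3"],
        "Category 1": [0.0, 0.0, 1.5],
        "Category 2": [3.6, 0.0, 0.0],
        "Category 3": [5.0, 2.9, 0.0],
    }
)

# Keep only the rating columns, then take the share of cells equal to the 0 marker.
rating_cols = toy.drop(columns="User")
zero_fraction = (rating_cols == 0).to_numpy().mean()
print(f"{zero_fraction:.2%} of rating cells are zero")
```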

## Checking Review Values

In [3]:
ratings.describe()

Unnamed: 0,Category 1,Category 2,Category 3,Category 4,Category 5,Category 6,Category 7,Category 8,Category 9,Category 10,...,Category 16,Category 17,Category 18,Category 19,Category 20,Category 21,Category 22,Category 23,Category 24,Unnamed: 25
count,5456.0,5456.0,5456.0,5456.0,5456.0,5456.0,5456.0,5456.0,5456.0,5456.0,...,5456.0,5456.0,5456.0,5456.0,5456.0,5456.0,5456.0,5456.0,5455.0,2.0
mean,1.45572,2.319707,2.489331,2.796886,2.958941,2.89349,3.351395,2.540795,3.126019,2.832729,...,1.192801,0.949203,0.822414,0.969811,1.000071,0.965838,1.750537,1.531453,1.560755,1.81
std,0.827604,1.421438,1.247815,1.309159,1.339056,1.2824,1.413492,1.111391,1.356802,1.307665,...,1.107005,0.973536,0.947911,1.203972,1.193891,0.929853,1.598734,1.316889,1.171756,1.088944
min,0.0,0.0,0.0,0.83,1.12,1.11,1.12,0.86,0.84,0.81,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.04
25%,0.92,1.36,1.54,1.73,1.77,1.79,1.93,1.62,1.8,1.64,...,0.69,0.58,0.53,0.52,0.54,0.57,0.74,0.79,0.88,1.425
50%,1.34,1.905,2.06,2.46,2.67,2.68,3.23,2.17,2.8,2.68,...,0.8,0.74,0.69,0.69,0.69,0.76,1.03,1.07,1.29,1.81
75%,1.81,2.6825,2.74,4.0925,4.3125,3.84,5.0,3.19,5.0,3.53,...,1.16,0.91,0.84,0.86,0.86,1.0,2.07,1.56,1.66,2.195
max,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,...,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,2.58


__All values, from the minimum to the maximum of each category, lie between 0 and 5, which is expected: no rating is negative or above 5, and 0 marks a category the user did not rate.__

In [4]:
# Missing Values ?
ratings.count()

User           5456
Category 1     5456
Category 2     5456
Category 3     5456
Category 4     5456
Category 5     5456
Category 6     5456
Category 7     5456
Category 8     5456
Category 9     5456
Category 10    5456
Category 11    5456
Category 12    5455
Category 13    5456
Category 14    5456
Category 15    5456
Category 16    5456
Category 17    5456
Category 18    5456
Category 19    5456
Category 20    5456
Category 21    5456
Category 22    5456
Category 23    5456
Category 24    5455
Unnamed: 25       2
dtype: int64

In [5]:
# Are all rows empty in the extra column?
# If so, the column is useless
ratings[~ratings['Unnamed: 25'].isna() ]

Unnamed: 0,User,Category 1,Category 2,Category 3,Category 4,Category 5,Category 6,Category 7,Category 8,Category 9,...,Category 16,Category 17,Category 18,Category 19,Category 20,Category 21,Category 22,Category 23,Category 24,Unnamed: 25
1347,User 1348,1.06,1.1,5.0,3.28,5.0,5.0,5.0,1.83,1.81,...,1.8,0.0,0.0,0.0,0.0,0.0,5.0,0.26,,1.04
2712,User 2713,1.71,1.68,1.46,1.13,1.12,1.15,1.26,1.17,1.59,...,1.08,1.1,1.04,5.0,4.43,5.0,5.0,5.0,2.57,2.58


__Column 25 will be dropped: only two users have a value in it, and the data description mentions that the file only contains reviews about 24 categories. Next we check the cell types; all reviews must be floats.__

In [6]:
# Drop Column 25
ratings = ratings.drop(columns='Unnamed: 25')

In [7]:
# Attributes types
ratings.dtypes

User            object
Category 1     float64
Category 2     float64
Category 3     float64
Category 4     float64
Category 5     float64
Category 6     float64
Category 7     float64
Category 8     float64
Category 9     float64
Category 10    float64
Category 11     object
Category 12    float64
Category 13    float64
Category 14    float64
Category 15    float64
Category 16    float64
Category 17    float64
Category 18    float64
Category 19    float64
Category 20    float64
Category 21    float64
Category 22    float64
Category 23    float64
Category 24    float64
dtype: object
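
Category 11 is reported as `object` rather than `float64`, which suggests that at least one of its entries is not a plain number. The notebook handles this later with a `try`/`except` around `float(rate)`. As an alternative sketch (not what the notebook does), `pd.to_numeric` with `errors="coerce"` converts such a column and exposes the offending rows. The series `col` below is hypothetical and its values are invented.

```python
import pandas as pd

# Hypothetical column with one malformed entry (invented for illustration).
col = pd.Series(["2.35", "1.8", "bad value", "0.5"], name="Category 11")

# errors="coerce" turns unparseable entries into NaN instead of raising.
as_float = pd.to_numeric(col, errors="coerce")

# Rows whose original text could not be parsed as a number.
bad_rows = col[as_float.isna()]
print(as_float.dtype)
print(bad_rows)
```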

__Now we have a sparse matrix, ready to be transformed into a new dataframe (just for fun) where each user is mapped only to the categories they reviewed. This reshapes the data by reducing the number of attributes to just three fields: user ID, category ID, and user rating.__

In [26]:
%%writefile -a Apriori.py
import math

import pandas as pd


class TransactionPreprocessor:
    """
    There are three types of dataset format; this class handles the
    sparse format.
    
    Output:
    ------
    DataFrame representing Transaction/User Id, Item Id, Rate or Review
    
    """
    
    def __init__(self):
        pass
    
    def sparse_to_compact(self, df, na_values=0): 
        columns = ['UID', 'TID', 'rate']
        populate = {k: [] for k in columns}
        
        for _, row in df.iterrows():
            # First field is the user label (e.g. "User 12"), the rest are ratings
            user_id, *rates = row
            for cat_idx, rate in enumerate(rates):
                try:
                    rate = float(rate)
                except (TypeError, ValueError):
                    # Skip cells that cannot be parsed as a number
                    continue
                # Skip missing values (NaN) and the "not rated" marker
                if math.isnan(rate) or rate == na_values:
                    continue
                populate['UID'].append(int(user_id.split()[-1]))
                populate['TID'].append(cat_idx + 1)
                populate['rate'].append(rate)
            
        return pd.DataFrame(populate)

Writing Apriori.py


In [13]:
# Preprocess Data
google_ratings = TransactionPreprocessor()
google_rates = google_ratings.sparse_to_compact(ratings)
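
The same wide-to-long reshaping can also be expressed without an explicit Python loop. The following is an alternative sketch using `DataFrame.melt`, not the notebook's method; the frame `wide` and the names `long_df` and `compact` are hypothetical, and the category ID is parsed from the column label rather than taken from the column position (the two agree when columns are in order).

```python
import pandas as pd

# Hypothetical miniature of the wide table (values invented for illustration).
wide = pd.DataFrame(
    {
        "User": ["User 1", "User 2"],
        "Category 1": [0.0, 1.2],
        "Category 2": [3.6, 0.0],
        "Category 3": [5.0, 2.9],
    }
)

# One row per (user, category) pair.
long_df = wide.melt(id_vars="User", var_name="category", value_name="rate")

# Parse the numeric IDs out of labels such as "User 12" and "Category 3".
long_df["UID"] = long_df["User"].str.split().str[-1].astype(int)
long_df["TID"] = long_df["category"].str.split().str[-1].astype(int)

# Coerce to float, then drop missing values and the 0 "not rated" marker.
long_df["rate"] = pd.to_numeric(long_df["rate"], errors="coerce")
keep = long_df["rate"].notna() & (long_df["rate"] != 0)
compact = (
    long_df.loc[keep, ["UID", "TID", "rate"]]
    .sort_values(["UID", "TID"])
    .reset_index(drop=True)
)
print(compact)
```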

## Select Favorable Set

Assume that favorable categories are all categories with a rating above a threshold given to the program. At first, we consider every category rated above 3 to be favorable for user X.

In [27]:
%%writefile -a Apriori.py

google_rates['favorable'] = google_rates['rate'] > 3

Appending to Apriori.py


In [28]:
%%writefile -a Apriori.py

# Sample the dataset to form a training set and reduce the search space.
# UIDs start at 1, so range(200) keeps users 1 to 199.
# All reviews of a selected user stay in the same training set.
train_df = google_rates[google_rates['UID'].isin(range(200))]

# Select only favorable reviews
favorable_ratings_by_user = train_df[train_df['favorable']]

Appending to Apriori.py


In [16]:
favorable_ratings_by_user.head()

Unnamed: 0,UID,TID,rate,favorable
0,1,3,3.63,True
1,1,4,3.65,True
2,1,5,5.0,True
4,1,7,5.0,True
16,2,3,3.63,True
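
For affinity analysis it is usually convenient to look at each user as a set of favorable categories. The sketch below shows one possible way to build that mapping; the frame `favorable` is a hypothetical stand-in for `favorable_ratings_by_user` with invented rows, and the name `favorable_by_user` is an assumption.

```python
import pandas as pd

# Hypothetical favorable ratings in the compact UID/TID/rate layout
# (rows invented for illustration).
favorable = pd.DataFrame(
    {
        "UID": [1, 1, 1, 2, 2],
        "TID": [3, 4, 5, 3, 7],
        "rate": [3.63, 3.65, 5.0, 3.63, 5.0],
    }
)

# Map each user ID to the frozenset of categories they rated favorably.
favorable_by_user = {
    int(uid): frozenset(group["TID"].tolist())
    for uid, group in favorable.groupby("UID")
}
print(favorable_by_user)
```

Frozensets are hashable, so they can later be used as dictionary keys or compared with candidate itemsets using subset tests.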


__Be careful: certain categories have no reviews in this training set (take a bigger sample or resample).__
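
A quick check for categories that are absent from a sample could look like the sketch below. The frame `sample` is a hypothetical stand-in for `train_df` with invented rows; the count of 24 categories comes from the dataset description.

```python
import pandas as pd

N_CATEGORIES = 24  # the dataset description mentions 24 categories

# Hypothetical training sample in the compact layout (rows invented).
sample = pd.DataFrame({"UID": [1, 1, 2, 3], "TID": [3, 5, 3, 24]})

# Categories that never appear in the sample cannot show up in any rule.
seen = set(sample["TID"].tolist())
missing = sorted(set(range(1, N_CATEGORIES + 1)) - seen)
print(f"{len(missing)} categories have no rows in the sample: {missing}")
```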