# Task 2: Recommendation Engine - Skeleton Notebook

This notebook provides a very basic example of the notebook you are expected to submit for Task 2 of the Final Project. Its main purpose is to let us try different examples to get a better sense of your approach. Unlike Task 1 (Kaggle Competition), we have no objective means to evaluate the recommendations.

Some general comments:
* You can import any data you need. This particularly includes your cleaned version of the properties dataset (incl. the auxiliary data or any other data you might have collected); there's no need to show the data cleaning / preprocessing steps in this notebook.
* You can also import your code in the form of an external Python (.py) script. You're actually encouraged to do so to keep this notebook light and uncluttered.
* **Important:** Please consider this notebook an example; it does not set specific requirements. Your notebook is likely to look very different. As long as there is a section where we can easily test your solution, it should be fine.

## Setting up the Notebook

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from scipy.cluster.hierarchy import linkage,dendrogram,cut_tree
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from pathlib import Path
from efficient_apriori import apriori




## Load the Data

For this example, we use the preprocessed training data (about 20k listings, judging by the row index in the output below), each with a subset of the features.

In [3]:
df_train = pd.read_csv("../../../clean_data/train_preproc.csv", index_col=0)
df_train

Unnamed: 0,property_type,built_year,num_beds,num_baths,size_sqft,total_num_units,lat,lng,subzone,planning_area,...,mean_property_sqft,mean_planning_sqft,planning_area_mean,total_rooms,size_per_room,mean_property_type,distance,num_shopping_malls,tenure_99-year leasehold,tenure_freehold
0,hdb,1988,3.0,2.0,1115,116,1.414399,103.837196,yishun south,yishun,...,1079.757868,1231.474343,1.143726e+06,5.0,223.00,6.620452e+05,0.573567,2.0,1,0
1,hdb,1992,4.0,2.0,1575,375,1.372597,103.875625,serangoon north,serangoon,...,1079.757868,2514.468039,3.670975e+06,6.0,262.50,6.620452e+05,1.728895,3.0,1,0
2,condo,2022,4.0,6.0,3070,56,1.298773,103.895798,mountbatten,marine parade,...,1154.798804,2011.903265,4.159877e+06,10.0,307.00,2.919816e+06,1.315256,5.0,0,1
3,condo,2023,3.0,2.0,958,638,1.312364,103.803271,farrer court,bukit timah,...,1154.798804,2468.346271,5.576084e+06,5.0,191.60,2.919816e+06,0.723885,4.0,0,1
4,condo,2026,2.0,1.0,732,351,1.273959,103.843635,anson,downtown core,...,1154.798804,1590.161473,4.853464e+06,3.0,244.00,2.919816e+06,0.370022,16.0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20091,condo,2026,2.0,2.0,635,605,1.385938,103.834466,tagore,ang mo kio,...,1154.798804,1634.952055,2.355193e+06,4.0,158.75,2.919816e+06,0.150007,4.0,1,0
20092,condo,2026,2.0,2.0,883,137,1.315948,103.857589,lavender,kallang,...,1154.798804,1106.259542,2.256485e+06,4.0,220.75,2.919816e+06,0.442631,6.0,0,1
20093,condo,2023,4.0,4.0,1378,340,1.315961,103.836848,moulmein,novena,...,1154.798804,1662.211716,3.637860e+06,8.0,172.25,2.919816e+06,0.422131,5.0,0,1
20094,hdb,2017,3.0,2.0,1205,402,1.440753,103.806671,woodlands east,woodlands,...,1079.757868,1229.137405,8.223176e+05,5.0,241.00,6.620452e+05,0.632423,8.0,1,0


In [7]:
df_test = pd.read_csv("../../../clean_data/test_preproc.csv", index_col=0)
# df_test

## Computing the Top Recommendations

The method `get_top_recommendations()` shows an example of how to get the top recommendations for a given data sample (a row of the dataframe).

The following steps are used to create recommendations:
1. Mine association rules (ARM) from the training data.
2. Pick one test listing as the input (the listing the user is viewing).
3. Keep the rules whose antecedent contains an attribute of the input listing.
4. Recommend the training listings that satisfy the retained rules.
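
The four steps above can be sketched end to end on a toy example. This is only an illustration under stated assumptions: the transactions are made up, the item names such as `size_bin_3` are hypothetical, and a brute-force search over item pairs stands in for the Apriori algorithm used in the notebook.

```python
from itertools import combinations

# Toy "transactions": one set of categorical items per training listing (made-up data).
train = {
    0: {"num_beds_3", "num_baths_2", "size_bin_3"},
    1: {"num_beds_3", "num_baths_2", "size_bin_3"},
    2: {"num_beds_2", "num_baths_2", "size_bin_2"},
    3: {"num_beds_4", "num_baths_4", "size_bin_3"},
}

def support(itemset):
    """Fraction of training listings that contain every item in `itemset`."""
    return sum(itemset <= items for items in train.values()) / len(train)

def mine_pair_rules(min_support=0.5, min_confidence=0.6):
    """Step 1: brute-force rules {a} -> {b} (a stand-in for Apriori)."""
    all_items = sorted(set().union(*train.values()))
    rules = []
    for a, b in combinations(all_items, 2):
        for lhs, rhs in ((a, b), (b, a)):
            supp = support({lhs, rhs})
            if supp >= min_support and supp / support({lhs}) >= min_confidence:
                rules.append((lhs, rhs))
    return rules

rules = mine_pair_rules()

# Step 2: the listing the user is currently viewing.
viewed = {"num_beds_3", "num_baths_2", "size_bin_2"}

# Step 3: keep rules whose antecedent appears in the viewed listing.
relevant = [(lhs, rhs) for lhs, rhs in rules if lhs in viewed]

# Step 4: recommend training listings that contain both sides of a relevant rule.
recommended = sorted(
    idx for idx, items in train.items()
    if any({lhs, rhs} <= items for lhs, rhs in relevant)
)
print(relevant)
print(recommended)
```

Listings 0 and 1 are recommended because they contain both sides of at least one relevant rule; listing 2 shares items with the viewed listing but satisfies no rule completely.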



In [17]:
# Pick a row id of choice
row_id = 10
#row_id = 20
#row_id = 30
#row_id = 40
#row_id = 50

# Get the row from the dataframe (invalid row ids will throw an error)
row = df_test.iloc[row_id]
row_location = row[["lat", "lng"]]

# Just for printing it nicely, we create a new dataframe from this single row
test_data = pd.DataFrame([row])
test_data

Unnamed: 0,property_type,built_year,num_beds,num_baths,size_sqft,total_num_units,lat,lng,subzone,planning_area,...,mean_property_sqft,mean_planning_sqft,planning_area_mean,total_rooms,size_per_room,mean_property_type,distance,num_shopping_malls,tenure_99-year leasehold,tenure_freehold
10,semi-detached house,2009.0,5.0,5.0,5709,363.0,1.398441,103.822165,springleaf,yishun,...,5253.011811,1231.474343,1143726.0,10.0,570.9,7704932.0,0.488762,2.0,0,1


**Preprocessing of data for ARM**

In [9]:
feature_list = ['size_per_room', 'num_baths', 'num_beds', 'size_sqft', 'num_shopping_malls']

def preprocessingTrainData(feature_list, input_df):
    """Turn the numeric features into categorical item strings for Apriori."""
    # .copy() avoids pandas' SettingWithCopyWarning when adding columns below
    df = input_df[feature_list].copy()

    # Bin size_per_room into right-closed intervals of width 100:
    # (0, 100] -> 1, (100, 200] -> 2, ..., (2800, 2900] -> 29
    bins = list(range(0, 3000, 100))
    labels = list(range(1, 30))
    df['size_per_room_binned'] = pd.cut(df['size_per_room'], bins, labels=labels)

    # Prefix each value with its column name so items from different columns stay distinct
    for col in ['num_baths', 'num_beds', 'size_sqft', 'num_shopping_malls']:
        df[col + '_str'] = df[col].apply(lambda s, col=col: col + '_' + str(int(s)))
    df['size_per_room_binned_str'] = df['size_per_room_binned'].apply(
        lambda s: 'size_per_room_binned_str_' + str(int(s)))

    return df[['size_per_room_binned_str', 'num_baths_str', 'num_beds_str',
               'size_sqft_str', 'num_shopping_malls_str']]
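
To make the binning step concrete, the sketch below mirrors in plain Python how a `size_per_room` value maps to its bin label. It assumes the default behaviour of `pd.cut` (right-closed intervals, lowest edge excluded); the helper `bin_label` is hypothetical and not part of the notebook.

```python
from bisect import bisect_left

# Same edges and labels as in the preprocessing function: 0, 100, ..., 2900 -> labels 1..29.
bins = list(range(0, 3000, 100))
labels = list(range(1, 30))

def bin_label(value):
    """Return the label of the right-closed interval (lo, hi] containing `value`,
    or None if the value lies outside (0, 2900]."""
    pos = bisect_left(bins, value)   # index of the first edge >= value
    if pos == 0 or pos == len(bins):
        return None                  # value <= 0 or value > 2900
    return labels[pos - 1]

for size_per_room in (223.0, 433.25, 570.9, 300.0, 3100.0):
    print(size_per_room, "->", bin_label(size_per_room))
```

The first two values reproduce the bins shown in the notebook output (223.0 falls in bin 3, 433.25 in bin 5). A value above 2900 has no bin; `pd.cut` returns `NaN` in that case, and the `int(s)` conversion in the preprocessing function would then likely fail, so the bin range should cover every listing that is passed in.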

In [10]:
feature_list = ['size_per_room','num_baths','num_beds','size_sqft', 'num_shopping_malls']
# size_per_room ranges from 52.7 to 2825.4 in the training data,
# so the bins (0 to 2900) cover every listing
df_train2 = preprocessingTrainData(feature_list, df_train)
df_train2

Unnamed: 0,size_per_room_binned_str,num_baths_str,num_beds_str,size_sqft_str,num_shopping_malls_str
0,size_per_room_binned_str_3,num_baths_2,num_beds_3,size_sqft_1115,num_shopping_malls_2
1,size_per_room_binned_str_3,num_baths_2,num_beds_4,size_sqft_1575,num_shopping_malls_3
2,size_per_room_binned_str_4,num_baths_6,num_beds_4,size_sqft_3070,num_shopping_malls_5
3,size_per_room_binned_str_2,num_baths_2,num_beds_3,size_sqft_958,num_shopping_malls_4
4,size_per_room_binned_str_3,num_baths_1,num_beds_2,size_sqft_732,num_shopping_malls_16
...,...,...,...,...,...
20091,size_per_room_binned_str_2,num_baths_2,num_beds_2,size_sqft_635,num_shopping_malls_4
20092,size_per_room_binned_str_3,num_baths_2,num_beds_2,size_sqft_883,num_shopping_malls_6
20093,size_per_room_binned_str_2,num_baths_4,num_beds_4,size_sqft_1378,num_shopping_malls_5
20094,size_per_room_binned_str_3,num_baths_2,num_beds_3,size_sqft_1205,num_shopping_malls_8


**Applying Apriori and finding patterns in listings**

In [11]:
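# One transaction per listing: the list of its five item strings
ap_input_trans = df_train2.values.tolist()
itemsets, rules = apriori(ap_input_trans, min_support=0.1, min_confidence=0.4, max_length=3)

# Show only rules with at least two antecedent items and one consequent, sorted by ascending lift
rules_rhs = filter(lambda rule: len(rule.lhs) >= 2 and len(rule.rhs) == 1, rules)
for rule in sorted(rules_rhs, key=lambda rule: rule.lift):
    print(rule)  # Prints the rule and its confidence, support, lift and conviction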

{num_baths_2, num_beds_3} -> {size_per_room_binned_str_3} (conf: 0.633, supp: 0.187, lift: 1.303, conv: 1.400)
{num_beds_3, size_per_room_binned_str_3} -> {num_baths_2} (conf: 0.812, supp: 0.187, lift: 1.616, conv: 2.643)
{num_baths_2, size_per_room_binned_str_3} -> {num_beds_3} (conf: 0.691, supp: 0.187, lift: 1.759, conv: 1.963)
{num_beds_2, size_per_room_binned_str_2} -> {num_baths_2} (conf: 0.899, supp: 0.104, lift: 1.791, conv: 4.950)
{num_baths_2, num_beds_2} -> {size_per_room_binned_str_2} (conf: 0.673, supp: 0.104, lift: 2.024, conv: 2.039)
{num_baths_2, size_per_room_binned_str_2} -> {num_beds_2} (conf: 0.497, supp: 0.104, lift: 2.277, conv: 1.553)
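
The metrics printed with each rule are related through the standard definitions lift = conf / supp(rhs) and conviction = (1 - supp(rhs)) / (1 - conf). As a sanity check, the sketch below recomputes the conviction of two printed rules from their rounded confidence and lift; the helper names are hypothetical, and small deviations are expected because the inputs are rounded to three decimals.

```python
def rhs_support(confidence, lift):
    """lift = confidence / supp(rhs)  =>  supp(rhs) = confidence / lift."""
    return confidence / lift

def conviction(confidence, lift):
    """conviction = (1 - supp(rhs)) / (1 - confidence)."""
    return (1 - rhs_support(confidence, lift)) / (1 - confidence)

# Values copied from two of the printed rules (rounded to three decimals).
printed = [
    ("{num_baths_2, num_beds_3} -> {size_per_room_binned_str_3}", 0.633, 1.303, 1.400),
    ("{num_beds_2, size_per_room_binned_str_2} -> {num_baths_2}", 0.899, 1.791, 4.950),
]
for rule, conf, lift, conv_printed in printed:
    print(rule)
    print(f"  supp(rhs) ~ {rhs_support(conf, lift):.3f}, "
          f"conviction ~ {conviction(conf, lift):.2f} (printed: {conv_printed})")
```

A lift above 1 means the consequent occurs more often among listings matching the antecedent than in the training data overall, which holds for every rule printed above.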


In [34]:
# Minimum lift; every rule printed in this notebook has lift > 1,
# so this value does not filter any of them
threshold = 0.2

def get_pref_feature_bases(user_preferences):
    """Return the mined rules whose antecedent contains one of the user's items.

    Uses the global `rules` mined above. A rule is appended once per matching
    item, so it can appear more than once in the result.
    """
    recommendation_to_user = []
    for rule in rules:
        for pref in user_preferences:
            if pref in rule.lhs and rule.lift > threshold:
                recommendation_to_user.append(rule)
    return recommendation_to_user

def filterTrain(rules_set, row_set):
    for rule in rules_set:
        # print(rule, row_set)
        if rule.issubset(row_set):
            return True
    return False

def filter_rules_based_on_user_viewing(final_rules, input_data, input_data_processed):
    """Return the rows of `input_data` whose items contain all items of at least one rule."""
    # Keep only rules with at least two antecedent items and a single consequent,
    # and represent each rule by the set of all its items (lhs and rhs)
    rules_set = [set(rule.lhs + rule.rhs) for rule in final_rules
                 if len(rule.lhs) >= 2 and len(rule.rhs) == 1]

    # Build the item set of every listing without modifying input_data_processed
    feature_cols = ['size_per_room_binned_str', 'num_baths_str', 'size_sqft_str',
                    'num_beds_str', 'num_shopping_malls_str']
    feature_sets = input_data_processed[feature_cols].apply(lambda row: set(row), axis=1)

    # A listing is recommended if it satisfies at least one rule
    matches = feature_sets[feature_sets.apply(lambda items: filterTrain(rules_set, items))]
    return input_data[input_data.index.isin(matches.index)]

def get_top_recommendations(user_preferences, **kwargs) -> pd.DataFrame:

    #####################################################
    ## Initialize the required parameters

    # k: the number of recommendations to return (None returns all matches)
    # input_data: the original listings; input_data_processed: their item strings
    k = kwargs.get('k')
    input_data = kwargs.get('input_data')
    input_data_processed = kwargs.get('input_data_processed')
    if input_data is None or input_data_processed is None:
        raise ValueError("input_data and input_data_processed are required")

    # Rules whose antecedent shares an item with the viewed listing
    possible_rules_on_user_viewing = get_pref_feature_bases(user_preferences)
    for final_rules in possible_rules_on_user_viewing:
        print(final_rules)

    # Listings that satisfy at least one of those rules
    recommendation_to_user = filter_rules_based_on_user_viewing(
        possible_rules_on_user_viewing, input_data, input_data_processed)
    return recommendation_to_user if k is None else recommendation_to_user.head(k)
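
Note that the matching listings come back in their original index order, so taking the first `k` rows does not rank them. One possible way to order the candidates, shown here only as a sketch and not as the approach used in this notebook, is to score each listing by the highest lift among the rules it satisfies. The rules and listings below are hypothetical stand-ins written as plain sets; in the notebook they would come from `rule.lhs + rule.rhs`, `rule.lift` and the processed dataframe.

```python
# Hypothetical stand-ins for mined rules: (itemset, lift).
rules = [
    ({"num_baths_2", "num_beds_2", "size_bin_2"}, 2.277),
    ({"num_baths_2", "num_beds_3", "size_bin_3"}, 1.759),
]

# Hypothetical candidate listings: index -> set of item strings.
candidates = {
    0: {"num_baths_2", "num_beds_3", "size_bin_3", "malls_2"},
    5: {"num_baths_2", "num_beds_2", "size_bin_2", "malls_2"},
    6: {"num_baths_2", "num_beds_3", "size_bin_3", "malls_6"},
    9: {"num_baths_1", "num_beds_1", "size_bin_4", "malls_1"},
}

def rank_candidates(candidates, rules, k):
    """Score each listing by the highest lift among the rules it satisfies,
    drop listings that satisfy none, and return the k best indices."""
    scored = []
    for idx, items in candidates.items():
        lifts = [lift for itemset, lift in rules if itemset <= items]
        if lifts:
            scored.append((max(lifts), idx))
    # Highest score first; ties broken by the smaller index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]

print(rank_candidates(candidates, rules, k=2))
```

With this scoring, listing 5 ranks ahead of listing 0 because it satisfies the rule with the higher lift, even though listing 0 comes first in index order. Other scores, such as the distance to the viewed listing, would fit into the same structure.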
        

## Testing the Recommendation Engine

This will be the main part of your notebook to allow for testing your solutions. Most basically, for a given listing (defined by the row id in your input dataframe), we would like to see the recommendations you make. So however you set up your notebook, it should have at least a comparable section that will allow us to run your solution for different inputs.

### Pick a Sample Listing as Input

In [13]:
# Pick a row id of choice
row_id = 10
#row_id = 20
#row_id = 30
#row_id = 40
#row_id = 50

# Get the row from the dataframe (invalid row ids will throw an error)
row = df_train.iloc[row_id]

# Just for printing it nicely, we create a new dataframe from this single row
test_data = pd.DataFrame([row])
test_data

Unnamed: 0,property_type,built_year,num_beds,num_baths,size_sqft,total_num_units,lat,lng,subzone,planning_area,...,mean_property_sqft,mean_planning_sqft,planning_area_mean,total_rooms,size_per_room,mean_property_type,distance,num_shopping_malls,tenure_99-year leasehold,tenure_freehold
10,condo,1985,2.0,2.0,1733,280,1.322153,103.945223,bedok south,bedok,...,1154.798804,2287.615299,3128573.0,4.0,433.25,2919816.0,0.570617,2.0,1,0


In [23]:
feature_list = ['size_per_room','num_baths','num_beds','size_sqft', 'num_shopping_malls']
df_test_processed = preprocessingTrainData(feature_list, test_data)
df_test_processed

Unnamed: 0,size_per_room_binned_str,num_baths_str,num_beds_str,size_sqft_str,num_shopping_malls_str
10,size_per_room_binned_str_5,num_baths_2,num_beds_2,size_sqft_1733,num_shopping_malls_2


### Compute and Display the Recommendations

Since the method `get_top_recommendations()` returns a `pd.DataFrame`, it's easy to display the result.

In [35]:
k = 3

df_recommendations = get_top_recommendations(df_test_processed.iloc[0], k=k, input_data=df_train, input_data_processed=df_train2)
df_recommendations.head(k)

{num_beds_2} -> {num_baths_2} (conf: 0.710, supp: 0.155, lift: 1.414, conv: 1.717)
{num_baths_2} -> {num_beds_3} (conf: 0.589, supp: 0.296, lift: 1.500, conv: 1.477)
{num_shopping_malls_2} -> {num_baths_2} (conf: 0.547, supp: 0.101, lift: 1.089, conv: 1.099)
{num_baths_2} -> {size_per_room_binned_str_2} (conf: 0.418, supp: 0.210, lift: 1.256, conv: 1.146)
{num_baths_2} -> {size_per_room_binned_str_3} (conf: 0.539, supp: 0.271, lift: 1.111, conv: 1.117)
{num_beds_2} -> {size_per_room_binned_str_2} (conf: 0.531, supp: 0.116, lift: 1.597, conv: 1.423)
{num_beds_2, size_per_room_binned_str_2} -> {num_baths_2} (conf: 0.899, supp: 0.104, lift: 1.791, conv: 4.950)
{num_baths_2, size_per_room_binned_str_2} -> {num_beds_2} (conf: 0.497, supp: 0.104, lift: 2.277, conv: 1.553)
{num_baths_2, num_beds_2} -> {size_per_room_binned_str_2} (conf: 0.673, supp: 0.104, lift: 2.024, conv: 2.039)
{num_baths_2, num_beds_2} -> {size_per_room_binned_str_2} (conf: 0.673, supp: 0.104, lift: 2.024, conv: 2.039)
...

Unnamed: 0,property_type,built_year,num_beds,num_baths,size_sqft,total_num_units,lat,lng,subzone,planning_area,...,mean_property_sqft,mean_planning_sqft,planning_area_mean,total_rooms,size_per_room,mean_property_type,distance,num_shopping_malls,tenure_99-year leasehold,tenure_freehold
0,hdb,1988,3.0,2.0,1115,116,1.414399,103.837196,yishun south,yishun,...,1079.757868,1231.474343,1143726.0,5.0,223.0,662045.2,0.573567,2.0,1,0
5,condo,2024,2.0,2.0,689,633,1.339338,103.763893,bukit batok south,bukit batok,...,1154.798804,1273.03169,1612262.0,4.0,172.25,2919816.0,1.330448,2.0,0,1
6,condo,2026,3.0,2.0,1076,407,1.31064,103.852149,kampong java,kallang,...,1154.798804,1106.259542,2256485.0,5.0,215.2,2919816.0,0.400608,6.0,1,0
