### Model: Generalized Significant Terms

Here, we model the recommendations accounting for:
1. Multiple signals (`visit`, `share`, `purchase`, `like`)
2. Signal interaction counts

We use a custom scoring function to do this with the significant terms aggregation.
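
As a rough illustration of what such a custom scoring function could look like, the sketch below builds a `significant_terms` request body that uses the `script_heuristic` option. The script only reads the frequency values Elasticsearch exposes to it (`_subset_freq`, `_superset_freq`, `_subset_size`, `_superset_size`). The formula and the helper name `build_scripted_sig_terms` are illustrative assumptions, not necessarily the scoring used in this notebook.

```python
import json

def build_scripted_sig_terms(past_items, field="item_id.keyword"):
    """Build a search body whose significant_terms agg is scored by a script.

    Hypothetical helper: the scoring formula below is only an example.
    """
    return {
        "size": 0,
        # the query defines the foreground set for the aggregation
        "query": {"terms": {field: past_items}},
        "aggs": {
            "sig_terms": {
                "significant_terms": {
                    "field": field,
                    "min_doc_count": 1,
                    "script_heuristic": {
                        "script": {
                            "lang": "painless",
                            # foreground count divided by
                            # (count outside the foreground + 1)
                            "source": "params._subset_freq/(params._superset_freq - params._subset_freq + 1)"
                        }
                    }
                }
            }
        }
    }

payload = build_scripted_sig_terms(["a", "c"])
print(json.dumps(payload, indent=2))
```

The payload is only built and printed here; it would be sent to the index's `/_search` endpoint like the other queries in this notebook.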

#### Assumptions / Constraints

1. We're not accounting for interaction timestamps. (Post-POC, this could be handled when upserting a document, by rolling up only the counts that fall within the timeframe under consideration.)
2. We will run one query per interaction type (four currently) to recommend the next items for each of the four signals.
3. We can't pass additional parameters to the scoring script in aggregations (e.g. I would have liked to pass signal weights and values).
4. We have to pass the interactions for which we want the next set of recommendations in the query, as a foreground filter. Since term queries aren't scored, the foreground filter can be:  
  A. Permissive, requiring at least one match (the current `should` condition),  
  B. Too restrictive, requiring every item to match (a `must` clause),  
  C. Somewhere in between, with a threshold that still needs to be defined (a `terms_set` query with a `minimum_should_match` clause).  
 Combined with 3: even if the query could be scored, the aggregations don't allow passing custom parameters while scoring, so I think we need to look at something beyond generalized significant terms to do this.
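
Constraints 2 and 4 can be pictured as one request body per signal, each with its own foreground filter. This is a minimal sketch, assuming each signal is stored as a numeric count field on the document and that the caller already knows the user's past items per signal. `build_signal_queries` and the shape of its argument are hypothetical.

```python
SIGNALS = ("visit", "share", "purchase", "like")

def build_signal_queries(past_items_by_signal, field="item_id.keyword"):
    """Return one search body per signal that has past items (hypothetical helper)."""
    queries = {}
    for signal in SIGNALS:
        past_items = past_items_by_signal.get(signal, [])
        if not past_items:
            continue  # nothing to use as a foreground set for this signal
        queries[signal] = {
            "size": 0,
            "query": {
                "bool": {
                    # option A from the list above: any one item may match
                    "should": [{"terms": {field: past_items}}],
                    "minimum_should_match": 1,
                    # only consider documents where this signal occurred
                    "filter": [{"range": {signal: {"gte": 1}}}],
                }
            },
            "aggs": {
                "sig_terms": {
                    "significant_terms": {"field": field, "min_doc_count": 1}
                }
            },
        }
    return queries

queries = build_signal_queries({"visit": ["a", "c"], "like": ["a"]})
print(sorted(queries))  # ['like', 'visit']
```

Signals with no past items are skipped, so the number of requests actually sent can be lower than four.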
  


#### Data Model Change

One change we will make here is to use a normalized data model. Instead of storing all the item interactions for a user in a single doc, a doc will be unique for a given (user, item). This also streamlines the process of updating the doc.

```json
{
    "_id": "1~foo", // combining user_id and item_id
    "user_id": "1",
    "item_id": "foo",
    "visit": 3,
    "like": 1
}
```
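
To illustrate why the per-(user, item) document is easy to update, here is a sketch of a scripted upsert body for the Elasticsearch `_update` API: it increments a signal's count if the document exists and creates the document otherwise. The helper names (`doc_id`, `build_upsert`) are hypothetical, and the request is only built here, not sent.

```python
import json

def doc_id(user_id, item_id):
    # same "<user_id>~<item_id>" convention as the example document above
    return f"{user_id}~{item_id}"

def build_upsert(user_id, item_id, signal, count=1):
    """Build an _update body that adds `count` to one signal (hypothetical helper)."""
    return {
        "script": {
            "lang": "painless",
            # start from 0 when the doc exists but has never seen this signal
            "source": (
                "if (ctx._source[params.signal] == null) "
                "{ ctx._source[params.signal] = 0; } "
                "ctx._source[params.signal] += params.count;"
            ),
            "params": {"signal": signal, "count": count},
        },
        # indexed as-is when the "<user_id>~<item_id>" document does not exist yet
        "upsert": {"user_id": user_id, "item_id": item_id, signal: count},
    }

body = build_upsert("1", "foo", "visit")
print(f"POST <index>/_update/{doc_id('1', 'foo')}")
print(json.dumps(body, indent=2))
```

Keeping the signal name and increment in `params` means the same script text can be reused for `visit`, `share`, `purchase`, and `like`.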

In [2]:
# We define some persistent variables here that are used everywhere. Always run this cell first.

# you may want to update the value below to something like 'http://localhost:9200/search_recommendations' for testing locally
url = 'https://vYBoRZTxv:cafabdca-c61c-4b70-9c19-f7f7a5e27258@es-cluster-dc-test-2-b5c555.searchbase.io/recommendations_generalized'

headers = {
    'Content-Type': 'application/json'
}
%store url
%store headers

Stored 'url' (str)
Stored 'headers' (dict)


In [38]:
# (optional) deletes the index
import requests

response = requests.request("DELETE", url)
print(response.text.encode('utf8'))

b'{"acknowledged":true}'


In [39]:
# (only run once) Here's the script to populate some qualitative data in the index
# We will use the following data:
# 
#| user~item | a | b | c | d | e |
#|-----------|---|---|---|---|---|
#| 1         | + |   | + |   |   |
#| 2         | + |   |   | + |   |
#| 3         | + |   |   | + | + |
#| 4         |   | + |   |   | + |
#| 5         | + |   |   |   | + |

import requests
import json

data = '''
{ "index": { "_id": "1~a" } }
{ "user_id": 1, "item_id" : "a", "visit": 1}
{ "index": { "_id": "1~c" } }
{ "user_id": 1, "item_id" : "c", "visit": 1}
{ "index": { "_id": "2~a" } }
{ "user_id": 2, "item_id" : "a", "visit": 1}
{ "index": { "_id": "2~d" } }
{ "user_id": 2, "item_id" : "d", "visit": 1}
{ "index": { "_id": "3~a" } }
{ "user_id": 3, "item_id" : "a", "visit": 1}
{ "index": { "_id": "3~d" } }
{ "user_id": 3, "item_id" : "d", "visit": 1}
{ "index": { "_id": "3~e" } }
{ "user_id": 3, "item_id" : "e", "visit": 1}
{ "index": { "_id": "4~b" } }
{ "user_id": 4, "item_id" : "b", "visit": 1}
{ "index": { "_id": "4~e" } }
{ "user_id": 4, "item_id" : "e", "visit": 1}
{ "index": { "_id": "5~a" } }
{ "user_id": 5, "item_id" : "a", "visit": 1}
{ "index": { "_id": "5~e" } }
{ "user_id": 5, "item_id" : "e", "visit": 1}
'''

headers = {
    'Content-Type': 'application/x-ndjson'
}

response = requests.request("POST", url+'/_bulk', headers=headers, data = data)
print(json.dumps(response.json(), indent=2))

{
  "took": 319,
  "errors": false,
  "items": [
    {
      "index": {
        "_index": "recommendations_generalized",
        "_type": "_doc",
        "_id": "1~a",
        "_version": 1,
        "result": "created",
        "_shards": {
          "total": 2,
          "successful": 2,
          "failed": 0
        },
        "_seq_no": 0,
        "_primary_term": 1,
        "status": 201
      }
    },
    {
      "index": {
        "_index": "recommendations_generalized",
        "_type": "_doc",
        "_id": "1~c",
        "_version": 1,
        "result": "created",
        "_shards": {
          "total": 2,
          "successful": 2,
          "failed": 0
        },
        "_seq_no": 1,
        "_primary_term": 1,
        "status": 201
      }
    },
    {
      "index": {
        "_index": "recommendations_generalized",
        "_type": "_doc",
        "_id": "2~a",
        "_version": 1,
        "result": "created",
        "_shards": {
          "total": 2,
          "succ

In [40]:
# Let's take a scenario where this data needs to be updated by inserting a document for a given user and a given item.

import requests
import json

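# A single JSON document is sent here, so use the JSON content type
# (the bulk cell above switched `headers` to application/x-ndjson)
headers = {
    'Content-Type': 'application/json'
}

def update_interaction(user_id, item_id):
    # Note: PUT on /_doc/<id> (re)indexes the whole document, so an existing
    # (user, item) doc is overwritten and its visit count is reset to 1
    data = {
        "user_id": user_id,
        "item_id": item_id,
        "visit": 1
    }
    response = requests.request("PUT", f"{url}/_doc/{user_id}~{item_id}", headers=headers, data=json.dumps(data))
    print(json.dumps(response.json(), indent=2))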

# Let's add an interaction for user 1 for item d
update_interaction(1, "d")


{
  "_index": "recommendations_generalized",
  "_type": "_doc",
  "_id": "1~d",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 2,
    "failed": 0
  },
  "_seq_no": 11,
  "_primary_term": 1
}


In [44]:
# This is the recommendations query: For a specified user as an input, it should generate the best recommendations as the output

import requests
import json

# get the past item interactions of a given user
def past_interactions(user_id):
    payload = {
        "query": {
            "bool": {
                "must": [
                    {
                        "term": {
                            "user_id": user_id
                        }
                    }, {
                        # match any positive visit count, not only exactly 1
                        "range": {
                            "visit": {"gte": 1}
                        }
                    }
                ]
            }
        }   
    }
    response = requests.request("GET", url+"/_search", headers=headers, data=json.dumps(payload))
    response_hits = response.json()['hits']['hits']
    items = list(map(lambda x: x['_source']['item_id'], response_hits))
    print(f"user {user_id}'s past interactions: ", items)
    return items

def recommended_interactions(user_id):
    # first, get the already visited items from previous interactions
    past_items = past_interactions(user_id)
    # then get the recommendations using significant terms
    payload = {
        "size": 10,
        "query": {
            "bool": {
                "should": {
                    "terms": {
                        "item_id.keyword": past_items
                    }
                }
            }
        },
        "aggs": {
            "sig_terms": {
                "significant_terms": {
                    "field": "item_id.keyword",
                    "min_doc_count": 1
                }
            }
        }
    }
    response = requests.request("GET", url+"/_search", headers=headers, data=json.dumps(payload))
    print(json.dumps(response.json(), indent=2))

recommended_interactions(1)

user 1's past interactions:  ['a', 'c', 'd']
{
  "took": 7,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 8,
      "relation": "eq"
    },
    "max_score": 1.0,
    "hits": [
      {
        "_index": "recommendations_generalized",
        "_type": "_doc",
        "_id": "1~a",
        "_score": 1.0,
        "_source": {
          "user_id": 1,
          "item_id": "a",
          "visit": 1
        }
      },
      {
        "_index": "recommendations_generalized",
        "_type": "_doc",
        "_id": "1~c",
        "_score": 1.0,
        "_source": {
          "user_id": 1,
          "item_id": "c",
          "visit": 1
        }
      },
      {
        "_index": "recommendations_generalized",
        "_type": "_doc",
        "_id": "2~a",
        "_score": 1.0,
        "_source": {
          "user_id": 2,
          "item_id": "a",
          "visit": 1
        }
      },
  

In [48]:
# Significant terms for the documents matching a single item (here: item d)
import requests
import json

payload = {
    "query": {
        "term": {
            "item_id.keyword": "d"
        }
    },
    "aggs": {
        "sf": {
            "significant_terms": {
                "field": "item_id.keyword"
            }
        }
    }
}

response = requests.request("GET", url+'/_search',  headers=headers, data=json.dumps(payload))
print(json.dumps(response.json(), indent=2))

{
  "took": 15,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3,
      "relation": "eq"
    },
    "max_score": 1.3121864,
    "hits": [
      {
        "_index": "recommendations_generalized",
        "_type": "_doc",
        "_id": "2~d",
        "_score": 1.3121864,
        "_source": {
          "user_id": 2,
          "item_id": "d",
          "visit": 1
        }
      },
      {
        "_index": "recommendations_generalized",
        "_type": "_doc",
        "_id": "3~d",
        "_score": 1.3121864,
        "_source": {
          "user_id": 3,
          "item_id": "d",
          "visit": 1
        }
      },
      {
        "_index": "recommendations_generalized",
        "_type": "_doc",
        "_id": "1~d",
        "_score": 1.3121864,
        "_source": {
          "user_id": 1,
          "item_id": "d",
          "visit": 1
        }
      }
    ]
  },
  "aggregati

In [54]:
# Nested terms aggregation on item_id, to inspect which items share a bucket
import requests
import json

payload = {
    "size": 0,
    "aggs": {
        "t1": {
            "terms": {
                "field": "item_id.keyword"
            },
            "aggs": {
                "t2": {
                    "terms": {
                        "field": "item_id.keyword"
                    }
                }
            }
        }
    }
}

response = requests.request("GET", url+'/_search',  headers=headers, data=json.dumps(payload))
print(json.dumps(response.json(), indent=2))

{
  "took": 8,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 12,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "t1": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "a",
          "doc_count": 4,
          "t2": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "a",
                "doc_count": 4
              }
            ]
          }
        },
        {
          "key": "d",
          "doc_count": 3,
          "t2": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "d",
                "doc_count": 3
              }
            ]
          }
        },
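
The nested `terms` output above shows each item bucket containing only itself, which is what one would expect when every document holds a single `item_id`. One possible way to surface co-occurring items under this data model is to go through `user_id` in two steps: find the users who interacted with the given items, then run `significant_terms` over those users' documents while excluding the items already seen. This is a sketch under those assumptions, not the approach the notebook settles on. The function names are hypothetical and the payloads are only built, not sent.

```python
def users_for_items_payload(past_items, max_users=1000):
    # step 1: which users interacted with any of the given items?
    # (user_id was indexed as a number in the bulk data, so it is aggregated directly)
    return {
        "size": 0,
        "query": {"terms": {"item_id.keyword": past_items}},
        "aggs": {"users": {"terms": {"field": "user_id", "size": max_users}}},
    }

def co_occurring_items_payload(user_ids, past_items):
    # step 2: among those users' documents, which other items stand out?
    return {
        "size": 0,
        "query": {"terms": {"user_id": user_ids}},
        "aggs": {
            "sig_terms": {
                "significant_terms": {
                    "field": "item_id.keyword",
                    "min_doc_count": 1,
                    "exclude": past_items,  # exact terms to leave out of the buckets
                }
            }
        },
    }

def bucket_keys(response_json, agg_name):
    # read the bucket keys of one aggregation from a search response
    buckets = response_json.get("aggregations", {}).get(agg_name, {}).get("buckets", [])
    return [bucket["key"] for bucket in buckets]

step1 = users_for_items_payload(["a", "c", "d"])
# made-up step-1 response fragment, only to show how the two steps connect
fake_users_response = {"aggregations": {"users": {"buckets": [{"key": 1}, {"key": 2}]}}}
step2 = co_occurring_items_payload(bucket_keys(fake_users_response, "users"), ["a", "c", "d"])
print(step2["query"])
```

The cost is an extra round trip, and `max_users` caps how many users feed the second query, so the result is an approximation for popular items.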
    