## Next city recommendation in a multi-destination trip

- Build a next-city recommender using only two columns, city_id and utrip_id (checkin is used only to order the cities within a trip).
- Method-1: Transition probability matrix, a simple lookup.
- Method-2: Use the previous cities in the trip to create features (using embeddings) and train an ML model. The ML model is not trained in this notebook.

In [1]:
import numpy as np
import pandas as pd
from collections import Counter
from tqdm import tqdm

In [2]:
df = pd.read_csv("train_set.csv")
print(df.shape)
df.head()

(1166835, 9)


Unnamed: 0,user_id,checkin,checkout,city_id,device_class,affiliate_id,booker_country,hotel_country,utrip_id
0,1000027,2016-08-13,2016-08-14,8183,desktop,7168,Elbonia,Gondal,1000027_1
1,1000027,2016-08-14,2016-08-16,15626,desktop,7168,Elbonia,Gondal,1000027_1
2,1000027,2016-08-16,2016-08-18,60902,desktop,7168,Elbonia,Gondal,1000027_1
3,1000027,2016-08-18,2016-08-21,30628,desktop,253,Elbonia,Gondal,1000027_1
4,1000033,2016-04-09,2016-04-11,38677,mobile,359,Gondal,Cobra Island,1000033_1


In [3]:
df = df[['utrip_id', 'city_id', 'checkin']]
df = df.sort_values(['utrip_id', 'checkin'])

In [4]:
df = df[['utrip_id', 'city_id']]
df.head(10)

Unnamed: 0,utrip_id,city_id
0,1000027_1,8183
1,1000027_1,15626
2,1000027_1,60902
3,1000027_1,30628
4,1000033_1,38677
5,1000033_1,52089
6,1000033_1,21328
7,1000033_1,27485
8,1000033_1,38677
9,1000045_1,64876


In [5]:
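# Save a copy with utrip_id renamed to trip_id; df itself keeps its column names,
# because later cells still group by 'utrip_id'
df.rename(columns={'utrip_id': 'trip_id'}).to_csv("tripid_cityid.csv", index=False)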


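In [None]:
# Two real trips from the data. In the second one the traveller ends by returning to the starting city (38677).
1000027_1: 8183 -> 15626 -> 60902 -> 30628

In [None]:
1000033_1: 38677 -> 52089 -> 21328 -> 27485 (-> 38677)

In [None]:
# Toy example with three cities A, B and C
trip_1:  A -> C -> A
trip_2:  B -> A -> C -> A -> B
trip_3:  A -> B -> C

In [None]:
# (current city, next city) pairs per trip, after dropping a final return to the starting city
trip_1: [A, C] = 1
trip_2: [B, A] = 1, [A, C] = 1, [C, A] = 1
trip_3: [A, B] = 1, [B, C] = 1

In [None]:
# Transition counts: rows = current city, columns = next city, T = row total
  A B C  T
A 0 1 2  3
B 1 0 1  2
C 1 0 0  1

In [None]:
# Transition probabilities: each count divided by its row total
  A    B    C
A 0   1/3  2/3
B 1/2  0   1/2
C 1    0   0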
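The toy example above can be reproduced in a few lines. This is a minimal sketch that uses plain dictionaries instead of the pandas matrix built later in the notebook; the variable names `trips`, `counts` and `probs` are illustrative and do not appear in the original code.

```python
from collections import Counter, defaultdict

# Toy trips from the example above
trips = {
    "trip_1": ["A", "C", "A"],
    "trip_2": ["B", "A", "C", "A", "B"],
    "trip_3": ["A", "B", "C"],
}

# counts[current_city][next_city] = number of observed transitions
counts = defaultdict(Counter)
for cities in trips.values():
    # Drop the final city when the traveller simply returns to the start
    if len(cities) > 1 and cities[0] == cities[-1]:
        cities = cities[:-1]
    for current_city, next_city in zip(cities[:-1], cities[1:]):
        counts[current_city][next_city] += 1

# Normalise each row so that its probabilities sum to 1
probs = {
    city: {nxt: c / sum(row.values()) for nxt, c in row.items()}
    for city, row in counts.items()
}

for city in sorted(probs):
    print(city, probs[city])
```

The printed rows should agree with the probability table above: from A the traveller goes to C twice as often as to B, and from C always back to A.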

### Method-1: Transition probability matrix

In [4]:
print(df['city_id'].nunique())

39901


In [5]:
%%time 
df_trip_cities = df.groupby('utrip_id')['city_id'].apply(list)
df_trip_cities.head()

CPU times: user 2.86 s, sys: 27.3 ms, total: 2.89 s
Wall time: 2.89 s


utrip_id
1000027_1                         [8183, 15626, 60902, 30628]
1000033_1                 [38677, 52089, 21328, 27485, 38677]
1000045_1    [64876, 55128, 9608, 31817, 36170, 58178, 36063]
1000083_1                        [55990, 14705, 35160, 36063]
100008_1                    [11306, 12096, 6761, 6779, 65690]
Name: city_id, dtype: object

In [6]:
%%time

# From the historical trip data, create all pairs of cities (current city, next city)

from_to = []
for trip_cities in df_trip_cities.values:
    # For some trips the last city is the same as the first, meaning the traveller
    # is simply returning home. Drop the last city in such cases.
    # Slicing (rather than pop) leaves the lists stored in df_trip_cities unchanged.
    if trip_cities[0] == trip_cities[-1]:
        trip_cities = trip_cities[:-1]

    from_city = trip_cities[:-1]
    to_city = trip_cities[1:]
    from_to.append(list(zip(from_city, to_city)))

# Flatten the per-trip lists into a single list of pairs
from_to = [item for sublist in from_to for item in sublist]
# Remove pairs where the current and next city are the same (data quality issue)
from_to = [f_t for f_t in from_to if f_t[0] != f_t[1]]

print(len(from_to))
print(from_to[:3])

840002
[(8183, 15626), (15626, 60902), (60902, 30628)]
CPU times: user 369 ms, sys: 16.4 ms, total: 386 ms
Wall time: 385 ms


In [7]:
%%time
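# Count each (current city, next city) pair and pivot into a matrix: rows = current city, columns = next city
trans_prob_mat = pd.Series(Counter(map(tuple, from_to))).unstack().fillna(0)
# Divide each row by its total so that every row sums to 1
trans_prob_mat = trans_prob_mat.divide(trans_prob_mat.sum(axis=1), axis=0)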
print(trans_prob_mat.shape)

(37188, 37481)
CPU times: user 14.2 s, sys: 38.4 s, total: 52.7 s
Wall time: 39.5 s
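The dense matrix above has 37188 x 37481 cells (roughly 1.4 billion, about 11 GB as float64), while at most 840002 of them can be non-zero. As a hedged alternative, the same lookup can be kept sparse with a dictionary of counters. The sketch below is not part of the original notebook: `build_transition_counts`, `top_recommendation_sparse` and `sample_pairs` are hypothetical names, and the sample pairs are hand-made rather than taken from `from_to`.

```python
from collections import Counter, defaultdict

def build_transition_counts(pairs):
    """Map each current city to a Counter of the next cities observed after it."""
    counts = defaultdict(Counter)
    for current_city, next_city in pairs:
        counts[current_city][next_city] += 1
    return counts

def top_recommendation_sparse(counts, current_city_id, n):
    """Top n next cities with probabilities; empty lists for a city never seen as an origin."""
    row = counts.get(current_city_id)
    if not row:
        return {'city_ids': [], 'probabilities': []}
    total = sum(row.values())
    top = row.most_common(n)
    return {
        'city_ids': [city for city, _ in top],
        'probabilities': [round(count / total, 4) for _, count in top],
    }

# Hand-made sample in the same (current city, next city) format as from_to
sample_pairs = [(8183, 15626), (8183, 15626), (8183, 60902), (15626, 60902)]
sample_counts = build_transition_counts(sample_pairs)
print(top_recommendation_sparse(sample_counts, 8183, 2))
print(top_recommendation_sparse(sample_counts, 99999, 2))
```

In the notebook, `build_transition_counts(from_to)` would take the place of `sample_counts`. Probabilities are computed per lookup, so no full matrix is ever materialised, and an unseen city returns an empty result instead of raising an error.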


In [8]:
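# Given the current city, this function returns the top n cities for the next destination.
# Note: current_city_id must appear as a row (origin city) in trans_prob_mat;
# otherwise the filtered frame is empty and sort_values raises a KeyError.

def top_recommendation(current_city_id, n):
    # Select the row for the current city and transpose it into a single column
    city_prob = trans_prob_mat.filter(items=[current_city_id], axis=0).T
    # Sort the candidate next cities by probability, highest first
    city_prob = city_prob.sort_values(current_city_id, ascending=False)
    top_n_cities = city_prob.index[:n].to_list()
    top_n_probs = city_prob[current_city_id].values[:n]
    top_n_probs = list(np.round(top_n_probs, 4))
    top_recos = {'city_ids': top_n_cities, 'probabilities': top_n_probs}
    return top_recos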

In [9]:
top_recos = top_recommendation(35850, 5)
print(top_recos)

{'city_ids': [17764, 27112, 56651, 47499, 7810], 'probabilities': [0.2271, 0.1643, 0.1304, 0.0652, 0.0411]}


If a customer is currently in city 35850, the best recommendation for the next city is 17764 (probability 0.2271), followed by 27112 (0.1643).

In [10]:
top_recos = top_recommendation(27112, 3)
print(top_recos)

{'city_ids': [17764, 47499, 56651], 'probabilities': [0.6027, 0.0959, 0.0639]}


In [11]:
top_recos = top_recommendation(60902, 1)
print(top_recos)

{'city_ids': [58015], 'probabilities': [0.1087]}


### Method-2: Embeddings

In [12]:
df.head()

Unnamed: 0,utrip_id,city_id,checkin
0,1000027_1,8183,2016-08-13
1,1000027_1,15626,2016-08-14
2,1000027_1,60902,2016-08-16
3,1000027_1,30628,2016-08-18
4,1000033_1,38677,2016-04-09


In [13]:
%%time 
df_trip_cities = df.groupby('utrip_id')['city_id'].apply(list)
df_trip_cities.head()

CPU times: user 3.23 s, sys: 164 ms, total: 3.39 s
Wall time: 3.41 s


utrip_id
1000027_1                         [8183, 15626, 60902, 30628]
1000033_1                 [38677, 52089, 21328, 27485, 38677]
1000045_1    [64876, 55128, 9608, 31817, 36170, 58178, 36063]
1000083_1                        [55990, 14705, 35160, 36063]
100008_1                    [11306, 12096, 6761, 6779, 65690]
Name: city_id, dtype: object

In [14]:
def generate_prev_next_cities(input_list):
    output_lists = []
    
    for i in range(len(input_list) - 1):
        for j in range(i + 2, len(input_list) + 1):
            sublist = input_list[i:j]
            output_lists.append(sublist)
    
    return output_lists

In [15]:
# Create records/rows with two columns: the previous cities and the next city in each trip.
# For the trip 1000027_1 with cities [8183, 15626, 60902, 30628], the output looks like the listing below.
# In each list, the last value is the next city and all the values except the last are the previous cities.
input_list = [8183, 15626, 60902, 30628]
prev_next_cities = generate_prev_next_cities(input_list)
print("[previous_cities], next_city")
for prev_next in prev_next_cities:
    print(prev_next[:-1], prev_next[-1])

[previous_cities], next_city
[8183] 15626
[8183, 15626] 60902
[8183, 15626, 60902] 30628
[15626] 60902
[15626, 60902] 30628
[60902] 30628
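Because every contiguous sublist of length two or more becomes a row, a trip with n cities produces n(n-1)/2 rows: 6 rows for the 4-city trip above. The sketch below restates the notebook's function as a list comprehension and checks that count. The capped variant and its `max_history` parameter are hypothetical additions, shown only as one way to limit row growth for long trips.

```python
def generate_prev_next_cities(input_list):
    # Same logic as the notebook function: every contiguous sublist of length >= 2
    return [input_list[i:j]
            for i in range(len(input_list) - 1)
            for j in range(i + 2, len(input_list) + 1)]

# A trip with n cities yields n * (n - 1) / 2 rows
for n in range(2, 8):
    trip = list(range(n))
    assert len(generate_prev_next_cities(trip)) == n * (n - 1) // 2

def generate_prev_next_capped(input_list, max_history):
    # Hypothetical variant: keep only rows with at most max_history previous cities
    return [sub for sub in generate_prev_next_cities(input_list)
            if len(sub) - 1 <= max_history]

rows = generate_prev_next_capped([8183, 15626, 60902, 30628], max_history=2)
for row in rows:
    print(row[:-1], row[-1])
```

With `max_history=2`, the row `[8183, 15626, 60902] 30628` is dropped and the other five rows remain.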


In [16]:
%%time 

prev_next_cities_ll = []
for trip_cities in tqdm(df_trip_cities.values):
    prev_next_cities = generate_prev_next_cities(trip_cities)
    prev_next_cities_ll.append(prev_next_cities)
    
prev_next_cities_ll = [item for sublist in prev_next_cities_ll for item in sublist]

100%|███████████████████████████████| 217686/217686 [00:02<00:00, 105085.10it/s]

CPU times: user 2.01 s, sys: 179 ms, total: 2.19 s
Wall time: 2.19 s





In [17]:
prev_next_df = pd.DataFrame({'trip': prev_next_cities_ll})

# All cities except the last form the history, stored as a comma-separated string, e.g. "8183, 15626"
prev_next_df['previous_cities'] = prev_next_df['trip'].apply(lambda trip: ', '.join(map(str, trip[:-1])))

# The last city of each sublist is the prediction target
prev_next_df['next_city'] = prev_next_df['trip'].apply(lambda trip: trip[-1])

prev_next_df.head()

Unnamed: 0,trip,previous_cities,next_city
0,"[8183, 15626]",8183,15626
1,"[8183, 15626, 60902]","8183, 15626",60902
2,"[8183, 15626, 60902, 30628]","8183, 15626, 60902",30628
3,"[15626, 60902]",15626,60902
4,"[15626, 60902, 30628]","15626, 60902",30628


In [18]:
prev_next_df.tail()

Unnamed: 0,trip,previous_cities,next_city
2985426,"[17944, 47075, 228]","17944, 47075",228
2985427,"[17944, 47075, 228, 62930]","17944, 47075, 228",62930
2985428,"[47075, 228]",47075,228
2985429,"[47075, 228, 62930]","47075, 228",62930
2985430,"[228, 62930]",228,62930


In [19]:
import requests
model_id = "sentence-transformers/all-MiniLM-L6-v2"
hf_token = ""  # set your Hugging Face access token here (left blank on purpose)

In [20]:
api_url = f"https://api-inference.huggingface.co/pipeline/feature-extraction/{model_id}"
headers = {"Authorization": f"Bearer {hf_token}"}

In [21]:
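def create_embds(texts):
    # Send the texts to the hosted feature-extraction pipeline; wait if the model is still loading
    response = requests.post(api_url, headers=headers, json={"inputs": texts, "options": {"wait_for_model": True}})
    response.raise_for_status()  # fail early on HTTP errors instead of parsing an error payload
    return response.json()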

In [22]:
# creating embeddings only for the first 100 records
texts = list(prev_next_df["previous_cities"].values[:100])
embds = create_embds(texts)
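Only the first 100 records are embedded here. To cover more records, the texts could be sent in batches, as in the minimal sketch below. The helper names `batched` and `embed_in_batches` are hypothetical, and `fake_embed` is a stand-in used so the example runs without network access; it is not the real embedding model.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items, each with at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_in_batches(texts, embed_fn, batch_size=100):
    """Call embed_fn on each batch and concatenate the returned vectors in order."""
    embeddings = []
    for batch in batched(texts, batch_size):
        embeddings.extend(embed_fn(batch))
    return embeddings

# Stand-in embedder for illustration only: one 1-dimensional vector per text
def fake_embed(batch):
    return [[float(len(text))] for text in batch]

demo = embed_in_batches(["8183", "8183, 15626", "15626"], fake_embed, batch_size=2)
print(demo)
```

In the notebook, `embed_fn` would be `create_embds`. The source does not state any batch size limit for the hosted API, so the default of 100 simply mirrors the number of records used above.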

In [23]:
embds_df = pd.DataFrame(embds)
embds_df['target'] = prev_next_df['next_city'].values[:100]
print(embds_df.shape)
embds_df.head()

(100, 385)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,375,376,377,378,379,380,381,382,383,target
0,-0.087536,0.012962,-0.054277,0.071328,-0.031723,-0.023091,0.004846,0.086497,0.002042,-0.120288,...,-0.05987,-0.055316,-0.034741,-0.001732,0.084464,0.075043,0.052253,-0.041033,-0.095299,15626
1,-0.023394,-0.021337,-0.042222,0.0411,-0.087723,-0.037878,-0.040731,-0.00165,0.053713,-0.11517,...,-0.019104,-0.072425,-0.018901,0.036811,0.045646,0.026298,0.042792,0.030901,-0.047055,60902
2,-0.024174,-0.007061,-0.043312,0.015833,-0.074283,-0.007714,-0.033722,0.022991,0.026141,-0.099314,...,-0.004968,-0.084713,-0.052688,0.026497,0.054743,0.037487,0.102648,0.04367,-0.059041,30628
3,-0.029118,0.015374,-0.069965,0.014509,-0.097234,0.003281,-0.048886,0.015815,0.028806,-0.083979,...,0.033824,0.015397,0.022432,-0.000587,-0.000909,0.083887,-0.004035,-0.018506,-0.03711,60902
4,-0.026712,0.020225,-0.029628,0.013419,-0.102775,0.009542,-0.065748,0.029631,0.009293,-0.077677,...,0.06527,-0.033991,-0.036413,0.001968,0.013815,0.020153,0.107023,0.002898,-0.039028,30628


The first 384 columns are the embedding features and the 385th column is the target (the next city in the historical trip).
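To illustrate how this feature/target layout could be consumed, here is a minimal nearest-neighbour sketch: it predicts the target of the most similar training row by cosine similarity. This is a stand-in baseline chosen for brevity, not the ML model the notebook refers to, and the tiny 3-dimensional vectors are made up in place of the real 384-dimensional embeddings. The names `cosine_similarity` and `predict_next_city` are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def predict_next_city(query_vec, train_vecs, train_targets):
    """Return the target of the training row most similar to query_vec."""
    best = max(range(len(train_vecs)),
               key=lambda i: cosine_similarity(query_vec, train_vecs[i]))
    return train_targets[best]

# Made-up vectors standing in for embedding rows; targets are next-city ids
train_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
train_targets = [15626, 60902, 30628]

prediction = predict_next_city([0.1, 0.9, 0.0], train_vecs, train_targets)
print(prediction)
```

With the real data, `train_vecs` would be the first 384 columns of `embds_df` and `train_targets` its `target` column. A trained classifier would likely replace this lookup once more than 100 records are embedded.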