General task

   - The goal of this task is to create personalized recommendations of football matches
   (events) for users over a one-week period, based on the teams each user follows.
   Personalized recommendations matter in many applications because they keep users
   satisfied with the content they are shown.

Data retrieval

In [None]:
from clickhouse_driver import Client
import pandas as pd
import ast
import datetime
import pickle 

In [None]:
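import os

# Credentials are read from the environment instead of being hard-coded.
# The variable names CLICKHOUSE_USER and CLICKHOUSE_PASSWORD are assumptions.
bqClient = Client(
    user=os.environ['CLICKHOUSE_USER'],
    password=os.environ['CLICKHOUSE_PASSWORD'],
    host='clickhouse.sofascore.ai',
    port=9000,
)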

Retrieving all football teams that played a match between '20230101' and '20230630' from the table sports.event.

In [None]:
query = """    SELECT t.id
               FROM sports.event AS e
               LEFT JOIN sports.sport AS s ON e.sport_id = toInt8(s.id)
               LEFT JOIN sports.team AS t ON e.hometeam_id = t.id
               WHERE s.name = 'Football'
               AND toYYYYMMDD(e.startdate) BETWEEN '20230101' AND '20230630'
               UNION DISTINCT
               SELECT t1.id
               FROM sports.event AS e
               LEFT JOIN sports.sport AS s ON e.sport_id = toInt8(s.id)
               LEFT JOIN sports.team AS t1 ON e.awayteam_id = t1.id
               WHERE s.name = 'Football'
               AND toYYYYMMDD(e.startdate) BETWEEN '20230101' AND '20230630';
       """

In [None]:
team_ids = bqClient.execute(query)

In [None]:
teams_df = pd.DataFrame(team_ids, columns=['team_id'])
teams_df.to_csv('teams.csv', index=False)

Retrieving all user data

In [None]:
query2 = """ SELECT user_account_id, teams, mcc
            FROM bq.mobileuser
            WHERE user_account_id IS NOT NULL
               AND teams IS NOT NULL
               AND length(teams) > 0
               AND toYYYYMMDD(created_at) <= '20230630'
               AND toYYYYMMDD(updated_at) <= '20230630'
               AND mcc IN (216, 218, 219, 220, 221, 222, 226, 232, 262, 276, 284, 293, 294,297)
            ORDER BY mcc DESC
        """

In [None]:
user_data =  bqClient.execute(query2)

In [None]:
user_data_df = pd.DataFrame(user_data, columns=['user_account_id', 'teams', 'mcc'])
user_data_df['teams'] = user_data_df['teams'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)
user_data_df['teams'] = user_data_df['teams'].apply(tuple)
user_data_df = user_data_df.drop_duplicates()

Removing the 47 users that appear with more than one MCC value (their rows are otherwise identical and differ only in MCC)

In [None]:
user_counts = user_data_df['user_account_id'].value_counts()
users_with_multiple_records = user_counts[user_counts > 1].index

user_data_df = user_data_df[~user_data_df['user_account_id'].isin(users_with_multiple_records)]

In [None]:
user_data_df.to_csv('./data/user_data.csv',index=False)

In [None]:
user_data_df.head(1)

Data preprocessing tasks 
- filter the teams that each user follows by retaining only those teams that are playing relevant events
- if a user does not follow any of those teams, discard the user

1st task -> getting relevant teams

In [None]:
relevant_team_ids = set(teams_df['team_id'])  # set membership is O(1) per lookup
user_data_df['teams'] = user_data_df['teams'].apply(lambda team_list: [team for team in team_list if team in relevant_team_ids])

In [None]:
user_data_df = user_data_df[user_data_df['teams'].apply(len) > 0]

- Grouping users by MCC and collecting the user accounts and followed teams in each group
- Creating a dict afterwards -> the key is the mcc, the values are 'teams' and 'user_account_id', which hold all the information about the teams and user accounts connected to that group (mcc) -> for O(1) lookups

In [None]:
grouped_by_mcc = user_data_df.groupby('mcc').agg({
    'user_account_id': lambda x: list(x),
    'teams': lambda x: list(set(sum(x, [])))
}).reset_index().set_index('mcc')

In [None]:
grouped_by_mcc['user_account_id'] = grouped_by_mcc['user_account_id'].apply(sorted)
grouped_by_mcc['teams'] = grouped_by_mcc['teams'].apply(sorted)

In [None]:
grouped_by_mcc_dict = grouped_by_mcc.to_dict(orient='index')

Saving dict and dataframe locally

In [None]:
import pickle 

with open('./data/grouped_by_mcc_dict.pkl', 'wb') as f:
    pickle.dump(grouped_by_mcc_dict, f)

In [None]:
user_data_df.to_csv('./data/user_data.csv', index=False)


In [None]:
user_data_dict = user_data_df.set_index('user_account_id').to_dict(orient='index')

with open('./data/user_data_dict.pkl', 'wb') as f:
    pickle.dump(user_data_dict, f)

Retrieving events data

In [None]:
query3 = """
        SELECT 
            toStartOfWeek(startdate, 1) AS week_start,
            groupArray(id) AS event_ids,
            groupArray((hometeam_id, awayteam_id)) AS team_ids
         FROM
            sports.event
         LEFT JOIN
            sports.sport s ON event.sport_id = toInt8(s.id)
         WHERE
            toYYYYMMDD(startdate) >= '20230601' AND toYYYYMMDD(startdate) <= '20230630' AND s.name = 'Football'
         GROUP BY 
            week_start
         ORDER BY 
            week_start
        """

In [None]:
events_data = bqClient.execute(query3)

In [None]:
events_data = pd.DataFrame(events_data, columns=['week_start', 'event_ids', 'team_ids'])
events_data.to_csv('./data/events_data.csv', index=False)

In [None]:
events_data_dict = events_data.set_index('week_start').to_dict(orient='index')
with open('./data/events_data_dict.pkl', 'wb') as f:
    pickle.dump(events_data_dict, f)

Recommendation system 
- recommendations should be generated for events based on the teams that each user follows.
- we recommend an event to a user if the user follows either of the two teams playing in that event.
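
As a minimal sketch of the rule above on toy data (the team ids, event ids and user names below are made up for illustration and do not come from the dataset):

```python
# Toy data: event id -> (home team id, away team id). All ids are hypothetical.
events = {101: (1, 2), 102: (3, 4), 103: (2, 5)}

# Toy data: user -> followed team ids.
follows = {"user_a": [2], "user_b": [4, 5], "user_c": [9]}

# Invert the events into a team -> event ids lookup.
team_to_events = {}
for event_id, (home, away) in events.items():
    team_to_events.setdefault(home, []).append(event_id)
    team_to_events.setdefault(away, []).append(event_id)

# Recommend an event when the user follows either of its two teams.
recommendations = {
    user: sorted({e for team in teams for e in team_to_events.get(team, [])})
    for user, teams in follows.items()
}
print(recommendations)
```

Here user_c follows no team that plays, so the list stays empty; such users are the ones the similarity step later tries to resolve.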

In [None]:
import pandas as pd
import numpy as np
from collections import defaultdict
from sklearn.metrics import jaccard_score
from itertools import chain


In [None]:
resolved_users, non_resolved_users = [],[]
event_based_user_recommendation = {}
user_based_user_recommendation = {}

In [None]:
dates = [datetime.date(2023, 5, 29), 
         datetime.date(2023, 6, 5),
         datetime.date(2023, 6, 12), 
         datetime.date(2023, 6, 19), 
         datetime.date(2023, 6, 26)
         ]

Mapping: for each date, map each team_id to the ids of the events it plays in

In [None]:
teams_to_event_mapping = {date : {} for date in dates}

In [None]:
for date in dates:
   for event_id,teams in zip(events_data_dict[date]['event_ids'],events_data_dict[date]['team_ids']):
      for team in teams:
         if team not in teams_to_event_mapping[date]:
            teams_to_event_mapping[date][team] = []
         teams_to_event_mapping[date][team].append(event_id)  

Event based recommendation

In [None]:
for user, user_data in user_data_dict.items():
   event_based_user_recommendation[user] = {}
   for date in dates:
      event_ids_for_date = set()
      for team in user_data['teams']:
         if team in teams_to_event_mapping[date]:
            event_ids_for_date.update(teams_to_event_mapping[date][team])
      if event_ids_for_date:
         event_based_user_recommendation[user][date] = list(event_ids_for_date)
   if all(date in event_based_user_recommendation[user] for date in dates):
      resolved_users.append(user)
   else:
      non_resolved_users.append(user)         

User mappings based on similarity for each date

*Similarity* is defined as:
- for each user, look at the users in the same MCC that have followed teams from the previous task, score their similarity by the teams they both follow, then pick the 10 most similar users and pass on their recommendations for each period.

*Algorithm*
- for user *u*, look at the set of users *U* in the same MCC
- calculate a static similarity between each pair of users based on the teams they follow (using Jaccard similarity, since the teams form sets of items with no continuous variable attached)
- for each date, filter the users *U* down to those that received recommendations for that date
- take the 10 users most similar in preferences and add all of their recommendations as recommendations for user *u* in the given week
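
The steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names `jaccard` and `recommend_from_similar`, the parameter `k`, and the toy data are hypothetical, and ties are broken by user id purely to keep the result deterministic.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend_from_similar(user, group_users, follows, week_recs, k=10):
    """Union of the week's recommendations of the k users most similar to `user`."""
    own = set(follows[user])
    # Keep only same-group users who already have recommendations this week.
    candidates = [u for u in group_users if u != user and week_recs.get(u)]
    # Rank by similarity (highest first), breaking ties by user id.
    ranked = sorted(candidates, key=lambda u: (-jaccard(own, set(follows[u])), u))
    recs = set()
    for u in ranked[:k]:
        recs.update(week_recs[u])
    return sorted(recs)


# Hypothetical toy data for one MCC group and one week.
follows = {"u": [1, 2, 3], "a": [1, 2], "b": [3, 7], "c": [8, 9], "d": [1, 2, 3]}
week_recs = {"a": [101], "b": [102], "c": [103]}  # "d" has none this week
result = recommend_from_similar("u", list(follows), follows, week_recs, k=2)
print(result)
```

User "d" is the closest match to "u" but has no recommendations in this week, so it is filtered out before ranking, which mirrors the per-date filtering step.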

1st -> precomputing the similarities between users in the same MCC

In [306]:
import itertools

In [None]:
def jaccard_similarity(set1, set2): #  accessed on https://www.geeksforgeeks.org/how-to-calculate-jaccard-similarity-in-python/ , 26.05.2024. at 14:09
    # intersection of two sets
    intersection = len(set1.intersection(set2))
    # union of two sets
    union = len(set1.union(set2))
    # two empty sets have no defined ratio; treat them as not similar
    if union == 0:
        return 0.0
    return intersection / union

In [None]:
similarity_scores = {}

In [312]:
grouped_by_mcc_dict

{216: {'user_account_id': ['5319e482f8b4cf16058b45f1',
   '531a128cf8b4cf710c8b45eb',
   '531ac5067f246171168b462d',
   ...

In [316]:
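for mcc, data in grouped_by_mcc_dict.items():
   # ids are sorted, so each unordered pair appears once with user1 < user2
   for user1, user2 in itertools.combinations(sorted(data['user_account_id']), 2):
      teams1 = set(user_data_dict[user1]['teams'])
      teams2 = set(user_data_dict[user2]['teams'])
      similarity_scores[(user1, user2)] = jaccard_similarity(teams1, teams2)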

In [309]:
def similarity(i1, i2):
    # Implement the actual similarity function here
    return abs(i1 - i2)

In [308]:
for i1,i2 in itertools.combinations([1,2,3,4,5,6,6,5,4,3,2,1],2):
   print(i1,i2)

1 2
1 3
1 4
1 5
1 6
1 6
1 5
1 4
1 3
1 2
1 1
2 3
2 4
2 5
2 6
2 6
2 5
2 4
2 3
2 2
2 1
3 4
3 5
3 6
3 6
3 5
3 4
3 3
3 2
3 1
4 5
4 6
4 6
4 5
4 4
4 3
4 2
4 1
5 6
5 6
5 5
5 4
5 3
5 2
5 1
6 6
6 5
6 4
6 3
6 2
6 1
6 5
6 4
6 3
6 2
6 1
5 4
5 3
5 2
5 1
4 3
4 2
4 1
3 2
3 1
2 1
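
The output above shows that itertools.combinations pairs elements by position, so repeated values produce repeated and reversed pairs. A small sketch, reusing the same toy list, of deduplicating and sorting first so that each unordered pair appears exactly once:

```python
import itertools

values = [1, 2, 3, 4, 5, 6, 6, 5, 4, 3, 2, 1]

# Deduplicate and sort so every pair (a, b) satisfies a < b and appears once.
unique_pairs = list(itertools.combinations(sorted(set(values)), 2))

print(len(unique_pairs))  # 15 pairs for 6 distinct values
```

With unique, sorted input an explicit check such as `i1 != i2 and i1 < i2` becomes unnecessary.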
