### Project N 4
## Recommendation system

### Goal: 
Lay the groundwork for a recommendation system that suggests relevant courses to users and thereby increases the average order value.

### Tasks:
- Analyze data using SQL and Python 
- Prepare a table containing a list of courses and two recommended courses for each of them




In [1]:
%matplotlib inline
import itertools
import random
from collections import Counter

import numpy as np
import pandas as pd
import psycopg2
import psycopg2.extras

recommendations_df = pd.read_csv('data/query_result.csv')
recommendations_df.head()

Unnamed: 0,user_id,resource_id
0,51,516
1,51,1099
2,6117,356
3,6117,357
4,6117,1125


In [2]:
# Group courses by user, displaying only unique users in the first column and all courses purchased by each user in the second

df_grouped = recommendations_df.groupby('user_id')['resource_id'].apply(list).reset_index()
df_grouped.head()

Unnamed: 0,user_id,resource_id
0,51,"[516, 1099]"
1,6117,"[356, 357, 1125]"
2,10275,"[553, 1147]"
3,10457,"[361, 1138]"
4,17166,"[357, 356]"


In [3]:
# Sort each user's courses in ascending order so that mirrored pairs such as (357, 356) and (356, 357) are not counted separately later

df_grouped['resource_id'] = df_grouped['resource_id'].apply(sorted)

In [4]:
# Build every pair of courses within each row: each course is paired with every course that follows it in the sorted list

df_grouped['resource_id'] = df_grouped['resource_id'].apply(lambda x: list(itertools.combinations(x,2)))
df_grouped.head()

Unnamed: 0,user_id,resource_id
0,51,"[(516, 1099)]"
1,6117,"[(356, 357), (356, 1125), (357, 1125)]"
2,10275,"[(553, 1147)]"
3,10457,"[(361, 1138)]"
4,17166,"[(356, 357)]"
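The effect of sorting before pairing can be checked on a toy input. The sketch below is a minimal illustration, not part of the original pipeline: `canonical_pairs` is a hypothetical helper, and the `set()` call that drops repeat purchases of the same course is an added assumption (the cells above do not deduplicate).

```python
from itertools import combinations


def canonical_pairs(course_ids):
    """Return each unordered pair of distinct courses once, smaller id first."""
    # set() drops repeat purchases (an assumption, not done in the notebook);
    # sorted() fixes the order inside every pair, so (357, 356) cannot appear.
    return list(combinations(sorted(set(course_ids)), 2))


print(canonical_pairs([357, 356]))        # [(356, 357)]
print(canonical_pairs([356, 357, 1125]))  # [(356, 357), (356, 1125), (357, 1125)]
```

Without the sort, user 17166's list `[357, 356]` would yield `(357, 356)`, which a counter would treat as a different key from `(356, 357)`.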


In [5]:
# Create a set of pairs to count the number of unique pairs

set_of_pairs = set()
for courses in df_grouped['resource_id']:
    for e in courses:
        set_of_pairs.add(e)

print(len(set_of_pairs))

3989


In [6]:
# Build a list of all pairs, then a Counter that maps each unique pair of courses to its number of joint purchases

list_of_pairs = []
for courses in df_grouped['resource_id']:
    for e in courses:
        list_of_pairs.append(e)

courses_count = Counter(list_of_pairs)

# Form a dataset from the dictionary

courses_count_df = pd.DataFrame(list(courses_count.items()), columns=['pairs','count'])
courses_count_df.head()

Unnamed: 0,pairs,count
0,"(516, 1099)",25
1,"(356, 357)",100
2,"(356, 1125)",44
3,"(357, 1125)",52
4,"(553, 1147)",16


In [7]:
# Set the minimum joint-purchase frequency to the 60th percentile of the pair counts.
# The 50th percentile (median) gave a threshold of only 3 joint purchases, which is too low:
# it raises the risk that two courses have nothing in common and were bought together only because of one user's individual needs

min_freq = np.percentile(courses_count_df['count'],60)
min_freq

5.0
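To make the percentile threshold concrete, here is a small sketch on made-up numbers. The counts in `toy_counts` are hypothetical and are not taken from `courses_count_df`; the point is only to show how `np.percentile` moves the cut-off and what share of pairs survives it.

```python
import numpy as np

# Hypothetical joint-purchase counts for ten course pairs (not the real data)
toy_counts = np.array([1, 1, 2, 2, 3, 3, 5, 8, 20, 100])

# np.percentile interpolates linearly between neighbouring values by default
median_cutoff = np.percentile(toy_counts, 50)
p60_cutoff = np.percentile(toy_counts, 60)

# Share of pairs that would still count as a valid recommendation
share_kept = (toy_counts >= p60_cutoff).mean()

print(median_cutoff, p60_cutoff, share_kept)
```

Because the counts are heavily skewed toward small values, a modest increase in the percentile can shift the threshold noticeably while still keeping the frequently co-purchased pairs.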

In [8]:
# Write a function that takes a course id and returns up to 2 recommendations.
# Each one is a tuple of the course pair (the given id plus the recommended course id) and the joint-purchase count; results are sorted by count in descending order

def course_recommendation(course_id):
    rec_list=[]
    for i in courses_count.keys():
        if i[0] == course_id:
            rec_list.append((i, courses_count[i]))
        if i[1] == course_id:
            rec_list.append((i, courses_count[i]))
    rec_list = sorted(rec_list, key=lambda x: x[1],  reverse=True)
    return rec_list[:2]
        
# Checking the operation of the recommendation function 
course_recommendation(517)

[((517, 551), 52), ((517, 750), 34)]
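The function above scans every pair on each call. If lookups become frequent, one possible alternative is to build a per-course index once. The sketch below is only an illustration under that assumption: `build_partner_index` and `top_partners` are hypothetical names, and the pair `(516, 517): 7` in the toy data is invented (the other two counts are the ones printed above).

```python
from collections import Counter, defaultdict


def build_partner_index(pair_counts):
    """Map each course id to a Counter of {partner id: joint-purchase count}."""
    index = defaultdict(Counter)
    for (a, b), n in pair_counts.items():
        # Register the pair in both directions so either course can be looked up
        index[a][b] += n
        index[b][a] += n
    return index


def top_partners(index, course_id, k=2):
    """Return up to k (partner id, count) tuples, most frequent first."""
    # The membership test avoids inserting empty entries into the defaultdict
    return index[course_id].most_common(k) if course_id in index else []


toy_pair_counts = Counter({(517, 551): 52, (517, 750): 34, (516, 517): 7})
toy_index = build_partner_index(toy_pair_counts)
print(top_partners(toy_index, 517))  # [(551, 52), (750, 34)]
```

Unlike `course_recommendation`, this variant returns the partner id directly rather than the whole pair, so the caller does not need to strip the queried course out of the tuple.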

In [9]:
# Creating a set with a list of unique courses

set_of_courses = set(recommendations_df['resource_id'])
print(len(set_of_courses))

126


In [10]:
# For courses whose top pairs were bought together fewer than min_freq (5) times,
# pick a fallback course at random from the list of all courses.
# Note: random.choice runs once here, so the same fallback course is reused for every such case

random_course = random.choice(list(set_of_courses))

*If a course was rarely bought together with any other course, it is most likely either thematically unrelated to the rest of the platform or complete enough to cover the user's needs on its own. The opposite is also possible: users were dissatisfied with the course materials and had no desire to buy further courses. To test these hypotheses, one can analyze how often users complete the courses that were rarely bought together with others.*
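A single random fallback can coincide with the course being viewed or with its other recommendation. One way to avoid this is sketched below, purely as an assumption about what a stricter fallback might look like: `pick_fallback` is a hypothetical helper and the course ids in `toy_courses` are used only for illustration.

```python
import random


def pick_fallback(course_id, all_courses, already_recommended=(), rng=random):
    """Pick a random course that is neither the course itself nor already recommended."""
    excluded = {course_id, *already_recommended}
    # Sorting makes the candidate order deterministic for a seeded generator
    candidates = sorted(c for c in all_courses if c not in excluded)
    return rng.choice(candidates) if candidates else None


toy_courses = {513, 514, 515, 516}
fallback = pick_fallback(513, toy_courses, already_recommended=[514],
                         rng=random.Random(0))
print(fallback)  # either 515 or 516
```

Passing a seeded `random.Random` instance keeps the result reproducible between notebook runs, and the function returns `None` when no eligible course is left.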

In [11]:
# Create the final table: unique course ids form the index,
# rec_1 holds the most frequent partner course and rec_2 the second most frequent one.
# Pairs bought together fewer than min_freq times are replaced by the random fallback course

recommendation_list = []
rec_df = pd.DataFrame(recommendation_list, columns=['rec_1', 'rec_2'])
for course_id in set_of_courses:
    # Look up the top pairs once per course: at most two (pair, count) tuples
    top_pairs = course_recommendation(course_id)
    recs = []
    for pair, count in top_pairs:
        if count >= min_freq:
            # The recommended course is the other member of the pair
            recs.append(pair[1] if pair[0] == course_id else pair[0])
        else:
            recs.append(random_course)
    # Pad with the fallback if the course appears in fewer than two pairs
    recs += [random_course] * (2 - len(recs))
    rec_df.loc[course_id] = recs

# Check whether there is a recommendation for each course from the list    
rec_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 126 entries, 513 to 511
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   rec_1   126 non-null    int64
 1   rec_2   126 non-null    int64
dtypes: int64(2)
memory usage: 3.0 KB


In [12]:
rec_df.head()

Unnamed: 0,rec_1,rec_2
513,503,551
514,551,515
515,551,489
516,745,553
517,551,750


#### Conclusions
*The distribution of joint purchases shows that about half of the course pairs were bought together 3 times or fewer. The courses in these pairs probably have no thematic connection with one another, or they are self-sufficient. They deserve a closer look. If such courses sell poorly overall, it may be worth updating them. If they sell well, it may be worth creating courses on similar subjects. It is also worth analyzing the behavior of users who buy them: these courses may be long and difficult, so they take users a lot of time and leave little room for additional courses. Conversely, popular courses may be short and fairly simple, so users buy similar courses alongside them. Further analysis requires additional information about course content, course duration, and completion rates.*