<a href="https://colab.research.google.com/github/TunaInABottle/data_mining_2022/blob/main/Collaborative_Filtering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!git clone https://github.com/TunaInABottle/data_mining_2022.git

Cloning into 'data_mining_2022'...
remote: Enumerating objects: 111, done.
remote: Counting objects: 100% (111/111), done.
remote: Compressing objects: 100% (80/80), done.
remote: Total 111 (delta 45), reused 82 (delta 25), pack-reused 0
Receiving objects: 100% (111/111), 53.63 KiB | 6.70 MiB/s, done.
Resolving deltas: 100% (45/45), done.


In [2]:
import pandas as pd 
import matplotlib.pyplot as plt
import numpy as np

<h4>Here we will work with the Euclidean distance (on normalized vectors it is monotonically related to the cosine distance, so both rank neighbours identically). <br>
In the data cleaning part, we want to transform the <strong>City</strong> and <strong>Profession</strong> columns into numerical data so that they can be used in the similarity measure. For this we need two datasets: <br>
<ul>
<li>A dataset of culturally similar cities (because people from the same culture tend to have similar tastes, behaviors, ...) </li>
<li>A dataset of similar occupations (we can first map the Profession column to categories such as science, art or law, and then to numerical data) </li>
</ul> </h4>
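
The relationship between the two distances can be checked numerically. For unit-length vectors the squared Euclidean distance equals 2 · (1 − cosine similarity), so the two measures order neighbours the same way even though their values differ. The sketch below uses made-up vectors, not values from the dataset.

```python
import numpy as np

# Two arbitrary feature vectors (illustrative values only)
a = np.array([38.0, 2.0, 5.0])
b = np.array([34.0, 1.0, 7.0])

# Scale both vectors to unit length
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

squared_euclidean = np.sum((a_unit - b_unit) ** 2)
cosine_similarity = np.dot(a_unit, b_unit)

# For unit vectors: ||a - b||^2 = 2 * (1 - cos(a, b))
print(np.isclose(squared_euclidean, 2 * (1 - cosine_similarity)))
```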

<h4>For the similarity measurement, two options are available: <br>
<ul>
<li>We can build a symmetric square matrix S of size n × n, where n = length(df), indicating the similarity between every pair of queries. <br>
However, computing the S matrix is computationally costly: it requires at least O(n²) time and memory. </li>
<li>Clustering </li>
</ul>
</h4>
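
As a rough illustration of the first option, the pairwise matrix can be built with NumPy broadcasting instead of nested Python loops. The cost is still O(n²) in time and memory, but the work moves into vectorized code. This is only a sketch: `pairwise_age_distance` is a hypothetical helper name, and it assumes missing ages are encoded as `np.nan` rather than the string `'Null'`.

```python
import numpy as np

def pairwise_age_distance(ages, fill=100.0):
    """Absolute age difference between every pair of queries.

    Pairs involving a missing age (np.nan) and the diagonal receive `fill`,
    a value large enough to mean "not similar".
    """
    ages = np.asarray(ages, dtype=float)
    S = np.abs(ages[:, None] - ages[None, :])  # broadcasting builds the n x n matrix
    S[np.isnan(S)] = fill                      # missing age on either side
    np.fill_diagonal(S, fill)                  # a query is not its own neighbour
    return S

# Ages of the first four queries; the second one has no age
S = pairwise_age_distance([38, np.nan, 34, 90])
print(S)
```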

<h4>For now, we will only work with the <strong>Age</strong> column. The rest of the workflow stays the same when other columns are added, since only the formula for the Euclidean distance needs to change.
</h4>

<h2> S Matrix
</h2>

In [3]:
# The S matrix contains the distance from each query to every other query;
# the distances kept for each query are normalized later.
# A generation is usually taken to span roughly 20 years,
# so we use 20 as the similarity threshold.

def S_Matrix (df, n):
  # n: positional index of the "Age" column
  size = df.shape[0]
  S = np.zeros((size, size))
  for i in range (size):
    # Entries that must never be selected (a query with itself, or a missing age)
    # are set to 100, well above the threshold of 20.
    S[i,i] = 100
    for j in range (i+1, size):
      if (df.iloc[i, n] != 'Null') and (df.iloc[j, n] != 'Null'):
        S[i,j] = abs(df.iloc[i, n] - df.iloc[j, n])
        S[j,i] = S[i,j]
      else:
        S[i,j] = 100
        S[j,i] = 100
  return S

In [43]:
from sklearn.preprocessing import normalize
# Add a new column to df, "Similar_Queries", which holds for each query
# an array of its most similar queries and the corresponding similarity values.
# These values are normalized and will be used as weights when calculating the missing ratings.

def similar(S, df):
  a = []
  for i in range (S.shape[0]):
    b = []
    for j in range (S.shape[0]):
      if S[i,j] < 20:  # keep only queries below the 20-year threshold
        b.append([j, S[i,j]])
    a.append(b)
  df['Similar_Queries'] = a
  df['Similar_Queries'] = df['Similar_Queries'].apply(lambda x: np.array(x).reshape(len(x), 2))
  for i in range (df.shape[0]):
    sim = df['Similar_Queries'].iloc[i]  # positional access: the index holds query ids, not integers
    if len(sim) != 0:
      # L1-normalize the distances ...
      d = normalize(sim[:,1].reshape(-1, 1), axis=0, norm='l1').ravel()
      # ... then invert and renormalize them, so that the nearest queries get the highest weights
      sim[:,1] = normalize((1 - d).reshape(-1, 1), axis=0, norm='l1').ravel()
  return df
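
The two normalization steps can be followed by hand on a small case. The sketch below mirrors them in plain NumPy instead of `sklearn.preprocessing.normalize`; `distances_to_weights` is a hypothetical helper name, not part of the notebook. It also shows an edge case of this weighting scheme: a query with a single neighbour at non-zero distance ends up with weight 0.

```python
import numpy as np

def distances_to_weights(distances):
    """L1-normalize the distances, invert them, then L1-normalize again."""
    d = np.asarray(distances, dtype=float)
    total = d.sum()
    d = d / total if total > 0 else d      # first normalization
    inverted = 1.0 - d                     # nearest query gets the largest value
    inv_total = inverted.sum()
    return inverted / inv_total if inv_total > 0 else inverted

# query_08 (age 104) against query_04 (age 90) and query_06 (age 96)
print(distances_to_weights([14, 8]))   # approximately [0.364, 0.636]

# A single neighbour at non-zero distance: the weight collapses to 0
print(distances_to_weights([4]))
```

The first result agrees with the weights listed for query_08 in the output table at the end of this section.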

In [11]:
#for a query that we want to rate, for a random user
#Similar queries that are not rated 
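
One possible way to turn the weights into a missing rating is a weighted average over the similar queries that the user has actually rated, skipping the unrated ones. This is an assumption about how the weights might be used, not the notebook's final method; `predict_rating` and the sample ratings are hypothetical.

```python
import numpy as np

def predict_rating(similar_queries, user_ratings):
    """Weighted average of one user's ratings over the similar queries.

    similar_queries: array of [query position, weight] rows
    user_ratings:    1-D array of that user's ratings, np.nan where unrated
    """
    similar_queries = np.asarray(similar_queries, dtype=float).reshape(-1, 2)
    positions = similar_queries[:, 0].astype(int)
    weights = similar_queries[:, 1]
    ratings = np.asarray(user_ratings, dtype=float)[positions]

    rated = ~np.isnan(ratings)             # ignore similar queries the user has not rated
    if not rated.any() or weights[rated].sum() == 0:
        return np.nan                      # nothing to base a prediction on
    return np.average(ratings[rated], weights=weights[rated])

# Hypothetical ratings of one user for four queries (np.nan = not rated)
ratings = np.array([80.0, np.nan, 60.0, np.nan])
neighbours = np.array([[0, 0.25], [1, 0.25], [2, 0.5]])
prediction = predict_rating(neighbours, ratings)
print(prediction)
```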

<h4>

*   Transform the Queries definition into a table, similar to the query content table
*   Calculate similarity of each query to the others (if possible), this is content-based similarity (list of similar queries, as well as their weights)

</h4>
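
The first step can be illustrated on a single query string. The sketch assumes query definitions follow the `attribute=value AND attribute=value` format, and `parse_query` is a hypothetical helper name. Attributes the query does not constrain are reported as `'Null'`.

```python
def parse_query(content):
    """Turn 'A=1 AND B=2' into {'A': '1', 'B': '2'}."""
    conditions = {}
    for part in content.split(' AND '):
        attribute, value = part.strip().split('=', 1)
        conditions[attribute.strip()] = value.strip()
    return conditions

# Hypothetical query definition in the assumed format
example = "Profession=financial analyst AND Age=90"
parsed = parse_query(example)
print(parsed)                       # {'Profession': 'financial analyst', 'Age': '90'}
print(parsed.get('City', 'Null'))   # Null
```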

In [30]:
df2= pd.read_csv("/content/data_mining_2022/data/size_30/queries.csv")

In [31]:
# Split each query definition into its "attribute=value" conditions
df2['content']=df2['content'].apply(lambda x: x.strip().split('AND'))

def f1(x):
  # turn every "attribute=value" string into an [attribute, value] pair
  for i in range (len(x)):
    x[i]=x[i].strip().split('=')
  return x 
df2['content']=df2['content'].apply(f1)

df2['content']=df2['content'].apply(lambda x: np.array(x).reshape(len(x),2))

def f2(x,c):
  # return the value of attribute c, or 'Null' if the query does not constrain it
  if c in x[:,0]:
    return x[x[:,0]==c, 1][0]
  else :
    return 'Null'
df2['Surname']=df2['content'].apply(lambda x: f2(x,'Surname'))
df2['Name']=df2['content'].apply(lambda x: f2(x,'Name'))
df2['City']=df2['content'].apply(lambda x: f2(x,'City'))
df2['Profession']=df2['content'].apply(lambda x: f2(x,'Profession'))
df2['Age']=df2['content'].apply(lambda x: f2(x,'Age'))

# ages are parsed as strings; convert them to integers where present
df2['Age']=df2['Age'].apply(lambda x: int(x) if x!='Null' else x)

df2=df2.set_index ('id')

In [44]:
S=S_Matrix (df2, df2.columns.get_loc('Age'))  # positional index of the "Age" column (5 here)
df2=similar(S, df2)

In [45]:
df2  # Note: the query indices in Similar_Queries are zero-based positions (0 corresponds to query_01)

id,content,Surname,Name,City,Profession,Age,Similar_Queries
query_01,"[[Profession, architect], [City, Rome], [Age, ...",Null,Ainsley,Rome,architect,38,"[[2.0, 0.15517241379310345], [13.0, 0.11206896..."
query_02,"[[Profession, pharmacist], [City, Moscow]]",Null,Null,Moscow,pharmacist,Null,[]
query_03,"[[Profession, advertising executive], [Age, 34]]",Null,Null,Null,advertising executive,34,"[[0.0, 0.15333333333333335], [13.0, 0.11666666..."
query_04,"[[Profession, financial analyst], [Age, 90]]",Null,Null,Null,financial analyst,90,"[[5.0, 0.39285714285714285], [7.0, 0.25], [12...."
query_05,"[[Profession, agricultural engineer], [Surname...",Johnson,Null,Null,agricultural engineer,Null,[]
query_06,"[[Profession, teacher], [Surname, Sanders], [A...",Sanders,Annie,Null,teacher,96,"[[3.0, 0.39285714285714285], [7.0, 0.357142857..."
query_07,"[[Profession, psychologist], [City, Dublin]]",Null,Null,Dublin,psychologist,Null,[]
query_08,"[[City, Odessa], [Surname, Hughes], [Age, 104]]",Hughes,Null,Odessa,Null,104,"[[3.0, 0.36363636363636365], [5.0, 0.636363636..."
query_09,"[[City, Tirana]]",Null,Null,Tirana,Null,Null,[]
query_10,"[[City, Toronto], [Age, 69]]",Null,Null,Toronto,Null,69,"[[12.0, 0.1333333333333333], [14.0, 0.86666666..."
