[View in Colaboratory](https://colab.research.google.com/github/PanthonImem/Content-Based-Lab-Notebook/blob/master/Content-based_Answer_key.ipynb)

# Implementing a Simple Content-based Recommender

This guide walks you through implementing a simple content-based recommender system in Python.

In a content-based recommender system, what we care about most is the **content** of the items we are recommending.

Note: This tutorial requires Python 3. You can check your version of Python by typing
'python --version' into your terminal.

(If you are using macOS and have already installed Python 3, try 'python3 --version' instead, as Python 2 and Python 3 can coexist on the same system.)

## Installing Packages

The packages used in this tutorial are already installed for you on Google Colab. If you are working locally, run the following commands in your terminal:

- pip3 install pandas

- pip3 install numpy

- pip3 install pythainlp

- pip3 install gensim


## Opening our data files with Pandas. 

Please feel free to skip this section if you are already familiar with Pandas. 

Python has a very convenient library for dealing with a large amount of data called **Pandas**.

We have provided several data files for you to use in **CSV (comma-separated values)** format. However, we use a semicolon as the separator instead of a comma, so that commas inside the data do not break the columns.
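To see why the separator matters, here is a minimal sketch on a made-up two-row sample (the column `note` and its values are hypothetical, not taken from the provided files). It only illustrates that `sep=';'` keeps commas inside a field intact.

```python
import io
import pandas

# Hypothetical sample in the same semicolon-separated layout as our files.
sample = io.StringIO(
    "userCode;project_id;note\n"
    "u1;7956;near park, quiet\n"
    "u2;5067;pool, gym\n"
)

# sep=';' makes pandas split on semicolons, so commas stay inside the field.
sample_df = pandas.read_csv(sample, sep=';')
print(sample_df)
```

If the same text were read with the default comma separator, the header and the rows would be split in the wrong places.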

Pandas has a class for working with tabular data called **DataFrame**.

In this section, we will explore some useful functions of the pandas.DataFrame class.


Before we use DataFrame, we have to import pandas.

In [1]:
import pandas

We will now try to open our .csv data file with pandas' **.read_csv**

You can also obtain basic statistics of the dataframe with **.describe()**

In [2]:
df = pandas.read_csv('https://raw.githubusercontent.com/PanthonImem/Content-Based-Lab-Notebook/master/userLog_201801_201802_for_participants.csv', sep = ';')
df.describe()


Unnamed: 0,project_id,year,month,day,hour
count,1234579.0,1234579.0,1234579.0,1234579.0,1234579.0
mean,6401.208,2018.0,1.427945,14.76546,14.24961
std,2197.497,0.0,0.494781,8.233066,5.774527
min,4.0,2018.0,1.0,1.0,0.0
25%,4928.0,2018.0,1.0,8.0,11.0
50%,6446.0,2018.0,1.0,15.0,15.0
75%,8428.0,2018.0,2.0,21.0,19.0
max,9504.0,2018.0,2.0,31.0,23.0


**.head(n)** shows the first n rows of the dataframe. 

**.tail(n)** shows the last n rows of the dataframe.

In [3]:
df.head(10)

Unnamed: 0,userCode,project_id,requestedDevice,userAgent,pageReferrer,year,month,day,hour
0,7717bdc2-ea3e-e8ad-5d6b-178bd71c38b2,7956,Mobile,Android,HomeWebsite,2018,1,1,0
1,7717bdc2-ea3e-e8ad-5d6b-178bd71c38b2,7956,Mobile,Android,HomeWebsite,2018,1,1,0
2,cb5b4b68-cc01-6db6-f54b-4a0f881301c5,5067,Mobile,iPhone,HomeWebsite,2018,1,1,0
3,5f74cef2-0d1e-b619-3564-0955a14e0985,6654,Mobile,iPhone,Google,2018,1,1,0
4,dba8f279-844e-eef6-73ac-22bd7d1353cc,6474,Mobile,iPad,Google,2018,1,1,0
5,1bce397f-9edd-6f75-40ea-408a27ddb0e8,6297,Mobile,Android,Other_PageReferer,2018,1,1,0
6,dba8f279-844e-eef6-73ac-22bd7d1353cc,6474,Mobile,iPad,Google,2018,1,1,0
7,7717bdc2-ea3e-e8ad-5d6b-178bd71c38b2,7956,Mobile,Android,HomeWebsite,2018,1,1,0
8,9794fe34-fb0f-4242-8a35-7610cb1e0ee8,9323,Mobile,iPad,Facebook,2018,1,1,0
9,812246ad-97a8-2a9d-457b-4b5bb521e20c,5764,Mobile,Android,Facebook,2018,1,1,0


**.unique()** gives you an array of all possible values of a column in the dataframe. 

In [4]:
df['userAgent'].unique()

array(['Android', 'iPhone', 'iPad', 'Windows', 'Macintosh', 'Other_OS'],
      dtype=object)

Below is how you can modify the dataframe to only have the columns that you desire. 

In [5]:
df_b = df[['userCode','project_id']]
df_b.head(30)

Unnamed: 0,userCode,project_id
0,7717bdc2-ea3e-e8ad-5d6b-178bd71c38b2,7956
1,7717bdc2-ea3e-e8ad-5d6b-178bd71c38b2,7956
2,cb5b4b68-cc01-6db6-f54b-4a0f881301c5,5067
3,5f74cef2-0d1e-b619-3564-0955a14e0985,6654
4,dba8f279-844e-eef6-73ac-22bd7d1353cc,6474
5,1bce397f-9edd-6f75-40ea-408a27ddb0e8,6297
6,dba8f279-844e-eef6-73ac-22bd7d1353cc,6474
7,7717bdc2-ea3e-e8ad-5d6b-178bd71c38b2,7956
8,9794fe34-fb0f-4242-8a35-7610cb1e0ee8,9323
9,812246ad-97a8-2a9d-457b-4b5bb521e20c,5764




Below is a list of useful pandas.dataframe attributes.

dataframe**.shape** gives a tuple representing the size of the dataframe in the form (rows, columns)

dataframe**.iloc[row][column]** selects by integer position (for label-based selection, use **loc** instead of **iloc**)

dataframe**.at[row,column]** lets you access the single value at a specific row and column label

You can find out more about pandas [here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html)
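The difference between position-based and label-based access is easiest to see on a frame whose index is not 0, 1, 2, and so on. The sketch below uses a made-up three-row frame (the values are hypothetical).

```python
import pandas

# Toy frame with a non-default index, so labels and positions differ.
toy = pandas.DataFrame(
    {'userCode': ['a', 'b', 'c'], 'project_id': [10, 20, 30]},
    index=[100, 200, 300],
)

by_position = toy.iloc[0]['project_id']   # first row, whatever its label is
by_label = toy.loc[300, 'project_id']     # row whose index label is 300
single_value = toy.at[200, 'userCode']    # scalar access by row and column label
print(by_position, by_label, single_value)
```

This distinction matters after sorting or filtering, when the index labels no longer match the row positions.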


In [6]:
print(df_b.shape)
print(df_b.iloc[0][1])

#note that we can use the column name instead of integer-indexing too. 
#This is convenient if we ever manipulate our dataframe, which we likely will
print(df_b.iloc[0]['userCode'])

#Note that writing df_copy = df simply creates a reference to the same object.
#If you do not want to modify your original dataframe, use df.copy()
df_copy = df_b.copy(deep=True)
df_copy.at[0,'userCode']= 'Dummy Text'
print(df_copy.iloc[0]['userCode'])

#the original dataframe remains the same
print(df_b.iloc[0]['userCode'])

(1234579, 2)
7956
7717bdc2-ea3e-e8ad-5d6b-178bd71c38b2
Dummy Text
7717bdc2-ea3e-e8ad-5d6b-178bd71c38b2


### TO DO #1:  Warming Up (Sorting)
Try sorting the dataframe by

1) userCode

2) userCode but descending

3) userCode as the primary sort axis(ascending), project_id as the secondary sort axis(descending)

Note1: Check the documentation for dataframe .sort_values() [here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html#pandas.DataFrame.sort_values).

Note2: You may want to use .reset_index(drop = True) after sorting. Try it out and see what happens. 

In [7]:
df_copy = df_b.copy(deep=True)

#manipulate df_copy, not df_b or df


#Empty this cell for participant
#print(df_copy.sort_values('userCode'))
#print(df_copy.sort_values('userCode', ascending = False))
#print(df_copy.sort_values(['userCode','project_id'], ascending = [True, False]))

There is also an operation called **groupby**, which the [documentation](https://pandas.pydata.org/pandas-docs/stable/groupby.html) nicely sums up in three words: split-apply-combine.

Below is one example of how you can use groupby to

1) split by userCode

2) apply the function 'count'

3) combine the results back together
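The three steps above can be sketched on a tiny made-up frame (the values are hypothetical). It contrasts an aggregation, which returns one row per group, with `transform`, which returns one value per original row.

```python
import pandas

toy = pandas.DataFrame({
    'userCode': ['a', 'b', 'a', 'a'],
    'project_id': [1, 2, 3, 1],
})

# Aggregation: one row per userCode.
per_user = toy.groupby('userCode')['project_id'].count()

# Transform: same length as the original frame, so it can be stored as a column.
toy['num_browse'] = toy.groupby('userCode')['project_id'].transform('count')

print(per_user)
print(toy)
```

Use the aggregated form when you want a per-user summary, and the transform form when every log row should carry its user's count.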

In [8]:
df_b = df_b.sort_values('userCode')
#select a single column before transform so that the result is a Series
df_b['num_browse'] = df_b.groupby('userCode')['project_id'].transform('count')
df_b = df_b.reset_index()
df_b.head(20)


Unnamed: 0,index,userCode,project_id,num_browse
0,847445,00005aba-5ebc-0821-f5a9-bacca40be125,5342,1
1,66594,0000bae7-6233-d7cc-2a6d-48aa70fe8ad4,5678,1
2,314280,0000c576-e929-19eb-615a-349ec3b4709b,6461,1
3,511353,0000d196-6385-80b8-661d-b7427042daa3,9040,1
4,57898,0000e1e2-f595-0ae7-860f-fcc07dcb116e,6709,2
5,58475,0000e1e2-f595-0ae7-860f-fcc07dcb116e,6712,2
6,533568,0000fa46-1f0b-9504-b568-43479d17620e,8829,6
7,533614,0000fa46-1f0b-9504-b568-43479d17620e,4703,6
8,533715,0000fa46-1f0b-9504-b568-43479d17620e,4703,6
9,533718,0000fa46-1f0b-9504-b568-43479d17620e,8829,6


## Data Exploration and Preprocessing

### TO DO #2: Find Top Viewers

Find the maximum number of times that a single user appears in our log.

Hint: Utilize num_browse, dataframe.drop_duplicates, and .sort_values()

Answer Check: The top viewer is 'de89bac5-57c6-ecfb-184d-cc4e973c31ac'. If you get this user as the top viewer, your result is correct. 

In [9]:
df_b =  df_b[['userCode','num_browse']].sort_values('num_browse', ascending = False).drop_duplicates(subset = 'userCode')
df_b.head(20)

Unnamed: 0,userCode,num_browse
1077297,de89bac5-57c6-ecfb-184d-cc4e973c31ac,24308
233976,31bb9bf0-8ad5-3e50-f334-e6c7de89bac5,8238
304068,3f0b3f64-468d-8411-efee-debb3facd532,1641
1152955,eec0a125-d5ab-f692-c501-4713c35c756d,1533
875884,b883abba-dc3e-f30d-a571-98fffeffe88f,1362
210487,2d023fed-0cad-0f48-c7c8-77ce40e36d0f,1261
726185,987779b7-042c-ddea-e43c-5132045fe84e,1138
433255,5a5c8ed0-dee9-beff-0a91-4c65d31e3744,954
652917,894a4600-3006-06c5-bff9-cd37a3618bc1,806
546549,72bd11b1-1880-2468-a691-78a33f0b3f64,790


We can see that some users have browsed an absurd number of projects. 

It is up to your discretion as to whether you want to train your model with these users, as some might be bots and do not represent the user behaviour that we are trying to model.

A simple way to screen out users who have viewed too many projects is as follows:

In [10]:
df_b[df_b.num_browse<50].head(20)

Unnamed: 0,userCode,num_browse
447543,5d5558eb-2bd5-ecff-5ba4-f756af3b0c41,49
139078,1e337972-3f0f-8583-3c26-83b3ec3d0a20,49
58951,0cfa6658-0234-ab1d-a919-a89a72a33747,49
1167889,f1bcd2e3-eed6-b5ca-d8e8-b433cf8db704,49
647760,882cbb2a-8c24-4624-74f6-252006893e0f,49
288922,3c48e9b3-daa0-9292-77b9-28f2646c1b74,49
459696,60163bf9-e667-470c-e124-86b9fd7da9f7,49
1117962,e716fb5e-516c-e18d-72a2-b430d80c3097,49
444850,5cafc0f4-afcc-6a91-d84f-0e1e7f023c13,49
109837,17f0dad0-a448-88ba-a8b8-5e3b82374cf7,49


However, this way of screening might not be what we want. A user who browses one project every day will have browsed ~60 projects by the end of two months, whereas a user who browses ~60 projects within a minute is most likely a bot (or is not paying attention to the content of the pages, in which case the user should be screened out anyway).
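One possible way to act on this idea is to look at how many views a user makes within a short window rather than in total. The log only records the hour of each view, so the sketch below counts views per user per hour on a made-up log. The threshold `MAX_VIEWS_PER_HOUR` and the sample rows are assumptions for illustration, not values derived from the real data.

```python
import pandas

# Hypothetical log: one user with a burst of views, one with spread-out views.
log = pandas.DataFrame({
    'userCode': ['bot'] * 6 + ['human'] * 3,
    'year': [2018] * 9,
    'month': [1] * 9,
    'day': [1] * 6 + [1, 2, 3],
    'hour': [0] * 6 + [9, 20, 8],
})

MAX_VIEWS_PER_HOUR = 5  # assumed threshold, tune it on the real log

# Count the views each user made in each hour, then take each user's busiest hour.
views_per_hour = log.groupby(['userCode', 'year', 'month', 'day', 'hour']).size()
peak = views_per_hour.groupby(level='userCode').max()

suspected_bots = set(peak[peak > MAX_VIEWS_PER_HOUR].index)
filtered = log[~log['userCode'].isin(suspected_bots)]
print(suspected_bots)
print(filtered)
```

A steady user with many views overall passes this filter, while a user with a burst inside a single hour does not.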

### TO DO #3: Write Your Own Data Preprocessing Function
Preprocess your data. Please keep all the original columns, but feel free to add more. 

In [11]:
def preprocess(dataframe):
    
    #Insert your code here
    return dataframe

df = pandas.read_csv('https://raw.githubusercontent.com/PanthonImem/Content-Based-Lab-Notebook/master/userLog_201801_201802_for_participants.csv', sep = ';')
df = preprocess(df) 

## Splitting Training-Testing Data set

In order to evaluate your model, you should not train your model with all the data you have, but rather split a part specifically for testing. Whether you want to split by date or not will depend on your preference.

Below is an example of how you can split your data by date.

In [12]:
import datetime
#this line takes a while, as we are processing more than 1,000,000 rows
df['date'] = df.apply(lambda row : datetime.datetime(row['year'], row['month'], row['day'], row['hour']).date(), axis=1)
df = df[['userCode', 'project_id','date']]
test_date = datetime.date(2018,2,14)
df_train = df[df.date<test_date]
df_test = df[df.date>=test_date]

df_test.head(15)

df = df_train
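If the row-by-row `apply` above is too slow, a vectorized alternative is to let pandas assemble the dates from the `year`, `month` and `day` columns. The sketch below runs on a made-up two-row frame. It keeps the dates as pandas timestamps rather than `datetime.date` objects, which is a deliberate difference from the cell above.

```python
import pandas

# Hypothetical rows with the same date columns as the user log.
toy = pandas.DataFrame({
    'project_id': [7956, 5067],
    'year': [2018, 2018],
    'month': [1, 2],
    'day': [31, 14],
    'hour': [23, 0],
})

# to_datetime assembles a datetime column from columns named year/month/day.
toy['date'] = pandas.to_datetime(toy[['year', 'month', 'day']])

test_date = pandas.Timestamp('2018-02-14')
toy_train = toy[toy['date'] < test_date]
toy_test = toy[toy['date'] >= test_date]
print(toy_train)
print(toy_test)
```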


## English and Thai Word Extraction

While we have a considerable number of features explicitly stored in our data file, a significant amount of content can be mined from the textual description of each item. Take project number 1698 as an example:

1698: คอนโดมิเนียมหรูบนชายหาดส่วนตัว บรรยากาศเงียบสงบ ห้องชุดริมทะเล พร้อมสระว่ายน้ำส่วนตัว และระเบียงโค้งใส เชื่อมความต่อเนื่องกับทะเลสีครามภายนอกที่อยู่ห่างเพียง 30 เมตร ตกแต่งเฟอร์นิเจอร์ครบ

From the description, we can find out that the project is 1) a condominium 2) has a swimming pool 3) is near a beach 4) has a balcony 5) comes with furniture

Unfortunately, words in Thai descriptions are harder to extract, since they are not separated by spaces as in English.

In order to extract these features, we will want to extract each word from the text chunk first.

We will start by using a python library called **pythainlp**.

In [13]:
!pip3 install pythainlp
!pip3 install stop_words
from pythainlp import word_tokenize
from pythainlp.corpus.newthaiword import get_data 
from pythainlp.corpus import alphabet
from pythainlp.corpus import stopwords
from stop_words import get_stop_words
import string



Before we split the string into words, we should remove all punctuation first.
The string of all ASCII punctuation characters can be obtained from string.punctuation.

In [14]:
text2 = "คอนโดมิเนียมหรูบนชายหาดส่วนตัว, บรรยากาศเงียบสงบ, ห้องชุดริมทะเล พร้อมสระว่ายน้ำส่วนตัว และระเบียงโค้งใส เชื่อมความต่อเนื่องกับทะเลสีครามภายนอกที่อยู่ห่างเพียง 30 เมตร ตกแต่งเฟอร์นิเจอร์ครบ"

text = ""
for c in text2:
    if c not in string.punctuation:
        text = text+c
    else:
        text = text+" "
print(text)

คอนโดมิเนียมหรูบนชายหาดส่วนตัว  บรรยากาศเงียบสงบ  ห้องชุดริมทะเล พร้อมสระว่ายน้ำส่วนตัว และระเบียงโค้งใส เชื่อมความต่อเนื่องกับทะเลสีครามภายนอกที่อยู่ห่างเพียง 30 เมตร ตกแต่งเฟอร์นิเจอร์ครบ
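The character-by-character loop above can also be written with a translation table, which handles the whole string in one call. This is a minimal sketch; the helper name `strip_punctuation` is made up for illustration. Like the loop, it only covers the ASCII characters in `string.punctuation`.

```python
import string

# Map every ASCII punctuation character to a single space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

def strip_punctuation(raw):
    """Return raw with each ASCII punctuation character replaced by a space."""
    return raw.translate(PUNCT_TO_SPACE)

print(strip_punctuation("pool, gym and sea-view!"))
```

Replacing with a space instead of deleting keeps neighbouring words apart, which is the same choice the loop makes.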


Now we will split the text chunk into a list of words using word_tokenize from Pythainlp.

In [15]:
#newmm is a dictionary-based maximum matching engine; you can try a different engine, such as longest matching
tokens = word_tokenize(text, engine='newmm')

print(tokens)

['คอนโดมิเนียม', 'หรู', 'บน', 'ชายหาด', 'ส่วนตัว', '  ', 'บรรยากาศ', 'เงียบสงบ', '  ', 'ห้องชุด', 'ริมทะเล', ' ', 'พร้อม', 'สระว่ายน้ำ', 'ส่วนตัว', ' ', 'และ', 'ระเบียง', 'โค้ง', 'ใส', ' ', 'เชื่อม', 'ความต่อเนื่อง', 'กับ', 'ทะเล', 'สี', 'คราม', 'ภายนอก', 'ที่อยู่', 'ห่าง', 'เพียง', ' ', '30', ' ', 'เมตร', ' ', 'ตกแต่ง', 'เฟอร์นิเจอร์', 'ครบ']


We now have a list of Thai words. However, since we will be using these words as features, some words such as 'และ' or 'กับ' clearly do not belong here. We call these words **stopwords**. Luckily, pythainlp has a list of stopwords that we can use without having to create one ourselves. Let's have a look at our stopwords.

In [16]:
words = stopwords.words('thai')
#print(words)

Just to be safe, you should also remove English stopwords.

In [17]:
en_stop = get_stop_words('en')

#Get rid of stopwords. 
stopped_tokens = [i for i in tokens if not i in words and not i in en_stop]
print('This is our text after we remove the stop_words: ')
print(stopped_tokens)



This is our text after we remove the stop_words: 
['คอนโดมิเนียม', 'หรู', 'ชายหาด', 'ส่วนตัว', '  ', 'บรรยากาศ', 'เงียบสงบ', '  ', 'ห้องชุด', 'ริมทะเล', ' ', 'สระว่ายน้ำ', 'ส่วนตัว', ' ', 'ระเบียง', 'โค้ง', 'ใส', ' ', 'เชื่อม', 'ความต่อเนื่อง', 'ทะเล', 'สี', 'คราม', 'ที่อยู่', 'ห่าง', ' ', '30', ' ', 'เมตร', ' ', 'ตกแต่ง', 'เฟอร์นิเจอร์']


We can improve this further by reducing English words to their stems. Words like 'cute' and 'cuteness' carry more or less the same meaning, so we can reduce both to a common stem with the Porter stemmer from NLTK.

In [18]:
from nltk.stem.porter import PorterStemmer
p_stemmer = PorterStemmer()
ls = ['cuteness', 'cute', 'aggressive', 'aggression']
print('Before: ', ls)
stemmed_tokens = [p_stemmer.stem(i) for i in ls]
print('After: ',stemmed_tokens)

Before:  ['cuteness', 'cute', 'aggressive', 'aggression']
After:  ['cute', 'cute', 'aggress', 'aggress']


Notice that each pair of words has been reduced to the same stem.

In [19]:
stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]
print(stemmed_tokens)

['คอนโดมิเนียม', 'หรู', 'ชายหาด', 'ส่วนตัว', '  ', 'บรรยากาศ', 'เงียบสงบ', '  ', 'ห้องชุด', 'ริมทะเล', ' ', 'สระว่ายน้ำ', 'ส่วนตัว', ' ', 'ระเบียง', 'โค้ง', 'ใส', ' ', 'เชื่อม', 'ความต่อเนื่อง', 'ทะเล', 'สี', 'คราม', 'ที่อยู่', 'ห่าง', ' ', '30', ' ', 'เมตร', ' ', 'ตกแต่ง', 'เฟอร์นิเจอร์']


Next, we will remove single characters and numbers like '30' from our data.

In [20]:
single_alpha_num_tokens = [i for i in stemmed_tokens if not i in alphabet.get_data() and not i.isnumeric()]
print(single_alpha_num_tokens)

['คอนโดมิเนียม', 'หรู', 'ชายหาด', 'ส่วนตัว', '  ', 'บรรยากาศ', 'เงียบสงบ', '  ', 'ห้องชุด', 'ริมทะเล', ' ', 'สระว่ายน้ำ', 'ส่วนตัว', ' ', 'ระเบียง', 'โค้ง', 'ใส', ' ', 'เชื่อม', 'ความต่อเนื่อง', 'ทะเล', 'สี', 'คราม', 'ที่อยู่', 'ห่าง', ' ', ' ', 'เมตร', ' ', 'ตกแต่ง', 'เฟอร์นิเจอร์']


As you can see, meaningless elements like '  ' still remain.

Write some Python code to get rid of those.

In [21]:
deletelist = [' ','  ','   ', '    ']
tokens = [i for i in single_alpha_num_tokens if not i in deletelist]
print(tokens)

['คอนโดมิเนียม', 'หรู', 'ชายหาด', 'ส่วนตัว', 'บรรยากาศ', 'เงียบสงบ', 'ห้องชุด', 'ริมทะเล', 'สระว่ายน้ำ', 'ส่วนตัว', 'ระเบียง', 'โค้ง', 'ใส', 'เชื่อม', 'ความต่อเนื่อง', 'ทะเล', 'สี', 'คราม', 'ที่อยู่', 'ห่าง', 'เมตร', 'ตกแต่ง', 'เฟอร์นิเจอร์']
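A fixed list of space strings only catches the lengths it contains. As a sketch of a more general filter, the helper below (the name `drop_blank_tokens` is made up for illustration) keeps a token only if something remains after stripping whitespace.

```python
def drop_blank_tokens(tokens):
    """Keep only tokens that contain at least one non-whitespace character."""
    return [tok for tok in tokens if tok.strip()]

sample_tokens = ['ชายหาด', ' ', 'ส่วนตัว', '  ', '\t', 'เมตร']
print(drop_blank_tokens(sample_tokens))
```

This also removes tabs, newlines and runs of spaces of any length.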


### TO DO #4: Create Your Own Word Splitting Function

Create your own function for splitting a chunk of Thai text into a list of words.

Your function will take a string as an input, and return a list.

Feel free to use the code above or experiment with different methods. 

Name this function split_word for future use within this notebook.

In [22]:
text = 'BOLD Your Space คิด...ให้...ครบ ทุกขนาดไลฟ์สไตล์ ดีไซน์ชีวิตให้เด่น ไลฟ์สไตล์ให้เดิร์นสุดๆ กับสถาปัตยกรรมอาคารที่มีเอกลักษณ์ชัดเจนเหมือนคนรุ่นใหม่เช่นคุณ ด้วยอาคารสีเทาดำ เรียบเท่ ทางเข้าและโถงต้อนรับสไตล์ Extravagant Hotel Lobby ให้ความโอ่อ่า หรูหรา กว่าคอนโดมิเนียมทั่วไป การตกแต่งภายในใส่ใจในทุกรายละเอียด ตอบสนองไลฟ์สไตล์คนรุ่นใหม่ได้สูงสุด ด้วย \'\'เฟอร์นิเจอร์ดีไซน์พิเศษ\'\' มีสระว่ายน้ำ ฟิตเนสเห็นวิวสระว่ายน้ำ เป็นสระออกกำลังกาย'

def split_word(text2):
    
    #empty this for participants
    words = stopwords.words('thai')
    en_stop = get_stop_words('en')
    p_stemmer = PorterStemmer()
    # remove punctuation
    text = ""
    for c in text2:
        if c not in string.punctuation:
            text = text+c
        else:
            text = text+" "
    tokens = word_tokenize(text, engine='newmm')
    # remove stop words
    stopped_tokens = [i for i in tokens if not i in words and not i in en_stop]
    # stem words
    stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]
    # remove single alphabet and number
    single_alpha_num_tokens = [i for i in stemmed_tokens if not i in alphabet.get_data() and not i.isnumeric()]
    deletelist = [' ', '  ', '   ','none','    ']
    tokens = [i for i in single_alpha_num_tokens if not i in deletelist]
    
    return tokens

print(split_word(text))

['bold', 'your', 'space', 'ขนาด', 'ไลฟ์สไตล์', 'ดีไซน์', 'ชีวิต', 'เด่น', 'ไลฟ์สไตล์', 'เดิร์น', 'สถาปัตยกรรม', 'อาคาร', 'เอกลักษณ์', 'ชัดเจน', 'เหมือน', 'คนรุ่นใหม่', 'อาคาร', 'สีเทา', 'ดำ', 'เท่', 'ทางเข้า', 'โถง', 'ต้อนรับ', 'สไตล์', 'extravag', 'hotel', 'lobbi', 'ความโอ่อ่า', 'หรูหรา', 'คอนโดมิเนียม', 'ทั่วไป', 'ตกแต่งภายใน', 'ใส่ใจ', 'รายละเอียด', 'ตอบสนอง', 'ไลฟ์สไตล์', 'คนรุ่นใหม่', 'เฟอร์นิเจอร์', 'ดีไซน์', 'พิเศษ', 'สระว่ายน้ำ', 'ฟิต', 'เน', 'วิว', 'สระว่ายน้ำ', 'สระ', 'ออกกำลังกาย']


## Latent Dirichlet Allocation!

While we could use every one of these words as a project feature, there are far too many of them, maybe 100,000 or more in total across our ~5,000 projects. That many features would make the model too computationally expensive.

Instead, we will use Latent Dirichlet Allocation (LDA) to compute topic probabilities for each description and use those as features, rather than using each word as a feature.

We will start by importing a python library called **gensim**.

Note that in order to save time (since LDA takes a while to train), we will only train the model on the descriptions of projects whose project_id is below 3000 in this tutorial. You should use project descriptions from the entire data file to train your own model.

The model takes a term dictionary and a document-term matrix as input. We have handled the dictionary and corpus preparation for you, but you can find out more about it [here](https://radimrehurek.com/gensim/models/ldamodel.html).

We have also fixed the random state. In reality, you might want to remove this and train the model a couple of times and see which result you like the most. 

In [23]:
!pip3 install gensim
import gensim
from gensim import corpora, models



In [24]:
inpt_list = []

#Obtain the descriptions of the projects whose project_id is below 3000
proj_df = pandas.read_csv('https://raw.githubusercontent.com/PanthonImem/Content-Based-Lab-Notebook/master/project_description.csv', sep = ';').fillna('None')
proj_df = proj_df.sort_values('project_id')
proj_df = proj_df[proj_df.project_id<3000]
proj_df = proj_df.reset_index(drop = True)
for i in range(proj_df.shape[0]):
    inpt_list.append(split_word(proj_df.iloc[i]['description_th']))

#We will turn this into a term dictionary for our model

# turn our tokenized documents into a id <-> term dictionary
dictionary = corpora.Dictionary(inpt_list)
dict2 = {dictionary[ID]:ID for ID in dictionary.keys()  }

# convert tokenized documents into a document-term matrix
corpus = [dictionary.doc2bow(text) for text in inpt_list]
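To make the dictionary and document-term matrix less abstract, here is a plain-Python sketch of the same idea on two made-up documents. It is not gensim's implementation, and the ids gensim assigns may differ. It only shows the shape of the data: each document becomes a list of (token id, count) pairs.

```python
from collections import Counter

# Hypothetical tokenized documents, as split_word would return them.
docs = [['pool', 'sea', 'pool'], ['sea', 'garden']]

# Assign an integer id to each distinct token, in order of first appearance.
token2id = {}
for doc in docs:
    for tok in doc:
        token2id.setdefault(tok, len(token2id))

def to_bow(doc):
    """Return a sorted list of (token id, count) pairs, ignoring unknown tokens."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())

bows = [to_bow(doc) for doc in docs]
print(token2id)
print(bows)
```

Tokens that never appeared in the training documents have no id, so they are dropped when a new text is converted.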

We will now train the LDA model. 

Note that the top words for each topic might not make much sense to you. The topics are an underlying structure found by the model (hence the name 'Latent' in Latent Dirichlet Allocation). You can retrain the model if you are unsure.

Also, note that if you remove the fixed random state, the topics will be different on every execution, so once you obtain a desirable result, load the model from the saved file instead of retraining.

In [25]:
# generate LDA model
num_top = 8
num_words = 8
num_it = 50
ldamodel = gensim.models.ldamodel.LdaModel(corpus,num_top, id2word = dictionary, random_state = 2, passes=num_it)
ldamodel.show_topics(num_top, num_words, log=True, formatted=False)

#ldamodel.save('HDTwork/lda' + str(num_top) + '_Topics_' + str(num_it) + '_Passes.model')


[(0,
  [('โครงการ', 0.017890211),
   ('ทำเล', 0.013937739),
   ('n', 0.010907112),
   ('ถนน', 0.010448513),
   ('บ้านเดี่ยว', 0.007971386),
   ('สะดวกสบาย', 0.007641118),
   ('แหล่ง', 0.0074599786),
   ('ศักยภาพ', 0.007253299)]),
 (1,
  [('นิ', 0.049044136),
   ('ยู', 0.04669608),
   ('ชั้น', 0.04065418),
   ('อาคาร', 0.03313869),
   ('จำนวน', 0.031437453),
   ('เมตร', 0.020785395),
   ('คอนโดมิเนียม', 0.019199815),
   ('ห้องชุด', 0.012328168)]),
 (2,
  [('ชั้น', 0.020443456),
   ('ชีวิต', 0.014535136),
   ('อาคาร', 0.011951573),
   ('รูปแบบ', 0.011647448),
   ('n', 0.011393745),
   ('ห้อง', 0.011099728),
   ('ออกแบบ', 0.009578494),
   ('พื้นที่', 0.009307983)]),
 (3,
  [('รถไฟฟ้า', 0.023896594),
   ('เดินทาง', 0.022971153),
   ('สถานี', 0.022267966),
   ('ถนน', 0.017102845),
   ('สาย', 0.014412835),
   ('ทำเล', 0.012076355),
   ('คอนโด', 0.011703157),
   ('ชีวิต', 0.011432385)]),
 (4,
  [('บ้าน', 0.032400567),
   ('พื้นที่', 0.023777794),
   ('ใช้สอย', 0.016938018),
   ('ห้องนอน', 0.0

We can see that the LDA model nicely generated a topic probability profile for us. 

We will now try to get the topic probabilities for a piece of text.


In [26]:
#ldamodel = models.LdaModel.load('HDTwork/lda' + str(num_top) + '_Topics_' + str(num_it) + '_Passes.model')
def print_topic(text):
    tokens = split_word(text)
    bow = dictionary.doc2bow(tokens)
    temp = ldamodel.get_document_topics(bow,minimum_probability=0.1, minimum_phi_value=None,per_word_topics=False)
    print(text)
    print(temp)
    print('-------------------------------------')
text = [0]*4
text[0] = 'บ้านหลังใหญ่ในสวน ท่ามกลางธรรมชาติ แวดล้อมด้วยสวนป่าธรรมชาติ ต้นไม้ และสายน้ำ ต้นไม้ใหญ่เป็นไม้จากป่า ปลูกร่วม 15 ปี สามารถเข้าถึงสวนได้อย่างสะดวกจากที่พักอาศัย แบ่งพื้นที่ในสวนเป็นส่วนต่าง ๆ ล้อมลานกิจกรรมขนาดใหญ่ สถานที่ออกกำลังกาย ลานสนามหญ้าโล่งกว้างริมน้ำ สำหรับนั่งพักผ่อนชมสวนได้ตลอดเวลา'
text[1] = 'โครงการเป็นอาคารพักอาศัยสูง 16 ชั้น จำนวน 2 อาคาร และอาคารคลับเฮ้าส์ จำนวน 448 ยูนิต+ 3 ยูนิตเพื่อการพาณิชย์ '
text[2] = 'ทาวน์โฮมหน้ากว้าง ให้ความรู้สึกเหมือนบ้านเดี่ยว ออกแบบเพดานสูงทำให้รู้สึกโปร่สบาย ไม่อึดอัด ออกแบบพื้นที่ใช่สอยได้อย่างเต็มพื้นที่'
text[3] = 'ค้นพบมุมมองใหม่ของการใช้ชีวิต กับดีไซต์ที่ตอบสนองแนวคิดที่แตกต่าง บนทำเลที่สะดวกสบาย ใกล้ท่าเรือแหลมฉบัง'

for i in range(4):
    print_topic(text[i])

บ้านหลังใหญ่ในสวน ท่ามกลางธรรมชาติ แวดล้อมด้วยสวนป่าธรรมชาติ ต้นไม้ และสายน้ำ ต้นไม้ใหญ่เป็นไม้จากป่า ปลูกร่วม 15 ปี สามารถเข้าถึงสวนได้อย่างสะดวกจากที่พักอาศัย แบ่งพื้นที่ในสวนเป็นส่วนต่าง ๆ ล้อมลานกิจกรรมขนาดใหญ่ สถานที่ออกกำลังกาย ลานสนามหญ้าโล่งกว้างริมน้ำ สำหรับนั่งพักผ่อนชมสวนได้ตลอดเวลา
[(4, 0.24236691), (7, 0.64153814)]
-------------------------------------
โครงการเป็นอาคารพักอาศัยสูง 16 ชั้น จำนวน 2 อาคาร และอาคารคลับเฮ้าส์ จำนวน 448 ยูนิต+ 3 ยูนิตเพื่อการพาณิชย์ 
[(1, 0.8459821), (6, 0.10041111)]
-------------------------------------
ทาวน์โฮมหน้ากว้าง ให้ความรู้สึกเหมือนบ้านเดี่ยว ออกแบบเพดานสูงทำให้รู้สึกโปร่สบาย ไม่อึดอัด ออกแบบพื้นที่ใช่สอยได้อย่างเต็มพื้นที่
[(4, 0.8107192)]
-------------------------------------
ค้นพบมุมมองใหม่ของการใช้ชีวิต กับดีไซต์ที่ตอบสนองแนวคิดที่แตกต่าง บนทำเลที่สะดวกสบาย ใกล้ท่าเรือแหลมฉบัง
[(0, 0.6729826), (4, 0.25193408)]
-------------------------------------


We can use the values that the LDA gave us as the features of the project. 

Note that the LDA gave us probabilities, not just the single topic that the project most likely falls under, so we can see that each text also belongs a little to other topics.
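The output above is a sparse list of (topic id, probability) pairs, and topics below the probability threshold are left out. To use it as a feature vector, one option is to expand it to a fixed-length list. The helper name `to_dense` below is made up for illustration, and the input is the first result printed above.

```python
def to_dense(topic_probs, num_topics):
    """Expand [(topic_id, prob), ...] into a list of length num_topics."""
    vector = [0.0] * num_topics
    for topic_id, prob in topic_probs:
        vector[topic_id] = prob
    return vector

sparse = [(4, 0.24236691), (7, 0.64153814)]
dense = to_dense(sparse, 8)
print(dense)
```

Every project then has a vector of the same length, which is what a similarity measure between projects and users needs.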

## User Profile

Now that we can create an LDA topic profile for every project, we want a way to capture the interests of our user.

One simple way is to average the topic profiles of the projects the user has browsed, which is what we will do in this exercise.

Suppose that user '0003d990-37d2-140c-a68e-060bc1e81ea5' has browsed 3 projects over 5 page views:




In [27]:
df = pandas.read_csv('https://raw.githubusercontent.com/PanthonImem/Content-Based-Lab-Notebook/master/userLog_201801_201802_for_participants.csv', sep = ';')

#Only keep the rows where userCode is the desired user
df = df[df.userCode == '0003d990-37d2-140c-a68e-060bc1e81ea5']

#Get rid of the unnecessary columns
df = df[['userCode','project_id']]
df = df.reset_index(drop = True)
print(df)

#note that we use 0.0 instead of 0 to indicate that this field should store value of the type float, not int.
topic_col = []
for i in range(num_top):
    df['Topic '+str(i)] = 0.0
    topic_col.append('Topic '+str(i))


                               userCode  project_id
0  0003d990-37d2-140c-a68e-060bc1e81ea5        9334
1  0003d990-37d2-140c-a68e-060bc1e81ea5        8742
2  0003d990-37d2-140c-a68e-060bc1e81ea5        8742
3  0003d990-37d2-140c-a68e-060bc1e81ea5        9334
4  0003d990-37d2-140c-a68e-060bc1e81ea5        8631


### TO DO #5:
1) Open the project description file down below. The file name is project_description.csv 

Note that some projects have an empty description. Use .fillna('None') to replace NaN with a placeholder string.

In [28]:
#Empty this cell for participants

proj_df = pandas.read_csv('https://raw.githubusercontent.com/PanthonImem/Content-Based-Lab-Notebook/master/project_description.csv', sep = ';').fillna('None')
proj_df = proj_df.sort_values('project_id')
proj_df = proj_df.reset_index(drop = True)
proj_df.head(20)

Unnamed: 0,project_id,description_th
0,4,ตั้งอยู่ติดถนนรังสิต-นครนายก จัดแบ่งผังโครงกา...
1,24,Balance Your Life เลือกคุณภาพชีวิตสมดุลลงตัวใน...
2,29,
3,41,บ้านเดี่ยวเล่นระดับ สไตล์ Modren Contemporary ...
4,44,รูปแบบชีวิตที่มีคุณค่า หรูหรา สง่างาม สะท้อนภา...
5,45,พาร์คเวย์ ชาเล่ต์ โครงการที่มากกว่าความใส่ใจ เ...
6,73,โดดเด่นด้วยสไตล์ Thai Contemporary ที่สวยงามลง...
7,95,สำหรับทุกครอบครัวที่ต้องการความเป็นสัดส่วนเพื่...
8,115,บ้านเดี่ยวที่ดินแปลงใหญ่ในราคาคุ้มค่า บนทำเลถน...
9,133,ด้วยศักยภาพสูงสุดของพื้นที่สีเขียวรอบโครงการ เ...


2) Obtain the project descriptions for the five page views (three distinct projects) of our user.
Preferably store them in a list for future reuse.

In [29]:
textList = []

#Insert your code here, get the descriptions of all the projects that the user has browsed into textList
browseList = []
for i in range(df.shape[0]):
    browseList.append(df.iloc[i]['project_id'])
    
proj_df = proj_df.set_index('project_id')
for i, proj in enumerate(browseList):
    textList.append(proj_df.loc[proj]['description_th'])


3) Obtain the topic probabilities for each project.
Hint: Review how to use get_document_topics in the LDA part of this tutorial

In [30]:
for i, text in enumerate(textList):
    #Insert your code here
    tokens = split_word(text)
    bow = dictionary.doc2bow(tokens)
    topic_prob = ldamodel.get_document_topics(bow,minimum_probability=None, minimum_phi_value=None,per_word_topics=False)
    for j in range(len(topic_prob)):
        df.at[i, 'Topic '+str(topic_prob[j][0])] = topic_prob[j][1]
df.head()


Unnamed: 0,userCode,project_id,Topic 0,Topic 1,Topic 2,Topic 3,Topic 4,Topic 5,Topic 6,Topic 7
0,0003d990-37d2-140c-a68e-060bc1e81ea5,9334,0.05573,0.0,0.0,0.118378,0.056521,0.0,0.0,0.756196
1,0003d990-37d2-140c-a68e-060bc1e81ea5,8742,0.0,0.0,0.244907,0.0,0.0,0.0,0.0,0.717549
2,0003d990-37d2-140c-a68e-060bc1e81ea5,8742,0.0,0.0,0.24491,0.0,0.0,0.0,0.0,0.717546
3,0003d990-37d2-140c-a68e-060bc1e81ea5,9334,0.055757,0.0,0.0,0.118372,0.05652,0.0,0.0,0.756175
4,0003d990-37d2-140c-a68e-060bc1e81ea5,8631,0.0,0.0,0.272016,0.0,0.228988,0.0,0.0,0.482539


We will then find the user's interest by simple averaging.

Note that this is for simplicity. You should try weighted average or other methods that you think captures the interest of the user well.  

In [31]:
df = df.groupby('userCode')[topic_col].agg('mean')
print(df)

#obtain the user vector list from the dataframe
user_vector = df.iloc[0].tolist()

                                       Topic 0  Topic 1   Topic 2  Topic 3  \
userCode                                                                     
0003d990-37d2-140c-a68e-060bc1e81ea5  0.022297      0.0  0.152367  0.04735   

                                       Topic 4  Topic 5  Topic 6   Topic 7  
userCode                                                                    
0003d990-37d2-140c-a68e-060bc1e81ea5  0.068406      0.0      0.0  0.686001  
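As one alternative to the plain mean, a weighted average lets some views count more than others, for example more recent ones. The sketch below is plain Python on made-up two-topic vectors. The helper name `weighted_profile` and the recency weights are assumptions for illustration.

```python
def weighted_profile(vectors, weights):
    """Return the weighted average of equal-length topic vectors."""
    total = sum(weights)
    if total == 0:
        raise ValueError('weights must not sum to zero')
    size = len(vectors[0])
    return [
        sum(w * vec[k] for vec, w in zip(vectors, weights)) / total
        for k in range(size)
    ]

# Hypothetical topic vectors of three views, oldest first.
views = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
recency_weights = [1, 2, 3]  # assumed: later views count more

profile = weighted_profile(views, recency_weights)
print(profile)
```

With equal weights this reduces to the simple average used above.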


## Cosine Similarity


Suppose we have two vectors. We want to know how close these two vectors are.

One simple way is to find the angle between the two vectors.

If the two vectors point in the same direction, then they are similar.

The cosine of the angle between them is close to one if the vectors align and zero if the vectors are perpendicular, so we can use it as a similarity score.

Recall from mathematics that for two vectors U and V,

$ U \cdot V = |U||V|\cos(\theta) $
        
 $\cos(\theta) = \frac{U \cdot V}{|U||V|}$
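As a worked example of the formula, take the vectors $U = (1, 2, 3)$ and $V = (2, 1, 3)$, chosen here only for illustration:

```latex
$$ U \cdot V = 1 \cdot 2 + 2 \cdot 1 + 3 \cdot 3 = 13 $$

$$ |U| = \sqrt{1^2 + 2^2 + 3^2} = \sqrt{14}, \qquad |V| = \sqrt{2^2 + 1^2 + 3^2} = \sqrt{14} $$

$$ \cos(\theta) = \frac{13}{\sqrt{14}\,\sqrt{14}} = \frac{13}{14} \approx 0.9286 $$
```

The two vectors contain the same values in a slightly different order, so the similarity is high but below one.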
  
 ### TO DO #6
 
 1) Write a function for finding cosine similarity. 
 
 Name it cosine_sim(a, b), where a and b are lists of the same length.
 
 We have a few testing examples for you to try out.

In [32]:
import math

def cosine_sim( list1, list2):
    
    #Empty below for participants
    size_l1 = 0
    size_l2 = 0
    dot_sum = 0
    if len(list1)!=len(list2):
        return 0
    for i in range(0, len(list1)):
        size_l1+=list1[i]**2
        size_l2+=list2[i]**2
        dot_sum+=list1[i]*list2[i]
    if(size_l1==0 or size_l2==0):
        return 0
    return dot_sum/(math.sqrt(size_l1*size_l2)) 

In [33]:
#compare floats with math.isclose rather than ==, since rounding can differ slightly
a = [1,1]
b = [0.5,0.5]
if(math.isclose(cosine_sim(a,b), 1.0)):
    print('Test 1 Passed')
else:
    print('Test 1 Failed')
a = [1,2,3]
b = [2,1,3]
if(math.isclose(cosine_sim(a,b), 0.9285714285714286)):
    print('Test 2 Passed')
else:
    print('Test 2 Failed')
a = [1,0,1]
b = [0,1,0]
if(cosine_sim(a,b) == 0.0):
    print('Test 3 Passed')
else:
    print('Test 3 Failed')

Test 1 Passed
Test 2 Passed
Test 3 Passed


Now we will compare the projects with cosine similarity to recommend a project based on the cosine similarity of LDA topics. 

In [34]:
#obtain ~50 projects for testing
proj_df = pandas.read_csv('https://raw.githubusercontent.com/PanthonImem/Content-Based-Lab-Notebook/master/project_description.csv', sep = ';').fillna('None')
proj_df = proj_df.sort_values('project_id')
proj_df = proj_df[proj_df.project_id>9300]
proj_df = proj_df[proj_df.project_id<9350]
proj_df = proj_df.reset_index(drop = True)

#Create a column for each of our topic
for i in range(num_top):
    proj_df['Topic '+str(i)] = 0.0

#Get all projects in our testing set
browseList = []
for i in range(proj_df.shape[0]):
    browseList.append(proj_df.iloc[i]['project_id'])

#get all the project description from project in our testing set
textList = []
proj_df = proj_df.set_index('project_id')
for i, proj in enumerate(browseList):
    textList.append(proj_df.loc[proj]['description_th'])
    
#get the LDA topic distribution for all of our testing projects
for i, text in enumerate(textList):
    tokens = split_word(text)
    bow = dictionary.doc2bow(tokens)
    topic_prob = ldamodel.get_document_topics(bow,minimum_probability=None, minimum_phi_value=None,per_word_topics=False)
    for j in range(len(topic_prob)):
        proj_df.at[browseList[i], 'Topic '+str(topic_prob[j][0])] = topic_prob[j][1]

#compute the cosine similarity
dist_list = [0.0]*len(browseList)
for i, proj in enumerate(browseList):
    item_vector = proj_df.loc[[proj]].values.tolist().pop(0)
    item_vector.pop(0)
    dist_list[i] += cosine_sim(user_vector, item_vector)
    
def argmax(values):
    #return the index of the largest element
    best = -1*float("inf")
    ind = 0
    for i in range(0, len(values)):
        if values[i] > best:
            ind = i
            best = values[i]
    return ind
    
print('Below is our Top 5 Recommendations and their cosine similarity to our user profile')
for i in range(0,5):
    recom = argmax(dist_list)
    print(browseList[recom], dist_list[recom])
    dist_list[recom] = 0.0

Below is our Top 5 Recommendations and their cosine similarity to our user profile
9334 0.9718599828503817
9335 0.9680030905353205
9349 0.9589027928928925
9333 0.8845076340508135
9306 0.8804714782852602


Our code above shows the top 5 recommendations from ~50 projects with cosine similarity.

You should see 9334 as one of our top recommended projects. 

This is because the user really did browse project 9334 (and in fact, twice) so our user profile which is derived from the browsing history should reflect this.

In practice, we want the next best project that the user has not yet browsed, so we would screen 9334 out of our recommendations.
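A minimal sketch of that screening step is shown below. The helper name `top_k_unseen` is made up for illustration, and the scores are the first four similarities from the output above, rounded to three decimals.

```python
def top_k_unseen(scores, seen, k):
    """Return the k highest-scoring (project_id, score) pairs not in seen."""
    candidates = [(pid, s) for pid, s in scores.items() if pid not in seen]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:k]

scores = {9334: 0.972, 9335: 0.968, 9349: 0.959, 9333: 0.885}
seen = {9334}  # projects the user has already browsed

recommendations = top_k_unseen(scores, seen, 2)
print(recommendations)
```

Sorting once and slicing also avoids having to overwrite scores to find the next best item.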

**Keep in mind that this result is purely from LDA. In reality, many more possible features (such as location or starting price) can be extracted from the other data files.**

## Bonus: Multiprocessing with imap

We can potentially speed up our program with multiprocessing.

Below is a simple tutorial on how to use multiprocessing in Python.


In [35]:
import multiprocessing

Below is how to find out how many CPU cores you have

In [36]:
num_cores = multiprocessing.cpu_count()
print(num_cores)

2


Suppose we have our data below

In [37]:
ls = [0]*100000
for i in range(0,100000):
    ls[i] = i

We will then find the cutoffs for splitting our data into **num_cores** parts, so that we can feed one part to each core.

In the cell below, we only compute the start and stop index for each core and store them in tuples.

In [38]:
n = len(ls)
core = 0
arg_instances = []
for i in range(num_cores):
    arg_instances.append((i*int(n/num_cores), (i+1)*int(n/num_cores), core))
    core+=1
print(arg_instances)

[(0, 50000, 0), (50000, 100000, 1)]
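The cutoffs above cover the whole list only when its length is divisible by the number of cores. As a sketch of a variant that also handles a remainder, the helper below (the name `chunk_bounds` is made up for illustration) spreads the leftover elements over the first few chunks.

```python
def chunk_bounds(n, parts):
    """Split range(n) into `parts` contiguous (start, stop, core) chunks."""
    base, extra = divmod(n, parts)
    bounds = []
    start = 0
    for core in range(parts):
        # The first `extra` chunks each take one of the leftover elements.
        stop = start + base + (1 if core < extra else 0)
        bounds.append((start, stop, core))
        start = stop
    return bounds

print(chunk_bounds(100000, 2))
print(chunk_bounds(10, 3))
```

For 100000 elements and 2 cores this gives the same tuples as the cell above.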


In [39]:
def process(arg):
    start, stop, core = arg
    print('Core '+str(core)+' start!')
    ls = [0]*100000
    for i in range(0,100000):
        ls[i] = i
    sum = 0
    for i in range(start,stop):
        sum+=ls[i]
    print('Core '+str(core)+' end!')
    return sum

We can then pass the function above into multiprocessing. 

In [40]:
total = 0
#the with-block shuts the worker processes down when we are done
with multiprocessing.Pool(num_cores) as p:
    for result in p.imap(process, arg_instances):
        total+=result
print('Sum of element in the list:', total)

Core 0 start!
Core 1 start!
Core 0 end!
Core 1 end!
Sum of element in the list: 4999950000


## -- End of Tutorial --