## In this notebook we build a baseline model based on the EDA from the previous notebook

### Before building the model, let's recap the main insights from the EDA

- This dataset has three tables: content data, topic data and correlation data
- Topic data
  - title: only 2 NA values, a good feature as a starting point
  - channel, level, category: no NA values, good categorical features as a starting point
  - parent: no NA values for non-root topics, a good feature as a starting point
  - description and has_content: too many NA values, can be kept as backup features for later models
- Content data
  - title: only 9 NA values, a good feature as a starting point
  - kind: a categorical feature with no NA values, also a good feature as a starting point
  - description and text: too many NA values, can be kept as backup features for later models
- Correlation data
  - over 99% of correlated topic and content pairs share the same language, so we use language match as a hard rule for candidate selection
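
The language rule in the last bullet can be sketched on toy data as a join on the language columns. The frames and values below are made up for illustration; only the column names `topic_language` and `content_language` match the ones used later in this notebook.

```python
import pandas as pd

# Toy stand-ins for the real topic and content tables (values are invented).
toy_topics = pd.DataFrame({
    "topic_id": ["t_1", "t_2"],
    "topic_language": ["en", "pt"],
})
toy_contents = pd.DataFrame({
    "content_id": ["c_1", "c_2", "c_3"],
    "content_language": ["en", "pt", "en"],
})

# Hard rule: a content item is a candidate for a topic only if the languages match.
toy_candidates = toy_topics.merge(
    toy_contents,
    left_on="topic_language",
    right_on="content_language",
    how="inner",
)[["topic_id", "content_id"]]

candidate_pairs = sorted(zip(toy_candidates["topic_id"], toy_candidates["content_id"]))
print(candidate_pairs)
```

On the full tables this join would likely be very large, so in practice it would probably be applied per language or combined with a further filter.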

In [1]:
import pandas as pd


In [2]:
%%time
# loading data
contents = pd.read_csv('./data/content.csv')
topics = pd.read_csv('./data/topics.csv')
correlations = pd.read_csv('./data/correlations.csv')

CPU times: user 4.65 s, sys: 630 ms, total: 5.28 s
Wall time: 8.23 s


In [3]:
%%time
# split the space-separated content_ids string into a list, then explode so each (topic_id, content_id) pair gets its own row
correlations['content_id'] = correlations['content_ids'].apply(lambda x : str(x).split(' '))
correlations = correlations.explode('content_id')
correlations = correlations[['topic_id', 'content_id']]
correlations.head()

CPU times: user 76.3 ms, sys: 6.51 ms, total: 82.8 ms
Wall time: 108 ms


Unnamed: 0,topic_id,content_id
0,t_00004da3a1b2,c_1108dd0c7a5d
0,t_00004da3a1b2,c_376c5a8eb028
0,t_00004da3a1b2,c_5bc0e1e2cba0
0,t_00004da3a1b2,c_76231f9d0b5e
1,t_00068291e9a4,c_639ea2ef9c95
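
Note that `explode` keeps the index of the source row, which is why the first column above repeats 0 four times. A minimal sketch on invented toy data, resetting the index afterwards so each pair gets its own row label:

```python
import pandas as pd

# Toy correlations table in the same shape as the real one (values are invented).
toy_corr = pd.DataFrame({
    "topic_id": ["t_1", "t_2"],
    "content_ids": ["c_1 c_2 c_3", "c_4"],
})

toy_long = (
    toy_corr
    .assign(content_id=toy_corr["content_ids"].str.split(" "))  # string -> list
    .explode("content_id")                                      # one row per list element
    .reset_index(drop=True)                                     # drop the repeated source index
    [["topic_id", "content_id"]]
)
print(toy_long)
```

Resetting the index is optional for the merges that follow, since they join on columns rather than on the index.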


### Generate training data
- all the pairs in correlations are positive samples
- all the pairs not in correlations are negative samples
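
One way to read the two rules above is as an anti-join: build candidate pairs, then keep those that do not appear in the correlations. The sketch below does this on invented toy data with `merge(..., indicator=True)`. Enumerating every pair is only feasible at toy scale, so on the real data some form of sampling would be needed; that part is an assumption and not shown here.

```python
import pandas as pd

# Toy positive pairs, standing in for the exploded correlations table.
toy_positives = pd.DataFrame({
    "topic_id": ["t_1", "t_1", "t_2"],
    "content_id": ["c_1", "c_2", "c_3"],
})
toy_topic_ids = pd.DataFrame({"topic_id": ["t_1", "t_2"]})
toy_content_ids = pd.DataFrame({"content_id": ["c_1", "c_2", "c_3"]})

# Every possible (topic, content) pair, built with a constant join key.
toy_all_pairs = (
    toy_topic_ids.assign(key=1)
    .merge(toy_content_ids.assign(key=1), on="key")
    .drop(columns="key")
)

# Anti-join: pairs that exist only on the left side are not in the positives.
flagged = toy_all_pairs.merge(
    toy_positives, on=["topic_id", "content_id"], how="left", indicator=True
)
toy_negatives = (
    flagged.loc[flagged["_merge"] == "left_only", ["topic_id", "content_id"]]
    .reset_index(drop=True)
)
print(toy_negatives)
```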

In [5]:
%%time
# merge content, topic and correlation data

# rename columns before join
topics.columns = ['topic_id', 'topic_title', 'topic_description', 'topic_channel', 'topic_category', 'topic_level', 'topic_language', 'topic_parent', 'topic_has_content']

# rename columns before join
contents.columns = ['content_id', 'content_title', 'content_description', 'content_kind', 'content_text', 'content_language', 'content_copyright_holder', 'content_license']

# merge corr with topics
corr_topics = pd.merge(correlations, topics, on=['topic_id'], how='left')

# generate positive samples by joining the content features on content_id
positive_data = pd.merge(corr_topics, contents, on=['content_id'], how='left')

positive_data.head()

CPU times: user 139 ms, sys: 14.1 ms, total: 153 ms
Wall time: 152 ms


Unnamed: 0,topic_id,content_id,topic_title,topic_description,topic_channel,topic_category,topic_level,topic_language,topic_parent,topic_has_content,content_title,content_description,content_kind,content_text,content_language,content_copyright_holder,content_license
0,t_00004da3a1b2,c_1108dd0c7a5d,Откриването на резисторите,"Изследване на материали, които предизвикват на...",000cf7,source,4,bg,t_16e29365b50d,True,Молив като резистор,"Моливът причинява промяна в отклонението, подо...",video,,bg,,
1,t_00004da3a1b2,c_376c5a8eb028,Откриването на резисторите,"Изследване на материали, които предизвикват на...",000cf7,source,4,bg,t_16e29365b50d,True,Да чуем променливото съпротивление,Тук чертаем линия на лист хартия и я използвам...,video,,bg,,
2,t_00004da3a1b2,c_5bc0e1e2cba0,Откриването на резисторите,"Изследване на материали, които предизвикват на...",000cf7,source,4,bg,t_16e29365b50d,True,Променлив резистор (реостат) с графит от молив,Използваме сърцевината на молива (неговия граф...,video,,bg,,
3,t_00004da3a1b2,c_76231f9d0b5e,Откриването на резисторите,"Изследване на материали, които предизвикват на...",000cf7,source,4,bg,t_16e29365b50d,True,Последователно свързване на галваничен елемент...,"Защо отклонението се променя, когато се свърже...",video,,bg,,
4,t_00068291e9a4,c_639ea2ef9c95,Entradas e saídas de uma função,Entenda um pouco mais sobre funções.,8e286a,source,4,pt,t_d14b6c2a2b70,True,Dados e resultados de funções: gráficos,Encontre todas as entradas que correspondem a ...,exercise,,pt,,
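
Assigning a full list to `.columns` depends on the column order in the CSV files. A possible alternative, sketched here on invented toy data, is `add_prefix`, combined with the `validate` argument of `merge` to check that each id is unique on the right-hand side. The raw column names (`id`, `title`, `language`, `kind`) are an assumption, since the original headers are not shown in this notebook.

```python
import pandas as pd

# Toy tables with assumed raw column names (values are invented).
toy_topics = pd.DataFrame({
    "id": ["t_1", "t_2"],
    "title": ["Fractions", "Functions"],
    "language": ["en", "pt"],
})
toy_contents = pd.DataFrame({
    "id": ["c_1", "c_2", "c_3"],
    "title": ["Adding fractions", "Fraction quiz", "Function inputs"],
    "kind": ["video", "exercise", "video"],
})
toy_corr = pd.DataFrame({
    "topic_id": ["t_1", "t_1", "t_2"],
    "content_id": ["c_1", "c_2", "c_3"],
})

# Prefix every column so names stay distinct after the join.
toy_topics = toy_topics.add_prefix("topic_")
toy_contents = toy_contents.add_prefix("content_")

# validate="many_to_one" raises if an id is duplicated in the right-hand table.
toy_positive = (
    toy_corr
    .merge(toy_topics, on="topic_id", how="left", validate="many_to_one")
    .merge(toy_contents, on="content_id", how="left", validate="many_to_one")
)
print(toy_positive)
```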


Unnamed: 0,topic_id,content_id,topic_title,topic_channel,topic_category,content_title,content_kind,target
0,t_0710556bc907,c_0e1efb291f9b,Solving Inequalities Topic,8afc0b,supplemental,"Algebra II Module 1, Topic A, Lesson 2: Studen...",document,0
1,t_0710556bc907,c_b20621e92cd4,Solving Inequalities Topic,8afc0b,supplemental,Deductive reasoning,video,0
2,t_0710556bc907,c_afc24567695b,Solving Inequalities Topic,8afc0b,supplemental,5.1: Relationships and the Relationships Tool,html5,0
3,t_0710556bc907,c_4147ec3b4d88,Solving Inequalities Topic,8afc0b,supplemental,1.5: Scientific Nomenclature,html5,0
4,t_0710556bc907,c_113ec092cd1f,Solving Inequalities Topic,8afc0b,supplemental,Meteorology,html5,0
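
The rows above all carry `target` 0. Assuming positive pairs are labeled 1 and negative pairs 0, a combined training frame could be put together roughly as follows. The toy frames and the shuffle seed are illustrative and not taken from the notebook.

```python
import pandas as pd

# Toy positive and negative pairs (values are invented).
toy_pos = pd.DataFrame({
    "topic_id": ["t_1", "t_2"],
    "content_id": ["c_1", "c_3"],
})
toy_neg = pd.DataFrame({
    "topic_id": ["t_1", "t_2", "t_2"],
    "content_id": ["c_3", "c_1", "c_2"],
})

# Label each side, stack them, then shuffle so the classes are interleaved.
toy_train = pd.concat(
    [toy_pos.assign(target=1), toy_neg.assign(target=0)],
    ignore_index=True,
)
toy_train = toy_train.sample(frac=1, random_state=0).reset_index(drop=True)
print(toy_train)
```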
