### Exploratory Data Analysis

- The data is small (832 MB for content.csv, 4.9 MB for correlations.csv and 14 MB for topics.csv), so it is fine to use pandas here

In [3]:
!ls -alh ./data/

total 850M
drwxrwxrwx 1 root root 4.0K  2月  2 21:38 .
drwxrwxrwx 1 root root 4.0K  2月  2 21:42 ..
-rwxrwxrwx 1 root root 832M 12月 14 19:43 content.csv
-rwxrwxrwx 1 root root 4.9M 12月 14 19:44 correlations.csv
-rwxrwxrwx 1 root root  336 12月 14 19:44 sample_submission.csv
-rwxrwxrwx 1 root root  14M 12月 14 19:44 topics.csv
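
The size check above can also be scripted. The following is a minimal sketch using only the standard library; the helper names `total_size_mb` and `fits_in_memory` and the 4096 MiB default budget are assumptions made for illustration, not part of the original notebook. On-disk size is only a rough proxy, since a DataFrame can take more memory than its CSV.

```python
import os


def total_size_mb(directory: str) -> float:
    """Return the combined size of the regular files in `directory`, in MiB."""
    total_bytes = 0
    for entry in os.scandir(directory):
        if entry.is_file():
            total_bytes += entry.stat().st_size
    return total_bytes / (1024 ** 2)


def fits_in_memory(directory: str, budget_mb: float = 4096.0) -> bool:
    """Rough check: is the raw on-disk size below a (hypothetical) memory budget?"""
    return total_size_mb(directory) < budget_mb


# Example call (commented out because it needs the ./data/ directory):
# print(total_size_mb("./data/"), fits_in_memory("./data/"))
```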


### Loading libraries

In [67]:
import pandas as pd


In [68]:
%%time
# loading data
contents = pd.read_csv('./data/content.csv')
topics = pd.read_csv('./data/topics.csv')
correlations = pd.read_csv('./data/correlations.csv')
sample_sub = pd.read_csv('./data/sample_submission.csv')

CPU times: user 4.18 s, sys: 367 ms, total: 4.55 s
Wall time: 4.55 s


## Content
Contains a row for each content item in the dataset. Note that the hidden dataset used for scoring contains additional content items not in the public version. These additional content items are only correlated to topics in the test set. Some content items may not be correlated with any topic.
- id - A unique identifier for this content item.
- title - Title text for this content item.
- description - Description text. May be empty.
- language - Language code representing the language of this content item.
- kind - Describes what format of content this item represents, as one of:
  - document (text is extracted from a PDF or EPUB file)
  - video (text is extracted from the subtitle file, if available)
  - exercise (text is extracted from questions/answers)
  - audio (no text)
  - html5 (text is extracted from HTML source)
- text - Extracted text content, if available and if licensing permitted (around half of content items have text content).
- copyright_holder - If text was extracted from the content, indicates the owner of the copyright for that content. Blank for all test set items.
- license - If text was extracted from the content, the license under which that content was made available. Blank for all test set items.

In [69]:
contents.head(3)

Unnamed: 0,id,title,description,kind,text,language,copyright_holder,license
0,c_00002381196d,"Sumar números de varios dígitos: 48,029+233,930","Suma 48,029+233,930 mediante el algoritmo está...",video,,es,,
1,c_000087304a9e,Trovare i fattori di un numero,Sal trova i fattori di 120.\n\n,video,,it,,
2,c_0000ad142ddb,Sumar curvas de demanda,Cómo añadir curvas de demanda\n\n,video,,es,,


In [70]:
contents.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 154047 entries, 0 to 154046
Data columns (total 8 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   id                154047 non-null  object
 1   title             154038 non-null  object
 2   description       89456 non-null   object
 3   kind              154047 non-null  object
 4   text              74035 non-null   object
 5   language          154047 non-null  object
 6   copyright_holder  71821 non-null   object
 7   license           74035 non-null   object
dtypes: object(8)
memory usage: 9.4+ MB


### Content data high-level statistics

#### Factors
- id, primary key with no NA values (154047 unique IDs for 154047 rows), so data integrity is guaranteed
- title, 130937 unique values and only 9 NA values, integrity is good but the NA values need to be handled
- description, about 42% of content items (64591 of 154047) have no description, integrity is poor, so it is a low-priority feature
- kind, no NA values, integrity is good, treat as a categorical feature
- text, just over half of content items (80012 of 154047) have no text, integrity is poor, so it is a low-priority feature
- language, no NA values, integrity is good, treat as a categorical feature; it can also be used to choose the language model when calculating embeddings
- copyright_holder, can be ignored because, as mentioned above, it is blank for all test set items
- license, can be ignored because, as mentioned above, it is blank for all test set items
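
The handling suggested in the list above (measure missing values, fill NA text, treat `kind` and `language` as categorical) could look roughly like this. It is a sketch on a tiny made-up frame named `demo`, which stands in for `contents`; filling with an empty string is one possible choice, not something the notebook prescribes.

```python
import pandas as pd

# Tiny stand-in for `contents`; the values are invented for illustration.
demo = pd.DataFrame({
    "id": ["c_1", "c_2", "c_3"],
    "title": ["Adding fractions", None, "Demand curves"],
    "description": [None, "Intro video", None],
    "kind": ["video", "video", "document"],
    "language": ["en", "es", "es"],
})

# Share of missing values per column, as a fraction between 0 and 1.
missing_ratio = demo.isna().mean()

# Replace missing text with an empty string so later string operations do not fail.
text_cols = ["title", "description"]
demo[text_cols] = demo[text_cols].fillna("")

# Store the low-cardinality columns as pandas categoricals.
for col in ["kind", "language"]:
    demo[col] = demo[col].astype("category")
```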

In [71]:
%%time
# id
print("-------------------------------------------------------------------------------------------------------------------------------------")
print(f"Total have {contents.id.nunique()} unique IDs, min ID is {sorted(contents.id.values)[0]}, max ID is {sorted(contents.id.values)[-1]}")
print("-------------------------------------------------------------------------------------------------------------------------------------")
# title
print(f"Total have {contents.title.nunique()} unique titles and {contents.title.isna().sum()} contents don't have title and {contents.title.duplicated().sum()} content have duplicate titles.")
print("-------------------------------------------------------------------------------------------------------------------------------------")
# description
print(f"Total have {contents.description.nunique()} unique descriptions and {contents.description.isna().sum()} contents don't have description.")
print("-------------------------------------------------------------------------------------------------------------------------------------")
# kind
print(f"Total have {contents.kind.nunique()} unique kinds and kinds distribution -> {contents.kind.value_counts().to_dict()}")
print("-------------------------------------------------------------------------------------------------------------------------------------")
# text
print(f"Total have {contents.text.nunique()} unique text and {contents.text.isna().sum()} contents don't have text.")
print("-------------------------------------------------------------------------------------------------------------------------------------")
# language
print(f"Total have {contents.language.nunique()} unique languages and language distribution -> {contents.language.value_counts().to_dict()}")
print("-------------------------------------------------------------------------------------------------------------------------------------")

-------------------------------------------------------------------------------------------------------------------------------------
Total have 154047 unique IDs, min ID is c_00002381196d, max ID is c_ffffe5254266
-------------------------------------------------------------------------------------------------------------------------------------
Total have 130937 unique titles and 9 contents don't have title and 23109 content have duplicate titles.
-------------------------------------------------------------------------------------------------------------------------------------
Total have 76305 unique descriptions and 64591 contents don't have description.
-------------------------------------------------------------------------------------------------------------------------------------
Total have 5 unique kinds and kinds distribution -> {'video': 61487, 'document': 33873, 'html5': 32563, 'exercise': 25925, 'audio': 199}
-------------------------------------------------------------
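
The per-column prints above repeat the same three measurements (unique values, missing values, repeated values). As a sketch, they can be collected into one table; the helper name `summarize_columns` is hypothetical and the demo frame is made up.

```python
import pandas as pd


def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with unique, missing and repeated-value counts."""
    return pd.DataFrame({
        # Distinct non-NA values.
        "n_unique": df.nunique(),
        # Rows where the value is NA.
        "n_missing": df.isna().sum(),
        # Rows whose value already appeared earlier in the column
        # (pandas treats repeated NA values as duplicates of each other).
        "n_duplicated": pd.Series(
            {col: int(df[col].duplicated().sum()) for col in df.columns}
        ),
    })


demo = pd.DataFrame({
    "title": ["a", "a", None],
    "kind": ["video", "video", "audio"],
})
summary = summarize_columns(demo)
```

Note that `n_duplicated` counts rows that repeat an earlier value, which is how the "duplicate titles" figure in the output above is computed.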

## Topic
Contains a row for each topic in the dataset. These topics are organized into "channels", with each channel containing a single "topic tree" (which can be traversed through the "parent" reference). Note that the hidden dataset used for scoring contains additional topics not in the public version. You should only submit predictions for those topics listed in sample_submission.csv.
- id - A unique identifier for this topic.
- title - Title text for this topic.
- description - Description text (may be empty)
- channel - The channel (that is, topic tree) this topic is part of.
- category - Describes the origin of the topic.
  - source - Structure was given by original content creator (e.g. the topic tree as imported from Khan Academy). There are no topics in the test set with this category.
  - aligned - Structure is from a national curriculum or other target taxonomy, with content aligned from multiple sources.
  - supplemental - This is a channel that has to some extent been aligned, but without the same level of granularity or fidelity as an aligned channel.
- language - Language code for the topic. May not always match apparent language of its title or description, but will always match the language of any associated content items.
- parent - The id of the topic that contains this topic, if any. This field is empty if the topic is the root node for its channel.
- level - The depth of this topic within its topic tree. Level 0 means it is a root node (and hence its title is the title of the channel).
- has_content - Whether there are content items correlated with this topic. Most content is correlated with leaf topics, but some non-leaf topics also have content correlations.
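
The "parent" reference described above can be followed upwards until a root is reached. The sketch below does this with plain dictionaries; the function name `path_to_root` and the example IDs are hypothetical.

```python
def path_to_root(topic_id, parent_of, title_of):
    """Return the titles from the channel root down to `topic_id`.

    `parent_of` maps a topic id to its parent id (root topics are absent),
    `title_of` maps a topic id to its title.
    """
    titles = []
    seen = set()  # guards against accidental cycles in the data
    current = topic_id
    while current is not None and current not in seen:
        seen.add(current)
        titles.append(title_of.get(current, ""))
        current = parent_of.get(current)  # None once the root is reached
    return list(reversed(titles))


# Made-up three-level tree: Channel > Unit > Lesson.
parent_of = {"t_c": "t_b", "t_b": "t_a"}
title_of = {"t_a": "Channel", "t_b": "Unit", "t_c": "Lesson"}
breadcrumb = path_to_root("t_c", parent_of, title_of)
```

With the real data, one way to build `parent_of` is from the rows of `topics` that have a non-missing parent, so that root topics are simply absent from the dictionary.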

In [49]:
topics.head(3)

Unnamed: 0,id,title,description,channel,category,level,language,parent,has_content
0,t_00004da3a1b2,Откриването на резисторите,"Изследване на материали, които предизвикват на...",000cf7,source,4,bg,t_16e29365b50d,True
1,t_000095e03056,Unit 3.3 Enlargements and Similarities,,b3f329,aligned,2,en,t_aa32fb6252dc,False
2,t_00068291e9a4,Entradas e saídas de uma função,Entenda um pouco mais sobre funções.,8e286a,source,4,pt,t_d14b6c2a2b70,True


In [50]:
topics.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 76972 entries, 0 to 76971
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   id           76972 non-null  object
 1   title        76970 non-null  object
 2   description  34953 non-null  object
 3   channel      76972 non-null  object
 4   category     76972 non-null  object
 5   level        76972 non-null  int64 
 6   language     76972 non-null  object
 7   parent       76801 non-null  object
 8   has_content  76972 non-null  bool  
dtypes: bool(1), int64(1), object(7)
memory usage: 4.8+ MB


### Topic data high-level statistics

#### Factors
- id, primary key with no NA values (76972 unique IDs for 76972 rows), so data integrity is guaranteed
- title, 45082 unique values and only 2 NA values, integrity is good but the NA values need to be handled
- description, more than half of the topics (42019 of 76972, about 55%) have no description, integrity is poor, so it is a low-priority feature
- channel, no NA values, integrity is good, treat as a categorical feature
- level, no NA values, gives the depth of the topic in its tree; the minimum level is 0 (a root node) and the maximum level is 10
- category, no NA values, integrity is good, treat as a categorical feature
- language, no NA values, integrity is good, treat as a categorical feature; it can also be used to choose the language model when calculating embeddings
- parent, 17152 unique values and 171 NA values; 171 matches the number of unique channels and the number of level-0 topics, so these 171 topics are most likely the root topics, each of which represents a channel
- has_content, roughly 4/5 of topics have associated content, integrity is good
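
The guess about the `parent` column (topics without a parent are the level-0 roots, one per channel) can be checked directly. A minimal sketch on a made-up frame named `demo_topics`, which stands in for `topics`:

```python
import pandas as pd

# Made-up stand-in for `topics` with two channels.
demo_topics = pd.DataFrame({
    "id": ["t_a", "t_b", "t_c", "t_d"],
    "channel": ["ch1", "ch1", "ch2", "ch2"],
    "level": [0, 1, 0, 1],
    "parent": [None, "t_a", None, "t_c"],
})

# Topics with no parent are the candidate root nodes.
roots = demo_topics[demo_topics["parent"].isna()]

# Every candidate root should sit at level 0 ...
all_roots_level_zero = bool((roots["level"] == 0).all())

# ... and each channel should have exactly one of them.
one_root_per_channel = bool(
    roots["channel"].is_unique
    and set(roots["channel"]) == set(demo_topics["channel"])
)
```

If both flags are true on the real `topics` frame, the 171 NA parents correspond exactly to the 171 channel roots.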

In [72]:
%%time
# id
print("-------------------------------------------------------------------------------------------------------------------------------------")
print(f"Total have {topics.id.nunique()} unique IDs, min ID is {sorted(topics.id.values)[0]}, max ID is {sorted(topics.id.values)[-1]}")
print("-------------------------------------------------------------------------------------------------------------------------------------")
# title
print(f"Total have {topics.title.nunique()} unique titles and {topics.title.isna().sum()} topics don't have title and {topics.title.duplicated().sum()} topics have duplicate titles.")
print("-------------------------------------------------------------------------------------------------------------------------------------")
# description
print(f"Total have {topics.description.nunique()} unique descriptions and {topics.description.isna().sum()} topics don't have description.")
print("-------------------------------------------------------------------------------------------------------------------------------------")
# channel
print(f"Total have {topics.channel.nunique()} unique channel and channel distribution -> {topics.channel.value_counts().to_dict()}")
print("-------------------------------------------------------------------------------------------------------------------------------------")
# level
print(f"Total have {topics.level.nunique()} unique level and level distribution -> {topics.level.value_counts().to_dict()}")
print("-------------------------------------------------------------------------------------------------------------------------------------")
# category
print(f"Total have {topics.category.nunique()} unique category and category distribution -> {topics.category.value_counts().to_dict()}")
print("-------------------------------------------------------------------------------------------------------------------------------------")
# language
print(f"Total have {topics.language.nunique()} unique languages and language distribution -> {topics.language.value_counts().to_dict()}")
print("-------------------------------------------------------------------------------------------------------------------------------------")
# parent
print(f"Total have {topics.parent.nunique()} unique parent and {topics.parent.isna().sum()} topics don't have parent and {topics.parent.duplicated().sum()} topics have duplicate parent.")
print("-------------------------------------------------------------------------------------------------------------------------------------")
# has content
print(f"Total have {topics.has_content.nunique()} unique has_content and has_content distribution -> {topics.has_content.value_counts().to_dict()}")
print("-------------------------------------------------------------------------------------------------------------------------------------")

-------------------------------------------------------------------------------------------------------------------------------------
Total have 76972 unique IDs, min ID is t_00004da3a1b2, max ID is t_fffe88835149
-------------------------------------------------------------------------------------------------------------------------------------
Total have 45082 unique titles and 2 topics don't have title and 31889 topics have duplicate titles.
-------------------------------------------------------------------------------------------------------------------------------------
Total have 23067 unique descriptions and 42019 topics don't have description.
-------------------------------------------------------------------------------------------------------------------------------------
Total have 171 unique channel and channel distribution -> {'fef095': 5770, '0ec697': 5355, '6e90a7': 4554, '2ee29d': 4438, '36a98b': 3667, '000cf7': 2867, '8e286a': 2780, '0c929f': 2698, 'c152d6': 2555, 'e

Unnamed: 0,id,title,description,channel,category,level,language,parent,has_content
7814,t_1a45cff2e90f,\n प्रभावी कार्यव्यवहार,,4d2d4a,source,4,hi,t_dcd9f8687267,True
40638,t_87aaa640a2b1,\n प्रभावी कार्यव्यवहार,,4d2d4a,source,4,hi,t_232a6f06533f,True
40766,t_881e51a344d0,\n प्रभावी कार्यव्यवहार,,4d2d4a,source,4,hi,t_0f47d393a3aa,True
55202,t_b89fae6825cc,\n प्रभावी कार्यव्यवहार,,4d2d4a,source,4,hi,t_ea7a0d4fae78,True
45290,t_97bceff1c37d,\n प्रभावी कार्यव्यवहार,,4d2d4a,source,4,hi,t_fe4f26107363,True
...,...,...,...,...,...,...,...,...,...
47867,t_a02db7492ff6,高级折纸,,da1fa7,source,3,zh,t_9e8fb8002b2c,True
15747,t_34ca20bec76e,ﺎﻠﻔﻳﺰﻳﺍﺀ,,00fda4,source,3,ar,t_578c5ec7f529,True
3381,t_0b3e011d9f83,ﺎﻠﻔﻳﺰﻳﺍﺀ,,00fda4,source,3,ar,t_3d526ca71fd7,True
18344,t_3d9ad9931021,,BC: BIOL 2 - Introduction to Human Biology (Gr...,ebc86c,supplemental,3,en,t_8e8ef88e79d1,True
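
The cell that produced the table above is not shown. A listing of topics that share a title can be produced along the following lines; this is a guess at the approach, not the original code, and the demo frame is made up.

```python
import pandas as pd

# Made-up stand-in for `topics`.
demo_topics = pd.DataFrame({
    "id": ["t_1", "t_2", "t_3", "t_4"],
    "title": ["Physics", "Algebra", "Physics", None],
})

# keep=False marks every row whose title occurs more than once,
# not only the second and later occurrences.
dup_titles = (
    demo_topics[demo_topics["title"].duplicated(keep=False)]
    .sort_values("title")
)
```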


## Correlation data
The content items associated to topics in the training set. A single content item may be associated to more than one topic. In each row, we give a topic_id and a list of all associated content_ids. These comprise the targets of the training set.

In [73]:
correlations.head(3)

Unnamed: 0,topic_id,content_ids
0,t_00004da3a1b2,c_1108dd0c7a5d c_376c5a8eb028 c_5bc0e1e2cba0 c...
1,t_00068291e9a4,c_639ea2ef9c95 c_89ce9367be10 c_ac1672cdcd2c c...
2,t_00069b63a70a,c_11a1dc0bfb99


In [74]:
correlations.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61517 entries, 0 to 61516
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   topic_id     61517 non-null  object
 1   content_ids  61517 non-null  object
dtypes: object(2)
memory usage: 961.3+ KB
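
Because each row packs several content IDs into one space-separated string, it is often easier to work with one row per (topic, content) pair. A minimal sketch on made-up data; the names `demo_corr` and `pairs` are illustrative.

```python
import pandas as pd

# Made-up stand-in for `correlations`.
demo_corr = pd.DataFrame({
    "topic_id": ["t_1", "t_2"],
    "content_ids": ["c_a c_b c_c", "c_a"],
})

# Split the space-separated string into a list, then emit one row per list element.
pairs = (
    demo_corr.assign(content_id=demo_corr["content_ids"].str.split())
    .explode("content_id")
    .drop(columns="content_ids")
    .reset_index(drop=True)
)

# A content item may belong to several topics: count topics per content item.
topics_per_content = pairs.groupby("content_id")["topic_id"].nunique()
```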


### Sample submission
A submission file in the correct format. See the Evaluation page for more details. You must use this file to identify which topics in the test set require predictions.

In [76]:
# This is just a sample submission. In the test environment we need to generate it from the test data: the test data will have a new content.csv and topics.csv, but correlations.csv will be the same
sample_sub

Unnamed: 0,topic_id,content_ids
0,t_00004da3a1b2,c_1108dd0c7a5d c_376c5a8eb028 c_5bc0e1e2cba0 c...
1,t_00068291e9a4,c_639ea2ef9c95 c_89ce9367be10 c_ac1672cdcd2c c...
2,t_00069b63a70a,c_11a1dc0bfb99
3,t_0006d41a73a8,c_0c6473c3480d c_1c57a1316568 c_5e375cf14c47 c...
4,t_4054df11a74e,c_3695c5dc1df6 c_f2d184a98231
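
A submission in this format can be assembled from per-topic predictions as sketched below. The `predictions` dictionary and its IDs are made up; only the two column names and the space-separated layout come from the sample file shown above.

```python
import pandas as pd

# Made-up stand-in for sample_submission.csv: the topics that need predictions.
sample = pd.DataFrame({"topic_id": ["t_1", "t_2"], "content_ids": ["", ""]})

# Hypothetical model output: topic id -> list of predicted content ids.
predictions = {
    "t_1": ["c_a", "c_b"],
    "t_2": ["c_c"],
    "t_9": ["c_d"],  # not requested in the sample file, so it is ignored
}

# Keep only the requested topics, in the sample file's order,
# and join each list of ids with single spaces.
submission = pd.DataFrame({
    "topic_id": sample["topic_id"],
    "content_ids": [" ".join(predictions.get(t, [])) for t in sample["topic_id"]],
})

# submission.to_csv("submission.csv", index=False)
```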
