# Tips and Recommendations

This notebook contains some tips and recommendations for working with this problem and this dataset, to help enrich your thinking about how you might approach it.

### [Tip #1: Context matters! Explore the tree!](#Tip-1:-Context-matters!-Explore-the-tree!)

### [Tip #2: Narrow down by language](#Tip-2:-Narrow-down-by-language)

### [Tip #3: Focus on aligned and supplemental for performance](#Tip-3:-Focus-on-aligned-and-supplemental-for-performance)

### [Tip #4: Balance the semantics of title, description, and text](#Tip-4:-Balance-the-semantics-of-title,-description,-and-text)

### [Tip #5: Disregard copyright_holder for training purposes](#Tip-5:-Disregard-copyright_holder-for-training-purposes)

### [Tip #6: Restructure correlations for efficiency](#Tip-6:-Restructure-correlations-for-efficiency)

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from IPython.display import display, Markdown
from pathlib import Path

data_dir = Path('/kaggle/input/learning-equality-curriculum-recommendations')

In [2]:
# load the data into pandas dataframes
topics_df = pd.read_csv(data_dir / "topics.csv", index_col=0).fillna({"title": "", "description": ""})
content_df = pd.read_csv(data_dir / "content.csv", index_col=0).fillna("")
correlations_df = pd.read_csv(data_dir / "correlations.csv", index_col=0)

In [3]:
# define some helper functions and classes to aid with data traversal

def print_markdown(md):
    display(Markdown(md))

class Topic:
    def __init__(self, topic_id):
        self.id = topic_id

    @property
    def parent(self):
        parent_id = topics_df.loc[self.id].parent
        if pd.isna(parent_id):
            return None
        else:
            return Topic(parent_id)

    @property
    def ancestors(self):
        ancestors = []
        parent = self.parent
        while parent is not None:
            ancestors.append(parent)
            parent = parent.parent
        return ancestors

    @property
    def siblings(self):
        if not self.parent:
            return []
        else:
            return [topic for topic in self.parent.children if topic != self]

    @property
    def content(self):
        if self.id in correlations_df.index:
            return [ContentItem(content_id) for content_id in correlations_df.loc[self.id].content_ids.split()]
        else:
            # no correlations recorded for this topic
            return []

    def get_breadcrumbs(self, separator=" >> ", include_self=True, include_root=True):
        ancestors = self.ancestors
        if include_self:
            ancestors = [self] + ancestors
        if not include_root:
            ancestors = ancestors[:-1]
        return separator.join(reversed([a.title for a in ancestors]))

    @property
    def children(self):
        return [Topic(child_id) for child_id in topics_df[topics_df.parent == self.id].index]

    def subtree_markdown(self, depth=0):
        markdown = "  " * depth + "- " + self.title + "\n"
        for child in self.children:
            markdown += child.subtree_markdown(depth=depth + 1)
        for content in self.content:
            markdown += ("  " * (depth + 1) + "- " + "[" + content.kind.title() + "] " + content.title) + "\n"
        return markdown

    def __eq__(self, other):
        if not isinstance(other, Topic):
            return False
        return self.id == other.id

    def __getattr__(self, name):
        return topics_df.loc[self.id][name]

    def __str__(self):
        return self.title
    
    def __repr__(self):
        return f"<Topic(id={self.id}, title=\"{self.title}\")>"


class ContentItem:
    def __init__(self, content_id):
        self.id = content_id

    @property
    def topics(self):
        return [Topic(topic_id) for topic_id in topics_df.loc[correlations_df[correlations_df.content_ids.str.contains(self.id)].index].index]

    def __getattr__(self, name):
        return content_df.loc[self.id][name]

    def __str__(self):
        return self.title
    
    def __repr__(self):
        return f"<ContentItem(id={self.id}, title=\"{self.title}\")>"

    def __eq__(self, other):
        if not isinstance(other, ContentItem):
            return False
        return self.id == other.id

    def get_all_breadcrumbs(self, separator=" >> ", include_root=True):
        breadcrumbs = []
        for topic in self.topics:
            new_breadcrumb = topic.get_breadcrumbs(separator=separator, include_root=include_root)
            if new_breadcrumb:
                new_breadcrumb = new_breadcrumb + separator + self.title
            else:
                new_breadcrumb = self.title
            breadcrumbs.append(new_breadcrumb)
        return breadcrumbs

## Tip 1: Context matters! Explore the tree!

These topics are organized into trees for a reason. The trees represent the overall organizational structure of the curriculum or other taxonomy into which the content is being organized. Just looking at the target topic and ignoring its context in its tree is unlikely to produce optimal results, as there may be relevant contextual information contained elsewhere in the tree as well.

In [4]:
# an example topic that does not by itself provide much information about what content is relevant
topic = Topic("t_c78b75536f2c")
print("Content title:\t'" + topic.content[0].title + "' [kind: " + topic.content[0].kind + "]")
print("Topic title:\t'" + topic.title + "'")
print("Breadcrumbs:\t" + topic.get_breadcrumbs())  # build the topic's path from the root of its tree

Content title:	'Applications of properties of kite' [kind: video]
Topic title:	'Videos'
Breadcrumbs:	Maths G3 to G10 >> Maths >> G8 >> 8. Quadrilateral: Constructions and types >> Kite >> Videos


In the example above, note that the title of the topic ("Videos") is not a good semantic predictor of the title of the content item ("Applications of properties of kite"), instead referring to its kind.

The parent topic ("Kite") is a better semantic match, but it still doesn't make clear which sense of the word "kite" is meant. The grandparent topic ("8. Quadrilateral: Constructions and types") adds the context that "kite" refers to a geometric shape.

Furthermore, the "G8" ancestor provides a clue about the appropriate level of complexity of the content, as this topic might be presented differently to a grade 4 class than to a grade 10 class.

Are there perhaps cases where the siblings, cousins, and other relatives of the topic might also provide additional relevant semantic context?
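
One way to turn this tree context into model input is to join the ancestor titles into a single string. The sketch below does that with plain dictionaries instead of the `Topic` class, which avoids a DataFrame lookup at every step up the tree. The toy frame, its ids, and the `context_text` helper are illustrative assumptions, not part of the dataset or the helper classes above.

```python
import pandas as pd

# Toy stand-in for topics_df (hypothetical ids and titles; the root has no parent).
toy_topics = pd.DataFrame(
    {
        "title": ["Maths", "G8", "Kite", "Videos"],
        "parent": [None, "t_root", "t_g8", "t_kite"],
    },
    index=["t_root", "t_g8", "t_kite", "t_videos"],
)

# Plain dicts make each step up the tree a constant-time lookup.
parent_of = toy_topics["parent"].to_dict()
title_of = toy_topics["title"].to_dict()


def context_text(topic_id, separator=" >> "):
    """Join the titles from the root down to the given topic."""
    titles = []
    current = topic_id
    while isinstance(current, str):  # stops at None/NaN, i.e. above the root
        titles.append(title_of[current])
        current = parent_of.get(current)
    return separator.join(reversed(titles))


print(context_text("t_videos"))  # Maths >> G8 >> Kite >> Videos
```

The same dictionaries could be built once from the real `topics_df` and reused for every topic.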

## Tip 2: Narrow down by language

The language of a topic will almost always (99% of the time) match the language of any correlated content. Filtering, or at least prioritizing, your content recommendations by the language of the target topic may give you better performance.

Content in the same language as the topic is the easiest to match.

In [5]:
matching = 0
nonmatching = 0
for topic_id in topics_df.query("has_content").sample(n=1000).index:
    topic = Topic(topic_id)
    if any(topic.language != content.language for content in topic.content):
        nonmatching += 1
    else:
        matching += 1

print("Matching:", matching)
print("Nonmatching:", nonmatching)
print("Percent matching: {:.2f}%".format(100 * matching / (matching + nonmatching)))

Matching: 989
Nonmatching: 11
Percent matching: 98.90%
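
Given the match rate measured above, one simple way to narrow the search is to restrict each topic's candidate pool to content in the same language. The following is a minimal sketch on toy data; the frame, its ids, and the `candidates_for` helper are hypothetical, and a hard filter like this gives up the small share of cross-language matches.

```python
import pandas as pd

# Toy stand-in for content_df (hypothetical ids).
toy_content = pd.DataFrame(
    {"language": ["en", "en", "es", "fr"]},
    index=["c_1", "c_2", "c_3", "c_4"],
)

# Precompute language -> list of content ids once, then reuse it for every topic.
content_ids_by_language = {
    language: list(group.index)
    for language, group in toy_content.groupby("language")
}


def candidates_for(topic_language):
    """Return the content ids whose language matches the topic's language."""
    return content_ids_by_language.get(topic_language, [])


print(candidates_for("en"))  # ['c_1', 'c_2']
```

If the cross-language cases matter for your score, the language match could instead be used as a ranking feature rather than a filter.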


## Tip 3: Focus on aligned and supplemental for performance

As described on the data page, topics are organized into "channels" (each with a single topic tree), and these channels fall into one of the following categories:
- **`source`**: Structure was given by original content creator (e.g. the topic tree as imported from Khan Academy).
- **`aligned`**: Structure is from a national curriculum or other target taxonomy, with content typically aligned from multiple sources.
- **`supplemental`**: This is a channel that has to some extent been aligned, but without the same level of granularity or fidelity as an aligned channel.

As the goal of this competition is to produce algorithms that recommend content from _multiple sources_ to align with the topics in a novel topic tree, the testing dataset does not contain any topics from `source` channels. These topics are included in the training dataset because we believe they will have a beneficial role to play in training, but they also have specific biases in them (such as more consistent relationships between titles of topics and content items, more homogeneous content formats, etc.). When choosing your loss functions, and your validation sets for measuring performance, you may wish to focus on aligned and supplemental topics (which are represented roughly equally in the testing data).

To summarize how the channel categories bear on model performance: `source` channels usually draw on a single source and have a fairly uniform structure, which helps a model learn the relationship between topic titles and content. `aligned` channels are organized around a specific educational standard, so their structure is finer-grained and more precise. `supplemental` channels are only partly aligned; they often complement aligned channels and do not necessarily follow a standard in full.

Because the test set excludes `source` topics, whose uniformity could introduce bias, they are best treated as extra training signal. When choosing loss functions and validation sets, focus on `aligned` and `supplemental` topics: they appear in roughly equal proportion in the test data and better reflect the multi-source recommendation setting.

In [6]:
# The training data (shown below) is heavily weighted towards `source` channels,
# whereas the testing data consists only of `supplemental` and `aligned` channels
topics_df.category.value_counts()

source          43487
supplemental    19368
aligned         14117
Name: category, dtype: int64
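
One way to act on this imbalance is to build the validation set only from non-`source` topics, holding out whole channels so that validation topics come from trees the model has not seen. The sketch below uses toy data; the `channel` column name, the ids, and the `split_by_channel` helper are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for topics_df; the `channel` column name is an assumption.
toy_topics = pd.DataFrame(
    {
        "channel": ["ch_a", "ch_a", "ch_b", "ch_c", "ch_c", "ch_d"],
        "category": ["source", "source", "aligned", "supplemental", "supplemental", "aligned"],
        "has_content": [True, True, True, True, False, True],
    },
    index=["t_1", "t_2", "t_3", "t_4", "t_5", "t_6"],
)


def split_by_channel(topics, holdout_channels):
    """Hold out whole non-source channels for validation; train on the rest."""
    is_holdout = topics["channel"].isin(holdout_channels) & (topics["category"] != "source")
    # Only topics with correlated content can be scored during validation.
    validation = topics[is_holdout & topics["has_content"]]
    train = topics[~is_holdout]
    return train, validation


train_topics, validation_topics = split_by_channel(toy_topics, holdout_channels={"ch_b", "ch_c"})
print(list(validation_topics.index))  # ['t_3', 't_4']
```

All `source` topics stay in the training split, which matches the suggestion to use them for training but not for measuring performance.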

## Tip 4: Balance the semantics of `title`, `description`, and `text`

The amount of text varies a lot across topics and content items. Every topic and content item has a `title`, fewer have a `description`, and even fewer (and only content items) have any `text`. The amount of "noise" or "distraction" also varies a lot across these fields. Think carefully about how you combine, weight, and select the semantic information from each of these fields.

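- **`title`**: Every topic and content item has one. It is usually the most concise description and carries the main information about the item. It is often the first thing a user sees, so it matters a lot.
- **`description`**: Sometimes gives more background than the title, but not every item has one. Where present, it adds context and detail that make the item easier to understand.
- **`text`**: Usually the most detailed field, but only content items have it, and not all of them. It supports deeper analysis of the content but may also contain more noise or irrelevant material.

Because information density and quality differ across these fields, a recommendation model needs to balance their semantic contributions, possibly by weighting them. For example, the title may deserve a higher weight because it reflects the core of the topic or content directly, while the text may need more filtering to remove noise.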

In [8]:
print("Topics with titles: {:.0f}%".format(100 * (topics_df.title != '').mean()))
print("Topics with descriptions: {:.0f}%".format(100 * (topics_df.description != '').mean()))
print()
print("Content with titles: {:.0f}%".format(100 * (content_df.title != '').mean()))
print("Content with descriptions: {:.0f}%".format(100 * (content_df.description != '').mean()))
print("Content with text: {:.0f}%".format(100 * (content_df.text != '').mean()))

Topics with titles: 100%
Topics with descriptions: 45%

Content with titles: 100%
Content with descriptions: 58%
Content with text: 48%
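
As one possible way to combine the fields, the sketch below concatenates them title first, skips empty ones, and truncates the longer fields so they cannot drown out the title. The `build_input_text` helper, its separator, and the character limits are illustrative assumptions, not tuned values.

```python
def build_input_text(title, description="", text="",
                     max_description_chars=200, max_text_chars=300,
                     separator=" | "):
    """Concatenate non-empty fields, title first, truncating the longer ones."""
    parts = [
        title.strip(),
        description.strip()[:max_description_chars],
        text.strip()[:max_text_chars],  # keeps only the start of the text
    ]
    # Empty fields are dropped so the separator never appears twice in a row.
    return separator.join(part for part in parts if part)


print(build_input_text("Kite", "", "A kite has two pairs of equal adjacent sides."))
# Kite | A kite has two pairs of equal adjacent sides.
```

Truncation is the crudest form of weighting; encoding each field separately and combining the embeddings is another option.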


In [9]:
# sometimes the title is meaningful, and the description less so
topic = Topic("t_bcab9c637071")
print("Title:      \t", topic.title)
print("Description:\t", topic.description)

Title:      	 Different meanings of fractions
Description:	 v0.1


In [10]:
# while other times the title doesn't provide as much useful information as the description
topic = Topic("t_fda27a0b8b63")
print("Title:      \t", topic.title)
print("Description:\t", topic.description)

Title:      	 Teacher leader
Description:	 Learn about Paul Clifton's role as a 6th grade teacher leader for math and ELD, as well as how he manages his finances. 


In [11]:
# and sometimes the content's text is more informative than its title, especially the first part
content = ContentItem("c_7f2dd85e3f71")
print("Topic title:\t", content.topics[0].title, "\n")
print("Content title:\t", content.title, "\n")
print("First line of text:")
print(content.text.split("\n")[0])
print("\nLast line of text:")
print(content.text.split("\n")[-1])

Topic title:	 Module 03: Permission and Ownership Management 

Content title:	 C.5: File Attributes 

First line of text:
EXAM OBJECTIVES COVERED 3.1 Given a scenario, apply or acquire the appropriate user and/or group permissions and ownership.

Last line of text:
Adapted from: "chattr command in Linux with examples" (https://www.geeksforgeeks.org/chattr-command-in-linux-with-examples/) by atharvakango (https://auth.geeksforgeeks.org/user/atharvakango/articles), Geeks for Geeks (https://www.geeksforgeeks.org/) is licensed under CC BY-SA 4.0 (http://creativecommons.org/licenses/by-sa/4.0) "Access Control Lists(ACL) in Linux" (https://www.geeksforgeeks.org/access-control-listsacl-linux/) by msdeep14 (https://auth.geeksforgeeks.org/user/msdeep14/articles), Geeks for Geeks is licensed under CC BY-SA 4.0 (http://creativecommons.org/licenses/by-sa/4.0)


## Tip 5: Disregard `copyright_holder` for training purposes

The field `copyright_holder` is specified for content items that include extracted text, in order to properly attribute the copyright holder for the source content. However, this field has been blanked out in the testing data, as it's not something we want you to use as a basis for prediction, so leveraging it during training will not benefit your submission's test performance.

In short: `copyright_holder` is blank in the test set, so it carries no usable signal there.
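
A simple way to make sure the field never leaks into your features is to drop it right after loading. This is a minimal sketch on a toy frame with hypothetical values; `errors="ignore"` keeps the call safe if the column is already absent.

```python
import pandas as pd

# Toy stand-in for content_df (hypothetical values).
toy_content = pd.DataFrame(
    {"title": ["File Attributes"], "copyright_holder": ["Example Org"]},
    index=["c_1"],
)

# Drop the column before any feature engineering.
features = toy_content.drop(columns=["copyright_holder"], errors="ignore")
print(list(features.columns))  # ['title']
```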

## Tip 6: Restructure correlations for efficiency

If you're repeatedly traversing `correlations_df.content_ids`, you may wish to restructure it for more efficient lookups and joins.



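`content_ids` stores, for each topic, the space-separated ids of its correlated content items. These topic-to-content pairs can serve as the training signal.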

In [12]:
correlations = correlations_df.copy()
correlations.content_ids = correlations.content_ids.str.split()
correlations = correlations.explode("content_ids").rename(columns={"content_ids": "content_id"})
correlations

topic_id,content_id
t_00004da3a1b2,c_1108dd0c7a5d
t_00004da3a1b2,c_376c5a8eb028
t_00004da3a1b2,c_5bc0e1e2cba0
t_00004da3a1b2,c_76231f9d0b5e
t_00068291e9a4,c_639ea2ef9c95
...,...
t_fff9e5407d13,c_d64037a72376
t_fffbe1d5d43c,c_46f852a49c08
t_fffbe1d5d43c,c_6659207b25d5
t_fffe14f1be1e,c_cece166bad6a
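
Once the correlations are in this one-pair-per-row form, lookups in both directions can be precomputed. The sketch below builds two dictionaries from a toy frame; the ids and the names `content_by_topic` and `topics_by_content` are assumptions chosen for illustration.

```python
import pandas as pd

# Toy stand-in for correlations_df (hypothetical ids).
toy_correlations = pd.DataFrame(
    {"content_ids": ["c_1 c_2", "c_2 c_3"]},
    index=pd.Index(["t_a", "t_b"], name="topic_id"),
)

# One row per (topic_id, content_id) pair.
pairs = (
    toy_correlations["content_ids"].str.split()
    .explode()
    .rename("content_id")
    .reset_index()
)

# Constant-time lookups in both directions.
content_by_topic = pairs.groupby("topic_id")["content_id"].apply(list).to_dict()
topics_by_content = pairs.groupby("content_id")["topic_id"].apply(list).to_dict()

print(topics_by_content["c_2"])  # ['t_a', 't_b']
```

A dictionary like `topics_by_content` avoids scanning every row with `str.contains`, as `ContentItem.topics` does, each time the topics of a content item are needed.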
