# Motivation

The goal is to analyze the tags attached to programming contest problems and quantify which types of problems are most
common in each rating division.

## Codeforces

For Codeforces, the common rating divisions are Div. 3, Div. 2, and Div. 1, with the problems in each contest ordered
from least to most difficult. It makes sense to first identify the most recent contests of these types, then find the
problems they contain and the corresponding tags. We can then visualize the distribution of the tags and see which are
the most frequent, along with possible secondary statistics such as which tags are the "toughest" (i.e., correspond to
the most failed attempts), which tags are the easiest, and so on.

The relevant API endpoints are:
* `https://codeforces.com/api/contest.list`
* `https://codeforces.com/api/problemset.problems`

The first returns a list of all contests, each represented as a [`Contest`](https://codeforces.com/apiHelp/objects#Contest) object.

```json
{
    "id": 1854,
    "name": "Codeforces Round 889 (Div. 1)",
    "type": "CF",
    "phase": "BEFORE",
    "frozen": false,
    "durationSeconds": 9000,
    "startTimeSeconds": 1690641300,
    "relativeTimeSeconds": -147153
}
```

Only the `id` and `name` fields are relevant here. We need string-based heuristics on the name to identify the standard
Codeforces rounds and their intended divisions.
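
The name-based heuristic described above could look like the following minimal sketch. `extract_divisions` is a hypothetical helper, and it assumes that divisions always appear in the contest name in the form `Div. N`; the second contest name below is made up for illustration.

```python
import re
from typing import List

# Assumption: divisions are written as "Div. N" inside the contest name.
_DIV_PATTERN = re.compile(r"Div\.\s*(\d)")


def extract_divisions(contest_name: str) -> List[int]:
    """Return the sorted, de-duplicated division numbers found in a contest name."""
    return sorted({int(d) for d in _DIV_PATTERN.findall(contest_name)})


print(extract_divisions("Codeforces Round 889 (Div. 1)"))
print(extract_divisions("Example Round (Div. 1 + Div. 2)"))  # made-up name
```

A contest whose name yields an empty list would be treated as a non-standard round and skipped.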

The problemset API returns a list of all problems, each represented as a [`Problem`]() object.

```json
{
    "contestId": 1853,
    "index": "B",
    "name": "Fibonaccharsis",
    "type": "PROGRAMMING",
    "points": 1000.0,
    "rating": 1200,
    "tags": ["binary search", "brute force", "math"]
}
```

The relevant fields are `contestId`, which ties a problem back to its contest so we can check whether it came from a
regular CF Round and which Div that round targeted; `rating`, which denotes difficulty and may be used to skip the
contest check entirely; and `tags`, which is the field we want to analyze.
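
As a rough sketch of tying problems back to contests through `contestId`, the snippet below joins the two record types with a dictionary lookup. The records are made-up samples trimmed to the fields discussed above, and `attach_contest_name` and the `contestName` key are hypothetical names introduced only for this illustration.

```python
from typing import Dict, List, Optional

# Made-up sample records shaped like the Contest and Problem objects above.
contests = [
    {"id": 101, "name": "Example Round 1 (Div. 1)"},
    {"id": 102, "name": "Example Round 1 (Div. 2)"},
]
problems = [
    {"contestId": 101, "index": "A", "rating": 1900, "tags": ["dp"]},
    {"contestId": 102, "index": "B", "rating": 1200, "tags": ["math", "greedy"]},
    {"contestId": 999, "index": "A", "tags": ["*special"]},  # no matching contest
]

# Index contests by id so that each problem lookup is a single dict access.
contest_names: Dict[int, str] = {c["id"]: c["name"] for c in contests}


def attach_contest_name(problem: dict) -> dict:
    """Return a copy of the problem with a `contestName` key (None if unknown)."""
    name: Optional[str] = contest_names.get(problem.get("contestId"))
    return {**problem, "contestName": name}


joined: List[dict] = [attach_contest_name(p) for p in problems]
for p in joined:
    print(p["contestId"], p["index"], p["contestName"], p.get("rating"))
```

Problems whose `contestName` is `None` can then be dropped, or kept when filtering by `rating` alone is enough.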

### Get data

In [1]:
import requests
r = requests.get('https://codeforces.com/api/problemset.problems')
r.status_code

200
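
Besides the HTTP status code, Codeforces API responses carry a `status` field (`OK` or `FAILED`) alongside `result`, with a `comment` on failure. A minimal sketch of checking it before use is shown below; `unwrap_result` is a hypothetical helper, and the payloads are hand-written for illustration rather than real responses.

```python
from typing import Any, Dict


def unwrap_result(payload: Dict[str, Any]) -> Any:
    """Return payload['result'], or raise if the API reported a failure."""
    if payload.get("status") != "OK":
        comment = payload.get("comment", "no comment provided")
        raise RuntimeError(f"Codeforces API call failed: {comment}")
    return payload["result"]


# Hand-written payload for illustration, not a real API response.
ok_payload = {"status": "OK", "result": {"problems": [], "problemStatistics": []}}
print(unwrap_result(ok_payload)["problems"])
```

In the notebook this would be called as `unwrap_result(r.json())` in place of indexing `data['result']` directly.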

In [2]:
from typing import List

class Problem:
    def __init__(self, name: str, rating: int, tags: List[str]):
        self.name = name
        self.rating = rating
        self.tags = tags
    
    def __repr__(self):
        return f'Problem(name={str(self.name)}, rating={str(self.rating)}, tags={str(self.tags)})'
    
    def __str__(self):
        return repr(self)

In [3]:
data = r.json()
problems = data['result']['problems']
print(problems[0])

{'contestId': 1887, 'index': 'F', 'name': 'Minimum Segments', 'type': 'PROGRAMMING', 'points': 2750.0, 'tags': ['constructive algorithms']}


In [4]:
import numpy as np
import pandas as pd

problems = [Problem(problem['name'], problem.get('rating', np.nan), problem.get('tags', [])) for problem in problems]
print(problems[0])

Problem(name=Minimum Segments, rating=nan, tags=['constructive algorithms'])


In [5]:
names = [problem.name for problem in problems]
ratings = [problem.rating for problem in problems]
tags = [problem.tags for problem in problems]

df = pd.DataFrame({
    'name': names,
    'rating': ratings,
    'tags': tags
})

print(df)

                        name  rating  \
0           Minimum Segments     NaN   
1             Good Colorings     NaN   
2                      Split     NaN   
3              Minimum Array     NaN   
4                Time Travel     NaN   
...                      ...     ...   
9002     The least round way  2000.0   
9003                  Winner  1500.0   
9004  Ancient Berland Circus  2100.0   
9005             Spreadsheet  1600.0   
9006          Theatre Square  1000.0   

                                                   tags  
0                             [constructive algorithms]  
1     [binary search, constructive algorithms, graph...  
2     [binary search, data structures, divide and co...  
3     [binary search, brute force, constructive algo...  
4               [binary search, graphs, shortest paths]  
...                                                 ...  
9002                                         [dp, math]  
9003                          [hashing, implementation]

### Preprocessing

In [6]:
all_tags = set(tag for tags_list in df['tags'] for tag in tags_list)
print(len(all_tags))
print(all_tags)

37
{'math', 'implementation', 'interactive', 'flows', 'binary search', 'dsu', 'constructive algorithms', 'two pointers', 'number theory', 'meet-in-the-middle', 'string suffix structures', 'ternary search', 'sortings', 'divide and conquer', 'hashing', 'expression parsing', 'graph matchings', 'chinese remainder theorem', 'trees', 'schedules', 'fft', 'data structures', 'greedy', '*special', 'dp', 'brute force', 'shortest paths', 'geometry', 'dfs and similar', 'matrices', 'games', 'strings', 'combinatorics', '2-sat', 'graphs', 'probabilities', 'bitmasks'}


In [None]:
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
tags_encoded = pd.DataFrame(mlb.fit_transform(df['tags']), columns=mlb.classes_, index=df.index)
df = pd.concat([df, tags_encoded], axis=1)

In [None]:
result_data = []

for index, row in df.iterrows():
    for tag in all_tags:
        if row[tag] == 1:
            result_data.append({'tag': tag, 'rating': row['rating']})

result_df = pd.DataFrame(result_data)
result_df = result_df.dropna()
result_df.reset_index(drop=True, inplace=True)
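
The same long-format table can likely be built without the one-hot columns or the row loop by using `DataFrame.explode`. This is a sketch of an alternative, not the approach used above; it runs on a small made-up frame with the same `rating` and `tags` columns as `df`, and the row order will differ from the loop-based version.

```python
import numpy as np
import pandas as pd

# Made-up frame with the same columns as `df` above.
sample = pd.DataFrame({
    "name": ["A", "B", "C"],
    "rating": [1200.0, np.nan, 800.0],
    "tags": [["math", "dp"], ["greedy"], []],
})

long_df = (
    sample[["rating", "tags"]]
    .explode("tags")                    # one row per (problem, tag) pair
    .rename(columns={"tags": "tag"})
    .dropna(subset=["rating", "tag"])   # drop unrated and untagged problems
    .reset_index(drop=True)
)[["tag", "rating"]]

print(long_df)
```

An empty tag list explodes to a single missing value, which is why `tag` is included in the `dropna` subset.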

### Visualization

Given this data, we want to gain insight into which tags are more common at certain rating ranges, such that competitors at those ranges have an idea of which problem tags will be useful to study if they want to progress.
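
Before plotting, a plain count of tags per rating band gives a quick numeric view of the same question. The sketch below uses made-up data with the same `tag` and `rating` columns as `result_df`; the band edges and labels are arbitrary assumptions and should be adjusted to the rating ranges of interest.

```python
import pandas as pd

# Made-up long-format data with the same columns as `result_df`.
sample = pd.DataFrame({
    "tag": ["math", "dp", "math", "greedy", "dp", "math"],
    "rating": [800.0, 1500.0, 1200.0, 900.0, 2100.0, 1900.0],
})

# Hypothetical band edges; each band is closed on the left and open on the right.
edges = [0, 1200, 1600, 2000, 4000]
labels = ["<1200", "1200-1599", "1600-1999", "2000+"]
sample["band"] = pd.cut(sample["rating"], bins=edges, labels=labels, right=False)

# Rows are rating bands, columns are tags, cells are problem counts.
counts = pd.crosstab(sample["band"], sample["tag"])
print(counts)
```

Sorting a single row of `counts` in descending order lists the most frequent tags for that band.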

In [105]:
import lets_plot as gg
gg.LetsPlot.setup_html()

# `aes`, `layer_tooltips` and `element_text` are qualified with `gg.` because lets_plot is imported under an alias.
p = gg.ggplot(result_df, gg.aes(x='rating')) + \
    gg.geom_density(gg.aes(color='tag', fill='tag'), alpha=.05,  \
                 tooltips=gg.layer_tooltips().format('..count..', '.1f')  \
                                             .line('@tag') \
                                             .line('count|@..count..')) + \
    gg.ggsize(1920, 1080) + \
    gg.labs(x='Rating', y='Density', color='Tags', fill='Tags') + \
    gg.ggtitle(label='Tag Distribution by Rating on Codeforces', \
               subtitle='Density curves for each problem tag on Codeforces by its rating') + \
    gg.flavor_solarized_light() + \
    gg.theme(plot_title=gg.element_text(hjust=0.5, face='bold'), \
             plot_subtitle=gg.element_text(hjust=0.5, color='#93a1a1'), \
             legend_title=gg.element_text(hjust=0.5)) + \
    gg.theme(legend_text=gg.element_text(size=10)) + \
    gg.guides(color=gg.guide_legend(ncol=2), fill=gg.guide_legend(ncol=2))

gg.ggsave(p, 'tag_distribution.png')

'/home/akrish13/repos/cp-problem-tag-analysis/lets-plot-images/tag_distribution.png'

## CodeChef

For CodeChef, there are two rating divisions, Div. 2 and Div. 1. There are long contests, short contests, and Snacktime.
For each of these, we can repeat the same process as for Codeforces.