## Notebook to calculate tag trends

### Description
This notebook calculates trend data for each tag, such as the tag's share and rank, based on the cumulative activity score up to a given date.

### Input
This notebook takes as input the partitioned `posts_activity_tag.csv` file produced by the previous step.

### Output
As output, this notebook produces the `tag_trends.csv` file in the following format:
```
Tag,CreationDate,TagTotalActivityScore,TotalActivityScore,TagShare,TagRank
{tag},{date},{tag-total-activity},{total-activity},{tag-share},{tag-rank}
```
where:
- `{tag}` - a single tag attached to a post. For instance: `c#`;
- `{date}` - the month in 'YYYY-MM' format. For example: '2008-07';
- `{tag-total-activity}` - the cumulative activity score of the `{tag}` from the beginning up until `{date}`;
- `{total-activity}` - the cumulative activity score of all posts from the beginning up until `{date}`;
- `{tag-share}` - the percentage of activity related to the tag compared to all activity. Calculated as `{tag-total-activity} / {total-activity} * 100`;
- `{tag-rank}` - the rank of the `{tag}` based on `{tag-share}` in comparison to other tags at the same `{date}`; rank 1 is the tag with the highest share.

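For example (rows taken from the output of the final cell of this notebook, with `TagShare` shown rounded to six decimals):
```csv
Tag,CreationDate,TagTotalActivityScore,TotalActivityScore,TagShare,TagRank
.a,2010-01,361,21380180,0.001688,4884.0
.a,2010-03,364,24298257,0.001498,5359.0
.a,2010-04,434,25736070,0.001686,5143.0
```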
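
The share and rank definitions above can be tried out on a few rows. The sketch below uses plain pandas and made-up cumulative scores (the numbers are assumptions for illustration, not values from the real dataset); the column names follow the output of this notebook.

```python
import pandas as pd

# Toy cumulative scores (made-up numbers, not taken from the real dataset).
toy_df = pd.DataFrame({
    "Tag": ["c#", "python", "c#", "python"],
    "CreationDate": ["2008-07", "2008-07", "2008-08", "2008-08"],
    "TagTotalActivityScore": [60, 30, 100, 150],
    "TotalActivityScore": [100, 100, 300, 300],
})

# Share of each tag in the overall cumulative activity, as a percentage.
toy_df["TagShare"] = (
    toy_df["TagTotalActivityScore"] / toy_df["TotalActivityScore"] * 100
)

# Rank tags within each month; the highest share gets rank 1.
toy_df["TagRank"] = (
    toy_df.groupby("CreationDate")["TagShare"]
    .rank(method="first", ascending=False)
)

print(toy_df)
```

In this toy data `c#` leads in 2008-07 and `python` overtakes it in 2008-08, which is the kind of change in rank the trend file is meant to capture.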

In [1]:
import dask.dataframe as dd
from config import get_file_path

### Load data and show a sample

In [2]:
posts_activity_tag_file_path = get_file_path("posts_activity_tag.csv/*")
posts_activity_tag_df = dd.read_csv(posts_activity_tag_file_path)
posts_activity_tag_df.head()

Unnamed: 0,Id,ActivityScore,CreationDate,Tag
0,4,809,2008-07,c#
1,4,809,2008-07,floating-point
2,4,809,2008-07,type-conversion
3,4,809,2008-07,double
4,4,809,2008-07,decimal


### Calculate activity score per tag and creation date

In [66]:
# Sum the activity score of all posts for each (tag, month) pair.
tag_date_activitity_score_df = posts_activity_tag_df.\
    groupby(['Tag', 'CreationDate'])['ActivityScore'].\
    sum().\
    rename('ActivityScore').\
    reset_index().\
    sort_values(by=['Tag', 'CreationDate'], ascending=True).\
    persist()
tag_date_activitity_score_df.head()

Unnamed: 0,Tag,CreationDate,ActivityScore
105340,.a,2010-01,361
155727,.a,2010-03,3
100215,.a,2010-04,70
159370,.a,2011-02,3
105364,.a,2011-05,18


### Calculate cumulative activity score per tag and creation date

In [74]:
# Sort chronologically within each tag and keep all rows in one partition.
tag_total_activitity_score_df = tag_date_activitity_score_df.sort_values(by=['Tag', 'CreationDate'], ascending=True).repartition(npartitions=1)
tag_total_activitity_score_df['TagTotalActivityScore'] = tag_total_activitity_score_df.groupby('Tag')['ActivityScore'].cumsum()
# persist() returns a new collection, so assign the result back.
tag_total_activitity_score_df = tag_total_activitity_score_df.persist()
tag_total_activitity_score_df.head()

Unnamed: 0,Tag,CreationDate,ActivityScore,TagTotalActivityScore
105340,.a,2010-01,361,361
155727,.a,2010-03,3,364
100215,.a,2010-04,70,434
159370,.a,2011-02,3,437
105364,.a,2011-05,18,455
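
The cumulative sum only gives a running total "up until `{date}`" if the rows of each tag are in chronological order. A minimal pandas sketch of the same pattern on made-up, deliberately unsorted rows (the data is an assumption for illustration):

```python
import pandas as pd

# Made-up monthly scores, intentionally out of order.
toy_df = pd.DataFrame({
    "Tag": ["b", "a", "a", "b", "a"],
    "CreationDate": ["2010-02", "2010-03", "2010-01", "2010-01", "2010-02"],
    "ActivityScore": [5, 3, 10, 1, 2],
})

# Sort by tag and month first; 'YYYY-MM' strings sort chronologically.
toy_df = toy_df.sort_values(by=["Tag", "CreationDate"]).reset_index(drop=True)

# Running total per tag, restarting at zero for every new tag.
toy_df["TagTotalActivityScore"] = toy_df.groupby("Tag")["ActivityScore"].cumsum()

print(toy_df)
```

Tag `a` accumulates 10, 12, 15 and tag `b` accumulates 1, 6, so each row holds the tag's total score up to and including that month.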


### Calculate activity score per creation date

In [96]:
# Each post appears once per tag, so take its score once per (month, post)
# before summing per month; otherwise multi-tag posts would be counted repeatedly.
date_activity_score_df = posts_activity_tag_df.\
    groupby(['CreationDate', 'Id']).\
    agg({'ActivityScore': 'first'}).\
    reset_index()[['CreationDate', 'ActivityScore']].\
    groupby('CreationDate').\
    agg({'ActivityScore': 'sum'}).\
    reset_index().\
    sort_values(by=['CreationDate'], ascending=True).\
    persist()

date_activity_score_df.head()

Unnamed: 0,CreationDate,ActivityScore
188,2008-07,5932
167,2008-08,472624
171,2008-09,1866414
175,2008-10,1450211
179,2008-11,909101
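
The two-level aggregation above is needed because the input has one row per (post, tag) pair. A small pandas sketch of the difference, assuming a toy input in which post `4` mirrors the sample shown earlier and post `6` is made up:

```python
import pandas as pd

# One row per (post, tag); the post's score is repeated on every row.
toy_df = pd.DataFrame({
    "Id": [4, 4, 4, 6, 6],
    "ActivityScore": [809, 809, 809, 100, 100],
    "CreationDate": ["2008-07"] * 5,
    "Tag": ["c#", "double", "decimal", "html", "css"],
})

# Naive monthly total: counts each post once per tag it carries.
naive_total = toy_df.groupby("CreationDate")["ActivityScore"].sum()

# Deduplicated total: one score per post, then summed per month.
per_post = (
    toy_df.groupby(["CreationDate", "Id"])["ActivityScore"].first().reset_index()
)
deduplicated_total = per_post.groupby("CreationDate")["ActivityScore"].sum()

print(naive_total["2008-07"], deduplicated_total["2008-07"])
```

The naive sum gives 2627 while the deduplicated sum gives 909, which is the sum of the two distinct posts.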


### Calculate cumulative activity score per creation date

In [101]:
# Sort chronologically so the cumulative sum runs from the earliest month onwards.
total_activitity_score_df = date_activity_score_df.sort_values(by=['CreationDate'], ascending=True)
total_activitity_score_df['TotalActivityScore'] = total_activitity_score_df['ActivityScore'].cumsum()
# persist() returns a new collection, so assign the result back.
total_activitity_score_df = total_activitity_score_df.persist()
total_activitity_score_df.head()

Unnamed: 0,CreationDate,ActivityScore,TotalActivityScore
188,2008-07,5932,5932
167,2008-08,472624,478556
171,2008-09,1866414,2344970
175,2008-10,1450211,3795181
179,2008-11,909101,4704282


### Join two dataframes to calculate tag trends

In [106]:
tag_trends_df = dd.merge(
    tag_total_activitity_score_df,
    total_activitity_score_df,
    on='CreationDate',
    how='inner'
)[['Tag', 'CreationDate', 'TagTotalActivityScore', 'TotalActivityScore']]
tag_trends_df.head()

Unnamed: 0,Tag,CreationDate,TagTotalActivityScore,TotalActivityScore
0,.a,2010-01,361,21380180
1,.a,2010-03,364,24298257
2,.a,2010-04,434,25736070
3,.a,2011-02,437,42108507
4,.a,2011-05,455,48750127
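
The join attaches the overall cumulative total of a month to every tag row of that month. A minimal pandas sketch on made-up frames (the numbers are assumptions for illustration); `validate="many_to_one"` is an optional pandas check that each month occurs only once on the right-hand side:

```python
import pandas as pd

# Made-up cumulative score per tag and month.
tag_totals = pd.DataFrame({
    "Tag": ["a", "a", "b"],
    "CreationDate": ["2010-01", "2010-02", "2010-02"],
    "TagTotalActivityScore": [10, 12, 5],
})

# Made-up cumulative score over all posts per month.
overall_totals = pd.DataFrame({
    "CreationDate": ["2010-01", "2010-02"],
    "TotalActivityScore": [100, 250],
})

# Many tag rows match one monthly row; the check raises if a month is duplicated.
joined = tag_totals.merge(
    overall_totals, on="CreationDate", how="inner", validate="many_to_one"
)

print(joined)
```

Both tags active in 2010-02 receive the same `TotalActivityScore` of 250, so their shares for that month are computed against a common denominator.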


### Calculate tag share

In [107]:
tag_share_sequence = (tag_trends_df['TagTotalActivityScore'] / tag_trends_df['TotalActivityScore']) * 100
tag_trends_df['TagShare'] = tag_share_sequence
tag_trends_df.head()

Unnamed: 0,Tag,CreationDate,TagTotalActivityScore,TotalActivityScore,TagShare
0,.a,2010-01,361,21380180,0.001688
1,.a,2010-03,364,24298257,0.001498
2,.a,2010-04,434,25736070,0.001686
3,.a,2011-02,437,42108507,0.001038
4,.a,2011-05,455,48750127,0.000933


### Calculate tag rank

In [114]:
tag_trends_pd_df = tag_trends_df.compute()

tag_rank_sequence = tag_trends_pd_df.groupby("CreationDate")["TagShare"].rank(method="first", ascending=False)
tag_trends_pd_df = tag_trends_pd_df.assign(TagRank=tag_rank_sequence)
tag_trends_pd_df

Unnamed: 0,Tag,CreationDate,TagTotalActivityScore,TotalActivityScore,TagShare,TagRank
0,.a,2010-01,361,21380180,0.001688,4884.0
1,.a,2010-03,364,24298257,0.001498,5359.0
2,.a,2010-04,434,25736070,0.001686,5143.0
3,.a,2011-02,437,42108507,0.001038,7263.0
4,.a,2011-05,455,48750127,0.000933,7923.0
...,...,...,...,...,...,...
3650769,zyte,2023-06,90,297998257,0.000030,20458.0
3650770,zyte,2023-09,91,299232836,0.000030,19269.0
3650771,zyte,2023-12,100,300234127,0.000033,18045.0
3650772,zyte,2024-01,104,300559060,0.000035,18872.0
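
With `method="first"`, tags with an identical share receive distinct ranks according to their row order. The sketch below compares that choice with two other pandas tie-breaking methods on made-up shares (the values are assumptions for illustration), and shows an optional cast to integers, since the ranks above are stored as floats.

```python
import pandas as pd

# Made-up shares for four tags in one month; "b" and "c" are tied.
shares = pd.Series([0.5, 0.2, 0.2, 0.1], index=["a", "b", "c", "d"])

# "first": ties are broken by the order in which the rows appear.
rank_first = shares.rank(method="first", ascending=False)

# "min": tied tags share the lowest rank, and the next rank is skipped.
rank_min = shares.rank(method="min", ascending=False)

# "dense": tied tags share a rank, and no rank is skipped afterwards.
rank_dense = shares.rank(method="dense", ascending=False)

# Optional: ranks are whole numbers here, so they can be cast to int.
rank_first_int = rank_first.astype(int)

print(pd.DataFrame({"first": rank_first, "min": rank_min, "dense": rank_dense}))
```

Which method fits best depends on how ties should be displayed; `method="first"` guarantees a unique rank per tag and month, at the cost of an arbitrary order among tied tags.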


### Save
Save the final dataframe as an intermediate result.

In [115]:
tag_trends_file_path = get_file_path('tag_trends.csv')
tag_trends_pd_df.to_csv(tag_trends_file_path, index=False)