## Notebook to calculate post creation trends per tag

### Description
This notebook calculates basic trend data for each tag, such as the tag's rank based on the number of posts created up to a given date.

### Input 
This notebook takes as input the `posts_tag.csv` file produced by the previous step.

### Output
As output, this notebook produces the `posts_creation_trends.csv` file in the following format:
```
Tag,CreationDate,TagPostsCreated,TagTotalPostsCreated,PostsCreated,TotalPostsCreated,TagPostsShare,TagRank
{tag},{creation-date},{tag-posts-created},{tag-total-posts-created},{posts-created},{total-posts-created},{tag-posts-share},{tag-rank}
```
where:
- `{tag}` - a single tag attached to a post, for instance `c#`;
- `{creation-date}` - the month the post was created, in `YYYY-MM` format, for example `2008-07`;
- `{tag-posts-created}` - the number of posts created with that tag in `{creation-date}`;
- `{tag-total-posts-created}` - the cumulative number of posts created with that tag from the beginning up to and including `{creation-date}`;
- `{posts-created}` - the number of all posts created in `{creation-date}`;
- `{total-posts-created}` - the cumulative number of all posts created from the beginning up to and including `{creation-date}`;
- `{tag-posts-share}` - the percentage of all posts that carry that tag, calculated as `{tag-total-posts-created} / {total-posts-created} * 100`;
- `{tag-rank}` - the rank of `{tag}` by `{tag-posts-share}` among all tags at the same `{creation-date}`, where rank 1 is the largest share.

For example:
```csv
Tag,CreationDate,TagPostsCreated,TagTotalPostsCreated,PostsCreated,TotalPostsCreated,TagPostsShare,TagRank
.a,2010-01,4,4.0,145866,1560787.0,0.00025628096594858875,9138.0
.a,2010-03,2,6.0,160711,1862493.0,0.00032214886176753414,9492.0
.a,2010-04,5,11.0,150604,2013097.0,0.0005464217571234769,8539.0
.a,2011-02,1,12.0,236699,3929356.0,0.00030539355558519004,11605.0
.a,2011-05,4,16.0,281657,4772127.0,0.00033528026391585974,12028.0
.a,2011-06,6,22.0,279544,5051671.0,0.00043549946146532504,11475.0
.a,2011-07,6,28.0,281125,5332796.0,0.0005250528990795823,11020.0
.a,2011-08,4,32.0,298765,5631561.0,0.0005682261099542382,10858.0
.a,2012-01,2,34.0,317368,7082831.0,0.00048003404288482954,11856.0
```
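
The share formula can be checked by hand against the first sample row. The sketch below plugs in the values for `.a` at `2010-01`; the variable names are illustrative only and do not appear in the notebook.

```python
# Values copied from the first sample row (.a, 2010-01).
tag_total_posts_created = 4.0
total_posts_created = 1560787.0

# TagPostsShare is a percentage, so the ratio is multiplied by 100.
tag_posts_share = tag_total_posts_created / total_posts_created * 100
print(tag_posts_share)
```

The printed value should agree with the `TagPostsShare` column of that row, up to floating-point rounding.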

In [35]:
import pandas as pd
from config import get_file_path

### Load data, show shape and sample

In [36]:
posts_tag_file_path = get_file_path("posts_tag.csv")
posts_tag_df = pd.read_csv(posts_tag_file_path)
posts_tag_df

Unnamed: 0,Id,CreationDate,Tag
0,1,2020-01,a
1,2,2020-01,a
2,3,2020-01,a
3,4,2020-01,a
4,5,2020-01,a
...,...,...,...
3595,3596,2021-12,c
3596,3597,2021-12,c
3597,3598,2021-12,c
3598,3599,2021-12,c
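
Before running the later steps, it can help to confirm that the loaded frame has the expected shape and columns. The following is a minimal sketch on a hypothetical stand-in frame (`sample_posts_tag_df` is made up for illustration; the real data comes from `posts_tag.csv`), assuming the three columns shown above and a `YYYY-MM` creation date.

```python
import pandas as pd

# Hypothetical stand-in for posts_tag_df, with the same three columns.
sample_posts_tag_df = pd.DataFrame({
    "Id": [1, 1, 2, 3],
    "CreationDate": ["2020-01", "2020-01", "2020-01", "2020-02"],
    "Tag": ["a", "b", "a", "c"],
})

# Check that every column used by the later steps is present.
expected_columns = {"Id", "CreationDate", "Tag"}
missing_columns = expected_columns - set(sample_posts_tag_df.columns)

# Check that CreationDate values look like 'YYYY-MM'.
is_year_month = sample_posts_tag_df["CreationDate"].str.fullmatch(r"\d{4}-\d{2}")

print(sample_posts_tag_df.shape)
print(missing_columns, bool(is_year_month.all()))
```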


### Calculate the number of posts per tag and creation date

In [37]:
# Count distinct post Ids per (Tag, CreationDate).
# reset_index(name=...) already names the count column, so no separate rename is needed.
tag_posts_created_df = (
    posts_tag_df
    .groupby(['Tag', 'CreationDate'])['Id']
    .nunique()
    .reset_index(name='TagPostsCreated')
    .sort_values(by=['Tag', 'CreationDate'])
)
tag_posts_created_df

Unnamed: 0,Tag,CreationDate,TagPostsCreated
0,a,2020-01,10
1,a,2020-02,10
2,a,2020-03,10
3,a,2020-04,10
4,a,2020-05,10
...,...,...,...
67,c,2021-08,40
68,c,2021-09,40
69,c,2021-10,40
70,c,2021-11,40
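
To make the shape of this step easier to see, here is a small sketch on hypothetical data (`sample_df` is invented for illustration). It shows that `nunique()` on a two-key group yields a Series with a two-level index, which `reset_index(name=...)` flattens into ordinary columns.

```python
import pandas as pd

# Hypothetical miniature of posts_tag_df.
sample_df = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "CreationDate": ["2020-01", "2020-01", "2020-02", "2020-01"],
    "Tag": ["a", "a", "a", "b"],
})

# nunique() yields a Series indexed by (Tag, CreationDate).
counts = sample_df.groupby(["Tag", "CreationDate"])["Id"].nunique()

# reset_index(name=...) turns the index levels into columns
# and gives the value column its name.
counts_df = counts.reset_index(name="TagPostsCreated")
print(counts_df)
```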


### Calculate the cumulative number of posts per tag and creation date

In [38]:
tag_total_posts_created_sequence = tag_posts_created_df.groupby('Tag')['TagPostsCreated']. \
    expanding(). \
    sum(). \
    reset_index(level=0, drop=True)

tag_total_posts_created_df = tag_posts_created_df.assign(TagTotalPostsCreated=tag_total_posts_created_sequence)
tag_total_posts_created_df

Unnamed: 0,Tag,CreationDate,TagPostsCreated,TagTotalPostsCreated
0,a,2020-01,10,10.0
1,a,2020-02,10,20.0
2,a,2020-03,10,30.0
3,a,2020-04,10,40.0
4,a,2020-05,10,50.0
...,...,...,...,...
67,c,2021-08,40,1040.0
68,c,2021-09,40,1080.0
69,c,2021-10,40,1120.0
70,c,2021-11,40,1160.0
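
The per-tag running total above comes out as a float (`10.0`, `20.0`, ...) because `expanding().sum()` returns floating-point values. As a possible alternative, not used in this notebook, `groupby(...).cumsum()` gives the same totals while keeping the integer dtype. A sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical per-tag monthly counts, already sorted by Tag and CreationDate.
sample_df = pd.DataFrame({
    "Tag": ["a", "a", "a", "b", "b"],
    "CreationDate": ["2020-01", "2020-02", "2020-03", "2020-01", "2020-02"],
    "TagPostsCreated": [10, 10, 10, 5, 7],
})

# Same approach as the notebook: expanding sum within each tag.
expanding_total = (
    sample_df.groupby("Tag")["TagPostsCreated"]
    .expanding()
    .sum()
    .reset_index(level=0, drop=True)
)

# Alternative: cumulative sum within each tag, aligned to the original index.
cumsum_total = sample_df.groupby("Tag")["TagPostsCreated"].cumsum()

print(expanding_total.dtype, cumsum_total.dtype)
```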


### Calculate the number of posts per creation date

In [39]:
# Count distinct post Ids per CreationDate.
# reset_index(name=...) already names the count column, so no separate rename is needed.
posts_created_df = (
    posts_tag_df
    .groupby(['CreationDate'])['Id']
    .nunique()
    .reset_index(name='PostsCreated')
    .sort_values(by=['CreationDate'])
)
posts_created_df

Unnamed: 0,CreationDate,PostsCreated
0,2020-01,100
1,2020-02,100
2,2020-03,100
3,2020-04,100
4,2020-05,100
5,2020-06,100
6,2020-07,100
7,2020-08,100
8,2020-09,100
9,2020-10,100
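
Counting distinct `Id` values matters here if a post can carry more than one tag and therefore occupy several rows of the input. The sketch below, on hypothetical rows, contrasts `nunique()` with a plain row count under that assumption.

```python
import pandas as pd

# Hypothetical rows: post 1 carries two tags, so it occupies two rows.
sample_df = pd.DataFrame({
    "Id": [1, 1, 2],
    "CreationDate": ["2020-01", "2020-01", "2020-01"],
    "Tag": ["a", "b", "a"],
})

by_month = sample_df.groupby("CreationDate")["Id"]
posts_per_month = by_month.nunique()  # distinct posts
rows_per_month = by_month.size()      # rows, i.e. (post, tag) pairs

print(posts_per_month["2020-01"], rows_per_month["2020-01"])
```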


### Calculate the cumulative number of posts per creation date

In [40]:
total_posts_created_sequence = posts_created_df['PostsCreated'].expanding().sum()

total_posts_created_df = posts_created_df.assign(TotalPostsCreated=total_posts_created_sequence)
total_posts_created_df

Unnamed: 0,CreationDate,PostsCreated,TotalPostsCreated
0,2020-01,100,100.0
1,2020-02,100,200.0
2,2020-03,100,300.0
3,2020-04,100,400.0
4,2020-05,100,500.0
5,2020-06,100,600.0
6,2020-07,100,700.0
7,2020-08,100,800.0
8,2020-09,100,900.0
9,2020-10,100,1000.0
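
The running total depends on the rows being in chronological order. Because `CreationDate` is a zero-padded `YYYY-MM` string, sorting it as text also sorts it by time. A small sketch with hypothetical, deliberately shuffled rows:

```python
import pandas as pd

# Hypothetical monthly counts in a non-chronological order.
sample_df = pd.DataFrame({
    "CreationDate": ["2020-10", "2020-02", "2021-01"],
    "PostsCreated": [100, 100, 200],
})

# Sorting the 'YYYY-MM' strings puts the months in chronological order.
ordered_df = sample_df.sort_values(by="CreationDate")

# The expanding sum then accumulates from the earliest month onward.
running_total = ordered_df["PostsCreated"].expanding().sum()
ordered_df = ordered_df.assign(TotalPostsCreated=running_total)
print(ordered_df)
```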


### Join two dataframes to calculate tag trends

In [41]:
posts_creation_trends_df = pd.merge(
    tag_total_posts_created_df,
    total_posts_created_df,
    left_on='CreationDate',
    right_on='CreationDate',
    how='inner'
)
posts_creation_trends_df

Unnamed: 0,Tag,CreationDate,TagPostsCreated,TagTotalPostsCreated,PostsCreated,TotalPostsCreated
0,a,2020-01,10,10.0,100,100.0
1,a,2020-02,10,20.0,100,200.0
2,a,2020-03,10,30.0,100,300.0
3,a,2020-04,10,40.0,100,400.0
4,a,2020-05,10,50.0,100,500.0
...,...,...,...,...,...,...
67,c,2021-08,40,1040.0,200,2800.0
68,c,2021-09,40,1080.0,200,3000.0
69,c,2021-10,40,1120.0,200,3200.0
70,c,2021-11,40,1160.0,200,3400.0
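
This join relies on the totals frame having exactly one row per `CreationDate`. One way to make that assumption explicit, shown here as a sketch on hypothetical frames rather than as the notebook's own code, is the `validate` argument of `pd.merge`, which raises an error if the right-hand keys are not unique.

```python
import pandas as pd

# Hypothetical per-tag cumulative counts (many rows per CreationDate).
tag_df = pd.DataFrame({
    "Tag": ["a", "b", "a"],
    "CreationDate": ["2020-01", "2020-01", "2020-02"],
    "TagTotalPostsCreated": [10.0, 5.0, 20.0],
})
# Hypothetical overall cumulative counts (one row per CreationDate).
totals_df = pd.DataFrame({
    "CreationDate": ["2020-01", "2020-02"],
    "TotalPostsCreated": [100.0, 200.0],
})

# on= is shorthand when the key column has the same name on both sides;
# validate="many_to_one" checks that CreationDate is unique in totals_df.
merged_df = pd.merge(
    tag_df,
    totals_df,
    on="CreationDate",
    how="inner",
    validate="many_to_one",
)
print(merged_df)
```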


### Calculate share of posts created per tag

In [42]:
tag_share_sequence = (posts_creation_trends_df['TagTotalPostsCreated'] / posts_creation_trends_df['TotalPostsCreated']) * 100
posts_creation_trends_tag_share_df = posts_creation_trends_df.assign(TagShare=tag_share_sequence)
posts_creation_trends_tag_share_df

Unnamed: 0,Tag,CreationDate,TagPostsCreated,TagTotalPostsCreated,PostsCreated,TotalPostsCreated,TagShare
0,a,2020-01,10,10.0,100,100.0,10.000000
1,a,2020-02,10,20.0,100,200.0,10.000000
2,a,2020-03,10,30.0,100,300.0,10.000000
3,a,2020-04,10,40.0,100,400.0,10.000000
4,a,2020-05,10,50.0,100,500.0,10.000000
...,...,...,...,...,...,...,...
67,c,2021-08,40,1040.0,200,2800.0,37.142857
68,c,2021-09,40,1080.0,200,3000.0,36.000000
69,c,2021-10,40,1120.0,200,3200.0,35.000000
70,c,2021-11,40,1160.0,200,3400.0,34.117647


### Calculate tag rank

In [43]:
tag_rank_sequence = posts_creation_trends_tag_share_df.groupby("CreationDate")["TagShare"].rank(method="first", ascending=False)
posts_creation_trends_df = posts_creation_trends_tag_share_df.assign(TagRank=tag_rank_sequence)
posts_creation_trends_df

Unnamed: 0,Tag,CreationDate,TagPostsCreated,TagTotalPostsCreated,PostsCreated,TotalPostsCreated,TagShare,TagRank
0,a,2020-01,10,10.0,100,100.0,10.000000,3.0
1,a,2020-02,10,20.0,100,200.0,10.000000,3.0
2,a,2020-03,10,30.0,100,300.0,10.000000,3.0
3,a,2020-04,10,40.0,100,400.0,10.000000,3.0
4,a,2020-05,10,50.0,100,500.0,10.000000,3.0
...,...,...,...,...,...,...,...,...
67,c,2021-08,40,1040.0,200,2800.0,37.142857,1.0
68,c,2021-09,40,1080.0,200,3000.0,36.000000,2.0
69,c,2021-10,40,1120.0,200,3200.0,35.000000,2.0
70,c,2021-11,40,1160.0,200,3400.0,34.117647,2.0
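
With `method="first"`, tags with an identical share within the same month still receive distinct ranks. The sketch below uses hypothetical shares to compare this with the `"min"` and `"dense"` methods, which give tied tags the same rank; which behaviour is preferable depends on how the rank is used downstream.

```python
import pandas as pd

# Hypothetical shares for one month; tags "a" and "b" are tied.
sample_df = pd.DataFrame({
    "Tag": ["a", "b", "c", "d"],
    "CreationDate": ["2020-01"] * 4,
    "TagShare": [10.0, 10.0, 30.0, 5.0],
})

shares = sample_df.groupby("CreationDate")["TagShare"]
ranks_df = sample_df.assign(
    # "first": ties get distinct consecutive ranks.
    RankFirst=shares.rank(method="first", ascending=False),
    # "min": ties share the lowest rank, and the next rank is skipped.
    RankMin=shares.rank(method="min", ascending=False),
    # "dense": ties share a rank, and no rank is skipped.
    RankDense=shares.rank(method="dense", ascending=False),
)
print(ranks_df)
```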


### Save
Save the final dataframe as an intermediate result.

In [44]:
posts_creation_trends_file_path = get_file_path('posts_creation_trends.csv')
posts_creation_trends_df.to_csv(posts_creation_trends_file_path, index=False)
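
The `index=False` argument keeps the row index out of the file. The sketch below, on a hypothetical two-row frame written to an in-memory string instead of a file, shows the extra `Unnamed: 0` column that appears on reload when the index is written.

```python
import io

import pandas as pd

# Hypothetical miniature of the final dataframe.
sample_df = pd.DataFrame({"Tag": ["a", "b"], "TagRank": [1.0, 2.0]})

# Default behaviour (index=True) writes the row index as an unnamed first column.
csv_with_index = sample_df.to_csv()
# index=False writes only the data columns.
csv_without_index = sample_df.to_csv(index=False)

columns_with_index = pd.read_csv(io.StringIO(csv_with_index)).columns.tolist()
columns_without_index = pd.read_csv(io.StringIO(csv_without_index)).columns.tolist()
print(columns_with_index)
print(columns_without_index)
```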