## Notebook to calculate post creation trends per tag

### Description
This notebook calculates basic trend data for each tag, such as the tag's rank based on the cumulative number of posts created up to a given month.

### Input
This notebook takes the `posts_tag.csv` file, produced by the previous step, as its input.

### Output
This notebook produces the `posts_creation_trends.csv` file in the following format:
```
Tag,CreationDate,TagPostsCreated,TagTotalPostsCreated,PostsCreated,TotalPostsCreated,TagShare,TagRank
{tag},{creation-date},{tag-posts-created},{tag-total-posts-created},{posts-created},{total-posts-created},{tag-posts-share},{tag-rank}
```
where:
- `{tag}` - a single tag attached to a post, for example `c#`;
- `{creation-date}` - the month the post was created, in `YYYY-MM` format, for example `2008-07`;
- `{tag-posts-created}` - the number of posts created with that tag in `{creation-date}`;
- `{tag-total-posts-created}` - the cumulative number of posts created with that tag from the beginning up to and including `{creation-date}`;
- `{posts-created}` - the number of all posts created in `{creation-date}`;
- `{total-posts-created}` - the cumulative number of all posts created from the beginning up to and including `{creation-date}`;
- `{tag-posts-share}` - the percentage of all posts that carry that tag, calculated as `{tag-total-posts-created} / {total-posts-created} * 100`;
- `{tag-rank}` - the rank of `{tag}` by `{tag-posts-share}` compared with other tags at the same `{creation-date}`; rank 1 is the largest share.

For example:
```csv
Tag,CreationDate,TagPostsCreated,TagTotalPostsCreated,PostsCreated,TotalPostsCreated,TagShare,TagRank
.a,2010-01,4,4.0,145866,1560787.0,0.00025628096594858875,9138.0
.a,2010-03,2,6.0,160711,1862493.0,0.00032214886176753414,9492.0
.a,2010-04,5,11.0,150604,2013097.0,0.0005464217571234769,8539.0
.a,2011-02,1,12.0,236699,3929356.0,0.00030539355558519004,11605.0
.a,2011-05,4,16.0,281657,4772127.0,0.00033528026391585974,12028.0
.a,2011-06,6,22.0,279544,5051671.0,0.00043549946146532504,11475.0
.a,2011-07,6,28.0,281125,5332796.0,0.0005250528990795823,11020.0
.a,2011-08,4,32.0,298765,5631561.0,0.0005682261099542382,10858.0
.a,2012-01,2,34.0,317368,7082831.0,0.00048003404288482954,11856.0
```
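
The share value in the sample rows can be reproduced by hand from the formula given for `{tag-posts-share}`. The sketch below does this for the first row (`.a`, `2010-01`); the function name `tag_posts_share` is illustrative and does not appear in the notebook.

```python
def tag_posts_share(tag_total_posts_created: float, total_posts_created: float) -> float:
    """Return the tag's cumulative share of all posts, in percent."""
    return tag_total_posts_created / total_posts_created * 100


# Values taken from the first sample row: .a, 2010-01
share = tag_posts_share(4.0, 1560787.0)
print(f"{share:.6f}")  # 0.000256
```

Note that the result is already a percentage, so `0.000256` means roughly 2.6 posts per million.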

In [1]:
import pandas as pd
from config import get_file_path

### Load data, show shape and sample

In [2]:
posts_tag_file_path = get_file_path("posts_tag.csv")
posts_tag_df = pd.read_csv(posts_tag_file_path)
posts_tag_df

Unnamed: 0,Id,CreationDate,Tag
0,4,2008-07,c#
1,4,2008-07,floating-point
2,4,2008-07,type-conversion
3,4,2008-07,double
4,4,2008-07,decimal
...,...,...,...
168684444,78091263,2024-03,react-hooks
168684445,78091263,2024-03,setstate
168684446,78091259,2024-03,drupal
168684447,78091259,2024-03,drupal-9


### Calculate the number of posts per tag and creation date.

In [3]:
# Count distinct posts for each (tag, month) pair.
# reset_index(name=...) already names the count column, so no rename is needed.
tag_posts_created_df = posts_tag_df.\
    groupby(['Tag', 'CreationDate'])['Id'].\
    nunique().\
    reset_index(name='TagPostsCreated').\
    sort_values(by=['Tag', 'CreationDate'])
tag_posts_created_df

Unnamed: 0,Tag,CreationDate,TagPostsCreated
0,.a,2010-01,4
1,.a,2010-03,2
2,.a,2010-04,5
3,.a,2011-02,1
4,.a,2011-05,4
...,...,...,...
3650769,zyte,2023-06,2
3650770,zyte,2023-09,1
3650771,zyte,2023-12,3
3650772,zyte,2024-01,1


### Calculate the cumulative number of posts per tag and creation date.

In [4]:
tag_total_posts_created_sequence = tag_posts_created_df.groupby('Tag')['TagPostsCreated']. \
    expanding(). \
    sum(). \
    reset_index(level=0, drop=True)

tag_total_posts_created_df = tag_posts_created_df.assign(TagTotalPostsCreated=tag_total_posts_created_sequence)
tag_total_posts_created_df

Unnamed: 0,Tag,CreationDate,TagPostsCreated,TagTotalPostsCreated
0,.a,2010-01,4,4.0
1,.a,2010-03,2,6.0
2,.a,2010-04,5,11.0
3,.a,2011-02,1,12.0
4,.a,2011-05,4,16.0
...,...,...,...,...
3650769,zyte,2023-06,2,28.0
3650770,zyte,2023-09,1,29.0
3650771,zyte,2023-12,3,32.0
3650772,zyte,2024-01,1,33.0
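
A per-group running total can also be written with `groupby(...).cumsum()`, which keeps the integer dtype and the original row index, so no `reset_index` is needed. The sketch below uses a small made-up frame (the tag names and counts are illustrative) and assumes the rows are already sorted by `Tag` and `CreationDate`, as in the previous step.

```python
import pandas as pd

# Toy data, already sorted by Tag and CreationDate (illustrative values).
toy_df = pd.DataFrame({
    "Tag": ["a", "a", "a", "b", "b"],
    "CreationDate": ["2010-01", "2010-03", "2010-04", "2010-01", "2010-02"],
    "TagPostsCreated": [4, 2, 5, 1, 3],
})

# Running total within each tag; the result is aligned with toy_df's index.
toy_df["TagTotalPostsCreated"] = toy_df.groupby("Tag")["TagPostsCreated"].cumsum()
print(toy_df)
```

The `expanding().sum()` form used in the cell above returns floats (`4.0`, `6.0`, ...), which is why the cumulative columns in the saved CSV contain decimal points.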


### Calculate the number of posts per creation date.

In [5]:
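# Count distinct posts per month; a post with several tags is counted once.
# reset_index(name=...) already names the count column, so no rename is needed.
posts_created_df = posts_tag_df.\
    groupby(['CreationDate'])['Id'].\
    nunique().\
    reset_index(name='PostsCreated').\
    sort_values(by=['CreationDate'])
posts_created_df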

Unnamed: 0,CreationDate,PostsCreated
0,2008-07,6
1,2008-08,14709
2,2008-09,61826
3,2008-10,58874
4,2008-11,47493
...,...,...
184,2023-11,130075
185,2023-12,110803
186,2024-01,125348
187,2024-02,127588
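
Because `posts_tag.csv` holds one row per post and tag, a post with several tags appears several times within the same month. That is why the cell above counts distinct `Id` values rather than rows. A minimal illustration on made-up data:

```python
import pandas as pd

# Two posts in the same month: post 4 has three tags, post 6 has one.
toy_df = pd.DataFrame({
    "Id": [4, 4, 4, 6],
    "CreationDate": ["2008-07"] * 4,
    "Tag": ["c#", "double", "decimal", "html"],
})

by_month = toy_df.groupby("CreationDate")["Id"]
row_count = by_month.count()      # counts every (post, tag) row
post_count = by_month.nunique()   # counts each post once

print(row_count["2008-07"], post_count["2008-07"])  # 4 2
```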


### Calculate the cumulative number of posts per creation date.

In [6]:
total_posts_created_sequence = posts_created_df['PostsCreated'].expanding().sum()

total_posts_created_df = posts_created_df.assign(TotalPostsCreated=total_posts_created_sequence)
total_posts_created_df

Unnamed: 0,CreationDate,PostsCreated,TotalPostsCreated
0,2008-07,6,6.0
1,2008-08,14709,14715.0
2,2008-09,61826,76541.0
3,2008-10,58874,135415.0
4,2008-11,47493,182908.0
...,...,...,...
184,2023-11,130075,56520009.0
185,2023-12,110803,56630812.0
186,2024-01,125348,56756160.0
187,2024-02,127588,56883748.0


### Join two dataframes to calculate tag trends

In [9]:
# Attach the monthly totals to every tag row for the same month.
posts_creation_trends_df = pd.merge(
    tag_total_posts_created_df,
    total_posts_created_df,
    on='CreationDate',
    how='inner'
)
posts_creation_trends_df

Unnamed: 0,Tag,CreationDate,TagPostsCreated,TagTotalPostsCreated,PostsCreated,TotalPostsCreated
0,.a,2010-01,4,4.0,145866,1560787.0
1,.a,2010-03,2,6.0,160711,1862493.0
2,.a,2010-04,5,11.0,150604,2013097.0
3,.a,2011-02,1,12.0,236699,3929356.0
4,.a,2011-05,4,16.0,281657,4772127.0
...,...,...,...,...,...,...
3650769,zyte,2023-06,2,28.0,161929,55801992.0
3650770,zyte,2023-09,1,29.0,135820,56252733.0
3650771,zyte,2023-12,3,32.0,110803,56630812.0
3650772,zyte,2024-01,1,33.0,125348,56756160.0
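
Each tag row should match exactly one monthly total. `pd.merge` can enforce that with its `validate` argument, which raises `pandas.errors.MergeError` if the right-hand frame has duplicate keys. The sketch below uses made-up frames; adding `validate` to the notebook's merge is a suggestion, not something the original cell does.

```python
import pandas as pd

# Illustrative per-tag rows: several tags can share a month.
tag_df = pd.DataFrame({
    "Tag": ["a", "b", "a"],
    "CreationDate": ["2010-01", "2010-01", "2010-02"],
    "TagTotalPostsCreated": [4, 1, 6],
})
# Illustrative monthly totals: exactly one row per month.
totals_df = pd.DataFrame({
    "CreationDate": ["2010-01", "2010-02"],
    "TotalPostsCreated": [100, 250],
})

# many_to_one: many tag rows per month, exactly one totals row per month.
merged_df = pd.merge(
    tag_df,
    totals_df,
    on="CreationDate",
    how="inner",
    validate="many_to_one",
)
print(merged_df)
```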


### Calculate share of posts created per tag

In [10]:
tag_share_sequence = (posts_creation_trends_df['TagTotalPostsCreated'] / posts_creation_trends_df['TotalPostsCreated']) * 100
posts_creation_trends_tag_share_df = posts_creation_trends_df.assign(TagShare=tag_share_sequence)
posts_creation_trends_tag_share_df

Unnamed: 0,Tag,CreationDate,TagPostsCreated,TagTotalPostsCreated,PostsCreated,TotalPostsCreated,TagShare
0,.a,2010-01,4,4.0,145866,1560787.0,0.000256
1,.a,2010-03,2,6.0,160711,1862493.0,0.000322
2,.a,2010-04,5,11.0,150604,2013097.0,0.000546
3,.a,2011-02,1,12.0,236699,3929356.0,0.000305
4,.a,2011-05,4,16.0,281657,4772127.0,0.000335
...,...,...,...,...,...,...,...
3650769,zyte,2023-06,2,28.0,161929,55801992.0,0.000050
3650770,zyte,2023-09,1,29.0,135820,56252733.0,0.000052
3650771,zyte,2023-12,3,32.0,110803,56630812.0,0.000057
3650772,zyte,2024-01,1,33.0,125348,56756160.0,0.000058


### Calculate tag rank

In [11]:
tag_rank_sequence = posts_creation_trends_tag_share_df.groupby("CreationDate")["TagShare"].rank(method="first", ascending=False)
posts_creation_trends_df = posts_creation_trends_tag_share_df.assign(TagRank=tag_rank_sequence)
posts_creation_trends_df

Unnamed: 0,Tag,CreationDate,TagPostsCreated,TagTotalPostsCreated,PostsCreated,TotalPostsCreated,TagShare,TagRank
0,.a,2010-01,4,4.0,145866,1560787.0,0.000256,9138.0
1,.a,2010-03,2,6.0,160711,1862493.0,0.000322,9492.0
2,.a,2010-04,5,11.0,150604,2013097.0,0.000546,8539.0
3,.a,2011-02,1,12.0,236699,3929356.0,0.000305,11605.0
4,.a,2011-05,4,16.0,281657,4772127.0,0.000335,12028.0
...,...,...,...,...,...,...,...,...
3650769,zyte,2023-06,2,28.0,161929,55801992.0,0.000050,20477.0
3650770,zyte,2023-09,1,29.0,135820,56252733.0,0.000052,19279.0
3650771,zyte,2023-12,3,32.0,110803,56630812.0,0.000057,18069.0
3650772,zyte,2024-01,1,33.0,125348,56756160.0,0.000058,18929.0
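
`method="first"` breaks ties by row order, so two tags with an identical share in the same month still receive different ranks. The sketch below contrasts it with `method="min"` on made-up data to show the difference; which behaviour is preferable depends on how the ranks are used downstream.

```python
import pandas as pd

# Tags "a" and "b" have the same share in 2010-01 (illustrative values).
toy_df = pd.DataFrame({
    "Tag": ["a", "b", "c", "a", "b"],
    "CreationDate": ["2010-01", "2010-01", "2010-01", "2010-02", "2010-02"],
    "TagShare": [0.5, 0.5, 0.2, 0.1, 0.3],
})

shares_by_month = toy_df.groupby("CreationDate")["TagShare"]
# "first": tied rows get distinct consecutive ranks.
toy_df["RankFirst"] = shares_by_month.rank(method="first", ascending=False)
# "min": tied rows share the lowest rank of the tied group.
toy_df["RankMin"] = shares_by_month.rank(method="min", ascending=False)
print(toy_df)
```

With `method="min"` both tied tags in `2010-01` are ranked 1 and the next tag is ranked 3, whereas `method="first"` assigns them ranks 1 and 2.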


### Save
Save the final dataframe as an intermediate result.

In [12]:
posts_creation_trends_file_path = get_file_path('posts_creation_trends.csv')
posts_creation_trends_df.to_csv(posts_creation_trends_file_path, index=False)