## Notebook to pre-process posts data

### Description
The main objective of this notebook is to pre-process post data for further aggregation and calculations.
In particular, it includes the following steps:
- Filter out deleted or closed posts;
- Inherit the `tags` field for answers from their parent questions;
- Explode the `tags` field so the resulting data is flat, which makes aggregations easier.

### Input 
This notebook takes the `posts.csv` file produced by the previous step as input.

### Output
As output, this notebook produces a `posts_tag.csv` file with the following format:
```csv
Id,CreationDate,Tag
{post-id},{creation-date},{tag}
```
where:
- `{post-id}` - post identifier.
- `{creation-date}` - post creation month in 'YYYY-MM' format. For example: '2008-07'.
- `{tag}` - a single tag related to the post. For instance: `c#`.

For instance:
```csv
Id,CreationDate,Tag
4,2008-07,c#
4,2008-07,floating-point
4,2008-07,type-conversion
4,2008-07,double
4,2008-07,decimal
``` 

In [1]:
import pandas as pd
from config import get_file_path

### Load data, show shape and sample

In [2]:
posts_file_path = get_file_path("posts.csv")
all_posts_df = pd.read_csv(posts_file_path)

In [3]:
all_posts_df.dtypes

DeletionDate    float64
PostTypeId        int64
ParentId        float64
Id                int64
CreationDate     object
ClosedDate       object
Tags             object
dtype: object

In [4]:
all_posts_df

Unnamed: 0,DeletionDate,PostTypeId,ParentId,Id,CreationDate,ClosedDate,Tags
0,,1,,4,2008-07-31T21:42:52.667,,<c#><floating-point><type-conversion><double><...
1,,1,,6,2008-07-31T22:08:08.620,,<html><css><internet-explorer-7>
2,,2,4.0,7,2008-07-31T22:17:57.883,,
3,,1,,9,2008-07-31T23:40:59.743,,<c#><.net><datetime>
4,,1,,11,2008-07-31T23:55:37.967,,<c#><datetime><time><datediff><relative-time-s...
...,...,...,...,...,...,...,...
59749044,,1,,78091308,2024-03-02T02:52:48.793,,<wifi><gstreamer><esp32><audio-streaming><mult...
59749045,,1,,78091309,2024-03-02T02:53:20.573,,<jquery><woocommerce><hide><show>
59749046,,2,18727766.0,78091310,2024-03-02T02:53:29.510,,
59749047,,1,,78091311,2024-03-02T02:54:12.030,,<r>


### Filter data
Filter out closed or deleted posts, since they are considered irrelevant.
The `DeletionDate` and `ClosedDate` columns are populated only for deleted or closed posts respectively.
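
As a minimal sketch of this rule, the toy dataframe below (hypothetical values, not taken from `posts.csv`) keeps only the rows where both date columns are empty:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: post 2 is closed, post 3 is deleted
toy_posts = pd.DataFrame({
    "Id": [1, 2, 3],
    "DeletionDate": [np.nan, np.nan, "2011-05-01T00:00:00.000"],
    "ClosedDate": [np.nan, "2010-01-01T00:00:00.000", np.nan],
})

# A post is kept only if it has neither a deletion date nor a closed date
keep_mask = toy_posts["DeletionDate"].isna() & toy_posts["ClosedDate"].isna()
kept_posts = toy_posts[keep_mask]

print(kept_posts["Id"].tolist())  # [1]
```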

In [5]:
non_deleted_closed_posts_df = all_posts_df[pd.isna(all_posts_df['DeletionDate']) & pd.isna(all_posts_df['ClosedDate'])]
non_deleted_closed_posts_count = len(non_deleted_closed_posts_df)
all_posts_count = len(all_posts_df)
print(all_posts_count)

non_deleted_closed_posts_percentage = round((non_deleted_closed_posts_count / all_posts_count) * 100, 2)
print(f'Number of NOT deleted or closed posts: {non_deleted_closed_posts_count}, which is {non_deleted_closed_posts_percentage}% of all data')

# Remove columns that we don't need anymore
posts_df = non_deleted_closed_posts_df.drop(['DeletionDate', 'ClosedDate'], axis=1)
print('Filtered posts dataframe column types:')
posts_df.dtypes

59749049
Number of NOT deleted or closed posts: 58652191, which is 98.16% of all data
Filtered posts dataframe column types:


PostTypeId        int64
ParentId        float64
Id                int64
CreationDate     object
Tags             object
dtype: object

### Convert creation date
Trends will be calculated at per-month granularity.
Convert the creation datetime into a year-month string. It is easier and faster to do this here, while the posts dataframe is still relatively small (before the tags are exploded into one row per tag).
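
The conversion can be illustrated on two sample timestamps in the same format as the `CreationDate` column. The string-slicing variant at the end is only an assumption-based alternative: it works when every value is an ISO-formatted string that starts with `YYYY-MM`.

```python
import pandas as pd

# Sample timestamps in the same format as the CreationDate column
sample_dates = pd.Series(["2008-07-31T21:42:52.667", "2024-03-02T02:52:48.793"])

# Parse to datetime, then keep only the year and month as a string
year_month = pd.to_datetime(sample_dates).dt.strftime("%Y-%m")
print(year_month.tolist())  # ['2008-07', '2024-03']

# Alternative sketch: take the first 7 characters of the ISO string directly
year_month_sliced = sample_dates.str[:7]
print(year_month_sliced.tolist())  # ['2008-07', '2024-03']
```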

In [6]:
posts_df['CreationDateNew'] = pd.to_datetime(posts_df['CreationDate']).dt.strftime('%Y-%m')
posts_df = posts_df.drop(['CreationDate'], axis=1).rename(columns={'CreationDateNew': 'CreationDate'})
posts_df

Unnamed: 0,PostTypeId,ParentId,Id,Tags,CreationDate
0,1,,4,<c#><floating-point><type-conversion><double><...,2008-07
1,1,,6,<html><css><internet-explorer-7>,2008-07
2,2,4.0,7,,2008-07
3,1,,9,<c#><.net><datetime>,2008-07
4,1,,11,<c#><datetime><time><datediff><relative-time-s...,2008-07
...,...,...,...,...,...
59749044,1,,78091308,<wifi><gstreamer><esp32><audio-streaming><mult...,2024-03
59749045,1,,78091309,<jquery><woocommerce><hide><show>,2024-03
59749046,2,18727766.0,78091310,,2024-03
59749047,1,,78091311,<r>,2024-03


### Split posts into questions and answers

Split `posts_df` into two dataframes: questions and answers.
Questions have tags assigned and answers do not.
Use the `PostTypeId` column for the split, where `1` is the type for a question and `2` is the type for an answer.
Drop the `ParentId` column for questions, because it is always `null`: questions are themselves the parent posts.
Drop the `Tags` column for answers, because it is always `null`: only questions carry tags.
Answers should get the same tags as their parent questions.
See readme.txt for more details.

In [7]:
answers_df = posts_df[posts_df['PostTypeId'] == 2].drop(['PostTypeId', 'Tags'], axis=1)
questions_df = posts_df[posts_df['PostTypeId'] == 1].drop(['PostTypeId', 'ParentId'], axis=1)

answers_count = len(answers_df)
questions_count = len(questions_df)
posts_count = len(posts_df)

answers_percentage = round((answers_count / posts_count) * 100, 2)
questions_percentage = round((questions_count / posts_count) * 100, 2)

print(f'Number of answers: {answers_count}, which is {answers_percentage}% of all data')
print(f'Number of questions: {questions_count}, which is {questions_percentage}% of all data')

Number of answers: 35553848, which is 60.62% of all data
Number of questions: 22984906, which is 39.19% of all data


### Parse tags

Parse the `Tags` column. It contains a list of tags in an XML-like format, for instance `<c#><.net><datetime>`.
To work with it properly, we need to turn it into a proper list of tags.
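
As an illustration of the target shape, here is one possible way to parse such strings on a toy series. It is an alternative sketch, not necessarily the approach used in the cell that follows, and it assumes tag names never contain `<` or `>`.

```python
import pandas as pd

# Toy values in the same format as the Tags column
raw_tags = pd.Series(["<c#><.net><datetime>", "<r>"])

# Strip the outer '<' and '>' and split on the '><' boundary between tags
parsed_tags = raw_tags.str.strip("<>").str.split("><")

print(parsed_tags.tolist())  # [['c#', '.net', 'datetime'], ['r']]
```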

In [8]:
# Drop every '<', turn every '>' into a '<' separator, then split on '<'
questions_df['TagsParsed'] = questions_df['Tags'].str.replace('<', '').str.replace('>', '<').str.split('<')

# Remove 'Tags' column that is not needed anymore
questions_df.drop(['Tags'], axis=1, inplace=True)

# Remove empty strings that may appear as a result of the split
questions_df['TagsParsed'] = questions_df['TagsParsed'].apply(lambda tags: [tag for tag in tags if tag])
questions_df

Unnamed: 0,Id,CreationDate,TagsParsed
0,4,2008-07,"[c#, floating-point, type-conversion, double, ..."
1,6,2008-07,"[html, css, internet-explorer-7]"
3,9,2008-07,"[c#, .net, datetime]"
4,11,2008-07,"[c#, datetime, time, datediff, relative-time-s..."
6,13,2008-08,"[html, browser, timezone, user-agent, timezone..."
...,...,...,...
59749041,78091305,2024-03,"[reactjs, next.js, micro-frontend, single-spa,..."
59749043,78091307,2024-03,"[ubuntu, server, cuda, nvidia, tesla]"
59749044,78091308,2024-03,"[wifi, gstreamer, esp32, audio-streaming, mult..."
59749045,78091309,2024-03,"[jquery, woocommerce, hide, show]"


### Explode tags

Explode the `TagsParsed` column to have a single tag per row and rename it to `Tag`.
Having a single tag per row makes the necessary aggregations possible.
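
A small sketch on hypothetical toy data shows what the explode does and why the flat shape helps: once there is one tag per row, counting questions per tag is a plain `groupby`.

```python
import pandas as pd

# Hypothetical toy questions with already parsed tag lists
toy_questions = pd.DataFrame({
    "Id": [4, 9],
    "CreationDate": ["2008-07", "2008-07"],
    "TagsParsed": [["c#", "double"], ["c#", ".net", "datetime"]],
})

# One row per (question, tag); the original index value is repeated per tag
toy_exploded = toy_questions.explode("TagsParsed").rename(columns={"TagsParsed": "Tag"})
print(toy_exploded["Id"].tolist())  # [4, 4, 9, 9, 9]

# With flat data, counting distinct questions per tag is straightforward
questions_per_tag = toy_exploded.groupby("Tag")["Id"].nunique()
print(questions_per_tag["c#"])  # 2
```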

In [9]:
questions_tag_df = questions_df.explode('TagsParsed').rename(columns={'TagsParsed': 'Tag'})
questions_tag_df

Unnamed: 0,Id,CreationDate,Tag
0,4,2008-07,c#
0,4,2008-07,floating-point
0,4,2008-07,type-conversion
0,4,2008-07,double
0,4,2008-07,decimal
...,...,...,...
59749045,78091309,2024-03,jquery
59749045,78091309,2024-03,woocommerce
59749045,78091309,2024-03,hide
59749045,78091309,2024-03,show


### Assign tags on answers

Merge the answers dataframe with the questions dataframe, matching the question `Id` column to the answer `ParentId` column.
This merge is needed to populate tag data into answer posts for later aggregations.
As mentioned before, answer posts don't have tags assigned, because they implicitly inherit them from their parent question posts.
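
The effect of an inner merge on these keys can be sketched with hypothetical toy data. Two things are worth noticing: an answer gets one row per tag of its parent question, and an answer whose parent question is absent (for example because it was filtered out earlier) disappears from the result.

```python
import pandas as pd

# Hypothetical exploded questions: question 4 has two tags, question 9 has one
toy_questions_tag = pd.DataFrame({
    "Id": [4, 4, 9],
    "CreationDate": ["2008-07", "2008-07", "2008-07"],
    "Tag": ["c#", "double", "datetime"],
})

# Hypothetical answers: answer 13 points to a question that is not present
toy_answers = pd.DataFrame({
    "ParentId": [4.0, 9.0, 100.0],
    "Id": [7, 12, 13],
    "CreationDate": ["2008-07", "2008-08", "2008-08"],
})

toy_answers_tag = pd.merge(
    toy_questions_tag,
    toy_answers,
    left_on="Id",
    right_on="ParentId",
    how="inner",
    suffixes=("_Question", "_Answer"),
)[["Id_Answer", "CreationDate_Answer", "Tag"]].rename(
    columns={"Id_Answer": "Id", "CreationDate_Answer": "CreationDate"}
)

# Answer 7 appears twice (one row per inherited tag); answer 13 is dropped
print(sorted(toy_answers_tag["Id"].tolist()))  # [7, 7, 12]
```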

In [11]:
answers_tag_df = pd.merge(
    questions_tag_df,
    answers_df,
    left_on='Id',
    right_on='ParentId',
    how='inner',
    suffixes=('_Question', '_Answer')
)[['CreationDate_Answer', 'Tag', 'Id_Answer']].rename(columns={'CreationDate_Answer': 'CreationDate', 'Id_Answer': 'Id'})

answers_tag_df

Unnamed: 0,CreationDate,Tag,Id
0,2008-07,c#,7
1,2008-08,c#,78
2,2008-08,c#,86
3,2008-08,c#,2791
4,2008-08,c#,7263
...,...,...,...
100076826,2024-03,react-hooks,78091263
100076827,2024-03,setstate,78091263
100076828,2024-03,drupal,78091259
100076829,2024-03,drupal-9,78091259


### Union questions and answers
Union the questions and answers dataframes to get a dataframe that lists all posts created for each tag.
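
One detail worth illustrating: `pd.concat` aligns columns by name, so the two dataframes do not need to share the same column order. The sketch below uses hypothetical one-row frames with differently ordered columns.

```python
import pandas as pd

# Hypothetical one-row frames with the same columns in a different order
toy_questions_tag = pd.DataFrame({"Id": [4], "CreationDate": ["2008-07"], "Tag": ["c#"]})
toy_answers_tag = pd.DataFrame({"CreationDate": ["2008-07"], "Tag": ["c#"], "Id": [7]})

# Columns are matched by name; ignore_index=True renumbers the rows from 0
toy_posts_tag = pd.concat([toy_questions_tag, toy_answers_tag], ignore_index=True)

print(list(toy_posts_tag.columns))   # ['Id', 'CreationDate', 'Tag']
print(toy_posts_tag["Id"].tolist())  # [4, 7]
```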

In [12]:
posts_tag_df = pd.concat([questions_tag_df, answers_tag_df], ignore_index=True)
posts_tag_df

Unnamed: 0,Id,CreationDate,Tag
0,4,2008-07,c#
1,4,2008-07,floating-point
2,4,2008-07,type-conversion
3,4,2008-07,double
4,4,2008-07,decimal
...,...,...,...
168684444,78091263,2024-03,react-hooks
168684445,78091263,2024-03,setstate
168684446,78091259,2024-03,drupal
168684447,78091259,2024-03,drupal-9


### Save
Save the final dataframe as an intermediate result.

In [14]:
posts_tag_file_path = get_file_path("posts_tag.csv")
posts_tag_df.to_csv(posts_tag_file_path, index=False)