### Importing Libraries

In [None]:
import pandas as pd
import numpy as np

## Loading and Analysing Dataframe

In [None]:
df = pd.read_csv('../data/train-sample.csv')

In [None]:
print(df.columns)
df.head(3)

### Features To Create/Encode
Of these columns, I consider the following important:
- Title
- BodyMarkdown
- SelectedTags (the top N tags, with all remaining tags grouped as "other")
- account age at posting time (PostCreationDate - OwnerCreationDate)
- OpenStatus

I will create them in the following sections.

## Data processing

### Statuses

In [None]:
statuses = df['OpenStatus'].unique()
statuses

### Most Frequent Tags

In [None]:
tag_column_names = ['Tag1', 'Tag2', 'Tag3', 'Tag4', 'Tag5']
# Flatten the five tag columns into one long list of tag values
tags_long = list(df[tag_column_names].values.ravel('K'))

# Count how often each distinct value occurs (missing tags show up as 'nan')
tags_unique, frequencies = np.unique(tags_long, return_counts=True)
freq_dict = dict(zip(tags_unique, frequencies))

In [None]:
# Sort (tag, count) pairs by count, most frequent first
tags_freq_arr = sorted(freq_dict.items(), key=lambda kv: -kv[1])

print('There are {} unique tags; the 10 most frequent (with counts): {}'.format(len(tags_freq_arr), tags_freq_arr[:10]))

As the output shows, the most frequent value is 'nan', which is a placeholder for missing tag values rather than a real tag. It is removed in the following cells.

We will also keep the N most frequently occurring tags and classify the rest as `other`.
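
As a cross-check on the counting step, here is a minimal sketch of the same idea using pandas alone. It assumes a small made-up frame (`toy`) rather than the real dataset, and relies on `value_counts()` skipping missing values by default, so no 'nan' entry has to be removed afterwards. This is an alternative for illustration, not the approach used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the tag columns; the values are made up for illustration.
toy = pd.DataFrame({
    'Tag1': ['python', 'java', 'python'],
    'Tag2': ['pandas', np.nan, 'numpy'],
    'Tag3': [np.nan, np.nan, 'pandas'],
})

# Flatten all tag columns into one Series. value_counts() drops NaN by default
# and returns the counts sorted from most to least frequent.
tag_counts = pd.Series(toy.to_numpy().ravel()).value_counts()

# Keep the two most frequent tags (N = 2 here only because the toy data is tiny).
top_tags = tag_counts.head(2).index.tolist()
print(tag_counts)
```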

In [None]:
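N_tags = 500

# Drop the 'nan' placeholder explicitly instead of assuming it is the first entry,
# then keep the N_tags most frequent tags.
tags_freq_arr = [kv for kv in tags_freq_arr if kv[0] != 'nan'][:N_tags]

# Sized from the list itself, so fewer than N_tags unique tags leaves no empty slots.
selected_tags = np.array([tag for tag, _ in tags_freq_arr], dtype=tags_unique.dtype)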
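
The cell above only collects the selected tags. The "rest as `other`" step could look roughly like the sketch below, shown on made-up data. The names `toy`, `toy_selected` and `bucketed` are hypothetical, and leaving missing tags as missing (instead of mapping them to `other`) is an assumption, not something this notebook has decided.

```python
import numpy as np
import pandas as pd

# Toy data; 'toy_selected' is a hypothetical stand-in for selected_tags.
toy = pd.DataFrame({
    'Tag1': ['python', 'java', 'cobol'],
    'Tag2': ['pandas', np.nan, 'python'],
})
toy_selected = np.array(['python', 'pandas'])

tag_cols = ['Tag1', 'Tag2']
bucketed = toy[tag_cols].copy()
for col in tag_cols:
    # Keep a value if it is a selected tag or if it is missing (assumption:
    # missing tags stay missing); everything else becomes 'other'.
    keep = bucketed[col].isin(toy_selected) | bucketed[col].isna()
    bucketed[col] = bucketed[col].where(keep, 'other')

print(bucketed)
```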

### Selecting Best Features

In [None]:
print(df.columns)
df.head(3)

In [None]:
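# read_csv loads the dates as strings, so parse them before subtracting
df['DaysTillPosting'] = (pd.to_datetime(df['PostCreationDate']) - pd.to_datetime(df['OwnerCreationDate'])).dt.days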
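
To make the date arithmetic concrete, here is a small sketch on made-up timestamps. The date format used in `toy` is an assumption and may not match the real CSV. `errors='coerce'` is an optional choice that turns unparseable values into `NaT` instead of raising.

```python
import pandas as pd

# Made-up timestamps; the real file may use a different date format.
toy = pd.DataFrame({
    'PostCreationDate': ['2012-03-10 12:00:00', '2012-03-10 08:30:00'],
    'OwnerCreationDate': ['2012-03-01 12:00:00', '2012-03-10 08:30:00'],
})

# Convert the string columns to datetime64 so that subtraction is defined.
for col in ['PostCreationDate', 'OwnerCreationDate']:
    toy[col] = pd.to_datetime(toy[col], errors='coerce')

# Subtracting two datetime columns gives a Timedelta column; .dt.days keeps
# only the whole-day component.
toy['DaysTillPosting'] = (toy['PostCreationDate'] - toy['OwnerCreationDate']).dt.days
print(toy['DaysTillPosting'].tolist())
```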

In [None]:
print(df.columns)
df.head(3)