# Exploratory Analysis notebook

In [1]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
import bay12_solution_eposts as solution

## Load data

**NOTE!** The loading function used below (`solution.prepare.load_dfs`) assumes the following directory structure:

```
/
    data/
        train/
        test/
    notebooks/
        0_exploratory_analysis.ipynb
    src/
        bay12_solution_eposts/
            __init__.py
            prepare.py
            ...
```

If your layout differs, set `path_data="your/path/to/data"` when calling it.
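
Before loading, it can help to confirm that the expected folders are actually there. The snippet below is a minimal sketch of such a check; `find_missing_dirs` is a hypothetical helper written for illustration and is not part of `bay12_solution_eposts`. It also assumes the notebook runs from `notebooks/`, so the project root is one level up.

```python
from pathlib import Path


def find_missing_dirs(root, expected=("data/train", "data/test")):
    """Return the expected sub-directories that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).is_dir()]


# Assumption: the current working directory is notebooks/, so the root is its parent.
missing = find_missing_dirs(Path.cwd().parent)
if missing:
    print("Missing directories, consider setting path_data explicitly:", missing)
```

If the list is non-empty, pointing `path_data` at the real data folder is probably the simplest fix.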

In [3]:
post, thread = solution.prepare.load_dfs('train')

In [4]:
post.head()

|   | thread_num | user | text | quotes |
|---|---|---|---|---|
| 0 | 45016 | Mephansteras | Basically, this is where we talk about what ga... | [] |
| 1 | 45016 | dakarian | The currently running or about to run games (i... | [] |
| 2 | 45016 | webadict | `And mine's started.\r\r\r\n\r\r\r\nI'll try to...` | [] |
| 3 | 45016 | ExKirby | Mine needs 14 players, not 13. | [] |
| 4 | 45016 | RedWarrior0 | Mine can wait a bit. BYORPE is a problem as it... | [] |


In [5]:
thread.head()

|   | thread_num | thread_name | thread_label | thread_replies | thread_label_id |
|---|---|---|---|---|---|
| 0 | 45016 | Games Threshold Discussion and List [Vote for ... | other | 5703 | 8 |
| 1 | 88720 | New Player's Guide to the Subforum - New to Ma... | other | 961 | 8 |
| 2 | 39338 | Mafia: A Basic Tutorial | other | 79 | 8 |
| 3 | 34959 | Paranormal Mafia Game - Rules Discussion | other | 1719 | 8 |
| 4 | 64229 | Notable Games Archive | other | 307 | 8 |


## Look at label statistics on the train set

In [6]:
# Per label: number of threads and total number of replies
lbl_stats = thread.groupby(['thread_label', 'thread_label_id']).thread_replies.agg(['count', 'sum'])
lbl_stats.columns = ['threads', 'posts']
# thread_replies counts replies only, so add one opening post per thread
lbl_stats['posts'] += lbl_stats.threads
# Integer average (astype(int) truncates rather than rounds)
lbl_stats['avg posts per thread'] = (lbl_stats['posts'] / lbl_stats['threads']).astype(int)
lbl_stats

| thread_label | thread_label_id | threads | posts | avg posts per thread |
|---|---|---|---|---|
| bastard | 0 | 14 | 5411 | 386 |
| beginners-mafia | 1 | 23 | 10242 | 445 |
| byor | 2 | 13 | 10609 | 816 |
| classic | 3 | 21 | 7021 | 334 |
| closed-setup | 4 | 36 | 13828 | 384 |
| cybrid | 5 | 3 | 958 | 319 |
| kotm | 6 | 2 | 1719 | 859 |
| non-mafia-game | 7 | 2 | 673 | 336 |
| other | 8 | 201 | 28639 | 142 |
| paranormal | 9 | 20 | 10976 | 548 |


In [7]:
print(lbl_stats.index.tolist())

[('bastard', 0), ('beginners-mafia', 1), ('byor', 2), ('classic', 3), ('closed-setup', 4), ('cybrid', 5), ('kotm', 6), ('non-mafia-game', 7), ('other', 8), ('paranormal', 9), ('supernatural', 10), ('vanilla', 11), ('vengeful', 12)]


**Some notes**:
* "other" threads are by far the most common (201 threads) but have the fewest posts per thread (142 on average)
* many classes have very few threads, e.g. kotm and non-mafia-game have 2 each and cybrid has 3
* from prior knowledge, I know that "named" variants such as beginners-mafia, cybrid, paranormal, and supernatural are easy to recognize from the thread title
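
The last note can be checked with a crude string match. The following is only a sketch on made-up rows (the titles are invented for illustration, not taken from the dataset); it assumes the `thread_name` and `thread_label` columns shown above and simply tests whether the label, with hyphens turned into spaces, occurs in the lower-cased title.

```python
import pandas as pd

# Made-up rows for illustration only; the real data lives in `thread`.
toy = pd.DataFrame({
    "thread_name": ["Beginners Mafia XX", "Paranormal Mafia 3",
                    "General chat", "Cybrid Mafia II"],
    "thread_label": ["beginners-mafia", "paranormal", "other", "cybrid"],
})


def label_in_title(df):
    """Flag rows whose label (hyphens as spaces) appears in the title."""
    keyword = df["thread_label"].str.replace("-", " ", regex=False).str.lower()
    title = df["thread_name"].str.lower()
    return pd.Series([k in t for k, t in zip(keyword, title)], index=df.index)


toy["label_in_title"] = label_in_title(toy)
# Share of threads per label whose title contains the label text
hit_rate = toy.groupby("thread_label")["label_in_title"].mean()
print(hit_rate)
```

Running the same function on `thread` would give a per-label hit rate; spelling variants in real titles (apostrophes, abbreviations) would likely need a more forgiving match.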

**Baseline task**:

I will first try to predict the label from the `thread` data alone (e.g. thread name and reply count), without the post text; with a dataset this small, a simpler model may help limit [overfitting](https://en.wikipedia.org/wiki/Overfitting).
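
Since "other" dominates the label counts, a useful reference point for any thread-level model is the majority-class baseline, which always predicts the most frequent label. This is a sketch on made-up labels, not the approach used later in this project; in the notebook the series would be `thread["thread_label"]`.

```python
import pandas as pd

# Made-up labels for illustration; replace with thread["thread_label"].
labels = pd.Series(["other", "other", "other", "classic", "byor", "other"])

# Majority-class baseline: always predict the most frequent label.
majority = labels.value_counts().idxmax()
# Accuracy of that constant prediction on the same labels
accuracy = (labels == majority).mean()
print(majority, round(accuracy, 3))
```

Any model built on thread data should beat this accuracy to be worth keeping; with classes this imbalanced, a per-class metric is probably more informative than accuracy alone.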