# Generate Game Data DataFrame
This DataFrame holds one row of metadata per game: the rule configuration plus each player's dan rank and rating.

## Creating the pandas DataFrame from the logs

Each game becomes one row in this format:

| log_id | red | kui | ton-nan | sanma | soku | p0-dan | p1-dan | p2-dan | p3-dan | p0-rating | p1-rating | p2-rating | p3-rating |
|--------|-----|-----|---------|-------|------|--------|--------|--------|--------|-----------|-----------|-----------|-----------|
| 2019010100gm-00a9-0000-009379d9 | True | True | False | False | False | 15 | 16 | 16 | 17 | 1700 | 1690 | 1909 | 1990 |

### Config Explanation
- 'red' == Game uses red fives
- 'kui' == Open tanyao allowed ([Kuitan](http://arcturus.su/wiki/Tanyao#Kuitan))
- 'ton-nan' == East-South game (otherwise East-only)
- 'sanma' == 3-player game
- 'soku' == Fast rounds
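
To make the target format concrete, here is a minimal sketch of how one parsed log maps to one row. The `sample_log` dictionary and the `build_row` helper are hypothetical and only for illustration; the key paths (`meta` → `GO` → `config` and `meta` → `UN`) are the ones the loading loop in this notebook reads, and the values are copied from the example row above.

```python
# Hypothetical parsed log: only the keys this notebook reads are included.
sample_log = {
    "meta": {
        "GO": {"config": {"red": True, "kui": True, "ton-nan": False,
                          "sanma": False, "soku": False}},
        "UN": [
            {"dan": 15, "rate": 1700.0},
            {"dan": 16, "rate": 1690.0},
            {"dan": 16, "rate": 1909.0},
            {"dan": 17, "rate": 1990.0},
        ],
    }
}

CONFIG_KEYS = ("red", "kui", "ton-nan", "sanma", "soku")


def build_row(log_id, json_log):
    """Flatten one parsed log into a single metadata row (illustrative helper)."""
    config = json_log["meta"]["GO"]["config"]
    players = json_log["meta"]["UN"]

    row = {"log_id": log_id}
    for key in CONFIG_KEYS:
        row[key] = config[key]
    # Dan columns first, then rating columns, matching the table layout
    for i, player in enumerate(players):
        row[f"p{i}-dan"] = player["dan"]
    for i, player in enumerate(players):
        row[f"p{i}-rating"] = player["rate"]
    return row


row = build_row("2019010100gm-00a9-0000-009379d9", sample_log)
print(row)
```

Keeping the flattening in one small function makes it easy to check a single log by hand before running the loop over every year.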

In [1]:
from pathlib import Path
import pandas as pd
import json
from tqdm import tqdm

import utilities.utilities as util

In [2]:
rows = []

for year, logs in util.get_all_logs_annually(Path('E:') / 'mahjong' / 'logs'):

    for log in logs:

        # Close each file right after parsing; the loop visits millions of logs
        with log.open() as f:
            json_log = json.load(f)

        config = json_log['meta']['GO']['config']
        players = json_log['meta']['UN']

        row = {}

        row['log_id'] = log.stem  # File name without extension identifies the game

        # Rule flags of the game
        for key in ('red', 'kui', 'ton-nan', 'sanma', 'soku'):
            row[key] = config[key]

        # Dan rank of each seat
        for i in range(4):
            row[f'p{i}-dan'] = players[i]['dan']

        # Rating of each seat
        for i in range(4):
            row[f'p{i}-rating'] = players[i]['rate']

        rows.append(row)

2009: 100%|█████████████████████████████████████████████████████████████████████| 80156/80156 [01:29<00:00, 899.79it/s]
2010: 100%|███████████████████████████████████████████████████████████████████| 149606/149606 [06:18<00:00, 395.03it/s]
2011: 100%|███████████████████████████████████████████████████████████████████| 217011/217011 [09:12<00:00, 392.43it/s]
2012: 100%|███████████████████████████████████████████████████████████████████| 263589/263589 [11:02<00:00, 398.16it/s]
2013: 100%|███████████████████████████████████████████████████████████████████| 250790/250790 [10:49<00:00, 385.88it/s]
2014: 100%|███████████████████████████████████████████████████████████████████| 244154/244154 [10:52<00:00, 374.27it/s]
2015: 100%|███████████████████████████████████████████████████████████████████| 249916/249916 [11:16<00:00, 369.27it/s]
2016: 100%|███████████████████████████████████████████████████████████████████| 252064/252064 [11:30<00:00, 365.03it/s]
2017: 100%|█████████████████████████████

In [12]:
df = pd.DataFrame(rows)

In [17]:
# Optimize DataFrame by changing type (Memory Usage from 200MB to 110MB)
for i in range(4):
    df[f'p{i}-dan'] = pd.to_numeric(df[f'p{i}-dan'], downcast='unsigned')  # Small non-negative ints fit in uint8

# Ratings carry fractional values, so downcast='unsigned' leaves them as float64
for i in range(4):
    df[f'p{i}-rating'] = pd.to_numeric(df[f'p{i}-rating'], downcast='unsigned')

df.set_index('log_id', inplace=True)  # Index by game ID instead of the default integer range
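
The effect of `downcast='unsigned'` depends on the values in each column, which is why the dan columns shrink to `uint8` while the rating columns stay `float64`. The sketch below uses made-up values shaped like this notebook's columns. The final `astype('float32')` step is not part of the notebook; it is one possible option, assuming single precision is accurate enough for ratings.

```python
import pandas as pd

# Made-up values shaped like the dan and rating columns
dan = pd.Series([0, 16, 17, 18])
rating = pd.Series([1500.00, 2061.34, 2187.98])

dan_small = pd.to_numeric(dan, downcast="unsigned")
rating_same = pd.to_numeric(rating, downcast="unsigned")

print(dan_small.dtype)    # uint8
print(rating_same.dtype)  # float64: fractional values cannot become unsigned ints

# Assumption: halving the width of the ratings is acceptable for later analysis
rating_f32 = rating.astype("float32")
print(rating.memory_usage(index=False), rating_f32.memory_usage(index=False))
```

Whether to take the extra float32 step is a trade-off between memory and precision; the notebook itself keeps the ratings at full precision.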

In [17]:
df.to_parquet(Path('E:') / 'mahjong' / 'pandas' / 'log_game_data.parquet', engine='fastparquet')  # Use `fastparquet` to preserve categorical data

In [6]:
df

| log_id | red | kui | ton-nan | sanma | soku | p0-dan | p1-dan | p2-dan | p3-dan | p0-rating | p1-rating | p2-rating | p3-rating |
|--------|-----|-----|---------|-------|------|--------|--------|--------|--------|-----------|-----------|-----------|-----------|
| 2009020103gm-00a9-0000-2453a04c | True | True | True | False | False | 0 | 0 | 0 | 0 | 1500.00 | 1500.00 | 1509.00 | 1500.00 |
| 2009020103gm-00a9-0000-47e70b77 | True | True | True | False | False | 0 | 0 | 0 | 0 | 1500.00 | 1500.00 | 1500.00 | 1500.00 |
| 2009022011gm-00a9-0000-d7935c6d | True | True | True | False | False | 16 | 16 | 16 | 16 | 2096.00 | 2000.00 | 2030.00 | 2008.00 |
| 2009022011gm-00e1-0000-2820118d | True | True | False | False | True | 17 | 16 | 16 | 17 | 2011.00 | 2032.00 | 2090.00 | 2070.00 |
| 2009022011gm-00e1-0000-293fb785 | True | True | False | False | True | 16 | 16 | 17 | 17 | 2034.00 | 2089.00 | 2004.00 | 2068.00 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2019123123gm-00e1-0000-99d83b7a | True | True | False | False | True | 16 | 16 | 16 | 16 | 2061.34 | 2187.98 | 2075.11 | 2122.54 |
| 2019123123gm-00e1-0000-c92676e9 | True | True | False | False | True | 16 | 16 | 16 | 17 | 2034.50 | 2085.77 | 2074.37 | 2179.28 |
| 2019123123gm-00e1-0000-cf8b36ed | True | True | False | False | True | 18 | 16 | 17 | 16 | 2240.79 | 2036.80 | 2172.86 | 2072.47 |
| 2019123123gm-00e1-0000-f2122b27 | True | True | False | False | True | 16 | 17 | 16 | 16 | 2031.11 | 2158.59 | 2158.87 | 2158.59 |


In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2506325 entries, 2009020103gm-00a9-0000-2453a04c to 2019123123gm-00e1-0000-f7f33877
Data columns (total 13 columns):
 #   Column     Dtype  
---  ------     -----  
 0   red        bool   
 1   kui        bool   
 2   ton-nan    bool   
 3   sanma      bool   
 4   soku       bool   
 5   p0-dan     uint8  
 6   p1-dan     uint8  
 7   p2-dan     uint8  
 8   p3-dan     uint8  
 9   p0-rating  float64
 10  p1-rating  float64
 11  p2-rating  float64
 12  p3-rating  float64
dtypes: bool(5), float64(4), uint8(4)
memory usage: 117.1+ MB
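
With the flags stored as `bool` columns, games can be selected with plain boolean masks. The sketch below is an illustration only: `df_demo` is a hypothetical two-row frame whose values are copied from the output above, and the chosen filter (4-player, East-only, fast games) is just one example of a query.

```python
import pandas as pd

# Hypothetical two-row frame using values from the displayed output
df_demo = pd.DataFrame(
    {
        "red": [True, True],
        "kui": [True, True],
        "ton-nan": [True, False],
        "sanma": [False, False],
        "soku": [False, True],
        "p0-rating": [2096.00, 2011.00],
        "p1-rating": [2000.00, 2032.00],
        "p2-rating": [2030.00, 2090.00],
        "p3-rating": [2008.00, 2070.00],
    },
    index=pd.Index(
        ["2009022011gm-00a9-0000-d7935c6d", "2009022011gm-00e1-0000-2820118d"],
        name="log_id",
    ),
)

rating_cols = [f"p{i}-rating" for i in range(4)]

# 4-player, East-only, fast games
mask = ~df_demo["sanma"] & ~df_demo["ton-nan"] & df_demo["soku"]
fast_east = df_demo[mask]

# Average table rating per selected game
mean_rating = fast_east[rating_cols].mean(axis=1)
print(mean_rating)
```

The same masks apply unchanged to the full frame, since they rely only on the column names and dtypes reported by `df.info()`.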
