# 2. Data Exploration 1 - Train Test split

#### This script reads the AuthorInformation CSV file and writes a new CSV file that tags each author as belonging to either the training set or the test set. The split aims for a roughly 50/50 gender balance in the training set, measured by number of posts.

### Imports

In [1]:
import pandas as pd
from IPython.display import clear_output
import numpy as np
import warnings

### Definitions

In [2]:
data_path = 'data/'

data_filename = 'AuthorInformation.csv'

### Load Dataframe

In [3]:
df = pd.read_csv(data_path+data_filename)
df.head()

Unnamed: 0,UserID,Gender,NoOfPosts
0,1000331,female,13
1,1000866,female,771
2,1004904,male,52
3,1005076,female,85
4,1005545,male,80


### Split into female and male

In [4]:
df_male = df[ df['Gender'] == 'male' ]
df_female = df[ df['Gender'] == 'female']

print('Number of male blogs: {0}'.format( df_male.shape[0] ))
print('Number of male posts: {0}'.format( df_male['NoOfPosts'].sum() ))
print('Number of female blogs: {0}'.format( df_female.shape[0] ))
print('Number of female posts: {0}'.format( df_female['NoOfPosts'].sum() ))

Number of male blogs: 9660
Number of male posts: 344773
Number of female blogs: 9660
Number of female posts: 335010


From the above we can see that the number of male and female blogs is identical (9660 each), while the number of posts is almost equal (344773 male vs. 335010 female).
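The same per-gender counts can also be produced in a single `groupby` call instead of four separate `print` statements. The sketch below is only an illustration: the `toy` dataframe is a made-up stand-in for `AuthorInformation.csv`, and the column names `blogs` and `posts` are chosen here, not taken from the notebook.

```python
import pandas as pd

# Toy stand-in for AuthorInformation.csv (values are illustrative only)
toy = pd.DataFrame({
    'UserID': [1, 2, 3, 4],
    'Gender': ['female', 'female', 'male', 'male'],
    'NoOfPosts': [13, 771, 52, 80],
})

# One row per gender: number of blogs (authors) and total number of posts
summary = toy.groupby('Gender')['NoOfPosts'].agg(blogs='count', posts='sum')
print(summary)
```

On the real data, replacing `toy` with `df` should give the same four numbers as the cell above in a single table.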

### Split into train and test for each dataframe

In [5]:
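# Authors are split by row position (no shuffling): the first 80% of male
# authors and the first 74% of female authors go to the training set.
# With these fractions the training post counts of the two genders come out
# nearly equal (see the output below).
n_male_train = int(df_male.shape[0]*0.8)
n_female_train = int(df_female.shape[0]*0.74)

df_male_train = df_male.iloc[:n_male_train]
df_male_test = df_male.iloc[n_male_train:]
df_female_train = df_female.iloc[:n_female_train]
df_female_test = df_female.iloc[n_female_train:]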

print('Number of training male blogs: {0}'.format( df_male_train.shape[0] ))
print('Number of training male posts: {0}'.format( df_male_train['NoOfPosts'].sum() ))
print('Number of training female blogs: {0}'.format( df_female_train.shape[0] ))
print('Number of training female posts: {0}'.format( df_female_train['NoOfPosts'].sum() ))
print()
print('Number of testing male blogs: {0}'.format( df_male_test.shape[0] ))
print('Number of testing male posts: {0}'.format( df_male_test['NoOfPosts'].sum() ))
print('Number of testing female blogs: {0}'.format( df_female_test.shape[0] ))
print('Number of testing female posts: {0}'.format( df_female_test['NoOfPosts'].sum() ))

Number of training male blogs: 7728
Number of training male posts: 270176
Number of training female blogs: 7148
Number of training female posts: 269911

Number of testing male blogs: 1932
Number of testing male posts: 74597
Number of testing female blogs: 2512
Number of testing female posts: 65099
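The notebook does not say how the 0.74 fraction for female authors was chosen. One way such a cutoff could be derived is to take the male training post total as a target and find the number of leading female rows whose cumulative post count is closest to it. The sketch below assumes that approach; `balanced_cutoff` is a hypothetical helper and the post counts are made up.

```python
import numpy as np
import pandas as pd

def balanced_cutoff(posts: pd.Series, target_posts: int) -> int:
    """Return how many leading rows give a post total closest to target_posts."""
    cumulative = posts.cumsum().to_numpy()
    # argmin gives a 0-based position; +1 converts it to a row count
    return int(np.abs(cumulative - target_posts).argmin()) + 1

# Made-up post counts per author, in file order
male_posts = pd.Series([50, 30, 20, 40, 10])
female_posts = pd.Series([40, 40, 30, 30, 25, 25])

# Fix the male split at 80% of authors, then match its post total
n_male_train = int(len(male_posts) * 0.8)
target = int(male_posts.iloc[:n_male_train].sum())

n_female_train = balanced_cutoff(female_posts, target)
female_fraction = n_female_train / len(female_posts)
print(n_female_train, round(female_fraction, 2))
```

Because the cutoff depends on row order, shuffling the authors first would change the resulting fraction.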


### Concatenate back with new training and test tags

In [6]:
# .assign() returns tagged copies, so the slices are never modified in place
# and no SettingWithCopyWarning has to be suppressed.
df = pd.concat([
    df_male_train.assign(TrainTest='train'),
    df_male_test.assign(TrainTest='test'),
    df_female_train.assign(TrainTest='train'),
    df_female_test.assign(TrainTest='test')
])

# Keep only the columns needed downstream
df = df[['UserID', 'TrainTest']]
df.head()

Unnamed: 0,UserID,TrainTest
2,1004904,train
4,1005545,train
5,1007188,train
8,1009572,train
12,1013637,train
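As an alternative to slicing four dataframes and concatenating them, the tags could be assigned in one pass over the original dataframe. This is a sketch under assumptions, not the notebook's method: the `toy` data and the `train_fraction` mapping are made up for illustration, and unlike the concatenation above it keeps the rows in their original order.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the author table (illustrative values only)
toy = pd.DataFrame({
    'UserID': range(1, 11),
    'Gender': ['male'] * 5 + ['female'] * 5,
})

# Hypothetical per-gender training fractions, mirroring the 0.8 / 0.74 split
train_fraction = {'male': 0.8, 'female': 0.74}

# Position of each author within their gender group (0, 1, 2, ...)
rank = toy.groupby('Gender').cumcount()
# Size of each author's gender group
group_size = toy['Gender'].map(toy['Gender'].value_counts())
# Number of leading rows per gender that go to training (truncated like int())
cutoff = (group_size * toy['Gender'].map(train_fraction)).astype(int)

toy['TrainTest'] = np.where(rank < cutoff, 'train', 'test')
print(toy.groupby(['Gender', 'TrainTest']).size())
```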


### Save to a new CSV

In [7]:
df.to_csv(data_path+'AuthorTrainTest.csv', index=False)
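A quick round-trip check can confirm that the saved file tags every author exactly once. The sketch below uses made-up data and a temporary directory rather than the real `data/` folder; the `validate='one_to_one'` argument makes `merge` raise an error if a `UserID` appears more than once on either side.

```python
import os
import tempfile

import pandas as pd

# Toy stand-ins for AuthorInformation.csv and the tagged output
authors = pd.DataFrame({
    'UserID': [1, 2, 3, 4],
    'Gender': ['female', 'female', 'male', 'male'],
    'NoOfPosts': [13, 771, 52, 80],
})
tags = pd.DataFrame({
    'UserID': [1, 2, 3, 4],
    'TrainTest': ['train', 'test', 'train', 'test'],
})

# Write and re-read the tag file, as a later script would
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'AuthorTrainTest.csv')
    tags.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Join tags back onto the author table; fails loudly on duplicate UserIDs
merged = authors.merge(reloaded, on='UserID', how='left', validate='one_to_one')
all_tagged = bool(merged['TrainTest'].notna().all())

# Post totals per split and gender, to re-check the balance
posts_by_split = merged.groupby(['TrainTest', 'Gender'])['NoOfPosts'].sum()
print(all_tagged)
print(posts_by_split)
```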