# Dataset preprocessing and manual labeling #

This project starts by defining a test set. The main problem is that the dataset contains
more than 12,000 items and none of them are labeled, so a semi-supervised approach will
be used. 200 articles will be picked at random and labeled manually to form the test set.
If the sample happens to contain fewer than 10 'aggressive' texts, the test set can be
extended manually. If it happens to contain more than 50 'aggressive' texts, some of them
should be excluded from the test set.
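
The size rule above can be written as a small check. This is only a sketch: the function name `check_aggressive_count` and its return values are hypothetical, and it assumes the labeled sample has an `aggressive` column holding 1 for aggressive texts. Only the thresholds 10 and 50 come from the text.

```python
import pandas as pd

def check_aggressive_count(test_set, low=10, high=50):
    """Return 'extend', 'trim' or 'ok' depending on the number of aggressive texts."""
    n_aggressive = int((test_set['aggressive'] == 1).sum())
    if n_aggressive < low:
        return 'extend'   # too few aggressive texts: add more manually
    if n_aggressive > high:
        return 'trim'     # too many aggressive texts: exclude some
    return 'ok'

# Tiny illustrative frame (made-up labels, not project data)
demo = pd.DataFrame({'aggressive': [1, 0, 0, 1, 0]})
print(check_aggressive_count(demo))
```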

Useful columns from the **fake_news.csv** dataset:

**language** - used to filter out non-English texts.

**type** - used to create the 'hate' column, which can serve as a benchmark model.

**text** - the main data source. Rows with empty texts should be removed.

A new column, **aggressive**, will be created during the manual labeling process. Only 200 randomly selected articles will be labeled. These articles will form the test dataset.

All the useful columns and processed data will be saved to the **partially_labeled_news.csv** file.

In [None]:
import numpy as np
import pandas as pd

rawData = pd.read_csv("fake_news.csv")

rawData.head(5)

In [None]:
# Remove tabs and newlines from all texts (the output file is written tab-separated)
rawData['text'] = rawData['text'].str.replace('\t', '')
rawData['text'] = rawData['text'].str.replace('\n', '')

## Data filtering ##

**Language filter:** Only English-language texts from the raw data are useful for us. The 'language' column can be used to decide this. Let's check that the final dataset contains only English texts.


In [None]:
rawData.language.unique()

In [None]:
print("before english language filtering: " + str(rawData.shape[0]))
data = rawData[rawData.language == 'english']
print("after english language filtering: " + str(data.shape[0]))

**Empty text filter:** We also need to check that there are no empty texts or texts consisting only of special characters.

In [None]:
print("before empty text filtering: " + str(data.shape[0]))

data = data.dropna(subset = ['text'])

data = data[data.text.str.strip().map(len) > 3]

print("after empty text filtering: " + str(data.shape[0]))
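
The length check above keeps any string longer than three characters, including one made only of punctuation. A stricter variant could count letters and digits instead. The sketch below assumes that "special characters" means anything outside ASCII letters and digits; `has_enough_content` and `min_chars` are hypothetical names.

```python
import pandas as pd

def has_enough_content(texts, min_chars=4):
    """True for texts with at least `min_chars` ASCII letters or digits."""
    # Drop every character that is not an ASCII letter or digit, then measure the rest.
    cleaned = texts.fillna('').str.replace(r'[^0-9A-Za-z]', '', regex=True)
    return cleaned.str.len() >= min_chars

# Made-up examples: a normal text, punctuation only, a missing value, a very short text
demo = pd.Series(['Real article text', '!!! ???', None, 'ok'])
mask = has_enough_content(demo)
print(mask.tolist())
```

Applied to the notebook's data, this would read `data = data[has_enough_content(data.text)]`.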

In [None]:
data = data.reset_index(drop=True)

## Creating 'hate' column ##

**Improve type column:** A new 'hate' column should be derived from the 'type' column, marking each row as 'hate' or not 'hate'.

In [None]:
data['type'].unique()

In [None]:
data['hate'] = data['type'].apply(lambda x: 1 if (x == 'hate') else 0)

In [None]:
def printHateStatistics(dt):
    nHate = dt[dt['hate'] == 1].shape[0]
    nNoneHate = dt[dt['hate'] == 0].shape[0]
    print("total hate messages: " + str(nHate))
    print("total nonhate messages: " + str(nNoneHate))
    print("hate messages %: " + str( 100.0 * nHate / (nHate + nNoneHate)))

In [None]:
printHateStatistics(data)

In [None]:
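# Keep only the columns needed later (.ix was removed from pandas; select by name instead)
data = data[['text', 'hate']].copy()
# Empty label column, to be filled in during manual labeling
data['aggressive'] = pd.Series(np.nan, index = data.index)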
data.head(5)

Finally, save the prepared dataset to a file.

In [None]:
data.to_csv('partially_labeled_news.csv', sep='\t', index=False)

## Manual labeling procedure ##

Load the data from the file to start or continue labeling.

In [2]:
import numpy as np
import pandas as pd

unlabeledData = pd.read_csv('partially_labeled_news.csv', sep='\t')

In [3]:
randomState = 2017

In [4]:
testSet = unlabeledData.sample(n = 200, random_state= randomState)

testSet.head()

Unnamed: 0,text,hate,aggressive
4270,"October 26, 2016 Trump Has Hissy Fit After Rep...",0,1.0
719,N379P / Piper PA-46-350P Malibu Mirageand Flig...,0,0.0
5653,Share This There are so many reasons Americans...,0,1.0
8766,UK economy running as mysteriously as a 1993 V...,0,0.0
7482,It wasn’t long ago that the Left represented t...,1,0.0


With about 2% hate messages in the data, the sample of 200 is expected to contain at least 4 of them. If there are 0 hate messages, then this sample cannot be used as a test set.
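
To gauge how likely the unusable case is, the arithmetic can be sketched as follows. It assumes each sampled article is independently a hate article with probability 0.02 (the 2% figure above); this is a binomial approximation, since the sample is actually drawn without replacement.

```python
n = 200        # size of the test sample
p_hate = 0.02  # assumed share of hate messages, taken from the text above

expected_hate = n * p_hate    # expected number of hate messages in the sample
p_zero = (1 - p_hate) ** n    # chance that the sample contains none at all

print("expected hate messages: " + str(expected_hate))
print("probability of zero hate messages: " + str(round(p_zero, 4)))
```

Under this assumption, a sample with no hate messages at all is rare (below 2%), so re-drawing with a different random state should seldom be needed.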

In [None]:
printHateStatistics(testSet)

In [None]:
def labelNext(dataSet):
    # Rows that have not been labeled yet
    total = dataSet[dataSet.aggressive.isnull()]
    index = total.index[0]
    print("Left : " + str(total.shape[0]))
    print("Next index to analyse " + str(index))
    print(dataSet.loc[index].text)
    # 1 = aggressive, 0 = not aggressive
    return (int(input("How to mark this text? ")), index)

Repeat the next cell until there are no unlabeled rows left.

In [None]:
inputResult, textIdx = labelNext(testSet)
testSet.loc[textIdx, 'aggressive'] = inputResult
unlabeledData.loc[textIdx, 'aggressive'] = inputResult
print(testSet.loc[textIdx])
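
Instead of re-running the cell by hand, the same step could be wrapped in a loop that also rejects anything other than 0 or 1. This is a sketch under assumptions, not the procedure used here: `label_all` and its `ask` parameter are hypothetical, and it relies on Python 3, where `input` returns a string.

```python
import numpy as np
import pandas as pd

def label_all(data_set, ask=input):
    """Ask for a 0/1 label for every row whose 'aggressive' value is missing."""
    labeled = 0
    # Take the list of unlabeled indices once, before any labels are written
    for index in data_set[data_set['aggressive'].isnull()].index:
        print(data_set.loc[index, 'text'])
        answer = ask("Aggressive? (0/1) ").strip()
        while answer not in ('0', '1'):
            answer = ask("Please enter 0 or 1: ").strip()
        data_set.loc[index, 'aggressive'] = int(answer)
        labeled += 1
    return labeled

# Non-interactive demo: scripted answers stand in for keyboard input
demo = pd.DataFrame({'text': ['first text', 'second text'],
                     'aggressive': [np.nan, np.nan]})
answers = iter(['1', 'x', '0'])   # 'x' is rejected and asked again
count = label_all(demo, ask=lambda prompt: next(answers))
print(demo)
```

In the notebook the labels would still have to be copied into `unlabeledData` at the same indices, as the cell above does.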

All items in testSet should now be labeled.

In [None]:
def printAggressiveStatistics(dt):
    nAggresive = dt[dt['aggressive'] == 1].shape[0]
    nNoneAggresive = dt[dt['aggressive'] == 0].shape[0]
    print("total aggressive messages: " + str(nAggresive))
    print("total nonaggressive messages: " + str(nNoneAggresive))
    print("aggressive messages %: " + str( 100.0 * nAggresive / (nAggresive + nNoneAggresive)))
    
printAggressiveStatistics(unlabeledData)


Check that no unexpected values were entered by mistake.

In [None]:
unlabeledData['aggressive'].unique()
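
The same check can be made explicit rather than read off by eye. The sketch below assumes the only valid labels are 0 and 1, with missing values for rows that were never labeled; `unexpected_labels` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

def unexpected_labels(labels, allowed=(0, 1)):
    """Return the sorted label values that are neither missing nor in `allowed`."""
    present = labels.dropna().unique()
    return sorted(float(v) for v in set(present) - set(allowed))

# Made-up labels containing one typo (2.0)
demo = pd.Series([0.0, 1.0, np.nan, 2.0, 1.0])
print(unexpected_labels(demo))
```

An empty list means every entered label is valid.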

## Save result set to file ##

In [1]:
unlabeledData.to_csv('partially_labeled_news.csv', sep='\t', index=False)
