# Assignment 4

## Preparing the data sets

In [76]:
import pandas as pd
import string
from math import log10
from collections import Counter
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer

We use the Pandas library to work with the data. In the cells below, we populate `COLUMN_LABELS` with the column names and then import the training and test data sets. Then we display the first five rows of each to visualize the data structure.

In [34]:
COLUMN_LABELS = ['Game Name', 'Class', 'Title', 'Review Text']
train = pd.read_csv('games-train.csv', sep='\t', names=COLUMN_LABELS)
test = pd.read_csv('games-test.csv', sep='\t', names=COLUMN_LABELS)

In [225]:
train.head()

| | Game Name | Class | Title | Review Text |
|---|---|---|---|---|
| 0 | Hay Day | gut | | Spaß pur |
| 1 | Bike Race Free | gut | | Top game mit sucht Potenzial |
| 2 | Subway Surfers | gut | Gut | Es lagt manchmal |
| 3 | Subway Surfers | gut | | Es ist ein tolles Spiel aber manchmal bleibt e... |
| 4 | Hay Day | gut | | Cccccccccooooooooooooooooo ooooooooooll |


In [226]:
test.head()

| | Game Name | Class | Title | Review Text |
|---|---|---|---|---|
| 0 | Farmville 2 | schlecht | | Echt schlecht , immer wen ich versuche zu star... |
| 1 | Die Simpsons | gut | Buchi0202136 | Suche noch freunde zum hinzufuegen |
| 2 | Die Simpsons | gut | Suchtgefähr :) !! | Ich find das Spiel gut,man muss nicht permanen... |
| 3 | Die Simpsons | gut | Dauerhafter Spaß... | ... durch immer neue Events. Schon 1 1/2 Jahre... |
| 4 | Subway Surfers | gut | Great | I like the game but near the last update it st... |


## Creating the Model

### Preprocessing
We define a global `tokenize` function here so that we don't need to instantiate a new `TweetTokenizer` object every time we call `preprocess`. We use `TweetTokenizer` because online game reviews resemble tweets: they are short, informal, and full of elongated words. Additionally, `TweetTokenizer` has a `reduce_len` parameter that shortens long runs of repeated characters, such as those in the "Cccccccccooooooooooooooooo" review shown above.

In [43]:
tokenize = TweetTokenizer(reduce_len=True).tokenize
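
To illustrate what length shortening does, here is a minimal sketch in plain Python. It assumes the behaviour is equivalent to collapsing every run of three or more identical characters down to exactly three; `shorten_repeats` is a hypothetical helper written for this illustration and is not part of NLTK.

```python
import re

def shorten_repeats(text):
    # Hypothetical stand-in for length shortening: collapse any run of
    # three or more identical characters down to exactly three.
    return re.sub(r'(.)\1{2,}', r'\1\1\1', text)

print(shorten_repeats('coooooool'))  # 'coool'
print(shorten_repeats('cool'))       # unchanged: 'cool'
```

Shortening like this means that differently elongated spellings of the same word are more likely to end up as the same token in the frequency counts.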

In [234]:
# Build the stop word set once; reloading the list for every token is very slow
GERMAN_STOPWORDS = set(stopwords.words('german'))

def preprocess(doc):
    doc = str(doc).lower() # str() in case Pandas imported number as int type
    doc = doc.translate(str.maketrans('', '', string.punctuation)).strip()
    return [token for token in tokenize(doc) if token not in GERMAN_STOPWORDS]

In [31]:
def estimate_parameters(docs, collection_size): # docs = docs belonging to one class
    p_y = len(docs) / collection_size
    count = Counter()
    for doc in docs:
        count.update(preprocess(doc))
        
    return (p_y, count)
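
As a quick sanity check of what the function returns, the sketch below runs the same logic on three made-up reviews. It is a toy version: `estimate_parameters_toy` and the example documents are hypothetical, and a plain lowercase `split()` stands in for `preprocess`.

```python
from collections import Counter

def estimate_parameters_toy(docs, collection_size):
    # Prior: share of the whole collection that belongs to this class.
    p_y = len(docs) / collection_size
    count = Counter()
    for doc in docs:
        # Assumption: whitespace splitting replaces preprocess() here.
        count.update(doc.lower().split())
    return p_y, count

good_docs = ['tolles spiel', 'spiel macht spaß']  # made-up "gut" reviews
bad_docs = ['spiel stürzt ab']                    # made-up "schlecht" review

p_y, count = estimate_parameters_toy(good_docs, len(good_docs) + len(bad_docs))
print(p_y)             # 2 of 3 documents belong to this class
print(count['spiel'])  # 2
```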

Now let's estimate the parameters for each class in the data sets. We use Python's dict comprehension to map every unique value in the training data's class column (in this case just 'gut', 'schlecht') to the result of running `estimate_parameters` on a data set containing only elements of that class. For added clarity, we enumerate the variables below.

- `params` = a dictionary mapping each class to a tuple of its prior probability `p_y` and the frequency distribution of terms in that class
- `class_` = a string containing either "gut" or "schlecht"
- `train` = the training data as a `DataFrame`

In [230]:
params = {
    class_: estimate_parameters(
        train[train['Class'] == class_]['Review Text'], # Gets only the text of the review
        len(train)
    ) for class_ in train['Class'].unique() # = ['gut', 'schlecht']
}

The prior probability `p_y` of each class, i.e. the share of training reviews that belong to it.

In [232]:
print(params['gut'][0], params['schlecht'][0])

0.8230904656534169 0.1769095343465831


The 10 most common words in the "gut" frequency distribution.

In [235]:
params['gut'][1].most_common(10)

[('spiel', 33871),
 ('cool', 16236),
 ('macht', 13632),
 ('super', 11447),
 ('geil', 9955),
 ('gut', 9503),
 ('einfach', 8996),
 ('spaß', 8329),
 ('echt', 5912),
 ('immer', 4793)]

The 10 most common words in the "schlecht" frequency distribution.

In [143]:
params['schlecht'][1].most_common(10)

[('spiel', 9424),
 ('mehr', 6483),
 ('seit', 3275),
 ('bitte', 3263),
 ('immer', 3176),
 ('update', 3092),
 ('mal', 2639),
 ('geht', 2309),
 ('beheben', 2143),
 ('schon', 1973)]

## Using the Model to Predict Class

Let's begin by using the `predict` function on some easy examples.
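
The cell that defines `predict` does not appear in this notebook. The following is only a minimal sketch of how such a function could look, assuming a multinomial Naive Bayes classifier with add-one smoothing whose score is the negative base-10 log probability (so a lower score is better). The name `predict_sketch`, the smoothing, the scoring, and the toy parameters are all assumptions; the scores printed in the cells below come from the original `predict`, not from this sketch.

```python
from collections import Counter
from math import log10

def predict_sketch(tokens, params):
    # Vocabulary = every term seen in any class (needed for add-one smoothing).
    vocab = set()
    for _, counts in params.values():
        vocab.update(counts)

    best_class, best_score = None, float('inf')
    for class_, (p_y, counts) in params.items():
        total = sum(counts.values())
        # Assumed score: negative log10 of prior times smoothed term likelihoods.
        score = -log10(p_y)
        for token in tokens:
            score -= log10((counts[token] + 1) / (total + len(vocab)))
        if score < best_score:
            best_class, best_score = class_, score
    return best_class, best_score

# Made-up parameters in the same (p_y, Counter) shape as `params`.
toy_params = {
    'gut': (0.8, Counter({'spiel': 5, 'tolles': 3, 'super': 2})),
    'schlecht': (0.2, Counter({'spiel': 3, 'update': 2, 'beheben': 2})),
}

good_guess = predict_sketch(['tolles', 'spiel'], toy_params)
bad_guess = predict_sketch(['update', 'beheben'], toy_params)
print(good_guess[0], bad_guess[0])  # gut schlecht
```

Working in log space avoids numeric underflow on long reviews, since many small probabilities are summed as logs instead of being multiplied.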

In [173]:
predict('tolles Spiel', params)

('gut', 3.763923847279274)

In [174]:
predict('das Spiel stürtzt immer ab. bitte schnell beheben', params)

('schlecht', 16.50217592654949)

On both examples, the classifier behaves just as we would expect. Now let's move on to the test data set. We assign to `result` a `Series` holding the prediction for each row in the "Review Text" column of `test`, then print the first five results.

In [176]:
%time result = test['Review Text'].apply(lambda x: predict(x, params))

CPU times: user 4min 34s, sys: 14.8 s, total: 4min 49s
Wall time: 4min 50s


In [178]:
result.head()

0    (schlecht, 24.71177312969205)
1    (schlecht, 7.229119758952452)
2    (schlecht, 71.83599943884803)
3    (schlecht, 39.38009345437814)
4         (gut, 45.68513157370194)
Name: Review Text, dtype: object

Now let's extract only the predicted classes and store them as `pred`.

In [221]:
pred = [x[0] for x in list(result)]

To visualize what this looks like, we'll create a `DataFrame` of the two sequences of predicted and true values, then print the first five rows.

In [224]:
joined = pd.concat([test['Class'], pd.Series(pred)], axis=1)
joined.columns = ['True', 'Predicted']
joined.head()

| | True | Predicted |
|---|---|---|
| 0 | schlecht | schlecht |
| 1 | gut | schlecht |
| 2 | gut | schlecht |
| 3 | gut | schlecht |
| 4 | gut | gut |


## Evaluation

In [216]:
def evaluate(target_class, true, predicted):
    """Return (tp, fp, fn, precision, recall, F1) for `target_class`."""
    if len(true) != len(predicted):
        raise ValueError('Sequences are of different lengths.')
    evl = pd.DataFrame(list(zip(true, predicted)), columns=['True', 'Predicted'])
    # Count true positives, false positives and false negatives for the target class
    tp = len(evl[(evl['Predicted'] == target_class) & (evl['True'] == target_class)])
    fp = len(evl[(evl['Predicted'] == target_class) & (evl['True'] != target_class)])
    fn = len(evl[(evl['Predicted'] != target_class) & (evl['True'] == target_class)])
    prec = tp / (tp + fp)
    recall = tp / (tp + fn)
    fscore = (2 * prec * recall) / (prec + recall) # harmonic mean of precision and recall
    return tp, fp, fn, prec, recall, fscore

In [215]:
print('gut:', evaluate('gut', test['Class'], pred))
print('schlecht:', evaluate('schlecht', test['Class'], pred))

gut: (28108, 2475, 8308, 0.919072687440735, 0.7718585237258347, 0.8390572993626769)
schlecht: (5342, 8308, 2475, 0.3913553113553114, 0.6833823717538697, 0.4976941351842363)
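
The per-class counts above also determine the overall accuracy and a macro-averaged F-score. The sketch below derives both from the numbers printed by `evaluate`; it assumes there are only the two classes "gut" and "schlecht", so the true negatives for "gut" equal the true positives for "schlecht".

```python
# Counts copied from the evaluate() output for target class 'gut'.
tp, fp, fn = 28108, 2475, 8308
# With two classes, true negatives for 'gut' = true positives for 'schlecht'.
tn = 5342

total = tp + fp + fn + tn           # number of test reviews
accuracy = (tp + tn) / total        # share of correctly classified reviews
precision = tp / (tp + fp)          # should match the 'gut' precision above
recall = tp / (tp + fn)             # should match the 'gut' recall above

# Macro F1: unweighted mean of the two per-class F-scores printed above.
macro_f1 = (0.8390572993626769 + 0.4976941351842363) / 2

print(total, accuracy, precision, recall, macro_f1)
```

Because about 82% of the training reviews are "gut", accuracy alone can look better than the classifier really is on the minority class; the macro average weights both classes equally.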
