# Assignment 9

In this assigment you will implement

 * a one hot encoding for categorical variables
 * the precision score
 * a text classifier for german parliament speeches

## Assignment 09.01

Implement a function assignment_09_01 that computes the one hot encoded representation of a list of strings. 

Return the result as a numpy array with the rows referring to data points and the columns to unique categories. The order of the columns is irrelevant. 

Test the implementation with the list ``['apple','banana','apple','lemon']``.

In [1]:
import numpy as np

def assignment_09_01(x):
    unique_items = list(set(x))
    one_hot = [[1 if ui == w else 0 for ui in unique_items] for w in x]

    return np.array(one_hot)

ohe = assignment_09_01(['apple','banana','apple','lemon'])
ohe

assert np.array_equal(ohe.sum(axis=1),np.array([1.,1.,1.,1.]))
assert np.array_equal(ohe[0,:],ohe[2,:])

## Assignment 09.02

Implement a function assignment_09_02 that computes the precision of binary predictions:

$${\text{precision}}={\frac {|\{{\text{relevant instances}}\}\cap \{{\text{predicted instances}}\}|}{|\{{\text{predicted instances}}\}|}}$$


The function should expect the true and predicted binary categories as numpy vectors, meaning numpy arrays with only one axis as e.g. ``np.array([1,0])`` where 1 stands for positive prediction and 0 for negative prediction. Make sure that always a number is returned and not a NaN.

In [2]:
def assignment_09_02(y_true, y_predicted):
    
    tp = len([i for i in range(len(y_true)) if y_true[i]==1 and y_predicted[i]==1])
    fp = len([i for i in range(len(y_true)) if y_true[i]==0 and y_predicted[i]==1])
    
    if tp==0 and fp==0:
        return 0
    
    return tp / (tp+fp)

assert assignment_09_02(np.array([1,1,0]),np.array([0,0,0])) == 0
assert assignment_09_02(np.array([1,1,0]),np.array([1,1,0])) == 1
assert assignment_09_02(np.array([1,1,0]),np.array([1,0,0])) == 1
assert assignment_09_02(np.array([1,1,0]),np.array([0,1,1])) == .5

## Assignment 09.03

In the 17th Bundestag elected in 2009, the ruling parties were CDU/CSU and FDP, in the 18th Bundestag elected in 2013 the ruling parties were CDU/CSU and SPD. Download the [parliament speeches](https://www.dropbox.com/s/1nlbfehnrwwa2zj/bundestags_parlamentsprotokolle.csv.gzip?dl=1) and compute a new target variable 'government' that is true if the respective party was in the ruling coalition at the time. 

Write a function ``assignment_09_03`` that preprocesses the data and trains a text classification pipeline that predicts whether a speech was made by the governing party. Train the pipeline on the speeches of the 17th Bundestag and test them on (heldout) data from the 17th Bundestag as well as on data from the 18th Bundestag. 

In [3]:
import os, gzip
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
import urllib.request
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, accuracy_score, precision_recall_curve

In [4]:
DATADIR = "data"

def load_data():
    if not os.path.exists(DATADIR): 
        os.mkdir(DATADIR)

    file_name = os.path.join(DATADIR, 'bundestags_parlamentsprotokolle.csv.gzip')
    if not os.path.exists(file_name):
        url_data = 'https://www.dropbox.com/s/1nlbfehnrwwa2zj/bundestags_parlamentsprotokolle.csv.gzip?dl=1'
        urllib.request.urlretrieve(url_data, file_name)

    df = pd.read_csv(gzip.open(file_name), index_col=0).sample(frac=1)
    df.loc[df.wahlperiode==17,'government'] = df[df.wahlperiode==17].partei.isin(['cducsu','fdp'])
    df.loc[df.wahlperiode==18,'government'] = df[df.wahlperiode==18].partei.isin(['cducsu','spd'])
    
    return df

In [5]:
def train(texts, gov, num_words=1e5):
    stopwords = [w.strip() for w in open("data/stopwords.txt").readlines()]
    
    clf = Pipeline([('vect', 
                          TfidfVectorizer(stop_words=stopwords,
                          ngram_range=(1,1), max_features=int(num_words))),
                         ('clf', SGDClassifier(loss='log', alpha=1e-5))])
    clf.fit(texts, gov)
    
    return clf

In [6]:
def assignment_09_03():
    
    # Load data from the remote sourcec
    df = load_data()
    
    # get government and text data from df, for only year 17
    df_17_gov, df_17_speeches = df.loc[df.wahlperiode == 17, "government"], df.loc[df.wahlperiode == 17, "text"]
    
    # get government and text data from df, for only year 18
    df_18_gov, df_18_speeches = df.loc[df.wahlperiode == 18, "government"], df.loc[df.wahlperiode == 18, "text"]

    # Split the data for 17 bundestags
    train_data, test_data, train_labels, test_labels = train_test_split(df_17_speeches, df_17_gov, test_size=0.2)
    
    # Train the model using 17-Bundestags training data
    model = train(train_data.tolist(), train_labels.tolist())
    
    # test the model using 17-Bundastag test data
    test_prediction_17 = model.predict(test_data.tolist())
    
    # test the model using 18-Bundastag test data
    test_prediction_18 = model.predict(df_18_speeches.tolist())
    
    print("*"*80 + "\nEvaluation on 17th Bundestag held-out data")
    print(classification_report(test_prediction_17, test_labels.tolist()))
    
    print("*"*80 + "\nEvaluation on 18th Bundestag held-out data")
    print(classification_report(test_prediction_18, df_18_gov.tolist()))
    
assignment_09_03()



********************************************************************************
Evaluation on 17th Bundestag held-out data
             precision    recall  f1-score   support

      False       0.82      0.81      0.81      2584
       True       0.77      0.79      0.78      2145

avg / total       0.80      0.80      0.80      4729

********************************************************************************
Evaluation on 18th Bundestag held-out data
             precision    recall  f1-score   support

      False       0.83      0.50      0.63     10002
       True       0.64      0.90      0.75     10032

avg / total       0.74      0.70      0.69     20034

