# Debating Motions Classifier

Classifies motions into categories such as 'Economics' and 'Politics'. 

**Multi-label classifier**: Each motion can be classified into multiple categories. E.g. a motion can be both of the category 'Economics' and 'Politics'.

The list of categories is specified: it is used in the training set and defined in the code as the array ``labels``.



In [2]:
#!/Users/jessica/anaconda/lib/python3.5

import sklearn
import pandas as pd
import numpy as np

In [60]:
# Read data
def read_csv_to_np_array(file_path, header=True):
    """
    Reads data from a csv file and converts it to a numpy array.
    Returns data, header
    """
    data = pd.read_csv(file_path)
    header_values = data.columns.values
    data = np.array(data)
    if header == True:
        return header_values, data
    else:
        return data
debating_file_path = "/Users/jessica/GitHub/data-science/data/debatingmotions_sorted.csv"
header, debating_data = read_csv_to_np_array(debating_file_path)

# Check data is of expected shape
debating_data.shape

(1080, 22)

In [4]:
# Specify sheet variables.
# Should consider assigning ``total_labelled`` and ``col`` variables programatically from debating_data input.

total_labelled = 188
train_size = 120
test_size = total_labelled - train_size
motion_col = 16
infoslide_col = 17
category_col_start = 18
category_col_end_plus_one = 21

In [5]:
# Shuffle labelled data 
labelled_data = debating_data[:total_labelled,:]
unlabelled_data = debating_data[total_labelled:,:]

# Check unlabelled data really is unlabelled
print('Is this unlabelled?', '\n', unlabelled_data[0], '\n')

# Shuffle data
np.random.shuffle(labelled_data)
np.random.shuffle(unlabelled_data)
print(labelled_data[:3])

Is this unlabelled? 
 ['2016-01-23' 'IoNA' 'United Kingdom' 0 'York IV' 'Bethany Garry'
 'Jennie Hope' 'Nissim Massarano' nan nan nan nan nan nan '2' '2'
 'This house believes that the Labour Party should have worked to rehabilitate Tony Blair’s image in its campaigning, prior to the 2015 General Election.'
 nan nan nan nan nan] 

[['2016-01-01' 'International' 'Greece' 3 'WUDC' 'Manos Moschopoulos'
  'Arinah Najwa' 'Chris Bisset' 'John McKee' 'Josh Zoffer'
  'Sarah Balakrishnan' 'Tasneem Elias' nan nan '5' '5'
  'THBT the US should withdraw from East Asia and cede regional hegemony to China.'
  nan 'International Relations' 'Security, War and Military' nan nan]
 ['2015-01-01' 'International' 'Malaysia' 3 'WUDC' 'Shafiq Bazari'
  'Jonathan Leader Maynard' 'Danique Van Koppenhagen' 'Sebastian Templeton'
  'Engin Arikan' 'Brett Frazer' 'Madeline Schultz' nan nan 'Masters_1'
  'Masters_1'
  'TH regrets the rise of art that celebrates gaining material wealth' nan
  'Art and Culture' nan na

### Convert labels to binary vectors

In [6]:
Z = labelled_data[:,category_col_start:category_col_end_plus_one]

In [7]:
labels = [
'Art and Culture',
'Business',
'Criminal Justice System',
'Development',
'Economics','Education',
'Environment',
'Family',
'Feminism',
'Freedoms',
'Funny',
'International Relations',
'LGBT+',
'Media',
'Medical Ethics',
'Minority Communities',
'Morality',
'Politics',
'Religion',
'Science and Technology',
'Security, War and Military',
'Social Policy',
'Social Movements',
'Sports',
'Terrorism',
'The Human Experience'
]

len_labels = len(labels)
print(len_labels)

26


In [8]:
Y_vector = np.zeros(shape=(total_labelled,len_labels))
for i in range(total_labelled):
    for j in range(category_col_end_plus_one - category_col_start):
        for k in range(len_labels):
            if Z[i,j] == labels[k]:
                Y_vector[i,k] = 1

In [9]:
# Transform Ys into binary
Y_dict = dict.fromkeys(labels)
for i in range(len_labels):
    Y_dict[labels[i]] = Y_vector[:,i]

In [10]:
Y_dict['Art and Culture']

array([ 0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  1.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0

In [11]:
# Choose category to train for
chosen_element = 'International Relations'
Y_for_element = Y_dict[chosen_element]

In [12]:
# Configure training set

X = labelled_data[:train_size,motion_col]
Y = Y_for_element[:train_size]

In [13]:
# Test set

X_test = labelled_data[train_size:total_labelled,motion_col]
Y_test = Y_for_element[train_size:total_labelled]

In [14]:
# Check that data is in form we expect
X[99]

'THR the criminalization of the reckless transmission of sexual infections (e.g. HIV, herpes and gonorrhoea) in England and Wales'

## Extract Features 

In [15]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X)
X_train_counts.shape

(120, 836)

In [16]:
count_vect.vocabulary_.get(u'house')

352

### Refinements to occurrence count
1. **tf**: To avoid these potential discrepancies it suffices to divide the number of occurrences of each word in a document by the total number of words in the document: these new features are called tf for Term Frequencies.

2. **tf-idf** (Term Frequency times Inverse Document Frequnecy) Another refinement on top of tf is to downscale weights for words that occur in many documents in the corpus and are therefore less informative than those that occur only in a smaller portion of the corpus.




In [17]:
from sklearn.feature_extraction.text import TfidfTransformer
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
X_train_tf.shape

(120, 836)

In [18]:
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(120, 836)

## Training a classifier

In [19]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, Y)

In [20]:
# Predict the outcome on a new document
docs_new = ['thw criminalise drugs', 'THW ban endangered animals',
           'ban drug taking in sports', 'THBT Saudi Arabia should nationalise its oil industry']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

predicted = clf.predict(X_new_tfidf)

for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, category))

'thw criminalise drugs' => 0.0
'THW ban endangered animals' => 0.0
'ban drug taking in sports' => 0.0
'THBT Saudi Arabia should nationalise its oil industry' => 1.0


## Building a Pipeline

In [21]:
from sklearn.pipeline import Pipeline
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())
                    ])

In [22]:
# Hard-coded array

category_clfs = [text_clf, text_clf, text_clf, text_clf, 
                 text_clf, text_clf, text_clf, text_clf,
                 text_clf, text_clf, text_clf, text_clf,
                 text_clf, text_clf, text_clf, text_clf,
                 text_clf, text_clf, text_clf, text_clf,
                 text_clf, text_clf, text_clf, text_clf,
                 text_clf, text_clf]            

In [23]:
for i in range(len_labels):
    category_clfs[i] = text_clf.fit(X, Y_dict[labels[i]][:train_size])

In [24]:
# Create dictionary of category classifiers
category_clfs_dict = dict.fromkeys(labels)
for i in range(len_labels):
    category_clfs_dict[labels[i]] = text_clf.fit(X, Y_dict[labels[i]][:train_size])

In [53]:
# v1 Evaluation of the performance on the test set
# v1 Evaluation of the performance on the test set
import numpy as np

def mean_accuracy_across_categories():
    """
    Prints mean accuracy of predictions.
    """
    overall_mean_accuracy_across_categories = []
    
    # For each category
    for category_to_test in labels:

        # Ask model to predict whether or not motions in ``X_test`` 
        # belong to category ``category_to_test``.
        predicted = category_clfs_dict[category_to_test].predict(X_test)
        
        # Calculate the mean accuracy of these predictions 
        # for ``category_to_test``
        category_mean_accuracy = np.mean(
        	predicted == Y_dict[category_to_test][train_size:total_labelled])
        print(category_to_test, category_mean_accuracy)
        
        # Append the mean accuracy to the array 
        # containing mean accuracy for all categories
        overall_mean_accuracy_across_categories.append(category_mean_accuracy)
    
    # Calculate and print the mean accuracy across all categories 
    # and motions in the test set. 
    print('Overall Mean Accuracy across Categories: ', 
    		np.mean(overall_mean_accuracy_across_categories))

Art and Culture 0.941176470588
Business 0.897058823529
Criminal Justice System 0.882352941176
Development 0.955882352941
Economics 0.779411764706
Education 0.955882352941
Environment 0.926470588235
Family 1.0
Feminism 0.897058823529
Freedoms 0.955882352941
Funny 0.985294117647
International Relations 0.764705882353
LGBT+ 0.985294117647
Media 0.970588235294
Medical Ethics 0.970588235294
Minority Communities 0.970588235294
Morality 0.897058823529
Politics 0.808823529412
Religion 0.955882352941
Science and Technology 0.985294117647
Security, War and Military 0.764705882353
Social Policy 0.926470588235
Social Movements 0.970588235294
Sports 1.0
Terrorism 0.985294117647
The Human Experience 0.985294117647
Overall Mean Accuracy across Categories:  0.927601809955


In [92]:
# v1.5 Evaluation of the performance on the test set

def accuracy_by_category(return_predictions=False):
    """
    Returns an array of arrays of booleans that indicates whether each 
    prediction matched the true value.
    Each row is a category and each value within the row is a motion.
    
    Weighs false positives the same as false negatives.
    
    >>> accuracy_by_category(e.g. if the prediction was 1 (in category)
    but the true value was 0 (not in category), the value is False.
    
    if ``return_predictions=True``, returns array of arrays of predictions 
    as well.
    """
    accuracy_by_category = []
    predictions = []

    # For each category
    for category_to_test in labels:

        # Generate predictions for all motions in test set
        # as to whether or not a motion belongs to a category
        predicted = category_clfs_dict[category_to_test].predict(X_test)

		# Add this to our array of predictions       
        predictions.append(predicted)
        
        # Accuracy: Is the prediction the same as the true label?
        accuracy_by_category.append(
        	predicted == Y_dict[category_to_test][train_size:total_labelled])
    
    # If we want to, the function can return the array of predictions
    # as well as the accuracy by category. 
    # (Specify ``return_predictions=True`` in func argument)
    if return_predictions == True:
        return accuracy_by_category, predictions
    else:
        return accuracy_by_category

def mean_accuracy_per_motion():
    """
    Prints the mean accuracy per motion.
    
    Assumes we weigh false positives the same as false negatives.
    """
    accuracy_per_motion = []
    acc_by_category = accuracy_by_category()

    # For each motion
    for i in range(test_size):
        score = 0

        # For the per-category prediction for each motion
        for j in range(len(labels)):

        	# Add 1 to the score if the prediction was accurate,
        	# add 0 to the score if the prediction was inaccurate
            score += acc_by_category[j][i]

        # Normalise the score such that it's between 0 and 1 inclusive    
        accuracy_for_one_motion = score/len(labels)

        # Put all the accuracy scores into one array
        accuracy_per_motion.append(accuracy_for_one_motion)

    print('Mean Accuracy Per Motion: ', np.mean(accuracy_per_motion))

mean_accuracy_per_motion()

Mean Accuracy Per Motion:  0.927601809955


In [None]:
# Suppose we weigh each false negative with weight ``false_neg`` and 
# each false positive with weight ``false_pos``. 
false_neg = 5
false_pos = 1

# v2 Evaluation of the performance on the test set

def mean_score_per_motion_v2():
    """
    Prints the mean score per motion.
    """
    # Booleans, actual predictions. 
    # Rows: categories. Values within rows: motions.
    acc_by_category, predicted = accuracy_by_category(return_predictions=True)
    
    # Initialise counts
    tp, tn, fp, fn = 0, 0, 0, 0
    accuracy_per_motion = []
    
    # For each motion
    for i in range(test_size):
        score = 0

        # For the per-category prediction for each motion
        for j in range(len(labels)):
            
            # Was the prediction accurate?
            pred_accuracy = acc_by_category[j][i]

            # We need the prediction to see if it is a 
            # true positive or negative
            pred = predicted[j][i]
            
            # If the prediction is accurate
            if pred_accuracy == 1:
                score += 1
                
                # True positive
                if pred == 1:
                    tp += 1
                
                # True negative
                elif pred == 0:
                    tn += 1
            
            # Else if the prediction is not accurate
            elif pred_accuracy == 0:
                
                # False positive
                if pred == 1:
                    score -= false_pos
                    fp += 1
                
                # False negative
                elif pred == 0:
                    score -= false_neg
                    fn += 1
                    
        # Normalise the score
        score_for_one_motion = score/len(labels)

        # Put all the scores into one array
        score_per_motion.append(score_for_one_motion)

    print('Mean Score Per Motion: ', np.mean(score_per_motion))
    print('True Positives: ', tp, '\n', 'True Negatives: ', tn, '\n', 
    		'False Positives: ', fp, '\n', 'False Negatives: ', fn)

mean_score_per_motion_v2()  

Of course the two are the same. We're taking the means of all the values in the array of arrays ``accuracy_by_category``. But we have laid the foundations to vary the weightings of false positives and false negatives and come up with a different measure of accuracy.

# Predict categories for unlabelled motions

In [31]:
# Predict the outcome on a new motion
motions_new = ['schools teachers students politicians elections government China']
X_new_counts = count_vect.transform(motions_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

for category_to_test in labels:
    predicted = category_clfs_dict[category_to_test].predict(motions_new)
    print(category_to_test, ": ", predicted)
    if predicted == 1:
        print("Yes: ", category_to_test)

Art and Culture :  [ 0.]
Business :  [ 0.]
Criminal Justice System :  [ 0.]
Development :  [ 0.]
Economics :  [ 0.]
Education :  [ 0.]
Environment :  [ 0.]
Family :  [ 0.]
Feminism :  [ 0.]
Freedoms :  [ 0.]
Funny :  [ 0.]
International Relations :  [ 0.]
LGBT+ :  [ 0.]
Media :  [ 0.]
Medical Ethics :  [ 0.]
Minority Communities :  [ 0.]
Morality :  [ 0.]
Politics :  [ 0.]
Religion :  [ 0.]
Science and Technology :  [ 0.]
Security, War and Military :  [ 0.]
Social Policy :  [ 0.]
Social Movements :  [ 0.]
Sports :  [ 0.]
Terrorism :  [ 0.]
The Human Experience :  [ 0.]
