<a href="https://colab.research.google.com/github/StavroK/MtySaturdayAI2020/blob/master/Fakenews_Challenge.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Data Set Analysis**

In this step we check whether the data set is balanced, so that we can train the machine learning models without oversampling or undersampling any class. If the data shows a bias towards one class, we need to apply techniques to handle that imbalance.

The first step is to check that we have all the Python packages needed to visualize the characteristics of the data set and check for class balance, and to install any that are missing. For this purpose we also need to join the files containing the labels of the examples with the files containing the content of the articles.


We want to check for: 1. Number of examples in each class, 2. Percentage of the total examples in each class, 3. Distribution of the length of each example, since its word content will be used to train the classifier.
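The three checks can be sketched with pandas as follows. This is a minimal sketch on a hand-made table, not the notebook's actual analysis: the column names `Stance` and `articleBody`, the helper column `n_words`, and all rows are assumptions chosen for illustration.

```python
import pandas as pd

# Toy stand-in for the joined training data. The column names "Stance" and
# "articleBody" are assumptions, as are the rows themselves.
examples = pd.DataFrame({
    "Stance": ["unrelated", "unrelated", "discuss", "agree"],
    "articleBody": [
        "one two three",
        "one two",
        "one two three four five",
        "one",
    ],
})

# 1. Number of examples in each class.
class_counts = examples["Stance"].value_counts()

# 2. Percentage of the total examples in each class.
class_percent = examples["Stance"].value_counts(normalize=True) * 100

# 3. Length of each example in words, summarized per class.
examples["n_words"] = examples["articleBody"].str.split().str.len()
length_by_class = examples.groupby("Stance")["n_words"].describe()

print(class_counts)
print(class_percent.round(1))
print(length_by_class[["count", "mean", "min", "max"]])
```

A strongly uneven `class_percent` would be the signal that oversampling, undersampling, or class weights are worth considering before training.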


In [0]:
# This command shows the "Location" where a package is installed.
# numpy is used here because it is one of the most commonly installed packages.
! pip show numpy

In [0]:
# Use this command to list everything installed under "Location".
# We are checking for the following packages:
# altair - builds the histogram of class representation in the examples
# matplotlib - plotting library built on NumPy; lets you embed plots in applications
# pandas - used to clean, transform, manipulate and analyze data
# pickle - used to convert a Python object into a byte stream
# pip - used to manage installation and updates of Python packages
# seaborn - built on matplotlib; used to visualize example length by category
# warnings - used to hide warnings coming from the seaborn package
# Note: pickle and warnings belong to the Python standard library, so they
# will not appear in the dist-packages listing.
! ls /usr/local/lib/python3.6/dist-packages
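As an alternative to reading the directory listing by eye, the same check can be sketched in Python. This is a minimal sketch, assuming the import names match the package names in the list above; `REQUIRED` and `missing_packages` are illustrative names, not part of the notebook.

```python
from importlib.util import find_spec

# Import names the notebook relies on (taken from the list above).
REQUIRED = ["altair", "matplotlib", "pandas", "pickle", "seaborn", "warnings"]


def missing_packages(names):
    """Return the names that cannot be imported in the current environment."""
    return [name for name in names if find_spec(name) is None]


print("Missing:", missing_packages(REQUIRED))
```

Because `find_spec` looks a module up without importing it, the check also covers standard-library modules such as `pickle` and `warnings`, which never show up under dist-packages.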

In [0]:
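# If you are missing a package, use the commands below to install it.
# pip installs into the environment's package folder regardless of the
# current working directory, so the %cd below is optional.

%cd /usr/local/lib/python3.6/dist-packages

# Use these commands to install a package and check the installation:
# ! pip install nameofpackage
# ! pip show nameofpackage

# warnings ships with Python's standard library, so it is imported directly
# and not installed with pip. seaborn is used as the example here.
! pip install seaborn
! pip show seaborn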

Now that the required Python packages are available in the environment, we import them into the program.

In [0]:
import altair as alt
import matplotlib.pyplot as plt
import pandas as pd
import pickle as pkl
import seaborn as sns
import warnings

The data is already in the /content/FakeNews folder. It is split into files with the example content (article bodies) and files with the labels (stances), and further into train, test and demo files to keep the process clean.


In [8]:
%cd /content/FakeNews/

file_train_instances = "train_stances.csv"
file_train_bodies = "train_bodies.csv"
file_test_instances = "test_stances_unlabeled.csv"
file_test_bodies = "test_bodies.csv"
file_predictions = 'predictions_test.csv'


/content/FakeNews
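Joining the label files to the article content can be sketched with `pandas.merge`. This is a minimal sketch under stated assumptions: the stance columns follow the `Headline`, `Body ID`, `Stance` header used by the scorer, while the `articleBody` column name and every row below are assumptions. In-memory CSV text stands in for the real files so the example runs on its own.

```python
import io

import pandas as pd

# Tiny in-memory stand-ins for train_stances.csv and train_bodies.csv.
# The "articleBody" column name and all rows are assumptions.
stances_csv = io.StringIO(
    "Headline,Body ID,Stance\n"
    "Headline A,0,unrelated\n"
    "Headline B,1,agree\n"
    "Headline C,1,discuss\n"
)
bodies_csv = io.StringIO(
    "Body ID,articleBody\n"
    "0,Text of the first article.\n"
    "1,Text of the second article.\n"
)

stances = pd.read_csv(stances_csv)
bodies = pd.read_csv(bodies_csv)

# Several headlines can point at one body, so this is a many-to-one join.
# validate= makes pandas raise if a Body ID is duplicated in the bodies table.
train = stances.merge(bodies, on="Body ID", how="left", validate="many_to_one")

print(train[["Headline", "Body ID", "Stance"]])
```

With the real data, `file_train_instances` and `file_train_bodies` would be passed to `pd.read_csv` in place of the in-memory text.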


In [19]:
%cd /content/FakeNews/
"""
Scorer for the Fake News Challenge
 - @bgalbraith
Submission is a CSV with the following fields: Headline, Body ID, Stance
where Stance is in {agree, disagree, discuss, unrelated}
Scoring is as follows:
  +0.25 for each correct unrelated
  +0.25 for each correct related (label is any of agree, disagree, discuss)
  +0.75 for each correct agree, disagree, discuss
"""
from __future__ import division
import csv
import sys


FIELDNAMES = ['Headline', 'Body ID', 'Stance']
LABELS = ['agree', 'disagree', 'discuss', 'unrelated']
RELATED = LABELS[0:3]

USAGE = """
FakeNewsChallenge FNC-1 scorer - version 1.0
Usage: python scorer.py gold_labels test_labels
  gold_labels - CSV file with reference GOLD stance labels
  test_labels - CSV file with predicted stance labels
The scorer will provide three scores: MAX, NULL, and TEST
  MAX  - the best possible score (100% accuracy)
  NULL - score as if all predicted stances were unrelated
  TEST - score based on the provided predictions
"""

ERROR_MISMATCH = """
ERROR: Entry mismatch at line {}
 [expected] Headline: {} // Body ID: {}
 [got] Headline: {} // Body ID: {}
"""

SCORE_REPORT = """
MAX  - the best possible score (100% accuracy)
NULL - score as if all predicted stances were unrelated
TEST - score based on the provided predictions
||    MAX    ||    NULL   ||    TEST   ||\n||{:^11}||{:^11}||{:^11}||
"""


class FNCException(Exception):
    pass


def score_submission(gold_labels, test_labels):
    score = 0.0
    cm = [[0, 0, 0, 0],
          [0, 0, 0, 0],
          [0, 0, 0, 0],
          [0, 0, 0, 0]]

    for i, (g, t) in enumerate(zip(gold_labels, test_labels)):
        if g['Headline'] != t['Headline'] or g['Body ID'] != t['Body ID']:
            error = ERROR_MISMATCH.format(i+2,
                                          g['Headline'], g['Body ID'],
                                          t['Headline'], t['Body ID'])
            raise FNCException(error)
        else:
            g_stance, t_stance = g['Stance'], t['Stance']
            if g_stance == t_stance:
                score += 0.25
                if g_stance != 'unrelated':
                    score += 0.50
            if g_stance in RELATED and t_stance in RELATED:
                score += 0.25

        cm[LABELS.index(g_stance)][LABELS.index(t_stance)] += 1

    return score, cm


def score_defaults(gold_labels):
    """
    Compute the "all false" baseline (all labels as unrelated) and the max
    possible score
    :param gold_labels: list containing the true labels
    :return: (null_score, best_score)
    """
    unrelated = [g for g in gold_labels if g['Stance'] == 'unrelated']
    null_score = 0.25 * len(unrelated)
    max_score = null_score + (len(gold_labels) - len(unrelated))
    return null_score, max_score


def load_dataset(filename):
    data = None
    try:
        with open(filename) as fh:
            reader = csv.DictReader(fh)
            if reader.fieldnames != FIELDNAMES:
                error = 'ERROR: Incorrect headers in: {}'.format(filename)
                raise FNCException(error)
            else:
                data = list(reader)

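            # list(reader) is an empty list, never None, when the file has
            # a header but no rows, so test for emptiness instead.
            if not data: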
                error = 'ERROR: No data found in: {}'.format(filename)
                raise FNCException(error)
    except FileNotFoundError:
        error = "ERROR: Could not find file: {}".format(filename)
        raise FNCException(error)

    return data


def print_confusion_matrix(cm):
    lines = ['CONFUSION MATRIX:']
    header = "|{:^11}|{:^11}|{:^11}|{:^11}|{:^11}|".format('', *LABELS)
    line_len = len(header)
    lines.append("-"*line_len)
    lines.append(header)
    lines.append("-"*line_len)

    hit = 0
    total = 0
    for i, row in enumerate(cm):
        hit += row[i]
        total += sum(row)
        lines.append("|{:^11}|{:^11}|{:^11}|{:^11}|{:^11}|".format(LABELS[i],
                                                                   *row))
        lines.append("-"*line_len)
    lines.append("ACCURACY: {:.3f}".format(hit / total))
    print('\n'.join(lines))


if __name__ == '__main__':
    if len(sys.argv) != 3:
        print(USAGE)
        sys.exit(0)

    _, gold_filename, test_filename = sys.argv

    try:
        gold_labels = load_dataset(gold_filename)
        test_labels = load_dataset(test_filename)

        test_score, cm = score_submission(gold_labels, test_labels)
        null_score, max_score = score_defaults(gold_labels)
        print_confusion_matrix(cm)
        print(SCORE_REPORT.format(max_score, null_score, test_score))

    except FNCException as e:
        print(e)

/content/FakeNews
ERROR: Could not find file: -f
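The `-f` in the error above most likely comes from the notebook kernel itself: inside Jupyter or Colab, `sys.argv` usually holds the launcher script, `-f`, and a connection-file path, so the `len(sys.argv) != 3` check passes and `-f` is read as the gold file name. Calling the scoring functions directly avoids `sys.argv` altogether. The sketch below restates the scoring rule from the scorer docstring on four hand-made pairs; the helper name `score_pair` and the pairs are illustrative and not part of the official scorer.

```python
RELATED = ["agree", "disagree", "discuss"]


def score_pair(gold, predicted):
    """Score one prediction with the rule from the scorer docstring."""
    score = 0.0
    if gold == predicted:
        score += 0.25
        if gold != "unrelated":
            score += 0.50
    if gold in RELATED and predicted in RELATED:
        score += 0.25
    return score


# Hypothetical (gold, predicted) pairs.
pairs = [
    ("unrelated", "unrelated"),  # correct unrelated      -> 0.25
    ("agree", "agree"),          # correct related stance -> 1.00
    ("agree", "discuss"),        # related, wrong stance  -> 0.25
    ("discuss", "unrelated"),    # missed related         -> 0.00
]

test_score = sum(score_pair(g, p) for g, p in pairs)

# Same baselines as score_defaults: NULL predicts "unrelated" everywhere,
# MAX gets every example fully right.
n_unrelated = sum(1 for g, _ in pairs if g == "unrelated")
null_score = 0.25 * n_unrelated
max_score = null_score + (len(pairs) - n_unrelated)

print(max_score, null_score, test_score)  # 3.25 0.25 1.5
```

In the notebook, the equivalent call would be `score_submission(load_dataset(gold_path), load_dataset(test_path))`, where `gold_path` and `test_path` are placeholders for two explicit CSV file paths.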
