The aim of this project is to find the best way to extract features from text data (the BBC News dataset) and to apply supervised machine learning algorithms to predict the label of each news article.

#### Import necessary modules from sklearn, scipy, numpy 

In [1]:
import numpy as np
import scipy
import os
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import train_test_split
import pandas as pd
from time import time

# Supervised Learning Using a Feed-Forward Neural Network

### Data Loading

Reads the CSV file using pandas.

In [2]:
bbc = pd.read_csv('supervisedlearningdataset_13082016.csv', parse_dates=True)

### Data Refining

Converting all NaN values in the data loaded to zeroes.

In [3]:
# converting NaN values in the dataset to 0.
for column in bbc.columns:
    bbc[column] = bbc[column].fillna(0)
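
The column-by-column loop above should have the same effect as a single `DataFrame.fillna(0)` call. A minimal sketch on a made-up frame (the column names `term_a` and `term_b` and their values are illustrative, not taken from the BBC dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the feature table; all values are invented.
toy = pd.DataFrame({"term_a": [1.0, np.nan, 3.0],
                    "term_b": [np.nan, 2.0, np.nan]})

# Replace every NaN in the whole frame with 0 in one call.
toy_filled = toy.fillna(0)

# Count the NaN values that remain after filling.
remaining_nans = int(toy_filled.isnull().sum().sum())
print(remaining_nans)
```

Filling with zero is a reasonable default when a missing entry means "this term does not occur in the document"; for other kinds of features a different fill value may be more appropriate.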

Removes the Document id column from the data.

In [4]:
# removing document id from dataset
bbc = bbc.drop("Document id", axis=1)

In [5]:
# table information: number of feature columns (all columns except the label)
# and number of rows (documents).
input_neurons = bbc.shape[1]-1
inputs = bbc.shape[0]

In [6]:
# feature column names.
feature_columns = bbc.columns[0:input_neurons]

Stores target column name in the target variable.

In [7]:
# target variable
target = "label"

## Splitting Data into Training and Test Sets

The code splits the data into training and test sets: 80% of the original dataset is used for training and the remaining 20% for testing.

In [8]:
# separating data into features and label
y = bbc.pop(target)
X = bbc
# splitting data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
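
To see how the 80/20 split produces the 1780 and 445 rows reported further down, here is a small sketch on synthetic data. The arrays `X_demo` and `y_demo` are placeholders with the same number of rows (2225) but only 4 made-up feature columns:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 2225 rows (1780 + 445), 4 invented feature columns.
X_demo = np.arange(2225 * 4).reshape(2225, 4)
y_demo = np.arange(2225) % 5  # five class labels, 0..4

# Hold out 20% of the rows for testing; random_state fixes the shuffle.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

print(X_tr.shape, X_te.shape)
```

Fixing `random_state` makes the shuffle reproducible, so repeated runs evaluate on the same held-out documents.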

In [9]:
# converting the pandas objects to numpy arrays
np_X_train = np.array(X_train)
np_X_test = np.array(X_test)
np_y_train = np.array(y_train)
np_y_test = np.array(y_test)

training_set = (np_X_train,np_y_train)
test_set = (np_X_test,np_y_test)

In [10]:
def vectorized_result(j):
    """Return a 5-dimensional unit vector with a 1.0 in the jth
    position and zeroes elsewhere.  This is used to convert a class
    label (0...4) into the corresponding desired output from the
    neural network."""
    e = np.zeros((5, 1))
    e[j] = 1.0
    return e
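
As a quick illustration of this one-hot encoding, the sketch below encodes a label and recovers it with `argmax`. The helper name `one_hot` and its `num_classes` parameter are hypothetical; the notebook's own function hard-codes 5 classes:

```python
import numpy as np

def one_hot(j, num_classes=5):
    """Return a (num_classes, 1) column vector with 1.0 at index j."""
    e = np.zeros((num_classes, 1))
    e[j] = 1.0
    return e

# Encode class label 3, then decode it again via the index of the maximum.
encoded = one_hot(3)
decoded = int(np.argmax(encoded))

print(encoded.ravel())
print(decoded)
```

The column shape `(5, 1)` matches the shape of the network's output layer, so a target can be compared with a prediction element by element.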

In [11]:
# ``training_data`` is a list containing 1780 2-tuples ``(x, y)``.
# ``x`` is a 9635-dimensional numpy.ndarray containing the input
# features.  ``y`` is a 5-dimensional numpy.ndarray representing the
# unit vector corresponding to the correct label for ``x``.

training_inputs = [np.reshape(x, (9635, 1)) for x in training_set[0]]
training_results = [vectorized_result(y) for y in training_set[1]]

In [12]:
# list(...) keeps this a reusable list under Python 3, where zip returns an iterator
training_data = list(zip(training_inputs, training_results))

In [13]:
test_inputs = [np.reshape(x, (9635, 1)) for x in test_set[0]]
# test labels stay as plain integers; list(...) keeps the pairs reusable under Python 3
test_data = list(zip(test_inputs, test_set[1]))

print(training_set[0].shape)
print(training_set[1].shape)
print(test_set[0].shape)
print(test_set[1].shape)

(1780, 9635)
(1780,)
(445, 9635)
(445,)
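
The reshaping and pairing steps in cells 11 to 13 can be tried on a tiny array. This is a sketch under stated assumptions: `n_features = 4` stands in for 9635, the data is invented, and `one_hot` is a hypothetical stand-in for `vectorized_result`:

```python
import numpy as np

n_features = 4  # stands in for the 9635 feature columns
X_small = np.arange(12, dtype=float).reshape(3, n_features)
y_small = np.array([0, 2, 4])

def one_hot(j, num_classes=5):
    """Return a (num_classes, 1) column vector with 1.0 at index j."""
    e = np.zeros((num_classes, 1))
    e[j] = 1.0
    return e

# Turn each row into a column vector and each label into a one-hot column.
inputs_small = [np.reshape(x, (n_features, 1)) for x in X_small]
targets_small = [one_hot(y) for y in y_small]

# Materialise the pairs so they can be shuffled and traversed repeatedly.
pairs = list(zip(inputs_small, targets_small))

print(len(pairs), pairs[0][0].shape, pairs[0][1].shape)
```

Note the asymmetry in the notebook: training targets are one-hot vectors, while test targets remain integer labels, which is convenient when predictions are compared by class index.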


## Building Neural Net

In [14]:
import network

In [15]:
# building a net with 3 layers:
# the input layer has input_neurons (9635) neurons, the hidden layer has 30 neurons,
# and the output layer has 5 neurons (one per class)

net = network.Network([input_neurons,30,5])
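
The `network` module itself is not shown here. Assuming it follows the common design of one weight matrix and one bias vector per non-input layer with sigmoid activations (an assumption, not something this notebook confirms), a `[9635, 30, 5]` network would hold parameters shaped as in this sketch:

```python
import numpy as np

sizes = [9635, 30, 5]  # input, hidden, output layer sizes
rng = np.random.RandomState(0)

# One bias column per non-input layer, one weight matrix per layer pair.
biases = [rng.randn(n, 1) for n in sizes[1:]]
weights = [rng.randn(n_out, n_in)
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def sigmoid(z):
    """Elementwise logistic function (assumed activation)."""
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(a):
    """Propagate a (9635, 1) input column through every layer."""
    for b, w in zip(biases, weights):
        a = sigmoid(np.dot(w, a) + b)
    return a

out = feedforward(np.zeros((sizes[0], 1)))
n_params = sum(w.size for w in weights) + sum(b.size for b in biases)

print([w.shape for w in weights], out.shape, n_params)
```

Under these assumptions almost all parameters sit in the first weight matrix (30 x 9635), which is worth keeping in mind when choosing the hidden layer size.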

In [16]:
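# running stochastic gradient descent for 30 epochs
# (the remaining positional arguments are assumed to be mini-batch size 10 and learning rate 3.0)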
net.SGD(training_data, 30, 10, 3.0,test_data=test_data)

Epoch 0: 182 / 445
Epoch 1: 203 / 445
Epoch 2: 323 / 445
Epoch 3: 358 / 445
Epoch 4: 365 / 445
Epoch 5: 384 / 445
Epoch 6: 386 / 445
Epoch 7: 392 / 445
Epoch 8: 392 / 445
Epoch 9: 394 / 445
Epoch 10: 406 / 445
Epoch 11: 398 / 445
Epoch 12: 401 / 445
Epoch 13: 401 / 445
Epoch 14: 401 / 445
Epoch 15: 407 / 445
Epoch 16: 404 / 445
Epoch 17: 404 / 445
Epoch 18: 404 / 445
Epoch 19: 405 / 445
Epoch 20: 408 / 445
Epoch 21: 407 / 445
Epoch 22: 407 / 445
Epoch 23: 407 / 445
Epoch 24: 406 / 445
Epoch 25: 409 / 445
Epoch 26: 408 / 445
Epoch 27: 408 / 445
Epoch 28: 409 / 445
Epoch 29: 409 / 445
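
Each line of the log above reports how many of the 445 test documents were classified correctly after that epoch. A short sketch that converts a few of those counts into accuracies (the counts are copied from the output; the variable names are illustrative):

```python
test_size = 445

# Correct predictions at selected epochs, copied from the training log.
correct_by_epoch = {0: 182, 10: 406, 20: 408, 29: 409}

# Accuracy is the fraction of test documents classified correctly.
accuracy = {epoch: correct / float(test_size)
            for epoch, correct in correct_by_epoch.items()}

for epoch in sorted(accuracy):
    print("Epoch %d: %.3f" % (epoch, accuracy[epoch]))
```

The final epoch corresponds to roughly 91.9% test accuracy, up from about 40.9% after the first epoch, with most of the gain in the first ten epochs.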
