## BBC Text Dataset
Source : https://www.kaggle.com/yufengdev/bbc-fulltext-and-category/downloads/bbc-text.csv

category: One of 5 categories

text: The title and body of the article, concatenated.

### Importing libraries

In [167]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt 
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

### Importing the file

In [168]:
data = pd.read_csv('bbc-text.csv')
data.head()

Unnamed: 0,category,text
0,tech,tv future in the hands of viewers with home th...
1,business,worldcom boss left books alone former worldc...
2,sport,tigers wary of farrell gamble leicester say ...
3,sport,yeading face newcastle in fa cup premiership s...
4,entertainment,ocean s twelve raids box office ocean s twelve...


### Setting data keys 

In [169]:
data.keys()

Index(['category', 'text'], dtype='object')

### Finding unique news categories 

In [170]:
data.category.unique()

array(['tech', 'business', 'sport', 'entertainment', 'politics'],
      dtype=object)

### Finding the count of each unique type of article

In [171]:
data['category'].value_counts()

sport            511
business         510
politics         417
tech             401
entertainment    386
Name: category, dtype: int64

### Shape of the data

In [172]:
data.shape

(2225, 2)

### Checking for Null

In [173]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2225 entries, 0 to 2224
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   category  2225 non-null   object
 1   text      2225 non-null   object
dtypes: object(2)
memory usage: 34.9+ KB


### Display sample text article

In [174]:
data['text'][15]

's korean credit card firm rescued south korea s largest credit card firm has averted liquidation following a one trillion won ($960m; £499m) bail-out.  lg card had been threatened with collapse because of its huge debts but the firm s creditors and its former parent have stepped in to rescue it. a consortium of creditors and lg group  a family owned conglomerate  have each put up $480m to stabilise the firm. lg card has seven million customers and its collapse would have sent shockwaves through the country s economy.  the firm s creditors - which own 99% of lg card - have been trying to agree a deal to secure its future for several weeks. they took control of the company in january when it avoided bankruptcy only through a $4.5bn bail-out.  they had threatened to delist the company  a move which would have triggered massive debt redemptions and forced the company into bankruptcy  unless agreement was reached on its future funding.  lg card will not need any more financial aid after th

### Transforming the text data into numerical from by using term frequency–inverse document frequency (tfidf vectorizer)
Tfidf vectorizer determines the importance of word in an document. Higher value placed on common word in document, and lower value for common word in corpus

In [175]:
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
x_transformed = tfidf_vectorizer.fit_transform(data.text)
tfidf_vectorizer.vocabulary_

{'tv': 27047,
 'future': 11392,
 'hands': 12454,
 'viewers': 27928,
 'home': 13065,
 'theatre': 26238,
 'systems': 25786,
 'plasma': 19954,
 'high': 12887,
 'definition': 7800,
 'tvs': 27050,
 'digital': 8268,
 'video': 27911,
 'recorders': 21495,
 'moving': 17681,
 'living': 15952,
 'room': 22583,
 'way': 28308,
 'people': 19550,
 'watch': 28276,
 'radically': 21123,
 'different': 8252,
 'years': 28971,
 'time': 26419,
 'according': 1648,
 'expert': 10127,
 'panel': 19194,
 'gathered': 11541,
 'annual': 2503,
 'consumer': 6705,
 'electronics': 9342,
 'las': 15428,
 'vegas': 27796,
 'discuss': 8440,
 'new': 18118,
 'technologies': 26024,
 'impact': 13580,
 'favourite': 10445,
 'pastimes': 19354,
 'leading': 15527,
 'trend': 26833,
 'programmes': 20651,
 'content': 6737,
 'delivered': 7865,
 'networks': 18100,
 'cable': 4951,
 'satellite': 22976,
 'telecoms': 26059,
 'companies': 6332,
 'broadband': 4614,
 'service': 23520,
 'providers': 20797,
 'rooms': 22584,
 'portable': 20187,
 'dev

### Setting the transformed data into X and the target variable into Y and Invoking Train-Test split
splitting the data 80% training and 20% test with shuffling the data

In [176]:
X = x_transformed

Y = data.category

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size= 0.2, shuffle= True, random_state = 0)

### Linear SVC

In [177]:
def linear_svc(x_train, y_train):
    linearsvc = LinearSVC(C=1.0, max_iter=1000, tol=1e-3)
    linearsvc.fit(x_train, y_train)
    return linearsvc

### Decison Tree

In [178]:
def decision_tree(x_train, y_train):
    dtc = DecisionTreeClassifier(max_depth = 10)
    dtc.fit(x_train, y_train)
    return dtc

### MLP Classifier

In [179]:
def mlp_model(x_train, y_train):
    mlpc = MLPClassifier(activation='relu', hidden_layer_sizes={12,12,12}, solver='adam', verbose=True, max_iter=1000)
    mlpc.fit(x_train, y_train)
    return mlpc

### Build, Train, Test, and evaluate model 


In [180]:
def build_and_train_model(data, target_name, class_fn):
    model = class_fn(x_train, y_train)
    score = model.score(x_train, y_train)
    print("Training Score : ", score)
    y_pred = model.predict(x_test)

    accuracy = accuracy_score(y_test, y_pred)
    print("Testing Score: ", accuracy)

    df_y = pd.DataFrame({'y_test' : y_test, 'y_pred' : y_pred})

    print(df_y.sample(10))

    return {'model': model,
            'x_train' : x_train, 'x_test' : x_test,
            'y_train' : y_train, 'y_test' : y_test, 
            'y_pred' : y_pred, 'sample' : df_y.sample(10)
            }

Utilizing the build and train model function to predict category of news article with accuracy score for each model on the data. 3 inputs as needed: data file, value to predict(category), and the type of model below. Results in order Training score, Testing/accuracy and the Y_test: actual values and Y_pred = predicted values. 

In [181]:
Linear_SVC = build_and_train_model(data, 'category', linear_svc)

Training Score :  1.0
Testing Score:  0.9887640449438202
             y_test         y_pred
132           sport          sport
913        business       business
1760          sport          sport
453           sport          sport
2224          sport          sport
938           sport          sport
994           sport          sport
1199  entertainment  entertainment
1789  entertainment  entertainment
1600           tech           tech


In [182]:
Decision_tree = build_and_train_model(data, 'category', decision_tree)

Training Score :  0.8449438202247191
Testing Score:  0.8044943820224719
             y_test         y_pred
666        business       business
2132           tech       business
1825          sport          sport
1602  entertainment  entertainment
1621  entertainment  entertainment
1474       business       business
1367       politics       politics
1629       business       business
488           sport          sport
1410  entertainment  entertainment


In [183]:
MLP = build_and_train_model(data, 'category', mlp_model)

Iteration 1, loss = 1.59810907
Iteration 2, loss = 1.52356628
Iteration 3, loss = 1.44135883
Iteration 4, loss = 1.35783637
Iteration 5, loss = 1.27464114
Iteration 6, loss = 1.19156165
Iteration 7, loss = 1.10956361
Iteration 8, loss = 1.02860171
Iteration 9, loss = 0.94997119
Iteration 10, loss = 0.87391226
Iteration 11, loss = 0.80109446
Iteration 12, loss = 0.73219931
Iteration 13, loss = 0.66799622
Iteration 14, loss = 0.60797355
Iteration 15, loss = 0.55305704
Iteration 16, loss = 0.50263401
Iteration 17, loss = 0.45711633
Iteration 18, loss = 0.41579863
Iteration 19, loss = 0.37858064
Iteration 20, loss = 0.34508427
Iteration 21, loss = 0.31510952
Iteration 22, loss = 0.28803693
Iteration 23, loss = 0.26390753
Iteration 24, loss = 0.24217824
Iteration 25, loss = 0.22260972
Iteration 26, loss = 0.20508255
Iteration 27, loss = 0.18922033
Iteration 28, loss = 0.17502828
Iteration 29, loss = 0.16214096
Iteration 30, loss = 0.15051377
Iteration 31, loss = 0.13999174
Iteration 32, los