# Text Classification Tutorial
Based on https://www.opencodez.com/python/text-classification-using-keras.htm

- We classify newsgroup documents into 20 categories.
- Dataset: `20news-bydate.tar.gz` from http://qwone.com/~jason/20Newsgroups/; it contains a training folder and a test folder.

In [1]:
import pandas as pd
import numpy as np
import pickle
from keras.preprocessing.text import Tokenizer
from keras.models import Sequential
from keras.layers import Activation, Dense, Dropout
from sklearn.preprocessing import LabelBinarizer
import sklearn.datasets as skds
from pathlib import Path

Using TensorFlow backend.


## Retrieve Data

In [2]:
# For reproducibility
np.random.seed(1237)
 
# Source file directory
path_train = "./resources/20news-bydate/20news-bydate-train"
 
files_train = skds.load_files(path_train,load_content=False)
 


In [3]:
label_index = files_train.target
label_names = files_train.target_names
labelled_files = files_train.filenames
 
data_tags = ["filename","category","news"]
data_list = []

In [5]:
# Read each file and collect (filename, category, text) tuples
for f, idx in zip(labelled_files, label_index):
    data_list.append((f, label_names[idx], Path(f).read_text()))

# The training data is now available as a DataFrame with the columns
# filename, category and news
data = pd.DataFrame.from_records(data_list, columns=data_tags)

## Prepare Data

- We split the data from the training folder 80/20 into a training portion and a test portion.
- Each element contains the content, the tag (category) and the file name.
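
The split is a plain positional slice: the first 80% of the rows go to training, the rest to testing. A minimal sketch of that idea on an ordinary list, assuming a hypothetical helper `split_by_fraction` that is not part of the notebook:

```python
def split_by_fraction(items, train_fraction=0.8):
    """Split a sequence positionally: first part for training, rest for testing.

    Hypothetical helper for illustration; the notebook slices the DataFrame
    columns directly instead.
    """
    train_size = int(len(items) * train_fraction)
    return items[:train_size], items[train_size:]


docs = [f"doc{i}" for i in range(10)]
train_docs, test_docs = split_by_fraction(docs)
print(len(train_docs), len(test_docs))
```

A positional slice only gives a representative split if the rows are in shuffled order. `load_files` shuffles by default (`shuffle=True`), but it is worth checking this for the scikit-learn version in use.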

### Preprocessing
- Tokenization of each document's content (Keras `Tokenizer`)
- The tokenizer transforms each text into a vector using tf-idf weighting
- Encoding of the tags
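
To make the tf-idf step concrete, here is a minimal pure-Python sketch of such a weighting for a single document. The formulas (`tf = 1 + ln(count)`, `idf = ln(1 + N / (1 + df))`) are an assumption about how `texts_to_matrix(mode='tfidf')` weights words, not something verified in this notebook, and the function name `tfidf_vector` is hypothetical.

```python
import math
from collections import Counter


def tfidf_vector(doc_tokens, vocabulary, doc_freq, num_docs):
    """Return one tf-idf weight per vocabulary word for a single document.

    Assumed weighting (believed to be close to Keras' Tokenizer, unverified):
        tf  = 1 + ln(count of the word in this document)
        idf = ln(1 + num_docs / (1 + number of documents containing the word))
    Words that do not occur in the document get weight 0.
    """
    counts = Counter(doc_tokens)
    vector = []
    for word in vocabulary:
        c = counts.get(word, 0)
        if c == 0:
            vector.append(0.0)
            continue
        tf = 1 + math.log(c)
        idf = math.log(1 + num_docs / (1 + doc_freq.get(word, 0)))
        vector.append(tf * idf)
    return vector


# Toy corpus: three already tokenized "documents"
corpus = [["gpu", "image", "image"], ["car", "engine"], ["image", "car"]]
vocabulary = sorted({w for doc in corpus for w in doc})
# Document frequency: in how many documents each word occurs
doc_freq = Counter(w for doc in corpus for w in set(doc))

vec = tfidf_vector(corpus[0], vocabulary, doc_freq, num_docs=len(corpus))
print(dict(zip(vocabulary, vec)))
```

Words that are frequent inside a document but rare across the corpus receive the largest weights, which is why "gpu" (one document) scores higher per occurrence than "image" (two documents) in this toy example.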

In [6]:
# Let's take 80% of the data for training and the remaining 20% for testing.
train_size = int(len(data) * .8)
 
train_posts = data['news'][:train_size]
train_tags = data['category'][:train_size]
train_files_names = data['filename'][:train_size]
 
test_posts = data['news'][train_size:]
test_tags = data['category'][train_size:]
test_files_names = data['filename'][train_size:]

In [8]:
# 20 news groups
num_labels = 20
vocab_size = 15000 #vocabulary is restricted to 15000 words
batch_size = 100
 
# Define the Tokenizer with the vocabulary size (this determines which words are kept)
tokenizer = Tokenizer(num_words=vocab_size)
tokenizer.fit_on_texts(train_posts)
 
x_train = tokenizer.texts_to_matrix(train_posts, mode='tfidf')
x_test = tokenizer.texts_to_matrix(test_posts, mode='tfidf')
 
encoder = LabelBinarizer()
encoder.fit(train_tags)
y_train = encoder.transform(train_tags)
y_test = encoder.transform(test_tags)
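
For more than two classes, `LabelBinarizer` turns each category name into a one-hot row whose columns follow the sorted class names. A minimal pure-Python sketch of that idea, using a hypothetical helper `one_hot_encode` rather than the scikit-learn implementation:

```python
def one_hot_encode(tags):
    """Map each tag to a one-hot row; columns follow the sorted unique tags.

    Hypothetical helper that mirrors the multi-class behaviour of
    LabelBinarizer for illustration only.
    """
    classes = sorted(set(tags))
    index = {name: i for i, name in enumerate(classes)}
    rows = []
    for tag in tags:
        row = [0] * len(classes)
        row[index[tag]] = 1
        rows.append(row)
    return classes, rows


classes, y = one_hot_encode(["sci.space", "rec.autos", "sci.med", "rec.autos"])
print(classes)
for row in y:
    print(row)
```

In the notebook, the column order of `y_train` and `y_test` can be read from `encoder.classes_` after fitting.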

## Model Data

- first layer: Dense (input size = vocab_size, nodes = 300, activation = sigmoid, dropout = 0.3)
- hidden layer: Dense (nodes = 300, activation = sigmoid, dropout = 0.3)
- output layer: Dense (nodes = num_labels = 20, activation = softmax, no dropout)
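
The softmax activation in the output layer converts the raw scores of the last Dense layer into probabilities that sum to 1, one per category. A minimal sketch using only the standard library; the function below is our own illustration, not the Keras implementation:

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the maximum avoids overflow and does not change the result.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
print(probs)
print(sum(probs))
```

The largest score always receives the largest probability, so the predicted category is the position of the maximum.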

The model is fitted on the training features (`x_train`) and labels (`y_train`); 10% of the training data is held out for validation (`validation_split=0.1`).

In [15]:
model = Sequential()
model.add(Dense(300, input_shape=(vocab_size,)))#512
model.add(Activation('sigmoid'))
model.add(Dropout(0.3))
model.add(Dense(300)) #512
model.add(Activation('sigmoid'))
model.add(Dropout(0.3))
model.add(Dense(num_labels))
model.add(Activation('softmax'))
model.summary()
 
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
 
history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=10,
                    verbose=1,
                    validation_split=0.1)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_7 (Dense)              (None, 300)               4500300   
_________________________________________________________________
activation_7 (Activation)    (None, 300)               0         
_________________________________________________________________
dropout_5 (Dropout)          (None, 300)               0         
_________________________________________________________________
dense_8 (Dense)              (None, 300)               90300     
_________________________________________________________________
activation_8 (Activation)    (None, 300)               0         
_________________________________________________________________
dropout_6 (Dropout)          (None, 300)               0         
_________________________________________________________________
dense_9 (Dense)              (None, 20)                6020      
__________
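
The `Param #` column can be checked by hand: a Dense layer has one weight per input-output pair plus one bias per output unit. A small sketch, where the helper name `dense_params` is hypothetical:

```python
def dense_params(n_inputs, n_units):
    """Trainable parameters of a fully connected layer: weights plus biases."""
    return n_inputs * n_units + n_units


vocab_size, hidden, num_labels = 15000, 300, 20
layer_params = [
    dense_params(vocab_size, hidden),   # first Dense layer
    dense_params(hidden, hidden),       # second Dense layer
    dense_params(hidden, num_labels),   # output layer
]
print(layer_params, sum(layer_params))
```

Activation and Dropout layers add no trainable parameters, which matches the zeros in the summary.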

## Predicting some unseen documents

- files are taken from the test folder
- steps: content tokenization, prediction, comparison with actual tag
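
The comparison step boils down to taking the index of the largest predicted probability and looking it up in the label list, which must be in the same order that was used when encoding the training tags. A minimal sketch with invented probabilities; the numbers and the helper `decode_prediction` are illustrative and not model output:

```python
def decode_prediction(probabilities, labels):
    """Return the label whose position holds the highest probability."""
    best_index = max(range(len(probabilities)), key=probabilities.__getitem__)
    return labels[best_index]


labels = ["comp.graphics", "misc.forsale", "soc.religion.christian"]
fake_probs = [0.15, 0.70, 0.15]  # invented values for illustration only
predicted = decode_prediction(fake_probs, labels)
actual = "misc.forsale"
print(predicted, predicted == actual)
```

If the label list were in a different order than the encoder's classes, the same probabilities would be mapped to the wrong category names.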

In [16]:
# These are the labels we stored from our training
# The order is very important here.
 
labels = np.array(['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x',
 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball',
 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space',
 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast',
 'talk.politics.misc', 'talk.religion.misc'])
 
test_files = ["./resources/20news-bydate/20news-bydate-test/comp.graphics/38761",
              "./resources/20news-bydate/20news-bydate-test/misc.forsale/76117",
              "./resources/20news-bydate/20news-bydate-test/soc.religion.christian/21411"
              ]
x_data = []
for t_f in test_files:
    t_f_data = Path(t_f).read_text()
    x_data.append(t_f_data)
 
x_data_series = pd.Series(x_data)
x_tokenized = tokenizer.texts_to_matrix(x_data_series, mode='tfidf')
 
for i, x_t in enumerate(x_tokenized):
    print(x_data[i])
    prediction = model.predict(np.array([x_t]))
    predicted_label = labels[np.argmax(prediction[0])]
    print("File ->", test_files[i], "Predicted label: " + predicted_label)
    print("********************************")

From: kiwi@iis.ethz.ch (Rene Mueller)
Subject: ICN (MSDOS) -> PBM/PGM/PPM format?
Organization: Swiss Federal Institute of Technology (ETH), Zurich, CH
Distribution: comp
Lines: 7

I have many icons in IconEdit and PBIcon format and I would like to 
convert them to PBM, PGM or PPM format. Do you know the formats of
IconEdit or PBIcon?

Thank's for your help.
   ,
Rene (kiwi@iis.ethz.ch)

File -> ./resources/20news-bydate/20news-bydate-test/comp.graphics/38761 Predicted label: comp.windows.x
********************************
From: jack@acs2.bu.edu
Subject: For Sale: Misc. Computer Parts & a radar detector
Distribution: na
Organization: Boston University, Boston, MA, USA
Lines: 183
Originator: jack@acs2.bu.edu


 I have the following computer items for sale:

  Item                                Condition                 Price

 (a) Color EGA card and monitor            Working                 $180.00
     Monitor made by Zenith
     

 (b) (3)  1Mx8 80ns SIMMS by MT            Working  