# 7. Topic classification

### 7.2. Using Auto-Keras’ pre-trained model for topic classification

In [1]:
from autokeras_pretrained.text_classifier import TopicClassifier
topic_classifier = TopicClassifier()
# text taken from https://techcrunch.com/2019/11/03/spacex-achieves-key-milestone-in-safety-testing-of-crew-dragon-spacecraft/?guccounter=1&guce_referrer_us=aHR0cHM6Ly93d3cuZ29vZ2xlLmNvbS8&guce_referrer_cs=1hiYk4oIiho1oL1M82Ddtg
#class_name = topic_classifier.predict(
#    """
#    SpaceX has managed to run 13 successful parachute tests in a row 
#    of the third major revision of the parachute system it’s 
#    planning to use for its Crew Dragon spacecraft. 
#    The most recent test, which SpaceX shared a shorted edited video 
#    clip of on Twitter, involved using the system with one of the 
#    parachutes intentionally not deploying, to prove that it can land the
#    crew craft safely even in case of a partial failure.
#    """)
#print(class_name)

should_be_business = topic_classifier.predict(
    'The DOW has reached a new low yesterday.')

should_be_world = topic_classifier.predict(
    'The rebuilding of the Notre Dame will start soon.')

should_be_sports = topic_classifier.predict(
    'The 2020 soccer world cup is a hard one for experts to predict.')

print(should_be_business, should_be_world, should_be_sports)

Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.
Downloading file with Google ID 1U7C3xPid1ZvBKpkfW9KikrmNui0yJqnk into /tmp/autokeras/tc.pth... Download completed.
Business Sports Sports
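
The variable names in the cell above record the label each sentence was expected to get, and the output shows that the Notre Dame sentence came back as Sports rather than World. Spot checks like this can be tallied with a few lines of Python. The sketch below is only an illustration: `score_predictions` is a hypothetical helper, not part of Auto-Keras, and the two lists are copied by hand from the variable names and the printed output.

```python
def score_predictions(expected, predicted):
    """Return (accuracy, mismatches) for two equal-length, non-empty label lists."""
    if not expected or len(expected) != len(predicted):
        raise ValueError("expected and predicted must be non-empty and equally long")
    # Collect (position, expected label, predicted label) for every disagreement.
    mismatches = [
        (i, exp, pred)
        for i, (exp, pred) in enumerate(zip(expected, predicted))
        if exp != pred
    ]
    accuracy = 1 - len(mismatches) / len(expected)
    return accuracy, mismatches


# Expected labels come from the variable names; predictions from the output above.
expected = ["Business", "World", "Sports"]
predicted = ["Business", "Sports", "Sports"]

accuracy, mismatches = score_predictions(expected, predicted)
print(f"accuracy: {accuracy:.2f}")
for index, exp, pred in mismatches:
    print(f"sentence {index}: expected {exp}, got {pred}")
```

Three sentences are far too few to judge the model, but the same helper works unchanged on a longer hand-labelled list.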


### 7.3. Building our own dataset for use with the pre-trained model

In [7]:
# pip install newspaper3k
# adapted from docs and examples: https://github.com/codelucas/newspaper

import newspaper
from autokeras_pretrained.text_classifier import TopicClassifier

topic_classifier = TopicClassifier()
cnn = newspaper.build('https://edition.cnn.com/', 
                      memoize_articles=False)
print('Total articles:', len(cnn.articles))
for article in cnn.articles[:10]:
    try:
        article.download()
        article.parse()
        text = article.text[:500]
        print(text)
        print('Predicted topic:', topic_classifier.predict(text))
        print('-' * 20)
    except Exception:
        # Skip articles that fail to download or parse.
        pass

Total articles: 867
How often does Trump misspell words on Twitter? These researchers have an answer
Predicted topic: Sci/Tech
--------------------
By Daniel Gallan, for CNN

New Zealand ended its Rugby World Cup campaign on a conciliatory high by beating Wales 40-17 in the third-place playoff match at the International Stadium in Yokohama.
Predicted topic: Sports
--------------------
New Delhi (CNN) Residents of India's capital are set to suffer record-levels of smog for at least a week, even as the local government puts in place emergency measures to try and tackle New Delhi's heavily polluted air.

Flights were delayed and diverted from New Delhi's international airport Sunday when pilots could not see through the thick smog, which was more than three times the "hazardous" level on the global air quality index (AQI).

On Monday, the AQI level remained above 800 in certain 
Predicted topic: World
--------------------
London (CNN Business) Saudi Arabia is moving forward with an initia

### 7.4. Our own Auto-Keras model for topic classification

In [8]:
import newspaper
from autokeras_pretrained.text_classifier import TopicClassifier

topic_classifier = TopicClassifier()
cnn = newspaper.build('https://edition.cnn.com/', 
                      memoize_articles=False)

articles = []
topics = []
for article in cnn.articles[:500]:
    try:
        article.download()
        article.parse()
        text = article.text[:500]
        articles.append(text)
        topics.append(topic_classifier.predict(text))
    except Exception:
        # Skip articles that fail to download or parse.
        pass
    
print('Downloaded articles:', len(articles))

Downloaded articles: 453
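
The labels collected in `topics` are predictions of the pre-trained classifier, not human annotations, so it can be worth checking how they are distributed before training on them. A minimal sketch using only the standard library follows; `label_distribution` is a hypothetical helper, and `sample_topics` is made-up illustration data, not the distribution of the 453 downloaded articles.

```python
from collections import Counter


def label_distribution(labels):
    """Map each label to (count, share of total), most frequent first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.most_common()}


# Made-up labels for illustration; in the notebook this would be `topics`.
sample_topics = ["World", "Sports", "World", "Business", "Sci/Tech", "World"]
for label, (n, share) in label_distribution(sample_topics).items():
    print(f"{label:10s} {n:3d} {share:.0%}")
```

A heavily skewed distribution would mean that a high accuracy later on could come largely from predicting the majority class.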


In [13]:
from autokeras import TextClassifier
import numpy as np
from autokeras.preprocessor import OneHotEncoder

def to_one_hot(y):
    y_encoder = OneHotEncoder()
    y_encoder.fit(y)
    y = y_encoder.transform(y)
    return y, y_encoder

n_train = 400
x_train, y_train = articles[:n_train], topics[:n_train]
x_test, y_test = articles[n_train:], topics[n_train:]

y_train, encoder = to_one_hot(y_train)
y_test = encoder.transform(y_test)

clf = TextClassifier(verbose=True)
clf.fit(x=x_train, y=y_train, time_limit=20 * 60)
results = clf.evaluate(x_test, y_test)
print(results)

Epoch:   0%|          | 0/4 [00:00<?, ?it/s]
Iteration:   0%|          | 0/13 [00:00<?, ?it/s]

***** Running training *****
Num examples = %d 400
Batch size = %d 32
Num steps = %d 50



Iteration:   8%|▊         | 1/13 [00:08<01:46,  8.86s/it]
Iteration:  15%|█▌        | 2/13 [00:17<01:35,  8.67s/it]
Iteration:  23%|██▎       | 3/13 [00:25<01:25,  8.55s/it]
Iteration:  31%|███       | 4/13 [00:33<01:16,  8.49s/it]
Iteration:  38%|███▊      | 5/13 [00:41<01:07,  8.40s/it]
Iteration:  46%|████▌     | 6/13 [00:50<00:59,  8.44s/it]
Iteration:  54%|█████▍    | 7/13 [00:58<00:50,  8.35s/it]
Iteration:  62%|██████▏   | 8/13 [01:06<00:41,  8.34s/it]
Iteration:  69%|██████▉   | 9/13 [01:15<00:33,  8.34s/it]
Iteration:  77%|███████▋  | 10/13 [01:23<00:24,  8.32s/it]
Iteration:  85%|████████▍ | 11/13 [01:31<00:16,  8.29s/it]
Iteration:  92%|█████████▏| 12/13 [01:39<00:08,  8.26s/it]
Epoch:  25%|██▌       | 1/4 [01:43<05:11, 103.94s/it]
Iteration:   0%|          | 0/13 [00:00<?, ?it/s]
Iteration:   8%|▊         | 1/13 [00:08<01:40,  8.38s/it]
Iteration:  15%|█▌        | 2/13 [00:16<01:32,  8.40s/it]
Iteration:  23%|██▎       |

Training loss = %d 1.4466414526104927
***** Running evaluation *****
  Num examples = %d 53
  Batch size = %d 32
0.9056603773584906
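
The numbers in this log can be cross-checked with a little arithmetic. The sketch below uses only values printed above (453 downloaded articles, `n_train = 400`, batch size 32, 4 epochs, 13 iterations per epoch, 50 steps, 53 evaluation examples). The formula for `train_steps` is an assumption that happens to reproduce the logged value; the source does not say how the library computes it. Because the test labels are themselves predictions of the pre-trained classifier, the final score measures agreement with that model rather than accuracy against human labels.

```python
import math

n_downloaded = 453   # "Downloaded articles: 453"
n_train = 400        # n_train in the cell above
batch_size = 32      # "Batch size" in the log
n_epochs = 4         # "Epoch: ... 0/4" in the progress bar
reported_score = 0.9056603773584906

n_test = n_downloaded - n_train                       # 53, matches "Num examples" at evaluation
iters_per_epoch = math.ceil(n_train / batch_size)     # 13, matches "0/13" in the progress bar
train_steps = int(n_train / batch_size * n_epochs)    # 50 under the assumed formula
n_agreeing = round(reported_score * n_test)           # 48 test articles

print("test examples:", n_test)
print("iterations per epoch:", iters_per_epoch)
print("training steps (assumed formula):", train_steps)
print(f"agreeing predictions: {n_agreeing} of {n_test}")
```

With only 53 test examples, each additional disagreement moves the score by almost two percentage points, so the figure should be read as a rough indication.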
