In [2]:
# https://www.kaggle.com/crawford/20-newsgroups

In [3]:
!pip install ktrain==0.26.4

Collecting ktrain==0.26.4
  Downloading ktrain-0.26.4.tar.gz (25.3 MB)
Collecting scikit-learn==0.23.2
  Downloading scikit_learn-0.23.2-cp37-cp37m-manylinux1_x86_64.whl (6.8 MB)
Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)
Collecting cchardet
  Downloading cchardet-2.1.7-cp37-cp37m-manylinux2010_x86_64.whl (263 kB)
Collecting syntok
  Downloading syntok-1.4.2-py3-none-any.whl (19 kB)
Collecting seqeval==0.0.19
  Downloading seqeval-0.0.19.tar.gz (30 kB)
Collecting transformers<=4.3.3,>=4.0.0
  Downloading transformers-4.3.3-py3-none-any.whl (1.9 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.ma

In [4]:
# !pip install --upgrade scikit-learn==0.24.1

In [5]:
import numpy as np
import pandas as pd
import ktrain

In [6]:
from sklearn.datasets import fetch_20newsgroups

In [7]:
fetch_20newsgroups

<function sklearn.datasets._twenty_newsgroups.fetch_20newsgroups>

In [8]:
remove = ('headers', 'footers', 'quotes')
data = fetch_20newsgroups(subset='train', remove=remove)

In [9]:
texts = data.data

In [10]:
len(texts)

11314

In [11]:
targets = data.target

In [12]:
data.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [13]:
categories = [data.target_names[target] for target in targets]
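A quick sanity check on the label mapping above is to count documents per category. The sketch below uses a tiny hypothetical label array and a three-name subset of the newsgroup names so that it runs without downloading the dataset; `toy_names` and `toy_targets` are illustrative stand-ins, not part of the notebook.

```python
from collections import Counter

# Hypothetical stand-ins for data.target_names and data.target.
toy_names = ['alt.atheism', 'comp.graphics', 'sci.space']
toy_targets = [2, 0, 2, 1, 2, 0]

# Same list comprehension as in the cell above: integer label -> category name.
toy_categories = [toy_names[t] for t in toy_targets]

# Count how many documents fall into each category.
category_counts = Counter(toy_categories)
print(category_counts.most_common())
```

On the real data, the same `Counter(categories)` call would show whether the 20 newsgroups are roughly balanced.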

In [14]:
%%time
tm = ktrain.text.get_topic_model(texts, n_features=10000)  # fits the topic model; preprocessing (e.g. stopword removal) happens internally

n_topics automatically set to 75
lang: en
preprocessing texts...
fitting model...
iteration: 1 of max_iter: 5
iteration: 2 of max_iter: 5
iteration: 3 of max_iter: 5
iteration: 4 of max_iter: 5
iteration: 5 of max_iter: 5
done.
CPU times: user 1min 26s, sys: 1min 3s, total: 2min 29s
Wall time: 1min 19s
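The log line `n_topics automatically set to 75` is consistent with a common square-root rule of thumb, floor(sqrt(n_docs / 2)), applied to the 11314 training documents. That ktrain uses exactly this rule is an assumption here, not something the output confirms, and `estimate_n_topics` is a hypothetical helper name.

```python
import math

def estimate_n_topics(n_docs: int) -> int:
    """Rule-of-thumb topic count: floor(sqrt(n_docs / 2)), at least 1."""
    return max(1, math.floor(math.sqrt(n_docs / 2)))

# 11314 is the number of training documents reported by len(texts).
print(estimate_n_topics(11314))
```

If the heuristic gives too many or too few topics for a corpus, the topic count can usually be set explicitly instead of relying on the default.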


In [15]:
tm.print_topics()

topic 0 | power air station options option tank plastic module control assembly
topic 1 | ruler irregular contour ms-windows borders summary post product appointed data
topic 2 | port modem serial ports parallel poster board connect irq interrupt
topic 3 | woods fade mindset threat gun people carry usually afraid hip
topic 4 | window server application running memory mouse set manager display screen
topic 5 | windows package software dos ibm library comp code unix microsoft
topic 6 | evidence science claim truth true moral argument universe believe objective
topic 7 | people president said armenian armenians war turkish children government states
topic 8 | pain doctor patients diet muscle acid med surgery sam diagnosed
topic 9 | gif scale white translate arbitrary black rotate example using object
topic 10 | encryption clause strong people away presently weak goal think furthermore
topic 11 | key chip encryption keys clipper des algorithm security bit nsa
topic 12 | groups pure baby cb

In [16]:
%%time
tm.build(texts, threshold=0.25)

done.
CPU times: user 7.72 s, sys: 4.34 s, total: 12.1 s
Wall time: 7.19 s


In [17]:
texts = tm.filter(texts)  # keep only the documents retained by build()

In [18]:
categories = tm.filter(categories)
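Conceptually, `build` with `threshold=0.25` drops documents whose strongest topic probability falls below 0.25, and `filter` then applies the same keep/drop decision to any list aligned with the original documents. The sketch below illustrates that idea with NumPy on a hypothetical document-topic matrix; it is not ktrain's implementation, and `keep_mask` and `filter_aligned` are made-up names.

```python
import numpy as np

# Hypothetical document-topic probabilities (rows: documents, columns: topics).
doc_topics = np.array([
    [0.70, 0.10, 0.10, 0.05, 0.05],
    [0.10, 0.10, 0.60, 0.10, 0.10],
    [0.20, 0.20, 0.20, 0.20, 0.20],  # no topic reaches the threshold
    [0.05, 0.80, 0.05, 0.05, 0.05],
])
threshold = 0.25

# Keep a document only if its strongest topic is at least `threshold`.
keep_mask = doc_topics.max(axis=1) >= threshold

def filter_aligned(items, mask):
    """Apply the keep/drop mask to any list aligned with the documents."""
    return [item for item, keep in zip(items, mask) if keep]

labels = ['sci.space', 'rec.autos', 'misc.forsale', 'sci.med']
kept_labels = filter_aligned(labels, keep_mask)
print(kept_labels)
```

Filtering the labels with the same mask is what keeps `categories` aligned with the documents that survive the threshold.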

In [19]:
tm.print_topics(show_counts=True)

topic:35 | count:3519 | just don like think know good time going people did
topic:20 | count:2581 | does use don like know just time problem need think
topic:50 | count:910 | drive card disk hard mac bit video scsi use drives
topic:72 | count:754 | god jesus people bible believe christian church say faith life
topic:7 | count:327 | people president said armenian armenians war turkish children government states
topic:29 | count:246 | information edu available computer mail internet ftp university software anonymous
topic:37 | count:235 | team game games hockey season league year players win nhl
topic:40 | count:233 | edu com list sale send e-mail new email condition interested
topic:31 | count:204 | space nasa center new program earth launch research april national
topic:61 | count:202 | file version program files ftp available directory jpeg use format
topic:54 | count:173 | government law public rights privacy guns gun federal state private
topic:11 | count:101 | key chip encryption k

In [20]:
tm.visualize_documents(doc_topics=tm.get_doctopics())

Output hidden; open in https://colab.research.google.com to view.

In [21]:
doc_topics = tm.get_doctopics(topic_ids=[31, 55])
tm.visualize_documents(doc_topics=doc_topics)

Output hidden; open in https://colab.research.google.com to view.

In [22]:
news = '''
The space agency has targeted the most remote point of the ocean to crash land the space station when its time finally comes. After that, NASA plans to buy time for its astronauts on commercial space stations.
'''

In [23]:
tm.predict([news])

array([[0.00256412, 0.0025641 , 0.00256411, 0.0025641 , 0.00256412,
        0.00256411, 0.0025641 , 0.00256411, 0.0025641 , 0.0025641 ,
        0.0025641 , 0.00256411, 0.0025641 , 0.0025641 , 0.0025641 ,
        0.0025641 , 0.0025641 , 0.0025641 , 0.0025641 , 0.0025641 ,
        0.00256411, 0.0025641 , 0.0025641 , 0.0025641 , 0.0025641 ,
        0.0025641 , 0.0025641 , 0.0025641 , 0.0025641 , 0.0025641 ,
        0.00256411, 0.522838  , 0.0025641 , 0.0025641 , 0.0025641 ,
        0.22324292, 0.0025641 , 0.0025641 , 0.0025641 , 0.0025641 ,
        0.0025641 , 0.0025641 , 0.0025641 , 0.0025641 , 0.0025641 ,
        0.0025641 , 0.0025641 , 0.00256412, 0.0025641 , 0.0025641 ,
        0.00256411, 0.0025641 , 0.0025641 , 0.0025641 , 0.0025641 ,
        0.00256422, 0.0025641 , 0.0025641 , 0.0025641 , 0.0025641 ,
        0.0025641 , 0.0025641 , 0.06930348, 0.0025641 , 0.0025641 ,
        0.0025641 , 0.0025641 , 0.0025641 , 0.0025641 , 0.0025641 ,
        0.0025641 , 0.0025641 , 0.00256411, 0.00

In [24]:
tm.topics[np.argmax(tm.predict([news]))]

'space nasa center new program earth launch research april national'
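`np.argmax` returns only the single strongest topic, but the prediction above also gives noticeable weight to topics 35 and 62. A minimal sketch for listing the top-k topics follows. The probability vector is reconstructed from the three visible peaks in the printed output, with every other entry set to the 0.0025641 floor; that fill is an assumption because the printed array is truncated, and `top_k_topics` is a hypothetical helper.

```python
import numpy as np

# Reconstructed 75-topic vector: floor value everywhere, plus the three peaks
# visible in the printed output (the remaining entries are assumed).
probs = np.full(75, 0.0025641)
probs[31] = 0.522838
probs[35] = 0.22324292
probs[62] = 0.06930348

def top_k_topics(p, k=3):
    """Return (topic_id, probability) pairs for the k largest entries."""
    order = np.argsort(p)[::-1][:k]
    return [(int(i), float(p[i])) for i in order]

for topic_id, proba in top_k_topics(probs):
    print(topic_id, round(proba, 4))
```

Each returned topic id can be looked up in `tm.topics` in the same way as the `argmax` result above.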

In [25]:
tm.train_recommender()
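The recommender can be thought of as a nearest-neighbor search in topic space: a new text is converted to a topic distribution, and the documents with the closest distributions are returned. The sketch below illustrates this with plain NumPy and Euclidean distance on hypothetical vectors. ktrain's own recommender may use a different metric or index, and `nearest_documents` is a made-up helper.

```python
import numpy as np

# Hypothetical topic distributions for four indexed documents (3 topics).
doc_vectors = np.array([
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.85, 0.10, 0.05],
    [0.05, 0.05, 0.90],
])

def nearest_documents(query, vectors, n=2):
    """Return indices of the n documents closest to `query` (Euclidean)."""
    distances = np.linalg.norm(vectors - query, axis=1)
    return np.argsort(distances)[:n].tolist()

# Hypothetical topic distribution of a new, unseen text.
query_vector = np.array([0.88, 0.07, 0.05])
print(nearest_documents(query_vector, doc_vectors))
```

Because similarity is measured on topic distributions rather than on words, short documents that happen to share a dominant topic can be returned even when their wording has little in common with the query.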

In [26]:
raw_text = '''
In the International Space Station Transition Report, NASA said the plan was for the ISS to fall to Earth in an area known as the South Pacific Oceanic Uninhabited Area -- also known as Point Nemo. The report said that its budget estimate assumed that the deorbit would happen in January 2031.
Named after the submarine sailor in Jules Verne's novel "Twenty Thousand Leagues Under the Sea," Point Nemo is the point in the ocean that is farthest from land and has been a watery grave for many other spacecraft.
The area is approximately 3,000 miles off of New Zealand's eastern coast and 2,000 miles north of Antarctica and it's estimated that space-faring nations such as the US, Russia, Japan and European countries have sunk more than 263 pieces of space debris there since 1971.
The report said the ISS would perform thrusting maneuvers that would ensure safe atmospheric entry.

'''

In [27]:
tm.recommend(text = raw_text, n=5)

[{'doc_id': 141,
  'text': '\n\nSo how much would it cost as a private venture, assuming you could talk the\nU.S. government into leasing you a couple of pads in Florida? \n\n',
  'topic_id': 31,
  'topic_proba': 0.29107155614449265},
 {'doc_id': 468,
  'text': '\nMexico City, Bogota, La Paz?\n',
  'topic_id': 31,
  'topic_proba': 0.2952380266318685},
 {'doc_id': 499,
  'topic_id': 31,
  'topic_proba': 0.37009274237962464},
 {'doc_id': 544,
  'text': '\n)Do you know what frequencies chanels 17 to 19 use and what is usually \n)allocated to those frequencies for broadcast outside of cable?\n\n17 is air comm.\n18 is amateur\n19 is business and public service\n',
  'topic_id': 31,
  'topic_proba': 0.37208652579353135},
 {'doc_id': 578,
  'text': 'Can anyone provide information on CS chemical agent--the tear gas used recently\nin WACO.  Just what is it chemically, and what are its effects on the body?\n\ndsc@gemini.gsfc.nasa.gov  \n |  Regards,         |   Hughes STX                |    Cod

In [30]:
# Print the leading words of each recommended document.
for i, doc in enumerate(tm.recommend(text=raw_text, n=5)):
    print(f'Result {i+1}')
    print('Text')
    print(' '.join(doc['text'].split(' ')[:501]))
    print('\n')

Result 1
Text


So how much would it cost as a private venture, assuming you could talk the
U.S. government into leasing you a couple of pads in Florida? 




Result 2
Text

Mexico City, Bogota, La Paz?



Result 3
Text
Toronto Siggraph 

What: ``Chance's Art'': 2D Graphics and Animation on the Indigo.

By:    Ken Evans, Imagicians Artware, Inc. 

When:  Tuesday 20 April 1993 7:00pm-9:00pm 

Where: The McLuhan Centre for Culture and Technology
       University of Toronto
       39A Queen's Park Crescent
       Toronto

Who:   Members and non-members alike 
       (non-members encouraged to become members...)

Abstract:

Imagicians Artware, Inc. is entering into early beta site testing on Silicon 
Graphics workstations of a new 2D abstract artwork and animation package called 
Chance's Art.  The package will be described and demonstrated, and some of the 
technical issues will be discussed.  Marketing plans will be outlined.  The 
talk will also present some of the technical and busine

GitHub


In [33]:
!git init

Initialized empty Git repository in /content/.git/


In [34]:
!git remote add origin https://github.com/AmazingGrace-D/NLP-Repo.git

In [36]:
!git fetch

In [38]:
!git status

On branch master

No commits yet

Untracked files:
  (use "git add <file>..." to include in what will be committed)

	.config/
	sample_data/

nothing added to commit but untracked files present (use "git add" to track)


In [42]:
!ls /content/sample_data

anscombe.json		      mnist_test.csv
california_housing_test.csv   mnist_train_small.csv
california_housing_train.csv  README.md


In [37]:
# Renaming the branch fails while master has no commits yet.
# !git branch -M main

error: refname refs/heads/master not found
fatal: Branch rename failed


In [None]:
# Requires a local 'main' branch with at least one commit.
!git push -u origin main