In [0]:
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

In [0]:
twenty_train = fetch_20newsgroups(subset='train', shuffle=True, random_state=42)

In [3]:
twenty_train.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [4]:
twenty_train.data[0].split('\n')

["From: lerxst@wam.umd.edu (where's my thing)",
 'Subject: WHAT car is this!?',
 'Nntp-Posting-Host: rac3.wam.umd.edu',
 'Organization: University of Maryland, College Park',
 'Lines: 15',
 '',
 ' I was wondering if anyone out there could enlighten me on this car I saw',
 'the other day. It was a 2-door sports car, looked to be from the late 60s/',
 'early 70s. It was called a Bricklin. The doors were really small. In addition,',
 'the front bumper was separate from the rest of the body. This is ',
 'all I know. If anyone can tellme a model name, engine specs, years',
 'of production, where this car is made, history, or whatever info you',
 'have on this funky looking car, please e-mail.',
 '',
 'Thanks,',
 '- IL',
 '   ---- brought to you by your neighborhood Lerxst ----',
 '',
 '',
 '',
 '',
 '']

In [0]:
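# Remove blank lines from every training document.
# Pass 1 turns each pair of newlines into a single newline.
# Pass 2 deletes any pair that is still left (from runs of three or more
# newlines); note that this joins the text on either side with no separator.
for i in range(len(twenty_train.data)):
  twenty_train.data[i] = twenty_train.data[i].replace("\n\n", "\n").replace("\n\n", "")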
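
The chained `replace` calls above can glue two words together when a document contains three or more consecutive newlines. A minimal sketch of an alternative, assuming the goal is only to drop blank lines: `collapse_blank_lines` is a hypothetical helper that is not part of this notebook, and switching to it would change the training text, so the scores shown further down may not reproduce exactly.

```python
import re


def collapse_blank_lines(text):
    """Replace every run of two or more newlines with a single newline."""
    return re.sub(r"\n{2,}", "\n", text)


sample = "Lines: 15\n\n\nI was wondering"

# Chained replace, as in the cell above: the two lines end up fused.
chained = sample.replace("\n\n", "\n").replace("\n\n", "")

# Regex version: the line break between the two parts is kept.
collapsed = collapse_blank_lines(sample)

print(repr(chained))
print(repr(collapsed))
```

To apply it to the training set, one option would be `twenty_train.data = [collapse_blank_lines(d) for d in twenty_train.data]`.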

In [6]:
twenty_train.data[1].split('\n')

['From: guykuo@carson.u.washington.edu (Guy Kuo)',
 'Subject: SI Clock Poll - Final Call',
 'Summary: Final call for SI clock reports',
 'Keywords: SI,acceleration,clock,upgrade',
 'Article-I.D.: shelley.1qvfo9INNc3s',
 'Organization: University of Washington',
 'Lines: 11',
 'NNTP-Posting-Host: carson.u.washington.edu',
 'A fair number of brave souls who upgraded their SI clock oscillator have',
 'shared their experiences for this poll. Please send a brief message detailing',
 'your experiences with the procedure. Top speed attained, CPU rated speed,',
 'add on cards and adapters, heat sinks, hour of usage per day, floppy disk',
 'functionality with 800 and 1.4 m floppies are especially requested.',
 'I will be summarizing in the next two days, so please add to the network',
 "knowledge base if you have done the clock upgrade and haven't answered this",
 'poll. Thanks.',
 'Guy Kuo <guykuo@u.washington.edu>',
 '']

In [0]:
from sklearn.pipeline import Pipeline

In [8]:
from sklearn.linear_model import SGDClassifier

# Bag-of-words counts -> TF-IDF weights -> linear classifier trained with SGD
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     # alpha (e.g. alpha=1e-3) is left at its default here
                     ('clf', SGDClassifier(loss='modified_huber', penalty='l2', random_state=42)),
                    ])
twenty_test = fetch_20newsgroups(subset='test', shuffle=True, random_state=42)
text_clf.fit(twenty_train.data, twenty_train.target)
docs_test = twenty_test.data
predicted = text_clf.predict(docs_test)
# Accuracy: fraction of test documents whose predicted label matches the true label
np.mean(predicted == twenty_test.target)

0.8507700477960701
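
To make the `vect` and `tfidf` steps of the pipeline less of a black box, here is a rough pure-Python sketch of what they compute under their default settings: lowercased tokens of two or more word characters, raw term counts, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each document vector. The names `tokenize` and `tfidf` are hypothetical, and scikit-learn itself returns sparse matrices rather than dictionaries, so treat this as an illustration, not as the library's implementation.

```python
import math
import re
from collections import Counter

# Same idea as CountVectorizer's default token pattern: 2+ word characters.
TOKEN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and extract word tokens."""
    return TOKEN.findall(text.lower())


def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    counts = [Counter(tokenize(d)) for d in docs]
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for c in counts for term in c)
    # Smoothed idf, so that terms present in every document keep weight 1.
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
    rows = []
    for c in counts:
        row = {t: tf * idf[t] for t, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in row.values()))
        rows.append({t: w / norm for t, w in row.items()} if norm else row)
    return rows


toy_docs = ["the car is fast", "the gpu is fast", "the car"]
weights = tfidf(toy_docs)
# In the second document, the rare term "gpu" outweighs the common term "the".
print(weights[1])
```

A document with no tokens, such as the empty string, yields an all-zero vector in this sketch, so a downstream linear classifier would have nothing but its bias terms to decide on.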

In [9]:
print(metrics.classification_report(twenty_test.target, predicted, target_names=twenty_test.target_names))

                          precision    recall  f1-score   support

             alt.atheism       0.82      0.80      0.81       319
           comp.graphics       0.76      0.79      0.78       389
 comp.os.ms-windows.misc       0.77      0.72      0.74       394
comp.sys.ibm.pc.hardware       0.72      0.75      0.73       392
   comp.sys.mac.hardware       0.83      0.85      0.84       385
          comp.windows.x       0.87      0.76      0.81       395
            misc.forsale       0.83      0.90      0.86       390
               rec.autos       0.92      0.90      0.91       396
         rec.motorcycles       0.95      0.95      0.95       398
      rec.sport.baseball       0.91      0.95      0.93       397
        rec.sport.hockey       0.96      0.98      0.97       399
               sci.crypt       0.94      0.93      0.94       396
         sci.electronics       0.79      0.80      0.79       393
                 sci.med       0.91      0.86      0.88       396
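
For readers unfamiliar with the columns of the report, the following sketch shows how precision, recall, F1 and support can be derived for a single class from true and predicted labels. `per_class_scores` is a hypothetical helper written for illustration; `metrics.classification_report` does this for every class at once.

```python
def per_class_scores(y_true, y_pred, label):
    """Return (precision, recall, f1, support) for one class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # true positives
    fp = sum(1 for t, p in pairs if t != label and p == label)  # false positives
    fn = sum(1 for t, p in pairs if t == label and p != label)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    support = tp + fn  # number of true examples of this class
    return precision, recall, f1, support


# Toy labels, not taken from the 20 newsgroups data.
toy_true = [0, 0, 1, 1, 1]
toy_pred = [0, 1, 1, 1, 0]
print(per_class_scores(toy_true, toy_pred, label=1))
```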
         

In [0]:
docs_new = ['Jesus teaches that god is love', '', 'OpenGL on the GPU is fast', 'In order to cure almost every disease']

In [0]:
predicted = text_clf.predict(docs_new)

In [12]:
for doc, category in zip(docs_new, predicted):
  print('%r => %s' % (doc, twenty_train.target_names[category]))

'Jesus teaches that god is love' => talk.religion.misc
'' => misc.forsale
'OpenGL on the GPU is fast' => rec.autos
'In order to cure almost every disease' => sci.med
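
The empty string above still receives a label (`misc.forsale`), because the classifier always returns some class even when the input contains no tokens. One possible guard, sketched here under the assumption that blank inputs should simply be skipped: `predict_skipping_blank` is a hypothetical helper, and `fake_predict` is a stand-in so the example runs without the trained model.

```python
def predict_skipping_blank(predict, docs, fallback=None):
    """Call predict only on non-blank documents; blank ones get fallback."""
    keep = [i for i, d in enumerate(docs) if d.strip()]
    results = [fallback] * len(docs)
    if keep:
        preds = predict([docs[i] for i in keep])
        for i, p in zip(keep, preds):
            results[i] = p
    return results


def fake_predict(batch):
    """Stand-in for a real model: 'predicts' the length of each document."""
    return [len(d) for d in batch]


guarded = predict_skipping_blank(fake_predict, ["abc", "", "   ", "hello"])
print(guarded)
```

With the pipeline from this notebook, the call would look like `predict_skipping_blank(text_clf.predict, docs_new)`, and entries equal to `None` would mark the skipped documents.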


In [13]:
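# Read the file and split it into paragraphs on blank lines;
# the with-block closes the file handle once reading is done.
with open("test.txt", "r") as f:
  test_txt = f.read().split('\n\n')
test_txt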

["Ryzen is a brand[3] of x86-64 microprocessors designed and marketed by Advanced Micro Devices, Inc. (AMD) for desktop, mobile and embedded platforms based on the Zen microarchitecture and its successors. It consists of central processing units marketed for mainstream, enthusiast and workstation segments and accelerated processing units (APUs) marketed for mainstream and entry-level segments and embedded applications. Ryzen is especially significant for AMD, since it is a completely new design, and since it marks the corporation's return to the high end desktop CPU market. AMD's competitor Intel had controlled this section of the market for almost ten years, since Intel released the Conroe/Core microarchitecture, which displaced AMD's Athlon 64 as the cutting edge of performance in the market.[clarification needed]",
 "AMD officially announced the first 14nm Ryzen products during its New Horizon summit on December 13, 2016 and introduced them the following February with the first proc

In [0]:
predicted = text_clf.predict(test_txt)

In [15]:
for doc, category in zip(test_txt, predicted):
  print('%r => %s' % (doc, twenty_train.target_names[category]))

"Ryzen is a brand[3] of x86-64 microprocessors designed and marketed by Advanced Micro Devices, Inc. (AMD) for desktop, mobile and embedded platforms based on the Zen microarchitecture and its successors. It consists of central processing units marketed for mainstream, enthusiast and workstation segments and accelerated processing units (APUs) marketed for mainstream and entry-level segments and embedded applications. Ryzen is especially significant for AMD, since it is a completely new design, and since it marks the corporation's return to the high end desktop CPU market. AMD's competitor Intel had controlled this section of the market for almost ten years, since Intel released the Conroe/Core microarchitecture, which displaced AMD's Athlon 64 as the cutting edge of performance in the market.[clarification needed]" => comp.sys.ibm.pc.hardware
"AMD officially announced the first 14nm Ryzen products during its New Horizon summit on December 13, 2016 and introduced them the following Feb