In [1]:
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

newsgroups = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))

Dataset description:

http://scikit-learn.org/stable/datasets/index.html#the-20-newsgroups-text-dataset

Filtering the text:

http://scikit-learn.org/stable/datasets/index.html#filtering-text-for-more-realistic-training

In [2]:
print("newsgroups.data")
print("type:", type(newsgroups.data), "; length:", len(newsgroups.data), "; dtype:", type(newsgroups.data[0]))
print("newsgroups.target")
print("type:", type(newsgroups.target), "; shape:", newsgroups.target.shape, "; dtype:", newsgroups.target.dtype)

newsgroups.data
type: <class 'list'> ; length: 11314 ; dtype: <class 'str'>
newsgroups.target
type: <class 'numpy.ndarray'> ; shape: (11314,) ; dtype: int64


In [3]:
newsgroups.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [4]:
for i in range(10):
    print("Class:", newsgroups.target[i])
    # target[i] is an index into target_names, so look the name up through it
    print("Class label:", newsgroups.target_names[newsgroups.target[i]])
    print(newsgroups.data[i])
    print("\n############################################\n")

Class: 7
Class label: rec.autos
I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

############################################

Class: 4
Class label: comp.sys.mac.hardware
A fair number of brave souls who upgraded their SI clock oscillator have
shared their experiences for this poll. Please send a brief message detailing
your experiences with the procedure. Top speed attained, CPU rated speed,
add on cards and adapters, heat sinks, hour of usage per day, floppy disk
functionality with 800 and 1.4 m floppies are especially requested.

I will be summarizing in the next two d

Converting text into a feature table:

http://scikit-learn.org/stable/datasets/index.html#converting-text-to-vectors

We will use `CountVectorizer`. It probably performs worse than `TfidfVectorizer`, but it is easier to understand:

http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
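To make the difference between the two vectorizers concrete, here is a minimal pure-Python sketch of how raw counts could be reweighted into tf-idf values. It assumes the smoothed idf formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation of each row, which is my understanding of the scikit-learn defaults; the tiny `counts` table is made up for illustration.

```python
import math

# Made-up count table: 3 documents (rows) x 3 terms (columns).
counts = [
    [2, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
]

n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: in how many documents each term occurs at least once.
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed inverse document frequency (assumed default formula).
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

tfidf = []
for row in counts:
    weighted = [c * w for c, w in zip(row, idf)]
    # L2-normalise each document vector; guard against an all-zero row.
    norm = math.sqrt(sum(v * v for v in weighted)) or 1.0
    tfidf.append([v / norm for v in weighted])

for row in tfidf:
    print([round(v, 3) for v in row])
```

A term that occurs in every document (the first column) keeps weight 1, while rarer terms are boosted. The plain counts used in the rest of this notebook skip this step entirely.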

In [5]:
train_text = [
    "Druga zwrotka bo zawsze chciałem zacząć od środka",
    "Wśród kamienic plotka, że to horror w opłotkach",
    "Jeden kolo ma ziarno i je pali aż parno",
    "Wszędzie dym aż czarno można ciąć Husqvarną",
    "A jak ziarno zasadzisz to Ci zniknie jak Vanish",
    "Kliknie jak klawisz aż się wzdrygniesz jak panicz Yo",
    "Bo to nie ziarnko pod farmerskie wdzianko",
    "To Cię zarazi i nic nie poradzisz"]

cv = CountVectorizer()
cv.fit(train_text)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [6]:
# print the learned vocabulary
# by default, single-character words are ignored
cv.vocabulary_

{'aż': 0,
 'bo': 1,
 'chciałem': 2,
 'ci': 3,
 'ciąć': 4,
 'cię': 5,
 'czarno': 6,
 'druga': 7,
 'dym': 8,
 'farmerskie': 9,
 'horror': 10,
 'husqvarną': 11,
 'jak': 12,
 'je': 13,
 'jeden': 14,
 'kamienic': 15,
 'klawisz': 16,
 'kliknie': 17,
 'kolo': 18,
 'ma': 19,
 'można': 20,
 'nic': 21,
 'nie': 22,
 'od': 23,
 'opłotkach': 24,
 'pali': 25,
 'panicz': 26,
 'parno': 27,
 'plotka': 28,
 'pod': 29,
 'poradzisz': 30,
 'się': 31,
 'to': 32,
 'vanish': 33,
 'wdzianko': 34,
 'wszędzie': 35,
 'wzdrygniesz': 36,
 'wśród': 37,
 'yo': 38,
 'zacząć': 39,
 'zarazi': 40,
 'zasadzisz': 41,
 'zawsze': 42,
 'ziarnko': 43,
 'ziarno': 44,
 'zniknie': 45,
 'zwrotka': 46,
 'środka': 47,
 'że': 48}
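The vocabulary above can be reproduced approximately by hand. The sketch below uses only the standard library: it lowercases the text, applies the `token_pattern` shown in the `CountVectorizer` repr above, and numbers the sorted unique tokens. It is a simplified illustration, not the library's actual implementation, and the helper name `tokenize` is my own.

```python
import re

# Default token pattern taken from the CountVectorizer repr:
# two or more word characters between word boundaries.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and keep tokens of at least two characters."""
    return TOKEN_PATTERN.findall(text.lower())


sentences = [
    "Jeden kolo ma ziarno i je pali aż parno",
    "A jak ziarno zasadzisz to Ci zniknie jak Vanish",
]

tokens_per_sentence = [tokenize(s) for s in sentences]

# Sorted unique tokens get consecutive column indices.
unique_tokens = sorted({tok for toks in tokens_per_sentence for tok in toks})
vocabulary = {tok: idx for idx, tok in enumerate(unique_tokens)}

print(tokens_per_sentence[0])
print(vocabulary)
```

The one-letter words "i" and "A" never reach the vocabulary because the pattern requires at least two word characters.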

In [7]:
# this form will be more convenient
features = list(zip(*sorted(cv.vocabulary_.items(), key=lambda tup: tup[1])))[0]
print(features)

('aż', 'bo', 'chciałem', 'ci', 'ciąć', 'cię', 'czarno', 'druga', 'dym', 'farmerskie', 'horror', 'husqvarną', 'jak', 'je', 'jeden', 'kamienic', 'klawisz', 'kliknie', 'kolo', 'ma', 'można', 'nic', 'nie', 'od', 'opłotkach', 'pali', 'panicz', 'parno', 'plotka', 'pod', 'poradzisz', 'się', 'to', 'vanish', 'wdzianko', 'wszędzie', 'wzdrygniesz', 'wśród', 'yo', 'zacząć', 'zarazi', 'zasadzisz', 'zawsze', 'ziarnko', 'ziarno', 'zniknie', 'zwrotka', 'środka', 'że')


In [8]:
X = cv.transform(train_text)
X

<8x49 sparse matrix of type '<class 'numpy.int64'>'
	with 58 stored elements in Compressed Sparse Row format>

It is worth getting familiar with _sparse_ matrix formats: CSR, CSC, and COO.

https://docs.scipy.org/doc/scipy/reference/sparse.html

https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html

https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csc_matrix.html

https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.coo_matrix.html
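As a rough illustration of what CSR stores, the sketch below builds the three CSR arrays by hand in plain Python. The names `data`, `indices`, and `indptr` mirror the attributes of `scipy.sparse.csr_matrix`; the helper functions and the small `dense` table are hypothetical and exist only for this example.

```python
def dense_to_csr(rows):
    """Convert a list of lists into CSR arrays (data, indices, indptr)."""
    data, indices, indptr = [], [], [0]
    for row in rows:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)      # the non-zero value itself
                indices.append(col)     # the column it lives in
        indptr.append(len(data))        # where the next row starts in data
    return data, indices, indptr


def csr_row_sums(data, indptr):
    """Sum each row by slicing data between consecutive indptr entries."""
    return [sum(data[indptr[r]:indptr[r + 1]]) for r in range(len(indptr) - 1)]


dense = [
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [2, 0, 0, 1],
]

data, indices, indptr = dense_to_csr(dense)
row_sums = csr_row_sums(data, indptr)

print("data:   ", data)
print("indices:", indices)
print("indptr: ", indptr)
print("row sums:", row_sums)
```

Only the non-zero entries are stored, so memory grows with the number of stored elements rather than with rows × columns. Row slicing is cheap in CSR, while CSC makes the same trade for columns.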

In [9]:
X = np.array(X.todense())
print(X)

[[0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 1 0 0 1 0 0 0 1 1 0]
 [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0
  1 0 0 0 0 0 0 0 0 0 0 1]
 [1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 1 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 1 0 0 0 0]
 [1 0 0 0 1 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
  0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 1 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0
  0 0 0 0 1 0 0 1 1 0 0 0]
 [1 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1
  0 1 0 0 0 0 0 0 0 0 0 0]
 [0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 1 0 1 0 0
  0 0 0 0 0 0 1 0 0 0 0 0]
 [0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0
  0 0 0 1 0 0 0 0 0 0 0 0]]


In [10]:
# check that the number of words in each sentence matches (up to single-letter words)
print("original:", [len(sentence.split()) for sentence in train_text])
print("transformed:", np.sum(X, axis=1))

original: [8, 8, 9, 7, 9, 9, 7, 7]
transformed: [8 7 8 7 8 9 7 6]


In [11]:
# a fitted vectorizer ignores words it has not seen before

new_text = [
    "I jak by pytał kto ja ten kolo jestem",
    "Mam plan niecny i szpetny jak Wujek Fester",
    "Mam torbę ziaren ale nie mylić z towarem",
    "Bo to podlewasz wokalem owocuje Ci tekstem"]

XX = np.array(cv.transform(new_text).todense())
print("original:", [len(sentence.split()) for sentence in new_text])
print("transformed:", np.sum(XX, axis=1))
print("known words:")
for row in XX:
    print("  ", [features[i] for i in np.nonzero(row)[0]])

original: [9, 8, 8, 7]
transformed: [2 1 1 3]
known words:
   ['jak', 'kolo']
   ['jak']
   ['nie']
   ['bo', 'ci', 'to']
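The behaviour above follows from how a fixed vocabulary is applied: a token without a column index has nowhere to be counted. Here is a minimal standard-library sketch of that idea. The small `vocabulary` dict is a hypothetical stand-in for the fitted `cv.vocabulary_`, and `transform` is my own helper, not the library method.

```python
import re

TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# Hypothetical small vocabulary standing in for a fitted cv.vocabulary_.
vocabulary = {"bo": 0, "ci": 1, "jak": 2, "kolo": 3, "nie": 4, "to": 5}


def transform(text, vocabulary):
    """Count only the tokens that already have a column in the vocabulary."""
    counts = [0] * len(vocabulary)
    for token in TOKEN_PATTERN.findall(text.lower()):
        col = vocabulary.get(token)
        if col is not None:          # unseen tokens are silently dropped
            counts[col] += 1
    return counts


row = transform("Bo to podlewasz wokalem owocuje Ci tekstem", vocabulary)
print(row, "-> known words:", sum(row))
```

Of the seven words in that sentence only three have columns, which matches the last row of the output above.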


In [12]:
# finally, let's look at the statistics of the "newsgroups" dataset

cv2 = CountVectorizer()
cv2.fit(newsgroups.data)
cv2.transform(newsgroups.data)

<11314x101631 sparse matrix of type '<class 'numpy.int64'>'
	with 1103627 stored elements in Compressed Sparse Row format>

I do not recommend:
* iterating manually over the table above in a Python loop,
* converting it to a dense representation (todense) unless it is really necessary.
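For scale: the matrix above stores 1103627 non-zero entries out of 11314 × 101631 cells, about 0.1%, and a dense int64 copy would need roughly 9 GB. The sketch below shows a few statistics computed directly on a sparse matrix without loops or densifying. The tiny matrix is made up for illustration; the same calls should work on the output of `cv2.transform`.

```python
import numpy as np
from scipy import sparse

# Made-up 3 documents x 4 terms count matrix, built from (row, col, value) triplets.
rows = np.array([0, 0, 1, 2, 2, 2])
cols = np.array([1, 3, 0, 1, 2, 3])
vals = np.array([1, 2, 1, 1, 1, 3])
X = sparse.csr_matrix((vals, (rows, cols)), shape=(3, 4))

# Total number of counted words per document (row sums).
words_per_doc = np.asarray(X.sum(axis=1)).ravel()

# Total occurrences of each term across the corpus (column sums).
count_per_term = np.asarray(X.sum(axis=0)).ravel()

# Number of distinct terms per document (stored entries per row).
distinct_terms_per_doc = X.getnnz(axis=1)

# Fraction of cells that are actually stored.
density = X.nnz / (X.shape[0] * X.shape[1])

print(words_per_doc, count_per_term, distinct_terms_per_doc, density)
```

Only the small summary vectors are converted to dense arrays here; the matrix itself stays sparse throughout.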