This example uses a scipy.sparse matrix to store the features and demonstrates various classifiers that can efficiently handle sparse matrices.

In [1]:
from sklearn.datasets import fetch_20newsgroups

categories = [
    'alt.atheism',
    'talk.religion.misc',
    'comp.graphics',
    'sci.space',
]
# An empty tuple strips nothing: headers, footers and quotes stay in the text.
remove = ()

train = fetch_20newsgroups(subset='train', categories=categories,
                           shuffle=True, random_state=42, remove=remove)
test = fetch_20newsgroups(subset='test', categories=categories,
                          shuffle=True, random_state=42, remove=remove)

In [47]:
train.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR', 'description'])

In [46]:
print(train['target_names'])
print()
print(train['target'][:10])
print()
print(train['data'][0])

['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']

[1 3 2 0 2 0 2 1 2 1]

From: rych@festival.ed.ac.uk (R Hawkes)
Subject: 3DS: Where did all the texture rules go?
Lines: 21

Hi,

I've noticed that if you only save a model (with all your mapping planes
positioned carefully) to a .3DS file that when you reload it after restarting
3DS, they are given a default position and orientation.  But if you save
to a .PRJ file their positions/orientation are preserved.  Does anyone
know why this information is not stored in the .3DS file?  Nothing is
explicitly said in the manual about saving texture rules in the .PRJ file. 
I'd like to be able to read the texture rule information, does anyone have 
the format for the .PRJ file?

Is the .CEL file format available from somewhere?

Rych

Rycharde Hawkes				email: rych@festival.ed.ac.uk
Virtual Environment Laboratory
Dept. of Psychology			Tel  : +44 31 650 3426
Univ. of Edinburgh			Fax  : +44 31 667 0150



In [60]:
def size_mb(docs):
    """Return the total UTF-8 size of docs in MB; also print the first document's byte length."""
    print("1st document text length is : {}".format(len(docs[0].encode('utf-8'))))
    return sum(len(s.encode('utf-8')) for s in docs) / 1e6

data_train_size_mb = size_mb(train.data)
data_test_size_mb = size_mb(test.data)

print("%d documents - %0.3fMB (training set)" % (
    len(train.data), data_train_size_mb))
print("%d documents - %0.3fMB (test set)" % (
    len(test.data), data_test_size_mb))
print("%d categories" % len(categories))
print()

1st document text length is : 1022
1st document text length is : 249
2034 documents - 3.980MB (training set)
1353 documents - 2.867MB (test set)
4 categories



In [61]:
# Separate the raw training documents from their target labels
train_X, train_Y = train.data, train.target

If n_samples == 10000 and the vocabulary holds 100000 distinct words, storing the bag-of-words feature matrix X as a NumPy array of type float32 would require 10000 x 100000 x 4 bytes = 4GB of RAM.
Fortunately, most values in X will be zeros, since any given document uses fewer than a few thousand distinct words.
For this reason we say that bags of words are typically high-dimensional sparse datasets.
We can save a lot of memory by storing only the non-zero parts of the feature vectors.
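
The arithmetic above can be checked with a rough back-of-the-envelope sketch. It assumes a CSR-style layout (one value and one column index per non-zero entry, plus one row pointer per row) with 4-byte indices, and it assumes, pessimistically, that every document uses 2000 distinct words. The helper names `dense_bytes` and `csr_bytes` are made up for this illustration.

```python
def dense_bytes(n_samples, n_features, itemsize=4):
    """Bytes needed to store every entry of an n_samples x n_features array."""
    return n_samples * n_features * itemsize


def csr_bytes(n_samples, nnz, itemsize=4, index_itemsize=4):
    """Approximate bytes for a CSR-style layout (assumed 4-byte indices)."""
    data = nnz * itemsize                       # one stored value per non-zero
    indices = nnz * index_itemsize              # one column index per non-zero
    indptr = (n_samples + 1) * index_itemsize   # one row pointer per row, plus one
    return data + indices + indptr


n_samples, n_features = 10_000, 100_000
# Assumption for illustration only: 2,000 distinct words in every document.
nnz = n_samples * 2_000

dense = dense_bytes(n_samples, n_features)
sparse = csr_bytes(n_samples, nnz)
print(f"dense : {dense / 1e9:.2f} GB")   # dense : 4.00 GB
print(f"sparse: {sparse / 1e9:.2f} GB")  # sparse: 0.16 GB
```

Even under this pessimistic assumption the sparse layout is roughly 25 times smaller; real documents usually contain far fewer distinct words, so the gap tends to be larger.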

# Extracting features from the training data using a sparse vectorizer

From occurrences to frequencies
Raw occurrence counts favor longer documents, so we divide the number of occurrences of each word in a document by the total number of words in that document.
These new features are called tf, for Term Frequencies.
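
As a minimal sketch of term frequencies, the snippet below counts tokens in a single toy sentence and divides by the document length. The whitespace tokenizer and the helper name `term_frequencies` are simplifications introduced here for illustration, not what scikit-learn does internally.

```python
from collections import Counter


def term_frequencies(text):
    """Map each token to (occurrences in the document) / (total tokens)."""
    tokens = text.lower().split()   # simplistic tokenizer, for illustration only
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


tf = term_frequencies("the cat sat on the mat")
print(tf["the"])   # 2 occurrences out of 6 tokens
print(tf["cat"])   # 1 occurrence out of 6 tokens
```

Because each count is divided by the document length, the frequencies of one document always sum to 1, whatever its length.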

A further refinement is to downscale the weights of words that occur in many documents in the corpus and are therefore less informative than words that occur in only a small portion of the corpus.
This downscaling is called tf–idf, for “Term Frequency times Inverse Document Frequency”.
TfidfVectorizer computes the tf–idf representation directly from the raw documents:

In [64]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                             stop_words='english')
# fit_transform() learns the vocabulary and idf weights from the training
# documents, then returns their tf-idf representation as a sparse matrix.
X_train = vectorizer.fit_transform(train.data)
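
To make the weighting less of a black box, here is a pure-Python sketch of the kind of computation behind the cell above. It follows the formulas given in the scikit-learn documentation for the default settings (smoothed idf `ln((1 + n) / (1 + df)) + 1`, L2 row normalization) together with `sublinear_tf=True` (`1 + ln(count)`). The whitespace tokenizer, the toy corpus and the function name `tfidf` are assumptions made for this illustration, and stop-word removal and `max_df` filtering are left out.

```python
import math
from collections import Counter


def tfidf(docs, sublinear_tf=True):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.lower().split() for doc in docs]   # simplistic tokenizer
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency.
    idf = {term: math.log((1 + n_docs) / (1 + d)) + 1 for term, d in df.items()}

    rows = []
    for tokens in tokenized:
        weights = {}
        for term, count in Counter(tokens).items():
            tf = 1 + math.log(count) if sublinear_tf else count
            weights[term] = tf * idf[term]
        # Scale the row to unit Euclidean length (skipped for empty documents).
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else weights)
    return rows


toy_docs = ["space shuttle launch", "space graphics card", "graphics card driver"]
rows = tfidf(toy_docs)
for doc, row in zip(toy_docs, rows):
    print(doc, {term: round(weight, 3) for term, weight in row.items()})
```

In the first toy document, "shuttle" appears in one document and "space" in two, so "shuttle" receives the larger weight.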

In [68]:
print(type(X_train) )
X_train

<class 'scipy.sparse.csr.csr_matrix'>


<2034x33809 sparse matrix of type '<class 'numpy.float64'>'
	with 224893 stored elements in Compressed Sparse Row format>
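
The repr above already contains enough information to quantify how sparse the matrix is. The sketch below plugs in the numbers it reports; on the live object the same values would come from `X_train.shape` and `X_train.nnz`. The helper name `sparsity_report` is made up for this illustration.

```python
def sparsity_report(shape, nnz):
    """Summarize how sparse a matrix is from its shape and stored-element count."""
    n_rows, n_cols = shape
    total = n_rows * n_cols
    return {
        "density": nnz / total,            # fraction of entries that are stored
        "avg_nnz_per_row": nnz / n_rows,   # average stored terms per document
    }


# Values copied from the repr above.
report = sparsity_report((2034, 33809), 224893)
print(f"{report['density']:.2%} of entries are non-zero")
print(f"{report['avg_nnz_per_row']:.1f} stored terms per document on average")
```

Well under one percent of the entries are stored, which is why a dense array would be wasteful here.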

In [69]:
X_train.shape  # 33809 unique terms (English stop words and terms found in more than 50% of documents are excluded)

(2034, 33809)

In [71]:
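# get_feature_names() was removed in scikit-learn 1.2; on newer versions use
# vectorizer.get_feature_names_out(), which returns a NumPy array instead of a list.
feature_names = vectorizer.get_feature_names()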
feature_names[:10]

['00',
 '000',
 '0000',
 '00000',
 '000000',
 '000005102000',
 '000021',
 '000062david42',
 '0000vec',
 '0001']