## Load 20news data with Python

The data is in the form (document, word, frequency). This structure follows the COO (coordinate) sparse matrix format.

The last column (frequency) is not used in this project.
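
As a minimal sketch of how such triples map onto a COO matrix, the toy example below builds one matrix that keeps the frequencies and one that keeps only binary occurrence. The `triples` array and the 3x3 shape are hypothetical and use 0-based indices for brevity.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical toy triples (document, word, frequency), 0-based for brevity.
triples = np.array([
    [0, 0, 4],
    [0, 2, 2],
    [1, 1, 1],
    [2, 0, 7],
])
rows, cols, freqs = triples[:, 0], triples[:, 1], triples[:, 2]

# COO takes exactly these three parallel arrays: row index, column index, value.
counts = coo_matrix((freqs, (rows, cols)), shape=(3, 3))

# Dropping the frequency column amounts to storing 1 for every listed pair.
occurrence = coo_matrix((np.ones(len(triples), dtype=int), (rows, cols)), shape=(3, 3))

print(counts.toarray())
print(occurrence.toarray())
```

The second matrix is the form this project needs, since only the presence of a word in a document matters.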

Finally, print summary statistics to verify that loading the data here yields the same values as the R example demo(0).

__R Demo(0) stats__:
- Number of groups: 20
- Number of words: 53975
- Number of documents: 11269
- Number of word-doc pairs: 1467345
- Density: 0.002412427

In [83]:
import pickle
import json
import numpy as np
from scipy.sparse import coo_matrix, csr_matrix

In [3]:
def load_20news_data(path_pfix="data/20news/"):
    # Each line of train.data holds three fields; keep only the first two (doc_id, word_id).
    X = []
    with open(f"{path_pfix}train.data", "r") as f:
        for line in f:
            fields = line.split()
            assert len(fields) == 3, f"line.split() returned incorrect number of elements: {len(fields)} != 3"
            X.append(tuple(fields[:2]))

    # One group label per document, in document order.
    with open(f"{path_pfix}train.label", "r") as f:
        y = [label.strip() for label in f]

    with open(f"{path_pfix}vocabulary.txt", "r") as f:
        vocab = [token.strip() for token in f]

    print(X[0])  # sanity check: first (doc_id, word_id) pair, still as strings
    return np.array(X, dtype=int), np.array(y, dtype=int), vocab


X, y, vocab = load_20news_data()
# Count distinct documents; this assumes the pairs in X are grouped by doc_id.
prev_docid = None
docid_count = 0
for x in X:
    if x[0] != prev_docid:
        docid_count += 1
        prev_docid = x[0]

assert docid_count == len(y), \
    f"Number of doc_ids and labels should be equal, got {docid_count} doc_ids and {len(y)} labels."

groups = [
    'alt.atheism', 
    'comp.graphics',    
    'comp.os.ms-windows.misc', 
    'comp.sys.ibm.pc.hardware', 
    'comp.sys.mac.hardware', 
    'comp.windows.x', 
    'misc.forsale', 
    'rec.autos', 
    'rec.motorcycles', 
    'rec.sport.baseball', 
    'rec.sport.hockey', 
    'sci.crypt', 
    'sci.electronics', 
    'sci.med', 
    'sci.space', 
    'soc.religion.christian', 
    'talk.politics.guns',
    'talk.politics.mideast', 
    'talk.politics.misc', 
    'talk.religion.misc'
]


('1', '1')


In [4]:
print(f"• Number of groups: {len(groups)}")
print(f"• Number of words: {len(vocab)}")
print(f"• Number of documents: {docid_count}")
print(f"• Number of word-doc-pairs: {len(X)}")
print(f"• Density: {round(len(X)/(docid_count*len(vocab)), 9)}")

• Number of groups: 20
• Number of words: 53975
• Number of documents: 11269
• Number of word-doc-pairs: 1467345
• Density: 0.002412427


In [8]:
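ynp = np.array(y, dtype=int)
group_index_ranges = []  # (start_index, end_index) per group, end index inclusive
train_index_ranges = []  # (start_index, split_index) per group, end index exclusive
test_index_ranges = []   # (split_index, end_index + 1) per group, end index exclusive
for i in range(1, 21):  # group labels are 1-based
    mask = (ynp == i)
    idx_range = np.where(mask)[0]  # document indices of group i (a contiguous block)
    # Put roughly the first 90% of each group's documents into the training split.
    split_idx = int((idx_range[-1]-idx_range[0])*0.9+idx_range[0])
    group_index_ranges.append((idx_range[0], idx_range[-1]))
    train_index_ranges.append((idx_range[0], split_idx))
    test_index_ranges.append((split_idx, idx_range[-1]+1))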

In [9]:
train_index_ranges

[(0, 431),
 (480, 1002),
 (1061, 1574),
 (1633, 2160),
 (2220, 2736),
 (2795, 3326),
 (3387, 3909),
 (3969, 4500),
 (4561, 5096),
 (5157, 5690),
 (5751, 6288),
 (6349, 6882),
 (6943, 7474),
 (7534, 8067),
 (8128, 8660),
 (8721, 9259),
 (9320, 9809),
 (9865, 10371),
 (10429, 10845),
 (10893, 11230)]

In [10]:
test_index_ranges

[(431, 480),
 (1002, 1061),
 (1574, 1633),
 (2160, 2220),
 (2736, 2795),
 (3326, 3387),
 (3909, 3969),
 (4500, 4561),
 (5096, 5157),
 (5690, 5751),
 (6288, 6349),
 (6882, 6943),
 (7474, 7534),
 (8067, 8128),
 (8660, 8721),
 (9259, 9320),
 (9809, 9865),
 (10371, 10429),
 (10845, 10893),
 (11230, 11269)]

### Create conditional probability tables

"Estimate the probability that a document from the given group contains the given word."

Here we don't care about word frequencies within documents, only about binary occurrence. In other words, for each group and each word, count the number of documents in that group that contain the word.

`p(w|g) = #(docs_having_word and docs_in_group) / #docs_in_group`
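
One possible way to compute this table is a sparse matrix product between a group-by-document indicator matrix and the binary document-by-word matrix. The sketch below uses hypothetical toy data (`doc_word`, `labels`, `n_groups`) and applies no smoothing, so it is only an illustration of the formula, not necessarily the approach used later in this notebook.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical toy data: 4 documents x 3 words, binary occurrence.
doc_word = csr_matrix(np.array([
    [1, 0, 1],
    [1, 1, 0],
    [0, 1, 0],
    [0, 1, 1],
]))
labels = np.array([1, 1, 2, 2])  # 1-based group label per document
n_groups = 2
n_docs = doc_word.shape[0]

# Indicator matrix (groups x documents): entry (g, d) is 1 when document d belongs to group g.
group_doc = csr_matrix(
    (np.ones(n_docs), (labels - 1, np.arange(n_docs))),
    shape=(n_groups, n_docs),
)

# Numerator: number of documents in each group that contain each word.
docs_with_word = (group_doc @ doc_word).toarray()
# Denominator: number of documents in each group.
docs_in_group = np.bincount(labels - 1, minlength=n_groups)

p_word_given_group = docs_with_word / docs_in_group[:, None]
print(p_word_given_group)
```

Each row of `p_word_given_group` corresponds to a group and each column to a word. A word that never occurs in a group gets probability 0 here, which may need smoothing depending on how the table is used.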

In [104]:
row = None  # groups
col = None  # vocab

scipy.sparse.coo.coo_matrix

The y labels are stored in an array whose length is the number of documents in the dataset. X, on the other hand, is currently in an (implicit) sparse matrix format where the first index indicates the document and the second index the word. Construct a proper sparse matrix from X and concatenate y to it.

The dataset was intended to be used with R, so its indexing starts at 1 instead of 0. Remove the first row and first column from the resulting sparse matrix.

After the data in X is transformed, the documents can be directly aggregated based on group indexes given in y.
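
To illustrate the 1-based indexing issue, the sketch below uses hypothetical toy pairs to compare two ways of handling it: building the matrix with the original indices and trimming the empty row 0 and column 0, or shifting the indices to 0-based before building. On this toy input both produce the same matrix.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 1-based (document, word) pairs.
pairs = np.array([[1, 1], [1, 3], [2, 2], [3, 1]])
data = np.ones(len(pairs))

# Option 1: keep 1-based indices, then drop the all-zero row 0 and column 0.
padded = csr_matrix((data, (pairs[:, 0], pairs[:, 1])))
trimmed = padded[1:, 1:]

# Option 2: shift to 0-based indices before building the matrix.
shifted = csr_matrix((data, (pairs[:, 0] - 1, pairs[:, 1] - 1)))

same = np.array_equal(trimmed.toarray(), shifted.toarray())
print(padded.shape, trimmed.shape, shifted.shape, same)
```

Shifting first avoids allocating the padded matrix, while trimming makes it easy to verify that the dropped row and column were in fact empty.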

In [105]:
# Indices in X are 1-based, so row 0 and column 0 of the inferred shape stay empty.
X_csr = csr_matrix((np.ones(X.shape[0]), (X[:, 0], X[:, 1])))
print(X_csr.shape)
print(X_csr[:, 0].sum() == 0)  # column 0 should be all zeros
print(X_csr[0, :].sum() == 0)  # row 0 should be all zeros
X_csr = X_csr[:, 1:]  # remove the first column, as it is all zeros
X_csr = X_csr[1:, :]  # remove the first row, as it is all zeros
assert X_csr.sum() == X.shape[0], "number of elements in the sparse CSR matrix does not match the original"

(11270, 53976)
True
True


In [108]:
print(X_csr.shape)
print(X_csr.sum())
print(X_csr[:, 0].sum() == 0)
print(X_csr[0, :].sum() == 0)

(11269, 53975)
1467345.0
False
False
