# Bit String Extraction using brown-clustering from PyPI (yangyuan)

Objectives:
- Extract the bit string representation of words/tokens
- Store the result in a text file, to be used later for extracting features to train a POS tagging model

References: 
- https://pypi.org/project/brown-clustering/
- https://github.com/yangyuan/brown-clustering

Package: 
`pip install brown-clustering`

In [10]:
from brown_clustering import BigramCorpus, BrownClustering

# use some tokenized and preprocessed data
sentences = [
    ["This", "is", "an", "example"],
    ["This", "is", "another", "example"]
]

# create a corpus
corpus = BigramCorpus(sentences, alpha=0.5, min_count=0)

# (optional) print corpus statistics:
corpus.print_stats()

# create a clustering
clustering = BrownClustering(corpus, m=4)

# train the clustering
clusters = clustering.train()


Vocab count: 5
Token count: 8
unique 2gram count: 7
2gram count: 10.0
Laplace smoothing: 0.5


100%|█████████████████████████████████████████████| 5/5 [00:01<00:00,  4.10it/s]


In [11]:
# get the bit string code for each word
cluster_dict = clustering.codes()
# example result:
# {'an': '110', 'another': '111', 'This': '00', 'example': '01', 'is': '10'}
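
Words whose codes share a prefix sit in the same branch of the cluster hierarchy. The sketch below groups words by the first bits of their code to make that visible. It is an illustration only: `group_by_prefix` is a hypothetical helper, not part of the `brown-clustering` package, and the codes are copied from the example result above rather than recomputed.

```python
from collections import defaultdict

# Codes copied from the example result above (not recomputed here).
example_codes = {'an': '110', 'another': '111', 'This': '00', 'example': '01', 'is': '10'}


def group_by_prefix(codes, length):
    """Group words by the first `length` bits of their code (hypothetical helper)."""
    groups = defaultdict(list)
    for word, bits in codes.items():
        # Slicing is safe even when a code is shorter than `length`.
        groups[bits[:length]].append(word)
    # Sort the words so the result does not depend on dictionary order.
    return {prefix: sorted(words) for prefix, words in groups.items()}


groups = group_by_prefix(example_codes, 2)
for prefix, words in sorted(groups.items()):
    print(prefix, words)
```

With a prefix length of 2, 'an' and 'another' fall into the same group because both codes start with '11'.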

In [12]:
import json

# save the word -> bit string mapping as JSON text
with open('test_bitstring_dictionary.txt', 'w') as convert_file:
    json.dump(cluster_dict, convert_file)

In [13]:
# importing the module
import json
  
# reading the data from the file
with open('test_bitstring_dictionary.txt') as f:
    data = f.read()

print("Data type before reconstruction : ", type(data))
      
# reconstructing the data as a dictionary
js = json.loads(data)
  
print("Data type after reconstruction : ", type(js))
print(js)

Data type before reconstruction :  <class 'str'>
Data type after reconstruction :  <class 'dict'>
{'an': '110', 'another': '111', 'This': '00', 'example': '01', 'is': '10'}
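
One possible next step toward the POS tagging objective is to turn each token's bit string into prefix features. The sketch below is only an assumption about how that could look: `bitstring_features`, the feature names, and the prefix lengths are hypothetical choices, not something defined by this notebook or by the `brown-clustering` package. The dictionary is written out inline so the example runs on its own.

```python
# Same content as the reconstructed dictionary printed above.
codes = {'an': '110', 'another': '111', 'This': '00', 'example': '01', 'is': '10'}


def bitstring_features(token, codes, prefix_lengths=(1, 2)):
    """Build a feature dict from a token's bit string (hypothetical helper).

    Shorter prefixes correspond to coarser clusters, longer ones to finer clusters.
    """
    bits = codes.get(token)
    if bits is None:
        # Token was not in the clustering vocabulary.
        return {'brown_unknown': True}
    features = {'brown_full': bits}
    for n in prefix_lengths:
        features[f'brown_prefix_{n}'] = bits[:n]
    return features


sentence = ['This', 'is', 'an', 'unseen', 'example']
for token in sentence:
    print(token, bitstring_features(token, codes))
```

Feature dicts of this shape can be passed to most dict-based feature pipelines. The prefix lengths would need tuning on a real corpus, where codes are much longer than in this toy example.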
