In [1]:
# Note: spacy.en is the import path used by early spaCy releases (before 2.0).
from spacy.en import English
parser = English()

In [2]:
with open("../TheGadflyProject/news_articles/China Is Rising.txt", encoding="utf-8") as f:
    source_text = f.read()
    
print(source_text)

In the constant scorekeeping of where tech’s power centers are, two trends stand out in Asia: the continued rise of China as a tech titan and the slow decline of Japan’s once-mighty tech industry.

Those currents were evident in two recent developments.

On Thursday, Paul Mozur and Jane Perlez reported that American officials had blocked a proposed purchase of a controlling stake in a unit of the Dutch electronics company Philips by Chinese investors because the United States was fearful the deal would be used to further China’s push into microchips.

At the center of the concerns was a material called gallium nitride, which was being used by the Philips unit in light-emitting diodes, but which has applications for military and space and can be helpful in semiconductors. The report illustrated how American officials have come to view China’s large and growing tech industry with misgivings.

At the same time, Jonathan Soble and Paul Mozur reported that Sharp, one of Japan’s large consum

In [3]:
# Parsing text takes a single call.
# Note: the first time spaCy runs in a session, loading its modules takes a little while.
parsedData = parser(source_text)

# Let's look at the tokens by iterating over parsedData.
# Each token is an object with many attributes.
# An attribute with a trailing underscore (e.g. token.orth_) returns the string form,
# while the same attribute without the underscore (e.g. token.orth) returns an
# integer ID into spaCy's vocabulary.
# token.prob is a log probability estimated from counts over a 3-billion-word
# corpus, smoothed using the Simple Good-Turing method.
for i, token in enumerate(parsedData):
    print("original:", token.orth, token.orth_)
    print("lowercased:", token.lower, token.lower_)
    print("lemma:", token.lemma, token.lemma_)
    print("shape:", token.shape, token.shape_)
    print("prefix:", token.prefix, token.prefix_)
    print("suffix:", token.suffix, token.suffix_)
    print("log probability:", token.prob)
    print("Brown cluster id:", token.cluster)
    print("----------------------------------------")
    if i > 1:
        break

original: 683 In
lowercased: 477 in
lemma: 477 in
shape: 87670 Xx
prefix: 467 I
suffix: 683 In
log probability: -7.603263854980469
Brown cluster id: 62
----------------------------------------
original: 466 the
lowercased: 466 the
lemma: 466 the
shape: 28983 xxx
prefix: 3598 t
suffix: 466 the
log probability: -3.528766632080078
Brown cluster id: 11
----------------------------------------
original: 2839 constant
lowercased: 2839 constant
lemma: 2839 constant
shape: 53740 xxxx
prefix: 4206 c
suffix: 14180 ant
log probability: -10.435121536254883
Brown cluster id: 1831
----------------------------------------
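
The shape, prefix, and suffix values printed above can be reproduced with a few lines of plain Python. This is only a sketch inferred from the three tokens shown (letters map to `X`/`x`, digits to `d`, runs are capped at four characters, the prefix is the first character, and the suffix is the last three); `word_shape` and `lexical_features` are hypothetical helper names, not spaCy functions, and spaCy's own rules may differ in edge cases.

```python
def word_shape(text):
    """Approximate a spaCy-style word shape (an assumption, not spaCy's code)."""
    shape = []
    last = ""
    run = 0
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char
        # Track the length of the current run of identical shape characters
        run = run + 1 if shape_char == last else 1
        last = shape_char
        # Keep at most four characters per run, so "constant" becomes "xxxx"
        if run <= 4:
            shape.append(shape_char)
    return "".join(shape)


def lexical_features(text):
    """Collect the string-valued features printed in the cell above."""
    return {
        "lowercased": text.lower(),
        "shape": word_shape(text),
        "prefix": text[:1],   # first character
        "suffix": text[-3:],  # last three characters (or the whole short word)
    }


for word in ["In", "the", "constant"]:
    print(word, lexical_features(word))
```

For the three tokens above this yields the shapes `Xx`, `xxx`, and `xxxx`, matching the printed output.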


In [5]:
for sent in parsedData.sents:
    print("*" * 100)
    print(sent.text)

****************************************************************************************************
In the constant scorekeeping of where tech’s power centers are, two trends stand out in Asia: the continued rise of China as a tech titan and the slow decline of Japan’s once-mighty tech industry.
****************************************************************************************************


Those currents were evident in two recent developments.
****************************************************************************************************


On Thursday, Paul Mozur and Jane Perlez reported that American officials had blocked a proposed purchase of a controlling stake in a unit of the Dutch electronics company Philips by Chinese investors because the United States was fearful the deal would be used to further China’s push into microchips.
****************************************************************************************************


At the center of the concerns was a

In [93]:
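entities = set()
for ent in parsedData.ents:
    # Skip unlabeled entities and purely numeric or temporal ones.
    if (ent.label_ != "") and (ent.label_ not in ["DATE", "TIME", "PERCENT", "CARDINAL"]):
        # print(ent, ent.label, ent.label_, ent.start)
        # Note: text_with_ws keeps any trailing whitespace, so 'China ' and 'China'
        # are stored as two distinct entries in the set.
        entities.add(ent.text_with_ws)
        # print(parsedData[:ent.start], "________", parsedData[ent.end:])
print(entities)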


{'Jane Perlez ', 'China ', 'Chinese ', 'Jonathan Soble ', 'Dutch ', 'Philips ', 'the United States ', 'Foxconn', 'Asia', 'Japan', 'Taiwanese ', 'South Korea', 'Sharp', 'China', 'American ', 'Paul Mozur '}
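
The set above contains both `'China '` and `'China'` because the trailing whitespace is part of each string. One possible cleanup, sketched here on a hand-copied subset of that output (the `raw_entities` literal is an assumption standing in for the real set), is to strip whitespace before deduplicating:

```python
# A subset of the printed entity set, copied by hand for illustration
raw_entities = {'Jane Perlez ', 'China ', 'Chinese ', 'Japan', 'China', 'Paul Mozur '}

# Strip surrounding whitespace; the set comprehension removes the duplicates
entities = {name.strip() for name in raw_entities}

print(sorted(entities))
```

Six raw strings collapse to five distinct names, since the two spellings of China become one.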


In [30]:
# Let's look at the sentences
sents = []
# The "sents" property yields spans.
# span.start and span.end are token indices into the parsed document,
# not character offsets into the original string.
for span in parsedData.sents:
    sent = [parsedData[i] for i in range(span.start, span.end)]
    print()
    print()
    print(''.join(parsedData[i].string for i in range(span.start, span.end)).strip())
    for token in sent:
        if token.ent_type_ != "":
            print()
            print(token, token.ent_type_)
            print("text_with_ws", token.text_with_ws)



In the constant scorekeeping of where tech’s power centers are, two trends stand out in Asia: the continued rise of China as a tech titan and the slow decline of Japan’s once-mighty tech industry.

two  CARDINAL
text_with_ws two 

Asia LOC
text_with_ws Asia

China  GPE
text_with_ws China 

Japan GPE
text_with_ws Japan


Those currents were evident in two recent developments.

two  CARDINAL
text_with_ws two 


On Thursday, Paul Mozur and Jane Perlez reported that American officials had blocked a proposed purchase of a controlling stake in a unit of the Dutch electronics company Philips by Chinese investors because the United States was fearful the deal would be used to further China’s push into microchips.

Thursday DATE
text_with_ws Thursday

Paul  PERSON
text_with_ws Paul 

Mozur  PERSON
text_with_ws Mozur 

Jane  PERSON
text_with_ws Jane 

Perlez  PERSON
text_with_ws Perlez 

American  NORP
text_with_ws American 

Dutch  NORP
text_with_ws Dutch 

Philips  ORG
text_with_ws Philips 
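
The token-level output above splits multi-word names such as "Paul Mozur" into separate tokens that share an entity type. As a rough illustration of how those tokens relate to whole entities, the sketch below merges runs of adjacent tokens with the same non-empty type. The `tagged_tokens` list is hand-copied from the output, and `group_entities` is a hypothetical helper; it would wrongly merge two different entities of the same type that sit next to each other, which is one reason to prefer `parsedData.ents` in practice.

```python
from itertools import groupby

# (token text, entity type) pairs copied by hand from the output above
tagged_tokens = [
    ("On", ""), ("Thursday", "DATE"), (",", ""),
    ("Paul", "PERSON"), ("Mozur", "PERSON"), ("and", ""),
    ("Jane", "PERSON"), ("Perlez", "PERSON"), ("reported", ""),
]


def group_entities(tagged):
    """Merge runs of adjacent tokens that share a non-empty entity type."""
    spans = []
    for label, run in groupby(tagged, key=lambda pair: pair[1]):
        if label:  # skip runs of tokens that are not part of any entity
            spans.append((" ".join(text for text, _ in run), label))
    return spans


print(group_entities(tagged_tokens))
```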


### Gadfly

In [31]:
import re
_parsed_text = parsedData

In [32]:
def find_named_entities():
    entities = set()
    for ent in _parsed_text.ents:
        if (ent.label_ != "") and \
            (ent.label_ not in ["DATE", "TIME", "PERCENT", "CARDINAL"]):
            entities.add(ent.text_with_ws)
    return entities

In [33]:
def _replaceNth(sent, old, new, n):
    """Replace the nth occurrence (counting from 1) of old in sent with new.

    Returns None if sent contains fewer than n occurrences of old.
    Cite: inspectorG4dget http://stackoverflow.com/a/27589436"""
    # Start index of every occurrence of old in sent
    inds = [i for i in range(len(sent) - len(old) + 1)
            if sent[i:i+len(old)] == old]
    if len(inds) < n:
        return None  # or maybe raise an error
    # Strings are immutable, so work on a list of characters instead
    sent_list = list(sent)
    # Use n-1 because n counts occurrences from 1, while list indices start at 0
    sent_list[inds[n-1]:inds[n-1]+len(old)] = new
    return ''.join(sent_list)
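
For comparison, here is an alternative sketch of the same idea built on `str.find` instead of a list of characters. `replace_nth` is a hypothetical name, and unlike the function above it explicitly rejects `n < 1` and an empty `old`; treat it as one possible variant rather than a drop-in replacement.

```python
def replace_nth(text, old, new, n):
    """Replace the n-th (1-based) occurrence of old in text, or return None."""
    if n < 1 or not old:
        return None
    start = -1
    for _ in range(n):
        # Search from just past the previous hit (overlapping hits are counted)
        start = text.find(old, start + 1)
        if start == -1:
            return None  # fewer than n occurrences
    return text[:start] + new + text[start + len(old):]


print(replace_nth("China and China", "China", "_____", 2))
```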

In [34]:
question_set = set()
entities = find_named_entities()
for sent in _parsed_text.sents:
    for entity in entities:
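        # Escape the entity so it is matched literally rather than as a regex
        sent_ents = re.findall(re.escape(entity), sent.orth_)
        if sent_ents:
            print("-----------" * 10)
            # _replaceNth counts occurrences from 1
            for n in range(1, len(sent_ents) + 1):
                gap_fill_question = _replaceNth(sent.orth_,
                                                entity,
                                                "_____",
                                                n
                                                )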
                print(entity, ": ", gap_fill_question)
#                 break
#         break
    break
    #             question = Question(sent.text,
    #                                 gap_fill_question,
    #                                 entity,
    #                                 self.GAP_FILL,
    #                                 )
    #             question_set.add(question)
    # return question_set

--------------------------------------------------------------------------------------------------------------
Japan :  In the constant scorekeeping of where tech’s power centers are, two trends stand out in Asia: the continued rise of China as a tech titan and the slow decline of _____’s once-mighty tech industry.
--------------------------------------------------------------------------------------------------------------
China :  In the constant scorekeeping of where tech’s power centers are, two trends stand out in Asia: the continued rise of _____ as a tech titan and the slow decline of Japan’s once-mighty tech industry.
--------------------------------------------------------------------------------------------------------------
Asia :  In the constant scorekeeping of where tech’s power centers are, two trends stand out in _____: the continued rise of China as a tech titan and the slow decline of Japan’s once-mighty tech industry.
-------------------------------------------------
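
The gap-fill loop can also be written without spaCy objects, which makes it easier to test. The sketch below is an assumption about how the questions might be collected as (question, answer) pairs, using `re.finditer` to get the span of each literal occurrence; `gap_fill_questions` and the example sentence are hypothetical and not part of the Gadfly code.

```python
import re


def gap_fill_questions(sentence, entities, blank="_____"):
    """Return (question, answer) pairs, one per occurrence of each entity."""
    questions = []
    for entity in entities:
        entity = entity.strip()  # ignore trailing whitespace from text_with_ws
        if not entity:
            continue
        # re.escape makes names such as "U.S." match literally
        for match in re.finditer(re.escape(entity), sentence):
            question = sentence[:match.start()] + blank + sentence[match.end():]
            questions.append((question, entity))
    return questions


# Illustrative sentence, not taken from the article
sentence = "China trades with Japan, and Japan trades with China."
for question, answer in gap_fill_questions(sentence, ["Japan"]):
    print(answer, ": ", question)
```

Each occurrence produces its own question, so an entity mentioned twice in a sentence yields two variants.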