In [5]:
import spacy

nlp = spacy.load("en_core_web_sm")

In [16]:
with open("all_text.txt", mode="r", encoding="utf-8") as file:
    text = file.read()

print(text[0:1000])

UK will not be able to resist China's tech dominance
China's success in technology has not come out of thin air, even given the unlikely origins of the DeepSeek deep shock.
The obscure Hangzhou hedge fund that coded a ChatGPT competitor as a side project it claims cost just $5.6m to train emerges from a concerted effort to invest in future generations of technology.
This is not an accident. This is policy.
The raw materials of artificial intelligence (AI) are microchips, science PhDs and data. On the latter two, China might be ahead already.
There are on average more than 6,000 PhDs in STEM subjects (science, technology, engineering and mathematics) coming out of Chinese universities every month. In the US it is more like 2,000-3,000, in the UK it is 1,500.
In terms of patents generally, more are being filed in China than in the rest of the world put together. In 2023China filed 1.7 million patents, against 600,000 in the US. Two decades earlier China had a third of the patents filed b

In [31]:
# Simple cleaning: split() breaks the text on any run of whitespace (spaces,
# tabs, newlines), and " ".join() puts the tokens back together with a single
# space between them. Quotation marks and other punctuation could be removed
# too, but we leave them in for now.

split = text.split()
all_text_cleaned = " ".join(split)

print(all_text_cleaned[0:1000])

UK will not be able to resist China's tech dominance China's success in technology has not come out of thin air, even given the unlikely origins of the DeepSeek deep shock. The obscure Hangzhou hedge fund that coded a ChatGPT competitor as a side project it claims cost just $5.6m to train emerges from a concerted effort to invest in future generations of technology. This is not an accident. This is policy. The raw materials of artificial intelligence (AI) are microchips, science PhDs and data. On the latter two, China might be ahead already. There are on average more than 6,000 PhDs in STEM subjects (science, technology, engineering and mathematics) coming out of Chinese universities every month. In the US it is more like 2,000-3,000, in the UK it is 1,500. In terms of patents generally, more are being filed in China than in the rest of the world put together. In 2023China filed 1.7 million patents, against 600,000 in the US. Two decades earlier China had a third of the patents filed b
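
The cleaning cell above leaves punctuation in place. As a minimal sketch of what the split()/join() step does, and of one possible way to also drop double quotation marks, consider the following. The sample string, the helper name normalize_whitespace and the table QUOTES_TO_DROP are made up for illustration; they are not part of the notebook. Apostrophes are deliberately kept so that forms like "China's" survive.

```python
# Hypothetical sample string; not taken from all_text.txt.
sample = "This is  not\nan accident.\t\"This is policy.\"\n"


def normalize_whitespace(raw: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    return " ".join(raw.split())


# Translation table that deletes straight and curly double quotation marks.
QUOTES_TO_DROP = str.maketrans("", "", "\"\u201c\u201d")

cleaned = normalize_whitespace(sample)
without_quotes = cleaned.translate(QUOTES_TO_DROP)

print(cleaned)
print(without_quotes)
```

Whether removing quotation marks helps or hurts the later named-entity step is an open question for this text; it would be worth comparing the entity lists before and after.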

In [33]:
# Open a file in write mode ("w" = write, "utf-8" ensures proper encoding)
with open("all_text_cleaned.txt", "w", encoding="utf-8") as file:
    file.write(all_text_cleaned)  # Write the cleaned text to the file

print("File saved successfully!")


File saved successfully!


In [None]:
# We import spaCy and also load the English language model.

import spacy

nlp = spacy.load("en_core_web_sm")

In [42]:
# Let's run our text through the spaCy pipeline (nlp) to get a Doc object.

doc_all_text_cleaned = nlp(all_text_cleaned)


In [46]:
# As we remember from last week, spaCy recognizes and classifies named
# entities as part of the nlp pipeline.

# There are several ways to access the named-entity (NE) information. spaCy
# stores all the named entities of a document as a tuple (similar to a list,
# but it cannot be modified) in doc_all_text_cleaned.ents. We can print the
# first 50 of them:

print(doc_all_text_cleaned.ents[:50])

# You can check the type of .ents yourself: it is a "tuple".

#print(type(doc_all_text_cleaned.ents))

(UK, China, China, DeepSeek, Hangzhou, just $5.6, two, China, more than 6,000, STEM, Chinese, every month, US, 2,000-3,000, UK, 1,500, China, 2023China, 1.7 million, 600,000, US, Two decades earlier, China, a third, US, a quarter, Japan, South Korea, Europe, China, US, Chinese, seventh, a decade ago, AI, China, China, Chinese, China, 24/7, AI, China, three-quarters, the start of the century, Last year, the US National Science Board, China, China, United, UK)


In [49]:
# Here we create a list with all the person names. The attribute .label_ holds
# the NE class (GPE, PERSON etc.) as a string. The attribute .text holds the
# entity itself as a string.

# In a for loop, we iterate through all the recognised named entities, check
# with "if" whether the entity is a "PERSON", and if so, append it to our list
# "list_persons".

list_persons = []

for entity in doc_all_text_cleaned.ents:
    if entity.label_ == "PERSON":          # We use the class "PERSON" here (could be any other NE class too)
        list_persons.append(entity.text)

print(f"The entities tagged with label PERSON are: {spacy.explain('PERSON')}")

print("")

print(list_persons)

The entities tagged with label PERSON are: People, including fictional

['Rachel Reeves', 'Trump', 'Keir Starmer', 'Xi', 'Biden', 'Liang Wenfeng', 'Elon Musk', 'Marina Zhang', 'Liang Wenfung', "Xi Jinping's", 'Donald Trump', 'Gregory C Allen', 'Mr Allen', 'Ms Zhang', 'Ms Zhang', 'Deepseek', 'Liang Wenfeng', 'Liang', 'Ms Zhang', 'Allen', 'Liang', 'Zhilin Yang', 'Kaiming', 'Wei Sun', 'Getty Images', 'Fiona Zhou', 'Marc Andreessen', 'Sputnik', 'OpenAI', 'Gene Munster', 'OpenAI', 'Sam Altman', 'Oracle', 'Larry Ellison', 'Donald Trump', 'Trump', 'Ellison', "Liang Wenfung's", 'High-Flyer', 'Sam Altman', 'Sputnik', 'Reuters DeepSeek', 'OpenAI', 'Donald Trump', 'Kayla Blomquist', 'Qwen', 'Ms Blomquist', 'Donald Trump', 'OpenAI', 'Deepseek', 'Sam Altman', 'GPT-4', 'App Store', 'Ernie', 'Doubao', 'Brandon Drenon', 'Tom Gerken', 'Marc Cieslak', 'Donald Trump', 'OpenAI', 'Deepseek', 'Sam Altman', 'GPT-4', 'App Store', 'Ernie', 'Doubao', 'Liang Wenfeng', 'Liang', 'Li Qiang', 'Liang', 'Liang', 'Lian

In [51]:
from collections import Counter

# Count occurrences of each unique name
name_frequencies = Counter(list_persons)

# Print explanation of "PERSON" tag
print(f"The entities tagged with label PERSON are: {spacy.explain('PERSON')}\n")

# Print names with their frequencies
for name, count in name_frequencies.items():
    print(f"{name}: {count}")


The entities tagged with label PERSON are: People, including fictional

Rachel Reeves: 1
Trump: 12
Keir Starmer: 1
Xi: 11
Biden: 8
Liang Wenfeng: 18
Elon Musk: 4
Marina Zhang: 3
Liang Wenfung: 1
Xi Jinping's: 1
Donald Trump: 11
Gregory C Allen: 1
Mr Allen: 1
Ms Zhang: 3
Deepseek: 5
Liang: 35
Allen: 2
Zhilin Yang: 1
Kaiming: 1
Wei Sun: 2
Getty Images: 2
Fiona Zhou: 1
Marc Andreessen: 8
Sputnik: 7
OpenAI: 17
Gene Munster: 1
Sam Altman: 12
Oracle: 3
Larry Ellison: 1
Ellison: 1
Liang Wenfung's: 1
High-Flyer: 2
Reuters DeepSeek: 1
Kayla Blomquist: 1
Qwen: 1
Ms Blomquist: 1
GPT-4: 7
App Store: 3
Ernie: 2
Doubao: 3
Brandon Drenon: 1
Tom Gerken: 1
Marc Cieslak: 1
Li Qiang: 7
Prof Gina Neff: 1
Wendy Hall: 1
Emmanuel Macron: 1
Narendra Modi: 1
JD Vance: 3
Sundar Pichai: 2
Kier Starmer: 1
Wu Zhaohui: 1
Ding Xuexiang: 1
Xi JinPing: 1
Dario Amodei: 1
Geoffrey Hinton: 1
Prof Hinton: 1
Prof Max Tegmark: 1
Holly Wang: 1
Holly: 4
Covid: 3
Getty Images DeepSeek: 1
Nan Jia: 1
Nan: 1
John: 1
Fang Kecheng:
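
The frequency list above contains strings that are clearly not people, for example OpenAI, GPT-4, Getty Images and Covid. One possible cleanup, sketched here under the assumption that a hand-maintained exclusion set is acceptable, is to filter the list before counting. The names sample_persons and NOT_PERSONS are hypothetical, and the sample is only a small hand-copied subset of the real list.

```python
from collections import Counter

# Small sample copied from the PERSON output above; the real list is much longer.
sample_persons = ["Sam Altman", "OpenAI", "Liang Wenfeng", "GPT-4",
                  "OpenAI", "Getty Images", "Sam Altman", "Covid"]

# Hypothetical, hand-maintained set of strings that the model tagged as PERSON
# but that are not people.
NOT_PERSONS = {"OpenAI", "GPT-4", "Getty Images", "Covid", "Sputnik"}

# Keep only the names that are not in the exclusion set.
filtered_persons = [name for name in sample_persons if name not in NOT_PERSONS]

print(Counter(filtered_persons).most_common())
```

An exclusion set has to be built by reading the output, so it only scales to a corpus of this size; it is a sketch, not a general fix for tagging errors.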

In [54]:
# Here we print only the 10 most common strings in the list:

print("Top 10 person named entities and their frequencies: " + str(Counter(list_persons).most_common(10)))


Top 10 person named entities and their frequencies: [('Liang', 35), ('Liang Wenfeng', 18), ('OpenAI', 17), ('Trump', 12), ('Sam Altman', 12), ('Xi', 11), ('Donald Trump', 11), ('Kevin', 10), ('Biden', 8), ('Marc Andreessen', 8)]
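
In the top 10 above, 'Trump' and 'Sam Altman' both have a count of 12. Counter.most_common() orders elements with equal counts by the order in which they were first encountered, which is why 'Trump' comes first. The sketch below shows this on a made-up mini-list and adds one possible alphabetical tie-break; the variable names are hypothetical.

```python
from collections import Counter

# Hypothetical mini-list: two names occur twice, and "Trump" is seen first.
names = ["Trump", "Sam Altman", "Biden", "Sam Altman", "Trump"]

counts = Counter(names)

# Ties keep first-encountered order.
ranked = counts.most_common()
print(ranked)

# Optional: sort by count (descending), then by name (ascending) to break ties.
ranked_alphabetical = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
print(ranked_alphabetical)
```

The explicit tie-break makes the ranking independent of where a name first appears in the text, which can matter when the top-10 cutoff falls in the middle of a group of tied names.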


In [57]:
import pandas as pd

persons_counted = Counter(list_persons)

df = pd.DataFrame(persons_counted.most_common(), columns=['person', 'count'])

df[0:10] # We show the first ten rows of our dataframe, i.e. the ten highest counts

Unnamed: 0,person,count
0,Liang,35
1,Liang Wenfeng,18
2,OpenAI,17
3,Trump,12
4,Sam Altman,12
5,Xi,11
6,Donald Trump,11
7,Kevin,10
8,Biden,8
9,Marc Andreessen,8


In [77]:
from collections import Counter

# Merge counts manually in list_persons: map short forms and misspellings
# to one canonical name per person.
canonical_names = {
    "Liang": "Liang Wenfeng",
    "Liang Wenfung": "Liang Wenfeng",  # misspelling in the source text
    "Trump": "Donald Trump",
    "Xi": "Xi Jinping",
}

# Replace every name that has a canonical form; leave all other names as they are
list_persons = [canonical_names.get(name, name) for name in list_persons]

# Print the updated top 10 person named entities and their frequencies
top_10_persons = Counter(list_persons).most_common(10)
print("Top 10 person named entities and their frequencies: ", top_10_persons)


Top 10 person named entities and their frequencies:  [('Liang Wenfeng', 54), ('Donald Trump', 23), ('Xi Jinping', 19), ('OpenAI', 17), ('Sam Altman', 12), ('Kevin', 10), ('Biden', 8), ('Marc Andreessen', 8), ('Sputnik', 7), ('GPT-4', 7)]
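
The manual merge above still leaves variants such as "Xi Jinping's", "Xi JinPing", "Liang Wenfung's" and "Kier Starmer" as separate entries in the frequency list. A minimal sketch of one way to catch these is to strip a trailing possessive and look names up case-insensitively. The dictionary CANONICAL and the function canonical_name are hypothetical and would need to be extended by hand for the full list.

```python
from collections import Counter

# Variants copied from the frequency list above.
sample_persons = ["Xi", "Xi Jinping's", "Xi JinPing", "Liang Wenfung's",
                  "Liang", "Kier Starmer", "Keir Starmer"]

# Hypothetical lookup from a lowercased variant to the canonical name.
CANONICAL = {
    "xi": "Xi Jinping",
    "xi jinping": "Xi Jinping",
    "liang": "Liang Wenfeng",
    "liang wenfung": "Liang Wenfeng",
    "kier starmer": "Keir Starmer",
}


def canonical_name(name: str) -> str:
    """Strip a trailing possessive 's, then look the name up case-insensitively."""
    if name.endswith("'s"):
        name = name[:-2]
    return CANONICAL.get(name.lower(), name)


merged = [canonical_name(name) for name in sample_persons]
print(Counter(merged).most_common())
```

Mapping a bare surname such as "Liang" to one person is an assumption that holds only if no other Liang appears in the corpus, so it is worth checking the surrounding sentences before merging.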


In [79]:
persons_counted = Counter(list_persons)

df = pd.DataFrame(persons_counted.most_common(), columns=['person', 'count'])

df[0:10]

Unnamed: 0,person,count
0,Liang Wenfeng,54
1,Donald Trump,23
2,Xi Jinping,19
3,OpenAI,17
4,Sam Altman,12
5,Kevin,10
6,Biden,8
7,Marc Andreessen,8
8,Sputnik,7
9,GPT-4,7


In [84]:
# List to store adjectives
adjectives = []

# Loop through the tokens in the processed document
for token in doc_all_text_cleaned:
    # Check if the token is an adjective (POS tag "ADJ")
    if token.pos_ == "ADJ":
        adjectives.append(token.text)

# Print the list of adjectives
print("Adjectives in the text:", adjectives[:50])

Adjectives in the text: ['able', 'tech', 'thin', 'unlikely', 'deep', 'obscure', 'ChatGPT', 'concerted', 'future', 'raw', 'artificial', 'latter', 'average', 'more', 'Chinese', 'new', 'scientific', 'Chinese', 'electric', 'visible', 'electric', 'biggest', 'Chinese', 'electric', 'intelligent', 'conventional', 'dark', 'astonishing', 'clean', 'Last', 'artificial', 'more', 'more', 'indigenous', 'electric', 'complete', 'own', 'Tube', 'Chinese', 'extraordinary', 'difficult', 'concerned', 'problematic', 'positive', 'prominent', 'first', 'obvious', 'long', 'national', 'deeper']
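
The list above stores each adjective exactly as it appears in the text, so capitalised and lowercase forms such as 'Last' and 'last' are counted separately. A minimal sketch of case-folding before counting follows; the sample list is made up for illustration. In the notebook itself, appending token.lower_ or token.lemma_ instead of token.text would be one way to get a similar effect.

```python
from collections import Counter

# Hypothetical sample in the style of the adjective list above.
sample_adjectives = ["Last", "last", "Chinese", "Chinese", "electric", "last"]

# Counting the raw strings treats "Last" and "last" as different adjectives.
raw_counts = Counter(sample_adjectives)

# Lowercasing first merges the case variants into one entry.
folded_counts = Counter(adjective.lower() for adjective in sample_adjectives)

print(raw_counts.most_common())
print(folded_counts.most_common())
```

The trade-off is that proper adjectives such as 'Chinese' are lowercased as well, which may be unwanted in a table meant for readers.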


In [86]:
adj_counted = Counter(adjectives)

df = pd.DataFrame(adj_counted.most_common(), columns=['adjective', 'count'])

df[0:50]

Unnamed: 0,adjective,count
0,Chinese,236
1,more,97
2,other,90
3,new,87
4,many,80
5,American,68
6,last,67
7,artificial,61
8,powerful,56
9,open,56
