# Introduction to the project

This project investigates the public statements issued by the Chinese Ministry of Foreign Affairs (https://www.fmprc.gov.cn/mfa_eng/) in December 2023. It examines which countries or regions China responded to during the month, and the opinions and attitudes expressed in those responses.

# Web scraping

## Introduction to the libraries
- requests:
Used for making HTTP requests to fetch data from web servers.
- BeautifulSoup:
A library for parsing HTML and XML documents, commonly used in web scraping.
- csv:
A built-in module for reading and writing CSV files, simplifying handling tabular data.
- urllib.parse:
Provides functions for parsing and manipulating URLs.
- re:
The regular expressions module for powerful string pattern matching and manipulation.
- os:
Interacts with the operating system, often used for file and directory manipulation.
- unicodedata:
Provides access to the Unicode Character Database, useful for working with Unicode characters and strings.

In [3]:
import requests
from bs4 import BeautifulSoup
import csv
import urllib.parse
import re
import os
import unicodedata

First, we collect the data from the website (https://www.fmprc.gov.cn/eng/xwfw_665399/s2510_665401/2511_665403/) and save it in a CSV file.

In [4]:
def scrape_and_save_news(url, output_file):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")

    news_list = soup.find("div", class_="newsLst_mod").find_all("li")

    data = []

    base_url = "https://www.fmprc.gov.cn/eng/xwfw_665399/s2510_665401/2511_665403/"

    for news_item in news_list:
        title = news_item.find("a").text.strip()
        news_url_relative = news_item.find("a")['href']

        # Splice the relative path and the base URL to get the complete news link
        news_url = urllib.parse.urljoin(base_url, news_url_relative)

        # Get the content of the news content page
        content_response = requests.get(news_url)
        content_soup = BeautifulSoup(content_response.text, "html.parser")

        # Get news content
        content_paragraphs = content_soup.find("div", class_="content").find_all("p")  
        content_text = " ".join([p.text.strip() for p in content_paragraphs])

        # Add title and content to data list
        data.append({"Title": title, "Content": content_text})

    # Write data to CSV file
    with open(output_file, 'w', newline='', encoding='utf-8') as csv_file:
        fieldnames = ["Title", "Content"]
        writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(data)

# Replace with actual URL and output filename
url_to_scrape = "https://www.fmprc.gov.cn/eng/xwfw_665399/s2510_665401/2511_665403/index.html"
output_csv_file = "news_data.csv"

scrape_and_save_news(url_to_scrape, output_csv_file)
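
The link-splicing step relies on `urllib.parse.urljoin`, which resolves each `href` against the base URL of the listing page. The following is a minimal sketch of how that resolution behaves; the `href` values are hypothetical examples, not paths taken from the site.

```python
from urllib.parse import urljoin

base_url = "https://www.fmprc.gov.cn/eng/xwfw_665399/s2510_665401/2511_665403/"

# Hypothetical href values of the kinds a listing page might contain
hrefs = [
    "./202312/t20231229_00000000.html",   # relative to the current directory
    "../index.html",                      # one directory up
    "https://example.com/absolute.html",  # already absolute, returned unchanged
]

# Resolve every href against the base URL
resolved = [urljoin(base_url, href) for href in hrefs]

for href, full_url in zip(hrefs, resolved):
    print(f"{href} -> {full_url}")
```

The trailing slash on `base_url` matters: without it, `urljoin` treats the last path segment as a file name and replaces it instead of appending to it.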


Because this produces a large amount of data, we keep only the items whose title contains "December".

In [5]:
def scrape_and_save_december_news(url, output_file):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")

    news_list = soup.find("div", class_="newsLst_mod").find_all("li")

    data = []

    base_url = "https://www.fmprc.gov.cn/eng/xwfw_665399/s2510_665401/2511_665403/"

    for news_item in news_list:
        title = news_item.find("a").text.strip()
        
        if "December" in title:
            news_url_relative = news_item.find("a")['href']
            
            # Splice the relative path and the base URL to get the complete news link
            news_url = urllib.parse.urljoin(base_url, news_url_relative)

            # Get the content of the news content page
            content_response = requests.get(news_url)
            content_soup = BeautifulSoup(content_response.text, "html.parser")
            content_paragraphs = content_soup.find("div", class_="content").find_all("p") 
            content_text = " ".join([p.text.strip() for p in content_paragraphs])

            # Add title and content to data list
            data.append({"Title": title, "Content": content_text})

    with open(output_file, 'w', newline='', encoding='utf-8') as csv_file:
        fieldnames = ["Title", "Content"]
        writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(data)

url_to_scrape = "https://www.fmprc.gov.cn/eng/xwfw_665399/s2510_665401/2511_665403/index.html"
output_csv_file = "december_news_data.csv"

scrape_and_save_december_news(url_to_scrape, output_csv_file)
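
The `"December" in title` check is a plain substring test, so it would also match a title that merely mentions December. A stricter variant could parse the date out of the title, which would also give a value for a date column in the metadata. The sketch below assumes titles end in a phrase such as "on December 29, 2023"; that format, the function name `december_date`, and the sample titles are assumptions made for illustration.

```python
import re
from datetime import date

# Assumed title pattern: "... on December 29, 2023"
DATE_PATTERN = re.compile(r"\bon December (\d{1,2}), (\d{4})\b")

def december_date(title):
    """Return the December date found in the title, or None if there is none."""
    match = DATE_PATTERN.search(title)
    if match is None:
        return None
    day, year = int(match.group(1)), int(match.group(2))
    try:
        return date(year, 12, day)
    except ValueError:
        # Day number out of range, e.g. "December 45"
        return None

# Hypothetical titles for illustration
briefing = "Foreign Ministry Spokesperson Mao Ning's Regular Press Conference on December 29, 2023"
other = "Remarks on the December Agreement"

print(december_date(briefing))
print(december_date(other))
```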


# Data cleaning

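Clean up the text, keeping only ASCII letters, spaces, and basic punctuation (. , ! ?). Note that this also removes digits, apostrophes, and hyphens, which is why forms such as "Chinas" and "LancangMekong" appear in the outputs further down.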

In [6]:
def clean_text(text):
    cleaned_text = re.sub(r'[^a-zA-Z.,!? ]', '', text)
    return cleaned_text

def clean_and_save_to_csv(input_csv_file, output_csv_file):
    data = []

    with open(input_csv_file, 'r', encoding='utf-8') as csv_file:
        reader = csv.DictReader(csv_file)
        for row in reader:
            title = row["Title"]
            content = row["Content"]

            cleaned_content_text = clean_text(content)

            data.append({"Title": title, "Content": cleaned_content_text})

    with open(output_csv_file, 'w', newline='', encoding='utf-8') as csv_file:
        fieldnames = ["Title", "Content"]
        writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(data)

input_csv_file = "december_news_data.csv"
output_csv_file = "cleaned_data.csv"

clean_and_save_to_csv(input_csv_file, output_csv_file)
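
To make the effect of the cleaning pattern concrete, the sketch below applies it to a sample sentence and compares it with a gentler variant that keeps digits, apostrophes, and hyphens. The function `clean_text_gentler` and the sample sentence are hypothetical; they are one possible alternative, not the method used in this project.

```python
import re
import unicodedata

def clean_text(text):
    # Same pattern as the notebook: keep ASCII letters, spaces, and . , ! ?
    return re.sub(r'[^a-zA-Z.,!? ]', '', text)

def clean_text_gentler(text):
    # Hypothetical alternative: normalize Unicode, straighten curly quotes,
    # then also keep digits, apostrophes, and hyphens
    text = unicodedata.normalize("NFKC", text)
    text = text.replace("\u2019", "'").replace("\u2018", "'")
    return re.sub(r"[^a-zA-Z0-9.,!?'\- ]", "", text)

# Sample sentence with a curly apostrophe, a number, and a hyphen
sample = "China\u2019s 10th Lancang-Mekong summit."

strict_result = clean_text(sample)
gentle_result = clean_text_gentler(sample)

print(strict_result)
print(gentle_result)
```

Keeping apostrophes and digits would avoid tokens such as "chinas" and "th" later in the frequency analysis, at the cost of slightly noisier text.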


Split the text content into separate TXT files, one per briefing, named in the format "txt_01.txt".

In [7]:


def save_to_txt(title, content, output_folder, file_count):

    if not os.path.exists(output_folder):
        os.makedirs(output_folder)

    filename = f"txt_{file_count:02d}.txt"

    with open(os.path.join(output_folder, filename), 'w', encoding='utf-8') as txt_file:
        txt_file.write(f"Title: {title}\n\n")
        txt_file.write(f"Content:\n{content}")

def split_csv_to_txt(input_csv_file, output_folder):
    with open(input_csv_file, 'r', encoding='utf-8') as csv_file:
        reader = csv.DictReader(csv_file)
        for idx, row in enumerate(reader, start=1):
            title = row["Title"]
            content = row["Content"]

            cleaned_content_text = clean_text(content)

            save_to_txt(title, cleaned_content_text, output_folder, idx)

input_csv_file = "cleaned_data.csv"
output_folder = "output_txt_files"

split_csv_to_txt(input_csv_file, output_folder)
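
The `{file_count:02d}` format pads the counter with a leading zero. A short sketch of why that helps: `os.listdir` returns names in arbitrary order, and sorting names alphabetically only matches numeric order when the numbers are zero-padded. The helper `txt_filename` below is a hypothetical stand-in for the naming logic in `save_to_txt`.

```python
def txt_filename(file_count, width=2):
    # Zero-pad the counter to a fixed width, e.g. 1 -> "txt_01.txt"
    return f"txt_{file_count:0{width}d}.txt"

# Alphabetical sort without padding puts 10 before 2
unpadded = sorted(f"txt_{n}.txt" for n in (1, 2, 10))

# With padding, alphabetical order and numeric order agree
padded = sorted(txt_filename(n) for n in (1, 2, 10))

print(unpadded)
print(padded)
```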


# Install spaCy

- Install and Import Libraries: The code installs spaCy and Plotly libraries using %pip install and imports necessary packages.

- Install English Language Model: It downloads and installs the English language model for spaCy using !python -m spacy download en_core_web_sm.

- Load spaCy Model: It loads the English language model into spaCy using nlp = spacy.load("en_core_web_sm").

- Import spaCy Visualizer: The code imports the spaCy visualizer displacy for later use in visualizing text annotations.

- Import Other Libraries: It imports additional libraries such as os for file handling, pandas for data manipulation, and Plotly for graphing.


In [8]:
# Install and import spacy and plotly.
%pip install spaCy
%pip install plotly
%pip install nbformat --upgrade

Note: you may need to restart the kernel to use updated packages.


In [9]:
# Import spacy
import spacy

# Install English language model
!python -m spacy download en_core_web_sm

# Import os to upload documents and metadata
import os

# Load spaCy model
nlp = spacy.load("en_core_web_sm")

# Load spaCy visualizer
from spacy import displacy

# Import pandas DataFrame packages
import pandas as pd
pd.options.mode.chained_assignment = None  # default='warn'

# Import graphing package
import plotly.graph_objects as go
import plotly.express as px


Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [10]:
# Create empty lists for file names and contents
texts = []
file_names = []

# Iterate through each file in the folder
# (note: this folder name differs from the 'output_txt_files' folder written above)
for _file_name in os.listdir('cleaned_txt_files'):
    # Look for only text files
    if _file_name.endswith('.txt'):
        # Append contents of each text file to text list
        with open(os.path.join('cleaned_txt_files', _file_name), 'r', encoding='utf-8') as f:
            texts.append(f.read())
        # Append name of each file to file name list
        file_names.append(_file_name)

In [11]:
# Create dictionary object associating each file name with its text
d = {'Filename':file_names,'Text':texts}

In [12]:
# Turn dictionary into a dataframe
News_df = pd.DataFrame(d)

In [13]:
# Remove extra whitespace from the texts
News_df['Text'] = News_df['Text'].str.replace(r'\s+', ' ', regex=True).str.strip()
News_df.head()

Unnamed: 0,Filename,Text
0,txt_11.txt,We noted that General Secretary and President ...
1,txt_05.txt,CCTV This year marks the tenth anniversary of ...
2,txt_04.txt,China News Service Premier Li Qiang attended t...
3,txt_10.txt,At the invitation of Premier of the State Coun...
4,txt_06.txt,The fourth LancangMekong Cooperation LMC Leade...


In [14]:
# Load metadata.
metadata_df = pd.read_csv('metafile.csv')
metadata_df.head()

Unnamed: 0,Paper_ID,Title,Date,Spokesman
0,txt_01,Foreign Ministry Spokesperson Mao Nings Regula...,29-Dec-23,Mao Ning
1,txt_02,Foreign Ministry Spokesperson Mao Nings Regula...,28-Dec-23,Mao Ning
2,txt_03,Foreign Ministry Spokesperson Mao Nings Regula...,27-Dec-23,Mao Ning
3,txt_04,Foreign Ministry Spokesperson Mao Nings Regula...,26-Dec-23,Mao Ning
4,txt_05,Foreign Ministry Spokesperson Mao Nings Regula...,25-Dec-23,Mao Ning


In [15]:
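# Remove the .txt extension from each filename
# (regex=False so that '.' is treated literally rather than as "any character")
News_df['Filename'] = News_df['Filename'].str.replace('.txt', '', regex=False)

# Rename the Paper_ID column to Filename so that it matches News_df
metadata_df.rename(columns={"Paper_ID": "Filename"}, inplace=True)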
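
Whether `'.txt'` is read as a regular expression or as a literal string changes what gets removed, because an unescaped `.` in a regex matches any character. The sketch below shows the difference using only the standard library; the filename `mytxt_01.txt` is a hypothetical example chosen to expose the pitfall.

```python
import os
import re

filename = "mytxt_01.txt"  # hypothetical filename chosen to expose the pitfall

# Unescaped: '.' matches any character, so "ytxt" is removed as well
as_regex = re.sub(".txt", "", filename)

# Escaped and anchored: only the extension at the end is removed
escaped = re.sub(r"\.txt$", "", filename)

# Without regular expressions: split off the extension
stem = os.path.splitext(filename)[0]

print(as_regex, escaped, stem)
```

With names such as `txt_01.txt` all three approaches give `txt_01`, so the distinction only matters once filenames contain "txt" preceded by another character.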

In [16]:
# Merge metadata and texts into a new DataFrame
# Will only keep rows where both text and metadata are present
final_News_df = metadata_df.merge(News_df,on='Filename')

In [17]:
# Print DataFrame
final_News_df.head()

Unnamed: 0,Filename,Title,Date,Spokesman,Text
0,txt_01,Foreign Ministry Spokesperson Mao Nings Regula...,29-Dec-23,Mao Ning,"CNR This year, President Xi Jinping visited As..."
1,txt_02,Foreign Ministry Spokesperson Mao Nings Regula...,28-Dec-23,Mao Ning,AFP Chinas Embassy in Myanmar today reminded a...
2,txt_03,Foreign Ministry Spokesperson Mao Nings Regula...,27-Dec-23,Mao Ning,CCTV This year marks the th anniversary of the...
3,txt_04,Foreign Ministry Spokesperson Mao Nings Regula...,26-Dec-23,Mao Ning,China News Service Premier Li Qiang attended t...
4,txt_05,Foreign Ministry Spokesperson Mao Nings Regula...,25-Dec-23,Mao Ning,CCTV This year marks the tenth anniversary of ...


# Text Enrichment with spaCy

## Creating Doc Objects

In [18]:
# Load nlp pipeline
nlp = spacy.load('en_core_web_sm')

# Check what functions it performs
print(nlp.pipe_names)

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']


In [19]:
# Define a function that runs the nlp pipeline on any given input text
def process_text(text):
    return nlp(text)

In [20]:
# Apply the function to the "Text" column, so that the nlp pipeline is called on each briefing transcript
final_News_df['Doc'] = final_News_df['Text'].apply(process_text)

# Text Reduction

## Tokenization

A critical first step spaCy performs is tokenization, or the segmentation of strings into individual words and punctuation markers. Tokenization enables spaCy to parse the grammatical structures of a text and identify characteristics of each word, such as its part of speech.

To retrieve a tokenized version of each text in the DataFrame, we’ll write a function that iterates through any given Doc object and returns all tokens found within it.

In [21]:
# Define a function to retrieve tokens from a doc object
def get_token(doc):
    return [(token.text) for token in doc]

In [22]:
# Run the token retrieval function on the doc objects in the dataframe
final_News_df['Tokens'] = final_News_df['Doc'].apply(get_token)
final_News_df.head()

Unnamed: 0,Filename,Title,Date,Spokesman,Text,Doc,Tokens
0,txt_01,Foreign Ministry Spokesperson Mao Nings Regula...,29-Dec-23,Mao Ning,"CNR This year, President Xi Jinping visited As...","(CNR, This, year, ,, President, Xi, Jinping, v...","[CNR, This, year, ,, President, Xi, Jinping, v..."
1,txt_02,Foreign Ministry Spokesperson Mao Nings Regula...,28-Dec-23,Mao Ning,AFP Chinas Embassy in Myanmar today reminded a...,"(AFP, Chinas, Embassy, in, Myanmar, today, rem...","[AFP, Chinas, Embassy, in, Myanmar, today, rem..."
2,txt_03,Foreign Ministry Spokesperson Mao Nings Regula...,27-Dec-23,Mao Ning,CCTV This year marks the th anniversary of the...,"(CCTV, This, year, marks, the, th, anniversary...","[CCTV, This, year, marks, the, th, anniversary..."
3,txt_04,Foreign Ministry Spokesperson Mao Nings Regula...,26-Dec-23,Mao Ning,China News Service Premier Li Qiang attended t...,"(China, News, Service, Premier, Li, Qiang, att...","[China, News, Service, Premier, Li, Qiang, att..."
4,txt_05,Foreign Ministry Spokesperson Mao Nings Regula...,25-Dec-23,Mao Ning,CCTV This year marks the tenth anniversary of ...,"(CCTV, This, year, marks, the, tenth, annivers...","[CCTV, This, year, marks, the, tenth, annivers..."


In [23]:
tokens = final_News_df[['Text', 'Tokens']].copy()
tokens.head()

Unnamed: 0,Text,Tokens
0,"CNR This year, President Xi Jinping visited As...","[CNR, This, year, ,, President, Xi, Jinping, v..."
1,AFP Chinas Embassy in Myanmar today reminded a...,"[AFP, Chinas, Embassy, in, Myanmar, today, rem..."
2,CCTV This year marks the th anniversary of the...,"[CCTV, This, year, marks, the, th, anniversary..."
3,China News Service Premier Li Qiang attended t...,"[China, News, Service, Premier, Li, Qiang, att..."
4,CCTV This year marks the tenth anniversary of ...,"[CCTV, This, year, marks, the, tenth, annivers..."


# ❗️❗️Research Question

# What are the most frequent words used in the briefings? Do they express China's stance on international affairs?

In [28]:
from collections import Counter
from nltk.corpus import stopwords
import string
import nltk

# Download NLTK stopwords if not already downloaded
nltk.download('stopwords')

# Get the English stop words
stop_words = set(stopwords.words('english'))

# Assuming 'Tokens' column contains a list of tokens for each document
all_tokens = [token for tokens_list in final_News_df['Tokens'] for token in tokens_list]

# Remove stop words and punctuation
filtered_tokens = [token.lower() for token in all_tokens if token.lower() not in stop_words and token not in string.punctuation]

# Analyze word frequency
word_frequency = Counter(filtered_tokens)

# Get the top N words
N = 10  # Replace with the desired number of top words
top_words = word_frequency.most_common(N)

# Display the top words
print(top_words)


[('china', 629), ('cooperation', 280), ('countries', 278), ('chinas', 241), ('two', 172), ('international', 167), ('president', 162), ('development', 161), ('us', 155), ('chinese', 153)]


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/josiechen/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
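
Two details of the filtering step affect the counts. Lowercasing every token merges "US" with the pronoun "us", and `token not in string.punctuation` is a substring test, so a multi-character token such as "..." is not filtered out. The sketch below shows one way to handle both; the token list, the tiny stop word set, and the function names are assumptions made for illustration (the notebook itself uses NLTK's stop word list).

```python
from collections import Counter
import string

# Hypothetical token list standing in for one row of the 'Tokens' column
tokens = ["The", "US", "told", "us", "...", "China", "and", "the", "US", "cooperate", "."]

# Tiny illustrative stop word set (an assumption, not NLTK's list)
stop_words = {"the", "and"}

def is_punctuation(token):
    """True when every character of the token is ASCII punctuation."""
    return all(ch in string.punctuation for ch in token)

def count_words(tokens, stop_words):
    counts = Counter()
    for token in tokens:
        if is_punctuation(token) or token.lower() in stop_words:
            continue
        # Keep all-uppercase tokens such as "US" as they are,
        # so that they are not merged with lowercase words like "us"
        key = token if token.isupper() else token.lower()
        counts[key] += 1
    return counts

word_counts = count_words(tokens, stop_words)
print(word_counts.most_common(3))
```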


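## Conclusion

Besides 'china' (629), the most frequent terms are 'cooperation' (280) and 'countries' (278); 'international' (167) and 'development' (161) also appear in the top ten. This suggests China's desire for friendly relations with other countries and the elimination of prejudice and discrimination, while also emphasizing its own stance and safeguarding the interests of the Chinese people. Two entries need care when interpreting the list: 'chinas' (241) is "China's" with the apostrophe removed by the cleaning step, and 'us' (155) may combine the pronoun "us" with "US", since all tokens were lowercased.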

## Lemmatization

Another process performed by spaCy is lemmatization, or the retrieval of the dictionary root word of each word (for example “brighten” for “brightening”). We’ll perform a similar set of steps to those above to create a function to call the lemmas from the Doc object, then apply it to the DataFrame.

In [None]:
# Define a function to retrieve lemmas from a doc object
def get_lemma(doc):
    return [(token.lemma_) for token in doc]

# Run the lemma retrieval function on the doc objects in the dataframe
final_News_df['Lemmas'] = final_News_df['Doc'].apply(get_lemma)

## Text Annotation

## Part of Speech Tagging

spaCy facilitates two levels of part-of-speech tagging: coarse-grained tagging, which predicts the simple universal part-of-speech of each token in a text (such as noun, verb, adjective, adverb), and detailed tagging, which uses a larger, more fine-grained set of part-of-speech tags (for example 3rd person singular present verb). The part-of-speech tags used are determined by the English language model we use. In this case, we’re using the small English model, and you can explore the differences between the models on spaCy’s website.

We can call the part-of-speech tags in the same way as the lemmas. Create a function to extract them from any given Doc object and apply the function to each Doc object in the DataFrame. The function we’ll create will extract both the coarse- and fine-grained part-of-speech for each token (token.pos_ and token.tag_, respectively).

In [None]:
# Define a function to retrieve parts of speech from a doc object
def get_pos(doc):
    # Return the coarse- and fine-grained part-of-speech tags for each token in the doc
    return [(token.pos_, token.tag_) for token in doc]

# Run the part-of-speech retrieval function on the doc objects in the dataframe
final_News_df['POS'] = final_News_df['Doc'].apply(get_pos)

In [None]:
# Define function to extract proper nouns from Doc object
def extract_proper_nouns(doc):
    return [token.text for token in doc if token.pos_ == 'PROPN']

# Apply function to Doc column and store resulting proper nouns in new column
final_News_df['Proper_Nouns'] = final_News_df['Doc'].apply(extract_proper_nouns)

## Named Entity Recognition

spaCy can tag named entities in the text, such as names, dates, organizations, and locations. Call the full list of named entities and their descriptions using this code:

In [None]:
# Get all NE labels and assign to variable
labels = nlp.get_pipe("ner").labels

# Print each label and its description
for label in labels:
    print(label + ' : ' + spacy.explain(label))

Let's check the named entity labels found in each full text:

In [None]:
# Define function to extract named entities from doc objects
def extract_named_entities(doc):
    return [ent.label_ for ent in doc.ents]

# Apply function to Doc column and store resulting named entities in new column
final_News_df['Named_Entities'] = final_News_df['Doc'].apply(extract_named_entities)
final_News_df['Named_Entities']

We can add another column with the words and phrases identified as named entities:

In [None]:
# Define function to extract the text of each named entity from doc objects
# (named differently from the label-extraction function above so it does not overwrite it)
def extract_named_entity_words(doc):
    return [ent.text for ent in doc.ents]

# Apply function to Doc column and store resulting text in new column
final_News_df['NE_Words'] = final_News_df['Doc'].apply(extract_named_entity_words)
final_News_df['NE_Words']

Let’s visualize the words and their named entity tags in a single text. Call the first text’s Doc object and use displacy.render to visualize the text with the named entities highlighted and tagged:

In [None]:
# Extract the first Doc object (index 0)
doc = final_News_df['Doc'][0]

# Visualize named entity tagging in a single paper
displacy.render(doc, style='ent', jupyter=True)

# Download Enriched Dataset

In [None]:
# Save the enriched DataFrame as a CSV file in the working directory
final_News_df.to_csv('december_news_with_spaCy_tags.csv')

# Analysis of Linguistic Annotations

## Part of Speech Analysis

spaCy counts the number of each part-of-speech tag that appears in each document (for example the number of times the NOUN tag appears in a document). This is called using doc.count_by(spacy.attrs.POS). Here’s how it works on a single sentence:

In [None]:
# Create doc object from single sentence
doc = nlp("This is 'an' example? sentence")

# Print counts of each part of speech in sentence
print(doc.count_by(spacy.attrs.POS))

In [None]:
# Store dictionary with indexes and POS counts in a variable
num_pos = doc.count_by(spacy.attrs.POS)

dictionary = {}

# Create a new dictionary which replaces the index of each part of speech for its label (NOUN, VERB, ADJECTIVE)
for k,v in sorted(num_pos.items()):
  dictionary[doc.vocab[k].text] = v

dictionary

In [None]:
# Create new DataFrame for analysis purposes
pos_analysis_df = final_News_df[['Filename', 'Spokesman', 'Doc']].copy()

# Create list to store each dictionary
num_list = []

# Define a function that maps each part-of-speech label to its count in a doc
# and appends the resulting dictionary to num_list
def get_pos_tags(doc):
    dictionary = {}
    num_pos = doc.count_by(spacy.attrs.POS)
    for k, v in sorted(num_pos.items()):
        dictionary[doc.vocab[k].text] = v
    num_list.append(dictionary)

# Call the function on each doc object; the counts are collected in num_list
# (the function returns None, so there is nothing to store in a column)
for doc in pos_analysis_df['Doc']:
    get_pos_tags(doc)

# ❗️❗️Research question:  

## Do spokespersons Mao Ning and Wang Wenbin use certain parts of speech more frequently?

In [None]:
# Create new dataframe with part of speech counts
pos_counts = pd.DataFrame(num_list)
columns = list(pos_counts.columns)

# Add the spokesperson of each briefing as a new column to the dataframe
idx = 0
new_col = pos_analysis_df['Spokesman']
pos_counts.insert(loc=idx, column='Spokesman', value=new_col)

pos_counts

In [None]:
# Get average part of speech counts per briefing for each spokesperson
average_pos_df = pos_counts.groupby(['Spokesman']).mean()

# Round calculations to the nearest whole number
average_pos_df = average_pos_df.round(0)

# Reset index to improve DataFrame readability
average_pos_df = average_pos_df.reset_index()

# Show dataframe
average_pos_df

In the part-of-speech statistics for December, Wang Wenbin's averages exceed Mao Ning's in various categories. Even though the two gave an equal number of briefings, his higher per-briefing counts indicate longer briefings on average, which may reflect a more diverse vocabulary and a broader range of topics.
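
Raw averages grow with the length of a briefing, so one way to compare speakers more fairly is to look at each tag's share of that speaker's total tokens. The sketch below illustrates the idea with plain dictionaries; the speaker labels, the counts, and the function name are made up for illustration and are not results from this dataset.

```python
from collections import defaultdict

# Hypothetical per-briefing POS counts (made-up numbers, for illustration only)
records = [
    {"Spokesman": "A", "NOUN": 300, "VERB": 150, "ADJ": 50},
    {"Spokesman": "A", "NOUN": 100, "VERB": 60, "ADJ": 40},
    {"Spokesman": "B", "NOUN": 600, "VERB": 300, "ADJ": 100},
]

def pos_shares_by_speaker(records):
    """Return, for each speaker, the share of tokens carrying each POS tag."""
    totals = defaultdict(lambda: defaultdict(int))
    for record in records:
        speaker = record["Spokesman"]
        for tag, count in record.items():
            if tag != "Spokesman":
                totals[speaker][tag] += count

    shares = {}
    for speaker, tag_counts in totals.items():
        n_tokens = sum(tag_counts.values())
        shares[speaker] = {tag: count / n_tokens for tag, count in tag_counts.items()}
    return shares

shares = pos_shares_by_speaker(records)
for speaker, tag_shares in shares.items():
    print(speaker, {tag: round(share, 3) for tag, share in tag_shares.items()})
```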

In [None]:
# Use plotly to plot average part-of-speech use per spokesperson
fig = px.bar(average_pos_df, x="Spokesman", y=["ADJ", 'VERB', "NUM","ADV"], title="Average Part-of-Speech Use per Briefing by Spokesperson", barmode='group')
fig.show()

## Conclusion

According to the visualization, there is no significant overall difference in the language preferences of the two spokespersons. Wang Wenbin, relatively speaking, tends to use more nouns and verbs, which may make his briefings come across as more persuasive and robust. This may be related to gender differences, although part-of-speech counts alone cannot establish that.

# Analysis of GPE Named Entities

# ❗️❗️Research question:  

## In the December press briefings of the Ministry of Foreign Affairs, which country or region was mentioned the most?

In [None]:
# Select the document to analyze
doc_to_analyze = final_News_df['Doc'][0]  # Replace 0 with the desired index

# Extract only GPE entities
gpe_entities = [ent.text for ent in doc_to_analyze.ents if ent.label_ == 'GPE']

# Count the frequency of each GPE entity
gpe_entity_counts = {}
for entity in gpe_entities:
    gpe_entity_counts[entity] = gpe_entity_counts.get(entity, 0) + 1

# Print the frequency of each GPE entity
for entity, count in gpe_entity_counts.items():
    print(f'{entity}: {count}')

# Visualize only the GPE entities, highlighted in the original document
# (a Doc rebuilt from the entity strings alone would carry no entity annotations)
displacy.render(doc_to_analyze, style='ent', options={'ents': ['GPE']}, jupyter=True)
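
The cell above counts GPE mentions in a single briefing. Since the research question concerns all December briefings, the same count could be accumulated across every document. The sketch below shows the idea on plain (text, label) pairs; the entity lists are hypothetical, and in the notebook they could be built from each Doc with `[(ent.text, ent.label_) for ent in doc.ents]`.

```python
from collections import Counter

# Hypothetical (text, label) pairs, one inner list per briefing
docs_entities = [
    [("China", "GPE"), ("Myanmar", "GPE"), ("Mao Ning", "PERSON"), ("China", "GPE")],
    [("China", "GPE"), ("Japan", "GPE"), ("December", "DATE")],
]

def count_gpe(docs_entities):
    """Count GPE mentions across all documents."""
    counts = Counter()
    for entities in docs_entities:
        counts.update(text for text, label in entities if label == "GPE")
    return counts

gpe_counts = count_gpe(docs_entities)
print(gpe_counts.most_common(3))
```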


## Conclusion

In the briefing analyzed, several countries and regions were mentioned, with China and Myanmar mentioned the most. This is related to the international situation at the time of the briefings.

# Conclusion

In summary, the analysis of briefings by the two spokespersons reveals a common emphasis on terms such as 'cooperation' and 'development', which suggests China's commitment to fostering friendly relations, eliminating prejudice, and safeguarding its people's interests. The visualization indicates that, overall, there is no significant difference in the language preferences of the two spokespersons, but Wang Wenbin tends to use more nouns and verbs, which may make his briefings more persuasive and is possibly influenced by gender differences.

Furthermore, the frequent mentions of China and Myanmar suggest a connection with international developments at the time. Together, these findings shed light on the spokespersons' communication strategies and the geopolitical context influencing their discourse.