# Question 2 Analysis 3
- Download articles matching the keyword "China" with the NYT Article Search API.
- Compare the articles from 1980 with those from 2010.
- Find the subjects the articles focus on in each year and check whether the focus of the China coverage has changed.
- Generate a CSV file that lists the top subjects of each year side by side.
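
The notebook does not show the download step itself, so the following is only a rough sketch of how one request per year and result page might be built. The endpoint and the `q`, `begin_date`, `end_date`, `page` and `api-key` parameters belong to the Article Search v2 API; the helper name `build_search_url` and the placeholder key are assumptions made for illustration, and the original download may have used different filters.

```python
from urllib.parse import urlencode

# Article Search v2 endpoint (assumption: the notebook does not show the URL it used)
BASE_URL = "https://api.nytimes.com/svc/search/v2/articlesearch.json"


def build_search_url(keyword, year, page, api_key):
    """Build the request URL for one result page of a keyword search in a single year.

    Hypothetical helper, not part of the original notebook.
    """
    params = {
        "q": keyword,                # search term, e.g. "China"
        "begin_date": f"{year}0101",  # dates use the YYYYMMDD format
        "end_date": f"{year}1231",
        "page": page,                # result pages are zero-based
        "api-key": api_key,
    }
    return f"{BASE_URL}?{urlencode(params)}"


# "YOUR_API_KEY" is a placeholder, not a working key
url = build_search_url("China", 1980, 0, "YOUR_API_KEY")
print(url)
```

Each JSON response fetched from such a URL could then be saved as one file under the folder for its year, which is the layout the cells below read from.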

In [1]:
import glob
import os
import json
import csv
from pathlib import Path
from collections import Counter

In [2]:
# To avoid duplicating code, define a helper that collects the subject keywords for one year
def getFocus(path):
    """Return every 'subject' keyword value found in the JSON files matching the glob pattern `path`."""
    files = glob.glob(path)
    focus = []
    for file in files:
        with open(file, 'r', encoding='utf-8', errors='ignore') as f:
            python_data = json.load(f)
        docs = python_data['response']['docs']
        for doc in docs:
            # some documents have no 'keywords' field, or have it set to null
            for key in doc.get('keywords') or []:
                if key['name'] == 'subject':
                    focus.append(key['value'])
    return focus
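
To make the expected file layout concrete, here is a small sketch of the structure that `getFocus` walks through, applied to an in-memory record instead of a file. The sample values are invented for illustration and `subjects_in` is a hypothetical stand-in for the loop body above.

```python
# Hypothetical sample that mimics the part of one response file read by getFocus
sample = {
    "response": {
        "docs": [
            {"keywords": [
                {"name": "subject", "value": "POLITICS AND GOVERNMENT"},
                {"name": "glocations", "value": "CHINA"},  # not a subject, so skipped
            ]},
            {"keywords": None},  # keywords present but null
            {},                  # no keywords field at all
        ]
    }
}


def subjects_in(python_data):
    """Collect the 'subject' keyword values of one parsed response (illustrative helper)."""
    focus = []
    for doc in python_data["response"]["docs"]:
        for key in doc.get("keywords") or []:
            if key["name"] == "subject":
                focus.append(key["value"])
    return focus


print(subjects_in(sample))
```

Only keywords whose `name` is `subject` are kept, so location or person keywords never reach the counters.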

In [8]:
# prepare the variables including paths and counters
p = Path(os.getcwd())
as_1980_path = str(p.parent) + '/data/article_search/1980/*'
as_2010_path = str(p.parent) + '/data/article_search/2010/*'
focus_counter_1980 = Counter()
focus_counter_2010 = Counter()

In [9]:
# call the helper for each year and collect the subjects
subject_list_1980 = getFocus(as_1980_path)
subject_list_2010 = getFocus(as_2010_path)
print(len(subject_list_1980))
print(len(subject_list_2010))

2248
2279


In [10]:
# update both counters; Counter.update accepts the whole list at once
focus_counter_1980.update(subject_list_1980)
focus_counter_2010.update(subject_list_2010)

In [11]:
# write the two rankings to one CSV file, side by side
# most_common(50) returns (subject, frequency) tuples sorted by descending frequency:
# tuple[0] is the subject, tuple[1] is the frequency of that subject
path = 'ana_3/focus_change.csv'
help_list = focus_counter_2010.most_common(50)
# newline='' stops the csv module from writing blank rows on Windows
with open(path, 'w', newline='', encoding='utf-8') as outcsv:
    writer = csv.DictWriter(outcsv, fieldnames=["RANK", "FOCUS_1980", "OCCURANCE_1980", "FOCUS_2010", "OCCURANCE_2010"])
    writer.writeheader()
    # this assumes the 2010 ranking has at least as many entries as the 1980 ranking
    for idx, val in enumerate(focus_counter_1980.most_common(50)):
        writer.writerow({
            'RANK': str(idx + 1),
            'FOCUS_1980': val[0],
            'OCCURANCE_1980': val[1],
            'FOCUS_2010': help_list[idx][0],
            'OCCURANCE_2010': help_list[idx][1],
        })

# Conclusion
- The result, visible in the CSV file, is interesting. In 1980, NYT articles about China focus heavily on politics and government, a subject that ranks only 10th in 2010.
- Building on that example, we can draw several further conclusions:
- The subject NYT covers most for China is politics and government in 1980, and international trade and the world market in 2010. This shift of focus reflects how China developed over these 30 years.
- Economic conditions and trends are a main focus in both years.
- In 2010, "computers and the Internet" appears among the top 3 subjects, yet it is not even in the top 50 for 1980. Computers and the Internet have thus become a prominent topic in recent NYT coverage of China.
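
The comparisons above were read off the CSV by eye. A minimal sketch of how the same checks could be done in code is shown below. The helpers `rank_table` and `new_subjects` are hypothetical, and the toy counts are invented for illustration; they are not the notebook's data.

```python
from collections import Counter


def rank_table(counter, top_n=50):
    """Map each of the top_n subjects to its 1-based rank."""
    ranked = counter.most_common(top_n)
    return {subject: rank for rank, (subject, _) in enumerate(ranked, start=1)}


def new_subjects(old_counter, new_counter, top_n=50):
    """List subjects in the new top_n that are absent from the old top_n, in rank order."""
    old_ranks = rank_table(old_counter, top_n)
    return [s for s, _ in new_counter.most_common(top_n) if s not in old_ranks]


# Toy counts, invented for illustration only
toy_1980 = Counter({
    "POLITICS AND GOVERNMENT": 5,
    "ECONOMIC CONDITIONS AND TRENDS": 3,
})
toy_2010 = Counter({
    "INTERNATIONAL TRADE AND WORLD MARKET": 6,
    "COMPUTERS AND THE INTERNET": 5,
    "ECONOMIC CONDITIONS AND TRENDS": 4,
    "POLITICS AND GOVERNMENT": 1,
})

print(new_subjects(toy_1980, toy_2010))
print(rank_table(toy_2010)["POLITICS AND GOVERNMENT"])
```

With the real data, passing `focus_counter_1980` and `focus_counter_2010` to `new_subjects` would list every 2010 top-50 subject that is missing from the 1980 top 50.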