<a href="https://colab.research.google.com/github/Addy-mufc/Kissinger-Travel-from-Apollo-docs/blob/main/Copy_of_INFM_603(Assignment_1).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**For INFM 603 students:** This notebook illustrates string processing operations in Python.

**Summary:** We'll first load a file of State Department cables (i.e., telegram messages) from the 1970s that the National Archives and Records Administration makes available.  We'll then use Python to find some interesting things in that collection.

**Getting started:** Click `File` -> `Save a copy in Drive`. This puts a copy of this demo notebook in your Google Drive, in a directory called `Colab Notebooks` that is created automatically. We will each work with our own copy of the notebook.


First we need to download the collection.  We'll do this by using the Unix wget command to download a zip file, and then the Unix unzip command.  To use a Unix command in a Colab notebook, just precede it with a !

In [None]:
!wget https://users.umiacs.umd.edu/~oard/cables.zip
print('starting unzip')
!unzip -u -q cables.zip
print('unzip complete, files stored in cables/')

--2022-09-12 13:31:17--  https://users.umiacs.umd.edu/~oard/cables.zip
Resolving users.umiacs.umd.edu (users.umiacs.umd.edu)... 128.8.120.33
Connecting to users.umiacs.umd.edu (users.umiacs.umd.edu)|128.8.120.33|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 509664391 (486M) [application/zip]
Saving to: ‘cables.zip’


2022-09-12 13:31:36 (25.3 MB/s) - ‘cables.zip’ saved [509664391/509664391]

starting unzip
unzip complete, files stored in cables/


Now let's open one file from the collection. Each file contains multiple messages: each message is a document enclosed in `<sasdoc></sasdoc>` tags, and inside each document are `<msgtext>` and `<subject>` tags. Look at some of the subject lines to get a sense for what's in the collection.
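To make that structure concrete before opening the real file, here is a minimal sketch on a tiny hand-written sample. The root element name `cables`, the sample text, and the names `sample_xml` and `sample_root` are assumptions made for illustration; only the `<sasdoc>`, `<subject>`, and `<msgtext>` tag names come from the collection.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-message sample; the real files are far larger and
# may contain additional fields.
sample_xml = """
<cables>
  <sasdoc>
    <subject>CIVIL AVIATION SECURITY</subject>
    <msgtext>UNCLASSIFIED PAGE 01</msgtext>
  </sasdoc>
  <sasdoc>
    <subject>OMEGA AUSTRALIA</subject>
    <msgtext>CONFIDENTIAL PAGE 01</msgtext>
  </sasdoc>
</cables>
"""

sample_root = ET.fromstring(sample_xml)

# iter() walks the whole tree and yields every element with the given tag.
subjects = [subject.text for subject in sample_root.iter('subject')]
print(subjects)

# Indexing the root returns its n-th child, here the second <sasdoc>;
# find() then returns that document's first <msgtext> child.
second_text = sample_root[1].find('msgtext').text
print(second_text)
```

The same two calls, `iter()` and `find()`, are all the cells in this notebook use to pull fields out of the real files.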

In [None]:
import xml.etree.ElementTree as ET
import random

# Parse one month of cables into an element tree.
tree = ET.parse('cables/CFPF.TEL.APR73.PU')
root = tree.getroot()

# Print the subject lines of ten randomly chosen cables.
print('Ten random cables')
for subject in random.sample(list(root.iter('subject')), 10):
    print('Subject:', subject.text)



Ten random cables
Subject: TRAVEL OF BNDD AGENT NOLAN
Subject: CIVIL AVIATION SECURITY
Subject: N/A
Subject: PRIVATE TRADE OPPORTUNITY
Subject: SUMMONS BY GENERAL AMIN
Subject: PROPOSED PRESIDENTIAL VISIT
Subject: PROPOSED LABOR SEMINAR
Subject: NEED FOR EXIMBANK, WASHINGTON, RESPONSE TO QUERY FROM BANK OF ADELAIDE. FOR EXIMBANK
Subject: WEEKLY REVIEW OF PEOPLE' S REPUBLIC OF CHINA NO. 15
Subject: OMEGA AUSTRALIA


I'm interested in the Apollo program, which in April 1973 had just recently ended.  So let's see what we can find on Apollo. 

In [None]:
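# n counts subject elements, so it doubles as the message's index in root
# (this assumes one subject per message).
for n, subject in enumerate(root.iter('subject')):
    # subject.text is None for an empty element, so guard before searching.
    if subject.text and 'APOLLO' in subject.text:
        print(n, 'Subject:', subject.text)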

991 Subject: BELGIAN REACTION TO MEETING ON POST- APOLLO COOPERATION
1385 Subject: POST- APOLLO INTERGOVERNMENTAL AGREEMENT
1951 Subject: PRESIDENTIAL GOOD WILL TOUR BY APOLLO 17 ASTRONAUTS - NIGER
2358 Subject: APOLLO 17 ASTRONAUTS VISIT - NIGER
2591 Subject: POST- APOLLO GOVERNMENTAL AGREEMENT
2714 Subject: PRESIDENTIAL GOODWILL TOUR BY APOLLO 17 ASTRONAUTS ( CHALLENGER)
2719 Subject: CHALLENGER: GOODWILL TOUR BY APOLLO 17 ASTRONAUTS
3618 Subject: PRESIDENTIAL GOODWILL TOUR BY APOLLO 17 ASTRONAUTS ( CHALLENGER)
3675 Subject: INFORMAL MEETING ON GOVERNMENTAL AGREEMENT ON POST- APOLLO COOPERATION.
6725 Subject: POST- APOLLO: SPACELAB GOVERNMENT AGREEMENT


Let's take a look at that last one, message 6725.

In [None]:
message = root[6725].find('msgtext')
print(message.text)




  UNCLASSIFIED

 PAGE 01   PARIS 10268  121948 Z

60
 ACTION   SCI-06

  INFO  OCT-01    EUR-25   ADP-00   NASA-04   GAC-01   ACDA-19   CIAE-00

       DODE-00   PM-09   INR-10   L-03   NSAE-00   NSC-10   RSC-01   EB-11

       COME-00   RSR-01  /101  W
                       ---------------------       054442
 R 121720 Z APR 73
FM AMEMBASSY PARIS
TO  SECSTATE WASHDC 9174
INFO  AMEMBASSY BONN
 AMEMBASSY BRUSSELS
 AMEMBASSY THE HAGUE
 AMEMBASSY LONDON
 AMEMBASSY ROME

 UNCLAS PARIS 10268

E. O. 11652:  N/ A
TAGS:  TSPA,  XT
SUBJECT:  POST- APOLLO:  SPACELAB GOVERNMENT AGREEMENT

DEPT PASS NASA

REFERENCE:  PARIS 9652

1.   SUMMARY:  ESRO OFFICIAL IS DELIVERING ESRO DRAFT GOVERNMENT
AGREEMENT TO POLLACK FOR APRIL 18  MEETING.   GIBSON REQUESTED
NAMES PRINCIPAL U. S.  PARTICIPANTS,  PROVIDED PROTOCOL SUGGESTIONS
FOR NEGOTIATION GOVERNMENT AGREEMENTS.   END SUMMARY.

2.   EMBASSY SCIENCE OFFICER AND NASA EUROPEAN REPRESENTATIVE
MET WITH GIBSON AND KALTENECKER APRIL 12  TO DISCUSS SPACELA

There are two problems with the way we found this.  First, the substring search we used (Python's `in` operator) will find not just entire words but also parts of words.  For example, if we had searched for cat we might find cat, catalog, or catatonic. Second, we searched only the subject line of the message.  We can solve both of these problems at once by using Python's `split` method to split the full text of the message up into words (which we will call tokens, because some are not actually words). We already have the entire message in a single string called `message.text`, so let's start by splitting that into tokens at spaces and newlines and stripping leading and trailing punctuation from each token.  To avoid the problem with capitalization being different for the word at the start of a sentence, we will also convert all the text to lowercase.  This will produce a list, each element of which is one token, and we can print out that list.
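The difference between the two kinds of search can be seen on a short made-up sentence. This is a minimal sketch: the helper name `tokenize`, the constant `PUNCTUATION`, and the sample sentences are assumptions for illustration, not part of the collection.

```python
# Characters stripped from both ends of every token.
PUNCTUATION = ".,:*/'()-"

def tokenize(text):
    """Split at whitespace, lowercase each token, and strip punctuation from its ends."""
    return [token.casefold().strip(PUNCTUATION) for token in text.split()]

sentence = "Check the catalog."

# Substring search: 'cat' is found inside 'catalog'.
print('cat' in sentence.casefold())    # True

# Token search: no token is exactly 'cat'.
print('cat' in tokenize(sentence))     # False

# Case and trailing punctuation no longer get in the way of a match.
print(tokenize("A CAT, at last."))     # ['a', 'cat', 'at', 'last']
```

Applied to a list, `in` compares whole elements, which is why the same operator behaves differently on a string and on a list of tokens.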

In [None]:
# Split at whitespace, then lowercase each token and strip punctuation from its ends.
tokens = [token.casefold().strip(".,:*/'()-") for token in message.text.split()]
print(tokens)

['unclassified', 'page', '01', 'paris', '10268', '121948', 'z', '60', 'action', 'sci-06', 'info', 'oct-01', 'eur-25', 'adp-00', 'nasa-04', 'gac-01', 'acda-19', 'ciae-00', 'dode-00', 'pm-09', 'inr-10', 'l-03', 'nsae-00', 'nsc-10', 'rsc-01', 'eb-11', 'come-00', 'rsr-01', '101', 'w', '', '054442', 'r', '121720', 'z', 'apr', '73', 'fm', 'amembassy', 'paris', 'to', 'secstate', 'washdc', '9174', 'info', 'amembassy', 'bonn', 'amembassy', 'brussels', 'amembassy', 'the', 'hague', 'amembassy', 'london', 'amembassy', 'rome', 'unclas', 'paris', '10268', 'e', 'o', '11652', 'n', 'a', 'tags', 'tspa', 'xt', 'subject', 'post', 'apollo', 'spacelab', 'government', 'agreement', 'dept', 'pass', 'nasa', 'reference', 'paris', '9652', '1', 'summary', 'esro', 'official', 'is', 'delivering', 'esro', 'draft', 'government', 'agreement', 'to', 'pollack', 'for', 'april', '18', 'meeting', 'gibson', 'requested', 'names', 'principal', 'u', 's', 'participants', 'provided', 'protocol', 'suggestions', 'for', 'negotiation

Once we can do that for one message, we can do it for all of them, in several files (just to save time, we'll only process 5 files, but you could process more if you're patient).  We'll make a list of lists.  The outer list will have one entry per message; the inner list will have one entry per token.  We'll print out the tokens for the first five messages.  Note that we need to be careful about how to handle empty messages.
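As a small sketch of the list-of-lists idea and of the empty-message case, the following runs on a hand-written sample rather than the real files. The function name `tokenize_messages`, the root element name `cables`, and the sample messages are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

PUNCTUATION = ".,:*/'()-"

def tokenize_messages(root):
    """Return one token list per non-empty <msgtext> element under root."""
    all_tokens = []
    for message in root.iter('msgtext'):
        # An element with no content has text None, so fall back to ''.
        text = message.text or ''
        message_tokens = [t.casefold().strip(PUNCTUATION) for t in text.split()]
        # Skip messages that contain no tokens at all.
        if message_tokens:
            all_tokens.append(message_tokens)
    return all_tokens

# Hypothetical sample: two real messages, one empty, one whitespace-only.
sample_root = ET.fromstring(
    "<cables>"
    "<sasdoc><msgtext>SECSTATE WASHDC, PARIS.</msgtext></sasdoc>"
    "<sasdoc><msgtext></msgtext></sasdoc>"
    "<sasdoc><msgtext>   </msgtext></sasdoc>"
    "<sasdoc><msgtext>POST- APOLLO</msgtext></sasdoc>"
    "</cables>"
)

sample_tokens = tokenize_messages(sample_root)
print(sample_tokens)    # [['secstate', 'washdc', 'paris'], ['post', 'apollo']]
```

Because empty messages are skipped, a position in the outer list is not necessarily the same as the message's position in the file.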

In [None]:
import os

tokens = []
files = 0
for file in os.listdir('cables'):
    if file.endswith('PU') and files < 5:
        # Parse each matching file in turn.
        tree = ET.parse(os.path.join('cables', file))
        root = tree.getroot()
        for message in root.iter('msgtext'):
            # An empty <msgtext> has text None; treat it as an empty string.
            tok = (message.text or '').split()
            if len(tok) > 0:
                tok = [t.casefold().strip(".,:*/'()-") for t in tok]
                tokens.append(tok)
                # Show the first five tokenized messages.
                if len(tokens) <= 5:
                    print(tok)
        files += 1

['confidential', 'page', '01', 'lima', '02545', '01', 'of', '02', '232207', 'z', '70', 'action', 'ara-17', 'info', 'oct-01', 'adp-00', 'nic-01', 'ciae-00', 'dode-00', 'pm-09', 'h-02', 'inr-10', 'l-03', 'nsae-00', 'nsc-10', 'pa-03', 'rsc-01', 'prs-01', 'ss-15', 'usia-12', 'eur-25', 'aid-20', 'eb-11', 'trse-00', 'iga-02', 'rsr-01', '144', 'w', '', '012957', 'r', '232119', 'z', 'apr', '73', 'fm', 'amembassy', 'lima', 'to', 'secstate', 'washdc', '5194', 'info', 'amembassy', 'santiago', 'uscincso', 'c', 'o', 'n', 'f', 'i', 'd', 'e', 'n', 't', 'i', 'a', 'l', 'section', '1', 'of', '2', 'lima', '2545', 'e', 'o', '11652', 'gds', 'tags', 'pfor', 'pinr', 'mass', 'pe', 'subject', 'conversation', 'with', 'prime', 'minister', 'mercado', 'leftist', 'influence', 'in', 'the', 'government', 'and', 'fms', 'for', 'santiago', 'your', 'attention', 'is', 'drawn', 'especially', 'to', 'para', '6', '1', 'during', 'a', 'recent', 'conversation', 'with', 'prime', 'minister', 'mercado', 'he', 'said', 'he', 'wished'

Now we can look for specific complete words in the full text of a message.  Let's look for mentions of Secretary of State Henry Kissinger.  Note that because we have lowercased everything, we need to lowercase this proper name as well.

In [None]:
# Compare whole tokens, so only the exact word 'kissinger' matches.
for i, message_tokens in enumerate(tokens):
    for token in message_tokens:
        if token == 'kissinger':
            print('Kissinger found in message', i)

print(tokens[1705])

Kissenger found in message 1705
Kissenger found in message 1705
Kissenger found in message 12623
Kissenger found in message 12623
Kissenger found in message 23541
Kissenger found in message 23541
Kissenger found in message 34459
Kissenger found in message 34459
Kissenger found in message 45377
Kissenger found in message 45377


OK, we've seen how to get text over the Internet using wget (and unzip), how to iterate through files in a directory (which we created using unzip), how to parse XML to get specific fields (subject and msgtext), how to split long strings into lists of tokens, and how to match full strings and substrings.  In Assignment 1 you'll use these capabilities to answer a question.

Where is Kissinger? That is, what is the `from` address of each message? The `from` field gives the location.

In [None]:
rules_analyzer = nlp_en.get_pipe('coreferee').annotator.rules_analyzer
rules_analyzer.get_propn_subtree(doc[1])

NameError: ignored

In [None]:
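import csv
import os
import xml.etree.ElementTree as ET

# newline='' lets the csv module control line endings; the with block closes the file.
with open('Kissinger_Travel.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    # Write the header once, before any data rows.
    writer.writerow(['Date', 'Location'])
    files = 0
    for file in os.listdir('cables'):
        if file.startswith('CFPF.TEL') and file.endswith('PU') and files < 20:
            tree = ET.parse('cables/' + file)
            root = tree.getroot()
            # The n-th date, from, and tags elements are assumed to belong to the n-th message.
            dates = root.iter('date')
            locations = root.iter('from')
            for date, location, tag in zip(dates, locations, root.iter('tags')):
                # tag.text is None for an empty element, so guard before searching.
                if tag.text and 'KISSINGER' in tag.text:
                    writer.writerow([date.text, location.text])
            files += 1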


I iterated through 20 files in the cables directory and found the messages tagged with Kissinger's name (the code checks each message's `tags` field for KISSINGER). I then wrote the date on which each telegram was sent and the location it was sent from to a CSV file called "Kissinger_Travel.csv".
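One possible variation, sketched here on a made-up sample, is to read all three fields from the same `<sasdoc>` element so that a date can never be paired with another message's tags. The sketch assumes that `<date>`, `<from>`, and `<tags>` are direct children of `<sasdoc>`, which I have not verified against the files; the function name `kissinger_rows`, the root element name `cables`, and the sample values are also assumptions.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample; the field layout inside <sasdoc> is an assumption.
sample_xml = """
<cables>
  <sasdoc>
    <date>04 JUL 1973</date><from>HELSINKI</from>
    <tags>PFOR, FI, KISSINGER</tags>
  </sasdoc>
  <sasdoc>
    <date>05 JUL 1973</date><from>PARIS</from>
    <tags>TSPA, XT</tags>
  </sasdoc>
  <sasdoc>
    <date>06 JUL 1973</date><from>BONN</from>
  </sasdoc>
</cables>
"""

def kissinger_rows(root):
    """Return [date, location] for every document whose tags mention KISSINGER."""
    rows = []
    for doc in root.iter('sasdoc'):
        # findtext() returns None when the child is missing or empty.
        tags = doc.findtext('tags') or ''
        if 'KISSINGER' in tags:
            rows.append([doc.findtext('date'), doc.findtext('from')])
    return rows

rows = kissinger_rows(ET.fromstring(sample_xml))

# Write to an in-memory buffer here; a real run would open a file instead.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['Date', 'Location'])
writer.writerows(rows)
print(buffer.getvalue())
```

The third sample document has no `<tags>` element at all. Counting dates and tags in separate lists would drift out of step at that point, while the per-document lookup simply skips it.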

In the code below, the pandas library is imported to read the CSV file created in the previous code block. The file is loaded into a DataFrame called `data`, and its rows are sorted by date in ascending order. The sorted DataFrame is written to a text file called "Kissinger_Travel.txt", which contains the required output.

In [None]:
import pandas as pd

# Read the CSV written by the previous cell (Colab's working directory is /content).
data = pd.read_csv('/content/Kissinger_Travel.csv')
display(data.head())

# Convert the date strings to datetimes so they sort chronologically.
data['Date'] = pd.to_datetime(data['Date'])
data.sort_values(by='Date', ascending=True, inplace=True)
display(data.head())
print(data)

# Write the header line, replacing any file left over from an earlier run.
with open('Kissinger_Travel.txt', 'w') as f:
    f.write('Date       Location\n')
# Append the sorted rows, separated by spaces.
data.to_csv('Kissinger_Travel.txt', header=False, index=False, sep=' ', mode='a')

Unnamed: 0,Date,Location
0,04 JUL 1973,HELSINKI
1,03 JUL 1973,HELSINKI
2,03 JUL 1973,HELSINKI
3,05 JUL 1973,HELSINKI
4,07 JUL 1973,HELSINKI


Unnamed: 0,Date,Location
4985,1973-01-03,STATE
4988,1973-01-12,TEL AVIV
4987,1973-01-12,TEL AVIV
4986,1973-01-16,JERUSALEM
6005,1973-02-19,STATE


           Date      Location
4985 1973-01-03         STATE
4988 1973-01-12      TEL AVIV
4987 1973-01-12      TEL AVIV
4986 1973-01-16     JERUSALEM
6005 1973-02-19         STATE
...         ...           ...
5897 1974-12-31         STATE
5896 1974-12-31  BUENOS AIRES
5895 1974-12-31         CAIRO
5894 1974-12-31         STATE
5888 1974-12-31         STATE

[8494 rows x 2 columns]
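The reason the dates are converted before sorting can be seen without pandas. The sketch below uses only the standard library and three of the (date, location) pairs shown above, written in the `04 JUL 1973` style of the raw CSV; the helper name `parse_cable_date` is an assumption, and `%b` relies on English month abbreviations in the current locale.

```python
from datetime import datetime

# Three rows in the raw "DD MON YYYY" form, deliberately out of order.
rows = [
    ('04 JUL 1973', 'HELSINKI'),
    ('12 JAN 1973', 'TEL AVIV'),
    ('03 JAN 1973', 'STATE'),
]

def parse_cable_date(text):
    """Parse a date such as '04 JUL 1973' into a datetime.date."""
    return datetime.strptime(text, '%d %b %Y').date()

# Sorting the raw strings would put '04 JUL' before '12 JAN';
# sorting on the parsed date gives true chronological order.
sorted_rows = sorted(rows, key=lambda row: parse_cable_date(row[0]))

for date_text, location in sorted_rows:
    print(parse_cable_date(date_text).isoformat(), location)
```

This prints the January rows first and the July row last, which matches the order pandas produces once the `Date` column holds datetimes rather than strings.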
