### About the Dataset
#### Our dataset contains excerpts from CIA memos that detail covert activities.
#### Each row includes the year the statement was made and an excerpt from the memo.

#### The first step is to import the data from the CSV file.
#### The file uses UTF-8 encoding, so we pass that encoding to open() to decode the data correctly.

In [3]:
import csv

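# Open the file with a context manager so it is closed after reading.
with open("sentences_cia.csv", "r", encoding="utf-8", newline="") as f:
    sentences_cia = list(csv.reader(f))

# Print the year and the excerpt of the first data row (row 0 is the header)
print(sentences_cia[1][0])
print(sentences_cia[1][1])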

1997
The FBI information included that al-Mairi's brother "traveled to Afghanistan in 1997-1998 to train in Bin - Ladencamps."
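
As a minimal sketch of what `csv.reader` returns, the example below parses a small in-memory CSV. The sample rows and the names `sample_csv`, `header` and `data` are illustrative assumptions, not taken from `sentences_cia.csv`. The point is that the first list is the header row and that every field arrives as a string.

```python
import csv
import io

# Hypothetical two-column sample standing in for the real file.
sample_csv = 'year,statement\n1997,"First excerpt, with a comma."\n2002,Second excerpt.\n'

# csv.reader accepts any iterable of lines, including an in-memory buffer.
rows = list(csv.reader(io.StringIO(sample_csv)))

header, data = rows[0], rows[1:]
print(header)        # column names
print(data[0][0])    # year of the first data row, still a string
print(data[0][1])    # the quoted field keeps its embedded comma
```

Because the year is read as a string, any later comparison against the `year` column has to use a string such as `"2000"` rather than an integer.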


### Convert to Dataframe
#### We will store the list of lists in a dataframe, using the first row as the column names.

In [6]:
import pandas as pd

sentences_cia_df = pd.DataFrame(sentences_cia[1:],columns=sentences_cia[0])
sentences_cia_df.head()

Unnamed: 0,year,statement
0,1997,The FBI information included that al-Mairi's b...
1,1997,The FBI information included that al-Mairi's b...
2,1997,"For example, on October 12, 2004, another CIA ..."
3,1997,"On October 16, 2001, an email from a CTC offic..."
4,1997,"For example, on October 12, 2004, another CIA ..."


### Clean Up Sentences
#### We will now remove extraneous symbols from the statement column. We only need letters, digits and spaces.
#### We will do this by keeping only the characters whose ord() code is in a list of allowed codes.

In [10]:
# Integer codes of all characters we want to keep:
# digits 0-9 (48-57), uppercase A-Z (65-90), lowercase a-z (97-122) and space (32).
good_characters = set(range(48, 58)) | set(range(65, 91)) | set(range(97, 123)) | {32}

def clean(row):
    statement = row['statement']
    clean_statement_list = [s for s in statement if ord(s) in good_characters]
    clean_statement = "".join(clean_statement_list)
    return clean_statement

sentences_cia_df['cleaned_statement'] = sentences_cia_df.apply(clean, axis=1)

In [11]:
sentences_cia_df.head(5)

Unnamed: 0,year,statement,cleaned_statement
0,1997,The FBI information included that al-Mairi's b...,The FBI information included that alMairis bro...
1,1997,The FBI information included that al-Mairi's b...,The FBI information included that alMairis bro...
2,1997,"For example, on October 12, 2004, another CIA ...",For example on October 12 2004 another CIA det...
3,1997,"On October 16, 2001, an email from a CTC offic...",On October 16 2001 an email from a CTC officer...
4,1997,"For example, on October 12, 2004, another CIA ...",For example on October 12 2004 another CIA det...
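
The same kind of cleanup can also be expressed with a regular expression. The sketch below is an alternative, not the method used in this notebook; the names `NON_ALNUM_SPACE` and `clean_text` are hypothetical. It removes every character that is not an ASCII letter, digit or space.

```python
import re

# Matches any character that is NOT an ASCII letter, digit, or space.
NON_ALNUM_SPACE = re.compile(r"[^A-Za-z0-9 ]")

def clean_text(text):
    """Drop every character other than ASCII letters, digits, and spaces."""
    return NON_ALNUM_SPACE.sub("", text)

print(clean_text("al-Mairi's brother, 1997-1998."))
```

Note that removing the hyphen joins the two years into a single token, which is a side effect of any approach that deletes punctuation instead of replacing it with a space.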


### Tokenize Statements
#### We will now combine statements and convert them into tokens.
#### Eventually, we will count up how many times each token occurs.

In [14]:
combined_statements = " ".join(sentences_cia_df["cleaned_statement"])
statement_tokens = combined_statements.split(" ")
print(statement_tokens[:5])

['The', 'FBI', 'information', 'included', 'that']
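
One detail worth knowing about `split(" ")`: removing punctuation can leave two spaces in a row (for example, "Bin - Ladencamps" becomes "Bin  Ladencamps" once the hyphen is dropped), and splitting on a single space then produces empty tokens. The short sketch below, using a made-up string, contrasts this with `split()` called without arguments.

```python
text = "Bin  Ladencamps in 1997"  # two spaces where a hyphen was removed

print(text.split(" "))  # splitting on a single space yields an empty token
print(text.split())     # splitting on runs of whitespace does not
```

In this notebook the empty tokens are harmless because the length filter applied next discards them, but they would matter if the tokens were counted directly.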


### Filter the Tokens
#### Some of the most frequently occurring words are 'stopwords' like a, an, the, etc.
#### They do not add much information to our analysis.
#### A simple way to remove most stopwords is to filter out all the words that have 4 or fewer characters.

In [16]:
filtered_tokens=[]
for token in statement_tokens:
    if len(token) > 4:
        filtered_tokens.append(token)

print(filtered_tokens[:5])

['information', 'included', 'alMairis', 'brother', 'traveled']
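
The length filter is a rough heuristic: it also discards short but meaningful tokens such as "FBI" and "CIA". A possible alternative, sketched below, is to filter against an explicit stopword set. The `STOPWORDS` set here is a deliberately tiny, hypothetical list for illustration only, and `remove_stopwords` is a hypothetical helper name.

```python
# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"a", "an", "the", "that", "in", "to", "of", "and"}

def remove_stopwords(tokens):
    """Keep tokens that are non-empty and not in STOPWORDS (case-insensitive)."""
    return [t for t in tokens if t and t.lower() not in STOPWORDS]

sample = ["The", "FBI", "information", "included", "that"]
print(remove_stopwords(sample))
```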


### Count the Tokens
#### After filtering, we can count how many times each token occurs.
#### We can use the Counter class from the collections module for this purpose.

In [18]:
from collections import Counter

filtered_token_counts = Counter(filtered_tokens)

### Most Common Tokens
#### With the help of Counter, we can get the most common tokens across all statements.

In [20]:
common_tokens = filtered_token_counts.most_common(5)
print(common_tokens)

[('REDACTED', 1746), ('General', 1509), ('interrogation', 1453), ('Interrogation', 1440), ('techniques', 1198)]
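
The output above lists 'interrogation' and 'Interrogation' as separate tokens because `Counter` treats strings that differ only in case as different keys. If the two should be counted together, one option is to lowercase the tokens before counting. The sketch below uses a hypothetical token list, not the real data.

```python
from collections import Counter

# Hypothetical token list; the real one comes from the filtering step.
tokens = ["interrogation", "Interrogation", "techniques", "REDACTED", "interrogation"]

case_sensitive = Counter(tokens)
case_insensitive = Counter(t.lower() for t in tokens)

print(case_sensitive.most_common(1))    # counts the two spellings separately
print(case_insensitive.most_common(1))  # merges them into one lowercase key
```

Lowercasing also turns markers such as 'REDACTED' into 'redacted', so whether to normalize case depends on whether capitalization carries meaning in the analysis.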


### Most Common Tokens by Year
#### We will now find the most common tokens for a few selected years.

In [24]:
def most_common_tokens(year):
    # Year values were read from the CSV as strings, so year must be a string too.
    df = sentences_cia_df[sentences_cia_df['year'] == year]
    combined_statements = " ".join(df["cleaned_statement"])
    statement_tokens = combined_statements.split(" ")
    # Keep only tokens longer than 4 characters, as in the filtering step.
    filtered_tokens = [token for token in statement_tokens if len(token) > 4]
    return Counter(filtered_tokens).most_common(5)

common_2000 = most_common_tokens("2000")
common_2002 = most_common_tokens("2002")
common_2013 = most_common_tokens("2013")
common_2014 = most_common_tokens("2014")
print(common_2000)
print(common_2002)
print(common_2013)
print(common_2014)

[('Ahmad', 9), ('terrorist', 9), ('Afghanistan', 6), ('training', 6), ('Padilla', 6)]
[('interrogation', 298), ('Zubaydah', 267), ('August', 261), ('REDACTED', 251), ('techniques', 231)]
[('Response', 196), ('states', 111), ('information', 101), ('reporting', 59), ('Committee', 55)]
[('UNCLASSIFIED', 32), ('Committee', 29), ('REDACTED', 22), ('SECRET', 19), ('162014Z', 18)]
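
To count tokens for every year in a single pass instead of calling a function once per year, one possible approach is a dictionary that maps each year to its own `Counter`. The sketch below works on a hypothetical list of (year, cleaned statement) pairs and does not touch the dataframe; `rows` and `counts_by_year` are illustrative names.

```python
from collections import Counter, defaultdict

# Hypothetical (year, cleaned_statement) pairs standing in for the dataframe rows.
rows = [
    ("2000", "terrorist training camps in Afghanistan"),
    ("2000", "training continued in Afghanistan"),
    ("2002", "interrogation techniques were approved"),
]

counts_by_year = defaultdict(Counter)
for year, statement in rows:
    # Same length filter as in the notebook: keep tokens longer than 4 characters.
    counts_by_year[year].update(t for t in statement.split() if len(t) > 4)

for year in sorted(counts_by_year):
    print(year, counts_by_year[year].most_common(2))
```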


### Conclusion
#### Using the simple techniques above, we can clean the data, tokenize it, and count how often each token occurs, both overall and by year.