# 1. Introduction

* **Computers store files on hard drives.** A hard drive allows us to save data, turn the computer off, and then access the data again later. The tech community commonly refers to hard drives as **magnetic storage**, because they store data on magnetic strips.

* Magnetic strips can only contain a series of two values - up and down. Our entire CSV file saves to a hard drive the same way. We can't directly write strings such as the letter a to a hard disk; **we need to convert them to a series of magnetic ups and downs first.**

* We can do this with an `encoding system called binary`. With binary, the only valid numbers are 0 and 1. This constraint makes it easy to store binary values on a hard disk.
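As a rough illustration of this idea, the sketch below turns the letter a into the eight 0/1 values that would end up on disk. The helper name `to_bits` is hypothetical, and it assumes one byte per character, which holds for plain ASCII letters.

```python
def to_bits(character):
    """Return the 8-digit binary string for a single one-byte character."""
    # ord() gives the character's integer code; format(..., "08b") writes that
    # integer in binary, padded with leading zeros to eight digits.
    return format(ord(character), "08b")

bits_a = to_bits("a")
print(bits_a)
```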

# 2. The Basics of Binary

* Computers can't store values like strings or integers directly. Instead, they store information in binary, where the only valid numbers are 0 and 1. This system makes storing data on devices like hard drives possible.

* **To work with binary in Python, we need to enter it as a string**, for example `'101001'`.

* We can convert a binary string to an integer with the `int` function.
* We'll need to set the optional second argument, `base`, to 2 (binary is base two).

In [1]:
b='1011'
print(int(b,2))

print(bin(11))

bin(int(b,2))

11
0b1011


'0b1011'

# 3. Binary Addition

In [2]:
def binary_add(a, b):
    return bin(int(a, 2) + int(b, 2))[2:]
binary_add("10001","101")
# The bin function adds "0b" to the beginning of a string to indicate that it contains binary values.

'10110'
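The function above lets Python do the arithmetic on integers. To see what binary addition does digit by digit, here is a minimal sketch of column-by-column addition with a carry, the same way we add decimal numbers on paper. The name `binary_add_by_hand` is hypothetical and is only meant for comparison with `binary_add`.

```python
def binary_add_by_hand(a, b):
    """Add two binary strings digit by digit, carrying as in pencil-and-paper addition."""
    # Pad the shorter string with leading zeros so both have the same length.
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)

    carry = 0
    digits = []
    # Walk from the rightmost (least significant) digit to the leftmost.
    for digit_a, digit_b in zip(reversed(a), reversed(b)):
        total = int(digit_a) + int(digit_b) + carry
        digits.append(str(total % 2))  # the digit written in this column
        carry = total // 2             # what carries into the next column
    if carry:
        digits.append("1")
    # The digits were collected right to left, so reverse them at the end.
    return "".join(reversed(digits))

result = binary_add_by_hand("10001", "101")
print(result)
```

Both approaches should agree; the integer-based version is shorter and is the one to prefer in practice.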

# 4. Converting Binary Values to Other Bases

In [3]:
# binary to int

int("10001010",2)

138
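Once a binary string has been parsed into an integer, the same integer can be written out in other bases. The short sketch below uses the built-in `oct`, `hex`, and `format` functions; the variable names are only illustrative.

```python
value = int("10001010", 2)             # parse the binary string first

as_octal = oct(value)                  # base 8, prefixed with "0o"
as_hex = hex(value)                    # base 16, prefixed with "0x"
as_hex_no_prefix = format(value, "x")  # base 16 without the prefix

print(value, as_octal, as_hex, as_hex_no_prefix)
```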

# 5. Converting Characters to Binary

**Computers store strings in binary, just like they do with integers.** First, they split them into single characters, then convert those characters to integers. Finally, they convert those integers to binary and store them.

* Standard ASCII defines 128 characters (codes 0 to 127, which need only 7 bits), so every ASCII character fits in one byte. A full byte can hold 256 different values, which is why 8-bit extensions of ASCII offer up to 256 symbols.
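A quick way to check whether a character belongs to ASCII is to try encoding it with the `"ascii"` codec. This is a small sketch; the helper name `fits_in_ascii` is hypothetical, and the sample characters are arbitrary.

```python
def fits_in_ascii(character):
    """Return True if the character can be encoded with the ASCII codec."""
    try:
        character.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False

# "\u00e9" is the letter e with an acute accent, which is not an ASCII character.
for character in ["w", "}", "\u00e9"]:
    code = ord(character)
    # chr() reverses ord(), so the round trip gives back the same character.
    print(code, chr(code) == character, fits_in_ascii(character))

ascii_flags = [fits_in_ascii(c) for c in ["w", "}", "\u00e9"]]
```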

## TODO:
* Convert "w" to binary and assign the result to binary_w.

* Convert "}" to binary and assign the result to binary_bracket.

In [4]:
binary_w=bin(ord("w"))
binary_w

'0b1110111'

In [5]:
binary_bracket=bin(ord("}"))
binary_bracket

'0b1111101'

# 6. Introduction to Unicode

You might be wondering what happened to all of the other characters and alphabets in the world. ASCII can't handle them, because it defines only 128 characters (8-bit extensions reach 256 at most). The tech community realized it needed a new standard, and created Unicode.

**Unicode assigns "code points" to characters. In Python, code points look like this:**

"\u3232"

* We can use an encoding system to convert these code points to binary.
* The most common encoding system for Unicode is `UTF-8`.
* This encoding tells a computer which sequence of bytes represents each code point.

UTF-8 can encode values that are longer than one byte, which enables it to store all Unicode characters. It encodes characters using a variable number of bytes (one to four), which means that it also supports regular ASCII characters (which are one byte each).
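To make the variable-length idea concrete, the sketch below encodes four characters and counts the bytes each one needs in UTF-8. The sample characters are arbitrary choices for illustration, written as escapes so the snippet does not depend on the editor's own encoding.

```python
# One ASCII letter, an accented letter, the arrow used in this lesson, and an emoji.
samples = ["a", "\u00e9", "\u27f6", "\U0001F600"]

byte_lengths = []
for character in samples:
    encoded = character.encode("utf-8")
    byte_lengths.append(len(encoded))
    # ascii() prints the character as an escape, so the output is safe on any terminal.
    print(ascii(character), hex(ord(character)), encoded, len(encoded))
```

The ASCII letter still takes a single byte, which is why plain ASCII text is already valid UTF-8.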

In [6]:
code_point = "\u27f6"
print(code_point)

⟶


In [7]:
ord(code_point)

10230

In [8]:
bin(ord(code_point))

'0b10011111110110'

# 7. Strings with Unicode

* `ASCII is a subset of Unicode.` Unicode implements all of the ASCII characters, as well as the additional characters that code points allow.
* This lets us create Unicode strings that combine both ASCII and Unicode characters.
* Python 3 uses Unicode for all strings, and UTF-8 is the default encoding for source code and for the `.encode()` method. That means we can enter the Unicode code points or the actual characters.

In [9]:
s1 = "café"
# The \u prefix means "the next four hexadecimal digits are a Unicode code point"
# It doesn't change the value at all (the last character in the string below is \u00e9)
s2 = "caf\u00e9"

# These strings are the same, because code points are equal to their corresponding Unicode characters.
# \u00e9 and é are equivalent.
print(s1 == s2)
s3=s1+s2


True


# 8. The Bytes Data Type

* Python includes a data type called `bytes`. It's similar to a string, except that it contains encoded byte values.

* When we create an object with a bytes type from a string, we specify an encoding system (usually UTF-8).
* We pass that encoding to the string's `.encode()` method, which returns the encoded bytes.

In [10]:
# We can make a string with some Unicode values
superman = "Clark Kent␦"
print(superman)

# This tells Python to encode the string superman into bytes using the UTF-8 encoding system
# We end up with a sequence of bytes instead of a string
superman_bytes = superman.encode("utf-8")
print(type(superman_bytes))
superman_bytes

Clark Kent␦
<class 'bytes'>


b'Clark Kent\xe2\x90\xa6'

In [11]:
s="anshu8,"
b="anshu8,".encode("utf-8")
b

b'anshu8,'

# 9. Introduction to HexaDecimal

**Similar to the `\u` prefix for a Unicode code point, `\x` is the prefix for a byte written in hexadecimal.**

* Just like binary is base 2 and our normal counting system is base 10, hexadecimal is base 16. 
* The valid digits in hexadecimal are 0-9 and A-F. Here are the values corresponding to each character:

  * A - 10 
  * B - 11
  * C - 12
  * D - 13
  * E - 14
  * F - 15

**We use hexadecimal because it represents a byte efficiently**
* Programmers often use hexadecimal to display bytes instead of binary because it's more compact and easier to write out.
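The difference in length is easy to see by printing the same bytes both ways. This sketch uses the built-in `bytes.hex()` method and `format`; the sample character is an arbitrary choice.

```python
data = "\u00e9".encode("utf-8")   # this accented e encodes to two bytes in UTF-8

# Each byte written as eight binary digits, separated by a space.
as_binary = " ".join(format(byte, "08b") for byte in data)
# The same two bytes written as two hexadecimal digits each.
as_hex = data.hex()

print(as_binary)
print(as_hex)
```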

# 10. Hexadecimal Conversions

**The \x prefix means "the next two digits are in hexadecimal."**

* Two hexadecimal digits equal eight binary digits, because each hexadecimal digit can take 16 different values, exactly as many as four binary digits. For instance, 15 is "F" in hexadecimal, but 1111 in binary.

Because it's shorter to display, and four binary digits always equal one hexadecimal digit, programs often use hexadecimal to print out values. This is purely for convenience.
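The four-to-one correspondence can be sketched by splitting a byte into two groups of four bits and converting each group on its own. The function name `byte_to_hex_by_groups` is hypothetical, and it assumes the input is exactly eight binary digits.

```python
def byte_to_hex_by_groups(bits):
    """Convert an 8-digit binary string to two hex digits, one per group of four bits."""
    hex_digits = "0123456789abcdef"
    high, low = bits[:4], bits[4:]
    # Each group of four bits is a number from 0 to 15, which indexes one hex digit.
    return hex_digits[int(high, 2)] + hex_digits[int(low, 2)]

hex_for_byte = byte_to_hex_by_groups("11101010")
print(hex_for_byte)
```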

## TODO:
Add "2" to "ea" in hexadecimal, and assign the result to hex_ea.

Add "e" to "f" in hexadecimal, and assign the result to hex_ef.

In [12]:
print(int("F",16))

15


In [13]:
hex(15)

'0xf'

In [14]:
hex(15)[2:]

'f'

In [15]:
def hex_add(a,b):
    return hex(int(a,16)+int(b,16))[2:]

In [16]:
hex_ea=hex_add("2","ea")
hex_ea

'ec'

In [17]:
hex_ef=hex_add("e","f")
hex_ef

'1d'

# 11. Hex to Binary

## TODO
Convert the hexadecimal byte "\xaa" to binary, and assign the result to binary_aa.


In [18]:
binary_aa=bin(ord('\xaa'))[2:]
binary_aa

'10101010'

In [19]:
bin(int('aa',16))[2:]

'10101010'

# 12. Bytes and Strings

**There's no encoding system associated with the bytes data type. That means if we have an object with that data type, Python won't know how to display the (encoded) code points in it. For this reason, we can't mix bytes objects and strings together.**

In [20]:
hulk_bytes = "Bruce Banner␦".encode("utf-8")

# We can't mix strings and bytes
# For instance, if we try to replace "Banner" using string arguments, it won't work, because hulk_bytes has been encoded to bytes
try:
    hulk_bytes.replace("Banner", "")
except TypeError:
    print("TypeError with replacement")

# We can create objects of the bytes data type by putting a b in front of the quotation marks in a string
hulk_bytes = b"Bruce Banner"

# Now, instead of mixing strings and bytes, we can use the replace method with bytes objects instead
hulk_bytes.replace(b"Banner", b"")

TypeError with replacement


b'Bruce '

### We can create objects of the bytes data type by putting a `b` in front of the quotation marks in a string
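A small sketch of why Python refuses to guess: the same two bytes produce different text depending on the encoding used to read them. The byte values and the choice of Latin-1 as the second encoding are illustrative assumptions, not part of the lesson's data.

```python
raw = b"\xc3\xa9"

as_utf8 = raw.decode("utf-8")      # read as one two-byte UTF-8 character
as_latin1 = raw.decode("latin-1")  # read as two one-byte Latin-1 characters

# ascii() shows the code points as escapes, so the difference is visible on any terminal.
print(ascii(as_utf8), len(as_utf8))
print(ascii(as_latin1), len(as_latin1))
```

Because a bytes object carries no record of which reading is intended, we have to name the encoding ourselves whenever we convert between bytes and strings.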

# 13. Decode Bytes to Strings

* Once we have a bytes object, we can decode it into a string using an encoding system. We use the `.decode()` method to do this.

In [21]:
# Make a bytes object with aquaman's secret identity
aquaman_bytes = b"Who knows?"

# Now, we can use the decode method, along with the encoding system (UTF-8) to turn it into a string
aquaman = aquaman_bytes.decode("utf-8")

# We can print the value and type to verify that it's a string
print(aquaman)
print(type(aquaman))

Who knows?
<class 'str'>
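Decoding only works cleanly when the bytes really were produced by the encoding we name. The sketch below assumes some bytes that were written as Latin-1 and shows what happens when they are decoded as UTF-8, first strictly and then with the `errors="replace"` option of `.decode()`. The sample text is made up for illustration.

```python
# Bytes produced with Latin-1, not UTF-8 (the last byte is \xe9).
latin1_bytes = "caf\u00e9".encode("latin-1")

try:
    latin1_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    # \xe9 on its own is not a valid UTF-8 sequence.
    decode_failed = True

# errors="replace" swaps undecodable bytes for the U+FFFD replacement character.
lenient = latin1_bytes.decode("utf-8", errors="replace")
# Naming the correct encoding recovers the original text.
correct = latin1_bytes.decode("latin-1")

print(decode_failed, ascii(lenient), ascii(correct))
```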


# 14. Read in File Data

In [22]:
# We can read our data in using csvreader
import csv
# When we open a file, we can specify the system used to encode it (in this case, UTF-8).
# The with block closes the file once we've finished reading it.
with open("sentences_cia.csv", 'r', encoding="utf-8") as f:
    csvreader = csv.reader(f)
    sentences_cia = list(csvreader)

# The data consists of two columns
# The first column contains the year, and the second contains a sentence from a CIA report written in that year
# Print the first column of the second row
print(sentences_cia[1][0])

# Print the second column of the second row
print(sentences_cia[1][1])
sentences_ten=sentences_cia[9][1]

1997
The FBI information included that al-Mairi's brother "traveled to Afghanistan in 1997-1998 to train in Bin - Ladencamps."


# 15. Convert to a dataframe

Having a dataframe will make processing and analysis much simpler because we can use the .apply() method.

In [23]:
import csv
# Let's read in the legislators data from a few missions ago
# The with block closes the file once we've finished reading it.
with open("legislators.csv", 'r', encoding="utf-8") as f:
    csvreader = csv.reader(f)
    legislators = list(csvreader)

# Now, we can import pandas and use the DataFrame class to convert the list of lists to a dataframe.
import pandas as pd

legislators_df = pd.DataFrame(legislators)

# As you can see, the first row contains the headers, which we don't want (because they're not actually data)
print(legislators_df.iloc[0,:])

# To remove the headers, we'll subset the df and pass them in separately
# This code removes the headers from legislators, and instead passes them into the columns argument
# The columns argument specifies column names
legislators_df = pd.DataFrame(legislators[1:], columns=legislators[0])
# We now have the right data in the first row, as well as the proper headers
print(legislators_df.iloc[0,:])


sentences_cia=pd.DataFrame(sentences_cia[1:],columns=sentences_cia[0])
print(sentences_cia[:2])

0     last_name
1    first_name
2      birthday
3        gender
4          type
5         state
6         party
Name: 0, dtype: object
last_name                 Bassett
first_name                Richard
birthday               1745-04-02
gender                          M
type                          sen
state                          DE
party         Anti-Administration
Name: 0, dtype: object
   year                                          statement      
0  1997  The FBI information included that al-Mairi's b...      
1  1997  The FBI information included that al-Mairi's b...      


# 16. Clean up Sentences

In [24]:
# The integer codes for all the characters we want to keep
good_characters = [48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 32]

sentence_15 = sentences_cia["statement"][14]

# Iterate over the characters in the sentence, and only take those whose integer representations are in good_characters
# This will construct a list of single characters
cleaned_sentence_15_list = [s for s in sentence_15 if ord(s) in good_characters]

# Join the list together, separated by "" (no space), which creates a string again
cleaned_sentence_15 = "".join(cleaned_sentence_15_list)
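The list of integer codes above covers the digits, the uppercase and lowercase letters, and the space. As an alternative sketch, the same set of characters can be built from the standard `string` module, which avoids typing the codes by hand. The names `allowed` and `keep_allowed` and the sample sentence are hypothetical.

```python
import string

# Letters, digits, and the space character, stored in a set for fast lookups.
allowed = set(string.ascii_letters + string.digits + " ")

def keep_allowed(text):
    """Drop every character that is not a letter, a digit, or a space."""
    return "".join(character for character in text if character in allowed)

sample = 'al-Mairi\'s brother "traveled" in 1997-1998'
cleaned_sample = keep_allowed(sample)
print(cleaned_sample)
```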

## TODO:
Make a function that takes a dataframe row and then returns the clean version of the "statement" column.

Use the .apply() method on dataframe to apply the function to each row of sentences_cia.

Assign the resulting vector to the cleaned_statement column of sentences_cia.

In [25]:
def clean_statement(row):
    good_characters = [48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77,
                       78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 97, 98, 99, 100, 101, 102, 103, 104, 105,
                       106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 32]
    statement = row["statement"]
    clean_statement_list = [s for s in statement if ord(s) in good_characters]
    return "".join(clean_statement_list)

sentences_cia["cleaned_statement"] = sentences_cia.apply(clean_statement, axis=1)

# 17. Tokenize Statements

## TODO:
Tokenize combined_statements by splitting it into words on the spaces.
You should end up with a list of all the words in combined_statements.
Assign the result to statement_tokens.

In [26]:
combined_statements = " ".join(sentences_cia["cleaned_statement"])
statement_tokens=combined_statements.split(" ")
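One detail worth knowing about `split(" ")`: when the cleaning step removes punctuation that sat between two spaces, the text ends up with two spaces in a row, and splitting on a single space then yields an empty token. The sketch below compares it with `split()` called without arguments, using a made-up fragment.

```python
# Two spaces in a row, as left behind when " - " loses its hyphen.
text = "Bin  Ladencamps"

tokens_single_space = text.split(" ")  # splits on every single space
tokens_whitespace = text.split()       # splits on runs of whitespace

print(tokens_single_space)
print(tokens_whitespace)
```

Empty tokens are harmless here as long as a later step filters tokens by length, but `split()` avoids producing them in the first place.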

# 18. Filter the Tokens

## TODO:
Filter the statement_tokens list so that it only contains tokens that are at least five characters long.
Assign the result to filtered_tokens.

In [27]:
filtered_tokens=[]
for token in statement_tokens:
    if len(token)>=5:
        filtered_tokens.append(token)
        
filtered_tokens[:10]

['information',
 'included',
 'alMairis',
 'brother',
 'traveled',
 'Afghanistan',
 '19971998',
 'train',
 'Ladencamps',
 'information']

# 19. Count the Tokens

## TODO:
Count the items in filtered_tokens and assign the result to filtered_token_counts.

`Counter` takes a list (or any other iterable) as input. It creates a dictionary-like object where the keys are list items, and the values are the number of times those items appear in the list.
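Here is a minimal sketch of that behavior on a small, made-up list of words, so the counts are easy to verify by eye.

```python
from collections import Counter

words = ["detainee", "information", "detainee", "email", "information", "detainee"]
word_counts = Counter(words)

# Look up a single count the same way as in a dictionary.
print(word_counts["detainee"])
# Items that never appeared count as zero instead of raising a KeyError.
print(word_counts["missing"])
# most_common(n) returns the n most frequent items as (item, count) pairs.
top_two = word_counts.most_common(2)
print(top_two)
```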

In [28]:
from collections import Counter

filtered_token_counts=Counter(filtered_tokens)
filtered_token_counts

Counter({'information': 375,
         'included': 49,
         'alMairis': 4,
         'brother': 9,
         'traveled': 25,
         'Afghanistan': 39,
         '19971998': 4,
         'train': 6,
         'Ladencamps': 4,
         'example': 41,
         'October': 138,
         'another': 22,
         'detainee': 128,
         'explained': 6,
         'alKuwaiti': 177,
         'guesthouse': 2,
         'operated': 4,
         'Shaykh': 27,
         'alLibi': 6,
         'Zubaydah': 328,
         'email': 104,
         'officer': 39,
         'tracking': 5,
         'since': 26,
         'stated': 96,
         'although': 13,
         'proof': 4,
         'needed': 9,
         'believe': 24,
         'mastermind': 14,
         'behind': 7,
         'attacks': 66,
         'Interrogators': 13,
         'Disagree': 3,
         'Headquarters': 72,
         'About': 13,
         'AlNashiris': 3,
         'Level': 4,
         'Cooperation': 3,
         'Oppose': 3,
         'Continued':

# 20. Most Common Tokens

## TODO:
Get the three most common items in filtered_token_counts, and assign the result to common_tokens.

In [29]:
common_tokens=filtered_token_counts.most_common(3)
common_tokens

[('interrogation', 391), ('information', 375), ('REDACTED', 375)]

# 21. Finding the Most Common Tokens by Year

## TODO:
Write a function that finds the two most common terms in sentences_cia for a given year (the "year" column).

* The "year" column in sentences_cia stores strings, so you'll need to pass strings into the function.
* Select the rows in sentences_cia that match that year, combine the clean statements, split them into a list on the space character (" "), filter out words fewer than five characters long, make a counter object with the results, and find the two most common items in the counter.

Use the function to find the most common terms for "2000". Assign the result to common_2000.

Use the function to find the most common terms for "2002". Assign the result to common_2002.

Use the function to find the most common terms for "2013". Assign the result to common_2013.

In [30]:
def find_most_common_by_year(year, sentences_cia):
    data = sentences_cia[sentences_cia["year"] == year]
    combined_statement = " ".join(data["cleaned_statement"])
    statement_split = combined_statement.split(" ")
    counter = Counter([s for s in statement_split if len(s) > 4])
    return counter.most_common(2)

common_2000 = find_most_common_by_year("2000", sentences_cia)
common_2002 = find_most_common_by_year("2002", sentences_cia)
common_2013 = find_most_common_by_year("2013", sentences_cia)