The file consists of one long string. To use it effectively, we'd need to parse it and convert it into rows and columns. We've covered strings extensively so far, but we haven't covered how a computer stores them.

Computers store files on hard drives. A hard drive allows us to save data, turn the computer off, and then access the data again later. The tech community commonly refers to hard drives as magnetic storage, because they store data on magnetic strips.

Magnetic strips can only contain a series of two values - up and down. Our entire CSV file saves to a hard drive the same way. We can't directly write strings such as the letter a to a hard disk; we need to convert them to a series of magnetic ups and downs first.

We can do this with an encoding system called binary. With binary, the only valid digits are 0 and 1, so each digit maps directly onto a magnetic up or down.

Computers can't store values like strings or integers directly. Instead, they store all information in binary, which is what makes storing data on devices like hard drives possible.

However, we normally count in "base 10." We call this system base 10 because there are 10 possible digits - 0 through 9. Binary is base two, because there are only two possible digits - 0 and 1.
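To make the idea of a base concrete, here is a minimal sketch of positional notation. The helper name positional_value is made up for this illustration; in practice Python's built-in int(digits, base) does the same job.

```python
def positional_value(digits, base):
    """Sum each digit times its place value (base ** position)."""
    total = 0
    # Walk the digits from right to left: position 0 is the ones place.
    for position, digit in enumerate(reversed(digits)):
        total += int(digit) * base ** position
    return total

decimal_example = positional_value("345", 10)  # 3*100 + 4*10 + 5*1
binary_example = positional_value("101", 2)    # 1*4 + 0*2 + 1*1
print(decimal_example, binary_example)
```

The same digits mean different amounts in different bases: "101" is one hundred and one in base 10, but five in base two.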

One way to work with binary in Python is to enter it as a string. If we enter something like b = 10 directly, for example, Python will assume that it's a base 10 integer (rather than binary). Instead, we put quotes around the digits to enter them as a string before working with them further.

In [4]:
# Let's say b is a binary number. Here, we store its digits as a string.
# If we try to enter it directly as b = 10, Python will assume it's a base 10 integer.
b = "10"

# Now, we can convert b from a string of binary digits to a base 10 integer with the int function.
# We'll need to set the optional second argument, base, to 2 (binary is base two).
print(int(b, 2))
int("100", 2)

2


4
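As a side note, strings are not the only route. Python also accepts binary literals written with a 0b prefix. The short sketch below compares the two approaches; the variable names are just for illustration.

```python
from_string = int("10", 2)     # parse a string of binary digits
from_literal = 0b10            # binary literal, no quotes needed
round_trip = bin(from_string)  # convert the integer back to a binary string

print(from_string, from_literal, round_trip)
```

Both routes produce the same ordinary integer; the string form is handy when the digits arrive as text, for example from a file.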

Computers store strings in binary, just like they do with integers. First, they split them into single characters, then convert those characters to integers. Finally, they convert those integers to binary and store them.

In [6]:
# We can use the ord() function to get the integer for an ASCII character.
ord('a')

# Then, we use the bin() function to convert to binary.
# The bin function adds "0b" to the beginning of a string to indicate that it contains binary values.
bin(ord('a'))

# ÿ has the integer value 255. It isn't part of standard ASCII (which stops at 127), but it's the last character in 8-bit "extended ASCII" sets such as Latin-1.
# 255 is the highest value we can represent with eight binary digits.
ord('ÿ')
# As you can see, we get eight 1's, which shows that this is the highest possible eight-digit value.
bin(ord('ÿ'))

# Why is this?  Because a single binary digit is called a bit, and computers store values in sequences of eight bits (i.e., a byte).
# You might be more familiar with kilobytes or megabytes. A kilobyte is 1000 bytes, and a megabyte is 1000 kilobytes.
# Standard ASCII defines 128 symbols (seven bits each). 8-bit extensions use the whole byte to reach 256 symbols, because one byte is the most storage a single character in those sets can take up.
binary_w = bin(ord("w"))
binary_bracket = bin(ord("}"))              
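The three steps described above (split into characters, convert to integers, convert to binary) can be sketched for a whole string at once. The helper name to_binary_digits is hypothetical, and the eight-digit padding assumes every character has a value below 256.

```python
def to_binary_digits(text):
    """Return one zero-padded, eight-digit binary string per character."""
    # ord() gives the integer for each character; format(..., "08b")
    # writes that integer in binary, padded with leading zeros to 8 digits.
    return [format(ord(ch), "08b") for ch in text]

bits = to_binary_digits("cat")
print(bits)
```

Unlike bin(), format() leaves off the "0b" prefix, which makes it easier to see that each character occupies exactly one byte here.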

Unicode assigns "code points" to characters. In Python, code points look like this:

"\u3232"

We can use an encoding system to convert these code points to bytes. The most common encoding system for Unicode is UTF-8. This encoding tells a computer which sequence of bytes represents each code point.

UTF-8 can encode values that are longer than one byte, which enables it to store all Unicode characters. It encodes characters using a variable number of bytes, which means that it also supports regular ASCII characters (which are one byte each).
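A quick way to see the variable-length behavior is to encode a few characters and count the resulting bytes. The sample characters below are chosen for this sketch (the emoji is not part of the lesson), and they are written as escapes so the example does not depend on how the file itself is saved.

```python
samples = ["a", "\u00ff", "\u27f6", "\U0001F600"]  # a, ÿ, ⟶, and an emoji

# Map each character to the number of bytes UTF-8 needs for it.
byte_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, length in byte_lengths.items():
    print(repr(ch), length)
```

The ASCII letter needs one byte, while the other characters need two, three, and four bytes respectively.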

In [7]:
# We can initialize a string with a Unicode character (this one is code point U+27F6, which can also be written as the escape \u27F6).
code_point = "⟶"

# This particular code point maps to a right arrow character.
print(code_point)

# We can get the base 10 integer value of the code point with the ord function.
print(ord(code_point))

# As you can see, this value needs 14 binary digits, so it doesn't fit in one byte.
print(bin(ord(code_point)))
binary_1019 = bin(ord("\u1019"))


⟶
10230
0b10011111110110


ASCII is a subset of Unicode. Unicode implements all of the ASCII characters, as well as the additional characters that code points allow.

This lets us create Unicode strings that combine both ASCII and Unicode characters.

By default, Python 3 uses Unicode for all strings, and UTF-8 is its default encoding for source files and for the .encode() method. That means we can enter either the Unicode code points or the actual characters.
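As a small illustration of entering either form, the sketch below builds the same arrow character from its escape and from the literal character, then mixes it with ASCII text. The variable names are only for this example.

```python
arrow_from_escape = "\u27F6"   # written as a code point escape
arrow_from_character = "⟶"     # written as the character itself

same = arrow_from_escape == arrow_from_character
mixed = "A " + arrow_from_escape + " B"   # ASCII and non-ASCII in one string

print(same, mixed, len(mixed))
```

Note that len() counts characters (code points), not bytes, so the arrow counts as one regardless of how many bytes it needs once encoded.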

Python includes a data type called "bytes." It's similar to a string, except that it contains encoded bytes values.

When we create a bytes object from a string, we specify an encoding system (usually UTF-8) and call the .encode() method, which encodes the string into bytes.

In [8]:
# We can make a string with some Unicode values
superman = "Clark Kent␦"
print(superman)

# This tells Python to encode the string superman as Unicode using the UTF-8 encoding system
# We end up with a sequence of bytes instead of a string
superman_bytes = "Clark Kent␦".encode("utf-8")

batman = "Bruce Wayne␦"
batman_bytes = batman.encode("utf-8")

Clark Kent␦
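To see what .encode() actually returns, we can inspect the result and decode it again. This is a minimal sketch that rebuilds the batman string with the escape \u2426 for its final symbol; the name restored is introduced here for illustration.

```python
batman = "Bruce Wayne\u2426"
batman_bytes = batman.encode("utf-8")

print(type(batman_bytes))               # the bytes type, not str
print(batman_bytes)                     # bytes display with a b'' prefix
print(len(batman), len(batman_bytes))   # characters vs. bytes

# Decoding with the same encoding recovers the original string.
restored = batman_bytes.decode("utf-8")
print(restored == batman)
```

The string has 12 characters but encodes to 14 bytes, because the final symbol takes three bytes in UTF-8.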


batman_bytes from the previous cell prints out as b'Bruce Wayne\xe2\x90\xa6'. Similar to the \u prefix for a Unicode code point, \x is the prefix for a byte written as two hexadecimal digits.

Just like binary is base 2 and our normal counting system is base 10, hexadecimal is base 16. The valid digits in hexadecimal are 0-9 and A-F. Here are the values corresponding to each character:

A - 10
B - 11
C - 12
D - 13
E - 14
F - 15

In hexadecimal, 9 + 1 equals A. We use hexadecimal because it represents a byte efficiently. You may recall that a byte is eight bits, or eight binary digits. The highest value we can express in a byte is 11111111, or 255 in base 10. We can express the same value in two hexadecimal digits, FF.

Programmers often use hexadecimal to display bytes instead of binary because it's more compact and easier to write out.
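The relationship between binary, base 10, and hexadecimal can be checked with Python's built-in conversion functions. This is a short sketch with illustrative variable names, reusing the symbol U+2426 from the earlier strings.

```python
value = 255

as_binary = bin(value)        # eight binary digits
as_hex = hex(value)           # two hexadecimal digits
from_hex = int("FF", 16)      # parse hexadecimal back to base 10

# 9 + 1 equals A in hexadecimal.
nine_plus_one = int("9", 16) + 1 == int("A", 16)

# bytes.hex() shows each byte as two hexadecimal digits.
symbol_hex = "\u2426".encode("utf-8").hex()

print(as_binary, as_hex, from_hex, nine_plus_one, symbol_hex)
```

Each pair of hexadecimal digits in symbol_hex corresponds to one byte, which is why hexadecimal is a compact way to display bytes.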


In [9]:
# One byte (eight bits) in hexadecimal (the value of the byte below is \xe2)
hex_byte = "â"

# Print the base 10 integer value for the hexadecimal byte
print(ord(hex_byte))

# This gives the exact same value. Remember that \x is just a prefix, and doesn't affect the value.
print(int("e2", 16))

# Convert the base 10 integer to binary
print(bin(ord("â"))) 
binary_aa = bin(ord("\xaa"))
binary_ab = bin(ord("\xab"))


226
226
0b11100010


In [11]:
from collections import Counter
fruits = ["apple", "apple", "banana", "orange", "pear", "orange", "apple", "grape"]
fruit_count = Counter(fruits)
print(fruit_count)
Counter()

Counter({'apple': 3, 'orange': 2, 'banana': 1, 'pear': 1, 'grape': 1})


In [12]:
from collections import Counter
fruits = ["apple", "apple", "banana", "orange", "pear", "orange", "apple", "grape"]
fruit_count = Counter(fruits)

# Our code has counted each of the items in the list, and given them dictionary keys
print(fruit_count)

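# filtered_tokens is preloaded in the original lesson environment. It isn't defined in this notebook, so the line below raises the NameError shown in the output.
filtered_token_counts = Counter(filtered_tokens)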

Counter({'apple': 3, 'orange': 2, 'banana': 1, 'pear': 1, 'grape': 1})


NameError: name 'filtered_tokens' is not defined
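Because filtered_tokens is not available here, the sketch below uses a small made-up list (sample_tokens, a stand-in that is not the lesson's data) to show how a Counter built from tokens might be queried.

```python
from collections import Counter

# Hypothetical stand-in for a list of tokens.
sample_tokens = ["data", "python", "data", "bytes", "python", "data"]
token_counts = Counter(sample_tokens)

# most_common(n) returns the n most frequent items with their counts.
top_two = token_counts.most_common(2)
print(top_two)

# Looking up a key that was never counted returns 0 instead of raising KeyError.
print(token_counts["missing"])
```

A Counter behaves like a dictionary, so the usual dictionary operations such as iterating over .items() work as well.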

Converting a string of binary digits to a base 10 integer:

int("100", 2)

Encoding a string into bytes:

batman.encode("utf-8")

Converting a string of length one to a Unicode code point:

ord("a")

Converting an integer to a binary string:

bin(100)

Decoding a bytes object into a string:

morgan_freeman.decode()

Counting how many times a string occurs in a list:

from collections import Counter
fruits = ["apple", "apple", "banana", "orange"]
fruit_count = Counter(fruits)

