
## CSV Files

CSV files are the workhorse data format for data science. “CSV” usually stands
for “comma‐separated value,” but it really should be “character‐separated
value” since characters other than commas do get used. Sometimes, you will
see “.tsv” if tabs are used or “.psv” if pipes (the “|” character) are used. More
often though, everything gets called CSV regardless of the
delimiter.

CSV files are pretty straightforward conceptually – just a table with rows and
columns. There are a few complications you should be aware of though:

- Headers. Sometimes, the first line gives names for all the columns, and
sometimes, it gets right into the data.
- Quotes. In many files, the data elements are surrounded by quotes or another
character. This is done largely so that commas (or whatever the delimiting
character is) can be included in the data fields.
- Nondata rows. In many file formats, the data itself is CSV, but there are a
certain number of nondata lines at the beginning of the file. Typically, these
encode metadata about the file and need to be stripped out when the file is
loaded into a table.
- Comments. Many CSV files will contain human‐readable comments, as
source code does. Typically, these are denoted by a single character, such as
the # in Python.
- Blank lines. They happen.
- Lines with the wrong number of columns. These happen too.
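
To make the complications above concrete, here is a minimal sketch that handles several of them by hand with Python's standard `csv` module. The sample text, the helper name `read_messy_csv`, and its parameters are all made up for illustration; real files will need their own rules.

```python
import csv

# Hypothetical file contents: one metadata line, a header, a comment,
# a quoted field containing the delimiter, a blank line, and a short row.
raw_text = """exported 2020-01-01
name|city|age
# this is a comment
"Smith|John"|New York|25

Doe|Boston
Lee|Chicago|31
"""

def read_messy_csv(text, sep="|", n_metadata_lines=1, comment="#"):
    """Return (header, good_rows, bad_rows) from pipe-delimited text."""
    lines = text.splitlines()[n_metadata_lines:]
    # Drop blank lines and comment lines before parsing
    kept = [ln for ln in lines if ln.strip() and not ln.startswith(comment)]
    reader = csv.reader(kept, delimiter=sep, quotechar='"')
    header = next(reader)
    good_rows, bad_rows = [], []
    for row in reader:
        # Rows with the wrong number of columns are set aside, not dropped silently
        (good_rows if len(row) == len(header) else bad_rows).append(row)
    return header, good_rows, bad_rows

header, rows, rejected = read_messy_csv(raw_text)
```

Keeping the rejected rows around, rather than discarding them, makes it easy to check afterward how much of the file was malformed.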

The following Python code shows how to read a basic CSV file into a data
frame using the pandas library:

```python
import pandas
df = pandas.read_csv("myfile.csv")
```

If your CSV file has weird complexities associated with it, then `read_csv` has
a number of optional arguments that let you deal with them. Here is a more
complicated call to `read_csv`:


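```python
import pandas

df = pandas.read_csv(
    "myfile.csv",
    sep="|",             # the delimiter; the default is the comma
    header=None,         # the file has no header row
    quotechar='"',       # character that wraps quoted fields
    compression="gzip",  # the file is gzip-compressed
    comment="#",         # ignore the rest of a line after this character
)
```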

The most used optional arguments are `sep` and `header`.

## JSON Files

JSON is a favorite among those from engineering backgrounds, for its dirt simplicity and flexibility.
It is a way to take hierarchical data structures and serialize them into a
plain text format. Every JSON data structure is either of the following:

- An atomic type, such as a number, a string, or a Boolean.
- A JSONObject, which is just a map from strings to JSON data structures.
This is similar to a Python dictionary, except that the keys in a
JSONObject must be strings.
- An array of JSON data structures. This is similar to a Python list.

Here is an example of some valid JSON, which encodes a JSONObject map
with a lot of substructures:

```json
{
    "firstName": "John",
    "lastName": "Smith",
    "isAlive": true,
    "age": 25,
    "address": {
        "streetAddress": "21 2nd Street",
        "city": "New York",
        "state": "NY",
        "postalCode": "10021-3100"
    },
    "children": ["alice", "john", {"name": "alice", "birth_order": 2}],
    "spouse": null
}
```

Note a few things about this example:
- The fact that it is all pretty with the newlines and indentations is
purely to make it easier to read. This could have all been on one long line and
any JSON parser would parse it equally well. A lot of programs for viewing
JSON will automatically format it in this more legible way.
- The overall object is conceptually similar to a Python dictionary, where the
keys are all strings and the values are JSON data structures. The overall
object could have been an array instead.
- A difference between JSON objects and Python dictionaries is that all the
field names have to be strings. In Python, the keys can be any hashable
type.
- The fields in the object can be ordered arrays, such as “children.” These
arrays are analogous to Python lists.
- You can mix and match types in the object, just as in Python.
- You can have Boolean types. Note though that they are declared in lower
case.
- There are also numerical types.
- A null value is supported.
- You can nest the object arbitrarily deeply.
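
Several of these points can be checked directly with Python's standard `json` module. This is a small sketch, with made-up sample values, of how Python types map to JSON and back:

```python
import json

# Non-string dictionary keys are converted to strings when serialized
encoded = json.dumps({1: "one", 2.5: "two and a half"})

# Python's True/False/None become JSON's lowercase true/false/null
literals = json.dumps([True, False, None])

# Going the other way, JSON objects become dicts and arrays become lists
decoded = json.loads('{"children": ["alice", {"birth_order": 2}], "spouse": null}')
```

Note that the key conversion is one-way: loading `encoded` back gives the string keys `"1"` and `"2.5"`, not the original numbers.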

Parsing JSON is a cinch in Python. You can either “load” a JSON string into
a Python object (a dictionary at the highest level, with JSON arrays mapping to
Python lists, etc.) or “dump” a Python dictionary into a JSON string. The JSON
can either be held in a Python string (`json.loads` and `json.dumps`) or be
stored in a file, in which case you read from or write to a file object
(`json.load` and `json.dump`). The code looks as follows:

```python
import json

json_str = """{"name": "Field", "height": 6.0}"""
my_obj = json.loads(json_str)  # JSON string -> Python dictionary
my_obj
# {'name': 'Field', 'height': 6.0}

str_again = json.dumps(my_obj)  # Python dictionary -> JSON string
```
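
The file-based versions work the same way, except that they take a file object. Here is a minimal sketch; the file name `record.json` and the sample record are placeholders, and the file is written to a throwaway temporary directory:

```python
import json
import os
import tempfile

record = {"name": "Field", "height": 6.0, "children": ["alice", None, True]}

# Hypothetical file name inside a temporary directory
path = os.path.join(tempfile.mkdtemp(), "record.json")

# json.dump writes to a file object; indent=4 gives the "pretty" layout
with open(path, "w") as f:
    json.dump(record, f, indent=4)

# json.load reads the structure back from a file object
with open(path) as f:
    restored = json.load(f)
```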

Historically, JSON was invented as a way to serialize objects from the
JavaScript language. Think of the keys in a JSONObject as the names of the
members in an object. However, JSON does NOT support notions such as
pointers, classes, and functions.

## XML Files

XML is similar to JSON: a text‐based format that lets you store hierarchical
data in a format that can be read by both humans and machines. However, it’s
significantly more complicated than JSON – part of the reason that JSON has
been eclipsing it as a data transfer standard on the web.
Let’s jump in with an example:

```xml
<GroupOfPeople>
<person gender="male">
<Name>Field Cady</Name>
<Profession>Data Scientist</Profession>
</person>
<person gender="female">
<Name>Ryna</Name>
<Profession>Engineer</Profession>
</person>
</GroupOfPeople>
```

Everything enclosed in angle brackets is called a “tag.” Every section of the
document is bookended by a matching pair of tags, which tell what type of
section it is. The closing tag contains a slash “/” after the “<”. The opening tag
can contain other pieces of information about the section, called attributes; in
this case, “gender” is such an attribute. Because you can have whatever tag names
or additional attributes you like, XML lends itself to making domain-specific
description languages.

XML sections must be fully nested into each other, so something such as the
following is invalid:

```xml
<a><b></a></b>
```

because the “b” section begins in the middle of the “a” section but doesn’t end
until the “a” is already over. For this reason, it is conventional to think of an
XML document as a tree structure. Every nonleaf node in the tree corresponds
to a pair of opening/closing tags, of some type and possibly with some attributes,
and the leaf nodes are the actual data.

Sometimes, we want the start and end tag of a section to be adjacent to each
other. In this case, there is a little bit of syntactic sugar, where you put the closing
“/” before the closing angle bracket. So,

```xml
<foo a="bar"></foo>
```

is equivalent to

```xml
<foo a="bar"/>
```
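
Both rules can be checked mechanically. The following is a minimal sketch using the standard library's `xml.etree.ElementTree` parser, which is introduced later in this section; the helper name `is_well_formed` is made up for this illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

nested_ok = is_well_formed("<a><b></b></a>")    # properly nested
overlapping = is_well_formed("<a><b></a></b>")  # sections overlap

# The long and short forms of an empty element parse to the same thing
long_form = ET.fromstring('<foo a="bar"></foo>')
short_form = ET.fromstring('<foo a="bar"/>')
same = (long_form.tag, long_form.attrib, long_form.text) == (
    short_form.tag, short_form.attrib, short_form.text)
```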

A big difference between JSON and XML is that the content in XML is
ordered. Every node in the tree has its children in a particular order – the order
in which they come in the document. They can be of any types and come in any
order, but there is AN order.

Processing XML is a little more finicky than processing JSON, in my experience.
This is for three reasons:

- It’s easier to refer to a named field in a JSON object than to search through
all the children of an XML node and find the one you’re looking for.
- XML nodes often have additional attributes, which are handled separately
from the node’s children.
- This isn’t inherent to the data formats, but in practice, JSON tends to be used
in small snippets, for smaller applications where the data has regular structure.
So, you typically know exactly how to extract the data you’re looking
for. In contrast, XML is liable to be a massive document with many parts,
and you have to sift through the whole thing.

In Python, the standard library's `xml` package offers a variety of ways of
processing XML data. The simplest is the ElementTree module, which gives us
direct access to the parse tree of the XML. It is shown in this code example,
where we parse an XML string into a tree object, access and modify the data,
and then re-encode it back to an XML string:

```python
import xml.etree.ElementTree as ET

xml_str = """
<data>
<country name="Liechtenstein">
<rank>1</rank>
<year>2008</year>
<gdppc>141100</gdppc>
<neighbor name="Austria" direction="E"/>
<neighbor name="Switzerland" direction="W"/>
</country>
<country name="Singapore">
<rank>4</rank>
<year>2011</year>
<gdppc>59900</gdppc>
<neighbor name="Malaysia" direction="N"/>
</country>
<country name="Panama">
<rank>68</rank>
<year>2011</year>
<gdppc>13600</gdppc>
<neighbor name="Costa Rica" direction="W"/>
<neighbor name="Colombia" direction="E"/>
</country>
</data>
"""

root = ET.fromstring(xml_str)
root.tag
# 'data'

# gives the zeroth child
root[0]
# <Element 'country' at 0x1092d4410>  (the address will vary)

# dictionary of the node's attributes
root.attrib
# {}

# list of the node's children
list(root)
# [<Element 'country' at 0x1092d4410>, <Element 'country' at 0x1092d47d0>, <Element 'country' at 0x1092d4910>]

# deletes the zeroth child from the tree
del root[0]
modified_xml_str = ET.tostring(root)
```
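
Since finding the right child and reading attributes are the two finicky parts mentioned earlier, here is a short sketch of both, using ElementTree's `findall`, `find`, `iter`, and `get` methods on a trimmed-down copy of the same data. The dictionary names are just for illustration.

```python
import xml.etree.ElementTree as ET

xml_str = """
<data>
  <country name="Liechtenstein">
    <rank>1</rank>
    <neighbor name="Austria" direction="E"/>
    <neighbor name="Switzerland" direction="W"/>
  </country>
  <country name="Singapore">
    <rank>4</rank>
    <neighbor name="Malaysia" direction="N"/>
  </country>
</data>
"""
root = ET.fromstring(xml_str)

ranks = {}
neighbors = {}
for country in root.findall("country"):
    name = country.get("name")                    # attribute lookup
    ranks[name] = int(country.find("rank").text)  # text of a named child
    # iter() walks every descendant with the given tag, in document order
    neighbors[name] = [n.get("name") for n in country.iter("neighbor")]
```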

The “right” way to manage XML data is called the “Document Object Model.”
It is a little more standardized across programming languages and web browsers,
but it is also more complicated to master. The ElementTree is fine for simple
applications and capable of doing whatever you need it to do.

## HTML Files

By far the most important variant of XML is HTML, the language for describing
pages on the web. Practically speaking, the definition of “valid” HTML is
that your web browser will parse it as intended. There are differences between
browsers, some intentional and some not, and that’s why the same page
might look different in Chrome and Internet Explorer. But browsers have
largely converged on a standard version of HTML (the most recent official
standard is HTML5), and to a first approximation, that standard is a variant
of XML. Many web pages could be parsed with an XML parser library.

As mentioned in the last section, XML can be used to create domain‐specific
languages, each of which is defined by its own set of valid tags and their
associated attributes. This is the way HTML works. Some of the more notable
tags are given in the following table:

![html table](html_table.png)

The practical problem with processing HTML data is that, unlike JSON or
even XML, HTML documents tend to be extremely messy. They are often individually
made, edited by humans, and tweaked until they look “just right.” This
means that there is almost no regularity in structure from one HTML document
to the next, so the tools for processing HTML lean toward combing
through the entire document to find what it is you’re looking for.

The default HTML tool for Python is the HTMLParser class, which you use
by creating a subclass that inherits from it. An HTMLParser works by walking
through the document, performing some action each time it hits a start or an
end tag or other piece of text. These actions will be user‐defined methods on
the class, and they work by modifying the parser’s internal state. When the
parser has walked through the entire document, its internal state can be queried
for whatever it is you were looking for. One very important note is that it’s
up to the user to keep track of things such as how deeply nested you are within
the document’s sections.

To illustrate, the following code will pull down the HTML for a Wikipedia
page, step through its content, and count all hyperlinks that are embedded in
the body of the text (i.e., they are within paragraph tags):

```python
from html.parser import HTMLParser
from urllib.request import urlopen

TOPIC = "Dangiwa_Umar"
url = "https://en.wikipedia.org/wiki/%s" % TOPIC


class LinkCountingParser(HTMLParser):
    in_paragraph = False
    link_count = 0

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.in_paragraph = True
        elif tag == 'a' and self.in_paragraph:
            self.link_count += 1

    def handle_endtag(self, tag):
        if tag == 'p':
            self.in_paragraph = False


# urlopen returns bytes, so decode them before feeding the parser
html = urlopen(url).read().decode("utf-8")
parser = LinkCountingParser()
parser.feed(html)
print("there were", parser.link_count, "links in the article")
```
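
The example above needs network access, so here is a self-contained variant that runs on an inline HTML string. It is only a sketch: the class name `ParagraphLinkCollector` and the sample HTML are made up, and it tracks nesting with a depth counter instead of a flag to show the kind of bookkeeping the parser leaves to you.

```python
from html.parser import HTMLParser

class ParagraphLinkCollector(HTMLParser):
    """Collect the href of every link that sits inside a paragraph."""

    def __init__(self):
        super().__init__()
        self.paragraph_depth = 0  # how many <p> tags are currently open
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.paragraph_depth += 1
        elif tag == "a" and self.paragraph_depth > 0:
            # attrs is a list of (name, value) pairs
            self.links.append(dict(attrs).get("href"))

    def handle_endtag(self, tag):
        if tag == "p" and self.paragraph_depth > 0:
            self.paragraph_depth -= 1

sample_html = """
<html><body>
<a href="/nav">navigation link, outside any paragraph</a>
<p>First <a href="/one">link</a> and <a href="/two">another</a>.</p>
<p>No links here.</p>
</body></html>
"""

collector = ParagraphLinkCollector()
collector.feed(sample_html)
```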

## Tar Files

Tar is the most popular example of an “archive file” format. The idea is to take
an entire directory full of data, possibly including nested subdirectories, and
combine it all into a single file that you can send in an e‐mail, store somewhere,
or whatever you want. There are a number of other archive file formats, such
as ISO, but in my experience, tar is the most common example.

Archive files also show up in the Java programming language and its relatives.
Compiled Java classes are bundled into JAR files, which play the same role for
Java class files that a Tar file plays for arbitrary files. Despite the similar
name, though, JAR is not the Tar format: a JAR file is built on the ZIP format
discussed later in this chapter.

Tarring a directory doesn’t actually compress the data; it just combines the
files into one file that takes up about as much space as the data did originally.
So in practice, Tar files are almost always then compressed. Gzip in particular
is popular. The “.tgz” file extension is used as a shorthand for “.tar.gz”, that is,
the directory has been put into a Tar file, which was then compressed using the
GZIP algorithm.

Tar files are typically opened from the command line, as in the
following:

```
    # This will expand the contents of my_directory.tar into the local directory
    $ tar -xvf my_directory.tar
    
    # This command will untar and unzip a directory which has been tarred and gzipped
    $ tar -zxf file.tar.gz
    
    # This command will tar the Homework3 directory into the file ILoveHomework.tar
    $ tar -cf ILoveHomework.tar Homework3 
```
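
The same operations are available from Python through the standard `tarfile` module. The sketch below builds a small directory in a temporary location, archives it as a gzipped Tar file, and extracts it again; the directory and file names are placeholders.

```python
import os
import tarfile
import tempfile

work_dir = tempfile.mkdtemp()

# Build a small directory tree to archive (hypothetical contents)
src_dir = os.path.join(work_dir, "Homework3")
os.makedirs(os.path.join(src_dir, "data"))
with open(os.path.join(src_dir, "data", "answers.txt"), "w") as f:
    f.write("42\n")

# "w:gz" tars and gzips in one step, like `tar -czf`
archive_path = os.path.join(work_dir, "Homework3.tgz")
with tarfile.open(archive_path, "w:gz") as tar:
    tar.add(src_dir, arcname="Homework3")

# "r:gz" reads it back, like `tar -xzf`
out_dir = os.path.join(work_dir, "extracted")
with tarfile.open(archive_path, "r:gz") as tar:
    member_names = tar.getnames()
    tar.extractall(out_dir)
```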

## GZip Files

Gzip is the most common compression format that you will see on Unix‐like
systems such as Mac and Linux. Often, it’s used in conjunction with Tar to
archive the contents of an entire directory. Encoding data with gzip is comparatively
slow, but the format has the following advantages:

- It compresses data super well.
- Data can be decompressed quickly.
- It can also be decompressed one line at a time, in case you want to operate
on only part of the data without decompressing the whole file.

Under the hood, gzip runs on a compression algorithm called DEFLATE. A
compressed gzip file is broken into blocks. The first part of each block contains
some data about the block, including how the rest of the block is encoded (it
will be some type of Huffman code, but you don’t need to worry about the
details of those). Once the gzip program has parsed this header, it can read the
rest of the block 1 byte at a time. This means there is minimal RAM being used
up, so all the decompression can go on near the top of the RAM cache and
hence proceed at breakneck speed.

The typical commands for gzipping/unzipping from the shell are simple:

```
    # creates raw file myfile.txt
    $ gunzip myfile.txt.gz 
    
    # compresses the file into myfile.txt.gz
    $ gzip myfile.txt 
```
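
From Python, the standard `gzip` module gives you the line-at-a-time behavior described earlier. This is a minimal sketch: it writes a throwaway gzip file (the name `myfile.txt.gz` is a placeholder) and then reads only the first three lines without decompressing the rest.

```python
import gzip
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "myfile.txt.gz")

# Write 1,000 lines of text through the compressor
raw_size = 0
with gzip.open(path, "wt", encoding="utf-8") as f:
    for i in range(1000):
        line = "line %d\n" % i
        raw_size += len(line)
        f.write(line)

compressed_size = os.path.getsize(path)

# Read back lazily: stop after three lines
first_lines = []
with gzip.open(path, "rt", encoding="utf-8") as f:
    for line in f:
        first_lines.append(line.rstrip("\n"))
        if len(first_lines) == 3:
            break
```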

However, you can typically also just double‐click on a file – most operating
systems can open gzip files natively.

## Zip Files

Zip files are very similar to Gzip files. In fact, they even use the same DEFLATE
algorithm under the hood! There are some differences though, such as the fact
that ZIP can compress an entire directory rather than just individual files.
Zipping and unzipping files is as easy with ZIP as with GZIP:

```
    # This puts several files into a single zip file
    $ zip filename.zip input1.txt input2.txt resume.doc pic1.jpg
    
    # This will open the zip file and put all of its contents into the current directory
    $ unzip filename.zip
```
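
Python's standard `zipfile` module does the same job. In this sketch the archive name and member names are placeholders; note that a single member can be read without unpacking the whole archive.

```python
import os
import tempfile
import zipfile

zip_path = os.path.join(tempfile.mkdtemp(), "filename.zip")

# Put several files, including one in a subdirectory, into a single zip file
with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("input1.txt", "first file\n")
    zf.writestr("notes/input2.txt", "second file\n")

# List the contents and read one member directly
with zipfile.ZipFile(zip_path) as zf:
    names = zf.namelist()
    second = zf.read("notes/input2.txt").decode("utf-8")
```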

## Image Files: Rasterized, Vectorized, and/or Compressed

Image files can be broken down into two broad categories: rasterized and vectorized.
Rasterized files break an image down into an array of pixels and encode
things such as the brightness or color of each individual pixel. Sometimes, the
image file will store the pixel array directly, and other times, it will store some
compressed version of the pixel array. Almost all machine‐generated data will
be rasterized.

Vectorized files, on the other hand, are a mathematical description of what
the image should look like, complete with perfect circles, straight lines, and so
on. They can be scaled to any size without losing resolution. Vectorized files
are more likely to be company logos, animations, and similar things. The most
common vectorized image format you’re likely to run into is SVG, which is
actually just an XML file under the hood. However, in daily work as a data scientist,
you’re most likely to encounter rasterized files.

A rasterized image is an array of pixels that, depending on the format, can be
combined with metadata and then possibly subjected to some form of compression
(sometimes using the same DEFLATE algorithm that gzip uses). There are several
considerations that differentiate the formats available:

- Lossy versus lossless. Many formats (such as BMP and PNG) encode the
pixel array exactly; these are called lossless. But others (such as JPEG) allow
you to reduce the size of the file by degrading the quality of your image.
- Grayscale versus RGB. If images are black-and-white, then you only need
one number per pixel. But if you have a colored image, then there needs to be
some way to specify the color. Typically, this is done by using RGB encoding,
where a pixel is specified by how much red, how much green, and how much
blue it contains.
- Transparency. Many images allow pixels to be partly transparent. The “alpha”
of a pixel ranges from 0 to 1, with 0 being completely transparent and 1 being
completely opaque.
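
To make the pixel-level ideas concrete, here is a toy sketch in plain Python, with an image stored as nested lists of (R, G, B) tuples. The function names are made up, and the grayscale weights are one common convention rather than anything a particular file format mandates.

```python
def to_grayscale(pixel):
    """Collapse an (R, G, B) pixel to one brightness value in 0..255."""
    r, g, b = pixel
    # Commonly used luma weights: green counts most, blue least
    return round(0.299 * r + 0.587 * g + 0.114 * b)

def blend(foreground, background, alpha):
    """Composite a foreground pixel over a background pixel.

    alpha = 0 is completely transparent, alpha = 1 is completely opaque.
    """
    return tuple(
        round(alpha * f + (1 - alpha) * b)
        for f, b in zip(foreground, background)
    )

# A tiny 2 x 2 RGB image: white, black, pure red, pure blue
image = [
    [(255, 255, 255), (0, 0, 0)],
    [(255, 0, 0), (0, 0, 255)],
]
gray_image = [[to_grayscale(p) for p in row] for row in image]

half_blend = blend((200, 100, 0), (0, 100, 200), alpha=0.5)
```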

Some of the most important image formats you should be aware of are as
follows:

- JPEG. This is probably the single most important one in web traffic, prized
for its ability to massively compress an image with almost invisible degradation.
It is a lossy compression format, stores RGB colors, and does not allow
for transparency.
- PNG. This is maybe the next most ubiquitous format. It is lossless and allows
for transparent pixels. The transparent pixels make PNG
files super useful when putting together slide decks.
- TIFF. TIFF files are not common on the Internet, but they are a frequent format
for storing high-resolution pictures in the context of photography or
science. They can be lossy or lossless.

The following Python code will read an image file using the Pillow library. It
takes care of any decompression or format-specific stuff under the hood and
returns the image as a NumPy array of integers. It will be a three-dimensional
array, with the first two dimensions corresponding to the image’s height and
width (rows come first). The image is converted to RGB, and the third dimension
of the array indicates whether we are measuring the red, green, or blue
content. The integers themselves will range from 0 to 255, since each is
encoded with a single byte.

```python
import numpy as np
from PIL import Image

# Image.open decodes the file; np.array turns it into a height x width x 3 array
img = np.array(Image.open('mypic.jpg').convert('RGB'))
```

If you want to read the image as grayscale, you can convert to mode "F" instead
and get a two-dimensional array. If you instead want to include the alpha opacity
as a fourth value for each pixel, convert to mode "RGBA".

## It’s All Bytes at the End of the Day

At the lowest level, the data in a computer file is a long array of bits, each of
which is set to 0 or 1. That array is broken into 8-bit chunks called bytes. The
notion of a byte is both logical and physical. On the one hand, we usually
break up a file into basic logical units that are composed of bytes, such as having
one byte to encode a letter or a number. You could theoretically create a
file format where the basic units were of 5 bits or 11 bits long, but the universal
convention is to use bytes. At the same time, the physical hardware of the
computer is optimized to process data one byte (or a group of several bytes) at
a time.

A modern computer’s memory is called RAM, for “random access memory.”
“Random” in this case isn’t about probability: it refers to the fact that you can
read/modify any part of memory with about the same latency. The memory is
physically broken up into bytes for easier processing. The data structures that
exist in memory as a program runs are, similarly to raw files, ultimately encoded
into bytes. Sometimes, the encodings used for a file and a real‐time data structure
are identical, and sometimes, they are quite different.

Historically, an atomic type in a programming language was defined to take
up a fixed number of bytes. An integer would frequently be allocated 4 bytes,
and the integer was encoded in those bytes in binary. Having every integer take
up the same amount of space was critical, because the physical layout of the
bits doesn’t make it clear where one integer (or any other type of variable) ends
and another begins. These transitions generally occur on the boundaries
between bytes, but that’s it. The computer’s “native language” doesn’t have any
notion of integers or any other data type; to the computer, everything is just
bytes, so fixed‐size variables are critical for keeping track of things.
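
Python's standard `struct` module lets you see this directly. The sketch below packs three integers into fixed 4-byte slots, finds one of them purely by its byte offset, and then reads the same four bytes both as text and as a number. The specific values are arbitrary.

```python
import struct

# Three integers packed into fixed-size, 4-byte little-endian slots
packed = struct.pack("<3i", 1, 2, 300)

# Because every slot is 4 bytes, the second integer must start at byte 4
second = struct.unpack_from("<i", packed, offset=4)[0]

# The same four bytes read two ways: as ASCII text and as one unsigned integer
raw = b"ABCD"
as_text = raw.decode("ascii")
as_int = struct.unpack("<I", raw)[0]
```

Nothing in `raw` itself says which reading is correct; that knowledge lives entirely in the code that interprets the bytes.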

In modern languages such as Python, there are more variable‐size types. For
example, a Python string can take up arbitrarily many bytes. However, doing
this requires overhead to keep track of where one item ends and another
begins, which translates to a substantial performance cost. Modern languages
tend to try for fixed‐size atomic types whenever possible, but then revert to the
less‐efficient version when necessary. Software that is intended to run extremely
fast, like Python’s numerical libraries, almost always strips out the overhead
and limits itself to fixed‐size types.

## Integers

Integers are about the simplest atomic type to understand. Back in the day
when RAM was more expensive, people did all kinds of tricks to try and encode
integers using fewer bits, but now things have basically settled out:

- An integer gets a fixed number of bytes. Four bytes is common, and eight bytes
is typical if you’re using a 64-bit computer.
- The integer is encoded in those bits in binary.
- One of the bits isn’t interpreted as an ordinary binary digit: it is a flag saying
whether the integer is negative. If it’s negative, then typically the 0s and 1s are
flipped in the rest of the number and 1 is added (a scheme known as two’s
complement), for arithmetic efficiency reasons that you don’t need to worry about.
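
If you are curious what that layout looks like, the following sketch prints the bits of a few small signed integers using Python 3's `int.to_bytes`. The helper name `bits` is made up, and a single byte is used only to keep the output short.

```python
def bits(value, n_bytes=1):
    """Show how a signed integer is laid out in n_bytes bytes."""
    raw = value.to_bytes(n_bytes, byteorder="big", signed=True)
    return " ".join(format(byte, "08b") for byte in raw)

positive = bits(5)     # '00000101'
negative = bits(-5)    # '11111011': the bits of 5 flipped, plus 1
largest = bits(127)    # '01111111': sign bit clear, everything else set
smallest = bits(-128)  # '10000000': only the sign bit set
```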

This system works seamlessly most of the time, but there is a maximum size
of integer that can be handled; 63 bits is only so big. In Python 2, you can get
that upper bound in the following way (Python 3 removed `sys.maxint`;
`sys.maxsize` typically reports the same value there):

```python
import sys

sys.maxint
# 9223372036854775807
```

This number is 2 to the power of 63, minus 1: 63 bits store the number and one
flags its sign. This number is large enough for almost all purposes, but
occasionally you need something bigger. Oftentimes, you never even realize
that you’ve ventured into this area! In Python 2, if you ever set a variable
equal to something larger than sys.maxint, then Python will silently switch
over to a different, far less efficient data type called a "long." From a programmer’s
perspective, a long looks, feels, and acts like an int: the only clear sign
that it’s something different is a telltale "L" after the number when it’s
displayed (Python 3 merges the two types into a single int, so no "L" appears
there):

```python
3 * sys.maxint
# 27670116110564327421L
```

The seamless transition is a luxury afforded by using a very high‐level language
such as Python, and you pay for it in efficiency. It takes overhead to
check at every step whether the system needs to switch over to using longs, and
if things ever DO switch over the performance hit really cranks up.
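
For comparison, here is a short sketch of the same experiment under Python 3, where there is only one integer type and the switch to big-number arithmetic is invisible. The variable names are just for illustration.

```python
import sys

# Largest value of the native index type; 2**63 - 1 on typical 64-bit builds
native_max = sys.maxsize

# Arithmetic past that limit just works
big = 3 * (2 ** 63 - 1)

still_int = type(big) is int  # no separate "long" type
n_bits = big.bit_length()     # more bits than fit in a 64-bit machine word
```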

## Floats

Floating-point numbers are more complicated than integers, mostly because
they are inherently error-prone. A real number can theoretically have infinitely
many decimal places, and the computer can only store finitely many of
them. Innocuous operations such as taking a square root, or even dividing by
3, can balloon a previously tame number into infinite-decimal land.

In almost every computer system, a floating-point value is stored as a pair of
two numbers, typically a pair of integers of the kind discussed in the previous
section:

- One integer stores the digits in the binary representation of the number.
- The other stores the location of the binary point in the number (the exponent).

The overwhelming advantage to this way of doing things is that it lets us
represent both very large and very small numbers with the same degree of
accuracy; roundoff error will corrupt the number 1 billion about the same percentage
that it hurts the number 1 billionth. Other floating‐point schemes
were tried out back in the day, but they are now in the dustbin of history.
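
You can peek at this two-part representation from Python. The sketch below uses `math.frexp`, which splits a float into a fractional mantissa and a power-of-two exponent (a convenient view of the stored pair rather than the raw bits), and `float.hex`, which prints the stored digits and exponent. The helper name `decompose` is made up.

```python
import math

def decompose(x):
    """Split x into (mantissa, exponent) with x == mantissa * 2**exponent."""
    mantissa, exponent = math.frexp(x)
    return mantissa, exponent

billion = decompose(1e9)
billionth = decompose(1e-9)
eight = decompose(8.0)  # (0.5, 4), since 8 == 0.5 * 2**4

# float.hex shows the stored binary digits and exponent directly
tenth_hex = (0.1).hex()
```

A billion and a billionth get mantissas in the same range, between 0.5 and 1; only the exponent differs, which is why both carry the same relative precision.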

As a data scientist, you don’t need to worry too much about roundoff error;
good partial workarounds have been baked into most numerical algorithms,
and the fact that RAM is so cheap now means that we usually carry around
many more decimal places than are ever necessary. However, roundoff issues
can show up in subtle ways, as shown in this script:

```python
x, y = 0.1, 0.2
x + y
# 0.30000000000000004
```

This is because 0.1 and 0.2 both have infinitely many decimal places when
expressed in binary, so Python only stores an approximation to them. The
stored value of x is not 0.1; it is the closest number to 0.1 that can be stored as
a float. In this case, that number is slightly larger than 0.1, and similarly for 0.2.
When you add x and y, these small errors are large enough to add up. If you try
to look at the value of x, you will see

```python
x
# 0.1
```

This number is an illusion! Python is rounding x by a tiny bit before displaying
it, as a visual courtesy to the user. But the error on x + y is large
enough that Python displays it.

As with large integers, there are computationally very expensive workarounds
for the limitations of machine floating points. Usually, these take the
form of either storing numbers as arbitrary-length strings or storing the arithmetic
expressions that generated the numbers. These exceedingly expensive,
but technically exact, expressions are carried through a computation and can
later be approximately cast into the normal style of numbers.

You’ll probably never use an exact arithmetic system.
They are mostly useful in theoretical math situations where exact equality is
important, and this almost never occurs in real-world work.
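
For the curious, Python's standard library does ship two such systems, `fractions.Fraction` and `decimal.Decimal`. This short sketch contrasts them with ordinary floats and with the more common remedy of comparing within a tolerance:

```python
from decimal import Decimal
from fractions import Fraction
import math

# Binary floats: the sum is not exactly 0.3
float_equal = (0.1 + 0.2 == 0.3)

# The usual remedy is to compare within a tolerance
close_enough = math.isclose(0.1 + 0.2, 0.3)

# Exact rational arithmetic gets the equality right
exact_equal = (Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10))

# Decimal(0.1) exposes the value actually stored for the float 0.1,
# which is slightly larger than one tenth
stored_is_larger = Decimal(0.1) > Decimal("0.1")
```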

## Text Data

The previous two subsections were kind of academic: you generally don’t need
to worry about how machines represent numbers in your daily work. This is
not the case with strings though: there are several different ways that strings
are stored, which have very different tradeoffs, and you must keep an eye
toward them.

The granddaddy string format is called ASCII (pronounced “ass”, “key”). It is
dirt simple, is super efficient, and has stood the test of time. The problem is
that it’s set in stone, and it’s limited. Anything you can type with a standard
American‐style keyboard can be encoded into ASCII, so you can do a lot with
it. But in many modern applications that’s not enough. There are Chinese
characters, and German letters with an umlaut on top. There are emoticons.
There might even be additional types of text that get invented later – just look
at the rise of the emoticon!

In ASCII, every character is encoded in a single byte, sometimes called a
“char.” This gives us an interesting phenomenon: there is a mapping between
ASCII characters and short integers, since they are encoded by the same byte.
It’s not one‐to‐one, because ASCII only specifies characters for numbers up to
127, but a byte can encode up to 255 (some bytes are not valid ASCII, but they
are still perfectly fine encodings of integers). Quite rationally, the capital "A" is
the number 65, "B" is 66, and so on. The lowercase letters come later, with 97
for “a,” 98 for “b,” 99 for “c,” and so on. Python lets you convert between these
using the functions chr() and ord() (for "ordinal"):

```python
chr(65)
# 'A'

ord("A")
# 65
```

ASCII also includes the various special characters that you can type with a
keyboard. Tab is 9, and newline is 10. “@” is 64. The digits “0” through “9” are
48 to 57.

You might think of ASCII as the “establishment” string format that things use
if they have to be extremely fault tolerant. Python 2 source code is supposed to
be stored as ASCII: the interpreter will throw an error if you point it at a file
containing non-ASCII bytes, unless the file declares another encoding (Python 3
assumes UTF-8 instead). Operating systems use it. Plain text files are typically
ASCII whenever possible. Plain Python 2 string objects are stored in RAM as
one byte per character, ASCII style.

This might be a good time to revisit the way we declare strings in Python – this
paragraph is optional but interesting. Recall that for the most part we just put
the contents of the string in quotation marks, and type whatever we want, as in:

```python
my_string = "abc123"
```

But some characters, such as tabs and newlines, can’t always be directly
typed. In this case, we use the backslash character "\" to encode them out of
things that we CAN type. For example:

```python
# this is a one-character string
my_tab = "\t"

# this is too
my_newline = "\n"
```

Adding the backslash before a character in order to encode something is called
“escaping” the character. Now I’ll give you the keys to the kingdom: if you want
super fine-grained control, you can escape "x" to tell the computer exactly
which bytes should be in a string. If I declare a string such as

```python
# the b prefix marks a byte string; in Python 2 a plain '\xAA' behaves the same way
fancy_string = b'\xAA'
```

then the two characters "AA" will be interpreted as the hexadecimal number of
the byte you want. Hexadecimal is a slightly archaic, base-16 way to
write numbers, whose 16 digits are 0, 1, 2, …, 9, A, B, …, E, F. Writing "\t" is just
a nicer way of writing "\x09," and "\n" is the same thing as "\x0A" (0A in hexadecimal
is 10). In fact, this is more powerful than ASCII, because technically
ASCII numbers only go up through 127, whereas hexadecimal notation lets
you put in bytes up to 255, that is, any possible byte.

The other big string standard is known as Unicode. Unicode defines one huge
character set along with a family of encodings for it, all of them aiming to
supplement basic ASCII with the massive range of other characters needed today
and possibly in the future. The most widely used of these encodings is UTF-8,
and it is fast becoming the most popular encoding around. In this chapter,
UTF-8 will be the one I discuss.

The biggest difference between UTF-8 and ASCII is that in UTF-8 there
is a variable number of bytes that encode each character. This means that all
the performance advantages of fixed-sized elements go out the window, but
this is the price you must pay for flexibility. However, UTF-8 is backward
compatible with ASCII: a chunk of bytes that are valid ASCII are also valid
UTF-8. This works because not all bytes are valid ASCII: the ASCII integers
top out at 127, but a byte can go up to 255. So, if you are reading through an
array of UTF-8 bytes and come to a byte that is greater than 127, it signifies that
this byte (and possibly the next several) constitutes a non-ASCII character.

When you upgrade from ASCII to 2-byte characters, you get pretty much all
characters in Western languages. Three bytes will give you East Asia. Four
will give you various historical writing systems, mathematical symbols, and
emoticons.
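
You can watch the byte counts grow with a few sample characters. This is a small Python 3 sketch; the characters are written as escape codes so that the example does not depend on how this file itself is encoded.

```python
# "A", e-acute, a Chinese character, and an emoji
samples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]
byte_lengths = [len(ch.encode("utf-8")) for ch in samples]

# Pure-ASCII text produces identical bytes under both encodings
same_bytes = "abc123".encode("ascii") == "abc123".encode("utf-8")

# Every byte of a multi-byte character is above 127
high_bytes = all(b > 127 for b in "\u00e9".encode("utf-8"))
```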

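Python has native support for Unicode. In Python 2, declaring a string-type
variable to be Unicode rather than a normal string is as simple as putting a "u"
in front of the opening quotation mark (in Python 3 every string is already
Unicode, and the prefix is accepted but optional):

```python
unicode_str = u"This is unicode"
```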

Python 2 strings are more general than ASCII or UTF-8: they are really just
sequences of bytes (the role played by the bytes type in Python 3). Using the
“\x” trick, you can force Python to put an arbitrary collection of bytes into
such an object, and they may or may not be valid ASCII/Unicode. If you
have a byte string and you want to convert it into valid ASCII or UTF-8 text,
then you can say

```python
my_string = b"abc123"  # a byte string (a plain str in Python 2)

# fails if not valid ASCII
as_ascii = my_string.decode('ascii')

# drops non-ASCII bytes
as_ascii = my_string.decode('ascii', 'ignore')

# drops bytes that are not valid UTF-8
as_utf8 = my_string.decode('utf8', 'ignore')
```
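
To see the error-handling options actually do something, the bytes need to contain a non-ASCII character. Here is a minimal Python 3 sketch using the UTF-8 encoding of "café" as sample data; the variable names are just for illustration.

```python
# The UTF-8 bytes for "café": b'caf\xc3\xa9'
raw = "caf\u00e9".encode("utf-8")

as_text = raw.decode("utf-8")                    # decodes cleanly
ascii_ignored = raw.decode("ascii", "ignore")    # silently drops the bad bytes
ascii_replaced = raw.decode("ascii", "replace")  # marks each bad byte with U+FFFD

# With no error handler, invalid bytes raise an exception
try:
    raw.decode("ascii")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

Silently dropping bytes is convenient but loses data, so "replace" is often the safer choice when you need to notice that something went wrong.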