## Python Optimizations: Interning

Python sometimes creates shared references automatically:

In [1]:
a = 10
b = 10
print(id(a))
print(id(b))

94891096948320
94891096948320


Note how `a` and `b` reference the same object.

But consider the following example:

In [2]:
a = 5000
b = 5000
print(id(a))
print(id(b))

139684958999280
139684959000944


As you can see, the variables `a` and `b` do not point to the same object! This is because Python pre-caches integer objects only in the range **[-5, 256]**, and 5000 falls outside it. This cache is a CPython implementation detail.


In [3]:
a = 256
b = 256
c = -5
d = -5
print(id(a), id(b), id(c), id(d))
# we have the same memory address

94891096956192 94891096956192 94891096947840 94891096947840


This is called **interning**: Python interns the integers in the range [-5, 256], so each of them is essentially a **singleton** object.
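
To see where the cache ends, here is a small sketch that probes the boundaries. It assumes CPython, since the cached range is an implementation detail, and the helper name `is_cached` is made up for this illustration. The integers are built from strings at runtime so that the compiler cannot merge equal literals into one constant.

```python
def is_cached(value):
    """Return True if two separately built ints equal to `value` are the same object."""
    # int(str(...)) builds the integer at runtime, so nothing is shared at compile time.
    first = int(str(value))
    second = int(str(value))
    return first is second


# Probe just outside and just inside both ends of the range.
for n in (-6, -5, 256, 257):
    print(n, is_cached(n))
```

On CPython, only the two values inside [-5, 256] are expected to report `True`.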

In [5]:
a = 10
b = int(10)
c = int('10')
d = int('1010', 2)
print(a, b, c, d)

10 10 10 10


In [8]:
# check whether the names refer to the same object
print(a is b)
print(a is c)
print(a is d)

True
True
True


As we can see, all these variables were created in different ways, but since the integer object with value 10 behaves like a singleton, they all ended up pointing to the **same** object in memory.

## Python Optimizations: String Interning

Python will automatically intern *certain* strings. In particular, all identifiers (variable names, function names, class names, etc.) are interned, meaning a single shared object is created for each.

Python will also intern string literals that look like identifiers.

For example:

In [10]:
a = 'hello'
b = 'hello'
print(id(a))
print(id(b))
print('-----------')
c = 'hello, world!'
d = 'hello, world!'
print(id(c))
print(id(d))

139684960705264
139684960705264
-----------
139684958761456
139684958761904


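In the output above, `'hello'` looks like an identifier and was interned, while `'hello, world!'` (which contains a comma, a space, and an exclamation mark) was not. Length is not the deciding factor: because the following literals resemble identifiers, Python still interns them automatically even though they are quite long: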

In [11]:
a = 'hello_world'
b = 'hello_world'
print(id(a))
print(id(b))

139684959526832
139684959526832


In [13]:
# even longer
a = '_this_is_a_long_string_that_could_be_used_as_an_identifier_ooooooooooo'
b = '_this_is_a_long_string_that_could_be_used_as_an_identifier_ooooooooooo'
print(id(a))
print(id(b))

139684964844976
139684964844976


Interning strings (making them singleton objects) means that testing for string equality can be done faster by comparing memory addresses with `is` instead of comparing characters with `==`:

In [14]:
a = 'this_is_a_long_string'
b = 'this_is_a_long_string'
print('a==b:', a == b)
print('a is b:', a is b)

a==b: True
a is b: True


#### <font color="red">Note: Remember, using `is` ONLY works if the strings were interned!</font>

Here's where this technique fails:

In [15]:
a = 'hello world'
b = 'hello world'
print('a==b:', a==b)
print('a is b:', a is b)

a==b: True
a is b: False
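
The same caveat applies to strings built at runtime, even when their content looks like an identifier. The sketch below assumes CPython behavior, and the variable names are only for illustration.

```python
literal = 'hello_world'                  # identifier-like literal
built = '_'.join(['hello', 'world'])     # same characters, but assembled at runtime

print('built == literal:', built == literal)   # compares contents
print('built is literal:', built is literal)   # compares identity
```

Here the contents match but the two names refer to different objects, so `==` remains the right way to compare strings unless you know both sides were interned.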


You *can* force strings to be interned (but only use it if you have a valid performance optimization need):

In [17]:
import sys

a = sys.intern('hello world')
b = sys.intern('hello world')
c = 'hello world'
print(id(a))
print(id(b))
print(id(c))

139684958923696
139684958923696
139684958921072


Notice how `a` and `b` are pointing to the same object, but `c` is **NOT**.

So, since both `a` and `b` were interned we can use `is` to test for equality of the two strings:

In [18]:
print('a==b:', a==b)
print('a is b:', a is b)

a==b: True
a is b: True
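
Forced interning is most useful when the same strings arrive repeatedly at runtime, for example as tokens read from a file. As a minimal sketch (the helper `intern_tokens` and the sample data are hypothetical), each token is passed through `sys.intern` so that equal tokens end up as one shared object:

```python
import sys


def intern_tokens(tokens):
    """Return a list in which equal strings are replaced by one shared interned object."""
    return [sys.intern(token) for token in tokens]


# Build equal strings at runtime so they start out as separate objects.
raw = [' '.join(['status', 'code', str(200)]) for _ in range(3)]
print(len({id(s) for s in raw}))        # number of distinct objects before interning

interned = intern_tokens(raw)
print(len({id(s) for s in interned}))   # number of distinct objects after interning
print(interned[0] is interned[2])
```

After this step the tokens can be compared with `is`, and only one copy of the text is kept alive by the list.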


So, does interning really make a big speed difference?

Yes, but only if you are performing a *lot* of comparisons. Let's run some quick and dirty benchmarks:

In [19]:
def compare_using_equals(n):
    a = 'a long string that is not interned' * 200
    b = 'a long string that is not interned' * 200
    for i in range(n):
        if a == b:
            pass

In [20]:
def compare_using_interning(n):
    a = sys.intern('a long string that is not interned' * 200)
    b = sys.intern('a long string that is not interned' * 200)
    for i in range(n):
        if a is b:
            pass

In [21]:
import time

start = time.perf_counter()
compare_using_equals(10000000)
end = time.perf_counter()

print('equality: ', end-start)

equality:  1.0860573120007757


In [22]:
start = time.perf_counter()
compare_using_interning(10000000)
end = time.perf_counter()

print('identity: ', end-start)

identity:  0.2549713740008883


As you can see, the performance difference can be substantial, especially for long strings and many comparisons: in this run the identity check was roughly four times faster.
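
As a rough cross-check, the same comparison can be timed with the standard `timeit` module, which handles the loop and the clock for you. This is a sketch only: the function name `time_comparisons`, the string contents, and the repetition count are arbitrary choices, and the numbers will vary by machine and Python version. It also times `==` on two interned strings, since on CPython `==` is expected to return early when both operands are the same object.

```python
import sys
import timeit


def time_comparisons(number=1_000_000):
    """Time `==` on distinct objects, and `==` / `is` on interned strings."""
    parts = ['a long string that is not interned'] * 200
    a = ''.join(parts)      # two equal strings built at runtime,
    b = ''.join(parts)      # so they are separate objects
    ia = sys.intern(a)
    ib = sys.intern(b)      # same object as ia

    return {
        'equality, distinct objects': timeit.timeit(lambda: a == b, number=number),
        'equality, interned': timeit.timeit(lambda: ia == ib, number=number),
        'identity, interned': timeit.timeit(lambda: ia is ib, number=number),
    }


for label, seconds in time_comparisons().items():
    print(f'{label}: {seconds:.4f}s')
```

If the interned `==` case comes out close to the `is` case on your machine, the main cost being avoided is the character-by-character comparison of two separate objects.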