# PYTHON OPTIMIZATIONS: INTERNING

**Interning**: reusing objects on demand.

At startup, Python (CPython) pre-loads (caches) a global list of integers in
the range [-5, 256].<br>

Any time an integer in that range is referenced, Python uses the cached
version of that object. These objects are called **singletons**. This is an optimization strategy: small integers show up often.

When we write
`a = 10`
Python just has to point to the existing reference for 10.

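But if we write
`a = 257`
Python does not use that global list, and a new object is typically created each time the value is produced (the compiler may still share one constant within a single compiled block of code).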
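
The caching described above can be probed with a small sketch. The helper name `is_cached` is an assumption made for illustration, and the check relies on CPython implementation details: each value is built twice at run time (through `str` and `int`) so that the two names cannot simply share one compiled constant.

```python
def is_cached(n):
    """Return True if CPython hands back the same object for n twice."""
    # Build the integer at run time, twice, so that both names cannot
    # refer to a single constant stored in the compiled code.
    first = int(str(n))
    second = int(str(n))
    return first is second


# Two values inside the documented range and one far outside it.
for value in (-5, 10, 10**9):
    print(value, is_cached(value))
```

The exact upper bound of the cache is an implementation detail, so code should compare integers with `==` and treat `is` only as a way to observe the cache.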

In [1]:
[-5, 256]

[-5, 256]

In [2]:

a = 10
b = 10
print(id(a))
print(id(b))

1664902464
1664902464


In [3]:
a = -5
b = -5
print(id(a), id(b))

1664902224 1664902224


In [4]:
a is b

True

In [5]:
a = 256
b = 256
print(id(a), id(b))

1664906400 1664906400


In [6]:
a is b

True

In [7]:
a = 257
b = 257
print(id(a), id(b))

108831088 108831072


In [8]:
print(a is b)

False


In [9]:
a = 10
b = int(10)
c = int('10')
d = int('1010', 2) 

In [10]:
print(a, b, c, d)

10 10 10 10


In [11]:
print(id(a), id(b), id(c), id(d))

1664902464 1664902464 1664902464 1664902464


# STRING INTERNING

Why string interning?

It's all about optimization: speed and, possibly, memory.

Python, both internally and in the code you write, performs a great many
dictionary-style lookups on string keys, which means a lot of **string equality**
testing.

Let's say we want to see if two strings are equal:<br>
`a = 'some_long_string'`<br>
`b = 'some_long_string'`<br>

Using `a == b`, we may need to compare the two strings *character by character*.<br>
But if we know that 'some_long_string' has been **interned**, then `a` and `b` are the same
string if they both point to the **same memory address**.

In that case we can use `a is b` instead, which compares two **integers** (the memory addresses).

This is **much** faster than the character-by-character comparison.

*Not all strings are automatically interned by Python*.<br>
*But you can **force** strings to be interned by using the `sys.intern()` function.*

```python
import sys
a = sys.intern('the quick brown fox')
b = sys.intern('the quick brown fox')

a is b  # True, and much faster than a == b
``` 
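
As a minimal sketch of the same idea with strings that are built at run time (so no compile-time sharing is involved), the example below creates two equal strings with `str.join` and then interns them. The variable names are illustrative assumptions.

```python
import sys

# Build two equal strings at run time; join() creates a fresh object each call.
parts = ['the', 'quick', 'brown', 'fox']
raw_a = ' '.join(parts)
raw_b = ' '.join(parts)

print(raw_a == raw_b)  # same contents
print(raw_a is raw_b)  # identity of the two un-interned objects

# sys.intern() returns the one canonical object for this text.
a = sys.intern(raw_a)
b = sys.intern(raw_b)
print(a is b)
```

Note that the identity check is only meaningful when every string taking part in the comparison has gone through `sys.intern()`.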

### When should you do this?
* when dealing with a large number of strings that could contain repeats, e.g.
tokenizing a large corpus of text (NLP)
* when doing lots of string comparisons
* when refactoring code
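
The tokenizing case from the list above might look like the following sketch. The `tokenize` helper is hypothetical and splits on whitespace only; a real NLP pipeline would use a proper tokenizer.

```python
import sys


def tokenize(text):
    """Split on whitespace and intern every token (hypothetical helper)."""
    return [sys.intern(token) for token in text.split()]


tokens = tokenize('the cat saw the other cat')

# Equal tokens now share one object, so the number of distinct objects
# matches the number of distinct words.
distinct_objects = len({id(token) for token in tokens})
distinct_words = len(set(tokens))
print(distinct_objects, distinct_words)
```

With a large corpus this keeps a single copy of each repeated word in memory instead of one copy per occurrence.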

In [12]:
a = 'hello'
b = 'hello'
print(id(a), id(b))

109071136 109071136


In [13]:
a = 'hello world'
b = 'hello world'
print(id(a), id(b))

108611088 108610888


In [14]:
a == b

True

In [15]:
a is b

False

In [16]:
a = 'hello'
b = 'hello'
print(a == b)

True


In [17]:
print(a is b)

True


In [18]:
a = '_this_is_a_long_string_that_could_be_used_as_an_identifier'

In [19]:
b = '_this_is_a_long_string_that_could_be_used_as_an_identifier'



In [20]:
a is b

True
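
The result above reflects a CPython implementation detail: string constants made up only of ASCII letters, digits, and underscores are interned when the code is compiled. A minimal sketch, assuming CPython, uses `eval` so that each literal is compiled on its own and the two names cannot share one constant from a single compiled block.

```python
source = "'_this_is_a_long_string_that_could_be_used_as_an_identifier'"

# Each eval() call compiles the string literal separately.
a = eval(source)
b = eval(source)

# Both names end up bound to the interned object.
print(a is b)
```

Because this behavior is not part of the language specification, code that needs a guarantee should call `sys.intern()` explicitly.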

In [21]:
import sys


In [22]:
a = sys.intern('hello world')

In [23]:
b = sys.intern('hello world')

In [24]:
c = 'hello world'

In [25]:
print(id(a), id(b), id(c))

109145288 109145288 109188864


In [26]:
a == b

True

In [27]:
a is b

True

In [28]:
c = '_hello world'

In [29]:
d = '_hello world'

In [31]:
print(c is d)
print(id(c), id(d))

False
109242752 108612208


In [32]:
def compare_using_equals(n):
    a = 'a long string that is not interned' * 200
    b = 'a long string that is not interned' * 200
    for i in range(n):
        if a == b:
            pass

In [33]:
def compare_using_interning(n):
    # Interned strings with equal contents are the same object,
    # so an identity check is enough.
    a = sys.intern('a long string that is not interned' * 200)
    b = sys.intern('a long string that is not interned' * 200)
    for i in range(n):
        if a is b:
            pass


In [34]:
import time


In [35]:
start = time.perf_counter()
compare_using_equals(10000000)
end = time.perf_counter()
print('equality', end - start)

equality 23.187704999999823


In [36]:
start = time.perf_counter()
compare_using_interning(10000000)
end = time.perf_counter()
print('interning', end - start)

interning 3.232388799999626
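
The same comparison can be sketched with the standard `timeit` module, which handles the timing loop for us. This is an alternative to the manual `perf_counter` approach above, not a replacement for it; the variable names and the run count are assumptions chosen to keep the example quick.

```python
import sys
import timeit

base = 'a long string that is not interned'

# Multiplying at run time yields two distinct objects with equal contents.
plain_a = base * 200
plain_b = base * 200

# Interning maps both values to one shared object.
interned_a = sys.intern(base * 200)
interned_b = sys.intern(base * 200)

runs = 100000
equality_time = timeit.timeit(lambda: plain_a == plain_b, number=runs)
identity_time = timeit.timeit(lambda: interned_a is interned_b, number=runs)

print('equality', equality_time)
print('identity', identity_time)
```

Timings vary with the machine and the Python version, so no numbers are quoted here.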
