<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc" style="margin-top: 1em;"><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Python-tricks" data-toc-modified-id="Python-tricks-0.1"><span class="toc-item-num">0.1&nbsp;&nbsp;</span>Python tricks</a></span></li><li><span><a href="#Numpy" data-toc-modified-id="Numpy-0.2"><span class="toc-item-num">0.2&nbsp;&nbsp;</span>Numpy</a></span></li><li><span><a href="#Regular-expressions" data-toc-modified-id="Regular-expressions-0.3"><span class="toc-item-num">0.3&nbsp;&nbsp;</span>Regular expressions</a></span></li></ul></li></ul></div>

This notebook covers:
* Python tricks: some lesser-known Python tips and tricks
* Numpy
* Regex basics

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import dask.dataframe as dd
import itertools
from operator import itemgetter

## Python tricks

![pokemon](../imgs/python-tips.jpg)

**variable unpacking**

In [3]:
a, b, c = [1, 2, 3]

In [5]:
a, b, c

(1, 2, 3)
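
Unpacking also accepts a starred name that collects whatever is left over. A minimal sketch of that, with illustrative variable names (`first`, `rest`, `head`, `middle`, `tail`) that do not appear elsewhere in this notebook:

```python
# Starred unpacking: the starred name always receives a list.
first, *rest = [1, 2, 3, 4]
print(first, rest)          # 1 [2, 3, 4]

# Works on any iterable, including strings.
head, *middle, tail = "abcde"
print(head, middle, tail)   # a ['b', 'c', 'd'] e

# Swapping two values without a temporary variable.
x, y = 10, 20
x, y = y, x
print(x, y)                 # 20 10
```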

**negative indexing**

In [6]:
l = [1, 2, 3, 4, 5]

In [8]:
l[-2]

4

In [9]:
l[-3:-1]

[3, 4]

**joining elements of list in a string**

In [112]:
a = ['cat', 'dog', 'rat']

In [113]:
"".join(a)

'catdograt'

In [115]:
",".join(a)

'cat,dog,rat'

In [114]:
" ".join(a)

'cat dog rat'

**splitting strings**

In [117]:
'a b c'.split()  # default: split on any run of whitespace

['a', 'b', 'c']

In [119]:
'a b c'.split(" ")

['a', 'b', 'c']

In [120]:
'cat,dog,rat'.split(',')

['cat', 'dog', 'rat']

**List slices with step (a[start:end:step])**

In [10]:
l[::-1]  #reversing the list

[5, 4, 3, 2, 1]

**repeating strings**

In [121]:
"cat" * 4

'catcatcatcat'

**we can reverse a string too**

In [79]:
s = 'abcbde'

In [80]:
s[::-1]

'edbcba'

Remember, though, that strings are immutable: slicing returns a new string and leaves the original unchanged.

**iterating using `enumerate`**

In [11]:
for i, e in enumerate(l):
    print(i, e)

0 1
1 2
2 3
3 4
4 5


**Iterating through a dictionary**

In [2]:
dictionary = {'spain': 'madrid', 'france': 'paris'}
for key, value in dictionary.items():
    print(key, " : ", value)
print('')

spain  :  madrid
france  :  paris



**zipping**

`zip` combines multiple iterables element by element. If the lengths differ, it stops at the shortest one.

In [33]:
a = [1, 2, 3]
b = ['a', 'b', 'c']
c = [6, 7, 8]

In [34]:
z = zip(a, b, c)

In [35]:
for e in z:
    print(e)

(1, 'a', 6)
(2, 'b', 7)
(3, 'c', 8)
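
What happens when the inputs differ in length is easy to overlook, so here is a small sketch. The lists `short` and `longer` are made up for illustration; `zip_longest` comes from the standard `itertools` module.

```python
from itertools import zip_longest

short = [1, 2]
longer = ['a', 'b', 'c']

# zip stops as soon as the shortest iterable is exhausted.
pairs = list(zip(short, longer))
print(pairs)    # [(1, 'a'), (2, 'b')]

# zip_longest pads the shorter iterable with a fill value instead.
padded = list(zip_longest(short, longer, fillvalue=None))
print(padded)   # [(1, 'a'), (2, 'b'), (None, 'c')]
```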


**inverting dictionary using zip**

In [38]:
dictionary

{'france': 'paris', 'spain': 'madrid'}

In [37]:
dict(zip(dictionary.values(), dictionary.keys()))

{'madrid': 'spain', 'paris': 'france'}

**flattening list**

In [45]:
a = [[1, 2], [3, 4], [5, 6, 7]]

These two approaches work only if every element is itself a list.

In [46]:
list(itertools.chain.from_iterable(a))

[1, 2, 3, 4, 5, 6, 7]

In [47]:
sum(a, [])

[1, 2, 3, 4, 5, 6, 7]

**dictionary comprehension**

In [49]:
m = {i: i * 4 for i in range(10)}

In [50]:
m

{0: 0, 1: 4, 2: 8, 3: 12, 4: 16, 5: 20, 6: 24, 7: 28, 8: 32, 9: 36}

**inverting dictionary using dictionary comprehension**

In [51]:
dictionary

{'france': 'paris', 'spain': 'madrid'}

In [53]:
{v: k for k, v in dictionary.items()}

{'madrid': 'spain', 'paris': 'france'}

**iterables**

`iter` works with lists, dictionaries and any other iterable.

In [54]:
a = [1, 2, 3]

In [59]:
it = iter(a)  # creates an iterator over the list

In [60]:
next(it)  #calls next iteration

1

In [61]:
print(*it)  #print remaining iterations

2 3


Once the iterator has gone through the complete list, calling `next` again raises `StopIteration`.

In [62]:
it = iter(a)  # creates an iterator over the list

In [63]:
next(it)  #calls next iteration

1

In [64]:
next(it)

2

In [65]:
next(it)

3

In [66]:
next(it)

StopIteration: 
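
If the exception is not wanted, `next` accepts a default value, and a `for` loop handles the end of iteration by itself. A short sketch (the names `numbers`, `values` and `collected` are illustrative):

```python
numbers = [1, 2]
it = iter(numbers)

# A second argument to next() is returned once the iterator is exhausted.
values = [next(it, None) for _ in range(3)]
print(values)      # [1, 2, None]

# A for loop catches StopIteration internally, so it simply ends.
collected = []
for n in iter(numbers):
    collected.append(n)
print(collected)   # [1, 2]
```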

A string is also an iterable.

In [67]:
a = "abcdef"

In [68]:
it = iter(a)

In [69]:
next(it)

'a'

In [70]:
print(*it)  #print remaining iterations

b c d e f


iteration over dictionary

In [71]:
a = {"a": 1, "B": 3, "c": 3}

In [77]:
it = iter(a)  # by default, iterating a dictionary yields its keys

In [78]:
next(it)

'a'

**sets**

In [82]:
l1 = [1, 2, 3, 3, 5, 6, 6, 7]
l2 = [1, 2, 4, 2]

`set` builds a collection of the unique values in an iterable.

In [85]:
s1 = set(l1)

In [86]:
s2 = set(l2)

In [87]:
s1

{1, 2, 3, 5, 6, 7}

In [88]:
s2

{1, 2, 4}

you can apply various set operations

In [89]:
s1.intersection(s2)

{1, 2}

In [90]:
s1.union(s2)

{1, 2, 3, 4, 5, 6, 7}

In [91]:
s1.difference(s2)

{3, 5, 6, 7}

In [92]:
s2.difference(s1)

{4}
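
The same operations are also available as operators, which can read more naturally in expressions. A brief sketch using the same values as `s1` and `s2` above:

```python
s1 = {1, 2, 3, 5, 6, 7}
s2 = {1, 2, 4}

print(s1 & s2)   # intersection: {1, 2}
print(s1 | s2)   # union: {1, 2, 3, 4, 5, 6, 7}
print(s1 - s2)   # difference: {3, 5, 6, 7}
print(s1 ^ s2)   # symmetric difference (in exactly one set): {3, 4, 5, 6, 7}

# Membership tests on a set take constant time on average.
print(4 in s2)   # True
```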

**pop**

In [93]:
l1

[1, 2, 3, 3, 5, 6, 6, 7]

In [94]:
l1.pop()  #this will pop the last element from the list

7

In [95]:
l1  #we can see that the element is no longer in the list

[1, 2, 3, 3, 5, 6, 6]

In [96]:
l1.pop(3)  # we can also specify the index of the element to pop

3

In [97]:
l1

[1, 2, 3, 5, 6, 6]

**`extend`: appends every element of another iterable to a list, in place**

In [98]:
l1, l2

([1, 2, 3, 5, 6, 6], [1, 2, 4, 2])

In [99]:
l1.extend(l2)

In [100]:
l1

[1, 2, 3, 5, 6, 6, 1, 2, 4, 2]

**default dict**

In [101]:
m = dict()

In [102]:
m['a']

KeyError: 'a'

In [103]:
import collections

In [105]:
m = collections.defaultdict(int)

In [106]:
m['a']

0

You can also specify your own default value by passing a function that returns it.

In [107]:
m = collections.defaultdict(lambda: 2)

In [108]:
m['a']

2
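
Two common uses of `defaultdict` are grouping and counting, because missing keys are created on first access. A minimal sketch with a made-up word list:

```python
from collections import defaultdict

words = ['cat', 'dog', 'cow', 'dove', 'rat']

# Group words by first letter; missing keys start out as an empty list.
by_letter = defaultdict(list)
for w in words:
    by_letter[w[0]].append(w)
print(dict(by_letter))   # {'c': ['cat', 'cow'], 'd': ['dog', 'dove'], 'r': ['rat']}

# Count words per first letter; missing keys start out as 0.
counts = defaultdict(int)
for w in words:
    counts[w[0]] += 1
print(dict(counts))      # {'c': 2, 'd': 2, 'r': 1}
```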

**combinations**

Cartesian product of 2 lists `itertools.product`

In [110]:
for p in itertools.product([1, 2, 3], [4, 5]):
    print(p)

(1, 4)
(1, 5)
(2, 4)
(2, 5)
(3, 4)
(3, 5)


all possible combinations of size x from the elements of an iterable: `itertools.combinations`

Here x = 2 and the list has 3 elements, so the number of combinations is 3 choose 2 (3C2), i.e. 3.

In [3]:
for p in itertools.combinations([1, 2, 3], 2):
    print(p)

(1, 2)
(1, 3)
(2, 3)


**Scoping: names are resolved with the LEGB rule, which searches the local scope, enclosing functions, the global scope and the built-in scope, in that order.**

In [96]:
x = 5


def f():
    y = 2 * x  # there is no local x, so the global x is used
    return y


print(f())
print(x)

10
5
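
The example above only exercises the global scope. A small sketch that touches all four levels of the lookup order (the function names `outer`, `inner` and `shadow` are illustrative):

```python
x = 'global'

def outer():
    x = 'enclosing'

    def inner():
        # No local x here, so the enclosing function's x is found first.
        return x

    return inner()

def shadow():
    x = 'local'   # a new local name; the global x is untouched
    return x

print(outer())      # enclosing
print(shadow())     # local
print(x)            # global
print(len('abc'))   # 3, because len is found in the built-in scope
```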


List/tuple conversion to dict

In [4]:
l = dict([['a', 1], ['b', 2]])

In [5]:
l

{'a': 1, 'b': 2}

**sorting based on specific element**

In [6]:
sorted(l, key=lambda x: l[x], reverse=True)  #sorting a dictionary

['b', 'a']

In [8]:
sorted(l.items(),key=itemgetter(1),reverse=True) #another way of sorting a dictionary

[('b', 2), ('a', 1)]

In [7]:
m = [[3, 1], [5, 2]]

here x refers to each element of the outer list

In [126]:
sorted(
    m, key=lambda x: x[1], reverse=True
)  #sorting a nested list based on second element of the nested list

[[5, 2], [3, 1]]

Calling `sorted` on a string returns a list of its characters in sorted order.

In [127]:
sorted('dsf')

['d', 'f', 's']

**Default and flexible arguments: args vs kwargs**

These two forms are used when a function should accept a variable number of arguments.

Whatever is passed positionally through `*args` is received as a tuple.

In [100]:
# flexible arguments *args
def f(*args):
    print(type(args))
    print(args[0])
    for i in args:
        print(i)


f(1)
print("")
f(1, 2, 3, 4)
f([1, 2, 3])

<class 'tuple'>
1
1

<class 'tuple'>
1
1
2
3
4
<class 'tuple'>
[1, 2, 3]
[1, 2, 3]


Whatever is passed by keyword through `**kwargs` is received as a **dictionary**.

In [102]:
def f(**kwargs):
    print(type(kwargs))
    # print each key and value of the dictionary
    for key, value in kwargs.items():
        print(key, " ", value)


f(country='spain', capital='madrid', population=123456)

<class 'dict'>
country   spain
capital   madrid
population   123456


You can pass args and kwargs simultaneously

In [103]:
def show_details(a, b, *args, **kwargs):
    print("a is ", a)
    print("b is ", b)
    print("args is ", args)
    print("kwargs is ", kwargs)


show_details(1, 2, 3, 4, 5, 6, 7, 8, 9)
print("-----------")
show_details(1, 2, 3, 4, 5, 6, c=7, d=8, e=9)
print("-----------")

a is  1
b is  2
args is  (3, 4, 5, 6, 7, 8, 9)
kwargs is  {}
-----------
a is  1
b is  2
args is  (3, 4, 5, 6)
kwargs is  {'c': 7, 'd': 8, 'e': 9}
-----------
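
A common reason to accept both `*args` and `**kwargs` is to forward them unchanged to another function. The sketch below assumes two hypothetical helpers, `describe` and `logged`, written only for this illustration:

```python
def describe(name, *args, **kwargs):
    return f"{name}: args={args}, kwargs={kwargs}"

def logged(func, *args, **kwargs):
    # Forward whatever was received, unchanged, to the wrapped function.
    print("calling", func.__name__)
    return func(*args, **kwargs)

result = logged(describe, 'demo', 1, 2, flag=True)
print(result)   # demo: args=(1, 2), kwargs={'flag': True}
```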


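The same `*` and `**` syntax also unpacks a list or a dictionary into arguments at the call site.

In [104]:
def add(a, b):  # named add so that the built-in sum is not shadowed
    return a + b

In [105]:
add(1, 2)

3

In [106]:
num = [1, 2]
add(*num)  # unpacks the list into positional arguments

3

In [111]:
num = {"a": 1, "b": 2}  # keys must match the function's parameter names
add(**num)  # unpacks the dictionary into keyword arguments

3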

**Lambda function**

Lambda functions are anonymous functions: they are written inline as a single expression and do not need a `def` statement or a name.

In [112]:
square = lambda x: x**2  # where x is name of argument
print(square(4))
tot = lambda x, y, z: x + y + z  # where x,y,z are names of arguments
print(tot(1, 2, 3))

16
6


`map` applies a function (here a lambda) to each element of an iterable

In [111]:
a = [1, 2, 3]
print(list(map(lambda x: x + 2, a)))

[3, 4, 5]
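
For comparison, the same result can be written as a list comprehension, and `filter` is the matching built-in for selecting elements. A quick sketch:

```python
a = [1, 2, 3]

mapped = list(map(lambda x: x + 2, a))
comprehension = [x + 2 for x in a]
print(mapped == comprehension)   # True

# filter keeps the elements for which the function returns a truthy value.
odds = list(filter(lambda x: x % 2 == 1, a))
print(odds)                      # [1, 3]
```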


## Numpy

![pokemon](../imgs/numpy.png)

**Creating arrays**

1-d

from list

In [128]:
l = [1, 2, 3]

In [129]:
a = np.array(l)

Vectorized operations work on arrays but not on plain lists.

In [130]:
a.shape

(3,)

This is a 1-D array. Notice that it has only 1 dimension

In [131]:
a

array([1, 2, 3])

In [132]:
a + 2

array([3, 4, 5])

In [133]:
l + 2

TypeError: can only concatenate list (not "int") to list

2-d

In [272]:
a2 = np.array([[1, 2], [3, 4]])

In [273]:
a2.shape

(2, 2)

In [274]:
a2

array([[1, 2],
       [3, 4]])

 Some of the most commonly used numpy dtypes are: `float`, `int`, `bool`, `str` and `object`.

In [277]:
np.array([[1, 2], [3, 4]], dtype='float')

array([[1., 2.],
       [3., 4.]])

You can also cast an existing array to another dtype with `astype`:

In [278]:
a2.astype('float')

array([[1., 2.],
       [3., 4.]])

***All elements of an array must share one dtype, which is not the case for a list***

In [134]:
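np.array(
    [[1, 2, 3], [3, 4]], dtype=object
)  # the rows have different lengths, so each inner list is stored as a single object element

array([list([1, 2, 3]), list([3, 4])], dtype=object)

So the result is a 1-D array of two list objects. Recent NumPy versions raise a `ValueError` for such ragged input unless `dtype=object` is passed explicitly.

In [135]:
np.array([[1, 2, 3], [3, 4]], dtype=object).shape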

(2,)

In [136]:
arr1d_obj = np.array([1, 'a'], dtype='object')

In [137]:
arr1d_obj.shape

(2,)

In [138]:
arr1d_obj

array([1, 'a'], dtype=object)

In [139]:
arr1d_obj.size

2

To convert the array back to a list use `tolist()`

In [282]:
a2.tolist()

[[1, 2], [3, 4]]

In [140]:
arr1d_obj.tolist()

[1, 'a']

* Arrays support vectorised operations, while lists don’t.
* Once an array is created, you cannot change its size. You will have to create a new array or overwrite the existing one.
* Every array has one and only one dtype. All items in it should be of that dtype.
* An equivalent numpy array occupies much less space than a python list of lists.

In [141]:
list2 = [[1, 2, 3, 4], [3, 4, 5, 6], [5, 6, 7, 8]]
arr2 = np.array(list2, dtype='float')
arr2

array([[1., 2., 3., 4.],
       [3., 4., 5., 6.],
       [5., 6., 7., 8.]])

In [142]:
arr2.shape

(3, 4)

size gives number of elements in the array

In [143]:
arr2.size  #gives number of elements

12

In [145]:
arr2.ndim  #gives number of dimension

2

In [147]:
arr2.dtype  #dtype of array

dtype('float64')

**Indexing into arrays**

In [150]:
arr2

array([[1., 2., 3., 4.],
       [3., 4., 5., 6.],
       [5., 6., 7., 8.]])

In [148]:
arr2[:2, :1]  #indexing starts at 0 and follows row, column format

array([[1.],
       [3.]])

In [149]:
arr2[1]  #second row of the array

array([3., 4., 5., 6.])

In [152]:
arr2[:, 0]  #first column of the array

array([1., 3., 5.])

**boolean indexing**

In [154]:
mask = arr2 > 4  #creating a mask

In [155]:
arr2[mask]  #boolean indexing can be passed but it'll flatten the output

array([5., 6., 5., 6., 7., 8.])

**Reversing the array**

In [156]:
list2 = [[1, 2, 3, 4], [3, 4, 5, 6], [5, 6, 7, 8]]
arr2 = np.array(list2, dtype='float')
arr2

array([[1., 2., 3., 4.],
       [3., 4., 5., 6.],
       [5., 6., 7., 8.]])

In [157]:
arr2[::-1]  #row reversal

array([[5., 6., 7., 8.],
       [3., 4., 5., 6.],
       [1., 2., 3., 4.]])

In [158]:
arr2[::-1, ::-1]  #row and column reversal

array([[8., 7., 6., 5.],
       [6., 5., 4., 3.],
       [4., 3., 2., 1.]])

**checking for missing and infinite value**

In [159]:
arr2[1, 1] = np.nan  # the np.NaN alias was removed in NumPy 2.0

In [160]:
arr2[1, 2] = np.inf

In [161]:
arr2

array([[ 1.,  2.,  3.,  4.],
       [ 3., nan, inf,  6.],
       [ 5.,  6.,  7.,  8.]])

In [162]:
np.isnan(arr2)

array([[False, False, False, False],
       [False,  True, False, False],
       [False, False, False, False]])

In [163]:
np.isinf(arr2)

array([[False, False, False, False],
       [False, False,  True, False],
       [False, False, False, False]])

In [164]:
np.isnan(arr2) | np.isinf(arr2)

array([[False, False, False, False],
       [False,  True,  True, False],
       [False, False, False, False]])

Modifying values in an array with a boolean mask

In [166]:
arr2[np.isnan(arr2) | np.isinf(arr2)] = -1

In [167]:
arr2

array([[ 1.,  2.,  3.,  4.],
       [ 3., -1., -1.,  6.],
       [ 5.,  6.,  7.,  8.]])

**aggregation**

In [168]:
arr2.mean(axis=0)  # mean of each column (the row axis is collapsed)

array([3.        , 2.33333333, 3.        , 6.        ])

In [169]:
arr2.max(axis=1)  # max of each row (the column axis is collapsed)

array([4., 6., 8.])

In [170]:
np.cumsum(
    arr2
)  # with no axis given, the array is flattened row by row and a 1-D running sum is returned

array([ 1.,  3.,  6., 10., 13., 12., 11., 17., 22., 28., 35., 43.])

In [171]:
np.cumsum(arr2, axis=0)  # running sum down each column; the shape is retained

array([[ 1.,  2.,  3.,  4.],
       [ 4.,  1.,  2., 10.],
       [ 9.,  7.,  9., 18.]])

In [172]:
np.cumsum(arr2, axis=1)  # running sum along each row; the shape is retained

array([[ 1.,  3.,  6., 10.],
       [ 3.,  2.,  1.,  7.],
       [ 5., 11., 18., 26.]])

**Note**:
If you just assign a portion of an array to another array, the new array you just created actually refers to the parent array in memory.

That means, if you make any changes to the new array, it will reflect in the parent array as well.

So to avoid disturbing the parent array, you need to make a copy of it using copy(). All numpy arrays come with the copy() method.
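
The note above can be checked directly. The sketch below uses a small illustrative array named `parent`. It relies on the fact that basic slicing returns a view, whereas boolean and integer-array indexing return copies.

```python
import numpy as np

parent = np.arange(6).reshape(2, 3)

view = parent[:1, :2]          # basic slicing returns a view
view[0, 0] = 99
print(parent[0, 0])            # 99, the parent changed too

safe = parent[:1, :2].copy()   # an independent copy
safe[0, 0] = -1
print(parent[0, 0])            # still 99

print(np.shares_memory(parent, view))   # True
print(np.shares_memory(parent, safe))   # False
```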

**Reshaping**

In [173]:
arr2.shape

(3, 4)

In [174]:
arr2

array([[ 1.,  2.,  3.,  4.],
       [ 3., -1., -1.,  6.],
       [ 5.,  6.,  7.,  8.]])

In [178]:
arr2.reshape(4, 3)

array([[ 1.,  2.,  3.],
       [ 4.,  3., -1.],
       [-1.,  6.,  5.],
       [ 6.,  7.,  8.]])

In [176]:
arr2.T  #transpose

array([[ 1.,  3.,  5.],
       [ 2., -1.,  6.],
       [ 3., -1.,  7.],
       [ 4.,  6.,  8.]])

In [283]:
arr = np.array([1, 2, 3])

In [284]:
arr

array([1, 2, 3])

In [285]:
arr.shape

(3,)

You can add a new axis of length 1 by indexing with `None`

In [287]:
arr[None].shape

(1, 3)

**Remember that reshape and transpose are different operations. Reshape refills the new shape by reading the elements in row-major order, whereas transpose swaps the axes.**

**flatten vs ravel**: `flatten` always returns a copy, whereas `ravel` returns a view of the parent array whenever possible. Any change made through the raveled array therefore affects the parent as well, but `ravel` is more memory efficient because it avoids the copy.

In [179]:
a3 = arr2.flatten()

In [180]:
a3[1] = 100

In [181]:
a3

array([  1., 100.,   3.,   4.,   3.,  -1.,  -1.,   6.,   5.,   6.,   7.,
         8.])

In [182]:
arr2

array([[ 1.,  2.,  3.,  4.],
       [ 3., -1., -1.,  6.],
       [ 5.,  6.,  7.,  8.]])

arr2 hasn't changed

In [183]:
a3 = arr2.ravel()

In [184]:
a3[1] = 100

In [185]:
a3

array([  1., 100.,   3.,   4.,   3.,  -1.,  -1.,   6.,   5.,   6.,   7.,
         8.])

In [186]:
arr2

array([[  1., 100.,   3.,   4.],
       [  3.,  -1.,  -1.,   6.],
       [  5.,   6.,   7.,   8.]])

**With `ravel`, the original array has been modified as well**

`reshape` also accepts -1 for one dimension, which is then inferred from the array size and the other dimensions.

In [191]:
arr2.reshape(6, -1).shape

(6, 2)

In [192]:
arr2.resize((4, 3))  #inplace reshaping of array

In [193]:
arr2

array([[  1., 100.,   3.],
       [  4.,   3.,  -1.],
       [ -1.,   6.,   5.],
       [  6.,   7.,   8.]])

**arithmetic operations: the real power of numpy**

In [194]:
arr2

array([[  1., 100.,   3.],
       [  4.,   3.,  -1.],
       [ -1.,   6.,   5.],
       [  6.,   7.,   8.]])

In [195]:
arr1 = arr2.T

In [196]:
arr1

array([[  1.,   4.,  -1.,   6.],
       [100.,   3.,   6.,   7.],
       [  3.,  -1.,   5.,   8.]])

In [197]:
np.dot(arr1, arr2)  #matrix multiplication

array([[   54.,   148.,    42.],
       [  148., 10094.,   383.],
       [   42.,   383.,    99.]])

In [198]:
np.multiply(arr2, arr2)  #element wise multiplication

array([[1.0e+00, 1.0e+04, 9.0e+00],
       [1.6e+01, 9.0e+00, 1.0e+00],
       [1.0e+00, 3.6e+01, 2.5e+01],
       [3.6e+01, 4.9e+01, 6.4e+01]])

In [199]:
np.exp(arr1)  #vectorized exponentiation

array([[2.71828183e+00, 5.45981500e+01, 3.67879441e-01, 4.03428793e+02],
       [2.68811714e+43, 2.00855369e+01, 4.03428793e+02, 1.09663316e+03],
       [2.00855369e+01, 3.67879441e-01, 1.48413159e+02, 2.98095799e+03]])

**broadcasting**

numpy documentation(https://docs.scipy.org/doc/numpy-1.10.0/user/basics.broadcasting.html): The term broadcasting describes how numpy treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes. Broadcasting provides a means of vectorizing array operations so that looping occurs in C instead of Python. It does this without making needless copies of data and usually leads to efficient algorithm implementations

When operating on two arrays, NumPy compares their shapes element-wise. It starts with the trailing dimensions, and works its way forward. Two dimensions are compatible when:

* they are equal, or
* one of them is 1

In [208]:
a = np.array([1, 2, 3])

In [209]:
a.shape

(3,)

In [210]:
a + 1

array([2, 3, 4])

In [211]:
arr2

array([[  1., 100.,   3.],
       [  4.,   3.,  -1.],
       [ -1.,   6.,   5.],
       [  6.,   7.,   8.]])

In [212]:
arr2.shape

(4, 3)

In [213]:
a.shape

(3,)

When either of the dimensions compared is one, the other is used. In other words, dimensions with size 1 are stretched or “copied” to match the other.

Here `arr2` has shape (4, 3) and `a` has shape (3,). Broadcasting compares shapes starting from the trailing dimension, where 3 matches 3. `a` has no leading dimension, so it is treated as shape (1, 3) and its single row is replicated 4 times to form a 4x3 array, which is then added to `arr2` element-wise.

In [214]:
arr2 + a

array([[  2., 102.,   6.],
       [  5.,   5.,   2.],
       [  0.,   8.,   8.],
       [  7.,   9.,  11.]])

Behind the scenes, `a` is broadcast to the matrix below

In [220]:
np.broadcast_to(a, arr2.shape)

array([[1, 2, 3],
       [1, 2, 3],
       [1, 2, 3],
       [1, 2, 3]])
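
To make the compatibility rules concrete, here is a sketch with three illustrative arrays (`m`, `row`, `col`) plus one shape that cannot be broadcast:

```python
import numpy as np

m = np.zeros((4, 3))
row = np.array([1, 2, 3])                  # shape (3,)
col = np.array([[10], [20], [30], [40]])   # shape (4, 1)

print((m + row).shape)     # (4, 3): row is stretched along axis 0
print((m + col).shape)     # (4, 3): col is stretched along axis 1
print((row + col).shape)   # (4, 3): both operands are stretched

# Shape (4,) does not match the trailing dimension 3, so this fails.
try:
    m + np.array([1, 2, 3, 4])
    failed = False
except ValueError as err:
    failed = True
    print("incompatible shapes:", err)
```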

**creating sequences**

In [221]:
np.arange(4)

array([0, 1, 2, 3])

In [222]:
np.arange(3, 8)

array([3, 4, 5, 6, 7])

In [223]:
np.arange(3, 30, 2)  #step size

array([ 3,  5,  7,  9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29])

In [224]:
np.arange(30, 3, -2)  #reverse order

array([30, 28, 26, 24, 22, 20, 18, 16, 14, 12, 10,  8,  6,  4])

If you don't want to calculate the step size yourself, use `np.linspace` and give the number of points instead

In [225]:
np.linspace(0, 100, num=4)

array([  0.        ,  33.33333333,  66.66666667, 100.        ])

In [226]:
np.linspace(0, 100, num=4, dtype='int')

array([  0,  33,  66, 100])

in log space

In [227]:
np.logspace(0, 3, num=3)

array([   1.       ,   31.6227766, 1000.       ])

In [228]:
np.zeros([1, 3])

array([[0., 0., 0.]])

In [229]:
np.ones([1, 3])

array([[1., 1., 1.]])

repeating sequences

In [230]:
a = [1, 2, 3]

In [231]:
np.tile(a, 2)

array([1, 2, 3, 1, 2, 3])

In [232]:
np.repeat(a, 3)

array([1, 1, 1, 2, 2, 2, 3, 3, 3])

**Generating random numbers**

In [233]:
np.random.rand(
    2, 2
)  # random array of the given shape, values drawn uniformly from [0, 1)

array([[0.4410182 , 0.04740462],
       [0.73086823, 0.62068428]])

In [234]:
np.random.randn(
    2, 2
)  #numbers picked from normal distribution of mean 0 and variance 1 of given shape

array([[-1.53018961, -0.76171759],
       [ 1.88120503,  0.03734846]])

In [235]:
np.random.randint(0, 10, [2, 2])  #uniform distribution is used

array([[3, 6],
       [3, 5]])

In [237]:
np.random.random([2, 2])

array([[0.23079446, 0.71556074],
       [0.0196204 , 0.45310905]])

In [67]:
np.random.choice(
    ['a', 'e', 'i', 'o', 'u'],
    size=10)  #random sample of given size from the list

array(['a', 'e', 'i', 'a', 'i', 'o', 'u', 'a', 'a', 'e'], dtype='<U1')

In [68]:
np.random.choice(
    ['a', 'e', 'i', 'o', 'u'], size=10, p=[0.3, .1, 0.1, 0.4, 0.1]
)  #random sample of given size from the list using predefined probabilities

array(['u', 'e', 'o', 'o', 'o', 'o', 'o', 'i', 'o', 'o'], dtype='<U1')

In [249]:
rn = np.random.RandomState(100)  # seeded generator for reproducibility
rn.rand(2, 2), rn.rand(3, 3)

(array([[0.54340494, 0.27836939],
        [0.42451759, 0.84477613]]),
 array([[0.00471886, 0.12156912, 0.67074908],
        [0.82585276, 0.13670659, 0.57509333],
        [0.89132195, 0.20920212, 0.18532822]]))

In [250]:
np.random.seed(100)

a = np.random.choice(['a', 'e', 'i', 'o', 'u'], size=10)
a

array(['a', 'a', 'o', 'a', 'i', 'u', 'i', 'i', 'i', 'i'], dtype='<U1')

In [251]:
np.unique(a, return_counts=True)  #getting unique items

(array(['a', 'i', 'o', 'u'], dtype='<U1'), array([3, 5, 1, 1]))
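
NumPy 1.17 and later also provide the `Generator` interface, created with `np.random.default_rng`, as an alternative to the functions above. A minimal sketch follows. The seeds 100 and 42 are arbitrary, and no output values are shown because they depend on the NumPy version.

```python
import numpy as np

rng = np.random.default_rng(100)   # seeded Generator

uniform = rng.random((2, 2))                 # uniform on [0, 1)
normal = rng.standard_normal((2, 2))         # mean 0, variance 1
ints = rng.integers(0, 10, size=(2, 2))      # integers in [0, 10)
vowels = rng.choice(['a', 'e', 'i', 'o', 'u'], size=5)
print(uniform, normal, ints, vowels, sep="\n")

# Two generators built from the same seed produce the same stream.
a = np.random.default_rng(42).random(3)
b = np.random.default_rng(42).random(3)
print(np.array_equal(a, b))   # True
```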

**Advanced numpy**

**filtering**

In [252]:
arr_rand = np.array([8, 8, 3, 7, 7, 0, 4, 2, 5, 2])

In [253]:
pos = np.where(arr_rand > 5)

In [254]:
arr_rand[pos].shape

(4,)

In [256]:
arr_rand[pos]

array([8, 8, 7, 7])

In [257]:
np.take(arr_rand, pos).shape

(1, 4)

In [258]:
np.where(
    arr_rand > 5, "a", "b"
)  # like if/else: "a" where the condition holds, otherwise "b"

array(['a', 'a', 'b', 'a', 'a', 'b', 'b', 'b', 'b', 'b'], dtype='<U1')

In [259]:
list2 = [[1, 2, 3, 4], [3, 4, 5, 6], [5, 6, 7, 8]]
arr2 = np.array(list2, dtype='float')

In [260]:
arr2

array([[1., 2., 3., 4.],
       [3., 4., 5., 6.],
       [5., 6., 7., 8.]])

`argmax` returns the index of maximum value across an axis

In [261]:
np.argmax(arr2, axis=1)

array([3, 3, 3])

In [262]:
np.argmax(arr2, axis=0)

array([2, 2, 2, 2])

**Stacking**

In [263]:
arr3 = np.ones([3, 4])

In [264]:
arr3

array([[1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.]])

In [265]:
arr2

array([[1., 2., 3., 4.],
       [3., 4., 5., 6.],
       [5., 6., 7., 8.]])

In [268]:
np.concatenate((arr2, arr3), axis=1)  #concatenate along second axis

array([[1., 2., 3., 4., 1., 1., 1., 1.],
       [3., 4., 5., 6., 1., 1., 1., 1.],
       [5., 6., 7., 8., 1., 1., 1., 1.]])

In [269]:
arr4 = np.ones([3, 4, 5])

In [270]:
arr5 = np.zeros([3, 4, 5])

In [275]:
arr4

array([[[1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.]],

       [[1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.]],

       [[1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.]]])

![pokemon](../imgs/array.png)

In [272]:
np.concatenate(
    (arr4, arr5),
    axis=2)  #concatenate along 3rd axis. remember axis starts at 0

array([[[1., 1., 1., 1., 1., 0., 0., 0., 0., 0.],
        [1., 1., 1., 1., 1., 0., 0., 0., 0., 0.],
        [1., 1., 1., 1., 1., 0., 0., 0., 0., 0.],
        [1., 1., 1., 1., 1., 0., 0., 0., 0., 0.]],

       [[1., 1., 1., 1., 1., 0., 0., 0., 0., 0.],
        [1., 1., 1., 1., 1., 0., 0., 0., 0., 0.],
        [1., 1., 1., 1., 1., 0., 0., 0., 0., 0.],
        [1., 1., 1., 1., 1., 0., 0., 0., 0., 0.]],

       [[1., 1., 1., 1., 1., 0., 0., 0., 0., 0.],
        [1., 1., 1., 1., 1., 0., 0., 0., 0., 0.],
        [1., 1., 1., 1., 1., 0., 0., 0., 0., 0.],
        [1., 1., 1., 1., 1., 0., 0., 0., 0., 0.]]])

In [273]:
np.concatenate((arr4, arr5), axis=1)

array([[[1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.]],

       [[1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.]],

       [[1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.]]])

In [274]:
np.concatenate((arr4, arr5), axis=0)

array([[[1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.]],

       [[1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.]],

       [[1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.]],

       [[0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.]],

       [[0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.]],

       [[0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.]]])

**sorting**

In [276]:
x = np.array([1, 10, 5, 2, 8, 9])
x

array([ 1, 10,  5,  2,  8,  9])

`argsort()` returns the indices that would sort the array

In [278]:
sort_index = np.argsort(x)
print(sort_index)

[0 3 2 4 5 1]
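
Those indices are mostly useful for reordering the array itself or a second, aligned array. A sketch using the same `x` and a made-up `labels` array:

```python
import numpy as np

x = np.array([1, 10, 5, 2, 8, 9])
labels = np.array(['a', 'b', 'c', 'd', 'e', 'f'])

order = np.argsort(x)
print(x[order])         # [ 1  2  5  8  9 10]
print(labels[order])    # ['a' 'd' 'c' 'e' 'f' 'b']
print(x[order[::-1]])   # [10  9  8  5  2  1]
print(x[order[-2:]])    # [ 9 10], the two largest values
```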


**vectorize**

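You can vectorize user-defined functions so that they can be applied to arrays. `np.vectorize` is a convenience wrapper around a Python loop rather than a performance tool, and unless `otypes` is given it infers the output dtype from the first result, which is why the `1.0` returned for the element 2 appears as the integer 1 below.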

In [288]:
def foo(x):
    if x % 2 == 1:
        return x**2
    else:
        return x / 2

In [289]:
a = np.array([1, 2, 3])

In [290]:
foo_vect = np.vectorize(foo)

In [291]:
foo_vect(a)

array([1, 1, 9])
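
For a simple branch like this one, `np.where` expresses the same logic with true array operations. This is a sketch of an alternative, not a drop-in replacement: both branches are evaluated for every element, and the result is a float array.

```python
import numpy as np

a = np.array([1, 2, 3])

# Odd elements are squared, even elements are halved.
result = np.where(a % 2 == 1, a**2, a / 2)
print(result)         # [1. 1. 9.]
print(result.dtype)   # float64
```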

**introducing a new axis**

In [292]:
x = np.arange(5)

In [293]:
x

array([0, 1, 2, 3, 4])

In [294]:
x.shape

(5,)

In [295]:
x[:, np.newaxis]

array([[0],
       [1],
       [2],
       [3],
       [4]])

In [296]:
x[np.newaxis, :]

array([[0, 1, 2, 3, 4]])

In [297]:
x[np.newaxis, :, np.newaxis].shape

(1, 5, 1)

In [299]:
x

array([0, 1, 2, 3, 4])

In [300]:
np.clip(
    x, 3, 8
)  #clipping of values in array to a range between minimum and maximum: 3 and 8

array([3, 3, 3, 3, 4])

## Regular expressions

![pokemon](../imgs/regex-example.png)

The basic regex functionality covered here is taken from this excellent post:

https://www.machinelearningplus.com/python/python-regex-tutorial-examples/

In [302]:
import re

A regex pattern is a special language used to represent generic text, numbers or symbols so it can be used to extract texts that conform to that pattern.

In the pattern below, `\s` matches any whitespace character. Adding `+` makes the pattern match one or more consecutive whitespace characters, so it also matches tab (`\t`) characters.

In [130]:
regex = re.compile(r'\s+')  # raw string, so the backslash reaches the regex engine unchanged

**Splitting a string using regex**

In [131]:
text = "Hello World.   Regex is awesome"

In [132]:
regex.split(text)

['Hello', 'World.', 'Regex', 'is', 'awesome']

Splitting on a single `\s` also works, but it leaves empty strings wherever whitespace repeats, so `\s+` is generally the better choice

In [133]:
re.split(r'\s', text)

['Hello', 'World.', '', '', 'Regex', 'is', 'awesome']

**re.findall**

the findall method extracts all occurrences of the pattern

 `'\d'` is a regular expression which matches any digit

In [134]:
text = "101 howard street, 246 mcallister street"

In [135]:
regex_num = re.compile(r'\d+')  # one or more digits

In [136]:
regex_num.findall(text)

['101', '246']

In [137]:
regex_num.split(text)

['', ' howard street, ', ' mcallister street']

**re.search() vs re.match()**

`regex.search()` returns a particular match object that contains the starting and ending positions of the **first occurrence of the pattern**.

Likewise, `regex.match()` also returns a match object. But the difference is, it requires the pattern to be present at the **beginning of the text itself**.

In [139]:
text2 = "205 MAT   Mathematics 189"

In [141]:
m = regex_num.match(text2)

In [142]:
m.group()

'205'

In [143]:
m.start()  # index at which the match starts

0

In [144]:
s = regex_num.search(text2)

In [145]:
s.group()

'205'
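
In the example above both methods return the same result because the text starts with digits. The difference only shows when the pattern first occurs later in the string, as in this sketch with a rearranged, made-up text:

```python
import re

digits = re.compile(r'\d+')
text = "MAT 205 Mathematics 189"

at_start = digits.match(text)
print(at_start)                      # None: the text does not start with a digit

found = digits.search(text)
print(found.group(), found.span())   # 205 (4, 7)

# fullmatch requires the whole string to match the pattern.
print(digits.fullmatch("205") is not None)   # True
```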

**Substituting one text by another using `regex.sub()`**

In [146]:
text = """101   COM \t  Computers
205   MAT \t  Mathematics
189   ENG  \t  English"""

In [147]:
regex = re.compile(r'\s+')

In [148]:
regex.sub(' ', text)  #it replaces the regular expression by ' '

'101 COM Computers 205 MAT Mathematics 189 ENG English'

In [149]:
# collapse extra whitespace but keep the newlines;
# (?!\n) prevents a match from starting at a newline
regex = re.compile(r'((?!\n)\s+)')
print(regex.sub(' ', text))

101 COM Computers
205 MAT Mathematics
189 ENG English


**combining regex pattern**

In [150]:
# define the course text pattern groups and extract
course_pattern = r'([0-9]+)\s*([A-Z]{3})\s*([A-Za-z]{4,})'
re.findall(course_pattern, text)

[('101', 'COM', 'Computers'),
 ('205', 'MAT', 'Mathematics'),
 ('189', 'ENG', 'English')]
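
With several groups, positional tuples become hard to read. Named groups are one option. The sketch below reuses the same course text and pattern, and the group names `number`, `code` and `name` are chosen here for illustration:

```python
import re

text = """101   COM \t  Computers
205   MAT \t  Mathematics
189   ENG  \t  English"""

course = re.compile(
    r'(?P<number>[0-9]+)\s*(?P<code>[A-Z]{3})\s*(?P<name>[A-Za-z]{4,})')

# finditer yields match objects; groupdict maps group names to matched text.
courses = [m.groupdict() for m in course.finditer(text)]
for c in courses:
    print(c['number'], c['code'], c['name'])
# 101 COM Computers
# 205 MAT Mathematics
# 189 ENG English
```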

**greedy regex**

The default behavior of regular expressions is to be greedy. That means a quantifier matches as much text as it can while still letting the overall pattern succeed, even when a shorter match would have satisfied the pattern.

In [303]:
text = "< body>Regex Greedy Matching Example < /body>"
re.findall('<.*>', text)

['< body>Regex Greedy Matching Example < /body>']

It should have stopped at the first `>`, but it didn't. To extract only the smaller portions, use lazy matching.

Lazy matching, on the other hand, ‘takes as little as possible’. It is enabled by adding a `?` right after the quantifier (here `.*?`).

In [153]:
re.findall('<.*?>', text)

['< body>', '< /body>']

In [155]:
s = re.search('<.*?>', text)  #getting only the first one

In [156]:
s.group()

'< body>'
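
Another way to stop at the first `>` is a negated character class, which says explicitly what the match may not contain. A small sketch on the same text:

```python
import re

text = "< body>Regex Greedy Matching Example < /body>"

# [^>]* matches any run of characters other than '>', so it cannot run past a tag.
tags = re.findall(r'<[^>]*>', text)
print(tags)   # ['< body>', '< /body>']
```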

![regex](../imgs/regex.png)

In [157]:
text = '01, Jan 2015'

In [159]:
print(re.findall(r'\d{3}', text))  # exactly three consecutive digits

['201']


**matching word boundaries**

Word boundaries `\b` are commonly used to detect and match the beginning or end of a word. That is, one side is a word character and the other side is a non-word character (such as whitespace or punctuation) or the start or end of the string.

For example, the regex `\btoy` will match the ‘toy’ in ‘toy cat’ and not in ‘tolstoy’. In order to match the ‘toy’ in ‘tolstoy’, you should use `toy\b`.

Can you come up with a regex that will match only the first ‘toy’ in ‘play toy broke toys’? (hint: `\b` on both sides)

Likewise, `\B` will match any non-boundary.

For example, `\Btoy\B` will match ‘toy’ surrounded by word characters on both sides, as in ‘antoynet’.

In [160]:
re.findall(r'\btoy\b', 'play toy broke toys')

['toy']

In [161]:
re.findall(r'\btoy', 'play toy broke toys')

['toy', 'toy']

In [162]:
re.findall(r'toy\b', 'play toy broke toys')

['toy']

In [163]:
re.findall(r'\Btoy\b', 'playtoy broke toys')

['toy']

In [164]:
re.findall(r'\Btoy\B', 'playtoybroke toys')

['toy']

In [166]:
re.findall(r'\btoy', 'playtoybroke toys')

['toy']

**Practice regex examples**

1. Extract the user id, domain name and suffix from the following email addresses.

In [167]:
emails = """zuck26@facebook.com
page33@google.com
jeff42@amazon.com"""

desired_output = [('zuck26', 'facebook', 'com'), ('page33', 'google', 'com'),
                  ('jeff42', 'amazon', 'com')]

In [197]:
regex = re.compile(r'(\w+)@(\w+)\.(\w+)')  # \. matches a literal dot

In [198]:
regex.findall(emails)

[('zuck26', 'facebook', 'com'),
 ('page33', 'google', 'com'),
 ('jeff42', 'amazon', 'com')]

2. Retrieve all the words starting with ‘b’ or ‘B’ from the following text.

In [200]:
text = """Betty bought a bit of butter, 
But the butter was so bitter, So she bought
some better butter, To make the bitter butter better."""

In [210]:
regex = re.compile(r'\b[bB]\w+')  # \b anchors the match at the start of a word

In [211]:
regex.findall(text)

['Betty',
 'bought',
 'bit',
 'butter',
 'But',
 'butter',
 'bitter',
 'bought',
 'better',
 'butter',
 'bitter',
 'butter',
 'better']

3. Clean up the following irregular sentence so that it matches the desired output.

In [212]:
sentence = """A, very   very; irregular_sentence"""
desired_output = "A very very irregular sentence"

In [228]:
regex = re.compile(r'[,\s;_]+')

In [231]:
' '.join(regex.split(sentence))

'A very very irregular sentence'

4. Clean up the following tweet so that only the message text remains.

In [232]:
tweet = '''Good advice! RT @TheNextWeb: What I would do differently if I was learning to code today http://t.co/lbwej0pxOd cc: @garybernhardt #rstats'''

In [233]:
desired_output = 'Good advice What I would do differently if I was learning to code today'

In [272]:
import string


def clean_tweet(tweet):
    tweet = re.sub(r'http\S+\s*', '', tweet)  # remove URLs
    tweet = re.sub(r'\b(?:RT|cc)\b', '', tweet)  # remove standalone RT and cc
    tweet = re.sub(r'#\S+', '', tweet)  # remove hashtags
    tweet = re.sub(r'@\S+', '', tweet)  # remove mentions
    tweet = re.sub('[%s]' % re.escape(string.punctuation), '',
                   tweet)  # remove punctuation
    tweet = re.sub(r'\s+', ' ', tweet)  # collapse runs of whitespace
    return tweet.strip()  # drop leading and trailing whitespace


print(clean_tweet(tweet))

Good advice What I would do differently if I was learning to code today
