In [None]:
import re

**Regex**, or regular expression, is a sequence of characters *that forms a search pattern*. It can be used to check whether a string contains the specified search pattern.

1. **findall** - returns a list containing all matches (pattern, string)
1. **search** - returns a match object if there is a match anywhere in the string (pattern, string)
1. **split** - returns a list where the string has been split at each match (pattern, string, maxsplit)
1. **sub** - replaces one or many matches with a string (pattern, replacement, string)
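
The four functions above can be tried on one sentence. This is a small sketch; the sample sentence and the variable names are made up for illustration.

```python
import re

txt = "The rain in Spain stays mainly in the plain"

# findall: every non-overlapping match as a list of strings
found = re.findall("ai", txt)
print(found)      # ['ai', 'ai', 'ai', 'ai']

# search: a Match object for the first match, or None if there is none
first = re.search("Spain", txt)
print(first)

# split: cut the string at each match; maxsplit caps the number of cuts
pieces = re.split(r"\s", txt, maxsplit=2)
print(pieces)     # ['The', 'rain', 'in Spain stays mainly in the plain']

# sub: replace matches with another string; count caps the replacements
replaced = re.sub(r"\s", "_", txt, count=2)
print(replaced)   # The_rain_in Spain stays mainly in the plain
```

Note that the pattern always comes first and the string to search comes last.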

**Match Object** is the object containing information about the search and the result.

1. **.span()** returns a tuple containing the start and end positions of the match
1. **.string** returns the string passed into the function
1. **.group()** returns the part of the string where there was a match
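
A short sketch of the three Match Object members listed above. The pattern here (the first word starting with "f") is only an illustration.

```python
import re

oranges = "The orange fruit fell from an orange tree to an orange sidewalk"
match = re.search(r"\bf\w+", oranges)  # first word that starts with "f"

if match:  # search() returns None when nothing matches
    print(match.span())    # (11, 16): start and end positions of the match
    print(match.group())   # fruit: the part of the string that matched
    print(match.string)    # the full string that was searched
```

Checking `if match:` first avoids an AttributeError when there is no match.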

In [None]:
oranges = "The orange fruit fell from an orange tree to an orange sidewalk"
searchedA = re.search("orange", oranges)
print(searchedA)

<re.Match object; span=(4, 10), match='orange'>


**METACHARACTERS**
1. []    set of characters, "a-m" or a to m
1. \     special sequence
1. .     any character except new line
1. ^     starts with
1. $     ends with
1. \*     zero or more occurrences
1. \+     one or more occurrences
1. ?     zero or one occurrence
1. {}    exactly the specified number of occurrences
1. |     either or
1. ()    capture and group
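
A few of the metacharacters above in action. This is a minimal sketch on a made-up sentence, not a complete tour.

```python
import re

txt = "hello planet, hello world"

in_set = re.findall("[a-c]", txt)            # any one character from a to c
any_char = re.findall("he..o", txt)          # . stands for any single character
starts = bool(re.search("^hello", txt))      # ^ anchors the pattern at the start
ends = bool(re.search("world$", txt))        # $ anchors the pattern at the end
doubled = re.findall("l{2}", txt)            # exactly two "l" in a row
either = re.findall("planet|world", txt)     # either alternative

print(in_set)    # ['a']
print(any_char)  # ['hello', 'hello']
print(starts, ends)  # True True
print(doubled)   # ['ll', 'll']
print(either)    # ['planet', 'world']
```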

In [None]:
bus = "The yellow bus is warm inside"
searchedB = re.search("y.l..w", bus)
print(searchedB.string)

The yellow bus is warm inside


**SPECIAL SEQUENCES** have a special meaning that could be added in the search.

1. **\A**    returns a match if the specified characters are at the *beginning of the string* ("\AThe")
1. **\b**    returns a match where the specified characters are at the *beginning or at the end of a word* (r"ain\b")
1. **\d**    returns a match where the string contains a digit
1. **\D**    returns a match where the string contains a character that is NOT a digit
1. **\s**    returns a match where the string contains a whitespace character
1. **\S**    returns a match where the string contains a character that is NOT whitespace
1. **\w**    returns a match where the string contains a word character (a letter, a digit or an underscore)
1. **\W**    returns a match where the string contains a character that is NOT a word character
1. **\Z**    returns a match if the specified characters are at the end of the string
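
The special sequences above can be sketched like this. The sentence is made up, and raw strings (the `r` prefix) are used so that Python does not treat the backslashes as string escapes.

```python
import re

txt = "The rain in Spain costs 25 euros"

at_start = bool(re.search(r"\AThe", txt))    # "The" at the beginning of the string
word_ends = re.findall(r"ain\b", txt)        # "ain" at the end of a word
digits = re.findall(r"\d", txt)              # each single digit
number = re.findall(r"\d+", txt)             # one or more digits in a row
spaces = len(re.findall(r"\s", txt))         # how many whitespace characters
at_end = bool(re.search(r"euros\Z", txt))    # "euros" at the end of the string

print(at_start)   # True
print(word_ends)  # ['ain', 'ain']
print(digits)     # ['2', '5']
print(number)     # ['25']
print(spaces)     # 6
print(at_end)     # True
```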

NOTE: the characters in the search pattern have to appear next to each other in the string for a match.

THIS IS A **VERRRYYYY USEFUL** piece of code to use to access files in google drive
```
from google.colab import drive
drive.mount('/content/gdrive')
gdrive_path = '/content/gdrive/My Drive/learningDataScience/'
with open(f'{gdrive_path}SCENE3ACT1_theMerchantOfVenice.txt', 'r') as file:
  for line in file:
    print(line)
```



Line by line matching using counters


```
matches = []
offset = 0
reg = re.compile(r"(<(\d{4,5})>)?")  # raw string, so \d is not treated as a string escape
for line in txtExcerpt:
    matches += [(reg.findall(line),offset)]
    offset += len(line)
txtExcerpt.close()
```
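
Here is a runnable sketch of the same offset-counting idea. It makes a few assumptions: the `lines` list is a made-up stand-in for an opened file, `finditer()` is used instead of `findall()` to get the position of each match, and the trailing `?` is dropped from the pattern, because an optional group also matches the empty string at every position.

```python
import re

# Hypothetical stand-in for the lines of an opened text file
lines = ["intro <1234> text\n", "no tag here\n", "<56789> and <0001>\n"]

reg = re.compile(r"<(\d{4,5})>")  # assumption: no trailing "?", so no empty matches
matches = []
offset = 0  # number of characters that come before the current line
for line in lines:
    for m in reg.finditer(line):
        # store the digits and their position counted from the start of the text
        matches.append((m.group(1), offset + m.start()))
    offset += len(line)

print(matches)  # [('1234', 6), ('56789', 30), ('0001', 42)]
```

The stored positions point into the whole text, not into the single line, which is what the running `offset` is for.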



Reading the whole file at once


```
import re

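# filename must be defined beforehand; "with" closes the file automatically
with open(filename, 'r') as textfile:
    filetext = textfile.read()
# raw string, so \d is not treated as a string escape
matches = re.findall(r"(<(\d{4,5})>)?", filetext)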
```



Aggregating lines based on the RegEx match


```
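import re
from collections import defaultdict


def groupLinesByMatch(filename, regex):
    # Group the lines of a file by the first capture group of the regex
    regex = re.compile(regex)
    result = defaultdict(list)

    with open(filename) as file:
        for line in file:
            # match() only looks for the pattern at the start of the line
            matches = regex.match(line)
            if matches:
                result[matches.group(1)].append(line)

    return result.values()


# filename and regex must be defined beforehand
for lines in groupLinesByMatch(filename, regex):
    for line in lines:
        print(line, end="")  # each line already ends with a newline
    print()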
```



**PROJECT #1** : Shakespeare's Predicament [google colab link](https://colab.research.google.com/drive/1vJjWOmkU2kMUR-08P-IE7K8-Dbq4aQ30#scrollTo=JXKYRd27tzX1)

**PROJECT #2** : Jupyter Wordle! (UNFINISHED) [google colab link](https://colab.research.google.com/drive/1lD1X-nrA2ivjJmwPes4tbhoY4pdV1Xq8)



---


**NumPy** stands for "Numerical Python" and is used for working with arrays. Python has lists, but they tend to be slower to process. NumPy, on the other hand, can offer up to 50x faster speeds than traditional Python lists. This is because most of the computational parts of NumPy are written in "fast languages" like C or C++.


Every array object in NumPy is called an **ndarray**. ndarrays are stored at one continuous place in memory, so unlike lists, processes can access and manipulate them very efficiently.

[This](https://github.com/numpy/numpy) is the github repository of NumPy.

Installing NumPy locally requires running



```
pip install numpy
```

in the command line.

However, NumPy is already installed in tools such as Google Colab. NumPy is usually imported as "np". The version of NumPy is stored in the
```
np.__version__
```
attribute.


In [2]:
import numpy as np

NumPy array objects are called "ndarray", and they can be created using the **array()** function. Lists, tuples or any array-like object can be passed to array(). **type()** is a useful built-in function that tells us the type of the object passed to it.




In [None]:
arr = np.array([1,2,3,4])
print(arr)
print(type(arr))

[1 2 3 4]
<class 'numpy.ndarray'>


**Array Dimension** refers to the level of array depth. In other words, it refers to how nested the arrays are. The more nested the arrays are, the higher the dimension. **Nested arrays** are arrays that have arrays as their elements.

**Scalars**, or 0-D arrays, are the elements in an array. Each value in an array is a 0-D array.

In [None]:
a = np.array(23)
print(a)

23


**1-D arrays** are arrays containing 0-D arrays as their elements. They are also called "uni-dimensional arrays".

In [None]:
b = np.array([1,2,3,4,5])
print(b)

[1 2 3 4 5]


**2-D arrays** are arrays containing 1-D arrays as their elements. They are often used to represent a **matrix** or a 2nd order tensor. NumPy also has a dedicated **numpy.matrix** class for matrix operations, although regular 2-D ndarrays are generally recommended instead.

In [None]:
c = np.array([[1,2,3],[4,5,6]])
print(c)

[[1 2 3]
 [4 5 6]]


Higher order tensors can be created by nesting more arrays inside each other. NumPy's **ndim** attribute returns an integer that tells us how many dimensions the array has.

In [None]:
print(a.ndim)
print(b.ndim)
print(c.ndim)

0
1
2


When an array is created, the minimum number of dimensions can also be defined by using the **ndmin** argument.

In [None]:
d = np.array([1,2,3,4,5], ndmin=5)
print(d)
print('number of dimensions: ', d.ndim)

[[[[[1 2 3 4 5]]]]]
number of dimensions:  5


Accessing array elements in NumPy is similar to indexing in Python itself. For higher-dimensional arrays, the indexes are given from the outermost dimension to the innermost dimension.

In [None]:
e = np.array([[1,2,3],[4,5,6]])
print(e[0,1]) 

2


In [None]:
f = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
print(f[0,1,2])

6


Negative indexes could also be used to access an array from the end.

In [None]:
g = np.array([[1,2,3,4,5], [6,7,8,9,10]])
print(g[0,-1])

5


**Slicing** in python means taking elements from one given index to another given index.

The slice can be passed like **[start:end]**

A step (the distance between the selected elements) can also be defined like **[start:end:step]**

If a parameter is not passed:
1. if the start is not passed, it is taken as 0
1. if the end is not passed, it is taken as the length of the array in that dimension
1. if the step is not passed, it is taken as 1

The result includes the start index, but excludes the end index.

In [None]:
h = np.array([1,2,3,4,5,6,7])
print(h[1:3])

[2 3]


In [None]:
i = np.array([1,2,3,4,5,6,7])
print(i[:4])

[1 2 3 4]


In [None]:
j = np.array([1,2,3,4,5,6,7])
print(j[2:])

[3 4 5 6 7]


In [None]:
k = np.array([1,2,3,4,5,6,7])
print(k[::2])

[1 3 5 7]


Negative slicing can also be done in the same way that negative indexes can be done.

In [None]:
l = np.array([1,2,3,4,5,6,7])
print(l[-3:-1])

[5 6]


Higher-dimensional arrays can be sliced as well: give an index or a slice for each dimension, separated by commas.

In [None]:
m = np.array([[1,2,3,4,5],[6,7,8,9,10]])
print(m[1,1:4])

[7 8 9]


In [None]:
n = np.array([[1,2,3,4,5],[6,7,8,9,10]])
print(n[0:2, 2])

[3 8]


These are the characters used to represent the data types in NumPy:
1. i - integer
1. b - boolean
1. u - unsigned integer (non-negative integer)
1. f - float
1. c - complex float
1. m - timedelta, the difference between two dates or times
1. M - datetime, represents a date and time
1. O - object
1. S - byte string
1. U - unicode string
1. V - fixed chunk of memory for other type (void)
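
The character codes above can be read back from an array through `dtype.kind`. This is a small sketch; the sample arrays and the dictionary names are made up for illustration.

```python
import numpy as np

dates = np.array(["2024-01-01", "2024-01-03"], dtype="datetime64[D]")

samples = {
    "integer": np.array([1, 2, 3]),
    "unsigned": np.array([1, 2, 3], dtype="u1"),
    "float": np.array([1.5, 2.5]),
    "boolean": np.array([True, False]),
    "complex": np.array([1 + 2j]),
    "unicode": np.array(["kiwi"]),
    "datetime": dates,
    "timedelta": dates[1:] - dates[:1],  # subtracting dates gives a timedelta
}

# dtype.kind holds the one-character code of each data type
kinds = {name: arr.dtype.kind for name, arr in samples.items()}
for name, kind in kinds.items():
    print(name, kind)
```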

The data type of a NumPy array object can be checked by the **dtype** property.

In [None]:
o = np.array([1,2,3,4])
print(o.dtype)

int64


In [None]:
p = np.array(["oranges","apples","kiwis","melons"])
print(p.dtype)

<U7


The array() function can take an optional **dtype** argument. This allows us to define the expected data type of the array elements.

In [None]:
q = np.array([1,2,3,4,5], dtype='S')
print(q)
print(q.dtype)

[b'1' b'2' b'3' b'4' b'5']
|S1


For certain data types such as i, u, f, S and U, the size can be defined as well. For example, let's create an array with an integer data type that is 4 bytes in size.

In [None]:
r = np.array([1,2,3,4,5], dtype='i4')
print(r)
print(r.dtype)

[1 2 3 4 5]
int32


A **ValueError** is raised when a value passed to a function is unexpected or incorrect, for example when a non-numeric string such as 'a' has to be converted to an integer.
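
A minimal sketch of that error: the string 'a' cannot be converted to an integer, so creating the array fails. The variable name `error_message` is only for illustration.

```python
import numpy as np

try:
    # 'a' is not a valid integer, so the conversion cannot succeed
    np.array(["a", "2", "3"], dtype="i")
    error_message = None
except ValueError as err:
    error_message = str(err)
    print("ValueError:", error_message)
```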

The best way to change the data type of an existing array is to make a copy of the array with the **astype()** method.

The astype() method creates a copy of the array and allows us to specify the data type as a parameter.


In [None]:
s = np.array([1.1,2.1,3.1])
xs = s.astype('i')

print(xs)
print(xs.dtype)

[1 2 3]
int32


In [None]:
t = np.array([1,0,3])
xt = t.astype(bool)

print(t)
print(xt)

[1 0 3]
[ True False  True]


To create a copy of an array, the copy() method is used. Changes made to the copy will not affect the original array, and changes made to the original will not affect the copy. A view created with view(), however, shares its data with the original, so changes made to either one directly affect the other.

In [None]:
ab = np.array([1,2,3,4,5])
ac = ab.copy()
ab[0] = 42

print(ab)
print(ac)

[42  2  3  4  5]
[1 2 3 4 5]


In [None]:
ad = np.array([1,2,3,4,5])
ae = ad.view()
ad[0] = 42

print(ad)
print(ae)

[42  2  3  4  5]
[42  2  3  4  5]


In [None]:
af = np.array([1,2,3,4,5])
ag = af.view()
ag[0] = 42

print(af)
print(ag)

[42  2  3  4  5]
[42  2  3  4  5]


To summarize, copies own their data, while views do not. How do we check whether an array owns its data?

We use the **base** attribute, which returns None if the array owns its data; otherwise it refers to the original array.

In [None]:
ah = np.array([1,2,3,4,5])

ai = ah.copy()
ak = ah.view()

print(ai.base)
print(ak.base)

None
[1 2 3 4 5]
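
The same check also works for slices. As a small sketch (the variable names are made up), a basic slice is a view of the array it was taken from, so writing to the slice changes the original:

```python
import numpy as np

original = np.array([1, 2, 3, 4, 5])
window = original[1:4]   # basic slicing gives a view, not a copy
window[0] = 99           # the change is visible through the original too

print(window.base is original)  # True
print(original)                 # [ 1 99  3  4  5]
```

When an independent piece is needed, calling copy() on the slice avoids this.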


The **shape** of an array is the number of elements in each dimension. The **shape** attribute returns a tuple in which each index holds the number of elements in the corresponding dimension.

In [None]:
al = np.array([[1,2,3,4],[5,6,7,8]])
print(al.shape)

(2, 4)


In [None]:
am = np.array([1,2,3,4,5], ndmin=5)

print(am)
print('The shape of array am is: ', am.shape)

[[[[[1 2 3 4 5]]]]]
The shape of array am is:  (1, 1, 1, 1, 5)


Reshaping means changing the shape of an array: we can add or remove dimensions, or change the number of elements in each dimension. The new shape is given from the outermost dimension to the innermost one, the same order used when indexing. We can reshape an array as long as both shapes contain the same total number of elements.

In [None]:
an = np.array([1,2,3,4,5,6,7,8,9,10,11,12])
ao = an.reshape(4,3)

print(ao)

[[ 1  2  3]
 [ 4  5  6]
 [ 7  8  9]
 [10 11 12]]


In [None]:
ap = np.array([1,2,3,4,5,6,7,8,9,10,11,12])
aq = ap.reshape(2,2,3)

print(aq)

[[[ 1  2  3]
  [ 4  5  6]]

 [[ 7  8  9]
  [10 11 12]]]
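
A short sketch of the "same number of elements" rule: 8 elements fit into a 2 by 4 shape, but not into 3 by 3. The variable names are made up for illustration.

```python
import numpy as np

eight = np.array([1, 2, 3, 4, 5, 6, 7, 8])

print(eight.reshape(2, 4).shape)  # (2, 4), because 2 * 4 == 8

try:
    eight.reshape(3, 3)           # 3 * 3 == 9, which is not 8
    reshape_failed = False
except ValueError as err:
    reshape_failed = True
    print("ValueError:", err)
```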


Does reshape() return a copy or a view? Checking the **base** attribute of the reshaped array shows the original array, so reshape() returns a view here.

In [None]:
ar = np.array([1,2,3,4,5,6,7,8,9,10,11,12])
at = ar.copy()
au = ar.view()

print(ar.reshape(3,4).base)

[ 1  2  3  4  5  6  7  8  9 10 11 12]


One unknown dimension is allowed: we do not have to specify an exact number for one of the dimensions in the reshape method. Pass **-1** as the value, and NumPy will calculate it automatically.

In [None]:
av = np.array([1,2,3,4,5,6,7,8])
aw = av.reshape(2,2,-1)

print(aw)

[[[1 2]
  [3 4]]

 [[5 6]
  [7 8]]]


Multidimensional arrays can also be converted into a 1-D array, which is called "flattening". **reshape(-1)** is used to do this. NumPy has many other functions for changing the shapes of arrays, but those belong to intermediate and advanced NumPy.

In [None]:
ax = np.array([[1,2,3],[4,5,6]])
ay = ax.reshape(-1)
print(ay)

[1 2 3 4 5 6]
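
Two of those other functions are flatten() and ravel(). As a hedged sketch (the variable names are made up): flatten() always returns a copy, while ravel() returns a view when the memory layout allows it, which is the case for the freshly created array below.

```python
import numpy as np

grid = np.array([[1, 2, 3], [4, 5, 6]])

flat_copy = grid.flatten()  # always a new, independent array
flat_view = grid.ravel()    # a view here, because grid is contiguous in memory

print(flat_copy)                          # [1 2 3 4 5 6]
print(np.shares_memory(grid, flat_copy))  # False
print(np.shares_memory(grid, flat_view))  # True
```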


To iterate over multidimensional arrays in NumPy, a for loop can be used. *Iterating over an n-D array goes through its (n-1)-D sub-arrays one by one.* To reach the actual values, the scalars, we have to iterate over the arrays in each dimension.

In [None]:
ba = np.array([[[1,2,3],[4,5,6]],[[7,8,9],[10,11,12]]])
for x in ba:
  print(x)

[[1 2 3]
 [4 5 6]]
[[ 7  8  9]
 [10 11 12]]


In [None]:
print(ba.shape)

(2, 2, 3)


In [None]:
for x in ba:
  for y in x:
    for z in y:
      print(z)

1
2
3
4
5
6
7
8
9
10
11
12


In high dimensional arrays, nested for loops can become difficult to write, and because of that we use the **nditer()** function. It iterates over each scalar element.

In [None]:
bc = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])

for x in np.nditer(bc):
  print(x)

1
2
3
4
5
6
7
8


Additional arguments can be passed to nditer(). The **op_dtypes** argument sets the data type that the elements should be converted to while iterating. NumPy does not change the data type of the elements in place (inside the array itself), so it needs some extra space to perform the conversion. That extra space is called a buffer, and to enable it in nditer() we pass **flags=['buffered']**.

In [None]:
bd = np.array([[1, 2, 3, 4], [5, 6, 7, 8]]) 

In [None]:
for x in np.nditer(bd, flags=['buffered'], op_dtypes=['S']):
  print(x)

b'1'
b'2'
b'3'
b'4'
b'5'
b'6'
b'7'
b'8'


nditer() can also iterate with a different step size: slice the array with a step, such as bd[:, ::2], and pass the result to nditer().

In [None]:
for x in np.nditer(bd[:,::2]):
  print(x)

1
3
5
7


Enumeration means mentioning the sequence number of things one by one. Sometimes we need the corresponding index of each element while iterating, and the **ndenumerate()** function can be used for those cases.

In [None]:
for idx, x in np.ndenumerate(bd):
  print(idx,x)

(0, 0) 1
(0, 1) 2
(0, 2) 3
(0, 3) 4
(1, 0) 5
(1, 1) 6
(1, 2) 7
(1, 3) 8


Putting the contents of two or more arrays into a single array is called joining. In SQL, tables are joined based on a key, whereas in NumPy we join arrays by axes. We pass a sequence of arrays that we want to join to the **concatenate()** function, along with the axis. If the axis is not explicitly passed, it is taken as 0.

In [None]:
be = np.array([1,2,3])
bf = np.array([4,5,6])  
bg = np.array([[1,2],[3,4]]) 
bh = np.array([[5,6,],[7,8]])

In [None]:
bi = np.concatenate((be,bf))
print(bi)

[1 2 3 4 5 6]


Axis could also be passed as an argument in concatenate() function.

axis = 0 is vertical concatenation, while axis=1 is horizontal concatenation

In [None]:
bk = np.concatenate((bg,bh), axis=1) 
print(bk)

[[1 2 5 6]
 [3 4 7 8]]
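
For comparison, here is a small sketch that joins the same kind of 2-D arrays along both axes. The variable names are made up for illustration.

```python
import numpy as np

top = np.array([[1, 2], [3, 4]])
bottom = np.array([[5, 6], [7, 8]])

rows = np.concatenate((top, bottom), axis=0)  # vertical: more rows
cols = np.concatenate((top, bottom), axis=1)  # horizontal: more columns

print(rows.shape)  # (4, 2)
print(cols.shape)  # (2, 4)
print(rows)
```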


**Stacking** is similar to concatenation; the only difference is that stacking is done along a new axis. We can concatenate two 1-D arrays along the second axis, which results in putting them one over the other, i.e. stacking. We pass a sequence of arrays that we want to join to the stack() method along with the axis. If the axis is not explicitly passed, it is taken as 0.

In [None]:
bl = np.stack((be,bf), axis=1)
print(bl)

[[1 4]
 [2 5]
 [3 6]]


**hstack()** is a helper function used to stack along rows.

In [None]:
bm = np.hstack((be,bf))
print(bm)

[1 2 3 4 5 6]


**vstack()** is another helper function to stack along columns

In [None]:
bn = np.vstack((be,bf))
print(bn)

[[1 2 3]
 [4 5 6]]


**dstack()** is a helper function used to stack along height, which is the same as depth.

In [None]:
bo = np.dstack((be,bf))
print(bo)

[[[1 4]
  [2 5]
  [3 6]]]
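
One way to keep the stacking functions apart is to compare the shapes they produce. This is a small sketch with two 1-D arrays of three elements each; the names are made up for illustration.

```python
import numpy as np

first = np.array([1, 2, 3])
second = np.array([4, 5, 6])

shapes = {
    "stack axis=0": np.stack((first, second), axis=0).shape,
    "stack axis=1": np.stack((first, second), axis=1).shape,
    "hstack": np.hstack((first, second)).shape,
    "vstack": np.vstack((first, second)).shape,
    "dstack": np.dstack((first, second)).shape,
}
for name, shape in shapes.items():
    print(name, shape)
```

Only hstack() keeps the result 1-D here; the other calls add at least one new dimension.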


The reverse operation of joining is **splitting**. Splitting breaks one array into multiple arrays. We use **array_split()** for splitting arrays, passing it the *array we want to split* and the *number of splits*. The split() method is also available, but it is generally avoided because it does not adjust when the array has fewer elements than an even split requires: in such a case array_split() works properly, but split() fails.

In [None]:
bp = np.array([1,2,3,4,5,6])

In [None]:
bq = np.array_split(bp,3)
print(bq)

[array([1, 2]), array([3, 4]), array([5, 6])]


If the array has fewer elements than required, it will adjust from the end accordingly.

In [None]:
br = np.array_split(bp, 4)
print(br)

[array([1, 2]), array([3, 4]), array([5]), array([6])]
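
A small sketch of the difference between the two functions: six elements cannot be divided evenly into four parts, so array_split() adjusts the sizes while split() raises an error. The variable names are made up for illustration.

```python
import numpy as np

six = np.array([1, 2, 3, 4, 5, 6])

parts = np.array_split(six, 4)
sizes = [len(part) for part in parts]
print(sizes)  # [2, 2, 1, 1]

try:
    np.split(six, 4)      # 6 elements cannot be divided evenly by 4
    split_failed = False
except ValueError as err:
    split_failed = True
    print("ValueError:", err)
```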


In [6]:
bu = np.array([[1,2,3],[4,5,6],[7,8,9],[10,11,12],[13,14,15],[16,17,18]])
bx = np.array_split(bu,3,axis=1)
print(bx)

[array([[ 1],
       [ 4],
       [ 7],
       [10],
       [13],
       [16]]), array([[ 2],
       [ 5],
       [ 8],
       [11],
       [14],
       [17]]), array([[ 3],
       [ 6],
       [ 9],
       [12],
       [15],
       [18]])]


The split counterparts of the stack helpers are also available, namely **hsplit()** and **vsplit()**.

In [8]:
by = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15], [16, 17, 18]])
bz = np.hsplit(by,3)
print(bz)

[array([[ 1],
       [ 4],
       [ 7],
       [10],
       [13],
       [16]]), array([[ 2],
       [ 5],
       [ 8],
       [11],
       [14],
       [17]]), array([[ 3],
       [ 6],
       [ 9],
       [12],
       [15],
       [18]])]
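
For completeness, a sketch of vsplit() on an array with the same values: it cuts along the rows instead of the columns. The array is built with arange() and reshape() here only to keep the example short.

```python
import numpy as np

table = np.arange(1, 19).reshape(6, 3)  # same values as the array above
blocks = np.vsplit(table, 3)            # three blocks of two rows each

for block in blocks:
    print(block)
```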


To search through arrays for a specific value, the **where()** method is used. The result is a tuple of arrays holding the indexes where the match is present.

In [9]:
ca = np.array([1,2,3,4,5,4,4])
cb = np.where(ca==4)
print(cb)

(array([3, 5, 6]),)


In [12]:
#find the indexes where the values are even (cd) and odd (cf)
cc = np.array([1,2,3,4,5,6,7,8])
cd = np.where(cc%2 == 0)
cf = np.where(cc%2 == 1)
print(cd)
print(cf)

(array([1, 3, 5, 7]),)
(array([0, 2, 4, 6]),)


To perform a binary search in an array, the **searchsorted()** method is used. It returns the index where the specified value would be inserted to maintain the sort order. It is meant to be used on *sorted arrays*.

In [14]:
ch = np.array([6,7,8,9])
cg = np.searchsorted(ch,7)
print(cg)

1


The default searching direction of the searchsorted() method is from the left side, but we can give **side='right'** parameter so that it will return the right most index instead.

In [15]:
ci = np.array([6,7,8,9])
ck = np.searchsorted(ci,7,side='right')
print(ck)

2


Multiple values can also be searched at once: instead of a single value, we pass an array of values.

In [19]:
cl = np.array([1,3,5,7]) 
cm = np.searchsorted(cl,[2,4,6])
print(cm)

[1 2 3]


**Sorting** means putting elements in an ordered sequence. An *ordered sequence* is any sequence that has an order corresponding to elements, like numeric or alphabetical, ascending or descending. NumPy ndarray objects can be sorted using the **np.sort()** function. This function returns a COPY of the array, leaving the original array unchanged.

In [22]:
cp = np.array([3,2,0,1])
print(np.sort(cp))

[0 1 2 3]
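
As a small sketch of the copy behaviour (the variable names are made up): np.sort() leaves the array alone, while the sort() method of the array itself sorts in place and returns None.

```python
import numpy as np

values = np.array([3, 2, 0, 1])

sorted_copy = np.sort(values)       # new sorted array
untouched = values.tolist()         # the original order is still intact
descending = np.sort(values)[::-1]  # reverse the sorted copy for descending order

result = values.sort()              # sorts values in place and returns None

print(sorted_copy)     # [0 1 2 3]
print(untouched)       # [3, 2, 0, 1]
print(descending)      # [3 2 1 0]
print(result, values)  # None [0 1 2 3]
```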


In [23]:
cq = np.array(['banana','cherry','apple'])
print(np.sort(cq))

['apple' 'banana' 'cherry']


In [24]:
ct = np.array([True,False,True])
print(np.sort(ct))

[False  True  True]


In [25]:
cv = np.array([[3,2,4],[5,0,1]])
print(np.sort(cv))

[[2 3 4]
 [0 1 5]]
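
For a 2-D array, np.sort() sorts each row by default (the last axis). A short sketch of the **axis** argument, with made-up variable names:

```python
import numpy as np

grid = np.array([[3, 2, 4], [5, 0, 1]])

by_row = np.sort(grid, axis=1)       # sort inside each row (the default)
by_col = np.sort(grid, axis=0)       # sort inside each column
flat = np.sort(grid, axis=None)      # flatten first, then sort everything

print(by_row)  # [[2 3 4] [0 1 5]]
print(by_col)  # [[3 0 1] [5 2 4]]
print(flat)    # [0 1 2 3 4 5]
```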


Filtering an array in NumPy is done using a boolean index list. A *boolean index list* is a list of booleans corresponding to the indexes in the array. The values tagged as True are kept, while the values tagged as False are excluded from the filtered array. A common use of this is to create a filter array based on conditions.

In [26]:
cw = np.array([41,42,43,44])
cx = [True,False,True,False]
cy = cw[cx]
print(cy)

[41 43]


In [27]:
da = np.array([41,42,43,44])
filter_da = []

for element in da:
  if element > 42:
    filter_da.append(True)
  else:
    filter_da.append(False)

db = da[filter_da]
print(filter_da)
print(db)

[False, False, True, True]
[43 44]


Direct substitution is also a great way of solving this common task in NumPy. We can use the array itself in the condition instead of the loop variable, and it will work just as we expect it to.

In [29]:
dc = np.array([41,42,43,44])
dd = dc>42
de = dc[dd]

print(dd)
print(de)

[False False  True  True]
[43 44]
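
Conditions can also be combined. In this small sketch (the variable names are made up), `&` means "and" and `~` inverts the filter; each condition needs its own parentheses.

```python
import numpy as np

nums = np.array([41, 42, 43, 44])

mask = (nums > 41) & (nums % 2 == 0)  # greater than 41 AND even
kept = nums[mask]
rest = nums[~mask]                    # everything the filter rejected

print(mask)  # [False  True False  True]
print(kept)  # [42 44]
print(rest)  # [41 43]
```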
