# Important Python/programming concepts for Data Science


## 1. List comprehensions


The basic structure of a list comprehension looks like this:

[output for item in iterable if condition]

Fill a list with numbers:

In [12]:
numbers = [x for x in range(1, 8)]
numbers

[1, 2, 3, 4, 5, 6, 7]

Modify a list:

In [13]:
squared = [x*x for x in numbers]
squared

[1, 4, 9, 16, 25, 36, 49]

Filter and modify:


In [14]:
even_squared = [x*x for x in numbers if x%2 == 0]
even_squared

[4, 16, 36]
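To make the structure concrete, the filter-and-modify comprehension can be unrolled into an ordinary for loop. This is only an illustrative sketch; the name `even_squared_loop` is introduced here and is not part of the original notebook.

```python
numbers = [x for x in range(1, 8)]

# Comprehension form: output expression, loop, then filter condition
even_squared = [x * x for x in numbers if x % 2 == 0]

# Equivalent explicit loop (even_squared_loop is a hypothetical name)
even_squared_loop = []
for x in numbers:
    if x % 2 == 0:                       # filter condition
        even_squared_loop.append(x * x)  # output expression

print(even_squared_loop)
```

Both forms build the same list; the comprehension simply packs the loop, the filter and the output expression into one line.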

Nested loop:

In [16]:
matrix = [[x for x in range(1,4)] for y in range(1,5)]
matrix

[[1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3]]
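In the cell above the outer loop variable is not used, so every row is identical. As a sketch of nested comprehensions that use both loop variables, here is a small multiplication table followed by a flattening step; the names `table` and `flat` are illustrative.

```python
# Inner comprehension builds one row; outer comprehension builds the rows
table = [[x * y for x in range(1, 4)] for y in range(1, 5)]
print(table)

# Two for-clauses in one comprehension read left to right, like nested loops
flat = [value for row in table for value in row]
print(flat)
```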

More examples with the Boston Housing dataset:

In [21]:
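# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older scikit-learn version.
from sklearn.datasets import load_boston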
import pandas as pd

boston = load_boston()
df = pd.DataFrame(boston.data, columns=boston.feature_names)
df.head()

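| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 |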


In [25]:
df.describe()

| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 |
| mean | 3.613524 | 11.363636 | 11.136779 | 0.06917 | 0.554695 | 6.284634 | 68.574901 | 3.795043 | 9.549407 | 408.237154 | 18.455534 | 356.674032 | 12.653063 |
| std | 8.601545 | 23.322453 | 6.860353 | 0.253994 | 0.115878 | 0.702617 | 28.148861 | 2.10571 | 8.707259 | 168.537116 | 2.164946 | 91.294864 | 7.141062 |
| min | 0.00632 | 0.0 | 0.46 | 0.0 | 0.385 | 3.561 | 2.9 | 1.1296 | 1.0 | 187.0 | 12.6 | 0.32 | 1.73 |
| 25% | 0.082045 | 0.0 | 5.19 | 0.0 | 0.449 | 5.8855 | 45.025 | 2.100175 | 4.0 | 279.0 | 17.4 | 375.3775 | 6.95 |
| 50% | 0.25651 | 0.0 | 9.69 | 0.0 | 0.538 | 6.2085 | 77.5 | 3.20745 | 5.0 | 330.0 | 19.05 | 391.44 | 11.36 |
| 75% | 3.677083 | 12.5 | 18.1 | 0.0 | 0.624 | 6.6235 | 94.075 | 5.188425 | 24.0 | 666.0 | 20.2 | 396.225 | 16.955 |
| max | 88.9762 | 100.0 | 27.74 | 1.0 | 0.871 | 8.78 | 100.0 | 12.1265 | 24.0 | 711.0 | 22.0 | 396.9 | 37.97 |


One handy use of list comprehensions is transforming a continuous variable into a categorical one.

In [26]:
df['age_cat'] = ['0-30' if age <= 30 else '30 to 60' if age <= 60 else 'above 60' for age in df['AGE']]

In [27]:
df['age_cat']

0      above 60
1      above 60
2      above 60
3      30 to 60
4      30 to 60
         ...   
501    above 60
502    above 60
503    above 60
504    above 60
505    above 60
Name: age_cat, Length: 506, dtype: object
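For larger frames, pandas offers `pd.cut` as an alternative way to bin a continuous column. The sketch below uses a small hypothetical Series in place of `df['AGE']`, and the bin edges 0, 30, 60 and 100 are assumptions chosen to mirror the labels above. Note that each bin includes its right edge by default, so a value of exactly 60 lands in '30 to 60'.

```python
import pandas as pd

# Hypothetical sample values standing in for df['AGE']
ages = pd.Series([2.9, 30.0, 45.8, 60.0, 65.2, 100.0])

# Bins are right-inclusive by default: (0, 30], (30, 60], (60, 100]
age_cat = pd.cut(ages, bins=[0, 30, 60, 100],
                 labels=['0-30', '30 to 60', 'above 60'])
print(age_cat.tolist())
```

The result is a categorical column rather than plain strings, which keeps the natural ordering of the bins.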

Negate the elements of a list that are greater than 3 but smaller than 8:

In [1]:
num = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

In [3]:
[-i if 3 < i < 8 else i for i in num]

[1, 2, 3, -4, -5, -6, -7, 8, 9, 10]

Letter-to-number pairs (a dict comprehension):

In [6]:
import string

{letter: i for i, letter in enumerate(string.ascii_lowercase, start=1)}

{'a': 1,
 'b': 2,
 'c': 3,
 'd': 4,
 'e': 5,
 'f': 6,
 'g': 7,
 'h': 8,
 'i': 9,
 'j': 10,
 'k': 11,
 'l': 12,
 'm': 13,
 'n': 14,
 'o': 15,
 'p': 16,
 'q': 17,
 'r': 18,
 's': 19,
 't': 20,
 'u': 21,
 'v': 22,
 'w': 23,
 'x': 24,
 'y': 25,
 'z': 26}

Tokenize words, exclude stopwords:

In [13]:
sentences = ["The Hubble Space telescope has spotted", 
             "a formation of galaxies that resembles", 
             "a smiling face in the sky", 
             "The image taken with the Wide Field Camera", 
             "shows a patch of space filled with galaxies", 
             "of all shapes, colours and sizes"]
stopwords = ['for', 'a', 'of', 'the', 'and', 'to', 'in', 'on', 'with']

In [14]:
[[word.lower() for word in sentence.split(' ') if word.lower() not in stopwords] for sentence in sentences]

[['hubble', 'space', 'telescope', 'has', 'spotted'],
 ['formation', 'galaxies', 'that', 'resembles'],
 ['smiling', 'face', 'sky'],
 ['image', 'taken', 'wide', 'field', 'camera'],
 ['shows', 'patch', 'space', 'filled', 'galaxies'],
 ['all', 'shapes,', 'colours', 'sizes']]
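The output above still contains the token 'shapes,' with its trailing comma, because splitting on spaces leaves punctuation attached to words. A minimal sketch of one way to handle this, assuming it is acceptable to strip leading and trailing punctuation with `str.strip` and `string.punctuation`; the helper name `tokenize` is hypothetical.

```python
import string

stopwords = ['for', 'a', 'of', 'the', 'and', 'to', 'in', 'on', 'with']

def tokenize(sentence):
    """Lowercase, strip surrounding punctuation, and drop stopwords."""
    words = (w.lower().strip(string.punctuation) for w in sentence.split())
    return [w for w in words if w and w not in stopwords]

print(tokenize("of all shapes, colours and sizes"))
print(tokenize("The Hubble Space telescope has spotted"))
```

Lowercasing before the stopword check also ensures that a capitalized 'The' is filtered out.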

(word, sentence id) pairs, where the id is the index of the sentence the word came from:

In [18]:
sentences = ["The Hubble Space telescope has spotted", 
             "a formation of galaxies that resembles", 
             "a smiling face in the sky"]

In [19]:
[(word.lower(),i) for i, sentence in enumerate(sentences) for word in sentence.split(' ')]

[('the', 0),
 ('hubble', 0),
 ('space', 0),
 ('telescope', 0),
 ('has', 0),
 ('spotted', 0),
 ('a', 1),
 ('formation', 1),
 ('of', 1),
 ('galaxies', 1),
 ('that', 1),
 ('resembles', 1),
 ('a', 2),
 ('smiling', 2),
 ('face', 2),
 ('in', 2),
 ('the', 2),
 ('sky', 2)]
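The pairs above attach each word to the index of its sentence. If instead each distinct word should get its own integer id, a dict comprehension over the unique words is one option. This is a sketch; the names `unique_words` and `vocab`, and the choice to sort the words so that ids are reproducible, are assumptions.

```python
sentences = ["The Hubble Space telescope has spotted",
             "a formation of galaxies that resembles",
             "a smiling face in the sky"]

# Collect the distinct lowercase words with a set comprehension
unique_words = {word.lower() for sentence in sentences for word in sentence.split()}

# Sort so that ids are deterministic, then number the words from 0
vocab = {word: idx for idx, word in enumerate(sorted(unique_words))}

print(len(vocab))
print(vocab['a'], vocab['the'])
```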

## 2. Lambda functions


A lambda function is a small anonymous function that can take any number of arguments but consists of a single expression, whose value is returned. Lambda functions are especially useful when calling functions that take another function as an argument, such as map, filter or apply.

###### Syntax

lambda arguments : expression



###### Examples:

Comparison between a normal function and a lambda function.

In [20]:
def square(y):
    return y * y

print(square(2))

l_square = lambda x: x * x

print(l_square(2))
  


4
4


For example, we can take a list of words and convert each one to uppercase.

In [1]:
list(map(lambda x: x.upper(), ['cat', 'dog', 'cow']))

['CAT', 'DOG', 'COW']

Or filter a list of numbers:

In [5]:
num = [1, 2, 3, 4, 5]
list(filter(lambda x: x > 3, num))

[4, 5]
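Another common place for a lambda is the `key` argument of the built-in `sorted`. The sketch below uses made-up word lists and counts purely for illustration.

```python
words = ['galaxies', 'sky', 'telescope', 'face']

# key receives each element and returns the value to sort by
by_length = sorted(words, key=lambda w: len(w))
print(by_length)

# Sort (word, count) pairs by count, largest first
counts = [('space', 2), ('sky', 1), ('galaxies', 3)]
by_count = sorted(counts, key=lambda pair: pair[1], reverse=True)
print(by_count)
```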

We can also create a new column based on the value of an existing one:

In [10]:
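# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older scikit-learn version.
from sklearn.datasets import load_boston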
import pandas as pd

boston = load_boston()
df = pd.DataFrame(boston.data, columns=boston.feature_names)
df.head()

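| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 |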


In [11]:
df['AGE_DESC'] = df['AGE'].apply(lambda x: 'age is: ' + str(x))
df.head()

| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | AGE_DESC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | age is: 65.2 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | age is: 78.9 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | age is: 61.1 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | age is: 45.8 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | age is: 54.2 |


Another example relevant for data science is combining two columns into a new one, here the ratio of LSTAT to AGE computed row by row:

In [13]:
df['AGE_LSTAT_Ratio'] = df.apply(lambda x: (x['LSTAT']/x['AGE']), axis=1)
df.head()

| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | AGE_DESC | AGE_LSTAT_Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | age is: 65.2 | 0.07638 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | age is: 78.9 | 0.115843 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | age is: 61.1 | 0.065957 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | age is: 45.8 | 0.064192 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | age is: 54.2 | 0.098339 |
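Row-wise `apply` with a lambda is flexible, but for simple arithmetic the same column can usually be computed with vectorized column operations, which tends to be much faster on large frames. The sketch below uses a hypothetical three-row frame named `small` built from the AGE and LSTAT values shown above, so that it runs without the dataset.

```python
import pandas as pd

# Hypothetical miniature frame with the first rows shown above
small = pd.DataFrame({'AGE': [65.2, 78.9, 61.1],
                      'LSTAT': [4.98, 9.14, 4.03]})

# Row-wise apply with a lambda
ratio_apply = small.apply(lambda row: row['LSTAT'] / row['AGE'], axis=1)

# Vectorized column arithmetic gives the same values
ratio_vec = small['LSTAT'] / small['AGE']

print(ratio_vec.round(6).tolist())
```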


## 3. Data reshaping

Another very important technique every data scientist should be able to do blindfolded is data reshaping, slicing and everything that goes with it. The following is a brief overview of how this can be done in Python with NumPy.

Let's start by creating a 1 dimensional array.


In [34]:
import numpy as np
a = np.arange(30)
a

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29])

In [35]:
a.shape

(30,)

The rank of an array (its number of dimensions, not to be confused with the rank of a matrix in linear algebra) is equal to the length of its shape.

In [36]:
len(a.shape)

1
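NumPy also exposes this number directly through the `ndim` attribute. A short sketch, reusing the same `np.arange(30)` array:

```python
import numpy as np

a = np.arange(30)
# ndim is the number of axes, i.e. the length of the shape tuple
print(a.ndim, len(a.shape))

a2 = a.reshape(5, 6)
print(a2.ndim, a2.shape)
```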

Let's convert a from a rank 1 tensor into a rank 2 tensor. We can achieve this by using None as an indexer.

In [66]:
a2 = a[None,:]
a2.shape

(1, 30)

We can get the same result by indexing with None alone, because the trailing slice is implied.

In [67]:
a2 = a[None]
a2.shape

(1, 30)

In [57]:
a2[:,:4]

array([[0, 1, 2, 3]])

In [58]:
len(a2.shape)

2

In [59]:
a2 = a[:,None]
a2.shape

(30, 1)

In [60]:
a2[:4]

array([[0],
       [1],
       [2],
       [3]])

Let's reshape a2 into a 5x6 rank 2 tensor.

In [70]:
a2 = a2.reshape(5,6)
a2

array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29]])
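When only one of the dimensions is known, `reshape` accepts -1 for the other and infers it from the total number of elements. The following is a small sketch of that behaviour, including what happens when the sizes do not fit; the variable name `b` is illustrative.

```python
import numpy as np

a = np.arange(30)

# -1 asks NumPy to infer that dimension from the total size (30 / 5 = 6)
b = a.reshape(5, -1)
print(b.shape)

# Flatten back to rank 1
print(b.reshape(-1).shape)

# Incompatible shapes raise ValueError (30 is not divisible by 7)
try:
    a.reshape(7, -1)
except ValueError as err:
    print("reshape failed:", err)
```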

We can transform the rank 2 into a rank 3 tensor like this:


In [71]:
a3 = a2[None, :, :]
a3.shape

(1, 5, 6)

As above, the trailing slices can be omitted:

In [74]:
a3 = a2[None]
a3.shape

(1, 5, 6)

In [73]:
a3 = a2[:, None, :]
a3.shape

(5, 1, 6)

In [75]:
a3 = a2[:, None]
a3.shape

(5, 1, 6)

In [65]:
a3 = a2[:, :, None]
a3.shape

(5, 6, 1)

The same can be written with an ellipsis, which stands for all remaining axes:

In [78]:
a3 = a2[...,None]
a3.shape

(5, 6, 1)
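Indexing with None is shorthand for `np.newaxis`, and `np.expand_dims` does the same job with an explicit axis argument. The sketch below compares the three spellings on a 5x6 array; the names `v1`, `v2` and `v3` are illustrative.

```python
import numpy as np

a2 = np.arange(30).reshape(5, 6)

# np.newaxis is simply an alias for None
print(np.newaxis is None)

# These three all append a trailing axis of length 1
v1 = a2[:, :, None]
v2 = a2[..., np.newaxis]
v3 = np.expand_dims(a2, axis=-1)
print(v1.shape, v2.shape, v3.shape)

# squeeze removes an axis of length 1 again
print(v3.squeeze(axis=-1).shape)
```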

## 4. Broadcasting


### 4.1 Element-wise

In [79]:
a = np.array([3, 5, 9])
b = np.array([4, 6, 7])

In [80]:
a+b

array([ 7, 11, 16])

In [81]:
(a<b)

array([ True,  True, False])

In [85]:
(a<b).mean()

0.6666666666666666
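Taking the mean of a boolean array works because True counts as 1 and False as 0, so the mean is the fraction of elements for which the comparison holds. A brief sketch with the same two arrays; the name `mask` is illustrative.

```python
import numpy as np

a = np.array([3, 5, 9])
b = np.array([4, 6, 7])

mask = a < b                 # element-wise comparison
print(mask.sum())            # number of True entries
print(mask.mean())           # fraction of True entries
print(mask.astype(int))      # the 0/1 values being averaged
```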

In [82]:
a * b

array([12, 30, 63])

### 4.2 Broadcasting (with help from fastai)

<b>From Numpy Documentation: </b>

<i>"The term broadcasting describes how numpy treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes. Broadcasting provides a means of vectorizing array operations so that looping occurs in C instead of Python. It does this without making needless copies of data and usually leads to efficient algorithm implementations. There are, however, cases where broadcasting is a bad idea because it leads to inefficient use of memory that slows computation."</i>

In [86]:
a + 2

array([ 5,  7, 11])

In [89]:
m = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]); m

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [90]:
2*m

array([[ 2,  4,  6],
       [ 8, 10, 12],
       [14, 16, 18]])

In [97]:
c = np.array([3, 6, 9])

In [98]:
c*m

array([[ 3, 12, 27],
       [12, 30, 54],
       [21, 48, 81]])

This is the same as multiplying by a matrix in which c is repeated on every row:

In [100]:
d = np.array([[3, 6, 9], [3, 6, 9], [3, 6, 9]]) * m; d

array([[ 3, 12, 27],
       [12, 30, 54],
       [21, 48, 81]])

To broadcast c down the columns instead, turn it into a column vector of shape (3, 1):

In [106]:
c1 = c[:, None]; c1

array([[3],
       [6],
       [9]])

In [107]:
c1 * m

array([[ 3,  6,  9],
       [24, 30, 36],
       [63, 72, 81]])
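To see what NumPy conceptually does with each operand, `np.broadcast_to` can expand an array to the target shape without copying data. The sketch below rebuilds the row-wise and column-wise cases from above; the names `row_view` and `col_view` are illustrative.

```python
import numpy as np

m = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
c = np.array([3, 6, 9])

# Shape (3,) is treated as (1, 3) and repeated down the rows
row_view = np.broadcast_to(c, m.shape)
# Shape (3, 1) is repeated across the columns
col_view = np.broadcast_to(c[:, None], m.shape)

print(row_view)
print(col_view)

# Multiplying by the broadcast views reproduces the results above
assert np.array_equal(c * m, row_view * m)
assert np.array_equal(c[:, None] * m, col_view * m)

# Shapes that cannot be aligned raise ValueError
try:
    np.array([1, 2]) * m
except ValueError as err:
    print("cannot broadcast:", err)
```

Shapes are compared from the trailing axis backwards; two axes are compatible when they are equal or when one of them has length 1.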