# ML vs Rule-Based Systems

Let's consider an email system:  
- we want to develop a spam detection system
- it is a classification task
- we may look at some spam emails and look for patterns, for example:
    - if sender = 'promotions@online.com' then "spam"
    - if title contains "tax review" and sender domain = 'online.com' then "spam"

In [1]:
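def detect_spam(email):
    # rule 1: known spam sender
    if email.sender == 'promotions@online.com':
        return "SPAM"
    # rule 2: suspicious title coming from the 'online.com' domain
    title = email.title.lower()
    if 'tax review' in title and email.sender.endswith('@online.com'):
        return "SPAM"
    return "GOOD"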

We deploy this system and it works until people start to complain about new spam emails:
- then we add a new rule

But then we see that our system sometimes classifies normal emails as spam, so we adjust the rules again.

We repeat this again and again.

The system becomes bigger and more complicated, and therefore hard to maintain.

We can use ML!

- to create features for an ML model, it is a good idea to start from rules like the ones above (e.g. is the sender 'promotions@online.com': true/false)
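
The idea of turning rules into features can be sketched roughly as follows. This is only an illustration: the `Email` record, its field names and the `email_to_features` helper are assumptions made for the example, not part of any real email system.

```python
from collections import namedtuple

# Hypothetical email record; the field names are assumptions for illustration.
Email = namedtuple('Email', ['sender', 'title'])

def email_to_features(email):
    """Turn the hand-written rules into binary (0/1) features."""
    title = email.title.lower()
    return [
        int(email.sender == 'promotions@online.com'),  # rule 1 as a feature
        int('tax review' in title),                    # title pattern
        int(email.sender.endswith('@online.com')),     # sender domain
    ]

features = email_to_features(Email('promotions@online.com', 'Tax review needed'))
print(features)
```

Instead of combining these checks by hand, an ML model can learn from labeled emails how much weight each feature deserves.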

# Supervised ML

## Regression, Classification, Ranking

- ranking - for example, a recommender system:  
for a user, we score a set of items and show the highest-scoring ones first; Google search does something similar

# ML processes
## CRISP-DM (Cross-Industry Standard Process for Data Mining)

![](./pic/1.png)

- Business Understanding - identify the problem we want to solve (this is important) and how we will measure the success of the solution;  
often we do not need ML at all
- Data Understanding - what data we can use:
    - Is the data reliable (do we track it correctly)?
    - Is the dataset large enough?
- Data Preparation - transform the data so it can be fed into an ML algorithm
    - Clean the data
    - Build data pipelines (sequences of steps) and put the data into tabular format (a table)
    - Extract features and put them into the table
- Modeling
    - Try different models
    - Select the best model
    - Sometimes go back to add more features, clean the data, etc.
- Evaluation - how well the model solves the business problem
    - Have we reduced the amount of spam by 50%?
    - We may adjust the business goal
- Deployment - often comes together with evaluation (online evaluation on live users)
    - proper monitoring
    - ensuring the quality and maintainability of the model

Start Simple - learn from feedback - improve

# Modeling step

Which model to choose:
- Logistic Regression
- Decision Tree
- Neural Network, or one of many others?

## Selecting the best model

- We split dataset into Train and Test (Validation) datasets
- Then we use different models and look at the Score of each model
- Then pick the model with the highest (best) score


however, this approach has some problems:  
![](./pic/2.png)    


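The last coin simply got really lucky and produced the same sequence as the right answers. In the same way, when we compare many models on one validation set, one of them can get the best score purely by chance.

This is the **multiple comparison problem**.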
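
A small simulation can make the effect more tangible. The sketch below is an assumption-laden illustration, not a formal analysis: it treats 1000 coin-flipping "models" as pure random guessers and checks how well the luckiest one scores on just 10 examples.

```python
import numpy as np

rng = np.random.default_rng(0)

# 10 "right answers" (0 or 1) that the models are scored against
y_true = rng.integers(0, 2, size=10)

# 1000 random guessers, each producing 10 coin flips
guesses = rng.integers(0, 2, size=(1000, 10))

# accuracy of each guesser on this small validation set
accuracy = (guesses == y_true).mean(axis=1)

print('average accuracy:', accuracy.mean())
print('best accuracy:   ', accuracy.max())
```

The average stays close to 0.5, but the best guesser typically looks much better than that, even though no guesser has any real skill.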

To guard against this, we split the data 60-20-20 (train, validation, test):  
- then we again do model selection, using the train and validation datasets
- then we apply the winning model to the test dataset

1. Split dataset (3 parts)
2. Train model
3. Validate using Validation dataset
4. Select the best model
5. Test the model (check that it is really good)

however, to use more data, we can do the following:
- split into 3 parts
- train validate as usual,
- then train model on the train-validation combined dataset
- test on the test dataset  

![](./pic/3.png)
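
The 60-20-20 split described above could be sketched like this with plain NumPy. The helper name `split_60_20_20`, the seed and the dataset size of 100 are assumptions chosen for the example.

```python
import numpy as np

def split_60_20_20(n, seed=2):
    """Return shuffled row indices for the train, validation and test parts."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)          # shuffle the indices 0..n-1

    n_val = int(n * 0.2)
    n_test = int(n * 0.2)
    n_train = n - n_val - n_test      # the remaining ~60%

    train_idx = idx[:n_train]
    val_idx = idx[n_train:n_train + n_val]
    test_idx = idx[n_train + n_val:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = split_60_20_20(100)

# after model selection, train + validation can be combined for the final fit
full_train_idx = np.concatenate([train_idx, val_idx])
print(len(train_idx), len(val_idx), len(test_idx), len(full_train_idx))
```

The test indices are kept aside until the very end, so the final check is not affected by the model selection step.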

# Intro to NumPy

In [1]:
import numpy as np

creating array

In [3]:
np.zeros(3)

array([0., 0., 0.])

In [4]:
np.ones(7)

array([1., 1., 1., 1., 1., 1., 1.])

In [5]:
np.full(4, 99)

array([99, 99, 99, 99])

In [7]:
# convert list into numpy array:
a = np.array([1,2,3])
a

array([1, 2, 3])

In [8]:
# access an element of an array:
a[2]

3

In [9]:
# change element:
a[2] = 999
a

array([  1,   2, 999])

In [11]:
# range from 3 to 9:
np.arange(3, 10)

array([3, 4, 5, 6, 7, 8, 9])

In [13]:
# 12 evenly spaced values from 0 to 1 (both ends included):
np.linspace(0, 1, 12)

array([0.        , 0.09090909, 0.18181818, 0.27272727, 0.36363636,
       0.45454545, 0.54545455, 0.63636364, 0.72727273, 0.81818182,
       0.90909091, 1.        ])

### Multidimensional arrays

In [15]:
np.zeros((5, 2)) # shape: (number of rows, number of columns)

array([[0., 0.],
       [0., 0.],
       [0., 0.],
       [0., 0.],
       [0., 0.]])

In [18]:
n = np.array([
    [1,2,3],
    [4,5,6]
]) # list of lists
n

array([[1, 2, 3],
       [4, 5, 6]])

In [20]:
# access element:
n[0,1] # 1st row and second column

2

In [22]:
# access row
n[0]

array([1, 2, 3])

In [24]:
# rewrite row:
n[0] = np.full(3,999)
n

array([[999, 999, 999],
       [  4,   5,   6]])

In [28]:
# access column:
n[:, 1] # all rows and only column index = 1 (second column)

array([999,   5])

### Randomly generated arrays:

In [33]:
np.random.rand(5, 2) # 5 rows and 2 columns
# samples come from the standard uniform distribution on [0, 1)

array([[0.58243186, 0.34040377],
       [0.67357958, 0.50478264],
       [0.84613947, 0.2364457 ],
       [0.15803523, 0.39101787],
       [0.46624332, 0.61067318]])

every time we get different numbers, but we can fix random seed:

In [35]:
np.random.seed(2)
np.random.rand(5, 2)

array([[0.4359949 , 0.02592623],
       [0.54966248, 0.43532239],
       [0.4203678 , 0.33033482],
       [0.20464863, 0.61927097],
       [0.29965467, 0.26682728]])

To get numbers from 0 to 100 instead of from 0 to 1, we can simply multiply the whole array by 100:

In [37]:
np.random.seed(2)
np.random.rand(5, 2) * 100

array([[43.59949021,  2.59262318],
       [54.96624779, 43.53223926],
       [42.03678021, 33.0334821 ],
       [20.4648634 , 61.92709664],
       [29.96546737, 26.68272751]])

We can also create arrays with random values from other distributions (not only the standard uniform distribution):  


In [36]:
# standard normal distribution:
np.random.seed(2)
np.random.randn(5,2)

array([[-0.41675785, -0.05626683],
       [-2.1361961 ,  1.64027081],
       [-1.79343559, -0.84174737],
       [ 0.50288142, -1.24528809],
       [-1.05795222, -0.90900761]])

In [39]:
# random integers:
np.random.seed(2)
np.random.randint(low=2, high=99, size=(5,3))

array([[42, 17, 74],
       [24, 45, 84],
       [77,  9, 36],
       [51, 97, 77],
       [87, 49, 65]])

### element-wise operations

In [40]:
a = np.arange(5)
a

array([0, 1, 2, 3, 4])

In [42]:
# add 10 to each element
a + 10

array([10, 11, 12, 13, 14])

In [46]:
# multiply each element by 10
b = a * 10
b

array([ 0, 10, 20, 30, 40])

In [47]:
# add two arrays element-wise:
a + b

array([ 0, 11, 22, 33, 44])

### Comparison operations

In [49]:
a > 2

array([False, False, False,  True,  True])

In [52]:
# filtering array:
a[a >= 2]

array([2, 3, 4])

Summarizing operations:

In [53]:
a.sum()

10

In [54]:
a.mean()

2.0

In [56]:
a.std() # standard deviation

1.4142135623730951

# Linear Algebra

### vector operations

In [57]:
# multiply a vector by a scalar:

In [59]:
np.full(3, 5) * 2

array([10, 10, 10])

Adding one vector to another: each element of the first vector is added to the corresponding element of the second.

In [61]:
a = np.full(3,5)
b = np.full(3,7)
a + b

array([12, 12, 12])

### Multiplication (dot product):  

Vector multiplication in linear algebra is different from array multiplication in NumPy (which is element-wise).  
In linear algebra (the dot product), we first multiply the vectors element-wise and then sum all the resulting numbers.

In [66]:
# element-wise multiplication of two vectors (numpy):
a * b

array([35, 35, 35])

In [67]:
# linear algebra:
(a * b).sum()

105

In [73]:
def vector_vector_multiplication(u, v):
    assert u.shape[0] == v.shape[0] # check if both vectors have the same size
    
    n = u.shape[0]

    result = 0
    for i in range(n):
        result = result + u[i] * v[i]
    return result

In [74]:
vector_vector_multiplication(a,b)

105

NumPy already has a function for this (dot):


In [76]:
a.dot(b)

105

### Matrix-vector multiplication

In [92]:
U = np.array([
    [2, 4, 5, 6],
    [1, 2, 1, 2],
    [3, 1, 2, 1]
])

U

array([[2, 4, 5, 6],
       [1, 2, 1, 2],
       [3, 1, 2, 1]])

In [96]:
v = np.array([1, 0, 0, 2])
v

array([1, 0, 0, 2])

In [159]:
def matrix_vector_multiplication(U, v):
    assert U.shape[1] == v.shape[0]

    num_rows = U.shape[0]

    result = np.zeros(num_rows) # result vector would have number of elements = number of rows in Matrix

    for i in range(num_rows):
        result[i] = vector_vector_multiplication(U[i], v)

    return result

In [160]:
matrix_vector_multiplication(U, v)

array([14.,  5.,  5.])

In [161]:
# there is function for that in numpy:
U.dot(v)

array([14,  5,  5])

### Matrix-matrix multiplication

In [178]:
U = np.array([
    [2, 4, 5, 6],
    [1, 2, 1, 2],
    [3, 1, 2, 1]
])

U

array([[2, 4, 5, 6],
       [1, 2, 1, 2],
       [3, 1, 2, 1]])

In [179]:
V = np.array([
    [1,2],
    [5,6],
    [3,4],
    [7,8]
])

V

array([[1, 2],
       [5, 6],
       [3, 4],
       [7, 8]])

In [198]:
def matrix_matrix_multiplication(U, V):
    assert U.shape[1] == V.shape[0] # inner dimensions must match: columns of U == rows of V
    
    result = np.zeros((U.shape[0], V.shape[1])) # empty result matrix

    for i in range(V.shape[1]):
        result[:,i] = matrix_vector_multiplication(U, V[:,i])
        #result[i] = U[i].dot(V)

    return result

In [199]:
matrix_matrix_multiplication(U, V)

array([[79., 96.],
       [28., 34.],
       [21., 28.]])

In [200]:
U.dot(V)

array([[79, 96],
       [28, 34],
       [21, 28]])

## Identity matrix

- identity matrix (named I) - a square matrix with 1s on the main diagonal and zeros everywhere else

It is the linear algebra analogue of the number 1 in ordinary algebra.

U · I = U (multiplying by it does not change the original matrix)

In [213]:
I = np.eye(4)
I

array([[1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]])

In [214]:
U.dot(I)

array([[2., 4., 5., 6.],
       [1., 2., 1., 2.],
       [3., 1., 2., 1.]])

## Inverse matrix

$A^{-1} A = I$

- only square matrices can have an inverse (and not every square matrix has one)

In [234]:
U_s = U[:, [0,1,2]] # 1st 3 columns of Matrix U
U_s

array([[2, 4, 5],
       [1, 2, 1],
       [3, 1, 2]])

In [235]:
np.linalg.inv(U_s)

array([[-0.2       ,  0.2       ,  0.4       ],
       [-0.06666667,  0.73333333, -0.2       ],
       [ 0.33333333, -0.66666667, -0.        ]])
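
As a quick sanity check of the definition, multiplying the inverse by the original matrix should give the identity matrix (up to floating-point rounding). A minimal sketch, re-creating `U_s` so the snippet is self-contained; the names `U_s_inv`, `product` and `is_identity` are introduced only for this example:

```python
import numpy as np

U_s = np.array([
    [2, 4, 5],
    [1, 2, 1],
    [3, 1, 2],
])

U_s_inv = np.linalg.inv(U_s)

# inverse times the original matrix should be (numerically) the identity
product = U_s_inv.dot(U_s)

# compare with a tolerance, because the result contains rounding errors
is_identity = np.allclose(product, np.eye(3))
print(is_identity)
```

`np.allclose` is used instead of `==` because tiny rounding errors can leave values such as `1e-16` where an exact zero is expected.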

# Pandas

In [236]:
import pandas as pd

DataFrame:

In [238]:
data = [
    ['Nissan', 'Stanza', 1991, 138, 4, 'MANUAL', 'sedan', 2000],
    ['Hyundai', 'Sonata', 2017, None, 4, 'AUTOMATIC', 'Sedan', 27150],
    ['Lotus', 'Elise', 2010, 218, 4, 'MANUAL', 'convertible', 54990],
    ['GMC', 'Acadia',  2017, 194, 4, 'AUTOMATIC', '4dr SUV', 34450],
    ['Nissan', 'Frontier', 2017, 261, 6, 'MANUAL', 'Pickup', 32340],
] # list of lists

columns = [
    'Make', 'Model', 'Year', 'Engine HP', 'Engine Cylinders',
    'Transmission Type', 'Vehicle_Style', 'MSRP'
] # list

In [242]:
df = pd.DataFrame(data, columns=columns)
df

| | Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle_Style | MSRP |
|---|---|---|---|---|---|---|---|---|
| 0 | Nissan | Stanza | 1991 | 138.0 | 4 | MANUAL | sedan | 2000 |
| 1 | Hyundai | Sonata | 2017 | NaN | 4 | AUTOMATIC | Sedan | 27150 |
| 2 | Lotus | Elise | 2010 | 218.0 | 4 | MANUAL | convertible | 54990 |
| 3 | GMC | Acadia | 2017 | 194.0 | 4 | AUTOMATIC | 4dr SUV | 34450 |
| 4 | Nissan | Frontier | 2017 | 261.0 | 6 | MANUAL | Pickup | 32340 |


Series

In [244]:
df['Make']

0     Nissan
1    Hyundai
2      Lotus
3        GMC
4     Nissan
Name: Make, dtype: object

In [247]:
# several columns:
df[['Make', 'Model', 'MSRP']]

| | Make | Model | MSRP |
|---|---|---|---|
| 0 | Nissan | Stanza | 2000 |
| 1 | Hyundai | Sonata | 27150 |
| 2 | Lotus | Elise | 54990 |
| 3 | GMC | Acadia | 34450 |
| 4 | Nissan | Frontier | 32340 |


In [248]:
# create new column:
df['id'] = [1,2,3,4,5]

In [250]:
del df['id']

In [251]:
df

| | Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle_Style | MSRP |
|---|---|---|---|---|---|---|---|---|
| 0 | Nissan | Stanza | 1991 | 138.0 | 4 | MANUAL | sedan | 2000 |
| 1 | Hyundai | Sonata | 2017 | NaN | 4 | AUTOMATIC | Sedan | 27150 |
| 2 | Lotus | Elise | 2010 | 218.0 | 4 | MANUAL | convertible | 54990 |
| 3 | GMC | Acadia | 2017 | 194.0 | 4 | AUTOMATIC | 4dr SUV | 34450 |
| 4 | Nissan | Frontier | 2017 | 261.0 | 6 | MANUAL | Pickup | 32340 |


### Index

In [252]:
df.index

RangeIndex(start=0, stop=5, step=1)

All Series in a DataFrame share the same index:

In [255]:
df.Make.index

RangeIndex(start=0, stop=5, step=1)

In [257]:
# access rows by index:
df.loc[1]

Make                   Hyundai
Model                   Sonata
Year                      2017
Engine HP                  NaN
Engine Cylinders             4
Transmission Type    AUTOMATIC
Vehicle_Style            Sedan
MSRP                     27150
Name: 1, dtype: object

In [258]:
# multiple rows by index:
df.loc[[1,2]]

| | Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle_Style | MSRP |
|---|---|---|---|---|---|---|---|---|
| 1 | Hyundai | Sonata | 2017 | NaN | 4 | AUTOMATIC | Sedan | 27150 |
| 2 | Lotus | Elise | 2010 | 218.0 | 4 | MANUAL | convertible | 54990 |


In [259]:
# change index:
df.index = ['a', 'b', 'c', 'd', 'e']

In [260]:
df

| | Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle_Style | MSRP |
|---|---|---|---|---|---|---|---|---|
| a | Nissan | Stanza | 1991 | 138.0 | 4 | MANUAL | sedan | 2000 |
| b | Hyundai | Sonata | 2017 | NaN | 4 | AUTOMATIC | Sedan | 27150 |
| c | Lotus | Elise | 2010 | 218.0 | 4 | MANUAL | convertible | 54990 |
| d | GMC | Acadia | 2017 | 194.0 | 4 | AUTOMATIC | 4dr SUV | 34450 |
| e | Nissan | Frontier | 2017 | 261.0 | 6 | MANUAL | Pickup | 32340 |


In [261]:
df.loc[['a', 'b']]

| | Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle_Style | MSRP |
|---|---|---|---|---|---|---|---|---|
| a | Nissan | Stanza | 1991 | 138.0 | 4 | MANUAL | sedan | 2000 |
| b | Hyundai | Sonata | 2017 | NaN | 4 | AUTOMATIC | Sedan | 27150 |


In [263]:
# refer to elements by positional index:
df.iloc[[0, 1]]

| | Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle_Style | MSRP |
|---|---|---|---|---|---|---|---|---|
| a | Nissan | Stanza | 1991 | 138.0 | 4 | MANUAL | sedan | 2000 |
| b | Hyundai | Sonata | 2017 | NaN | 4 | AUTOMATIC | Sedan | 27150 |


In [266]:
# reset index:
df.reset_index(drop=True, inplace=True)

In [267]:
df

| | Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle_Style | MSRP |
|---|---|---|---|---|---|---|---|---|
| 0 | Nissan | Stanza | 1991 | 138.0 | 4 | MANUAL | sedan | 2000 |
| 1 | Hyundai | Sonata | 2017 | NaN | 4 | AUTOMATIC | Sedan | 27150 |
| 2 | Lotus | Elise | 2010 | 218.0 | 4 | MANUAL | convertible | 54990 |
| 3 | GMC | Acadia | 2017 | 194.0 | 4 | AUTOMATIC | 4dr SUV | 34450 |
| 4 | Nissan | Frontier | 2017 | 261.0 | 6 | MANUAL | Pickup | 32340 |


### Element-wise operations

In [268]:
df['Engine HP'] / 100

0    1.38
1     NaN
2    2.18
3    1.94
4    2.61
Name: Engine HP, dtype: float64

In [270]:
# filtering:
df[df['Year'] >= 2015]

| | Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle_Style | MSRP |
|---|---|---|---|---|---|---|---|---|
| 1 | Hyundai | Sonata | 2017 | NaN | 4 | AUTOMATIC | Sedan | 27150 |
| 3 | GMC | Acadia | 2017 | 194.0 | 4 | AUTOMATIC | 4dr SUV | 34450 |
| 4 | Nissan | Frontier | 2017 | 261.0 | 6 | MANUAL | Pickup | 32340 |


### String Operators

In [273]:
# convert all elements to lower case letters
df['Vehicle_Style'].str.lower()

0          sedan
1          sedan
2    convertible
3        4dr suv
4         pickup
Name: Vehicle_Style, dtype: object

In [276]:
# replace spaces with underscores and convert to lower case:
df['Vehicle_Style'] = df['Vehicle_Style'].str.replace(' ', '_').str.lower()
df['Vehicle_Style']

0          sedan
1          sedan
2    convertible
3        4dr_suv
4         pickup
Name: Vehicle_Style, dtype: object

### Summarizing operators:

In [279]:
df['MSRP'].describe()

count        5.000000
mean     30186.000000
std      18985.044904
min       2000.000000
25%      27150.000000
50%      32340.000000
75%      34450.000000
max      54990.000000
Name: MSRP, dtype: float64

In [282]:
# summary for numeric values
df.describe().round(2)

| | Year | Engine HP | Engine Cylinders | MSRP |
|---|---|---|---|---|
| count | 5.0 | 4.0 | 5.0 | 5.0 |
| mean | 2010.4 | 202.75 | 4.4 | 30186.0 |
| std | 11.26 | 51.3 | 0.89 | 18985.04 |
| min | 1991.0 | 138.0 | 4.0 | 2000.0 |
| 25% | 2010.0 | 180.0 | 4.0 | 27150.0 |
| 50% | 2017.0 | 206.0 | 4.0 | 32340.0 |
| 75% | 2017.0 | 228.75 | 4.0 | 34450.0 |
| max | 2017.0 | 261.0 | 6.0 | 54990.0 |


In [285]:
# number of unique values in each column:
df.nunique()

Make                 4
Model                5
Year                 3
Engine HP            4
Engine Cylinders     2
Transmission Type    2
Vehicle_Style        4
MSRP                 5
dtype: int64

### Missing values:  
In ML we usually do not want missing values in the data, so we need to know how many there are and then handle them.

In [286]:
df.isnull()

| | Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle_Style | MSRP |
|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False | False |
| 1 | False | False | False | True | False | False | False | False |
| 2 | False | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False |


In [287]:
# count the missing values in each column:
df.isnull().sum()

Make                 0
Model                0
Year                 0
Engine HP            1
Engine Cylinders     0
Transmission Type    0
Vehicle_Style        0
MSRP                 0
dtype: int64
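
One common way to "do something" with missing values is to fill them in. The sketch below fills the missing horsepower with the column mean. This is only one possible strategy, chosen here as an assumption for illustration, and the Series is re-created so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# same 'Engine HP' values as in the DataFrame above, with one missing value
hp = pd.Series([138, np.nan, 218, 194, 261], name='Engine HP')

n_missing = hp.isnull().sum()       # how many values are missing

# mean() skips NaN by default, so it is computed from the 4 known values
hp_filled = hp.fillna(hp.mean())

print(n_missing)
print(hp_filled)
```

Other options include filling with zero or the median, or dropping the affected rows; which one is appropriate depends on the data and the model.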

### Grouping:

In [289]:
df.groupby('Transmission Type').agg({
    'MSRP': 'mean'
}).round()

| Transmission Type | MSRP |
|---|---|
| AUTOMATIC | 30800.0 |
| MANUAL | 29777.0 |


In [291]:
# we can then get the values as a NumPy array (by adding .values):
df.groupby('Transmission Type').agg({
    'MSRP': 'mean'
}).round().values

array([[30800.],
       [29777.]])

In [293]:
# convert the DataFrame to a dictionary of columns (column name -> {index: value}):
df.to_dict()

{'Make': {0: 'Nissan', 1: 'Hyundai', 2: 'Lotus', 3: 'GMC', 4: 'Nissan'},
 'Model': {0: 'Stanza', 1: 'Sonata', 2: 'Elise', 3: 'Acadia', 4: 'Frontier'},
 'Year': {0: 1991, 1: 2017, 2: 2010, 3: 2017, 4: 2017},
 'Engine HP': {0: 138.0, 1: nan, 2: 218.0, 3: 194.0, 4: 261.0},
 'Engine Cylinders': {0: 4, 1: 4, 2: 4, 3: 4, 4: 6},
 'Transmission Type': {0: 'MANUAL',
  1: 'AUTOMATIC',
  2: 'MANUAL',
  3: 'AUTOMATIC',
  4: 'MANUAL'},
 'Vehicle_Style': {0: 'sedan',
  1: 'sedan',
  2: 'convertible',
  3: '4dr_suv',
  4: 'pickup'},
 'MSRP': {0: 2000, 1: 27150, 2: 54990, 3: 34450, 4: 32340}}
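
If a list of dictionaries (one per row) is needed instead, `to_dict` accepts an `orient` argument. A minimal sketch on a smaller, made-up two-row frame (`df_small` is an assumption used only to keep the example short):

```python
import pandas as pd

df_small = pd.DataFrame({
    'Make': ['Nissan', 'Hyundai'],
    'MSRP': [2000, 27150],
})

# orient='records' gives one dictionary per row
records = df_small.to_dict(orient='records')
print(records)
# [{'Make': 'Nissan', 'MSRP': 2000}, {'Make': 'Hyundai', 'MSRP': 27150}]
```

The default (`orient='dict'`) groups values by column, as in the output above, while `'records'` groups them by row.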