#### Machine Learning Intro
* making predictions and classifications
* start with training data and fit a model to it, e.g. a regression line
* quality is judged by how well the model predicts new data, not by how well it fits the training data
* decision trees

<br>

* bias/variance tradeoff
    * a model can fit the training data very well (low bias) but still make poor predictions on new data (high variance)
    * i.e. over-fitting

* compare methods and pick the one with the best predictions on the testing data
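
The over-fitting idea above can be sketched with NumPy alone. The data, the noise level, and the two polynomial degrees are assumptions made for illustration; they are not from the original notes.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Hypothetical ground truth: y = 2x + 1 plus Gaussian noise.
    x = rng.uniform(-1, 1, n)
    y = 2 * x + 1 + rng.normal(0, 0.3, n)
    return x, y

x_train, y_train = make_data(15)
x_test, y_test = make_data(200)

def mse(coeffs, x, y):
    # Mean squared error of a polynomial model on (x, y).
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

results = {}
for degree in (1, 9):
    coeffs = np.polyfit(x_train, y_train, degree)  # fit on training data only
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(f"degree {degree}: train MSE {results[degree][0]:.3f}, test MSE {results[degree][1]:.3f}")
```

The degree-9 fit always matches the training data at least as well as the line. The test MSE is the number to compare when picking a method.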

#### ML vs AI
* machine learning is a subset of AI
* the importance of each feature is expressed as a weight
* ML finds the weights by comparing its predictions with the actual data you gave it, and re-adjusts them as it gets trained
* a weight determines how much each feature factors into the prediction
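
A minimal sketch of that compare-and-re-adjust loop, assuming a linear model trained with gradient descent on the mean squared error. The data, true weights, and learning rate are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: two features, and targets built from known weights.
true_weights = np.array([3.0, -2.0])
features = rng.normal(size=(100, 2))
targets = features @ true_weights

weights = np.zeros(2)  # start with no feature mattering
learning_rate = 0.1
for _ in range(200):
    predictions = features @ weights
    errors = predictions - targets  # compare predictions with the actual data
    gradient = 2 * features.T @ errors / len(targets)  # slope of the mean squared error
    weights -= learning_rate * gradient  # re-adjust the weights

print(weights)
```

Because the targets here are noise-free, the learned weights should end up very close to `true_weights`.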

#### ML vs Statistics
[ml vs stats](https://www.educba.com/machine-learning-vs-statistics/)
* ML is built on statistics
* ML is a set of steps/rules fed by the user where the machine understands and trains by itself, predicts future events, or classifies existing material 
    * for hypothesis uses 
    * more for data analytics
    * keywords: linear regressions, random forests, support vector machine, neural networks
* Statistics: math for finding patterns in data; identifies relationships between data points
    * for correlation between data points, univariate and multivariate analysis
    * keywords: covariance, univariate, multivariate, estimators, p-values, root mean squared error (RMSE)

#### Calculus
* derivatives, chain rule
* hard problems can be reframed as finding an area under some graph
* integral of x^2: a function that gives the area under the curve of x^2
* indefinite integration returns a function (the antiderivative); definite integration returns a number
* differentiation gives the derivative
* the derivative of the area-under-the-graph function returns the original function
* integrals and derivatives are inverses of each other
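
The inverse relationship can be checked numerically for x^2. This is a sketch that uses `scipy.integrate.quad` for the definite integral and a central difference for the derivative; the evaluation points 3 and 2 are arbitrary choices.

```python
from scipy.integrate import quad

def f(x):
    return x ** 2

def area(x):
    # Antiderivative of x^2 (indefinite integral with the constant set to 0).
    return x ** 3 / 3

# Definite integral: the area under x^2 between 0 and 3 is a number.
numeric_area, _ = quad(f, 0, 3)

# Differentiating the area function (central difference) recovers f.
h = 1e-5
slope_of_area = (area(2 + h) - area(2 - h)) / (2 * h)

print(numeric_area, area(3) - area(0))
print(slope_of_area, f(2))
```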

#### Calculus in Data Science
* used in least squares regression, probability distributions (basis)
* measuring rates or quantities that change over time (change)
* maximum and minimum of functions when optimizing (min/max)
* two kinds: 1. differential (rates of change) 2. integral (quantity at a specific time, given the rate of change)
* finding the slope of a curve at a point: find the derivative, which follows set rules
* if f(x) = x^2, then f'(x) = 2x
* use these procedures to optimize decisions 

<br>

* [more info](https://www.mathsisfun.com/calculus/derivatives-introduction.html)
* price optimization
* find values that maximize and minimize outcomes
* e.g. finding maximum revenue given sales/price and discount per sale, describes an inverted U relationship
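
A small sketch of the inverted-U revenue curve, assuming a made-up linear demand function. The function `units_sold` and its numbers are hypothetical and only illustrate the shape of the problem.

```python
from scipy.optimize import minimize_scalar

def units_sold(price):
    # Hypothetical linear demand: fewer units sell as the price rises.
    return 100 - 2 * price

def negative_revenue(price):
    # minimize_scalar minimizes, so negate revenue to maximize it.
    return -(price * units_sold(price))

res = minimize_scalar(negative_revenue, bounds=(0, 50), method="bounded")
best_price, max_revenue = res.x, -res.fun
print(best_price, max_revenue)
```

Revenue is zero at a price of 0 and at a price of 50, and peaks in between, which is the inverted U.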

<br>

* minimums and maximums
* absolute vs. relative (local) extrema: a relative max/min only has to beat the points near it
* min and max are used in almost all machine learning algos, via "gradient descent"
* absolute max at x = c if f(c) >= f(x) for all x in the domain of the function
* absolute min at x = c if f(c) <= f(x) for all x in the domain
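
A minimal gradient descent sketch showing why the absolute/relative distinction matters: the starting point decides which minimum is found. The function, learning rate, and step count are assumptions chosen for illustration.

```python
def f(x):
    # Hypothetical function with one relative and one absolute minimum.
    return x ** 4 - x ** 2 + 0.2 * x

def f_prime(x):
    return 4 * x ** 3 - 2 * x + 0.2

def gradient_descent(start, learning_rate=0.05, steps=500):
    x = start
    for _ in range(steps):
        x -= learning_rate * f_prime(x)  # step downhill along the slope
    return x

left = gradient_descent(-1.0)
right = gradient_descent(1.0)
print(left, f(left))
print(right, f(right))
```

Both runs stop where the slope is zero, but only the run that ends at negative x reaches the lower of the two minima.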

#### Walk through clustering and optimization
* from realpython
* scipy ecosystem: numpy, scipy, matplotlib, ipython, pandas
* scipy subpackages require explicit import

In [4]:
import scipy

In [3]:
help(scipy)

Help on package scipy:

NAME
    scipy

DESCRIPTION
    SciPy: A scientific computing package for Python
    
    Documentation is available in the docstrings and
    online at https://docs.scipy.org.
    
    Contents
    --------
    SciPy imports all the functions from the NumPy namespace, and in
    addition provides:
    
    Subpackages
    -----------
    Using any of these subpackages requires an explicit import. For example,
    ``import scipy.cluster``.
    
    ::
    
     cluster                      --- Vector Quantization / Kmeans
     fft                          --- Discrete Fourier transforms
     fftpack                      --- Legacy discrete Fourier transforms
     integrate                    --- Integration routines
     interpolate                  --- Interpolation Tools
     io                           --- Data input and output
     linalg                       --- Linear algebra routines
     linalg.blas                  --- Wrappers to BLAS library
     li

In [None]:
# using scipy.cluster.vq: kmeans clustering algo with vector quantization

In [9]:
from pathlib import Path
import numpy as np
from scipy.cluster.vq import whiten, kmeans, vq

In [15]:
data = Path("smsspamcollection/SMSSpamCollection").read_text()
data = data.strip()  # removes leading and trailing whitespace
data = data.split("\n")  # split the string into a list of lines


In [16]:
# use numpy to count digits in the data - could be done with collections.Counter - but cluster.vq needs numpy array

In [19]:
# creates an uninitialized array with one row per message (5574) and two columns
digit_counts = np.empty((len(data), 2), dtype=int)

In [None]:
# use the number of digits in each message to decide whether it is spam or not
# the array is created before the loop so no new memory has to be allocated as it fills up

In [20]:
for i, line in enumerate(data):  # enumerate yields the index and the line
    case, message = line.split("\t")  # split the label (ham/spam) from the message text
    num_digits = sum(c.isdigit() for c in message)  # count the digit characters in the message
    digit_counts[i, 0] = 0 if case == "ham" else 1  # 0 for ham, 1 for spam
    digit_counts[i, 1] = num_digits  # store the digit count for this message


In [21]:
digit_counts[:10,:]

array([[ 0,  0],
       [ 0,  0],
       [ 1, 25],
       [ 0,  0],
       [ 0,  0],
       [ 1,  4],
       [ 0,  0],
       [ 0,  1],
       [ 1, 19],
       [ 1, 13]])

In [None]:
# use clustering algo with number of messages with certain number of digits
# create an array where first column = number of digits, 2nd column number of messages that have that number of digits (value counts?)

In [23]:
# np.unique() takes an array, and returns another array with unique elements from the argument
# return_counts returns number of times that element is in the array
unique_counts = np.unique(digit_counts[:,1], return_counts=True)

In [25]:
# two arrays with number of digits, and number of cases with those digits
unique_counts

(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
        17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
        34, 35, 36, 37, 40, 41, 47]),
 array([4110,  486,  160,   78,   42,   39,   16,   14,   28,   17,   16,
          34,   30,   31,   37,   29,   35,   33,   41,   47,   18,   31,
          28,   36,   34,   16,   16,   13,   19,    9,    2,    6,    3,
           4,    3,    4,    1,    1,    4,    2,    1]))

In [26]:
# transform unique_counts for clustering
# vstack combines the two arrays into one 2xN array, which is then transposed to an Nx2 array
unique_counts = np.transpose(np.vstack(unique_counts))
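
As a side note, `np.column_stack` gives the same Nx2 layout in one call. The sketch below uses the first three value/count pairs from the output above as sample input.

```python
import numpy as np

# A small tuple shaped like the output of np.unique(..., return_counts=True).
values_and_counts = (np.array([0, 1, 2]), np.array([4110, 486, 160]))

stacked = np.transpose(np.vstack(values_and_counts))
alternative = np.column_stack(values_and_counts)

print(stacked)
print(np.array_equal(stacked, alternative))
```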

In [28]:
# number of digits in a message, number of messages in rows
unique_counts[:10,:]

array([[   0, 4110],
       [   1,  486],
       [   2,  160],
       [   3,   78],
       [   4,   42],
       [   5,   39],
       [   6,   16],
       [   7,   14],
       [   8,   28],
       [   9,   17]])

In [29]:
# now ready for k-means algo
# whiten rescales each feature (column) by its standard deviation to give it unit variance, so no single feature dominates the distance calculation
whitened_counts = whiten(unique_counts)

# kmeans takes the whitened data and the number of clusters - ham, spam, unknown
# codebook is a 3x2 array with the centroid of each group
# _ is the mean Euclidean distance between observations and centroids - not needed here
codebook, _ = kmeans(whitened_counts, 3)
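
To see what `whiten` does, the sketch below compares it with dividing each column by its standard deviation by hand. The sample rows are the first four rows of `unique_counts` shown earlier, converted to floats.

```python
import numpy as np
from scipy.cluster.vq import whiten

# First rows of unique_counts from above, as floats.
sample = np.array([[0, 4110], [1, 486], [2, 160], [3, 78]], dtype=float)

whitened = whiten(sample)
by_hand = sample / sample.std(axis=0)  # divide each column by its standard deviation

print(whitened.std(axis=0))
print(np.allclose(whitened, by_hand))
```

After whitening, both columns have a standard deviation of 1, so the large message counts no longer swamp the digit counts.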

In [31]:
codebook

array([[2.52050073, 0.01840656],
       [0.85234324, 0.09724666],
       [0.        , 6.49364346]])

In [33]:
# determining which cluster each observation belongs to using vq 
# assigns a code from the codebook (a cluster) to each observation: 0, 1, or 2
# _ is the Euclidean distance between each observation and its nearest centroid
codes, _ = vq(whitened_counts, codebook)

In [36]:
codes

array([2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
      dtype=int32)

In [34]:
# find the code that kmeans assigned to each cluster
ham_code = codes[0]  # the first row of unique_counts is 0 digits, which should be ham
spam_code = codes[-1]  # the last row has the most digits, which should be spam
unknown_code = list(set(range(3)) ^ set((ham_code, spam_code)))[0]  # symmetric difference leaves the one code in {0, 1, 2} that is neither ham nor spam
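
On the symmetric difference: because the ham and spam codes are both inside `{0, 1, 2}`, `^` gives the same answer as a plain set difference. A quick check with the codes from the run shown above:

```python
all_codes = set(range(3))
assigned = {2, 0}  # ham_code and spam_code from the run shown above

by_symmetric_difference = all_codes ^ assigned
by_difference = all_codes - assigned

print(by_symmetric_difference, by_difference)
```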

In [35]:
print("definitely ham:", unique_counts[codes == ham_code][-1]) #0 digits definitely ham
print("definitely spam:", unique_counts[codes == spam_code][-1]) #21 to 47 (max num of digits) digits definitely spam 
print("unknown:", unique_counts[codes == unknown_code][-1]) #unknown 1-20 digits

definitely ham: [   0 4110]
definitely spam: [47  1]
unknown: [20 18]


In [37]:
# check accuracy of predictions 
# create masks for digit counts to grab ham/spam status based on the kmeans results
digits = digit_counts[:, 1]
predicted_hams = digits == 0
predicted_spams = digits > 20
predicted_unknowns = np.logical_and(digits > 0, digits <= 20)

In [38]:
spam_cluster = digit_counts[predicted_spams]
ham_cluster = digit_counts[predicted_hams]
unk_cluster = digit_counts[predicted_unknowns]

In [39]:
print("hams:", np.unique(ham_cluster[:, 0], return_counts=True))
print("spams:", np.unique(spam_cluster[:, 0], return_counts=True))
print("unknowns:", np.unique(unk_cluster[:, 0], return_counts=True))

hams: (array([0, 1]), array([4071,   39]))
spams: (array([0, 1]), array([  1, 232]))
unknowns: (array([0, 1]), array([755, 476]))


In [None]:
# 4110 were identified as definitely ham: 4071 were actually ham and 39 were spam, per the initial labels
# 233 were identified as definitely spam: 1 was ham and 232 were spam
# 1231 fell into unknown (755 ham, 476 spam) - not that great
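
The counts above can be turned into rates. This sketch hard-codes the numbers from the `np.unique` output rather than re-running the clustering.

```python
# Counts copied from the np.unique output above: (actual ham, actual spam).
ham_cluster_counts = (4071, 39)
spam_cluster_counts = (1, 232)
unknown_cluster_counts = (755, 476)

total = sum(ham_cluster_counts) + sum(spam_cluster_counts) + sum(unknown_cluster_counts)

ham_precision = ham_cluster_counts[0] / sum(ham_cluster_counts)  # share of "definitely ham" that is ham
spam_precision = spam_cluster_counts[1] / sum(spam_cluster_counts)  # share of "definitely spam" that is spam
unknown_share = sum(unknown_cluster_counts) / total  # share of messages left undecided

print(f"{ham_precision:.3f} {spam_precision:.3f} {unknown_share:.3f}")
```

The two "definite" clusters are right about 99% of the time, but roughly a fifth of all messages land in the unknown cluster.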

#### Using the Optimize Module in Scipy
* used to optimize the input params of a function; each function below performs a different kind of optimization
* minimize_scalar() and minimize() - minimize a function of one variable (minimize_scalar) or multiple variables (minimize)
* curve_fit() - fits a function to a set of data
* root_scalar() and root() - find zeros of a function for one or many variables, respectively
* linprog() - minimize linear objective function that has linear inequality and equality constraints
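
The walkthrough below only covers `minimize_scalar()` and `minimize()`, so here is a short sketch of `curve_fit()` and `root_scalar()`. The line, the parabola, and the data are made up for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit, root_scalar

def line(x, slope, intercept):
    return slope * x + intercept

# Hypothetical noise-free data generated from y = 2x + 1.
x_data = np.linspace(0, 5, 20)
y_data = line(x_data, 2.0, 1.0)
params, _ = curve_fit(line, x_data, y_data)  # fitted slope and intercept

def parabola(x):
    return x ** 2 - 4

# The bracket must contain a sign change; here it surrounds the root at x = 2.
root = root_scalar(parabola, bracket=[0, 3]).root

print(params, root)
```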

#### Minimize a function with one variable
* a scalar function takes one number as input and gives one number as output

In [40]:
from scipy.optimize import minimize_scalar

def objective_function(x):
    return 3 * x ** 4 - 2 * x + 1

In [41]:
# finding the minimum of a quartic function
# the objective can be any function as long as it returns a single number
# minimize_scalar has one required argument, the objective function
res = minimize_scalar(objective_function)

In [42]:
# reports whether the optimization was successful; fun is the value of the function at the optimal x
res

     fun: 0.17451818777634331
 message: '\nOptimization terminated successfully;\nThe returned value satisfies the termination criteria\n(using xtol = 1.48e-08 )'
    nfev: 16
     nit: 12
 success: True
       x: 0.5503212087491959

In [43]:
# when there is no minimum the call can fail with an OverflowError: the optimizer tries a number that is too big for the computer to calculate
def objective_func2(x):
    return x**3

In [None]:
res2=minimize_scalar(objective_func2)
res2

In [45]:
# for functions that have several minima, not guaranteed to find the global minimum
# built-in methods: brent (default), golden ('golden-section' search), bounded (when the minimum is in a known range)
# when method is brent or golden, the optional argument 'bracket' (a sequence of 2 or 3 elements) provides the initial guess for the bounds of the region with the minimum
# the minimum found may not be within these bounds

def objective_func3(x):
    return x ** 4 - x ** 2

In [47]:
# the default (brent) finds one minimum but not the symmetric minimum at negative x
res3 = minimize_scalar(objective_func3)
res3

     fun: -0.24999999999999994
 message: '\nOptimization terminated successfully;\nThe returned value satisfies the termination criteria\n(using xtol = 1.48e-08 )'
    nfev: 15
     nit: 11
 success: True
       x: 0.7071067853059209

In [50]:
# can specify the bracket argument, but it is only an initial guess: here the result is still the positive minimum
res4 = minimize_scalar(objective_func3, bracket=(-1, 0))
res4

     fun: -0.24999999999999997
 message: '\nOptimization terminated successfully;\nThe returned value satisfies the termination criteria\n(using xtol = 1.48e-08 )'
    nfev: 17
     nit: 13
 success: True
       x: 0.7071067809244586

In [52]:
# use bounded to find the minimum at negative x 
res5 = minimize_scalar(objective_func3, method='bounded', bounds=(-1, 0))
res5

     fun: -0.24999999999998732
 message: 'Solution found.'
    nfev: 10
     nit: 10
  status: 0
 success: True
       x: -0.707106701474177
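
Building on the bounded search above, one way to get both minima is to search each side of zero separately and compare with the analytic answer. This is a sketch; the interval (0, 1) for the right-hand side is an assumption.

```python
from scipy.optimize import minimize_scalar

def objective_func3(x):
    return x ** 4 - x ** 2

# Search each side of zero separately with the bounded method.
left = minimize_scalar(objective_func3, method="bounded", bounds=(-1, 0))
right = minimize_scalar(objective_func3, method="bounded", bounds=(0, 1))

# Setting the derivative 4x^3 - 2x to zero gives x = +/- 1/sqrt(2).
expected = 2 ** -0.5
print(left.x, right.x, expected)
```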

#### minimizing a function with many variables
* for functions of several input variables (the objective still returns a single number)
* has more complex optimization algorithms
* can have constraints:
    * LinearConstraint: solution is constrained by taking the **inner product** of the solution x values with a user-input array and comparing the result to a lower and upper bound.
    * NonlinearConstraint:  solution is constrained by applying a **user-supplied function** to the solution x values and comparing the return value with a lower and upper bound.
    * Bounds: solution x values are constrained to lie between a lower and upper bound
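
The share-selling example below uses `LinearConstraint`. To illustrate the other two, here is a minimal sketch with `NonlinearConstraint` and `Bounds`; the objective, the unit-circle constraint, and the starting point are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize, NonlinearConstraint, Bounds

def objective(x):
    # Hypothetical objective: squared distance from the point (1, 2).
    return (x[0] - 1) ** 2 + (x[1] - 2) ** 2

def squared_radius(x):
    return x[0] ** 2 + x[1] ** 2

# Keep the solution inside the unit circle and in the box [0, 2] x [0, 2].
inside_circle = NonlinearConstraint(squared_radius, lb=-np.inf, ub=1)
box = Bounds(lb=[0, 0], ub=[2, 2])

res = minimize(objective, x0=[0.5, 0.5], constraints=inside_circle, bounds=box)
print(res.x)
```

The unconstrained minimum (1, 2) lies outside the circle, so the solver should settle on the point of the circle closest to it, roughly (0.447, 0.894).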

* constrained optimization problem - for maximizing total income
    * minimize() finds the min value, so multiply the objective function by -1 and find the x-values that give the largest negative number
    * constraint - the total number of shares purchased cannot exceed the number of shares you own
    * bounds - each buyer has an upper bound set by cash available and a lower bound of zero; lower than zero would mean you pay the buyers

In [53]:
import numpy as np
from scipy.optimize import minimize, LinearConstraint

n_buyers = 10
n_shares = 15

In [54]:
# set up an array of the price each buyer pays and an array of how much money each buyer has
np.random.seed(10)  # ensures the same output every time
prices = np.random.random(n_buyers)  # one price in [0, 1) per buyer
money_available = np.random.randint(1, 4, n_buyers)  # integers from 1 to 3 (the upper bound 4 is exclusive), how much money each buyer has

In [55]:
# computing number of shares each buyer can buy
n_shares_per_buyer = money_available / prices #ratio of money_avail with prices to get the max number of shares each buyer can purchase
print(prices, money_available, n_shares_per_buyer, sep="\n")

[0.77132064 0.02075195 0.63364823 0.74880388 0.49850701 0.22479665
 0.19806286 0.76053071 0.16911084 0.08833981]
[1 1 1 3 1 3 3 2 1 1]
[ 1.29647768 48.18824404  1.57816269  4.00638948  2.00598984 13.34539487
 15.14670609  2.62974258  5.91328161 11.3199242 ]


In [None]:
# now create constraints and bounds for the solver
# the constraint involves more than one solution variable: the total number of purchased shares can't be greater than the total number of shares
# take the dot (inner) product of a vector of ones with the solution values, and constrain that to be equal to n_shares
# i.e. use LinearConstraint

In [56]:
constraint = LinearConstraint(np.ones(n_buyers), lb=n_shares, ub=n_shares)  # the dot product with ones is the sum of purchased shares
# lb = ub = n_shares makes this an equality constraint: the sum must equal both lb and ub
# if lb were different from ub, it would be an inequality constraint

In [57]:
# minimize() expects bounds to be a sequence of tuples of lower and upper bounds
# want to make the negative of your income as large a negative as possible 
bounds = [(0, n) for n in n_shares_per_buyer]

In [59]:
# income generated from each sale is price that the buyer pays * number of shares they are buying, i.e. the dot or inner product
# turned negative to get the number as small as possible
def objective_function(x, prices):
    return -x.dot(prices)

In [60]:
res = minimize(
    objective_function,  # objective function that is being optimized
    x0=10 * np.random.random(n_buyers),  # initial guess for the solution: random values between 0 and 10, one per buyer
    args=(prices,),  # tuple of extra arguments passed to the objective function, here prices
    constraints=constraint,  # a constraint or sequence of constraints
    bounds=bounds,  # sequence of (lower, upper) bounds
)


In [61]:
res

     fun: -8.78302015708768
     jac: array([-0.7713207 , -0.02075195, -0.63364828, -0.74880397, -0.49850702,
       -0.22479665, -0.1980629 , -0.76053071, -0.16911089, -0.08833981])
 message: 'Optimization terminated successfully'
    nfev: 187
     nit: 17
    njev: 17
  status: 0
 success: True
       x: array([1.29647768e+00, 3.73026111e-14, 1.57816269e+00, 4.00638948e+00,
       2.00598984e+00, 3.48323773e+00, 6.66133815e-16, 2.62974258e+00,
       2.79628716e-15, 5.05103555e-15])

In [None]:
# status = 0 means successful optimization
# fun - value of the objective function at the optimized solution values = -8.78, i.e. an income of 8.78 from this sale
# res.x holds the values of x that optimize the function
# sell about 1.3 shares to the first buyer, 0 to the second, 1.6 to the third, and 4 to the fourth
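
Since the objective, the constraint, and the bounds are all linear, the same problem can also be handed to `linprog()` as a cross-check. This is a sketch, not part of the original walkthrough; the arrays are copied from the printed output above, so they are rounded to 8 decimals.

```python
import numpy as np
from scipy.optimize import linprog

# Values copied from the printed output above.
prices = np.array([0.77132064, 0.02075195, 0.63364823, 0.74880388, 0.49850701,
                   0.22479665, 0.19806286, 0.76053071, 0.16911084, 0.08833981])
money_available = np.array([1, 1, 1, 3, 1, 3, 3, 2, 1, 1])
n_shares = 15

res_lp = linprog(
    c=-prices,  # negate to maximize income
    A_eq=np.ones((1, len(prices))),  # sum of shares sold ...
    b_eq=[n_shares],  # ... must equal n_shares
    bounds=[(0, n) for n in money_available / prices],  # per-buyer limits
)
print(-res_lp.fun)
```

The income should agree with the `minimize()` result of about 8.78.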

In [63]:
# check that the constraints and bounds you set are satisfied:
# the total should equal n_shares, and each buyer's leftover money should be non-negative


print("The total number of shares is:", sum(res.x))
print("Leftover money for each buyer:", money_available - res.x * prices)

The total number of shares is: 14.999999999999996
Leftover money for each buyer: [3.66373598e-15 1.00000000e+00 1.77635684e-15 3.55271368e-15
 2.10942375e-15 2.21697984e+00 3.00000000e+00 4.88498131e-15
 1.00000000e+00 1.00000000e+00]


In [None]:
# errors
# status 9 - iteration limit exceeded
# happens when there's no solution, e.g. when 1000 shares are being sold to the same buyers: not enough money and buyers in the market
# always check res.success and res.status before using the result
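
Before calling the solver, a cheap feasibility pre-check can catch the no-solution case described above. This is a sketch: the helper `is_feasible` is hypothetical, and the upper bounds are copied from the printed `n_shares_per_buyer`.

```python
import numpy as np

# Upper bounds per buyer, copied from the printed n_shares_per_buyer above.
n_shares_per_buyer = np.array([1.29647768, 48.18824404, 1.57816269, 4.00638948,
                               2.00598984, 13.34539487, 15.14670609, 2.62974258,
                               5.91328161, 11.3199242])

def is_feasible(n_shares, upper_bounds):
    # The equality constraint can only hold if the buyers can absorb every share.
    return n_shares <= upper_bounds.sum()

print(is_feasible(15, n_shares_per_buyer))
print(is_feasible(1000, n_shares_per_buyer))
```

The buyers can afford about 105 shares in total, so 15 shares is feasible and 1000 is not.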