In [1]:
#Modeling
'''
A Model is a specification of a mathematical (or probabilistic) relationship that exists between different variables. 
'''


In [2]:
#What is Machine Learning?
'''
Creating and using models that are learned from data.

In other contexts this might be called predictive modeling or data mining.

Our goal will be to use existing data to develop models that we can use to predict various outcomes for new data, such as:
    -Whether an email message is spam or not
    -Whether a credit card transaction is fraudulent
    -Which advertisement a shopper is most likely to click on
    -Which football team is going to win the Super Bowl

Supervised models - in which there is a set of data labeled with the correct answers to learn from
Unsupervised models - in which there are no such labels
Semisupervised models - in which only some of the data are labeled
Online models - in which the model needs to continuously adjust to newly arriving data
Reinforcement models - in which, after making a series of predictions, the model gets a signal indicating how well it did
'''


In [3]:
#Overfitting and Underfitting

'''
A common danger in ML is overfitting - producing a model that performs well on the data you train it on but generalizes poorly to any new data.
The other side of this is underfitting - producing a model that doesn't perform well even on the training data. Typically when this happens you decide your model isn't good enough and keep looking for a better one.
'''
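
#A minimal sketch of the two failure modes, using made-up toy data (y = 2x + 1).
#Memorizer and ConstantModel are hypothetical classes invented for this illustration:
#the first fits the training data perfectly but fails on unseen inputs (overfitting),
#the second does poorly even on the data it was trained on (underfitting).

```python
from typing import Dict, List

# Hypothetical toy data: even xs for training, odd xs for testing
train_xs = [x for x in range(0, 20, 2)]
test_xs = [x for x in range(1, 20, 2)]
train_ys = [2 * x + 1 for x in train_xs]
test_ys = [2 * x + 1 for x in test_xs]

class Memorizer:
    """Overfits: remembers every training pair and guesses 0 for anything unseen"""
    def __init__(self) -> None:
        self.seen: Dict[int, int] = {}

    def train(self, xs: List[int], ys: List[int]) -> None:
        self.seen = dict(zip(xs, ys))

    def predict(self, x: int) -> float:
        return self.seen.get(x, 0)

class ConstantModel:
    """Underfits: ignores x and always predicts the mean of the training ys"""
    def __init__(self) -> None:
        self.guess = 0.0

    def train(self, xs: List[int], ys: List[int]) -> None:
        self.guess = sum(ys) / len(ys)

    def predict(self, x: int) -> float:
        return self.guess

def mean_abs_error(model, xs: List[int], ys: List[int]) -> float:
    """Average absolute difference between predictions and true values"""
    return sum(abs(model.predict(x) - y) for x, y in zip(xs, ys)) / len(xs)

memorizer = Memorizer()
memorizer.train(train_xs, train_ys)
constant = ConstantModel()
constant.train(train_xs, train_ys)

# Perfect on the training data, poor on new data
assert mean_abs_error(memorizer, train_xs, train_ys) == 0
assert mean_abs_error(memorizer, test_xs, test_ys) > 0

# Poor even on the training data
assert mean_abs_error(constant, train_xs, train_ys) > 0
```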


In [7]:
#Use different data to train the model and to test it. The simplest way to do this is to split the dataset so that, for example, 2/3 of it is used to train the model, after which we measure the model's performance on the remaining 1/3

import random
from typing import TypeVar, List, Tuple
X = TypeVar('X') #generic type to represent a data point

def split_data(data: List[X], prob: float) -> Tuple[List[X], List[X]]:
    """Split data into fractions [prob, 1 - prob]"""
    data = data[:]                  #Make a shallow copy
    random.shuffle(data)            #b/c shuffle modifies the list
    cut = int(len(data) * prob)     #Use prob to find a cutoff
    return data[:cut], data[cut:]   #and split the shuffled list there

data = [n for n in range(1000)]
train, test = split_data(data, 0.75)

#The proportions should be correct
assert len(train) == 750
assert len(test) == 250

#And the original data should be preserved (in some order)
assert sorted(train + test) == data

In [14]:
#Often, we'll have paired input variables and output variables. In that case, we need to make sure to put corresponding values together in either the training data or the test data

Y = TypeVar('Y') #generic type to represent output variables

def train_test_split(xs: List[X], ys: List[Y], test_pct: float) -> Tuple[List[X], List[X], List[Y], List[Y]]:
    # Generate the indices and split them
    idxs = [i for i in range(len(xs))]
    train_idxs, test_idxs = split_data(idxs, 1 - test_pct)

    return ([xs[i] for i in train_idxs],    #x_train
            [xs[i] for i in test_idxs],     #x_test
            [ys[i] for i in train_idxs],    #y_train
            [ys[i] for i in test_idxs])     #y_test

xs = [x for x in range(1000)] # xs are 0 ... 999
ys = [2 * x for x in xs]      #each y_i is twice x_i
x_train, x_test, y_train, y_test = train_test_split(xs, ys, 0.25)

#Check that the proportions are correct
assert len(x_train) == len(y_train) == 750
assert len(x_test) == len(y_test) == 250

#Check that the corresponding data points are paired correctly
assert all(y == 2 * x for x, y in zip(x_train, y_train))
assert all(y == 2 * x for x, y in zip(x_test, y_test))

'''
After which you can do something like:

model = SomeKindOfModel()
x_train, x_test, y_train, y_test = train_test_split(xs, ys, 0.33)
model.train(x_train, y_train)
performance = model.test(x_test, y_test)
'''
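
#A runnable version of the workflow above, as a sketch only. SlopeModel is a hypothetical
#stand-in for SomeKindOfModel (it fits y = slope * x by least squares), and its train/test
#methods are assumptions made for this example. The split is restated compactly so the
#block runs on its own.

```python
import random
from typing import List, Tuple

def train_test_split(xs: List[float], ys: List[float], test_pct: float
                     ) -> Tuple[List[float], List[float], List[float], List[float]]:
    """Compact restatement of the paired split: shuffle indices, then cut"""
    idxs = list(range(len(xs)))
    random.shuffle(idxs)
    cut = int(len(idxs) * (1 - test_pct))
    train_idxs, test_idxs = idxs[:cut], idxs[cut:]
    return ([xs[i] for i in train_idxs], [xs[i] for i in test_idxs],
            [ys[i] for i in train_idxs], [ys[i] for i in test_idxs])

class SlopeModel:
    """Hypothetical model: learns a single slope so that y is approximately slope * x"""
    def __init__(self) -> None:
        self.slope = 0.0

    def train(self, xs: List[float], ys: List[float]) -> None:
        # Least-squares slope for a line through the origin
        self.slope = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

    def test(self, xs: List[float], ys: List[float]) -> float:
        """Mean absolute error on the held-out data (lower is better)"""
        return sum(abs(self.slope * x - y) for x, y in zip(xs, ys)) / len(xs)

random.seed(0)                      # fixed seed so the split is repeatable
xs = [x for x in range(1000)]
ys = [2 * x for x in xs]

model = SlopeModel()
x_train, x_test, y_train, y_test = train_test_split(xs, ys, 0.33)
model.train(x_train, y_train)
performance = model.test(x_test, y_test)

# The data are exactly y = 2x, so the held-out error should be essentially zero
assert performance < 1e-9
```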


In [15]:
#Correctness

'''
Given a set of labeled data and such a predictive model, every data point lies in one of four categories
    True Positive
        "This message is spam, and we correctly predicted spam"
    False Positive (Type 1 error)
        "This message is not spam, but we predicted spam"
    False Negative (Type 2 error)
        "This message is spam, but we predicted not spam"
    True Negative
        "This message is not spam, and we correctly predicted not spam"
'''
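
#A small sketch of how the four categories above could be counted from labels and predictions.
#confusion_counts and the sample lists are hypothetical, made up for illustration;
#True stands for "spam".

```python
from collections import Counter
from typing import List, Tuple

def confusion_counts(actual: List[bool], predicted: List[bool]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn) for paired actual labels and predictions"""
    counts = Counter(zip(actual, predicted))   # keys are (actual, predicted) pairs
    tp = counts[(True, True)]      # spam, predicted spam
    fp = counts[(False, True)]     # not spam, predicted spam
    fn = counts[(True, False)]     # spam, predicted not spam
    tn = counts[(False, False)]    # not spam, predicted not spam
    return tp, fp, fn, tn

actual    = [True, True,  False, False, True, False]
predicted = [True, False, True,  False, True, False]

tp, fp, fn, tn = confusion_counts(actual, predicted)

#Every data point lands in exactly one category
assert tp + fp + fn + tn == len(actual)
```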


In [20]:
#Accuracy - the fraction of correct predictions

#tp = True Positive, fp = False Positive, fn = False Negative, tn = True Negative
def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    correct = tp + tn
    total = tp + fp + fn + tn
    return correct / total

assert accuracy(70, 4930, 13930, 981070) == 0.98114

#Precision measures how accurate our positive predictions were

def precision(tp: int, fp: int, fn: int, tn: int) -> float:
    return tp / (tp + fp)

assert precision(70, 4930, 13930, 981070) == 0.014

#Recall measures what fraction of the positives our model identified

def recall(tp: int, fp: int, fn: int, tn: int) -> float:
    return tp / (tp + fn)

assert recall(70, 4930, 13930, 981070) == 0.005

#Sometimes precision and recall are combined into the F1 score

def f1_score(tp: int, fp: int, fn: int, tn: int) -> float:
    p = precision(tp, fp, fn, tn)
    r = recall(tp, fp, fn, tn)

    return 2 * p * r / (p + r)

#This is the harmonic mean of precision and recall
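
#A worked check of the harmonic-mean claim, using the same counts as the assertions above.
#f1_from_counts is a hypothetical helper; it uses the algebraically equivalent form
#2 * tp / (2 * tp + fp + fn), which follows from substituting the definitions of
#precision and recall.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 written directly in terms of the counts"""
    return 2 * tp / (2 * tp + fp + fn)

tp, fp, fn = 70, 4930, 13930
p = tp / (tp + fp)                      # precision
r = tp / (tp + fn)                      # recall

harmonic_mean = 2 / (1 / p + 1 / r)     # harmonic mean of precision and recall
arithmetic_mean = (p + r) / 2

#Both forms agree (up to floating-point rounding)
assert abs(harmonic_mean - f1_from_counts(tp, fp, fn)) < 1e-12

#The harmonic mean is pulled toward the smaller of the two values
assert min(p, r) < harmonic_mean < arithmetic_mean
```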

In [1]:
#The Bias-Variance Tradeoff
'''
    Another way of thinking about the overfitting problem is as a tradeoff between bias and variance. Both are measures of what would happen if you were to retrain your model many times on different sets of training data (from the same larger population).
    If your model has high bias (which means it performs poorly even on your training data), one thing to try is adding more features.
    If your model has high variance, you can similarly remove features. But another solution is to obtain more data (if you can).
'''
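
#A rough simulation of "retraining many times on different training sets", as a sketch only.
#Everything here is an assumption made for illustration: the population (y = x^2 plus
#Gaussian noise), the two toy models, and the evaluation point x0. The mean model is too
#simple (high bias, low variance); the nearest-neighbor model is too flexible (low bias,
#high variance).

```python
import random
from statistics import mean, pvariance
from typing import Callable, List, Tuple

Point = Tuple[float, float]

def true_f(x: float) -> float:
    """Assumed underlying relationship"""
    return x * x

def sample_training_set(n: int) -> List[Point]:
    """Draw n noisy (x, y) points from the same population"""
    xs = [random.random() for _ in range(n)]
    return [(x, true_f(x) + random.gauss(0, 0.3)) for x in xs]

def mean_model(data: List[Point], x0: float) -> float:
    """Too simple: ignores x0 and predicts the average training y"""
    return mean(y for _, y in data)

def nearest_neighbor_model(data: List[Point], x0: float) -> float:
    """Too flexible: copies the y of the single closest training x"""
    return min(data, key=lambda pair: abs(pair[0] - x0))[1]

def bias_and_variance(model: Callable[[List[Point], float], float],
                      x0: float, trials: int = 500, n: int = 20) -> Tuple[float, float]:
    """Retrain on many fresh training sets and summarize the predictions at x0"""
    preds = [model(sample_training_set(n), x0) for _ in range(trials)]
    bias = mean(preds) - true_f(x0)     # how far off the average prediction is
    variance = pvariance(preds)         # how much predictions move between training sets
    return bias, variance

random.seed(0)
x0 = 0.9
mean_bias, mean_var = bias_and_variance(mean_model, x0)
nn_bias, nn_var = bias_and_variance(nearest_neighbor_model, x0)

assert abs(mean_bias) > abs(nn_bias)    # the simple model is more biased
assert nn_var > mean_var                # the flexible model varies more
```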


In [None]:
#Feature Extraction and Selection
'''
    Features are whatever inputs we provide to our model

'''
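
#A tiny illustration of turning raw data into features, tied to the spam example used earlier.
#extract_features and the three features it returns are hypothetical choices made for this
#sketch, not a recommended feature set.

```python
from typing import Dict, Union

def extract_features(subject: str) -> Dict[str, Union[int, bool]]:
    """Turn a raw email subject line into a few simple model inputs"""
    words = subject.lower().split()
    return {
        "num_words": len(words),                    # a number
        "contains_free": "free" in words,           # a yes/no value
        "num_exclamations": subject.count("!"),     # a number
    }

features = extract_features("Claim your FREE prize now!!!")

assert features["contains_free"]
```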