# Week 11 - Classification and Clustering

## Recap/Precap
Sample statistics: used to compute parameter estimates of models  
Probability: helps formalise the model problem to deal with noisy data  
Expectation: defines average values a model takes on.  
Distributions: define the random processes in a model.  
Inference: helps to estimate parameters of a model.  
CLT: implies the variance of our parameter estimates will shrink with more observations.  
Confidence Intervals: give bounds on the value of the true parameters of the model.  
Hypothesis testing: helps tell us if model components are useful for prediction, and to compare models.  
Regression: gives us a model that predicts a continuous variable from continuous and/or categorical variables  

But where was the machine learning?  
In multiple linear regression we use linear algebra to solve a set of simultaneous equations for the parameters in a single step. You can think of this as instantaneous learning.  
In general, an arbitrary method may lead to a set of simultaneous equations with no closed-form solution (usually when nonlinear functions are involved), so iterative methods are required to obtain the optimal parameter estimates.  
Machine learning is about iterative learning as new data observations arrive. Logistic regression is an example of such an iterative method.  
You can also think of building up probability distributions observation by observation as an iterative method.  
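
To make the contrast concrete, here is a minimal sketch in R. It assumes the built-in `mtcars` data set and an arbitrary choice of variables (neither comes from these notes): the linear regression coefficients are obtained in one step from the normal equations, while `glm()` has to iterate to fit a logistic regression.

```r
# Closed form: solve the normal equations (X'X) b = X'y in a single step.
# mtcars and the chosen columns are illustrative assumptions only.
X <- cbind(Intercept = 1, wt = mtcars$wt, hp = mtcars$hp)
y <- mtcars$mpg
beta.closed <- solve(t(X) %*% X, t(X) %*% y)

# lm() solves the same least-squares problem, so the estimates should agree
beta.lm <- coef(lm(mpg ~ wt + hp, data = mtcars))
cbind(beta.closed, beta.lm)

# Iterative: logistic regression has no closed-form solution, so glm()
# repeats an update step until the estimates stop changing
fit <- glm(am ~ wt, data = mtcars, family = binomial)
fit$iter       # number of iterations that were needed
fit$converged  # whether the iterations converged
```

The `iter` component of the fitted object is the same count that `summary()` reports as "Number of Fisher Scoring iterations".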

# Get GPT Notes

## Hard and Soft Classifiers
A classifier attempts to predict the value of a categorical variable Y, based on predictors X1, ..., Xp.  
A hard classifier gives a specific predicted value of Y: it predicts the class of each individual.  
A soft classifier gives a score for each class based on the predictors.  
- A common example is the probability that the individual is in the class given the values of the predictors.

We will look at soft classifiers today.  
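
A soft classifier can always be turned into a hard one by picking the class with the highest score. The sketch below uses made-up scores for two hypothetical individuals, purely to illustrate the difference.

```r
# Hypothetical soft scores: one row per individual, one column per class
scores <- rbind(
  a = c(setosa = 0.90, versicolor = 0.08, virginica = 0.02),
  b = c(setosa = 0.05, versicolor = 0.55, virginica = 0.40)
)

# Hard classification: keep only the class with the largest score in each row
hard <- colnames(scores)[apply(scores, 1, which.max)]
hard
```

The hard prediction throws away how confident the classifier was: individual `b` is labelled versicolor even though virginica is almost as likely.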

### Bayes Classifier
The Bayes classifier uses the joint probability of the target and the predictors to obtain the probability of each class given the predictors.  
In practice, we don't have the joint probabilities, so we estimate them from the data.  

Naïve Bayes solves the problem of having too many probabilities to estimate by making a very strong assumption.  
Let X1, ..., Xp be p categorical predictors (features); they do not have to be binary.  
Use the shorthand notation p(Y=y, X=x) ≡ p(y, x).  
Naïve Bayes assumes the predictors are conditionally independent, given the value of the target.  
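
Written out in the shorthand above, the conditional independence assumption gives the following factorisation. This is the standard textbook statement of Naïve Bayes rather than something copied from the slides.

```latex
p(x_1, \dots, x_p \mid y) = \prod_{j=1}^{p} p(x_j \mid y)
\quad\Longrightarrow\quad
p(y \mid x) = \frac{p(y) \prod_{j=1}^{p} p(x_j \mid y)}
                   {\sum_{y'} p(y') \prod_{j=1}^{p} p(x_j \mid y')}
```

For p binary predictors this means estimating p probabilities per class, instead of the 2^p - 1 needed for the full joint table of the predictors within each class.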


### Naive Bayes Example
Estimate whether someone will play tennis, given the weather, temperature, humidity and wind.

Total = 14 days  
Yes = 9  
No = 5  
P(Yes) = 9/14  
P(No) = 5/14  

P(Sunny|Yes) = 2/9  
P(Cool|Yes) = 2/9  
P(High|Yes) = 3/9  
P(Strong|Yes) = 2/9  

P(Sunny|No) = 3/5  
P(Cool|No) = 1/5  
P(High|No) = 4/5  
P(Strong|No) = 3/5  

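X = (Sunny, Cool, High, Strong)  
P(Yes|X) = P(Yes)\*P(X|Yes) / P(X)  
P(X) = Sum over Y of P(X|Y)\*P(Y) = P(X|Yes)\*P(Yes) + P(X|No)\*P(No)  

Under the Naïve Bayes assumption, P(X|Yes) = P(Sunny|Yes)\*P(Cool|Yes)\*P(High|Yes)\*P(Strong|Yes), and similarly for No:  

P(X|Yes)\*P(Yes) = 2/9 \* 2/9 \* 3/9 \* 2/9 \* 9/14 ≈ 0.00235  
P(X|No)\*P(No) = 3/5 \* 1/5 \* 4/5 \* 3/5 \* 5/14 ≈ 0.02057  

P(Yes|X) = P(X|Yes)\*P(Yes) / (P(X|Yes)\*P(Yes) + P(X|No)\*P(No)) ≈ 0.1026  
P(No|X) = P(X|No)\*P(No) / (P(X|Yes)\*P(Yes) + P(X|No)\*P(No)) ≈ 0.8974  

The two posteriors sum to 1, so the prediction for this day is No.  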
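
The arithmetic can be checked with a few lines of R. This sketch just plugs in the priors and conditional probabilities listed above, reading the 3/5 at the top of the No list as P(Sunny|No); the variable names are illustrative.

```r
# Priors and conditional probabilities as listed in the notes
prior <- c(Yes = 9/14, No = 5/14)
lik <- list(
  Yes = c(Sunny = 2/9, Cool = 2/9, High = 3/9, Strong = 2/9),
  No  = c(Sunny = 3/5, Cool = 1/5, High = 4/5, Strong = 3/5)
)

# Naive Bayes score per class: prior times the product of the conditionals
score <- prior * sapply(lik, prod)

# Dividing by the sum of the scores is the same as dividing by P(X)
posterior <- score / sum(score)
round(posterior, 4)
```

Because both posteriors share the same denominator P(X), comparing the unnormalised scores is enough to pick the class; normalising only matters when the probabilities themselves are needed.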



In [22]:
data("iris")
summary(iris)
str(iris)

iris$SLC <- iris$Sepal.Length < 6
iris$SWC <- iris$Sepal.Width < 3
iris$PLC <- iris$Petal.Length < 5
iris$PWC <- iris$Petal.Width < 1.6

SLC.tab <- table(iris$SLC, iris$Species)
SLC.tab
SWC.tab <- table(iris$SWC, iris$Species)
SWC.tab
PLC.tab <- table(iris$PLC, iris$Species)
PLC.tab
PWC.tab <- table(iris$PWC, iris$Species)
PWC.tab


  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50  
                
                
                

'data.frame':	150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...


       
        setosa versicolor virginica
  FALSE      0         24        43
  TRUE      50         26         7

       
        setosa versicolor virginica
  FALSE     48         16        29
  TRUE       2         34        21

       
        setosa versicolor virginica
  FALSE      0          2        44
  TRUE      50         48         6

       
        setosa versicolor virginica
  FALSE      0          5        47
  TRUE      50         45         3

In [26]:
str(iris)
iris$setosa <- (iris$Species == 'setosa')
sum(iris$setosa)

p.set=sum(iris$setosa)/150
p.set

SLC.tab
p.SLC.set<-SLC.tab[2,1]/sum(SLC.tab[ ,1])
p.SLC.set

SWC.tab
p.SWC.set<-SWC.tab[2,1]/sum(SWC.tab[ ,1])
p.SWC.set

PLC.tab
p.PLC.set<-PLC.tab[2,1]/sum(PLC.tab[ ,1])
p.PLC.set

PWC.tab
p.PWC.set<-PWC.tab[2,1]/sum(PWC.tab[ ,1])
p.PWC.set

p.denominator <- sum(iris$SLC*iris$SWC*iris$PLC*iris$PWC)/150
p.denominator

p.out.set <- (p.set*p.SLC.set*p.SWC.set*p.PLC.set*p.PWC.set)/p.denominator
p.out.set

'data.frame':	150 obs. of  10 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ SLC         : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
 $ SWC         : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ PLC         : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
 $ PWC         : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
 $ setosa      : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...


       
        setosa versicolor virginica
  FALSE      0         24        43
  TRUE      50         26         7

       
        setosa versicolor virginica
  FALSE     48         16        29
  TRUE       2         34        21

       
        setosa versicolor virginica
  FALSE      0          2        44
  TRUE      50         48         6

       
        setosa versicolor virginica
  FALSE      0          5        47
  TRUE      50         45         3
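
The cell above divides the Naïve Bayes numerator by `p.denominator`, the observed fraction of flowers with all four indicators TRUE. That mixes a naïve estimate with an empirical joint one, so the resulting posteriors need not sum to 1 across species. A possible alternative, sketched here under the same cut-offs, is to compute the score for every species and normalise. The names `feats`, `lik`, `score` and `posterior` are illustrative and not part of the original notebook.

```r
data("iris")

# Same binary cut-offs as in the notes
feats <- data.frame(
  SLC = iris$Sepal.Length < 6,
  SWC = iris$Sepal.Width  < 3,
  PLC = iris$Petal.Length < 5,
  PWC = iris$Petal.Width  < 1.6
)

# Prior P(species) and P(feature = TRUE | species) for each feature
# (lik has one row per species and one column per feature)
prior <- table(iris$Species) / nrow(iris)
lik <- sapply(feats, function(f) tapply(f, iris$Species, mean))

# Naive Bayes score for x = (TRUE, TRUE, TRUE, TRUE), then normalise
score <- as.numeric(prior) * apply(lik, 1, prod)
names(score) <- levels(iris$Species)
posterior <- score / sum(score)
round(posterior, 4)
```

With this normalisation, the denominator is the sum of the per-class scores, which is the Naïve Bayes estimate of P(X).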

In [31]:
pimaTrain <- read.csv("data/pima_train.csv")
str(pimaTrain)

pimaTrain <- read.csv("data/pima_train.csv", stringsAsFactors = T)
str(pimaTrain)

full <- glm(DIABETES ~ ., pimaTrain, family=binomial)
summary(full)

pimaTest <- read.csv("data/pima_test.csv")
str(pimaTest)

pimaTest <- read.csv("data/pima_test.csv", stringsAsFactors = T)
str(pimaTest)

my.pred.stats(predict(full,pimaTest, type="response"), pimaTest$DIABETES)

'data.frame':	668 obs. of  9 variables:
 $ PREG    : int  6 1 8 1 5 3 10 2 8 4 ...
 $ PLAS    : int  148 85 183 89 116 78 115 197 125 110 ...
 $ BP      : num  72 66 64 66 74 50 35.3 70 96 92 ...
 $ SKIN    : num  35 29 23.3 23 25.6 32 35.3 45 54 37.6 ...
 $ INS     : int  148 85 183 94 116 88 115 543 125 110 ...
 $ BMI     : num  33.6 26.6 23.3 28.1 25.6 31 35.3 30.5 54 37.6 ...
 $ PED     : num  0.627 0.351 0.672 0.167 0.201 0.248 0.134 0.158 0.232 0.191 ...
 $ AGE     : int  50 31 32 21 30 26 29 53 54 30 ...
 $ DIABETES: chr  "Y" "N" "Y" "N" ...
'data.frame':	668 obs. of  9 variables:
 $ PREG    : int  6 1 8 1 5 3 10 2 8 4 ...
 $ PLAS    : int  148 85 183 89 116 78 115 197 125 110 ...
 $ BP      : num  72 66 64 66 74 50 35.3 70 96 92 ...
 $ SKIN    : num  35 29 23.3 23 25.6 32 35.3 45 54 37.6 ...
 $ INS     : int  148 85 183 94 116 88 115 543 125 110 ...
 $ BMI     : num  33.6 26.6 23.3 28.1 25.6 31 35.3 30.5 54 37.6 ...
 $ PED     : num  0.627 0.351 0.672 0.167 0.201 0.248 0.134 0.


Call:
glm(formula = DIABETES ~ ., family = binomial, data = pimaTrain)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -8.5271236  0.7988517 -10.674  < 2e-16 ***
PREG         0.1255938  0.0346542   3.624 0.000290 ***
PLAS         0.0353683  0.0043183   8.190  2.6e-16 ***
BP          -0.0170075  0.0071017  -2.395 0.016627 *  
SKIN         0.0136405  0.0153301   0.890 0.373582    
INS          0.0003532  0.0013082   0.270 0.787181    
BMI          0.0805829  0.0214811   3.751 0.000176 ***
PED          0.8410120  0.3293096   2.554 0.010653 *  
AGE          0.0189665  0.0104655   1.812 0.069943 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 868.88  on 667  degrees of freedom
Residual deviance: 618.08  on 659  degrees of freedom
AIC: 636.08

Number of Fisher Scoring iterations: 5


'data.frame':	100 obs. of  9 variables:
 $ PREG    : int  2 1 3 8 13 4 7 4 2 1 ...
 $ PLAS    : int  137 118 126 99 145 103 105 146 100 107 ...
 $ BP      : num  40 84 88 84 82 60 24 85 66 68 ...
 $ SKIN    : num  35 47 41 35.4 19 33 24 27 20 19 ...
 $ INS     : int  168 230 235 99 110 192 105 100 90 107 ...
 $ BMI     : num  43.1 45.8 39.3 35.4 22.2 24 24 28.9 32.9 26.5 ...
 $ PED     : num  2.288 0.551 0.704 0.388 0.245 ...
 $ AGE     : int  33 31 27 50 57 33 24 27 28 24 ...
 $ DIABETES: chr  "Y" "Y" "N" "N" ...
'data.frame':	100 obs. of  9 variables:
 $ PREG    : int  2 1 3 8 13 4 7 4 2 1 ...
 $ PLAS    : int  137 118 126 99 145 103 105 146 100 107 ...
 $ BP      : num  40 84 88 84 82 60 24 85 66 68 ...
 $ SKIN    : num  35 47 41 35.4 19 33 24 27 20 19 ...
 $ INS     : int  168 230 235 99 110 192 105 100 90 107 ...
 $ BMI     : num  43.1 45.8 39.3 35.4 22.2 24 24 28.9 32.9 26.5 ...
 $ PED     : num  2.288 0.551 0.704 0.388 0.245 ...
 $ AGE     : int  33 31 27 50 57 33 24 27 28 24 ..

ERROR: Error in my.pred.stats(predict(full, pimaTest, type = "response"), pimaTest$DIABETES): could not find function "my.pred.stats"
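
The call fails because `my.pred.stats` is never defined in this notebook; presumably it is a helper supplied elsewhere that has to be sourced first. Its real definition is not shown here, so the following is only a hypothetical stand-in, named `pred.stats.sketch` to avoid confusion. It assumes the first argument is a vector of predicted probabilities for the positive class, thresholds them at 0.5, and reports a confusion matrix and accuracy. The toy inputs are made up.

```r
# Hypothetical stand-in for the missing helper (NOT the original function)
pred.stats.sketch <- function(prob, actual, threshold = 0.5, positive = "Y") {
  actual <- factor(actual)
  negative <- setdiff(levels(actual), positive)[1]
  # Turn the soft scores into hard predictions
  predicted <- factor(ifelse(prob > threshold, positive, negative),
                      levels = levels(actual))
  confusion <- table(Predicted = predicted, Actual = actual)
  accuracy <- sum(diag(confusion)) / sum(confusion)
  list(confusion = confusion, accuracy = accuracy)
}

# Toy example with made-up probabilities and labels
prob <- c(0.9, 0.2, 0.6, 0.4)
actual <- c("Y", "N", "N", "Y")
res <- pred.stats.sketch(prob, actual)
res$confusion
res$accuracy
```

With the Pima data, a call such as `pred.stats.sketch(predict(full, pimaTest, type = "response"), pimaTest$DIABETES)` should work in the same way, since `glm()` with a two-level factor response models the probability of the second level (here "Y").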
