In [104]:
library(data.table)
library(glmnet)
library(ggplot2)
library(lubridate, quietly=TRUE)
library(zoo, quietly = TRUE)
library(dplyr, quietly = TRUE)
library(GGally, quietly=TRUE)
library(caTools)
library(rpart)
library(rattle)
library(caret)
library(e1071)
library(randomForest)
library(gbm)

### Performance Function

In [105]:
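perf_dt=function(type,actual,forecast){
    # Summary error metrics for a vector of forecasts against the actual values.
    # Note: MPE and MAPE divide by `actual`, so they are undefined if it contains zeros.
    name=type
    n=length(actual)
    error=actual-forecast
    mean_actual=mean(actual)
    sd_actual=sd(actual)
    FBias=sum(error)/sum(actual)     # forecast bias
    MPE=sum(error/actual)/n          # mean percentage error
    MAPE=sum(abs(error/actual))/n    # mean absolute percentage error
    RMSE=sqrt(sum(error^2)/n)        # root mean squared error (divide by n inside the root)
    MAD=sum(abs(error))/n            # mean absolute deviation
    WMAPE=MAD/mean_actual            # weighted MAPE
    l=data.frame(name,n,mean=mean_actual,sd=sd_actual,FBias,MPE,MAPE,RMSE,MAD,WMAPE)
    return(l)
}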
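
As a quick sanity check of these metric definitions, the sketch below computes a few of them by hand on a toy vector. The values in `actual` and `forecast` are made up for illustration, and RMSE is taken in its standard form, the square root of the mean squared error.

```r
# Toy data, hypothetical values chosen so the arithmetic is easy to follow
actual   <- c(10, 20, 30, 40)
forecast <- c(12, 18, 33, 35)

error <- actual - forecast             # -2, 2, -3, 5
n     <- length(actual)

fbias <- sum(error) / sum(actual)      # 2 / 100 = 0.02
mad   <- sum(abs(error)) / n           # 12 / 4 = 3
wmape <- mad / mean(actual)            # 3 / 25 = 0.12
rmse  <- sqrt(sum(error^2) / n)        # sqrt(42 / 4) = sqrt(10.5)
mape  <- sum(abs(error / actual)) / n  # (0.2 + 0.1 + 0.1 + 0.125) / 4

print(c(FBias = fbias, MAD = mad, WMAPE = wmape, RMSE = rmse, MAPE = mape))
```

Because MAPE divides by each actual value, it only makes sense when no actual value is zero.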

# Second Dataset

The "spam" concept is diverse: advertisements for products/web sites, make money fast schemes, chain letters, pornography...

Our collection of spam e-mails came from our postmaster and individuals who had filed spam. Our collection of non-spam e-mails came from filed work and personal e-mails, and hence the word 'george' and the area code '650' are indicators of non-spam. These are useful when constructing a personalized spam filter. One would either have to blind such non-spam indicators or get a very wide collection of non-spam to generate a general purpose spam filter.

For background on spam:

Cranor, Lorrie F., LaMacchia, Brian A. Spam!
Communications of the ACM, 41(8):74-83, 1998.

(a) Hewlett-Packard Internal-only Technical Report. External forthcoming.
(b) Determine whether a given email is spam or not.
(c) ~7% misclassification error. False positives (marking good mail as spam) are very undesirable. If we insist on zero false positives in the training/testing set, 20-25% of the spam passes through the filter.


Attribute Information:

The last column of 'spambase.data' denotes whether the e-mail was considered spam (1) or not (0), i.e. unsolicited commercial e-mail. Most of the attributes indicate whether a particular word or character was frequently occurring in the e-mail. The run-length attributes (55-57) measure the length of sequences of consecutive capital letters. For the statistical measures of each attribute, see the end of this file. Here are the definitions of the attributes:

48 continuous real [0,100] attributes of type word_freq_WORD
= percentage of words in the e-mail that match WORD, i.e. 100 * (number of times the WORD appears in the e-mail) / total number of words in e-mail. A "word" in this case is any string of alphanumeric characters bounded by non-alphanumeric characters or end-of-string.

6 continuous real [0,100] attributes of type char_freq_CHAR
= percentage of characters in the e-mail that match CHAR, i.e. 100 * (number of CHAR occurrences) / total characters in e-mail

1 continuous real [1,...] attribute of type capital_run_length_average
= average length of uninterrupted sequences of capital letters

1 continuous integer [1,...] attribute of type capital_run_length_longest
= length of longest uninterrupted sequence of capital letters

1 continuous integer [1,...] attribute of type capital_run_length_total
= sum of length of uninterrupted sequences of capital letters
= total number of capital letters in the e-mail

1 nominal {0,1} class attribute of type spam
= denotes whether the e-mail was considered spam (1) or not (0), i.e. unsolicited commercial e-mail.
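
To make the attribute definitions concrete, here is a minimal sketch that derives the same kinds of features from a single toy message in base R. The message text is made up, and two choices are assumptions not stated in the dataset description: word matching is case-insensitive, and "total characters" counts every character including spaces.

```r
# Toy message, made up for illustration
email <- "FREE money!!! Call NOW to claim your FREE prize"

# Words: maximal runs of alphanumeric characters
words <- strsplit(email, "[^[:alnum:]]+")[[1]]
words <- words[nzchar(words)]

# word_freq_free: percentage of words equal to "free" (case-insensitive, an assumption)
word_freq_free <- 100 * sum(tolower(words) == "free") / length(words)

# char_freq_!: percentage of characters equal to "!" (spaces counted, an assumption)
chars <- strsplit(email, "")[[1]]
char_freq_excl <- 100 * sum(chars == "!") / length(chars)

# Uninterrupted runs of capital letters: "FREE", "C", "NOW", "FREE"
runs <- regmatches(email, gregexpr("[A-Z]+", email))[[1]]
run_lengths <- nchar(runs)
capital_run_length_average <- mean(run_lengths)  # 3
capital_run_length_longest <- max(run_lengths)   # 4
capital_run_length_total   <- sum(run_lengths)   # 12

print(c(word_freq_free, char_freq_excl,
        capital_run_length_average, capital_run_length_longest, capital_run_length_total))
```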



### Loading Dataset

In [106]:
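# spambase.data has no header row, so read.csv() (header=TRUE by default) takes the
# first record as column names (X0, X0.64, ..., X1); X1 is the spam/non-spam class.
spam=read.csv("spambase.data")
spam=as.data.table(spam[1000:nrow(spam),])  # keep rows 1000 onward
spam=na.omit(spam)
spam$X1=as.factor(spam$X1)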

In [107]:
str(spam)

Classes 'data.table' and 'data.frame':	3601 obs. of  58 variables:
 $ X0     : num  0.45 0.45 0.82 0.09 0 0 0 0.47 0 1.47 ...
 $ X0.64  : num  0.9 0.91 0 0.49 0 0 0 0 0.72 0 ...
 $ X0.64.1: num  0.9 0.91 0.82 0.59 1.31 1.31 0.6 0.94 1.81 0 ...
 $ X0.1   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ X0.32  : num  0.45 0.45 0.41 0.39 0 0 0 0.94 0 0 ...
 $ X0.2   : num  0 0 0 0.19 0 0 0.6 0 0.36 0 ...
 $ X0.3   : num  0 0 0.41 0 0 0 0 0.94 0 0 ...
 $ X0.4   : num  0.45 0.45 0.82 0 0 0 0 0 0.36 0 ...
 $ X0.5   : num  0 0 0.41 0.09 0 0 0.6 0 0.72 0 ...
 $ X0.6   : num  1.8 1.83 1.23 0.39 0 0 0 0 1.08 1.47 ...
 $ X0.7   : num  0 0 1.65 0 0 0 0 0 0.36 0 ...
 $ X0.64.2: num  2.26 2.29 0.41 1.57 1.31 1.31 1.8 0.47 0.72 0 ...
 $ X0.8   : num  0 0 0 0.19 0 0 0 0 0 1.47 ...
 $ X0.9   : num  0.45 0.91 0 0 0 0 0 0 0.36 0 ...
 $ X0.10  : num  0 0 0 0 0 0 0 0 0 0 ...
 $ X0.32.1: num  0.45 0.45 2.47 0 0 0 0.3 0 0.36 7.35 ...
 $ X0.11  : num  0 0 1.65 0.09 0 0 0 0.47 0.36 0 ...
 $ X1.29  : num  0 0 0 0 0 0 0 0 0.36

In [108]:
set.seed(35)
spl=sample.split(spam$X1, SplitRatio = 0.8)
train_spam=subset(spam,spl==TRUE)
test_spam=subset(spam,spl==FALSE)

## Penalized Regression Approaches(PRA)

### L1 Penalty with Mean Square Error measure

The cv.glmnet() function from the glmnet package is used to tune lambda by 10-fold cross-validation. The selected lambda values are then used to fit the penalized (lasso) logistic regression models.

In [None]:
train_mat_spam=data.matrix(train_spam[complete.cases(train_spam),-c("X1"),with=F])

result_vec_spam=as.vector(t(train_spam[complete.cases(train_spam),"X1"]))

cvfit_spam=cv.glmnet(train_mat_spam,result_vec_spam,family="binomial",nfolds = 10,type.measure = "mse")

test_mat_spam=data.matrix(test_spam[complete.cases(test_spam),-c("X1")])

lasso_model_spam_mse_min <- glmnet(train_mat_spam,result_vec_spam,family="binomial", alpha = 1, lambda = cvfit_spam$lambda.min, standardize = FALSE)
lasso_model_spam_mse_1se <- glmnet(train_mat_spam,result_vec_spam,family="binomial", alpha = 1, lambda = cvfit_spam$lambda.1se, standardize = FALSE)
lasso_model_spam_mse_10th <- glmnet(train_mat_spam,result_vec_spam,family="binomial", alpha = 1, lambda = cvfit_spam$lambda[10], standardize = FALSE)

In [None]:
# train_mat_spam=data.matrix(train_spam[complete.cases(train_spam),-c("X1"),with=F])

# result_vec_spam=as.vector(t(train_spam[complete.cases(train_spam),"X1"]))

# cvfit_spam=cv.glmnet(train_mat_spam,result_vec_spam,family="binomial",nfolds = 10)

# test_mat_spam=data.matrix(test_spam[complete.cases(test_spam),-c("X1")])

# lasso_model_spam <- glmnet(train_mat_spam,result_vec_spam, alpha = 1,family="binomial", lambda = cvfit_spam$lambda.min, standardize = FALSE)

In [None]:
plot(cvfit_spam)

### Lambda Values for the Mean Squared Error Measure

In [None]:
cvfit_spam$lambda.min

In [None]:
cvfit_spam$lambda.1se

In [None]:
cvfit_spam$lambda[10]

cv.glmnet() reports two candidate lambda values. lambda.min is the value that gives the minimum cross-validated Mean Squared Error. lambda.1se is the largest lambda whose cross-validated error is still within one standard error of that minimum; it penalizes more strongly, keeps fewer parameters in the model, and so reduces the risk of over-fitting the training set. Lastly, the 10th value of the lambda sequence was picked as an arbitrary third point, so that results can be compared across these 3 lambda values.
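
The one-standard-error rule behind lambda.1se can be reproduced from the fields of a cv.glmnet object. The following is a minimal sketch on simulated data; `x`, `y` and the coefficients used to generate them are hypothetical and unrelated to the spam data.

```r
library(glmnet)

# Simulated binary-response data (hypothetical, for illustration only)
set.seed(1)
n <- 200; p <- 10
x <- matrix(rnorm(n * p), nrow = n, ncol = p)
y <- rbinom(n, 1, plogis(1.5 * x[, 1] - x[, 2]))

cvfit <- cv.glmnet(x, y, family = "binomial", nfolds = 10, type.measure = "mse")

# lambda.min: the lambda with the smallest mean cross-validated error
i_min <- which.min(cvfit$cvm)

# lambda.1se: the largest lambda whose error is within one standard error of that minimum
threshold <- cvfit$cvm[i_min] + cvfit$cvsd[i_min]
lambda_1se_manual <- max(cvfit$lambda[cvfit$cvm <= threshold])

print(c(lambda.min = cvfit$lambda.min,
        lambda.1se = cvfit$lambda.1se,
        manual.1se = lambda_1se_manual))
```

Since lambda.1se is never smaller than lambda.min, the corresponding model is at least as heavily penalized.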

In [None]:
prediction_pra_mse_spam_min <- predict(lasso_model_spam_mse_min, s = cvfit_spam$lambda.min, newx = test_mat_spam,type="class")
prediction_pra_mse_spam_1se <- predict(lasso_model_spam_mse_1se, s = cvfit_spam$lambda.1se, newx = test_mat_spam,type="class")
prediction_pra_mse_spam_10th <-predict(lasso_model_spam_mse_10th, s = cvfit_spam$lambda[10], newx = test_mat_spam,type="class")

### L1 Penalty with Mean Absolute Error measure

In [None]:
train_mat_spam=data.matrix(train_spam[complete.cases(train_spam),-c("X1"),with=F])

result_vec_spam=as.vector(t(train_spam[complete.cases(train_spam),"X1"]))

cvfit_spam_mae=cv.glmnet(train_mat_spam,result_vec_spam,family="binomial",nfolds = 10,type.measure = "mae")

test_mat_spam=data.matrix(test_spam[complete.cases(test_spam),-c("X1")])

lasso_model_spam_mae_min <- glmnet(train_mat_spam,result_vec_spam,family="binomial", alpha = 1, lambda = cvfit_spam_mae$lambda.min, standardize = FALSE)
lasso_model_spam_mae_1se <- glmnet(train_mat_spam,result_vec_spam, family="binomial",alpha = 1, lambda = cvfit_spam_mae$lambda.1se, standardize = FALSE)
lasso_model_spam_mae_10th <- glmnet(train_mat_spam,result_vec_spam,family="binomial", alpha = 1, lambda = cvfit_spam_mae$lambda[10], standardize = FALSE)

In [None]:
plot(cvfit_spam_mae)

### Lambda Values for the Mean Absolute Error Measure

In [None]:
cvfit_spam_mae$lambda.min

In [None]:
cvfit_spam_mae$lambda.1se

In [None]:
cvfit_spam_mae$lambda[10]

cv.glmnet() again reports two candidate lambda values, this time based on the Mean Absolute Error. lambda.min is the value that gives the minimum cross-validated Mean Absolute Error. lambda.1se is the largest lambda whose cross-validated error is still within one standard error of that minimum; it keeps fewer parameters in the model and so reduces the risk of over-fitting the training set. Lastly, the 10th value of the lambda sequence was picked as an arbitrary third point, so that results can be compared across these 3 lambda values.

In [None]:
# Use the lambdas from the MAE cross-validation (cvfit_spam_mae) and predict class labels for all three models
prediction_pra_mae_spam_min <- predict(lasso_model_spam_mae_min, s = cvfit_spam_mae$lambda.min, newx = test_mat_spam,type="class")
prediction_pra_mae_spam_1se <- predict(lasso_model_spam_mae_1se, s = cvfit_spam_mae$lambda.1se, newx = test_mat_spam,type="class")
prediction_pra_mae_spam_10th <- predict(lasso_model_spam_mae_10th, s = cvfit_spam_mae$lambda[10], newx = test_mat_spam,type="class")
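
For a binomial glmnet model, the `type` argument of predict() decides what comes back: "response" returns fitted probabilities, while "class" returns the label obtained by thresholding those probabilities at 0.5. A minimal sketch on simulated data (`x`, `y` and the fixed lambda of 0.02 are hypothetical choices for illustration):

```r
library(glmnet)

# Simulated binary-response data (hypothetical, for illustration only)
set.seed(1)
x <- matrix(rnorm(200 * 5), nrow = 200, ncol = 5)
y <- rbinom(200, 1, plogis(2 * x[, 1] - x[, 2]))

fit <- glmnet(x, y, family = "binomial", alpha = 1, lambda = 0.02)

prob <- predict(fit, newx = x, type = "response")  # probabilities of class "1"
cls  <- predict(fit, newx = x, type = "class")     # character labels "0" / "1"

# Thresholding the probabilities at 0.5 reproduces the class labels
manual <- ifelse(prob > 0.5, "1", "0")
print(table(manual, cls))
```

Mixing the two types across models makes their error metrics incomparable, since one is a probability and the other a hard label.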

### Performance Measure for Lasso Regression

In [None]:
# as.numeric() on the factor test_spam$X1 returns the level codes 1/2, while glmnet returns the labels 0/1.
# Adding 1 to the predictions puts both on the same scale and keeps the actual values away from zero.
perf_dt("Spam Data Set for Lasso Function with min lambda and mse objective", as.numeric(test_spam$X1), as.numeric(prediction_pra_mse_spam_min)+1)
perf_dt("Spam Data Set for Lasso Function with 1se lambda and mse objective", as.numeric(test_spam$X1), as.numeric(prediction_pra_mse_spam_1se)+1)
perf_dt("Spam Data Set for Lasso Function with 10th lambda and mse objective", as.numeric(test_spam$X1), as.numeric(prediction_pra_mse_spam_10th)+1)

perf_dt("Spam Data Set for Lasso Function with min lambda and mae objective", as.numeric(test_spam$X1), as.numeric(prediction_pra_mae_spam_min)+1)
perf_dt("Spam Data Set for Lasso Function with 1se lambda and mae objective", as.numeric(test_spam$X1), as.numeric(prediction_pra_mae_spam_1se)+1)
perf_dt("Spam Data Set for Lasso Function with 10th lambda and mae objective", as.numeric(test_spam$X1), as.numeric(prediction_pra_mae_spam_10th)+1)
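
One pitfall when passing class labels to a numeric error function is that as.numeric() on a factor returns the internal level codes, not the labels. A small base-R sketch with made-up labels:

```r
# Made-up labels for illustration
labels_factor <- factor(c("0", "1", "1", "0"))
labels_char   <- c("0", "1", "1", "0")

codes_from_factor <- as.numeric(labels_factor)               # level codes: 1 2 2 1
values_from_char  <- as.numeric(labels_char)                 # label values: 0 1 1 0
values_recovered  <- as.numeric(as.character(labels_factor)) # label values: 0 1 1 0

print(rbind(codes_from_factor, values_from_char, values_recovered))
```

If the actual values come from a factor and the predictions from a character vector, the two are offset by one unless both are converted the same way.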

In [None]:
confusionMatrix(data = as.factor(prediction_pra_mae_spam_min), reference = as.factor(test_spam$X1), mode = "prec_recall")

## Decision Tree(DT)

In [None]:
set.seed(35)

In [None]:
numFolds=trainControl(method="cv",number = 10)
cpGrid=expand.grid(.cp=(0:10)*0.02)
#minbucket_grid=expand.grid(.cp=(5:10))
for(i in 5:10){
    tr=train(X1~.,
          data=train_spam, 
          method="rpart",
          trControl=numFolds,
          tuneGrid= cpGrid,
            # minbucket=minbucket_grid
           control= rpart.control(minsplit = i)
            )
    trellis.par.set(caretTheme())
    print(plot(tr))    
    print(tr)
}

In [None]:
numFolds=trainControl(method="cv",number = 10)
cpGrid=expand.grid(.cp=(0:10)*0.01)
tr=train(X1~.,
      data=train_spam, 
      method="rpart",
      trControl=numFolds,
      tuneGrid= cpGrid,
        # minbucket=minbucket_grid
       control= rpart.control(minsplit = 10)
        )
trellis.par.set(caretTheme())
print(plot(tr))    
print(tr)
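
The complexity parameter `cp` tuned in the grid controls how much a split must improve the fit to be kept, so larger values give smaller trees. A minimal sketch of that effect, using rpart directly on the built-in `iris` data rather than the spam data (the two cp values are arbitrary choices for illustration):

```r
library(rpart)

# Same minsplit, two different complexity parameters
tree_full   <- rpart(Species ~ ., data = iris, method = "class",
                     control = rpart.control(cp = 0, minsplit = 10))
tree_pruned <- rpart(Species ~ ., data = iris, method = "class",
                     control = rpart.control(cp = 0.1, minsplit = 10))

# Each row of $frame is one node of the fitted tree
nodes_full   <- nrow(tree_full$frame)
nodes_pruned <- nrow(tree_pruned$frame)
print(c(cp_0 = nodes_full, cp_0.1 = nodes_pruned))
```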

In [None]:
reg_tree_spam=tr$finalModel
fancyRpartPlot(reg_tree_spam)
reg_tree_spam$variable.importance

In [None]:
predicted_pisa=predict(reg_tree_spam,newdata=test_spam,type="class")

In [None]:
table(test_spam$X1,predicted_pisa)

### Performance Measure

In [None]:
confusionMatrix(data = as.factor(predicted_pisa), reference = as.factor(test_spam$X1), mode = "prec_recall")

In [None]:
perf_dt("Spam Data Set for Decision Tree",as.numeric(test_spam$X1),as.numeric(predicted_pisa))

## Random Forest(RF)

In [None]:
library(ranger)

In [None]:
fitControl=trainControl(method = "repeatedcv",
                           number = 10) 

In [None]:
rf_grid=expand.grid(mtry=c(4,8,10,15),
                   splitrule = c("extratrees"),
                   min.node.size= c(5))
rf_grid  

In [None]:
rf_fit=train(X1 ~ ., data = train_spam, 
                 method = "ranger", 
                 trControl = fitControl, num.trees=500,
                 tuneGrid = rf_grid) 

In [None]:
rf_fit
plot(rf_fit)

In [None]:
PredictRandomForest_spam=predict(rf_fit,newdata=test_spam)

### Performance Measure

In [None]:
confusionMatrix(data = as.factor(PredictRandomForest_spam), reference = as.factor(test_spam$X1), mode = "prec_recall")

### Performance Measure

In [None]:
perf_dt("Spam Data Set for Random Forest",as.numeric(test_spam$X1),as.numeric(PredictRandomForest_spam))

## Stochastic Gradient Boosting(SGB)

In [None]:
set.seed(35)

In [None]:
gbmGrid=expand.grid(interaction.depth = c(1, 3, 5), 
                        n.trees = (1:5)*50, 
                        shrinkage = c(0.1, 0.3, 0.5),
                        n.minobsinnode = 20)
                                                                

gbm_fit=train(X1 ~ ., data = train_spam, 
                 method = "gbm", 
                 trControl = fitControl,  
                 tuneGrid = gbmGrid,
                 verbose=F) #verbose is an argument from gbm, prints to screen

plot(gbm_fit)
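
The number of candidate models follows directly from the grid: 3 interaction depths x 5 tree counts x 3 shrinkage values x 1 node size. A quick base-R check that rebuilds the same grid:

```r
# Same grid as above, rebuilt so the snippet runs on its own
gbm_grid_check <- expand.grid(interaction.depth = c(1, 3, 5),
                              n.trees = (1:5) * 50,
                              shrinkage = c(0.1, 0.3, 0.5),
                              n.minobsinnode = 20)

n_candidates <- nrow(gbm_grid_check)  # 3 * 5 * 3 * 1 = 45
print(n_candidates)
print(range(gbm_grid_check$n.trees))  # 50 to 250 trees
```

Each of these candidates is evaluated on every resample defined by the train control object, so the size of the grid largely determines the tuning time.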

In [None]:
predicted_spam_sgb=predict(gbm_fit,test_spam)

### Performance Measure

In [None]:
confusionMatrix(data = as.factor(predicted_spam_sgb), reference = as.factor(test_spam$X1), mode = "prec_recall")

In [None]:
perf_dt("Spam Data Set for Stochastic Gradient Boosting", as.numeric(test_spam$X1), as.numeric(predicted_spam_sgb)) # actual first, forecast second

# General Results for the 4 Methods on the Spam Data Set

In [None]:
perf_dt("Spam Data Set for Lasso Function with min lambda and mae objective", as.numeric(test_spam$X1), as.numeric(prediction_pra_mae_spam_min)+1) # +1 aligns glmnet's 0/1 labels with the factor codes 1/2
perf_dt("Spam Data Set for Decision Tree with CV",as.numeric(test_spam$X1),as.numeric(predicted_pisa))
perf_dt("Spam Data Set for Random Forest", as.numeric(test_spam$X1), as.numeric(PredictRandomForest_spam))
perf_dt("Spam Data Set for Stochastic Gradient Boosting", as.numeric(test_spam$X1), as.numeric(predicted_spam_sgb))

## Performance

In [None]:
# Compare the cross-validated accuracy of the two caret models
results = resamples(list("Random Forest"=rf_fit,#"Linear Regression with Penalty"=lasso_model_spam_mae_min,
                         "Stochastic Gradient Boosting"=gbm_fit))
summary(results, metric="Accuracy")
bwplot(results, metric="Accuracy")
densityplot(results, metric="Accuracy")

In [100]:
confusionMatrix(data = as.factor(prediction_pra_mae_spam_min), reference = as.factor(test_spam$X1), mode = "prec_recall")

Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 538  32
         1  20 131
                                          
               Accuracy : 0.9279          
                 95% CI : (0.9065, 0.9457)
    No Information Rate : 0.7739          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.7884          
                                          
 Mcnemar's Test P-Value : 0.1272          
                                          
              Precision : 0.9439          
                 Recall : 0.9642          
                     F1 : 0.9539          
             Prevalence : 0.7739          
         Detection Rate : 0.7462          
   Detection Prevalence : 0.7906          
      Balanced Accuracy : 0.8839          
                                          
       'Positive' Class : 0               
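
The statistics in this output can be reproduced from the four cell counts. Because the 'Positive' class is 0, the reported precision and recall describe the non-spam class; the sketch below also derives the corresponding spam-class values from the same counts.

```r
# Cell counts copied from the confusion matrix above (rows = prediction, columns = reference)
pred0_ref0 <- 538; pred0_ref1 <- 32
pred1_ref0 <- 20;  pred1_ref1 <- 131
total <- pred0_ref0 + pred0_ref1 + pred1_ref0 + pred1_ref1

accuracy  <- (pred0_ref0 + pred1_ref1) / total

# Positive class = 0 (non-spam), as in the printed output
precision <- pred0_ref0 / (pred0_ref0 + pred0_ref1)
recall    <- pred0_ref0 / (pred0_ref0 + pred1_ref0)
f1        <- 2 * precision * recall / (precision + recall)

# The same quantities with spam (class 1) treated as the positive class
precision_spam <- pred1_ref1 / (pred1_ref1 + pred1_ref0)
recall_spam    <- pred1_ref1 / (pred1_ref1 + pred0_ref1)

balanced_accuracy <- (recall + recall_spam) / 2

print(round(c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1,
              precision_spam = precision_spam, recall_spam = recall_spam,
              balanced_accuracy = balanced_accuracy), 4))
```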
                                          

In [101]:
confusionMatrix(data = as.factor(predicted_pisa), reference = as.factor(test_spam$X1), mode = "prec_recall")

Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 533  30
         1  25 133
                                         
               Accuracy : 0.9237         
                 95% CI : (0.9019, 0.942)
    No Information Rate : 0.7739         
    P-Value [Acc > NIR] : <2e-16         
                                         
                  Kappa : 0.7796         
                                         
 Mcnemar's Test P-Value : 0.5896         
                                         
              Precision : 0.9467         
                 Recall : 0.9552         
                     F1 : 0.9509         
             Prevalence : 0.7739         
         Detection Rate : 0.7393         
   Detection Prevalence : 0.7809         
      Balanced Accuracy : 0.8856         
                                         
       'Positive' Class : 0              
                                         

In [102]:
confusionMatrix(data = as.factor(PredictRandomForest_spam), reference = as.factor(test_spam$X1), mode = "prec_recall")

Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 548  15
         1  10 148
                                          
               Accuracy : 0.9653          
                 95% CI : (0.9492, 0.9774)
    No Information Rate : 0.7739          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.8998          
                                          
 Mcnemar's Test P-Value : 0.4237          
                                          
              Precision : 0.9734          
                 Recall : 0.9821          
                     F1 : 0.9777          
             Prevalence : 0.7739          
         Detection Rate : 0.7601          
   Detection Prevalence : 0.7809          
      Balanced Accuracy : 0.9450          
                                          
       'Positive' Class : 0               
                                          

In [103]:
confusionMatrix(data = as.factor(predicted_spam_sgb), reference = as.factor(test_spam$X1), mode = "prec_recall")

Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 545  16
         1  13 147
                                          
               Accuracy : 0.9598          
                 95% CI : (0.9427, 0.9729)
    No Information Rate : 0.7739          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.8843          
                                          
 Mcnemar's Test P-Value : 0.7103          
                                          
              Precision : 0.9715          
                 Recall : 0.9767          
                     F1 : 0.9741          
             Prevalence : 0.7739          
         Detection Rate : 0.7559          
   Detection Prevalence : 0.7781          
      Balanced Accuracy : 0.9393          
                                          
       'Positive' Class : 0               
                                          