# Module 5 Practice

In this module we will practice extracting and comparing features between probable-bots and probable-not-bots.

### Extract features

In machine learning, feature extraction refers to the process of representing properties of the data as numeric variables. Machine learning isn't restricted by theory, so any number you can derive from the data at hand will work. That said, theoretically meaningful features tend to be better predictors than spurious relationships that arise from noise, so it's usually not a good idea to line up completely unrelated numbers with each observation. A good rule of thumb for this data is to limit feature extraction to what you can derive from the provided data set.

If you haven't guessed it yet, the task for this week's module is to classify accounts as bots or not-bots using features extracted from Twitter user data. To extract the features, go ahead and load the tidyverse.

In [1]:
## load tidyverse
suppressPackageStartupMessages(library(tidyverse))

For this practice notebook, I extracted several features, split the data into training and test sets (making sure that the training set has an equal number of bots and not-bots), modeled the data using the {gbm} package, and then compared predicted to actual bot/not-bot classifications. The steps are listed below. In this week's exercise you'll only be asked to (a) come up with at least 3 new features, (b) play around with the parameters in the gbm model, and (c) report the percent correct for classifications of the bots and non-bots in the test data set.

## Data

I collected data on about 6,697 Twitter accounts. I believe 4,243 of those accounts are genuine (non-automated) accounts and 2,454 are bots. I then randomly split the data set into a **training** set and a **test** set. In the training data, I made sure there were equal numbers of bots and non-bots. Since we are not using the test data to build the model, it is not important to have equal group sizes in the test data set.
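The split described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code that produced `train.rds` and `test.rds`: the `accounts` data frame, its `user_id` column, and the per-class size `n_per_class` are all made up for the example.

```r
## toy stand-in for the real accounts data (made-up values)
set.seed(42)
accounts <- data.frame(
    user_id = seq_len(100),
    bot = rep(c(0, 1), times = c(70, 30))
)

## hypothetical choice: put 20 accounts from each class in the training set
n_per_class <- 20
train_ids <- unlist(lapply(
    split(accounts$user_id, accounts$bot),
    function(ids) sample(ids, n_per_class)
))

## everything not sampled for training goes to the test set
toy_train <- accounts[accounts$user_id %in% train_ids, ]
toy_test <- accounts[!accounts$user_id %in% train_ids, ]

table(train_bot = toy_train$bot)
table(test_bot = toy_test$bot)
```

Sampling the same number of accounts from each class balances the training set, while the test set simply keeps whatever is left over, so its group sizes are unequal.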

### 1. Load the data

In [2]:
## read in the data (train and test)
train <- readRDS("../data/train.rds")
test <- readRDS("../data/test.rds")

### 2. How many bots and non-bots are in each data set?

In [3]:
## bot == 1, not-bot == 0
table(train_bot = train$bot)
table(test_bot = test$bot)

train_bot
   0    1 
1604 1604 

test_bot
   0    1 
2437  799 

### 3. Come up with 6 new features (numeric predictors)

In [4]:
## feature extraction
## is.numeric() already returns TRUE for integer columns
is_num <- function(x) is.numeric(x)
extract_features <- function(data) {
    ## mutate 6 new features
    data %>% 
        mutate(
            bio_chars = nchar(description),
            verified = as.integer(verified),
            ## account age in years; `units` must be named, otherwise "days" is matched to `tz`
            years = as.integer(difftime(Sys.time(), account_created_at, units = "days")) / 365,
            ## add one to both counts so the ratio is never x / 0 (NaN or Inf)
            tweets_to_followers = (statuses_count + 1) / (followers_count + 1),
            statuses_rate = statuses_count / years,
            ## add one so the denominator is never 0
            ff_ratio = (followers_count + 1) / (friends_count + followers_count + 1)) %>%
        ## return only numeric variables
        select_if(is_num)
}

ftrain <- extract_features(train)
ftest <- extract_features(test)
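For the exercise, new features can be added in the same way. The sketch below uses a made-up `toy` tibble that borrows column names from this data set; the three features (`bio_missing`, `favs_per_status`, `log_followers`) are illustrative suggestions only, and whether the real `description` column contains missing values is an assumption.

```r
suppressPackageStartupMessages(library(dplyr))

## made-up accounts with the same column names used above
toy <- tibble(
    description = c("I love data", NA, "Follow me! http://example.com"),
    statuses_count = c(1200, 0, 98000),
    followers_count = c(300, 0, 15),
    favourites_count = c(450, 0, 2)
)

toy_features <- toy %>%
    mutate(
        ## 1 if the account has no bio, 0 otherwise
        bio_missing = as.integer(is.na(description) | description == ""),
        ## likes given per tweet posted (+ 1 avoids dividing by zero)
        favs_per_status = favourites_count / (statuses_count + 1),
        ## log scale tames heavily skewed follower counts
        log_followers = log1p(followers_count)
    ) %>%
    ## keep only numeric columns, as extract_features() does
    select(where(is.numeric))

toy_features
```

Each new feature is a single number per account, which is all the model needs; the character column `description` is dropped by the final `select()`.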

### 4. Merge the data sets, group by the `bot` variable (whether the observations are bots), and summarise the numeric variables.

In [5]:
bind_rows(ftrain, ftest) %>%
    group_by(bot) %>%
    summarise_if(is_num, median, na.rm = TRUE)

| bot | followers_count | friends_count | listed_count | statuses_count | favourites_count | verified | bio_chars | years | tweets_to_followers | statuses_rate | ff_ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1497 | 338 | 24 | 21596 | 6406 | 0 | 67 | 5.6 | 11.81194 | 4913.503 | 0.7878366 |
| 1 | 857 | 249 | 21 | 5491 | 94 | 0 | 97 | 3.989041 | 8.531579 | 1504.555 | 0.6463078 |


### 5. Train a model predicting whether each observation is a bot.

In [6]:
## load gbm package
suppressPackageStartupMessages(library(gbm))

In [7]:
## number of trees
n_trees <- 500

## set params and run model (~ . means use all other variables)
m1 <- gbm(bot ~ .,
          data = ftrain,
          n.trees = n_trees,
          interaction.depth = 4,
          cv.folds = 2,
          verbose = FALSE,
          distribution = "bernoulli",
          n.minobsinnode = 10,
          shrinkage = .025)
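When playing with the parameters, one thing worth checking is how many trees the model actually needs. Because the model is fit with `cv.folds`, `gbm.perf()` can estimate the number of trees with the lowest cross-validated error. The sketch below runs on simulated data (the `sim` data frame and its predictors `x1` and `x2` are made up, and the parameter values are arbitrary), so treat it as an illustration rather than a recommended setup.

```r
suppressPackageStartupMessages(library(gbm))

## simulated stand-in for ftrain (made-up predictors x1 and x2)
set.seed(123)
n <- 400
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$bot <- rbinom(n, 1, plogis(2 * sim$x1 - sim$x2))

sim_model <- gbm(bot ~ .,
                 data = sim,
                 distribution = "bernoulli",
                 n.trees = 300,
                 interaction.depth = 2,
                 shrinkage = .05,
                 n.minobsinnode = 10,
                 cv.folds = 3,
                 n.cores = 1,
                 verbose = FALSE)

## number of trees with the lowest cross-validated error
best_iter <- gbm.perf(sim_model, method = "cv", plot.it = FALSE)
best_iter

## use that many trees when predicting
sim_pred <- predict(sim_model, newdata = sim, n.trees = best_iter, type = "response")
```

If the estimate sits right at the maximum number of trees, the model may benefit from more trees or a larger `shrinkage`; if it is far below the maximum, the extra trees are not helping.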

### 6. Summarise the model by looking up the relative influence of each feature.

In [8]:
summary(m1, plotit = FALSE)

| var | rel.inf |
|---|---|
| ff_ratio | 32.31932 |
| friends_count | 25.430048 |
| favourites_count | 20.35148 |
| statuses_rate | 4.903556 |
| listed_count | 4.402853 |
| tweets_to_followers | 2.753572 |
| followers_count | 2.28639 |
| bio_chars | 2.260802 |
| years | 2.064818 |
| statuses_count | 1.944522 |


### 7. Check percent correct on training data set.

In [9]:
## predicted probability of being a bot for each training observation
ftrain$pred <- predict(m1, newdata = ftrain, n.trees = n_trees, type = "response")

In [10]:
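percent_correct <- function(x) {
    ## rows = predicted class (pred > .5), columns = actual class
    ## fixed factor levels keep the table 2 x 2 even if only one class is predicted
    x <- table(predicted = factor(x$pred > .5, levels = c(FALSE, TRUE)),
               bot = factor(x$bot, levels = c(0, 1)))
    ## share of actual bots that were predicted to be bots
    pc <- round((x[2, 2]) / sum(x[, 2]), 3)
    pc <- as.character(pc * 100)
    message(sprintf("The model was %s%% accurate when classifying bots.\n", pc))
    ## share of actual non-bots that were predicted to be non-bots
    pc <- round((x[1, 1]) / sum(x[, 1]), 3)
    pc <- as.character(pc * 100)
    message(sprintf("The model was %s%% accurate when classifying non-bots.\n", pc))
    ## share of all observations classified correctly
    pc <- round((x[1, 1] + x[2, 2]) / sum(x), 3)
    pc <- as.character(pc * 100)
    message(sprintf("Overall, the model was correct %s%% of the time.", pc))
}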
percent_correct(ftrain)

The model was 89.3% accurate when classifying bots.

The model was 92.3% accurate when classifying non-bots.

Overall, the model was correct 90.8% of the time.
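To make the three percentages concrete, here is a small worked example of the same calculation on eight made-up observations (the `toy` data frame and its values are invented for illustration). It thresholds the predicted probabilities at .5, cross-tabulates predicted against actual class, and reads the per-class and overall accuracy off the table.

```r
## eight made-up observations: actual class and predicted probability
toy <- data.frame(
    bot  = c(1, 1, 1, 1, 1, 0, 0, 0),
    pred = c(0.9, 0.8, 0.3, 0.6, 0.7, 0.1, 0.7, 0.4)
)

## rows = predicted class, columns = actual class
predicted <- factor(toy$pred > 0.5, levels = c(FALSE, TRUE))
actual <- factor(toy$bot, levels = c(0, 1))
cm <- table(predicted = predicted, actual = actual)
cm

## 4 of the 5 actual bots were predicted to be bots
bot_accuracy <- cm["TRUE", "1"] / sum(cm[, "1"])
## 2 of the 3 actual non-bots were predicted to be non-bots
nonbot_accuracy <- cm["FALSE", "0"] / sum(cm[, "0"])
## 6 of the 8 observations were classified correctly
overall_accuracy <- sum(diag(cm)) / sum(cm)

c(bots = bot_accuracy, non_bots = nonbot_accuracy, overall = overall_accuracy)
```

Reporting the per-class numbers alongside the overall figure matters most when the groups are unequal in size, since the overall percentage is then dominated by the larger group.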


### 8. Now, for the final task, classify the test data and report the percent correct again.

In [11]:
ftest$pred <- predict(m1, newdata = ftest, n.trees = n_trees, type = "response")
percent_correct(ftest)

The model was 86.9% accurate when classifying bots.

The model was 89% accurate when classifying non-bots.

Overall, the model was correct 88.5% of the time.
