# Module 5 Exercises

This week's exercises expand on the practice notebook, which extracted features from user data collected via Twitter's REST API and compared them between probable bots and probable non-bots.

### Extracting features

Before the specific tasks, load the tidyverse.

In [8]:
## load tidyverse
suppressPackageStartupMessages(library(tidyverse))

## Data

Data were collected on 6,697 Twitter accounts. Of these accounts, 4,243 are considered genuine (non-automated) and 2,454 are considered "bots" (accounts that are automated and/or part of a larger coordinated network of accounts). The data were then randomly split into a **training** data set and a **test** data set. The training data contain an equal number of bots and non-bots. Because the test data are not used to build the model, equal group sizes are not important in the test data set.
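
The procedure used to create the split is not shown in this notebook. As a minimal sketch of one way a balanced training set could be drawn, the following uses a small made-up data frame (`accounts`, `n_per_class`, `toy_train`, and `toy_test` are hypothetical names, not part of the course data):

```r
## hypothetical toy data standing in for the real accounts (not the course data)
set.seed(42)
accounts <- data.frame(
    user_id = as.character(1:100),
    bot = rep(c(0, 1), times = c(65, 35))
)

## number of accounts to sample per class for the training set (an assumed value)
n_per_class <- 20

## row indices for each class
bot_rows <- which(accounts$bot == 1)
nonbot_rows <- which(accounts$bot == 0)

## sample equally from both classes; everything else becomes the test set
train_rows <- c(sample(bot_rows, n_per_class), sample(nonbot_rows, n_per_class))
toy_train <- accounts[train_rows, ]
toy_test <- accounts[-train_rows, ]

table(train_bot = toy_train$bot)
table(test_bot = toy_test$bot)
```

With this approach the training set is balanced by construction, while the test set keeps whatever class mix is left over.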

### 1. Load the data

In [9]:
## read in the data (train and test)
train <- readRDS("../data/train.rds")
test <- readRDS("../data/test.rds")

### 2. How many bots and non-bots are in each data set?

In [10]:
## bot == 1, not-bot == 0
table(train_bot = train$bot)
table(test_bot = test$bot)

train_bot
   0    1 
1604 1604 

test_bot
   0    1 
2437  799 
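
Because the test set is unbalanced, it helps to know what accuracy a trivial classifier would reach there. The sketch below uses only the counts printed above to compute the class shares and the accuracy of always guessing "non-bot"; the variable names are illustrative.

```r
## counts copied from the test_bot table above
test_counts <- c(non_bot = 2437, bot = 799)

## share of each class in the test set
test_share <- test_counts / sum(test_counts)
round(test_share, 3)

## accuracy of a classifier that always predicts the majority class (non-bot)
baseline_acc <- max(test_share)
round(baseline_acc, 3)
```

The baseline is about 75% on the test set, so a useful model should do clearly better than that. On the balanced training set the equivalent baseline is 50%.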

In [11]:
str(train)

Classes 'tbl_df', 'tbl' and 'data.frame':	3208 obs. of  21 variables:
 $ user_id               : chr  "251344965" "17621767" "6267142" "795736816192081920" ...
 $ name                  : chr  "Videodisc <U+25B3>" "Policy Innovations" "Ethan Brown <U+0001F916>" "Galaxi16" ...
 $ screen_name           : chr  "RolexSound" "carnegiePI" "ethanwbrown" "Galaxi162" ...
 $ location              : chr  "Barcelona" "New York, NY" "Zurich, Switzerland" "" ...
 $ description           : chr  "I'm not a bot . Music . Drawing . Videogames . Sweet Gig.holo.gram booking dani@divined.com https://t.co/jlLMxj"| __truncated__ "Policy Innovations is the @carnegiecouncil magazine for #socialinnovation and global ethics. We talk about doer"| __truncated__ "AI, algorithms, physics, and the rest <U+2022> Machine learning @squirro <U+2022> Building a bot army: @portman"| __truncated__ "" ...
 $ url                   : chr  "https://t.co/QnrvTrhJqP" "http://t.co/k2DZ1mSSgt" "https://t.co/aP5PVNh5Y0" NA ...
 $ pro

In [12]:
summary(train)

   user_id              name           screen_name          location        
 Length:3208        Length:3208        Length:3208        Length:3208       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
 description            url            protected       followers_count 
 Length:3208        Length:3208        Mode :logical   Min.   :     0  
 Class :character   Class :character   FALSE:3208      1st Qu.:   398  
 Mode  :character   Mode  :character                   Median :  1170  
                                                       Mean   : 11635  
                                                       3rd Qu.:  3558  
                             

### 3. Come up with at least 3 NEW features (numeric predictors). In other words, create 3 features IN ADDITION TO the ones provided to you below (i.e., `bio_chars`, `verified`, `years`, `tweets_to_followers`, `statuses_rate`, and `ff_ratio`).

In [17]:
## feature extraction
## is.numeric() is TRUE for both integer and double vectors
is_num <- function(x) is.numeric(x)
extract_features <- function(data) {
    ## mutate 9 total features
    data %>% 
        mutate(
            ## your new variables should go below here
            'bio_chars' = nchar(description),
            'verified' = as.integer(verified),
            ## account age in years (whole days / 365); units must be passed by name,
            ## because the third positional argument of difftime() is tz
            'years' = as.integer(difftime(Sys.time(), account_created_at, units = "days")) / 365,
            ## add one to both terms so the ratio is never NaN (0 / 0) or Inf (x / 0)
            'tweets_to_followers' = (statuses_count + 1) / (followers_count + 1),
            'statuses_rate' = statuses_count / years,
            #start_new
            ## 'protected' = as.integer(protected) was dropped: every value is FALSE (no variation)
            ## favourites per status
            'oc' = (favourites_count + 1) / (statuses_count + 1),
            ## number of characters in the user ID
            'id_chars' = nchar(user_id),
            ## list memberships per year
            'list_rate' = listed_count / years,
            #end_new
            ## add one so the denominator is never zero
            'ff_ratio' = (followers_count + 1) / (friends_count + followers_count + 1)) %>%
        ## return only numeric variables
        select_if(is_num)
}

## apply function to training and test data sets
ftrain <- extract_features(train)
ftest <- extract_features(test)
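
The `+ 1` terms in the ratio features guard against division by zero. A small illustration with made-up counts (not drawn from the data) shows what the raw ratio does when an account has no followers:

```r
## made-up counts for three hypothetical accounts
statuses_count <- c(0, 10, 250)
followers_count <- c(0, 0, 50)

## raw ratio: 0 / 0 gives NaN and 10 / 0 gives Inf
raw_ratio <- statuses_count / followers_count

## smoothed ratio, as in extract_features(): always finite for non-negative counts
smoothed_ratio <- (statuses_count + 1) / (followers_count + 1)

is.finite(raw_ratio)
is.finite(smoothed_ratio)
```

The features that divide by `years` are not smoothed this way, so running `is.finite()` over the extracted columns may be a worthwhile check before modeling.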

### 4. Merge the data sets, `group_by` the `bot` variable (whether an account is considered a bot), and summarise the numeric variables by estimating the median of each.

In [18]:
bind_rows(ftrain, ftest) %>%
    group_by(bot) %>%
    summarise_if(is_num, median, na.rm = TRUE)

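| bot | followers_count | friends_count | listed_count | statuses_count | favourites_count | verified | bio_chars | years | tweets_to_followers | statuses_rate | oc | id_chars | list_rate | ff_ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1497 | 338 | 24 | 21596 | 6406 | 0 | 67 | 5.6 | 11.81194 | 4913.503 | 0.41113347 | 9 | 5.568268 | 0.7878366 |
| 1 | 857 | 249 | 21 | 5491 | 94 | 0 | 97 | 3.989041 | 8.531579 | 1504.555 | 0.02459016 | 10 | 6.033058 | 0.6463078 |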


### 5. Train a gradient boosted model (gbm) predicting whether each observation is a bot. Adjust the parameter values passed to the `gbm()` function (read the documentation to figure out what you can do and what everything means) and try to maximize the quality of your model (hint: make sure not to overfit the model; the goal is accuracy on the TEST set).

In [15]:
## load gbm package
suppressPackageStartupMessages(library(gbm))

## view documentation for gbm() function (in gbm package)
?gbm

In [19]:
## number of trees
n_trees <- 500

## set params and run model (~ . means use all other variables)
m1 <- gbm(bot ~ .,
          data = ftrain,
          n.trees = n_trees,
          interaction.depth = 4,
          cv.folds = 2,
          verbose = FALSE,
          distribution = "bernoulli",
          n.minobsinnode = 10,
          shrinkage = .025)
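
One possible way to guard against overfitting is to let the cross-validation error choose how many trees to use. The sketch below is self-contained and runs on simulated data (the `toy` data frame, its columns, and the parameter values are made up for illustration). It assumes the gbm package is installed and relies on `gbm.perf()` with `method = "cv"`, which returns the iteration with the lowest cross-validated error.

```r
library(gbm)

## simulated data: a binary outcome loosely driven by x1 (purely illustrative)
set.seed(123)
n <- 500
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- rbinom(n, size = 1, prob = plogis(1.5 * toy$x1))

## fit with cross-validation so that a CV error curve is available
m_toy <- gbm(y ~ .,
             data = toy,
             distribution = "bernoulli",
             n.trees = 300,
             interaction.depth = 2,
             shrinkage = .05,
             cv.folds = 3,
             n.cores = 1,
             verbose = FALSE)

## iteration with the lowest cross-validated error
best_iter <- gbm.perf(m_toy, method = "cv", plot.it = FALSE)
best_iter

## predict using only the first best_iter trees
toy_pred <- predict(m_toy, newdata = toy, n.trees = best_iter, type = "response")
```

The same idea could be applied to `m1` by passing the selected iteration as `n.trees` in `predict()`, although with only two folds the cross-validated estimate is likely to be noisy.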

### 6. Summarise the model by estimating the relative influence of each feature.

In [20]:
summary(m1, plotit = FALSE)

| var | rel.inf |
| --- | --- |
| ff_ratio | 31.72876082 |
| friends_count | 24.77079543 |
| favourites_count | 18.12039093 |
| statuses_rate | 4.37805558 |
| oc | 4.27969152 |
| tweets_to_followers | 2.5459599 |
| listed_count | 2.53100173 |
| list_rate | 2.39954488 |
| followers_count | 2.20358961 |
| bio_chars | 2.09915049 |


### 7. Check the percent correct for predictions of the training data set.

In [21]:
## classify probability of bot for each observation in training set
ftrain$pred <- predict(m1, n.trees = n_trees, type = "response")

In [22]:
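## write a function to print the percent correct (for bots, for non-bots, and overall)
percent_correct <- function(x) {
    ## confusion matrix: rows = predicted class (pred > .5), columns = actual class;
    ## fixed factor levels keep the table 2 x 2 even if one class is never predicted
    tab <- table(predicted = factor(x$pred > .5, levels = c(FALSE, TRUE)),
                 bot = factor(x$bot, levels = c(0, 1)))
    ## bots: predicted bot (row 2) among actual bots (column 2)
    pc <- round((tab[2, 2]) / sum(tab[, 2]), 4)
    pc <- as.character(pc * 100)
    message(sprintf("The model was %s%% accurate when classifying bots.\n", pc))
    ## non-bots: predicted non-bot (row 1) among actual non-bots (column 1)
    pc <- round((tab[1, 1]) / sum(tab[, 1]), 4)
    pc <- as.character(pc * 100)
    message(sprintf("The model was %s%% accurate when classifying non-bots.\n", pc))
    ## overall: correct predictions (the diagonal) among all observations
    pc <- round((tab[1, 1] + tab[2, 2]) / sum(tab), 3)
    pc <- as.character(pc * 100)
    message(sprintf("Overall, the model was correct %s%% of the time.", pc))
}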
percent_correct(ftrain)

The model was 89.71% accurate when classifying bots.

The model was 92.96% accurate when classifying non-bots.

Overall, the model was correct 91.3% of the time.
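
The per-class figures come from a 2 x 2 table of predicted versus actual classes. A toy example with made-up probabilities and labels (not model output) makes the indexing explicit:

```r
## made-up predicted probabilities and true labels for eight accounts
pred <- c(.9, .8, .3, .7, .2, .1, .6, .55)
bot <- c(1, 1, 1, 1, 0, 0, 0, 0)

## rows = predicted class, columns = actual class
tab <- table(predicted = pred > .5, bot = bot)
tab

## share of actual bots predicted to be bots
bot_acc <- tab["TRUE", "1"] / sum(tab[, "1"])

## share of actual non-bots predicted to be non-bots
nonbot_acc <- tab["FALSE", "0"] / sum(tab[, "0"])

## share of all accounts classified correctly
overall_acc <- sum(diag(tab)) / sum(tab)

c(bots = bot_acc, non_bots = nonbot_acc, overall = overall_acc)
```

Here three of the four bots and two of the four non-bots are classified correctly, so the three values are 0.75, 0.5, and 0.625.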


### 8. Now, for the final task, classify the test data and report the percent correct for the predictions of the test data set.

In [23]:
ftest$pred <- predict(m1, newdata = ftest, n.trees = n_trees, type = "response")
percent_correct(ftest)

The model was 87.23% accurate when classifying bots.

The model was 89.37% accurate when classifying non-bots.

Overall, the model was correct 88.8% of the time.
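
As a consistency check, the overall accuracy should equal the per-class accuracies weighted by class size. The sketch below uses only figures reported in this notebook (class counts from task 2, per-class accuracies from tasks 7 and 8); the variable names are illustrative.

```r
## class counts (non-bots, bots) from task 2
train_n <- c(1604, 1604)
test_n <- c(2437, 799)

## per-class accuracy (non-bots, bots) reported in tasks 7 and 8
train_acc <- c(.9296, .8971)
test_acc <- c(.8937, .8723)

## overall accuracy = class-size-weighted mean of the per-class accuracies
train_overall <- weighted.mean(train_acc, train_n)
test_overall <- weighted.mean(test_acc, test_n)

round(c(train = train_overall, test = test_overall), 3)
```

The results, 0.913 and 0.888, match the overall figures printed by `percent_correct()`. Because non-bots make up about three quarters of the test set, the overall test accuracy sits closer to the non-bot accuracy than to the bot accuracy.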
