<h1 style="text-align: center;"><span style="color: #333399;">Twitter Analytics: Social Bot Detection</span></h1>
<h6 style="text-align: center;">Created by: Michael Gagliano on 2/8/19</h6>
<h6 style="text-align: center;">Last Update: Michael Gagliano 2/12/19</h6>


Internet bots have been around for over a decade now, and have been used in a multitude of applications. 


Some early examples are bots that use *candidate-elimination* algorithms to determine the most likely answer from the information a user feeds them, much like the game 20 Questions (e.g. https://en.akinator.com/).


Other bots many internet users are familiar with are chat bots, which use *natural language processing* and *decision tree-type* algorithms to automate customer service. Many mobile banking, e-commerce, and online retail websites and applications now use them to maximize resources while maintaining a (subjectively) personal experience.


### In this notebook, the focus will be on **Social Bots**, i.e. the various types of bots you encounter on social media platforms: how to detect them and what implications they may have. From the Pew Research Center$^{1}$:

`"A recent Pew Research Center study explored the role bots play in sharing links on Twitter. The study examined 1.2 million tweeted links – collected over the summer of 2017 – to measure how many came from suspected bot accounts. The result: Around two-thirds (66%) of the tweeted links the Center examined were shared by suspected bots, or automated accounts that can generate or distribute content without direct human oversight."`

**Some things to keep in mind:**


*   Not all bots are bad! Think about the bots that send you updates on Amazon deals or airline tickets, or the ones that aggregate your favorite news sources for you on your phone. Even Apple uses them for their AppleCare service.

*   Some of them are very bad. Unfortunately, data privacy and AI ethics remain a grey area, and safeguards are not developing fast enough to curb the production of the bad ones.$^{2, 3}$



# Import Packages for Bot Detection

In [1]:
import pandas as pd # Powerful data manipulation and visualization tool
import numpy as np # Computational data tool
import csv # Standard-library package for reading and writing .csv files
import seaborn as sns # Package that provides rich and highly customizable visualizations built on top of the matplotlib package
import matplotlib.pyplot as plt # Importing the pyplot functionality built on matplotlib
%matplotlib inline

In [2]:
pd.set_option('display.max_colwidth', None) # Show full column contents (-1 is deprecated in newer pandas versions)
pd.set_option('display.max_columns', None)

# Importing Data

Data collection was achieved using Amazon Web Services' EC2 service, where a virtual machine (Amazon Linux AMI) continuously ran a Python script to mine and retrieve text data from Twitter. Tweets were collected only if they contained keywords such as "Applebees".

The data was retrieved in .json format via manual input from the AWS EC2 virtual machine. The exported data collected was 51.7MB in total, containing 94,544 tweets and all respective metadata from 9/28/2018 to 11/2/2018. Once retrieved, the data was converted into .csv format for analysis to take place. 

**For this notebook, the data was reduced to a significantly smaller sample: 5,000 rows drawn at random from the full dataset**

In [7]:
# Using the pandas package to import the data

# All tweets

df = pd.read_csv('AllTweets.csv') # Import full dataset
df_mod = df.sample(n=5000, replace = True, random_state = 1) # Randomly sample 5,000 rows (with replacement, so a row can be drawn more than once)
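
Because the cell above passes `replace = True`, the same tweet can be drawn more than once, which can slightly inflate per-user tweet counts later on. The sketch below uses a small made-up table (not the real dataset) to contrast sampling with and without replacement; the `toy` frame and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the tweet table (hypothetical values, not the real data).
toy = pd.DataFrame({
    "user": [f"user_{i}" for i in range(10)],
    "text": [f"tweet {i}" for i in range(10)],
})

# Sampling WITH replacement may return the same row several times.
with_repl = toy.sample(n=10, replace=True, random_state=1)

# Sampling WITHOUT replacement returns each row at most once.
without_repl = toy.sample(n=10, replace=False, random_state=1)

# Count how many sampled rows are repeats of an earlier row (same index label).
dup_with = int(with_repl.index.duplicated().sum())
dup_without = int(without_repl.index.duplicated().sum())

print("Repeated rows when sampling with replacement:", dup_with)
print("Repeated rows when sampling without replacement:", dup_without)
```

If duplicates matter for your analysis, `replace = False` or a follow-up `drop_duplicates()` are two possible options.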

# Brief Data Overview

In [8]:
df_mod.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5000 entries, 77708 to 15287
Data columns (total 36 columns):
id                             5000 non-null int64
time                           5000 non-null int64
created_at                     5000 non-null object
from_user_name                 5000 non-null object
text                           5000 non-null object
filter_level                   5000 non-null object
possibly_sensitive             752 non-null float64
withheld_copyright             0 non-null float64
withheld_scope                 0 non-null float64
truncated                      0 non-null float64
retweet_count                  5000 non-null int64
favorite_count                 5000 non-null int64
lang                           5000 non-null object
to_user_name                   407 non-null object
in_reply_to_status_id          394 non-null float64
quoted_status_id               311 non-null float64
source                         5000 non-null object
location       

# Data Pre-Processing

This is not comprehensive; it is a quick, manual dimensionality reduction (removing features that are not essential for this specific kind of analysis). Most of the data was already cleaned prior to this notebook.


**Dropping non-essential columns**

In [9]:
df_mod.drop(columns = ['id',
               'filter_level',
               'possibly_sensitive',
               'withheld_copyright',
               'withheld_scope',
               'truncated',
               'from_user_utcoffset',
               'from_user_timezone',
               'in_reply_to_status_id',
               'quoted_status_id',
               'from_user_withheld_scope'], inplace = True)

**Renaming columns to access data easier**

In [23]:
df_mod.rename(columns={'from_user_name': 'user',
                            'from_user_followercount': 'followers',
                            'from_user_friendcount': 'friends',
                            'from_user_favourites_count': 'favs',}, inplace = True)

# Preview first 5 rows of data
df_mod.head()

Unnamed: 0,time,created_at,user,text,retweet_count,favorite_count,lang,to_user_name,source,location,lat,lng,from_user_id,from_user_realname,from_user_verified,from_user_description,from_user_url,from_user_profile_image_url,from_user_lang,from_user_tweetcount,followers,friends,favs,from_user_listed,from_user_created_at
77708,1540161699,2018-10-21 22:41:39,R_I_C_H_4_R_D,"RT @DanyAllstar15: Folks somebody had intercourse with somebodys girl in this San Jose/Islanders game. Everybody is fighting. This looks like a 4 am cocaine fight in an Applebee’s parking lot. People want to scrap, Saturday night Harley Davidson type shit. I love it.",0,0,en,,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","Ottawa, Canada",,,991947122,Richard,0,,,http://pbs.twimg.com/profile_images/994735978387472384/BFjGPCo1_normal.jpg,en,1873,37,180,3273,0,2012-12-05 22:49:24
5192,1538351038,2018-09-30 23:43:58,Big_Sanch_860,RT @gothJudyHopps: oh that’s your girl?? lmaooo then why did she just give me her number and insurance information after i hit her car in the parking lot of applebee’s,0,0,en,,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",in your speakers,,,1651746972,Big Sanchie,0,Puerto Rican | Manager-@Cole860_ | 8$ixty| MysticBoyz | Listen to me on @AppleMusic |Inquires: MysticBoyz860@gmail.com| ALOEBOYS |,https://soundcloud.com/big-sanchez-860/krabby-patty-big-sanch-x-cakey-jakey,http://pbs.twimg.com/profile_images/1040042626098126848/I3KgrzHi_normal.jpg,en,10384,1411,2912,17842,13,2013-08-07 02:08:51
50057,1538766776,2018-10-05 19:12:56,Isaih_Cordell,RT @YezzusP: gf: “what sounds good” me: “a blow job” gf: me: Applebee’s employ: “I’ll just give you two another minute https://t.co/5Nr2oPoMYJ,0,0,en,,"<a href=""http://twitter.com/download/android"" rel=""nofollow"">Twitter for Android</a>",Untouchable,,,2933921308,⚡ay_In_Da_Cut,0,💫Dreamchaser💫,,http://pbs.twimg.com/profile_images/1039235240084918272/VN1Aib18_normal.jpg,en,15820,449,496,6580,2,2014-12-20 17:02:55
73349,1539915462,2018-10-19 02:17:42,freeworldgroup,RT @DesignationSix: Here are some @FoxNews advertisers. They are tagged so they will get notified for each LIKE RETWEET or COMMENT @Arbys @IHOP @Applebees @McDonalds @LifeLock @Nestle @OmahaSteaks @IdahoPotato @Walmart @WaltDisneyCo @redlobster @goldencorral @Toyota @ProcterGamble @Always @Tampax,0,0,en,,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Everywhere,,,49024495,FWG,0,We stand against #dotard Trump. For the sake of all humanity. #TheResistance #ImpeachTrump #resist #basta,http://www.freeworldgroup.com,http://pbs.twimg.com/profile_images/276419063/logo_square_normal.jpg,en,17279,3748,449,10763,63,2009-06-20 14:57:19
21440,1538454157,2018-10-02 04:22:37,tinacnguyen,RT @gothJudyHopps: oh that’s your girl?? lmaooo then why did she just give me her number and insurance information after i hit her car in the parking lot of applebee’s,0,0,en,,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",,,,2723411614,LittyTittyTina,0,Memento Vivere,,http://pbs.twimg.com/profile_images/1028931805318008837/5Zv0ighI_normal.jpg,en,2661,205,197,5320,0,2014-07-24 21:24:36


# Determining whether a user in your dataset is a bot

It is certainly possible to build your own classification or regression model to detect bots, and to tune its hyperparameters to better fit the context you are working in. However, for more general research and accessibility, many Python packages and APIs have been developed to facilitate this.

## 0. Exploratory Analysis (what might suggest you have bots in your dataset)

Prior to using the resources below, it's important to understand that many APIs limit usage unless you purchase a "Pro" tier or a specific plan that allows increased querying of their data warehouses. Exploratory analysis can help prevent unnecessary API queries and narrow down the most prominent accounts to examine in your data early on.

The following lines of code may help highlight the more prominent bot accounts in your data (**using the above dataset**):

**Unique Users in Data** - If there is a high volume of tweets but few unique users, it may indicate a lot of repeat tweeters or RTs in the data

In [28]:
uni = df_mod['user'].nunique()
print('The number of unique Twitter users in the dataset is %s' % uni)

The number of unique Twitter users in the dataset is 4808
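
To follow up on the point about repeat tweeters and RTs, one rough check is the share of rows that are retweets and which texts repeat most often. The sketch below assumes that a retweet can be recognized by its text starting with `RT @` (a heuristic, based on how retweets appear in the preview above) and uses a tiny made-up table rather than `df_mod`.

```python
import pandas as pd

# Toy stand-in for df_mod (hypothetical rows, same column names as the notebook).
toy = pd.DataFrame({
    "user": ["a", "b", "c", "a"],
    "text": [
        "RT @x: hello",
        "lunch at Applebees",
        "RT @x: hello",
        "RT @y: deal",
    ],
})

# Heuristic assumption: retweets start with "RT @".
is_retweet = toy["text"].str.startswith("RT @")
retweet_share = is_retweet.mean()

# The most frequently repeated tweet texts.
top_texts = toy["text"].value_counts().head(3)

print(f"Share of retweets: {retweet_share:.0%}")
print(top_texts)
```

A very high retweet share is not proof of bots on its own, but it tells you how much of the volume is amplification rather than original content.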


**Average Tweets Per User** - If a user shows up far more often than the average number of tweets per user, it may suggest bot-like activity

In [29]:
tweets_per_user = (df_mod['user'].count())/(uni)
print('The average number of tweets per user in the new dataset is %.2f' % tweets_per_user)

The average number of tweets per user in the new dataset is 1.04


**Showing Most Active Tweeters** - Anything "significantly" larger than the average value may show who the most prominent tweeters are, or the most severe bot culprits

In [27]:
df_mod['user'].value_counts().head(10)

Applebees        13
upper_fixer      5 
jordie_nassif    3 
Paula56790599    3 
TinaMarie_80s    3 
ApplebeesHI      3 
mmiicckkeeyyy    3 
aidanzzzzz       3 
lookingbad45     3 
King_essencee    3 
Name: user, dtype: int64
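
One way to make "significantly larger than the average" concrete is to flag users whose tweet count exceeds the mean by some number of standard deviations. The sketch below is an assumption-laden illustration: the cutoff of 3 standard deviations is arbitrary, and the toy series only borrows the two largest counts shown above (13 and 5) alongside made-up single-tweet users.

```python
import pandas as pd

# Toy user column: two heavy tweeters plus 40 hypothetical one-tweet users.
toy_users = pd.Series(
    ["Applebees"] * 13 + ["upper_fixer"] * 5 + [f"user_{i}" for i in range(40)],
    name="user",
)

# Tweets per user, sorted from most to least active.
counts = toy_users.value_counts()

# Arbitrary cutoff (an assumption): mean plus 3 sample standard deviations.
N_STD = 3
threshold = counts.mean() + N_STD * counts.std()

# Users whose activity exceeds the cutoff are candidates for a closer look.
flagged = counts[counts > threshold]

print(f"Threshold: {threshold:.2f} tweets")
print(flagged)
```

A flagged account is only a candidate for review. In this dataset the top account is the brand's own handle, which is a reminder that high activity does not by itself mean a malicious bot.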

## 1. Botometer for Python$^{4}$

https://github.com/IUNetSci/botometer-python
http://www.pewinternet.org/2018/04/09/bots-in-the-twittersphere/pi_2018-04-09_twitter-bots_m-06/

Follow the instructions to install the package and dependencies, if not already satisfied.

Then, create an account on RapidAPI, formerly Mashape Market. I chose to simply link my GitHub account for added convenience.

Search for "Botometer" and click the *non*-pro version. The resulting page displays your Secret Key, which you will need to copy for use in your Jupyter Notebook later.

In [18]:
# To install: pip install botometer

import os
import botometer

# Read credentials from environment variables rather than hard-coding secrets
# in the notebook. The variable names below are placeholders; set them in your
# own environment before running this cell.
mashape_key = os.environ['RAPIDAPI_KEY']
twitter_app_auth = {
    'consumer_key': os.environ['TWITTER_CONSUMER_KEY'],
    'consumer_secret': os.environ['TWITTER_CONSUMER_SECRET'],
    'access_token': os.environ['TWITTER_ACCESS_TOKEN'],
    'access_token_secret': os.environ['TWITTER_ACCESS_TOKEN_SECRET'],
  }
bom = botometer.Botometer(wait_on_ratelimit=True,
                          mashape_key=mashape_key,
                          **twitter_app_auth)

# Check a single account by screen name
result = bom.check_account('@mikefromcollege') # Me, a human

# Check a single account by id
#result = bom.check_account(1548959833)
    
import pprint

pprint.pprint(result)

{'cap': {'english': 0.0014187924969112314, 'universal': 0.0021984408773070953},
 'categories': {'content': 0.09728515292570794,
                'friend': 0.09994026911083925,
                'network': 0.08358138135822211,
                'sentiment': 0.06780783386008177,
                'temporal': 0.06669571356978159,
                'user': 0.03482447811460989},
 'display_scores': {'content': 0.5,
                    'english': 0.2,
                    'friend': 0.5,
                    'network': 0.4,
                    'sentiment': 0.3,
                    'temporal': 0.3,
                    'universal': 0.2,
                    'user': 0.2},
 'scores': {'english': 0.03265990956683609, 'universal': 0.04447883217419818},
 'user': {'id_str': '237573884', 'screen_name': 'mikefromcollege'}}


### Checking multiple accounts

In [30]:
# Check a sequence of accounts

accounts = ['@mikefromcollege', '@DesignationSix'] # Me, and a suspected bot
for screen_name, result in bom.check_accounts_in(accounts):
    pprint.pprint(result)

{'cap': {'english': 0.0014187924969112314, 'universal': 0.0021984408773070953},
 'categories': {'content': 0.09728515292570794,
                'friend': 0.09994026911083925,
                'network': 0.08358138135822211,
                'sentiment': 0.06780783386008177,
                'temporal': 0.06669571356978159,
                'user': 0.03482447811460989},
 'display_scores': {'content': 0.5,
                    'english': 0.2,
                    'friend': 0.5,
                    'network': 0.4,
                    'sentiment': 0.3,
                    'temporal': 0.3,
                    'universal': 0.2,
                    'user': 0.2},
 'scores': {'english': 0.03265990956683609, 'universal': 0.04447883217419818},
 'user': {'id_str': '237573884', 'screen_name': 'mikefromcollege'}}
{'cap': {'english': 0.2564674002687074, 'universal': 0.16230758351843008},
 'categories': {'content': 0.05090107918514772,
                'friend': 0.5196129601812253,
                'network':

### Interpreting the reports

A high-level explanation of how to interpret the scores and how the algorithm works can be found in the Botometer FAQ:
https://botometer.iuni.iu.edu/#!/faq
In essence, the display scores range from 0 to 5, where 0 is the most human-like and 5 is the most bot-like.

**From above, my user score is 0.2 on a scale of 0 to 5, which is very human-like. The suspected political propaganda bot DesignationSix from our dataset has a user score of 4.4, which suggests it is very likely a bot.**
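
When checking many accounts, it can help to reduce each report to a few headline fields and sort by score. The sketch below is a minimal illustration that works on a dictionary shaped like the output printed above (values rounded); the `summarize` helper and the `REVIEW_CUTOFF` value of 2.5 are hypothetical choices for illustration, not part of the Botometer package or an official threshold.

```python
# Hypothetical cutoff on the 0-5 display scale; choose your own in practice.
REVIEW_CUTOFF = 2.5


def summarize(result):
    """Pull a few headline fields out of one Botometer-style result dict."""
    return {
        "screen_name": result["user"]["screen_name"],
        "display_english": result["display_scores"]["english"],
        "display_user": result["display_scores"]["user"],
        "cap_english": result["cap"]["english"],
    }


# A result shaped like the report printed above (values rounded).
sample_results = [
    {
        "cap": {"english": 0.0014, "universal": 0.0022},
        "display_scores": {"english": 0.2, "universal": 0.2, "user": 0.2},
        "scores": {"english": 0.0327, "universal": 0.0445},
        "user": {"id_str": "237573884", "screen_name": "mikefromcollege"},
    },
]

rows = [summarize(r) for r in sample_results]

# Mark accounts at or above the cutoff for manual review.
for row in rows:
    row["needs_review"] = row["display_english"] >= REVIEW_CUTOFF

# Most bot-like accounts first.
rows.sort(key=lambda r: r["display_english"], reverse=True)

for row in rows:
    print(row)
```

In the loop over `bom.check_accounts_in(accounts)`, each `result` could be passed through a helper like this instead of being pretty-printed in full.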



## 2. "From Scratch" Bot Detection Program$^{5}$ - GitHub user @jubins

https://github.com/jubins/MachineLearning-Detecting-Twitter-Bots

Here is an example of how scikit-learn and other common data science packages can be used to build a machine learning program that trains, tests, and classifies accounts using a **bag of words** approach.
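
To give a feel for the bag-of-words idea, here is a minimal sketch using scikit-learn. It is not the linked repository's code: the six training texts, their labels, and the choice of a Naive Bayes classifier are all assumptions made for illustration, and a real model would need far more labeled data and a held-out test set.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training texts: 1 = bot-like, 0 = human-like (hypothetical labels).
texts = [
    "win free gift card click link now",
    "free followers click here now",
    "click link win free prize",
    "had dinner with my family tonight",
    "watching the game with friends tonight",
    "my family loved the dinner",
]
labels = [1, 1, 1, 0, 0, 0]

# CountVectorizer turns each text into word counts (the "bag of words");
# MultinomialNB then learns which words are typical of each class.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

new_texts = [
    "click now for a free prize",
    "dinner with friends tonight",
]
preds = model.predict(new_texts)

for text, pred in zip(new_texts, preds):
    print(f"{pred}  {text}")
```

Word order is ignored in this representation; only which words appear, and how often, influences the prediction.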

## 3. Uncovering Bot Networks on social media platforms

https://paulvanderlaken.com/2018/03/17/identifying-dirty-twitter-bots-with-r-and-python/
https://github.com/r0zetta/pronbot_search

Although the API permissions have changed (Facebook owns Instagram), the Python package "pronbot_search" is used to uncover "adult" spam bots on social media networks. Visualizations were created with Gephi.

# Appendix A: Resources

$^{1}$ http://www.pewresearch.org/fact-tank/2018/04/19/qa-how-pew-research-center-identified-bots-on-twitter/

$^{2}$ https://www.globaldots.com/2018-bad-bot-report-the-year-bad-bots-went-mainstream/

$^{3}$ https://voluum.com/blog/bot-traffic-bigger-than-human-make-sure-they-dont-affect-you/

$^{4}$ https://github.com/IUNetSci/botometer-python

$^{5}$ https://github.com/jubins/MachineLearning-Detecting-Twitter-Bots