## Twitter Bot detection 

Bots, trolls and fake accounts are a serious problem in an era when public awareness depends heavily on social media. They generate hype and spam and spread fake news, displaying non-genuine behavior on micro-blogging platforms. Genuine posts are valuable not only because they help us understand human sentiment, but also because they increase diversity in the sphere.

### Goal 
The goal of this project is to create a classifier that predicts whether an account is a troll or a human. Using APIs, NLP and machine learning, I identify the important features by which a troll account can be detected, so that we can avoid spending time communicating with trolls.

The outcome of this project could be used by marketers who target real users to sell their products. A curated list of real users would help them reach customers more easily and attain their business goals.


### Data

The data comprise an existing dataset from Kaggle [] and data collected from Twitter via the Tweepy API between x/x/18 and x/x/18.

### Project Workflow

#### Approach 1: 
We tried to build a classifier from existing user account data to predict bots. The model(s) achieved suspiciously high accuracy, suggesting overfitting. Upon inspecting the datasets, we found that the training and testing sets were small and had high sampling bias. Therefore, we needed to gather more data.

#### Approach 2:
We gathered more data from Twitter. However, we had difficulty getting data that could be distinctly labeled as bot or non-bot. Specifically, many Twitter-verified accounts that are claimed to be authentic are either run with automation or represent an organization rather than a person. The bot data that we gathered [from Website](www.) are small in number (xxx) and consist of suspended Russian bots. This makes the sample biased.

#### Approach 3:
Because of this, we decided to run a sentiment analysis on #TOPIC and identify bots from the tweets.


### Analysis

Predicting bots from text is a Natural Language Processing (NLP) problem.
I used NLTK as the tokenizer and scikit-learn's `TfidfVectorizer` for the bag-of-words analysis and the TF-IDF transformation. Word2Vec
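
As a rough illustration of the bag-of-words and TF-IDF step, here is a minimal sketch. The tweets are invented, and `max_features=1000` is an arbitrary illustrative value. It relies on scikit-learn's built-in tokenizer instead of NLTK, which is an assumption made only to keep the example free of NLTK data downloads; an NLTK tokenizer such as `word_tokenize` could be passed through the `tokenizer` parameter.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy tweets, invented for illustration only (not project data).
tweets = [
    "Win a free phone now, click the link",
    "Click the link to win a free prize",
    "Had a great coffee with friends this morning",
]

# Bag-of-words counting and TF-IDF weighting in a single step.
# max_features caps the vocabulary size (illustrative value).
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english", max_features=1000)
X = vectorizer.fit_transform(tweets)

print(X.shape)  # (number of tweets, vocabulary size)
print(sorted(vectorizer.vocabulary_)[:5])
```

Each row of `X` is a sparse, L2-normalized TF-IDF vector that can be fed directly to the classifiers imported in this notebook.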

For the final model I evaluated model complexity by accuracy and speed. I eventually settled on _____
- Why? and limiting the max features in the SVD step to explain a lot of  __________________

-.

-.

-.


### Conclusions
- .

-.

-.

-.



## Approach 1: Bot detection from user account data 

In [1]:
# Import libraries

import pandas as pd
import numpy as np
import scipy
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from sklearn import preprocessing
from sklearn.utils import resample

# Import sklearn models 
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split

from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier, ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.metrics import roc_auc_score, precision_score, recall_score, confusion_matrix, accuracy_score

#nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Time 
import time as t

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)


### To do next

- Create histograms for each of the plots above to show the overlap between classes and the overall distribution of each class.
- Use the log features as inputs (probably not required for RFC).
- Try all the classifiers with the log features and see whether that improves the accuracy/metrics of the other classifiers.
- Bring in more data: https://botometer.iuni.iu.edu/bot-repository/datasets.html
- Try more feature engineering: log transforms, and possibly more features from the paper.
- Try getting more tweets from the bots and the real users, and try sentiment analysis: give each account an overall positivity/negativity score.

- Write up what has been done so far; this is the baseline.
- 100% accuracy > more data > how the data is biased > use the cluster plots to show how it is biased.
- Bring in fresh data from the other repository: https://botometer.iuni.iu.edu/bot-repository/datasets.html
- Feature engineering: see how to incorporate NLP, and use sentiment analysis to score positivity/negativity for the last 10 tweets.

## Summary of baseline 

To train our system we initially used a publicly available dataset of 7k bot and non-bot accounts in total. We later collected more data by scraping Twitter and by incorporating data from various publicly available resources. This procedure yielded a dataset of 57.4k accounts, with 39k non-bot and 18k bot accounts.

We benchmarked our system using several off-the-shelf algorithms provided in the scikit-learn library (Pedregosa et al. 2011). We measured model performance as the area under the receiver operating characteristic curve (ROC-AUC) with 5-fold cross-validation. We compared Bernoulli Naive Bayes, Logistic Regression, Gradient Boosting Classifier, K-Nearest Neighbors, Random Forest Classifier and Support Vector Classifier. The best score, 98.76%, was obtained by the Random Forest algorithm. The Random Forest model was trained with 100 estimators and used Gini impurity to measure the quality of splits.
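
The evaluation described above can be sketched as follows. This is a minimal illustration on synthetic data from `make_classification`, not the project's dataset; the sample size, feature count, class weights and `random_state` values are assumptions chosen only so the example runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the account features (hypothetical, not project data).
X, y = make_classification(n_samples=500, n_features=8, n_informative=4,
                           weights=[0.65, 0.35], random_state=0)

# Random Forest with 100 trees and Gini impurity as the split criterion.
model = RandomForestClassifier(n_estimators=100, criterion="gini", random_state=0)

# 5-fold stratified cross-validation, scored by ROC-AUC.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")

print(f"ROC-AUC: {scores.mean():.4f} +/- {scores.std():.4f}")
```

Stratified folds keep the bot/non-bot proportion roughly constant in every split, which matters when the classes are imbalanced.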

For the rest of the project I focus on the latest verified dataset from [this resource](https://botometer.iuni.iu.edu/bot-repository/datasets.html), because the data we collected had sampling bias. Another advantage of this dataset is that the accounts' tweets are provided with it.

In the following I take a subset of certified bot and non-bot accounts, create features and try to improve the model accuracy. This dataset has almost 39k data points, of which roughly 37% (14,245 of 38,945) are certified bots.

In [5]:
# Getting more data 
all_hums = pd.read_csv('Data/all_hums.csv', encoding='utf-8')
all_bots = pd.read_csv('Data/all_bots.csv', encoding='utf-8')
new_fake = pd.read_csv('Data/fake_users_.csv')
fake_users_2 = pd.read_csv('Data/fake_users_2.csv')


In [6]:
# Creating certified humans and non-human dataset 
cert_hum = all_hums
cert_fake = pd.concat([all_bots, fake_users_2, new_fake], axis=0, ignore_index=True)

In [7]:
cert_hum.shape, cert_fake.shape

((24700, 20), (14245, 20))

In [8]:
df_cert = pd.concat([cert_hum, cert_fake], axis=0, ignore_index=True)

df_cert.shape

(38945, 20)

In [9]:
df_cert.bot.value_counts()

0    24700
1    14245
Name: bot, dtype: int64

In [10]:
df_cert.isna().sum()

bot                          0
created_at                   0
default_profile          30220
default_profile_image    34654
description              24536
favourites_count         19276
followers_count              0
friends_count                0
has_extended_profile     38945
id                           0
id_str                   38945
lang                     20276
listed_count             19276
location                 26097
name                     19277
screen_name              19276
status                   38945
statuses_count           19276
url                      30544
verified                 38934
dtype: int64

### Features 

__Existing features__: So far we have trained our models on the numerical features. The most important features are 'statuses_count' and 'favourites_count', followed by 'followers_count' and 'friends_count'. In the following I create new features based on the meta-data provided with the user accounts. The main difficulty in creating these features is that a large percentage of the underlying values are missing.
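
One possible way to handle the missing values when deriving length features is sketched below. The small frame and the `*_missing` and `len_*` column names are hypothetical; the idea is to keep missing entries as `NaN` and record an explicit indicator, so that a missing name is not confused with a very short one.

```python
import pandas as pd

# Tiny invented frame standing in for df_cert.
df = pd.DataFrame({
    "screen_name": ["alice_w", None, "b0t_4821"],
    "description": ["Coffee and code", None, None],
})

for col in ["screen_name", "description"]:
    # Flag missing values explicitly so a model can use that signal.
    df[f"{col}_missing"] = df[col].isna().astype(int)
    # .str.len() leaves NaN for missing entries instead of a spurious length.
    df[f"len_{col}"] = df[col].str.len()

print(df)
```

Tree-based models in recent scikit-learn versions may accept `NaN` inputs directly; otherwise the `NaN` lengths would need to be imputed before training.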

__Intended Features from meta-data__:

- Length of screen names 
- Length of description 
- Ratio of friends to twice the followers: friends / (2 × followers). It has been claimed that spambots have a high ratio value (i.e., lower ratio values indicate legitimate users); see [here](https://arxiv.org/pdf/1509.04098.pdf).

__Intended Features from tweets__:
- The content of spambots’ tweets exhibits the so-called message similarity. The score is higher for bots. 
- Sentiment features 
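
To make the message-similarity idea concrete, here is a minimal sketch that scores an account by the mean pairwise Jaccard similarity of the word sets of its tweets. Jaccard similarity is a stand-in chosen for simplicity, not necessarily the measure used in the cited paper, and the function names and example tweets are hypothetical.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the word sets of two tweets."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def message_similarity(tweets: list[str]) -> float:
    """Mean pairwise Jaccard similarity over one account's tweets."""
    pairs = list(combinations(tweets, 2))
    if not pairs:
        return 0.0  # fewer than two tweets: no pair to compare
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Invented examples: repetitive spam-like tweets vs. varied tweets.
bot_like = ["win a free phone now", "win a free prize now", "win a free phone today"]
human_like = ["great coffee this morning", "traffic was awful today", "new album is out"]

print(message_similarity(bot_like), message_similarity(human_like))
```

The repetitive account gets the higher score, which matches the expectation that bots score higher on message similarity.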


In [11]:
# There are several null values or blanks 
print(df_cert.screen_name.isna().sum(),
df_cert.description.isna().sum())

# Fill blanks with 0 (once cast to str later, these become '0', i.e. length 1)

df_cert.screen_name = df_cert.screen_name.fillna(0)
df_cert.description = df_cert.description.fillna(0)

19276 24536


In [12]:
# Features from meta-data

# Length of screen name 
df_cert.screen_name = df_cert.screen_name.astype('str')
df_cert['len_screen_name'] = df_cert.screen_name.apply(lambda x: len(x))

In [13]:
# All 19,276 null screen names were filled with 0 and cast to '0', so they show up with length 1

df_cert.len_screen_name.value_counts()

1     19276
15     3128
12     2994
13     2590
14     2460
11     2457
10     2023
9      1693
8      1130
7       748
6       332
5       102
4        10
3         2
Name: len_screen_name, dtype: int64

In [14]:
# Length of description
df_cert.description = df_cert.description.astype('str')  # filled nulls become '0', so their length is 1 in the next step
df_cert['len_desc'] = df_cert.description.apply(lambda x: len(x))

In [15]:
#df_cert.len_desc.value_counts()

In [None]:
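# Ratio of friends to twice the followers: friends / (2 * followers)
# Neither column has null values; zero-follower accounts are mapped to NaN to avoid division by zero
df_cert['friend_2xfollower_ratio'] = df_cert.friends_count / (2 * df_cert.followers_count.replace(0, np.nan))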



In [None]:
df_cert.friends_count

In [None]:
df_cert.head()