# Cyberbullying Detection App


Bullying has shifted from school recess encounters to the boundless frontiers of the digital world. Victims can be targeted whenever and wherever they are. Bullies, on the other hand, can hide behind their computer screens and avoid confronting their victims face-to-face; they can also hide behind false avatars and fake user names.

The Nemours Foundation reports that cyberbullying's effects are always negative, and some cyberbullying victims experience catastrophic effects such as self-harm, depression and anxiety, and even suicide.

Twitter, the well-known microblogging platform, has become a daily essential, with over 330 million monthly active users posting 5,787 tweets every second (source). However, with each passing day the platform is becoming more of a playground for bullies, as hurtful tweets target individuals, especially teens in their fragile growing years. It is estimated that around 15,000 bullying-related tweets are posted every day (source). Furthermore another research suggests (source)

This app was created as an attempt to tackle the cyberbullying phenomenon on Twitter: the goal is to develop a tool that can identify tweets containing bullying content and flag their occurrence. The tool could be scaled up to trigger immediate follow-up actions.

In [1]:
from __future__ import division  # __future__ imports must come before any other statement
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
import operator
import numpy as np
import sys
import matplotlib.pyplot as plt
import pickle
from collections import Counter, defaultdict
from math import log,isinf
import csv
import codecs
import re
import random
import json
import nltk
from lib import *

#### The first step is to fetch Twitter data in real time so that we can analyze it and obtain the results we are looking for. We will use two different tools:

• Scraper: this tool reads the data from Twitter in real time.

• Analyzer: this tool listens to the tweets obtained by the Scraper, performs the analysis and delivers results.

In [2]:
def main():
    conf = SparkConf().setMaster("local[2]").setAppName("Streamer")
    sc = SparkContext(conf=conf)
    # Creating a streaming context with batch interval of 10 sec
    ssc = StreamingContext(sc, 10)
    ssc.checkpoint("checkpoint")
    
    filename = 'datasets/new_data.csv'
    target_class = '1'
    
    # Load the data
    [train_data, dev_data, test_data] = load_data_from_csv(filename)
    
    # Train the classifier
    classifier = train_classifier(train_data, target_class)
    
    # Evaluate the classifier   
    in_class = []
    not_in_class = []
    
    for datum in dev_data:
        if classify(classifier, datum):
            in_class.append(datum)
        else:
            not_in_class.append(datum)
    evaluate(target_class, in_class, not_in_class)
    
    filename = 'finalized_model.sav'
    pickle.dump(classifier, open(filename, 'wb'))
    
    stream2(ssc, 30)
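
The `evaluate` helper comes from `lib` and is not shown in this notebook. As a rough sketch of how accuracy, precision and recall could be derived from the `in_class` / `not_in_class` split built above, the hypothetical `evaluate_sketch` below works on plain label strings rather than `Datum` objects (an assumption made to keep the example self-contained):

```python
def evaluate_sketch(target_class, in_class, not_in_class):
    """Return (accuracy, precision, recall) as percentages.

    in_class / not_in_class hold the gold labels of the items the classifier
    accepted / rejected. Hypothetical simplification: the notebook passes
    Datum objects to lib.evaluate, whose real implementation is not shown.
    """
    tp = sum(1 for label in in_class if label == target_class)      # accepted, truly in class
    fp = len(in_class) - tp                                         # accepted, not in class
    fn = sum(1 for label in not_in_class if label == target_class)  # rejected, truly in class
    tn = len(not_in_class) - fn                                     # rejected, not in class

    total = tp + fp + fn + tn
    accuracy = 100.0 * (tp + tn) / total if total else 0.0
    precision = 100.0 * tp / (tp + fp) if (tp + fp) else 0.0
    recall = 100.0 * tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Toy example: four items flagged as bullying, two items not flagged
acc, prec, rec = evaluate_sketch('1', ['1', '0', '0', '0'], ['1', '0'])
print("Accuracy:  %.1f" % acc)
print("Precision: %.1f" % prec)
print("Recall:    %.1f" % rec)
```

Low precision with higher recall, as in this toy split, means the classifier flags many harmless items while still catching a fair share of the true positives.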
    

### The Scraper
The scraper runs four tasks:

• Opens a socket

• Connects to the Twitter Streaming API

• Reads stream tweets from the Twitter streaming API

• Publishes/sends tweets to the port/socket in JSON format

In order to read tweets, we first set variables containing the user credentials needed to access the Twitter API.
We then specify a port/socket (which works as a channel) to publish the tweets to, so that the Analyzer can read the streaming data from the same socket. We pick an arbitrary port, bind a socket to localhost on that port, and start streaming.
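
The Scraper code itself is not included in this notebook. The publishing half of it might look roughly like the sketch below, which uses only the standard library. The function names (`encode_tweet`, `publish`, `serve_once`) are hypothetical, the tweets are assumed to arrive as dictionaries from whatever Twitter client is used, and port 9092 is taken from the `stream2` cell:

```python
import json
import socket


def encode_tweet(tweet):
    """Serialize one tweet (a dict) as a single newline-terminated JSON line."""
    # json.dumps escapes embedded newlines, so each tweet stays on one line
    return (json.dumps(tweet) + "\n").encode("utf-8")


def publish(conn, tweets):
    """Send every tweet from an iterable over an already-connected socket."""
    for tweet in tweets:
        conn.sendall(encode_tweet(tweet))


def serve_once(tweets, host="localhost", port=9092):
    """Bind a socket, wait for one client (e.g. the Analyzer), publish, close."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(1)
    conn, _addr = server.accept()  # blocks until a client connects
    try:
        publish(conn, tweets)
    finally:
        conn.close()
        server.close()


# Usage (blocks until a client connects to localhost:9092):
# serve_once([{"text": "first tweet"}, {"text": "second tweet"}])
```

Sending one JSON object per line keeps the framing simple: the reader can split the stream on `\n` and parse each line independently.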

In [3]:
def stream2(ssc, duration):

    dataStream = ssc.socketTextStream("localhost",9092)
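    # NOTE: window() returns a new DStream; the result is not assigned here,
    # so the transformations below run on the un-windowed dataStream.
    dataStream.window(60)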

    tweets = dataStream.flatMap(lambda line:line.split('\n'))
    print('-----------------------------------------')
    print('Start of Twitter Feed')
    print('-----------------------------------------')
    tweets.foreachRDD(rdd_classify)
     
    # Start the computation
    ssc.start()
    
    ssc.awaitTerminationOrTimeout(duration)
    ssc.stop(stopGraceFully = True)
    
    return 

### The Analyzer
The Analyzer uses Spark to:

• Bind to the specified socket

• Listen to the tweets published by the scraper by reading data sent to the socket

• Preprocess the Tweets

• Count “Bullying” versus “Not Bullying” tweets.

• Push the results for visualization

We define the following “Analyzer” parameters:

• The batch interval, which specifies how frequently (in seconds) we want to update the streaming data, as well as the window duration.

• The SparkContext object that needs to be instantiated since it is the main entry point for Spark functionality, and the StreamingContext. Spark Streaming is an extension of the core Spark API that enables stream processing of live data streams. Given that we will perform some RDD transformations, we also start checkpoints that save the generated RDDs to a reliable storage.

• A DStream that will connect to a hostname and a port (the same as the one we had defined on the Scraper file). We use the “window” to apply transformations over a sliding window of data (windows length and sliding window defined in the initial parameters).

The classifier model is currently a part of the Analyzer application, and it trains on a set of labeled text statements (bullying vs. non-bullying).

The Analyzer then prints out all tweets, displaying either “Bully Alert!” below offending tweets or “Nothing to see here” below regular, harmless tweets.

Ideally, a count function applying reduceByKey would sum the number of “bullying” tweets per batch, window or overall. Specific thresholds could be set to trigger various events and a plot could be generated to show the number and trend of offending tweets.
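
The counting step described above is not implemented in this notebook. As a plain-Python stand-in for the per-batch tally (not PySpark itself), the sketch below counts labels and checks a threshold. The names `count_labels` and `should_alert` and the threshold value are hypothetical. The comment shows how the same tally could be expressed with `reduceByKey` on an RDD of labels:

```python
from collections import Counter


def count_labels(labels):
    """Tally labels for one batch: equivalent to mapping each label to
    (label, 1) and summing the ones per key."""
    counts = Counter()
    for label in labels:
        counts[label] += 1
    return counts


def should_alert(counts, threshold):
    """Return True once the number of 'Bullying' labels reaches the threshold."""
    return counts["Bullying"] >= threshold


batch = ["Bullying", "Not Bullying", "Not Bullying", "Bullying", "Bullying"]
counts = count_labels(batch)

# In PySpark, assuming labels_rdd is an RDD of label strings:
#   labels_rdd.map(lambda label: (label, 1)).reduceByKey(operator.add)

if should_alert(counts, threshold=3):
    print("Threshold reached: %d bullying tweets in this batch" % counts["Bullying"])
```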

In [4]:
def rdd_classify(rdd):
   
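    # Load the trained model and the bad-word list; the files are closed automatically
    with open('finalized_model.sav', 'rb') as model_file:
        classifier = pickle.load(model_file)
    with open('datasets/badwords.txt', 'r') as badwords_file:
        badwords = nltk.Text(nltk.word_tokenize(badwords_file.read()))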
   

    for tweet in rdd.collect():       
        if tweet != '':
            
            tweet_words = nltk.Text(nltk.word_tokenize(tweet))
            if classify(classifier, Datum(tweet, '1')):
                # Second step: confirm the classifier's verdict against the bad-word list
                if any(i in tweet_words for i in badwords.tokens):
                    print(tweet)
                    print("Bully Alert!")
                else:
                    print(tweet)
                    print("Nothing to see here")
            else:
                print(tweet)
                print("Nothing to see here")
            print('-----------------------------------------')        


In [5]:
def featurize(datum):
    features = []
    last_word = '^'
    for word in lower(datum):
        features.append(word)
        features.append(last_word + "_" + word)
        last_word = word
    return set(features)
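
To make the feature set concrete, the sketch below applies the same unigram-plus-bigram scheme to a plain list of tokens. It assumes that `lower(datum)` from `lib` yields lowercased word tokens. That helper is not shown here, so `featurize_tokens` and the whitespace tokenization are illustrative stand-ins:

```python
def featurize_tokens(tokens):
    """Build unigram and bigram features from a list of lowercased tokens.

    '^' marks the start of the text, so the first bigram is '^_<first word>'.
    """
    features = set()
    last_word = "^"
    for word in tokens:
        features.add(word)                    # unigram feature
        features.add(last_word + "_" + word)  # bigram feature
        last_word = word
    return features


features = featurize_tokens("You are so mean".lower().split())
print(sorted(features))
# ['^_you', 'are', 'are_so', 'mean', 'so', 'so_mean', 'you', 'you_are']
```

Because the result is a set, a word or bigram that appears several times in one tweet counts only once for that tweet.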

In [6]:
def train_classifier(data, class_of_interest):
    
    # 1. Collect count(c)
    # The number of training examples inside and outside the class of interest.
    total_counts = Counter()
    for datum in data:
        
        if datum.answer() == class_of_interest:
            total_counts[True] += 1
        else:
            total_counts[False] += 1
    
    # 2. Compute p(c)
    # The probability of each label. This should mirror total_counts.
    total_probs = Counter()
    total_probs[True] = total_counts[True] / (total_counts[True] + total_counts[False])
    total_probs[False] = total_counts[False] / (total_counts[True] + total_counts[False])
    
    # 3. Collect count(f | c)
    true_counts = Counter()
    false_counts = Counter()
   
    # For each tweet in our dataset...
    
    for datum in data:
        features = featurize(datum)
        for feature in features:
            if datum.answer() == class_of_interest:
                true_counts[feature] += 1
            else:
                false_counts[feature] += 1
  
    # Add an UNK count
    true_counts['__UNK__'] = 1
    false_counts['__UNK__'] = 1
    
    # Smooth the counts (add 0.1 fake counts to each feature)
    features = set(true_counts + false_counts)
    
    for feature in features:
        true_counts[feature] += 0.1
        false_counts[feature] += 0.1
        
    # 4. Compute p(f | c)
    true_probs = Counter()
    false_probs = Counter()
    # p(f | c) = count(f, c) / count(c)
    
    for feature in true_counts:
        true_probs[feature] = true_counts[feature] / total_counts[True]
    for feature in false_counts:
        false_probs[feature] = false_counts[feature] / total_counts[False]
    #print(total_probs)
    # 5. Return the model
    return [total_probs, true_probs, false_probs]

In [7]:
def classify(model, datum):
    # Unpack the model
    [total_probs, true_probs, false_probs] = model
    # Start the log scores at 0.0
    true_score = 0.0
    false_score = 0.0

    # Featurize the input
    features = featurize(datum)
    
    # Multiply in p(c): in log space, multiplication becomes addition
    true_score += log(total_probs[True])
    false_score += log(total_probs[False])
    
    # Multiply in p(f | c) for each f
    for feature in features:
        if feature in true_probs:
            true_score += log(true_probs[feature])
        else:
            true_score += log(true_probs['__UNK__'])
        if feature in false_probs:
            false_score += log(false_probs[feature])
        else:
            false_score += log(false_probs['__UNK__'])
            
    # Some error checking
    if isinf(true_score) or isinf(false_score):
        print ("WARNING: either true_score or false_score is infinite")
    
    # Return the most likely class.
    #print(true_score > false_score)
    return true_score > false_score
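
As a small worked example of the scoring rule above, the sketch below compares the two log scores for a hand-written toy model. The probabilities are made up for illustration and `log_score` is a hypothetical helper, not part of the notebook:

```python
from math import log


def log_score(prior, feature_probs, features):
    """Sum log p(c) and log p(f | c) for every feature, falling back to the
    '__UNK__' probability for features not seen in training."""
    score = log(prior)
    for feature in features:
        score += log(feature_probs.get(feature, feature_probs["__UNK__"]))
    return score


# Toy model: 'mean' is much more likely in the bullying class
prior = {True: 0.5, False: 0.5}
true_probs = {"mean": 0.6, "you": 0.5, "__UNK__": 0.1}
false_probs = {"mean": 0.1, "you": 0.5, "__UNK__": 0.1}

features = {"you", "mean", "today"}  # 'today' is unseen, so it maps to __UNK__
true_score = log_score(prior[True], true_probs, features)
false_score = log_score(prior[False], false_probs, features)

is_bullying = true_score > false_score
print(is_bullying)
```

Here the two scores differ only in the term for `mean`, so the gap between them is log(0.6) - log(0.1) = log(6), and the tweet is assigned to the bullying class.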

  

### Results
The raw tweet stream is displayed in one window. In another window, once a tweet is classified as containing bullying content by the two-step classification process (the trained classifier followed by the bad-word check), a “Bully Alert!” message is printed right below the tweet.

In [8]:
if __name__=="__main__":
    try:
        main()
    except Exception as e:
        # Report the failure instead of silently swallowing it
        print("Error: %s" % e)



Accuracy:  50.5
Precision: 31.7
Recall:    65.1
-----------------------------------------
Start of Twitter Feed
-----------------------------------------
The pocket of my jacket accommodates my Leuchtturm1917 A5 notebook brilliantly. https://t.co/6DLyzHuUKF
Nothing to see here
-----------------------------------------
Today:
Nothing to see here
-----------------------------------------
Fashion blitz...
Nothing to see here
-----------------------------------------
Enjoy!
Nothing to see here
-----------------------------------------
#TrendyTuesday #HighLevel
Nothing to see here
-----------------------------------------
@BaddieTwinz @slaytvnow @HIMpodcast Right here sis.. 😘 https://t.co/jWZtpKSaXP
Nothing to see here
-----------------------------------------
@HandsomeAssh0le I be telling my own brain stfu cus of that shit lmaoooo
Bully Alert!
-----------------------------------------
Working in the city and having so many fucking Starbucks around really has me fucked up. #GOODMORNING 🌎
Bu