___

Topic: `Sentiment Analysis for Cyberbullying Detection on Twitter Data Fetching `

Date: `2023/05/03`

Programming Language: `python`

Main: `Natural Language Processing (NLP)`

___



### Cyber Bullying Data Scraping

In this notebook we are going to use the Twitter API to collect data for a sentiment classification task. We are going to use Python and the `tweepy` library, together with API keys obtained from [twitter dev](https://developer.twitter.com/en), to scrape the data from Twitter.


### Installation of `tweepy`

In the following code cell we are going to install the latest version of `tweepy`. This package allows us to interact with the Twitter API from Python.


In [1]:
!pip install tweepy --upgrade -q

### Why tweepy?

We can use libraries like `selenium` to scrape the data for our task but tweepy comes with the following advantages:

* Provides many features about a given tweet (e.g. information about a tweet’s geographical location, etc.)
* Easy to use.
* It uses the official Twitter API, which is the supported way of getting tweets from Twitter, especially for research purposes.
* It has well-written documentation.

However, tweepy comes with some limitations such as:

* The Twitter API limits the number of requests we can make in a given time window.


### Installing TextBlob

We are going to use TextBlob in this notebook to label tweets with a sentiment. Scraped tweets do not come labeled as `positive`, `negative` or `neutral`, so we have to label them ourselves with the help of the [TextBlob Library](https://textblob.readthedocs.io/en/dev/), a library for processing text in Python. We are going to use it to group our text as `positive`, `negative` or `neutral` based on its `polarity` value.


> TextBlob returns `polarity` and `subjectivity` of a sentence. Polarity lies between `[-1,1]`, `-1` defines a negative sentiment and `1` defines a positive sentiment. Negation words reverse the polarity. TextBlob has semantic labels that help with fine-grained analysis. For example — emoticons, exclamation mark, emojis, etc. Subjectivity lies between `[0,1]`. Subjectivity quantifies the amount of personal opinion and factual information contained in the text. The higher subjectivity means that the text contains personal opinion rather than factual information. TextBlob has one more parameter — intensity. TextBlob calculates subjectivity by looking at the ‘intensity’. Intensity determines if a word modifies the next word. For English, adverbs are used as modifiers (‘very good’).

We are only going to make use of the text `polarity` to create labels for our cyber bullying dataset.


In [2]:
!pip install textblob -q

### Helper Functions
Next let's install the `helperfns` package, a package that I created which provides some machine learning helper functions.

In [3]:
!pip install helperfns -q

### Importing packages

In the following code cell we are going to import the packages that we are going to use in this notebook for data collection, scraping and data cleaning.

In [4]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import tweepy as tp
import pandas as pd

import os
import json
import uuid
import textblob
import re


from textblob import TextBlob
from helperfns.text import de_contract

tp.__version__

'4.14.0'

### File System

We are going to store and load data from Google Drive, so we need to mount it. In the following code cell we are mounting Google Drive.

> Note that we are importing `files` and `drive` from `google.colab`. `files` allows us to interact with files in Google Colab and `drive` allows us to connect our Google Drive to the Colab instance.


In [5]:
from google.colab import drive, files

drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Paths

In the following code cell we are going to define the path where all our files will be stored in Google Drive. The main folder is called `Cyber Bullying on Twitter`; this is where all the files that we work with in this project are going to be stored.

In [6]:
base_dir = '/content/drive/My Drive/Cyber Bullying on Twitter'

assert os.path.exists(base_dir), f"The path '{base_dir}' does not exist, check if you have mounted the google drive."

### Getting Keys

In order to scrape data from Twitter using `tweepy` we need a [twitter-developer-account](https://developer.twitter.com/en), which can be created [here](https://developer.twitter.com/en). In order to get the keys you need to:

1. login/register to your twitter developer account
2. create an application
3. get the following keys from your app
    * `API_KEY`
    * `API_SECRET`
    * `ACCESS_TOKEN`
    * `ACCESS_TOKEN_SECRET`
    * `BEARER_TOKEN`
        
After getting these keys we are going to create a file called `keys.json`, which we are going to store in the `Cyber Bullying on Twitter` folder in our Google Drive. We put our API keys in this file instead of displaying them in this notebook, because we don't want to make the keys public: anyone who has them can use the API on our behalf. The `keys.json` file will look as follows:

```json
{
 "API_KEY" : "<YOUR_API_KEY>",
 "API_SECRET": "<YOUR_API_SECRET>",
 "ACCESS_TOKEN": "<YOUR_ACCESS_TOKEN>",
 "ACCESS_TOKEN_SECRET": "<YOUR_ACCESS_TOKEN_SECRET>",
 "BEARER_TOKEN": "<YOUR_BEARER_TOKEN>"
}
```

### Loading `KEYS`.

In the following code cell we are going to load `KEYS` from the external file `keys.json`. I'm loading the keys from an external file because I don't want them to be visible in this notebook. I'm also going to add the file name to the `.gitignore` file so that when this code is pushed to GitHub, `keys.json` won't be uploaded and no one will have access to our keys to use them on our behalf.

In [7]:
with open(os.path.join(base_dir, 'keys.json'), 'r') as reader:
    keys = json.load(reader)  # parse the JSON straight from the file handle
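
A missing or empty entry in `keys.json` would otherwise only surface later as a `KeyError` or as a failed authentication. The following is a minimal sketch of an early check, assuming the five key names listed above; `REQUIRED_KEYS` and `validate_keys` are hypothetical names that are not part of the original notebook, and the example values are placeholders.

```python
import json

# The five key names listed above; any other entries in the file are ignored.
REQUIRED_KEYS = (
    "API_KEY",
    "API_SECRET",
    "ACCESS_TOKEN",
    "ACCESS_TOKEN_SECRET",
    "BEARER_TOKEN",
)


def validate_keys(keys: dict) -> dict:
    """Return `keys` unchanged if every required entry is a non-empty string."""
    missing = [
        name for name in REQUIRED_KEYS
        if not isinstance(keys.get(name), str) or not keys[name].strip()
    ]
    if missing:
        raise KeyError(f"keys.json is missing or has empty values for: {missing}")
    return keys


# Usage with an in-memory example (placeholder values) instead of the real file.
example = json.loads(
    '{"API_KEY": "a", "API_SECRET": "b", "ACCESS_TOKEN": "c", '
    '"ACCESS_TOKEN_SECRET": "d", "BEARER_TOKEN": "e"}'
)
validated = validate_keys(example)
```

In the notebook this could be called on the loaded `keys` dictionary right after reading the file.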

### Keys Type
In the following code cell we are going to create a class called `Keys`. This class stores our keys as class attributes so that we can access them by name.

In [8]:
class Keys:
    API_KEY             = keys['API_KEY']
    API_SECRET          = keys['API_SECRET']
    ACCESS_TOKEN        = keys['ACCESS_TOKEN']
    ACCESS_TOKEN_SECRET = keys['ACCESS_TOKEN_SECRET']
    BEARER_TOKEN        = keys['BEARER_TOKEN']

### Creating `client` object

We are going to authenticate by creating a `client` object and passing the following values to it:

* `BEARER_TOKEN`
* `API_KEY`
* `API_SECRET`
* `ACCESS_TOKEN`
* `ACCESS_TOKEN_SECRET`

We are also going to set `wait_on_rate_limit` to `True`, so that tweepy waits for the rate limit to reset instead of raising an error when the limit is reached.

In [9]:
client = tp.Client(Keys.BEARER_TOKEN, Keys.API_KEY, Keys.API_SECRET, Keys.ACCESS_TOKEN, Keys.ACCESS_TOKEN_SECRET, wait_on_rate_limit=True)

### Querying data on Twitter

We are going to scrape tweets that are related to the following categories:

1. politics
2. sports
3. education

So our query (`q`) will be a keyword search that includes one of those `categories`.


### Why searching for those categories on twitter?

We are using the mentioned categories because these are the topics where we expect cyber bullying to occur the most on Twitter. We are using more than one category because we want to get as much data as possible.

So in our search using the `twitter` api we are going to search tweets that are:

1. in English
2. world-wide
3. not retweets

### Why not get tweets based on a geographical location, for example Zimbabwe only?

The reason why we are not doing this is that we want a wide range of tweet text to make our dataset as big as possible. Another reason is that we are focusing on cyber bullying, and cyber bullying text looks largely the same everywhere, so this won't have much of a negative effect on our Natural Language Processing task.

We are going to set the count and use the tweepy `Paginator` class to retrieve up to `COUNT` tweets.

In [10]:
COUNT = 500_000
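
To get a feel for what this count means in terms of API calls, here is a rough back-of-the-envelope sketch. It assumes `100` tweets per request, matching the `max_results=100` used further down. The rate-limit figure of `450` requests per 15-minute window is purely an illustrative assumption, so check the limits of your own access tier. `pages_needed` and `estimated_minutes` are hypothetical helpers, not part of tweepy.

```python
import math


def pages_needed(limit: int, page_size: int = 100) -> int:
    """Number of API requests needed to retrieve `limit` tweets."""
    return math.ceil(limit / page_size)


def estimated_minutes(pages: int, requests_per_window: int, window_minutes: int = 15) -> int:
    """Rough lower bound on waiting time under a fixed-window rate limit."""
    windows = math.ceil(pages / requests_per_window)
    # The first window starts immediately, so only the remaining windows add waiting time.
    return (windows - 1) * window_minutes


pages = pages_needed(500_000, page_size=100)  # 5000 requests
# 450 requests per window is an assumed placeholder, not a documented limit.
waiting = estimated_minutes(pages, requests_per_window=450)
print(f"{pages} requests, at least {waiting} minutes of rate-limit waiting")
```

This is why `wait_on_rate_limit=True` matters for a collection of this size: a single run can span several rate-limit windows.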

Defining a list of categories that we will get data from.

In [11]:
# -is:retweet means I don't want retweets and lang:en asks for the tweets to be in English
categories = ['politics -is:retweet lang:en', 'sports -is:retweet lang:en', 'education -is:retweet lang:en']
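
The three query strings share the same filter suffix, and the same suffix is stripped again later to build the `.csv` file name. As a small sketch, the queries could be built from the bare topic names. `build_query` and `topic_from_query` are hypothetical helpers introduced here for illustration only.

```python
# Operators shared by every query: exclude retweets and keep English tweets only.
RETWEET_AND_LANGUAGE_FILTER = "-is:retweet lang:en"


def build_query(topic: str) -> str:
    """Append the shared filter operators to a bare topic keyword."""
    return f"{topic} {RETWEET_AND_LANGUAGE_FILTER}"


def topic_from_query(query: str) -> str:
    """Recover the bare topic, e.g. to use it as a file name."""
    return query.replace(f" {RETWEET_AND_LANGUAGE_FILTER}", "")


topics = ["politics", "sports", "education"]
queries = [build_query(topic) for topic in topics]
print(queries)
```

Keeping the suffix in one place means the query and the file name cannot drift apart if the filter is changed.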

We will get the tweets for each category one by one and save them in their respective `.csv` files, which we will merge at the end of this notebook to create a `giant` `csv` file with cleaned text and labels.

In [32]:
tweets = list()
category = categories[2]
print(f"Getting tweets of: {category}")
# Paginator follows the pagination tokens for us; flatten() yields individual tweets up to `limit`
_tweets = tp.Paginator(client.search_recent_tweets, query=category, max_results=100).flatten(limit=COUNT)
for tweet in _tweets:
    tweets.append(tweet.text)

Getting tweets of: education -is:retweet lang:en


TwitterServerError: ignored
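
The cell above was interrupted by a `TwitterServerError`, yet the tweets appended before the error are still in `tweets`, which is why the following cells work. A minimal sketch of making that behaviour explicit is shown below. `collect_texts` and `flaky_source` are hypothetical names, and the broad `except Exception` is an assumption made to keep the sketch self-contained; in the notebook it would be narrowed to the tweepy error that was raised.

```python
from typing import Callable, Iterable, List


def collect_texts(
    items: Iterable,
    get_text: Callable = lambda item: item.text,
    limit: int = 1000,
) -> List[str]:
    """Collect up to `limit` texts, keeping whatever was gathered if the source fails."""
    collected: List[str] = []
    try:
        for item in items:
            if len(collected) >= limit:
                break
            collected.append(get_text(item))
    except Exception as error:  # assumption: narrow this to the tweepy server error in practice
        print(f"Stopped early after {len(collected)} items: {error!r}")
    return collected


def flaky_source():
    """Stand-in for a paginated API iterator that fails part-way through."""
    yield "first tweet"
    yield "second tweet"
    raise RuntimeError("simulated server error")


partial = collect_texts(flaky_source(), get_text=lambda text: text)
print(partial)
```

With this pattern the partial result can still be cleaned and saved, and the query can be re-run later for more data.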

### Checking a single `tweet` example
In the following code cell we are going to check a single tweet example.

In [33]:
tweets[0]

"investment in critical areas such as education, healthcare, and infrastructure, which can have long-term negative effects on a country's development.\n\nIn conclusion, electing a drug baron as a president is a dangerous proposition that can have severe consequences for a country's"



### Features (text) Cleaning

Our tweets were obtained by scraping. If we look at our example texts we can see that they contain hashtags (`#`), mentions (`@user`), numbers (`123`), URLs (`http://google.com/whatever`), etc. These things do not add any meaning to our text. We are going to create a preprocessing function that removes all hashtags, mentions, URLs and numbers, and expands words like `I'm` to `I am`, so that our text is clean. We are also going to remove single letters that mean nothing and convert every word to lowercase. For example, let's have a look at the following sentence:

```
I'm working with a model downloaded on http://google.com/whatever #100daysofcode created using AI by @username5 e h in 2015.
```

When we clean this text we want it to look as follows:

```
i am working with a model downloaded on created using ai by in
```

The above sentence does not make much sense to a human being, but it does to a deep learning model, because deep learning models learn the context of the words in a sentence rather than the meaning of the sentence.


### Is text cleaning going to improve model metrics?

Cleaning features, a preprocessing step that comes before feature extraction, is a very important step when building machine learning models. It helps us reduce noise in our features so that, instead of focusing on numbers, hashtags and mentions, our model focuses on the text, which is what matters. In NLP we have an important concept called the `vocabulary`, which I'm going to explain in more detail later in the model training notebook. For now, we must know that text cleaning reduces the size of the vocabulary and improves the model training speed.
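
To illustrate the vocabulary claim, here is a small sketch that counts the unique whitespace-separated tokens before and after cleaning. `simple_clean` is a simplified, hypothetical stand-in for the cleaning steps described above (it does not expand contractions), and the two example tweets are made up.

```python
import re


def simple_clean(text: str) -> str:
    """Simplified stand-in for the cleaning steps described above."""
    text = text.lower()
    text = re.sub(r"https?\S+", " ", text)  # URLs
    text = re.sub(r"[@#]\w+", " ", text)    # mentions and hashtags
    text = re.sub(r"\d", " ", text)         # digits
    text = re.sub(r"[^\w\s']", " ", text)   # punctuation except apostrophes
    return re.sub(r"\s+", " ", text).strip()


def vocabulary(texts) -> set:
    """Set of unique whitespace-separated tokens across all texts."""
    return {token for text in texts for token in text.split()}


# Made-up example tweets that differ only in casing, punctuation, tags and links.
raw_tweets = [
    "Great game today! #sports @fan1 http://example.com/a",
    "great GAME today, @fan2 #Sports http://example.com/b",
]
raw_vocab = vocabulary(raw_tweets)
clean_vocab = vocabulary(simple_clean(tweet) for tweet in raw_tweets)
print(len(raw_vocab), len(clean_vocab))
```

Both tweets collapse to the same three words after cleaning, so the model sees one consistent token per word instead of many noisy variants.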

In [34]:
def clean_sentence(sent: str) -> str:
    """
    Clean a raw tweet so that only lowercase words remain.

    Args:
        sent (str): an uncleaned sentence with text, punctuation, numbers, urls, mentions and hashtags

    Returns:
        str: the cleaned, lowercased sentence with contractions expanded
    """
    sent = sent.lower() # converting the text to lower case
    # removing emails first, otherwise the mention pattern below would strip the "@domain" part
    sent = re.sub(r"([a-z0-9]+[._-])*[a-z0-9]+@[a-z0-9-]+(\.[a-z]{2,})+", " ", sent)
    sent = re.sub(r'https?\S+', ' ', sent) # removing url's
    sent = re.sub(r'[@#]\w+', ' ', sent) # removing hashtags and mentions (there's no perfect regular expression for this, but this will try)
    sent = re.sub(r'\d', ' ', sent) # removing digits
    sent = re.sub(r'[^\w\s\']', ' ', sent) # removing punctuation except for "'" in words like i'm
    sent = re.sub(r'\s+', ' ', sent).strip() # collapsing repeated whitespace
    words = list()
    for word in sent.split(' '):
        if len(word) == 1 and word not in ['a', 'i']:
            continue # dropping single letters that carry no meaning
        words.append(de_contract(word)) # replace words like "i'm -> i am"
    return " ".join(words)

clean_sentence("I'm working with a model downloaded on http://google.com/whatever #100daysofcode created using AI by @username5 e h in 2015.")

'i am working with a model downloaded on created using ai by in'

> TextBlob’s output for a polarity task is a float within the range `[-1.0, 1.0]`, where `-1.0` is a negative polarity and `1.0` is positive. This score can also be equal to `0`, which stands for a neutral evaluation of a statement, for example when it doesn’t contain any words from TextBlob's sentiment lexicon.

In [35]:
def create_label(text: str) -> str:
    """
    Label a text as positive, negative or neutral based on its TextBlob polarity.

    Later on we will customize the neutral class based on an absolute threshold to balance the data.
    """
    blob = TextBlob(text)
    polarity = blob.sentiment.polarity
    if polarity == 0.0:
        return "neutral"
    elif polarity < 0.0:
        return "negative"
    else:
        return "positive"

In [36]:
create_label("This is lovely"), create_label("I'm not saying anything"), create_label("this is boring")

('positive', 'neutral', 'negative')
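
The note in the labelling cell mentions that the neutral class will later be customized with an absolute threshold. One way this could look is sketched below. `label_from_polarity` and the default threshold of `0.1` are illustrative assumptions, not values taken from this notebook. With `threshold=0.0` the function behaves like `create_label` above.

```python
def label_from_polarity(polarity: float, threshold: float = 0.1) -> str:
    """Map a polarity in [-1, 1] to a label, treating |polarity| <= threshold as neutral."""
    if not -1.0 <= polarity <= 1.0:
        raise ValueError(f"polarity must be within [-1, 1], got {polarity}")
    if abs(polarity) <= threshold:
        return "neutral"
    return "negative" if polarity < 0 else "positive"


# Weakly polarized scores fall into the neutral band when a threshold is used.
for score in (-0.5, -0.05, 0.0, 0.05, 0.5):
    print(score, label_from_polarity(score), label_from_polarity(score, threshold=0.0))
```

Widening the neutral band moves weakly positive and weakly negative tweets into the neutral class, which is one lever for balancing the three labels.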

### Clean Tweets

We are then going to clean the text of our tweets, give them class labels and save the data in a `csv` file. Our `csv` file will have the following columns:

1. `text` - the tweet cleaned text

2. `label` - the class label which can be `positive`, `negative` or `neutral`



In [37]:
cleaned_tweets = list()
for twt in tweets:
    twt_txt = clean_sentence(twt)  # cleaned tweet text
    twt_label = create_label(twt_txt)  # positive, negative or neutral
    cleaned_tweets.append((twt_txt, twt_label))

In [38]:
len(cleaned_tweets)

28092
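
Since the labels are generated automatically, it can help to check how balanced they are before saving. The sketch below counts labels in a list of `(text, label)` tuples shaped like `cleaned_tweets`. `label_distribution` is a hypothetical helper and `sample_rows` is a tiny illustrative sample, not the real data.

```python
from collections import Counter


def label_distribution(rows):
    """Return {label: (count, percentage)} for a list of (text, label) tuples."""
    counts = Counter(label for _, label in rows)
    total = sum(counts.values())
    return {
        label: (count, round(100 * count / total, 1))
        for label, count in counts.most_common()
    }


# Tiny illustrative sample in the same shape as `cleaned_tweets`.
sample_rows = [
    ("great collaboration", "positive"),
    ("expect delays ahead", "neutral"),
    ("this is boring", "negative"),
    ("this is lovely", "positive"),
]
distribution = label_distribution(sample_rows)
print(distribution)
```

In the notebook the same function could be called on `cleaned_tweets` to see whether one class dominates.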

### Columns
In the following code cell we are going to define the column names that we need in our `.csv` file as follows.

In [39]:
columns = np.array([ 'text', 'label'])

### Dataframe
In the following cell we are going to create a dataframe from our cleaned tweets, which will later be saved as `[category].csv`, and check the first `10` rows of our data as follows:

> Note that the `category` is dynamic: it's either `education`, `sports` or `politics`.

In [40]:
dataframe = pd.DataFrame(cleaned_tweets, columns=columns, index=None)
dataframe.head(10)

| | text | label |
|---|---|---|
| 0 | investment in critical areas such as education... | negative |
| 1 | is there value in creating central teaching id... | positive |
| 2 | chegg chgg after the education company says ch... | neutral |
| 3 | the response to the phil beadle interview is e... | positive |
| 4 | non childish elder public infrastructure is no... | negative |
| 5 | the winners of the eighth arts for competition... | positive |
| 6 | great collaboration between education govt and... | positive |
| 7 | okay i am not happy about the issue i was one ... | positive |
| 8 | and disadvantage called intelligence the aim o... | positive |
| 9 | 'putting kids first' legislation clears north ... | positive |


In [41]:
file_name = f"{category.replace(' -is:retweet lang:en', '')}.csv" # politics.csv, education.csv or sports.csv
save_path = os.path.join(base_dir, file_name) 
dataframe.to_csv(save_path)
print(f"Done: {file_name}")

Done: education.csv


> Note that we need to repeat the process of getting the data for each of the mentioned categories before we proceed to the code cells that follow.


We now have the following files in our Google Drive:

1. `sports.csv`
2. `education.csv`
3. `politics.csv`

We can go ahead and merge these csv files to create a giant `csv` file that we will use to train the model.

### Saving our data 

We are going to merge the three files, shuffle the rows and save our cleaned data in a `csv` file named `clean_tweets.csv`, using the pandas dataframe method `to_csv`, as follows:

In [51]:
df_sports = pd.read_csv(os.path.join(base_dir, 'sports.csv'))
df_education = pd.read_csv(os.path.join(base_dir, 'education.csv'))
df_politics = pd.read_csv(os.path.join(base_dir, 'politics.csv'))
giant_df = pd.concat([df_sports, df_education, df_politics], ignore_index=True)
giant_df.drop(['Unnamed: 0'], axis=1, inplace=True)
# shuffle values
giant_df = giant_df.sample(frac = 1)
giant_df.head(10)

| | text | label |
|---|---|---|
| 112052 | time for a full criminal investigation into po... | negative |
| 38453 | ew your personality is politics | neutral |
| 176574 | bjp is politics has now come down to foul crie... | negative |
| 113079 | history low step individual politics life fast... | positive |
| 100130 | in medieval times using god for dynastic polit... | positive |
| 85701 | naive idealism and naive overreaction are the ... | negative |
| 38256 | the defence ministry mapn is looking for over ... | neutral |
| 1871 | failed in men sports try women | negative |
| 76174 | expect delays ahead | neutral |
| 197582 | trumpism isn politics it extortion and also | neutral |


In [52]:
save_path = os.path.join(base_dir, 'clean_tweets.csv')
giant_df.to_csv(save_path, index=False)
print("Done")

Done


### Next

In the next notebook we are going to create an `Artificial Neural Network (ANN)` model that classifies text by sentiment. Given a text from the user, our model will classify its sentiment as positive, negative or neutral.

### Credits

1. [tweepy docs](https://docs.tweepy.org/en/stable/authentication.html)
2. [textblob-docs](https://textblob.readthedocs.io/en/dev/)
3. [towardsdatascience](https://towardsdatascience.com/how-to-access-data-from-the-twitter-api-using-tweepy-python-e2d9e4d54978)
4. [dev.to](https://dev.to/twitterdev/a-comprehensive-guide-for-using-the-twitter-api-v2-using-tweepy-in-python-15d9)