___

Topic: `A Community-based Real-Time Service Delivery Sentiment Analysis Data Fetching.`

Date: `2022/06/16`

Programming Language: `python`

Main: `Natural Language Processing (NLP)`

___



### Service Delivery SA Data Scraping

In this notebook we are going to use the Twitter API to collect data for a sentiment classification task. We are going to use the Python programming language to collect the data from Twitter, using `tweepy` together with API keys.


### Installation of `tweepy`
In the following code cell we are going to install the latest version of `tweepy`. This package allows us to interact with the Twitter API from Python.


In [None]:
!pip install tweepy --upgrade

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting tweepy
  Downloading tweepy-4.10.0-py3-none-any.whl (94 kB)
     |████████████████████████████████| 94 kB 3.5 MB/s 
Collecting requests<3,>=2.27.0
  Downloading requests-2.28.1-py3-none-any.whl (62 kB)
     |████████████████████████████████| 62 kB 1.8 MB/s 
Installing collected packages: requests, tweepy
  Attempting uninstall: requests
    Found existing installation: requests 2.23.0
    Uninstalling requests-2.23.0:
      Successfully uninstalled requests-2.23.0
  Attempting uninstall: tweepy
    Found existing installation: tweepy 3.10.0
    Uninstalling tweepy-3.10.0:
      Successfully uninstalled tweepy-3.10.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-colab 1.0.0 requires requests~=2.23.0, but you have requests 2.28.1 which

### Why tweepy?

We could use libraries like `selenium` to scrape the data for our task, but tweepy comes with the following advantages:

* It provides many features of a given tweet (e.g. information about a tweet’s geographical location).
* It is easy to use.
* It uses the official Twitter API, which is the supported way of getting tweets from Twitter, especially for research purposes.
* It has well-written documentation.

However, tweepy comes with some limitations, such as:

* The Twitter API limits the number of requests that can be made in a given time window.


### Installing TextBlob

We are going to use TextBlob in this notebook to label the sentiment of each tweet. Tweets collected from Twitter are not labelled `positive`, `negative` or `neutral`, so we have to label them ourselves with the help of the [TextBlob Library](https://textblob.readthedocs.io/en/dev/), a Python library for processing text. We are going to use this library to group our text by `polarity` value into `positive`, `negative` or `neutral`.


> TextBlob returns `polarity` and `subjectivity` of a sentence. Polarity lies between `[-1,1]`, `-1` defines a negative sentiment and `1` defines a positive sentiment. Negation words reverse the polarity. TextBlob has semantic labels that help with fine-grained analysis. For example — emoticons, exclamation mark, emojis, etc. Subjectivity lies between `[0,1]`. Subjectivity quantifies the amount of personal opinion and factual information contained in the text. The higher subjectivity means that the text contains personal opinion rather than factual information. TextBlob has one more parameter — intensity. TextBlob calculates subjectivity by looking at the ‘intensity’. Intensity determines if a word modifies the next word. For English, adverbs are used as modifiers (‘very good’).

We are only going to make use of the text `polarity` and create labels for our dataset.


In [None]:
!pip install textblob

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


### Importing packages
In the following code cell we are going to import packages that we are going to use in this notebook.

In [None]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import tweepy as tp
import pandas as pd

import os
import re
import json
import uuid
import textblob
import nltk

from textblob import TextBlob

nltk.download("punkt")
nltk.download("words")

tp.__version__

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


'4.10.0'

### File System

We are going to store and load data from Google Drive, so we need to mount it. In the following code cell we mount Google Drive.


In [None]:
from google.colab import drive, files

drive.mount('/content/drive')

Mounted at /content/drive


### Paths

In the following code cell we are going to define the paths where all our files will be stored in the google drive.

In [None]:
base_dir = '/content/drive/My Drive/Service Delivery'

assert os.path.exists(base_dir), f"The path '{base_dir}' does not exist, check if you have mounted the google drive."

### Getting Keys

In order to collect data from Twitter using `tweepy` we need a [twitter-developer-account](https://developer.twitter.com/en), which can be created [here](https://developer.twitter.com/en). To get the keys you need to:

1. login to your twitter developer account
2. create an application
3. get the following keys
    * `API_KEY`
    * `API_SECRET`
    * `ACCESS_TOKEN`
    * `ACCESS_TOKEN_SECRET`
        
After getting these keys we are going to create a file called `keys.json`, where we are going to store our keys instead of displaying them in this notebook. The `keys.json` file will look as follows:

```json
{
 "API_KEY" : "<YOUR_API_KEY>",
 "API_SECRET": "<YOUR_API_SECRET>",
 "ACCESS_TOKEN": "<YOUR_ACCESS_TOKEN>",
 "ACCESS_TOKEN_SECRET": "<YOUR_ACCESS_TOKEN_SECRET>"
}
```

### Loading `KEYS`.

In the following code cell we are going to load `KEYS` from an external file, `keys.json`. The keys are loaded from an external file because I don't want them to be visible in this notebook. I'm also going to add the file name to the `.gitignore` file so that when this code is pushed to GitHub, `keys.json` won't be uploaded and no one will be able to use our keys on our behalf.

In [None]:
with open(os.path.join(base_dir, 'keys.json'), 'r') as reader:
    keys = json.loads(reader.read())
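
A missing or empty entry in `keys.json` only shows up later, when authentication fails. The following is a minimal sketch of a loader that checks the four expected entries up front. The helper name `load_keys` and the constant `REQUIRED_KEYS` are hypothetical and not part of the original notebook, and the demonstration writes placeholder values to a temporary file rather than reading real credentials.

```python
import json
import os
import tempfile

# The four entries the notebook expects to find in keys.json.
REQUIRED_KEYS = ("API_KEY", "API_SECRET", "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")


def load_keys(path: str) -> dict:
    """Load the credentials file and fail early if an entry is missing or empty."""
    with open(path, "r") as reader:
        keys = json.load(reader)
    missing = [name for name in REQUIRED_KEYS if not keys.get(name)]
    if missing:
        raise KeyError(f"keys.json is missing: {', '.join(missing)}")
    return keys


# Demonstration with a throwaway file and placeholder values (not real credentials).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "keys.json")
    with open(demo_path, "w") as writer:
        json.dump({name: "placeholder" for name in REQUIRED_KEYS}, writer)
    demo_keys = load_keys(demo_path)

print(sorted(demo_keys))
```

In the notebook itself the call would be `load_keys(os.path.join(base_dir, 'keys.json'))`.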

### Keys Type
In the following code cell we are going to create a class called `Keys`. This class exposes our keys as class attributes, so that they can be accessed as `Keys.API_KEY` and so on.

In [None]:
class Keys:
    API_KEY             = keys['API_KEY']
    API_SECRET          = keys['API_SECRET']
    ACCESS_TOKEN        = keys['ACCESS_TOKEN']
    ACCESS_TOKEN_SECRET = keys['ACCESS_TOKEN_SECRET']
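
As an alternative to plain class attributes, the same idea can be sketched with the standard `dataclasses` module. This is only an illustration under assumptions: the names `TwitterKeys` and `from_dict` are hypothetical, and the values below are placeholders. Making the object frozen and masking its `repr` reduces the chance of a secret being changed or printed in a notebook output by accident.

```python
from dataclasses import dataclass


@dataclass(frozen=True, repr=False)
class TwitterKeys:
    """Immutable container for the four credentials (hypothetical name)."""
    api_key: str
    api_secret: str
    access_token: str
    access_token_secret: str

    @classmethod
    def from_dict(cls, raw: dict) -> "TwitterKeys":
        # Map the upper-case names used in keys.json to the fields above.
        return cls(
            api_key=raw["API_KEY"],
            api_secret=raw["API_SECRET"],
            access_token=raw["ACCESS_TOKEN"],
            access_token_secret=raw["ACCESS_TOKEN_SECRET"],
        )

    def __repr__(self) -> str:
        # Never echo the secrets when the object is displayed.
        return "TwitterKeys(<4 values hidden>)"


# Placeholder values, used only to show the behaviour.
demo = TwitterKeys.from_dict({
    "API_KEY": "key-placeholder",
    "API_SECRET": "secret-placeholder",
    "ACCESS_TOKEN": "token-placeholder",
    "ACCESS_TOKEN_SECRET": "token-secret-placeholder",
})
print(demo)
```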

### Creating `api` object

We are going to authenticate with our Twitter developer account using the `OAuthHandler` class. This class takes in two arguments:

* `API_KEY`
* `API_SECRET`

On our `auth` object we are going to call the `set_access_token` method, which takes in the following arguments:
* `ACCESS_TOKEN`
* `ACCESS_TOKEN_SECRET`

We are then going to create an `api` object and pass in `auth`. Setting `wait_on_rate_limit=True` tells tweepy to wait when a rate limit is reached instead of raising an error.


In [None]:
auth = tp.OAuthHandler(Keys.API_KEY, Keys.API_SECRET)
auth.set_access_token(Keys.ACCESS_TOKEN, Keys.ACCESS_TOKEN_SECRET)
api = tp.API(auth, wait_on_rate_limit=True)

### Querying data on Twitter

We are going to collect tweets that are related to service delivery. Our query (`q`) is a keyword search for `Service Delivery`, which may also match hashtags. We set an upper bound, `COUNT`, on the number of tweets and use the tweepy `Cursor` class to page through the results until that bound is reached or the search has no more results.

In [None]:
COUNT = 500_000

### Data Class Tweet
In the following code cell we are going to create a datatype called `Tweet`. This datatype holds the properties of a single `tweet` object that we are interested in. For each tweet we are interested in the following attributes:

1. `id` - the tweet id

2. `created_at` - the date a tweet was created

3. `username` - who tweeted this tweet

4. `text` - the text content of the tweet

In [None]:
class Tweet:
    def __init__(self, id:str, created_at:str, username:str, text:str):
        self.id = id
        self.created_at = created_at
        self.username = username
        self.text = text
    
    def __str__(self):
        return f"Tweet <{self.id}>"
    
    def __repr__(self):
        return f"Tweet <{self.id}>"

In [None]:
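tweets = list()
# COUNT is an upper bound: the cursor stops earlier if the search runs out of results.
# count is the page size; the standard search endpoint returns at most 100 tweets per page.
search = tp.Cursor(api.search_tweets, q="Service Delivery", lang="en", tweet_mode="extended", count=100)
for status in search.items(COUNT):
  payload = status._json  # raw JSON payload of the tweet
  data = (payload['id'], payload['created_at'], payload['user']['screen_name'], payload['full_text'])
  tweets.append(Tweet(*data))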
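
The `created_at` value is kept as the raw string from the JSON payload, in a form such as `Thu Jul 14 13:49:37 +0000 2022`. If a real timestamp is needed later (for sorting or filtering by date), it can be parsed with the standard library. This is a small sketch, not part of the original notebook; the names `TWITTER_DATE_FORMAT` and `parse_created_at` are hypothetical, and the format string assumes English day and month abbreviations.

```python
from datetime import datetime

# Format of the created_at string, e.g. "Thu Jul 14 13:49:37 +0000 2022".
TWITTER_DATE_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_created_at(value: str) -> datetime:
    """Convert a raw created_at string into a timezone-aware datetime."""
    return datetime.strptime(value, TWITTER_DATE_FORMAT)


parsed = parse_created_at("Thu Jul 14 13:49:37 +0000 2022")
print(parsed.isoformat())  # 2022-07-14T13:49:37+00:00
```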


### Checking a single `tweet` example
In the following code cell we are going to check a single tweet example.

In [None]:
tweets[0].text

'@delhivery i want to place one order https://t.co/Ku1BvIFQWt Delivery courier wismaster shipment delivered to me but status cancelled mark updated 2 day please help me delhivery courier service  AWB NO 5963192046766 Delhivery courier DC HEAD IS MISVIHEBAR TALKING TO ME and no https://t.co/vWCdBOPK7A'

In [None]:
len(tweets)

17975



### Features (text) Cleaning

Our tweets were collected from Twitter, and if we look at the example text we can see hashtags `#`, mentions `@user`, numbers `123`, URLs `http://google.com/whatever` and so on. These things do not add any meaning to our text. We are going to create a preprocessing function that removes all hashtags, mentions, URLs and numbers, and expands contractions like `I'm` to `I am`, so that our text is clean. We also want to drop single letters that mean nothing and convert every word to lowercase. For example, let's have a look at the following sentence:

```
I'm working with a model downloaded on http://google.com/whatever #100daysofcode created using AI by @username5 e h in 2015.
```

When we clean this text we want it to look as follows:

```
i am working with a model downloaded on created using ai by in.
```

The above sentence does not make much sense to a human being, but it is still useful to a deep learning model, because such models learn from the context of the words in a sentence rather than from the meaning of the sentence.
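
The noise-removal part of this idea can be sketched with the standard `re` module alone. The helper below is a simplified illustration with a hypothetical name, `strip_noise`. It does not expand contractions, drop single letters or filter non-English words, so its output differs from the target sentence above.

```python
import re


def strip_noise(text: str) -> str:
    """Lower-case the text and remove URLs, mentions, hashtags, digits and punctuation."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # URLs
    text = re.sub(r"[@#]\w+", " ", text)       # mentions and hashtags
    text = re.sub(r"\d+", " ", text)           # digits
    text = re.sub(r"[^\w\s']", " ", text)      # punctuation, keeping apostrophes
    return re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace


example = (
    "I'm working with a model downloaded on http://google.com/whatever "
    "#100daysofcode created using AI by @username5 e h in 2015."
)
print(strip_noise(example))
# i'm working with a model downloaded on created using ai by e h in
```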

### Retweets

We are going to remove retweets from our tweets. By Twitter convention the text of a retweet starts with `RT`, so we are going to skip every tweet whose text starts with `RT` before we create our `.csv` file.
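
A plain prefix check on `RT` would also drop an ordinary tweet that happens to begin with those two letters. The sketch below uses a slightly stricter test, the prefix `RT @`, on the assumption that retweet text has the form `RT @user: ...`. The helper name `is_retweet` is hypothetical and is not used elsewhere in this notebook.

```python
def is_retweet(text: str) -> bool:
    """Return True when the tweet text looks like a retweet ("RT @user: ...")."""
    return text.lstrip().startswith("RT @")


samples = [
    "RT @someone: service delivery is slow",  # retweet, should be dropped
    "RTs are appreciated, thank you",         # ordinary tweet, should be kept
    "Great service today",                    # ordinary tweet, should be kept
]
kept = [text for text in samples if not is_retweet(text)]
print(kept)
```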

### Is text cleaning going to improve model metrics?

Cleaning features, sometimes treated as a step of "feature extraction", is a very important step for machine learning models. It helps us reduce noise in our features, so that instead of learning numbers, hashtags and mentions the model focuses on the text, which is what matters. This also reduces the size of the `vocabulary`. The `vocabulary` is an important concept in NLP, which I'm going to explain in more detail in the model training notebook. For now it is enough to know that text cleaning reduces the size of the vocabulary and can improve model training speed.
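
The effect on the vocabulary can be shown on a toy example. The sketch below treats the vocabulary as the set of distinct whitespace-separated tokens, which is an assumption made for illustration. The two sentences, the handles in them and the helper names `vocabulary` and `rough_clean` are made up and are not taken from the collected data.

```python
import re


def vocabulary(sentences):
    """Distinct whitespace-separated tokens across all sentences."""
    return {token for sentence in sentences for token in sentence.split()}


def rough_clean(text):
    """Very rough cleaning: lower-case, then drop URLs, mentions, hashtags, digits and symbols."""
    text = re.sub(r"https?://\S+|[@#]\w+|\d+", " ", text.lower())
    return re.sub(r"[^a-z\s]", " ", text)


raw = [
    "Great service @user1! #ServiceDelivery 2022",
    "great SERVICE from @user2 #servicedelivery https://t.co/abc123",
]
cleaned = [rough_clean(sentence) for sentence in raw]

print(len(vocabulary(raw)), "tokens before cleaning")
print(len(vocabulary(cleaned)), "tokens after cleaning")
```

Before cleaning, `Great`, `great`, `service` and `SERVICE` count as four different tokens; after cleaning they collapse to two.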

In the code cells that follow we are going to use the `nltk` and `re` packages to clean our text. We are going to create a function called `clean_sentence` that does the text cleaning for us. Besides removing noise from the sentence, this function also expands contracted words, for example `don't` to `do not`.

The text-cleaning functions were taken from [CrispenGari/ml-utils](https://github.com/CrispenGari/ml-utils/tree/main/text-cleaning).

In [None]:
def decontracted(phrase:str)->str:
    """
    Args:
        phrase (str): takes in a word like I'm

    Returns:
        string: a decontracted word like I am
    """
    # specific
    phrase = re.sub(r"won\'t", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)

    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase
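
To see what these substitution rules do on concrete words, the same rules can be written as an ordered table and applied in a loop. This is a self-contained sketch for experimentation; the names `RULES` and `expand` are hypothetical. It also shows two limits of a purely rule-based approach: `ain't` becomes `ai not`, and a possessive `'s` is treated as `is`.

```python
import re

# Same substitutions as `decontracted`, in the same order (specific cases first).
RULES = [
    (r"won't", "will not"),
    (r"can't", "can not"),
    (r"n't", " not"),
    (r"'re", " are"),
    (r"'s", " is"),
    (r"'d", " would"),
    (r"'ll", " will"),
    (r"'t", " not"),
    (r"'ve", " have"),
    (r"'m", " am"),
]


def expand(phrase: str) -> str:
    """Apply every rule in order and return the expanded phrase."""
    for pattern, replacement in RULES:
        phrase = re.sub(pattern, replacement, phrase)
    return phrase


for word in ["i'm", "won't", "they're", "ain't", "user's"]:
    print(word, "->", expand(word))
```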

In [None]:
ENGLISH_WORDS = set(nltk.corpus.words.words()) # built once instead of on every call

def clean_sentence(sent:str)->str:

    """
    Args:
        sent (str): an uncleaned sentence with text, punctuations, numbers and non-english words
    Returns:
        str: returns a cleaned sentence with only english words in it.
    """
    sent = sent.lower() # converting the text to lower case
    sent = re.sub(r'(@|#)([A-Za-z0-9]+)', ' ', sent) # removing tags and mentions (there's no right way of doing it with regular expression but this will try)
    sent = re.sub(r"([A-Za-z0-9]+[._-])*[A-Za-z0-9]+@[A-Za-z0-9-]+(\.[A-Za-z]{2,})+", " ", sent) # removing emails ("-" moved to the end of the class so it is not read as a range)
    sent = re.sub(r'https?\S+', ' ', sent, flags=re.MULTILINE) # removing urls
    sent = re.sub(r'\d', ' ', sent) # removing digits
    sent = re.sub(r'[^\w\s\']', ' ', sent) # removing punctuation except for "'" in words like I'm
    sent = re.sub(r'\s+', ' ', sent).strip() # collapsing repeated whitespace
    words = list()
    for word in sent.split(' '):
        words.append(decontracted(word)) # replace words like "i'm -> i am"
    return " ".join(w for w in words if w.lower() in ENGLISH_WORDS or not w.isalpha()) # removing non-english words


> TextBlob’s output for a polarity task is a float within the range `[-1.0, 1.0]` where `-1.0` is a negative polarity and `1.0` is positive. This score can also be equal to `0`, which stands for a neutral evaluation of a statement as it doesn’t contain any words from the training set.

In [None]:
"""
later on we will customize for neutral based on the absolute threshold to balance the data.
"""

def create_label(text:str)->str:
    blob = TextBlob(text)
    polarity =  blob.sentiment.polarity
    if polarity == 0.0:
        return "neutral"
    elif polarity < 0.0:
        return "negative"
    else:
        return "positive"

In [None]:
create_label("This is lovely"), create_label("I'm not saying anything"), create_label("this is boring")

('positive', 'neutral', 'negative')
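
The note in the labelling cell mentions customizing the neutral class with an absolute threshold. One way this could look is sketched below, working directly on a polarity value so that it runs without TextBlob. The function name `label_from_polarity` and the example threshold `0.1` are assumptions, not values from this notebook. With the default threshold of `0.0` the rule is the same as in `create_label`.

```python
def label_from_polarity(polarity: float, threshold: float = 0.0) -> str:
    """Map a polarity in [-1, 1] to a label, treating |polarity| <= threshold as neutral."""
    if abs(polarity) <= threshold:
        return "neutral"
    return "negative" if polarity < 0 else "positive"


for value in (0.5, 0.05, 0.0, -0.05, -1.0):
    print(value, label_from_polarity(value), label_from_polarity(value, threshold=0.1))
```

A larger threshold moves weakly polar tweets into the neutral class, which changes the class balance of the dataset.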

### Categorical Label

Next we are going to create a categorical label: `1` if the sentiment label is `negative`, `0` if it is `neutral` and `2` if it is `positive`. We are going to create a Python `lambda` function called `categorical_label`.

In [None]:
categorical_label = lambda x: 1 if x == "negative" else 0 if x == "neutral" else 2
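
The same mapping can also be written as a dictionary, which makes it easy to go back from an integer to its class name later on, for example when reading model predictions. This is an optional sketch; the names `LABEL_TO_ID` and `ID_TO_LABEL` are hypothetical. Unlike the `lambda`, a dictionary lookup raises a `KeyError` on an unexpected label instead of silently returning `2`.

```python
# Integer codes as described above: neutral -> 0, negative -> 1, positive -> 2.
LABEL_TO_ID = {"neutral": 0, "negative": 1, "positive": 2}

# Inverse mapping, useful for turning predicted integers back into class names.
ID_TO_LABEL = {index: label for label, index in LABEL_TO_ID.items()}

print(LABEL_TO_ID["negative"], ID_TO_LABEL[2])
```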

### Clean Tweets

We are then going to clean the text of our tweets, give them class and categorical labels, and then save the data in a `csv` file. Our `csv` file will have the following columns:

1. `created_at` - the date when the tweet was created

2. `text` - the tweet cleaned text

3. `label` - the class label which can be `positive`, `negative` or `neutral`

4. `categorical_label` - an integer value either `2`, `1` or `0` for `positive`, `negative` or `neutral` class labels respectively

5. `username` - the user who wrote the tweet.

6. `id` - the tweet id.


In [None]:
cleaned_tweets = list()

for twt in tweets:
  # Removing all retweets:
  if twt.text.startswith("RT"):
    continue
  else:
    twt_txt = clean_sentence(twt.text)
    twt_id = twt.id
    twt_create_at = twt.created_at
    twt_username = twt.username
    
    # labels
    twt_label = create_label(twt_txt)
    twt_categorical_label = categorical_label(twt_label)
    
    cleaned_tweets.append(tuple([twt_id, twt_create_at, twt_username, twt_txt, twt_label, twt_categorical_label]))

In [None]:
len(cleaned_tweets)

6336
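
Because the labels are generated automatically, it can be useful to check how balanced the three classes are before saving the data. The sketch below counts labels with the standard library. The function name `label_distribution` and the sample list are made up for illustration and do not reflect the real distribution of the collected tweets.

```python
from collections import Counter


def label_distribution(labels):
    """Return the share of each label as a fraction of all labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Made-up labels; in the notebook one could pass [row[4] for row in cleaned_tweets].
sample_labels = ["positive", "positive", "neutral", "negative"]
print(label_distribution(sample_labels))
```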

### Columns
In the following code cell we are going to define the column names that we need in our `.csv` file as follows.

In [None]:
columns = np.array(['id', 'created_at', 'username', 'text', 'label', 'categorical_label'])

### Dataframe
In the following code cell we are going to create a dataframe based on our `cleaned_tweets` list and check the first `10` rows of our data as follows:
    

In [None]:
dataframe = pd.DataFrame(cleaned_tweets, columns=columns, index=None)
dataframe.head(10)

| | id | created_at | username | text | label | categorical_label |
|---|---|---|---|---|---|---|
| 0 | 1547579068781998083 | Thu Jul 14 13:49:37 +0000 2022 | vivek49276338 | i want to place one order delivery courier shi... | neutral | 0 |
| 1 | 1547579038197108737 | Thu Jul 14 13:49:29 +0000 2022 | henryki29684325 | your delivery service is ducking and unreliabl... | negative | 1 |
| 2 | 1547578977325158401 | Thu Jul 14 13:49:15 +0000 2022 | earifin_com | exclusive gift pack make your one feel differe... | positive | 2 |
| 3 | 1547578768448884740 | Thu Jul 14 13:48:25 +0000 2022 | deAndrento | actually we should expect in the future alread... | positive | 2 |
| 4 | 1547578665822593029 | Thu Jul 14 13:48:01 +0000 2022 | MatthewJRoth | _z you only pay the at the delivery of your ca... | positive | 2 |
| 5 | 1547578506283913217 | Thu Jul 14 13:47:23 +0000 2022 | ERadiators | read our latest review great customer service ... | positive | 2 |
| 6 | 1547578433789915137 | Thu Jul 14 13:47:05 +0000 2022 | flipkartsupport | sorry about that we understand your concern ab... | negative | 1 |
| 7 | 1547578268529729537 | Thu Jul 14 13:46:26 +0000 2022 | col_fox | out visiting the rope access and wash reach wi... | positive | 2 |
| 8 | 1547578177500852225 | Thu Jul 14 13:46:04 +0000 2022 | SparrowCareers | are you looking for a career where you contrib... | neutral | 0 |
| 9 | 1547578161436577792 | Thu Jul 14 13:46:00 +0000 2022 | RoadsAgency | to strengthen with role in the mining sector w... | neutral | 0 |


### Saving our data 

We are going to save our cleaned data in a `csv` file named `clean_tweets.csv`, using the pandas dataframe method `to_csv` as follows:

In [None]:
save_path = os.path.join(base_dir, 'clean_tweets.csv')

dataframe.to_csv(save_path)

print("Done")

Done


### Next

In the next notebook we are going to create an `Artificial Neural Network (ANN)` model that classifies text by sentiment. Given a text from the user, the model will classify its sentiment as positive, negative or neutral.

### References

1. [tweepy docs](https://docs.tweepy.org/en/stable/authentication.html)
2. [textblob-docs](https://textblob.readthedocs.io/en/dev/)