
### Neural Machine Translation Data Gathering

In this notebook we are going to gather as many English sentences as we can so that we can create an `nmt` dataset from the gathered data. The data consists of tweets collected on different topics.
___

Topic: `NMT`

Date: `2022/07/19`

Programming Language: `python`

Main: `Natural Language Processing (NLP)`

___



### NMT Dataset Scraping

In this notebook we are going to use the `twitter` API to collect tweets for the `nmt` task. We are going to use the Python programming language to scrape the data from Twitter, using `tweepy` together with API keys.


### Installation of `tweepy`
In the following code cell we are going to install the latest version of `tweepy`. This package allows us to interact with Twitter using the Python programming language.


In [1]:
!pip install tweepy --upgrade -q

### Why tweepy?

We can use libraries like `selenium` to scrape the data for our task but tweepy comes with the following advantages:

* Provides many features about a given tweet (e.g. information about a tweet’s geographical location, etc.)
* Easy to use.
* It uses the official Twitter API, which is the sanctioned way of getting tweets from Twitter, especially for research purposes.
* It has well-written documentation.

However, tweepy comes with some limitations such as:

* Limiting API requests.
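
A common way to live with request limits is to wait and retry. The sketch below is a generic illustration of that idea, not tweepy's own mechanism: `RateLimited`, `call_with_retry`, `fake_request`, and the wait times are hypothetical names and values chosen for the example.

```python
import time


class RateLimited(Exception):
    """Hypothetical error raised when the service refuses a request."""


def call_with_retry(func, max_attempts=3, wait_seconds=1.0, sleep=time.sleep):
    """Call func(), waiting and retrying when it raises RateLimited."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except RateLimited:
            if attempt == max_attempts:
                raise  # give up after the last attempt
            # Wait longer after each failure (simple exponential backoff).
            sleep(wait_seconds * 2 ** (attempt - 1))


# Usage with a fake request that fails twice before succeeding.
calls = {"n": 0}


def fake_request():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimited()
    return "ok"


# Record the requested waits instead of actually sleeping.
waits = []
result = call_with_retry(fake_request, sleep=waits.append)
print(result, waits)
```

Later in this notebook the `api` object is created with `wait_on_rate_limit=True`, which asks tweepy to handle the waiting for us, so a wrapper like this is only needed when more control is wanted.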


### Installing the package for helper functions

We are going to install a package called [`helperfns`](https://github.com/CrispenGari/helperfns). This is a package that I created; it contains some helper functions that are useful in this task.


In [2]:
!pip install helperfns -q

### Importing packages
In the following code cell we are going to import packages that we are going to use in this notebook.

In [4]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import tweepy as tp
import pandas as pd

import os
import re
import json

from helperfns.text import clean_sentence


tp.__version__

'4.10.0'

### File System

We are going to store and load data from Google Drive, so we need to mount it. In the following code cell we are mounting Google Drive.


In [5]:
from google.colab import drive, files

drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Paths

In the following code cell we are going to define the paths where all our files will be stored in the google drive.

In [6]:
base_dir = '/content/drive/My Drive/NLP Data/nmt'

assert os.path.exists(base_dir), f"The path '{base_dir}' does not exist, check if you have mounted the google drive."

### Getting Keys

In order to scrape data from Twitter using `tweepy` we need a [twitter-developer-account](https://developer.twitter.com/en), which can be created [here](https://developer.twitter.com/en). To get the keys you need to:

1. log in to your Twitter developer account
2. create an application
3. get the following keys
    * `API_KEY`
    * `API_SECRET`
    * `ACCESS_TOKEN`
    * `ACCESS_TOKEN_SECRET`
        
After getting these keys we are going to create a file called `keys.json`, which is where we are going to store our keys instead of displaying them here in this notebook. The `keys.json` file will look as follows:

```json
{
 "API_KEY": "<YOUR_API_KEY>",
 "API_SECRET": "<YOUR_API_SECRET>",
 "ACCESS_TOKEN": "<YOUR_ACCESS_TOKEN>",
 "ACCESS_TOKEN_SECRET": "<YOUR_ACCESS_TOKEN_SECRET>"
}
```
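
Since a missing or misspelled entry in this file only shows up later as an authentication error, it can help to validate the file when it is read. The following is a minimal sketch of that idea; `load_keys` and `REQUIRED_KEYS` are hypothetical names, and the demonstration writes a throwaway file with placeholder values rather than real keys.

```python
import json
import os
import tempfile

REQUIRED_KEYS = ("API_KEY", "API_SECRET", "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")


def load_keys(path):
    """Read a keys file and fail early if an expected entry is missing or empty."""
    with open(path, "r") as reader:
        keys = json.load(reader)
    missing = [name for name in REQUIRED_KEYS if not keys.get(name)]
    if missing:
        raise KeyError(f"Missing entries in {path}: {missing}")
    return keys


# Demonstration with a throwaway file holding placeholder values.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "keys.json")
    with open(demo_path, "w") as writer:
        json.dump({name: f"<YOUR_{name}>" for name in REQUIRED_KEYS}, writer)
    demo_keys = load_keys(demo_path)

print(sorted(demo_keys))
```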

### Loading `KEYS`

In the following code cell we are going to load `KEYS` from an external JSON file stored in `base_dir` (named `keys-2.json` in the cell below, with the same layout as the `keys.json` example above). The keys are loaded from an external file because I don't want them to be visible in this notebook. The file name is also added to the `.gitignore` file, so that when this code is pushed to `github` the keys file is not uploaded and no one can use our `keys` on our behalf.

In [7]:
with open(os.path.join(base_dir, 'keys-2.json'), 'r') as reader:
    keys = json.loads(reader.read())

### Keys Type
In the following code cell we are going to create a class called `Keys`. This class stores our keys as class attributes, so they can be accessed as `Keys.API_KEY` and so on.

In [8]:
class Keys:
    API_KEY             = keys['API_KEY']
    API_SECRET          = keys['API_SECRET']
    ACCESS_TOKEN        = keys['ACCESS_TOKEN']
    ACCESS_TOKEN_SECRET = keys['ACCESS_TOKEN_SECRET']
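
As an alternative to a plain class, the same idea can be sketched with Python's `dataclasses` module. This is an illustrative variant, not what the notebook uses: `ApiKeys` and `from_dict` are hypothetical names, and the values below are dummy strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True, repr=False)
class ApiKeys:
    """Immutable container for the four credentials (hypothetical name)."""
    api_key: str
    api_secret: str
    access_token: str
    access_token_secret: str

    @classmethod
    def from_dict(cls, raw):
        # Map the upper-case names used in the keys file onto the fields.
        return cls(
            api_key=raw["API_KEY"],
            api_secret=raw["API_SECRET"],
            access_token=raw["ACCESS_TOKEN"],
            access_token_secret=raw["ACCESS_TOKEN_SECRET"],
        )

    def __repr__(self):
        # Avoid printing secrets when the object is displayed in a notebook.
        return "ApiKeys(<hidden>)"


example = ApiKeys.from_dict({
    "API_KEY": "a", "API_SECRET": "b",
    "ACCESS_TOKEN": "c", "ACCESS_TOKEN_SECRET": "d",
})
print(example)
```

Freezing the dataclass and hiding the values in `__repr__` are design choices aimed at keeping the secrets from being changed or echoed by accident.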

### Creating `api` object

We are going to authenticate to our Twitter developer account using the `OAuthHandler` class. This class takes in two arguments, which are:

* `API_KEY`
* `API_SECRET`

On our `auth` object we are going to call the method `set_access_token`, which takes in the following arguments:
* `ACCESS_TOKEN`
* `ACCESS_TOKEN_SECRET`

We are then going to create an `api` object and pass in the `auth` object.


In [9]:
auth = tp.OAuthHandler(Keys.API_KEY, Keys.API_SECRET)
auth.set_access_token(Keys.ACCESS_TOKEN, Keys.ACCESS_TOKEN_SECRET)
api = tp.API(auth, wait_on_rate_limit=True)

### Querying data on Twitter

We are going to query tweets on different topics on Twitter, and we are going to collect as many tweets as we can.

In [10]:
COUNT = 500_000
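
To get a feel for what this target means in API requests, here is a rough back-of-the-envelope estimate. The page size, requests per window, and window length below are assumptions about the standard search endpoint and should be checked against the current Twitter documentation; only `COUNT` comes from this notebook.

```python
import math

COUNT = 500_000            # target from the notebook
TWEETS_PER_REQUEST = 100   # assumed maximum page size of the search endpoint
REQUESTS_PER_WINDOW = 180  # assumed request limit per rate-limit window
WINDOW_MINUTES = 15        # assumed length of one rate-limit window

requests_needed = math.ceil(COUNT / TWEETS_PER_REQUEST)
windows_needed = math.ceil(requests_needed / REQUESTS_PER_WINDOW)
minutes_upper_bound = windows_needed * WINDOW_MINUTES

print(requests_needed, windows_needed, minutes_upper_bound)
```

`COUNT` acts as an upper bound: the cursor stops early when the search has no more results to return, so the number of tweets actually collected can be much smaller.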

### Data Class Tweet
In the following code cell we are going to create a datatype called `Tweet`. This datatype will contain the properties of a single `tweet` object that we are interested in. For each tweet we are interested in the following attributes:

1. `id` - the tweet id

2. `created_at` - the date a tweet was created

3. `username` - who tweeted this tweet

4. `text` - the text content of the tweet

In [11]:
class Tweet:
    def __init__(self, id:str, created_at:str, username:str, text:str):
        self.id = id
        self.created_at = created_at
        self.username = username
        self.text = text
    
    def __str__(self):
        return f"Tweet <{self.id}>"
    
    def __repr__(self):
        return f"Tweet <{self.id}>"
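
The same type could also be sketched as a dataclass that knows how to build itself from the JSON of a single tweet. The field names mirror the JSON keys read by the collection cell in this notebook (`id`, `created_at`, `user.screen_name`, `full_text`); `TweetRecord`, `from_status`, and the hand-written `fake_status` are hypothetical and only for illustration.

```python
from dataclasses import dataclass


@dataclass
class TweetRecord:
    """Dataclass variant of the Tweet type (hypothetical name)."""
    id: int
    created_at: str
    username: str
    text: str

    @classmethod
    def from_status(cls, status):
        # `status` is the decoded JSON of one tweet fetched in extended mode.
        return cls(
            id=status["id"],
            created_at=status["created_at"],
            username=status["user"]["screen_name"],
            text=status["full_text"],
        )


# Hand-written stand-in for an API response, for illustration only.
fake_status = {
    "id": 1,
    "created_at": "Wed Jul 20 19:29:49 +0000 2022",
    "user": {"screen_name": "example_user"},
    "full_text": "hello world",
}
record = TweetRecord.from_status(fake_status)
print(record.username, record.text)
```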

### Queries

Our queries will be based on the following keyword search on twitter:

```
[
  'farewell',
  'impromptu',
  'explanatory',
  'eulogy',
  'demonstrative',
  'entertaining',
  'debate',
  'motivational',
  'informative',
  'pitch',
  'persuasive',
  'oratorical',
  'occasion',
  'joke',
  'greating',
  'culture',
  'sports',
  'art'
]
```
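
The notebook collects one topic per run by editing the `topic` variable. A loop over several keywords might be sketched as follows; `collect_by_topic` and `fake_fetch` are hypothetical helpers, and `fake_fetch` yields made-up strings so the example runs without API access. In practice `fetch` would wrap the `tp.Cursor(api.search_tweets, ...)` call.

```python
def collect_by_topic(topics, fetch, limit):
    """Return {topic: items} by calling fetch(topic, limit) once per topic."""
    collected = {}
    for topic in topics:
        collected[topic] = list(fetch(topic, limit))
    return collected


def fake_fetch(topic, limit):
    # Stand-in for the real search; yields made-up strings, not tweets.
    for i in range(limit):
        yield f"{topic} example {i}"


TOPICS = ["farewell", "debate", "art"]  # a subset of the keyword list
results = collect_by_topic(TOPICS, fake_fetch, limit=2)
print({topic: len(items) for topic, items in results.items()})
```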

In [None]:

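tweets = list()
topic = 'art'
# The standard search endpoint returns at most 100 tweets per request, so count is set to 100.
cursor = tp.Cursor(api.search_tweets, q=topic, lang="en", tweet_mode="extended", count=100)
for status in cursor.items(COUNT):
  status_json = status._json
  data = (status_json['id'], status_json['created_at'], status_json['user']['screen_name'], status_json['full_text'])
  tweets.append(Tweet(*data))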


### Checking a single `tweet` example
In the following code cell we are going to check a single tweet example.

In [108]:
tweets[0].text

"RT @warehouseno_8: #LLDFest2k22 DAY 5\n\n🥪 https://t.co/uM3stCnGlZ 🕵️\n\nGong Jun's sandwich has gone missing and he needs your help! A Choose…"

In [109]:
len(tweets)

15122

### Clean Tweets

We are then going to clean the text of our tweets and save the result as a `csv` file that will be located in the `<topic>` folder with the file name `<topic>.csv`. Our `csv` file will have the following columns:

1. `created_at` - the date when the tweet was created

2. `text` - the tweet cleaned text

3. `username` - the user who wrote the tweet.

4. `id` - the tweet id.


In [110]:
cleaned_tweets = list()

for twt in tweets:
  # Skip retweets, whose text starts with "RT".
  if twt.text.startswith("RT"):
    continue
  # Clean the text and keep the fields in the same order as the csv columns.
  twt_txt = clean_sentence(twt.text)
  cleaned_tweets.append((twt.id, twt.created_at, twt.username, twt_txt))

In [112]:
len(cleaned_tweets)

5824
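
To make the cleaning step more concrete, here is a small regex-based sketch of the kind of work such a function might do. It is an approximation written for this example, not the implementation of `clean_sentence` from `helperfns`; `is_retweet`, `simple_clean`, and the sample texts are hypothetical. Note that it checks for `"RT @"`, which is slightly stricter than the `"RT"` prefix used in the cell above.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")


def is_retweet(text):
    """Treat texts that start with 'RT @' as retweets."""
    return text.startswith("RT @")


def simple_clean(text):
    """Lower-case, drop URLs and mentions, keep letters and single spaces."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = NON_LETTER_RE.sub(" ", text)
    return " ".join(text.split())


samples = [
    "RT @someone: Already seen this one",
    "Great match today! https://t.co/abc123 @fan_account",
]
cleaned = [simple_clean(s) for s in samples if not is_retweet(s)]
print(cleaned)
```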

### Columns
In the following code cell we are going to define the column names that we need in our `.csv` file as follows.

In [113]:
columns = np.array(['id', 'created_at', 'username', 'text'])

### Dataframe
In the following code cell we are going to create a dataframe based on our `cleaned_tweets` list and check the first `10` rows of our data as follows:
    

In [100]:
dataframe = pd.DataFrame(cleaned_tweets, columns=columns, index=None)
dataframe.head(10)

| | id | created_at | username | text |
|---|---|---|---|---|
| 0 | 1549839010029240328 | Wed Jul 20 19:29:49 +0000 2022 | BGamboe380 | the sports media had their time to disrespect ... |
| 1 | 1549839008863338497 | Wed Jul 20 19:29:48 +0000 2022 | SportsDayDFW | listen rocker s a ranger golf s a threat and m... |
| 2 | 1549839005054865414 | Wed Jul 20 19:29:48 +0000 2022 | MalickDaho | _rwanda _sports my man put on a show |
| 3 | 1549839004836794368 | Wed Jul 20 19:29:48 +0000 2022 | Dovehousecars | new in coupe manual in midnight blue with metr... |
| 4 | 1549838995965845504 | Wed Jul 20 19:29:45 +0000 2022 | JohnLeviBaker | _sell bazooka gold parallel rookie pop only hi... |
| 5 | 1549838992597729280 | Wed Jul 20 19:29:45 +0000 2022 | Sorelle_Arduino | so you think a male bodied person should be in... |
| 6 | 1549838986285236224 | Wed Jul 20 19:29:43 +0000 2022 | Kumarreddy222 | _ranjitbajaj first this guy does not know abou... |
| 7 | 1549838985324859392 | Wed Jul 20 19:29:43 +0000 2022 | xProudPapax | _misfit i am trying to follow as much sports i... |
| 8 | 1549838983097552896 | Wed Jul 20 19:29:42 +0000 2022 | ltimmerman25 | how many do the current roster of the call hom... |
| 9 | 1549838976160112640 | Wed Jul 20 19:29:41 +0000 2022 | Conferenceof12 | _sports it s a sign language y as well as the ... |


### Saving our data 

We are going to save our cleaned data in a `csv` file with the name `<topic>.csv`. We are going to use the pandas dataframe method `to_csv` as follows:

In [114]:
save_path = os.path.join(base_dir, f'datasets/{topic}/{topic}.csv')
save_dir = os.path.dirname(save_path)

# Create the topic folder (and any missing parent folders) before saving.
if not os.path.exists(save_dir):
  print("Creating path: ", save_dir)
  os.makedirs(save_dir)

dataframe.to_csv(save_path)

print("Done")

Done
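
The folder layout used here (`datasets/<topic>/<topic>.csv`) can be tried out without Google Drive. The sketch below swaps pandas for the standard-library `csv` module and writes into a temporary folder; `save_rows` is a hypothetical helper and the single row is made up.

```python
import csv
import os
import tempfile


def save_rows(base_dir, topic, columns, rows):
    """Write rows to <base_dir>/datasets/<topic>/<topic>.csv and return the path."""
    folder = os.path.join(base_dir, "datasets", topic)
    os.makedirs(folder, exist_ok=True)  # also creates missing parent folders
    save_path = os.path.join(folder, f"{topic}.csv")
    with open(save_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(columns)
        writer.writerows(rows)
    return save_path


# Demonstration in a temporary folder with one made-up row.
with tempfile.TemporaryDirectory() as tmp:
    path = save_rows(
        tmp,
        "art",
        ["id", "created_at", "username", "text"],
        [(1, "Wed Jul 20 19:29:49 +0000 2022", "example_user", "hello world")],
    )
    with open(path, newline="", encoding="utf-8") as handle:
        saved = list(csv.reader(handle))

print(saved[0])
```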


### Downloading the Saved File

In the following code cell we are going to download the saved file to our local computer.

In [115]:
from google.colab import files

files.download(save_path)

print("Done")

Done
