# NLP on Tweets about Crypto Markets

### by Artemij Kiel and Roman Pavlyutin

**Step 1. Data Wrangling**

The text file that we want to analyze is rather large: 4.41 GB. It contains about 8 million tweets that were parsed within a single day. Because of the sheer size of the file, loading it with pandas was not practical on our machine, so we decided to use dask, which processes the data in partitions instead of loading everything into memory at once.

In [3]:
import json
from dask import bag as db 
from dask import dataframe as dd

Here, we add a Client so that we can track the progress of the computation on the dashboard.

In [4]:
from dask.distributed import Client
client = Client(n_workers=8)
client

Connection method: Cluster object
Cluster type: distributed.LocalCluster
Dashboard: http://127.0.0.1:8787/status
Workers: 8
Total threads: 8
Total memory: 8.00 GiB
Status: running
Using processes: True
Comm: tcp://127.0.0.1:59789
Started: Just now

Comm,Dashboard,Nanny,Total threads,Memory,Local directory
tcp://127.0.0.1:59826,http://127.0.0.1:59827/status,tcp://127.0.0.1:59795,1,1.00 GiB,/Users/romanpavlyutin/dask-worker-space/worker-nr9ao1tr
tcp://127.0.0.1:59835,http://127.0.0.1:59836/status,tcp://127.0.0.1:59798,1,1.00 GiB,/Users/romanpavlyutin/dask-worker-space/worker-n4cuu0rg
tcp://127.0.0.1:59820,http://127.0.0.1:59821/status,tcp://127.0.0.1:59794,1,1.00 GiB,/Users/romanpavlyutin/dask-worker-space/worker-umqh9x47
tcp://127.0.0.1:59817,http://127.0.0.1:59818/status,tcp://127.0.0.1:59793,1,1.00 GiB,/Users/romanpavlyutin/dask-worker-space/worker-w6uccin0
tcp://127.0.0.1:59823,http://127.0.0.1:59824/status,tcp://127.0.0.1:59792,1,1.00 GiB,/Users/romanpavlyutin/dask-worker-space/worker-db_z5rf1
tcp://127.0.0.1:59832,http://127.0.0.1:59833/status,tcp://127.0.0.1:59799,1,1.00 GiB,/Users/romanpavlyutin/dask-worker-space/worker-43u_cze_
tcp://127.0.0.1:59829,http://127.0.0.1:59830/status,tcp://127.0.0.1:59797,1,1.00 GiB,/Users/romanpavlyutin/dask-worker-space/worker-pqkj54zz
tcp://127.0.0.1:59838,http://127.0.0.1:59839/status,tcp://127.0.0.1:59796,1,1.00 GiB,/Users/romanpavlyutin/dask-worker-space/worker-y07jzv9z


Reading in the file:

In [6]:
# a is a dask bag over the lines of the raw .txt file, read in blocks of 50 MB

a = db.read_text("/Users/romanpavlyutin/Desktop/tweets0408.txt", blocksize="50MB")
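To get a feeling for what the blocksize means here, the following is a rough back-of-the-envelope sketch of how many partitions to expect. It assumes that the 4.41 GB and the "50MB" are both decimal units; the exact number dask produces may differ slightly, and the variable names are only illustrative.

```python
# Rough estimate of the number of partitions for the file above.
# Assumption: 4.41 GB and 50 MB are decimal units (10**9 and 10**6 bytes).
file_size_bytes = 4_410_000_000
block_size_bytes = 50_000_000

# Integer ceiling division: every started block becomes one partition.
n_partitions = -(-file_size_bytes // block_size_bytes)

print(n_partitions)
```

With 8 workers of 1 GiB each, blocks of this size should leave enough headroom per worker, whereas the whole file would not fit into the memory of a single worker.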

Use a cleaner:

In [7]:
# cleaner removes the last two characters of every line in the text file
# json.dumps makes a JSON string out of a Python object
# json.loads makes a Python object out of a JSON string
# sort_keys=True as an argument to dumps sorts the keys alphabetically. The keys are the column names.

def cleaner(line):
    return line[0:-2]
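A minimal sketch of what this step does to a single line. The sample line is made up, and it assumes that every line of the file ends with a comma followed by a newline, which is what cutting off two characters suggests; the real file may look different.

```python
import json


def cleaner(line):
    # Drop the last two characters of the line
    # (assumed here to be a trailing comma and a newline).
    return line[0:-2]


# Hypothetical raw line, modelled on the columns of our data.
raw_line = '{"tweet_id": 1, "language": "en", "text": "hello"},\n'

cleaned = cleaner(raw_line)    # now a valid JSON object string
record = json.loads(cleaned)   # JSON string -> Python dict

print(record["text"])
```

Without the cleaning step, json.loads would fail on the trailing comma.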

In [8]:
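# map applies the function "cleaner" defined above, and then json.loads, to every element

b = a.map(cleaner).map(json.loads).to_dataframe()

# use tweet_id as the index and write the result to the "parquet" directory
b.set_index("tweet_id").to_parquet("parquet")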

2022-06-17 14:50:31,584 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:59829
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/distributed/comm/tcp.py", line 266, in write
    async def write(self, msg, serializers=None, on_error="message"):
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/asyncio/tasks.py", line 418, in wait_for
    return fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/distributed/comm/core.py", line 327, in connect
    await asyncio.wait_for(comm.write(local_info), time_left())
  File "/Library/Framework

The newly generated parquet files are read into a dataframe.

In [9]:
df=dd.read_parquet("parquet")

Let's look at the number of rows we have:

In [10]:
len(df)

7972537

This time the missing values were handled well, and the dataframe has a structure we can work with.

In [12]:
df.head(5)

tweet_id,user_name,user_screen_name,language,nb_retweets,nb_likes,user_id,provider,isSensitive,countStatus,text,user_location,keyword,isRetweeted,report_date,isRetweet
1511650360317861894,Jaineet King,JaineetKing,en,0,0,1510316194418597897,random,False,8,@purpleNFTmuseum TALKADO NFT #Whitelist IS LIV...,,top_banks,False,1649240505000,False
1511650360385052680,Mohd Nazri Mohd Nor,MohdNazriMohdN9,en,0,0,1452885627595149312,random,False,1,RT @AmeerGiveaway: Hey guys it’s Ameer! Just m...,"Kuala Lumpur City, Kuala Lumpu",top_banks,False,1649240505000,True
1511650360414584836,Borsalino Kizaru,kborsalino00,en,0,0,1487770961176719367,random,False,3,"@yogetoth Hello ELEFbody\n\nStarting April 8, ...",,top_banks,False,1649240505000,False
1511650360628293635,jhangvi,Alikhan13319125,en,0,0,1507030320847245312,random,False,2,@LaCryptoMonnai1 6/4\n?????? ??????\n ?REAL UT...,,top_banks,False,1649240505000,False
1511650360758263811,Imola_red ?,Imola_red888,en,0,0,1171555019478507520,random,False,4,RT @LOcommunityNFT: Fatima living that Lo-Fi l...,"Birmingham, England",top_banks,False,1649240505000,True
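The values in the report_date column look like Unix timestamps in milliseconds. This is an assumption based on their magnitude, not something the data source documents. Under that assumption, a value can be converted into a readable date as sketched below:

```python
from datetime import datetime, timezone

# Value taken from the report_date column above.
# Assumption: it is a Unix timestamp in milliseconds.
report_date_ms = 1649240505000

# fromtimestamp expects seconds, so we divide by 1000.
report_date = datetime.fromtimestamp(report_date_ms / 1000, tz=timezone.utc)

print(report_date.isoformat())
```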


Dropping unwanted columns:

In [19]:
df_clean = df.drop(columns=['user_name', 'provider', 'isSensitive', 'user_location', 'keyword', 'isRetweeted', 'isRetweet'])
df_clean.head()

tweet_id,user_screen_name,language,nb_retweets,nb_likes,user_id,countStatus,text,report_date
1511650360317861894,JaineetKing,en,0,0,1510316194418597897,8,@purpleNFTmuseum TALKADO NFT #Whitelist IS LIV...,1649240505000
1511650360385052680,MohdNazriMohdN9,en,0,0,1452885627595149312,1,RT @AmeerGiveaway: Hey guys it’s Ameer! Just m...,1649240505000
1511650360414584836,kborsalino00,en,0,0,1487770961176719367,3,"@yogetoth Hello ELEFbody\n\nStarting April 8, ...",1649240505000
1511650360628293635,Alikhan13319125,en,0,0,1507030320847245312,2,@LaCryptoMonnai1 6/4\n?????? ??????\n ?REAL UT...,1649240505000
1511650360758263811,Imola_red888,en,0,0,1171555019478507520,4,RT @LOcommunityNFT: Fatima living that Lo-Fi l...,1649240505000


Keeping only the rows whose text is in English, then dropping the language and report_date columns:

In [28]:
df_clean = df_clean[df_clean.language == 'en']
df_clean = df_clean.drop(columns=['language', 'report_date'])
df_clean.head(10)

tweet_id,user_screen_name,nb_retweets,nb_likes,user_id,countStatus,text
1511650360317861894,JaineetKing,0,0,1510316194418597897,8,@purpleNFTmuseum TALKADO NFT #Whitelist IS LIV...
1511650360385052680,MohdNazriMohdN9,0,0,1452885627595149312,1,RT @AmeerGiveaway: Hey guys it’s Ameer! Just m...
1511650360414584836,kborsalino00,0,0,1487770961176719367,3,"@yogetoth Hello ELEFbody\n\nStarting April 8, ..."
1511650360628293635,Alikhan13319125,0,0,1507030320847245312,2,@LaCryptoMonnai1 6/4\n?????? ??????\n ?REAL UT...
1511650360758263811,Imola_red888,0,0,1171555019478507520,4,RT @LOcommunityNFT: Fatima living that Lo-Fi l...
1511650360833716230,Mrszoe8,0,0,1497442153592606722,7,PROMOTE IT ON https://t.co/0kaEzKyesI
1511650360905113600,ngorji3,0,0,1486787670491766784,6,RT @bigtraderrrrr: @AltcoinWorldcom New genera...
1511650360951255040,manishms75,0,0,1467093830830800902,5,RT @Hujemi2: @autofarmnetwork A great project ...
1511650361211506688,Aprilw1nberkah,0,0,1499998570899914753,10,"RT @TiarCrypto: $250 | 3,5 JUTA in 72 HOURS\n\..."
1511650361249079297,Spice378j,0,0,1260085132855312384,9,RT @Play_Ukiyo: ?? Ukiyo X PXN ?\n\nWe have te...


Dropping more unwanted columns:

In [30]:
df_clean = df_clean.drop(columns=['nb_retweets', 'nb_likes'])
df_clean.head(10)

tweet_id,user_screen_name,user_id,countStatus,text
1511650360317861894,JaineetKing,1510316194418597897,8,@purpleNFTmuseum TALKADO NFT #Whitelist IS LIV...
1511650360385052680,MohdNazriMohdN9,1452885627595149312,1,RT @AmeerGiveaway: Hey guys it’s Ameer! Just m...
1511650360414584836,kborsalino00,1487770961176719367,3,"@yogetoth Hello ELEFbody\n\nStarting April 8, ..."
1511650360628293635,Alikhan13319125,1507030320847245312,2,@LaCryptoMonnai1 6/4\n?????? ??????\n ?REAL UT...
1511650360758263811,Imola_red888,1171555019478507520,4,RT @LOcommunityNFT: Fatima living that Lo-Fi l...
1511650360833716230,Mrszoe8,1497442153592606722,7,PROMOTE IT ON https://t.co/0kaEzKyesI
1511650360905113600,ngorji3,1486787670491766784,6,RT @bigtraderrrrr: @AltcoinWorldcom New genera...
1511650360951255040,manishms75,1467093830830800902,5,RT @Hujemi2: @autofarmnetwork A great project ...
1511650361211506688,Aprilw1nberkah,1499998570899914753,10,"RT @TiarCrypto: $250 | 3,5 JUTA in 72 HOURS\n\..."
1511650361249079297,Spice378j,1260085132855312384,9,RT @Play_Ukiyo: ?? Ukiyo X PXN ?\n\nWe have te...


Here, we save the dataframe as a parquet file so that we can work with it later without going through all the previous steps every time.

In [33]:
df_clean.to_parquet('/Users/romanpavlyutin/Desktop/df.parquet.gzip', compression='gzip')

In [36]:
df2 = dd.read_parquet('/Users/romanpavlyutin/Desktop/df.parquet.gzip')
df2.head(15)

tweet_id,user_screen_name,user_id,countStatus,text
1511650360317861894,JaineetKing,1510316194418597897,8,@purpleNFTmuseum TALKADO NFT #Whitelist IS LIV...
1511650360385052680,MohdNazriMohdN9,1452885627595149312,1,RT @AmeerGiveaway: Hey guys it’s Ameer! Just m...
1511650360414584836,kborsalino00,1487770961176719367,3,"@yogetoth Hello ELEFbody\n\nStarting April 8, ..."
1511650360628293635,Alikhan13319125,1507030320847245312,2,@LaCryptoMonnai1 6/4\n?????? ??????\n ?REAL UT...
1511650360758263811,Imola_red888,1171555019478507520,4,RT @LOcommunityNFT: Fatima living that Lo-Fi l...
1511650360833716230,Mrszoe8,1497442153592606722,7,PROMOTE IT ON https://t.co/0kaEzKyesI
1511650360905113600,ngorji3,1486787670491766784,6,RT @bigtraderrrrr: @AltcoinWorldcom New genera...
1511650360951255040,manishms75,1467093830830800902,5,RT @Hujemi2: @autofarmnetwork A great project ...
1511650361211506688,Aprilw1nberkah,1499998570899914753,10,"RT @TiarCrypto: $250 | 3,5 JUTA in 72 HOURS\n\..."
1511650361249079297,Spice378j,1260085132855312384,9,RT @Play_Ukiyo: ?? Ukiyo X PXN ?\n\nWe have te...


We have split this project into several notebooks to keep it tidy and easy to follow, and so that we do not have to rerun all the steps every time.