# Text analysis

Intro

In [1]:
import pandas as pd 
filename = 'tocho2.csv'
df = pd.read_csv(filename, index_col = 0)
df['text_ tokens'] = df['text_ tokens'].str.split('\t')
print(df['text_ tokens'].iloc[0])

['101', '16493', '12478', '117', '10105', '42370', '76299', '100', '187', '15480', '10108', '23837', '117', '10393', '169', '32342', '18077', '10135', '10226', '27925', '10111', '10393', '17087', '10950', '10114', '41904', '51937', '16981', '10142', '15217', '119', '14820', '10689', '17860', '169', '28502', '11360', '10689', '13461', '10105', '13451', '35688', '26981', '10142', '12557', '136', '14120', '131', '120', '120', '188', '119', '11170', '120', '175', '11166', '15705', '10129', '19282', '10113', '11396', '10362', '102']
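
Note that `str.split('\t')` leaves every ID as a string (`'101'`, not `101`). A minimal sketch of the same parsing in plain Python, with an optional conversion to integers; the helper name `parse_token_ids` is hypothetical and not part of the notebook:

```python
def parse_token_ids(raw):
    """Split a tab-separated string of token IDs and convert each one to int."""
    return [int(tok) for tok in raw.split("\t") if tok]

# First IDs of the tweet printed above, joined with tabs for illustration
raw_tokens = "101\t16493\t12478\t117"
token_ids = parse_token_ids(raw_tokens)
print(token_ids)
```

The notebook keeps the IDs as strings throughout, which is why the special-token IDs are converted to strings before being compared later on.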


We want to identify the language of the tweets, which are encoded with [BERT Base Multilingual Cased](https://huggingface.co/bert-base-multilingual-cased). Since the values of the `language` column are themselves encoded, we inspect the content of the texts to find the tweets written in languages we can read.

To do so, we decode the text of one tweet per `language` value and read its content. We will use the *transformers* library:

In [2]:
#Import of the BertTokenizer module to detokenize the text tokens
from transformers import BertTokenizer

#Setting the BERT Base Multilingual Cased tokenizer model
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')

#Print the first tweet in the most frequent languages
for index in df.language.value_counts().index[:21]:
    dl = df[df.language == index].copy()
    dl = dl.reset_index(drop=True)
    print("language = {} \n".format(index))        
    print(tokenizer.decode(dl['text_ tokens'].iloc[0]) + '\n')

Neither PyTorch nor TensorFlow >= 2.0 have been found.Models won't be available and only tokenizers, configurationand file/data utilities can be used.


language = D3164C7FBCF2565DDF915B1B3AEFB1DC 

[CLS] Jacob Young, the Syndicate [UNK] s Director of PR, has a serious problem on his hands and has reached out to Hammond Robotics for help. Can they find a solution before they become the next ones marked for death? https : / / t. co / g5evrQa8va [SEP]

language = 22C448FF81263D4BAF2A176145EE9EAD 

[CLS] ただいま うさ 顔 洗 い 中. おしっこ 飛 ばしの 後 にはいつもする 仕 草. 壁 跡 は 生 々しいけれど 、 なんとも 愛 らしい [UNK] [UNK] ！.. # うさぎ # うさぎ 好 きさんと 繋 がりたい # 顔 洗 い # キリトリセカイ # 癒 し 動 画 # 動 物 好 きな 人 と 繋 がりたい # うさぎあるある # 動 物 あるある # ファインダー 越 しの 私 の 世 界 # 部 屋 んぽ # かわいい https : / / t. co / 2gKb0EeWqa [SEP]

language = 06D61DCBBE938971E1EA0C38BD9B5446 

[CLS] RT @ GermantequilIa : otra vez Las Palmas hacia ruta 68 pista izquierda bloqueada por choque de vehículo @ sitiodelsuceso @ AlertaVR https : / / t. [UNK] [SEP]

language = ECED8A16BE2A5E8871FD55F4842F16B1 

[CLS] Porque ele matou, esquartejou e deu os restos mortais da mãe do filho dele para os cachorros só pq não queria pagar pensã

For the textual analysis of the tweets, we only include those in **English**, **Spanish** and **Catalan**, filtering by the codes that correspond to the texts we identified.

Therefore, we only use the languages `D3164C7FBCF2565DDF915B1B3AEFB1DC` (English), `06D61DCBBE938971E1EA0C38BD9B5446` (Spanish) and `190BA7DA361BC06BC1D7E824C378064D` (Catalan).

In [3]:
#Subset: English language 
language = 'D3164C7FBCF2565DDF915B1B3AEFB1DC'
d_eng = df[df.language == language].copy()
d_eng = d_eng.reset_index(drop=True)

#Subset: Spanish language 
language = '06D61DCBBE938971E1EA0C38BD9B5446'
d_cast = df[df.language == language].copy()
d_cast = d_cast.reset_index(drop=True)

#Subset: Catalan language 
language = '190BA7DA361BC06BC1D7E824C378064D'
d_cat = df[df.language == language].copy()
d_cat = d_cat.reset_index(drop=True)

#len() counts rows (tweets); .size would count rows x columns (24 columns here)
print("English subset has a total of {} tweets".format(len(d_eng)))
print("Spanish subset has a total of {} tweets".format(len(d_cast)))
print("Catalan subset has a total of {} tweets".format(len(d_cat)))

English subset has a total of 190801 tweets
Spanish subset has a total of 41662 tweets
Catalan subset has a total of 1053 tweets


Once the tweets are filtered by language, we look at the tweet types:

In [4]:
d_eng['tweet_type'].value_counts()

TopLevel    99974
Retweet     72282
Quote       18545
Name: tweet_type, dtype: int64

There are three tweet types:
- <b>TopLevel</b>: tweet originally published by a user
- <b>Retweet</b>: tweet shared by a user, but originally published by another user
- <b>Quote</b>: a user's reply to a TopLevel or Retweet

To identify original content per user, we discard the Retweets and keep only the `TopLevel` tweets.

In [5]:
engagement_user_type = 'engaged_with_user_id'
tweet_type = 'TopLevel'
d_eng_top_level = d_eng.loc[d_eng['tweet_type'] == tweet_type]

We can thus obtain the users with the most original posts (TopLevel) in each language.

In [6]:
d_eng_top_level[engagement_user_type].value_counts()[:10]

C6758D692A850E4C67B2763B66D1CFA8    659
5FF622786FB4924A067BD44D4B717570    509
E5D1B83B0E02FAFF871EEEF276D18132    395
7C03844E8B2E0C7B4346D41028AB14E2    328
FBB188A3C1E05C41587AAAC00B5B1879    232
F2A8BF0F4EB185E6D2E5E1A0DF4C33AE    204
416B919C0DAA48D42FF6780574034149    189
8A800256378089EF53C6F655F8690490    183
A9DAB08351D94BDE86235B37D6E8C61D    182
9D9C2BC354011249F2D4D9B9C4205AC9    168
Name: engaged_with_user_id, dtype: int64

For the first analysis, we use the most 'active' user in English.

In [7]:
#Take the user ID with most publications
engaging_user_id = d_eng_top_level[engagement_user_type].value_counts().index[0]

#Subset with only the tweets of the user ID with most publications
#(the boolean mask is built from the same frame that is being filtered)
d_eng_user = d_eng_top_level.loc[d_eng_top_level[engagement_user_type] == engaging_user_id]

In [8]:
for n in range (10): print(tokenizer.decode(d_eng_user['text_ tokens'].iloc[n]))

[CLS] Why didn [UNK] t John Bolton complain about this [UNK] nonsense [UNK] a long time ago, when he was very publicly terminated. He said, not that it matters, NOTHING! [SEP]
[CLS] [UNK] I like President Trump [UNK] s Tweets ( Social Media ), I like everything about him... and this Ukraine stuff, the trial, the impeachment, this isn [UNK] t t about Ukraine. Donald Trump has committed the two unpardonable sins in the eyes of the Democrats. He beat Hillary Clinton in 2016, and.. [SEP]
[CLS] Are you better off now than you were three years ago? Almost everyone say YES! [SEP]
[CLS] Reports are that basketball great Kobe Bryant and three others have been killed in a helicopter crash in California. That is terrible news! [SEP]
[CLS] The Democrat controlled House never even asked John Bolton to testify. It is up to them, not up to the Senate! [SEP]
[CLS]..... So, what the hell has happened to @ FoxNews. Only I know! Chris Wallace and others should be on Fake News CNN or MSDNC. How [UNK] s Sh

## Text tokens: analysis and data cleaning

We print the first tokens of the model's vocabulary to see whether the token IDs follow any particular order.

In [9]:
n = 1000
print(tokenizer.decode(list(range(n))))

[PAD] [unused1] [unused2] [unused3] [unused4] [unused5] [unused6] [unused7] [unused8] [unused9] [unused10] [unused11] [unused12] [unused13] [unused14] [unused15] [unused16] [unused17] [unused18] [unused19] [unused20] [unused21] [unused22] [unused23] [unused24] [unused25] [unused26] [unused27] [unused28] [unused29] [unused30] [unused31] [unused32] [unused33] [unused34] [unused35] [unused36] [unused37] [unused38] [unused39] [unused40] [unused41] [unused42] [unused43] [unused44] [unused45] [unused46] [unused47] [unused48] [unused49] [unused50] [unused51] [unused52] [unused53] [unused54] [unused55] [unused56] [unused57] [unused58] [unused59] [unused60] [unused61] [unused62] [unused63] [unused64] [unused65] [unused66] [unused67] [unused68] [unused69] [unused70] [unused71] [unused72] [unused73] [unused74] [unused75] [unused76] [unused77] [unused78] [unused79] [unused80] [unused81] [unused82] [unused83] [unused84] [unused85] [unused86] [unused87] [unused88] [unused89] [unused90] [unused91] [u

We see that the first 105 tokens are reserved for special codes. The tokens that follow are assigned to punctuation marks and digits, and after those come the different alphabets included in the model.

On the other hand, looking at tokens further along, we can see that each word recognised by the model is assigned its own token.

In [10]:
print(tokenizer.decode(list(range(11000, 11200))))

33 годинеן three 1948 fuů invånaream kvadratkilometerou4 Earthä anche benот 1942는 made англria కి yeov 00 1957 người 1930 1920il used 1954mo dhe 09 vomwa Unjo sebagai formaésicabe Seion durante przez suchного deux 08 known ved 36 South ta nie then он USAken 1946 dass You Mariaina να 34 final 1955 से である Leagueной کی populationпраland का 1938lla than nuب 1953ine naamTسный 300 IV 1949 Thomas 1918 India τον 1952 All한 הוא State merupakanскойunні Oxford familie club July 1947 یکzione més elle Richardche June team Az 1936 unter học 44Pরneric care secondł zen ca mga Nel として Retrievedok відigth Marchو inomnuш She που には someada Santa there olehste GeoNamesut British pe Film ke West data5 gli areaán co 38 һәм 1937ció born 400 kao any became τωνdos vớitonского serie एक ook છેque vorку place Berlinныеty soorttes includingে


### Text cleaning

We can see in the text that some codes are repeated at the start (`[CLS]`) and end (`[SEP]`) of every tweet. For the data cleaning process, we want to remove these codes. We also take the opportunity to remove any occurrence of other tokens that add no value to the text: `[PAD]` and `[MASK]`.

In [11]:
num_tokens = [0, 101, 102, 103]

for x in num_tokens:
    print('{}: {}'.format(x, tokenizer.decode(x)))

0: [ P A D ]
101: [ C L S ]
102: [ S E P ]
103: [ M A S K ]


Once the token ID of each of these codes is identified, we remove them from our dataset.

In [12]:
#Convert the token list to string items
num_tokens = [ str(x) for x in num_tokens ]

#Remove all the items in num_tokens in each tweet
for k in range(len(d_eng['text_ tokens'])):
    for r in num_tokens:
        if r in d_eng['text_ tokens'][k]:
            d_eng['text_ tokens'][k].remove(r)

d_eng_top_level = d_eng.loc[d_eng['tweet_type'] == tweet_type]
engaging_user_id = d_eng_top_level[engagement_user_type].value_counts().index[0]
d_eng_user = d_eng_top_level.loc[d_eng_top_level[engagement_user_type] == engaging_user_id]

for n in range (10): print(tokenizer.decode(d_eng_user['text_ tokens'].iloc[n]))

Why didn [UNK] t John Bolton complain about this [UNK] nonsense [UNK] a long time ago, when he was very publicly terminated. He said, not that it matters, NOTHING!
[UNK] I like President Trump [UNK] s Tweets ( Social Media ), I like everything about him... and this Ukraine stuff, the trial, the impeachment, this isn [UNK] t t about Ukraine. Donald Trump has committed the two unpardonable sins in the eyes of the Democrats. He beat Hillary Clinton in 2016, and..
Are you better off now than you were three years ago? Almost everyone say YES!
Reports are that basketball great Kobe Bryant and three others have been killed in a helicopter crash in California. That is terrible news!
The Democrat controlled House never even asked John Bolton to testify. It is up to them, not up to the Senate!
..... So, what the hell has happened to @ FoxNews. Only I know! Chris Wallace and others should be on Fake News CNN or MSDNC. How [UNK] s Shep Smith doing? Watch, this will be the beginning of the end for 
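
`list.remove` deletes only the first occurrence of a value, so a token list that contained, say, two `[SEP]` IDs would keep one of them. A minimal sketch of an alternative that filters out every occurrence; `SPECIAL_IDS` and `strip_special_tokens` are hypothetical names, and the IDs are the ones identified above:

```python
# IDs of [PAD], [CLS], [SEP] and [MASK], stored as strings like the dataset tokens
SPECIAL_IDS = {"0", "101", "102", "103"}

def strip_special_tokens(tokens):
    """Return a new list without any special-token ID (all occurrences)."""
    return [tok for tok in tokens if tok not in SPECIAL_IDS]

sample = ["101", "16493", "12478", "102", "102"]
print(strip_special_tokens(sample))
```

With pandas, this could be applied column-wise as `d_eng['text_ tokens'].apply(strip_special_tokens)`, which also avoids mutating the lists through chained indexing.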

## Links: URLs, media and retweets

In [13]:
import re
#We are looking for patterns such as https : / / t. co / yvMa6bPqfy
valid_pattern = r'https : \/ \/ t\. co \/ [\dA-Za-z\.-]+'
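
A quick check of the pattern on a decoded string (the sample text is made up from fragments shown in this notebook). Because `.` is part of the character class, a sentence-ending period that directly follows a link is captured as well; the `rstrip('.')` step is an optional assumption, not something the notebook does:

```python
import re

# Same pattern as above, written as a raw string without the redundant escapes
pattern = r"https : / / t\. co / [\dA-Za-z.-]+"

decoded = ("https : / / t. co / aFdLgZl3vE. Great news, fit post. "
           "https : / / t. co / dbU1hBNELy")

# Collapse the spaces the tokenizer inserted inside each link
links = [match.replace(" ", "") for match in re.findall(pattern, decoded)]
print(links)

# Optional: drop a trailing period captured from the end of a sentence
cleaned = [link.rstrip(".") for link in links]
print(cleaned)
```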

In [14]:
list_twits = []

#Gathering all the strings that meet the pattern
for index, row in d_eng_user.iterrows(): 
    list_twits.append(re.findall(valid_pattern, 
                                 tokenizer.decode(d_eng['text_ tokens'][index])))
    
    #print(tokenizer.decode(d_tweet_type['text_ tokens'][index]))

In [15]:
#Let's print them in order to check the resulting strings and easily access the links in the text
for i, s in enumerate(list_twits):
    list_twits_item = []
    for item in list_twits[i]:
        link_regex = item.replace(' ','')
        list_twits_item.append(link_regex) #replace() is used because strip() only trims leading/trailing whitespace
        print(link_regex) #Let's print the link so we can access it easier
    
    #list_twits_fix.append(list_twits_item)
    if list_twits_item != []: print(list_twits_item) #Print only the lists that are not empty
#print(list_twits_fix)

https://t.co/drhi6UeJ14
['https://t.co/drhi6UeJ14']
https://t.co/yvMa6bPqfy
['https://t.co/yvMa6bPqfy']
https://t.co/drhi6UeJ14
['https://t.co/drhi6UeJ14']
https://t.co/4YCo01XYCn
['https://t.co/4YCo01XYCn']
https://t.co/UCdQXY3vPn
['https://t.co/UCdQXY3vPn']
https://t.co/QwYzS32lyQ
['https://t.co/QwYzS32lyQ']
https://t.co/UCdQXY3vPn
['https://t.co/UCdQXY3vPn']
https://t.co/WFK33pR0Lv
['https://t.co/WFK33pR0Lv']
https://t.co/SOn6wRV9Zs
['https://t.co/SOn6wRV9Zs']
https://t.co/VshQceiwUA
['https://t.co/VshQceiwUA']
https://t.co/5NeC0mFWfU
['https://t.co/5NeC0mFWfU']
https://t.co/rpGHpdoIa6
['https://t.co/rpGHpdoIa6']
https://t.co/9LXDf6mJKf
['https://t.co/9LXDf6mJKf']
https://t.co/rrtF1Stk78
['https://t.co/rrtF1Stk78']
https://t.co/UCdQXY3vPn
['https://t.co/UCdQXY3vPn']
https://t.co/QwYzS32lyQ
['https://t.co/QwYzS32lyQ']
https://t.co/rrtF1Stk78
['https://t.co/rrtF1Stk78']
https://t.co/VshQceiwUA
['https://t.co/VshQceiwUA']
https://t.co/yvMa6bPqfy
['https://t.co/yvMa6bPqfy']
https://t.co

In [16]:
#Create a new column to store the list of links
d_eng['present_links_url'] = ""
#d_eng.at[index, 'present_links_text'] = links_list

for index, row in d_eng.iterrows():
    links_list = []
    links_list = re.findall(valid_pattern, tokenizer.decode(d_eng['text_ tokens'][index]))
    for i, s in enumerate(links_list):
        links_list[i] = s.replace(' ','') #replace() is used because strip() only trims leading/trailing whitespace
        
    d_eng.at[index, 'present_links_url'] = links_list

In [17]:
#Replace the links in the text in order to make them clickable

#Now that the text will be more readable, we will create a column with the text
d_eng['text'] = ""    

for index, row in d_eng.iterrows():
    links_list = []
    tweet_text = tokenizer.decode(d_eng['text_ tokens'][index])
    links_list = re.findall(valid_pattern, tweet_text)
    
    for i, s in enumerate(links_list):
        if i != 0:
            tweet_text = tweet_text_replaced
        link_old = s
        link_new = s.replace(' ','') #replace() is used because strip() only trims leading/trailing whitespace
        tweet_text_replaced = tweet_text.replace(link_old, link_new)
    
    if links_list == []:    
        d_eng.at[index, 'text'] = tweet_text
    else:
        d_eng.at[index, 'text'] = tweet_text_replaced
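
The same replacement can be sketched in a single pass with `re.sub` and a callback, which avoids carrying `tweet_text_replaced` from one iteration to the next. `compact_links` and `LINK_PATTERN` are hypothetical names; the pattern is the one defined earlier:

```python
import re

LINK_PATTERN = re.compile(r"https : / / t\. co / [\dA-Za-z.-]+")

def compact_links(text):
    """Collapse the spaces inside every tokenised t.co link found in `text`."""
    return LINK_PATTERN.sub(lambda m: m.group(0).replace(" ", ""), text)

print(compact_links("fit post. https : / / t. co / dbU1hBNELy"))
```

Text without links is returned unchanged, so no separate branch for the empty case is needed.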

In [18]:
second_col = d_eng['text']
d_eng.drop(labels=['text'], axis=1, inplace = True)
d_eng.insert(1, 'text', second_col)
d_eng

Unnamed: 0,text_ tokens,text,hashtags,tweet_id,present_media,present_links,present_domains,tweet_type,language,tweet_timestamp,...,engaging_user_follower_count,engaging_user_following_count,engaging_user_is_verified,engaging_user_account_creation,engagee_follows_engager,reply_timestamp,retweet_timestamp,retweet_with_comment_timestamp,like_timestamp,present_links_url
0,"[16493, 12478, 117, 10105, 42370, 76299, 100, ...","Jacob Young, the Syndicate [UNK] s Director of...",,39024FBE0136E046D1357196BAECFCA6,GIF,,,TopLevel,D3164C7FBCF2565DDF915B1B3AEFB1DC,1581532200,...,2,29,False,1568107028,False,,,,,[https://t.co/g5evrQa8va]
1,"[14924, 16118, 10114, 12888, 15127, 31204, 101...",Final week to see my exhibition in the Shircli...,3653868A966576CF17D6A9064889BCED\t7A6710E791A1...,AB3ADBBD011F88D10FE7F6C5FDAB214C,Photo,,,TopLevel,D3164C7FBCF2565DDF915B1B3AEFB1DC,1581087590,...,515,424,False,1254588643,True,,,,1.581090e+09,[https://t.co/PxG5XzTruM]
2,"[138, 19826, 10108, 10105, 10635, 10105, 12250...",A lot of the time the side that wins is simply...,,675D7920EA2FB4869BA767F5122FB115,,,,TopLevel,D3164C7FBCF2565DDF915B1B3AEFB1DC,1581134845,...,4,125,False,1547955941,False,,,,1.581135e+09,[]
3,"[56898, 137, 42374, 36630, 10797, 90861, 131, ...","RT @ readyletsgo27 : Schiff, Nadler and Pelosi...",,7EAAE5C243D675AD4CFC2C052B0E0BB5,,,,Retweet,D3164C7FBCF2565DDF915B1B3AEFB1DC,1581011517,...,33439,35598,False,1261862258,True,,1.581012e+09,,,[]
4,"[137, 87043, 10679, 11369, 11211, 22650, 11305...",@ LiliyaL62076921 Unique beauty of flowers [UN...,C8F2526CD1AEA3A8CFC67EB4B79E46D5,E243DF6401359C38C5D21B6FB86A881E,,,,TopLevel,D3164C7FBCF2565DDF915B1B3AEFB1DC,1581037782,...,27980,25991,False,1493066269,True,,,,1.581058e+09,[]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
190796,"[14600, 146, 10134, 10106, 56237, 10112, 10108...",While I was in awe of how absolutely adorable ...,,2E56D8FF0CEF6509FE6FA5A20649D70A,Video,,,TopLevel,D3164C7FBCF2565DDF915B1B3AEFB1DC,1581547818,...,558,541,False,1492528509,False,,,,,[https://t.co/9RrcqgY1uW]
190797,"[56898, 137, 38553, 20664, 12337, 22897, 10237...",RT @ lapublichealth : Media Advisory and Press...,1C6C4D7B7E82DE3337CE2049CA52CF22,E618A26B30E697DABE47B63B0F9C8C4D,,,,Retweet,D3164C7FBCF2565DDF915B1B3AEFB1DC,1581105025,...,363,885,True,1477249339,False,,,,,[]
190798,"[21635, 11841, 119, 14120, 131, 120, 120, 188,...",fit post. https://t.co/dbU1hBNELy,,1F2B267AA1EE6C11D5734174EE5CB8D0,Photo\tPhoto\tPhoto,,,TopLevel,D3164C7FBCF2565DDF915B1B3AEFB1DC,1581379814,...,178,139,False,1451443862,False,,,,1.581399e+09,[https://t.co/dbU1hBNELy]
190799,"[14120, 131, 120, 120, 188, 119, 11170, 120, 1...",https://t.co/aFdLgZl3vE. Great news Timmy's go...,,EF9D148030AF297AF0A9B536EB548A59,,979697C6B686E3B27F245B573057CFC7,866690453BCAA699384C54E88AB0C123,TopLevel,D3164C7FBCF2565DDF915B1B3AEFB1DC,1581341973,...,1714,2448,False,1341837788,True,,1.581357e+09,,1.581357e+09,[https://t.co/aFdLgZl3vE.]


In [19]:
#We'll use a list of 19 random ints (values may repeat) in order to print random tweets
import random
randomlist = []
for i in range(0,19):
    n = random.randint(0,100)
    randomlist.append(n)
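
`random.randint` can return the same position more than once, and the fixed upper bound of 100 assumes the selected user has at least 101 tweets. A minimal sketch that draws distinct positions bounded by the actual number of rows; `n_rows` and `n_samples` are placeholder values (in the notebook, `n_rows` would be something like `len(d_eng_user)`):

```python
import random

n_rows = 659      # placeholder: number of tweets of the selected user
n_samples = 10    # how many distinct tweets to print

random.seed(0)    # optional, makes the draw reproducible
sample_positions = random.sample(range(n_rows), k=min(n_samples, n_rows))
print(sample_positions)
```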

In [20]:
d_eng_top_level = d_eng.loc[d_eng['tweet_type'] == tweet_type]
engaging_user_id = d_eng_top_level[engagement_user_type].value_counts().index[0]
d_eng_user = d_eng_top_level.loc[d_eng_top_level[engagement_user_type] == engaging_user_id]

for n in randomlist: 
    links_list = d_eng_user['present_links_url'].iloc[n]
#    print(d_eng_user['engaged_with_user_id'].iloc[n] + ': \n\n'  + 
#                           'Original text: \n' + tokenizer.decode(d_eng_user['text_ tokens'].iloc[n]) + '\n\n' +
#                           'Resulting text: \n' + d_eng_user['text'].iloc[n] + '\n\n' +
#                           'Links: ' + '{} \n'.format(d_eng_user['present_links_url'].iloc[n]) + '\n\n' +
#                           '---------------' + '\n\n' )
    print('User ID: ' + d_eng_user['engaged_with_user_id'].iloc[n] + '\n'  +
          'Tweet ID: ' + d_eng_user['tweet_id'].iloc[n] + '\n\n'  +
          'Original text: \n' + tokenizer.decode(d_eng_user['text_ tokens'].iloc[n]) + '\n\n' +
          'Resulting text: \n' + d_eng_user['text'].iloc[n] + '\n')
    if links_list != []:
        print ('Links: ')
        for item in links_list:
            print(item)
        print('\n')
    print('------------------------ \n')

User ID: C6758D692A850E4C67B2763B66D1CFA8
Tweet ID: AD10C649E8CD710303E443F5DE067D31

Original text: 
Kobe Bryant, despite being one of the truly great basketball players of all time, was just getting started in life. He loved his family so much, and had such strong passion for the future. The loss of his beautiful daughter, Gianna, makes this moment even more devastating....

Resulting text: 
Kobe Bryant, despite being one of the truly great basketball players of all time, was just getting started in life. He loved his family so much, and had such strong passion for the future. The loss of his beautiful daughter, Gianna, makes this moment even more devastating....

------------------------ 

User ID: C6758D692A850E4C67B2763B66D1CFA8
Tweet ID: F930B864A75E22D3CD00FD4E2C6E8294

Original text: 
Just received a briefing on the Coronavirus in China from all of our GREAT agencies, who are also working closely with China. We will continue to monitor the ongoing developments. We have the best

In [21]:
#Create text with no links
d_eng['text_tokens_no_links'] = ""    

for index, row in d_eng.iterrows():
    links_list = []
    tweet_text = tokenizer.decode(d_eng['text_ tokens'][index])
    links_list = re.findall(valid_pattern, tweet_text)
    
    for i, s in enumerate(links_list):
        if i != 0:
            tweet_text = tweet_text_replaced
        link_old = s
        tweet_text_replaced = tweet_text.replace(link_old, '')
    
    if links_list == []:
        d_eng.at[index, 'text_tokens_no_links'] = d_eng['text_ tokens'][index]        
    else:
        new_tokens = tokenizer.encode(tweet_text_replaced)
        new_tokens = [str(x) for x in new_tokens]
        d_eng.at[index, 'text_tokens_no_links'] = new_tokens

for k in range(len(d_eng['text_tokens_no_links'])):
    for r in num_tokens:
        if r in d_eng['text_tokens_no_links'][k]:
            d_eng['text_tokens_no_links'][k].remove(r)
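
`tokenizer.encode` adds `[CLS]` and `[SEP]` again by default, which is why the special tokens are removed a second time above. On the string side, deleting a link leaves its surrounding spaces behind; a sketch of a link-removal helper that also tidies the whitespace (`remove_links` and `LINK_PATTERN` are hypothetical names):

```python
import re

LINK_PATTERN = re.compile(r"https : / / t\. co / [\dA-Za-z.-]+")

def remove_links(text):
    """Delete tokenised t.co links and collapse the leftover whitespace."""
    without_links = LINK_PATTERN.sub("", text)
    return " ".join(without_links.split())

print(remove_links("fit post. https : / / t. co / dbU1hBNELy"))
```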

In [22]:
second_col = d_eng['text_tokens_no_links']
d_eng.drop(labels=['text_tokens_no_links'], axis=1, inplace = True)
d_eng.insert(1, 'text_tokens_no_links', second_col)            
            
d_eng.head()

Unnamed: 0,text_ tokens,text_tokens_no_links,text,hashtags,tweet_id,present_media,present_links,present_domains,tweet_type,language,...,engaging_user_follower_count,engaging_user_following_count,engaging_user_is_verified,engaging_user_account_creation,engagee_follows_engager,reply_timestamp,retweet_timestamp,retweet_with_comment_timestamp,like_timestamp,present_links_url
0,"[16493, 12478, 117, 10105, 42370, 76299, 100, ...","[16493, 12478, 117, 10105, 42370, 76299, 100, ...","Jacob Young, the Syndicate [UNK] s Director of...",,39024FBE0136E046D1357196BAECFCA6,GIF,,,TopLevel,D3164C7FBCF2565DDF915B1B3AEFB1DC,...,2,29,False,1568107028,False,,,,,[https://t.co/g5evrQa8va]
1,"[14924, 16118, 10114, 12888, 15127, 31204, 101...","[14924, 16118, 10114, 12888, 15127, 31204, 101...",Final week to see my exhibition in the Shircli...,3653868A966576CF17D6A9064889BCED\t7A6710E791A1...,AB3ADBBD011F88D10FE7F6C5FDAB214C,Photo,,,TopLevel,D3164C7FBCF2565DDF915B1B3AEFB1DC,...,515,424,False,1254588643,True,,,,1581090000.0,[https://t.co/PxG5XzTruM]
2,"[138, 19826, 10108, 10105, 10635, 10105, 12250...","[138, 19826, 10108, 10105, 10635, 10105, 12250...",A lot of the time the side that wins is simply...,,675D7920EA2FB4869BA767F5122FB115,,,,TopLevel,D3164C7FBCF2565DDF915B1B3AEFB1DC,...,4,125,False,1547955941,False,,,,1581135000.0,[]
3,"[56898, 137, 42374, 36630, 10797, 90861, 131, ...","[56898, 137, 42374, 36630, 10797, 90861, 131, ...","RT @ readyletsgo27 : Schiff, Nadler and Pelosi...",,7EAAE5C243D675AD4CFC2C052B0E0BB5,,,,Retweet,D3164C7FBCF2565DDF915B1B3AEFB1DC,...,33439,35598,False,1261862258,True,,1581012000.0,,,[]
4,"[137, 87043, 10679, 11369, 11211, 22650, 11305...","[137, 87043, 10679, 11369, 11211, 22650, 11305...",@ LiliyaL62076921 Unique beauty of flowers [UN...,C8F2526CD1AEA3A8CFC67EB4B79E46D5,E243DF6401359C38C5D21B6FB86A881E,,,,TopLevel,D3164C7FBCF2565DDF915B1B3AEFB1DC,...,27980,25991,False,1493066269,True,,,,1581058000.0,[]


In [23]:
#Inspect different types of tweets in order to understand which kind of links we have