# Content Analysis
---

## Coding Procedure

**Steps**  
1. **Preprocessing the data** for word representation/embedding
    1. Clean data
    2. Stem, lemmatize, tokenize tweets
    3. Vectorize corpus
2. **Define sanction classifier** with seed words
    * The seed words result from combining:
    1. rationalisation from netnography data
    2. conceptualisation from literature / theoretical concepts

    Example:
```python
sanction = ['sanctions', 'retaliation']
#seed words found in netnographic data, where tweets were coded manually in Microsoft Excel
#(appendix X: Immersion Journal: coding procedure)
```
        
3. **Improve seed words**: apply embedding models (Word2Vec, GloVe, ...) to expand the seed words in each category
    1. Use the seed words to query the model for the most/least similar words (see [lecture 7](https://absalon.ku.dk/courses/56009/files/5967363?module_item_id=1647448))

<img src="img_wv.jpg" alt="W2V code ex:" style="width:400px;height:300px;">
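What "most similar" means in step 3 can be illustrated without a trained model. The sketch below ranks words by cosine similarity to a seed word. The three-dimensional vectors and the helper names `cosine` and `rank_by_similarity` are made up for illustration; in the actual analysis the vectors would come from a trained Word2Vec or GloVe model.

```python
import math

# Hypothetical toy vectors (assumption); real ones come from a trained embedding model.
toy_vectors = {
    "sanctions": [0.9, 0.1, 0.0],
    "embargo": [0.8, 0.2, 0.1],
    "retaliation": [0.7, 0.3, 0.0],
    "holiday": [0.0, 0.1, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_similarity(seed, vectors):
    """Return all other words, sorted from most to least similar to `seed`."""
    seed_vec = vectors[seed]
    scores = {w: cosine(seed_vec, v) for w, v in vectors.items() if w != seed}
    return sorted(scores, key=scores.get, reverse=True)

ranking = rank_by_similarity("sanctions", toy_vectors)
print(ranking)
```

Words at the top of the ranking are candidates for extending the seed list; words at the bottom help to check that the category is not too broad.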

---
---
---

## 1. Preprocessing the data


#### Import data into dataframe

---

In [8]:
import pandas as pd
import re

In [2]:
df = pd.read_excel('data_finito.xlsx')

In [3]:
df.drop(columns=['Unnamed: 0'], inplace=True)

In [4]:
df

Unnamed: 0,handle,Tweet_id,Date,Tweet,Fav_count,Retweet_count,in_reply_to_status_id,english,Name,Country,EU Party,ID,National Party,T_Name_e,Unnamed: 0_y,Twitter_names,Twitter_handles
0,andreykovatchev,1.522546e+18,2022-05-06 11:56:37+00:00,Historic day for #Romania! \n\nU.S. first lady...,5,1,,1,andrey kovatchev,Bulgaria,Group of the European People's Party (Christia...,97968,Citizens for European Development of Bulgaria,andrey kovatchev,376,andrey kovatchev,andreykovatchev
1,andreykovatchev,1.519710e+18,2022-04-28 16:08:40+00:00,"@NicolaeCiuca, Prime Minister of Romania parti...",3,0,,1,andrey kovatchev,Bulgaria,Group of the European People's Party (Christia...,97968,Citizens for European Development of Bulgaria,andrey kovatchev,376,andrey kovatchev,andreykovatchev
2,andreykovatchev,1.518311e+18,2022-04-24 19:26:52+00:00,"Good news from France, good news for Europe an...",6,0,,1,andrey kovatchev,Bulgaria,Group of the European People's Party (Christia...,97968,Citizens for European Development of Bulgaria,andrey kovatchev,376,andrey kovatchev,andreykovatchev
3,andreykovatchev,1.512050e+18,2022-04-07 12:48:56+00:00,Romania is fully prepared to join the #schenge...,3,1,,1,andrey kovatchev,Bulgaria,Group of the European People's Party (Christia...,97968,Citizens for European Development of Bulgaria,andrey kovatchev,376,andrey kovatchev,andreykovatchev
4,andreykovatchev,1.510663e+18,2022-04-03 17:00:15+00:00,I strongly condemn the #massacre in #Bucha com...,14,2,,1,andrey kovatchev,Bulgaria,Group of the European People's Party (Christia...,97968,Citizens for European Development of Bulgaria,andrey kovatchev,376,andrey kovatchev,andreykovatchev
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
605996,JakopDalunde,1.496511e+18,2022-02-23 15:43:44+00:00,"@jonworth No, it was very much introductory re...",1,0,1496510423956430855,1,jakop g. dalunde,Sweden,Group of the Greens/European Free Alliance,183338,Miljöpartiet de gröna,jakop dalunde,531,jakop dalunde,JakopDalunde
605997,JakopDalunde,1.496503e+18,2022-02-23 15:13:04+00:00,@jonworth I was in the room. His address was g...,4,0,1496449399588859909,1,jakop g. dalunde,Sweden,Group of the Greens/European Free Alliance,183338,Miljöpartiet de gröna,jakop dalunde,531,jakop dalunde,JakopDalunde
605998,JakopDalunde,1.496305e+18,2022-02-23 02:03:34+00:00,RT @ThomasVLinge: If you're gonna listen to an...,0,44010,,1,jakop g. dalunde,Sweden,Group of the Greens/European Free Alliance,183338,Miljöpartiet de gröna,jakop dalunde,531,jakop dalunde,JakopDalunde
605999,JakopDalunde,1.496109e+18,2022-02-22 13:04:19+00:00,RT @OstrivGame: I wish all my followers outsid...,0,129,,1,jakop g. dalunde,Sweden,Group of the Greens/European Free Alliance,183338,Miljöpartiet de gröna,jakop dalunde,531,jakop dalunde,JakopDalunde


---

#### 1.A Clean and sort the text data
* Create a hashtag variable, defining the hashtags used in the tweet
* Create a retweet variable that determines whether the tweet is a retweet (1) or a regular tweet (0)
* Create a mentions variable, defining the users mentioned in the tweet
* Remove the retweet tag ('RT') in the text
* Clean the tweet of mentions (@user), punctuation, and links
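As a compact illustration of the steps listed above, the sketch below runs them on a single made-up tweet (not taken from the dataset). The variable names and the assumption that retweets start with `RT @` are illustrative only.

```python
import re

# Made-up sample tweet (assumption, not from the dataset).
sample = "RT @EU_Commission: New #sanctions on Russia agreed https://example.org/x 2022"

hashtags = re.findall(r"#(\w+)", sample)        # words following '#'
mentions = re.findall(r"@(\w{1,15})", sample)   # handles: letters, digits, underscore
is_retweet = int(sample.startswith("RT @"))     # 1 if the text starts with a retweet tag

text = re.sub(r"^RT\s+", "", sample)            # drop the leading retweet tag
text = re.sub(r"\w+://\S+", "", text)           # drop links
text = re.sub(r"@\w+", "", text)                # drop mentions
text = re.sub(r"[^0-9A-Za-z \t]", "", text)     # drop punctuation and symbols
clean = " ".join(text.split())                  # collapse repeated whitespace

print(hashtags, mentions, is_retweet, clean)
```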

In [9]:
#finding hashtags

hashtags = []
for x in df['Tweet']:
    s = re.findall(r"#(\w+)", x)
    hashtags.append(s)
    
df['hashtags'] = hashtags

df.head(3)

Unnamed: 0,handle,Tweet_id,Date,Tweet,Fav_count,Retweet_count,in_reply_to_status_id,english,Name,Country,EU Party,ID,National Party,T_Name_e,Unnamed: 0_y,Twitter_names,Twitter_handles,hashtags
0,andreykovatchev,1.522546e+18,2022-05-06 11:56:37+00:00,Historic day for #Romania! \n\nU.S. first lady...,5,1,,1,andrey kovatchev,Bulgaria,Group of the European People's Party (Christia...,97968,Citizens for European Development of Bulgaria,andrey kovatchev,376,andrey kovatchev,andreykovatchev,[Romania]
1,andreykovatchev,1.51971e+18,2022-04-28 16:08:40+00:00,"@NicolaeCiuca, Prime Minister of Romania parti...",3,0,,1,andrey kovatchev,Bulgaria,Group of the European People's Party (Christia...,97968,Citizens for European Development of Bulgaria,andrey kovatchev,376,andrey kovatchev,andreykovatchev,[ukraine]
2,andreykovatchev,1.518311e+18,2022-04-24 19:26:52+00:00,"Good news from France, good news for Europe an...",6,0,,1,andrey kovatchev,Bulgaria,Group of the European People's Party (Christia...,97968,Citizens for European Development of Bulgaria,andrey kovatchev,376,andrey kovatchev,andreykovatchev,"[FranceElection2022, FranceVotes]"


In [10]:
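# Flag retweets: retweeted text starts with the tag 'RT @user'.
# Checking only the prefix avoids false positives from words that contain 'RT'.
retweet = []
for element in df['Tweet']:
    if element.startswith('RT @'):
        retweet.append(1)
    else:
        retweet.append(0)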
        

df['is_retweet'] = retweet
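A retweet flag can be computed in more than one way, and the choice matters for edge cases. The sketch below contrasts a plain substring test with a stricter prefix test on three made-up tweets. The helper names are hypothetical, and the assumption that retweets always start with `RT @` should be checked against the data.

```python
def is_retweet_substring(text):
    """Flags any text that contains the letters 'RT' anywhere."""
    return int("RT" in text)

def is_retweet_prefix(text):
    """Flags only texts that start with the retweet tag 'RT @'."""
    return int(text.startswith("RT @"))

examples = [
    "RT @someone: breaking news",   # genuine retweet
    "SMART sanctions are needed",   # 'RT' appears inside a word
    "Thanks for the RT!",           # abbreviation used in a regular tweet
]

substring_flags = [is_retweet_substring(t) for t in examples]
prefix_flags = [is_retweet_prefix(t) for t in examples]
print(substring_flags, prefix_flags)
```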

In [11]:
df.head(20)

Unnamed: 0,handle,Tweet_id,Date,Tweet,Fav_count,Retweet_count,in_reply_to_status_id,english,Name,Country,EU Party,ID,National Party,T_Name_e,Unnamed: 0_y,Twitter_names,Twitter_handles,hashtags,is_retweet
0,andreykovatchev,1.522546e+18,2022-05-06 11:56:37+00:00,Historic day for #Romania! \n\nU.S. first lady...,5,1,,1,andrey kovatchev,Bulgaria,Group of the European People's Party (Christia...,97968,Citizens for European Development of Bulgaria,andrey kovatchev,376,andrey kovatchev,andreykovatchev,[Romania],0
1,andreykovatchev,1.51971e+18,2022-04-28 16:08:40+00:00,"@NicolaeCiuca, Prime Minister of Romania parti...",3,0,,1,andrey kovatchev,Bulgaria,Group of the European People's Party (Christia...,97968,Citizens for European Development of Bulgaria,andrey kovatchev,376,andrey kovatchev,andreykovatchev,[ukraine],0
2,andreykovatchev,1.518311e+18,2022-04-24 19:26:52+00:00,"Good news from France, good news for Europe an...",6,0,,1,andrey kovatchev,Bulgaria,Group of the European People's Party (Christia...,97968,Citizens for European Development of Bulgaria,andrey kovatchev,376,andrey kovatchev,andreykovatchev,"[FranceElection2022, FranceVotes]",0
3,andreykovatchev,1.51205e+18,2022-04-07 12:48:56+00:00,Romania is fully prepared to join the #schenge...,3,1,,1,andrey kovatchev,Bulgaria,Group of the European People's Party (Christia...,97968,Citizens for European Development of Bulgaria,andrey kovatchev,376,andrey kovatchev,andreykovatchev,"[schengen, Romania, immoral, unfair]",0
4,andreykovatchev,1.510663e+18,2022-04-03 17:00:15+00:00,I strongly condemn the #massacre in #Bucha com...,14,2,,1,andrey kovatchev,Bulgaria,Group of the European People's Party (Christia...,97968,Citizens for European Development of Bulgaria,andrey kovatchev,376,andrey kovatchev,andreykovatchev,"[massacre, Bucha, Russian, Bucha, BuchaMassacre]",0
5,andreykovatchev,1.509998e+18,2022-04-01 20:56:40+00:00,Congratulations @RobertaMetsola! This picture ...,88,6,,1,andrey kovatchev,Bulgaria,Group of the European People's Party (Christia...,97968,Citizens for European Development of Bulgaria,andrey kovatchev,376,andrey kovatchev,andreykovatchev,"[EU, EU]",0
6,andreykovatchev,1.50702e+18,2022-03-24 15:41:06+00:00,Stronger #EU is stronger #NATO! \n\nJean Monne...,7,1,,1,andrey kovatchev,Bulgaria,Group of the European People's Party (Christia...,97968,Citizens for European Development of Bulgaria,andrey kovatchev,376,andrey kovatchev,andreykovatchev,"[EU, NATO, Ukraine]",0
7,andreykovatchev,1.506284e+18,2022-03-22 14:56:56+00:00,Debate with #ukraine’s Minister of Agriculture...,2,0,,1,andrey kovatchev,Bulgaria,Group of the European People's Party (Christia...,97968,Citizens for European Development of Bulgaria,andrey kovatchev,376,andrey kovatchev,andreykovatchev,"[ukraine, UkraineConflict]",0
8,andreykovatchev,1.503261e+18,2022-03-14 06:44:33+00:00,Humanitarian visit to #Ukraine! \n\nI crossed ...,51,11,,1,andrey kovatchev,Bulgaria,Group of the European People's Party (Christia...,97968,Citizens for European Development of Bulgaria,andrey kovatchev,376,andrey kovatchev,andreykovatchev,"[Ukraine, ukrainian, Ukraine]",0
9,andreykovatchev,1.500189e+18,2022-03-05 19:18:01+00:00,President @KlausIohannis visited the mobile #r...,7,0,,1,andrey kovatchev,Bulgaria,Group of the European People's Party (Christia...,97968,Citizens for European Development of Bulgaria,andrey kovatchev,376,andrey kovatchev,andreykovatchev,"[refugee, Ukraine]",0


In [12]:
#finding people they mention

people = []
for x in df['Tweet']:
    s = re.findall(r"@(\w{1,15})", x)   # handles may contain letters, digits and underscores
    people.append(s)
    
df['mentiones'] = people

In [13]:
df.head(3)

Unnamed: 0,handle,Tweet_id,Date,Tweet,Fav_count,Retweet_count,in_reply_to_status_id,english,Name,Country,EU Party,ID,National Party,T_Name_e,Unnamed: 0_y,Twitter_names,Twitter_handles,hashtags,is_retweet,mentiones
0,andreykovatchev,1.522546e+18,2022-05-06 11:56:37+00:00,Historic day for #Romania! \n\nU.S. first lady...,5,1,,1,andrey kovatchev,Bulgaria,Group of the European People's Party (Christia...,97968,Citizens for European Development of Bulgaria,andrey kovatchev,376,andrey kovatchev,andreykovatchev,[Romania],0,"[FLOTUS, FLOTUS]"
1,andreykovatchev,1.51971e+18,2022-04-28 16:08:40+00:00,"@NicolaeCiuca, Prime Minister of Romania parti...",3,0,,1,andrey kovatchev,Bulgaria,Group of the European People's Party (Christia...,97968,Citizens for European Development of Bulgaria,andrey kovatchev,376,andrey kovatchev,andreykovatchev,[ukraine],0,"[NicolaeCiuca, EPPGroup, NicolaeCiuca]"
2,andreykovatchev,1.518311e+18,2022-04-24 19:26:52+00:00,"Good news from France, good news for Europe an...",6,0,,1,andrey kovatchev,Bulgaria,Group of the European People's Party (Christia...,97968,Citizens for European Development of Bulgaria,andrey kovatchev,376,andrey kovatchev,andreykovatchev,"[FranceElection2022, FranceVotes]",0,[EmmanuelMacron]


In [14]:
#Removing the leading retweet tag ('RT') from tweets:

clean_rt = []
for element in df['Tweet']:
    #Anchoring the pattern removes only the tag at the start, not 'RT' inside words
    clean_rt.append(re.sub(r'^RT\s+', '', element))
        
df['Tweet'] = clean_rt

In [15]:
df.head(20)

Unnamed: 0,handle,Tweet_id,Date,Tweet,Fav_count,Retweet_count,in_reply_to_status_id,english,Name,Country,EU Party,ID,National Party,T_Name_e,Unnamed: 0_y,Twitter_names,Twitter_handles,hashtags,is_retweet,mentiones
0,andreykovatchev,1.522546e+18,2022-05-06 11:56:37+00:00,Historic day for #Romania! \n\nU.S. first lady...,5,1,,1,andrey kovatchev,Bulgaria,Group of the European People's Party (Christia...,97968,Citizens for European Development of Bulgaria,andrey kovatchev,376,andrey kovatchev,andreykovatchev,[Romania],0,"[FLOTUS, FLOTUS]"
1,andreykovatchev,1.51971e+18,2022-04-28 16:08:40+00:00,"@NicolaeCiuca, Prime Minister of Romania parti...",3,0,,1,andrey kovatchev,Bulgaria,Group of the European People's Party (Christia...,97968,Citizens for European Development of Bulgaria,andrey kovatchev,376,andrey kovatchev,andreykovatchev,[ukraine],0,"[NicolaeCiuca, EPPGroup, NicolaeCiuca]"
2,andreykovatchev,1.518311e+18,2022-04-24 19:26:52+00:00,"Good news from France, good news for Europe an...",6,0,,1,andrey kovatchev,Bulgaria,Group of the European People's Party (Christia...,97968,Citizens for European Development of Bulgaria,andrey kovatchev,376,andrey kovatchev,andreykovatchev,"[FranceElection2022, FranceVotes]",0,[EmmanuelMacron]
3,andreykovatchev,1.51205e+18,2022-04-07 12:48:56+00:00,Romania is fully prepared to join the #schenge...,3,1,,1,andrey kovatchev,Bulgaria,Group of the European People's Party (Christia...,97968,Citizens for European Development of Bulgaria,andrey kovatchev,376,andrey kovatchev,andreykovatchev,"[schengen, Romania, immoral, unfair]",0,[]
4,andreykovatchev,1.510663e+18,2022-04-03 17:00:15+00:00,I strongly condemn the #massacre in #Bucha com...,14,2,,1,andrey kovatchev,Bulgaria,Group of the European People's Party (Christia...,97968,Citizens for European Development of Bulgaria,andrey kovatchev,376,andrey kovatchev,andreykovatchev,"[massacre, Bucha, Russian, Bucha, BuchaMassacre]",0,[]
5,andreykovatchev,1.509998e+18,2022-04-01 20:56:40+00:00,Congratulations @RobertaMetsola! This picture ...,88,6,,1,andrey kovatchev,Bulgaria,Group of the European People's Party (Christia...,97968,Citizens for European Development of Bulgaria,andrey kovatchev,376,andrey kovatchev,andreykovatchev,"[EU, EU]",0,"[RobertaMetsola, Europarl]"
6,andreykovatchev,1.50702e+18,2022-03-24 15:41:06+00:00,Stronger #EU is stronger #NATO! \n\nJean Monne...,7,1,,1,andrey kovatchev,Bulgaria,Group of the European People's Party (Christia...,97968,Citizens for European Development of Bulgaria,andrey kovatchev,376,andrey kovatchev,andreykovatchev,"[EU, NATO, Ukraine]",0,[KlausIohannis]
7,andreykovatchev,1.506284e+18,2022-03-22 14:56:56+00:00,Debate with #ukraine’s Minister of Agriculture...,2,0,,1,andrey kovatchev,Bulgaria,Group of the European People's Party (Christia...,97968,Citizens for European Development of Bulgaria,andrey kovatchev,376,andrey kovatchev,andreykovatchev,"[ukraine, UkraineConflict]",0,"[LeshchenkoRM, EP]"
8,andreykovatchev,1.503261e+18,2022-03-14 06:44:33+00:00,Humanitarian visit to #Ukraine! \n\nI crossed ...,51,11,,1,andrey kovatchev,Bulgaria,Group of the European People's Party (Christia...,97968,Citizens for European Development of Bulgaria,andrey kovatchev,376,andrey kovatchev,andreykovatchev,"[Ukraine, ukrainian, Ukraine]",0,[]
9,andreykovatchev,1.500189e+18,2022-03-05 19:18:01+00:00,President @KlausIohannis visited the mobile #r...,7,0,,1,andrey kovatchev,Bulgaria,Group of the European People's Party (Christia...,97968,Citizens for European Development of Bulgaria,andrey kovatchev,376,andrey kovatchev,andreykovatchev,"[refugee, Ukraine]",0,[KlausIohannis]


In [16]:
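def remove_at(text):
    #Strip @mentions, links, and every character that is not a letter, digit, space or tab,
    #then collapse repeated whitespace
    return ' '.join(re.sub(r"(@[A-Za-z0-9_]+)|([^0-9A-Za-z \t])|(\w+://\S+)", "", text).split())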

df['clean_text'] = [remove_at(i) for i in df.Tweet]

In [17]:
df.head(3)

Unnamed: 0,handle,Tweet_id,Date,Tweet,Fav_count,Retweet_count,in_reply_to_status_id,english,Name,Country,...,ID,National Party,T_Name_e,Unnamed: 0_y,Twitter_names,Twitter_handles,hashtags,is_retweet,mentiones,clean_text
0,andreykovatchev,1.522546e+18,2022-05-06 11:56:37+00:00,Historic day for #Romania! \n\nU.S. first lady...,5,1,,1,andrey kovatchev,Bulgaria,...,97968,Citizens for European Development of Bulgaria,andrey kovatchev,376,andrey kovatchev,andreykovatchev,[Romania],0,"[FLOTUS, FLOTUS]",Historic day for Romania US first lady will vi...
1,andreykovatchev,1.51971e+18,2022-04-28 16:08:40+00:00,"@NicolaeCiuca, Prime Minister of Romania parti...",3,0,,1,andrey kovatchev,Bulgaria,...,97968,Citizens for European Development of Bulgaria,andrey kovatchev,376,andrey kovatchev,andreykovatchev,[ukraine],0,"[NicolaeCiuca, EPPGroup, NicolaeCiuca]",Prime Minister of Romania participated today a...
2,andreykovatchev,1.518311e+18,2022-04-24 19:26:52+00:00,"Good news from France, good news for Europe an...",6,0,,1,andrey kovatchev,Bulgaria,...,97968,Citizens for European Development of Bulgaria,andrey kovatchev,376,andrey kovatchev,andreykovatchev,"[FranceElection2022, FranceVotes]",0,[EmmanuelMacron],Good news from France good news for Europe and...


In [13]:
df_twitter = df.copy()

In [14]:
df_twitter

Unnamed: 0,clean_text,Date,Country,text_no_sw,text_stem,text_lem
0,Historic day for Romania US first lady will vi...,2022-05-06 11:56:37+00:00,Bulgaria,Historic day Romania US first lady visit Roman...,"['histor', 'day', 'romania', 'us', 'first', 'l...","['historic', 'day', 'romania', 'u', 'first', '..."
1,Prime Minister of Romania participated today a...,2022-04-28 16:08:40+00:00,Bulgaria,Prime Minister Romania participated today Bure...,"['prime', 'minist', 'romania', 'particip', 'to...","['prime', 'minister', 'romania', 'participate'..."
2,Good news from France good news for Europe and...,2022-04-24 19:26:52+00:00,Bulgaria,Good news France good news Europe good news de...,"['good', 'news', 'franc', 'good', 'news', 'eur...","['good', 'news', 'france', 'good', 'news', 'eu..."
3,Romania is fully prepared to join the schengen...,2022-04-07 12:48:56+00:00,Bulgaria,Romania fully prepared join schengen zone Keep...,"['romania', 'fulli', 'prepar', 'join', 'scheng...","['romania', 'fully', 'prepared', 'join', 'sche..."
4,I strongly condemn the massacre in Bucha commi...,2022-04-03 17:00:15+00:00,Bulgaria,I strongly condemn massacre Bucha committed Ru...,"['i', 'strongli', 'condemn', 'massacr', 'bucha...","['i', 'strongly', 'condemn', 'massacre', 'buch..."
...,...,...,...,...,...,...
605996,No it was very much introductory remarks to th...,2022-02-23 15:43:44+00:00,Sweden,No much introductory remarks event,"['no', 'much', 'introductori', 'remark', 'event']","['no', 'much', 'introductory', 'remark', 'event']"
605997,I was in the room His address was good but the...,2022-02-23 15:13:04+00:00,Sweden,I room His address good nothing new proclaimed,"['i', 'room', 'hi', 'address', 'good', 'noth',...","['i', 'room', 'his', 'address', 'good', 'nothi..."
605998,If youre gonna listen to any speech about Ukra...,2022-02-23 02:03:34+00:00,Sweden,If youre gonna listen speech Ukraine let oneTh...,"['if', 'your', 'gon', 'na', 'listen', 'speech'...","['if', 'youre', 'gon', 'na', 'listen', 'speech..."
605999,I wish all my followers outside Ukraine could ...,2022-02-22 13:04:19+00:00,Sweden,I wish followers outside Ukraine could take mo...,"['i', 'wish', 'follow', 'outsid', 'ukrain', 'c...","['i', 'wish', 'follower', 'outside', 'ukraine'..."


---
---
---

### 1.B
---
#### Stem, lemmatize, tokenize tweets
---

In [7]:
# Imports for tokenizing, stemming, lemmatizing and stopword removal
import nltk 
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer # Porter is used below. This is an alternative, harsher stemmer. 
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize, pos_tag
from nltk.corpus import wordnet
from collections import defaultdict
from nltk.corpus import stopwords
import re
import pandas as pd
from gensim.models.word2vec import Word2Vec
import fasttext


In [64]:
df_twitter['clean_text'].keys

<bound method Series.keys of 0         Historic day for Romania US first lady will vi...
1         Prime Minister of Romania participated today a...
2         Good news from France good news for Europe and...
3         Romania is fully prepared to join the schengen...
4         I strongly condemn the massacre in Bucha commi...
                                ...                        
605996    No it was very much introductory remarks to th...
605997    I was in the room His address was good but the...
605998    If youre gonna listen to any speech about Ukra...
605999    I wish all my followers outside Ukraine could ...
606000                                              Im game
Name: clean_text, Length: 606001, dtype: object>

In [12]:
help(word_tokenize)

Help on function word_tokenize in module nltk.tokenize:

word_tokenize(text, language='english', preserve_line=False)
    Return a tokenized copy of *text*,
    using NLTK's recommended word tokenizer
    (currently an improved :class:`.TreebankWordTokenizer`
    along with :class:`.PunktSentenceTokenizer`
    for the specified language).
    
    :param text: text to split into words
    :type text: str
    :param language: the model name in the Punkt corpus
    :type language: str
    :param preserve_line: A flag to decide whether to sentence tokenize the text or not.
    :type preserve_line: bool



In [42]:
[text for text in df_twitter.text_no_sw.values]

['Historic day Romania US first lady visit Romania Slovakia meet meet Ukrainian refugees American troops Romanias firstlady Europe United States united ever',
 'Prime Minister Romania participated today Bureau reiterated Romanias support ukraine Republic Moldova Only together defeat Kremlins aggression',
 'Good news France good news Europe good news democratic worldThe European project continue The proEuropean second term French president FranceElection2022 FranceVotes',
 'Romania fully prepared join schengen zone Keeping Romania Schengen area deeply immoral unfairRomanias accession Schengen would positive signal favor consolidating European Unions external borders',
 'I strongly condemn massacre Bucha committed Russian forcesIts horrific absolutely unimaginable happened Bucha The international community must take immediate action BuchaMassacre',
 'Congratulations This picture worth thousand words Metsola first president EU institution visit Ukrainian capital since Russia invasion EU s

In [86]:
def tokenizer(sent):
    #Lowercasing and splitting the tweet into word tokens with NLTK's word_tokenize
    sent = word_tokenize(sent.lower())
    sent_tknzed = []                 #Empty list to collect the tokens

    for token in sent:               #Copying the tokens one by one (no stemming at this step)
        sent_tknzed.append(token)

    return sent_tknzed               #List of tokens

#df_twitter['text_token'] = df_twitter['text_no_sw'].apply(lambda x: tokenizer(x))

In [249]:
#sent = [text for text in df_twitter.text_no_sw.values]
#corpus = tokenizer(sent)

In [43]:
# Remove stopwords

def remove_stopwords(sent):
    patterns = set(stopwords.words('english'))

    for pattern in patterns:
        #Only lowercase stopwords with a space on both sides are matched, so stopwords at the
        #start or end of a tweet and capitalised ones ('I', 'No', 'His') are left in place
        if re.search(' '+pattern+' ', sent):
            sent = re.sub(' '+pattern+' ', ' ', sent)  #Substituting the stopword with a single space

    return sent


In [44]:
df_twitter['text_no_sw'] = df_twitter['clean_text'].apply(lambda x: remove_stopwords(x))
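Because the regex only matches lowercase stopwords surrounded by spaces, the `text_no_sw` output still contains words such as 'I', 'No' and 'His'. A token-based filter is one possible alternative. The sketch below is an illustration under stated assumptions, not the method used in this notebook: `STOPWORDS` is a small hand-picked set standing in for `stopwords.words('english')`, and `remove_stopwords_tokens` is a hypothetical helper name.

```python
# Small illustrative stopword set (assumption); the notebook uses NLTK's full English list.
STOPWORDS = {"i", "the", "in", "for", "will", "is", "to", "of", "a", "and", "his", "was", "it"}

def remove_stopwords_tokens(sent, stop=STOPWORDS):
    """Drop stopwords token by token, ignoring case and position in the sentence."""
    kept = [tok for tok in sent.split() if tok.lower() not in stop]
    return " ".join(kept)

result = remove_stopwords_tokens("I was in the room His address was good")
print(result)
```

Comparing case-insensitively on whole tokens removes stopwords at the start and end of a tweet as well, at the cost of also dropping capitalised tokens that happen to match a stopword.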

In [48]:
df_twitter.head(5)

Unnamed: 0,clean_text,Date,Country,text_no_sw
0,Historic day for Romania US first lady will vi...,2022-05-06 11:56:37+00:00,Bulgaria,Historic day Romania US first lady visit Roman...
1,Prime Minister of Romania participated today a...,2022-04-28 16:08:40+00:00,Bulgaria,Prime Minister Romania participated today Bure...
2,Good news from France good news for Europe and...,2022-04-24 19:26:52+00:00,Bulgaria,Good news France good news Europe good news de...
3,Romania is fully prepared to join the schengen...,2022-04-07 12:48:56+00:00,Bulgaria,Romania fully prepared join schengen zone Keep...
4,I strongly condemn the massacre in Bucha commi...,2022-04-03 17:00:15+00:00,Bulgaria,I strongly condemn massacre Bucha committed Ru...


---
---
---

### 1.C
---
#### Create corpus and vectorize
---

In [49]:
def stemmer(sent):
    porter = PorterStemmer()                  #Creating the Porter stemmer

    sent = word_tokenize(sent.lower())        #Tokenizing, as the stemmer works on single tokens
    sent_stemmed = []                         #Empty list to save the stemmed tokens

    for word in sent:
        stem = porter.stem(word)              #Stemming each word
        sent_stemmed.append(stem)

    return sent_stemmed                       #List of stems; use ' '.join(sent_stemmed) for a string

df_twitter['text_stem'] = df_twitter['text_no_sw'].apply(lambda x: stemmer(x))

In [50]:
df_twitter.head(5)

Unnamed: 0,clean_text,Date,Country,text_no_sw,text_stem
0,Historic day for Romania US first lady will vi...,2022-05-06 11:56:37+00:00,Bulgaria,Historic day Romania US first lady visit Roman...,"[histor, day, romania, us, first, ladi, visit,..."
1,Prime Minister of Romania participated today a...,2022-04-28 16:08:40+00:00,Bulgaria,Prime Minister Romania participated today Bure...,"[prime, minist, romania, particip, today, bure..."
2,Good news from France good news for Europe and...,2022-04-24 19:26:52+00:00,Bulgaria,Good news France good news Europe good news de...,"[good, news, franc, good, news, europ, good, n..."
3,Romania is fully prepared to join the schengen...,2022-04-07 12:48:56+00:00,Bulgaria,Romania fully prepared join schengen zone Keep...,"[romania, fulli, prepar, join, schengen, zone,..."
4,I strongly condemn the massacre in Bucha commi...,2022-04-03 17:00:15+00:00,Bulgaria,I strongly condemn massacre Bucha committed Ru...,"[i, strongli, condemn, massacr, bucha, commit,..."


In [51]:
#Lemmatize words 

def lemmatize(sent):
    
    #First, the nltk wordnet lemmatizer needs the part-of-speech (POS) tag to correctly lemmatize
    #NLTK has a POS-tagger, but the format does not match POS-tags in wordnet's lemmatizer. 
    #The mapping dictionary below fixes that.
    
    tag_map = defaultdict(lambda : wordnet.NOUN)  #If nothing else is specified, use noun tag
    tag_map['J'] = wordnet.ADJ
    tag_map['V'] = wordnet.VERB
    tag_map['R'] = wordnet.ADV    
    
    lemmatizer = WordNetLemmatizer()        #Creating lemmatizer.
    
    sent = word_tokenize(sent.lower())              #Tokenizing, as lemmatizer only takes tokenized sentences
    sent_lemmatized = []                    #Empty list to save lemmatized sentence

    for word, tag in pos_tag(sent):
        lemma = lemmatizer.lemmatize(word, tag_map[tag[0]])  #Where the magic happens 
        
        #Notice above that we choose tag[0] to get all instances of a word class, 
        #e.g. NN (noun) and NNP (proper noun) should both translate to noun. 
        
        sent_lemmatized.append(lemma)       #Putting the words back together into a sentence list
    
    return sent_lemmatized #' '.join(sent_lemmatized)

df_twitter['text_lem'] = df_twitter['text_no_sw'].apply(lambda x: lemmatize(x))

#stem = [[stemmer(word) for word in sent] for sent in df_twitter['corpus']]
#df_twitter['text_stem'] = stem

In [52]:
df_twitter.head(5)

Unnamed: 0,clean_text,Date,Country,text_no_sw,text_stem,text_lem
0,Historic day for Romania US first lady will vi...,2022-05-06 11:56:37+00:00,Bulgaria,Historic day Romania US first lady visit Roman...,"[histor, day, romania, us, first, ladi, visit,...","[historic, day, romania, u, first, lady, visit..."
1,Prime Minister of Romania participated today a...,2022-04-28 16:08:40+00:00,Bulgaria,Prime Minister Romania participated today Bure...,"[prime, minist, romania, particip, today, bure...","[prime, minister, romania, participate, today,..."
2,Good news from France good news for Europe and...,2022-04-24 19:26:52+00:00,Bulgaria,Good news France good news Europe good news de...,"[good, news, franc, good, news, europ, good, n...","[good, news, france, good, news, europe, good,..."
3,Romania is fully prepared to join the schengen...,2022-04-07 12:48:56+00:00,Bulgaria,Romania fully prepared join schengen zone Keep...,"[romania, fulli, prepar, join, schengen, zone,...","[romania, fully, prepared, join, schengen, zon..."
4,I strongly condemn the massacre in Bucha commi...,2022-04-03 17:00:15+00:00,Bulgaria,I strongly condemn massacre Bucha committed Ru...,"[i, strongli, condemn, massacr, bucha, commit,...","[i, strongly, condemn, massacre, bucha, commit..."


In [55]:
keywords = ['sanction','sanctioning','sanctions','oil','gas']
df_twitter[df_twitter["clean_text"].apply(lambda x: any(k in x for k in keywords))]

Unnamed: 0,clean_text,Date,Country,text_no_sw,text_stem,text_lem
105,I will be appearing on shortly to discuss the ...,2022-05-04 16:11:55+00:00,Hungary,I appearing shortly discuss impact announcemen...,"[i, appear, shortli, discuss, impact, announc,...","[i, appear, shortly, discuss, impact, announce..."
111,EU proposes to ban all Russian oil imports in ...,2022-05-04 07:18:22+00:00,Hungary,EU proposes ban Russian oil imports new sancti...,"[eu, propos, ban, russian, oil, import, new, s...","[eu, propose, ban, russian, oil, import, new, ..."
151,Seeing whats been done to Ukraine and its popu...,2022-04-14 08:00:15+00:00,Hungary,Seeing whats done Ukraine population immoral E...,"[see, what, done, ukrain, popul, immor, eu, co...","[see, whats, do, ukraine, population, immoral,..."
172,President s message is clear Russia must be he...,2022-04-06 16:39:30+00:00,Hungary,President message clear Russia must held accou...,"[presid, messag, clear, russia, must, held, ac...","[president, message, clear, russia, must, hold..."
181,Interview with this afternoon in Strasbourg fo...,2022-04-05 10:45:52+00:00,Hungary,Interview afternoon Strasbourg Europarl Radio ...,"[interview, afternoon, strasbourg, europarl, r...","[interview, afternoon, strasbourg, europarl, r..."
...,...,...,...,...,...,...
605944,Adam Kaufman Dont know about you but Im willin...,2022-03-12 14:26:01+00:00,Sweden,Adam Kaufman Dont know Im willing pay little e...,"[adam, kaufman, dont, know, im, will, pay, lit...","[adam, kaufman, dont, know, im, willing, pay, ..."
605945,Mr Putin these are not sanctions This is a spe...,2022-03-10 18:41:23+00:00,Sweden,Mr Putin sanctions This special financial oper...,"[mr, putin, sanction, thi, special, financi, o...","[mr, putin, sanction, this, special, financial..."
605946,Mr Putin these are not sanctions This is a spe...,2022-03-10 18:41:23+00:00,Sweden,Mr Putin sanctions This special financial oper...,"[mr, putin, sanction, thi, special, financi, o...","[mr, putin, sanction, this, special, financial..."
605963,Major remaining holes in sanctions EU gas purc...,2022-03-04 14:39:44+00:00,Sweden,Major remaining holes sanctions EU gas purchas...,"[major, remain, hole, sanction, eu, ga, purcha...","[major, remain, hole, sanction, eu, gas, purch..."
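The filter above tests whether a keyword occurs anywhere in the string, so 'gas' also matches 'Vegas' and 'oil' also matches 'turmoil'. A whole-word match is one way to tighten this. The sketch below is illustrative: `mentions_keyword` is a hypothetical helper and the example sentences are made up.

```python
import re

keywords = ['sanction', 'sanctioning', 'sanctions', 'oil', 'gas']

# \b marks a word boundary, so only whole words match; re.escape guards special characters.
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, keywords)) + r")\b", re.IGNORECASE)

def mentions_keyword(text):
    """True if the text contains at least one keyword as a whole word."""
    return bool(pattern.search(text))

examples = [
    "EU proposes ban on Russian oil imports",
    "Las Vegas turmoil",
    "New Sanctions agreed",
]
flags = [mentions_keyword(t) for t in examples]
print(flags)
```

On a DataFrame the same helper could be passed to `apply` on the `clean_text` column, in the same way as the lambda above.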


In [61]:
df_twitter.to_csv('prepro_data.csv')

In [28]:
corpus_lem

["['historic', 'day', 'romania', 'u', 'first', 'lady', 'visit', 'romania', 'slovakia', 'meet', 'meet', 'ukrainian', 'refugee', 'american', 'troop', 'romanias', 'firstlady', 'europe', 'united', 'state', 'unite', 'ever']",
 "['prime', 'minister', 'romania', 'participate', 'today', 'bureau', 'reiterate', 'romanias', 'support', 'ukraine', 'republic', 'moldova', 'only', 'together', 'defeat', 'kremlin', 'aggression']",
 "['good', 'news', 'france', 'good', 'news', 'europe', 'good', 'news', 'democratic', 'worldthe', 'european', 'project', 'continue', 'the', 'proeuropean', 'second', 'term', 'french', 'president', 'franceelection2022', 'francevotes']",
 "['romania', 'fully', 'prepared', 'join', 'schengen', 'zone', 'keep', 'romania', 'schengen', 'area', 'deeply', 'immoral', 'unfairromanias', 'accession', 'schengen', 'would', 'positive', 'signal', 'favor', 'consolidate', 'european', 'union', 'external', 'border']",
 "['i', 'strongly', 'condemn', 'massacre', 'bucha', 'commit', 'russian', 'forcesits

---

In [2]:
#import pandas as pd
df_org = pd.read_csv('prepro_data.csv', index_col='Unnamed: 0')

In [3]:
df = df_org.copy()

In [4]:
df.head(5)

Unnamed: 0,clean_text,Date,Country,text_no_sw,text_stem,text_lem
0,Historic day for Romania US first lady will vi...,2022-05-06 11:56:37+00:00,Bulgaria,Historic day Romania US first lady visit Roman...,"['histor', 'day', 'romania', 'us', 'first', 'l...","['historic', 'day', 'romania', 'u', 'first', '..."
1,Prime Minister of Romania participated today a...,2022-04-28 16:08:40+00:00,Bulgaria,Prime Minister Romania participated today Bure...,"['prime', 'minist', 'romania', 'particip', 'to...","['prime', 'minister', 'romania', 'participate'..."
2,Good news from France good news for Europe and...,2022-04-24 19:26:52+00:00,Bulgaria,Good news France good news Europe good news de...,"['good', 'news', 'franc', 'good', 'news', 'eur...","['good', 'news', 'france', 'good', 'news', 'eu..."
3,Romania is fully prepared to join the schengen...,2022-04-07 12:48:56+00:00,Bulgaria,Romania fully prepared join schengen zone Keep...,"['romania', 'fulli', 'prepar', 'join', 'scheng...","['romania', 'fully', 'prepared', 'join', 'sche..."
4,I strongly condemn the massacre in Bucha commi...,2022-04-03 17:00:15+00:00,Bulgaria,I strongly condemn massacre Bucha committed Ru...,"['i', 'strongli', 'condemn', 'massacr', 'bucha...","['i', 'strongly', 'condemn', 'massacre', 'buch..."


In [7]:
#df.describe()

In [5]:
df['text_no_sw'] = df['text_no_sw'].str.lower()

In [8]:
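from nltk.tokenize import word_tokenize

# Tokenize each lower-cased, stop-word-free tweet into a list of tokens
corpus = []

for text in df['text_no_sw']:
    tokens = word_tokenize(str(text))  # str() guards against NaN / non-string rows
    corpus.append(tokens)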

In [9]:
#corpus1 = df['text_lem']

In [10]:
#corpus1.iloc[corpus1]

In [11]:
#help(corpus1)

In [12]:
df['text_token'] = corpus

In [13]:
df

Unnamed: 0,clean_text,Date,Country,text_no_sw,text_stem,text_lem,text_token
0,Historic day for Romania US first lady will vi...,2022-05-06 11:56:37+00:00,Bulgaria,historic day romania us first lady visit roman...,"['histor', 'day', 'romania', 'us', 'first', 'l...","['historic', 'day', 'romania', 'u', 'first', '...","[historic, day, romania, us, first, lady, visi..."
1,Prime Minister of Romania participated today a...,2022-04-28 16:08:40+00:00,Bulgaria,prime minister romania participated today bure...,"['prime', 'minist', 'romania', 'particip', 'to...","['prime', 'minister', 'romania', 'participate'...","[prime, minister, romania, participated, today..."
2,Good news from France good news for Europe and...,2022-04-24 19:26:52+00:00,Bulgaria,good news france good news europe good news de...,"['good', 'news', 'franc', 'good', 'news', 'eur...","['good', 'news', 'france', 'good', 'news', 'eu...","[good, news, france, good, news, europe, good,..."
3,Romania is fully prepared to join the schengen...,2022-04-07 12:48:56+00:00,Bulgaria,romania fully prepared join schengen zone keep...,"['romania', 'fulli', 'prepar', 'join', 'scheng...","['romania', 'fully', 'prepared', 'join', 'sche...","[romania, fully, prepared, join, schengen, zon..."
4,I strongly condemn the massacre in Bucha commi...,2022-04-03 17:00:15+00:00,Bulgaria,i strongly condemn massacre bucha committed ru...,"['i', 'strongli', 'condemn', 'massacr', 'bucha...","['i', 'strongly', 'condemn', 'massacre', 'buch...","[i, strongly, condemn, massacre, bucha, commit..."
...,...,...,...,...,...,...,...
605996,No it was very much introductory remarks to th...,2022-02-23 15:43:44+00:00,Sweden,no much introductory remarks event,"['no', 'much', 'introductori', 'remark', 'event']","['no', 'much', 'introductory', 'remark', 'event']","[no, much, introductory, remarks, event]"
605997,I was in the room His address was good but the...,2022-02-23 15:13:04+00:00,Sweden,i room his address good nothing new proclaimed,"['i', 'room', 'hi', 'address', 'good', 'noth',...","['i', 'room', 'his', 'address', 'good', 'nothi...","[i, room, his, address, good, nothing, new, pr..."
605998,If youre gonna listen to any speech about Ukra...,2022-02-23 02:03:34+00:00,Sweden,if youre gonna listen speech ukraine let oneth...,"['if', 'your', 'gon', 'na', 'listen', 'speech'...","['if', 'youre', 'gon', 'na', 'listen', 'speech...","[if, youre, gon, na, listen, speech, ukraine, ..."
605999,I wish all my followers outside Ukraine could ...,2022-02-22 13:04:19+00:00,Sweden,i wish followers outside ukraine could take mo...,"['i', 'wish', 'follow', 'outsid', 'ukrain', 'c...","['i', 'wish', 'follower', 'outside', 'ukraine'...","[i, wish, followers, outside, ukraine, could, ..."


---

## 2. Define initial sanction and sub-classifier 

### General sanction category

In [None]:
'''''
1. 
Define (a list with) seed words for "general sanction category":
'''''

general_sanction = ['sanction', 'sanctions']


'''''
2.
Define (lists with) seed words for "sub-categories" in our "main sanction categories" 
(e.g. topic, target, and tweet type)
Define (lists with) seed words for sub-category blacklists (if any)
'''''

### Sanction topics category

In [27]:
'''''
Sub-categories and keywords:
'''''

energy = ['oil', 'gas', 'coal', 'uranium', 'nuclear']
#energy_black_list = ['']

flight_ban = ['']
#flight_ban_black_list = ['']

finance = ['swift']
#finance_black_list = ['']

propaganda = ['']
#propaganda_black_list = ['']

naval_ban = ['']
#naval_ban_black_list = ['']

# Multilateral coordination between EU member states
m_coordination = ['coordination', 'cooperate', 'among', 'allies']
#m_coordination_black_list = ['']

### Sanction targets category

In [None]:
'''''
Sub-categories and keywords:
'''''

putin = ['putin']
oligarch = [''] # wealthy_russians?
officials = ['']
civil_servants = ['']
secret_services = ['']
meps = ['']
military = ['']

### Tweet type category (proposed by Martin)
`Tweet example:`
"***putin*** (Target: person, official) *is a* ***criminal*** (Tweet type: accusation) *&amp; russia has become a* ***gulag*** (Tweet type: accusation) *state we cannot make any* ***compromises*** (Tweet type: call to action) *nor* ***concessions*** (Tweet type: call to action) *to* ***criminals*** (Tweet type: accusation).
 
*voted in #eplenary new #sanctions against* ***russia*** (Target: country) *and full* ***embargo*** (Topic: Trade) *on* ***imports*** (Topic: Trade) *of russian* ***oil, coal, nuclear fuel and gas*** (Topic: Energy/natural resources). 
    
#***stoprussianaggression*** (Tweet type: call to action)
@eppgroup https://t.co/rv5hoj8oml"

* (NB: the keywords I would use in the sub-category lists are highlighted in ***bold italic text*** above.)

`Tweet type:`  
- accusation: the MEP accuses Putin of being a "criminal", and Russia of being a "gulag" state (a prison labour camp, referring to the time of Stalin and the Soviet Union) and an "aggressor".  
- call to action: the tweet can be read as stating which actions the MEP believes should be taken in response to these accusations.

`Topic:`  
- Energy/natural resources: oil, coal, nuclear fuel and gas  
- Trade sanctions: embargo on imports
- War / military: it would be safe to infer that the use of hashtag #stoprussianaggression refers to the war in Ukraine  

`Target:`  
- Country: Russia  
- Person: Putin  


In [None]:
'''''
Sub-categories and keywords:
'''''

calls_to_action = ['compromises', 'concessions', 'stoprussianaggression', 'must','propose']
accusations = ['criminal', 'criminals', 'gulag']
public_announcement = ['interview','article','talking', 'chatting'] #from another tweet

---

## 3. Explore new seed words for categorizing tweets

* Use pretrained Word2Vec, GloVe, and FastText models for comparison

In [23]:
### Define parameters for the model
size=300 # Size of the embedding vectors.
workers=4 # Number of CPU cores to use for training in parallel.
iter_=10 # Number of passes over the data (gensim 4 calls this `epochs`); defined here but not passed to the models below.
window=6 # Context window: maximum distance between the current and the predicted word.
min_count=5 # Minimum number of occurrences for a word to be kept in the vocabulary (reflects our willingness to accept false negatives).

In [24]:
### Initializing the model and start training ###
# I train models on the tokenized, lemmatized, and stemmed corpora
from gensim.models import Word2Vec

model = Word2Vec(corpus,vector_size=size,workers=workers,window=window,min_count=min_count)

In [218]:
model1 = Word2Vec(corpus_lem,vector_size=size,workers=workers,window=window,min_count=min_count)

2022-05-26 16:49:28,459 : INFO : collecting all words and their counts
2022-05-26 16:49:28,464 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2022-05-26 16:49:28,560 : INFO : PROGRESS: at sentence #10000, processed 922503 words, keeping 37 word types
2022-05-26 16:49:28,647 : INFO : PROGRESS: at sentence #20000, processed 1842440 words, keeping 37 word types
2022-05-26 16:49:28,737 : INFO : PROGRESS: at sentence #30000, processed 2812806 words, keeping 37 word types
2022-05-26 16:49:28,821 : INFO : PROGRESS: at sentence #40000, processed 3736857 words, keeping 37 word types
2022-05-26 16:49:28,907 : INFO : PROGRESS: at sentence #50000, processed 4687198 words, keeping 37 word types
2022-05-26 16:49:28,987 : INFO : PROGRESS: at sentence #60000, processed 5569257 words, keeping 37 word types
2022-05-26 16:49:29,069 : INFO : PROGRESS: at sentence #70000, processed 6475358 words, keeping 37 word types
2022-05-26 16:49:29,155 : INFO : PROGRESS: at sentence #80000

2022-05-26 16:49:33,857 : INFO : estimated required memory for 37 words and 300 dimensions: 107300 bytes
2022-05-26 16:49:33,859 : INFO : resetting layer weights
2022-05-26 16:49:33,860 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2022-05-26T16:49:33.860538', 'gensim': '4.2.0', 'python': '3.9.7 (default, Sep 16 2021, 08:50:36) \n[Clang 10.0.0 ]', 'platform': 'macOS-10.16-x86_64-i386-64bit', 'event': 'build_vocab'}
2022-05-26 16:49:33,861 : INFO : Word2Vec lifecycle event {'msg': 'training model with 4 workers on 37 vocabulary and 300 features, using sg=0 hs=0 sample=0.001 negative=5 window=6 shrink_windows=True', 'datetime': '2022-05-26T16:49:33.861298', 'gensim': '4.2.0', 'python': '3.9.7 (default, Sep 16 2021, 08:50:36) \n[Clang 10.0.0 ]', 'platform': 'macOS-10.16-x86_64-i386-64bit', 'event': 'train'}
2022-05-26 16:49:34,866 : INFO : EPOCH 0 - PROGRESS: at 7.63% examples, 767213 words/s, in_qsize 7, out_qsize 0
2022-05-26 16:49:35,872 : INFO :

KeyboardInterrupt: 
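
The log above reports only 37 word types across millions of "words". That is what Word2Vec would see if each document were a plain string and it iterated over characters instead of tokens. The `text_lem` and `text_stem` columns come back from `pd.read_csv` as strings such as `"['historic', 'day', ...]"`, so one plausible fix is to parse each cell back into a list before training. This is an assumption, since the cell that builds `corpus_lem` is not shown; `parse_token_list` is a hypothetical helper name.

```python
import ast

def parse_token_list(cell):
    """Turn a stringified list such as "['a', 'b']" back into a real list of tokens."""
    if isinstance(cell, list):       # already parsed, nothing to do
        return cell
    return ast.literal_eval(cell)

# Two rows as they look after pd.read_csv (values shortened from the output above)
raw_text_lem = [
    "['historic', 'day', 'romania']",
    "['prime', 'minister', 'romania']",
]

corpus_lem = [parse_token_list(cell) for cell in raw_text_lem]

# Iterating a raw string yields characters, iterating the parsed row yields tokens
chars_seen = set(raw_text_lem[0])
tokens_seen = set(corpus_lem[0])
print(len(chars_seen), tokens_seen)
```

On the dataframe this would presumably be `corpus_lem = df['text_lem'].apply(parse_token_list).tolist()`; rows with missing values would need to be dropped or filled first.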

In [102]:
# Not relevant atm
#model.predict_output_word('sanctions')

[('n', 0.06626926),
 ('made', 0.011676979),
 ('w', 0.008690058),
 ('r', 0.004418599),
 ('moved', 0.0032434314),
 ('de', 0.0031338346),
 ('calls', 0.002856475),
 ('e', 0.0027837178),
 ('caidreamh', 0.0020919596),
 ('viva', 0.002042156)]

In [103]:
from sklearn.decomposition import IncrementalPCA    # initial reduction
from sklearn.manifold import TSNE                   # final reduction
import numpy as np

In [25]:
### Search for most similar words or lists of words.
### Exploratively
sanction = ['sanction']
print("Word2Vec Most Similar", "\n", model.wv.most_similar(sanction), "\n")
print("Word2Vec Most Similar (Cosine Similarity)","\n", model.wv.most_similar_cosmul(sanction),"\n")

'''''
Comment (Martin): I think, we should use the Cosine Similarity. But not sure yet.
'''''

# Check if any sanction tweets are detected
# df_twitter.loc[df_twitter['general_shaming_classifier'] == True]

Word2Vec Most Similar 
 [('assets', 0.40560057759284973), ('gasoil', 0.37045061588287354), ('oligarchs', 0.3693709373474121), ('sanctions', 0.36789485812187195), ('gaswont', 0.3586117625236511), ('importsgt', 0.3576804995536804), ('navalny', 0.34960615634918213), ('brusselswill', 0.33997172117233276), ('seizing', 0.3284044563770294), ('plas', 0.325237900018692)] 

Word2Vec Most Similar (Cosine Similarity) 
 [('assets', 0.7027996182441711), ('gasoil', 0.6852246522903442), ('oligarchs', 0.6846848130226135), ('sanctions', 0.6839467883110046), ('gaswont', 0.6793052554130554), ('importsgt', 0.6788396239280701), ('navalny', 0.6748024225234985), ('brusselswill', 0.6699852347373962), ('seizing', 0.664201557636261), ('plas', 0.6626182794570923)] 
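
A note on the two score lists above: both rank the words identically. To my understanding of gensim, `most_similar` already ranks by cosine similarity, while `most_similar_cosmul` uses a multiplicative combination of shifted cosine similarities. For a single query word and no negative words, the second score is just the first one rescaled from [-1, 1] to [0, 1]. The sketch below checks this against the numbers printed above; `cosmul_single` is a hypothetical helper, and the small `eps` is an assumption about the library's guard against division by zero.

```python
def cosmul_single(cos_sim, eps=1e-6):
    """Multiplicative score for ONE positive word and no negative words.

    The cosine similarity is shifted from [-1, 1] to [0, 1]; with several
    positive words the shifted values would be multiplied together.
    """
    return ((1 + cos_sim) / 2) / (1 + eps)

# (word, cosine score, cosmul score) copied from the output above
reported = [
    ('assets', 0.40560057759284973, 0.7027996182441711),
    ('gasoil', 0.37045061588287354, 0.6852246522903442),
    ('oligarchs', 0.3693709373474121, 0.6846848130226135),
]

for word, cos_sim, cosmul in reported:
    print(word, round(cosmul_single(cos_sim), 5), round(cosmul, 5))
```

The choice between the two methods therefore only matters when several words are combined in one query, as with the full `energy` list, where the multiplied scores become much smaller (around 0.1).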



"''\nComment (Martin): I think, we should use the Cosine Similarity. But not sure yet.\n"

**Explore the initial keywords in all our sub-categories, and find relevant new keywords to extend the lists**

In [31]:
'''''
Explore each word, and some words together.
'''''

#Ex. testing energy sub-cat: energy = ['oil', 'gas', 'coal', 'uranium', 'nuclear']

print("Energy, all key-w's:","\n","Word2Vec Most Similar (Cosine Similarity)","\n",model.wv.most_similar_cosmul(energy),"\n")
print("Oil:","\n","Word2Vec Most Similar (Cosine Similarity)","\n",model.wv.most_similar_cosmul('oil'),"\n")
print("Gas:","\n","Word2Vec Most Similar (Cosine Similarity)","\n",model.wv.most_similar_cosmul('gas'),"\n")
print("Coal:","\n","Word2Vec Most Similar (Cosine Similarity)","\n",model.wv.most_similar_cosmul('coal'),"\n")
print("Uranium:","\n","Word2Vec Most Similar (Cosine Similarity)","\n",model.wv.most_similar_cosmul('uranium'),"\n")

#s = ['sanction', 'assets', 'oligarchs', 'sanctions','gasoil','expel']
#extended = ['assets', 'oligarchs', 'sanctions','gasoil','expel']
#model.wv.most_similar_cosmul(extended)



Energy, all key-w's: 
 Word2Vec Most Similar (Cosine Similarity) 
 [('import', 0.12527714669704437), ('imports', 0.12104213237762451), ('importsgt', 0.1131068766117096), ('embargo', 0.10409441590309143), ('fossil', 0.10372918844223022), ('export', 0.10194060206413269), ('gasoil', 0.09963272511959076), ('crude', 0.09958624839782715), ('bryansk', 0.09943831712007523), ('purchases', 0.09678073227405548)] 

Oil: 
 Word2Vec Most Similar (Cosine Similarity) 
 [('import', 0.7070629000663757), ('capx', 0.7050924897193909), ('imports', 0.7025620937347412), ('pute', 0.6828221678733826), ('refinerie', 0.6741490960121155), ('blackmailto', 0.6732582449913025), ('cutthegas', 0.6730424165725708), ('importing', 0.6728558540344238), ('skyhigh', 0.6722517013549805), ('putindependent', 0.6713516712188721)] 

Gas: 
 Word2Vec Most Similar (Cosine Similarity) 
 [('gaswe', 0.7422040700912476), ('crude', 0.727179765701294), ('gasthat', 0.7142114043235779), ('cites', 0.6923399567604065), ('inju', 0.69090479612

**Expand lists from above calculations**

---

**Next: Word count and new column (variable) for each tweet**

In [None]:
# Classify documents as sanction
general_sanction = set(general_sanction)


In [None]:
# Calculate the number of words from our dictionary that is in the tweet.
general_sanction_text = [len(general_sanction.intersection(text)) for text in corpus]


In [None]:
# Make the dictionary count variable a part of our dataframe
df['general_sanction'] = general_sanction_text

In [None]:
# Make Variable
## Define "number of word" threshold
threshold = 1
df['general_sanction_classifier'] = df['general_sanction'] >= threshold
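
As a minimal sketch of what the cells above compute, the toy example below applies the same set-intersection count and threshold to three made-up tokenized tweets. `toy_corpus` is hypothetical data. Note that the intersection counts distinct seed words, not total occurrences; `total_hits` shows the alternative.

```python
general_sanction = {'sanction', 'sanctions'}

# Toy tokenized tweets (hypothetical, in the same shape as `corpus`)
toy_corpus = [
    ['eu', 'proposes', 'new', 'sanctions', 'sanctions', 'russia'],
    ['good', 'news', 'france'],
    ['sanction', 'sanctions', 'oligarchs'],
]

# Set intersection counts DISTINCT seed words, not total occurrences
distinct_hits = [len(general_sanction.intersection(text)) for text in toy_corpus]

# Alternative: count every occurrence of a seed word
total_hits = [sum(token in general_sanction for token in text) for text in toy_corpus]

threshold = 1
is_sanction = [hits >= threshold for hits in distinct_hits]
print(distinct_hits, total_hits, is_sanction)
```

With `threshold = 1` both counting schemes classify the same tweets; they only diverge for higher thresholds.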

In [None]:
# Create a dict of lists to iterate through - with blacklists
subcat_dict = {
    # extended sub-categories
    # topics
    'energy' : ['oil', 'gas', 'coal', 'uranium', 'nuclear'],
    'energy_black_list' : [''],
    'flight_ban' : [''],
    'flight_ban_black_list' : [''],
    'finance' : ['swift'],
    #... and so on
}

In [None]:
#Loop to classify documents according to sub-categories of sanctions - now with blacklists

## Define "number of word" threshold
threshold = 1

for name in subcat_dict.keys():
    # Skip black_lists in dict
    if 'black_list' in name:
        continue # Continue to the next iteration (here name)

    ###################
    # Make white list #
    ###################

    # Makes a set out of our list of dictionary terms under sanction.
    set_sub_list = set(subcat_dict[name])

    # Calculate the number of words from our dictionary that is in the tweet.
    sub_list_text = [len(set_sub_list.intersection(text)) for text in corpus]
    
    # Make the dictionary count variable a part of our dataframe
    df[name] = sub_list_text # This is the white list

    ###################
    # Make black list #
    ###################

    # Get name of black list (an empty list is used if none is defined in the dict)
    black_list = name + '_black_list'
    black_set_sub_list = set(subcat_dict.get(black_list, []))
    
    # Calculate the number of words from our dictionary that is in the tweet.
    black_sub_list_text = [len(black_set_sub_list.intersection(text)) for text in corpus]
    
    # Make the dictionary count variable a part of our dataframe
    df[black_list] = black_sub_list_text # This is the black list
    
    ###################
    # Make classifier #
    ###################
    
    # Create name for classifier
    name_classifier = name + '_classifier'
    
    # Create classifier based on white list, general_sanction and black list
    df[name_classifier] = (df[name] >= threshold) & (df['general_sanction'] >=
    threshold) & (df[black_list] == 0)
    
    # Output in csv-file
    # Get your query out in a csv for further investigation
    
    #df_query_soli = df[df[name_classifier]!=0]
    #query_soli = df_query_soli[['screen_name', 'text_nh']]
    #query_soli.to_csv(name_classifier+'.csv')
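
The rule applied in the loop above can be sketched for a single tweet: a tweet is assigned to a sub-category only if it hits the white list, also hits the general sanction list, and hits nothing on the black list. The function name `classify`, the toy tweets and the black-list term are hypothetical and only serve as illustration.

```python
def classify(tokens, white_list, black_list, general_list, threshold=1):
    """Return True if the tweet matches the sub-category rule used in the loop above."""
    tokens = set(tokens)
    white_hits = len(tokens & set(white_list))
    black_hits = len(tokens & set(black_list))
    general_hits = len(tokens & set(general_list))
    return white_hits >= threshold and general_hits >= threshold and black_hits == 0

general = ['sanction', 'sanctions']
energy = ['oil', 'gas', 'coal', 'uranium', 'nuclear']
energy_black_list = ['olive']   # hypothetical black-list term, for illustration only

toy_tweets = [
    ['eu', 'sanctions', 'russian', 'oil', 'imports'],   # energy + sanction
    ['oil', 'prices', 'skyhigh'],                       # energy, but no sanction word
    ['sanctions', 'olive', 'oil', 'producers'],         # blocked by the black list
]

results = [classify(t, energy, energy_black_list, general) for t in toy_tweets]
print(results)
```

Keeping the rule in one function makes it easier to test changes to the threshold or to the lists before running them on the full dataframe.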

In [None]:
# #### Check how many are classified
df_distribution = pd.DataFrame()
df_distribution['general_sanction'] = df.groupby('general_sanction_classifier').size()
df_distribution['energy'] = df.groupby('energy_classifier').size()
#... and so on

In [None]:
# ### Evaluating Classifiers
# Conditional(on positive) sample to quickly estimate precision.
sample_general_sanction_p = df[df['general_sanction_classifier']!=0].sample(n=50)
sample_energy_p = df[df['energy_classifier']!=0].sample(n=50)
# ... and so on

In [None]:
# Conditional(on negative) sample to quickly estimate/get sense of recall.
sample_general_sanction_n = df[df['general_sanction_classifier']==0].sample(n=50)
sample_energy_n = df[df['energy_classifier']==0].sample(n=50)
# ... and so on
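
Once the two samples have been coded by hand, precision and a rough recall figure can be derived from them. The sketch below shows one way to do this; all numbers (the manual codes and the dataset sizes `n_pos` and `n_neg`) are hypothetical, and `estimate_precision_recall` is not part of the notebook. The positive sample gives precision directly. The negative sample only gives the share of missed tweets among the unflagged ones, so recall has to be scaled by how many tweets were flagged and unflagged overall.

```python
def estimate_precision_recall(pos_labels, neg_labels, n_pos, n_neg):
    """Estimate precision and recall from two hand-coded samples.

    pos_labels: manual codes (True = really about sanctions) for tweets the classifier flagged
    neg_labels: manual codes for tweets the classifier did NOT flag
    n_pos, n_neg: number of flagged / unflagged tweets in the full dataset
    """
    precision = sum(pos_labels) / len(pos_labels)
    miss_rate = sum(neg_labels) / len(neg_labels)    # share of unflagged tweets that are relevant
    true_pos = precision * n_pos                     # estimated relevant tweets among the flagged
    false_neg = miss_rate * n_neg                    # estimated relevant tweets that were missed
    total = true_pos + false_neg
    recall = true_pos / total if total else float('nan')
    return precision, recall

# Hypothetical manual codes for two samples of 50 tweets each
pos_labels = [True] * 45 + [False] * 5
neg_labels = [True] * 1 + [False] * 49

precision, recall = estimate_precision_recall(
    pos_labels, neg_labels, n_pos=10_000, n_neg=590_000
)
print(round(precision, 2), round(recall, 2))
```

With a very large unflagged group, even one relevant tweet in a negative sample of 50 moves the recall estimate a lot, so the recall figure should be read as a rough indication only.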

In [None]:
# Export datasets for manual coding
# (the dataframe shown above has no 'text' column; 'clean_text' holds the cleaned tweet text)
sample_general_sanction_p['clean_text'].to_csv('_sample_general_sanction_positive_test_set.csv')
sample_energy_p['clean_text'].to_csv('_sample_energy_positive_test_set.csv')
#...
sample_general_sanction_n['clean_text'].to_csv('_sample_general_sanction_negative_test_set.csv')
sample_energy_n['clean_text'].to_csv('_sample_energy_negative_test_set.csv')

**Testing how many tweets we get when we search each text column for the keyword list `s` (defined in the commented-out line above)**

In [152]:
df[df["clean_text"].astype(str).apply(lambda x: any(k in x for k in s))]

Unnamed: 0,clean_text,Date,Country,text_no_sw,text_stem,text_lem,text_token
105,I will be appearing on shortly to discuss the ...,2022-05-04 16:11:55+00:00,Hungary,i appearing shortly discuss impact announcemen...,"['i', 'appear', 'shortli', 'discuss', 'impact'...","['i', 'appear', 'shortly', 'discuss', 'impact'...","[i, appearing, shortly, discuss, impact, annou..."
111,EU proposes to ban all Russian oil imports in ...,2022-05-04 07:18:22+00:00,Hungary,eu proposes ban russian oil imports new sancti...,"['eu', 'propos', 'ban', 'russian', 'oil', 'imp...","['eu', 'propose', 'ban', 'russian', 'oil', 'im...","[eu, proposes, ban, russian, oil, imports, new..."
235,As of now we stop treating Russia as a MostFav...,2022-03-15 11:23:38+00:00,Hungary,as stop treating russia mostfavourednation thi...,"['as', 'stop', 'treat', 'russia', 'mostfavoure...","['a', 'stop', 'treating', 'russia', 'mostfavou...","[as, stop, treating, russia, mostfavourednatio..."
289,The tragedy of Afghanistan is that it is a dev...,2022-02-28 19:31:01+00:00,Hungary,the tragedy afghanistan development emergency ...,"['the', 'tragedi', 'afghanistan', 'develop', '...","['the', 'tragedy', 'afghanistan', 'development...","[the, tragedy, afghanistan, development, emerg..."
421,The tough economic and financial sanctions aga...,2022-03-15 10:59:21+00:00,Hungary,the tough economic financial sanctions putins ...,"['the', 'tough', 'econom', 'financi', 'sanctio...","['the', 'tough', 'economic', 'financial', 'san...","[the, tough, economic, financial, sanctions, p..."
...,...,...,...,...,...,...,...
605780,EU needs to be more active to end this crisis ...,2022-04-06 12:29:20+00:00,Austria,eu needs active end crisis also pressure sanct...,"['eu', 'need', 'activ', 'end', 'crisi', 'also'...","['eu', 'need', 'active', 'end', 'crisis', 'als...","[eu, needs, active, end, crisis, also, pressur..."
605945,Mr Putin these are not sanctions This is a spe...,2022-03-10 18:41:23+00:00,Sweden,mr putin sanctions this special financial oper...,"['mr', 'putin', 'sanction', 'thi', 'special', ...","['mr', 'putin', 'sanction', 'this', 'special',...","[mr, putin, sanctions, this, special, financia..."
605946,Mr Putin these are not sanctions This is a spe...,2022-03-10 18:41:23+00:00,Sweden,mr putin sanctions this special financial oper...,"['mr', 'putin', 'sanction', 'thi', 'special', ...","['mr', 'putin', 'sanction', 'this', 'special',...","[mr, putin, sanctions, this, special, financia..."
605963,Major remaining holes in sanctions EU gas purc...,2022-03-04 14:39:44+00:00,Sweden,major remaining holes sanctions eu gas purchas...,"['major', 'remain', 'hole', 'sanction', 'eu', ...","['major', 'remain', 'hole', 'sanction', 'eu', ...","[major, remaining, holes, sanctions, eu, gas, ..."


In [154]:
df[df["text_lem"].astype(str).apply(lambda x: any(k in x for k in s))]

Unnamed: 0,clean_text,Date,Country,text_no_sw,text_stem,text_lem,text_token
105,I will be appearing on shortly to discuss the ...,2022-05-04 16:11:55+00:00,Hungary,i appearing shortly discuss impact announcemen...,"['i', 'appear', 'shortli', 'discuss', 'impact'...","['i', 'appear', 'shortly', 'discuss', 'impact'...","[i, appearing, shortly, discuss, impact, annou..."
111,EU proposes to ban all Russian oil imports in ...,2022-05-04 07:18:22+00:00,Hungary,eu proposes ban russian oil imports new sancti...,"['eu', 'propos', 'ban', 'russian', 'oil', 'imp...","['eu', 'propose', 'ban', 'russian', 'oil', 'im...","[eu, proposes, ban, russian, oil, imports, new..."
163,Also on why Ireland needs to take a closer loo...,2022-04-09 11:27:09+00:00,Hungary,also ireland needs take closer look security d...,"['also', 'ireland', 'need', 'take', 'closer', ...","['also', 'ireland', 'need', 'take', 'close', '...","[also, ireland, needs, take, closer, look, sec..."
172,President s message is clear Russia must be he...,2022-04-06 16:39:30+00:00,Hungary,president message clear russia must held accou...,"['presid', 'messag', 'clear', 'russia', 'must'...","['president', 'message', 'clear', 'russia', 'm...","[president, message, clear, russia, must, held..."
235,As of now we stop treating Russia as a MostFav...,2022-03-15 11:23:38+00:00,Hungary,as stop treating russia mostfavourednation thi...,"['as', 'stop', 'treat', 'russia', 'mostfavoure...","['a', 'stop', 'treating', 'russia', 'mostfavou...","[as, stop, treating, russia, mostfavourednatio..."
...,...,...,...,...,...,...,...
605810,Rank Just watched talking at about EU Sanction...,2022-02-25 11:53:22+00:00,Austria,rank just watched talking eu sanctions russia ...,"['rank', 'just', 'watch', 'talk', 'eu', 'sanct...","['rank', 'just', 'watch', 'talk', 'eu', 'sanct...","[rank, just, watched, talking, eu, sanctions, ..."
605945,Mr Putin these are not sanctions This is a spe...,2022-03-10 18:41:23+00:00,Sweden,mr putin sanctions this special financial oper...,"['mr', 'putin', 'sanction', 'thi', 'special', ...","['mr', 'putin', 'sanction', 'this', 'special',...","[mr, putin, sanctions, this, special, financia..."
605946,Mr Putin these are not sanctions This is a spe...,2022-03-10 18:41:23+00:00,Sweden,mr putin sanctions this special financial oper...,"['mr', 'putin', 'sanction', 'thi', 'special', ...","['mr', 'putin', 'sanction', 'this', 'special',...","[mr, putin, sanctions, this, special, financia..."
605963,Major remaining holes in sanctions EU gas purc...,2022-03-04 14:39:44+00:00,Sweden,major remaining holes sanctions eu gas purchas...,"['major', 'remain', 'hole', 'sanction', 'eu', ...","['major', 'remain', 'hole', 'sanction', 'eu', ...","[major, remaining, holes, sanctions, eu, gas, ..."


In [153]:
df[df["text_token"].astype(str).apply(lambda x: any(k in x for k in s))]

Unnamed: 0,clean_text,Date,Country,text_no_sw,text_stem,text_lem,text_token
105,I will be appearing on shortly to discuss the ...,2022-05-04 16:11:55+00:00,Hungary,i appearing shortly discuss impact announcemen...,"['i', 'appear', 'shortli', 'discuss', 'impact'...","['i', 'appear', 'shortly', 'discuss', 'impact'...","[i, appearing, shortly, discuss, impact, annou..."
111,EU proposes to ban all Russian oil imports in ...,2022-05-04 07:18:22+00:00,Hungary,eu proposes ban russian oil imports new sancti...,"['eu', 'propos', 'ban', 'russian', 'oil', 'imp...","['eu', 'propose', 'ban', 'russian', 'oil', 'im...","[eu, proposes, ban, russian, oil, imports, new..."
163,Also on why Ireland needs to take a closer loo...,2022-04-09 11:27:09+00:00,Hungary,also ireland needs take closer look security d...,"['also', 'ireland', 'need', 'take', 'closer', ...","['also', 'ireland', 'need', 'take', 'close', '...","[also, ireland, needs, take, closer, look, sec..."
172,President s message is clear Russia must be he...,2022-04-06 16:39:30+00:00,Hungary,president message clear russia must held accou...,"['presid', 'messag', 'clear', 'russia', 'must'...","['president', 'message', 'clear', 'russia', 'm...","[president, message, clear, russia, must, held..."
235,As of now we stop treating Russia as a MostFav...,2022-03-15 11:23:38+00:00,Hungary,as stop treating russia mostfavourednation thi...,"['as', 'stop', 'treat', 'russia', 'mostfavoure...","['a', 'stop', 'treating', 'russia', 'mostfavou...","[as, stop, treating, russia, mostfavourednatio..."
...,...,...,...,...,...,...,...
605810,Rank Just watched talking at about EU Sanction...,2022-02-25 11:53:22+00:00,Austria,rank just watched talking eu sanctions russia ...,"['rank', 'just', 'watch', 'talk', 'eu', 'sanct...","['rank', 'just', 'watch', 'talk', 'eu', 'sanct...","[rank, just, watched, talking, eu, sanctions, ..."
605945,Mr Putin these are not sanctions This is a spe...,2022-03-10 18:41:23+00:00,Sweden,mr putin sanctions this special financial oper...,"['mr', 'putin', 'sanction', 'thi', 'special', ...","['mr', 'putin', 'sanction', 'this', 'special',...","[mr, putin, sanctions, this, special, financia..."
605946,Mr Putin these are not sanctions This is a spe...,2022-03-10 18:41:23+00:00,Sweden,mr putin sanctions this special financial oper...,"['mr', 'putin', 'sanction', 'thi', 'special', ...","['mr', 'putin', 'sanction', 'this', 'special',...","[mr, putin, sanctions, this, special, financia..."
605963,Major remaining holes in sanctions EU gas purc...,2022-03-04 14:39:44+00:00,Sweden,major remaining holes sanctions eu gas purchas...,"['major', 'remain', 'hole', 'sanction', 'eu', ...","['major', 'remain', 'hole', 'sanction', 'eu', ...","[major, remaining, holes, sanctions, eu, gas, ..."
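
One thing to keep in mind when comparing the three queries above: each column is cast to a string and tested with `k in x`, which is a substring test. A keyword therefore also matches inside longer words, and for the list columns it is matched against the printed form of the list. The sketch below contrasts this with matching whole tokens. The keyword list is taken from the commented-out line earlier in the notebook (an assumption, since `s` is not defined in a visible cell), and the example tweet is made up.

```python
s = ['sanction', 'assets', 'oligarchs', 'sanctions', 'gasoil', 'expel']  # from the commented-out line above

def substring_match(text, keywords):
    """True if any keyword occurs anywhere in the text, also inside longer words."""
    return any(k in text for k in keywords)

def token_match(tokens, keywords):
    """True only if a whole token equals one of the keywords."""
    return not set(keywords).isdisjoint(tokens)

tokens = ['kremlin', 'expelled', 'diplomats']   # hypothetical tokenized tweet
text = ' '.join(tokens)

print(substring_match(text, s))   # 'expel' is found inside 'expelled'
print(token_match(tokens, s))     # no token is exactly 'expel'
```

Substring matching is more forgiving towards merged tokens such as `gaswont`, but it can also pull in unrelated words, so the row counts of the two approaches are not directly comparable.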


---

**Haven't gotten this to work yet**

In [124]:
%matplotlib inline

In [125]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [126]:
from collections import defaultdict
from gensim import corpora

In [128]:
# remove words that appear only once
frequency = defaultdict(int)
for text in corpus:
    for token in text:
        frequency[token] += 1

corpus = [
    [token for token in text if frequency[token] > 1]
    for text in corpus
]

dictionary = corpora.Dictionary(corpus)
corpus = [dictionary.doc2bow(text) for text in corpus]

2022-05-26 15:41:42,337 : INFO : adding document #0 to Dictionary<0 unique tokens: []>
2022-05-26 15:41:42,515 : INFO : adding document #10000 to Dictionary<15654 unique tokens: ['american', 'day', 'europe', 'ever', 'first']...>
2022-05-26 15:41:42,695 : INFO : adding document #20000 to Dictionary<20139 unique tokens: ['american', 'day', 'europe', 'ever', 'first']...>
2022-05-26 15:41:42,874 : INFO : adding document #30000 to Dictionary<23594 unique tokens: ['american', 'day', 'europe', 'ever', 'first']...>
2022-05-26 15:41:43,052 : INFO : adding document #40000 to Dictionary<23716 unique tokens: ['american', 'day', 'europe', 'ever', 'first']...>
2022-05-26 15:41:43,246 : INFO : adding document #50000 to Dictionary<24751 unique tokens: ['american', 'day', 'europe', 'ever', 'first']...>
2022-05-26 15:41:43,422 : INFO : adding document #60000 to Dictionary<24826 unique tokens: ['american', 'day', 'europe', 'ever', 'first']...>
2022-05-26 15:41:43,601 : INFO : adding document #70000 to Di

2022-05-26 15:41:53,257 : INFO : adding document #580000 to Dictionary<33169 unique tokens: ['american', 'day', 'europe', 'ever', 'first']...>
2022-05-26 15:41:53,452 : INFO : adding document #590000 to Dictionary<33169 unique tokens: ['american', 'day', 'europe', 'ever', 'first']...>
2022-05-26 15:41:53,629 : INFO : adding document #600000 to Dictionary<34762 unique tokens: ['american', 'day', 'europe', 'ever', 'first']...>
2022-05-26 15:41:53,737 : INFO : built Dictionary<35404 unique tokens: ['american', 'day', 'europe', 'ever', 'first']...> from 606001 documents (total 8401994 corpus positions)
2022-05-26 15:41:53,738 : INFO : Dictionary lifecycle event {'msg': "built Dictionary<35404 unique tokens: ['american', 'day', 'europe', 'ever', 'first']...> from 606001 documents (total 8401994 corpus positions)", 'datetime': '2022-05-26T15:41:53.738413', 'gensim': '4.2.0', 'python': '3.9.7 (default, Sep 16 2021, 08:50:36) \n[Clang 10.0.0 ]', 'platform': 'macOS-10.16-x86_64-i386-64bit', 'ev

---


In [129]:
from gensim import models

In [130]:
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)

2022-05-26 15:43:12,409 : INFO : using serial LSI version on this node
2022-05-26 15:43:12,410 : INFO : updating model with new documents
2022-05-26 15:43:12,414 : INFO : preparing a new chunk of documents
2022-05-26 15:43:12,507 : INFO : using 100 extra samples and 2 power iterations
2022-05-26 15:43:12,508 : INFO : 1st phase: constructing (35404, 102) action matrix
2022-05-26 15:43:12,642 : INFO : orthonormalizing (35404, 102) action matrix
2022-05-26 15:43:13,155 : INFO : 2nd phase: running dense svd on (102, 20000) matrix
2022-05-26 15:43:13,384 : INFO : computing the final decomposition
2022-05-26 15:43:13,385 : INFO : keeping 2 factors (discarding 87.182% of energy spectrum)
2022-05-26 15:43:13,391 : INFO : processed documents up to #20000
2022-05-26 15:43:13,428 : INFO : topic #0(79.553): -0.426*"i" + -0.373*"the" + -0.292*"amp" + -0.271*"eu" + -0.230*"ukraine" + -0.186*"we" + -0.159*"people" + -0.146*"gas" + -0.146*"need" + -0.132*"russian"
2022-05-26 15:43:13,443 : INFO : topi

2022-05-26 15:43:17,820 : INFO : preparing a new chunk of documents
2022-05-26 15:43:17,908 : INFO : using 100 extra samples and 2 power iterations
2022-05-26 15:43:17,911 : INFO : 1st phase: constructing (35404, 102) action matrix
2022-05-26 15:43:17,981 : INFO : orthonormalizing (35404, 102) action matrix
2022-05-26 15:43:18,314 : INFO : 2nd phase: running dense svd on (102, 20000) matrix
2022-05-26 15:43:18,466 : INFO : computing the final decomposition
2022-05-26 15:43:18,466 : INFO : keeping 2 factors (discarding 84.719% of energy spectrum)
2022-05-26 15:43:18,481 : INFO : merging projections: (35404, 2) + (35404, 2)
2022-05-26 15:43:18,490 : INFO : keeping 2 factors (discarding 1.229% of energy spectrum)
2022-05-26 15:43:18,492 : INFO : processed documents up to #160000
2022-05-26 15:43:18,498 : INFO : topic #0(228.479): 0.536*"amp" + 0.328*"the" + 0.302*"eu" + 0.271*"i" + 0.264*"ukraine" + 0.169*"we" + 0.134*"war" + 0.131*"people" + 0.128*"russian" + 0.107*"need"
2022-05-26 15:4

2022-05-26 15:43:23,276 : INFO : preparing a new chunk of documents
2022-05-26 15:43:23,370 : INFO : using 100 extra samples and 2 power iterations
2022-05-26 15:43:23,374 : INFO : 1st phase: constructing (35404, 102) action matrix
2022-05-26 15:43:23,447 : INFO : orthonormalizing (35404, 102) action matrix
2022-05-26 15:43:23,803 : INFO : 2nd phase: running dense svd on (102, 20000) matrix
2022-05-26 15:43:23,971 : INFO : computing the final decomposition
2022-05-26 15:43:23,972 : INFO : keeping 2 factors (discarding 82.479% of energy spectrum)
2022-05-26 15:43:23,978 : INFO : merging projections: (35404, 2) + (35404, 2)
2022-05-26 15:43:23,983 : INFO : keeping 2 factors (discarding 0.490% of energy spectrum)
2022-05-26 15:43:23,985 : INFO : processed documents up to #300000
2022-05-26 15:43:23,987 : INFO : topic #0(316.097): 0.594*"amp" + 0.309*"the" + 0.294*"eu" + 0.266*"ukraine" + 0.240*"i" + 0.161*"we" + 0.128*"war" + 0.122*"russian" + 0.121*"people" + 0.101*"russia"
2022-05-26 15

2022-05-26 15:43:28,390 : INFO : preparing a new chunk of documents
2022-05-26 15:43:28,475 : INFO : using 100 extra samples and 2 power iterations
2022-05-26 15:43:28,478 : INFO : 1st phase: constructing (35404, 102) action matrix
2022-05-26 15:43:28,555 : INFO : orthonormalizing (35404, 102) action matrix
2022-05-26 15:43:28,905 : INFO : 2nd phase: running dense svd on (102, 20000) matrix
2022-05-26 15:43:29,053 : INFO : computing the final decomposition
2022-05-26 15:43:29,054 : INFO : keeping 2 factors (discarding 86.857% of energy spectrum)
2022-05-26 15:43:29,061 : INFO : merging projections: (35404, 2) + (35404, 2)
2022-05-26 15:43:29,064 : INFO : keeping 2 factors (discarding 0.259% of energy spectrum)
2022-05-26 15:43:29,066 : INFO : processed documents up to #440000
2022-05-26 15:43:29,068 : INFO : topic #0(388.828): 0.550*"amp" + 0.325*"eu" + 0.311*"the" + 0.276*"ukraine" + 0.253*"i" + 0.174*"we" + 0.138*"war" + 0.128*"russian" + 0.119*"people" + 0.105*"european"

2022-05-26 15:43:33,655 : INFO : preparing a new chunk of documents
2022-05-26 15:43:33,742 : INFO : using 100 extra samples and 2 power iterations
2022-05-26 15:43:33,745 : INFO : 1st phase: constructing (35404, 102) action matrix
2022-05-26 15:43:33,815 : INFO : orthonormalizing (35404, 102) action matrix
2022-05-26 15:43:34,169 : INFO : 2nd phase: running dense svd on (102, 20000) matrix
2022-05-26 15:43:34,321 : INFO : computing the final decomposition
2022-05-26 15:43:34,322 : INFO : keeping 2 factors (discarding 86.701% of energy spectrum)
2022-05-26 15:43:34,327 : INFO : merging projections: (35404, 2) + (35404, 2)
2022-05-26 15:43:34,331 : INFO : keeping 2 factors (discarding 0.137% of energy spectrum)
2022-05-26 15:43:34,333 : INFO : processed documents up to #580000
2022-05-26 15:43:34,336 : INFO : topic #0(452.190): 0.509*"amp" + 0.349*"eu" + 0.316*"the" + 0.285*"ukraine" + 0.257*"i" + 0.182*"we" + 0.146*"war" + 0.133*"russian" + 0.117*"people" + 0.114*"european"
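
The log lines of the form "keeping 2 factors (discarding ...% of energy spectrum)" report how much of the decomposition is lost when it is truncated to two topics. The sketch below shows one way to compute such a ratio, assuming "energy" means the sum of squared singular values. That is my reading of the gensim log message and is not confirmed anywhere in this notebook. The function name `discarded_energy` and the singular values are hypothetical.

```python
def discarded_energy(singular_values, k):
    """Fraction of squared-singular-value mass outside the k largest values."""
    energy = sorted((s * s for s in singular_values), reverse=True)
    return 1.0 - sum(energy[:k]) / sum(energy)


# Hypothetical singular values, chosen only for illustration
example = [4.0, 2.0, 1.0, 1.0]
print(f"discarding {100 * discarded_energy(example, 2):.3f}% of energy spectrum")
```

A high discarded share, as in the log above, suggests that two topics capture only a small part of the variation in the corpus.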

In [138]:
doc = "Russian sanctions"
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lsi = lsi[vec_bow]  # convert the query to LSI space
print(vec_lsi)

[(0, 0.18207010794618483), (1, 0.15587069818600482)]
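
The query is first turned into a bag of words and only then projected into the two LSI dimensions. A rough sketch of the bag-of-words step, assuming a toy vocabulary (`token2id` and `to_bow` are hypothetical stand-ins, not the `dictionary` object used above):

```python
from collections import Counter

# Hypothetical toy vocabulary; the real mapping lives in `dictionary`
token2id = {"russian": 0, "sanctions": 1, "ukraine": 2}


def to_bow(text, token2id):
    """Return sorted (token_id, count) pairs, skipping out-of-vocabulary tokens."""
    counts = Counter(token2id[t] for t in text.lower().split() if t in token2id)
    return sorted(counts.items())


print(to_bow("Russian sanctions", token2id))  # [(0, 1), (1, 1)]
```

Tokens that are not in the vocabulary are dropped silently. If the corpus was stemmed or lemmatized, the query probably needs the same preprocessing before it is converted.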


In [139]:
from gensim import similarities

In [140]:
# Transform the corpus to LSI space and index it; num_features is passed explicitly
# so gensim does not have to scan the corpus to infer it
index = similarities.MatrixSimilarity(lsi[corpus], num_features=lsi.num_topics)
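
`MatrixSimilarity` compares the query with every indexed document by cosine similarity. A minimal sketch of that measure on dense vectors, using the rounded query vector from the output above and hypothetical document vectors (the helper `cosine` is illustrative and is not part of gensim):

```python
import math


def cosine(u, v):
    """Cosine similarity of two dense vectors; 0.0 if either has zero length."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0


query = [0.182, 0.156]  # rounded LSI query vector from the output above
docs = [[0.30, 0.25], [0.10, -0.40], [0.0, 0.0]]  # hypothetical document vectors
print([round(cosine(query, d), 3) for d in docs])
```

With only two LSI dimensions, many documents point in almost the same direction, so similarity scores close to 1 are to be expected.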

In [134]:
index.save('similaritymatrix.index')
index = similarities.MatrixSimilarity.load('similaritymatrix.index')

2022-05-26 15:45:01,728 : INFO : MatrixSimilarity lifecycle event {'fname_or_handle': 'similaritymatrix.index', 'separately': 'None', 'sep_limit': 10485760, 'ignore': frozenset(), 'datetime': '2022-05-26T15:45:01.728720', 'gensim': '4.2.0', 'python': '3.9.7 (default, Sep 16 2021, 08:50:36) \n[Clang 10.0.0 ]', 'platform': 'macOS-10.16-x86_64-i386-64bit', 'event': 'saving'}
2022-05-26 15:45:01,740 : INFO : saved similaritymatrix.index
2022-05-26 15:45:01,741 : INFO : loading MatrixSimilarity object from similaritymatrix.index
2022-05-26 15:45:01,743 : INFO : MatrixSimilarity lifecycle event {'fname': 'similaritymatrix.index', 'datetime': '2022-05-26T15:45:01.743809', 'gensim': '4.2.0', 'python': '3.9.7 (default, Sep 16 2021, 08:50:36) \n[Clang 10.0.0 ]', 'platform': 'macOS-10.16-x86_64-i386-64bit', 'event': 'loaded'}


In [141]:
sims = index[vec_lsi]  # perform a similarity query against the corpus
# Print only the ten best matches; the full list exceeds the notebook's IOPub data rate limit
top_sims = sorted(enumerate(sims), key=lambda item: -item[1])[:10]
print(top_sims)  # (document_number, document_similarity) 2-tuples
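
The document numbers returned by a similarity query are positions in the corpus, so they can be mapped back to the tweet text. The sketch below assumes that the corpus order matches the row order of the dataframe, which this notebook does not verify. `top_matches`, `scores` and `tweets` are hypothetical stand-ins for `sims` and `df['Tweet']`.

```python
import heapq


def top_matches(sims, texts, n=10):
    """Return (document_number, similarity, text) for the n best matches, best first."""
    best = heapq.nlargest(n, enumerate(sims), key=lambda item: item[1])
    return [(i, score, texts[i]) for i, score in best]


# Hypothetical scores and tweets, standing in for `sims` and df['Tweet']
scores = [0.12, 0.98, -0.30, 0.75, 0.50]
tweets = ["tweet a", "tweet b", "tweet c", "tweet d", "tweet e"]
for doc_number, score, text in top_matches(scores, tweets, n=3):
    print(doc_number, score, text)
```

gensim's `MatrixSimilarity` also accepts a `num_best` argument that limits the query result to the best matches, which may be simpler when the texts themselves are not needed.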

