
# TikTok

In this notebook we analyze a database of TikTok content related to the war in Ukraine. After looking at the complete graph in general terms, we focus on the most popular news-related profiles and on how they convey information, using a sentiment analysis of their video descriptions. We also study the network of recommendations that TikTok generates from these profiles, to check whether its algorithm shows any pattern or trend.

In [2]:
import os
import numpy as np
from py2neo import Graph
from py2neo import Node
from py2neo import Relationship
import pandas as pd
# for some nlp secret sauce
import nltk
from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer
from langdetect import detect_langs
from langdetect import detect
from nltk.sentiment.vader import SentimentIntensityAnalyzer
# Wordcloud stuff
from PIL import Image
from wordcloud import WordCloud, ImageColorGenerator, STOPWORDS
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk import FreqDist
from googletrans import Translator, constants
from pprint import pprint
#print(stopwords.fileids())

In [3]:
# update bolt port number and password as needed.
# in Neo4j, start DB of choice, click on DB name, on right under Details tab note 'Bolt port'
# Windows
if os.name == 'nt':
    graph = Graph("bolt://localhost:7687", password='xxxx', name='neo4j')
# Linux & Mac
if os.name == 'posix':
    graph = Graph("bolt://localhost:11003", password='xxxx', name='neo4j')
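
To avoid hard-coding credentials in the notebook, the connection settings could be read from the environment instead. This is only a sketch: the variable names `NEO4J_BOLT_URI` and `NEO4J_PASSWORD` and the helper `bolt_settings` are assumptions made for illustration, and the defaults simply mirror the ports used in the cell above.

```python
import os

def bolt_settings(environ=None, os_name=None):
    """Return (uri, password) for the Neo4j connection.

    NEO4J_BOLT_URI and NEO4J_PASSWORD are hypothetical variable names.
    """
    environ = os.environ if environ is None else environ
    os_name = os.name if os_name is None else os_name
    # Same defaults as the cell above: 7687 on Windows, 11003 on Linux & Mac
    default_port = 7687 if os_name == "nt" else 11003
    uri = environ.get("NEO4J_BOLT_URI", f"bolt://localhost:{default_port}")
    password = environ.get("NEO4J_PASSWORD", "xxxx")
    return uri, password

uri, password = bolt_settings()
# graph = Graph(uri, password=password, name='neo4j')
```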

Graph visualizations always look better in Neo4j, where they were originally produced; here we only include the queries and their tabular results.

## Centrality Measures
<A HREF="https://neo4j-website.s3.eu-central-1.amazonaws.com/build/html/Algorithms/centrality/centrality.html">Centrality Measures</A>

First, we should install <A HREF="https://neo4j.com/labs/apoc/4.1/installation/">APOC</A> and the Graph Data Science Library in the TikTok database.

In [6]:
# Graph structure
query = """
CALL db.schema.visualization()
"""
graph.run(query)

nodes,relationships
"[(_-8:song {constraints: [], indexes: [], name: 'song'}), (_-5:video {constraints: [], indexes: [], name: 'video'}), (_-7:tag {constraints: [], indexes: [], name: 'tag'}), (_-6:user {constraints: [], indexes: ['id'], name: 'user'})]","[(_-6)-[:author {}]->(_-5), (_-6)-[:author {}]->(_-8), (_-5)-[:has {}]->(_-7), (_-5)-[:has {}]->(_-8), (_-6)-[:recommends {}]->(_-5), (_-6)-[:recommends {}]->(_-6), (_-5)-[:recommends {}]->(_-5), (_-5)-[:recommends {}]->(_-6)]"


In [70]:
#Accounts with highest centrality score
query = """
CALL gds.alpha.closeness.stream({
    nodeProjection: "user",
    relationshipProjection: "recommends"
})
YIELD nodeId, centrality
RETURN gds.util.asNode(nodeId).id AS name,
       gds.util.asNode(nodeId).followerCount AS followerCount,
       centrality AS centralScore
ORDER BY centrality DESC;
"""
result=graph.run(query)
centrality = result.to_data_frame()
centrality.head(10)

ClientError: [Procedure.ProcedureNotFound] There is no procedure with the name `gds.alpha.closeness.stream` registered for this database instance. Please ensure you've spelled the procedure name correctly and that the procedure is properly deployed.
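
The error above means that no procedure with that exact name is registered. Procedure names in the Graph Data Science library can change between releases, so it may help to ask the database which closeness procedures it actually provides. A minimal sketch, assuming the library's `gds.list()` procedure is available and that `graph` is the py2neo connection created earlier; `find_procedures` is a hypothetical helper name.

```python
def find_procedures(graph, keyword):
    """Return the names of installed GDS procedures whose name contains `keyword`."""
    query = """
    CALL gds.list() YIELD name
    WHERE name CONTAINS $keyword
    RETURN name
    ORDER BY name
    """
    # py2neo passes keyword arguments to the query as Cypher parameters
    return [record["name"] for record in graph.run(query, keyword=keyword).data()]

# Example (needs a live connection):
# print(find_procedures(graph, "closeness"))
```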

In [None]:
# Accounts with highest betweenness score
query = """
CALL gds.betweenness.stream({
    nodeProjection: "user",
    relationshipProjection: "recommends"
})
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).id AS name,
       gds.util.asNode(nodeId).followerCount AS followerCount,
       score AS btwnScore
ORDER BY score DESC;
"""
result=graph.run(query)
betweeness = result.to_data_frame()
betweeness.head(10)

In [None]:
# Accounts with highest PageRank score
query = """
CALL gds.pageRank.stream({  
    nodeProjection: "user",
    relationshipProjection: "recommends",  
    maxIterations: 50,  dampingFactor: 0.85})
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).id AS name, 
       gds.util.asNode(nodeId).followerCount AS followerCount,
       score AS pageRank
ORDER BY score DESC;
"""
result=graph.run(query)
pagerank = result.to_data_frame()
pagerank.head(10)
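
The `maxIterations` and `dampingFactor` settings in the PageRank call correspond to the two knobs of the usual power iteration. The sketch below is a textbook version in plain Python, not the Graph Data Science implementation (whose score scaling may differ); the account names and edges are invented for illustration.

```python
def pagerank(edges, damping=0.85, max_iterations=50, tol=1e-9):
    """Textbook PageRank by power iteration over a list of (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iterations):
        # Rank held by nodes without outgoing links is spread evenly over all nodes
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n for node in nodes}
        for source in nodes:
            for target in out_links[source]:
                new_rank[target] += damping * rank[source] / len(out_links[source])
        converged = sum(abs(new_rank[node] - rank[node]) for node in nodes) < tol
        rank = new_rank
        if converged:
            break
    return rank

# Hypothetical "recommends" edges between four made-up accounts
edges = [("account_a", "account_c"), ("account_b", "account_c"),
         ("account_d", "account_c"), ("account_c", "account_a")]
ranks = pagerank(edges)
for name, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(name, round(score, 3))
```

In this toy graph the account that receives the most recommendations ends up with the highest score, which is the behavior the query above relies on.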

# Popularity analysis

From here we will analyze the most popular news accounts that create content related to the Ukraine war. But the first thing we have to define is: what do we consider a popular account?

Aside from popularity, one feature that tells us a lot about a content creator is whether the account is verified by TikTok. Verification does not directly imply that the information the account provides is more rigorous, but it does mean that the creator has at least been identified by the platform. This variable therefore appears throughout the analysis.

Below we present several data frames that rank these profiles by likes, followers and views, keeping only accounts whose name matches the pattern `.*new.*` (which captures "news" but also names such as ".thenewwave"). A more in-depth selection follows once we decide which criterion to use.

In [6]:
#News accounts sorted by likes
query = """
MATCH (u:user)-[:author]->(v:video)
WHERE u.id =~ '.*new.*'
RETURN DISTINCT u.id AS userName, u.heartCount as LikesReceived
ORDER BY LikesReceived DESC 
"""
lista=pd.DataFrame(graph.run(query).data())
lista.head(10)

Unnamed: 0,userName,LikesReceived
0,cbsnews,153400000
1,yahoonews,97800000
2,nbcnews,79200000
3,underthedesknews,76600000
4,nikocadodailynews,71100000
5,donald_newshd,67000000
6,.thenewwave,54300000
7,news.com.au,51700000
8,switchy._.news,39700000
9,kingqurannewpage,34900000


In [7]:
#News accounts sorted by followers
query = """
MATCH (u:user)-[:author]->(v:video)
WHERE u.id =~ '.*new.*'
RETURN DISTINCT u.id AS userName, u.followerCount as Followers
ORDER BY Followers DESC ;
"""
lista=pd.DataFrame(graph.run(query).data())
lista.head(10)

Unnamed: 0,userName,Followers
0,nbcnews,2700000
1,cbsnews,2500000
2,underthedesknews,2300000
3,abcnews,2100000
4,yahoonews,1700000
5,kingqurannewpage,1700000
6,.thenewwave,1700000
7,nikocadodailynews,1700000
8,donald_newshd,1400000
9,viceworldnews,1300000


In [8]:
#News accounts sorted by streams (total plays of their videos)
query = """
MATCH (u:user)-[:author]->(v:video)
WHERE u.id =~ '.*new.*'
RETURN  u.id AS userName, sum(v.playCount) as Streams
ORDER BY Streams DESC ;
"""
lista=pd.DataFrame(graph.run(query).data())
lista.head(10)

Unnamed: 0,userName,Streams
0,nbcnews,126478400
1,nikocadodailynews,114467700
2,news.com.au,98309800
3,viceworldnews,86062900
4,kingqurannewpage,53900000
5,cbsnews,52198548
6,bikini.bottom.news,48264300
7,skynews,44466903
8,mednews22,41500000
9,donald_newshd,32900000


As we can see, the order of the lists changes considerably depending on the criteria we follow. 
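
To make the comparison concrete, the three top-10 lists can be compared directly. The sketch below copies the account names from the three tables above and reports which accounts appear in all of them, together with their position under each criterion.

```python
# Top-10 account names copied from the three tables above
top_by_likes = ["cbsnews", "yahoonews", "nbcnews", "underthedesknews", "nikocadodailynews",
                "donald_newshd", ".thenewwave", "news.com.au", "switchy._.news", "kingqurannewpage"]
top_by_followers = ["nbcnews", "cbsnews", "underthedesknews", "abcnews", "yahoonews",
                    "kingqurannewpage", ".thenewwave", "nikocadodailynews", "donald_newshd",
                    "viceworldnews"]
top_by_streams = ["nbcnews", "nikocadodailynews", "news.com.au", "viceworldnews",
                  "kingqurannewpage", "cbsnews", "bikini.bottom.news", "skynews", "mednews22",
                  "donald_newshd"]

rankings = (top_by_likes, top_by_followers, top_by_streams)
in_all_three = set(top_by_likes) & set(top_by_followers) & set(top_by_streams)

for name in sorted(in_all_three):
    # 1-based position under likes, followers and streams
    positions = [ranking.index(name) + 1 for ranking in rankings]
    print(name, positions)
```

Only half of the accounts appear in all three top-10 lists, and those that do change position noticeably from one criterion to the next.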

TikTok is a platform whose activity is based on consuming content from the "For You" page. This content is not shown randomly or chronologically; it is selected according to how the user has reacted to the content shown so far. That behavior feeds a powerful recommendation algorithm which, as the user spends more time on the platform, steers the feed towards the kind of content the user has interacted with most. As a result, users rarely follow many accounts: they do not need to, since the content they are interested in appears on their screen anyway. For this reason we rank the popularity of an account by the number of plays of its videos on this topic: the higher the count, the more "For You" pages the account has appeared in.

Next we look for accounts with videos where "news" appears either in the profile name or in the video description, and whose description also mentions "war", "Ukraine", "Russia" or "conflict". There is one query per keyword, each covering English, Ukrainian and Russian spellings.
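
Cypher's `=~` operator matches the whole string against a regular expression and is case-sensitive unless the pattern starts with `(?i)`. The sketch below uses Python's `re.fullmatch` as a stand-in to show what patterns such as `'.*war.*'` do and do not catch. The sample descriptions are invented for illustration, and Neo4j uses Java regular expressions, so edge cases may differ.

```python
import re

def cypher_like_match(pattern, text):
    """Approximate Cypher's `text =~ pattern`: the pattern must match the whole string."""
    return re.fullmatch(pattern, text) is not None

samples = [
    "breaking news: war in ukraine",  # lower case, matched
    "War in Ukraine",                 # capitalised, missed by '.*war.*'
    "award ceremony tonight",         # unrelated, but contains 'war'
]

for text in samples:
    strict = cypher_like_match(".*war.*", text)
    relaxed = cypher_like_match("(?i).*war.*", text)
    print(f"{text!r}: case-sensitive={strict}, case-insensitive={relaxed}")
```

If these cases matter for the analysis, a case-insensitive pattern with word boundaries would be one option, at the cost of changing the counts reported below.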

In [15]:
#War
query = """
MATCH (u:user)-[:author]->(v:video)
WHERE (u.id =~ '.*news.*' OR
      v.desc =~ '.*news.*' OR
      v.desc =~ '.*новини.*' OR 
      v.desc =~ '.*Новости.*') AND 
      (v.desc =~ '.*war.*' OR
      v.desc =~ '.*війни.*' OR
      v.desc =~ '.*война.*')
RETURN  u.id AS userName, u.verified as VerifiedAccount,count(v) as VidRelated,u.videoCount as Totalvid,sum(v.playCount) as Streams
ORDER BY Streams DESC ;
"""
war=pd.DataFrame(graph.run(query).data())
war

Unnamed: 0,userName,VerifiedAccount,VidRelated,Totalvid,Streams
0,bikini.bottom.news,0,12,39,36651600
1,skynews,1,18,738,33906575
2,cheddygrace,0,1,767,31600000
3,president.putin.fanpage,0,15,242,30406300
4,authentictimetraveler,0,7,112,27760800
...,...,...,...,...,...
119,wanhedaa,0,1,141,2042
120,karlemmrich,0,1,1376,1688
121,militaryforces.com,0,1,156,1570
122,ivxxuk,0,1,143,1206


In [16]:
war["VerifiedAccount"].sum()

19

In [17]:
#proportion of verified accounts
19/124

0.1532258064516129

In [18]:
users1=war["userName"].to_list()

In [19]:
#Ukraine
query = """
MATCH (u:user)-[:author]->(v:video)
WHERE (u.id =~ '.*news.*' OR
      v.desc =~ '.*news.*' OR
      v.desc =~ '.*новини.*' OR 
      v.desc =~ '.*Новости.*') AND 
      (v.desc =~ '.*ukraine.*' OR
      v.desc =~ '.*україни.*'OR
      v.desc =~ '.*Украина.*')
RETURN  u.id AS userName, u.verified as VerifiedAccount,count(v) as VidRelated,u.videoCount as Totalvid,sum(v.playCount) as Streams
ORDER BY Streams DESC ;
"""
ukraine=pd.DataFrame(graph.run(query).data())
ukraine

Unnamed: 0,userName,VerifiedAccount,VidRelated,Totalvid,Streams
0,megatmurad,0,15,32,39641600
1,bikini.bottom.news,0,14,39,37147500
2,skynews,1,2,738,32625800
3,cheddygrace,0,1,767,31600000
4,president.putin.fanpage,0,16,242,30492700
...,...,...,...,...,...
125,r3dwyn,0,1,31,1974
126,militaryforces.com,0,1,156,1570
127,linkyfirsttiktok,0,2,114,1214
128,thealetheiaproject,0,1,15,1092


In [20]:
ukraine["VerifiedAccount"].sum()

18

In [21]:
#proportion of verified accounts
18/130

0.13846153846153847

In [22]:
users2=ukraine["userName"].to_list()

In [23]:
#Russia
query = """
MATCH (u:user)-[:author]->(v:video)
WHERE (u.id =~ '.*news.*' OR
      v.desc =~ '.*news.*' OR
      v.desc =~ '.*новини.*'  OR 
      v.desc =~ '.*Новости.*') AND 
      (v.desc =~ '.*russia.*' OR
      v.desc =~ '.*Росія.*' OR
      v.desc =~ '.*Россия.*')
RETURN  u.id AS userName, u.verified as VerifiedAccount,count(v) as VidRelated,u.videoCount as Totalvid,sum(v.playCount) as Streams
ORDER BY Streams DESC ;
"""
russia=pd.DataFrame(graph.run(query).data())
russia

Unnamed: 0,userName,VerifiedAccount,VidRelated,Totalvid,Streams
0,president.putin.fanpage,0,16,242,39692700
1,megatmurad,0,15,32,39641600
2,bikini.bottom.news,0,14,39,37147500
3,skynews,1,3,738,32649600
4,cheddygrace,0,1,767,31600000
...,...,...,...,...,...
120,militaryforces.com,0,1,156,1570
121,linkyfirsttiktok,0,2,114,1214
122,thealetheiaproject,0,1,15,1092
123,ykrainews,0,1,16,408


In [24]:
russia["VerifiedAccount"].sum()

18

In [25]:
#proportion of verified accounts
18/125

0.144

In [26]:
users3=russia["userName"].to_list()

In [27]:
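#Conflict (the account-name pattern here is '.*new.*', slightly broader than the '.*news.*' used in the three queries above)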
query = """
MATCH (u:user)-[:author]->(v:video)
WHERE (u.id =~ '.*new.*' OR
      v.desc =~ '.*news.*' OR
      v.desc =~ '.*новини.*'  OR 
      v.desc =~ '.*Новости.*') AND 
      (v.desc =~ '.*conflict.*' OR
      v.desc =~ '.*конфлікт.*' OR
      v.desc =~ '.*конфликт.*')
RETURN  u.id AS userName, u.verified as VerifiedAccount,count(v) as VidRelated,u.videoCount as Totalvid,sum(v.playCount) as Streams
ORDER BY Streams DESC ;
"""
conflict=pd.DataFrame(graph.run(query).data())
conflict

Unnamed: 0,userName,VerifiedAccount,VidRelated,Totalvid,Streams
0,caspolnews,0,3,404,9341700
1,therealisticfishhead,0,1,90,5300000
2,themsinstem,0,15,339,5105004
3,latestconflictwarnews2,0,5,98,1338500
4,lordeedge,0,2,1348,993200
5,fundiambb1,0,1,481,706700
6,thegossipnewstribune,0,27,32,532612
7,world_news_clips,0,2,26,347287
8,project_bandito,0,27,44,323462
9,wearescotland,0,1,740,306500


In [28]:
conflict["VerifiedAccount"].sum()

4

In [29]:
#proportion of verified accounts
4/31

0.12903225806451613

The word "conflict" seems to be the least used of all: it cuts the number of creators to about a quarter (31, against 124 to 130 for the other keywords). We will therefore study these accounts separately.

On the other hand, verified accounts make up only about 15% or less of each group above. This suggests that traditional media, which are verified on all platforms, either do not regard TikTok as an important platform for their content or are outperformed by unofficial accounts that are more popular among users.
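
The verified shares computed one by one above can be gathered in a single place. The sketch below only repeats that arithmetic; the counts are copied from the earlier cells.

```python
# (verified accounts, total accounts) per keyword, copied from the cells above
groups = {
    "war": (19, 124),
    "ukraine": (18, 130),
    "russia": (18, 125),
    "conflict": (4, 31),
}

shares = {keyword: verified / total for keyword, (verified, total) in groups.items()}
for keyword, share in shares.items():
    print(f"{keyword}: {share:.1%} verified")
```

With the data frames still in memory, an expression such as `war["VerifiedAccount"].mean()` should give the same proportion without hard-coding the counts, assuming the column only holds 0/1 flags.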

Now we select the 20 profiles with the most popular content under these filters.

In [30]:
#Intersection of the different list of creators (that mention war, ukraine & russia in any video)
users=list(set(users1) & set(users2)& set(users3))


In [31]:
len(users)

99

In [32]:
# Sort the users by streams again and keep the top 20
query=  """
WITH ['president.putin.fanpage','frankdomenic','random.content99','bbcnewsukrainian','ykrainews','joy_enjoy6','themsinstem',
'project_bandito','ervesto77','about_ww3','tiktok__czech','picklenews','visegrad24','newswoow','abc15arizona','dogedave',
 'tip_berlin','ukrainerussianewsdaily','papakreem','militaryforces.com','dark_engine84','whitecheddarbread','caspolnews',
 'tv13news','world_news_clips','cbsmornings','crease19','lordeedge','dailyrundownwithchris','theverymessyworld','fundiambb1',
 'viceworldnews','aaronparnas6','internationalnews12365','bigmike3223','cheddygrace','cbsnews','matthewcassel','seoulsecrets',
 'world247breakingnews','an4e216','latestconflictwarnews2','military_thingz','goodmorningbadnews','supercad55','robertjackcutter',
 'aviationhxb','morriganamber','thenewsmovement','belgian_warrior_group','relevanttopics','stand.with.ukraine.now','iam.rob21',
 'david.heath','military_newss','dailymail','xenasolo','gregmustreader','eurovision.by.kseniia','ca_astronorthstarph',
 'crazyniiiik','planefinder','thealetheiaproject','ukraine.news.today','the_ghost_ace_of_kiev','forcesnews',
 'newz2you','bbcnews','s1xrey','.military.newss','highlightsideline','skynews','therealisticfishhead','_rachtok_','city__resident',
 'u24_news','thegossipnewstribune','cnn','sashareheylo','howya.now','politicsoftheworld','snoopy.vn','frankorando','mihhail83',
 'thefirstnews','healingwithirina','themiracle35','thehistorylegends','astridszarek','countdaedalus','ukraine_latest','thedailyrundown',
 'christina_hrystya','bikini.bottom.news','ukraine_war_1','bigkingpiz','zeitimbild','1newsnz','cazacaza23'] as users
MATCH (u:user)-[:author]->(v:video)
WHERE u.id in users and
          (v.desc =~ '.*ukraine.*' OR
          v.desc =~ '.*war.*' OR 
          v.desc =~ '.*russia.*')
RETURN   u.id AS userName, u.verified as VerifiedAccount , sum(v.playCount) as Streams
ORDER BY Streams DESC ;
""" 
sortedusers=pd.DataFrame(graph.run(query).data()).head(20)
sortedusers

Unnamed: 0,userName,VerifiedAccount,Streams
0,president.putin.fanpage,0,48940200
1,viceworldnews,1,41258500
2,bikini.bottom.news,0,37147500
3,skynews,1,34132375
4,cheddygrace,0,31600000
5,xenasolo,0,30629200
6,cbsnews,1,24747400
7,world247breakingnews,0,24664271
8,abc15arizona,1,23950384
9,matthewcassel,0,23114400


In [33]:
sortedusers["VerifiedAccount"].sum()
# Verified accounts make up less than 50% of the 20 most-viewed accounts

6

In [34]:
sortedusers["userName"].to_list()

['president.putin.fanpage',
 'viceworldnews',
 'bikini.bottom.news',
 'skynews',
 'cheddygrace',
 'xenasolo',
 'cbsnews',
 'world247breakingnews',
 'abc15arizona',
 'matthewcassel',
 'aaronparnas6',
 'planefinder',
 'mihhail83',
 'military_thingz',
 'visegrad24',
 'random.content99',
 'politicsoftheworld',
 'aviationhxb',
 'cbsmornings',
 'city__resident']

In [35]:
#Scoring each account by the descriptions of their videos related to the war, ukraine and russia
sai = SentimentIntensityAnalyzer()
news=sortedusers["userName"]
# The account id is passed as a Cypher parameter ($user) and compared with "=",
# because with "=~" the dots in an id such as 'bikini.bottom.news' act as regex wildcards
query = """
MATCH (u:user)-[a:author]->(v:video)
WHERE u.id = $user and
      (v.desc =~ '.*ukraine.*' OR
      v.desc =~ '.*war.*' OR 
      v.desc =~ '.*russia.*')
RETURN v.desc AS description; 
"""
for index, new in enumerate(news):
    score=[]
    result=pd.DataFrame(graph.run(query, user=new).data())
    if (len(result)!=0):
        for sentence in result['description']:
            # only English descriptions are scored, since VADER is an English lexicon
            lang = detect(sentence)
            if lang=='en':
                kvp = sai.polarity_scores(sentence)
                score.append(kvp['compound'])
        # mean compound score (-1 most negative, +1 most positive) over the English descriptions
        sortedusers.at[index, "Score"]=round(np.mean(score),3)
        
    else:
        print(new,'returned no results in video description')
        print()

In [36]:
sortedusers

Unnamed: 0,userName,VerifiedAccount,Streams,Score
0,president.putin.fanpage,0,48940200,0.114
1,viceworldnews,1,41258500,-0.263
2,bikini.bottom.news,0,37147500,-0.053
3,skynews,1,34132375,-0.173
4,cheddygrace,0,31600000,-0.511
5,xenasolo,0,30629200,0.097
6,cbsnews,1,24747400,-0.254
7,world247breakingnews,0,24664271,0.0
8,abc15arizona,1,23950384,0.045
9,matthewcassel,0,23114400,-0.331


In [37]:
VerifiedAccounts=sortedusers.loc[sortedusers['VerifiedAccount'] == 1]
VerifiedAccounts

Unnamed: 0,userName,VerifiedAccount,Streams,Score
1,viceworldnews,1,41258500,-0.263
3,skynews,1,34132375,-0.173
6,cbsnews,1,24747400,-0.254
8,abc15arizona,1,23950384,0.045
11,planefinder,1,19829600,-0.214
18,cbsmornings,1,12221719,-0.198


In [38]:
VerifiedAccounts["Score"].mean()

-0.17616666666666667

In [39]:
VerifiedAccounts["Score"].max()

0.045

In [40]:
NoVerifiedAccounts=sortedusers.loc[sortedusers['VerifiedAccount'] == 0]
NoVerifiedAccounts

Unnamed: 0,userName,VerifiedAccount,Streams,Score
0,president.putin.fanpage,0,48940200,0.114
2,bikini.bottom.news,0,37147500,-0.053
4,cheddygrace,0,31600000,-0.511
5,xenasolo,0,30629200,0.097
7,world247breakingnews,0,24664271,0.0
9,matthewcassel,0,23114400,-0.331
10,aaronparnas6,0,20800000,0.0
12,mihhail83,0,17416100,0.0
13,military_thingz,0,16800000,-0.1
14,visegrad24,0,16694300,0.053


In [41]:
NoVerifiedAccounts["Score"].mean()

-0.045214285714285714

In [42]:
NoVerifiedAccounts["Score"].max()

0.122

Let's now see how this works for the "conflict" accounts.

In [53]:
#Scoring each account by the descriptions of their videos that mention "conflict"
sai = SentimentIntensityAnalyzer()
news=conflict["userName"]
# The account id is passed as a Cypher parameter ($user) and compared with "=",
# because with "=~" the dots in an id act as regex wildcards
query = """
MATCH (u:user)-[a:author]->(v:video)
WHERE u.id = $user and
      (v.desc =~ '.*conflict.*' OR
      v.desc =~ '.*конфлікт.*' OR
      v.desc =~ '.*конфликт.*')
RETURN v.desc AS description; 
"""
for index, new in enumerate(news):
    score=[]
    result=pd.DataFrame(graph.run(query, user=new).data())
    if (len(result)!=0):
        for sentence in result['description']:
            # only English descriptions are scored, since VADER is an English lexicon
            lang = detect(sentence)
            if lang=='en':
                kvp = sai.polarity_scores(sentence)
                score.append(kvp['compound'])
        # mean compound score over the English descriptions (NaN if there are none)
        conflict.at[index, "Score"]=round(np.mean(score),3)
        
    else:
        print(new,'returned no results in video description')
        print()

  return _methods._mean(a, axis=axis, dtype=dtype,
  ret = ret.dtype.type(ret / rcount)
  return _methods._mean(a, axis=axis, dtype=dtype,
  ret = ret.dtype.type(ret / rcount)
  return _methods._mean(a, axis=axis, dtype=dtype,
  ret = ret.dtype.type(ret / rcount)
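
The lines above appear to be the tail of the warning NumPy emits when `np.mean` is given an empty list, which happens here whenever none of an account's matching descriptions is detected as English; the resulting score is NaN. A small guard avoids the warning while keeping the same result. This is a sketch, and `mean_or_nan` is a hypothetical helper name.

```python
import math

def mean_or_nan(scores):
    """Mean of the compound scores, or NaN (without a warning) when the list is empty."""
    if not scores:
        return math.nan
    return round(sum(scores) / len(scores), 3)

print(mean_or_nan([0.5, -0.25]))
print(mean_or_nan([]))
```

In the loop above it would replace `round(np.mean(score),3)`.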


In [54]:
conflict

Unnamed: 0,userName,VerifiedAccount,VidRelated,Totalvid,Streams,Score
0,caspolnews,0,3,404,9341700,-0.127
1,therealisticfishhead,0,1,90,5300000,-0.318
2,themsinstem,0,15,339,5105004,0.0
3,latestconflictwarnews2,0,5,98,1338500,-0.26
4,lordeedge,0,2,1348,993200,0.0
5,fundiambb1,0,1,481,706700,0.0
6,thegossipnewstribune,0,27,32,532612,0.0
7,world_news_clips,0,2,26,347287,0.0
8,project_bandito,0,27,44,323462,-0.017
9,wearescotland,0,1,740,306500,0.0


In [57]:
VerifiedConflict=conflict.loc[conflict['VerifiedAccount'] == 1]
VerifiedConflict["Score"].mean()

-0.27166666666666667

In [58]:
NoVerifiedConflict=conflict.loc[conflict['VerifiedAccount'] == 0]
NoVerifiedConflict["Score"].mean()

-0.08588000000000001

This is consistent with the top 20 accounts, where the verified accounts had a mean score of -0.176 against -0.045 for the unverified ones. Verified accounts describe the "conflict" in more negative language, while unverified accounts come out closer to neutral, usually because they mix positive and negative values.

Although the predominant language is English, we can now look at the most frequently mentioned terms by the language of the descriptions, to see what each language emphasizes.

In [48]:
query = """
WITH ['president.putin.fanpage','viceworldnews','bikini.bottom.news','skynews','cheddygrace','xenasolo','cbsnews','world247breakingnews','abc15arizona','matthewcassel','aaronparnas6','planefinder','mihhail83','military_thingz','visegrad24','random.content99','politicsoftheworld','aviationhxb','cbsmornings','city__resident'] as users
MATCH (u:user)-[:author]->(v:video)
WHERE (u.id in users) AND 
      (v.desc =~ '.*news.*' OR
      v.desc =~ '.*новини.*' OR 
      v.desc =~ '.*Новости.*'OR
      v.desc =~ '.*ukraine.*' OR
      v.desc =~ '.*україни.*' OR
      v.desc =~ '.*Украина.*'OR
      v.desc =~ '.*russia.*' OR
      v.desc =~ '.*Росія.*' OR
      v.desc =~ '.*Россия.*' OR
      v.desc =~ '.*war.*' OR
      v.desc =~ '.*війни.*' OR
       v.desc =~ '.*война.*' )
RETURN u.id AS username,
       u.verified AS verified,
       sum(v.playCount) AS playCount,
       v.desc AS description
ORDER BY playCount DESC
"""
result = pd.DataFrame(graph.run(query).data())
result.head()
uk=[]
en=[]
other=[]
for sentence in result['description']:
    lang = detect(sentence)
    if lang=='uk':
        uk.append(sentence)
    elif lang=='en':
        en.append(sentence)
    else:
        other.append(sentence)

In [49]:

morestops = ['ukraine','ukraina','україна','українські','ukrainian','#','!','?',
             '&','>','<','@','.',',','у','в','«','»','на','до','про','що','як', 'з','а','коли',
            'news', 'це', 'новини', 'зсу','ми','його','не','чи',':','за','було','’','і']
stops = set(stopwords.words('english'))
def wordc(sentences, lang):
    print('\n',lang)
    wc = WordCloud(background_color = 'white', width = 1920, height = 1080, max_words=20)
    text = " ".join(sentences) # put all descriptions into one string
    tokens = word_tokenize(text)
    tokens = [word.lower() for word in tokens]
    # drop the custom and the English stopwords (set lookup instead of reloading the list per token)
    tokens = [word for word in tokens if word not in morestops and word not in stops]
    text = ' '.join(tokens)
    #print(text)
    wc.generate_from_text(text)
    wc.to_file('wordcloud_'+lang+'.png')
    fdist = FreqDist(tokens)
    top20 = fdist.most_common(20)
    print(top20)


wordc(en, 'en')
wordc(uk, 'uk')
wordc(other, 'other')


 en
[('russia', 239), ('war', 133), ('fyp', 120), ('putin', 94), ('ukraine🇺🇦', 81), ('ukrainewar', 77), ('russian', 64), ('invasion', 41), ('fypシ', 35), ('politics', 32), ('abc15arizona', 30), ('kyiv', 30), ('abc15', 28), ('nato', 27), ('biden', 25), ('bikinibottomnews', 25), ('zelenskyy', 25), ('abc15news', 24), ('foryou', 24), ('says', 23)]

 uk
[('новинах', 1), ('чули', 1), ('росіяни', 1), ('вигадуйте', 1), ('українськийтікток', 1), ('війна', 1), ('війнавукраїні', 1), ('війна2022', 1), ('війназросією', 1)]

 other
[('russia', 34), ('putin', 24), ('ukrainewar', 20), ('fyp', 15), ('vladimirputin', 13), ('ukraine🇺🇦', 11), ('україна🇺🇦', 8), ('war', 7), ('poland', 7), ('zelensky', 7), ('army', 6), ('украина', 6), ('stopwar', 5), ('ukrainewarrussia', 5), ('2022', 4), ('🇺🇦', 4), ('polska', 4), ('dlaciebie', 4), ('украинавойна', 4), ('nowar', 3)]


In [50]:
langs = []
for i in other:
    langs.append(detect(i))
fdist = FreqDist(langs)
print(fdist.most_common(10))

[('fi', 12), ('et', 10), ('en', 9), ('pl', 7), ('id', 5), ('no', 3), ('af', 2), ('ru', 1), ('tl', 1), ('sk', 1)]


Translation of the Ukrainian words:  
news, heard, Russians, invent, Ukrainian TikTok, war, war in Ukraine, war 2022, war with Russia

We can see that practically all the content from the accounts that mention "news" is in English, since the frequency of words in other languages is practically zero. The "other" category consists mainly of hashtags or combinations of words that the detector cannot assign reliably to one language.
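
Hashtag-heavy descriptions are hard cases for a language detector. langdetect's documentation notes that its results are non-deterministic unless `DetectorFactory.seed` is set, which may be why nine entries of the "other" bucket come back as `en` when detected a second time, and it raises an exception for text with no usable features. The sketch below shows one way to make the bucketing robust to both. The detector is passed in as an argument, and `toy_detect` is a stand-in invented so that the example runs without langdetect; in the notebook one would pass `detect` after setting the seed.

```python
def bucket_by_language(sentences, detect_fn, wanted=("uk", "en")):
    """Group sentences by detected language; everything else goes to 'other'."""
    buckets = {lang: [] for lang in wanted}
    buckets["other"] = []
    for sentence in sentences:
        try:
            lang = detect_fn(sentence)
        except Exception:
            # e.g. langdetect raises when a text has no usable features (only emoji)
            lang = "other"
        buckets[lang if lang in wanted else "other"].append(sentence)
    return buckets

def toy_detect(text):
    """Stand-in detector (an assumption for this sketch): Cyrillic means 'uk', else 'en'."""
    if not any(ch.isalpha() for ch in text):
        raise ValueError("no features in text")
    cyrillic = any("а" <= ch.lower() <= "я" or ch.lower() in "іїєґ" for ch in text)
    return "uk" if cyrillic else "en"

buckets = bucket_by_language(["war in ukraine", "війна в україні", "🇺🇦🇺🇦"], toy_detect)
for lang, items in buckets.items():
    print(lang, items)
```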

Finally, focusing more on the visual aspect of the database, we study the user recommendations that TikTok makes. Having already selected the users with the most popular videos on the subject, we look at the network of recommendations generated by the algorithm and check whether there is any pattern.

In [None]:
#Graph of the recommendations
query = """
WITH ['president.putin.fanpage','viceworldnews','bikini.bottom.news','skynews','cheddygrace','xenasolo','cbsnews','world247breakingnews','abc15arizona','matthewcassel','aaronparnas6','planefinder','mihhail83','military_thingz','visegrad24','random.content99','politicsoftheworld','aviationhxb','cbsmornings','city__resident'] as users
 MATCH (u:user)-[r:recommends]->(a:user)
 WHERE u.id in users
 RETURN *
"""
graph.run(query);


In [61]:
#List of the names of the recommended accounts
query = """
WITH ['president.putin.fanpage','viceworldnews','bikini.bottom.news','skynews','cheddygrace','xenasolo','cbsnews','world247breakingnews','abc15arizona','matthewcassel','aaronparnas6','planefinder','mihhail83','military_thingz','visegrad24','random.content99','politicsoftheworld','aviationhxb','cbsmornings','city__resident'] as users
 MATCH (u:user)-[r:recommends]->(a:user)
 WHERE u.id in users 
RETURN DISTINCT a.id
"""
graph.run(query);

# There are 516 distinct recommended accounts

In [65]:
query = """
WITH ['president.putin.fanpage','viceworldnews','bikini.bottom.news','skynews','cheddygrace','xenasolo','cbsnews','world247breakingnews','abc15arizona','matthewcassel','aaronparnas6','planefinder','mihhail83','military_thingz','visegrad24','random.content99','politicsoftheworld','aviationhxb','cbsmornings','city__resident'] as users
 MATCH (u:user)-[r:recommends]->(a:user)
 WHERE u.id in users
RETURN a.id as RecommendedAccount,count(*) as recommendCnt, a.verified as VerifiedAccount
ORDER BY recommendCnt DESC
"""
lista=pd.DataFrame(graph.run(query).data())
lista.head(20)

Unnamed: 0,RecommendedAccount,recommendCnt,VerifiedAccount
0,c4news,5,1
1,skynews,3,1
2,thealetheiaproject,3,0
3,internationalnews12365,3,0
4,24news4u,3,0
5,forcesnews,3,0
6,latestnewsss,3,0
7,istandwithukraine39,3,0
8,underthedesknews,3,1
9,typical_democrat,3,0


In [67]:
#Accounts that recommend "c4news"
query = """
WITH ['president.putin.fanpage','viceworldnews','bikini.bottom.news','skynews','cheddygrace','xenasolo','cbsnews','world247breakingnews','abc15arizona','matthewcassel','aaronparnas6','planefinder','mihhail83','military_thingz','visegrad24','random.content99','politicsoftheworld','aviationhxb','cbsmornings','city__resident'] as users
 MATCH (u:user)-[r:recommends]->(a:user)
 WHERE u.id in users and a.id="c4news"
RETURN u.id ,u.verified as VerifiedAccount

"""
lista=pd.DataFrame(graph.run(query).data())
lista

Unnamed: 0,u.id,VerifiedAccount
0,skynews,1
1,cbsmornings,1
2,viceworldnews,1
3,bikini.bottom.news,0
4,aaronparnas6,0


In [68]:
#Accounts that recommend "skynews"
query = """
WITH ['president.putin.fanpage','viceworldnews','bikini.bottom.news','skynews','cheddygrace','xenasolo','cbsnews','world247breakingnews','abc15arizona','matthewcassel','aaronparnas6','planefinder','mihhail83','military_thingz','visegrad24','random.content99','politicsoftheworld','aviationhxb','cbsmornings','city__resident'] as users
 MATCH (u:user)-[r:recommends]->(a:user)
 WHERE u.id in users and a.id="skynews"
RETURN u.id ,u.verified as VerifiedAccount

"""
lista=pd.DataFrame(graph.run(query).data())
lista

Unnamed: 0,u.id,VerifiedAccount
0,president.putin.fanpage,0
1,viceworldnews,1
2,cbsmornings,1


As we can see, "c4news" is the most recommended account, with 5 recommendations against 3 for the accounts that follow it. Although there is no clear pattern, news accounts tend to lead to recommendations of accounts with "news" or an international focus in their name.
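
The observation about account names can be checked against the ten rows shown in the recommendation table. The sketch below copies those rows and counts how many names contain "news" and how many are verified; it says nothing about the remaining recommended accounts.

```python
# (account, times recommended, verified flag), copied from the table above
top_recommended = [
    ("c4news", 5, 1), ("skynews", 3, 1), ("thealetheiaproject", 3, 0),
    ("internationalnews12365", 3, 0), ("24news4u", 3, 0), ("forcesnews", 3, 0),
    ("latestnewsss", 3, 0), ("istandwithukraine39", 3, 0),
    ("underthedesknews", 3, 1), ("typical_democrat", 3, 0),
]

with_news = [name for name, _, _ in top_recommended if "news" in name]
verified = [name for name, _, flag in top_recommended if flag == 1]

print(f"{len(with_news)} of {len(top_recommended)} names contain 'news'")
print(f"{len(verified)} of {len(top_recommended)} are verified")
```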