In [1]:
from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 $('div.output_stderr').hide();
 } else {
 $('div.input').show();
 $('div.output_stderr').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action='javascript:code_toggle()'><input STYLE='color: #4286f4' 
type='submit' value='Click here to toggle on/off the raw code.'></form>''')

# <center> Small world effect on Twitter </center>

### <center> Third presentation </center>

---

#### <center>by Kristóf Furuglyás </center>


<center> <img src="twitter_start.jpg" alt="tw_start" width="600"/> </center>

##### <center> 2019 Fall, Consultant: Eszter Bokányi, Eötvös Loránd University </center> 

_Note: if you do not see the raw code, toggle it with the button at the top of the page._


## <center> Plan </center>

1. Setting up Twitter API $\checkmark$
2. Gathering tweets $\checkmark$
3. Cleaning the tweets (*in progress*)
4. Creating word-graph $\checkmark$
5. Exploring small-world properties (*in progress*)



### <center>About last week </center>

My progress so far:

- Streamed tweets via the $\texttt{tweepy}$ package
- Selected them by location
- Loaded the data from the .json format
- Tokenized the words in the tweets
- Cleaned the words of unnecessary characters
- Created the word graph

Since last week's presentation, I have removed most of the unnecessary punctuation and other noise, and after tokenizing the words in the tweets, I was able to build a graph from them.

In [2]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


In [3]:
import sys
import ast
import pandas as pd
import json
import re
import nltk
import numpy as np
import networkx as nx
import operator
import plotly.graph_objects as go
import plotly
import folium
import collections
#import enchant
# I will need outside help for this particular package (for language check)
from collections import Counter
from nltk.stem import SnowballStemmer
import matplotlib.pyplot as plt

I already have about 8500 tweets streamed from last week.

In [4]:
with open('intermediaryreporti/tweets_2.txt') as f:
    data = f.readlines()

tweets = []
for k in data:
    tweets.append(json.loads(k))

However, one tweet caused a problem (its place information was missing), so I removed it.

I will not load every single piece of information in, just the necessary ones.

In [5]:
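# Keep only the tweets that carry place information (tweet["place"] is not None).
# A new list is built because deleting from a list while iterating over it
# skips the element that follows each deletion.
tweets = [tweet for tweet in tweets if tweet.get("place") is not None]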
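
The difference between deleting during iteration and building a filtered list can be seen on a toy example. This is only a sketch: the records and the helper names `drop_in_place` and `keep_with_place` are made up for illustration and are not part of the notebook.

```python
# Made-up records standing in for tweets; only the "place" field matters here.
records = [
    {"id": 1, "place": {"bounding_box": {"coordinates": [[[0.0, 51.0]]]}}},
    {"id": 2, "place": None},
    {"id": 3, "place": None},
    {"id": 4, "place": {"bounding_box": {"coordinates": [[[1.0, 52.0]]]}}},
]


def drop_in_place(items):
    """Delete while iterating: the element after each deletion is never checked."""
    items = list(items)  # work on a copy so the input stays untouched
    for index, item in enumerate(items):
        if item["place"] is None:
            del items[index]
    return items


def keep_with_place(items):
    """Build a new list that keeps only the records with a place."""
    return [item for item in items if item.get("place") is not None]


print([r["id"] for r in drop_in_place(records)])
print([r["id"] for r in keep_with_place(records)])
```

With a single faulty record both approaches give the same result; they only differ when two faulty records are adjacent, as in the toy data.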

In [6]:
df = pd.DataFrame()

df['id'] = np.array([tweet["id"] for tweet in tweets])
df['len'] = np.array([len(tweet["text"]) for tweet in tweets])

df['date'] = np.array([tweet["created_at"] for tweet in tweets])
df['source'] = np.array([tweet["source"] for tweet in tweets])
df['likes'] = np.array([tweet["favorite_count"] for tweet in tweets])
df['retweets'] = np.array([tweet['retweet_count'] for tweet in tweets])
df['name'] = np.array([tweet['user']['name'] for tweet in tweets])
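# Twitter stores coordinates as [longitude, latitude]; reverse each pair to [latitude, longitude] for folium.
df['locs'] = [
    [loc[::-1] for loc in tweet['place']['bounding_box']['coordinates'][0] if loc is not None]
    for tweet in tweets
]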

In [7]:
df.shape

(8479, 8)

The share of tweets that did not need the 'full_text' option (the value printed below is a fraction, so it corresponds to about 74% despite the trailing % sign):

In [8]:
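texts = []
cnt = 1  # note: starts at 1, so the count is one too high
for i, tweet in enumerate(tweets):
    try:
        # Long tweets carry their complete text in 'extended_tweet'.
        texts.append(tweet['extended_tweet']['full_text'])
    except KeyError:
        # Short tweets only have the plain 'text' field.
        texts.append(tweet['text'])
        cnt += 1
# note: i is the last index (len(tweets) - 1), so this is only approximately cnt / len(tweets)
print(f"{cnt/i}%")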

0.7425100259495164%
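
A hedged sketch of how this share could be computed as a true percentage, dividing by the number of tweets. The helper name `share_without_extended` and the sample tweets are hypothetical; the check for an `extended_tweet` key mirrors the `KeyError` branch above but is an assumption about the data layout.

```python
def share_without_extended(tweets):
    """Return the fraction of tweets that have no 'extended_tweet' entry."""
    if not tweets:
        return 0.0
    short = sum(1 for tweet in tweets if "extended_tweet" not in tweet)
    return short / len(tweets)


# Made-up sample: three short tweets and one truncated tweet.
sample = [
    {"text": "short one"},
    {"text": "truncated...", "extended_tweet": {"full_text": "the long version"}},
    {"text": "another short one"},
    {"text": "and one more"},
]

# The '%' format specifier multiplies by 100 and appends the percent sign.
print(f"{share_without_extended(sample):.1%}")
```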


In [9]:
df['text'] = texts

# <center> Location of the tweets </center>

In [34]:
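mymap = folium.Map(location=[52.809865, -2.118092], zoom_start=5.4, tiles='cartodbpositron')
# Plot the first 1000 tweets at the first corner of their bounding box.
for i in range(1000):
    # parse_html=True escapes the tweet text so it is shown literally in the popup.
    popup = folium.Popup(df.text[i], parse_html=True)
    folium.Marker(location=df.locs[i][0], popup=popup).add_to(mymap)
mymap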

In [11]:
snow = SnowballStemmer('english',ignore_stopwords=False)

# <center> Cleaning the tweets </center>

In [12]:
print(f"Text of tweet no.8: \n\n{df.text[8]}")

Text of tweet no.8: 

Bob Prattey: Liverpool Exhibition Centre - Do NOT host trophy hunting safari companies - Sign the Petition! https://t.co/W8VcURxSLN via @UKChange


In [13]:
clean = re.sub(r'http\S+', '', df.text[8])
print(f"\nAfter cleaning:\n\n{clean}")


After cleaning:

Bob Prattey: Liverpool Exhibition Centre - Do NOT host trophy hunting safari companies - Sign the Petition!  via @UKChange


In [14]:
# Raw string for the regex, so that \w is not treated as a string escape.
clean_tknzd = [snow.stem(word) for word in re.findall(r'\w+', clean.lower())]
print(f"After tokenizing:\n\n{clean_tknzd}")

After tokenizing:

['bob', 'prattey', 'liverpool', 'exhibit', 'centr', 'do', 'not', 'host', 'trophi', 'hunt', 'safari', 'compani', 'sign', 'the', 'petit', 'via', 'ukchang']


Other stemmers, such as Porter, would also be possible.

In [15]:
df["nourl"] = [re.sub(r'http\S+', '', t) for t in df.text]

In [16]:
# Lower-case each tweet, split it into word tokens and stem every token.
df['tkzd_clnd'] = [[snow.stem(word) for word in re.findall(r'\w+', t.lower())] for t in df.nourl]
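
The URL removal and the tokenization can be combined into one helper. The following is a minimal sketch using only the standard library; stemming is left out, and the name `tokenize` is hypothetical. It also shows why single-letter tokens such as 's' and 't' appear: `\w+` splits contractions at the apostrophe.

```python
import re

URL_PATTERN = re.compile(r"http\S+")
WORD_PATTERN = re.compile(r"\w+")


def tokenize(text):
    """Strip URLs, lower-case the text and return its word tokens."""
    without_urls = URL_PATTERN.sub("", text)
    return WORD_PATTERN.findall(without_urls.lower())


print(tokenize("Sign the Petition! https://t.co/W8VcURxSLN via @UKChange"))
print(tokenize("It's here, don't panic"))
```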

# <center> Network of tweets </center>


- Tool: $\texttt{MultiGraph}$ from $\texttt{networkx}$,

- Connection = two words occur in the same tweet $\rightarrow$ the more shared tweets, the greater the weight,

- Parallel edges are possible, but self-loops are not.

In [17]:
g = nx.MultiGraph()

In [18]:
for index, tweet in df.iterrows():
    tokens = tweet['tkzd_clnd']
    for i, w in enumerate(tokens):
        g.add_node(w)
        # Connect w to every earlier word of the same tweet.
        for j in range(i):
            if tokens[j] != w:  # skip self-loops
                g.add_edge(tokens[j], w)
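
The same co-occurrence rule can be sketched without a graph library by counting each unordered pair of different words once per co-occurrence, which gives the edge weights directly. This is an illustrative alternative rather than the notebook's method; `cooccurrence_counts` and the toy token lists are made up.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(token_lists):
    """Count how often each unordered pair of different words shares a tweet."""
    counts = Counter()
    for tokens in token_lists:
        # combinations() yields every pair of positions (i < j) exactly once.
        for first, second in combinations(tokens, 2):
            if first != second:  # no self-loops
                counts[tuple(sorted((first, second)))] += 1
    return counts


toy = [["the", "cat", "sat"], ["the", "cat"], ["the", "the", "dog"]]
counts = cooccurrence_counts(toy)
print(counts.most_common(2))
```

A word repeated inside one tweet contributes one count per occurrence, which matches the parallel edges added by the loop above.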

In [19]:
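# Remove the single-letter tokens 's' and 't'; \w+ splits such fragments off contractions like "it's" and "don't".
g.remove_node("s")
g.remove_node("t")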

In [20]:
print(f"Num of nodes and edges: {len(g.nodes), len(g.edges())}")

Num of nodes and edges: (22181, 1569028)


In [21]:
# Degree of every node, largest first.
degree_sequence = sorted([d for n, d in g.degree()], reverse=True)
# Number of nodes (words) that have each degree.
degreeCount = collections.Counter(degree_sequence)
deg, cnt = zip(*degreeCount.items())

layout = go.Layout(
    autosize=False,
    width=720,
    height=480,
    margin=go.layout.Margin(
        l=50,
        r=50,
        b=100,
        t=100,
        pad=4
    ),
    bargroupgap=0.3
)


fig1 = go.Figure(go.Bar(x = deg, y = cnt, name = "Degree distribution"), layout=layout)
fig1.update_layout(xaxis_type="log")#, yaxis_type="log")
fig1.update_layout({'hovermode': 'x',})
fig1.update_xaxes(title_text = "Degree")
fig1.update_yaxes(title_text = "Num of words")
fig1.update_layout(title="Degree distribution" ,     font=dict(
                family='Courier New, monospace',
                size=14,
                color='black'
            ))

In [22]:
plotly.offline.init_notebook_mode(connected=True)
plotly.offline.iplot(fig1, filename='degdist')
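
How a degree histogram of this kind is obtained can be sketched on a small edge list with the standard library only. The function name `degree_histogram` and the edges are hypothetical; parallel edges each add to the degree, as they do in a multigraph.

```python
from collections import Counter


def degree_histogram(edges):
    """Map each degree to the number of nodes that have it."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return Counter(degree.values())


# Toy multigraph: the pair ("the", "cat") appears twice.
edges = [("the", "cat"), ("the", "cat"), ("the", "dog"), ("cat", "sat")]
hist = degree_histogram(edges)
for deg, n_words in sorted(hist.items()):
    print(deg, n_words)
```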

The words with the highest degree, compared with the reference word frequencies read from `words_zipf.txt`:

In [23]:
file = open("words_zipf.txt", "r") 

In [24]:
words = []
freqs = []

# Each line holds a word followed by its frequency.
for line in file:
    t = line.split()
    words.append(t[0])
    freqs.append(t[1])
file.close()

In [25]:
degs = sorted(g.degree, key=lambda x: x[1], reverse=True)
node, occ = [x[0] for x in degs], [x[1] for x in degs]

n = 20
len_g_e = len(g.edges()) 

data =[go.Bar(x = node[:n], y = [x/len_g_e for x in occ[:n]], name = "From twitter"),
      go.Bar(x = [x.lower() for x in words[:n]], y = [float(x)/10e5 for x in freqs[:n]], name = "Real" )]

fig2 = go.Figure(data, layout=layout)
#fig1.update_layout(xaxis_type="log")#, yaxis_type="log")
fig2.update_layout({'hovermode': 'x',})
fig2.update_xaxes(title_text = "Nodes", tickangle=315)
fig2.update_yaxes(title_text = "Num of edges (relative)")
fig2.update_layout(title="Most common words" ,     font=dict(
                family='Courier New, monospace',
                size=14,
                color='black'
            ))

In [26]:
plotly.offline.init_notebook_mode(connected=True)
plotly.offline.iplot(fig2, filename='mostcommnodes')

In [27]:
edgs = list(g.edges())

edgs_d = dict(Counter(edgs))

sorted_edgs = sorted(edgs_d.items(), key=operator.itemgetter(1), reverse=True)

In [28]:
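n = 20

# Keep the n most frequent word pairs (the cutoff was previously hard-coded to 30).
toplot = sorted_edgs[:n]
# Share of each pair among all edges, in percent.
nums = [100 * count / len(edgs) for pair, count in toplot]
labels = [f"{pair[0]} - {pair[1]}" for pair, count in toplot]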

Below you can see the 20 most common co-occurrences (the word pairs with the most parallel edges).

In [29]:

fig = go.Figure(go.Bar(x = labels,y =  nums, name = 'Occurrence'), layout=layout )

fig.update_xaxes(title_text = "Connection pairs", tickangle=315)
fig.update_yaxes(title_text = "% of all the connections")
fig.update_layout(title_text="Relative occurrence of the most common edges",     font=dict(
                family='Courier New, monospace',
                size=14,
                color='black'
            ))

In [30]:
plotly.offline.init_notebook_mode(connected=True)
plotly.offline.iplot(fig, filename='mostcommedges')

It is clear that many non-words (tokens that are not part of natural language), such as 'https', 't' and 'co', are present. However, it is promising that the word 'the' is very common, as expected from [Zipf's law](https://en.wikipedia.org/wiki/Zipf%27s_law).
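
Zipf's law says that the frequency of the $r$-th most common word is roughly proportional to $1/r$. A minimal sketch of how observed counts could be set against that prediction is given here; the token list is invented and `rank_frequency` is a hypothetical helper, so the numbers say nothing about the Twitter data.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, count, zipf_prediction) rows, most frequent first."""
    ranked = Counter(tokens).most_common()
    if not ranked:
        return []
    top_count = ranked[0][1]
    # Idealized Zipf curve: the count of the top word divided by the rank.
    return [
        (rank, word, count, top_count / rank)
        for rank, (word, count) in enumerate(ranked, start=1)
    ]


tokens = ["the"] * 6 + ["of"] * 3 + ["and"] * 2 + ["cat"]
for rank, word, count, predicted in rank_frequency(tokens):
    print(f"{rank:>2} {word:<4} observed={count} zipf={predicted:.1f}")
```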

# <center> Upcoming tasks </center>

- Gather more tweets (currently $\sim 8500$; the goal is more than a million)
- Search for communities
- Look for other measures (centralities)
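
For the small-world exploration listed in the plan, two quantities are commonly compared: the average clustering coefficient and the average shortest path length. Which measures will actually be used is not decided here, so the following is only a sketch under that assumption, written with the standard library on a toy edge list; all function names are hypothetical.

```python
from collections import deque


def build_adjacency(edges):
    """Collapse an edge list (parallel edges allowed) into neighbour sets."""
    adjacency = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def average_clustering(adjacency):
    """Mean local clustering coefficient; nodes with degree < 2 count as 0."""
    total = 0.0
    for node, neighbours in adjacency.items():
        k = len(neighbours)
        if k < 2:
            continue
        # Each link between two neighbours is seen from both ends, hence / 2.
        links = sum(1 for x in neighbours for y in adjacency[x] if y in neighbours) / 2
        total += 2 * links / (k * (k - 1))
    return total / len(adjacency) if adjacency else 0.0


def average_path_length(adjacency):
    """Mean BFS distance over all ordered pairs of mutually reachable nodes."""
    total, pairs = 0, 0
    for source in adjacency:
        distance = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in distance:
                    distance[neighbour] = distance[node] + 1
                    queue.append(neighbour)
        total += sum(distance.values())
        pairs += len(distance) - 1
    return total / pairs if pairs else 0.0


# Toy graph: a triangle a-b-c with a pendant node d; ("a", "b") is a parallel edge.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("a", "b")]
adjacency = build_adjacency(edges)
print(average_clustering(adjacency), average_path_length(adjacency))
```

A network is usually called small-world when its clustering is much higher than that of a random graph of the same size while its average path length stays comparably short.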

# <center> Thank you for your attention! </center>
---
<center> <img src="elte_cimer_szines.jpg" alt="elte" width="400"/> </center>

---
#### <center> 2019 Fall, Eötvös Loránd University </center>