# <center> Small World Phenomenon on Twitter Data   </center>

## <center> $\textit{First Short Report} $ </center>

### <center> Data Science Laboratory </center>

<center> <img src="elte_cimer_szines.jpg" alt="elte_cimer" width="260"/> </center>

### Introduction 

This report briefly summarizes my work since last week. Before last week's lesson I was already able to gather tweets via the $\texttt{tweepy}$ package.

### About last week

After being able to stream tweets, my next task was to prepare the data. The data originally comes in $\texttt{.json}$ format, but there were some decoding errors, which I was not able to handle properly at the beginning.

On my personal laptop, after loading the files, the keys of each tweet dictionary showed up as unicode strings with a $\texttt{u}$ prefix:

~~~
>>> list(tweets[0].keys())
~~~

~~~
[u'quote_count', u'contributors', u'truncated', u'text', u'is_quote_status', u'in_reply_to_status_id', u'reply_count', u'id', u'favorite_count', u'entities', u'retweeted', u'coordinates', u'timestamp_ms', u'source', u'in_reply_to_screen_name', u'id_str', u'retweet_count', u'in_reply_to_user_id', u'favorited', u'user', u'geo', u'in_reply_to_user_id_str', u'lang', u'extended_tweet', u'created_at', u'filter_level', u'in_reply_to_status_id_str', u'place']
~~~

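So I tried the exact same method on the kooplex server, and there the keys came back as plain strings. The $\texttt{u'...'}$ prefix is how Python 2 displays unicode strings, so the difference presumably comes from my laptop running Python 2 while the server runs Python 3.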

In [1]:
import ast
import pandas as pd
import json
import re
import nltk

In [2]:
# tweets.txt stores one JSON-encoded tweet per line
with open('tweets.txt', encoding='utf-8') as f:
    data = f.readlines()

# parse every line into a dictionary
tweets = [json.loads(line) for line in data]
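
The loading step above stops at the first line that is not valid JSON. A minimal sketch of a more forgiving loader follows, assuming the file holds one tweet per line. The function name $\texttt{load\_tweets}$ and the choice to count skipped lines are my own illustration, not part of the original notebook.

```python
import json

def load_tweets(path):
    """Read one JSON object per line, skipping blank or malformed lines."""
    tweets, skipped = [], 0
    # Explicit UTF-8 avoids depending on the platform default encoding.
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # ignore empty lines between records
            try:
                tweets.append(json.loads(line))
            except json.JSONDecodeError:
                skipped += 1  # count a broken record instead of crashing
    return tweets, skipped
```

Calling $\texttt{load\_tweets('tweets.txt')}$ would return the parsed tweets together with the number of lines that could not be decoded, which helps to judge how much data the decoding problems affect.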

In [4]:
list(tweets[0].keys())

['created_at',
 'id',
 'id_str',
 'text',
 'source',
 'truncated',
 'in_reply_to_status_id',
 'in_reply_to_status_id_str',
 'in_reply_to_user_id',
 'in_reply_to_user_id_str',
 'in_reply_to_screen_name',
 'user',
 'geo',
 'coordinates',
 'place',
 'contributors',
 'retweeted_status',
 'is_quote_status',
 'quote_count',
 'reply_count',
 'retweet_count',
 'favorite_count',
 'entities',
 'favorited',
 'retweeted',
 'filter_level',
 'lang',
 'timestamp_ms']

After sorting out these problems, I load everything into a pandas DataFrame for easier handling.

In [5]:
df = pd.DataFrame(tweets)

In [6]:
df.head()

Unnamed: 0,contributors,coordinates,created_at,display_text_range,entities,extended_tweet,favorite_count,favorited,filter_level,geo,...,quoted_status_permalink,reply_count,retweet_count,retweeted,retweeted_status,source,text,timestamp_ms,truncated,user
0,,,Thu Sep 19 11:12:50 +0000 2019,,"{'hashtags': [], 'urls': [], 'user_mentions': ...",,0,False,low,,...,,0,0,False,{'created_at': 'Tue Sep 17 22:30:05 +0000 2019...,"<a href=""http://twitter.com/download/iphone"" r...","RT @seanhannity: **Move over FBI, James Comey,...",1568891570649,False,"{'id': 108675976, 'id_str': '108675976', 'name..."
1,,,Thu Sep 19 11:12:51 +0000 2019,,"{'hashtags': [], 'urls': [{'url': 'https://t.c...",{'full_text': 'To everybody breathlessly waiti...,0,False,low,,...,,0,0,False,,"<a href=""https://mobile.twitter.com"" rel=""nofo...",To everybody breathlessly waiting for the next...,1568891571304,True,"{'id': 21390418, 'id_str': '21390418', 'name':..."
2,,,Thu Sep 19 11:12:52 +0000 2019,,"{'hashtags': [], 'urls': [], 'user_mentions': ...",,0,False,low,,...,,0,0,False,{'created_at': 'Thu Sep 19 02:41:51 +0000 2019...,"<a href=""http://twitter.com/#!/download/ipad"" ...",RT @RyanAFournier: Democratic megadonor Ed Buc...,1568891572477,False,"{'id': 821529902805643264, 'id_str': '82152990..."
3,,,Thu Sep 19 11:12:52 +0000 2019,,"{'hashtags': [], 'urls': [], 'user_mentions': ...",,0,False,low,,...,,0,0,False,{'created_at': 'Wed Sep 18 21:23:58 +0000 2019...,"<a href=""http://twitter.com/download/android"" ...",RT @TalbertSwan: Barack Obama: 8 yrs\n\n0 indi...,1568891572528,False,"{'id': 55573445, 'id_str': '55573445', 'name':..."
4,,,Thu Sep 19 11:12:53 +0000 2019,,"{'hashtags': [], 'urls': [], 'user_mentions': ...",,0,False,low,,...,,0,0,False,{'created_at': 'Wed Sep 18 19:41:51 +0000 2019...,"<a href=""http://twitter.com/download/iphone"" r...",RT @dbongino: One of the worst acts of media m...,1568891573644,False,"{'id': 4606622202, 'id_str': '4606622202', 'na..."


The next task is to get the core (the stem) of the words in each tweet's text.

In [10]:
print('Sample text: ' + df.text[3])

Sample text: RT @TalbertSwan: Barack Obama: 8 yrs

0 indictments
0 porn stars raw dogged and paid
0 pedophiles endorsed
0 racists pardoned
0 mass murder…


Each entry of the $\texttt{user}$ column is a dictionary, so I can access many of its fields via their keys, for example the number of friends:

In [17]:
key = 'friends_count'

for i in df.user.values:
    print(i[key])

4349
6412
64
487
233
254
400
52
954
270
323
4812
60
646
197
16465
1256
3695
2028
134
204
6871
397
72
4964
99
8355
30
228
988
16465
33
2
1189
224
11082
1558
47636
420
135
3022
2466
308
679
3608
9211
588
5002
623
919
28
158
135
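
The key lookup above can be collected into flat records instead of being printed. This is a minimal sketch using only plain dictionaries. The helper name $\texttt{user\_fields}$, the default field selection and the two sample records are assumptions made for illustration.

```python
def user_fields(tweets, fields=('screen_name', 'followers_count', 'friends_count')):
    """Return one flat dict per tweet holding the selected user fields."""
    rows = []
    for tweet in tweets:
        user = tweet.get('user') or {}
        # dict.get returns None for a missing key instead of raising KeyError
        rows.append({field: user.get(field) for field in fields})
    return rows

# Two hand-made records standing in for streamed tweets.
sample = [
    {'user': {'screen_name': 'alice', 'followers_count': 10, 'friends_count': 64}},
    {'user': {'screen_name': 'bob', 'friends_count': 487}},
]
rows = user_fields(sample)
print([row['friends_count'] for row in rows])
```

A list of flat dictionaries like $\texttt{rows}$ can be passed to $\texttt{pd.DataFrame}$ to get one column per field. pandas also provides $\texttt{pd.json\_normalize}$ for flattening nested records, which may be the more convenient route on the full data.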


I can even access the complete user object that belongs to a tweet.

In [18]:
df.user[1]

{'id': 21390418,
 'id_str': '21390418',
 'name': '𝔇𝔬𝔠',
 'screen_name': 'xxdr_zombiexx',
 'location': '#NaziAmerica',
 'url': None,
 'description': 'ACT UP NOW.\n\nVote next year.\n\n#RepublicansHateYou',
 'translator_type': 'none',
 'protected': False,
 'verified': False,
 'followers_count': 7176,
 'friends_count': 6412,
 'listed_count': 85,
 'favourites_count': 44817,
 'statuses_count': 128671,
 'created_at': 'Fri Feb 20 11:38:00 +0000 2009',
 'utc_offset': None,
 'time_zone': None,
 'geo_enabled': False,
 'lang': None,
 'contributors_enabled': False,
 'is_translator': False,
 'profile_background_color': '81E74B',
 'profile_background_image_url': 'http://abs.twimg.com/images/themes/theme1/bg.png',
 'profile_background_image_url_https': 'https://abs.twimg.com/images/themes/theme1/bg.png',
 'profile_background_tile': True,
 'profile_link_color': '000000',
 'profile_sidebar_border_color': '1B1F19',
 'profile_sidebar_fill_color': '83D1E6',
 'profile_text_color': '333333',
 'profile_use_b

With the $\texttt{nltk}$ package I can tokenize the text and reduce each word to its stem.

In [89]:
trump_tweet = 'Having great meetings and discussions with my friend, President @EmmanuelMacron of France. We are in the midst of meetings on Iran, Syria and Trade. We will be holding a joint press conference shortly, here at the @WhiteHouse. 🇺🇸🇫🇷'

nltk.download('stopwords')
nltk.download('punkt')

from nltk.stem import SnowballStemmer

# stemming maps inflected forms of a word to the same root,
# so that words with the same meaning can be matched
snow = SnowballStemmer('english', ignore_stopwords=True)
for word in nltk.word_tokenize(df.text[0].lower())[0:10]:
    print(word, '->', snow.stem(word))

[nltk_data] Downloading package stopwords to /home/ahmitr/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/ahmitr/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
always -> alway
wondered -> wonder
about -> about
the -> the
great -> great
cj -> cj
flipflop -> flipflop
. -> .


Here I tokenize and stem all the tweets.

In [95]:
for k in df.text:
    print('\n' + k + '\n')
    # print every token of the tweet next to its stem
    for word in nltk.word_tokenize(k.lower()):
        print(word, '->', snow.stem(word))


Always wondered about the great CJ flipflop.

always -> alway
wondered -> wonder
about -> about
the -> the
great -> great
cj -> cj
flipflop -> flipflop
. -> .

RT @RedNationRising: Daily Reminder

Hillary Clinton PAYED British &amp; Russian Government Sources to dig up dirt on candidate Trump so she co…

rt -> rt
@ -> @
rednationrising -> rednationris
: -> :
daily -> daili
reminder -> remind
hillary -> hillari
clinton -> clinton
payed -> pay
british -> british
& -> &
amp -> amp
; -> ;
russian -> russian
government -> govern
sources -> sourc
to -> to
dig -> dig
up -> up
dirt -> dirt
on -> on
candidate -> candid
trump -> trump
so -> so
she -> she
co… -> co…

RT @thebradfordfile: Clinton Foundation contributions:

2015: $183 million
2016: $135 million
2017:   $23 million

Why would foreign govern…

rt -> rt
@ -> @
thebradfordfile -> thebradfordfil
: -> :
clinton -> clinton
foundation -> foundat
contributions -> contribut
: -> :
2015 -> 2015
: -> :
$ -> $
183 -> 183
million -> million
201

“ -> “
the -> the
president -> presid
must -> must
be -> be
held -> held
accountable -> account
. -> .
no -> no
one* -> one*
is -> is
above -> above
the -> the
law -> law
. -> .
'' -> ''
* -> *
except -> except
barack -> barack
obama -> obama
, -> ,
hillary -> hillari
clin… -> clin…

JUST NOW: Senator Bernie Sanders tells crowds at New Hampshire rally, “But given the impeachment process that is no… https://t.co/KnCXSUxgaO

just -> just
now -> now
: -> :
senator -> senat
bernie -> berni
sanders -> sander
tells -> tell
crowds -> crowd
at -> at
new -> new
hampshire -> hampshir
rally -> ralli
, -> ,
“ -> “
but -> but
given -> given
the -> the
impeachment -> impeach
process -> process
that -> that
is -> is
no… -> no…
https -> https
: -> :
//t.co/kncxsuxgao -> //t.co/kncxsuxgao

RT @PrisonPlanet: Who rioted after they lost the election?

Hillary Clinton supporters. https://t.co/ABXkZK8bTY

rt -> rt
@ -> @
prisonplanet -> prisonplanet
: -> :
who -> who
rioted -> riot
after -> after
they -> th
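
The output above shows that retweet markers, handles, links and HTML entities such as $\texttt{\&amp;}$ end up as tokens of their own. One way to remove them before tokenizing is sketched below with the standard library only. The function name $\texttt{clean\_tweet}$ and the exact regular expressions are assumptions, not a rule set taken from the original work.

```python
import html
import re

URL_RE = re.compile(r'https?://\S+')
MENTION_RE = re.compile(r'@\w+')
RT_RE = re.compile(r'^RT\s+')

def clean_tweet(text):
    """Strip retweet markers, mentions, URLs and HTML entities from a tweet."""
    text = html.unescape(text)        # '&amp;' -> '&'
    text = RT_RE.sub('', text)        # leading 'RT ' of a retweet
    text = URL_RE.sub(' ', text)      # links such as https://t.co/...
    text = MENTION_RE.sub(' ', text)  # user handles such as @name
    # Replace every run of characters that are neither word characters
    # nor apostrophes (punctuation, whitespace) with a single space.
    text = re.sub(r"[^\w']+", ' ', text)
    return text.strip().lower()

print(clean_tweet('RT @PrisonPlanet: Who rioted after they lost the election?'))
```

The cleaned string can then be handed to $\texttt{nltk.word\_tokenize}$ and the stemmer as before, so that only real words reach the later steps.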

### Upcoming tasks

In the following weeks I will need to gather more tweets, find an appropriate way to extract the chunks (stems) of the words, and then find those words in the tweets so that I can create the graph.
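
How the graph will be defined is still open. As one possible reading, the sketch below treats stems as nodes and connects two stems whenever they occur in the same tweet. The function name $\texttt{cooccurrence\_graph}$, the co-occurrence rule and the sample stems are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence_graph(token_lists):
    """Build an undirected weighted graph: words are nodes, and an edge
    weight counts the tweets in which both words occur."""
    graph = defaultdict(lambda: defaultdict(int))
    for tokens in token_lists:
        # sorted(set(...)) removes repeats so each pair counts once per tweet
        for a, b in combinations(sorted(set(tokens)), 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph

# Hypothetical, already stemmed tweets.
tweets = [
    ['clinton', 'foundat', 'contribut'],
    ['hillari', 'clinton', 'support'],
    ['clinton', 'foundat'],
]
graph = cooccurrence_graph(tweets)
print(graph['clinton']['foundat'])   # 2
print(sorted(graph['support']))      # ['clinton', 'hillari']
```

A nested dictionary like this is enough for small experiments. For measuring path lengths and clustering on the full data, a dedicated graph library would probably be the better choice.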