Skip to content

CoreyHyllested/impactweets

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

17 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Data Collection src/collect/getAllFollowers.py is a script that utilizes twitter4j binaries to not hit the ratelimits. This makes collection of data take a while. It dumps all data, individual timelines, into a directory structure. Variables exist to modify where the timelines are to be saved, ego-networks's user, etc.

The protectedAccounts is a file of names. This list is twitter protected accounts and will be skipped because Twitter will timeout or fail to collect them. For both, protectedAccounts and deleted (some people have follow bots which are removed)... they need to be removed prior to Data Formating. I know.

grep "Showing @" * | cut -f 1 -d ':' >> protectedAccounts grep "Showing @" * | cut -f 1 -d ':' | xargs rm

grep "^Failed to get timeline" | cut -f 1 -d ':' # >> protectedAccounts and xargs rm... def do this if you're repeating this process.

Data Formating. Ah grasshopper. The JSON isn't an array. It requires manually massaging.

The following steps create a single JSON array. For each file, add a ',' to the last line. Use $ sed -e 's/^ }$/^ },/'

Concatenate all files together into an array, basically add "[" and "]". But be sure to remove the very last comma.

Now you have a giant JSON array of tweets.

Process. The pig script will require the tutorial.jar and data.json to be in specific locations. Be sure it has access. For the least amount of work, cd to src/process/testing. $ pig -x local GetTweetCharactoristics.pig it will create a set of output in src/process/testing/out

Topic-Grams. The process output will create a digraph file for graphviz. $ sfdp -Gbgcolor=black -Ncolor=white -Ecolor=white -Nwidth=0.02 -Nheight=0.02 -Goverlap=false -Nfontcolor=white -Earrowsize=0.4 -Gratio=fill -Tpng

About

Investigating avenues for suggested tweets

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages