Trump Of Yore notes
Challenge: Find old tweets that are similar to new ones.
Research & Resources
Interesting answers to a Quora question here: https://www.quora.com/What-are-some-NLP-techniques-that-can-be-used-to-find-similar-Twitter-messages
That referenced this paper: https://arxiv.org/pdf/1605.03481v2.pdf
Which has this code: https://github.com/bdhingra/tweet2vec
Going to try to use that code.
Using a new anaconda python environment I'm calling `vectoring`:
```
conda create --name vectoring python=2
source activate vectoring
conda install theano lasagne numpy
pip install tweepy
pip install boto3
```
Ran into a dependency error around `downsample`, which is deprecated but still used by the old Lasagne version that the Theano install pulls in. So I did this:
```
pip install --upgrade https://github.com/Theano/Theano/archive/master.zip
pip install --upgrade https://github.com/Lasagne/Lasagne/archive/master.zip
```
See: https://lasagne.readthedocs.io/en/latest/user/installation.html for more info on this.
For the cosine distance:
```
conda install scikit-learn
```
Playing with the vectorization
Ran `./tweet2vec_encoder.sh` to generate test results.
Results end up in `result/`, which I copied over to this repo's `making_vectors` folder.
Had to add jupyter to the `vectoring` environment so I could use Python 2 in the notebook: `conda install jupyter`
See the python notebook in the `making_vectors` folder, where I load in the `embeddings.npy` file generated by tweet2vec using numpy.
Also took the one-to-many comparison code from here: https://stackoverflow.com/questions/12118720/python-tf-idf-cosine-to-find-document-similarity Particularly this part:
Hence to find the top 5 related documents, we can use argsort and some negative array slicing (most related documents have highest cosine similarity values, hence at the end of the sorted indices array) ...
Got this working.
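A minimal sketch of that lookup, with the file name and variable names assumed (this isn't the actual notebook code):
```
# One-to-many cosine similarity: rank all old tweets against one new tweet,
# then take the top 5 via argsort + reverse slicing.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

embeddings = np.load("embeddings.npy")   # (n_tweets, dim) tweet2vec vectors
new_vec = embeddings[-1].reshape(1, -1)  # pretend the last row is the new tweet

sims = cosine_similarity(new_vec, embeddings).flatten()
top5 = sims.argsort()[::-1][1:6]         # the top hit is the tweet itself, so skip it
print(top5, sims[top5])
```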
Commented out some prediction lines in `encode_char.py` since we're not going to use the hashtag prediction -- just want the vector encoding of tweets.
Prepping Trump Tweets
Inside `trump_data`, modified the `preprocess.py` from the original tweet2vec:
- added the twitter str_id to each row, separated by a tab character
- made the tokenizer handle periods better
- changed the 'write' mode to 'append' mode for the output file (roughly sketched below)
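A rough sketch of that modified write step -- the field names and JSON structure here are assumptions, not the actual diff:
```
# Each output row becomes "<id_str>\t<text>", and the output file opens in
# append mode so the yearly runs below accumulate into one pre_prez_archive.txt.
import io
import json
import sys

in_path, out_path = sys.argv[1], sys.argv[2]
with io.open(in_path, encoding="utf-8") as f:
    tweets = json.load(f)  # assumes each yearly archive file is a json array

with io.open(out_path, "a", encoding="utf-8") as out:  # 'a' instead of 'w'
    for tweet in tweets:
        text = tweet["text"].replace("\n", " ")  # stand-in for the real tokenizer
        out.write(u"%s\t%s\n" % (tweet["id_str"], text))
```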
```
python preprocess.py 2009.json pre_prez_archive.txt
python preprocess.py 2010.json pre_prez_archive.txt
python preprocess.py 2011.json pre_prez_archive.txt
python preprocess.py 2012.json pre_prez_archive.txt
python preprocess.py 2013.json pre_prez_archive.txt
python preprocess.py 2014.json pre_prez_archive.txt
python preprocess.py 2015.json pre_prez_archive.txt
python preprocess.py 2016.json pre_prez_archive.txt
python preprocess.py 2017_pre0120.json pre_prez_archive.txt
```
Now I'm going to use a modified version of tweet2vec which removes the prediction functions since we're not doing any machine learning, just turning the tweets into vectors. It also accounts for the tab characters we've added (along with the twitter id_str values) in the data.
Repo for Quartz version is here: https://github.com/Quartz/tweet2vec
Over there, ran the modified `encode_char.py`, which takes the following parameters:
```
python encode_char.py [datafile] [modelpath] [resultpath]
```
I'm just going to use the default model path to keep it happy.
```
source activate vectoring
python encode_char.py ../../botstudio-trump-echo/trump_data/pre_prez_archive.txt best_model/ ../../botstudio-trump-echo/trump_data/result/
```
And I got a 50 MB numpy file!
Heading over to the `vector_tinkering_all` notebook ...
Tweeting with Tweepy
```
TRUMP_USER_ID = "25073877"
POTUS_USER_ID = "822215679726100480"
```
API keys, tokens, secrets
These are all in a file called `config.py`, which is not included in this repo. I have included a `config_example.py` which mirrors the `config.py` file, but without the secret codes.
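A sketch of what `config_example.py` might contain -- the variable names here are my guesses, since the real `config.py` is deliberately kept out of the repo:
```
# Hypothetical config_example.py layout: fill in your own Twitter app credentials.
CONSUMER_KEY = ""
CONSUMER_SECRET = ""
ACCESS_TOKEN = ""
ACCESS_TOKEN_SECRET = ""
```
And the standard tweepy (v3-era) auth flow that would consume it, assuming those names:
```
import tweepy
import config

# OAuth 1a handshake: app keys first, then the account's access tokens.
auth = tweepy.OAuthHandler(config.CONSUMER_KEY, config.CONSUMER_SECRET)
auth.set_access_token(config.ACCESS_TOKEN, config.ACCESS_TOKEN_SECRET)
api = tweepy.API(auth)
```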
Installing on EC2
Had to install conda. See: https://www.continuum.io/downloads#linux
wget "https://repo.continuum.io/archive/Anaconda2-4.4.0-Linux-x86_64.sh" bash Anaconda2-4.4.0-Linux-x86_64.sh
(Answer 'yes' to adding the path to `.bashrc`, which is not the default.)
Did all the installs above.
Added the `config.py` file to the top level of the repo.
To avoid this error/warning when running the app ...
```
WARNING (theano.configdefaults): g++ not detected ! Theano will be unable to execute optimized C-implementations (for both CPU and GPU) and will default to Python implementations. Performance will be severely degraded. To remove this warning, set Theano flags cxx to an empty string.
```
... I added a file at the user home level called `.theanorc` containing this:
```
[global]
floatX = float32
device = cpu
cxx = ""
```
Getting Tweet Images
So, I was working on how to display both old and new tweets together. Couldn't get the user experience to be very good (see images in `notes_images`).
Then, in talking with Quartz's Chris Zarate, we had the idea of getting screenshots of both tweets and combining them into a single image. He had already been working on a screenshot system running on AWS Lambda, and modified it to accept two tweet urls. It then returns a url to the combined image.
To do this, I need to connect to the Lambda function in the AWS system. So I installed boto3, the AWS SDK for Python.
`pip install boto3`
Added code that invokes the lambda function, using the boto3 docs.
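Roughly what that looks like -- the function name, region, and payload keys below are placeholders, not the real (private) screenshot service's interface:
```
# Sketch of invoking a Lambda function with boto3.
import json
import boto3

client = boto3.client("lambda", region_name="us-east-1")  # region is assumed

response = client.invoke(
    FunctionName="tweet-screenshot",  # hypothetical function name
    InvocationType="RequestResponse",
    Payload=json.dumps({
        "old_tweet_url": "https://twitter.com/...",
        "new_tweet_url": "https://twitter.com/...",
    }),
)
result = json.loads(response["Payload"].read())  # should hold the image url
```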
Tinkered with this in ...
Found a key, undocumented trick to programmatically change the AWS region in this useful post.
Note that the screenshot function isn't publicly available just now. If you're actually reading this and are excited about this feature, let us know at firstname.lastname@example.org!
For now, the server the TrumpOfYore bot is running on has specific credentials to invoke the lambda function.
Running on EC2
Using `forever` to run my python script, just like I do with node, but using `-c` to run a command:
- `-l` is the log file
- `-m` is the max number of retries (don't want to upset Twitter)
- `-c` is the command
Note that I'm doing this command after `source activate vectoring` to start my conda environment.
```
cd botstudio-trump-of-yore
source activate vectoring
forever start -a -l /home/ubuntu/botstudio-trump-of-yore/forever.log -c python app.py
```