A classifier built on Twitter data to estimate how likely a piece of text is to be a Russian Troll tweet, or to measure the influence of Russian Trolls on our own posts and text. For the sake of convenience, 'Russian' in the following context refers to tweets that were found to come from a known Twitter handle of Russia's Internet Research Agency, and has nothing to do with the Russian people in general.
- Sentiment140: 1,600,498 tweets (circa 2010)
- Cheng-Caverlee-Lee: September 2009 - January 2010 Twitter scrape
Non-Russian tweets were chosen from 2009-2010 in order to avoid 'tainted' tweets in the dataset; the hypothesis here is that Russian Trolls (read: the Internet Research Agency) have had a negative effect on our social media experience and are partly responsible for the great divides we are seeing in our society today. To determine whether there is Russian Troll influence in a tweet, I needed a baseline of what we sounded like before the social media campaigns started, hence the timeframe.
Using the power of Left Joins, I discovered 25,837 Russian tweets in the non-Russian sets above and removed them before combining the datasets.
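In pandas, that de-duplication can be sketched as a left join with an indicator column. The file names and the `text` column below are hypothetical stand-ins for the actual dataset schemas:

```python
import pandas as pd

# Hypothetical file names and column names; the real datasets have their own schemas.
non_russian = pd.read_csv("sentiment140.csv")
russian = pd.read_csv("russian_troll_tweets.csv")

# Left join on tweet text; rows that also appear in the Russian set
# come back with _merge == 'both'.
merged = non_russian.merge(
    russian[["text"]].drop_duplicates(),
    on="text", how="left", indicator=True,
)

overlap = merged[merged["_merge"] == "both"]  # the Russian tweets hiding in the non-Russian sets
clean_non_russian = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
```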
The ultimate goal of this is to measure 'Russian' influence over modern social media language usage. To do that, I needed to remove things that were time-dependent in a social context, and to allow future ideas to be analyzed based on the language used, not the subject matter. In other words, I had to replace nouns and other parts of speech.
To do this, I ended up with the following preprocessor. First, I replaced specific '@' references with the phrase "_at_someone_", and did the same for links ("_link_") and #hashtags ("_hashtag_"). Then I removed all punctuation aside from the underscores. From there, I used part-of-speech tagging from nltk to sort through the tokens, keeping certain words and rejecting others based on their part of speech. I kept prepositions (IN), verbs of all sorts (any tag containing V), determiners (DT), coordinating conjunctions (CC), and all of my tagged Twitter placeholders above. The rest I replaced with their nltk part-of-speech tags. This removed the nouns and gendered pronouns (for example, anti-Obama and anti-Hillary Clinton speech would otherwise be treated differently because of 'him' vs. 'her', but now reads the same), and even adverbs and the like, while keeping verbs, words such as 'all' or 'some', etc. Tagging the tweets took over 3 hours.
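A minimal sketch of that preprocessor, assuming nltk's default tokenizer and Penn Treebank tagger (the actual implementation may differ in details):

```python
import re
import string
import nltk

# Requires: nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')

KEEP_TAGS = {"IN", "DT", "CC"}  # prepositions, determiners, coordinating conjunctions
PLACEHOLDERS = {"_at_someone_", "_link_", "_hashtag_"}

def preprocess(tweet: str) -> str:
    # Replace @mentions, links, and #hashtags with placeholder tokens.
    tweet = re.sub(r"https?://\S+|www\.\S+", "_link_", tweet)
    tweet = re.sub(r"@\w+", "_at_someone_", tweet)
    tweet = re.sub(r"#\w+", "_hashtag_", tweet)

    # Strip all punctuation except underscores, so the placeholders survive.
    punct = string.punctuation.replace("_", "")
    tweet = tweet.translate(str.maketrans("", "", punct))

    out = []
    for word, tag in nltk.pos_tag(nltk.word_tokenize(tweet)):
        if word in PLACEHOLDERS:
            out.append(word)           # keep the tagged Twitter placeholders
        elif tag in KEEP_TAGS or "V" in tag:
            out.append(word.lower())   # keep prepositions, determiners, conjunctions, verbs
        else:
            out.append(tag)            # replace everything else with its POS tag
    return " ".join(out)
```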
I created a tf-idf vectorizer based on ngrams in the range of 2-5, with a minimum of 500 document occurrences and a maximum of 70% of documents (as recommended to eliminate stop-words). From there, the results were scaled using l1-normalization on each vector. See here for details.
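Roughly, in scikit-learn terms, with the vectorizer's built-in `norm='l1'` option standing in for the separate scaling step (a sketch, not the exact code):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    ngram_range=(2, 5),  # 2- to 5-grams
    min_df=500,          # phrase must appear in at least 500 documents
    max_df=0.70,         # drop phrases appearing in more than 70% of documents
    norm="l1",           # l1-normalize each tweet's vector
)

# preprocessed_tweets is assumed to be the output of the preprocessor above.
X = vectorizer.fit_transform(preprocessed_tweets)
```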
Using a hold-out set, a training set, and a test set, I ran Logistic Regression with lasso (L1) regularization in order to eliminate the phrases that didn't contribute to determining whether a tweet was Russian or not. The training data got a score of 0.896, the test set 0.894, and the hold-out set 0.91.
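A sketch of that setup in scikit-learn, with assumed split sizes and an assumed label vector `y` (1 = Russian, 0 = not):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Carve off a final hold-out set, then split the remainder into train/test.
# The 80/20 and 75/25 proportions are assumptions for illustration.
X_rest, X_holdout, y_rest, y_holdout = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)

# The L1 (lasso) penalty drives coefficients of uninformative phrases to zero.
clf = LogisticRegression(penalty="l1", solver="liblinear")
clf.fit(X_train, y_train)

print("train:   ", clf.score(X_train, y_train))
print("test:    ", clf.score(X_test, y_test))
print("hold-out:", clf.score(X_holdout, y_holdout))
```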