
FakeNews

Fake vs Real News Binary Classification

Contents of this Repository

- Data Wrangling - removal of duplicates and non-English-language articles
- Data Preprocessing - TF-IDF vectorization, lemmatization
- Exploratory Data Analysis - fake news shows higher variability in readability scores
- Modeling - ~90% accuracy

The most interesting part of this project, in my opinion: this was (as far as I know) a novel and successful use of readability scores as features for classifying fake news.

Just a note: this repository is a portfolio display, so I currently upload the documents directly. Version-controlled notebooks and documentation can be found in the Springboard repository.

Removing duplicates was a difficult task for this project. Even vectorized fuzzy matching demanded more computing resources than were practical, so I came up with a method using the Python library textstat (which would ultimately produce features as well). This method was not perfect, but it was computationally efficient. Though I did not do it in this project, ideally any match this method caught, or at least some subset, should then be verified via fuzzy matching.
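As a rough illustration of that idea (not the exact code in this repo), articles can be fingerprinted with a handful of textstat scores and flagged as duplicate candidates when fingerprints collide; the column name `text`, the particular scores, and the rounding are assumptions.

```python
# Sketch only: fingerprint articles with a few textstat readability scores and
# flag rows whose fingerprints collide as duplicate candidates.
import pandas as pd
import textstat

def readability_fingerprint(text: str) -> tuple:
    """Bundle several readability scores into a hashable signature."""
    return (
        round(textstat.flesch_reading_ease(text), 1),
        round(textstat.flesch_kincaid_grade(text), 1),
        round(textstat.gunning_fog(text), 1),
        textstat.lexicon_count(text),
    )

def flag_duplicate_candidates(df: pd.DataFrame, text_col: str = "text") -> pd.DataFrame:
    """Return rows that share a readability fingerprint with at least one other row."""
    fingerprints = df[text_col].apply(readability_fingerprint)
    return df[fingerprints.duplicated(keep=False)]
```

Candidates surfaced by a cheap pass like this leave a much smaller set of pairs for fuzzy matching to confirm.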

Data preprocessing was standard for an NLP project: tokenization, lemmatization, and TF-IDF vectorization.
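A minimal sketch of that pipeline, assuming NLTK for tokenization/lemmatization and scikit-learn for TF-IDF; the notebook's actual parameters may differ.

```python
# Assumed preprocessing sketch: NLTK lemmatization feeding a TF-IDF vectorizer.
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download("punkt", quiet=True)
nltk.download("wordnet", quiet=True)

lemmatizer = WordNetLemmatizer()

def lemmatize(text: str) -> str:
    """Lowercase, tokenize, and lemmatize a document into a space-joined string."""
    return " ".join(lemmatizer.lemmatize(tok) for tok in word_tokenize(text.lower()))

# max_features is an assumed cap, not necessarily what the notebook uses.
vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
# X = vectorizer.fit_transform(df["text"].apply(lemmatize))
```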

EDA - The most interesting finding from EDA was that readability scores for fake news show much higher variability (a wider range of scores) than those for real news. I take this to indicate greater adherence to institutional style guides by major legitimate news outlets compared with fake news outlets.
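A quick sketch of the kind of comparison behind that finding; the column names `text` and `label` and the choice of Flesch Reading Ease are assumptions.

```python
# Sketch: per-class spread of a readability score (assumed columns and score).
import pandas as pd
import textstat

def readability_spread(df: pd.DataFrame, text_col: str = "text", label_col: str = "label") -> pd.DataFrame:
    """Standard deviation and range of Flesch Reading Ease scores per class."""
    scores = df[text_col].apply(textstat.flesch_reading_ease)
    return scores.groupby(df[label_col]).agg(["std", "min", "max"])
```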

Modeling - No baseline model was used, as there wasn't much one could do naively beyond a no-skill assumption (random guessing). Logistic Regression, Random Forest Classifier, and Support Vector Machine were compared.
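A hedged sketch of such a comparison using scikit-learn with cross-validated ROC AUC; the hyperparameters and evaluation setup here are assumptions, not the notebook's exact choices.

```python
# Sketch: compare the three candidate classifiers by cross-validated ROC AUC.
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def compare_models(X, y):
    """Return mean 5-fold ROC AUC for each candidate model."""
    models = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "random_forest": RandomForestClassifier(n_estimators=200, random_state=42),
        "svm": LinearSVC(),
    }
    return {
        name: cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
        for name, model in models.items()
    }
```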

Random Forest Classifier was a narrow winner over Logistic Regression on all relevant measures, particularly the ROC score. A voting classifier combining all three models was also tried, but it did not perform as well as Random Forest or Logistic Regression.
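For reference, a voting ensemble over the same three models might look like the sketch below; hard voting is assumed here (a plain SVM without probability estimates cannot vote softly), and the repo's actual configuration may differ.

```python
# Sketch of a voting ensemble over the three models (hard voting assumed).
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

voting_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=200, random_state=42)),
        ("svm", LinearSVC()),
    ],
    voting="hard",  # soft voting would need predict_proba from every estimator
)
# voting_clf.fit(X_train, y_train); voting_clf.score(X_test, y_test)
```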

Overall, the accuracy of the final model was ~90%, and the textstat readability features competed with the TF-IDF vectors for the most important model features.
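One way to see that, sketched under the assumption that the feature matrix stacks TF-IDF columns followed by readability columns, is to rank the fitted Random Forest's feature importances:

```python
# Sketch: rank Random Forest feature importances over combined TF-IDF +
# readability features (the column ordering is an assumption).
import numpy as np

def top_features(forest, tfidf_names, readability_names, k=20):
    """Return the k most important (name, importance) pairs from a fitted forest."""
    names = np.array(list(tfidf_names) + list(readability_names))
    order = np.argsort(forest.feature_importances_)[::-1][:k]
    return list(zip(names[order], forest.feature_importances_[order]))
```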
