NLP semi-supervised learning project classifying climate change tweets as either climate believers or deniers

JoeBrowz/climate-prop

Climate Prop

Overview

Twitter: not the final frontier, but one of the loudest. Twitter is a platform that allows users to share thoughts, ideas, resources, and news. With over 300 million active users and billions of queries per day, Twitter is a trove of usable, immediate, actionable data on what people feel strongly about. This project uses that data to understand the public conversation on climate change.

Task Definition

For an environmental advocacy group, real-time information on what people are saying about the climate crisis can be an invaluable tool for making the case to lawmakers, corporations, and individuals that the will of the people is to take action on climate change. A pipeline that gathers tweets on climate change, classifies them as believer or denier, and then runs sentiment analysis on each group gives an organization that data immediately.

Data

For this project, the classifier was trained on a portion of the GW Libraries' "Climate Change Tweets Ids" dataset, which provides the tweet IDs of nearly 40 million tweets on climate change. The dataset was hydrated using Twarc and the Twitter Developer APIs, split into files of manageable size, then imported in a for loop and concatenated back together using the Pandas package. The target variable was created by coercing embedded hashtag data from the API response.
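As a rough illustration of deriving a target from hashtags, the sketch below labels a hydrated tweet from its `entities.hashtags` field, which is where the Twitter v1.1 tweet object stores hashtags. The function name `label_from_hashtags` and both hashtag sets are hypothetical placeholders, not the lists this project actually used.

```python
from typing import Optional

# Hypothetical hashtag sets, for illustration only; the project's real lists are not given here.
BELIEVER_TAGS = {"climateaction", "actonclimate", "climatecrisis"}
DENIER_TAGS = {"climatehoax", "climatescam"}


def label_from_hashtags(tweet: dict) -> Optional[int]:
    """Return 1 (believer), 0 (denier), or None if hashtags are missing or ambiguous."""
    hashtags = tweet.get("entities", {}).get("hashtags", [])
    tags = {h["text"].lower() for h in hashtags}
    is_believer = bool(tags & BELIEVER_TAGS)
    is_denier = bool(tags & DENIER_TAGS)
    # Tweets matching neither set, or both sets, are left unlabeled.
    if is_believer == is_denier:
        return None
    return 1 if is_believer else 0


example = {"entities": {"hashtags": [{"text": "ClimateHoax"}]}}
print(label_from_hashtags(example))
```

Leaving ambiguous tweets unlabeled is one possible design choice; it keeps noisy examples out of the training set at the cost of discarding data.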

Data was cleaned using regular expressions and the nltk package. Guided by domain knowledge and findings from exploratory data analysis, stop words were determined, and tweets were lemmatized and prepped for modeling.
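The regular-expression part of a cleaning step like this might look roughly as follows. This is a minimal sketch using only the standard library: the patterns, the `clean_tweet` name, and the tiny stop-word set are assumptions for illustration, and lemmatization (done with nltk in the project) is deliberately left out.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")

# Hypothetical stop words; the project chose its own list from domain knowledge and EDA.
STOP_WORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "rt", "amp"}


def clean_tweet(text: str) -> list[str]:
    """Lowercase a tweet, strip URLs, mentions, and punctuation, then drop stop words."""
    text = text.lower()
    text = URL_RE.sub(" ", text)         # remove links
    text = MENTION_RE.sub(" ", text)     # remove @mentions
    text = NON_LETTER_RE.sub(" ", text)  # keep only letters and whitespace
    return [tok for tok in text.split() if tok not in STOP_WORDS]


print(clean_tweet("RT @user: Climate change is REAL! https://t.co/abc #ActOnClimate &amp; more"))
# -> ['climate', 'change', 'real', 'actonclimate', 'more']
```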

Model

Several GridSearchCV models were trained using TF-IDF vectorization with Logistic Regression, Gradient Boosted Trees (based on the LightGBM technique), and Random Forest classifiers. Accuracy and F1 scores were both quite high on the training and testing data, meaning both precision and recall are high. When tested on data from a different dataset, with target variables classified using a different method, the model performed acceptably, though it didn't predict negative cases very well.
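A grid search over a TF-IDF plus Logistic Regression pipeline can be sketched with scikit-learn as below. The parameter grid, the `build_search` helper, and the toy tweets are all assumptions made for this example; the grids and data used in the modeling notebook are not reproduced here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


def build_search(cv: int = 3) -> GridSearchCV:
    """Build a grid search over a TF-IDF + Logistic Regression pipeline (illustrative grid)."""
    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    # Hypothetical search space; step names prefix the parameters they tune.
    param_grid = {
        "tfidf__ngram_range": [(1, 1), (1, 2)],
        "clf__C": [0.1, 1.0, 10.0],
    }
    return GridSearchCV(pipeline, param_grid, scoring="f1", cv=cv)


# Made-up toy tweets: 1 = believer, 0 = denier.
TOY_TEXTS = [
    "climate action now",
    "we must act on the climate crisis",
    "renewable energy protects the planet",
    "science shows warming is real",
    "climate change is a hoax",
    "global warming scam exposed",
    "the climate hoax is a tax scam",
    "warming alarmists are lying",
]
TOY_LABELS = [1, 1, 1, 1, 0, 0, 0, 0]

search = build_search(cv=2)
search.fit(TOY_TEXTS, TOY_LABELS)
print(search.best_params_)
```

Putting the vectorizer inside the pipeline means it is refit on each training fold, so vocabulary from the held-out fold does not leak into cross-validation scores.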

[Figures: lr1, lr2, lr1_val, lr2_val]

Conclusion

Using TF-IDF has shortcomings that can't be fully addressed with linear and ensemble models. The dramatic class imbalance makes it hard to discern what is holding the model back: the modeling techniques, the data itself, or the mathematical realities a class imbalance causes. Shortcomings aside, the model does have excellent metrics, achieving over 90% accuracy and F1 score on the best-parameter model for each classifier type.
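To make the "mathematical realities" of class imbalance concrete, the sketch below computes accuracy, F1, and negative-class recall from a confusion matrix. The counts are invented purely for illustration and are not results from this project; the example also assumes the negative class is the minority.

```python
def summarize(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Compute accuracy, positive-class F1, and negative-class recall from confusion counts."""
    total = tp + fn + tn + fp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "f1": 2 * precision * recall / (precision + recall),
        "negative_recall": tn / (tn + fp),
    }


# Invented counts: 950 positive tweets and 50 negative tweets.
metrics = summarize(tp=940, fn=10, tn=10, fp=40)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts, accuracy is 0.95 and F1 is about 0.974, yet only 10 of the 50 negative tweets are caught. Headline metrics dominated by the majority class can hide weak minority-class performance in exactly this way.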

Next Steps

A likely remedy for the shortcomings of the modeling methods used so far would be a recurrent neural network built with transfer learning. Better handling of class imbalance and the addition of contextual memory would likely improve performance on minority class cases. This, along with feature engineering using other tweet metadata, utilizing a larger share of the dataset, more robust sentiment analysis, and importing and analyzing CO2 emissions data in relation to the climate conversation on Twitter, is on the to-do list for this project's next steps.

Web App

A Streamlit web app was built that receives the text of a tweet and outputs its classification as either a climate believer's or a denier's tweet.

File Structure:

├── README.md                      <- the top-level README for reviewers of this project
├── data_wrangle.ipynb             <- data collection, organization, and concatenation
├── EDA_notebook.ipynb             <- data cleaning, EDA, light visualization
├── modeling_notebook.ipynb        <- notebook containing all elements of model
├── data                           <- dataset files not hosted on github due to size constraints 
├── Classifying-Climate-Tweets...  <- a pdf of the project presentation
├── streamlit.py                   <- Streamlit web app (Python) for local use
└── images                         <- generated from code for use in readme and presentation slides

For Inquiries:

Sources

https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/5QCCUU

https://www.kaggle.com/edqian/twitter-climate-change-sentiment-dataset
