Copyright Notice:

About

This is a Python program that scrapes data from a set of Reddit subreddits, then uses natural language processing and a Latent Dirichlet Allocation (LDA) model to find ten topics that summarize what people on those subreddits talk about. The output is presented as a word cloud representing the words that make up each topic.

Install/Use

Run the Python scripts in the following order:
panacirce_scraper.py - scrapes data from Reddit and writes it to the "scraped_data" folder
- Scrapes data from minTimeOfInterest to maxTimeOfInterest
- Scrapes data from assigned subreddits (default: subreddits = [['Chicago','flair:"Ask CHI"'],['Boston','flair:tourism'],['NYC',""],['LosAngeles',""],['Seattle',""]])
  - Second argument is a flair filter, though it is not currently used by the scraper
  - Submissions that are not "self" posts (text-only submissions, as opposed to links) and AutoModerator posts are filtered out
  - Hyperlinks are extracted but not yet used for anything.
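
The filtering described above can be sketched roughly as follows. This is a minimal illustration, not the code in panacirce_scraper.py: submissions are represented as plain dictionaries, the field names (is_self, author, created_utc, selftext) and the helpers keep_submission and extract_links are assumptions made for the example, and the time window values are made up.

```python
import re

# Hypothetical time window (Unix timestamps); the real script defines its own
# minTimeOfInterest and maxTimeOfInterest values.
minTimeOfInterest = 1_600_000_000
maxTimeOfInterest = 1_700_000_000

# Simple URL matcher for the example; stops at whitespace or a closing bracket.
URL_PATTERN = re.compile(r"https?://[^\s)\]]+")


def keep_submission(submission):
    """Return True for text-only, non-AutoModerator posts inside the time window."""
    if not submission.get("is_self", False):
        return False  # link posts are dropped
    if submission.get("author", "").lower() == "automoderator":
        return False  # automated moderator posts are dropped
    created = submission.get("created_utc", 0)
    return minTimeOfInterest <= created <= maxTimeOfInterest


def extract_links(text):
    """Collect hyperlinks found in a submission's self text."""
    return URL_PATTERN.findall(text)


# Made-up submissions standing in for scraped data.
submissions = [
    {"is_self": True, "author": "alice", "created_utc": 1_650_000_000,
     "selftext": "Best deep dish? See https://example.com/pizza for my list."},
    {"is_self": False, "author": "bob", "created_utc": 1_650_000_000,
     "selftext": ""},
    {"is_self": True, "author": "AutoModerator", "created_utc": 1_650_000_000,
     "selftext": "Weekly thread"},
]

kept = [s for s in submissions if keep_submission(s)]
links = [extract_links(s["selftext"]) for s in kept]
```

Only the first submission survives the filter here, and its single hyperlink is collected separately, mirroring the "extracted but not used" behavior.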
panacirce_cleaner.py - cleans the data created by panacirce_scraper.py
- Goes through each .csv file in scraped_data and consolidates each Reddit submission into one row of data containing
  - phrased self text of the submission
  - permalink to the submission
  - a column for each of the most common words
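
As a rough illustration of the consolidation step, the sketch below turns scraped rows into one cleaned row per submission. It is not the repository's cleaner: the CSV column names (permalink, selftext), the tiny stop-word list, the tokenize helper, and the choice of the top 3 words are all assumptions made for the example, and an in-memory string stands in for a file in scraped_data.

```python
import csv
import io
import re
from collections import Counter

# Stand-in for one file in scraped_data; the column names are assumptions.
scraped_csv = io.StringIO(
    "permalink,selftext\n"
    "/r/Chicago/a1,Where is the best pizza? Pizza near the loop please.\n"
    "/r/Boston/b2,Looking for a quiet park and good pizza.\n"
)

STOP_WORDS = {"a", "and", "for", "is", "the", "where"}  # tiny illustrative list


def tokenize(text):
    """Lowercase the text and keep alphabetic words that are not stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


rows = list(csv.DictReader(scraped_csv))
tokens_per_row = [tokenize(row["selftext"]) for row in rows]

# Pick the most common words across all submissions (top 3 here).
totals = Counter(w for tokens in tokens_per_row for w in tokens)
common_words = [w for w, _ in totals.most_common(3)]

# One output row per submission: cleaned text, permalink, one count column per word.
cleaned_rows = []
for row, tokens in zip(rows, tokens_per_row):
    counts = Counter(tokens)
    cleaned = {"text": " ".join(tokens), "permalink": row["permalink"]}
    cleaned.update({word: counts[word] for word in common_words})
    cleaned_rows.append(cleaned)
```

Each dictionary in cleaned_rows could then be written out with csv.DictWriter, giving one row per submission with a count column for every common word.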
panacirce_lda.py - uses LDA to create a word cloud from all cleaned_data and displays it
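
To give a feel for what the LDA step does, here is a small dependency-free sketch. It is not how panacirce_lda.py is implemented: it swaps in a plain-Python collapsed Gibbs sampler in place of whatever library the script uses, it prints the top words per topic instead of drawing a word cloud, and the function names (fit_lda, top_words), the hyperparameters, and the toy documents are all assumptions for illustration.

```python
import random


def fit_lda(docs, num_topics=10, iterations=100, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; return per-topic word counts."""
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})
    doc_topic = [[0] * num_topics for _ in docs]   # tokens per (document, topic)
    topic_word = [{} for _ in range(num_topics)]   # tokens per (topic, word)
    topic_total = [0] * num_topics                 # tokens per topic
    assignments = []                               # topic of every token

    def update(d, word, topic, delta):
        """Add or remove one token from all three count tables."""
        doc_topic[d][topic] += delta
        topic_word[topic][word] = topic_word[topic].get(word, 0) + delta
        topic_total[topic] += delta

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        topics = [rng.randrange(num_topics) for _ in doc]
        assignments.append(topics)
        for word, topic in zip(doc, topics):
            update(d, word, topic, +1)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                update(d, word, assignments[d][i], -1)  # take the token out
                # Unnormalized probability of each topic for this token.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k].get(word, 0) + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                new_topic = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = new_topic
                update(d, word, new_topic, +1)          # put it back
    return topic_word


def top_words(topic_counts, n=3):
    """Return up to n of the most frequent words currently assigned to a topic."""
    ranked = sorted(topic_counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, count in ranked[:n] if count > 0]


# Toy "cleaned" documents, already tokenized.
docs = [
    "pizza burger restaurant pizza".split(),
    "train bus subway train".split(),
    "burger restaurant pizza".split(),
    "bus subway train station".split(),
]
topics = fit_lda(docs, num_topics=2)
for index, counts in enumerate(topics):
    print(index, top_words(counts))
```

The per-topic word counts returned here are the kind of weights a word cloud would be drawn from: the more often a word is assigned to a topic, the larger it would appear in that topic's cloud.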

Future

panacirce_model_builder
- uses cleaned_data that has been manually labeled with a topic to build a predictive model
  - To create training data:
    - Set panacirce_cleaner.py :: mode to "1", which will output the number of occurrences of each most common word
    - e.g. "these are most common words" -> "these 5 are 3 most 2 common 1 words 1"
    - Replace the leftmost column with 1 or 0, indicating whether or not the list of words matches the topic you want to detect.
    - e.g. if you see "restaurant" or "burger", a 1 would mark this list of words as indicating the topic of food
    - Place the resulting file in the "training_data" folder and run panacirce_model_builder.py
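
The training-data format above might look like this in code. This is only a sketch under assumptions: occurrences_line reproduces the "word count word count" layout from the example, while label_row uses a hypothetical keyword set as a stand-in for the manual 1/0 labeling the steps describe. Neither function name comes from the repository.

```python
from collections import Counter


def occurrences_line(words):
    """Format words as 'word count' pairs, most frequent first."""
    counts = Counter(words)
    return " ".join(f"{word} {count}" for word, count in counts.most_common())


def label_row(line, topic_keywords):
    """Return (label, line): 1 if any topic keyword appears in the line, else 0.

    This keyword check only stands in for the manual judgment described above.
    """
    present = set(line.split()[::2])  # every second token is a word, not a count
    return (1 if present & topic_keywords else 0, line)


food_keywords = {"restaurant", "burger"}  # hypothetical keywords for a "food" topic

line = occurrences_line("burger burger restaurant fries burger restaurant".split())
row = label_row(line, food_keywords)
```

Here the label ends up in the leftmost position of the row, followed by the word and count pairs, which matches the layout the training step expects.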
