NAACP MEDIA RESEARCH

CS506 Spark! project

SOME INSTRUCTIONS ON HOW TO RUN OUR CODE:

Note: to run our code successfully, install all the tools and packages we used with pip, and follow our project structure (or change the target directories in our code to valid paths of your own).
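Before running anything, a quick check like the one below may help confirm that the required packages are importable. It is only a sketch: Scrapy is the only package named in these instructions, so the `required` list is an assumption that you should extend to whatever the scripts actually import. The helper name `missing_packages` is hypothetical.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level package names that cannot be imported."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # Assumed list: only "scrapy" is mentioned in this README.
    required = ["scrapy"]
    missing = missing_packages(required)
    if missing:
        print("Install with: pip install " + " ".join(missing))
    else:
        print("All listed packages are importable.")
```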

Collect data

Run get_links.py first to produce three text files (bostonglobe.txt, wbur.txt and wgbh.txt), one for each of the three websites. Our Scrapy spider is globeSpider.py. To collect the data with Scrapy, open a terminal, go to the spiders folder, and run:

scrapy runspider globeSpider.py -o resultname.json

To get results for all three websites, change the filename on line 11 of our code and switch the XPath filter (line 29 for Boston Globe, line 30 for WGBH and WBUR). For more instructions on how to use Scrapy, see A primer on web scraping and the Wayback machine by John C. Merfeld. We thank John for his excellent work, which helped us a lot.

The whole data collection process takes a long time: days, or even weeks if there are server problems. We therefore split the three text files into smaller sub-files for this step. We stored our data in the raw data folder, and suggest you simply download it to see what we've collected.
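If you want to split the link files yourself, something along these lines could work. It is a minimal sketch, not our original script: it assumes each text file holds one link per line, and the function name `split_links`, the chunk size and the output naming scheme are all hypothetical.

```python
from pathlib import Path


def split_links(source, out_dir, chunk_size=500):
    """Split a text file of links (one per line) into numbered sub-files."""
    source = Path(source)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Drop blank lines so every chunk holds only real links.
    lines = source.read_text(encoding="utf-8").splitlines()
    links = [line.strip() for line in lines if line.strip()]

    written = []
    for start in range(0, len(links), chunk_size):
        index = start // chunk_size + 1
        part = out_dir / f"{source.stem}_{index:03d}.txt"
        chunk = links[start:start + chunk_size]
        part.write_text("\n".join(chunk) + "\n", encoding="utf-8")
        written.append(part)
    return written
```

Each sub-file can then be fed to the spider in turn, so a server problem only costs you one chunk rather than the whole run.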

Filter news about black people

Make sure you already have the classified data, either generated by running combine.py on the raw data or downloaded, then run keywords_filter_black.py. The results are written to the relevant data folder.
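The general idea of keyword filtering can be sketched as follows. This is an illustration under stated assumptions rather than the contents of keywords_filter_black.py: the `text` field name, the whole-word matching rule and the sample keywords are all hypothetical.

```python
import re


def build_pattern(keywords):
    """Compile one case-insensitive, whole-word pattern for all keywords."""
    cleaned = [k.strip() for k in keywords if k.strip()]
    if not cleaned:
        raise ValueError("at least one keyword is required")
    # Longest first, so multi-word keywords win over their prefixes.
    escaped = sorted((re.escape(k) for k in cleaned), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)


def filter_articles(articles, keywords, field="text"):
    """Keep articles whose `field` mentions at least one keyword."""
    pattern = build_pattern(keywords)
    return [a for a in articles if pattern.search(a.get(field, ""))]


# Hypothetical sample data, for illustration only.
sample = [
    {"text": "Residents of Roxbury met with city officials."},
    {"text": "Mild weather is expected this weekend."},
]
print(len(filter_articles(sample, ["Roxbury", "NAACP"])))
```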

Calculate coverage: make sure you already have the classified data, either generated by running combine.py on the raw data or downloaded, then run coverage.py. The result is printed to the screen. It is also recorded in sheet one of statistics_final.xlsx.
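As a worked illustration, coverage can be read as the share of all collected articles that pass the keyword filter. That definition is an assumption made for this sketch, and coverage.py may compute it differently. The outlet name and counts below are made up.

```python
def coverage(relevant_count, total_count):
    """Return relevant articles as a fraction of all articles."""
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    return relevant_count / total_count


# Made-up counts, for illustration only.
counts = {"example_outlet": (25, 1000)}
for outlet, (relevant, total) in counts.items():
    print(f"{outlet}: {coverage(relevant, total):.1%}")
```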

Sentiment analysis

Make sure you already have the classified data and the relevant data, either generated (by running combine.py on the raw data and keywords_filter_black.py on the classified data) or downloaded, then run sentiment_analysis_black.py and sentiment_analysis_all.py. The results are stored in sentiment_analysis_black.csv and sentiment_analysis_all.csv. They are also recorded in sheet one of statistics_final.xlsx.
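To show the shape of this step (score each article, then write one CSV row per article), here is a toy sketch. It does not reproduce our scripts: the sentiment library they use is not named in these instructions, so a tiny invented word list stands in for it, and the `title` and `text` field names and the CSV columns are assumptions.

```python
import csv
import io
import re

# Toy word lists, invented for this illustration only.
POSITIVE = {"good", "great", "success", "celebrate"}
NEGATIVE = {"bad", "crime", "violence", "fail"}


def polarity(text):
    """Score text in [-1, 1]: (positive hits - negative hits) / all hits."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return 0.0 if pos + neg == 0 else (pos - neg) / (pos + neg)


def write_scores(articles, out):
    """Write one (title, polarity) row per article to a file-like object."""
    writer = csv.writer(out)
    writer.writerow(["title", "polarity"])
    for article in articles:
        score = polarity(article["text"])
        writer.writerow([article["title"], f"{score:.2f}"])


# Hypothetical usage: write to an in-memory buffer instead of a real file.
buffer = io.StringIO()
write_scores([{"title": "Sample", "text": "A great success"}], buffer)
print(buffer.getvalue())
```

A real run would swap `polarity` for a proper sentiment model and pass an open file instead of the buffer.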

Look for popular topics

Make sure you already have the classified data, either generated by running combine.py on the raw data or downloaded, then run get_topics.py to get popular topics and see visualization results such as topic_crime_black.png and topic_crime_all.png. You can change the code to get all popular topics for the past five years. The results are stored in 5topics_black.txt and 5topics_all.txt, and also in sheet two of statistics_final.xlsx.
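One simple way to rank topics is to count how many articles mention each topic's keywords. The sketch below illustrates that idea only and is not the logic of get_topics.py: the topic names, keyword lists, `text` field and substring matching rule are all hypothetical.

```python
from collections import Counter


def top_topics(articles, topic_keywords, n=5):
    """Count articles per topic and return the n most common topics."""
    counts = Counter()
    for article in articles:
        text = article["text"].lower()
        for topic, keywords in topic_keywords.items():
            # Plain substring match: an article counts once per topic.
            if any(k.lower() in text for k in keywords):
                counts[topic] += 1
    return counts.most_common(n)


# Hypothetical topics and articles, for illustration only.
topics = {"crime": ["robbery", "arrest"], "education": ["school"]}
articles = [
    {"text": "Police report a robbery downtown"},
    {"text": "School budget vote scheduled"},
    {"text": "Arrest made after robbery"},
]
print(top_topics(articles, topics))
```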

FOR NON-TECHNICAL AUDIENCE

If you are a non-technical reader who just wants to see what we've achieved, please have a look at our project report and poster in the Report&Poster folder. They should give you a basic idea of what we've done.

AllenChenGH/NAACP_MEDIA_RESEARCH

Folders and files

Latest commit

History

Repository files navigation

NAACP MEDIA RESEARCH

SOME INSTRUCTIONS ON HOW TO RUN OUR CODE:

Collect data

Filter news about black people

Sentiment analysis

Look for popular topics

FOR NON-TECHNICAL AUDIANCE

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages