# COGS 108 - Final Project

# Permissions
[x]No - keep private

# Names

- Manit Soni
- Young Jun Kim
- Esau Estrada
- Cooper Beaman
- Arren De Manuel

# Group Members IDs

- A15576567
- A16062404
- A16056905
- A13589935
- A15697684

# Overview

Our project explores how different news media sources portrayed the candidates of the 2016 presidential election, namely Hillary Clinton and Donald Trump. We look at articles from three news outlets: CNN (moderate leaning), New York Times (left leaning), and Breitbart (right leaning). Using web scraping to gather our data, followed by sentiment analysis to quantify candidate portrayals, we present a preliminary analysis of potential trends in sentiment between outlets and candidates.

# Research Question

What do one- to two-word phrases tell us about how different news outlets covered Donald Trump and Hillary Clinton in the 2016 general election? Did news outlets predominantly describe each candidate positively or negatively? Were certain phrases more commonly used by news outlets when describing candidates?

## Background and Prior Work

There has been much discussion of the words politicians use to convey their ideas. Their words are meant to convince the public and are tailored to that effect. Many analyses have examined the keywords politicians use and the connotations those keywords carry for the subjects they discuss. Our objective is to apply this same methodology to the words news organizations themselves use to describe political figures, focusing on the presidential candidates of the 2016 election.

We are interested in this question because we often hear claims about fake news and media narratives and want to test whether those claims hold up. The importance of this topic lies in understanding the information we consume, especially as its accuracy is often taken for granted.

Prior work has already examined the headlines published during President Trump's term in office. An analysis on Towards Data Science found that 'trump' was the most used word by a wide margin [3]. CNN and MSNBC used many keywords associated with Trump's scandals, whereas FOX used more generic words such as 'man' and 'woman'. The New Yorker analyzed the language used by the political candidates during the 2016 election [2]. For example, the Democratic candidates tended to use more concrete language and the Republican candidates tended to use more descriptive language.


References:
- [1] https://blog.gdeltproject.org/announcing-the-television-news-ngram-datasets-tv-ngram/
- [2] https://www.newyorker.com/magazine/2016/04/11/examining-the-vocabulary-of-the-presidential-race
- [3] https://towardsdatascience.com/how-does-news-coverage-differ-between-media-outlets-20aa7be1c96a
- [4] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6549470/
- [5] https://www.newyorker.com/magazine/2018/10/01/how-russia-helped-to-swing-the-election-for-trump

# Hypothesis


### Explanation
Not all news outlets use words of equally negative sentiment in their articles. However, all of the outlets' articles and headlines analyzed by Michael Tauberg at Towards Data Science exhibited a negative sentiment bias on average [3]. Additionally, during presidential elections, candidates typically search for negative events from their opponents' pasts to reduce those opponents' chances of winning. Furthermore, most humans tend to prefer gossip over other forms of conversation [4]. Hence, news outlets are incentivized to report negative stories from candidates' campaigns, true or not, rather than more positive stories, in order to increase viewership and revenue more reliably. Thus, we predict that both Hillary Clinton and Donald Trump were portrayed negatively by most news outlets. However, given the abnormally high frequency of "Trump" in news articles, combined with his infamous personality [3,5], we predict that negative-sentiment descriptions of Trump will predominate relative to those of Hillary Clinton.

### Hypothesis
__We predict that all news networks we analyze will cover both 2016 presidential candidates with predominantly negative sentiment. However, we expect to see more negative-sentiment descriptions of Donald Trump than of Hillary Clinton.__

# Dataset(s)

### Dataset Name: News Sentiments Towards Presidential Candidates

#### Dataset Source:
We are collecting data, through web scraping, from online news articles.  The three news sources we will be scraping are:
* CNN
* Breitbart
* New York Times

#### Size of Dataset

From the three news sources, we are reading in: 
* 9272 articles from CNN
* 150  articles from Breitbart
* 1294 articles from New York Times

After setting up our data into a dataframe, as described in the next sections, our data will be organized in columns as follows:

| Column Name | Description                                            |
|-------------|--------------------------------------------------------|
| **News_Source** | The host of the article. Either CNN, Breitbart, or NYT |
| **Candidate**   | The name of the candidate. Either Trump or Hillary     |
| **neg**         | The negative sentiment score. Between 0 and 1          |
| **neu**         | The neutral sentiment score. Between 0 and 1           |
| **pos**         | The positive sentiment score. Between 0 and 1          |
| **compound**    | The combined sentiment score. Between -1 and 1         |

A condensed representation of our dataset appears at the end of the **Data Cleaning** section. After applying sentiment analysis to each sentence, our cleaned dataset has 137599 observations. However, comparing the sources directly would be unfair, since some news organizations had more articles and more references to Trump/Hillary than others. We therefore reduced the number of observations from CNN and NYT to match Breitbart through random sampling. This condensed the dataset to **6479** observations, split approximately evenly among the three news sources.
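The random downsampling described above can be sketched roughly as follows. This is a minimal illustration on made-up rows, not the code we actually ran: the `toy` dataframe, its values, and the choice to balance by `News_Source` only (rather than by source and candidate) are all assumptions.

```python
import pandas as pd

# Toy stand-in for the cleaned dataset; the counts and scores are made up.
toy = pd.DataFrame({
    "News_Source": ["CNN"] * 6 + ["NYT"] * 4 + ["Breitbart"] * 2,
    "compound": [0.1, -0.2, 0.3, 0.0, 0.5, -0.4, 0.2, -0.1, 0.6, 0.0, -0.3, 0.4],
})

# Size of the smallest group (Breitbart in this toy example).
n_min = toy["News_Source"].value_counts().min()

# Randomly draw n_min rows from every source; random_state makes the draw repeatable.
balanced = toy.groupby("News_Source").sample(n=n_min, random_state=0)

print(balanced["News_Source"].value_counts())
```

Fixing `random_state` keeps the sample reproducible, which matters when the later analysis is rerun on the condensed data.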

# Setup

First, we went through each of the news sources and scraped a list of all US presidential election articles from the year 2016. The scripts for this can be seen in our Scripts folder, named [News Source]ArticleFinder.cpp, and they are pretty straightforward.

Now that we have our data sources, it is time to:
* scrape the contents of those articles
* perform sentiment analysis on each sentence
* connect each sentence to a candidate
* write all that data to a CSV


This was done in the LinkToText.ipynb file in the Scripts folder. The web scraping takes hours, so we do NOT recommend rerunning it. The data has already been collected and saved in the Dataset.csv file in the Scripts folder.
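The per-sentence steps listed above can be sketched as follows. This is only an illustration, not the code in LinkToText.ipynb: the tiny word lists, `toy_sentiment`, `candidate_of`, and `rows_for_article` are hypothetical stand-ins, and the keyword rule for matching a sentence to a candidate is an assumption about one simple way to do it. The scraping step is skipped; the article text is hard-coded.

```python
import csv
import io
import re

# Hypothetical mini-lexicon standing in for a real sentiment analyzer.
POSITIVE = {"great", "win", "strong"}
NEGATIVE = {"scandal", "lies", "weak"}

def toy_sentiment(sentence):
    """Crude score: (+1 per positive word, -1 per negative word) / word count."""
    words = re.findall(r"[a-z']+", sentence.lower())
    if not words:
        return 0.0
    score = sum((w in POSITIVE) - (w in NEGATIVE) for w in words)
    return score / len(words)

def candidate_of(sentence):
    """Tag a sentence by keyword; None if it names neither or both candidates."""
    text = sentence.lower()
    has_trump = "trump" in text
    has_clinton = "clinton" in text or "hillary" in text
    if has_trump == has_clinton:
        return None
    return "Trump" if has_trump else "Hillary"

def rows_for_article(source, article_text):
    """Split an article into sentences; yield one row per candidate-tagged sentence."""
    for sentence in re.split(r"(?<=[.!?])\s+", article_text.strip()):
        candidate = candidate_of(sentence)
        if candidate is not None:
            yield [source, candidate, round(toy_sentiment(sentence), 3)]

article = "Trump faced a new scandal today. Clinton had a strong debate. Polls open Tuesday."
rows = list(rows_for_article("CNN", article))

# Write to an in-memory buffer here; a real run would open a file instead.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["News_Source", "Candidate", "Sentiment"])
writer.writerows(rows)
print(buffer.getvalue())
```

Sentences that mention neither candidate, or both, are skipped in this sketch so that every row maps to exactly one candidate.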

# Data Cleaning

Now that we have all of our data in a single CSV, we will parse it into a pandas dataframe and clean it. For cleaning, we will remove rows where neg, pos, neu, and compound are all 0, since this means that our sentiment analyzer was not able to work on that particular sentence.

In [18]:
import pandas as pd

# Read in our data set
df = pd.read_csv("Dataset.csv")

In [19]:
# Separate the sentiment into its own columns
df[['neg', 'neu', 'pos', 'compound']] = df.Sentiment.str.split(",", expand=True)

# Drop Sentiment, we don't need it anymore
df = df.drop(columns=['Sentiment'])

In [20]:
import re

# Matches a number with an optional leading minus sign, so that
# negative compound scores keep their sign
NUMBER_PATTERN = re.compile(r'-?\d+(?:\.\d+)?')

# Clean up the extraneous parts of the strings
def just_digits(label):
    return NUMBER_PATTERN.findall(label)[0]

# Convert each of those columns to float type
for col in ['neg', 'neu', 'pos', 'compound']:
    df[col] = df[col].apply(just_digits).astype(float)
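As a worked example of the extraction step, the sketch below runs a sign-aware number pattern over one raw value. The `raw` string is a hypothetical example of what a `Sentiment` entry might look like; the real strings in Dataset.csv may be formatted differently.

```python
import re

# Optional leading minus, digits, optional decimal part.
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")

# Hypothetical raw value; the real Sentiment strings may be formatted differently.
raw = "{'neg': 0.12, 'neu': 0.7, 'pos': 0.18, 'compound': -0.3182}"

# Splitting on commas gives one "label: value" piece per score.
pieces = raw.split(",")
scores = [float(NUMBER.findall(piece)[0]) for piece in pieces]
print(scores)  # [0.12, 0.7, 0.18, -0.3182]
```

Without the optional minus in the pattern, the last value would come out as 0.3182, and a negative sentence would be read as a positive one.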

In [22]:
# drop rows where ALL values are 0
df = df[(df[['neg', 'neu', 'pos', 'compound']] != 0).any(axis=1)]
df

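| Unnamed: 0 | News_Source | Candidate | neg   | neu   | pos   | compound |
|------------|-------------|-----------|-------|-------|-------|----------|
| 2          | CNN         | Trump     | 0.000 | 1.000 | 0.000 | 0.0000   |
| 3          | CNN         | Trump     | 0.000 | 0.813 | 0.187 | 0.3182   |
| 4          | CNN         | Trump     | 0.000 | 0.813 | 0.187 | 0.3182   |
| 5          | CNN         | Trump     | 0.081 | 0.827 | 0.091 | 0.0772   |
| 6          | CNN         | Trump     | 0.066 | 0.838 | 0.095 | 0.2500   |
| ...        | ...         | ...       | ...   | ...   | ...   | ...      |
| 139593     | Breitbart   | Hillary   | 0.065 | 0.873 | 0.062 | 0.0240   |
| 139594     | Breitbart   | Hillary   | 0.000 | 0.946 | 0.054 | 0.0387   |
| 139595     | Breitbart   | Hillary   | 0.000 | 0.946 | 0.054 | 0.0387   |
| 139596     | Breitbart   | Hillary   | 0.042 | 0.725 | 0.233 | 0.7430   |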


At this stage, we have a dataframe that contains columns for News_Source, Candidate, neg (negative sentiment score), neu (neutral sentiment score), pos (positive sentiment score), and compound (compound sentiment score). Using this, we will be able to manipulate and analyse the data further.

This data did not include any null or invalid values. A row would have been invalid if "pos", "neg", and "neu" were all 0. However, the sentiment analyser should not produce such a row, since it gives a neutral value of 1 when there is no trace of positive or negative sentiment. We did have to clean up the data by breaking the sentiment output apart into "pos", "neg", "neu", and "compound", so that we could graph these float values instead of the string sentiment output.

However, most of these sentences are simply neutral, meaning they were not analyzed as having much positive or negative sentiment. Because we read in every sentence, many of them are full of filler words, conjunctions, etc. that just convey information. So in the following section, we will use different thresholds on the neutral score (i.e. < 0.8 neu or < 0.5 neu) in order to look only at the more *emotionally* charged sentences.
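The neutral-score thresholds mentioned above amount to a simple row filter. The sketch below shows the idea on made-up rows; the `toy` dataframe and the helper name `charged` are illustrative assumptions, not part of our notebook.

```python
import pandas as pd

# Toy rows in the same column layout as the cleaned dataframe (values are made up).
toy = pd.DataFrame({
    "News_Source": ["CNN", "CNN", "NYT", "NYT", "Breitbart", "Breitbart"],
    "Candidate": ["Trump", "Hillary", "Trump", "Hillary", "Trump", "Hillary"],
    "neg": [0.30, 0.00, 0.05, 0.45, 0.00, 0.60],
    "neu": [0.60, 1.00, 0.90, 0.40, 0.75, 0.40],
    "pos": [0.10, 0.00, 0.05, 0.15, 0.25, 0.00],
})

def charged(df, max_neu):
    """Keep only sentences whose neutral score is strictly below max_neu."""
    return df[df["neu"] < max_neu]

print(len(charged(toy, 0.8)))  # 4
print(len(charged(toy, 0.5)))  # 2
```

A stricter threshold keeps fewer sentences, so there is a trade-off between how charged the remaining sentences are and how many observations are left per outlet and candidate.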

# Data Analysis & Results

# Ethics & Privacy

There should not be any privacy concerns with the data used, as it is taken from news articles and coverage that are already available to the public. The dataset may be biased if it does not draw on diverse sources or relies too heavily on a few, because news outlets can have political biases of their own. We control for this by recording the specific news organization as a variable, so that we have multiple perspectives on the coverage instead of a single combined media perspective. The data could also be used, unintentionally, to promote narratives about why the coverage is slanted the way it is. Our research is specifically meant to look for correlations in positive/negative coverage of candidates and does not prove any bias towards or against candidates.
 
One ethical concern is that news outlets may have certain ideas they want to promote. For example, a news source might have policies on what its hosts can talk about and how they should present information to the public. If we do not account for this, it can raise ethical issues, because it can skew our data with the biases of a particular news outlet.

# Conclusion & Discussion

# Team Contributions

- Manit and Arren wrote the main web scraping algorithms
- Young Jun, Esau, and Cooper analyzed the data, created questions, and made adjustments to the dataframe