# Reputation First
## -- Public Opinion Research Based on Media Articles

In this project, we designed and implemented ReputationFirst, a public opinion research system based on public media. Our goal is to build a useful system that efficiently collects, processes, and analyzes public opinion about companies as expressed in public media, and calculates these companies' reputations using NLP techniques.
<img src="https://raw.githubusercontent.com/ShengjieLuo/ReputationFirst/master/image/intro.jpg" width="500">

We will cover the following topics in this report:
- [1. Motivation](#1.-Motivation)
- [2. Contribution](#2.-Contribution)
- [3. Architecture](#3.-Architecture)
- [4. Implementation](#4.-Implementation)
- [5. Evaluation](#5.-Evaluation)
- [6. Conclusion](#6.-Conclusion)
- [Reference](#Reference)


# 1. Motivation

Many factors matter to a technology company, such as human resources, technology trends, management, and operations. A manager must study all of these factors and take them into consideration for the company to be profitable. Among them, one very important factor has rarely been researched: **public reporting and reputation**.

Why can public reporting affect a company so much? The reason is simple: with the development of communication tools and media technology, everyone is exposed to a large amount of public information from the Internet, newspapers, and social media such as Facebook and Twitter posts. Once an event is made public or reported, it can spread around the world within seconds. As people receive this information, they read and think about the news and form their own judgments, investment decisions, and emotional reactions toward the news and the companies involved.

The opinions of all readers add up to a general response to a piece of news, and this response affects the company. A well-known example is Facebook's admission that a data leak affected 87 million users, which widened its privacy scandal. When the event became public, people's trust in Facebook was badly hurt and many people deleted the Facebook application from their phones, which in turn caused Facebook's stock price to fall sharply. A positive example also involves Facebook: when it admitted its own fault and announced that it would cooperate with the British government to protect users' data from third-party companies, the move was reported positively, readers started to forgive and trust Facebook again, and its stock price started to rise.

<img src="https://raw.githubusercontent.com/ShengjieLuo/ReputationFirst/master/image/motivation.jpg" width="500">

From these examples we can see that while people did not know Facebook was leaking their information, nothing special happened, even though the leak was already taking place. People started to respond only when the event was made public and reported by the media. We therefore conclude that **the reputation of a company is closely related to the news about this company**. If the media consistently report good news about a company, it is unlikely that the company has a bad reputation, and vice versa.

Based on this assumption, we think it would be interesting to collect and analyze the news about every company and infer each company's reputation from that news. If we can do this automatically, a manager can simply **use the news as input, get the company's reputation today, and make decisions accordingly**. An investor can make investment decisions based on a company's reputation simply by using our tool to analyze the news about it. An ordinary user can still learn what others and the public media think about a company. Building a bridge from public news to company reputation is useful and exciting, and this idea motivates us to dive deeply into the problem.

# 2. Contribution

In this project, our contributions are:
1. We collect news from **multiple sources** and efficiently extract news about technology companies with a **distributed web crawler**.
2. We take advantage of Google Cloud Platform and **Spark** to speed up ETL and NLP data analysis.
3. We use the **TextRank** algorithm and NLTK **sentiment analysis** tools to calculate the reputation score of each technology company.

Since there is no ground truth for reputation, we evaluate our system against multiple factors, such as significant changes in stock price and major events or news.


# 3. Architecture
<img src=https://raw.githubusercontent.com/ShengjieLuo/ReputationFirst/master/image/arch.jpg width="500">

We use a distributed web spider to crawl articles from Reuters, the New York Times and the Wall Street Journal. We then use Spark to clean the data and apply the TextRank algorithm to extract a summary from the original text. In the next step, we use Python NLTK to perform sentiment analysis and extract the media attitude. Finally, these results are combined to predict the reputation score.

<img src=https://raw.githubusercontent.com/ShengjieLuo/ReputationFirst/master/image/analysis.jpg width="500">

One of the largest challenges is how to process 2 GB of data and 600 million tokens efficiently across 3 different tasks: the *ETL* task, the *textrank* task and the *sentiment analysis* task.

It is a typical **Iterative Machine Learning** workload, in which the intermediate result of each stage is used in the next stage. One of the best ways to accelerate such a workload is to use the **Spark+Yarn+CloudCluster** framework.

In our implementation, we use a Spark cluster with 8 cores to reduce the running time of the workload from 73.9 hours to 9.8 hours.
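
As a quick sanity check on these figures, the speedup and the per-core efficiency follow directly from the two running times. The snippet below is only arithmetic on the numbers quoted above; the variable names are ours.

```python
# Running times quoted in the text (hours)
sequential_hours = 73.9
spark_hours = 9.8
cores = 8

speedup = sequential_hours / spark_hours   # how many times faster the cluster is
efficiency = speedup / cores               # fraction of ideal linear scaling

print(f"speedup: {speedup:.2f}x, efficiency: {efficiency:.0%}")
```

This works out to a speedup of roughly 7.5x, close to linear scaling on 8 cores.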


# 4. Implementation

In this chapter, we introduce the implementation of this project.
## 4.1 Distributed Internet Crawler
In this project, we use a distributed web spider to crawl media articles from different sources. However, it is not easy to crawl historical data directly from media websites, for the following reasons.

+ The websites list not only article links but also advertisement links, recommendation links and so on.
+ The website content is customized based on each user's habits, so it varies between users for the same view.
+ These popular websites set up anti-crawler measures to avoid excessive network traffic.

Based on the observation above, we have to give up the attempts of fetching contents from the original websites.

However, it is important to note that some popular newspapers and magazines provide a public **archive database**, which usually contains the printable, readable version of each article. Compared with the website content, it has the following features:
+ It provides all published articles without per-user customization.
+ There are no, or far fewer, advertisements and other links.
+ The content can be fetched directly from the HTML page with Beautiful Soup.
+ Usually, the content is marked with clear timestamps.

The following are the archives of popular media outlets; we take the content from May 1 as an example. Note that usually only traditional media provide an archive as a summary of past articles, while Internet or web media do not provide a similar service.

**TypeA: Newspapers & Magazines**

*New York Times*
  + http://spiderbites.nytimes.com/2018/articles_2018_05_00000.html
  + The archive is official and updated daily. It can be crawled.

*Wall Street Journal*
  + http://www.wsj.com/public/page/archive-2018-5-01.html
  + The archive is official and updated daily. It can be crawled.

*Washington Post*
  + http://www.washingtonpost.com/wp-adv/archives/copyright.htm
  + The archive is not organized by timestamp; instead it is keyword- or topic-based.

**TypeB: News Agency**

*BBC*: 
  + http://dracos.co.uk/made/bbc-news-archive/2018/05/01/
  + Collected by a third-party organization, so it is less reliable. It also does not include all of the outlet's content.
  
*Reuters*
  + https://uk.reuters.com/resources/archive/uk/20180501.html
  + The archive is official and updated daily. It can be crawled.
  
*CNN*
  + CNN provides a clear, well-formatted archive for content before 2001; however, later material is confusingly organized and comes without a specification.
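
Because these archives are organized by date, the list of pages to crawl can be generated rather than written by hand. The sketch below builds one Reuters archive URL per day, assuming every day follows the same `YYYYMMDD.html` pattern as the May 1 example above; `reuters_archive_urls` is a hypothetical helper, not part of the project code.

```python
import datetime

# Pattern inferred from the single May 1 example; an assumption, not a documented API
REUTERS_TEMPLATE = "https://uk.reuters.com/resources/archive/uk/{day}.html"

def reuters_archive_urls(start, end):
    """Return one archive URL per day in [start, end], both inclusive."""
    urls = []
    day = start
    while day <= end:
        urls.append(REUTERS_TEMPLATE.format(day=day.strftime("%Y%m%d")))
        day += datetime.timedelta(days=1)
    return urls

urls = reuters_archive_urls(datetime.date(2018, 5, 1), datetime.date(2018, 5, 3))
print("\n".join(urls))
```

A list like this, one URL per line, is the kind of input the crawler reads from its `sourcelist` file.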


The newspaper crawler is shown below:

In [None]:
import requests
from bs4 import BeautifulSoup
import re
import json
import threading
from newspaper import Article

In [None]:
'''
NewsParser: The Newspaper Crawler Class
'''
class NewsParser(threading.Thread):

    def __init__(self,source):
        '''
        Init Function:
        1. Initialize as a Python thread
        2. Define the source used in the parser
        '''
        threading.Thread.__init__(self)
        self.source = source

    def _fetch_links(self,url):
        '''
        1. Fetch the url links from archive database
        2. Parse the database page to get the article links
        '''
        links = []
        r = requests.get(url)
        if r.status_code!=200:
            print(url+": unexpected response status "+str(r.status_code))
            return links
        soup = BeautifulSoup(r.text,"html.parser")
        for link in soup.find_all('a'):
            rawlink = link.get('href')
            # Anchors without an href return None, which re.match cannot handle
            if rawlink and re.match(r"(\S)+://www.nytimes.com/20(\d)+/(\d)+/(\d)+/(\S)+\.html",rawlink):
                links.append(rawlink)
        return links

    def _fetch_data(self,link):
        '''
        1. Get the articles and parse the articles by "newspaper" lib
        2. Add the related fields to the datum dict
        '''
        datum = {}
        article = Article(link)
        try:
            article.download()
        except Exception:
            print(link+": Cannot be downloaded!")
            return datum
        try:
            article.parse()
        except Exception:
            print(link+": Cannot be parsed!")
            return datum
        datum["authors"] = article.authors
        datum["date"]    = str(article.publish_date)
        datum["text"]    = str(article.text)
        datum["title"]   = str(article.title)
        return datum
    
    def parse(self,source):
        '''
        Main Function to use the crawling function
        '''
        links = self._fetch_links(source)
        count = 0
        # Append one JSON record per line; the with-block also closes the file on errors
        with open("data_"+source.split("/")[-1],"a") as f:
            for link in links:
                datum = self._fetch_data(link)
                if len(datum)==0:
                    continue
                f.write(json.dumps(datum)+'\n')
                count += 1
                print("Fetch news from " + source + " : "+str(count)+"/"+str(len(links)))

    def run(self):
        '''
        Provide the interactive interface as a python thread
        '''
        self.parse(self.source)
             

The program shown above is a single-threaded web crawler, which is too slow to crawl all newspaper articles. The following code extends it into a multi-threaded crawler.

In [None]:
def _get_targets():
    '''
    Read the "sourcelist" file to get the crawling resources    
    '''
    with open("sourcelist") as f:
        lines = f.readlines()
    result = []
    for line in lines:
        # Strip first so that blank lines (which still contain "\n") are skipped
        line = line.strip()
        if line:
            result.append(line)
    return result

def real_main():
    '''
    1. Initialize one thread per source
    2. Start all threads and wait for the worker threads to finish.
    '''
    sources = _get_targets()
    threads = []
    for source in sources:
        threads.append(NewsParser(source))
    for t in threads:
        t.start()
    for t in threads:
        t.join()

real_main()
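
Starting one thread per source is simple, but the thread count grows with the source list. One possible variation, not what the project uses, is to cap concurrency with a thread pool from the standard library. In the sketch below, `crawl` is a hypothetical stand-in for `NewsParser(source).parse(source)` so that the example runs without network access, and the second URL is made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def crawl(source):
    # Stand-in for the real crawling work; returns the output file name
    # the crawler above would write to.
    return "data_" + source.split("/")[-1]

def crawl_all(sources, max_workers=4):
    """Crawl every source using at most max_workers threads at a time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the input sources
        return list(pool.map(crawl, sources))

outputs = crawl_all([
    "http://spiderbites.nytimes.com/2018/articles_2018_05_00000.html",
    "http://spiderbites.nytimes.com/2018/articles_2018_05_00001.html",  # illustrative
])
print(outputs)
```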

## 4.2 Data Extract, Transform, Load (ETL)

In the next step, we use Spark to clean the raw dataset fetched by the crawler, following these steps:

*Step1.* Filter out all articles in which **the company name is not included**. Note that we match the company name with a case-insensitive Python regular expression.

*Step2.* For the remaining articles, we parse each article to get the **title**, **summary** and **tag** fields. The method used to extract the summary from the original text is described in section 4.2.2.

*Step3.* Filter out articles in which the company name is not included in the title, summary or tag fields.

It is important to note that:
+ Sometimes the company name has other meanings. For example, *"Adobe"* also describes a type of house, and it is used in that sense in some articles about earthquakes. Therefore, we use the summary to identify the relation between articles and company names.
+ Summary extraction is a time-consuming task with a complex algorithm. Hence, we use Step1 as a coarse filter to reduce the overall workload.
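
The case-insensitive match in Step1 can be sketched with Python's `re` module as follows. This is a minimal illustration, not the project's filter: `mentions_company` is a hypothetical helper, and the word-boundary anchors are our assumption to avoid matching inside longer words.

```python
import re

def mentions_company(text, name):
    """Return True if `name` occurs in `text` as a whole word, ignoring case."""
    pattern = r"\b" + re.escape(name) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

print(mentions_company("ADOBE shares rose after earnings.", "adobe"))
print(mentions_company("The adobe houses collapsed in the earthquake.", "adobe"))
print(mentions_company("Facebooking is not a word.", "facebook"))
```

The second call also matches even though the article is about houses, which is why the summary-based filter in Step3 is still needed.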

### 4.2.1 Coarse-Grained Data Filter

The coarse data filter removes articles that are not related to any company in the company list. First, we use NLTK to tokenize the text into a list of tokens. Then, we compare the tokens with the company list one by one. Finally, we add the tag names to the result struct.

In [None]:
import pyspark
import json
import nltk
from nltk import word_tokenize

def _init_line(line):
    name = line.lower().split()[0]
    return (name,line.lower().split())

def _init_list(sc):
    results = {}
    companyRDD = sc.textFile("gs://group688/companylist")
    coms = companyRDD.map(_init_line).collect()
    for com in coms:
        for name in com[1]:
            results[name] = com[0]
    return results

def _data_filter(lines,company,source):
    import nltk
    nltk.download('punkt',download_dir='./nltk_data')
    nltk.data.path.append("./nltk_data")
    results = []
    for datum in lines:
        data    = json.loads(datum)
        authors = data["authors"]
        date    = data["date"]
        text    = data["text"]
        title   = data["title"]
        tokens_text  = word_tokenize(text.lower())
        tokens_title = word_tokenize(title.lower())
        tags = []
        for word in text.lower().split():
            # Words starting with "#" are kept as tags
            if word[0]=="#":
                tags.append(word)
        #Stat is a dictionary, key is the company name, and value is the attribute
        #attributes: [in_title,title_count,total_count]
        stat  = {}
        for token in tokens_title:
            if token in company:
                if company[token] in stat:
                    stat[company[token]][0] = True
                    stat[company[token]][1] += 1
                else:
                    stat[company[token]] = [True,1,0]
        for token in tokens_text:
            if token in company:
                if company[token] in stat:
                    stat[company[token]][2] += 1
                else:
                    stat[company[token]] = [False,0,1]
        # Emit one record per company mentioned in this article
        for name in stat:
            result = {}
            if (source=="wsj"):
                result["date"]      = date[:5] + '0' + date[5:9]
            else:
                result["date"]      = date[:10]
            result["text"]        = text
            result["tokens"]      = tokens_text
            result["company"]     = name
            result["source"]      = source
            result["in_title"]    = stat[name][0]
            result["title_count"] = max(stat[name][1],title.lower().count(name))
            result["total_count"] = max(stat[name][2],text.lower().count(name))
            result["title"]       = title
            result["authors"]     = authors
            result["tags"]        = tags
            results.append((name,json.dumps(result)))
    return results

As noted above, even the coarse data filter is time-consuming. In our experiment, a single-threaded data filter running on a 300 MB raw dataset with 255k articles takes more than 2 hours to complete. To reduce the execution time on a GB-level dataset, we use the Spark+Yarn platform to execute the program in parallel on a 5-instance cluster.
The distributed cluster configuration is:
+ **1 Master Node**
  1. 4   CPUs
  2. 16  GB memory
  3. 100 GB storage
  
+ **4 Worker Nodes**
  1. 2  CPUs
  2. 12 GB memory
  3. 80 GB storage
 
The Spark program is shown below:

In [None]:
def real_main():
    sc = pyspark.SparkContext()
    company = _init_list(sc)
    dataRDD1 = sc.textFile("gs://group688/nytimes",5)
    dataRDD1 = dataRDD1.mapPartitions(lambda x:_data_filter(x,company,"nytimes"))
    dataRDD2 = sc.textFile("gs://group688/wsj",10)
    dataRDD2 = dataRDD2.mapPartitions(lambda x:_data_filter(x,company,"wsj"))
    dataRDD3 = sc.textFile("gs://group688/reuters.dat",10)
    dataRDD3 = dataRDD3.mapPartitions(lambda x:_data_filter(x,company,"reuters"))
    dataRDD  = dataRDD3.union(dataRDD2).union(dataRDD1)
    dataRDD.sortByKey().map(lambda x:x[1]).saveAsTextFile("gs://group688/688v1")

real_main()

To support Spark execution on a cloud platform, we use the **Google Cloud Dataproc** service to deploy the Spark cluster efficiently. The following script submits the Spark program to the Google Dataproc cluster.

In [None]:
gsutil rm -r gs://group688/688v1
gcloud dataproc jobs submit pyspark \
--cluster spark688 \
--region us-east1 \
etl.py

### 4.2.2 Generate Article Summary

#### TextRank

TextRank is a popular extractive text summarization algorithm. Why do we need to summarize the articles? The full text of a news article is usually too long for sentiment analysis, so we need to reduce its size and extract the useful text from it. We therefore take advantage of this text summarization algorithm.

TextRank is similar to PageRank. It treats sentences as the equivalent of web pages. The probability of going from sentence A to sentence B is given by the similarity of the 2 sentences, and the PageRank algorithm is then applied over this sentence graph. By keeping only the top-ranked sentences, we can decrease the size of the text significantly.

![TextRankModel](image/TextRank.png)
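
To make the idea concrete, here is a toy version of the ranking step in plain Python. It is a sketch under simplifying assumptions, not the implementation used in this project: similarity is a word-overlap measure normalized by sentence length, tokenization is a bare `split()`, and the function names, the damping factor of 0.85 and the iteration count are our own choices.

```python
import math

def similarity(s1, s2):
    """Word-overlap similarity between two tokenized sentences."""
    if not s1 or not s2:
        return 0.0
    denom = math.log(len(s1)) + math.log(len(s2))
    if denom <= 0:
        return 0.0  # both sentences have a single word
    return len(set(s1) & set(s2)) / denom

def textrank(sentences, damping=0.85, iterations=50):
    """Score sentences with weighted PageRank over the similarity graph."""
    tokens = [s.lower().split() for s in sentences]
    n = len(tokens)
    # weights[i][j] is the edge weight between sentence i and sentence j
    weights = [[similarity(tokens[i], tokens[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0)
            for i in range(n)
        ]
    return scores

sentences = [
    "Facebook admitted the data leak affected millions of users",
    "The data leak hurt user trust in Facebook",
    "The weather was sunny in London today",
]
scores = textrank(sentences)
best = max(range(len(sentences)), key=lambda i: scores[i])
print(sentences[best])
```

A summary is then formed by keeping the highest-scoring sentences; the unrelated weather sentence receives the lowest score here because it shares few words with the others.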

The following is the sequential version of the abstract and keyword extraction algorithm. It takes advantage of an open-source TextRank implementation.

In [None]:
import datetime
import json
import nltk
from summa import summarizer
from summa import keywords


def get_abstract_keywords(text):
    return summarizer.summarize(text), keywords.keywords(text, split=True)


def get_result_list(file, num):
    line_count = 0
    abandon_count = 0
    result_list = []
    with open(file, 'r') as raw_data:
        while line_count != num:
            line = raw_data.readline()
            if not line:
                # Stop early if the file has fewer than num lines
                break
            if line_count % 100 == 0:
                print(line_count, datetime.datetime.now())
            line_count += 1
            json_data = json.loads(line)
            abstract, keyword = get_abstract_keywords(json_data['text'])
            name = json_data['company']
            tags_words = list(map(lambda x: x[1:], json_data['tags']))
            abstract_words = list(
                map(lambda x: x.lower(),
                    nltk.tokenize.word_tokenize(abstract)))
            title_words = list(
                map(lambda x: x.lower(),
                    nltk.tokenize.word_tokenize(json_data['title'])))
            if abstract != '' and name not in abstract_words and name not in title_words and name not in tags_words:
                abandon_count += 1
                print(json_data['title'])
                print(abandon_count)
                continue
            json_data['abstract'] = abstract
            json_data['keywords'] = keyword
            result_list.append(json.dumps(json_data))
    return result_list

However, the sequential version of the algorithm is too slow, especially with so much raw text, so we port it to Spark and speed up the algorithm by 8 times.

In [None]:
from summa import summarizer
from summa import keywords
def generate_summary(text):
    abstract  = summarizer.summarize(text)
    return abstract

### 4.2.3 Fine-Grained Data Filter

The fine-grained data filter is based on the generated summary. We introduce it because some articles mention the company name only in minor details, while the main content has no relation to the company. In this case, the company name does not appear in the generated summary, so we can filter out the unrelated articles at a more fine-grained level.

In [None]:
import time
import json
import datetime
import pyspark

def get_result_list(lines):
    from summa import summarizer
    from summa import keywords
    import nltk
    nltk.download('punkt',download_dir='./nltk_data')
    nltk.data.path.append("./nltk_data")
    result_list = []
    for line in lines:
        json_data = json.loads(line)
        text      = json_data["text"]
        abstract  = summarizer.summarize(text)
        keyword   = keywords.keywords(text, split=True)
        name = json_data['company']
        tags_words = list(map(lambda x: x[1:], json_data['tags']))
        abstract_words = list(map(lambda x: x.lower(), nltk.tokenize.word_tokenize(abstract)))
        title_words = list(map(lambda x: x.lower(), nltk.tokenize.word_tokenize(json_data['title'])))
        if abstract != '' and name not in abstract_words and name not in title_words and name not in tags_words:
            continue
        json_data['abstract'] = abstract
        json_data['keywords'] = keyword
        result_list.append(json.dumps(json_data))
    return result_list

if __name__=="__main__":
    sc = pyspark.SparkContext()
    dataRDD = sc.textFile("gs://group688/688v2.dat",20)
    dataRDD.mapPartitions(get_result_list).saveAsTextFile("gs://group688/688v3")

After the ETL step in 4.2, we get a series of JSON structs, each of which represents a "company-article" relation. The JSON struct is defined as follows:

In [None]:
definition = {
  "company":"google",
  "in_title":True,
  "title_count":1,
  "total_count":8,
  "title": "is google a better engine than baidu",
  "text": "qwertyuiopsdfghjklertyuiopdfghjkl;ertyuiolp;dfghjklfvgbhnjmk",
  "tokens": ["is","google","better", ...],
  "date": "2018-05-01",
  "source": "nytimes",
  "tags":["#nytoday","#deletefacebook"],
  "authors":["John Williams","name2","name3"],
  "abstract":"google is better than baidu in most respects."
}

## 4.3 Get Stock Price

The last data collection step is to use the pandas-datareader interface to fetch and store the stock prices for a list of companies over a specific time interval.

In [None]:
import datetime
import time
from pandas_datareader import data, wb


def get_stock_pdf(ticker):
    start = datetime.datetime(2018, 1, 1)
    end = datetime.date.today()
    print(ticker)
    time.sleep(1)
    return ticker, data.DataReader(ticker, "iex", start, end)


def read_stock_data(file):
    with open(file, 'r') as f:
        fl = f.readlines()
        tickers = list(filter(lambda x: x, map(lambda x: x.split(',')[1].strip(), fl)))
        tickers_map = {i.split(',')[1].strip(): i.split(',')[0].split(' ') for i in fl}
        return tickers, tickers_map


def output_csv(pdfs):
    for pdf in pdfs:
        pdf[1].to_csv('data/'+pdf[0]+'.csv', sep=',', encoding='utf-8')
        

tickers, tickers_map = read_stock_data('ticker_list')
print(tickers)
pdfs = [get_stock_pdf(t) for t in tickers]
output_csv(pdfs)
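
Since the evaluation relies on significant stock price changes, one way to flag them in the downloaded prices is to compute day-over-day percentage changes. The sketch below uses made-up closing prices and an arbitrary 5% threshold; both are illustrative assumptions, as is the helper name `significant_moves`.

```python
def significant_moves(closes, threshold=0.05):
    """Return (date, change) pairs whose absolute daily change exceeds threshold.

    `closes` is a list of (date_string, closing_price) tuples in date order.
    """
    moves = []
    for (_, prev), (date, cur) in zip(closes, closes[1:]):
        change = (cur - prev) / prev
        if abs(change) > threshold:
            moves.append((date, change))
    return moves

# Made-up prices for illustration only
closes = [
    ("2018-03-15", 100.0),
    ("2018-03-16", 101.0),
    ("2018-03-19", 94.0),
    ("2018-03-20", 95.0),
]
for date, change in significant_moves(closes):
    print(date, f"{change:+.1%}")
```

Dates flagged this way can then be compared against the dates on which the sentiment scores shift.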

## 4.4  Sentiment Analysis

As described in the design part above, the overall idea is to first preprocess the raw dataset to produce an abstract for each article, or fall back to the original text. The problem then becomes how to use these texts to reflect companies' reputations. One straightforward idea is to use sentiment analysis. However, we met several practical problems.

Sentiment analysis here refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. Sentiment analysis is widely applied to voice of the customer materials such as reviews and survey responses, online and social media, and healthcare materials for applications that range from marketing to customer service to clinical medicine.

### Why sentiment analysis

Generally speaking, sentiment analysis aims to determine the attitude of a speaker, writer, or other subject with respect to some topic or the overall contextual polarity or emotional reaction to a document, interaction, or event. The attitude may be a judgment or evaluation (see appraisal theory), affective state (that is to say, the emotional state of the author or speaker), or the intended emotional communication (that is to say, the emotional effect intended by the author or interlocutor).

In our application, we use attitude as the main target of sentiment analysis, because we want to know the attitude that popular media generally hold toward a certain company, which helps to reflect its overall reputation.

### Details with attitude analysis
There are several different components when analyzing the attitude in each article. We need to determine: 1) The holder (source) of the attitude. 2) The aspect (target) of the attitude. 3) The detailed type of attitude, including different positive and negative words and the weighting between them. 4) The scope of a certain type of attitude.

### Bag of words --- Input of Sentiment Analysis
We use the simplest model for sentiment analysis, which simply takes the words adjacent to the company name as input. We also tried to use an EM model to refine this input model. However, one problem is that in each article, the words that appear in the bag are very sparse. In other words, it is hard to use posterior knowledge to refine this model.
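
The "adjacent words" input can be sketched as a fixed-size window around each occurrence of the company name. This is an illustration under our own assumptions (whitespace tokenization, a window of 3 tokens on each side, and the hypothetical helper name `context_bag`), not the exact preprocessing used in the project.

```python
def context_bag(tokens, name, window=3):
    """Collect the tokens within `window` positions of each occurrence of `name`."""
    bag = []
    for i, token in enumerate(tokens):
        if token == name:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            bag.extend(left + right)
    return bag

tokens = "regulators said facebook badly mishandled private user data last year".split()
print(context_bag(tokens, "facebook"))
# ['regulators', 'said', 'badly', 'mishandled', 'private']
```

With only a handful of words per mention, the bag for a single article stays small, which illustrates the sparsity problem described above.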

Here is the full implementation of our Spark program for sentiment analysis:

In [None]:
import time
import json
import datetime
import pyspark

def get_result_list(lines):
    import nltk
    from nltk.sentiment.vader import SentimentIntensityAnalyzer
    nltk.download('punkt',download_dir='./nltk_data')
    nltk.download('vader_lexicon',download_dir='./nltk_data')
    nltk.data.path.append("./nltk_data")
    sia = SentimentIntensityAnalyzer()
    result_list = []
    for line in lines:
        json_data = json.loads(line)
        
        #mode1: major-abstract
        if len(json_data["abstract"])>0:
            score = sia.polarity_scores(json_data["abstract"])
        else:
            score = sia.polarity_scores(json_data["text"])
        json_data["score_abstract"] = score
        
        #mode2: text-based
        score2 = sia.polarity_scores(json_data["text"])
        json_data["score_text"] = score2

        #mode3: sentence-based
        scores = []
        sents  = nltk.sent_tokenize(json_data["text"].lower())
        name   = json_data["company"]
        for sent in sents:
            if name in sent:
                scores.append(sia.polarity_scores(sent))
        try:
            pos_score = 0
            neg_score = 0
            neu_score = 0
            for score in scores:
                pos_score += score["pos"]
                neg_score += score["neg"]
                neu_score += score["neu"]
            pos_score /= len(scores)
            neg_score /= len(scores)
            neu_score /= len(scores)
            json_data["score_sentence"] = {"pos":pos_score,"neg":neg_score,"neu":neu_score}
        except ZeroDivisionError:
            # No sentence mentions the company: fall back to the text-based score
            json_data["score_sentence"] = score2

        result_list.append(json.dumps(json_data))
    return result_list

if __name__=="__main__":
    sc = pyspark.SparkContext()
    dataRDD = sc.textFile("gs://group688/688v3/*")
    dataRDD.mapPartitions(get_result_list).saveAsTextFile("gs://group688/688v4")

## 4.5 Coreference Resolution
Coreference resolution is the task of finding all expressions that refer to the same entity in a text. It is an important step for a lot of higher level NLP tasks that involve natural language understanding such as document summarization, question answering, and information extraction.

#### Why coreference resolution
When exploring the news articles we get from various sources, one interesting fact is that the company's name often appears only a few times, even if the company is the main subject of the article. Most other mentions are phrases such as 'it' or 'the company', or even the name of its CEO or chairman. As a result, if we use only the company name as the keyword without resolving these references, the number of candidate words is too small when we apply the bag-of-words algorithm to the sentiment analysis of an article.

#### Usage of coreference resolution
Here we use the Stanford CoreNLP toolkit to pre-process the coreferences. One problem with the toolkit is that it is written in Java and has only limited support for Python. So we set up a local NLP server, starting by downloading the toolkit with the following command:

```
wget http://nlp.stanford.edu/software/stanford-corenlp-full-2018-02-27.zip
```
Then install the Python library used to access the NLP server:
```
pip install stanfordcorenlp
```
Go to the root of the downloaded directory, and then run the following command to set up the local Stanford CoreNLP server (detailed configuration can be found here: https://stanfordnlp.github.io/CoreNLP/history.html):
```
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000
```
Finally, we can access the server either from a web browser at http://localhost:9000 or through the programming API, as follows:

In [None]:
from stanfordcorenlp import StanfordCoreNLP

nlp = StanfordCoreNLP("http://localhost", 9000)

sentence = 'Google is a good company'
print('Tokenize:', nlp.word_tokenize(sentence))
print('Part of Speech:', nlp.pos_tag(sentence))
print('Named Entities:', nlp.ner(sentence))
print('Constituency Parsing:', nlp.parse(sentence))
print('Dependency Parsing:', nlp.dependency_parse(sentence))

nlp.close() 

The code above only shows the basic operations supported. To do coreference resolution, we use the following pseudocode:

In [None]:
import json
from stanfordcorenlp import StanfordCoreNLP

nlp = StanfordCoreNLP("http://localhost", 9000)
# Input each whole article and collect all possible coreferences
with open("dataset", 'r') as f:
    for line in f:
        article = json.loads(line)
        # result is a list of lists: one list of mentions per entity
        result = nlp.coref(article["body"])
        for subject in result:
            # If the chain refers to the company name, put it into sentiment analysis.
            # judge_if_company_name is a placeholder and is not defined here.
            judge_if_company_name(subject)
nlp.close()
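
The `judge_if_company_name` placeholder above could be filled in along the following lines. The example assumes each coreference chain is a list of `(sentence_number, start, end, mention_text)` tuples, which is how we understand the `stanfordcorenlp` wrapper to return them; treat that shape, the helper name `company_sentences` and the hand-written sample chains as assumptions.

```python
def company_sentences(chains, name):
    """Return sorted sentence numbers of all mentions coreferent with `name`."""
    sentences = set()
    for chain in chains:
        mentions = [m[3].lower() for m in chain]
        # Keep the whole chain if any of its mentions contains the company name
        if any(name in mention for mention in mentions):
            sentences.update(m[0] for m in chain)
    return sorted(sentences)

# Hand-written chains for illustration; not real CoreNLP output
chains = [
    [(1, 1, 2, "Facebook"), (2, 1, 2, "It"), (4, 1, 3, "the company")],
    [(3, 2, 4, "British regulators"), (4, 5, 6, "they")],
]
print(company_sentences(chains, "facebook"))
# [1, 2, 4]
```

Sentences 2 and 4 never mention the name directly, yet they are now candidates for sentence-level sentiment scoring.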

## 4.6 Combined Prediction

After the 5 steps above, we have calculated the reputation score for each "company-article" relation. The final step is to combine the predictions and build the dataframe of prediction results. The following code shows the procedure for the final prediction.

In [None]:
import datetime
import json
import pandas as pd

company_dict = {}
def get_company_dataframes(file):
    line_count = 0
    abandon_count = 0
    result_list = []
    with open(file, 'r') as raw_data:
        while True:
            line = raw_data.readline()
            if not line:
                break
            if line_count % 1000 == 0:
                print(line_count, datetime.datetime.now())
            line_count += 1
            json_data = json.loads(line)
            name = json_data['company']
            d = {'neg_abstract': json_data['score_abstract']['neg'],\
                 'pos_abstract': json_data['score_abstract']['pos'],\
                 'neu_abstract': json_data['score_abstract']['neu'],\
                 'neg_text': json_data['score_text']['neg'],\
                 'pos_text': json_data['score_text']['pos'],\
                 'neu_text': json_data['score_text']['neu'],\
                 'neg_sentence': json_data['score_sentence']['neg'],\
                 'pos_sentence': json_data['score_sentence']['pos'],\
                 'neu_sentence': json_data['score_sentence']['neu'],\
                 'date': json_data['date'],\
                 'week': json_data['week']}
            if name not in company_dict:
                company_dict[name] = pd.DataFrame([d])
            else:
                df = pd.DataFrame([d])
                # DataFrame.append was removed in pandas 2.0; pd.concat works across versions
                company_dict[name] = pd.concat([company_dict[name], df], ignore_index=True)

The result is presented as follows:

In [None]:
get_company_dataframes('688v43.dat')

```
0 2018-05-08 23:11:38.375655
1000 2018-05-08 23:11:41.527542
2000 2018-05-08 23:11:44.186069
3000 2018-05-08 23:11:46.843435
4000 2018-05-08 23:11:49.595050
5000 2018-05-08 23:11:52.346185
6000 2018-05-08 23:11:54.849394
7000 2018-05-08 23:11:57.364961
8000 2018-05-08 23:11:59.890049
9000 2018-05-08 23:12:02.564109
10000 2018-05-08 23:12:05.154044
```
Let's take the company **facebook** as an example. In the following script, we fetch all articles related to Facebook from the dataframe and count them. In total, we get 2080 articles related to Facebook; we use this result in the next step to show the effect of our model.

In [None]:
pd_fb = company_dict['facebook']
print(len(pd_fb))

```
2080
```
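
As a rough illustration of how such a per-company frame could be summarized, the sketch below averages the sentence-level scores by week. The toy values are invented, and the integer `week` values are an assumption about how that column is encoded; only the column names come from the code above.

```python
import pandas as pd

# Toy rows with the same column names as the per-company frames; values are invented
toy = pd.DataFrame({
    "pos_sentence": [0.10, 0.20, 0.30, 0.40],
    "neg_sentence": [0.30, 0.10, 0.20, 0.20],
    "week": [17, 17, 18, 18],  # assumption: week is stored as a week number
})

# Mean score per week, plus the number of articles behind each mean
weekly = toy.groupby("week")[["pos_sentence", "neg_sentence"]].mean()
weekly["article_count"] = toy.groupby("week").size()
print(weekly)
```

Keeping the article count next to the mean makes it easier to tell a genuine shift from a week with only a handful of articles.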

# 5. Evaluation
After sentiment analysis, each article has three scores, which describe the probability that it is positive, negative, or neutral. To verify our idea and implementation, we draw several plots for each company to reflect its reputation.

The visualization program is shown below.

In [None]:
import matplotlib.pyplot as plt

pd_fb = company_dict['facebook']
print(len(pd_fb))
# Daily mean of every numeric score column
pd_tmp = pd_fb.groupby("date").mean(numeric_only=True).reset_index()
labels = ['neg_abstract',
          'pos_abstract',
          'neg_text',
          'pos_text',
          'neg_sentence',
          'pos_sentence']

# Daily minimum and maximum, used as the error-bar bounds
pd_min = pd_fb.groupby("date").min()
pd_max = pd_fb.groupby("date").max()
for label in labels:
    # Only the sentence-level positive score is plotted here
    if label != "pos_sentence":
        continue
    pd_min_tmp = pd_min[[label]].rename(columns={label: label + "min"})
    pd_tmp = pd_tmp.set_index("date").join(pd_min_tmp).reset_index()
    pd_max_tmp = pd_max[[label]].rename(columns={label: label + "max"})
    pd_tmp = pd_tmp.set_index("date").join(pd_max_tmp).reset_index()
    # Restrict the plot to rows 110 through 120 of the daily frame
    pd_tmp = pd_tmp.loc[110:120, :]
    # Distance from the mean down to the minimum and up to the maximum
    dev = [pd_tmp[label] - pd_tmp[label + "min"], pd_tmp[label + "max"] - pd_tmp[label]]
    www_plot = plt.subplot(121)
    plt.ylim(0, 0.3)
    plt.xticks(rotation=60)
    plt.errorbar(pd_tmp.date, pd_tmp[label], yerr=dev, fmt='k-', ecolor='gray', lw=1)
    plt.show()

print(pd_tmp.to_string())
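
The error bars in the program above are asymmetric: the lower bar runs from the daily mean down to the daily minimum, and the upper bar runs up to the daily maximum. A minimal sketch of that computation on invented toy data (the dates and scores below are not from our dataset):

```python
import pandas as pd

# Toy per-article scores; values are invented for illustration
toy = pd.DataFrame({
    "date": ["2018-04-27", "2018-04-27", "2018-04-28", "2018-04-28", "2018-04-28"],
    "pos_sentence": [0.10, 0.20, 0.10, 0.30, 0.20],
})

# One row per day with the mean, minimum, and maximum score
stats = toy.groupby("date")["pos_sentence"].agg(["mean", "min", "max"])

# Asymmetric error bars: distance below the mean, then distance above it
lower = stats["mean"] - stats["min"]
upper = stats["max"] - stats["mean"]
yerr = [lower.to_numpy(), upper.to_numpy()]
print(stats.assign(lower=lower, upper=upper))
```

A two-row `yerr` like this can be passed to `plt.errorbar(stats.index, stats["mean"], yerr=yerr)`, where the first row gives the lower extents and the second row the upper extents.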

The first figure shows the reputation score of Facebook over the most recent 120 days. The x-axis is the date and the y-axis is the index value. Figures 1.1, 1.2, and 1.3 each show the change of the positive reputation score.
<img src="https://github.com/ShengjieLuo/ReputationFirst/raw/master/image/facebook.jpeg" width="500">
The second figure shows the reputation score of Google over the most recent 120 days.
<img src="https://github.com/ShengjieLuo/ReputationFirst/raw/master/image/google.jpeg" width="500">
The third figure shows the reputation score of Amazon over the most recent 120 days.
<img src="https://github.com/ShengjieLuo/ReputationFirst/raw/master/image/amazon.jpeg" width="500">

## 5.2 Case Analysis
In this case study, we find an increase in Facebook's reputation on Apr. 28. The reason is that Facebook began cooperating with the UK government on data protection.

The following figure shows the reputation score of Facebook, in which the blue curve indicates the positive index and the orange curve indicates the negative index. The blue curve shows a clear peak of positive news in this period.
<img src="https://github.com/ShengjieLuo/ReputationFirst/raw/master/image/case1.jpg" width="500">
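
A peak like this one could also be located programmatically instead of by eye. The sketch below flags days whose positive score exceeds the mean by more than `k` standard deviations. The function name `find_peak_days`, the threshold `k=1.5`, and the daily values are all assumptions made for illustration, not figures from our dataset.

```python
from statistics import mean, stdev

def find_peak_days(daily_scores, k=1.5):
    """Return the dates whose score exceeds mean + k * stdev of all daily scores."""
    values = list(daily_scores.values())
    threshold = mean(values) + k * stdev(values)
    return [day for day, score in daily_scores.items() if score > threshold]

# Invented daily means of the positive score around the event
daily_pos = {
    "2018-04-24": 0.10,
    "2018-04-25": 0.11,
    "2018-04-26": 0.09,
    "2018-04-27": 0.10,
    "2018-04-28": 0.25,
    "2018-04-29": 0.12,
    "2018-04-30": 0.10,
}
peaks = find_peak_days(daily_pos)
print(peaks)
```

The flagged dates can then be used to select the matching articles for manual reading.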

To find the true reason for this increase in reputation, we dive into the raw text and find that most media agencies reported that Facebook had begun cooperating with the UK government on data protection to prevent another data leak.

For example, the following figure is a screenshot of a Reuters news story from Apr. 29 about the Facebook issue. It directly shows Facebook's willingness to protect user privacy in the future, which makes it a positive report for Facebook.
<img src="https://github.com/ShengjieLuo/ReputationFirst/raw/master/image/case2.jpg" width="500">

# 6. Conclusion

"Reputation First" is a public opinion research project based on media articles. Facebook has recently been going through its largest reputation crisis, leading to millions in losses. Today, public reputation matters for every company. That’s also why we want to dive into social media data and try to predict reputation from it.

We look forward to extending this project into a **realtime analysis tool** in the future. Thanks for your attention!

# Reference
+ [1] Newspaper3k: Article scraping & curation https://github.com/codelucas/newspaper
+ [2] Hamborg, Felix & Meuschke, Norman & Breitinger, Corinna & Gipp, Bela. (2017). news-please: A Generic News Crawler and Extractor. 
+ [3] Summa – Textrank https://github.com/summanlp/textrank
+ [4] Barrios, Federico & López, Federico & Argerich, Luis & Wachenchauzer, Rosa. (2016). Variations of the Similarity Function of TextRank for Automated Summarization. Proc. Argentine Symposium on Artificial Intelligence, ASAI.
+ [5] Mihalcea, Rada & Tarau, Paul. (2004). TextRank: Bringing Order into Text.
+ [6] Perkins, J. (2010). Text Classification for Sentiment Analysis – Naive Bayes Classifier. [online] StreamHacker. Available at: http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/ [Accessed 5 Apr. 2016].
+ [7] Perkins, J. (2010). Text Classification for Sentiment Analysis – Precision and Recall. [online] StreamHacker. Available at: http://streamhacker.com/2010/05/17/text-classification-sentiment-analysis-precision-recall/ [Accessed 5 Apr. 2016].
+ [8] Hirsch, Peter Buell. (2017). Counting the spoons: what really influences corporate reputation. Journal of Business Strategy, Vol. 38, Issue 6, pp. 54-58. https://doi.org/10.1108/JBS-09-2017-0131
+ [9] Dowling, Grahame. (2016). Defining and measuring corporate social reputations. Annals in Social Responsibility, Vol. 2, Issue 1, pp. 18-28. https://doi.org/10.1108/ASR-08-2016-0008
