## Implementation

As described in the design section above, the overall idea is to first preprocess the raw dataset to obtain each article's abstract or, failing that, its original text. The problem then becomes how to use these texts to reflect a company's reputation. One straightforward idea is sentiment analysis. However, we ran into several practical problems.

### Sentiment Analysis

Sentiment analysis here refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. Sentiment analysis is widely applied to voice of the customer materials such as reviews and survey responses, online and social media, and healthcare materials for applications that range from marketing to customer service to clinical medicine.

#### Why sentiment analysis

Generally speaking, sentiment analysis aims to determine the attitude of a speaker, writer, or other subject with respect to some topic or the overall contextual polarity or emotional reaction to a document, interaction, or event. The attitude may be a judgment or evaluation (see appraisal theory), affective state (that is to say, the emotional state of the author or speaker), or the intended emotional communication (that is to say, the emotional effect intended by the author or interlocutor).

In our application, attitude is the main target of sentiment analysis. We want to know the attitude that popular media in general hold toward a given company, which helps reflect its overall reputation.

#### Details of attitude analysis
There are several components to analyzing the attitude in each article. We need to determine:

1. The holder (source) of the attitude.
2. The aspect (target) of the attitude.
3. The type of attitude, including the positive and negative words involved and the weighting between them.
4. The scope of each type of attitude.

#### Bag of words --- Input of Sentiment Analysis
We use the simplest model for sentiment analysis: the input is just the words adjacent to the company name. We also tried an EM model to refine this input. However, in each article the words that appear in the bag are very sparse, so it is hard to use posterior knowledge to refine the model.

Here is the full implementation of our Spark program for sentiment analysis:
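
To make the "adjacent words" input concrete, here is a minimal sketch of collecting a bag of words around each mention of a company. It is an illustration rather than the code used in the Spark job: the function name `context_bag`, the `window` parameter, and the regex tokenizer are all assumptions, and it only handles single-word company names.

```python
import re
from collections import Counter


def context_bag(text, company, window=3):
    """Collect the words within `window` tokens of each mention of `company`."""
    # Crude tokenizer (assumption): lowercase runs of letters, digits, apostrophes
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    target = company.lower()
    bag = Counter()
    for i, token in enumerate(tokens):
        if token != target:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        # Count every neighbour except the company name itself
        bag.update(t for t in tokens[lo:hi] if t != target)
    return bag


bag = context_bag(
    "Users say Google is a good company, but Google faces a lawsuit.",
    "Google",
    window=2,
)
print(bag.most_common(3))
```

With a small window, an opinion word such as "good" can fall just outside the bag, which illustrates why the bag for a single article tends to be sparse.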

In [None]:
import time
import json
import datetime
import pyspark

def get_result_list(lines):
    import nltk
    from nltk.sentiment.vader import SentimentIntensityAnalyzer
    nltk.download('punkt',download_dir='./nltk_data')
    nltk.download('vader_lexicon',download_dir='./nltk_data')
    nltk.data.path.append("./nltk_data")
    sia = SentimentIntensityAnalyzer()
    result_list = []
    for line in lines:
        json_data = json.loads(line)
        
        #mode1: major-abstract
        if len(json_data["abstract"])>0:
            score = sia.polarity_scores(json_data["abstract"])
        else:
            score = sia.polarity_scores(json_data["text"])
        json_data["score_abstract"] = score
        
        #mode2: text-based
        score2 = sia.polarity_scores(json_data["text"])
        json_data["score_text"] = score2

        #mode3: sentence-based
        # Score only the sentences that mention the company, then average them
        sents  = nltk.sent_tokenize(json_data["text"].lower())
        # Sentences are lowercased above, so the name is lowercased to match
        name   = json_data["company"].lower()
        scores = [sia.polarity_scores(sent) for sent in sents if name in sent]
        if scores:
            json_data["score_sentence"] = {
                key: sum(score[key] for score in scores) / len(scores)
                for key in ("pos", "neg", "neu")
            }
        else:
            # No sentence mentions the company: fall back to the whole-text score
            json_data["score_sentence"] = score2

        result_list.append(json.dumps(json_data))
    return result_list

if __name__=="__main__":
    sc = pyspark.SparkContext()
    dataRDD = sc.textFile("gs://group688/688v3/*")
    dataRDD.mapPartitions(get_result_list).saveAsTextFile("gs://group688/688v4")

### Coreference Resolution
Coreference resolution is the task of finding all expressions that refer to the same entity in a text. It is an important step for a lot of higher level NLP tasks that involve natural language understanding such as document summarization, question answering, and information extraction.

#### Why coreference resolution
While exploring news articles from various sources, we noticed that the company's name often appears only a few times, even when the company is the main subject of the article. Most other mentions are 'it', 'the company', or even its CEO or chairman. If we use only the company name as the keyword, without resolving these references, the bag-of-words approach has too few candidates for the sentiment analysis of an article.

#### Usage of coreference resolution
Here we use the Stanford CoreNLP toolkit to pre-process the coreferences. One problem with the toolkit is that it is written in Java and has only limited support for Python, so we set up a local NLP server. First, download the toolkit:

```
wget http://nlp.stanford.edu/software/stanford-corenlp-full-2018-02-27.zip
```
Then install the Python library used to access the NLP server:
```
pip install stanfordcorenlp
```
Unzip the archive, go to its root directory, and run the following command to start a local Stanford CoreNLP server (detailed configuration can be found here: https://stanfordnlp.github.io/CoreNLP/history.html):
```
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000
```
Finally, the server can be accessed either from a web browser at http://localhost:9000 or through the programming API, as follows:

In [2]:
from stanfordcorenlp import StanfordCoreNLP

nlp = StanfordCoreNLP("http://localhost", 9000)

sentence = 'Google is a good company'
print('Tokenize:', nlp.word_tokenize(sentence))
print('Part of Speech:', nlp.pos_tag(sentence))
print('Named Entities:', nlp.ner(sentence))
print('Constituency Parsing:', nlp.parse(sentence))
print('Dependency Parsing:', nlp.dependency_parse(sentence))

nlp.close() 


The code above shows only the basic operations supported. To do coreference resolution, we use the following sketch:

In [None]:
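import json
from stanfordcorenlp import StanfordCoreNLP

nlp = StanfordCoreNLP("http://localhost", 9000)

def judge_if_company_name(chain, company):
    # Placeholder check: a chain refers to the company if any mention contains its name.
    # Assumes each mention is a tuple whose last element is the mention text.
    return any(company.lower() in mention[-1].lower() for mention in chain)

company_chains = []
with open("dataset", "r") as f:
    for line in f:
        article = json.loads(line)
        # input the whole article and output all possible coreferences (a list of lists)
        result = nlp.coref(article["body"])
        # If a chain refers to the company name, keep it for sentiment analysis
        # (assumes each article also carries a "company" field)
        company_chains.append([chain for chain in result
                               if judge_if_company_name(chain, article["company"])])
nlp.close()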
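
Once the chains are available, one possible way to feed them into the sentiment analysis is to rewrite every mention in a company chain as the company name, so that sentence-level matching finds more candidates. The sketch below assumes each mention is a tuple `(sentence number, start, end, text)` with 1-based token indices and an exclusive end; that format, the function name `resolve_company_mentions`, and the pre-tokenized input are assumptions made for illustration, not part of the pipeline above.

```python
def resolve_company_mentions(sentences, chains, company):
    """Replace every mention in a chain that names `company` with the name itself.

    sentences: list of token lists, one per sentence.
    chains: list of chains; each mention is (sent_num, start, end, text), with
            1-based indices and an exclusive end (assumed format).
    """
    resolved = [list(tokens) for tokens in sentences]
    edits = []
    for chain in chains:
        # Keep only chains with at least one mention containing the company name
        if not any(company.lower() in text.lower() for _, _, _, text in chain):
            continue
        edits.extend((sent, start, end) for sent, start, end, _ in chain)
    # Apply edits right to left so earlier token indices stay valid
    for sent, start, end in sorted(edits, reverse=True):
        resolved[sent - 1][start - 1:end - 1] = [company]
    return [" ".join(tokens) for tokens in resolved]


sentences = [
    ["Google", "released", "a", "new", "phone", "."],
    ["The", "company", "said", "it", "is", "proud", "."],
]
chains = [[(1, 1, 2, "Google"), (2, 1, 3, "The company"), (2, 4, 5, "it")]]
resolved = resolve_company_mentions(sentences, chains, "Google")
print(resolved)
```

The substring test is deliberately crude: a chain for "Google's CEO" would also match, so a real filter would need to tell the company apart from people associated with it. Overlapping mentions are not handled either.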

## Evaluation
After sentiment analysis, each article has three scores, which describe how positive, negative, or neutral it is. To verify our idea and implementation, we draw several plots for each company to reflect its reputation.

![facebook.png](facebook.jpeg)
![google.png](google.jpeg)
![amazon.png](amazon.jpeg)

Although the overall performance is not very good, 
![facebook429.png](facebook429.png)

In [None]:
import matplotlib.pyplot as plt

# company_dict maps a company name to its DataFrame of per-article scores
pd_fb = company_dict['facebook']
print(len(pd_fb))
# Daily mean of every numeric score column
pd_tmp = pd_fb.groupby("date").mean(numeric_only=True).reset_index()
labels = ['neg_abstract',
          'pos_abstract',
          'neg_text',
          'pos_text',
          'neg_sentence',
          'pos_sentence']
# Daily minimum and maximum, used below for the error bars
pd_min = pd_fb.groupby("date").min()
pd_max = pd_fb.groupby("date").max()
for label in labels:
    # Only the sentence-based positive score is plotted here
    if label != "pos_sentence":
        continue
    pd_min_tmp = pd_min.loc[:, label].to_frame().rename(columns={label: label + "min"})
    pd_tmp = pd_tmp.set_index("date").join(pd_min_tmp).reset_index()
    pd_max_tmp = pd_max.loc[:, label].to_frame().rename(columns={label: label + "max"})
    pd_tmp = pd_tmp.set_index("date").join(pd_max_tmp).reset_index()
    # Zoom in on rows 110 to 120 of the daily series
    pd_tmp = pd_tmp.loc[110:120, :]
    # Asymmetric error bars: mean - min below the point, max - mean above it
    dev = [pd_tmp.loc[:, label] - pd_tmp.loc[:, label + "min"], pd_tmp.loc[:, label + "max"] - pd_tmp.loc[:, label]]
    www_plot = plt.subplot(121)
    plt.ylim(0, 0.3)
    plt.xticks(rotation=60)
    plt.errorbar(pd_tmp.date, pd_tmp.loc[:, label], yerr = dev, fmt='k-', ecolor='gray', lw=1)
    plt.show()

print(pd_tmp.to_string())
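
The plotting cell reads flat columns such as `pos_sentence`, whereas the Spark job writes nested dictionaries such as `score_sentence`. The step that connects the two is not shown here, so the following is only a sketch of how one output line might be flattened. The function name `flatten_scores` is hypothetical, and the presence of a `date` field in each record is an assumption based on the `groupby("date")` calls.

```python
import json


def flatten_scores(line):
    """Turn one JSON line from the scoring job into a flat row of columns."""
    record = json.loads(line)
    # "date" is assumed to be present in each record
    row = {"company": record["company"], "date": record.get("date")}
    for mode in ("abstract", "text", "sentence"):
        scores = record.get("score_" + mode, {})
        for polarity in ("pos", "neg", "neu"):
            # e.g. record["score_sentence"]["pos"] becomes row["pos_sentence"]
            row[polarity + "_" + mode] = scores.get(polarity)
    return row


line = json.dumps({
    "company": "facebook",
    "date": "2018-04-01",
    "score_abstract": {"pos": 0.1, "neg": 0.2, "neu": 0.7, "compound": -0.3},
    "score_text": {"pos": 0.15, "neg": 0.1, "neu": 0.75, "compound": 0.2},
    "score_sentence": {"pos": 0.3, "neg": 0.0, "neu": 0.7},
})
row = flatten_scores(line)
print(row["pos_sentence"], row["neg_abstract"])
```

Rows built this way could be collected into one pandas DataFrame per company, which would give a structure similar to the `company_dict` used in the plotting cell.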