# Introduction


This report gives a brief literature review of the NLP task, explains in detail the steps taken to apply the NLP techniques, describes the additional data pre-processing performed after cleaning the data collected by the web crawler, and discusses the results.


# Literature Review

A review of the literature shows that sentiment analysis remains a challenging task: despite the large body of research on it, open problems remain, such as “Slang words, New accents, grammatical and spelling mistakes” [1]. The primary objective of this task is to perform sentiment analysis on the news articles scraped from the website, assigning each article a polarity value that reflects the opinion expressed in the text of its title. Understanding the user's requirements in this way lets us deliver balanced rather than biased recommendations. During the recent pandemic, news websites have published differing views and opinions on outbreak-related events, which makes such a dataset well suited to opinion mining [2]. Our web scraper was accordingly set up to collect 100 articles related to the COVID-19 pandemic over a set period of time. “Research studies and practical applications in the field of SA have escalated in the past decade with the transformation and expansion of Web from passive provider of content to an active socially‐aware distributor of collective intelligence” [3]. Applying this NLP technique lets us assign each title to one of three categories: positive, neutral or negative.

Document similarity is a crucial component of many text analysis tasks, such as document classification, clustering and information retrieval. One of the key strategies adopted in recent IT services to achieve their goals is an asset-based approach to service delivery [4]. However, the similarity measures currently applied in industry only distinguish between similar and dissimilar documents, which is a simple classification at best. More advanced similarity measures are needed for scenarios in which an article covers several topics involved in a project, or relates to the information in another article. “Accurate assessment of the topical similarity between documents is fundamental to many automatic text analysis applications, including information retrieval, document classification, and document clustering. Choosing a good similarity measure is no less important than choosing a good document representation (Hartigan, 1975)” [5]. The most commonly used measures are the cosine and Jaccard metrics.
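
To make the two metrics concrete, the following is a minimal sketch of both on tokenized text, using only the Python standard library. The function names and the two sample sentences are illustrative assumptions and are not part of the notebook, which uses scikit-learn's `cosine_similarity` on TF-IDF vectors instead.

```python
import math
from collections import Counter


def jaccard_similarity(tokens_a, tokens_b):
    """Size of the shared vocabulary divided by the size of the combined vocabulary."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def cosine_similarity_counts(tokens_a, tokens_b):
    """Cosine of the angle between the two term-count vectors."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(counts_a[term] * counts_b[term] for term in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up example sentences, already split into tokens
doc_a = "vaccine trial data shows vaccine works".split()
doc_b = "trial data shows strong results".split()

print(jaccard_similarity(doc_a, doc_b))        # 3 shared terms out of 7 distinct terms
print(cosine_similarity_counts(doc_a, doc_b))  # 3 / sqrt(8 * 5)
```

Jaccard only looks at which terms occur, while cosine also takes term counts into account, which is why the repeated word in the first sentence changes the cosine score but not the Jaccard score.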

These measures treat the words in a document as if they were independent of each other, which is not a good assumption. To carry grammatical meaning and convey information, words are always related to each other, forming meaningful sentences and structures that develop ideas. As news websites compete with one another for views and clicks on their articles, it is important for an organization to stay ahead by innovating and attracting new users. A recommendation system helps users find the items they are looking for, and content-based recommendation is one of several methods used to build such a system. Of the NLP techniques mentioned above, we use sentiment analysis and document similarity to develop a content-based balanced recommender.
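
The independence assumption can be illustrated with a small sketch: two sentences with opposite meanings but the same words produce identical bags of words, so any measure computed on word counts alone cannot tell them apart. The helper name and the sentences are made up for illustration.

```python
from collections import Counter


def same_bag_of_words(text_a, text_b):
    """True when both texts contain exactly the same words with the same counts."""
    return Counter(text_a.lower().split()) == Counter(text_b.lower().split())


# Made-up sentences: same words, different word order and meaning
sentence_a = "vaccine protects people from virus"
sentence_b = "virus protects people from vaccine"

print(same_bag_of_words(sentence_a, sentence_b))  # the word order is lost
```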


# Rationale for the NLP Task

Sentiment analysis is used to process large volumes of text automatically in order to separate out and extract relevant knowledge. It was chosen because sentiment analysis techniques capture opinions from written text quite effectively when the text is syntactically correct and the language is explicit. The news articles published on these websites have passed through editorial processes, so there is little chance of having to deal with informal language. In addition, ongoing research suggests that supplying additional information to the process improves the accuracy and reliability of the polarity values.

We supply this additional information by taking the descriptions of the articles and computing cosine similarities from them. With this we are able to develop a prototype balanced recommendation system that uses both NLP techniques.


# Data Pre-Processing


Data pre-processing is essential if we are to get a good recommendation system. It removes noise from the dataset and reshapes its content to meet the requirements of the model. The steps taken here are as follows:

“na” removal: with numerical data, missing values can generally be filled by mean substitution or similar techniques, since eliminating instances from the dataset is a last resort. In our dataset, however, the descriptions of a couple of articles were missing, and rerunning the web crawler did not recover them. Hence the rows containing na were removed. All of the text in the descriptions is also converted to lower case, so that differences in letter case do not affect the system.

“Stopword removal” is a standard step taken before processing natural language. Stopwords are articles, prepositions, pronouns, conjunctions and other words that do not add much information to the text. Removing them from the descriptions strips out this low-level information, so that the more important words remain and receive the system's focus.
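
As a minimal sketch of this step, the snippet below filters a sentence against a small stopword set. The set `STOP_WORDS` and the example sentence are hypothetical stand-ins chosen for illustration; the notebook itself uses NLTK's English stopword list.

```python
# Hypothetical mini stopword list; the notebook uses NLTK's English list instead
STOP_WORDS = {"a", "an", "the", "is", "to", "of", "in", "and", "it", "that"}


def remove_stopwords(text):
    """Lowercase the text, split on whitespace and drop the stopwords."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]


# Made-up example sentence
description = "The hospital is preparing to open a new ward in the city"
print(remove_stopwords(description))
```

Building the stopword collection once as a `set` keeps each membership check cheap, which matters when the filter runs over every word of every description.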

“Lemmatization” is applied before the data is fed to the vectorizer. Its goal is to reduce inflectional and derivationally related forms of a word to a common base form. Stemming was not chosen because it is a crude heuristic process that chops off the ends of words, so lemmatization suits this scenario better. The lemmatizer can treat each word as a particular part of speech, such as a noun or a verb; here we have chosen to lemmatize the words as verbs.
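
The difference between the two approaches can be sketched with toy versions of each. Both functions below are illustrative assumptions: `crude_stem` is a deliberately simple suffix stripper (not the Porter stemmer), and `VERB_LEMMAS` is a hypothetical lookup table standing in for the dictionary-backed WordNet lemmatizer used in the notebook.

```python
def crude_stem(word):
    """Toy suffix stripper: chops a known ending without checking the result is a word."""
    for suffix in ("ing", "ies", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word


# Hypothetical lookup table standing in for a dictionary-backed lemmatizer
VERB_LEMMAS = {"studies": "study", "says": "say", "died": "die", "running": "run"}


def lookup_lemma(word):
    """Return the dictionary base form when known, otherwise the word unchanged."""
    return VERB_LEMMAS.get(word, word)


for word in ["studies", "died", "running"]:
    print(word, crude_stem(word), lookup_lemma(word))
```

The stripped forms ("stud", "di", "runn") are not real words, whereas the looked-up forms are, which is the property that motivated choosing lemmatization here.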

In [1]:
from newsapi import NewsApiClient
from textblob import TextBlob
import numpy as np
import pandas as pd
import nltk
import string
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\venka\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\venka\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [3]:
abcnews_data_final = pd.read_csv('news_crawled.csv')

In [4]:
abcnews_data_final.head(10)

Unnamed: 0,author,title,description,url,date
0,RICARDO ALONSO-ZALDIVAR and ZEKE MILLER Associ...,US setting up $1.7B national network to track ...,The Biden administration says the U.S. is sett...,https://abcnews.go.com/Health/wireStory/us-set...,2021-04-16T14:34:18Z
1,MARINA VILLENEUVE Associated Press,Reforms follow deadly year in New York nursing...,After a deadly year in New York’s nursing home...,https://abcnews.go.com/Health/wireStory/reform...,2021-04-10T12:14:31Z
2,RYAN J. FOLEY Associated Press,Industry foe charged under Iowa's new food tre...,An animal rights activist whose investigations...,https://abcnews.go.com/Business/wireStory/indu...,2021-04-08T18:27:31Z
3,JOYCE M. ROSENBERG AP Business Writer,"Supply bottlenecks leave ships stranded, busin...",A trade bottleneck born of the COVID-19 outbre...,https://abcnews.go.com/Business/wireStory/supp...,2021-03-21T12:21:23Z
4,KAREL JANICEK Associated Press,"Longest-serving bookseller among 25,000 Czech ...",A year after the Czech Republic recorded its f...,https://abcnews.go.com/Health/wireStory/longes...,2021-03-28T07:10:53Z
5,Erin Schumaker,Pausing Johnson & Johnson vaccines shows monit...,Pausing Johnson & Johnson vaccines shows monit...,https://abcnews.go.com/Health/pausing-johnson-...,2021-04-13T20:48:08Z
6,JAKE COYLE AP Film Writer,"The pandemic has upended the Oscars. Good, pro...",Ninety seconds,https://abcnews.go.com/Entertainment/wireStory...,2021-04-16T12:10:22Z
7,The Associated Press,"China says it will discuss climate, other issu...",China says it has agreed with the U.S. to take...,https://abcnews.go.com/US/wireStory/china-disc...,2021-03-20T11:44:45Z
8,Associated Press,Coroner: Man who died after vaccine died of na...,A South Florida doctor who died about two week...,https://abcnews.go.com/Health/wireStory/corone...,2021-04-07T23:14:37Z
9,The Associated Press,Norwegian Cruises asks CDC to allow trips from...,Norwegian Cruise Line’s parent company wants t...,https://abcnews.go.com/Business/wireStory/norw...,2021-04-05T16:32:15Z


# Specification and Justification of Hyperparameters

No specific hyperparameters were set for either the sentiment analysis or the TF-IDF vectorizer. The TextBlob sentiment call used here takes no parameters. For the TF-IDF vectorizer, the limited resources of the system prevented me from trying other settings within the time limit, so no parameters were passed and the defaults were used.

The one hyperparameter passed explicitly was in the lemmatization, where the lemmatizer was set to treat words as verbs (pos="v") rather than as nouns, its default.


# NLP Technique - I (News Polarity Analysis based on titles)

In [5]:
# Score each title with TextBlob; polarity ranges from -1 (negative) to 1 (positive)
abcnews_data_final['polarity'] = abcnews_data_final['title'].apply(lambda title: TextBlob(title).sentiment.polarity)

In [6]:
abcnews_data_final.head()

Unnamed: 0,author,title,description,url,date,polarity
0,RICARDO ALONSO-ZALDIVAR and ZEKE MILLER Associ...,US setting up $1.7B national network to track ...,The Biden administration says the U.S. is sett...,https://abcnews.go.com/Health/wireStory/us-set...,2021-04-16T14:34:18Z,0.0
1,MARINA VILLENEUVE Associated Press,Reforms follow deadly year in New York nursing...,After a deadly year in New York’s nursing home...,https://abcnews.go.com/Health/wireStory/reform...,2021-04-10T12:14:31Z,-0.031818
2,RYAN J. FOLEY Associated Press,Industry foe charged under Iowa's new food tre...,An animal rights activist whose investigations...,https://abcnews.go.com/Business/wireStory/indu...,2021-04-08T18:27:31Z,0.136364
3,JOYCE M. ROSENBERG AP Business Writer,"Supply bottlenecks leave ships stranded, busin...",A trade bottleneck born of the COVID-19 outbre...,https://abcnews.go.com/Business/wireStory/supp...,2021-03-21T12:21:23Z,0.0
4,KAREL JANICEK Associated Press,"Longest-serving bookseller among 25,000 Czech ...",A year after the Czech Republic recorded its f...,https://abcnews.go.com/Health/wireStory/longes...,2021-03-28T07:10:53Z,0.0


In [7]:
abcnews_data_final.dropna(inplace = True)

In [8]:
abcnews_data_final_posneutral = abcnews_data_final.loc[abcnews_data_final.polarity >= 0, :]
abcnews_data_final_negneutral = abcnews_data_final.loc[abcnews_data_final.polarity <= 0, :]
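
The split above keeps neutral titles (polarity exactly 0) in both subsets. The sketch below shows the same thresholds on a handful of polarity values taken from the outputs in this report; the function `polarity_label` is a hypothetical helper added for illustration and is not part of the notebook.

```python
def polarity_label(polarity):
    """Map a polarity score to one of the three categories."""
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"


# Polarity values that appear in the outputs of this report
sample_polarities = [0.0, -0.031818, 0.136364, 0.6, -0.15]
labels = [polarity_label(p) for p in sample_polarities]

# Same thresholds as the two .loc filters: zero satisfies both conditions
pos_or_neutral = [p for p in sample_polarities if p >= 0]
neg_or_neutral = [p for p in sample_polarities if p <= 0]

print(labels)
print(pos_or_neutral)
print(neg_or_neutral)
```

Because zero passes both filters, a neutral article can be recommended in the positive-or-neutral list and in the negative-or-neutral list alike.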

# NLP Technique - II (Document Similarity Analysis)

In [9]:
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

In [10]:
# Build the stopword set, lemmatizer and punctuation table once instead of once per word
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
punctuation_table = str.maketrans('', '', string.punctuation)

def preprocess_text(text):
    # Lowercase each word, strip ASCII punctuation, drop stopwords, lemmatize as verbs
    words = [word.lower().translate(punctuation_table) for word in text.split()]
    return ' '.join(lemmatizer.lemmatize(word, pos="v") for word in words if word not in stop_words)

corpus_posneutral = [preprocess_text(doc) for doc in abcnews_data_final_posneutral['description']]
corpus_negneutral = [preprocess_text(doc) for doc in abcnews_data_final_negneutral['description']]

In [11]:
vectorizer_posneutral = TfidfVectorizer()
tf_idf_corpus_posneutral = vectorizer_posneutral.fit_transform(corpus_posneutral)
vectorizer_negneutral = TfidfVectorizer()
tf_idf_corpus_negneutral = vectorizer_negneutral.fit_transform(corpus_negneutral)
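
To show what the vectorizer computes, the following is a minimal standard-library sketch of TF-IDF weighting. It assumes scikit-learn's documented default settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalisation of each document vector) and skips tokenization; the function name and the three tiny documents are made up for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Return the sorted vocabulary and one L2-normalised TF-IDF vector per document."""
    n_docs = len(tokenized_docs)
    vocabulary = sorted({term for doc in tokenized_docs for term in doc})
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    # Smoothed inverse document frequency
    idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocabulary}
    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        weights = [counts[term] * idf[term] for term in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights))
        vectors.append([w / norm if norm else 0.0 for w in weights])
    return vocabulary, vectors


# Made-up, already pre-processed documents
docs = [["vaccine", "trial", "data"], ["vaccine", "rollout"], ["cruise", "trips"]]
vocabulary, vectors = tfidf_vectors(docs)

for term, weight in zip(vocabulary, vectors[0]):
    print(term, round(weight, 3))
```

In the first document, "vaccine" receives a lower weight than "trial" because it also occurs in a second document, which is the effect that makes rarer, more distinctive terms dominate the similarity scores.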

In [17]:
input_query_doc = input('Enter the query to search related news among latest news: ')

Enter the query to search related news among latest news: covid


In [18]:
# Clean the query the same way as the corpus: lowercase, strip punctuation, drop stopwords, lemmatize as verbs
query_stop_words = set(stopwords.words('english'))
query_punctuation_table = str.maketrans('', '', string.punctuation)
filtered_list_of_words = [word.lower().translate(query_punctuation_table) for word in input_query_doc.split()]
filtered_list_of_words = [WordNetLemmatizer().lemmatize(word, pos="v") for word in filtered_list_of_words if word not in query_stop_words]
input_query_doc = ' '.join(filtered_list_of_words)

In [19]:
doc_vector_input_posneutral = vectorizer_posneutral.transform([input_query_doc])
doc_vector_input_negneutral = vectorizer_negneutral.transform([input_query_doc])

In [20]:
cosine_similarities = cosine_similarity(doc_vector_input_posneutral.toarray(), tf_idf_corpus_posneutral.toarray()).reshape(len(corpus_posneutral))
most_similar_news = list(np.argsort(cosine_similarities))
most_similar_news.reverse()
print('5 Most Related Positive or Neutral News Articles are as follows: ')
abcnews_data_final_posneutral.iloc[most_similar_news[:5],:]

5 Most Related Positive or Neutral News Articles are as follows: 


Unnamed: 0,author,title,description,url,date,polarity
548,Erin Schumaker,Few health care workers infected with COVID af...,Few health care workers infected with COVID af...,https://abcnews.go.com/Health/health-care-work...,2021-03-24T20:48:59Z,0.075
670,Morgan Winsor,Over 100 fully vaccinated people contract COVI...,Over 100 fully vaccinated people contract COVI...,https://abcnews.go.com/Health/100-fully-vaccin...,2021-03-31T13:22:37Z,0.0
912,Dr. Alexis E. Carrington,How NY hospital faced COVID devastation and ca...,How NY hospital faced COVID devastation and ca...,https://abcnews.go.com/Health/ny-hospital-face...,2021-03-28T16:49:30Z,0.0
884,The Associated Press,AstraZeneca says US trial data shows vaccine 7...,AstraZeneca says advanced trial data from a U....,https://abcnews.go.com/Health/wireStory/astraz...,2021-03-22T07:50:04Z,0.6
435,"Sarah Kolinovsky, Molly Nagle","'My heart goes out,' Biden says on Colorado sh...","For the second time in a week, a mass shooting...",https://abcnews.go.com/Politics/heart-biden-co...,2021-03-23T18:09:07Z,0.0


In [21]:
cosine_similarities = cosine_similarity(doc_vector_input_negneutral.toarray(), tf_idf_corpus_negneutral.toarray()).reshape(len(corpus_negneutral))
most_similar_news = list(np.argsort(cosine_similarities))
most_similar_news.reverse()
print('5 Most Related Negative or Neutral News Articles are as follows: ')
abcnews_data_final_negneutral.iloc[most_similar_news[:5],:]

5 Most Related Negative or Neutral News Articles are as follows: 


Unnamed: 0,author,title,description,url,date,polarity
953,Dr. Samuel Rothman,National Doctors Day celebrates mental health ...,"COVID ""gave permission for physicians to care ...",https://abcnews.go.com/Health/year-national-do...,2021-03-30T21:58:12Z,-0.1
670,Morgan Winsor,Over 100 fully vaccinated people contract COVI...,Over 100 fully vaccinated people contract COVI...,https://abcnews.go.com/Health/100-fully-vaccin...,2021-03-31T13:22:37Z,0.0
912,Dr. Alexis E. Carrington,How NY hospital faced COVID devastation and ca...,How NY hospital faced COVID devastation and ca...,https://abcnews.go.com/Health/ny-hospital-face...,2021-03-28T16:49:30Z,0.0
435,"Sarah Kolinovsky, Molly Nagle","'My heart goes out,' Biden says on Colorado sh...","For the second time in a week, a mass shooting...",https://abcnews.go.com/Politics/heart-biden-co...,2021-03-23T18:09:07Z,0.0
811,Sarah Kolinovsky,Biden forced to confront 2nd mass shooting in ...,"For the second time in a week, a mass shooting...",https://abcnews.go.com/Politics/biden-forced-c...,2021-03-23T16:45:02Z,-0.15
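
The two ranking cells sort all articles by similarity and take the first five positions whatever their scores, so articles with zero similarity to the query can be returned when fewer than five articles match. One possible refinement, sketched here with a hypothetical helper and made-up scores rather than the notebook's data, is to drop zero-similarity articles before taking the top results.

```python
def top_k_related(similarities, k=5):
    """Indices of the k highest scores, skipping articles with zero similarity."""
    ranked = sorted(range(len(similarities)), key=lambda i: similarities[i], reverse=True)
    return [i for i in ranked if similarities[i] > 0][:k]


# Made-up similarity scores for seven articles
scores = [0.0, 0.42, 0.0, 0.17, 0.0, 0.0, 0.63]
print(top_k_related(scores))  # [6, 1, 3]
```

With this filter the list may contain fewer than five articles, which trades a shorter recommendation list for one in which every entry shares at least one term with the query.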


# Preliminary assessment of NLP Task


We set out to build a balanced recommender that suggests news articles to the user across all possible polarities: positive, negative and neutral. This was achieved by applying the relevant NLP techniques to the dataset. For the query “covid”, the recommender returns the five most related positive or neutral articles and the five most related negative or neutral articles, with neutral articles appearing in both lists. From these results it can be observed that we have achieved our goal of building a balanced recommender.

# References


1. Dharani Devi, G., & Kamalakkannan, D. S. (2020, January). Literature Review on Sentiment Analysis in Social Media: Open Challenges toward Applications. ResearchGate. https://www.researchgate.net/publication/341913399_Literature_Review_on_Sentiment_Analysis_in_Social_Media_Open_Challenges_toward_Applications

2. A.H. Alamoodi, B.B. Zaidan, A.A. Zaidan, O.S. Albahri, K.I. Mohammed, R.Q. Malik, E.M. Almahdi, M.A. Chyad, Z. Tareq, A.S. Albahri, Hamsa Hameed, Musaab Alaa. Sentiment analysis and its applications in fighting COVID-19 and infectious diseases: A systematic review. Expert Systems with Applications.

3. O'Reilly T. Web 2.0 Compact Definition: Trying Again. Sebastopol, CA: O'Reilly Media; 2017. http://radar.oreilly.com/2006/12/web-20-compact-definition-tryi.html. Accessed April 24, 2017.

4. Sayeed, Asad & Sarkar, Soumitra & Deng, Yu & Hosn, Rafah & Mahindru, Ruchi & Nithya, R.Anitha. (2009). Characteristics of document similarity measures for compliance analysis. 1207-1216. 10.1145/1645953.1646106.

5. Hartigan, J. A. (1975). Clustering Algorithms. New York, NY, USA: John Wiley & Sons, Inc.

6. H. Yanagimoto, M. Shimada and A. Yoshimura, "Document similarity estimation for sentiment analysis using neural network," 2013 IEEE/ACIS 12th International Conference on Computer and Information Science (ICIS), 2013, pp. 105-110, doi: 10.1109/ICIS.2013.6607825.
