**Problem Statement:**

•	Objective 1: Get the most frequent entities from the tweets.

•	Objective 2: Find out the sentiment/polarity of each author towards each of the entities.

**Technology/Tools used:**

•	spaCy for tokenization, stop-word removal and named entity recognition.

•	scispaCy for named entity recognition of medical or biomedical entities. The “en_ner_bc5cdr_md” model covers chemical and disease entities.

•	The “en_core_sci_lg” model, whose NER component uses the single label “ENTITY”, is used for the clinical dataset, i.e. “tweets.json”.

•	Pandas for analyzing the dataset.

•	JSON input is read into a DataFrame.

•	“SentimentIntensityAnalyzer” from VADER (Valence Aware Dictionary and sEntiment Reasoner) for sentiment scoring.

**Approach followed:**

Step 1:  Installed spacy.

Step 2:  Imported scispacy and loaded a biomedical model (“en_core_sci_lg” in the cells below; the “en_ner_bc5cdr_md” lines are commented out).

Step 3:  Read the JSON file into a DataFrame using orient “index”.

Step 4:  Cleaned the “tweet_text” column.

•	Confirmed there are no duplicates or missing values in the DataFrame.

•	Removed punctuation, emojis and digits.

•	Removed hyperlinks and hashtags.

•	Removed words of 3 or fewer characters.

•	Changed to lower case.

•	Removed extra spaces.

•	Created tokens.

•	Removed stop words.

Step 5:  Extracted entities from the “cleaned_tweet_text” column.

Step 6:  Computed the frequency of entities (solution to Objective 1).

Step 7:  Calculated the polarity of each entity.

Step 8:  Analysed the authors' sentiment (solution to Objective 2).

**Approach to solve the problem:**

•	The spaCy model “en_core_web_sm” did not recognise the context-specific entities. A few clinical models were tried, and the “en_ner_bc5cdr_md” biomedical model of scispaCy solved this problem.

**Conclusion:**

•	Data cleansing and entity extraction were crucial parts of the solution. Removing unnecessary elements from “tweet_text” consumed most of the time.

•	Entity extraction is the key point of the solution. Finding the right entities means extracting the phrases on which the core meaning of the tweets relies.

•	The sentiment of each entity is then extracted with the help of “SentimentIntensityAnalyzer” from the vaderSentiment package.


# **Installing required Packages**

In [None]:
!pip install spacy

#medical or biomedical named entities
!pip install scispacy

#en_ner_bc5cdr_md - chemical disease relation (scispacy module)
#!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_ner_bc5cdr_md-0.5.1.tar.gz

#en_core_sci_lg
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_core_sci_lg-0.5.1.tar.gz

#en_ner_craft_md
#!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_ner_craft_md-0.5.1.tar.gz

#en_ner_jnlpba_md
#!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_ner_jnlpba_md-0.5.1.tar.gz

#en_ner_bionlp13cg_md
#!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_ner_bionlp13cg_md-0.5.1.tar.gz


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_core_sci_lg-0.5.1.tar.gz
  Using cached https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_core_sci_lg-0.5.1.tar.gz (532.3 MB)
  Preparing metadata (setup.py) ... done
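The model download above is large (532.3 MB), so when re-running the notebook it can help to check what is already present before installing again. A minimal sketch using only the standard library; the helper name `is_installed` is made up for illustration, and the package names are the ones from the install cell.

```python
import importlib.util

def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found on the current Python path."""
    return importlib.util.find_spec(module_name) is not None

# Package names taken from the install cell above.
for name in ("spacy", "scispacy", "en_core_sci_lg"):
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

Only the packages reported as missing then need a `!pip install` line.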


# **Importing required Libraries**

In [None]:
#Importing Pandas for data analysis and work with dataframe
import pandas as pd
#Importing numpy to work with arrays
import numpy as np 
#Importing json to read the .json file
import json
#Importing re for regular expression operation
import re

#importing spaCy and scispaCy packages
import spacy
import scispacy
#import en_ner_bc5cdr_md
#import en_ner_craft_md
#import en_ner_bionlp13cg_md
#import en_ner_jnlpba_md
import en_core_sci_lg

#importing _________
#import en_core_web_sm

import nltk
nltk.download('averaged_perceptron_tagger')

nltk.download('stopwords')
nltk.download('brown')
nltk.download('wordnet')
nltk.download('brown')
nltk.download('punkt')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

# **Loading necessary Modules**

In [None]:
#loading modules
#ner_bc = en_ner_bc5cdr_md.load()
#ner_cr = en_ner_craft_md.load()
#ner_bi = en_ner_bionlp13cg_md.load()
#ner_jn = en_ner_jnlpba_md.load()
#ner_web = en_core_web_sm.load()
ner_sci = en_core_sci_lg.load()

# **Reading the .json file to dataframe**

In [None]:
df = pd.read_json('/content/tweets.json',orient ='index')
df = df.reset_index(drop = False)
df

Unnamed: 0,index,tweet_author,tweet_text
0,2013-07-18 09:39:46.071961602,Hematopoiesis News,⚕️ Scientists conducted a Phase II study of ac...
1,2013-07-17 03:40:32.173842437,"Michael Wang, MD",This phase 2 Acalabrutinib-Venetoclax (AV) tri...
2,2013-07-15 15:41:16.553048065,1stOncology,#NICE backs #AstraZenecas #Calquence for #CLL ...
3,2013-07-12 19:19:42.367813635,Toby Eyre,#acalabrutinib is a valuable option in pts int...
4,2013-07-04 12:40:34.334232586,Lymphoma Hub,NICE has recommended the use of acalabrutinib ...
...,...,...,...
43342,1987-06-19 12:17:53.643945985,Joy is a Lifestyle,Hanging out with Friends! :) #FF #CLL #Happine...
43343,1987-06-19 12:06:26.675290112,𝓒𝓻𝓲𝔃𝔃𝔂 𝓟𝓮𝓻𝓻𝔂🌹,Hanging out with Friends! :) #FF #CLL #Happine...
43344,1987-06-17 23:05:41.186953217,IQWiG,Zusatznutzen von #Idelalisib ist weder für #CL...
43345,1987-06-17 15:18:00.525635584,Medibooks,#Hematología PTK2 EXPRESSION AND IMMUNOCHEMOTH...
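The `index` column above shows timestamp-like values. One possible explanation, and this is an assumption because the raw file is not shown here, is that the top-level keys of `tweets.json` are numeric tweet IDs that `pd.read_json` converted to dates while converting the axes. The sketch below reads such a structure with the standard `json` module and keeps every key as a string. The sample records and the helper name `load_records` are made up for illustration.

```python
import json

# Hypothetical two-record sample in the same shape as tweets.json
# (top-level keys are IDs; the IDs, authors and texts here are made up).
raw = '''{
  "1417000000000000001": {"tweet_author": "Author A", "tweet_text": "First tweet"},
  "1417000000000000002": {"tweet_author": "Author B", "tweet_text": "Second tweet"}
}'''

def load_records(json_text):
    """Flatten {id: {fields}} into a list of rows, keeping each ID as a string."""
    data = json.loads(json_text)
    return [{"tweet_id": tweet_id, **fields} for tweet_id, fields in data.items()]

records = load_records(raw)
print(records[0]["tweet_id"])
```

The resulting list can be passed to `pd.DataFrame(records)`. Passing `convert_axes=False` to `pd.read_json` may also be enough to keep the keys unconverted.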


# **Data Preprocessing (Cleaning the data)**

In [None]:
#checking missing values
df.isnull().sum()

index           0
tweet_author    0
tweet_text      0
dtype: int64

In [None]:
#checking for duplicates
df.duplicated().sum()

0

In [None]:
#removing hashtags (word starting with #)
df['tweet_text'] = df['tweet_text'].replace(r'#[\w]*', '', regex=True)
df["tweet_text"][0]

'⚕️ Scientists conducted a Phase II study of acalabrutinib in patients with relapsed/refractory  who were ibrutinib-intolerant, and found an overall response rate of 73%. \nhttps://t.co/eJ6m4QpC5P https://t.co/kuZz6ZO47r'
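The pattern above deletes each hashtag together with its word, which is why nothing follows "relapsed/refractory" in the output. Several raw tweets use hashtags for the drug or disease name itself (for example "#acalabrutinib is a valuable option"), so those entities are lost before NER runs. A possible alternative, sketched here with hypothetical helper names, removes only the `#` symbol and keeps the word.

```python
import re

def strip_hash_symbol(text: str) -> str:
    """Remove the leading '#' from hashtags but keep the tagged word."""
    return re.sub(r"#(\w+)", r"\1", text)

def drop_hashtags(text: str) -> str:
    """Same pattern as the cell above: remove the hashtag together with its word."""
    return re.sub(r"#[\w]*", "", text)

sample = "#acalabrutinib is a valuable option"
print(strip_hash_symbol(sample))  # acalabrutinib is a valuable option
print(drop_hashtags(sample))      # ' is a valuable option' (drug name is gone)
```

Which variant is preferable depends on whether hashtags in this dataset are mostly topical noise or mostly entity names.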

In [None]:
#removing https links
df['tweet_text'] = df['tweet_text'].replace(r'http\S+', '', regex=True).replace(r'www\S+', '', regex=True)
df["tweet_text"][0]

'⚕️ Scientists conducted a Phase II study of acalabrutinib in patients with relapsed/refractory  who were ibrutinib-intolerant, and found an overall response rate of 73%. \n '

In [None]:
# Replacing every non-letter character (punctuation, digits, emojis) with a space
df['tweet_text'] = df['tweet_text'].str.replace("[^a-zA-Z]", " ",regex = True)
df["tweet_text"][0]

'   Scientists conducted a Phase II study of acalabrutinib in patients with relapsed refractory  who were ibrutinib intolerant  and found an overall response rate of        '
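The pattern `[^a-zA-Z]` keeps only ASCII letters, so accented letters in the non-English tweets are replaced as well. That may be why truncated tokens such as "refrakt" and "follikul" show up in the German tweet further down (an assumption, since the full raw text is not shown). A Unicode-aware variant can be sketched with `str.isalpha()`; the helper name `keep_letters` is hypothetical.

```python
import re

def keep_letters(text: str) -> str:
    """Replace every character that is not a letter (in any script) with a space."""
    return "".join(ch if ch.isalpha() else " " for ch in text)

def keep_ascii_letters(text: str) -> str:
    """Same pattern as the cell above: only a-z and A-Z survive."""
    return re.sub(r"[^a-zA-Z]", " ", text)

print(keep_letters("Hematología 73%"))        # accented letter is kept
print(keep_ascii_letters("Hematología 73%"))  # accented letter becomes a space
```

Digits, punctuation and emojis are still removed by both versions.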

In [None]:
# make entire text lowercase
df['tweet_text'] = [row.lower() for row in df['tweet_text']]
df["tweet_text"][0]

'   scientists conducted a phase ii study of acalabrutinib in patients with relapsed refractory  who were ibrutinib intolerant  and found an overall response rate of        '

In [None]:
# remove words with 3 or fewer characters (only words with len > 3 are kept)
df['tweet_text'] = df['tweet_text'].apply(lambda x: ' '.join([w for w in x.split() if len(w)>3]))
df["tweet_text"][0]

'scientists conducted phase study acalabrutinib patients with relapsed refractory were ibrutinib intolerant found overall response rate'
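The cleaning cells above can be collected into one function, which makes the order of the steps explicit and lets the minimum word length be changed in one place. This is a sketch that mirrors the patterns used above; the function name `clean_tweet`, the `min_len` parameter and the sample tweet are made up for illustration.

```python
import re

def clean_tweet(text: str, min_len: int = 4) -> str:
    """Apply the cleaning steps of the cells above, in the same order, to one tweet."""
    text = re.sub(r"#[\w]*", "", text)       # drop hashtags
    text = re.sub(r"http\S+", "", text)      # drop http(s) links
    text = re.sub(r"www\S+", "", text)       # drop bare www links
    text = re.sub(r"[^a-zA-Z]", " ", text)   # every non-letter becomes a space
    text = text.lower()
    # keep words with at least min_len characters (len(w) > 3 in the cell above)
    return " ".join(w for w in text.split() if len(w) >= min_len)

# Made-up sample tweet.
sample = "Phase II study of acalabrutinib: response rate of 73%! #CLL https://t.co/abc"
print(clean_tweet(sample))  # phase study acalabrutinib response rate
```

In the notebook this would be applied as `df["tweet_text"].apply(clean_tweet)`. Note that with the default `min_len` short domain terms such as "cll" are dropped along with "a", "of" and "ii".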

In [None]:
#tokenization
#df["cleaned_tweet_text"] = df["tweet_text"].apply(lambda x: [t.text for t in ner_bc.tokenizer(x)])
#df
#df["cleaned_tweet_text"] = df["tweet_text"].apply(lambda x: ner_bc.tokenizer(x))
#df
df["cleaned_tweet_text"] = df["tweet_text"].apply(lambda x: ner_sci.tokenizer(x))
df

Unnamed: 0,index,tweet_author,tweet_text,cleaned_tweet_text
0,2013-07-18 09:39:46.071961602,Hematopoiesis News,scientists conducted phase study acalabrutinib...,"(scientists, conducted, phase, study, acalabru..."
1,2013-07-17 03:40:32.173842437,"Michael Wang, MD",this phase acalabrutinib venetoclax trial that...,"(this, phase, acalabrutinib, venetoclax, trial..."
2,2013-07-15 15:41:16.553048065,1stOncology,backs,(backs)
3,2013-07-12 19:19:42.367813635,Toby Eyre,valuable option intolerant further valuable da...,"(valuable, option, intolerant, further, valuab..."
4,2013-07-04 12:40:34.334232586,Lymphoma Hub,nice recommended acalabrutinib patients with t...,"(nice, recommended, acalabrutinib, patients, w..."
...,...,...,...,...
43342,1987-06-19 12:17:53.643945985,Joy is a Lifestyle,hanging with friends,"(hanging, with, friends)"
43343,1987-06-19 12:06:26.675290112,𝓒𝓻𝓲𝔃𝔃𝔂 𝓟𝓮𝓻𝓻𝔂🌹,hanging with friends,"(hanging, with, friends)"
43344,1987-06-17 23:05:41.186953217,IQWiG,zusatznutzen weder noch refrakt follikul lymph...,"(zusatznutzen, weder, noch, refrakt, follikul,..."
43345,1987-06-17 15:18:00.525635584,Medibooks,expression immunochemotherapy outcome chronic ...,"(expression, immunochemotherapy, outcome, chro..."


In [None]:
#stopwords removal
#df["cleaned_tweet_text"] = df["cleaned_tweet_text"].apply(lambda x: [token.text for token in ner_bc(x) if not token.is_stop])
#df
df["cleaned_tweet_text"] = df["cleaned_tweet_text"].apply(lambda x: [token.text for token in ner_sci(x) if not token.is_stop])
df

Unnamed: 0,index,tweet_author,tweet_text,cleaned_tweet_text
0,2013-07-18 09:39:46.071961602,Hematopoiesis News,scientists conducted phase study acalabrutinib...,"[scientists, conducted, phase, study, acalabru..."
1,2013-07-17 03:40:32.173842437,"Michael Wang, MD",this phase acalabrutinib venetoclax trial that...,"[phase, acalabrutinib, venetoclax, trial, recr..."
2,2013-07-15 15:41:16.553048065,1stOncology,backs,[backs]
3,2013-07-12 19:19:42.367813635,Toby Eyre,valuable option intolerant further valuable da...,"[valuable, option, intolerant, valuable, data,..."
4,2013-07-04 12:40:34.334232586,Lymphoma Hub,nice recommended acalabrutinib patients with t...,"[nice, recommended, acalabrutinib, patients, t..."
...,...,...,...,...
43342,1987-06-19 12:17:53.643945985,Joy is a Lifestyle,hanging with friends,"[hanging, friends]"
43343,1987-06-19 12:06:26.675290112,𝓒𝓻𝓲𝔃𝔃𝔂 𝓟𝓮𝓻𝓻𝔂🌹,hanging with friends,"[hanging, friends]"
43344,1987-06-17 23:05:41.186953217,IQWiG,zusatznutzen weder noch refrakt follikul lymph...,"[zusatznutzen, weder, noch, refrakt, follikul,..."
43345,1987-06-17 15:18:00.525635584,Medibooks,expression immunochemotherapy outcome chronic ...,"[expression, immunochemotherapy, outcome, chro..."


In [None]:
df["cleaned_tweet_text"] = df["cleaned_tweet_text"].astype(str)

In [None]:
df["cleaned_tweet_text"][0]

In [None]:
df['cleaned_tweet_text'] = df['cleaned_tweet_text'].str.replace("[^a-zA-Z]", " ",regex = True)
df["cleaned_tweet_text"][0]

'  scientists    conducted    phase    study    acalabrutinib    patients    relapsed    refractory    ibrutinib    intolerant    found    overall    response    rate  '
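Converting the token list to its string representation and then stripping non-letters works, but it leaves runs of spaces, as the output above shows. Joining the tokens directly yields the same words separated by single spaces. A minimal sketch on a plain Python list; the helper name is hypothetical.

```python
def tokens_to_text(tokens):
    """Join a list of token strings into one space-separated string."""
    return " ".join(tokens)

tokens = ["scientists", "conducted", "phase", "study", "acalabrutinib"]
print(tokens_to_text(tokens))  # scientists conducted phase study acalabrutinib
```

Applied to the list column produced by the stop-word cell, this would be `df["cleaned_tweet_text"].apply(" ".join)`.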

# ***SciSpacy Named Entity Recognition (NER)***

***Application on DataFrame***

In [None]:
print(list(ner_sci.component_names))

['tok2vec', 'tagger', 'attribute_ruler', 'lemmatizer', 'parser', 'ner']


In [None]:
print(list(ner_sci.get_pipe("ner").labels))

['ENTITY']


In [None]:
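def add_bc(abstractList, doiList):
    #Run the NER pipeline over every text and record one row per detected entity
    table = {"Author": [], "Entity": [], "Class": []}
    #zip pairs each author with the Doc produced for that author's tweet
    for doi, doc in zip(doiList, ner_sci.pipe(abstractList)):
        for x in doc.ents:
            table["Author"].append(doi)
            table["Entity"].append(x.text)
            table["Class"].append(x.label_)
    return table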
     

In [None]:
#meta_df = pd.read_csv("/content/sample.csv")

#Sort out blank abstracts
#df = meta_df.dropna(subset=['abstract'])

#Create lists
doiList = df['tweet_author'].tolist()
abstractList = df['cleaned_tweet_text'].tolist()

#Add all author/entity pairs to the table (run one model at a time; each run takes ~20 min)
table = add_bc(abstractList, doiList)

# table = add_bc(abstractList, doiList)

# table = add_bi(abstractList, doiList)

# table = add_jn(abstractList, doiList)

#Turn table into an exportable CSV file (returns normalized file of entity/value pairs)
ner_df = pd.DataFrame(table)
#trans_df.to_csv ("Entity_pairings.csv", index=False)

In [None]:
#trans_df = pd.DataFrame(table)
ner_df[:30]
#trans_df.to_csv("Entity.csv",index = False)

Unnamed: 0,Author,Entity,Class
0,Hematopoiesis News,scientists,ENTITY
1,Hematopoiesis News,acalabrutinib,ENTITY
2,Hematopoiesis News,patients,ENTITY
3,Hematopoiesis News,relapsed,ENTITY
4,Hematopoiesis News,ibrutinib,ENTITY
5,Hematopoiesis News,intolerant,ENTITY
6,Hematopoiesis News,response,ENTITY
7,Hematopoiesis News,rate,ENTITY
8,"Michael Wang, MD",acalabrutinib,ENTITY
9,"Michael Wang, MD",venetoclax,ENTITY


In [None]:
ner_df.duplicated().value_counts()

False    58701
True     52478
dtype: int64

In [None]:
ner_df.drop_duplicates(inplace = True)
ner_df

Unnamed: 0,Author,Entity,Class
0,Hematopoiesis News,scientists,ENTITY
1,Hematopoiesis News,acalabrutinib,ENTITY
2,Hematopoiesis News,patients,ENTITY
3,Hematopoiesis News,relapsed,ENTITY
4,Hematopoiesis News,ibrutinib,ENTITY
...,...,...,...
111170,Micheál 🇮🇪,lifetime,ENTITY
111172,𝓒𝓻𝓲𝔃𝔃𝔂 𝓟𝓮𝓻𝓻𝔂🌹,friends,ENTITY
111174,Medibooks,immunochemotherapy,ENTITY
111175,Medibooks,outcome,ENTITY


In [None]:
ner_df[ner_df["Entity"] == "na"]
  
#ner_df_cleaned = ner_df[ner_df['Entity'] != "na"][["Author","Entity","Class"]]
#ner_df_cleaned
ner_df_cleaned = ner_df

In [None]:
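#Count how often each entity occurs; rename_axis + reset_index(name=...) yields the columns "Entity" and "Frequency" on both pandas 1.x and 2.x
freq = ner_df_cleaned['Entity'].value_counts().rename_axis("Entity").reset_index(name="Frequency")
freq[:20]
freq.to_csv("objective1.csv", encoding='utf-8',index = False)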

In [None]:
df5 = pd.read_csv("/content/objective1.csv")
df5

Unnamed: 0,Entity,Frequency
0,patients,2146
1,chronic,1628
2,acalabrutinib,1444
3,calquence,1349
4,treatment,1291
...,...,...
7233,astronauts,1
7234,sigo,1
7235,presby,1
7236,snapshots,1
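Because repeated (Author, Entity, Class) rows were dropped before counting, the Frequency column above gives the number of distinct authors who mention an entity, not the total number of mentions. The difference can be illustrated with `collections.Counter` on a few made-up rows.

```python
from collections import Counter

# Made-up (author, entity) pairs; one row per entity mention.
mentions = [
    ("Author A", "acalabrutinib"),
    ("Author A", "acalabrutinib"),
    ("Author B", "acalabrutinib"),
    ("Author B", "patients"),
]

# Total mentions: count every row.
total_mentions = Counter(entity for _, entity in mentions)

# Distinct authors: drop repeated (author, entity) pairs first,
# which mirrors drop_duplicates() followed by value_counts().
distinct_authors = Counter(entity for _, entity in set(mentions))

print(total_mentions["acalabrutinib"])    # 3
print(distinct_authors["acalabrutinib"])  # 2
```

If total mentions are wanted for Objective 1, the counts would need to be taken before `drop_duplicates` is applied.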


# ***Part II - Sentiment analysis/Polarity***

In [None]:
polarity_df = ner_df_cleaned.copy()
polarity_df

Unnamed: 0,Author,Entity,Class
0,Hematopoiesis News,scientists,ENTITY
1,Hematopoiesis News,acalabrutinib,ENTITY
2,Hematopoiesis News,patients,ENTITY
3,Hematopoiesis News,relapsed,ENTITY
4,Hematopoiesis News,ibrutinib,ENTITY
...,...,...,...
111170,Micheál 🇮🇪,lifetime,ENTITY
111172,𝓒𝓻𝓲𝔃𝔃𝔂 𝓟𝓮𝓻𝓻𝔂🌹,friends,ENTITY
111174,Medibooks,immunochemotherapy,ENTITY
111175,Medibooks,outcome,ENTITY


In [None]:
#polarity = en_ner_bc5cdr_md.load()
#polarity = en_core_web_sm.load()
polarity = en_core_sci_lg.load()

In [None]:
!pip install vaderSentiment

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

#Sentiment Analysis
SIA = SentimentIntensityAnalyzer()

polarity_df["Entity"]= polarity_df["Entity"].astype(str)
# Applying the model to each entity string and creating one column per score
polarity_df['Polarity Score'] = polarity_df["Entity"].apply(lambda x:SIA.polarity_scores(x)['compound'])
polarity_df['Neutral Score'] = polarity_df["Entity"].apply(lambda x:SIA.polarity_scores(x)['neu'])
polarity_df['Negative Score'] = polarity_df["Entity"].apply(lambda x:SIA.polarity_scores(x)['neg'])
polarity_df['Positive Score'] = polarity_df["Entity"].apply(lambda x:SIA.polarity_scores(x)['pos'])


# Converting the compound score (range -1 to 1) to a categorical variable
# Note: a score of exactly 0 is labelled 'Negative' because the 'Neutral' line is commented out
polarity_df['overall polarity']=''
polarity_df.loc[polarity_df['Polarity Score'] > 0,'overall polarity']='Positive'
#polarity_df.loc[polarity_df['Polarity Score'] == 0,'overall polarity']='Neutral'
polarity_df.loc[polarity_df['Polarity Score'] <= 0,'overall polarity']='Negative'
polarity_df[:30]

Unnamed: 0,Author,Entity,Class,Polarity Score,Neutral Score,Negative Score,Positive Score,overall polarity
0,Hematopoiesis News,scientists,ENTITY,0.0,1.0,0.0,0.0,Negative
1,Hematopoiesis News,acalabrutinib,ENTITY,0.0,1.0,0.0,0.0,Negative
2,Hematopoiesis News,patients,ENTITY,0.0,1.0,0.0,0.0,Negative
3,Hematopoiesis News,relapsed,ENTITY,0.0,1.0,0.0,0.0,Negative
4,Hematopoiesis News,ibrutinib,ENTITY,0.0,1.0,0.0,0.0,Negative
5,Hematopoiesis News,intolerant,ENTITY,0.0,1.0,0.0,0.0,Negative
6,Hematopoiesis News,response,ENTITY,0.0,1.0,0.0,0.0,Negative
7,Hematopoiesis News,rate,ENTITY,0.0,1.0,0.0,0.0,Negative
8,"Michael Wang, MD",acalabrutinib,ENTITY,0.0,1.0,0.0,0.0,Negative
9,"Michael Wang, MD",venetoclax,ENTITY,0.0,1.0,0.0,0.0,Negative
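Many entity strings repeat across authors, so each distinct string only needs to be scored once. The sketch below caches the result per unique entity; `score_fn` stands in for `SIA.polarity_scores`, and the toy scorer is a hypothetical placeholder so the example runs without VADER installed.

```python
def score_unique(entities, score_fn):
    """Call score_fn once per distinct entity and return {entity: scores}."""
    return {entity: score_fn(entity) for entity in set(entities)}

# Hypothetical stand-in for SIA.polarity_scores; it records each call it receives.
calls = []
def toy_scores(text):
    calls.append(text)
    return {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0}

entities = ["acalabrutinib", "patients", "acalabrutinib"]
cache = score_unique(entities, toy_scores)
compound = [cache[e]["compound"] for e in entities]
print(len(calls), compound)  # 2 scorer calls for 3 rows
```

In the notebook this could look like `cache = score_unique(polarity_df["Entity"], SIA.polarity_scores)` followed by `polarity_df["Entity"].map(lambda e: cache[e]["compound"])` for each score column.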


In [None]:
polarity_df["overall polarity"].value_counts()

Negative    54689
Positive     4012
Name: overall polarity, dtype: int64
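Since every score `<= 0` is mapped to 'Negative', entities with a compound score of exactly 0.0 (such as "scientists" or "patients" in the table above) are counted as Negative, which likely accounts for most of the 54689 rows. A three-way split is one alternative. The sketch below uses the ±0.05 cut-offs that are commonly suggested for VADER's compound score; the function name and the `threshold` parameter are assumptions for illustration, not part of the original notebook.

```python
def label_from_compound(score: float, threshold: float = 0.05) -> str:
    """Map a compound score in [-1, 1] to Positive, Negative or Neutral."""
    if score >= threshold:
        return "Positive"
    if score <= -threshold:
        return "Negative"
    return "Neutral"

for s in (0.6, 0.0, -0.4):
    print(s, label_from_compound(s))
```

Applied to the notebook, this would be `polarity_df["Polarity Score"].apply(label_from_compound)`.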

In [None]:
polarity_df.columns

Index(['Author', 'Entity', 'Class', 'Polarity Score', 'Neutral Score',
       'Negative Score', 'Positive Score', 'overall polarity'],
      dtype='object')

In [None]:
del polarity_df["Polarity Score"]
#,"Neutral Score","Negative Score","Positive Score","Class"])

In [None]:
del polarity_df["Neutral Score"]

In [None]:
del polarity_df["Negative Score"]

In [None]:
del polarity_df["Positive Score"]

In [None]:
del polarity_df["Class"]

In [None]:
polarity_df

Unnamed: 0,Author,Entity,overall polarity
0,Hematopoiesis News,scientists,Negative
1,Hematopoiesis News,acalabrutinib,Negative
2,Hematopoiesis News,patients,Negative
3,Hematopoiesis News,relapsed,Negative
4,Hematopoiesis News,ibrutinib,Negative
...,...,...,...
111170,Micheál 🇮🇪,lifetime,Negative
111172,𝓒𝓻𝓲𝔃𝔃𝔂 𝓟𝓮𝓻𝓻𝔂🌹,friends,Positive
111174,Medibooks,immunochemotherapy,Negative
111175,Medibooks,outcome,Negative


In [None]:
polarity_df.to_csv("objective2.csv", encoding='utf-8',index = False)
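The scores above are computed from the entity string alone, so a drug name scores 0.0 regardless of what the author wrote about it. One option for Objective 2, offered as an assumption rather than the method used in this notebook, is to score the tweet text in which the entity appears and average the result per (author, entity) pair. In the sketch below `score_fn` stands in for a function returning VADER's compound score, and the toy scorer and rows are made up so the example runs on its own.

```python
from collections import defaultdict

def author_entity_sentiment(rows, score_fn):
    """Average score_fn(tweet_text) for every (author, entity) pair.

    rows: iterable of (author, tweet_text, entities) tuples.
    """
    totals = defaultdict(lambda: [0.0, 0])  # (author, entity) -> [sum, count]
    for author, text, entities in rows:
        score = score_fn(text)
        for entity in set(entities):
            totals[(author, entity)][0] += score
            totals[(author, entity)][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}

# Hypothetical scorer: +1 for each "good" and -1 for each "bad" in the text.
def toy_score(text):
    words = text.split()
    return float(words.count("good") - words.count("bad"))

# Made-up rows: (author, tweet text, entities found in that tweet).
rows = [
    ("Author A", "good response with acalabrutinib", ["acalabrutinib", "response"]),
    ("Author A", "bad tolerance of acalabrutinib", ["acalabrutinib"]),
    ("Author B", "good outcome", ["outcome"]),
]
result = author_entity_sentiment(rows, toy_score)
print(result[("Author A", "acalabrutinib")])  # 0.0 (one positive and one negative tweet)
```

With VADER, `score_fn` could be `lambda t: SIA.polarity_scores(t)["compound"]`, ideally applied to the tweet text before punctuation and stop words are stripped, since VADER uses both as cues.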