<a href="https://colab.research.google.com/github/Azizkhaled/NLP-with-Aziz/blob/main/Sentiment_of_organizations_in_Reddit.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NER and Sentiment

In this part, we apply a simple sentiment analysis to our data using a pretrained DistilBERT model from the Flair library. We then use the organization labels obtained from NER in the previous section to rank organizations by their average sentiment score, from highest to lowest.

In [5]:
%pip install flair

  Attempting uninstall: urllib3
    Found existing installation: urllib3 2.0.4
    Uninstalling urllib3-2.0.4:
      Successfully uninstalled urllib3-2.0.4
  Attempting uninstall: gdown
    Found existing installation: gdown 4.6.6
    Uninstalling gdown-4.6.6:
      Successfully uninstalled gdown-4.6.6
Successfully installed accelerate-0.21.0 boto3-1.28.26 botocore-1.31.26 bpemb-0.3.4 conllu-4.5.3 datasets-2.14.4 deprecated-1.2.14 dill-0.3.7 flair-0.12.2 ftfy-6.1.1 gdown-4.4.0 huggingface-hub-0.16.4 janome-0.5.0 jmespath-1.0.1 langdetect-1.0.9 mpld3-0.3 multiprocess-0.70.15 pptree-3.1 pytorch-revgrad-0.2.0 s3transfer-0.6.1 safetensors-0.3.2 segtok-1.5.11 sentencepiece-0.1.99 sqlitedict-2.1.0 tokenizers-0.13.3 transformer-smaller-training-vocab-0.2.4 transformers-4.31.0 urllib3-1.26.16 wikipedia-api-0.6.0 xxhash-3.3.0


In [21]:
import pandas as pd
import flair
import spacy


### Initialize the sentiment model

In [7]:
model = flair.models.TextClassifier.load('en-sentiment')

2023-08-15 04:47:29,292 https://nlp.informatik.hu-berlin.de/resources/models/sentiment-curated-distilbert/sentiment-en-mix-distillbert_4.pt not found in cache, downloading to /tmp/tmpf8961atc


100%|██████████| 253M/253M [00:11<00:00, 22.7MB/s]

2023-08-15 04:47:41,444 copying /tmp/tmpf8961atc to cache at /root/.flair/models/sentiment-en-mix-distillbert_4.pt





2023-08-15 04:47:42,506 removing temp file /tmp/tmpf8961atc



### Function to get the sentiment

The function below will:

 - tokenize the input text,
 - make a prediction, and
 - extract the direction (positive or negative) and the confidence (a score from 0 to 1).

In [62]:
def get_sentiment(text):
    # tokenize input text
    sentence = flair.data.Sentence(text)
    # make sentiment prediction
    model.predict(sentence)
    # take the first predicted label and return its direction and confidence
    label = sentence.get_labels()[0]
    return label.value, label.score
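
The `(label, score)` tuple keeps direction and confidence separate. For ranking, it can help to fold both into one signed number. A minimal sketch, assuming the labels are the strings `'POSITIVE'` and `'NEGATIVE'` that appear in this notebook's outputs; `signed_score` is a hypothetical helper and not part of Flair:

```python
def signed_score(sentiment):
    """Map a (label, confidence) tuple to a single value in [-1, 1]."""
    label, confidence = sentiment
    if label == 'POSITIVE':
        return confidence
    if label == 'NEGATIVE':
        return -confidence
    # any other label is treated as an error in this sketch
    raise ValueError(f'unexpected label: {label!r}')


print(signed_score(('NEGATIVE', 0.956432580947876)))  # prints -0.956432580947876
```

Averaging these signed values per organization gives the same net score that is computed later in this notebook.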

### Get sentiment from the data

In [63]:
# the data is produced by NER_On_Sub_reddits.ipynb and is also available in the repo

df = pd.read_csv('./reddit_investing.csv', sep='|')
df.head()

Unnamed: 0,id,created,subreddit,title,selftext,upvote_ratio,ups,downs,score
0,l6wvia,1611841000.0,investing,Robinhood and other brokers literally blocking...,"See title. Can't buy these stocks on RH, but c...",0.99,26952,0,26952
1,64q201,1491907000.0,investing,United Airlines stock down over 5% premarket t...,http://www.marketwatch.com/story/uniteds-stock...,0.88,13795,0,13795
2,a6zrah,1545053000.0,investing,"Bitcoin was nearly $20,000 a year ago today",It's always interesting looking at the past an...,0.94,10636,0,10636
3,949u8r,1533305000.0,investing,"If in 2001, you bought $399 of Apple stock ins...",,0.92,10538,0,10538
4,lhtodm,1613075000.0,investing,Historically it's way better to invest at mark...,"Found this 2018 article, interesting/fun fact:...",0.98,9351,0,9351


In [107]:
# get sentiment
# alternative input: title + '\n' + selftext; here we score the post body only
# note: a missing body (NaN) becomes the string 'nan' and is still scored
# assigning the whole column at once avoids pandas' chained-assignment warning
df['sentiment'] = df['selftext'].astype(str).apply(get_sentiment)

df.head()


Unnamed: 0,id,created,subreddit,title,selftext,upvote_ratio,ups,downs,score,sentiment,organizations
0,l6wvia,1611841000.0,investing,Robinhood and other brokers literally blocking...,"See title. Can't buy these stocks on RH, but c...",0.99,26952,0,26952,"(NEGATIVE, 0.956432580947876)","[NOK, AMC]"
1,64q201,1491907000.0,investing,United Airlines stock down over 5% premarket t...,http://www.marketwatch.com/story/uniteds-stock...,0.88,13795,0,13795,"(NEGATIVE, 0.9999698400497437)","[United Airlines, UAL]"
2,a6zrah,1545053000.0,investing,"Bitcoin was nearly $20,000 a year ago today",It's always interesting looking at the past an...,0.94,10636,0,10636,"(POSITIVE, 0.9983178377151489)",[]
3,949u8r,1533305000.0,investing,"If in 2001, you bought $399 of Apple stock ins...",,0.92,10538,0,10538,"(NEGATIVE, 0.6754499673843384)","[Apple, iPod]"
4,lhtodm,1613075000.0,investing,Historically it's way better to invest at mark...,"Found this 2018 article, interesting/fun fact:...",0.98,9351,0,9351,"(NEGATIVE, 0.9989438652992249)","[SPY, The Stock Market Works, Chart, https://i..."


### Get Organizations

In [108]:
nlp = spacy.load('en_core_web_sm')

def get_orgs(text):
    # process the text with our SpaCy model to get named entities
    doc = nlp(text)
    # initialize list to store identified organizations
    org_list = []
    # loop through the identified entities and append ORG entities to org_list
    for entity in doc.ents:
        if entity.label_ == 'ORG':
            org_list.append(entity.text)
    # if organization is identified more than once it will appear multiple times in list
    # we use set() to remove duplicates then convert back to list
    org_list = list(set(org_list))
    return org_list

In [109]:
# get organizations
# combine title and body; assigning the whole column at once avoids
# pandas' chained-assignment warning
text = df['title'].astype(str) + '\n' + df['selftext'].astype(str)
df['organizations'] = text.apply(get_orgs)

df.head()


Unnamed: 0,id,created,subreddit,title,selftext,upvote_ratio,ups,downs,score,sentiment,organizations
0,l6wvia,1611841000.0,investing,Robinhood and other brokers literally blocking...,"See title. Can't buy these stocks on RH, but c...",0.99,26952,0,26952,"(NEGATIVE, 0.956432580947876)","[NOK, AMC]"
1,64q201,1491907000.0,investing,United Airlines stock down over 5% premarket t...,http://www.marketwatch.com/story/uniteds-stock...,0.88,13795,0,13795,"(NEGATIVE, 0.9999698400497437)","[United Airlines, UAL]"
2,a6zrah,1545053000.0,investing,"Bitcoin was nearly $20,000 a year ago today",It's always interesting looking at the past an...,0.94,10636,0,10636,"(POSITIVE, 0.9983178377151489)",[]
3,949u8r,1533305000.0,investing,"If in 2001, you bought $399 of Apple stock ins...",,0.92,10538,0,10538,"(NEGATIVE, 0.6754499673843384)","[Apple, iPod]"
4,lhtodm,1613075000.0,investing,Historically it's way better to invest at mark...,"Found this 2018 article, interesting/fun fact:...",0.98,9351,0,9351,"(NEGATIVE, 0.9989438652992249)","[SPY, The Stock Market Works, Chart, https://i..."


## Get the sentiment score for each organization

In [110]:
# initialize sentiment dictionary
sentiment = {}

# loop through dataframe and extract org labels and sentiment scores into sentiment dictionary
for i, row in df.iterrows():
    # extract sentiment direction and score
    direction = row['sentiment'][0]
    score = row['sentiment'][1]
    # loop through each label in organizations column
    for org in row['organizations']:
        # check if org label exists in sentiment dictionary already
        if org not in sentiment.keys():
            # if it doesn't, initialize new entry in dictionary
            sentiment[org] = {'POSITIVE': [], 'NEGATIVE': []}
        # append positive/negative score to respective dictionary entry
        sentiment[org][direction].append(score)
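
The explicit "create the entry if it is missing" check can also be handled by `collections.defaultdict`. A minimal sketch on three toy rows copied from the output above; `rows` and `sentiment_by_org` are hypothetical names used only for this illustration:

```python
from collections import defaultdict

# toy rows standing in for the 'sentiment' and 'organizations' columns
rows = [
    (('NEGATIVE', 0.956432580947876), ['NOK', 'AMC']),
    (('NEGATIVE', 0.9999698400497437), ['United Airlines', 'UAL']),
    (('POSITIVE', 0.9983178377151489), []),
]

# every new organization starts with two empty score lists
sentiment_by_org = defaultdict(lambda: {'POSITIVE': [], 'NEGATIVE': []})

for (direction, score), orgs in rows:
    for org in orgs:
        # the default entry is created on first access
        sentiment_by_org[org][direction].append(score)

print(dict(sentiment_by_org))
```

The third row contributes nothing because it has no organizations, which matches the behavior of the loop above.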

In [111]:
sentiment['Amazon']

{'POSITIVE': [0.9859572649002075,
  0.9754666090011597,
  0.7815338373184204,
  0.9391564130783081,
  0.8032909035682678],
 'NEGATIVE': [0.8081296682357788,
  0.9999397993087769,
  0.6754499673843384,
  0.9992662072181702,
  0.9999432563781738,
  0.6406596899032593,
  0.9997462630271912,
  0.9809196591377258,
  0.9981691837310791,
  0.9629344344139099,
  0.8861182928085327,
  0.9794243574142456,
  0.9928145408630371,
  0.678425133228302,
  0.6047007441520691,
  0.9957142472267151,
  0.9953433871269226,
  0.9997910857200623,
  0.9999902248382568,
  0.7631211876869202,
  0.9978368878364563,
  0.9828482866287231,
  0.9957152009010315,
  0.6589743494987488,
  0.9995431900024414]}

## Get the average sentiment for each organization and assign a score

In [112]:
# initialize sentiment list
avg_sentiment = []

# loop through each organization
for org, scores in sentiment.items():
    # sum the confidence scores per direction (an empty list sums to 0.0);
    # the lists in `sentiment` are left untouched so the cell can be re-run
    positive = float(sum(scores['POSITIVE']))
    negative = float(sum(scores['NEGATIVE']))
    # number of posts that mention the organization
    freq = len(scores['POSITIVE']) + len(scores['NEGATIVE'])
    # net score: positive minus negative confidence, averaged over all mentions
    avg = (positive - negative) / freq
    # add to sentiment list
    avg_sentiment.append({
        'entity': org,
        'positive': positive,
        'negative': negative,
        'frequency': freq,
        'score': avg
    })
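
The score is the sum of positive confidences minus the sum of negative confidences, divided by the number of mentions, so it always lies between -1 and 1. As a quick check, the sketch below recomputes it from the rounded `positive`, `negative`, and `frequency` values that this notebook reports for Apple; `net_score` is a hypothetical helper name:

```python
def net_score(positive_sum, negative_sum, frequency):
    """Average signed sentiment: (sum of positives - sum of negatives) / mentions."""
    return (positive_sum - negative_sum) / frequency


# values taken from the Apple row of the results table
apple = net_score(2.346662, 17.484563, 22)
print(round(apple, 6))  # prints -0.688086
```

An organization mentioned only in negative posts with high confidence therefore scores close to -1, regardless of how often it appears.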

In [118]:
sentiment_df = pd.DataFrame(avg_sentiment)
sentiment_df

Unnamed: 0,entity,positive,negative,frequency,score
0,NOK,0.000000,0.956433,1,-0.956433
1,AMC,0.000000,6.935461,7,-0.990780
2,United Airlines,0.000000,0.999970,1,-0.999970
3,UAL,0.000000,0.999970,1,-0.999970
4,Apple,2.346662,17.484563,22,-0.688086
...,...,...,...,...,...
1194,Live Nation Entertainment,0.000000,0.999698,1,-0.999698
1195,Ticketmaster’s,0.000000,0.999698,1,-0.999698
1196,Live Nation,0.000000,0.999698,1,-0.999698
1197,reported](https://www.reuters.com,0.000000,0.999972,1,-0.999972
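
The tail of the table shows some noise from the NER step: a possessive form (`Ticketmaster’s`) and a Markdown link fragment (`reported](https://www.reuters.com`). One possible cleanup, sketched here as an assumption and not something this notebook does, is to normalize entity strings before aggregating. `clean_org` is a hypothetical helper:

```python
import re


def clean_org(name):
    """Return a normalized organization name, or None if it looks like noise."""
    # drop entities that contain URL or Markdown-link fragments
    if re.search(r'https?://|\]\(', name):
        return None
    # strip a trailing possessive written with a straight or curly apostrophe
    name = re.sub(r"['’]s$", '', name.strip())
    return name or None


for raw in ['Ticketmaster’s', 'Live Nation', 'reported](https://www.reuters.com']:
    print(repr(raw), '->', repr(clean_org(raw)))
```

This does not merge variants such as `Live Nation` and `Live Nation Entertainment`; that would need an explicit alias table.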


#### Keep organizations mentioned more than twice and sort by score in descending order

In [119]:
sentiment_df = sentiment_df[sentiment_df['frequency'] > 2]
sentiment_df

Unnamed: 0,entity,positive,negative,frequency,score
1,AMC,0.000000,6.935461,7,-0.990780
4,Apple,2.346662,17.484563,22,-0.688086
6,SPY,1.891549,5.751514,8,-0.482496
16,Uber,0.000000,3.996291,4,-0.999073
17,WSB,0.925240,8.862602,10,-0.793736
...,...,...,...,...,...
773,Android,0.900775,3.961626,5,-0.612170
794,YOLO,0.000000,2.951484,3,-0.983828
879,Yahoo Finance,0.973287,2.990989,4,-0.504425
881,P/S,0.781534,1.787338,3,-0.335268


In [121]:
sentiment_df.sort_values('score', ascending=False).head()


Unnamed: 0,entity,positive,negative,frequency,score
223,MSFT,2.761266,0.899029,4,0.465559
422,Hertz,1.685041,1.755769,4,-0.017682
586,Cathie Wood,1.35748,1.989693,4,-0.158053
378,AI,1.990934,2.875314,5,-0.176876
109,ETF,1.891549,2.880625,5,-0.197815
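
The table above shows only the top of the ranking. The bottom can be read off the same records by sorting in ascending order. A minimal sketch in plain Python, using five rows copied from the filtered table earlier in this section as stand-in data; on the real DataFrame, `sentiment_df.sort_values('score').head()` would be the equivalent call:

```python
# five records copied from the filtered table (frequency > 2)
records = [
    {'entity': 'AMC', 'frequency': 7, 'score': -0.990780},
    {'entity': 'Apple', 'frequency': 22, 'score': -0.688086},
    {'entity': 'SPY', 'frequency': 8, 'score': -0.482496},
    {'entity': 'Uber', 'frequency': 4, 'score': -0.999073},
    {'entity': 'WSB', 'frequency': 10, 'score': -0.793736},
]

# ascending sort puts the most negative average sentiment first
lowest = sorted(records, key=lambda r: r['score'])

for r in lowest[:3]:
    print(r['entity'], r['score'])
```

Among these five rows, Uber ranks lowest, followed by AMC and WSB. The full dataset may order differently, since only a sample of rows is used here.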
