## Pre-trained Sentiment Analysis

Sentiment analysis is an interesting way to analyse the comments. We want to measure how positive, negative or neutral each comment is and assign it a score accordingly.

The intuition is that positive comments might stand out more and get more upvotes.

In our last notebook, we will include these sentiment scores in our final model for score prediction.

Some Python libraries offer sentiment analysis implementations for textual data. The models in these libraries are pre-trained, so we do not need to train our own sentiment analysis model.

### Set-up

In [4]:
!pip install nltk

Collecting nltk
  Downloading nltk-3.6.1-py3-none-any.whl (1.5 MB)
Collecting regex
  Downloading regex-2021.4.4-cp39-cp39-win_amd64.whl (270 kB)
Collecting click
  Using cached click-7.1.2-py2.py3-none-any.whl (82 kB)
Collecting tqdm
  Downloading tqdm-4.60.0-py2.py3-none-any.whl (75 kB)
Installing collected packages: tqdm, regex, click, nltk
Successfully installed click-7.1.2 nltk-3.6.1 regex-2021.4.4 tqdm-4.60.0


In [6]:
!pip install textblob

Collecting textblob
  Downloading textblob-0.15.3-py2.py3-none-any.whl (636 kB)
Installing collected packages: textblob
Successfully installed textblob-0.15.3


In [7]:
import pickle

import numpy as np
import pandas as pd

import nltk
nltk.download('punkt')
from textblob import TextBlob

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Solene\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Input your repository path here:

In [2]:
repsource = "C:/Users/s1027177/OneDrive - Syngenta/Documents/FOAD/au_secours/"

In [9]:
# Load the pickled DataFrame; the context manager closes the file automatically.
with open(repsource + "df_inter", "rb") as file:
    df = pickle.load(file)

### Pre-trained sentiment analysis

In [10]:
def get_sentiment(df, colname):
    
    '''
      Compute sentiment scores for each comment using the TextBlob library.
      It has to be applied to raw data (not preprocessed).
      Scores are the ratios of positive/negative/neutral sentences to the number of
      sentences in the comment. A sentence is positive when its polarity is above 0.1,
      negative when it is -0.1 or below, and neutral otherwise.
      Missing or deleted comments get a score of 0 for all three ratios.

      Parameters
      ----------
      df: pandas.DataFrame
        Raw data

      colname: string
        Name of the column containing text from raw data.

      Returns
      -------
      df: pandas.DataFrame
        The input pandas.DataFrame (modified in place) with new columns
        for the values of the three ratios.

    '''

    for idx, com in df[colname].items():

        if isinstance(com, str) and com!='deleted' and com!='[deleted]':
            blob = TextBlob(com)
            pos = 0
            neg = 0
            neutral = 0
            count = 0
            for sentence in blob.sentences:
                sentiment = sentence.sentiment.polarity
                if sentiment > 0.1:
                    pos +=1
                elif sentiment > -0.1:
                    neutral +=1
                else:
                    neg +=1
                count+=1
            if count == 0:
                count = 1

            scores = {"pos": pos/count,
                      "neutral": neutral/count,
                      "neg": neg/count
                     }
            
        else:
            scores = {"pos": 0,
                    "neutral": 0,
                    "neg": 0
                    }
            
        df.at[idx,'positive_com'] = scores["pos"]
        df.at[idx,'neutral_com'] = scores["neutral"]
        df.at[idx,'negative_com'] = scores["neg"]

    return df
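
The scoring rule can be checked without running TextBlob by isolating the thresholding step. The sketch below assumes the sentence polarities are already available as a list of floats in [-1, 1]; `sentiment_ratios` and its `threshold` parameter are hypothetical names used only for illustration, not part of the notebook or of TextBlob.

```python
def sentiment_ratios(polarities, threshold=0.1):
    """Share of positive, neutral and negative sentences in one comment.

    Mirrors the thresholds used in get_sentiment: a polarity above
    `threshold` is positive, a polarity of -`threshold` or below is
    negative, anything in between is neutral.
    """
    pos = sum(1 for p in polarities if p > threshold)
    neg = sum(1 for p in polarities if p <= -threshold)
    neutral = len(polarities) - pos - neg
    # Avoid dividing by zero when the comment has no sentence.
    count = max(len(polarities), 1)
    return {"pos": pos / count, "neutral": neutral / count, "neg": neg / count}


# Four sentences: one positive, two neutral, one negative.
print(sentiment_ratios([0.5, 0.0, -0.3, 0.05]))
# {'pos': 0.25, 'neutral': 0.5, 'neg': 0.25}
```

Note that the boundaries are asymmetric: a polarity of exactly 0.1 counts as neutral, while exactly -0.1 counts as negative.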

Warning: executing this function on the raw comments takes 4 hours.

In [11]:
new_df=get_sentiment(df, 'body')

### Saving the sentiment analysis

In [17]:
# Save the DataFrame with the sentiment columns; the context manager closes the file.
with open(repsource + "new_df", "wb") as file3:
    pickle.dump(new_df, file3)

We will see in the training part that these variables are not very significant. A better option might have been to build dummy variables equal to 1 for whichever of the positive, negative and neutral scores is highest, and to include only two of them in the model to avoid multicollinearity. Because the computation is very time-consuming, we did not run it again. The function below creates these dummy variables.

In [1]:
def get_sentiment(df, colname):

    df['positive_com'] = 0
    df['neutral_com'] = 0
    df['negative_com'] = 0

    for idx, com in df[colname].items():
        # TextBlob only accepts strings, so skip missing and non-text values.
        if isinstance(com, str) and com!='deleted' and com!='[deleted]':
            blob = TextBlob(com)
            pos = 0
            neg = 0
            neutral = 0
            count = 0
            for sentence in blob.sentences:
                sentiment = sentence.sentiment.polarity
                if sentiment > 0.1:
                    pos +=1
                elif sentiment > -0.1:
                    neutral +=1
                else:
                    neg +=1
                count+=1
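            if count == 0:
                # No sentence found (e.g. empty comment): keep all three dummies at 0
                # instead of letting max() pick 'pos' on an all-zero tie.
                continue
            scores = {"pos": pos/count,
                      "neutral": neutral/count,
                      "neg": neg/count
                     }
            # Ties are resolved in dictionary order: pos, then neutral, then neg.
            max_score = max(scores, key=scores.get)

            if max_score == 'pos':
                df.at[idx,'positive_com'] = 1
            elif max_score == 'neutral':
                df.at[idx,'neutral_com'] = 1
            else:
                df.at[idx,'negative_com'] = 1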
                
        else:
            df.at[idx,'positive_com'] = 0
            df.at[idx,'neutral_com'] = 0
            df.at[idx,'negative_com'] = 0

    return df
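
The dummy encoding can be illustrated on the ratios of a single comment. This is a minimal sketch under stated assumptions: `dominant_sentiment` and `sentiment_dummies` are hypothetical helper names, keeping `pos` and `neg` (with neutral as the omitted reference category) is one possible choice since the text does not say which two to keep, and returning `None` when every ratio is zero is a choice made for this sketch.

```python
def dominant_sentiment(scores):
    """Return the label with the highest ratio, or None if all ratios are zero."""
    if not any(scores.values()):
        return None
    # max() keeps the first key on ties, so insertion order acts as the tie-break.
    return max(scores, key=scores.get)


def sentiment_dummies(scores, kept=("pos", "neg")):
    """Encode the dominant label as 0/1 dummies, leaving one category out."""
    label = dominant_sentiment(scores)
    return {name: int(label == name) for name in kept}


print(sentiment_dummies({"pos": 0.25, "neutral": 0.5, "neg": 0.25}))
# {'pos': 0, 'neg': 0}  (neutral is the omitted reference category)
print(sentiment_dummies({"pos": 0.6, "neutral": 0.2, "neg": 0.2}))
# {'pos': 1, 'neg': 0}
```

Dropping one of the three dummies matters because they would otherwise sum to 1 for every scored comment, which makes them perfectly collinear with the model's intercept.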