This notebook contains Python code for running sentiment analysis on one or more columns of text. I prepared it so that it can be reused in future scientific studies, both by me and by anyone else who finds it.

Import the NLTK library and download the resources it needs: the Punkt tokenizer models, the stopword lists, and the lexicon for the VADER (Valence Aware Dictionary and sEntiment Reasoner) sentiment analysis tool.

In [None]:
import nltk
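nltk.download('punkt')  # Punkt tokenizer models used by nltk.word_tokenize
nltk.download('punkt_tab')  # Newer NLTK releases look for this instead of 'punkt'
nltk.download('stopwords')  # Stopword lists
nltk.download('vader_lexicon')  # Lexicon used by VADER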

Load the CSV file into a pandas DataFrame, following the comments in the code. The text is then cleaned of:

1. Uppercase characters
2. Numeric characters
3. Stopwords and punctuation marks (e.g. periods (.) and commas (,))

The sentiment of each text is then analyzed with VADER, and the compound polarity score for each row is printed.
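As a rough illustration of the three cleaning steps listed above, here is a minimal sketch that uses only the Python standard library. The tiny `DEMO_STOPWORDS` set and the `demo_clean` helper are hypothetical stand-ins invented for this example; the notebook itself relies on NLTK's English stopword list and tokenizer, which may split text differently.

```python
from string import punctuation

# Hypothetical, deliberately tiny stopword set for illustration only;
# the notebook itself uses NLTK's much longer English list.
DEMO_STOPWORDS = {"the", "a", "an", "is", "in", "and", "of", "to"}

def demo_clean(text):
    """Lowercase, drop punctuation and digits, then drop stopwords."""
    # Step 1: convert uppercase characters to lowercase
    text = text.lower()
    # Steps 2 and 3 (first half): remove numeric characters and punctuation marks
    text = "".join(ch for ch in text if ch not in punctuation and not ch.isdigit())
    # Step 3 (second half): split on whitespace and remove stopwords
    words = [w for w in text.split() if w not in DEMO_STOPWORDS]
    return " ".join(words)

print(demo_clean("The 3 movies in the series were GREAT, and the ending is perfect."))
# prints: movies series were great ending perfect
```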

In [None]:
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import pandas as pd
from nltk.corpus import stopwords
from string import punctuation


# Define stopwords for removal (a set makes membership tests fast)
stop_words = set(stopwords.words('english'))

def clean_text(text):
    """
    Clean a text value by lowercasing it and removing punctuation,
    numeric characters, and stopwords.

    Args:
        text: The value to clean. Non-string values (e.g. NaN from
            empty CSV cells) are treated as missing.

    Returns:
        str: The cleaned text, or an empty string for non-string input.
    """

    if not isinstance(text, str):
        # Non-string values (e.g. NaN) become an empty string
        return ""

    # Lowercase conversion
    text = text.lower()

    # Remove punctuation and numeric characters
    text = ''.join(char for char in text
                   if char not in punctuation and not char.isdigit())

    # Tokenization (splitting into words)
    tokens = nltk.word_tokenize(text)

    # Remove stopwords
    filtered_tokens = [word for word in tokens if word not in stop_words]

    # Join back into a cleaned text string
    cleaned_text = ' '.join(filtered_tokens)
    return cleaned_text

# Replace 'data.csv' with the path to your actual CSV file
# Replace 'text_column' with the name of the column containing the text data
df = pd.read_csv('data.csv')
text_column = 'text_column'

# Clean text data, handling potential non-string values
df[text_column] = df[text_column].apply(clean_text)

# Create sentiment analyzer
analyzer = SentimentIntensityAnalyzer()

# Add a new column holding VADER's compound score (ranges from -1 to 1)
df['sentiment_polarity'] = df[text_column].apply(lambda x: analyzer.polarity_scores(x)['compound'])

# Print or use the DataFrame with sentiment scores as needed
print(df['sentiment_polarity'])
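
If a categorical label is more convenient than the raw score, the compound value can be bucketed. The sketch below assumes cutoffs of ±0.05, a convention commonly used with VADER. The `label_sentiment` helper and its `threshold` parameter are hypothetical names introduced here, and the cutoffs should be tuned to the study at hand.

```python
def label_sentiment(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Example with made-up scores (not output from real data)
for score in (0.62, -0.30, 0.0):
    print(score, label_sentiment(score))
```

With the DataFrame from the cell above, this could be applied as `df['sentiment_label'] = df['sentiment_polarity'].apply(label_sentiment)`.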