# Notebook Description:
## This Jupyter notebook is designed for a comprehensive literary analysis of Sylvia Plath's poems, using a variety of Python libraries for data manipulation, natural language processing (NLP), and visualization. The notebook is structured into distinct sections, each focusing on different aspects of text analysis and visualization.

# Sections of the Notebook:
- ## Sentiment Analysis Over Time: This section analyzes the sentiment of poems across different years, including a special category for 'Juvenilia'. It uses TextBlob to calculate sentiment polarity and Plotly for visualization. The output is a line plot showing how average sentiment varies by year.

- ## Sentiment Variability (Standard Deviation): In addition to average sentiment scores, this part calculates the standard deviation of sentiments for each year, providing insight into the emotional variability within the poems. The visualization includes error bars to represent this variability.

- ## Keyword and Phrase Analysis: This part involves extracting and visualizing the most common words and bigrams (pairs of adjacent words) in the poems. It employs NLTK for text processing and Plotly for creating bar charts that display these frequencies.

- ## Named Entity Recognition (NER): Using spaCy's NER feature, this section identifies and categorizes named entities (like people, places, organizations) within the poems. The results are visualized as a bar chart, showing the frequency of different types of named entities.

- ## Data Loading and Preprocessing: Each section loads its poem data from a CSV or DOCX file. Preprocessing steps include tokenization, lemmatization, and conversion of data types for analysis.

# How to Use the Notebook:
## Setup:
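- ## Ensure the third-party libraries (pandas, plotly, nltk, spacy, python-docx, textblob) are installed. The datetime, string, and collections modules ship with Python, and the docx import is provided by the python-docx package. The spaCy model en_core_web_sm must also be downloaded.
- ## Download PlathPoems.csv, PlathPoems.docx and poems_by_year.csv. After downloading them, upload them to the "files" section of this notebook so that they are available under /content/, the path the code cells read from.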
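
A quick way to confirm the setup before running the cells is to check which modules can be imported. The sketch below is not part of the original notebook; the helper name `missing_modules` and the module list are assumptions based on the imports used in the code cells.

```python
from importlib.util import find_spec

# Import names of the third-party packages the notebook relies on.
# Note: the `docx` module is installed by the python-docx package.
THIRD_PARTY_MODULES = ["pandas", "plotly", "nltk", "spacy", "docx", "textblob"]


def missing_modules(module_names):
    """Return the names from module_names that cannot be imported here."""
    return [name for name in module_names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(THIRD_PARTY_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All third-party modules are available.")
```

The spaCy model is installed separately from the library itself, typically with `python -m spacy download en_core_web_sm`.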

## Running Sections:
- ## Execute each cell in the notebook sequentially. Each section is self-contained and includes both the code for analysis and the code for visualization.
- ## Modify parameters like file paths or visualization settings as needed for your specific dataset.

## Interpreting Results:
- ## View and interpret the visualizations generated by the notebook. These plots offer insights into sentiment trends, word usage, and named entity occurrences in the poems.

## This notebook serves as a powerful tool for literary analysis, particularly for studying Sylvia Plath's poems. It leverages the capabilities of Python in text processing and visualization, making it suitable for both academic research and personal exploration.

#### By: Marcela Soriano

## Timeline

In [None]:
import pandas as pd
import plotly.graph_objects as go
from datetime import datetime

In [None]:
# Function to convert year to a datetime object (for those that are valid years)
def convert_year(year):
    try:
        return datetime.strptime(str(year), "%Y")
    except ValueError:
        return None

In [None]:
# Load the CSV file
file_path = '/content/PlathPoems.csv'
plath_data = pd.read_csv(file_path)

In [None]:
# Apply the conversion to the Year column
plath_data['Year_dt'] = plath_data['Year'].apply(convert_year)

# Create 'Type' column for combined data
plath_data['Type'] = 'Specific Years'
plath_data.loc[plath_data['Year'] == 'Juvenilia', 'Type'] = 'Juvenilia'

# Sort the DataFrame by 'Year_dt'
plath_data = plath_data.sort_values(by='Year_dt')

In [None]:
# Adjust the size and spacing of the scatter dots
dot_size = 12  # Adjust the size of the scatter dots
dot_spacing = 40.0  # Adjust the spacing as needed
font_size = 14  # Adjust the font size for the text

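# Add spacing between the dots. This is done before the trace is built,
# because shifting the DataFrame afterwards does not change the figure.
# The i-th row of the sorted data is shifted by i * dot_spacing days, so the
# offsets accumulate and later dots drift away from their exact year.
plath_data = plath_data.reset_index(drop=True)
x_spaced = pd.to_datetime(plath_data['Year_dt']) + pd.to_timedelta(
    plath_data.index * dot_spacing, unit='D')

# Create a figure for the horizontal dot plot
fig = go.Figure()

# Add scatter dots for events with text annotations
fig.add_trace(go.Scatter(
    y=plath_data['Poem'],
    x=x_spaced,
    mode='markers+text',
    marker=dict(size=dot_size, color='blue'),
    text=plath_data['Event'],
    hoverinfo='text+y+x',
    textposition='middle right',  # Adjust text position next to the dot
    textfont=dict(size=font_size),  # Adjust the font size for the text
))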

# Customize the layout
fig.update_layout(
    title='Sylvia Plath Poems and Events',
    xaxis_title='Year',
    yaxis_title='Poem',
    yaxis=dict(categoryorder='total ascending'),
    width=2000,  # Adjust the width of the plot
    height=1500,  # Adjust the height of the plot
    template="plotly_dark",  # Use a darker background template
)

# Show the interactive plot
fig.show()

- ### Import Libraries and Functions:
  - ### pandas for data manipulation.
  - ### plotly.graph_objects for plotting.
  - ### datetime for date and time operations.

- ### Convert Year Function: A function convert_year is defined to convert year values into datetime objects. If the conversion fails (e.g., for invalid years), it returns None.

- ### Load and Process Data: The CSV file PlathPoems.csv (with the Year, Poem, and Event columns used by the code) is loaded into a pandas DataFrame named plath_data. The Year column is converted into datetime objects using the convert_year function and stored in a new column Year_dt; the non-numeric value 'Juvenilia' becomes a missing date. A new column Type is created to categorize the rows as 'Specific Years' or 'Juvenilia'. The DataFrame is then sorted by Year_dt, which places rows with a missing date last.

- ### Scatter Plot Configuration: Dot size, spacing, and font size are defined for the scatter plot. A plotly figure (fig) is created for the horizontal dot plot. Scatter dots are added to the plot with a text annotation for each event. These dots represent poems, plotted against years. The dots are colored blue, and the position and font size of the text annotations are adjusted. A per-dot offset in days (dot_spacing) is intended to spread the dots apart for clarity.

- ### Customize and Display Plot: The layout of the plot is customized with titles, axis labels, dimensions (width and height), and a dark background template. The interactive plot is displayed using fig.show().
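
The conversion, sorting, and spacing steps described above can be tried out without pandas. The following is a minimal sketch using only the standard library; the sample year values are hypothetical, and the sort key only imitates how pandas places missing dates last.

```python
from datetime import datetime, timedelta


def convert_year(year):
    """Same logic as the notebook: parse a four-digit year, else return None."""
    try:
        return datetime.strptime(str(year), "%Y")
    except ValueError:
        return None


# Hypothetical sample values; the real ones come from the Year column.
years = ["1962", "Juvenilia", "1956", "1960"]
parsed = [convert_year(y) for y in years]

# Put unparseable entries last, similar to how pandas sorts missing dates.
ordered = sorted(parsed, key=lambda d: (d is None, d or datetime.min))

# Shift the i-th dated point by i * dot_spacing days, as the spacing step intends.
dot_spacing = 40.0
spaced = [
    d + timedelta(days=dot_spacing * i) if d is not None else None
    for i, d in enumerate(ordered)
]

for original, shifted in zip(ordered, spaced):
    print(original, "->", shifted)
```

Because the offset grows with the row position, dots late in the sorted order move further from their true year than early ones; a smaller `dot_spacing` keeps the timeline closer to the actual dates.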

## Top 20 Common Words and Bigrams

In [None]:
import nltk
import spacy
import string
from collections import Counter
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.util import ngrams
from docx import Document
import plotly.express as px

In [None]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Load spaCy model
nlp = spacy.load("en_core_web_sm")

# Function to read text from a DOCX file
def read_text_from_docx(file_path):
    doc = Document(file_path)
    return " ".join([para.text for para in doc.paragraphs])

# Replace the file path with the path of your DOCX file
file_path = "/content/PlathPoems.docx"
text = read_text_from_docx(file_path)

# Text Preprocessing
def preprocess_text(text):
    # Tokenization and lowercasing
    tokens = word_tokenize(text.lower())

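    # Removing punctuation: keep only tokens that contain a letter or digit.
    # (Testing `word not in string.punctuation` is a substring check and lets
    # multi-character marks such as '--', '...' or '``' through.)
    tokens = [word for word in tokens if any(ch.isalnum() for ch in word)]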

    # Removing stop words
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]

    # Lemmatization using spaCy
    lemmatized = [token.lemma_ for token in nlp(' '.join(tokens))]

    return lemmatized

# Preprocess the text
preprocessed_text = preprocess_text(text)

# Keyword Analysis
word_freq = Counter(preprocessed_text)
common_words = word_freq.most_common(20)

# Phrase Analysis (bigrams)
bigrams = ngrams(preprocessed_text, 2)
bigram_freq = Counter(bigrams)
common_bigrams = bigram_freq.most_common(20)

# Named Entity Recognition (NER)
doc = nlp(text)
entities = [(entity.text, entity.label_) for entity in doc.ents]

# Define a dark theme with blue elements
dark_theme_layout = {
    'plot_bgcolor': 'black',  # Background color
    'paper_bgcolor': 'black', # Paper background color
    'font': {'color': 'white'}, # Font color
    'colorway': ['blue'] # Color of the bars or lines in the graph
}

# Graph for Common Words with dark theme
fig = px.bar(x=[word for word, freq in common_words], y=[freq for word, freq in common_words],
             labels={'x': 'Words', 'y': 'Frequency'}, title='Top 20 Common Words')
fig.update_layout(**dark_theme_layout)
fig.show()

# Graph for Common Bigrams with dark theme
fig = px.bar(x=[' '.join(bigram) for bigram, freq in common_bigrams], y=[freq for bigram, freq in common_bigrams],
             labels={'x': 'Bigrams', 'y': 'Frequency'}, title='Top 20 Common Bigrams')
fig.update_layout(**dark_theme_layout)
fig.show()

# Graph for Named Entities with dark theme
entities_data = Counter([entity.label_ for entity in doc.ents])
fig = px.bar(x=list(entities_data.keys()), y=list(entities_data.values()),
             labels={'x': 'Entity Type', 'y': 'Frequency'}, title='Frequency of Named Entity Types')
fig.update_layout(**dark_theme_layout)
fig.show()


- ### Downloading NLTK Resources: The Natural Language Toolkit (NLTK) downloader fetches 'punkt' (tokenizer models), 'stopwords' (common words that are often filtered out), and 'wordnet' (a lexical database for the English language). Lemmatization in this section is done by spaCy, so 'wordnet' is downloaded but not used by the code shown here.

- ### Loading spaCy Model: spaCy, another popular NLP library, is used to load the "en_core_web_sm" model, which is a small English language model.

- ### Reading Text from a DOCX File: A function read_text_from_docx is defined to read text from a DOCX file. The function opens the DOCX file and concatenates the text from all paragraphs.

- ### Text Preprocessing: A preprocess_text function is defined that tokenizes the text and converts it to lowercase, removes punctuation and stop words, and then lemmatizes the remaining tokens (reduces each word to its base form) using the loaded spaCy model.

- ### Keyword and Phrase Analysis: The preprocessed text is analyzed to find the most common words and bigrams (pairs of adjacent words). Counter from the collections module is used to calculate frequencies.

- ### Named Entity Recognition (NER): spaCy's NER feature is used to identify named entities (like people, places, organizations) in the text.

- ### Visualization with Plotly: Plotly Express (px) is used to create bar graphs for visualizing the results of the analysis. Three graphs are created: one for common words, one for common bigrams, and one for named entity types. A dark theme layout with blue elements is defined and applied to all graphs.

- ### Displaying Graphs: The graphs are displayed interactively in the notebook, showing the top 20 common words, bigrams, and the frequency of different types of named entities.
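
The counting step can be illustrated on a small scale. This sketch uses a hypothetical token list in place of the output of preprocess_text, and builds bigrams with `zip` as a dependency-free stand-in for `nltk.util.ngrams(tokens, 2)`, which yields the same tuples.

```python
from collections import Counter

# Hypothetical tokens standing in for the output of preprocess_text.
tokens = ["moon", "sea", "moon", "sea", "black", "moon", "sea"]

# Word frequencies, most common first.
word_freq = Counter(tokens)
top_words = word_freq.most_common(2)

# Pair each token with its successor to form bigrams.
bigrams = list(zip(tokens, tokens[1:]))
bigram_freq = Counter(bigrams)
top_bigrams = bigram_freq.most_common(1)

print(top_words)    # [('moon', 3), ('sea', 3)]
print(top_bigrams)  # [(('moon', 'sea'), 3)]
```

Since the notebook forms bigrams after stop words and punctuation have been removed, a reported pair is not necessarily adjacent in the original poem; it may have been separated by a stop word or a line break.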

## Sentiment Analysis

In [55]:
import pandas as pd
import plotly.express as px
from textblob import TextBlob

# Load the CSV file
file_path_poems_by_year = '/content/poems_by_year.csv'
poems_by_year_df = pd.read_csv(file_path_poems_by_year)

# Function to calculate sentiment polarity
def calculate_sentiment(text):
    return TextBlob(text).sentiment.polarity

# Calculate sentiment scores
poems_by_year_df['Sentiment'] = poems_by_year_df['Poem'].apply(calculate_sentiment)

# Convert 'Year' to string to include 'Juvenilia'
poems_by_year_df['Year'] = poems_by_year_df['Year'].astype(str)

# Ensure 'Juvenilia' is ordered before the numerical years
poems_by_year_df['Year'] = pd.Categorical(poems_by_year_df['Year'], categories=['Juvenilia'] + sorted(set(poems_by_year_df['Year']) - {'Juvenilia'}), ordered=True)

# Aggregate scores by year
average_sentiment_by_year = poems_by_year_df.groupby('Year')['Sentiment'].mean().reset_index()

# Creating the plot with Plotly
fig = px.line(average_sentiment_by_year, x='Year', y='Sentiment', line_shape='linear', title='Average Sentiment Score Over Time')

# Setting the line color to blue
fig.update_traces(line=dict(color='blue'))

# Adding colored markers
fig.add_scatter(x=average_sentiment_by_year['Year'], y=average_sentiment_by_year['Sentiment'], mode='markers',
                marker=dict(color=average_sentiment_by_year['Sentiment'], colorscale='Viridis', size=10))

# Adding lines and annotations for positive, neutral, and negative sentiments
fig.add_hline(y=0, line_dash="dash", line_color="white",
              annotation_text="Neutral", annotation_position="bottom right", annotation_font_color="white")
fig.add_hline(y=0.2, line_dash="dash", line_color="green",
              annotation_text="Positive", annotation_position="bottom right", annotation_font_color="green")
fig.add_hline(y=-0.2, line_dash="dash", line_color="red",
              annotation_text="Negative", annotation_position="bottom right", annotation_font_color="red")

# Changing the background color
fig.update_layout(plot_bgcolor='black', paper_bgcolor='black',
                  font_color='white', title_x=0.5)

# Updating axes titles
fig.update_layout(xaxis_title='Year', yaxis_title='Average Sentiment Score', xaxis=dict(tickangle=45))

# Displaying the plot
fig.show()

- ### Load Data: The CSV file, containing poems and their publication years, is loaded into a pandas DataFrame poems_by_year_df.

- ### Sentiment Analysis: The function calculate_sentiment is defined to calculate the sentiment polarity of a given text using TextBlob. This function is applied to the 'Poem' column of the DataFrame to create a new column 'Sentiment' containing sentiment scores for each poem.

- ### Data Preparation: The 'Year' column is converted to a string type to include non-numeric categories like 'Juvenilia'. It is then converted to an ordered categorical so that 'Juvenilia' comes first, followed by the numerical years in ascending order.

- ### Aggregating Sentiment Scores: The DataFrame is grouped by 'Year', and the mean sentiment score for each year is calculated. The resulting data is stored in average_sentiment_by_year.

- ### Creating a Line Plot with Plotly: A line plot is created to visualize the average sentiment score over time. The line is set to be blue, and markers are added to represent each year's average sentiment score, colored according to a 'Viridis' colorscale.

- ### Adding Reference Lines and Annotations: Dashed horizontal lines are drawn at 0 (labeled "Neutral"), 0.2 ("Positive"), and -0.2 ("Negative"). These reference lines help in quickly identifying where each year's average sentiment falls.

- ### Customizing the Plot Appearance: The plot and paper backgrounds are set to black and the font to white for contrast. The title is centered, and the axes titles are updated for clarity. The x-axis tick labels are rotated 45 degrees for better readability.

- ### Displaying the Plot: The plot is displayed, providing an interactive visualization of how the sentiment in Sylvia Plath's poems varied over the years, including a special focus on her early works ('Juvenilia').
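
The ordering and aggregation steps can be checked by hand on a few values. The sketch below reuses the notebook's expression for the category order but replaces the pandas groupby with plain Python; the (year, polarity) pairs are hypothetical and do not come from TextBlob.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (year, polarity) pairs; real scores come from TextBlob.
scores = [("1960", 0.25), ("Juvenilia", 0.5), ("1956", -0.25),
          ("1960", 0.75), ("Juvenilia", 0.0)]

# Same ordering expression as the notebook: 'Juvenilia' first, then the years.
years = {year for year, _ in scores}
categories = ["Juvenilia"] + sorted(years - {"Juvenilia"})

# Group the scores by year, then average each group.
by_year = defaultdict(list)
for year, polarity in scores:
    by_year[year].append(polarity)

average_by_year = [(year, mean(by_year[year])) for year in categories]
print(average_by_year)
```

The years are sorted as strings, which gives the expected chronological order as long as every year has four digits.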

## Sentiment Analysis with Standard Deviation

In [62]:
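# Imports repeated here so this section can run on its own
import pandas as pd
import plotly.graph_objects as go
from textblob import TextBlob

# Load the CSV file
poems_by_year = '/content/poems_by_year.csv'
poems_by_year_df = pd.read_csv(poems_by_year)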

# Function to calculate sentiment polarity
def calculate_sentiment(text):
    return TextBlob(text).sentiment.polarity

# Calculate sentiment scores
poems_by_year_df['Sentiment'] = poems_by_year_df['Poem'].apply(calculate_sentiment)

# Convert 'Year' to string and order it correctly
poems_by_year_df['Year'] = poems_by_year_df['Year'].astype(str)
poems_by_year_df['Year'] = pd.Categorical(poems_by_year_df['Year'], categories=['Juvenilia'] + sorted(set(poems_by_year_df['Year']) - {'Juvenilia'}), ordered=True)

# Calculate average sentiment and standard deviation for each year
sentiment_analysis_df = poems_by_year_df.groupby('Year').agg(
    Average_Sentiment=('Sentiment', 'mean'),
    Sentiment_StdDev=('Sentiment', 'std')
).reset_index()

# Creating the plot with Plotly
fig = go.Figure()

# Adding the line and markers (dots)
fig.add_trace(go.Scatter(
    x=sentiment_analysis_df['Year'],
    y=sentiment_analysis_df['Average_Sentiment'],
    mode='lines+markers',
    line=dict(color='blue'),
    marker=dict(color=sentiment_analysis_df['Average_Sentiment'], colorscale='Viridis', size=10),
    error_y=dict(type='data', array=sentiment_analysis_df['Sentiment_StdDev'], visible=True)
))

# Setting the background color to black and adjusting layout
fig.update_layout(
    title='Average Sentiment Score Over Time with Standard Deviation (Plotly Interactive)',
    xaxis=dict(title='Year', tickangle=45),
    yaxis=dict(title='Average Sentiment Score'),
    plot_bgcolor='black',
    paper_bgcolor='black',
    font=dict(color='white'),
    title_x=0.5
)

# Displaying the plot
fig.show()

- ### Standard Deviation Calculation: After the sentiment score of each poem is calculated, the scores are grouped by year and aggregated with the agg() function in pandas, which returns both the mean (Average_Sentiment) and the standard deviation (Sentiment_StdDev) for each year. The standard deviation is a statistical measure that quantifies the amount of variation or dispersion in a set of values. In this context, it tells us how much the sentiment scores of poems in a given year vary from the average sentiment score of that year. pandas computes the sample standard deviation, so a year represented by a single poem has no defined value (NaN).

- ### Interpreting Standard Deviation in this Context: A high standard deviation in a particular year indicates that the sentiments expressed in poems from that year vary widely - some poems might be very positive, others very negative, and others neutral. Conversely, a low standard deviation suggests that the poems from that year are more consistent in their sentiment - either mostly positive, mostly negative, or mostly neutral.

- ### Visualization: The standard deviation is visualized as error bars in the scatter plot created with Plotly. These error bars are attached to each year's average sentiment score marker. The presence of error bars allows viewers to not only see the average sentiment for each year but also to understand the range and variability of sentiments within that year.

- ### Importance of Including Standard Deviation: Including the standard deviation in the analysis provides a more nuanced understanding of the data. It reveals the emotional complexity and diversity in the poems across different years. For literary analysis, this could be particularly insightful. For instance, a year with a high standard deviation in sentiment might indicate a period of tumultuous or varied emotional expression in the poet's life, which could be a point of interest for researchers or enthusiasts studying the poet's work.
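
To make the interpretation concrete, the sketch below compares two hypothetical years that share the same average sentiment but differ in spread. It uses `statistics.stdev`, which computes the sample standard deviation, the same kind that pandas uses by default; the helper name `summarize` and the scores are assumptions for illustration.

```python
from statistics import mean, stdev

# Hypothetical polarity scores for two years with the same average.
consistent_year = [0.25, 0.25, 0.25, 0.25]
varied_year = [-0.5, 1.0, -0.25, 0.75]


def summarize(scores):
    """Return (mean, sample standard deviation).

    The deviation is None when fewer than two scores are available,
    mirroring the missing value pandas reports in that case.
    """
    spread = stdev(scores) if len(scores) > 1 else None
    return mean(scores), spread


print(summarize(consistent_year))  # same mean, no spread
print(summarize(varied_year))      # same mean, wide spread
print(summarize([0.5]))            # a single poem: spread is undefined
```

Both years average 0.25, so a plot of means alone would show them as identical; only the error bars reveal that the second year mixes strongly positive and negative poems.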