# ToGitOrNotToGit 💀

## EDA 🗣️ `dark_stage` dataset

1. Import libraries   
2. Convert Markdown table (.md) to clean CSV  
3. Load cleaned CSV  
4. Quick overview  
    - Playwrights 🎭  
    - Sentiment analysis  
    - Period 👑  
    - Incidents & Locations      
 
---

🎭 `creators.md` → who writes  
✨ `creatures.md` → who acts  
🗣️ `dark_stage.md` → where transgression unfolds  

In [None]:
# ------------------------------------------------------------------
# 1. Import libraries
# ------------------------------------------------------------------
import pandas as pd                 # === CORE EDA ===
import numpy as np

import matplotlib.pyplot as plt     # === VISUALIZATION ===
import seaborn as sns
import plotly.express as px

import textwrap                     # === TEXT / Light NLP ===
import re
from collections import Counter
from wordcloud import WordCloud

from sklearn.feature_extraction.text import TfidfVectorizer    # === MACHINE LEARNING ===
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

from tabulate import tabulate       # === ENHANCED DISPLAY ===
from rich import print as rprint

plt.style.use("default")            # === DISPLAY SETTING ===
sns.set_theme()

In [None]:
# ------------------------------------------------------------------
# 2. Convert Markdown (.md) to clean CSV
# ------------------------------------------------------------------
md_file = '../../data/raw/dark_stage_raw_dataset.md'

df = pd.read_csv(md_file, sep='|', skiprows=[1], engine='python')       # Read Markdown as table

# Remove ghost columns (Unnamed and fully empty)
df = df.loc[:, ~df.columns.str.contains("^Unnamed")]       # Remove Unnamed columns
df = df.dropna(axis=1, how='all')                          # Remove fully empty columns

df.columns = df.columns.str.strip()                                     # Strip whitespace from column names

text_cols = df.select_dtypes(include="object").columns                  # Strip leading/trailing spaces from all text columns
df[text_cols] = df[text_cols].apply(lambda col: col.str.strip())

df = df.drop(columns=["style"], errors="ignore")                        # Drop 'style' column (all 'blank-verse')

df.to_csv('../../data/raw/dark_stage_clean.csv', index=False)        


# ------------------------------------------------------------------
# 3. Load cleaned CSV 
# ------------------------------------------------------------------
df_dark_stage = pd.read_csv("../../data/raw/dark_stage_clean.csv")
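
To make the conversion step easier to follow, here is a minimal sketch of the same idea applied to an in-memory string instead of the raw file. The two-row table `md_text` and the name `sample` are made up for illustration; only the `read_csv` arguments mirror the cell above.

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for the raw Markdown file.
md_text = """\
| author_id | intensity | style |
|-----------|-----------|-------|
| A01       | Epic      | blank-verse |
| A02       | Minor     | blank-verse |
"""

# skiprows=[1] drops the |---|---| separator line under the header.
sample = pd.read_csv(io.StringIO(md_text), sep="|", skiprows=[1], engine="python")

# The leading and trailing pipes of each row produce empty "Unnamed" columns.
sample = sample.loc[:, ~sample.columns.str.contains("^Unnamed")]
sample.columns = sample.columns.str.strip()

# Strip the padding spaces from every text column.
for name in sample.columns:
    if pd.api.types.is_string_dtype(sample[name]):
        sample[name] = sample[name].str.strip()

print(sample)
```

Note that this approach assumes no cell contains a literal `|`; a table with escaped pipes would need a proper Markdown parser.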

In [None]:
# ------------------------------------------------------------------
# 4. Quick overview
# ------------------------------------------------------------------
print(df_dark_stage.shape)      

In [None]:
print(df_dark_stage.columns)

In [None]:
df_dark_stage.info()            # info() prints its own report and returns None

In [None]:
print(df_dark_stage.head())

In [None]:
df_dark_stage.head()            # rich table view of the reloaded CSV

#### 🗣️ Columns Guide (15 columns)

=== CORE IDENTIFIERS ===
- **author_id**        : the dramatist involved, linked to `creators.csv`
- **play_id**          : the play concerned (if applicable)
- **creature_id**      : the character involved, unwitting actor in the backstage drama

=== EVENT & STORY DIMENSIONS ===
- **incident_type**    : nature of chaos (Rivalry, Duel, Censorship, Collaboration, Scandal, Witty Repartée…)
- **anecdote**         : the juicy narrative (Duels, Betrayals, Quips, Literary Feuds…)
- **intensity**        : scale of impact, measuring drama magnitude (Minor, Notable, Epic, etc.)
- **sentiment**        : emotional or moral undertone (jealousy, grudge, admiration, humiliation, cunning, etc.)
- **stage_mood**       : expressive emoji representing the backstage atmosphere (😏 😡 🤫 😈 🤔…)

=== CONTEXTUAL INFORMATION ===
- **period**           : time indication (historical period or approximate year/range)
- **location**         : where the incident occurred (Theatre, Tavern, Royal Court, London, etc.)
- **notes**            : sources, critical references, metadata, commentary backing up the tale

=== ADDITIONAL LITERARY METADATA ===
- **influence**        : cultural/literary reach : national, global, etc.
- **death_year**       : year of death of the dramatist (for timeline alignment)
- **feud_with**        : person(s) or group involved in the feud : rivals, actors, critics, authorities
- **notable_rivalry**  : named rivalry or controversy associated with the anecdote
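
As a possible sanity check, the guide above can be compared with the columns actually present in a DataFrame. This is only a sketch: `check_columns` and the `toy` frame are hypothetical names introduced here, and the column list is copied from the guide.

```python
import pandas as pd

# Column names copied from the guide above.
EXPECTED_COLUMNS = [
    "author_id", "play_id", "creature_id",
    "incident_type", "anecdote", "intensity", "sentiment", "stage_mood",
    "period", "location", "notes",
    "influence", "death_year", "feud_with", "notable_rivalry",
]


def check_columns(frame: pd.DataFrame, expected: list[str]) -> dict[str, list[str]]:
    """Return the columns that are missing from, or extra in, `frame`."""
    actual = list(frame.columns)
    return {
        "missing": [c for c in expected if c not in actual],
        "extra": [c for c in actual if c not in expected],
    }


# Hypothetical frame with one guide column missing and one unexpected column.
toy = pd.DataFrame(columns=[c for c in EXPECTED_COLUMNS if c != "notes"] + ["style"])
report = check_columns(toy, EXPECTED_COLUMNS)
print(report)
```

Calling `check_columns(df_dark_stage, EXPECTED_COLUMNS)` on the real data would flag any drift between the guide and the CSV.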

In [None]:
# ------------------------------------------------------------------
# Playwrights 🎭
# ------------------------------------------------------------------
pivot_intensity = pd.crosstab(df_dark_stage['author_id'], df_dark_stage['intensity'])
pivot_intensity = pivot_intensity.div(pivot_intensity.sum(axis=1), axis=0)                 # convert to proportions

pivot_intensity.plot(kind='bar', stacked=True, figsize=(12, 4), colormap='cividis')

plt.title("DRAMA INTENSITY per AUTHOR")
plt.ylabel("")               # no numeric label
plt.xlabel("Author")

plt.yticks([])               # remove tick labels entirely
plt.xticks(rotation=45, ha='right')
plt.show()
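
The proportions step in the cell above divides each row of the crosstab by its row total. A small sketch on made-up data (the `toy` frame is hypothetical) shows that `pd.crosstab(..., normalize="index")` gives the same row-wise shares in one call:

```python
import pandas as pd

# Hypothetical incidents: three for A01, one for A02.
toy = pd.DataFrame({
    "author_id": ["A01", "A01", "A01", "A02"],
    "intensity": ["Epic", "Minor", "Minor", "Notable"],
})

# Manual version, as in the cell above: counts divided by row totals.
counts = pd.crosstab(toy["author_id"], toy["intensity"])
manual = counts.div(counts.sum(axis=1), axis=0)

# Built-in version: normalize="index" makes every row sum to 1.
builtin = pd.crosstab(toy["author_id"], toy["intensity"], normalize="index")

print(builtin.round(2))
```

Because every bar is scaled to 1, the chart compares the mix of intensities per author, not how many incidents each author has.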

In [None]:
weights = {"minor": 1, "notable": 2, "epic": 4}

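df_dark_stage["feud_score"] = (
    df_dark_stage["intensity"].str.lower().map(weights)        # intensity weight (NaN if the level is not in `weights`)
    * df_dark_stage["feud_with"].notna().astype(int)           # 1 if a feud partner is recorded, else 0
)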

ranking = (df_dark_stage.groupby("author_id")["feud_score"].sum().sort_values(ascending=False))

print(ranking.head(10))
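
To see how the score behaves, here is a worked example on four made-up rows (the `toy` frame and its IDs are hypothetical). An incident only counts when `feud_with` is filled in, and its weight comes from the intensity level:

```python
import pandas as pd

weights = {"minor": 1, "notable": 2, "epic": 4}

# Hypothetical incidents: two of the four have no recorded feud partner.
toy = pd.DataFrame({
    "author_id": ["A01", "A01", "A02", "A02"],
    "intensity": ["Epic", "Minor", "Notable", "Epic"],
    "feud_with": ["A02", None, "A01", None],
})

# Intensity weight multiplied by a 0/1 flag for "a feud partner is recorded".
toy["feud_score"] = (
    toy["intensity"].str.lower().map(weights)
    * toy["feud_with"].notna().astype(int)
)

toy_ranking = toy.groupby("author_id")["feud_score"].sum().sort_values(ascending=False)
print(toy_ranking)
```

Here A01 scores 4 (one epic feud, the minor incident has no partner) and A02 scores 2 (one notable feud, the epic incident has no partner). An intensity level missing from `weights` maps to NaN, which `sum()` skips.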

In [None]:
epic = df_dark_stage[df_dark_stage["intensity"].str.lower() == "epic"]

pie_data = epic["author_id"].value_counts()

colors = [
    "#4b006e",  # deep purple
    "#6a0dad",  # royal purple
    "#8a2be2",  # blue-violet
    "#9b5fc0",  # amethyst
]

pie_data.plot(kind="pie", figsize=(8,8), autopct="%1.1f%%", colors=colors[:len(pie_data)])

plt.title("WHO IS INVOLVED IN THE MOST EPIC FEUDS ?")
plt.ylabel("")
plt.show()

In [None]:
import networkx as nx

feuds_df = df_dark_stage.dropna(subset=['feud_with'])

G = nx.Graph()

for _, row in feuds_df.iterrows():                           # Add edges (author ↔ feud_with)
    G.add_edge(row['author_id'], row['feud_with'])

plt.figure(figsize=(12,8))                                   # Draw graph
pos = nx.spring_layout(G, seed=42) 
nx.draw(G, pos, with_labels=True, node_color='orange', edge_color='darkred', node_size=1000, font_size=8, font_color='black', width=3)

plt.title("FEUD NETWORK, Who Feuds with Whom")
plt.show()

In [None]:
feuds_df = df_dark_stage.dropna(subset=['feud_with'])
feuds_df = feuds_df[feuds_df['intensity'].str.lower() == "epic"]          # Only 'epic' intensity + valid feud_with

top_fighters = (feuds_df['author_id'].value_counts().head(10).index)
feuds_filtered = feuds_df[feuds_df['author_id'].isin(top_fighters)]       # Keep only rows where author_id is among top fighters

G = nx.Graph()

for _, row in feuds_filtered.iterrows():
    G.add_edge(row['author_id'], row['feud_with'])

plt.figure(figsize=(10,6))
pos = nx.spring_layout(G, seed=42)

nx.draw(G, pos, with_labels=True, node_color='orange', edge_color='darkred', node_size=2000, font_size=9, font_color='black', width=3)

plt.title("EPIC FEUD NETWORK")
plt.show()

In [None]:
# ------------------------------------------------------------------
# Sentiment analysis
# ------------------------------------------------------------------
df_dark_stage['sentiment'].value_counts()

In [None]:
plt.figure(figsize=(10,5))

sns.countplot(data=df_dark_stage, y='sentiment', order=df_dark_stage['sentiment'].value_counts().index,
    hue='sentiment', palette=sns.color_palette("mako_r", n_colors=df_dark_stage['sentiment'].nunique()), dodge=False)

plt.title("DISTRIBUTION of SENTIMENTS")
plt.xlabel("Number of incidents")
plt.ylabel("Sentiment")
plt.show()

In [None]:
sentiment_author = pd.crosstab(df_dark_stage['author_id'], df_dark_stage['sentiment'])

sentiment_author.plot(kind='bar', stacked=True, figsize=(16,6), colormap='mako_r')

plt.title("CROSS-TAB AUTHOR x SENTIMENT")
plt.xlabel("Author")
plt.ylabel("Number of incidents")
plt.legend(title='Sentiment', bbox_to_anchor=(1.05, 1))
plt.xticks(rotation=45, ha='right')
plt.show()

In [None]:
sentiment_incident = pd.crosstab(df_dark_stage['incident_type'], df_dark_stage['sentiment'])

sentiment_incident.plot(kind='bar', stacked=True, figsize=(10,5), colormap='mako_r')

plt.title("CROSS-TAB of SENTIMENT x INCIDENT TYPE")
plt.xlabel("Incident Type")
plt.ylabel("Number of incidents")
plt.legend(title='Sentiment', bbox_to_anchor=(1.05, 1))
plt.xticks(rotation=45, ha='right')
plt.show()

In [None]:
# ------------------------------------------------------------------
# Period 👑
# ------------------------------------------------------------------
sentiment_period = pd.crosstab(df_dark_stage['period'], df_dark_stage['sentiment'])

sentiment_period.plot(kind='bar', stacked=True, figsize=(10,6), colormap='mako_r')

plt.title("CROSS-TAB PERIOD x SENTIMENT (Elizabethan vs Jacobean)")
plt.xlabel("")               # period names on the ticks are self-explanatory
plt.ylabel("Number of incidents")
plt.legend(title='Sentiment', bbox_to_anchor=(1.05, 1))
plt.xticks(rotation=45, ha='right')
plt.show()

In [None]:
plt.figure(figsize=(12,5))
sns.countplot(data=df_dark_stage, x='incident_type', hue='period', palette='mako_r')

plt.title("INCIDENT TYPES by PERIOD")
plt.xlabel("")               # incident types on the ticks are self-explanatory
plt.ylabel("Number of incidents")
plt.legend(title="Period")
plt.xticks(rotation=45, ha='right')
plt.show()

In [None]:
# ------------------------------------------------------------------
# Incidents & Locations
# ------------------------------------------------------------------
df_dark_stage['incident_type'].nunique()

In [None]:
df_dark_stage['incident_type'].unique()

In [None]:
# Group by incident_type and count occurrences
incident_counts = df_dark_stage['incident_type'].value_counts().sort_values(ascending=False)

plt.figure(figsize=(8,6))
incident_counts.plot(
    kind='barh',
    color=plt.cm.plasma(np.linspace(0, 1, len(incident_counts))),         # one colour per bar from the plasma colormap
)
plt.xlabel("Count of Incidents")
plt.ylabel("Incident Type")
plt.title("DISTRIBUTION of INCIDENT TYPES")
plt.gca().invert_yaxis()  
plt.show()

In [None]:
location_counts = df_dark_stage['location'].value_counts()             

plt.figure(figsize=(8,4))
location_counts.plot(kind='bar', color=plt.cm.viridis(np.linspace(0,1,len(location_counts))))

plt.xlabel("Location")                  # kind='bar' puts the categories on the x-axis
plt.ylabel("Number of incidents")
plt.title("INCIDENT COUNT by LOCATION")
plt.show()

In [None]:
theatre_incidents = df_dark_stage[df_dark_stage['location'] == 'Theatre']
theatre_authors = theatre_incidents['author_id'].unique()
print("AUTHORS INVOLVED in THEATRE INCIDENTS :", theatre_authors)

In [None]:
tavern_incidents = df_dark_stage[df_dark_stage['location'] == 'Tavern']
tavern_authors = tavern_incidents['author_id'].unique()
print("AUTHORS INVOLVED in TAVERN INCIDENTS :", tavern_authors)
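
The Theatre and Tavern cells filter one location at a time. As an alternative sketch, a single `groupby` can list the authors for every location at once; the `toy` frame below is made up for illustration.

```python
import pandas as pd

# Hypothetical incidents across two locations.
toy = pd.DataFrame({
    "location": ["Theatre", "Theatre", "Tavern", "Theatre"],
    "author_id": ["A01", "A02", "A01", "A01"],
})

# One entry per location, holding the distinct authors seen there.
authors_by_location = toy.groupby("location")["author_id"].unique()

for location, authors in authors_by_location.items():
    print(f"{location}: {sorted(authors)}")
```

On the real data the same call would be `df_dark_stage.groupby("location")["author_id"].unique()`.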

In [None]:
import matplotlib.ticker as mtick       # Provides tools to format and control axis tick marks and labels in Matplotlib

g = sns.FacetGrid(df_dark_stage, col="location", col_wrap=3, height=4, sharey=False)       # Create FacetGrid : one subplot per location

g.map_dataframe(sns.countplot, x="incident_type", hue="incident_type", palette="tab20", legend=False)   # Plot countplot in each subplot

g.set_xticklabels(rotation=45)                        # Rotate x-axis labels and set titles
g.set_axis_labels("Incident Type", "Count")
g.set_titles(col_template="{col_name}")

for ax in g.axes.flat:                                # Round y-axis labels to 1 decimal
    ax.yaxis.set_major_formatter(mtick.FormatStrFormatter('%.1f'))
     
    for p in ax.patches:                              # Add counts on top of each bar
        height = p.get_height()
        ax.text(x=p.get_x() + p.get_width()/2, y=height + 0.1, s=f"{int(height)}", ha='center')

plt.tight_layout()
plt.show()

In [None]:
print("LONDON")
print(df_dark_stage[df_dark_stage['location'] == 'London']['incident_type'].value_counts())
print()

In [None]:
print("ROYAL COURT")
print(df_dark_stage[df_dark_stage['location'] == 'Royal Court']['incident_type'].value_counts())
print()

In [None]:
print("THEATRE")
print(df_dark_stage[df_dark_stage['location'] == 'Theatre']['incident_type'].value_counts())
print()

In [None]:
print("TAVERN")
print(df_dark_stage[df_dark_stage['location'] == 'Tavern']['incident_type'].value_counts())
print()

In [None]:
print("COURT")
print(df_dark_stage[df_dark_stage['location'] == 'Court']['incident_type'].value_counts())
print()

In [None]:
total_counts = df_dark_stage['incident_type'].value_counts()
print("TOTAL")
print(total_counts)
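
The per-location counts and the overall total printed above can also be read from one table. A minimal sketch, using a made-up `toy` frame, builds a location × incident type crosstab with a margin row and column:

```python
import pandas as pd

# Hypothetical incidents across three locations.
toy = pd.DataFrame({
    "location": ["Theatre", "Theatre", "Tavern", "Court"],
    "incident_type": ["Duel", "Rivalry", "Duel", "Censorship"],
})

# margins=True appends row and column totals under the given name.
by_location = pd.crosstab(
    toy["incident_type"],
    toy["location"],
    margins=True,
    margins_name="TOTAL",
)
print(by_location)
```

Each location column matches what the individual `value_counts()` cells report, except that incident types absent from a location show up as 0 instead of being omitted.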