
## Group 4 – CPH Business Intelligence Exam Project 2025

**Team Members:**
- Alberte Mary Wahlstrøm Vallentin
- Felicia Favrholdt
- Fatima Majid Shamcizadh

**GitHub Links:**
- [Analysis 1](https://github.com/AlberteVallentin/SpeakScape)
- [Analysis 2 & Streamlit App](https://github.com/FeliciaFavrholdt/Speakscape_BI_Exam_Group_4/tree/main)


## Problem Statement

How can SpeakScape provide actionable, data-driven feedback to users by analyzing their presentation text against TED Talk benchmarks to identify impactful linguistic patterns?


## Motivation

Effective public speaking plays a critical role in personal and professional success. However, most individuals lack access to high-quality, personalized feedback on their communication style.

SpeakScape aims to bridge this gap with text analytics and AI: it learns from expertly crafted TED Talks and helps users enhance their own presentations.



## Project Goals

- Identify linguistic patterns that correlate with high user engagement.
- Train a classifier to distinguish TED-style speech from user submissions.
- Generate actionable feedback to improve user presentations.
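
As a rough illustration of the first goal, the sketch below computes a Pearson correlation between one linguistic feature and an engagement metric. The feature name (`question_ratio`) and all numbers are made up for illustration and are not project data. Correlating against the logarithm of the view count is an assumption made here because view counts tend to be heavily skewed; it is not necessarily the method used in the notebooks.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values for illustration only (not taken from the TED datasets).
question_ratio = [0.02, 0.05, 0.08, 0.11, 0.15]
views = [120_000, 450_000, 900_000, 2_100_000, 5_000_000]

# Assumption: use log10(views) to dampen the effect of a few very popular talks.
log_views = [math.log10(v) for v in views]
r = pearson(question_ratio, log_views)
print(f"Pearson r between question_ratio and log10(views): {r:.3f}")
```

A coefficient near 0 would suggest the feature carries little linear signal for engagement; a high value on real data would still only indicate association, not causation.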

## Hypotheses

- Linguistic richness, clarity, and emotional appeal are more prevalent in TED Talks than in user-submitted presentations.
- Machine learning can detect these patterns and predict presentation quality.



## Impact and Beneficiaries

SpeakScape can benefit:
- Students and professionals preparing for talks.
- Educators aiming to enhance oral communication curricula.
- AI researchers exploring explainable NLP feedback mechanisms.

The feedback engine promotes better presentation design and self-improvement.


## Project Structure

The project is implemented through six modular notebooks:

1. **Problem Statement & Setup** – Define scope and prepare environment.
2. **Data Loading & Preprocessing** – Clean and standardize raw input data.
3. **Exploratory Data Analysis (EDA)** – Analyze and visualize textual patterns.
4. **Feature Engineering** – Extract meaningful linguistic features.
5. **Model Training & Evaluation** – Train models to classify TED-like content.
6. **Results & Interpretation** – Analyze model insights and propose feedback.

A Streamlit application will later provide an interactive interface for feedback.


## Environment Setup

We use standard Python libraries for data analysis, NLP, and machine learning. All paths are relative to the notebook's working directory, so the project runs without machine-specific configuration.


# SpeakScape — Presentation Feedback Powered by TED Talks

**Group 4 — CPH Business BI Exam Project 2025**  
- Alberte Mary Wahlstrøm Vallentin — cph-av169@cphbusiness.dk  
- Felicia Favrholdt — cph-ff62@cphbusiness.dk  
- Fatima Majid Shamcizadh — cph-fs156@cphbusiness.dk

## Project Title: SpeakScape

## Research Questions
1. What specific linguistic features—such as sentence complexity, pronoun usage, and rhetorical devices—are most predictive of audience engagement in TED Talks?
2. How can we effectively correlate these features with engagement metrics like view count?
3. How can we use these insights to offer personalized feedback on user-submitted texts?
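
To make the first research question more concrete, here is a minimal sketch of how a few such features could be computed from raw text using only the standard library. The function name `extract_features`, the pronoun sets, and the chosen features are assumptions for illustration; the project's actual feature engineering may differ.

```python
import re

# Hypothetical pronoun sets, chosen only for illustration.
FIRST_PERSON = {"i", "me", "my", "mine", "we", "us", "our", "ours"}
SECOND_PERSON = {"you", "your", "yours"}


def extract_features(text: str) -> dict:
    """Compute a few simple linguistic features from raw text."""
    # Rough heuristic: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Lowercased word tokens; apostrophes are kept so "don't" stays one token.
    words = re.findall(r"[a-z']+", text.lower())
    n_words = len(words) or 1  # avoid division by zero on empty input
    n_sentences = len(sentences) or 1
    return {
        "avg_sentence_length": len(words) / n_sentences,
        "first_person_ratio": sum(w in FIRST_PERSON for w in words) / n_words,
        "second_person_ratio": sum(w in SECOND_PERSON for w in words) / n_words,
        "question_ratio": sum(s.endswith("?") for s in sentences) / n_sentences,
        "type_token_ratio": len(set(words)) / n_words,
    }


sample = "Have you ever felt nervous on stage? I have. We can change that together."
features = extract_features(sample)
for name, value in features.items():
    print(f"{name}: {value:.3f}")
```

The regular-expression tokenizer is deliberately crude; a dedicated NLP library would handle abbreviations and punctuation more reliably.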


## Project Scope and Impact

This project analyzes TED Talk transcripts to learn what makes them effective. Using machine learning and text analytics, SpeakScape will:

- Identify high-impact linguistic patterns
- Benchmark user speeches against TED Talks
- Offer tailored feedback to help users improve their communication
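
One possible way to benchmark a user's text against TED Talks is to express each feature as a z-score relative to the TED reference values and turn large deviations into a message. The sketch below assumes this approach; the function names, the threshold of 1.0, and the reference numbers are hypothetical and not taken from the project.

```python
from statistics import mean, stdev
from typing import Sequence


def benchmark(user_value: float, ted_values: Sequence[float]) -> float:
    """Return the z-score of a user's feature value relative to TED reference values."""
    spread = stdev(ted_values)
    if spread == 0:
        return 0.0  # all reference values are identical, so no meaningful distance
    return (user_value - mean(ted_values)) / spread


def feedback(feature: str, z: float, threshold: float = 1.0) -> str:
    """Turn a z-score into a short message (threshold is an illustrative choice)."""
    if z > threshold:
        return f"Your {feature} is noticeably higher than in the TED reference set."
    if z < -threshold:
        return f"Your {feature} is noticeably lower than in the TED reference set."
    return f"Your {feature} is in line with the TED reference set."


# Hypothetical reference values for illustration only.
ted_sentence_lengths = [14.0, 16.0, 15.0, 17.0, 13.0]
z = benchmark(21.0, ted_sentence_lengths)
message = feedback("average sentence length", z)
print(message)
```

Whether a deviation is good or bad depends on the feature, so a real feedback engine would need feature-specific wording rather than one generic template.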

## Expected Outcomes

- A clean, annotated dataset combining TED_2017 and TED_2020
- A trained machine learning model to predict TED-likeness
- An interactive Streamlit app that gives linguistic feedback to users
- Documentation of all steps to ensure full reproducibility


In [42]:
# Import libraries
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Apply Seaborn styling (this sets both matplotlib and seaborn visuals)
sns.set_theme(style="whitegrid", palette="deep")


# Create directories for reproducibility
folders = ["../data", "../models", "../plots", "../reports"]
for folder in folders:
    os.makedirs(folder, exist_ok=True)

print("Environment setup complete.")


Environment setup complete.


## Execution Plan: BI Sprints

**Sprint 1: Problem Formulation**  
Notebook: `01_problem_statement_and_setup.ipynb`  
Focus: Define problem, goals, research questions, and project structure.  

**Sprint 2: Data Collection & Cleaning**  
Notebook: `02_data_loading_and_preprocessing.ipynb`  
Focus: Load and clean TED datasets, recover transcripts, merge into a consistent schema.  

**Sprint 3: Feature Engineering and Machine Learning**  
Notebooks: `04_feature_engineering.ipynb`, `05_model_training.ipynb`  
Focus: Extract features from transcripts, train classifiers to predict TED-likeness, and analyze important linguistic features.  

**Sprint 4: Business Application**  
Assets: `streamlit_app.py`, `06_results_and_interpretation.ipynb`  
Focus: Deploy a user-facing Streamlit application that provides actionable presentation feedback. Include visual interpretation and documentation of results.

This project follows a structured BI development lifecycle, using notebooks for reproducible analysis and Streamlit for interactive delivery.


**Tools:** Jupyter, pandas, scikit-learn, seaborn, TextBlob, Streamlit  
**Versioning:** All work is tracked via GitHub repositories linked in Wiseflow  
**Deployment:** Streamlit app with model + visualizations

In [51]:
def save_notebook_and_summary(notebook_name: str, summary: dict):
    import json
    from datetime import datetime
    from pathlib import Path

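    try:
        from IPython import get_ipython
        from IPython.display import display, Javascript

        # get_ipython() returns None outside an IPython session, so check the
        # returned shell rather than the attribute, which always exists.
        if get_ipython() is not None:
            # The JavaScript IPython.notebook object is provided by the classic
            # Notebook front end; other front ends may ignore this call.
            display(Javascript('IPython.notebook.save_checkpoint();'))
            print("Notebook save triggered.")
        else:
            print("Notebook save skipped (non-Notebook environment).")
    except Exception:
        print("Notebook save skipped (not supported in this interface).")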

    timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    # Work on a copy so the caller's dict is not modified.
    summary = {**summary, "notebook": f"{notebook_name}.ipynb", "timestamp": timestamp}

    Path("../reports").mkdir(parents=True, exist_ok=True)
    summary_path = Path("../reports") / f"{notebook_name}_summary_{timestamp}.json"

    with open(summary_path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps names such as "Wahlstrøm" readable in the file.
        json.dump(summary, f, indent=4, ensure_ascii=False)

    print(f"Summary saved to: {summary_path}")


In [53]:
save_notebook_and_summary(
    notebook_name="01_problem_statement_and_setup",
    summary={
        "description": "Defined problem statement, research questions, BI sprint plan, and initial folder setup.",
        "team_members": [
            "Alberte Mary Wahlstrøm Vallentin",
            "Felicia Favrholdt",
            "Fatima Majid Shamcizadh"
        ],
        "sprints_defined": 4,
        "folders_created": ["data", "models", "plots", "reports"]
    }
)


Notebook save triggered.
Summary saved to: ../reports/01_problem_statement_and_setup_summary_2025-05-25_23-34-03.json
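
Because each run writes a new timestamped file, a later notebook may want to read back the most recent summary. The helper below is a minimal sketch of that idea; `load_latest_summary` is a hypothetical name and not part of the project code. It assumes the file naming pattern shown in the output above.

```python
import json
from pathlib import Path
from typing import Optional


def load_latest_summary(notebook_name: str, reports_dir: str = "../reports") -> Optional[dict]:
    """Return the newest saved summary for a notebook, or None if there is none."""
    directory = Path(reports_dir)
    if not directory.is_dir():
        return None
    # The timestamp format YYYY-MM-DD_HH-MM-SS sorts chronologically as plain text,
    # so the last file name in sorted order is the most recent one.
    candidates = sorted(directory.glob(f"{notebook_name}_summary_*.json"))
    if not candidates:
        return None
    with open(candidates[-1], encoding="utf-8") as f:
        return json.load(f)


latest = load_latest_summary("01_problem_statement_and_setup")
if latest is not None:
    print(f"Latest summary written at: {latest.get('timestamp')}")
```

Sorting by file name avoids relying on file modification times, which can change when a repository is cloned or copied.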


------------------------