# Notebook 01 - Problem Statement And Setup

### SpeakScape — Presentation Feedback Powered by TED Talks

**Group 4, l25dat4bi1f**  
CPH Business Academy Lyngby  
Exam Project 2025

### Collaborators  
- Alberte Mary Wahlstrøm Vallentin — cph-av169@cphbusiness.dk  
- Felicia Favrholdt — cph-ff62@cphbusiness.dk  
- Fatima Majid Shamcizadh — cph-fs156@cphbusiness.dk

### Project Title
SpeakScape

### GitHub Links
- [Analysis 1](https://github.com/AlberteVallentin/SpeakScape)
- [Analysis 2 & Streamlit App](https://github.com/FeliciaFavrholdt/Speakscape_BI_Exam_Group_4/tree/main)

### Problem Statement
How can SpeakScape provide actionable, data-driven feedback to users by analyzing their presentation text against TED Talk benchmarks to identify impactful linguistic patterns?

### Research Questions
1. What specific linguistic features—such as sentence complexity, pronoun usage, and rhetorical devices—are most predictive of audience engagement in TED Talks?
2. How can we effectively correlate these features with engagement metrics like view count?
3. How can we use these insights to offer personalized feedback on user-submitted texts?
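
One possible way to approach research question 2 is a rank correlation, since view counts tend to be heavily skewed and a rank-based measure is less sensitive to a few viral talks. The sketch below is an illustration only, not the project's actual analysis: the function names `average_ranks`, `pearson`, and `spearman` are hypothetical, and it uses only the standard library so it runs without the project's dependencies.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(feature_values, view_counts):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(feature_values), average_ranks(view_counts))
```

A value near +1 or -1 would suggest a strong monotonic relationship between a feature and view count; it would not by itself show that the feature causes engagement.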


### Motivation

Effective public speaking plays a critical role in personal and professional success. However, most individuals lack access to high-quality, personalized feedback on their communication style.

By leveraging text analytics and AI, SpeakScape aims to bridge this gap by learning from expertly crafted TED Talks and helping users enhance their own presentations.



### Project Goals

- Identify linguistic patterns that correlate with high user engagement.
- Train a classifier to distinguish TED-style speech from user submissions.
- Generate actionable feedback to improve user presentations.

### Hypotheses

- Linguistic richness, clarity, and emotional appeal are more prevalent in TED Talks.
- Machine learning can detect these patterns and predict presentation quality.



---------------------------------

## Project Scope and Impact

SpeakScape is a machine learning–driven system designed to analyze TED Talk transcripts in order to understand the linguistic elements that contribute to effective public speaking. By leveraging natural language processing (NLP) and engagement metrics, the project aims to provide meaningful, personalized feedback on presentation content.

### Key Objectives

- Detect high-impact linguistic patterns common to successful TED Talks
- Benchmark user-generated transcripts against TED standards
- Deliver actionable feedback to enhance the clarity, structure, and impact of user speeches

---

## Expected Outcomes

The project will deliver the following components:

- A cleaned, unified dataset merging TED_2017 and TED_2020, annotated for analysis
- A trained machine learning model capable of predicting TED-likeness or engagement potential
- A fully functional Streamlit application that allows users to upload presentations and receive detailed linguistic feedback
- Well-documented, reproducible code and workflows across modular Jupyter notebooks

---

## Impact and Beneficiaries

SpeakScape provides value across multiple user groups:

- **Students and professionals** who need to prepare for presentations, pitches, or interviews
- **Educators** who teach communication skills and seek scalable tools for assessment and feedback
- **AI and NLP researchers** interested in transparent, explainable models of language effectiveness

By aligning speech content with proven TED Talk benchmarks, SpeakScape empowers users to improve their public speaking through structured, data-backed feedback—encouraging more engaging and effective communication.


---------------------------------------

## Brief Annotation

**1. Which challenge would you like to address?**

We aim to address the challenge of providing personalized, data-driven feedback to public speakers—particularly students and professionals—by analyzing how their presentation content compares to highly engaging TED Talks. The core challenge is helping users understand what makes a talk effective from a linguistic and structural perspective.

**2. Why is this challenge an important or interesting research goal?**
    
Strong communication is a critical skill in education, business, and public speaking, yet it's rarely evaluated with objective, content-based tools. This research is interesting because it bridges natural language processing (NLP) and audience engagement analytics to make expert-level feedback scalable and automated. By using TED Talks as a benchmark, we explore the linguistic patterns that correlate with high impact.

**3. What is the expected solution your project would provide?**
    
Our solution will produce a machine learning model trained on TED data to detect linguistic traits that align with high engagement. Combined with a Streamlit app, the system allows users to upload their own presentation text and receive feedback on clarity, complexity, rhetorical style, and TED-likeness. The model highlights how closely their content aligns with successful talks and where they can improve.

**4. What would be the positive impact of the solution, and which category of users could benefit from it?**

The solution empowers non-expert speakers—including students, educators, startup founders, and professionals—to refine their content using empirical, explainable language metrics. It supports inclusive, scalable skill development in a way that's usually only available through costly 1-on-1 coaching. Ultimately, it democratizes access to communication feedback.

----------------------------------------

## Notebooks

The project is implemented through six modular notebooks:

- **01_Problem_Statement_and_Setup**  
   Define the research goals, problem formulation, and prepare the working environment.

- **02_Dataset_Cleaning_Overview**  
   Merge TED datasets, clean raw fields, drop irrelevant columns, and document preprocessing logic.

- **03_Data_Loading_and_Preprocessing**  
   Normalize transcripts, extract linguistic features (e.g., word count, readability), and save a ready-to-model dataset.

- **04_Exploratory_Data_Analysis (EDA)**  
   Visualize linguistic patterns and engagement metrics to guide feature engineering.

- **05_Model_Training_and_Evaluation**  
   Train classification models to detect TED-like linguistic features and predict engagement.

- **06_Results_and_Interpretation**  
   Interpret model outputs, extract feature importance, and design content feedback strategies.
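
To make the feature-extraction step of notebook 03 more concrete, here is a minimal sketch of how a few surface features could be computed from a transcript. It is an assumption-laden illustration: `extract_features` and the pronoun list are hypothetical, and simple regular expressions stand in for the NLTK tokenization the project lists among its libraries.

```python
import re

# Hypothetical pronoun list, used here only to illustrate a ratio feature.
FIRST_SECOND_PERSON = {"i", "me", "my", "we", "us", "our", "you", "your"}


def extract_features(text):
    """Compute a few simple surface features from a transcript string."""
    # Rough heuristic: split on runs of sentence-ending punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Lowercased word tokens made of letters and apostrophes.
    words = re.findall(r"[A-Za-z']+", text.lower())
    word_count = len(words)
    pronouns = sum(1 for w in words if w in FIRST_SECOND_PERSON)
    return {
        "word_count": word_count,
        "sentence_count": len(sentences),
        "avg_sentence_length": word_count / len(sentences) if sentences else 0.0,
        "pronoun_ratio": pronouns / word_count if word_count else 0.0,
    }
```

Returning a plain dictionary keeps the function easy to apply row by row and to convert into DataFrame columns later.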

------------------------------

### Streamlit Application

A separate **Streamlit web application** is used to deliver feedback interactively, allowing users to compare their transcripts against TED benchmarks.
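
As a rough illustration of what comparing a transcript against TED benchmarks could look like, the sketch below standardises each user feature against a benchmark mean and standard deviation and turns the result into a short message. This z-score approach is a simple stand-in, not the project's model-based feedback; `benchmark_feedback`, the feature names, and the example numbers are all hypothetical.

```python
def benchmark_feedback(user_features, benchmarks, threshold=1.0):
    """Compare user features with benchmark (mean, std) pairs.

    Returns one feedback string per feature that has a usable benchmark.
    """
    messages = []
    for name, value in user_features.items():
        if name not in benchmarks:
            continue  # no benchmark available for this feature
        mean, std = benchmarks[name]
        if std == 0:
            continue  # cannot standardise against a constant benchmark
        z = (value - mean) / std
        if z > threshold:
            messages.append(f"{name}: higher than the benchmark (z = {z:.2f})")
        elif z < -threshold:
            messages.append(f"{name}: lower than the benchmark (z = {z:.2f})")
        else:
            messages.append(f"{name}: within the benchmark range (z = {z:.2f})")
    return messages


# Placeholder numbers for illustration only; they are NOT measured from TED data.
EXAMPLE_BENCHMARKS = {
    "avg_sentence_length": (15.0, 5.0),
    "pronoun_ratio": (0.10, 0.02),
}
```

In a Streamlit app, the returned strings could be shown one per line next to the uploaded text; the threshold of one standard deviation is an arbitrary starting point.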


--------------------------------

## Execution Plan: BI Sprints

**Sprint 1: Problem Formulation**  
Notebook: `01_problem_statement_and_setup.ipynb`  
Focus: Define problem, goals, research questions, and project structure.  

**Sprint 2: Data Collection & Cleaning**  
Notebooks: `02_dataset_cleaning_overview.ipynb`, `03_data_loading_and_preprocessing.ipynb`  
Focus: Load and clean TED datasets, recover transcripts, merge into a consistent schema.  

**Sprint 3: Feature Engineering and Machine Learning**  
Notebooks: `04_exploratory_data_analysis.ipynb`, `05_model_training_and_evaluation.ipynb`  
Focus: Extract features from transcripts, train classifiers to predict TED-likeness, and analyze important linguistic features.  

**Sprint 4: Business Application**  
Assets: `streamlit_app/app.py`, `06_results_and_interpretation.ipynb`  
Focus: Deploy a user-facing Streamlit application that provides actionable presentation feedback. Include visual interpretation and documentation of results.

This project follows a structured BI development lifecycle, using notebooks for reproducible analysis and Streamlit for interactive delivery.

#### Timeline & Milestones
 
| Milestone                             | Deliverables                                      |
|---------------------------------------|---------------------------------------------------|
| Define problem & gather data          | Project scope, TED_2017 + TED_2020 datasets       |
| Dataset cleaning & unification        | `cleaned_data.csv`, Notebook 02                   |
| Preprocessing & feature extraction    | `preprocessed_data.csv`, Notebook 03              |
| Exploratory Data Analysis (EDA)       | Plots & correlation insights, Notebook 04         |
| Feature engineering + model training  | Model pipeline, Notebook 05                       |
| Results interpretation + Streamlit UI | Final models, Notebook 06 + Streamlit app         |
| Team review, documentation, hand-in   | GitHub updated, final `.md` summary + PDF export  |


#### Team Member Engagement

| Member    | Tasks                                         |
|-----------|------------------------------------------------------|
| Alberte   |       |
| Felicia   | |
| Fatima    | |

All members are involved in review and testing before deliverables are pushed to GitHub or submitted.


## Project Directory Structure

The project is organized as a modular, reproducible data science workflow:

SpeakScape_Analysis/
│
├── data/                  
│   ├── combined_dataset.csv         # Merged TED 2017 + 2020 data
│   ├── cleaned_data.csv             # Output from cleaning pipeline
│   └── preprocessed_data.csv        # Feature-rich version for modeling
│
├── models/               
│   └── *.pkl / *.joblib             # Trained ML models
│
├── notebooks/           
│   ├── 01_problem_statement_and_setup.ipynb
│   ├── 02_dataset_cleaning_overview.ipynb
│   ├── 03_data_loading_and_preprocessing.ipynb
│   ├── 04_exploratory_data_analysis.ipynb
│   ├── 05_model_training_and_evaluation.ipynb
│   └── 06_results_and_interpretation.ipynb
│
├── plots/                 
│   └── *.png                         # All visualizations generated in EDA and preprocessing
│
├── reports/               
│   ├── *.json                       # Notebook summaries
│   └── Exam_Group4_SpeakScape_Summary.md
│
├── streamlit_app/       
│   ├── app.py                       # Main app file
│   └── utils.py / model_loaders.py  # App logic and prediction helpers
│
├── utils/
│   └── setup.py / save_tools.py     # Reusable functions for all notebooks
│
├── .gitignore
├── requirements.txt
└── README.md

### Workflow Procedures

- Notebooks save processed files (`.csv`) and `.json` summaries to ensure reproducibility.
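- A JavaScript call to `IPython.notebook.save_checkpoint()` saves the notebook state when the save cell runs. This call belongs to the classic Jupyter Notebook front end and may not be available in other front ends such as JupyterLab or VS Code.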
- Preprocessed data is versioned (e.g., `cleaned_data_v1.csv`).
- ML models are saved using `joblib` for app integration.
- Final review and packaging are handled before exam hand-in.
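
The summary-saving step can be sketched roughly as follows. The project's own helper lives in `utils/save_tools.py` and its implementation is not shown here, so `save_summary` below is a hypothetical stand-in. It assumes the timestamped naming pattern visible in this notebook's output (`<notebook>_summary_<date>_<time>.json`) and uses only the standard library.

```python
import json
from datetime import datetime
from pathlib import Path


def save_summary(notebook_name, summary, reports_dir="../reports"):
    """Write `summary` as JSON to a timestamped file and return its path."""
    out_dir = Path(reports_dir)
    # Create the reports folder if it does not exist yet.
    out_dir.mkdir(parents=True, exist_ok=True)
    timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    out_path = out_dir / f"{notebook_name}_summary_{timestamp}.json"
    with out_path.open("w", encoding="utf-8") as f:
        # ensure_ascii=False keeps names such as "Wahlstrøm" readable in the file.
        json.dump(summary, f, indent=2, ensure_ascii=False)
    return out_path
```

Including the timestamp in the file name means repeated runs never overwrite an earlier summary, at the cost of accumulating files in `reports/`.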

------------------------------

## Environment Setup

We use standard Python libraries for data analysis, natural language processing, and machine learning. All paths are defined relative to the `notebooks/` directory so the project runs unchanged across machines and platforms.

### Libraries Used

- **Pandas** — Data manipulation and DataFrame operations  
- **NumPy** — Numerical computations  
- **Matplotlib & Seaborn** — Data visualization  
- **NLTK** — Tokenization, lemmatization, stopword removal  
- **Scikit-learn** — Modeling, metrics, and preprocessing tools  
- **Joblib** — Model persistence  
- **Streamlit** — Interactive web application for feedback delivery

### Setup Procedures

1. Create a virtual environment using Anaconda or `venv`
2. Install dependencies:

```bash
pip install -r requirements.txt
```


#### Development Tools

- **IDE:** Visual Studio Code (VS Code) with Jupyter support
- **Version Control:** Git (GitHub remote)
- **Package Management:** Anaconda / pip (via `requirements.txt`)
- **Notebook Environment:** Jupyter Notebooks, Streamlit for deployment


### Platform Requirements

- **Python** 3.9 or higher  
- **Jupyter Notebook** (or Visual Studio Code with Jupyter extension)  
- **Streamlit** version 1.20 or later  

All notebooks are executed from within the `notebooks/` directory. Output files such as datasets, visualizations, and model summaries are saved to corresponding subdirectories located one level up for consistency across the project.
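
A minimal sketch of how this relative layout could be expressed in code is shown below. It assumes the working directory is `notebooks/`; the constant names and the helper `add_project_root_to_path` are hypothetical, and whether the project itself adjusts `sys.path` this way is not shown in this notebook.

```python
import sys
from pathlib import Path

# Assumption: the notebook's working directory is notebooks/,
# so the project root is one level up.
PROJECT_ROOT = Path("..")

DATA_DIR = PROJECT_ROOT / "data"
MODELS_DIR = PROJECT_ROOT / "models"
PLOTS_DIR = PROJECT_ROOT / "plots"
REPORTS_DIR = PROJECT_ROOT / "reports"


def add_project_root_to_path(root=PROJECT_ROOT):
    """Make top-level packages such as utils/ importable from notebooks/."""
    root_str = str(root.resolve())
    if root_str not in sys.path:
        sys.path.insert(0, root_str)
    return root_str
```

Defining the folders once as `Path` objects avoids repeating string literals such as `"../data"` in every notebook.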

------------------------------------

### Project Initialization

In [68]:
from utils.setup import init_environment
init_environment()

Environment initialized.


### Initialization Function

The `init_environment()` function performs the following setup tasks:

- Applies consistent Seaborn and Matplotlib visual styles
- Verifies the existence of required directories: `../data`, `../plots`, `../models`, and `../reports`
- Ensures a clean, reproducible environment for all notebooks

-------------------

In [72]:
# Import libraries
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Apply Seaborn styling (this sets both matplotlib and seaborn visuals)
sns.set_theme(style="whitegrid", palette="deep")


# Create directories for reproducibility
folders = ["../data", "../models", "../plots", "../reports"]
for folder in folders:
    os.makedirs(folder, exist_ok=True)

print("Environment setup complete.")

Environment setup complete.


In [74]:
from utils.save_tools import save_notebook_and_summary

save_notebook_and_summary(
    notebook_name="01_problem_statement_and_setup",
    summary={
        "description": "Established the project scope and objectives for SpeakScape, outlined key research questions, defined the BI sprint structure, and initialized the working environment.",
        "team_members": [
            "Alberte Mary Wahlstrøm Vallentin",
            "Felicia Favrholdt",
            "Fatima Majid Shamcizadh"
        ],
        "sprints_defined": 4,
        "notebooks_planned": [
            "01_problem_statement_and_setup",
            "02_dataset_cleaning_overview",
            "03_data_loading_and_preprocessing",
            "04_exploratory_data_analysis",
            "05_model_training_and_evaluation",
            "06_results_and_interpretation"
        ],
        "folders_created": ["data", "models", "plots", "reports"],
        "tools_used": [
            "Python 3.9", "Jupyter Notebook", "VS Code", "GitHub", "Streamlit"
        ]
    }
)


<IPython.core.display.Javascript object>

Notebook save triggered.
Summary saved to: ../reports/01_problem_statement_and_setup_summary_2025-05-26_03-38-27.json


------------------------