# ML Product Team Capstone

## The Coursera Summarizer

**The notebooks in this directory demonstrate how to build a machine learning tool to extract hyperlinks from video transcripts.**

## Introduction

Coursera course videos have two key information components: the video and its transcript. Content within the transcripts can be hard to understand and often requires further searching, such as Googling terms and reading Wikipedia pages.

## Goal
This notebook demonstrates how to create a hyperlink- and summary-based web application for video transcripts.

**Run the following notebooks to explore how we built the model and deployed it to our website.**

## 1. Create Dataset

First, we extract the raw Coursera data, which includes all of the transcripts used in the MADS program.

[Coursera Downloader Documentation](https://github.com/coursera-dl/coursera-dl)

Run the following notebook to extract the raw Coursera data.

[1-CreateDataset.ipynb](./1-CreateDataset.ipynb)

This notebook saves the course data in the folder `./data`.
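As a rough sketch of what the download step could look like outside the notebook, the helper below builds a `coursera-dl` command line and runs it. The function names and the course slug are hypothetical and not taken from the notebook; the `-u`, `-p`, and `--path` options are assumed from the coursera-dl README, so check them against the version you have installed.

```python
import subprocess


def build_download_command(username, password, course, out_dir="./data"):
    """Return the coursera-dl argument list for one course (hypothetical helper)."""
    # Flags assumed from the coursera-dl README: -u user, -p password, --path target folder.
    return ["coursera-dl", "-u", username, "-p", password, "--path", out_dir, course]


def download_course(username, password, course, out_dir="./data"):
    """Run coursera-dl; raises CalledProcessError if the download fails."""
    cmd = build_download_command(username, password, course, out_dir)
    subprocess.run(cmd, check=True)
```

Reading the password with `getpass.getpass()` rather than typing it into a cell keeps it out of the saved notebook.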

## 2. Clean Dataset

After the raw data has been gathered, we clean the dataset to keep only the course transcripts.

Run the following notebook to extract the transcripts from the raw course data.

[2-CleanDataset.ipynb](./2-CleanDataset.ipynb) 

This notebook saves the dataset in the file `./intermediate_data/transcripts_<course_name>.csv`.
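The cleaning step can be pictured roughly as follows. This is a minimal sketch, not the notebook's code: it assumes the downloaded transcripts are plain-text files ending in `.en.txt` somewhere under the course folder, and the column names `lecture` and `transcript` are hypothetical.

```python
import csv
from pathlib import Path


def collect_transcripts(course_dir, suffix=".en.txt"):
    """Gather one row per transcript file found under course_dir."""
    rows = []
    for path in sorted(Path(course_dir).rglob("*" + suffix)):
        # Collapse line breaks and repeated spaces into single spaces.
        text = " ".join(path.read_text(encoding="utf-8").split())
        if text:  # skip empty transcripts
            rows.append({"lecture": path.name[: -len(suffix)], "transcript": text})
    return rows


def write_transcripts(rows, out_path):
    """Write the rows to a CSV file, creating the parent folder if needed."""
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["lecture", "transcript"])
        writer.writeheader()
        writer.writerows(rows)
```

Calling `write_transcripts(collect_transcripts("./data/<course_name>"), "./intermediate_data/transcripts_<course_name>.csv")` would then produce a file in the layout described above.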

## 3. Transcript Summarization

The next step is to generate summaries for each of the transcripts. Here we use the Bert Extractive Summarizer and SentenceTransformers libraries to generate our transcript summaries.

[Bert Summarizer Documentation](https://pypi.org/project/bert-extractive-summarizer/#description)

[SentenceTransformers Documentation](https://www.sbert.net/)


Run the following notebook to generate transcript summaries. 



[3-TranscriptSummarization.ipynb](./3-TranscriptSummarization.ipynb)

This notebook saves the dataset in the file `./intermediate_data/transcripts_<course_name>_summaries.csv`.
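For orientation, summarizing a single transcript with the Bert Extractive Summarizer might look like the sketch below. The helper name `summarize_transcript` is hypothetical, and the `paraphrase-MiniLM-L6-v2` model is the one used in the library's README example, not necessarily the one used in the notebook.

```python
def summarize_transcript(text, num_sentences=3, model=None):
    """Return an extractive summary of `text`.

    `model` is any callable accepting (text, num_sentences=...). When omitted,
    an SBertSummarizer is created, which downloads model weights on first use.
    """
    if model is None:
        # Imported lazily so the helper can be exercised without the library.
        from summarizer.sbert import SBertSummarizer

        model = SBertSummarizer("paraphrase-MiniLM-L6-v2")
    return model(text, num_sentences=num_sentences)
```

When summarizing many transcripts, create the summarizer once and pass it in through `model` so the weights are not reloaded for every row.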

## 4. Keyword Extraction

Next, we generate keywords for each transcript using the keyphrase_vectorizers, spaCy, and KeyBERT libraries.

[KeyBERT Documentation](https://maartengr.github.io/KeyBERT/)

[KeyphraseVectorizers Documentation](https://github.com/TimSchopf/KeyphraseVectorizers)

[spaCy Documentation](https://spacy.io/)

[SentenceTransformers Documentation](https://www.sbert.net/)


Run the following notebook to generate transcript keywords. 


[4-KeywordExtraction.ipynb](./4-KeywordExtraction.ipynb)

This notebook saves the dataset in the file `./intermediate_data/transcripts_<course_name>_summaries_keywords.csv`.
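A minimal sketch of keyword extraction for one transcript, assuming KeyBERT is combined with `KeyphraseCountVectorizer` from keyphrase_vectorizers (which relies on a spaCy English model being installed). The helper name and the `top_n` default are hypothetical; the notebook may use different settings.

```python
def extract_keywords(text, top_n=5, kw_model=None, vectorizer=None):
    """Return up to top_n keyphrases for `text`, most relevant first."""
    if kw_model is None:
        # Imported lazily; KeyBERT loads a sentence-transformers model on creation.
        from keybert import KeyBERT

        kw_model = KeyBERT()
    if vectorizer is None:
        # Picks candidate phrases by part-of-speech pattern instead of fixed n-grams.
        from keyphrase_vectorizers import KeyphraseCountVectorizer

        vectorizer = KeyphraseCountVectorizer()
    pairs = kw_model.extract_keywords(docs=text, vectorizer=vectorizer, top_n=top_n)
    # KeyBERT returns (keyphrase, score) tuples for a single document; keep the phrases.
    return [phrase for phrase, _score in pairs]
```

As with the summarizer, building the `KeyBERT` model once and passing it in through `kw_model` avoids reloading it for every transcript.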

## 5. Hyperlink Generation

Once the keywords are created, we want useful web links for them. To do this, we use the wikipedia library to find a hyperlink to the Wikipedia page for each keyword.

[Wikipedia Documentation](https://wikipedia.readthedocs.io/en/latest/)

Run the following notebook to generate hyperlinks for each keyword.

[5-HyperlinkGeneration.ipynb](./5-HyperlinkGeneration.ipynb)

This notebook saves the dataset in the file `./intermediate_data/transcripts_<course_name>_summaries_keywords_urls.csv`.
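Looking up a link for one keyword could be sketched as below. The helper name `wikipedia_url` is hypothetical, and returning `None` for ambiguous or missing pages is one possible policy, not necessarily what the notebook does. The lookup needs network access.

```python
def wikipedia_url(keyword, wiki=None):
    """Return the Wikipedia URL for `keyword`, or None if no single page matches."""
    if wiki is None:
        # Imported lazily so the helper can be exercised without the library.
        import wikipedia as wiki
    try:
        # auto_suggest=False looks up the exact title instead of a suggested one.
        return wiki.page(keyword, auto_suggest=False).url
    except (wiki.exceptions.DisambiguationError, wiki.exceptions.PageError):
        # Ambiguous title or no such page: leave the keyword without a link.
        return None
```

Mapping this function over the keyword column would give the URL column of the final CSV, with empty entries where no page was found.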

## 6. Demo

With the final hyperlinks generated, the information can be deployed to a web application. We have created a website to host the application as well as our project blog.

To access the tool, follow these steps:
1. Go to the [Coursera Summarizer](https://QEJSCSCHDK5N6XTY.anvil.app/TRSK7WPDLEL4Z6NUKLZYBHLJ) site.
2. Click "Demo".
3. Enter the password "Bagel".
4. Enter one of the following URLs:
      - https://www.coursera.org/learn/siads697698/lecture/g3pi8/capstone-overview
      - https://www.coursera.org/learn/siads697698/lecture/3vwIb/how-to-do-a-standup
      - https://www.coursera.org/learn/siads697698/lecture/PBRXG/how-to-write-an-effective-blog-post
      - https://www.coursera.org/learn/siads697698/lecture/USD5r/how-to-collaborate-with-a-team
      - https://www.coursera.org/learn/siads697698/lecture/uPBqO/how-to-write-documentation-for-your-project-repository



## Version and Hardware Information
Here we use the watermark extension to print software, operating system, and hardware version information.

```python
%load_ext watermark
%watermark -v -m -p ipywidgets,matplotlib,numpy,pandas,sklearn,gensim,tqdm,nltk,collections,spacy,string,keybert,keyphrase_vectorizers,wikipedia,getpass,summarizer
```

```text
Python implementation: CPython
Python version       : 3.7.4
IPython version      : 7.8.0

ipywidgets           : 7.5.1
matplotlib           : 2.2.3
numpy                : 1.21.6
pandas               : 0.25.1
sklearn              : 1.0.2
gensim               : 4.2.0
tqdm                 : 4.64.0
nltk                 : 3.7
collections          : unknown
spacy                : 3.4.0
string               : unknown
keybert              : 0.5.1
keyphrase_vectorizers: 0.0.10
wikipedia            : 1.4.0
getpass              : unknown
summarizer           : unknown

Compiler    : Clang 4.0.1 (tags/RELEASE_401/final)
OS          : Darwin
Release     : 21.5.0
Machine     : x86_64
Processor   : i386
CPU cores   : 16
Architecture: 64bit
```



---

**Authors:** [Wei Zhou](mailto:weiwzhou@umich.edu), [Nick Capaldini](mailto:nickcaps@umich.edu), University of Michigan, August 21, 2022

---