# Capstone

**The notebooks in this directory demonstrate how to build a machine learning tool to extract hyperlinks from video transcripts.**

## Introduction

Coursera course content videos have two key information components: the video and its transcript. Content within the transcripts can often be hard to understand and may require further searching, such as googling terms and reading Wikipedia pages.

Important topics within a transcript include:
* Key terms
* Books
* People

## Goal
This notebook demonstrates how to create a hyperlink-based web application for video transcripts.

**Run the following notebooks to explore how we built the model and deployed it to our website.**

## 1. Create Dataset

First, we extract the raw Coursera data with all of the transcripts used in the MADS program.

[Coursera Downloader Documentation](https://github.com/coursera-dl/coursera-dl)

Run the following notebook to extract the raw Coursera data.

[1-CreateDataset.ipynb](./1-CreateDataset.ipynb)

This notebook saves the dataset in the folder `./data`.

## 2. Clean Dataset

After the raw data has been gathered, we clean the dataset to only keep the course transcripts.

Run the following notebook to extract the transcripts from the raw course data.

[2-CleanDataset.ipynb](./2-CleanDataset.ipynb) 

This notebook saves the dataset in the file `./intermediate_data/transcripts.csv`.
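
As a rough illustration of this cleaning step, the sketch below walks the raw data folder and writes one row per transcript file to a CSV. The `.txt` suffix, the column names, and the `collect_transcripts` helper are assumptions made for this example; the notebook itself may organize the data differently.

```python
import csv
from pathlib import Path


def collect_transcripts(raw_dir, out_csv, suffix=".txt"):
    """Write one CSV row per non-empty transcript file found under raw_dir.

    The ".txt" suffix and the column names are assumptions for this sketch.
    Returns the number of transcripts written.
    """
    raw_dir = Path(raw_dir)
    out_csv = Path(out_csv)
    out_csv.parent.mkdir(parents=True, exist_ok=True)

    rows = 0
    with out_csv.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["course", "file", "transcript"])
        for path in sorted(raw_dir.rglob(f"*{suffix}")):
            # Collapse line breaks and repeated spaces into single spaces.
            text = " ".join(path.read_text(encoding="utf-8").split())
            if not text:
                continue  # skip empty files
            # Use the first path component under raw_dir as the course name.
            course = path.relative_to(raw_dir).parts[0]
            writer.writerow([course, path.name, text])
            rows += 1
    return rows
```

For example, `collect_transcripts("./data", "./intermediate_data/transcripts.csv")` would write the file described above, assuming the transcripts are stored as plain-text files with one top-level folder per course.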

## 3. Transcript Summarization

The next step is to generate a summary for each transcript. Here we use the Bert Extractive Summarizer and SentenceTransformers libraries to generate the summaries.

[Bert Summarizer Documentation](https://pypi.org/project/bert-extractive-summarizer/#description)

[SentenceTransformers Documentation](https://www.sbert.net/)


Run the following notebook to generate transcript summaries.

[3-TranscriptSummarization.ipynb](./3-TranscriptSummarization.ipynb)

This notebook saves the dataset with feature vectors in the file `./intermediate_data/features.json`.
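
A minimal sketch of how the summaries might be generated is shown here. It assumes the transcripts are held in a dictionary mapping a name to its text. The helpers `summarize_transcripts` and `make_bert_summarizer` and the `min_words` cutoff are hypothetical names introduced for illustration, while `Summarizer` and its `num_sentences` argument come from the bert-extractive-summarizer package.

```python
def summarize_transcripts(transcripts, summarize, min_words=30):
    """Return {name: summary} for each transcript.

    `summarize` is any callable that maps a text to a shorter text.
    Transcripts with fewer than `min_words` words are kept unchanged,
    since there is little to gain from summarizing them (an assumption).
    """
    summaries = {}
    for name, text in transcripts.items():
        if len(text.split()) < min_words:
            summaries[name] = text
        else:
            summaries[name] = summarize(text)
    return summaries


def make_bert_summarizer(num_sentences=3):
    """Build a summarizer backed by bert-extractive-summarizer.

    The import is deferred so the helper above can be used and tested
    without the package (and its model download) being available.
    """
    from summarizer import Summarizer  # pip install bert-extractive-summarizer

    model = Summarizer()
    return lambda text: model(text, num_sentences=num_sentences)
```

With the package installed, `summarize_transcripts(transcripts, make_bert_summarizer())` would return one extractive summary per transcript. Passing the summarizer in as a callable keeps the loop independent of the model, so a different backend could be swapped in without touching it.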

## 5. Keyword Extraction

## 6. Hyperlink Generation

Next, we fit a classification model using the feature vectors.

Run the following notebook to fit a machine learning model on a training set and evaluate its performance on a test set.

[3-FitModel.ipynb](./3-FitModel.ipynb)

This notebook saves the classification model in the file `./intermediate_data/classifier`.

## 7. Manual Evaluation
How accurate are the URLs that the model generates? Are they relevant? What percentage of the URLs make sense?
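
One simple way to put a number on the last question is to have a reviewer mark each generated URL as relevant or not and report the share marked relevant. The sketch below assumes that setup; the `relevance_rate` helper and the example labels are hypothetical and are not results from this project.

```python
def relevance_rate(labels):
    """Percentage of generated URLs that a reviewer marked as relevant.

    `labels` maps each URL to True (relevant) or False (not relevant).
    """
    if not labels:
        raise ValueError("no labelled URLs to evaluate")
    relevant = sum(1 for is_relevant in labels.values() if is_relevant)
    return 100.0 * relevant / len(labels)


# Hypothetical reviewer labels, for illustration only.
example_labels = {
    "https://en.wikipedia.org/wiki/Machine_learning": True,
    "https://en.wikipedia.org/wiki/Python_(programming_language)": True,
    "https://en.wikipedia.org/wiki/Mercury_(planet)": False,
    "https://en.wikipedia.org/wiki/Regression_analysis": True,
}
print(f"{relevance_rate(example_labels):.1f}% of URLs judged relevant")
```

Here three of the four example URLs are marked relevant, so the script reports 75.0%.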

## 8. Model Generalization
Does the model generalize to other course transcripts? (Perry Samsom)

## 9. Deploy Model Product Demo on Website (Optional)

Finally, we deploy the trained model from the previous step to our Capstone website using Anvil.

[4-MLProduct.ipynb](./4-MLProduct.ipynb)

## Version and Hardware Information
Here we use the watermark extension to print software, operating system, and hardware version information.

    %load_ext watermark
    %watermark -v -m -p ipywidgets,matplotlib,numpy,pandas,sklearn,gensim,tqdm,nltk,collections,spacy,string,keybert,keyphrase_vectorizers,wikipedia,getpass,Summarizer

    Python implementation: CPython
    Python version       : 3.7.4
    IPython version      : 7.8.0

    ipywidgets           : 7.5.1
    matplotlib           : 2.2.3
    numpy                : 1.21.6
    pandas               : 0.25.1
    sklearn              : 1.0.2
    gensim               : 4.2.0
    tqdm                 : 4.64.0
    nltk                 : 3.7
    collections          : unknown
    spacy                : 3.4.0
    string               : unknown
    keybert              : 0.5.1
    keyphrase_vectorizers: 0.0.10

    Compiler    : Clang 4.0.1 (tags/RELEASE_401/final)
    OS          : Darwin
    Release     : 21.5.0
    Machine     : x86_64
    Processor   : i386
    CPU cores   : 16
    Architecture: 64bit



---

**Authors:** [Wei Zhou](mailto:weiwzhou@umich.edu), [Nick Capaldini](mailto:nickcaps@umich.edu), University of Michigan, July 30, 2022

---