# Question answering from recorded videos

### Potential questions of interest to users
1) What was the summary of the talk?
    * What was the key takeaway of the talk?
    * What was the most exciting result of the project?

2) What are the clusters in a playlist like Dev Conf talks?
    * Which talks are related to each other?
    
3) How is this talk related to topic X [AI, cloud application, Linux, open source, ...]?
    * Which talks/docs are related to topic X?
    * How do I demo X, or how do I solve X?
    * How can I get started with X?

4) What are the links/webpages mentioned in the talk?


### Possible videos it can be applied to
* Dev Conf playlist on YouTube
* Espresso series
* Operate First channel videos
* DS meetup videos playlist
* OpenShift channel videos
* Open Services Group SIG and subproject meetings

### Possible approaches to get text from the data
1) Audio to text from videos
2) Metadata
    * Abstract, video description, meeting notes
3) Image to text (video frames as images)


### Tasks
1) Summarization for question 1)
2) Clustering of talks for question 2)
3) Question answering and document ranking for question 3)

# Summarization

# Clustering

## Document ranking based on the query
#### BM25 document retrieval

In [None]:
!pip install rank_bm25

In [2]:
from rank_bm25 import BM25Okapi

# Code example from package
corpus = [
    "Hello there good man!",
    "It is quite windy in London",
    "How is the weather today?"
]
tokenized_corpus = [doc.split(" ") for doc in corpus]
query = "windy London"
tokenized_query = query.split(" ")
bm25 = BM25Okapi(tokenized_corpus)
bm25.get_top_n(tokenized_query, corpus, n=1)
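
To make the ranking above less of a black box, here is a minimal sketch of the Okapi BM25 scoring formula in plain Python. It is not the `rank_bm25` implementation: the function name `bm25_scores`, the non-negative IDF variant, and the parameter values `k1=1.5` and `b=0.75` are assumptions chosen for illustration, so absolute scores may differ from the package's.

```python
import math
from collections import Counter


def bm25_scores(tokenized_corpus, tokenized_query, k1=1.5, b=0.75):
    """Return one BM25 score per document (simplified sketch, non-negative IDF)."""
    n_docs = len(tokenized_corpus)
    avg_len = sum(len(doc) for doc in tokenized_corpus) / n_docs

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for doc in tokenized_corpus:
        doc_freq.update(set(doc))

    scores = []
    for doc in tokenized_corpus:
        term_freq = Counter(doc)
        score = 0.0
        for term in tokenized_query:
            if term not in term_freq:
                continue  # terms absent from the document contribute nothing
            n_t = doc_freq[term]
            # Rare terms get a higher weight than common ones.
            idf = math.log(1 + (n_docs - n_t + 0.5) / (n_t + 0.5))
            tf = term_freq[term]
            # Length normalisation: long documents are penalised via b.
            denom = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / denom
        scores.append(score)
    return scores


corpus = [
    "Hello there good man!",
    "It is quite windy in London",
    "How is the weather today?",
]
tokenized_corpus = [doc.split(" ") for doc in corpus]
scores = bm25_scores(tokenized_corpus, "windy London".split(" "))
best = corpus[max(range(len(corpus)), key=scores.__getitem__)]
print(best)
```

Only the second document contains either query term, so it is the only one with a non-zero score. BM25 is purely lexical: a document can only score if it shares exact tokens with the query.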

In [22]:
# Application to one of the talks: each line of the transcript is treated as one document
with open('test-doc.txt', 'r') as f:
    corpus = f.readlines()
    
tokenized_corpus = [doc.split(" ") for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)
query = "What is the summary of the talk?"
tokenized_query = query.split(" ")
bm25.get_top_n(tokenized_query, corpus, n=1)

["Speaker 1    00:05:39    So Neo otics, we talked about HDM being a conceptu, similar to the neocortex. What is neocortex? Neocortex is the part that you are seeing the picture. It's the part of the brain covering the brain. It's the size of the large table napkin, which is like 50 by 50 square centimeters. That's it, uh, whatever I do, whatever. I think how I behave completely dependence on completely depends on this neocortex, whatever I'm speaking right now, whatever you are listening right now, it's, it's all because of these neocortex. And it, it, it constitute of 75% of the brain's volume, which is 2.5 millimeter thick. It contains almost a 20 billion neurons, tens of thousands of signups per neurons signups are the intersections just to be clear intersections, which, uh, which acts as a memory or some spike in the new one of the important characteristic is it is partially active.  \n"]
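
Two things may be worth trying before ranking transcript lines. Splitting on spaces keeps case and punctuation, so query tokens such as `What` and `talk?` only match identical strings, and the speaker label and timestamp end up as tokens too. The sketch below separates the metadata from the spoken text and normalises tokens. It assumes every transcript line follows the `Speaker N    HH:MM:SS    text` layout seen in the output above; the names `LINE_PATTERN`, `parse_line` and `tokenize` are hypothetical helpers, not part of `rank_bm25`.

```python
import re

# Assumed layout: "Speaker <number>  <HH:MM:SS>  <spoken text>"
LINE_PATTERN = re.compile(
    r"^(?P<speaker>Speaker \d+)\s+(?P<time>\d{2}:\d{2}:\d{2})\s+(?P<text>.*)$"
)


def parse_line(line):
    """Split one transcript line into speaker, time and text; None if it does not match."""
    match = LINE_PATTERN.match(line.strip())
    return match.groupdict() if match else None


def tokenize(text):
    """Lowercase the text and keep only runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


sample = "Speaker 1    00:05:39    What is neocortex? Neocortex is the part that you are seeing the picture.  \n"
record = parse_line(sample)
tokens = tokenize(record["text"])
query_tokens = tokenize("What is the summary of the talk?")
print(record["speaker"], record["time"], tokens[:4], query_tokens)
```

Passing `tokenize(record["text"])` for each parsed line, and `tokenize(query)` for the query, to `BM25Okapi` would be one way to plug this in. Lines that do not match the assumed layout come back as `None` and need to be skipped or handled separately.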

## Doc2Vec

In [20]:
!pip install gensim

Collecting gensim
  Downloading gensim-4.1.2-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (24.1 MB)
Collecting smart-open>=1.8.1
  Downloading smart_open-5.2.1-py3-none-any.whl (58 kB)
Collecting scipy>=0.18.1
  Downloading scipy-1.8.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (41.6 MB)
Installing collected packages: smart-open, scipy, gensim
Successfully installed gensim-4.1.2 scipy-1.8.0 smart-open-5.2.1


In [None]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

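# Tag each tokenized document with its index so Doc2Vec learns one vector per document.
# Assumption: the documents are the tokenized transcript lines (`tokenized_corpus`) from the BM25 cell above.
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(tokenized_corpus)]
model = Doc2Vec(documents, vector_size=5, window=2, min_count=1, workers=4)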
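
Once every document has a vector, ranking against a query usually comes down to comparing vectors, most commonly with cosine similarity. The sketch below shows that step in plain Python. The three-dimensional vectors are made-up values for illustration only, not output of the model above, and `cosine_similarity` and `rank_by_similarity` are hypothetical helper names. With gensim, the query vector would typically come from `model.infer_vector(tokenized_query)`.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_similarity(query_vector, doc_vectors):
    """Return (document index, similarity) pairs, most similar first."""
    scored = [(i, cosine_similarity(query_vector, vec)) for i, vec in enumerate(doc_vectors)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up document and query vectors, for illustration only.
doc_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
query_vector = [1.0, 0.1, 0.0]
ranking = rank_by_similarity(query_vector, doc_vectors)
print([index for index, _ in ranking])
```

Unlike BM25, this comparison does not require the query and the document to share exact tokens, which is the main reason to try embeddings for questions such as "How is this talk related to topic X?".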