# Project

In this project, you will bring together many of the tools and techniques that you have learned throughout this course. You can choose from many different paths to reach the solution.

### Business scenario

You work for a training organization that recently developed an introductory course about machine learning (ML). The course includes more than 40 videos that cover a broad range of ML topics. You have been asked to create an application that students can use to quickly locate and view video content by searching for topics and key phrases.

You have downloaded all of the videos to an Amazon Simple Storage Service (Amazon S3) bucket. Your assignment is to produce a dashboard that meets your supervisor's requirements.

## Project steps

To complete this project, you will follow these steps:

1. [Viewing the video files](#1.-Viewing-the-video-files)
2. [Transcribing the videos](#2.-Transcribing-the-videos)
3. [Normalizing the text](#3.-Normalizing-the-text)
4. [Extracting key phrases and topics](#4.-Extracting-key-phrases-and-topics)
5. [Creating the dashboard](#5.-Creating-the-dashboard)

## Useful information

The following cell contains some information that might be useful as you complete this project.

In [1]:
bucket = "c56161a939430l3396553t1w744137092661-labbucket-rn642jaq01e9"
job_data_access_role = 'arn:aws:iam::744137092661:role/service-role/c56161a939430l3396553t1w7-ComprehendDataAccessRole-1P24MSS91ADHP'

In [None]:
#Installing all the necessary libraries
!pip install moviepy
!pip install SpeechRecognition
!pip install pocketsphinx
!pip install contractions
!pip install keybert
!pip install gradio

# Import all the libraries used in this notebook
import subprocess
import boto3
import os
import moviepy.editor as mp
import speech_recognition as sr
import pocketsphinx
import shutil
import contractions
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import SnowballStemmer
from nltk.stem import WordNetLemmatizer
import re, string
from nltk.corpus import stopwords
from keybert import KeyBERT
import gradio as gr

## 1. Viewing the video files
([Go to top](#Project))


The source video files are located in the following shared Amazon Simple Storage Service (Amazon S3) bucket.

In [None]:
!aws s3 ls s3://aws-tc-largeobjects/CUR-TF-200-ACMNLP-1/video/

## 2. Transcribing the videos
([Go to top](#Project))

Use this section to implement your solution to transcribe the videos. 

In [None]:
# Write your answer/code here

# Build a list of the video file names in the shared bucket
import subprocess
aws_command = "aws s3 ls s3://aws-tc-largeobjects/CUR-TF-200-ACMNLP-1/video/"
output = subprocess.check_output(aws_command, shell=True).decode()

movie_list = []
for line in output.strip().split('\n'):
    line = line.strip()
    # Skip sub-folder entries, which are listed as "PRE <name>/"
    if line.startswith("PRE "):
        continue
    # Each object line is "<date> <time> <size> <name>". Limiting the split
    # to three separators keeps file names that contain spaces intact.
    parts = line.split(None, 3)
    if len(parts) == 4:
        movie_list.append(parts[3])
print(movie_list)

In [None]:
# Download the videos from the S3 bucket into a local "video" folder
import boto3
import os
s3 = boto3.resource("s3")
b = s3.Bucket("aws-tc-largeobjects")
obj_key = "CUR-TF-200-ACMNLP-1/video/"
os.makedirs("video", exist_ok=True)  # download_file does not create missing folders
for i in movie_list:
    object_key = obj_key + i
    video_file = os.path.join("video", i)
    print(object_key)
    b.download_file(Key=object_key, Filename=video_file)

In [None]:
# Convert each video into a WAV audio file
import moviepy.editor as mp
import speech_recognition as sr
file = "video/"
for i in movie_list:
    video_file = file + i
    clip = mp.VideoFileClip(video_file)
    audio = clip.audio
    audio.write_audiofile(f"{i}.wav")
    clip.close()  # release the file handles held by the video reader

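Before transcribing, it can help to confirm that each extracted WAV file is readable and has a plausible duration. The following is a minimal sketch that uses only the standard-library `wave` module. `describe_wav` is a hypothetical helper name, not part of the solution above, and it assumes the files are uncompressed PCM WAV.

```python
import os
import tempfile
import wave

def describe_wav(path):
    """Return channel count, sample rate, and duration (seconds) of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "duration_s": frames / rate if rate else 0.0,
        }

# Demo on a small synthetic file so the sketch runs without any video data.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)                    # 16-bit samples
    out.setframerate(16000)
    out.writeframes(b"\x00\x00" * 16000)   # one second of silence
print(describe_wav(demo_path))
```

In the notebook, the same helper could be called on each `f"{i}.wav"` file to spot empty or truncated audio before the slower transcription step.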
In [None]:
# Transcribe each audio file and save the text
import os
import pocketsphinx
import speech_recognition as sr
recognizer = sr.Recognizer()
folder_path = "text_files/"
os.makedirs(folder_path, exist_ok=True)  # create the output folder if it is missing
for i in movie_list:
    file_name = folder_path + i + ".txt"
    audio_file = i + ".wav"
    with sr.AudioFile(audio_file) as source:
        print("Transcribing audio: ", audio_file)
        data = recognizer.record(source)
    text = recognizer.recognize_sphinx(data)
    # The with block closes the file, so the transcript is flushed to disk
    with open(file_name, "w") as txt_file:
        txt_file.write(text)

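The cell above reads each recording in a single call, so a long video is held in memory as one block and a recognition failure loses the whole transcript. One option is to process the audio in fixed-length windows. The sketch below only computes the windows. `chunk_windows` is a hypothetical helper, and the 60-second default is an arbitrary assumption rather than a recommended value.

```python
def chunk_windows(total_seconds, window_seconds=60.0):
    """Return (offset, duration) pairs that cover total_seconds without overlap."""
    total_seconds = float(total_seconds)
    window_seconds = float(window_seconds)
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    windows = []
    offset = 0.0
    while offset < total_seconds:
        # The last window is shortened so it ends exactly at the end of the audio.
        duration = min(window_seconds, total_seconds - offset)
        windows.append((offset, duration))
        offset += window_seconds
    return windows

print(chunk_windows(150, 60))  # [(0.0, 60.0), (60.0, 60.0), (120.0, 30.0)]
```

Because `Recognizer.record` accepts `offset` and `duration` arguments, each pair could be passed to a separate `record` call on a freshly opened `sr.AudioFile`, and the recognized pieces joined with spaces. Words that straddle a window boundary may be split, so this trades some accuracy for robustness.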
## 3. Normalizing the text
([Go to top](#Project))

Use this section to perform any text normalization steps that are necessary for your solution.

In [None]:
# Write your answer/code here

# Apply text preprocessing to every transcript
import os
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import re, string
import contractions
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

def text_normalizing(text):
    # Expand contractions ("don't" -> "do not") before removing punctuation
    text = " ".join(contractions.fix(word) for word in text.split())
    text = text.lower()
    punctuation = re.compile('[%s]' % re.escape(string.punctuation))
    text = punctuation.sub('', text).strip()
    # A set makes the stop-word membership test fast
    stop_word = set(stopwords.words('english'))
    normalized_text = [token for token in word_tokenize(text) if token not in stop_word]
    # Reduce each remaining token to its dictionary form
    lm = WordNetLemmatizer()
    final_text = " ".join(lm.lemmatize(token) for token in normalized_text)
    return final_text


os.makedirs("normalized_text_files", exist_ok=True)  # create the output folder if it is missing
for i in sorted(os.listdir("text_files/")):
    with open("text_files/" + i, "r") as file:
        data = text_normalizing(file.read())
    with open("normalized_text_files/" + i, "w") as normalized_file:
        normalized_file.write(data)

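To see what each normalization stage contributes, the sketch below applies the same order of operations (lowercase, strip punctuation, drop stop words) to one sentence and keeps the intermediate results. It is a simplified stand-in, not the notebook's function: it uses only the standard library, a tiny hand-picked stop-word set instead of the NLTK list, whitespace splitting instead of `word_tokenize`, and it omits contraction expansion and lemmatization. `DEMO_STOP_WORDS` and `normalize_steps` are hypothetical names.

```python
import re
import string

# Assumption: a tiny illustrative stop-word set, far smaller than NLTK's English list.
DEMO_STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in", "that"}

PUNCTUATION = re.compile("[%s]" % re.escape(string.punctuation))

def normalize_steps(text):
    """Return the text after each stage so the effect of every step is visible."""
    lowered = text.lower()
    no_punct = PUNCTUATION.sub("", lowered).strip()
    tokens = no_punct.split()
    kept = [token for token in tokens if token not in DEMO_STOP_WORDS]
    return {
        "lowered": lowered,
        "no_punct": no_punct,
        "tokens": kept,
        "final": " ".join(kept),
    }

steps = normalize_steps("Machine Learning is a subset of AI, and it's powerful!")
for name, value in steps.items():
    print(f"{name}: {value}")
# The last line printed is: final: machine learning subset ai its powerful
```

Note that "it's" collapses to "its" once the apostrophe is stripped. Expanding contractions before removing punctuation avoids this by turning the contraction into separate words that the stop-word filter can then handle.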
## 4. Extracting key phrases and topics
([Go to top](#Project))

Use this section to extract the key phrases and topics from the videos.

In [None]:
# Write your answer/code here

# Print the top keywords and their scores for each video
from keybert import KeyBERT
key = KeyBERT()
keyword_score = {}
for i in sorted(os.listdir("normalized_text_files")):
    with open("normalized_text_files/" + i, "r") as file:
        doc = file.read()
    keywords = key.extract_keywords(doc)
    keyword_score[i] = keywords  # keep the (keyword, score) pairs for later lookups
    print(i + ": ", keywords)

In [None]:
# Extract keywords from the user's query and find the best-matching video
kw_model = KeyBERT()  # load the model once instead of on every call

def key_word_extraction(text):
    return kw_model.extract_keywords(text)

def locate_video(query_keywords):
    score = 0
    video = None
    for i in sorted(os.listdir("normalized_text_files")):
        with open("normalized_text_files/" + i, "r") as file:
            doc = file.read()
        key_words = key_word_extraction(doc)
        matching_score = 0
        # Add a document keyword's score each time it contains a query keyword
        for query_word, _ in query_keywords:
            for doc_word, doc_score in key_words:
                if query_word in doc_word:
                    matching_score = matching_score + doc_score
        if matching_score > score:
            score = matching_score
            video = i
    return video

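`locate_video` re-extracts keywords from every transcript on each query. One possible alternative is to extract them once and score queries against the stored pairs. The sketch below shows only the scoring half, using hand-written data. `match_score`, `best_match`, and the sample `index` are hypothetical, the scores are made-up numbers in the `(keyword, score)` shape that `extract_keywords` returns, and query terms are plain strings rather than tuples.

```python
def match_score(query_terms, doc_keywords):
    """Add a keyword's score once for every query term that the keyword contains."""
    total = 0.0
    for term in query_terms:
        for keyword, score in doc_keywords:
            if term in keyword:
                total += score
    return total

def best_match(query_terms, index):
    """Return the name with the highest positive score, or None if nothing matches."""
    best_name, best_score = None, 0.0
    for name in sorted(index):
        score = match_score(query_terms, index[name])
        if score > best_score:
            best_name, best_score = name, score
    return best_name

# Made-up keyword scores, for illustration only.
index = {
    "intro.mp4.txt": [("learning", 0.52), ("machine", 0.48)],
    "vision.mp4.txt": [("vision", 0.61), ("image", 0.44), ("learning", 0.20)],
}
print(best_match(["vision"], index))    # vision.mp4.txt
print(best_match(["learning"], index))  # intro.mp4.txt
print(best_match(["audio"], index))     # None
```

In the notebook, such an index could be built from a dictionary that maps each transcript name to its extracted keywords, so that only the user's query needs to pass through KeyBERT at request time.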
## 5. Creating the dashboard
([Go to top](#Project))

Use this section to create the dashboard for your solution.

In [None]:
# Write your answer/code here

# Recommend a video based on the topic given by the user
def chatbot(message, history):
    # Default response if no keyword matches
    default_response = "No video found"
    input_text = key_word_extraction(message)
    recommended_video = locate_video(input_text)
    if recommended_video:
        # Transcript files are named "<video file>.txt"; drop the suffix to get the video name
        video_name = os.path.splitext(recommended_video)[0]
        return f"recommended video: {video_name}"
    else:
        return default_response

gr.ChatInterface(
    chatbot,
    chatbot=gr.Chatbot(height=300),
    textbox=gr.Textbox(placeholder="Type the topic you want to learn", container=False, scale=7),
    title="Locate Video",
    description="Which video topics are you looking for?",
    theme="soft",
    examples=["Computer Vision", "Speech Recognition", "Classification", "Regression"],
    cache_examples=True,
    retry_btn=None,
    undo_btn="Delete Previous",
    clear_btn="Clear",
).launch()