# Project

In this project, you will bring together many of the tools and techniques that you learned throughout this course. You can take many different paths to reach a solution.

### Business scenario

You work for a training organization that recently developed an introductory course about machine learning (ML). The course includes more than 40 videos that cover a broad range of ML topics. You have been asked to create an application that students can use to quickly locate and view video content by searching for topics and key phrases.

You have downloaded all of the videos to an Amazon Simple Storage Service (Amazon S3) bucket. Your assignment is to produce a dashboard that meets your supervisor’s requirements.

## Project steps

To complete this project, you will follow these steps:

1. [Viewing the video files](#1.-Viewing-the-video-files)
2. [Transcribing the videos](#2.-Transcribing-the-videos)
3. [Normalizing the text](#3.-Normalizing-the-text)
4. [Extracting key phrases and topics](#4.-Extracting-key-phrases-and-topics)
5. [Creating the dashboard](#5.-Creating-the-dashboard)

## Useful information

The following cell contains some information that might be useful as you complete this project.

In [101]:
bucket = "c56161a939430l3396553t1w744137092661-labbucket-rn642jaq01e9"
job_data_access_role = 'arn:aws:iam::744137092661:role/service-role/c56161a939430l3396553t1w7-ComprehendDataAccessRole-1P24MSS91ADHP'

## 1. Viewing the video files
([Go to top](#Project))


The source video files are located in the following shared Amazon Simple Storage Service (Amazon S3) bucket.

In [102]:
!aws s3 ls s3://aws-tc-largeobjects/CUR-TF-200-ACMNLP-1/video/

2021-04-26 20:17:33  410925369 Mod01_Course Overview.mp4
2021-04-26 20:10:02   39576695 Mod02_Intro.mp4
2021-04-26 20:31:23  302994828 Mod02_Sect01.mp4
2021-04-26 20:17:33  416563881 Mod02_Sect02.mp4
2021-04-26 20:17:33  318685583 Mod02_Sect03.mp4
2021-04-26 20:17:33  255877251 Mod02_Sect04.mp4
2021-04-26 20:23:51   99988046 Mod02_Sect05.mp4
2021-04-26 20:24:54   50700224 Mod02_WrapUp.mp4
2021-04-26 20:26:27   60627667 Mod03_Intro.mp4
2021-04-26 20:26:28  272229844 Mod03_Sect01.mp4
2021-04-26 20:27:06  309127124 Mod03_Sect02_part1.mp4
2021-04-26 20:27:06  195635527 Mod03_Sect02_part2.mp4
2021-04-26 20:28:03  123924818 Mod03_Sect02_part3.mp4
2021-04-26 20:31:28  171681915 Mod03_Sect03_part1.mp4
2021-04-26 20:32:07  285200083 Mod03_Sect03_part2.mp4
2021-04-26 20:33:17  105470345 Mod03_Sect03_part3.mp4
2021-04-26 20:35:10  157185651 Mod03_Sect04_part1.mp4
2021-04-26 20:36:27  187435635 Mod03_Sect04_part2.mp4
2021-04-26 20:36:40  280720369 Mod03_Sect04_part3.mp4
2021-04-26 20:40:01  443479

## 2. Transcribing the videos
([Go to top](#Project))

Use this section to implement your solution to transcribe the videos. 

In [3]:
!pip install boto3 pandas



In [4]:
import boto3

In [5]:
pip install moviepy

Collecting moviepy
  Downloading moviepy-1.0.3.tar.gz (388 kB)
  Preparing metadata (setup.py) ... done
Collecting decorator<5.0,>=4.0.2 (from moviepy)
  Downloading decorator-4.4.2-py2.py3-none-any.whl.metadata (4.2 kB)
Collecting proglog<=1.0.0 (from moviepy)
  Downloading proglog-0.1.10-py3-none-any.whl.metadata (639 bytes)
Collecting imageio_ffmpeg>=0.2.0 (from moviepy)
  Downloading imageio_ffmpeg-0.4.9-py3-none-manylinux2010_x86_64.whl.metadata (1.7 kB)
Downloading decorator-4.4.2-py2.py3-none-any.whl (9.2 kB)
Downloading imageio_ffmpeg-0.4.9-py3-none-manylinux2010_x86_64.whl (26.9 MB)
Downloading proglog-0.1.10-py3-none-any.whl (6.1 kB)
Building wheels for collected packages: moviepy
  Buildi

In [6]:
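from moviepy.editor import VideoFileClip

def convert(src_name, dst_name):
    # Extract the audio track of a video file and write it to dst_name
    video = VideoFileClip(src_name)
    video.audio.write_audiofile(dst_name)
    video.close()  # release the file handles held by the reader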

Matplotlib is building the font cache; this may take a moment.


In [7]:
s3 = boto3.client('s3')

bucket_name = 'aws-tc-largeobjects'
list_file_names = []
pref = 'CUR-TF-200-ACMNLP-1/video'

response = s3.list_objects_v2(Bucket=bucket_name, Prefix=pref)

if 'Contents' in response:
    for obj in response['Contents']:
        list_file_names.append(obj['Key'])
    
    # Keep only the file name from each object key
    cleaned_file_names = [key.split('/')[-1] for key in list_file_names]
    #print(cleaned_file_names)
    '''
    #converting the video files to audio to reduce bytes processed
    for i in range(len(list_file_names)):
        file_name = cleaned_file_names[i]
        
        res = s3.get_object(Bucket=bucket_name, Key=list_file_names[i])
        aud = res['Body'].read()
        file_size = res['ContentLength']
        print(file_size)
        
        #conversion
        
        s3.download_file(bucket_name, list_file_names[i], "videos/temp.mp4")
        audio_name = file_name.split('.')
        audio_name = audio_name[0]
        audio_name = audio_name + ".mp3"
        new = convert("videos/temp.mp4", "audios/"+ audio_name)'''
else:
    print("Empty Folder.")
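The file-name handling in the cell above can be sketched as a small helper. This is a minimal sketch, assuming the key layout shown in the bucket listing; `audio_name_for` is a hypothetical name that is not part of the original solution. It uses `posixpath.splitext` rather than `split('.')`, so only the final extension is replaced if a name happens to contain extra dots.

```python
import posixpath

def audio_name_for(s3_key):
    """Return the .mp3 file name for a video object key (hypothetical helper)."""
    # S3 keys always use forward slashes, so posixpath is used on every OS
    base_name = posixpath.basename(s3_key)
    # splitext strips only the final extension
    stem, _extension = posixpath.splitext(base_name)
    return stem + ".mp3"

print(audio_name_for("CUR-TF-200-ACMNLP-1/video/Mod02_Intro.mp4"))
```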


In [8]:
#observing the change in size of file
import os
print(os.path.getsize("audios/Mod04_Sect02_part1.mp3"))

8602270


In [45]:
import os

# Collect the path of every .mp3 file in the audio folder
files = []
folder = "audios"
for file_name in os.listdir(folder):
    file_path = os.path.join(folder, file_name)
    if file_name.lower().endswith('.mp3'):
        files.append(file_path)

In [46]:
print(files)

['audios/Mod04_Sect02_part1.mp3', 'audios/Mod06_Sect01.mp3', 'audios/Mod03_WrapUp.mp3', 'audios/Mod02_Sect01.mp3', 'audios/Mod03_Sect03_part1.mp3', 'audios/Mod05_Sect02_part2.mp3', 'audios/Mod03_Sect06.mp3', 'audios/Mod02_Sect04.mp3', 'audios/Mod03_Sect03_part3.mp3', 'audios/Mod03_Sect02_part2.mp3', 'audios/Mod05_WrapUp_ver2.mp3', 'audios/Mod02_Intro.mp3', 'audios/Mod03_Sect05.mp3', 'audios/Mod03_Sect08.mp3', 'audios/Mod03_Sect04_part3.mp3', 'audios/Mod04_Intro.mp3', 'audios/Mod03_Sect04_part2.mp3', 'audios/Mod04_Sect01.mp3', 'audios/Mod05_Intro.mp3', 'audios/Mod05_Sect03_part1.mp3', 'audios/Mod03_Sect07_part2.mp3', 'audios/Mod01_Course Overview.mp3', 'audios/Mod04_Sect02_part2.mp3', 'audios/Mod03_Sect02_part1.mp3', 'audios/Mod06_WrapUp.mp3', 'audios/Mod05_Sect03_part4_ver2.mp3', 'audios/Mod05_Sect03_part2.mp3', 'audios/Mod05_Sect02_part1_ver2.mp3', 'audios/Mod03_Sect04_part1.mp3', 'audios/Mod03_Sect07_part1.mp3', 'audios/Mod03_Sect01.mp3', 'audios/Mod03_Intro.mp3', 'audios/Mod04_Sect0

In [11]:
print(len(files))

46


In [12]:
!pip install ibm_watson

Collecting ibm_watson
  Downloading ibm-watson-8.0.0.tar.gz (398 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting ibm-cloud-sdk-core==3.*,>=3.3.6 (from ibm_watson)
  Downloading ibm-cloud-sdk-core-3.19.2.tar.gz (61 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting urllib3<3.0.0,>=2.1.0 (from ibm-cloud-sdk-core==3.*,>=3.3.6->ibm_watson)
  Using cached urllib3-2.2.1-py3-none-any.whl.metadata (6.4 kB)
Collecting PyJWT<3.0.0,>=2.8.0 (from ibm-cloud-sdk-core==3.*,>=3.3.6-

In [13]:
from ibm_watson import SpeechToTextV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

In [14]:
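import os
# Read the key from the environment rather than hard-coding it; IBM_WATSON_APIKEY is an assumed variable name
apikey = os.environ['IBM_WATSON_APIKEY']
url = 'https://api.au-syd.speech-to-text.watson.cloud.ibm.com/instances/be7a1e87-d5e6-4e4f-a08a-a5070095f974'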

In [15]:
authenticator = IAMAuthenticator(apikey)
stt = SpeechToTextV1(authenticator = authenticator)
stt.set_service_url(url)

In [17]:
# Transcribe one file first to inspect the response format
with open(files[15], 'rb') as audio_file:
    res = stt.recognize(audio=audio_file, content_type='audio/mp3', model='en-US_Telephony',
                        inactivity_timeout=360).get_result()
print(res)

{'result_index': 0, 'results': [{'final': True, 'alternatives': [{'transcript': 'hi and welcome to marginal four of aws academy machine learning ', 'confidence': 0.83}]}, {'final': True, 'alternatives': [{'transcript': "in this module we're going to look at forecasting ", 'confidence': 0.92}]}, {'final': True, 'alternatives': [{'transcript': 'will start with an introduction to forecasting and look at how time series data is different from other kinds of data ', 'confidence': 0.95}]}, {'final': True, 'alternatives': [{'transcript': "then we're going to look at amazon forecast a service that helps you simplify building forecasts ", 'confidence': 0.95}]}, {'final': True, 'alternatives': [{'transcript': "at the end of this module you'll be able to describe the business problem solved with amazon forecast ", 'confidence': 0.96}]}, {'final': True, 'alternatives': [{'transcript': 'describe the challenges of working with time series data ', 'confidence': 0.96}]}, {'final': True, 'alternatives'
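The response above is a list of segments, each with a ranked list of alternatives that carry a `transcript` and a `confidence`. As a minimal sketch, assuming every result has that shape, the following hypothetical helper averages the confidence of the top alternative so that poorly transcribed files can be spotted. The sample dictionary is abbreviated from the output above.

```python
def mean_confidence(result):
    """Average confidence of the top alternative in each segment (hypothetical helper)."""
    scores = [
        segment["alternatives"][0]["confidence"]
        for segment in result.get("results", [])
        if segment.get("alternatives")
    ]
    # An empty response has no segments, so report 0.0 instead of dividing by zero
    return sum(scores) / len(scores) if scores else 0.0

# Abbreviated copy of the first two segments printed above
sample = {
    "result_index": 0,
    "results": [
        {"final": True, "alternatives": [
            {"transcript": "hi and welcome to marginal four of aws academy machine learning ",
             "confidence": 0.83}]},
        {"final": True, "alternatives": [
            {"transcript": "in this module we're going to look at forecasting ",
             "confidence": 0.92}]},
    ],
}
print(round(mean_confidence(sample), 3))
```

A low average could flag a file for manual review, for example the segment that heard "module" as "marginal".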

In [21]:
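# Transcribe every audio file; the responses are kept in the same order as files
results = []
for filename in files:
    with open(filename, 'rb') as audio_file:
        res = stt.recognize(audio=audio_file, content_type='audio/mp3', model='en-US_Telephony',
                            inactivity_timeout=360).get_result()
    results.append(res)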

In [22]:
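# Map each audio file to its full transcript
transcribed_text = {}
for filename, res in zip(files, results):
    # Take the top-ranked alternative of every recognized segment
    segments = [r['alternatives'][0]['transcript'] for r in res['results']]
    # Each segment already ends with a space, so a plain join keeps the words apart.
    # The transcript is stored in a one-element list, which is how it appears in the output below.
    transcribed_text[filename] = [''.join(segments)]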


    

In [23]:
print(transcribed_text['audios/Mod05_Sect03_part3.mp3'])

["hi welcome back we'll continue exploring video analysis by reviewing how to create the test data set the final step before you train your model is to identify a test data set you will use this test data set to validate and evaluate the models performance you'll do this by performing an inference on the images in the test data set you'll then compare the results with the labeling information that's in the training data set you can create your own test data set alternatively you can use amazon recognition costume labels to split your training data set into two data sets by using an eighty twenty split this split means that eighty percent of the data is used for training and twenty percent is used for testing after you define the training and test data sets amazon recognition custom labels can automatically train the model for you the service automatically loads and inspects the data selects the correct machine learning algorithms trains a model and provides model performance metrics yo

In [24]:
#saving all of the transcribed data in a csv file

import pandas as pd
df = pd.DataFrame(list(transcribed_text.items()), columns=['file_names', 'text'])
print(df)

csv = 'transcribed_text.csv'

df.to_csv(csv, index=True)

                            file_names  \
0        audios/Mod04_Sect02_part1.mp3   
1              audios/Mod06_Sect01.mp3   
2              audios/Mod03_WrapUp.mp3   
3              audios/Mod02_Sect01.mp3   
4        audios/Mod03_Sect03_part1.mp3   
5        audios/Mod05_Sect02_part2.mp3   
6              audios/Mod03_Sect06.mp3   
7              audios/Mod02_Sect04.mp3   
8        audios/Mod03_Sect03_part3.mp3   
9        audios/Mod03_Sect02_part2.mp3   
10        audios/Mod05_WrapUp_ver2.mp3   
11              audios/Mod02_Intro.mp3   
12             audios/Mod03_Sect05.mp3   
13             audios/Mod03_Sect08.mp3   
14       audios/Mod03_Sect04_part3.mp3   
15              audios/Mod04_Intro.mp3   
16       audios/Mod03_Sect04_part2.mp3   
17             audios/Mod04_Sect01.mp3   
18              audios/Mod05_Intro.mp3   
19       audios/Mod05_Sect03_part1.mp3   
20       audios/Mod03_Sect07_part2.mp3   
21    audios/Mod01_Course Overview.mp3   
22       audios/Mod04_Sect02_part2

## 3. Normalizing the text
([Go to top](#Project))

Use this section to perform any text normalization steps that are necessary for your solution.

In [1]:
#load the csv file as a dataframe for preprocessing
import pandas as pd 
df = pd.read_csv('transcribed_text.csv')

print(df.head())

   Unnamed: 0                     file_names  \
0           0  audios/Mod04_Sect02_part1.mp3   
1           1        audios/Mod06_Sect01.mp3   
2           2        audios/Mod03_WrapUp.mp3   
3           3        audios/Mod02_Sect01.mp3   
4           4  audios/Mod03_Sect03_part1.mp3   

                                                text  
0  ["hi and welcome back this is section two and ...  
1  ["will get started by reviewing what natural l...  
2  ["it's now time to review the monul and wrap u...  
3  ["i and welcome to section one in this section...  
4  ["hi i'm welcome back this is section three an...  
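The `text` column above holds the string form of a one-element list (note the leading `["`), because each transcript was wrapped in a list before it was written to the CSV file. A minimal sketch of recovering the plain string, assuming every cell has that form; `unwrap_text` is a hypothetical helper:

```python
import ast

def unwrap_text(cell):
    """Turn the CSV cell '["some transcript"]' back into 'some transcript'."""
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        # Not a Python literal, so assume the cell is already plain text
        return cell
    if isinstance(value, list) and value:
        return str(value[0])
    return cell

print(unwrap_text('["hi and welcome back this is section two"]'))
```

With the DataFrame loaded above, this could be applied as `df['text'] = df['text'].apply(unwrap_text)`.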


In [2]:
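import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

# A set makes the membership test in removal_stopwords fast
stop_words = set(stopwords.words('english'))

def removal_stopwords(text):
    # Drop every whitespace-separated token that is an English stop word
    words = text.split()
    filtered_words = [word for word in words if word not in stop_words]
    return ' '.join(filtered_words)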

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/ec2-user/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [3]:
# Write your answer/code here
import re
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def text_preprocessing(text):
    # Normalization: lowercase, then drop digits and punctuation
    lowercase = text.lower()
    rm_numbers = re.sub(r'\d+', '', lowercase)
    rm_punc = re.sub(r'[^\w\s]', '', rm_numbers)
    rm_wspace = rm_punc.strip()
    no_stop_words = removal_stopwords(rm_wspace)

    # Lemmatization: WordNetLemmatizer works on single words, so lemmatize token by token
    lemmatized_words = [lemmatizer.lemmatize(word) for word in no_stop_words.split()]

    return ' '.join(lemmatized_words)


[nltk_data] Downloading package wordnet to /home/ec2-user/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [4]:
preprocessed_text = df['text'].apply(text_preprocessing)
df['preprocessed_text'] = preprocessed_text
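To make the effect of the normalization steps concrete, here is a minimal standard-library sketch of the same pipeline applied to one sentence. The small stop-word set is a hypothetical stand-in for NLTK's English list, and lemmatization is left out because it needs the WordNet data.

```python
import re

# Hypothetical stand-in for stopwords.words('english'), kept tiny for illustration
SAMPLE_STOP_WORDS = {"and", "to", "of", "the", "in", "this"}

def normalize(text, stop_words=SAMPLE_STOP_WORDS):
    """Lowercase, strip digits and punctuation, then drop stop words."""
    lowered = text.lower()
    no_digits = re.sub(r"\d+", "", lowered)
    no_punct = re.sub(r"[^\w\s]", "", no_digits)
    # split() with no argument also collapses repeated whitespace
    return " ".join(word for word in no_punct.split() if word not in stop_words)

print(normalize("Hi, and welcome to Module 4 of the course!"))
```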

## 4. Extracting key phrases and topics
([Go to top](#Project))

Use this section to extract the key phrases and topics from the videos.

In [5]:
pip install spacy

Note: you may need to restart the kernel to use updated packages.


In [6]:
!python3 -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [7]:
# Write your answer/code here
import spacy
nlp = spacy.load("en_core_web_sm")

def extract_keywords(text):
    # doc.ents holds the named entities that spaCy detects; they are used here as keywords
    doc = nlp(text)
    return doc.ents

        

In [8]:
keywords = df['text'].apply(extract_keywords)
df['keywords'] = keywords

In [9]:
print(df['keywords'])

0     ((two), (third), (some, day), (the, month), (t...
1     ((next, language), (english), (one), (first), ...
2                                                    ()
3     ((first), (every, year), (twenty, four, seven)...
4     ((three), (one), (csv), (java), (japanese), (t...
5     ((two), (first), (three), (gat), (seconds), (r...
6     ((six), (first), (diplomas), (tens, of, thousa...
7     ((today), (first), (jupiter), (jupiter), (line...
8     ((one), (two), (two), (zero), (two), (one), (f...
9     ((three), (first), (three), (jupiter), (three)...
10                                                   ()
11                                     ((two), (first))
12    ((three), (five), (first), (three), (first), (...
13    ((three), (eight), (first), (second), (third),...
14    ((two), (first), (second), (two), (two), (two)...
15                                            ((four),)
16    ((between, two, and, eight), (california), (th...
17    ((two), (first), (one), (second), (more, t
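Most entities in the output above are numbers and ordinals, which make weak search terms. One alternative, sketched here with only the standard library, is to rank adjacent word pairs by frequency. This is a swapped-in technique for illustration, not the spaCy approach used above, and `top_bigrams` is a hypothetical helper.

```python
from collections import Counter

def top_bigrams(text, n=3):
    """Return the n most frequent adjacent word pairs in whitespace-tokenized text."""
    words = text.lower().split()
    # Pair every word with its successor
    pairs = zip(words, words[1:])
    counts = Counter(" ".join(pair) for pair in pairs)
    return counts.most_common(n)

# Made-up sample text in the style of the transcripts
sample = (
    "amazon forecast helps with time series data "
    "time series data is different and amazon forecast simplifies forecasts"
)
print(top_bigrams(sample, n=3))
```

Running this on text with stop words already removed would likely give cleaner phrases, since pairs such as "of the" would otherwise dominate.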

## 5. Creating the dashboard
([Go to top](#Project))

Use this section to create the dashboard for your solution.

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def search(all_videos, search_words):
    # Fit TF-IDF on the video transcripts, then project the query into the same space
    vectorizer = TfidfVectorizer()
    vid_text_vector = vectorizer.fit_transform(all_videos)
    search_vector = vectorizer.transform(search_words)

    # Cosine similarity between the query vector and every video vector
    similarity_score = cosine_similarity(search_vector, vid_text_vector)

    # Keep the index of each video scoring above 0.1 so it can be traced back to its file
    store_best_videos = {}
    for i, score in enumerate(similarity_score[0]):
        if score > 0.1:
            store_best_videos[i] = score

    # Sort by similarity score, highest first, so the most relevant videos come first
    sorted_bestvideos = dict(sorted(store_best_videos.items(), key=lambda item: item[1], reverse=True))

    # Debugging aid: print the text of the matched videos to check their relevance
    '''
    for key, value in sorted_bestvideos.items():
        print(all_videos[key])'''
    return sorted_bestvideos
        
    

#keywords = ['text']
#search(df['text'].tolist(), keywords)
#print(search_results)
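To illustrate the scoring that `search` relies on, here is a minimal pure-Python sketch of TF-IDF weighting and cosine similarity. It uses a smoothed IDF and plain whitespace tokenization, so its scores should not be expected to match `TfidfVectorizer` exactly. All function names and the two sample documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return ({term: weight} dicts, idf dict) for whitespace-tokenized documents."""
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(docs)
    # Smoothed IDF: terms found in every document still get a positive weight
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in doc_freq.items()}
    vectors = [{t: c * idf[t] for t, c in Counter(tokens).items()} for tokens in tokenized]
    return vectors, idf

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

docs = [
    "image analysis with custom labels",
    "forecasting with time series data",
]
vectors, idf = tfidf_vectors(docs)
# Query terms that are missing from the corpus vocabulary are ignored
query = {t: c * idf[t] for t, c in Counter("image labels".split()).items() if t in idf}
scores = [cosine(query, vector) for vector in vectors]
print(scores)
```

Only the first document shares terms with the query, so it is the only one that would pass the 0.1 threshold used in `search`.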

In [11]:
pip install ipywidgets

Note: you may need to restart the kernel to use updated packages.


In [12]:
# Map each audio path (audios/<name>.mp3) back to the source video name (<name>.mp4)
files_list = df['file_names'].tolist()
for i in range(len(files_list)):
    splitone = files_list[i].split('/')
    splittwo = splitone[1].split('.')
    files_list[i] = splittwo[0] + '.mp4'

def find_videolink(search_word):
    video_dict = search(df['text'].tolist(), search_word)
    temp_link = 'https://aws-tc-largeobjects.s3.amazonaws.com/CUR-TF-200-ACMNLP-1/video/'
    # Build the list inside the function so that results from earlier searches do not accumulate
    urls = []
    for k in video_dict:
        urls.append(temp_link + files_list[k])
    print(urls)
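The mapping from an audio path back to a video link can be sketched as a single helper. This is a minimal sketch, assuming every audio file shares its base name with an `.mp4` object under the prefix used above; `video_url` is a hypothetical name. `urllib.parse.quote` percent-encodes characters such as the space in `Mod01_Course Overview`, which would otherwise produce a broken link.

```python
import posixpath
from urllib.parse import quote

BASE_URL = "https://aws-tc-largeobjects.s3.amazonaws.com/CUR-TF-200-ACMNLP-1/video/"

def video_url(audio_path):
    """Build the video URL for an audio path such as 'audios/Mod02_Intro.mp3'."""
    # Drop the folder and the .mp3 extension, keeping only the base name
    stem, _extension = posixpath.splitext(posixpath.basename(audio_path))
    # Percent-encode the file name so that spaces are valid in the URL
    return BASE_URL + quote(stem + ".mp4")

print(video_url("audios/Mod01_Course Overview.mp3"))
```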
              

In [13]:
import ipywidgets as widgets
from IPython.display import display

text = widgets.HTML('Search for the topic you wish to learn about and press enter')

    
text_input = widgets.Text(
    placeholder='Enter text for search',
    description='Input:',
    disabled=False,
    continuous_update=False
)

# Display the text input widget
display(text)
display(text_input)

#after entering text

HTML(value='Search for the topic you wish to learn about and press enter')

Text(value='', continuous_update=False, description='Input:', placeholder='Enter text for search')

In [14]:
word = [str(text_input.value)]

In [15]:
print("Click on the following videos to learn more about " + str(text_input.value))
find_videolink(word)

Click on the following videos to learn more about images
['https://aws-tc-largeobjects.s3.amazonaws.com/CUR-TF-200-ACMNLP-1/video/Mod05_Sect03_part1.mp4', 'https://aws-tc-largeobjects.s3.amazonaws.com/CUR-TF-200-ACMNLP-1/video/Mod05_Sect03_part4_ver2.mp4', 'https://aws-tc-largeobjects.s3.amazonaws.com/CUR-TF-200-ACMNLP-1/video/Mod05_Sect01_ver2.mp4', 'https://aws-tc-largeobjects.s3.amazonaws.com/CUR-TF-200-ACMNLP-1/video/Mod05_Sect03_part2.mp4', 'https://aws-tc-largeobjects.s3.amazonaws.com/CUR-TF-200-ACMNLP-1/video/Mod05_Sect02_part1_ver2.mp4']
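The cell above prints the URLs as a Python list, which may not be clickable in every notebook front end. A minimal sketch of turning such a list into HTML anchors follows; `links_to_html` is a hypothetical helper, and the resulting string could be passed to `IPython.display.HTML` for rendering.

```python
from html import escape
from urllib.parse import unquote

def links_to_html(urls):
    """Render a list of URLs as an HTML bullet list of clickable links."""
    items = []
    for url in urls:
        # Use the decoded file name as the visible link text
        label = unquote(url.rsplit("/", 1)[-1])
        items.append(
            '<li><a href="{}" target="_blank">{}</a></li>'.format(
                escape(url, quote=True), escape(label)
            )
        )
    return "<ul>" + "".join(items) + "</ul>"

sample_urls = [
    "https://aws-tc-largeobjects.s3.amazonaws.com/CUR-TF-200-ACMNLP-1/video/Mod05_Sect03_part1.mp4",
]
print(links_to_html(sample_urls))
```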
