# Using NLP for content monetization
This notebook accompanies Chapter 8 of the book Natural Language Processing with AWS AI Services. Please do not run it on its own: the book documents prerequisites and dependent steps that must be completed first. In this chapter, we look at how to use AWS AI services, specifically the NLP services, to monetize your video content. The following high-level steps walk through the solution and indicate where to find the instructions for each:
1. Upload a video file to an Amazon S3 bucket - Refer to the book
2. Use AWS Elemental MediaConvert to create broadcast streams - Refer to the book
3. Run a transcription of the video file using Amazon Transcribe - Refer to this notebook
4. Run an Amazon Comprehend Topic Modeling job to extract topics - Refer to this notebook
5. Select the ad markers based on topics extracted - Refer to this notebook
6. Stitch into an Ad decision server URL - Refer to this notebook
7. Create an AWS Elemental MediaTailor configuration - Refer to the book
8. Play the ad-embedded video to test - Refer to the book

## Transcribe section

In [None]:
import pandas as pd
import json
import boto3
import re
import uuid
import time
import io
from io import BytesIO
import sys
import csv
from IPython.display import Image, display
from PIL import Image as PImage, ImageDraw

In [None]:
bucket='<bucket-name>'
prefix='chapter8'
s3=boto3.client('s3')

In [None]:
import time
import boto3

def transcribe_file(job_name, file_uri, transcribe_client):
    # Start an asynchronous transcription job; the call returns before the job finishes
    transcribe_client.start_transcription_job(
        TranscriptionJobName=job_name,
        Media={'MediaFileUri': file_uri},
        MediaFormat='mp4',
        LanguageCode='en-US'
    )

In [None]:
job_name = 'media-monetization-transcribe'

In [None]:
transcribe_client = boto3.client('transcribe')
file_uri = 's3://'+bucket+'/'+prefix+'/'+'rawvideo/bank-demo-prem-ranga.mp4'
transcribe_file(job_name, file_uri, transcribe_client)

In [None]:
# Check the job status once; re-run this cell until the job is COMPLETED or FAILED
job = transcribe_client.get_transcription_job(TranscriptionJobName=job_name)
job_status = job['TranscriptionJob']['TranscriptionJobStatus']
if job_status in ['COMPLETED', 'FAILED']:
    print(f"Job {job_name} is {job_status}.")
    if job_status == 'COMPLETED':
        print(f"Download the transcript from\n"
              f"\t{job['TranscriptionJob']['Transcript']['TranscriptFileUri']}")
else:
    print(f"Job {job_name} is still {job_status}. Re-run this cell in a few moments.")
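
The status check above runs only once, so it has to be re-run by hand until the job finishes. A minimal sketch of a polling loop is shown below. The helper name `wait_for_transcription` and its `poll_seconds` and `max_polls` parameters are assumptions made for illustration and are not part of the book's code. It relies only on the `get_transcription_job` call already used in this notebook.

```python
import time

def wait_for_transcription(job_name, transcribe_client, poll_seconds=10, max_polls=60):
    """Poll until the job is COMPLETED or FAILED, then return the job description."""
    for _ in range(max_polls):
        job = transcribe_client.get_transcription_job(TranscriptionJobName=job_name)
        status = job['TranscriptionJob']['TranscriptionJobStatus']
        if status in ('COMPLETED', 'FAILED'):
            return job
        time.sleep(poll_seconds)  # wait before asking again
    raise TimeoutError(f"Job {job_name} did not finish after {max_polls} polls")

# Hypothetical usage in this notebook:
# job = wait_for_transcription(job_name, transcribe_client)
```

Capping the number of polls keeps a stuck job from blocking the notebook indefinitely; the default values here are arbitrary and may need tuning for longer videos.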

## Comprehend Topic Modeling Section

### First get the transcript

In [None]:
# Load the transcript JSON file into a Pandas DataFrame for easy manipulation
raw_df = pd.read_json(job['TranscriptionJob']['Transcript']['TranscriptFileUri'])
raw_df.shape

In [None]:
raw_df.head()

In [None]:
# Keep only the transcript text; we do not need the remaining columns for our solution
raw_df = pd.DataFrame(raw_df.at['transcripts','results'].copy())
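
To make the DataFrame indexing above easier to follow, here is a small sketch of the same lookup on the parsed JSON directly. The transcript text sits under `results` then `transcripts` in the Transcribe output. The helper name `extract_transcript` and the sample document are made up for illustration.

```python
import json

def extract_transcript(transcribe_output):
    """Return the full transcript text from a parsed Transcribe output document."""
    transcripts = transcribe_output['results']['transcripts']
    return ' '.join(t['transcript'] for t in transcripts)

# A trimmed, made-up document with the same shape as a Transcribe result
sample = json.loads(
    '{"jobName": "demo", "status": "COMPLETED", '
    '"results": {"transcripts": [{"transcript": "Hello there. Welcome to the bank."}], '
    '"items": []}}'
)
print(extract_transcript(sample))
```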

In [None]:
# Save the transcript text to a CSV file
raw_df.to_csv('topic-modeling/raw/transcript.csv', header=False, index=False)

In [None]:
import csv
import os
# Split the transcript into sentences and write one sentence per row
folderpath = r"topic-modeling/raw" # folder that contains the raw transcript file(s)
filepaths  = [os.path.join(folderpath, name) for name in os.listdir(folderpath) if not name.startswith('.')] # skip hidden files
fnfull = "topic-modeling/job-input/transcript_formatted.csv"
# Open the output file once so that multiple input files do not overwrite each other
with open(fnfull, "w", encoding='utf-8', newline='') as ff:
    csv_writer = csv.writer(ff, delimiter=',', quotechar='"')
    for path in filepaths:
        print(path)
        with open(path, 'r', encoding='utf-8') as f:
            content = f.read() # Read the whole file
        for line in content.split('.'): # naive split on periods
            if line.strip(): # skip empty fragments
                csv_writer.writerow([line.strip()])
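
Splitting on every period also breaks decimals such as "3.5" into separate rows. As a hedged alternative, the sketch below uses a regular expression that splits only where sentence-ending punctuation is followed by whitespace. The function names `split_sentences` and `sentences_to_csv` are hypothetical, and the rule is still a heuristic (abbreviations such as "Dr. Smith" would be split).

```python
import csv
import io
import re

def split_sentences(text):
    """Split text after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

def sentences_to_csv(text):
    """Return CSV text with one sentence per row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    for sentence in split_sentences(text):
        writer.writerow([sentence])
    return buffer.getvalue()

print(split_sentences("We offer loans. Rates start at 3.5 percent! Any questions?"))
```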

In [None]:
# Upload the CSV file to the input prefix in S3 to be used in the topic modeling job
s3.upload_file('topic-modeling/job-input/transcript_formatted.csv', bucket, prefix+'/topic-modeling/job-input/tm-input.csv')

### Now follow the instructions in the book to run the topic modeling job from the Amazon Comprehend console
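
The book runs this job from the console. For readers who prefer the API, the sketch below assembles the arguments for the boto3 Comprehend `start_topics_detection_job` call. The helper name, the output prefix, the role ARN, and the number of topics are all assumptions for illustration; use the values given in the book. The input location matches the upload step in this notebook.

```python
def build_topics_job_request(bucket, prefix, role_arn, job_name, number_of_topics=2):
    """Build keyword arguments for Comprehend's start_topics_detection_job."""
    return {
        'JobName': job_name,
        'NumberOfTopics': number_of_topics,  # assumed value, not taken from the book
        'InputDataConfig': {
            'S3Uri': f's3://{bucket}/{prefix}/topic-modeling/job-input/tm-input.csv',
            'InputFormat': 'ONE_DOC_PER_LINE',  # each CSV row is one document
        },
        'OutputDataConfig': {
            # hypothetical output prefix
            'S3Uri': f's3://{bucket}/{prefix}/topic-modeling/results/',
        },
        'DataAccessRoleArn': role_arn,  # IAM role that Comprehend assumes to read and write S3
    }

request = build_topics_job_request(
    'my-bucket', 'chapter8',
    'arn:aws:iam::123456789012:role/placeholder-role',  # placeholder ARN
    'media-monetization-topics')

# Hypothetical usage (requires AWS credentials and a real role ARN):
# comprehend = boto3.client('comprehend')
# response = comprehend.start_topics_detection_job(**request)
```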

### Process Topic Modeling Results

In [None]:
# Let's first download the results of the topic modeling job. 
# Please copy the output data location from your topic modeling job for this step and use it below
tpprefix = prefix+'/'+'<path-to-job-output-tar>'
s3.download_file(bucket, tpprefix, 'topic-modeling/results/output.tar.gz')
!tar -xzvf topic-modeling/results/output.tar.gz

In [None]:
# Now load each of the resulting CSV files to their own DataFrames
tt_df = pd.read_csv('topic-terms.csv')
dt_df = pd.read_csv('doc-topics.csv')

In [None]:
# The topic terms DataFrame contains the topic number, a term that belongs to the topic, and
# the weight of that term within the topic
for i,x in tt_df.iterrows():
    print(str(x['topic'])+":"+x['term']+":"+str(x['weight']))

In [None]:
# The same line may be associated with multiple topics. For this example we are not interested in these duplicates, so we keep the first entry per line and drop the rest
dt_df = dt_df.drop_duplicates(subset=['docname'])

In [None]:
# For each topic, find the maximum term weight
ttdf_max = tt_df.groupby(['topic'], sort=False)['weight'].max()

In [None]:
ttdf_max.head()

In [None]:
# Collect the highest-weighted term for each topic into its own DataFrame
# (DataFrame.append was removed in pandas 2.0, so we use pd.concat instead)
newtt_df = pd.concat([tt_df.query('weight == @x') for x in ttdf_max])
newtt_df = newtt_df.reset_index(drop=True)
# Use the term in the second row as the ad topic for this example
adtopic = newtt_df.at[1,'term']
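
To illustrate the selection logic without depending on a particular pandas version, the following sketch picks the highest-weighted term per topic using only the standard library. The `topic`, `term`, and `weight` column names match those used above; the helper name `top_term_per_topic` and the sample rows are made up for illustration.

```python
import csv
import io

def top_term_per_topic(rows):
    """Return {topic: (term, weight)} keeping the highest-weighted term per topic."""
    best = {}
    for row in rows:
        topic, term, weight = row['topic'], row['term'], float(row['weight'])
        if topic not in best or weight > best[topic][1]:
            best[topic] = (term, weight)
    return best

# Made-up rows in the same layout as topic-terms.csv
sample_csv = """topic,term,weight
0,bank,0.12
0,account,0.30
1,content,0.25
1,video,0.10
"""
top_terms = top_term_per_topic(csv.DictReader(io.StringIO(sample_csv)))
print(top_terms)
```

Unlike matching rows on `weight == x`, this keeps exactly one term per topic even when two topics happen to share the same maximum weight.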

## Ad marking for MediaTailor
I have provided a sample CSV file containing content metadata for looking up ads. For this example, we'll use the topics we discovered from our topic modeling job as the key to fetch the cmsid and vid. We will then substitute these into the VAST ad marker URL before creating the AWS Elemental MediaTailor configuration.

In [None]:
#Get the ad content for marking our input video
adindex_df = pd.read_csv('media-content/ad-index.csv', header=None, index_col=0)
adindex_df

#### We will select `content` as the topic from our topic modeling results and look up the ad content in the ad index above for our example

In [None]:
# Look up the cmsid and vid using the selected topic as the key
advalue = adindex_df.loc[adtopic]
advalue[2]

In [None]:
# Now we will create the AdMarker URL to use with AWS Elemental MediaTailor.
# Let's first copy the placeholder URL available in our GitHub repo, which has pre-roll, mid-roll and post-roll segments filled in
ad_rawurl = pd.read_csv('media-content/adserver.csv', header=None).at[0,0].split('&')
ad_rawurl

In [None]:
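# Substitute the cmsid and vid parameters with the entries looked up from the ad index.
# As in the original substitution, each entry replaces the whole parameter, so it is
# assumed to be a complete key=value pair
ad_params = []
for x in ad_rawurl:
    if x.startswith('cmsid'):
        x = advalue[1]
    elif x.startswith('vid'):
        x = advalue[2]
    ad_params.append(x)

# Re-join with '&' so that the URL does not end with a stray separator
ad_formattedurl = '&'.join(ad_params)
ad_formattedurl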
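
String splitting on `&` works for the placeholder URL but is fragile. As a hedged alternative, the sketch below rewrites named query parameters with `urllib.parse` from the standard library. The helper name `set_query_params`, the example URL, and the parameter values are hypothetical, and the sketch assumes the ad index holds bare values rather than full `key=value` pairs.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def set_query_params(url, **overrides):
    """Return url with the given query parameters replaced, keeping their order."""
    parts = urlsplit(url)
    query = [(key, overrides.get(key, value))
             for key, value in parse_qsl(parts.query, keep_blank_values=True)]
    return urlunsplit(parts._replace(query=urlencode(query)))

# Made-up template URL and values for illustration only
template = 'https://ads.example.com/vast?sz=640x480&cmsid=0&vid=placeholder&correlator='
formatted = set_query_params(template, cmsid='demo-cms', vid='demo-video')
print(formatted)
```

Note that `urlencode` percent-encodes reserved characters, so a template that carries literal placeholders such as square brackets may need different handling.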

## Resume from Creating AWS Elemental MediaTailor Configuration section in Chapter 8 of the book