Version: 02.14.2023

# Challenge Lab 6.3: Implementing Topic Modeling

In this lab, you will use either Amazon Comprehend or the Amazon SageMaker Neural Topic Model (NTM) to extract topics from the [CMU Movie Summary Corpus](http://www.cs.cmu.edu/~ark/personas/). 

## CMU Movie Summary Corpus

The CMU Movie Summary Corpus is a collection of 42,306 movie plot summaries and metadata at both the movie level (including box office revenue, genre, and date of release) and character level (including gender and estimated age). This data supports work in the following paper:

David Bamman, Brendan O'Connor, and Noah Smith. "Learning Latent Personas of Film Characters." Presented at the Annual Meeting of the Association for Computational Linguistics (ACL 2013), Sofia, Bulgaria, August 2013. http://www.cs.cmu.edu/~dbamman/pubs/pdf/bamman+oconnor+smith.acl13.pdf.

You will use two datasets in this lab:

**plot_summaries.txt**

This dataset contains plot summaries of 42,306 movies, extracted from the November 2, 2012 dump of English-language Wikipedia. Each line contains the Wikipedia movie ID (which indexes into movie.metadata.tsv) followed by the summary.

**movie.metadata.tsv**

This dataset contains metadata for 81,741 movies, extracted from the November 4, 2012 dump of Freebase. The data is tab-separated and contains the following columns:

1. Wikipedia movie ID
2. Freebase movie ID
3. Movie name
4. Movie release date
5. Movie box office revenue
6. Movie runtime
7. Movie languages (Freebase ID:name tuples)
8. Movie countries (Freebase ID:name tuples)
9. Movie genres (Freebase ID:name tuples)
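
The language, country, and genre columns hold several values per movie. In the sample rows shown later in this notebook, these cells appear as JSON-style objects that map a Freebase ID to a name. The following is a minimal sketch for pulling out the names, assuming that format holds for every row; the helper name `freebase_names` is hypothetical and not part of the lab code.

```python
import json

def freebase_names(cell):
    """Return the names from a '{"<Freebase ID>": "<name>", ...}' cell."""
    # Assumption: the cell is a JSON object string; an empty cell yields no names.
    if not cell:
        return []
    return list(json.loads(cell).values())

# Sample value copied from the metadata preview in this notebook
genres = '{"/m/0lsxr": "Crime Fiction", "/m/07s9rl0": "Drama"}'
print(freebase_names(genres))
```

This lab drops these columns later, so the helper matters only if you want to compare extracted topics with the listed genres.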

## Lab steps

To complete this lab, you will follow these steps:

1. [Installing the packages](#1.-Installing-the-packages)
2. [Reviewing the dataset](#2.-Reviewing-the-dataset)
3. [Extracting topics](#3.-Extracting-topics)

## Submitting your work

1. In the lab console, choose **Submit** to record your progress and, when prompted, choose **Yes**.

1. If the results don't display after a couple of minutes, return to the top of these instructions and choose **Grades**.

     **Tip**: You can submit your work multiple times. After you change your work, choose **Submit** again. Your last submission is what will be recorded for this lab.

1. To find detailed feedback on your work, choose **Details** followed by **View Submission Report**.

## 1. Installing the packages
([Go to top](#Challenge-Lab-6.3:-Implementing-Topic-Modeling))

First, import the packages and download the NLTK resources that you will use in the notebook.

In [1]:
%matplotlib inline

import boto3
import os, io, struct, json
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import uuid
from time import sleep
import re
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import RegexpTokenizer
from nltk.stem.wordnet import WordNetLemmatizer

Matplotlib is building the font cache; this may take a moment.
[nltk_data] Downloading package punkt to /home/ec2-user/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/ec2-user/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/ec2-user/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /home/ec2-user/nltk_data...


In [2]:
bucket = "c100688a2296030l5426874t1w241840010076-labbucket-4068hmqh9l3w"
job_data_access_role = 'arn:aws:iam::241840010076:role/service-role/c100688a2296030l5426874t1w-ComprehendDataAccessRole-Am4ndZ38wCo2'
prefix='lab63'

## 2. Reviewing the dataset
([Go to top](#Challenge-Lab-6.3:-Implementing-Topic-Modeling))

First, load the plot_summaries.tsv data into a pandas DataFrame.

The file has no header row and contains two tab-separated columns: **movie_id** and **plot**. Pass the '\t' escape sequence as the separator and supply the column names yourself.

In [3]:
df = pd.read_csv('../data/plot_summaries.tsv', sep='\t', names=['movie_id','plot'])

Review the first few rows of data to get an overview of how the data is structured.

In [4]:
# Show the full plot text instead of truncating long strings
pd.set_option('display.max_colwidth', None)

df.head(5)

Unnamed: 0,movie_id,plot
0,23890098,"Shlykov, a hard-working taxi driver and Lyosha, a saxophonist, develop a bizarre love-hate relationship, and despite their prejudices, realize they aren't so different after all."
1,31186339,"The nation of Panem consists of a wealthy Capitol and twelve poorer districts. As punishment for a past rebellion, each district must provide a boy and girl between the ages of 12 and 18 selected by lottery for the annual Hunger Games. The tributes must fight to the death in an arena; the sole survivor is rewarded with fame and wealth. In her first Reaping, 12-year-old Primrose Everdeen is chosen from District 12. Her older sister Katniss volunteers to take her place. Peeta Mellark, a baker's son who once gave Katniss bread when she was starving, is the other District 12 tribute. Katniss and Peeta are taken to the Capitol, accompanied by their frequently drunk mentor, past victor Haymitch Abernathy. He warns them about the ""Career"" tributes who train intensively at special academies and almost always win. During a TV interview with Caesar Flickerman, Peeta unexpectedly reveals his love for Katniss. She is outraged, believing it to be a ploy to gain audience support, as ""sponsors"" may provide in-Games gifts of food, medicine, and tools. However, she discovers Peeta meant what he said. The televised Games begin with half of the tributes killed in the first few minutes; Katniss barely survives ignoring Haymitch's advice to run away from the melee over the tempting supplies and weapons strewn in front of a structure called the Cornucopia. Peeta forms an uneasy alliance with the four Careers. They later find Katniss and corner her up a tree. Rue, hiding in a nearby tree, draws her attention to a poisonous tracker jacker nest hanging from a branch. Katniss drops it on her sleeping besiegers. They all scatter, except for Glimmer, who is killed by the insects. Hallucinating due to tracker jacker venom, Katniss is warned to run away by Peeta. Rue cares for Katniss for a couple of days until she recovers. Meanwhile, the alliance has gathered all the supplies into a pile. 
Katniss has Rue draw them off, then destroys the stockpile by setting off the mines planted around it. Furious, Cato kills the boy assigned to guard it. As Katniss runs from the scene, she hears Rue calling her name. She finds Rue trapped and releases her. Marvel, a tribute from District 1, throws a spear at Katniss, but she dodges the spear, causing it to stab Rue in the stomach instead. Katniss shoots him dead with an arrow. She then comforts the dying Rue with a song. Afterward, she gathers and arranges flowers around Rue's body. When this is televised, it sparks a riot in Rue's District 11. President Snow summons Seneca Crane, the Gamemaker, to express his displeasure at the way the Games are turning out. Since Katniss and Peeta have been presented to the public as ""star-crossed lovers"", Haymitch is able to convince Crane to make a rule change to avoid inciting further riots. It is announced that tributes from the same district can win as a pair. Upon hearing this, Katniss searches for Peeta and finds him with an infected sword wound in the leg. She portrays herself as deeply in love with him and gains a sponsor's gift of soup. An announcer proclaims a feast, where the thing each survivor needs most will be provided. Peeta begs her not to risk getting him medicine. Katniss promises not to go, but after he falls asleep, she heads to the feast. Clove ambushes her and pins her down. As Clove gloats, Thresh, the other District 11 tribute, kills Clove after overhearing her tormenting Katniss about killing Rue. He spares Katniss ""just this time...for Rue"". The medicine works, keeping Peeta mobile. Foxface, the girl from District 5, dies from eating nightlock berries she stole from Peeta; neither knew they are highly poisonous. Crane changes the time of day in the arena to late at night and unleashes a pack of hound-like creatures to speed things up. They kill Thresh and force Katniss and Peeta to flee to the roof of the Cornucopia, where they encounter Cato. 
After a battle, Katniss wounds Cato with an arrow and Peeta hurls him to the creatures below. Katniss shoots Cato to spare him a prolonged death. With Peeta and Katniss apparently victorious, the rule change allowing two winners is suddenly revoked. Peeta tells Katniss to shoot him. Instead, she gives him half of the nightlock. However, before they can commit suicide, they are hastily proclaimed the victors of the 74th Hunger Games. Haymitch warns Katniss that she has made powerful enemies after her display of defiance. She and Peeta return to District 12, while Crane is locked in a room with a bowl of nightlock berries, and President Snow considers the situation."
2,20663735,"Poovalli Induchoodan is sentenced for six years prison life for murdering his classmate. Induchoodan, the only son of Justice Maranchery Karunakara Menon was framed in the case by Manapally Madhavan Nambiar and his crony DYSP Sankaranarayanan to take revenge on idealist judge Menon who had earlier given jail sentence to Manapally in a corruption case. Induchoodan, who had achieved top rank in Indian Civil Service loses the post and Manapally Sudheeran ([[Saikumar enters the list of civil service trainees. We learn in flashback that it was Ramakrishnan the son of Moopil Nair , who had actually killed his classmate. Six years passes by and Manapally Madhavan Nambiar, now a former state minister, is dead and Induchoodan, who is all rage at the gross injustice meted out to him - thus destroying his promising life, is released from prison. Induchoodan thwarts Manapally Pavithran from performing the funeral rituals of Nambiar at Bharathapuzha. Many confrontations between Induchoodan and Manapally's henchmen follow. Induchoodan also falls in love with Anuradha ([[Aishwarya , the strong-willed and independent-minded daughter of Mooppil Nair. Justice Menon and his wife returns back to Kerala to stay with Induchoodan. There is an appearance of a girl named Indulekha ([[Kanaka , who claims to be the daughter of Justice Menon. Menon flatly refuses the claim and banishes her. Forced by circumstances and at the instigation and help of Manapally Pavithran, she reluctantly come out open with the claim. Induchoodan at first thrashes the protesters. But upon knowing the truth from Chandrabhanu his uncle, he accepts the task of her protection in the capacity as elder brother. Induchoodan decides to marry off Indulekha to his good friend Jayakrishnan . Induchoodan has a confrontation with his father and prods him to accept mistake and acknowledge the parentage of Indulekha. Menon ultimately regrets and goes on to confess to his daughter. 
The very next day, when Induchoodan returns to Poovally, Indulekha is found dead and Menon is accused of murdering her. The whole act was planned by Pavithran, who after killing Indulekha, forces Raman Nair to testify against Menon in court. In court, Nandagopal Maarar , a close friend of Induchoodan and a famous supreme court lawyer, appears for Menon and manages to lay bare the murder plot and hidden intentions of other party . Menon is judged innocent of the crime by court. After confronting Pavithran and promising just retribution to the crime of killing Indulekha, Induchoodan returns to his father, who now shows remorse for all his actions including not believing in the innocence of his son. But while speaking to Induchoodan, Menon suffers a heart stroke and passes away. At Menon's funeral, Manapally Pavithran arrives to poke fun at Induchoodan and he also tries to carry out the postponed last rituals of his own father. Induchoodan interrupts the ritual and avenges for the death of his sister and father by severely injuring Pavithran. On his way back to peaceful life, Induchoodan accepts Anuradha as his life partner."
3,2231378,"The Lemon Drop Kid , a New York City swindler, is illegally touting horses at a Florida racetrack. After several successful hustles, the Kid comes across a beautiful, but gullible, woman intending to bet a lot of money. The Kid convinces her to switch her bet, employing a prefabricated con. Unfortunately for the Kid, the woman ""belongs"" to notorious gangster Moose Moran , as does the money. The Kid's choice finishes dead last and a furious Moran demands the Kid provide him with $10,000 by Christmas Eve, or the Kid ""won't make it to New Year's."" The Kid decides to return to New York to try to come up with the money. He first tries his on-again, off-again girlfriend Brainy Baxter . However, when talk of long-term commitment arises, the Kid quickly makes an escape. He next visits local crime boss ""Oxford"" Charley , with whom he has had past dealings. This falls through as Charley is in serious tax trouble and does not particularly care for the Kid anyway. As he leaves Charley's establishment and is about to give up hope, the Kid notices a cornerside Santa Claus and his kettle. Thinking quickly, the Kid fashions himself a Santa suit and begins collecting donations. This fails as he is recognized by a passing policeman, who remembers his previous underhanded activity well. The Kid lands in court, where he is convicted of collecting for a charity without a license and sentenced to ten days in jail . However, while in court, the Kid learns where his scheme went wrong. After a short stay, Brainy arrives to bail him out. He then sets about restarting his Santa operation, this time with legitimate backing. To this end, he needs a charity to represent and a city license. The kid receives key inspiration when he remembers that Nellie Thursday , a kindly neighborhood resident, has been denied entry to a retirement home because of her jailed husband's criminal past as a safecracker. 
Organizing other small-time New York swindlers and Brainy, who is both surprised and charmed at the Kid's apparent goodwill, the Kid converts an abandoned casino into the ""Nellie Thursday Home For Old Dolls"". A small group of elderly women and makeshift amenities complete the project. The Kid is able to receive the all-important city license. Now free to collect, the Kid and his compatriots dress as Santa Claus and position themselves throughout Manhattan. The others are unaware that the Kid plans to keep the money for himself to pay off Moran. The scheme is a huge success, netting $2,000 in only a few days. An overjoyed Brainy decides to leave her job as a dancer and look after the ""home"" full-time until after Christmas. Coincidentally, her employer is none other than ""Oxford"" Charley, whom Brainy cheerfully informs of the effort. Seeing a potential gold mine, Charley decides to muscle in on the operation. Reasoning that the Nellie Thursday home is ""wherever Nellie Thursday is"", Charley and his crew kidnap the home's inhabitants and move them to Charley's mansion in Nyack. The Kid learns of this when he returns to the home after a late night to find the home deserted and money gone. Clued in by oversized Oxford footprints in the snow, the Kid and his friends pay Charley a visit. Here, Charley reveals the true nature of the Kid's scheme through a phone conversation with Moose Moran. The Kid's accomplices are angry and move to confront him, but the Kid manages to slip away. However, Brainy tracks him down outside and voices her disgust at his actions. After a few days of stewing in self-pity , the Kid is surprised to meet Nellie, who has escaped Charley's compound. He decides to recover the money, sneaking into Charley's home in the guise of an elderly woman. He finds that Charley and his crew are again moving the women, this time to a more secure location. Using the heightened activity to his advantage, the Kid enters Charley's office and confronts him. 
After a brief struggle, the Kid overpowers Charley and makes off with the money, narrowly avoiding the thugs Charley has sent after him. The ensuing chaos allows Brainy and the others to escape. Later that night, the Kid returns to the original Nellie Thursday home to meet with Moose Moran . The deal appears to be in jeopardy as Moran arrives with Charley. Charley demands that the Kid reimburse him, which would leave too little for Moran. However, the Kid turns the tables by hitting a switch, revealing hidden casino tables. All are occupied, mainly by the escaped old dolls. The Kid and his still-loyal friends hold off the gangsters as the police initiate a raid. Moran and Charley are arrested while the judge who sentenced the Kid earlier warns that he will be ""keeping an eye on him"". The Kid assures him that will not be necessary and his attention will lie on the home, which is going to become a reality. The night's main event begins as Nellie's husband Henry, free on parole, joyously reunites with his wife."
4,595909,"Seventh-day Adventist Church pastor Michael Chamberlain, his wife Lindy, their two sons, and their nine-week-old daughter Azaria are on a camping holiday in the Outback. With the baby sleeping in their tent, the family is enjoying a barbecue with their fellow campers when a cry is heard. Lindy returns to the tent to check on Azaria and is certain she sees a dingo with something in its mouth running off as she approaches. When she discovers the infant is missing, everyone joins forces to search for her, without success. It is assumed what Lindy saw was the animal carrying off the child, and a subsequent inquest rules her account of events is true. The tide of public opinion soon turns against the Chamberlains. For many, Lindy seems too stoic, too cold-hearted, and too accepting of the disaster that has befallen her. Gossip about her begins to swell and soon is accepted as statements of fact. The couple's beliefs are not widely practised in the country, and when the media report a rumour that the name Azaria means ""sacrifice in the wilderness"" , the public is quick to believe they decapitated their baby with a pair of scissors as part of a bizarre religious rite. Law-enforcement officials find new witnesses, forensics experts, and a lot of circumstantial evidence—including a small wooden coffin Michael uses as a receptacle for his parishioners' packs of un-smoked cigarettes—and reopen the investigation, and eventually Lindy is charged with murder. Seven months pregnant, she ignores her attorneys' advice to play on the jury's sympathy and appears emotionless on the stand, convincing onlookers she is guilty of the crime of which she is accused. As the trial progresses, Michael's faith in his religion and his belief in his wife disintegrate, and he stumbles through his testimony, suggesting he is concealing the truth. 
In October 1982, Lindy is found guilty and sentenced to life imprisonment with hard labour, while Michael is found guilty as an accessory and given an 18-month suspended sentence. More than three years later, while searching for the body of an English tourist who fell from Uluru, police discover a small item of clothing that is identified as the jacket Lindy had insisted Azaria was wearing over her jumpsuit, which had been recovered early in the investigation. She is immediately released from prison, the case reopened and all convictions against the Chamberlains overturned."


To check the number of rows and columns, use the `shape` property.

In [5]:
df.shape

(42303, 2)

Now examine the metadata. The [dataset documentation](http://www.cs.cmu.edu/~ark/personas/data/README.txt) explains that the data contains nine fields. Load the data into a pandas DataFrame and specify the column names.

In [6]:
movie_meta_df = pd.read_csv('../data/movie.metadata.tsv', sep='\t', names=['movie_id','freebase_id','name','release_date','box_office_revenue','runtime','languages','countries','genres'])
movie_meta_df.head()

Unnamed: 0,movie_id,freebase_id,name,release_date,box_office_revenue,runtime,languages,countries,genres
0,975900,/m/03vyhn,Ghosts of Mars,2001-08-24,14010832.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science Fiction"", ""/m/03npn"": ""Horror"", ""/m/03k9fj"": ""Adventure"", ""/m/0fdjb"": ""Supernatural"", ""/m/02kdv5l"": ""Action"", ""/m/09zvmj"": ""Space western""}"
1,3196793,/m/08yl5d,Getting Away with Murder: The JonBenét Ramsey Mystery,2000-02-16,,95.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/02n4kr"": ""Mystery"", ""/m/03bxz7"": ""Biographical film"", ""/m/07s9rl0"": ""Drama"", ""/m/0hj3n01"": ""Crime Drama""}"
2,28463795,/m/0crgdbh,Brun bitter,1988,,83.0,"{""/m/05f_3"": ""Norwegian Language""}","{""/m/05b4w"": ""Norway""}","{""/m/0lsxr"": ""Crime Fiction"", ""/m/07s9rl0"": ""Drama""}"
3,9363483,/m/0285_cd,White Of The Eye,1987,,110.0,"{""/m/02h40lc"": ""English Language""}","{""/m/07ssc"": ""United Kingdom""}","{""/m/01jfsb"": ""Thriller"", ""/m/0glj9q"": ""Erotic thriller"", ""/m/09blyk"": ""Psychological thriller""}"
4,261236,/m/01mrr1,A Woman in Flames,1983,,106.0,"{""/m/04306rv"": ""German Language""}","{""/m/0345h"": ""Germany""}","{""/m/07s9rl0"": ""Drama""}"


Set the index to **movie_id**, which will make it easier to merge this dataset with the plot summaries.

In [7]:
movie_meta_df.set_index('movie_id', inplace=True)

Because you only need the movie name and something to link this metadata to the plot (**movie_id**), drop the remaining columns.

In [8]:
movie_meta_df=movie_meta_df.drop(['freebase_id','release_date','box_office_revenue','runtime','languages','countries','genres'], axis=1)
movie_meta_df.head()

movie_id,name
975900,Ghosts of Mars
3196793,Getting Away with Murder: The JonBenét Ramsey Mystery
28463795,Brun bitter
9363483,White Of The Eye
261236,A Woman in Flames
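
With **movie_id** as the index of the metadata, the movie names can be attached to the plot summaries with a join. The sketch below uses small stand-in DataFrames so that it runs on its own; the plot text and the third movie ID are made-up values for illustration, and a left join is only one reasonable choice.

```python
import pandas as pd

# Toy stand-ins for df and movie_meta_df (plot text and ID 12345 are hypothetical)
plots = pd.DataFrame({
    'movie_id': [975900, 261236, 12345],
    'plot': ['plot a', 'plot b', 'plot c'],
})
meta = pd.DataFrame(
    {'name': ['Ghosts of Mars', 'A Woman in Flames']},
    index=pd.Index([975900, 261236], name='movie_id'),
)

# Match the movie_id column of plots against the index of meta.
# A left join keeps every plot; movies without metadata get NaN in 'name'.
merged = plots.join(meta, on='movie_id', how='left')
print(merged)
```

In the notebook, the equivalent call would be along the lines of `df.join(movie_meta_df, on='movie_id')`.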


## 3. Extracting topics
([Go to top](#Challenge-Lab-6.3:-Implementing-Topic-Modeling))

You must now decide whether to use Amazon Comprehend or the SageMaker NTM algorithm to extract your topics. Both can produce useful topics, but they have different data requirements: Amazon Comprehend reads plain-text documents from Amazon S3, whereas NTM trains on bag-of-words vectors of word counts.

Refer to the notebooks from labs 6.1 and 6.2 for any code snippets you might need for each solution. Experiment with the number of topics to see if you can get better results. 

Questions to address:

1. What data cleanup do you need to perform?

2. How many topics will give you the best results?
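
As one way to approach the first question: if you choose NTM, each document must become a vector of word counts over a fixed vocabulary, whereas Amazon Comprehend can read the cleaned text directly. The sketch below builds such vectors with only the standard library. The function names and the `min_doc_freq` threshold are illustrative assumptions, not part of either service's API.

```python
from collections import Counter

def build_vocabulary(docs, min_doc_freq=2):
    """Keep tokens that occur in at least min_doc_freq documents."""
    doc_freq = Counter()
    for doc in docs:
        # Count each token once per document
        doc_freq.update(set(doc.split()))
    return sorted(tok for tok, n in doc_freq.items() if n >= min_doc_freq)

def to_count_vector(doc, vocabulary):
    """Count how often each vocabulary token occurs in one document."""
    counts = Counter(doc.split())
    return [counts[tok] for tok in vocabulary]

# Tiny made-up documents that stand in for preprocessed plots
docs = ['kid money kid scheme', 'money court judge', 'court kid trial']
vocab = build_vocabulary(docs)
vectors = [to_count_vector(d, vocab) for d in docs]
print(vocab)
print(vectors)
```

Dropping tokens that appear in very few documents keeps the vocabulary small; the right threshold for 42,000 plot summaries is something to experiment with.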

In [9]:
# Initialize a tokenizer and lemmatizer
tokenizer = RegexpTokenizer(r'\w+')
lemmatizer = WordNetLemmatizer()

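# Load the English stop word list (downloaded in the first cell)
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    """Lowercase, tokenize, remove stop words and short tokens, then lemmatize."""
    tokens = tokenizer.tokenize(text.lower())  # Lowercase, then keep runs of word characters
    # Drop stop words and tokens of one or two characters, then lemmatize what remains
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words and len(word) > 2]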
    return ' '.join(tokens)

# Apply preprocessing to the plot column
df['processed_plot'] = df['plot'].apply(preprocess_text)

# Preview the processed data
df[['movie_id', 'processed_plot']].head(5)

Unnamed: 0,movie_id,processed_plot
0,23890098,shlykov hard working taxi driver lyosha saxophonist develop bizarre love hate relationship despite prejudice realize different
1,31186339,nation panem consists wealthy capitol twelve poorer district punishment past rebellion district must provide boy girl age selected lottery annual hunger game tribute must fight death arena sole survivor rewarded fame wealth first reaping year old primrose everdeen chosen district older sister katniss volunteer take place peeta mellark baker son gave katniss bread starving district tribute katniss peeta taken capitol accompanied frequently drunk mentor past victor haymitch abernathy warns career tribute train intensively special academy almost always win interview caesar flickerman peeta unexpectedly reveals love katniss outraged believing ploy gain audience support sponsor may provide game gift food medicine tool however discovers peeta meant said televised game begin half tribute killed first minute katniss barely survives ignoring haymitch advice run away melee tempting supply weapon strewn front structure called cornucopia peeta form uneasy alliance four career later find katniss corner tree rue hiding nearby tree draw attention poisonous tracker jacker nest hanging branch katniss drop sleeping besieger scatter except glimmer killed insect hallucinating due tracker jacker venom katniss warned run away peeta rue care katniss couple day recovers meanwhile alliance gathered supply pile katniss rue draw destroys stockpile setting mine planted around furious cato kill boy assigned guard katniss run scene hears rue calling name find rue trapped release marvel tribute district throw spear katniss dodge spear causing stab rue stomach instead katniss shoot dead arrow comfort dying rue song afterward gather arranges flower around rue body televised spark riot rue district president snow summons seneca crane gamemaker express displeasure way game turning since katniss peeta presented public star crossed lover haymitch able convince crane make rule change avoid inciting riot announced tribute district win pair upon hearing katniss search peeta find infected sword 
wound leg portrays deeply love gain sponsor gift soup announcer proclaims feast thing survivor need provided peeta begs risk getting medicine katniss promise fall asleep head feast clove ambush pin clove gloat thresh district tribute kill clove overhearing tormenting katniss killing rue spare katniss time rue medicine work keeping peeta mobile foxface girl district dy eating nightlock berry stole peeta neither knew highly poisonous crane change time day arena late night unleashes pack hound like creature speed thing kill thresh force katniss peeta flee roof cornucopia encounter cato battle katniss wound cato arrow peeta hurl creature katniss shoot cato spare prolonged death peeta katniss apparently victorious rule change allowing two winner suddenly revoked peeta tell katniss shoot instead give half nightlock however commit suicide hastily proclaimed victor 74th hunger game haymitch warns katniss made powerful enemy display defiance peeta return district crane locked room bowl nightlock berry president snow considers situation
2,20663735,poovalli induchoodan sentenced six year prison life murdering classmate induchoodan son justice maranchery karunakara menon framed case manapally madhavan nambiar crony dysp sankaranarayanan take revenge idealist judge menon earlier given jail sentence manapally corruption case induchoodan achieved top rank indian civil service loses post manapally sudheeran saikumar enters list civil service trainee learn flashback ramakrishnan son moopil nair actually killed classmate six year pass manapally madhavan nambiar former state minister dead induchoodan rage gross injustice meted thus destroying promising life released prison induchoodan thwart manapally pavithran performing funeral ritual nambiar bharathapuzha many confrontation induchoodan manapally henchman follow induchoodan also fall love anuradha aishwarya strong willed independent minded daughter mooppil nair justice menon wife return back kerala stay induchoodan appearance girl named indulekha kanaka claim daughter justice menon menon flatly refuse claim banishes forced circumstance instigation help manapally pavithran reluctantly come open claim induchoodan first thrash protester upon knowing truth chandrabhanu uncle accepts task protection capacity elder brother induchoodan decides marry indulekha good friend jayakrishnan induchoodan confrontation father prod accept mistake acknowledge parentage indulekha menon ultimately regret go confess daughter next day induchoodan return poovally indulekha found dead menon accused murdering whole act planned pavithran killing indulekha force raman nair testify menon court court nandagopal maarar close friend induchoodan famous supreme court lawyer appears menon manages lay bare murder plot hidden intention party menon judged innocent crime court confronting pavithran promising retribution crime killing indulekha induchoodan return father show remorse action including believing innocence son speaking induchoodan menon suffers heart stroke pass away menon funeral 
manapally pavithran arrives poke fun induchoodan also try carry postponed last ritual father induchoodan interrupt ritual avenges death sister father severely injuring pavithran way back peaceful life induchoodan accepts anuradha life partner
3,2231378,lemon drop kid new york city swindler illegally touting horse florida racetrack several successful hustle kid come across beautiful gullible woman intending bet lot money kid convinces switch bet employing prefabricated con unfortunately kid woman belongs notorious gangster moose moran money kid choice finish dead last furious moran demand kid provide 000 christmas eve kid make new year kid decides return new york try come money first try girlfriend brainy baxter however talk long term commitment arises kid quickly make escape next visit local crime bos oxford charley past dealing fall charley serious tax trouble particularly care kid anyway leaf charley establishment give hope kid notice cornerside santa claus kettle thinking quickly kid fashion santa suit begin collecting donation fails recognized passing policeman remembers previous underhanded activity well kid land court convicted collecting charity without license sentenced ten day jail however court kid learns scheme went wrong short stay brainy arrives bail set restarting santa operation time legitimate backing end need charity represent city license kid receives key inspiration remembers nellie thursday kindly neighborhood resident denied entry retirement home jailed husband criminal past safecracker organizing small time new york swindler brainy surprised charmed kid apparent goodwill kid convert abandoned casino nellie thursday home old doll small group elderly woman makeshift amenity complete project kid able receive important city license free collect kid compatriot dress santa claus position throughout manhattan others unaware kid plan keep money pay moran scheme huge success netting 000 day overjoyed brainy decides leave job dancer look home full time christmas coincidentally employer none oxford charley brainy cheerfully informs effort seeing potential gold mine charley decides muscle operation reasoning nellie thursday home wherever nellie thursday charley crew kidnap home inhabitant move 
charley mansion nyack kid learns return home late night find home deserted money gone clued oversized oxford footprint snow kid friend pay charley visit charley reveals true nature kid scheme phone conversation moose moran kid accomplice angry move confront kid manages slip away however brainy track outside voice disgust action day stewing self pity kid surprised meet nellie escaped charley compound decides recover money sneaking charley home guise elderly woman find charley crew moving woman time secure location using heightened activity advantage kid enters charley office confronts brief struggle kid overpowers charley make money narrowly avoiding thug charley sent ensuing chaos allows brainy others escape later night kid return original nellie thursday home meet moose moran deal appears jeopardy moran arrives charley charley demand kid reimburse would leave little moran however kid turn table hitting switch revealing hidden casino table occupied mainly escaped old doll kid still loyal friend hold gangster police initiate raid moran charley arrested judge sentenced kid earlier warns keeping eye kid assures necessary attention lie home going become reality night main event begin nellie husband henry free parole joyously reunites wife
4,595909,seventh day adventist church pastor michael chamberlain wife lindy two son nine week old daughter azaria camping holiday outback baby sleeping tent family enjoying barbecue fellow camper cry heard lindy return tent check azaria certain see dingo something mouth running approach discovers infant missing everyone join force search without success assumed lindy saw animal carrying child subsequent inquest rule account event true tide public opinion soon turn chamberlain many lindy seems stoic cold hearted accepting disaster befallen gossip begin swell soon accepted statement fact couple belief widely practised country medium report rumour name azaria mean sacrifice wilderness public quick believe decapitated baby pair scissors part bizarre religious rite law enforcement official find new witness forensics expert lot circumstantial evidence including small wooden coffin michael us receptacle parishioner pack smoked cigarette reopen investigation eventually lindy charged murder seven month pregnant ignores attorney advice play jury sympathy appears emotionless stand convincing onlooker guilty crime accused trial progress michael faith religion belief wife disintegrate stumble testimony suggesting concealing truth october 1982 lindy found guilty sentenced life imprisonment hard labour michael found guilty accessory given month suspended sentence three year later searching body english tourist fell uluru police discover small item clothing identified jacket lindy insisted azaria wearing jumpsuit recovered early investigation immediately released prison case reopened conviction chamberlain overturned


In [11]:
# Initialize S3 client
s3_client = boto3.client('s3')

# Define S3 bucket and file path
bucket = 'c125984a3128011l7679056t1w199902952632-labbucket-owxxnbun3jgz'
file_path = 'path/to/processed_plot_data.csv'

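# Save the processed data to a local file, one document per line.
# header=False keeps the column name from being read as a document.
df[['processed_plot']].to_csv('processed_plot_data.csv', index=False, header=False)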

# Upload the file to S3
s3_client.upload_file('processed_plot_data.csv', bucket, file_path)
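
The `ONE_DOC_PER_LINE` input format used in the next cell treats every line of the uploaded file as a separate document, so a header row or a stray newline inside a summary would count as a document of its own. The following is a minimal sketch of writing such a file with the standard library only. The helper name `write_one_doc_per_line` is hypothetical and is not part of the lab.

```python
def write_one_doc_per_line(documents, output_path):
    """Write each document on its own line, with no header row.

    Hypothetical helper for illustration; returns the number of lines written.
    """
    lines_written = 0
    with open(output_path, "w", encoding="utf-8") as handle:
        for text in documents:
            # Collapse all whitespace runs (including newlines) into single
            # spaces so that a document never spans more than one line.
            single_line = " ".join(str(text).split())
            if single_line:  # skip documents that are empty after cleaning
                handle.write(single_line + "\n")
                lines_written += 1
    return lines_written
```

In this notebook it could be called as `write_one_doc_per_line(df['processed_plot'], 'processed_plot_data.csv')` before the upload, assuming `df` holds the processed summaries.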

In [13]:
import uuid  # used below to give the job a unique name

comprehend_client = boto3.client('comprehend')

response = comprehend_client.start_topics_detection_job(
    JobName=f"top-job-{uuid.uuid1()}",
    InputDataConfig={
        'S3Uri': f's3://{bucket}/{file_path}',
        'InputFormat': 'ONE_DOC_PER_LINE'
    },
    OutputDataConfig={
        'S3Uri': f's3://{bucket}/output/',
    },
    DataAccessRoleArn='arn:aws:iam::199902952632:role/service-role/c125984a3128011l7679056t1w-ComprehendDataAccessRole-S5V4dviXZ8eE',
    NumberOfTopics=10  # Specify the number of topics
)

job_id = response['JobId']
print(f'Topic modeling job started: {job_id}')

Topic modeling job started: ccffbc1c4c551f56105e8d75e34f74f3


In [14]:
# Monitor the job status
job_status = comprehend_client.describe_topics_detection_job(JobId=job_id)['TopicsDetectionJobProperties']['JobStatus']
print(f'Job Status: {job_status}')

Job Status: IN_PROGRESS


In [15]:
import time

def check_job_status(job_id):
    """Poll the topic detection job every 30 seconds until it finishes."""
    while True:
        response = comprehend_client.describe_topics_detection_job(
            JobId=job_id
        )
        status = response['TopicsDetectionJobProperties']['JobStatus']
        print(f"Job Status: {status}")
        if status == 'COMPLETED':
            print("Job completed successfully!")
            break
        elif status in ('FAILED', 'STOPPED'):
            # Stop polling on any terminal state so the loop cannot run forever
            raise RuntimeError(f"The job ended with status {status}.")
        time.sleep(30)  # Check every 30 seconds

# Wait for the job started above to finish
check_job_status(job_id)

Job Status: IN_PROGRESS
Job Status: IN_PROGRESS
Job Status: IN_P
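
As the output above suggests, a topic detection job can stay `IN_PROGRESS` for a long time, and a loop that only exits on `COMPLETED` or `FAILED` has no upper bound. The following is a minimal sketch of a polling loop with a timeout. It takes a status-returning callable instead of a client, so it can be tried without AWS access. The names `wait_for_job`, `TERMINAL_STATES`, and the default intervals are assumptions made for this example.

```python
import time

# States after which the job will not change again (assumed set for this sketch).
TERMINAL_STATES = {"COMPLETED", "FAILED", "STOPPED"}


def wait_for_job(get_status, poll_seconds=30, timeout_seconds=3600, sleep=time.sleep):
    """Call get_status() until it returns a terminal state or the timeout passes.

    Returns the terminal status string; raises TimeoutError otherwise.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job still {status} after {timeout_seconds} seconds")
        sleep(poll_seconds)


# Possible use with the client from this notebook (not executed here):
# status = wait_for_job(
#     lambda: comprehend_client.describe_topics_detection_job(JobId=job_id)
#     ['TopicsDetectionJobProperties']['JobStatus']
# )
```

Passing `sleep` as a parameter is a design choice that makes the loop easy to exercise with a fake status source and no real waiting.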

In [16]:
import boto3

s3_client = boto3.client('s3')

def download_output_file(bucket, topics_comprehend_key, file_name):
    # Download the object stored under the given key in the S3 bucket to /tmp
    output_file = f'/tmp/{file_name}'
    s3_client.download_file(bucket, topics_comprehend_key, output_file)
    return output_file

In [17]:
import tarfile
import os

def extract_tar_gz(file_path, extract_to='/tmp'):
    """Extract a .tar.gz archive and return the top-level entries of extract_to."""
    with tarfile.open(file_path, "r:gz") as tar:
        tar.extractall(path=extract_to)
        print(f"Extracted files to {extract_to}")
        # List everything now in extract_to; this includes entries that were
        # already there, not only the files that came from the archive
        extracted_files = os.listdir(extract_to)
        print(f"Extracted files: {extracted_files}")

        # Print out all files, including those in subdirectories
        for root, dirs, files in os.walk(extract_to):
            print(f"Directory: {root}")
            for file in files:
                print(f"File: {file}")

        return extracted_files
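
Because `/tmp` is shared with other processes, listing it after extraction also shows unrelated entries. One alternative, sketched below under stated assumptions, is to pull only the expected files out of the archive into a dedicated directory. The function name `extract_named_members` and its parameters are hypothetical and are not part of the lab code.

```python
import os
import tarfile


def extract_named_members(archive_path, wanted_names, extract_to):
    """Extract only regular files whose base name is in wanted_names.

    Returns the list of paths written under extract_to.
    """
    os.makedirs(extract_to, exist_ok=True)
    extracted = []
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            base_name = os.path.basename(member.name)
            if member.isfile() and base_name in wanted_names:
                # Copy the member's bytes to extract_to, ignoring any
                # directory components stored inside the archive.
                source = tar.extractfile(member)
                target = os.path.join(extract_to, base_name)
                with source, open(target, "wb") as out:
                    out.write(source.read())
                extracted.append(target)
    return extracted
```

A call such as `extract_named_members('/tmp/output.tar.gz', {'topic-terms.csv', 'doc-topics.csv'}, '/tmp/comprehend_output')` would then return only the paths of the two result files, where the target directory name is an assumption.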

In [18]:
import shutil

# Function to search for the result CSV files and copy them next to the notebook
def move_csv_files(extract_to='/tmp', notebook_dir='../en_us'):
    found_files = []
    
    # Walk through all extracted directories and files
    for root, dirs, files in os.walk(extract_to):
        for file in files:
            if file in ['topic-terms.csv', 'doc-topics.csv']:
                full_file_path = os.path.join(root, file)
                found_files.append(full_file_path)
                # Copy the file to the notebook directory (the original stays in place)
                shutil.copy(full_file_path, os.path.join(notebook_dir, file))
                print(f"Moved {file} to {notebook_dir}")

    if not found_files:
        print("No CSV files found.")

In [19]:
import csv

def print_topics(file_path):
    with open(file_path, 'r') as csv_file:
        csv_reader = csv.reader(csv_file)
        next(csv_reader)  # Skip header row
        for row in csv_reader:
            topic_index = row[0]    # Topic number
            term = row[1]           # Term associated with the topic
            score = row[2]          # Relevance score of the term
            print(f'Topic {topic_index}: {term} (Score: {score})')
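
Printing one line per term makes it hard to see what each topic is about. The following sketch groups the rows by topic and keeps the highest-weighted terms for each. It assumes the same three-column layout that `print_topics` reads (topic, term, weight). The function names `top_terms_by_topic` and `read_topic_terms` are hypothetical.

```python
import csv
from collections import defaultdict


def top_terms_by_topic(rows, limit=5):
    """Group (topic, term, weight) rows by topic and keep the heaviest terms."""
    grouped = defaultdict(list)
    for topic, term, weight in rows:
        grouped[topic].append((term, float(weight)))
    return {
        topic: sorted(terms, key=lambda pair: pair[1], reverse=True)[:limit]
        for topic, terms in grouped.items()
    }


def read_topic_terms(file_path):
    """Read a topic-terms CSV file and return its rows without the header."""
    with open(file_path, newline="") as csv_file:
        reader = csv.reader(csv_file)
        next(reader)  # skip header row
        return [tuple(row[:3]) for row in reader]
```

For example, `top_terms_by_topic(read_topic_terms('/tmp/topic-terms.csv'), limit=3)` would return a dictionary that maps each topic to its three heaviest terms.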

In [20]:
response = comprehend_client.describe_topics_detection_job(JobId=job_id)
topic_comprehend_output_file = response['TopicsDetectionJobProperties']['OutputDataConfig']['S3Uri']
print(f'output filename: {topic_comprehend_output_file}')

topics_comprehend_bucket, topics_comprehend_key = topic_comprehend_output_file.replace("s3://", "").split("/", 1)
print(topics_comprehend_key)

output filename: s3://c125984a3128011l7679056t1w199902952632-labbucket-owxxnbun3jgz/output/199902952632-TOPICS-ccffbc1c4c551f56105e8d75e34f74f3/output/output.tar.gz
output/199902952632-TOPICS-ccffbc1c4c551f56105e8d75e34f74f3/output/output.tar.gz
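
The cell above splits the S3 URI with string operations. A slightly more defensive variant, sketched here, uses `urllib.parse.urlparse` from the standard library and rejects anything that is not an `s3://` URI. The helper name `split_s3_uri` is hypothetical. Keys that contain `?` or `#` would need different handling, because `urlparse` treats those characters as query and fragment separators.

```python
from urllib.parse import urlparse


def split_s3_uri(s3_uri):
    """Split 's3://bucket/key/parts' into ('bucket', 'key/parts')."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {s3_uri!r}")
    # The path starts with '/', which is not part of the object key.
    return parsed.netloc, parsed.path.lstrip("/")
```

With the output location printed above, `split_s3_uri(topic_comprehend_output_file)` would return the lab bucket name and the key that ends in `output/output.tar.gz`.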


In [21]:
# Define file name
file_name = 'output.tar.gz'

# Download tar.gz file
tar_gz_file = download_output_file(bucket, topics_comprehend_key, file_name)

# Extract the tar.gz file
extracted_files = extract_tar_gz(tar_gz_file)

# Copy the CSV files to the notebook location
move_csv_files()

# Assuming topic-terms.csv is one of the extracted files, print topics
if 'topic-terms.csv' in extracted_files:
    print_topics('/tmp/topic-terms.csv')
else:
    print("topic-terms.csv not found!")

Extracted files to /tmp
Extracted files: ['.font-unix', '.X11-unix', '.ICE-unix', '.Test-unix', '.XIM-unix', 'output.tar.gz', 'doc-topics.csv', 'topic-terms.csv', 'systemd-private-bd1d0c9b686c4491a99dda395b4645f5-chronyd.service-UXJKz4', 'hsperfdata_role-agent', '.java_pid4139', 'jetty-localhost-9081-role-proxy-agent_war-_-any-10912996224794097439']
Directory: /tmp
File: output.tar.gz
File: doc-topics.csv
File: topic-terms.csv
File: .java_pid4139
Directory: /tmp/.font-unix
Directory: /tmp/.X11-unix
Directory: /tmp/.ICE-unix
Directory: /tmp/.Test-unix
Directory: /tmp/.XIM-unix
Directory: /tmp/hsperfdata_role-agent
File: 4139
Moved doc-topics.csv to ../en_us
Moved topic-terms.csv to ../en_us
Topic 000: money (Score: 0.0070251743)
Topic 000: gang (Score: 0.0046430626)
Topic 000: work (Score: 0.0056518908)
Topic 000: steal (Score: 0.0035103702)
Topic 000: make (Score: 0.0059012785)
Topic 000: police (Score: 0.004918466)
Topic 000: bank (Score: 0.0027779688)
Topic 000: job (Score: 0.0032056
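
The job also produced `doc-topics.csv`, which the cells above copy but do not read. The following sketch picks the topic with the largest proportion for each document. It assumes the file has a header row with the columns `docname`, `topic`, and `proportion`, so check the header of your own file before relying on it. The function name `dominant_topics` is hypothetical.

```python
import csv


def dominant_topics(csv_file):
    """Return {docname: (topic, proportion)} with the largest proportion per document.

    csv_file is any open text file or file-like object with a header row
    containing the assumed columns docname, topic, and proportion.
    """
    best = {}
    for row in csv.DictReader(csv_file):
        proportion = float(row["proportion"])
        current = best.get(row["docname"])
        if current is None or proportion > current[1]:
            best[row["docname"]] = (row["topic"], proportion)
    return best
```

A possible use is `with open('/tmp/doc-topics.csv', newline='') as f: best = dominant_topics(f)`, after which `best` maps each document name to its strongest topic.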

# Congratulations!

You have completed this lab, and you can now end the lab by following the lab guide instructions.

*©2023 Amazon Web Services, Inc. or its affiliates. All rights reserved. This work may not be reproduced or redistributed, in whole or in part, without prior written permission from Amazon Web Services, Inc. Commercial copying, lending, or selling is prohibited. All trademarks are the property of their owners.*
