# Keyword Extraction
This notebook reads the transcript dataset and generates keywords for each video using the `keyphrase_vectorizers`, `spacy`, and `keybert` libraries.

In [1]:
import pandas as pd
from keyphrase_vectorizers import KeyphraseCountVectorizer
import spacy
from keybert import KeyBERT
from collections import Counter
from nltk.stem import WordNetLemmatizer
import re
import os

## Input Course URL
The course ID is the path segment that follows `/learn/` in the URL of any page in the course.

In [2]:
url = 'https://www.coursera.org/learn/siads697698/lecture/3vwIb/how-to-do-a-standup'
# Raw string with escaped dots; [\w-]+ also matches course IDs that contain hyphens
course = re.search(r'(?<=coursera\.org/learn/)([\w-]+)', url).group(0)
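The lookup above raises an `AttributeError` when the URL does not contain `/learn/`. A minimal sketch of a more defensive version follows; the helper name `extract_course_id` is hypothetical and not part of the original notebook.

```python
import re


def extract_course_id(url):
    """Return the course ID that follows '/learn/' in a Coursera URL.

    Hypothetical helper, not part of the original notebook.
    """
    match = re.search(r"coursera\.org/learn/([\w-]+)", url)
    if match is None:
        # Fail with a readable message instead of an AttributeError on None
        raise ValueError(f"No course ID found in URL: {url}")
    return match.group(1)
```

Calling `extract_course_id(url)` with the URL from the cell above should give the same value as `course`.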

## Load Directory

In [4]:
# Relative path, consistent with the read and save cells below
directory = os.listdir('./intermediate_data')
new = True
for file in directory:
    if '{}_summaries_keywords'.format(course) in file:
        print("Course Already In Directory")
        new = False
        break

Course Already In Directory
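The same check can be sketched with `pathlib`, which avoids the manual loop and flag. The function name `course_already_processed` and its parameters are assumptions for illustration only.

```python
from pathlib import Path


def course_already_processed(data_dir, course):
    """Return True if any file in data_dir has '<course>_summaries_keywords' in its name.

    Hypothetical helper, not part of the original notebook.
    """
    pattern = f"*{course}_summaries_keywords*"
    # glob() yields matching paths lazily; any() stops at the first hit
    return any(Path(data_dir).glob(pattern))
```

In this notebook it would be called as `course_already_processed('./intermediate_data', course)`.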


## Read in Transcript DataFrame

In [5]:
df = pd.read_csv("./intermediate_data/transcripts_{}_summaries.csv".format(course)).drop(['Unnamed: 0'], axis=1)
df

Unnamed: 0,course_id,video_title,transcripts,length,summary
0,siads697698,recording-of-elle-o-brien-office-hours-siads-6...,We'll see if anybody is joining us today. What...,20293,"Because at first time, I was thinking about us..."
1,siads697698,recording-of-elle-o-brien-office-hours-siads-6...,"I'm going to do this, Git log.one line and tha...",20294,"In this branch, let's create a new file and ca..."
2,siads697698,recording-of-elle-o-brien-office-hours-siads-6...,"Well, not sure if anybody is joining this morn...",21840,"If that doesn't get you exactly what you want,..."
3,siads697698,recording-of-elle-o-brien-office-hours-siads-6...,"now. Cool. There's lots of stuff here. Wow, l...",21840,If you're not very comfortable doing terminal ...
4,siads697698,recording-of-elle-o-brien-office-hours-siads-6...,"Hello, nice to meet you. >> Nice to meet you t...",24317,">> Yeah, you can do that too, so today, I'm go..."
5,siads697698,recording-of-elle-o-brien-office-hours-siads-6...,model perhaps we don't have any rules like th...,24317,"And so the ideal model for each one could, it ..."
6,siads697698,how-to-write-an-effective-blog-post,It's not enough to just do data science on you...,7725,It could be that you've done something that ot...
7,siads697698,how-to-do-a-standup,I mentioned to you that we're going to do some...,3253,I mentioned to you that we're going to do some...
8,siads697698,how-to-collaborate-with-a-team,One of the most unexpectedly challenging parts...,13279,I don't know what a really reliable and certai...
9,siads697698,capstone-overview,"Hi, welcome to the capstone. My name's Dr. Ell...",5438,Office hours are not required or expected of y...


## Select a Sample Transcript for Keywords

In [6]:
# Row 6 is the 'how-to-write-an-effective-blog-post' video
new_line_full = df.transcripts.iloc[6]
new_line_summary = df.summary.iloc[6]

## Initialize Vectorizer

In [7]:
vectorizer = KeyphraseCountVectorizer()
nlp = spacy.load('en_core_web_sm')

## Initialize Model

In [8]:
kw_model = KeyBERT()

## Generate Full Text Keywords

In [10]:
res_full = kw_model.extract_keywords(docs=new_line_full, vectorizer=KeyphraseCountVectorizer(), top_n=10)
res_full

[('professional data science', 0.4502),
 ('data scientist', 0.4373),
 ('blog', 0.4039),
 ('data science career', 0.3948),
 ('blogs', 0.3813),
 ('data science', 0.3778),
 ('data science project', 0.3187),
 ('data', 0.3146),
 ('work', 0.3116),
 ('writing', 0.3071)]

## Generate Summary Text Keywords

In [11]:
res_summary = kw_model.extract_keywords(docs=new_line_summary, vectorizer=KeyphraseCountVectorizer(), top_n=10)
res_summary

[('voice', 0.4438),
 ('editing', 0.3516),
 ('model', 0.3113),
 ('model architecture', 0.2947),
 ('text', 0.1829),
 ('category', 0.1523),
 ('picture', 0.1472),
 ('other people', 0.1382),
 ('lot', 0.1288),
 ('enough room', 0.1278)]

## Combine Keywords and Take Top 5

In [12]:
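# Merge both keyword lists and rank them by score, highest first
scored = res_full + res_summary
scored.sort(key=lambda pair: pair[1], reverse=True)
# dict() drops duplicate phrases while keeping the ranked order
k = list(dict(scored))

# Lemmatize every word of every phrase (requires the NLTK 'wordnet' corpus)
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(word) for phrase in k for word in phrase.split()]

# Words that occur at least twice across the phrases signal near-duplicate keywords
rep_words = [word for word, occurrences in Counter(lemmas).items() if occurrences >= 2]

# Keep the highest-scoring phrase containing each repeated word and drop the later ones
tracker_dict = {word: False for word in rep_words}
remove_list = []
for phrase in k:
    remove = False
    for word in rep_words:
        if word in phrase:  # substring match, so 'blog' also matches 'blogs'
            if not tracker_dict[word]:
                tracker_dict[word] = True
            else:
                remove = True
    if remove:
        remove_list.append(phrase)

k = [phrase for phrase in k if phrase not in remove_list]
top_5 = k[:5]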

In [13]:
top_5

['professional data science', 'voice', 'blog', 'editing', 'work']
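The deduplication step can be sketched as a standalone function that runs without NLTK. This is an illustration under stated assumptions, not the notebook's exact code: `naive_lemma` is a hypothetical stand-in for `WordNetLemmatizer` that only strips a trailing "s", and `merge_and_dedupe` is a hypothetical name.

```python
from collections import Counter


def naive_lemma(word):
    # Hypothetical stand-in for WordNetLemmatizer: strips one trailing "s"
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def merge_and_dedupe(full, summary, top_n=5, lemmatize=naive_lemma):
    """Merge two (phrase, score) lists and drop phrases that repeat a shared word."""
    # Rank all candidates by score; dict() keeps the first occurrence of each phrase
    ranked = sorted(full + summary, key=lambda pair: pair[1], reverse=True)
    phrases = list(dict(ranked))

    # Words appearing at least twice across the phrases mark near-duplicates
    counts = Counter(lemmatize(w) for p in phrases for w in p.split())
    repeated = [w for w, n in counts.items() if n >= 2]

    seen = set()
    kept = []
    for phrase in phrases:
        duplicate = False
        for word in repeated:
            if word in phrase:  # substring match, as in the cell above
                if word in seen:
                    duplicate = True
                else:
                    seen.add(word)
        if not duplicate:
            kept.append(phrase)
    return kept[:top_n]
```

Passing `WordNetLemmatizer().lemmatize` as the `lemmatize` argument would bring the sketch closer to the cell above. Because the match is on substrings, a repeated word such as "data" would also match an unrelated phrase like "database".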

## Apply to All Transcripts

In [14]:
def keyword_creation(transcript, summary):
    """Return up to five deduplicated keywords for one transcript and its summary."""
    res_full = kw_model.extract_keywords(docs=transcript, vectorizer=KeyphraseCountVectorizer(), top_n=10)
    res_summary = kw_model.extract_keywords(docs=summary, vectorizer=KeyphraseCountVectorizer(), top_n=10)

    # Merge both lists, rank by score, and drop duplicate phrases (order is preserved)
    scored = res_full + res_summary
    scored.sort(key=lambda pair: pair[1], reverse=True)
    k = list(dict(scored))

    # Lemmatize every word (requires the NLTK 'wordnet' corpus) and find repeated words
    lemmatizer = WordNetLemmatizer()
    lemmas = [lemmatizer.lemmatize(word) for phrase in k for word in phrase.split()]
    rep_words = [word for word, occurrences in Counter(lemmas).items() if occurrences >= 2]

    # Keep the highest-scoring phrase containing each repeated word and drop the later ones
    tracker_dict = {word: False for word in rep_words}
    remove_list = []
    for phrase in k:
        remove = False
        for word in rep_words:
            if word in phrase:  # substring match, so 'blog' also matches 'blogs'
                if not tracker_dict[word]:
                    tracker_dict[word] = True
                else:
                    remove = True
        if remove:
            remove_list.append(phrase)

    k = [phrase for phrase in k if phrase not in remove_list]
    return k[:5]



df['keywords'] = df.apply(lambda x: keyword_creation(x.transcripts, x.summary), axis=1)

In [17]:
df.head()

Unnamed: 0,course_id,video_title,transcripts,length,summary,keywords
0,siads697698,recording-of-elle-o-brien-office-hours-siads-6...,We'll see if anybody is joining us today. What...,20293,"Because at first time, I was thinking about us...","[git tutorial today, terminal, analysis script..."
1,siads697698,recording-of-elle-o-brien-office-hours-siads-6...,"I'm going to do this, Git log.one line and tha...",20294,"In this branch, let's create a new file and ca...","[git log.one line, commits, dvc repository]"
2,siads697698,recording-of-elle-o-brien-office-hours-siads-6...,"Well, not sure if anybody is joining this morn...",21840,"If that doesn't get you exactly what you want,...","[license file, team meeting, office hour, proj..."
3,siads697698,recording-of-elle-o-brien-office-hours-siads-6...,"now. Cool. There's lots of stuff here. Wow, l...",21840,If you're not very comfortable doing terminal ...,"[github, make dataset file, folder structure, ..."
4,siads697698,recording-of-elle-o-brien-office-hours-siads-6...,"Hello, nice to meet you. >> Nice to meet you t...",24317,">> Yeah, you can do that too, so today, I'm go...","[jupiter notebook experience, gpu, course feed..."


## Save Dataset
We save the transcript dataset, now including the `keywords` column, as a CSV file for further analysis.

In [19]:
df.to_csv("./intermediate_data/transcripts_{}_summaries_keywords.csv".format(course))
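The `keywords` column holds Python lists, which `to_csv` writes as their string representation, so a later `pd.read_csv` returns strings rather than lists. A minimal sketch of converting one cell back follows; `parse_keywords` is a hypothetical helper, not part of the original workflow.

```python
import ast


def parse_keywords(cell):
    """Convert a stringified list such as "['a', 'b']" back into a Python list.

    Hypothetical helper, not part of the original notebook.
    """
    if isinstance(cell, list):
        return cell  # already a list, e.g. before the CSV round trip
    # literal_eval only evaluates Python literals, so it is safer than eval()
    return ast.literal_eval(cell)
```

After reloading the file, it could be applied with `df['keywords'] = df['keywords'].apply(parse_keywords)`.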

## Next step
After saving the dataset, run the next step in the workflow, [5-HyperlinkGeneration.ipynb](./5-HyperlinkGeneration.ipynb), or go back to [0-Workflow.ipynb](./0-Workflow.ipynb).

---

**Authors:** [Wei Zhou](mailto:weiwzhou@umich.edu), [Nick Capaldini](mailto:nickcaps@umich.edu), University of Michigan, August 21, 2022

---