<a href="https://colab.research.google.com/github/Jliu223/AI-in-Education/blob/main/align_instructional_content_with_lecture_materials_with_timestamps_linenums_openai.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Align Instructional Content with Lecture Materials by OpenAI GPT-4o (and GPT-4o-mini)
## USING THE TEST DATA WITH TIMESTAMPS AND LINE NUMBERS

- **Lecture materials**: Refers to the prepared resources such as slides, notes, handouts, or visual aids that support the lecture. These are typically created before the class.

 - Example: "The instructor uploaded the lecture materials to the course portal, including slides and code."
 - The lines in the lecture materials are indexed by line numbers.


- **Instructional content**: Refers to the actual verbal content delivered by the instructor during the lecture. This includes both the structured content and any ad hoc elaborations.

 - Example: "The instructional content focused on explaining key algorithms and their real-world applications."
 - The lines of instructional content are indexed by timestamps.

## The Problem:
- Transcribe a segment of the instructional content into text.
- Identify the corresponding chunks in the lecture materials that match the transcribed segment.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Load the Data Containing Manually Annotated Instructional Content and Lecture Materials

In [None]:
#path = "/content/drive/MyDrive/Colab Notebooks/Co_op_JohnLiu/mit_intro_programming/data/test/mitlecture4_transcript_materials_md.csv"

manual_test_path = "/content/drive/MyDrive/Colab Notebooks/Co_op_JohnLiu/mit_intro_programming/data/test/mitlecture12.csv"

In [None]:
test = pd.read_csv(manual_test_path, header=0, encoding='latin-1')
#test.columns = ['Time', 'Transcript_Text', 'Lecture_Notes_Text_md']
test.head()

Unnamed: 0,Time,Transcript_Segments,Lecture_Notes
0,0:00-3:00,"0:30 PROFESSOR: So, for the last two lectures\...",21:# 6.0001 LECTURE 12\n22:---\n23:# SEARCHING...
1,3:00-6:00,3:02 Sometimes also called British Museum algo...,35:# 6.0001 LECTURE 12\n36:---\n37:# LINEAR SE...
2,6:00-9:00,6:03 I'm throwing away half of the remaining l...,90:# BISECTION SEARCH\n91:\n92:# IMPLEMENTATIO...
3,9:00-12:00,9:02 Because it says it's never true.\n9:05 Ou...,147:6.0001 LECTURE 12\n148:---\n149:# AMORTIZE...
4,12:00-15:00,"12:06 That, by the way-- the complexity of tha...",182:6.0001 LECTURE 12 12\n183:---\n184:# COMP...


In [None]:
test.shape

(16, 3)

In [None]:
# rename test columns
test = test.rename(columns = {'Transcript_Segments':'Transcript_Text', 'Lecture_Notes':'Lecture_Notes_Text_md'})

In [None]:
test.head()

Unnamed: 0,Time,Transcript_Text,Lecture_Notes_Text_md
0,0:00-3:00,"0:30 PROFESSOR: So, for the last two lectures\...",21:# 6.0001 LECTURE 12\n22:---\n23:# SEARCHING...
1,3:00-6:00,3:02 Sometimes also called British Museum algo...,35:# 6.0001 LECTURE 12\n36:---\n37:# LINEAR SE...
2,6:00-9:00,6:03 I'm throwing away half of the remaining l...,90:# BISECTION SEARCH\n91:\n92:# IMPLEMENTATIO...
3,9:00-12:00,9:02 Because it says it's never true.\n9:05 Ou...,147:6.0001 LECTURE 12\n148:---\n149:# AMORTIZE...
4,12:00-15:00,"12:06 That, by the way-- the complexity of tha...",182:6.0001 LECTURE 12 12\n183:---\n184:# COMP...


In [None]:
test.iloc[1].Transcript_Text

"3:04 Can I predict that?\n3:05 Can I make guesses about how much time\n3:09 I'm going to need to solve this problem?\n3:11 Especially if it's in a real world\n3:12 circumstance where time is going to be crucial.\n3:16 Equally important is going the other direction.\n3:20 We want you to begin to reason about the algorithms\n3:23 you write to be able to say how do\n3:25 certain choices in a design of an algorithm influence\n3:30 how much time it's going to take.\n3:32 If I choose to do this recursively,\n3:33 is that going to be different than iteratively?\n3:35 If I choose to do this with a particular kind of structure\n3:38 in my algorithm, what does that say about the amount\n3:41 of time I'm going to need?\n3:43 And you're going to see there's a nice association\n3:45 between classes of algorithms and the interior structure\n3:48 of them.\n3:50 And in particular, we want to ask some fundamental questions.\n3:54 Are there fundamental limits to how much time\n3:57 it's going to take t

In [None]:
test.iloc[1].Lecture_Notes_Text_md

'12:6.0001 LECTURE 10 2\n13:# WANT TO UNDERSTAND\n14:\n15:EFFICIENCY OF PROGRAMS\n16:\n17:Computers are fast and getting faster \x96 so maybe efficient programs don\x92t matter?\n18:\n19:- But data sets can be very large (e.g., in 2014, Google served 30,000,000,000,000 pages, covering 100,000,000 GB \x96 how long to search brute force?)\n20:- Thus, simple solutions may simply not scale with size in an acceptable manner.\n21:\n22:How can we decide which option for a program is most efficient?\n23:\n24:Separate time and space efficiency of a program.\n25:\n26:Tradeoff between the:\n27:\n28:- Can some times pre-compute results are stored; then use \x93lookup\x94 to retrieve (e.g., memoization for Fibonacci).\n29:- Will focus on the efficiency.'
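
The `\x96`, `\x92`, `\x93` and `\x94` escapes in the output above are what Windows-1252 punctuation (en dash, curly quotes) looks like after being decoded as Latin-1. A minimal sketch of one way to repair such strings after loading, assuming the CSV really was written as Windows-1252; `repair_cp1252` is a hypothetical helper, not part of the original notebook:

```python
def repair_cp1252(text: str) -> str:
    """Re-decode text that was read as Latin-1 but written as Windows-1252."""
    # Latin-1 maps every code point below 256 to the same byte value,
    # so encoding back to bytes recovers the original file bytes.
    raw = text.encode("latin-1")
    # Bytes that Windows-1252 leaves undefined become U+FFFD instead of raising.
    return raw.decode("cp1252", errors="replace")


sample = "faster \x96 so maybe efficient programs don\x92t matter?"
print(repair_cp1252(sample))
```

This only works on strings that came from a Latin-1 read; text containing characters above U+00FF would make the `encode` step fail.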

## Load Lecture Materials for the Lecture

In [None]:
import os
os.listdir("/content/drive/MyDrive/Colab Notebooks/Co_op_JohnLiu/mit_intro_programming/data/slides_md")

['1_welcome.md',
 '10_understand_program_efficiency_1_md_json_objs.pkl',
 '10_understand_program_efficiency_1_com.md',
 '12_searching_sorting_md_json_objs.pkl',
 '2_branching_iteration.md',
 '3_string_manipulation.md',
 '4_decomposition.md',
 '5_tuples.md',
 '6_recursion.md',
 '7_testing_debugging.md',
 '8_object_oriented.md',
 '9_python_classes.md',
 '11_understand_program_efficiency_2.md',
 '12_searching_sorting.md',
 '7_testing_debugging_linenums.md',
 '11_understand_program_efficiency_2_linenums.md',
 '9_python_classes_linenums.md',
 '8_object_oriented_linenums.md',
 '5_tuples_linenums.md',
 '2_branching_iteration_linenums.md',
 '10_understand_program_efficiency_1_com_linenums.md',
 '6_recursion_linenums.md',
 '4_decomposition_linenums.md',
 '12_searching_sorting_linenums.md',
 '1_welcome_linenums.md',
 '3_string_manipulation_linenums.md',
 '.ipynb_checkpoints',
 '3_string_manipulation_linenums_annotated.md',
 '1_welcome_linenums_annotated.md']

In [None]:
#materials_path = "/content/drive/MyDrive/Colab Notebooks/Co_op_JohnLiu/mit_intro_programming/data/slides_txt/1_welcome_text.txt"

#materials_path = "/content/drive/MyDrive/Colab Notebooks/Co_op_JohnLiu/mit_intro_programming/data/slides_txt/1_welcome_text_linenums.txt"

#materials_path = "/content/drive/MyDrive/Colab Notebooks/Co_op_JohnLiu/mit_intro_programming/data/slides_md/1_welcome_linenums.md"

materials_path = "/content/drive/MyDrive/Colab Notebooks/Co_op_JohnLiu/mit_intro_programming/data/slides_md/12_searching_sorting_linenums.md"

In [None]:
with open(materials_path, 'r', encoding='utf-8') as file:
    lecture_materials = file.read()

In [None]:
print(lecture_materials)

1:# SEARCHING AND SORTING ALGORITHMS
2:
3:(download slides and .py files and follow along!)
4:
5:# 6.0001 LECTURE 12
6:
7:6.0001 LECTURE 12
8:---
9:# SEARCH ALGORITHMS
10:
11:§ search algorithm – method for finding an item or group of items with specific properties within a collection of items
12:
13:§ collection could be implicit
14:
15:- example – find square root as a search problem
16:
17:§ collection could be explicit
18:
19:- example – is a student record in a stored collection of data?
20:
21:# 6.0001 LECTURE 12
22:---
23:# SEARCHING ALGORITHMS
24:
25:# 1. Linear Search
26:
27:- Brute force search (aka British Museum algorithm)
28:- List does not have to be sorted
29:
30:# 2. Binary Search
31:
32:- List MUST be sorted to give correct answer
33:- Saw two different implementations of the algorithm
34:
35:# 6.0001 LECTURE 12
36:---
37:# LINEAR SEARCH ON UNSORTED LIST: RECAPE
38:
39:def linear_search(L, e):
40:found = False
41:for i in range(len(L)):
42:if e == L[i]:
43:found = True
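
Each line of the `_linenums.md` file carries a 1-based index followed by a colon, and blank lines are numbered too. The notebook does not show how these files were generated; the sketch below is one plausible way to produce the same layout from plain Markdown, with `add_line_numbers` as a hypothetical helper name:

```python
def add_line_numbers(text: str) -> str:
    """Prefix every line, including blank ones, with its 1-based index and a colon."""
    return "\n".join(
        f"{number}:{line}"
        for number, line in enumerate(text.splitlines(), start=1)
    )


plain = "# SEARCH ALGORITHMS\n\n- example"
print(add_line_numbers(plain))
```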

## OpenAI Setup

In [None]:
from google.colab import userdata
OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')

In [None]:
from openai import OpenAI
client = OpenAI(api_key=OPENAI_API_KEY)

### Test Whether GPT-4o Works

In [None]:
response = client.chat.completions.create(
    model="gpt-4o-mini-2024-07-18",
    #model="gpt-4o-2024-08-06",
    messages=[
        {
            "role": "developer",
            "content": "You extract email addresses into JSON data."
        },
        {
            "role": "user",
            "content": "Feeling stuck? Send a message to help@mycompany.com."
        }
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "email_schema",
            "schema": {
                "type": "object",
                "properties": {
                    "email": {
                        "description": "The email address that appears in the input",
                        "type": "string"
                    }
                },
                # These keywords apply to the object itself, so they sit next to
                # "properties" rather than inside it
                "required": ["email"],
                "additionalProperties": False
            }
        }
    }
)

In [None]:
print(response.choices[0].message.content)

{"email":"help@mycompany.com"}


## Prompt GPT-4o for Chunk Identification in Lecture Materials

In [None]:
system_prompt = '''
You are an assistant tasked with the following:

1. You will receive:

- A segment of the transcript of instructional content, where each line is marked by a timestamp.
- The corresponding lecture materials, where each line is numbered with a line index.

2. Your job is to:

- Analyze the transcript segment and the lecture materials.
- Identify which chunks of the lecture materials are discussed in the given transcript segment.

3. Output:

- For each identified chunk in the lecture materials, return the first and last
line indices as a list of JSON objects in the following format:
[
  {"start_line_idx": <first line index of chunk 1>, "end_line_idx": <last line index of chunk 1>},
  ...,
  {"start_line_idx": <first line index of chunk n>, "end_line_idx": <last line index of chunk n>}
]

- If no chunk in the lecture materials matches the transcript segment, return this JSON object:

[{"start_line_idx": -1, "end_line_idx": -1}]
'''

In [None]:
user_prompt_with_variables = '''
Please analyze the given timestamped segment of transcript of instructional content and the corresponding
lecture materials. Your task is to identify the chunks in the lecture materials that
match the provided transcript segment.

Input:
Transcript Segment:
{transcript}

Lecture Materials:
{lecture_materials}

Output:
- Return the first line index and last line index of each matching chunk in the lecture materials as
a list of JSON objects in the following format:
[
    {{"start_line_idx": <first line index of chunk 1>, "end_line_idx": <last line index of chunk 1>}},
    ...,
    {{"start_line_idx": <first line index of chunk n>, "end_line_idx": <last line index of chunk n>}}
].

- If no chunk in the lecture materials matches the transcript segment, return this JSON object:
[{{"start_line_idx": -1, "end_line_idx": -1}}].
'''
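
The doubled braces in the template above matter because `str.format` treats single braces as placeholders: `{{` and `}}` come out as literal `{` and `}`, while `{transcript}` and `{lecture_materials}` are substituted. A small illustration with a shortened, made-up template:

```python
# Shortened stand-in for the real prompt template (illustrative only)
template = (
    "Transcript Segment:\n{transcript}\n"
    'If nothing matches, return [{{"start_line_idx": -1, "end_line_idx": -1}}]'
)

# Braces inside the substituted value are left alone; only the template is parsed
filled = template.format(transcript="0:30 PROFESSOR: a dict looks like {1: 'a'}")
print(filled)
```

Because only the template string is parsed, transcripts or lecture materials that contain braces (for example Python code on a slide) are inserted unchanged.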

In [None]:
transcript_segment = test.iloc[0]['Transcript_Text']
transcript_segment

"0:30 PROFESSOR: Quick, quick recap of what we did last time.\n0:33 So last time we introduced this idea of decomposition\n0:36 and abstraction.\n0:37 And we started putting that into our programs.\n0:39 And these were sort of high level concepts,\n0:41 and we achieved them using these concrete things called\n0:45 functions in our programs.\n0:46 And functions allowed us to create\n0:47 code that was coherent, that had some structure to it,\n0:53 and was reusable.\n0:55 OK.\n0:57 And from now on in problem sets and in lectures,\n0:59 I'm going to be using functions a lot.\n1:01 So make sure that you understand how they work\n1:04 and all of those details.\n1:05 So today, we're going to introduce two new data types.\n1:09 And they're called compound data types,\n1:11 because they're actually data types that are made up\n1:13 of other data types, particularly ints, floats,\n1:17 Booleans, and strings.\n1:19 And actually not just these, but other data types as well.\n1:22 So that's why th

In [None]:
user_prompt = user_prompt_with_variables.format(transcript=transcript_segment, lecture_materials=lecture_materials)
user_prompt

'\nPlease analyze the given timestamped segment of transcript of instructional content and the corresponding\nlecture materials. Your task is to identify the chunks in the lecture materials that\nmatch the provided transcript segment.\n\nInput:\nTranscript Segment:\n0:30 PROFESSOR: Quick, quick recap of what we did last time.\n0:33 So last time we introduced this idea of decomposition\n0:36 and abstraction.\n0:37 And we started putting that into our programs.\n0:39 And these were sort of high level concepts,\n0:41 and we achieved them using these concrete things called\n0:45 functions in our programs.\n0:46 And functions allowed us to create\n0:47 code that was coherent, that had some structure to it,\n0:53 and was reusable.\n0:55 OK.\n0:57 And from now on in problem sets and in lectures,\n0:59 I\'m going to be using functions a lot.\n1:01 So make sure that you understand how they work\n1:04 and all of those details.\n1:05 So today, we\'re going to introduce two new data types.\n1:09 A

In [None]:
response = client.chat.completions.create(
    model="gpt-4o-mini-2024-07-18",
    #model="gpt-4o-2024-08-06",
    messages=[
        {
            "role": "developer",
            "content": system_prompt
        },
        {
            "role": "user",
            "content": user_prompt
        }
    ]
)

In [None]:
response = client.chat.completions.create(
    model="gpt-4o-mini-2024-07-18",
    messages=[
        {
            "role": "developer",
            "content": system_prompt
        },
        {
            "role": "user",
            "content": user_prompt
        }
    ],
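    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "chunks_schema",
            "schema": {
                "type": "object",
                "properties": {
                    "start_line_idx": {
                        "description": "first line index of chunk 1",
                        "type": "integer"
                    },
                    "end_line_idx": {
                        "description": "last line index of chunk 1",
                        "type": "integer"
                    }
                },
                # These keywords apply to the object itself, so they sit next to
                # "properties" rather than inside it. Note that this schema
                # describes a single chunk, not the list requested in the prompt.
                "required": ["start_line_idx", "end_line_idx"],
                "additionalProperties": False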
            }
        }
    }
)

In [None]:
print(response.choices[0].message.content)

```json
[
    {"start_line_idx": 7, "end_line_idx": 15},
    {"start_line_idx": 19, "end_line_idx": 33},
    {"start_line_idx": 38, "end_line_idx": 60}
]
```


In [None]:
import json
import re

In [None]:
def extract_json_string(response_string):
    """
    Extracts a JSON string from a string that might contain surrounding text.

    Args:
        response_string (str): The input string, potentially containing JSON data.

    Returns:
        str or None: The extracted JSON string, or None if no valid JSON is found.
    """

    # Strip a Markdown ```json fence if the model wrapped its answer in one
    pattern = r"```json(.*?)```"
    match = re.search(pattern, response_string, re.DOTALL)
    if match:
        response_string = match.group(1).strip()
    # Without a fence, fall through so the raw string is still validated below

    try:
        # Attempt to parse the entire string as JSON
        json.loads(response_string)
        return response_string  # If successful, the entire string is valid JSON
    except json.JSONDecodeError:
        pass  # If parsing fails, it's not pure JSON, try extracting


    # If not pure JSON, look for JSON-like structure
    start_index = response_string.find('[')
    if start_index == -1:
        return None  # No opening bracket, not likely JSON

    end_index = response_string.rfind(']')
    if end_index == -1:
        return None # No closing bracket, not likely JSON

    json_like_string = response_string[start_index : end_index + 1]

    try:
        json.loads(json_like_string)
        return json_like_string  # parses cleanly, so it is valid JSON
    except json.JSONDecodeError:
        return None
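
As an alternative to slicing between the first `[` and the last `]`, the standard library's `json.JSONDecoder.raw_decode` can parse a JSON value that starts at a given position and ignore whatever follows it. The sketch below is a swapped-in technique, not the notebook's method, and `first_json_value` is a hypothetical name:

```python
import json


def first_json_value(text: str):
    """Return the first JSON array or object embedded in text, or None."""
    decoder = json.JSONDecoder()
    for position, char in enumerate(text):
        if char in "[{":
            try:
                # raw_decode parses one value starting at `position`
                # and tolerates trailing text after it
                value, _end = decoder.raw_decode(text, position)
                return value
            except json.JSONDecodeError:
                continue  # not valid JSON from this bracket; try the next one
    return None


fenced = '```json\n[{"start_line_idx": 7, "end_line_idx": 15}]\n```'
print(first_json_value(fenced))
print(first_json_value("Lines 84 to 102 look relevant."))
```

One caveat: any bracketed fragment that happens to be valid JSON, such as a citation like `[1]` in a prose answer, would be returned first, so the result still needs a shape check before use.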

In [None]:
response.choices[0].message.content

'[\n    {"start_line_idx": 455, "end_line_idx": 491},\n    {"start_line_idx": 500, "end_line_idx": 527},\n    {"start_line_idx": 532, "end_line_idx": 548}\n]'

In [None]:
json_response = extract_json_string(response.choices[0].message.content)
print(json_response)

None


In [None]:
def extract_segments_to_paragraph(json_response, lecture_materials):
    """
    Extracts content from lecture content based on segment indices in JSON response
    and combines them into a single paragraph.

    Args:
        json_response (str): JSON string containing segment indices.
        lecture_materials (str): Multiline string of line-numbered lecture materials.

    Returns:
        str: A single paragraph containing extracted content, or None if no content.
    """
    try:
        segments = json.loads(json_response)
    except json.JSONDecodeError:
        print("Error: Invalid JSON format.")
        return None

    # A single chunk may arrive as a bare JSON object; wrap it in a list first
    if isinstance(segments, dict):
        segments = [segments]

    if not segments or (len(segments) == 1 and segments[0]["start_line_idx"] == -1):
        print("No matching segments found.")
        return None

    lines = lecture_materials.splitlines()

    # Map each line number to its text; every line looks like "<number>:<text>"
    num_line_dict = {}
    for line in lines:
        prefix, sep, line_text = line.partition(":")
        # Skip lines whose prefix is not a line number (e.g. wrapped text with a colon)
        if sep and prefix.strip().isdecimal():
            num_line_dict[int(prefix)] = line_text.strip()

    extracted_content = []

    for segment in segments:
        start_idx = segment["start_line_idx"]
        end_idx = segment["end_line_idx"]

        if not (0 <= start_idx <= len(lines) and 0 <= end_idx <= len(lines) and start_idx <= end_idx):
             print(f"Error: Invalid line indices {start_idx}, {end_idx}")
             continue
        else:
            for line_num in range(start_idx, end_idx+1):
                if line_num in num_line_dict:
                    extracted_content.append(str(line_num) + ":" + num_line_dict[line_num])
                    #extracted_content.append(num_line_dict[line_num])


    #Combine extracted content into a single paragraph, handling blank lines
    paragraph = "\n".join([line.strip() for line in extracted_content if line.strip()])
    return paragraph

In [None]:
print(extract_segments_to_paragraph(json_response, lecture_materials))

7:# LAST TIME
8:
9:functions
10:
11:decomposition – create structure
12:
13:abstraction – suppress details
14:
15:from now on will be using functions a lot
19:# TODAY
20:
21:have seen variable types: int, float, bool, string
22:
23:# introduce new compound data types
24:
25:- tuples
26:- lists
27:
28:# idea of
29:
30:- aliasing
31:- mutability
32:- cloning
33:
38:# TUPLES
39:
40:an ordered sequence of elements, can mix element types
41:
42:cannot change element values, immutable
43:
44:represented with parentheses
45:
46:te = ()
47:
48:t = (2,"mit",3)
49:
50:t[0] evaluates to 2
51:
52:(2,"mit",3) + (5,6) evaluates to (2,"mit",3,5,6)
53:
54:t[1:2] slice tuple, evaluates to ("mit",)
55:
56:t[1:3] slice tuple, evaluates to ("mit",3)
57:
58:len(t) evaluates to 3
59:
60:t[1] = 4 gives error, can’t modify object


In [None]:
print(test.iloc[0]['Lecture_Notes_Text_md'])

1:# 6.0001 LECTURE 5
2:
3:# TUPLES, LISTS, ALIASING, MUTABILITY, CLONING
4:
5:(download slides and .py files and follow along!)
6:---
7:# LAST TIME
8:
9:functions
10:
11:decomposition  create structure
12:
13:abstraction  suppress details
14:
15:from now on will be using functions a lot
16:
17:# 6.0001 LECTURE 5
18:---
19:# TODAY
20:
21:have seen variable types: int, float, bool, string
22:
23:# introduce new compound data types
24:
25:- tuples
26:- lists
27:
28:# idea of
29:
30:- aliasing
31:- mutability
32:- cloning
33:
34:# 6.0001 LECTURE 5
35:
36:3
37:---
38:# TUPLES
39:
40:an ordered sequence of elements, can mix element types
41:
42:cannot change element values, immutable
43:
44:represented with parentheses
45:
46:te = ()
47:
48:t = (2,"mit",3)
49:
50:t[0] evaluates to 2
51:
52:(2,"mit",3) + (5,6) evaluates to (2,"mit",3,5,6)
53:
54:t[1:2] slice tuple, evaluates to ("mit",)
55:
56:t[1:3] slice tuple, evaluates to ("mit",3)
57:
58:len(t) evaluates to 3
59:
60:t[1] = 4 gives erro

In [None]:
results_dicts = []

In [None]:
row_stack_dict = {"time": test.iloc[0]['Time'],
                  "transcript_segment": transcript_segment,
                  "lecture_notes":test.iloc[0]['Lecture_Notes_Text_md'],
                  'identified_text': extract_segments_to_paragraph(json_response, lecture_materials),
                  'identified_json_lines': json_response}

In [None]:
results_dicts = []
results_dicts.append(row_stack_dict)

In [None]:
pd.DataFrame(results_dicts)

Unnamed: 0,time,transcript_segment,lecture_notes,identified_text,identified_json_lines
0,0:00-3:00,"0:30 PROFESSOR: Quick, quick recap of what we ...","1:# 6.0001 LECTURE 5\n2:\n3:# TUPLES, LISTS, A...",7:# LAST TIME\n8:\n9:functions\n10:\n11:decomp...,"[\n {""start_line_idx"": 7, ""end_line_idx"": 1..."


In [None]:
test.columns

Index(['Time', 'Transcript_Text', 'Lecture_Notes_Text_md'], dtype='object')

In [None]:
from tqdm import tqdm

In [None]:
import time
import random

responses_dict = {}
for idx, row in tqdm(test.iterrows(),
                     total=len(test), desc="Processing Rows"):

    print(f"Processing item {idx}...")
    sleep_time = random.uniform(5, 10)  # Random float between 5 and 10 seconds
    print(f"Sleeping for {sleep_time:.2f} seconds...")
    time.sleep(sleep_time)

    transcript_segment = row['Transcript_Text']

    user_prompt = user_prompt_with_variables.format(transcript=transcript_segment, lecture_materials=lecture_materials)

    response = client.chat.completions.create(
        model="gpt-4o-mini-2024-07-18",
        #model="gpt-4o-2024-08-06",
        messages=[
            {
                "role": "developer",
                "content": system_prompt
            },
            {
                "role": "user",
                "content": user_prompt
            }
        ]
    )

    ## store the responses for future processing
    responses_dict[idx] = response

Processing Rows:   0%|          | 0/16 [00:00<?, ?it/s]

Processing item 0...
Sleeping for 6.60 seconds...


Processing Rows:   6%|▋         | 1/16 [00:08<02:07,  8.52s/it]

Processing item 1...
Sleeping for 7.98 seconds...


Processing Rows:  12%|█▎        | 2/16 [00:19<02:18,  9.86s/it]

Processing item 2...
Sleeping for 5.92 seconds...


Processing Rows:  19%|█▉        | 3/16 [00:27<01:58,  9.08s/it]

Processing item 3...
Sleeping for 7.12 seconds...


Processing Rows:  25%|██▌       | 4/16 [00:36<01:47,  8.92s/it]

Processing item 4...
Sleeping for 8.90 seconds...


Processing Rows:  31%|███▏      | 5/16 [00:47<01:46,  9.66s/it]

Processing item 5...
Sleeping for 6.23 seconds...


Processing Rows:  38%|███▊      | 6/16 [00:54<01:28,  8.81s/it]

Processing item 6...
Sleeping for 5.15 seconds...


Processing Rows:  44%|████▍     | 7/16 [01:01<01:13,  8.22s/it]

Processing item 7...
Sleeping for 5.84 seconds...


Processing Rows:  50%|█████     | 8/16 [01:08<01:02,  7.79s/it]

Processing item 8...
Sleeping for 6.52 seconds...


Processing Rows:  56%|█████▋    | 9/16 [01:17<00:57,  8.24s/it]

Processing item 9...
Sleeping for 5.39 seconds...


Processing Rows:  62%|██████▎   | 10/16 [01:24<00:46,  7.78s/it]

Processing item 10...
Sleeping for 9.70 seconds...


Processing Rows:  69%|██████▉   | 11/16 [01:34<00:43,  8.72s/it]

Processing item 11...
Sleeping for 8.24 seconds...


Processing Rows:  75%|███████▌  | 12/16 [01:44<00:36,  9.05s/it]

Processing item 12...
Sleeping for 7.21 seconds...


Processing Rows:  81%|████████▏ | 13/16 [01:54<00:27,  9.18s/it]

Processing item 13...
Sleeping for 8.15 seconds...


Processing Rows:  88%|████████▊ | 14/16 [02:03<00:18,  9.31s/it]

Processing item 14...
Sleeping for 7.94 seconds...


Processing Rows:  94%|█████████▍| 15/16 [02:13<00:09,  9.32s/it]

Processing item 15...
Sleeping for 9.22 seconds...


Processing Rows: 100%|██████████| 16/16 [02:24<00:00,  9.02s/it]
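
The loop above spaces requests out by sleeping a random 5 to 10 seconds before every call. A different approach, sketched here under stated assumptions, is to wait only after a failure and double the wait each time (exponential backoff with jitter). `call_with_backoff` is a hypothetical helper; it catches every `Exception` for brevity, whereas real code would likely narrow this to the client library's rate-limit or timeout errors.

```python
import random
import time


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); after a failure wait base_delay * 2**attempt plus jitter, then retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the last error
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            sleep(delay)


# Demo with a stand-in for the API call that fails twice before succeeding
calls = {"count": 0}
delays = []


def flaky_request():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated transient failure")
    return "ok"


result = call_with_backoff(flaky_request, sleep=delays.append)
print(result, len(delays))
```

In the notebook, `fn` could be a `lambda` wrapping the `client.chat.completions.create(...)` call for one row.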


In [None]:
test.columns

Index(['Time ', 'Transcript_Text', 'Lecture_Notes_Text_md'], dtype='object')

In [None]:
test = test.rename(columns = {'Time ':'Time'})

In [None]:
results_dicts = []
for idx, row in tqdm(test.iterrows(),
                     total=len(test), desc="Processing Rows"):
    response = responses_dict[idx]

    transcript_segment = row['Transcript_Text']

    json_response = extract_json_string(response.choices[0].message.content)

    try:
        extracted_text = extract_segments_to_paragraph(json_response, lecture_materials)
    except TypeError as e:
        print(f"Error processing row {idx}: {e}")
        print(json_response)
        print(response.choices[0].message.content)
        continue

    row_stack_dict = {"time": row['Time'],
                      "transcript_segment": transcript_segment,
                      "lecture_notes":row['Lecture_Notes_Text_md'],
                      'identified_text': extracted_text,
                      'identified_json_lines': json_response}

    results_dicts.append(row_stack_dict)

Processing Rows: 100%|██████████| 16/16 [00:00<00:00, 1369.85it/s]


In [None]:
for resp in responses_dict.values():
    print(resp.choices[0].message.content)

```json
[
    {"start_line_idx": 7, "end_line_idx": 15},
    {"start_line_idx": 19, "end_line_idx": 33},
    {"start_line_idx": 38, "end_line_idx": 60}
]
```
```json
[
    {"start_line_idx": 38, "end_line_idx": 61},
    {"start_line_idx": 66, "end_line_idx": 74}
]
```
```json
[
    {"start_line_idx": 66, "end_line_idx": 81}
]
```
```json
[
  {"start_line_idx": 84, "end_line_idx": 100}
]
```
Based on the provided transcript segment and lecture materials, the transcript discusses manipulating tuples and extracting unique strings, iterating over tuples, retrieving minimum and maximum values, and also comments on lists being mutable. The segment also mentions testing the program and switching to lists from tuples.

Given these points, it's clear that the following sections in the lecture materials are relevant to the transcript:

1. **Manipulating Tuples**:
   - Lines 84 to 102, where it details iterating over tuples, extracting unique words, and calculating minimum and maximum values.

2.

In [None]:
mitlecture_aligned_results = pd.DataFrame(results_dicts)

In [None]:
mitlecture_aligned_results

Unnamed: 0,time,transcript_segment,lecture_notes,identified_text,identified_json_lines
0,0:00-3:00,"0:30 PROFESSOR: So, for the last two lectures\...",21:# 6.0001 LECTURE 12\n22:---\n23:# SEARCHING...,9:# SEARCH ALGORITHMS\n10:\n11:§ search algori...,"[\n {""start_line_idx"": 9, ""end_line_idx"": 1..."
1,3:00-6:00,3:02 Sometimes also called British Museum algo...,35:# 6.0001 LECTURE 12\n36:---\n37:# LINEAR SE...,25:# 1. Linear Search\n26:\n27:- Brute force s...,"[\n {""start_line_idx"": 25, ""end_line_idx"": ..."
2,6:00-9:00,6:03 I'm throwing away half of the remaining l...,90:# BISECTION SEARCH\n91:\n92:# IMPLEMENTATIO...,90:# BISECTION SEARCH\n91:\n92:# IMPLEMENTATIO...,"[\n {""start_line_idx"": 90, ""end_line_idx"": ..."
3,9:00-12:00,9:02 Because it says it's never true.\n9:05 Ou...,147:6.0001 LECTURE 12\n148:---\n149:# AMORTIZE...,141:- When sorting is less than O(n)\n142:\n14...,"[\n {""start_line_idx"": 141, ""end_line_idx"":..."
4,12:00-15:00,"12:06 That, by the way-- the complexity of tha...",182:6.0001 LECTURE 12 12\n183:---\n184:# COMP...,"171:# MONKEY SORT\n172:\n173:aka bogosort, stu...","[\n {""start_line_idx"": 171, ""end_line_idx"":..."
5,15:00-18:00,15:01 And then I'm going to loop.\n15:02 As lo...,210:# COMPLEXITY OF BUBBLE SORT\n211:\n212:def...,198:# BUBBLE SORT\n199:\n200:- compare consecu...,"[\n {""start_line_idx"": 198, ""end_line_idx"":..."
6,18:00-21:00,18:01 What's this?\n18:02 Quadratic.\n18:05 So...,231:6.0001 LECTURE 12 15\n232:---\n233:# SELEC...,233:# SELECTION SORT\n234:\n235:# § First Step...,"[\n {""start_line_idx"": 233, ""end_line_idx"":..."
7,21:00-24:00,21:11 PROFESSOR: I would invite you to watch\n...,250:6.0001 LECTURE 12 16\n251:---\n252:# ANALY...,233:# SELECTION SORT\n234:\n235:# § First Step...,"[\n {""start_line_idx"": 233, ""end_line_idx"":..."
8,24:00-27:00,24:02 And as long as I still have things\n24:0...,258:# 6.0001 LECTURE 12\n259:\n260:17\n261:---...,233:# SELECTION SORT\n234:\n235:# § First Step...,"[\n {""start_line_idx"": 233, ""end_line_idx"":..."
9,27:00-30:00,27:03 I'm going to sort them.\n27:04 And when ...,292:6.0001 LECTURE 12\n293:\n294:19\n295:---\n...,282:# MERGE SORT\n283:\n284:Use a divide-and-c...,"[\n {""start_line_idx"": 282, ""end_line_idx"":..."


In [None]:
aligned_results_path = "/content/drive/MyDrive/Colab Notebooks/Co_op_JohnLiu/mit_intro_programming/data/test/mitlecture12_time_linenums_aligned_results_openai_md.csv"

In [None]:
mitlecture_aligned_results.to_csv(aligned_results_path, index=False)
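
The saved table places the manual annotation (`lecture_notes`) next to the model's line ranges (`identified_json_lines`). The notebook does not define an evaluation metric; as one possible way to compare the two columns, the sketch below computes line-level precision and recall. All function names are hypothetical, and it assumes the annotation lines keep their `<number>:` prefixes.

```python
import json


def predicted_lines(json_lines: str) -> set:
    """Expand [{"start_line_idx": a, "end_line_idx": b}, ...] into a set of line numbers."""
    numbers = set()
    for chunk in json.loads(json_lines):
        if chunk["start_line_idx"] >= 0:  # -1 marks "no matching chunk"
            numbers.update(range(chunk["start_line_idx"], chunk["end_line_idx"] + 1))
    return numbers


def annotated_lines(notes: str) -> set:
    """Collect the numeric prefixes of lines shaped like '<number>:<text>'."""
    numbers = set()
    for line in notes.splitlines():
        prefix, sep, _text = line.partition(":")
        if sep and prefix.strip().isdecimal():
            numbers.add(int(prefix))
    return numbers


def line_overlap(json_lines: str, notes: str) -> dict:
    """Line-level precision and recall of predicted ranges against annotated lines."""
    pred, gold = predicted_lines(json_lines), annotated_lines(notes)
    hits = len(pred & gold)
    return {
        "precision": hits / len(pred) if pred else 0.0,
        "recall": hits / len(gold) if gold else 0.0,
    }


print(line_overlap('[{"start_line_idx": 9, "end_line_idx": 12}]', "11:a\n12:b\n13:c\n14:d"))
```

Rows where `identified_json_lines` is missing would need to be filtered out before applying this to the DataFrame.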