This script helps with processing interview transcripts. The first portion loads the transcripts. The second portion removes the interviewer's statements from the transcripts.

You might want this kind of script if you are only interested in what study participants say.

Note that most of this script runs in R, but some portions (especially mounting the Drive to access files stored on your Google Drive) run in Python. Cells that begin with %%R are run in R through the rpy2 extension.

In [None]:
%load_ext rpy2.ipython

In [None]:
# Mount Google Drive so its files are available under /content/drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import os
import pandas as pd

In [None]:
os.listdir()

['.config', 'drive', 'sample_data']

In [None]:
data_dir = "/content/drive/Shareddrives/Working Group - NLP in Engineering Education Research/Fall 2021 Independent Study/data/Sample Interviews"

In [None]:
os.chdir(data_dir)

In [None]:
os.listdir()

['Interview 01.docx',
 'Interview 02.docx',
 'Interview 03.docx',
 'Interview 04.docx']

This next portion downloads and loads the libraries needed to read the files into R.

In [None]:
%%R
library(tidyverse)

R[write to console]: ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

R[write to console]: ✔ ggplot2 3.3.5     ✔ purrr   0.3.4
✔ tibble  3.1.4     ✔ dplyr   1.0.7
✔ tidyr   1.1.3     ✔ stringr 1.4.0
✔ readr   2.0.1     ✔ forcats 0.5.1

R[write to console]: ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



In [None]:
%%R
# Install the poppler system library needed to build readtext's PDF dependency (pdftools)
system('sudo apt-get install -y libpoppler-cpp-dev', intern=TRUE)

In [None]:
%%R
install.packages("readtext")

In [None]:
%%R
library(readtext)

In [None]:
%%R
data_dir <- "/content/drive/Shareddrives/Working Group - NLP in Engineering Education Research/Fall 2021 Independent Study/data/Sample Interviews/"

Build the list of transcript files to load from the data_dir location.

In [None]:
%%R

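# List the files in data_dir (not just the current working directory)
file_name_df <- tibble(file_name = list.files(data_dir))

# Keep only names that end in .docx
file_name_df <- file_name_df %>% filter(str_detect(file_name, "\\.docx$"))

my_files <- file_name_df$file_name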

In [None]:
%%R
file_name_df$file_name

[1] "Interview 01.docx" "Interview 02.docx" "Interview 03.docx"
[4] "Interview 04.docx"
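
A filter keyed to the file extension, rather than to the text "docx" appearing anywhere in a name, avoids picking up unrelated files. The Python sketch below shows the idea; the helper `select_docx` and the file names in `listing` are made up for illustration and are not part of the original notebook.

```python
def select_docx(file_names):
    """Keep only names that end in .docx (case-insensitive), in sorted order."""
    return sorted(name for name in file_names if name.lower().endswith(".docx"))


# Hypothetical directory listing, used only for illustration
listing = [
    "Interview 02.docx",
    "notes on docx export.txt",  # contains "docx" but is not a transcript
    "Interview 01.docx",
    "example_csv.csv",
]

transcripts = select_docx(listing)
print(transcripts)
```

With a real folder, `listing` would come from `os.listdir()` as in the cells above.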


In [None]:
%%R


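# Read each transcript; data_dir ends in "/", so paste0() yields full paths
temp_files <- lapply(paste0(data_dir, my_files), readtext)

# Name each list element after its file so the rows can be labeled later
names(temp_files) <- my_files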

In [None]:
%%R
# Stack the per-file results into one data frame; column_label records the source file
temp_sol <- bind_rows(temp_files, .id = "column_label")

In [None]:
%%R
temp_sol

readtext object consisting of 4 documents and 1 docvar.
# Description: df [4 × 3]
  column_label      doc_id            text               
  <chr>             <chr>             <chr>              
1 Interview 01.docx Interview 01.docx "\"Interview \"..."
2 Interview 02.docx Interview 02.docx "\"Interview \"..."
3 Interview 03.docx Interview 03.docx "\"Interview \"..."
4 Interview 04.docx Interview 04.docx "\"Interview \"..."


In [None]:
%%R
temp_sol$text[3]

[1] "Interview 03\nI: This is the third interview. This is the first sentence in the first interviewer statement. This is the second sentence in the first interviewer statement.\nP: This is participant’s first response. This is the first sentence in the first participant statement. This is the second sentence in the first participant statement.\nI: This is the third interview. This is the first sentence in the second interviewer statement. This is the second sentence in the second interviewer statement.\nP: This is participant’s second response. This is the first sentence in the second participant statement. This is the second sentence in the second participant statement.\nEnd Transcript"


In [None]:
%%R
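by_speaker <- temp_sol %>% 
  # One row per line of the transcript
  separate_rows(text, sep = "\n") %>% 
  # Split on the first colon only, so colons inside a statement are kept;
  # lines without a colon get NA text instead of raising a warning
  separate(text, into = c("Speaker", "text"), sep = ":",
           extra = "merge", fill = "right") %>% 
  # Drop the title and "End Transcript" lines
  filter(Speaker %in% c("I", "P")) %>% 
  group_by(doc_id, Speaker) %>% 
  # One row per document and speaker
  summarize(text = paste(text, collapse = ' '))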

`summarise()` has grouped output by 'doc_id'. You can override using the `.groups` argument.


In [None]:
%%R
# Same pipeline as the previous cell, printed here to inspect the result
temp_sol %>% 
  separate_rows(text, sep = "\n") %>%
  separate(text, into = c("Speaker", "text"), sep = ":",
           extra = "merge", fill = "right") %>%
  filter(Speaker %in% c("I", "P")) %>%
  group_by(doc_id, Speaker) %>% 
  summarize(text = paste(text, collapse = ' '))


`summarise()` has grouped output by 'doc_id'. You can override using the `.groups` argument.
# A tibble: 8 × 3
# Groups:   doc_id [4]
  doc_id            Speaker text                                                
  <chr>             <chr>   <chr>                                               
1 Interview 01.docx I       " This is the first interview. This is the first se…
2 Interview 01.docx P       " This is participant’s first response. This is the…
3 Interview 02.docx I       " This is the second interview. This is the first s…
4 Interview 02.docx P       " This is participant’s first response. This is the…
5 Interview 03.docx I       " This is the third interview. This is the first se…
6 Interview 03.docx P       " This is participant’s first response. This is the…
7 Interview 04.docx I       " This is the fourth interview. This is the first s…
8 Interview 04.docx P       " This is participant’s first response. This is the…
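
The speaker split above can also be sketched in plain Python, which may help when checking what the R pipeline is doing. This is a minimal sketch, assuming every turn sits on one line of the form `LABEL: text`; the function `split_by_speaker` and the shortened `sample` transcript are illustrative and not part of the original notebook. Unlike the R version, it also trims the whitespace around each statement.

```python
from collections import defaultdict


def split_by_speaker(transcript, speakers=("I", "P")):
    """Group the lines of one transcript by speaker label."""
    turns = defaultdict(list)
    for line in transcript.split("\n"):
        # partition() splits on the first colon only, so later colons survive
        label, sep, text = line.partition(":")
        # Lines without a colon (title, "End Transcript") have an empty sep
        if sep and label in speakers:
            turns[label].append(text.strip())
    return {label: " ".join(parts) for label, parts in turns.items()}


# Shortened, made-up transcript in the same layout as the sample interviews
sample = (
    "Interview 03\n"
    "I: This is the first interviewer statement.\n"
    "P: This is the first participant statement.\n"
    "I: This is the second interviewer statement.\n"
    "P: Note: this reply contains a colon.\n"
    "End Transcript"
)

grouped = split_by_speaker(sample)
print(grouped["P"])
```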


Look at just the transcript from interview 02

In [None]:
%%R

by_speaker %>% filter(str_detect(doc_id, "02"))

# A tibble: 2 × 3
# Groups:   doc_id [1]
  doc_id            Speaker text                                                
  <chr>             <chr>   <chr>                                               
1 Interview 02.docx I       " This is the second interview. This is the first s…
2 Interview 02.docx P       " This is participant’s first response. This is the…


Look at just the participant's statements in interview 02

In [None]:
%%R

by_speaker %>% filter(str_detect(doc_id, "02")) %>% filter(Speaker == "P") 

# A tibble: 1 × 3
# Groups:   doc_id [1]
  doc_id            Speaker text                                                
  <chr>             <chr>   <chr>                                               
1 Interview 02.docx P       " This is participant’s first response. This is the…


Use str_squish() to trim leading and trailing whitespace and collapse repeated whitespace inside the text

In [None]:
%%R

by_speaker %>% filter(str_detect(doc_id, "02")) %>% filter(Speaker == "P") %>% mutate(text = str_squish(text))

# A tibble: 1 × 3
# Groups:   doc_id [1]
  doc_id            Speaker text                                                
  <chr>             <chr>   <chr>                                               
1 Interview 02.docx P       This is participant’s first response. This is the f…
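
For comparison, the same kind of whitespace cleanup can be sketched in Python with `str.split()` and `str.join()`. The helper name `squish` and the `raw` string are made up for illustration; this is an approximation of what `str_squish()` does, not a port of it.

```python
def squish(text):
    """Trim both ends and collapse each run of whitespace to a single space."""
    # split() with no arguments splits on any whitespace and drops empty pieces
    return " ".join(text.split())


# Made-up string with a leading space, repeated spaces, and a tab
raw = "  This is the first response.   This is the\tfirst sentence. "
cleaned = squish(raw)
print(cleaned)
```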


In [None]:
%%R

by_speaker %>% filter(Speaker == "P")

# A tibble: 4 × 3
# Groups:   doc_id [4]
  doc_id            Speaker text                                                
  <chr>             <chr>   <chr>                                               
1 Interview 01.docx P       " This is participant’s first response. This is the…
2 Interview 02.docx P       " This is participant’s first response. This is the…
3 Interview 03.docx P       " This is participant’s first response. This is the…
4 Interview 04.docx P       " This is participant’s first response. This is the…


In [None]:
%%R
# Save the by-speaker table as a CSV in the current working directory
by_speaker %>% write_csv("example_csv.csv")

In [None]:
%%R
getwd()

[1] "/content/drive/Shareddrives/Working Group - NLP in Engineering Education Research/Fall 2021 Independent Study/data/Sample Interviews"
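
The CSV written above contains both interviewer and participant rows. If only participant statements are wanted in the export, the rows can be filtered before writing. The Python sketch below uses the standard `csv` module; the `rows` data are made up, and an in-memory buffer stands in for a real file, so nothing is written to the Drive.

```python
import csv
import io

# Made-up rows in the same shape as the by-speaker table
rows = [
    {"doc_id": "Interview 01.docx", "Speaker": "I", "text": "Interviewer text."},
    {"doc_id": "Interview 01.docx", "Speaker": "P", "text": "Participant text."},
    {"doc_id": "Interview 02.docx", "Speaker": "I", "text": "Interviewer text."},
    {"doc_id": "Interview 02.docx", "Speaker": "P", "text": "Participant text."},
]

# Keep only the participant rows
participant_rows = [row for row in rows if row["Speaker"] == "P"]

# StringIO stands in for a file opened with open(path, "w", newline="")
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["doc_id", "Speaker", "text"])
writer.writeheader()
writer.writerows(participant_rows)

csv_text = buffer.getvalue()
print(csv_text)
```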
