## Feedback Preprocessing - Stage 1

**BACKGROUND**:

The preprocessing of the feedback data was done through the following stages:

- Stage 0: The initial raw data files containing the feedback were sourced from Monash University's Master of Data Science course. The required data was isolated and filtered out of the original files. The feedback text also contained many stray characters, including HTML code. Through a meticulous data wrangling process, the feedback text was cleaned and brought into its final form, ready to be labelled.

The feedback data was labelled with the help of a lightweight text annotation tool called Yedda. Yedda was customised to include the below 8 rubrics of Learner-centred feedback:

- **Focus on Future Impact**:
   - Impact 1: Provide actionable information for improving on specific aspects of the student's task
   - Impact 2: Provide actionable information for improving the overall task, in line with the learning outcomes.
   - Impact 3: Provide actionable information for the student to improve certain skills and strategies beyond the scope of the learning outcomes.
   
- **Sensemaking**:
  - Sensemaking 1: Highlight strengths and weaknesses of specific aspects of the student's work.
  - Sensemaking 2: Highlight the overall strengths and weaknesses of the student's work with respect to the learning outcomes.
  
- **Support Agency**:
  - Agency 1: Encourage the student to look up the academic material or feedback, look up sources on the internet for more information or reach out to the teacher regarding the feedback or learning outcomes.
  - Agency 2: Provide statements that affirm and acknowledge the student's work.
  - Agency 3: Include comments that help foster a closer teacher-student relationship.
  
The Yedda tool exports the labelled data as `.sentence` files. For more information on using this tool, see [Yedda](https://github.com/jiesutd/YEDDA).

In this notebook, we will load the `.sentence` files that have been generated from the annotation process. For better manoeuvrability, we can use the Pandas library to load these files into dataframes and then combine them into a single dataframe with 1000 annotated text samples.

The Yedda exported `.sentence` files were converted into standard text files before loading them here. 

**Note**: The code for preprocessing stage 0 could not be shared in this repository for confidentiality reasons. For any queries regarding the preprocessing steps, please reach out to me by email: siddharthgupte@hotmail.com

After the steps in this notebook were completed, the stage 1 file was converted to a stage 2 file in Excel by adding a new column called `FeedbackCode`. To view this column, please refer to the `stage2.csv` file in this repository.

First, let us load the required libraries including `Pandas` and `itertools`.

In [59]:
# Loading the required libraries
import pandas as pd
import re
import itertools

### 1. Loading the Annotated Text

We first read each annotated `.txt` file into a list of lines. These lists can be combined later.

In [50]:
# Specifying the column name for the dataframes
colnames=['Sentence', 'Rubric']

In [102]:
# Loading labelled data file 1
with open("./LabelledFeedback/1-100_labelled.txt") as file1:
    dat1 = [line.rstrip() for line in file1]

In [103]:
# Loading labelled data file 2
with open("./LabelledFeedback/101-200_labelled.txt") as file2:
    dat2 = [line.rstrip() for line in file2]

In [104]:
# Loading labelled data file 3
with open("./LabelledFeedback/201-300_labelled.txt") as file3:
    dat3 = [line.rstrip() for line in file3]

In [105]:
# Loading labelled data file 4
with open("./LabelledFeedback/301-400_labelled.txt") as file4:
    dat4 = [line.rstrip() for line in file4]

In [106]:
# Loading labelled data file 5
with open("./LabelledFeedback/401-500_labelled.txt") as file5:
    dat5 = [line.rstrip() for line in file5]

In [107]:
# Loading labelled data file 6
with open("./LabelledFeedback/501-600_labelled.txt") as file6:
    dat6 = [line.rstrip() for line in file6]

In [108]:
# Loading labelled data file 7
with open("./LabelledFeedback/601-700_labelled.txt") as file7:
    dat7 = [line.rstrip() for line in file7]

In [109]:
# Loading labelled data file 8
with open("./LabelledFeedback/701-800_labelled.txt") as file8:
    dat8 = [line.rstrip() for line in file8]

In [110]:
# Loading labelled data file 9
with open("./LabelledFeedback/801-900_labelled.txt") as file9:
    dat9 = [line.rstrip() for line in file9]

In [111]:
# Loading labelled data file 10
with open("./LabelledFeedback/901-1000_labelled.txt") as file10:
    dat10 = [line.rstrip() for line in file10]

In [112]:
# Loading labelled data file 11
with open("./LabelledFeedback/1001-1039_labelled.txt") as file11:
    dat11 = [line.rstrip() for line in file11]

In [113]:
# Loading labelled data file 12
with open("./LabelledFeedback/1040-1079_labelled.txt") as file12:
    dat12 = [line.rstrip() for line in file12]

In the previous step, we loaded each of the labelled data files as lists. We can use the `itertools` library to combine these lists into a single master list.

In [114]:
# Creating a master list from the lists of feedbacks
full_dat = list(itertools.chain(dat1, dat2, dat3, dat4, dat5, dat6, dat7, dat8, dat9, dat10, dat11, dat12))

# Checking the length of the master list
len(full_dat)

12171
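The twelve near-identical loading cells above could also be expressed as a loop. The following is a minimal sketch rather than the notebook's actual code: `load_labelled_files` is a hypothetical helper, and the use of `encoding="utf-8-sig"` (which drops a leading byte-order mark such as the `\ufeff` visible in the output below) is an assumption about how the files were saved. The demo writes two small temporary files so that it runs without the original data.

```python
import itertools
import tempfile
from pathlib import Path


def load_labelled_files(paths, encoding="utf-8-sig"):
    """Read each file into a list of right-stripped lines, then chain the lists.

    Hypothetical helper; the encoding default is an assumption.
    """
    per_file = []
    for path in paths:
        with open(path, encoding=encoding) as handle:
            per_file.append([line.rstrip() for line in handle])
    # chain.from_iterable flattens the list of lists into one master list
    return list(itertools.chain.from_iterable(per_file))


# Small demo with made-up content in temporary files
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "a_labelled.txt"
    second = Path(tmp) / "b_labelled.txt"
    first.write_text("\ufeffGood work.\tAgency 2\n\n", encoding="utf-8")
    second.write_text("Cite more sources. 70\tImpact 1\n", encoding="utf-8")
    demo_lines = load_labelled_files([first, second])

print(demo_lines)
```

With the real data, the list of paths would be the twelve files under `./LabelledFeedback/` in the order used above.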

In [115]:
# Checking the master list
full_dat

['Yuejing more in depth analysis is required and see how you can link key concepts. 58.33\tSensemaking 1&Impact 1',
 '',
 '\ufeffTeam 1 requested to re-do their workbook 3 to better their original mark of 55.\tImpact 1',
 'The team submitted the workbook 23 days after the submission date. KV 59\tSensemaking 2',
 '',
 'Risk assessment and report needs work as discussed in tutorial 70\tSensemaking 1',
 '',
 '"Good effort, Please refer to detailed feedback file attached.\tAgency 2&Agency 1',
 'Best WishesNergiz." 60.6\tAgency 3',
 '',
 '"lacked depth and detail.\tSensemaking 1',
 'Failed to cite, and use the prescribed texts so the answers were over simplified and not in accordance with ther learning of this course.\tSensemaking 2',
 'DevOSps is a hybrid between Project managemetn and SDLC.\tSensemaking 1',
 'EA provides detail design and requirements for planning as agreed by software developers and IT operations.\tSensemaking 1',
 'YOu missed several very vital points" 54.8\tSensemaking

This master list can be converted to a Pandas dataframe as follows:

In [116]:
# Converting the list of feedback texts into a dataframe
feed_df = pd.DataFrame(full_dat)

# Checking the newly created dataframe
feed_df

Unnamed: 0,0
0,Yuejing more in depth analysis is required and...
1,
2,﻿Team 1 requested to re-do their workbook 3 to...
3,The team submitted the workbook 23 days after ...
4,
...,...
12166,"In part (b), which was more complicated than i..."
12167,Q3: 11.5/13: More English exposition is requir...
12168,You made two errors: finding the determinant i...
12169,Q4: 5.5/6: The point where the two circles int...


As shown above, we have a very unorganized DataFrame. Let us go about making it more readable and creating proper columns. First, let us name this column as follows:

In [117]:
# Naming the column 
feed_df.columns = ['Sentence']

# Checking the data with the new column name
feed_df

Unnamed: 0,Sentence
0,Yuejing more in depth analysis is required and...
1,
2,﻿Team 1 requested to re-do their workbook 3 to...
3,The team submitted the workbook 23 days after ...
4,
...,...
12166,"In part (b), which was more complicated than i..."
12167,Q3: 11.5/13: More English exposition is requir...
12168,You made two errors: finding the determinant i...
12169,Q4: 5.5/6: The point where the two circles int...


### 2. Isolating the Rubrics from the Text

In this section, we can define a function that can separate the rubrics from the text and put them into a new column. Each feedback is accompanied by a rubric or rubric combination and separated from it with a `\t`. We will use this to split the data.

In [118]:
# Defining a function to separate the rubrics from the feedback text
def remRubric(text):

    rubric1 = ''

    if '\t' in text:

        rubric1 = text.split("\t", 1)[1]

    return rubric1

# Applying the rubric separating function to the data
feed_df['Rubric'] = feed_df['Sentence'].apply(lambda x : remRubric(x))

In [119]:
# Checking the separated rubrics
feed_df

Unnamed: 0,Sentence,Rubric
0,Yuejing more in depth analysis is required and...,Sensemaking 1&Impact 1
1,,
2,﻿Team 1 requested to re-do their workbook 3 to...,Impact 1
3,The team submitted the workbook 23 days after ...,Sensemaking 2
4,,
...,...,...
12166,"In part (b), which was more complicated than i...",Sensemaking 1
12167,Q3: 11.5/13: More English exposition is requir...,Impact 2
12168,You made two errors: finding the determinant i...,Sensemaking 1
12169,Q4: 5.5/6: The point where the two circles int...,Sensemaking 1
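As an alternative to a row-wise function, the same split on the first `\t` can be sketched with the vectorised pandas string accessor. This is an illustration on made-up rows, not the notebook's code; the `Text` column name is an assumption introduced here.

```python
import pandas as pd

# Made-up rows in the same "text<TAB>rubric" layout as the labelled files
demo = pd.DataFrame({
    "Sentence": [
        "Needs more depth. 58.33\tSensemaking 1&Impact 1",
        "",
        "Good effort.\tAgency 2",
    ]
})

# Split once on the first tab; expand=True returns one column per part
parts = demo["Sentence"].str.split("\t", n=1, expand=True)

demo["Text"] = parts[0]
# Rows without a tab have no second part, so fill them with an empty string
demo["Rubric"] = parts[1].fillna("")

print(demo[["Text", "Rubric"]])
```

Note that if no row contained a tab at all, `parts` would have only one column, so this sketch assumes at least one labelled row is present.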


In [120]:
# Printing the new occurrences of rubrics and rubric combinations
print(feed_df['Rubric'].unique())

['Sensemaking 1&Impact 1' '' 'Impact 1' 'Sensemaking 2' 'Sensemaking 1'
 'Agency 2&Agency 1' 'Agency 3' 'Agency 2' 'Agency 1' 'Impact 2'
 'Agency 2&Impact 1' 'Sensemaking 1&Impact 2' 'Agency 2&Sensemaking 1'
 'Sensemaking 2&Sensemaking 1' 'Sensemaking 2&Impact 2'
 'Sensemaking 1&Sensemaking 2' 'Impact 1&Impact 2'
 'Sensemaking 1&Impact 3' 'Impact 3' 'Agency 1&Impact 1'
 'Agency 2&Impact 2' 'Agency 3&Agency 2' 'Impact 3&Agency 3'
 'Sensemaking 2&Impact 1' 'Impact 1&Sensemaking 1'
 'Sensemaking 1&Agency 1' 'Impact 3&Impact 2' 'Impact 3&Impact 1'
 'Impact 1&Impact 3' 'Impact 2&Impact 1' 'Impact 1&Agency 2'
 'Agency 2&Sensemaking 2' 'Sensemaking 1&Agency 2'
 'Impact 2&Sensemaking 1' 'Agency 2&Agency 3' 'Sensemaking 2&Agency 2'
 'Agency 1&Sensemaking 1' 'Impact 3&Sensemaking 1' 'Agency 1&Impact 2'
 'Agency 3&Sensemaking 1' 'Sensemaking 2&Agency 3'
 'Agency 2&Sensemaking 1&Impact 1' 'Sensemaking 2&Agency 1'
 'Agency 2&Sensemaking 1&Agency 1' 'Sensemaking 2&Sensemaking 1&Impact 1'
 'Sensemaki

There is a lot of non-feedback text in between the feedback comments. This text is not labelled with any rubric, so the corresponding value in the new `Rubric` column is an empty string. We need to drop these rows from the data, as they do not contribute to the subsequent analysis.

In [121]:
# Filtering the empty rubric values from the data
feed_df = feed_df[feed_df['Rubric'] != '']

# Resetting the index of the dataframe
feed_df = feed_df.reset_index(drop = True)

# Checking the unique rubric values in the filtered dataframe
print(feed_df['Rubric'].unique())

['Sensemaking 1&Impact 1' 'Impact 1' 'Sensemaking 2' 'Sensemaking 1'
 'Agency 2&Agency 1' 'Agency 3' 'Agency 2' 'Agency 1' 'Impact 2'
 'Agency 2&Impact 1' 'Sensemaking 1&Impact 2' 'Agency 2&Sensemaking 1'
 'Sensemaking 2&Sensemaking 1' 'Sensemaking 2&Impact 2'
 'Sensemaking 1&Sensemaking 2' 'Impact 1&Impact 2'
 'Sensemaking 1&Impact 3' 'Impact 3' 'Agency 1&Impact 1'
 'Agency 2&Impact 2' 'Agency 3&Agency 2' 'Impact 3&Agency 3'
 'Sensemaking 2&Impact 1' 'Impact 1&Sensemaking 1'
 'Sensemaking 1&Agency 1' 'Impact 3&Impact 2' 'Impact 3&Impact 1'
 'Impact 1&Impact 3' 'Impact 2&Impact 1' 'Impact 1&Agency 2'
 'Agency 2&Sensemaking 2' 'Sensemaking 1&Agency 2'
 'Impact 2&Sensemaking 1' 'Agency 2&Agency 3' 'Sensemaking 2&Agency 2'
 'Agency 1&Sensemaking 1' 'Impact 3&Sensemaking 1' 'Agency 1&Impact 2'
 'Agency 3&Sensemaking 1' 'Sensemaking 2&Agency 3'
 'Agency 2&Sensemaking 1&Impact 1' 'Sensemaking 2&Agency 1'
 'Agency 2&Sensemaking 1&Agency 1' 'Sensemaking 2&Sensemaking 1&Impact 1'
 'Sensemaking 
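The list above contains some combinations in both orders, for example `'Sensemaking 2&Sensemaking 1'` and `'Sensemaking 1&Sensemaking 2'`. If the order of the rubrics within a combination is not meaningful (an assumption; the notebook does not say either way), the combinations could be split on `&` and put into a canonical order. The following is only a sketch on made-up values.

```python
import pandas as pd

# Made-up rubric values, including one combination in both orders
rubrics = pd.Series([
    "Sensemaking 1&Sensemaking 2",
    "Sensemaking 2&Sensemaking 1",
    "Impact 1",
])

# Split each combination into a list of individual rubrics
rubric_lists = rubrics.str.split("&")

# Sort the rubrics inside each combination and join them back together
canonical = rubric_lists.apply(lambda labels: "&".join(sorted(labels)))

print(canonical.unique())
```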

In [122]:
# Checking if the rubrics have been properly split from the data
print(feed_df[~feed_df["Sentence"].str.contains('\t')])

Empty DataFrame
Columns: [Sentence, Rubric]
Index: []


Now that we have split the rubrics from the data and put them in a new column, we can remove them from the feedback text. The function below removes the `\t` separator and the rubric that follows it, leaving only the feedback text.

In [123]:
# Defining a function to remove the rubric from the feedback text
def remLabel(string):

    # Keeping only the text before the first tab; the tab and the rubric after it are dropped.
    # Replacing the rubric string everywhere in the line could also delete
    # matching words inside the feedback text itself.
    new_string = string.split("\t", 1)[0]

    return new_string

# Applying the rubric removal function to the data
feed_df['SentenceLabelRem'] = feed_df['Sentence'].apply(lambda x : remLabel(x))

We have successfully split the rubrics from the text and put them into another column. The next step is to split the score from the text. Note that not every row contains a score. The score represents the final mark given to the student for one piece of feedback, but one piece of feedback is spread out over multiple rows.

Therefore, we must split off the score, put it in a new column and then duplicate it across all the rows that are part of the same feedback. Since the score is the last word of the text, we can split it off relatively easily.

In [132]:
# Splitting the last word from the data
feed_df['Score'] = feed_df['SentenceLabelRem'].apply(lambda x: x.split(' ')[-1])

# Checking the data
feed_df

Unnamed: 0,Sentence,Rubric,SentenceLabelRem,Score,SentenceScoreRem
0,Yuejing more in depth analysis is required and...,Sensemaking 1&Impact 1,Yuejing more in depth analysis is required and...,58.33,Yuejing more in depth analysis is required and...
1,﻿Team 1 requested to re-do their workbook 3 to...,Impact 1,﻿Team 1 requested to re-do their workbook 3 to...,55.,﻿Team 1 requested to re-do their workbook 3 to...
2,The team submitted the workbook 23 days after ...,Sensemaking 2,The team submitted the workbook 23 days after ...,59,The team submitted the workbook 23 days after ...
3,Risk assessment and report needs work as discu...,Sensemaking 1,Risk assessment and report needs work as discu...,70,Risk assessment and report needs work as discu...
4,"""Good effort, Please refer to detailed feedbac...",Agency 2&Agency 1,"""Good effort, Please refer to detailed feedbac...",attached.,"""Good effort, Please refer to detailed feedbac..."
...,...,...,...,...,...
5754,Q2: 6.5/7 Need to use more English to communic...,Impact 1,Q2: 6.5/7 Need to use more English to communic...,(d).,Q2: 6.5/7 Need to use more English to communic...
5755,"In part (b), which was more complicated than i...",Sensemaking 1,"In part (b), which was more complicated than i...",else.,"In part (b), which was more complicated than i..."
5756,Q3: 11.5/13: More English exposition is requir...,Impact 2,Q3: 11.5/13: More English exposition is required.,required.,Q3: 11.5/13: More English exposition is required.
5757,You made two errors: finding the determinant i...,Sensemaking 1,You made two errors: finding the determinant i...,(e).,You made two errors: finding the determinant i...


As with the rubrics, we will now define a function that removes the score from the feedback text, since it has been separated and placed in its own column. It is important to remove it completely from the feedback text, since we don't want it to be analysed as part of the text by the NLP, machine learning or deep learning techniques applied later.

In [125]:
# Defining a function to remove the score from the feedback text.
def remScore(text):
    
    # Splitting the last word from the feedback text
    last = text.split(' ')[-1]
    
    # Checking if the last word is a number
    if last.isnumeric() or re.match(r'^-?\d+(?:\.\d+)$', last) is not None:
        
        # Removing only the trailing score; str.replace would also remove
        # the same number wherever else it appears in the text
        text = text[:-len(last)]
    
    # Returning the text with the value removed
    return text

feed_df['SentenceScoreRem'] = feed_df['SentenceLabelRem'].apply(lambda x : remScore(x))

In [126]:
# Checking the data
feed_df

Unnamed: 0,Sentence,Rubric,SentenceLabelRem,Score,SentenceScoreRem
0,Yuejing more in depth analysis is required and...,Sensemaking 1&Impact 1,Yuejing more in depth analysis is required and...,58.33,Yuejing more in depth analysis is required and...
1,﻿Team 1 requested to re-do their workbook 3 to...,Impact 1,﻿Team 1 requested to re-do their workbook 3 to...,55.,﻿Team 1 requested to re-do their workbook 3 to...
2,The team submitted the workbook 23 days after ...,Sensemaking 2,The team submitted the workbook 23 days after ...,59,The team submitted the workbook 23 days after ...
3,Risk assessment and report needs work as discu...,Sensemaking 1,Risk assessment and report needs work as discu...,70,Risk assessment and report needs work as discu...
4,"""Good effort, Please refer to detailed feedbac...",Agency 2&Agency 1,"""Good effort, Please refer to detailed feedbac...",attached.,"""Good effort, Please refer to detailed feedbac..."
...,...,...,...,...,...
5754,Q2: 6.5/7 Need to use more English to communic...,Impact 1,Q2: 6.5/7 Need to use more English to communic...,(d).,Q2: 6.5/7 Need to use more English to communic...
5755,"In part (b), which was more complicated than i...",Sensemaking 1,"In part (b), which was more complicated than i...",else.,"In part (b), which was more complicated than i..."
5756,Q3: 11.5/13: More English exposition is requir...,Impact 2,Q3: 11.5/13: More English exposition is required.,required.,Q3: 11.5/13: More English exposition is required.
5757,You made two errors: finding the determinant i...,Sensemaking 1,You made two errors: finding the determinant i...,(e).,You made two errors: finding the determinant i...
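The two steps above (taking the last word, then removing it if it is a number) can also be sketched as a single pass with one regular expression anchored at the end of the string. This is an illustrative alternative on made-up sentences, not the notebook's code; `split_trailing_score` and `SCORE_AT_END` are hypothetical names.

```python
import re

# A number (optionally with decimals) preceded by whitespace at the very end
SCORE_AT_END = re.compile(r"\s+(\d+(?:\.\d+)?)\s*$")


def split_trailing_score(text):
    """Return (text_without_score, score); score is '' when none is found."""
    match = SCORE_AT_END.search(text)
    if match is None:
        return text, ""
    # Cut the text where the trailing whitespace and number begin
    return text[:match.start()], match.group(1)


examples = [
    "Link key concepts. 58.33",
    "Please refer to the file attached.",
    "Workbook 3 scored 3",
]
results = [split_trailing_score(sentence) for sentence in examples]
print(results)
```

Because the pattern is anchored at the end of the string, an identical number earlier in the sentence is left untouched.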


If we take a closer look at the new `Score` column, we can see several instances where the last word of the feedback text was not the score, so the column contains many non-numeric values. The idea is to blank these values out and then fill them with the next occurring numeric value.

For example, the view above contains values like '(d).', 'else.' and 'required.'. The next occurring numeric value is 83.78, so these values need to be replaced with 83.78, as they all belong to the same feedback.

Therefore, we will define a function that replaces the non-numeric characters with an empty string.

Some of the last words also contain the character `.`. The function leaves this character in place so that decimal scores such as 58.33 survive; values that are reduced to a lone `.` are then cleared separately.

In [127]:
# Erasing the non-numeric values in the Score column
def remNonNumeric(text):
    
    # Using regular expressions to replace the non-numeric characters with an empty string
    rem = re.sub(r'[^0-9.]', '', text)
    
    # Returning the replaced values
    return rem

# Applying the non-numeric value removal function to the Score column
feed_df['Score'] = feed_df['Score'].apply(lambda x : remNonNumeric(x))

In [128]:
# Replacing values that consist of a single '.' with an empty string
feed_df['Score'] = feed_df['Score'].replace('.','')

In [129]:
# Checking the data
feed_df

Unnamed: 0,Sentence,Rubric,SentenceLabelRem,Score,SentenceScoreRem
0,Yuejing more in depth analysis is required and...,Sensemaking 1&Impact 1,Yuejing more in depth analysis is required and...,58.33,Yuejing more in depth analysis is required and...
1,﻿Team 1 requested to re-do their workbook 3 to...,Impact 1,﻿Team 1 requested to re-do their workbook 3 to...,55.,﻿Team 1 requested to re-do their workbook 3 to...
2,The team submitted the workbook 23 days after ...,Sensemaking 2,The team submitted the workbook 23 days after ...,59,The team submitted the workbook 23 days after ...
3,Risk assessment and report needs work as discu...,Sensemaking 1,Risk assessment and report needs work as discu...,70,Risk assessment and report needs work as discu...
4,"""Good effort, Please refer to detailed feedbac...",Agency 2&Agency 1,"""Good effort, Please refer to detailed feedbac...",,"""Good effort, Please refer to detailed feedbac..."
...,...,...,...,...,...
5754,Q2: 6.5/7 Need to use more English to communic...,Impact 1,Q2: 6.5/7 Need to use more English to communic...,,Q2: 6.5/7 Need to use more English to communic...
5755,"In part (b), which was more complicated than i...",Sensemaking 1,"In part (b), which was more complicated than i...",,"In part (b), which was more complicated than i..."
5756,Q3: 11.5/13: More English exposition is requir...,Impact 2,Q3: 11.5/13: More English exposition is required.,,Q3: 11.5/13: More English exposition is required.
5757,You made two errors: finding the determinant i...,Sensemaking 1,You made two errors: finding the determinant i...,,You made two errors: finding the determinant i...
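To make the effect of the character filter concrete, the short sketch below applies the same pattern, `[^0-9.]`, to a few sample last words. The samples are illustrative; the last one, `6.5/7`, is an assumed edge case showing that a word containing several digits separated by other characters would be merged into a single number.

```python
import re

samples = ["58.33", "55.", "attached.", "(d).", "else.", "6.5/7"]

# Keep only digits and '.', exactly as the pattern in remNonNumeric does
cleaned = [re.sub(r"[^0-9.]", "", value) for value in samples]

# Values reduced to a lone '.' carry no score, so they are cleared
final = ["" if value == "." else value for value in cleaned]

for original, result in zip(samples, final):
    print(repr(original), "->", repr(result))
```

The merged value produced for `6.5/7` suggests that checking whether the last word is a number before filtering may be a safer order of operations.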


Our stage 1 data is ready. Next, we will export this data to a `csv` file. The empty `Score` values will be replaced with the next occurring numeric value in Excel. Furthermore, we will add a new column called `FeedbackCode` in Excel. The file we end up with in the second stage is called `stage2.csv` and it can be found in the same directory as this notebook.

In [133]:
# Exporting the stage 1 data to a csv file.
feed_df[['SentenceLabelRem', 'SentenceScoreRem', 'Rubric', 'Score']].to_csv('./LabelledFeedback/stage1.csv')
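The step of filling empty `Score` values with the next occurring numeric value was carried out in Excel. For reference, the same idea could be sketched in pandas with a backward fill. This is an assumption about how the Excel step behaves, shown on made-up values; the `ScoreFilled` column name is hypothetical.

```python
import pandas as pd

# Made-up Score column: empty strings until the row that carries the score
demo = pd.DataFrame({"Score": ["", "", "83.78", "", "70"]})

# Convert to numbers; empty or unparseable values become NaN
numeric = pd.to_numeric(demo["Score"], errors="coerce")

# Backward fill: each NaN takes the next numeric value further down
demo["ScoreFilled"] = numeric.bfill()

print(demo)
```

Rows after the last score in the data would stay empty, and any last word that happens to be a number but is not a score would be propagated as well, so the result would still need a manual check.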