# Evaluation of Descriptive Answers Using NLP

## Problem Statement

Our goal is to look at the Original answer and the student answer and note their similarities and differences and give a score or percentage

### libraries used:
kindly install proper versions and import these while using!
* Pandas
* spacy
* neuralcoref
* textblob
* re
* string
* nltk, nltk.stem, nltk.tokenize, nltk.corpus
* matplotlib
* numpy
* sklearn


### Files to download:
Please download the training_data.xlsx and test_data.xlsx from the links given in report and keep in the same folder where this code lies.

## 1. Getting The Data

#### Kindly download the test_data.xlsx from the link given in report and keep it in the same folder where this file lies and add your 

#### own test cases in it,or change numbers from 0 to 9 to use the test cases we provided. Thanks!

In [1]:
import pandas as pd
#load the original answer as below, choose number between 0 to 9
test_data = pd.read_excel("test_data.xlsx")
original_answer_script = test_data["original_answer"][0]

#load the student answer as below, choose number between 0 to 9
student_answer_script = test_data["Student_answer"][0]


In [2]:
print('\033[1m'+"Original Answer"+'\033[0m')
print(original_answer_script)
print('\033[1m'+"\nStudent's Answer"+'\033[0m')
print(student_answer_script)

[1mOriginal Answer[0m
Historiography has a number of related meanings. Firstly, it can refer to how history has been produced: the story of the development of and practices (for example, the move from short-term biographical narrative toward long-term thematic analysis). Secondly, it can refer to what has been produced: a specific body of historical writing (for example, "medieval historiography during the 1960s" means "Works of medieval history written during the 1960s").Thirdly, it may refer to why history is produced: the philosophy of history. As a meta-level analysis of descriptions of the past, this third conception can relate to the first two in that the analysis usually focuses on the narratives, interpretations, world view, use of evidence, or method of presentation of other historians. Professional historians also debate the question of whether history can be taught as a single coherent narrative or a series of competing narratives.
[1m
Student's Answer[0m
History has man

## 2. Cleaning The Data

**Common data cleaning steps on all text:**
* Make text all lower case, Remove punctuation, Remove numerical values, Remove common non-sensical text (/n)
* coreference resolution, Tokenize text, Remove stop words

**More data cleaning steps after tokenization:**
* Stemming / lemmatization using pos_tag, Parts of speech tagging
* Deal with typos, Replacing synonyms etc

**Further high level cleaning techniques like:**
* chunking, collocation extraction, Bi - grams.
* Relationship extraction, NER etc can be used for more accuracy.


### 2.1  Inserting Text in DataFrame

In [3]:
# We can either keep it in dictionary format or put it into a pandas dataframe
import pandas as pd  
  
# assign data of lists.  
data = {'Person': ['Teacher', 'student'], 'text': [original_answer_script, student_answer_script]}  
data_df = pd.DataFrame(data) #data_df is unprocessed dataset.

### 2.2  Coreference Resolution 

In [4]:
#corefernce resolution : Replacing pronouns with nouns
import neuralcoref
import spacy

nlp = spacy.load("en_core_web_md") #medium (md) is used due to limited RAM, can use lg(large) for more accuracy.

neuralcoref.add_to_pipe(nlp)

doc_1 = nlp(original_answer_script)
doc_2 = nlp(student_answer_script)

original_answer_script = doc_1._.coref_resolved
student_answer_script = doc_2._.coref_resolved
print('\033[1m'+"Original Answer (Replacing Pronouns with nouns)"+'\033[0m')
print(original_answer_script)
print('\033[1m'+"\nStudent Answer (Replacing Pronouns with nouns)"+'\033[0m')
print(student_answer_script)

[1mOriginal Answer (Replacing Pronouns with nouns)[0m
Historiography has a number of related meanings. Firstly, it can refer to how history has been produced: the story of the development of and practices (for example, the move from short-term biographical narrative toward long-term thematic analysis). Secondly, it can refer to what has been produced: a specific body of historical writing (for example, "medieval historiography during the 1960s" means "Works of medieval history written during the 1960s").Thirdly, it may refer to why history is produced: the philosophy of history. As a meta-level analysis of descriptions of the past, this third conception can relate to the first two in that the analysis usually focuses on the narratives, interpretations, world view, use of evidence, or method of presentation of other historians. Professional historians also debate the question of whether history can be taught as a single coherent narrative or a series of competing narratives.
[1m
Stud

### 2.3 Removing Punctuations and making the text lower case

In [5]:
# Apply a first round of text cleaning techniques
import re
import string

def clean_text_round1(text):
    '''Make text lowercase, remove text in square brackets, remove punctuation and remove words containing numbers.'''
    text = text.lower()
    text = re.sub('\[.*?\]', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub('\w*\d\w*', '', text)
    text = text.replace(' “' , " ")
    text = text.replace("’s ", " ")
    text = text.replace('” ', " ")
    return text
    return text

text_clean_1 = clean_text_round1(original_answer_script)
text_clean_2 = clean_text_round1(student_answer_script)
print('\033[1m'+"Original Answer (Removing punctuations and all lower case)"+'\033[0m')
print(text_clean_1)
print('\033[1m'+"\nStudent Answer (Removing punctuations and all lower case)"+'\033[0m')
print(text_clean_2)

[1mOriginal Answer (Removing punctuations and all lower case)[0m
historiography has a number of related meanings firstly it can refer to how history has been produced the story of the development of and practices for example the move from shortterm biographical narrative toward longterm thematic analysis secondly it can refer to what has been produced a specific body of historical writing for example medieval historiography during the  means works of medieval history written during the  it may refer to why history is produced the philosophy of history as a metalevel analysis of descriptions of the past this third conception can relate to the first two in that the analysis usually focuses on the narratives interpretations world view use of evidence or method of presentation of other historians professional historians also debate the question of whether history can be taught as a single coherent narrative or a series of competing narratives
[1m
Student Answer (Removing punctuations an

### 2.4 Fixing Typos and Counting number  of Typos

In [6]:
#fix typos
from textblob import TextBlob
 
#correct typos in students answer
textBlb = TextBlob(text_clean_2)            
text_fixed_typos_2 = textBlb.correct() 

#correct typos in original answer (typos acc to textblob)
textBlb = TextBlob(text_clean_1)            
text_fixed_typos_1 = textBlb.correct() 

# checking the typos not in the original text
typos_list = []
typos_list_corrected = []
wordslist_original = list(text_clean_1.split())
wordslist_1 = list(text_clean_2.split())
wordslist_2 = list(text_fixed_typos_2.split())

for i in range(len(wordslist_1)) :
    if wordslist_1[i] != wordslist_2[i] :
        if wordslist_1[i] not in wordslist_original :
            typos_list.append(wordslist_1[i])
            typos_list_corrected.append(wordslist_2[i])

typos = len(typos_list)
print('\033[1m'+"List of typos"+'\033[0m')
print(typos_list, typos_list_corrected)
print("no of typos : ")
print(typos)
cleaned_1 = str(text_fixed_typos_1)
cleaned_2 = str(text_fixed_typos_2)
print('\033[1m'+"Original Answer (After Fixing typos)"+'\033[0m')
print(cleaned_1)
print('\033[1m'+"\nStudent Answer (After Fixing typos)"+'\033[0m')
print(cleaned_2)

[1mList of typos[0m
['implications'] ['implication']
no of typos : 
1
[1mOriginal Answer (After Fixing typos)[0m
historiography has a number of related meanings firstly it can refer to how history has been produced the story of the development of and practices for example the move from shorter biographical narrative toward longer rheumatic analysis secondly it can refer to what has been produced a specific body of historical writing for example medieval historiography during the  means works of medieval history written during the  it may refer to why history is produced the philosophy of history as a metalevel analysis of descriptions of the past this third conception can relate to the first two in that the analysis usually focused on the narratives interpretations world view use of evidence or method of presentation of other historians professional historians also debate the question of whether history can be taught as a single coherent narrative or a series of competing narrative

### 2.5 Pos Tagging

In [7]:
#pos tagging
 
# Define function to lemmatize each word with its POS tag
import nltk
import nltk.corpus
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet

# POS_TAGGER_FUNCTION 
def pos_tagger(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:         
        return None
    

# tokenize the sentence and find the POS tag for each token
pos_tagged_1 = nltk.pos_tag(nltk.word_tokenize(cleaned_1))
pos_tagged_2 = nltk.pos_tag(nltk.word_tokenize(cleaned_2))
 
# the above pos tags are a little confusing.
 
# we can use our own pos_tagger function to make things simpler to understand.
wordnet_tagged_1 = list(map(lambda x: (x[0], pos_tagger(x[1])), pos_tagged_1))
wordnet_tagged_2 = list(map(lambda x: (x[0], pos_tagger(x[1])), pos_tagged_2))

data_tagged = {'Person': ['Teacher', 'student'], 'text': [wordnet_tagged_1, wordnet_tagged_2]}  
data_df_tagged = pd.DataFrame(data_tagged) 
data_df_tagged

Unnamed: 0,Person,text
0,Teacher,"[(historiography, n), (has, v), (a, None), (nu..."
1,student,"[(history, n), (has, v), (many, a), (relevant,..."


### 2.6 Lemmatization

In [8]:
#lemmatization
from nltk.stem import WordNetLemmatizer
 
lemmatizer = WordNetLemmatizer()
 
lemmatized_text_1 = []
lemmatized_text_2 = []


for word, tag in wordnet_tagged_1:
    if tag is None:
        # if there is no available tag, append the token as is
        lemmatized_text_1.append(word)
    else:       
        # else use the tag to lemmatize the token
        lemmatized_text_1.append(lemmatizer.lemmatize(word, tag))
lemmatized_text_1 = " ".join(lemmatized_text_1)

for word, tag in wordnet_tagged_2:
    if tag is None:
        # if there is no available tag, append the token as is
        lemmatized_text_2.append(word)
    else:       
        # else use the tag to lemmatize the token
        lemmatized_text_2.append(lemmatizer.lemmatize(word, tag))
lemmatized_text_2 = " ".join(lemmatized_text_2)
 
data_lemmatized = {'Person': ['Teacher', 'student'], 'text': [lemmatized_text_1, lemmatized_text_2]}  
data_df_lemmatized = pd.DataFrame(data_lemmatized) 
data_df_lemmatized

Unnamed: 0,Person,text
0,Teacher,historiography have a number of related meanin...
1,student,history have many relevant implication first i...


### 2.7 Replace synonyms with original words 

In [9]:
from nltk.corpus import wordnet

original_word_list = lemmatized_text_1.split()
student_word_list = lemmatized_text_2.split()

replaced_words_list = []

for original_word in original_word_list:                           #replace synonyms in original_text
    synonyms = []
    for syn in wordnet.synsets(original_word):                         
        for l in syn.lemmas():
            synonyms.append(l.name())
    for synonym in synonyms:
        if (synonym in original_word_list) and (synonym != original_word):
            for i in range(len(original_word_list)):
                if original_word_list[i] == synonym:
                    original_word_list[i] = original_word
                    print(synonym,original_word )
                    print(original_word_list)
            replaced_words_list.append([synonym, original_word])
lemmatized_text_1 = " ".join(original_word_list)



for original_word in original_word_list:                           #replace synonyms in student text with original text
    synonyms = []
    for syn in wordnet.synsets(original_word):                         
        for l in syn.lemmas():
            synonyms.append(l.name())
    for synonym in synonyms:
        if (synonym in student_word_list) and (synonym != original_word):
            for i in range(len(student_word_list)):
                if student_word_list[i] == synonym:
                    student_word_list[i] = original_word
                    print(student_word_list)
            replaced_words_list.append([synonym, original_word])

for word in student_word_list:
    synonyms = []
    for syn in wordnet.synsets(word):                         
        for l in syn.lemmas():
            synonyms.append(l.name())
    for synonym in synonyms:
        if (synonym in original_word_list) and (synonym != word):
            for i in range(len(student_word_list)):
                if student_word_list[i] == word:
                    student_word_list[i] = synonym
            replaced_words_list.append([word, synonym])
lemmatized_text_ = " ".join(student_word_list)
           
print(replaced_words_list)

relate related
['historiography', 'have', 'a', 'number', 'of', 'related', 'meaning', 'firstly', 'it', 'can', 'refer', 'to', 'how', 'history', 'have', 'be', 'produce', 'the', 'story', 'of', 'the', 'development', 'of', 'and', 'practice', 'for', 'example', 'the', 'move', 'from', 'short', 'biographical', 'narrative', 'toward', 'longer', 'rheumatic', 'analysis', 'secondly', 'it', 'can', 'refer', 'to', 'what', 'have', 'be', 'produce', 'a', 'specific', 'body', 'of', 'historical', 'writing', 'for', 'example', 'medieval', 'historiography', 'during', 'the', 'mean', 'work', 'of', 'medieval', 'history', 'write', 'during', 'the', 'it', 'may', 'refer', 'to', 'why', 'history', 'be', 'produce', 'the', 'philosophy', 'of', 'history', 'as', 'a', 'metalevel', 'analysis', 'of', 'description', 'of', 'the', 'past', 'this', 'third', 'conception', 'can', 'related', 'to', 'the', 'first', 'two', 'in', 'that', 'the', 'analysis', 'usually', 'focus', 'on', 'the', 'narrative', 'interpretation', 'world', 'view', 'use

['history', 'have', 'many', 'relevant', 'implication', 'first', 'it', 'may', 'have', 'to', 'do', 'with', 'how', 'history', 'be', 'create', 'history', 'of', 'development', 'and', 'practice', 'eg', 'the', 'transition', 'from', 'short', 'biographical', 'story', 'to', 'longer', 'rheumatic', 'analysis', 'second', 'it', 'may', 'refer', 'to', 'what', 'be', 'create', 'as', 'particular', 'collection', 'of', 'historical', 'work', 'for', 'example', 'medieval', 'history', 'of', 'the', 'be', 'work', 'of', 'medieval', 'history', 'write', 'in', 'the', 'mean', 'as', 'an', 'analysis', 'of', 'the', 'metal', 'level', 'of', 'the', 'past', 'this', 'third', 'concept', 'be', 'the', 'first', 'two', 'as', 'the', 'analysis', 'usually', 'focus', 'on', 'story', 'interpretation', 'worldviews', 'use', 'of', 'evidence', 'or', 'other', 'historian', 'representation', 'it', 'may', 'be', 'relate', 'professional', 'historian', 'also', 'argue', 'whether', 'history', 'can', 'be', 'teach', 'as', 'a', 'single', 'coherent', '

In [10]:
synonyms = []
for syn in wordnet.synsets("story"):                         
    for l in syn.lemmas():
        synonyms.append(l.name())
            
print(synonyms)

['narrative', 'narration', 'story', 'tale', 'story', 'floor', 'level', 'storey', 'story', 'history', 'account', 'chronicle', 'story', 'report', 'news_report', 'story', 'account', 'write_up', 'fib', 'story', 'tale', 'tarradiddle', 'taradiddle']


In [11]:
print('\033[1m'+"Original Answer (Lemmatized)"+'\033[0m')
print(lemmatized_text_1)
print('\033[1m'+"\nStudent Answer (Lemmatized)"+'\033[0m')
print(lemmatized_text_2)

[1mOriginal Answer (Lemmatized)[0m
historiography have as number of related meaning firstly it can related to how history have be produce the history of the development of and practice for example the move from short biographical narrative toward longer rheumatic analysis secondly it can related to what have be produce as specific body of historical writing for example medieval historiography during the meaning work of medieval history writing during the it may related to why history be produce the philosophy of history as as metalevel analysis of description of the past this third conception can related to the firstly two in that the analysis usually focus on the narrative interpretation world view practice of evidence or method of presentation of other historian professional historian also debate the question of whether history can be teach as as single coherent narrative or as series of compete narrative
[1m
Student Answer (Lemmatized)[0m
history have many relevant implication f

## Optional text preprocessing techniques

### a. Chunking (Shallow Parsing)

Chunking is a natural language process that identifies constituent parts of sentences (nouns, verbs, adjectives, etc.) and links them to higher order units that have discrete grammatical meanings (noun groups or phrases, verb groups, etc.)

### b. Named-entity recognition 


Named-entity recognition (NER) aims to find named entities in text and classify them into predefined categories (names of persons, locations, organizations, times, etc.).

### c. Collocation Extraction


Collocations are word combinations occurring together more often than would be expected by chance. Collocation examples are “break the rules,” “free time,” “draw a conclusion,” “keep in mind,” “get ready,” and so on.

### d. Relationship Extraction

Relationship extraction allows obtaining structured information from unstructured sources such as raw text. Strictly stated, it is identifying relations (e.g., acquisition, spouse, employment) among named entities (e.g., people, organizations, locations). For example, from the sentence “Matthew and Emily married yesterday,” we can extract the information that Matthew is Emily’s husband.

## 3. Organizing The Data

The output of this section will be clean, organized data in two standard text formats:
1. **Corpus** - a collection of text
2. **Document-Term Matrix** - word counts in matrix format

### 3.1 Corpus

We already created a corpus in an earlier step. The definition of a corpus is a collection of texts, and they are all put together neatly in a pandas dataframe here.

In [12]:
# assign data of lists.  
data = {'Person': ['Teacher', 'student'], 'text': [lemmatized_text_1, lemmatized_text_2]}  
data_df_cleaned = pd.DataFrame(data) #data_df is unprocessed dataset.

# Let's take a look at our dataframe
data_df_cleaned

Unnamed: 0,Person,text
0,Teacher,historiography have as number of related meani...
1,student,history have many relevant implication first i...


### 3.2 Document-Term Matrix

For many of the techniques we'll be using in future, the text must be tokenized, meaning broken down into smaller pieces. The most common tokenization technique is to break down text into words. We can do this using scikit-learn's CountVectorizer, where every row will represent a different document and every column will represent a different word.

In addition, with CountVectorizer, we can remove stop words. Stop words are common words that add no additional meaning to text such as 'a', 'the', etc.

In [13]:
# We are going to create a document-term matrix using CountVectorizer, and exclude common English stop words
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english')
data_cv = cv.fit_transform(data_df_cleaned.text)
data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names())
data_dtm.index = data_df_cleaned.index
data_dtm

Unnamed: 0,analysis,argue,biographical,body,coherent,collection,compete,concept,conception,create,...,teach,transition,use,usually,view,work,world,worldviews,write,writing
0,3,0,1,1,1,0,1,0,1,0,...,1,0,0,1,1,1,1,0,0,2
1,3,1,1,0,1,1,1,1,0,2,...,1,1,1,1,0,2,0,1,1,0


## 4. Exploratory Data Analysis (Text Analysis)

* 1. Counting number of common words,missing words, Total words and Total different words, between original and student scripts
* 2. Lists of nouns, verbs, adverbs, adjectives to choose sentences for similarity scoring
* 3. Sentence similarity scoring and counting missing nouns
*  Antonyms and wrong mapping of functions (verbs) for given noun(s) will be taken care in the inbuilt similarity score.


### 4.1 Counting number of common words, missing words, Total words and Total different words, between original and student scripts

In [14]:
# Counting number of common words between original and student text
words = [ key for key in dict(data_dtm.iloc[0])] 

common_word_count = 0
total_words_student_count = 0
missing_words_count = 0
total_words_original_count = 0

common_words = []
total_words_student = []
missing_words = []
total_words_original = []

for word in words:
    list_ = list(data_dtm[word])
    if (list_[0] != 0) and (list_[1] != 0) :
        common_word_count += 1
        common_words.append(word)
    if (list_[1] != 0) :
        total_words_student_count += 1
        total_words_student.append(word)
    if (list_[1] == 0) and (list_[0] != 0):
        missing_words_count += 1
        missing_words.append(word)
    if (list_[0] != 0) :
        total_words_original_count += 1   
        total_words_original.append(word)

original_total_word_count = data_dtm.sum(axis = 1)[0] 
student_total_word_count = data_dtm.sum(axis = 1)[1]
        
common_word_count_percentage = common_word_count/ total_words_student_count

### 4.2 Lists of nouns to choose sentences for similarity scoring

In [15]:
#create lists of  nouns from lemmatized text to select proper sentences while comparing similarity
nouns_original = []

pos_tagged_1 = nltk.pos_tag(nltk.word_tokenize(lemmatized_text_1))
 
# we can use our own pos_tagger function to make things simpler to understand.
wordnet_tagged_1 = list(map(lambda x: (x[0], pos_tagger(x[1])), pos_tagged_1))

for word,tag in wordnet_tagged_1 :
    if tag == "n":
        nouns_original.append(word)


In [16]:
#removing duplicates from list
distinct_nouns_original = []
[distinct_nouns_original.append(x) for x in nouns_original if x not in distinct_nouns_original]
nouns_original = distinct_nouns_original

### 4.3 Sentence similarity scoring and counting missing nouns

In [17]:
# Getting sentences from original text and student text
student_text_sentences_list = []
original_text_sentences_list = []

student_text_sentence = ""
original_text_sentence = ""
word_count = 0
sentence_count = 0

list_1 = original_answer_script.split(".")
if list_1[len(list_1) - 1] == " ":
    list_1.pop()

# generating original_text_sentences_list
for word in lemmatized_text_1.split(" "):
    if word_count < len(list_1[sentence_count].lstrip().split(" ")) :
        original_text_sentence = original_text_sentence + " " + word
        word_count += 1
    else :
        word_count = 0
        sentence_count += 1
        original_text_sentences_list.append(original_text_sentence.lstrip())
        original_text_sentence = ""
        original_text_sentence = original_text_sentence + " " + word
        word_count += 1
        
original_text_sentences_list.append(original_text_sentence.lstrip())
print('\033[1m'+"Original Text Sentences"+'\033[0m')
print(original_text_sentences_list)

# generating student_text_sentences_list
list_2 = student_answer_script.split(".")

if list_2[len(list_2) - 1] == " ":
    list_2.pop()
    
word_count = 0
sentence_count = 0

for word in lemmatized_text_2.split(" "):
    if word_count < len(list_2[sentence_count].lstrip().split(" ")) :
        student_text_sentence = student_text_sentence + " " + word
        word_count += 1
    else :
        word_count = 0
        sentence_count += 1
        student_text_sentences_list.append(student_text_sentence.lstrip())
        student_text_sentence = ""
        student_text_sentence = student_text_sentence + " " + word
        word_count += 1
        
student_text_sentences_list.append(student_text_sentence.lstrip())      
print('\033[1m'+"\nStudent Text Sentences"+'\033[0m')    
print(student_text_sentences_list)

[1mOriginal Text Sentences[0m
['historiography have as number of related meaning', 'firstly it can related to how history have be produce the history of the development of and practice for example the move from short biographical narrative toward longer rheumatic', 'analysis secondly it can related to what have be produce as specific body of historical writing for example medieval historiography during the meaning work of medieval history writing during the it', 'may related to why history be produce the philosophy of history as', 'as metalevel analysis of description of the past this third conception can related to the firstly two in that the analysis usually focus on the narrative interpretation world view practice of evidence or method of presentation of', 'other historian professional historian also debate the question of whether history can be teach as as single coherent narrative or as series of', 'compete', 'narrative']
[1m
Student Text Sentences[0m
['history have many relev

In [18]:
# creating a class that gives nouns in a sentence
class pos_sentence:
    def __init__(self, sentence):
        self.nouns = [noun for noun in nouns_original if noun in sentence.split(" ")]
        


In [19]:
import spacy
nlp = spacy.load("en_core_web_md")

similar_sentence_list = []
similarity_score_list = []
missing_noun_list = []

for sentence in original_text_sentences_list :
    similar_sentences = []
    similarity_score_overall = 0
    
    sentence_object = pos_sentence(sentence)
    distinct_nouns_sentence = []
    [distinct_nouns_sentence.append(x) for x in sentence_object.nouns if x not in distinct_nouns_sentence]
    if len(distinct_nouns_sentence) != 0 :
        for noun in distinct_nouns_sentence:
            similarity_score = 0
            for sentence_1 in student_text_sentences_list :
                if noun in sentence_1.split(" ") :
                    if nlp(sentence_1).similarity(nlp(sentence)) > similarity_score :
                        similarity_score = nlp(sentence_1).similarity(nlp(sentence))
                        similar_sentence = sentence_1

            if similarity_score == 0:
                missing_noun_list.append(noun)

            elif similar_sentence not in similar_sentences :
                similar_sentences.append(similar_sentence)
                similarity_score_overall += similarity_score
    if len(similar_sentences) != 0:
        similarity_score_list.append(similarity_score_overall/len(similar_sentences))
        similar_sentence_list.append([sentence, similar_sentences])



In [20]:
for i in range(len(similar_sentence_list)) :       
    print(f"{i}:{similar_sentence_list[i]}, \noverall similarity:{similarity_score_list[i]}")       
print('\033[1m'+"\nList of missing nouns\n"+'\033[0m')   
print(missing_noun_list)

0:['firstly it can related to how history have be produce the history of the development of and practice for example the move from short biographical narrative toward longer rheumatic', ['first it may have to do with how history be create history of development and practice eg the transition from short biographical story to longer rheumatic analysis', 'second it may refer to what be create a particular collection of historical work for example medieval history of the be work of medieval history write in the mean as']], 
overall similarity:0.9735370368294042
1:['analysis secondly it can related to what have be produce as specific body of historical writing for example medieval historiography during the meaning work of medieval history writing during the it', ['second it may refer to what be create a particular collection of historical work for example medieval history of the be work of medieval history write in the mean as', 'first it may have to do with how history be create history of

## 5. Feature Engineering : Extracting features from the data

*  Score based on sentence similarity(synoyms, Antonyms,Wrong function mapping will be  taken care in this score)
*  Score based on counting number of common words, common nouns, Total words and Total different words
*  Score reduction based on missing nouns (Topics missed in student answer)


In [21]:
distinct_nouns_original_count = len(distinct_nouns_original)

In [22]:
print('\033[1m'+"\nList of missing nouns\n"+'\033[0m')
print(missing_noun_list)
missing_nouns_count = len(missing_noun_list)
print(f"no_of_missing_nouns: {len(missing_noun_list)}")

[1m
List of missing nouns
[0m
['historiography', 'number', 'meaning', 'move', 'historiography', 'meaning', 'body', 'writing', 'philosophy', 'metalevel', 'description', 'conception', 'world', 'view', 'method', 'presentation', 'question']
no_of_missing_nouns: 17


In [23]:
print(similarity_score_list)
similarity_score = sum(similarity_score_list)/len(similarity_score_list)
print(f"similarity_score = {similarity_score}")

[0.9735370368294042, 0.9698571424588274, 0.9582720332402799, 0.9604702690683868, 0.9067023316712794]
similarity_score = 0.9537677626536354


In [24]:
typos_list

['implications']

In [25]:
print(f"common_word_count = {common_word_count} and total_words_count_student = {total_words_student_count} and % = {common_word_count_percentage}")
print(f"\ncommon_words = {common_words}\n")
print(f"total_words_student = {total_words_student}\n")
print(f"missing_words_student = {missing_words}\n")
print(f"total_words_original = {total_words_original}\n")
print(f"common_word_count = {common_word_count}\ntotal_words_student_count = {total_words_student_count} \nmissing_words_count = {missing_words_count} \ntotal_words_original_count = {total_words_original_count}")

common_word_count = 24 and total_words_count_student = 43 and % = 0.5581395348837209

common_words = ['analysis', 'biographical', 'coherent', 'compete', 'development', 'evidence', 'example', 'focus', 'historian', 'historical', 'history', 'interpretation', 'longer', 'medieval', 'past', 'practice', 'professional', 'rheumatic', 'series', 'short', 'single', 'teach', 'usually', 'work']

total_words_student = ['analysis', 'argue', 'biographical', 'coherent', 'collection', 'compete', 'concept', 'create', 'development', 'evidence', 'example', 'focus', 'historian', 'historical', 'history', 'implication', 'interpretation', 'level', 'longer', 'mean', 'medieval', 'metal', 'particular', 'past', 'practice', 'professional', 'refer', 'relate', 'relevant', 'representation', 'rheumatic', 'second', 'series', 'short', 'single', 'story', 'teach', 'transition', 'use', 'usually', 'work', 'worldviews', 'write']

missing_words_student = ['body', 'conception', 'debate', 'description', 'firstly', 'historiography

In [26]:
print(similarity_score)

0.9537677626536354


In [27]:
#Features for model are derived as follows: 

#Feature 1
similarity_score = similarity_score   #considers the presence of synonyms, antonyms and wrong mapping of verbs, adj, adv of corresponding nouns

missing_nouns_count                   # total no of missing nouns(distinct missing topics) in student's text
distinct_nouns_original_count         # total no of nouns (distinct topics) in original text
#Feature 2
fraction_of_topics_missed = (missing_nouns_count/distinct_nouns_original_count)

missing_words_count                   # total no of missing words(distinct) in student's text (verbs,nouns, adjectives etc)
total_words_student_count             # total no of words(distinct) in student's text (excluding stop words)
total_words_original_count            # total no of words(distinct) in original text (excluding stop words)
#Feature  3
fraction_of_new_topics = (total_words_student_count -total_words_original_count + missing_words_count)/ total_words_original_count



## 6. Model 

#### Kindly download the training_data.xlsx from the link given in the report and put it in the same folder where this file lies.

In [28]:
training_data = pd.read_excel("training_data.xlsx")

In [29]:
training_data #All the features are derived individually by going through the same procedure and are entered manually in the file

Unnamed: 0,Test cases,Teacher_text,Student_text,label,similarity_score,fraction_of_topics_missed,fraction_of_new_topics,original_score,Unnamed: 8,Unnamed: 9,...,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24,Unnamed: 25
0,1,Cells provide structure and support to the bod...,Cells are the basic unit of every organism. Th...,"Including Synonyms, antonym or wrong statements",0.944236,0.392857,0.413043,75,"Note: Columns E, F, G are filled from the resu...",,...,,,,,,,,,,
1,2,Cells provide structure and support to the bod...,Cell is the basic unit of a living organism. T...,"Less points, But all right",0.953181,0.464286,0.086957,65,,,...,,,,,,,,,,
2,3,Cells provide structure and support to the bod...,The body of an organism is supported and struc...,wrong mapping of functions but includes all.,0.930503,0.357143,0.304348,70,,,...,,,,,,,,,,
3,4,Cells provide structure and support to the bod...,Cells provide structure and support to the bod...,including all points. With correct mapping of...,0.982915,0.178571,0.152174,98,,,...,,,,,,,,,,
4,5,Cells provide structure and support to the bod...,The interior of the cell is organized into va...,Various Synonyms. Same text.,0.942905,0.285714,0.217391,80,,,...,,,,,,,,,,
5,6,Cells provide structure and support to the bod...,Cells provide structure and support to the bod...,Misspelled words and antonyms,0.975938,0.285714,0.217391,95,,,...,,,,,,,,,,
6,7,Cells provide structure and support to the bod...,Cells offer shape and help to the structure of...,Using more pronouns,0.924222,0.357143,0.326087,80,,,...,,,,,,,,,,
7,8,Cells provide structure and support to the bod...,The body of an organism is supported and struc...,More points with unnecessary points. [Deductio...,0.943895,0.214286,0.847826,70,,,...,,,,,,,,,,
8,9,Cells provide structure and support to the bod...,Every human has cells in their body. Cell is t...,Writing facts but irrelevant to the question.,0.926133,0.714286,0.434783,50,,,...,,,,,,,,,,
9,10,Cells provide structure and support to the bod...,The body of an organism is supported and struc...,"No antonyms, includes all points with differen...",0.952364,0.25,0.326087,95,,,...,,,,,,,,,,


In [31]:
#fitting a multiple linear regression model
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model, metrics

# defining feature matrix(X) and response vector(y)
X = training_data[["similarity_score","fraction_of_topics_missed","fraction_of_new_topics" ]]
y = training_data[["original_score"]]

# # splitting X and y into training and testing sets
# from sklearn.model_selection import train
# X_train, y_train = train(X, y)

# create linear regression object
reg = linear_model.LinearRegression()

# train the model using the training sets
reg.fit(X, y)

# regression coefficients
print('Coefficients: ', reg.coef_)
coefficients = reg.coef_
#regression intercept
print("Intercept:" , reg.intercept_ )
Intercept = reg.intercept_

# variance score: 1 means perfect prediction
# print('Variance score: {}'.format(reg.score(X_test, y_test)))

# plot for residual error

## setting plot style
# plt.style.use('fivethirtyeight')

# ## plotting residual errors in training data
# plt.scatter(reg.predict(X_train), reg.predict(X_train) - y_train,color = "green", s = 10, label = 'Train data')

# ## plotting residual errors in test data
# plt.scatter(reg.predict(X_test), reg.predict(X_test) - y_test,color = "blue", s = 10, label = 'Test data')

# ## plotting line for zero residual error
# plt.hlines(y = 0, xmin = 0, xmax = 50, linewidth = 2)

# ## plotting legend
# plt.legend(loc = 'upper right')

# ## plot title
# plt.title("Residual errors")

# ## method call for showing the plot
# plt.show()


Coefficients:  [[169.05178094 -69.15099897 -20.09644363]]
Intercept: [-51.51129328]


## 7. Prediction

In [32]:
# model
predicted_score =  -51.51129328 + (169.05178094 * similarity_score) + (-69.15099897*fraction_of_topics_missed) +(-20.09644363* fraction_of_new_topics)

if predicted_score>100: #For the cases where no topics are missing and less synonyms have been used.
    predicted_score=100

In [33]:
#here's the predicted score of given test data in the beginning
print(predicted_score)

60.70288801187889
