## NLP and Linear regression

In [87]:
# Sample essays:
# The first essay contains a plethora of spelling misstakes and or grammar mistakse 

essay1 = "Supercomputers is very powerful computers that are designed to perform n\
compleks calculations at high speeds. Thei are typically used for scientific research n\
and weather forecasting. Supercomputers is be quite expensive and require specialized facilities to house them."

# The second essay has a little bit less, but still quite many misstakes
essay2 = "Supercomputers may be used to solve some of the most challenging problems in science and engineeering. n\
They are used in fields such as physikcs, chemistry, and biology to simulate complex systems and perform large-scale n\
data analysis. Supercomputers are also used in the design of new products and the optimization of manufacturing processes."

# Third essay is a run-of-the mill essay, not good, not horrible
essay3 = "One of the key features of supercomputers is their ability to perform parallel processing. n\
This means that they can break down complex problems into smaller parts and solve them simultaneously, n\
greatly reducing the time required to complete the calculation supercomputers are also capable of handling n\
large amounts of data which is essential in many scientific and engineering applications."

# The fourth is already quite good with 1 or 2 grammar misstakes for example commas and sentence length
essay4 = "Supercomputers are essential tools for scientific research and innovation.n\
They are used to model and simulate complex systems such as the behavior of molecules n\
in a chemical reaction or the flow of air over an airplane wing. Supercomputers are also n\
used to analyze large amounts of datas, such as genomic data in the field of bioinformatics."

# The fifth is the perfect essay in from this entire lot
essay5 = "Supercomputers are driving innovation in fields such as artificial intelligence, machine learning, n\
and autonomous systems. They are used to train large-scale deep learning models, which are used in applications n\
such as image recognition and natural language processing. Supercomputers are also essential in the development n\
of new technologies, such as self-driving cars and smart cities. As supercomputers continue to advance, their n\
potential applications are virtually limitless, and they will play an increasingly important role in shaping the future of technology and society."

# This essay results in a predicted grade of 2
toBeGraded = "Super computers is a very important area in computer science. It is used for complex calculation and simulation.n\
It has a big role in science and industry. Many of the computer science researchers are working on this area. The super computer n\
has huge processing power and memory, making it very useful for scientific simulations and weather forecasting. Companies like IBM and Cray are the leaders in the super computer market."

# This is the fifth essay, which can be used to check wether the grading is at all reliable
# It seems to be considering the result (predicted grade) is 5 for this essay
toBeGraded5 = "Supercomputers are driving innovation in fields such as artificial intelligence, machine learning, n\
and autonomous systems. They are used to train large-scale deep learning models, which are used in applications n\
such as image recognition and natural language processing. Supercomputers are also essential in the development n\
of new technologies, such as self-driving cars and smart cities. As supercomputers continue to advance, their n\
potential applications are virtually limitless, and they will play an increasingly important role in shaping the future of technology and society."

# This is the third essay for testing the grading
# it got a score of 3.8 which is quite high considering this is the example essay for a grading of 3, I'm assuming this is due to the small dataset
toBeGraded3 = "One of the key features of supercomputers is their ability to perform parallel processing. n\
This means that they can break down complex problems into smaller parts and solve them simultaneously, n\
greatly reducing the time required to complete the calculation supercomputers are also capable of handling n\
large amounts of data which is essential in many scientific and engineering applications."


In [88]:
# Imports 

import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.linear_model import LinearRegression

In [89]:
# Downloads
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('omw-1.4')
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [90]:
# Defining the function which preprocesses the "toBeGraded" essay
def preprocess(text):
    # Tokenization
    words = word_tokenize(text.lower())

    # -Stop words + lemmatization
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]

    # Remake sentances
    sentences = ' '.join(words)
    return sentences


In [91]:
preEssay = preprocess(toBeGraded)
#preEssay = preprocess(toBeGraded5)
# preEssay = preprocess(toBeGraded3)



To determine the grade of an essay, we can try and user certain key attributes of sed text:
*   sentence length
*   unique words
*   grammar errors
*   transitional phrases





In [92]:
features = {}
features['sentence_length'] = len(sent_tokenize(preEssay))
features['unique_words'] = len(set(word_tokenize(preEssay)))
features['grammar_errors'] = 0 # Placeholder for grammar error detection
features['transitional_phrases'] = preEssay.count('however')

In [93]:
dataset = [
    {'essay': essay1, 'score': 1},
    {'essay': essay2, 'score': 2},
    {'essay': essay3, 'score': 3},
    {'essay': essay4, 'score': 4},
    {'essay': essay5, 'score': 5},
]

In [94]:
# print(dataset)

In [95]:
for i in range(len(dataset)):
    dataset[i]['essay'] = preprocess(dataset[i]['essay'])

# X and Y lists for the linearRegression
X = []
y = []

In [96]:
# For each essay in the dataset of essays
for essay in dataset:
    x_i = []
    x_i.append(len(sent_tokenize(essay['essay'])))
    x_i.append(len(set(word_tokenize(essay['essay']))))
    x_i.append(0) # Placeholder for grammar error detection
    x_i.append(essay['essay'].count('however'))
    X.append(x_i)
    y.append(essay['score'])

model = LinearRegression()
model.fit(X, y)

LinearRegression()

In [97]:
x_sample = [features['sentence_length'], features['unique_words'], features['grammar_errors'], features['transitional_phrases']]
predicted_grade = model.predict([x_sample])[0]
print('Predicted grade:', round(predicted_grade, 2))

Predicted grade: 2.01
