# BERT

Sources:
https://towardsdatascience.com/checking-grammar-with-bert-and-ulmfit-1f59c718fe75
https://gist.github.com/sayakmisra/dbb06efec99e760cf9e5d197175ad9c5#file-grammar-checker-bert-ipynb

In [None]:
import tensorflow as tf

# Get the GPU device name.
device_name = tf.test.gpu_device_name()

# The device name should look like the following:
if device_name == '/device:GPU:0':
    print('Found GPU at: {}'.format(device_name))
else:
    raise SystemError('GPU device not found')

Found GPU at: /device:GPU:0


In [None]:
import torch

# If there's a GPU available...
if torch.cuda.is_available():    

    # Tell PyTorch to use the GPU.    
    device = torch.device("cuda")

    print('There are %d GPU(s) available.' % torch.cuda.device_count())

    print('We will use the GPU:', torch.cuda.get_device_name(0))

# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
We will use the GPU: Tesla T4


Package from: https://github.com/huggingface/transformers

In [None]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.23.1-py3-none-any.whl (5.3 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.10.1-py3-none-any.whl (163 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.10.1 tokenizers-0.13.1 transformers-4.23.1


# Loading Data

In [1]:
import pandas as pd

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# To unmount your Google Drive:
# drive.flush_and_unmount()

In [None]:
# Load the dataset into a pandas dataframe.
df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Erdos Fall 2022/Dataset/train.csv")

In [None]:
# Report the number of essays in the training set.
print('Number of training essays: {:,}\n'.format(df.shape[0]))

# Display 10 random rows from the data.
df.sample(10)

Number of training essays: 3,911



Unnamed: 0,text_id,full_text,cohesion,syntax,vocabulary,phraseology,grammar,conventions
1764,8593D649EBE4,"Dear, TEACHER_NAME\n\nI know that us student a...",3.0,2.5,3.0,3.0,3.0,3.0
3508,EF47AB98271A,Do you agree or disagree with the impression a...,3.0,3.0,4.0,3.5,3.0,3.0
1618,7B0430EDACBA,I agree that having a positive attitude is the...,4.0,3.0,3.5,3.5,3.5,3.5
1642,7CBB1B6F4E25,Having electives classes can be a great opport...,3.5,3.5,4.0,4.0,3.5,3.0
198,0EEE49F99224,Some people believe that you don't identifying...,2.5,2.5,2.5,3.5,2.5,2.5
2598,C0B30026B439,Something I would like to accomplish is being ...,2.5,2.5,3.0,3.0,3.0,3.0
1667,7EA986233EDA,One of minister Winston Churchill most famous ...,3.5,2.5,3.5,3.0,3.0,3.0
1896,8EEA38B2E6CD,Generic_Name is working hard to be owner of hi...,3.0,3.0,3.0,2.5,2.5,2.0
3631,F4C52358CE03,In this reasons from Churchill's statement. I ...,2.5,2.5,2.5,2.0,2.5,2.5
2897,D190133EDD2B,Dear council\n\nI think the city council shoul...,3.0,3.0,2.5,2.5,2.0,2.5


In [None]:
df['full_text'][0]

"I think that students would benefit from learning at home,because they wont have to change and get up early in the morning to shower and do there hair. taking only classes helps them because at there house they'll be pay more attention. they will be comfortable at home.\n\nThe hardest part of school is getting ready. you wake up go brush your teeth and go to your closet and look at your cloths. after you think you picked a outfit u go look in the mirror and youll either not like it or you look and see a stain. Then you'll have to change. with the online classes you can wear anything and stay home and you wont need to stress about what to wear.\n\nmost students usually take showers before school. they either take it before they sleep or when they wake up. some students do both to smell good. that causes them do miss the bus and effects on there lesson time cause they come late to school. when u have online classes u wont need to miss lessons cause you can get everything set up and go t

In [None]:
df['full_text'][5]

"Dear Principal,\r\n\r\nOur school should have a community center. The reasons why, are so students can learn what our community needs, how to make our community better place, and why is community important for students to know. Its a great to have a community center to know how we can make things better.\r\n\r\nStudents think community center takes their time away. but they have to learn what our community needs. students will participate in a group of students making a list what our community needs, therefore students will learn what our community needs! students will present their list of things our community needs! due to that students will be giving extra credit for the ones who have low grades!\r\n\r\nSome students don't participate because their friends say its waste of time. it would not be waste of time when you get to know how our community can be a better place for us. students should know that the program is about our own lives, because if our community is bad well our live

In [None]:
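# Chain the two replacements: `a and b` would return only the second result,
# leaving the '\r\n\r\n' paragraph breaks untouched.
text = df['full_text'].apply(lambda x: x.replace('\r\n\r\n', ' ').replace('\n\n', ' '))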

In [None]:
text.shape

(3911,)
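
The cleaning step above replaces paragraph breaks with spaces. As a minimal sketch of a more general alternative, the snippet below collapses any run of whitespace (including `\r\n\r\n` and `\n\n`) into a single space with a regular expression. The function name `normalize_whitespace` and the sample string are illustrative assumptions, not part of the original notebook.

```python
import re

def normalize_whitespace(essay: str) -> str:
    """Collapse every run of whitespace (spaces, \\r, \\n, tabs) into one space."""
    return re.sub(r"\s+", " ", essay).strip()

# Hypothetical sample mixing Windows-style and Unix-style paragraph breaks.
sample = "Dear Principal,\r\n\r\nOur school should have a community center.\n\nThank you."
cleaned = normalize_whitespace(sample)
print(cleaned)
```

Unlike the two `replace` calls, this version also collapses single line breaks and trims leading and trailing whitespace. It could be applied to the essays with `df['full_text'].apply(normalize_whitespace)`.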

In [None]:
# Get the array of grammar scores.
labels = df.grammar.values

In [None]:
labels.shape

(3911,)

# Import Grammar Checker BERT Model

In [None]:
!pip install transformers

from transformers import BertForSequenceClassification

output_dir = "/content/drive/MyDrive/Colab Notebooks/Erdos Fall 2022/model_save/"

print(output_dir)

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.24.0-py3-none-any.whl (5.5 MB)
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.10.1-py3-none-any.whl (163 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.10.1 tokenizers-0.13.1 transformers-4.24.0
/content/drive/MyDrive/Colab Notebooks/Erdos Fall 2022/model_save/


In [None]:
from transformers import BertTokenizer
import torch
# Load the BERT tokenizer and the saved classification model from output_dir.
print('Loading BERT tokenizer...')
tokenizer = BertTokenizer.from_pretrained(output_dir)
model_loaded = BertForSequenceClassification.from_pretrained(output_dir)

Loading BERT tokenizer...


### Try on first essay

In [None]:
import nltk
nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package alpino to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/alpino.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping
[nltk_data]    |       taggers/averaged_perceptron_tagger_ru.zip.
[nltk_data]    | Downloading package basque_grammars to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping grammars/basque_grammars.zip.
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping corpora/biocreative_ppi.zip.
[nltk_data]    | Downloadin

True

In [None]:
from nltk import tokenize

In [None]:
# Split the first essay into sentences.
essay1_sentences = tokenize.sent_tokenize(text[0])

In [None]:
len(essay1_sentences)

18

In [None]:
encoded_dict = tokenizer.batch_encode_plus(
    essay1_sentences,              # Sentences to encode.
    add_special_tokens=True,       # Add '[CLS]' and '[SEP]'.
    max_length=64,                 # Pad and truncate every sentence to 64 tokens.
    padding='max_length',          # Replaces the deprecated pad_to_max_length=True.
    truncation=True,               # Truncate explicitly to avoid the tokenizer warning.
    return_attention_mask=True,    # Construct attention masks.
    return_tensors='pt',           # Return PyTorch tensors.
)

# Token IDs for every sentence in the essay (already a LongTensor).
input_id = encoded_dict['input_ids']

# Attention masks: 1 for real tokens, 0 for padding.
attention_mask = encoded_dict['attention_mask']


In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_loaded = model_loaded.to(device)
input_id = input_id.to(device)
attention_mask = attention_mask.to(device)
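
To make the padding and truncation performed by the tokenizer call more concrete, here is a rough pure-Python sketch of what happens to one sentence's token IDs. The special-token IDs (`0`, `101`, `102`) are the values used by `bert-base-uncased` and are an assumption about the saved tokenizer; the helper `pad_and_mask` and the sample IDs are hypothetical.

```python
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102  # assumed bert-base-uncased special-token IDs

def pad_and_mask(token_ids, max_length):
    """Add [CLS]/[SEP], truncate to max_length, then pad and build the mask."""
    # Reserve two positions for the special tokens before truncating.
    body = token_ids[: max_length - 2]
    ids = [CLS_ID] + body + [SEP_ID]
    mask = [1] * len(ids)              # 1 marks a real token
    padding = max_length - len(ids)
    return ids + [PAD_ID] * padding, mask + [0] * padding  # 0 marks padding

# Hypothetical word-piece IDs for a short and a long sentence.
short_ids, short_mask = pad_and_mask([7592, 2088], max_length=8)
long_ids, long_mask = pad_and_mask(list(range(1000, 1020)), max_length=8)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

The short sentence is padded on the right up to the maximum length, while the long one is cut so that `[CLS]` and `[SEP]` still fit. With `max_length = 64`, any sentence longer than 62 word pieces loses its tail before classification.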

In [None]:
with torch.no_grad():
  # Forward pass, calculate logit predictions
  outputs = model_loaded(input_id, token_type_ids=None, attention_mask=attention_mask)

outputs

SequenceClassifierOutput(loss=None, logits=tensor([[-1.4535,  2.3197],
        [ 1.4277, -2.5745],
        [-1.9283,  3.2056],
        [-1.9442,  3.1001],
        [-1.9300,  2.7581],
        [-0.7864,  0.9893],
        [-1.6461,  2.9197],
        [-1.4238,  2.1689],
        [-1.4116,  2.9467],
        [-1.8151,  2.9699],
        [ 0.1897, -0.5348],
        [ 1.2897, -2.6996],
        [ 0.3602, -0.8343],
        [ 0.8457, -0.9150],
        [ 1.1205, -2.2841],
        [-1.7858,  2.8251],
        [-1.1794,  1.6135],
        [-1.5450,  2.8306]], device='cuda:0'), hidden_states=None, attentions=None)
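
Each row of logits holds two unnormalized scores, one per class. As a small sketch (not part of the original notebook), a softmax turns a row into probabilities, and the index of the larger value is the predicted class; the notebook treats class 1 as "grammatically correct". The three rows below are copied from the output above; the helper `softmax` is illustrative.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of logits."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [value / total for value in exps]

# First three rows of logits copied from the model output.
logits = [[-1.4535, 2.3197], [1.4277, -2.5745], [-1.9283, 3.2056]]
probabilities = [softmax(row) for row in logits]
predicted = [row.index(max(row)) for row in probabilities]
print(predicted)  # [1, 0, 1]
```

Because softmax preserves the ordering of the scores, taking `argmax` directly on the logits gives the same classes; the probabilities are only useful if a confidence threshold is wanted.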

In [None]:
logits = outputs[0]
index = logits.argmax(dim=1)
# Class 1 means the sentence is predicted to be grammatically correct.
for label in index:
  if label == 1:
    print("Grammatically correct")
  else:
    print("Grammatically incorrect")

Grammatically correct
Grammatically incorrect
Grammatically correct
Grammatically correct
Grammatically correct
Grammatically correct
Grammatically correct
Grammatically correct
Grammatically correct
Grammatically correct
Grammatically incorrect
Grammatically incorrect
Grammatically incorrect
Grammatically incorrect
Grammatically incorrect
Grammatically correct
Grammatically correct
Grammatically correct


In [None]:
type(index)

torch.Tensor

In [None]:
print('The number of grammatically correct sentences is ', torch.sum(index).item(), ' out of ', len(essay1_sentences), ' sentences')

The number of grammatically correct sentences is  12  out of  18  sentences


In [None]:
print('Correct ratio is ', torch.sum(index).item()/len(essay1_sentences))

Correct ratio is  0.6666666666666666
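
The ratio is simply the number of sentences predicted as class 1 divided by the number of sentences. A minimal sketch of that computation as a reusable function follows; the name `correct_ratio` and the choice to return `0.0` for an essay with no sentences are assumptions made here, not something the notebook defines.

```python
def correct_ratio(predictions):
    """Fraction of sentences predicted as class 1; 0.0 for an empty essay."""
    if not predictions:
        return 0.0
    return sum(predictions) / len(predictions)

# Predicted classes for the 18 sentences of the first essay, read off the output above.
essay1_predictions = [1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1]
print(correct_ratio(essay1_predictions))  # 0.6666666666666666
```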


In [None]:
print('Grammar score is ', labels[0])

Grammar score is  4.0


## Compute the ratio of grammatically correct sentences for each essay in the training set

In [None]:
grammar_correct_ratio = []

In [None]:
# Move the model to the device once, before the loop.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_loaded = model_loaded.to(device)

for i in range(len(text)):
  if i%100 == 0:
    print('Running on essay ', i, '/',len(text))
  sentences = tokenize.sent_tokenize(text[i])
  encoded_dict = tokenizer.batch_encode_plus(
      sentences,                     # Sentences to encode.
      add_special_tokens=True,       # Add '[CLS]' and '[SEP]'.
      max_length=64,                 # Pad and truncate every sentence to 64 tokens.
      padding='max_length',          # Replaces the deprecated pad_to_max_length=True.
      truncation=True,
      return_attention_mask=True,    # Construct attention masks.
      return_tensors='pt',           # Return PyTorch tensors.
  )

  # Token IDs and attention masks (1 for real tokens, 0 for padding).
  input_id = encoded_dict['input_ids'].to(device)
  attention_mask = encoded_dict['attention_mask'].to(device)

  with torch.no_grad():
    # Forward pass, calculate logit predictions
    outputs = model_loaded(input_id, token_type_ids=None, attention_mask=attention_mask)

  logits = outputs[0]
  index = logits.argmax(dim=1)

  # Fraction of sentences predicted as class 1 (grammatically correct).
  grammar_correct_ratio.append(torch.sum(index).item()/len(sentences))

Running on essay  0 / 3911
Running on essay  100 / 3911
Running on essay  200 / 3911
Running on essay  300 / 3911
Running on essay  400 / 3911
Running on essay  500 / 3911
Running on essay  600 / 3911
Running on essay  700 / 3911
Running on essay  800 / 3911
Running on essay  900 / 3911
Running on essay  1000 / 3911
Running on essay  1100 / 3911
Running on essay  1200 / 3911
Running on essay  1300 / 3911
Running on essay  1400 / 3911
Running on essay  1500 / 3911
Running on essay  1600 / 3911
Running on essay  1700 / 3911
Running on essay  1800 / 3911
Running on essay  1900 / 3911
Running on essay  2000 / 3911
Running on essay  2100 / 3911
Running on essay  2200 / 3911
Running on essay  2300 / 3911
Running on essay  2400 / 3911
Running on essay  2500 / 3911
Running on essay  2600 / 3911
Running on essay  2700 / 3911
Running on essay  2800 / 3911
Running on essay  2900 / 3911
Running on essay  3000 / 3911
Running on essay  3100 / 3911
Running on essay  3200 / 3911
Running on essay  3300

In [None]:
# Check the list of ratios.
grammar_correct_ratio

[0.6666666666666666,
 0.35714285714285715,
 0.631578947368421,
 0.9166666666666666,
 0.0,
 0.7,
 0.7777777777777778,
 0.2727272727272727,
 0.391304347826087,
 0.1875,
 0.5454545454545454,
 0.8888888888888888,
 0.4,
 0.84,
 0.8536585365853658,
 0.0,
 0.1111111111111111,
 0.8846153846153846,
 0.5384615384615384,
 0.8,
 0.05263157894736842,
 0.45,
 0.8333333333333334,
 0.6875,
 0.9767441860465116,
 0.0,
 0.13333333333333333,
 0.7931034482758621,
 0.8,
 0.0,
 0.72,
 0.18181818181818182,
 0.6,
 0.8611111111111112,
 0.9512195121951219,
 0.6857142857142857,
 0.6071428571428571,
 0.7727272727272727,
 0.41379310344827586,
 0.0,
 0.8666666666666667,
 0.7692307692307693,
 0.1111111111111111,
 0.8695652173913043,
 0.8571428571428571,
 0.5714285714285714,
 0.6923076923076923,
 0.0,
 0.13636363636363635,
 0.2903225806451613,
 0.24528301886792453,
 0.5,
 0.8666666666666667,
 0.6363636363636364,
 0.7435897435897436,
 0.7777777777777778,
 0.7916666666666666,
 0.5555555555555556,
 0.6595744680851063,
 0

In [None]:
df_grammar = pd.DataFrame({'cleaned_full_text':text, 'grammar_score': labels, 'ratio_grammar_correct_sentences': grammar_correct_ratio })

In [None]:
df_grammar

Unnamed: 0,cleaned_full_text,grammar_score,ratio_grammar_correct_sentences
0,I think that students would benefit from learn...,4.0,0.666667
1,When a problem is a change you have to let it ...,2.0,0.357143
2,"Dear, Principal If u change the school policy ...",3.0,0.631579
3,The best time in life is when you become yours...,4.0,0.916667
4,Small act of kindness can impact in other peop...,2.5,0.000000
...,...,...,...
3906,I believe using cellphones in class for educat...,2.5,0.500000
3907,"Working alone, students do not have to argue w...",3.5,0.437500
3908,"""A problem is a chance for you to do your best...",3.5,0.375000
3909,Many people disagree with Albert Schweitzer's ...,4.5,1.000000


In [None]:
# Save data to csv in Google Drive
df_grammar.to_csv('/content/drive/MyDrive/Colab Notebooks/Erdos Fall 2022/grammar_train.csv')
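
One quick sanity check on this feature is to see how the ratio moves with the human grammar score. The sketch below computes a Pearson correlation in plain Python on the nine rows visible in the table above only, so the value it prints says nothing reliable about the full 3,911-essay set; the helper `pearson` is an illustrative assumption, not part of the original notebook.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# The nine rows shown in the df_grammar preview (rows 0-4 and 3906-3909).
scores = [4.0, 2.0, 3.0, 4.0, 2.5, 2.5, 3.5, 3.5, 4.5]
ratios = [0.666667, 0.357143, 0.631579, 0.916667, 0.0, 0.5, 0.4375, 0.375, 1.0]

r = pearson(scores, ratios)
print(round(r, 3))
```

On the full frame the same quantity should be available through pandas as `df_grammar['grammar_score'].corr(df_grammar['ratio_grammar_correct_sentences'])`.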

In [None]:
sentence_number = []

In [None]:
for i in range(len(text)):
  if i%100 == 0:
    print('Running on essay ', i+1, '/',len(text))
  sentence_number.append(len(tokenize.sent_tokenize(text[i])))

Running on essay  1 / 3911
Running on essay  101 / 3911
Running on essay  201 / 3911
Running on essay  301 / 3911
Running on essay  401 / 3911
Running on essay  501 / 3911
Running on essay  601 / 3911
Running on essay  701 / 3911
Running on essay  801 / 3911
Running on essay  901 / 3911
Running on essay  1001 / 3911
Running on essay  1101 / 3911
Running on essay  1201 / 3911
Running on essay  1301 / 3911
Running on essay  1401 / 3911
Running on essay  1501 / 3911
Running on essay  1601 / 3911
Running on essay  1701 / 3911
Running on essay  1801 / 3911
Running on essay  1901 / 3911
Running on essay  2001 / 3911
Running on essay  2101 / 3911
Running on essay  2201 / 3911
Running on essay  2301 / 3911
Running on essay  2401 / 3911
Running on essay  2501 / 3911
Running on essay  2601 / 3911
Running on essay  2701 / 3911
Running on essay  2801 / 3911
Running on essay  2901 / 3911
Running on essay  3001 / 3911
Running on essay  3101 / 3911
Running on essay  3201 / 3911
Running on essay  3301

In [None]:
len(sentence_number)

3911

In [None]:
df_train_sentence_number = pd.DataFrame({'sentence_number':sentence_number})

In [None]:
df_train_sentence_number.to_csv('/content/drive/MyDrive/Colab Notebooks/Erdos Fall 2022/grammar_train_sentence_number.csv')

## Combine the training CSV files

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
train_1 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Erdos Fall 2022/grammar_train.csv', index_col=0)

In [5]:
train_2 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Erdos Fall 2022/grammar_train_sentence_number.csv', index_col=0)

In [7]:
train_1.head()

Unnamed: 0,cleaned_full_text,grammar_score,ratio_grammar_correct_sentences
0,I think that students would benefit from learn...,4.0,0.666667
1,When a problem is a change you have to let it ...,2.0,0.357143
2,"Dear, Principal If u change the school policy ...",3.0,0.631579
3,The best time in life is when you become yours...,4.0,0.916667
4,Small act of kindness can impact in other peop...,2.5,0.0


In [8]:
train_2.head()

Unnamed: 0,sentence_number
0,18
1,14
2,19
3,36
4,3


In [9]:
# Copy so that adding a column below does not also modify train_1.
train_comb = train_1.copy()

In [10]:
train_comb['sentence_number'] = train_2['sentence_number']

In [11]:
train_comb

Unnamed: 0,cleaned_full_text,grammar_score,ratio_grammar_correct_sentences,sentence_number
0,I think that students would benefit from learn...,4.0,0.666667,18
1,When a problem is a change you have to let it ...,2.0,0.357143,14
2,"Dear, Principal If u change the school policy ...",3.0,0.631579,19
3,The best time in life is when you become yours...,4.0,0.916667,36
4,Small act of kindness can impact in other peop...,2.5,0.000000,3
...,...,...,...,...
3906,I believe using cellphones in class for educat...,2.5,0.500000,6
3907,"Working alone, students do not have to argue w...",3.5,0.437500,16
3908,"""A problem is a chance for you to do your best...",3.5,0.375000,8
3909,Many people disagree with Albert Schweitzer's ...,4.5,1.000000,21


In [12]:
train_comb.to_csv('/content/drive/MyDrive/Colab Notebooks/Erdos Fall 2022/grammar_train_comb.csv')