## Example of using the Paraphraser

In [1]:
import paraphrase_pipeline as ppipe
from transformers import pipeline
from transformers.pipelines import FillMaskPipeline
from transformers import AutoTokenizer, AutoModelForMaskedLM

In [2]:
tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModelForMaskedLM.from_pretrained("roberta-large")
model.resize_token_embeddings(len(tokenizer))
unmasker = FillMaskPipeline(model=model, tokenizer=tokenizer, device=0)
# unmasker = pipeline('fill-mask', model='roberta-large')
paraphraser = ppipe.ParaphrasePipeline(unmasker)

Some weights of RobertaForMaskedLM were not initialized from the model checkpoint at roberta-large and are newly initialized: ['lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [3]:
filename = r"./../data/wikipedia/og/666-ORIG-31.txt"
with open(filename) as file:
    originalText = file.read()
originalText

'The alkali metals are more similar to each other than the elements in any other group are to each other. For instance, when moving down the table, all known alkali metals show increasing atomic radius, decreasing electronegativity, increasing reactivity, and decreasing melting and boiling points as well as heats of fusion and vaporisation. In general, their densities increase when moving down the table, with the exception that potassium is less dense than sodium.'

In [6]:
spun_text, df = paraphraser.parapherase(originalText, mask=0.1, range_replace=(1, 4), mark_replace=True, return_df=True)

In [7]:
spun_text

'The alkali metals are more similar to each other than the elements in any[ metal] group are to each other. For[ instance][,] when moving down the table, all known alkali[ groups] show increasing atomic radius,[ increasing] electronegativity, increasing reactivity, and[ increasing] melting and boiling points as well as heats of fusion and[ oxid]isation. In general, their densities increase when moving down the[ scale], with the exception that[ magnesium] is less dense than sodium.'

In [8]:
df

index,token,token_str,score,state
85,35797,Ġpotassium,0.43621,ignored
85,16904,Ġlithium,0.195053,looked
85,18051,Ġaluminium,0.068147,looked
85,36165,Ġmagnesium,0.051567,chosen
65,31973,Ġion,0.411731,ignored
65,34774,Ġcrystall,0.235001,looked
65,29406,Ġvapor,0.215348,looked
65,44322,Ġoxid,0.025223,chosen
79,2103,Ġtable,0.964238,ignored
79,889,Ġlist,0.007279,looked
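
The `state` column appears to record how each fill-mask candidate was treated: the candidate ranked below the lower bound of `range_replace` is `ignored`, candidates inside the range are `looked` at, and one of them is `chosen`. The sketch below imitates that bookkeeping. It assumes that `range_replace` is a half-open rank range and that the pick is uniform at random, neither of which is confirmed by the output above; `label_candidates` is a hypothetical helper, not part of `paraphrase_pipeline`.

```python
import random

def label_candidates(candidates, range_replace=(1, 4), rng=None):
    """Label ranked fill-mask candidates as ignored, looked, or chosen.

    candidates: list of (token_str, score), sorted by descending score.
    range_replace: half-open rank range [low, high) eligible for replacement
                   (assumed semantics, not taken from the library).
    """
    rng = rng or random.Random()
    low, high = range_replace
    eligible = list(range(low, min(high, len(candidates))))
    # Assumption: one eligible rank is picked uniformly at random.
    chosen_rank = rng.choice(eligible) if eligible else None
    rows = []
    for rank, (token_str, score) in enumerate(candidates[:high]):
        if rank < low:
            state = "ignored"
        elif rank == chosen_rank:
            state = "chosen"
        else:
            state = "looked"
        rows.append((token_str, score, state))
    return rows

# Candidates copied from the first group of the table above.
candidates = [("Ġpotassium", 0.43621), ("Ġlithium", 0.195053),
              ("Ġaluminium", 0.068147), ("Ġmagnesium", 0.051567)]
rows = label_candidates(candidates, rng=random.Random(0))
for row in rows:
    print(row)
```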


In [4]:
filename = r"./../data/wikipedia/og/339-ORIG-2.txt"
with open(filename) as file:
    originalText = file.read()
originalText

'Ayn Rand (; born Alisa Zinovyevna Rosenbaum; \xa0– March 6, 1982) was a Russian-American writer and philosopher. She is known for her two best-selling novels, "The Fountainhead" and "Atlas Shrugged", and for developing a philosophical system she named Objectivism. Educated in Russia, she moved to the United States in 1926. She had a play produced on Broadway in 1935 and 1936. After two early novels that were initially unsuccessful, she achieved fame with her 1943 novel, "The Fountainhead". In 1957, Rand published her best-known work, the novel "Atlas Shrugged". Afterward, she turned to non-fiction to promote her philosophy, publishing her own periodicals and releasing several collections of essays until her death in 1982.'

In [4]:
spun_text, df = paraphraser.parapherase(originalText, mask=0.1, range_replace=(1, 4), mark_replace=True, return_df=True)

In [5]:
spun_text

'Ayn Rand (; born Alisa Zinovyevna Rosenbaum; Â\xa0â€“[September] 6, 1982) was a Russian[—]American[ novelist] and[ educator]. She is known for her two best-selling novels, "The Fountainhead" and "Atlas[ Sch]rugged", and for developing a philosophical[ school] she named Objectivism. Educated in Russia, she moved to the United States[ by] 1926. She had a play produced on Broadway in 1935 and 1936[,] After[ writing] early novels that were initially unsuccessful, she[ gained] fame with her 1943 novel,["]The Fountainhead". In 1957[,] Rand published her best-known work, the[ memoir] "Atlas Shrugged". Afterward, she turned to[ Non]-fiction to promote her philosophy[ by] publishing her own periodicals and releasing several collections of essays[ after] her death[ of] 1982.'

In [6]:
df

index,token,token_str,score,state
96,4,.,0.999610,ignored
96,479,Ġ.,0.000186,looked
96,6,",",0.000062,chosen
96,7586,..,0.000053,looked
31,12,-,0.964292,ignored
...,...,...,...,...
82,30,Ġby,0.000150,chosen
121,6,",",0.999708,ignored
121,2156,"Ġ,",0.000069,chosen
121,4,.,0.000048,looked


### Iterating over the Data

In [10]:
from os import listdir
from os.path import isfile, join
ogpath = "./../data/thesis/og"
sppath = "./../data/thesis/sp"
ogfiles = [f for f in listdir(ogpath) if isfile(join(ogpath, f))]

In [5]:
ogfiles[0]

'1-ORIG-10.txt'

In [11]:
with open(join(ogpath, ogfiles[0])) as file:
    originalText = file.read()
originalText

'Data were collected and analysed using excels spreadsheet and presented them into graphs and tables. Where data was exported from the sources, they were analytically explained in detail, outlining the trends and figures in the data, such as either in the form of plotted graphs or tables with corresponding graphs. Table and graphs were relevant in the analysis of the findings, since it made clarification simple and less complex, in instances where graphs were used to support the table, the essentiality was to concretise the explanations further pictorially.'
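
The cells above only list the files and read the first one. A loop over the whole directory could look like the following minimal sketch. `spin_directory` and its `transform` argument are hypothetical names, and writing each result under the same file name in the output directory is an assumption about the intended layout.

```python
from pathlib import Path
from typing import Callable

def spin_directory(og_dir, sp_dir, transform: Callable[[str], str], encoding="utf-8"):
    """Apply `transform` to every file in og_dir and write the results to sp_dir."""
    og_dir, sp_dir = Path(og_dir), Path(sp_dir)
    sp_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(og_dir.iterdir()):
        if not path.is_file():
            continue  # skip sub-directories, like the isfile() filter above
        text = path.read_text(encoding=encoding)
        out_path = sp_dir / path.name  # assumption: output keeps the input file name
        out_path.write_text(transform(text), encoding=encoding)
        written.append(out_path)
    return written
```

With the objects from this notebook, `transform` could be `lambda t: paraphraser.parapherase(t, mask=0.1, range_replace=(1, 4), mark_replace=True, return_df=True)[0]`, and `encoding` may need to be changed for files that are not UTF-8.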

## Check for problems in the Data or incompatibilities

In [3]:
import paraphrase_pipeline as ppipe
from transformers import pipeline
from transformers.pipelines import FillMaskPipeline
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModelForMaskedLM.from_pretrained("roberta-large")
model.resize_token_embeddings(len(tokenizer))
unmasker = FillMaskPipeline(model=model, tokenizer=tokenizer, use_fast=True, device=-1)
# unmasker = pipeline('fill-mask', model='roberta-large')
paraphraser = ppipe.ParaphrasePipeline(unmasker)
tokenizer = paraphraser.tokenizer
model = paraphraser.model

Special tokens have been added in the vocabulary, make sure the associated word embedding are fine-tuned or trained.
Some weights of RobertaForMaskedLM were not initialized from the model checkpoint at roberta-large and are newly initialized: ['lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [2]:
filename = r"./../data/thesis/og/1-ORIG-18.txt"
with open(filename, 'r', encoding='latin-1') as file:
    originalText = file.read()
originalText

"Economic development in the past was seen regarding the planned alteration of the structure of production and employment so that agricultureÕs share of both production and employment declines while that of manufacturing and services increases (Todaro, 2013)Todaro claims that economic growth is a vital condition to improve the quality of life. Development was therefore defined as a rapid and sustained rise in real output per head and attendant shifts in the technological and economic characteristics of society. This conceptualisation gave priority to increased commodity output instead of the human beings involved in the production. Increases in the output of such industries were recorded as growth in the economy and for that matter development for the country (Kane, 2008).\nEconomic development refers to a qualitative variation in what or how goods and services produced through shifts in resource use, workforce skills, technology, information, production methods, or financial arrangeme

In [4]:
tokens = tokenizer(originalText)
tokens

Token indices sequence length is longer than the specified maximum sequence length for this model (968 > 512). Running this sequence through the model will result in indexing errors


{'input_ids': [50265, 41376, 709, 11, 5, 375, 21, 450, 2624, 5, 1904, 39752, 9, 5, 3184, 9, 931, 8, 4042, 98, 14, 6300, 3849, 15722, 29, 458, 9, 258, 931, 8, 4042, 11081, 150, 14, 9, 3021, 8, 518, 3488, 36, 565, 1630, 5191, 6, 1014, 43, 565, 1630, 5191, 1449, 14, 776, 434, 16, 10, 4874, 1881, 7, 1477, 5, 1318, 9, 301, 4, 2717, 21, 3891, 6533, 25, 10, 6379, 8, 5232, 1430, 11, 588, 4195, 228, 471, 8, 22012, 10701, 11, 5, 9874, 8, 776, 12720, 9, 2313, 4, 152, 28647, 3258, 851, 3887, 7, 1130, 8497, 4195, 1386, 9, 5, 1050, 14766, 963, 11, 5, 931, 4, 43832, 11, 5, 4195, 9, 215, 4510, 58, 2673, 25, 434, 11, 5, 866, 8, 13, 14, 948, 709, 13, 5, 247, 36, 530, 1728, 6, 2266, 322, 50118, 41376, 709, 12859, 7, 10, 29981, 21875, 11, 99, 50, 141, 3057, 8, 518, 2622, 149, 10701, 11, 5799, 304, 6, 6862, 2417, 6, 806, 6, 335, 6, 931, 6448, 6, 50, 613, 7863, 4, 96, 97, 1617, 6, 24, 16, 5, 434, 9, 5, 247, 3849, 15722, 29, 4764, 14, 16, 7, 1477, 5, 157, 12, 9442, 9, 5, 247, 3849, 15722, 29, 24696, 36, 565,

In [19]:
len(tokens['input_ids'])

968

In [17]:
encode_input = tokenizer(originalText,  return_tensors='pt')

In [18]:
output_tensor = model(**encode_input)

IndexError: index out of range in self

In [39]:
testLongText = ' '.join(511*['you']) # 966 is too long, 510 is ok, 511 is too long

In [40]:
len(tokenizer(testLongText)['input_ids'])

513

In [41]:
testLong_input = tokenizer(testLongText,  return_tensors='pt')

In [52]:
print(testLong_input['input_ids'].size())
print(testLong_input['attention_mask'].size())

torch.Size([1, 513])
torch.Size([1, 513])


In [42]:
model(**testLong_input)

IndexError: index out of range in self

### Input size is bounded

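The model accepts at most 512 tokens per sequence, which is the limit quoted in the tokenizer warning above.
Two of these are the added start and end tokens, which leaves 510 tokens for the text itself. The value 1024 is the embedding dimension of `roberta-large`, not a limit on the sequence length.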
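
Longer texts therefore have to be cut into pieces before they reach the model. One simple option, sketched here on plain lists of token ids, is to split the content into consecutive windows of at most 510 ids and wrap each one in start and end tokens. `split_into_windows` is a hypothetical helper, and the ids `0` and `2` are placeholders; in the notebook they would come from `tokenizer.bos_token_id` and `tokenizer.eos_token_id`.

```python
def split_into_windows(content_ids, bos_id, eos_id, max_len=512):
    """Split content token ids into windows that fit the model once bos/eos are added."""
    body = max_len - 2  # room left after the start and end tokens
    windows = []
    for start in range(0, len(content_ids), body):
        windows.append([bos_id] + content_ids[start:start + body] + [eos_id])
    return windows

# 966 dummy ids: the 968-token text above minus its two special tokens.
content = list(range(1000, 1966))
windows = split_into_windows(content, bos_id=0, eos_id=2)
print([len(w) for w in windows])  # [512, 458]
```

Non-overlapping windows lose context at the cut, so a masked token near a boundary sees less of its neighbourhood than one in the middle.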

In [1]:
import paraphrase_pipeline as ppipe
from transformers import pipeline
from transformers.pipelines import FillMaskPipeline
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModelForMaskedLM.from_pretrained("roberta-large")
model.resize_token_embeddings(len(tokenizer))
unmasker = FillMaskPipeline(model=model, tokenizer=tokenizer, use_fast=True, device=-1)
# unmasker = pipeline('fill-mask', model='roberta-large')
paraphraser = ppipe.ParaphrasePipeline(unmasker)
tokenizer = paraphraser.tokenizer
model = paraphraser.model

Special tokens have been added in the vocabulary, make sure the associated word embedding are fine-tuned or trained.
Some weights of RobertaForMaskedLM were not initialized from the model checkpoint at roberta-large and are newly initialized: ['lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [2]:
filename = r"./../data/thesis/og/1-ORIG-18.txt"
with open(filename, 'r', encoding='latin-1') as file:
    originalText = file.read()

In [3]:
spun_text, df = paraphraser.parapherase(originalText, mask=0.1, range_replace=(1, 4), mark_replace=True, return_df=True)

Token indices sequence length is longer than the specified maximum sequence length for this model (968 > 512). Running this sequence through the model will result in indexing errors
shift: 456
masked_index_in tensor([[759]]), masked_index_out: tensor([[303]])
shift: 159
masked_index_in tensor([[414]]), masked_index_out: tensor([[255]])
shift: 32
masked_index_in tensor([[287]]), masked_index_out: tensor([[255]])
shift: 456
masked_index_in tensor([[894]]), masked_index_out: tensor([[438]])
shift: 456
masked_index_in tensor([[840]]), masked_index_out: tensor([[384]])
shift: 102
masked_index_in tensor([[357]]), masked_index_out: tensor([[255]])
shift: 456
masked_index_in tensor([[807]]), masked_index_out: tensor([[351]])
shift: 456
masked_index_in tensor([[816]]), masked_index_out: tensor([[360]])
shift: 0
masked_index_in tensor([[37]]), masked_index_out: tensor([[37]])
shift: 0
masked_index_in tensor([[3]]), masked_index_out: tensor([[3]])
shift: 456
masked_index_in tensor([[742]]), maske
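
The logged values are consistent with a 512-token window that is moved so the masked token sits at position 255, with the shift clamped to the range from 0 to 456 (968 - 512). The function below reproduces the logged shifts under that reading. It is a reconstruction from the log output, not the code of `paraphrase_pipeline`, and `window_shift` is a hypothetical name.

```python
def window_shift(masked_index, seq_len, window=512):
    """Offset of a `window`-token slice that keeps the masked token near its centre."""
    centre = window // 2 - 1              # 255 for a 512-token window
    max_shift = max(seq_len - window, 0)  # 456 for the 968-token text
    return min(max(masked_index - centre, 0), max_shift)

for masked_in in (759, 414, 287, 37):
    shift = window_shift(masked_in, seq_len=968)
    print(f"shift: {shift}, masked_index_in: {masked_in}, masked_index_out: {masked_in - shift}")
```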

In [6]:
spun_text

"Economic development[ during] the past was seen regarding the planned alteration of[ a] structure of production and[ production] so that agricultureÕs share[ in] both production and employment declines[ while] that of manufacturing and[ consumption] increases (Todaro, 2013)Todaro claims that economic growth is a vital condition to[ enhancing] the quality of life. Development was[ previously] defined as[ any] rapid and sustained rise in real output per head and attendant shifts in[ underlying] technological and economic characteristics of society. This conceptualisation gave[ prominence] to increased commodity output instead of the human beings involved in the production. Increases in the output of such industries were recorded as growth in the economy[ not] for that matter development for the country (Kane, 2008).\nEconomic development refers to a qualitative variation in what or how goods and services produced through[ shifts] in resource use, workforce skills,[ knowledge], informati

### Getting the Model embedding size

In [74]:
model.get_input_embeddings()

Embedding(50269, 1024)

In [70]:
model.get_output_embeddings().in_features

1024