In [26]:
pip install transformers==4.3

Collecting transformers==4.3
  Using cached transformers-4.3.0-py3-none-any.whl (1.8 MB)
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.12.3
    Uninstalling transformers-4.12.3:
      Successfully uninstalled transformers-4.12.3
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
sentence-transformers 2.1.0 requires transformers<5.0.0,>=4.6.0, but you have transformers 4.3.0 which is incompatible.[0m
Successfully installed transformers-4.3.0


### Using pre-trained transformers (2pts)
_for fun and profit_

There are many toolkits that let you access pre-trained transformer models, but the most powerful and convenient by far is [`huggingface/transformers`](https://github.com/huggingface/transformers). In this week's practice, you'll learn how to download, apply and modify pre-trained transformers for a range of tasks. Buckle up, we're going in!


__Pipelines:__ if all you want is to apply a pre-trained model, you can do that in one line of code using pipeline. Huggingface/transformers has a selection of pre-configured pipelines for masked language modelling, sentiment classification, question aswering, etc. ([see full list here](https://huggingface.co/transformers/main_classes/pipelines.html))

A typical pipeline includes:
* pre-processing, e.g. tokenization, subword segmentation
* a backbone model, e.g. bert finetuned for classification
* output post-processing

Let's see it in action:

In [None]:
import transformers
from transformers import pipeline
classifier = pipeline('sentiment-analysis', model="distilbert-base-uncased-finetuned-sst-2-english")

print(classifier("BERT is amazing!"))

Downloading:   0%|          | 0.00/629 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/268M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

[{'label': 'POSITIVE', 'score': 0.9998860955238342}]


In [None]:
data = {
    'arryn': 'As High as Honor.',
    'baratheon': 'Ours is the fury.',
    'stark': 'Winter is coming.',
    'tyrell': 'Growing strong.'
}
for t, d in data.items():
  print(classifier(d)[0]['label'])
#classifier(data[])['label']

POSITIVE
NEGATIVE
POSITIVE
POSITIVE


In [None]:
import base64
data = {
    'arryn': 'As High as Honor.',
    'baratheon': 'Ours is the fury.',
    'stark': 'Winter is coming.',
    'tyrell': 'Growing strong.'
}

# YOUR CODE: predict sentiment for each noble house and create outputs dict

outputs = { house : True if classifier(text)[0]['label'] == "POSITIVE" else False for house , text in data.items() }
# dict (house name) : True if positive, False if negative>
print(outputs)
assert sum(outputs.values()) == 3 and outputs[base64.decodestring(b'YmFyYXRoZW9u\n').decode()] == False
print("Well done!")

{'arryn': True, 'baratheon': False, 'stark': True, 'tyrell': True}
Well done!


  


You can also access vanilla Masked Language Model that was trained to predict masked words. Here's how:

In [None]:
mlm_model = pipeline('fill-mask', model="bert-base-uncased")
MASK = mlm_model.tokenizer.mask_token

for hypo in mlm_model(f"Donald {MASK} is the president of the united states."):
  print(f"P={hypo['score']:.5f}", hypo['sequence'])

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


P=0.99719 [CLS] donald trump is the president of the united states. [SEP]
P=0.00024 [CLS] donald duck is the president of the united states. [SEP]
P=0.00022 [CLS] donald ross is the president of the united states. [SEP]
P=0.00020 [CLS] donald johnson is the president of the united states. [SEP]
P=0.00018 [CLS] donald wilson is the president of the united states. [SEP]


In [None]:
mlm_model = pipeline('fill-mask', model="bert-base-uncased")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
# Your turn: use bert to recall what year was the Soviet Union founded in
mlm_model(f'Soviet Union founded in year {MASK} .')

[{'score': 0.019652415066957474,
  'sequence': '[CLS] soviet union founded in year 1945. [SEP]',
  'token': 3386,
  'token_str': '1945'},
 {'score': 0.018976988270878792,
  'sequence': '[CLS] soviet union founded in year 1947. [SEP]',
  'token': 4006,
  'token_str': '1947'},
 {'score': 0.01777983456850052,
  'sequence': '[CLS] soviet union founded in year 1917. [SEP]',
  'token': 4585,
  'token_str': '1917'},
 {'score': 0.0127203818410635,
  'sequence': '[CLS] soviet union founded in year 1949. [SEP]',
  'token': 4085,
  'token_str': '1949'},
 {'score': 0.012019027955830097,
  'sequence': '[CLS] soviet union founded in year 1948. [SEP]',
  'token': 3882,
  'token_str': '1948'}]

```

```

```

```


Huggingface offers hundreds of pre-trained models that specialize on different tasks. You can quickly find the model you need using [this list](https://huggingface.co/models).


In [None]:
text = """Almost two-thirds of the 1.5 million people who viewed this liveblog had Googled to discover
 the latest on the Rosetta mission. They were treated to this detailed account by the Guardian’s science editor,
 Ian Sample, and astronomy writer Stuart Clark of the moment scientists landed a robotic spacecraft on a comet 
 for the first time in history, and the delirious reaction it provoked at their headquarters in Germany.
  “We are there. We are sitting on the surface. Philae is talking to us,” said one scientist.
"""

# Task: create a pipeline for named entity recognition, use task name 'ner' and search for the right model in the list
ner_model = pipeline('ner')

named_entities = ner_model(text)

In [None]:
print('OUTPUT:', named_entities)
word_to_entity = {item['word']: item['entity'] for item in named_entities}
assert 'org' in word_to_entity.get('Guardian').lower() and 'per' in word_to_entity.get('Stuart').lower()
print("All tests passed")

OUTPUT: [{'word': 'Google', 'score': 0.8803119659423828, 'entity': 'I-MISC', 'index': 19}, {'word': 'Rose', 'score': 0.9005069136619568, 'entity': 'I-MISC', 'index': 27}, {'word': '##tta', 'score': 0.9509623050689697, 'entity': 'I-MISC', 'index': 28}, {'word': 'Guardian', 'score': 0.9992534518241882, 'entity': 'I-ORG', 'index': 40}, {'word': 'Ian', 'score': 0.9992009401321411, 'entity': 'I-PER', 'index': 46}, {'word': 'Sam', 'score': 0.999500036239624, 'entity': 'I-PER', 'index': 47}, {'word': '##ple', 'score': 0.9964978694915771, 'entity': 'I-PER', 'index': 48}, {'word': 'Stuart', 'score': 0.9991856217384338, 'entity': 'I-PER', 'index': 53}, {'word': 'Clark', 'score': 0.99964839220047, 'entity': 'I-PER', 'index': 54}, {'word': 'Germany', 'score': 0.9998210668563843, 'entity': 'I-LOC', 'index': 85}, {'word': 'Phil', 'score': 0.6295722126960754, 'entity': 'I-PER', 'index': 99}, {'word': '##ae', 'score': 0.8340393304824829, 'entity': 'I-PER', 'index': 100}]
All tests passed


### The building blocks of a pipeline

Huggingface also allows you to access its pipelines on a lower level. There are two main abstractions for you:
* `Tokenizer` - converts from strings to token ids and back
* `Model` - a pytorch `nn.Module` with pre-trained weights

You can use such models as part of your regular pytorch code: insert is as a layer in your model, apply it to a batch of data, backpropagate, optimize, etc.

In [None]:
import torch
from transformers import AutoTokenizer, AutoModel, pipeline

model_name = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)


In [None]:
lines = [
    "Luke, I am your father.",
    "Life is what happens when you're busy making other plans.",
    ]

# tokenize a batch of inputs. "pt" means [p]y[t]orch tensors
tokens_info = tokenizer(lines, padding=True, truncation=True, return_tensors="pt")

for key in tokens_info:
    print(key, tokens_info[key])

print("Detokenized:")
for i in range(2):
    print(tokenizer.decode(tokens_info['input_ids'][i]))

input_ids tensor([[ 101, 5355, 1010, 1045, 2572, 2115, 2269, 1012,  102,    0,    0,    0,
            0,    0,    0],
        [ 101, 2166, 2003, 2054, 6433, 2043, 2017, 1005, 2128, 5697, 2437, 2060,
         3488, 1012,  102]])
token_type_ids tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
attention_mask tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
Detokenized:
[CLS] luke, i am your father. [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
[CLS] life is what happens when you're busy making other plans. [SEP]


In [None]:
# You can now apply the model to get embeddings
with torch.no_grad():
    out = model(**tokens_info)

print(out)

(tensor([[[-0.3502,  0.2246, -0.2345,  ..., -0.2232,  0.1730,  0.6747],
         [-0.6097,  0.6892, -0.5512,  ..., -0.4814,  0.5322,  1.3833],
         [ 0.1842,  0.4881,  0.2193,  ..., -0.2699,  0.2246,  0.7985],
         ...,
         [-0.4413,  0.2748, -0.0391,  ..., -0.0604, -0.4358,  0.1384],
         [-0.5414,  0.4633,  0.0678,  ..., -0.1871, -0.5046,  0.2752],
         [-0.3940,  0.6180,  0.2092,  ..., -0.2345, -0.4177,  0.3341]],

        [[ 0.1622, -0.1154, -0.3894,  ..., -0.4180,  0.0138,  0.7644],
         [ 0.6471,  0.3774, -0.4082,  ...,  0.0050,  0.5559,  0.4385],
         [ 0.3351, -0.3158, -0.1178,  ...,  0.1348, -0.3143,  1.4409],
         ...,
         [ 1.2932, -0.1743, -0.5613,  ..., -0.2718, -0.1367,  0.4217],
         [ 1.0305,  0.1708, -0.2985,  ...,  0.2097, -0.4627, -0.4277],
         [ 1.0854,  0.1760, -0.0377,  ...,  0.3152, -0.5979, -0.3465]]]), tensor([[-0.8854, -0.4722, -0.9392,  ..., -0.8081, -0.6955,  0.8748],
        [-0.9297, -0.5161, -0.9334,  ..., -0

### Fine-tuning for salary prediction (5 pts)

Now let's put all this monstrosity to good use!

Remember week5 when you've trained a convolutional neural network for salary prediction? Now let's see how transformers fare at this task.

__The goal__ is to take one or more pre-trained models and fine-tune it for salary prediction. A good baseline solution would be to get RoBerta or T5 from [huggingface model list](https://huggingface.co/models) and fine-tune it to solve the task. After choosing the model, please take care to use the matching Tokenizer for preprocessing, as different models have different preprocessing requirements.


There are no prompts this time: you will have to write everything from scratch. Although, feel free to reuse any code from the original salary prediction notebook :)

#I really tried a lot to solve that error but noway, first error was memory crashed but now this error "Target 27500 is out of bounds"

In [None]:
#!wget https://www.dropbox.com/s/r9d1f3ve471osob/Train_rev1.zip?dl=1 -O data.zip
#!unzip -e data.zip

In [82]:
import transformers
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [83]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [84]:
data = pd.read_csv("/content/drive/MyDrive/Data/Train_rev1.csv", index_col=None)
data['Log1pSalary'] = np.log1p(data['SalaryNormalized']).astype('float32')

text_columns = ["Title", "FullDescription"]
categorical_columns = ["Category", "Company", "LocationNormalized", "ContractType", "ContractTime"]
target_column = "Log1pSalary"
data[categorical_columns] = data[categorical_columns].fillna('NaN') # cast nan to string

data.sample(3)

Unnamed: 0,Id,Title,FullDescription,LocationRaw,LocationNormalized,ContractType,ContractTime,Company,Category,SalaryRaw,SalaryNormalized,SourceName,Log1pSalary
225932,72440401,Building Surveyor (Projects),"Our client, a prestigious Central London Estat...",London South East,South East London,,permanent,Deverell Smith Recruitment Ltd,Trade & Construction Jobs,"55000 - 60000 per annum + Pension, Healthcare",57500,totaljobs.com,10.959558
16580,66896140,Assistant Scientist,Assistant Scientist/Analytical Chemistry Aberd...,"Aberdeen, Central Scotland",UK,,contract,Stafffinders,Healthcare & Nursing Jobs,8.50/hour,16320,cv-library.co.uk,9.700208
78746,69016046,Support Worker Health Care,Are you a support worker looking for a new cha...,"Leeds, West Yorkshire, West Yorkshire",Leeds,,permanent,BS Social Care,Admin Jobs,13000 - 15500/annum,14250,cv-library.co.uk,9.564583


In [85]:
X = data.drop(['Log1pSalary'], axis=1)

In [86]:
X = data.drop(['SalaryNormalized'], axis=1)
X

Unnamed: 0,Id,Title,FullDescription,LocationRaw,LocationNormalized,ContractType,ContractTime,Company,Category,SalaryRaw,SourceName,Log1pSalary
0,12612628,Engineering Systems Analyst,Engineering Systems Analyst Dorking Surrey Sal...,"Dorking, Surrey, Surrey",Dorking,,permanent,Gregory Martin International,Engineering Jobs,20000 - 30000/annum 20-30K,cv-library.co.uk,10.126671
1,12612830,Stress Engineer Glasgow,Stress Engineer Glasgow Salary **** to **** We...,"Glasgow, Scotland, Scotland",Glasgow,,permanent,Gregory Martin International,Engineering Jobs,25000 - 35000/annum 25-35K,cv-library.co.uk,10.308986
2,12612844,Modelling and simulation analyst,Mathematical Modeller / Simulation Analyst / O...,"Hampshire, South East, South East",Hampshire,,permanent,Gregory Martin International,Engineering Jobs,20000 - 40000/annum 20-40K,cv-library.co.uk,10.308986
3,12613049,Engineering Systems Analyst / Mathematical Mod...,Engineering Systems Analyst / Mathematical Mod...,"Surrey, South East, South East",Surrey,,permanent,Gregory Martin International,Engineering Jobs,25000 - 30000/annum 25K-30K negotiable,cv-library.co.uk,10.221977
4,12613647,"Pioneer, Miser Engineering Systems Analyst","Pioneer, Miser Engineering Systems Analyst Do...","Surrey, South East, South East",Surrey,,permanent,Gregory Martin International,Engineering Jobs,20000 - 30000/annum 20-30K,cv-library.co.uk,10.126671
...,...,...,...,...,...,...,...,...,...,...,...,...
244763,72705211,TEACHER OF SCIENCE,Position: Qualified Teacher Subject/Specialism...,Swindon,Swindon,,contract,,Teaching Jobs,450 - 500 per week,hays.co.uk,10.034559
244764,72705212,TEACHER OF BUSINESS STUDIES AND ICT,Position: Qualified Teacher or NQT Subject/Spe...,Swindon,Swindon,,contract,,Teaching Jobs,450 - 500 per week,hays.co.uk,10.034559
244765,72705213,ENGLISH TEACHER,Position: Qualified Teacher Subject/Specialism...,Swindon,Swindon,,contract,,Teaching Jobs,450 - 500 per week,hays.co.uk,10.034559
244766,72705216,SUPPLY TEACHERS,Position: Qualified Teacher Subject/Specialism...,Wiltshire,Wiltshire,,contract,,Teaching Jobs,450 to 500 per week,hays.co.uk,10.034559


In [87]:
X['combined']=X['Title'].astype(str)+' '+X['FullDescription'].astype(str)+' '+X['Category'].astype(str)

In [88]:
y =data['SalaryNormalized']
y.astype('object').dtypes

dtype('O')

In [89]:
y

0         25000
1         30000
2         30000
3         27500
4         25000
          ...  
244763    22800
244764    22800
244765    22800
244766    22800
244767    42500
Name: SalaryNormalized, Length: 244768, dtype: int64

In [91]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y ,  test_size=0.2, random_state=0 )

In [92]:
X_train_datset = list(X_train["combined"])
X_test_datset = list(X_test["combined"])
y_train_datset = list(y_train)
y_test_datset = list(y_test)

In [93]:
X_train_datset[0]

'Sales Administrator Sales Administrator A leading Surrey based group is offering an excellent opportunity for an experienced Sales Administrator to join their team. The candidate must have experience of Kerridge/ADP The company would also consider an inexperienced person on a lower salary package Phone Peter or email Motorvation cover all the South East of England. We have a variety of jobs available from Dealer Principal, to Sales Executives, Parts, Service and Technicians positions Other/General Jobs'

In [94]:
len(X_test_datset)

48954

In [95]:
len(y_test)

48954

In [96]:
len(y_train)

195814

In [97]:
X_train_datset[0] 

'Sales Administrator Sales Administrator A leading Surrey based group is offering an excellent opportunity for an experienced Sales Administrator to join their team. The candidate must have experience of Kerridge/ADP The company would also consider an inexperienced person on a lower salary package Phone Peter or email Motorvation cover all the South East of England. We have a variety of jobs available from Dealer Principal, to Sales Executives, Parts, Service and Technicians positions Other/General Jobs'

In [98]:
X_test_datset[0]

'Financial Controller Our client, a leading highend Property Developer based in Surrey, requires a Financial Controller to join their established team. The successful candidate will be ACA qualified from a Big 4 Practice, before spending time in a commercial role within industry. The role will offer you the rare opportunity to get heavily involved in prestigious, highprofile Real Estate developments in key locations in and around Surrey. You will be managing a team of 8 in Finance, comprising a mixture of Junior Property Accountants, Credit Controllers and Administration staff. Key Responsibilities include: • Costing and cashflow analysis • Budgeting and Forecasting • Management Reporting • Liaising with senior internal stakeholders on a daily basis • Monthly/Quarterly and Statutory Reporting You will be: • Graduate calibre • Recently qualified accountant (ACCA/CIMA/ACA etc) • Real Estate/Property background highly desirable • Articulate and engaging • Wellversed in dealing confidently

In [99]:

from transformers import BertTokenizer, BertForSequenceClassification
from transformers import RobertaTokenizer, RobertaModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased') #bert-base-uncased
model = BertForSequenceClassification.from_pretrained('bert-base-uncased') #bert-base-uncased

Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/570 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

In [100]:
import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'
mode = model.to(device)



In [101]:
device

'cuda'

In [102]:
X_train_tokenized = tokenizer(X_train_datset, padding=True, truncation=True, return_tensors="pt").to(device)

In [103]:
X_test_tokenized= tokenizer(X_test_datset, padding=True, truncation=True, return_tensors="pt").to(device)

In [104]:
for key in X_train_tokenized:
    print(key, X_train_tokenized[key])

print("Detokenized:")
for i in range(10000,10010):
    print(tokenizer.decode(X_train_tokenized['input_ids'][i]))


input_ids tensor([[  101,  4341,  8911,  ...,     0,     0,     0],
        [  101,  3353,  3208,  ...,     0,     0,     0],
        [  101,  4007,  3992,  ...,     0,     0,     0],
        ...,
        [  101,  2449,  2458,  ...,     0,     0,     0],
        [  101, 24555,  1013,  ...,     0,     0,     0],
        [  101,  1044,  2290,  ...,     0,     0,     0]], device='cuda:0')
token_type_ids tensor([[0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        ...,
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0]], device='cuda:0')
attention_mask tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]], device='cuda:0')
Detokenized:
[CLS] internal sales engineer ( power transmissions ) internal sales engineer ( power transmission

In [105]:

from transformers import BertTokenizer, BertForSequenceClassification
class Dataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels=None):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        if self.labels:
            item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.encodings["input_ids"])

In [None]:
X_train_tokenized

{'input_ids': tensor([[  101,  4341,  8911,  ...,     0,     0,     0],
        [  101,  3353,  3208,  ...,     0,     0,     0],
        [  101,  4007,  3992,  ...,     0,     0,     0],
        ...,
        [  101,  2449,  2458,  ...,     0,     0,     0],
        [  101, 24555,  1013,  ...,     0,     0,     0],
        [  101,  1044,  2290,  ...,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        ...,
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])}

In [106]:
train_dataset = Dataset(X_train_tokenized, y_train_datset)
test_dataset = Dataset(X_test_tokenized, y_test_datset)

In [None]:
for x in train_dataset :
  print(x)
  break

{'input_ids': tensor([  101,  4341,  8911,  4341,  8911,  1037,  2877,  9948,  2241,  2177,
         2003,  5378,  2019,  6581,  4495,  2005,  2019,  5281,  4341,  8911,
         2000,  3693,  2037,  2136,  1012,  1996,  4018,  2442,  2031,  3325,
         1997, 14884, 13623,  1013,  4748,  2361,  1996,  2194,  2052,  2036,
         5136,  2019, 26252,  2711,  2006,  1037,  2896, 10300,  7427,  3042,
         2848,  2030, 10373,  5013, 21596,  3104,  2035,  1996,  2148,  2264,
         1997,  2563,  1012,  2057,  2031,  1037,  3528,  1997,  5841,  2800,
         2013, 11033,  4054,  1010,  2000,  4341, 12706,  1010,  3033,  1010,
         2326,  1998, 20202,  4460,  2060,  1013,  2236,  5841,   102,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0, 

  if __name__ == '__main__':


In [107]:
from transformers import TrainingArguments, Trainer

args = TrainingArguments(
    output_dir="output",
    per_device_train_batch_size=2,
    per_device_eval_batch_size=4, 
    num_train_epochs=1,
    weight_decay = 0.01,
 
)


In [113]:

with torch.no_grad():
  model = BertForSequenceClassification.from_pretrained('bert-base-uncased')

trainer = Trainer(
    model=model.to(device),
    args=args,
    train_dataset = train_dataset,
    eval_dataset = test_dataset, 

)


trainer.train()

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

RuntimeError: ignored

In [None]:
trainer.evaluate(test_dataset)

In [None]:
trainer.predict(test_dataset)

In [None]:
output=trainer.predict(test_dataset)[1]

In [None]:
from sklearn.metrics import confusion_matrix

cm=confusion_matrix(y_test,output)
cm

### The search for similar questions (3pts)

* Implement a function that takes a text string and finds top-k most similar questions from `quora.txt`
* Demonstrate your function using at least 5 examples

There are no prompts this time: you will have to write everything from scratch.


In [27]:
# download the data:
!wget https://www.dropbox.com/s/obaitrix9jyu84r/quora.txt?dl=1 -O ./quora.txt
# alternative download link: https://yadi.sk/i/BPQrUu1NaTduEw

--2021-11-13 19:53:58--  https://www.dropbox.com/s/obaitrix9jyu84r/quora.txt?dl=1
Resolving www.dropbox.com (www.dropbox.com)... 162.125.80.18, 2620:100:6035:18::a27d:5512
Connecting to www.dropbox.com (www.dropbox.com)|162.125.80.18|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /s/dl/obaitrix9jyu84r/quora.txt [following]
--2021-11-13 19:53:58--  https://www.dropbox.com/s/dl/obaitrix9jyu84r/quora.txt
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc16140fdce04dbcbac4263b0c61.dl.dropboxusercontent.com/cd/0/get/BZ5i-gdJssytcjHvAJ6GH_D7xoPdINUC8hhNSgorFEvJ8uQqbJ1ZDolCGbJEElNjTu9rZyxcO0RycFdNO6ci5fsoMYB3vYcryxd3ykz5UoduxpUUml5uucb1MiS1iHWSrRdVd5ltqtARg2mprUR6h82K/file?dl=1# [following]
--2021-11-13 19:53:59--  https://uc16140fdce04dbcbac4263b0c61.dl.dropboxusercontent.com/cd/0/get/BZ5i-gdJssytcjHvAJ6GH_D7xoPdINUC8hhNSgorFEvJ8uQqbJ1ZDolCGbJEElNjTu9rZyxcO0RycFdNO6ci5fsoMYB3vY

In [28]:
import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'


In [29]:
device

'cuda'

In [30]:
sentence_list = []
with open("quora.txt", "r") as f:
  for i in f:
    sentence_list.append(i.strip("\n"))

In [31]:
sentence_list[0]

"Can I get back with my ex even though she is pregnant with another guy's baby?"

In [32]:
from transformers import AutoTokenizer, AutoModel

In [33]:
model = "sentence-transformers/bert-base-nli-mean-tokens"

In [34]:
tokenizer = AutoTokenizer.from_pretrained(model)
model = AutoModel.from_pretrained(model)

In [35]:
tokens = {'input_ids': [], 'attention_mask':[]}

In [36]:
for sentence in sentence_list:
  new_tokens = tokenizer.encode_plus(sentence, max_length=128, truncation=True, padding='max_length',
                                     return_tensors='pt')
  tokens['input_ids'].append(new_tokens['input_ids'][0])
  tokens["attention_mask"].append(new_tokens['attention_mask'][0])

In [37]:
tokens['input_ids'] = torch.stack(tokens['input_ids'])
tokens['attention_mask'] = torch.stack(tokens['attention_mask'])

In [38]:
train={}
train = {'input_ids': tokens['input_ids'][0:400],'attention_mask': tokens['attention_mask'][0:400]}

In [14]:
device

'cuda'

In [39]:
outputs = []
for i in range(len(tokens))
 with torch.no_grad():
  outputs = model(**train)



In [40]:
embedding = outputs.last_hidden_state

In [41]:
attention = tokens['attention_mask'][0:400]

In [42]:
mask = attention.unsqueeze(-1).expand(embedding.shape).float()

In [43]:
mask_embedding = embedding*mask

In [44]:
summ = torch.sum(mask_embedding,1)

In [45]:
counts = torch.clamp(mask.sum(1),min = 1e-9)

In [46]:
mean_pooled = summ / counts

In [47]:
from sklearn.metrics.pairwise import cosine_similarity

In [48]:
mean_pooled = mean_pooled.detach().numpy()

In [119]:
for i in range(0,6):
 max = cosine_similarity([mean_pooled[i]], mean_pooled[i+1:])
 top_k_result = torch.topk(t, k=5)
 print("For Sentence :", sentence_list[i])
 print("The Most similar questions are :" )
 for score, j in zip(top_k_result[0][0], top_k_result[1][0]):
  print(sentence_list[j], " Score : {:.4f}".format(score))
 print()
 print()
 print("*****************************************************************")


For Sentence : Can I get back with my ex even though she is pregnant with another guy's baby?
The Most similar questions are :
Can eating boiled egg or omelette cause bird flu?  Score : 0.6631
What can I do to improve my immune system?  Score : 0.6118
How do I research for MUN?  Score : 0.6094
What does entertainment mean for you?  Score : 0.5976
What happens to my stock options when I quit?  Score : 0.5805


*****************************************************************
For Sentence : What are some ways to overcome a fast food addiction?
The Most similar questions are :
Can eating boiled egg or omelette cause bird flu?  Score : 0.6631
What can I do to improve my immune system?  Score : 0.6118
How do I research for MUN?  Score : 0.6094
What does entertainment mean for you?  Score : 0.5976
What happens to my stock options when I quit?  Score : 0.5805


*****************************************************************
For Sentence : Who were the great Chinese soldiers and leaders who 

I used part of data 400 questions to avoid memory crashed

```















```

__Bonus demo:__ transformer language models. 

`/* No points awarded for this task, but its really cool, we promise :) */`

In [None]:
import torch
import numpy as np
from transformers import GPT2Tokenizer, GPT2LMHeadModel
device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = GPT2Tokenizer.from_pretrained('gpt2', add_prefix_space=True)
model = GPT2LMHeadModel.from_pretrained('gpt2').train(False).to(device)

text = "The Fermi paradox "
tokens = tokenizer.encode(text)
num_steps = 1024
line_length, max_length = 0, 70

print(end=tokenizer.decode(tokens))

for i in range(num_steps):
    with torch.no_grad():
        logits = model(torch.as_tensor([tokens], device=device))[0]
    p_next = torch.softmax(logits[0, -1, :], dim=-1).data.cpu().numpy()

    next_token_index = p_next.argmax() #<YOUR CODE: REPLACE THIS LINE>
    # YOUR TASK: change the code so that it performs nucleus sampling

    tokens.append(int(next_token_index))
    print(end=tokenizer.decode(tokens[-1]))
    line_length += len(tokenizer.decode(tokens[-1]))
    if line_length >= max_length:
        line_length = 0
        print()

