<a href="https://colab.research.google.com/github/Chenxin-Sun/Sentimental_analysis_transformer/blob/main/Train.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# TensorFlow with GPU

This notebook provides an introduction to computing on a [GPU](https://cloud.google.com/gpu) in Colab. In this notebook you will connect to a GPU, and then run some basic TensorFlow operations on both the CPU and a GPU, observing the speedup provided by using the GPU.


## Enabling and testing the GPU

First, you'll need to enable GPUs for the notebook:

- Navigate to Edit→Notebook Settings
- Select GPU from the Hardware Accelerator drop-down

Next, we'll confirm that we can connect to the GPU with TensorFlow:

In [1]:
import tensorflow as tf
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

Found GPU at: /device:GPU:0


## Observe TensorFlow speedup on GPU relative to CPU

This example constructs a typical convolutional neural network layer over a
random image and manually places the resulting ops on either the CPU or the GPU
to compare execution speed.

In [2]:
import tensorflow as tf
import timeit

device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  print(
      '\n\nThis error most likely means that this notebook is not '
      'configured to use a GPU.  Change this in Notebook Settings via the '
      'command palette (cmd/ctrl-shift-P) or the Edit menu.\n\n')
  raise SystemError('GPU device not found')

def cpu():
  with tf.device('/cpu:0'):
    random_image_cpu = tf.random.normal((100, 100, 100, 3))
    net_cpu = tf.keras.layers.Conv2D(32, 7)(random_image_cpu)
    return tf.math.reduce_sum(net_cpu)

def gpu():
  with tf.device('/device:GPU:0'):
    random_image_gpu = tf.random.normal((100, 100, 100, 3))
    net_gpu = tf.keras.layers.Conv2D(32, 7)(random_image_gpu)
    return tf.math.reduce_sum(net_gpu)

# We run each op once to warm up; see: https://stackoverflow.com/a/45067900
cpu()
gpu()

# Run the op several times.
print('Time (s) to convolve 32x7x7x3 filter over random 100x100x100x3 images '
      '(batch x height x width x channel). Sum of ten runs.')
print('CPU (s):')
cpu_time = timeit.timeit('cpu()', number=10, setup="from __main__ import cpu")
print(cpu_time)
print('GPU (s):')
gpu_time = timeit.timeit('gpu()', number=10, setup="from __main__ import gpu")
print(gpu_time)
print('GPU speedup over CPU: {}x'.format(int(cpu_time/gpu_time)))

Time (s) to convolve 32x7x7x3 filter over random 100x100x100x3 images (batch x height x width x channel). Sum of ten runs.
CPU (s):
7.557269375999994
GPU (s):
0.14599719200001005
GPU speedup over CPU: 51x
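
As a rough check on how much work each timed call does, the sketch below works out the output shape, parameter count, and multiply-accumulate count of the `Conv2D(32, 7)` layer. It assumes the Keras defaults (stride 1, `'valid'` padding, bias enabled); the helper names `conv2d_valid_shape` and `conv2d_param_count` are made up for illustration and are not part of TensorFlow.

```python
def conv2d_valid_shape(batch, height, width, filters, kernel, stride=1):
    """Output shape of a 2-D convolution with 'valid' (no) padding."""
    out_h = (height - kernel) // stride + 1
    out_w = (width - kernel) // stride + 1
    return (batch, out_h, out_w, filters)


def conv2d_param_count(in_channels, filters, kernel, use_bias=True):
    """Trainable weights: one kernel x kernel x in_channels filter per output channel."""
    weights = kernel * kernel * in_channels * filters
    return weights + (filters if use_bias else 0)


# Same sizes as the benchmark: 100 images of 100x100x3, 32 filters of 7x7.
out_shape = conv2d_valid_shape(batch=100, height=100, width=100, filters=32, kernel=7)
n_params = conv2d_param_count(in_channels=3, filters=32, kernel=7)

# Every output value needs kernel * kernel * in_channels multiply-accumulates.
macs = out_shape[0] * out_shape[1] * out_shape[2] * out_shape[3] * 7 * 7 * 3

print(out_shape)  # (100, 94, 94, 32)
print(n_params)   # 4736
print(macs)       # 4156454400
```

Roughly four billion multiply-accumulates per call is the kind of dense, regular workload where a GPU is expected to pull well ahead of a CPU.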


In [1]:
import torch
from torch.utils.data import DataLoader,Dataset


In [2]:
%pip install torch torchvision transformers

Collecting transformers
  Downloading transformers-4.32.1-py3-none-any.whl (7.5 MB)
Collecting huggingface-hub<1.0,>=0.15.1 (from transformers)
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.3.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

In [3]:
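from transformers import BertTokenizer, BertForSequenceClassification
# transformers' own AdamW is deprecated in favor of the PyTorch optimizer.
# Note: torch.optim.AdamW defaults to weight_decay=0.01, whereas transformers' AdamW defaulted to 0.0.
from torch.optim import AdamW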

In [4]:
model_name='bert-base-uncased'
tokenizer=BertTokenizer.from_pretrained(model_name)
model=BertForSequenceClassification.from_pretrained(model_name,num_labels=2)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [5]:
class SentimentDataset(Dataset):
  def __init__(self,texts,labels,tokenizer,max_length):
    self.texts=texts
    self.labels=labels
    self.tokenizer=tokenizer
    self.max_length=max_length
  def __len__(self):
    return len(self.texts)
  def __getitem__(self,idx):
    text=self.texts[idx]
    label=self.labels[idx]

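    encoding=self.tokenizer.encode_plus(
        text,
        add_special_tokens=True,  # add the [CLS] start token and [SEP] separator token
        max_length=self.max_length,
        return_tensors='pt',  # return PyTorch tensors
        padding='max_length',  # every input must have the same length; shorter texts are padded with [PAD] up to max_length
        truncation=True
    )
    input_ids=encoding['input_ids'].flatten()  # flatten() collapses the (1, max_length) tensor to 1-D so the DataLoader can stack examples into a batch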
    attention_mask=encoding['attention_mask'].flatten()

    return {
        'input_ids':input_ids,
        'attention_mask':attention_mask,
        'label':torch.tensor(label)
    }
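
To make the effect of `add_special_tokens=True`, `truncation=True`, and `padding='max_length'` concrete, here is a minimal pure-Python sketch of the resulting `input_ids` and `attention_mask`. It is not the tokenizer's implementation: the helper `pad_and_truncate` and the word-piece ids in the usage lines are hypothetical. The special-token ids (101, 102, 0) are those of the `bert-base-uncased` vocabulary.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # [CLS], [SEP], [PAD] in bert-base-uncased


def pad_and_truncate(token_ids, max_length):
    """Mimic add_special_tokens=True, truncation=True, padding='max_length'."""
    # Reserve two positions for [CLS] and [SEP], then cut the text to fit.
    body = token_ids[: max_length - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]
    # Real tokens get mask 1; padding positions get mask 0.
    attention_mask = [1] * len(input_ids)
    n_pad = max_length - len(input_ids)
    input_ids += [PAD_ID] * n_pad
    attention_mask += [0] * n_pad
    return input_ids, attention_mask


# Hypothetical word-piece ids standing in for a short tweet.
short_ids, short_mask = pad_and_truncate([7, 8, 9], max_length=8)
print(short_ids)   # [101, 7, 8, 9, 102, 0, 0, 0]
print(short_mask)  # [1, 1, 1, 1, 1, 0, 0, 0]

# A text that is too long is cut so that [SEP] still fits at the end.
long_ids, long_mask = pad_and_truncate(list(range(1000, 1020)), max_length=8)
print(long_ids)    # [101, 1000, 1001, 1002, 1003, 1004, 1005, 102]
```

The attention mask is what lets the model ignore the padding positions, which is why the dataset returns it alongside `input_ids`.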

In [6]:
def train(model,dataloader,optimizer,loss_fn,device):
  model.train()
  total_loss=0

  for batch in dataloader:
    input_ids=batch['input_ids'].to(device)
    attention_mask=batch['attention_mask'].to(device)
    labels=batch['label'].to(device)

    optimizer.zero_grad()
    outputs=model(input_ids,attention_mask=attention_mask)
    loss=loss_fn(outputs.logits,labels)
    loss.backward()
    optimizer.step()
    total_loss+=loss.item()
  return total_loss/len(dataloader)
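
The `loss_fn(outputs.logits, labels)` call is a softmax cross-entropy when `loss_fn` is `torch.nn.CrossEntropyLoss()` with its default mean reduction. The pure-Python sketch below shows that computation on made-up logits; the function name `cross_entropy` is only illustrative, and this is not PyTorch's implementation.

```python
import math


def cross_entropy(logits, labels):
    """Mean negative log-probability of the true class under a row-wise softmax."""
    losses = []
    for row, label in zip(logits, labels):
        # log-sum-exp, with the row maximum subtracted for numerical stability
        m = max(row)
        log_norm = m + math.log(sum(math.exp(x - m) for x in row))
        losses.append(log_norm - row[label])
    return sum(losses) / len(losses)


# Two made-up examples with two classes each.
logits = [[2.0, 0.0], [0.0, 0.0]]
labels = [0, 1]
loss = cross_entropy(logits, labels)
print(round(loss, 4))  # 0.41
```

The first example is confidently correct and contributes little loss; the second is a coin flip and contributes log 2, which dominates the batch mean.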

In [47]:
def predict(model, dataloader, device):
    model.eval()
    predictions = []

    with torch.no_grad():
        for batch in dataloader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)

            outputs = model(input_ids, attention_mask=attention_mask)
            _, predicted = torch.max(outputs.logits, 1)
            predictions.extend(predicted.cpu().numpy())

    return predictions
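
`torch.max(outputs.logits, 1)` returns both the largest value in each row and its index, and `predict` keeps only the index, so the predicted class is a row-wise argmax. A minimal sketch on made-up logits (the helper `argmax_rows` is hypothetical, and its tie-breaking rule is a choice made for this sketch):

```python
def argmax_rows(logits):
    """Index of the largest value in each row; ties go to the first index here."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


batch_logits = [[1.2, -0.3], [-0.8, 0.4], [0.1, 0.1]]
print(argmax_rows(batch_logits))  # [0, 1, 0]
```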

In [7]:
def evaluate(model,dataloader,loss_fn,device):
  model.eval()
  total_loss=0
  correct=0
  with torch.no_grad():
        for batch in dataloader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['label'].to(device)

            outputs = model(input_ids, attention_mask=attention_mask)
            loss = loss_fn(outputs.logits, labels)
            total_loss += loss.item()

            _, predicted = torch.max(outputs.logits, 1)
            correct += (predicted == labels).sum().item()
  return total_loss / len(dataloader), correct / len(dataloader.dataset)
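
Note the two different denominators in `evaluate`: the loss is averaged over batches, while accuracy is averaged over examples. When the last batch is shorter than the rest, the mean of per-batch losses is not exactly the per-example mean. The sketch below illustrates this with made-up loss values; `batch_sizes` is a hypothetical helper, 2284 is the size of the held-out split reported in this notebook, and 32 is `batch_size`.

```python
def batch_sizes(n_examples, batch_size):
    """Sizes of the batches produced when the final partial batch is kept."""
    full, rest = divmod(n_examples, batch_size)
    return [batch_size] * full + ([rest] if rest else [])


sizes = batch_sizes(2284, 32)
print(len(sizes), sizes[-1])  # 72 12

# Made-up per-batch mean losses: 0.2 for every full batch, 0.8 for the short last one.
batch_means = [0.2] * (len(sizes) - 1) + [0.8]

# What evaluate() reports: every batch counts equally.
mean_of_batches = sum(batch_means) / len(batch_means)

# Per-example mean: each batch is weighted by how many examples it holds.
per_example = sum(m * s for m, s in zip(batch_means, sizes)) / sum(sizes)

print(mean_of_batches > per_example)  # True
```

The gap is small for a dataset of this size, but it is the reason accuracy is divided by `len(dataloader.dataset)` rather than by the number of batches.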

In [8]:
batch_size=32
max_length=128
learning_rate=2e-5
epochs=5


## Data preprocessing

In [37]:
import pandas as pd
train_df=pd.read_csv('/content/train.csv')
train_df.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [38]:
from sklearn.model_selection import train_test_split
train_text,test_text,train_target,test_target=train_test_split(train_df['text'],train_df['target'],test_size=0.3,stratify=train_df['target'])
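
The `stratify` argument keeps the class proportions of `target` the same in both parts of the split. The following sketch shows the idea on made-up labels using only the standard library. It is not scikit-learn's algorithm, and `stratified_split` is a hypothetical helper.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size, seed=0):
    """Return (train_idx, test_idx) with roughly the same class mix in both parts."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction from every class.
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Made-up labels: 60 negatives and 40 positives.
labels = [0] * 60 + [1] * 40
train_idx, test_idx = stratified_split(labels, test_size=0.3)
print(len(train_idx), len(test_idx))     # 70 30
print(sum(labels[i] for i in test_idx))  # 12
```

The fixed `seed` plays the role that `random_state` plays in `train_test_split`: passing it makes the split reproducible between runs.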


In [39]:
import re
import string
def remove_html(text):
    html = re.compile(r'<.*?>')
    return html.sub(r'', text)
# Match runs of characters in the Unicode ranges associated with emoji.
def remove_emoji(text):
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002702-\U000027B0"
                               u"\U000024C2-\U0001F251"
                               "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)
def remove_punc(text):
  table=str.maketrans("","",string.punctuation)  # build a translation table that maps every punctuation character to nothing
  return text.translate(table)  # apply the table

def clean_df(df):
  # df is a pandas Series of strings (one tweet per row)
  df=df.apply(remove_html)
  df=df.apply(remove_emoji)
  df=df.apply(remove_punc)
  df=df.replace(r"\s+", " ", regex=True)  # collapse runs of whitespace into a single space
  return df
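
To see what the cleaning steps do to a single tweet, the sketch below applies the same four steps to one string. The function `clean_text` and the sample tweet are made up for illustration, and the emoji pattern here uses only the three narrow emoji ranges from `remove_emoji`.

```python
import re
import string

HTML_TAG = re.compile(r"<.*?>")
# Narrow emoji ranges only: emoticons, symbols & pictographs, transport & map symbols.
EMOJI = re.compile("[\U0001F600-\U0001F64F\U0001F300-\U0001F5FF\U0001F680-\U0001F6FF]+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text):
    """Strip HTML tags, emoji, and punctuation, then collapse whitespace."""
    text = HTML_TAG.sub("", text)
    text = EMOJI.sub("", text)
    text = text.translate(PUNCT_TABLE)
    return re.sub(r"\s+", " ", text)


sample = "<b>Forest fire</b> \U0001F525 near La Ronge, Sask.  #Canada!"
print(clean_text(sample))  # Forest fire near La Ronge Sask Canada

# The widest range in remove_emoji (U+24C2 to U+1F251) also covers CJK ideographs,
# so non-Latin text is removed along with emoji.
WIDE = re.compile("[\U000024C2-\U0001F251]+")
print(repr(WIDE.sub("", "地震 earthquake")))  # ' earthquake'
```

Because `#` counts as punctuation, hashtags survive as plain words, which is usually what a text classifier wants. The wide range is worth narrowing if the data may contain non-Latin scripts.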

In [40]:
train_text=clean_df(train_text)
test_text=clean_df(test_text)

In [41]:
train_text=train_text.tolist()
train_label=train_target.tolist()

In [42]:
train_dataset=SentimentDataset(train_text,train_label,tokenizer,max_length)
train_dataloader=DataLoader(train_dataset,batch_size=batch_size,shuffle=True)


In [43]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
optimizer=AdamW(model.parameters(),lr=learning_rate)
loss_fn=torch.nn.CrossEntropyLoss()

for epoch in range(epochs):
  train_loss=train(model,train_dataloader,optimizer,loss_fn,device)
  print(f'Epoch{epoch+1}/{epochs},Train Loss:{train_loss:.4f}')



Epoch1/5,Train Loss:0.1536
Epoch2/5,Train Loss:0.0998
Epoch3/5,Train Loss:0.0775
Epoch4/5,Train Loss:0.0641
Epoch5/5,Train Loss:0.0515


In [58]:
evaluate(model,train_dataloader,loss_fn,device)

(0.03328290157414649, 0.9844248451867142)

In [45]:
test_text=test_text.tolist()
test_label=test_target.tolist()

In [60]:
test_dataset=SentimentDataset(test_text,test_label,tokenizer,max_length)
test_dataloader=DataLoader(test_dataset,batch_size=batch_size,shuffle=False)

In [61]:
predictions = predict(model, test_dataloader, device)

In [62]:
from sklearn.metrics import classification_report
print(classification_report(test_target,predictions))

              precision    recall  f1-score   support

           0       0.95      0.94      0.94      1303
           1       0.92      0.93      0.93       981

    accuracy                           0.94      2284
   macro avg       0.94      0.94      0.94      2284
weighted avg       0.94      0.94      0.94      2284
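
The precision, recall, and F1 columns in the report are simple ratios of true positives, false positives, and false negatives for each class. A minimal sketch of those formulas on made-up labels (these are not the notebook's predictions, and `binary_report` is a hypothetical helper rather than a scikit-learn function):

```python
def binary_report(y_true, y_pred, positive=1):
    """Precision, recall and F1 for one class, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up labels: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
precision, recall, f1 = binary_report(y_true, y_pred)
print(precision, recall, f1)  # 0.75 0.75 0.75
```

Calling the helper with `positive=0` gives the row for the other class; the macro average in the report is the unweighted mean of the two rows.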



In [65]:
import pandas as pd
df=pd.read_csv('/content/test.csv')
df.head()

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan


In [66]:
test=clean_df(df['text'])


In [68]:
test=test.to_list()

In [69]:
class InputDataset(Dataset):
  def __init__(self,texts,tokenizer,max_length):
    self.texts=texts
    self.tokenizer=tokenizer
    self.max_length=max_length
  def __len__(self):
    return len(self.texts)
  def __getitem__(self,idx):
    text=self.texts[idx]
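    encoding=self.tokenizer.encode_plus(
        text,
        add_special_tokens=True,  # add the [CLS] start token and [SEP] separator token
        max_length=self.max_length,
        return_tensors='pt',  # return PyTorch tensors
        padding='max_length',  # every input must have the same length; shorter texts are padded with [PAD] up to max_length
        truncation=True
    )
    input_ids=encoding['input_ids'].flatten()  # flatten() collapses the (1, max_length) tensor to 1-D so the DataLoader can stack examples into a batch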
    attention_mask=encoding['attention_mask'].flatten()

    return {
        'input_ids':input_ids,
        'attention_mask':attention_mask
    }

In [70]:
test_dataset=InputDataset(test,tokenizer,max_length)
test_dataloader=DataLoader(test_dataset,batch_size=batch_size,shuffle=False)

In [71]:
predictions = predict(model, test_dataloader, device)

In [76]:
pre=pd.DataFrame(predictions)

In [78]:
result=pd.concat([df['id'],pre],axis=1)

In [81]:
result.columns=['id','target']

In [82]:
result

Unnamed: 0,id,target
0,0,1
1,2,1
2,3,1
3,9,1
4,11,1
...,...,...
3258,10861,1
3259,10865,1
3260,10868,1
3261,10874,1


In [83]:
result= result.reset_index(drop=True)

In [84]:
result

Unnamed: 0,id,target
0,0,1
1,2,1
2,3,1
3,9,1
4,11,1
...,...,...
3258,10861,1
3259,10865,1
3260,10868,1
3261,10874,1


In [86]:
result.to_csv('/content/prediction.csv',index=False)
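
With `index=False`, the saved file holds just a header row followed by one `id,target` pair per line. The sketch below reproduces that layout with the standard `csv` module, using the first five rows shown in the output above and an in-memory buffer instead of `/content/prediction.csv`. It illustrates the file format and is not a replacement for `to_csv`.

```python
import csv
import io

ids = [0, 2, 3, 9, 11]     # first ids shown in the result table
targets = [1, 1, 1, 1, 1]  # matching predictions shown in the result table

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["id", "target"])    # header row, no index column
writer.writerows(zip(ids, targets))  # one prediction per line

print(buffer.getvalue())
```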