
# Tweet Sentiment Extraction

![tweets_resize.png](attachment:113a7a6d-1cba-4e25-b3ad-3b77fb2d5136.png)


## Overview

This notebook has two main parts.<br/>
First, explore the data and visualize useful patterns and relationships in it.<br/>
Second, choose a model based on what the visualizations show, then transform the data into the form that model expects.<br/><br/>

Unlike a typical sentiment analysis competition, this one does not ask us to classify a tweet as positive or negative.<br/>
Instead, the sentiment is given, and the task is to extract the word or phrase in the tweet that best supports that sentiment.
##### "Keep in mind that the goal is to extract text, not to classify sentiment."<br/>


#### My opinion:
1) The labels have different lengths, so we predict each one as a start index and an end index into the tweet.<br/>
2) The submission must contain text exactly as it appears in the original tweet, so the model input should not be cleaned. This notebook cleans the text only for visualization.
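
To make point 1 concrete, here is a minimal sketch of how a label can be expressed as a start index and an end index. It assumes the selected text occurs verbatim inside the tweet; the helper name `find_span` and the sample tweet are made up for illustration and are not part of the competition data.

```python
def find_span(text, selected_text):
    """Return (start, end) character indices of selected_text inside text.

    `end` is inclusive. Returns None when the selection is not a substring.
    """
    start = text.find(selected_text)
    if start == -1:
        return None
    return start, start + len(selected_text) - 1


# Hypothetical example tweet, not taken from the competition data.
text = "my phone died again, so sad today"
selected_text = "so sad"

span = find_span(text, selected_text)
start, end = span
# Slicing the untouched original text recovers the label exactly.
recovered = text[start:end + 1]
print(span, repr(recovered))
```

Because the slice is taken from the original string, any cleaning applied to `text` beforehand would shift the indices, which is why point 2 matters.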

***

## My workflow
#### 1. Import & Install library
* Import basic libraries
* Import engineering libraries
* Install stopwords list

#### 2. Check out my data
* Check Shape / Info

#### 3. Exploratory Data Analysis(EDA) with Visualization [Before Preprocessing]
* Plot the null values
* Plot the "sentiment" columns count
* Number of alphabets by sentence / Number of words by sentence
* Bi-gram by sentiment per texts in Tweets

#### 4. Preprocessing Data
* Drop null rows & useless columns
* Cleaning "text" / "selected_text" data

#### 5. Visualization [After Preprocessing]
* Wordcloud by sentiment per selected text in Tweets

#### 6. Feature Engineering
* How to use Bert tokenization
* Convert to data suitable for Bert model (Dataset / Dataloader)

#### 7. Modeling
* Bert Modeling
* Loss Function (CrossEntropyLoss / Jaccard)
* Training
* Evaluating

#### 8. Submission
* Submit the predictions.

# 1. Import & Install library
* Import basic libraries
* Import engineering libraries
* Install stopwords list

In [None]:
import re
import os
import random

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
import plotly.express as px

import warnings
warnings.filterwarnings('ignore')
plt.style.use('seaborn')

%matplotlib inline

In [None]:
import torch
import torchtext
from torchtext import data, datasets
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.data
from torch.utils.data import TensorDataset, DataLoader
import torch.optim as optim

import transformers
from transformers import BertTokenizer, BertModel, BertConfig

import tokenizers

import nltk
from nltk.stem import SnowballStemmer
from nltk.stem import WordNetLemmatizer
from wordcloud import WordCloud
from PIL import Image

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split

In [None]:
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

In [None]:
train_df = pd.read_csv('/kaggle/input/tweet-sentiment-extraction/train.csv')
test_df = pd.read_csv('/kaggle/input/tweet-sentiment-extraction/test.csv')
train_df.head()

# 2. Check out my data
* Check Shape / Info
* Set color palette

In [None]:
print("Train data size : {}".format(train_df.shape))
print("Test data size : {}".format(test_df.shape))

In [None]:
train_df.info()

In [None]:
PuBu_palette = sns.color_palette("PuBu", 10)
YlGnBu_palette = sns.color_palette("YlGnBu", 10)
sns.palplot(PuBu_palette)
sns.palplot(YlGnBu_palette)

#### ✔️ This notebook uses these palettes.

# 3. Exploratory Data Analysis(EDA) with Visualization [Before Preprocessing]
* Plot the null values
* Plot the "sentiment" columns count
* Number of alphabets by sentence / Number of words by sentence
* Bi-gram by sentiment per texts in Tweets

### 3-1)  Plot the Null Values

In [None]:
pd.DataFrame(train_df.isnull().sum(), columns=["Train Null Count"])

In [None]:
pd.DataFrame(test_df.isnull().sum(), columns=["Test Null Count"])

In [None]:
msno.matrix(df=train_df.iloc[:,:],figsize=(5,5),color=YlGnBu_palette[8])

### 3-2)  Plot the "sentiment" columns count

In [None]:
fig, ax = plt.subplots(1,1,figsize=(7, 5))
# Pass the column as the keyword argument `x` and draw on the axes created above
ax = sns.countplot(x=train_df["sentiment"],
              order = train_df['sentiment'].value_counts().index,
              palette=PuBu_palette[-5:], ax=ax)
ax.patch.set_alpha(0)
fig.text(0.1,0.92,"Distribution by sentiment in Tweets", fontweight="bold", fontfamily='serif', fontsize=17)
plt.show()

### 3-3) Number of alphabets by sentence / Number of words by sentence

In [None]:
def get_length_alphabets(text):
    text = str(text)
    return len(text)

In [None]:
def get_length_words(text):
    text = str(text)
    return len(text.split())  # split on any whitespace so repeated spaces do not count as words

In [None]:
train_df['length_alphabets'] = train_df['text'].apply(lambda x: get_length_alphabets(x))
train_df['length_words'] = train_df['text'].apply(lambda x: get_length_words(x))
train_df.head()

#### 💡 `describe()` shows summary statistics for [length_alphabets] and [length_words]

In [None]:
train_df.describe()

In [None]:
PuBu_palette

In [None]:
# Pick three well-separated shades, one per sentiment
three_PuBu_palette = [PuBu_palette[2], PuBu_palette[6], PuBu_palette[4]]
three_PuBu_palette

In [None]:
fig = plt.figure(figsize=(12, 8))
gs = fig.add_gridspec(2,1)

axes = list()

for index in range(2):
    axes.append(fig.add_subplot(gs[index, 0]))
    
    
    if index==0:
        sns.kdeplot(x='length_alphabets', data=train_df, 
                        fill=True, ax=axes[index], cut=0, bw_method=0.20, 
                        lw=1.4 , hue='sentiment', palette=three_PuBu_palette,
                         alpha=0.3)
    else:
        sns.kdeplot(x='length_words', data=train_df, 
                    fill=True, ax=axes[index], cut=0, bw_method=0.20, 
                    lw=1.4 , hue='sentiment',palette=three_PuBu_palette,
                     alpha=0.3) 

    axes[index].set_yticks([])
    if index != 1 : axes[index].set_xticks([])
    axes[index].set_ylabel('')
    axes[index].set_xlabel('')
    axes[index].spines[["top","right","left","bottom"]].set_visible(False)
    
    
    if index == 0:
        axes[index].text(-0.2,0,"length_alphabets",fontweight="light", fontfamily='serif', fontsize=13,ha="right")
    else:
        axes[index].text(-0.2,0,"length_words",fontweight="light", fontfamily='serif', fontsize=13,ha="right")
        
        
    axes[index].patch.set_alpha(0)
    if index != 0 : axes[index].get_legend().remove()
        
fig.text(0.05,0.91,"Count distribution by length in Tweets", fontweight="bold", fontfamily='serif', fontsize=20)
plt.show()

### 3-4) Bi-gram by sentiment per texts in Tweets

In [None]:
def get_top_tweet_bigrams(corpus, n=None):
    vec = CountVectorizer(ngram_range=(2, 2)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
    return words_freq[:n]
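
What the function above counts can be reproduced without scikit-learn. The following is a rough sketch, assuming the default `CountVectorizer` tokenization (lowercasing, and keeping only tokens of two or more word characters). The name `top_bigrams` and the mini corpus are made up for illustration.

```python
import re
from collections import Counter


def top_bigrams(corpus, n=None):
    """Count adjacent word pairs across all documents and return the top n."""
    counts = Counter()
    for doc in corpus:
        # Approximates CountVectorizer's default tokenization:
        # lowercase, keep only tokens with two or more word characters.
        tokens = re.findall(r"\b\w\w+\b", doc.lower())
        counts.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return counts.most_common(n)


# Hypothetical mini corpus for illustration.
corpus = [
    "Happy mothers day to all",
    "happy mothers day mom",
    "have a good day",
]
print(top_bigrams(corpus, n=2))
```

Note that the single-character word "a" is dropped before pairing, so "have a good day" contributes the bigram "have good".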

In [None]:
fig, axes = plt.subplots(1,3, figsize=(12, 10), constrained_layout=True)

sentiment_list = list(np.unique(train_df['sentiment']))

for i, sentiment in zip(range(3), sentiment_list):
    top_tweet_bigrams = get_top_tweet_bigrams(train_df[train_df['sentiment']==sentiment]['text'].fillna(" "))[:10]
    x,y = map(list,zip(*top_tweet_bigrams))
    sns.barplot(x=y, y=x, ax=axes[i], palette=PuBu_palette[::-1])
    axes[i].text(0,-0.7, sentiment, fontweight="bold", fontfamily='serif', fontsize=13,ha="right")
    axes[i].patch.set_alpha(0)

fig.text(0,1.01,"Bi-gram by sentiment per text in Tweets", fontweight="bold", fontfamily='serif', fontsize=18)
plt.show()

# 4. Preprocessing Data
* Drop null rows & useless columns
* Cleansing "text" / "selected_text" data

### 4-1) Drop null rows & useless columns

In [None]:
train_df_before_drop_shape = train_df.shape
test_df_before_drop_shape = test_df.shape

In [None]:
train_df.dropna(inplace=True)
test_df.dropna(inplace=True)

In [None]:
train_df.drop(['length_alphabets','length_words'], axis=1, inplace=True)

In [None]:
train_df_drop_shape = train_df.shape
test_df_drop_shape = test_df.shape

print("Train dataset Shape : {} => {}".format(train_df_before_drop_shape, train_df_drop_shape))
print("Test dataset Shape : {} => {}".format(test_df_before_drop_shape, test_df_drop_shape))

### 4-2) Cleansing "text" / "selected_text" data

In [None]:
def preprocess_fn(text):
    text = str(text)
    text = text.lower()  # lowercase

    text = re.sub(r'[!]+', '!', text)
    text = re.sub(r'[?]+', '?', text)
    text = re.sub(r'[.]+', '.', text)
    text = re.sub(r"'", "", text)
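    text = re.sub(r'\s+', ' ', text).strip()  # collapse repeated whitespace
    text = re.sub(r'&amp;?', r'and', text)  # replace the HTML entity &amp; -> and
    # remove some punctuation (keeps . ! # ? *); the hyphen is escaped so that
    # ",-/" is not read as a character range, which would also delete "."
    text = re.sub(r'[:"$%&+,\-/;<=>@\\^_{|}~`]+', '', text)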
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002702-\U000027B0"
                               u"\U000024C2-\U0001F251"
                               "]+", flags=re.UNICODE)
    text = emoji_pattern.sub(r'EMOJI', text)
    
    return text
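
One detail of the punctuation step is easy to get wrong: inside a character class, a hyphen between two characters defines a range. A small sketch of the difference, using a made-up sample string:

```python
import re

sample = "wait... really? no-way, ok."

# Unescaped: ",-/" is read as the range from "," to "/",
# which covers the four characters , - . /
unescaped = re.sub(r"[,-/]+", "", sample)

# Escaped: the hyphen is a literal, so only , - / are removed.
escaped = re.sub(r"[,\-/]+", "", sample)

print(unescaped)
print(escaped)
```

If periods should survive cleaning, the hyphen has to be escaped or moved to the start or end of the class.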

In [None]:
train_df['text'] = train_df['text'].apply(lambda x: preprocess_fn(x))
train_df['selected_text'] = train_df['selected_text'].apply(lambda x: preprocess_fn(x))

test_df['text'] = test_df['text'].apply(lambda x: preprocess_fn(x))

In [None]:
train_df.head()

# 5. Visualization [After Preprocessing]
* Wordcloud by sentiment per selected text in Tweets

### 5-1) Wordcloud by sentiment per selected text in Tweets

In [None]:
mask_dir = np.array(Image.open('../input/masksforwordclouds/twitter_mask3.jpg'))

In [None]:
fig, axes = plt.subplots(1,3, figsize=(24,12))
sentiment_list = np.unique(train_df['sentiment'])

for i, sentiment in zip(range(3), sentiment_list):
    wc = WordCloud(background_color="white", max_words = 2000, width = 1600, height = 800, mask=mask_dir, colormap="Blues").generate(" ".join(train_df[train_df['sentiment']==sentiment]['selected_text']))
    
    axes[i].text(0.5,1, "{} text".format(sentiment), fontweight="bold", fontfamily='serif', fontsize=17)
    axes[i].patch.set_alpha(0)
    axes[i].axis('off')
    axes[i].imshow(wc)

fig.text(0.1,0.8,"WordCloud by sentiment per selected text in Tweets", fontweight="bold", fontfamily='serif', fontsize=20)
plt.show()

# 6. Feature Engineering
* How to use Bert tokenization
* Convert to data suitable for Bert model (Dataset / Dataloader)

#### -> We reload the raw data and skip text cleaning here, because the submission must contain the text exactly as it appears in the tweet.

In [None]:
train_df = pd.read_csv('/kaggle/input/tweet-sentiment-extraction/train.csv')
test_df = pd.read_csv('/kaggle/input/tweet-sentiment-extraction/test.csv')

train_df.dropna(inplace=True)
test_df.dropna(inplace=True)

train_df.head()

### 6-1) How to use Bert tokenization

In [None]:
bert_tokenizer = BertTokenizer.from_pretrained(
    '../input/bert-base-uncased/vocab.txt',
    do_lower_case=True)

In [None]:
print("Original Text")
print(train_df['text'][0]) # original sentence
print("\n")

print("Original Text Data")
print(bert_tokenizer.tokenize(train_df['text'][0]))
print(bert_tokenizer.convert_tokens_to_ids(bert_tokenizer.tokenize(train_df['text'][0])))
print(bert_tokenizer.decode(bert_tokenizer.convert_tokens_to_ids(bert_tokenizer.tokenize(train_df['text'][0]))))
print("\n")

print("Selected Text Data")
print(bert_tokenizer.tokenize(train_df['selected_text'][0]))
print(bert_tokenizer.convert_tokens_to_ids(bert_tokenizer.tokenize(train_df['selected_text'][0])))
print(bert_tokenizer.decode(bert_tokenizer.convert_tokens_to_ids(bert_tokenizer.tokenize(train_df['selected_text'][0]))))
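
`bert_tokenizer.tokenize` splits words that are missing from the vocabulary into subword pieces marked with `##`. Below is a toy sketch of that greedy longest-match idea. It assumes a made-up five-entry vocabulary (the real BERT vocabulary has about 30,000 entries) and a hypothetical helper `wordpiece`; the real tokenizer also lowercases and splits on punctuation first.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into subwords."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no vocabulary entry matches
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabulary for illustration only, not the real BERT vocab.
toy_vocab = {"play", "##ing", "##ed", "sad", "##ly"}
print(wordpiece("playing", toy_vocab))
print(wordpiece("sadly", toy_vocab))
print(wordpiece("xyz", toy_vocab))
```

This is also why decoding token ids does not always reproduce the original string character for character.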

In [None]:
def get_max_len(df):
    max_len = 0
    for text in df['text']:

        # Tokenize the text and add special tokens i.e `[CLS]` and `[SEP]`
        input_ids = bert_tokenizer.encode(text, add_special_tokens=True)
        # Update the maximum sentence length.
        max_len = max(max_len, len(input_ids))

    return max_len

In [None]:
max_len = get_max_len(train_df)
print("max len : ", max_len)

### 6-2) Convert to data suitable for Bert model (Dataset / Dataloader)

In [None]:
class BertDataset(torch.utils.data.Dataset):
    def __init__(self, df, max_len=max_len, is_label=False):
        self.df = df
        self.max_len = max_len + 3 # room for the sentiment prefix and its `[SEP]`; max_len already counts `[CLS]` and the final `[SEP]`
        self.is_label = is_label

        
    def __len__(self):
        return len(self.df)
    

    def __getitem__(self, index):
        data = {}
        row = self.df.iloc[index]

        ids, masks, token_type  = self.get_bert_tokenize(row)
        data['input_ids'] = ids
        data['attention_masks'] = masks
        data['token_type_ids'] = token_type
        
        # Text / Selected Text Decode
        data['text'] = bert_tokenizer.decode(ids)
        
        if self.is_label:
            start_idx, end_idx = self.get_label_idx(data, row)
            data['start_index'] = start_idx
            data['end_index'] = end_idx
             
        return data

        
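    def get_label_idx(self, data, row):
        # Token ids of the label (selected_text), without special tokens
        text_id = bert_tokenizer.encode(
                row['selected_text'],
                add_special_tokens=False,
            )
        label_len = len(text_id)

        # Default to position 0 (`[CLS]`) when the label is not found
        # (an assumed fallback; the indices are local, not global)
        start_index, end_index = 0, 0

        # Find the first position whose first and last tokens match the label;
        # the range bound keeps i+label_len-1 inside the sequence
        if label_len > 0:
            for i in range(len(data['input_ids']) - label_len + 1):
                if (data['input_ids'][i] == text_id[0]
                        and data['input_ids'][i+label_len-1] == text_id[-1]):
                    start_index = i
                    end_index = i+label_len-1
                    break

        return torch.tensor(start_index), torch.tensor(end_index)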
    
    
    def get_bert_tokenize(self, row):
        
        text = row['text']
        sentiment = row['sentiment']
        
        encoded = bert_tokenizer.encode_plus(
          sentiment,
          text,
          add_special_tokens=True,
          max_length=self.max_len,
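          padding='max_length',  # replaces the deprecated pad_to_max_length=True
          truncation=True,       # keep every sequence at exactly max_length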
          return_token_type_ids=True,
          return_attention_mask=True,
          return_tensors='pt'
        )

        input_ids = encoded['input_ids'].squeeze()
        attention_masks = encoded['attention_mask'].squeeze()
        token_type_ids = encoded['token_type_ids'].squeeze()
        
        return input_ids, attention_masks, token_type_ids  
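
The label search in the class compares only the first and last token of the label. A stricter variant compares the whole token sequence. The sketch below shows that idea on plain lists; `find_token_span`, the `search_from` parameter, and the token ids are hypothetical, and the offset of 3 assumes the sentiment is a single token after `[CLS]`.

```python
def find_token_span(input_ids, target_ids, search_from=0):
    """Return (start, end) of the first full occurrence of target_ids, else None."""
    n = len(target_ids)
    if n == 0:
        return None
    for i in range(search_from, len(input_ids) - n + 1):
        if input_ids[i:i + n] == target_ids:
            return i, i + n - 1  # inclusive end index
    return None


# Hypothetical ids: [CLS]=101, sentiment, [SEP]=102, then the tweet tokens.
input_ids = [101, 3893, 102, 8, 4, 5, 8, 9, 5, 102]
target_ids = [8, 9, 5]

# Matching only the first and last token would stop at positions 3..5
# (8, 4, 5); comparing every token finds the real span at 6..8.
print(find_token_span(input_ids, target_ids, search_from=3))
```

Starting the search after the prefix also avoids matching the sentiment token itself.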

* 1) Split Train data / Validation data
* 2) Get Dataset
* 3) Get Dataloader

In [None]:
train_df, val_df = train_test_split(train_df, test_size=0.2)
print("Train dataframe size:",train_df.shape)
print("Validation dataframe size:",val_df.shape)

In [None]:
train_dataset = BertDataset(train_df, is_label=True)
val_dataset = BertDataset(val_df, is_label=True)
test_dataset = BertDataset(test_df, is_label=False)

In [None]:
print("Original text length: {} \n".format(len(train_dataset[0]['input_ids'])))
print("Original text: {} \n\n".format(train_dataset[0]['text']))

print("[input_ids] \n", train_dataset[0]['input_ids'])
print("\n[attention_masks] \n", train_dataset[0]['attention_masks'])
print("\n[token_type_ids] \n", train_dataset[0]['token_type_ids'])

- 1. input_ids: Token ids passed to the model, in the order `[CLS]` sentiment `[SEP]` text `[SEP]`, followed by padding.
- 2. attention_masks: 1 for real tokens and 0 for padding, so the model ignores the padded positions.
- 3. token_type_ids: Segment ids that let the model tell the two sequences apart: 0 for the first sequence ('sentiment') and 1 for the second ('text').
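
The three tensors can be built by hand to see how they line up. This is a minimal sketch on plain lists, assuming the special token ids of `bert-base-uncased` (`[CLS]`=101, `[SEP]`=102, `[PAD]`=0); the helper `encode_pair` and the word ids are made up for illustration.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special token ids in bert-base-uncased


def encode_pair(first_ids, second_ids, max_len):
    """Build input_ids, attention_mask and token_type_ids for a sentence pair."""
    input_ids = [CLS_ID] + first_ids + [SEP_ID] + second_ids + [SEP_ID]
    # Segment 0 covers [CLS] + first + [SEP]; segment 1 covers second + [SEP].
    token_type_ids = [0] * (len(first_ids) + 2) + [1] * (len(second_ids) + 1)
    attention_mask = [1] * len(input_ids)

    pad = max_len - len(input_ids)
    if pad < 0:
        raise ValueError("sequence is longer than max_len")
    return (input_ids + [PAD_ID] * pad,
            attention_mask + [0] * pad,
            token_type_ids + [0] * pad)


# Hypothetical ids standing in for the sentiment word and a 3-token tweet.
sentiment_ids = [3893]
text_ids = [2026, 2154, 2003]
ids, mask, types = encode_pair(sentiment_ids, text_ids, max_len=10)
print(ids)
print(mask)
print(types)
```

Padding positions get 0 in all three lists, so the attention mask is what separates padding from the first segment.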

In [None]:
def get_train_val_loaders(train_dataset, val_dataset, batch_size=8):
    
    train_loader = torch.utils.data.DataLoader(
        train_dataset,
        batch_size=batch_size, 
        shuffle=True, 
        num_workers=0,
        drop_last=True)
    
    val_loader = torch.utils.data.DataLoader(
        val_dataset,
        batch_size=batch_size, 
        shuffle=False, 
        num_workers=0)
    
    dataloaders_dict = {"train": train_loader, "val": val_loader}
    return dataloaders_dict


def get_test_loader(dataset, batch_size=32):
    
    loader = torch.utils.data.DataLoader(
        dataset,
        batch_size=batch_size, 
        shuffle=False, 
        num_workers=0)  
    
    return loader
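
The train loader uses `drop_last=True`, which discards the final incomplete batch. A small sketch of the effect on the number of batches, using a hypothetical helper `num_batches` and a made-up dataset size:

```python
import math


def num_batches(n_samples, batch_size, drop_last):
    """Number of batches a loader yields per epoch."""
    if drop_last:
        return n_samples // batch_size  # incomplete final batch is discarded
    return math.ceil(n_samples / batch_size)


# Hypothetical dataset of 1003 samples with batch size 8.
kept = num_batches(1003, 8, drop_last=True)
full = num_batches(1003, 8, drop_last=False)
samples_seen = kept * 8
print(kept, full, samples_seen)
```

With `drop_last=True`, a few samples are skipped each epoch, so a loss sum divided by the full dataset size slightly underestimates the mean training loss.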

In [None]:
dict_loader = get_train_val_loaders(train_dataset, val_dataset)
test_loader = get_test_loader(test_dataset)

In [None]:
dict_loader

# 7. Modeling
* Bert Modeling
* Loss Function (CrossEntropyLoss / Jaccard)
* Training
* Evaluating

In [None]:
USE_CUDA = torch.cuda.is_available() 
print(USE_CUDA)

device = torch.device('cuda:0' if USE_CUDA else 'cpu') 
print('Using device:', device)

### 7-1) Bert Modeling

In [None]:
## main ##
class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        config = BertConfig.from_pretrained(
            '../input/bert-base-uncased/config.json', output_hidden_states=True)    
        self.bert = transformers.BertModel.from_pretrained(
            '../input/bert-base-uncased/pytorch_model.bin', config=config)
        self.hidden_size = self.bert.config.hidden_size
        # batch_first=True because the BERT hidden states are (batch, seq_len, hidden)
        self.LSTM = nn.LSTM(self.hidden_size*2, 128, batch_first=True)
        self.layer = nn.Sequential(
            nn.Linear(128,64),
            nn.Dropout(0.2),
        )
    
        # The output will have two dimensions ("start_logits", and "end_logits")
        self.FC = nn.Linear(64,2)
        torch.nn.init.normal_(self.FC.weight, std=0.02)
        
 
    def forward(self, ids, mask, token):
        # Return the hidden states from the BERT backbone
        out = self.bert(
            ids,
            attention_mask=mask,
            token_type_ids=token
        )
        
        # out[2] holds the hidden states of every layer; concatenate the last two
        out = torch.cat((out[2][-1],out[2][-2]), dim=-1)
        
        out, _ = self.LSTM(out)
        out = self.layer(out)
        logits = self.FC(out)
        start_logits, end_logits = logits.split(1, dim=-1)
        start_logits = start_logits.squeeze(-1)
        end_logits = end_logits.squeeze(-1)

        return start_logits, end_logits

### 7-2) Loss Function 
* CrossEntropyLoss
* Jaccard 

In [None]:
def loss_fn(start_logits, end_logits, start_positions, end_positions):
    ce_loss = nn.CrossEntropyLoss()
    start_loss = ce_loss(start_logits, start_positions)
    end_loss = ce_loss(end_logits, end_positions)    
    total_loss = start_loss + end_loss
    return total_loss
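
The loss treats the start and the end position as two separate classification problems over the token positions and adds the two cross-entropy terms. A worked sketch in plain Python for a single example, with made-up logits and a hypothetical helper `cross_entropy`:

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


# Hypothetical logits over a 5-token sequence.
start_logits = [0.1, 2.0, 0.3, -1.0, 0.0]
end_logits = [0.0, 0.2, 0.1, 3.0, -0.5]
start_pos, end_pos = 1, 3

total = cross_entropy(start_logits, start_pos) + cross_entropy(end_logits, end_pos)
print(round(total, 4))
```

The loss is small when the highest logit sits on the true position and grows as probability mass moves to other positions.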

In [None]:
def get_selected_text(text_encode, start_idx, end_idx):
    text_encode = text_encode[start_idx: end_idx + 1]
    selected_text = bert_tokenizer.decode(text_encode)
    return selected_text


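def get_original_text(text_encode):
    # Skip the `[CLS]` sentiment `[SEP]` prefix, then keep tokens up to the closing `[SEP]`
    text_encode = text_encode[3:]
    last_index = len(text_encode)  # used when no `[SEP]` is found
    for i, encode in enumerate(text_encode):
        if encode == bert_tokenizer.sep_token_id:
            last_index = i
            break
    return bert_tokenizer.decode(text_encode[:last_index])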
    
       
def jaccard(str1, str2): 
    a = set(str1.lower().split()) 
    b = set(str2.lower().split())
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))


def compute_jaccard_score(text_encode, start_idx, end_idx, start_logits, end_logits):
    start_pred = np.argmax(start_logits)
    end_pred = np.argmax(end_logits)
    # print("start predict index : {} / end predict index : {}".format(start_pred, end_pred))
    # print("start real index : {} / end real index : {}".format(start_idx, end_idx))
    # print(text_encode)
    
    if start_pred > end_pred:
        pred = get_original_text(text_encode)
        # print("get original text", pred)
    else:
        pred = get_selected_text(text_encode, start_pred, end_pred)
        # print("get selected text : ",pred)
        
    true = get_selected_text(text_encode, start_idx, end_idx)
    # print("get label text : ",true)
    return jaccard(true, pred)
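
A quick worked example of the word-level Jaccard score defined above. The function body is copied from this notebook; the two strings are made up.

```python
def jaccard(str1, str2):
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))


true = "so sad today"
pred = "sad today"

# Shared words: {"sad", "today"}; union: {"so", "sad", "today"}
score = jaccard(true, pred)
print(score)
```

The score is 1.0 for identical word sets and 0.0 when no word is shared. As written, the function raises `ZeroDivisionError` if both strings are empty.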

### 7-3) Training

In [None]:
def train_model(model, dataloaders_dict, criterion, optimizer, num_epochs, filename):
    model.to(device)

    for epoch in range(num_epochs):
        for phase in ['train', 'val']:
            if phase == 'train':
                model.train()
            else:
                model.eval()

            epoch_loss = 0.0
            epoch_jaccard = 0.0
            
            for j, data in enumerate((dataloaders_dict[phase])):
                ids = data['input_ids'].to(device, dtype=torch.int64)
                masks = data['attention_masks'].to(device, dtype=torch.int64)
                token = data['token_type_ids'].to(device, dtype=torch.int64)
                start_idx = data['start_index'].to(device, dtype=torch.int64)
                end_idx = data['end_index'].to(device, dtype=torch.int64)

                optimizer.zero_grad()

                with torch.set_grad_enabled(phase == 'train'):
                    
                    start_logits, end_logits = model(ids, masks, token)
                    
                    loss = criterion(start_logits, end_logits, start_idx, end_idx)
                    
                    if phase == 'train':
                        loss.backward()
                        optimizer.step()

                    epoch_loss += loss.item() * len(ids)
                    
                    start_idx = start_idx.cpu().detach().numpy()
                    end_idx = end_idx.cpu().detach().numpy()
                    start_logits = torch.softmax(start_logits, dim=1).cpu().detach().numpy()
                    end_logits = torch.softmax(end_logits, dim=1).cpu().detach().numpy()
                    
                    
                    for i in range(len(ids)):                        
                        jaccard_score = compute_jaccard_score(
                            ids[i],
                            start_idx[i],  # after encoding
                            end_idx[i],
                            start_logits[i],  # after encoding
                            end_logits[i])
                        epoch_jaccard += jaccard_score
                    
            epoch_loss = epoch_loss / len(dataloaders_dict[phase].dataset)
            epoch_jaccard = epoch_jaccard / len(dataloaders_dict[phase].dataset)
            
            print('Epoch {}/{} | {:^5} | Loss: {:.4f} | Jaccard: {:.4f}'.format(
                epoch + 1, num_epochs, phase, epoch_loss, epoch_jaccard))
    
    torch.save(model.state_dict(), filename)

* Training Bert Model

In [None]:
import gc
gc.collect()
torch.cuda.empty_cache()

In [None]:
EPOCHS = 3
SAVE_MODEL_PATH = 'bert.pth'

In [None]:
model = Model() 
optimizer = optim.AdamW(model.parameters(), lr=3e-5, betas=(0.9, 0.999))
criterion = loss_fn

train_model(
    model, 
    dict_loader,
    criterion, 
    optimizer, 
    EPOCHS,
    SAVE_MODEL_PATH)

### 7-4) Evaluating

In [None]:
predictions = []

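model = Model()
# Load the saved weights onto whichever device is available (GPU or CPU)
model.load_state_dict(torch.load(SAVE_MODEL_PATH, map_location=device))
model.to(device)
model.eval()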


for data in test_loader:
    ids = data['input_ids'].to(device, dtype=torch.int64)
    masks = data['attention_masks'].to(device, dtype=torch.int64)
    token = data['token_type_ids'].to(device, dtype=torch.int64)
                                          
    start_logits = []
    end_logits = []
    with torch.no_grad():
        start_logit, end_logit = model(ids, masks, token)
        start_logits = torch.softmax(start_logit, dim=1).cpu().detach().numpy()
        end_logits = torch.softmax(end_logit, dim=1).cpu().detach().numpy()
    
    for i in range(len(ids)):    
        start_pred = np.argmax(start_logits[i])
        end_pred = np.argmax(end_logits[i])
        if start_pred > end_pred:
            pred = get_original_text(ids[i])
        else:
            pred = get_selected_text(ids[i], start_pred, end_pred)
        predictions.append(pred)
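
The decoding rule in the loop above can be summarized on plain lists: take the most likely start and end positions, and fall back to the whole tweet when they are inverted. This is a simplified sketch that works on word tokens instead of BERT ids; `decode_span` and the probabilities are made up for illustration.

```python
def decode_span(tokens, start_probs, end_probs):
    """Pick the argmax start/end; return the full text if start comes after end."""
    start = max(range(len(start_probs)), key=start_probs.__getitem__)
    end = max(range(len(end_probs)), key=end_probs.__getitem__)
    if start > end:
        return " ".join(tokens)  # inverted span: fall back to the whole tweet
    return " ".join(tokens[start:end + 1])


tokens = ["so", "sad", "today"]

# Consistent prediction: start at "sad", end at "today".
normal = decode_span(tokens, [0.1, 0.7, 0.2], [0.1, 0.2, 0.7])

# Inverted prediction: start after end, so the full text is returned.
fallback = decode_span(tokens, [0.1, 0.1, 0.8], [0.8, 0.1, 0.1])
print(normal, "|", fallback)
```

Falling back to the full tweet is a reasonable default here because the Jaccard score still rewards the overlapping words.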

In [None]:
test_df.head()

In [None]:
print("[Prediction head]")
for i,pred in enumerate(predictions[:5]):
    print("index",i,": ",pred)

# 8. Submission
* Submit the predictions

In [None]:
submission = pd.read_csv('../input/tweet-sentiment-extraction/sample_submission.csv')
submission['selected_text'] = predictions
submission.head()

In [None]:
submission.to_csv('submission.csv', index=False)

##### References
* https://www.kaggle.com/subinium/tps-apr-highlighting-the-data : EDA part (Visualization)
* https://www.kaggle.com/shoheiazuma/tweet-sentiment-roberta-pytorch : Feature Engineering & Modeling (Bert)
* https://www.kaggle.com/abhishek/bert-base-uncased-using-pytorch : Modeling (Bert)
* https://www.kaggle.com/parulpandey/eda-and-preprocessing-for-bert : Modeling (Bert)
