In [1]:
!pip install transformers



In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split

### Loading data

In [3]:
data = pd.read_csv('/content/Twitter_Data.csv')
data.head()

Unnamed: 0,clean_text,category
0,when modi promised “minimum government maximum...,-1.0
1,talk all the nonsense and continue all the dra...,0.0
2,what did just say vote for modi welcome bjp t...,1.0
3,asking his supporters prefix chowkidar their n...,1.0
4,answer who among these the most powerful world...,1.0


### Data preprocessing

In [4]:
data = data[data['category'] != 0].copy() # remove neutral examples (category = 0); copy so later edits do not touch a slice
data['category'] = data['category'].replace(-1, 0) # relabel negative examples as 0, keep 1 for the positive ones
data.reset_index(inplace = True) # the old row labels are kept in a new 'index' column
data.head()


Unnamed: 0,index,clean_text,category
0,0,when modi promised “minimum government maximum...,0.0
1,2,what did just say vote for modi welcome bjp t...,1.0
2,3,asking his supporters prefix chowkidar their n...,1.0
3,4,answer who among these the most powerful world...,1.0
4,8,with upcoming election india saga going import...,1.0


### Create training set

In [5]:
train = pd.DataFrame()
train[['text', 'label']] = data[['clean_text', 'category']]
print(train.isnull().any()) # True means the column contains missing values

text     True
label    True
dtype: bool


In [6]:
# Remove any rows containing missing data
train.dropna(inplace = True)

In [7]:
train.isnull().any()

text     False
label    False
dtype: bool

In [8]:
len(train)

107758

### Train-Test split

In [9]:
text = train['text'].values.tolist()
labels = train['label'].values.tolist()

In [10]:
training_sentences, validation_sentences, training_labels, validation_labels = train_test_split(text, labels, test_size=.2)
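
The split above is drawn at random on every run and does not control the label ratio. A minimal sketch, using toy data rather than the tweet dataset, of how the standard `random_state` and `stratify` parameters of `train_test_split` make the split reproducible and keep the same label ratio in both parts. The seed value 42 and the `toy_*` names are arbitrary choices for illustration.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy stand-in for the tweets: 10 negative (0.0) and 10 positive (1.0) examples.
toy_text = [f"tweet {i}" for i in range(20)]
toy_labels = [0.0] * 10 + [1.0] * 10

tr_text, va_text, tr_labels, va_labels = train_test_split(
    toy_text,
    toy_labels,
    test_size=0.2,
    random_state=42,      # arbitrary seed, fixed only for reproducibility
    stratify=toy_labels,  # keep the 50/50 label ratio in both splits
)

print(Counter(tr_labels), Counter(va_labels))
```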

### Tokenization

In [11]:
from transformers import BertTokenizer

model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)

train_encodings = tokenizer(
    training_sentences,
    truncation = True,
    padding = True
    )
val_encodings = tokenizer(
    validation_sentences,
    truncation = True,
    padding = True
    )

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
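
With `padding = True` the tokenizer pads every sequence to the longest one in the batch, and with `truncation = True` it cuts sequences at the model's maximum length (512 tokens for `bert-base-uncased`). The sketch below imitates that behaviour on plain lists of ids to show what ends up in `input_ids` and `attention_mask`. It is not the tokenizer's implementation: `pad_and_truncate` is a hypothetical helper, the token ids are illustrative, and it ignores special-token handling during truncation.

```python
PAD_ID = 0  # id of the [PAD] token in the bert-base-uncased vocabulary


def pad_and_truncate(batch, max_length):
    """Truncate each id list to max_length, then pad to the longest one."""
    truncated = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [PAD_ID] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [
        [1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Illustrative token ids, not taken from the dataset.
toy_batch = [[101, 7592, 102], [101, 7592, 2088, 999, 102]]
encoded = pad_and_truncate(toy_batch, max_length=4)
print(encoded)
```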


In [12]:
import tensorflow as tf

train_dataset = tf.data.Dataset.from_tensor_slices(
    (
      dict(train_encodings),
      training_labels
    )
  )

val_dataset = tf.data.Dataset.from_tensor_slices(
    (
      dict(val_encodings),
      validation_labels
    )
  )

### Model training

In [13]:
from transformers import TFBertForSequenceClassification

model = TFBertForSequenceClassification.from_pretrained(
    model_name,
    )

All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [14]:
learning_rate = 2e-5
number_of_epochs = 1

optimizer = tf.keras.optimizers.Adam(
    learning_rate=learning_rate
)
loss = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True
)
metric = tf.keras.metrics.SparseCategoricalAccuracy(
    'accuracy'
)
model.compile(
    optimizer=optimizer,
    loss=loss,
    metrics=[metric]
)

In [15]:
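# The batch size is set by .batch(64); fit() should not also be given
# batch_size when the input is a tf.data.Dataset.
model.fit(
    train_dataset.shuffle(100).batch(64),
    epochs = number_of_epochs,
    validation_data = val_dataset.batch(64) # validation data does not need shuffling
  )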



<keras.src.callbacks.History at 0x7b679e42b100>

### Testing

In [20]:
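def get_sentiment(sentence):
  """Print 'Negative' or 'Positive' for a single sentence."""
  # Convert the sentence to a (1, sequence_length) tensor of token ids
  input_encoding = tokenizer.encode(
      sentence,
      truncation = True,
      padding = True,
      return_tensors = 'tf'
  )
  # The first element of the model output holds the logits, shape (1, 2)
  output = model.predict(input_encoding)[0]
  output_prediction = tf.nn.softmax(output, axis=1) # logits -> class probabilities
  labels = ['Negative', 'Positive'] # index 0 = negative, index 1 = positive
  label = tf.argmax(output_prediction, axis=1) # index of the most probable class
  label = label.numpy()
  print(labels[label[0]])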
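
The mapping from logits to a label in `get_sentiment` can be followed by hand. A minimal sketch in plain Python, assuming made-up logit values (`toy_logits` is not real model output): softmax turns the two raw scores into probabilities, and the index of the larger one selects the label.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    highest = max(logits)
    # Subtracting the maximum keeps exp() numerically stable.
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


labels = ['Negative', 'Positive']
toy_logits = [-1.2, 2.3]  # made-up values, not real model output

probs = softmax(toy_logits)
predicted = labels[probs.index(max(probs))]  # argmax over the two classes
print(predicted, probs)
```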

In [21]:
sentence = " I hate any one can hurt you "
get_sentiment(sentence)

Negative


In [22]:
sentence = "I hate the selfishness in you"
get_sentiment(sentence)

Negative


In [23]:
sentence = "I love NLP"
get_sentiment(sentence)

Positive


In [24]:
sentence = "I am so happy"
get_sentiment(sentence)

Positive


### Using the Hugging Face Transformers pipeline

In [26]:
from transformers import pipeline

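# Note: this rebinds the name `pipeline` from the factory function to the pipeline object it returns
pipeline = pipeline("sentiment-analysis")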

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



In [27]:
pipeline('I hate any one can hurt you')

[{'label': 'POSITIVE', 'score': 0.9785550236701965}]

In [28]:
pipeline('I hate the selfishness in you')

[{'label': 'NEGATIVE', 'score': 0.9951192140579224}]

In [29]:
pipeline('I love NLP')

[{'label': 'POSITIVE', 'score': 0.9997692704200745}]

In [30]:
pipeline('I am so happy')

[{'label': 'POSITIVE', 'score': 0.9998812675476074}]
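
The two approaches report labels in different formats (`Negative`/`Positive` versus `NEGATIVE`/`POSITIVE` with a score). A small sketch, using only the outputs recorded in the cells above, that normalizes the pipeline format and lists the sentences on which the two models disagree. `to_notebook_label` is a hypothetical helper name, and surrounding whitespace in the first test sentence is dropped so the keys match.

```python
def to_notebook_label(pipeline_output):
    """Return 'Negative' or 'Positive' for a single-sentence pipeline result."""
    top = pipeline_output[0]          # one dict per input sentence
    return top['label'].capitalize()  # 'POSITIVE' -> 'Positive'


# Predictions printed by get_sentiment in the Testing section.
bert_predictions = {
    'I hate any one can hurt you': 'Negative',
    'I hate the selfishness in you': 'Negative',
    'I love NLP': 'Positive',
    'I am so happy': 'Positive',
}

# Raw pipeline outputs copied from the cells above.
pipeline_outputs = {
    'I hate any one can hurt you': [{'label': 'POSITIVE', 'score': 0.9785550236701965}],
    'I hate the selfishness in you': [{'label': 'NEGATIVE', 'score': 0.9951192140579224}],
    'I love NLP': [{'label': 'POSITIVE', 'score': 0.9997692704200745}],
    'I am so happy': [{'label': 'POSITIVE', 'score': 0.9998812675476074}],
}

disagreements = [
    sentence
    for sentence, output in pipeline_outputs.items()
    if to_notebook_label(output) != bert_predictions[sentence]
]
print(disagreements)
```

On these four sentences the models differ only on the first one.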