In [1]:
import numpy as np
import pandas as pd


The line of code below imports two classes from the `transformers` library, which Hugging Face provides for Natural Language Processing (NLP) tasks.

1. `AutoTokenizer`: This class loads the tokenizer that matches a pretrained checkpoint. Tokenization splits text into tokens, which in NLP are usually words or subwords, and maps them to the integer IDs the model takes as input. It is a required step in preparing text for a machine learning model.

2. `TFAutoModelForSequenceClassification`: This class loads a TensorFlow model with a sequence classification head. Sequence classification is a predictive modeling problem in which the input is a sequence and the task is to predict a category for the whole sequence. A typical example is sentiment analysis, where the goal is to classify a sequence of words (such as a sentence or a tweet) as positive, negative, or neutral.
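
To make the tokenization step concrete, here is a toy sketch of what padding and truncation do to a batch of texts. It is an illustration only: the vocabulary, the `toy_encode` helper, and the whitespace splitting are all made up for this example, whereas a real BERT tokenizer uses WordPiece subwords and a far larger vocabulary.

```python
# Toy illustration only: whitespace tokens and a made-up vocabulary.
PAD_ID, UNK_ID = 0, 1
vocab = {"[PAD]": PAD_ID, "[UNK]": UNK_ID,
         "that": 2, "guy": 3, "is": 4, "well": 5, "he": 6}


def toy_encode(texts, max_length=8):
    """Map texts to equal-length ID lists plus attention masks."""
    # Tokenize (lowercase + whitespace split), look up IDs, truncate
    rows = [
        [vocab.get(tok, UNK_ID) for tok in text.lower().split()][:max_length]
        for text in texts
    ]
    # Pad every row to the longest row in the batch
    longest = max((len(row) for row in rows), default=0)
    input_ids = [row + [PAD_ID] * (longest - len(row)) for row in rows]
    # The mask is 1 for real tokens and 0 for padding
    attention_mask = [
        [1] * len(row) + [0] * (longest - len(row)) for row in rows
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


encoded = toy_encode(["That guy is well", "He is useless"])
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

The word "useless" is not in the toy vocabulary, so it maps to the unknown-token ID, and the shorter sentence is padded to the length of the longer one.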

In [1]:
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

In [2]:
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("./sentiment_transfer_learning_tensorflow/")

# Load model
loaded_model = TFAutoModelForSequenceClassification.from_pretrained('./sentiment_transfer_learning_tensorflow/')

Some layers from the model checkpoint at ./sentiment_transfer_learning_tensorflow/ were not used when initializing TFBertForSequenceClassification: ['dropout_113']
- This IS expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertForSequenceClassification were initialized from the model checkpoint at ./sentiment_transfer_learning_tensorflow/.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassification for predictions without further training.


In [3]:
logitpreds = loaded_model(tokenizer(["He is useless, I dont know why he came to our neighbourhood",
                                     "That guy is well", "He is such a retard"],
                                    return_tensors="np",padding=True,truncation=True))['logits']

print(logitpreds)

tf.Tensor(
[[-0.42841133  0.23808081]
 [-0.45384634  0.21313922]
 [-0.39815122  0.22506994]], shape=(3, 2), dtype=float32)


In [4]:
import tensorflow as tf

# Convert logits to class probabilities, then pick the most likely class
probabilities = tf.nn.softmax(logitpreds).numpy()
predictions = np.argmax(probabilities, axis=1)
print(predictions)

[1 1 1]
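
All three sentences come out as class 1. As a rough check, the sketch below recomputes the softmax with NumPy on the logits printed earlier. The `softmax` helper is written here for illustration and is not part of the notebook. Because softmax preserves the ordering within each row, taking the argmax of the logits gives the same classes as taking it after the softmax.

```python
import numpy as np

# Logits copied from the tf.Tensor printed by the previous cell
logits = np.array([
    [-0.42841133, 0.23808081],
    [-0.45384634, 0.21313922],
    [-0.39815122, 0.22506994],
])


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


probs = softmax(logits)
print(probs.round(3))

# The argmax is the same before and after the softmax
same_classes = np.array_equal(probs.argmax(axis=1), logits.argmax(axis=1))
print(same_classes)
```

The class-1 probabilities are nearly the same for all three inputs, which may suggest that the model barely distinguishes between these sentences.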


In [5]:
# Map class indices to sentiment labels
predict_score_and_class_dict = {0: 'Negative', 1: 'Positive'}

for pred in predictions:
    print(predict_score_and_class_dict[pred])

Positive
Positive
Positive


In [6]:
def predict_sentiment(text):
    # Process the text using the loaded tokenizer
    tokens = tokenizer(
        [text],
        return_tensors="tf",
        padding=True,
        truncation=True
    )

    # Get the model predictions
    preds = loaded_model(tokens)['logits']
    class_pred = np.argmax(preds, axis=1)[0]

    # Return the predicted sentiment label
    return predict_score_and_class_dict[class_pred]

I loaded the unlabeled Truth Social data that I had cleaned earlier.

In [7]:
import pandas as pd
df = pd.read_csv("unlabeled_data_cleaned.csv", encoding='latin-1')
df

Unnamed: 0,text
0,q be ready anons - public awakening coming - q...
1,enough is enough retruth
2,sjustthenewscompolitics-policyall-things-trump...
3,stmerealxreport
4,cecebloomwood
...,...
739728,bob lighthizer did a great job for america sww...
739729,the time to stand up to this growing tyranny i...
739730,swwwmiamiheraldcomnewspolitics-governmentartic...
739731,swwwfoxnewscompoliticstrump-loves-the-idea-of-...


In [11]:
# Convert all non-string types to strings
df['text'] = df['text'].astype(str)




To save time, I tested the model on a 10% sample of the data.


In [12]:
# make a subset of 10 percent of the data
df_sub = df.sample(frac=0.1, random_state=100)


In [13]:
# Now apply your function
df_sub['result'] = df_sub['text'].apply(predict_sentiment)

KeyboardInterrupt: 
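
The run above was interrupted, possibly because `apply` calls the model once per row, which is slow on a sample of this size. One common alternative is to send the texts to the model in batches. The sketch below shows only the batching logic. `predict_in_batches` and `fake_predict_batch` are hypothetical names introduced here, and the fake predictor stands in for the real model so that the example runs on its own.

```python
def predict_in_batches(texts, predict_batch, batch_size=32):
    """Apply predict_batch to consecutive slices of texts and collect labels."""
    labels = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        # predict_batch must return one label per text in the batch
        labels.extend(predict_batch(batch))
    return labels


# Hypothetical stand-in for the real model, used only for this sketch
def fake_predict_batch(batch):
    return ["Positive" if "great" in text else "Negative" for text in batch]


sample_texts = ["a great job", "enough is enough", "great news", "tyranny", "ok"]
results = predict_in_batches(sample_texts, fake_predict_batch, batch_size=2)
print(results)
```

With the objects from this notebook, `predict_batch` could tokenize the whole batch at once with `tokenizer(batch, return_tensors="tf", padding=True, truncation=True)`, pass the result to `loaded_model`, take the argmax of the logits along axis 1, and map each index through `predict_score_and_class_dict`. The output list could then be assigned to `df_sub['result']` after converting the text column with `df_sub['text'].tolist()`.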

In [None]:
# Print the results
print(df_sub[['text', 'result']])

# Describe the results
df_sub['result'].describe()
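
On a column of string labels, `describe()` reports only the count, the number of unique values, the most frequent value, and its frequency. For a full breakdown per label, `value_counts()` may be more informative. The sketch below uses a small made-up Series in place of `df_sub['result']`.

```python
import pandas as pd

# Made-up labels standing in for df_sub['result']
result = pd.Series(["Positive", "Negative", "Positive", "Positive"], name="result")

# Absolute number of rows per label
counts = result.value_counts()
# Share of rows per label (sums to 1)
shares = result.value_counts(normalize=True)

print(counts)
print(shares)
```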



I saved the results to a CSV file and plotted the distribution of the predicted labels.

In [None]:
# save the results to a csv file
df_sub.to_csv('sentiment_results1.csv', index=True)

In [None]:
# plot the distribution of the results
import matplotlib.pyplot as plt
df_sub['result'].value_counts().plot(kind='bar')
plt.show()
