Cleaning the dataset first is a must.
The dataset is organized like so:
dev-articles -> Articles for testing the results
train-articles -> Articles for training the model
train-labels-task1 -> Labels of propaganda technique in articles
train-labels-task2 -> Labels of propaganda technique with lines in articles

First, gather the labels from the task1 folder and put them in a dict where the article name is the key and the label lines (article ID, start offset, end offset of each propaganda span) are the values.

In [8]:
import os
articles = os.listdir("datasets/train-articles/") # this is where our news articles are located
propagandaTagsSpan = os.listdir("datasets/train-labels-task1-span-identification") # this is where our tags are located
articles.sort()
propagandaTagsSpan.sort()
propTagsSpan = {} # Dictionary with the news article name as key and its propaganda span labels as values

for i in range(len(articles)):
    article = articles[i]
    articleNoExt = os.path.splitext(article)[0] # remove the .txt file extension ([2])
    articles[i] = articleNoExt # replace articles[i] with the same name but without the .txt extension
    
    tagPath = "datasets/train-labels-task1-span-identification/" + articleNoExt + ".task1-SI.labels"
    with open(tagPath) as f: # the with block closes the file, so no explicit f.close() is needed
        tags = f.readlines()
    # replace \t and \n in tags with " " for easier processing later on
    tags = [tag.replace("\t", " ").replace("\n", " ") for tag in tags]
    propTagsSpan[articleNoExt] = tags

print(propTagsSpan[articles[0]])

['111111111 265 323 ', '111111111 1795 1935 ', '111111111 149 157 ', '111111111 1069 1091 ', '111111111 1334 1462 ', '111111111 1577 1616 ', '111111111 2023 2086 ']
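
As a minimal sketch of the label format shown above, the following turns label lines into integer (start, end) pairs. It assumes each line holds an article ID, a start offset and an end offset separated by whitespace, as in the printed output. The helper name parse_span_labels is hypothetical and not part of this notebook.

```python
def parse_span_labels(lines):
    """Turn label lines such as '111111111 265 323' into sorted (start, end) pairs."""
    spans = []
    for line in lines:
        parts = line.split()  # split() handles tabs, spaces and trailing newlines
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        spans.append((int(parts[1]), int(parts[2])))
    return sorted(spans)


# Hypothetical in-memory lines in the same layout as the label files
example = ["111111111\t265\t323\n", "111111111\t149\t157\n", "\n"]
spans = parse_span_labels(example)
print(spans)
```

Keeping the offsets as integers from the start avoids splitting and converting the same strings again in the later cells.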


Using the dict created, read all of the sentences that have been annotated as "propaganda" from the 'train-articles' folder, and put them in a list which will be named 'propSentencesSpan'.

In [9]:
propSentencesSpan = []

for article in articles:
    artPath = "datasets/train-articles/" + article + ".txt"
    
    labels = propTagsSpan[article]
    
    with open(artPath, encoding="utf-8") as f:
        wholeArticle = f.read()
        for label in labels:
            label = label.split()
            start = int(label[1])
            end = int(label[2])
            
            labeledLine = wholeArticle[start:end]
            labeledLine = labeledLine.replace("\n", " ")
            labeledLine = labeledLine.replace("\t", " ")
          
            propSentencesSpan.append(labeledLine)
    
print(propSentencesSpan[0])

The next transmission could be more pronounced or stronger


Create a dictionary that pairs the propaganda sentences with their associated propaganda label. This is to set up the data to be put into a Pandas DataFrame.

*Using the list of propaganda sentences that we've gathered, create another list, 'notPropSentences', which will contain sentences from the articles that have not been annotated as propaganda.*

In [10]:
import nltk
notPropSentences = []

count = 0
maxNum = len(propSentencesSpan) # we want an equal number of propaganda and non-propaganda sentences to create a balanced training set
for article in articles:
    artPath = "datasets/train-articles/" + article + ".txt"
    with open(artPath, encoding="utf-8") as f:
        wholeArticle = f.read()
        
        # Collect the SPANNED lines of propaganda in this article so that non-propaganda lines can be detected
        currentPropSentences = []
        tags = propTagsSpan[article]
        for tag in tags:
            tag = tag.split()
            start = int(tag[1])
            end = int(tag[2])
            taggedLine = wholeArticle[start:end]
            taggedLine = taggedLine.replace("\n", " ")
            taggedLine = taggedLine.replace("\t", " ")
            currentPropSentences.append(taggedLine)
        
        sentences = nltk.sent_tokenize(wholeArticle)
        for sentence in sentences:
            if(count == maxNum):
                break
            notProp = True
            sentence = sentence.replace("\n", " ")
            sentence = sentence.replace("\t", " ")
            for propSentence in currentPropSentences:
                if(propSentence in sentence):
                    notProp = False
                    
            if(notProp): 
                count +=1
                notPropSentences.append(sentence)

print(len(propSentencesSpan))
print(len(notPropSentences))
print(notPropSentences[0])

5468
5468
An outbreak of both bubonic plague, which is spread by infected rats via flea bites, and pneumonic plague, spread person to person, has killed more than 200 people in the Indian Ocean island nation since August.
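
The cell above decides whether a sentence is propaganda with a substring test, which can miss a labeled span that crosses a sentence boundary. As an alternative sketch (not the method used in this notebook), the check can be done on character offsets instead. The regex splitter below is a simplified stand-in for nltk.sent_tokenize, and both function names are hypothetical.

```python
import re


def split_with_offsets(text):
    """Split text into rough sentences, keeping (start, end, sentence) offsets.

    Simplified stand-in for nltk.sent_tokenize: a sentence ends at '.', '!' or '?'.
    """
    return [(m.start(), m.end(), m.group())
            for m in re.finditer(r"[^.!?]+[.!?]*", text)]


def non_propaganda_sentences(text, spans):
    """Return sentences whose character range overlaps none of the labeled spans."""
    kept = []
    for start, end, sentence in split_with_offsets(text):
        # Two half-open ranges overlap when each one starts before the other ends
        overlaps = any(start < s_end and s_start < end for s_start, s_end in spans)
        if not overlaps and sentence.strip():
            kept.append(sentence.strip())
    return kept


text = "Plain fact here. They are ALL traitors! Another plain fact."
# Hypothetical labeled span covering "ALL traitors"
spans = [(text.index("ALL"), text.index("traitors") + len("traitors"))]
result = non_propaganda_sentences(text, spans)
print(result)
```

Because the comparison uses offsets, it does not depend on the span text surviving the newline and tab replacement unchanged.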


Build a dataset from the propaganda and non-propaganda sentences.

In [11]:
import pandas as pd

# In order to use pandas, we create a dict mapping each column name to a list of values, which we can then convert into a Pandas DataFrame
sentencesToCSV = {}
sentencesToCSV["Propaganda"] = []
sentencesToCSV["Sentence"] = []

# A special dict for the Keras Logistic Regression Model:

sentencesToCSVKeras = {}
sentencesToCSVKeras["Propaganda"] = []
sentencesToCSVKeras["Sentence"] = []

for sentence in propSentencesSpan: 
    sentencesToCSV["Propaganda"].append("Yes")
    sentencesToCSV["Sentence"].append(sentence) 
    
    sentencesToCSVKeras["Propaganda"].append(1)
    sentencesToCSVKeras["Sentence"].append(sentence)
    

for sentence in notPropSentences:  
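    # newlines and tabs were already replaced when notPropSentences was built
    # (str.replace returns a new string, so calling it here without assignment had no effect)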
    
    sentencesToCSV["Propaganda"].append("No")
    sentencesToCSV["Sentence"].append(sentence) 
    
    sentencesToCSVKeras["Propaganda"].append(0)
    sentencesToCSVKeras["Sentence"].append(sentence)


df = pd.DataFrame.from_dict(sentencesToCSV)
dfKeras = pd.DataFrame.from_dict(sentencesToCSVKeras)

df.head()

Unnamed: 0,Propaganda,Sentence
0,Yes,The next transmission could be more pronounced...
1,Yes,when (the plague) comes again it starts from m...
2,Yes,appeared
3,Yes,"a very, very different"
4,Yes,He also pointed to the presence of the pneumon...


Convert the training data to NumPy arrays.
Also split the dataset, with shuffling, into a training set and a test set, so that the model is evaluated on sentences it has not seen during training.

In [12]:
import pandas as pd
from sklearn.model_selection import train_test_split

train_sentences, test_sentences, train_tags, test_tags = train_test_split(df["Sentence"],
                                                                      df["Propaganda"],
                                                                      test_size=0.1, 
                                                                      random_state=10,
                                                                      stratify=df["Propaganda"])

train_tags = train_tags.to_numpy()
train_sentences = train_sentences.to_numpy()
# Testing set (what we will use to test the trained model)
test_tags = test_tags.to_numpy()
test_sentences = test_sentences.to_numpy()


print(train_sentences[1])
print(train_tags[1])


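# Do the same thing for the Keras df
# NOTE: this reuses the names train_sentences, test_sentences, train_tags and test_tags,
# so after this cell they refer to the Keras split (pandas Series with 0/1 tags)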

train_sentences, test_sentences, train_tags, test_tags = train_test_split(dfKeras["Sentence"],
                                                                      dfKeras["Propaganda"],
                                                                      test_size=0.1, 
                                                                      random_state=10,
                                                                      stratify=dfKeras["Propaganda"])

train_tags_keras = train_tags.to_numpy()
train_sentences_keras = train_sentences.to_numpy()
# Testing set (what we will use to test the trained model)
test_tags_keras = test_tags.to_numpy()
test_sentences_keras = test_sentences.to_numpy()

Teams are motivated and working hard.
No
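
To illustrate what the stratify argument is for, here is a pure-Python sketch of a stratified shuffle split. It is a simplified illustration and not scikit-learn's implementation; the function name stratified_split and the toy data are assumptions made for this example.

```python
import random
from collections import defaultdict


def stratified_split(items, labels, test_fraction=0.1, seed=10):
    """Shuffle and split so each label keeps the same share in train and test."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for item, label in zip(items, labels):
        by_label[label].append(item)

    train, test = [], []
    for label, group in by_label.items():
        rng.shuffle(group)  # shuffle within each class
        n_test = round(len(group) * test_fraction)
        test += [(x, label) for x in group[:n_test]]
        train += [(x, label) for x in group[n_test:]]
    rng.shuffle(train)  # mix the classes again
    rng.shuffle(test)
    return train, test


# Toy data: 10 propaganda and 10 non-propaganda "sentences"
items = [f"sentence {i}" for i in range(20)]
labels = ["Yes"] * 10 + ["No"] * 10
train, test = stratified_split(items, labels)
print(len(train), len(test))
```

With a balanced dataset like the one built here, stratifying keeps the test set balanced as well, so accuracy on it is not skewed by an uneven class mix.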


Turn the sentences into bag-of-words count vectors. For the Keras model, the resulting sparse matrices are additionally converted to dense arrays.

In [39]:
from sklearn.feature_extraction.text import CountVectorizer


count_vect = CountVectorizer()
train_counts = count_vect.fit_transform(train_sentences)
test_counts = count_vect.transform(test_sentences)
print(train_counts.shape)
print(train_tags.shape)


# Same thing but for Keras

count_vect_keras = CountVectorizer()
train_counts_keras = count_vect_keras.fit_transform(train_sentences_keras).toarray()
test_counts_keras = count_vect_keras.transform(test_sentences_keras).toarray()

(9842, 13363)
(9842,)
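
The following pure-Python sketch shows roughly what the vectorizer does: it learns a vocabulary from the training sentences and then counts those terms in any sentence. It is a simplified illustration under stated assumptions, not scikit-learn's code; it lowercases the text, uses the same regular expression as CountVectorizer's default token pattern, and the function names are hypothetical.

```python
import re
from collections import Counter

# Tokens of two or more word characters, as in CountVectorizer's default token_pattern
TOKEN = re.compile(r"\b\w\w+\b")


def tokenize(sentence):
    return TOKEN.findall(sentence.lower())


def fit_vocabulary(sentences):
    """Map every term seen in the training sentences to a column index."""
    terms = sorted({tok for s in sentences for tok in tokenize(s)})
    return {term: i for i, term in enumerate(terms)}


def transform(sentences, vocabulary):
    """Count vocabulary terms per sentence; unseen terms are ignored."""
    rows = []
    for s in sentences:
        counts = Counter(tokenize(s))
        rows.append([counts.get(term, 0) for term in vocabulary])
    return rows


train = ["the plague is back", "the plague the rats"]
vocab = fit_vocabulary(train)
row = transform(["the rats and the plague"], vocab)[0]
print(vocab)
print(row)

# A vocabulary fitted on a single sentence has only that sentence's terms
small_vocab = fit_vocabulary(["In 2000 the 21st century started"])
print(len(small_vocab))
```

The number of columns is fixed by the vocabulary learned at fit time, which is why the training matrix above has 13363 columns and why new sentences must be transformed with the same fitted vectorizer.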


Define the precision and recall metric functions for the Keras model, and the Keras model itself.
IMPORTANT NOTE: the metric functions are not my code, nor are they changed from the original source. They are commonly used and well optimized, so there is no need to change them.

In [14]:
from keras.models import Sequential 
from keras.layers import Dense, Activation 
from keras import backend as K

# The functions below were taken from [3]
def recall_m(y_true, y_pred):
        true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
        possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
        recall = true_positives / (possible_positives + K.epsilon())
        return recall

def precision_m(y_true, y_pred):
        true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
        predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
        precision = true_positives / (predicted_positives + K.epsilon())
        return precision



keras_lr_1 = Sequential() 
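keras_lr_1.add(Dense(input_dim = 13363, units = 1)) # 13363 is the number of features (vocabulary size) for task 1, 1 is the output dimension of the test tag which is 0 or 1 
keras_lr_1.add(Activation('relu')) # NOTE: logistic regression normally uses 'sigmoid' here; 'relu' output is not confined to [0, 1]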
keras_lr_1.compile(optimizer='sgd', loss='binary_crossentropy', metrics=['accuracy', recall_m, precision_m])
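
As a small sketch of why the choice of output activation matters for a binary classifier, the code below compares sigmoid and ReLU on a few pre-activation values. It is an illustration in plain Python, not Keras code, and the clipping value eps is an assumption made for this example.

```python
import math


def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def relu(z):
    """Pass positive values through unchanged and map the rest to 0."""
    return max(0.0, z)


def binary_crossentropy(y_true, p, eps=1e-7):
    """Log loss for one example; p is clipped so that log() is defined."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


for z in (-2.0, 0.0, 3.0):
    print(z, round(sigmoid(z), 3), relu(z))

# A positive example with a negative pre-activation
loss_sigmoid = binary_crossentropy(1, sigmoid(-2.0))
loss_relu = binary_crossentropy(1, relu(-2.0))
print(loss_sigmoid, loss_relu)
```

Sigmoid outputs can always be read as probabilities, while ReLU outputs can be exactly 0 or larger than 1. A unit that outputs 0 for every input would predict the negative class throughout, which on a balanced test set would give 50% accuracy with zero precision and recall; this may be worth checking if the Keras metrics look degenerate.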






Define the logistic regression model, using commonly used parameters for the model.

In [15]:
from sklearn.linear_model import LogisticRegression
import datetime
from sklearn import metrics #Import scikit-learn metrics module for accuracy calculation

# What we will use for LogisticRegression
clf_lr = LogisticRegression(solver='lbfgs', multi_class="ovr", max_iter=1000, random_state=1)

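Define a training helper that fits the classifier and records its accuracy on the training data after each epoch. Note that scikit-learn's LogisticRegression.fit starts from scratch on every call unless warm_start is enabled, so repeating it for several epochs does not improve the model further.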

In [16]:
def train_model(clf, X_train, y_train, epochs=10):
    scores = []
    print("Starting training...")
    for epoch in range(1, epochs + 1):
        print("Epoch:" + str(epoch) + "/" + str(epochs) + " -- " + str(datetime.datetime.now()))
        clf.fit(X_train, y_train)
        score = clf.score(X_train, y_train)
        scores.append(score)
    print("Done training.")
    return scores

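Define a function for precision: of all the items predicted as the class of interest, the fraction that actually belong to it.
Define another function for recall: of all the items that actually belong to the class of interest, the fraction that the model predicted as such.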

In [17]:
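def precision(actualTags, predictions, classOfInterest):
    # Fraction of items predicted as classOfInterest that really belong to it
    actualCounter = 0
    predCounter = 0
    for i in range(len(predictions)):
        if classOfInterest == predictions[i]:
            predCounter += 1
            if classOfInterest == actualTags[i]:
                actualCounter += 1
    if predCounter == 0:
        return 0.0 # the class was never predicted, avoid dividing by zero
    return actualCounter/predCounter

def recall(actualTags, predictions, classOfInterest):
    # Fraction of items that really are classOfInterest and were predicted as such
    actualTagCounter = 0
    predictionsCounter = 0
    for i in range(len(predictions)):
        if classOfInterest == actualTags[i]:
            actualTagCounter += 1
            if classOfInterest == predictions[i]:
                predictionsCounter += 1
    if actualTagCounter == 0:
        return 0.0 # the class never occurs in the actual tags, avoid dividing by zero
    return predictionsCounter/actualTagCounter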
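
A short worked example may help to tell the two measures apart. The sketch below computes both from true positives, false positives and false negatives on a toy list of tags. The function name precision_recall and the data are made up for illustration.

```python
def precision_recall(actual, predicted, positive):
    """Compute precision and recall for one class from paired tag lists."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)  # correct hits
    fp = sum(1 for a, p in pairs if a != positive and p == positive)  # false alarms
    fn = sum(1 for a, p in pairs if a == positive and p != positive)  # misses
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


actual = [1, 1, 1, 1, 0, 0]
predicted = [1, 1, 0, 0, 1, 0]
p, r = precision_recall(actual, predicted, positive=1)
print(p, r)
```

Here two of the three positive predictions are correct, so precision is 2/3, while only two of the four actual positives were found, so recall is 1/2.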

Run 10 epochs of basic training on the Keras model. The number of epochs affects how well the model fits the data.
The accuracy may stagnate because the model is very much surface level: it only sees word counts for each sentence, without any word-by-word (order or context) information.

In [18]:
import tensorflow as tf

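# Use the dense Keras test array (test_counts_keras) for validation and evaluation, matching the dense training array
keras_lr_1.fit(train_counts_keras, train_tags_keras, epochs= 10, batch_size=128, verbose=1, validation_data=(test_counts_keras, test_tags_keras))

loss, accuracy1_keras, recall1_keras, precision1_keras = keras_lr_1.evaluate(test_counts_keras, test_tags_keras, verbose=0)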

print("Accuracy:", accuracy1_keras)
print("Precision:", precision1_keras)
print("Recall:", recall1_keras)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Accuracy: 0.5
Precision: 0.0
Recall: 0.0


Train the scikit-learn model and print its accuracy on the training set after each epoch.

In [19]:
clf_lr_score = train_model(clf_lr, train_counts, train_tags, 10)
y_pred = clf_lr.predict(test_counts)
print("Accuracy:",clf_lr_score)

Starting training...
Epoch:1/10 -- 2024-02-18 23:17:50.582969
Epoch:2/10 -- 2024-02-18 23:17:51.273715
Epoch:3/10 -- 2024-02-18 23:17:51.965838
Epoch:4/10 -- 2024-02-18 23:17:52.982632
Epoch:5/10 -- 2024-02-18 23:17:53.640029
Epoch:6/10 -- 2024-02-18 23:17:54.300503
Epoch:7/10 -- 2024-02-18 23:17:55.210294
Epoch:8/10 -- 2024-02-18 23:17:56.074222
Epoch:9/10 -- 2024-02-18 23:17:56.892296
Epoch:10/10 -- 2024-02-18 23:17:57.657264
Done training.
Accuracy: [0.9348709611867506, 0.9348709611867506, 0.9348709611867506, 0.9348709611867506, 0.9348709611867506, 0.9348709611867506, 0.9348709611867506, 0.9348709611867506, 0.9348709611867506, 0.9348709611867506]


Save the logistic regression model and upload it to the folder (saved on GitHub to avoid rerunning the code many times).

In [20]:
import joblib

joblib.dump(clf_lr, "PropDetectionModel.clf")

['PropDetectionModel.clf']

Example import of the model

In [21]:
import joblib

model = joblib.load("PropDetectionModel.clf")

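Example of usage for a single sentence. The sentence must be transformed with the vectorizer that was fitted on the training data, so that it has the 13363 features the model expects. The error below was raised in a run where the vectorizer produced only 6 features, most likely because it had been refit on the sentence itself.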

In [43]:
sentence = ['In 2000 the 21st century started']
print(model.predict(count_vect_keras.transform(sentence)))

ValueError: X has 6 features, but LogisticRegression is expecting 13363 features as input.

Example usage for an array of sentences (for example an article)

In [36]:
sentences = ['In 2000 the 21st century started','Even though no one has noticed it', "Now we live as we live", "One might belive otherwise"]
print(model.predict(count_vect_keras.transform(sentences)))

[0 1 0 1]
