**1) Introduction**

This ML project offers a comprehensive solution to classify news articles into distinct political orientations, aiding users in understanding the underlying biases and perspectives shaping public discourse.

The project comprises a series of intricately connected components:


- *Scraping Applications*: url_scraping.py: This application retrieves a list of URLs containing news articles from diverse sources. news_scraping.py: Upon obtaining URLs, this application extracts the text content of the articles in a raw format.


- *Data Cleaning* (cleaner.py): Once the text data is collected, the cleaner.py script plays a pivotal role in preprocessing the raw text, eliminating redundant information, such as HTML tags, advertisements, and other noise, to ensure the integrity and quality of the dataset.


- *Data Preprocessing*: The collected text data undergoes preprocessing steps such as tokenization, stop-word removal, and possibly stemming or lemmatization to prepare it for analysis.

A series of ML models shall be proposed, evaluated and then compared amongst each other. Finally, simple tf-idf, neural embedding and hybrid search engines will be implemented for the same dataset.

In [1]:
# Importing standard libraries
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt # for visualisation purposes

____________________
**2) Preprocessing Data**

- *OneHotEncoder*: a preprocessing technique that converts categorical variables into a numerical format that machine learning algorithms can work with. It creates one binary column per category and, for each sample, puts 1 in the column matching its category and 0 in the others.

- *ColumnTransformer*: applies different transformations to different columns of a dataset in a single step. It ensures that all transformations are applied consistently during training and prediction, which is useful because real datasets often mix types (e.g. numeric, categorical, text).
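
As a rough illustration of the encoding described above, the following pure-Python sketch builds the binary columns by hand. The helper `one_hot_rows` and the toy labels are hypothetical and not part of scikit-learn. The helper sorts the categories alphabetically, which is also the order `OneHotEncoder` uses when it infers string categories from the data.

```python
def one_hot_rows(values):
    """Return (sorted categories, one binary row per input value)."""
    categories = sorted(set(values))
    position = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for value in values:
        row = [0.0] * len(categories)   # one column per category
        row[position[value]] = 1.0      # 1 only in the matching column
        rows.append(row)
    return categories, rows

# Toy labels, made up for illustration
labels = ["western_conservative", "non_western", "western_progressive", "non_western"]
categories, encoded = one_hot_rows(labels)

print(categories)
for label, row in zip(labels, encoded):
    print(label, row)
```

Because the columns follow alphabetical order rather than order of first appearance, a `western_conservative` sample is encoded as `[0.0, 1.0, 0.0]`.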

In [2]:
# Importing required libraries from sklearn
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

In [3]:
df = pd.read_json('datasets/news.json').drop('source',axis=1)

In [4]:
df.head()

Unnamed: 0,article,orientation
0,Health authorities in one state have issued an...,western_conservative
1,\n'Kennedy Saves the World' podcast host Kenne...,western_conservative
2,\nFormer counterterrorism analyst Jonathan Sch...,western_conservative
3,\nFox News Flash top headlines are here. Check...,western_conservative
4,\nCrowe is charged with harassment and stalkin...,western_conservative


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1164 entries, 0 to 1163
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   article      1164 non-null   object
 1   orientation  1164 non-null   object
dtypes: object(2)
memory usage: 18.3+ KB


In [6]:
df['orientation'].unique()

array(['western_conservative', 'non_western', 'western_progressive'],
      dtype=object)

We have a total of three unique categories: western conservative, non-western and western progressive.

In [7]:
# Turn categories into numbers

# Define the categorical features
categorical_features = ['orientation']

# Initialize the OneHotEncoder
one_hot = OneHotEncoder()

# Initialize the ColumnTransformer
transformer = ColumnTransformer([('one_hot', one_hot, categorical_features)], remainder='passthrough')

# Apply the transformation to your dataframe
df_transformed = transformer.fit_transform(df)
df_transformed[:3]

array([[0.0, 1.0, 0.0,
        'Health authorities in one state have issued an urgent alert for residents who visited a Costco, DFO, businesses and caught trams after two measles cases were infectious while in public.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nVictorian residents have been put on alert after two holidaymakers returning from overseas were unknowingly infectious with measles while out in the community.\nThe Department of Health revealed the new cases on Saturday afternoon, which brings the total measles cases to three after another traveller was identified this week.\nAt least 10 exposure sites have been listed, with the days ranging between Wednesday January 17 and Wednesday January 24, on the department\'s website.\nWant more news? Stream Sky News Australia’s live channel here. \nWednesday January 17 \n6am to 3pm: Bay City Auto Group (and associated construction site) 14 Dandenong Road West, Frankston\n7:30pm to 9pm: Box Hill Action Indoor Sports 9 Clarice Road, Box Hill

In [8]:
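# Turning it into a DataFrame
data = pd.DataFrame(df_transformed)
# OneHotEncoder orders the categories alphabetically, so the encoded columns are
# non_western, western_conservative, western_progressive, followed by the passthrough column
data.columns = ['non_western','western_conservative','western_progressive','article'] # renaming the columns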

# Lowercasing and stripping newlines and punctuation before word embedding
data['article'] = data['article'].str.lower()
chars_to_remove = ['"', "'", ',', '.', '\n', '\\', '/', '—', '_', '’', '-', '@',
                   '–', '‘', '…', '”', '“', ':', '!', '?', '^', '<']
for ch in chars_to_remove:
    # regex=False treats '.', '?' and '^' as literal characters on every pandas version
    data['article'] = data['article'].str.replace(ch, '', regex=False)
data.head()

Unnamed: 0,non_western,western_conservative,western_progressive,article
0,0.0,1.0,0.0,health authorities in one state have issued an...
1,0.0,1.0,0.0,kennedy saves the world podcast host kennedy a...
2,0.0,1.0,0.0,former counterterrorism analyst jonathan schan...
3,0.0,1.0,0.0,fox news flash top headlines are here check ou...
4,0.0,1.0,0.0,crowe is charged with harassment and stalking ...


In [9]:
# Removing stopwords before word embedding
# (requires the NLTK stopword corpus: nltk.download('stopwords'))
from nltk.corpus import stopwords

# Build the stop-word set once instead of on every call
STOP_WORDS = set(stopwords.words('english'))

def remove_stopwords(text):
    words = text.split()
    filtered_words = [word for word in words if word.lower() not in STOP_WORDS]
    return ' '.join(filtered_words)

In [10]:
data['article'] = data['article'].apply(remove_stopwords)

In [11]:
data.head()

Unnamed: 0,non_western,western_conservative,western_progressive,article
0,0.0,1.0,0.0,health authorities one state issued urgent ale...
1,0.0,1.0,0.0,kennedy saves world podcast host kennedy fox n...
2,0.0,1.0,0.0,former counterterrorism analyst jonathan schan...
3,0.0,1.0,0.0,fox news flash top headlines check whats click...
4,0.0,1.0,0.0,crowe charged harassment stalking related acti...


In [12]:
# Saving cleaned data (same path that is read back in the next section)
data.to_csv('datasets/news_clean.csv',index=False)

___________________

**3) Word-Embedding**


Word embedding is a technique that represents words as dense numerical vectors in a continuous vector space, capturing semantic and syntactic relationships between words based on their context in large datasets. Unlike traditional methods like one-hot encoding, word embeddings place similar words closer together in the vector space, allowing models to generalize better by understanding analogies, synonyms, and other linguistic patterns.

- *Word2Vec*: a word embedding technique that uses a shallow neural network to learn vector representations of words based on their contextual usage in large text datasets. It captures semantic and syntactic relationships by predicting either a target word from its neighbors (Continuous Bag of Words, CBOW) or predicting surrounding words from a target word (Skip-gram). The resulting dense vectors place similar words (e.g., "king" and "queen") close together in the vector space, enabling tasks like analogy solving (e.g., "king - man + woman ≈ queen").
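
To make the difference between the two training objectives concrete, here is a minimal sketch that only generates the training examples each objective would see. It does not train a model. The helper `training_pairs`, the window size and the toy sentence are assumptions made for illustration. In gensim, the `sg` parameter of `Word2Vec` selects between the two (`sg=0` for CBOW, `sg=1` for Skip-gram).

```python
def training_pairs(tokens, window=2):
    """Yield (context_words, target_word) for every position in a sentence."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:
            yield context, target

# Toy sentence, made up for illustration
sentence = ["health", "authorities", "issued", "urgent", "alert"]

# CBOW: the whole context window is the input, the target word is the output
cbow_examples = [(context, target) for context, target in training_pairs(sentence)]

# Skip-gram: the target word is the input, each context word is a separate output
skipgram_examples = [
    (target, word)
    for context, target in training_pairs(sentence)
    for word in context
]

print(cbow_examples[0])
print(skipgram_examples[:2])
```

CBOW produces one example per word position, whereas Skip-gram produces one example per (target, neighbor) pair, so it sees more, finer-grained training signals from the same text.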

In [13]:
data = pd.read_csv('datasets/news_clean.csv').dropna()
data.shape

(1162, 4)

In [14]:
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

# tokenizing text data
tokenized_data = [simple_preprocess(article) for article in data['article']]

# training Word2Vec model
# vector_size=100 and window=5 are gensim's defaults; min_count=1 keeps every token, even those seen only once
word2vec_model = Word2Vec(sentences=tokenized_data, vector_size=100, window=5, min_count=1, workers=4)


# retrieving word vectors for each token in the article
word_vectors = []
for tokens in tokenized_data:
    vectors = [word2vec_model.wv[token] for token in tokens if token in word2vec_model.wv]
    if vectors:
        article_vector = sum(vectors) / len(vectors)  # average the word vectors to get one vector per article
        word_vectors.append(article_vector)
    else:
        word_vectors.append(None)  # handle case where all tokens are out-of-vocabulary

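# converting word_vectors to a pandas Series that shares data's index
# (dropna() may leave gaps in the index, so a default 0..n-1 index would misalign rows on assignment)
word_vectors_series = pd.Series(word_vectors, index=data.index, name='word_embeddings')

# adding the word vectors as a new column in the DataFrame
data['word_embeddings'] = word_vectors_series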

In [15]:
data.head()

Unnamed: 0,non_western,western_conservative,western_progressive,article,word_embeddings
0,0.0,1.0,0.0,health authorities one state issued urgent ale...,"[-0.59863424, 0.13775885, 0.012549366, -0.4808..."
1,0.0,1.0,0.0,kennedy saves world podcast host kennedy fox n...,"[-0.85196894, 0.1582429, 0.08021765, -0.659584..."
2,0.0,1.0,0.0,former counterterrorism analyst jonathan schan...,"[-0.6727169, 0.16773795, 0.027970431, -0.64749..."
3,0.0,1.0,0.0,fox news flash top headlines check whats click...,"[-0.9936382, 0.21699953, -0.15789154, -0.89784..."
4,0.0,1.0,0.0,crowe charged harassment stalking related acti...,"[-0.69958556, 0.20131096, 0.039057776, -0.5881..."


In [21]:
# Each embedding is currently stored as a NumPy array
# Converting them to plain Python lists of floats:

# Check the type of the first element
print(type(data['word_embeddings'].iloc[0]))

# If it's already a NumPy array, convert it to a list directly
if isinstance(data['word_embeddings'].iloc[0], np.ndarray):
    data['word_embeddings'] = data['word_embeddings'].apply(lambda x: x.tolist())



<class 'numpy.ndarray'>


In [22]:
type(data['word_embeddings'][0])
# List

list

In [23]:
type(data['word_embeddings'][0][0])
# Float

float

In [None]:
# Saving the word-embedded dataframe (same path that is read back in the next section)
data.to_csv('datasets/news_embedded.csv',index=False)

_____________________
**4) Model Fitting and Evaluation**

The problem is a Text Classification Problem. 

For this kind of problem, the following suitable models will be applied and evaluated:

- Naive Bayes

- Support Vector Machines (SVM)

- Random Forest or Gradient Boosting Machines

- Neural Networks: deep learning models like Convolutional Neural Networks (CNNs) or Recurrent Neural Networks (RNNs).



I) ***Naive Bayes***

In [24]:
data = pd.read_csv('datasets/news_embedded.csv')

In [25]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score


# Collapsing the one-hot label columns into a single label column
data['label'] = data[['western_conservative', 'non_western', 'western_progressive']].idxmax(axis=1)

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data['article'], data['label'], test_size=0.2, random_state=42)

# Vectorizing the text data
vectorizer = CountVectorizer()
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)

# Creating the model and fitting it to the data
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train_vectorized, y_train)

# Making predictions
y_pred = nb_classifier.predict(X_test_vectorized)

In [28]:
# Evaluating the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.9267241379310345


In [29]:
# Getting predicted probabilities
proba = nb_classifier.predict_proba(X_test_vectorized)

# Creating a DataFrame to display results
results_data = pd.DataFrame({'Article': X_test, 'Predicted Label': y_pred})
for i, label in enumerate(nb_classifier.classes_):
    results_data[label + ' Probability'] = proba[:, i]

# Printing the results
print(results_data.iloc[90])

Article                             former president donald trump seeking sweeping...
Predicted Label                                                   western_progressive
non_western Probability                                                           0.0
western_conservative Probability                                                  0.0
western_progressive Probability                                                   1.0
Name: 865, dtype: object


II) ***Support Vector Machine***

In [30]:
from sklearn.svm import SVC # Importing SVC 

# Create and train the SVM model
svm_classifier = SVC(kernel='linear', probability=True) # You can experiment with different kernels
svm_classifier.fit(X_train_vectorized, y_train)

# Make predictions
y_pred_svm = svm_classifier.predict(X_test_vectorized)

# Evaluate the model
accuracy_svm = accuracy_score(y_test, y_pred_svm)
print("SVM Accuracy:", accuracy_svm)

# Getting predicted probabilities
proba_svm = svm_classifier.predict_proba(X_test_vectorized)

# Creating a DataFrame to display results
results_svm = pd.DataFrame({'Article': X_test, 'Predicted Label': y_pred_svm})
for i, label in enumerate(svm_classifier.classes_):
    results_svm[label + ' Probability'] = proba_svm[:, i]

# Printing the results
results_svm.iloc[90]

SVM Accuracy: 0.9827586206896551


Article                             former president donald trump seeking sweeping...
Predicted Label                                                   western_progressive
non_western Probability                                                      0.003567
western_conservative Probability                                             0.001326
western_progressive Probability                                              0.995107
Name: 865, dtype: object

III) ***Random Forest***

In [31]:
from sklearn.ensemble import RandomForestClassifier 


# Create and train the Random Forest model
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42) # You can adjust n_estimators
rf_classifier.fit(X_train_vectorized, y_train)

# Make predictions
y_pred_rf = rf_classifier.predict(X_test_vectorized)

# Evaluate the model
accuracy_rf = accuracy_score(y_test, y_pred_rf)
print("Random Forest Accuracy:", accuracy_rf)

# Getting predicted probabilities
proba_rf = rf_classifier.predict_proba(X_test_vectorized)

# Creating a DataFrame to display results
results_rf = pd.DataFrame({'Article': X_test, 'Predicted Label': y_pred_rf})
for i, label in enumerate(rf_classifier.classes_):
    results_rf[label + ' Probability'] = proba_rf[:, i]

# Printing the results
results_rf.iloc[90]

Random Forest Accuracy: 0.978448275862069


Article                             former president donald trump seeking sweeping...
Predicted Label                                                   western_progressive
non_western Probability                                                          0.01
western_conservative Probability                                                  0.0
western_progressive Probability                                                  0.99
Name: 865, dtype: object

IV) ***Recurrent Neural Networks***

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# Assuming you have a vocabulary size and maximum sequence length
vocab_size = 10000  
max_len = 200      

# Convert text to sequences of integers (tokenization)
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=vocab_size, oov_token="<OOV>")
tokenizer.fit_on_texts(data['article'])
sequences = tokenizer.texts_to_sequences(data['article'])
padded_sequences = tf.keras.preprocessing.sequence.pad_sequences(sequences, maxlen=max_len, truncating='post', padding='post')

# One-hot encode the labels
labels = pd.get_dummies(data['label']).values


# Split the data
X_train, X_test, y_train, y_test = train_test_split(padded_sequences, labels, test_size=0.2, random_state=42)

# Define the model
model = Sequential()
model.add(Embedding(vocab_size, 128, input_length=max_len)) # Embedding layer
model.add(LSTM(128)) # LSTM layer
model.add(Dense(3, activation='softmax')) # Output layer (3 classes)

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(X_train, y_train, epochs=5, batch_size=32, validation_data=(X_test, y_test))

# Evaluate the model
loss, accuracy = model.evaluate(X_test, y_test)
print("RNN Accuracy:", accuracy)

# Make predictions
predictions = model.predict(X_test)





Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
RNN Accuracy: 0.7801724076271057


In [None]:
# Get predicted probabilities
predictions = model.predict(X_test)

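# Create a dataset to display results
# Note: X_test[:, 0] is only the first token id of each padded sequence, not the article text,
# and 'Predicted Label' is the index of the most probable class
results_rnn = pd.DataFrame({'Article': X_test[:, 0], 'Predicted Label': np.argmax(predictions, axis=1)})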

# Get class labels from the one-hot encoded training labels
class_labels = list(pd.get_dummies(data['label']).columns)

for i, label in enumerate(class_labels):
    results_rnn[label + ' Probability'] = predictions[:, i]

# Printing the results for the sample at position 90
results_rnn.iloc[90]



Article                             36.000000
Predicted Label                      2.000000
non_western Probability              0.002677
western_conservative Probability     0.001412
western_progressive Probability      0.995911
Name: 90, dtype: float64

_____________________________
**5) Search Engine Implementation** 


This code implements and evaluates three different search engines for political news articles, along with an evaluation system using LLM (Large Language Model) judgments.

The following section has been built with the help of Stack Overflow articles.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer
import openai  # For LLM judge evaluation

# Load the cleaned data
data = pd.read_csv('datasets/news_embedded.csv')

Threads used as reference:


- https://stackoverflow.com/questions/23838056/what-is-the-difference-between-transform-and-fit-transform-in-sklearn



- https://stackoverflow.com/questions/27697766/understanding-min-df-and-max-df-in-scikit-countvectorizer


- https://stackoverflow.com/questions/18424228/cosine-similarity-between-2-number-lists


- https://stackoverflow.com/questions/51485813/an-illegal-reflective-access-operation-has-occurred-while-setting-up-spring-xd/51486492#51486492


- https://stackoverflow.com/questions/17901218/numpy-argsort-what-is-it-doing


- https://stackoverflow.com/questions/26984414/efficiently-sorting-a-numpy-array-in-descending-order

In [None]:
class TFIDEngine:
    def __init__(self, documents):
        self.vectorizer = TfidfVectorizer()
        self.tfidf_matrix = self.vectorizer.fit_transform(documents) # Create TF-IDF matrix from documents
        self.documents = documents # Store original documents
        
    def search(self, query, top_k=5):
        query_vec = self.vectorizer.transform([query]) # Convert query to TF-IDF vector
        # Calculate cosine similarity between query and all documents
        similarities = cosine_similarity(query_vec, self.tfidf_matrix).flatten()
        # Get indices of top k most similar documents
        top_indices = similarities.argsort()[-top_k:][::-1]
        # Return top documents with their similarity scores
        return [(self.documents.iloc[i], similarities[i]) for i in top_indices]
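
To show what the TF-IDF engine computes under the hood, here is a small pure-Python sketch of the same idea: weight each term, normalize the vectors, and rank documents by cosine similarity. The function names (`fit_idf`, `vectorize`, `search`) and the toy documents are hypothetical. The smoothed IDF, ln((1 + n) / (1 + df)) + 1, and the L2 normalization follow what the scikit-learn documentation describes for `TfidfVectorizer`'s defaults, while tokenization here is plain whitespace splitting, which is simpler than the library's.

```python
import math
from collections import Counter

def fit_idf(docs):
    """Learn the vocabulary and smoothed IDF weights from the documents."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Number of documents in which each term appears at least once
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    vocab = sorted(doc_freq)
    idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocab}
    return vocab, idf

def vectorize(text, vocab, idf):
    """Turn a text into an L2-normalized TF-IDF vector; unknown words are ignored."""
    counts = Counter(text.split())
    vec = [counts[tok] * idf[tok] for tok in vocab]
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

def search(query, docs, top_k=2):
    """Rank documents by cosine similarity to the query."""
    vocab, idf = fit_idf(docs)
    doc_vecs = [vectorize(doc, vocab, idf) for doc in docs]
    query_vec = vectorize(query, vocab, idf)
    # For unit-length vectors the dot product equals the cosine similarity
    scores = [sum(q * d for q, d in zip(query_vec, vec)) for vec in doc_vecs]
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [(docs[i], scores[i]) for i in order[:top_k]]

# Toy corpus, made up for illustration
toy_docs = [
    "border security immigration policy",
    "economy markets inflation",
    "immigration reform debate",
]
for doc, score in search("immigration border", toy_docs):
    print(f"{score:.3f} | {doc}")
```

A document that shares no term with the query scores exactly 0, which is one reason purely lexical scores on short queries tend to be low.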

Threads used as reference:


- https://stackoverflow.com/questions/6910641/how-do-i-get-indices-of-n-maximum-values-in-a-numpy-array

- https://stackoverflow.com/questions/51485813/an-illegal-reflective-access-operation-has-occurred-while-setting-up-spring-xd/51486492#51486492

- https://stackoverflow.com/questions/55258608/get-values-from-list-of-tuples-according-to-first-value/55258680#55258680

- https://stackoverflow.com/questions/70411258/image-tag-how-to-put-on-an-icon-not-string-not-url-in-case-of-url-failure

- https://stackoverflow.com/questions/65419499/download-pre-trained-sentence-transformers-model-locally

In [None]:
class NeuralEmbeddingSE:
    def __init__(self, documents):
        # Load pre-trained sentence transformer model
        self.model = SentenceTransformer('all-MiniLM-L6-v2')
        # Create embeddings for all documents
        self.doc_embeddings = self.model.encode(documents.tolist())
        self.documents = documents
        
    def search(self, query, top_k=5):
        query_embedding = self.model.encode([query])  # Create embedding for query
        # Calculate cosine similarity between query and document embeddings
        similarities = cosine_similarity(query_embedding, self.doc_embeddings).flatten()
        top_indices = similarities.argsort()[-top_k:][::-1]  # Get top indices
        return [(self.documents.iloc[i], similarities[i]) for i in top_indices]

Threads used as reference: 

- https://stackoverflow.com/questions/58662904/how-to-access-stdshared-ptr-methods/58663062#58663062

- https://stackoverflow.com/questions/38640109/logistic-regression-python-solvers-definitions

- https://stackoverflow.com/questions/613183/how-do-i-sort-a-dictionary-by-value

In [None]:

    
class HybridSE:
    def __init__(self, documents):
        # Initialize both engines
        self.tfidf_engine = TFIDEngine(documents)
        self.neural_engine = NeuralEmbeddingSE(documents)
        self.documents = documents
        
    def search(self, query, top_k=5, alpha=0.5):
        # Get results from both engines (twice as many as needed)
        tfidf_results = self.tfidf_engine.search(query, top_k*2)
        neural_results = self.neural_engine.search(query, top_k*2)
        
        # Combine scores with weighted average
        combined_scores = {}
        for doc, score in tfidf_results:
            combined_scores[doc] = combined_scores.get(doc, 0) + alpha * score
            
        for doc, score in neural_results:
            combined_scores[doc] = combined_scores.get(doc, 0) + (1 - alpha) * score
            
        # Sort by combined score and return top k
        sorted_results = sorted(combined_scores.items(), key=lambda x: x[1], reverse=True)
        return sorted_results[:top_k]
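
In `HybridSE.search` as written, a document that appears in only one engine's candidate list receives only that engine's weighted share of the score. This is consistent with the 0.188 hybrid score reported further down being about half of the 0.377 neural score at `alpha=0.5`. Raw TF-IDF and embedding similarities also live on different scales. One possible alternative, sketched here under assumptions rather than as the method used in this project, scores every document with both engines, rescales each score list with min-max normalization, and only then combines them. The names `min_max` and `hybrid_rank` and the toy scores are hypothetical.

```python
def min_max(scores):
    """Rescale a list of scores to the [0, 1] range."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]  # no spread: no document stands out
    return [(s - lo) / (hi - lo) for s in scores]

def hybrid_rank(tfidf_scores, neural_scores, alpha=0.5, top_k=5):
    """Combine per-document scores from both engines and return the top_k (index, score) pairs."""
    if len(tfidf_scores) != len(neural_scores):
        raise ValueError("both score lists must cover the same documents")
    t_norm = min_max(tfidf_scores)
    n_norm = min_max(neural_scores)
    combined = [alpha * t + (1 - alpha) * n for t, n in zip(t_norm, n_norm)]
    order = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
    return [(i, combined[i]) for i in order[:top_k]]

# Toy similarity scores for four documents, made up for illustration
tfidf_scores = [0.10, 0.02, 0.00, 0.06]
neural_scores = [0.30, 0.40, 0.10, 0.35]

ranking = hybrid_rank(tfidf_scores, neural_scores, alpha=0.5, top_k=2)
print(ranking)
```

In the toy data the document with the highest neural score (index 1) does not reach the top two, because the documents at indices 0 and 3 score well with both engines.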

Threads used as reference:

- https://stackoverflow.com/questions/76256897/how-to-make-a-conditional-statement-using-two-different-dataframes-in-pandas/76257033#76257033

- https://stackoverflow.com/questions/75203085/sanitys-urlfor-not-working-with-my-react-app

- https://stackoverflow.com/questions/75329835/kotlin-json-string-with-linebreaker-and-variable

- https://stackoverflow.com/questions/71992082/how-to-install-and-run-ward-monitoring-tool-on-linux-ubuntu/71992083#71992083

In [None]:
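class LLMJudge:
    def __init__(self, api_key):
        openai.api_key = api_key
        # Template for LLM evaluation; the 1-5 scale is an assumption inferred from the neutral default of 3
        self.prompt_template = ("Rate from 1 (irrelevant) to 5 (highly relevant) how well the document answers the query. "
                                "Reply with a single integer. Query: {query}, Retrieved Document: {document}")

    def evaluate(self, query, document):
        prompt = self.prompt_template.format(query=query, document=document)
        try:
            # Call OpenAI API to get relevance rating
            # (openai.ChatCompletion is the pre-1.0 interface of the openai package)
            response = openai.ChatCompletion.create(
                model="gpt-3.5-turbo",
                messages=[{"role": "system", "content": prompt}],
                max_tokens=2
            )
            rating = int(response.choices[0].message.content.strip())
            return rating
        except Exception:
            return 3  # Default neutral rating if the API call or the integer parse fails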
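
Since `evaluate` converts the model's reply with `int()` and falls back to 3 on any exception, a reply such as "Rating: 4" is silently turned into the neutral default. A more tolerant parser could look like the sketch below. The function name `parse_rating` and the 1 to 5 scale are assumptions (the scale is inferred from the neutral default of 3) and are not part of the original code.

```python
import re

def parse_rating(text, low=1, high=5, default=3):
    """Extract the first integer from an LLM reply; fall back to `default` if none is usable."""
    match = re.search(r"-?\d+", text or "")
    if match is None:
        return default          # no number in the reply at all
    value = int(match.group())
    # Ratings outside the assumed scale are treated as unusable
    return value if low <= value <= high else default

print(parse_rating("4"))
print(parse_rating("Rating: 4"))
print(parse_rating("I cannot rate this."))
```

Keeping the parsing in its own function also makes it possible to tell a failed API call apart from an unparsable reply, which the single `except` branch cannot do.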

Threads used as reference:

- https://stackoverflow.com/questions/522563/how-can-i-access-the-index-value-in-a-for-loop

- https://stackoverflow.com/questions/9039961/finding-the-average-of-a-list

In [None]:
def evaluate_search_engine(engine, queries, judge, top_k=3):
    # Evaluates an engine by getting average rating across all queries
    scores = []
    for query in queries:
        results = engine.search(query, top_k=top_k)
        for doc, _ in results:
            score = judge.evaluate(query, doc)
            scores.append(score)
    return np.mean(scores) if scores else 0

Threads used as reference:

- https://stackoverflow.com/questions/1798465/remove-last-3-characters-of-a-string

- https://stackoverflow.com/questions/45310254/fixed-digits-after-decimal-with-f-strings

- https://stackoverflow.com/questions/23267409/how-to-implement-retry-mechanism-into-python-requests-library

In [None]:
# Initialize search engines
tfidf_se = TFIDEngine(data['article'])
neural_se = NeuralEmbeddingSE(data['article'])
hybrid_se = HybridSE(data['article'])

# Sample test queries
test_queries = [
    "conservative view on immigration",
    "progressive economic policies",
    "international relations from non-western perspective",
    "healthcare reform debate",
    "climate change policies"
]

# a manual comparison for demonstration without API key 
def manual_evaluation():
    print("=== Sample Search Results ===")
    for query in test_queries[:2]:  # Just show first 2 for demo
        print(f"\nQuery: '{query}'")
        
        print("\nTF-IDF Results:")
        for doc, score in tfidf_se.search(query, top_k=1):
            print(f"Score: {score:.3f} | {doc[:100]}...")
        
        print("\nNeural Embedding Results:")
        for doc, score in neural_se.search(query, top_k=1):
            print(f"Score: {score:.3f} | {doc[:100]}...")
        
        print("\nHybrid Results:")
        for doc, score in hybrid_se.search(query, top_k=1):
            print(f"Score: {score:.3f} | {doc[:100]}...")

manual_evaluation()

=== Sample Search Results ===

Query: 'conservative view on immigration'

TF-IDF Results:
Score: 0.129 | oklahoma gov kevin stitt (r) says invasion occurring southern border applauds texas gov greg abbotta...

Neural Embedding Results:
Score: 0.418 | liveisraelhamas warlivetech ceos testify senateby aditi sangal elise hammond maureen chowdhury tori ...

Hybrid Results:
Score: 0.418 | liveisraelhamas warlivetech ceos testify senateby aditi sangal elise hammond maureen chowdhury tori ...

Query: 'progressive economic policies'

TF-IDF Results:
Score: 0.071 | donald trumps voters get attention joe bidens may decide general electiona historic rematch white ho...

Neural Embedding Results:
Score: 0.377 | marketsfear & greed indexlatest market newsit happened us economy defied yet another forecast big wa...

Hybrid Results:
Score: 0.188 | marketsfear & greed indexlatest market newsit happened us economy defied yet another forecast big wa...


_____________________________
**6) Conclusion** 



The ML models evaluated in the fourth section yielded largely positive results: three of the four reached an accuracy above 0.92 on the held-out test set. SVM performed best with an accuracy of 0.98, whereas the RNN performed worst with an accuracy of 0.78. The dataset, however, contains news sources that may fall under the category of “mainstream media”; to test the models’ predictive performance further, it would be advisable to build new datasets covering a wider variety of news sources. The results of the search engine implementations were less satisfactory. Even the Neural Embedding Engine, which yielded the best scores out of the three engines, failed to surpass a similarity score of 0.5 on the sample queries, which points to a poor affinity between the data collected and the parameters adopted in the final evaluation.

