What makes DATA so POWERFUL?
It has the power to predict the future, which is why artificial intelligence is such a hot topic.

I started by trying various news APIs: the New York Times, The NEWS, and Bloomberg.
However, their data did not go back more than a few months, so I used The Guardian's API instead.

Some features to notice:

In [2]:
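import requests

# fetch_articles: sends a GET request to The Guardian's search endpoint to retrieve articles
# Parameters: api_key (Guardian API key), keyword (search term), page_size (results per page)
# The JSON response is parsed into (title, body) tuples, one per article
# Note: the date range below is hard-coded to 2012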

def fetch_articles(api_key, keyword, page_size):
    print(f"Fetching articles for '{keyword}'...")
    base_url = "https://content.guardianapis.com/search"
    params = {
        'api-key': api_key,
        'q': keyword,
        'page-size': page_size,
        'show-fields': 'body',
        'from-date': '2012-01-01',
        'to-date': '2012-12-31'
    }
    constructed_url = requests.Request('GET', base_url, params=params).prepare().url
    print(f"Constructed URL for '{keyword}': {constructed_url}")
   
    try:
        response = requests.get(constructed_url, timeout=10)
        response.raise_for_status()
        data = response.json()
        articles = data['response']['results']
        if not articles:
            print(f"No articles found for '{keyword}'.")
            return []
        print(f"Fetched {len(articles)} articles for '{keyword}'.")
        # .get() avoids a KeyError when a result comes back without a 'fields' entry
        return [(article['webTitle'], article['fields']['body']) for article in articles if 'body' in article.get('fields', {})]
    except requests.exceptions.HTTPError as http_err:
        print(f"HTTP error occurred while fetching articles for '{keyword}': {http_err}")
    except requests.exceptions.RequestException as err:
        print(f"Request error while fetching articles for '{keyword}': {err}")
    return []
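To make the parsing step concrete without calling the API, here is a minimal sketch that applies the same extraction to a hand-written payload. The payload only mimics the response / results / fields / body nesting that fetch_articles reads. The sample titles and the helper name extract_title_body are assumptions for illustration, not real API output.

```python
def extract_title_body(data):
    """Return (title, body) pairs from a Guardian-style search payload."""
    results = data.get("response", {}).get("results", [])
    pairs = []
    for article in results:
        body = article.get("fields", {}).get("body")
        if body:  # skip results that came back without a body
            pairs.append((article["webTitle"], body))
    return pairs


# Hand-written sample (not real API output); the shape mirrors what fetch_articles reads
sample = {
    "response": {
        "results": [
            {"webTitle": "Example headline", "fields": {"body": "<p>Example body</p>"}},
            {"webTitle": "No body here", "fields": {}},
            {"webTitle": "No fields at all"},
        ]
    }
}

print(extract_title_body(sample))
```

Only the first sample result survives, because the other two have no body to analyze.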

About stop words:
Initially, I built the stop word list manually, but after a few runs through the articles I found that HTML fragments (tag and attribute names such as div, href, class) were showing up as keywords.
I also excluded words that intuitively did not seem meaningful (theguardian, the).

In [3]:
custom_stop_words = ["https", "com", "theguardian", "href", "www", "class", "block", "time", "div", "id", "h2", "figure", "elements", "the"]
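The count_words helper that aggregate_rankings calls is not shown in this write-up. The sketch below is one plausible way to write it, assuming it strips HTML tags, lowercases the text, and drops the custom stop words. The regular expressions and the sample body are assumptions, not the original implementation.

```python
import re
from collections import Counter

custom_stop_words = ["https", "com", "theguardian", "href", "www", "class", "block",
                     "time", "div", "id", "h2", "figure", "elements", "the"]


def count_words(html_text, stop_words=custom_stop_words):
    """Hypothetical helper: count words in an article body, skipping stop words."""
    text = re.sub(r"<[^>]+>", " ", html_text)        # drop HTML tags such as <p> or <div class=...>
    tokens = re.findall(r"[a-z0-9]+", text.lower())  # lowercase alphanumeric tokens
    blocked = set(stop_words)
    return Counter(t for t in tokens if t not in blocked)


sample_body = '<div class="block"><p>The election and the economy. Election day!</p></div>'
print(count_words(sample_body).most_common(3))
```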

The TF-IDF (Term Frequency-Inverse Document Frequency) method was used to extract keywords.
TfidfVectorizer: converts the documents into TF-IDF features; with stop_words='english' it ignores common English words
fit_transform: learns the vocabulary and inverse document frequency weightings, then transforms the documents into a sparse matrix

In [4]:
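from sklearn.feature_extraction.text import TfidfVectorizer

# Returns, for each document, its top_n (term, TF-IDF score) pairs
def extract_keywords_tfidf(documents, top_n=10):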
    vectorizer = TfidfVectorizer(max_features=500, stop_words='english')
    tfidf_matrix = vectorizer.fit_transform(documents)
    feature_names = vectorizer.get_feature_names_out()
    dense = tfidf_matrix.todense().tolist()
    keywords = [sorted(zip(feature_names, doc), key=lambda x: x[1], reverse=True)[:top_n] for doc in dense]
    return keywords
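To show what the scores mean, here is a minimal pure-Python sketch of the weighting, assuming scikit-learn's default settings (raw term counts, smoothed IDF of ln((1 + n) / (1 + df)) + 1, then L2 normalization per document). The function name tfidf_scores and the toy documents are assumptions. Tokenization and stop word removal are skipped to keep the arithmetic visible.

```python
import math
from collections import Counter


def tfidf_scores(tokenized_docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalized)."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents that contain each term
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    scored = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        raw = {t: count * (math.log((1 + n_docs) / (1 + df[t])) + 1)
               for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        scored.append({t: w / norm for t, w in raw.items()})
    return scored


# Toy documents, already tokenized
docs = [
    "election vote candidate".split(),
    "election economy jobs".split(),
    "election debate candidate".split(),
]
for weights in tfidf_scores(docs):
    print(sorted(weights.items(), key=lambda kv: kv[1], reverse=True))
```

A word that appears in every document ("election" here) gets the lowest weight in each document, which is why TF-IDF surfaces distinctive words rather than merely frequent ones.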

analyze_sentiment applies TextBlob, a library that provides a simple API for common natural language processing tasks, to score the sentiment of a text.

In [5]:
from textblob import TextBlob

# Sentiment polarity ranges over [-1, 1], from most negative to most positive
def analyze_sentiment(text):
    blob = TextBlob(text)
    return blob.sentiment.polarity

Latent Dirichlet Allocation (LDA) for topic modeling

In [6]:
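from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# CountVectorizer: converts text documents into a matrix of token counts (max_df and min_df filter out words that are too common or too rare)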
# LDA Model: Initialized with the number of topics (n_components) and a random state for reproducibility
# Topic Extraction: For each topic, it sorts the words based on their association with the topic and selects the top N words to represent each topic

def perform_lda(documents, n_topics=5, n_words=10):
    count_vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
    doc_term_matrix = count_vectorizer.fit_transform(documents)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    lda.fit(doc_term_matrix)
    words = count_vectorizer.get_feature_names_out()
    topics = {i: [words[index] for index in topic.argsort()[:-n_words - 1:-1]] for i, topic in enumerate(lda.components_)}
    return topics
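The slice topic.argsort()[:-n_words - 1:-1] is the least obvious step in perform_lda. The sketch below reproduces it with plain Python lists: sort the indices by weight in ascending order, then walk backwards from the end for n_words steps. The helper name top_indices and the weights are made up for illustration.

```python
def top_indices(weights, n_words):
    """Indices of the n_words largest weights, largest first."""
    # Indices sorted by ascending weight; this plays the role of argsort()
    ascending = sorted(range(len(weights)), key=lambda i: weights[i])
    # Start at the last (largest) entry and step backwards n_words times
    return ascending[:-n_words - 1:-1]


words = ["vote", "tax", "debate", "poll", "jobs"]
topic_weights = [0.9, 0.1, 0.4, 0.7, 0.2]  # made-up weights for one topic

print([words[i] for i in top_indices(topic_weights, 3)])  # ['vote', 'poll', 'debate']
```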

aggregate_rankings: loops through the articles, sums the word counts, and computes the average sentiment; returns the 20 most frequent words together with that average

In [7]:
from collections import Counter

def aggregate_rankings(articles, keyword):
    print(f"Aggregating rankings for {len(articles)} articles...")
    word_ranking_sums = Counter()
    sentiments = []
    for title, body in articles:
        word_counts = count_words(body)
        word_ranking_sums.update(word_counts)
        sentiment = analyze_sentiment(body)
        sentiments.append(sentiment)
    average_sentiment = sum(sentiments) / len(sentiments) if sentiments else 0
    print(f"Average sentiment for '{keyword}': {average_sentiment}")
    return word_ranking_sums.most_common(20), average_sentiment
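The aggregation relies on two Counter behaviors: update adds counts instead of overwriting them, and most_common returns the highest totals first. A minimal sketch with made-up per-article counts and polarity scores (all values are assumptions for illustration):

```python
from collections import Counter

# Made-up word counts for two articles
per_article_counts = [
    Counter({"election": 3, "vote": 1}),
    Counter({"election": 2, "economy": 4}),
]
per_article_sentiment = [0.10, -0.05]  # made-up polarity scores

totals = Counter()
for counts in per_article_counts:
    totals.update(counts)  # adds to existing counts, unlike dict.update

# Guard against an empty list, as aggregate_rankings does
average = (sum(per_article_sentiment) / len(per_article_sentiment)
           if per_article_sentiment else 0)

print(totals.most_common(2))
print(round(average, 3))
```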

plot_keyword_rankings_interactive: visualizes the rankings
A bar chart makes it easy to see at a glance which words rank from number 1 to number 20.

In [8]:
import plotly.express as px

def plot_keyword_rankings_interactive(rankings, keyword):
    if rankings:
        words, scores = zip(*rankings)
        fig = px.bar(x=words, y=scores, title=f'Top Word Rankings for {keyword}', labels={'x':'Words', 'y':'Frequency'})
        fig.show()
    else:
        print(f"No data to plot for '{keyword}'.") 

HERE ARE THE INITIAL TRIALS:

Results for 2020 Election:

![Alt text](images/Screenshot%202024-04-11%20at%203.54.02%20PM.png)

Results for 2016 Election:

![Alt text](images/Screenshot%202024-04-11%20at%203.54.18%20PM.png)

Results for 2012 Election:

![Alt text](images/Screenshot%202024-04-11%20at%203.54.22%20PM.png)

HERE ARE THE FINAL TRIALS:

Results for 2020 Election:

![Alt text](images/Screenshot%202024-04-11%20at%205.07.36%20PM.png)

Results for 2016 Election:

![Alt text](images/Screenshot%202024-04-11%20at%205.07.41%20PM.png)

Results for 2012 Election:

![Alt text](images/Screenshot%202024-04-11%20at%205.07.55%20PM.png)

When looking for relations between the keywords and the winner, I did not notice anything in common at first.
However, I soon realized that the winner's keywords tended to be more neutral in tone.
