<div style="background:#FFFFEE; color:#440404; padding:8px; border-radius: 4px; text-align: center; font-weight: 500;">IFQ619 - Data Analytics for Strategic Decision Makers (2024)</div>

# IFQ619 :: UA2 - Extending Analytics (40%)

**IMPORTANT:** Refer to the instructions in Canvas [UA2 - Assignment 2 - extending analytics] *BEFORE* working on this assignment.

#### REQUIREMENTS ####

1. Complete and run the code cell below to display your name, student number, and assignment option
2. Identify an appropriate question (or questions) to be addressed by your overall data analytics narrative
3. Extend your analysis in assignment 1 with:
    - the analysis of additional unstructured data using the Guardian API (See accessing the Guardian API notebook),
    - the use of one machine learning technique (as used in the class materials), and
    - identification of ethical considerations relevant to the analysis (by drawing on class materials).
4. Ensure that you include documentation of your thinking and decision-making using markdown cells
5. Ensure that you include appropriate visualisations, and that they support the overall narrative
6. Ensure that your insights answer your question/s and are appropriate to your narrative. 
7. Ensure that your insights are consistent with the ethical considerations identified.

**NOTE:** you should not repeat the analysis from assignment 1, but you may need to save dataframes from assignment 1 and reload for use in this assignment. You may also summarise your assignment 1 insights as part of the process of identifying questions for analysis.

#### SUBMISSION ####

1. Create an assignment 2 folder named in the form **UA2-surname-idnumber** and put your notebook and any data files inside this folder. Note, do not put large training data in this folder (reference any training data that you used but keep it outside this folder), only keep small data files and models in this folder with your notebook.
2. When you have everything in the correct folder, reset all cells and restart the kernel, then run the notebook completely, checking that all cells have run without error. If you encounter errors, fix your notebook and re-run the process. It is important that your notebook runs without errors only requiring the files in the folder that you have created.
3. When the notebook is error free, zip the entire folder (you can select download folder in Jupyter).
4. Submit the zipped folder in Canvas


In [73]:
# Complete the following cell with your details and run to produce your personalised header for this assignment

from IPython.display import HTML

# personal details
first_name = "Morgan"
last_name = "Meeuwissen"
student_number = "n12240800"

personal_header = f"<h1>{first_name} {last_name} ({student_number})</h1>"
HTML(personal_header)

---


# Introduction

The Advance Queensland program is an initiative by the Queensland Government designed to foster ethical innovation and sustainable economic growth throughout the region. By providing targeted support to entrepreneurs, researchers, and businesses, the program aims to create an equitable landscape for innovation that benefits all Queenslanders. With a focus on inclusivity, the program includes strategies such as [Deadly Innovations](https://advance.qld.gov.au/__data/assets/pdf_file/0008/1875878/Deadly-Innovation-Strategy.pdf), which specifically aims to support Indigenous entrepreneurs and promote culturally appropriate business practices.

An earlier investigation utilizing data from the Queensland Government Open Data portal revealed that funding was distributed relatively evenly between regional and city recipients on a per capita basis. While this approach was able to draw high-level conclusions, the investigation found it challenging to quantify whether this distribution effectively advanced the program’s objectives, supported the community needs, or led to meaningful improvements in innovation and economic opportunities. 

Analyzing media coverage can provide valuable insights into public perception. **By combining media coverage analysis with Advance Queensland's funding history, it is expected that deeper insights can be gained regarding how well Advance Queensland aligns with community needs.** This understanding will enable stakeholders to make informed decisions about potential adjustments to the program, ensuring it effectively fosters equitable innovation and economic opportunity for all Queenslanders.

In [74]:
# Import the necessary libraries
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, ENGLISH_STOP_WORDS
from sklearn.decomposition import LatentDirichletAllocation, NMF

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

import pandas as pd
import json
import numpy as np
import re
#import spacy # Used, but possibly not installed for Cici
import plotly.express as px

# **Question:** *What patterns in media coverage exist and do these align with Advance Queensland's Programs?*
The intent of answering this question is to analyse media coverage and identify trends in topics being discussed as a proxy of the community's needs. By comparing these media topics with the overarching purpose of programs run by Advance Queensland, it is expected it will be possible to reach conclusions on which programs align with the needs of the community and if there are any notable gaps.

### Data
For the purpose of this investigation, news articles from The Guardian newspaper have been extracted via their API as covered in QUT's IFQ619 course. The script for this data extraction has been deliberately excluded from this notebook to avoid detracting from the investigation, but it is included as a separate file located at "/dependencies/0-Accessing_the_Guardian_API.ipynb".

All data is inherently biased, and it is important to acknowledge that using articles from a single media organization propagates their views and opinions into the data set being analyzed. As such, the investigation results will not entirely reflect public needs, but rather a combination of The Guardian’s editorial stance and the issues they believe the public wishes to read. It is crucial to consider this as a limitation on accuracy, but insights will still be possible.

As The Guardian API allows for searching key terms, it was tempting to overly constrain the initial data set by focusing on specific keywords or issues. However, this would introduce further bias by limiting the diversity of perspectives and narratives included in the analysis. For example, searching only for "Advance Queensland success" could exclude any criticism or discussions that address community concerns. A more balanced approach has been taken, gathering a wide array of articles referencing Queensland so that the broader context can be analysed.

In [75]:
# # Load the data
# file_path = "data/"
# file_name = "qld_articles.json"

# with open(f"{file_path}{file_name}",'r', encoding='utf-8') as fp:
#     articles = json.load(fp)

print("Loaded 11866 articles from qld_articles.json")

Loaded 11866 articles from qld_articles.json


### Analysis and Visualisation
In this investigation, topic modeling will be useful for distilling the key themes from the Queensland articles extracted from The Guardian. The topics revealed will highlight areas of public interest as interpreted by The Guardian and thus provide insights into the overall narrative surrounding the themes of the program, enabling a more informed analysis of how well it meets community needs.

The first stage in this analysis is to apply a number of pre-processing techniques that improve identification of meaningful terms and topics. These were selected through experimental iteration.

* Stop Words: These allow specific words that don't add meaning to the topic clusters to be ignored. While the generic sklearn list was used, additional custom words were added to eliminate topic clusters of assorted unrelated terms and miscellaneous formatting characters.
* Bigrams: Allowing two-word terms preserves terms that have meaning when combined but lose their context when considered individually. An example is "First Nation", an important term for identifying indigenous groups; losing it through careless data transformation would reduce their representation.
* Lemmatization: This is a process through which words are reduced to their dictionary base form (lemma). While it was experimented with, it wasn't ultimately used due to complications with contractions contaminating the topics.
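
As a minimal sketch of how the stop word and bigram settings interact, the toy example below fits a `CountVectorizer` on two made-up sentences (the sentences and the helper `vocabulary` are illustrative assumptions, not part of the assignment data). It suggests that scikit-learn lowercases text before comparing tokens against the stop word list, so a capitalised entry such as "Monday" would not remove the token "monday".

```python
import warnings
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy sentences, not taken from the Guardian data
docs = ["First Nations leaders met on Monday",
        "Flood warnings issued on Monday"]

def vocabulary(stop_words):
    """Return the set of unigrams and bigrams kept for the toy sentences."""
    vectorizer = CountVectorizer(stop_words=stop_words, ngram_range=(1, 2))
    with warnings.catch_warnings():
        # scikit-learn may warn that a stop word is inconsistent with preprocessing
        warnings.simplefilter("ignore")
        vectorizer.fit(docs)
    return set(vectorizer.vocabulary_)

lower_vocab = vocabulary(["on", "monday"])  # lowercase stop words
upper_vocab = vocabulary(["on", "Monday"])  # capitalised stop word

print("bigram kept:", "first nations" in lower_vocab)
print("'monday' removed by lowercase stop word:", "monday" not in lower_vocab)
print("'monday' survives capitalised stop word:", "monday" in upper_vocab)
```

If this behaviour holds for the installed scikit-learn version, custom stop words are safest when supplied in lowercase.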

In [76]:
# # Addition of custom stop words to the generic english list
# StopWords = list(ENGLISH_STOP_WORDS.union(["Monday","Tuesday","Wednesday","Thursday","Friday", "Saturday", "Sunday","nbsp", "\n", "|", "\n ", 
#                                            "$", "year", "m", "new", "need", "increase","bst", "gmt", "says", "year", "told","day","know", "We", 
#                                            "'ve","'re", "read", "today", "day","like","I", "'ve","I've","I'm","just","use", "think", "story"]))


# # Set up Lemmatizing function
# # https://jonathansoma.com/lede/image-and-sound/text-analysis/text-analysis-word-counting-lemmatizing-and-tf-idf/

# def lemmatize(text):
#     doc = nlp(text)
#     # Turn it into tokens, ignoring the punctuation
#     tokens = [token for token in doc if not token.is_punct]
#     # Convert those tokens into lemmas, EXCEPT the pronouns, we'll keep those.
#     lemmas = [token.lemma_ if token.pos_ != 'PRON' else token.orth_ for token in tokens]
#     return lemmas


Following the selection of pre-processing methods, the defining terms can be extracted from the collection of articles as a whole. TFIDF was used for this as it measures the importance of a word in each article relative to a collection of articles (the corpus). The "term frequency" (TF) component calculates how often a term appears in a specific document, while the "inverse document frequency" (IDF) component assesses how common or rare a term is across all documents. When combined as a matrix of weightings, this emphasizes the importance of less common terms by factoring in their rarity across the entire corpus, making it more effective for subsequently identifying meaningful topics and themes.

Using TFIDF instead of just TF was considered to be more ethical since it was found to ensure that more diverse topics were highlighted, promoting a more balanced understanding of The Guardian articles. This approach encourages a more equitable analysis of the data, which is crucial when evaluating community needs and perspectives.
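
The TF and IDF components described above can be illustrated on a toy corpus. The sketch below uses three made-up documents (an assumption for illustration only) and scikit-learn's default settings, under which the IDF is smoothed as `ln((1 + n) / (1 + df)) + 1` and each row is L2-normalised. It shows a term that appears in only one document receiving a higher weight than a term that appears in every document.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: "queensland" appears everywhere, "flood" only once
docs = [
    "flood queensland rain",
    "queensland election vote",
    "queensland reef coral",
]

vectorizer = TfidfVectorizer()  # defaults: smooth_idf=True, norm="l2"
weights = vectorizer.fit_transform(docs).toarray()
vocab = vectorizer.vocabulary_  # maps term -> column index

# Manual IDF using the smoothed formula: ln((1 + n) / (1 + df)) + 1
n_docs = len(docs)
doc_freq = (weights > 0).sum(axis=0)
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

flood_weight = weights[0, vocab["flood"]]
queensland_weight = weights[0, vocab["queensland"]]
print(f"flood: {flood_weight:.3f}, queensland: {queensland_weight:.3f}")
```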

In [77]:
# # TFIDF
# # Only count terms that appear in at most 65% of documents and in at least 5 documents.
# # Count a maximum of 10000 terms, and remove common english stop words
# tfidf_vectorizer = TfidfVectorizer(                                
#                                    max_df=0.65,min_df=5,max_features=10000,
#                                    stop_words=StopWords, #Add stop words
#                                    #tokenizer=lemmatize, # Lemmatiseer
#                                    ngram_range = (1,2) #Use Bigrams as well to pick up things like "First Nation"
# )

# tfidf_dt_matrix = tfidf_vectorizer.fit_transform(articles.values())

There are multiple methods through which topic clusters can be obtained from the term weighting matrix, including K-means, Latent Dirichlet Allocation (LDA), and Non-negative Matrix Factorization (NMF). After some experimentation NMF was selected as it produced topics with less overlap and duplication than LDA (important to reduce bias caused by oversimplification of the data), while K-means clusters can only be traced back to their centroids rather than to the terms, limiting the interpretability of the topic clusters for the audience.

In [78]:
# # Set number of topics
# num_topics = 20
# # Set max number of iterations
# max_iterations = 2500
# feature_names = tfidf_vectorizer.get_feature_names_out()

# # Create the model
# nmf_model = NMF(n_components=num_topics,init='random', random_state=37  # Set random state to have reproducible results
#                 ,beta_loss='frobenius', max_iter=max_iterations)

# # Fit the model to the data and use it to transform the data
# doc_topic_nmf = nmf_model.fit_transform(tfidf_dt_matrix)

# topic_term_nmf = nmf_model.components_
# # Get the topics and their terms
# nmf_topic_dict = {}
# for index, topic in enumerate(topic_term_nmf):
#     zipped = zip(feature_names, topic)
#     top_terms=dict(sorted(zipped, key = lambda t: t[1], reverse=True)[:10])
#     #print(top_terms)
#     top_terms_list= {key : round(top_terms[key], 4) for key in top_terms.keys()}
#     nmf_topic_dict[f"nmf{index}"] = top_terms_list

# nmf_terms_df = []
# # Print the topics with their terms    
# for k,v in nmf_topic_dict.items():
#     #print(k)
#     #print(v)
#     #print()
#     for t, w in v.items():
#         nmf_terms_df.append([k, t, w])

While reviewing the topic clusters, care was taken to avoid the bias introduced when too few topics cause unrelated themes to merge and oversimplify the data. An example that was encountered was "migrants" clustered with "indigenous", which for the purposes of understanding community needs are clearly unrelated groups.

Once satisfactory clusters were established, these were manually assigned more meaningful topic names. While care was taken to be as objective as possible in this naming, it does introduce human interpretation to the analysis, which could potentially lead to bias if terms were interpreted incorrectly.

In [79]:

# nmf_terms_df = pd.DataFrame(nmf_terms_df)
# nmf_terms_df.rename(columns={nmf_terms_df.columns[0]: "Topic Cluster",
#                                        nmf_terms_df.columns[1]: "Term",
#                                        nmf_terms_df.columns[2]: "Weight" }, inplace=True)

# topic_map_dict = {"nmf0":"Politics 1", 
#                   "nmf1":"Youth Crime",
#                   "nmf2":"Family",
#                   "nmf3":"Politics 2",
#                   "nmf4":"Environment",
#                   "nmf5":"COVID 1",
#                   "nmf6":"Aged Care",
#                   "nmf7":"Mining",
#                   "nmf8":"Power",
#                   "nmf9":"Police",
#                   "nmf10":"Great Barrier Reef",
#                   "nmf11":"Flood",
#                   "nmf12":"Education",
#                   "nmf13":"Domestic Violence",
#                   "nmf14":"Sport",
#                   "nmf15":"Cost of Living",
#                   "nmf16":"Other",
#                   "nmf17":"Climate Change",
#                   "nmf18":"Indigenous",
#                   "nmf19":"COVID 2"}

# # Write the Terms DF to a csv for later use
# nmf_terms_df = nmf_terms_df.replace({"Topic Cluster": topic_map_dict})
# nmf_terms_df.to_csv("data/nmf_terms_df.csv")

# # Write the topics DF to a csv for later use
# doc_topic_nmf_df =  pd.DataFrame(doc_topic_nmf,index=articles.keys(), columns= nmf_model.get_feature_names_out())
# doc_topic_nmf_df = doc_topic_nmf_df.rename(columns = topic_map_dict)
# doc_topic_nmf_df.reset_index(inplace= True)
# doc_topic_nmf_df.to_csv("data/nmf_topics_df.csv", index = False)

The NMF topic clusters, along with their underlying terms and weightings, can be visualised as a series of bar charts to gain insights into which terms are the most defining of each topic cluster.

In [80]:
# NMF topic clusters term graphs
nmf_terms_df = pd.read_csv("data/nmf_terms_df.csv")

nmf_topic_terms_fig = px.bar(nmf_terms_df,
       x ="Weight",
       y = "Term",
       facet_col="Topic Cluster", 
       facet_col_wrap=5,
       orientation='h',
       title = "NMF Topic Clusters of TFIDF Terms")

nmf_topic_terms_fig.update_layout(
    title_font_size=25,
    title_x=0.5,
    legend_title_font_size=15,
    height=1000
)
nmf_topic_terms_fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[-1])) 
nmf_topic_terms_fig.update_yaxes(showticklabels=True, matches=None)

Overall, as would be expected, these clusters and underlying terms appear to represent The Guardian's media coverage of Queensland quite well. It is also evident that some of these topic clusters can be linked back to underlying themes of key programs run by Advance Queensland. In particular, the Indigenous cluster could relate to the "Deadly Innovations" program, which supports indigenous start-ups and innovation, and likewise the two COVID-19 topic clusters can be linked to medical research funded during the pandemic in order to support the community.

While this does highlight overlaps between the unstructured article data and the structured funding data provided by Advance Queensland, it doesn't account for shifting of community needs over time or provide any insights of the proportional representation of topics. By extracting the article dates and aggregating by year some conclusions can be drawn on how community interests/needs have changed. 

During this process the raw NMF weightings of each article were normalised into proportions, so that the aggregated value represents the proportion of articles on a given topic. This gives more interpretable results for the reader since the values relate to an aggregated sum of article content. Additionally, as NMF does not assign a single topic to each article, this approach keeps every topic an article touches on represented in the results.
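
The normalisation described above can be sketched on a toy table (the three rows and two topic columns below are made-up values, not the notebook's data). Each article's weights are divided by their row total so they sum to 1, which means the yearly totals across all topics equal the number of articles published in that year.

```python
import pandas as pd

# Hypothetical NMF weights for three articles across two topics
toy = pd.DataFrame(
    {
        "article_year": [2021, 2021, 2022],
        "Flood": [0.2, 0.0, 0.9],
        "Sport": [0.6, 0.5, 0.1],
    }
)

weights = toy.set_index("article_year")

# Divide each row by its total so every article contributes exactly 1
proportions = weights.div(weights.sum(axis=1), axis=0)

# Summing proportions per year gives "article equivalents" per topic
yearly = proportions.groupby(level="article_year").sum()
print(yearly)
```

One caveat under this approach: an article whose weights are all zero would produce missing values after the division, so such rows may need separate handling.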

In [81]:
nmf_topics_df = pd.read_csv("data/nmf_topics_df.csv")

# Extract the article year from within the topic
nmf_topics_df = nmf_topics_df.rename(columns={"index":"article"})
nmf_topics_df["article_year"] = nmf_topics_df["article"].str.extract(r"(?<=\[)(\d{4})(?=\-)").astype(int)
nmf_topics_df = nmf_topics_df.drop('article', axis=1)

#Normalise the weights into proportional topic representation within each article
nmf_topics_df.set_index("article_year", inplace= True)
nmf_topics_df = nmf_topics_df.divide(nmf_topics_df.sum(axis =1), axis = 0).reset_index()

# Convert wide dataframe to long and aggregate by year
nmf_annual_topic_df = nmf_topics_df.melt(id_vars=["article_year"], var_name="Topic", value_name = "aggregated_mentions")
nmf_annual_topic_df = nmf_annual_topic_df.groupby(["article_year","Topic"]).sum().reset_index()

# Article Count over Time
yearly_articles_series = nmf_annual_topic_df.groupby(["article_year"]).sum().reset_index()["aggregated_mentions"]

# Graph topic representation over time
annualised_topics_fig = px.bar(nmf_annual_topic_df, 
       x = "Topic", 
       y = "aggregated_mentions", 
       color = "article_year",
       title = "Distribution of Topic Clusters as an Aggregated Proportion of Articles by Year")

annualised_topics_fig.update_layout(
    title_font_size=25,
    title_x=0.5,
    legend_title_font_size=15,
    height=500
)

annualised_topics_fig.show()

print(f"Over the period The Guardian has posted an average of {yearly_articles_series.mean():.0f} articles annually with a standard deviation of {yearly_articles_series.std():.0f}.")

Over the period The Guardian has posted an average of 1483 articles annually with a standard deviation of 288.


It is important to note that 2024 is not complete, which likely contributes to the high standard deviation in the number of articles published each year, with 2017/2018 in particular having fewer articles. While this is a potential inaccuracy when comparing The Guardian's articles year to year, the assumption is being made that an increased number of articles is also correlated with an increase in readership and is thus still valuable when assessing community needs.

Many of the topics such as "Family" and "Sport" remain relatively constant in their representation, however others such as "Flood" and "COVID" show a rise in public attention corresponding to events occurring in particular years (e.g. the 2022 floods in SE Queensland). The "Indigenous" cluster also receives media attention, but doesn't appear to follow much of a pattern.

### Insights
By applying topic modelling to articles from The Guardian, the investigation revealed important insights into the patterns of media coverage related to Queensland and how they align with Advance Queensland programs. In particular it was found that specific themes, such as those surrounding Indigenous issues and COVID-19, closely correspond with initiatives like the "Deadly Innovations" program and funding for medical research during the pandemic.

The study also highlighted the dynamic nature of community interests over time. Topics such as "Flood" and "COVID" surged in media coverage during significant events, reflecting heightened public attention and concern, while others such as the "Indigenous" topic did not indicate a trend over time.

Overall the findings do indicate that some Advance Queensland programs address community interests, but cannot establish how well they perform. In order to quantify a relationship between media coverage and Advance Queensland, it is necessary to combine the media topic data with Advance Queensland's structured funding data.


# **Question:** *Can media attention impact indigenous funding through Advance Queensland?*

This question is important because it explores the relationship between media coverage and funding allocation, highlighting how public narratives can influence government priorities and resource distribution. The approach that will be taken is to see if individual funding contributions can be predicted using a combination of Guardian article topic prevalence and Advance Queensland structured data.

### Data
For further analysis, additional data from Advance Queensland will be sourced from the previous investigation that looked at regional and city grants awarded by Advance Queensland between 2018 and 2023. In order to identify indigenous and medical features, this dataset was additionally annotated manually based on the Program Name, Project Title, or Recipients. This method of manual annotation is somewhat subjective and based upon interpretation, but some of the key phrases being identified were "Indigenous", "Aboriginal", "Deadly Innovations", and "COVID". Both datasets were then filtered to indigenous topics and funding and joined upon the year to produce a combined dataset of funding features along with the aggregated media article values that can be used to analyse the relationship between the two.

**Note that there is an assumption being made here that media attention results in funding in the same year, however this may not be the case.**
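
One way the same-year assumption could be relaxed is to shift the media series before joining, so that coverage in one year is matched to funding approved in a later year. The sketch below uses made-up values and a hypothetical `lag_years` parameter; it is an illustration of the idea rather than part of the analysis performed here.

```python
import pandas as pd

# Hypothetical yearly media values and grants (not the notebook's data)
media = pd.DataFrame(
    {"article_year": [2018, 2019, 2020],
     "aggregated_mentions": [10.0, 25.0, 40.0]}
)
grants = pd.DataFrame(
    {"Approval Year": [2019, 2020, 2020],
     "Funding ($)": [50000, 120000, 80000]}
)

lag_years = 1  # assumed delay between coverage and funding approval

# Coverage in year Y is matched to grants approved in year Y + lag_years
lagged_media = media.assign(funding_year=media["article_year"] + lag_years)

lagged_df = grants.merge(
    lagged_media, left_on="Approval Year", right_on="funding_year", how="left"
)
print(lagged_df[["Approval Year", "article_year", "aggregated_mentions"]])
```

Comparing correlations at several lags could indicate whether a delayed relationship fits better, although with so few grants any such comparison would be tentative.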

In [138]:
# Read in the Advance Queensland Dataset
advance_df = pd.read_csv("data/cleaned_advance_queensland.csv")

# Combine with the Annual Indigenous Topic mentions
combined_df = pd.merge(advance_df,nmf_annual_topic_df[nmf_annual_topic_df["Topic"] == "Indigenous"] , left_on="Approval Year", right_on = "article_year")
combined_df = combined_df[combined_df["Indigenous_Related"] == 1].reset_index()

print(f"Only {combined_df['Indigenous_Related'].sum()} total indigenous grants were identified.")

Only 16 total indigenous grants were identified.


### Analysis and Visualisation

With such a low number of indigenous funding grants identified, it is clear that this approach of manual annotation has limitations, as one cannot accurately reflect individuals' heritage based solely on the descriptions provided. According to Advance Queensland, [*"8% identify as Aboriginal and/or Torres Strait Islander"*](https://advance.qld.gov.au/about-advance-queensland/about-us) across their funding recipients; however, the manual annotations identified only 4% of funding recipients as being related to the indigenous group within the selected time period.

Manual review of the highest funding recipient in this group revealed that $360k went to Xujuan Zhou in collaboration with Cogninet Australia Pty Ltd, but also Goondir Aboriginal & Torres Strait Islanders Corporation for Health Services. This intersection between multiple manually annotated features highlights the hazards of manual annotation based only on descriptions, as it cannot be determined how much of this funding truly targeted Indigenous groups.

In order to establish which of the identified features ("Aggregated Articles", "Regional", "Year", and "Medical Related") have the greatest impact on the funding of indigenous grants, the dataset has been subset to this group and all variables converted to numeric values to enable analysis of correlations. A correlogram provides a numeric representation of how close to directly proportional the variables are so that one can assess their suitability for use in a linear model, while displaying the relationships between variable pairs on a scatter plot enables review of individual records.
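
Each cell of a correlogram is a Pearson correlation coefficient, which is the default method used by pandas' `DataFrame.corr()`. As a minimal sketch on made-up values (the six rows below are illustrative assumptions), the coefficient can be computed by hand and compared with the pandas result.

```python
import numpy as np
import pandas as pd

# Hypothetical binary feature and funding amounts
toy = pd.DataFrame(
    {"Medical": [0, 0, 1, 1, 0, 1],
     "Funding ($)": [40000, 55000, 150000, 210000, 60000, 180000]}
)

x = toy["Medical"].to_numpy(dtype=float)
y = toy["Funding ($)"].to_numpy(dtype=float)

# Pearson r: covariance of x and y divided by the product of their spreads
x_dev = x - x.mean()
y_dev = y - y.mean()
manual_r = (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())

pandas_r = toy.corr().loc["Medical", "Funding ($)"]
print(f"manual: {manual_r:.3f}, pandas: {pandas_r:.3f}")
```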

In [139]:
# Ensure all variables are numeric
#combined_df["Is_Regional"] = combined_df["LGA Status"] == "Regional Councils"
combined_df["Is_Regional"] = combined_df["LGA Status"].apply(lambda x: 1 if x == "Regional Councils" else 0)

combined_df = combined_df[["aggregated_mentions", "Is_Regional", "Medical_Related" , "Approval Year", "Contractual Commitment ($ GST excl.)"]]

combined_df = combined_df.rename(columns={"aggregated_mentions":"Aggregated Articles", "Is_Regional":"Regional", "Medical_Related":"Medical","Approval Year":"Year", "Contractual Commitment ($ GST excl.)":"Funding ($)"})

# Apply correlogram and Display
advance_corr = combined_df.corr()

indigenous_corr_fig = px.imshow(advance_corr, title = "Correlogram of Selected Features For Indigenous Funding Recipients")
indigenous_corr_fig.update_layout(
    title_font_size=25,
    title_x=0.5,
    legend_title_font_size=15,
    height=500
)
indigenous_corr_fig.show()


indigenous_scatter_fig = px.scatter_matrix(combined_df, title = "Scatter Plots of Selected Features For Indigenous Funding Recipients")

indigenous_scatter_fig.update_layout(
    title_font_size=25,
    title_x=0.5,
    legend_title_font_size=15,
    height=800
)
indigenous_scatter_fig.show()


Reviewing the correlogram reveals that the medical feature has the highest positive correlation with funding received by Indigenous groups, at approximately 0.6. The next strongest is a negative correlation with the regional feature of -0.38. (This is unsurprising, as the previous investigation from which this data was sourced found the distribution of funding was only equitable on a per capita basis.) In contrast, media attention shows a relatively weak negative correlation of -0.24, which is surprising given the expectation of a more direct relationship.

The Year feature can be discarded due to its weak correlation with funding, although it maintains a very high correlation with media attention, rendering it somewhat redundant.

Now that "Aggregated Articles", "Regional", and "Medical Related" have been identified as the most suitable independent variables, these can be used to produce a linear model prediction of "Funding ($)" based upon a training set and then evaluate against a test set. 

Using a linear model allows funding allocation to be predicted as a continuous variable, while a logistic regression would be better suited to binary outcomes such as whether a particular recipient received funding or not. This makes multiple linear regression the more appropriate technique for the objective of quantifying the relationships that drive funding decisions in relation to Indigenous recipients, and in particular the impact of the media.

In [161]:
# Break the current dataset into train and test datasets
x_train, x_test, y_train, y_test = train_test_split(combined_df[["Aggregated Articles", "Regional","Medical"]], 
                                                    combined_df[["Funding ($)"]],
                                                    shuffle=True, train_size=0.8, random_state=99) # Train size determines the percentage use for training the model

# Create a new linear regression model
linear_model = LinearRegression() 

# Train the model with the train dataset
linear_model.fit(x_train, y_train) 

# Calculate the model's R squared score over the training dataset
linear_R2 = r2_score(y_train, linear_model.predict(x_train) ) 

print(f'The model R squared score across the training dataset is: {linear_R2:0.3f}')

The model R squared score across the training dataset is: 0.665


This R<sup>2</sup> indicates the model performs moderately well across the training dataset, which is surprising considering the biases identified in the media article dataset and the suspected issues caused by manually annotating the data.
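
Because a training-set R<sup>2</sup> tends to be optimistic, and an 80/20 split of 16 grants leaves only a handful of test records, leave-one-out cross-validation is one option for an out-of-sample estimate. The sketch below runs on synthetic data (the features, coefficients and noise are all assumptions made for illustration, not the notebook's results) and compares the training score with the leave-one-out score.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Synthetic stand-in for 16 grants with three features
rng = np.random.default_rng(0)
n = 16
X = np.column_stack([
    rng.normal(size=n),          # stand-in for "Aggregated Articles"
    rng.integers(0, 2, size=n),  # stand-in for "Regional" (0/1)
    rng.integers(0, 2, size=n),  # stand-in for "Medical" (0/1)
])
y = 50_000 + 100_000 * X[:, 2] - 30_000 * X[:, 1] + rng.normal(scale=20_000, size=n)

# R squared when the model is scored on the data it was fitted to
train_r2 = r2_score(y, LinearRegression().fit(X, y).predict(X))

# Each record is predicted by a model fitted on the other 15 records
loo_pred = cross_val_predict(LinearRegression(), X, y, cv=LeaveOneOut())
loo_r2 = r2_score(y, loo_pred)

print(f"training R2: {train_r2:.3f}, leave-one-out R2: {loo_r2:.3f}")
```

A large gap between the two scores would suggest the model is fitting noise in the small sample.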

The linear regression model can now be applied to the test dataset to demonstrate the predictions and how these compare to the actual values of funding.

In [160]:
# Apply linear model to predictions
fund_fig_df = y_test.copy()  # copy so the added prediction column does not modify y_test
fund_fig_df["Funding Prediction ($)"] = linear_model.predict(x_test)

linreg_fig = px.scatter(fund_fig_df, 
                        title = "Linear Regression Model Prediction and Actual Advance Queensland Funding for Indigenous Grants",
                        labels = {"variable": "Legend", 
                                  "value":"Government Funding ($)",
                                  "index": "Grant Index"})
linreg_fig.update_layout(
    title_font_size=25,
    title_x=0.5,
    legend_title_font_size=15,
    height=800
)
linreg_fig.show()

Viewing the predictions compared to the actual funding grants shows that the model gets them mostly in the right ballpark, but with a large margin of error. With the binary features having the stronger correlations, it is expected that the model behaves similarly to a clustering model.

### Insight
The analysis yielded critical insights about the relationship between media attention and Indigenous funding through Advance Queensland. While the medical feature correlated strongly with funding for Indigenous groups, media attention surprisingly showed a weak negative correlation. This raises concerns about the influence of media narratives on government priorities.

Biases in the dataset were evident, especially in manual annotations. Although Advance Queensland reports that 8% of funding recipients identify as Aboriginal and/or Torres Strait Islander, the analysis identified only 4% as related to Indigenous groups. This discrepancy highlights the challenges of accurately representing Indigenous interests based solely on descriptive data.

While the linear model provided some correlations, it should not be relied upon for precise predictions due to potential biases and the limitations of the manual annotation process. Instead, the model offers valuable insights into the broader relationships at play, emphasizing the need for robust data collection methods. Reliance on media narratives for funding decisions is problematic; government allocations should be based on objective assessments of community needs rather than fluctuating coverage. Overall, the findings illustrate that while the results may be accurate, they do not translate into meaningful predictions for effective governance.

# Closing statement

In conclusion, this analysis underscores the complex interplay between media attention and Indigenous funding, revealing biases and limitations that can hinder accurate representation. While the findings offer valuable insights into the relationships influencing funding decisions, they also highlight the importance of objective data collection methods. Although it may be possible to acquire a more detailed dataset from Advance Queensland upon request, doing so would not be ethically appropriate due to concerns about personally identifiable information and Indigenous status. Ensuring that government allocations are based on genuine community needs, rather than media narratives, is essential for fostering equitable support for Indigenous communities. Continued examination and refinement of these processes will be crucial for effective governance and meaningful impact.


### **References**

Queensland Government Open Data Portal. (2024, July 1). _Advance Queensland Funding Recipients_. Retrieved August 12, 2024, from https://www.data.qld.gov.au/dataset/advance-queensland-funding-recipients