<div style="background:#FFFFEE; color:#440404; padding:8px; border-radius: 4px; text-align: center; font-weight: 500;">IFQ619 - Data Analytics for Strategic Decision Makers (2024)</div>

# IFQ619 :: UA2 - Extending Analytics (40%)

**IMPORTANT:** Refer to the instructions in Canvas [UA2 - Assignment 2 - extending analytics] *BEFORE* working on this assignment.

#### REQUIREMENTS ####

1. Complete and run the code cell below to display your name, student number, and assignment option
2. Identify an appropriate question (or questions) to be addressed by your overall data analytics narrative
3. Extend your analysis in assignment 1 with:
    - the analysis of additional unstructured data using the Guardian API (See accessing the Guardian API notebook),
    - the use of one machine learning technique (as used in the class materials), and
    - identification of ethical considerations relevant to the analysis (by drawing on class materials).
4. Ensure that you include documentation of your thinking and decision-making using markdown cells
5. Ensure that you include appropriate visualisations, and that they support the overall narrative
6. Ensure that your insights answer your question/s and are appropriate to your narrative. 
7. Ensure that your insights are consistent with the ethical considerations identified.

**NOTE:** you should not repeat the analysis from assignment 1, but you may need to save dataframes from assignment 1 and reload for use in this assignment. You may also summarise your assignment 1 insights as part of the process of identifying questions for analysis.

#### SUBMISSION ####

1. Create an assignment 2 folder named in the form **UA2-surname-idnumber** and put your notebook and any data files inside this folder. Note, do not put large training data in this folder (reference any training data that you used but keep it outside this folder), only keep small data files and models in this folder with your notebook.
2. When you have everything in the correct folder, reset all cells and restart the kernel, then run the notebook completely, checking that all cells have run without error. If you encounter errors, fix your notebook and re-run the process. It is important that your notebook runs without errors only requiring the files in the folder that you have created.
3. When the notebook is error free, zip the entire folder (you can select download folder in Jupyter).
4. Submit the zipped folder in Canvas


In [10]:
# Complete the following cell with your details and run to produce your personalised header for this assignment

from IPython.display import HTML

# personal details
first_name = "Morgan"
last_name = "Meeuwissen"
student_number = "n12240800"

personal_header = f"<h1>{first_name} {last_name} ({student_number})</h1>"
HTML(personal_header)

---


# Themes that use topic modelling?
* Long-term effects of funding? Do they remain in the news or are they short-lived?
* Funding as a result of news (public image leading funding)
* Funding to be lagged one year from the news? Gov funding cycle?
* Impact on Local Jobs: The Guardian’s coverage of labor market issues and economic policies provides a context for evaluating how programs like Advance Queensland impact local employment. Discussions on job creation and displacement relate directly to the ethical consideration of benefiting local communities through such initiatives.
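
The one-year lag idea in the list above could be checked roughly as follows. This is only a sketch: the yearly figures are made-up illustrative values, and the column names `article_count` and `funding` are assumptions rather than names from the assignment 1 dataframes.

```python
import pandas as pd

# Hypothetical yearly figures (illustrative values only, not real data)
yearly = pd.DataFrame(
    {
        "year": [2016, 2017, 2018, 2019, 2020],
        "article_count": [10, 20, 15, 30, 25],
        "funding": [5.0, 11.0, 19.0, 16.0, 29.0],
    }
).set_index("year")

# Shift article counts forward one year so that each funding figure
# lines up with the previous year's news coverage
yearly["article_count_prev_year"] = yearly["article_count"].shift(1)

# Pearson correlation with and without the lag (rows with NaN are skipped)
same_year_corr = yearly["funding"].corr(yearly["article_count"])
lagged_corr = yearly["funding"].corr(yearly["article_count_prev_year"])

print(f"Same-year correlation: {same_year_corr:.3f}")
print(f"One-year-lag correlation: {lagged_corr:.3f}")
```

With only a handful of yearly points, any such correlation would be indicative at best.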

Word limit: 1500–3000 words. Three-four analytics cycles?

**Outline:**

* Context from first assignment and final insight + graph - fairly distributed -> Overarching question... does it make a difference?
* Question: What does the media attention show? Topic cluster graphs
* How much of the media attention is for Indigenous matters? Cumulative clusters bar chart by articles.
* Question: What impacts media attention from the funding? (count articles on the topic and correlogram)
* Question: Can a relationship be established? Predict funding from public opinion: articles via linear regression, or perhaps a logistic regression on whether any attention is given by topic (if article counts are low).


Can we topic model across multiple corpora? Take topics from the program name/project name and find matches in The Guardian.
https://towardsdatascience.com/unsupervised-nlp-topic-models-as-a-supervised-learning-input-cf8ee9e5cf28

* Train LDA Model on 100,000 Restaurant Reviews from 2016
* Grab Topic distributions for every review using the LDA Model
* Use Topic Distributions directly as feature vectors in supervised classification models (Logistic Regression, SVC, etc) and get F1-score.
* Use the same 2016 LDA model to get topic distributions from 2017 (the LDA model did not see this data!)

Want a topic that combines regional and Indigenous? Bigram "traditional owner"
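
The steps noted above (fit LDA on one corpus, reuse it on unseen text, feed the topic distributions to a classifier) might look something like this in scikit-learn. It is a minimal sketch on a made-up mini corpus with hypothetical labels, not the linked article's code.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up mini corpus; label 1 = mining-related, 0 = health-related (hypothetical)
train_docs = [
    "coal mine royalty project approved",
    "coal export price and mining company",
    "mining royalty increase for coal",
    "hospital health vaccine rollout",
    "health workers vaccine and hospital beds",
    "hospital funding for health services",
]
train_labels = [1, 1, 1, 0, 0, 0]
new_docs = ["new coal mining project", "vaccine supply for hospital"]

# LDA works on raw term counts, so use CountVectorizer rather than TF-IDF
count_vec = CountVectorizer(stop_words="english")
train_counts = count_vec.fit_transform(train_docs)

# Fit the topic model on the training corpus only
lda = LatentDirichletAllocation(n_components=2, random_state=42)
train_topics = lda.fit_transform(train_counts)

# Use the per-document topic distributions as feature vectors
clf = LogisticRegression()
clf.fit(train_topics, train_labels)

# Reuse the fitted vectoriser and LDA model on documents they have not seen
new_topics = lda.transform(count_vec.transform(new_docs))
predictions = clf.predict(new_topics)
print(predictions)
```

The same pattern would apply if the first corpus were the program/project names and the second the Guardian articles, although very short texts may give unstable topics.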

# Past assignment Andrew notes
Extend your assignment 1 analysis by including consideration of ethical and human factors, including 1 extended analysis technique, and using at least 1 additional data source. Possible additional data sources:

* Guardian API (See accessing the Guardian API notebook)
* Hospital location data (Provided, see data folder)


# Insights
Start off by working out what the QLD gov articles actually show. This will be done through topic modelling variants and then clustering.
The articles JSON has only the title as the key and the text as the value, but the title contains the date, which can be extracted for potential use.
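
A minimal sketch of pulling the date out of an article title is shown here. The title format is an assumption: it presumes each key contains a bracketed date such as `[2021-06-15]`, and the example titles are hypothetical.

```python
import re
from datetime import date

# Hypothetical article keys; the real titles are assumed to contain "[YYYY-MM-DD"
example_titles = [
    "Queensland budget boosts regional funding [2021-06-15]",
    "Treaty talks begin in Brisbane [2019-08-02]",
    "Headline with no date",
]

# Capture year, month and day that follow an opening square bracket
DATE_PATTERN = re.compile(r"\[(\d{4})-(\d{2})-(\d{2})")


def extract_date(title):
    """Return the first bracketed date in the title, or None if there is none."""
    match = DATE_PATTERN.search(title)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    return date(year, month, day)


article_dates = {title: extract_date(title) for title in example_titles}
for title, found in article_dates.items():
    print(found, "<-", title)
```

Keeping the full date rather than only the year leaves the option of grouping by month or financial year later.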


Is clustering into topics already done by LDA and NMF? TF-IDF weights terms per document (scaled by how common each term is across the corpus), while LDA and NMF learn topics across the whole corpus.
Establish topics that match Advance Queensland and then use them as predictors by year?

No mentions of Advance QLD
QLD Gov search not finding a cluster for Regional
**Try working from regional Queensland back? Just regional? A specific approach like this could produce far fewer clusters...**

https://open-platform.theguardian.com/explore/ very useful. Narrow down from 79

In [11]:
# Import the necessary libraries
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, ENGLISH_STOP_WORDS
from sklearn.decomposition import LatentDirichletAllocation, NMF
import pandas as pd
import json

import re
import spacy

In [12]:
# Load the data - articles from The Guardian about the Queensland Government
file_path = "data/"
file_name = "qld_gov_articles.json"

with open(f"{file_path}{file_name}",'r', encoding='utf-8') as fp:
    articles = json.load(fp)

#print(f"Loaded {len(articles)} articles from {file_name}")


# Lemmatising function, adapted from:
# https://jonathansoma.com/lede/image-and-sound/text-analysis/text-analysis-word-counting-lemmatizing-and-tf-idf/


nlp = spacy.load("en_core_web_sm")
# Source: spaCy is a dream, but a dream where sometimes your legs won’t move right and you can’t read text. But sometimes you can fly! So yes, as always, ups and downs
# Morgan: I hate this
# python -m pip install spacy
# python -m spacy download en_core_web_sm

def lemmatize(text):
    doc = nlp(text)
    # Turn it into tokens, ignoring the punctuation
    tokens = [token for token in doc if not token.is_punct]
    # Convert those tokens into lemmas, EXCEPT the pronouns, we'll keep those.
    lemmas = [token.lemma_ if token.pos_ != 'PRON' else token.orth_ for token in tokens]
    return lemmas


# Custom stop words: scikit-learn's English list plus day names, time-zone abbreviations,
# markup debris and generic news terms that add little to topic meaning
StopWords = list(ENGLISH_STOP_WORDS.union(["Monday","Tuesday","Wednesday","Thursday","Friday", "Saturday", "Sunday","nbsp", "\n", "|", "\n ", 
                                           "$", "year", "m", "new", "need", "increase","bst", "gmt", "says", "told"]))


# #count_vectorizer = CountVectorizer(preprocessor=lambda x: x.replace('(\D+)', "", regex = True))
# #count_vectorizer = CountVectorizer(
# count_vectorizer = CountVectorizer(
#                                    preprocessor=lambda x: re.sub(r"([\d\.])+", "NUM", x),
#                                    max_df=0.75,min_df=5,max_features=10000,
#                                    stop_words=StopWords, #Add stop words
#                                    tokenizer=lemmatize, # Lemmatiseer
#                                    ngram_range = (1,2)) #Use Bigrams as well to pick up things like "First Nation"
# count_dt_matrix = count_vectorizer.fit_transform(articles.values())

In [72]:
# TFIDF
# Only count terms that appear in at most 75% of documents and in at least 5 documents.
# Keep a maximum of 10000 terms, and remove common English stop words
tfidf_vectorizer = TfidfVectorizer(
                                   preprocessor=lambda x: re.sub(r"([\d\.])+", "NUM", x), # Replace runs of digits and dots with NUM
                                   max_df=0.75,min_df=5,max_features=10000,
                                   stop_words=StopWords, # Add stop words
                                   tokenizer=lemmatize, # Lemmatiser
                                   ngram_range = (1,2) # Use bigrams as well to pick up things like "First Nations"
)


tfidf_dt_matrix = tfidf_vectorizer.fit_transform(articles.values())


The parameter 'token_pattern' will not be used since 'tokenizer' is not None'


Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['far', 'make', 'say', 'tell', 'whereaft'] not in stop_words.



Introduce NMF

In [22]:
# Set number of topics
num_topics = 15
# Set max number of iterations
max_iterations = 500
feature_names = tfidf_vectorizer.get_feature_names_out()

# Create the model
nmf_model = NMF(n_components=num_topics,init='random', random_state=42  # Set random state to have reproducible results
                ,beta_loss='frobenius', max_iter=max_iterations)

# Fit the model to the data and use it to transform the data
doc_topic_nmf = nmf_model.fit_transform(tfidf_dt_matrix)

topic_term_nmf = nmf_model.components_

In [44]:
# Get the top 10 terms (by weight) for each topic
nmf_topic_dict = {}
for index, topic in enumerate(topic_term_nmf):
    zipped = zip(feature_names, topic)
    top_terms = dict(sorted(zipped, key=lambda t: t[1], reverse=True)[:10])
    nmf_topic_dict[f"Topic_{index}"] = {term: round(weight, 4) for term, weight in top_terms.items()}

# Print the topics with their terms, collecting one row per (topic, term, weight)
nmf_topic_rows = []
for k, v in nmf_topic_dict.items():
    print(k)
    print(v)
    print()

    for t, w in v.items():
        nmf_topic_rows.append([k, t, w])

# Long-format dataframe used for plotting
nmf_topic_terms_df = pd.DataFrame(nmf_topic_rows, columns=["Topic Cluster", "Term", "Weight"])

Topic_0
{'coal': 1.7919, 'royalty': 0.9452, 'mining': 0.4705, 'tonne': 0.4603, 'coalmine': 0.4414, 'price': 0.3856, 'thermal': 0.3722, 'company': 0.3659, 'climate': 0.3513, 'project': 0.3089}

Topic_1
{'treaty': 0.6872, 'indigenous': 0.3214, 'First': 0.3103, 'First Nations': 0.3018, 'Nations': 0.2941, 'Aboriginal': 0.2928, 'people': 0.2772, 'traditional': 0.261, 'native title': 0.2572, 'Torres': 0.257}

Topic_2
{'Adani': 2.082, 'Carmichael': 0.5473, 'rail': 0.5191, 'project': 0.5006, 'loan': 0.4986, 'Naif': 0.4432, 'royalty': 0.3797, 'Adani ’s': 0.3427, 'Labor': 0.3379, 'company': 0.295}

Topic_3
{'child': 1.7003, 'youth': 1.4161, 'watch house': 1.0496, 'watch': 0.9755, 'detention': 0.9177, 'house': 0.8043, 'young': 0.6483, 'police': 0.5962, 'justice': 0.5411, 'crime': 0.4954}

Topic_4
{'border': 1.0037, 'Covid': 0.9456, 'case': 0.9344, 'quarantine': 0.8485, 'health': 0.8022, 'NSW': 0.704, 'test': 0.6995, 'Covid NUM': 0.6563, 'vaccine': 0.6214, 'coronavirus': 0.5755}

Topic_5
{'energy'

### **Topics:**

* Resources
* Indigenous
* Adani
* Youth Crime
* COVID
* Energy
* Great Barrier Reef
* Adani Environmental Impact
* Environment
* Housing
* Domestic Violence
* Other 1
* Other 2
* Flooding
* Sport


In [45]:
# Plot the top terms for each NMF topic as a grid of horizontal bar charts
import plotly.express as px

nmf_topic_terms_fig = px.bar(nmf_topic_terms_df,
       x ="Weight",
       y = "Term",
       facet_col="Topic Cluster", 
       facet_col_wrap=5,
       orientation='h',
       title = "Top 10 Terms by Weight for Each NMF Topic")

nmf_topic_terms_fig.update_layout(
    title_font_size=25,
    title_x=0.5,
    legend_title_font_size=15,
    height=1000
)
nmf_topic_terms_fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[-1])) # https://plotly.com/python/facet-plots/
nmf_topic_terms_fig.update_yaxes(showticklabels=True, matches=None)

In [99]:
# Build a document-topic dataframe, then total each topic's weight across all articles

doc_topic_nmf_df =  pd.DataFrame(doc_topic_nmf,index=articles.keys(), columns= nmf_model.get_feature_names_out())
doc_topic_nmf_df = doc_topic_nmf_df.reset_index().rename(columns={"index":"article"})
# Extract the four-digit year that follows "[" in each article title
doc_topic_nmf_df["article_year"] = doc_topic_nmf_df["article"].str.extract(r"(?<=\[)(\d{4})(?=\-)")

doc_topic_nmf_df = doc_topic_nmf_df.drop('article', axis=1)
doc_topic_nmf_df.set_index("article_year", inplace= True)

# Sum the raw NMF weights per topic; the summed values arrive in a column named 0
total_weight_col = "Total NMF Topic Weight Across All Articles (not normalised)"
sum_doc_topic_nmf_df = doc_topic_nmf_df.sum().reset_index().rename(columns={"index":"Topic", 0:total_weight_col})

px.bar(sum_doc_topic_nmf_df, x = "Topic", y = total_weight_col)
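
The totals plotted above are raw sums of NMF weights. One possible way to normalise them is to convert each article's weights to shares that sum to 1 and then average the shares per topic. The sketch below uses a made-up three-article matrix, not the real `doc_topic_nmf` output.

```python
import pandas as pd

# Toy document-topic weights (made-up values; rows are articles, columns are topics)
doc_topic = pd.DataFrame(
    [[0.2, 0.6, 0.0],
     [0.1, 0.1, 0.3],
     [0.0, 0.0, 0.8]],
    columns=["nmf0", "nmf1", "nmf2"],
)

# Scale each row to sum to 1 so every article contributes equally
row_totals = doc_topic.sum(axis=1)
doc_topic_share = doc_topic.div(row_totals, axis=0)

# Average share of each topic across all articles
mean_topic_share = (
    doc_topic_share.mean()
    .rename("Mean topic share")
    .rename_axis("Topic")
    .reset_index()
)
print(mean_topic_share)
```

An article whose weights are all zero would give a division by zero here, so such rows would need to be dropped or handled separately on real data.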

In [98]:
sum_doc_topic_nmf_df

| | Topic | 0 |
|---|---|---|
| 0 | nmf0 | 10.447087 |
| 1 | nmf1 | 15.669275 |
| 2 | nmf2 | 13.788805 |
| 3 | nmf3 | 10.529782 |
| 4 | nmf4 | 11.574933 |
| 5 | nmf5 | 12.738551 |
| 6 | nmf6 | 6.40192 |
| 7 | nmf7 | 11.631789 |
| 8 | nmf8 | 9.690547 |
| 9 | nmf9 | 7.485206 |
