<div style="background:#E9FFF6; color:#440404; padding:8px; border-radius: 4px; text-align: center; font-weight: 500;">IFQ619 - Data Analytics for Strategic Decision Makers (2024)</div>

# IFQ619 :: C1-UnstructuredAnalytics

For this session, the focus will be on analysis of unstructured text. However, the thinking required is similar to approaches to analysing images, video, sound and other unstructured data. Primarily, the analysis is based on the notion that there are useful patterns in the unstructured data which can be obtained mathematically. By converting the data to a mathematical structure, various algorithms can be applied to the structure with the aim of identifying patterns. 

In the case of the `topic modelling` approaches below, many of the techniques are *probabilistic* - that is they mathematically identify the *likelihood* that a feature might be important. Thus, they are never 100% accurate, and their use needs to be mediated by a more pragmatic *useful or not* approach, rather than *right or wrong*.

In [1]:
# Import the necessary libraries
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, ENGLISH_STOP_WORDS
from sklearn.decomposition import LatentDirichletAllocation, NMF
import pandas as pd
import json
import random
from sklearn.cluster import KMeans

### Accessing the data via The Guardian API

See the `Accessing_the_Guardian_API.ipynb` notebook file for details on getting the data. **Note:** This approach may be used for additional data for Assignment 2.

### Read in pre-saved data

To save time, we're loading in pre-saved data that was fetched using the Guardian API.

In [2]:
# Load the data - articles from The Guardian about the war in Ukraine
file_path = "data/"
file_name = "qld_gov_articles.json"

with open(f"{file_path}{file_name}",'r', encoding='utf-8') as fp:
    articles = json.load(fp)

print(f"Loaded {len(articles)} articles from {file_name}")

Loaded 1063 articles from qld_gov_articles.json


Each dictionary entry includes the *title [date]* as `key` and the *body text* from the article as `value`.

In [3]:
#article1 = list(articles.items())[0]
#print("Key:",article1[0])
#print("Value:",article1[1][:300],"...") # Just show first 300 characters

So the values gives us a list of documents that we can analyse.

In [4]:
# Get a list of documents
#documents = list(articles.values())

# View first 400 characters of the 1st document
#documents[0][:400]

### Term Count 

**Finding important terms by the frequency of their occurance**

Using `CountVectorizer` create a `vector` for each document where the dimensionality of the vector is the `vocabulary` (all terms in the collection), and the value of each component is the number of times that the `term` occurs in the document.

All of these analyses, approach the document as a [Bag of Words](https://en.wikipedia.org/wiki/Bag-of-words_model) model. In this approach, the order of the words don't matter. A popular approach that takes into account order is [Word embedding](https://en.wikipedia.org/wiki/Word_embedding). This session does not explore word embedding.

In [5]:
# AYOOOOOO WE DOING A LEMMATISING FUNCTION
# https://jonathansoma.com/lede/image-and-sound/text-analysis/text-analysis-word-counting-lemmatizing-and-tf-idf/

import spacy
nlp = spacy.load("en_core_web_sm")
# Source: spaCy is a dream, but a dream where sometimes your legs won’t move right and you can’t read text. But sometimes you can fly! So yes, as always, ups and downs
# Morgan: I hate this
# python -m spacy download en
# python -m pip install spacy

def lemmatize(text):
    doc = nlp(text)
    # Turn it into tokens, ignoring the punctuation
    tokens = [token for token in doc if not token.is_punct]
    # Convert those tokens into lemmas, EXCEPT the pronouns, we'll keep those.
    lemmas = [token.lemma_ if token.pos_ != 'PRON' else token.orth_ for token in tokens]
    return lemmas

In [6]:
# Only count terms that in maximum of 75% of documents, and a minimum of 2 documents. 
# Count a maximum of 10000 terms, and remove common english stop words
StopWords = list(ENGLISH_STOP_WORDS.union(["Monday","Tuesday","Wednesday","Thursday","Friday", "Saturday", "Sunday","nbsp", "\n", "|", "\n ", 
                                           "$", "year", "m", "new", "need", "increase","bst", "gmt", "says", "year", "told"]))
import re

#count_vectorizer = CountVectorizer(preprocessor=lambda x: x.replace('(\D+)', "", regex = True))
#count_vectorizer = CountVectorizer(
count_vectorizer = CountVectorizer(
                                   preprocessor=lambda x: re.sub(r"([\d\.])+", "NUM", x),
                                   max_df=0.60,min_df=5,max_features=10000,
                                   stop_words=StopWords, #Add stop words
                                   tokenizer=lemmatize, # Lemmatiseer
                                   ngram_range = (1,2)) #Use Bigrams as well to pick up things like "First Nation"
count_dt_matrix = count_vectorizer.fit_transform(articles.values())



**I HAVE ADDED MY OWN ADDITIONAL STOP WORDS HERE FOR NUMBERS**
NOt sufficient, referred to https://stackoverflow.com/questions/43216530/adding-numbers-to-stop-words-to-scikit-learns-countvectorizer

In [7]:
# Take a look at the vector for the first document
#doc001_vector = count_dt_matrix.toarray()[0]
#doc001_vector

In [8]:
# Get the 1000 terms identified during the vectorization process
feature_names = count_vectorizer.get_feature_names_out()
feature_names

array(['\n \n', '\n \n ', '\n \n \n ', ..., '’ve read', '’ve say',
       '’ve tell'], dtype=object)

In [9]:
# Look at how the counts match up to the terms (for the 1st doc)
#doc001_term_counts = list(zip(feature_names,doc001_vector[0])) 
#doc001_term_counts

In [10]:
# Take a look at the vocabulary which shows the total counts for whole collection
count_vectorizer.vocabulary_

{'prominent': 7332,
 'victim': 9451,
 'right': 7923,
 'group': 5177,
 'remove': 7748,
 'child': 3279,
 'police': 7060,
 'watch': 9534,
 'house': 5402,
 'release': 7713,
 'confront': 3613,
 'footage': 4830,
 'brutal': 3056,
 'treatment': 9260,
 'adult': 2528,
 'holding': 5360,
 'video': 9461,
 'publish': 7428,
 'long': 6061,
 'investigation': 5699,
 'SBS': 2009,
 'young': 9743,
 'people': 6899,
 'lock': 6049,
 'freeze': 4879,
 'isolation': 5726,
 'cell': 3205,
 'panic': 6816,
 'struggle': 8820,
 'indefinite': 5529,
 'detention': 4061,
 'overcrowded': 6768,
 'broadly': 3049,
 'criticise': 3846,
 'human': 5421,
 'organisation': 6727,
 'member': 6251,
 'crime': 3824,
 'zero': 9759,
 'tolerance': 9171,
 'approach': 2717,
 'youth': 9751,
 'crimeNUM': 3825,
 '\n  \n \n': 15,
 'term': 9045,
 'Labor': 1227,
 'MP': 1305,
 'Bush': 449,
 'appear': 2708,
 'break': 3031,
 'rank': 7510,
 'justice': 5793,
 'policynum': 7084,
 'work': 9675,
 'support': 8888,
 'enter': 4435,
 'parliament': 6830,
 'tell'

In [11]:
count_vectorizer.vocabulary_["regional"]

7652

#### Display matrix in dataframe

Take the term count matrix and display in a dataframe to make visible the structure


In [12]:
count_dt_matrix.toarray()

array([[1, 2, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [13]:
# Create a new dataframe with the matrix - use titles for the index and terms for the columns
count_df = pd.DataFrame(count_dt_matrix.toarray(), index=articles.keys(), columns=feature_names)
count_df

Unnamed: 0,\n \n,\n \n.1,\n \n \n,\n \n \n.1,\n \n.2,\n \n \n \n,\n \n \n \n sign,\n \n \n.2,\n \n relate,\n \n Sign,...,’s work,’s world,’s youth,’ve,’ve NUM,’ve just,’ve make,’ve read,’ve say,’ve tell
‘Harrowing’ footage sparks calls for Queensland government to remove children from police watch houses [2024-07-18T15:00:15Z],1,2,0,0,3,0,0,0,3,0,...,0,0,0,0,0,0,0,0,0,0
Queensland government accused of cowing to Christian Lobby on anti-discrimination bill [2024-06-14T15:00:56Z],0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Queensland government hoses down suggestions it is considering bailout for Bonza [2024-05-10T08:26:48Z],0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Queensland government accused of failing to provide adequate schooling to locked up children [2024-04-11T15:00:28Z],0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
"‘There has to be a way’: Queensland government working to reunite Molly the magpie with family, premier says [2024-03-28T06:26:13Z]",1,1,0,0,1,0,0,0,1,1,...,0,0,0,1,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Mark Latham takes aim at Sky boss in post-sacking Twitter spray | The Weekly Beast [2017-03-30T20:35:02Z],0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Where the 2017 Queensland election will be won and lost [2017-10-29T02:00:35Z],0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
"Bill Shorten takes town hall test: Trump, tax, refugees and the gum tree menace [2017-03-02T10:15:44Z]",0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
Indigenous incarceration: turning the tide on colonisation's cruel third act [2017-02-20T03:08:41Z],0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


By selecting a row from the dataframe and sorting the values (counts), we can identify the top 10 terms

#### Create a top10 terms dataframe

Using the index from the documents, create a dataframe that can hold the top10 terms for each document. We also include columns for our other analysis (tfidf, lda, lda)

In [14]:
# Create a dataframe to hold top terms for each analysis type
terms_df = pd.DataFrame(index=count_df.index,columns=['count','tfidf','lda','nmf'])
terms_df

Unnamed: 0,count,tfidf,lda,nmf
‘Harrowing’ footage sparks calls for Queensland government to remove children from police watch houses [2024-07-18T15:00:15Z],,,,
Queensland government accused of cowing to Christian Lobby on anti-discrimination bill [2024-06-14T15:00:56Z],,,,
Queensland government hoses down suggestions it is considering bailout for Bonza [2024-05-10T08:26:48Z],,,,
Queensland government accused of failing to provide adequate schooling to locked up children [2024-04-11T15:00:28Z],,,,
"‘There has to be a way’: Queensland government working to reunite Molly the magpie with family, premier says [2024-03-28T06:26:13Z]",,,,
...,...,...,...,...
Mark Latham takes aim at Sky boss in post-sacking Twitter spray | The Weekly Beast [2017-03-30T20:35:02Z],,,,
Where the 2017 Queensland election will be won and lost [2017-10-29T02:00:35Z],,,,
"Bill Shorten takes town hall test: Trump, tax, refugees and the gum tree menace [2017-03-02T10:15:44Z]",,,,
Indigenous incarceration: turning the tide on colonisation's cruel third act [2017-02-20T03:08:41Z],,,,


Populate the count column with data created by the count vectorizer.

In [15]:
#For each doc, get the 10 columns with the largest counts
for idx in terms_df.index:
    counts = dict(count_df.loc[idx].sort_values(ascending=False).head(10))
    #print(counts)
    terms_df.at[idx,'count'] = list(counts.keys()) # Just the list of terms

terms_df

Unnamed: 0,count,tfidf,lda,nmf
‘Harrowing’ footage sparks calls for Queensland government to remove children from police watch houses [2024-07-18T15:00:15Z],"[child, watch, watch house, house, footage, yo...",,,
Queensland government accused of cowing to Christian Lobby on anti-discrimination bill [2024-06-14T15:00:56Z],"[school, base, teacher, Labor, experience, fai...",,,
Queensland government hoses down suggestions it is considering bailout for Bonza [2024-05-10T08:26:48Z],"[administrator, aircraft, federal, financial, ...",,,
Queensland government accused of failing to provide adequate schooling to locked up children [2024-04-11T15:00:28Z],"[youth, child, hour, detention, centre, educat...",,,
"‘There has to be a way’: Queensland government working to reunite Molly the magpie with family, premier says [2024-03-28T06:26:13Z]","[Molly, family, department, bird, couple, care...",,,
...,...,...,...,...
Mark Latham takes aim at Sky boss in post-sacking Twitter spray | The Weekly Beast [2017-03-30T20:35:02Z],"[Sky, political, News, Australian, , change, ...",,,
Where the 2017 Queensland election will be won and lost [2017-10-29T02:00:35Z],"[Labor, LNP, nation, seat, vote, party, Hanson...",,,
"Bill Shorten takes town hall test: Trump, tax, refugees and the gum tree menace [2017-03-02T10:15:44Z]","[Shorten, Labor, We, want, He, I, people, said...",,,
Indigenous incarceration: turning the tide on colonisation's cruel third act [2017-02-20T03:08:41Z],"[indigenous, people, justice, child, issue, pr...",,,


In [16]:
# # Sample 5 random articles
# samples = random.sample(range(0,len(terms_df)),5)

# for sample in samples:
#     doc = count_df.iloc[sample]
#     top_terms = dict(count_df.iloc[sample].sort_values(ascending=False).head(10))
#     print(f"[{sample}] {doc.name}")
#     print("\t- Top terms:",top_terms)

### Term Frequency / Inverse Document Frequency (TF/IDF)

**Finding terms that are very common in a document, but less common in the whole collection**

The [TF/IDF](https://en.wikipedia.org/wiki/Tf–idf) algorithm takes the term frequencies for a document and divides them by the frequencies of the terms in the whole collection.


In [17]:
# Only count terms that in maximum of 75% of documents, and a minimum of 2 documents. 
# Count a maximum of 10000 terms, and remove common english stop words
tfidf_vectorizer = TfidfVectorizer(
                                   preprocessor=lambda x: re.sub(r"([\d\.])+", "NUM", x),
                                   max_df=0.60,min_df=5,max_features=10000,
                                   stop_words=StopWords, #Add stop words
                                   tokenizer=lemmatize, # Lemmatiseer
                                   ngram_range = (1,2) #Use Bigrams as well to pick up things like "First Nation"
)

In [18]:
# Get the document vectors
tfidf_dt_matrix = tfidf_vectorizer.fit_transform(articles.values())

# Display the vector for the first document
#tfidf_dt_matrix.toarray()[0]



#### Display matrix in dataframe

In [19]:
tfidf_df = pd.DataFrame(tfidf_dt_matrix.toarray(), index=articles.keys(), columns=tfidf_vectorizer.get_feature_names_out())
tfidf_df

Unnamed: 0,\n \n,\n \n.1,\n \n \n,\n \n \n.1,\n \n.2,\n \n \n \n,\n \n \n \n sign,\n \n \n.2,\n \n relate,\n \n Sign,...,’s work,’s world,’s youth,’ve,’ve NUM,’ve just,’ve make,’ve read,’ve say,’ve tell
‘Harrowing’ footage sparks calls for Queensland government to remove children from police watch houses [2024-07-18T15:00:15Z],0.026480,0.050064,0.0,0.0,0.065820,0.0,0.0,0.0,0.066178,0.000000,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
Queensland government accused of cowing to Christian Lobby on anti-discrimination bill [2024-06-14T15:00:56Z],0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
Queensland government hoses down suggestions it is considering bailout for Bonza [2024-05-10T08:26:48Z],0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
Queensland government accused of failing to provide adequate schooling to locked up children [2024-04-11T15:00:28Z],0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
"‘There has to be a way’: Queensland government working to reunite Molly the magpie with family, premier says [2024-03-28T06:26:13Z]",0.045693,0.043194,0.0,0.0,0.037859,0.0,0.0,0.0,0.038064,0.057274,...,0.0,0.0,0.0,0.024792,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Mark Latham takes aim at Sky boss in post-sacking Twitter spray | The Weekly Beast [2017-03-30T20:35:02Z],0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
Where the 2017 Queensland election will be won and lost [2017-10-29T02:00:35Z],0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
"Bill Shorten takes town hall test: Trump, tax, refugees and the gum tree menace [2017-03-02T10:15:44Z]",0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,...,0.0,0.0,0.0,0.012028,0.0,0.0,0.0,0.0,0.0,0.0
Indigenous incarceration: turning the tide on colonisation's cruel third act [2017-02-20T03:08:41Z],0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0


#### Update the terms matrix

In [20]:
for idx in terms_df.index:
    tfidf = dict(tfidf_df.loc[idx].sort_values(ascending=False).head(10))
    #print(counts)
    terms_df.at[idx,'tfidf'] = list(tfidf.keys()) 

terms_df

Unnamed: 0,count,tfidf,lda,nmf
‘Harrowing’ footage sparks calls for Queensland government to remove children from police watch houses [2024-07-18T15:00:15Z],"[child, watch, watch house, house, footage, yo...","[child, watch, watch house, footage, house, ad...",,
Queensland government accused of cowing to Christian Lobby on anti-discrimination bill [2024-06-14T15:00:56Z],"[school, base, teacher, Labor, experience, fai...","[teacher, faith base, school, faith, discrimin...",,
Queensland government hoses down suggestions it is considering bailout for Bonza [2024-05-10T08:26:48Z],"[administrator, aircraft, federal, financial, ...","[administrator, aircraft, airline, creditor, p...",,
Queensland government accused of failing to provide adequate schooling to locked up children [2024-04-11T15:00:28Z],"[youth, child, hour, detention, centre, educat...","[youth, detention, detention centre, education...",,
"‘There has to be a way’: Queensland government working to reunite Molly the magpie with family, premier says [2024-03-28T06:26:13Z]","[Molly, family, department, bird, couple, care...","[Molly, carer, bird, reunite, family, couple, ...",,
...,...,...,...,...
Mark Latham takes aim at Sky boss in post-sacking Twitter spray | The Weekly Beast [2017-03-30T20:35:02Z],"[Sky, political, News, Australian, , change, ...","[Sky, Sky News, News, ad, political, Lloyd, re...",,
Where the 2017 Queensland election will be won and lost [2017-10-29T02:00:35Z],"[Labor, LNP, nation, seat, vote, party, Hanson...","[LNP, Labor, nation, Hanson, seat, vote, prima...",,
"Bill Shorten takes town hall test: Trump, tax, refugees and the gum tree menace [2017-03-02T10:15:44Z]","[Shorten, Labor, We, want, He, I, people, said...","[Shorten, Labor, hall, yes, progressive, Labor...",,
Indigenous incarceration: turning the tide on colonisation's cruel third act [2017-02-20T03:08:41Z],"[indigenous, people, justice, child, issue, pr...","[indigenous, justice, child, Aboriginal, Torre...",,


#### Compare approaches

In [21]:
# # Sample 5 random articles
# samples = random.sample(range(0,len(terms_df)),5)

# for sample in samples:
#     doc = terms_df.iloc[sample]
#     print(f"[{sample}] {doc.name}")
#     print("\t- Counts:\t",doc['count'])
#     print("\t- TFIDF:\t",doc['tfidf'])
#     print()

### Topic modelling with Latent Dirichlet Allocation (LDA)

[LDA](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) is an algorithm for obtaining *topics* (a list of terms) from a document-term matrix. It is a generative probabilistic approach to *decomposition* of the document-term matrix into 2 factor matrices: document-topic and topic-term.

![img](https://editor.analyticsvidhya.com/uploads/26864dtm.JPG)

*Source: [Analytics Vidhya](https://www.analyticsvidhya.com/blog/2021/06/part-2-topic-modeling-and-latent-dirichlet-allocation-lda-using-gensim-and-sklearn/)*

The LDA model requires the number of topics to be set in advance. As it is a generative model, it also runs over a number of iterations. These values usually need to be experimented with to obtain quality topics.

In [22]:
# Set number of topics
num_topics = 15
# Set max number of iteractions
max_iterations = 500

# Create the model
lda_model = LatentDirichletAllocation(n_components=num_topics,max_iter=max_iterations,learning_method='online')

# Fit the model to the data, and use the model to transform the data (do the decomposition)
doc_topic_matrix = lda_model.fit_transform(count_dt_matrix) # Note that this uses the counts already processed

# Obtain the topics
topic_term_matrix = lda_model.components_

#### View the topics

In [23]:
# Get the topics and their terms
lda_topic_dict = {}
for index, topic in enumerate(topic_term_matrix):
    zipped = zip(feature_names, topic)
    top_terms=dict(sorted(zipped, key = lambda t: t[1], reverse=True)[:10])
    #print(top_terms)
    top_terms_list= {key : round(top_terms[key], 4) for key in top_terms.keys()}
    lda_topic_dict[f"topic_{index}"] = top_terms_list

lda_topic_terms_df = []
# Print the topics with their terms    
for k,v in lda_topic_dict.items():
    print(k)
    print(v)
    print()

    for t, w in v.items():
        lda_topic_terms_df.append([k, t, w])

lda_topic_terms_df = pd.DataFrame(lda_topic_terms_df)
lda_topic_terms_df.rename(columns={lda_topic_terms_df.columns[0]: "Topic Cluster",
                                       lda_topic_terms_df.columns[1]: "Term",
                                       lda_topic_terms_df.columns[2]: "Weight" }, inplace=True)

topic_0
{'I': 1581.5397, 'people': 719.769, ' ': 554.9478, 'like': 459.1645, 'We': 451.7063, 'just': 451.1717, 'tell': 439.311, 'party': 420.4249, 'time': 418.1599, 'federal': 402.5929}

topic_1
{'Adani': 1639.6172, 'Carmichael': 384.7475, 'Adani ’s': 327.3459, 'company': 323.4741, 'project': 304.1369, 'coal': 229.0547, 'royalty': 209.3115, 'loan': 207.6915, 'rail': 195.465, 'coalmine': 188.2865}

topic_2
{'housing': 331.3087, 'NUMm': 302.8824, 'cost': 272.1416, ' ': 271.6818, 'home': 255.4294, 'num%': 223.5067, 'We': 221.654, 'rent': 220.6801, 'people': 213.0859, 'saidNUM': 202.4782}

topic_3
{'police': 786.7328, 'violence': 455.8961, 'domestic': 377.815, 'woman': 314.6548, 'officer': 276.3137, 'domestic violence': 273.0379, 'inquiry': 242.4935, 'court': 206.3884, 'law': 179.398, 'victim': 175.6418}

topic_4
{'reduce rate': 0.0667, 'australian premier': 0.0667, 'say resource': 0.0667, 'underneath': 0.0667, 'fund Queensland': 0.0667, 'concernsnum': 0.0667, 'say great': 0.0667, 'exact':

#### Update the terms matrix

In [24]:
for idx,topic in enumerate(doc_topic_matrix):
     # enumerate just takes a list and returns the value and a counter for the loop starting from 0
    topic_num = topic.argmax()
    top_topic = lda_topic_dict[f"topic_{topic_num}"]
    terms_df['lda'].iloc[idx] = list(top_topic.keys())

terms_df

You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  terms_df['lda'].iloc[idx] = list(top_topic.keys())


Unnamed: 0,count,tfidf,lda,nmf
‘Harrowing’ footage sparks calls for Queensland government to remove children from police watch houses [2024-07-18T15:00:15Z],"[child, watch, watch house, house, footage, yo...","[child, watch, watch house, footage, house, ad...","[child, people, youth, young, police, watch, h...",
Queensland government accused of cowing to Christian Lobby on anti-discrimination bill [2024-06-14T15:00:56Z],"[school, base, teacher, Labor, experience, fai...","[teacher, faith base, school, faith, discrimin...","[housing, NUMm, cost, , home, num%, We, rent,...",
Queensland government hoses down suggestions it is considering bailout for Bonza [2024-05-10T08:26:48Z],"[administrator, aircraft, federal, financial, ...","[administrator, aircraft, airline, creditor, p...","[housing, NUMm, cost, , home, num%, We, rent,...",
Queensland government accused of failing to provide adequate schooling to locked up children [2024-04-11T15:00:28Z],"[youth, child, hour, detention, centre, educat...","[youth, detention, detention centre, education...","[child, people, youth, young, police, watch, h...",
"‘There has to be a way’: Queensland government working to reunite Molly the magpie with family, premier says [2024-03-28T06:26:13Z]","[Molly, family, department, bird, couple, care...","[Molly, carer, bird, reunite, family, couple, ...","[I, people, , like, We, just, tell, party, ti...",
...,...,...,...,...
Mark Latham takes aim at Sky boss in post-sacking Twitter spray | The Weekly Beast [2017-03-30T20:35:02Z],"[Sky, political, News, Australian, , change, ...","[Sky, Sky News, News, ad, political, Lloyd, re...","[I, people, , like, We, just, tell, party, ti...",
Where the 2017 Queensland election will be won and lost [2017-10-29T02:00:35Z],"[Labor, LNP, nation, seat, vote, party, Hanson...","[LNP, Labor, nation, Hanson, seat, vote, prima...","[coal, energy, power, renewable, climate, Labo...",
"Bill Shorten takes town hall test: Trump, tax, refugees and the gum tree menace [2017-03-02T10:15:44Z]","[Shorten, Labor, We, want, He, I, people, said...","[Shorten, Labor, hall, yes, progressive, Labor...","[I, people, , like, We, just, tell, party, ti...",
Indigenous incarceration: turning the tide on colonisation's cruel third act [2017-02-20T03:08:41Z],"[indigenous, people, justice, child, issue, pr...","[indigenous, justice, child, Aboriginal, Torre...","[child, people, youth, young, police, watch, h...",


#### Compare approaches

In [25]:
# # Sample 5 random articles
# samples = random.sample(range(0,len(terms_df)),5)

# for sample in samples:
#     doc = terms_df.iloc[sample]
#     print(f"[{sample}] {doc.name}")
#     print("\t- Counts:\t",doc['count'])
#     print("\t- TFIDF:\t",doc['tfidf'])
#     print("\t- LDA:\t\t",doc['lda'])
#     print()

### Topic modelling with Non-negative Matrix Factorisation (NMF)


[NMF](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) is a different algorithm for obtaining *topics* (a list of terms) from a document-term matrix. It also factorises the document-term matrix into 2 factor matrices: document-topic and topic-term.

In [26]:
# Parameters from previous shared ones

# Create the model
nmf_model = NMF(n_components=num_topics,init='random',beta_loss='frobenius', max_iter=max_iterations)

# Fit the model to the data and use it to transform the data
doc_topic_nmf = nmf_model.fit_transform(tfidf_dt_matrix)

topic_term_nmf = nmf_model.components_

# THIS IS WHAT I WANT TO OPTIMISE HERE, AN APPROPRIATE CLUSTERING SHOWHING FUNDING

In [27]:
# Get the topics and their terms
nmf_topic_dict = {}
for index, topic in enumerate(topic_term_nmf):
    zipped = zip(feature_names, topic)
    top_terms=dict(sorted(zipped, key = lambda t: t[1], reverse=True)[:10])
    #print(top_terms)
    top_terms_list= {key : round(top_terms[key], 4) for key in top_terms.keys()}
    nmf_topic_dict[f"topic_{index}"] = top_terms_list

nmf_topic_terms_df = []
# Print the topics with their terms    
for k,v in nmf_topic_dict.items():
    print(k)
    print(v)
    print()

    for t, w in v.items():
        nmf_topic_terms_df.append([k, t, w])

nmf_topic_terms_df = pd.DataFrame(nmf_topic_terms_df)
nmf_topic_terms_df.rename(columns={nmf_topic_terms_df.columns[0]: "Topic Cluster",
                                       nmf_topic_terms_df.columns[1]: "Term",
                                       nmf_topic_terms_df.columns[2]: "Weight" }, inplace=True)

topic_0
{'I': 1.5941, 'saysnum': 0.6321, 'people': 0.5298, 'think': 0.3762, 'just': 0.351, 'I ’m': 0.3354, '’m': 0.3354, '’ve': 0.3292, 'know': 0.3001, 'We': 0.2995}

topic_1
{'energy': 0.982, 'coal': 0.8845, 'renewable': 0.6352, 'power': 0.6317, 'price': 0.4603, 'electricity': 0.3528, 'renewable energy': 0.3429, 'power station': 0.3127, 'emission': 0.3118, 'project': 0.2987}

topic_2
{'police': 1.3277, 'violence': 0.991, 'domestic': 0.8746, 'domestic violence': 0.7543, 'woman': 0.6405, 'officer': 0.4965, 'inquiry': 0.3816, 'victim': 0.3365, 'coercive': 0.3347, 'taskforce': 0.2983}

topic_3
{'child': 1.9741, 'youth': 1.642, 'watch house': 1.2178, 'watch': 1.1317, 'detention': 1.0647, 'house': 0.9337, 'young': 0.7519, 'police': 0.6909, 'justice': 0.627, 'crime': 0.5736}

topic_4
{'treaty': 1.023, 'indigenous': 0.479, 'First': 0.463, 'First Nations': 0.4504, 'Nations': 0.438, 'Aboriginal': 0.4345, 'people': 0.4138, 'native title': 0.3952, 'Torres': 0.3814, 'traditional': 0.3789}

topic_5

In [28]:
# Your answer here
import plotly.express as px



lda_topic_terms_fig = px.bar(lda_topic_terms_df,
       x ="Weight",
       y = "Term",
       facet_col="Topic Cluster", 
       facet_col_wrap=5,
       orientation='h',
       title = "LDA Topics and Terms",
       labels = {"Local Government Area (LGA)": "Local Government Area", 
                 "Total Contractual Commitment ($ GST excl.)": "Contractual Commitment ($ GST excl.)"})

lda_topic_terms_fig.update_layout(
    title_font_size=25,
    title_x=0.5,
    legend_title_font_size=15,
    height=1000
)
lda_topic_terms_fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[-1])) # https://plotly.com/python/facet-plots/
lda_topic_terms_fig.update_yaxes(showticklabels=True, matches=None)

In [29]:
# Your answer here
import plotly.express as px



nmf_topic_terms_fig = px.bar(nmf_topic_terms_df,
       x ="Weight",
       y = "Term",
       facet_col="Topic Cluster", 
       facet_col_wrap=5,
       orientation='h',
       title = "Total Contractual Commitment ($ GST excl.) by Local Government Area",
       labels = {"Local Government Area (LGA)": "Local Government Area", 
                 "Total Contractual Commitment ($ GST excl.)": "Contractual Commitment ($ GST excl.)"})

nmf_topic_terms_fig.update_layout(
    title_font_size=25,
    title_x=0.5,
    legend_title_font_size=15,
    height=1000
)
nmf_topic_terms_fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[-1])) # https://plotly.com/python/facet-plots/
nmf_topic_terms_fig.update_yaxes(showticklabels=True, matches=None)

# SHould make some mention that these are just the top topics... Maybe I should sum them across the year?

#### Update the terms matrix

In [30]:
for idx,topic in enumerate(doc_topic_nmf):
    topic_num = topic.argmax()
    top_topic = lda_topic_dict[f"topic_{topic_num}"]
    terms_df['nmf'].iloc[idx] = list(top_topic.keys())

terms_df


ChainedAssignmentError: behaviour will change in pandas 3.0!
You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy




Unnamed: 0,count,tfidf,lda,nmf
‘Harrowing’ footage sparks calls for Queensland government to remove children from police watch houses [2024-07-18T15:00:15Z],"[child, watch, watch house, house, footage, yo...","[child, watch, watch house, footage, house, ad...","[child, people, youth, young, police, watch, h...","[police, violence, domestic, woman, officer, d..."
Queensland government accused of cowing to Christian Lobby on anti-discrimination bill [2024-06-14T15:00:56Z],"[school, base, teacher, Labor, experience, fai...","[teacher, faith base, school, faith, discrimin...","[housing, NUMm, cost, , home, num%, We, rent,...","[forest, hydro, pump, pump hydro, glider, log,..."
Queensland government hoses down suggestions it is considering bailout for Bonza [2024-05-10T08:26:48Z],"[administrator, aircraft, federal, financial, ...","[administrator, aircraft, airline, creditor, p...","[housing, NUMm, cost, , home, num%, We, rent,...","[forest, hydro, pump, pump hydro, glider, log,..."
Queensland government accused of failing to provide adequate schooling to locked up children [2024-04-11T15:00:28Z],"[youth, child, hour, detention, centre, educat...","[youth, detention, detention centre, education...","[child, people, youth, young, police, watch, h...","[police, violence, domestic, woman, officer, d..."
"‘There has to be a way’: Queensland government working to reunite Molly the magpie with family, premier says [2024-03-28T06:26:13Z]","[Molly, family, department, bird, couple, care...","[Molly, carer, bird, reunite, family, couple, ...","[I, people, , like, We, just, tell, party, ti...","[I, people, , like, We, just, tell, party, ti..."
...,...,...,...,...
Mark Latham takes aim at Sky boss in post-sacking Twitter spray | The Weekly Beast [2017-03-30T20:35:02Z],"[Sky, political, News, Australian, , change, ...","[Sky, Sky News, News, ad, political, Lloyd, re...","[I, people, , like, We, just, tell, party, ti...","[Gabba, stadium, Schrinner, cricket, AFL, rebu..."
Where the 2017 Queensland election will be won and lost [2017-10-29T02:00:35Z],"[Labor, LNP, nation, seat, vote, party, Hanson...","[LNP, Labor, nation, Hanson, seat, vote, prima...","[coal, energy, power, renewable, climate, Labo...","[I, people, , like, We, just, tell, party, ti..."
"Bill Shorten takes town hall test: Trump, tax, refugees and the gum tree menace [2017-03-02T10:15:44Z]","[Shorten, Labor, We, want, He, I, people, said...","[Shorten, Labor, hall, yes, progressive, Labor...","[I, people, , like, We, just, tell, party, ti...","[I, people, , like, We, just, tell, party, ti..."
Indigenous incarceration: turning the tide on colonisation's cruel third act [2017-02-20T03:08:41Z],"[indigenous, people, justice, child, issue, pr...","[indigenous, justice, child, Aboriginal, Torre...","[child, people, youth, young, police, watch, h...","[reduce rate, australian premier, say resource..."


In [31]:
px.histogram(terms_df, x = "nmf")

### Compare approaches

In [32]:
# Sample 5 random articles
samples = random.sample(range(0,len(terms_df)),5)

for sample in samples:
    doc = terms_df.iloc[sample]
    print(f"[{sample}] {doc.name}")
    print("\t- Counts:\t",doc['count'])
    print("\t- TFIDF:\t",doc['tfidf'])
    print("\t- LDA:\t\t",doc['lda'])
    print("\t- NMF:\t\t",doc['nmf'])
    print()

[176] Coalmine approvals in Australia this year could add 150m tonnes of CO2 to atmosphere [2023-09-01T22:40:57Z]
	- Counts:	 ['coal', 'coalmine', 'environment', 'climate', 'law', 'metallurgical', 'environment law', 'approval', 'tonne', 'emission']
	- TFIDF:	 ['metallurgical', 'environment law', 'coalmine', 'coal', 'Plibersek', 'environment', 'metallurgical coal', 'climate', 'tonne', 'law']
	- LDA:		 ['water', 'environmental', 'federal', 'plan', 'land', 'project', 'department', 'environment', 'clearing', 'report']
	- NMF:		 ['Adani', 'Carmichael', 'Adani ’s', 'company', 'project', 'coal', 'royalty', 'loan', 'rail', 'coalmine']

[577] Kelly Wilkinson sought help from the police ‘almost every day’ after her first domestic violence complaint. So what went wrong? [2021-04-23T20:00:19Z]
	- Counts:	 ['violence', 'domestic', 'domestic violence', 'police', 'family', 'risk', 'case', 'high', 'separation', 'We']
	- TFIDF:	 ['violence', 'domestic', 'domestic violence', 'Wilkinson', 'separation', '

Sometimes different approaches are used. Sometimes LDA is better, sometimes nmf. LDA/nmf use different. How do we evaluate which is best?

Andrew recommends not just using counts, but it can do pretty good sometimes

In [33]:
print(topic.argmax())


7


In [34]:
# ATTEMPT CLUSTERING
#https://youtu.be/i74DVqMsRWY?list=PL2VXyKi-KpYttggRATQVmgFcQst3z6OlX

km = KMeans(n_clusters=10, random_state=0, n_init="auto")
#km.fit(terms_df.reset_index()["nmf"])
#Fit model
km.fit(doc_topic_nmf)
# At this point it is still on the 10k wide matrix of frequencies, so need to displ

order_centroids = km.cluster_centers_.argsort()[:,::-1]
terms = nmf_model##.get_feature_names_out()

In [35]:
order_centroids

array([[ 4,  6,  0, 14,  5,  3,  2, 11, 12,  7,  1,  9, 13,  8, 10],
       [ 2,  0,  4, 13,  3,  9, 11,  8, 10,  1, 14,  5,  6,  7, 12],
       [ 9, 11,  0,  8,  2,  4,  1, 13, 14,  3, 10, 12,  6,  7,  5],
       [ 0,  8, 14,  7, 11,  1,  2,  5, 10,  9,  4, 13, 12,  3,  6],
       [ 1,  7,  6,  0, 10, 14, 12,  8, 13, 11,  5,  9,  4,  2,  3],
       [ 6, 14, 10,  1,  0, 12,  4,  8, 13, 11,  7,  2,  5,  9,  3],
       [12, 10, 14,  5,  0,  1,  6, 11,  7,  2,  8,  4,  9, 13,  3],
       [ 3,  0,  2,  4, 11,  8, 13,  9,  1, 14, 10, 12,  6,  5,  7],
       [13,  0,  1,  2,  3, 11,  8, 14,  9,  5, 10,  6,  4,  7, 12],
       [10,  6,  0,  1, 14, 11,  7,  8,  9,  2, 13,  4, 12,  3,  5]],
      dtype=int64)

In [36]:
#terms_df['nmf_factor'] = pd.factorize(terms_df[['nmf']])[0]