<div style="background:#E9FFF6; color:#440404; padding:8px; border-radius: 4px; text-align: center; font-weight: 500;">IFQ619 - Data Analytics for Strategic Decision Makers (2024)</div>

# IFQ619 :: C1-UnstructuredAnalytics

For this tutorial, you will use the studio notebook as a guide, and:

1. Use the Guardian API to undertake your own search and obtain a JSON file of documents
2. Create a TF/IDF document-term matrix for your documents
3. Perform topic modelling of your documents using NMF

In [5]:
# Import the necessary libraries
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF
import pandas as pd
import json
import random

### 1. Accessing the data via The Guardian API

Make a copy of the studio notebook file, and modify it to perform your own search of the Guardian API. **NOTE:** you will need to obtain your own developer API key first and put it in a file in the appropriate folder.

A suggested search term is "ukraine", or come up with another that is of interest to you and will return a fair amount of data.

Save your search results in a JSON file, then read in that data below...
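The studio notebook's search code is not reproduced here, so the following is only a minimal sketch of one way to query the Guardian content API and save the results. The helper names (`build_search_url`, `fetch_page`, `to_articles`) are hypothetical, and the `"<title> [<date>]"` key format is an assumption inferred from the headlines shown later in this notebook.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

GUARDIAN_SEARCH_URL = "https://content.guardianapis.com/search"


def build_search_url(query, api_key, page=1, page_size=50):
    """Build a search URL that also requests the plain-text article body."""
    params = {
        "q": query,
        "api-key": api_key,
        "show-fields": "bodyText",
        "page-size": page_size,
        "page": page,
    }
    return f"{GUARDIAN_SEARCH_URL}?{urlencode(params)}"


def fetch_page(query, api_key, page=1):
    """Fetch one page of results (needs network access and a valid key)."""
    with urlopen(build_search_url(query, api_key, page)) as resp:
        return json.load(resp)["response"]["results"]


def to_articles(results):
    """Map API results to {"<title> [<date>]": body_text} (assumed layout)."""
    return {
        f"{r['webTitle']} [{r['webPublicationDate']}]": r.get("fields", {}).get("bodyText", "")
        for r in results
    }


def save_articles(articles, path):
    """Write the articles dictionary to a JSON file."""
    with open(path, "w", encoding="utf-8") as fp:
        json.dump(articles, fp, ensure_ascii=False, indent=2)
```

A typical use would be `save_articles(to_articles(fetch_page("ukraine", my_key)), "articles.json")`, where `my_key` is read from your own key file.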

In [6]:
# Load the data - articles from The Guardian
file_path = "./Assessments/assignment 2/data/"
file_name = "qld_gov_articles.json"

with open(f"{file_path}{file_name}",'r', encoding='utf-8') as fp:
    articles = json.load(fp)

print(f"Loaded {len(articles)} articles from {file_name}")

Loaded 1446 articles from qld_gov_articles.json


#### Create a top-10 terms dataframe

Using the document keys as the index, create a dataframe that can hold the top 10 terms for each document.

In [7]:
# Create a dataframe to hold top terms for each analysis type
terms_df = pd.DataFrame(index=articles.keys(),columns=['tfidf','nmf'])
terms_df

Unnamed: 0,tfidf,nmf
‘Harrowing’ footage sparks calls for Queensland government to remove children from police watch houses [2024-07-18T15:00:15Z],,
Queensland government accused of cowing to Christian Lobby on anti-discrimination bill [2024-06-14T15:00:56Z],,
Queensland government hoses down suggestions it is considering bailout for Bonza [2024-05-10T08:26:48Z],,
Queensland government accused of failing to provide adequate schooling to locked up children [2024-04-11T15:00:28Z],,
"‘There has to be a way’: Queensland government working to reunite Molly the magpie with family, premier says [2024-03-28T06:26:13Z]",,
...,...,...
Covid case surge continues as 67 deaths recorded nationally – as it happened [2022-01-19T07:37:48Z],,
Penny Wong warns against ‘miscalculation’ as China-Taiwan tensions escalate – as it happened [2022-08-04T09:14:38Z],,
Children’s vaccination program starts – as it happened [2022-01-10T07:38:15Z],,
Liberal MP Bridget Archer to cross the floor on climate bill – as it happened [2022-08-03T09:25:00Z],,


### Term Frequency / Inverse Document Frequency (TF/IDF)


In [8]:
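# Set parameters appropriate to your data
tfidf_vectorizer = TfidfVectorizer(
    max_df=0.75,           # ignore terms that appear in more than 75% of documents
    min_df=5,              # ignore terms that appear in fewer than 5 documents
    max_features=10000,    # keep at most the 10,000 most frequent remaining terms
    stop_words="english",  # drop scikit-learn's built-in English stop words
)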

In [9]:
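# Get the document vectors
# NOTE: articles.keys() passes the headline strings (title plus timestamp) to the
# vectorizer, so the vocabulary below comes from the headlines, not the article bodies.
# If each value in `articles` holds the body text, pass articles.values() instead.
tfidf_dt_matrix = tfidf_vectorizer.fit_transform(articles.keys())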

# Display the vector for the first document
tfidf_dt_matrix.toarray()[0]

array([0.14896907, 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.22104285, 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.     

#### Update the terms matrix

In [10]:
# list of feature names
feature_names = tfidf_vectorizer.get_feature_names_out()

# create a df to combine matrix with feature names
tfidf_df = pd.DataFrame(tfidf_dt_matrix.toarray(), index=articles.keys(), columns=feature_names)
tfidf_df

Unnamed: 0,00,000,00z,01,01t05,01t06,01t07,01t08,01t14,01z,...,workers,world,worst,worth,wrong,year,years,york,young,youth
‘Harrowing’ footage sparks calls for Queensland government to remove children from police watch houses [2024-07-18T15:00:15Z],0.148969,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Queensland government accused of cowing to Christian Lobby on anti-discrimination bill [2024-06-14T15:00:56Z],0.198921,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Queensland government hoses down suggestions it is considering bailout for Bonza [2024-05-10T08:26:48Z],0.000000,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Queensland government accused of failing to provide adequate schooling to locked up children [2024-04-11T15:00:28Z],0.161047,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"‘There has to be a way’: Queensland government working to reunite Molly the magpie with family, premier says [2024-03-28T06:26:13Z]",0.000000,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Covid case surge continues as 67 deaths recorded nationally – as it happened [2022-01-19T07:37:48Z],0.000000,0.0,0.000000,0.211689,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Penny Wong warns against ‘miscalculation’ as China-Taiwan tensions escalate – as it happened [2022-08-04T09:14:38Z],0.000000,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Children’s vaccination program starts – as it happened [2022-01-10T07:38:15Z],0.000000,0.0,0.000000,0.370204,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Liberal MP Bridget Archer to cross the floor on climate bill – as it happened [2022-08-03T09:25:00Z],0.000000,0.0,0.359577,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [11]:
# Record the 10 highest-weighted TF/IDF terms for each document
for idx in terms_df.index:
    top_terms = tfidf_df.loc[idx].sort_values(ascending=False).head(10)
    terms_df.at[idx, 'tfidf'] = list(top_terms.index)

terms_df

Unnamed: 0,tfidf,nmf
‘Harrowing’ footage sparks calls for Queensland government to remove children from police watch houses [2024-07-18T15:00:15Z],"[sparks, houses, watch, calls, children, 15z, ...",
Queensland government accused of cowing to Christian Lobby on anti-discrimination bill [2024-06-14T15:00:56Z],"[accused, anti, 56z, government, 06, queenslan...",
Queensland government hoses down suggestions it is considering bailout for Bonza [2024-05-10T08:26:48Z],"[26, 48z, government, 05, queensland, 2024, 00...",
Queensland government accused of failing to provide adequate schooling to locked up children [2024-04-11T15:00:28Z],"[provide, failing, accused, 28z, children, gov...",
"‘There has to be a way’: Queensland government working to reunite Molly the magpie with family, premier says [2024-03-28T06:26:13Z]","[28t06, way, 26, family, premier, 13z, governm...",
...,...,...
Covid case surge continues as 67 deaths recorded nationally – as it happened [2022-01-19T07:37:48Z],"[surge, nationally, recorded, continues, 37, c...",
Penny Wong warns against ‘miscalculation’ as China-Taiwan tensions escalate – as it happened [2022-08-04T09:14:38Z],"[penny, wong, china, warns, 14, 38z, 08, happe...",
Children’s vaccination program starts – as it happened [2022-01-10T07:38:15Z],"[38, children, 15z, 01, happened, 2022, mining...",
Liberal MP Bridget Archer to cross the floor on climate bill – as it happened [2022-08-03T09:25:00Z],"[floor, liberal, mp, 00z, 25, climate, 08, hap...",


### Topic modelling with Non-negative Matrix Factorisation (NMF)


[NMF](https://en.wikipedia.org/wiki/Non-negative_matrix_factorization) is an alternative to LDA for obtaining *topics* (lists of terms) from a document-term matrix. Like LDA, it factorises the document-term matrix into two factor matrices: document-topic and topic-term.
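As a rough illustration of the factorisation, the sketch below runs NMF on a small random non-negative matrix (not the tutorial data) and checks the shapes of the two factors. The matrix size, `n_components=3`, the `nndsvda` initialisation and the fixed `random_state` are all choices made for this example.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
V = rng.random((6, 8))  # stand-in document-term matrix: 6 documents, 8 terms

toy_model = NMF(n_components=3, init="nndsvda", random_state=0, max_iter=500)
W = toy_model.fit_transform(V)  # document-topic matrix, shape (6, 3)
H = toy_model.components_       # topic-term matrix, shape (3, 8)

print("W shape:", W.shape)
print("H shape:", H.shape)
# W @ H approximates V; the Frobenius norm of the difference is the reconstruction error
print("reconstruction error:", np.linalg.norm(V - W @ H))
```

Fixing `random_state` makes the factors repeatable between runs; with `init='random'` and no seed, the topics can differ each time the model is fitted.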

In [12]:
help(NMF().fit_transform)

Help on method fit_transform in module sklearn.decomposition._nmf:

fit_transform(X, y=None, W=None, H=None) method of sklearn.decomposition._nmf.NMF instance
    Learn a NMF model for the data X and returns the transformed data.

    This is more efficient than calling fit followed by transform.

    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        Training vector, where `n_samples` is the number of samples
        and `n_features` is the number of features.

    y : Ignored
        Not used, present for API consistency by convention.

    W : array-like of shape (n_samples, n_components), default=None
        If `init='custom'`, it is used as initial guess for the solution.
        If `None`, uses the initialisation method specified in `init`.

    H : array-like of shape (n_components, n_features), default=None
        If `init='custom'`, it is used as initial guess for the solution.
        If `None`, uses the initialisation met

In [13]:
# Set the number of topics
num_topics = 30

# Create the model
nmf_model = NMF(n_components=num_topics, init='random', beta_loss='frobenius')

# Fit the model to the data and use it to transform the data.
# NMF is fitted on the TF/IDF matrix, so it builds on the analysis above
# (LDA, by contrast, is normally fitted on raw term counts).
doc_topic_nmf = nmf_model.fit_transform(tfidf_dt_matrix)

topic_term_nmf = nmf_model.components_

In [14]:
# Get the topics and their terms
nmf_topic_dict = {}
for index, topic in enumerate(topic_term_nmf):
    # Pair each term with its weight in this topic and keep the 10 heaviest
    zipped = zip(feature_names, topic)
    top_terms = dict(sorted(zipped, key=lambda t: t[1], reverse=True)[:10])
    # Round the weights for display
    nmf_topic_dict[f"topic_{index}"] = {term: round(weight, 4) for term, weight in top_terms.items()}

# Print the topics with their terms    
for k,v in nmf_topic_dict.items():
    print(k)
    print(v)
    print()

topic_0
{'03': 0.7309, '06z': 0.071, 'aukus': 0.068, '15t14': 0.0562, '47': 0.048, 'claims': 0.0459, '43z': 0.0439, '07t14': 0.043, '14t14': 0.0428, '21t14': 0.0414}

topic_1
{'2023': 0.79, '08z': 0.0547, '39z': 0.0544, 'year': 0.0438, 'push': 0.0383, '44z': 0.0381, 'make': 0.0305, '55': 0.0283, '19z': 0.0273, '01z': 0.0261}

topic_2
{'08': 0.9525, '2022': 0.1373, '15z': 0.0897, '2024': 0.0814, 'minister': 0.078, 'ministries': 0.0673, '06t15': 0.0634, '10z': 0.0631, 'houses': 0.0572, 'watch': 0.056}

topic_3
{'police': 0.8251, 'violence': 0.2977, 'domestic': 0.2937, 'inquiry': 0.1761, 'indigenous': 0.139, 'death': 0.1347, 'man': 0.1326, 'response': 0.1178, 'shooting': 0.1084, 'queensland': 0.1014}

topic_4
{'04': 0.8855, '2024': 0.2552, '20': 0.0784, 'people': 0.0687, '51z': 0.0674, '29t15': 0.0651, '03z': 0.056, 'election': 0.051, '30t08': 0.0479, '26z': 0.0471}

topic_5
{'mail': 0.5047, 'morning': 0.4954, 'gaza': 0.2084, '59': 0.1884, 'plan': 0.1691, 'coalition': 0.1688, '11z': 0.138

#### Update the terms matrix

In [15]:
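for idx, topic in enumerate(doc_topic_nmf):
    topic_num = topic.argmax()
    top_topic = nmf_topic_dict[f"topic_{topic_num}"]
    # terms_df is indexed by headline, so look up the row label for position idx.
    # Writing to terms_df.loc[idx, 'nmf'] with the bare integer appends new rows
    # labelled 0, 1, 2, ... instead, which is what produces the integer-labelled
    # rows and the NaN entries in the output shown below.
    terms_df.at[terms_df.index[idx], 'nmf'] = list(top_topic.keys())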

terms_df

Unnamed: 0,tfidf,nmf
‘Harrowing’ footage sparks calls for Queensland government to remove children from police watch houses [2024-07-18T15:00:15Z],"[sparks, houses, watch, calls, children, 15z, ...",
Queensland government accused of cowing to Christian Lobby on anti-discrimination bill [2024-06-14T15:00:56Z],"[accused, anti, 56z, government, 06, queenslan...",
Queensland government hoses down suggestions it is considering bailout for Bonza [2024-05-10T08:26:48Z],"[26, 48z, government, 05, queensland, 2024, 00...",
Queensland government accused of failing to provide adequate schooling to locked up children [2024-04-11T15:00:28Z],"[provide, failing, accused, 28z, children, gov...",
"‘There has to be a way’: Queensland government working to reunite Molly the magpie with family, premier says [2024-03-28T06:26:13Z]","[28t06, way, 26, family, premier, 13z, governm...",
...,...,...
1441,,"[covid, deaths, records, nation, 2022, morriso..."
1442,,"[08, 2022, 15z, 2024, minister, ministries, 06..."
1443,,"[01, 42z, 23t14, 47z, cases, 38, 2022, 43, 16z..."
1444,,"[08, 2022, 15z, 2024, minister, ministries, 06..."
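The integer-labelled rows at the bottom of the table above are what pandas produces when `.loc` is given a label that does not exist in the index: it adds a new row rather than updating an existing one. A minimal sketch of the difference, using a made-up two-row dataframe:

```python
import pandas as pd

toy_df = pd.DataFrame(index=["first headline", "second headline"], columns=["tfidf", "nmf"])

# Label-based write with an integer that is NOT in the index: pandas enlarges the frame
buggy = toy_df.copy()
buggy.loc[0, "nmf"] = "topic terms"

# Translate the position into the matching row label first: the existing row is updated
fixed = toy_df.copy()
fixed.loc[fixed.index[0], "nmf"] = "topic terms"

print(buggy)
print(fixed)
```

Converting the position to a label with `df.index[position]` keeps the TF/IDF and NMF terms on the same row.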


### Check against articles

In [16]:
# Sample 5 random articles
samples = random.sample(range(0,len(terms_df)),5)

for sample in samples:
    doc = terms_df.iloc[sample]
    print(f"[{sample}] {doc.name}")
    print("\t- TFIDF:\t",doc['tfidf'])
    print("\t- NMF:\t\t",doc['nmf'])
    print()

[65] Afternoon Update: retiring Shorten channels Sinatra; Harris and Trump agree debate rules; and con artist to star in hit show [2024-09-05T06:42:23Z]
	- TFIDF:	 ['05t06', 'star', 'debate', 'rules', 'hit', 'afternoon', '42', 'update', '23z', '09']
	- NMF:		 nan

[2770] 1324
	- TFIDF:	 nan
	- NMF:		 ['06', '2024', '21z', '50z', 'blues', '34', 'accused', 'maroons', 'game', 'senator']

[787] Some popular accounts likely to disappear from Twitter as Elon Musk ends free access to API [2023-02-03T06:39:11Z]
	- TFIDF:	 ['access', 'likely', 'free', '39', '11z', '02', '2023', 'network', 'mps', 'minister']
	- NMF:		 nan

[863] Australians turn to interstate train travel amid soaring domestic air fares heading into holiday peak [2022-11-28T14:00:24Z]
	- TFIDF:	 ['turn', 'travel', 'peak', 'train', '28t14', 'domestic', 'australians', 'amid', '24z', '11']
	- NMF:		 nan

[479] Heavy rain continues in NSW; SA police head to Alice Springs for backup – as it happened [2024-04-05T06:50:15Z]
	- TFIDF:	 

## Refine your analysis

Once you have worked through the process, try tweaking the parameters of the TF/IDF vectorizer and the NMF topic model to obtain better results for your data.

#### Advanced

You may obtain better results by doing the following:

1. Creating smaller documents (e.g. article paragraphs)
2. Pre-processing the text by Stemming or Lemmatizing, and by removing additional stop words.
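For the first suggestion, a possible approach is to split each article into paragraphs before vectorising. This sketch assumes `articles` maps a title to body text with paragraphs separated by blank lines, which may not match your JSON file. `split_into_paragraphs` and the `min_words` threshold are hypothetical.

```python
import re


def split_into_paragraphs(articles, min_words=5):
    """Return {"<title> :: p<n>": paragraph} for paragraphs with at least min_words words."""
    paragraphs = {}
    for title, text in articles.items():
        # Split on one or more blank lines and discard empty chunks
        chunks = [c.strip() for c in re.split(r"\n\s*\n", text) if c.strip()]
        for i, chunk in enumerate(chunks):
            if len(chunk.split()) >= min_words:
                paragraphs[f"{title} :: p{i}"] = chunk
    return paragraphs


toy_articles = {
    "Title A": "First paragraph has more than five words here.\n\n"
               "Short one.\n\n"
               "Second long paragraph also has plenty of words.",
}
toy_paragraphs = split_into_paragraphs(toy_articles)
for key, para in toy_paragraphs.items():
    print(key, "->", para)
```

The resulting dictionary can be passed through the same TF/IDF and NMF steps, with each paragraph treated as its own document.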

Source: https://stackoverflow.com/questions/1787110/what-is-the-difference-between-lemmatization-vs-stemming

    The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.

    However, the two words differ in their flavor. Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.

From the NLTK docs:

    Lemmatization and stemming are special cases of normalization. They identify a canonical representative for a set of related word forms.

