<div style="background:#E9FFF6; color:#440404; padding:8px; border-radius: 4px; text-align: center; font-weight: 500;">IFQ619 - Data Analytics for Strategic Decision Makers (2024)</div>

# IFQ619 :: C1-UnstructuredAnalytics

For this tutorial, you will use the studio notebook as a guide, and:

1. Use the Guardian API to undertake your own search and obtain a json file of documents
2. Create a TF/IDF document-term matrix for your documents
3. Perform topic modelling of your documents using NMF

In [1]:
# Import the necessary libraries
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF
import pandas as pd
import json
import random

### 1. Accessing the data via The Guardian API

Make a copy of the studio notebook file, and modify it to perform your own search of the Guardian API. **NOTE:** you will need to obtain your own developer API key first and put it in a file in the appropriate folder.

A suggested search term is "ukraine", or come up with another that is of interest to you and will return a fair amount of data.

Save your search results in a json file, then read in that data below...

In [2]:
# Load the data - articles from The Guardian
file_path = "data/"
file_name = "war_articles.json"

with open(f"{file_path}{file_name}",'r', encoding='utf-8') as fp:
    articles = json.load(fp)

print(f"Loaded {len(articles)} articles from {file_name}")

Loaded 864 articles from war_articles.json


#### Create a top10 terms dataframe

Using the index from the documents, create a dataframe that can hold the top10 terms for each document.

In [3]:
# Create a dataframe to hold top terms for each analysis type
terms_df = pd.DataFrame(index=articles.keys(),columns=['tfidf','nmf'])
terms_df

Unnamed: 0,tfidf,nmf
Ukraine war briefing: Trump vows to end war in call with Zelenskiy [2024-07-20T01:34:47Z],,
Ukraine war briefing: Turkey launches new Ukrainian warship [2024-08-02T00:18:54Z],,
Israel-Gaza war: protesters in Tel Aviv demand end to war – as it happened [2024-05-25T23:15:52Z],,
Ukraine war briefing: US hits China with sanctions over war supplies to Russia [2024-05-02T00:24:30Z],,
Ukraine war briefing: ‘Russia does not control Kursk border’ [2024-08-07T01:23:54Z],,
...,...,...
Oliver Dowden reportedly reveals preferred choice for next Tory leader – UK general election as it happened [2024-07-02T19:59:07Z],,
PM defends response to ICC arrest warrant request – as it happened [2024-05-23T08:26:49Z],,
Museum loses anti-discrimination case – as it happened [2024-04-09T08:10:14Z],,
Woman found dead in North Bondi – as it happened [2024-04-30T07:57:49Z],,


### Term Frequency / Inverse Document Frequency (TF/IDF)


In [4]:
# Set parameters appropriate to your data


tfidf_vectorizer = TfidfVectorizer(
    max_df=0.75, min_df=5, max_features=10000, stop_words="english"
)

In [5]:
# Get the document vectors
tfidf_dt_matrix = tfidf_vectorizer.fit_transform(articles.keys())

# Display the vector for the first document
tfidf_dt_matrix.toarray()[0]

array([0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.2007124 , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.     

#### Update the terms matrix

In [6]:
# list of feature names
feature_names = tfidf_vectorizer.get_feature_names_out()

# create a df to combine matrix with feature names
tfidf_df = pd.DataFrame(tfidf_dt_matrix.toarray(), index=articles.keys(), columns=feature_names)
tfidf_df

Unnamed: 0,00,000,00z,01,01t14,01z,02,02t00,02t15,02z,...,wong,work,worker,workers,world,year,years,zaporizhzhia,zelenskiy,zomi
Ukraine war briefing: Trump vows to end war in call with Zelenskiy [2024-07-20T01:34:47Z],0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.338836,0.0
Ukraine war briefing: Turkey launches new Ukrainian warship [2024-08-02T00:18:54Z],0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.426298,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
Israel-Gaza war: protesters in Tel Aviv demand end to war – as it happened [2024-05-25T23:15:52Z],0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
Ukraine war briefing: US hits China with sanctions over war supplies to Russia [2024-05-02T00:24:30Z],0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.389934,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
Ukraine war briefing: ‘Russia does not control Kursk border’ [2024-08-07T01:23:54Z],0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Oliver Dowden reportedly reveals preferred choice for next Tory leader – UK general election as it happened [2024-07-02T19:59:07Z],0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
PM defends response to ICC arrest warrant request – as it happened [2024-05-23T08:26:49Z],0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
Museum loses anti-discrimination case – as it happened [2024-04-09T08:10:14Z],0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
Woman found dead in North Bondi – as it happened [2024-04-30T07:57:49Z],0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0


In [7]:
for idx in terms_df.index:
    tfidf = dict(tfidf_df.loc[idx].sort_values(ascending=False).head(10))
    #print(counts)
    terms_df.at[idx,'tfidf'] = list(tfidf.keys()) 

terms_df

Unnamed: 0,tfidf,nmf
Ukraine war briefing: Trump vows to end war in call with Zelenskiy [2024-07-20T01:34:47Z],"[end, 34, 47z, trump, war, zelenskiy, briefing...",
Ukraine war briefing: Turkey launches new Ukrainian warship [2024-08-02T00:18:54Z],"[02t00, launches, 18, 54z, ukrainian, new, bri...",
Israel-Gaza war: protesters in Tel Aviv demand end to war – as it happened [2024-05-25T23:15:52Z],"[tel, aviv, protesters, end, 15, 52z, war, isr...",
Ukraine war briefing: US hits China with sanctions over war supplies to Russia [2024-05-02T00:24:30Z],"[02t00, 24, hits, sanctions, 30z, china, war, ...",
Ukraine war briefing: ‘Russia does not control Kursk border’ [2024-08-07T01:23:54Z],"[does, 23, border, kursk, 54z, russia, briefin...",
...,...,...
Oliver Dowden reportedly reveals preferred choice for next Tory leader – UK general election as it happened [2024-07-02T19:59:07Z],"[reportedly, general, uk, leader, 07z, electio...",
PM defends response to ICC arrest warrant request – as it happened [2024-05-23T08:26:49Z],"[arrest, response, 26, icc, 49z, pm, 05, happe...",
Museum loses anti-discrimination case – as it happened [2024-04-09T08:10:14Z],"[case, 10, 14z, 04, happened, muslim, musk, mp...",
Woman found dead in North Bondi – as it happened [2024-04-30T07:57:49Z],"[woman, dead, north, 49z, 57, 04, happened, mu...",


### Topic modelling with Non-negative Matrix Factorisation (NMF)


[NMF](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) is a different algorithm for obtaining *topics* (a list of terms) from a document-term matrix. It also factorises the document-term matrix into 2 factor matrices: document-topic and topic-term.

In [8]:
help(NMF().fit_transform)

Help on method fit_transform in module sklearn.decomposition._nmf:

fit_transform(X, y=None, W=None, H=None) method of sklearn.decomposition._nmf.NMF instance
    Learn a NMF model for the data X and returns the transformed data.
    
    This is more efficient than calling fit followed by transform.
    
    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        Training vector, where `n_samples` is the number of samples
        and `n_features` is the number of features.
    
    y : Ignored
        Not used, present for API consistency by convention.
    
    W : array-like of shape (n_samples, n_components), default=None
        If `init='custom'`, it is used as initial guess for the solution.
        If `None`, uses the initialisation method specified in `init`.
    
    H : array-like of shape (n_components, n_features), default=None
        If `init='custom'`, it is used as initial guess for the solution.
        If `None`, uses th

In [9]:
# Set the number of topics
num_topics = 30

# Create the model
nmf_model = NMF(n_components=num_topics ,init='random',beta_loss='frobenius')

# Fit the model to the data and use it to transform the data
doc_topic_nmf = nmf_model.fit_transform(tfidf_dt_matrix) # NOTE THAT THIS HAS USED THE TDIDF_DT_MATRIX, IT BUILDS ON THAT ANALYSIS. LDA USED A DIFFERENT COUNTS ONE?

topic_term_nmf = nmf_model.components_

In [10]:
# Get the topics and their terms
nmf_topic_dict = {}
for index, topic in enumerate(topic_term_nmf):
    zipped = zip(feature_names, topic)
    top_terms=dict(sorted(zipped, key = lambda t: t[1], reverse=True)[:10])
    #print(top_terms)
    top_terms_list= {key : round(top_terms[key], 4) for key in top_terms.keys()}
    nmf_topic_dict[f"topic_{index}"] = top_terms_list

# Print the topics with their terms    
for k,v in nmf_topic_dict.items():
    print(k)
    print(v)
    print()

topic_0
{'04': 0.6253, '39z': 0.0419, '34z': 0.0407, 'death': 0.0372, 'lehrmann': 0.0327, '29z': 0.0326, 'zomi': 0.0301, '56': 0.0257, 'nsw': 0.0255, 'claims': 0.0235}

topic_1
{'great': 0.3334, 'reads': 0.3051, 'like': 0.1058, 'election': 0.0939, 'win': 0.0654, '09z': 0.0619, '27z': 0.0603, 'history': 0.0539, '13z': 0.0537, 'far': 0.0523}

topic_2

topic_3
{'05': 0.8603, 'kharkiv': 0.0674, 'icc': 0.0635, '51z': 0.0506, 'rafah': 0.0493, '38z': 0.0483, 'budget': 0.0441, 'arrest': 0.0382, 'aukus': 0.0382, '02': 0.0364}

topic_4
{'ukraine': 0.693, 'briefing': 0.575, 'war': 0.5377, 'russia': 0.3781, 'zelenskiy': 0.2604, 'kyiv': 0.1549, 'kharkiv': 0.1359, '34': 0.0872, 'kursk': 0.0858, 'nato': 0.0737}

topic_5
{'happened': 1.1065, 'pm': 0.2482, '47z': 0.1486, 'senator': 0.141, '10': 0.1253, '54z': 0.1104, '59': 0.0948, 'greens': 0.0947, 'deal': 0.0923, '03': 0.0921}

topic_6
{'43z': 0.5821, '55': 0.5572, 'budget': 0.1771, 'review': 0.1407, 'dies': 0.1247, 'humanity': 0.0891, '54z': 0.0821, 

#### Update the terms matrix

In [11]:
for idx,topic in enumerate(doc_topic_nmf):
    topic_num = topic.argmax()
    top_topic = nmf_topic_dict[f"topic_{topic_num}"]
    #terms_df['nmf'].iloc[idx] = list(top_topic.keys())
    terms_df.loc[idx,'nmf'] = list(top_topic.keys()) # THIS IS THE PANDAS APPROACH TO UPDATE IN PLACE?

terms_df

Unnamed: 0,tfidf,nmf
Ukraine war briefing: Trump vows to end war in call with Zelenskiy [2024-07-20T01:34:47Z],"[end, 34, 47z, trump, war, zelenskiy, briefing...",
Ukraine war briefing: Turkey launches new Ukrainian warship [2024-08-02T00:18:54Z],"[02t00, launches, 18, 54z, ukrainian, new, bri...",
Israel-Gaza war: protesters in Tel Aviv demand end to war – as it happened [2024-05-25T23:15:52Z],"[tel, aviv, protesters, end, 15, 52z, war, isr...",
Ukraine war briefing: US hits China with sanctions over war supplies to Russia [2024-05-02T00:24:30Z],"[02t00, 24, hits, sanctions, 30z, china, war, ...",
Ukraine war briefing: ‘Russia does not control Kursk border’ [2024-08-07T01:23:54Z],"[does, 23, border, kursk, 54z, russia, briefin...",
...,...,...
859,,"[happened, pm, 47z, senator, 10, 54z, 59, gree..."
860,,"[05, kharkiv, icc, 51z, rafah, 38z, budget, ar..."
861,,"[04, 39z, 34z, death, lehrmann, 29z, zomi, 56,..."
862,,"[04, 39z, 34z, death, lehrmann, 29z, zomi, 56,..."


### Check against articles

In [12]:
# Sample 5 random articles
samples = random.sample(range(0,len(terms_df)),5)

for sample in samples:
    doc = terms_df.iloc[sample]
    print(f"[{sample}] {doc.name}")
    print("\t- TFIDF:\t",doc['tfidf'])
    print("\t- NMF:\t\t",doc['nmf'])
    print()

[1584] 720
	- TFIDF:	 nan

[1034] 170
	- TFIDF:	 nan
	- NMF:		 ['ukraine', 'briefing', 'war', 'russia', 'zelenskiy', 'kyiv', 'kharkiv', '34', 'kursk', 'nato']

[1489] 625
	- TFIDF:	 nan
	- NMF:		 ['china', 'taiwan', 'sea', '03z', 'south', '35', 'tensions', '30', '29z', '53']

[1502] 638
	- TFIDF:	 nan
	- NMF:		 ['00', '34z', '19t15', '16t15', '40z', '17z', 'review', '14z', '03z', '02t15']

[344] Middle East crisis: Israeli military said 10 rockets had been fired from Lebanon and that one of them hit kibbutz HaGoshrim – as it happened [2024-07-30T13:48:28Z]
	- TFIDF:	 ['hit', 'lebanon', '10', '28z', '48', 'military', 'crisis', 'middle', 'east', 'israeli']
	- NMF:		 nan



## Refine your analysis

Once you have worked through the process. Try tweaking the parameters in the TF/IDF vectorizer and also in the NMF topic modelling to try and obtain better results for your data.

#### Advanced

You may obtain better results by doing the following:

1. Creating smaller documents (e.g. article paragraphs)
2. Pre-processing the text by Stemming or Lemmatizing, and by removing additional stop words.

#https://stackoverflow.com/questions/1787110/what-is-the-difference-between-lemmatization-vs-stemming



    The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.

    However, the two words differ in their flavor. Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma .

From the NLTK docs:

    Lemmatization and stemming are special cases of normalization. They identify a canonical representative for a set of related word forms.

