## Introduction 
#### This document contains the following: 
- Importing the cleaned dataset produced by the "text_preprocessing.ipynb" file. The file is /Data/bbc_news_post_eda.csv 
- The imported data is a cleaned version of the original dataset, bbc_news.csv 
- Of the features in the dataset, the one used in this document is "description_text", which is the preprocessed text from the BBC news website
- This document will begin by splitting the data into a training set and a test set 
- Perform vectorization to convert the raw text into a numerical form that the ML analysis can be applied to: CountVectorizer is used for LDA and TF-IDF for NMF. TF-IDF is chosen because it performs better during Sentiment Analysis
- Once vectorization is applied, we can create the ML models for the NLP problem. Two models are created: Latent Dirichlet Allocation (LDA) and Non-Negative Matrix Factorization (NMF)

------------------

### Import the libraries

In [12]:
#Import the required libraries 
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt 
import re


### Import the data

In [15]:
#Import the data 
df = pd.read_csv("Data\\bbc_news_post_eda.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,title,description,description_text
0,0,Ukraine: Angry Zelensky vows to punish Russian...,The Ukrainian president says the country will ...,The Ukrainian president say country forgive fo...
1,1,War in Ukraine: Taking cover in a town under a...,"Jeremy Bowen was on the frontline in Irpin, as...",Jeremy Bowen frontline Irpin resident came Rus...
2,2,Ukraine war 'catastrophic for global food',One of the world's biggest fertiliser firms sa...,One world biggest fertiliser firm say conflict...
3,3,Manchester Arena bombing: Saffie Roussos's par...,The parents of the Manchester Arena bombing's ...,The parent Manchester Arena bombing youngest v...
4,4,Ukraine conflict: Oil price soars to highest l...,Consumers are feeling the impact of higher ene...,Consumers feeling impact higher energy cost fu...
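The `Unnamed: 0` and `Unnamed: 0.1` columns in the preview are what pandas typically produces when a DataFrame's index is written to CSV and later read back as ordinary data. The following is a minimal sketch, assuming that is what happened here; `csv_text` is a made-up two-row sample, not the real file. It also shows a `pathlib` path, which avoids the Windows-only `\\` separator.

```python
import io
from pathlib import Path

import pandas as pd

# Hypothetical two-row sample standing in for the real CSV file.
csv_text = ",title,description_text\n0,Headline A,text a\n1,Headline B,text b\n"

# Default read: the saved index comes back as an "Unnamed: 0" data column.
df_default = pd.read_csv(io.StringIO(csv_text))

# Treat the first column as the index instead of as data.
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

# A path built with pathlib works on Windows, macOS, and Linux alike.
data_path = Path("Data") / "bbc_news_post_eda.csv"

print(list(df_default.columns))
print(list(df_indexed.columns))
```

With `index_col=0`, the later `drop(columns='Unnamed: 0')` step would no longer be needed for that column.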


----------

## Apply LDA

In [75]:
#1. Import the data for the LDA algorithm
npr_LDF = pd.read_csv("Data\\bbc_news_post_eda.csv")
npr_LDF.drop(columns='Unnamed: 0', inplace=True)
npr_LDF

Unnamed: 0,title,description,description_text
0,Ukraine: Angry Zelensky vows to punish Russian...,The Ukrainian president says the country will ...,The Ukrainian president say country forgive fo...
1,War in Ukraine: Taking cover in a town under a...,"Jeremy Bowen was on the frontline in Irpin, as...",Jeremy Bowen frontline Irpin resident came Rus...
2,Ukraine war 'catastrophic for global food',One of the world's biggest fertiliser firms sa...,One world biggest fertiliser firm say conflict...
3,Manchester Arena bombing: Saffie Roussos's par...,The parents of the Manchester Arena bombing's ...,The parent Manchester Arena bombing youngest v...
4,Ukraine conflict: Oil price soars to highest l...,Consumers are feeling the impact of higher ene...,Consumers feeling impact higher energy cost fu...
...,...,...,...
17791,England v Ireland: Stuart Broad says bowlers r...,Stuart Broad says England's attack is ready to...,Stuart Broad say Englands attack ready step As...
17792,"England v Ireland: Stuart Broad, Zak Crawley &...",Watch highlights as Stuart Broad's five-wicket...,Watch highlight Stuart Broads fivewicket haul ...
17793,Phil Neville: Inter Miami sack coach after 10 ...,Major League Soccer side Inter Miami sack coac...,Major League Soccer side Inter Miami sack coac...
17794,French Open: Jessica Pegula column on getting ...,"American Jessica Pegula, seeded third in the F...",American Jessica Pegula seeded third French Op...


In [76]:
#2 Perform preprocessing on the data (i.e., build the vocabulary)
#Build the settings for the CountVectorizer. 
from sklearn.feature_extraction.text import CountVectorizer
cv_LDF = CountVectorizer(lowercase=True, stop_words='english', ngram_range=(1, 2), max_df=0.85, min_df=3)
    #lowercase --> Converts all characters to lowercase before tokenizing
    #stop_words --> Removes the common words of the English language 
    #ngram_range --> (1, 2) means that both unigrams and bigrams are considered
    #max_df --> 0.85 tells the vectorizer to ignore terms that appear in more than 85% of the articles
    #min_df --> 3 tells the vectorizer to ignore terms that appear in fewer than 3 articles

cv_LDF


In [78]:
#3 Create a Document Term Matrix (DTM) using the CountVectorizer's settings 
dtm_LDF = cv_LDF.fit_transform(npr_LDF['description_text'].values.astype('U'))
    #The data must be converted to unicode (i.e., a string in Python 3)
dtm_LDF

<17796x14541 sparse matrix of type '<class 'numpy.int64'>'
	with 202580 stored elements in Compressed Sparse Row format>
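One side effect of `astype('U')` is worth knowing: it converts every value to a string, so a missing value (`NaN`) becomes the literal text `nan`, which the vectorizer then treats as an ordinary word. The token `nan` that shows up among the NMF topic words later in this notebook would be consistent with that, although this has not been verified against the data. A minimal sketch with a made-up three-row Series shows the effect and one possible fix using `fillna("")`.

```python
import numpy as np
import pandas as pd

# Hypothetical text column with one missing description.
texts = pd.Series(["ukraine war", np.nan, "premier league"], dtype=object)

# Direct conversion: the missing value turns into the string "nan".
as_unicode = texts.values.astype("U")

# Replacing missing values with an empty string first avoids that token.
cleaned = texts.fillna("").values.astype("U")

print(as_unicode.tolist())
print(cleaned.tolist())
```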

In [83]:
#4. Get the vocabulary (this step is not essential)
#Get the number of terms in the vocabulary 
print (len(cv_LDF.get_feature_names_out()))

#There are 14541 terms (unigrams and bigrams) in the vocabulary of the DTM 

14541


In [84]:
#Example: view the term at index 1994 
print (cv_LDF.get_feature_names_out()[1994])

camilla


In [85]:
#5 Perform LDA via Sklearn. Remember that you choose the number of topics

from sklearn.decomposition import LatentDirichletAllocation 
LDA = LatentDirichletAllocation(n_components=5, random_state=101) 
    #n_components --> The K value (the number of topics)
    #random_state --> The seed of the random number generator, which makes the results reproducible

#6 Fit the machine learning model on the DTM built by the CountVectorizer 
LDA.fit(dtm_LDF)

In [86]:
#7 Get the number of topics (the K value) 
print (len(LDA.components_))

#LDA.components_ --> one row per topic and one column per vocabulary term
print (LDA.components_.shape)

5
(5, 14541)
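Each row of `components_` holds one topic's unnormalized weights over the vocabulary. Dividing a row by its sum turns it into a word distribution for that topic. The sketch below uses a made-up 2 x 4 matrix (`toy_components`) in place of the real `LDA.components_`.

```python
import numpy as np

# Hypothetical stand-in for LDA.components_: 2 topics x 4 vocabulary terms.
toy_components = np.array([[8.0, 1.0, 0.5, 0.5],
                           [0.2, 0.8, 6.0, 3.0]])

# Normalize each row so it sums to 1 (a word distribution per topic).
topic_word_dist = toy_components / toy_components.sum(axis=1, keepdims=True)

print(topic_word_dist.round(3))
print(topic_word_dist.sum(axis=1))
```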


In [88]:
#8 Grab the highest-weighted words per topic. We use argsort, which returns the index positions that would sort an array in ascending order. 

#Get the index positions of the last 15 items, which are the largest
list_LDA = []
for x in range(LDA.components_.shape[0]):
    # print('The words for topic ',x)
    for y in LDA.components_[x].argsort()[-15:] :
        # print(cv_LDF.get_feature_names_out()[y]) 
        list_LDA.append(cv_LDF.get_feature_names_out()[y])
    # print('\n')
    # print('\n')

In [94]:
#Create a list for each Topic 
list_LDA_Nested = []
for x in range(0, len(list_LDA), 15): 
    list_LDA_Nested.append(list_LDA[x:x+15])
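The two cells above (collect the top terms, then split them into one list per topic) can also be sketched as a single helper. `top_terms_per_topic` is a hypothetical name, and the toy matrix and vocabulary are made up. Unlike the cells above, which list terms in ascending order of weight, this version returns the strongest term first.

```python
import numpy as np

def top_terms_per_topic(components, feature_names, n_top=15):
    """Return, for each topic row, its n_top highest-weighted terms (largest first)."""
    feature_names = np.asarray(feature_names)
    top_lists = []
    for topic_weights in components:
        # argsort is ascending, so take the last n_top indices and reverse them.
        top_idx = np.argsort(topic_weights)[-n_top:][::-1]
        top_lists.append(feature_names[top_idx].tolist())
    return top_lists

# Hypothetical 2-topic, 4-term example.
toy_components = np.array([[0.1, 3.0, 2.0, 0.5],
                           [4.0, 0.2, 0.3, 1.0]])
toy_vocab = ["cup", "league", "premier", "world"]

toy_top_terms = top_terms_per_topic(toy_components, toy_vocab, n_top=2)
print(toy_top_terms)
```

In the notebook this would be called as `top_terms_per_topic(LDA.components_, cv_LDF.get_feature_names_out())`, which also avoids rebuilding the feature-name array inside the loop.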


In [105]:
#Topic 0
print('Topic 0')
print (list_LDA_Nested[0])
print('\n')
#Topic 1
print('Topic 1')
print (list_LDA_Nested[1])
print('\n')
#Topic 2
print('Topic 2')
print (list_LDA_Nested[2])
print('\n')
#Topic 3
print('Topic 3')
print (list_LDA_Nested[3])
print('\n')
#Topic 4
print('Topic 4')
print (list_LDA_Nested[4])
print('\n')


Topic 0
['manchester city', 'champions', 'bbc', 'liverpool', '2022', 'open', 'united', 'final', 'premier league', 'city', 'premier', 'win', 'manchester', 'say', 'league']


Topic 1
['lead', 'pay', 'bbc', 'energy', 'strike', 'plan', 'new', 'uk', 'paper', 'government', 'home', 'police', 'year', 'people', 'say']


Topic 2
['final', 'south', 'france', 'past', 'test', 'watch', 'victory', 'wales', 'win', 'day', 'say', 'world cup', 'england', 'cup', 'world']


Topic 3
['ukrainian', 'social', 'country', 'russia', 'medium', 'president', 'tell', 'woman', 'world', 'new', 'bbc', 'ukraine', 'say', 'year', 'war']


Topic 4
['new', 'party', 'election', 'living', 'government', 'bbc', 'leader', 'prime minister', 'prime', 'cost', 'people', 'ukraine', 'minister', 'uk', 'say']




In [103]:
#Interpreting Each Topic 
#Topic 0 --> This topic seems to cover local U.K. soccer / European Soccer 
#Topic 1 --> This topic seems to cover the internal U.K governmental concerns and politics 
#Topic 2 --> This topic seems to cover international soccer 
#Topic 3 --> This topic seems to cover the current war in Ukraine 
#Topic 4 --> This topic seems to cover elections and the impact of the Ukrainian war on U.K. politics 

In [106]:
#9 Transform the DTM sparse matrix via the machine learning model. The transform step is where topics are assigned to the documents 

results_LDF = LDA.transform(dtm_LDF) 

In [107]:
#10 Create a new column that represents the topic each article belongs to 
Results_list_LDF = [] 
for i,x in enumerate(npr_LDF['description_text']):

    Results_list_LDF.append(results_LDF[i].argsort()[-1:][0])
        #argsort --> the last index position is the topic with the highest weight for this article 

# print (Results_list)
import numpy as np 
npr_LDF['Results'] = np.array(Results_list_LDF)
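The loop above picks, for every article, the topic with the largest weight. A vectorized sketch of the same idea uses `argmax(axis=1)`. The matrix `toy_doc_topic` is made up and stands in for `results_LDF`.

```python
import numpy as np

# Hypothetical document-topic matrix: 3 documents x 3 topics.
toy_doc_topic = np.array([[0.7, 0.2, 0.1],
                          [0.1, 0.3, 0.6],
                          [0.2, 0.5, 0.3]])

# Loop form, mirroring the cell above.
loop_result = [int(row.argsort()[-1:][0]) for row in toy_doc_topic]

# Vectorized form: index of the largest value in each row.
vectorized_result = toy_doc_topic.argmax(axis=1)

print(loop_result)
print(vectorized_result.tolist())
```

The two forms agree whenever each row has a unique largest value. They can differ on ties (for example an all-zero row), because `argmax` returns the first maximal position while the `argsort`-based form takes the last position of the sorted order.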

In [110]:
#11 Add a descriptive label for each numerical topic in the Results column 
def Column_Convert(value):
    if value == 0:
        return "U.K & European Soccer"
    elif value == 1:
        return "U.K Governmental Politics"
    elif value == 2:
        return "International Soccer"
    elif value == 3:
        return "Ukrainian War"
    elif value == 4:
        return "Impact of Ukrainian War on U.K Politics"
 
npr_LDF['Results_description'] = npr_LDF['Results'].map(Column_Convert)

In [137]:
#The final results
npr_LDF

Unnamed: 0,title,description,description_text,Results,Results_description
0,Ukraine: Angry Zelensky vows to punish Russian...,The Ukrainian president says the country will ...,The Ukrainian president say country forgive fo...,3,Ukrainian War
1,War in Ukraine: Taking cover in a town under a...,"Jeremy Bowen was on the frontline in Irpin, as...",Jeremy Bowen frontline Irpin resident came Rus...,4,Impact of Ukrainian War on U.K Politics
2,Ukraine war 'catastrophic for global food',One of the world's biggest fertiliser firms sa...,One world biggest fertiliser firm say conflict...,1,U.K Governmental Politics
3,Manchester Arena bombing: Saffie Roussos's par...,The parents of the Manchester Arena bombing's ...,The parent Manchester Arena bombing youngest v...,4,Impact of Ukrainian War on U.K Politics
4,Ukraine conflict: Oil price soars to highest l...,Consumers are feeling the impact of higher ene...,Consumers feeling impact higher energy cost fu...,1,U.K Governmental Politics
...,...,...,...,...,...
17791,England v Ireland: Stuart Broad says bowlers r...,Stuart Broad says England's attack is ready to...,Stuart Broad say Englands attack ready step As...,2,International Soccer
17792,"England v Ireland: Stuart Broad, Zak Crawley &...",Watch highlights as Stuart Broad's five-wicket...,Watch highlight Stuart Broads fivewicket haul ...,2,International Soccer
17793,Phil Neville: Inter Miami sack coach after 10 ...,Major League Soccer side Inter Miami sack coac...,Major League Soccer side Inter Miami sack coac...,0,U.K & European Soccer
17794,French Open: Jessica Pegula column on getting ...,"American Jessica Pegula, seeded third in the F...",American Jessica Pegula seeded third French Op...,0,U.K & European Soccer
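The topic-number-to-label step can also be written with a dictionary, since `Series.map` accepts one directly. This is a sketch only: `topic_labels` is a hypothetical, deliberately incomplete subset of the labels, and `toy_results` is made up, to show what happens to a topic index that has no label.

```python
import pandas as pd

# Hypothetical subset of the topic labels, keyed by topic index.
topic_labels = {
    0: "U.K & European Soccer",
    1: "U.K Governmental Politics",
    2: "International Soccer",
}

toy_results = pd.Series([2, 0, 1, 4])

# Series.map accepts a dict; indices missing from the dict become NaN.
toy_descriptions = toy_results.map(topic_labels)

print(toy_descriptions.tolist())
```

A dictionary keeps the labels in one place, which makes it easier to update them if K or the topic interpretation changes.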


#### LDA Analysis Conclusions: 
- A K of 5 was applied to the BBC news articles of March 2023 to June 2023 
- CountVectorizer was used to create the DTM, and Sklearn's LatentDirichletAllocation was used to fit the LDA model and transform the DTM 
- This work can help companies suggest specific news articles to specific users 


------------------------------------------------

## Apply NMF

In [112]:
#1. Import the data for the NMF algorithm
npr_NMF = pd.read_csv("Data\\bbc_news_post_eda.csv")
npr_NMF.drop(columns='Unnamed: 0', inplace=True)
npr_NMF

Unnamed: 0,title,description,description_text
0,Ukraine: Angry Zelensky vows to punish Russian...,The Ukrainian president says the country will ...,The Ukrainian president say country forgive fo...
1,War in Ukraine: Taking cover in a town under a...,"Jeremy Bowen was on the frontline in Irpin, as...",Jeremy Bowen frontline Irpin resident came Rus...
2,Ukraine war 'catastrophic for global food',One of the world's biggest fertiliser firms sa...,One world biggest fertiliser firm say conflict...
3,Manchester Arena bombing: Saffie Roussos's par...,The parents of the Manchester Arena bombing's ...,The parent Manchester Arena bombing youngest v...
4,Ukraine conflict: Oil price soars to highest l...,Consumers are feeling the impact of higher ene...,Consumers feeling impact higher energy cost fu...
...,...,...,...
17791,England v Ireland: Stuart Broad says bowlers r...,Stuart Broad says England's attack is ready to...,Stuart Broad say Englands attack ready step As...
17792,"England v Ireland: Stuart Broad, Zak Crawley &...",Watch highlights as Stuart Broad's five-wicket...,Watch highlight Stuart Broads fivewicket haul ...
17793,Phil Neville: Inter Miami sack coach after 10 ...,Major League Soccer side Inter Miami sack coac...,Major League Soccer side Inter Miami sack coac...
17794,French Open: Jessica Pegula column on getting ...,"American Jessica Pegula, seeded third in the F...",American Jessica Pegula seeded third French Op...


In [115]:
#2. Set up the text preprocessing via TF-IDF to build a vocabulary
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_NMF = TfidfVectorizer(lowercase=True, stop_words='english', ngram_range=(1, 2), max_df=0.85, min_df=3)
    #lowercase --> Converts all characters to lowercase before tokenizing
    #stop_words --> Removes the common words of the English language 
    #ngram_range --> (1, 2) means that both unigrams and bigrams are considered
    #max_df --> 0.85 tells the vectorizer to ignore terms that appear in more than 85% of the articles
    #min_df --> 3 tells the vectorizer to ignore terms that appear in fewer than 3 articles
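For intuition about what TF-IDF adds on top of raw counts, the sketch below reproduces the weighting by hand with NumPy, assuming scikit-learn's documented defaults (smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row). The `counts` matrix is made up.

```python
import numpy as np

# Hypothetical raw term counts: 3 documents x 3 terms.
counts = np.array([[2.0, 1.0, 0.0],
                   [1.0, 0.0, 1.0],
                   [1.0, 0.0, 0.0]])

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)   # number of documents containing each term

# Smoothed inverse document frequency (assumed scikit-learn default).
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

tfidf = counts * idf                                      # weight counts by rarity
tfidf = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)  # L2-normalize each row

print(idf.round(3))
print(tfidf.round(3))
```

The first term appears in every document and therefore gets the lowest IDF, while the rarer terms are weighted up.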

In [116]:
#3 Create a Document Term Matrix (DTM) using the TfidfVectorizer's settings 
dtm_NMF = tfidf_NMF.fit_transform(npr_NMF['description_text'].values.astype('U'))
    #The data must be converted to unicode (i.e., a string in Python 3)
dtm_NMF

<17796x14541 sparse matrix of type '<class 'numpy.float64'>'
	with 202580 stored elements in Compressed Sparse Row format>

In [117]:
#4. Get the vocab of all the words 
#Get the number of words used 
print (len(tfidf_NMF.get_feature_names_out()))


14541


In [118]:
#Get the term at index 1994 
print (tfidf_NMF.get_feature_names_out()[1994])

camilla


In [120]:
#5 Perform NMF via Sklearn. Remember that you choose the number of topics

from sklearn.decomposition import NMF
NMF_ML = NMF(n_components=5, random_state=101) 
    #n_components --> The K value (the number of topics)
    #random_state --> The seed of the random number generator, which makes the results reproducible

#6 Fit the machine learning model on the DTM built by the TfidfVectorizer 
NMF_ML.fit(dtm_NMF)

In [121]:
#7 Get the number of topics (the K value) 
print (len(NMF_ML.components_))

# components_ has one row per topic and one column per vocabulary term
print (NMF_ML.components_.shape)

5
(5, 14541)
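As a reminder of what the fitted model represents: with its default settings (Frobenius loss, no regularization), Sklearn's NMF looks for two non-negative matrices whose product approximates the TF-IDF matrix. The symbols $V$, $W$, and $H$ below are notation introduced here for illustration; $H$ corresponds to `components_` and $W$ to the output of `transform`.

```latex
\min_{W \ge 0,\; H \ge 0} \; \tfrac{1}{2}\,\lVert V - W H \rVert_F^2,
\qquad
V \in \mathbb{R}^{17796 \times 14541},\quad
W \in \mathbb{R}^{17796 \times 5},\quad
H \in \mathbb{R}^{5 \times 14541}
```

Unlike LDA, the rows of $W$ and $H$ are weights rather than probabilities, so they do not sum to 1.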


In [123]:
#8 Grab the highest-weighted words per topic. We use argsort, which returns the index positions that would sort an array in ascending order. 

#Get the index positions of the last 15 items, which are the largest
list_NFM = []
for x in range(NMF_ML.components_.shape[0]):
    # print('The words for topic ',x)
    for y in NMF_ML.components_[x].argsort()[-15:] :
        # print(cv_LDF.get_feature_names_out()[y]) 
        list_NFM.append(tfidf_NMF.get_feature_names_out()[y])
    # print('\n')
    # print('\n')


In [125]:
#Create a list for each Topic 
list_NFM_Nested = []
for x in range(0, len(list_NFM), 15): 
    list_NFM_Nested.append(list_NFM[x:x+15])


In [126]:
#Topic 0
print('Topic 0')
print (list_NFM_Nested[0])
print('\n')
#Topic 1
print('Topic 1')
print (list_NFM_Nested[1])
print('\n')
#Topic 2
print('Topic 2')
print (list_NFM_Nested[2])
print('\n')
#Topic 3
print('Topic 3')
print (list_NFM_Nested[3])
print('\n')
#Topic 4
print('Topic 4')
print (list_NFM_Nested[4])
print('\n')


Topic 0
['victory', 'englands', 'semifinal', 'bbc', '2022', 'france', 'win', 'watch', 'wales', 'qatar', 'final', 'england', 'world cup', 'cup', 'world']


Topic 1
['going', 'whats', 'paying', 'attention', 'closely', 'going past', 'whats going', 'attention whats', 'closely paying', 'paying attention', 'day', 'past', 'seven', 'past seven', 'seven day']


Topic 2
['law', 'scotlands', 'advice', 'avoid', 'clash', 'include', 'flu', 'gender', 'family', 'protest', 'award', 'aim', 'documentary', 'glory', 'nan']


Topic 3
['russia', 'war', 'president', 'home', 'russian', 'minister', 'bbc', 'government', 'police', 'new', 'year', 'uk', 'ukraine', 'people', 'say']


Topic 4
['title', 'season', 'goal', 'liverpool', 'win', 'manchester united', 'champions league', 'champions', 'united', 'manchester city', 'city', 'premier league', 'premier', 'manchester', 'league']




In [127]:
#Interpreting Each Topic 
#Topic 0 --> This topic seems to cover international soccer
#Topic 1 --> This topic seems to be repeating many words which could be an indicator that the selected K value is too large 
#Topic 2 --> This topic seems to cover internal news of the U.K & elections 
#Topic 3 --> This topic seems to cover the current war in Ukraine 
#Topic 4 --> This topic seems to cover U.K and European soccer
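The repeated bigrams in Topic 1 ('paying attention', 'past seven', 'seven day') read like one sentence that recurs across many descriptions. If that is the cause (this is an assumption and has not been checked against the data), counting duplicated descriptions is a quick diagnostic to run before changing K. The sketch uses a made-up `toy_text` Series in place of `npr_NMF['description_text']`.

```python
import pandas as pd

# Hypothetical descriptions; the notebook would use npr_NMF['description_text'].
toy_text = pd.Series([
    "closely paying attention whats going past seven day",
    "Stuart Broad say Englands attack ready",
    "closely paying attention whats going past seven day",
    "closely paying attention whats going past seven day",
])

# Count how often each exact description occurs, most frequent first.
description_counts = toy_text.value_counts()
repeated = description_counts[description_counts > 1]
print(repeated)

# One option: keep only the first copy of each description before vectorizing.
deduplicated = toy_text.drop_duplicates()
print(len(toy_text), len(deduplicated))
```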

In [128]:
#9 Transform the DTM sparse matrix via the machine learning model

results_NFM = NMF_ML.transform(dtm_NMF) 

In [130]:
#10 Create a new column that represents the topic each article belongs to 
Results_list_NFM = [] 
for i,x in enumerate(npr_NMF['description_text']):

    Results_list_NFM.append(results_NFM[i].argsort()[-1:][0])
        #argsort --> the last index position is the topic with the highest weight for this article 

import numpy as np 
npr_NMF['Results'] = np.array(Results_list_NFM)

In [135]:
#11 Add a descriptive label for each numerical topic in the Results column 
def Column_Convert(value):
    if value == 0:
        return "International Soccer"    
    elif value == 1:
        return "Many Words are Repeated"
    elif value == 2:
        return "U.K Governmental Politics"
    elif value == 3:
        return "Ukrainian War & its Impact"
    elif value == 4:
        return "U.K & European Soccer"

 
npr_NMF['Results_description'] = npr_NMF['Results'].map(Column_Convert)

In [136]:
npr_NMF

Unnamed: 0,title,description,description_text,Results,Results_description
0,Ukraine: Angry Zelensky vows to punish Russian...,The Ukrainian president says the country will ...,The Ukrainian president say country forgive fo...,3,Ukrainian War & its Impact
1,War in Ukraine: Taking cover in a town under a...,"Jeremy Bowen was on the frontline in Irpin, as...",Jeremy Bowen frontline Irpin resident came Rus...,3,Ukrainian War & its Impact
2,Ukraine war 'catastrophic for global food',One of the world's biggest fertiliser firms sa...,One world biggest fertiliser firm say conflict...,3,Ukrainian War & its Impact
3,Manchester Arena bombing: Saffie Roussos's par...,The parents of the Manchester Arena bombing's ...,The parent Manchester Arena bombing youngest v...,4,U.K & European Soccer
4,Ukraine conflict: Oil price soars to highest l...,Consumers are feeling the impact of higher ene...,Consumers feeling impact higher energy cost fu...,3,Ukrainian War & its Impact
...,...,...,...,...,...
17791,England v Ireland: Stuart Broad says bowlers r...,Stuart Broad says England's attack is ready to...,Stuart Broad say Englands attack ready step As...,3,Ukrainian War & its Impact
17792,"England v Ireland: Stuart Broad, Zak Crawley &...",Watch highlights as Stuart Broad's five-wicket...,Watch highlight Stuart Broads fivewicket haul ...,0,International Soccer
17793,Phil Neville: Inter Miami sack coach after 10 ...,Major League Soccer side Inter Miami sack coac...,Major League Soccer side Inter Miami sack coac...,4,U.K & European Soccer
17794,French Open: Jessica Pegula column on getting ...,"American Jessica Pegula, seeded third in the F...",American Jessica Pegula seeded third French Op...,3,Ukrainian War & its Impact


#### NMF Analysis Conclusions: 
- A K of 5 was applied to the BBC news articles of March 2023 to June 2023 
- TfidfVectorizer was used to create the DTM, and Sklearn's NMF was used to fit the NMF model and transform the DTM 
- This work can help companies suggest specific news articles to specific users 


--------------------------

## Overall Conclusions 
- LDA and NMF were applied to BBC news articles of March 2023 to June 2023 
- A K of 5 was selected for both Topic Modeling algorithms 
- NMF seemed to have an issue with one of the topics, which consisted of many repeated words; this could indicate that the selected K value is too large 
- Topics of both algorithms covered: European Soccer, International Soccer, Internal Politics of the U.K, and The Ukrainian War, all of which were highly relevant at the time of this work 
- A future improvement is to conduct Coherence Analysis over various K values and then select the K value that produces the highest coherence score
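
As an illustration of the coherence analysis mentioned above, the following sketch computes a UMass-style coherence score for one topic from a binary document-term matrix. The choice of the UMass measure is an assumption made here for illustration (the notebook does not name a specific measure), and `umass_coherence` and the `presence` matrix are hypothetical.

```python
import numpy as np

def umass_coherence(doc_term_binary, top_term_idx):
    """UMass-style coherence for one topic.

    doc_term_binary: (n_docs, n_terms) array of 0/1 term presence.
    top_term_idx: term indices for the topic, strongest first.
    Assumes every listed term occurs in at least one document.
    """
    score = 0.0
    for m in range(1, len(top_term_idx)):
        for l in range(m):
            w_m = doc_term_binary[:, top_term_idx[m]]
            w_l = doc_term_binary[:, top_term_idx[l]]
            co_docs = np.sum(w_m * w_l)   # documents containing both terms
            l_docs = np.sum(w_l)          # documents containing the higher-ranked term
            score += np.log((co_docs + 1) / l_docs)
    return float(score)

# Hypothetical presence matrix: 4 documents x 4 terms.
presence = np.array([[1, 1, 0, 0],
                     [1, 1, 0, 0],
                     [0, 0, 1, 1],
                     [0, 0, 1, 0]])

coherent_topic = umass_coherence(presence, [0, 1])  # terms that co-occur
mixed_topic = umass_coherence(presence, [0, 2])     # terms that never co-occur
print(coherent_topic, mixed_topic)
```

In practice one would fit a model for each candidate K, average this score over the topics of each model, and compare the averages across K values.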
