## Name: Fareed Hassan Khan
## ERP ID: 25367
## Text Analytics Assignment 1

<div style="border: 1px solid black; padding:10px; border-radius: 5px;">
<p>Important Point</p>

Several helper functions are defined to avoid repetitive code.
    
| Function Name | Purpose |
| :--- | :--- |
| lemma_or_stemma | Lemmatize or stem news titles |
| cleaning | Remove unwanted characters using regex |
| sentence_vector | Convert a document into a vector by averaging word embeddings from pre-trained Word2Vec, pre-trained GloVe, or a custom-trained Word2Vec model |
| assign_doc_to_clus | Map each document to a cluster based on the labels produced by the chosen method |
</div>

### Dataset Description

I explored the BBC News dataset on Kaggle, which contains over 14,000 news articles published by the British Broadcasting Corporation (BBC) over a six-year period. The dataset covers five categories, namely business, entertainment, politics, sport, and technology. Moreover, each news article is accompanied by a short description, giving a quick insight into its contents.

| Dataset Name | Default task | Download link |
| :--- | :--- | :--- |
| BBC News | Clustering, Classification | https://www.kaggle.com/datasets/gpreda/bbc-news |


_____

**Project Workflow**
1. [Initial steps](#initial)
    1. Importing Libraries
    2. Creating Functions
2. [Models](#Models)
3. [Clustering KMeans](#kmeans)
4. [Saving Models](#savingmodels)

_________

<a name="initial"></a>
# Initial Steps

Importing Libraries

In [1]:
# To remove warning
import warnings
warnings.filterwarnings('ignore')

# For dataset handling
import pandas as pd
import numpy as np

# For working with models
import nltk
import re
import itertools
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from gensim.models import KeyedVectors, Word2Vec
import gensim 
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# Checking similar documents
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans, AgglomerativeClustering, MiniBatchKMeans, MeanShift
from sklearn.decomposition import TruncatedSVD

# For saving models
import pickle

BBC News Dataset loading

In [2]:
df = pd.read_csv('bbc_news.csv', usecols=['title','description'])
df.drop_duplicates(inplace=True)
df.dropna(inplace=True)

Cleaning Unwanted Characters (Regex) function

In [3]:
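def cleaning(s):
    """Strip unwanted characters and markup remnants from a value."""
    s = str(s)
    s = re.sub(r'[!@#$_]', '', s)
    # Caution: plain substring removal also hits ordinary words
    # (e.g. "country" becomes "untry" in the outputs shown in this notebook).
    s = s.replace("co", "")
    s = s.replace("https", "")
    # str.replace matches literally, so this only removes the exact text "[\w*".
    s = s.replace(r"[\w*", " ")
    # Regex patterns need re.sub; str.replace('<.*?>', '') would never match a tag.
    s = re.sub(r'<.*?>', '', s)
    s = s.replace('strong>', '')
    s = s.replace('\x92', '')
    return s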

df['title'] = df['title'].apply(cleaning)
df['description'] = df['description'].apply(cleaning)
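
The `cleaning` step mixes `re.sub` with `str.replace`, and the two treat their pattern argument differently: `str.replace` matches literal text, while `re.sub` interprets a regular expression. The sketch below illustrates the difference on a made-up sample string (an assumption for illustration, not a row from the dataset), and also shows why a bare substring removal such as `replace('co', '')` alters ordinary words.

```python
import re

# Hypothetical sample text, not taken from the BBC dataset.
sample = "<strong>Cost of living</strong>: country faces cold snap"

# str.replace treats '<.*?>' as literal text, so nothing is removed.
literal = sample.replace('<.*?>', '')

# re.sub treats '<.*?>' as a regex and strips both tags.
regex = re.sub(r'<.*?>', '', sample)

# A bare substring removal also hits the inside of ordinary words.
stripped_co = regex.replace('co', '')

print(literal)      # unchanged, tags still present
print(regex)        # Cost of living: country faces cold snap
print(stripped_co)  # Cost of living: untry faces ld snap
```

The last line reproduces the kind of damage ("untry", "ld") that appears in some of the titles printed in this notebook.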

Applying Lemmatization/Stemming

In [4]:
ps = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def lemma_or_stemma(text, which_method):
    """Lower-case each word, then stem it ('stemming') or lemmatize it (any other value)."""
    if which_method == 'stemming':
        return " ".join(ps.stem(word.lower()) for word in text.split())
    return " ".join(lemmatizer.lemmatize(word.lower()) for word in text.split())

df['headline_lemmatization'] = df['title'].apply(lambda text: lemma_or_stemma(text, 'lemmatization'))
df['headline_stemming'] = df['title'].apply(lambda text: lemma_or_stemma(text, 'stemming'))

df.head(2)

| | title | description | headline_lemmatization | headline_stemming |
| :--- | :--- | :--- | :--- | :--- |
| 0 | Ukraine: Angry Zelensky vows to punish Russian... | The Ukrainian president says the untry will no... | ukraine: angry zelensky vow to punish russian ... | ukraine: angri zelenski vow to punish russian ... |
| 1 | War in Ukraine: Taking ver in a town under attack | Jeremy Bowen was on the frontline in Irpin, as... | war in ukraine: taking ver in a town under attack | war in ukraine: take ver in a town under attack |


sentence_vector (transforming documents to vectorized matrix using word2vec, glove etc)

In [None]:
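def sentence_vector(sentence, word_vectors):
    """Average the embeddings of the in-vocabulary words of `sentence`."""
    words = sentence.lower().split()
    vectors = [word_vectors[word] for word in words if word in word_vectors]
    # Fall back to a zero vector only when no word is in the vocabulary,
    # and use the model's own dimensionality instead of a hard-coded 300.
    if not vectors:
        return np.zeros(word_vectors.vector_size)
    return np.mean(vectors, axis=0)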
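
The averaging idea behind `sentence_vector` can be sketched without gensim. In the example below, a plain dictionary of made-up 3-dimensional vectors stands in for a `KeyedVectors` model, and the helper name `average_vector` is hypothetical. It shows that out-of-vocabulary words are skipped and that a zero vector is returned only when no word is known.

```python
from statistics import fmean

# Hypothetical 3-dimensional embeddings; real models use 50 to 300 dimensions.
toy_vectors = {
    "late":    [1.0, 0.0, 2.0],
    "wickets": [3.0, 2.0, 0.0],
}

def average_vector(sentence, vectors, size=3):
    """Average the vectors of known words; return zeros if none are known."""
    found = [vectors[w] for w in sentence.lower().split() if w in vectors]
    if not found:
        return [0.0] * size
    # zip(*found) groups the values dimension by dimension.
    return [fmean(dim) for dim in zip(*found)]

print(average_vector("Late wickets today", toy_vectors))  # [2.0, 1.0, 1.0]
print(average_vector("unknown words only", toy_vectors))  # [0.0, 0.0, 0.0]
```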

<a name="Models"></a>
# Models

Bag of words default (CountVectorizer/tf-idf Vectorizer)

In [11]:
# CountVectorizer
cv = CountVectorizer(ngram_range=(1,2), stop_words='english')
# TfIdf-Vectorizer
tfidf = TfidfVectorizer(ngram_range=(1,2), stop_words='english')

# Applying the default bag-of-words vectorizers to the lemmatized headlines
lemma_vector_countvect = cv.fit_transform(df['headline_lemmatization']).toarray()
lemma_vector_tfidf = tfidf.fit_transform(df['headline_lemmatization']).toarray()

print(f'CountVectorizer vector Shape {lemma_vector_countvect.shape} ________ Tf-Idf Vector Shape {lemma_vector_tfidf.shape}')

CountVectorizer vector Shape (14039, 79576) ________ Tf-Idf Vector Shape (14039, 79576)


### Bag of Words

Parameter tuning

In [7]:
# CountVectorizer Parameters
parameters_countvectorizer = {
    'n_gram': [(1,2)],
    'max_features': [10000, 20000, 25000],
    'binary':[True, False]}
keys, values = zip(*parameters_countvectorizer.items())
parameters_countvectorizer = [dict(zip(keys, v)) for v in itertools.product(*values)]


# TF-IDF Parameters
parameters_tfidfvectorizer = {
    'n_gram': [(1,2)],
    'max_features': [30000, 50000, 60000],
    'norm':['l1','l2']}
keys, values = zip(*parameters_tfidfvectorizer.items())
parameters_tfidfvectorizer = [dict(zip(keys, v)) for v in itertools.product(*values)]
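
The `zip` and `itertools.product` idiom used above expands a dictionary of candidate values into one dictionary per combination. A minimal sketch with the CountVectorizer grid, where the names `grid` and `combinations` are illustrative only:

```python
import itertools

grid = {
    'n_gram': [(1, 2)],
    'max_features': [10000, 20000, 25000],
    'binary': [True, False],
}

# Split the grid into parallel tuples of names and candidate lists.
keys, values = zip(*grid.items())

# One dict per element of the Cartesian product of the candidate lists.
combinations = [dict(zip(keys, combo)) for combo in itertools.product(*values)]

print(len(combinations))  # 6, i.e. 1 * 3 * 2
print(combinations[0])    # {'n_gram': (1, 2), 'max_features': 10000, 'binary': True}
```

Each resulting dictionary corresponds to one vectorizer configuration, which is why the loops that follow build one matrix per entry.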

CountVectorizer (Stemming/Lemmatization)

In [12]:
stemming_data_countvec = []
lemmatization_data_countvect = []

for each_model in parameters_countvectorizer:
    # Count Vectorizer with stemming/lemmatization
    cv = CountVectorizer(ngram_range=each_model['n_gram'], stop_words='english', 
                         max_features=each_model['max_features'], binary=each_model['binary'])
    lemma_vector_countvect = cv.fit_transform(df['headline_lemmatization']).toarray()
    stemm_vector_countvect = cv.fit_transform(df['headline_stemming']).toarray()

    lemmatization_data_countvect.append(lemma_vector_countvect)
    stemming_data_countvec.append(stemm_vector_countvect)

In [19]:
each_similaririty_lemma_countvect = []
each_similaririty_stem_countvect = []

for each in lemmatization_data_countvect:
    similarity_scores = cosine_similarity(each)
    each_similaririty_lemma_countvect.append(similarity_scores)

for each in stemming_data_countvec:
    similarity_scores = cosine_similarity(each)
    each_similaririty_stem_countvect.append(similarity_scores)

Tf-Idf Vectorizer (Stemming/Lemmatization)

In [16]:
stemming_data_tidf = []
lemmatization_data_tidf = []

for each_model in parameters_tfidfvectorizer:
    # Tf-Idf Vectorizer with stemming/lemmatization
    cv = TfidfVectorizer(ngram_range=each_model['n_gram'], stop_words='english', max_features=each_model['max_features'], 
                         norm=each_model['norm'])
    lemmatization_vector_tidf = cv.fit_transform(df['headline_lemmatization']).toarray()
    stemming_vector_tidf = cv.fit_transform(df['headline_stemming']).toarray()

    lemmatization_data_tidf.append(lemmatization_vector_tidf)
    stemming_data_tidf.append(stemming_vector_tidf)

In [17]:
each_similaririty_lemma_tfidf = []
each_similaririty_stem_tfidf = []

for each in lemmatization_data_tidf:
    similarity_scores = cosine_similarity(each)
    each_similaririty_lemma_tfidf.append(similarity_scores)

for each in stemming_data_tidf:
    similarity_scores = cosine_similarity(each)
    each_similaririty_stem_tfidf.append(similarity_scores)

In [None]:
# Countvectorizer lemmatization/stemming cosine similarity with hyper-parameter tuning
each_similaririty_lemma_countvect
each_similaririty_stem_countvect

# Tf-Idf lemmatization/stemming cosine similarity with hyper-parameter tuning
each_similaririty_lemma_tfidf
each_similaririty_stem_tfidf
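
For reference, each entry of these matrices is the cosine of the angle between two document vectors: the dot product divided by the product of the vector lengths. The pure-Python sketch below computes that quantity for made-up term-count rows. The helper `cosine` is hypothetical and is not how scikit-learn implements `cosine_similarity`; returning 0.0 for an all-zero vector is a choice made here for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical term-count rows for three short titles over a 4-word vocabulary.
doc_a = [1, 1, 0, 0]
doc_b = [1, 1, 1, 0]
doc_c = [0, 0, 0, 2]

print(round(cosine(doc_a, doc_b), 3))  # 0.816 (two shared terms)
print(cosine(doc_a, doc_c))            # 0.0 (no shared terms)
```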

### Word2Vec Pre-trained

In [19]:
model_W2V = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin',binary=True, limit=50000)

2023-03-12 14:13:52,007 : INFO : loading projection weights from GoogleNews-vectors-negative300.bin
2023-03-12 14:13:52,351 : INFO : KeyedVectors lifecycle event {'msg': 'loaded (50000, 300) matrix of type float32 from GoogleNews-vectors-negative300.bin', 'binary': True, 'encoding': 'utf8', 'datetime': '2023-03-12T14:13:52.351055', 'gensim': '4.3.0', 'python': '3.9.0 (tags/v3.9.0:9cf6752, Oct  5 2020, 15:34:40) [MSC v.1927 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.22621-SP0', 'event': 'load_word2vec_format'}


In [20]:
word2vec_news = df['title'].apply(lambda sentence: sentence_vector(sentence, model_W2V))

In [30]:
dataframe = list(word2vec_news)
cosine_similarity_word2vec = cosine_similarity(dataframe)

In [378]:
news_id = 4511

print('actual news ----', '\n', df.iloc[news_id].title)
the_sort = sorted(list(enumerate(cosine_similarity_word2vec[news_id])), reverse=True, key=lambda x:x[1])[0:10]

print('similar news -----')
# close_ques = []
for each in the_sort[1:4]:
    # close_ques.append(each[0])
    print(df.iloc[each[0]].title)

actual news ---- 
 England v New Zealand: Late wickets put hosts in charge at Headingley
similar news -----
England v New Zealand: Late wickets put England in charge of third Test
England v South Africa: Issy Wong takes late wickets to boost hosts
England v India highlights: Jasprit Bumrah takes six wickets as hosts beaten by 10 wickets in first ODI
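
The lookup cells in this notebook slice `the_sort[1:4]`, which assumes the query document itself ranks first. With tied scores that assumption may not hold, so one possible variant is to exclude the query index explicitly. The helper below is a hypothetical sketch on made-up scores, not part of the original workflow.

```python
def top_similar(similarity_row, query_index, k=3):
    """Return (index, score) pairs of the k highest scores, excluding the query."""
    ranked = sorted(enumerate(similarity_row), key=lambda pair: pair[1], reverse=True)
    return [(i, s) for i, s in ranked if i != query_index][:k]

# Hypothetical similarity scores of document 0 against five documents.
row = [1.0, 0.2, 0.9, 0.4, 0.7]
print(top_similar(row, query_index=0, k=3))  # [(2, 0.9), (4, 0.7), (3, 0.4)]
```

In the notebook's terms, `similarity_row` would be a row such as `cosine_similarity_word2vec[news_id]` and the returned indices would be passed to `df.iloc`.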


### Glove Pre-trained

In [5]:
from gensim.scripts.glove2word2vec import glove2word2vec

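# Conversion of the raw GloVe file to word2vec text format (commented out; the converted file is loaded directly)
# glove_input_file = 'glove.6B.50d.txt'
word2vec_output_file = 'glove.6B.300d.txt.word2vec'
# glove2word2vec(glove_input_file, word2vec_output_file)
# Note: the log output recorded below comes from a run that loaded the 50-dimensional file.
modelg = KeyedVectors.load_word2vec_format(word2vec_output_file, binary=False)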

2023-03-12 12:39:13,908 : INFO : loading projection weights from glove.6B.50d.txt.word2vec
2023-03-12 12:39:24,851 : INFO : KeyedVectors lifecycle event {'msg': 'loaded (400000, 50) matrix of type float32 from glove.6B.50d.txt.word2vec', 'binary': False, 'encoding': 'utf8', 'datetime': '2023-03-12T12:39:24.851510', 'gensim': '4.3.0', 'python': '3.9.0 (tags/v3.9.0:9cf6752, Oct  5 2020, 15:34:40) [MSC v.1927 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.22621-SP0', 'event': 'load_word2vec_format'}


In [7]:
glove_news = df['title'].apply(lambda sentence: sentence_vector(sentence, modelg))

In [383]:
dataframe_glove = list(glove_news)
cosine_similarity_glove = cosine_similarity(dataframe_glove)

In [390]:
news_id = 8777

print('actual news ----', '\n', df.iloc[news_id].title)
the_sort = sorted(list(enumerate(cosine_similarity_glove[news_id])), reverse=True, key=lambda x:x[1])[0:10]

print('similar news -----')
# close_ques = []
for each in the_sort[1:4]:
    print(df.iloc[each[0]].title)

actual news ---- 
 Track Cycling World Championships: Jess Roberts wins women's scratch bronze
similar news -----
Commonwealth Games: Laura Kenny wins scratch race gold at track cycling
Track Cycling World Championships: Great Britain win three bronze medals on opening day
World Aquatic Championships: Summer McIntosh, 15, wins gold & Great Britain take relay bronze


### Customized Word2Vec

CBOW (sg=0 i.e., Default)

In [None]:
documents = [gensim.utils.simple_preprocess(document) for document in list(df.description)]

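# Passing `documents` to the constructor already builds the vocabulary and trains for the default number of epochs.
model_cbow = gensim.models.Word2Vec(documents, vector_size=300, window=10, min_count=2, workers=10)

# Continue training for 150 additional epochs.
model_cbow.train(documents, total_examples=len(documents), epochs=150)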

customized_word2vec_news = df['title'].apply(lambda sentence: sentence_vector(sentence, model_cbow.wv))

dataframe_customized_news = list(customized_word2vec_news)
cosine_similarity_customized = cosine_similarity(dataframe_customized_news)

In [54]:
news_id = 8777

print('actual news ----', '\n', df.iloc[news_id].title)
the_sort = sorted(list(enumerate(cosine_similarity_customized[news_id])), reverse=True, key=lambda x:x[1])[0:10]

print('similar news -----')
# close_ques = []
for each in the_sort[1:4]:
    # close_ques.append(each[0])
    print(df.iloc[each[0]].title)

SKIPGRAM  (sg=1)

In [None]:
documents = [gensim.utils.simple_preprocess(document) for document in list(df.description)]

model_skipgram = gensim.models.Word2Vec(documents, vector_size=300, window=10, min_count=2, workers=10, sg=1)
model_skipgram.train(documents,total_examples=len(documents),epochs=150)

model_skipgram.wv.save_word2vec_format('customized_word2vec.txt', binary=False)

customized_word2vec_news_Skip = df['title'].apply(lambda sentence: sentence_vector(sentence, model_skipgram.wv))

In [67]:
news_id = 9882

print('actual news ----', '\n', df.iloc[news_id].title)
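# Build the similarity matrix from the skip-gram vectors (`cosine_similarity_customized` holds the CBOW matrix)
cosine_similarity_skipgram = cosine_similarity(list(customized_word2vec_news_Skip))
the_sort = sorted(list(enumerate(cosine_similarity_skipgram[news_id])), reverse=True, key=lambda x:x[1])[0:10]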

print('similar news -----')
# close_ques = []
for each in the_sort[1:4]:
    # close_ques.append(each[0])
    print(df.iloc[each[0]].title)

actual news ---- 
 Rugby World Cup: Holly Aitchison joins Emily Scarratt in England centres for final
similar news -----
2023 sporting calendar: The year's main events from Women's World Cup football to Ashes series and men's rugby union World Cup
Why were ITV hosts Holly Willoughby and Phillip Schofield at Queen's lying-in-state?
ITV boss defends Holly Willoughby and Phillip Schofield over queue furore


### LSA/SVD on Word2Vec

In [21]:
# SVD on pre trained word2vec
svd_model = TruncatedSVD(n_components=50, algorithm='randomized', n_iter=100, random_state=122)
lsa = svd_model.fit_transform(list(word2vec_news))

In [22]:
cosine_similarity_pretrain_word2vec_lsa = cosine_similarity(lsa)

In [25]:
news_id = 14012

print('actual news ----', '\n', df.iloc[news_id].title)
the_sort = sorted(list(enumerate(cosine_similarity_pretrain_word2vec_lsa[news_id])), reverse=True, key=lambda x:x[1])[0:10]

print('similar news -----')
# close_ques = []
for each in the_sort[1:4]:
    # close_ques.append(each[0])
    print(df.iloc[each[0]].title)

actual news ---- 
 In pictures: Snow blankets parts of the UK as ld snap starts
similar news -----
In pictures: Snow blankets parts of the UK
UK weather: Spring snow as parts of untry hit by ld snap


# Clustering Effect

Function to assign each document to a cluster, given the labels obtained from bag-of-words, Word2Vec, etc.

In [9]:
def assign_doc_to_clus(cluster_labels, df_title):
    outer_clas = []
    for i in range(5):
        cluster_documents = []
        for j in range(len(df_title)):
            if cluster_labels[j] == i:
                cluster_documents.append(df_title[j])
        outer_clas.append(cluster_documents)
    return outer_clas
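
`assign_doc_to_clus` scans the full label list once per cluster and assumes exactly five clusters. As an alternative sketch (not the function used in the rest of this notebook), the same grouping can be done in a single pass for any number of clusters. The name `group_by_cluster` and the sample labels and titles are hypothetical.

```python
from collections import defaultdict

def group_by_cluster(cluster_labels, titles):
    """Map each cluster label to the list of titles assigned to it."""
    groups = defaultdict(list)
    for label, title in zip(cluster_labels, titles):
        groups[label].append(title)
    return dict(groups)

# Hypothetical labels and titles, standing in for KMeans output.
labels = [0, 2, 0, 1]
titles = ["title a", "title b", "title c", "title d"]
print(group_by_cluster(labels, titles))
# {0: ['title a', 'title c'], 2: ['title b'], 1: ['title d']}
```

Unlike the list-of-lists returned by `assign_doc_to_clus`, this variant returns a dictionary keyed by label, so clusters that receive no documents are simply absent.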

KMeans on default (i.e., Best) CountVectorizer and TfIdf Vectorizer

In [None]:
# CountVectorizer With KMeans
km_cv = KMeans(n_clusters = 5)
km_cv.fit(lemma_vector_countvect)
labels_cv = km_cv.labels_

cv_assign = assign_doc_to_clus(labels_cv, list(df.title))

In [None]:
# TfIdfVectorizer With KMeans
km_tfidf = KMeans(n_clusters = 5)
km_tfidf.fit(lemma_vector_tfidf)
labels_tfidf = km_tfidf.labels_

tfidf_assign = assign_doc_to_clus(labels_tfidf, list(df.title))

Kmeans on Word2Vec Pre-trained

In [38]:
# Word2Vec With KMeans
km_word2vec = KMeans(n_clusters = 5)
km_word2vec.fit(list(word2vec_news))
labels_word2vec = km_word2vec.labels_

pretrain_word2vec_assign = assign_doc_to_clus(labels_word2vec, list(df.title))



Kmeans on Glove Pre-trained

In [10]:
# Glove With KMeans
km_glove = KMeans(n_clusters = 5)
km_glove.fit(list(glove_news))
labels_glove = km_glove.labels_

pretrain_glove_assign = assign_doc_to_clus(labels_glove, list(df.title))



Kmeans on Customized Word2Vec - CBOW (Default) and Skipgram

In [None]:
# CBOW With KMeans
km_cbow = KMeans(n_clusters = 5)
km_cbow.fit(list(customized_word2vec_news))
labels_customized_word2vec_cbow = km_cbow.labels_

customized_cbow_assign = assign_doc_to_clus(labels_customized_word2vec_cbow, list(df.title))

In [None]:
# Skip-gram with KMeans
km_skipgram = KMeans(n_clusters = 5)
km_skipgram.fit(list(customized_word2vec_news_Skip))
labels_customized_word2vec_skipgram = km_skipgram.labels_

customized_skipgram_assign = assign_doc_to_clus(labels_customized_word2vec_skipgram, list(df.title))

Kmeans on LSA_Word2Vec

In [None]:
# word2vec with SVD With KMeans
km_word2vec_lsa = KMeans(n_clusters = 5)
km_word2vec_lsa.fit(list(lsa))
labels_customized_word2vec_lsa = km_word2vec_lsa.labels_

word2vec_lsa_assign = assign_doc_to_clus(labels_customized_word2vec_lsa, list(df.title))

# Saving the models

Bag of words most accurate

In [None]:
# Fit the TF-IDF vectorizer on the lemmatized headlines (KMeans was already fitted earlier as km_tfidf)
vectorizer = TfidfVectorizer(ngram_range=(1,2), stop_words='english')

vectorizer.fit(df['headline_lemmatization'])

# Save the vectorizer and KMeans to files

with open('modelstest/bag_of_words_best/vectorizer.pkl', 'wb') as f:
    pickle.dump(vectorizer, f)
    
with open('modelstest/bag_of_words_best/kmeans.pkl', 'wb') as f:
    pickle.dump(km_tfidf, f)

Word2vec saving

In [141]:
# Save the cluster assignments (titles grouped by cluster) to a file
with open('modelstest/word2vec_best/pretrain_word2vec_assign.pkl', 'wb') as f:
    pickle.dump(pretrain_word2vec_assign, f)

# Save the KMeans to files
with open('modelstest/word2vec_best/kmeans_word2vec.pkl', 'wb') as f:
    pickle.dump(km_word2vec, f)



glove saving

In [12]:
# Save the cluster assignments (titles grouped by cluster) to a file
with open('modelstest/glove_best/pretrain_glove_assign.pkl', 'wb') as f:
    pickle.dump(pretrain_glove_assign, f)

# Save the KMeans to files
with open('modelstest/glove_best/kmeans_glove.pkl', 'wb') as f:
    pickle.dump(km_glove, f)



Customized word2vec saving

In [None]:
# Save the cluster assignments (titles grouped by cluster) to a file
with open('modelstest/customized_word2vec_best/customized_skipgram_assign.pkl', 'wb') as f:
    pickle.dump(customized_skipgram_assign, f)

# Save the KMeans to files
with open('modelstest/customized_word2vec_best/kmeans_customized_skipgram.pkl', 'wb') as f:
    pickle.dump(km_skipgram, f)

LSA/SVD

In [26]:
# Save the SVD-reduced document matrix to a file
with open('modelstest/svd_best/svd_lsa_d.pkl', 'wb') as f:
    pickle.dump(lsa, f)
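
The saving cells write into nested folders such as `modelstest/svd_best/`, and `open(..., 'wb')` raises an error if the folder does not exist. The sketch below shows one way to create the folder first and to confirm that a pickled object loads back unchanged. It uses a temporary directory and a made-up dictionary as a stand-in for a fitted model, so the paths and the `artifact` object are assumptions for illustration.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted model or a cluster assignment.
artifact = {"n_clusters": 5, "labels": [0, 2, 0, 1]}

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create the nested output folder before opening a file inside it.
    out_dir = os.path.join(tmp_dir, "modelstest", "example_best")
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, "artifact.pkl")

    with open(path, "wb") as f:
        pickle.dump(artifact, f)

    # Load the object back to verify the round trip.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == artifact)  # True
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.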
