### College of Computing and Informatics, Drexel University
### INFO 323: Cloud Computing and Big Data
### Due: Tuesday, June 13, 2023
---

## Final Project Report

## Project Title: Unsupervised Clustering of Wikipedia Articles

## Student(s): Charlie Gormley; Monish B.

#### Date: Tuesday, June 13, 2023
---

# Abstract

This project performs unsupervised clustering on a large dataset of Wikipedia articles. We use Latent Dirichlet Allocation (LDA), run on Apache Spark, to group the articles by topic.

# Data

The data covers the majority of Wikipedia articles from 2017 to 2018. The full database holds about 4.5 million articles (20 GB). We used a randomly selected subset of 500,000 articles (3 GB) because training models on all 4.5 million articles would require more computational resources than we had available.

# Exploratory Data Analysis
For our EDA, we removed stop words and transformed the text of each article into TF-IDF vectors. We also created an elbow plot to estimate the optimal number of clusters for our dataset. To compute TF-IDF, we first pass each article through a tokenizer, which converts the text into a list of words. We then iterate through those lists and remove any stop words. Next, we vectorize each article by counting the occurrences of each word, and finally we weight the counts by inverse document frequency (IDF). The resulting TF-IDF vectors are the input to LDA clustering.
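
As a rough illustration of the steps above, the following pure-Python sketch tokenizes, removes stop words, counts terms, and applies IDF weighting to a toy corpus. The three-sentence corpus and the tiny stop-word set are made up for illustration, and the smoothed formula `log((N + 1) / (df + 1))` is assumed here to match the one Spark ML's `IDF` applies; the notebook itself relies on Spark's `Tokenizer`, `StopWordsRemover`, `CountVectorizer`, and `IDF` rather than on this code.

```python
import math
import re
from collections import Counter

# Toy corpus and stop-word list (hypothetical, for illustration only)
docs = [
    "The cat sat on the mat",
    "The dog chased the cat",
    "Stock markets fell on Monday",
]
stop_words = {"the", "on"}

def tokenize(text):
    # Lowercase and keep runs of letters/underscores
    return re.findall(r"[a-z_]+", text.lower())

# 1. Tokenize and 2. remove stop words
token_lists = [[t for t in tokenize(d) if t not in stop_words] for d in docs]

# 3. Term counts per document
counts = [Counter(tokens) for tokens in token_lists]

# 4. Document frequency and smoothed IDF: log((N + 1) / (df + 1))
n_docs = len(docs)
doc_freq = Counter(term for c in counts for term in c)
idf = {term: math.log((n_docs + 1) / (df + 1)) for term, df in doc_freq.items()}

# 5. TF-IDF weight = term count * IDF
tfidf = [{term: tf * idf[term] for term, tf in c.items()} for c in counts]

for row in tfidf:
    print({term: round(w, 3) for term, w in row.items()})
```

Terms that appear in more documents (here "cat") receive a smaller weight than terms confined to a single document (here "dog").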

We also used silhouette scores, reported in the notebook below, to estimate the quality of the clustering for a given k-value. The lowest k-value we tried (k = 2) produced the best score.
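
For reference, the silhouette score of a point compares its mean distance to its own cluster (a) with its mean distance to the nearest other cluster (b): s = (b - a) / max(a, b). The sketch below computes the mean silhouette for a handful of made-up 2-D points using squared Euclidean distance, which is assumed here to match the default of Spark's `ClusteringEvaluator`. It illustrates the definition and is not the evaluator's implementation.

```python
def sq_dist(p, q):
    # Squared Euclidean distance between two points
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_silhouette(points, labels):
    scores = []
    for i, (p, label) in enumerate(zip(points, labels)):
        # Collect distances from p to every other point, grouped by cluster
        by_cluster = {}
        for j, (q, other) in enumerate(zip(points, labels)):
            if i != j:
                by_cluster.setdefault(other, []).append(sq_dist(p, q))
        if label not in by_cluster or len(by_cluster) < 2:
            scores.append(0.0)  # singleton cluster or only one cluster present
            continue
        a = sum(by_cluster[label]) / len(by_cluster[label])
        b = min(sum(d) / len(d) for c, d in by_cluster.items() if c != label)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical points forming two tight, well-separated groups
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
good = mean_silhouette(points, [0, 0, 1, 1])  # labels follow the groups
bad = mean_silhouette(points, [0, 1, 0, 1])   # labels cut across the groups
print(good, bad)
```

A labeling that follows the natural groups scores close to 1, while one that cuts across them goes negative, which is how the negative scores in the tuning output should be read.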

# Modeling & Evaluation

For modeling, we tune LDA by fitting it with different values of k and comparing the resulting models by their silhouette scores and through the elbow analysis.

We then pick the best model and run it again to analyze the topics with the *analyze_topics* function.

# **NOTEBOOK**

## 1. SetUp
Importing & Device Setup 

In [1]:
from multiprocessing import Pool
import sqlite3 as sql
import pandas as pd
import numpy as np
import logging
import time
import random
import re
import matplotlib.pyplot as plt
import os
import sys
os.chdir("..")
os.getcwd()

'd:\\INFO323\\TokenizedToast'

## Loading In Datasets 

In [2]:
db = r'Data\enwiki-20170820.db'  # raw string so the backslash is not read as an escape

In [3]:
def get_query(select, db=db):
    '''
    1. Connects to SQLite database (db)
    2. Executes select statement
    3. Return results and column names
    
    Input: 'select * from analytics limit 2'
    Output: ([(1, 2, 3)], ['col_1', 'col_2', 'col_3'])
    '''
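    conn = sql.connect(db)
    try:
        c = conn.cursor()
        c.execute(select)
        col_names = [str(name[0]).lower() for name in c.description]
        rows = c.fetchall()
    finally:
        # `with conn` only manages transactions, so close the connection explicitly
        conn.close()
    return rows, col_names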

In [4]:
def tokenize(text, lower=True):
    '''
    1. Strips apostrophes
    2. Searches for all alpha tokens (exception for underscore)
    3. Return list of tokens

    Input: 'The 3 dogs jumped over Scott's tent!'
    Output: ['the', 'dogs', 'jumped', 'over', 'scotts', 'tent']
    '''
    text = re.sub("'", "", text)
    if lower:
        tokens = re.findall('''[a-z_]+''', text.lower())
    else:
        tokens = re.findall('''[A-Za-z_]+''', text)
    return tokens

In [5]:
def get_article(article_id):
    '''
    1. Construct select statement
    2. Retrieve all section_texts associated with article_id
    3. Join section_texts into a single string (article_text)
    4. Tokenize article_text
    5. Return the tokens joined into a single space-separated string
    
    Input: 100
    Output: 'the austroasiatic languages in ...'
    '''
    select = "SELECT section_text FROM articles WHERE article_id = " + str(article_id)

    # # Execute the query with the article_id as a parameter
    # article = spark.sql(select, article_id).collect()
    docs, _ = get_query(select)
    
    docs = [doc[0] for doc in docs]
    doc = '\n'.join(docs)
    
    tokens = tokenize(doc)
    return ' '.join(tokens)
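
The query in `get_article` is assembled by string concatenation, which is safe here only because `article_id` is an integer. A minimal sketch of the same lookup using sqlite3's `?` placeholder is shown below. The in-memory table, its rows, and the helper name `get_article_text` are hypothetical stand-ins for the real `articles` table and are not part of the notebook.

```python
import sqlite3

def get_article_text(conn, article_id):
    # The ? placeholder lets sqlite3 bind the value instead of splicing it into the SQL
    rows = conn.execute(
        "SELECT section_text FROM articles WHERE article_id = ?", (article_id,)
    ).fetchall()
    return "\n".join(row[0] for row in rows)

# Hypothetical in-memory stand-in for the Wikipedia database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (article_id INTEGER, section_text TEXT)")
conn.executemany(
    "INSERT INTO articles VALUES (?, ?)",
    [(100, "Intro text."), (100, "History text."), (101, "Other article.")],
)

article_100 = get_article_text(conn, 100)
print(article_100)
conn.close()
```

Reusing one connection across calls, as this sketch does, also avoids opening a new connection for each of the 500,000 articles.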

In [6]:
def get_bulk_articles(article_ids):
    corpus = []
    for article_id in article_ids:
        article = get_article(article_id)
        output = (article_id, article)
        corpus.append(output)
    return corpus

In [7]:
select = '''select distinct article_id from articles'''
article_ids, _ = get_query(select)
article_ids = [article_id[0] for article_id in article_ids]

Original pull of the dataset (commented out once the corpus had been pickled).

In [8]:
# num_articles = 500000
# random_article_ids = random.sample(range(1, 4902648), num_articles)
# corpus = get_bulk_articles(random_article_ids)


In [9]:
# import pickle
# with open('corpus-500000.pkl', 'wb') as file:
#     pickle.dump(corpus, file)


In [10]:
import pickle
with open('corpus-500000.pkl', 'rb') as file:
    corpus = pickle.load(file)

#### PreProcessing Text

## Spark Setup 

In [11]:
import findspark
findspark.init()

# Spark Packages
from pyspark.sql import SparkSession
from pyspark.sql.functions import lower, rand, regexp_replace, split, trim, udf
from pyspark.sql.types import ArrayType, DoubleType, IntegerType, StringType
from pyspark.ml import Pipeline
from pyspark.ml.clustering import LDA
from pyspark.ml.evaluation import ClusteringEvaluator
from pyspark.ml.feature import CountVectorizer, IDF, StopWordsRemover, Tokenizer, Word2Vec
from pyspark.ml.linalg import VectorUDT

# Spark NLP
from sparknlp.base import DocumentAssembler
from sparknlp import annotator

# Bert Tokenizer
# from transformers import BertTokenizer, BertModel # Hugging Face Package
# import torch # PyTorch

In [12]:
spark = SparkSession.builder \
    .appName("TokenizedToast")\
    .config("spark.driver.memory","28G")\
    .getOrCreate()

In [13]:

stopwords = StopWordsRemover.loadDefaultStopWords("english")

In [14]:
spark_path = os.environ['SPARK_HOME']

In [15]:
sys.path.insert(0, spark_path + "/bin")
sys.path.insert(0, spark_path + "/python/pyspark/")
sys.path.insert(0, spark_path + "/python/lib/pyspark.zip")
sys.path.insert(0, spark_path + "/python/lib/py4j-0.10.7-src.zip")
os.environ['PYSPARK_DRIVER_PYTHON_OPTS']= "notebook"
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable
os.environ['PYSPARK_PYTHON'] = sys.executable

## Cluster Analysis Techniques

## Tokenization & Vectorization Methods
1. TFIDF

#### Initializing Classes

In [16]:
main_df = spark.createDataFrame(corpus, ['id', 'text'])
main_df = main_df.dropna()

In [17]:
tokenizer = Tokenizer(inputCol='text', outputCol='temp')
stopwords_remover = StopWordsRemover(inputCol='text', outputCol='temp', stopWords=stopwords)
count_vectorizer = CountVectorizer(inputCol='text', outputCol='temp')
idf = IDF(inputCol='text', outputCol='temp')

## Topic Modelling Evaluation

In [18]:
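def calculate_coherence(lda_model, df):
    # Despite the name, this returns the log-likelihood and log-perplexity of df
    # under the model; neither value is a topic-coherence measure.
    return lda_model.logLikelihood(df), lda_model.logPerplexity(df)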

def tuning(df, tokenizer, stopwords_remover, count_vectorizer, idf, max_k):
    coherence_scores = []
    perplexity_scores = []
    results = list()

    df = tokenizer.transform(df)
    df = df.drop('text').withColumnRenamed('temp', 'text')

    df = stopwords_remover.transform(df)
    df = df.drop('text').withColumnRenamed('temp', 'text')

    model_cv = count_vectorizer.fit(df)
    df = model_cv.transform(df)
    df = df.drop('text').withColumnRenamed('temp', 'text')

    model_idf = idf.fit(df)
    df = model_idf.transform(df)
    df = df.drop('text').withColumnRenamed('temp', 'text')

    for k in range(2, max_k + 1, 5):
        cur_result = dict()
        lda = LDA(k=k, maxIter=5, featuresCol='text')
        model = lda.fit(df)
        coherence, perplexity = calculate_coherence(model, df)
        coherence_scores.append(coherence)
        perplexity_scores.append(perplexity)

        sample_size = 100
        sampled_articles_df = df.orderBy(rand()).limit(sample_size)
        sampled_articles_df = model.transform(sampled_articles_df).select("id", "text", "topicDistribution")

        get_max_topic_udf = udf(lambda vector: int(vector.argmax()), IntegerType())
        sampled_articles_df = sampled_articles_df.withColumn("maxTopicIndex", get_max_topic_udf("topicDistribution"))

        # Evaluate the clustering results
        evaluator = ClusteringEvaluator(predictionCol="maxTopicIndex", featuresCol="text")

        # Calculate silhouette coefficient
        silhouette = evaluator.evaluate(sampled_articles_df)
        print("Silhouette coefficient: ", silhouette)

        cur_result['silhouette'] = silhouette
        cur_result['coherence'] = coherence
        cur_result['perplexity'] = perplexity
        cur_result['k'] = k
        results.append(cur_result)
    print(results)

In [19]:
tuning(main_df, tokenizer, stopwords_remover, count_vectorizer, idf, 50)

Silhouette coefficient:  0.06498282576481665
Silhouette coefficient:  -0.6644439268936647
Silhouette coefficient:  -0.6554911981420182
Silhouette coefficient:  -0.5839173521105573
Silhouette coefficient:  -0.6947313540517591
Silhouette coefficient:  -0.5926663297804388
Silhouette coefficient:  -0.6664091380762678
Silhouette coefficient:  -0.5638120480353167
Silhouette coefficient:  -0.5667703468223168
Silhouette coefficient:  -0.5310403825602856
[{'silhouette': 0.06498282576481665, 'coherence': -6397237.184257759, 'perplexity': 10.290046065957823, 'k': 2}, {'silhouette': -0.6644439268936647, 'coherence': -6539463.403415318, 'perplexity': 10.518818941617287, 'k': 7}, {'silhouette': -0.6554911981420182, 'coherence': -6792370.124809612, 'perplexity': 10.925622963194234, 'k': 12}, {'silhouette': -0.5839173521105573, 'coherence': -7114858.841212765, 'perplexity': 11.444350603261416, 'k': 17}, {'silhouette': -0.6947313540517591, 'coherence': -7472882.418228105, 'perplexity': 12.0202365668540
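
The `coherence` values printed above come from `logLikelihood` and the `perplexity` values from `logPerplexity`. Assuming Spark's LDA defines log-perplexity as the negative log-likelihood divided by the total token weight of the evaluated corpus, the ratio of the two should be the same for every k. The sketch below checks this using the first two result rows copied from the output; the variable names are illustrative.

```python
# First two result rows copied from the printed tuning output
results = [
    {"k": 2, "coherence": -6397237.184257759, "perplexity": 10.290046065957823},
    {"k": 7, "coherence": -6539463.403415318, "perplexity": 10.518818941617287},
]

# Implied token weight = -logLikelihood / logPerplexity (assumed relationship)
implied_weights = [-r["coherence"] / r["perplexity"] for r in results]

for r, w in zip(results, implied_weights):
    print(f"k={r['k']}: implied corpus token weight = {w:,.1f}")
```

If the implied weights agree across rows, the two columns are rescalings of one another rather than independent measures of model quality.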

### Obtaining the model for the best k-value
The best k-value is 2, the only one with a positive silhouette score (about 0.065). Because a two-topic model may be too coarse to be informative, we also fit the model with the second-highest silhouette score, which surprisingly belongs to the largest k-value we tested: 47.

Note that we did not save the models inside the tuning loop, in order to conserve RAM and storage. Refitting the two selected models is much less expensive.
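
The selection described above can be reproduced from the printed silhouette scores. The sketch below pairs each score with its k-value (the tuning loop steps through `range(2, max_k + 1, 5)` with `max_k = 50`) and sorts them. The scores are copied from the tuning output, and the variable names are illustrative.

```python
# k-values visited by range(2, max_k + 1, 5) with max_k = 50
k_values = list(range(2, 51, 5))

# Silhouette coefficients copied from the tuning output, in the same order
silhouettes = [
    0.06498282576481665, -0.6644439268936647, -0.6554911981420182,
    -0.5839173521105573, -0.6947313540517591, -0.5926663297804388,
    -0.6664091380762678, -0.5638120480353167, -0.5667703468223168,
    -0.5310403825602856,
]

# Rank the k-values by silhouette score, highest first
ranked = sorted(zip(k_values, silhouettes), key=lambda pair: pair[1], reverse=True)
best_k, second_k = ranked[0][0], ranked[1][0]
print(best_k, second_k)
```

The step size of 5 also explains why the largest k-value evaluated is 47 rather than 50.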

In [22]:


df = tokenizer.transform(main_df)
df = df.drop('text').withColumnRenamed('temp', 'text')

df = stopwords_remover.transform(df)
df = df.drop('text').withColumnRenamed('temp', 'text')

model_cv = count_vectorizer.fit(df)
df = model_cv.transform(df)
df = df.drop('text').withColumnRenamed('temp', 'text')

model_idf = idf.fit(df)
df = model_idf.transform(df)
df = df.drop('text').withColumnRenamed('temp', 'text')

lda = LDA(k=2, maxIter=5, featuresCol='text')
model_k2 = lda.fit(df)

lda = LDA(k=47, maxIter=5, featuresCol='text')
model_k47 = lda.fit(df)

## Visualizations

In [26]:
from pyspark.sql.functions import col
def analyze_topics(lda_model, df):
    transformed = lda_model.transform(df)
    topics = transformed.select('id', 'topicDistribution').rdd.map(lambda row: (row[0], row[1].tolist())).toDF(['id', 'topics'])
    df_with_topics = df.join(topics, on='id')

    # Assign each document its dominant topic (the index of the largest topic weight)
    dominant_topic_udf = udf(lambda topics: int(np.argmax(topics)), IntegerType())
    df_with_topics = df_with_topics.withColumn('dominant_topic', dominant_topic_udf(col('topics')))
    df_with_topics.show()

In [27]:
analyze_topics(model_k2, df)

+-------+--------------------+--------------------+--------------+
|     id|                text|              topics|dominant_topic|
+-------+--------------------+--------------------+--------------+
|2130509|(30521,[6,10,12,2...|[0.99947665275049...|             0|
|2346170|(30521,[4,17,21,2...|[0.75411475097072...|             0|
|3456671|(30521,[2,12,19,3...|[0.99241565439553...|             0|
|4512412|(30521,[45,71,82,...|[0.35551641805267...|             1|
|3067872|(30521,[42,66,81,...|[0.65504208781328...|             0|
|1949618|(30521,[2,4,7,8,1...|[0.56952636551823...|             0|
|2865107|(30521,[0,2,5,22,...|[0.54015024437450...|             0|
|1170827|(30521,[1,2,3,4,5...|[0.99991771923471...|             0|
| 403062|(30521,[30,32,37,...|[0.38795922793309...|             1|
|1491162|(30521,[4,6,9,32,...|[0.00316051490843...|             1|
|4193714|(30521,[5,11,22,2...|[0.33843809587893...|             1|
|4885976|(30521,[5,11,12,1...|[0.99784490886379...|           

In [28]:
analyze_topics(model_k47, df)

+-------+--------------------+--------------------+--------------+
|     id|                text|              topics|dominant_topic|
+-------+--------------------+--------------------+--------------+
|2130509|(30521,[6,10,12,2...|[0.99904220574373...|             0|
|2346170|(30521,[4,17,21,2...|[2.81598070782669...|            40|
|3456671|(30521,[2,12,19,3...|[0.56144770489104...|             0|
|4512412|(30521,[45,71,82,...|[1.10368789670043...|            21|
|3067872|(30521,[42,66,81,...|[9.51289437854830...|             6|
|1949618|(30521,[2,4,7,8,1...|[5.09487263239038...|             6|
|2865107|(30521,[0,2,5,22,...|[1.49887802795905...|             1|
|1170827|(30521,[1,2,3,4,5...|[1.42543850019460...|            40|
| 403062|(30521,[30,32,37,...|[1.16062840638495...|            40|
|1491162|(30521,[4,6,9,32,...|[2.29475779028124...|            33|
|4193714|(30521,[5,11,22,2...|[7.08835760238747...|             1|
|4885976|(30521,[5,11,12,1...|[8.82723722686408...|           
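
One simple way to summarize a table like this is to count how many documents fall under each dominant topic. The sketch below does so for the eleven complete rows visible in the k = 47 output above (the last row is cut off and is left out). It uses plain Python rather than Spark, and the variable names are illustrative.

```python
from collections import Counter

# dominant_topic values from the eleven complete rows of the k=47 output
dominant_topics = [0, 40, 0, 21, 6, 6, 1, 40, 40, 33, 1]

# Count documents per dominant topic, most frequent first
topic_counts = Counter(dominant_topics)
for topic, n_docs in topic_counts.most_common():
    print(f"topic {topic}: {n_docs} document(s)")
```

On the full DataFrame the same summary could be obtained by grouping on the `dominant_topic` column, which would show whether a few topics absorb most of the articles.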

# Conclusion
In this project we used Apache Spark to cluster a subset of Wikipedia articles with LDA, compared models across a range of k-values, and examined the topic assignments of the two best-scoring models (k = 2 and k = 47).

# References 
*None*

# Project Requirements

This final project examines the level of knowledge the students have learned from the course. Course outcomes include querying and exploring data using higher-level tools built on top of a cloud computing platform, applying practical tools for processing massive data sets, and building scalable big data analytical and predictive models. 

**Marking will be focused on both presentation and content.**

## Written Presentation Requirements
The report will be judged on the basis of visual appearance, grammatical correctness, and quality of writing, as well as its contents. Please make sure that the text of your report is well-structured, using paragraphs, full sentences, and other features of well-written presentation.

## Technical Content of the Entire Project:
* Is the problem well defined and described thoroughly?
* Is the size and complexity of the data set used in this project commensurate with the course?
* Does the project use cloud computing techniques for exploratory data analysis?
* Does the project use cloud computing techniques for building analytical and predictive models?
* Does the project cover the key data science activities including data cleaning, data wrangling, visualization, model selection, feature engineering, and model evaluation?
* Does the report present the findings well and make clear conclusions?
* Overall, what is the rating of this project?