## Topic Modeling for Research Articles

Dataset Source: https://www.kaggle.com/datasets/shivanandmn/multilabel-classification-dataset

##### Import Necessary Libraries

In [0]:
import re

import pyspark

import pyspark.sql.functions as F
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, ArrayType
from pyspark.ml import Pipeline

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.pretrained import PretrainedPipeline

#### Define Functions Used Throughout This Project

##### Create Function to Ingest Data

In [0]:
def ingest_data(file_location: str, 
                schema: StructType, 
                delimiter: str = ',' 
               ) -> pyspark.sql.dataframe.DataFrame:
    '''
    This function reads in the dataset that is passed to it
    and fits the schema that is passed in to the dataset.
    '''
    file_type = "csv"
    infer_schema = "false"
    first_row_is_header = "true"
    
    dataset = spark.read.format(file_type) \
        .option("inferSchema", infer_schema) \
        .option("header", first_row_is_header) \
        .option("sep", delimiter) \
        .schema(schema)\
        .csv(file_location)
    
    return dataset

##### Define Function to Evaluate Model & Display Metrics

In [0]:
def evaluate_multilabel_model(dataset: pyspark.sql.dataframe.DataFrame, 
                              metrics: [str], 
                              model_name: str 
                             ) -> None:
    '''
    This function calculates & displays metrics for a multilabel 
    classification analysis.
    '''
    from pyspark.ml.evaluation import MultilabelClassificationEvaluator
    
    print("+---------------------------------------------+")
    print("|  " + model_name.center(41) + "  |")
    print("+---------------------------------------------+")
    print("|   %s  |  %s   |" % ("Metric".rjust(20), "Value".ljust(14)))
    print("+---------------------------------------------+")
    
    for x in metrics:
        evaluator = MultilabelClassificationEvaluator(labelCol="label", \
                                                      predictionCol="prediction", \
                                                      metricName=x) 
        score = evaluator.evaluate(dataset)
        print("|   %s  |  %s   |" % (x.rjust(20), str(round(score, 6)).ljust(14)))
        print("+---------------------------------------------+")

#### Ingest & Preprocess Data

##### Ingest Dataset

In [0]:
data_file = "/FileStore/tables/multilabel_train_ready.txt"

orig_schema = StructType([
         StructField('id', StringType(), True),
         StructField('title', StringType(), True),
         StructField('abstract', StringType(), True),
         StructField('Computer Science', IntegerType(), True),
         StructField('Physics', IntegerType(), True),
         StructField('Mathematics', IntegerType(), True),
         StructField('Statistics', IntegerType(), True),
         StructField('Quantitative Biology', IntegerType(), True),
         StructField('Quantitative Finance', IntegerType(), True)
         ])

df = ingest_data(data_file, orig_schema, delimiter=':::')

df = df.na.drop(how='all')

df = df.drop("id", "Quantitative Finance") 
# For some reason, the whole "Quantitative Finance" column loaded as 'null' values

df = df.na.drop(how='all', 
                        subset=["Computer Science", 
                        "Physics", 
                        "Mathematics", 
                        "Statistics", 
                        "Quantitative Biology"
                        ])

df = df.dropDuplicates()

display(df)

title,abstract,Computer Science,Physics,Mathematics,Statistics,Quantitative Biology
An information model for modular robots: the Hardware Robot Information Model (HRIM),"Today's landscape of robotics is dominated by vertical integration where single vendors develop the final product leading to slow progress, expensive products and customer lock-in. Opposite to this, an horizontal integration would result in a rapid development of cost-effective mass-market products with an additional consumer empowerment. The transition of an industry from vertical integration to horizontal integration is typically catalysed by de facto industry standards that enable a simplified and seamless integration of products. However, in robotics there is currently no leading candidate for a global plug-and-play standard. This paper tackles the problem of incompatibility between robot components that hinder the reconfigurability and flexibility demanded by the robotics industry. Particularly, it presents a model to create plug-and-play robot hardware components. Rather than iteratively evolving previous ontologies, our proposed model answers the needs identified by the industry while facilitating interoperability, measurability and comparability of robotics technology. Our approach differs significantly with the ones presented before as it is hardware-oriented and establishes a clear set of actions towards the integration of this model in real environments and with real manufacturers.",1.0,0.0,0.0,0.0,0.0
Learning Graph Representations by Dendrograms,"Hierarchical graph clustering is a common technique to reveal the multi-scale structure of complex networks. We propose a novel metric for assessing the quality of a hierarchical clustering. This metric reflects the ability to reconstruct the graph from the dendrogram, which encodes the hierarchy. The optimal representation of the graph defines a class of reducible linkages leading to regular dendrograms by greedy agglomerative clustering.",1.0,0.0,0.0,1.0,0.0
A New Achievable Rate Region for Multiple-Access Channel with States,"The problem of reliable communication over the multiple-access channel (MAC) with states is investigated. We propose a new coding scheme for this problem which uses quasi-group codes (QGC). We derive a new computable single-letter characterization of the achievable rate region. As an example, we investigate the problem of doubly-dirty MAC with modulo-$4$ addition. It is shown that the sum-rate $R_1+R_2=1$ bits per channel use is achievable using the new scheme. Whereas, the natural extension of the Gel'fand-Pinsker scheme, sum-rates greater than $0.32$ are not achievable.",1.0,0.0,0.0,0.0,0.0
Moduli Spaces of Unordered $n\ge5$ Points on the Riemann Sphere and Their Singularities,"For $n\ge5$, it is well known that the moduli space $\mathfrak{M_{0,\:n}}$ of unordered $n$ points on the Riemann sphere is a quotient space of the Zariski open set $K_n$ of $\mathbb C^{n-3}$ by an $S_n$ action. The stabilizers of this $S_n$ action at certain points of this Zariski open set $K_n$ correspond to the groups fixing the sets of $n$ points on the Riemann sphere. Let $\alpha$ be a subset of $n$ distinct points on the Riemann sphere. We call the group of all linear fractional transformations leaving $\alpha$ invariant the stabilizer of $\alpha$, which is finite by observation. For each non-trivial finite subgroup $G$ of the group ${\rm PSL}(2,{\Bbb C})$ of linear fractional transformations, we give the necessary and sufficient condition for finite subsets of the Riemann sphere under which the stabilizers of them are conjugate to $G$. We also prove that there does exist some finite subset of the Riemann sphere whose stabilizer coincides with $G$. Next we obtain the irreducible decompositions of the representations of the stabilizers on the tangent spaces at the singularities of $\mathfrak{M_{0,\:n}}$. At last, on $\mathfrak{M_{0,\:5}}$ and $\mathfrak{M_{0,\:6}}$, we work out explicitly the singularities and the representations of their stabilizers on the tangent spaces at them.",0.0,0.0,1.0,0.0,0.0
Riemannian stochastic variance reduced gradient,"Stochastic variance reduction algorithms have recently become popular for minimizing the average of a large but finite number of loss functions. In this paper, we propose a novel Riemannian extension of the Euclidean stochastic variance reduced gradient algorithm (R-SVRG) to a manifold search space. The key challenges of averaging, adding, and subtracting multiple gradients are addressed with retraction and vector transport. We present a global convergence analysis of the proposed algorithm with a decay step size and a local convergence rate analysis under a fixed step size under some natural assumptions. The proposed algorithm is applied to problems on the Grassmann manifold, such as principal component analysis, low-rank matrix completion, and computation of the Karcher mean of subspaces, and outperforms the standard Riemannian stochastic gradient descent algorithm in each case.",1.0,0.0,1.0,1.0,0.0
Scholars on Twitter: who and how many are they?,"In this paper we present a novel methodology for identifying scholars with a Twitter account. By combining bibliometric data from Web of Science and Twitter users identified by Altmetric.com we have obtained the largest set of individual scholars matched with Twitter users made so far. Our methodology consists of a combination of matching algorithms, considering different linguistic elements of both author names and Twitter names; followed by a rule-based scoring system that weights the common occurrence of several elements related with the names, individual elements and activities of both Twitter users and scholars matched. Our results indicate that about 2% of the overall population of scholars in the Web of Science is active on Twitter. By domain we find a strong presence of researchers from the Social Sciences and the Humanities. Natural Sciences is the domain with the lowest level of scholars on Twitter. Researchers on Twitter also tend to be younger than those that are not on Twitter. As this is a bibliometric-based approach, it is important to highlight the reliance of the method on the number of publications produced and tweeted by the scholars, thus the share of scholars on Twitter ranges between 1% and 5% depending on their level of productivity. Further research is suggested in order to improve and expand the methodology.",1.0,0.0,0.0,0.0,0.0
Deformable Generator Network: Unsupervised Disentanglement of Appearance and Geometry,"We propose a deformable generator model to disentangle the appearance and geometric information from images into two independent latent vectors. The appearance generator produces the appearance information, including color, illumination, identity or category, of an image. The geometric generator produces displacement of the coordinates of each pixel and performs geometric warping, such as stretching and rotation, on the appearance generator to obtain the final synthesized image. The proposed model can learn both representations from image data in an unsupervised manner. The learned geometric generator can be conveniently transferred to the other image datasets to facilitate downstream AI tasks.",0.0,0.0,0.0,1.0,0.0
Radio variability and non-thermal components in stars evolving toward planetary nebulae,"""We present new JVLA multi-frequency measurements of a set of stars in transition from the post-AGB to the Planetary Nebula phase monitored in the radio range over several years. Clear variability is found for five sources. Their light curves show increasing and decreasing patterns. New radio observations at high angular resolution are also presented for two sources. Among these is IRAS 18062+2410, whose radio structure is compared to near-infrared images available in the literature. With these new maps, we can estimate inner and outer radii of 0.03$""""$ and 0.08$""""$ for the ionised shell, an ionised mass of $3.2\times10^{-4}$ M$_\odot$, and a density at the inner radius of $7.7\times 10^{-5}$ cm$^{-3}$, obtained by modelling the radio shell with the new morphological constraints. The combination of multi-frequency data and, where available, spectral-index maps leads to the detection of spectral indices not due to thermal emission, contrary to what one would expect in planetary nebulae. Our results allow us to hypothesise the existence of a link between radio variability and non-thermal emission mechanisms in the nebulae. This link seems to hold for IRAS 22568+6141 and may generally hold for those nebulae where the radio flux decreases over time.""",0.0,1.0,0.0,0.0,0.0
Hidden long evolutionary memory in a model biochemical network,"We introduce a minimal model for the evolution of functional protein-interaction networks using a sequence-based mutational algorithm, and apply the model to study neutral drift in networks that yield oscillatory dynamics. Starting with a functional core module, random evolutionary drift increases network complexity even in the absence of specific selective pressures. Surprisingly, we uncover a hidden order in sequence space that gives rise to long-term evolutionary memory, implying strong constraints on network evolution due to the topology of accessible sequence space.",0.0,1.0,0.0,0.0,0.0
From Natural to Artificial Camouflage: Components and Systems,"We identify the components of bio-inspired artificial camouflage systems including actuation, sensing, and distributed computation. After summarizing recent results in understanding the physiology and system-level performance of a variety of biological systems, we describe computational algorithms that can generate similar patterns and have the potential for distributed implementation. We find that the existing body of work predominately treats component technology in an isolated manner that precludes a material-like implementation that is scale-free and robust. We conclude with open research challenges towards the realization of integrated camouflage solutions.",1.0,0.0,0.0,0.0,1.0


##### Convert Label Columns into Single ArrayType Column For Classifier

In [0]:
label_cols = ['Computer Science', 'Physics', 'Mathematics', 'Statistics', 'Quantitative Biology']

for x in label_cols:
    df = df.withColumn(x, F.when(F.col(x)==1, F.lit(x)).otherwise(F.concat(F.lit(x), F.lit("Not"))))

df = df.withColumn("labels", F.array(
                    df['Computer Science'], 
                    df['Physics'], 
                    df['Mathematics'], 
                    df['Statistics'], 
                    df['Quantitative Biology'], 
                    ).cast(ArrayType(StringType())))

df = df.drop('Computer Science', 
                    'Physics', 
                    'Mathematics', 
                    'Statistics', 
                    'Quantitative Biology')

df = df.withColumn("text", F.concat(F.col("title"), F.lit(". "), F.col("abstract"))).drop("title", "abstract")

display(df)

labels,text
"List(Computer Science, PhysicsNot, MathematicsNot, StatisticsNot, Quantitative BiologyNot)","An information model for modular robots: the Hardware Robot Information Model (HRIM). Today's landscape of robotics is dominated by vertical integration where single vendors develop the final product leading to slow progress, expensive products and customer lock-in. Opposite to this, an horizontal integration would result in a rapid development of cost-effective mass-market products with an additional consumer empowerment. The transition of an industry from vertical integration to horizontal integration is typically catalysed by de facto industry standards that enable a simplified and seamless integration of products. However, in robotics there is currently no leading candidate for a global plug-and-play standard. This paper tackles the problem of incompatibility between robot components that hinder the reconfigurability and flexibility demanded by the robotics industry. Particularly, it presents a model to create plug-and-play robot hardware components. Rather than iteratively evolving previous ontologies, our proposed model answers the needs identified by the industry while facilitating interoperability, measurability and comparability of robotics technology. Our approach differs significantly with the ones presented before as it is hardware-oriented and establishes a clear set of actions towards the integration of this model in real environments and with real manufacturers."
"List(Computer Science, PhysicsNot, MathematicsNot, Statistics, Quantitative BiologyNot)","Learning Graph Representations by Dendrograms. Hierarchical graph clustering is a common technique to reveal the multi-scale structure of complex networks. We propose a novel metric for assessing the quality of a hierarchical clustering. This metric reflects the ability to reconstruct the graph from the dendrogram, which encodes the hierarchy. The optimal representation of the graph defines a class of reducible linkages leading to regular dendrograms by greedy agglomerative clustering."
"List(Computer Science, PhysicsNot, MathematicsNot, StatisticsNot, Quantitative BiologyNot)","A New Achievable Rate Region for Multiple-Access Channel with States. The problem of reliable communication over the multiple-access channel (MAC) with states is investigated. We propose a new coding scheme for this problem which uses quasi-group codes (QGC). We derive a new computable single-letter characterization of the achievable rate region. As an example, we investigate the problem of doubly-dirty MAC with modulo-$4$ addition. It is shown that the sum-rate $R_1+R_2=1$ bits per channel use is achievable using the new scheme. Whereas, the natural extension of the Gel'fand-Pinsker scheme, sum-rates greater than $0.32$ are not achievable."
"List(Computer ScienceNot, PhysicsNot, Mathematics, StatisticsNot, Quantitative BiologyNot)","Moduli Spaces of Unordered $n\ge5$ Points on the Riemann Sphere and Their Singularities. For $n\ge5$, it is well known that the moduli space $\mathfrak{M_{0,\:n}}$ of unordered $n$ points on the Riemann sphere is a quotient space of the Zariski open set $K_n$ of $\mathbb C^{n-3}$ by an $S_n$ action. The stabilizers of this $S_n$ action at certain points of this Zariski open set $K_n$ correspond to the groups fixing the sets of $n$ points on the Riemann sphere. Let $\alpha$ be a subset of $n$ distinct points on the Riemann sphere. We call the group of all linear fractional transformations leaving $\alpha$ invariant the stabilizer of $\alpha$, which is finite by observation. For each non-trivial finite subgroup $G$ of the group ${\rm PSL}(2,{\Bbb C})$ of linear fractional transformations, we give the necessary and sufficient condition for finite subsets of the Riemann sphere under which the stabilizers of them are conjugate to $G$. We also prove that there does exist some finite subset of the Riemann sphere whose stabilizer coincides with $G$. Next we obtain the irreducible decompositions of the representations of the stabilizers on the tangent spaces at the singularities of $\mathfrak{M_{0,\:n}}$. At last, on $\mathfrak{M_{0,\:5}}$ and $\mathfrak{M_{0,\:6}}$, we work out explicitly the singularities and the representations of their stabilizers on the tangent spaces at them."
"List(Computer Science, PhysicsNot, Mathematics, Statistics, Quantitative BiologyNot)","Riemannian stochastic variance reduced gradient. Stochastic variance reduction algorithms have recently become popular for minimizing the average of a large but finite number of loss functions. In this paper, we propose a novel Riemannian extension of the Euclidean stochastic variance reduced gradient algorithm (R-SVRG) to a manifold search space. The key challenges of averaging, adding, and subtracting multiple gradients are addressed with retraction and vector transport. We present a global convergence analysis of the proposed algorithm with a decay step size and a local convergence rate analysis under a fixed step size under some natural assumptions. The proposed algorithm is applied to problems on the Grassmann manifold, such as principal component analysis, low-rank matrix completion, and computation of the Karcher mean of subspaces, and outperforms the standard Riemannian stochastic gradient descent algorithm in each case."
"List(Computer Science, PhysicsNot, MathematicsNot, StatisticsNot, Quantitative BiologyNot)","Scholars on Twitter: who and how many are they?. In this paper we present a novel methodology for identifying scholars with a Twitter account. By combining bibliometric data from Web of Science and Twitter users identified by Altmetric.com we have obtained the largest set of individual scholars matched with Twitter users made so far. Our methodology consists of a combination of matching algorithms, considering different linguistic elements of both author names and Twitter names; followed by a rule-based scoring system that weights the common occurrence of several elements related with the names, individual elements and activities of both Twitter users and scholars matched. Our results indicate that about 2% of the overall population of scholars in the Web of Science is active on Twitter. By domain we find a strong presence of researchers from the Social Sciences and the Humanities. Natural Sciences is the domain with the lowest level of scholars on Twitter. Researchers on Twitter also tend to be younger than those that are not on Twitter. As this is a bibliometric-based approach, it is important to highlight the reliance of the method on the number of publications produced and tweeted by the scholars, thus the share of scholars on Twitter ranges between 1% and 5% depending on their level of productivity. Further research is suggested in order to improve and expand the methodology."
"List(Computer ScienceNot, PhysicsNot, MathematicsNot, Statistics, Quantitative BiologyNot)","Deformable Generator Network: Unsupervised Disentanglement of Appearance and Geometry. We propose a deformable generator model to disentangle the appearance and geometric information from images into two independent latent vectors. The appearance generator produces the appearance information, including color, illumination, identity or category, of an image. The geometric generator produces displacement of the coordinates of each pixel and performs geometric warping, such as stretching and rotation, on the appearance generator to obtain the final synthesized image. The proposed model can learn both representations from image data in an unsupervised manner. The learned geometric generator can be conveniently transferred to the other image datasets to facilitate downstream AI tasks."
"List(Computer ScienceNot, Physics, MathematicsNot, StatisticsNot, Quantitative BiologyNot)","Radio variability and non-thermal components in stars evolving toward planetary nebulae. ""We present new JVLA multi-frequency measurements of a set of stars in transition from the post-AGB to the Planetary Nebula phase monitored in the radio range over several years. Clear variability is found for five sources. Their light curves show increasing and decreasing patterns. New radio observations at high angular resolution are also presented for two sources. Among these is IRAS 18062+2410, whose radio structure is compared to near-infrared images available in the literature. With these new maps, we can estimate inner and outer radii of 0.03$""""$ and 0.08$""""$ for the ionised shell, an ionised mass of $3.2\times10^{-4}$ M$_\odot$, and a density at the inner radius of $7.7\times 10^{-5}$ cm$^{-3}$, obtained by modelling the radio shell with the new morphological constraints. The combination of multi-frequency data and, where available, spectral-index maps leads to the detection of spectral indices not due to thermal emission, contrary to what one would expect in planetary nebulae. Our results allow us to hypothesise the existence of a link between radio variability and non-thermal emission mechanisms in the nebulae. This link seems to hold for IRAS 22568+6141 and may generally hold for those nebulae where the radio flux decreases over time."""
"List(Computer ScienceNot, Physics, MathematicsNot, StatisticsNot, Quantitative BiologyNot)","Hidden long evolutionary memory in a model biochemical network. We introduce a minimal model for the evolution of functional protein-interaction networks using a sequence-based mutational algorithm, and apply the model to study neutral drift in networks that yield oscillatory dynamics. Starting with a functional core module, random evolutionary drift increases network complexity even in the absence of specific selective pressures. Surprisingly, we uncover a hidden order in sequence space that gives rise to long-term evolutionary memory, implying strong constraints on network evolution due to the topology of accessible sequence space."
"List(Computer Science, PhysicsNot, MathematicsNot, StatisticsNot, Quantitative Biology)","From Natural to Artificial Camouflage: Components and Systems. We identify the components of bio-inspired artificial camouflage systems including actuation, sensing, and distributed computation. After summarizing recent results in understanding the physiology and system-level performance of a variety of biological systems, we describe computational algorithms that can generate similar patterns and have the potential for distributed implementation. We find that the existing body of work predominately treats component technology in an isolated manner that precludes a material-like implementation that is scale-free and robust. We conclude with open research challenges towards the realization of integrated camouflage solutions."


##### Split Dataset into Training & Testing Datasets

In [0]:
train_ds, test_ds = df.randomSplit([0.80, 0.20], seed=42)

train_ds = train_ds.persist()
test_ds = test_ds.persist()

print(f"There are {train_ds.count()} samples in the training dataset.")
print(f"There are {test_ds.count()} samples in the testing dataset.")

There are 16694 samples in the training dataset.
There are 4264 samples in the testing dataset.


##### Build Pipeline Stages

In [0]:
# document assembler
doc = DocumentAssembler() \
        .setInputCol("text") \
        .setOutputCol("document") \
        .setCleanupMode("shrink")

# Universal Sentence Encoder
use = UniversalSentenceEncoder.pretrained() \
        .setInputCols(["document"]) \
        .setOutputCol("sentences")

# Sentiment Analysis Deep Learning Classifier
ml_clf = MultiClassifierDLApproach() \
        .setInputCols("sentences") \
        .setOutputCol("class") \
        .setLabelColumn("labels") \
        .setBatchSize(128) \
        .setMaxEpochs(10) \
        .setLr(1e-3) \
        .setThreshold(0.5) \
        .setShufflePerEpoch(False) \
        .setEnableOutputLogs(True) \
        .setValidationSplit(0.2) \
        .setEnableOutputLogs(True)

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][ — ][OK!]


##### Build Pipeline

In [0]:
ml_pipe = Pipeline().setStages([
    doc,
    use,
    ml_clf
])

##### Fit/Train Model

In [0]:
ml_model = ml_pipe.fit(train_ds)

##### Generate Predictions Using Testing Dataset

In [0]:
predictions = ml_model.transform(test_ds)

##### Prepare Predictions for Metrics Evaluation (Part 1)

In [0]:
preds = predictions.select(F.col('labels').alias("label"),
                                F.col('class.result').cast(ArrayType(StringType())).alias("prediction"))

train_ds = train_ds.unpersist()
test_ds = test_ds.unpersist()

preds = preds.persist()

preds = preds.withColumn("label", F.array_sort(F.col("label")).cast(ArrayType(StringType()))) \
            .withColumn("prediction", F.array_sort(F.col("prediction").cast(ArrayType(StringType()))))

display(preds)

label,prediction
"List(Computer Science, Mathematics, PhysicsNot, Quantitative BiologyNot, Statistics)","List(Computer Science, Mathematics, PhysicsNot, Quantitative BiologyNot, Statistics)"
"List(Computer Science, MathematicsNot, PhysicsNot, Quantitative BiologyNot, Statistics)","List(Computer ScienceNot, MathematicsNot, PhysicsNot, Quantitative BiologyNot, Statistics)"
"List(Computer Science, MathematicsNot, PhysicsNot, Quantitative BiologyNot, Statistics)","List(Computer Science, MathematicsNot, PhysicsNot, Quantitative BiologyNot, StatisticsNot)"
"List(Computer Science, MathematicsNot, PhysicsNot, Quantitative BiologyNot, Statistics)","List(Computer Science, MathematicsNot, PhysicsNot, Quantitative BiologyNot, StatisticsNot)"
"List(Computer Science, MathematicsNot, PhysicsNot, Quantitative BiologyNot, Statistics)","List(Computer Science, MathematicsNot, PhysicsNot, Quantitative BiologyNot, StatisticsNot)"
"List(Computer Science, MathematicsNot, PhysicsNot, Quantitative Biology, StatisticsNot)","List(Computer Science, MathematicsNot, PhysicsNot, Quantitative BiologyNot, Statistics)"
"List(Computer Science, MathematicsNot, PhysicsNot, Quantitative BiologyNot, StatisticsNot)","List(Computer Science, MathematicsNot, PhysicsNot, Quantitative BiologyNot, StatisticsNot)"
"List(Computer Science, MathematicsNot, PhysicsNot, Quantitative BiologyNot, StatisticsNot)","List(Computer Science, MathematicsNot, PhysicsNot, Quantitative BiologyNot, StatisticsNot)"
"List(Computer Science, MathematicsNot, PhysicsNot, Quantitative BiologyNot, StatisticsNot)","List(Computer Science, MathematicsNot, PhysicsNot, Quantitative BiologyNot, StatisticsNot)"
"List(Computer Science, MathematicsNot, PhysicsNot, Quantitative BiologyNot, StatisticsNot)","List(Computer ScienceNot, MathematicsNot, Physics, Quantitative BiologyNot, StatisticsNot)"


##### Prepare Predictions for Metrics Evaluation (Part 2)

In [0]:
convert_to_zero = {"Computer ScienceNot" : "0",
                    "MathematicsNot" : "0",
                    "PhysicsNot" : "0",
                    "Quantitative BiologyNot" : "0",
                    "StatisticsNot" : "0"}

convert_to_one = {"Computer Science" : "1",
                "Mathematics" : "1",
                "Physics" : "1",
                "Quantitative Biology" : "1",
                "Statistics" : "1"}

### For the 'label' Column

# Convert values in 'label' column that end with "Not" to "0"
def replace_with_zeros(x):
    return [convert_to_zero.get(i,i) for i in x]
zero_converter = F.udf(replace_with_zeros)
preds = preds.withColumn("label", zero_converter(F.col("label")))


# Convert values in 'label' column that do not end with "Not" to "1"
def replace_with_ones(x):
    return [convert_to_one.get(i,i) for i in x]
one_converter = F.udf(replace_with_ones)
preds = preds.withColumn("label", one_converter(F.col("label")))

### For the 'prediction' Column

# Convert values in 'prediction' column that end with "Not" to "0"
def replace_with_zeros(x):
    return [convert_to_zero.get(i,i) for i in x]
zero_converter = F.udf(replace_with_zeros)
preds = preds.withColumn("prediction", zero_converter(F.col("prediction")))

# Convert values in 'prediction' column that do not end with "Not" to "1"
def replace_with_ones(x):
    return [convert_to_one.get(i,i) for i in x]
one_converter = F.udf(replace_with_ones)
preds = preds.withColumn("prediction", one_converter(F.col("prediction")))

display(preds)

label,prediction
"[1, 1, 0, 0, 1]","[1, 1, 0, 0, 1]"
"[1, 0, 0, 0, 1]","[0, 0, 0, 0, 1]"
"[1, 0, 0, 0, 1]","[1, 0, 0, 0, 0]"
"[1, 0, 0, 0, 1]","[1, 0, 0, 0, 0]"
"[1, 0, 0, 0, 1]","[1, 0, 0, 0, 0]"
"[1, 0, 0, 1, 0]","[1, 0, 0, 0, 1]"
"[1, 0, 0, 0, 0]","[1, 0, 0, 0, 0]"
"[1, 0, 0, 0, 0]","[1, 0, 0, 0, 0]"
"[1, 0, 0, 0, 0]","[1, 0, 0, 0, 0]"
"[1, 0, 0, 0, 0]","[0, 0, 1, 0, 0]"


##### Prepare Predictions for Metrics Evaluation (Part 3)

In [0]:
preds = preds.withColumn("label", F.split(F.col("label"),","))\
            .withColumn("prediction", F.split(F.col("prediction"),","))

preds = preds.withColumn("label", F.col("label").cast(ArrayType(DoubleType())))\
            .withColumn("prediction", F.col("prediction").cast(ArrayType(DoubleType())))

##### Calculate & Display Metrics

In [0]:
metrics_to_eval = ["accuracy", "f1Measure", 
                    "precision", "recall", 
                    "microPrecision", "microRecall", 
                    "microF1Measure", "subsetAccuracy", 
                    "hammingLoss"]

evaluate_multilabel_model(preds, 
                          metrics_to_eval, 
                          "Multi-Label of Research Articles")

+---------------------------------------------+
|       Multi-Label of Research Articles      |
+---------------------------------------------+
|                 Metric  |  Value            |
+---------------------------------------------+
|               accuracy  |  0.940548         |
+---------------------------------------------+
|              f1Measure  |  0.964272         |
+---------------------------------------------+
|              precision  |  0.964478         |
+---------------------------------------------+
|                 recall  |  0.964212         |
+---------------------------------------------+
|         microPrecision  |  0.964348         |
+---------------------------------------------+
|            microRecall  |  0.964212         |
+---------------------------------------------+
|         microF1Measure  |  0.96428          |
+---------------------------------------------+
|         subsetAccuracy  |  0.791745         |
+---------------------------------------

##### End Spark Session

In [0]:
preds = preds.unpersist()

spark.stop()

### Notes & Other Takeaways From This Project
****
- Training this model any longer (any more epochs) would result in overfitting. The results are fantastic as it is.
****