# News Topic modeling


## I- Modules import

In [1]:
from pyspark.sql import SparkSession
from pyspark.ml.feature import CountVectorizer, IDF
from pyspark.sql.functions import col, explode, split
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.clustering import LDA

## II- Spark context and session creation

In [2]:
spark = (
    SparkSession.builder
    .master("spark://node02:7077")
    .appName("NewsTopicModeling")
    .getOrCreate()
)
spark

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/06/12 09:33:10 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


## III- Dataframe preparing

### 1. Load the data

In [3]:
# Load data (Parquet files embed their schema, so the CSV-only options
# header/inferSchema are not needed)
df = spark.read.parquet("input/news.parquet")

### 2. Partition and cache the dataframe

In [4]:
df.rdd.getNumPartitions()

9

In [5]:
num_partitions=4*40
df= df.repartition(num_partitions).cache()

In [6]:
df.rdd.getNumPartitions()



160

### 3. Preview the data

In [7]:
df.count()

1716608

In [8]:
df.show()

+--------------+--------------------+
|category_label|description_filtered|
+--------------+--------------------+
|          22.0|billie jean king ...|
|          23.0|huffpost accolade...|
|          22.0|genus arizona wom...|
|          21.0|trumpet transform...|
|          21.0|observe instigato...|
|          21.0|friend diaspora c...|
|          21.0|effort father nyc...|
|          21.0|seeding future he...|
|          21.0|help improve surv...|
|          21.0|watched traumatiz...|
|          21.0|marshall speech w...|
|          21.0|help phratry slai...|
|          21.0|irma tim duncan p...|
|          23.0|natalie portman s...|
|          21.0|let choose create...|
|          21.0|german supermarke...|
|          21.0|one world rarest ...|
|          23.0|spectacular new t...|
|          21.0|sex act set polit...|
|          22.0|floor detail trea...|
+--------------+--------------------+
only showing top 20 rows



In [9]:
df.printSchema()

root
 |-- category_label: double (nullable = true)
 |-- description_filtered: string (nullable = true)



### 4. Convert filtered descriptions to arrays

In [10]:
# Create a new DataFrame with description_filtered as arrays
df= df.withColumn('description_filtered', split(col('description_filtered'), ' '))
# Show the new DataFrame
df.show(truncate=False)

+--------------+---------------------------------------------------------------------------------------------------------+
|category_label|description_filtered                                                                                     |
+--------------+---------------------------------------------------------------------------------------------------------+
|22.0          |[billie, jean, king, quotation, card, fraud]                                                             |
|23.0          |[huffpost, accolade, roll, barbershop, outreach, get, shirley, temple, boy, meter, reading]              |
|22.0          |[genus, arizona, woman, allegedly, injects, married, man, iv, feces]                                     |
|21.0          |[trumpet, transforming, capitalism, making, front, see]                                                  |
|21.0          |[observe, instigator, conduct, social, change]                                                           |
|21.0          |

## IV- Feature Engineering


### 1. Explode the filtered descriptions to get the words

In [11]:
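# Note: .alias('words') is applied to the DataFrame here, not to the column,
# so the exploded column keeps its default name `col` (see the output below).
# To rename the column itself, use explode(...).alias('words') inside select().
exploded_df = df.select(explode(df.description_filtered)).alias('words')
exploded_df.show()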

+----------+
|       col|
+----------+
|    billie|
|      jean|
|      king|
| quotation|
|      card|
|     fraud|
|  huffpost|
|  accolade|
|      roll|
|barbershop|
|  outreach|
|       get|
|   shirley|
|    temple|
|       boy|
|     meter|
|   reading|
|     genus|
|   arizona|
|     woman|
+----------+
only showing top 20 rows



### 2. Get unique words in the filtered_description

In [12]:
unique_words=exploded_df.distinct()

### 3. Cache and show the unique words dataframe

In [13]:
unique_words=unique_words.cache()
unique_words.show()



+-------------+
|          col|
+-------------+
|      barrier|
|     vladimir|
|vladimirovich|
|       travel|
|         hope|
|   lieutenant|
|   stateowned|
|        still|
|        inner|
|  requirement|
|       online|
|          fog|
|   electrical|
|        trail|
|     priority|
|      alquran|
|     ignominy|
|      implore|
|     tortured|
|        1970s|
+-------------+
only showing top 20 rows

### 4. Get the vocabulary size

In [14]:
vocabulary_size=unique_words.count()
vocabulary_size

128622

In [15]:
minDF=3.0 # The minimum document frequency

### 5. Define the CountVectorizer and IDF stages

In [16]:
# Define the CountVectorizer and IDF stages
vectorizer = CountVectorizer(
    inputCol="description_filtered",
    outputCol="raw_features",
    vocabSize=vocabulary_size,
    minDF=minDF,
)
idf = IDF(inputCol="raw_features", outputCol="features")
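
To make the roles of `minDF` and IDF concrete, here is a minimal plain-Python sketch on a toy corpus. It is not Spark's implementation: the corpus and the helper names `document_frequency` and `idf_weights` are made up for illustration. It assumes the smoothed formula given in the Spark MLlib documentation, idf = log((m + 1) / (df + 1)), where m is the number of documents, and it treats `min_df` as an absolute document count (which is how Spark interprets values of `minDF` that are at least 1, such as the 3.0 used above).

```python
import math
from collections import Counter

def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # set(): a term counts once per document
    return df

def idf_weights(docs, min_df=1):
    """Return {term: idf} for terms found in at least min_df documents."""
    m = len(docs)
    df = document_frequency(docs)
    # Assumed smoothed IDF: log((m + 1) / (df + 1))
    return {t: math.log((m + 1) / (n + 1)) for t, n in df.items() if n >= min_df}

# Hypothetical toy corpus, already tokenized like description_filtered
toy_docs = [
    ["trump", "bill", "gop"],
    ["trump", "news", "fox"],
    ["nasa", "space", "news"],
    ["trump", "space"],
]
weights = idf_weights(toy_docs, min_df=2)
```

In this toy run only `trump`, `news` and `space` survive the `min_df=2` filter, and `trump`, which appears in the most documents, receives the smallest weight. The same filtering would explain why the fitted vocabulary can end up smaller than `vocabulary_size`.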

## V- Models set up, training and evaluation

### 1. Set up LDA model

We instantiate an LDA model with 30 topics.

We choose 30 because it is close to the 32 categories in our data set.

In [17]:
num_topics = 30
# LDA model with 30 topics
lda = LDA(featuresCol="features", seed=0, k=num_topics)
lda

LDA_2272ffdb448c

### 2. Set up pipelines

We set up a single pipeline that chains the following stages:

- CountVectorizer
- IDF
- LDA

In [18]:
# Create pipeline for LDA
pipeline = Pipeline(stages=[vectorizer, idf, lda])


pipeline

Pipeline_4409f428b81b

### 3. Split the data

First, we split the data into a training set (80%) and a test set (20%).

In [19]:
# Split data
(train_set, test_set) = df.randomSplit([0.8, 0.2], seed=0)

### 4. Create a function for model training

Let us create a function that fits the model (or pipeline) passed as its argument on the training set and returns the fitted model.

In [20]:
def train_model(model):
    return model.fit(train_set)

In [21]:
fitted_model=train_model(pipeline)
fitted_model

24/06/12 09:35:37 WARN DAGScheduler: Broadcasting large task binary with size 1987.5 KiB
24/06/12 09:35:40 WARN DAGScheduler: Broadcasting large task binary with size 1987.6 KiB
24/06/12 09:35:42 WARN DAGScheduler: Broadcasting large task binary with size 2004.4 KiB
24/06/12 09:35:43 WARN DAGScheduler: Broadcasting large task binary with size 2008.0 KiB
24/06/12 09:35:49 WARN DAGScheduler: Broadcasting large task binary with size 2009.3 KiB
24/06/12 09:35:52 WARN DAGScheduler: Broadcasting large task binary with size 2004.4 KiB
24/06/12 09:35:53 WARN DAGScheduler: Broadcasting large task binary with size 2008.0 KiB
24/06/12 09:35:58 WARN DAGScheduler: Broadcasting large task binary with size 2009.3 KiB
24/06/12 09:36:00 WARN DAGScheduler: Broadcasting large task binary with size 2004.4 KiB
24/06/12 09:36:01 WARN DAGScheduler: Broadcasting large task binary with size 2008.0 KiB
24/06/12 09:36:05 WARN DAGScheduler: Broadcasting large task binary with size 2009.3 KiB
24/06/12 09:36:07 WAR

PipelineModel_3ed20f173a7c

### 5. Visualize the topics

In [23]:
fitted_vectorizer=fitted_model.stages[0]
vocabulary= fitted_vectorizer.vocabulary
len(vocabulary)

73489

In [24]:
vocabulary[:10]

['new', 'photo', 'state', 'trump', 'day', 'nt', 'say', 'woman', 'make', 'get']

In [25]:
topics = fitted_model.stages[-1].describeTopics()
topics.show()

topics_rdd = topics.rdd
topics_words = topics_rdd\
       .map(lambda row: row['termIndices'])\
       .map(lambda idx_list: [vocabulary[idx] for idx in idx_list])\
       .collect()
topics_words[:2]

+-----+--------------------+--------------------+
|topic|         termIndices|         termWeights|
+-----+--------------------+--------------------+
|    0|[96, 2, 198, 588,...|[0.01022540609806...|
|    1|[123, 469, 337, 5...|[0.01101762943972...|
|    2|[261, 671, 485, 9...|[0.00658080796295...|
|    3|[430, 470, 643, 3...|[0.00929968391042...|
|    4|[187, 3, 278, 329...|[0.00766232554540...|
|    5|[133, 99, 309, 36...|[0.01120993625942...|
|    6|[21, 92, 16, 31, ...|[0.01790640739366...|
|    7|[281, 334, 341, 4...|[0.00636672732379...|
|    8|[342, 455, 325, 5...|[0.00901083155412...|
|    9|[316, 26, 609, 75...|[0.00872033001026...|
|   10|[46, 22, 151, 91,...|[0.01155021560144...|
|   11|[567, 637, 521, 7...|[0.00668455764965...|
|   12|[189, 226, 68, 38...|[0.00715842597608...|
|   13|[203, 67, 581, 52...|[0.00836252371731...|
|   14|[401, 285, 511, 6...|[0.00747224889808...|
|   15|[343, 406, 57, 70...|[0.00829819700176...|
|   16|[387, 354, 564, 4...|[0.00805366547810...|

[['sexual',
  'state',
  'tree',
  'assault',
  'woman',
  '100',
  'charged',
  'pine',
  'arrested',
  'push'],
 ['dog',
  'ampere',
  'race',
  'human',
  'legal',
  'miss',
  'walk',
  'commonwealth',
  'peace',
  'turkey']]
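
The cell above keeps only the words of each topic. If the weights are also of interest, the `termIndices` and `termWeights` columns can be paired element by element. Below is a small plain-Python sketch of that idea. The helper `label_topics`, the toy vocabulary and the toy rows are hypothetical stand-ins for what `topics.collect()` and `vocabulary` would provide.

```python
def label_topics(rows, vocabulary):
    """Map each topic id to a list of (word, weight) pairs."""
    return {
        row["topic"]: [
            (vocabulary[i], w)
            for i, w in zip(row["termIndices"], row["termWeights"])
        ]
        for row in rows
    }

# Hypothetical stand-ins for the fitted vocabulary and the collected rows
toy_vocabulary = ["new", "photo", "state", "trump", "day"]
toy_rows = [
    {"topic": 0, "termIndices": [2, 0], "termWeights": [0.4, 0.1]},
    {"topic": 1, "termIndices": [3, 4], "termWeights": [0.3, 0.2]},
]
labelled = label_topics(toy_rows, toy_vocabulary)
```

Because the rows are accessed by field name, the same function should also work on the `Row` objects returned by `topics.collect()`, though that has not been run here.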

In [31]:
topics_words_dict={}
for idx, topic in enumerate(topics_words):
    print("topic: {}".format(idx))
    print("*"*25)
    topic_words_array=[]
    for word in topic:
        print(word)
        topic_words_array.append(word)
    topics_words_dict[idx]=topic_words_array
    print("*"*25)

topic: 0
*************************
sexual
state
tree
assault
woman
100
charged
pine
arrested
push
*************************
topic: 1
*************************
dog
ampere
race
human
legal
miss
walk
commonwealth
peace
turkey
*************************
topic: 2
*************************
crisis
martin
cut
gas
course
meeting
poor
read
climate
voter
*************************
topic: 3
*************************
monophosphate
deoxyadenosine
activity
role
angstrom
problem
unit
disney
aid
gender
*************************
topic: 4
*************************
bill
trump
republican
fox
face
gop
news
horn
storm
ring
*************************
topic: 5
*************************
space
national
administration
red
police
guy
mar
aeronautics
officer
nasa
*************************
topic: 6
*************************
bank
india
covid19
coronavirus
ha
case
said
today
country
test
*************************
topic: 7
*************************
long
small
insurance
lead
8
fry
study
brain
crataegus
god
****************

In [27]:
topics_words_dict

{0: ['sexual',
  'state',
  'tree',
  'assault',
  'woman',
  '100',
  'charged',
  'pine',
  'arrested',
  'push'],
 1: ['dog',
  'ampere',
  'race',
  'human',
  'legal',
  'miss',
  'walk',
  'commonwealth',
  'peace',
  'turkey'],
 2: ['crisis',
  'martin',
  'cut',
  'gas',
  'course',
  'meeting',
  'poor',
  'read',
  'climate',
  'voter'],
 3: ['monophosphate',
  'deoxyadenosine',
  'activity',
  'role',
  'angstrom',
  'problem',
  'unit',
  'disney',
  'aid',
  'gender'],
 4: ['bill',
  'trump',
  'republican',
  'fox',
  'face',
  'gop',
  'news',
  'horn',
  'storm',
  'ring'],
 5: ['space',
  'national',
  'administration',
  'red',
  'police',
  'guy',
  'mar',
  'aeronautics',
  'officer',
  'nasa'],
 6: ['bank',
  'india',
  'covid19',
  'coronavirus',
  'ha',
  'case',
  'said',
  'today',
  'country',
  'test'],
 7: ['long',
  'small',
  'insurance',
  'lead',
  '8',
  'fry',
  'study',
  'brain',
  'crataegus',
  'god'],
 8: ['divorcement',
  'dad',
  'move',
  'dati

### 6. Get topics distributions

In [28]:
# Transform the training and test data
train_set_transformed = fitted_model.transform(train_set)
test_set_transformed = fitted_model.transform(test_set)

# Get the LDA model from the pipeline model
#lda_model = fitted_model.stages[-1]

# Extract the topic distributions
train_topic_distributions = train_set_transformed.select("description_filtered", "topicDistribution")
test_topic_distributions = test_set_transformed.select("description_filtered", "topicDistribution")

In [29]:
# Show the topic distributions for the training set
train_topic_distributions.show(truncate=False)

# Show the topic distributions for the test set
test_topic_distributions.show(truncate=False)

24/06/12 09:44:34 WARN DAGScheduler: Broadcasting large task binary with size 18.7 MiB


+--------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|description_filtered                                                                                                      |topicDistribution                                                                               

24/06/12 09:44:35 WARN DAGScheduler: Broadcasting large task binary with size 18.7 MiB


+----------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|description_filtered                                                                                                  |topicDistribution                                                                                               
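
Each `topicDistribution` value is a vector with one probability per topic, which is hard to read in a wide table. One common way to summarize it is to keep only the most probable topic. The sketch below shows the idea in plain Python on a made-up three-topic distribution; the helper name `dominant_topic` is hypothetical, and with real data the vector would first need to be converted to a list (for example with `toArray()` on the Spark vector).

```python
def dominant_topic(distribution):
    """Return (topic_index, probability) of the largest entry."""
    best = max(range(len(distribution)), key=lambda i: distribution[i])
    return best, distribution[best]

# Hypothetical distribution over three topics for one document
toy_distribution = [0.05, 0.70, 0.25]
topic, prob = dominant_topic(toy_distribution)
```

The returned index can then be looked up in `topics_words_dict` to get a human-readable description of the document's main topic.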

### 7. Save the model and other important data

In [36]:
fitted_model.save('output/news_topic_model')

24/06/12 09:47:36 WARN TaskSetManager: Stage 150 contains a task of very large size (1373 KiB). The maximum recommended task size is 1000 KiB.
24/06/12 09:47:37 WARN TaskSetManager: Stage 154 contains a task of very large size (1179 KiB). The maximum recommended task size is 1000 KiB.
24/06/12 09:47:39 WARN TaskSetManager: Stage 158 contains a task of very large size (17586 KiB). The maximum recommended task size is 1000 KiB.

Let us reload the saved model and run it on the test set to make sure everything is okay.

In [40]:
# Load the saved model and use it (not the in-memory one) for the check
loaded_model = PipelineModel.load('output/news_topic_model')
test_topic = loaded_model.transform(test_set)

In [43]:
test_topic.select('topicDistribution').show()

24/06/12 09:52:09 WARN DAGScheduler: Broadcasting large task binary with size 18.7 MiB


+--------------------+
|   topicDistribution|
+--------------------+
|[0.08535319788072...|
|[3.56484866714508...|
|[5.11276830799110...|
|[3.70785723711311...|
|[0.10490493899793...|
|[0.13381013825972...|
|[7.02350718738341...|
|[0.26348733053179...|
|[4.24504891571636...|
|[0.10875776556678...|
|[3.69178733449807...|
|[3.18097157458662...|
|[4.16774925592334...|
|[4.55233034544584...|
|[7.93050835752041...|
|[6.86429874455979...|
|[0.00103633915831...|
|[6.12979960451373...|
|[3.04405968585980...|
|[3.28065531292427...|
+--------------------+
only showing top 20 rows



Let us also save the dictionary that maps each topic to its words.

In [44]:
import json

# Specify the file name
file_name = 'output/news_topic_model/topics.json'

# Write the data to a JSON file
with open(file_name, 'w') as json_file:
    json.dump(topics_words_dict, json_file, indent=4)

print(f"Data successfully written to {file_name}")

Data successfully written to output/news_topic_model/topics.json
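
One detail to keep in mind when reading this file back: JSON object keys are always strings, so the integer topic ids become `"0"`, `"1"`, and so on. The sketch below demonstrates the round trip on a small made-up dictionary written to a temporary directory (the names `toy_topics`, `raw` and `restored` are illustrative only) and converts the keys back to integers.

```python
import json
import os
import tempfile

# Hypothetical two-topic dictionary with integer keys
toy_topics = {0: ["sexual", "state"], 1: ["dog", "race"]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "topics.json")
    with open(path, "w") as f:
        json.dump(toy_topics, f, indent=4)
    with open(path) as f:
        raw = json.load(f)  # keys come back as strings

# Convert the string keys back to integer topic ids
restored = {int(k): v for k, v in raw.items()}
```

After the conversion, `restored` can be indexed with the same integer topic ids used elsewhere in the notebook.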


## VI- Summary

In this notebook we trained an LDA topic model with 30 topics on the filtered news descriptions, using a pipeline made of a CountVectorizer, an IDF stage and the LDA estimator.

We then inspected the top words of each topic and computed the topic distributions for the training and test sets.

Finally, we saved the fitted pipeline together with the dictionary that maps each topic to its words.

 The next step of our work will be to ...

In [45]:
# Release the cached DataFrames and stop the Spark session
df.unpersist()
unique_words.unpersist()
spark.stop()