# Topic Modelling
Topic detection (topic modeling): extract the relevant topics covered by the abstracts.

In [None]:
!pip install bertopic[visualization] --quiet

In [None]:
!pip install numpy==1.19.5

# **Imports**

In [75]:
import numpy as np
import pandas as pd
from copy import deepcopy
from bertopic import BERTopic

In [76]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [126]:
import pandas as pd
df = pd.read_json('/content/gdrive/MyDrive/Exam/AgrSmall_cleaned.json')
df.head()

Unnamed: 0,doi,titles,abstracts,authors,keywords,sources
0,10.339,Community Faecal Management Strategies and Per...,Most people in rural areas in South Africa SA ...,"[Matthew Mamera, Johan J. van Tol, Makhosazana...",agriculture,mdpi
1,10.339,Adoption of Sustainable Agriculture Practices ...,The aim of this study was to highlight the imp...,"[Rafay Waseem, Gershom Endelani Mwalupaso, Far...",agriculture,mdpi
2,10.339,Atlanta Residents’ Knowledge Regarding Heavy M...,Urban agriculture and gardening provide many h...,"[Lauren Balotin, Samantha Distler, Antoinette ...",agriculture,mdpi
3,10.339,Perceptions of the Challenges and Opportunitie...,Waste management has become pertinent in urban...,"[Nqubeko Neville Menyuka, Melusi Sibanda, Urmi...",agriculture,mdpi
4,10.339,An Assessment of Seaweed Extracts: Innovation ...,Plant growth regulators PGRs are described in ...,"[El Chami Daniel, Galli Fabio]",agriculture,mdpi


Fortunately, this kind of method does not need much pre-processing, so we can pass our data to the model almost as-is.

# **Load data**

We need some idea of how many topics to extract from our abstracts. Fortunately, the `keywords` column can be interpreted as a column of topics, and it contains 33 distinct values.

In [127]:
len(df['keywords'].value_counts())

33

In [128]:
docs = list(df.loc[:, "abstracts"].values)

# **Creating Topics**

In [130]:
model = BERTopic(language="english",nr_topics="auto")

In [131]:
topics, probs = model.fit_transform(docs)

# We can then extract the most frequent topics:

In [132]:
model.get_topic_freq()

Unnamed: 0,Topic,Count
0,-1,1478
1,0,223
2,1,221
3,2,193
4,3,165
...,...,...
56,55,12
57,56,12
58,57,11
59,58,11


Topic -1 collects the outliers, that is, the documents that were not assigned to any topic. It is not relevant for our application, since we want to group the abstracts that can be grouped rather than study the outliers.
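To gauge how much of the corpus ends up in the outlier topic, the frequency table can be filtered on the `Topic` column. This is a minimal sketch, assuming a DataFrame shaped like the output of `get_topic_freq()` (columns `Topic` and `Count`). The rows are only the first five displayed above, not the whole table, so the resulting share is illustrative.

```python
import pandas as pd

# Illustrative subset: the first five rows displayed above (the full table has more topics).
freq = pd.DataFrame({"Topic": [-1, 0, 1, 2, 3], "Count": [1478, 223, 221, 193, 165]})

# Topic -1 collects the documents that were not assigned to any cluster.
outliers = freq.loc[freq["Topic"] == -1, "Count"].sum()
clustered = freq[freq["Topic"] != -1]

# Share of outliers within this subset only.
outlier_share = outliers / freq["Count"].sum()
print(f"outlier share in this subset: {outlier_share:.2%}")
print(clustered.sort_values("Count", ascending=False).head())
```

With the real model, `freq = model.get_topic_freq()` would replace the hand-built DataFrame.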

# Get Individual Topics

In [133]:
model.get_topic(0)

[('resistance', 0.027306180360468505),
 ('wheat', 0.022536936387512512),
 ('genes', 0.015326432149689312),
 ('gene', 0.012973224803286497),
 ('breeding', 0.010564508038693646),
 ('genetic', 0.008334712069907636),
 ('plant', 0.008172686948642576),
 ('plants', 0.0077816582570526096),
 ('cultivars', 0.007644248708109173),
 ('genotypes', 0.007554980210517708)]

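Topic 0 is not the outlier topic (that is topic -1): its top words point to crop genetics and breeding (wheat, resistance, genes, cultivars).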



In [134]:
model.get_topic(2)

[('cows', 0.02477964625381168),
 ('dairy', 0.02239357360037012),
 ('milk', 0.019933434338471116),
 ('cattle', 0.012417915963207972),
 ('pig', 0.010149702434404785),
 ('animal', 0.009687357316715972),
 ('cow', 0.008901690751340622),
 ('pigs', 0.008148020096708798),
 ('farm', 0.0076607437896666665),
 ('feeding', 0.007506514561680994)]

Topic 2 is about livestock and dairy farming. The next one seems to be related to machine learning:

In [135]:
model.get_topic(14)

[('leaf', 0.01709953169884755),
 ('plant', 0.015461347938547013),
 ('classification', 0.013272324515412761),
 ('tree', 0.012575488701553818),
 ('canopy', 0.012183447987412239),
 ('leaves', 0.01105078285855785),
 ('trees', 0.008473905512920798),
 ('dataset', 0.007336205870198295),
 ('detection', 0.007298431730194899),
 ('vegetation', 0.006829647371095687)]

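Topic 14 combines machine-learning terms (classification, dataset, detection) with agricultural vocabulary (leaf, tree, canopy, vegetation).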

# **Visualize Topics**

In [136]:
model.visualize_topics()

In [137]:
new_topics, new_probs = model.reduce_topics(docs, topics, probs, nr_topics=33)

In [138]:
model.get_topic_freq()

Unnamed: 0,Topic,Count
0,-1,1535
1,0,254
2,1,243
3,2,193
4,3,165
5,4,143
6,5,140
7,6,138
8,7,134
9,8,112


In [139]:
model.visualize_topics()

Here we reduced the number of topics to 33 to check whether our first assumption is relevant.

We can see that the clusters are fairly well separated.

We could still reduce the number of topics further if we wanted a very low topic granularity.



In [143]:
model.find_topics("agriculture")

([7, 5, 27, 9, 28],
 [0.8518545761899723,
  0.8431021045540084,
  0.675743308534456,
  0.6502194162663008,
  0.6107970152718672])

In [145]:
model.get_topic(7)

[('agricultural', 0.03565437019382535),
 ('technology', 0.026140116507505806),
 ('technologies', 0.025191018766316336),
 ('agriculture', 0.02399745875743113),
 ('farmers', 0.020352165381229397),
 ('farming', 0.01690241404833638),
 ('smart', 0.012369160185100859),
 ('development', 0.010033620563684323),
 ('innovation', 0.00941452781672134),
 ('rural', 0.00829096345981467)]

In [144]:
model.find_topics("machine learning")

([19, 28, 1, -1, 26],
 [0.6228054769533977,
  0.5893433771438448,
  0.583040589428113,
  0.5528516026756964,
  0.48521100341254775])

In [146]:
model.get_topic(19)

[('learning', 0.038192074333737634),
 ('data', 0.01873829285043282),
 ('model', 0.013674783425643256),
 ('audio', 0.013612465297208966),
 ('models', 0.013596429964412085),
 ('fairness', 0.013477921456525252),
 ('as', 0.012610033948148306),
 ('ml', 0.01085955670939449),
 ('healthcare', 0.00980604712895294),
 ('supervised', 0.009429501947349846)]

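The result is interesting: for a search term, `find_topics` returns the topics most similar to it together with a similarity score. These scores are not probabilities; they indicate how well a single word or phrase represents each topic. Here topic 7 matches "agriculture" with a score of about 0.85, and topic 19 matches "machine learning" with a score of about 0.62.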
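The ranking idea behind such a search can be sketched with plain NumPy. This sketch assumes the scores are cosine similarities between an embedding of the search term and one embedding per topic. The three-dimensional vectors are hypothetical and only serve the illustration; real sentence embeddings have hundreds of dimensions.

```python
import numpy as np

def cosine_similarity(query: np.ndarray, topics: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and each row of `topics`."""
    query_unit = query / np.linalg.norm(query)
    topic_units = topics / np.linalg.norm(topics, axis=1, keepdims=True)
    return topic_units @ query_unit

# Hypothetical 3-dimensional embeddings, for illustration only.
query_embedding = np.array([1.0, 0.0, 0.0])
topic_embeddings = np.array([
    [2.0, 0.0, 0.0],  # same direction as the query
    [0.0, 3.0, 0.0],  # orthogonal to the query
    [1.0, 1.0, 0.0],  # 45 degrees from the query
])

scores = cosine_similarity(query_embedding, topic_embeddings)
ranking = np.argsort(scores)[::-1]  # most similar topic first
print(ranking, scores[ranking])
```

Because the vectors are normalized, the score depends only on direction, which is why it cannot be read as a probability and does not sum to one across topics.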

The words that define the topics are much more coherent when we create 33 topics.
Let's see whether asking for very few topics (the code below sets `nr_topics=3`) yields topics that define agriculture and machine learning.

# 2 TOPICS

In [117]:
docs = list(df.loc[:, "abstracts"].values)

In [118]:
model = BERTopic(language="english",nr_topics=3)

In [119]:
topics, probs = model.fit_transform(docs)

In [120]:
model.get_topic_freq()

Unnamed: 0,Topic,Count
0,-1,1540
1,0,1212
2,1,1100
3,2,606


In [124]:
model.find_topics("agriculture")

([1, 0, 2, -1],
 [0.6286732631640763,
  0.41762277978959006,
  0.3311495793753889,
  0.32436338257908837])

In [122]:
model.get_topic(1)

[('to', 0.06128262780777263),
 ('for', 0.03765940981166114),
 ('data', 0.02215323611515929),
 ('agriculture', 0.02195908129465209),
 ('agricultural', 0.018556788586467855),
 ('using', 0.013720183003748965),
 ('use', 0.012720749746971046),
 ('learning', 0.012604427999175504),
 ('water', 0.012456796406484245),
 ('farming', 0.012391099137114642)]

In [125]:
model.find_topics("machine learning")

([2, 1, -1, 0],
 [0.4688787801961388,
  0.3184875039725184,
  0.29089146589564324,
  0.2871437819998832])

In [123]:
model.get_topic(2)

[('in', 0.04879732338921885),
 ('we', 0.034624216579477785),
 ('that', 0.029920501263070783),
 ('learning', 0.025694609610814705),
 ('neural', 0.023517840793407484),
 ('cnn', 0.021882377052382294),
 ('segmentation', 0.021457022246107817),
 ('data', 0.021047214596373723),
 ('as', 0.020567844773851995),
 ('model', 0.018847795307258816)]

In [121]:
model.visualize_topics()

After this first attempt with very few topics, we saw that the topics are partly defined by stop words such as "to", "for", "in" and "we". I think this is expected: stop words appear everywhere, and by asking the algorithm to reduce the number of topics so drastically we make it underfit and generalize.

In [94]:
import nltk

In [95]:
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')
from nltk.tokenize import word_tokenize


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [96]:
len(docs)

4458

In [97]:
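# NLTK's English stop-word list (all lowercase)
stop_words = set(stopwords.words('english'))

def remove_stopwords(x):
  """Remove stop words from a space-separated string."""
  # Note: the comparison is case-sensitive, so capitalized words such as "The" are kept.
  words = x.split(" ")
  return " ".join(word for word in words if word not in stop_words)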

In [98]:
df['text_clean'] = df['abstracts'].apply(lambda x: remove_stopwords(x))
df

Unnamed: 0,doi,titles,abstracts,authors,keywords,sources,text_clean
0,10.3390,Community Faecal Management Strategies and Per...,Most people in rural areas in South Africa SA ...,"[Matthew Mamera, Johan J. van Tol, Makhosazana...",agriculture,mdpi,Most people rural areas South Africa SA rely u...
1,10.3390,Adoption of Sustainable Agriculture Practices ...,The aim of this study was to highlight the imp...,"[Rafay Waseem, Gershom Endelani Mwalupaso, Far...",agriculture,mdpi,The aim study highlight importance socioeconom...
2,10.3390,Atlanta Residents’ Knowledge Regarding Heavy M...,Urban agriculture and gardening provide many h...,"[Lauren Balotin, Samantha Distler, Antoinette ...",agriculture,mdpi,Urban agriculture gardening provide many healt...
3,10.3390,Perceptions of the Challenges and Opportunitie...,Waste management has become pertinent in urban...,"[Nqubeko Neville Menyuka, Melusi Sibanda, Urmi...",agriculture,mdpi,Waste management become pertinent urban region...
4,10.3390,An Assessment of Seaweed Extracts: Innovation ...,Plant growth regulators PGRs are described in ...,"[El Chami Daniel, Galli Fabio]",agriculture,mdpi,Plant growth regulators PGRs described literat...
...,...,...,...,...,...,...,...
4526,10.2478/,1. Modelling groundwater flow and nitrate tran...,The present paper discusses studies related to...,"[Sieczka Anna, Bujakowski Filip, Koda Eugeniusz]",precision agriculture,"Sciendo, 2018.",The present paper discusses studies related pr...
4527,unknown,2. Cosechando los beneficios de la agricultura...,El objetivo del trabajo fue desarrollar una me...,"[Bonilla, Camila, Terra, José A, Gutiérrez, Lu...",precision agriculture,Facultad de Agronomía - Instituto Nacional de ...,El objetivo del trabajo fue desarrollar una me...
4528,unknown,6. A Risk Analysis of Precision Agriculture Te...,Precision agriculture technology can transform...,"[Yangxuan Liu, Michael R. Langemeier, Ian M. S...",precision agriculture,"MDPI, Open Access Journal, 2018.",Precision agriculture technology transform far...
4529,10.1016/,7. Integrated open geospatial web service enab...,We proposed an integrated geospatial service e...,"[Chen, Nengcheng, Zhang, Xiang, Wang, Chao]",precision agriculture,Elsevier B.V.,We proposed integrated geospatial service enab...
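The `text_clean` column still starts with words such as "The" and "Most", because the stop-word list is lowercase while the comparison is case-sensitive. A possible variant, shown here as a sketch rather than as the notebook's function, compares the lowercase form of each token. The tiny `STOP_WORDS` set and the function name are hypothetical and stand in for NLTK's full English list.

```python
# Hypothetical, tiny stop-word set for illustration; the notebook uses NLTK's English list.
STOP_WORDS = {"the", "of", "this", "was", "to", "in"}

def remove_stopwords_case_insensitive(text: str) -> str:
    """Drop tokens whose lowercase form is a stop word; keep the casing of the rest."""
    return " ".join(word for word in text.split() if word.lower() not in STOP_WORDS)

sample = "The aim of this study was to highlight the importance"
print(remove_stopwords_case_insensitive(sample))
```

Punctuation attached to a word (for example "the,") would still slip through, since the text is split on whitespace only.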


In [99]:
docs = list(df.loc[:, "text_clean"].values)

We then remove the stop words from the abstracts and see whether the result is better.

In [112]:
model = BERTopic(language="english",nr_topics=3)

In [113]:
topics, probs = model.fit_transform(docs)

In [114]:
model.get_topic_freq()

Unnamed: 0,Topic,Count
0,-1,3037
1,0,607
2,1,474
3,2,340


In [115]:
model.visualize_topics()

As we can see, removing stop words before fitting is not a good idea here: the outlier topic grows from 1540 to 3037 documents, so far fewer abstracts end up in an actual topic.
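An alternative worth trying is to leave the abstracts untouched for the embedding step and filter stop words only when the topic words are counted. The sketch below uses scikit-learn's `CountVectorizer` on two toy documents to show the filtering itself. Passing the vectorizer to BERTopic through a `vectorizer_model` argument is an assumption about the installed version and is left as a comment.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Stop words are filtered only when counting words, not in the text that gets embedded.
vectorizer_model = CountVectorizer(stop_words="english")

# Toy documents, for illustration only.
toy_docs = [
    "we use a neural model to segment the data",
    "data for agriculture and farming",
]
vectorizer_model.fit(toy_docs)
vocabulary = set(vectorizer_model.vocabulary_)
print(sorted(vocabulary))

# Assumption: the installed BERTopic version accepts a `vectorizer_model` argument, e.g.
# model = BERTopic(language="english", nr_topics=3, vectorizer_model=vectorizer_model)
```

With this setup the documents keep their full sentences, which is what the embedding model expects, while words like "to" or "for" cannot appear among the topic words.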

We still need an evaluation pipeline to score the topics created, but I have not found a suitable method yet.
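One simple candidate, offered as a suggestion rather than as the method used in this notebook, is topic diversity: the share of unique words among the top words of all topics. A value close to 1 means the topics barely share vocabulary; a low value suggests redundant topics or topics dominated by common words. The function name is hypothetical, and the word lists are the first five words of topics 0 and 2 printed earlier.

```python
from typing import Dict, List

def topic_diversity(top_words: Dict[int, List[str]]) -> float:
    """Share of unique words among all top words; 1.0 means no word is shared."""
    all_words = [word for words in top_words.values() for word in words]
    if not all_words:
        return 0.0
    return len(set(all_words)) / len(all_words)

# First five words of topics 0 and 2 from the "auto" model above.
top_words = {
    0: ["resistance", "wheat", "genes", "gene", "breeding"],
    2: ["cows", "dairy", "milk", "cattle", "pig"],
}
print(topic_diversity(top_words))
```

With a fitted model, the dictionary could be built from `model.get_topic(t)` for each topic `t` other than -1. Diversity says nothing about whether a topic is meaningful, so it would need to be paired with a coherence measure or a manual review.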

# Conclusion

As this notebook shows, our approach lets us extract topics from the abstracts. The results are fairly good: the similarity scores are high for the two search terms 'agriculture' and 'machine learning'.

In [1]:
!wget -nc https://raw.githubusercontent.com/brpy/colab-pdf/master/colab_pdf.py
from colab_pdf import colab_pdf
colab_pdf('BERT-Topic-Modelling.ipynb',"/content/drive/MyDrive/Exam/")

