# NLP Topic Modeling of Environmental Policy Using spaCy Language Models


Following the paper *Beyond modeling: NLP Pipeline for efficient environmental policy analysis* ([Planas et al., 2022](https://arxiv.org/abs/2201.07105)) and its recommended Knowledge Management Framework, I designed an NLP workflow with PythonGPT and Mistral to carry out topic modeling on a single page of the 2014 French national agroecology policy, **La loi d'avenir pour l'agriculture, l'alimentation et la forêt**.

In [None]:
#next steps: open a new notebook
#use python GPT to complete 'assignment'
#Graduate Level Assignment: Topic Modeling on Translated Sentences from PDF Using spaCy French Model

#Next feed Loi pdf to ChatGPT and see how it does with translation

In [2]:
#previously installed
#!pip install pymupdf 
#!pip install googletrans==4.0.0-rc1 
#!pip install nltk
#!pip install spacy

#!pip install gensim 
#!pip install pyLDAvis
#!pip install scikit-learn



### Pre-NLP: import libraries

In [9]:
#Step 1: Extract Text from PDF
import fitz  # PyMuPDF


#Step 3: Preprocess the French Sentences
import spacy
from nltk.tokenize import sent_tokenize

#Step 4: Translate the Sentences
from googletrans import Translator

#Step 5: Perform Topic Modeling
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
#unsupervised topic modeling
#represents documents as semantic vectors
import gensim
from gensim import corpora

#Step 6
import pyLDAvis
import pyLDAvis.gensim_models

### Pre-NLP: Verify spacy install + Language Models

In [3]:
#above install of spacy threw conflict errors
#verifying proper install of spacy + language models

# Load the spaCy models
nlp_fr = spacy.load('fr_core_news_sm')
nlp_en = spacy.load('en_core_web_sm')

# Example French and English texts
french_text = "Bonjour tout le monde."
english_text = "Hello everyone."

# Process the texts
doc_fr = nlp_fr(french_text)
doc_en = nlp_en(english_text)

# Print tokens
print("French tokens:", [token.text for token in doc_fr])
print("English tokens:", [token.text for token in doc_en])

French tokens: ['Bonjour', 'tout', 'le', 'monde', '.']
English tokens: ['Hello', 'everyone', '.']
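
Because the earlier install reported conflicts, it can help to check that the model packages are present before calling `spacy.load`. The sketch below relies on the fact that spaCy models installed through pip are ordinary Python packages; the helper name `missing_models` is hypothetical and not part of spaCy.

```python
import importlib.util

# Model package names taken from the spacy.load() calls above
REQUIRED_MODELS = ("fr_core_news_sm", "en_core_web_sm")

def missing_models(names=REQUIRED_MODELS):
    """Return the model packages that cannot be found in this environment."""
    # find_spec returns None when a top-level package is not installed
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_models()
if missing:
    for name in missing:
        print(f"Missing model; install with: python -m spacy download {name}")
else:
    print("All required spaCy models are installed.")
```

This only confirms that the packages can be found; loading a model and tokenizing a sentence, as in the cell above, remains the stronger check.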


### Step 1: Extract Text from PDF

In [5]:
#Function to extract txt from pdf
#debugging steps: commented out

def pdf_to_text_third_page(pdf_path):
    # Open the PDF file
    doc = fitz.open(pdf_path)
    
    # Check the number of pages
    #if len(doc) < 3:
        #raise ValueError("The PDF does not contain a third page.")
    
    # Load the third page (page indexing in PyMuPDF starts @ 0)
    third_page = doc.load_page(2)
    
    # Extract text from the third page
    text = third_page.get_text()
    #if text is None:
        #raise ValueError("No text found on the third page.")
    
    return text

#La loi (the 2014 law, as a PDF)
pdf_path = '/Users/jenniferbadger/Dropbox/AI_course/NLP project/joe_20141014_0238_0001.pdf'

# Step 1: Extract Text from PDF
try:
    third_page_text = pdf_to_text_third_page(pdf_path)
    print("Extracted Text from Third Page:\n", third_page_text)
except ValueError as e:
    print(e)
    third_page_text = ""

Extracted Text from Third Page:
 « V. – La politique en faveur de l’agriculture et de l’alimentation tient compte des spécificités des outre-mer 
ainsi que de l’ensemble des enjeux économiques, sociaux et environnementaux de ces territoires. Elle a pour 
objectif de favoriser le développement des productions agricoles d’outre-mer, en soutenant leur accès aux marchés, 
la recherche et l’innovation, l’organisation et la modernisation de l’agriculture par la structuration en filières 
organisées compétitives et durables, l’emploi, la satisfaction de la demande alimentaire locale par des productions 
locales, le développement des énergies renouvelables, des démarches de qualité particulières et de l’agriculture 
familiale, ainsi que de répondre aux spécificités de ces territoires en matière de santé des animaux et des végétaux. 
« VI. – La politique en faveur de l’agriculture et de l’alimentation tient compte des spécificités des territoires de 
montagne, en application de l’article 8 de l
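
The page checks in the function above are commented out. A minimal sketch of how they could be restored is shown here, assuming a document object that supports `len()` and `load_page()` as PyMuPDF documents do; the helper name `extract_page_text` and its 1-based `page_number` parameter are hypothetical.

```python
def extract_page_text(doc, page_number):
    """Return the text of a 1-based page number from an open PyMuPDF-style document."""
    page_count = len(doc)
    if not 1 <= page_number <= page_count:
        raise ValueError(
            f"Page {page_number} requested, but the PDF has {page_count} page(s)."
        )
    # PyMuPDF pages are 0-indexed, so page 3 is index 2
    text = doc.load_page(page_number - 1).get_text()
    if not text.strip():
        raise ValueError(f"No extractable text on page {page_number}.")
    return text

# Assumed usage with PyMuPDF (left commented out so the sketch runs without the PDF):
# import fitz
# with fitz.open(pdf_path) as doc:
#     third_page_text = extract_page_text(doc, 3)
```

Raising `ValueError` keeps the sketch compatible with the `try`/`except ValueError` block used above.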

### Step 2: Save the Extracted Text to a TXT File

In [10]:
# Function to save text to a file

def save_text_to_file(text, filename):
    #opens the file specified by filename in write mode 
    #with UTF-8 encoding
    with open(filename, 'w', encoding='utf-8') as file:
        #writes the text to the file
        #file is automatically closed when the block inside with is exited
        file.write(text)

# Save the extracted text to a file
#only save when extraction succeeded (third_page_text is non-empty)

if third_page_text:
    txt_filename = 'third_page_LaLoi.txt'
    save_text_to_file(third_page_text, txt_filename)
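
One way to confirm that accented characters survive the save is to read the file back and compare. The sketch below repeats `save_text_to_file` so that it runs on its own and writes to a temporary directory; `read_text_from_file`, the sample string, and the file name are illustrative assumptions, not part of the notebook.

```python
import tempfile
from pathlib import Path

def save_text_to_file(text, filename):
    # Same behavior as the function above: write text as UTF-8
    with open(filename, 'w', encoding='utf-8') as file:
        file.write(text)

def read_text_from_file(filename):
    # Hypothetical helper: read the file back with the same encoding
    with open(filename, 'r', encoding='utf-8') as file:
        return file.read()

# Illustrative sample with accents and guillemets, similar to the extracted page
sample = "« V. – spécificités des outre-mer, énergies renouvelables »"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / 'roundtrip_check.txt'
    save_text_to_file(sample, path)
    restored = read_text_from_file(path)

print("Round trip preserved the text:", restored == sample)
```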

### Step 3: Preprocess the French Sentences
- Loads spaCy's French model (`fr_core_news_sm`)
- Defines a function that lowercases each sentence and removes stop words and non-alphabetic tokens
- Splits *La loi* into sentences with NLTK's `sent_tokenize`, then preprocesses them

In [11]:
# Load spacy's French model
nlp_fr = spacy.load('fr_core_news_sm')

#Function to preprocess French sentences
#takes a single argument- 'sentences'

def preprocess_french_sentences(sentences):
    
    #creates an empty list to hold the processed sentences
    processed_sentences = []
   
    #Loops through each sentence in the input list
    for sentence in sentences:
        
        #Converts each sentence to lowercase and 
        #processes it using the spacy French model
        doc = nlp_fr(sentence.lower())
        
        #Removes stop words and non-alphabetic tokens
        #Creates a list of words in the sentence that fit this parameter
        words = [token.text for token in doc if token.is_alpha and not token.is_stop]
        
        #Appends the list of words to the processed_sentences list
        processed_sentences.append(words)
        
    #Returns a list of lists, 
    #where each inner list contains the cleaned and tokenized words of a single sentence.
    return processed_sentences

In [14]:
test_sentences = ["Bonjour tout le monde.",
    "C'est un exemple de texte en français.",
    "Il s'agit de la troisième phrase."]

processed_test_sentences = preprocess_french_sentences(test_sentences)
print("Processed Test Sentences:", processed_test_sentences)

Processed Test Sentences: [['bonjour', 'monde'], ['exemple', 'texte', 'français'], ['agit', 'phrase']]
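
To make the filtering step easier to follow, here is a dependency-free sketch of the same idea: lowercase the sentence, keep alphabetic tokens, and drop stop words. It is only an approximation. The regular expression is a stand-in that does not split text the way spaCy's tokenizer does, and `DEMO_STOP_WORDS` is a tiny hypothetical list chosen for these three sentences, not spaCy's French stop list.

```python
import re

# Tiny illustrative stop list; spaCy's French list is far larger
DEMO_STOP_WORDS = {"tout", "le", "la", "c", "est", "un", "de", "en", "il", "s", "troisième"}

def simple_filter(sentence, stop_words=DEMO_STOP_WORDS):
    # Lowercase, then keep runs of letters only (digits and punctuation are dropped)
    tokens = re.findall(r"[^\W\d_]+", sentence.lower())
    # Drop any token that appears in the stop list
    return [token for token in tokens if token not in stop_words]

demo_sentences = ["Bonjour tout le monde.",
    "C'est un exemple de texte en français.",
    "Il s'agit de la troisième phrase."]

demo_processed = [simple_filter(sentence) for sentence in demo_sentences]
print("Processed Demo Sentences:", demo_processed)
```

With a different stop list the surviving words change, which is why results depend on the language model used in `preprocess_french_sentences`.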


In [13]:
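# Split the extracted text into sentences
# Note: sent_tokenize defaults to NLTK's English Punkt model;
# passing language='french' may give better sentence boundaries for this text
french_sentences = sent_tokenize(third_page_text)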

# Preprocess the French sentences
processed_french_sentences = preprocess_french_sentences(french_sentences)
print("Processed French Sentences:", processed_french_sentences)

Processed French Sentences: [['politique', 'faveur', 'agriculture', 'alimentation', 'tient', 'compte', 'spécificités', 'ensemble', 'enjeux', 'économiques', 'sociaux', 'environnementaux', 'territoires'], ['objectif', 'favoriser', 'développement', 'productions', 'agricoles', 'soutenant', 'accès', 'marchés', 'recherche', 'innovation', 'organisation', 'modernisation', 'agriculture', 'structuration', 'filières', 'organisées', 'compétitives', 'durables', 'emploi', 'satisfaction', 'demande', 'alimentaire', 'locale', 'productions', 'locales', 'développement', 'énergies', 'renouvelables', 'démarches', 'qualité', 'particulières', 'agriculture', 'familiale', 'répondre', 'spécificités', 'territoires', 'matière', 'santé', 'animaux', 'végétaux'], ['vi'], ['politique', 'faveur', 'agriculture', 'alimentation', 'tient', 'compte', 'spécificités', 'territoires', 'montagne', 'application', 'article', 'loi', 'no', 'janvier', 'développement', 'protection', 'montagne'], ['reconnaît', 'contribution', 'positiv