Goal for 04_preprocessing_nlp.ipynb

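Take the translated output from 03_remove_french_translate.ipynb (the TranslatedDescription column).

Then clean the translated description text:

-	Lowercasing
-	Remove non-alphabetic characters (digits, punctuation, special characters)
-	Tokenization (NLTK word_tokenize)
-	Remove stopwords (NLTK English list)
-	Lemmatization (WordNetLemmatizer)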
-	Save final cleaned text ready for STM

In [None]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from tqdm import tqdm

In [None]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [None]:
tqdm.pandas()

In [None]:
# Load English stopwords
stop_words = set(stopwords.words('english'))

In [None]:
lemmatizer = WordNetLemmatizer()

In [None]:
# Step 1: Load the translated dataset
df = pd.read_csv('/content/energy_text_trn.csv')
print(f"Loaded {len(df)} records for NLP preprocessing.")

Loaded 64212 records for NLP preprocessing.


In [None]:
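# Step 2: Clean text - lowercase, strip non-alphabetic characters, tokenize,
# remove stopwords, and lemmatize
def clean_text(text):
    if pd.isnull(text):
        return ""
    text = str(text).lower()  # lowercase (str() guards against non-string values)
    # Delete non-alphabetic characters; this joins hyphenated words
    # ("service-based" becomes "servicebased")
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    text = re.sub(r'\s+', ' ', text).strip()  # collapse extra whitespace
    tokens = word_tokenize(text)
    # Drop stopwords, then lemmatize (WordNetLemmatizer treats words as nouns by default)
    tokens = [lemmatizer.lemmatize(w) for w in tokens if w not in stop_words]
    return ' '.join(tokens)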
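
The LD_clean preview printed further down in this notebook contains a token beginning `serviceba...`, which is consistent with the regex deleting the hyphen in "service-based" rather than splitting the word. The sketch below compares the notebook's behaviour with a hypothetical variant that substitutes a space instead. The helper name `strip_non_letters` and the sample sentence are illustrative assumptions, not part of the notebook, and whether merged or split tokens are preferable depends on the downstream STM vocabulary.

```python
import re

def strip_non_letters(text, replacement):
    """Lowercase, replace non-alphabetic characters, and collapse whitespace.

    `replacement` is '' to mirror clean_text, or ' ' for the hypothetical variant.
    """
    text = text.lower()
    text = re.sub(r'[^a-zA-Z\s]', replacement, text)
    return re.sub(r'\s+', ' ', text).strip()

# Made-up sample sentence, used only for illustration
sample = "Ecosystem service-based adaptation (2015-2020)"

merged = strip_non_letters(sample, '')   # same regex behaviour as clean_text
split = strip_non_letters(sample, ' ')   # hypothetical alternative

print(merged)  # ecosystem servicebased adaptation
print(split)   # ecosystem service based adaptation
```

If the split form is preferred, the only change needed in `clean_text` would be the replacement string in the first `re.sub` call; the later whitespace collapse already removes the extra spaces this introduces.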

In [None]:
df.columns

Index(['Year', 'CrsID', 'ProjectTitle', 'Bi_Multi', 'SectorCode',
       'LongDescription', 'ClimateMitigation', 'ClimateAdaptation',
       'DonorName', 'RecipientName', 'RegionName', 'IncomegroupName',
       'USD_Commitment', 'USD_Disbursement', 'USD_Received', 'DonorContinent',
       'RecipientContinent', 'LD_Eng', 'FrenchPartRemoved', 'DetectedLanguage',
       'TranslatedDescription', 'DetectedLanguage_tr', 'LD_clean',
       'word_count'],
      dtype='object')

In [None]:
df['LD_clean'] = df['TranslatedDescription'].apply(clean_text)

In [None]:
df.columns

Index(['Year', 'CrsID', 'ProjectTitle', 'Bi_Multi', 'SectorCode',
       'LongDescription', 'ClimateMitigation', 'ClimateAdaptation',
       'DonorName', 'RecipientName', 'RegionName', 'IncomegroupName',
       'USD_Commitment', 'USD_Disbursement', 'USD_Received', 'DonorContinent',
       'RecipientContinent', 'LD_Eng', 'FrenchPartRemoved', 'DetectedLanguage',
       'TranslatedDescription', 'DetectedLanguage_tr', 'LD_clean',
       'word_count'],
      dtype='object')

In [None]:
df['word_count'] = df['LongDescription'].astype(str).str.split().str.len()
df[df['word_count'] > 100][["LongDescription", "TranslatedDescription", "LD_clean"]].head()

,LongDescription,TranslatedDescription,LD_clean
37,"Background: Comme pour toute l'Afrique, la pr...","Background: As with all of Africa, production ...",background africa production access energy maj...
38,"Background: Comme pour toute l'Afrique, la pr...","Background: As with all of Africa, production ...",background africa production access energy maj...
39,There is growing interest in the potential for...,There is growing interest in the potential for...,growing interest potential ecosystem serviceba...
259,Lake Chad is situated in the poverty stricken ...,Lake Chad is situated in the poverty stricken ...,lake chad situated poverty stricken sahel regi...
276,Background: The Junior Professional Officers ...,Background: The Junior Professional Officers ...,background junior professional officer jpo pro...


In [None]:
df.to_csv('/content/energy_clean.csv', index=False)
print(f"Final {len(df)} records for STM Modelling.")

Final 64212 records for STM Modelling.
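
Cleaning can leave some descriptions empty, for example when a description is missing or consists only of stopwords and digits. It may be worth counting such rows before handing the file to STM. The sketch below is a minimal check on a plain list of strings; the helper name `find_empty_documents` and the sample list are hypothetical, and in the notebook it could be applied to `df['LD_clean'].tolist()` before saving.

```python
def find_empty_documents(texts):
    """Return the positions of documents that are empty or whitespace-only."""
    return [i for i, t in enumerate(texts) if not str(t).strip()]

# Made-up cleaned documents, used only for illustration
cleaned = [
    "lake chad situated poverty stricken sahel",
    "",
    "   ",
    "background junior professional officer",
]

empty_positions = find_empty_documents(cleaned)
print(empty_positions)  # [1, 2]
```

Note that pandas writes empty strings as empty CSV fields, and `read_csv` reads those back as NaN by default, so a check run after reloading the file would need to test for missing values instead.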
