# Preprocessing

In this Jupyter Notebook, we preprocess the metadata parquet file for similarity analysis. We use the [NLTK](http://www.nltk.org) package to build a preprocessing pipeline for the dataset.

## Setup

We will perform similarity analysis on metadata records using the [Natural Language Toolkit (NLTK)](http://www.nltk.org/howto/twitter.html) package, an open-source Python library for natural language processing. It has modules for collecting, handling, and processing text, which we apply to the metadata records.

The data contains metadata for the data records on geo.ca. It is saved in an S3 bucket in the dev environment: webpresence-geocore-geojson-to-parquet-dev.

In [26]:
import boto3
import logging 
from botocore.exceptions import ClientError

import matplotlib.pyplot as plt 

import pandas as pd 
import numpy as np
import io
import os 


## About the metadata parquet dataset

The metadata parquet dataset contains the properties identified in the [GeoCore format](https://canadian-geospatial-platform.github.io/geocore/). It contains 7000+ records, each with a uuid. We need to download the data from S3 and open it in the workspace (or on your local computer).


In [2]:
# Function to read the parquet file as a pandas dataframe
def open_S3_file_as_df(bucket_name, file_name):
    """Open a S3 parquet file from bucket and filename and return the parquet as pandas dataframe
    :param bucket_name: Bucket name
    :param file_name: Specific file name to open
    :return: pandas dataframe with the file contents, or None if the S3 request fails
    """
    try:
        s3 = boto3.resource('s3')
        s3_object = s3.Object(bucket_name, file_name)  # avoid shadowing the built-in `object`
        body = s3_object.get()['Body'].read()
        df = pd.read_parquet(io.BytesIO(body))
        print(f'Loading {file_name} from {bucket_name} to pandas dataframe')
        return df
    except ClientError as e:
        logging.error(e)
        return None
file_name = "records.parquet"
bucket_name = "webpresence-geocore-geojson-to-parquet-dev"
df = open_S3_file_as_df(bucket_name, file_name)

Loading records.parquet from webpresence-geocore-geojson-to-parquet-dev to pandas dataframe


We can print a few examples from the metadata record dataset to see what information it contains. 

Explore the dataset structure, including its shape, column names, and datatype.  


In [28]:
df.info() 
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7343 entries, 0 to 7342
Data columns (total 68 columns):
 #   Column                                                         Non-Null Count  Dtype 
---  ------                                                         --------------  ----- 
 0   features_type                                                  7343 non-null   string
 1   features_geometry_type                                         7343 non-null   string
 2   features_geometry_coordinates                                  7343 non-null   string
 3   features_properties_id                                         7343 non-null   object
 4   features_properties_title_en                                   7342 non-null   string
 5   features_properties_title_fr                                   7342 non-null   string
 6   features_properties_description_en                             7343 non-null   string
 7   features_properties_description_fr                             7343 n

Unnamed: 0,features_type,features_geometry_type,features_geometry_coordinates,features_properties_id,features_properties_title_en,features_properties_title_fr,features_properties_description_en,features_properties_description_fr,features_properties_keywords_en,features_properties_keywords_fr,...,features_properties_contact,features_properties_credits,features_properties_cited,features_properties_distributor,features_properties_options,features_properties_temporalExtent_end_@indeterminatePosition,features_properties_temporalExtent_end_#text,features_properties_plugins,features_properties_sourceSystemName,features_popularity
0,Feature,Polygon,"[[[-141.003, 41.6755], [-52.6174, 41.6755], [-...",000183ed-8864-42f0-ae43-c4313a860720,"Principal Mineral Areas, Producing Mines, and ...","Principales régions minières, principales mine...",This dataset is produced and published annuall...,Ce jeu de données est produit et publié annuel...,"mineralization, mineral occurrences, mines, hy...","minéralisation, indices minéralisés, mines, hy...",...,"[{""individual"": ""null"", ""position"": {""en"": ""nu...",[],"[{""individual"": ""null"", ""position"": {""en"": ""nu...","[{""individual"": ""null"", ""position"": {""en"": ""nu...","[{""url"": ""https://maps-cartes.services.geo.ca/...",,,[],,1250806
1,Feature,Polygon,"[[[-142, 41], [-52, 41], [-52, 84], [-142, 84]...",7f245e4d-76c2-4caa-951a-45d1d2051333,"Canadian Digital Elevation Model, 1945-2011","Modèle numérique d'élévation du Canada, 1945-2011",This collection is a legacy product that is no...,Ce produit fait maintenant partie du patrimoin...,"Canada, Earth Sciences, elevation, relief, geo...","Canada, Sciences de la Terre, élévation, relie...",...,"[{""individual"": ""null"", ""position"": {""en"": ""nu...",[],"[{""individual"": ""null"", ""position"": {""en"": ""nu...","[{""individual"": ""null"", ""position"": {""en"": ""nu...","[{""url"": ""https://maps.geogratis.gc.ca/wms/ele...",,,[],,210798
2,Feature,Polygon,"[[[-141.003, 41.6755], [-52.6174, 41.6755], [-...",085024ac-5a48-427a-a2ea-d62af73f2142,Canada's National Earthquake Scenario Catalogue,Catalogue national de scénarios de tremblement...,"The National Earthquake Scenario Catalogue, pr...",Le dépôt est utilisé pour l’élaboration du cat...,"Emergency preparedness, Earth sciences, Earthq...","Protection civile, Sciences de la terre, Tremb...",...,"[{""individual"": ""Dr. Tiegan Hobbs"", ""position""...",[],"[{""individual"": ""Nicky Hastings"", ""position"": ...","[{""individual"": ""null"", ""position"": {""en"": ""nu...","[{""url"": ""https://github.com/OpenDRR/earthquak...",,,[],,140088
3,Feature,Polygon,"[[[-104.75571511, 50.42392886], [-104.56356008...",03ccfb5c-a06e-43e3-80fd-09d4f8f69703,Temporal Series of the National Air Photo Libr...,Série temporelle de la photothèque nationale d...,"Note: To visualize the data in the viewer, zoo...",Note: Pour visualiser les données dans l’outil...,"Mosaic, Aerial photography, Access to informat...","Mosaïque, Photographie aérienne, Accès à l'inf...",...,"[{""individual"": ""null"", ""position"": {""en"": ""nu...",[],"[{""individual"": ""null"", ""position"": {""en"": ""Na...","[{""individual"": ""null"", ""position"": {""en"": ""nu...","[{""url"": ""https://datacube-prod-data-public.s3...",,,[],,120162
4,Feature,Polygon,"[[[-141.003, 41.6755], [-52.6174, 41.6755], [-...",488faf70-b50b-4749-ac1c-a1fd44e06f11,Indigenous Mining Agreements,Ententes minières autochtones,The Indigenous Mining Agreements dataset provi...,Les données des ententes minières autochtones ...,"Indigenous, First Nations, Métis, Indigenous a...","Autochtones, Premières nations, Métis, Affaire...",...,"[{""individual"": ""Melanie Campbell"", ""position""...",[],"[{""individual"": ""null"", ""position"": {""en"": ""nu...","[{""individual"": ""null"", ""position"": {""en"": ""nu...","[{""url"": ""https://atlas.gc.ca/imaema/en/"", ""pr...",,,[],,111036


In [29]:
row_count, column_count = df.shape 
print(f'The shape of the raw metadata parquet dataset is {df.shape}')

colnames = list(df.columns)
print(f'The column names are {colnames}')

type_int64 = df.dtypes[df.dtypes=='int64']
print(f'\n The columns that have int64 as data type are {type_int64}') 
type_string = df.dtypes[df.dtypes=='string']
print(f'\n The columns that have string as data type are {type_string}') 

The shape of the raw metadata parquet dataset is (7343, 68)
The column names are ['features_type', 'features_geometry_type', 'features_geometry_coordinates', 'features_properties_id', 'features_properties_title_en', 'features_properties_title_fr', 'features_properties_description_en', 'features_properties_description_fr', 'features_properties_keywords_en', 'features_properties_keywords_fr', 'features_properties_topicCategory', 'features_properties_parentIdentifier', 'features_properties_date_published_text', 'features_properties_date_published_date', 'features_properties_date_created_text', 'features_properties_date_created_date', 'features_properties_date_revision_text', 'features_properties_date_revision_date', 'features_properties_date_notavailable_text', 'features_properties_date_notavailable_date', 'features_properties_date_inforce_text', 'features_properties_date_inforce_date', 'features_properties_date_adopted_text', 'features_properties_date_adopted_date', 'features_properties_

We can see that the data has 68 columns, and most of them are stored as strings. The data contains both English text and its French translation. The English columns most relevant to this analysis are:
* "features_properties_title_en"
* "features_properties_description_en"
* "features_properties_keywords_en"

## Data selection and data cleaning 
Since there are 68 columns, we need to select the columns with the most descriptive information to train the model for similarity. For the current models, we limit the data to the English title, keywords, and description.


In [30]:
# df[list_of_columns] selects columns; df.loc[] and df.iloc[] select rows
selected_cols = ['features_properties_id', 'features_properties_title_en','features_properties_description_en','features_properties_keywords_en']
df_en = df[selected_cols]
df_en.head()

Unnamed: 0,features_properties_id,features_properties_title_en,features_properties_description_en,features_properties_keywords_en
0,000183ed-8864-42f0-ae43-c4313a860720,"Principal Mineral Areas, Producing Mines, and ...",This dataset is produced and published annuall...,"mineralization, mineral occurrences, mines, hy..."
1,7f245e4d-76c2-4caa-951a-45d1d2051333,"Canadian Digital Elevation Model, 1945-2011",This collection is a legacy product that is no...,"Canada, Earth Sciences, elevation, relief, geo..."
2,085024ac-5a48-427a-a2ea-d62af73f2142,Canada's National Earthquake Scenario Catalogue,"The National Earthquake Scenario Catalogue, pr...","Emergency preparedness, Earth sciences, Earthq..."
3,03ccfb5c-a06e-43e3-80fd-09d4f8f69703,Temporal Series of the National Air Photo Libr...,"Note: To visualize the data in the viewer, zoo...","Mosaic, Aerial photography, Access to informat..."
4,488faf70-b50b-4749-ac1c-a1fd44e06f11,Indigenous Mining Agreements,The Indigenous Mining Agreements dataset provi...,"Indigenous, First Nations, Métis, Indigenous a..."


In [6]:
df_en.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7343 entries, 0 to 7342
Data columns (total 4 columns):
 #   Column                              Non-Null Count  Dtype 
---  ------                              --------------  ----- 
 0   features_properties_id              7343 non-null   object
 1   features_properties_title_en        7342 non-null   string
 2   features_properties_description_en  7343 non-null   string
 3   features_properties_keywords_en     6194 non-null   string
dtypes: object(1), string(3)
memory usage: 229.6+ KB


Missing values.

In the selected variables, missing values are shown as NaN. We should replace each NaN with an empty string so that the strings can be concatenated later.

In [32]:
df_en = df_en.fillna('')
print("The NaN values in the English columns after cleaning are \n") 
df_en.info()

The NaN values in the English columns after cleaning are 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7343 entries, 0 to 7342
Data columns (total 4 columns):
 #   Column                              Non-Null Count  Dtype 
---  ------                              --------------  ----- 
 0   features_properties_id              7343 non-null   object
 1   features_properties_title_en        7343 non-null   string
 2   features_properties_description_en  7343 non-null   string
 3   features_properties_keywords_en     7343 non-null   string
dtypes: object(1), string(3)
memory usage: 229.6+ KB
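
To see why the empty-string replacement matters for the later concatenation, here is a minimal sketch on a pair of toy Series (the values are made up for illustration and are not taken from the dataset). With the pandas `string` dtype, a missing value propagates through `+`, so a single missing keyword would blank out the whole combined text.

```python
import pandas as pd

# Toy data (hypothetical values): the second record has no keywords
titles = pd.Series(["Lake map", "Trail map"], dtype="string")
keywords = pd.Series(["water", pd.NA], dtype="string")

# Without cleaning, the missing keyword makes the whole result missing
combined_raw = titles + " " + keywords

# After fillna(''), the title survives the concatenation
combined_clean = titles + " " + keywords.fillna("")

print(combined_raw.isna().tolist())
print(combined_clean.tolist())
```

Note that the cleaned result can carry a trailing space when the last field is empty, which is usually harmless for whitespace-based splitting.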


Some fields are labeled "Not Available; Indisponible" to indicate an empty value. We should replace these values with an empty string too.

In [33]:
df_en = df_en.replace('Not Available; Indisponible', '')
# Find columns containing 'Not Available; Indisponible'
columns_with_string = df_en.columns[df_en.apply(lambda col: col.isin(['Not Available; Indisponible'])).any()].tolist()
print("Columns with 'Not Available; Indisponible':", columns_with_string)

# Find rows containing 'Not Available; Indisponible'
rows_with_string = df_en[df_en.apply(lambda row: row.isin(['Not Available; Indisponible'])).any(axis=1)]
print("\nRows with 'Not Available; Indisponible':", rows_with_string)


Columns with 'Not Available; Indisponible': []

Rows with 'Not Available; Indisponible': Empty DataFrame
Columns: [features_properties_id, features_properties_title_en, features_properties_description_en, features_properties_keywords_en]
Index: []
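
One detail worth keeping in mind: `DataFrame.replace` with a plain string only replaces cells whose entire value equals that string. The sketch below, on made-up values, contrasts this with `Series.str.replace`, which also removes the label when it is embedded in a longer string. Whether the real data contains such embedded labels is not checked here, so treat this as an illustration rather than a required step.

```python
import pandas as pd

label = "Not Available; Indisponible"

# Toy data (hypothetical values): one exact match, one embedded occurrence
toy = pd.DataFrame({"keywords": [label, "roads, " + label]})

# Whole-cell replacement: only the exact match is emptied
whole_cell = toy.replace(label, "")

# Substring replacement: the label is removed wherever it appears
substring = toy["keywords"].str.replace(label, "", regex=False)

print(whole_cell["keywords"].tolist())
print(substring.tolist())
```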


Concatenate the strings from the three selected variables into a new variable called "metadata_en", which will be used as the text for similarity analysis. 


In [34]:
df_en['metadata_en'] = df_en['features_properties_title_en'] + ' ' + df_en['features_properties_description_en'] + ' ' + df_en['features_properties_keywords_en'] 
df_en.info()
if df_en['metadata_en'].isnull().any():
    df_en['metadata_en'] = df_en['metadata_en'].fillna('')
    

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7343 entries, 0 to 7342
Data columns (total 5 columns):
 #   Column                              Non-Null Count  Dtype 
---  ------                              --------------  ----- 
 0   features_properties_id              7343 non-null   object
 1   features_properties_title_en        7343 non-null   string
 2   features_properties_description_en  7343 non-null   string
 3   features_properties_keywords_en     7343 non-null   string
 4   metadata_en                         7343 non-null   string
dtypes: object(1), string(4)
memory usage: 287.0+ KB


Checking for duplications. 

Before preprocessing the data, we also need to check for duplications in the records. This is an important step for the similarity analysis.
1. First, check for duplications in uuid, and delete the duplicated rows.
2. Then, check for duplications in 'metadata_en', and delete the duplicated rows. If 'metadata_en' is duplicated, we will train the models on the exact same text for the duplicates, which might create bias.
3. Record the uuids of the duplicated rows, and upload them to S3 as a Parquet file.
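
The steps above rely on the `keep` argument of `duplicated` and `drop_duplicates`. As a minimal sketch on a made-up four-row frame (the column names `id` and `text` are placeholders, not the dataset's columns): `keep=False` flags every copy of a duplicated value, while `keep='first'` flags only the later copies that would actually be dropped. This is why the number of recorded uuids can be larger than the number of deleted rows.

```python
import pandas as pd

# Toy data (hypothetical): rows "a" and "c" share the same text
toy = pd.DataFrame({"id": ["a", "b", "c", "d"],
                    "text": ["x", "y", "x", "z"]})

# keep=False marks every member of a duplicated group
all_copies = toy[toy.duplicated(["text"], keep=False)]

# keep='first' marks only the rows that drop_duplicates would remove
removed_only = toy[toy.duplicated(["text"], keep="first")]

# The deduplicated frame keeps the first occurrence of each text
deduplicated = toy.drop_duplicates(subset=["text"], keep="first")

print(all_copies["id"].tolist())
print(removed_only["id"].tolist())
print(deduplicated["id"].tolist())
```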


In [35]:
# Find rows with duplications in 'features_properties_id'
duplicateRowsDF1 = df_en[df_en.duplicated(['features_properties_id'], keep=False)]
#print("Duplicate Rows except first occurrence based on features_properties_id are :")
#print(duplicateRowsDF1)
df_en = df_en.drop_duplicates(subset=['features_properties_id'], keep='first')
print(df_en.shape)

# Find rows with duplications in 'metadata_en'
duplicateRowsDF2 = df_en[df_en.duplicated(['metadata_en'], keep=False)]
#print("Duplicate Rows except first occurrence based on Metadata_en are :")
#print(duplicateRowsDF2.shape)
df_en = df_en.drop_duplicates(subset=['metadata_en'], keep='first')
print(df_en.shape)

# Combine the duplicated rows; they are uploaded to S3 in the next cell
duplicateRowsDF = pd.concat([duplicateRowsDF1, duplicateRowsDF2])

print('The number of unique uuids among the duplicated rows is: ')
print(len(duplicateRowsDF['features_properties_id'].unique()))
print(duplicateRowsDF['features_properties_id'].unique())


(7342, 5)
(7153, 5)
The number of unique uuids among the duplicated rows is: 
287
['0c25772d-da22-4ac1-b130-c1d97b935f6f'
 'cfe12472-0088-474d-92fc-908524196834'
 'e268a64c-89f8-49bf-befe-1e0fa5c2462e'
 '45dbaf52-c4c8-4e5d-89fe-d14cec62fc41'
 '81467af6-ac07-4127-ba5e-a96c8e425e35'
 '8d9de1e5-584d-443e-82aa-03583ba6f27a'
 '5de016da-7974-4606-b714-ac25831b0ced'
 'd1a2a741-5514-4ff8-a539-ce1033ef5b58'
 'e99a44a4-edc8-47e3-8fc3-38cc98c2f637' 'CGDIWH-141021' 'CGDIWH-141020'
 'CGDIWH-85410' 'CGDIWH-84255' 'CGDIWH-142546' 'CGDIWH-144657'
 'CGDIWH-143710' 'CGDIWH-144646' 'CGDIWH-141928' 'CGDIWH-141932'
 'CGDIWH-139658' 'CGDIWH-116333' 'CGDIWH-117986' 'CGDIWH-118535'
 'CGDIWH-139569' 'CGDIWH-114342' 'CGDIWH-114774' 'CGDIWH-86619'
 'CGDIWH-116120' 'CGDIWH-86608' 'CGDIWH-129163' 'CGDIWH-128172'
 'CGDIWH-126311' 'CGDIWH-126319' 'CGDIWH-145682' 'CGDIWH-150745'
 'CGDIWH-150614' 'CGDIWH-69036' 'CGDIWH-57666' 'CGDIWH-69041'
 'CGDIWH-150538' 'CGDIWH-150542' '2b367fbb-e6ab-48b5-b762-f49da1ec4114'
 'CGDIWH-15

In [11]:
# Upload the duplicate data to S3 as a parquet file 
def upload_dataframe_to_s3_as_parquet(df, bucket_name, file_key):
    # Save DataFrame as a Parquet file locally
    parquet_file_path = 'temp.parquet'
    df.to_parquet(parquet_file_path)

    # Create an S3 client
    s3_client = boto3.client('s3')

    # Upload the Parquet file to S3 bucket
    try:
        s3_client.upload_file(parquet_file_path, bucket_name, file_key)
        print(f'Uploading {file_key} to {bucket_name} as parquet file')
        return True
    except ClientError as e:
        logging.error(e)
        return False
    finally:
        # Delete the local Parquet file whether or not the upload succeeded
        os.remove(parquet_file_path)
    
    
upload_dataframe_to_s3_as_parquet(df=duplicateRowsDF,  bucket_name='nlp-data-preprocessing', file_key='Duplicated_records.parquet')  

Uploading Duplicated_records.parquet to nlp-data-preprocessing as parquet file


True

Looking at the cleaned, but not yet preprocessed, texts.

In [37]:
print(df_en[['metadata_en']].sample(n=1, random_state=42).iloc[0])
print(df_en[['metadata_en']].sample(n=1, random_state=2000).iloc[0])

metadata_en    Mining activities Mining activities include in...
Name: 2416, dtype: string
metadata_en    Leduc Cityworks Cityworks utilities This third...
Name: 5784, dtype: string


## Preprocess raw text for similarity analysis 
Data preprocessing is one of the critical steps in any machine learning project. It includes cleaning and formatting the data before feeding it into a machine learning algorithm. For NLP, the preprocessing steps comprise the following tasks:

* Tokenizing the string (skipped)
* Lowercasing
* Removing [stop words](https://gist.github.com/ethen8181/d57e762f81aa643744c2ffba5688d33a)  and [punctuation](https://towardsdatascience.com/tf-idf-for-document-ranking-from-scratch-in-python-on-real-world-dataset-796d339a4089) 
* Stemming
* Removing apostrophes


In [38]:
# Our selected sample. Complex enough to exemplify each step
metadata_en = df_en.loc[20:30]['metadata_en']
print(f'The selected sample type is {type(metadata_en)} and the text is: \n')
print(metadata_en)

print(f'The first row of the select sample is type {type(metadata_en.iloc[0])} and the text is: \n')
print(metadata_en.iloc[0])

The selected sample type is <class 'pandas.core.series.Series'> and the text is: 

20    Areas of High Quality Natural Cover in the Lak...
21    High Resolution Digital Elevation Model Mosaic...
22    Coastal biodiversity of the benthic epifauna o...
23    Heat wave days for warm season crops (>35°C) H...
24    Coastal BC Campsites The locations of coastal ...
25    OD0139 PEI Confederation Trail Available as an...
26    Digital Elevation Model for British Columbia -...
27    Yukon Reference Information Created for distri...
28    Spring 2020 flood mapping # #Données related t...
29    Seasonal Movements and Diving of Ringed Seals,...
30    Major Projects Inventory The major projects in...
Name: metadata_en, dtype: string
The first row of the select sample is type <class 'str'> and the text is: 

Areas of High Quality Natural Cover in the Lake Simcoe Watershed Natural cover includes areas that have been mapped as woodlands (including plantations and hedgerows), wetlands and other rare 

In [39]:
import nltk
#nltk.download('stopwords')
from nltk.corpus import stopwords          # module for stop words that come with NLTK
from nltk.stem import PorterStemmer        # module for stemming
from nltk.tokenize import word_tokenize   # module for tokenizing strings
import string                              # for string operations
import re

### Tokenize the string
To tokenize means to split the strings into individual words without blanks or tabs. In this same step, we will also convert each word in the string to lower case. The [tokenize](https://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize.casual) module from NLTK allows us to do this easily (note that `word_tokenize` relies on NLTK's Punkt tokenizer data, which may need to be downloaded once with `nltk.download`):

In [40]:
# Lowercase the string, then tokenize it into words
def tokenize_text(text):
    return word_tokenize(text.lower())
tokens = tokenize_text(metadata_en.iloc[0])
print(tokens)

['areas', 'of', 'high', 'quality', 'natural', 'cover', 'in', 'the', 'lake', 'simcoe', 'watershed', 'natural', 'cover', 'includes', 'areas', 'that', 'have', 'been', 'mapped', 'as', 'woodlands', '(', 'including', 'plantations', 'and', 'hedgerows', ')', ',', 'wetlands', 'and', 'other', 'rare', 'vegetative', 'cover', 'communities', '.', 'data', 'here', 'represent', 'areas', 'outlined', 'in', 'the', 'lake', 'simcoe', 'protection', 'plan', 'policy', '6.48', '(', 'june', '2011', ')', '.', 'instructions', 'for', 'downloading', 'this', 'dataset', ':', '*', 'select', 'the', 'link', 'below', 'and', 'scroll', 'down', 'the', 'metadata', 'record', 'page', 'until', 'you', 'find', '*', '*', 'transfer', 'options', '*', '*', 'in', 'the', '*', '*', 'distribution', 'information', '*', '*', 'section', '*', 'select', 'the', 'link', 'beside', 'the', '*', '*', 'data', 'for', 'download', '*', '*', 'label', '*', 'you', 'must', 'provide', 'your', 'name', ',', 'organization', 'and', 'email', 'address', 'in', 'ord
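
For readers without NLTK's tokenizer data at hand, the idea of lowercasing and splitting into word and punctuation tokens can be sketched with the standard library alone. This regex-based `simple_tokenize` is a hypothetical stand-in, not what the notebook uses, and it is not equivalent to `word_tokenize`: for example, it would split `6.48` into three tokens.

```python
import re

def simple_tokenize(text):
    """Lowercase the text, then split it into word and punctuation tokens."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark
    return re.findall(r"\w+|[^\w\s]", text.lower())

tokens = simple_tokenize("Lake Simcoe Protection Plan (June 2011).")
print(tokens)
```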

### Remove stop words and punctuation

The next step is to remove stop words and punctuation. Stop words are words that don't add significant meaning to the text. You'll see the list provided by NLTK when you run the cells below.

In [41]:
#Import the english stop words list from NLTK
nltk.download('stopwords')
stopwords_english = stopwords.words('english') 
print('Stop words\n')
print(stopwords_english)

print('\nPunctuation\n')
print(string.punctuation)


Stop words

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so

In [17]:
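# remove stop words and punctuation for pandas series
def remove_stopwords_pdf(text):
    # Build the stop word set once; set membership tests are faster than list lookups
    stop_words = set(stopwords.words('english')) # English stop words see:https://gist.github.com/ethen8181/d57e762f81aa643744c2ffba5688d33a
    # The comparison is case-sensitive, so lowercase the text first to catch words like "This"
    text = text.apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))
    return text
# Remove punctuation 
def remove_punctuation_pdf(text):
    # regex=True is needed for the pattern to be treated as a regular expression in pandas >= 2.0
    text = text.str.replace(r'[^\w\s]+', '', regex=True)
    return text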
test_str = remove_stopwords_pdf(metadata_en)
test_str = remove_punctuation_pdf(test_str)
print('Text after removing stop words and punctuation\n')
print(test_str.iloc[0])

Text after removing stop words and punctuation

Areas High Quality Natural Cover Lake Simcoe Watershed Natural cover includes areas mapped woodlands including plantations hedgerows wetlands rare vegetative cover communities Data represent areas outlined Lake Simcoe Protection Plan Policy 648 June 2011 Instructions downloading dataset  select link scroll metadata record page find Transfer Options Distribution Information section  select link beside Data download label  must provide name organization email address order access dataset This product requires use GIS software GIS geographic information system Economy Business Environment Natural Resources Environment energy Government information




In [18]:
# remove stop words and punctuation for a list of tokens
def remove_stopwords_punctuation_tokens(tokens):
    stop_words = stopwords.words('english') 
    tokens = [token for token in tokens if token.isalnum() and token not in stop_words]
    return tokens
tokens = remove_stopwords_punctuation_tokens(tokens)
print('Text after removing stop words and punctuation\n')
print(tokens)

Text after removing stop words and punctuation

['areas', 'high', 'quality', 'natural', 'cover', 'lake', 'simcoe', 'watershed', 'natural', 'cover', 'includes', 'areas', 'mapped', 'woodlands', 'including', 'plantations', 'hedgerows', 'wetlands', 'rare', 'vegetative', 'cover', 'communities', 'data', 'represent', 'areas', 'outlined', 'lake', 'simcoe', 'protection', 'plan', 'policy', 'june', '2011', 'instructions', 'downloading', 'dataset', 'select', 'link', 'scroll', 'metadata', 'record', 'page', 'find', 'transfer', 'options', 'distribution', 'information', 'section', 'select', 'link', 'beside', 'data', 'download', 'label', 'must', 'provide', 'name', 'organization', 'email', 'address', 'order', 'access', 'dataset', 'product', 'requires', 'use', 'gis', 'software', 'gis', 'geographic', 'information', 'system', 'economy', 'business', 'environment', 'natural', 'resources', 'environment', 'energy', 'government', 'information']
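
The token filter above combines two tests: the token must be alphanumeric and must not be a stop word. The sketch below reproduces that logic with a deliberately tiny, hypothetical stop-word set so that it runs without downloading the NLTK list. It also shows a side effect of `isalnum()`: tokens such as `6.48` are dropped along with the punctuation.

```python
# Hypothetical, tiny stop-word set for illustration; NLTK's English list is much longer
STOP_WORDS = {"of", "in", "the", "and"}

def filter_tokens(tokens, stop_words=STOP_WORDS):
    """Keep alphanumeric tokens that are not stop words."""
    return [t for t in tokens if t.isalnum() and t not in stop_words]

sample = ["areas", "of", "high", "quality", "(", "6.48", ")", "in", "the", "lake"]
filtered = filter_tokens(sample)
print(filtered)
```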


### Remove apostrophe

In [19]:
# Remove apostrophes, double quotes, and commas
def remove_apostrophe(data):
    data = data.str.replace(r"[\"\',]", '', regex=True)
    return data
test_str = remove_apostrophe(test_str)
print(test_str.iloc[0])

Areas High Quality Natural Cover Lake Simcoe Watershed Natural cover includes areas mapped woodlands including plantations hedgerows wetlands rare vegetative cover communities Data represent areas outlined Lake Simcoe Protection Plan Policy 648 June 2011 Instructions downloading dataset  select link scroll metadata record page find Transfer Options Distribution Information section  select link beside Data download label  must provide name organization email address order access dataset This product requires use GIS software GIS geographic information system Economy Business Environment Natural Resources Environment energy Government information




In [20]:
def remove_apostrophe_tokens(tokens):
    # Strip a trailing possessive 's from each token; contractions such as "you're" are left unchanged
    tokens = [re.sub(r"\'s$", "", token) for token in tokens]
    return tokens
remove_apostrophe_tokens(['areas', 'high', 'quality', 'you\'re', 'looking', 'for'])


['areas', 'high', 'quality', "you're", 'looking', 'for']
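
As the output shows, the pattern only targets a trailing possessive `'s`. The metadata text also contains typographic apostrophes (as in "Canada’s"), which the ASCII-only pattern does not match. The following sketch compares the original pattern with a variant that accepts both apostrophe characters. The variant is a suggestion under the assumption that such tokens reach this step at all, which depends on how the tokenizer splits them.

```python
import re

# Toy tokens (hypothetical): ASCII possessive, typographic possessive, contraction
tokens = ["canada's", "canada’s", "you're", "areas"]

# Pattern used above: matches only the ASCII apostrophe
ascii_only = [re.sub(r"'s$", "", t) for t in tokens]

# Variant: accept either the ASCII or the typographic apostrophe
both_forms = [re.sub(r"['’]s$", "", t) for t in tokens]

print(ascii_only)
print(both_forms)
```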

### Stemming

Stemming is the process of converting a word to its most general form, or stem. This helps in reducing the size of our vocabulary.

Consider the words: 
 * **learn**
 * **learn**ing
 * **learn**ed
 * **learn**t
 
All these words share the common root **learn**. However, in some cases, the stemming process produces words that are not correct spellings of the root word, for example **happi** and **sunni**. That's because it chooses the most common stem for related words. For example, we can look at the set of words that comprises the different forms of happy:

 * **happ**y
 * **happi**ness
 * **happi**er
 
We can see that the prefix **happi** is more commonly used. We cannot choose **happ** because it is the stem of unrelated words like **happen**.
 
NLTK has different modules for stemming and we will be using the [PorterStemmer](https://www.nltk.org/api/nltk.stem.html#module-nltk.stem.porter) module which uses the [Porter Stemming Algorithm](https://tartarus.org/martin/PorterStemmer/).
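
As a quick check of the examples discussed above, this small sketch runs a few of the words through `PorterStemmer` directly. It assumes NLTK is installed; the stemmer itself does not need any downloaded corpus. The word list is chosen for illustration only.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Words taken from the discussion above
words = ["learn", "learning", "learned", "happy", "happiness"]

# Map each word to its stem
stems = {word: stemmer.stem(word) for word in words}
print(stems)
```

Comparing the stems for the two word families shows how different surface forms collapse to a shared vocabulary entry, even when that entry (such as `happi`) is not a dictionary word.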

In [21]:
# Stemming
def stemming(text):
    stemmer = PorterStemmer()
    text = text.apply(lambda x: ' '.join([stemmer.stem(word) for word in x.split()]))
    return text
test_str = stemming(test_str)
print(test_str.iloc[0])


area high qualiti natur cover lake simco watersh natur cover includ area map woodland includ plantat hedgerow wetland rare veget cover commun data repres area outlin lake simco protect plan polici 648 june 2011 instruct download dataset select link scroll metadata record page find transfer option distribut inform section select link besid data download label must provid name organ email address order access dataset thi product requir use gi softwar gi geograph inform system economi busi environ natur resourc environ energi govern inform


In [22]:
def stemming_tokens(tokens):
    # Instantiate stemming class
    stemmer = PorterStemmer() 
    tokens = [stemmer.stem(token) for token in tokens]
    return tokens
tokens = stemming_tokens(tokens)
print(tokens)
print(metadata_en.iloc[0])

['area', 'high', 'qualiti', 'natur', 'cover', 'lake', 'simco', 'watersh', 'natur', 'cover', 'includ', 'area', 'map', 'woodland', 'includ', 'plantat', 'hedgerow', 'wetland', 'rare', 'veget', 'cover', 'commun', 'data', 'repres', 'area', 'outlin', 'lake', 'simco', 'protect', 'plan', 'polici', 'june', '2011', 'instruct', 'download', 'dataset', 'select', 'link', 'scroll', 'metadata', 'record', 'page', 'find', 'transfer', 'option', 'distribut', 'inform', 'section', 'select', 'link', 'besid', 'data', 'download', 'label', 'must', 'provid', 'name', 'organ', 'email', 'address', 'order', 'access', 'dataset', 'product', 'requir', 'use', 'gi', 'softwar', 'gi', 'geograph', 'inform', 'system', 'economi', 'busi', 'environ', 'natur', 'resourc', 'environ', 'energi', 'govern', 'inform']
Areas of High Quality Natural Cover in the Lake Simcoe Watershed Natural cover includes areas that have been mapped as woodlands (including plantations and hedgerows), wetlands and other rare vegetative cover communities.

### process()


In [24]:
# Preprocess a pandas Series of text without tokenizing (stemming is disabled)
def process_without_tokens(text):
    text = text.str.lower() # convert to lowercase
    text = remove_stopwords_pdf(text)
    text = remove_punctuation_pdf(text)
    text = remove_apostrophe(text)
    #text = stemming(text)
    return text
df_en['metadata_en_processed'] = process_without_tokens(df_en['metadata_en'])





In [50]:
# An example of cleaned text 
print('For dataset {} \n'.format(df_en.loc[442]['features_properties_title_en']))
print('Original:  {}'.format(df_en.loc[442]['metadata_en']))
print("\n")
print('Cleaned:  {}'.format(df_en.loc[442]['metadata_en_processed']))

For dataset Important Areas for Cetaceans in Strait of Georgia Ecoregion 

Original:  Important Areas for Cetaceans in Strait of Georgia Ecoregion This layer details Important Areas (IAs) relevant to key cetacean species in the Strait of Georgia (SOG) ecoregion. This data was mapped to inform the selection of marine Ecologically and Biologically Significant Areas (EBSA). Experts have indicated that these areas are relevant based upon their high ranking in one or more of three criteria (Uniqueness, Aggregation, and Fitness Consequences). The distribution of IA's within ecoregions is used in the designation of EBSA's.\n\nCanada’s Oceans Act provides the legislative framework for an integrated ecosystem approach to management in Canadian oceans, particularly in areas considered ecologically or biologically significant. DFO has developed general guidance for the identification of ecologically or biologically significant areas. The criteria for defining such areas include uniqueness, aggreg

In [51]:
def process_tokens(text):
    # Tokenize text and lower case
    tokens = tokenize_text(text)
    # Remove stop words and punctuation
    tokens = remove_stopwords_punctuation_tokens(tokens)
    # Remove apostrophe
    tokens = remove_apostrophe_tokens(tokens)
    # Stemming
    #tokens = stemming_tokens(tokens)
    # Join tokens back into a string
    preprocessed_text = ' '.join(tokens)
    return preprocessed_text
# Apply text preprocessing to the 'Text' column
df_en['metadata_en_preprocessed_token'] = df_en['metadata_en'].apply(process_tokens)
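
The token-based pipeline above relies on helper functions defined earlier in the notebook. As a rough illustration of the same sequence of steps (lowercase, tokenize, strip punctuation and apostrophes, drop stop words, rejoin), here is a minimal standard-library sketch. The function name `sketch_process_tokens`, the tiny `STOP_WORDS` set, and the whitespace tokenizer are assumptions made for this example; they are stand-ins for, not copies of, the NLTK-based helpers used in this notebook, so the output may differ from the real pipeline.

```python
# Hypothetical stand-in for the stop-word list; the notebook itself uses NLTK.
STOP_WORDS = {"the", "of", "in", "for", "to", "and", "a", "is", "this"}


def sketch_process_tokens(text):
    """Lowercase, tokenize, strip punctuation and apostrophes, drop stop words, rejoin."""
    # Lowercase and split on whitespace (a simplification of a real tokenizer)
    tokens = text.lower().split()
    # Strip punctuation attached to the start or end of each token
    tokens = [tok.strip("()[]{}.,;:!?\"") for tok in tokens]
    # Remove straight and curly apostrophes inside tokens
    tokens = [tok.replace("'", "").replace("\u2019", "") for tok in tokens]
    # Keep only non-empty tokens that are not stop words
    tokens = [tok for tok in tokens if tok and tok not in STOP_WORDS]
    return " ".join(tokens)


example = "Important Areas for Cetaceans in the Strait of Georgia (SOG) ecoregion."
print(sketch_process_tokens(example))
# important areas cetaceans strait georgia sog ecoregion
```

Because every step discards information (case, punctuation, stop words), two records whose raw text differs only in those details end up with identical processed strings.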



In [52]:
# An example of cleaned text 
print('For dataset {} \n'.format(df_en.loc[442]['features_properties_title_en']))
print('Original:  {}'.format(df_en.loc[442]['metadata_en']))
print("\n")
print('Cleaned:  {}'.format(df_en.loc[442]['metadata_en_preprocessed_token']))

For dataset Important Areas for Cetaceans in Strait of Georgia Ecoregion 

Original:  Important Areas for Cetaceans in Strait of Georgia Ecoregion This layer details Important Areas (IAs) relevant to key cetacean species in the Strait of Georgia (SOG) ecoregion. This data was mapped to inform the selection of marine Ecologically and Biologically Significant Areas (EBSA). Experts have indicated that these areas are relevant based upon their high ranking in one or more of three criteria (Uniqueness, Aggregation, and Fitness Consequences). The distribution of IA's within ecoregions is used in the designation of EBSA's.\n\nCanada’s Oceans Act provides the legislative framework for an integrated ecosystem approach to management in Canadian oceans, particularly in areas considered ecologically or biologically significant. DFO has developed general guidance for the identification of ecologically or biologically significant areas. The criteria for defining such areas include uniqueness, aggreg

## Check for duplicates before exporting the processed data

In [53]:
# Get the shape of the dataframe
shape_training = df_en.shape
# Count the number of missing values in each column
missing_values_training = df_en.isnull().sum()
# Count the number of unique values in each column
unique_values_training = df_en.nunique()

shape_training, missing_values_training, unique_values_training

((7153, 7),
 features_properties_id                0
 features_properties_title_en          0
 features_properties_description_en    0
 features_properties_keywords_en       0
 metadata_en                           0
 metadata_en_processed                 0
 metadata_en_preprocessed_token        0
 dtype: int64,
 features_properties_id                7153
 features_properties_title_en          6930
 features_properties_description_en    5604
 features_properties_keywords_en       3779
 metadata_en                           7153
 metadata_en_processed                 7149
 metadata_en_preprocessed_token        6849
 dtype: int64)

In [54]:

# Remove duplicates based on 'metadata_en_preprocessed_token', keeping the first occurrence
df_en_deduplicated = df_en.drop_duplicates(subset='metadata_en_preprocessed_token')
# Check the shape of the deduplicated dataframe
shape_deduplicated = df_en_deduplicated.shape
shape_deduplicated

(6849, 7)
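
The shapes above show that deduplication removes 7153 - 6849 = 304 rows, which matches the unique count reported for `metadata_en_preprocessed_token`. Before discarding those rows it can be useful to see which records collapse to the same processed text. The sketch below shows the grouping logic in plain Python; the `records` list and its ids are hypothetical stand-ins for rows of `df_en`. In the notebook itself, a comparable view could likely be obtained with `DataFrame.duplicated(subset=..., keep=False)`.

```python
from collections import defaultdict

# Hypothetical (id, processed text) pairs standing in for rows of df_en
records = [
    ("id-1", "sponge reef areas pacific region"),
    ("id-2", "schools offering program list"),
    ("id-3", "sponge reef areas pacific region"),
    ("id-4", "science strategies summary ccs 2016"),
]

# Group record ids by their processed text
groups = defaultdict(list)
for record_id, processed in records:
    groups[processed].append(record_id)

# Keep only the texts shared by more than one record
duplicate_groups = {text: ids for text, ids in groups.items() if len(ids) > 1}

# Mirror the default behaviour of drop_duplicates: keep the first row of each group
kept_ids = [ids[0] for ids in groups.values()]
removed = len(records) - len(kept_ids)

print(duplicate_groups)  # {'sponge reef areas pacific region': ['id-1', 'id-3']}
print(kept_ids, removed)  # ['id-1', 'id-2', 'id-4'] 1
```

Reviewing the duplicate groups first makes it easier to judge whether the collapsed records are true duplicates or distinct datasets whose differences were stripped out during preprocessing.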

In [55]:

# Select the id, title, and processed text columns from the deduplicated data
df_sample = df_en_deduplicated[['features_properties_id', 'features_properties_title_en', 'metadata_en_processed','metadata_en_preprocessed_token']]
# Draw a reproducible random sample of 500 records
df_sample = df_sample.sample(n=500, random_state=1)
df_sample.head()
# Optional: export the sample to csv
#df_sample.to_csv('df_training.csv', index=False)


| | features_properties_id | features_properties_title_en | metadata_en_processed | metadata_en_preprocessed_token |
|---|---|---|---|---|
| 6530 | CGDIWH-161640 | 2021 Cartographic Boundary Files (FLC) | 2021 cartographic boundary files (flc) 2021 ca... | 2021 cartographic boundary files flc 2021 cart... |
| 4675 | d95901e9-e710-cf6a-bb80-31c2c9111439 | Schools Offering the Pre-primary Program | schools offering pre-primary program list scho... | schools offering program list schools offering... |
| 5305 | 7467e377-7dad-863d-d253-d0c7cb1819b4 | First vertical derivative of the magnetic fiel... | first vertical derivative magnetic field, eagl... | first vertical derivative magnetic field eagle... |
| 233 | 8ba7bced-b63f-462a-a8a1-7c7c8a7bcfa4 | Sponge Reef Areas of the Pacific Region | sponge reef areas pacific region sponge reefs ... | sponge reef areas pacific region sponge reefs ... |
| 6075 | CGDIWH-87590 | Science Strategies Summary by CCS (2016) | science strategies summary ccs (2016) "science... | science strategies summary ccs 2016 science st... |


### Upload to S3 bucket 

In [31]:
bucket_name_nlp='nlp-data-preprocessing'
#upload_dataframe_to_s3_as_parquet(df=duplicateRowsDF,  bucket_name=bucket_name_nlp, file_key='Duplicated_records.parquet')  
upload_dataframe_to_s3_as_parquet(df=df_en, bucket_name=bucket_name_nlp, file_key='Processed_records.parquet')

Uploading Processed_records.parquet to nlp-data-preprocessing as parquet file


True