Truncation and Lemmatization

Description

This project focuses on text processing using Python and the Natural Language Toolkit (nltk) library. The primary goal is to prepare textual data for information retrieval tasks by performing various preprocessing steps, including tokenization, punctuation removal, conversion to lowercase, removal of stopwords, stemming, and lemmatization. The project includes practical implementations and a comparative analysis of different stemming and lemmatization techniques.

Files Included

lab2.py: A Python script that performs various text processing tasks.
prueba.txt: A text file containing a sample story titled "The Robber Bridegroom".
Laboratorio 2 Truncado y Lematización.pdf: Official documentation detailing the objectives, methodology, and results of the project.

Notable Code Snippets

1. Reading and Tokenizing Text

This snippet reads the content of prueba.txt and tokenizes the text using the nltk library.

import os
import nltk
nltk.download('punkt')  # tokenizer models; newer NLTK releases may ask for 'punkt_tab' instead

# Read the content of the external file
archivo_path = os.path.join(os.path.dirname(__file__), 'prueba.txt')
with open(archivo_path, 'r', encoding='utf-8') as archivo:
    texto = archivo.read()

# Tokenization using NLTK
palabras = nltk.word_tokenize(texto)

2. Removing Punctuation

This snippet defines a regular expression to remove punctuation from the tokenized text.

import re
import string

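# Define regular expression to remove punctuation
# string.punctuation covers ASCII only, so the typographic apostrophe is added by hand
simbolos_extra = '’'
re_punc = re.compile('[%s%s]' % (re.escape(string.punctuation), re.escape(simbolos_extra)))

# Remove punctuation from each word; tokens that were only punctuation become ''
stripped = [re_punc.sub('', w) for w in palabras]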
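
To illustrate what the substitution does, the following minimal sketch applies the same pattern to a few made-up tokens. The sample list and the name non_empty are illustrative and not taken from lab2.py. Tokens that consist only of punctuation come back as empty strings, so a follow-up filter is usually needed.

```python
import re
import string

# Same pattern as in the snippet: ASCII punctuation plus the typographic apostrophe
simbolos_extra = '’'
re_punc = re.compile('[%s%s]' % (re.escape(string.punctuation), re.escape(simbolos_extra)))

# Hypothetical tokens, roughly as a tokenizer might produce them
sample_tokens = ['Hello', ',', 'world', '!', 'don’t', "it's"]

stripped = [re_punc.sub('', w) for w in sample_tokens]

# Pure-punctuation tokens are now '', so drop them
non_empty = [w for w in stripped if w]

print(stripped)
print(non_empty)
```

Dropping the empty strings here keeps them from reaching the stemmers and the lemmatizer later on.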

3. Stemming and Lemmatization

This snippet demonstrates the use of PorterStemmer, SnowballStemmer, and WordNetLemmatizer to stem and lemmatize the cleaned text.

from nltk.stem.porter import PorterStemmer
from nltk.stem import SnowballStemmer
from nltk.stem import WordNetLemmatizer

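# filtered_words is the token list after punctuation removal, lowercasing,
# and stopword removal (those steps are not shown in this excerpt)
# Stemming with PorterStemmer
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in filtered_words]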

# Stemming with SnowballStemmer
stemmer2 = SnowballStemmer('english')
stemmed_words2 = [stemmer2.stem(word) for word in filtered_words]

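# Lemmatization with WordNetLemmatizer (requires nltk.download('wordnet'))
# lemmatize() treats every word as a noun unless a pos argument is passed
lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word) for word in filtered_words]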
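
As a rough illustration of how the two stemmers can be compared side by side, the sketch below runs both on a short word list and flags the words where they disagree. The word list and the helper name compare_stemmers are hypothetical and not part of lab2.py. Neither stemmer needs extra NLTK data to be downloaded.

```python
from nltk.stem import PorterStemmer, SnowballStemmer


def compare_stemmers(words):
    """Return (word, porter_stem, snowball_stem) tuples for each word."""
    porter = PorterStemmer()
    snowball = SnowballStemmer('english')
    return [(w, porter.stem(w), snowball.stem(w)) for w in words]


# Hypothetical sample; in the project this would be filtered_words
sample = ['running', 'fairly', 'generously', 'stories']

for word, p, s in compare_stemmers(sample):
    # Flag the rows where the two stemmers produce different stems
    marker = '' if p == s else '  <- differs'
    print(f'{word:12} {p:10} {s:10}{marker}')
```

Printing the stems in columns makes it easy to spot where the two algorithms diverge before drawing conclusions about which one truncates more.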

Official Documentation Summary

The official documentation provided in "Laboratorio 2 Truncado y Lematización.pdf" outlines the following key points:

Objectives

To prepare texts for information retrieval by performing tokenization, punctuation removal, converting text to lowercase, removing stopwords, and truncating words using different stemming methods.
To compare the results of different stemming methods (Porter, Snowball) and lemmatization.

Methodology

Tokenization: Splitting text into individual tokens (words).
Punctuation Removal: Using regular expressions to strip punctuation from tokens.
Lowercase Conversion: Converting all tokens to lowercase.
Stopwords Removal: Eliminating common stopwords using the nltk library.
Stemming and Lemmatization: Applying PorterStemmer, SnowballStemmer, and WordNetLemmatizer to reduce words to their base or root form.
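
The lowercase and stopword steps have no snippet of their own in this README, so here is a minimal sketch of how they might look. The function name clean_tokens and the tiny STOPWORDS set are assumptions made for illustration. The project itself relies on the stopword list provided by nltk, presumably nltk.corpus.stopwords.words('english'), which requires nltk.download('stopwords').

```python
# Stand-in stopword set so the sketch runs without downloading NLTK data;
# an assumption for illustration, not the list used by lab2.py
STOPWORDS = {'the', 'a', 'an', 'and', 'of', 'to', 'in', 'was'}


def clean_tokens(tokens, stopwords=STOPWORDS):
    """Lowercase tokens, then drop empty strings and stopwords."""
    lowered = [t.lower() for t in tokens]
    return [t for t in lowered if t and t not in stopwords]


# Hypothetical input: tokens after punctuation removal
stripped = ['The', 'miller', 'was', '', 'a', 'father', 'of', 'a', 'Daughter']
filtered_words = clean_tokens(stripped)
print(filtered_words)
```

Lowercasing comes before the stopword check because stopword lists are normally stored in lowercase, so 'The' would otherwise slip through.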

Results and Discussion

The results compare how each method reduces words to a base form. PorterStemmer produced the most aggressive stemming, followed by SnowballStemmer, while WordNetLemmatizer was the most conservative because it returns dictionary forms rather than truncated stems.
The comparative analysis highlights the strengths and weaknesses of each method in terms of accuracy and preservation of semantic meaning.

Installation and Usage

Clone the repository to your local machine.
Ensure you have Python and nltk installed (nltk can be installed with pip install nltk).
Run the lab2.py script to process the text in prueba.txt.

git clone https://github.com/KPlanisphere/truncation-and-lemmatization.git
cd truncation-and-lemmatization
python lab2.py

Dependencies

Python
NLTK library
