#  <span style="font-family: Latin Modern Roman; font-size: 35px; font-weight: bold;"> Lab Assignment 2. Data Cleaning </span>

---

In [13]:
import re
import nltk
import requests
from bs4 import BeautifulSoup
from googletrans import Translator
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

## <span style="font-family: Latin Modern Roman; font-size: 25px;"> 1. Obtain a Spanish translation of the sample paragraph </span>

In [3]:
# Paragraph in English
sample_paragraph_eng = """The most merciful thing in the world, I think, is the inability of the human mind to correlate all its contents. We live on a placid island of ignorance in the midst of black seas of infinity, and it was not meant that we should voyage far. The sciences, each straining in its own direction, have hitherto harmed us little; but some day the piecing together of dissociated knowledge will open up such terrifying vistas of reality, and of our frightful position therein, that we shall either go mad from the revelation or flee from the deadly light into the peace and safety of a new dark age. Theosophists have guessed at the awesome grandeur of the cosmic cycle wherein our world and human race form transient incidents. They have hinted at strange survivals in terms which would freeze the blood if not masked by a bland optimism. But it is not from them that there came the single glimpse of forbidden aeons which chills me when I think of it and maddens me when I dream of it. That glimpse, like all dread glimpses of truth, flashed out from an accidental piecing together of separated things—in this case an old newspaper item and the notes of a dead professor. I hope that no one else will accomplish this piecing out; certainly, if I live, I shall never knowingly supply a link in so hideous a chain. I think that the professor, too, intended to keep silent regarding the part he knew, and that he would have destroyed his notes had not sudden death seized him."""

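# Async translation helper: the googletrans call is awaited, so this must run inside an event loop
async def translate_text(text, src_lang, dest_lang):
    """Translate `text` from `src_lang` to `dest_lang` and return the translated string."""
    translator = Translator()
    translation = await translator.translate(text, src=src_lang, dest=dest_lang)
    return translation.text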

In [4]:
sample_paragraph_esp = await translate_text(sample_paragraph_eng, 'en', 'es')
print("Translated Paragraph in Spanish:\n", sample_paragraph_esp)

Translated Paragraph in Spanish:
 Creo que lo más misericordioso del mundo es la incapacidad de la mente humana para correlacionar todos sus contenidos. Vivimos en una placida isla de ignorancia en medio de los mares negros del infinito, y no significaba que debamos viajar lejos. Las ciencias, cada una que se esfuerza en su propia dirección, hasta ahora nos han dañado poco; Pero algún día la junta de conocimiento disociado abrirá vistas tan aterradoras de la realidad, y de nuestra espantosa posición en la misma, que nos enojaremos de la revelación o huiremos de la luz mortal a la paz y la seguridad de una nueva edad oscura. . Los teósofos han adivinado en la impresionante grandeza del ciclo cósmico en el que nuestro mundo y la raza humana forman incidentes transitorios. Han insinuado sobrevivientes extrañas en términos que congelarían la sangre si no enmascaran por un optimismo suave. Pero no es de ellos que llegó el único vistazo de los eones prohibidos que me enfría cuando pienso en 
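
The `await` call in the cell above works at top level because Jupyter/IPython runs cells inside an event loop. In a plain Python script the same line would be a syntax error, so the coroutine has to be started explicitly. The following is a minimal sketch of that pattern; `fake_translate` is a hypothetical stand-in for `translate_text` (no network access, no googletrans), used only to keep the example self-contained.

```python
import asyncio


async def fake_translate(text, src_lang, dest_lang):
    """Hypothetical stand-in for translate_text: no network, no googletrans."""
    await asyncio.sleep(0)  # yield to the event loop, as a real request would
    return f"[{src_lang}->{dest_lang}] {text}"


async def main():
    # Inside a coroutine, `await` behaves as it does in the notebook cell.
    return await fake_translate("placid island", "en", "es")


# A plain script has no running event loop, so asyncio.run() starts one.
result = asyncio.run(main())
print(result)
```

Replacing `fake_translate` with the notebook's `translate_text` should give the same structure for a script, assuming the installed googletrans version exposes an awaitable `translate`.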

---

## <span style="font-family: Latin Modern Roman; font-size: 25px;"> 2. Obtain a list of Spanish words and a list of English words by applying the preceding cleaning steps to the sample paragraph </span>

In [None]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Lemmatizer and Stopwords
lemmatizer = WordNetLemmatizer()
stop_words_eng = set(stopwords.words("english"))
stop_words_esp = set(stopwords.words("spanish"))

# Text Cleaning Function with Language
def clean_text(text, language = "english"):
    if language == "spanish":
        stop_words = stop_words_esp
    else:
        stop_words = stop_words_eng

    # Collapse runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text)

    # Replace everything that is not a letter or whitespace (digits, punctuation) with a space
    text = re.sub(r'[^a-zA-ZáéíóúÁÉÍÓÚüÜñÑ\s]', ' ', text)

    # Remove single letters surrounded by whitespace
    text = re.sub(r'\s+[a-zA-ZáéíóúüñÁÉÍÓÚÜÑ]\s+', ' ', text)

    # Convert to lowercase
    text = text.lower()

    # Tokenization
    tokens = word_tokenize(text)

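    # Remove stop words, then lemmatize the remaining tokens
    # (WordNetLemmatizer targets English, so Spanish tokens mostly pass through unchanged)
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]

    # Drop tokens of three characters or fewer
    tokens = [word for word in tokens if len(word) > 3]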

    return tokens

# Cleaned word lists in English and Spanish
words_eng = clean_text(sample_paragraph_eng, language = "english")
words_es = clean_text(sample_paragraph_esp, language = "spanish")

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/sergiocuencanunez/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/sergiocuencanunez/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/sergiocuencanunez/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
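
To see what each step of `clean_text` contributes, the regex and filtering steps can be traced on a short sentence without NLTK. This is a simplified sketch, not the function above: `clean_text_demo` and `DEMO_STOP_WORDS` are hypothetical names, `str.split` stands in for `word_tokenize`, the stop-word set is a tiny hand-written one, and lemmatization is omitted.

```python
import re

# Hypothetical mini stop-word set; the notebook uses NLTK's full lists.
DEMO_STOP_WORDS = {"the", "of", "on", "we", "in"}


def clean_text_demo(text, stop_words=DEMO_STOP_WORDS):
    """Simplified stand-in for clean_text: same regex steps, no NLTK."""
    # Collapse runs of whitespace
    text = re.sub(r"\s+", " ", text)
    # Replace digits and punctuation with spaces, keeping accented letters
    text = re.sub(r"[^a-zA-ZáéíóúÁÉÍÓÚüÜñÑ\s]", " ", text)
    # Remove single letters surrounded by whitespace
    text = re.sub(r"\s+[a-zA-ZáéíóúüñÁÉÍÓÚÜÑ]\s+", " ", text)
    # Lowercase and split on whitespace (instead of word_tokenize)
    tokens = text.lower().split()
    # Remove stop words
    tokens = [t for t in tokens if t not in stop_words]
    # Keep only words of four or more characters
    return [t for t in tokens if len(t) > 3]


demo = clean_text_demo("We live on a placid island of ignorance, in 1928!")
print(demo)  # ['live', 'placid', 'island', 'ignorance']
```

The length filter is strict: a three-letter word such as "día" is dropped even when it is not a stop word, which is worth keeping in mind when reading the cleaned lists.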


In [6]:
print("List of words in English (cleaned):\n", words_eng)
print("List of words in Spanish (cleaned):\n", words_es)

List of words in English (cleaned):
 ['merciful', 'thing', 'world', 'think', 'inability', 'human', 'mind', 'correlate', 'content', 'live', 'placid', 'island', 'ignorance', 'midst', 'black', 'infinity', 'meant', 'voyage', 'science', 'straining', 'direction', 'hitherto', 'harmed', 'little', 'piecing', 'together', 'dissociated', 'knowledge', 'open', 'terrifying', 'vista', 'reality', 'frightful', 'position', 'therein', 'shall', 'either', 'revelation', 'flee', 'deadly', 'light', 'peace', 'safety', 'dark', 'theosophist', 'guessed', 'awesome', 'grandeur', 'cosmic', 'cycle', 'wherein', 'world', 'human', 'race', 'form', 'transient', 'incident', 'hinted', 'strange', 'survival', 'term', 'would', 'freeze', 'blood', 'masked', 'bland', 'optimism', 'came', 'single', 'glimpse', 'forbidden', 'aeon', 'chill', 'think', 'maddens', 'dream', 'glimpse', 'like', 'dread', 'glimpse', 'truth', 'flashed', 'accidental', 'piecing', 'together', 'separated', 'thing', 'case', 'newspaper', 'item', 'note', 'dead', 'prof

---

## <span style="font-family: Latin Modern Roman; font-size: 25px;"> 3. Obtain a list of words by applying the preceding cleaning steps to the text of a web page </span>

In [7]:
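def extract_text_from_website(url):
    """Return the joined text of every <p> element on the page, or "" on a non-200 response."""
    # A timeout keeps the cell from hanging if the server never answers
    response = requests.get(url, timeout=10)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        # Only paragraph elements are collected; headings, menus and scripts are ignored
        paragraphs = soup.find_all('p')
        full_text = ' '.join([p.text for p in paragraphs])
        return full_text
    else:
        return ""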

In [None]:
web_text = extract_text_from_website("https://www.elmundo.es")
cleaned_web_words = clean_text(web_text, language = "spanish")
print("List of words from El Mundo (cleaned):\n", cleaned_web_words)

List of words from El Mundo (cleaned):
 ['portada', 'suscríbete', 'suscríbete', 'profesiones', 'exigen', 'dedicación', 'tiempo', 'pasa', 'profesionales', 'sanitarios', 'fundamental', 'recargar', 'pilas', 'llegar', 'casa', 'costaba', 'mundo', 'junto', 'ikea', 'conseguido', 'convertir', 'espacio', 'alma', 'refugio', 'personal', 'proyecto', 'ikea', 'ofrecido', 'vichy']
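
The extraction step keeps only text that sits inside `<p>` elements, which is why headings and navigation items do not reach the word list. The sketch below illustrates that behaviour offline on an inline HTML string. It deliberately swaps BeautifulSoup for the standard library's `html.parser.HTMLParser` so that it runs without dependencies; `ParagraphCollector` and `sample_html` are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collects the text inside each <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._depth = 0  # greater than 0 while inside a <p> element

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._depth += 1
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        # Text outside <p> (headings, menus) is ignored
        if self._depth:
            self.paragraphs[-1] += data


# Hypothetical page fragment standing in for a downloaded response
sample_html = (
    "<html><body><h1>Front page</h1>"
    "<p>First <b>story</b>.</p>"
    "<div>menu</div>"
    "<p>Second story.</p></body></html>"
)

collector = ParagraphCollector()
collector.feed(sample_html)
full_text = " ".join(collector.paragraphs)
print(full_text)  # First story. Second story.
```

Text nested in inline tags such as `<b>` is kept because it still falls inside the enclosing paragraph, while the heading and the `<div>` are skipped.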


---

## <span style="font-family: Latin Modern Roman; font-size: 25px;"> 4. Obtain a list of words by applying the preceding cleaning steps to the first 10 reviews of the film *Titanic* on *Rotten Tomatoes* </span>

In [27]:
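def get_rotten_tomatoes_reviews(movie_url, language="english"):
    """Fetch a reviews page and return the cleaned words of its first 10 reviews."""
    # A timeout keeps the cell from hanging if the server never answers
    response = requests.get(movie_url, timeout=10)

    if response.status_code != 200:
        print("Error: Unable to access Rotten Tomatoes. Status Code:", response.status_code)
        return []

    soup = BeautifulSoup(response.text, 'html.parser')

    # Review bodies are expected in <p class="review-text"> elements
    review_paragraphs = soup.find_all('p', class_='review-text')
    if not review_paragraphs:
        print("No reviews found. Check if the page structure has changed.")
        return []

    # Keep only the first 10 reviews and merge them into one string for cleaning
    reviews_text = ' '.join([review.get_text(strip=True) for review in review_paragraphs[:10]])

    return clean_text(reviews_text, language)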

In [28]:
rotten_tomatoes_url = "https://www.rottentomatoes.com/m/titanic/reviews"
titanic_reviews_rt = get_rotten_tomatoes_reviews(rotten_tomatoes_url, language="english")
print("List of words from Titanic reviews on Rotten Tomatoes (cleaned):\n", titanic_reviews_rt)

List of words from Titanic reviews on Rotten Tomatoes (cleaned):
 ['thanks', 'inspiring', 'piece', 'captivating', 'romance', 'james', 'cameron', 'titanic', 'remains', 'powerful', 'today', 'year', 'every', 'time', 'watch', 'film', 'piece', 'element', 'make', 'fall', 'love', 'even', 'truly', 'greatest', 'romance', 'seen', 'film', 'largely', 'thanks', 'performer', 'titanic', 'reaching', 'heartstrings', 'well', 'made', 'acted', 'impossible', 'look', 'away', 'cameron', 'skilled', 'creating', 'successful', 'film', 'often', 'perform', 'exceptionally', 'well', 'office', 'titanic', 'exception', 'inspiring', 'life', 'changing', 'love', 'kind', 'perhaps', 'worth', 'boarding', 'titanic', 'still', 'second', 'favourite', 'film', 'time', 'masterpiece', 'fact', 'still', 'talking', 'titanic', 'year', 'release', 'speaks', 'film', 'ship', 'staying', 'power', 'issue', 'fictional', 'love', 'story', 'withstanding', 'titanic', 'remains', 'epic', 'achievement', 'filmmaking', 'colossal', 'ship', 'technically',
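
The "first 10 reviews" requirement is handled by slicing the list of review elements before joining them. A minimal sketch of that selection step, using plain strings instead of scraped elements, is shown here; `join_first_reviews` and `fake_reviews` are hypothetical and only illustrate how the slice behaves when more or fewer than 10 reviews are available.

```python
def join_first_reviews(review_texts, limit=10):
    """Strip each review and join the first `limit` of them (hypothetical helper)."""
    selected = [text.strip() for text in review_texts[:limit]]
    return " ".join(selected)


# Hypothetical snippets standing in for scraped review paragraphs
fake_reviews = [f"  Review number {i}.  " for i in range(1, 13)]

joined = join_first_reviews(fake_reviews)
print(joined.count("Review"))  # 10

# Slicing never raises when fewer than `limit` items exist
short = join_first_reviews(fake_reviews[:3])
print(short)
```

Because a slice past the end of a list simply returns what is there, a page with fewer than 10 reviews yields a shorter string rather than an error.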

---

## <span style="font-family: Latin Modern Roman; font-size: 25px;"> Sergio Cuenca Núñez </span>