# NLP and Web Scraping
On TechCrunch: https://techcrunch.com/

#### Requests Basics
Request parameters:
- url
- params: query-string parameters, for example ?key=val
- headers: for example, to specify the User-Agent

Response Content:
- r.encoding = '<encoding_name>'
- r.content (binary), r.json() (JSON), r.raw (raw stream)

Response Status Code:
- r.status_code (== requests.codes.ok)

Response Headers:
- r.headers

Cookies:
- r.cookies

Redirection and History:
- r.history
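
As a minimal sketch of the request parameters and response attributes listed above (the query key, the User-Agent string, and the `fetch` helper are illustrative assumptions, not part of the scraper):

```python
import requests

URL = "https://techcrunch.com/"
PARAMS = {"key": "val"}                     # placeholder query, becomes ?key=val
HEADERS = {"User-Agent": "my-scraper/0.1"}  # hypothetical User-Agent string

# Preparing a request shows how url and params are combined, without sending anything.
prepared = requests.Request("GET", URL, params=PARAMS, headers=HEADERS).prepare()
print(prepared.url)


def fetch(url, params=None, headers=None, timeout=10):
    """Send a GET request; raise requests.HTTPError on 4xx/5xx responses."""
    r = requests.get(url, params=params, headers=headers, timeout=timeout)
    r.raise_for_status()
    # r.text (decoded using r.encoding), r.content (bytes), r.json(),
    # r.headers, r.cookies and r.history are available on the returned object.
    return r
```

Calling `fetch(URL, headers=HEADERS)` would perform the actual network request; the prepared request above is only a way to inspect the final URL offline.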

#### TechCrunch Scraper
TechCrunchScraper is an object whose methods fetch article data and build lists of dictionaries, which are later stored in MongoDB.

In [1]:
from scraping_techcrunch import TechCrunchScraper
import time
import random

On writing developer documentation:\
https://textcortex.com/es/post/how-to-write-a-software-documentation

In [2]:
scraper = TechCrunchScraper()

In [5]:
scraper.categories.keys()

dict_keys(['latest_news'])

Changes:
- November 11, 2025. The most recent article on record was published on August 5, 2025.
- Choose which categories the scraper should handle: [Startups, Ventures, Security, AI, Apps] or [latest]

In [6]:
# Crawl the paginated article lists for each news category:
entradas = []
n_paginas = 86

# Iterate over the categories: [Startups, Ventures, Security, AI, Apps] or ['latest']

for key in scraper.categories:
    print(f"Extrayendo noticias de la categoría {key}...")

    # Pages 2 through n_paginas + 1 (the first page is not fetched in this loop)
    for n_page in range(2, n_paginas + 2):
        print(f"Noticias de la página {n_page}...")
        soup = scraper.http_on_website(scraper.categories[key], page=n_page)
        resultados = scraper.recursive_data_process(soup)

        # entradas is a list of lists (one per page fetched); each inner list holds one dictionary per article:
        entradas.append(resultados)

        # Sleep between iterations to avoid flooding the site with requests
        delay = max(0, random.gauss(1.5, 0.25))
        time.sleep(delay)

Extrayendo noticias de la categoría latest_news...
Noticias de la página 2...
https://techcrunch.com/latest/page/2/
Conexión exitosa: 200
Error en la iteracion14 : list index out of range
Conexión fallida: IndexError: list index out of range
Error en la iteracion15 : list index out of range
Conexión fallida: IndexError: list index out of range
Error en la iteracion16 : list index out of range
Conexión fallida: IndexError: list index out of range
Error en la iteracion17 : list index out of range
Conexión fallida: IndexError: list index out of range
Error en la iteracion18 : list index out of range
Conexión fallida: IndexError: list index out of range
Error en la iteracion19 : list index out of range
Conexión fallida: IndexError: list index out of range
Noticias de la página 3...
https://techcrunch.com/latest/page/3/
Conexión exitosa: 200
Error en la iteracion13 : list index out of range
Conexión fallida: IndexError: list index out of range
Error en la iteracion14 : list index out of ran
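
The `list index out of range` messages above suggest that some items on a page lack an element the parser expects. The scraper's internals are not shown here, so the following is only a sketch, with a hypothetical helper `first_or_none` and made-up sample data, of how an empty lookup could be skipped instead of raising `IndexError`:

```python
def first_or_none(items):
    """Return the first element of a sequence, or None when it is empty."""
    return items[0] if items else None


# Made-up stand-ins for per-article lookup results; an empty list models
# a page item that lacks the expected element.
lookups = [["Headline A"], [], ["Headline B"]]

titles = []
for found in lookups:
    title = first_or_none(found)
    if title is None:
        continue  # skip incomplete items instead of raising IndexError
    titles.append(title)

print(titles)
```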

In [None]:
# For each category, use an offset to find the date of its oldest article:

In [None]:
# To run every category and extract all entries from the first page:
# entradas = []

for key in scraper.categories:
    print(f"Extrayendo noticias de la categoría {key}...")
    soup = scraper.http_on_website(scraper.categories[key])
    resultados = scraper.recursive_data_process(soup)

    # entradas is a list of lists (one per category); each inner list holds one dictionary per article:
    entradas.append(resultados)

    # Sleep between iterations to avoid flooding the site with requests
    delay = max(0, random.gauss(2, 0.25))
    time.sleep(delay)

In [None]:
# To run a single category:
soup = scraper.http_on_website(scraper.url_base, scraper.categories['AI'])
resultados = scraper.recursive_data_process(soup)

### MongoDB Connection

In [2]:
# %pip install "pymongo[srv]"
import mongodb_feed as mongo

In [3]:
database = "tech_crunch"
collection = "tech_crunch_news"

In [4]:
# This is added so that many files can reuse the function get_database()
if __name__ == "__main__":

    # Get the database
    dbname = mongo.get_database(mongo.user, mongo.password, mongo.cluster, database)
    collection_name = mongo.ensure_collection_exists(dbname, collection)
    # collection_name.create_index([("headtitle", 1), ("dt_utc", 1)], unique=True)

La colección 'tech_crunch_news' ya existe.


#### Inserting into MongoDB

In [None]:
# Single insert:
from pymongo.errors import DuplicateKeyError

# Inserting documents in Python:
try:
    collection_name.insert_one(merge_dict)
except DuplicateKeyError:
    print("Duplicado detectado: ya existe una entrada con ese título y fecha.")

In [12]:
# Bulk insert:
from pymongo.errors import BulkWriteError

# Insert each list of entries; ordered=False keeps inserting after a failed document
for batch in entradas:
    try:
        collection_name.insert_many(batch, ordered=False)
    except BulkWriteError:
        print("Se detectaron duplicados, algunos documentos no se insertaron")
    except Exception as e:
        print(f"Error: {e}")

Se detectaron duplicados, algunos documentos no se insertaron
Se detectaron duplicados, algunos documentos no se insertaron
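
Since `entradas` is a list of lists, one option is to flatten it and drop repeated articles before inserting. The sketch below assumes each article dictionary carries the `headtitle` and `dt_utc` fields seen in the stored documents; `flatten_unique` is a hypothetical helper, not part of `mongodb_feed`, and the sample data is made up:

```python
from itertools import chain


def flatten_unique(batches, key_fields=("headtitle", "dt_utc")):
    """Flatten a list of lists of dicts, keeping the first dict per key."""
    seen = set()
    unique = []
    for doc in chain.from_iterable(batches):
        key = tuple(doc.get(field) for field in key_fields)
        if key in seen:
            continue
        seen.add(key)
        unique.append(doc)
    return unique


# Made-up sample data: the second batch repeats one article from the first.
sample = [
    [{"headtitle": "A", "dt_utc": "2025-07-04 23:36:00 UTC+0000"}],
    [{"headtitle": "A", "dt_utc": "2025-07-04 23:36:00 UTC+0000"},
     {"headtitle": "B", "dt_utc": "2025-07-06 20:58:00 UTC+0000"}],
]
docs = flatten_unique(sample)
print(len(docs))
```

This only removes duplicates within the current run; documents already stored in the collection would still need the `BulkWriteError` handling or a unique index on those fields.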


### MongoDB Operations:

In [None]:
# # Find All:
# coleccion_noticias_completo = [doc for doc in collection_name.find({},{"_id":0})]

Find one document in a collection:

In [5]:
# Find One:
x = collection_name.find_one()
print(x)

{'_id': ObjectId('686be121fdf38784bcf8afb5'), 'main_category': 'Venture', 'headtitle': 'Ready-made stem cell therapies for pets could be coming', 'news_author': 'Connie Loizos', 'link': 'https://techcrunch.com/2025/07/04/ready-made-stem-cell-therapies-for-pets-could-be-coming/', 'dt_localizado': '2025-07-04 16:36:00 PDT-0700', 'dt_utc': '2025-07-04 23:36:00 UTC+0000', 'cuerpo_noticia_str': 'Earlier this week, San Diego startup Gallant announced $18 million in funding to bring the first FDA-approved ready-to-use stem cell therapy to veterinary medicine. If it passes regulatory muster, it could create a whole new way to treat our fur babies.\n\nIt’s still an experimental field, even though people have been researching stem cells for humans for decades. Seven-year-old Gallant’s first target is a painful mouth condition in cats called Feline Chronic Gingivostomatitis (FCGS), which Gallant says could receive FDA approval by early 2026.\n\nThe field has shown some encouraging early results. 

Select the fields "cuerpo_noticia_str" and "main_category" from every document in the collection:

In [6]:
# Return only some fields:
# Select Columns FROM collection
coleccion_noticias = [doc for doc in collection_name.find({},{"_id":0, "cuerpo_noticia_str":1, "main_category":1})]

Find every document in the collection that belongs to the 'AI' category:

In [7]:
# MongoDB Query:
myquery = {"main_category":"AI"}

mydoc = collection_name.find(myquery)

for doc in mydoc:
    print(doc)

{'_id': ObjectId('686bf202a9c6cc362593138c'), 'main_category': 'AI', 'headtitle': 'Ingram Micro says ongoing outage caused by ransomware attack', 'news_author': 'Zack Whittaker', 'link': 'https://techcrunch.com/2025/07/07/ingram-micro-says-ongoing-outage-caused-by-ransomware-attack/', 'dt_localizado': '2025-07-07 05:57:00 PDT-0700', 'dt_utc': '2025-07-07 12:57:00 UTC+0000', 'cuerpo_noticia_str': 'Ingram Micro, a U.S. technology distributing giant and managed services provider, said on Monday a ransomware attack is the cause of an ongoing outage at the company.\n\nThe hack began on Thursday, after which the company’s website and much of its network went down. Late on Saturday, the company said in a brief statement that it was working to restore systems so it can begin processing orders again.\xa0\n\nIngram Micro on Monday alerted shareholders to the breach before markets opened in the United States.\n\nCalifornia-based Ingram Micro is one of the world’s largest technology distributors, 

Find the documents published from July 6, 2025 onward (the "dt_utc" field is stored as a string, so "$gt" compares text and every timestamp on July 6 sorts after "2025-07-06"):

In [9]:
#Advanced Query:
myquery = {"dt_utc": {"$gt":"2025-07-06"}}

mydoc = collection_name.find(myquery)
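
Because `dt_utc` is stored as a string in the documents shown above, `$gt` compares text rather than dates. That works here only because the format starts with a zero-padded `YYYY-MM-DD`. A small sketch of the difference, where `parse_dt_utc` is a hypothetical helper and the format string is inferred from the sample values:

```python
from datetime import datetime, timezone

DT_FORMAT = "%Y-%m-%d %H:%M:%S UTC%z"  # inferred from '2025-07-04 23:36:00 UTC+0000'


def parse_dt_utc(value):
    """Parse a stored dt_utc string into a timezone-aware datetime."""
    return datetime.strptime(value, DT_FORMAT)


stored = "2025-07-06 20:58:00 UTC+0000"

# String comparison: any timestamp on July 6 is "greater" than the bare date.
print(stored > "2025-07-06")

# Datetime comparison: explicit about the instant being compared.
cutoff = datetime(2025, 7, 7, tzinfo=timezone.utc)
print(parse_dt_utc(stored) > cutoff)
```

Storing real datetime values instead of strings would let the same `$gt` query compare instants directly.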

Select the documents in the collection whose headline matches the regular expression "Apple":

In [10]:
#Filter With Regular Expressions:
myquery = {"headtitle": {"$regex": "Apple"}}
mydoc = collection_name.find(myquery)

# for doc in mydoc:
#     print(doc)
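
A `$regex` given as a plain string is an unanchored, case-sensitive match, so it finds "Apple" anywhere in the headline. As a rough local preview using Python's `re` module (the headlines are made up, `preview` is a hypothetical helper, and MongoDB's regex engine is not Python's, though simple patterns like these should behave the same):

```python
import re

# Made-up headlines for illustration only.
headlines = [
    "Apple unveils new chips",
    "Pineapple startup raises seed round",
    "Why apple growers are turning to drones",
]

query_default = {"headtitle": {"$regex": "Apple"}}
query_ignore_case = {"headtitle": {"$regex": "Apple", "$options": "i"}}
query_whole_word = {"headtitle": {"$regex": r"\bApple\b"}}


def preview(query, titles):
    """Apply the query's pattern locally to see which titles would match."""
    spec = query["headtitle"]
    flags = re.IGNORECASE if "i" in spec.get("$options", "") else 0
    return [t for t in titles if re.search(spec["$regex"], t, flags)]


print(preview(query_default, headlines))
print(preview(query_ignore_case, headlines))
print(preview(query_whole_word, headlines))
```

Any of the three query dictionaries can be passed to `collection_name.find(...)` in place of `myquery`.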

Sort the results:

In [None]:
#Sort the Result (apply on Social news to lighten the search)
myquery = {"main_category":"Social"}
mydoc = collection_name.find(myquery).sort("headtitle")

# for doc in mydoc:
#     print(doc)

{'_id': ObjectId('686fb900c615752af259f9c2'), 'main_category': 'Social', 'headtitle': '4 days to go: TechCrunch Sessions: AI is almost in session', 'news_author': 'TechCrunch Events', 'link': 'https://techcrunch.com/2025/06/01/4-days-to-go-techcrunch-sessions-ai-is-almost-in-session/', 'dt_localizado': '2025-06-01 07:00:00 PDT-0700', 'dt_utc': '2025-06-01 14:00:00 UTC+0000', 'cuerpo_noticia_str': 'Artificial intelligence has no shortage of visionaries — but the ones who matter are executing. In 4 days, TechCrunch Sessions: AI brings those builders, researchers, funders, and enthusiasts under one roof at UC Berkeley’s Zellerbach Hall.\n\nThis isn’t a parade of AI hype or a string of over-edited keynotes. It’s a single day designed for clarity, candor, and real connection.\n\nIt’s also your last chance to save. Ticket prices rise soon — but right now, you can save over $300 on your pass and get 50% off a second, so your partner, co-founder, or friend can dive in with you.\n\nMaybe it’s a

In [12]:
#Sort descending:
myquery = {"main_category":"Social"}
mydoc = collection_name.find(myquery).sort("headtitle",-1)

for doc in mydoc:
    print(doc)

{'_id': ObjectId('686bf202a9c6cc362593138e'), 'main_category': 'Social', 'headtitle': '‘Improved’ Grok criticizes Democrats and Hollywood’s ‘Jewish executives’', 'news_author': 'Anthony Ha', 'link': 'https://techcrunch.com/2025/07/06/improved-grok-criticizes-democrats-and-hollywoods-jewish-executives/', 'dt_localizado': '2025-07-06 13:58:00 PDT-0700', 'dt_utc': '2025-07-06 20:58:00 UTC+0000', 'cuerpo_noticia_str': 'On Friday morning, Elon Musk declared, “We have improved @Grok significantly. You should notice a difference when you ask Grok questions.”\n\nWhile Musk didn’t say exactly what improvements to look for, he’d previously declared that xAI (which built Grok) would retrain the chatbot after it had been trained on “far too much garbage,” and he called on users at X (where Grok is heavily featured) to share\xa0“divisive facts” that are “politically incorrect, but nonetheless factually true.” (Musk recently merged the two companies.)\n\nOne user subsequently asked Grok whether elec
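
The two outputs above follow from how strings are ordered: without a collation, MongoDB compares strings by their binary value, so digits sort before letters and a typographic quote such as `‘` sorts after every ASCII character. Python's default string ordering (by code point) gives the same result for these titles, which allows a quick local check using headlines taken from the outputs in this notebook:

```python
titles = [
    "Ready-made stem cell therapies for pets could be coming",
    "‘Improved’ Grok criticizes Democrats and Hollywood’s ‘Jewish executives’",
    "4 days to go: TechCrunch Sessions: AI is almost in session",
]

ascending = sorted(titles)                 # comparable to .sort("headtitle")
descending = sorted(titles, reverse=True)  # comparable to .sort("headtitle", -1)

print(ascending[0])
print(descending[0])
```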

In [13]:
#Limit the result:
mydoc = collection_name.find().limit(6)

for doc in mydoc:
    print(doc)

{'_id': ObjectId('686be121fdf38784bcf8afb5'), 'main_category': 'Venture', 'headtitle': 'Ready-made stem cell therapies for pets could be coming', 'news_author': 'Connie Loizos', 'link': 'https://techcrunch.com/2025/07/04/ready-made-stem-cell-therapies-for-pets-could-be-coming/', 'dt_localizado': '2025-07-04 16:36:00 PDT-0700', 'dt_utc': '2025-07-04 23:36:00 UTC+0000', 'cuerpo_noticia_str': 'Earlier this week, San Diego startup Gallant announced $18 million in funding to bring the first FDA-approved ready-to-use stem cell therapy to veterinary medicine. If it passes regulatory muster, it could create a whole new way to treat our fur babies.\n\nIt’s still an experimental field, even though people have been researching stem cells for humans for decades. Seven-year-old Gallant’s first target is a painful mouth condition in cats called Feline Chronic Gingivostomatitis (FCGS), which Gallant says could receive FDA approval by early 2026.\n\nThe field has shown some encouraging early results. 

More info at: https://www.w3schools.com/python/python_mongodb_getstarted.asp\
From MongoDB: https://www.mongodb.com/resources/languages/python

### Save to JSON and change the data type of the 'dt_utc' field: