# NLP

Natural Language Processing (NLP) is a field of artificial intelligence that focuses on enabling computers to understand, interpret, and generate human language. It combines linguistics, computer science, and machine learning to process text and speech in ways that are meaningful and useful. Common applications of NLP include machine translation, sentiment analysis, chatbots, and speech recognition. By bridging the gap between human communication and computer understanding, NLP plays a key role in modern technology, from virtual assistants to search engines.

# Import Libraries

In [None]:
from pprint import pprint            # For pretty-printing data structures
from bs4 import BeautifulSoup        # For HTML parsing
from urllib.request import urlopen   # For opening URLs

import requests                      # For HTTP requests with custom headers
import pandas as pd                  # For tabular data
import re                            # For regular expressions (e.g. cleaning numeric columns)
import numpy as np                   # For numeric columns

import matplotlib.pyplot as plt      # For visualization

import nltk                              # For NLP utilities
import string                            # For string constants
from nltk.corpus import stopwords        # For stopword lists
from nltk.tokenize import word_tokenize  # For word tokenization
from string import punctuation           # For punctuation characters

# 1. Choose a Website and Identify Structured Data

I chose a Wikipedia page for this task. As part of my ongoing research on wildfires in the USA, I was looking for a comprehensive list of wildfire events. The page provides a well-structured table that summarizes wildfire history across the country, with clearly organized columns for the year, size, name, and area of each wildfire.  
I selected this table not only because it is relevant to my research topic, but also because its HTML structure is relatively simple and consistent, which makes it easier to scrape.  

(Ref: https://en.wikipedia.org/wiki/List_of_wildfires)

# 2. Scrape Data Using BeautifulSoup

### 2-1. Load website

In [None]:
# Target website
myurl = 'https://en.wikipedia.org/wiki/List_of_wildfires'

# Send a browser-like User-Agent header to avoid 'HTTP Error 403: Forbidden'
# ref: https://stackoverflow.com/questions/77129954/error-403-webscraping-project-using-beautifulsoup
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/117.0'}

# Request the page with the header and stop early on an HTTP error
request_id = requests.get(myurl, headers=headers)
request_id.raise_for_status()

# Parse the HTML content using BeautifulSoup with the chosen parser
soupified = BeautifulSoup(request_id.content, 'html.parser')

### 2-2. Find all tables on the page

List every table on the page with its dimensions to identify the one that contains the `Year`, `Size`, `Name`, and `Area` columns.

In [None]:
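def extract_table_data(soup_object):
    """Collect the cell text of every non-empty table in the parsed page.

    Returns a list of dicts with the table's index, its rows of cell text,
    the row count, and the column count of its first row.
    """
    tables_data = []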
    
    for table_index, table in enumerate(soup_object.find_all('table')):
        table_data = []
        rows = table.find_all('tr')
        
        for row in rows:
            # Handle both header (th) and data (td) cells
            cells = row.find_all(['th', 'td'])
            row_data = [cell.get_text(strip=True) for cell in cells]
            if row_data:  # Only add non-empty rows
                table_data.append(row_data)
        
        if table_data:  # Only add non-empty tables
            tables_data.append({
                'table_index': table_index,
                'data': table_data,
                'rows': len(table_data),
                'columns': len(table_data[0])  # table_data is non-empty here
            })
    
    return tables_data

table_info = extract_table_data(soupified)
for table in table_info:
    print(f"Table {table['table_index']}: {table['rows']} rows × {table['columns']} columns")

Table 0: 1 rows × 2 columns
Table 1: 132 rows × 5 columns
Table 2: 16 rows × 1 columns
Table 3: 5 rows × 2 columns
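
Rather than picking the target table by its index, it can also be located by its header row. The sketch below is a minimal example of that idea, not part of the original notebook: `find_table_by_headers` is a hypothetical helper, and `sample_tables` is a toy stand-in for `table_info` with placeholder cell values (including an assumed fifth `Notes` column), so that the block runs without a network request.

```python
# Hypothetical helper (not from the original notebook): pick a table by header names.
def find_table_by_headers(tables, required_headers):
    """Return the first table whose first row contains every required header.

    `tables` is a list of dicts shaped like the output of extract_table_data:
    each has a 'table_index' and a 'data' list of rows (lists of strings).
    Returns None when no table matches.
    """
    required = {header.lower() for header in required_headers}
    for table in tables:
        if not table['data']:
            continue
        # Compare case-insensitively against the cells of the first row
        header_cells = {cell.strip().lower() for cell in table['data'][0]}
        if required <= header_cells:
            return table
    return None


# Toy stand-in for table_info; the values are placeholders, not real wildfire data.
sample_tables = [
    {'table_index': 0, 'data': [['Note', 'Text']]},
    {'table_index': 1, 'data': [['Year', 'Size', 'Name', 'Area', 'Notes'],
                                ['y1', 's1', 'n1', 'a1', '']]},
]

match = find_table_by_headers(sample_tables, ['Year', 'Size', 'Name', 'Area'])
print(match['table_index'])  # 1
```

With the real page, the same call would be `find_table_by_headers(table_info, ['Year', 'Size', 'Name', 'Area'])`. This assumes the header cells contain exactly those words; if the page adds footnote markers or units to a header, the comparison would need to be loosened (for example, to a prefix match).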
