### Step 1: Import Necessary Libraries

In [1]:
import requests
from bs4 import BeautifulSoup

### Step 2: Fetch the Webpage

In [2]:
# Specify the URL of the Wikipedia page to scrape
url = "https://en.wikipedia.org/wiki/Artificial_intelligence"

# Send an HTTP GET request to the URL
response = requests.get(url)

# Extract the HTML content from the response
html_content = response.text

- **Define the URL**: Specify the URL of the Wikipedia page you want to scrape, here "https://en.wikipedia.org/wiki/Artificial_intelligence".

- **Send an HTTP GET Request**: Use the `requests.get()` method to send an HTTP GET request to the specified URL. This asks the server for the web page, much as a browser does when you open the link.

- **Retrieve HTML Content**: Read the `text` attribute of the response to obtain the raw HTML of the page, which includes all of its text and markup.
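
The fetch step above can be made a little more defensive. The following is a minimal sketch, not part of the original notebook: the helper name `fetch_html`, the 10-second timeout, and the User-Agent string are all assumptions chosen for illustration.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the HTML of `url`, raising an exception on HTTP errors.

    Hypothetical helper: the name, timeout, and User-Agent are illustrative.
    """
    # Assumed, illustrative User-Agent; replace it with one describing your script
    headers = {"User-Agent": "example-scraper/0.1 (educational use)"}
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return response.text
```

With this helper, `html_content = fetch_html(url)` would stand in for the separate `requests.get(url)` and `response.text` lines.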

### Step 3: Parse the HTML Content

In [3]:
# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(html_content, "html.parser")

- **Parsing the HTML Content**: We employ BeautifulSoup to parse the HTML content, making it structured and navigable.
- **Utilizing BeautifulSoup**: We create a BeautifulSoup object called "soup" with the provided HTML content and "html.parser" as the parser. This object facilitates interaction with the page's structure for subsequent data extraction and analysis.
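
To see what "structured and navigable" means in practice, here is a small sketch that parses an inline HTML string instead of the downloaded page. The sample markup and the variable names `sample_html`, `page_title`, and `paragraph_count` are made up for illustration.

```python
from bs4 import BeautifulSoup

# Tiny stand-in for a downloaded page (illustrative markup, not Wikipedia's)
sample_html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <div id="mw-content-text"><p>First paragraph.</p><p>Second paragraph.</p></div>
  </body>
</html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Tags can be reached as attributes of the soup object
page_title = soup.title.string

# find_all() returns every matching tag in document order
paragraph_count = len(soup.find_all("p"))

print(page_title, paragraph_count)
```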

### Step 4: Locate the Data

In [4]:
# Locate the div element with the ID "content"
content_div = soup.find("div", {"id": "content"})

In [6]:
# Narrow the search to the div with the ID "mw-content-text", which holds the main content
content_div = soup.find("div", {"id": "mw-content-text"})

- **Working from the Parsed HTML**: The `soup` object built in Step 3 lets us look up elements by tag name and attributes.
- **Searching for a Specific Division**: We searched the parsed HTML for a division by its unique ID. The first cell looks up the "content" division; the second cell reassigns `content_div` to the "mw-content-text" division, which contains the primary content we want to extract.
- **Storing the Targeted HTML Element**: The `content_div` variable stores the HTML element that represents the "mw-content-text" division. This allows us to focus on the relevant content for extraction.
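
One thing to keep in mind is that `soup.find()` returns `None` when no element matches, which would only surface as an error in the next step. The sketch below wraps the lookup in a hypothetical helper, `find_content_div`, that tries the two IDs used above in order and fails with a clear message; the helper and the inline sample page are assumptions for illustration.

```python
from bs4 import BeautifulSoup


def find_content_div(soup, candidate_ids=("mw-content-text", "content")):
    """Return the first div whose id matches one of `candidate_ids`.

    Hypothetical helper: soup.find() returns None when nothing matches,
    so this raises a clear error instead of failing later on None.
    """
    for element_id in candidate_ids:
        div = soup.find("div", {"id": element_id})
        if div is not None:
            return div
    raise ValueError(f"No div found with any of these ids: {candidate_ids}")


# Small inline page standing in for the downloaded HTML
sample_soup = BeautifulSoup(
    '<div id="content"><div id="mw-content-text"><p>Body text.</p></div></div>',
    "html.parser",
)
content_div = find_content_div(sample_soup)
found_id = content_div["id"]
print(found_id)
```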

### Step 5: Extract the Data

In [7]:
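# Collect the text of every paragraph ("p" tag) inside content_div
text_data = [p.get_text() for p in content_div.find_all("p")]

In [8]:
# Strip surrounding whitespace and drop paragraphs that are empty after stripping
cleaned_text_data = [text.strip() for text in text_data if text.strip()]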

- **Extracting Text Data**: We obtained the text data by calling `get_text()` on every paragraph ("p" tag) found within `content_div`.
- **Cleaning Text Data**: We cleaned the text data by removing extra whitespace and filtering out any empty paragraphs to ensure only meaningful content remains.
- **Storing Cleaned Text**: The "cleaned_text_data" variable holds the cleaned text data, which is now ready for further processing or analysis.
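
Wikipedia paragraphs also carry numeric citation markers such as `[1]`. If those are unwanted, the cleaning step could be extended along the following lines. This is an optional sketch, not something the original notebook does: the function name `clean_paragraphs` and the regular expression are assumptions, and the pattern only covers purely numeric markers.

```python
import re

# Matches numeric citation markers such as [1] or [23]
CITATION_PATTERN = re.compile(r"\[\d+\]")


def clean_paragraphs(paragraphs):
    """Strip whitespace and numeric citation markers; drop empty paragraphs."""
    cleaned = []
    for text in paragraphs:
        text = CITATION_PATTERN.sub("", text).strip()
        if text:
            cleaned.append(text)
    return cleaned


sample = [
    "  AI was founded in 1956.[2]\n",
    "\n",
    "Cycles of optimism[3][4] followed.",
]
result = clean_paragraphs(sample)
print(result)
```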

In [11]:
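# Write each paragraph from text_data (the unstripped text) to output.txt,
# following each one with a newline character
with open('output.txt', 'w', encoding='utf-8') as file:
    for paragraph in text_data:
        file.write(paragraph + '\n')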

- **Opening a File**: We used the 'open' function to create a file named 'output.txt' for writing, with UTF-8 encoding support.
- **Iterating and Writing**: We iterated through the 'text_data' paragraphs and wrote each paragraph to the file, followed by a newline character to separate paragraphs.
- **Saving Content**: This process saved the text data to the 'output.txt' file in a readable and structured manner for further use.
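
A quick way to check the saved file is to read it back. The sketch below shows one possible pair of helpers; the names `save_paragraphs` and `load_paragraphs` are hypothetical, and the code assumes that no paragraph contains an internal line break. Unlike the cell above, it strips each paragraph first, so a paragraph's own trailing newline does not produce blank lines in the file.

```python
from pathlib import Path


def save_paragraphs(paragraphs, path):
    """Write one non-empty paragraph per line, UTF-8 encoded, and return the path."""
    path = Path(path)
    # Strip each paragraph and skip the ones that are empty after stripping
    lines = [p.strip() for p in paragraphs if p.strip()]
    path.write_text("".join(line + "\n" for line in lines), encoding="utf-8")
    return path


def load_paragraphs(path):
    """Read the file back as a list with one paragraph per line."""
    return Path(path).read_text(encoding="utf-8").splitlines()
```

For example, `load_paragraphs(save_paragraphs(text_data, "output.txt"))` should return the stripped, non-empty paragraphs.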

In [12]:
# Remove the character U+010C ('Č') from each paragraph.
# Note: this rebuilds cleaned_text_data from the raw text_data.
cleaned_text_data = [paragraph.replace('\u010c', '') for paragraph in text_data]

# Print the cleaned text
for paragraph in cleaned_text_data:
    print(paragraph)



Artificial intelligence (AI) is the intelligence of machines or software, as opposed to the intelligence of humans or animals. It is also the field of study in computer science that develops and studies intelligent machines. "AI" may also refer to the machines themselves.

AI technology is widely used throughout industry, government and science. Some high-profile applications are: advanced web search engines (e.g., Google Search), recommendation systems (used by YouTube, Amazon, and Netflix), understanding human speech (such as Siri and Alexa), self-driving cars (e.g., Waymo), generative or creative tools (ChatGPT and AI art), and competing at the highest level in strategic games (such as chess and Go).[1]

Artificial intelligence was founded as an academic discipline in 1956.[2] The field went through multiple cycles of optimism[3][4] followed by disappointment and loss of funding,[5][6] but after 2012, when deep learning surpassed all previous AI techniques,[7] there was a vast inc

- **Cleaning Text**: We removed problematic characters, represented by '\u010c', from the text data.
- **Replacement**: We used the 'replace' method to replace the problematic characters with empty strings, effectively removing them.
- **Displaying Cleaned Text**: We printed the cleaned text data to the console, ensuring that it's free from unwanted characters and ready for further use.

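The call `replace('\u010c', '')` targets a single character: `'\u010c'` is the escape sequence for the Unicode code point U+010C (Č, Latin capital letter C with caron), and replacing it with the empty string `''` deletes every occurrence.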
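
The standard library can confirm which character an escape sequence refers to, and `str.translate()` offers a way to remove several characters in one pass. The following is a small sketch; the helper name `remove_characters` and the sample strings are assumptions for illustration.

```python
import unicodedata

target = "\u010c"

# Look up the official Unicode name and code point of the character
char_name = unicodedata.name(target)
code_point = f"U+{ord(target):04X}"


def remove_characters(paragraphs, unwanted):
    """Delete every character in `unwanted` from each paragraph."""
    # With three arguments, str.maketrans maps each character in the third to None
    table = str.maketrans("", "", unwanted)
    return [p.translate(table) for p in paragraphs]


result = remove_characters(["\u010capek", "plain text"], "\u010c")
print(char_name, code_point, result)
```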

### Step 6: Write Cleaned Text to a Word Document and CSV File

In [17]:
pip install python-docx

Collecting python-docx
  Obtaining dependency information for python-docx from https://files.pythonhosted.org/packages/ea/82/ddb60b44c6e39a74bd406fab7d7c102ce7dfca2dff9515dfd6edc7d25f1e/python_docx-1.0.1-py3-none-any.whl.metadata
  Downloading python_docx-1.0.1-py3-none-any.whl.metadata (2.0 kB)
Downloading python_docx-1.0.1-py3-none-any.whl (237 kB)
Installing collected packages: python-docx
Successfully installed python-docx-1.0.1
Note: you may need to restart the kernel to use updated packages.


In [18]:
import docx
import pandas as pd

In [24]:
# Remove the character U+010C from each paragraph of the raw text
cleaned_text_data = [paragraph.replace('\u010c', '') for paragraph in text_data]

# Create a Word document for the cleaned text
cleaned_doc = docx.Document()

# Add each cleaned paragraph to the Word document
for paragraph in cleaned_text_data:
    cleaned_doc.add_paragraph(paragraph)

# Save the Word document
cleaned_doc.save('Cleaned_AI_Wikipedia.docx')

# Create a DataFrame from the cleaned text data
cleaned_df = pd.DataFrame({'Cleaned Text': cleaned_text_data})

# Save the cleaned text data to a CSV file, without the index column
cleaned_df.to_csv('Cleaned_AI_Wikipedia.csv', index=False)

- **Removed Problematic Characters**: Problematic characters, represented by Unicode escape sequences like '\u010c', were replaced or removed from the extracted text.
- **Created a Word Document**: A new, empty Word document was created in memory with `docx.Document()`.
- **Added Cleaned Text**: Each cleaned paragraph was added to the document with `add_paragraph()`.
- **Saved in Word Format**: The document containing the cleaned text was saved to a file named 'Cleaned_AI_Wikipedia.docx'.
- **Created a DataFrame**: A DataFrame was created to structure the cleaned text data for further analysis.
- **Saved in CSV Format**: The cleaned text data was saved as a CSV file named 'Cleaned_AI_Wikipedia.csv' for convenient data storage and sharing.
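
To check that the CSV preserves the paragraphs exactly, including commas and quotation marks inside the text, the file can be read back with `pd.read_csv()`. The sketch below does this with made-up sample paragraphs and a temporary directory, both of which are assumptions for illustration rather than part of the original notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Illustrative paragraphs containing characters that CSV must quote
sample_paragraphs = [
    "First paragraph, with a comma.",
    'Second paragraph with "quotes".',
]

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "sample.csv"
    # Write the column without the index, as in the cell above
    pd.DataFrame({"Cleaned Text": sample_paragraphs}).to_csv(csv_path, index=False)
    # Read the file back and recover the column as a plain list
    restored = pd.read_csv(csv_path)["Cleaned Text"].tolist()

round_trip_ok = restored == sample_paragraphs
print(round_trip_ok)
```

Empty strings would come back from `pd.read_csv()` as missing values (NaN) by default, so filtering out empty paragraphs before saving keeps the round trip clean.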