# Building a text corpus using BeautifulSoup

This notebook scrapes text from a list of company websites. The collected text is intended as input for classifying companies by industry using Natural Language Processing (NLP).

The process sends an HTTP request to each website, retrieves the HTML, parses it to extract paragraph text, and discards text that is not identified as English. The extracted text is then stored in a pandas DataFrame alongside the corresponding website URL.

Afterwards, we will test whether this DataFrame can serve as a base dataset for text classification and other machine learning tasks that categorize companies by the content of their websites.

In [None]:
import requests  # HTTP library for making web requests
from bs4 import BeautifulSoup  # HTML parsing library for web scraping
import langid  # Language identification library for text
from wordcloud import WordCloud  # Word cloud generation
import matplotlib.pyplot as plt  # Plotting
import pandas as pd  # Tabular data handling

In [4]:
df = pd.read_csv("1.2_web_scrap_source.csv")
df.head(2)

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,name,domain,year founded,industry,size range,locality,country,linkedin url,current employee estimate,total employee estimate
0,0,5872184,ibm,ibm.com,1911.0,information technology and services,10001+,"new york, new york, united states",united states,linkedin.com/company/ibm,274047,716906
1,3,2309813,us army,goarmy.com,1800.0,military,10001+,"alexandria, virginia, united states",united states,linkedin.com/company/us-army,162163,445958


In [11]:
websites = ['https://ibm.com', 'https://goarmy.com']

In [15]:

# Function to extract English text from a website
def extract_text(url):
    """Return the paragraph text of `url` if it is classified as English, else None."""
    try:
        # Send a GET request; the timeout keeps an unresponsive site from blocking the run
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # Raise an HTTPError for 4xx/5xx responses

        # Parse the HTML content
        soup = BeautifulSoup(response.text, 'html.parser')

        # Extract text from <p> elements only (adjust to the site's structure).
        # Pages that render their content with JavaScript may yield little or no text here.
        text = ' '.join(p.get_text() for p in soup.find_all('p'))

        # Keep the text only if langid classifies it as English
        if langid.classify(text)[0] == 'en':
            return text
        else:
            print(f"Text from {url} is not in English.")
            return None
    except Exception as e:
        # Network errors, HTTP errors, and parsing failures all end up here
        print(f"Error extracting text from {url}: {e}")
        return None
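
To see what the paragraph-extraction step keeps and discards, it can help to run it on a small inline HTML string instead of a live site. The following is a minimal sketch, not part of the original pipeline: `sample_html` and `paragraphs_to_text` are hypothetical names, and the whitespace normalization is an addition that `extract_text` does not perform.

```python
from bs4 import BeautifulSoup

# Hypothetical page used only for illustration
sample_html = """
<html><head><title>Acme</title><script>var x = 1;</script></head>
<body>
<nav><a href="/">Home</a></nav>
<p>Acme builds industrial robots.</p>
<div>Text outside paragraph tags.</div>
<p>  Founded in <b>1999</b>.  </p>
</body></html>
"""

def paragraphs_to_text(html):
    """Join the text of all <p> elements and collapse runs of whitespace."""
    soup = BeautifulSoup(html, "html.parser")
    raw = " ".join(p.get_text() for p in soup.find_all("p"))
    # split() with no arguments drops leading, trailing, and repeated whitespace
    return " ".join(raw.split())

print(paragraphs_to_text(sample_html))
# → Acme builds industrial robots. Founded in 1999.

# A page with no <p> elements produces an empty string
print(repr(paragraphs_to_text("<div id='app'></div>")))
# → ''
```

Text in `<nav>`, `<div>`, and `<script>` is ignored because only `<p>` elements are collected, so a site whose visible content lives outside paragraph tags contributes little or nothing to the corpus.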


In [19]:
# Build a DataFrame of website URLs from the 'domain' column
df2 = pd.DataFrame({'Website': df['domain'].apply(lambda x: f'https://{x}')})
df2.head(2)

Unnamed: 0,Website
0,https://ibm.com
1,https://goarmy.com
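
If the `domain` column contains missing or duplicated values, the f-string above would produce entries such as `https://nan` or request the same site twice. The sketch below shows one way to guard against that, assuming that lowercasing and trimming domains is acceptable for this dataset; `build_urls` and the sample values are hypothetical.

```python
import pandas as pd

def build_urls(domains: pd.Series) -> pd.Series:
    """Turn a column of bare domains into https:// URLs, skipping unusable rows."""
    cleaned = domains.dropna().astype(str).str.strip().str.lower()
    # Remove empty strings and repeated domains before building URLs
    cleaned = cleaned[cleaned != ""].drop_duplicates()
    return "https://" + cleaned

# Hypothetical sample with a missing value, stray whitespace, and a duplicate
sample = pd.Series(["ibm.com", None, " GoArmy.com ", "ibm.com", ""])
print(build_urls(sample).tolist())
# → ['https://ibm.com', 'https://goarmy.com']
```

The returned Series keeps the original index labels of the surviving rows, so it can still be aligned with `df` if needed.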


In [20]:
# Apply the extract_text function to each website and create a new column 'Text'
df2['Text'] = df2['Website'].apply(extract_text)

# Save the DataFrame to a CSV file for further analysis
df2.to_csv('testextract.csv', index=False)

Text from https://ibm.com is not in English.
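
Because `extract_text` returns `None` for failed or non-English pages, it is worth checking how many rows actually received text before using the DataFrame downstream. The following is a rough sketch of such a check; `coverage_report` is a hypothetical helper, and the sample text is a placeholder rather than real scraped output.

```python
import pandas as pd

def coverage_report(frame: pd.DataFrame, text_col: str = "Text") -> dict:
    """Count rows with usable text versus rows that are missing or blank."""
    has_text = frame[text_col].notna() & (frame[text_col].astype(str).str.strip() != "")
    return {
        "total": int(len(frame)),
        "with_text": int(has_text.sum()),
        "missing": int((~has_text).sum()),
    }

# Placeholder data mirroring the shape of df2 (the text value is invented)
demo = pd.DataFrame({
    "Website": ["https://ibm.com", "https://goarmy.com"],
    "Text": [None, "Placeholder paragraph text."],
})
print(coverage_report(demo))
# → {'total': 2, 'with_text': 1, 'missing': 1}
```

Calling `coverage_report(df2)` after the scraping cell would give the same summary for the real run.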
