# Web Scraping Cancer Research UK Forum Comments: Documentation

## Introduction
This Python script scrapes the Cancer Research UK forum website to extract comments from a specific thread about chemotherapy side effects. It fetches the page with the requests library and uses the BeautifulSoup library to parse the HTML and extract the relevant text. The extracted comments are then cleaned, filtered, and saved to a CSV file for further analysis.

## Required Libraries
- BeautifulSoup: A library for pulling data out of HTML and XML files. It is installed as the `beautifulsoup4` package and imported from `bs4`.
- requests: A library for making HTTP requests.
- pandas: A library for data manipulation and analysis.
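
Before running the script, it can help to confirm that the three libraries are importable. The following is a minimal sketch using only the standard library; the `REQUIRED` mapping and the `missing_packages` helper are illustrative names, not part of the original script.

```python
from importlib.util import find_spec

# Maps import name -> PyPI package name (they differ for BeautifulSoup).
# REQUIRED and missing_packages are hypothetical helpers for illustration.
REQUIRED = {"bs4": "beautifulsoup4", "requests": "requests", "pandas": "pandas"}


def missing_packages(required=REQUIRED):
    """Return the PyPI names of packages whose import name cannot be found."""
    return [
        pip_name
        for import_name, pip_name in required.items()
        if find_spec(import_name) is None
    ]


missing = missing_packages()
if missing:
    print("Install with: pip install " + " ".join(missing))
else:
    print("All required libraries are available.")
```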

## Script Explanation
1. Import Libraries: The script starts by importing the necessary libraries: BeautifulSoup, requests, and pandas.

2. URL and Request: The URL of the target forum thread is provided (url_link). The requests.get() function sends an HTTP GET request to the URL, and the HTML content of the response is extracted using the .text attribute.

3. HTML Parsing: The HTML content retrieved from the webpage is parsed using BeautifulSoup. The parsed HTML content is stored in the doc variable.

4. Extract Comments: The comments are located within the HTML element whose class attribute is `field-item even`. The script uses the `find()` method to locate this element; `find()` returns only the first match, so only the first such element on the page is searched. Within it, all `<p>` (paragraph) tags are found using `find_all('p')`, which returns a list of paragraph elements.

5. Comments Processing: The text content of the comment elements is extracted, converted to strings, and added to the comments list.

6. Data Cleaning: The script replaces '\xa0' characters (non-breaking spaces) in each comment with regular spaces using a list comprehension. It then keeps only comments longer than 20 characters, so comments of 20 characters or fewer are filtered out.

7. Create DataFrame: The cleaned comments are organized into a pandas DataFrame with two columns: 'comments' and 'type'. The 'type' column is set to 'cancer' for all rows.

8. Export to CSV: The DataFrame is exported to a CSV file named 'cancer_research.csv'. The index=False parameter ensures that the DataFrame index is not included in the CSV file.
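
Steps 3 to 7 can be tried without any network access by running them on a small inline HTML string. This is a sketch under stated assumptions: `SAMPLE_HTML` is made-up markup that imitates the structure described above, and `extract_comments` and its `min_length` parameter are illustrative names rather than part of the original script.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical markup; the real forum page may be structured differently.
SAMPLE_HTML = """
<div class="field-item even">
  <p>Hi all, I started chemo last week and feel tired.</p>
  <p>Thanks&nbsp;everyone</p>
  <p>Has anyone else had a metallic taste&nbsp;in their mouth?</p>
</div>
"""


def extract_comments(html, min_length=20):
    """Parse html, pull paragraph text from the first comments container,
    replace non-breaking spaces, and keep texts longer than min_length."""
    doc = BeautifulSoup(html, "html.parser")
    container = doc.find(class_="field-item even")
    if container is None:  # no matching element on the page
        return []
    texts = [p.get_text().replace("\xa0", " ") for p in container.find_all("p")]
    return [t for t in texts if len(t) > min_length]


comments = extract_comments(SAMPLE_HTML)
# The scalar 'cancer' is repeated for every row of the DataFrame.
df = pd.DataFrame({"comments": comments, "type": "cancer"})
print(df)
```

The short "Thanks everyone" paragraph is dropped by the length filter, and the non-breaking space in the last paragraph becomes a regular space.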

## How to Use
1. Ensure you have the required libraries (BeautifulSoup, requests, pandas) installed.

2. Run the script. It fetches the HTML content, parses it, and extracts comments from the specified webpage.

3. The script generates a CSV file named 'cancer_research.csv' in the current working directory (the directory the script is run from, which is not necessarily where the script file is located). This file contains the cleaned comments along with their corresponding 'type' ('cancer').
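
One way to check the output is to load the CSV back with pandas and inspect its shape and columns. The sketch below writes a one-row stand-in file to a temporary directory so it can run on its own; the sample comment is invented, and in practice `csv_path` would point at the real 'cancer_research.csv'.

```python
import os
import tempfile

import pandas as pd

# Stand-in data with the same two columns the script produces (invented content).
sample = pd.DataFrame(
    {"comments": ["An example comment that is long enough."], "type": "cancer"}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "cancer_research.csv")
    sample.to_csv(csv_path, index=False)  # same call the script uses
    loaded = pd.read_csv(csv_path)

print(loaded.shape)
print(list(loaded.columns))
```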

In [5]:
from bs4 import BeautifulSoup
import requests
import pandas as pd


In [6]:
#URL of the forum site
url_link =  'https://www.cancerresearchuk.org/about-cancer/cancer-chat/thread/chemo-side-effects-2'

In [60]:
#Fetching the html of the site
result = requests.get(url_link).text
doc = BeautifulSoup(result, "html.parser")
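
The request above has no timeout and does not check the HTTP status, so a hung connection or an error page would go unnoticed. A more defensive variant might look like the sketch below; `fetch_html` and its `get` parameter are hypothetical names introduced here (the `get` hook exists only so the function can be exercised without network access).

```python
import requests


def fetch_html(url, timeout=10.0, get=requests.get):
    """Return the response body as text, failing loudly on HTTP errors.

    `get` defaults to requests.get; it is a parameter only so that a
    stand-in can be supplied when no network is available.
    """
    response = get(url, timeout=timeout)  # give up instead of waiting forever
    response.raise_for_status()           # raise on 4xx/5xx status codes
    return response.text


# Usage with the notebook's variable (performs a real HTTP request):
# result = fetch_html(url_link)
```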

In [63]:
# Locate the comments section: find() returns the first element whose
# class attribute is exactly "field-item even" (or None if there is none)
heading = doc.find(class_="field-item even")

In [64]:
# Collect the text of every <p> tag inside the comments section
paragraphs = heading.find_all('p')
comments = [str(p.text) for p in paragraphs]
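
Because `find()` stops at the first match, paragraphs in any later matching element are not collected. If a page contains several such elements (an assumption; the markup below is invented for illustration), `find_all()` can be used to walk through all of them:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two comment containers.
SAMPLE_HTML = """
<div class="field-item even"><p>First post in the thread.</p></div>
<div class="field-item even"><p>A reply further down the page.</p></div>
"""

doc = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find(): only the first matching container is searched.
first_only = [p.get_text() for p in doc.find(class_="field-item even").find_all("p")]

# find_all(): every matching container is searched; yields [] if none match.
all_posts = [
    p.get_text()
    for block in doc.find_all(class_="field-item even")
    for p in block.find_all("p")
]

print(first_only)
print(all_posts)
```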

In [65]:
# Replace non-breaking spaces, then keep only comments longer than 20 characters;
# shorter ones are unlikely to carry useful content
comments = [s.replace('\xa0', ' ') for s in comments]
comments = [x for x in comments if len(x) > 20]

In [67]:
# Create a DataFrame of the comments; the scalar 'cancer' is repeated for every row
df = pd.DataFrame({'comments': comments, 'type': 'cancer'})
# Write the DataFrame to a CSV file in the current working directory
df.to_csv('cancer_research.csv', index=False)