# **Classification of Documents Using Graph-Based Features and KNN**

## **1. Data Collection and Preparation:**
Collect or create 15 pages of text for each of the three assigned topics, ensuring each
page contains approximately 500 words.

**Import necessary libraries**

In [2]:
import requests
from bs4 import BeautifulSoup
import os
import json
import csv

**Function to scrape article links from a given URL**

In [3]:
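def scrape_articles_links(url):
    """Return all links found in the article listing section of `url`."""
    try:
        # A timeout keeps the request from hanging indefinitely
        response = requests.get(url, timeout=10)
        soup = BeautifulSoup(response.text, 'html.parser')
        articles = soup.find('section', class_='o-ListArticle')
        # hrefs are assumed to be protocol-relative ("//www..."), so prepend the scheme
        links = ["https:" + a['href'] for a in articles.find_all('a', href=True)]
        return links
    except Exception as e:
        print(f"Error scraping links from {url}: {e}")
        # Return an empty list (not None) so process_links() can still call len() on it
        return []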

**Function to process links**

In [4]:
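def process_links(articles_links):
    links = []
    # Step by 2: each article presumably appears twice in the listing (e.g. image link and title link)
    for i in range(0, len(articles_links), 2):
        # Stop at placeholder anchors ("#") or at what appear to be pagination links (".../articles/p...")
        if articles_links[i] == "https:#" or articles_links[i].startswith("https://www.foodnetwork.com/healthy/articles/p"):
            break
        links.append(articles_links[i])
    return links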

**Function to scrape data from a given URL**

In [5]:
def scrape_data(url):
    article = {}
    try:
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        article_content = soup.find('article', class_='article-content')
        if article_content:
            article['title'] = article_content.find('div', class_='assetTitle').get_text().replace('\n', '')
            body = article_content.find('div', class_='article-body')
            paras = body.find_all('div', class_='customRTE smartbody-core section')
            article['body'] = ' '.join([p.get_text() for p in paras]).replace('\n', '')
            article['words_count'] = len(article['body'].split())
        return article
    except Exception as e:
        print(f"Error scraping data from {url}: {e}")
        return None

**Function to save scraped data (dictionary) to a file in JSON format**

In [6]:
def save_to_json(data, filename):
    try:
        with open(filename, 'w', encoding='utf-8') as f:
            json.dump(data, f, ensure_ascii=False, indent=4)
        print(f"Data saved to {filename}")
    except Exception as e:
        print(f"Error saving data to {filename}: {e}")

**Function to save scraped data (dictionary) to a file in CSV format**

In [7]:
def save_to_csv(data, filename):
    try:
        with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
            fieldnames = ['title', 'body', 'words_count']
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            writer.writeheader()
            for article in data:
                writer.writerow({'title': article['title'], 'body': article['body'], 'words_count': article['words_count']})
        print(f"Data saved to {filename}")
    except Exception as e:
        print(f"Error saving data to {filename}: {e}")

**Main function to scrape articles**

In [8]:
def scrape_articles(url_base, pages, min_articles):
    articles_data = []
    articles_count = 0
    for i in range(1, pages + 1):
        url = f"{url_base}/p/{i}"
        print(url)
        articles_links = scrape_articles_links(url)
        links = process_links(articles_links)
        for link in links:
            data = scrape_data(link)
            if data and data.get('words_count', 0) > 500:
                print(f"Article {articles_count + 1} - {data['title']} - {data['words_count']} words")
                articles_data.append({'index': articles_count + 1, 'label': 'Food', **data})
                articles_count += 1
                if articles_count >= min_articles:
                    return articles_data
    return articles_data

**Run the scraper and save the results**

In [9]:
food_url = 'https://www.foodnetwork.com/healthy/articles'
pages = 5  # Number of pages to scrape
min_articles = 15  # Target number of articles (scraping stops once this many are collected)
articles_data = scrape_articles(food_url, pages, min_articles)
os.makedirs("articles", exist_ok=True)
# Save to JSON file
output_file = 'articles/food_articles.json'
save_to_json(articles_data, output_file)
# Save to CSV file
output_file = 'articles/food_articles.csv'
save_to_csv(articles_data, output_file)

https://www.foodnetwork.com/healthy/articles/p/1
Article 1 - Now Trending: Smoothie-Delivery Programs - 503 words
Article 2 - Scalloped Potatoes with Blue Cheese and Mushrooms - 611 words
Article 3 - Reading List: Red Meat Gets The Green Light, Chemicals in Canned Foods and Fast Food Facts - 513 words
Article 4 - 6 Surprising Tips for Cooking with Kids - 684 words
Article 5 - This Is the Right Internal Temperature for Cooked Chicken - 854 words
Article 6 - What Is Cutting and Bulking? And Should You Do It? - 670 words
Article 7 - Summer Fest: Cool Cucumber Soup - 562 words
https://www.foodnetwork.com/healthy/articles/p/2
Article 8 - No, California Is Not Banning Skittles - 810 words
Article 9 - When Cravings Call, What Are They Saying? - 548 words
Article 10 - This Week's Nutrition News Feed - 503 words
https://www.foodnetwork.com/healthy/articles/p/3
Article 11 - Feed Your Brain: 4 Healthy Foods with Brain-Boosting Nutrients - 686 words
Article 12 - Please! For the Love of Food Safety

Divide the dataset into a training set (12 pages per topic) and a test set (3 pages per 
topic).

In [10]:
# Fixed split: the first 12 articles form the training set and the remaining 3 the test set.
# (sklearn's train_test_split(articles_data, test_size=0.2, random_state=42) would give a random 12/3 split instead.)
train_set = articles_data[:12]
test_set = articles_data[12:]

# Print the number of articles in each set
print("Training set size:", len(train_set))
print("Test set size:", len(test_set))

Training set size: 12
Test set size: 3
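
The slice above handles a single topic. Since the task calls for 12 training pages and 3 test pages per topic, a per-topic split could be sketched as follows. This is a minimal sketch, assuming the articles for all topics are collected into one list of dictionaries with a `label` key like the one in `articles_data`; the function name `split_per_topic` and the sample labels are hypothetical.

```python
from collections import defaultdict

def split_per_topic(articles, n_train=12):
    """Split articles into train/test lists, taking the first n_train of each label."""
    by_label = defaultdict(list)
    for article in articles:
        by_label[article['label']].append(article)

    train, test = [], []
    for label, items in by_label.items():
        train.extend(items[:n_train])
        test.extend(items[n_train:])
    return train, test

# Hypothetical data: 15 placeholder articles for each of three example topics
sample = [{'index': i + 1, 'label': label}
          for label in ('Food', 'Sports', 'Travel')
          for i in range(15)]
train_set_all, test_set_all = split_per_topic(sample)
print(len(train_set_all), len(test_set_all))  # 36 9
```

Splitting per label keeps every topic represented in both sets, which a single slice over a concatenated list would not guarantee.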


## **2. Preprocessing**
Preprocess each article by applying tokenization, stop-word removal, and stemming.

In [11]:
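import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# One-time setup if NLTK data is missing ('punkt_tab' instead of 'punkt' on recent NLTK releases):
# nltk.download('punkt'); nltk.download('stopwords')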

In [12]:
# Tokenization
def tokenize(text):
    return nltk.word_tokenize(text)

# Stop-word removal
def remove_stopwords(tokens):
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [token for token in tokens if token.lower() not in stop_words]
    return filtered_tokens

# Stemming
def stem_tokens(tokens):
    stemmer = PorterStemmer()
    stemmed_tokens = [stemmer.stem(token) for token in tokens]
    return stemmed_tokens

In [13]:
# Preprocess train dataset
preprocessed_train_set = []
for article in train_set:
    title_tokens = tokenize(article['title'])
    body_tokens = tokenize(article['body'])
    
    title_tokens = remove_stopwords(title_tokens)
    body_tokens = remove_stopwords(body_tokens)
    
    title_tokens = stem_tokens(title_tokens)
    body_tokens = stem_tokens(body_tokens)

    words_count = len(body_tokens)
    
    preprocessed_train_set.append({'index': article['index'], 'label': article['label'], 'title_tokens': title_tokens, 'body_tokens': body_tokens, 'words_count': words_count})

# Print preprocessed train dataset
for article in preprocessed_train_set:
    print(f"Article: {article['index']}")
    print(f"Label: {article['label']}")
    print(f"Title Tokens: {article['title_tokens']}")
    print(f"Body Tokens: {article['body_tokens']}")
    print(f"Words Count: {article['words_count']}")
    print()

Article: 1
Label: Food
Title Tokens: ['trend', ':', 'smoothie-deliveri', 'program']
Body Tokens: ['popular', 'home-deliveri', 'cook', 'servic', 'continu', 'grow', '.', 'think', 'beyond', 'meal', 'program', ':', 'smoothi', 'juic', 'lover', 'get', 'action', '.', 'took', 'most-popular', 'option', 'whirl', 'blender', '.', 'smoothi', 'nutrit', 'smoothi', 'healthi', 'eat', ',', 'mind', 'ingredi', 'portion', 'size', ';', 'even', 'much', 'healthi', 'food', 'lead', 'excess', 'calori', '.', 'smoothi', 'tri', 'contain', 'anywher', '40', '200', 'calori', 'per', 'serv', 'vari', 'significantli', 'total', 'healthi', 'fat', ',', 'fiber', 'carbohydr', ',', 'depend', 'ingredi', '.', 'sinc', 'smoothi', 'made', 'noth', 'whole', 'food', ',', 'ad', 'sugar', '’', 'realli', 'concern', '.', 'order', ',', 'price', 'deliveri', 'place', 'order', 'super-easi', '.', 'brand', 'like', 'green', 'blender', 'daili', 'harvest', 'user-friendli', 'websit', 'get', 'smoothi', 'door', 'click', '.', 'price', 'rang', '$', '7', 
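
The token lists printed above still contain punctuation such as `'.'`, `','`, and `':'`, which would later become nodes in the graphs. If that is not wanted, a small filter can drop such tokens. This is a sketch only: `drop_punctuation` is a hypothetical helper that is not part of the original notebook, and the sample tokens are drawn from the output above.

```python
def drop_punctuation(tokens):
    """Keep only tokens that contain at least one letter or digit."""
    return [token for token in tokens if any(ch.isalnum() for ch in token)]

# Sample drawn from the body tokens printed above
sample_tokens = ['grow', '.', 'think', 'beyond', 'meal', 'program', ':', 'smoothi', '$', '7', '’']
print(drop_punctuation(sample_tokens))
# ['grow', 'think', 'beyond', 'meal', 'program', 'smoothi', '7']
```

Hyphenated stems such as `'home-deliveri'` pass through unchanged, because they contain letters.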

## **3. Graph Construction:**
Represent each page as a directed graph: nodes are the unique terms (words) left after preprocessing (tokenization, stop-word removal, and stemming), around 300 words per page, and edges denote term relationships based on their sequence in the text.

In [14]:
import networkx as nx
from nltk.tokenize import word_tokenize
import matplotlib.pyplot as plt

In [15]:
# Function to build directed graph
def build_graph(tokens):
    graph = nx.DiGraph()
    for i in range(len(tokens) - 1):
        if not graph.has_edge(tokens[i], tokens[i+1]):
            graph.add_edge(tokens[i], tokens[i+1], weight=1)
        else:
            graph.edges[tokens[i], tokens[i+1]]['weight'] += 1
    return graph
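
To see which weights `build_graph` produces, the same counting can be reproduced on a short token list without networkx. This is an illustrative sketch: `edge_weights` is a hypothetical helper, and the sample tokens are made up from stems that appear in the preprocessing output.

```python
from collections import Counter

def edge_weights(tokens):
    """Count how often each ordered pair of consecutive tokens occurs."""
    return Counter(zip(tokens, tokens[1:]))

sample_tokens = ['smoothi', 'nutrit', 'smoothi', 'healthi', 'smoothi', 'nutrit']
for (source, target), weight in edge_weights(sample_tokens).items():
    print(f"{source} -> {target}: {weight}")
# smoothi -> nutrit: 2
# nutrit -> smoothi: 1
# smoothi -> healthi: 1
# healthi -> smoothi: 1
```

Each key corresponds to a directed edge, and each count to the `weight` attribute that `build_graph` stores on that edge.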

In [16]:
# Function to plot the graph
def plot_graph(graph):
    pos = nx.spring_layout(graph)
    nx.draw(graph, pos, with_labels=True, node_color='skyblue', node_size=1500, edge_color='black', linewidths=1, font_size=10)
    labels = nx.get_edge_attributes(graph, 'weight')
    nx.draw_networkx_edge_labels(graph, pos, edge_labels=labels)
    plt.show()

In [17]:
graphs_train_set = []
for article in preprocessed_train_set:
    # Build the directed graph from the preprocessed body tokens
    graph = build_graph(article['body_tokens'])
    graphs_train_set.append(graph)
    # Report only after the graph has actually been built
    print("Article : ", article['index'], "Graph built")

    # Uncomment to plot the graph
    # plot_graph(graph)

Article :  1 Graph built
Article :  2 Graph built
Article :  3 Graph built
Article :  4 Graph built
Article :  5 Graph built
Article :  6 Graph built
Article :  7 Graph built
Article :  8 Graph built
Article :  9 Graph built
Article :  10 Graph built
Article :  11 Graph built
Article :  12 Graph built


## **4. Feature Extraction via Common Subgraphs:**
Utilize frequent subgraph mining techniques to identify common subgraphs within the training set graphs. These common subgraphs will serve as features for classification, capturing the shared content across documents related to the same topic.
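
Full frequent subgraph mining (for example, with an algorithm such as gSpan) is fairly involved. As a starting point, the simplest common subgraphs are single edges that recur across documents. The sketch below counts, for each directed edge, how many graphs contain it and keeps the edges that reach a minimum support. It is an illustration under stated assumptions rather than the method used in this notebook: each graph is reduced to a set of `(source, target)` pairs (for a networkx `DiGraph` this could be obtained with `set(graph.edges())`), and `frequent_edges`, `to_feature_vector`, and `min_support` are hypothetical names.

```python
from collections import Counter

def frequent_edges(edge_sets, min_support):
    """Return edges that occur in at least `min_support` of the given edge sets."""
    support = Counter()
    for edges in edge_sets:
        support.update(set(edges))  # count each edge once per graph
    return {edge: count for edge, count in support.items() if count >= min_support}

def to_feature_vector(edges, features):
    """Binary vector marking which frequent edges appear in one graph."""
    return [1 if edge in edges else 0 for edge in features]

# Toy edge sets standing in for three document graphs (made-up data)
doc_graphs = [
    {('healthi', 'food'), ('food', 'lead'), ('smoothi', 'nutrit')},
    {('healthi', 'food'), ('smoothi', 'nutrit'), ('cook', 'chicken')},
    {('healthi', 'food'), ('food', 'safeti')},
]
common = frequent_edges(doc_graphs, min_support=2)
features = sorted(common)
print(features)
# [('healthi', 'food'), ('smoothi', 'nutrit')]
print([to_feature_vector(g, features) for g in doc_graphs])
# [[1, 1], [1, 1], [1, 0]]
```

Vectors like these could then be passed to a KNN classifier. Mining larger subgraphs (paths or connected groups of several edges) would need a dedicated algorithm.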