In [14]:
%pip install faker



This code cell installs the `faker` library using the `%pip install` magic command. The `faker` library is used to generate synthetic data, which is helpful for creating realistic-looking data for testing and development purposes when you don't have access to real data.

In [15]:
from faker import Faker
import pandas as pd

fake = Faker()

def generate_client_record():
    """Generates a synthetic client record using Faker."""
    return {
        'company_name': fake.company(),
        'industry': fake.bs(),  # bs() returns a buzzword phrase, used here as a stand-in for an industry name
        'contact_person': fake.name(),
        'contact_email': fake.email(),
        'company_description': fake.catch_phrase() + ". " + fake.text(max_nb_chars=100)
    }

def generate_industry_overview(industry):
    """Generates a synthetic industry overview based on a given industry name."""
    # Simplified approach using Faker and string formatting; richer text would need a language model (avoided here because of free-tier limits).
    overview = f"The {industry} industry is currently experiencing {fake.word()} trends. Key players are focusing on {fake.catch_phrase().lower()}. Recent developments indicate a shift towards {fake.bs().lower()} solutions."
    return overview

This code cell imports `Faker` (for generating synthetic data) and `pandas` (for building a DataFrame), then defines two functions:

1.  **`generate_client_record()`**: Returns a dictionary describing a synthetic client: company name, industry, contact person, contact email, and company description. The industry value comes from `fake.bs()`, which returns a buzzword phrase rather than a real industry name.
2.  **`generate_industry_overview(industry)`**: Builds a short templated overview for the given industry name, filling the gaps with Faker-generated words and phrases.

These functions produce the raw synthetic data that the RAG system will use.
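
Because `fake.bs()` returns a buzzword phrase, the generated overviews can read oddly (for example, "The innovate innovative networks industry"). One possible alternative, sketched here with only the standard library, is to draw the industry from a fixed list. The `INDUSTRIES` list and the `pick_industry` helper are hypothetical and not part of the notebook.

```python
import random

# Hypothetical list of industry names; not part of the original notebook.
INDUSTRIES = ["Healthcare", "Retail", "Logistics", "Financial Services", "Manufacturing"]

def pick_industry(rng=random):
    """Return one industry name chosen uniformly from INDUSTRIES."""
    return rng.choice(INDUSTRIES)

# A seeded generator makes the choice repeatable across runs.
rng = random.Random(42)
industry = pick_industry(rng)
print(f"The {industry} industry is currently experiencing ...")
```

In `generate_client_record()`, the `'industry'` value could then come from `pick_industry()` instead of `fake.bs()`.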

In [16]:
num_records = 75 # Choose a number between 50 and 100
synthetic_data = []

for _ in range(num_records):
    client_data = generate_client_record()
    industry_overview = generate_industry_overview(client_data['industry'])
    synthetic_data.append({
        'client_data': client_data,
        'industry_overview': industry_overview
    })

df_synthetic = pd.DataFrame(synthetic_data)
display(df_synthetic.head())

|   | client_data | industry_overview |
|---|---|---|
| 0 | {'company_name': 'Williams and Sons', 'industr... | The innovate innovative networks industry is c... |
| 1 | {'company_name': 'Hayes, Ruiz and Jones', 'ind... | The transform cross-media applications industr... |
| 2 | {'company_name': 'Hopkins Group', 'industry': ... | The cultivate distributed convergence industry... |
| 3 | {'company_name': 'Wolfe-Villegas', 'industry':... | The re-intermediate proactive e-commerce indus... |
| 4 | {'company_name': 'Johnson Group', 'industry': ... | The transition collaborative methodologies ind... |


This code cell calls `generate_client_record()` and `generate_industry_overview()` once per record to build `num_records` synthetic records.

For each record, it stores the client dictionary and the industry overview in a new dictionary and appends it to a list. It then converts the list into a pandas DataFrame called `df_synthetic`, in which the `client_data` column holds the nested client dictionaries.

The `display(df_synthetic.head())` command is used to show the first few rows of the generated DataFrame, allowing you to inspect the structure and content of the synthetic data.
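
Each row stores the whole client record as a nested dictionary in `client_data`, which is why the display shows truncated dictionaries. If separate columns are more convenient, the records can be flattened before the DataFrame is built. This is a sketch under the assumption that every record has the two keys used in the cell; `flatten_record` is a hypothetical helper and the sample values are made up for illustration.

```python
def flatten_record(record):
    """Merge the nested client_data dict with the industry overview into one flat dict."""
    flat = dict(record["client_data"])  # copy so the original record is untouched
    flat["industry_overview"] = record["industry_overview"]
    return flat

# Illustrative record with the same shape as the entries in synthetic_data.
sample = {
    "client_data": {
        "company_name": "Example Co",
        "industry": "example industry",
        "contact_person": "Jane Doe",
        "contact_email": "jane@example.com",
        "company_description": "An example description.",
    },
    "industry_overview": "The example industry is ...",
}

flat = flatten_record(sample)
print(sorted(flat))
```

Passing `[flatten_record(r) for r in synthetic_data]` to `pd.DataFrame` would then give one column per field.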

In [17]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

# Download the stop-word list if it is not already available.
# RegexpTokenizer needs no NLTK data, so the punkt models are not required.
try:
    nltk.data.find('corpora/stopwords')
except LookupError:
    nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

# RegexpTokenizer is used instead of word_tokenize, so no punkt data is needed
tokenizer = RegexpTokenizer(r'\w+')  # Match runs of word characters, skipping punctuation

def tokenize_and_remove_stopwords_robust(text):
    """Tokenizes text using RegexpTokenizer and removes stop words."""
    if not isinstance(text, str):
        return []  # Return empty list for non-string inputs

    tokens = tokenizer.tokenize(text)

    # NLTK's stop words are lowercase, so compare on the lowercased token
    filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
    return filtered_tokens

# Cleaning runs before tokenization so that the tokenizer receives a plain string
import re
import string

def combine_and_clean_text(row):
    """Combines client data and industry overview, then cleans the text."""
    client_info = row['client_data']
    industry_info = row['industry_overview']

    # Extract relevant text from the client_data dictionary
    client_text = f"Company: {client_info.get('company_name', '')}. Industry: {client_info.get('industry', '')}. Contact: {client_info.get('contact_person', '')}. Email: {client_info.get('contact_email', '')}. Description: {client_info.get('company_description', '')}"

    # Combine client info and industry overview
    combined_text = f"{client_text} Industry Overview: {industry_info}"

    # Clean the text
    # Delete ASCII punctuation (no space is inserted, so 'Wolfe-Villegas' becomes 'wolfevillegas')
    combined_text = re.sub(f'[{re.escape(string.punctuation)}]', '', combined_text)
    # Convert to lowercase
    combined_text = combined_text.lower()
    # Remove extra whitespace
    combined_text = re.sub(r'\s+', ' ', combined_text).strip()

    return combined_text

# Build the cleaned string for each record
df_synthetic['document_text'] = df_synthetic.apply(combine_and_clean_text, axis=1)

# Tokenize the cleaned string and drop stop words (the column now holds lists of tokens)
df_synthetic['document_text'] = df_synthetic['document_text'].apply(tokenize_and_remove_stopwords_robust)

display(df_synthetic[['client_data', 'industry_overview', 'document_text']].head())

|   | client_data | industry_overview | document_text |
|---|---|---|---|
| 0 | {'company_name': 'Williams and Sons', 'industr... | The innovate innovative networks industry is c... | [company, williams, sons, industry, innovate, ... |
| 1 | {'company_name': 'Hayes, Ruiz and Jones', 'ind... | The transform cross-media applications industr... | [company, hayes, ruiz, jones, industry, transf... |
| 2 | {'company_name': 'Hopkins Group', 'industry': ... | The cultivate distributed convergence industry... | [company, hopkins, group, industry, cultivate,... |
| 3 | {'company_name': 'Wolfe-Villegas', 'industry':... | The re-intermediate proactive e-commerce indus... | [company, wolfevillegas, industry, reintermedi... |
| 4 | {'company_name': 'Johnson Group', 'industry': ... | The transition collaborative methodologies ind... | [company, johnson, group, industry, transition... |


This code cell focuses on **preprocessing the text data** using the **Natural Language Toolkit (NLTK)** library.

**What is NLTK?**
NLTK is a Python library for working with human language data. It bundles interfaces to more than 50 corpora and lexical resources, such as WordNet, together with modules for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.

**Why do we use NLTK here?**
In this context, NLTK handles two preprocessing steps, applied after the text has been combined and cleaned with Python's `re` module:

1.  **Tokenization**: Breaking the text into individual words, or tokens. Most NLP techniques operate on tokens rather than raw strings. Here, `RegexpTokenizer` with the pattern `\w+` keeps each run of letters, digits, and underscores as a token and drops everything else, including punctuation.
2.  **Removing Stop Words**: Dropping common words (such as 'the', 'a', 'is', 'in') that rarely carry meaning for retrieval or analysis. This reduces noise for keyword-based methods; whether it helps a later embedding step depends on the embedding model, since many are trained on full natural-language sentences.

After these steps, the `document_text` column holds a list of tokens for each record, ready for further processing such as creating text embeddings for the RAG system.
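
To make the two steps concrete, here is a standard-library sketch of what the cell does to a single string. It assumes that `RegexpTokenizer(r'\w+')` behaves like `re.findall(r'\w+', text)` for this kind of input, and it uses a tiny hand-written stop-word set (`SAMPLE_STOP_WORDS`, hypothetical) in place of NLTK's English list.

```python
import re

# Small illustrative subset; NLTK's English stop-word list is much longer.
SAMPLE_STOP_WORDS = {"the", "is", "a", "in", "and", "are", "on"}

def tokenize_and_filter(text):
    """Split text into word tokens and drop tokens found in SAMPLE_STOP_WORDS."""
    tokens = re.findall(r"\w+", text)
    return [t for t in tokens if t.lower() not in SAMPLE_STOP_WORDS]

print(tokenize_and_filter("company williams and sons industry innovate innovative networks"))
# ['company', 'williams', 'sons', 'industry', 'innovate', 'innovative', 'networks']
```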

# Summary of Progress: Data Generation and Preprocessing

This notebook demonstrates the initial steps in building a Retrieval-Augmented Generation (RAG) system for a salesperson preparation tool. So far, we have:

1.  **Generated Synthetic Data:** Using the Faker library, we created synthetic client information and industry overviews. This gives us a realistic-looking dataset for developing and testing the RAG system without relying on real-world data.
2.  **Preprocessed Text Data:** We preprocessed the generated data using Python's `re` module and the NLTK library. This included:
    *   Combining client information and industry overviews into a single text document per record.
    *   Cleaning the text by removing punctuation and converting it to lowercase.
    *   Tokenizing the text into individual words using `RegexpTokenizer`.
    *   Removing common English stop words that do not contribute significantly to the meaning of the text.

These steps have prepared the data for the next stages of the RAG pipeline, which involve creating text embeddings and building a vector index for efficient information retrieval.
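
One detail of the cleaning step is worth checking before embeddings are built: deleting punctuation outright fuses hyphenated names and email addresses into single tokens (the preprocessing output shows `wolfevillegas`). A possible variation, which is not what the notebook does, replaces each punctuation character with a space instead. The helper names in this sketch are hypothetical.

```python
import re
import string

PUNCT_PATTERN = re.compile(f"[{re.escape(string.punctuation)}]")

def clean_by_deleting(text):
    """Mirror the notebook's cleaning: delete punctuation, lowercase, collapse whitespace."""
    text = PUNCT_PATTERN.sub("", text).lower()
    return re.sub(r"\s+", " ", text).strip()

def clean_by_spacing(text):
    """Variation: replace punctuation with a space so word boundaries survive."""
    text = PUNCT_PATTERN.sub(" ", text).lower()
    return re.sub(r"\s+", " ", text).strip()

print(clean_by_deleting("Company: Wolfe-Villegas."))  # company wolfevillegas
print(clean_by_spacing("Company: Wolfe-Villegas."))   # company wolfe villegas
```

Which variant is preferable depends on whether the retrieval step should treat `wolfe` and `villegas` as separate searchable terms.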