## **Market Research Assistant**


# Abstract
This project develops an AI-powered market research assistant that generates concise industry reports based on Wikipedia data. Designed for business analysts in large corporations, the assistant automates information retrieval and report generation, reducing manual effort in market research.

# Objective
To create an automated system that fetches and summarizes industry-specific information from Wikipedia, structuring the output in a standardized JSON format.

# Problem Statement
Market research is time-intensive, requiring analysts to sift through vast amounts of data to extract relevant insights. This project addresses this challenge by leveraging natural language processing (NLP) and retrieval-based AI to generate structured reports, including industry overviews, key players, market trends, and relevant sources.

# Key Features

- Wikipedia-based industry research retrieval
- JSON-formatted structured output
- Automated summarization of key industry insights
- Performance optimization techniques for improved accuracy

This implementation uses LangChain's `PromptTemplate` to hold the prompt, the `wikipedia` Python package to retrieve page content, and Google's Gemini 1.5 Flash model to generate the report. Future improvements may include refining prompt engineering, integrating multiple data sources, and enhancing retrieval accuracy.
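The structured output can be checked mechanically once a report has been parsed. The sketch below assumes the six top-level section names used in the prompt template later in this notebook; `EXPECTED_SECTIONS` and `missing_sections` are illustrative names, not part of the implementation.

```python
# Top-level section names, copied from the report structure in the prompt template.
EXPECTED_SECTIONS = [
    "Industry Overview",
    "Key Trends",
    "Challenges",
    "Market Size and Growth Potential",
    "Major Companies",
    "Conclusion",
]


def missing_sections(report):
    """Return the expected sections that are absent or empty in a parsed report."""
    return [name for name in EXPECTED_SECTIONS if not report.get(name)]


# Hypothetical partial report, for illustration only.
sample_report = {
    "Industry Overview": {"Significance": "..."},
    "Key Trends": [],
}
print(missing_sections(sample_report))
```

Because the prompt asks the model to drop sections with no meaningful data, a missing section is not necessarily an error; a check like this is mainly useful for spotting reports that came back unexpectedly thin.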

In [None]:
!pip install google-generativeai
!pip install wikipedia
!pip install langchain fpdf




In [None]:
# !pip install google-generativeai wikipedia langchain fpdf

import google.generativeai as genai
import wikipedia
import json
import os
import re
from langchain.prompts import PromptTemplate
from functools import lru_cache
from fpdf import FPDF

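# Configure the Gemini API key from the GEMINI_API_KEY environment variable.
# The fallback string is a redacted placeholder, not a usable key.
genai.configure(api_key=os.getenv("GEMINI_API_KEY", "AI**"))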


system_prompt_template = PromptTemplate.from_template("""
You are an AI research assistant for a business analyst conducting market research on various industries.
Your task is to generate a **detailed and structured** industry report **in JSON format** based **ONLY** on the Wikipedia data provided.
Pay special attention to capturing detailed information on:
- Challenges, risks, and barriers faced by the industry.
- Market size, growth trends, and any numerical data available.
- Detailed information on major companies, including historical context and notable events (e.g., acquisitions).

Please ensure that the descriptions in the "Market Size and Growth Potential" section are as long and descriptive as possible, and include all available details from the source.

### **Report Structure**
{
  "Industry Overview": {
    "Significance": "<Explain the industry's significance and economic role>",
    "Historical Development": "<Discuss historical development and transformation over time>",
    "Global Differences": "<Provide insights into global vs. regional differences>"
  },
  "Key Trends": [
    {
      "Trend": "<Trend name>",
      "Description": "<Trend description>"
    },
    ... (at least 2-4 items)
  ],
  "Challenges": {
    "Risks And Barriers": "<Identify key risks and barriers (economic, regulatory, operational)>",
    "Addressing Issues": "<Describe how companies are addressing these issues>"
  },
  "Market Size and Growth Potential": {
    "Market Size": "<Provide a detailed description including any numerical data available>",
    "Growth Trends": "<Provide a detailed analysis of historical growth trends and future projections>"
  },
  "Major Companies": {
    "companies": [
      "<List 2-5 leading companies along with a brief note on their market influence, innovations, and historical context>"
    ]
  },
  "Conclusion": {
    "Industry Outlook": "<Summarize the industry outlook>",
    "Future Opportunities": "<Highlight future opportunities and key developments>"
  }
}

### **Important Rules**
- Remove any section with no meaningful data.
- The final report must be **less than 600 words**.
- Ensure structured JSON output enclosed in triple backticks with the label 'json'.
- Do NOT generate new facts beyond the Wikipedia content.
""")



@lru_cache(maxsize=32)
def get_wikipedia_content(industry):
    """
    Retrieves the full Wikipedia page content and URL for the given industry.
    If the direct query fails or is ambiguous, it tries alternative queries
    (appending 'industry' if necessary, and finally using Wikipedia's search function).
    Uses caching to avoid duplicate requests.
    """
    try:
        page = wikipedia.page(industry)
    except (wikipedia.exceptions.PageError, wikipedia.exceptions.DisambiguationError):
        alt_query = industry
        if "industry" not in industry.lower():
            alt_query = industry + " industry"
        try:
            page = wikipedia.page(alt_query)
        except (wikipedia.exceptions.PageError, wikipedia.exceptions.DisambiguationError):
            search_results = wikipedia.search(industry)
            if search_results:
                try:
                    page = wikipedia.page(search_results[0])
                except Exception as e:
                    raise ValueError(f"Unable to retrieve a page for '{industry}' using search results: {e}")
            else:
                raise ValueError(f"No Wikipedia page found for '{industry}'.")
    return page.content, page.url

def extract_json_block(text):
    """
    Uses regex to extract a JSON block enclosed in triple backticks with the 'json' label.
    """
    json_pattern = r"```json\s*(\{.*?\})\s*```"
    match = re.search(json_pattern, text, re.DOTALL)
    if match:
        json_str = match.group(1)
        try:
            return json.loads(json_str)
        except json.JSONDecodeError as e:
            raise ValueError(f"JSON parsing failed: {str(e)}")
    else:
        raise ValueError("No JSON block found in the model output.")

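def extract_additional_info(wiki_text, keywords):
    """
    Extracts additional information from the Wikipedia text based on provided keywords.
    Returns concatenated sentences containing those keywords, or an empty string if none match.
    Note: splitting on '.' is a rough heuristic and also splits decimals such as "3.5".
    """
    sentences = wiki_text.split('.')
    relevant = [s.strip() for s in sentences if any(kw.lower() in s.lower() for kw in keywords)]
    # Return an empty string (not a lone ".") when nothing matches.
    return ". ".join(relevant) + "." if relevant else ""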

def merge_additional_info(report_json, wiki_text):
    """
    Merges additional extracted information about challenges and market data into the report JSON.
    """
    # Define keywords for challenges and market size.
    challenge_keywords = ["challenge", "risk", "barrier", "obstacle", "problem"]
    market_keywords = ["market size", "growth", "share", "projection", "trend", "dollar", "billion"]

    # Extract additional info.
    extra_challenges = extract_additional_info(wiki_text, challenge_keywords)
    extra_market = extract_additional_info(wiki_text, market_keywords)

    # Fill the fields only when the model left them missing or empty.
    if not report_json.get("Challenges", {}).get("Risks And Barriers"):
        report_json.setdefault("Challenges", {})["Risks And Barriers"] = extra_challenges
    if not report_json.get("Market Size and Growth Potential", {}).get("Market Size"):
        report_json.setdefault("Market Size and Growth Potential", {})["Market Size"] = extra_market

    return report_json

def clean_text(text):
    """
    Cleans the text by removing specific patterns like numeric ranges with Unicode dashes.
    """
    # Remove patterns like "1974–1984" (digits, en-dash, digits)
    cleaned = re.sub(r'\d+–\d+', '', text)
    return cleaned.strip()

def clean_report(report):
    """
    Recursively cleans all string fields in the report dictionary.
    """
    if isinstance(report, dict):
        return {k: clean_report(v) for k, v in report.items()}
    elif isinstance(report, list):
        return [clean_report(item) for item in report]
    elif isinstance(report, str):
        return clean_text(report)
    else:
        return report

def generate_market_report_gemini(industry):
    """
    Generates a comprehensive market research report for the given industry.
    Combines full Wikipedia data with a detailed prompt and uses Gemini 1.5 Flash.
    Returns a structured JSON report.
    """
    try:
        wiki_content, wiki_url = get_wikipedia_content(industry)
    except ValueError as err:
        return {"error": str(err)}

    # Assemble the final prompt by combining the system prompt with the full Wikipedia data.
    final_prompt = system_prompt_template.template + "\n\nWikipedia Data:\n" + wiki_content

    # Use Gemini 1.5 Flash for content generation.
    model = genai.GenerativeModel("gemini-1.5-flash")
    response = model.generate_content([final_prompt])
    full_text = response.text.strip()

    try:
        report_json = extract_json_block(full_text)
    except ValueError as e:
        return {"error": str(e), "raw_output": full_text}

    # Merge in additional extracted info (e.g., challenges, market data).
    report_json = merge_additional_info(report_json, wiki_content)

    # Add metadata (overwrites any existing values for these keys).
    report_json["source_url"] = wiki_url
    report_json["industry"] = industry

    # Clean all text in the report to remove undesired patterns.
    report_json = clean_report(report_json)

    return report_json

# Generate reports for multiple industries.
industries = [
    "Aluminium",
    "Advertising",
    "Internet Services",
    "Automobile Manufacturers",
    "Automobile Manufacturers in South Korea"
]
all_reports = {ind: generate_market_report_gemini(ind) for ind in industries}

# Convert the reports dictionary into a formatted JSON string.
json_string = json.dumps(all_reports, indent=4, ensure_ascii=False)

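`extract_additional_info` splits the page text on every period, which also cuts figures such as "3.5 billion" in two. A minimal sketch of one alternative is shown here: split only where sentence punctuation is followed by whitespace. The names `split_sentences` and `extract_by_keywords` and the sample text are illustrative assumptions, not part of the notebook.

```python
import re


def split_sentences(text):
    """Split after ., ! or ? only when whitespace follows, so '3.5' stays intact."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_by_keywords(text, keywords):
    """Return the sentences that mention any keyword, joined by single spaces."""
    lowered = [kw.lower() for kw in keywords]
    hits = [s for s in split_sentences(text) if any(kw in s.lower() for kw in lowered)]
    return " ".join(hits)


# Made-up sample text, for illustration only.
sample = "The market was worth 3.5 billion dollars. Demand is seasonal. Growth slowed in 2020."
print(extract_by_keywords(sample, ["billion", "growth"]))
```

This is still a heuristic: abbreviations followed by a space (for example "U.S. market") would be split, so a dedicated sentence tokenizer may be a better fit if accuracy matters.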

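The prompt asks for a report of fewer than 600 words, but nothing in the pipeline verifies that the model complied. The sketch below shows one way such a check might look, counting whitespace-separated words across all string values of a parsed report. `iter_strings` and `report_word_count` are hypothetical helper names, and the sample report contains placeholder text only.

```python
def iter_strings(node):
    """Yield every string value in a nested dict/list structure (keys are skipped)."""
    if isinstance(node, dict):
        for value in node.values():
            yield from iter_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_strings(item)
    elif isinstance(node, str):
        yield node


def report_word_count(report):
    """Count whitespace-separated words across all string values in the report."""
    return sum(len(text.split()) for text in iter_strings(report))


# Placeholder report, for illustration only.
sample_report = {
    "Industry Overview": {"Significance": "one two three"},
    "Key Trends": [{"Trend": "four", "Description": "five six"}],
}
print(report_word_count(sample_report))
```

A count of 600 or more could be logged or used to trigger a retry. Metadata such as `source_url` would also be counted if it is added before the check runs.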
In [None]:
# --- PDF Generation Section ---
from fpdf import FPDF

class PDF(FPDF):
    # Override the footer with a no-op so that no page numbers are printed.
    def footer(self):
        pass

pdf = PDF()
pdf.add_page()
pdf.set_auto_page_break(auto=True, margin=15)
# Use a monospaced font (Courier) for better JSON readability.
pdf.set_font("Courier", size=10)

# Convert your JSON reports dictionary into a formatted string.
json_string = json.dumps(all_reports, indent=4, ensure_ascii=False)
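# Replace en dashes with hyphens, then replace any remaining character that
# cannot be encoded as Latin-1 with "?" so the built-in Courier font can render it.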
safe_string = json_string.replace("\u2013", "-").encode("latin-1", "replace").decode("latin-1")



# Optionally, you can remove any unwanted page markers from the safe_string.
# For example, if your JSON accidentally contains "Page X", you can remove it:
safe_string = re.sub(r'Page\s+\d+', '', safe_string)

# Add the JSON string to the PDF.
pdf.multi_cell(0, 5, safe_string)

pdf_file = "Q9_MSIN0231.pdf"
pdf.output(pdf_file)

print(f"All detailed reports have been saved to '{pdf_file}'.")


All detailed reports have been saved to 'Q9_MSIN0231.pdf'.
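The PDF cell maps en dashes to hyphens and lets `encode("latin-1", "replace")` turn every other unsupported character into "?". A slightly broader mapping, sketched below under the assumption that curly quotes and long dashes are the most common offenders in Wikipedia text, keeps more of the output readable. `to_latin1_safe` is an illustrative name, not a function from the notebook.

```python
def to_latin1_safe(text):
    """Map common typographic characters to ASCII, then replace the rest with '?'."""
    replacements = {
        "\u2013": "-",   # en dash
        "\u2014": "-",   # em dash
        "\u2018": "'",   # left single quote
        "\u2019": "'",   # right single quote
        "\u201c": '"',   # left double quote
        "\u201d": '"',   # right double quote
    }
    for src, dst in replacements.items():
        text = text.replace(src, dst)
    # Anything still outside Latin-1 becomes '?', as in the cell above.
    return text.encode("latin-1", "replace").decode("latin-1")


# Characters already in Latin-1 (such as e-acute) pass through unchanged.
print(to_latin1_safe("1974\u20131984 \u201cgrowth\u201d caf\u00e9 \u4e2d"))
```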

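`generate_market_report_gemini` returns a dictionary with an `"error"` key when retrieval or JSON parsing fails, so a batch run can contain a mix of reports and failures. The sketch below shows one way to separate the two before exporting; `split_reports` and the sample data are hypothetical.

```python
def split_reports(reports):
    """Separate successful reports from entries that carry an 'error' key."""
    ok = {name: r for name, r in reports.items() if "error" not in r}
    failed = {name: r["error"] for name, r in reports.items() if "error" in r}
    return ok, failed


# Hypothetical results, for illustration only.
sample = {
    "Aluminium": {"Industry Overview": {"Significance": "..."}},
    "Unknown Sector": {"error": "No Wikipedia page found for 'Unknown Sector'."},
}
ok, failed = split_reports(sample)
print(sorted(ok), failed)
```

Passing only `ok` to the JSON dump or the PDF step would keep error payloads (and any `raw_output` text) out of the final document.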

In [None]:
# Additional Industry reports for testing purposes

# Generate reports for your five chosen industries.
industries = [
    "Automotive",
    "Luxury Fashion",
    "Video Games",
    "Laptop",
    "FMCG in India"
]

all_reports = {}
for ind in industries:
    all_reports[ind] = generate_market_report_gemini(ind)

# Convert the reports dictionary into a formatted JSON string.
json_string = json.dumps(all_reports, indent=4, ensure_ascii=False)

# Print the final JSON outputs for the five chosen industries.
print(json_string)


{
    "Automotive": {
        "Industry Overview": {
            "Significance": "The automotive industry designs, develops, manufactures, markets, and sells motor vehicles, with cars comprising over three-quarters of production.  It plays a significant role in global economies, employing millions and impacting energy consumption, infrastructure development, and societal mobility.",
            "Historical Development": "Early car development involved steam-powered vehicles (Cugnot, 1769) and internal combustion engines (de Rivaz, 1808).  Carl Benz's 1886 Benz Patent-Motorwagen is considered the first practical, marketable automobile. Mass production, starting with Oldsmobile in 1901 and revolutionized by Ford's moving assembly line in 1913, made cars widely accessible.  The industry experienced rapid technological advancements throughout the 20th century.",
            "Global Differences": "China is the largest producer and market for cars, followed by the US, Japan, Germany, South K