# Task 2 Summary

In **Task 2**, we developed a Python script to extract specific information from company websites (Nestlé, Pfizer, and Coca-Cola) to evaluate their alignment with various health-related categories. The output was structured into a DataFrame with columns representing different operational and health aspects of each company, facilitating a clear comparative analysis.

---

## Libraries Used

1. **requests**: Sends HTTP requests to fetch website content.
2. **BeautifulSoup**: Parses and navigates through the HTML content.
3. **pandas**: Organizes extracted data into a structured DataFrame and exports it to an Excel file.
4. **openpyxl**: Serves as the engine that pandas uses to write the DataFrame to the `.xlsx` format.

---

## Columns with Objectives

- **Products Offered**: Identifies if the company provides health-related products like supplements, vitamins, or nutrition products.
- **Company Description**: Captures a brief overview of the company’s history, mission, and values.
- **Category**: Classifies the company as a manufacturer, distributor, or brand.
- **Health Relevance**: Assesses if health or wellness is a core focus for the company.
- **Manufacturer**: Checks if the company manufactures its own products.
- **Brand Presence**: Evaluates the visibility and presence of the company’s brand lineup.
- **Distribution Role**: Determines if the company actively manages its distribution or supply chain.

This column-based approach provides a consistent structure to analyze each company’s alignment within the health sector and specific operational roles.

---

## Process Summary

We created a Python script to analyze the health relevance of three companies—**Nestlé, Pfizer, and Coca-Cola**—based on defined health-related criteria. Below is an overview of the process:

1. **Objective Definition**:
   - Defined specific objectives for each DataFrame column, such as *Products Offered* and *Company Description*.

2. **Data Extraction Using Python Libraries**:
   - Leveraged libraries:
     - `requests` and `BeautifulSoup` for web scraping,
     - `pandas` to structure and analyze the data,
     - `openpyxl` to export the results into an Excel file.
   
3. **Keyword and Selector Setup**:
   - Defined keywords and CSS selectors to identify and extract relevant content per company, ensuring alignment with health-related criteria like health product offerings, brand visibility, and manufacturing involvement.

4. **DataFrame Population and Export**:
   - Populated the DataFrame with the extracted data, enabling easy comparison.
   - Exported the structured data to an Excel file for further review.

---

In [284]:
pip install requests beautifulsoup4 pandas

Note: you may need to restart the kernel to use updated packages.


In [285]:
pip install openpyxl

Note: you may need to restart the kernel to use updated packages.


In [286]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

## List of companies and their websites

In [288]:
companies = {
    "Nestle": "https://www.nestle.com",
    "Pfizer": "https://www.pfizer.com",
    "Coca-Cola": "https://www.coca-cola.com"}

## Creating an empty list to store results

In [290]:
results = []

## Defining a function to check whether any keyword is present

In [292]:
def check_keywords(content, keywords):
    return any(keyword.lower() in content.lower() for keyword in keywords)
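A quick way to see how `check_keywords` behaves is to call it on a few short strings. The sketch below repeats the function so that it runs on its own; the sample sentences and the shortened keyword list are made up for illustration and do not come from the scraped sites. Because the check is a plain case-insensitive substring test, a keyword such as `'care'` also matches inside `'careers'`.

```python
def check_keywords(content, keywords):
    """Return True if any keyword occurs in content (case-insensitive substring match)."""
    return any(keyword.lower() in content.lower() for keyword in keywords)

# Hypothetical sample strings, not taken from any of the scraped sites.
health_keywords = ['health', 'wellness', 'care']

matches_health = check_keywords("Good Food, Good Life: Nutrition, HEALTH and more", health_keywords)
matches_nothing = check_keywords("Investor relations and annual reports", health_keywords)
# Substring matching can over-match: 'care' is found inside 'careers'.
matches_careers = check_keywords("Explore careers at our company", health_keywords)

print(matches_health, matches_nothing, matches_careers)
```

If whole-word matching is preferred, a regular expression with word boundaries would be one option, at the cost of missing plural or inflected forms.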

## Defining the CSS selectors for each company

In [294]:
selectors = {
    "Nestle": {
        "products_offered": ".description, .title-product, .page-header__title",
        "company_description": ".card-item__body p, .card-item__title h2, b",
        "category": ".news, .page-header__body",
        "health_relevance": "li, .ct-footer h2, p, .page-header__title",
        "manufacturer": "h3, p, h2, .page-header__title",
        "brand_presence": ".description, .title-product, .page-header__title",
        "distribution_role": ".text p, h2, .page-header__title"
    },
    "Pfizer": {
        "products_offered": "h4, .bw-pull-quote, p, .h3-styling-press-release",
        "company_description": ".link-text a, .link-l, .first-sub",
        "category": "h4, .banner-tile__headline-small, p",
        "health_relevance": ".dark-text, p, .main-menu__overlay, .landing-page-title",
        "manufacturer": "h4, p, .banner-tile__headline-small",
        "brand_presence": ".main :nth-child(1)",
        "distribution_role": "h4, .breadcrumbs, p, .banner-tile__headline-small"
    },
    "Coca-Cola": {
        "products_offered": ".cmp-container :nth-child(1)",
        "company_description": ".cmp-teaser__content :nth-child(1)",
        "category": "#container-7ee9da1a92 :nth-child(1)",
        "health_relevance": "#container-f3ebc6e3ad :nth-child(1)",
        "manufacturer": ".cmp-teaser__title, p, .cmp-title__text, .cmp-tabs__tablist",
        "brand_presence": ".cmp-tabs, .sub-nav-brands, p, .cmp-title__text",
        "distribution_role": ".cmp-container :nth-child(1)"
    }
}
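Several of the selectors above are comma-separated groups. With BeautifulSoup, `select_one` returns only the first element in the document that matches any selector in the group, while `select` returns every match. The scraping loop in this notebook uses `select_one`, so the difference affects which text is checked for keywords. The sketch below illustrates it on a small, hypothetical HTML snippet (the markup is invented for the example).

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet used only for illustration.
html = """
<div>
  <h2 class="page-header__title">Welcome</h2>
  <p class="description">Nutrition, health and wellness products</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
selector = ".description, .page-header__title"

# select_one returns only the first matching element in document order.
first_text = soup.select_one(selector).get_text(strip=True)

# select returns every matching element, so all of their text can be combined.
all_text = " ".join(el.get_text(" ", strip=True) for el in soup.select(selector))

print(first_text)
print(all_text)
```

In this snippet the heading comes first, so a keyword check on `first_text` alone would not see the word "health" that appears in the paragraph.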

## Defining the keywords for each field

In [296]:
keywords_dict = {
    "products_offered": ['health', 'nutrition', 'probiotics', 'vitamins', 'supplements', 'health drinks', 'snacks'],
    "company_description": ['about us', 'company overview', 'mission', 'history', 'vision', 'values'],
    "category": ['manufacturer', 'distributor', 'brand', 'consumer goods', 'healthcare'],
    "health_relevance": ['health', 'wellness', 'nutrition', 'lifestyle', 'care', 'sustainability', 'research'],
    "manufacturer": ['manufacturing', 'production', 'factory', 'quality assurance', 'processes'],
    "brand_presence": ['brands', 'brand portfolio', 'our brands', 'product lines'],
    "distribution_role": ['distribution', 'supply chain', 'logistics', 'partners', 'distribution network']
}
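The scraping loop looks up `keywords_dict[field]` for every field named in `selectors`, so a field that has a selector but no keyword list would raise a `KeyError`. A small consistency check, sketched below, can catch that before any request is sent. The helper name `missing_keyword_fields` and the shortened demo dictionaries are hypothetical and exist only for this example.

```python
def missing_keyword_fields(selectors, keywords_dict):
    """Map each company to the selector fields that have no keyword list."""
    problems = {}
    for company, fields in selectors.items():
        missing = sorted(set(fields) - set(keywords_dict))
        if missing:
            problems[company] = missing
    return problems

# Hypothetical, shortened inputs for illustration only.
demo_selectors = {
    "CompanyA": {"products_offered": "p", "category": "h2"},
    "CompanyB": {"products_offered": "p", "brand_presence": ".brands"},
}
demo_keywords = {"products_offered": ["health"], "category": ["brand"]}

problems = missing_keyword_fields(demo_selectors, demo_keywords)
print(problems)
```

An empty dictionary means every field with a selector also has keywords defined.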

## Going through each company to scrape data

### Making a request to the website

In [300]:
# Iterate through each company to scrape data
for company, url in companies.items():
    # Make a request to the website; the timeout keeps a slow site from hanging the loop
    response = requests.get(url, timeout=30)

    # Continue only if the request was successful
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')

        # Create the data dictionary for this company
        company_data = {"Company": company, "Website": url}

        # Evaluate each field using its selector and keyword list
        for field, css_selector in selectors[company].items():
            # select_one returns only the first element matching the selector group,
            # so the keyword check covers that single element's text
            element = soup.select_one(css_selector)
            field_value = 'Yes' if element and check_keywords(element.get_text(), keywords_dict[field]) else 'No'
            # Convert the field key to a readable column name, e.g. "products_offered" -> "Products Offered"
            company_data[field.replace("_", " ").title()] = field_value

        # Append this company's results to the list
        results.append(company_data)
    else:
        # Report failed requests instead of skipping them silently
        print(f"Request to {url} failed with status code {response.status_code}")
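The loop above calls `requests.get` directly, so a connection error or an invalid URL would raise an exception and stop the whole run. A minimal sketch of a more defensive fetch step follows. The helper name `fetch_html`, the 10-second default timeout, and the `User-Agent` string are assumptions made for this example; whether any of the three sites needs a custom header was not tested here.

```python
import requests

def fetch_html(url, timeout=10):
    """Return the page body as bytes, or None if the request fails."""
    # Assumed header value; some sites respond differently to the default client string.
    headers = {"User-Agent": "Mozilla/5.0 (compatible; example-scraper)"}
    try:
        response = requests.get(url, headers=headers, timeout=timeout)
        response.raise_for_status()  # raises HTTPError for 4xx/5xx responses
    except requests.exceptions.RequestException as exc:
        print(f"Skipping {url}: {exc}")
        return None
    return response.content

# A URL without a scheme fails validation before any network traffic is sent.
result = fetch_html("not-a-valid-url")
print(result)
```

With a helper like this, the loop could skip a company when `None` is returned and continue with the remaining ones.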

### Creating a DataFrame from the results

In [310]:
data = pd.DataFrame(results)
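`pd.DataFrame` turns a list of dictionaries into one row per dictionary, with the dictionary keys as column names. The sketch below shows this on two hypothetical rows shaped like the `company_data` dictionaries, and adds an illustrative `Yes Count` column that is not part of the original analysis.

```python
import pandas as pd

# Hypothetical rows shaped like the company_data dictionaries built in the loop.
demo_results = [
    {"Company": "CompanyA", "Website": "https://example.com/a", "Products Offered": "Yes"},
    {"Company": "CompanyB", "Website": "https://example.com/b", "Products Offered": "No"},
]

demo = pd.DataFrame(demo_results)

# Count how many "Yes" flags each company received across the criteria columns.
criteria_columns = [c for c in demo.columns if c not in ("Company", "Website")]
demo["Yes Count"] = (demo[criteria_columns] == "Yes").sum(axis=1)

print(demo)
```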

### Displaying the DataFrame

In [313]:
data

| | Company | Website | Products Offered | Company Description | Category | Health Relevance | Manufacturer | Brand Presence | Distribution Role |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Nestle | https://www.nestle.com | No | No | No | No | No | No | No |
| 1 | Pfizer | https://www.pfizer.com | Yes | No | No | No | No | No | No |
| 2 | Coca-Cola | https://www.coca-cola.com | No | No | No | No | No | No | No |


### Saving an Excel file at the specified path

In [316]:
data.to_excel(r"C:\Users\verma\Downloads\company_health_relevance.xlsx", index=False)
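The path above is specific to one Windows machine. A portable variant can build the path with `pathlib`, as sketched below. The temporary directory and the one-row demo frame are assumptions chosen so the example runs anywhere; any writable folder would work in practice. Writing `.xlsx` files requires `openpyxl` to be installed.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical one-row frame standing in for the real results.
demo = pd.DataFrame([{"Company": "CompanyA", "Products Offered": "Yes"}])

# A temporary directory is an assumption made so the example runs anywhere.
output_dir = Path(tempfile.mkdtemp())
output_path = output_dir / "company_health_relevance.xlsx"

# index=False leaves out the row index; pandas uses openpyxl for .xlsx files.
demo.to_excel(output_path, index=False)

# Read the file back to confirm the round trip.
round_trip = pd.read_excel(output_path)
print(round_trip)
```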

## Conclusion

This process gives a repeatable, keyword-based way to flag each company's role in health and wellness alongside its operational focus areas. The final DataFrame records a Yes/No value per criterion, so the three companies can be compared side by side. In this run, only Pfizer's *Products Offered* field returned "Yes"; every other field returned "No".

---