# Coffee Shop Data Analysis System Demo

This notebook demonstrates how to use the coffee shop data analysis system to collect, process, analyze, and visualize data about coffee shops in Pakistan using free resources.


## 1. Setup

First, let's make sure we have all the required dependencies installed:


In [None]:
%pip install beautifulsoup4 requests pandas matplotlib seaborn plotly statsmodels scikit-learn scipy python-dotenv schedule
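
Several of these packages are imported under a different name than the one passed to pip (`beautifulsoup4` imports as `bs4`, `scikit-learn` as `sklearn`, `python-dotenv` as `dotenv`). As a minimal sketch using only the standard library, the helper below reports which import names cannot be found in the current kernel. `missing_modules` and `REQUIRED_MODULES` are hypothetical names introduced here for illustration; they are not part of the project.

```python
import importlib.util

# Import names for the packages installed above; three differ from their pip names
REQUIRED_MODULES = [
    "bs4", "requests", "pandas", "matplotlib", "seaborn", "plotly",
    "statsmodels", "sklearn", "scipy", "dotenv", "schedule",
]

def missing_modules(names):
    """Return the top-level module names that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_modules(REQUIRED_MODULES)
print("Missing modules:", missing or "none")
```

If the list is not empty, rerunning the install cell and restarting the kernel is usually enough.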

Next, let's import the necessary modules and set up the paths to access our code:


In [None]:
import sys
import os
import json
from datetime import datetime
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Add the project root to Python path so we can import our modules
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
sys.path.append(project_root)

# Import project modules
from src.config import TARGET_LOCATIONS, PATHS
from src.data_collection.google_maps import collect_google_maps_data
from src.data_collection.social_media import collect_social_media_data
from src.data_collection.food_delivery import collect_food_delivery_data
from src.data_collection.market_trends import collect_market_trends_data
from src.visualization.dashboard import CoffeeShopVisualizer, create_visualizations

# Set up timestamp for this run
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
print(f"Notebook execution timestamp: {timestamp}")

## 2. Data Collection

Now let's collect data from various sources. All of these modules have been modified to use free resources instead of paid APIs.


### 2.1 Collect Google Maps Data

This module uses web scraping to extract coffee shop information from Google Maps instead of using the paid Google Maps API.


In [None]:
# Collect data from Google Maps using web scraping
google_data = collect_google_maps_data()

# Preview the first city's data; first_city is None if nothing was collected
first_city = next(iter(google_data), None)
print(f"Collected data for {len(google_data)} cities.")
if first_city is not None:
    print(f"Found {len(google_data[first_city])} coffee shops in {first_city}.")

# Show one sample coffee shop
if first_city is not None and google_data[first_city]:
    sample_shop = google_data[first_city][0]
    print("\nSample coffee shop data:")
    print(f"Name: {sample_shop.get('name')}")
    print(f"Rating: {sample_shop.get('rating')}")
    print(f"Address: {sample_shop.get('address')}")
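
The nested `{city: [shop, ...]}` structure is convenient for collection but awkward for tabular analysis. The sketch below flattens it into one row per shop. It assumes each shop is a dict with `name`, `rating`, and `address` keys, as the preview above suggests; `flatten_shops` is a hypothetical helper and the sample data is made up for illustration, not real scraper output.

```python
def flatten_shops(data_by_city):
    """Flatten {city: [shop_dict, ...]} into a list of row dicts."""
    rows = []
    for city, shops in data_by_city.items():
        for shop in shops or []:  # tolerate None for a city with no results
            rows.append({
                "city": city,
                "name": shop.get("name"),
                "rating": shop.get("rating"),
                "address": shop.get("address"),
            })
    return rows

# Hypothetical sample, for illustration only
sample_google_data = {
    "Karachi": [
        {"name": "Cafe A", "rating": 4.5, "address": "Street 1"},
        {"name": "Cafe B", "rating": None, "address": "Street 2"},
    ],
    "Lahore": [],
}

rows = flatten_shops(sample_google_data)
print(f"{len(rows)} rows")
```

Passing the result to `pd.DataFrame(flatten_shops(google_data))` would give a table that can be filtered and grouped by city.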

### 2.2 Collect Social Media Data

This module generates simulated Twitter/X data about coffee shops since we don't have access to the paid Twitter API.


In [None]:
# Collect simulated social media data
social_data = collect_social_media_data()

# Preview the data
print(f"Collected {len(social_data.get('twitter', {}).get('tweets', []))} tweets related to coffee shops.")

# Display a few sample tweets
if 'twitter' in social_data and 'tweets' in social_data['twitter']:
    print("\nSample tweets:")
    for i, tweet in enumerate(social_data['twitter']['tweets'][:3]):
        print(f"{i+1}. {tweet.get('username')}: {tweet.get('text')}")

# Show the top 5 hashtags by mention count, if available
if 'twitter' in social_data and 'hashtags' in social_data['twitter']:
    print("\nTop hashtags:")
    # Sort explicitly; dict order is not guaranteed to follow the counts
    top_hashtags = sorted(social_data['twitter']['hashtags'].items(), key=lambda kv: kv[1], reverse=True)[:5]
    for tag, count in top_hashtags:
        print(f"{tag}: {count} mentions")

### 2.3 Collect Food Delivery Data

This module simulates data from food delivery platforms based on realistic patterns for coffee shops in Pakistan.


In [None]:
# Collect simulated food delivery data
delivery_data = collect_food_delivery_data()

# Preview the data
print(f"Collected food delivery data for {len(delivery_data)} cities.")

# Show sample coffee menu items and pricing
if delivery_data and first_city in delivery_data and delivery_data[first_city]:
    sample_shop = delivery_data[first_city][0]
    print(f"\nSample coffee shop: {sample_shop.get('name')}")
    print("Menu items:")
    for item in sample_shop.get('menu_items', []):
        print(f"  - {item.get('name')}: Rs. {item.get('price')} ({item.get('description')})")

### 2.4 Collect Market Trends Data

This module collects market trend data about the coffee industry from free public datasets.


In [None]:
# Collect market trends data from free sources
market_data = collect_market_trends_data()

# Preview the data
print("Market trend data categories:")
for category in market_data.keys():
    print(f"- {category}")

# Show coffee price trends if available
if 'coffee_prices' in market_data:
    prices = market_data['coffee_prices']
    print("\nCoffee price trends:")
    for entry in prices[:5]:  # Show first 5 entries
        print(f"{entry.get('date')}: ${entry.get('price')}")
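
A quick way to read a price series like this one is the change from each entry to the next. The sketch below assumes the entries are dicts with `date` and `price` keys, as in the preview above. `pct_changes` is a hypothetical helper and the sample values are made up for illustration.

```python
def pct_changes(entries):
    """Return (date, percent change vs. the previous entry) pairs."""
    changes = []
    for prev, curr in zip(entries, entries[1:]):
        prev_price, curr_price = prev.get("price"), curr.get("price")
        if not prev_price or curr_price is None:
            continue  # skip pairs with a missing price or a zero baseline
        change = (curr_price - prev_price) / prev_price * 100
        changes.append((curr.get("date"), round(change, 2)))
    return changes

# Hypothetical sample, for illustration only
sample_prices = [
    {"date": "2024-01", "price": 2.00},
    {"date": "2024-02", "price": 2.10},
    {"date": "2024-03", "price": 2.00},
]

for date, change in pct_changes(sample_prices):
    print(f"{date}: {change:+.2f}%")
```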

## 3. Data Visualization

Now let's create some visualizations of our data using matplotlib and seaborn, the free plotting libraries imported above.


### 3.1 Coffee Shop Distribution by City

Let's create a bar chart showing coffee shop distribution across Pakistani cities.


In [None]:
# Create a DataFrame with shop counts per city
city_counts = []
for city, shops in google_data.items():
    city_counts.append({"city": city, "count": len(shops)})

city_df = pd.DataFrame(city_counts)

# Create the visualization
plt.figure(figsize=(12, 6))
ax = sns.barplot(data=city_df, x="city", y="count")

# Add value labels on top of bars
for i, v in enumerate(city_df["count"]):
    ax.text(i, v + 0.1, str(v), ha="center")

plt.title("Coffee Shop Distribution by City in Pakistan", fontsize=16)
plt.xlabel("City", fontsize=12)
plt.ylabel("Number of Coffee Shops", fontsize=12)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

### 3.2 Coffee Price Analysis

Let's analyze coffee prices from our food delivery data.


In [None]:
# Extract coffee prices from different cities
coffee_prices = []
for city, shops in delivery_data.items():
    for shop in shops:
        for item in shop.get('menu_items', []):
            if item.get('name') == 'Espresso':  # Focus on one coffee type
                coffee_prices.append({
                    "city": city,
                    "shop": shop.get('name'),
                    "price": item.get('price')
                })

price_df = pd.DataFrame(coffee_prices)

# Create a boxplot to compare prices across cities
plt.figure(figsize=(14, 8))
ax = sns.boxplot(data=price_df, x="city", y="price")
plt.title("Espresso Prices Across Pakistani Cities", fontsize=16)
plt.xlabel("City", fontsize=12)
plt.ylabel("Price (PKR)", fontsize=12)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# Calculate and display average prices
avg_prices = price_df.groupby('city')['price'].mean().sort_values(ascending=False)
print("Average Espresso Prices by City (PKR):")
for city, price in avg_prices.items():
    print(f"{city}: {price:.2f}")

### 3.3 Social Media Analysis

Let's visualize some insights from our social media data.


In [None]:
# Extract hashtags and their counts
if 'twitter' in social_data and 'hashtags' in social_data['twitter']:
    hashtags = []
    for tag, count in social_data['twitter']['hashtags'].items():
        hashtags.append({"tag": tag, "count": count})
    
    # Convert to DataFrame and sort by count
    hashtag_df = pd.DataFrame(hashtags)
    hashtag_df = hashtag_df.sort_values('count', ascending=False).head(10)  # Top 10
    
    # Create the visualization
    plt.figure(figsize=(12, 8))
    ax = sns.barplot(data=hashtag_df, y="tag", x="count", palette='viridis')
    
    # Add value labels
    for i, v in enumerate(hashtag_df['count']):
        ax.text(v + 0.1, i, str(v), va='center')
    
    plt.title('Top Coffee-Related Hashtags on Social Media', fontsize=16)
    plt.xlabel('Number of Mentions', fontsize=12)
    plt.ylabel('Hashtag', fontsize=12)
    plt.tight_layout()
    plt.show()

### 3.4 Generate Complete Dashboard

Let's create a complete dashboard using our visualization module.


In [None]:
# First we need to save our collected data to the expected locations
# This simulates what the main pipeline would do

# Notebooks do not define __file__, so reuse project_root from the setup cell
# (the parent of the notebook's working directory)
base_dir = project_root
raw_dir = os.path.join(base_dir, PATHS['raw_data'])
processed_dir = os.path.join(base_dir, PATHS['processed_data'])

# Create directories if they don't exist
os.makedirs(raw_dir, exist_ok=True)
os.makedirs(processed_dir, exist_ok=True)

# Save the raw data
with open(os.path.join(raw_dir, f"google_maps_{timestamp}.json"), 'w') as f:
    json.dump(google_data, f, indent=2)

with open(os.path.join(raw_dir, f"social_media_{timestamp}.json"), 'w') as f:
    json.dump(social_data, f, indent=2)
    
with open(os.path.join(raw_dir, f"food_delivery_{timestamp}.json"), 'w') as f:
    json.dump(delivery_data, f, indent=2)
    
with open(os.path.join(raw_dir, f"market_trends_{timestamp}.json"), 'w') as f:
    json.dump(market_data, f, indent=2)

print("Raw data saved to files.")
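
The four save blocks above follow the same pattern, so they could be factored into a small helper. This is a sketch only: `save_raw` is a hypothetical function, not part of the project modules, and the demo writes to a temporary directory rather than to `PATHS['raw_data']`.

```python
import json
import os
import tempfile

def save_raw(data, directory, source, timestamp):
    """Write data as JSON to <directory>/<source>_<timestamp>.json and return the path."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, f"{source}_{timestamp}.json")
    # ensure_ascii=False keeps non-ASCII shop names readable in the file
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)
    return path

# Demo with made-up data in a temporary directory
demo_dir = tempfile.mkdtemp()
demo_path = save_raw({"Karachi": []}, demo_dir, "google_maps", "20250101_000000")

with open(demo_path, encoding="utf-8") as f:
    loaded = json.load(f)
print(os.path.basename(demo_path))
```

In the notebook, the same helper could be called once per source, for example `save_raw(google_data, raw_dir, "google_maps", timestamp)`.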

In [None]:
# Now create an analysis directory with some sample analysis results
# In a real pipeline these would come from the analysis modules
analysis_dir = os.path.join(processed_dir, f"analysis_{timestamp}")
os.makedirs(analysis_dir, exist_ok=True)

# Create sample statistical analysis results
statistical_results = {
    "city_analysis": {
        "google_maps": {
            "shop_count_by_city": [
                {"city": city, "coffee_shop_count": len(shops)} 
                for city, shops in google_data.items()
            ]
        }
    },
    "review_analysis": {
        "google_maps": {
            "rating_distribution": [
                {"rating": r, "count": sum(1 for city in google_data.values() for shop in city if shop.get('rating', 0) == r)}
                for r in [3.0, 3.5, 4.0, 4.5, 5.0]
            ]
        }
    },
    "pricing_analysis": {
        "coffee_menu_items": {
            "city_comparison": [
                {
                    "name": "Espresso",
                    "Karachi": 250,
                    "Lahore": 230,
                    "Islamabad": 280
                },
                {
                    "name": "Cappuccino",
                    "Karachi": 400,
                    "Lahore": 380,
                    "Islamabad": 420
                },
                {
                    "name": "Latte",
                    "Karachi": 450,
                    "Lahore": 430,
                    "Islamabad": 470
                }
            ]
        }
    }
}

# Save the statistical analysis
with open(os.path.join(analysis_dir, 'statistical_analysis.json'), 'w') as f:
    json.dump(statistical_results, f, indent=2)

# Create sample trend analysis results
trend_results = {
    "price_trends": {
        "coffee_price": {
            "forecast": {
                "forecast": [
                    {"date": "2025-01", "value": 100},
                    {"date": "2025-02", "value": 102},
                    {"date": "2025-03", "value": 105},
                    {"date": "2025-04", "value": 110},
                    {"date": "2025-05", "value": 112},
                    {"date": "2025-06", "value": 115}
                ]
            }
        }
    },
    "consumption_trends": {
        "forecast": {
            "next_3_years": [
                {"year": "2025", "value": 1200},
                {"year": "2026", "value": 1350},
                {"year": "2027", "value": 1500}
            ]
        }
    },
    "social_media_trends": {
        "hashtags": {
            "top_10": [
                {"tag": "#CoffeeLover", "mentions": 145},
                {"tag": "#PakistaniCafe", "mentions": 98},
                {"tag": "#MorningCoffee", "mentions": 87},
                {"tag": "#CafeHopping", "mentions": 76},
                {"tag": "#LahoreEats", "mentions": 65},
                {"tag": "#IslamabadCafe", "mentions": 54},
                {"tag": "#KarachiCoffee", "mentions": 43},
                {"tag": "#CoffeeTime", "mentions": 32},
                {"tag": "#CafeVibes", "mentions": 28},
                {"tag": "#BrewedCoffee", "mentions": 22}
            ]
        }
    },
    "competitor_trends": {
        "market_share": [
            {"category": "International Chains", "market_share": 35},
            {"category": "Local Premium Brands", "market_share": 25},
            {"category": "Mid-market Cafes", "market_share": 20},
            {"category": "Independent Shops", "market_share": 15},
            {"category": "Tea & Coffee Houses", "market_share": 5}
        ]
    }
}

# Save the trend analysis
with open(os.path.join(analysis_dir, 'trend_analysis.json'), 'w') as f:
    json.dump(trend_results, f, indent=2)

print("Sample analysis results created.")
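
Note that the sample `rating_distribution` above counts only ratings that exactly equal one of the bucket values, so a shop rated 4.3 or 4.7 is counted nowhere. If that is not the intent, one option is to round each rating to the nearest half star first. The sketch below shows this under stated assumptions: `half_star_bucket` and `rating_distribution` are hypothetical helpers, and the sample data is made up.

```python
from collections import Counter

def half_star_bucket(rating):
    """Round a rating to the nearest 0.5, or return None if it is not numeric."""
    try:
        # round() rounds exact ties to the even integer, so 4.25 maps to 4.0
        return round(float(rating) * 2) / 2
    except (TypeError, ValueError, OverflowError):
        return None

def rating_distribution(data_by_city, buckets=(3.0, 3.5, 4.0, 4.5, 5.0)):
    """Count shops per half-star bucket across all cities."""
    counts = Counter(
        half_star_bucket(shop.get("rating"))
        for shops in data_by_city.values()
        for shop in shops
    )
    return [{"rating": b, "count": counts.get(b, 0)} for b in buckets]

# Hypothetical sample, for illustration only
sample_ratings = {
    "Karachi": [{"rating": 4.3}, {"rating": 4.7}, {"rating": None}],
    "Lahore": [{"rating": "4.4"}, {}],
}

distribution = rating_distribution(sample_ratings)
print(distribution)
```

Ratings that round to a value outside the listed buckets are still left out, which matches the bucket list used in the cell above.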

In [None]:
# Now we can generate a dashboard using our visualization module
visualization_results = create_visualizations(timestamp=timestamp)

if visualization_results and 'dashboard' in visualization_results:
    dashboard_path = visualization_results['dashboard']
    print(f"Dashboard successfully created at: {dashboard_path}")
    
    # Inside an if block the IFrame is not rendered automatically,
    # so pass it to display() explicitly
    from IPython.display import IFrame, display
    display(IFrame(src=dashboard_path, width=1000, height=600))

## 4. Advanced Analysis Example: Price Correlation with Ratings

Let's perform a simple analysis to see if there's any correlation between coffee prices and shop ratings.


In [None]:
# Combine price data with ratings
price_rating_data = []

for city, shops in delivery_data.items():
    for shop in shops:
        # Average the prices of all menu items; skip shops with no menu
        # to avoid dividing by zero
        menu_items = shop.get('menu_items') or []
        if menu_items:
            avg_price = sum(item.get('price', 0) for item in menu_items) / len(menu_items)
            
            price_rating_data.append({
                "shop_name": shop.get('name'),
                "city": city,
                "rating": shop.get('rating', 0),
                "avg_price": avg_price
            })

# Convert to DataFrame
pr_df = pd.DataFrame(price_rating_data)

# Create a scatter plot
plt.figure(figsize=(12, 8))
ax = sns.scatterplot(data=pr_df, x="avg_price", y="rating", hue="city", s=100, alpha=0.7)

# Add a trend line
sns.regplot(data=pr_df, x="avg_price", y="rating", scatter=False, ax=ax, color="black")

plt.title("Correlation Between Coffee Prices and Shop Ratings", fontsize=16)
plt.xlabel("Average Menu Price (PKR)", fontsize=12)
plt.ylabel("Rating", fontsize=12)
plt.tight_layout()
plt.show()

# Calculate correlation
correlation = pr_df[['avg_price', 'rating']].corr().iloc[0, 1]
print(f"Correlation coefficient between price and rating: {correlation:.3f}")

# Interpretation
if abs(correlation) < 0.3:
    print("There is a weak correlation between coffee prices and ratings.")
elif abs(correlation) < 0.7:
    print("There is a moderate correlation between coffee prices and ratings.")
else:
    print("There is a strong correlation between coffee prices and ratings.")
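
With only a handful of shops, a coefficient of this size can arise by chance, and the thresholds above say nothing about that. One way to check is a permutation test: shuffle the ratings many times and see how often the shuffled data correlates at least as strongly as the observed data. The sketch below uses only the standard library. `pearson` and `permutation_p_value` are hypothetical helpers, and the sample prices and ratings are made up for illustration.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # undefined for constant input; this sketch returns 0.0
    return cov / (sd_x * sd_y)

def permutation_p_value(xs, ys, n_permutations=2000, seed=0):
    """Estimate a two-sided p-value by shuffling ys and recomputing the correlation."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = abs(pearson(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(pearson(xs, shuffled)) >= observed:
            hits += 1
    # Add one to both counts so the estimate is never exactly zero
    return (hits + 1) / (n_permutations + 1)

# Hypothetical sample, for illustration only
sample_avg_prices = [200, 250, 300, 350, 400, 450, 500, 550]
sample_shop_ratings = [3.0, 3.2, 3.5, 3.6, 4.0, 4.1, 4.4, 4.8]

r = pearson(sample_avg_prices, sample_shop_ratings)
p = permutation_p_value(sample_avg_prices, sample_shop_ratings)
print(f"r = {r:.3f}, permutation p-value = {p:.4f}")
```

For the notebook's data, the inputs would be `pr_df['avg_price'].tolist()` and `pr_df['rating'].tolist()`. Because the collected data is partly simulated, any result describes the simulation rather than the real market.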

## 5. Conclusion and Next Steps

In this notebook, we've demonstrated:

1. How to collect coffee shop data using free resources (web scraping and simulated data)
2. How to visualize and analyze this data using free Python libraries
3. How to generate a comprehensive dashboard for coffee shop market analysis
4. How to perform advanced analyses like price-rating correlation

Next steps:

1. Run the full data pipeline regularly using scheduled tasks
2. Refine the data collection methods as needed
3. Add more advanced analyses based on business questions
4. Create custom reports for specific stakeholders
