# Swiggy Data Analysis – Structured Notebook

This notebook follows a **Question → Script → Remarks → Conclusion** pattern for each analytic task. Place `swiggy.csv` in the same folder before running. Outputs (PNGs & CSVs) will be saved to `./outputs`.

In [None]:
# ---- Setup & config ----
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
pd.set_option('display.max_rows', 50)
os.makedirs('outputs', exist_ok=True)
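# Maps CSV headers (keys) to the column names used in this notebook (values).
# It is an identity map by default; edit the keys if your file's headers differ.
COLUMN_MAP = {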
    'ID': 'ID', 'Area': 'Area', 'City': 'City', 'Restaurant': 'Restaurant',
    'Price': 'Price', 'Avg ratings': 'Avg ratings', 'Total ratings': 'Total ratings',
    'Food type': 'Food type', 'Address': 'Address', 'Delivery time': 'Delivery time'
}
CSV_PATH = 'swiggy.csv'  # update if needed


## Question
What is the structure and quality of the Swiggy dataset?


In [None]:
# ==== Load Data ====
df = pd.read_csv(CSV_PATH)
df = df.rename(columns=COLUMN_MAP)
df.info()

# ==== Data Quality Checks ====
summary = {
    'rows': len(df), 'cols': df.shape[1],
    'missing_values_total': int(df.isna().sum().sum()),
    'duplicate_rows': int(df.duplicated().sum())
}
# Only the last expression in a cell is displayed, so print intermediate results
print(summary)

# Check that ID can serve as a unique primary key
pk_unique = df['ID'].is_unique if 'ID' in df.columns else False
print('ID is a unique key:', pk_unique)


### Remarks
We count missing values and fully duplicated rows, and check whether `ID` can serve as a unique primary key.
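
A total count hides which columns the gaps sit in. The sketch below breaks the missing values down per column and lists rows that share an `ID`. The toy frame and the `quality_report` helper are illustrative assumptions, not part of the original notebook; in practice you would pass `df` instead.

```python
import pandas as pd

# Toy frame standing in for the Swiggy data (all values are made up).
toy = pd.DataFrame({
    'ID': [1, 2, 3, 3],
    'City': ['Pune', None, 'Delhi', 'Delhi'],
    'Price': [200.0, 350.0, None, None],
})

def quality_report(frame):
    """Hypothetical helper: per-column missing counts, percentages, and dtypes."""
    missing = frame.isna().sum()
    report = pd.DataFrame({
        'missing': missing,
        'missing_pct': (missing / len(frame) * 100).round(1),
        'dtype': frame.dtypes.astype(str),
    })
    # Worst-affected columns first
    return report.sort_values('missing', ascending=False)

report = quality_report(toy)
# keep=False marks every row of a repeated ID, not just the later copies
dup_ids = toy[toy['ID'].duplicated(keep=False)]

print(report)
print(dup_ids)
```

Columns with a high `missing_pct` are the ones to clean or exclude before the later steps.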


### Conclusion
If the missing-value and duplicate counts are low and `ID` is unique, the dataset is structurally sound and needs only light cleaning before analysis.


## Question
Which cities host the most restaurants on Swiggy?


In [None]:
city_counts = df['City'].value_counts()
# print() is needed here: a bare expression mid-cell is not displayed
print(city_counts.head(10))

# Visualization
plt.figure()
city_counts.head(10).plot(kind='bar')
plt.title('Top 10 Cities with the Most Restaurants')
plt.xlabel('City')
plt.ylabel('Number of Restaurants')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.savefig('outputs/top_cities_restaurants.png', dpi=200)
plt.show()


### Remarks
The distribution of listings per city shows where Swiggy's restaurant supply is concentrated.
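
Raw counts do not say how concentrated the listings are. One way to check, sketched here on made-up data, is to turn the counts into cumulative shares; with the real data you would start from `df['City'].value_counts()`. The city names and `top_n` value are assumptions for illustration.

```python
import pandas as pd

# Made-up city labels: 5 + 3 + 2 listings
cities = pd.Series(['Kolkata'] * 5 + ['Mumbai'] * 3 + ['Pune'] * 2, name='City')

counts = cities.value_counts()      # sorted from largest to smallest
share = counts / counts.sum()       # fraction of all listings per city
cum_share = share.cumsum()          # running total down the ranking

top_n = 2                           # hypothetical cutoff
top_share = cum_share.iloc[top_n - 1]
print(f"Top {top_n} cities hold {top_share:.0%} of listings")
```

A claim such as "the top cities hold the majority of listings" is only justified if `top_share` exceeds 0.5 for the chosen `top_n`.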


### Conclusion
A small number of cities account for a large share of Swiggy listings, which marks them as focus markets.


## Question
How does the average price vary across cities?


In [None]:
avg_price_by_city = df.groupby('City')['Price'].mean().sort_values(ascending=False)
# print() is needed here: a bare expression mid-cell is not displayed
print(avg_price_by_city.head(10))

# Visualization
plt.figure()
avg_price_by_city.head(10).plot(kind='bar')
plt.title('Average Meal Price by City (Top 10)')
plt.xlabel('City')
plt.ylabel('Average Price')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.savefig('outputs/avg_price_by_city.png', dpi=200)
plt.show()


### Remarks
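We compute the mean price per city and rank cities from most to least expensive. The mean is sensitive to a few very expensive restaurants, especially in cities with few listings.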
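
A minimal sketch of a more robust variant: report the median next to the mean and drop cities with too few listings to compare. The toy data and the `MIN_LISTINGS` cutoff are assumptions chosen for illustration.

```python
import pandas as pd

# Made-up prices; city B contains one extreme value, city C has a single listing
toy = pd.DataFrame({
    'City': ['A', 'A', 'A', 'B', 'B', 'B', 'C'],
    'Price': [200, 250, 300, 150, 200, 1500, 900],
})

MIN_LISTINGS = 3  # hypothetical cutoff for a city to be ranked

price_stats = (
    toy.groupby('City')['Price']
       .agg(listings='count', mean_price='mean', median_price='median')
)

# Keep only cities with enough listings, then rank by the median
reliable = (
    price_stats[price_stats['listings'] >= MIN_LISTINGS]
    .sort_values('median_price', ascending=False)
)
print(reliable)
```

In this toy example city B has the highest mean but the lowest median, because a single expensive restaurant pulls the mean up.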


### Conclusion
Average prices differ noticeably between cities, which supports city-specific pricing and promotion strategies.


## Question
Which restaurants lead in ratings and engagement?


In [None]:
# Most customer engagement (ranked by number of ratings)
top_ratings = df.sort_values(by='Total ratings', ascending=False)[['Restaurant','City','Total ratings','Avg ratings']].head(10)
print(top_ratings)

# Top average ratings (min 100 reviews, so the comparison is inclusive)
filtered_df = df[df['Total ratings'] >= 100]
top_avg_ratings = filtered_df.sort_values(by='Avg ratings', ascending=False)[['Restaurant','City','Avg ratings','Total ratings']].head(10)
top_avg_ratings


### Remarks
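We rank restaurants by total ratings as a proxy for engagement, and by average rating among restaurants that clear a minimum threshold of around 100 ratings, so that a handful of reviews cannot produce a top score.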
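
Many restaurants share the same average rating, so the order within a tie is otherwise arbitrary. The sketch below wraps the threshold in a function and breaks ties by review volume. The `top_rated` helper, its defaults, and the toy data are assumptions for illustration.

```python
import pandas as pd

# Made-up restaurants: R1 has a perfect score but only 20 ratings
toy = pd.DataFrame({
    'Restaurant': ['R1', 'R2', 'R3', 'R4'],
    'Avg ratings': [5.0, 4.6, 4.6, 4.2],
    'Total ratings': [20, 100, 1000, 5000],
})

def top_rated(frame, min_ratings=100, n=10):
    """Hypothetical helper: best average ratings among sufficiently reviewed restaurants."""
    eligible = frame[frame['Total ratings'] >= min_ratings]
    # Sort by rating first, then by review volume to break ties
    return eligible.sort_values(
        ['Avg ratings', 'Total ratings'], ascending=[False, False]
    ).head(n)

ranked = top_rated(toy, min_ratings=100, n=3)
print(ranked)
```

Rerunning with a few different `min_ratings` values is a quick way to see how sensitive the top 10 is to the threshold.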


### Conclusion
These restaurants exemplify high engagement and customer satisfaction.


## Question
What are the most popular food types?


In [None]:
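# Drop missing entries first; astype(str) alone would turn NaN into the label 'nan'
food_series = df['Food type'].dropna().astype(str).str.split(',').explode().str.strip()
food_counts = food_series.value_counts()
print(food_counts.head(10))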

# Visualization
plt.figure()
food_counts.head(10).plot(kind='bar')
plt.title('Top 10 Food Types')
plt.xlabel('Food Type')
plt.ylabel('Count')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.savefig('outputs/top_food_types.png', dpi=200)
plt.show()


### Remarks
A restaurant can list several comma-separated cuisines, so we split these entries and explode them into one row per cuisine before counting frequencies.
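
The split-and-explode step is easiest to see on a tiny example. The values below are made up; the same chain applies to `df['Food type']`.

```python
import pandas as pd

# Made-up entries, including a missing value and stray whitespace
food = pd.Series(
    ['Biryani, Chinese', 'Chinese', None, 'Pizzas,Chinese '],
    name='Food type',
)

cuisines = (
    food.dropna()          # skip restaurants with no cuisine listed
        .str.split(',')    # 'Biryani, Chinese' -> ['Biryani', ' Chinese']
        .explode()         # one row per cuisine
        .str.strip()       # remove whitespace left over from the split
)
counts = cuisines.value_counts()
print(counts)
```

Because a restaurant with two cuisines contributes two rows, the counts sum to more than the number of restaurants; they measure how often a cuisine is offered, not how many restaurants exist.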


### Conclusion
A few cuisines account for most listings, which makes them natural candidates for targeted promotions.


## Question
What is the distribution of delivery times?


In [None]:
plt.figure()
plt.hist(df['Delivery time'].dropna(), bins=20, edgecolor='black')
plt.title('Distribution of Delivery Time (minutes)')
plt.xlabel('Delivery Time')
plt.ylabel('Count of Restaurants')
plt.tight_layout()
plt.savefig('outputs/delivery_time_distribution.png', dpi=200)
plt.show()


### Remarks
The histogram shows where delivery times cluster and how long the tail of slow deliveries is.
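
To put numbers on the tail, one common convention is to flag values more than 1.5 interquartile ranges above the third quartile. This is a sketch on made-up delivery times, and the 1.5 multiplier is a rule of thumb rather than something the dataset prescribes.

```python
import pandas as pd

# Made-up delivery times in minutes
times = pd.Series([30, 35, 40, 45, 50, 55, 60, 120], name='Delivery time')

q1, q3 = times.quantile([0.25, 0.75])   # first and third quartiles
iqr = q3 - q1
upper = q3 + 1.5 * iqr                  # conventional upper fence

slow = times[times > upper]
print(f"Upper fence: {upper} minutes")
print(slow)
```

With the real data, `df['Delivery time'].dropna()` would take the place of `times`.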


### Conclusion
Delivery times cluster within a limited range. The slow outliers are worth checking against ratings, since long waits may hurt customer satisfaction.


## Question
What relationships exist between key metrics (price, ratings, delivery)?


In [None]:
numeric_cols = ['Price','Avg ratings','Total ratings','Delivery time']
corr = df[numeric_cols].corr()

plt.figure()
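# Pin the colour scale to [-1, 1] so colours mean the same thing in every run
im = plt.imshow(corr.values, interpolation='nearest', vmin=-1, vmax=1, cmap='coolwarm')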
plt.title('Correlation Matrix')
plt.xticks(range(len(numeric_cols)), numeric_cols, rotation=45, ha='right')
plt.yticks(range(len(numeric_cols)), numeric_cols)
for i in range(len(numeric_cols)):
    for j in range(len(numeric_cols)):
        plt.text(j, i, f"{corr.values[i, j]:.2f}", ha='center', va='center')
plt.colorbar(im, fraction=0.046, pad=0.04)
plt.tight_layout()
plt.savefig('outputs/correlation_matrix.png', dpi=200)
plt.show()
corr


### Remarks
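The heatmap shows pairwise Pearson correlations (the `DataFrame.corr()` default) between price, ratings, and delivery time. Values near 0 indicate little linear relationship.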
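
Pearson correlation measures linear association and is sensitive to extreme values, which prices and rating counts often contain. As a cross-check, a rank-based Spearman correlation can be computed with the same method. The sketch uses made-up numbers in which one very expensive restaurant distorts the linear fit.

```python
import pandas as pd

# Made-up data: perfectly monotonic, but one price is an extreme value
toy = pd.DataFrame({
    'Price': [100, 200, 300, 400, 5000],
    'Delivery time': [20, 25, 30, 35, 40],
})

pearson = toy.corr(method='pearson').loc['Price', 'Delivery time']
spearman = toy.corr(method='spearman').loc['Price', 'Delivery time']

print(f"Pearson:  {pearson:.2f}")
print(f"Spearman: {spearman:.2f}")
```

A large gap between the two values suggests that outliers or a non-linear relationship are shaping the Pearson figure.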


### Conclusion
Ratings are only weakly related to price, and delivery time correlates only modestly with the other metrics.


## Question
Which restaurants are hidden gems or face slower deliveries?


In [None]:
# Hidden gems: highly rated, but with a moderate number of ratings
hidden_gems = df[(df['Avg ratings'] >= 4.5) & (df['Total ratings'].between(500, 1000))][['Restaurant','City','Avg ratings','Total ratings']]
# print() is needed here: a bare expression mid-cell is not displayed
print(hidden_gems.head(10))

# High-rated but slower delivery (more than 60 minutes)
slow_but_high = df[(df['Avg ratings'] >= 4.0) & (df['Delivery time'] > 60)][['Restaurant','City','Avg ratings','Delivery time']]
slow_but_high.head(10)


### Remarks
Hidden gems are restaurants rated 4.5 or higher with 500 to 1,000 total ratings. Slow-but-high restaurants are rated 4.0 or higher with delivery times above 60 minutes.
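
The thresholds are judgment calls, so it can help to keep them in one place. The sketch below adds both flags as boolean columns, which also shows restaurants that fall into both groups. The `flag_segments` helper, its parameter names, and the toy data are assumptions for illustration; the default values mirror the thresholds used above.

```python
import pandas as pd

# Made-up restaurants covering each combination of flags
toy = pd.DataFrame({
    'Restaurant': ['R1', 'R2', 'R3', 'R4'],
    'Avg ratings': [4.6, 4.7, 3.9, 4.5],
    'Total ratings': [700, 5000, 600, 1000],
    'Delivery time': [40, 75, 90, 61],
})

def flag_segments(frame, gem_rating=4.5, gem_range=(500, 1000),
                  slow_rating=4.0, slow_minutes=60):
    """Hypothetical helper: add boolean columns for the two segments."""
    out = frame.copy()
    # Series.between is inclusive on both ends by default
    out['hidden_gem'] = (
        (out['Avg ratings'] >= gem_rating)
        & out['Total ratings'].between(*gem_range)
    )
    out['slow_but_high'] = (
        (out['Avg ratings'] >= slow_rating)
        & (out['Delivery time'] > slow_minutes)
    )
    return out

flagged = flag_segments(toy)
print(flagged[['Restaurant', 'hidden_gem', 'slow_but_high']])
```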


### Conclusion
Hidden gems are candidates for extra visibility on the platform, and well-rated but slow restaurants are candidates for logistics support.


## Question
How can we prepare exports for presentation use?


In [None]:
city_counts.head(10).to_csv('outputs/top_cities.csv', header=['Number of Restaurants'])
avg_price_by_city.head(10).to_csv('outputs/avg_price_by_city.csv', header=['Average Price'])
top_ratings.to_csv('outputs/top_restaurants_by_total_ratings.csv', index=False)
top_avg_ratings.to_csv('outputs/top_avg_ratings_filtered.csv', index=False)
food_counts.head(10).to_csv('outputs/top_food_types.csv', header=['Count'])
hidden_gems.head(20).to_csv('outputs/hidden_gems.csv', index=False)
slow_but_high.head(20).to_csv('outputs/high_rated_slow_delivery.csv', index=False)
print('Exports completed in ./outputs')


### Remarks
Exports provide ready-to-use CSVs for PPT integration.
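
When the list of exports grows, a dictionary of name-to-table pairs keeps the file names and the write logic in one place. This is a sketch only: the tables are made up, and it writes to a temporary folder rather than `./outputs`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up tables standing in for the notebook's results
exports = {
    'top_cities': pd.Series(
        [5, 3],
        index=pd.Index(['Kolkata', 'Mumbai'], name='City'),
        name='Number of Restaurants',
    ),
    'hidden_gems': pd.DataFrame({'Restaurant': ['R1'], 'Avg ratings': [4.6]}),
}

out_dir = Path(tempfile.mkdtemp())  # temporary folder for this example
written = []
for name, table in exports.items():
    path = out_dir / f'{name}.csv'
    # A Series keeps its labels in the index, so write it; a frame's index is just row numbers
    table.to_csv(path, index=isinstance(table, pd.Series))
    written.append(path)

round_trip = pd.read_csv(out_dir / 'top_cities.csv')
print(round_trip)
```

Reading one file back, as in the last lines, is a cheap check that the headers are what a slide deck expects.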


### Conclusion
The analysis is presentation-ready with supporting datasets.


## KPI → Action Matrix

| KPI | Key Finding | Actionable Strategy |
|---|---|---|
| Top-Rated Restaurants | Identify exemplary venues by average rating with sufficient reviews. | Partner for exclusive deals; study operations for best practices. |
| Customer Engagement | Chains dominate by total ratings in several cities. | Co-marketing with high-volume partners; leverage their reach. |
| Average Price by City | Pricing varies by city. | Tailor promotions and commissions by local price sensitivity. |
| Geographic Density | Certain cities have far more listings. | Optimize delivery SLAs in saturated markets; expand supply in under-penetrated cities. |
| Delivery Efficiency | Delivery times differ across cities. | Replicate fastest-city logistics playbooks in slower markets. |
| Food Type Popularity | A few cuisines account for most listings. | Create themed festivals and homepage rows for top cuisines. |
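
The Delivery Efficiency row compares cities, but the cells above only plot the overall distribution. A minimal sketch of the per-city comparison, on made-up data, could look like this; with the real data you would group `df` the same way.

```python
import pandas as pd

# Made-up delivery times in minutes
toy = pd.DataFrame({
    'City': ['A', 'A', 'A', 'B', 'B'],
    'Delivery time': [30, 40, 50, 60, 70],
})

delivery_by_city = (
    toy.groupby('City')['Delivery time']
       .agg(listings='count', median_minutes='median')
       .sort_values('median_minutes')   # fastest cities first
)
fastest = delivery_by_city.index[0]
print(delivery_by_city)
```

The median is used here so that a few very slow restaurants do not dominate a city's figure; the `listings` column shows how much data each figure rests on.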
