## ☕️ Exploratory Data Analysis (EDA) - BrewBuddy Price Predictor

This project is a simple, beginner-friendly Exploratory Data Analysis (EDA) performed on a custom-built dummy dataset. **The goal** is to discover insights in coffee sales data, then clean and prepare that data for the pricing model.

---

### 📂 Project Structure

- **.ipynb**: Main Jupyter notebook for loading, exploring, and cleaning data.
- **data/**
  - `NewCoffeeData.csv` – raw sample data

---

### 📊 Dataset

- **Source:** Your own handcrafted BrewBuddy sample data
- **Columns:**
  - `Bean_Origin` (e.g., Ethiopia, Colombia, Brazil)
  - `Roast_Level` (Light, Medium, Dark)
  - `Flavor_Profile` (Fruity, Nutty, Chocolatey, Floral…)
  - `Customer_Rating` (1.0–5.0 star feedback)
  - `Actual_Price` (USD or INR; the notebook cells read this column as `Actual_Price_INR`)

---

### 🔍 Key Steps in the Notebook

1. **Data Cleaning & Discovery** 🕵️‍♀️
   - **Initial summary:** `df.describe()`, `df.info()` to get counts, means, and dtypes.
   - **Missing values:** identify with `df.isnull().sum()` and decide to drop or impute.
   - **Value checks:** look for typos or inconsistent bean names (`df['Bean_Origin'].unique()`).

2. **Exploratory Charts** 📈
   - **Average Price by Roast Level:** bar chart of mean price per roast.
   - **Price vs. Rating:** scatter plot to see correlation between customer rating and price.
   - **Origin Popularity:** countplot of bean origins to spot data biases.

3. **Feature Engineering & Transformation** 🔧
   - **Price per Rating:** new column `price_per_star = Actual_Price / Customer_Rating`.
   - **One-Hot Encoding:** convert `Bean_Origin`, `Roast_Level`, `Flavor_Profile` into numeric columns.
   - **Scaling:** apply `StandardScaler` to `Customer_Rating` and engineered numeric features.

4. **Sample Output** 💾
   ```python
   # Example: price_per_star
   df[['Actual_Price', 'Customer_Rating', 'price_per_star']].head()

   # Actual_Price  Customer_Rating  price_per_star
   # 15.99         4.5              3.55
   # 13.50         4.0              3.38
   # 12.00         3.5              3.43
   ```
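The feature engineering step above can be sketched roughly as follows. This is a minimal illustration on hypothetical toy rows, not the notebook's actual code: the `toy` frame and its values are made up, and the scaling is done by hand with pandas using the same formula `StandardScaler` applies (mean 0, population standard deviation 1).

```python
import pandas as pd

# Hypothetical toy rows for illustration; the real data lives in data/NewCoffeeData.csv.
toy = pd.DataFrame({
    "Bean_Origin": ["Ethiopia", "Colombia", "Brazil"],
    "Roast_Level": ["Light", "Medium", "Dark"],
    "Flavor_Profile": ["Fruity", "Nutty", "Chocolatey"],
    "Customer_Rating": [4.5, 4.0, 3.5],
    "Actual_Price": [15.99, 13.50, 12.00],
})

# Price per rating star.
toy["price_per_star"] = toy["Actual_Price"] / toy["Customer_Rating"]

# One-hot encode the three categorical columns into 0/1 indicator columns.
categorical = ["Bean_Origin", "Roast_Level", "Flavor_Profile"]
encoded = pd.get_dummies(toy, columns=categorical, dtype=int)

# Standardize numeric features: (x - mean) / std with ddof=0,
# which is the formula StandardScaler uses.
numeric = ["Customer_Rating", "price_per_star"]
encoded[numeric] = (encoded[numeric] - encoded[numeric].mean()) / encoded[numeric].std(ddof=0)

print(encoded.head())
```

With a real train/test split, the mean and standard deviation would normally be computed on the training rows only and then reused on the test rows.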



## 📥 Data Loading

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
df = pd.read_csv('data/NewCoffeeData.csv')
print("First five rows:\n", df.head())
print("\nColumn types & missing values:")
df.info()  # info() prints its report directly and returns None, so don't wrap it in print()
print("\nSummary stats for numeric columns:\n", df.describe())
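The Key Steps list mentions deciding whether to drop or impute missing values. Here is a minimal sketch of both options, assuming a hypothetical toy frame with a few gaps. Using the median for numeric columns and dropping rows without an origin are illustrative choices, not something the notebook prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows with deliberate gaps (not taken from the real CSV).
toy = pd.DataFrame({
    "Bean_Origin": ["Ethiopia", None, "Brazil", "Colombia"],
    "Customer_Rating": [4.5, 4.0, np.nan, 3.5],
    "Actual_Price_INR": [450.0, 380.0, 320.0, np.nan],
})

# Count missing values per column.
missing_counts = toy.isnull().sum()
print(missing_counts)

# Option 1: drop every row that has any missing value.
dropped = toy.dropna()

# Option 2: drop rows with no origin, then fill numeric gaps with the column median.
imputed = toy.dropna(subset=["Bean_Origin"]).copy()
for col in ["Customer_Rating", "Actual_Price_INR"]:
    imputed[col] = imputed[col].fillna(imputed[col].median())

print(f"Rows kept when dropping: {len(dropped)}, when imputing: {len(imputed)}")
```

Dropping is the simplest route but can throw away most of a small dataset, which is why imputing is often worth trying on sample data like this.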

### Your Mission: Look for hidden secrets and patterns in your coffee data.
**What's the average price? (Simple numbers!)**

In [None]:
avg_price = df['Actual_Price_INR'].mean()
print(f"Average price: ₹{avg_price:.2f}")

**How are prices distributed?**

In [None]:
df['Actual_Price_INR'].hist(bins=10)
plt.title("Distribution of Prices")
plt.xlabel("Price (INR)")
plt.ylabel("Count")
plt.show()

**☕️ Do darker roasts sell for more or less?**

In [None]:
mean_by_roast = df.groupby('Roast_Level')['Actual_Price_INR'].mean()
print(mean_by_roast)

In [None]:
mean_by_roast.plot(kind='bar')
plt.title("Average Price by Roast Level")
plt.ylabel("Avg Price (INR)")
plt.show()
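One thing to watch for: `groupby` sorts its keys alphabetically, so the roasts come out as Dark, Light, Medium, which makes a light-to-dark trend harder to read. A small sketch of reordering the result, using hypothetical toy prices (the `roast_order` list is an assumption based on the three roast levels named in the dataset description):

```python
import pandas as pd

# Hypothetical toy rows; prices are invented for illustration.
toy = pd.DataFrame({
    "Roast_Level": ["Dark", "Light", "Medium", "Dark", "Light", "Medium"],
    "Actual_Price_INR": [300.0, 420.0, 350.0, 340.0, 380.0, 370.0],
})

# Natural light-to-dark order instead of the alphabetical groupby order.
roast_order = ["Light", "Medium", "Dark"]

mean_by_roast = (
    toy.groupby("Roast_Level")["Actual_Price_INR"]
    .mean()
    .reindex(roast_order)
)
print(mean_by_roast)
```

Calling `mean_by_roast.plot(kind='bar')` on the reindexed series then draws the bars in that same order.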

**🍑 Are fruity coffees always highly rated?**

In [None]:
fruity = df[df['Flavor_Profile'] == 'Fruity']['Customer_Rating']
overall = df['Customer_Rating']
print("Fruity avg rating:", fruity.mean())
print("Overall avg rating:", overall.mean())

In [None]:
df.boxplot(column='Customer_Rating', by='Flavor_Profile', rot=45)
plt.title("Ratings by Flavor Profile")
plt.suptitle("")  # remove auto-title
plt.ylabel("Customer Rating")
plt.show()
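Comparing averages shows whether fruity coffees rate higher overall, but not whether they are *always* rated highly. One way to check that directly is to look at the lowest fruity rating and the share of fruity ratings above a cut-off. The sketch below uses hypothetical toy rows, and the 4.0 threshold for "highly rated" is an assumption.

```python
import pandas as pd

# Hypothetical toy rows; ratings are invented for illustration.
toy = pd.DataFrame({
    "Flavor_Profile": ["Fruity", "Fruity", "Fruity", "Nutty", "Chocolatey"],
    "Customer_Rating": [4.8, 4.5, 3.2, 4.0, 3.9],
})

HIGH_RATING = 4.0  # assumed cut-off for "highly rated"

fruity = toy.loc[toy["Flavor_Profile"] == "Fruity", "Customer_Rating"]

lowest = fruity.min()                               # worst fruity rating
share_high = (fruity >= HIGH_RATING).mean()         # fraction at or above the cut-off
always_high = bool((fruity >= HIGH_RATING).all())   # True only if every rating passes

print(f"Lowest fruity rating: {lowest}")
print(f"Share of fruity ratings >= {HIGH_RATING}: {share_high:.0%}")
print(f"Always highly rated? {always_high}")
```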

**🔍 Are there any typos in my bean names?**

In [None]:
print(df['Bean_Origin'].unique())

**No typos found.**

In [None]:
# Flag suspiciously short origin names (fewer than 5 characters)
lengths = df['Bean_Origin'].str.len()  # .str.len() tolerates missing values, unlike apply(len)
odd = df[lengths < 5]
print(odd['Bean_Origin'].value_counts())
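A length check only catches truncated names. Misspellings of similar length, such as "Columbia" for "Colombia", slip through. As a rough sketch, names can be normalized and then compared against a list of expected origins with the standard library's `difflib`. The sample values, the `known_origins` list, and the 0.8 similarity cut-off are all assumptions for illustration.

```python
import difflib

import pandas as pd

# Hypothetical messy values, invented to show the idea.
origins = pd.Series(["Ethiopia", "ethiopia ", "Colombia", "Columbia", "Brazil", "Brasil"])

# Assumed list of canonical spellings.
known_origins = ["Ethiopia", "Colombia", "Brazil"]

# Normalize whitespace and capitalization first; this alone fixes "ethiopia ".
cleaned = origins.str.strip().str.title()

# For each remaining unknown name, suggest the closest known spelling (if any).
suspects = {}
for name in cleaned.dropna().unique():
    if name in known_origins:
        continue
    match = difflib.get_close_matches(name, known_origins, n=1, cutoff=0.8)
    suspects[name] = match[0] if match else None

print(suspects)
```

Any name mapped to `None` has no close match and would need a manual look; the suggested corrections should also be reviewed before applying them with something like `cleaned.replace(suspects)`.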