<a href="https://colab.research.google.com/github/Aidakazemi/BUS650/blob/Assignment/BUSI650_Predictive_Decision_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📌 BUSI650 — Predictive & Decision-Making Analysis Assignment (Week 6)
**Weight:** 20% of final grade  
**Environment:** Google Colab (recommended)  

## 🎯 Assignment Purpose
In this assignment you will:
1. **Find a business dataset** (using the class Custom GPT helper) in a field you care about (Marketing, Finance, HR, Law/Policy, Operations, etc.).
2. Perform **descriptive analysis** and **hypothesis testing** to answer a concrete research question.
3. Build a simple **predictive model (linear regression)** to support a **decision recommendation**.
4. Communicate your findings in **plain business language**.

👉 **Rubric:** See the course shell (Week Overview → Syllabus → Rubric section).

---
## 🧭 What You Submit
- Your final **`.ipynb` notebook** (this file with your completed analysis and commentary).
- **Your dataset file or a permanent link with access** (e.g., CSV attached/uploaded, Drive shared link with "Anyone with the link" access, or Kaggle URL + exact version).  
  - Include a **Data Provenance** note (source, link, date accessed).

---
## ✅ Academic Integrity
- You may use the **Custom GPT helper** to **find** a dataset and for **guided steps**.
- You **must run your own code** and **interpret your own outputs**. Do not write new code from scratch; use only the code provided in this notebook.
- Questions in this assignment **cannot be answered** without running your code on your chosen data.


---
## 🧑‍💻 Step 0 — Identify Yourself
Save your own copy in Colab (**File → Save a copy in Drive**) and rename it `Lastname_Firstname_PredictiveDecision.ipynb`.

Fill in your info below and run the cell.

In [None]:
STUDENT_NAME = "Your Name"
STUDENT_ID = "Your Student ID"
print(f"Student: {STUDENT_NAME} | ID: {STUDENT_ID}")

---
## 🔎 Step 1 — Find a Dataset (with the Custom GPT helper)
Choose a domain you like (Marketing, Finance, HR, Law/Policy, Ops). Use the **Custom GPT** ([BUSI650 Helper GPT — Dataset Finder & Descriptive Analysis](https://chatgpt.com/g/g-68e750d6c8a88191bcc36422f4d40ce2-busi650-dataset-finder-descriptive-analysis)) to locate a **clean, tabular dataset** with:
- **Rows ≥ 200** (preferred), columns ≥ 5  
- Includes **at least one numeric outcome** you could predict (e.g., Sales, Salary, Satisfaction, Price) and **2–5 numeric predictors** (e.g., AdSpend, Tenure, Rating, Age)  
- Data is **non-sensitive** and sharable for academic use
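
Once a candidate dataset is loaded (Step 2), the size and type criteria above can be checked in a few lines. The following is a minimal sketch, not part of the required code: the helper name `meets_requirements`, its thresholds, and the toy frame are all illustrative assumptions.

```python
import numpy as np
import pandas as pd


def meets_requirements(frame, min_rows=200, min_cols=5, min_numeric=3):
    """Return simple pass/fail checks for the dataset criteria (hypothetical helper)."""
    numeric_cols = frame.select_dtypes(include="number").columns
    return {
        "rows_ok": len(frame) >= min_rows,
        "cols_ok": frame.shape[1] >= min_cols,
        # one numeric outcome plus at least two numeric predictors
        "numeric_ok": len(numeric_cols) >= min_numeric,
    }


# Toy data for illustration only; pass your own `df` instead of `toy`.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Sales": rng.normal(10_000, 1_500, 250),
    "AdSpend": rng.normal(2_000, 300, 250),
    "Price": rng.normal(20, 2, 250),
    "StoreVisits": rng.integers(100, 500, 250),
    "Region": rng.choice(["A", "B"], 250),
})
checks = meets_requirements(toy)
print(checks)
```

Because 200 rows is a preference rather than a hard limit, a failed `rows_ok` check is a prompt to look for a larger dataset, not an automatic rejection.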

### 💬 Suggested Prompts to use in the Custom GPT
- *“Find a clean public **marketing dataset** (CSV) with monthly sales and advertising spend that I can load in Colab. Include a direct CSV link if possible.”*
- *“Suggest an **HR dataset** with employee age, tenure, satisfaction, and attrition status. Include data dictionary and a CSV link or Kaggle page.”*
- *“Point me to a **finance dataset** for stock prices with features I could use to predict next-day return (or volume). Provide source and direct CSV link if available.”*
- *“Suggest a **law/policy** dataset where we could examine whether a policy change affects an outcome, with at least one continuous variable to predict.”*

### 📌 Data Provenance (fill in after you choose)
- **Source / URL:**  
- **Date accessed:**  
- **Short description of variables:**  


---
## 📥 Step 2 — Load Your Data
Choose **one** of the methods below. Run the cell that matches your situation.

### Option A — Upload a local CSV (easy)
Use the upload button to select your CSV file.


In [None]:
import pandas as pd
from google.colab import files

print("🔼 Choose your CSV file…")
uploaded = files.upload()  # pick your CSV
csv_name = list(uploaded.keys())[0]
df = pd.read_csv(csv_name)
print("Loaded:", csv_name)
df.head()

### Option B — Load directly from a **URL** (if you have a direct CSV link)
Paste the URL and run.

In [None]:
import pandas as pd
CSV_URL = "https://example.com/path/to/your.csv"  # <-- paste
try:
    df = pd.read_csv(CSV_URL)
    print("Loaded from URL:", CSV_URL)
    display(df.head())
except Exception as e:
    print("URL load failed — check the link or use Option A.\nError:", e)

### Option C — Kaggle (advanced, optional)
1) Go to **Kaggle → Account → Create New API Token** → it downloads `kaggle.json`.  
2) Upload `kaggle.json` here.  
3) Replace the dataset path (e.g., `zynicide/wine-reviews`) and filename.


In [None]:
!pip -q install kaggle
from google.colab import files
import os, pandas as pd

print("Upload kaggle.json from your Kaggle account…")
files.upload()
os.makedirs('/root/.kaggle', exist_ok=True)
os.replace('kaggle.json','/root/.kaggle/kaggle.json')
os.chmod('/root/.kaggle/kaggle.json', 0o600)

KAGGLE_DATASET = "zynicide/wine-reviews"  # <-- change to your dataset
!kaggle datasets download -d $KAGGLE_DATASET -p /content --unzip

# Change filename below to the CSV in the dataset
CSV_FILENAME = "/content/winemag-data-130k-v2.csv"  # <-- change
df = pd.read_csv(CSV_FILENAME)
print("Loaded Kaggle dataset:", CSV_FILENAME)
df.head()

---
## 🧹 Step 3 — Understand & Prepare Data
Run the checks below and make minimal cleaning choices.

In [None]:
print("Shape (rows, cols):", df.shape)
print("\nColumns:\n", list(df.columns))
print("\nDtypes:\n", df.dtypes)
print("\nMissing values (per column):\n", df.isna().sum())
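
What counts as a "minimal cleaning choice" depends on the dataset. One common case is a numeric column that was read as text because of thousands separators or placeholder strings. The sketch below assumes a toy frame with hypothetical columns `Sales` and `AdSpend`; adapt the column names to your own data.

```python
import pandas as pd

# Toy frame for illustration only: numbers stored as text, one duplicate row.
raw = pd.DataFrame({
    "Sales": ["1,200", "950", "950", "n/a", "1,430"],
    "AdSpend": ["300", "210", "210", "250", "410"],
})

clean = raw.drop_duplicates().copy()
for col in ["Sales", "AdSpend"]:
    # Strip thousands separators, then coerce; unparseable entries become NaN.
    no_commas = clean[col].str.replace(",", "", regex=False)
    clean[col] = pd.to_numeric(no_commas, errors="coerce")

print(clean.dtypes)
print(clean.isna().sum())
```

Rows that end up with `NaN` in a chosen column are dropped later by the `dropna()` call in the variable-selection cell, so it is worth noting how many rows are lost this way.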

### Choose Your Variables
Pick **one numeric outcome (Y)** to predict and **2–5 numeric predictors (X)** that could explain it.

- Example Marketing: `Y = Sales`, `X = [AdSpend, Price, StoreVisits]`
- Example HR: `Y = Salary`, `X = [YearsExperience, PerformanceScore]`

Fill in your choices below and run.

In [None]:
Y_COL = "<your_numeric_outcome>"      # e.g., 'Sales'
X_COLS = ["<x1>", "<x2>"]            # e.g., ['AdSpend','Price']

assert Y_COL in df.columns, "Y_COL not found in dataframe!"
for c in X_COLS:
    assert c in df.columns, f"Predictor {c} not in dataframe!"

data = df[X_COLS + [Y_COL]].dropna()
print("Data used (first 5 rows):")
data.head()

---
## 📊 Step 4 — Descriptive Analysis
Compute basic summaries to understand your variables.

In [None]:
desc = data.describe().T
desc
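
Alongside the summary table, a quick look at how strongly each predictor moves with the outcome can help in judging whether the chosen X variables are plausible. This is a hedged sketch on synthetic data; the column names and coefficients are made up for illustration, and in the notebook the same idea would apply to `data` and `Y_COL`.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only.
rng = np.random.default_rng(1)
ad = rng.normal(2_000, 300, 300)
price = rng.normal(20, 2, 300)
sales = 3.0 * ad - 150.0 * price + rng.normal(0, 400, 300)
demo = pd.DataFrame({"AdSpend": ad, "Price": price, "Sales": sales})

# Pearson correlation of every predictor with the outcome, strongest positive first.
corr_with_y = demo.corr()["Sales"].drop("Sales").sort_values(ascending=False)
print(corr_with_y.round(3))
```

A correlation near zero does not rule a predictor out, since it only measures a straight-line relationship with Y taken one variable at a time.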

### Quick Visuals (optional but helpful)
- Histogram of the outcome
- Scatterplots of Y vs each X

In [None]:
import matplotlib.pyplot as plt
import numpy as np

plt.figure()
plt.hist(data[Y_COL], bins=20)
plt.title(f"Distribution of {Y_COL}")
plt.xlabel(Y_COL)
plt.ylabel("Frequency")
plt.show()

for col in X_COLS:
    plt.figure()
    plt.scatter(data[col], data[Y_COL])
    plt.xlabel(col)
    plt.ylabel(Y_COL)
    plt.title(f"{Y_COL} vs {col}")
    plt.show()

---
## 🧪 Step 5 — Hypothesis Testing (choose one)
Pick **one** hypothesis test suitable for your dataset and state your hypotheses:

### Option A — One-sample t-test (mean of Y vs a benchmark)
- Example: *Is average monthly sales **> 10,000**?*

### Option B — Two-sample t-test (mean of Y across two groups)
- Requires a **binary group column** (e.g., Region A vs B). Create one if needed.
- Example: *Is **average salary** higher for **department A** than **department B**?*
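
If the dataset has no ready-made two-level column, one way to create one is a median split on a numeric variable. The sketch below uses a toy frame with hypothetical columns `Tenure` and `Salary`; the new column name `TenureGroup` is likewise an assumption.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "Tenure": [1, 3, 5, 7, 9, 11],
    "Salary": [40, 45, 52, 58, 63, 70],
})

# Median split: rows above the median tenure are "High", the rest "Low".
cutoff = toy["Tenure"].median()
toy["TenureGroup"] = np.where(toy["Tenure"] > cutoff, "High", "Low")
print(toy["TenureGroup"].value_counts())
```

Add such a column to the full dataframe (`df`) rather than to a subset, so that it is available wherever the group comparison is run.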

👉 Write your $H_0$ and $H_1$ clearly in a **Text** cell below. Then run the matching code.
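
As an illustration of the notation, the two example questions above might be written as follows, assuming $\mu$ is the population mean of Y and $\mu_A$, $\mu_B$ are the two group means (the benchmark 10,000 comes from the example, not from your data):

$$
\text{Option A:}\quad H_0: \mu = 10{,}000 \qquad H_1: \mu > 10{,}000
$$

$$
\text{Option B:}\quad H_0: \mu_A = \mu_B \qquad H_1: \mu_A > \mu_B
$$

For Option A, the statistic compares the sample mean $\bar{y}$ with the benchmark $\mu_0$, scaled by the standard error, where $s$ is the sample standard deviation and $n$ the sample size:

$$
t = \frac{\bar{y} - \mu_0}{s / \sqrt{n}}
$$

A one-sided $H_1$ such as these corresponds to `alternative='greater'` in SciPy; a two-sided version replaces $>$ with $\neq$.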

### ✅ Option A — One-sample t-test

In [None]:
from scipy import stats
import numpy as np

BENCHMARK = 0.0  # <-- set your benchmark, e.g., 10000
alternative = 'two-sided'  # 'greater', 'less', or 'two-sided'

y = data[Y_COL].astype(float).values
t_stat, p_val = stats.ttest_1samp(y, popmean=BENCHMARK, alternative=alternative)
print(f"t = {t_stat:.3f}, p = {p_val:.4f}, n = {len(y)}")
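
The printed p-value still has to be turned into a conclusion. A minimal sketch of the decision rule at $\alpha = 0.05$, using synthetic monthly sales and the benchmark from the Option A example (the data and the variable names `ALPHA` and `decision` are illustrative assumptions):

```python
import numpy as np
from scipy import stats

# Synthetic monthly sales for illustration only.
rng = np.random.default_rng(42)
sales = rng.normal(loc=10_800, scale=1_000, size=60)

ALPHA = 0.05
# H0: mean = 10,000 versus H1: mean > 10,000
t_stat, p_val = stats.ttest_1samp(sales, popmean=10_000, alternative="greater")

decision = "reject H0" if p_val < ALPHA else "fail to reject H0"
print(f"t = {t_stat:.3f}, p = {p_val:.4f} -> {decision}")
```

"Fail to reject" is not the same as proving $H_0$ true; it only means the sample does not provide enough evidence against it at the chosen $\alpha$.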

### ✅ Option B — Two-sample t-test (independent)
Provide a **binary group column** (e.g., `GroupCol` with values A/B or 0/1).

In [None]:
from scipy import stats

GROUP_COL = "<your_binary_group_col>"  # e.g., 'RegionAB' with values 'A'/'B'
# Look the column up in df: `data` holds only X_COLS and Y_COL,
# so a separate group column would never be found there.
if GROUP_COL in df.columns:
    grp = df[[GROUP_COL, Y_COL]].dropna()
    gvals = grp[GROUP_COL].unique()
    assert len(gvals) == 2, "Group column must have exactly 2 unique values!"
    g1, g2 = gvals[0], gvals[1]
    y1 = grp.loc[grp[GROUP_COL] == g1, Y_COL].astype(float)
    y2 = grp.loc[grp[GROUP_COL] == g2, Y_COL].astype(float)
    # equal_var=True is the pooled-variance (Student) t-test
    t_stat, p_val = stats.ttest_ind(y1, y2, equal_var=True, alternative='two-sided')
    print(f"Groups: {g1} (n={len(y1)}), {g2} (n={len(y2)})")
    print(f"t = {t_stat:.3f}, p = {p_val:.4f}")
else:
    print("Define GROUP_COL and ensure it exists in df with exactly 2 unique values.")

---
## 🤖 Step 6 — Predictive Modeling (Linear Regression)
We will build a simple regression to predict **Y** from your selected **X** variables.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
import numpy as np

X = data[X_COLS].astype(float).values
y = data[Y_COL].astype(float).values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # RMSE; avoids the 'squared' argument, which newer scikit-learn versions no longer accept

print("Model Coefficients (b for each X):")
for name, coef in zip(X_COLS, model.coef_):
    print(f"  {name}: {coef:.4f}")
print(f"Intercept (a): {model.intercept_:.4f}")
print("\nTest Performance:")
print(f"  R^2  : {r2:.3f}")
print(f"  MAE  : {mae:.3f}")
print(f"  RMSE : {rmse:.3f}")
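
For reference, the fitted model and the three reported metrics can be written as follows, where $a$ is the intercept, $b_1, \dots, b_k$ are the coefficients printed above, and the sums run over the $n$ rows of the test set (with $\bar{y}$ the mean of the test outcomes):

$$
\hat{y} = a + b_1 x_1 + \dots + b_k x_k
$$

$$
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
\qquad
\text{MAE} = \frac{1}{n}\sum_i \lvert y_i - \hat{y}_i \rvert
\qquad
\text{RMSE} = \sqrt{\frac{1}{n}\sum_i (y_i - \hat{y}_i)^2}
$$

MAE and RMSE are in the same units as Y, so they can be compared directly with typical values of the outcome. RMSE penalizes large errors more heavily than MAE, and $R^2$ on a test set can be negative when the model predicts worse than the test mean.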

### Visualize (one predictor at a time)
If you only selected **one X**, this draws the regression line.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

if len(X_COLS) == 1:
    xname = X_COLS[0]
    xvals = data[xname].astype(float).values.reshape(-1,1)
    yvals = data[Y_COL].astype(float).values
    m = LinearRegression().fit(xvals, yvals)
    xgrid = np.linspace(xvals.min(), xvals.max(), 100).reshape(-1,1)
    yhat  = m.predict(xgrid)
    plt.figure()
    plt.scatter(xvals, yvals)
    plt.plot(xgrid, yhat)
    plt.xlabel(xname)
    plt.ylabel(Y_COL)
    plt.title(f"Regression of {Y_COL} on {xname}")
    plt.show()
else:
    print("(Visualization: set only one predictor in X_COLS to see the fitted line plot.)")

---
## 🧠 Step 7 — Decision Questions (Answer in Markdown)
Answer **all** questions below using your outputs (they cannot be answered without running your code):

1) **Hypothesis Test Result:** What was your $H_0$ and $H_1$? Report your **t-statistic**, **p-value**, and the **plain-language conclusion** at $\alpha=0.05$.
2) **Model Fit:** Interpret your **R², MAE, RMSE**. Is the model useful for decision-making? Why or why not?
3) **Feature Effects:** Pick the most important predictor from your model (largest |coefficient|, bearing in mind that coefficient size depends on each variable's units). Explain what a **one-unit increase** in that variable means for the outcome **in business terms**.
4) **What-if Scenario:** If you change one predictor by +10% (e.g., increase AdSpend), what is the **predicted change** in Y (using the fitted model)? Show the calculation.
5) **Risk & Limitations:** Identify one potential bias or limitation in your data or model (e.g., omitted variables, nonlinearity, small sample). How could it affect decisions?
6) **Decision Recommendation:** Based on your analysis, what should the manager do **next month/quarter**? Be specific (e.g., adjust price by X, raise budget by Y, prioritize region Z) and justify with numbers.
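
For the what-if scenario in question 4, a linear model makes the calculation short: holding the other predictors fixed, the predicted change in Y equals the coefficient times the change in X. The sketch below illustrates this on synthetic, noise-free data; the variable names, the coefficients, and the choice of the average row as the baseline are all assumptions for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data for illustration only: Sales = 500 + 3*AdSpend - 20*Price (no noise).
rng = np.random.default_rng(0)
ad_spend = rng.uniform(1_000, 3_000, 200)
price = rng.uniform(15, 25, 200)
sales = 500 + 3.0 * ad_spend - 20.0 * price

X_demo = np.column_stack([ad_spend, price])
demo_model = LinearRegression().fit(X_demo, sales)

# Baseline: an "average" month. Scenario: AdSpend raised by 10%, Price unchanged.
baseline = X_demo.mean(axis=0)
scenario = baseline.copy()
scenario[0] *= 1.10

pred_base, pred_new = demo_model.predict(np.vstack([baseline, scenario]))
delta_model = pred_new - pred_base
# The same number by hand: coefficient x change in the predictor.
delta_hand = demo_model.coef_[0] * 0.10 * baseline[0]
print(f"Predicted change in Sales: {delta_model:.2f} (by hand: {delta_hand:.2f})")
```

The two numbers agree because the model is linear; the size of a "+10%" change depends on the baseline value chosen, so state that baseline explicitly in your answer.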

> Write your answers below this cell in **clear, concise business language**. Use bullet points and short paragraphs.

### *Write your responses here*

---
## 🎨 Step 8 — Creative Element (pick one)
Add at least **one** of the following:
- **Segmentation:** Split data into two meaningful groups (e.g., Region A/B, Product Type) and **run regression separately**; compare coefficients.
- **Feature Engineering:** Create a new variable that could improve predictions (e.g., `RevenuePerVisit = Revenue / Visits`).
- **Validation:** Perform **k-fold cross-validation** (e.g., 5-fold) and report average R²/MAE.
- **Visualization:** Add one clean visualization that helps a manager understand the recommendation at a glance.
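
For the **Validation** option, 5-fold cross-validation could look roughly like the sketch below. It runs on synthetic data so that it is self-contained; the names `X_demo` and `y_demo` are placeholders, and in the notebook the predictor and outcome arrays from Step 6 would take their place.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic data for illustration only.
rng = np.random.default_rng(7)
X_demo = rng.normal(size=(300, 2))
y_demo = 5.0 + 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=300)

# Shuffle before splitting so each fold is a random sample of rows.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_validate(
    LinearRegression(), X_demo, y_demo, cv=cv,
    scoring=("r2", "neg_mean_absolute_error"),
)

mean_r2 = scores["test_r2"].mean()
# scikit-learn reports MAE as a negative score, so flip the sign.
mean_mae = -scores["test_neg_mean_absolute_error"].mean()
print(f"5-fold mean R^2: {mean_r2:.3f} | mean MAE: {mean_mae:.3f}")
```

Averaging over five folds gives a steadier estimate of out-of-sample performance than the single 75/25 split used in Step 6, which can look better or worse depending on which rows land in the test set.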


---
## 📄 Step 9 — Decision Memo (≤ 200 words)
Write a short memo to a non-technical manager:
- **Goal**: what you set out to predict or test
- **Key result**: hypothesis test conclusion and model takeaway
- **Recommendation**: the decision and quantitative justification
- **Risk**: one limitation and how you’d address it next

Type your memo below this cell.

---
## 📦 Step 10 — Save & Submit
- Go to **File → Download → Download .ipynb** and upload to the assignment page.
- If you used a local file, also **attach your CSV**. If you used an online dataset, ensure the link is accessible.
- Include the **Data Provenance** note with link and access info.

*(Optional) Save a copy of your working dataset (CSV) to Google Drive using the code below.*

In [None]:
from google.colab import drive
from datetime import datetime
drive.mount('/content/drive')
ts = datetime.now().strftime('%Y%m%d_%H%M')
out = f"/content/drive/MyDrive/{STUDENT_NAME.replace(' ','_')}_PredictiveDecision_{ts}.csv"
try:
    data.to_csv(out, index=False)
    print("Saved a copy of your working dataset to:", out)
except Exception as e:
    print("(Optional save) Could not save to Drive:", e)