# 🧮 Descriptive Analysis: Individual Assignment (Week 3)

**Course:** BUS 650: Business Analytics  
**Submission:** One Colab notebook (`.ipynb`) only  
**Rubric:** See Course Overview & Syllabus → Rubrics  
**Support tool:** [BUSI650 Helper GPT: Dataset Finder & Descriptive Analysis](https://chatgpt.com/g/g-68e750d6c8a88191bcc36422f4d40ce2-busi650-dataset-finder-descriptive-analysis)

---

Use the **BUSI650 Helper GPT** that I created (link above) to help you find a small, beginner-friendly dataset in your area of interest: HR, Marketing, Accounting/Finance, or Policy/Law.

To begin, type this in the Helper GPT:

> I need your help for dataset finding

It will ask you a few short questions and provide:
- Dataset suggestions with links  
- A short description of each dataset  
- A direct CSV file you can use in Google Colab  

You’ll use this dataset to perform a **simple descriptive analysis** and write plain-English answers about what you find.

---

## 🎯 Your Task

You will:
1. Choose one small dataset (CSV format).  
2. Open Google Colab and create one notebook.  
3. Run your analysis step-by-step (we’ll guide you below).  
4. Use plain English to describe what the data shows.  
5. (Optional) Use the Helper GPT for guidance and record your usage in an **Appendix**.

---


## 🧭 Part A: Set Up Your Colab Notebook

### Step 1: Open Colab
1. Go to [https://colab.research.google.com](https://colab.research.google.com)
2. Sign in using your Google account.
3. Click **File → New notebook**.

### Step 2: Rename Your File
Click the title (top-left) and rename it:

`Lastname_Firstname_DescriptiveAnalysis.ipynb`


### Step 3: Add Your Header
At the very top, create a **Text cell** and paste this:
```markdown
# BUS 650: Week 3 Descriptive Analysis
**Name:** Your Name  
**Student ID:** 12345678  
**Dataset chosen:** (Name + Link)  
**Date:** YYYY-MM-DD  
**AI usage:** (Yes/No)  
If Yes → see Appendix (include your prompts + what you changed)
```


## Part B: Minimal Setup Code

Add a Code cell below your header and copy this:

```python
# Step 1: Import three basic libraries for analysis.
# - pandas: helps you work with tables of data
# - numpy: helps with calculations
# - matplotlib: helps you make charts

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Step 2: Check that the libraries are working and see their versions
print("Pandas version:", pd.__version__)
print("NumPy version:", np.__version__)
```


**Explanation:** This code simply loads the tools you’ll need, so no changes are required. When you run the cell, the version numbers appear below it.

💾 Save often! Colab autosaves, but you can also press Ctrl/Cmd + S.


## Part C: Get a Dataset

Pick one dataset area (HR, Marketing, Finance, Policy/Law).
Ask the Helper GPT for a beginner-friendly CSV with a direct link.

Examples:

- HR: “Find a simple HR absenteeism dataset (CSV).”
- Marketing: “Find a small marketing campaign dataset.”
- Finance: “Find a tidy transactions dataset with numeric and categorical columns.”
- Law: “Find a small public policy or crime dataset (CSV).”

✅ You must have a direct CSV link or upload a local CSV file.


## Part C (continued): Load Your Data

**Option 1: Load from a Web Link (recommended)**

If your dataset is hosted online (e.g., from Kaggle, GitHub, or data.gov):

```python
# Replace the text inside the quotes with your dataset's direct CSV link
url = "PASTE_YOUR_CSV_LINK_HERE"

# Load the data into a pandas "DataFrame" (like a spreadsheet)
df = pd.read_csv(url)

# Show the first 5 rows of the data
df.head()
```


**Option 2: Upload a Local CSV File**

If you downloaded a dataset to your computer:

```python
# Upload the CSV file from your computer
from google.colab import files
uploaded = files.upload()  # Choose your CSV file

# Get the filename and load it into pandas
filename = list(uploaded.keys())[0]
df = pd.read_csv(filename)

# Show the first 5 rows
df.head()
```


If the dataset doesn’t load correctly, ask the Helper GPT:

> “This dataset isn’t loading. How do I fix the separator or text encoding?”
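If you want to see what a separator problem looks like, here is a minimal sketch. The sample text and the file name `mydata.csv` in the last comment are made up for illustration; your own file may need a different separator or encoding.

```python
import io
import pandas as pd

# Hypothetical sample: a file that uses semicolons instead of commas
raw_text = "name;department;salary\nAna;HR;52000\nBen;Finance;61000\n"

# With the default comma separator, everything lands in a single column
wrong = pd.read_csv(io.StringIO(raw_text))
print("Default separator:", wrong.shape)

# Telling pandas the separator is ";" splits the columns correctly
right = pd.read_csv(io.StringIO(raw_text), sep=";")
print("With sep=';':", right.shape)

# For a real file with unusual characters, you can also pass an encoding,
# for example (hypothetical file name):
# df = pd.read_csv("mydata.csv", sep=";", encoding="latin-1")
```

A table with only one column after loading is usually a sign that the separator is wrong.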

## Part D: Understand Your Data

Now let’s learn what your dataset looks like.

```python
# Check the number of rows and columns
print("Shape (rows, columns):", df.shape)

# Show the first 10 rows
display(df.head(10))

# Show the column names and their data types
print("\nColumn names:", list(df.columns))
print("\nData types:\n", df.dtypes)

# Quick info summary (non-empty values, data types)
df.info()
```


Write in a Text cell:

- What does one row represent (e.g., one employee, one sale, one crime record)?
- List 3–5 categorical variables (text or labels) and 3–5 numeric variables (numbers).
- Are there any “ID” or name columns you should ignore or drop?
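If you are unsure how to sort your columns into these groups, the sketch below shows one possible way on a small made-up table. The table `toy` and its column names are hypothetical; replace `toy` with your own `df`. The "all values are different" check is only a rough hint that a column may be an ID.

```python
import pandas as pd

# Hypothetical toy table standing in for your own df
toy = pd.DataFrame({
    "employee_id": [101, 102, 103, 104],
    "department": ["HR", "Finance", "HR", "Sales"],
    "salary": [52000, 61000, 52000, 58000],
})

# Columns that hold numbers
numeric_cols = list(toy.select_dtypes(include="number").columns)

# Everything else is treated as text / categorical in this sketch
text_cols = [c for c in toy.columns if c not in numeric_cols]

# A column whose values are all different is often just an identifier
id_like = [c for c in toy.columns if toy[c].nunique() == len(toy)]

print("Numeric:", numeric_cols)
print("Text/categorical:", text_cols)
print("Possible ID columns:", id_like)
```

Note that an ID column can be stored as a number even though its mean or standard deviation has no meaning.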

If you see a strange column like `Unnamed: 0`, remove it:

```python
df = df.drop(columns=['Unnamed: 0'], errors='ignore')
```


## Part E: Light Cleaning

This step just makes your data easier to work with.

```python
# Remove duplicate rows (if any)
df = df.drop_duplicates()

# Simplify column names: lowercase and replace spaces with underscores
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

# Change small text columns into "category" type for easier summaries
for col in df.select_dtypes(include="object").columns:
    if df[col].nunique() <= 30:
        df[col] = df[col].astype("category")

# Show the first 20 column types
df.dtypes.head(20)
```
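To see what the first two cleaning steps do, here is a small sketch on a made-up table (the table `toy` and its column names are hypothetical, not part of your dataset):

```python
import pandas as pd

# Hypothetical table with messy column names and one repeated row
toy = pd.DataFrame({
    " Employee Name ": ["Ana", "Ben", "Ana"],
    "Annual Salary": [52000, 61000, 52000],
})

# The third row is an exact copy of the first, so it is removed
toy = toy.drop_duplicates()

# Trim spaces, lowercase, and replace inner spaces with underscores
toy.columns = toy.columns.str.strip().str.lower().str.replace(" ", "_")

print(list(toy.columns))
print("Rows left:", len(toy))
```

After cleaning, you can refer to a column as `toy["annual_salary"]` without worrying about capital letters or stray spaces.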


## Part F: Descriptive Statistics (Numbers)

```python
# Find which columns have numbers
num_cols = df.select_dtypes(include="number").columns
print("Numeric columns:", list(num_cols))

# Show a basic summary: count, mean, standard deviation, min, quartiles, max
df[num_cols].describe().T
```


In a Text cell, answer:

- Pick 2 numeric columns. Write their mean, median (the `50%` value in the summary), and standard deviation (`std`).
- If mean ≠ median, what might that tell you (skew or outliers)?
- Which variable has the largest spread (highest std)? Why could that matter for decisions?
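As an illustration of the mean-versus-median question, the sketch below uses a made-up table in which one salary is much larger than the rest. The table `toy`, its columns, and the extra column `mean_minus_median` are all hypothetical names chosen for this example.

```python
import pandas as pd

# Hypothetical data: one very high salary acts as an outlier
toy = pd.DataFrame({
    "salary": [40000, 42000, 45000, 47000, 150000],
    "age": [25, 30, 35, 40, 45],
})

# Mean, median, and standard deviation for every column, one row per variable
summary = toy.agg(["mean", "median", "std"]).T

# A large gap between mean and median hints at skew or outliers
summary["mean_minus_median"] = summary["mean"] - summary["median"]

print(summary.round(1))
```

Here the single large salary pulls the mean well above the median, while `age` is symmetric, so its mean and median are equal.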

## Part G: Descriptive Statistics (Text / Categories)

```python
# Find which columns contain categories or text
cat_cols = df.select_dtypes(include=["category", "object", "bool"]).columns
print("Categorical columns:", list(cat_cols))

# Show counts and percentages for up to 5 category columns
for col in cat_cols[:5]:
    print(f"\n--- {col} ---")
    print("Counts:\n", df[col].value_counts(dropna=False).head(10))
    print("\nPercentages (%):\n", (df[col].value_counts(normalize=True, dropna=False).head(10)*100).round(1))
```


In a Text cell:

- Which 3 categories are most common overall?
- Do you see any large imbalances (e.g., one category much bigger than others)?
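One possible way to spot an imbalance is to look at the share of the most common category. The sketch below uses made-up values, and the 70% cut-off is only an assumption for this example, not a course rule.

```python
import pandas as pd

# Hypothetical column: 8 of 10 employees are full-time
toy = pd.Series(
    ["Full-time"] * 8 + ["Part-time"] + ["Contract"],
    name="employment_type",
)

# Shares of each category, largest first
shares = toy.value_counts(normalize=True)
top_category = shares.index[0]
top_share = shares.iloc[0]

print(f"Most common: {top_category} ({top_share:.0%})")

# Assumed rule of thumb for this sketch: flag anything above 70%
if top_share > 0.70:
    print("This column looks imbalanced.")
```

When one category dominates like this, averages for the small groups rest on very few rows, so interpret them with care.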

## Part H: Visualizations

Let’s make one histogram (for numbers) and one bar chart (for categories).

```python
# Histogram for one numeric variable
numeric_example = df.select_dtypes(include="number").columns[0]  # pick the first numeric column
plt.hist(df[numeric_example].dropna(), bins=20)
plt.title(f"Histogram of {numeric_example}")
plt.xlabel(numeric_example)
plt.ylabel("Count")
plt.show()

# Bar chart for one categorical variable
categorical_example = df.select_dtypes(include=["category", "object", "bool"]).columns[0]
top_counts = df[categorical_example].value_counts().head(10)
plt.bar(top_counts.index.astype(str), top_counts.values)
plt.title(f"Top 10 {categorical_example} Categories")
plt.xlabel(categorical_example)
plt.ylabel("Count")
plt.xticks(rotation=45, ha="right")
plt.show()
```


Explanation:

- A histogram shows how numeric values are distributed.
- A bar chart shows which categories appear most often.
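To make the idea of a histogram concrete, the sketch below counts how many values fall into each bin without drawing anything. The numbers are made up, and 4 bins are used only to keep the output short.

```python
import numpy as np

# Hypothetical values: most are small, one is much larger
values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 9])

# Split the range 1 to 9 into 4 equal-width bins and count the values in each
counts, edges = np.histogram(values, bins=4)

print("Bin edges:", edges)
print("Counts per bin:", counts)
```

Each bar in a histogram is one of these counts. Most values sit in the first two bins and the last bin holds the single large value, which is what a long right tail looks like.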

## Part I: Think Like a Manager

In a Text cell, answer 5–6 of these in plain language:

- For two numeric variables, what are the mean, median, and spread?
- Which variable varies the most (has the highest standard deviation)?
- For one categorical variable, what are the top 3 categories? Any surprises?
- (Optional) Compare one numeric variable by category:

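```python
# Average of the numeric column within each category, largest first
# (observed=True keeps only categories that actually appear in the data)
df.groupby(categorical_example, observed=True)[numeric_example].mean().sort_values(ascending=False).head(10)
```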


- What differences might matter in practice?
- What are your top 2–3 business insights you’d report to a manager?
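For the optional comparison by category, it can help to show the group size and the median next to the mean. This is a sketch on made-up data; the table `toy` and its columns are hypothetical.

```python
import pandas as pd

# Hypothetical data: Sales has one unusually high salary
toy = pd.DataFrame({
    "department": ["HR", "HR", "Sales", "Sales", "Sales"],
    "salary": [50000, 54000, 40000, 45000, 95000],
})

# Count, mean, and median of salary within each department
by_group = (
    toy.groupby("department")["salary"]
       .agg(["count", "mean", "median"])
       .sort_values("mean", ascending=False)
)

print(by_group)
```

In this example Sales has the higher mean but the lower median, so a manager reading only the averages could draw the wrong conclusion.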

## Part J: (Optional) Use the BUSI650 Helper GPT

You may ask the GPT helper for:

- Dataset ideas
- Help loading or cleaning data
- Guidance on interpreting charts

If you use GPT, add an Appendix at the end that includes:

- Your exact prompt(s)
- What part of the response helped you
- What you changed or decided on your own

✅ Do not copy full AI text. Just summarize what you used.

## Part K: Final Checks & Submission

1. In Colab, click **Runtime → Restart and run all** to make sure all code works.
2. Fix any errors that appear.
3. Go to **File → Download → Download .ipynb**.
4. Upload your `.ipynb` file to the course shell.

📁 Filename format: `Lastname_Firstname_DescriptiveAnalysis.ipynb`


### ⚠️ Important Note About Datasets

If you use a dataset stored in your own Google Drive (e.g., you uploaded a CSV and read it using a path like `/content/drive/MyDrive/...`), you must also attach that dataset file when you submit your assignment.

Please follow one of these submission options:

Preferred (public dataset):
Use a public URL (like a CSV link from GitHub, Kaggle, or a government data portal) so your notebook loads automatically, with no need to attach files.

Example:

```python
url = "https://raw.githubusercontent.com/datasets/inflation/master/data/cpi.csv"
df = pd.read_csv(url)
```


If using a private or uploaded dataset:

1. Download the CSV from your Drive (right-click → Download).
2. Attach that same `.csv` file alongside your `.ipynb` file when you submit.
3. Make sure the filename in your code matches the attached dataset file.

Example:

```python
df = pd.read_csv("mydata.csv")
```


Do not use Drive-only paths in your final submission:

```python
# ❌ This will not work when the instructor opens your notebook
df = pd.read_csv("/content/drive/MyDrive/mydata.csv")
```

📦 What to submit

You should upload:

- ✅ Your Colab notebook: `Lastname_Firstname_DescriptiveAnalysis.ipynb`
- ✅ Your dataset file (if it’s not publicly accessible online)



### You Should Have:

- A clear header with your name, ID, and dataset info
- All code and answers in one notebook
- Numeric and categorical summaries
- One histogram + one bar chart
- Plain-English insights for decision-making
- Appendix (if GPT was used)