
## CSCE 676 :: Data Mining and Analysis :: Texas A&M University :: Spring 2026


# Weekly Homework 1: Data Basics


***Goals of this homework:***
Onboard into the course (introductions and short video), perform end-to-end data analysis on a noisy real-world dataset (ingestion, cleaning, feature engineering, and exploratory data analysis), practice clear technical communication (written and spoken via short video), and answer interview-style questions that assess data reasoning, assumptions, tradeoffs, and limitations.


***Submission instructions:***

You should post your notebook to Canvas (look for the homework 1 assignment there). Please name your submission **your-uin_hw1.ipynb**, so for example, my submission would be something like **555001234_hw1.ipynb**. Your notebook should be fully executed when you submit ... so run all the cells for us so we can see the output, then submit that.

***Grading philosophy:***

We are grading reasoning, judgment, and clarity, not just correctness. Show us that you understand the data, the constraints, and the limits of your conclusions.

***For each question, you need to respond with 3 cells:***
1. **[A Code Cell] Your Code:** If code is not applicable, put `# no code` in the cell. For tests: tests can be simple assertions or checks (e.g., using `assert` or `print` or small functions or visual inspection); formal testing frameworks are not required.
2. **[A Markdown Cell] Your Answer:** Write up your answers and explain them in complete sentences. Include any videos in this section as well; for videos, upload them to your TAMU Google Drive, and ensure they are set to be visible by the instruction team (`caverlee@tamu.edu` and `mariateleki@tamu.edu`), then share the link to the video in the cell.
3. **[A Markdown Cell] Your Resources:** You need to cite 3 types of resources and note how they helped you: (1) Collaborators, (2) Web Sources (e.g. StackOverflow), and (3) AI Tools (you must also describe how you prompted, but we do not require any links to any specific chats). Specifically, use the following format as a template:
```
On my honor, I declare the following resources:
1. Collaborators:
- Reveille A.: Helped me understand that a df in pandas is a data structure kinda like a CSV.
- Sully A.: Helped me fix a bug with the vector addition of 2 columns.
- ...

2. Web Sources:
- https://stackoverflow.com/questions/46562479/python-pandas-data-frame-creation: how to create a pd df
- ...

3. AI Tools:
- ChatGPT: I gave it the homework .ipynb file and the ufo.csv, and told it to generate the code for the first question, but it did it with csv.reader(), so I re-prompted it to use pandas and that one was correct
- ...
```
***Why do we require this cell?*** This cell is important...

1. For academic integrity, you must give credit where credit is due.

2. We want you to pay attention to how you can successfully get help to move through problems! Is there someone you work with or an AI tool that helps you learn the material better? That's great! The point of engineering is to use your tools to solve hard problems, and part of graduate school is learning about how *you* learn and solve problems best.

***A reminder: you get out of it what you put into it.***
Do your best on these homeworks, show us your creativity, and ask for help when you need it -- good luck!

# A [4pts]. Hi! Introduce Yourself

**Note:** We only need 1 markdown cell for these questions.

**Rubric**

[2pt] Complete, thoughtful response.

[1pt] Partial response.

[0pt] Minimal response.


## 1.
Welcome to CSCE 676! Head to this thread -- **"Week 1: Introductions (on Canvas Discussions)"** -- in this week's module and introduce yourself. When you're done, type "done" here.
**Done.**

## 2.

I want to get to know you all! Please share a very brief (1min max) video saying hello.

What to include:

* A greeting (hello, hola, yo!, whatever)
* Please tell me how you pronounce your name
* One memorable thing -- could be your favorite meme, an interesting fact, favorite movie, etc. Just something that will help me remember -- like "Aha, Alice is that student who really loves skateboarding".

See the introduction for instructions on how to share the video.



# B [64pts]. UFO Sightings — Data Ingestion, Cleaning, and Feature Engineering

**Dataset:** `ufos.csv`

Detected columns: `datetime`, `city`, `state`, `country`, `shape`, `duration (seconds)`, `duration (hours/min)`, `comments`, `date posted`, `latitude`, `longitude`, and possibly extra unnamed columns.

**Goal:** Perform a set of tasks to load the data, diagnose issues, clean/standardize it, and derive basic features to support downstream mining using the Python package `pandas`.

**Rubric**

[8 pts] Strong/Professional: Correct and complete implementation of the task; Reasonable assumptions, stated or implied, and justified; Thoughtful handling of real-world data issues (missingness, noise, scale, duplicates, edge cases); Clear, concise explanations of what was done and why; Code is clean, readable, and well-structured, uses appropriate pandas, and would plausibly pass a professional code review; Tests meaningfully validate non-trivial behavior (not just "the code runs so it must be right").

[4 pts] Partial/Developing: Core task mostly completed but with gaps, weak assumptions, or minor mistakes; Reasoning is shallow or mostly descriptive; Code works but is messy, repetitive, or fragile; Tests are superficial, incomplete, or poorly motivated.

[0 pts] Minimal/Incorrect: Task is largely incorrect, missing, or misunderstands the goal; Little to no reasoning or justification; Code does not run or ignores constraints; No meaningful tests.


## 1.

* Load `ufos.csv` into a pd.DataFrame named `ufo_raw`.
* Display 5 random rows and `ufo_raw.info()`.
* Display the number of rows/columns.
* Display any empty columns.
* Write at least 1 test for your code, then answer: What did you test for? How do you know your code is correct?


import pandas as pd

bad_lines = []
ufo_raw = pd.read_csv('ufos.csv', engine='python', on_bad_lines=lambda line: bad_lines.append(line))

# Tests to make sure ufos.csv was loaded correctly (at least one row and one column)
assert ufo_raw.shape[0] > 0
assert ufo_raw.shape[1] > 0

# Test to make sure ufo_raw is a DataFrame
assert isinstance(ufo_raw, pd.DataFrame)

# Displays a random 5 rows
print("======================== RANDOM 5 ROWS ========================")
print(ufo_raw.sample(5), end='\n\n')

# Displays information about ufo_raw CSV
print("======================== UFO_RAW.INFO() ========================")
# info() prints its report directly and returns None, so it is not wrapped in print()
ufo_raw.info()
print()

print("======================== UFO_RAW ROWS & COLUMNS ========================")
print(f"UFO ROWS: {ufo_raw.shape[0]}")
print(f"UFO COLUMNS: {ufo_raw.shape[1]}", end='\n\n')

print("======================== UFO_RAW EMPTY COLUMNS ========================")
# Finds all empty columns
empty_columns = [col for col in ufo_raw.columns if ufo_raw[col].isna().all()]
if empty_columns:
    print('\n'.join(empty_columns))
    print('\n\n')
else:
    print("No empty columns\n")

print("======================== UFO_RAW BAD LINES ========================")
# Prints all bad lines
if bad_lines:
    print(f"Bad lines: {len(bad_lines)}")
else:
    print("No bad lines")


              datetime                             city state country  \
46630  5/13/2013 21:00                       northfield    nj      us   
40359  4/13/2002 21:55           leicester (uk/england)   NaN      gb   
84067  9/20/2004 22:00  vlachata&#44 kefalonia (greece)   NaN     NaN   
56249  6/21/2012 00:00                         harrison    ar      us   
81651  9/12/2006 22:00                        waterbury    ct      us   

          shape duration (seconds) duration (hours/min)  \
46630     light                150          2.5 minutes   
40359    sphere               2100           35 minutes   
84067  triangle                 20           20 seconds   
56249  fireball                600           10 minutes   
81651     light               3600               1 hour   

                                                comments date posted  \
46630  I saw what at first looked to me like a small ...   5/15/2013   
40359  On the 13 April 2002 i witnessed 7 spheres cir...   4/2

## Q/A

### 1. What did you test for? How do you know your code is correct?

I tested that the `ufos.csv` data was read correctly by asserting that the row and column counts were greater than 0.

I also tested to make sure ufo_raw's data type was pd.DataFrame.

```
On my honor, I declare the following resources:
1. Collaborators:
N/A
- ...

2. Web Sources:
- https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html#pandas.DataFrame: Used for understanding DataFrame APIs (read_csv, on_bad_lines, isna, etc)
- ...

3. AI Tools:
- Sonnet 4.5: I gave it the ufo.csv and prompted it to characterize the set. (How many columns are empty, bad lines, total rows/columns)
- ...
```


## 2.

Create a cleaned DataFrame `ufo`:

* Drop fully-empty or irrelevant columns (e.g., unnamed columns).
* Parse `datetime` to `datetime64[ns]` (`errors='coerce'`).
* Coerce `duration (seconds)`, `latitude`, `longitude` to numeric.
* Lowercase/trim `city`, `state`, `country`, `shape`.
* Remove rows with impossible coordinates (lat ∉ [-90,90], lon ∉ [-180,180]).
* Drop exact duplicates based on a reasonable subset (document your choice).
* Write at least 2 tests for your code (focus on the most complicated parts), then answer: What did you test for? How do you know your code is correct?
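
The steps above could be chained roughly as follows. This is a minimal sketch on a tiny made-up frame, not the real `ufos.csv`; the helper name `clean_ufo`, the sample rows, and the deduplication subset are assumptions chosen for illustration.

```python
import pandas as pd


def clean_ufo(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: apply the cleaning steps listed above."""
    df = raw.copy()
    # Drop unnamed columns and columns that are entirely empty.
    df = df.loc[:, ~df.columns.astype(str).str.startswith("Unnamed")]
    df = df.dropna(axis=1, how="all")
    # Parse timestamps; unparseable values become NaT.
    df["datetime"] = pd.to_datetime(df["datetime"], errors="coerce")
    # Coerce numeric columns; unparseable values become NaN.
    for col in ["duration (seconds)", "latitude", "longitude"]:
        df[col] = pd.to_numeric(df[col], errors="coerce")
    # Trim and lowercase the text columns.
    for col in ["city", "state", "country", "shape"]:
        df[col] = df[col].str.strip().str.lower()
    # Keep rows with possible coordinates. NaN fails between(), so rows with
    # missing coordinates are dropped too (an assumption of this sketch).
    valid = df["latitude"].between(-90, 90) & df["longitude"].between(-180, 180)
    df = df[valid]
    # Assumed subset: same time and same place counts as the same report.
    subset = ["datetime", "city", "state", "latitude", "longitude"]
    return df.drop_duplicates(subset=subset).reset_index(drop=True)


# Made-up rows: a duplicate pair, a bad duration, and an impossible latitude.
raw = pd.DataFrame({
    "datetime": ["10/10/1949 20:30", "10/10/1949 20:30", "1/1/2000 12:00", "not a date"],
    "city": [" Austin ", "austin", "Paris", "Nowhere"],
    "state": ["TX", "tx", None, "zz"],
    "country": ["US", "us", "fr", "us"],
    "shape": ["Light", "light", "disk ", "circle"],
    "duration (seconds)": ["120", "120", "60`", "30"],
    "latitude": ["30.27", "30.27", "48.85", "95.0"],
    "longitude": ["-97.74", "-97.74", "2.35", "10.0"],
    "Unnamed: 11": [None, None, None, None],
})
ufo = clean_ufo(raw)
print(ufo)
```

Normalizing the text columns before deduplicating matters here: the two Austin rows only compare equal once case and whitespace are standardized.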


## 3.

Now take a look at the `duration (hours/min)` column. For this question, we'd like you to just extract however many unique versions of durations reported in *minutes* you can from the `duration (hours/min)` column. In other words, find as many different variations of anything that could be reasonably interpreted as a minute-like duration. Examples include:
* several minutes
* x to y minutes
* x minutes
* x min.
* x mins.
* and so on ...

Write at least 2 tests for your code (focus on the most complicated parts), then answer: What did you test for? How do you know your code is correct?
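
One way to approach this is a regular expression that recognizes the minute unit, followed by a normalization step that replaces every number with a placeholder so that `5 minutes` and `10 minutes` collapse into one variant. The sketch below uses only the standard library; the pattern, the helper name `minute_like_variants`, and the sample strings are assumptions, and a real run would pass in the `duration (hours/min)` column instead.

```python
import re

# Assumed pattern: "min", "mins", "minute", or "minutes", not embedded in a
# longer alphabetic word (so "minimum" does not count).
MINUTE_RE = re.compile(r"(?<![a-z])min(?:ute)?s?(?![a-z])")
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")


def minute_like_variants(values):
    """Return sorted unique minute-like patterns, with numbers shown as 'x'."""
    variants = set()
    for value in values:
        if not isinstance(value, str):
            continue  # skip NaN / None entries
        text = " ".join(value.lower().split())  # lowercase, collapse whitespace
        if MINUTE_RE.search(text):
            variants.add(NUMBER_RE.sub("x", text))
    return sorted(variants)


sample = [
    "5 minutes", "10 minutes", "2-3 min.", "several minutes",
    "1 hour", "20 seconds", "5MIN", float("nan"), "minimum 1 hr",
]
variants = minute_like_variants(sample)
print(variants)
```

The unit check runs before the number substitution on purpose: once `5min` has become `xmin`, the letter `x` in front of `min` would make the pattern reject it.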


## 4.

Create additional columns in a new feature engineering version of the dataset called `ufo_fe`:

- `year`, `month`, `hour` from `datetime`
- `duration_log10` = log10(duration_seconds + 1)
- `duration_bucket` = categorical bins for `duration (seconds)` (you may adjust edges)
- `us_only` = 1 if `country == 'us'` else 0

Then produce:
- Value counts of `shape` (top 10).
- A table of sightings by `year` (counts).
- A `state × shape` table (top 10 states by sample size).
- A matplotlib or seaborn visualization of the top 10 states by number of UFO sightings. Write about which visualization you selected, and why.
- Write at least 2 tests for your code (focus on the most complicated parts), then answer: What did you test for? How do you know your code is correct?
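
The derived columns listed above could be built along these lines. This is a sketch on three made-up rows; the bucket edges and labels are assumptions (the question allows adjusting them), and the real `ufo_fe` would start from the cleaned `ufo` frame.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the cleaned frame.
ufo = pd.DataFrame({
    "datetime": pd.to_datetime(["2004-09-20 22:00", "2012-06-21 00:00", "2013-05-13 21:00"]),
    "country": ["us", None, "gb"],
    "duration (seconds)": [20.0, 600.0, 7200.0],
})

ufo_fe = ufo.copy()
ufo_fe["year"] = ufo_fe["datetime"].dt.year
ufo_fe["month"] = ufo_fe["datetime"].dt.month
ufo_fe["hour"] = ufo_fe["datetime"].dt.hour

# The +1 keeps zero-second durations finite under the log.
ufo_fe["duration_log10"] = np.log10(ufo_fe["duration (seconds)"] + 1)

# Assumed edges: right-closed bins (0-60], (60-600], (600-3600], (3600-inf).
edges = [0, 60, 600, 3600, np.inf]
labels = ["0-60 s", "1-10 min", "10-60 min", "over 1 h"]
ufo_fe["duration_bucket"] = pd.cut(
    ufo_fe["duration (seconds)"], bins=edges, labels=labels, include_lowest=True
)

# A missing country compares unequal to "us", so it maps to 0.
ufo_fe["us_only"] = (ufo_fe["country"] == "us").astype(int)
print(ufo_fe)
```

Because `pd.cut` bins are right-closed by default, a duration of exactly 600 seconds lands in the `1-10 min` bucket rather than the next one, which is the kind of boundary case worth testing.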





## 5.

In 3–6 sentences, based on your previous answers, summarize data quality and distributional patterns; use at least 3 facts about the dataset produced by your analysis in the previous questions.


## 6.
* In a new chat, prompt an AI tool to perform an EDA on the dataset for you, e.g., *Here is my dataset (attached). Give python code for a basic EDA.* Paste and run the code here (re-prompt until it gives reasonable output if there are bugs/failures).
* Write at least 2 tests for your code (focus on the most complicated parts), then answer: What did you test for? How do you know the code is correct?
* So, what did the AI tool do for the EDA?
* Come up with 3 things it could've done instead.

## 7.
* Now, give screenshots of your code output back to the AI tool, and ask it to interpret the results for you, e.g., *Here are the results, interpret.* Paste the interpretation it provides here.
* What do you agree with?
* What do you disagree with?


## 8.

* Propose and implement 1-2 concrete next steps for your EDA (e.g., better deduplication with fuzzy text, geospatial clustering, normalization of duration text, timezone handling, creation of dummy variables, visualization of a different column, etc.) to improve your understanding of the data, pretending that you are going to use this dataset for important downstream tasks later.
* Interpret your results: what did you learn about the dataset?
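
As one illustration of the "better deduplication with fuzzy text" idea mentioned above, near-identical comments can be flagged with a string-similarity ratio. The sketch below uses the standard library's `difflib`; the helper name `near_duplicate_pairs`, the 0.9 threshold, and the sample comments are assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations


def near_duplicate_pairs(texts, threshold=0.9):
    """Return index pairs whose similarity ratio meets the threshold."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(texts), 2):
        if SequenceMatcher(None, a, b).ratio() >= threshold:
            pairs.append((i, j))
    return pairs


# Made-up comments: the first two differ only by a trailing period.
comments = [
    "bright light hovering over the lake",
    "bright light hovering over the lake.",
    "three triangles moving fast",
]
pairs = near_duplicate_pairs(comments)
print(pairs)
```

This compares every pair, so the cost grows quadratically with the number of rows. On the full dataset it would likely only be practical within small groups, for example reports sharing the same city and date.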

# C [30pts]. Interview Questions

We now pretend this is a real job interview. Here's some guidance on how to answer these questions:

1. Briefly restate the question and state any assumptions you are making.

2. Explain your reasoning out loud, focusing on tradeoffs, limitations, and constraints.

3. As a principle, keep your answers as short and clear as they can be (while still answering the question).

4. Write/speak in a conversational but professional tone (avoid being overly formal). For speaking: speak at a reasonable pace and volume, speak clearly, pause when you need to, and practice making "eye contact" with the camera. Keep a confident, positive, and professional tone. *For additional coaching and practice, the University Writing Center provides individual appointments: https://writingcenter.tamu.edu/make-an-appointment.*

There may not be a single correct answer. We are grading whether your reasoning is reasonable and aware of limitations.

These questions are written unless a video response is specified.


**Rubric**

[6pt] Clear understanding of the question; reasonable assumptions; thoughtful reasoning that acknowledges tradeoffs and limitations; clear, concise communication in a conversational but professional tone (for speaking: clear pace, volume, and articulation).

[3pt] Basic understanding but shallow reasoning or unclear assumptions; communication is somewhat unclear, overly verbose, or overly informal/formal.

[0pt] Minimal, unclear, or incorrect response; poor communication or unprofessional tone.

## 1.
When should you use `pandas` versus just reading in a CSV?

## 2.
If this dataset suddenly grew by 10000×, which parts of your analysis pipeline would fail first? (Hint: Consider your hardware constraints.)
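
One common mitigation when a file no longer fits in memory is to stream it in chunks and keep only running aggregates. The sketch below is illustrative only: it reads a small in-memory CSV in place of `ufos.csv`, and the chunk size of 2 is an assumption chosen to force several chunks.

```python
import io
from collections import Counter

import pandas as pd

# Tiny made-up CSV standing in for a file too large to load at once.
csv_text = "shape,state\nlight,tx\ndisk,ca\nlight,ca\ntriangle,tx\nlight,ny\n"

shape_counts = Counter()
# usecols limits parsing to the needed column; chunksize yields DataFrames
# of at most that many rows instead of one large frame.
for chunk in pd.read_csv(io.StringIO(csv_text), usecols=["shape"], chunksize=2):
    shape_counts.update(chunk["shape"].value_counts().to_dict())

print(shape_counts.most_common())
```

Only counts survive between chunks, so peak memory depends on the chunk size and the number of distinct shapes, not on the number of rows. Steps that need all rows at once, such as global deduplication or sorting, do not decompose this easily.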

## 3.
Assume some fraction of reports are adversarially fabricated (meaning: someone submitted fake UFO reports on purpose). How does that change your analysis?

## 4.
How would incorrect timezone handling distort downstream statistical conclusions?
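
A small worked example can make the distortion concrete. The sketch below assumes a report logged at 22:00 local wall-clock time in a zone fixed at UTC-6 (daylight saving is ignored for simplicity), and compares the correct interpretation with the mistake of treating the value as UTC.

```python
from datetime import datetime, timedelta, timezone

# Assumed fixed offset for illustration; a real zone would also follow DST.
LOCAL = timezone(timedelta(hours=-6))

reported = datetime(2004, 9, 20, 22, 0)  # naive local wall-clock time

# Correct: label the value as local time, then express it in UTC.
correct_utc = reported.replace(tzinfo=LOCAL).astimezone(timezone.utc)

# Mistake: label the value as UTC, then convert it to local time.
wrong_local = reported.replace(tzinfo=timezone.utc).astimezone(LOCAL)

print("correct UTC:", correct_utc.isoformat())
print("wrong local:", wrong_local.isoformat())
```

Under the mistaken reading, a late-evening sighting moves to 16:00, which would shift any hour-of-day histogram toward the afternoon. Under the correct conversion to UTC, the same report crosses midnight into the next calendar day, so daily counts can also move by one day depending on which clock the analysis uses.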

## 5.
Now, link to a video (1 min. max) of yourself answering the following question: What kind of selection bias do you think is present in this web-based UFO dataset?

Make sure to end on a follow-up question for the interviewer -- e.g., *So, to get some more context, are you thinking about this for a Speech AI application like Siri?*

# D [2pts]. What new questions do you have?
We want you to think bigger! Tell us what questions and curiosity this homework brings up for you.

**Rubric**

[2pt] Complete, thoughtful response.

[1pt] Partial response.

[0pt] Minimal response.

## 1.
What new questions do you have about data cleaning and exploratory data analysis (in general) after this homework? Or, what topics are you curious about now? List at least 3.