
## CSCE 676 :: Data Mining and Analysis :: Texas A&M University :: Spring 2026


# Weekly Homework 3: Graphs!


***Goals of this homework:***
Perform an analysis of a graph of your choice.


***Submission instructions:***

You should post your notebook to Canvas (look for the assignment there). Please name your submission **your-uin_hw3.ipynb**, so for example, my submission would be something like **555001234_hw3.ipynb**. Your notebook should be fully executed when you submit ... so run all the cells for us so we can see the output, then submit that.

***Grading philosophy:***

We are grading reasoning, judgment, and clarity, not just correctness. Show us that you understand the data, the constraints, and the limits of your conclusions.

***For each question, you need to respond with 2 cells:***
1. **[A Code Cell] Your Code:**
  - If code is not applicable for the question, you can skip this cell.
  - For tests: tests can be simple assertions or checks (e.g., using `assert` or `print` or small functions or visual inspection); formal testing frameworks are not required.
2. **[A Markdown Cell] Your Answer:** Write up your answers and explain them in complete sentences. Include any videos in this section as well; for videos, upload them to your TAMU Google Drive, and ensure they are set to be visible by the instruction team (set to: **anyone with a TAMU email can view**), then share the link to the video in the cell.

***At the end of each Section (A/B/C/...) include a cell for your resources:***

**[A Markdown Cell] Your Resources:** You need to cite 3 types of resources and note how they helped you: (1) Collaborators, (2) Web Sources (e.g. StackOverflow), and (3) AI Tools (you must also describe how you prompted, but we do not require any links to any specific chats). Specifically, use the following format as a template:
```
On my honor, I declare the following resources:
1. Collaborators:
- Reveille A.: Helped me understand that a df in pandas is a data structure kinda like a CSV.
- Sully A.: Helped me fix a bug with the vector addition of 2 columns.
- ...

2. Web Sources:
- https://stackoverflow.com/questions/46562479/python-pandas-data-frame-creation: how to create a pd df
- ...

3. AI Tools:
- ChatGPT: I gave it the homework .ipynb file and the ufo.csv, and told it to generate the code for the first question, but it did it with csv.reader(), so I re-prompted it to use pandas and that one was correct
- ...
```
***Why do we require this cell?*** This cell is important...

1. For academic integrity, you must give credit where credit is due.

2. We want you to pay attention to how you can successfully get help to move through problems! Is there someone you work with or an AI tool that helps you learn the material better? That's great! The point of engineering is to use your tools to solve hard problems, and part of graduate school is learning about how *you* learn and solve problems best.

***A reminder: you get out of it what you put into it.***
Do your best on these homeworks, show us your creativity, and ask for help when you need it -- good luck!

# A [72pts]. Step-by-Step Data Mining & Experimental Analysis on A Graph of Your Choice

**Rubric**

[18 pts] Strong/Professional: Correct and complete implementation of the task; Reasonable assumptions, stated or implied, and justified; Thoughtful handling of real-world data issues (missingness, noise, scale, duplicates, edge cases); Clear, concise explanations of what was done and why; Code is clean, readable, and well-structured, uses pandas appropriately, and would plausibly pass a professional code review; Tests meaningfully validate non-trivial behavior (not just "the code runs so it must be right").

[9 pts] Partial/Developing: Core task mostly completed but with gaps, weak assumptions, or minor mistakes; Reasoning is shallow or mostly descriptive; Code works but is messy, repetitive, or fragile; Tests are superficial, incomplete, or poorly motivated.

[0 pts] Minimal/Incorrect: Task is largely incorrect, missing, or misunderstands the goal; Little to no reasoning or justification; Code does not run or ignores constraints; No meaningful tests.


## Overview
In this homework, you will **choose one dataset you like** from the **Social networks** section of the [SNAP datasets](https://snap.stanford.edu/data/index.html) collection. You must choose a **directed graph**. In this section, you will perform a step‑by‑step data mining & experimental analysis. *Much of this section is self-directed, meaning you will need to make critical decisions about what tools you use and what you explore.*

Ideally, you should eventually turn in a coherent story: **What you tried → Why → What you found → So what? → Wait...Anything more?**. It's completely OKAY if you only get minor discoveries. But you should always document the whole learning and reasoning process. Grading will be based on the logic and coherence of your submitted notebook.

As a guide, for each step of the homework, you should briefly document:
- **Method choice & rationale.** Why this method? What do you expect?
- **Parameters.** E.g., `alpha=0.85` for PageRank; seed selection for PPR; thresholds for community extraction.
- **Results.** Tables/plots + **1–3 sentences** of interpretation.
- **Reflection.** Did results match your expectations? If not, why might that be?
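
As an illustration of documenting parameters, here is a minimal sketch of global and personalized PageRank (PPR) in `networkx`. The toy graph, the constant names `ALPHA` and `SEED_NODE`, and the choice of seed are assumptions made for this example only; they are not part of the assignment.

```python
import networkx as nx

# Toy directed graph (hypothetical data, for illustration only).
G_toy = nx.DiGraph([(1, 2), (2, 3), (3, 1), (4, 1), (4, 3)])

ALPHA = 0.85   # damping factor; record every parameter you set
SEED_NODE = 4  # hypothetical seed for personalized PageRank

# Global PageRank: the random surfer restarts uniformly at random.
pr = nx.pagerank(G_toy, alpha=ALPHA)

# Personalized PageRank: every restart jumps back to the seed node.
ppr = nx.pagerank(G_toy, alpha=ALPHA, personalization={SEED_NODE: 1.0})

for node in sorted(G_toy):
    print(f"node {node}: PR={pr[node]:.3f}  PPR(seed={SEED_NODE})={ppr[node]:.3f}")
```

Reporting both score vectors side by side, together with `alpha` and the seed, makes it easier to explain why a node ranks differently under the two settings.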

Ideally, strong submissions should read like a *short research memo* rather than a raw dump of numbers.

## Environment Setup
You may use **Python 3.9+**. Other possible tools are:
- **Graph tools**: You are welcome to use existing tools that are optimised for graph learning. Among them, `networkx` is recommended. `graph-tool` and `igraph` are also good choices for larger graphs.
- **Data processing packages**: e.g., `numpy`, `pandas`
- **Plotting**: `matplotlib`, `plotly`, `seaborn`, etc.
- **Other related tools**: `scipy`, `scikit-learn`, `pytorch`, etc.
- You may choose other tools; just be sure to let us know.

> If your chosen dataset is very large, consider using `graph-tool` or `igraph`, or sampling/induced subgraphs, to stay within reasonable time/memory limits.
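
One way to take such a sample is sketched below. It assumes uniform random node sampling, and the helper name `sample_induced_subgraph` and the synthetic stand-in graph are hypothetical. Uniform node sampling tends to produce sparse, fragmented subgraphs, so treat this as a starting point rather than a recommended sampling scheme.

```python
import random
import networkx as nx

def sample_induced_subgraph(G, n_nodes, seed=42):
    """Return the subgraph induced by a uniform random sample of nodes."""
    rng = random.Random(seed)  # local RNG keeps the sample reproducible
    k = min(n_nodes, G.number_of_nodes())
    nodes = rng.sample(list(G.nodes()), k)
    return G.subgraph(nodes).copy()  # copy() detaches the sample from G

# Synthetic stand-in for a large graph (hypothetical data, not a SNAP dataset).
G_big = nx.fast_gnp_random_graph(1000, 0.01, seed=42, directed=True)
G_small = sample_induced_subgraph(G_big, 200)
print(G_small.number_of_nodes(), G_small.number_of_edges())
```

Only edges with both endpoints in the sample survive, which is why statistics computed on the sample can differ substantially from those of the full graph.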


In [None]:
# Install libraries as needed (uncomment if running on a clean environment)
# %pip install networkx pandas numpy matplotlib scipy scikit-learn
# %pip install python-louvain igraph

import os, io, gzip, zipfile, tarfile, sys, math, random
import numpy as np
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt

# For reproducibility
random.seed(42)
np.random.seed(42)

print(nx.__version__)

# 1. Choose a Dataset
Pick **one** directed dataset from SNAP's **Social networks** section (e.g., `ego-Twitter` or `wiki-Vote`). Paste the **download URL** and briefly describe why you chose it.

- **Dataset name:** _e.g., soc-Slashdot0811_
- **URL:** _direct link to .txt/.gz_
- **Why this dataset?** _1–3 sentences on interest & expected properties_

> ⚠️ Make sure it's **directed**.

Add this citation to your resources cell:
> Jure Leskovec and Andrej Krevl. SNAP Datasets: Stanford Large Network Dataset Collection. http://snap.stanford.edu/data.

In [None]:
# === You may use the following method to download datasets if you like ===
DATA_URL = "https://snap.stanford.edu/data/soc-Slashdot0811.txt.gz"  # Example; replace with your chosen dataset
LOCAL_PATH = "data/raw_graph.txt.gz"

os.makedirs("data", exist_ok=True)

def download_dataset(url: str, to_path: str):
    import urllib.request
    print(f"Downloading from {url} ...")
    urllib.request.urlretrieve(url, to_path)
    size = os.path.getsize(to_path) / (1024*1024)
    print(f"Saved to {to_path} ({size:.2f} MB)")

# Uncomment to download when ready
# download_dataset(DATA_URL, LOCAL_PATH)

# 2. Load & Parse the Directed Graph
First, choose a graph tool and write a few lines about why you chose it and how you are going to use it. If you are not using any tools, please document that as well.

> SNAP directed edge lists are usually in the form `src\tdst`, one edge per line, with comment lines starting with `#`.

Implement a robust loader that:
- Skips comment lines (if there are any)
- Builds a **`networkx.DiGraph`** (if you choose networkx)
- (Optional) Restricts to the **largest weakly connected component (WCC)** for clarity

Write your graph-loading code here:

In [None]:
'''sample code'''
def load_directed_graph(**kwargs) -> nx.DiGraph:
    """Load a directed edge list into a DiGraph.
    Assumes lines like: u<sep>v, with comment lines starting with `comment`.
    If sep is None, split on whitespace.
    """
    graph = nx.DiGraph()  # placeholder so the template runs before you fill it in
    ''' your code here'''
    return graph
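
For orientation, one possible shape for such a loader is sketched below. It is not the required solution. It assumes integer node IDs and one edge per line, and the function name `load_edge_list_sketch` and the `largest_wcc` flag are hypothetical. The demo writes a tiny gzipped edge list to a temporary directory so the sketch can run without downloading anything.

```python
import gzip
import os
import tempfile
import networkx as nx

def load_edge_list_sketch(path, sep=None, comment="#", largest_wcc=False):
    """Parse a plain or gzipped edge list into a DiGraph (integer IDs assumed)."""
    opener = gzip.open if str(path).endswith(".gz") else open
    G = nx.DiGraph()
    with opener(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith(comment):
                continue  # skip blank and comment lines
            parts = line.split(sep)  # sep=None splits on any whitespace
            if len(parts) < 2:
                continue  # skip malformed lines instead of crashing
            G.add_edge(int(parts[0]), int(parts[1]))  # duplicate edges collapse
    if largest_wcc and G.number_of_nodes() > 0:
        biggest = max(nx.weakly_connected_components(G), key=len)
        G = G.subgraph(biggest).copy()
    return G

# Tiny demo file (hypothetical data): a comment header, one duplicate edge,
# and two weakly connected components.
sample = "# Directed graph\n# FromNodeId\tToNodeId\n0\t1\n1\t2\n2\t0\n0\t1\n3\t4\n"
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "toy.txt.gz")
    with gzip.open(demo_path, "wt", encoding="utf-8") as fh:
        fh.write(sample)
    G_demo = load_edge_list_sketch(demo_path)
    G_wcc = load_edge_list_sketch(demo_path, largest_wcc=True)

print(G_demo.number_of_nodes(), G_demo.number_of_edges())
print(G_wcc.number_of_nodes(), G_wcc.number_of_edges())
```

Silently skipping malformed lines is a design choice; counting and reporting them is often the better option on real data.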

# 3. First Look: Basic Structural Statistics
Compute and report at least:
- `|V|` (nodes), `|E|` (edges)
- **Average in/out degree**, **degree distributions** (plot)
- **#SCCs**, size of **largest SCC** and **largest WCC**
- *(Optional): **Density**, **reciprocity***

Add a few sentences interpreting what these numbers suggest about your network.


In [None]:
'''sample code'''
def basic_stats(G: nx.DiGraph) -> pd.DataFrame:
    stats = {}  # e.g. {"nodes": ..., "edges": ...}
    ''' your code here'''
    return pd.DataFrame([stats])

# Example after loading:
# basic_stats(G)
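
As a rough guide, the statistics listed above could be collected along the following lines. This is a sketch under stated assumptions: the graph is non-empty, the function name `basic_stats_sketch` and the column names are hypothetical, and the toy graph is illustrative data only. The degree-distribution plot is omitted.

```python
import networkx as nx
import pandas as pd

def basic_stats_sketch(G):
    """Collect basic structural statistics of a non-empty DiGraph in one row."""
    n, m = G.number_of_nodes(), G.number_of_edges()
    sccs = list(nx.strongly_connected_components(G))
    wccs = list(nx.weakly_connected_components(G))
    stats = {
        "nodes": n,
        "edges": m,
        # Every edge adds one in-degree and one out-degree, so both averages are m/n.
        "avg_in_degree": m / n,
        "avg_out_degree": m / n,
        "num_sccs": len(sccs),
        "largest_scc": max(len(c) for c in sccs),
        "largest_wcc": max(len(c) for c in wccs),
        "density": nx.density(G),          # m / (n * (n - 1)) for directed graphs
        "reciprocity": nx.reciprocity(G),  # share of edges that are reciprocated
    }
    return pd.DataFrame([stats])

# Toy graph (hypothetical data): a mutual pair, a 3-cycle, and a pure source.
G_toy = nx.DiGraph([(0, 1), (1, 0), (1, 2), (2, 3), (3, 1), (4, 0)])
stats_df = basic_stats_sketch(G_toy)
print(stats_df.T)
```

Checking the output against a graph small enough to verify by hand, as here, is one simple way to test this kind of function.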

# 4. Quick Visualization (Exploratory)
Produce a small subgraph visualization to build intuition (e.g., induced subgraph of top-`k` PageRank nodes or a random 500-node sample). For large graphs, you **don't have to** try to plot everything.

- Annotate what you observe (hubs? communities? sources/sinks?)


In [None]:
def visualize_subgraph(G):
    ''' your code here'''
    pass
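
A minimal sketch of the top-`k` PageRank option follows. The function name `draw_top_pagerank`, the node-size scaling, and the synthetic scale-free stand-in graph are assumptions for illustration; substitute your own loaded graph.

```python
import matplotlib.pyplot as plt
import networkx as nx

def draw_top_pagerank(G, k=30, alpha=0.85, seed=42):
    """Draw the subgraph induced by the k highest-PageRank nodes."""
    pr = nx.pagerank(G, alpha=alpha)
    top_nodes = sorted(pr, key=pr.get, reverse=True)[:k]
    H = G.subgraph(top_nodes).copy()

    fig, ax = plt.subplots(figsize=(6, 6))
    pos = nx.spring_layout(H, seed=seed)    # fixed seed for a reproducible layout
    sizes = [20 + 3000 * pr[n] for n in H]  # node area grows with PageRank
    nx.draw_networkx(H, pos=pos, ax=ax, node_size=sizes,
                     with_labels=False, arrowsize=6, width=0.5)
    ax.set_title(f"Top-{k} PageRank nodes (induced subgraph)")
    ax.set_axis_off()
    return fig, H, pr

# Synthetic stand-in graph (hypothetical data, not a SNAP dataset).
G_demo = nx.DiGraph(nx.scale_free_graph(300, seed=42))
G_demo.remove_edges_from(list(nx.selfloop_edges(G_demo)))
fig, H, pr = draw_top_pagerank(G_demo, k=30)
```

Because the subgraph keeps only high-PageRank nodes, it over-represents hubs. A random sample would give a different and usually sparser picture, so it is worth stating which view you chose and why.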



# 5. Research Brief

Prepare a concise research brief summarizing your data mining results and their significance.

Your write-up must include:

1. Results
- Clearly and precisely report your empirical findings.
- Include relevant quantitative results (e.g., metrics, comparisons, trends, error analysis).
- Figures or tables may be used if they improve clarity, but they must be referenced and interpreted in the text.
- You can use markdown formatting (bold, italics, headings, etc.) to help you communicate your findings.
- Do not focus on implementation details unless they are necessary to understand the results.

2. Significance and Interpretation
- Explain why the results matter in a data mining context.
- Discuss what the findings imply about the data, the model(s), or the assumptions made.
- Address limitations, tradeoffs, or unexpected outcomes where relevant.

Clearly state what new insight is gained from your results.
Aim for 5-7 paragraphs in length.

# B [24pts]. Interview Questions

We now pretend this is a real job interview. Here's some guidance on how to answer these questions:

1. Briefly restate the question and state any assumptions you are making.

2. Explain your reasoning out loud, focusing on tradeoffs, limitations, and constraints.

3. As a principle, keep your answers as short and clear as they can be (while still answering the question).

4. Write/speak in a conversational but professional tone (avoid being overly formal). For speaking: speak at a reasonable pace and volume, speak clearly, pause when you need to, and practice making "eye contact" with the camera. Keep a confident, positive, and professional tone. *For additional coaching and practice, the University Writing Center provides individual appointments: https://writingcenter.tamu.edu/make-an-appointment.*

There may not be a single correct answer. We are grading whether your reasoning is reasonable and aware of limitations.


**Rubric**

[8pt] Clear understanding of the question; reasonable assumptions; thoughtful reasoning that acknowledges tradeoffs and limitations; clear, concise communication in a conversational but professional tone (for speaking: clear pace, volume, and articulation).

[4pt] Basic understanding but shallow reasoning or unclear assumptions; communication is somewhat unclear, overly verbose, or overly informal/formal.

[0pt] Minimal, unclear, or incorrect response; poor communication or unprofessional tone.

# 1.
Many real systems can be represented as graphs in multiple ways. How would you decide what the nodes and edges should represent in a given domain, and what kinds of errors can arise from a poor abstraction?
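
To make the abstraction question concrete, the sketch below models the same toy purchase records (hypothetical data) in two ways: as a bipartite user-item graph and as a user-user projection. All names are illustrative, and this is one possible framing, not a model answer.

```python
import networkx as nx
from networkx.algorithms import bipartite

# Hypothetical purchase records: (user, item).
purchases = [("ann", "book"), ("bob", "book"), ("cat", "book"),
             ("cat", "lamp"), ("dan", "lamp")]
users = {u for u, _ in purchases}
items = {i for _, i in purchases}

# Abstraction 1: bipartite graph, where an edge means "user bought item".
B = nx.Graph()
B.add_nodes_from(users, bipartite=0)
B.add_nodes_from(items, bipartite=1)
B.add_edges_from(purchases)

# Abstraction 2: user-user projection, where an edge means "share at least one item".
P = bipartite.projected_graph(B, users)

print("bipartite edges: ", sorted(B.edges()))
print("projection edges:", sorted(P.edges()))
```

In the projection, a single popular item turns all of its buyers into a clique, and the information about which item connects two users is lost. This is one way a poor abstraction can manufacture apparent communities.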

# 2.
Discuss how missing edges, spurious edges, or sampling bias affect centrality-based conclusions. Which measures are most fragile?
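
One way to probe this empirically is to perturb a graph and measure how much each centrality ranking changes. The sketch below removes edges uniformly at random and reports the top-`k` overlap with the unperturbed ranking. The synthetic graph, the helper names, and the removal fractions are assumptions, and real missingness is rarely uniform, so this covers only the simplest case.

```python
import random
import networkx as nx

def top_k(scores, k):
    """Return the set of the k highest-scoring nodes."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])

def drop_random_edges(G, frac, seed):
    """Return a copy of G with a uniform random fraction of edges removed."""
    rng = random.Random(seed)
    edges = list(G.edges())
    H = G.copy()
    H.remove_edges_from(rng.sample(edges, int(frac * len(edges))))
    return H

# Synthetic stand-in graph (hypothetical data, not a SNAP dataset).
G_demo = nx.DiGraph(nx.scale_free_graph(500, seed=1))
K = 20
measures = {
    "in_degree": nx.in_degree_centrality,
    "betweenness": nx.betweenness_centrality,
}
baseline = {name: top_k(fn(G_demo), K) for name, fn in measures.items()}

overlaps = {}
for frac in (0.05, 0.20):
    H = drop_random_edges(G_demo, frac, seed=7)
    for name, fn in measures.items():
        overlaps[(name, frac)] = len(baseline[name] & top_k(fn(H), K)) / K
        print(f"{name:12s} removed={frac:.0%}  top-{K} overlap={overlaps[(name, frac)]:.2f}")
```

Repeating the experiment over several seeds, and with biased rather than uniform removal, would give a more honest picture of how fragile each measure is.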

# 3.
As a video (reminder to keep it brief, 2 minutes max): So, I see you did a graph analysis (referring to this homework). That's cool -- can you walk me through what you did?

# C [4pts]. What new questions do you have?
We want you to think bigger! Tell us what questions and curiosity this homework brings up for you.

**Rubric**

[4pt] Complete, thoughtful response.

[2pt] Partial response.

[0pt] Minimal response.

# 1.
What new questions do you have about graph mining (in general) after this homework? Or, what topics are you curious about now? List at least 3.