# 💻 Advanced AI Mini-Project: Data Mining & Data Cleansing
Welcome! This notebook includes two distinct tasks for your team. Each task is designed to challenge your skills in sourcing and preparing data for AI-driven projects.

---
## 📌 Instructions
- You can divide your group into two sub-teams or work sequentially.
- Each section is marked with goals and deliverables.
- Document your decisions and issues in the markdown cells provided.
- Make sure to comment your code.


## 🛰 Task 1: Data Mining
### Goal:
Find and extract a **small, real-world dataset** related to human behavior, interaction, or accessibility. Source it from a public platform or open API, or scrape it from an online service in a way that respects the site's terms of service, its `robots.txt`, and user privacy.

### Guidelines:
- Choose a source (e.g. Reddit, Wikipedia tables, GitHub, open APIs)
- Extract and store the data using Python (e.g. `requests`, `BeautifulSoup`, or an API client)
- Save a copy of the data locally (as CSV)
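
As a starting point, the extract-and-store steps above might look like the sketch below. It is only an illustration under stated assumptions: `fetch_json`, `save_records_to_csv`, the `User-Agent` string, the file name `raw_data.csv`, and the sample records are all hypothetical, and the standard-library `urllib.request` stands in for `requests` so the example has no extra dependencies. The network helper is defined but not called, so the sketch runs offline.

```python
import csv
import json
import urllib.request


def fetch_json(url, timeout=10):
    """Download and decode a JSON document from `url` (hypothetical helper)."""
    request = urllib.request.Request(
        url,
        # Identify your script; the value here is only a placeholder.
        headers={"User-Agent": "course-mini-project/0.1 (educational use)"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def save_records_to_csv(records, path):
    """Write a list of flat dicts to CSV and return the column order used."""
    # Collect the union of keys so records with missing fields still fit.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(records)
    return fieldnames


# Made-up records standing in for an API response.
sample_records = [
    {"id": 1, "title": "First post", "score": 12},
    {"id": 2, "title": "Second post"},  # "score" is missing on purpose
]
columns = save_records_to_csv(sample_records, "raw_data.csv")
```

Keeping the raw CSV untouched and doing all cleaning on a copy makes it easier to explain later which values came from the source and which were changed by your team.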

### Deliverables:
- A working script that gathers the data
- A short markdown summary explaining what the data is, how it was retrieved, and ethical considerations
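
One concrete ethical check you could mention in the summary is whether the site's `robots.txt` permits fetching the pages you scrape. The sketch below uses the standard-library `urllib.robotparser`; the helper `is_allowed`, the user agent `course-bot`, the example rules, and the `example.org` URLs are assumptions made for illustration. In practice you would load the real file from the site you are scraping, and passing this check does not replace reading the site's terms of service or the data's license.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets `user_agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up rules; a real file would come from https://<site>/robots.txt.
example_robots = """\
User-agent: *
Disallow: /private/
"""

allowed_public = is_allowed(
    example_robots, "course-bot", "https://example.org/wiki/Page"
)
allowed_private = is_allowed(
    example_robots, "course-bot", "https://example.org/private/data"
)
```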


In [None]:
# ✏️ Code your data mining pipeline here
# Example: scraping a table from Wikipedia or using an open API
# Make sure to document your process clearly


### 🧠 Reflection (fill this in):
- What challenges did you face during data retrieval?
- How did you address licensing and ethical implications?
- What could go wrong when using this data in a real AI system?


## 🧼 Task 2: Data Cleansing and Profiling
### Goal:
Perform an in-depth cleaning and exploration of the dataset your team collected in Task 1.

### Required steps:
- Handle missing values with a justified strategy (e.g. imputation or an explicit "missing" flag) rather than simply dropping rows
- Normalize column formats and types
- Identify outliers and inconsistencies
- Visualize distributions of at least two variables
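
The first three steps above can be sketched with pandas as follows. This is a minimal illustration, not a recommended recipe: the toy columns (`age`, `country`, `signup_date`), the median imputation, the `UNKNOWN` placeholder, and the 1.5 × IQR outlier rule are all assumptions, and the right choices depend on your own dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the CSV collected in Task 1.
raw = pd.DataFrame(
    {
        "age": ["34", "29", None, "41", "230"],
        "country": [" de", "DE", "fr ", None, "FR"],
        "signup_date": ["2023-01-05", "2023-02-05", "not a date", "2023-03-20", None],
    }
)
df = raw.copy()  # keep the raw data untouched

# Normalize types: unparseable entries become NaN / NaT instead of raising.
df["age"] = pd.to_numeric(df["age"], errors="coerce")
df["signup_date"] = pd.to_datetime(
    df["signup_date"], format="%Y-%m-%d", errors="coerce"
)

# Normalize text formats: trim whitespace and unify case.
df["country"] = df["country"].str.strip().str.upper()

# Remember which values were missing before anything is filled in.
df["age_was_missing"] = df["age"].isna()

# Flag (do not delete) outliers with the 1.5 * IQR rule.
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
df["age_is_outlier"] = (df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)

# Impute missing ages with the median of the non-outlier values.
median_age = df.loc[~df["age_is_outlier"], "age"].median()
df["age"] = df["age"].fillna(median_age)

# Make missing categories explicit instead of dropping the rows.
df["country"] = df["country"].fillna("UNKNOWN")
```

Flagging outliers and missing values in separate columns, rather than overwriting or deleting them silently, keeps the trade-offs visible for the reflection questions at the end of this task.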

### Bonus:
- Try basic feature engineering (e.g. transforming or combining columns)
- Check for bias if the dataset contains personal attributes (e.g. gender, age), for instance by comparing how groups are represented and treated in the data
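
Both bonus items can be approached with very little code. The sketch below bins a numeric column into a categorical feature and then compares group representation and outcome rates as a first, rough bias check. The data, the column names (`gender`, `age`, `approved`), and the bin edges are made up for illustration, and a gap in rates is only a signal to investigate, not proof of bias.

```python
import pandas as pd

# Hypothetical toy data; replace with your cleaned dataset.
people = pd.DataFrame(
    {
        "gender": ["f", "m", "f", "m", "m", "f"],
        "age": [23, 35, 51, 44, 29, 38],
        "approved": [1, 1, 0, 1, 0, 0],
    }
)

# Feature engineering: turn a numeric column into labelled age bands.
# Bins are right-inclusive: (0, 30], (30, 45], (45, 120].
people["age_group"] = pd.cut(
    people["age"], bins=[0, 30, 45, 120], labels=["<=30", "31-45", "46+"]
)

# Representation: what share of the rows does each group make up?
group_share = people["gender"].value_counts(normalize=True)

# Outcome rate per group, and the gap between the best and worst group.
approval_rate = people.groupby("gender")["approved"].mean()
rate_gap = approval_rate.max() - approval_rate.min()
```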


In [None]:
# ✏️ Load your dataset and perform profiling
# Clean and prepare the data for ML use
# Use visualizations to understand patterns and quality issues


### 🧠 Reflection (fill this in):
- Which issues were most time-consuming?
- What trade-offs did you make when cleaning the data?
- How could bad preprocessing affect downstream AI decisions?
