# ARTI 308 - Machine Learning
## Lab 2: Identifying ML Problems, Selecting Open Datasets, and Drawing a Methodology Diagram

**Dataset:** Student Performance Dataset (UCI Machine Learning Repository)

**Dataset Source:** https://archive.ics.uci.edu/dataset/320/student+performance

---
## Part 1: Dataset Description

The **Student Performance Dataset** contains data about student achievement in secondary education from two Portuguese schools. The dataset includes student grades, demographic information, social factors, and school-related features collected through school reports and questionnaires.

### Key Information:
- **Source:** UCI Machine Learning Repository
- **Number of Instances:** 395 students (mathematics course file, `student-mat.csv`)
- **Number of Features:** 30 input attributes plus 3 grade columns (G1, G2, G3), 33 columns in total
- **Target Variable:** G3 (final grade, ranging from 0 to 20)

---
## Part 2: Machine Learning Problem Definition

### Problem Type: **Regression** (can also be framed as Classification)

### Target Variable: **G3** (Final Grade)
- G3 is the final year grade issued at the 3rd period
- It is a numerical value ranging from 0 to 20

### Problem Statement:
**"Given a student's demographic information, family background, social factors, and school-related attributes, predict their final grade (G3) in the course."**

### Why is this a Regression Problem?
- The target variable (G3) is **numerical**: an integer grade from 0 to 20 that is ordered and can be treated as continuous
- We want to **predict a specific grade value**, not just a category
- The model will learn patterns from input features to estimate the final grade

### Alternative: Classification Approach
This could also be framed as a **classification** problem by converting grades to categories:
- Pass (G3 >= 10) vs Fail (G3 < 10) → Binary Classification
- Grade levels (A, B, C, D, F) → Multi-class Classification
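
As a minimal sketch of the two framings above, the snippet below derives both label types from a small, made-up series of grades. The pass threshold of 10 comes from the list above; the A-F bin edges are an assumption for illustration only, since the lab does not define them.

```python
import pandas as pd

# Hypothetical sample of final grades; in the lab these would come from df["G3"].
g3 = pd.Series([0, 5, 9, 10, 13, 15, 18, 20], name="G3")

# Binary target: pass if G3 >= 10, otherwise fail.
pass_fail = (g3 >= 10).map({True: "pass", False: "fail"})

# Multi-class target. These bin edges are assumed, not taken from the dataset.
# pd.cut uses right-closed intervals: (-1, 9], (9, 11], (11, 13], (13, 15], (15, 20].
grade_bins = [-1, 9, 11, 13, 15, 20]
grade_labels = ["F", "D", "C", "B", "A"]
grade_level = pd.cut(g3, bins=grade_bins, labels=grade_labels)

print(pd.DataFrame({"G3": g3, "pass_fail": pass_fail, "grade_level": grade_level}))
```

Either derived column could replace G3 as the target if the classification framing is chosen.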

---
## Part 3: Loading and Inspecting the Dataset in Python

In [1]:
# Import required libraries
import pandas as pd

In [None]:
# Load the dataset (the UCI CSV files use ';' as the field separator)
df = pd.read_csv('student+performance/student/student-mat.csv', sep=';')

print("Dataset loaded successfully!")

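
The UCI distribution of this dataset stores its CSV files with semicolons between fields, so the delimiter passed to `pd.read_csv` matters. The sketch below uses a tiny in-memory stand-in for the file (the rows are made up) to show how the default comma separator collapses everything into one column, while `sep=";"` recovers the expected columns.

```python
import io
import pandas as pd

# Tiny stand-in for student-mat.csv: hypothetical rows, same ';' layout.
sample_csv = io.StringIO(
    "school;sex;age;G1;G2;G3\n"
    '"GP";"F";18;5;6;6\n'
    '"MS";"M";17;14;15;15\n'
)

# Default separator is ',' so each line is read as a single field.
df_default = pd.read_csv(sample_csv)

# Rewind the buffer and read again with the semicolon separator.
sample_csv.seek(0)
df_semicolon = pd.read_csv(sample_csv, sep=";")

print("Default separator:", df_default.shape)
print("Semicolon separator:", df_semicolon.shape)
```

A single-column shape after loading is a quick sign that the separator is wrong.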

### 3.1 Display Dataset Shape

In [None]:
# Display the shape of the dataset (rows, columns)
print(f"Dataset Shape: {df.shape}")
print(f"Number of Rows (Students): {df.shape[0]}")
print(f"Number of Columns (Attributes + Grades): {df.shape[1]}")

### 3.2 Preview the First Few Rows

In [None]:
# Display the first 5 rows of the dataset
df.head()

### 3.3 Check Column Names

In [None]:
# Display all column names
print("Column Names:")
print(list(df.columns))

### 3.4 Check Data Types

In [None]:
# Display data types and non-null counts for each column
df.info()
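
Beyond reading the `df.info()` output by eye, it can help to split the columns by type programmatically, since text-valued columns usually need encoding before modeling. The following is a minimal sketch on a made-up three-row sample that reuses a few of the dataset's column names; with the real data, `sample` would be replaced by `df`.

```python
import pandas as pd

# Hypothetical three-row sample with a few of the dataset's columns.
sample = pd.DataFrame({
    "school": ["GP", "GP", "MS"],
    "sex": ["F", "M", "F"],
    "age": [18, 17, 16],
    "internet": ["yes", "no", "yes"],
    "absences": [6, 4, 10],
    "G3": [6, 15, 11],
})

# Numeric columns (integers here) versus everything else (text categories).
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

# Total count of missing values across the whole frame.
missing_total = int(sample.isna().sum().sum())

print("Numeric:", numeric_cols)
print("Categorical:", categorical_cols)
print("Missing values:", missing_total)
```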

### 3.5 Basic Statistics (Optional)

In [None]:
# Display basic statistics for numerical columns
df.describe()

### 3.6 Target Variable Distribution

In [None]:
# Check the distribution of the target variable (G3 - Final Grade)
print("Target Variable (G3) Statistics:")
print(f"Mean: {df['G3'].mean():.2f}")
print(f"Median: {df['G3'].median():.2f}")
print(f"Min: {df['G3'].min()}")
print(f"Max: {df['G3'].max()}")
print(f"Standard Deviation: {df['G3'].std():.2f}")
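
Summary statistics can hide the shape of a distribution, for instance a cluster of grades at a single value. As a rough sketch, the snippet below counts how often each grade occurs and computes the share of passing grades (G3 >= 10). The grades are invented for illustration; with the real data, `g3` would be `df['G3']`.

```python
import pandas as pd

# Hypothetical grades for illustration only.
g3 = pd.Series([0, 0, 8, 10, 10, 11, 12, 15, 15, 18], name="G3")

# Frequency of each distinct grade, ordered by grade value.
grade_counts = g3.value_counts().sort_index()

# Fraction of students with a grade of exactly 0, and fraction who pass.
zero_share = (g3 == 0).mean()
pass_share = (g3 >= 10).mean()

print(grade_counts)
print(f"Share of zeros: {zero_share:.2f}")
print(f"Share passing:  {pass_share:.2f}")
```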

---
## Feature Descriptions

| Feature | Description |
|---------|-------------|
| school | Student's school (GP or MS) |
| sex | Student's sex (F or M) |
| age | Student's age (15-22) |
| address | Home address type (U=Urban, R=Rural) |
| famsize | Family size (LE3 = 3 or fewer, GT3 = more than 3) |
| Pstatus | Parents' cohabitation status (T=Together, A=Apart) |
| Medu | Mother's education (0-4) |
| Fedu | Father's education (0-4) |
| Mjob | Mother's job |
| Fjob | Father's job |
| reason | Reason to choose this school |
| guardian | Student's guardian |
| traveltime | Home to school travel time (1-4) |
| studytime | Weekly study time (1-4) |
| failures | Number of past class failures |
| schoolsup | Extra educational support |
| famsup | Family educational support |
| paid | Extra paid classes |
| activities | Extra-curricular activities |
| nursery | Attended nursery school |
| higher | Wants to take higher education |
| internet | Internet access at home |
| romantic | In a romantic relationship |
| famrel | Quality of family relationships (1-5) |
| freetime | Free time after school (1-5) |
| goout | Going out with friends (1-5) |
| Dalc | Workday alcohol consumption (1-5) |
| Walc | Weekend alcohol consumption (1-5) |
| health | Current health status (1-5) |
| absences | Number of school absences |
| G1 | First period grade (0-20) |
| G2 | Second period grade (0-20) |
| **G3** | **Final grade (0-20) - TARGET** |
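
To connect the table to the problem definition, the sketch below separates the input features from the target G3. Whether the earlier period grades G1 and G2 count as inputs is a design choice the lab does not settle, so the hypothetical helper `split_features_target` exposes it as a flag. The sample rows are made up.

```python
import pandas as pd

# Hypothetical two-row sample with a subset of the dataset's columns.
sample = pd.DataFrame({
    "school": ["GP", "MS"],
    "studytime": [2, 3],
    "absences": [6, 4],
    "G1": [5, 14],
    "G2": [6, 15],
    "G3": [6, 15],
})

TARGET = "G3"
PERIOD_GRADES = ["G1", "G2"]


def split_features_target(df, include_period_grades=False):
    """Return (X, y); G1 and G2 are dropped from X unless requested."""
    drop_cols = [TARGET] if include_period_grades else [TARGET] + PERIOD_GRADES
    X = df.drop(columns=drop_cols)
    y = df[TARGET]
    return X, y


X, y = split_features_target(sample)
X_with_grades, _ = split_features_target(sample, include_period_grades=True)

print(list(X.columns))
print(list(X_with_grades.columns))
```

Excluding G1 and G2 matches the problem statement, which lists only demographic, family, social, and school-related inputs; including them is an alternative setup for the same target.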

---
## Summary

In this lab, we:
1. Selected the **Student Performance** dataset from the UCI Machine Learning Repository
2. Identified this as a **Regression** problem (predicting continuous final grade)
3. Defined the **target variable** as G3 (final grade)
4. Loaded and inspected the dataset using Python/Pandas
5. Created a methodology diagram showing the ML workflow

The methodology diagram is saved separately as `methodology_diagram.png`.