# Web & Data Science – Data Mining Assignment

**Instructions for students:**
- This notebook provides the *structure* of the assignment.
- Cells marked with `# TODO` are for **you** to complete.
- You may add extra cells as needed (for checks, plots, notes, etc.).
- Make sure your final notebook is clean and readable (remove debugging prints).

**Required files (place in the same folder as this notebook):**
- `world_happiness.csv` – World Happiness Report subset
- `economic_freedom.csv` – Economic Freedom of the World dataset (Fraser Institute subset)
  

## 0. Setup & Imports

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

pd.set_option('display.max_columns', None)
sns.set()
print('Libraries imported.')

Libraries imported.


## 1. Load the Datasets
In this section, you will:
- Load the **World Happiness** dataset
- Load the **Economic Freedom** dataset
- Inspect their basic structure

In [None]:
# TODO: Load the datasets from CSV files

happiness_path = 'world_happiness.csv'   # adjust if needed
freedom_path = 'economic_freedom.csv'    # adjust if needed

# YOUR CODE HERE


### 1.1 Basic Inspection
Answer questions like:
- What are the column names?
- How many rows/columns does each dataset have?
- Are there obvious issues (weird values, extra columns)?
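
As a rough illustration of the kind of checks meant here, the sketch below loads a tiny made-up table and prints its basic structure. The inline CSV text, its column names, and the name `demo_df` are assumptions for demonstration only; with the real files you would pass the path (e.g. `happiness_path`) to `pd.read_csv` instead.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV file; names and values are made up.
csv_text = """Country,Score,GDP_per_capita
Country A,7.1,1.30
Country B,5.4,0.95
Country C,4.2,
"""

# With the real file this would be pd.read_csv('world_happiness.csv').
demo_df = pd.read_csv(io.StringIO(csv_text))

print(demo_df.shape)             # (number of rows, number of columns)
print(demo_df.columns.tolist())  # column names
print(demo_df.dtypes)            # inferred type of each column
demo_df.info()                   # non-null counts hint at missing data
print(demo_df.describe())        # summary statistics for numeric columns
```

Comparing the non-null counts from `.info()` with the row count is a quick way to spot columns with missing entries before moving on to Task 1.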

In [None]:
# TODO: Inspect the basic structure of both dataframes
# Use .info(), .describe(), .head(), .dtypes

# YOUR CODE HERE


---
# TASK 1 – Exploration & Cleaning (CRISP-DM: Data Understanding & Preparation)

In this task, you will:
1. Identify attribute types (numeric, categorical, etc.)
2. Explore data quality (missing values, outliers, inconsistencies)
3. Handle missing values
4. Handle outliers


## 1.1 Attribute Types
- Identify **numeric** and **categorical** attributes for each dataset.
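
One possible starting point is to split columns by their pandas dtype, as in the sketch below. The toy dataframe and its column names are assumptions, not the real datasets. Note that dtype is only a first guess: a numeric-looking column such as a rank or a year may be better treated as ordinal or categorical.

```python
import pandas as pd

# Made-up example data; the real datasets will have different columns.
demo_df = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Region": ["North", "South", "North"],
    "Score": [7.1, 5.4, 4.2],
    "Rank": [1, 2, 3],
})

# Columns pandas stores as numbers vs. everything else.
numeric_cols = demo_df.select_dtypes(include="number").columns.tolist()
categorical_cols = demo_df.select_dtypes(exclude="number").columns.tolist()

print("Numeric:", numeric_cols)
print("Categorical:", categorical_cols)
```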
  

In [None]:
# TODO: Inspect dtypes and decide attribute types (numeric or categorical) for both datasets

# YOUR CODE HERE
# Example logic to separate numeric vs non-numeric


## 1.2 & 1.3 Missing Values and Handling Them
Refer to the slides on **Missing data – what can be done?** when choosing a strategy.

Tasks:
- Count missing values per column
- Decide how to handle them (drop rows, drop columns, or impute)
- Apply your chosen strategy and justify it
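
The steps above could look roughly like the following sketch, which uses median imputation for numeric columns and mode imputation for categorical ones. This is only one of several defensible strategies, and the toy data and the names `demo_df` and `demo_clean` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value in each of two columns.
demo_df = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Region": ["North", None, "North", "South"],
    "Score": [7.0, np.nan, 5.0, 4.0],
})

# Count missing values per column before any handling.
missing_before = demo_df.isna().sum()

demo_clean = demo_df.copy()
num_cols = demo_clean.select_dtypes(include="number").columns
cat_cols = demo_clean.select_dtypes(exclude="number").columns

# Numeric columns: fill gaps with the column median.
demo_clean[num_cols] = demo_clean[num_cols].fillna(demo_clean[num_cols].median())

# Categorical columns: fill gaps with the most frequent value.
for col in cat_cols:
    demo_clean[col] = demo_clean[col].fillna(demo_clean[col].mode().iloc[0])

# Count again to confirm nothing is left.
missing_after = demo_clean.isna().sum()
print(missing_before, missing_after, sep="\n\n")
```

The median is less sensitive to extreme values than the mean, which is one argument for it when a column is skewed. Dropping rows instead is simpler but shrinks the dataset, so the justification should weigh both.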

In [None]:
# TODO: Explore missing values in both datasets

# YOUR CODE HERE


In [None]:
# TODO: Handle missing values
# Choose and justify your strategy (e.g., mean/median imputation, dropping rows/columns)
# You may want to create cleaned copies: happiness_clean, freedom_clean

# YOUR CODE HERE
happiness_clean = happiness_df.copy()
freedom_clean = freedom_df.copy()

# Strategy used here:
# For numeric columns: impute with median
# For categorical columns: impute with mode

# Numeric median imputation


# Categorical mode imputation


# Alternatively: drop rows with missing values entirely (if you choose this strategy instead)



In [None]:
# TODO: Re-check missing values in both datasets AFTER handling them

# YOUR CODE HERE


## 1.4 Outlier Detection

Tasks:
- Choose a few key numeric variables
- Use IQR or ±2 standard deviations to detect outliers
- Remove or clamp them (justify your choice)
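
A minimal sketch of the IQR approach on a single made-up column is shown below. It assumes the conventional 1.5 × IQR fences, which is a common default rather than a requirement of the assignment, and shows both removal and clamping so the two can be compared.

```python
import pandas as pd

# Made-up values; 20.0 is deliberately far from the rest.
values = pd.Series([4.0, 5.0, 5.5, 6.0, 6.5, 7.0, 20.0], name="Score")

# Interquartile range and the usual 1.5 * IQR fences.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Flag points outside the fences.
is_outlier = (values < lower) | (values > upper)

# Option 1: remove the flagged points.
removed = values[~is_outlier]

# Option 2: clamp them to the nearest fence instead.
clamped = values.clip(lower=lower, upper=upper)

print(f"fences: [{lower}, {upper}], outliers found: {is_outlier.sum()}")
```

Removal discards the whole country row, while clamping keeps the row but distorts the value, so the choice depends on whether the extreme value looks like an error or a genuine observation.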

In [None]:
# TODO: Detect outliers for selected numeric columns
# Hint: Use boxplots and/or IQR calculations

# YOUR CODE HERE


In [None]:
# TODO: Apply your chosen outlier handling strategy
# You can create new dataframes without outliers if you want

# YOUR CODE HERE
happiness_no_outliers = happiness_clean.copy()  # start from the cleaned data
freedom_no_outliers = freedom_clean.copy()

# Example outline (fill in logic):


# TODO: Re-inspect both datasets AFTER outlier handling (e.g., row counts, boxplots)


---
# TASK 2 – Apply Data Mining Methods (Modelling & Evaluation)

Now that the data is cleaned and prepared, you will:
- Merge the two datasets
- Perform correlation analysis
- Apply unsupervised learning (clustering)
- Optionally apply a simple supervised method (classification)


## 2.1 Merge Datasets
Merge the cleaned Happiness and Economic Freedom dataframes on `Country`.
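
As an illustration, the sketch below merges two small made-up dataframes on `Country` and then checks which countries failed to match. The data and the `*_demo` names are assumptions. An inner join is used here as one reasonable choice; mismatched country spellings between the two sources are a common reason for lost rows.

```python
import pandas as pd

# Made-up stand-ins for the two cleaned dataframes.
happiness_demo = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Score": [7.0, 5.5, 4.0],
})
freedom_demo = pd.DataFrame({
    "Country": ["A", "B", "D"],
    "EconomicFreedomIndex": [8.1, 6.9, 5.2],
})

# Inner join: keep only countries present in both tables.
merged_demo = pd.merge(happiness_demo, freedom_demo, on="Country", how="inner")

# Outer join with an indicator column to see what did not match.
check = pd.merge(happiness_demo, freedom_demo, on="Country",
                 how="outer", indicator=True)
unmatched = check.loc[check["_merge"] != "both", "Country"].tolist()

print(merged_demo)
print("Unmatched countries:", unmatched)
```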


In [None]:
# TODO: Merge the two datasets

# YOUR CODE HERE



## 2.2 Correlation Analysis
Use the merged dataset to:
- Compute a correlation matrix for selected numeric features
- Visualise it using a heatmap
- Interpret some of the strongest positive/negative correlations
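
One way to approach this is sketched below on made-up numbers: compute the Pearson correlation matrix, then list each pair of features once, sorted by correlation, so the strongest relationships are easy to pick out. The values, the `Corruption` column, and the name `merged_demo` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up merged data; the real columns and values will differ.
merged_demo = pd.DataFrame({
    "Score": [7.0, 6.0, 5.0, 4.0, 3.0],
    "GDP_per_capita": [1.4, 1.2, 1.0, 0.7, 0.5],
    "Corruption": [0.1, 0.2, 0.35, 0.4, 0.6],
})

# Pearson correlation between the numeric columns.
corr = merged_demo.select_dtypes(include="number").corr()

# Keep only the upper triangle so each pair appears once,
# then flatten to a Series sorted from most negative to most positive.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values()

print(corr.round(2))
print(pairs.round(2))
```

For the heatmap itself, `sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1)` is a typical call. Keep in mind that a strong correlation does not by itself show that one variable causes the other.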

In [None]:
# TODO: Compute and visualise correlation matrix

# YOUR CODE HERE


## 2.3 Clustering (Unsupervised Learning)

Tasks:
- Select a subset of numeric features (e.g., Score, GDP_per_capita, EconomicFreedomIndex, LifeExpectancy)
- Normalise them (if not already done)
- Apply K-Means with k=3 and k=5
- Visualise the clusters (e.g., Score vs GDP, coloured by cluster)
- Interpret the clusters
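
The scaling and clustering steps above might be sketched as follows. The synthetic points (three artificial groups), the two chosen features, and the settings `n_init=10` and `random_state=42` are assumptions made so the example is reproducible; they are not prescribed by the assignment.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Synthetic data: 10 noisy points around each of three made-up centres.
rng = np.random.default_rng(0)
centres = np.array([[3.0, 0.4], [5.0, 0.9], [7.0, 1.4]])
points = np.vstack([c + rng.normal(scale=0.1, size=(10, 2)) for c in centres])
demo_df = pd.DataFrame(points, columns=["Score", "GDP_per_capita"])

# Scale features to [0, 1] so no single feature dominates the distances.
X = MinMaxScaler().fit_transform(demo_df)

# Fit K-Means for both requested values of k and store the labels.
for k in (3, 5):
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    demo_df[f"cluster_k{k}"] = model.fit_predict(X)
    print(f"k={k}: inertia={model.inertia_:.3f}")

# Cluster sizes for k=3.
print(demo_df["cluster_k3"].value_counts())
```

Inertia always decreases as k grows, so a lower value for k=5 does not on its own mean the clustering is better. A scatter plot such as `plt.scatter(demo_df['GDP_per_capita'], demo_df['Score'], c=demo_df['cluster_k3'])` helps judge whether the groups are meaningful.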

In [None]:
# TODO: Apply K-Means clustering

from sklearn.cluster import KMeans

# YOUR CODE HERE


In [None]:
# TODO: Visualise clusters (e.g., Score vs GDP, coloured by cluster)

# YOUR CODE HERE
x_col = ''  # e.g., use column names from the dataset
y_col = ''  # e.g., use column names from the dataset



✏️ **Short Answer:**
- How would you describe each cluster (high GDP & high happiness, etc.)?
- Do clusters correspond to specific regions/continents?


*(Write your answer here.)*