# Web & Data Science – Data Mining Assignment

**Instructions for students:**
- This notebook provides the *structure* of the assignment.
- Cells marked with `# TODO` are for **you** to complete.
- You may add extra cells as needed (for checks, plots, notes, etc.).
- Make sure your final notebook is clean and readable (remove debugging prints).

**Required files (place in the same folder as this notebook):**
- `world_happiness.csv` – World Happiness Report subset
- `economic_freedom.csv` – Economic Freedom of the World dataset (Fraser Institute subset)
  

## 0. Setup & Imports

In [44]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

pd.set_option('display.max_columns', None)
sns.set()
print('Libraries imported.')

Libraries imported.


## 1. Load the Datasets
In this section, you will:
- Load the **World Happiness** dataset
- Load the **Economic Freedom** dataset
- Inspect their basic structure

In [45]:
# TODO: Load the datasets from CSV files

happiness_path = 'world_happiness.csv'   # adjust if needed
freedom_path = 'economic_freedom.csv'    # adjust if needed

# YOUR CODE HERE
df_happiness = pd.read_csv(f"./assignment_support/{happiness_path}")
df_happiness.head()

Unnamed: 0,Overall rank,Country or region,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption
0,154,Afghanistan,3.203,0.35,,0.361,,0.158,0.025
1,107,Albania,4.719,0.947,0.848,0.874,0.383,0.178,0.027
2,88,Algeria,5.211,1.002,1.16,0.785,0.086,0.073,0.114
3,47,Argentina,6.086,1.092,1.432,0.881,0.471,0.066,0.05
4,116,Armenia,4.559,0.85,1.055,0.815,0.283,0.095,0.064


In [46]:
df_freedom = pd.read_csv(f"./assignment_support/{freedom_path}")
df_freedom.head()

Unnamed: 0,Year,ISO_Code,Countries,Economic Freedom Summary Index,Rank,Size of Government,Legal System & Property Rights,Sound Money,Freedom to trade internationally,Regulation
0,2019,AGO,Angola,4.99,157,,3.46,3.71,5.52,4.5
1,2019,ALB,Albania,7.72,33,7.64,5.52,9.76,8.53,7.17
2,2019,ARE,United Arab Emirates,7.05,72,5.4,5.81,8.93,8.49,6.63
3,2019,ARG,Argentina,5.3,153,6.34,5.46,3.37,5.85,5.48
4,2019,ARM,Armenia,7.51,47,7.71,5.86,9.34,8.11,6.54


### 1.1 Basic Inspection
Answer questions like:
- What are the column names?
- How many rows/columns does each dataset have?
- Are there obvious issues (weird values, extra columns)?

In [9]:
# TODO: Inspect the basic structure of both dataframes
# Use .info(), .describe(), .head(), .dtypes

# YOUR CODE HERE

# What are the column names?
df_happiness.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 155 entries, 0 to 154
Data columns (total 9 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Overall rank                  155 non-null    int64  
 1   Country or region             155 non-null    object 
 2   Score                         155 non-null    float64
 3   GDP per capita                154 non-null    float64
 4   Social support                153 non-null    float64
 5   Healthy life expectancy       154 non-null    float64
 6   Freedom to make life choices  154 non-null    float64
 7   Generosity                    155 non-null    float64
 8   Perceptions of corruption     154 non-null    float64
dtypes: float64(7), int64(1), object(1)
memory usage: 11.0+ KB


In [14]:
df_freedom.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 165 entries, 0 to 164
Data columns (total 10 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   Year                              165 non-null    int64  
 1   ISO_Code                          165 non-null    object 
 2   Countries                         165 non-null    object 
 3   Economic Freedom Summary Index    164 non-null    float64
 4   Rank                              165 non-null    int64  
 5   Size of Government                164 non-null    float64
 6   Legal System & Property Rights    164 non-null    float64
 7   Sound Money                       164 non-null    float64
 8   Freedom to trade internationally  164 non-null    float64
 9   Regulation                        165 non-null    float64
dtypes: float64(6), int64(2), object(2)
memory usage: 13.0+ KB


In [11]:
# How many rows/columns does each dataset have?
df_happiness.shape, df_freedom.shape

((155, 9), (165, 10))

In [12]:
# Are there obvious issues (weird values, extra columns)?
df_happiness.describe()

Unnamed: 0,Overall rank,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption
count,155.0,155.0,154.0,153.0,154.0,154.0,155.0,154.0
mean,78.935484,5.395348,0.899753,1.211667,0.724636,0.394052,0.183897,0.109721
std,44.994181,1.106983,0.398083,0.295503,0.241765,0.14005,0.09482,0.093972
min,1.0,2.853,0.0,0.0,0.0,0.01,0.0,0.0
25%,40.5,4.541,0.58625,1.056,0.552,0.3095,0.1085,0.047
50%,79.0,5.373,0.96,1.274,0.789,0.417,0.177,0.0855
75%,117.5,6.178,1.221,1.452,0.881,0.50475,0.2465,0.14075
max,156.0,7.769,1.684,1.624,1.141,0.631,0.566,0.453


In [13]:
df_freedom.describe()

Unnamed: 0,Year,Economic Freedom Summary Index,Rank,Size of Government,Legal System & Property Rights,Sound Money,Freedom to trade internationally,Regulation
count,165.0,164.0,165.0,164.0,164.0,164.0,164.0,165.0
mean,2019.0,6.753476,82.769697,6.749329,5.294878,8.1375,7.168049,6.376727
std,0.0,1.050573,47.783126,1.119278,1.716158,1.627281,1.446906,1.114888
min,2019.0,2.56,1.0,3.7,1.69,0.74,1.75,2.27
25%,2019.0,6.1675,42.0,6.0575,4.09,7.31,6.3325,5.78
50%,2019.0,6.85,83.0,6.685,5.11,8.625,7.24,6.59
75%,2019.0,7.5925,123.0,7.5675,6.355,9.33,8.32,7.1
max,2019.0,8.94,165.0,9.18,8.93,9.86,9.66,9.11


---
# TASK 1 – Exploration & Cleaning (CRISP-DM: Data Understanding & Preparation)

In this task, you will:
1. Identify attribute types (numeric, categorical, etc.)
2. Explore data quality (missing values, outliers, inconsistencies)
3. Handle missing values
4. Handle outliers


## 1.1 Attribute Types
- Identify **numeric** and **categorical** attributes for each dataset.
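
One possible way to split the attributes is `select_dtypes`. The sketch below runs on a small made-up frame (the values are illustrative and not read from the CSV files). Keep in mind that a column such as a rank or a year is numeric by dtype, yet may be better described as ordinal or as a constant.

```python
import pandas as pd

# Toy frame with made-up values that mimics the kinds of columns in the assignment data
toy = pd.DataFrame({
    "Country or region": ["A", "B", "C"],
    "Overall rank": [1, 2, 3],
    "Score": [7.1, 6.4, 5.2],
})

# Columns with a numeric dtype (int or float)
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
# Everything else (here: text columns) is treated as categorical
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
```

The same two calls can be applied to `df_happiness` and `df_freedom` once they are loaded.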
  

In [16]:
# TODO: Inspect dtypes and decide attribute types (numeric or categorical) for both datasets

# YOUR CODE HERE
# Example logic to separate numeric vs non-numeric
df_happiness.dtypes

Overall rank                      int64
Country or region                object
Score                           float64
GDP per capita                  float64
Social support                  float64
Healthy life expectancy         float64
Freedom to make life choices    float64
Generosity                      float64
Perceptions of corruption       float64
dtype: object

In [18]:
df_freedom.dtypes

Year                                  int64
ISO_Code                             object
Countries                            object
Economic Freedom Summary Index      float64
Rank                                  int64
Size of Government                  float64
Legal System & Property Rights      float64
Sound Money                         float64
Freedom to trade internationally    float64
Regulation                          float64
dtype: object

## 1.2 & 1.3 Missing Values and Handling Them
Use the slides on **Missing data – what can be done?**

Tasks:
- Count missing values per column
- Decide how to handle them (drop rows, drop columns, or impute)
- Apply your chosen strategy and justify it
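
To make the options concrete, here is a minimal sketch on a made-up four-row frame (not the assignment data) that compares dropping incomplete rows with median imputation. Which one is appropriate depends on how many values are missing and on how much data you can afford to lose.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Score": [7.0, 5.0, 6.0, 4.0],
    "GDP per capita": [1.2, np.nan, 0.8, 1.0],
})

# Count missing values per column
print(toy.isna().sum())

# Option 1: drop every row that contains at least one missing value
dropped = toy.dropna()

# Option 2: fill numeric gaps with the column median (less sensitive to extremes than the mean)
imputed = toy.copy()
num_cols = imputed.select_dtypes(include="number").columns
imputed[num_cols] = imputed[num_cols].fillna(imputed[num_cols].median())

print(len(toy), len(dropped))
print(imputed)
```

Dropping loses country B entirely, while imputation keeps it at the cost of inventing a plausible value.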

In [47]:
# TODO: Explore missing values in both datasets

# YOUR CODE HERE
df_happiness.isnull().sum()

Overall rank                    0
Country or region               0
Score                           0
GDP per capita                  1
Social support                  2
Healthy life expectancy         1
Freedom to make life choices    1
Generosity                      0
Perceptions of corruption       1
dtype: int64

In [31]:
df_happiness[df_happiness.isna().any(axis=1)]

Unnamed: 0,Overall rank,Country or region,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption
0,154,Afghanistan,3.203,0.35,,0.361,,0.158,0.025
29,103,Congo (Brazzaville),4.812,0.673,0.799,,0.372,0.105,0.093
30,127,Congo (Kinshasa),4.418,0.094,,0.357,0.269,0.212,0.053
34,20,Czech Republic,6.852,,1.487,0.92,0.457,0.046,0.036
38,137,Egypt,4.166,0.913,1.039,0.644,0.241,0.076,


In [26]:
df_freedom.isnull().sum()

Year                                0
ISO_Code                            0
Countries                           0
Economic Freedom Summary Index      1
Rank                                0
Size of Government                  1
Legal System & Property Rights      1
Sound Money                         1
Freedom to trade internationally    1
Regulation                          0
dtype: int64

In [34]:
df_freedom[df_freedom.isna().any(axis=1)]

Unnamed: 0,Year,ISO_Code,Countries,Economic Freedom Summary Index,Rank,Size of Government,Legal System & Property Rights,Sound Money,Freedom to trade internationally,Regulation
0,2019,AGO,Angola,4.99,157,,3.46,3.71,5.52,4.5
5,2019,AUS,Australia,8.23,6,6.48,,9.42,8.19,8.41
15,2019,BHS,"Bahamas, The",7.15,63,8.83,6.27,,5.66,7.04
18,2019,BLZ,Belize,,115,6.96,3.88,7.28,6.57,6.66
21,2019,BRB,Barbados,6.84,83,6.97,6.08,7.9,,6.21


In [51]:
# TODO: Handle missing values
# Choose and justify your strategy (e.g., mean/median imputation, dropping rows/columns)
# You may want to create cleaned copies: happiness_clean, freedom_clean

# YOUR CODE HERE
happiness_clean = df_happiness.copy()
freedom_clean = df_freedom.copy()

# Strategy used here:
# - Numeric columns: impute with the median, which is robust to extreme values.
# - Categorical columns (country names) have no missing values, so neither
#   mode imputation nor dropping rows is needed.
happiness_numeric = happiness_clean.select_dtypes(include='number').columns

for col in happiness_numeric:
    happiness_clean[col] = happiness_clean[col].fillna(happiness_clean[col].median())

# Should display an empty frame: no rows with missing values remain
happiness_clean[happiness_clean.isna().any(axis=1)]

Unnamed: 0,Overall rank,Country or region,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption


In [58]:
freedom_numeric = freedom_clean.select_dtypes(include='number').columns

# Median imputation for every numeric column that has gaps
for col in freedom_numeric:
    freedom_clean[col] = freedom_clean[col].fillna(freedom_clean[col].median())

# Should display an empty frame: no rows with missing values remain
freedom_clean[freedom_clean.isna().any(axis=1)]


Unnamed: 0,Year,ISO_Code,Countries,Economic Freedom Summary Index,Rank,Size of Government,Legal System & Property Rights,Sound Money,Freedom to trade internationally,Regulation


In [60]:
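# TODO: Explore missing values in both datasets AFTER handling them

# YOUR CODE HERE
# A cell only renders its last expression, so the happiness check is asserted rather than displayed
assert not happiness_clean.isna().any().any()
freedom_clean[freedom_clean.isna().any(axis=1)]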

Unnamed: 0,Year,ISO_Code,Countries,Economic Freedom Summary Index,Rank,Size of Government,Legal System & Property Rights,Sound Money,Freedom to trade internationally,Regulation


## 1.4 Outlier Detection

Tasks:
- Choose a few key numeric variables
- Use IQR or ±2 standard deviations to detect outliers
- Remove or clamp them (justify your choice)
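
As a rough illustration of the IQR rule, the sketch below flags values outside 1.5 × IQR beyond the quartiles and then shows both handling options, removal and clamping. The helper name `iqr_bounds` and the numbers are made up for this example; for the ±2 standard deviation variant the bounds would instead be `mean - 2 * std` and `mean + 2 * std`.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) fences placed k * IQR beyond the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up values; 100 is deliberately extreme
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100], dtype=float)

lower, upper = iqr_bounds(values)
is_outlier = (values < lower) | (values > upper)

# Option 1: remove the flagged rows
removed = values[~is_outlier]
# Option 2: clamp (winsorize) them to the fences, keeping the row count unchanged
clamped = values.clip(lower=lower, upper=upper)

print(lower, upper)
print(is_outlier.sum(), "outlier(s) flagged")
print(clamped.tolist())
```

Clamping keeps every country in the dataset, which matters when the frames are merged later; removal is simpler to justify when a value is clearly a data error.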

In [61]:
# TODO: Detect outliers for selected numeric columns
# Hint: Use boxplots and/or IQR calculations

# YOUR CODE HERE
happiness_clean.describe()

Unnamed: 0,Overall rank,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption
count,155.0,155.0,155.0,155.0,155.0,155.0,155.0,155.0
mean,78.935484,5.395348,0.900142,1.212471,0.725052,0.3942,0.183897,0.109565
std,44.994181,1.106983,0.396818,0.293663,0.241034,0.139607,0.09482,0.093686
min,1.0,2.853,0.0,0.0,0.0,0.01,0.0,0.0
25%,40.5,4.541,0.5945,1.057,0.553,0.31,0.1085,0.047
50%,79.0,5.373,0.96,1.274,0.789,0.417,0.177,0.0855
75%,117.5,6.178,1.221,1.447,0.881,0.5025,0.2465,0.1405
max,156.0,7.769,1.684,1.624,1.141,0.631,0.566,0.453


In [62]:
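# Half-width of the ±2 standard deviation band for 'Perceptions of corruption'
# (std copied from the describe() output above); the outlier bounds are mean ± this value
std = 0.093686
maxi = 2*std
maxi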

0.187372

In [None]:
# TODO: Apply your chosen outlier handling strategy
# You can create new dataframes without outliers if you want

# YOUR CODE HERE
happiness_no_outliers = happiness_clean.copy()
freedom_no_outliers = freedom_clean.copy()

# Example outline (fill in logic):


# TODO: Re-check both datasets AFTER outlier handling (e.g., with .describe())


---
# TASK 2 – Apply Data Mining Methods (Modelling & Evaluation)

Now that the data is cleaned and prepared, you will:
- Merge the two datasets
- Perform correlation analysis
- Apply unsupervised learning (clustering)
- Optionally apply a simple supervised method (classification)


## 2.1 Merge Datasets
Merge the cleaned Happiness and Economic Freedom dataframes on the country name (`Country or region` in the Happiness data, `Countries` in the Economic Freedom data).
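
The two files name their country column differently (`Country or region` and `Countries`), so the merge needs `left_on`/`right_on`. The sketch below uses a handful of rows copied from the outputs above; an outer merge with `indicator=True` is one way to spot countries that appear in only one file before settling on an inner merge.

```python
import pandas as pd

# A few rows taken from the notebook outputs, for illustration only
happy = pd.DataFrame({
    "Country or region": ["Albania", "Argentina", "Czech Republic"],
    "Score": [4.719, 6.086, 6.852],
})
free = pd.DataFrame({
    "Countries": ["Albania", "Argentina", "Angola"],
    "Economic Freedom Summary Index": [7.72, 5.30, 4.99],
})

# Diagnostic merge: the _merge column says where each row came from
check = happy.merge(free, left_on="Country or region", right_on="Countries",
                    how="outer", indicator=True)
unmatched = check[check["_merge"] != "both"]
print(unmatched)

# Final merge: keep only countries present in both frames, drop the duplicate key
merged = happy.merge(free, left_on="Country or region", right_on="Countries",
                     how="inner").drop(columns="Countries")
print(merged)
```

Unmatched rows are often caused by spelling differences between the sources, which can be fixed by renaming before the merge.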


In [None]:
# TODO: Merge the two datasets

# YOUR CODE HERE



## 2.2 Correlation Analysis
Use the merged dataset to:
- Compute a correlation matrix for selected numeric features
- Visualise it using a heatmap
- Interpret some of the strongest positive/negative correlations
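
A minimal sketch of these steps, assuming synthetic data in place of the merged frame: it computes the correlation matrix and lists each feature pair once, ordered by absolute strength. The column names mirror the Happiness data, but the values are randomly generated.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
gdp = rng.uniform(0, 1.7, size=50)

# Synthetic stand-in for the merged dataset
toy = pd.DataFrame({
    "GDP per capita": gdp,
    "Score": 3 + 2 * gdp + rng.normal(0, 0.3, size=50),  # built to correlate with GDP
    "Generosity": rng.uniform(0, 0.6, size=50),          # independent noise
})

corr = toy.corr()  # Pearson correlation by default

# Keep each pair once (strict upper triangle), then sort by absolute correlation
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna().sort_values(key=abs, ascending=False)
print(pairs)
```

The heatmap itself can then be drawn with `sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)`.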

In [None]:
# TODO: Compute and visualise correlation matrix

# YOUR CODE HERE


## 2.3 Clustering (Unsupervised Learning)

Tasks:
- Select a subset of numeric features (e.g., `Score`, `GDP per capita`, `Economic Freedom Summary Index`, `Healthy life expectancy`)
- Normalise them (if not already done)
- Apply K-Means with k=3 and k=5
- Visualise the clusters (e.g., Score vs GDP, coloured by cluster)
- Interpret the clusters
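
The normalise-then-cluster steps can be sketched as follows, assuming synthetic points in place of the merged dataset. The feature names are borrowed from the Happiness data, while the three group centres and the random seed are arbitrary choices for the example.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(42)

# Synthetic data: three loose groups of 30 points each
centers = np.array([[4.0, 0.4], [5.5, 1.0], [7.0, 1.4]])
points = np.vstack([c + rng.normal(0, [0.2, 0.05], size=(30, 2)) for c in centers])
toy = pd.DataFrame(points, columns=["Score", "GDP per capita"])

# Scale every feature to [0, 1] so that no feature dominates the distance
X = MinMaxScaler().fit_transform(toy)

for k in (3, 5):
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    toy[f"cluster_k{k}"] = model.fit_predict(X)
    # Inertia (within-cluster sum of squares) always drops as k grows
    print(k, round(model.inertia_, 3))

print(toy.groupby("cluster_k3")[["Score", "GDP per capita"]].mean())
```

The per-cluster means in the last line are a convenient starting point for describing each cluster in words.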

In [None]:
# TODO: Apply K-Means clustering

from sklearn.cluster import KMeans

# YOUR CODE HERE


In [None]:
# TODO: Visualise clusters (e.g., Score vs GDP, coloured by cluster)

# YOUR CODE HERE
x_col = ''  # e.g., use column names from the dataset
y_col = ''  # e.g., use column names from the dataset



✏️ **Short Answer:**
- How would you describe each cluster (high GDP & high happiness, etc.)?
- Do clusters correspond to specific regions/continents?


*(Write your answer here.)*