# FIT Competition
<div style="text-align: justify">FIT Competition, short for Faculty of Information Technology Competition, is an annual technology competition organized by the Computer Science Student Association at the Faculty of Information Technology, Satya Wacana Christian University. FIT Competition 2024 was held under the theme “Creating Sustainable Solution for Smart City”. It aims to foster a positive competitive spirit and to encourage creative computer science ideas that raise public awareness of the environment through the integration of technology in urban settings. The competition ran from July 3 to July 5, 2024.</div>

## Description
<div style="text-align: justify">The happiness of a city can reflect how well its government manages its people and region, not only in physical areas such as infrastructure but also in social, economic, and cultural matters. Many variables can affect a city's happiness index, including GDP per capita, social support in times of need, absence of corruption in government, healthy life expectancy, freedom to make life choices, and generosity or charity towards others. The goal of the competition is to use regression to predict the happiness score of cities and regencies in Indonesia.</div>

## Dataset
<div style="text-align: justify">FIT Competition uses the "Indonesia Smart City" dataset, a collection of features related to smart city implementations across cities and regencies in Indonesia for the years 2022 and 2023. It contains train, test, and sample submission files. The features of the train and test datasets are:</div>

- `id` - City or Regency identifier
- `city_or_regency` - Name of City or Regency
- `year` - The year in which the data is recorded
- `total_area` - Area of City or Regency (KM^2)
- `population` - The Number of Residents in One City or Regency
- `densities` - Density Level (Population/KM^2)
- `traffic_density` - Categories for Traffic Density (Low/Medium/High)
- `green_open_space` - Area of Green Open Space (KM^2)
- `hdi` - Index of Human Development for Each City or Regency
- `gross_regional_domestic_product` - Total Gross Value Added at Current Prices (Billion Rupiah)
- `total_landfills` - Number of Landfills per City or Regency
- `solid_waste_generated` - The amount of waste each City or Regency generated from various sources for a year (Tens of Tons)
- `happiness_score` - Score measuring the level of happiness of each City or Regency (0 - 100); *this is the label, present only in the train dataset*

The official Kaggle competition for the preliminary round of FIT Competition 2024 can be found here:
https://www.kaggle.com/competitions/preliminary-round-fit-competition-2024
(p.s. it's invite only 😉)

## Importing Libraries

<div style="text-align: justify">In this first section of the notebook, we set up the environment by importing the libraries and modules needed for data analysis and machine learning. These cover data manipulation, visualization, preprocessing, and several machine learning algorithms that are popular for tabular data (what we might call <b>the state of the art</b>). Importing everything up front keeps the notebook organized and means we do not have to import these libraries again later.</div>

In [25]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

from sklearn.model_selection import cross_val_score, KFold, train_test_split
from sklearn.metrics import make_scorer, mean_squared_error
from catboost import CatBoostRegressor
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, VotingRegressor, StackingRegressor

# Used for permutation importance in the Modelling section (requires the eli5 package)
import eli5
from eli5.sklearn import PermutationImportance

| Library Import Command | Purpose |
| ---------------------- | ------- |
| `import numpy as np` | Used for numerical operations on arrays, supporting large, multi-dimensional arrays and matrices along with high-level mathematical functions. |
| `import pandas as pd` | Provides data structures and data analysis tools, fundamental for manipulating and using structured data efficiently. |
| `import matplotlib.pyplot as plt`<br>`import seaborn as sns` | Used for data visualization. Matplotlib offers customization options, while Seaborn provides a high-level interface for attractive statistical graphics. |
| `from sklearn.preprocessing import LabelEncoder` | Encodes labels as integers between 0 and n_classes-1. It is intended for target labels, but can also turn a categorical column into integer codes. |
| `from sklearn.experimental import enable_iterative_imputer` | Enables the experimental `IterativeImputer` in scikit-learn; it must be imported before `IterativeImputer` itself. |
| `from sklearn.impute import IterativeImputer` | Imputes missing values by modeling each feature with missing values as a function of the other features, in a round-robin fashion. |
| `from sklearn.model_selection import cross_val_score, KFold, train_test_split` | Includes classes for model selection and evaluation, facilitating cross-validation and data splitting. |
| `from sklearn.metrics import make_scorer, mean_squared_error` | Contains functions for creating custom scoring functions and calculating mean squared error for model evaluation. |
| `from catboost import CatBoostRegressor`<br>`from lightgbm import LGBMRegressor`<br>`from xgboost import XGBRegressor`<br>`from sklearn.linear_model import LinearRegression`<br>`from sklearn.ensemble import RandomForestRegressor, VotingRegressor, StackingRegressor` | Various machine learning models from libraries like CatBoost, LightGBM, XGBoost, and scikit-learn used for regression tasks. |
| `import eli5`<br>`from eli5.sklearn import PermutationImportance` | ELI5 is for debugging machine learning classifiers and explaining their predictions. Permutation Importance computes feature importances in our models. |

## Data Preprocessing

<div style="text-align: justify">In this section, we will focus on preparing our datasets for future analysis and modeling. This section is crucial as it sets the foundation for accurate and effective models that we are going to use. Below is an overview of the steps and activities we will do:</div>

1. **Data Loading**: We begin by importing the provided train, test, and sample submission files. We then preview the raw train and test datasets, which is essential for understanding the structure of the data and identifying which features can be used as-is and which need modification.

2. **Data Cleaning**: This involves handling missing values by removing or imputing data as necessary, and making sure the datasets are free of errors or outliers that could affect our analysis.

By carefully processing our data, we can later be sure that our analysis `(EDA)` is reliable enough for us to continue to the next step, `Feature Engineering`.

### Data Loading and Raw Data Visualization

In [26]:
# Replace these paths with the location of your own data files.
# Ours sit in the same directory as the notebook, so we simply read them by name.
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
sample = pd.read_csv("sample_submission.csv")

In [27]:
display(train.head(8).style.set_caption('Train Dataset'))
display(train.describe().style.set_caption('Train Dataset Summary'))
display(test.head(8).style.set_caption('Test Dataset'))
test.describe().style.set_caption('Test Dataset Summary')

Unnamed: 0,id,city_or_regency,year,total_area (km2),population,densities,traffic_density,green_open_space,hdi,gross_regional_domestic_product,total_landfills,solid_waste_generated,happiness_score
0,11012022,Simeulue,2022,1838.09,94876,51.62,Low,0.12,67.27,2688,1.0,1628.09,72.37
1,11032022,Aceh Selatan,2022,4173.82,237376,56.87,Low,,67.87,6447,1.0,3521.77,72.54
2,11042022,Aceh Tenggara,2022,4242.04,228308,53.82,Low,,70.32,5829,1.0,3333.3,72.38
3,11062022,Aceh Tengah,2022,4527.53,222673,49.18,Low,1.11,73.95,8873,1.0,,71.38
4,11072022,Aceh Barat,2022,2927.95,202858,69.28,Low,0.01,72.34,12730,1.0,3702.16,72.36
5,11082022,Aceh Besar,2022,2903.5,414490,142.76,Low,,74.0,15457,,,72.91
6,11102022,Bireuen,2022,1798.0,443874,246.87,Low,0.0,73.16,15470,,,73.03
7,11112022,Aceh Utara,2022,3296.86,614640,186.43,Low,0.05,70.22,28310,,,70.79


Unnamed: 0,id,year,densities,hdi,total_landfills,happiness_score
count,822.0,822.0,822.0,822.0,420.0,822.0
mean,45302436.125304,2022.5,1159.968893,71.079745,1.264286,73.919246
std,26498600.384108,0.500304,2689.298031,6.449038,0.854113,2.705407
min,11012022.0,2022.0,0.0,34.1,1.0,65.11
25%,18062022.25,2022.0,55.095,67.9225,1.0,71.99
50%,35292022.5,2022.5,169.99,70.845,1.0,73.695
75%,71729522.0,2023.0,911.8425,74.35,1.0,75.6225
max,94712023.0,2023.0,22150.59,88.28,8.0,81.81


Unnamed: 0,id,city_or_regency,year,total_area (km2),population,densities,traffic_density,green_open_space,hdi,gross_regional_domestic_product,total_landfills,solid_waste_generated
0,11022022,Aceh Singkil,2022,1857.88,130787,70.4,Low,685.53,69.62,3005,1.0,1926.13
1,11052022,Aceh Timur,2022,6040.6,432849,71.66,Low,0.09,68.72,13101,2.0,6319.6
2,11092022,Pidie,2022,3184.46,444505,139.59,Low,0.42,71.2,12412,1.0,6489.77
3,11182022,Pidie Jaya,2022,952.11,162771,170.96,Low,0.05,74.34,3980,1.0,2902.94
4,11752022,City of Subulussalam,2022,1391.0,95199,68.44,Low,0.2,66.2,2365,1.0,1353.0
5,12052022,Tapanuli Utara,2022,3793.71,318424,83.93,Low,0.0,74.14,9646,,
6,12082022,Asahan,2022,3732.97,787681,211.01,Low,,71.13,46575,1.0,17256.5
7,12112022,Karo,2022,2127.25,414429,194.82,Low,0.01,75.36,23976,,


Unnamed: 0,id,year,densities,green_open_space,hdi,total_landfills
count,206.0,206.0,206.0,105.0,206.0,104.0
mean,47508915.703883,2022.5,951.246374,99.382855,70.329515,1.240385
std,27948473.246756,0.501218,2523.463139,237.269981,6.126309,0.689663
min,11012023.0,2022.0,2.198125,0.0,44.59,1.0
25%,17074522.75,2022.0,48.654709,0.04271,67.765,1.0
50%,36027022.5,2022.5,126.755259,0.6401,69.735,1.0
75%,72107022.75,2023.0,735.491663,36.93696,73.2975,1.0
max,94362023.0,2023.0,19760.432,1215.40535,86.69,5.0


### Combining Train and Test Datasets

<div style="text-align: justify">The next step combines the train and test datasets into a single DataFrame named <code>df</code>. This is common practice, and we do it here because the two datasets turned out to have very similar characteristics: there is no concrete difference in their features or distributions, and they appear to have been split only for the purposes of the competition. The other reasons are:</div>

- **Consistent Data Handling**: Combining the datasets ensures that any transformations or feature engineering we apply are consistent across the entire dataset, reducing the risk of discrepancies between training and testing phases.

- **Increased Data Volume**: By combining the datasets, we increase the volume of data available for certain features (except the target labels, obviously).

In [28]:
df = pd.concat([train, test])

<div style="text-align: justify"><b>Note:</b> for those wondering whether this counts as leakage: data leakage occurs when information from outside the training dataset, such as future or unseen data (for example the test set labels), is used to build the model. Because we do not use any target values from the test set, there is no leakage of the kind that could unfairly advantage the model.</div>
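
Because the test rows carry no `happiness_score`, the combined frame can be separated again later by checking for a missing label. The following is a minimal sketch on toy data, not a cell from the original notebook; the frames, values, and the names `is_train`, `train_part`, and `test_part` are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for train.csv and test.csv (values are illustrative only)
toy_train = pd.DataFrame({"id": [1, 2], "hdi": [67.2, 70.3], "happiness_score": [72.4, 71.9]})
toy_test = pd.DataFrame({"id": [3], "hdi": [69.6]})

# ignore_index=True avoids duplicated index labels after stacking
combined = pd.concat([toy_train, toy_test], ignore_index=True)

# Rows that came from the test set have NaN in the label column
is_train = combined["happiness_score"].notna()
train_part = combined[is_train]
test_part = combined[~is_train].drop(columns="happiness_score")

print(len(train_part), len(test_part))
```

This only works if every train row has a label, which should be checked before relying on it.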

### Formatting Numeric Columns

<div style="text-align: justify">In this step, we correct numeric values that are stored as strings with commas as thousands and millions separators. We remove all commas and convert the values to floats for the <code>total_area (km2)</code>, <code>population</code>, <code>gross_regional_domestic_product</code>, and <code>green_open_space</code> columns. For the <code>solid_waste_generated</code> column, we additionally convert any non-numeric values to NaN. This ensures that all data is in a consistent numeric format for further analysis.</div>

In [29]:
df['total_area (km2)'] = df['total_area (km2)'].str.replace(',', '').astype(float)
df['population'] = df['population'].str.replace(',', '').astype(float)
df['gross_regional_domestic_product'] = df['gross_regional_domestic_product'].str.replace(',', '').astype(float)
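# green_open_space is already numeric in test.csv; cast to str first, because .str methods turn non-string values into NaN
df['green_open_space'] = df['green_open_space'].astype(str).str.replace(',', '').astype(float)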
df['solid_waste_generated'] = pd.to_numeric(df['solid_waste_generated'].str.replace(',', ''), errors='coerce')
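
To illustrate the two conversions used in this step, here is a small sketch on a toy column. The values are invented, and the mixed string/float column is an assumption about what a concatenated column can look like when one file was parsed as text and the other as numbers.

```python
import pandas as pd

# Toy column mixing comma-formatted strings, a non-numeric marker, and a real float
raw = pd.Series(["1,838.09", "4,173.82", "-", 685.53])

# .str methods return NaN for non-string elements, so the float 685.53 is lost here
lost = pd.to_numeric(raw.str.replace(",", ""), errors="coerce")

# Casting to str first keeps it; errors="coerce" turns "-" into NaN
kept = pd.to_numeric(raw.astype(str).str.replace(",", ""), errors="coerce")

print(lost.tolist())
print(kept.tolist())
```

Comparing the number of missing values before and after such a conversion is a cheap way to catch values that were dropped by accident.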

### `traffic_density` Encoding

<div style="text-align: justify">Next, we apply a manual ordinal encoding to the <code>traffic_density</code> column. This column holds the categories <b>Low</b>, <b>Medium</b>, and <b>High</b>, which we map to <b>-1</b>, <b>0</b>, and <b>1</b>, respectively. Using 0 as the midpoint makes the encoding symmetric around zero, which may make model coefficients easier to interpret.</div>


In [30]:
df['traffic_density'] = df['traffic_density'].map({'Low': -1, 'Medium': 0, 'High': 1})
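
For comparison, the sketch below puts the ordinal mapping next to a one-hot encoding on a toy column. The data and the `traffic` prefix are illustrative assumptions, not taken from the notebook.

```python
import pandas as pd

toy = pd.DataFrame({"traffic_density": ["Low", "High", "Medium", "Low"]})

# Ordinal encoding: one column that preserves Low < Medium < High
ordinal = toy["traffic_density"].map({"Low": -1, "Medium": 0, "High": 1})

# One-hot encoding: one indicator column per category, no order implied
one_hot = pd.get_dummies(toy["traffic_density"], prefix="traffic")

print(ordinal.tolist())
print(one_hot.columns.tolist())
```

Note that `map` silently produces NaN for any category missing from the dictionary, so it is worth checking for unexpected spellings first.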

### Dropping Columns

<div style="text-align: justify">Here, we drop the <code>total_landfills</code>, <code>solid_waste_generated</code>, and <code>green_open_space</code> columns from the dataset. Imputing these columns did not produce satisfactory results, as indicated by the built-in feature importance of XGBoost and CatBoost. Despite various imputation strategies, trials, hard work, and dedication, these features did not contribute significantly to our model's performance, so we remove them.</div>


In [31]:
df = df.drop(['total_landfills', 'solid_waste_generated', 'green_open_space'], axis=1)

### Summary of Preprocessing

The table below summarizes the preprocessing techniques we tried and which ones worked:

| Technique                                                                     | Outcome      |
|-------------------------------------------------------------------------------|--------------|
| One hot encoding `traffic_density`                                            | Didn't Work  |
| Ordinal encoding `traffic_density` (-1, 0, 1)                                 | Worked       |
| Imputing `total_landfills`, `solid_waste_generated`, `green_open_space` using IterativeImputer | Didn't Work  |
| Imputing `total_landfills`, `solid_waste_generated`, `green_open_space` using mean imputation | Didn't Work  |
| Imputing `total_landfills`, `solid_waste_generated`, `green_open_space` using 0 imputation | Didn't Work  |
| Dropping `total_landfills`, `solid_waste_generated`, `green_open_space`       | Worked       |
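
The notebook does not include the imputation attempts themselves. As a rough sketch of what the three strategies could look like, here they are applied to a toy frame with one missing value; the frame, the variable names, and the default `IterativeImputer` settings are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

# Toy frame with a single missing value (illustrative only)
toy = pd.DataFrame({
    "population": [94876, 237376, 228308, 222673, 202858],
    "solid_waste_generated": [1628.09, 3521.77, 3333.30, np.nan, 3702.16],
})

# 0 imputation
zero_filled = toy.fillna(0)

# Mean imputation: replace NaN with the column mean
mean_filled = pd.DataFrame(SimpleImputer(strategy="mean").fit_transform(toy), columns=toy.columns)

# Iterative imputation: predict the missing value from the other column
iter_filled = pd.DataFrame(IterativeImputer(random_state=0).fit_transform(toy), columns=toy.columns)

print(mean_filled.loc[3, "solid_waste_generated"], iter_filled.loc[3, "solid_waste_generated"])
```

Whether any of these helps depends on how much signal the imputed column carries; in this notebook, dropping the columns gave the better result.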


## Exploratory Data Analysis (EDA)

### Data Observation

1. Each city or regency has data for two years, 2022 and 2023. This lets us analyze changes over time within each administrative region.
2. The `happiness_score` tends to increase from 2022 to 2023 for the same city or regency. We checked this on the cities and regencies that have both their 2022 and 2023 `happiness_score` in the train dataset:

In [32]:
increase_count = 0
decrease_count = 0

for city in train['city_or_regency'].unique():
    city_data = train[train['city_or_regency'] == city]
    if len(city_data) == 2:
        score_2022 = city_data[city_data['year'] == 2022]['happiness_score'].values[0]
        score_2023 = city_data[city_data['year'] == 2023]['happiness_score'].values[0]
        
        if score_2023 > score_2022:
            increase_count += 1
        elif score_2023 < score_2022:
            decrease_count += 1

print(f'Number of cities/regencies with increased happiness score: {increase_count}')
print(f'Number of cities/regencies with decreased happiness score: {decrease_count}')

Number of cities/regencies with increased happiness score: 257
Number of cities/regencies with decreased happiness score: 67
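
The same comparison can also be written without an explicit loop. The sketch below is an alternative formulation on toy data, not the code that produced the counts above; the city names and scores are invented.

```python
import pandas as pd

toy = pd.DataFrame({
    "city_or_regency": ["A", "A", "B", "B", "C"],
    "year": [2022, 2023, 2022, 2023, 2022],
    "happiness_score": [72.0, 73.5, 74.0, 73.0, 70.0],
})

# One row per city, one column per year; cities missing a year get NaN
wide = toy.pivot(index="city_or_regency", columns="year", values="happiness_score")

# Keep only cities that have both years
change = (wide[2023] - wide[2022]).dropna()

increase_count = int((change > 0).sum())
decrease_count = int((change < 0).sum())
print(increase_count, decrease_count)
```

`pivot` raises an error if a city appears twice for the same year, which doubles as a sanity check on the data.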


### Dataset `id` Structure

<div style="text-align: justify">In this EDA section, we take a closer look at the structure of the dataset's <code>id</code> column to uncover meaningful patterns. After careful observation, we believe the IDs follow this pattern, using 11012022 as an example:</div>

1. Province ID: The first two digits (11) represent the province id/number in Indonesia.
2. City/Regency ID: The first four digits (1101) represent the city or regency id/number in Indonesia.
3. Year: The last four digits (2022) indicate the year.

<div style="text-align: justify">We confirmed this observation by cross-checking the city or regency ID against this <a href="https://www.kaggle.com/datasets/greegtitan/indonesia-province-city-district-and-subdistrict">Kaggle dataset</a> and this <a href="https://lekadnews.blogspot.com/2011/04/data-id-kabupaten-kota-dan-propinsi-di.html">blogspot</a>. The IDs differ slightly from the ones on <a href="https://id.wikipedia.org/wiki/Daftar_kabupaten_di_Indonesia">Wikipedia</a> or <a href="https://kodewilayah.id/">kodewilayah.id</a>, but that seems to be only a matter of which year (version) of the data was published.</div>

## Feature Engineering

### `id` Extraction

In [33]:
df['city'] = df['id'].astype(int).astype(str).str[:4].astype(int)
df['prov'] = df['id'].astype(int).astype(str).str[:2].astype(int)
df = df.drop(['id', 'city_or_regency'], axis=1)
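
The `id` layout described earlier can also be unpacked with integer arithmetic instead of string slicing. This is an equivalent sketch under the assumption that every `id` has exactly eight digits; the second example value is made up.

```python
import pandas as pd

ids = pd.Series([11012022, 35782023])

# id = CCCCYYYY: CCCC is the city/regency code, whose first two digits are the province
year = ids % 10_000
city = ids // 10_000
prov = city // 100

print(pd.DataFrame({"id": ids, "prov": prov, "city": city, "year": year}))
```

Recovering `year` this way also gives a quick consistency check against the dataset's own `year` column.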

### Categorize `pulau` (Island)

In [34]:
def categorize_pulau(prov):
    if 11 <= prov <= 21:
        return 'Sumatra'
    elif 31 <= prov <= 36:
        return 'Jawa'
    elif 51 <= prov <= 53:
        return 'Timur dan Bali'
    elif 61 <= prov <= 65:
        return 'Kalimantan'
    elif 71 <= prov <= 76:
        return 'Sulawesi'
    elif 81 <= prov <= 82:
        return 'Ambon'
    elif 91 <= prov <= 94:
        return 'Papua'
    else:
        return 'Unknown'  # Handle other cases if needed

# Apply the function to create the new column 'pulau'
df['pulau'] = df['prov'].apply(categorize_pulau)
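
`pulau` is a string column, and libraries such as XGBoost expect numeric or categorical dtypes by default. The notebook imports `LabelEncoder` but does not show where it is used, so the following is only one assumed way to turn the island names into integer codes; the toy frame and the `pulau_code` column name are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({"pulau": ["Sumatra", "Jawa", "Papua", "Jawa"]})

# LabelEncoder assigns integer codes in sorted order of the class names
encoder = LabelEncoder()
toy["pulau_code"] = encoder.fit_transform(toy["pulau"])

# Show which code was assigned to which island
print(dict(zip(encoder.classes_, range(len(encoder.classes_)))))
```

The resulting codes carry no meaningful order, which tree-based models usually tolerate better than linear ones.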

### Aggregation

<div style="text-align: justify">We group the dataset by year, city, province (<code>prov</code>), province-year (<code>prov_year</code>), and island (<code>pulau</code>). For each grouping, we compute summary statistics of the numeric columns: mean, standard deviation, maximum, minimum, variance, median, kurtosis, and skewness (only mean and standard deviation for the city grouping). Each aggregation first drops the other grouping columns, aggregates with <code>groupby</code> and <code>agg</code>, and then flattens the multi-level column names for readability. The aggregated results are then merged back into the original dataset on their grouping keys, so every row carries the statistics of the groups it belongs to. This gives the models context at several geographic and temporal levels of granularity.</div>
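
The group, flatten, and merge pattern described above can be seen in miniature on a toy frame. This sketch uses only two statistics and one grouping key; the data is invented.

```python
import pandas as pd

toy = pd.DataFrame({
    "prov": [11, 11, 12, 12],
    "hdi": [67.0, 71.0, 70.0, 74.0],
    "population": [100.0, 300.0, 200.0, 400.0],
})

# groupby + agg with a list of functions yields two-level column names
agg = toy.groupby("prov").agg(["mean", "max"]).reset_index()

# Flatten ('hdi', 'mean') -> 'hdi_mean'; the key ('prov', '') becomes 'prov_'
agg.columns = ["_".join(col).strip() for col in agg.columns.values]
agg = agg.rename(columns={"prov_": "prov"})

# Attach each group's statistics to every row of that group
merged = toy.merge(agg, on="prov")
print(merged[["prov", "hdi", "hdi_mean", "hdi_max"]])
```

The `suffixes` argument only matters from the second merge onward, when the incoming aggregate columns share names with ones already attached.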

In [35]:
df['prov_year'] = (df['prov'].astype(int).astype(str) + df['year'].astype(int).astype(str)).astype(int)

In [36]:
from scipy.stats import kurtosis, skew

# Group by 'year'
aggregated_by_year = df.drop(columns=['city', 'prov', 'prov_year', 'pulau']).groupby('year').agg(['mean', 'std', 'max', 'min', 'var', 'median', kurtosis, skew]).reset_index()
aggregated_by_year.columns = ['_'.join(col).strip() for col in aggregated_by_year.columns.values]
aggregated_by_year.rename(columns={'year_': 'year'}, inplace=True)

# Group by 'city'
aggregated_by_city = df.drop(columns=['year', 'prov', 'prov_year', 'pulau']).groupby('city').agg(['mean', 'std']).reset_index()
aggregated_by_city.columns = ['_'.join(col).strip() for col in aggregated_by_city.columns.values]
aggregated_by_city.rename(columns={'city_': 'city'}, inplace=True)

# Group by 'prov'
aggregated_by_prov = df.drop(columns=['city', 'year', 'prov_year', 'pulau']).groupby('prov').agg(['mean', 'std', 'max', 'min', 'var', 'median', kurtosis, skew]).reset_index()
aggregated_by_prov.columns = ['_'.join(col).strip() for col in aggregated_by_prov.columns.values]
aggregated_by_prov.rename(columns={'prov_': 'prov'}, inplace=True)

# Group by 'prov_year'
aggregated_by_prov_year = df.drop(columns=['city', 'year', 'prov', 'pulau']).groupby('prov_year').agg(['mean', 'std', 'max', 'min', 'var', 'median', kurtosis, skew]).reset_index()
aggregated_by_prov_year.columns = ['_'.join(col).strip() for col in aggregated_by_prov_year.columns.values]
aggregated_by_prov_year.rename(columns={'prov_year_': 'prov_year'}, inplace=True)

# Group by 'pulau'
aggregated_by_pulau = df.drop(columns=['city', 'year', 'prov', 'prov_year']).groupby('pulau').agg(['mean', 'std', 'max', 'min', 'var', 'median', kurtosis, skew]).reset_index()
aggregated_by_pulau.columns = ['_'.join(col).strip() for col in aggregated_by_pulau.columns.values]
aggregated_by_pulau.rename(columns={'pulau_': 'pulau'}, inplace=True)

# Merge the aggregated results back into the original dataframe
df = df.merge(aggregated_by_year, on='year', suffixes=('', '_year'))
df = df.merge(aggregated_by_city, on='city', suffixes=('', '_city'))
df = df.merge(aggregated_by_prov, on='prov', suffixes=('', '_prov'))
df = df.merge(aggregated_by_prov_year, on='prov_year', suffixes=('', '_prov_year'))
df = df.merge(aggregated_by_pulau, on='pulau', suffixes=('', '_pulau'))

# Show the result
print(df.head())



   year  total_area (km2)  population  densities  traffic_density    hdi  \
0  2022           1838.09     94876.0      51.62               -1  67.27   
1  2022           4173.82    237376.0      56.87               -1  67.87   
2  2022           4242.04    228308.0      53.82               -1  70.32   
3  2022           4527.53    222673.0      49.18               -1  73.95   
4  2022           2927.95    202858.0      69.28               -1  72.34   

   gross_regional_domestic_product  happiness_score  city  prov  ...  \
0                           2688.0            72.37  1101    11  ...   
1                           6447.0            72.54  1103    11  ...   
2                           5829.0            72.38  1104    11  ...   
3                           8873.0            71.38  1106    11  ...   
4                          12730.0            72.36  1107    11  ...   

  gross_regional_domestic_product_kurtosis_pulau  \
0                                      14.772938   
1     



### `GDP_per_Capita`

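GDP per capita is calculated by dividing total GDP by total population. Here we apply the same idea at the city or regency level, dividing `gross_regional_domestic_product` (at current prices, in billion rupiah) by `population`, so the resulting feature is in billion rupiah per person.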

In [37]:
df['GDP_per_Capita'] = df['gross_regional_domestic_product'] / df['population']
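
As a worked example of the units involved, the sketch below uses the first row of the train preview (Simeulue, 2022) and converts the ratio to plain rupiah per person. The conversion factor follows from the dataset description, which gives the regional product in billion rupiah; the variable names are illustrative.

```python
# Simeulue, 2022, taken from the first row of the train preview
grdp_billion_rupiah = 2688.0
population = 94876

# Same ratio as the GDP_per_Capita feature: billion rupiah per person
per_capita_billion = grdp_billion_rupiah / population

# Convert to plain rupiah per person for readability
per_capita_rupiah = per_capita_billion * 1e9

print(f"{per_capita_billion:.6f} billion rupiah per person")
print(f"{per_capita_rupiah:,.0f} rupiah per person")
```

The rescaling does not matter for tree-based models, but it makes the feature easier to sanity-check by eye.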

### Separate the Target Labels

In [38]:
df_label = df['happiness_score']
df = df.drop('happiness_score', axis=1)

## Modelling

<div style="text-align: justify">We begin our modelling process by defining three popular gradient boosting models: CatBoost, LightGBM, and XGBoost, with hyperparameters tuned beforehand using <b>Optuna</b>. Then, we define a Mean Squared Error (MSE) scorer for model evaluation. The <code>greater_is_better=False</code> argument tells scikit-learn that lower MSE is better, so the scorer returns negated MSE values. Additionally, we create a Voting Regressor to combine the predictions of the three models.</div>

In [39]:
# Define the models
cb = CatBoostRegressor(iterations=715, depth=9, learning_rate=0.21685466614525384,
                    l2_leaf_reg=46.80107916663932, min_child_samples=5, border_count=40, silent=True)

lgb = LGBMRegressor(max_depth=20, lambda_l1=0.00016628985509900782,
                    lambda_l2=0.002837398285557587, min_split_gain=0.7577793742601943,
                    min_child_weight=9.419284482047434, colsample_bytree=0.7269574870557697,
                    subsample=0.272828878529134, reg_alpha=6.174076387406154,
                    reg_lambda=6.096821117413743, n_estimators=217, verbose_eval=False)

xgb = XGBRegressor(colsample_bytree=0.5920623339546657, learning_rate=0.05655179773541492,
                max_depth=8, n_estimators=477, gamma=0.6651835678620678,
                min_child_weight=1.561555677120107, alpha=0.131264100671468,
                reg_lambda=0.4116178861256159)

# Define the MSE scorer for validation
mse_scorer = make_scorer(mean_squared_error, greater_is_better=False)

models = {
    "CatBoost": cb,
    "LightGBM": lgb,
    "XGBoost": xgb,
}

votingreg = VotingRegressor(estimators=[
    ('cb',cb),
    ('lgb', lgb),
    ('xgb',xgb)])
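
To make concrete what the Voting Regressor does, the sketch below checks that its prediction equals the plain average of its members' predictions when no weights are given. It uses two scikit-learn estimators and synthetic data as stand-ins for the three boosted models, purely to keep the example light.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression

# Synthetic regression data (stand-in for the competition features)
X, y = make_regression(n_samples=100, n_features=5, noise=5.0, random_state=0)

voter = VotingRegressor(estimators=[
    ("lr", LinearRegression()),
    ("rf", RandomForestRegressor(n_estimators=20, random_state=0)),
])
voter.fit(X, y)

# With no weights given, the ensemble prediction is the plain mean of the members
member_preds = np.column_stack([est.predict(X) for est in voter.estimators_])
manual_average = member_preds.mean(axis=1)
ensemble_pred = voter.predict(X)

print(np.allclose(manual_average, ensemble_pred))
```

`VotingRegressor` fits clones of the estimators it is given, so the fitted members live in `voter.estimators_` rather than in the original objects.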

<div style="text-align: justify">To filter the features later on, we compute each feature's permutation importance score using <code>PermutationImportance</code>. First, we split the training data into training and testing sets using a 70-30 split. Second, we train the Voting Regressor on the training set and make predictions on the testing set. As the Indonesian saying goes, "sambil menyelam, minum air" (drink water while diving, that is, get two things done at once), so we also compute the MSE on the held-out 30% as a quick check of model performance (the actual evaluation comes at the end of our modelling process). We then use permutation importance to identify the most important features for our model and visualize them.</div>

In [40]:
# Split the combined frame back into train and test rows.
# Assumption: this step is missing from the notebook; rows from test.csv have no label.
is_train = df_label.notna()
train, test = df[is_train], df[~is_train]
train_label = df_label[is_train]

X = train
y = train_label

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train the model to get the permutation importance
model = votingreg
model.fit(X_train, y_train)

# Predict and compute the MSE on the held-out set
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f'MSE: {mse}')

# Compute the permutation importance on the held-out set
perm = PermutationImportance(model, scoring='neg_mean_squared_error', random_state=42).fit(X_test, y_test)

# Show the weights
eli5.show_weights(perm, feature_names = X_test.columns.tolist())

<div style="text-align: justify">The step above shows which features contribute most to the model's predictions. We keep only the 50 highest-scoring features and continue to model evaluation.</div>

In [None]:
importance_df = pd.DataFrame({
    'feature': X_test.columns,
    'importance': perm.feature_importances_
})

top_50_features = importance_df.sort_values(by='importance', ascending=False).head(50)['feature'].tolist()
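
If eli5 is not available, scikit-learn ships its own `permutation_importance` function that measures the same idea. The sketch below is a swapped-in alternative rather than the notebook's method, and it uses synthetic data, a random forest, and a top-2 cut-off as stand-ins.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data with named columns (stand-in for the engineered features)
X, y = make_regression(n_samples=200, n_features=6, n_informative=2, random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(6)])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each column of the held-out set and record how much the score drops
result = permutation_importance(
    model, X_te, y_te, scoring="neg_mean_squared_error", n_repeats=5, random_state=42
)

ranking = pd.DataFrame({"feature": X_te.columns, "importance": result.importances_mean})
top_features = ranking.sort_values("importance", ascending=False).head(2)["feature"].tolist()
print(top_features)
```

Importances near zero or below mean that shuffling the column did not hurt the score, which is the signal used to discard a feature.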

<div style="text-align: justify">We perform cross-validation on each individual model as well as the Voting Regressor to evaluate their performance. We use a 6-fold cross-validation scheme to ensure robust performance metrics.</div>
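
The sign handling in the cross-validation step is easy to get wrong, so here is a small sketch of it in isolation. It uses synthetic data and a linear model as stand-ins; only the scorer and the `KFold` settings mirror the notebook.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (stand-in for the selected features)
X, y = make_regression(n_samples=120, n_features=4, noise=10.0, random_state=0)

# greater_is_better=False makes the scorer return -MSE, so "higher is better" still holds
mse_scorer = make_scorer(mean_squared_error, greater_is_better=False)
cv = KFold(n_splits=6, shuffle=True, random_state=42)

raw_scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring=mse_scorer)
mse_per_fold = -raw_scores  # flip the sign back to a regular MSE

print(mse_per_fold.mean(), mse_per_fold.std())
```

Passing the same `KFold` object to every model keeps the folds identical, so the per-model scores are directly comparable.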

In [None]:
X = train[top_50_features]
y = train_label

for name, model in models.items():
    cv_scores = cross_val_score(model, X, y, cv=KFold(n_splits=6, shuffle=True, random_state=42), scoring=mse_scorer)
    cv_scores = -cv_scores  # The scorer returns negative MSE, so flip the sign
    print(f'{name} Cross-Validation MSE Scores: {cv_scores}')
    print(f'{name} Mean MSE: {cv_scores.mean()}')
    print(f'{name} Standard Deviation MSE: {cv_scores.std()}')

# Cross-validate the Voting Regressor
cv_scores_voting = cross_val_score(votingreg, X, y, cv=KFold(n_splits=6, shuffle=True, random_state=42), scoring=mse_scorer)
cv_scores_voting = -cv_scores_voting  # The scorer returns negative MSE, so flip the sign
print(f'Voting Regressor Cross-Validation MSE Scores: {cv_scores_voting}')
print(f'Voting Regressor Mean MSE: {cv_scores_voting.mean()}')
print(f'Voting Regressor Standard Deviation MSE: {cv_scores_voting.std()}')

CatBoost Cross-Validation MSE Scores: [1.56611909 1.40801677 1.51404972 1.28215923 1.36236832 1.83801147]
CatBoost Mean MSE: 1.4951207682369414
CatBoost Standard Deviation MSE: 0.17959957105806332
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000528 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1490
[LightGBM] [Info] Number of data points in the train set: 685, number of used features: 30
[LightGBM] [Info] Start training from score 73.948438
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000528 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1462
[LightGBM] [Info] Number of data points in the train set: 685, number of used features: 30
[LightGBM] [Info] Start training from score 73.934730
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000514 seconds.
You can set `forc

## Submitting To The Competition

<div style="text-align: justify">We use the previously identified top 50 features from the training data <code>train</code> and corresponding labels <code>train_label</code> to train the Voting Regressor. After training, we use the Voting Regressor to predict the <b>happiness_score</b> for the test dataset using the same top 50 features. </div>

In [None]:
X, y = train[top_50_features], train_label
votingreg.fit(X, y)
sample['happiness_score'] = votingreg.predict(test[top_50_features])

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000604 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1512
[LightGBM] [Info] Number of data points in the train set: 822, number of used features: 30
[LightGBM] [Info] Start training from score 73.919246


Then, we ~~pray~~ reevaluate and submit the prediction to the competition.