# -------- INFO --------
"""
Repository: https://github.com/NLBrien/mod550-2025
Creation date: 2025-10-15
Author: Nathan L.Brien
Course: MOD550 - Machine Learning
Title: Semester project
Description: Task 2 discussion and result analysis from script MOD550-P1-NLB-Task_2

Last modification date: 2025-10-15
"""


'\nRepository: https://github.com/NLBrien/mod550-2025\nCreation date: 2025-10-12\nAuthor: Nathan L.Brien\nCourse: MOD550 - Machine Learning\nTitle: Semester project\nDescription: Task 2\n    - Linear Regression (LinReg) using sklearn\n    - Mean Squared Error (MSE) code in vanilla Python\n    - Neural Network (NN) using keras\n    - K-Means (KM) clustering\n    - Gaussian (GMM) code\n\nKernel: Use of Python 3.10.9 necessary to run tensorflow module\n\nImport: After many unsuccessful attempts to individually call the function scripts, I had to create one notebook file\n        with all the function codes. More detailed explanation in MOD550_P1_NLB_DEF_functions.ipynb.\n\nNote:   I initially intended to classify the targeted food crisis state with the 5 different levels, but data manipulation\n        proved to be a colossal challenge due to a sparse and incomplete initial dataset. The food crisis level has \n        been used in the first task (1), but data management proved to be more complex

# World Food Crises Prediction ML
The main script is **MOD550-P1-NLB-Task_2**, which calls the different machine learning operators in Python. These operators are defined in **MOD550_P1_NLB_DEF_functions.ipynb** and were initially coded as individual Python scripts (PYTHON_DEF_...).

## Introduction
Before running different operators on the selected dataset (Global Report on Food Crises, GRFC 2025, with data from 2016 to 2024), multiple data manipulation and cleaning steps were required.  
The project goal is to predict whether a country will enter a state of food crisis, based on several features such as:
- Region (continent)  
- Previous food crisis phase (level 1 {least} to 5 {worst})  
- Total country population  
- Total country GDP  
- GDP per capita  

Other characteristics like population birth rate or total food production (in tons) would have been interesting to include, but the data available from open‑source databases was limited.

## Scripts List
#### Data
- Datasets original downloaded files:
    MOD550-P1-NLB-datasets [folder]
- Datasets import:
    MOD550_P1_NLB_datasets.py
- Datasets preview:
    MOD550-P1-NLB-data_preview.ipynb
- Datasets merge:
    MOD550-P1-NLB-data_merge.py

#### Task 1
- Main dataset manipulation and cleaning:
    MOD550-P1-NLB-crisis_phase_isolation.py
- Main dataset manipulation and export:
    MOD550-P1-NLB-phase_p_year.py
- Reference dataset:
    grfc_phase_3d_points.csv
- Histogram and heat map:
    MOD550-NLB_T1.3-5.ipynb

#### Task 2
- Linear Regression initial class:
    PYTHON_DEF_linearregression.py
- Mean Squared Error initial class:
    PYTHON_DEF_mse.py
- Neural Network initial class:
    PYTHON_DEF_neuralnetwork.py
- K-Means initial class:
    PYTHON_DEF_kmeans.py
- Gaussian Mixture Model initial class:
    PYTHON_DEF_gmm.py
- Function collection:
    MOD550_P1_NLB_DEF_functions.ipynb
- Feature weight pre-run:
    MOD550-P1-NLB-heaviest_feature.ipynb
- Main code:
    MOD550-P1-NLB-Task_2.ipynb
- Neural Network tests:
    MOD550-P1-NLB-NN_variations.ipynb

## Challenges
Most of the time was spent on data manipulation, merging, and cleaning. Most errors came from data mismatches or functions not running due to inconsistent formats. Data manipulation is much easier when visualized with tools like Excel. Python’s text‑only terminal can be harder for beginners to interpret and visualize clearly.  

As this was my first time coding in Python, the learning curve was especially steep. There are many different ways and modules to run similar functions, and online references for learning can be inconsistent. The lack of script examples in class lectures and exercises made knowledge retention more difficult. Even after taking online classes on Python basics and spending many hours watching tutorials and practicing, the required level for a fluid workflow has not been reached yet.

## Main Data Analysis
The initial dataset was not usable in its original form. Many manipulations were required before running any scripts. Operators were very sensitive to sparse and inconsistent data. A lot of time was spent understanding the data and arranging formats to merge external complementary datasets such as *total country population*, *total country GDP*, and *GDP per capita*. I tested multiple approaches to ensure the data structure was compatible before merging. Once the merge was successful, an export file was generated in CSV format for further use.  
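
The merge step described above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual merge script: the column names (`country`, `year`, `phase`, `population`), the sample rows, and the helper names `merge_key` and `left_merge` are all hypothetical. It shows the two points that matter in practice: normalising the join key so that formatting differences do not block a match, and listing unmatched rows before exporting to CSV.

```python
import csv
import io

# Hypothetical rows standing in for the GRFC table and one external table.
crisis_rows = [
    {"country": "Aland", "year": "2023", "phase": "3"},
    {"country": "Borduria", "year": "2023", "phase": "2"},
    {"country": "Borduria", "year": "2024", "phase": "4"},
]
population_rows = [
    {"country": "aland ", "year": "2023", "population": "1200000"},
    {"country": "Borduria", "year": "2023", "population": "5400000"},
]


def merge_key(row):
    """Normalise the join key (trim, lowercase, integer year)."""
    return (row["country"].strip().lower(), int(row["year"]))


def left_merge(left, right, extra_column):
    """Left join: keep every row of `left`, add `extra_column` where a match exists."""
    lookup = {merge_key(r): r[extra_column] for r in right}
    merged = []
    for row in left:
        out = dict(row)
        out[extra_column] = lookup.get(merge_key(row), "")  # empty string = no match
        merged.append(out)
    return merged


merged_rows = left_merge(crisis_rows, population_rows, "population")

# Rows without a partner are worth inspecting before any model is run.
unmatched = [r for r in merged_rows if r["population"] == ""]

# Export as CSV text (a real script would write to a file instead).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["country", "year", "phase", "population"])
writer.writeheader()
writer.writerows(merged_rows)
csv_text = buffer.getvalue()
```

Checking `unmatched` right after the merge is a cheap way to catch the data mismatches mentioned under Challenges before they surface as operator errors.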

Before coding direct operations on the data, I completed the individual Python operator scripts so they could later be called from the main program. All operators were written to be general-purpose so that they can be reused in the near future. Some concepts and explanations were covered in MOD550 course lectures, but most references and application examples came from online sources. All references used are listed in the main data script (second notebook cell).  

I could not launch the operator classes from their original, separate scripts because of kernel and interpreter issues that I was unable to solve. Since I knew the functions would run in a *Jupyter Notebook*, I copied all the function classes into one notebook file and found a method to import functions from one notebook into another (the main script). After testing, this solution worked and I was able to start coding the main script.
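
One way to reuse definitions from another notebook is sketched below. It is not necessarily the method used in this project: it relies only on the fact that an `.ipynb` file is JSON with a `cells` list, and it executes every code cell into a fresh namespace. The helper name `load_notebook_definitions` and the demo notebook are hypothetical, and cells containing IPython magics (lines starting with `%`) would not run this way.

```python
import json
import os
import tempfile


def load_notebook_definitions(path):
    """Execute every code cell of a notebook and return the resulting namespace."""
    with open(path, encoding="utf-8") as handle:
        notebook = json.load(handle)
    namespace = {}
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells
        source = cell.get("source", "")
        if isinstance(source, list):  # the source may be stored as a list of lines
            source = "".join(source)
        exec(source, namespace)
    return namespace


# Build a tiny stand-in notebook so the sketch is self-contained.
demo_notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# helpers"]},
        {
            "cell_type": "code",
            "metadata": {},
            "outputs": [],
            "execution_count": None,
            "source": [
                "def mean_squared_error(y_true, y_pred):\n",
                "    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)\n",
            ],
        },
    ],
}

with tempfile.TemporaryDirectory() as folder:
    notebook_path = os.path.join(folder, "demo_functions.ipynb")
    with open(notebook_path, "w", encoding="utf-8") as handle:
        json.dump(demo_notebook, handle)
    functions = load_notebook_definitions(notebook_path)

mse_value = functions["mean_squared_error"]([1.0, 0.0], [0.5, 0.5])
```

Because every code cell is executed, this approach works best when the function notebook contains definitions only, with no plotting or training at top level.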

Unfortunately, even after spending a lot of time managing and transforming the datasets, this was not enough, and I still encountered multiple issues when executing the operators. After extensive trial and error and troubleshooting, I was able to create a more usable dataset. Once the data was fitted to the operators, some minor changes were needed to improve computation and result visualization.  

Overall, the results would be more meaningful with a non‑binary target variable; in hindsight, a 2D target would have been more interesting than a binary one. Due to lack of time, however, I kept the solution I had already implemented to fix errors and simplify data management.

### Results

Linear regression (LinReg) shows the average trend of the points; see the graphs in **MOD550-P1-NLB-Task_2** for better visualization.

- **Region**
    - **Intercept** (prediction when x=0): 0.78947368
    - Coefficients (**slopes**): [[ 0.02870813  0.08149406 -0.16447368 -0.16947368 -0.19423559 -0.34331984]]
- **Year**
    - **Intercept** (prediction when x=0): 0.66590909
    - Coefficient (**slope**): 0.07992909
- **Total country population**
    - **Intercept** (prediction when x=0): 0.66590909
    - Coefficient (**slope**): 0.11239573
- **Total country GDP (US$)**
    - **Intercept** (prediction when x=0): 0.66590909
    - Coefficient (**slope**): 0.02930666
- **GDP per capita (US$)**
    - **Intercept** (prediction when x=0): 0.66590909
    - Coefficient (**slope**): -0.08913634
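
The four numeric features above share the same intercept, 0.66590909. That is what one would expect if each feature was standardized (zero mean) before fitting, because the least-squares intercept then equals the mean of the target. The sketch below illustrates this with a closed-form one-feature fit in plain Python instead of sklearn; the standardization step is an assumption about the pipeline, and the sample data and function names are hypothetical.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu, sigma = fmean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x (single feature)."""
    x_mean, y_mean = fmean(x), fmean(y)
    sxy = sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y))
    sxx = sum((a - x_mean) ** 2 for a in x)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Hypothetical data: one row per year, binary crisis flag as target.
years = [2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023]
in_crisis = [0, 0, 1, 0, 1, 1, 1, 1]

intercept, slope = fit_line(standardize(years), in_crisis)
target_mean = fmean(in_crisis)  # matches the intercept up to rounding error
```

With a standardized feature, the slope reads as the change in predicted crisis probability per one standard deviation of that feature, which makes slopes of different features roughly comparable.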

Mean squared error (MSE): a lower value implies a better fit between the linear regression and the data.

- **Region MSE**: 0.19441885
- **Year MSE**: 0.21608551
- **Total country population MSE**: 0.20984137
- **Total country GDP (US$) MSE**: 0.22161529
- **GDP per capita (US$) MSE**: 0.21452889
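
For reference, an MSE function in vanilla Python can be as short as the sketch below (the function name and sample values are illustrative, not taken from PYTHON_DEF_mse.py). It also computes a useful yardstick: for a binary target with mean p, always predicting p gives an MSE of p(1 - p). Assuming the shared intercept 0.66590909 is the target mean, that baseline is about 0.2225, so the values above are only slightly better than a constant prediction.

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences between observed and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("inputs must not be empty")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical binary target with mean 0.75 and a constant prediction of that mean.
observed = [1, 0, 1, 1]
constant_guess = [0.75] * len(observed)
mse_constant = mean_squared_error(observed, constant_guess)  # equals 0.75 * 0.25

# Baseline for this report, assuming 0.66590909 is the mean of the binary target.
p = 0.66590909
baseline = p * (1 - p)
```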

Neural Network (NN):

- **NN results of Region**
    - Prediction run: 14/14 steps, 0s, 8ms/step
    - Region loss: 0.27917298674583435
    - Region MSE: 0.19958528
    - Region mean prediction: 0.6615747213363647
- **NN results of Year**
    - Prediction run: 14/14 steps, 0s, 7ms/step
    - Year loss: 0.2727109491825104
    - Year MSE: 0.22032226
    - Year mean prediction: 0.6611434817314148
- **NN results of Total country population**
    - Prediction run: 14/14 steps, 0s, 6ms/step
    - Total country population loss: 0.27564677596092224
    - Total country population MSE: 0.20647663
    - Total country population mean prediction: 0.6511742472648621
- **NN results of Total country GDP (US$)**
    - Prediction run: 14/14 steps, 0s, 8ms/step
    - Total country GDP (US$) loss: 0.2835673689842224
    - Total country GDP (US$) MSE: 0.22156743
    - Total country GDP (US$) mean prediction: 0.6589770913124084
- **NN results of GDP per capita (US$)**
    - Prediction run: 14/14 steps, 0s, 7ms/step
    - GDP per capita (US$) loss: 0.2695464491844177
    - GDP per capita (US$) MSE: 0.21395775
    - GDP per capita (US$) mean prediction: 0.6609631776809692

K-means elbow (KME), see graphs in **MOD550-P1-NLB-Task_2** for better visualization. A lower inertia (compared to the other results) means the points lie closer to their cluster center; compact clusters indicate a good fit.

- **Region inertia**: 21.06
- **Year inertia**: 18.59
- **Total country population inertia**: 9.08
- **Total country GDP (US$) inertia**: 9.26
- **GDP per capita (US$) inertia**: 18.81

K-means with optimized cluster number (KMO). As above, a lower inertia (compared to the other results) means the points lie closer to their cluster center; compact clusters indicate a good fit.

- **Region inertia**: 159.10
- **Year inertia**: 204.44
- **Total country population inertia**: 94.30
- **Total country GDP (US$) inertia**: 111.84
- **GDP per capita (US$) inertia**: 95.21
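
Inertia is the sum of squared distances from each point to its nearest cluster center. The sketch below computes it for a small one-dimensional k-means written in plain Python; it is an illustration with hypothetical data and a simple deterministic initialisation, not the project's K-Means class. It shows why inertia always drops as the number of clusters grows, which is the reason the elbow of the curve is read instead of the minimum.

```python
def inertia(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(min((p - c) ** 2 for c in centers) for p in points)


def nearest(point, centers):
    """Index of the center closest to `point`."""
    return min(range(len(centers)), key=lambda j: (point - centers[j]) ** 2)


def kmeans_1d(points, k, iterations=50):
    """Lloyd's algorithm in one dimension with a deterministic start."""
    ordered = sorted(points)
    # Spread the initial centers evenly over the sorted data.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        labels = [nearest(p, centers) for p in points]
        updated = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            # Keep the old center if a cluster ends up empty.
            updated.append(sum(members) / len(members) if members else centers[j])
        if updated == centers:  # converged
            break
        centers = updated
    return centers


# Hypothetical standardized feature values forming three visible groups.
values = [-1.2, -1.0, -0.9, 0.0, 0.1, 0.2, 2.0, 2.2, 2.4]
inertias = {k: inertia(values, kmeans_1d(values, k)) for k in (1, 2, 3, 4)}
```

Note that inertia values are only comparable between features when the features share the same scale and the same number of clusters, so comparing the two lists above feature by feature needs some care.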

Gaussian mixture model (GMM), see graphs in **MOD550-P1-NLB-Task_2** for better visualization:

- **GMM results of Year**
    - Cluster means: [ 0.92377692 -0.75754307]
    - Covariance matrices: [[[0.21741716]], [[0.3680864 ]]]
    - Label distribution: [201 239]
- **GMM results of Total country population**
    - Cluster means: [-0.47817034  2.4695347   0.06353903]
    - Covariance matrices: [[[0.02405191]], [[1.58616376]], [[0.07437892]]]
    - Label distribution: [276  44 120]
- **GMM results of Total country GDP (US$)**
    - Cluster means: [-0.41982386  0.86459924  9.43457342]
    - Covariance matrices: [[[1.07878607e-02]], [[1.41444836e+00]], [[1.00000000e-06]]]
    - Label distribution: [309 130   1]
- **GMM results of GDP per capita (US$)**
    - Cluster means: [-0.67038876  1.11906907 -0.13225283  4.1176204 ]
    - Covariance matrices: [[[0.01818492]], [[0.4702217 ]], [[0.06941185]], [[1.02115492]]]
    - Label distribution: [207  91 133   9]
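
To read these numbers, it helps to see how a fitted mixture assigns a point to a cluster. The sketch below takes the Year means and variances reported above and computes, for a given standardized year value, the probability of belonging to each component. The mixture weights are not reported, so they are approximated here from the label distribution (201 and 239 out of 440), which is an assumption; the function names are illustrative.

```python
import math


def gaussian_pdf(x, mean, variance):
    """Density of a one-dimensional normal distribution."""
    return math.exp(-((x - mean) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)


def responsibilities(x, weights, means, variances):
    """Posterior probability of each mixture component given the value x."""
    weighted = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(weighted)
    return [value / total for value in weighted]


# Parameters reported for the Year feature (standardized units).
year_means = [0.92377692, -0.75754307]
year_variances = [0.21741716, 0.3680864]

# Assumption: weights approximated by the share of labels in each cluster.
label_counts = [201, 239]
year_weights = [count / sum(label_counts) for count in label_counts]

resp_late = responsibilities(1.0, year_weights, year_means, year_variances)
resp_early = responsibilities(-1.0, year_weights, year_means, year_variances)
```

Under these assumptions, a value near +1 falls almost entirely in the first component and a value near -1 in the second, so the Year mixture essentially splits the period into later and earlier years.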


## Discussion
- (a) Discuss how different functions can be used in the linear regression, and different NN architectures.
- (b) Discuss how you can use the validation data for the different cases.
- (c) Discuss the different outcomes from the different models when using the full dataset to train and when you use a different ML approach.
- (d) Discuss the outcomes you get for K-means and GMM.
- (e) Discuss how you can integrate supervised and unsupervised methods for your case.