# Unsupervised Machine Learning - KMeans Clustering Analysis on Environmental Data and Life Ladder Values.

Finalized and cleaned per capita dataset: `dataset/per_capita_MAIN_ds.csv`.

We use the `KMeans` estimator from the scikit-learn library to group countries into clusters based on our environmental data. Clustering lets us look for hidden patterns that other types of analysis might miss. The resulting table of cluster labels can be used to colour countries on a world map, along with other visualizations, to present the analysis.

The initial KMeans analysis was the driving force behind converting the Greenhouse Gases and CO<sub>2</sub> Emissions columns to per capita values: the raw totals were heavily weighted toward the largest countries and skewed the results. After the conversion, China, India, and the United States fall more in line with the rest of the countries.
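To illustrate why the per capita conversion matters, here is a minimal sketch using made-up figures for three hypothetical countries. The `CO2 Emissions` and `Population` column names and all values are assumptions for illustration only; only the `pc CO2 Emissions` name comes from the dataset.

```python
import pandas as pd

# Made-up figures for three hypothetical countries (not taken from the dataset).
demo = pd.DataFrame(
    {
        "Country": ["Large", "Medium", "Small"],
        "CO2 Emissions": [10_000.0, 500.0, 50.0],  # totals, arbitrary units
        "Population": [1_400.0, 60.0, 5.0],        # millions
    }
).set_index("Country")

# Dividing by population removes the effect of sheer country size.
demo["pc CO2 Emissions"] = demo["CO2 Emissions"] / demo["Population"]
print(demo)
```

In this toy table the largest country dominates the totals, while the per capita column puts all three on a comparable scale.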

In [1]:
%matplotlib inline
# Dependencies and data.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# SQL Database access.
from sqlalchemy import create_engine
from sqlalchemy.ext.automap import automap_base
from sqlalchemy.orm import Session

# To run this yourself, supply your own PostgreSQL password (pg_pass in config.py) and adjust the database name.
from config import pg_pass

# Style.
from matplotlib import style
style.use('fivethirtyeight')

# Turn off warnings.
import warnings
warnings.filterwarnings('ignore')

## EXTRACT

### SQL Database Connection

Using SQLAlchemy, we connect to the PostgreSQL database to access the dataset. We map the database with `automap_base()` and reflect its tables. Storing the required table as a variable, we query it to get the data and use Pandas to read the result into a DataFrame.

In [2]:
# Connect to the PostgreSQL - pgAdmin database.
engine = create_engine(f"postgresql+psycopg2://postgres:{pg_pass}@localhost:5434/final_project")
# Reflect database, reflect tables.
Base = automap_base()
# SQLAlchemy 1.4+ form; on older versions use Base.prepare(engine, reflect=True).
Base.prepare(autoload_with=engine)

In [3]:
# Save the main dataset as a class.
MainPC = Base.classes.main_pcapita
# Create a query session.
session = Session(engine)
# Query the database to get the table of data.
q = session.query(MainPC)
# Create a DataFrame from the queried data.
df = pd.read_sql(q.statement, engine)
df.head()

|   | ID | Country | Year | Life Ladder | Temperature | Clean Water | PM2.5 | pc Greenhouse Gas Emissions | pc CO2 Emissions |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Afghanistan | 2010 | 4.758 | 14.629 | 48.28708 | 52.49585 | 0.001539 | 0.287738 |
| 1 | 2 | Afghanistan | 2011 | 3.832 | 16.487 | 50.82785 | 57.09972 | 0.001947 | 0.401953 |
| 2 | 3 | Afghanistan | 2012 | 3.783 | 14.373 | 53.40352 | 55.46611 | 0.002142 | 0.327922 |
| 3 | 4 | Afghanistan | 2013 | 3.572 | 16.156 | 56.01404 | 59.62277 | 0.002318 | 0.26157 |
| 4 | 5 | Afghanistan | 2014 | 3.131 | 15.647 | 58.65937 | 62.72192 | 0.002536 | 0.232968 |
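For readers without access to the PostgreSQL instance, the read step can be tried against a stand-in. The sketch below is not the notebook's connection: it assumes Python's built-in `sqlite3` module with an in-memory table and made-up rows, and passes a plain SQL string to `pd.read_sql` instead of an automapped query.

```python
import sqlite3

import pandas as pd

# Stand-in database: an in-memory SQLite table with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE main_pcapita ("ID" INTEGER, "Country" TEXT, "Year" INTEGER, "Life Ladder" REAL)'
)
conn.executemany(
    "INSERT INTO main_pcapita VALUES (?, ?, ?, ?)",
    [(1, "Aland", 2010, 5.1), (2, "Aland", 2011, None), (3, "Borduria", 2010, 6.3)],
)
conn.commit()

# Read the whole table straight into a DataFrame.
demo_df = pd.read_sql("SELECT * FROM main_pcapita", conn)
conn.close()
print(demo_df)
```

The SQL `NULL` in the second row arrives as `NaN`, which is what the cleaning step in the next section removes.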


## TRANSFORM

We remove all rows containing null values, along with the `ID` column, which is not needed for clustering.

In [4]:
# Drop every row that contains a NaN value, then drop the ID column.
df = df.dropna()
df = df.drop(columns=['ID'])
df.head()

|   | Country | Year | Life Ladder | Temperature | Clean Water | PM2.5 | pc Greenhouse Gas Emissions | pc CO2 Emissions |
|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 2010 | 4.758 | 14.629 | 48.28708 | 52.49585 | 0.001539 | 0.287738 |
| 1 | Afghanistan | 2011 | 3.832 | 16.487 | 50.82785 | 57.09972 | 0.001947 | 0.401953 |
| 2 | Afghanistan | 2012 | 3.783 | 14.373 | 53.40352 | 55.46611 | 0.002142 | 0.327922 |
| 3 | Afghanistan | 2013 | 3.572 | 16.156 | 56.01404 | 59.62277 | 0.002318 | 0.26157 |
| 4 | Afghanistan | 2014 | 3.131 | 15.647 | 58.65937 | 62.72192 | 0.002536 | 0.232968 |
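Before dropping rows, it can help to check how much data the cleaning step removes. This is a small sketch on a made-up frame; the rows are hypothetical and only the column names mirror the dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows; one of them is missing its Life Ladder value.
demo = pd.DataFrame(
    {
        "ID": [1, 2, 3],
        "Country": ["A", "A", "B"],
        "Year": [2010, 2011, 2010],
        "Life Ladder": [4.8, np.nan, 6.1],
    }
)

# Count the missing values per column, then apply the same cleaning as above.
missing_per_column = demo.isna().sum()
cleaned = demo.dropna().drop(columns=["ID"])

print(missing_per_column)
print(f"Dropped {len(demo) - len(cleaned)} of {len(demo)} rows")
```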


## LOAD

We split the data by year (2010-2018), producing nine tables, and run each one through its own KMeans model. The resulting cluster labels are combined into one dataset that we save for the map rendering and other analysis.

In [5]:
# Years list.
years = df['Year'].unique().tolist()
year_dfs = {}

# Fill dictionary with corresponding DataFrames.
for year in years:
    year_dfs[year] = df[df.Year == year]
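The same split can also be written with `groupby`. The sketch below runs on a small made-up frame rather than the project data, and the names `demo` and `year_frames` are illustrative only.

```python
import pandas as pd

# Made-up rows standing in for the cleaned dataset.
demo = pd.DataFrame(
    {
        "Country": ["A", "B", "A", "B"],
        "Year": [2010, 2010, 2011, 2011],
        "Life Ladder": [4.0, 6.0, 4.5, 6.5],
    }
)

# groupby yields (year, sub-frame) pairs, so a dict comprehension builds the lookup.
year_frames = {year: group for year, group in demo.groupby("Year")}
print(year_frames[2010])
```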

In [6]:
# Make an empty dictionary to store the model predictions.
preds = {}
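The loop in the next cell fixes `n_clusters=4`. One common way to sanity-check such a choice is to compare inertia (the elbow method) and silhouette scores across several values of k. The sketch below does this on synthetic data from `make_blobs`, not on the project data, so its numbers say nothing about the real dataset; the range of k and the blob settings are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 200 points with 5 features.
X_demo, _ = make_blobs(n_samples=200, centers=4, n_features=5, random_state=99)
X_demo = StandardScaler().fit_transform(X_demo)

scores = {}
for k in range(2, 8):
    km_demo = KMeans(n_clusters=k, n_init=10, random_state=99).fit(X_demo)
    scores[k] = {
        "inertia": km_demo.inertia_,  # within-cluster sum of squares
        "silhouette": silhouette_score(X_demo, km_demo.labels_),
    }

for k, s in scores.items():
    print(k, round(s["inertia"], 1), round(s["silhouette"], 3))
```

A bend in the inertia curve or a peak in the silhouette score suggests a reasonable k. Applying the same loop to each year's `scaled_df` would show whether four clusters is a sensible setting for this data.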

In [7]:
# Loop through each year's DataFrame.
for year, year_df in year_dfs.items():
    # Set the index as the Country column and drop the Year column.
    features = year_df.set_index('Country').drop(columns=['Year'])
    year_dfs[year] = features
    # Scale the features to zero mean and unit variance.
    X_scale = StandardScaler().fit_transform(features)
    # Create a scaled DataFrame from the scaled array.
    scaled_df = pd.DataFrame(X_scale, index=features.index, columns=features.columns)

    # Create a KMeans model, fit the scaled DataFrame to it.
    km = KMeans(n_clusters=4, random_state=99)
    km.fit(scaled_df)

    # Store the cluster labels as a Series indexed by country - the key is the year.
    preds[year] = pd.Series(km.predict(scaled_df), index=scaled_df.index)

In [8]:
# Create a DataFrame from the predictions; the Series align on their country index.
pred_df = pd.DataFrame(preds)
pred_df.head()

| Country | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 |
|---|---|---|---|---|---|---|---|---|---|
| Afghanistan | 0 | 2 | 0 | 2 | 3 | 2 | 3 | 2 | 2 |
| Albania | 1 | 1 | 2 | 1 | 1 | 0 | 1 | 0 | 0 |
| Argentina | 1 | 3 | 3 | 0 | 2 | 1 | 2 | 0 | 1 |
| Armenia | 1 | 1 | 2 | 1 | 1 | 0 | 1 | 0 | 0 |
| Australia | 2 | 0 | 1 | 3 | 0 | 3 | 0 | 3 | 3 |
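KMeans numbers its clusters arbitrarily on each fit, so label 2 in one year need not describe the same kind of country as label 2 in another. The table above is consistent with this: Albania and Armenia share a label in every year, yet the number itself changes from year to year. One possible way to make labels more comparable is to renumber each year's clusters by the centroid value of a reference feature. The helper below, `relabel_by_feature`, is a hypothetical sketch of that idea and is not part of the original notebook.

```python
import numpy as np


def relabel_by_feature(labels, centers, feature_idx):
    """Renumber clusters so label 0 has the lowest centroid value on one feature."""
    order = np.argsort(centers[:, feature_idx])  # old labels, sorted by centroid value
    mapping = np.empty_like(order)
    mapping[order] = np.arange(len(order))       # old label -> new rank
    return mapping[np.asarray(labels)]


# Toy check: three clusters whose centroids sit at 5.0, 1.0 and 3.0.
centers_demo = np.array([[5.0], [1.0], [3.0]])
labels_demo = [0, 1, 2, 1]
print(relabel_by_feature(labels_demo, centers_demo, 0))  # [2 0 1 0]
```

Inside the yearly loop, `km.cluster_centers_` and `scaled_df.columns.get_loc('Life Ladder')` could supply the centroids and the feature position. This orders clusters by a single feature only; fitting one model on all years pooled together is another option if fully comparable clusters are needed.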


In [9]:
# Store the results for plotting purposes.
pred_df.to_csv('predictions/ML_KMeans_By_Year.csv')

In [10]:
# Stack the data into a long format (one row per country and year) for visualization in Tableau.
viz_df = pd.DataFrame(pred_df.stack())
viz_df.index.set_names(['Country', 'Year'], inplace=True)
viz_df.columns = ['Class']
viz_df.to_csv('predictions/ML_KMeans_BY_Year_Stacked.csv')
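To show what the stacking step does, the sketch below reshapes a tiny made-up wide table into the long Country/Year/Class layout. The values are illustrative, not model output.

```python
import pandas as pd

# Made-up wide table: one row per country, one column per year.
wide = pd.DataFrame(
    {2010: [0, 1], 2011: [2, 1]},
    index=pd.Index(["A", "B"], name="Country"),
)

# stack() moves the year columns into a second index level.
long_series = wide.stack().rename("Class")
long_series.index = long_series.index.set_names(["Country", "Year"])

# reset_index() turns both index levels into ordinary columns.
long_df = long_series.reset_index()
print(long_df)
```

Each (country, year) pair becomes its own row, which is the shape most plotting and dashboard tools expect.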