# K-means: California housing


## Notebook set-up

In [None]:
# Standard library imports
from pathlib import Path

# Core data science libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

# Machine learning libraries
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.cluster import KMeans
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Custom helper functions for visualization and analysis
import functions as funcs

RANDOM_SEED = 315

## 1. Data loading

In this section, we load the California housing dataset from a remote URL, save a local copy for future use, and perform initial data inspection. We'll also filter the dataset to keep only the features needed for our clustering analysis: median income and geographic coordinates.

### 1.1. Load data from URL

In [None]:
# Load California housing data from remote CSV
url = 'https://raw.githubusercontent.com/4GeeksAcademy/k-means-project-tutorial/main/housing.csv'
data_df = pd.read_csv(url)

### 1.2. Save a local copy

In [None]:
# Your code here...
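
One possible way to save a local copy is sketched below. The helper name `save_local_copy`, the tiny stand-in frame, and the `data/housing.csv` path are assumptions made for illustration; in the notebook you would pass `data_df` and a path of your choosing.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_local_copy(df: pd.DataFrame, path: Path) -> Path:
    """Write df to path as CSV, creating parent directories if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)
    return path


# Tiny stand-in frame (assumption); in the notebook this would be data_df
demo_df = pd.DataFrame({
    'MedInc': [8.3, 7.2],
    'Latitude': [37.88, 37.85],
    'Longitude': [-122.23, -122.24],
})

# Hypothetical location; a temporary directory keeps the demo free of side effects
with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = save_local_copy(demo_df, Path(tmp_dir) / 'data' / 'housing.csv')
    reloaded_df = pd.read_csv(saved_path)

print(reloaded_df)
```

Writing with `index=False` avoids storing the row index as an extra column, so the reloaded file has the same columns as the original.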


### 1.3. Inspect

In [None]:
# Your code here...
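
A minimal inspection sketch, assuming a small illustrative frame in place of `data_df`. It uses only standard pandas calls: `head()`, `info()`, `describe()`, and `isna()`.

```python
import pandas as pd

# Small illustrative frame (assumption); in the notebook this would be data_df
demo_df = pd.DataFrame({
    'MedInc': [8.3252, 8.3014, 7.2574],
    'Latitude': [37.88, 37.86, 37.85],
    'Longitude': [-122.23, -122.22, -122.24],
})

# First rows, column types and non-null counts
print(demo_df.head())
demo_df.info()

# Summary statistics and missing values per column
summary_df = demo_df.describe()
missing_counts = demo_df.isna().sum()
print(summary_df)
print(missing_counts)
```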


### 1.4. Remove unnecessary features

In [None]:
# Select only the location and median income features, as specified in the assignment
data_df = data_df[['MedInc', 'Latitude', 'Longitude']]
data_df.info()

## 2. EDA

### 2.1. Feature distributions

In [None]:
# Your code here...


### 2.2. Feature correlations

In [None]:
# Your code here...
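
As a sketch of one way to look at feature correlations, the block below computes a Pearson correlation matrix with `DataFrame.corr()`. The data are synthetic random values (an assumption), so the coefficients themselves carry no meaning here.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumption); in the notebook this would be data_df
rng = np.random.default_rng(315)
demo_df = pd.DataFrame({
    'MedInc': rng.uniform(0.5, 15.0, size=100),
    'Latitude': rng.uniform(32.5, 42.0, size=100),
    'Longitude': rng.uniform(-124.3, -114.3, size=100),
})

# Pairwise Pearson correlation between all numeric columns
corr_df = demo_df.corr(method='pearson')
print(corr_df.round(2))
```

The resulting matrix can be passed to `sns.heatmap(corr_df, annot=True)` if a visual summary is preferred.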


## 3. Data preparation

### 3.1. Train-test split

In [None]:
# Your code here...
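
A minimal sketch of the split, assuming synthetic stand-in data and an 80/20 ratio (the ratio is an assumption, not something the assignment fixes). The seed matches the notebook's `RANDOM_SEED`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

RANDOM_SEED = 315

# Synthetic stand-in data (assumption); in the notebook this would be data_df
rng = np.random.default_rng(RANDOM_SEED)
demo_df = pd.DataFrame({
    'MedInc': rng.uniform(0.5, 15.0, size=100),
    'Latitude': rng.uniform(32.5, 42.0, size=100),
    'Longitude': rng.uniform(-124.3, -114.3, size=100),
})

# Hold out 20% of the rows; the fixed seed makes the split reproducible
train_df, test_df = train_test_split(
    demo_df, test_size=0.2, random_state=RANDOM_SEED
)

print(train_df.shape, test_df.shape)
```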


### 3.2. Feature scaling

In [None]:
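# Scale the features so that they all share the same range. sklearn's MinMaxScaler() (imported above) or StandardScaler() are good options here.
# Fit the scaler on the training data only, then apply it to both the training and test data.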
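
One way this could look with `MinMaxScaler` is sketched below, assuming synthetic training and test frames. The scaler is fitted on the training rows only, so no information from the test set leaks into the scaling.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in frames (assumption); in the notebook these come from the split
rng = np.random.default_rng(315)
feature_names = ['MedInc', 'Latitude', 'Longitude']
train_df = pd.DataFrame(rng.normal(size=(80, 3)), columns=feature_names)
test_df = pd.DataFrame(rng.normal(size=(20, 3)), columns=feature_names)

# Learn each feature's min and max from the training data only
scaler = MinMaxScaler()
scaled_train_df = pd.DataFrame(
    scaler.fit_transform(train_df), columns=feature_names, index=train_df.index
)

# Reuse the training min and max for the test data
scaled_test_df = pd.DataFrame(
    scaler.transform(test_df), columns=feature_names, index=test_df.index
)

print(scaled_train_df.describe().loc[['min', 'max']])
```

Because the test data are scaled with the training minimum and maximum, scaled test values can fall slightly outside [0, 1]; that is expected.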



## 4. Clustering


### 4.1. Find clusters

In [None]:
# Fit a KMeans() model to the training data, and extract the cluster labels. You can get the list of cluster assignments
# for the training data with the .labels_ attribute of the fitted KMeans model.
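
A minimal sketch of the fit, assuming synthetic, already-scaled features. The choice of `n_clusters=6` is an assumption for illustration; pick the value your assignment or analysis calls for.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

RANDOM_SEED = 315

# Synthetic, already-scaled features (assumption); use the scaled training data in the notebook
rng = np.random.default_rng(RANDOM_SEED)
scaled_train_df = pd.DataFrame(
    rng.uniform(0, 1, size=(200, 3)),
    columns=['MedInc', 'Latitude', 'Longitude'],
)

# n_clusters=6 is an illustrative assumption; n_init=10 restarts the algorithm 10 times
kmeans = KMeans(n_clusters=6, n_init=10, random_state=RANDOM_SEED)
kmeans.fit(scaled_train_df)

# One cluster index per training row, in row order
train_labels = kmeans.labels_
print(train_labels[:10])
print(kmeans.cluster_centers_.shape)
```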


### 4.2. Add cluster label to training data

In [None]:
# Your code here...
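
Attaching the labels could look like the sketch below. The column name `cluster` and the synthetic data are assumptions; the key point is that `labels_` is in the same row order as the data the model was fitted on.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

RANDOM_SEED = 315

# Synthetic, already-scaled training features (assumption)
rng = np.random.default_rng(RANDOM_SEED)
train_df = pd.DataFrame(
    rng.uniform(0, 1, size=(200, 3)),
    columns=['MedInc', 'Latitude', 'Longitude'],
)

# n_clusters=6 is an illustrative assumption
kmeans = KMeans(n_clusters=6, n_init=10, random_state=RANDOM_SEED).fit(train_df)

# Work on a copy so the feature-only frame stays untouched
clustered_train_df = train_df.copy()
clustered_train_df['cluster'] = kmeans.labels_

# Number of training rows in each cluster
cluster_sizes = clustered_train_df['cluster'].value_counts().sort_index()
print(cluster_sizes)
```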


### 4.3. Plot results

In [None]:
# Plot geographic distribution of clusters using 2D scatter plot
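
A sketch of a 2D scatter plot with matplotlib, assuming a synthetic frame with random coordinates and random cluster labels in a hypothetical `cluster` column. In the notebook you would plot the labeled training data instead.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic coordinates and random labels (assumption), for illustration only
rng = np.random.default_rng(315)
plot_df = pd.DataFrame({
    'Latitude': rng.uniform(32.5, 42.0, size=200),
    'Longitude': rng.uniform(-124.3, -114.3, size=200),
    'cluster': rng.integers(0, 6, size=200),
})

# Longitude on x and latitude on y gives a map-like orientation
fig, ax = plt.subplots(figsize=(6, 6))
points = ax.scatter(
    plot_df['Longitude'], plot_df['Latitude'],
    c=plot_df['cluster'], cmap='tab10', s=10,
)
ax.set_xlabel('Longitude')
ax.set_ylabel('Latitude')
ax.set_title('K-means clusters by location')
fig.colorbar(points, ax=ax, label='Cluster')
```

If the features were scaled before clustering, plotting the unscaled coordinates keeps the axes readable as real latitude and longitude.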


## 5. Supervised classification model

In this final section, we build a supervised classification model to predict cluster membership based on the features alone. We'll train a Gradient Boosting Classifier using the cluster labels from K-means as our target variable, evaluate its performance through cross-validation, and test its accuracy on held-out data. This demonstrates how unsupervised clustering results can be used to create supervised learning models.

### 5.1. Features & labels

In [None]:
# Set up the features (MedInc, Latitude and Longitude) and the label (the cluster assigned by the KMeans model) for the training and test sets
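
One way to assemble features and labels is sketched below, assuming synthetic scaled data and `n_clusters=6`. Training labels come from `labels_`, and test labels come from `predict()` on the same fitted K-means model, so both sets use the same cluster definitions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

RANDOM_SEED = 315
feature_names = ['MedInc', 'Latitude', 'Longitude']

# Synthetic, already-scaled train and test features (assumption)
rng = np.random.default_rng(RANDOM_SEED)
train_df = pd.DataFrame(rng.uniform(0, 1, size=(160, 3)), columns=feature_names)
test_df = pd.DataFrame(rng.uniform(0, 1, size=(40, 3)), columns=feature_names)

# n_clusters=6 is an illustrative assumption
kmeans = KMeans(n_clusters=6, n_init=10, random_state=RANDOM_SEED).fit(train_df)

# Training set: features plus the labels found during fitting
X_train = train_df[feature_names]
y_train = kmeans.labels_

# Test set: assign each row to its nearest learned centroid
X_test = test_df[feature_names]
y_test = kmeans.predict(X_test)

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
```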


### 5.2. Model training

In [None]:
# Train a classification model to predict the cluster label
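
A minimal training sketch with `GradientBoostingClassifier`. Here `make_blobs` supplies synthetic features and group labels that stand in for the K-means cluster labels; both the data and the default hyperparameters are assumptions.

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import GradientBoostingClassifier

RANDOM_SEED = 315

# Four well-separated synthetic groups; y_train stands in for the K-means labels (assumption)
centers = [[0, 0, 0], [10, 10, 10], [-10, 10, 0], [10, -10, 0]]
X_train, y_train = make_blobs(
    n_samples=300, centers=centers, cluster_std=1.0, random_state=RANDOM_SEED
)

# Default hyperparameters; the seed makes the fit reproducible
model = GradientBoostingClassifier(random_state=RANDOM_SEED)
model.fit(X_train, y_train)

print(model.classes_)
print(model.predict(X_train[:5]))
```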


### 5.3. Cross-validation

In [None]:
# Cross validate the classification model
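
Cross-validation could be sketched with `cross_val_score`, as below. The synthetic `make_blobs` data and the choice of 5 folds are assumptions. For a classifier with an integer `cv`, scikit-learn uses stratified folds.

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

RANDOM_SEED = 315

# Synthetic stand-in for the training features and cluster labels (assumption)
centers = [[0, 0, 0], [10, 10, 10], [-10, 10, 0], [10, -10, 0]]
X_train, y_train = make_blobs(
    n_samples=300, centers=centers, cluster_std=1.0, random_state=RANDOM_SEED
)

model = GradientBoostingClassifier(random_state=RANDOM_SEED)

# Five folds: each fold is held out once while the model trains on the other four
scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')

print(f'Accuracy per fold: {scores.round(3)}')
print(f'Mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})')
```

Only the training data go into cross-validation; the held-out test set stays untouched until the final evaluation.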

### 5.4. Model evaluation

In [None]:
# Evaluate the classification model on the held-out test data
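
A sketch of the final evaluation with `accuracy_score`, again on synthetic `make_blobs` data standing in for the real features and cluster labels (an assumption). The model is fitted on the training portion and scored once on the held-out portion.

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

RANDOM_SEED = 315

# Synthetic stand-in for the features and cluster labels (assumption)
centers = [[0, 0, 0], [10, 10, 10], [-10, 10, 0], [10, -10, 0]]
X, y = make_blobs(
    n_samples=300, centers=centers, cluster_std=1.0, random_state=RANDOM_SEED
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=RANDOM_SEED
)

# Fit on the training portion only
model = GradientBoostingClassifier(random_state=RANDOM_SEED)
model.fit(X_train, y_train)

# Fraction of held-out rows whose predicted label matches the true label
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print(f'Test accuracy: {test_accuracy:.3f}')
```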
