# Table of contents

## EDA
1. [Problem statement and data collection](#1-problem-statement-and-data-collection)
2. [Exploration and data cleaning](#2-exploration-and-data-cleaning)
   - [2.1: Understanding the features](#21-understanding-the-features)
   - [2.2: Identifying null values](#22-identifying-null-values)
   - [2.3: Eliminating duplicate values](#23-eliminating-duplicate-values)
   - [2.4: Eliminating variables](#24-eliminating-variables)

## Machine Learning
3. [K-Means](#3-k-means)
4. [KNN](#4-knn)



-----------------------------------------------------------------------------------------------------------------

## EDA

## 1. Problem statement and data collection

In [11]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns
import json
from sklearn.model_selection import train_test_split
import warnings
# Silence all warnings by replacing warnings.warn with a no-op
def warn(*args, **kwargs):
    pass
warnings.warn = warn
warnings.filterwarnings("ignore", category=FutureWarning)

pd.set_option('display.max_columns', None)
sns.set(
    style="whitegrid",
    palette="pastel",
)
pd.set_option('display.precision', 2)

In [6]:
total_data = pd.read_csv("https://raw.githubusercontent.com/4GeeksAcademy/k-means-project-tutorial/main/housing.csv")
total_data.head()

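|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|--------|----------|----------|-----------|------------|----------|----------|-----------|-------------|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |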


## 2. Exploration and data cleaning

### 2.1 Understanding the features

We want to be able to classify houses according to their region and median income. To do this, we will use the famous California Housing dataset. It was constructed using data from the 1990 California census. It contains one row per census block group. A block group is the smallest geographic unit for which US Census data is published.

We are only interested in the `Latitude`, `Longitude` and `MedInc` columns.

In [7]:
total_data.shape

(20640, 9)

### 2.2 Identifying null values

In [8]:
total_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   MedInc       20640 non-null  float64
 1   HouseAge     20640 non-null  float64
 2   AveRooms     20640 non-null  float64
 3   AveBedrms    20640 non-null  float64
 4   Population   20640 non-null  float64
 5   AveOccup     20640 non-null  float64
 6   Latitude     20640 non-null  float64
 7   Longitude    20640 non-null  float64
 8   MedHouseVal  20640 non-null  float64
dtypes: float64(9)
memory usage: 1.4 MB
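
The `info()` output above reports 20640 non-null values in every column, so there is nothing to impute. As a more direct check, null values can also be counted per column. The sketch below uses a small toy DataFrame (the `toy` name and its values are illustrative assumptions, not rows from the real dataset) so that it runs without downloading anything:

```python
import numpy as np
import pandas as pd

# Toy stand-in for total_data; the values are made up for illustration
toy = pd.DataFrame({
    "MedInc": [8.3252, np.nan, 7.2574],
    "Latitude": [37.88, 37.86, 37.85],
    "Longitude": [-122.23, -122.22, np.nan],
})

# Number of missing values in each column
null_counts = toy.isnull().sum()
print(null_counts)

# Number of rows that contain at least one missing value
rows_with_nulls = int(toy.isnull().any(axis=1).sum())
print(rows_with_nulls)
```

Running the same two expressions on `total_data` should give zero for every column, consistent with the `info()` summary.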


### 2.3 Eliminating duplicate values

In [9]:
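# drop_duplicates() returns a new DataFrame, so the result has to be assigned back
total_data = total_data.drop_duplicates()
total_data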

|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|--------|----------|----------|-----------|------------|----------|----------|-----------|-------------|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.023810 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.971880 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.802260 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20635 | 1.5603 | 25.0 | 5.045455 | 1.133333 | 845.0 | 2.560606 | 39.48 | -121.09 | 0.781 |
| 20636 | 2.5568 | 18.0 | 6.114035 | 1.315789 | 356.0 | 3.122807 | 39.49 | -121.21 | 0.771 |
| 20637 | 1.7000 | 17.0 | 5.205543 | 1.120092 | 1007.0 | 2.325635 | 39.43 | -121.22 | 0.923 |
| 20638 | 1.8672 | 18.0 | 5.329513 | 1.171920 | 741.0 | 2.123209 | 39.43 | -121.32 | 0.847 |
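
Before dropping anything, it can help to count how many rows are exact duplicates. A minimal sketch, again on a toy DataFrame whose name and values are assumptions made for illustration:

```python
import pandas as pd

# Toy stand-in for total_data; the second row repeats the first on purpose
toy = pd.DataFrame({
    "MedInc": [8.3252, 8.3252, 7.2574],
    "Latitude": [37.88, 37.88, 37.85],
    "Longitude": [-122.23, -122.23, -122.24],
})

# duplicated() flags every repeat of an earlier row
n_duplicates = int(toy.duplicated().sum())
print(n_duplicates)

# drop_duplicates() returns a new DataFrame and leaves `toy` itself unchanged
deduplicated = toy.drop_duplicates().reset_index(drop=True)
print(len(toy), len(deduplicated))
```

If `duplicated().sum()` is zero on the real data, the drop step leaves the row count unchanged.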


### 2.4 Eliminating variables

In [10]:
# Keep only MedInc, Latitude and Longitude
total_data = total_data.drop(columns=["HouseAge", "AveRooms", "AveBedrms", "Population", "AveOccup", "MedHouseVal"])
total_data.head()

|   | MedInc | Latitude | Longitude |
|---|--------|----------|-----------|
| 0 | 8.3252 | 37.88 | -122.23 |
| 1 | 8.3014 | 37.86 | -122.22 |
| 2 | 7.2574 | 37.85 | -122.24 |
| 3 | 5.6431 | 37.85 | -122.25 |
| 4 | 3.8462 | 37.85 | -122.25 |


## Machine Learning

### 3. K-Means

In [16]:
from sklearn.cluster import KMeans

X = total_data.copy()

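# Training the model
model = KMeans(n_clusters=3, random_state=42)
model.fit(X)

# Making predictions with new data. The model was fitted on three features,
# so each new row needs MedInc, Latitude and Longitude, in that order.
# The values below are illustrative, not taken from the dataset.
new_data = pd.DataFrame(
    [[2.0, 34.0, -118.0], [4.0, 38.0, -122.0], [8.0, 37.0, -121.0]],
    columns=X.columns,
)
predictions = model.predict(new_data)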
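
Once a K-Means model is fitted, the cluster assigned to each row and the cluster centers can be inspected, and any new point must have the same three features the model was trained on. The sketch below illustrates this on a toy DataFrame with three well-separated groups; the `toy` data, the `feature_cols` name and the explicit `n_init=10` are assumptions for the example, not part of the original notebook:

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy stand-in for the three selected columns; values are illustrative only
toy = pd.DataFrame({
    "MedInc": [8.3, 8.1, 2.5, 2.7, 5.0, 5.2],
    "Latitude": [37.88, 37.86, 34.05, 34.10, 39.48, 39.43],
    "Longitude": [-122.23, -122.22, -118.24, -118.30, -121.09, -121.22],
})
feature_cols = ["MedInc", "Latitude", "Longitude"]

# Fit the model and get one cluster label per row
model = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = model.fit_predict(toy[feature_cols])
toy["cluster"] = labels

# One center per cluster, expressed in the same three features
centers = pd.DataFrame(model.cluster_centers_, columns=feature_cols)
print(centers)

# A new point needs the same columns, in the same order, as the training data
new_point = pd.DataFrame([[8.0, 37.8, -122.3]], columns=feature_cols)
new_label = int(model.predict(new_point)[0])
print(new_label)
```

The label numbers themselves are arbitrary; what matters is which rows share a label. Here the new point lands in the same cluster as the first two rows, which it closely resembles.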