# California Housing (K-means)

---

Imported Libraries

In [28]:
# Data processing
# ==================================================================================
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Preprocessing and modeling
# ==================================================================================
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans

---

## Step 1: Loading the dataset

In [10]:
_df_ = pd.read_csv("https://raw.githubusercontent.com/4GeeksAcademy/k-means-project-tutorial/main/housing.csv")
_df_.head(3)

|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |


> NOTE: In this case, we are only interested in the `Latitude`, `Longitude` and `MedInc` columns.

In [11]:
# Keep only the `Latitude`, `Longitude` and `MedInc` columns.

df = _df_[['Latitude', 'Longitude', 'MedInc']]
df.head(3)

|   | Latitude | Longitude | MedInc |
|---|---|---|---|
| 0 | 37.88 | -122.23 | 8.3252 |
| 1 | 37.86 | -122.22 | 8.3014 |
| 2 | 37.85 | -122.24 | 7.2574 |


**Description and types of Data**

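- `Latitude` --> Latitude of the block group (`float64`).

- `Longitude` --> Longitude of the block group (`float64`).

- `MedInc` --> Median income in the block group (`float64`).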

---

## Step 2: Study of variables and their content

In [12]:
# Obtain dimensions

rows, columns = df.shape

print(f"The dimensions of this dataset are: {rows} Rows and {columns} Columns")

The dimensions of this dataset are: 20640 Rows and 3 Columns


In [13]:
# Obtain information about data types and non-null values

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Latitude   20640 non-null  float64
 1   Longitude  20640 non-null  float64
 2   MedInc     20640 non-null  float64
dtypes: float64(3)
memory usage: 483.9 KB


In [14]:
# Check null values

null_var = df.isnull().sum().loc[lambda x: x > 0] # Null count for each variable that has any nulls.

num_of_null_var = len(null_var) # Number of variables with at least 1 null.

print(f"{null_var}\n\nThe number of variables with null values is {num_of_null_var}")

Series([], dtype: int64)

The number of variables with null values is 0


### 2.1 Divide the dataset into train and test

Split the dataset into `train` and `test` sets, as in previous lessons. These sets are not used to compute evaluation metrics here (there is no `y` to compare against), but you can fit the unsupervised algorithm on the training set and then predict which cluster each new point in the test set belongs to.

In [None]:
# Train - Test - Split
# ===============================================================================
def split(dataset,
          # target,  # In unsupervised learning we don't have a target
          test_size=0.2,
          random_state=42):
    """Split the dataset into train and test sets (no target column)."""
    X = dataset  # Features
    # y = dataset[target]  # Target

    X_train, X_test = train_test_split(X,
                                       # y,  # In unsupervised learning we don't have a target
                                       test_size=test_size,
                                       random_state=random_state)

    return X_train, X_test

In [27]:
X_train, X_test = split(df)


X_train.head(3)

|   | Latitude | Longitude | MedInc |
|---|---|---|---|
| 14196 | 32.71 | -117.03 | 3.2596 |
| 8267 | 33.77 | -118.16 | 3.8125 |
| 17445 | 34.66 | -120.48 | 4.1563 |
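
The idea described in this step, fitting on the training rows and then assigning clusters to unseen rows, can be sketched as follows. This is only an illustration under stated assumptions: `demo_df` is a hypothetical stand-in filled with random values (not the housing data), the `demo_*` names are made up so they do not overwrite the notebook's variables, and `n_init=10` is set explicitly as a choice of this sketch.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for `df`: random points with the same three columns.
rng = np.random.default_rng(42)
demo_df = pd.DataFrame({
    "Latitude": rng.uniform(32.5, 42.0, size=500),
    "Longitude": rng.uniform(-124.3, -114.3, size=500),
    "MedInc": rng.uniform(0.5, 15.0, size=500),
})

# Same split settings as the `split` helper above.
demo_train, demo_test = train_test_split(demo_df, test_size=0.2, random_state=42)

# Fit the clustering model on the training rows only.
demo_model = KMeans(n_clusters=6, n_init=10, random_state=42)
demo_model.fit(demo_train)

# Assign each unseen test row to its nearest learned centroid.
test_clusters = demo_model.predict(demo_test)
print(test_clusters[:10])
```

With the real data, `demo_train` and `demo_test` would simply be `X_train` and `X_test`.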


---

## Step 3: Build a K-means model

Classify the data into **6 clusters** using the **K-Means model**. Then store the cluster each house belongs to as a new column in the dataset; you could call it `cluster`. Before adding it, check the format and values of the labels, since you may need to convert them to a categorical type. Finally, draw a scatter plot of the result and describe what you see.

In [30]:
# Training the model

model = KMeans(n_clusters=6, random_state=42)

model.fit(df)
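
Before storing the labels, it can help to look at their format, as the task description suggests. The sketch below is a minimal illustration, assuming a hypothetical `demo_df` of random values in place of the housing data; the `demo_*` names and `n_init=10` are choices of this sketch, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical stand-in for `df`: random points with the same three columns.
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    "Latitude": rng.uniform(32.5, 42.0, size=300),
    "Longitude": rng.uniform(-124.3, -114.3, size=300),
    "MedInc": rng.uniform(0.5, 15.0, size=300),
})

demo_model = KMeans(n_clusters=6, n_init=10, random_state=42).fit(demo_df)

# `labels_` is a NumPy array with one integer cluster id per fitted row.
labels = demo_model.labels_
print(type(labels), labels.dtype, labels.shape)
print(np.unique(labels))

# The ids are names, not quantities, so a categorical column is a reasonable fit.
demo_clustered = demo_df.assign(cluster=pd.Categorical(labels))
print(demo_clustered["cluster"].dtype)
print(demo_clustered["cluster"].value_counts().sort_index())
```

Whether to keep the column as integers or convert it to `category` is a judgment call; the categorical form mainly prevents the ids from being treated as numbers in later plots or summaries.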

In [38]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
# (df was created as a slice of _df_).
df = df.assign(cluster=model.labels_)

df.head(3)


|   | Latitude | Longitude | MedInc | cluster |
|---|---|---|---|---|
| 0 | 37.88 | -122.23 | 8.3252 | 2 |
| 1 | 37.86 | -122.22 | 8.3014 | 2 |
| 2 | 37.85 | -122.24 | 7.2574 | 2 |
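
The plotting part of this step could look roughly like the sketch below, which colors each point by its cluster on a longitude/latitude scatter plot. It is an illustration under assumptions: `demo_df` is a hypothetical random stand-in for the housing data, the output file name `clusters_demo.png` is made up, and matplotlib is an extra dependency that the notebook does not import.

```python
import matplotlib
matplotlib.use("Agg")  # Non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical stand-in for `df`: random points with the same three columns.
rng = np.random.default_rng(7)
demo_df = pd.DataFrame({
    "Latitude": rng.uniform(32.5, 42.0, size=400),
    "Longitude": rng.uniform(-124.3, -114.3, size=400),
    "MedInc": rng.uniform(0.5, 15.0, size=400),
})

# Cluster on the three feature columns and store the label of each row.
features = ["Latitude", "Longitude", "MedInc"]
demo_df["cluster"] = KMeans(n_clusters=6, n_init=10, random_state=42).fit_predict(demo_df[features])

# One dot per row, positioned by location and colored by cluster id.
fig, ax = plt.subplots(figsize=(7, 6))
points = ax.scatter(demo_df["Longitude"], demo_df["Latitude"],
                    c=demo_df["cluster"], cmap="tab10", s=10)
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
ax.set_title("K-means clusters (k = 6)")
ax.legend(*points.legend_elements(), title="cluster")
fig.savefig("clusters_demo.png", dpi=100)
```

Inside a notebook, the `matplotlib.use("Agg")` line can be dropped so the figure renders inline. Because `MedInc` is not on either axis, clusters that overlap geographically may still differ in income.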
