Previously, we looked at data where we knew what belonged to what: the data was labeled, and we could learn from those labels using supervised learning techniques. Often, though, data isn't neatly labeled.

There can be many attributes or features, but no label to indicate which group each row belongs to. Without labels, we'd be hard-pressed to build a classification model. We wouldn't even know what classes we're assigning things to!

One approach to this situation is to look for "natural" groupings in the data. Clustering is one such approach: it looks at the attributes in the data and groups together rows that have similar values for those attributes.
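
To make the idea of grouping rows with similar values concrete, here is a minimal sketch using `kmeans2` from `scipy.cluster.vq`. The toy points and the choice of starting centroids are assumptions made for illustration; they are not taken from the UN data.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# Hypothetical toy data: six unlabeled 2-D points forming two obvious groups.
points = np.array([
    [1.0, 1.1], [0.9, 1.0], [1.2, 0.8],   # rows with small values
    [8.0, 8.2], [7.9, 8.1], [8.3, 7.7],   # rows with large values
])

# Start from two of the observed points so the run is reproducible.
initial_centroids = points[[0, 3]]
centroids, labels = kmeans2(points, initial_centroids, minit="matrix")

# Each row receives a cluster index, even though no labels were supplied.
print(labels)
print(centroids)
```

The first three rows end up sharing one cluster index and the last three share the other, which is the kind of "natural" grouping clustering looks for.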

In [15]:
# Source: https://drive.google.com/file/d/0B7Vwt0JZZE6qSzIyckNiU0x3U3c/view
# Local source: /Users/AshRajBala/Repositories/ThinkfulProjects/undata.csv

import pandas as pd

# Load un data
un = pd.read_csv("/Users/AshRajBala/Repositories/ThinkfulProjects/un.csv")

In [21]:
len(un)


207

In [23]:
# Count the non-null values present in each column
un.count()

# Highest number of missing values (lowest values in output):
# educationMale              76
# educationFemale            76

country                   207
region                    207
tfr                       197
contraception             144
educationMale              76
educationFemale            76
lifeMale                  196
lifeFemale                196
infantMortality           201
GDPperCapita              197
economicActivityMale      165
economicActivityFemale    165
illiteracyMale            160
illiteracyFemale          160
dtype: int64

Based on the number of non-null values, the best-populated numeric columns to cluster on are tfr, lifeMale, lifeFemale, infantMortality, and GDPperCapita.
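
The same selection can be made programmatically rather than by eye. The sketch below assumes a hypothetical miniature table and an arbitrary 75% coverage threshold; neither comes from the original analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature table standing in for the UN data.
toy = pd.DataFrame({
    "tfr": [6.9, 2.6, np.nan, 3.8],
    "educationMale": [np.nan, np.nan, np.nan, 10.1],
    "infantMortality": [154.0, 32.0, 44.0, 11.0],
})

# Fraction of non-null values in each column.
coverage = toy.count() / len(toy)

# Keep only columns that are at least 75% populated (an assumed threshold).
usable = coverage[coverage >= 0.75].index.tolist()
print(usable)
```

Sparsely populated columns such as `educationMale` fall below the threshold and are left out.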

In [19]:
# Determine datatype of each column
un.dtypes

country                    object
region                     object
tfr                       float64
contraception             float64
educationMale             float64
educationFemale           float64
lifeMale                  float64
lifeFemale                float64
infantMortality           float64
GDPperCapita              float64
economicActivityMale      float64
economicActivityFemale    float64
illiteracyMale            float64
illiteracyFemale          float64
dtype: object

In [20]:
# How many country entries (rows) are present in the data? 207
len(un['country'])

207

We're going to see how lifeMale, lifeFemale, and infantMortality cluster against GDPperCapita, keeping in mind that we don't know in advance how many clusters there will be.
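
When the number of clusters is unknown, one common heuristic is to fit k-means for a range of k and look for the "elbow" where the distortion stops dropping sharply. The sketch below runs on synthetic points, not the UN data, and the elbow heuristic is an assumption about how k might be chosen rather than a method confirmed by this notebook.

```python
import numpy as np
from scipy.cluster.vq import kmeans, whiten

# Synthetic data: three well-separated blobs of 20 points each (an assumption).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

# Scale each feature to unit variance before clustering.
scaled = whiten(points)

# Fit k-means for k = 1..9 and record the mean distance to the nearest centroid.
distortions = []
for k in range(1, 10):
    _, distortion = kmeans(scaled, k)
    distortions.append(distortion)

for k, d in zip(range(1, 10), distortions):
    print(k, round(float(d), 3))
```

Plotting `distortions` against k would typically show a sharp drop up to the true number of groups and a much flatter curve afterwards.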

In [24]:
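# lifeMale, lifeFemale, infantMortality and GDPperCapita sit at zero-based
# positions 6-9, so keep the first ten columns (.iloc replaces the removed .ix)
data = un.iloc[:, :10]
# Drop every row that has a missing value in any of those ten columns
data = data.dropna()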
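
Because `dropna()` is called without arguments, a row is discarded if any of the selected columns is missing, including columns that are not used for clustering. The sketch below uses a hypothetical toy table to contrast that with pandas' `subset` argument, which checks only the named columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table: 'contraception' is sparse, the other columns less so.
toy = pd.DataFrame({
    "contraception": [np.nan, 47.0, np.nan],
    "infantMortality": [154.0, 32.0, np.nan],
    "GDPperCapita": [2848.0, 863.0, 1531.0],
})

# Drops any row with a missing value in ANY column.
all_cols = toy.dropna()

# Drops a row only if one of the named columns is missing.
subset = toy.dropna(subset=["infantMortality", "GDPperCapita"])

print(len(all_cols), len(subset))
```

Restricting the check to the columns actually being clustered keeps more rows, at the cost of producing a different data set from the one printed in this notebook.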

In [26]:
# Collect [infantMortality, GDPperCapita] pairs, one per remaining row
thelist = data[['infantMortality', 'GDPperCapita']].values.tolist()
print(thelist)

[[44.0, 1531.0], [6.0, 20046.0], [6.0, 29006.0], [14.0, 12545.0], [18.0, 9073.0], [7.0, 26582.0], [30.0, 2569.0], [56.0, 3640.0], [97.0, 165.0], [114.0, 205.0], [6.0, 18943.0], [13.0, 4736.0], [9.0, 1983.0], [9.0, 4450.0], [89.0, 117.0], [7.0, 33191.0], [34.0, 1508.0], [54.0, 973.0], [39.0, 1660.0], [98.0, 96.0], [12.0, 2433.0], [7.0, 26444.0], [122.0, 321.0], [23.0, 343.0], [6.0, 29632.0], [5.0, 22898.0], [14.0, 4325.0], [48.0, 1019.0], [95.0, 11308.0], [12.0, 1779.0], [4.0, 41718.0], [9.0, 9736.0], [14.0, 15757.0], [86.0, 359.0], [16.0, 1764.0], [72.0, 486.0], [142.0, 142.0], [52.0, 388.0], [6.0, 25635.0], [7.0, 16866.0], [44.0, 464.0], [5.0, 33734.0], [25.0, 6232.0], [39.0, 1860.0], [45.0, 2497.0], [35.0, 1093.0], [8.0, 10428.0], [17.0, 14013.0], [24.0, 1570.0], [58.0, 1106.0], [48.0, 3230.0], [7.0, 14111.0], [65.0, 1389.0], [5.0, 26253.0], [5.0, 42416.0], [33.0, 3573.0], [14.0, 4083.0], [44.0, 2814.0], [6.0, 18913.0], [7.0, 26037.0], [21.0, 3496.0], [103.0, 382.0]]


In [27]:
# 'Whiten' the data (scale each feature to unit variance) before clustering for k in range(1, 10)
from scipy.cluster.vq import whiten
w = whiten(thelist)
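
For reference, `whiten` divides each column by its standard deviation so that features on very different scales (infant mortality in tens, GDP in thousands) contribute comparably to distance calculations. A minimal sketch, using three of the rows from the list printed above:

```python
import numpy as np
from scipy.cluster.vq import whiten

# Three [infantMortality, GDPperCapita] rows taken from the printed list.
obs = np.array([[44.0, 1531.0], [6.0, 20046.0], [14.0, 12545.0]])

# whiten() is equivalent to dividing each column by its standard deviation.
manual = obs / obs.std(axis=0)
scaled = whiten(obs)

print(np.allclose(scaled, manual))
print(scaled.std(axis=0))
```

After scaling, every column has a standard deviation of 1. The values differ from the notebook's `w` because the standard deviations here come from three rows rather than the full list.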

In [28]:
w

array([[ 1.27920867,  0.13177427],
       [ 0.17443755,  1.72537358],
       [ 0.17443755,  2.49656719],
       [ 0.40702094,  1.07975713],
       [ 0.52331264,  0.78091961],
       [ 0.20351047,  2.28793178],
       [ 0.87218773,  0.22111567],
       [ 1.62808377,  0.31329741],
       [ 2.82007367,  0.01420167],
       [ 3.31431338,  0.0176445 ],
       [ 0.17443755,  1.63043758],
       [ 0.37794802,  0.40763091],
       [ 0.26165632,  0.17067823],
       [ 0.26165632,  0.38301469],
       [ 2.58749027,  0.01007027],
       [ 0.20351047,  2.85677314],
       [ 0.98847943,  0.12979464],
       [ 1.56993792,  0.08374681],
       [ 1.13384405,  0.14287739],
       [ 2.84914659,  0.00826279],
       [ 0.34887509,  0.20941005],
       [ 0.20351047,  2.27605402],
       [ 3.54689678,  0.0276287 ],
       [ 0.66867726,  0.02952226],
       [ 0.17443755,  2.55044746],
       [ 0.14536462,  1.97084726],
       [ 0.40702094,  0.37225585],
       [ 1.39550037,  0.08770606],
       [ 2.76192782,