# Unsupervised Learning

Overview

In the past two lessons, you learned what Unsupervised Machine Learning is, which problems suit a solution based on it, and how to apply it, and you practiced implementing the basic phases of a solution using Scikit-learn. Now it is time to put that conceptual and procedural knowledge to work on a larger project. Choose a problem domain that motivates you, and build a complete solution that implements all the phases covered in previous chapters. A dedicated section in this lesson suggests some interesting problem domains, but be creative and adventurous, and explore other options as well. This lesson does not present any new material: everything you need to complete this project was discussed in previous lessons.

External Interface Requirements

    Input requirement: capacity to read a dataset stored on disk.
    Output requirement: report on optimal number of clusters, centroid coordinates and quality metric.
    Output requirement: identifiers of classes corresponding to new instances classified by the model.

Functional Requirements

    The software must learn a clusterization of the dataset.
    The software must use the learned clusterization to classify new problem instances.
    The software must evaluate the quality of a clusterization.
    The software must be flexible enough to work with different preconfigured numbers of clusters.
    The software must compare results using different numbers of clusters and determine which number of clusters is best.
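
One way to approach the last two requirements is to fit the model for a range of cluster counts and score each fit. The sketch below does this with the silhouette score on synthetic data from `make_blobs`. The helper name `best_k_by_silhouette`, the range 2 to 6, and the synthetic data are assumptions for illustration, not part of the assignment.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def best_k_by_silhouette(X, k_values, random_state=0):
    """Fit KMeans for each k and return (best_k, {k: silhouette score})."""
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(X)
        # Silhouette lies in [-1, 1]; higher means tighter, better separated clusters.
        scores[k] = silhouette_score(X, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Synthetic stand-in for a real dataset: three well separated blobs.
X_demo, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                       cluster_std=0.5, random_state=0)
best_k, scores = best_k_by_silhouette(X_demo, range(2, 7))
print(best_k, scores)
```

The silhouette score needs at least two clusters, so the range starts at 2. Other quality metrics could be swapped in without changing the loop.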

Technical Requirements

    Use Python as programming language.
    Use Pandas for reading the dataset into a Pandas dataframe.
    Use Scikit-learn for training and testing the Machine Learning model.

Necessary Deliverables

    Python application that performs ETL, training, and testing.
    Report containing quality metrics, an explanation of the dataset, and the experimental procedure (the range of cluster counts tested, how the range was traversed, etc.).
    Optional: build a supervised model that uses your cluster labels as the target.

Suggestions to Get Started

    Find an interesting dataset! Look in the Useful Resources section for sources of ideas.
    If you do not find a pre-existing dataset on the problem domain that you like, be creative: consider building the dataset yourself and donating the dataset to one of the public Machine Learning repositories.
    Break down the project into smaller tasks, for instance: importing the dataset, training, etc.
    Decide whether you will create a single Python application or several Python applications.

Potential Project Ideas

    Segment smartphone users according to phone usage and apps installed.
    Segment healthy people under 50 years of age according to their risk or propensity of suffering from Alzheimer's disease after 70 years of age.
    Classify different customers from e-commerce data.


In [1]:
# silhouette score
# from yellowbrick.cluster import KElbowVisualizer

In [2]:
# starting libraries, additional ones will be added when needed

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
import plotly

%matplotlib inline
sns.set()

In [3]:
# display options

pd.options.display.max_columns = None
pd.set_option('display.max_rows', 200)

### data import and cleaning

In [4]:
df = pd.read_csv(r"C:\Users\aciag\ih\Week7_project\data\terror\globalterrorismdb_0718dist.csv", encoding='latin-1')


Columns (4,6,31,33,61,62,63,76,79,90,92,94,96,114,115,121) have mixed types. Specify dtype option on import or set low_memory=False.



In [5]:
df.shape

(181691, 135)

In [6]:
df.isna().sum()

eventid                    0
iyear                      0
imonth                     0
iday                       0
approxdate            172452
extended                   0
resolution            179471
country                    0
country_txt                0
region                     0
region_txt                 0
provstate                421
city                     434
latitude                4556
longitude               4557
specificity                6
vicinity                   0
location              126196
summary                66129
crit1                      0
crit2                      0
crit3                      0
doubtterr                  1
alternative           152680
alternative_txt       152680
multiple                   1
success                    0
suicide                    0
attacktype1                0
attacktype1_txt            0
attacktype2           175377
attacktype2_txt       175377
attacktype3           181263
attacktype3_txt       181263
targtype1     

In [7]:
percent_missing = pd.DataFrame((df.isnull().sum() * 100 / len(df)).sort_values(ascending=False))
percent_missing

Unnamed: 0,0
gsubname3,99.988992
weapsubtype4_txt,99.961473
weapsubtype4,99.961473
weaptype4,99.959822
weaptype4_txt,99.959822
claimmode3,99.926799
claimmode3_txt,99.926799
gsubname2,99.911938
claim3,99.824978
guncertain3,99.823877


In [8]:
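# NOTE: isnull().sum() returns the number of missing cells per column, not a
# percentage, so this keeps only columns with at most 15 missing values.
max_missing_count = 15
# copy() so that later column assignments do not act on a view of df
df1 = df.loc[:, (df.isnull().sum(axis=0) <= max_missing_count)].copy()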

In [9]:
df1.shape

(181691, 32)
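
The filter above compares the per-column count of missing cells against the threshold, which is why only 32 columns survive. If a percentage threshold is what is wanted, a sketch along the following lines could be used instead. The helper name `drop_sparse_columns` and the toy frame are hypothetical. Applied to `df`, such a filter would likely keep more columns (for example `latitude`, with 4556 missing values out of 181691 rows, about 2.5%).

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(frame, max_percent_missing):
    """Keep only columns whose share of missing values is at or below the threshold."""
    percent_missing = frame.isna().mean() * 100  # per-column percentage
    return frame.loc[:, percent_missing <= max_percent_missing]


# Tiny hypothetical frame: 'b' is 50% missing, 'c' is 25% missing.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [np.nan, np.nan, 1.0, 2.0],
    "c": [1.0, np.nan, 3.0, 4.0],
})
kept = drop_sparse_columns(toy, max_percent_missing=30.0)
print(list(kept.columns))
```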

In [10]:
df1.head()

Unnamed: 0,eventid,iyear,imonth,iday,extended,country,country_txt,region,region_txt,specificity,vicinity,crit1,crit2,crit3,doubtterr,multiple,success,suicide,attacktype1,attacktype1_txt,targtype1,targtype1_txt,gname,individual,weaptype1,weaptype1_txt,property,dbsource,INT_LOG,INT_IDEO,INT_MISC,INT_ANY
0,197000000001,1970,7,2,0,58,Dominican Republic,2,Central America & Caribbean,1.0,0,1,1,1,0.0,0.0,1,0,1,Assassination,14,Private Citizens & Property,MANO-D,0,13,Unknown,0,PGIS,0,0,0,0
1,197000000002,1970,0,0,0,130,Mexico,1,North America,1.0,0,1,1,1,0.0,0.0,1,0,6,Hostage Taking (Kidnapping),7,Government (Diplomatic),23rd of September Communist League,0,13,Unknown,0,PGIS,0,1,1,1
2,197001000001,1970,1,0,0,160,Philippines,5,Southeast Asia,4.0,0,1,1,1,0.0,0.0,1,0,1,Assassination,10,Journalists & Media,Unknown,0,13,Unknown,0,PGIS,-9,-9,1,1
3,197001000002,1970,1,0,0,78,Greece,8,Western Europe,1.0,0,1,1,1,0.0,0.0,1,0,3,Bombing/Explosion,7,Government (Diplomatic),Unknown,0,6,Explosives,1,PGIS,-9,-9,1,1
4,197001000003,1970,1,0,0,101,Japan,4,East Asia,1.0,0,1,1,1,-9.0,0.0,1,0,7,Facility/Infrastructure Attack,7,Government (Diplomatic),Unknown,0,8,Incendiary,1,PGIS,-9,-9,1,1


In [11]:
df1 = df1.drop_duplicates()
df1.shape

(181691, 32)

In [12]:
# print(df1.INT_ANY.value_counts())

In [13]:
# drop columns that don't carry meaning?
#cols=['INT_ANY','INT_MISC','INT_IDEO','INT_LOG','dbsource','','','']
#df1.drop

In [14]:
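# Build a date from the day, month and year columns.
# errors='coerce' returns NaT for strings that cannot be parsed (e.g. day or month 0).
# NOTE: no explicit format is given, so an ambiguous string such as "2-7-1970"
# is read month-first (see 1970-02-07 in the output below); passing
# format='%d-%m-%Y' would avoid this.
df1['event_date'] = (pd.to_datetime(df1['iday'].astype(str) + '-' +
                                    df1['imonth'].astype(str) + '-' +
                                    df1['iyear'].astype(str), errors='coerce'))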


In [15]:
df1.head(5)

Unnamed: 0,eventid,iyear,imonth,iday,extended,country,country_txt,region,region_txt,specificity,vicinity,crit1,crit2,crit3,doubtterr,multiple,success,suicide,attacktype1,attacktype1_txt,targtype1,targtype1_txt,gname,individual,weaptype1,weaptype1_txt,property,dbsource,INT_LOG,INT_IDEO,INT_MISC,INT_ANY,event_date
0,197000000001,1970,7,2,0,58,Dominican Republic,2,Central America & Caribbean,1.0,0,1,1,1,0.0,0.0,1,0,1,Assassination,14,Private Citizens & Property,MANO-D,0,13,Unknown,0,PGIS,0,0,0,0,1970-02-07
1,197000000002,1970,0,0,0,130,Mexico,1,North America,1.0,0,1,1,1,0.0,0.0,1,0,6,Hostage Taking (Kidnapping),7,Government (Diplomatic),23rd of September Communist League,0,13,Unknown,0,PGIS,0,1,1,1,NaT
2,197001000001,1970,1,0,0,160,Philippines,5,Southeast Asia,4.0,0,1,1,1,0.0,0.0,1,0,1,Assassination,10,Journalists & Media,Unknown,0,13,Unknown,0,PGIS,-9,-9,1,1,NaT
3,197001000002,1970,1,0,0,78,Greece,8,Western Europe,1.0,0,1,1,1,0.0,0.0,1,0,3,Bombing/Explosion,7,Government (Diplomatic),Unknown,0,6,Explosives,1,PGIS,-9,-9,1,1,NaT
4,197001000003,1970,1,0,0,101,Japan,4,East Asia,1.0,0,1,1,1,-9.0,0.0,1,0,7,Facility/Infrastructure Attack,7,Government (Diplomatic),Unknown,0,8,Incendiary,1,PGIS,-9,-9,1,1,NaT
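
Row 0 above shows 1970-02-07 although `iday` is 2 and `imonth` is 7, so the day and month were swapped during parsing. The following sketch shows one way to make the order explicit. The small `parts` frame is a hypothetical miniature of the three date columns, not the real data.

```python
import pandas as pd

# Hypothetical miniature of the iday / imonth / iyear columns.
parts = pd.DataFrame({
    "iday": [2, 0, 15],
    "imonth": [7, 1, 12],
    "iyear": [1970, 1970, 1970],
})

date_strings = (parts["iday"].astype(str) + "-" +
                parts["imonth"].astype(str) + "-" +
                parts["iyear"].astype(str))

# An explicit day-month-year format removes the ambiguity; errors='coerce'
# still turns impossible dates (day 0) into NaT.
event_date = pd.to_datetime(date_strings, format="%d-%m-%Y", errors="coerce")
print(event_date)
```

With the explicit format, the first row parses as 2 July 1970 and the row with day 0 still becomes `NaT`.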


In [16]:
df1 = df1.dropna(axis=0)
df1 = df1.drop(['iyear','iday','imonth'], axis=1)

In [17]:
df1.shape

(180792, 30)

In [18]:
df1.head()

Unnamed: 0,eventid,extended,country,country_txt,region,region_txt,specificity,vicinity,crit1,crit2,crit3,doubtterr,multiple,success,suicide,attacktype1,attacktype1_txt,targtype1,targtype1_txt,gname,individual,weaptype1,weaptype1_txt,property,dbsource,INT_LOG,INT_IDEO,INT_MISC,INT_ANY,event_date
0,197000000001,0,58,Dominican Republic,2,Central America & Caribbean,1.0,0,1,1,1,0.0,0.0,1,0,1,Assassination,14,Private Citizens & Property,MANO-D,0,13,Unknown,0,PGIS,0,0,0,0,1970-02-07
5,197001010002,0,217,United States,1,North America,1.0,0,1,1,1,0.0,0.0,1,0,2,Armed Assault,3,Police,Black Nationalists,0,5,Firearms,1,Hewitt Project,-9,-9,0,-9,1970-01-01
6,197001020001,0,218,Uruguay,3,South America,1.0,0,1,1,1,0.0,0.0,0,0,1,Assassination,3,Police,Tupamaros (Uruguay),0,5,Firearms,0,PGIS,0,0,0,0,1970-02-01
7,197001020002,0,217,United States,1,North America,1.0,0,1,1,1,1.0,0.0,1,0,3,Bombing/Explosion,21,Utilities,Unknown,0,6,Explosives,1,Hewitt Project,-9,-9,0,-9,1970-02-01
8,197001020003,0,217,United States,1,North America,1.0,0,1,1,1,0.0,0.0,1,0,7,Facility/Infrastructure Attack,4,Military,New Year's Gang,0,8,Incendiary,1,Hewitt Project,0,0,0,0,1970-02-01


In [19]:
df1.gname.value_counts()
# TODO: remove the *_txt columns (country, region, targtype1_txt)
# TODO: reshape the date column into a month feature (month=+1)

Unknown                                             82454
Taliban                                              7468
Islamic State of Iraq and the Levant (ISIL)          5613
Shining Path (SL)                                    4532
Farabundo Marti National Liberation Front (FMLN)     3343
                                                    ...  
Right-Wing National Youth Front                         1
Union of the Peoples of the Arabian Peninsula           1
Burmese refugees                                        1
Death to the Demobilized Militias                       1
Communist Territorial Unit                              1
Name: gname, Length: 3517, dtype: int64

### preprocessing

In [20]:
# we will drop text columns that already have numerical representation in the dataset

df2 = df1.drop(['country_txt','region_txt','targtype1_txt', 'weaptype1_txt', 'attacktype1_txt', 'dbsource'], axis=1)

In [21]:
#df2.dbsource.value_counts()
df2.gname.value_counts()[0:10]

Unknown                                             82454
Taliban                                              7468
Islamic State of Iraq and the Levant (ISIL)          5613
Shining Path (SL)                                    4532
Farabundo Marti National Liberation Front (FMLN)     3343
Al-Shabaab                                           3280
New People's Army (NPA)                              2752
Irish Republican Army (IRA)                          2658
Revolutionary Armed Forces of Colombia (FARC)        2465
Boko Haram                                           2418
Name: gname, dtype: int64

In [22]:
df2.gname.value_counts()

Unknown                                             82454
Taliban                                              7468
Islamic State of Iraq and the Levant (ISIL)          5613
Shining Path (SL)                                    4532
Farabundo Marti National Liberation Front (FMLN)     3343
                                                    ...  
Right-Wing National Youth Front                         1
Union of the Peoples of the Arabian Peninsula           1
Burmese refugees                                        1
Death to the Demobilized Militias                       1
Communist Territorial Unit                              1
Name: gname, Length: 3517, dtype: int64

In [23]:
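# count events per group name
s = df2.gname.value_counts()

# keep the names of groups with at least 2417 events (the ten largest, down to
# Boko Haram with 2418) and lump every other group into 'Other'
df2['org_claim'] = np.where(df2['gname'].isin(s.index[s >= 2417]), df2['gname'], 'Other')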
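
The hard-coded 2417 works only for this particular snapshot of the data. A sketch of the same idea that selects the N most frequent names directly is shown below. The toy `names` series and the cutoff `top_n = 2` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical group names standing in for the gname column.
names = pd.Series(["A", "A", "A", "B", "B", "C", "D", "A", "B", "E"])

top_n = 2  # assumed cutoff; the notebook effectively keeps the 10 largest groups
counts = names.value_counts()
top_names = counts.nlargest(top_n).index

# Keep the name for the most frequent groups and lump the rest into 'Other'.
grouped = pd.Series(np.where(names.isin(top_names), names, "Other"))
print(grouped.value_counts())
```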

In [24]:
#df2 = pd.get_dummies(data=df2, columns=['attacktype1_txt'], drop_first=True)
#df2 = pd.get_dummies(data=df2, columns=['dbsource'], drop_first=True)

In [25]:
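# one-hot encode org_claim; drop_first=True drops the first category
# (alphabetically Al-Shabaab), which becomes the implicit baseline
df2 = pd.get_dummies(data=df2, columns=['org_claim'], drop_first=True)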

In [26]:
df2.reset_index(drop=True, inplace=True)

In [27]:
# dropping time column for now
df2.drop(['event_date','eventid'], axis=1, inplace=True)

In [28]:
df2.drop(['gname'], axis=1, inplace=True)

In [29]:
df2.sample(10)

Unnamed: 0,extended,country,region,specificity,vicinity,crit1,crit2,crit3,doubtterr,multiple,success,suicide,attacktype1,targtype1,individual,weaptype1,property,INT_LOG,INT_IDEO,INT_MISC,INT_ANY,org_claim_Boko Haram,org_claim_Farabundo Marti National Liberation Front (FMLN),org_claim_Irish Republican Army (IRA),org_claim_Islamic State of Iraq and the Levant (ISIL),org_claim_New People's Army (NPA),org_claim_Other,org_claim_Revolutionary Armed Forces of Colombia (FARC),org_claim_Shining Path (SL),org_claim_Taliban,org_claim_Unknown
84856,0,182,11,1.0,0,1,1,1,0.0,0.0,1,0,3,12,0,6,-9,-9,-9,1,1,0,0,0,0,0,0,0,0,0,1
12063,0,98,8,1.0,0,1,1,1,0.0,0.0,1,0,3,6,0,6,1,1,1,1,1,0,0,0,0,0,1,0,0,0,0
159303,0,95,10,1.0,0,1,1,1,0.0,0.0,1,0,3,14,0,6,-9,-9,-9,0,-9,0,0,0,0,0,0,0,0,0,1
60073,1,167,9,1.0,0,1,1,1,-9.0,0.0,1,0,4,18,0,5,0,-9,-9,0,-9,0,0,0,0,0,0,0,0,0,1
65690,0,235,9,1.0,0,1,1,1,-9.0,0.0,1,0,1,14,0,5,0,-9,-9,1,1,0,0,0,0,0,0,0,0,0,1
36431,0,186,6,3.0,0,1,1,0,1.0,0.0,1,0,2,4,0,5,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0
24726,0,45,3,2.0,0,1,1,1,0.0,0.0,1,0,5,14,0,5,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0
46296,0,159,3,1.0,0,1,1,1,0.0,0.0,1,0,3,2,0,5,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0
4742,0,98,8,1.0,0,1,1,1,0.0,0.0,1,0,2,1,0,5,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
65354,0,28,9,1.0,0,1,1,1,0.0,0.0,1,0,3,7,0,6,1,-9,-9,1,1,0,0,0,0,0,0,0,0,0,1


In [None]:
df2.shape

### model building

In [None]:
# clustering - KMeans (sklearn.cluster import KMeans)

# association - 

#### k-Means

In [None]:
from sklearn.cluster import KMeans

In [None]:
kmeans = KMeans(n_clusters=5)
# fit() returns the fitted estimator itself, so df2_clusters is the same object as kmeans
df2_clusters = kmeans.fit(df2)

In [None]:
centroids=kmeans.cluster_centers_
labels=kmeans.labels_
centroids

In [None]:
labels
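
Once a model is fitted, new instances can be assigned to the learned clusters with `predict`, which covers the requirement to classify new problem instances. The sketch below uses a small hypothetical array rather than `df2`.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical training data: two obvious groups in two dimensions.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_train)

# predict() assigns each new row to the nearest learned centroid.
X_new = np.array([[0.05, 0.05], [5.05, 5.0]])
new_labels = model.predict(X_new)

print(model.cluster_centers_)
print(new_labels)
```

The numeric label of each cluster is arbitrary, so only the grouping is meaningful, not which cluster happens to be called 0 or 1.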

In [None]:
#plt.plot(df2,colors[labels],markersize=10)

In [None]:
# any X and Y columns, hue by centroids

In [None]:
df2.sample(3)

In [None]:
# sns.scatterplot(x=df2.attacktype1, y=df2.targtype1, hue=labels);

In [None]:
# sns.scatterplot(x=df2.weaptype1, y=df2.country, hue=labels);

In [None]:
# sns.scatterplot(x=df2.weaptype1, y=df2.success, hue=labels);

In [None]:
plt.scatter(centroids[:,0],centroids[:,1],marker='x',s=150,linewidths=5,zorder=10)

In [None]:
# open question: how can the actual clusters be visualized?

from yellowbrick.cluster import KElbowVisualizer
from sklearn import cluster

In [None]:
# draw a random subsample of rows to keep the elbow search fast
n_samples = 1500
X = df2.sample(n=n_samples, random_state=42)

In [None]:
model=cluster.KMeans()

In [None]:
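# k=(1, 5) tries k = 1, 2, 3, 4 (the upper bound is exclusive)
visualizer = KElbowVisualizer(model, k=(1, 5))
visualizer.fit(X)
visualizer.show()  # poof() was renamed to show() in Yellowbrick 1.0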
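
If Yellowbrick is not available, the same elbow curve can be approximated by recording `inertia_` for each k with scikit-learn alone. This is a minimal sketch on synthetic data. The blob parameters and the range 1 to 5 are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the sampled feature matrix.
X_demo, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.6, random_state=1)

inertias = {}
for k in range(1, 6):
    model = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X_demo)
    # inertia_ is the within-cluster sum of squared distances to the centroids.
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
```

Plotting these values against k and looking for the point where the curve flattens gives the usual elbow estimate.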

## DBSCAN model

In [30]:
from sklearn.cluster import DBSCAN
from sklearn.cluster import KMeans

In [31]:
from sklearn.preprocessing import StandardScaler
import pandas as pd

In [32]:
stscaler = StandardScaler().fit(df2)
data = stscaler.transform(df2)

In [None]:
"""plt.scatter(df2.targtype1, df2.country)
plt.xlabel("Type of target")
plt.ylabel("Country")
plt.title("test viz")
#plt.savefig("results/wholesale.png", format = "PNG")"""

In [35]:
clustering = DBSCAN(eps=10, min_samples=50).fit(data)
clustering.labels_

MemoryError: 
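
A `MemoryError` like the one above can plausibly occur when `eps` is so large that almost every point is a neighbour of every other point, although the exact cause here is not confirmed. One hedged workaround is to fit on a random subsample and choose `eps` from the k-distance distribution. The synthetic `data_demo`, the sample size of 400, and the 90th percentile cutoff below are all assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Hypothetical stand-in for the scaled matrix `data`: two dense blobs plus noise.
blob_a = rng.normal(loc=0.0, scale=0.3, size=(500, 2))
blob_b = rng.normal(loc=5.0, scale=0.3, size=(500, 2))
noise = rng.uniform(low=-3.0, high=8.0, size=(50, 2))
data_demo = np.vstack([blob_a, blob_b, noise])

# 1. Work on a random subsample so the neighbourhood queries stay small.
sample_size = 400
sample_idx = rng.choice(len(data_demo), size=sample_size, replace=False)
sample = data_demo[sample_idx]

# 2. k-distance heuristic: distance from each point to its k-th nearest
#    neighbour, counting the point itself (as DBSCAN's min_samples does).
min_samples = 10
nn = NearestNeighbors(n_neighbors=min_samples).fit(sample)
distances, _ = nn.kneighbors(sample)
k_dist = np.sort(distances[:, -1])
eps = float(np.percentile(k_dist, 90))  # assumed cutoff; normally read off a plot

# 3. Fit DBSCAN on the subsample with the chosen eps.
sample_labels = DBSCAN(eps=eps, min_samples=min_samples).fit(sample).labels_
n_clusters = len(set(sample_labels)) - (1 if -1 in sample_labels else 0)
print(eps, n_clusters)
```

In practice the sorted `k_dist` values are usually plotted and `eps` is read off at the bend of the curve rather than taken from a fixed percentile.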

In [None]:
dbsc = DBSCAN(eps = 3, min_samples = 10).fit(data)

In [None]:
labels = dbsc.labels_
core_samples = np.zeros_like(labels, dtype = bool)
core_samples[dbsc.core_sample_indices_] = True

In [None]:
unique_labels = np.unique(labels)
colors = plt.cm.Spectral(np.linspace(0,1, len(unique_labels)))

In [None]:
for (label, color) in zip(unique_labels, colors):
    class_member_mask = (labels == label)
    xy = data[class_member_mask & core_samples]
    plt.plot(xy[:,0],xy[:,1], 'o', markerfacecolor = color, markersize = 10)
    
    xy2 = data[class_member_mask & ~core_samples]
    plt.plot(xy2[:,0],xy2[:,1], 'o', markerfacecolor = color, markersize = 5)
# columns 0 and 1 of the scaled matrix correspond to 'extended' and 'country' in df2
plt.title("DBSCAN on the terrorism dataset")
plt.xlabel("extended (scaled)")
plt.ylabel("country (scaled)")
#plt.savefig("results/dbscan_wholesale.png", format = "PNG")
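
Besides plotting, a quick numeric summary of the DBSCAN result is often useful for the report. The sketch below counts points per cluster and the number of noise points. The `labels_demo` array is a hypothetical example in the same format as `dbsc.labels_`, where -1 marks noise.

```python
import numpy as np

# Hypothetical label vector in the format of `dbsc.labels_`; -1 marks noise.
labels_demo = np.array([0, 0, 1, -1, 1, 1, 0, -1, 2, 2])

cluster_ids, counts = np.unique(labels_demo, return_counts=True)
sizes = {int(c): int(n) for c, n in zip(cluster_ids, counts)}

n_noise = sizes.get(-1, 0)
n_clusters = len([c for c in sizes if c != -1])

print(sizes)
print(n_clusters, n_noise)
```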