# <center>Whisky Distilleries</center>

![title](whisky.jpg)

*For this task, you will need the following Python packages:*

    - pandas
    - NumPy
    - scikit-learn
    - Seaborn
    - Matplotlib
    - folium
    - pyproj

In [None]:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import folium
from pyproj import Proj, transform
from folium.plugins import MarkerCluster

## Load the dataset.

In [None]:
df = pd.read_csv('whisky.csv',index_col='RowID')

### Preview the kind of data you will be working with by printing some samples from the DataFrame.

In [None]:
df.head()

In [None]:
df.shape

The dataset is a DataFrame with 86 rows; the second value returned by `df.shape` is the number of columns.

*That means there are 86 whisky distilleries, each described by its name, its taste scores and its location.*

### Get some descriptive statistics

In [None]:
df.describe()

The outputs above show the features of the dataset and some basic statistics for each of them.
The feature names are listed below:

In [None]:
print(df.columns.values)

It is important to note that not all machine learning algorithms accept missing values in the data fed to them,
and K-Means is one of those that do not. Any missing values in the data therefore need to be handled first.
Let's see where values are missing:

In [None]:
print(df.isnull().sum())

*There are no missing values, which a visual inspection of the data also confirms.*
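Had the check reported missing values, they would need to be filled or dropped before clustering. As a minimal sketch of one common option, median imputation with pandas, the cell below uses a small made-up DataFrame (the values are illustrative and are not taken from `whisky.csv`):

```python
import numpy as np
import pandas as pd

# Made-up taste scores with two gaps; not taken from whisky.csv.
toy = pd.DataFrame({
    'Body':      [2, 3, np.nan, 1],
    'Sweetness': [2, np.nan, 3, 1],
})

# Count missing values per column, as in the cell above.
missing_before = toy.isnull().sum()

# Replace each gap with the median of its own column.
toy_filled = toy.fillna(toy.median())

missing_after = toy_filled.isnull().sum()
print(missing_before.to_dict(), missing_after.to_dict())
```

Dropping the affected rows with `dropna()` is the simpler alternative, at the cost of losing those distilleries from the analysis.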

Let's analyze the data a little further in order to understand it better.
A good understanding of the data is essential to any machine learning task.
Let's start by finding out which features are categorical and which are numerical.

In [None]:
df.info()
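As a complement to `df.info()`, the column types can also be split programmatically. The sketch below uses a made-up DataFrame whose column names mimic the ones used in this notebook; the values, including the postcodes, are invented for illustration:

```python
import pandas as pd

# Made-up rows; the values are not taken from whisky.csv.
toy = pd.DataFrame({
    'Distillery': ['A', 'B', 'C'],
    'Body': [2, 3, 1],
    'Postcode': ['AB1 2CD', 'EF3 4GH', 'IJ5 6KL'],
    'Smoky': [1, 0, 4],
})

# Columns holding numbers versus everything else (here, text).
numeric_cols = toy.select_dtypes(include='number').columns.tolist()
text_cols = toy.select_dtypes(exclude='number').columns.tolist()

print('numeric:', numeric_cols)
print('text:', text_cols)
```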

### Which features are suitable for clustering whisky distilleries by tasting profile?

In [None]:
df.hist(figsize=(10,10))
plt.show()

*The above histograms show that most of the features are categorical.*

You might be wondering how a labeled dataset can be used for a clustering task.
Drop the 'Distillery' column from the dataset to make it unlabeled,
and keep only the columns that will be used to cluster the whiskies by taste.
Grouping the records of the dataset is then the task of K-Means.


Often, it is better to train your model with only the significant features than with all the features,
including unnecessary ones.
This not only makes the modelling more efficient, but also lets the model train in much less time.

For this task, we want to focus on the factors that influence the taste of whisky. The features
Latitude, Longitude, Postcode and Distillery
can be dropped without a significant impact on the training of the K-Means model.

In [None]:
df_new = df.drop(['Latitude','Longitude','Postcode','Distillery'], axis=1)

## Use the elbow or silhouette method to find the optimal number of clusters.

### Dimensionality Reduction

The features are reduced to two dimensions so that they can be visualized and fitted by the model with ease.

In [None]:
from sklearn.decomposition import KernelPCA
np.random.seed(42)
rbf_pca= KernelPCA(n_components=2, kernel='rbf')
X_reduced= rbf_pca.fit_transform(df_new)

In [None]:
x=X_reduced[:,0]
y=X_reduced[:,1]

plt.scatter(x, y)
plt.title('transformed features')
plt.show()
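To make the effect of the reduction step concrete, here is a minimal sketch on random stand-in data. The 30 rows and 12 columns are an arbitrary choice for illustration and are not meant to describe `whisky.csv`. With `kernel='rbf'` and no explicit `gamma`, scikit-learn uses `1 / n_features` as the kernel coefficient.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)

# Stand-in data: 30 samples, 12 integer scores between 0 and 4 (made up).
X_toy = rng.integers(0, 5, size=(30, 12)).astype(float)

# Project onto the first two kernel principal components.
kpca = KernelPCA(n_components=2, kernel='rbf')
X_toy_2d = kpca.fit_transform(X_toy)

print(X_toy.shape, '->', X_toy_2d.shape)
```

Every sample keeps its row position, so row `i` of the reduced array still corresponds to row `i` of the input.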

### Use the elbow method to find the optimal number of clusters.

In [None]:
wcss = []
for i in range(1, 12):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=20, random_state=0)
    kmeans.fit(X_reduced)
    wcss.append(kmeans.inertia_)

fig = plt.figure(figsize=(10,8))
plt.plot(range(1, 12), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

*The graph above shows that the optimal number of clusters is 3.*
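The WCSS plotted above is the within-cluster sum of squares: the squared distance from every point to the centre of its own cluster, summed over all points. As a sketch on three made-up blobs (not the whisky data), the cell below recomputes it by hand and compares it with the `inertia_` attribute that the elbow loop collects:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Three made-up, well-separated blobs of 20 points each.
X_toy = np.vstack([
    rng.normal(loc=(0, 0), scale=0.3, size=(20, 2)),
    rng.normal(loc=(5, 5), scale=0.3, size=(20, 2)),
    rng.normal(loc=(0, 5), scale=0.3, size=(20, 2)),
])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_toy)

# Look up the centre assigned to each point, then sum the squared distances.
assigned_centres = km.cluster_centers_[km.labels_]
wcss_manual = ((X_toy - assigned_centres) ** 2).sum()

print(wcss_manual, km.inertia_)
```

Because WCSS always decreases as clusters are added, the elbow method looks for the point where the decrease flattens rather than for a minimum.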

In [None]:

from sklearn.metrics import silhouette_score
sil_scores=[]
for clusters in range(2,10):
    km= KMeans(n_clusters=clusters, random_state=42)
    km.fit(X_reduced)
    labels= km.predict(X_reduced)
    
    #silhouette score
    sil_score= silhouette_score(X_reduced, labels)   
    sil_scores.append(sil_score)
    
sns.barplot(x=list(range(2,10)), y=sil_scores)
plt.title('Silhouette score by number of clusters')
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette score')
plt.show()

**The above bar graph shows that 3 is the optimal number of clusters for this data.**
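Instead of reading the best value off the bar chart, the number of clusters with the highest silhouette score can be picked in code. The sketch below does this on three made-up blobs rather than on the whisky data; the range of `k` values is an arbitrary choice.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)

# Three made-up, well-separated blobs of 20 points each.
X_toy = np.vstack([
    rng.normal(loc=(0, 0), scale=0.3, size=(20, 2)),
    rng.normal(loc=(5, 5), scale=0.3, size=(20, 2)),
    rng.normal(loc=(0, 5), scale=0.3, size=(20, 2)),
])

# Silhouette score for each candidate number of clusters.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_toy)
    scores[k] = silhouette_score(X_toy, labels)

# The candidate with the highest average silhouette.
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

The silhouette score is undefined for a single cluster, which is why the loop starts at 2.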

### Looks like you are good to go to train your K-Means model now.

In [None]:
kmeans = KMeans(n_clusters=3,random_state=0)
pred_y = kmeans.fit_predict(X_reduced)
plt.scatter(x, y,c = pred_y,cmap ='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='red')
plt.show()

In [None]:
cluster = pd.DataFrame(pred_y, columns=['Cluster'], index=df.index)  # reuse df's RowID index so each label stays on its own row
df.shape


In [None]:
df_cluster = pd.concat([df,cluster],axis = 1)


In [None]:
df_cluster.isnull().sum()
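Note that `pd.concat(..., axis=1)` matches rows by index label, not by position, so any missing values reported by this check would point to a mismatch between the `RowID` index of `df` and the index of the cluster labels. The sketch below reproduces the effect on a made-up frame; that the index starts at 1 is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# Made-up frame whose index starts at 1 (an assumption, for illustration).
toy = pd.DataFrame({'Body': [2, 3, 1]}, index=pd.Index([1, 2, 3], name='RowID'))
pred = np.array([0, 1, 0])  # hypothetical cluster labels, in row order

# A default 0..2 index does not match 1..3, so concat pads with NaN
# and every label lands one row away from the record it belongs to.
misaligned = pd.concat([toy, pd.DataFrame(pred, columns=['Cluster'])], axis=1)

# Reusing the original index keeps every label on its own row.
aligned = pd.concat(
    [toy, pd.DataFrame(pred, columns=['Cluster'], index=toy.index)], axis=1
)

print(misaligned.shape, int(misaligned.isnull().sum().sum()))
print(aligned.shape, int(aligned.isnull().sum().sum()))
```

Calling `dropna()` on the misaligned result removes the padded rows but does not move the remaining labels back to the right records.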

In [None]:
df_cluster['Cluster'].value_counts()

In [None]:
plt.figure(figsize=(15,5))
fontdict={'fontsize':20}
sns.countplot(x='Cluster',data=df_cluster)
plt.title('Clusters',fontdict=fontdict)

*The above bar graph displays the number of distilleries in each cluster.*

In [None]:
df_clusters = pd.concat([df_new,cluster],axis = 1)
df_clusters = df_clusters.dropna()
df_clusters = df_clusters.astype(int)  # keep the integer conversion instead of discarding it
df_clusters

In [None]:
g = df_clusters.groupby('Cluster').mean()
g

In [None]:
# plt.figure(figsize=(15,5))
g.plot.bar(figsize=(20,10))

**Characteristics of the whisky tastes in each cluster**

cluster_0 - Bodied and Sweet (equally), Fruity

cluster_1 - More Sweet, Bodied, Malty

cluster_2 - More Sweet, Floral and Bodied
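Summaries like these can also be derived from the per-cluster means rather than read off the bar chart. The sketch below uses a made-up table with three taste columns and two clusters (the values are invented) and lists the two highest-scoring features of each cluster:

```python
import pandas as pd

# Made-up taste scores and cluster labels; not taken from whisky.csv.
toy = pd.DataFrame({
    'Body':      [3, 3, 1, 1],
    'Sweetness': [1, 2, 3, 4],
    'Floral':    [0, 1, 2, 2],
    'Cluster':   [0, 0, 1, 1],
})

# Mean score of every feature within each cluster.
profile = toy.groupby('Cluster').mean()

# The two features with the highest mean score in each cluster.
top_two = {
    int(c): row.nlargest(2).index.tolist()
    for c, row in profile.iterrows()
}
print(top_two)
```

Ranking by raw means favours features that score high everywhere, so comparing each cluster mean with the overall mean may give a sharper picture of what sets a cluster apart.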

**Locating the whisky distilleries.**

In [None]:
df = pd.read_csv('whisky.csv',index_col='RowID')

In [None]:
cluster = pd.DataFrame(pred_y, columns=['Cluster'], index=df.index)  # reuse df's RowID index so each label stays on its own row

In [None]:
df['class'] = cluster

df = df.dropna()
df['class'] = df['class'].astype(int)


In [None]:
col = ['blue','red','green']

In [None]:
labels = df['Distillery']

In [None]:

map_distillery = folium.Map(location=[57.499520,  -2.776390], zoom_start = 9)

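# The 'Latitude' and 'Longitude' columns are treated as British National Grid
# (EPSG:27700) eastings and northings and converted to WGS84 degrees (EPSG:4326).
from pyproj import Transformer
transformer = Transformer.from_crs('EPSG:27700', 'EPSG:4326', always_xy=True)

for label, easting, northing, c in zip(labels, df['Latitude'], df['Longitude'], df['class']):
    # always_xy=True returns (longitude, latitude); folium expects [latitude, longitude].
    lon, lat = transformer.transform(easting, northing)
    folium.Marker([lat, lon], popup=label, icon=folium.Icon(color=col[c])).add_to(map_distillery)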

        

map_distillery.save('Whiskymap.html')
map_distillery

The map shows the location of the distilleries in each class.
Most of the distilleries are located in the north-east of Scotland.
  

**In conclusion, the whiskies can be clustered mainly according to Sweetness and Body.**