# Machine Learning Engineer Nanodegree
## Unsupervised Learning
## Project: Creating Customer Segments

## Getting Started

In this project, you will analyze a dataset of customers' annual spending (reported in *monetary units*) across several product categories, looking for internal structure. One goal of this project is to best describe the variation among the different types of customers that a wholesale distributor interacts with. Doing so would give the distributor insight into how to structure its delivery service to meet the needs of each customer.

The dataset for this project can be found on the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Wholesale+customers). For the purposes of this project, the features `'Channel'` and `'Region'` will be excluded from the analysis, with the focus instead on the six product categories recorded for customers.

Run the code block below to load the wholesale customers dataset, along with a few of the necessary Python libraries required for this project. You will know the dataset loaded successfully if the size of the dataset is reported.

In [6]:
# Import libraries necessary for this project
import numpy as np
import pandas as pd
from IPython.display import display # Allows the use of display() for DataFrames

# Import supplementary visualizations code visuals.py
import visuals as vs

# Pretty display for notebooks
%matplotlib inline

# Load the wholesale customers dataset
try:
    data = pd.read_csv("customers.csv")
    data.drop(['Region', 'Channel'], axis = 1, inplace = True)
    print("Wholesale customers dataset has {} samples with {} features each.".format(*data.shape))
except FileNotFoundError:
    print("Dataset could not be loaded. Is the dataset missing?")

Wholesale customers dataset has 440 samples with 6 features each.


In [7]:
# Display a description of the dataset
display(data.describe())
data.dropna()

Unnamed: 0,Fresh,Milk,Grocery,Frozen,Detergents_Paper,Delicatessen
count,440.0,440.0,440.0,440.0,440.0,440.0
mean,12000.297727,5796.265909,7951.277273,3071.931818,2881.493182,1524.870455
std,12647.328865,7380.377175,9503.162829,4854.673333,4767.854448,2820.105937
min,3.0,55.0,3.0,25.0,3.0,3.0
25%,3127.75,1533.0,2153.0,742.25,256.75,408.25
50%,8504.0,3627.0,4755.5,1526.0,816.5,965.5
75%,16933.75,7190.25,10655.75,3554.25,3922.0,1820.25
max,112151.0,73498.0,92780.0,60869.0,40827.0,47943.0


Unnamed: 0,Fresh,Milk,Grocery,Frozen,Detergents_Paper,Delicatessen
0,12669,9656,7561,214,2674,1338
1,7057,9810,9568,1762,3293,1776
2,6353,8808,7684,2405,3516,7844
3,13265,1196,4221,6404,507,1788
4,22615,5410,7198,3915,1777,5185
...,...,...,...,...,...,...
435,29703,12051,16027,13135,182,2204
436,39228,1431,764,4510,93,2346
437,14531,15488,30243,437,14841,1867
438,10290,1981,2232,1038,168,2125


In [8]:
# TODO: Select three indices of your choice you wish to sample from the dataset
indices = [60,110,180]

# Create a DataFrame of the chosen samples
samples = pd.DataFrame(data.loc[indices], columns = data.keys()).reset_index(drop = True)
print("Chosen samples of wholesale customers dataset:")
display(samples)

Chosen samples of wholesale customers dataset:


Unnamed: 0,Fresh,Milk,Grocery,Frozen,Detergents_Paper,Delicatessen
0,8590,3045,7854,96,4095,225
1,11818,1648,1694,2276,169,1647
2,12356,6036,8887,402,1382,2794


In [9]:
# TODO: Make a copy of the DataFrame, using the 'drop' function to drop the given feature
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
new_data = data.drop(['Delicatessen'], axis = 1)
new_feature = pd.DataFrame(data['Delicatessen'])

# TODO: Split the data into training and testing sets(0.25) using the given feature as the target
# Set a random state.
X_train, X_test, y_train, y_test = train_test_split(new_data, new_feature, test_size=0.25, random_state=42)

# TODO: Create a decision tree regressor and fit it to the training set
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train,y_train)

# TODO: Report the score of the prediction using the testing set
score = regressor.score(X_test, y_test)
score

In [10]:
# TODO: Make a copy of the DataFrame, using the 'drop' function to drop the given feature
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
new_data = data.drop(['Fresh'], axis = 1)
new_feature = pd.DataFrame(data['Fresh'])

# TODO: Split the data into training and testing sets(0.25) using the given feature as the target
# Set a random state.
X_train, X_test, y_train, y_test = train_test_split(new_data, new_feature, test_size=0.25, random_state=42)

# TODO: Create a decision tree regressor and fit it to the training set
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train,y_train)

# TODO: Report the score of the prediction using the testing set
score = regressor.score(X_test, y_test)
score

In [None]:
# TODO: Make a copy of the DataFrame, using the 'drop' function to drop the given feature
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
new_data = data.drop(['Frozen'], axis = 1)
new_feature = pd.DataFrame(data['Frozen'])

# TODO: Split the data into training and testing sets(0.25) using the given feature as the target
# Set a random state.
X_train, X_test, y_train, y_test = train_test_split(new_data, new_feature, test_size=0.25, random_state=42)

# TODO: Create a decision tree regressor and fit it to the training set
regressor =  DecisionTreeRegressor(random_state=42)
regressor.fit(X_train,y_train)

# TODO: Report the score of the prediction using the testing set
score =  regressor.score(X_test, y_test)
score

In [None]:
# TODO: Make a copy of the DataFrame, using the 'drop' function to drop the given feature
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
new_data = data.drop(['Milk'], axis = 1)
new_feature = pd.DataFrame(data['Milk'])

# TODO: Split the data into training and testing sets(0.25) using the given feature as the target
# Set a random state.
X_train, X_test, y_train, y_test = train_test_split(new_data, new_feature, test_size=0.25, random_state=42)

# TODO: Create a decision tree regressor and fit it to the training set
regressor =  DecisionTreeRegressor(random_state=42)
regressor.fit(X_train,y_train)

# TODO: Report the score of the prediction using the testing set
score =  regressor.score(X_test, y_test)
score

In [None]:
# TODO: Make a copy of the DataFrame, using the 'drop' function to drop the given feature
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
new_data = data.drop(['Grocery'], axis = 1)
new_feature = pd.DataFrame(data['Grocery'])

# TODO: Split the data into training and testing sets(0.25) using the given feature as the target
# Set a random state.
X_train, X_test, y_train, y_test = train_test_split(new_data, new_feature, test_size=0.25, random_state=42)

# TODO: Create a decision tree regressor and fit it to the training set
regressor =  DecisionTreeRegressor(random_state=42)
regressor.fit(X_train,y_train)

# TODO: Report the score of the prediction using the testing set
score =  regressor.score(X_test, y_test)
score
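
The cells above repeat the same experiment once per feature: drop a feature, try to predict it from the others, and report the held-out R^2. A minimal sketch of that idea as a single loop is shown below. It runs on synthetic stand-in data rather than the wholesale dataset, and the helper name `feature_relevance` and the columns `a`, `b`, `c` are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def feature_relevance(frame, random_state=42):
    """Return the held-out R^2 obtained when each column is predicted from the others.

    Hypothetical helper, not part of the project template.
    """
    scores = {}
    for target in frame.columns:
        X = frame.drop(columns=[target])
        y = frame[target]
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.25, random_state=random_state)
        model = DecisionTreeRegressor(random_state=random_state)
        model.fit(X_train, y_train)
        scores[target] = model.score(X_test, y_test)
    return pd.Series(scores, name="R^2")


# Synthetic stand-in data: 'b' is almost a linear function of 'a', 'c' is independent noise.
rng = np.random.RandomState(0)
a = rng.uniform(0, 100, size=400)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=1.0, size=400),
    "c": rng.uniform(0, 100, size=400),
})

relevance = feature_relevance(toy)
print(relevance)
```

A high score suggests the feature is largely redundant given the others, while a score near zero or below suggests it carries information the other features do not. With the real data, `feature_relevance(data)` would be the analogous call.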

In [None]:
# Produce a scatter matrix for each pair of features in the data
pd.plotting.scatter_matrix(data, alpha = 0.3, figsize = (14,8), diagonal = 'kde');

In [None]:
# Correlation matrix for the features of the customer segments CSV file
data.corr()

In [None]:
import seaborn as sns
sns.heatmap(data.corr(), annot=True)



In [None]:
# TODO: Scale the data using the natural logarithm
log_data = np.log(data)

# TODO: Scale the sample data using the natural logarithm
log_samples = np.log(samples)

# Produce a scatter matrix for each pair of newly-transformed features
pd.plotting.scatter_matrix(log_data, alpha = 0.3, figsize = (14,8), diagonal = 'kde');
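
One way to check numerically what the log transform does, rather than judging it only from the scatter matrix, is to compare per-feature skewness before and after. The sketch below uses made-up log-normal data with hypothetical column names (`category_a`, `category_b`), so it illustrates the effect but says nothing about the actual wholesale values.

```python
import numpy as np
import pandas as pd

# Synthetic, strictly positive, right-skewed "spending" data (illustration only).
rng = np.random.RandomState(0)
spend = pd.DataFrame({
    "category_a": rng.lognormal(mean=8.0, sigma=1.0, size=1000),
    "category_b": rng.lognormal(mean=7.0, sigma=1.2, size=1000),
})

# Sample skewness of each column before and after the natural-log transform.
skew_before = spend.skew()
skew_after = np.log(spend).skew()

print(pd.DataFrame({"raw": skew_before, "log": skew_after}))
```

Note that `np.log` requires strictly positive inputs. That holds for this dataset, whose minimum values are all at least 3, but data containing zeros would need something like `np.log1p` instead.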

In [None]:
# Correlation matrix for the features after applying the log transform
log_data.corr()

In [None]:
import seaborn as sns
sns.heatmap(log_data.corr(), annot=True)

In [None]:
# Display the log-transformed sample data
display(log_samples)

In [None]:
all_outliers = np.array([], dtype='int64')

# For each feature find the data points with extreme high or low values
for feature in log_data.keys():
    
    # TODO: Calculate Q1 (25th percentile of the data) for the given feature
    Q1 = np.percentile(log_data[feature], 25)
    
    # TODO: Calculate Q3 (75th percentile of the data) for the given feature
    Q3 = np.percentile(log_data[feature], 75)
    
    # TODO: Use the interquartile range to calculate an outlier step (1.5 times the interquartile range)
    step = (Q3 - Q1) * 1.5
    
    outlier_points = log_data[~((log_data[feature] >= Q1 - step) & (log_data[feature] <= Q3 + step))]
    all_outliers = np.append(all_outliers, outlier_points.index.values.astype('int64'))
    # Display the outliers
    print ("Data points considered outliers for the feature '{}':".format(feature))
    display(outlier_points)

# Keep only the points flagged as outliers for more than one feature
all_outliers, counts = np.unique(all_outliers, return_counts=True)
outliers = all_outliers[counts > 1]

print(outliers)

# Remove the outliers, if any were specified
good_data = log_data.drop(log_data.index[outliers]).reset_index(drop = True)
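
The outlier step above applies Tukey's rule per feature (anything more than 1.5 interquartile ranges outside the quartiles is flagged) and then keeps only the rows flagged for more than one feature. A compact, vectorized sketch of the same rule follows. The function name `tukey_outlier_counts` and the toy values are assumptions for illustration.

```python
import pandas as pd


def tukey_outlier_counts(frame, k=1.5):
    """Count, for each row, how many columns flag it as an outlier under Tukey's rule."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    step = k * (q3 - q1)
    # Comparing a DataFrame with a Series aligns the Series on the column labels.
    flagged = (frame < q1 - step) | (frame > q3 + step)
    return flagged.sum(axis=1)


# Toy data: row 5 is extreme in both columns, row 6 only in 'y'.
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 100.0, 3.0, 2.5],
    "y": [10.0, 11.0, 12.0, 13.0, 14.0, -90.0, 60.0, 12.5],
})

counts = tukey_outlier_counts(toy)
multi_feature_outliers = counts[counts > 1].index.tolist()
print(multi_feature_outliers)  # [5]
```

Dropping only rows that are outliers in several features is a conservative choice: it removes the most unusual customers while keeping points that are extreme in a single category.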

In [None]:
# TODO: Apply PCA by fitting the good data with the same number of dimensions as features
from sklearn.decomposition import PCA
pca = PCA(n_components=6).fit(good_data)

# TODO: Apply a PCA transformation to the sample log-data
pca_samples = pca.transform(log_samples)

# Generate PCA results plot
pca_results = vs.pca_results(good_data, pca)

# display cumulative variance:
print (pca_results['Explained Variance'].cumsum())
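
The cumulative variance printed above can also be read directly from the fitted PCA object through its `explained_variance_ratio_` attribute, which helps when deciding how many dimensions to keep. The sketch below assumes synthetic data with two underlying factors and an arbitrary 90% threshold. Neither comes from the project.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 6 observed features driven by 2 latent factors plus a little noise.
rng = np.random.RandomState(0)
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.05 * rng.normal(size=(300, 6))

pca = PCA(n_components=6).fit(X)

# Running total of the fraction of variance explained by the first 1, 2, ... components.
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches the (assumed) threshold.
n_needed = int(np.searchsorted(cumulative, 0.90) + 1)

print(np.round(cumulative, 4))
print("Components needed for 90% of the variance:", n_needed)
```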



In [None]:
# Display sample log-data after having a PCA transformation applied
display(pd.DataFrame(np.round(pca_samples, 4), columns = pca_results.index.values))

In [None]:
# TODO: Apply PCA by fitting the good data with only two dimensions
pca = PCA(n_components=2).fit(good_data)

# TODO: Transform the good data using the PCA fit above
reduced_data = pca.transform(good_data)

# TODO: Transform log_samples using the PCA fit above
pca_samples = pca.transform(log_samples)

# Create a DataFrame for the reduced data
reduced_data = pd.DataFrame(reduced_data, columns = ['Dimension 1', 'Dimension 2'])

In [None]:
# Display sample log-data after applying PCA transformation in two dimensions
display(pd.DataFrame(np.round(pca_samples, 4), columns = ['Dimension 1', 'Dimension 2']))

In [None]:
# Create a biplot
vs.biplot(good_data, reduced_data, pca)

In [None]:
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans
from sklearn import preprocessing
from sklearn.cluster import DBSCAN
# Prepare models
df = reduced_data
kmeans = KMeans(n_clusters=3).fit(df)
normalized_vectors = preprocessing.normalize(df)
normalized_kmeans = KMeans(n_clusters=4).fit(normalized_vectors)
# Note: DBSCAN is fit on the full log-transformed data, not the PCA-reduced data
db = DBSCAN(eps=3.5, min_samples=10).fit(good_data)
# Print results
print('kmeans: {}'.format(silhouette_score(df, kmeans.labels_, 
                                           metric='euclidean')))
print('Cosine kmeans:{}'.format(silhouette_score(normalized_vectors,
                                          normalized_kmeans.labels_,
                                          metric='cosine')))
print('DBSCAN: {}'.format(silhouette_score(df, db.labels_, 
                                           metric='cosine')))
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
preds = kmeans.predict(reduced_data)
centers = kmeans.cluster_centers_

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)

print('Estimated number of clusters: %d' % n_clusters_)
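
DBSCAN labels noise points as `-1`, and `silhouette_score` treats that label like any other cluster, so a score computed on all labels mixes noise into the result. A hedged sketch of one alternative, scoring only the points assigned to a cluster, is shown below on synthetic blobs. The `eps` and `min_samples` values here are chosen for the toy data and are not recommendations for the customer data.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Two tight synthetic blobs plus two isolated points that should end up as noise.
X, _ = make_blobs(n_samples=200, centers=[[-4, 0], [4, 0]],
                  cluster_std=0.5, random_state=0)
X = np.vstack([X, [[0.0, 20.0], [0.0, -20.0]]])

db = DBSCAN(eps=1.0, min_samples=5).fit(X)
labels = db.labels_

# Exclude noise (-1) before counting clusters and scoring.
mask = labels != -1
n_clusters = len(set(labels[mask]))

score = None
if n_clusters >= 2:  # the silhouette coefficient needs at least two clusters
    score = silhouette_score(X[mask], labels[mask], metric="euclidean")

print("Clusters:", n_clusters, "| noise points:", int((~mask).sum()))
print("Silhouette (noise excluded):", score)
```

The guard matters in practice: if DBSCAN finds a single cluster, `silhouette_score` raises an error instead of returning a number.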

In [None]:
vs.cluster_results(reduced_data, preds, centers, pca_samples)

In [None]:
# K-MEANS
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
reduced_samples = pd.DataFrame(pca_samples, columns = ['Dimension 1', 'Dimension 2'])

# TODO: Apply your clustering algorithm of choice to the reduced data 
clusterer = KMeans(n_clusters=2, random_state=29).fit(reduced_data)
# TODO: Predict the cluster for each data point
preds = clusterer.predict(reduced_data)
centers = clusterer.cluster_centers_
# TODO: Predict the cluster for each transformed sample data point
sample_preds = clusterer.predict(reduced_samples)
# TODO: Calculate the mean silhouette coefficient for the number of clusters chosen
score = silhouette_score(reduced_data, clusterer.labels_, metric='euclidean')
print("KMeans score", score)
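
The cell above scores a single choice of `n_clusters`. A common way to pick that number is to repeat the fit for several values and compare the mean silhouette coefficients. The following sketch does this on synthetic blobs with three well-separated groups. The data, the range of `k`, and the `n_init` setting are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic 2-D data with three well-separated groups.
X, _ = make_blobs(n_samples=300, centers=[[-5, -5], [0, 5], [6, -2]],
                  cluster_std=0.8, random_state=0)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels, metric="euclidean")

best_k = max(scores, key=scores.get)

for k, s in scores.items():
    print("k = {}: silhouette = {:.3f}".format(k, s))
print("Best k by silhouette:", best_k)
```

With the project data, the same loop could be run over `reduced_data`. The best value there depends on the data and is not implied by this toy example.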

In [None]:
# Display the results of the clustering from implementation
vs.cluster_results(reduced_data, preds, centers, pca_samples)

In [None]:

# TODO: Inverse transform the centers
log_centers = pca.inverse_transform(centers)

# TODO: Exponentiate the centers
true_centers = np.exp(log_centers)

# Display the true centers
segments = ['Segment {}'.format(i) for i in range(0,len(centers))]
true_centers = pd.DataFrame(np.round(true_centers), columns = data.keys())
true_centers.index = segments
display(true_centers)
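
To see what the inverse transform followed by `np.exp` recovers, it can help to push a known point through the same pipeline. In the sketch below, which uses synthetic data and assumes the default `whiten=False`, the origin of the reduced space maps back to the per-feature mean of the log data, which is the geometric mean in the original units.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic positive "spending" data with 4 features (illustration only).
rng = np.random.RandomState(0)
spend = rng.lognormal(mean=8.0, sigma=1.0, size=(200, 4))
log_spend = np.log(spend)

pca = PCA(n_components=2).fit(log_spend)

# Map the origin of the 2-D PCA space back to log space, then undo the log.
origin = np.zeros((1, 2))
recovered = np.exp(pca.inverse_transform(origin))[0]

# Geometric mean of each feature: exp of the mean of the logs.
geometric_mean = np.exp(log_spend.mean(axis=0))

print(np.round(recovered, 2))
print(np.round(geometric_mean, 2))
```

Two consequences are worth keeping in mind when reading the recovered centers: they behave like geometric rather than arithmetic averages, and because only two components are kept, the reconstruction discards whatever variation lies outside those two dimensions.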


* For each sample point, which customer segment best represents it?
* Are the predictions for each sample point consistent with this?

Run the code block below to find which cluster each sample point is predicted to be.

In [None]:
# Display the predictions
for i, pred in enumerate(sample_preds):
    print("Sample point", i, "predicted to be in Cluster", pred)
display(true_centers)
display(samples)
display(true_centers-data.median())
display(true_centers-data.mean())

In [None]:
# Display the clustering results based on 'Channel' data
vs.channel_results(reduced_data, outliers, pca_samples)

In [None]:
print(preds)

In [None]:
print(reduced_data.shape)

In [None]:
good_data["cluster_number"]=preds

In [None]:
good_data.head()

In [None]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn import svm
# TODO: Split the data into training and testing sets(0.25) using the given feature as the target
# Set a random state.
new_data = good_data.drop(['cluster_number'], axis = 1)
new_feature = pd.DataFrame(good_data['cluster_number'])
X_train, X_test, y_train, y_test = train_test_split(new_data, new_feature, test_size=0.25, random_state=42)

# Create a support vector classifier and fit it to the training set
claaas = svm.SVC(random_state=42)
claaas.fit(X_train,y_train)

# TODO: Report the score of the prediction using the testing set
score =  claaas.score(X_test, y_test)
score

In [None]:
good_data.shape

In [None]:
preds.shape

In [None]:
new_data.shape

In [None]:
new_feature.shape

In [None]:
p=claaas.predict(X_test)
p=pd.DataFrame(p)

In [None]:
from sklearn.metrics import accuracy_score 

p1 = accuracy_score(y_test, p)

In [None]:
p1

In [None]:
# Summarize the predictions with a confusion matrix (rows: true labels, columns: predicted labels)
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, p)

In [None]:
from sklearn.metrics import classification_report
# The true labels go first, then the predictions
ff = classification_report(y_test, p)
print(ff)
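
scikit-learn's metric functions take the true labels first and the predictions second. For accuracy the order does not change the result, but for the confusion matrix and for precision and recall it does. The small hand-made example below (labels invented for illustration) shows the effect of swapping the arguments.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hand-made labels: the classifier over-predicts class 1.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)

precision = precision_score(y_true, y_pred)  # of the predicted 1s, how many are truly 1
recall = recall_score(y_true, y_pred)        # of the true 1s, how many were found

# Swapping the arguments silently exchanges the roles of precision and recall.
swapped_precision = precision_score(y_pred, y_true)

print("precision:", precision, "| recall:", recall,
      "| precision with swapped arguments:", swapped_precision)
```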

In [None]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn import svm
# TODO: Split the data into training and testing sets(0.25) using the given feature as the target
# Set a random state.
new_data = good_data.drop(['cluster_number'], axis = 1)
new_feature = pd.DataFrame(good_data['cluster_number'])
X_train, X_test, y_train, y_test = train_test_split(new_data, new_feature, test_size=0.25, random_state=42)

# Create an AdaBoost classifier and fit it to the training set
claaas = AdaBoostClassifier(random_state=42)
claaas.fit(X_train,y_train)

# TODO: Report the score of the prediction using the testing set
score =  claaas.score(X_test, y_test)
score

In [None]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.neighbors import KNeighborsClassifier
# TODO: Split the data into training and testing sets(0.25) using the given feature as the target
# Set a random state.
new_data = good_data.drop(['cluster_number'], axis = 1)
new_feature = pd.DataFrame(good_data['cluster_number'])
X_train, X_test, y_train, y_test = train_test_split(new_data, new_feature, test_size=0.25, random_state=42)

# Create a k-nearest neighbors classifier and fit it to the training set
claaas = KNeighborsClassifier(n_neighbors=2)
claaas.fit(X_train,y_train)

# TODO: Report the score of the prediction using the testing set
score =  claaas.score(X_test, y_test)
score

In [None]:
p = claaas.predict(X_test)
p = pd.DataFrame(p)
from sklearn.metrics import classification_report
# The true labels go first, then the predictions
ff = classification_report(y_test, p)
print(ff)