# Problem definition

The more the telecom services marketing paradigm evolves, the more important it becomes to retain high value customers. Traditional customer segmentation methods based on experience or ARPU (Average Revenue per User) consider neither customers’ future revenue nor the cost of servicing customers of different types. Therefore, it is very difficult to effectively identify high-value customers.

Cons of the current segmentation:
- A single segmentation serves several areas of the company with different objectives (best practice is to use different segmentations for different purposes such as customer care, acquisition, and offering P&S)
- Mixes behavioural elements with customer value
- Leaves out elements describing customer behaviour, e.g. recharge behaviour and payment methods
- Does not link to the market or market potential; looks only at the MCI customer base
- Does not incorporate future needs or potential; captures only historical and current customer usage

MCI’s current segmentation is ARPU-based and does not allow value-based marketing. MCI’s current segments are:
- SHVC (super high value customers)
- HVR (high voice revenue)
- HMR (high mix revenue)
- HDR (high data revenue)
- Mass


Ideally, different segmentation techniques should be used for different purposes such as customer care, acquisition, and P&S development:
1. ARPU and lifetime-value based segmentation for customer care (CC), applied across all touchpoints
2. Dynamic micro-segmentation for P&S development (contextual marketing that targets micro-segments), as opposed to traditional statistical segmentation
3. Value-based segmentation for customer acquisition, mixing behaviour patterns and value to address specific target segments with tailor-made offers

----------------

# Data preparation

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from numpy import percentile
from IPython.display import Audio
from sklearn.ensemble import IsolationForest
sound='0.mp3'
#%%capture #run the cell, capturing stdout, stderr, and IPython’s rich display() calls.

In [None]:
data=pd.read_csv('100000.csv', sep=',', dtype={'CUSTOMER_ID': float, 'GENDER': float })
print(type(data)) # pandas.core.frame.DataFrame
backup=data.copy()

In [None]:
data.columns

In [None]:
data.isnull().sum()
null_values=data.columns[data.isnull().any()]
data[null_values].isnull().sum()

In [None]:
data.isnull().sum().sort_values(ascending=False).head(10)

In [None]:
data.iloc[16443,:]  # we can see ### input for AGE

# Missing Values

In [None]:
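# Assign the result back instead of using inplace=True on a column,
# which newer pandas versions warn about and may silently ignore.
data['COUNT_MSISDN'] = data['COUNT_MSISDN'].fillna(0)
data['GENDER'] = data['GENDER'].fillna(0) #222

# AGE holds the placeholder string '###' as well as NaN; map both to 0
data['AGE'] = data['AGE'].replace('###', 0).fillna(0) #222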


# df["column1"].replace({"a": "x", "b": "y"}, inplace=True)


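# Convert GENERATION to a number: fill missing values with '0G', then keep the text before 'G'
data['GENERATION'] = data['GENERATION'].fillna('0G')
# .str[0] takes the first piece of each split; expand=True would return a
# multi-column DataFrame that cannot be assigned to a single column
data['GENERATION'] = data['GENERATION'].str.split('G').str[0]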

# data['GENERATION']= data['GENERATION'].str.strip().split('g')[0]
# data['GENERATION']= data['GENERATION'].values.strip().split('g')[0]

#.split('g')[0]

#Voice_main_lac
lac_cols = ['VOICE_MAIN_LACCELL', 'VOICE_MAIN_LAC_CELL_PROVINCE', 'ACTIVE_INTEC_USAGE']
data[lac_cols] = data[lac_cols].fillna(0)

#data[data.VOICE_MAIN_LACCELL.isnull()]=-1
#data[data.VOICE_MAIN_LAC_CELL_PROVINCE.isnull()]=-1
#data[data.ACTIVE_INTEC_USAGE.isnull()]=-1


#Packages
# Missing package counts are treated as zero
pkg_cols = [
    'MONTHLY_PKG_COUNT', 'BOUNDLE_PKG_COUNT', 'HOURLY_U_PKG_COUNT', 'GIFT_PKG_COUNT',
    'HOURLY_L_PKG_COUNT', 'SHARED_PKG_COUNT', 'OTHER_PKG_COUNT', 'CLEAN_INTERNET_PKG_COUNT',
    'CUSTOMIZED_OFFER_PKG_COUNT', 'DTS_PKG_COUNT', 'B2B_PKG_COUNT', 'SHORT_TERM_PKG_COUNT',
    'LONG_TERM_PKG_COUNT', 'NEWSUBS_PKG_COUNT', 'CONTENT_PLAN_PKG_COUNT', 'ROAMING_PKG_COUNT',
    'UNKNOWN_TYPE_PKG_COUNT', 'ZZ_BLANK_GENERAL_TYPE_PKG_COUNT', 'ZZ_NEED_TO_MAP_PKG_COUNT',
]
data[pkg_cols] = data[pkg_cols].fillna(0)



# Calculation fields

In [None]:
data['TENURE']=1400 - data.ACTIVATION_YEAR.values
data['VOL_ALL_CBS_BYTE']=data['VOL_2G_CBS_BYTE'].values+ data['VOL_3G_CBS_BYTE'].values + data['VOL_4G_CBS_BYTE'].values+data['VOL_UNK_CBS_BYTE'].values
data['ARPU_DATA']= data['CASH_PKG'].values + data['PAYG_DATA_RIAL'].values + data['PKG_DATA_ACTIVATION_RIAL_CRM'].values 
data['ARPU_NONDATA']=data.ARPU_RIAL.values - data.ARPU_DATA.values

In [None]:
data.GENDER.value_counts()

----------------------
# EDA (Exploratory Data Analysis)
Exploratory Data Analysis is the process of performing initial investigations on data to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations.

In [1]:
data.head()


In [None]:
data.info()
# we see many Data types as objects, so we have to change them into the proper Data type

In [None]:
data.describe() #Looking for some statistical information about each feature

In [None]:
plt.figure(figsize=(30,30))
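# numeric_only=True skips object columns, which raise an error in recent pandas versions
sns.heatmap(data.corr(numeric_only=True), cmap='viridis', annot=True)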

### Checking the skewness of our dataset
- Normally distributed data has a skewness close to zero.
- Skewness greater than zero means a longer right tail: most of the values sit on the left side of the distribution.
- Skewness smaller than zero means a longer left tail: most of the values sit on the right side of the distribution.
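
As a quick illustration of how the sign of the skewness relates to the shape of a distribution, here is a minimal sketch on synthetic data. The samples and variable names are made up for this example and are not part of the customer dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Right-skewed toy sample: most values are small, a few are very large.
right_skewed = pd.Series(rng.exponential(scale=1.0, size=10_000))
# Mirroring the sample moves the long tail to the left.
left_skewed = -right_skewed
# A symmetric sample for comparison.
symmetric = pd.Series(rng.normal(size=10_000))

skew_right = right_skewed.skew()
skew_left = left_skewed.skew()
skew_sym = symmetric.skew()

print(f"long right tail: {skew_right:.2f}")
print(f"long left tail:  {skew_left:.2f}")
print(f"symmetric:       {skew_sym:.2f}")
```

Revenue-like columns such as ARPU tend to look like the first case, which is one reason a few very large values can dominate plots and distance-based methods.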

In [None]:
data.skew(numeric_only=True)  # skip non-numeric columns

In [None]:
# data['GENDER'] = pd.to_numeric(data['GENDER'])
# data['GENDER'] =[int(g) for g in data['GENDER']]
data['GENDER'] = data['GENDER'].astype('int8')
data['AGE'] = data['AGE'].astype('int8')
# data['GENERATION']= [int(g) for g in data['GENERATION']]
data['GENERATION']=data['GENERATION'].astype('int8')
# data['PROVINCE']= [str(g) for g in data['PROVINCE']]          # data['PROVINCE'].astype(str)

data['VAS_CNT'] = pd.to_numeric(data['VAS_CNT'])
data['TOT_RCHG_CNT'] = pd.to_numeric(data['TOT_RCHG_CNT'])
data['MONTHLY_PKG_COUNT'] = pd.to_numeric(data['MONTHLY_PKG_COUNT'])

#Packages
#MONTHLY_PKG_COUNT

In [None]:
data.info()

### Label encoding

`PROVINCE` is a categorical column that still needs to be encoded (open item).
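
One possible way to encode `PROVINCE` is sketched below with `pd.factorize`. The province values and the `PROVINCE_CODE` column name are hypothetical, since the notebook does not show the real contents of the column, and mapping missing values to an `UNKNOWN` category is an assumption of this sketch.

```python
import pandas as pd

# Hypothetical province values; the real PROVINCE column is not shown in the notebook.
toy = pd.DataFrame({"PROVINCE": ["Tehran", "Fars", None, "Tehran", "Gilan"]})

# Treat missing provinces as their own category before encoding.
province = toy["PROVINCE"].fillna("UNKNOWN")

# factorize assigns integer codes in order of first appearance.
codes, categories = pd.factorize(province)
toy["PROVINCE_CODE"] = codes

print(toy)
print(list(categories))
```

Integer codes impose an arbitrary order on the provinces, so for distance-based methods such as K-Means a one-hot encoding (for example `pd.get_dummies`) may be the safer choice.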

## The distribution of a data set

Plotting a histogram of each numerical variable (in this case, all features) visualizes the distribution of each feature. It gives fast insights:
- Check the kind of distribution each feature follows
- Check data symmetry
- Verify feature frequencies
- Identify outliers

In [None]:
%%time
sns.set(style='white', font_scale=1.3, rc={'figure.figsize':(30,30)})
data2=data.drop('CUSTOMER_ID',axis=1)
ax=data2.hist(bins=20, color='red')

In [None]:
data2.plot(kind='box', subplots=True, layout=(15,15), sharex=False, sharey=False, color='black')
plt.show()

---------------

# Remove Outlier / Anomaly Detection
- Standard Deviation Method: suitable when the distribution of values in the sample is Gaussian or Gaussian-like
- Interquartile Range Method: suitable when the data is not normal enough to be treated as drawn from a Gaussian distribution
- Automatic Outlier Detection
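
The first two methods can be sketched in a few lines of NumPy. The function names, the 3-sigma cut-off, and the toy sample are assumptions made for illustration; the 1.5 multiplier for the IQR matches the value used in the IQR cells of this notebook.

```python
import numpy as np

def std_outliers(values, n_sigmas=3.0):
    """True where a value lies more than n_sigmas standard deviations from the mean."""
    values = np.asarray(values, dtype=float)
    mean, std = values.mean(), values.std()
    return np.abs(values - mean) > n_sigmas * std

def iqr_outliers(values, k=1.5):
    """True where a value lies more than k * IQR outside the first or third quartile."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Toy sample: 1000 Gaussian values plus two planted extremes at the end.
rng = np.random.default_rng(42)
sample = np.concatenate([rng.normal(loc=50, scale=5, size=1000), [120.0, -40.0]])

std_mask = std_outliers(sample)
iqr_mask = iqr_outliers(sample)
print("std method flags:", int(std_mask.sum()))
print("IQR method flags:", int(iqr_mask.sum()))
```

Both functions return boolean masks, so they can be applied to a column with something like `data[~iqr_outliers(data['ARPU_RIAL'])]` without worrying about positional versus label indices.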


The Python Outlier Detection (PyOD) module simplifies anomaly detection modeling. It collects a wide range of techniques, from supervised to unsupervised learning.
PyOD includes more than 30 detection algorithms, ranging from classical algorithms like Isolation Forest to deep learning methods and newer algorithms like COPOD. PyOD algorithms are well established, highly cited in the literature, and useful.


## Outlier detection algorithms used in PyOD

This section describes the outlier detection algorithms that power PyOD. Understanding how they work underneath gives more flexibility when applying them to a dataset.

Note: the term "outlying score" is used throughout this section. Every model, in some way, scores a data point and then uses a threshold value to decide whether the point is an outlier.

 
### Angle-Based Outlier Detection (ABOD)

- It considers the relationship between each point and its neighbors, not the relationships among those neighbors. The variance of a point's weighted cosine scores to all neighbors is used as its outlying score.
- ABOD performs well on multi-dimensional data.
- PyOD provides two versions of ABOD:
    - Fast ABOD: approximates the score using the k nearest neighbors
    - Original ABOD: considers all training points, with high time complexity

 
### k-Nearest Neighbors Detector

- For any data point, the distance to its kth nearest neighbor can be used as the outlying score.
- PyOD supports three kNN detectors:
    - Largest: uses the distance to the kth neighbor as the outlier score
    - Mean: uses the average distance to all k neighbors as the outlier score
    - Median: uses the median distance to the k neighbors as the outlier score
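
The three scoring variants can be sketched with plain NumPy. This is an illustrative brute-force version and not PyOD's implementation; the function name `knn_outlier_scores` and the toy data are assumptions made for this example.

```python
import numpy as np

def knn_outlier_scores(X, k=5, method="largest"):
    """Outlier score of each row from the distances to its k nearest neighbours."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances (brute force, fine for small samples only).
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # After sorting each row, column 0 is the zero distance of a point to itself; skip it.
    knn_dists = np.sort(dists, axis=1)[:, 1:k + 1]
    if method == "largest":
        return knn_dists[:, -1]           # distance to the kth neighbour
    if method == "mean":
        return knn_dists.mean(axis=1)     # average distance to the k neighbours
    if method == "median":
        return np.median(knn_dists, axis=1)
    raise ValueError(f"unknown method: {method}")

# Toy data: a Gaussian blob plus one planted outlier in the last row.
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(size=(200, 2)), [[8.0, 8.0]]])

scores = {m: knn_outlier_scores(X_toy, k=5, method=m) for m in ("largest", "mean", "median")}
for name, values in scores.items():
    print(name, "-> most anomalous row:", int(np.argmax(values)))
```

A threshold on these scores (for example a high percentile) then turns them into 0/1 labels, which is the role the `contamination` parameter plays in PyOD.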

 
### Isolation Forest

- It uses the scikit-learn library internally. The data is partitioned using a set of trees, and the anomaly score reflects how isolated a point is in that structure. The score is then used to separate outliers from normal observations.
- Isolation Forest performs well on multi-dimensional data.

 
### Histogram-based Outlier Detection

- It is an efficient unsupervised method that assumes feature independence and calculates the outlier score by building a histogram per feature.
- It is much faster than multivariate approaches, at the cost of less precision.
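
The idea can be sketched as follows: build one histogram per feature, look up the relative height of the bin each value falls into, and sum the negative log heights across features. This is a simplified illustration with fixed-width bins, not PyOD's HBOS implementation; the function name and the toy data are assumptions.

```python
import numpy as np

def hbos_scores(X, n_bins=10, eps=1e-9):
    """Histogram-based outlier score: sum over features of -log(relative bin height)."""
    X = np.asarray(X, dtype=float)
    n_samples, n_features = X.shape
    scores = np.zeros(n_samples)
    for j in range(n_features):
        counts, edges = np.histogram(X[:, j], bins=n_bins)
        # Normalise so the tallest bin has height 1.
        density = counts / counts.max()
        # Using only the interior edges keeps the bin index in 0..n_bins-1.
        idx = np.digitize(X[:, j], edges[1:-1])
        # Rare bins have a small height and therefore a large contribution.
        scores += -np.log(density[idx] + eps)
    return scores

# Toy data: a Gaussian blob plus one planted outlier in the last row.
rng = np.random.default_rng(1)
X_toy = np.vstack([rng.normal(size=(500, 2)), [[6.0, -6.0]]])

toy_scores = hbos_scores(X_toy)
print("most anomalous row:", int(np.argmax(toy_scores)))
```

Because each feature is scored independently, a point that is unusual only as a combination of otherwise normal values would not be flagged, which is the loss of precision mentioned above.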

 
### Local Correlation Integral (LOCI)

- LOCI is very effective for detecting outliers and groups of outliers. It provides a LOCI plot for each point, which summarizes information about the data in the area around the point: clusters, micro-clusters, their diameters, and their inter-cluster distances.
- Most other outlier detection methods output only a single number for each point and so cannot provide this kind of summary.

 
### Feature Bagging

- A feature bagging detector fits a number of base detectors on various feature sub-samples of the dataset. It uses averaging or other combination methods to improve the prediction accuracy.
- By default, Local Outlier Factor (LOF) is used as the base estimator. However, any estimator can be used as the base estimator, such as kNN or ABOD.
- Feature bagging first constructs n sub-samples by randomly selecting a subset of features, which brings diversity to the base estimators. The final prediction score is generated by averaging, or taking the maximum of, the scores of all base detectors.
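
A minimal sketch of this procedure using scikit-learn's `LocalOutlierFactor` as the base detector is shown below. It is not PyOD's `FeatureBagging` class; the function name, the subset-size rule (between half and all of the features), and the toy data are assumptions made for illustration.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def feature_bagging_scores(X, n_estimators=10, n_neighbors=20,
                           combination="average", random_state=0):
    """Combine LOF scores computed on random feature subsets."""
    X = np.asarray(X, dtype=float)
    n_features = X.shape[1]
    rng = np.random.default_rng(random_state)
    all_scores = []
    for _ in range(n_estimators):
        # Assumption of this sketch: each subset holds between half and all of the features.
        size = int(rng.integers(max(1, n_features // 2), n_features + 1))
        cols = rng.choice(n_features, size=size, replace=False)
        lof = LocalOutlierFactor(n_neighbors=n_neighbors)
        lof.fit(X[:, cols])
        # negative_outlier_factor_ is the negated LOF; flip the sign so larger means more anomalous.
        all_scores.append(-lof.negative_outlier_factor_)
    all_scores = np.vstack(all_scores)
    if combination == "average":
        return all_scores.mean(axis=0)
    if combination == "max":
        return all_scores.max(axis=0)
    raise ValueError(f"unknown combination: {combination}")

# Toy data: 300 Gaussian rows with 6 features plus one planted outlier in the last row.
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(size=(300, 6)), np.full((1, 6), 6.0)])

avg_scores = feature_bagging_scores(X_toy, combination="average")
max_scores = feature_bagging_scores(X_toy, combination="max")
print("average ->", int(np.argmax(avg_scores)), " max ->", int(np.argmax(max_scores)))
```

Averaging is the more conservative combination; taking the maximum lets a single feature subset flag a point, which can surface outliers that are visible only in a few dimensions.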

 
### Clustering Based Local Outlier Factor

- It classifies the data into small clusters and large clusters. The anomaly score is then calculated from the size of the cluster the point belongs to and the distance to the nearest large cluster.

 
### Extra utilities provided by PyOD

- The function `generate_data` can be used to generate random data with outliers. Inliers are generated from a multivariate Gaussian distribution and outliers from a uniform distribution.
- We can provide our own values for the outlier fraction and the total number of samples that we want in our dataset. We will use this utility function to create data in the implementation part.


In [None]:
# Drop identifiers, activity flags, categorical columns, and the raw volume columns
# (each column listed once; the original list repeated several names)
data = data.drop(['CUSTOMER_ID', 'ACTIVE', 'ACTIVE_DATA_RIAL', 'ACTIVE_DATA_USAGE', 'ACTIVE_VOICE_RIAL', 'ACTIVE_VOICE_USAGE',
                  'ACTIVE_SMS_RIAL', 'ACTIVE_SMS_USAGE', 'ACTIVE_INTEC_USAGE', 'ACTIVATION_YEAR', 'ACTIVATION_MONTH',
                  'PROVINCE', 'SIM_PLAN_CHANNEL', 'SIM_PLAN_CATEGORY', 'SEGMENT_NEW_VER', 'VOICE_MAIN_LACCELL',
                  'VOICE_MAIN_LAC_CELL_PROVINCE', 'VOL_2G_CBS_BYTE', 'VOL_3G_CBS_BYTE', 'VOL_4G_CBS_BYTE', 'VOL_UNK_CBS_BYTE'], axis=1)
data.info()

In [None]:
sns.regplot(x="ARPU_RIAL", y="ARPU_NONDATA", data=data)
sns.despine();

In [None]:
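import math

l = data.columns.values
number_of_columns = 60
# Integer number of grid rows needed to hold one box plot per column
# (the original expression produced a float, which plt.subplot rejects)
number_of_rows = math.ceil(len(l) / number_of_columns)

sns.set_style('whitegrid')
plt.figure(figsize=(number_of_columns, 5 * number_of_rows))
for i in range(len(l)):
    plt.subplot(number_of_rows, number_of_columns, i + 1)
    sns.boxplot(y=data[l[i]], color='green')  # vertical box plot of one column
plt.tight_layout()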

In [None]:
data[l[4]]

Families of outlier detection techniques:
- Distribution-based techniques: Minimum Covariance Determinant, Elliptic Envelope
- Depth-based technique: Isolation Forest
- Density-based technique: Local Outlier Factor
- Clustering-based technique: DBSCAN
- Unified library for outlier detection: PyOD
- Statistical techniques: Interquartile range
- Visualization techniques: Box plot

In [None]:
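# Assumption: every numeric column of `data` is used as a feature.
# The data has no labels, so there is no y to split.
X = data.select_dtypes(include='number').values
X = StandardScaler().fit_transform(X)

X_train, X_test = train_test_split(X, test_size=.4, random_state=42)

# Wrap the 2-D NumPy array in a pandas DataFrame for easier manipulation
X_train_pd = pd.DataFrame(X_train)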


In [None]:
# train kNN detector
from pyod.models.knn import KNN
clf_name = 'KNN'
clf = KNN()
clf.fit(X_train)

In [None]:
y_train_pred = clf.labels_  # binary labels (0: inliers, 1: outliers)
y_train_scores = clf.decision_scores_  # raw outlier scores

# get the prediction on the test data
y_test_pred = clf.predict(X_test)  # outlier labels (0 or 1)
y_test_scores = clf.decision_function(X_test)  # outlier scores

In [None]:
len(y_train_pred[y_train_pred==1])

In [None]:
len(y_test_pred[y_test_pred==0])

In [None]:
# With the trained kNN model, predict on the test data
y_test_pred = clf.predict(X_test) # outlier labels (0 or 1)
# The labels are 0 (inlier) and 1 (outlier), so a simple count shows how many points were flagged.
# With PyOD's default contamination of 0.1, about ten percent of the training points are labelled outliers.
unique, counts = np.unique(y_test_pred, return_counts=True)
dict(zip(unique, counts))
# The anomaly scores come from clf.decision_function:
y_test_scores = clf.decision_function(X_test)

In [None]:
y_test_scores

In [None]:
y_test_scores.shape

In [None]:
y_test_pred

In [None]:
plt.hist(y_test_scores, bins='auto')  # arguments are passed to np.histogram
plt.title("Histogram with 'auto' bins")
plt.show()

In [None]:
# Count the inliers (0) and outliers (1) under a manual score threshold
df_test = pd.DataFrame(X_test)
df_test['score'] = y_test_scores
# The threshold of 1 is ad hoc; choose it from the score histogram above
df_test['cluster'] = np.where(df_test['score']<1, 0, 1)
df_test['cluster'].value_counts()

# Summary statistics per group:
df_test.groupby('cluster').mean()

In [None]:
df_test.cluster.value_counts()

In [None]:
df_test.query('cluster==0')

In [None]:
data.iloc[39835,:]

In [None]:
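# Plot the first two standardized features, coloured by the kNN training labels
plt.scatter(X_train_pd[0], X_train_pd[1], c=y_train_pred, alpha=0.8)
plt.title('kNN outlier labels (first two features)')
plt.xlabel('feature 0')
plt.ylabel('feature 1')
plt.show()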

### Interquartile Range (IQR)

In [None]:
ys = data.ARPU_DATA
quartile_1, quartile_3 = np.percentile(ys, [25, 75])
iqr = quartile_3 - quartile_1
lower_bound = quartile_1 - (iqr * 1.5)
upper_bound = quartile_3 + (iqr * 1.5)
outliers_indices = np.where((ys > upper_bound) | (ys < lower_bound))

In [None]:
outliers_indices

In [None]:
np.percentile(ys, 75)

In [None]:
data.ARPU_DATA.value_counts().sort_values(ascending=False)

In [None]:
data.info()

In [None]:

from pyod.models.cblof import CBLOF
from sklearn.preprocessing import MinMaxScaler

# The tutorial this cell was adapted from plotted 'Sales' and 'Profit' columns, which do not exist here.
# Assumption: use ARPU_DATA and ARPU_NONDATA instead, scaled to [0, 1] to match the plotting grid.
features = ['ARPU_DATA', 'ARPU_NONDATA']
X_2d = MinMaxScaler().fit_transform(data[features])

outliers_fraction = 0.01
xx, yy = np.meshgrid(np.linspace(0, 1, 100), np.linspace(0, 1, 100))
clf = CBLOF(contamination=outliers_fraction, check_estimator=False, random_state=0)
clf.fit(X_2d)
scores_pred = clf.decision_function(X_2d) * -1
y_pred = clf.predict(X_2d)  # 0: inlier, 1: outlier
n_inliers = np.count_nonzero(y_pred == 0)
n_outliers = np.count_nonzero(y_pred == 1)
print('OUTLIERS:', n_outliers, 'INLIERS:', n_inliers)

# Decision surface over the unit square
threshold = percentile(scores_pred, 100 * outliers_fraction)
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]) * -1
Z = Z.reshape(xx.shape)

plt.figure(figsize=(8, 8))
plt.contourf(xx, yy, Z, levels=np.linspace(Z.min(), threshold, 7), cmap=plt.cm.Blues_r)
a = plt.contour(xx, yy, Z, levels=[threshold], linewidths=2, colors='red')
plt.contourf(xx, yy, Z, levels=[threshold, Z.max()], colors='orange')
b = plt.scatter(X_2d[y_pred == 0, 0], X_2d[y_pred == 0, 1], c='white', s=20, edgecolor='k')
c = plt.scatter(X_2d[y_pred == 1, 0], X_2d[y_pred == 1, 1], c='black', s=20, edgecolor='k')

plt.axis('tight')
# legend_elements() returns the contour line artist without relying on the removed .collections attribute
plt.legend([a.legend_elements()[0][0], b, c], ['learned decision function', 'inliers', 'outliers'],
           fontsize=20, loc='lower right')
plt.xlim((0, 1))
plt.ylim((0, 1))
plt.xlabel('ARPU_DATA (scaled)')
plt.ylabel('ARPU_NONDATA (scaled)')
plt.title('Cluster-based Local Outlier Factor (CBLOF)')
plt.show()

### Manual outlier and anomaly detection

In [None]:
plt.scatter(range(data.shape[0]),data['COUNT_MSISDN'].values)
plt.xlabel('Index')
plt.ylabel('COUNT MSISDN')
plt.title('Distribution of COUNT MSISDN')
sns.despine() #Remove the top and right spines from plot(s).

In [None]:
back=data.copy()
back

In [None]:
print(len(data.query('COUNT_MSISDN > 10')))
out=data.query('COUNT_MSISDN > 10')
out

In [None]:
data.drop(out.index, axis=0 , inplace=True)

In [None]:
print(len(data.query('ARPU_RIAL <0')))

In [None]:
data.drop(data[data.ARPU_RIAL<0].index,axis=0 , inplace=True )

In [None]:
data['ARPU_RIAL'].describe()

In [None]:
plt.scatter(range(data.shape[0]), np.sort(data['ARPU_RIAL'].values))

In [None]:
sns.histplot(data['ARPU_RIAL'], kde=True)  # distplot is deprecated in recent seaborn releases

In [None]:
isolation_forest = IsolationForest(n_estimators=100)
isolation_forest.fit(data['ARPU_RIAL'].values.reshape(-1, 1))
xx = np.linspace(data['ARPU_RIAL'].min(), data['ARPU_RIAL'].max(), len(data)).reshape(-1,1)
anomaly_score = isolation_forest.decision_function(xx)
outlier = isolation_forest.predict(xx)
plt.figure(figsize=(10,4))
plt.plot(xx, anomaly_score, label='anomaly score')
plt.fill_between(xx.T[0], np.min(anomaly_score), np.max(anomaly_score), where=outlier==-1, color='r',  alpha=.4, label='outlier region')
plt.legend()
plt.ylabel('anomaly score')
plt.xlabel('ARPU_RIAL')
plt.show();

In [None]:
ys = data.ARPU_RIAL
quartile_1, quartile_3 = np.percentile(ys, [25, 75])
iqr = quartile_3 - quartile_1
lower_bound = quartile_1 - (iqr * 1.5)
upper_bound = quartile_3 + (iqr * 1.5)
outliers_indices = np.where((ys > upper_bound) | (ys < lower_bound))

In [None]:
upper_bound

In [None]:
np.where(ys > upper_bound)

In [None]:
data.ARPU_RIAL.sort_values(ascending=False).head(40)

In [None]:
data.iloc[26086,]

In [None]:
len(outliers_indices[0])

In [None]:
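# np.where returns positions, not index labels; rows were dropped earlier without
# resetting the index, so translate positions to labels before dropping
data = data.drop(data.index[outliers_indices[0]], axis=0)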

In [None]:
len(data)

In [None]:
plt.scatter(range(data.shape[0]), np.sort(data['ARPU_RIAL'].values))

In [None]:
plt.scatter(x=data.AGE, y=data.COUNT_MSISDN)

In [None]:
plt.scatter(range(data.shape[0]),data['POSTPAID_COUNTS'].values)
plt.xlabel('Index')
plt.ylabel('POSTPAID COUNTS')
plt.title('Distribution of  POSTPAID COUNTS')
sns.despine() 

In [None]:
plt.scatter(x=data.AGE, y=data.POSTPAID_COUNTS)

In [None]:
%time len(data.query('AGE < 18'))

In [None]:
plt.scatter(range(data.shape[0]),data['PREPAID_COUNTS'].values)
plt.xlabel('Index')
plt.ylabel('PREPAID COUNTS')
plt.title('Distribution of  PREPAID COUNTS')
sns.despine() 

In [None]:
len(data.query('PREPAID_COUNTS > 10'))

In [None]:
print("skewness: %f" % data['ARPU_DATA'].skew())# unbiased skew over requested axis.
print("kurtosis: %f" % data['ARPU_DATA'].kurt()) # unbiased kurtosis over requested axis.

In [None]:
plt.scatter(range(data.shape[0]),data['ARPU_NONDATA'].values)
plt.xlabel('Index')
plt.ylabel('ARPU NONDATA')
plt.title('Distribution of ARPU NONDATA')
sns.despine() #Remove the top and right spines from plot(s).

In [None]:
sns.displot(data, x='COUNT_MSISDN', hue='GENDER', kind='kde')

## Dimensionality Reduction
- PCA
- UMAP
- t-SNE

### PCA (Principal Component Analysis)
PCA is affected by scale, so the features need to be scaled before applying PCA. `StandardScaler` standardizes each feature to unit scale (mean = 0 and variance = 1), which many machine learning algorithms require for optimal performance. The scikit-learn documentation has a section showing the negative effect of not standardizing the data.
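
The effect of scale can be seen on a small synthetic example. The two features below are hypothetical stand-ins for columns measured in very different units (for example a traffic volume in bytes next to a package count); they are not taken from the customer data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 1000

# Two independent toy features on very different scales.
big = rng.normal(loc=0.0, scale=1e6, size=n)
small = rng.normal(loc=0.0, scale=1.0, size=n)
X_toy = np.column_stack([big, small])

# Without scaling, the first component is essentially the large-scale feature.
raw_ratio = PCA(n_components=2).fit(X_toy).explained_variance_ratio_

# After standardization both features contribute about equally.
X_std = StandardScaler().fit_transform(X_toy)
scaled_ratio = PCA(n_components=2).fit(X_std).explained_variance_ratio_

print("explained variance ratio, raw:   ", np.round(raw_ratio, 4))
print("explained variance ratio, scaled:", np.round(scaled_ratio, 4))
```

On the raw data the first component captures almost all of the variance simply because one feature has larger numbers, not because it carries more information.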

In [None]:
# Columns excluded from the PCA input. errors='ignore' skips any column that an
# earlier cell already dropped, which would otherwise raise a KeyError.
drop_cols = ['CUSTOMER_ID', 'ACTIVE', 'ACTIVE_DATA_RIAL', 'ACTIVE_DATA_USAGE', 'ACTIVE_VOICE_RIAL', 'ACTIVE_VOICE_USAGE',
             'ACTIVE_SMS_RIAL', 'ACTIVE_SMS_USAGE', 'ACTIVE_INTEC_USAGE', 'ACTIVATION_YEAR', 'ACTIVATION_MONTH',
             'PROVINCE', 'SIM_PLAN_CHANNEL', 'SIM_PLAN_CATEGORY', 'SEGMENT_NEW_VER', 'VOICE_MAIN_LACCELL',
             'VOICE_MAIN_LAC_CELL_PROVINCE', 'VOL_2G_CBS_BYTE', 'VOL_3G_CBS_BYTE', 'VOL_4G_CBS_BYTE', 'VOL_UNK_CBS_BYTE']
data2 = data.drop(columns=drop_cols, errors='ignore')
data2.columns

In [None]:
data2.shape

### Plotting 65 dimensions in 2D

In [None]:
# Separating out the features
x = data2.loc[:, data2.columns].values
# Separating out the target
#y = df.loc[:,['target']].values
# Standardizing the features
x = StandardScaler().fit_transform(x)

In [None]:
pca=PCA(n_components=2)
pcComponents=pca.fit_transform(x)
pcDF=pd.DataFrame(data=pcComponents, columns=['pc1','pc2'])

In [None]:
plt.figure(figsize=(10,10))
# No labels are available to colour the points, so cmap and colorbar are omitted
plt.scatter(pcComponents[:,0], pcComponents[:,1], edgecolors='none', alpha=.7, s=40)
plt.xlabel('pc1')
plt.ylabel('pc2')

-------------

# Clustering

Clustering is the task of grouping a set of objects in such a way that those in the same group (called a cluster) are more similar to each other than to those in other groups. It is an exploratory data mining activity, and a common technique for statistical data analysis used in many fields including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression and computer graphics. Some common real life use cases of clustering are:

- Customer segmentation based on purchase history or interests to design targeted marketing campaigns.
- Cluster documents into multiple categories based on tags, topics, and the content of the document.
- Analysis of outcome in social / life science experiments to find natural groupings and patterns in the data.

### A list of 10 of the more popular algorithms is as follows:

- Affinity Propagation
- Agglomerative Clustering
- BIRCH
- DBSCAN
- K-Means
- Mini-Batch K-Means
- Mean Shift
- OPTICS
- Spectral Clustering
- Mixture of Gaussians

## K-Means
K-Means works well on low-dimensional data but degrades as the number of dimensions grows. The dimensionality can be reduced in three ways:

a) Drop features that are very similar to each other, keeping just one of each pair.
b) Combine features that carry more meaningful information when considered together.
c) If neither is possible, or the dimensions are still unmanageable afterwards, use one of the available dimensionality reduction techniques.
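
Option (c) can be sketched as a scale, reduce, then cluster sequence. The example below runs on synthetic blobs from `make_blobs`, not on the customer data; the number of features, components, and clusters are assumptions chosen only to keep the illustration small.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 600 points with 50 features, drawn around 3 centres.
X_toy, true_labels = make_blobs(n_samples=600, n_features=50, centers=3, random_state=42)

# 1) Standardize, 2) project onto the first two principal components, 3) cluster.
X_scaled = StandardScaler().fit_transform(X_toy)
X_reduced = PCA(n_components=2, random_state=42).fit_transform(X_scaled)

kmeans_toy = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=42)
toy_labels = kmeans_toy.fit_predict(X_reduced)

print("reduced shape:", X_reduced.shape)
print("cluster sizes:", np.bincount(toy_labels))
```

Clustering in the reduced space also makes the result easy to plot, at the cost of discarding whatever variance the dropped components carried.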

In [None]:
arpu_ds=data[['ARPU_DATA','ARPU_NONDATA']]
ss=StandardScaler()
data_cluster=arpu_ds.copy()
data_cluster[data_cluster.columns]=ss.fit_transform(data_cluster)
data_cluster.describe()

In [None]:
# Cluster on the standardized copy prepared in the previous cell so both ARPU columns weigh equally
x=data_cluster.values
wcss=[]
for i in range(1,10):
    kmeans=KMeans(n_clusters=i,init='k-means++', random_state=42)
    kmeans.fit(x)
    wcss.append(kmeans.inertia_)
wcss

In [None]:
%%time
%matplotlib inline
plt.plot(range(1,10),wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

In [None]:
kmeans=KMeans(n_clusters=4,init='k-means++',random_state=42)
y_kmeans=kmeans.fit_predict(x)

In [None]:
plt.scatter(x[:,0] ,x[:,1] , c=y_kmeans, s=50, cmap='viridis' )
centers=kmeans.cluster_centers_
plt.scatter(centers[:,0], centers[:,1], c='black', s=200, alpha=.5);