#### OFM3 — OFM3 TASK 2: DIMENSIONALITY REDUCTION METHODS

<ul>
<li>Ryan L. Buchanan</li>
<li>Student ID:  001826691</li>
<li>Masters Data Analytics (12/01/2020)</li>
<li>Program Mentor:  Dan Estes</li>
<li>385-432-9281 (MST)</li>
<li>rbuch49@wgu.edu</li>
</ul>

#### Scenario 1
One of the most critical factors in customer relationship management that directly affects a company’s long-term profitability is understanding its customers. When a company can better understand its customer characteristics, it is better able to target products and marketing campaigns for customers, resulting in better profits for the company in the long term.

You are an analyst for a telecommunications company that wants to better understand the characteristics of its customers. You have been asked to perform a market basket analysis to analyze customer data to identify key associations of your customer purchases, ultimately allowing better business and strategic decision-making.

#### Part I: Research Question

#### <span style="color:green"><b>A1. Proposal of Question</b>:</span>
Which principal variables of our customers indicate a high risk of churn, and which relationships among customer features might help identify customers who are likely to churn?  This question will be answered using principal component analysis (PCA).
<br>In other words, although we are not using a supervised learning model, such as linear regression, to make predictions, we are trying to better understand the relationships between customer features in order to inform stakeholder decisions.


#### <span style="color:green"><b>A2. Defined Goal</b>:</span>
Stakeholders in the company will benefit from knowing, with some measure of confidence, which customers are at highest risk of churn, because this knowledge lends weight to decisions about marketing improved services to customers with these characteristics and past user experiences.
The goal of this data analysis is to present numerical values to company stakeholders to help them better understand their customers and the principal components associated with customer churn.

#### Part II: Technique Justification
B.  Explain the reasons for using PCA by doing the following:

#### <span style="color:green"><b>B1. Explanation of PCA</b>:</span>
In this analysis, Principal Component Analysis (PCA) is used for feature extraction. The breakdown of PCA involves linear algebra operations to manipulate the dataset into a more tractable form with far fewer and more meaningful variables.  The steps for PCA are as follows:
* Standardize the data.  This involves subtracting each variable's mean from its data points and dividing the result by that variable's standard deviation.
* Obtain the Eigenvectors and Eigenvalues from the covariance matrix or correlation matrix.
* Sort Eigenvalues in descending order and choose the <i>k</i> Eigenvectors that correspond to the <i>k</i> largest Eigenvalues where <i>k</i> is the number of the dimensions (columns) of the new feature subspace.
* Construct the projection matrix <i>W</i> from the selected <i>k</i> Eigenvectors.
* Transform the original dataset <i>x</i> via <i>W</i> to obtain a <i>k</i>-dimensional feature subspace <i>Y</i>.

<span style="color:orange">(SuperDataScience)</span>
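The steps above can be sketched directly in NumPy. This is a minimal illustration on synthetic data: the array `X`, its size, and the choice `k = 2` are assumptions made for the example, not values taken from the churn dataset, and it is not the implementation used later in this notebook, which relies on Scikit-learn.

```python
import numpy as np

# Synthetic stand-in data (an assumption): 200 rows, 4 features in two correlated pairs
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
X = np.column_stack([
    base[:, 0],
    base[:, 0] + 0.1 * rng.normal(size=200),
    base[:, 1],
    base[:, 1] + 0.1 * rng.normal(size=200),
])

# Step 1: standardize each column (mean 0, sample standard deviation 1)
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Step 2: eigen-decompose the covariance matrix of the standardized data
cov = np.cov(X_std, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Step 3: sort eigenvalues (and their eigenvectors) in descending order
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Step 4: build the projection matrix W from the k leading eigenvectors
k = 2
W = eigenvectors[:, :k]

# Step 5: project the data onto the k-dimensional subspace
Y = X_std @ W

print("Eigenvalues:", np.round(eigenvalues, 3))
print("Shape of Y:", Y.shape)
```

Because the data are standardized, the eigenvalues sum to the number of features, and the variance of each column of `Y` equals the corresponding eigenvalue.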

#### <span style="color:green"><b>B2. PCA Assumption</b>:</span>
One assumption of this approach is that we can reduce the dimensions (the number of our customers' features) of this particular <i>d</i>-dimensional churn dataset by projecting it onto a <i>k</i>-dimensional subspace.  The point is to find <i>k</i> components, where <i>k</i> is less than <i>d</i>, that still retain most of the variance in the data 
<span style="color:orange">(SuperDataScience)</span>.

#### Part III: Data Preparation

#### <span style="color:green"><b>C1. Continuous Dataset Variables</b>:</span>
In cleaning the data, we identify the following continuous predictor variables as relevant:
* Children
* Age
* Income
* Outage_sec_perweek
* Email
* Contacts    
* Yearly_equip_failure
* Tenure (the number of months the customer has stayed with the provider)
* MonthlyCharge
* Bandwidth_GB_Year    

Our target variable for all of these analyses is Churn. Churn is a binary (yes/no) variable, so we will encode it accordingly with a dummy variable (1/0).
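As a minimal sketch of this encoding, pandas' `Series.map` with an explicit dictionary can do the same job on a toy column. The `toy` DataFrame below is a made-up stand-in, not the churn data. One difference worth noting: `map` leaves any value other than 'Yes' or 'No' as `NaN` instead of silently coding it as 0.

```python
import pandas as pd

# Hypothetical stand-in for the Churn column
toy = pd.DataFrame({'Churn': ['Yes', 'No', 'No', 'Yes']})

# Map the yes/no labels to 1/0; unmapped values would become NaN
toy['DummyChurn'] = toy['Churn'].map({'Yes': 1, 'No': 0})

print(toy)
```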

In [None]:
# Standard data science imports
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

# Visualization libraries
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib.axes._axes import _log as matplotlib_axes_logger
%matplotlib inline

# Scikit-learn
import sklearn
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import IncrementalPCA
from sklearn.cluster import KMeans
from sklearn import metrics

# Import Scikit Learn PCA application
from sklearn.decomposition import PCA

# Import Scipy for feature scaling
import scipy
from scipy.cluster.vq import whiten

In [None]:
# Change color of Matplotlib font
import matplotlib as mpl

COLOR = 'white'
mpl.rcParams['text.color'] = COLOR
mpl.rcParams['axes.labelcolor'] = COLOR
mpl.rcParams['xtick.color'] = COLOR
mpl.rcParams['ytick.color'] = COLOR

In [None]:
# Increase Jupyter display cell-width
from IPython.display import display, HTML
display(HTML("<style>.container { width:75% !important; }</style>"))

In [None]:
# Ignore Warning Code
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Load data set into Pandas dataframe
churn_df = pd.read_csv('data/churn_clean.csv')

In [None]:
# Examine the features of the dataset
churn_df.columns

In [None]:
# Get an idea of dataset size
churn_df.shape

In [None]:
# View DataFrame info
churn_df.info()

In [None]:
# Provide an initial look at extant dataset
churn_df.head()

In [None]:
# Get an overview of descriptive statistics
churn_df.describe()

In [None]:
# Use describe method to view non-numerical data
churn_df.describe(exclude='number')

In [None]:
# Get data types of features
churn_df.dtypes

In [None]:
# Encode binary categorical variable with dummies
churn_df['DummyChurn'] = [1 if v == 'Yes' else 0 for v in churn_df['Churn']] ### If the customer left (churned) they get a '1'

In [None]:
# Drop original binary categorical feature from dataframe
churn_df = churn_df.drop(columns=['Churn'])

In [None]:
# Remove identifier, geographic & categorical columns, plus other columns not used in this analysis, to leave a fully numerical dataframe
churn_df = churn_df.drop(columns=['CaseOrder', 'Customer_id', 'Interaction', 'UID', 'City', 'State', 
                                  'County', 'Zip', 'Lat', 'Lng', 'Area', 'TimeZone', 
                                  'Job', 'Marital', 'PaymentMethod', 'Gender', 'Techie', 
                                  'Contract', 'Port_modem', 'Tablet', 
                                  'InternetService', 'Phone', 'Multiple', 'OnlineSecurity', 
                                  'OnlineBackup', 'DeviceProtection', 'TechSupport', 
                                  'StreamingTV', 'StreamingMovies', 'PaperlessBilling', 
                                  'Item2', 'Item3', 'Item4', 'Item5', 'Item6', 'Item7', 
                                  'Item8'])

In [None]:
# Move DummyChurn to end of dataset to set as target
churn_df = churn_df[['Children', 'Age', 'Income', 'Outage_sec_perweek', 'Email', 'Contacts',
       'Yearly_equip_failure', 'Tenure', 'MonthlyCharge', 'Bandwidth_GB_Year', 'DummyChurn']]

In [None]:
# Review changes in DataFrame
churn_df.head()

In [None]:
# Examine the features of the dataset
churn_df.columns

In [None]:
# Create histograms of continuous variables & the encoded target variable
churn_df[['Children', 'Age', 'Income', 'Outage_sec_perweek', 'Email', 
          'Contacts', 'Yearly_equip_failure', 'Tenure', 'MonthlyCharge', 
          'Bandwidth_GB_Year', 'DummyChurn']].hist()
plt.tight_layout()

In [None]:
# Set plot style to ggplot for aesthetics & R style
plt.style.use('ggplot')

In [None]:
# Create a scatterplot to get an idea of correlations between potentially related variables
sns.scatterplot(x=churn_df['MonthlyCharge'], y=churn_df['Outage_sec_perweek'], color='blue')
plt.show();

In [None]:
# Create a scatterplot to get an idea of correlations between potentially related variables
sns.scatterplot(x=churn_df['Outage_sec_perweek'], y=churn_df['DummyChurn'], color='blue')
plt.show();

In [None]:
# Create a scatterplot to get an idea of correlations between potentially related variables
sns.scatterplot(x=churn_df['Tenure'], y=churn_df['Bandwidth_GB_Year'], color='blue')
plt.show();

In [None]:
# Provide a scatter matrix of numeric variables for high level overview of potential relationships & distributions
churn_numeric = churn_df[['Children', 'Age', 'Income', 'Outage_sec_perweek', 
                          'Email', 'Contacts','Yearly_equip_failure', 'Tenure', 
                          'MonthlyCharge', 'Bandwidth_GB_Year', 'DummyChurn']]


scatter_matrix = pd.plotting.scatter_matrix(
    churn_numeric,
    figsize  = [15, 15],
    diagonal = "kde",
    color="b"
)

for ax in scatter_matrix.ravel():
    ax.set_xlabel(ax.get_xlabel(), fontsize = 10, rotation = 90)
    ax.set_ylabel(ax.get_ylabel(), fontsize = 10, rotation = 0)

In [None]:
sns.pairplot(churn_df, hue='DummyChurn', diag_kind='hist')

In [None]:
# Create multiple boxplots for continuous variables
churn_df.boxplot(column=['MonthlyCharge','Bandwidth_GB_Year'])

In [None]:
# Create Seaborn boxplot for a continuous variable
sns.boxplot(x='MonthlyCharge', data = churn_df)
plt.show()

In [None]:
# Create Seaborn boxplot for a continuous variable
sns.boxplot(x='Bandwidth_GB_Year', data = churn_df)
plt.show()

In [None]:
# Create Seaborn boxplots for continuous variables
sns.boxplot(x='Tenure', data = churn_df)
plt.show()

#### Anomalies
It appears that anomalies have been removed from the supplied dataset, churn_clean.csv. &nbsp; There are no remaining outliers.
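One way to back up a visual reading of boxplots is to count the points that fall outside the whiskers, using the usual 1.5 × IQR rule. The sketch below does this on a small made-up DataFrame; `toy` and the helper `count_iqr_outliers` are hypothetical names introduced for illustration. Running the same helper over the churn columns would be one possible check, assuming the IQR rule is the outlier definition of interest.

```python
import pandas as pd

def count_iqr_outliers(series):
    """Count values lying more than 1.5 * IQR beyond the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(((series < lower) | (series > upper)).sum())

# Made-up values: the 500.0 entry is deliberately extreme
toy = pd.DataFrame({
    'MonthlyCharge': [100.0, 120.0, 130.0, 140.0, 150.0, 160.0, 170.0, 500.0],
    'Tenure': [1.0, 5.0, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
})

outlier_counts = {col: count_iqr_outliers(toy[col]) for col in toy.columns}
print(outlier_counts)
```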

In [None]:
# Discover missing data points within dataset
data_nulls = churn_df.isnull().sum()
print(data_nulls)

In [None]:
# Check for missing data & visualize missing values in dataset 

# Install appropriate library
!pip install missingno

# Importing the libraries
import missingno as msno

# Visualize missing values as a matrix
msno.matrix(churn_df);
# (GeeksForGeeks, p. 1)

In [None]:
churn_df.head()

In [None]:
# List features for analysis
features = (list(churn_df.columns[:-1]))
print('Features for analysis include: \n', features)

In [None]:
# Export the cleaned, prepared dataset (index=False avoids writing the row index as an extra column)
churn_df.to_csv('data/churn_prepared_pca.csv', index=False)

#### <span style="color:green"><b>C2. Standardization of Dataset Variables</b>:</span>

In [None]:
# Load clean, prepared dataset
churn_df = pd.read_csv('data/churn_prepared_pca.csv')

In [None]:
# Standardize the data
churn_standardized = (churn_df - churn_df.mean()) / churn_df.std()

In [None]:
# View standardized values
churn_standardized.head()

In [None]:
# Statistically describe standardized values
churn_standardized.describe()
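A small detail about the standardization above: pandas' `DataFrame.std()` defaults to the sample standard deviation (`ddof=1`), whereas Scikit-learn's `StandardScaler` divides by the population standard deviation (`ddof=0`). The sketch below, on a made-up four-row DataFrame (`toy` is a hypothetical stand-in), shows that the two conventions differ only by a constant factor of sqrt(n / (n - 1)), which becomes negligible for large datasets.

```python
import numpy as np
import pandas as pd

# Made-up values standing in for two churn features
toy = pd.DataFrame({
    'Tenure': [1.0, 12.0, 24.0, 60.0],
    'MonthlyCharge': [100.0, 150.0, 200.0, 250.0],
})

# Sample standard deviation (ddof=1), the pandas default
z_sample = (toy - toy.mean()) / toy.std()

# Population standard deviation (ddof=0), the convention StandardScaler uses
z_population = (toy - toy.mean()) / toy.std(ddof=0)

# Every standardized value differs by the same constant factor
ratio = (z_population / z_sample).to_numpy()
print(ratio)
```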

#### <span style="color:orange"><b>Visualization of Feature Scaling</b></span>

In [None]:
# Scale the data with the Scipy whiten method
churn_df_scaled = whiten(churn_df)
print(churn_df_scaled)

In [None]:
# Plot original & scaled data
plt.plot(churn_df, 
        label="original")
plt.plot(churn_df_scaled,
        label="scaled")

# Show legend and display plot
plt.legend()
plt.show()

#### Part IV: Analysis
D. Perform PCA by doing the following:

#### <span style="color:green"><b>D1. Principal Components</b></span>
<span style="color:red">Determine the matrix of all the principal components.</span>

In [None]:
# Select the standardized features & create a list of principal component names
churn_numeric = churn_standardized[['Children', 'Age', 'Income', 'Outage_sec_perweek', 
                      'Email', 'Contacts','Yearly_equip_failure', 
                      'Tenure', 'MonthlyCharge', 'Bandwidth_GB_Year']]
pcs_names = []
for i, col in enumerate(churn_numeric.columns):
    pcs_names.append('PC' + str(i + 1))
print(pcs_names)

In [None]:
# Select number of components to extract & fit PCA to the standardized features
pca = PCA(n_components = churn_numeric.shape[1])
pca.fit(churn_numeric)
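One common way to present the matrix of all principal components is as a loadings table, with one row per original feature and one column per component. The sketch below builds such a table from the `components_` attribute of a fitted Scikit-learn `PCA`. The data are random and the names `toy`, `toy_std`, `pca_toy` and `loadings` are assumptions for illustration, so the numbers carry no meaning for the churn dataset.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Random stand-in data with three made-up feature columns
rng = np.random.default_rng(42)
toy = pd.DataFrame(rng.normal(size=(100, 3)),
                   columns=['Tenure', 'MonthlyCharge', 'Income'])
toy_std = (toy - toy.mean()) / toy.std()

# Fit PCA with as many components as there are features
pca_toy = PCA(n_components=toy_std.shape[1])
pca_toy.fit(toy_std)

# components_ has shape (n_components, n_features); transpose it so that
# rows are features and columns are principal components
loadings = pd.DataFrame(
    pca_toy.components_.T,
    columns=['PC' + str(i + 1) for i in range(pca_toy.n_components_)],
    index=toy_std.columns,
)
print(loadings)
```

Each column of `loadings` is a unit-length eigenvector, and the columns are mutually orthogonal.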

In [None]:
# Set independent and dependent variables
X = churn_df[churn_numeric.columns].values
y = churn_df['DummyChurn'].values

In [None]:
# Import Scikit-learn PCA and StandardScaler
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Scale features with Scikit-learn's standardization class
sc = StandardScaler()
X = sc.fit_transform(X)

In [None]:
# Split dataset into training and test sets for logistic regression model analysis 
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
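The cells above scale the full feature matrix before splitting it, so the test rows contribute to the scaler's mean and standard deviation. An alternative design, sketched here under the assumption that a strict train/test separation is wanted, chains scaling, PCA and the classifier in a Scikit-learn pipeline so that every step is fitted on the training rows only. The data come from `make_classification` and are purely synthetic; `X_toy`, `y_toy` and `model` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary-classification data standing in for the churn features
X_toy, y_toy = make_classification(n_samples=300, n_features=10,
                                   n_informative=4, random_state=0)

# Split first, so the test rows never influence the fitted transforms
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy,
                                          test_size=0.2, random_state=0)

# Scaling, PCA and the classifier are all fitted on the training rows only
model = make_pipeline(StandardScaler(),
                      PCA(n_components=2),
                      LogisticRegression(random_state=0))
model.fit(X_tr, y_tr)

test_accuracy = model.score(X_te, y_te)
print("Test accuracy:", test_accuracy)
```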

#### <span style="color:green"><b>D2. Identification of Total Number of Components</b></span>
<span style="color:red">Identify the total number of principal components using the elbow rule or the Kaiser criterion. Include a screenshot of the scree plot.</span>

In [None]:
# Run the scree plot
plt.plot(pca.explained_variance_ratio_)
plt.xlabel('Number of Components')
plt.ylabel('Explained Variance')
plt.show();
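The Kaiser criterion can be applied numerically alongside the scree plot: for standardized data, keep the components whose eigenvalue exceeds 1, meaning each explains more variance than a single original feature. The sketch below is a minimal example on synthetic data. The array `X` and its structure (three features driven by one shared factor, plus two noise features) are assumptions chosen so that at least one eigenvalue is clearly large.

```python
import numpy as np

# Synthetic data: three correlated features plus two unrelated noise features
rng = np.random.default_rng(1)
shared = rng.normal(size=(500, 1))
X = np.hstack([
    shared + 0.3 * rng.normal(size=(500, 3)),
    rng.normal(size=(500, 2)),
])

# Eigenvalues of the correlation matrix, sorted in descending order
corr = np.corrcoef(X, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]

# Kaiser criterion: retain components with eigenvalue greater than 1
n_keep = int((eigenvalues > 1).sum())

print("Eigenvalues:", np.round(eigenvalues, 3))
print("Components retained by the Kaiser criterion:", n_keep)
```

With a fitted Scikit-learn `PCA` on standardized features, the `explained_variance_` attribute holds the corresponding eigenvalues.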

#### <span style="color:green"><b>D3. Total Variance of Components</b></span>
<span style="color:red">Identify the variance of each of the principal components identified in part D2.</span>

#### <span style="color:green"><b>D4. Total Variance Captured by Components</b></span>
<span style="color:red">Identify the total variance captured by the principal components identified in part D2.</span>

In [None]:
# Apply Scikit-learn PCA method
pca = PCA(n_components = 2)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
explained_variance = pca.explained_variance_ratio_
print("Explained variance ratio of each component: ", explained_variance)
print("Total explained variance: ", explained_variance.sum())

In [None]:
# Show the cumulative explained variance to help select the fewest components
for pc_name, var in zip(pcs_names, np.cumsum(pca.explained_variance_ratio_)):
    print(pc_name, var)
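A cumulative-variance listing like the one above can also be turned into a simple selection rule: pick the smallest number of components whose cumulative explained variance reaches a chosen threshold. In the sketch below, both the `ratios` array and the 80% threshold are hypothetical values chosen for illustration, not results from the churn data.

```python
import numpy as np

# Hypothetical explained-variance ratios for five components
ratios = np.array([0.45, 0.25, 0.15, 0.10, 0.05])
cumulative = np.cumsum(ratios)

# Assumed threshold: keep enough components to explain at least 80% of variance
threshold = 0.80

# np.argmax returns the index of the first True value; add 1 to get a count
k = int(np.argmax(cumulative >= threshold)) + 1

print("Cumulative explained variance:", cumulative)
print("Components needed:", k)
```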

In [None]:
# Train the Logistic Regression model on the training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)

In [None]:
# Predict values for the test set
y_pred = classifier.predict(X_test)
print(y_pred)

In [None]:
# Create a confusion matrix
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
print("Accuracy for prediction with test set: ", accuracy_score(y_test, y_pred) * 100, "%")
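Accuracy is only one of the quantities that can be read off a binary confusion matrix. Scikit-learn lays the matrix out with true labels as rows and predicted labels as columns, so for labels 0/1 it reads `[[TN, FP], [FN, TP]]`. The sketch below derives accuracy, precision and recall from a matrix of hypothetical counts; `cm_toy` is made up and does not reflect the model's actual results.

```python
import numpy as np

# Hypothetical confusion matrix: rows = true class, columns = predicted class
cm_toy = np.array([[50, 10],
                   [5, 35]])

# Unpack in Scikit-learn's row-major order for binary labels 0/1
tn, fp, fn, tp = cm_toy.ravel()

accuracy = (tp + tn) / cm_toy.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)            # share of predicted churners who did churn
recall = tp / (tp + fn)               # share of actual churners who were caught

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
```

For a churn problem, recall on the churn class is often worth reporting next to accuracy, since missed churners are usually the costlier error.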

#### <span style="color:orange"><b>Visualization of Training set results</b></span>

In [None]:
# Visualization of Training set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green', 'blue')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green', 'blue'))(i), label = j)
plt.title('Logistic Regression (Training set)')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.legend()
plt.show()

#### <span style="color:orange"><b>Visualization of Test set results</b></span>

In [None]:
# Visualization of Test set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green', 'blue')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green', 'blue'))(i), label = j)
plt.title('Logistic Regression (Test set)')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.legend()
plt.show()

In [None]:
# Create a visually more intuitive confusion matrix (Dennis, pg. 1)
cf_matrix = cm
group_names = ['True Neg', 'False Pos', 'False Neg', 'True Pos']
group_counts = ["{0:0.0f}".format(value) for value in
                cf_matrix.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in
                     cf_matrix.flatten()/np.sum(cf_matrix)]
labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in
          zip(group_names,group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)
sns.heatmap(cf_matrix, annot=labels, fmt='', cmap='Blues')

#### <span style="color:green"><b>D5. Summary of Data Analysis</b></span>
<span style="color:red">Summarize the results of your data analysis.</span>

It is critical that decision-makers & marketers understand that our predictor variables, after scaling and reduction to two principal components, yield a relatively low accuracy score of 0.84.   We should analyse the features that are common among customers leaving the company & attempt to reduce the likelihood of their occurring with any given customer in the future.   The analysis suggests that as a customer subscribes to more of the services the company provides, an additional port modem or online backup for example, they are less likely to leave the company.   Clearly, it is in the best interest of customer retention to provide customers with more services & to improve their experience with the company by helping them understand all the services that are available to them as subscribers, not simply mobile phone service.

#### <span style="color:green"><b> E. Sources for Third-Party Code</b></span>
* GeeksForGeeks. &ensp; (2019, July 4). &ensp; <i>Python | Visualize missing values (NaN) values using Missingno Library</i>. &ensp; GeeksForGeeks. &ensp; https://www.geeksforgeeks.org/python-visualize-missing-values-nan-values-using-missingno-library/
<br>
* Dennis, T. &ensp; (2019, July 25). &ensp; <i>Confusion Matrix Visualization</i>. &ensp; Medium. &ensp; https://medium.com/@dtuk81/confusion-matrix-visualization-fc31e3f30fea
<br>
* SuperDataScience. &ensp; (2021, August 15). &ensp; <i>Machine Learning A-Z: Hands-On Python & R in Data Science</i>. &ensp; https://www.superdatascience.com/

#### <span style="color:green"><b> F. Sources</b></span>
* CBTNuggets. &ensp; (2018, September 20). &ensp; <i>Why Data Scientists Love Python</i>. &ensp; CBTNuggets. &ensp; https://www.cbtnuggets.com/blog/technology/data/why-data-scientists-love-python
<br> 
* Massaron, L. & Boschetti, A. &ensp; (2016). &ensp; <i>Regression Analysis with Python</i>. &ensp; Packt Publishing.