# Customer Satisfaction
#### Business Problem Introduction
From frontline support teams to C-suites, customer satisfaction is a key measure of success. Unhappy customers don't stick around. What's more, unhappy customers rarely voice their dissatisfaction before leaving.

Santander Bank is asking Kagglers to help them identify dissatisfied customers early in their relationship. Doing so would allow Santander to take proactive steps to improve a customer's happiness before it's too late.

In this competition, you'll work with hundreds of anonymized features to predict if a customer is satisfied or dissatisfied with their banking experience.

The original dataset can be found on Kaggle: <br> https://www.kaggle.com/c/santander-customer-satisfaction

#### Summary
Model performance, measured by AUC (area under the ROC curve):
* Logistic regression: 0.57

--------------
## View Train Data
The training data is the main dataset: we use it to look for patterns that explain the TARGET column, the dependent variable (y).

In [17]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_validate
from statistics import mean
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.datasets import make_classification
from sklearn import ensemble
import sklearn.metrics as metrics

sample=r'/kaggle/input/santander-customer-satisfaction/sample_submission.csv'
train=r'/kaggle/input/santander-customer-satisfaction/train.csv'
test=r'/kaggle/input/santander-customer-satisfaction/test.csv'

data=pd.read_csv(train)
data = data.drop_duplicates()  # returns a new frame, so reassign; this dataset has no duplicates
print(data.shape) #(76020, 371)
data.head() 
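
As a small illustration of the duplicate check above: `drop_duplicates()` returns a new DataFrame rather than modifying the original, so its result has to be assigned or compared. The toy frame below is hypothetical and is not taken from the Santander file.

```python
import pandas as pd

# Toy frame with one repeated row (hypothetical data, not the Santander file)
toy = pd.DataFrame({"ID": [1, 2, 2], "var": [10, 20, 20]})

n_dupes = int(toy.duplicated().sum())  # number of fully duplicated rows
deduped = toy.drop_duplicates()        # new frame; `toy` itself is unchanged

print(n_dupes, toy.shape, deduped.shape)
```

Comparing the shape before and after is a quick way to confirm whether any rows were removed.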

## Find Missing Values
After selecting only the numeric columns, the dataset still has shape (76020, 371), so every column is numeric. There are no missing values.

In [18]:
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
df = data.select_dtypes(include=numerics) 

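# Search for columns with missing values
def find_na(frame):
    """Print the five columns with the highest share of missing values."""
    print("Missing data by column as a percent:")
    missing_pct = frame.isnull().sum().sort_values(ascending=False) / len(frame) * 100
    print(missing_pct.head())
find_na(df)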
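
The two checks in this section (are all columns numeric, and is anything missing) can be sketched on a toy frame as follows. The column names and values are made up for illustration; on the real data one would pass `data` instead of `toy`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one numeric column with a gap, one complete
# numeric column, and one text column
toy = pd.DataFrame({
    "a": [1, 2, np.nan, 4],
    "b": [1.0, 2.0, 3.0, 4.0],
    "c": ["x", "y", "z", "w"],
})

# Columns that select_dtypes would silently leave out of a numeric-only view
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()

# Share of missing values per column, as a percent, largest first
missing_pct = toy.isnull().mean().mul(100).sort_values(ascending=False)

print(non_numeric)
print(missing_pct)
```

An empty `non_numeric` list is a more direct confirmation that no column was dropped than comparing shapes.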

## Examine Target Column
TARGET takes the values 1 and 0; per the Kaggle data description, 1 marks an unsatisfied customer and 0 a satisfied one. Because the outcome is binary, logistic regression is a reasonable first model.

In [19]:
target=df['TARGET'].unique()
print(target)
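
Beyond listing the unique values, it can help to see how the two classes are balanced, since a rare positive class is one reason to prefer AUC over plain accuracy. A minimal sketch on a made-up series follows; the 96/4 split is an assumption for illustration, not a figure measured from the Santander data.

```python
import pandas as pd

# Hypothetical target with a rare positive class (proportions are made up)
y_toy = pd.Series([0] * 96 + [1] * 4, name="TARGET")

counts = y_toy.value_counts()               # absolute count per class
share = y_toy.value_counts(normalize=True)  # fraction per class

print(counts)
print(share)
```

On the real data the same two calls would be applied to `df['TARGET']`.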

## Identify Highly Correlated Features
First, we remove features whose absolute correlation with another feature is above 0.90. A heatmap is normally a helpful visualization, but with 371 columns it would be slow to render and too dense to read.

In [27]:
print(df.shape, " before removing highly correlated variables.")
# Create correlation matrix
corr_matrix = df.corr().abs()

# Select upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Find feature columns with correlation greater than 0.90
to_drop = [column for column in upper.columns if any(upper[column] > 0.90)]

df=df.drop(to_drop, axis = 1)
print(df.shape, " after removing highly correlated variables.")
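
To make the upper-triangle filter easier to follow, here is the same technique on a three-column toy frame. The columns `a`, `a_copy`, and `b` are hypothetical: `a_copy` is built to be almost perfectly correlated with `a`, while `b` is independent noise.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)

# Hypothetical toy features
toy = pd.DataFrame({
    "a": a,
    "a_copy": 2 * a + 0.001 * rng.normal(size=200),  # nearly collinear with a
    "b": rng.normal(size=200),                       # independent of a
})

corr = toy.corr().abs()

# Keep only entries strictly above the diagonal so each pair is seen once
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))

# A column is dropped if it correlates above 0.90 with any earlier column
to_drop = [col for col in upper.columns if (upper[col] > 0.90).any()]

print(to_drop)
reduced = toy.drop(columns=to_drop)
print(reduced.columns.tolist())
```

Because only the upper triangle is inspected, the later column of each correlated pair is removed and the earlier one is kept. Note that this depends on column order.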

In [21]:
X=df.drop('TARGET', axis=1)
y=df['TARGET']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state = 42)
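
One variation worth considering, shown here as a sketch rather than what the notebook does: passing `stratify=y` to `train_test_split` keeps the class proportions the same in both splits, which matters when one class is rare. The arrays below are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 10% positive class
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 90 + [1] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.30,
    random_state=42,
    stratify=y_toy,  # preserve the 90/10 class ratio in both splits
)

print(y_tr.mean(), y_te.mean())
```

Without stratification, a small test split can end up with noticeably more or fewer positives than the training split, which makes the AUC estimate noisier.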

## Logistic Regression
Logistic regression reaches an AUC of 0.57 on the held-out 30% test split.

In [22]:
from sklearn.linear_model import LogisticRegression
logReg = LogisticRegression(solver='liblinear') #solver param gets rid of encoder error

#Train the model and create predictions
logReg.fit(X_train, y_train)
logPredict = logReg.predict_proba(X_test)[:, 1]  # predicted probability of class 1

#calculate AUC of model
auc = round( metrics.roc_auc_score(y_test, logPredict), 4 ) 
print("AUC for logistic regression is: ", auc)
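
A possible next step, sketched here on synthetic data only: logistic regression with regularization is sensitive to feature scale, so standardizing the features in a pipeline before fitting is a common thing to try. This was not run on the Santander data, so whether it changes the 0.57 score is an open question. The dataset comes from `make_classification`, and all variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the real data (assumption for illustration)
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=20, n_informative=5,
    weights=[0.9, 0.1], random_state=42,
)
X_syn[:, 0] *= 1e4  # give one feature a much larger scale than the rest

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.30, random_state=42, stratify=y_syn
)

# The scaler is fitted on the training split only, then applied to the test split
pipe = make_pipeline(StandardScaler(), LogisticRegression(solver="liblinear"))
pipe.fit(X_tr, y_tr)

proba = pipe.predict_proba(X_te)[:, 1]  # probability of the positive class
auc_syn = roc_auc_score(y_te, proba)
print("AUC on synthetic data:", round(auc_syn, 4))
```

Wrapping the scaler and the model in one pipeline keeps test-set statistics out of the fitting step.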

## References
1. https://www.kaggle.com/rahulanand0070/feature-selection
2. https://www.kaggle.com/solegalli/feature-selection-with-feature-engine