# Data Exploration - Santander Customer Transaction Prediction

This notebook contains a preliminary study for the [Santander Customer Transaction Prediction Challenge on Kaggle.com](https://www.kaggle.com/c/santander-customer-transaction-prediction).
![alt text](Kaggle.png)![alt text](im-wcsanusa-logo-7-19-18.png)

###### Description
At [Santander](https://www.santanderbank.com) our mission is to help people and businesses prosper. We are always looking for ways to help our customers understand their financial health and identify which products and services might help them achieve their monetary goals.

Our data science team is continually challenging our machine learning algorithms, working with the global data science community to make sure we can more accurately identify new ways to solve our most common challenge, binary classification problems such as: is a customer satisfied? Will a customer buy this product? Can a customer pay this loan?

In this challenge, we invite Kagglers to help us identify which customers will make a specific transaction in the future, irrespective of the amount of money transacted. The data provided for this competition has the same structure as the real data we have available to solve this problem.

In [None]:
# Importing all the libraries needed
import os
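from IPython.display import Image, display
# a bare Image(...) only renders as the last expression of a cell, so call display()
display(Image("Kaggle.png")) # same directory
display(Image("im-wcsanusa-logo-7-19-18.png")) # same directory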
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

%matplotlib inline

# getting my path, e.g. C:\\Users\\username\\Desktop on Windows
# or /Users/username/Desktop on Mac
path = os.getcwd()
# keep the first four path components; os.sep is / on Mac and \\ on Windows
path = os.sep.join(path.split(os.sep)[:4])

In [None]:
# reading the training set (train.csv); I had this file in C:\\Users\\username\\Desktop\\...
# os.path.join picks the right separator on both Mac and Windows
df = pd.read_csv(os.path.join(path, 'Santander_Customer_Transaction_Prediction', 'data', 'train.csv'))

In [None]:
# first records
df.head(5)

In [None]:
# last records
df.tail(5)

In [None]:
# summary statistics for the numeric columns of our df
df.describe()

In [None]:
# data types and df shape
print(df.dtypes)
print(df.shape)

In [None]:
# Finding the null values, if present:
null_counts = df.isnull().sum()
if null_counts.any():
    print('There are some null values here!\nVar_Name; number_of_nulls:')
    print(null_counts[null_counts > 0])
else:
    print('No null values in your df!')

In [None]:
# putting all the df column names in a list
dfcols = list(df.columns)

# excluding the first two columns (ID and target)
variables = dfcols[2:]

# splitting the list every n elements:
n = 10
chunks = [variables[x:x + n] for x in range(0, len(variables), n)]

In [None]:
# displaying one boxplot figure for every n columns:
for chunk in chunks:
    df.boxplot(column=chunk)
    plt.show()
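
The boxplots above draw individual points for values beyond the whiskers, which by default sit at 1.5 times the interquartile range. With 200 columns it can help to have the same information as numbers. Below is a minimal sketch that counts those points per column. The helper name `iqr_outlier_counts` and the toy columns `var_a` and `var_b` are made up for illustration and are not part of the competition data.

```python
import pandas as pd

def iqr_outlier_counts(frame, columns, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each column."""
    q1 = frame[columns].quantile(0.25)
    q3 = frame[columns].quantile(0.75)
    iqr = q3 - q1
    # compare every column against its own lower and upper bound
    below = frame[columns].lt(q1 - k * iqr, axis="columns")
    above = frame[columns].gt(q3 + k * iqr, axis="columns")
    return (below | above).sum()

# hypothetical toy data: var_a has one extreme value, var_b has none
toy = pd.DataFrame({"var_a": [1, 2, 3, 4, 100], "var_b": [1, 2, 3, 4, 5]})
outliers = iqr_outlier_counts(toy, ["var_a", "var_b"])
print(outliers)
```

In the notebook this could be called as `iqr_outlier_counts(df, variables)` and sorted to see which columns have the most points beyond the whiskers.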

## Let's look closer
###### Select an index from 0 to 199

In [None]:
# choose a column index
index = 34

In [None]:
# displaying boxplots for the selected column:
fig1, axes1 = plt.subplots(ncols = 2, sharey = True, figsize=(10,5))

# boxplot only for records with target = 1
ax1 = df.loc[df["target"] == 1].boxplot(column = variables[index], ax=axes1[0], sym='k.')
ax1.set_title('Target = 1')

# boxplot only for records with target = 0
ax2 = df.loc[df["target"] == 0].boxplot(column = variables[index], ax=axes1[1], sym='k.')
ax2.set_title('Target = 0')

fig1.suptitle("Boxplots for column: " + str(variables[index]), fontsize=15)
plt.show()

# displaying histograms for the selected column:
fig2, axes2 = plt.subplots(nrows = 2, sharex = True, figsize=(8,10))

# histogram only for records with target = 1
ax1 = df.loc[df["target"] == 1, variables[index]].plot.hist(ax=axes2[0], bins=20, color='green', alpha=0.7)
ax1.set_title('Target = 1')

# histogram only for records with target = 0
ax2 = df.loc[df["target"] == 0, variables[index]].plot.hist(ax=axes2[1], bins=20, color='blue', alpha=0.7)
ax2.set_title('Target = 0')

fig2.suptitle("Histograms for column: " + str(variables[index]), fontsize=15)
plt.show()
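
The plots above compare the selected column for the two target values by eye. A numeric summary per class can sit next to them. The sketch below uses `groupby` with `describe` on a small made-up frame; the values are invented for illustration, and only the column names `target` and `var_34` mirror the notebook.

```python
import pandas as pd

# hypothetical toy data standing in for df[["target", variables[index]]]
toy = pd.DataFrame({
    "target": [0, 0, 0, 1, 1],
    "var_34": [1.0, 2.0, 3.0, 10.0, 20.0],
})

# count, mean, std, min, quartiles and max of the column for each target value
summary = toy.groupby("target")["var_34"].describe()
print(summary)
```

On the real data the equivalent call would be `df.groupby("target")[variables[index]].describe()`.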

## Is the Dataset balanced?
## How many 0 and 1 items are there?

In [None]:
# value_counts() sorts by frequency, so look each class up by label, not by position
shares = df["target"].value_counts(normalize=True) * 100
zero = round(float(shares.loc[0]), 2)
one = round(float(shares.loc[1]), 2)
print('The dataset has {zero} % of target 0 and {one} % of target 1'.format(zero=zero, one=one))
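
To see counts and percentages side by side, the balance check can also be wrapped in a small helper. This is only a sketch: the function name `class_balance` and the toy series are assumptions made for illustration.

```python
import pandas as pd

def class_balance(target):
    """Return counts and percentages per class label, ordered by label."""
    counts = target.value_counts().sort_index()
    percent = (counts / counts.sum() * 100).round(2)
    return pd.DataFrame({"count": counts, "percent": percent})

# hypothetical toy target: three records of class 0 and one of class 1
toy_target = pd.Series([0, 0, 0, 1], name="target")
balance = class_balance(toy_target)
print(balance)
```

In the notebook, `class_balance(df["target"])` would give the same table for the training set.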