## Mini Project III

Repo with the instructions for the Mini Project III.


### Topics
This mini project is dedicated to the following topics:
- Data Wrangling
- Data Visualization
- Data Preparation and Feature Engineering
- Dimensionality Reduction
- Unsupervised Learning

### Data
We will use historical data on financial transactions. You can download the data from [here](https://drive.google.com/file/d/1zAjnf936aHkwVCq_BmA47p4lpRjyRzMf/view?usp=sharing). The data contains the following tables:

- twm_customer - information about customers
- twm_accounts - information about accounts
- twm_checking_accounts - information about checking accounts (subset of twm_accounts)
- twm_credit_accounts - information about credit accounts (subset of twm_accounts)
- twm_savings_accounts - information about savings accounts (subset of twm_accounts)
- twm_transactions - information about financial transactions
- twm_savings_tran - information about savings transactions (subset of twm_transactions)
- twm_checking_tran - information about checking transactions (subset of twm_transactions)
- twm_credit_tran - information about credit transactions (subset of twm_transactions)


### Output

In this mini project, we will:

1. Create two separate customer segmentations (using clustering), each splitting the customers into 3-5 clusters:
    - based on demographics (using only the information from twm_customer)
    - based on their banking behavior. We can consider the following as banking behavior:
        - Do they have a savings account? How much do they save?
        - Do they have a credit account? How much debt do they carry?
        - Do they make many small transactions or a few large ones?
2. Visualize the created clusters using [radar charts](https://plotly.com/python/radar-chart/) and compare them against each other.
3. Visualize the segmentations using scatter plots. We will have to use PCA to plot our observations in 2D.
4. (Stretch) Visualize in 2D how our clusters evolve in each iteration of KMeans (for at least 20 iterations).
    - We will need to write our own implementation of KMeans so we can see what happens to the clusters during the iterations.
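
For the stretch goal, one possible starting point is a small NumPy version of Lloyd's algorithm that stores the centroids and labels at every iteration. This is only a sketch on made-up 2D blobs: the function name `kmeans_with_history`, its parameters, and the toy data are assumptions for illustration, not part of the project instructions.

```python
import numpy as np

def kmeans_with_history(X, k, n_iter=20, seed=0):
    """Run n_iter rounds of Lloyd's algorithm and record every round.

    history[t] holds the centroids used for the assignment in round t
    and the labels produced by that assignment.
    """
    rng = np.random.default_rng(seed)
    # start from k distinct observations chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    history = []
    for _ in range(n_iter):
        # squared Euclidean distance from every point to every centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        history.append((centroids.copy(), labels.copy()))
        # move each centroid to the mean of its points; leave it in place if its cluster is empty
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids, labels, history

# toy 2D data (illustration only): three blobs around different centers
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in ([0, 0], [5, 5], [0, 5])])

centroids, labels, history = kmeans_with_history(X, k=3, n_iter=20)
print(len(history))  # one (centroids, labels) snapshot per iteration
```

Each entry of `history` can then be drawn as one frame of a scatter plot (points colored by label, centroids marked) to show how the clusters move.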


In [1]:
import pandas as pd 
import numpy as np

In [2]:
# explore the dataset customer 
df = pd.read_csv('twm_customer.csv', delimiter=';')
df.isna().mean()

cust_id            0.0
income             0.0
age                0.0
years_with_bank    0.0
nbr_children       0.0
gender             0.0
marital_status     0.0
name_prefix        0.0
first_name         0.0
last_name          0.0
street_nbr         0.0
street_name        0.0
postal_code        0.0
city_name          0.0
state_code         0.0
dtype: float64

In [9]:
# exploring KMeans with multiple features
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=5, n_init=10)

# note: the features are not scaled here, so income (by far the largest-valued column) dominates the distances
y = kmeans.fit_predict(df[['income','age','years_with_bank','nbr_children']])

df['Cluster'] = y

# numeric_only=True skips text columns such as gender and names, which recent pandas versions refuse to average
df.groupby('Cluster').mean(numeric_only=True)

| Cluster | cust_id | income | age | years_with_bank | nbr_children | marital_status | street_nbr | postal_code |
|---|---|---|---|---|---|---|---|---|
| 0 | 1362995.0 | 45674.809524 | 46.460317 | 4.071429 | 1.404762 | 2.134921 | 8791.873016 | 54686.714286 |
| 1 | 1362998.0 | 5777.573574 | 37.783784 | 3.825826 | 0.288288 | 1.660661 | 8478.690691 | 59963.411411 |
| 2 | 1363030.0 | 130895.8 | 54.6 | 4.0 | 0.6 | 2.2 | 10388.2 | 67077.8 |
| 3 | 1362984.0 | 23271.040486 | 45.530364 | 3.88664 | 0.923077 | 1.991903 | 8426.287449 | 61484.008097 |
| 4 | 1362953.0 | 80462.277778 | 49.361111 | 4.222222 | 0.833333 | 2.25 | 8398.583333 | 57450.805556 |


In [13]:
df

| | cust_id | income | age | years_with_bank | nbr_children | gender | marital_status | name_prefix | first_name | last_name | street_nbr | street_name | postal_code | city_name | state_code | Cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1362691 | 26150 | 46 | 5 | 1 | M | 2 | | Donald ... | Marek ... | 8298 | Second ... | 89194 | Las Vegas | NV | 3 |
| 1 | 1362487 | 6605 | 71 | 1 | 0 | M | 2 | | ChingDyi ... | Moussavi ... | 10603 | Daffodil ... | 90159 | Los Angeles | CA | 1 |
| 2 | 1363160 | 18548 | 38 | 8 | 0 | F | 1 | | Rosa ... | Johnston ... | 8817 | Figueroa ... | 90024 | Los Angeles | CA | 3 |
| 3 | 1362752 | 47668 | 54 | 3 | 0 | F | 1 | | Lisa ... | Martin ... | 676 | Humble ... | 90172 | Los Angeles | CA | 0 |
| 4 | 1362548 | 44554 | 59 | 9 | 2 | F | 4 | | Barbara ... | O'Malley ... | 6578 | C ... | 10138 | New York City | NY | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 742 | 1363324 | 14795 | 36 | 6 | 1 | F | 4 | | Lillian ... | Kaufman ... | 9677 | B ... | 90016 | Los Angeles | CA | 3 |
| 743 | 1362895 | 26387 | 56 | 6 | 1 | M | 2 | | Marty ... | McSherry ... | 3227 | Inspiration ... | 10126 | New York City | NY | 3 |
| 744 | 1362569 | 61300 | 50 | 0 | 2 | M | 2 | | Ken ... | Lawrence ... | 6082 | 23rd ... | 87194 | Albuquerque | NM | 0 |
| 745 | 1363364 | 15100 | 37 | 7 | 0 | F | 2 | | Debbie ... | Runner ... | 7851 | H ... | 35241 | Birmingham | AL | 3 |


In [14]:
# scale the numerical columns of df
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

In [15]:
ct = make_column_transformer(
    # rescale each of these columns to the range [0, 1]
    (MinMaxScaler(), ['income','age','years_with_bank','nbr_children','marital_status']),
)

# the result is a NumPy array with one column per scaled feature, in the order listed above
df_scaled = ct.fit_transform(df)
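
With its default `feature_range=(0, 1)`, `MinMaxScaler` maps each column value `x` to `(x - min) / (max - min)`, using that column's own minimum and maximum. A minimal check on made-up numbers (the values below are for illustration only and are not taken from the dataset):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# made-up values for illustration: the columns stand for income and age
X = np.array([[10000.0, 20.0],
              [30000.0, 40.0],
              [50000.0, 80.0]])

# manual min-max scaling, column by column
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# the same transformation via scikit-learn
scaled = MinMaxScaler().fit_transform(X)

print(manual)
print(np.allclose(manual, scaled))
```

After scaling, every column spans the same range, so no single feature dominates the Euclidean distances that KMeans relies on.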


In [20]:
df_scaled = pd.DataFrame(df_scaled, columns = ['income','age','years_with_bank','nbr_children','marital_status'])

In [22]:
df_scaled['Cluster'] = df['Cluster']

In [23]:
df_scaled

| | income | age | years_with_bank | nbr_children | marital_status | Cluster |
|---|---|---|---|---|---|---|
| 0 | 0.181399 | 0.434211 | 0.555556 | 0.2 | 0.333333 | 3 |
| 1 | 0.045818 | 0.763158 | 0.111111 | 0.0 | 0.333333 | 1 |
| 2 | 0.128665 | 0.328947 | 0.888889 | 0.0 | 0.000000 | 3 |
| 3 | 0.330667 | 0.539474 | 0.333333 | 0.0 | 0.000000 | 0 |
| 4 | 0.309066 | 0.605263 | 1.000000 | 0.4 | 1.000000 | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| 742 | 0.102631 | 0.302632 | 0.666667 | 0.2 | 1.000000 | 3 |
| 743 | 0.183043 | 0.565789 | 0.666667 | 0.2 | 0.333333 | 3 |
| 744 | 0.425231 | 0.486842 | 0.000000 | 0.4 | 0.333333 | 0 |
| 745 | 0.104747 | 0.315789 | 0.777778 | 0.0 | 0.333333 | 3 |
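
Note that the `Cluster` column here still holds the labels fitted on the unscaled features. A possible next step, sketched below on random stand-in data rather than the real table, is to refit KMeans on the scaled features and project them to two principal components for the scatter plot. The choice of 4 clusters, the `random_state`, and the stand-in value ranges are all assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

# random stand-in for five demographic columns (NOT the real data)
rng = np.random.default_rng(0)
X = rng.random((200, 5)) * np.array([100000, 80, 10, 5, 3])

X_scaled = MinMaxScaler().fit_transform(X)

# cluster on the scaled features so that no single column dominates the distances
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_scaled)

# project to two principal components for a 2D scatter plot
pca = PCA(n_components=2)
coords = pca.fit_transform(X_scaled)

print(coords.shape)
print(pca.explained_variance_ratio_)  # share of variance kept by each component
```

`coords[:, 0]` and `coords[:, 1]` can then be passed to any scatter plot function, with `labels` as the color.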
