# Exploration Lesson

What is it?
- understand what features are driving the outcome
- try to find patterns in the data

Why do we care? 
- gain insights on our data 
- use what we learn to decide which features and models to try
- if we see patterns, maybe we can build clusters

## Scenario
We would like to use attributes of mall customers to estimate their spending score, so that we can target the customers who are likely to be most profitable for us. Our target variable is spending_score. The only customer data currently available for this project is age, annual_income and gender, so we may not have enough information to build a valuable model. If not, we could try unsupervised learning: find clusters of similar customers using all of the variables (including spending_score) and use those clusters as a starting point for our targeted marketing.

In [None]:
#standard ds
import pandas as pd
import numpy as np

#viz and stats
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

#splits, scale
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

#my env file
from env import get_db_url


## Acquire

In [None]:
#get my data
df = pd.read_sql('SELECT * FROM customers;', get_db_url('mall_customers'))
df = df.set_index('customer_id')

#see it
df.head()

In [None]:
df.info()

## Prepare

In [None]:
def train_validate_test_split(df, target, seed=123):
    '''
    accepts dataframe and splits the data into train, validate and test (60/20/20)
    target is kept in the signature but is not used by the split itself
    '''
    train_validate, test = train_test_split(df, test_size=0.2, random_state=seed)
    
    train, validate = train_test_split(train_validate, test_size=0.25, random_state=seed)
    return train, validate, test
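
Why `test_size=0.25` on the second split? It takes 25% of the remaining 80%, which is 20% of the original rows. A minimal sketch on a made-up 200-row frame (`demo` is a hypothetical stand-in, not the customers table):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in for the customers table: 200 rows, one column.
demo = pd.DataFrame({'spending_score': range(200)})

# First cut: hold out 20% of all rows as test.
train_validate, test = train_test_split(demo, test_size=0.2, random_state=123)

# Second cut: 25% of the remaining 80% is 20% of the original rows.
train, validate = train_test_split(train_validate, test_size=0.25, random_state=123)

split_sizes = (len(train), len(validate), len(test))
print(split_sizes)
```

The three pieces come out at roughly 60/20/20 of the original rows.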

In [None]:
def scale_my_data(train, validate, test):
    '''
    scale my data using minmaxscaler and add it back to my input datasets
    '''
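    #work on copies so adding columns below does not trigger SettingWithCopyWarning
    train, validate, test = train.copy(), validate.copy(), test.copy()

    scaler = MinMaxScaler()
    scaler.fit(train[['age', 'annual_income']])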
    
    X_train_scaled = scaler.transform(train[['age', 'annual_income']])
    X_validate_scaled = scaler.transform(validate[['age', 'annual_income']])
    X_test_scaled = scaler.transform(test[['age', 'annual_income']])

    train[['age_scaled', 'annual_income_scaled']] = X_train_scaled
    validate[['age_scaled', 'annual_income_scaled']] = X_validate_scaled
    test[['age_scaled', 'annual_income_scaled']] = X_test_scaled
    return train, validate, test
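
With its default range, MinMaxScaler rescales each column as (x - min) / (max - min), using the min and max it learned in `fit`. A hand-rolled sketch of that arithmetic on made-up ages (the names `train_age`, `validate_age` and `min_max_scale` are hypothetical), showing why we fit on train only:

```python
import pandas as pd

# Made-up ages; validate contains a value above anything seen in train.
train_age = pd.Series([20, 30, 40, 60])
validate_age = pd.Series([25, 70])

# "Fit": learn the minimum and maximum from train only.
age_min, age_max = train_age.min(), train_age.max()

def min_max_scale(values):
    '''Rescale values with the min and max learned from train.'''
    return (values - age_min) / (age_max - age_min)

train_scaled = min_max_scale(train_age)
validate_scaled = min_max_scale(validate_age)

print(train_scaled.tolist())
print(validate_scaled.tolist())
```

Train lands in the 0 to 1 range, but a validate or test value outside the range seen in train can scale below 0 or above 1. That is expected when the scaler is fit on train alone.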

In [None]:
def prep_mall(df):
    '''
    dummy var for gender into is_male
    split on target of 'spending_score'
    scale age and annual income. 
    '''
    df['is_male'] = pd.get_dummies(df['gender'], drop_first=True)['Male']
    train, validate, test = train_validate_test_split(df, target='spending_score', seed=1349)
    train, validate, test = scale_my_data(train, validate, test)
    
    print(f'df: {df.shape}')
    print()
    print(f'train: {train.shape}')
    print(f'validate: {validate.shape}')
    print(f'test: {test.shape}')
    return df, train, validate, test
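
What does `get_dummies` with `drop_first=True` hand back? A small sketch, assuming gender holds the strings 'Male' and 'Female' (the `gender` Series here is made up):

```python
import pandas as pd

# Made-up gender values, assumed to match the format in the customers table.
gender = pd.Series(['Male', 'Female', 'Female', 'Male'])

# Categories are sorted alphabetically, so drop_first=True drops 'Female'
# and leaves a single 'Male' indicator column.
dummies = pd.get_dummies(gender, drop_first=True)

# Cast to int so the column is 0/1 regardless of pandas version.
is_male = dummies['Male'].astype(int)
print(is_male.tolist())
```

Depending on the pandas version the indicator column may be boolean or integer, so casting to int is one way to keep it consistent.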

In [None]:
#prep my data!
df, train, validate, test = prep_mall(df)

In [None]:
train.head()

## Explore

In [None]:
#distribution of all my variables
plt.figure(figsize=(14,4))

for i, col in enumerate(train.columns[:-3]):
    plt.subplot(1,len(train.columns[:-3]),i+1)
    plt.hist(train[col])
    plt.title(col)

plt.show()

**Takeaways**
- ?

## If you don't know where to start, start with pairplot

In [None]:
#looking at my unprocessed data only
sns.pairplot(train[['gender', 'age', 'annual_income', 'spending_score']], 
             corner=True
            )
plt.show()

**Takeaways**
- ?

## Q. Does the spending score differ across gender?

what kind of variables do i have?
- gender:
- spending_score: 

what types of plots can i use?
- ?

what type of stats test should i use? 
- ?


In [None]:
#visualize my variables
sns.boxplot(data=train, x='gender', y='spending_score')
plt.title('what is the relationship between gender and spending_score?')
plt.show()

Mann-Whitney U test
- $H_0$: there is no difference in spending scores between genders
- $H_a$: there is a difference in spending scores between genders

alpha = 0.05

In [None]:
#verify with stats
stats.mannwhitneyu(train[train.gender=='Male'].spending_score, 
                   train[train.gender=='Female'].spending_score)

result:

**Takeaways**
- ?
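
To turn a test result into a decision, compare its p-value to alpha. A minimal sketch on made-up scores (`group_a` and `group_b` are hypothetical stand-ins, not the mall data):

```python
import numpy as np
from scipy import stats

# Made-up spending scores for two groups (not the mall data).
rng = np.random.default_rng(42)
group_a = rng.integers(1, 100, size=60)
group_b = rng.integers(1, 100, size=60)

alpha = 0.05

# Two-sided test: do the two groups' distributions differ in either direction?
result = stats.mannwhitneyu(group_a, group_b, alternative='two-sided')

reject_null = result.pvalue < alpha
if reject_null:
    print(f'p = {result.pvalue:.3f}: reject the null hypothesis')
else:
    print(f'p = {result.pvalue:.3f}: fail to reject the null hypothesis')
```

Passing `alternative='two-sided'` explicitly matches the "there is a difference" wording of $H_a$ and avoids relying on the default of whichever SciPy version is installed.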

## Q. Is there a relationship between spending score and annual income?


what kind of variables do i have?
- annual income:
- spending score: 

what types of plots can i use?
- ?

what type of stats test should i use? 
- ?

In [None]:
#visualize my variables
sns.scatterplot(data=train, x='annual_income', y='spending_score')
plt.title('what is the relationship between annual income and spending score?')
plt.show()

Spearman R
- $H_0$: there is no monotonic correlation between annual income and spending score
- $H_a$: there is a monotonic correlation between annual income and spending score

In [None]:
#verify it with stats
stats.spearmanr(train.annual_income, train.spending_score)

result: 

**Takeaways**
- ?
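
Spearman works on ranks, so it picks up any relationship that consistently goes up or down, even along a curve, while Pearson measures how close the points are to a straight line. A quick sketch on made-up data (`x` and `y` are hypothetical):

```python
import numpy as np
from scipy import stats

# Made-up data: y always increases with x, but along a curve, not a line.
x = np.arange(1, 21)
y = np.exp(x / 3)

rho, rho_p = stats.spearmanr(x, y)
r, r_p = stats.pearsonr(x, y)

print(f'spearman rho: {rho:.3f}')
print(f'pearson r:    {r:.3f}')
```

Here the ranks line up perfectly, so Spearman's rho is 1, while Pearson's r is lower because the points do not sit on a line.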

## Q. Is there a relationship between age and spending score? 

what kind of variables do i have?
- age:
- spending_score: 

what types of plots can i use?
- ?

what type of stats test should i use? 
- ?

In [None]:
#visualize my variables
sns.scatterplot(data=train, x='age', y='spending_score')
plt.title('what is the relationship between age and spending score?')
plt.show()

thoughts:

#### create a bin for age

In [None]:
#make new age_bins column
train['age_bins'] = pd.cut(
                    train.age, #column to bin
                    [0,40,80], #bin edges: (0, 40] and (40, 80], right edge included
                    labels=['40_and_under', 'over_40'] #labels for my bins
                )
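
Which bin does someone who is exactly 40 land in? By default `pd.cut` includes the right edge of each bin, so the bins are (0, 40] and (40, 80]. A small check on made-up ages (`ages` is hypothetical):

```python
import pandas as pd

# Made-up ages, including the boundary value 40.
ages = pd.Series([18, 40, 41, 70])

# Same edges and labels as the age_bins column above.
age_bins = pd.cut(ages, [0, 40, 80], labels=['40_and_under', 'over_40'])
print(age_bins.tolist())
```

Age 40 falls in `40_and_under` and age 41 in `over_40`, which matches the labels.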

In [None]:
#look at it!
train.head()

In [None]:
#distribution
train.age_bins.value_counts()

In [None]:
#type of new column
train.dtypes

#### dive into my new variable

what kind of variables do i have?
- age_bins:
- spending_score: 

what types of plots can i use?
- ?

what type of stats test should i use? 
- ?

In [None]:
#visualize my variable
sns.boxplot(data=train, x='age_bins', y='spending_score')
plt.title('relationship of spending score for people below and above 40')
plt.show()

In [None]:
#check for equal variances 
stats.levene(train.spending_score[train.age_bins == '40_and_under'],
             train.spending_score[train.age_bins == 'over_40'],)

In [None]:
#verify with a stats test
stats.ttest_ind(train.spending_score[train.age_bins == '40_and_under'],
                train.spending_score[train.age_bins == 'over_40'],
                equal_var=False)

result: 

**Takeaways**
- ?
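
The Levene result can drive the `equal_var` argument instead of hard-coding it. A minimal sketch on made-up groups (`younger` and `older` are hypothetical samples, not the mall data):

```python
import numpy as np
from scipy import stats

# Made-up spending scores for two age groups (not the mall data).
rng = np.random.default_rng(7)
younger = rng.normal(loc=55, scale=25, size=80)
older = rng.normal(loc=45, scale=15, size=60)

alpha = 0.05

# Levene's null hypothesis: the groups have equal variances.
levene_stat, levene_p = stats.levene(younger, older)

# Only assume equal variances if we fail to reject Levene's null.
equal_var = bool(levene_p >= alpha)

t_stat, t_p = stats.ttest_ind(younger, older, equal_var=equal_var)
print(f'levene p = {levene_p:.3f}, equal_var = {equal_var}, t-test p = {t_p:.3f}')
```

Setting `equal_var=False` regardless, as the cell above does, runs Welch's t-test, which does not assume equal variances.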

## Q. If we control for age, does spending score differ across annual income?

what kind of variables do i have?
- age_bins:
- annual_income: 
- spending_score: 

what types of plots can i use?
- ?

what type of stats test should i use? 
- ?

In [None]:
#calculate my mean spending score
ss_mean = train.spending_score.mean()
ss_mean

In [None]:
#visualize my variables
#keep the returned grid in p so the commented loop below can reach each facet
p = sns.relplot(data=train, 
                x='annual_income', 
                y='spending_score', 
                col='age_bins'
               )

# plt.hlines(ss_mean,0,140)

#cycle through each axes-level plot to add overall mean line
# for ax in p.axes.flat:
#     ax.hlines(ss_mean,0,140, ls=':')

thoughts:

#### bin my under 40

In [None]:
#see where to make my bin
sns.histplot(train.age, bins=20)
plt.show()

thoughts:

In [None]:
#visualize my variables
sns.relplot(data=train, 
            x='annual_income', 
            y='spending_score',
            s=100,
            col=pd.cut(train.age,[0,30,40,80])
           )

plt.suptitle("Do the different age groups account for the upper vs lower extremes?")
plt.tight_layout()
plt.show()

thoughts:

#### how does gender affect them?

how do i add a fourth variable to my plots?
- ?

In [None]:
#visualize my plots
sns.relplot(data=train, 
            x='annual_income', 
            y='spending_score', 
            hue='gender',
            s=100,
            col=pd.cut(train.age,[0,30,40,80])
           )

plt.suptitle("Do the different age groups account for the upper vs lower extremes?")
plt.tight_layout()
plt.show()

**Takeaways:**
- ?

## Q. If we control for annual income, does spending score differ across age?

Since I want to control for annual income, I need to bin it.

In [None]:
#look at it to figure out some bins
train.annual_income.hist()
plt.show()

In [None]:
#make the bins
train['income_bins'] = pd.cut(train.annual_income, [0,50,140])
train.head()

In [None]:
#visualize my variables
sns.relplot(data=train, 
            x='age', 
            y='spending_score',
            hue='income_bins'
           )

plt.title('how does age relate to spending_score when accounting for annual income?')
plt.show()

**Takeaways:**
- ? 
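
One way to put a number next to the plot is to measure the age vs spending score correlation separately inside each income bin. A sketch on made-up rows (the `demo` frame and its values are hypothetical, so the numbers say nothing about the mall data):

```python
import pandas as pd

# Made-up customers, not the mall data.
demo = pd.DataFrame({
    'age':            [22, 35, 48, 60, 25, 33, 52, 65],
    'annual_income':  [20, 30, 40, 45, 60, 80, 100, 120],
    'spending_score': [80, 70, 40, 30, 60, 55, 50, 45],
})

# Same bin edges as income_bins above.
demo['income_bins'] = pd.cut(demo.annual_income, [0, 50, 140])

# Spearman correlation of age and spending score inside each income bin.
by_bin = {
    str(interval): group.age.corr(group.spending_score, method='spearman')
    for interval, group in demo.groupby('income_bins', observed=True)
}

for interval, rho in by_bin.items():
    print(f'{interval}: rho = {rho:.2f}')
```

If the correlation has a similar sign and size in every bin, income is probably not what drives the age pattern. If it changes from bin to bin, the two variables interact.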

## Explore Conclusion
