# Exploration Lesson

### Goals:

- Can we see patterns or find signals in the data?

- What features are driving the outcome?

- Are there other features we can construct that have stronger relationships?

- Use visualization and statistical testing to help answer these questions.

- We want to walk away from exploration with modeling strategies (feature selection, algorithm selection, evaluation methods).

### Scenario:

We would like to be able to use attributes of customers to estimate their spending score. In doing so, we can target those customers that are likely to be most profitable for us. Our target variable is spending_score. Currently the only customer data we have available to use in this project is age, annual_income and gender. It is possible we may not have enough information to build a valuable model. If not, maybe we could do some unsupervised learning, and find clusters of similar customers using all of the variables (including spending_score) and that could help us with a starting point for our targeted marketing.

In [None]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

from env import get_db_url

## Acquire!

In [None]:
df = pd.read_sql('SELECT * FROM customers;', get_db_url('mall_customers'))
df = df.set_index('customer_id')
df.head()

In [None]:
df.info()

## Prepare!

In [None]:
def train_validate_test_split(df, target, seed=123):
    '''
    Split a dataframe into train (56%), validate (24%) and test (20%).
    `target` is currently unused; no stratification is applied.
    '''
    train_validate, test = train_test_split(df, test_size=0.2, random_state=seed)
    train, validate = train_test_split(train_validate, test_size=0.3, random_state=seed)
    return train, validate, test
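
As a quick check on the proportions, here is a minimal sketch of the same two-step split on a synthetic stand-in table. The `demo` dataframe and its 200 random rows are assumptions for illustration, not the mall data. Taking 20% for test and then 30% of the remainder for validate leaves 0.8 × 0.7 = 56% for train.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the customers table: 200 synthetic rows.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'age': rng.integers(18, 71, size=200),
    'annual_income': rng.integers(15, 138, size=200),
    'spending_score': rng.integers(1, 100, size=200),
})

# Same two-step split as train_validate_test_split.
demo_train_validate, demo_test = train_test_split(demo, test_size=0.2, random_state=123)
demo_train, demo_validate = train_test_split(demo_train_validate, test_size=0.3, random_state=123)

# Report the size of each subset as a share of the full table.
for name, part in [('train', demo_train), ('validate', demo_validate), ('test', demo_test)]:
    print(f'{name}: {len(part)} rows ({len(part) / len(demo):.0%})')
```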

In [None]:
def scale_my_data(train, validate, test):
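    '''
    Scale age and annual_income with MinMaxScaler (fit on train only)
    and add the scaled columns back to each input dataset.
    '''
    # work on copies so the new columns are not written into slices of df
    train, validate, test = train.copy(), validate.copy(), test.copy()
    scaler = MinMaxScaler()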
    scaler.fit(train[['age', 'annual_income']])
    
    X_train_scaled = scaler.transform(train[['age', 'annual_income']])
    X_validate_scaled = scaler.transform(validate[['age', 'annual_income']])
    X_test_scaled = scaler.transform(test[['age', 'annual_income']])

    train[['age_scaled', 'annual_income_scaled']] = X_train_scaled
    validate[['age_scaled', 'annual_income_scaled']] = X_validate_scaled
    test[['age_scaled', 'annual_income_scaled']] = X_test_scaled
    return train, validate, test
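
The scaler is fit on train only, so validate and test are scaled with train's minimum and maximum. A small sketch with made-up toy values (an assumption, not the mall data) shows the consequence: values outside the train range land outside 0 to 1.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy values chosen so the arithmetic is easy to follow.
demo_train = pd.DataFrame({'age': [20, 30, 40], 'annual_income': [20, 60, 100]})
demo_validate = pd.DataFrame({'age': [50], 'annual_income': [10]})

scaler = MinMaxScaler()
scaler.fit(demo_train)  # learn each column's min and max from train only

train_scaled = scaler.transform(demo_train)
validate_scaled = scaler.transform(demo_validate)

print(train_scaled)     # each train column spans 0 to 1
print(validate_scaled)  # age 50 -> (50 - 20) / 20 = 1.5, income 10 -> (10 - 20) / 80 = -0.125
```

Values outside 0 to 1 in validate or test are expected with this approach and are not a sign of a bug.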

In [None]:
def prep_mall(df):
    '''
    dummy var for gender into is_male
    split on target of 'spending_score'
    scale age and annual income. 
    '''
    df['is_male'] = pd.get_dummies(df['gender'], drop_first=True)['Male']
    train, validate, test = train_validate_test_split(df, target='spending_score', seed=1349)
    train, validate, test = scale_my_data(train, validate, test)
    
    print(f'df: {df.shape}')
    print()
    print(f'train: {train.shape}')
    print(f'validate: {validate.shape}')
    print(f'test: {test.shape}')
    return df, train, validate, test
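
To see what the `is_male` line in `prep_mall` produces, here is a minimal sketch on a four-row toy series (the values are made up). The `.astype(int)` step is an optional addition, not part of the function above; recent pandas versions return booleans from `get_dummies`.

```python
import pandas as pd

# Hypothetical toy column, not the mall data.
gender = pd.Series(['Male', 'Female', 'Female', 'Male'], name='gender')

# drop_first=True drops the first category in sorted order ('Female'),
# leaving a single 'Male' indicator column.
dummies = pd.get_dummies(gender, drop_first=True)

# Optional: convert the indicator to 0/1 integers.
is_male = dummies['Male'].astype(int)

print(list(dummies.columns))
print(is_male.tolist())
```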

In [None]:
df, train, validate, test = prep_mall(df)

In [None]:
train.head()

## Explore!

1. Ask your question

2. Visualize it

3. Perform a stats test, if needed

4. Write your takeaway

### Q1. What is the distribution of each variable?

Since I'm doing univariate exploration, I can use the original dataset.
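
One possible way to look at each distribution is a histogram per numeric column plus value counts for the categorical one. This is a sketch on a synthetic stand-in (`demo` and its random values are assumptions); in the lesson you would pass `df` instead.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for the customers table.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'gender': rng.choice(['Male', 'Female'], size=200),
    'age': rng.integers(18, 71, size=200),
    'annual_income': rng.integers(15, 138, size=200),
    'spending_score': rng.integers(1, 100, size=200),
})

# One histogram per numeric variable.
numeric_cols = ['age', 'annual_income', 'spending_score']
fig, axes = plt.subplots(1, len(numeric_cols), figsize=(12, 3))
for ax, col in zip(axes, numeric_cols):
    ax.hist(demo[col], bins=10)
    ax.set_title(col)
plt.tight_layout()
plt.show()

# Counts for the categorical variable.
print(demo['gender'].value_counts())
```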

#### Takeaways

### Q2. Does the spending score differ across gender?

I am now comparing variables, so I HAVE to use the train dataset.

> what type of variable is spending_score?   
> what type of variable is gender? 

In [None]:
#barplot


Thoughts

Which stats test to use? 
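
One common choice for a continuous variable compared across two groups is a two-sample t-test, with Levene's test used first to decide the `equal_var` argument. The sketch below runs that flow on synthetic data (the `demo` dataframe and `alpha = 0.05` are assumptions for illustration).

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical stand-in for train: random gender labels and spending scores.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'gender': rng.choice(['Male', 'Female'], size=120),
    'spending_score': rng.integers(1, 100, size=120),
})

male_scores = demo[demo.gender == 'Male'].spending_score
female_scores = demo[demo.gender == 'Female'].spending_score

alpha = 0.05  # assumed significance level

# Levene's test: the null hypothesis is that both groups have equal variance.
_, p_levene = stats.levene(male_scores, female_scores)
equal_var = bool(p_levene >= alpha)

# equal_var=True runs Student's t-test; equal_var=False runs Welch's t-test.
t_stat, p_value = stats.ttest_ind(male_scores, female_scores, equal_var=equal_var)

print(f'Levene p = {p_levene:.3f}, equal_var = {equal_var}')
print(f't = {t_stat:.3f}, p = {p_value:.3f}')
```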

In [None]:
#stats.levene()


In [None]:
#stats.ttest_ind()
stats.ttest_ind(train[train.gender == 'Male'].spending_score,
                train[train.gender == 'Female'].spending_score,
                equal_var=True)

#### Takeaway

### Q3. Is there a relationship between spending score and annual income?

> what type of variable is spending score?   
> what type of variable is annual income? 

In [None]:


# plt.title('what is the relationship between annual income and spending score?')
# plt.show()

Thoughts

Which stats test to use? 


Spearman's R


In [None]:
#stats.spearmanr()
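
Spearman's rank correlation measures how monotonic a relationship is, without assuming it is linear. A minimal sketch on synthetic values (assumptions, not the mall data) contrasts a perfectly monotonic pair with an independently generated pair.

```python
import numpy as np
from scipy import stats

# Hypothetical values standing in for annual_income and spending_score.
rng = np.random.default_rng(0)
income = rng.uniform(15, 140, size=100)
independent_score = rng.uniform(1, 100, size=100)  # generated without reference to income

# A monotonic but nonlinear relationship still ranks identically,
# so the coefficient is 1 up to floating point error.
r_monotonic, _ = stats.spearmanr(income, income ** 3)

# Independently generated values: the coefficient depends on the sample.
r_random, p_random = stats.spearmanr(income, independent_score)

print(f'monotonic: r = {r_monotonic:.3f}')
print(f'independent: r = {r_random:.3f}, p = {p_random:.3f}')
```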


#### Takeaways

### Q4. Is there a relationship between age and spending score? 

> what type of variable is age?    
> what type of variable is spending score? 

In [None]:

# plt.title('what is the relationship between age and spending score?')
# plt.show()

Thoughts

We'll use `pd.cut()` to make bins
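
A minimal sketch of `pd.cut()` on a handful of made-up ages. The bin edges 0, 40 and 80 and the label names are assumptions for illustration; bins are right-inclusive by default, so 40 falls in the lower bin.

```python
import pandas as pd

# Hypothetical ages, not the mall data.
ages = pd.Series([18, 25, 39, 40, 41, 70], name='age')

# Two right-inclusive bins: (0, 40] and (40, 80].
age_bin = pd.cut(ages, bins=[0, 40, 80])
print(age_bin.value_counts())

# The same bins with readable labels instead of intervals.
age_label = pd.cut(ages, bins=[0, 40, 80], labels=['40_and_under', 'over_40'])
print(age_label.tolist())
```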

In [None]:
#make new age_bin column


In [None]:
# train.head()

In [None]:

# plt.title('relationship of spending score for people below and above 40')
# plt.show()

Which stats test to use? 


Levene test

In [None]:
train.age_bin.value_counts()

In [None]:
train.dtypes

In [None]:
#stats.levene


> Our p-value is less than alpha, so we reject the null hypothesis and conclude that the variances are not equal.

In [None]:
#stats.ttest_ind


> Our p-value is less than alpha, so we reject the null hypothesis that the mean spending score is the same in both age bins.

#### Takeaway

### Q5. If we control for age, does spending score differ across annual income?

Use `sns.relplot` to control for variables
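
A sketch of how `sns.relplot` can hold one variable roughly constant: bin age, then draw one income vs. spending score scatterplot per bin with `col=`. The `demo` dataframe, the bin edges and the labels are assumptions for illustration.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical stand-in for train.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'age': rng.integers(18, 71, size=200),
    'annual_income': rng.integers(15, 138, size=200),
    'spending_score': rng.integers(1, 100, size=200),
})

# Assumed age bins, right-inclusive: (0, 30], (30, 40], (40, 50], (50, 80].
demo['age_bin'] = pd.cut(demo['age'], bins=[0, 30, 40, 50, 80],
                         labels=['18-30', '31-40', '41-50', '51+'])

# One panel per age bin; within a panel, age varies much less.
g = sns.relplot(data=demo, x='annual_income', y='spending_score', col='age_bin')
plt.tight_layout()
```

Adding `hue='gender'` to the same call is one way to bring a third variable into each panel.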

Thoughts


In [None]:
#create age bins in col


# plt.suptitle("Do the different decades account for the upper vs lower extremes?")
# plt.tight_layout()

Thoughts

#### Does gender play a role?

In [None]:

# plt.suptitle("Do the different decades account for the upper vs lower extremes?")
# plt.tight_layout()

Thoughts


#### Takeaways


### Q6. If we control for annual income, does spending score differ across age?

In [None]:

# plt.title('annual_income')
# plt.show()

In [None]:
#make income_bin

#plot it

# plt.title("How does age compare to spending score within each income bin?")
# plt.show()

#### Takeaways


### If you don't know where to start, start with pairplot

In [None]:
# print("Interaction of variables along with income bins")

# plt.show()

In [None]:
# print("Interaction of variables by gender")

# plt.show()

## Conclusion

