## Tidy Data

1. Load your data from the previous exercise.

1. Are any column headers values instead of variable names?  

    - print column names
    - answer question above
    - if answer is yes, transform the data so that column names do not include values, and all values are in the table records.
    
2. Do multiple variables exist in a single column?

    - answer question above
    - if yes, print a sample of the column (name and values)
    - if yes, transform the data so that there is one column for each of the variables represented in that single column. 
    
3. Are there multiple observation units in the same table?  (e.g. 'per person & family' in same table)

4. Is a single observation unit spread across multiple tables? (e.g. 'per person' data exists across multiple tables)
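
As a rough illustration of the column-header and multiple-variable questions above, the sketch below uses a small made-up table. The column names (`2019`, `2020`, `sex_age`, `amount`) are hypothetical and not taken from the exercise data. It assumes pandas and shows one possible way to reshape with `melt` and split a combined column with `str.split`.

```python
import pandas as pd

# Hypothetical wide table: the year columns are values, not variable names,
# and `sex_age` packs two variables into one column.
wide = pd.DataFrame({
    "id": [1, 2],
    "sex_age": ["f_34", "m_51"],
    "2019": [10, 20],
    "2020": [15, 25],
})

# Print column names to check whether any header is really a value.
print(wide.columns.tolist())

# Move the year headers into a `year` column so every value lives in a record.
long = wide.melt(id_vars=["id", "sex_age"], var_name="year", value_name="amount")

# Split the combined column into one column per variable.
long[["sex", "age"]] = long["sex_age"].str.split("_", expand=True)
long["age"] = long["age"].astype(int)

tidy = long.drop(columns="sex_age")
print(tidy)
```

Each row of `tidy` now holds one observation (one `id` in one `year`), and each column holds a single variable.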

## Distributions and outliers

### Numeric Variables

1. Currently the number of rows and columns in our function `hist_subplots()` is *hard coded*.  We want to avoid that where possible when the code will be called repeatedly with different values.  Write a function, `fig_subplots()`, that takes a list of column names and returns a figure and an axes for each of the columns.  The subplots should be distributed as evenly as possible, with no more than 4 rows and no fewer than 2 columns.  

2. Write a function, `hist_matrix()`, that then sets up the frame for the subplots (using `fig_subplots()`) and plots each of the histograms (using `hist_subplot()`).

3. What are your takeaways from these plots? For each plot:

    1. Describe its distribution. 
    
    2. Identify any outliers and your thoughts on where each group of outliers came from.  Data errors? True observational outliers? Possible scenarios?
    
    3. Give your recommendations for handling these outliers, reasoning for each, and any necessary details. Options include (but are not limited to):
    
        - ignore the outliers (leave them as is)
        - remove the observations with those outliers
        - remove the variables with those outliers
        - replace them with another value (include what you would replace them with, e.g. "I would replace the values that are > 100 in variable x with 100", or "I would replace the outliers that are above $z = Q3 + 1.5 \times IQR$ with z.")
        
    4. Should there be any transformation(s) performed on the column?  Types of numeric transformations include (but are not limited to):
    
        - standard-normal
        - min/max
        - log 
        - arithmetic, e.g. applying a scalar weight
        - merging with another feature
        - binning
        - convert to boolean
        
    5. Provide any other insights or takeaways you gained related to the column and variable(s).
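
For the `fig_subplots()` exercise above, one possible way to avoid hard-coding the grid is to derive it from the number of columns. The helper below, `grid_shape()`, is a hypothetical name, and the rule it applies is only one reading of "as evenly as possible": use the fewest columns that keep the grid within 4 rows, but never fewer than 2.

```python
import math


def grid_shape(n_plots):
    """Return (nrows, ncols) with at most 4 rows and at least 2 columns."""
    if n_plots < 1:
        raise ValueError("need at least one column to plot")
    # Fewest columns that keep the grid within 4 rows, but never fewer than 2.
    ncols = max(2, math.ceil(n_plots / 4))
    # Fewest rows that still hold every plot at that width.
    nrows = math.ceil(n_plots / ncols)
    return nrows, ncols


for n in (3, 8, 9, 13):
    print(n, grid_shape(n))
```

The resulting pair could be handed to `matplotlib.pyplot.subplots(nrows, ncols)`. When the number of plots does not fill the grid, the unused axes in the last row would still need to be hidden or removed.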

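The "replace with another value" option above mentions capping at $z = Q3 + 1.5 \times IQR$. A minimal sketch of that rule, assuming pandas and a made-up series `x` (not the exercise data), might look like this:

```python
import pandas as pd

# Hypothetical numeric column with one extreme value.
x = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100], name="x")

# Quartiles and interquartile range.
q1, q3 = x.quantile([0.25, 0.75])
iqr = q3 - q1

# Upper fence: the z from the text.
upper = q3 + 1.5 * iqr

# Replace every value above the fence with the fence itself.
capped = x.clip(upper=upper)

print(f"Q1={q1}, Q3={q3}, IQR={iqr}, upper fence={upper}")
print(capped.tolist())
```

Capping keeps the observation in the data while limiting its influence. Whether that is preferable to removing the row depends on whether the extreme value looks like a data error or a true observation.
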
### Categorical Variables

#### Exercises

1. Write a function, `bar_subplot()`, that returns a subplot for each of a list of categorical columns.  

    - Each subplot should meet the following requirements:
    
        - A bar chart plotting the number of observations by each class in the column.  (Think histogram but for categorical variables)
        - A title that is the name of the column
        - arguments: column name, nrows, ncols, index, title of subplot, dataframe name, and any arguments necessary for the chart function you will be using.  
        
    - Test the function by plotting the columns whose names are in the list `cat_vars`
    
2. Write a function `bar_matrix()` that then sets up the frame for the subplots (using `fig_subplots()`) and plots each of the bar charts (using `bar_subplot()`)

3. What are your takeaways from these plots? For each plot:

    1. Describe how the observations are distributed across the classes.
    
    2. Identify any outliers and your thoughts on where each group of outliers came from.  Data errors? True observational outliers? Possible scenarios?
    
    3. Give your recommendations for handling these outliers, reasoning for each, and any necessary details. Options include (but are not limited to):
    
        - ignore the outliers (leave them as is)
        - remove the observations with those outliers
        - remove the columns with those outliers
        - replace them with another value (include what you would replace them with, e.g. "I would replace the values that are > 100 in column x with 100", or "I would replace the outliers that are above $z = Q3 + 1.5 \times IQR$ with z.")
        
    4. Should there be any transformation(s) on the columns?  Types of categorical transformations include (but are not limited to):
    
        - new class
        - grouping classes
        - text normalization
        - arithmetic, e.g. applying a scalar weight
        - merging with another column
        - convert to boolean
        
    5. Provide any other insights or takeaways you gained related to the column and variable(s).
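
To make a few of the categorical transformations above concrete, here is a small sketch assuming pandas. The series `s`, its class labels, and the `min_count` threshold are all made up for illustration. It shows text normalization, the per-class counts that a bar chart would plot, and grouping rare classes into a new class.

```python
import pandas as pd

# Hypothetical categorical column with inconsistent casing and a rare class.
s = pd.Series(
    ["DSL", "dsl ", "Fiber", "fiber", "Fiber", "Satellite"], name="internet"
)

# Text normalization: trim whitespace and lowercase so spellings collapse.
clean = s.str.strip().str.lower()

# The bar heights for a bar chart are just the per-class counts.
counts = clean.value_counts()
print(counts)

# Group classes seen fewer than `min_count` times into a single "other" class.
min_count = 2
rare = counts[counts < min_count].index
grouped = clean.where(~clean.isin(rare), "other")
print(grouped.value_counts())
```

Given an existing matplotlib axes `ax`, the counts could be drawn with `counts.plot.bar(ax=ax)`, which is one option for the chart call inside `bar_subplot()`.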
    
    
    
 

## Exploration

4. Is there a relationship between *x1 continuous variable*, *x2 continuous variable*, and churn?  Create a single chart to answer this question.
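
One common way to put two continuous variables and a binary target on a single chart is a scatter plot colored by class. The sketch below assumes matplotlib and pandas. The column names `tenure` and `monthly_charges` and all values are hypothetical stand-ins for *x1* and *x2*.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data: two continuous variables and a binary churn flag.
df = pd.DataFrame({
    "tenure": [1, 5, 12, 24, 36, 60],
    "monthly_charges": [70.0, 85.5, 55.0, 90.0, 40.0, 30.0],
    "churn": [1, 1, 0, 1, 0, 0],
})

fig, ax = plt.subplots()

# One scatter layer per churn class so each gets its own color and legend entry.
for label, group in df.groupby("churn"):
    ax.scatter(group["tenure"], group["monthly_charges"], label=f"churn = {label}")

ax.set_xlabel("tenure")
ax.set_ylabel("monthly_charges")
ax.set_title("tenure vs. monthly_charges by churn")
ax.legend()
```

If the two classes occupy visibly different regions of the plot, that suggests the pair of variables carries some signal about churn.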


## Modeling

2. Walk through the steps of training the Support Vector Machine (SVM) classifier.  Set the `class_weight`, `probability`, and `random_state` parameters accordingly. 
3. Evaluate your results using the model score, confusion matrix, and classification report.
4. Print and clearly label the following:  Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support. 
5. Research the `kernel` parameter.  What is your best option(s) for the particular problem you are trying to solve and the data to be used?
6. Run through steps 2-4 using another `kernel` (from question 5).
7. Which appears to perform better?
8. Test the best model on your testing data. 
9. Store your final model in `logit` for future use.
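
The training and evaluation steps above can be sketched as follows, assuming scikit-learn. The data here is synthetic (generated with `make_classification`), and the parameter values (`kernel="rbf"`, `class_weight="balanced"`, `random_state=123`) are illustrative choices rather than recommended settings for the exercise data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the exercise data (an assumption).
X, y = make_classification(
    n_samples=300, n_features=5, weights=[0.75, 0.25], random_state=123
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=123
)

# Create and fit the classifier.
svc = SVC(kernel="rbf", class_weight="balanced", probability=True, random_state=123)
svc.fit(X_train, y_train)

# Evaluate on the training split; the test split is held back for the final model.
y_pred = svc.predict(X_train)
accuracy = svc.score(X_train, y_train)

# For binary labels {0, 1}, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_train, y_pred).ravel()
tpr = tp / (tp + fn)  # true positive rate (recall)
fpr = fp / (fp + tn)  # false positive rate
tnr = tn / (tn + fp)  # true negative rate
fnr = fn / (fn + tp)  # false negative rate

print(f"Accuracy:            {accuracy:.3f}")
print(f"True positive rate:  {tpr:.3f}")
print(f"False positive rate: {fpr:.3f}")
print(f"True negative rate:  {tnr:.3f}")
print(f"False negative rate: {fnr:.3f}")

# Precision, recall, f1-score, and support per class.
print(classification_report(y_train, y_pred))
```

Repeating the same steps with a different `kernel` value (for example `"linear"`) gives a second set of metrics to compare before scoring the chosen model on `X_test` and `y_test`.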

