# Exploratory Data Analysis

Planning - Acquisition - Preparation - **Exploratory Analysis** - Modeling - Product Delivery

In this lesson, we introduce exploratory data analysis, called EDA, which is the "explore" stage in the DS Pipeline.

## Goals of EDA

We explore the interactions of the attributes and target variable to help discover drivers of our target variable and redundant or interdependent attributes. 

1. Discover features that are driving the outcome (target). (Number 1 reason to explore)
2. Learn the vast majority of our takeaways and interesting stories from the data. 
3. Discover if we need to drop features, if we need to handle missing values, or if there's value to combining features. 

## EDA Process

1. Hypothesize: Form and document your initial hypotheses about how the predictors (independent variables, features, or attributes) interact with the target (y-value or dependent variable). You can do this in the form of questions in a natural language (as opposed to "statistical" language). 

2. Visualize: Use visualization techniques to identify drivers. When a visualization needs to be followed up with a statistical test, do so.

3. Test: Test your hypotheses when a visualization isn't immediately clear. Use the appropriate statistical tests (t-tests, correlation, chi-square).

**General Recipe**

1. Univariate Stats: descriptive stats, frequencies, histograms. This is often done during prep, prior to splitting into train/validate/test; if it was not, do it first here. Why? To spot outliers, to check distributions before running tests that assume normality, to learn the scale of each variable, and to get to know your data in general.

    - Univariate means a single variable, so we'll look at `.value_counts()` and histograms.
    - Explore the target variable itself. What is the distribution of values?  
    - Explore the categorical and qualitative variables. 
    - Explore the numeric variables. 


2. Bivariate Stats. Bivariate means two variables.

    - Plot the interactions of each variable with the target. Document your takeaways.     
    - Explore the interaction of independent variables using viz and/or hypothesis testing to address interdependence.


3. Multivariate Stats (more than 2 variables): Ask additional questions of the data, such as how subgroups compare to each other and to the overall population. Answer questions using visualizations and/or hypothesis testing.

    - Use color to represent a discrete variable, and then choose a chart style based on the data types of the other two variables.

4. Statistical Tests: If the visualizations are not crystal clear, it's important to conduct hypothesis tests.

    - With numeric to numeric, test for correlation with Pearson's r for linear relationships and Spearman's rho for monotonic (not necessarily linear) relationships.

    - For numeric to categorical, compare the means of two groups, or of a subgroup and the population, using a [t-test](https://ds.codeup.com/stats/compare-means/) if your samples are normally(ish) distributed. If the groups have different variances (check by calling `.var()` on each column/Series), pass `equal_var=False` to `scipy.stats.ttest_ind()`. Use [ANOVA](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.f_oneway.html) to compare means from more than 2 groups, or a [Mann-Whitney U test](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mannwhitneyu.html) if the data does not match the [assumptions of a t-test](https://www.investopedia.com/ask/answers/073115/what-assumptions-are-made-when-conducting-ttest.asp).

    - With categorical to categorical variables, use $\chi^2$, [chi-squared test](https://ds.codeup.com/stats/compare-group-membership/).


**Standing Orders** for Exploration

- **Document** your initial questions or assumptions. Write them down (in your README or notebook) so they are concrete and not in your head.

- **Document** your takeaways after each visualization. Even if your takeaway is, "there is nothing interesting between *var1* and *target*". 

- **Document** your answer to each question. 

- When you run statistical tests to answer your questions, **Document** your null and alternative hypothesis, the test you run, the test results, and your conclusion. 

- **Document** your takeaways, in case that wasn't clear. It is a huge component of your final deliverable/analysis.

- **Document** your action plan.  What are your next steps and/or new questions based on what you have learned? I recommend documenting, continuing through all of your questions, and then going back and taking action only after you have answered your initial questions. 
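The standing order about statistical tests can be made concrete by writing the hypotheses and alpha next to the test itself. The following is a minimal sketch: the fare values are invented, and `summarize_test` is a hypothetical helper, not part of the lesson's code.

```python
from scipy import stats

# Hypothetical fares for two groups; the numbers are invented for illustration.
survived_fares = [71.3, 53.1, 26.6, 30.0, 80.0, 16.7, 35.5, 263.0]
died_fares = [7.3, 8.1, 8.5, 21.1, 7.9, 13.0, 7.2, 10.5]

# Write these down BEFORE looking at the result.
alpha = 0.05
null_hypothesis = "Mean fare is the same for survivors and non-survivors."
alternative_hypothesis = "Mean fare differs between survivors and non-survivors."


def summarize_test(p_value, alpha, null_hypothesis):
    """Return a one-line conclusion suitable for pasting into a notebook."""
    if p_value < alpha:
        return f"p = {p_value:.4f} < {alpha}: reject H0 ({null_hypothesis})"
    return f"p = {p_value:.4f} >= {alpha}: fail to reject H0 ({null_hypothesis})"


# Welch's t-test (equal_var=False) does not assume equal variances.
t_stat, p_value = stats.ttest_ind(survived_fares, died_fares, equal_var=False)
conclusion = summarize_test(p_value, alpha, null_hypothesis)

print("H0:", null_hypothesis)
print("Ha:", alternative_hypothesis)
print(conclusion)
```

Keeping the hypotheses as strings in the same cell as the test makes it harder to forget what was actually being asked when you reread the notebook later.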

## Acquire

Acquire the Titanic data from our MySQL database.

```python
# acquire.get_titanic_data()
```

```python
# peek
```


## Prepare

Prepare the Titanic data. We apply the same steps before splitting so that we handle edge cases identically.

- drop deck since most of the data is missing
- drop rows where age or embarked is missing
- drop passenger_id, since it adds no new information
- drop class, as encoded values are in pclass
- create dummy vars & drop sex, embark_town
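The steps listed above can be sketched on a tiny hand-made table. This is an illustration only: the rows are invented, `prep_titanic_sketch` is a hypothetical stand-in (the lesson's `prepare.prep_titanic` is not shown here), and `drop_first=True` is an assumption about how the dummy variables are built.

```python
import pandas as pd

# A tiny hand-made stand-in for the Titanic table; values are invented.
raw = pd.DataFrame({
    "passenger_id": [0, 1, 2, 3, 4],
    "survived": [0, 1, 1, 0, 1],
    "pclass": [3, 1, 3, 2, 1],
    "sex": ["male", "female", "female", "male", "female"],
    "age": [22.0, 38.0, None, 35.0, 54.0],
    "embarked": ["S", "C", "S", None, "S"],
    "class": ["Third", "First", "Third", "Second", "First"],
    "deck": [None, "C", None, None, "E"],
    "embark_town": ["Southampton", "Cherbourg", "Southampton", None, "Southampton"],
})


def prep_titanic_sketch(df):
    """Apply the listed preparation steps and return a new DataFrame."""
    # Drop the mostly-empty, uninformative, and redundant columns.
    df = df.drop(columns=["deck", "passenger_id", "class"])
    # Drop rows where age or embarked is missing.
    df = df.dropna(subset=["age", "embarked"])
    # Encode the categorical columns, then drop the originals.
    dummies = pd.get_dummies(df[["sex", "embark_town"]], drop_first=True)
    df = pd.concat([df, dummies], axis=1)
    return df.drop(columns=["sex", "embark_town"])


prepared = prep_titanic_sketch(raw)
print(prepared.columns.tolist())
print(prepared.shape)
```

Because the function returns a new DataFrame instead of modifying its input, it can be rerun safely while you iterate on the preparation steps.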

**Questions for 2nd draft** 

- Let's investigate and determine what our best options are for handling the missing ages.

```python
# prepare.prep_titanic(df)
```

```python
# peek
```

**Split** data into Train, Validate, Test

```python
# prepare.my_train_test_split(df, target = 'survived')
```

```python
# shape
```

```python
# Stratification means we'll get even proportions of the target variable in each data set
# train.survived.mean(), validate.survived.mean(), test.survived.mean()
```
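
A stratified three-way split might look like the sketch below. It uses scikit-learn's `train_test_split` twice. The function name `split_sketch`, the 20% and 30% split sizes, the seed, and the data are all assumptions made for illustration; the lesson's own `prepare.my_train_test_split` may differ.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented data: 100 rows with a 40% positive rate in the target.
df = pd.DataFrame({
    "survived": [1] * 40 + [0] * 60,
    "fare": range(100),
})


def split_sketch(df, target, seed=123):
    """Split df into train, validate, and test, stratifying on the target."""
    # First carve off 20% of the rows as the test set.
    train_validate, test = train_test_split(
        df, test_size=0.2, random_state=seed, stratify=df[target]
    )
    # Then take 30% of the remainder as the validate set.
    train, validate = train_test_split(
        train_validate,
        test_size=0.3,
        random_state=seed,
        stratify=train_validate[target],
    )
    return train, validate, test


train, validate, test = split_sketch(df, target="survived")
print(train.shape, validate.shape, test.shape)
print(train.survived.mean(), validate.survived.mean(), test.survived.mean())
```

The three means printed at the end should be close to each other, which is the check the last cell above performs.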

## Univariate Exploration: Explore Individual Variables

### Goals

- Identify outliers, and decide whether they are genuine anomalies or data errors.
- Identify distributions of numeric data. Statistical tests often assume a type of distribution, such as a normal distribution. 
- Get a sense of scale for each variable. 
- Get a good general understanding of your data.  
- Is your target balanced or imbalanced? 
- Are there variables with no entropy (that is, a single constant value)?

### How

#### Numeric Variables

- `df.describe()`
- `series.hist()` 
- `sns.boxplot()`

#### Discrete Variables

- `series.value_counts()`
- `series.value_counts(normalize=True)`
- `sns.countplot()`
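
The non-plotting calls above can be tried on a small sample. The following sketch uses invented rows with Titanic-style column names. The 1.5 × IQR fence is the convention that box plots use by default for drawing whiskers; treat it as one reasonable rule for flagging outliers, not the only one.

```python
import pandas as pd

# Invented sample; column names mirror the Titanic data used in this lesson.
train = pd.DataFrame({
    "survived": [0, 1, 1, 0, 0, 1, 0, 0],
    "pclass": [3, 1, 3, 2, 3, 1, 3, 2],
    "age": [22.0, 38.0, 26.0, 35.0, 28.0, 54.0, 2.0, 71.0],
    "fare": [7.25, 71.28, 7.93, 26.0, 8.05, 51.86, 21.08, 512.33],
})

# Numeric variables: summary statistics give scale and hint at outliers.
print(train[["age", "fare"]].describe())

# Flag high outliers in fare with the 1.5 * IQR rule.
q1, q3 = train.fare.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = train[train.fare > upper_fence]
print(outliers)

# Discrete variables: raw counts and proportions.
class_counts = train.pclass.value_counts()
class_shares = train.pclass.value_counts(normalize=True)
print(class_counts)
print(class_shares)

# The mean of a 0/1 target is the share of the positive class.
survival_rate = train.survived.mean()
print(f"Survival rate: {survival_rate:.3f}")
```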

### Things to take away from this step

- Document findings at the end of the section (and throughout)
- Return to prep to further clean in ways discovered in this step
- Document questions that come up as you begin to look at the data. 

## Bivariate Exploration: Explore Interactions of 2 Variables

### Goals

- Analyze each feature with respect to the target variable and document takeaways. Always document your findings and takeaways, even if the takeaway is "There's nothing here between x and y". 
- Analyze features with respect to each other to identify those that may be interdependent and add no additional information. 
- Ask and answer specific questions. 

### How

#### Numeric x Numeric Variables

**Plots**

- `sns.scatterplot()`
- `sns.heatmap()`
- `sns.lineplot()`
- `sns.lmplot()`
- `sns.pairplot()`

**Stats**

- Pearson's R: tests for **LINEAR** correlation  `scipy.stats.pearsonr()` 
- Spearman's Rho: tests for monotonic relationships (not necessarily linear, relationship between ordered sets):  `scipy.stats.spearmanr()` 
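
To see how the two coefficients differ, the sketch below runs both on invented data where `y` always increases with `x` but not in a straight line. Spearman's rho works on ranks, so a perfectly monotonic relationship scores 1 even when Pearson's r does not.

```python
import numpy as np
from scipy import stats

# Invented data: y grows monotonically with x, but not linearly.
x = np.arange(1, 21)
y = x ** 3

pearson_r, pearson_p = stats.pearsonr(x, y)
spearman_rho, spearman_p = stats.spearmanr(x, y)

print(f"Pearson r    = {pearson_r:.3f} (p = {pearson_p:.4f})")
print(f"Spearman rho = {spearman_rho:.3f} (p = {spearman_p:.4f})")
```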

#### Discrete x Numeric Variables

**Plots**

- `sns.swarmplot()`
- `sns.violinplot()`
- `sns.barplot()`
- `sns.stripplot()`
- `sns.boxenplot()`

**Stats**

- Independent t-test: compare the means of two groups; are they significantly different? `scipy.stats.ttest_ind()`
- ANOVA (one-way): compare the means of more than two groups; are they significantly different? `scipy.stats.f_oneway()`
- Mann-Whitney: compare two groups when the data is not necessarily normally distributed, i.e. the non-parametric counterpart of the t-test. It works on ranks rather than on the means themselves: `scipy.stats.mannwhitneyu()`
- Kruskal-Wallis: compare more than two groups when the data is not necessarily normally distributed, i.e. the non-parametric counterpart of the ANOVA test. It also works on ranks: `scipy.stats.kruskal()`
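
The four tests above share a similar calling pattern in SciPy. The sketch below runs each of them on randomly generated ages for three invented groups; the group means, spreads, and seed are arbitrary choices made so that there is a difference to detect.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Invented ages for three passenger classes.
first = rng.normal(loc=40, scale=12, size=60)
second = rng.normal(loc=30, scale=12, size=60)
third = rng.normal(loc=25, scale=10, size=60)

# Two groups, parametric. equal_var=False runs Welch's t-test.
t_stat, t_p = stats.ttest_ind(first, third, equal_var=False)

# Two groups, non-parametric.
u_stat, u_p = stats.mannwhitneyu(first, third, alternative="two-sided")

# More than two groups, parametric and non-parametric.
f_stat, f_p = stats.f_oneway(first, second, third)
h_stat, h_p = stats.kruskal(first, second, third)

for name, p in [("t-test", t_p), ("Mann-Whitney", u_p),
                ("ANOVA", f_p), ("Kruskal-Wallis", h_p)]:
    print(f"{name:15s} p = {p:.6f}")
```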

#### Discrete x Discrete

**Plots**

- `sns.swarmplot()`
- `sns.countplot()` adding hue for the second discrete variable. 
- If one categorical is a boolean, such as survived in the Titanic case, we can put the binary target on the `y` axis of `sns.barplot()`. The bar height is the mean of the 0/1 values, which is the proportion in each group.

**Stats**

- Crosstab comparing values `pd.crosstab()`
- Chi-Square test: is there a relationship between two categorical variables? `scipy.stats.chi2_contingency()`
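
A crosstab feeds directly into the chi-square test. The sketch below uses invented passengers with a deliberately strong relationship between `sex` and `survived`, so the counts are easy to check by eye.

```python
import pandas as pd
from scipy import stats

# Invented passengers: most women survive, most men do not.
train = pd.DataFrame({
    "sex": ["female"] * 50 + ["male"] * 50,
    "survived": [1] * 40 + [0] * 10 + [1] * 10 + [0] * 40,
})

# Observed counts for each combination of the two categories.
observed = pd.crosstab(train.sex, train.survived)
print(observed)

# expected holds the counts we would see if the variables were independent.
chi2, p, dof, expected = stats.chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, p = {p:.6f}, dof = {dof}")
print(expected)
```

Comparing `observed` with `expected` side by side is often the quickest way to explain the result to a non-technical audience.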


### Things to take away from this step

- Document findings at the end of the section (and throughout)
- Return to prep to further clean in ways discovered in this step
- Document further questions to explore in multivariate exploration. 
- Document features you wish to move forward into modeling, and variables you wish to drop. 

## Multivariate Exploration: Explore the Interactions of 3+ Variables

### Goals

- Ask and answer specific questions: We ask more specific and targeted questions of the data, such as how subgroups compare to each other and to the overall population.
- Identify relationships between independent variables (aka features, predictors) and dependent variable (aka target, outcome). 
- During multivariate analysis, we often add another dimension to our data, such as the target variable as color.

### How

**Plots**

- Add color to your plots. For seaborn plots, the argument is `hue=<colname>`. For matplotlib plots, the argument is `c=<yourseries>.astype('category').cat.codes`.
- If you have multiple numeric columns, generating a [pairplot](https://seaborn.pydata.org/generated/seaborn.pairplot.html) with the target variable set to the `hue` argument might help. It may also be too noisy.
- A [relplot](https://seaborn.pydata.org/generated/seaborn.relplot.html) of a numeric x, a numeric y, and a `hue` argument using a category z. If you discover a good set of numeric columns in the `pairplot`, then it would be valuable to create a visual for that pairing along with the target.
- We can also use `hue` along with [seaborn catplot](https://seaborn.pydata.org/generated/seaborn.catplot.html)
- We can make subgroups based on multiple categorical features and compare to other groups or the population
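
As a small illustration of the matplotlib color argument mentioned in the first bullet, the sketch below colors a scatter plot by a categorical column. The rows are invented and the `viridis` colormap is an arbitrary choice.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Invented sample; column names mirror the Titanic data.
train = pd.DataFrame({
    "age": [22, 38, 26, 35, 28, 54, 2, 71],
    "fare": [7.25, 71.28, 7.93, 26.0, 8.05, 51.86, 21.08, 34.65],
    "sex": ["male", "female", "female", "male", "male", "male", "female", "male"],
})

# Convert the category to integer codes so matplotlib can map it to colors.
color_codes = train.sex.astype("category").cat.codes

fig, ax = plt.subplots()
points = ax.scatter(train.age, train.fare, c=color_codes, cmap="viridis")
ax.set_xlabel("age")
ax.set_ylabel("fare")
ax.set_title("Fare by age, colored by sex")
```

With seaborn, a comparable plot would be `sns.scatterplot(data=train, x="age", y="fare", hue="sex")`, which also adds a legend automatically.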

**Stats**

- Multivariate statistical tests exist, but are outside the scope of this course. 

- We can, however, create subgroups based on multiple categorical features and conduct hypothesis tests. Using the same methods used in the bivariate exploration, you will begin by controlling for the third variable. For example, select only customers who are senior citizens ("control" for senior citizen), and then test whether there is a significant difference in cost for those who churn and those who do not churn. 
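
The senior citizen example can be sketched as a filter followed by a bivariate test. Everything in this block is invented for illustration: the column names `senior_citizen`, `churn`, and `monthly_charges` are assumptions, and the data is generated so that churned customers pay more on average.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(7)
n = 600

# Invented telco-style data; column names are assumptions.
df = pd.DataFrame({
    "senior_citizen": rng.integers(0, 2, size=n),
    "churn": rng.integers(0, 2, size=n),
})
# Make churned customers pay more on average so there is something to find.
df["monthly_charges"] = rng.normal(65, 15, size=n) + 15 * df.churn

# Control for the third variable by keeping only senior citizens.
seniors = df[df.senior_citizen == 1]

# Within that subgroup, compare cost for churned vs. retained customers.
churned = seniors[seniors.churn == 1].monthly_charges
stayed = seniors[seniors.churn == 0].monthly_charges

t_stat, p_value = stats.ttest_ind(churned, stayed, equal_var=False)
print(f"seniors only: t = {t_stat:.2f}, p = {p_value:.6f}")
```

Repeating the same test on the non-senior subgroup shows whether the relationship holds in both groups or only in one.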


### Things to take away from this step

- Document takeaways, findings, conclusions at the end of the section and throughout. 
- Document initial recommendations. 
- Document new questions to ask of the data.
- Finalize features to move forward into modeling. 
- Return to prep step as needed. 

**Questions to answer**


_____________________



**Get Creative**

Ask additional, more specific and targeted questions of the data, such as how subgroups compare to each other and to the overall population. We then answer these questions using visualizations and/or hypothesis testing.

1. Is there a relationship between survival and parch for women travelers?
2. Is there a relationship between survival and parch for male travelers? 
3. Is there a relationship between survival and sibsp for women travelers? 
4. Is there a relationship between survival and sibsp for male travelers?

What other subgroups can you create and visualize?

What variables have you not worked with yet?


What other subgroups can you create and test?

## Conclusion

Here we pull all of our takeaways and actions together into one place we can reference as we move forward. 



**Work fast to an MVP understanding of your data**

- Focus on features that give you the biggest bang for your buck. 

- If 30% of a population is responsible for 99% of the sales, start by breaking that 30% down into different groups.


**When you have time for a second iteration**

- Revisit some of the things you may have skipped earlier in order to get to an MVP.

- For example, there were ~20% of rows with missing age. If you have time, check these out. Is there a set of the population that is similar such that we can impute an expected age value?

- Explore creating your own features

    - Turn a numeric column like `age` into a category, for example a boolean `is_child`.
    - Where does it make logical sense to combine columns?
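
The feature ideas above might be sketched as follows. The rows are invented, and the age cutoff of 18, the bin edges, and the `family_size` feature are assumptions chosen for illustration.

```python
import pandas as pd

# Invented sample; column names mirror the Titanic data.
train = pd.DataFrame({
    "age": [22.0, 38.0, 4.0, 35.0, 15.0, 54.0],
    "sibsp": [1, 1, 3, 0, 0, 0],
    "parch": [0, 0, 1, 0, 2, 0],
})

# Numeric to boolean: the cutoff of 18 is an assumption, not a lesson rule.
train["is_child"] = train.age < 18

# Numeric to ordered bins with pd.cut; edges and labels are arbitrary choices.
train["age_group"] = pd.cut(
    train.age,
    bins=[0, 12, 18, 60, 120],
    labels=["child", "teen", "adult", "senior"],
)

# Combine two related counts into one feature (+1 counts the passenger).
train["family_size"] = train.sibsp + train.parch + 1

print(train)
```

Any feature created this way should go through the same univariate and bivariate checks as the original columns before it moves forward into modeling.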

## Exercises

### Part I

Continue in your `classification_exercises.ipynb` notebook. As always, add, commit, and push your changes.

**Section 1 - iris_db:** Using iris data from our MySQL server and the methods used in the lesson above: 

1. Acquire, prepare & split your data. 

1. Univariate Stats

    - For each measurement type (quantitative variable): create a histogram, boxplot, & compute descriptive statistics (using .describe()). 

    - For each species (categorical variable): create a frequency table and a bar plot of those frequencies. 

    - Document takeaways & any actions. 


2. Bivariate Stats

    - Visualize each measurement type (y-axis) with the species variable (x-axis) using barplots, adding a horizontal line showing the overall mean of the metric (y-axis). 

    - For each measurement type, compute the descriptive statistics for each species. 

    - For virginica & versicolor: Compare the mean petal_width using the Mann-Whitney test (scipy.stats.mannwhitneyu) to see if there is a significant difference between the two groups. Do the same for the other measurement types. 

    - Document takeaways & any actions. 


3. Multivariate Stats

    - Visualize the interaction of each measurement type with the others using a pairplot (or scatter matrix or something similar) and add color to represent species. 
    
    - Visualize two numeric variables by means of the species. Hint: `sns.relplot` with `hue` or `col`

    - Create a swarmplot using a melted dataframe of all your numeric variables. The x-axis should be the variable name, the y-axis the measure. Add another dimension using color to represent species. Document takeaways from this visualization.

    - Ask a specific question of the data, such as: is the sepal area significantly different in virginica compared to setosa? Answer the question through both a plot and a Mann-Whitney or t-test. If you use a t-test, be sure assumptions are met (independence, normality, equal variance). 

    - Document takeaways and any actions. 



### Part II

Explore your `titanic` dataset more completely.

- Determine drivers of the target variable
- Determine if certain columns should be dropped
- Determine if it would be valuable to bin some numeric columns
- Determine if it would be valuable to combine multiple columns into one.

Does it make sense to combine any features?

Do you find any surprises?

Document any and all findings and takeaways in your notebook using markdown.

### Part III

- Explore your `telco` data to discover drivers of churn
- Determine if certain columns should be dropped
- Determine if it would be valuable to bin some numeric columns
- Determine if it would be valuable to combine multiple columns into one.

What are your drivers of churn?

Does it make sense to combine any features?

Do you find any surprises?

Document any and all findings and takeaways in your notebook using markdown.