### Four Types of Analytics
![chart](../images/4-types-of-data-analytics-01.png)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#### Let's get some data and take a look

In [None]:
people_df = pd.read_csv('../data/people.csv')

In [None]:
people_df.shape

##### Statistics is decision making in the face of uncertainty or variability
 - GOAL: can we understand the metrics that explain how many years a person played sports?

In [None]:
people_df.head()

### Common descriptive statistics
 - measures of central tendency (mean, median, mode)
 - measures of variability (standard deviation, variance)
 - distribution metrics (quartiles, interquartile range, outliers)
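
As a minimal sketch of how each of these measures can be computed with pandas, the cell below uses a small hypothetical `weights` Series (toy values, not taken from `people.csv`):

```python
import pandas as pd

# Hypothetical toy weights, used only for illustration
weights = pd.Series([140, 140, 155, 166, 180, 195, 210])

# Measures of central tendency
mean_weight = weights.mean()
median_weight = weights.median()
mode_weight = weights.mode().iloc[0]  # mode() returns a Series; ties yield several values

# Measures of variability (pandas defaults to the sample versions, ddof=1)
std_weight = weights.std()
var_weight = weights.var()

# Distribution metrics
q1, q3 = weights.quantile([0.25, 0.75])
iqr = q3 - q1

print(mean_weight, median_weight, mode_weight, std_weight, var_weight, iqr)
```

The same method calls should work on a real column such as `people_df['weight']`.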

In [None]:
people_df['weight'].value_counts()

In [None]:
people_df.describe()

 -  What are the mean, median and mode for `weight`?

In [None]:
# From describe() above: mean is 169.95, median (the 50% row) is 166
# The mode is the most frequent value: 140
people_df.weight.value_counts().head(2)

 - What are the following values for `years_played_sports`?
    - Minimum  
    - Maximum  
    - 1st Quartile  
    - 2nd Quartile  
    - 3rd Quartile  
    - Interquartile Range (IQR) (Difference between the 1st and 3rd quartiles) 

In [None]:
# Q1 = 1 and Q3 = 10, so IQR = 10 - 1 = 9
# Lower fence: Q1 - 1.5 * IQR
low_outliers = 1 - (1.5 * 9)
print(low_outliers)
# Upper fence: Q3 + 1.5 * IQR
high_outliers = 10 + (1.5 * 9)
print(high_outliers)

In [None]:
people_desc = people_df.describe()

In [None]:
q1, q3 = people_desc.loc[['25%', '75%'], 'years_played_sports']
low_outliers_2 = q1 - 1.5 * (q3 - q1)

In [None]:
low_outliers_2

- Outliers can be determined mathematically. They are values that fall below (Q1 − 1.5 × IQR) or above (Q3 + 1.5 × IQR)
    - how many outliers are there for `years_played_sports`?
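
The outlier rule above can be sketched as a small helper. The function name `iqr_fences` and the toy `years` values are assumptions made for illustration. They are not part of the dataset.

```python
import pandas as pd

def iqr_fences(values: pd.Series, k: float = 1.5):
    """Return (lower, upper) fences: Q1 - k * IQR and Q3 + k * IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical toy data with one unusually large value
years = pd.Series([0, 1, 1, 2, 4, 6, 10, 10, 12, 40])

low, high = iqr_fences(years)

# Keep only the values that fall outside the fences
outliers = years[(years < low) | (years > high)]
print(low, high, len(outliers))
```

Computing the fences from the data avoids hard-coding thresholds that would go stale if the dataset changes.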

In [None]:
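# Count the rows outside the fences computed above (-12.5 and 23.5)
print(len(people_df.loc[(people_df.years_played_sports < -12.5) | (people_df.years_played_sports > 23.5)]))
# Pass the column as x explicitly so the box plot is drawn horizontally
sns.boxplot(x=people_df.years_played_sports);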

### [Correlations](https://statistics.laerd.com/statistical-guides/pearson-correlation-coefficient-statistical-guide.php) help us understand if variables _may_ have an underlying relationship
- a perfect positive correlation is 1.0
- a perfect negative correlation is -1.0
- interpreting correlation depends on the context and purpose!
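
To make the endpoints concrete, here is a minimal sketch on a hypothetical toy DataFrame (the column names are invented for illustration). One column is an exact positive multiple of `x`, one is its negation, and one is only loosely related.

```python
import pandas as pd

# Hypothetical toy data, not from people.csv
toy = pd.DataFrame({
    'x':        [1, 2, 3, 4, 5],
    'double_x': [2, 4, 6, 8, 10],      # exact positive linear function of x
    'neg_x':    [-1, -2, -3, -4, -5],  # exact negative linear function of x
    'noisy':    [2, 1, 4, 3, 5],       # related to x, but not perfectly
})

# Pearson correlation is the pandas default
corr = toy.corr()
print(corr['x'])
```

Here `double_x` correlates with `x` at 1.0, `neg_x` at -1.0, and `noisy` lands in between.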


In [None]:
people_df.corr(numeric_only = True)

 - Which variables in our dataset are most highly correlated with each other?
 - Which variables might explain the variability in `years_played_sports`?

In [None]:
fig, ax = plt.subplots(figsize=(6, 6))

corr = people_df.corr(numeric_only = True)
# create a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, cmap=cmap, mask = mask, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5});

### Normalization
 - Does the `years_played_sports` variable have the same meaning for all people in the dataset? Is a 70-year-old with 5 years of sports the same as a 20-year-old with 5 years of sports?
 - How would you normalize it?

In [None]:
people_df['pct_life_sports'] = people_df.years_played_sports/people_df.age

In [None]:
people_df.corr(numeric_only = True)

- Correlation between a variable you are trying to explain (sometimes called the dependent variable or target) and a variable that might explain it (independent variable or explanatory variable) helps us understand the target better.  
- Correlation between two explanatory variables may cause us to overestimate their importance in explaining the variance in the target. Think about _why_ variables might be highly correlated.
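
One way to spot explanatory variables that are highly correlated with each other is to list the column pairs whose absolute correlation exceeds a threshold. The sketch below is an illustration only: the helper name `highly_correlated_pairs`, the 0.8 threshold, and the toy columns are all assumptions.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(df: pd.DataFrame, threshold: float = 0.8) -> pd.Series:
    """Return column pairs whose absolute Pearson correlation is >= threshold."""
    corr = df.corr(numeric_only=True)
    # Keep only the upper triangle (excluding the diagonal) so each pair appears once
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    return pairs[pairs.abs() >= threshold]

# Hypothetical toy data: height recorded twice in different units
toy = pd.DataFrame({'height_in': [60, 64, 68, 70, 72],
                    'age':       [30, 22, 41, 35, 27]})
toy['height_cm'] = toy['height_in'] * 2.54
toy = toy[['height_in', 'height_cm', 'age']]

result = highly_correlated_pairs(toy)
print(result)
```

In this toy case the two height columns carry the same information, which is exactly the kind of pair worth questioning before treating both as separate explanations.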
 

#### Feature Engineering

##### New features could come from an external dataset or from the provided data

- compare bmi to an optimal bmi
- categorize as `oldest`, `middle`, `youngest`, or `only` child

In [None]:
men_bmi_age = pd.read_csv('../data/men_bmi_age.csv')
women_bmi_age = pd.read_csv('../data/women_bmi_age.csv')

In [None]:
men_bmi_age.head()

In [None]:
women_bmi_age.head()

In [None]:
people_df_men = people_df[people_df['sex'] == 'M']
people_df_women = people_df[people_df['sex'] == 'F']
# Everything that is neither 'M' nor 'F' (this includes missing values)
people_df_unk = people_df[~people_df['sex'].isin(['M', 'F'])]
# Only the rows where sex is missing
people_df_unk_2 = people_df[people_df['sex'].isnull()]

In [None]:
people_df_men.head()

In [None]:
people_df_women.head()

In [None]:
people_df_unk

In [None]:
people_df_unk_2

In [None]:
people_df_men = pd.merge(people_df_men, men_bmi_age, how = 'left', on = 'age')
people_df_women = pd.merge(people_df_women, women_bmi_age, how = 'left', on = 'age')
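
A left merge keeps every row of the left table, filling the lookup columns with `NaN` where no matching `age` exists. As a minimal sketch on hypothetical toy tables (the values are invented), `indicator=True` adds a `_merge` column that makes such unmatched rows easy to find:

```python
import pandas as pd

# Hypothetical toy tables; age 99 has no row in the lookup
people = pd.DataFrame({'age': [25, 40, 99], 'bmi': [22.0, 27.5, 24.0]})
lookup = pd.DataFrame({'age': [25, 40], 'optimal_bmi': [21.5, 23.0]})

merged = pd.merge(people, lookup, how='left', on='age', indicator=True)

# Rows present only in the left table did not find a match
unmatched = merged[merged['_merge'] == 'left_only']
print(merged)
print(unmatched)
```

Checking for unmatched rows after a merge like the one above can reveal ages that the lookup table does not cover.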

In [None]:
people_df_men.head()

In [None]:
people_df_men['diff_optimal_bmi'] = people_df_men['optimal_bmi'] - people_df_men['bmi']
people_df_women['diff_optimal_bmi'] = people_df_women['optimal_bmi'] - people_df_women['bmi']

In [None]:
people_df_men

In [None]:
people_df = pd.concat([people_df_women, people_df_men, people_df_unk])

In [None]:
people_df.head()

In [None]:
people_df = people_df.reset_index(drop = True)

In [None]:
people_df['birth_category'] = ''

In [None]:
people_df

In [None]:
# Assign each person a birth category; the first matching rule wins
for ind, row in people_df.iterrows():
    if row['sibling_count'] == 0:
        people_df.loc[ind, 'birth_category'] = 'only'
    elif row['birth_order'] == 1:
        people_df.loc[ind, 'birth_category'] = 'oldest'
    elif row['birth_order'] > row['sibling_count']:
        people_df.loc[ind, 'birth_category'] = 'youngest'
    else:
        people_df.loc[ind, 'birth_category'] = 'middle'
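
The row-by-row loop above can also be written without `iterrows`. The following is a sketch of one alternative using `np.select`, shown on a hypothetical toy DataFrame that borrows the `sibling_count` and `birth_order` column names. It is not a replacement the notebook requires.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows covering each category
toy = pd.DataFrame({
    'sibling_count': [0, 2, 2, 2, 1],
    'birth_order':   [1, 1, 2, 3, 2],
})

# Conditions are evaluated in order; the first one that is True wins,
# mirroring the if/elif chain
conditions = [
    toy['sibling_count'] == 0,
    toy['birth_order'] == 1,
    toy['birth_order'] > toy['sibling_count'],
]
choices = ['only', 'oldest', 'youngest']

toy['birth_category'] = np.select(conditions, choices, default='middle')
print(toy)
```

On large DataFrames the vectorized form is typically much faster than iterating over rows.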

In [None]:
people_df

# End of instruction