In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
sns.set_palette("viridis")
sns.set_context("talk")

## Summary measures of variables
Suppose you have received offers to join three different companies (A, B, or C) as a data scientist. Before making your decision, you gathered data on the salaries of data scientists at the three companies.

## The data
Let's load the data.

In [4]:
salaries_df = pd.read_csv('salaries.csv') # read the data into dataframe
salaries_df

Unnamed: 0,Company,Salary,EmployeeID
0,A,15500,1
1,A,23500,2
2,A,18000,3
3,A,19000,4
4,A,21500,5
5,B,19500,1
6,B,20500,2
7,B,19500,3
8,B,19500,4
9,B,18500,5


## Summary measures of the dataset
Now, let's compute several summary measures of the salaries at each company.

That is, we would like to get summary measures (mean, median, etc.) for each *company*.
To do this, we will use the *groupby* method and then summarize each resulting group with the *agg* method.

Note that this dataset has no missing values. If it had any, pandas aggregations such as mean and median would skip them (NaN) by default.
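To illustrate the note on missing values, here is a minimal sketch on a small, made-up frame (`toy_df` and its values are hypothetical and are not part of `salaries.csv`). It compares `mean`, `count` (non-missing values only), and `size` (all rows) for each group.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one salary in company A is missing (NaN)
toy_df = pd.DataFrame({
    "Company": ["A", "A", "A", "B", "B"],
    "Salary": [10000.0, np.nan, 20000.0, 15000.0, 17000.0],
})

# mean and count skip the NaN; size counts every row in the group
toy_summary = toy_df.groupby("Company")["Salary"].agg(["mean", "count", "size"])
print(toy_summary)
```

For company A, the mean is computed from the two available salaries only, so `count` is 2 while `size` is 3.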

In [None]:
grpby_company = salaries_df.groupby('Company', as_index=False) # create a groupby object, grouping by Company
agg_types = {  # the measures we want to include
    'Salary': ['mean']
}
# agg_types = {  # the measures we want to include
#     'Salary': ['mean', 'median', 'min', 'max', 'count']
# }
salaries_agg_df = grpby_company.agg(agg_types)  # create the aggregations
salaries_agg_df
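As a self-contained check that does not need `salaries.csv`, the sketch below rebuilds only the ten rows displayed above (companies A and B) and computes the fuller set of measures from the commented-out dictionary. The names `shown_df` and `shown_summary` are introduced here for illustration.

```python
import pandas as pd

# Only the rows shown in the output above; the full file may contain more
shown_df = pd.DataFrame({
    "Company": ["A"] * 5 + ["B"] * 5,
    "Salary": [15500, 23500, 18000, 19000, 21500,
               19500, 20500, 19500, 19500, 18500],
})

# Selecting the Salary column first gives flat column names (mean, median, ...)
shown_summary = shown_df.groupby("Company")["Salary"].agg(
    ["mean", "median", "min", "max", "count"]
)
print(shown_summary)
```

For these ten rows, both companies have a mean salary of 19,500, while the median, minimum, and maximum differ. That is a first hint that the mean alone does not tell the whole story.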

## Visualizing the data
Let's visualize the data to get another perspective.

In [None]:
bins = np.arange(14000, 38001, 500)  # bin edges for the histogram (cover the min to max values)
g = sns.displot(salaries_df, x="Salary", col="Company", bins=bins)  # one histogram per company (how many observations in each bin)
g.set_axis_labels("Salary", "Number of Employees")  # displot returns a FacetGrid, so the labels are set through it

# DataSaurus

This notebook demonstrates why visualizing data can be critical.

The data we use here includes 3 variables: *x*, *y*, and *type*. <br>
We can think of *x* as units of some input (e.g. resources invested), of *y* as units of some output (e.g. units produced), and of *type* as the source of the data (e.g. department). <br>
We are interested in the relationship between the input and the output for different sources. 

Let's first read the data into a dataframe.

In [None]:
df = pd.read_csv('Datasaurus.csv') # read the data
df

How many observations do we have?
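One way to answer this is sketched below on a small stand-in frame (`toy` and its values are hypothetical; only the column names *x*, *y*, and *type* come from the text). `shape` gives the total number of rows, and `groupby(...).size()` gives the number of rows for each *type*.

```python
import pandas as pd

# Hypothetical stand-in with the same column names as the Datasaurus data
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.0, 1.0, 4.0, 3.0, 5.0],
    "type": ["a", "a", "a", "b", "b"],
})

n_rows, n_cols = toy.shape                  # total observations and number of columns
rows_per_type = toy.groupby("type").size()  # observations within each type

print(n_rows, n_cols)
print(rows_per_type)
```

The same two calls can be applied to `df` once `Datasaurus.csv` has been loaded.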
***
We would like to examine the properties of the *x* and *y* variables for each *type*.
To do this, we will use the *groupby* method and then summarize each resulting group with the *agg* method. 

We could do this exactly as we did above for the companies (see the commented-out segment below). Here, I will instead use pandas' "named aggregations", which let us name the variables created in the aggregation step directly:


In [None]:
grpby_type = df.groupby('type', as_index=False) # create a groupby object, grouping by type

# agg_types = {  # the measures we want to include
#     'x': ['mean', 'std'],
#     'y': ['mean', 'std']
# }
# summary_stats = grpby_type.agg(agg_types)  # create the aggregations

# compute summary statistics (mean, sd) for x and y in each dataset, and name them (pandas' "named aggregations")
summary_stats = grpby_type.agg(
    mean_x=('x','mean'),
    mean_y=('y','mean'),
    std_x=('x','std'),
    std_y=('y','std')
)
summary_stats

What do you see? <br>
How do the "inputs" and "outputs" differ across types?

Do you think the relationship between *x* and *y* is the same across *types*? <br>
Let's check the **correlation** between *x* and *y* in each type:

In [None]:
grpby_type.corr() # returns a correlation matrix of the variables within each group

What did we get? <br>
Why all the 1s? Do the correlations differ across types?
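The output of `corr()` contains a full matrix for every group. If only the single *x*-*y* value per *type* is of interest, one possible approach is to loop over the groups and call `Series.corr`. This is a sketch on a hypothetical toy frame (`toy`, its values, and the name `corr_xy` are assumptions made for illustration).

```python
import pandas as pd

# Hypothetical toy data: y rises with x in one type and falls in the other
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    "y": [2.0, 4.0, 6.0, 3.0, 2.0, 1.0],
    "type": ["up", "up", "up", "down", "down", "down"],
})

# Pearson correlation between x and y, computed separately within each type
xy_corr = pd.Series(
    {name: grp["x"].corr(grp["y"]) for name, grp in toy.groupby("type")},
    name="corr_xy",
)
print(xy_corr)
```

Replacing `toy` with `df` gives one number per *type*, which is easier to compare across types than a stack of matrices.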

Let's now **plot** the relationship between *x* and *y* in each type.

In [None]:
# we want to plot the x-y pairs of all types side by side.
# Seaborn's FacetGrid sets up the grid of subplots, one per type
g = sns.FacetGrid(data=df, col="type", col_wrap=4, hue='type') # hue controls colors of the plotted data
# FacetGrid's map draws the same kind of plot in every subplot
g.map(plt.scatter, "x", "y", alpha=.7, s=30) # alpha controls transparency of the plotted data; s controls size of points

What do you think? Is the relationship between the inputs and the outputs the same for each source?

This demonstrates how summary statistics alone can sometimes fool us. Plotting the data makes such differences easy to see.