# Plotting Lab

In the `datasets/` folder in this notebook you will find two datasets:

- `salary.csv` -- a dataset comparing salary data across gender and tenure lines for academics 
- `wine_quality.csv` -- a dataset comparing chemical qualities of red and white wine and user-rated quality scores (on a 10-point scale)

Your task is to use Matplotlib and Seaborn to create two high-quality plots, one from each dataset. Your deliverable for this lab is to share your plots in your market's Slack channel at the end of the day.

Effective data science work includes taking a new dataset and investigating it for interesting correlations or relationships that might be the basis of future research. Take this lab as an opportunity to practice those skills and to see how plotting can help you toward that goal!

##### Useful Workflow Tips

1. Open the data and do a quick EDA:
  - How many rows and columns?
  - Is there missing data?
  - What does each column mean?
    - The meaning may not be clear at first glance, so double-check
    - Googling for insight into the domain (such as salary information for the academic world) is highly encouraged and may be required in some cases
    - Consider checking in with your colleagues, classmates, and teachers
  - At first glance, are there columns that you think might have an interesting relationship?
2. Begin plotting:
  - If a variable of interest is encoded as a string, do some feature extraction / transformation to turn it into numeric values
  - Use something like Seaborn's `pairplot` to visualize overall relationships
  - Start digging into a bivariate relationship
3. Refine plots:
  - Try different plotting types / plotting options to create an accurate and interesting plot
  - Remember to include titles, axes labels, etc.
  - Does your plot have a story? What should a reader take away from your plot?
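
Steps 2 and 3 can be sketched on a few toy rows before touching the real files. The frame below only borrows the salary dataset's column names; its values, the rank labels other than `full`, the rank ordering, and the meaning given to `yr` in the axis label are all assumptions for illustration, so confirm them against the dataset's documentation.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy rows that reuse the salary dataset's column names (sx, rk, yr, sl).
# Values are made up; rank labels other than "full" are assumed.
toy = pd.DataFrame({
    "sx": ["male", "female", "female", "male"],
    "rk": ["full", "associate", "assistant", "full"],
    "yr": [25, 7, 2, 13],
    "sl": [36350, 26775, 18000, 35350],
})

# Step 2: turn string columns into numbers.
# A two-level column maps directly onto 0/1.
toy["sx_num"] = toy["sx"].map({"male": 0, "female": 1})

# An ordered categorical keeps the rank ordering in its integer codes.
rank_order = ["assistant", "associate", "full"]  # assumed ordering
toy["rk_num"] = pd.Categorical(
    toy["rk"], categories=rank_order, ordered=True
).codes

# Step 3: a bivariate plot with a title and axis labels.
fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(toy["yr"], toy["sl"], c=toy["sx_num"], cmap="coolwarm")
ax.set_title("Salary vs. yr (toy data)")
ax.set_xlabel("Years in rank (assumed meaning of yr)")
ax.set_ylabel("Salary (sl)")
```

Once the string columns have numeric counterparts, the same frame can be passed to `sns.pairplot` for the overview plot mentioned in step 2.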

In [6]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Salary

In [7]:
salary = pd.read_csv('datasets/salary.csv')
salary.head()

,sx,rk,yr,dg,yd,sl
0,male,full,25,doctorate,35,36350
1,male,full,13,doctorate,22,35350
2,male,full,10,doctorate,23,28200
3,female,full,7,doctorate,27,26775
4,male,full,19,masters,30,33696


# Wine Quality

In [9]:
wine = pd.read_csv('datasets/wine_quality.csv')
wine.head()

,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,type
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,red
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5,red
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5,red
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6,red
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,red


# Shape of Salary Dataset

In [11]:
n_rows, n_cols = salary.shape
print(f'{n_rows} rows and {n_cols} columns')

52 rows and 6 columns


# Shape of Wine Dataset

In [13]:
n_rows, n_cols = wine.shape
print(f'{n_rows} rows and {n_cols} columns')

6497 rows and 13 columns


# Missing Data? 

### Salary Dataset

In [15]:
salary.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52 entries, 0 to 51
Data columns (total 6 columns):
sx    52 non-null object
rk    52 non-null object
yr    52 non-null int64
dg    52 non-null object
yd    52 non-null int64
sl    52 non-null int64
dtypes: int64(3), object(3)
memory usage: 2.5+ KB


In [18]:
salary.isnull().sum()

sx    0
rk    0
yr    0
dg    0
yd    0
sl    0
dtype: int64
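
The `info()` and `isnull().sum()` checks above can be folded into one small table per dataset. The sketch below uses a hypothetical helper, `summarize_columns`, which is not part of pandas, and runs it on a toy frame with one deliberately missing value so the check has something to report.

```python
import pandas as pd

def summarize_columns(df):
    """Return one row per column: dtype, missing count, distinct non-null values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isnull().sum(),
        "n_unique": df.nunique(),
    })

# Toy frame (made-up values) with one missing entry in `sx`.
toy = pd.DataFrame({
    "sx": ["male", "female", None],
    "yr": [25, 7, 13],
})

summary = summarize_columns(toy)
print(summary)
```

Calling `summarize_columns(salary)` or `summarize_columns(wine)` should give the same overview for the real data, and the `n_unique` column is a quick way to spot string columns with only a few levels.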