
**Introduction**

The purpose of this project is to explore data visualization techniques using the Salaries by College Type dataset. This kernel cleans the data and provides a few visualizations.

**Setup**

The following packages need to be installed: pandas, numpy, seaborn, matplotlib, and bokeh. The re module used for regular expressions ships with Python and needs no installation. On Windows, open a terminal and run the following command for each package: pip install "package name".


In [1]:
# To begin, import pandas, numpy, re, csv, seaborn, and matplotlib.
# Warnings are suppressed below to keep the notebook output tidy.
import pandas as pd
import numpy as np
import re
import csv
import warnings
warnings.filterwarnings("ignore")
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="white", color_codes=True)

#To import the college salary dataset:
college = pd.read_csv("../input/salaries-by-college-type.csv")


#To view income data:
college.head()



In [2]:
#check the type of data we have
type(college)

In [3]:
#To check for any missing values
college.isnull().any()

In [4]:
#Some values are missing
#Drop the rows that have missing data (dropna removes rows by default, not columns)
college.dropna(inplace=True)
college.reset_index(inplace=True, drop=True)
college.head()

In [5]:
#Check again for missing values
college.isnull().any()
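The missing-value check and the drop step above can be illustrated on a small stand-in table. This is only a sketch: the frame `toy`, its values, and the column "Other Salary" are made up for illustration and are not taken from the dataset. It shows that `isnull().sum()` reports how many values are missing per column, and that `dropna()` with default arguments removes rows.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the college data; all values are invented.
toy = pd.DataFrame({
    "School Name": ["A", "B", "C"],
    "Starting Median Salary": ["$50,000.00", "$60,000.00", "$55,000.00"],
    "Other Salary": ["$40,000.00", np.nan, "$45,000.00"],
})

# Count the missing values in each column instead of only flagging them.
missing_per_column = toy.isnull().sum()

# With default arguments, dropna() removes every row that has a missing value.
cleaned = toy.dropna().reset_index(drop=True)
```

Counting per column before dropping makes it easier to judge how much data the drop will cost.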

In [7]:
#Change the salary strings to floats using str.replace and pd.to_numeric
#Only convert the salary columns; the name and type columns are plain text (it will not work otherwise)
for x in college.columns:
    if x != 'School Name' and x != 'School Type':
        #regex=False treats "$" as a literal character instead of a regex anchor
        new_col = college[x].str.replace("$", "", regex=False)
        new_col = new_col.str.replace(",", "", regex=False)
        college[x] = pd.to_numeric(new_col)
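As a minimal sketch of the conversion step, the same idea can be run on a tiny made-up table. The frame `toy` and its values are assumptions for illustration only. Passing `regex=False` is a deliberate choice here: "$" is an end-of-line anchor in regular expressions, so a literal replacement is the safer reading.

```python
import pandas as pd

# Hypothetical stand-in for the college data; all values are invented.
toy = pd.DataFrame({
    "School Name": ["A", "B"],
    "School Type": ["State", "Engineering"],
    "Starting Median Salary": ["$45,100.00", "$62,400.00"],
})

# Every column except the two text columns is assumed to hold currency strings.
text_columns = ["School Name", "School Type"]
for column in toy.columns.difference(text_columns):
    stripped = (toy[column]
                .str.replace("$", "", regex=False)   # drop the literal dollar sign
                .str.replace(",", "", regex=False))  # drop the thousands separator
    toy[column] = pd.to_numeric(stripped)
```

After the loop, the salary column holds floats and can be used directly in plots and aggregations.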
        


In [8]:
#Now that the data is clean, we can begin to graph
college.head()

In [9]:
#To view the number of schools by type:
college["School Type"].value_counts()

In [10]:
#kdeplot is a seaborn function that plots univariate or bivariate density estimates.
#Plot the density of Starting Median Salary for each school type
#Note: FacetGrid's size argument was renamed to height in seaborn 0.9
sns.FacetGrid(college, hue="School Type", height=5) \
   .map(sns.kdeplot, "Starting Median Salary") \
   .add_legend()
plt.show()

In [11]:
#Use a heatmap to view the correlation between variables in the dataset.

#Correlate only the numeric columns; the name and type columns are text
corr = college.select_dtypes(include="number").corr()
sns.heatmap(corr,
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values, cmap="YlGnBu")

plt.show()
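The correlation matrix behind the heatmap can be sketched on a small made-up table. The frame `toy`, its values, and the column name "Mid-Career Median Salary" are assumptions for illustration. Selecting the numeric columns first keeps text columns such as the school type out of the calculation.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned college data; all values are invented.
toy = pd.DataFrame({
    "School Type": ["State", "State", "Engineering", "Engineering"],
    "Starting Median Salary": [40000.0, 45000.0, 60000.0, 65000.0],
    "Mid-Career Median Salary": [70000.0, 80000.0, 100000.0, 110000.0],
})

# Keep only the numeric columns, then compute pairwise Pearson correlations.
numeric = toy.select_dtypes(include="number")
corr = numeric.corr()
```

The result is a square, symmetric table with ones on the diagonal, which is the shape `sns.heatmap` expects.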

In [12]:
#A bar plot of the average Mid-Career Salary by school type
#Note: Averaging medians can be misleading when a group contains outliers
objects = ('Engineering', 'Party', 'Liberal Arts', 'Ivy League', 'State')
y_pos = np.arange(len(objects))
performance = [103842,84685,89379,120125,78567]
 
plt.bar(y_pos, performance, align='center', alpha=0.5,  color='r')
plt.xticks(y_pos, objects, rotation=70)
plt.ylabel('Average')
plt.title('Average Mid-Career Salary by School Type')
 
plt.show()
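The averages in the bar plot above are typed in by hand. One possible alternative, shown here only as a sketch, is to compute them with a groupby. The frame `toy`, its values, and the column name "Mid-Career Median Salary" are assumptions for illustration and do not reproduce the figures used in the plot.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned college data; all values are invented.
toy = pd.DataFrame({
    "School Type": ["State", "State", "Engineering", "Engineering", "Ivy League"],
    "Mid-Career Median Salary": [70000.0, 80000.0, 100000.0, 110000.0, 120000.0],
})

# Average the mid-career salary within each school type, highest first.
average_by_type = (toy.groupby("School Type")["Mid-Career Median Salary"]
                      .mean()
                      .sort_values(ascending=False))
```

The index and values of `average_by_type` could then take the place of `objects` and `performance` in the bar plot, so the labels and heights cannot drift out of step with the data.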

In [None]:
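#Use the visualization package bokeh to create a scatter plot with an HTML output
#An image of the output file can be seen below.
#Note: bokeh.charts was removed from newer Bokeh releases, so this uses bokeh.plotting instead
from bokeh.plotting import figure, output_file, show

#A categorical x-axis needs its categories passed as x_range
p = figure(x_range=sorted(college["School Type"].unique().tolist()),
           title="Starting Median Salary vs School Type",
           x_axis_label="School Type", y_axis_label="Starting Median Salary")
p.scatter(x="School Type", y="Starting Median Salary", source=college, color="darkmagenta")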

output_file("college.html")

show(p)

**Bokeh Image**
(https://imgur.com/a/QAMBZ)

