# Major_Evaluation Data Cleaning/Analysis Project

To begin the project, we need to import all of the packages that the project will use.

In [1]:
#importing all necessary packages
import pandas as pd
import numpy as np
import matplotlib as mp
import os
import settings

Now that the necessary packages are imported, let's see which data files we will be handling.

In [2]:
#create a list of all of our data file names
data_files = os.listdir(settings.Data_dir)
print(data_files)

['Bachelor_Degrees_Conferred.csv', 'degrees-that-pay-back.csv']


Notice that all of the data files are in ".csv" format.
This file format is easy to handle thanks to the "pandas" Python package that we have already imported.
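
If other file types ever end up in the data directory, the listing could be filtered by extension before loading. A minimal sketch, assuming a temporary directory as a stand-in for `settings.Data_dir` and a hypothetical stray file `notes.txt`:

```python
import os
import tempfile

# Hypothetical directory standing in for settings.Data_dir
with tempfile.TemporaryDirectory() as data_dir:
    # Create one CSV file and one stray non-CSV file (both empty)
    for name in ["degrees-that-pay-back.csv", "notes.txt"]:
        open(os.path.join(data_dir, name), "w").close()

    # Keep only names ending in ".csv", sorted for a stable order
    csv_files = sorted(
        f for f in os.listdir(data_dir) if f.lower().endswith(".csv")
    )

print(csv_files)
```

Sorting is optional; it only makes the order of the file list reproducible, since `os.listdir` returns names in arbitrary order.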

Let's get the data files into a usable Python format by loading each one into its own pandas
data frame.

In [3]:
#initialize an empty dictionary that will provide easy, organized access to the data frames
DF = {}

#loop over each data file
for entry in data_files:

    #read the data file into a data frame and store it in the dictionary,
    #keyed by the file name without its extension
    DF[entry.split(".")[0]] = pd.read_csv(os.path.join(settings.Data_dir, entry))

#print the dictionary keys for future reference
print(list(DF.keys()))
    


['degrees-that-pay-back', 'Bachelor_Degrees_Conferred']


Great! Now that we have our data in pandas data frames, we can view, edit, and analyze it all in this Jupyter Notebook.

Let's take a look at what the data actually looks like by printing the contents of each data frame.

In [4]:
DF["degrees-that-pay-back"]

Unnamed: 0,Undergraduate Major,Starting Median Salary,Mid-Career Median Salary,Percent change from Starting to Mid-Career Salary,Mid-Career 10th Percentile Salary,Mid-Career 25th Percentile Salary,Mid-Career 75th Percentile Salary,Mid-Career 90th Percentile Salary
0,Accounting,"$46,000.00","$77,100.00",67.6,"$42,200.00","$56,100.00","$108,000.00","$152,000.00"
1,Aerospace Engineering,"$57,700.00","$101,000.00",75.0,"$64,300.00","$82,100.00","$127,000.00","$161,000.00"
2,Agriculture,"$42,600.00","$71,900.00",68.8,"$36,300.00","$52,100.00","$96,300.00","$150,000.00"
3,Anthropology,"$36,800.00","$61,500.00",67.1,"$33,800.00","$45,500.00","$89,300.00","$138,000.00"
4,Architecture,"$41,600.00","$76,800.00",84.6,"$50,600.00","$62,200.00","$97,000.00","$136,000.00"
5,Art History,"$35,800.00","$64,900.00",81.3,"$28,800.00","$42,200.00","$87,400.00","$125,000.00"
6,Biology,"$38,800.00","$64,800.00",67.0,"$36,900.00","$47,400.00","$94,500.00","$135,000.00"
7,Business Management,"$43,000.00","$72,100.00",67.7,"$38,800.00","$51,500.00","$102,000.00","$147,000.00"
8,Chemical Engineering,"$63,200.00","$107,000.00",69.3,"$71,900.00","$87,300.00","$143,000.00","$194,000.00"
9,Chemistry,"$42,600.00","$79,900.00",87.6,"$45,300.00","$60,700.00","$108,000.00","$148,000.00"


In [5]:
DF["Bachelor_Degrees_Conferred"]

Unnamed: 0,"Table 322.10. Bachelor's degrees conferred by postsecondary institutions, by field of study: Selected years, 1970-71 through 2014-15",Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 29,Unnamed: 30,Unnamed: 31,Unnamed: 32,Unnamed: 33,Unnamed: 34,Unnamed: 35,Unnamed: 36,Unnamed: 37,Unnamed: 38
0,Field of study,1970-71,1975-76,1980-81,1985-86,1990-91,1995-96,2000-01,2004-05,2005-06,...,,,,,,,,,,
1,1,2,3,4,5,6,7,8,9,10,...,,,,,,,,,,
2,Total ........................................,839730,925746,935140,987823,1094538,1164792,1244171,1439264,1485242,...,,,,,,,,,,
3,Agriculture and natural resources ...............,12672,19402,21886,16823,13124,21425,23370,23002,23053,...,,,,,,,,,,
4,Architecture and related services ...............,5570,9146,9455,9119,9781,8352,8480,9237,9515,...,,,,,,,,,,
5,"Area, ethnic, cultural, gender, and group stud...",2579,3577,2887,3021,4776,5633,6160,7569,7879,...,,,,,,,,,,
6,Biological and biomedical sciences ..............,35705,54154,43078,38395,39482,61014,60576,65915,70607,...,,,,,,,,,,
7,Business ........................................,115396,143171,200521,236700,249165,226623,263515,311574,318042,...,,,,,,,,,,
8,,,,,,,,,,,...,,,,,,,,,,
9,"Communication, journalism, and related program...",10324,20045,29428,41666,51650,47320,58013,72715,73955,...,,,,,,,,,,


The data in both data frames appears to be relatively easy to work with. It is largely continuous and quantitative, with only a few qualitative fields such as the names of the majors. Neither data frame should need much individual attention to suit our purposes for this project.
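
One caveat: the salary columns in the "degrees-that-pay-back" output above are displayed as strings such as "$46,000.00", so they would need converting before any numeric analysis. A minimal sketch of one way to do that, assuming the columns were read in as text; the small `salaries` frame is a hypothetical excerpt built from the first two rows shown above:

```python
import pandas as pd

# Hypothetical excerpt; values copied from the first two rows of the output
salaries = pd.DataFrame({
    "Undergraduate Major": ["Accounting", "Aerospace Engineering"],
    "Starting Median Salary": ["$46,000.00", "$57,700.00"],
})

# Strip the dollar sign and thousands separators, then convert to float
salaries["Starting Median Salary"] = (
    salaries["Starting Median Salary"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(float)
)

print(salaries)
```

Passing `regex=False` treats "$" as a literal character rather than as a regular-expression anchor.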

The first thing that caught my eye is the large number of "NaN" values in the "Bachelor_Degrees_Conferred" dataframe. "NaN" values mark cells where no data was filled in. Leaving them in the data will throw off our analysis later on, so it is best to remove them to get an accurate representation of our data.
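
As a rough illustration of what removing "NaN" values could look like, the sketch below counts the missing values per column and then drops only the columns and rows that are entirely empty. The `toy` frame is a hypothetical miniature of the conferred-degrees table, not the real data, and dropping only fully empty rows and columns is one possible choice rather than the approach this project necessarily takes.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the table: one empty row and one empty column
toy = pd.DataFrame({
    "Field of study": ["Total", "Business", np.nan],
    "1970-71": [839730, 115396, np.nan],
    "Unnamed: 29": [np.nan, np.nan, np.nan],
})

# Count the missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop columns, then rows, in which every value is missing
cleaned = toy.dropna(axis=1, how="all").dropna(axis=0, how="all")
print(cleaned)
```

Using `how="all"` keeps rows that are only partly filled, which avoids discarding real figures just because one cell is missing.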

Second, the "Bachelor_Degrees_Conferred" dataframe has headers that are more or less useless, because almost all of them are "Unnamed" placeholders. We should either remove the pre-set headers or rename them so that they provide value.

Lastly, looking at our data frames, it may not be a good idea to combine them into a single data set right away. Removing the "NaN" values and condensing some of the data first should make each data frame neater and easier to interpret.

Let's start cleaning the data sets to make them usable for the purposes of this project. Since the "Bachelor_Degrees_Conferred" dataframe appears to be the one that needs more cleaning, we will begin with that.

In [6]:
#cleaning the "Bachelor_Degrees_Conferred" dataframe

#The column headers for this dataframe do not appear to be of use to us, but the first row does. Let's
#replace the original headers with the first row.

#rename the column headers
DF["Bachelor_Degrees_Conferred"].columns = DF["Bachelor_Degrees_Conferred"].iloc[0]

#drop the row that has become the new header, along with the column-number row below it
DF["Bachelor_Degrees_Conferred"] = DF["Bachelor_Degrees_Conferred"][2:]

#print dataframe to check
DF["Bachelor_Degrees_Conferred"]




Unnamed: 0,Field of study,1970-71,1975-76,1980-81,1985-86,1990-91,1995-96,2000-01,2004-05,2005-06,...,nan,nan.1,nan.2,nan.3,nan.4,nan.5,nan.6,nan.7,nan.8,nan.9
2,Total ........................................,839730.0,925746.0,935140.0,987823.0,1094538.0,1164792.0,1244171.0,1439264.0,1485242.0,...,,,,,,,,,,
3,Agriculture and natural resources ...............,12672.0,19402.0,21886.0,16823.0,13124.0,21425.0,23370.0,23002.0,23053.0,...,,,,,,,,,,
4,Architecture and related services ...............,5570.0,9146.0,9455.0,9119.0,9781.0,8352.0,8480.0,9237.0,9515.0,...,,,,,,,,,,
5,"Area, ethnic, cultural, gender, and group stud...",2579.0,3577.0,2887.0,3021.0,4776.0,5633.0,6160.0,7569.0,7879.0,...,,,,,,,,,,
6,Biological and biomedical sciences ..............,35705.0,54154.0,43078.0,38395.0,39482.0,61014.0,60576.0,65915.0,70607.0,...,,,,,,,,,,
7,Business ........................................,115396.0,143171.0,200521.0,236700.0,249165.0,226623.0,263515.0,311574.0,318042.0,...,,,,,,,,,,
8,,,,,,,,,,,...,,,,,,,,,,
9,"Communication, journalism, and related program...",10324.0,20045.0,29428.0,41666.0,51650.0,47320.0,58013.0,72715.0,73955.0,...,,,,,,,,,,
10,Communications technologies .....................,478.0,1237.0,1854.0,1479.0,1397.0,853.0,1178.0,2523.0,2981.0,...,,,,,,,,,,
11,Computer and information sciences ...............,2388.0,5652.0,15121.0,42337.0,25159.0,24506.0,44142.0,54111.0,47480.0,...,,,,,,,,,,
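
The output above still carries columns labelled "nan" and a row index that starts at 2. A possible follow-up, sketched here on a hypothetical miniature of the raw table rather than on the real dataframe, is to keep only the columns whose label is not "NaN" and to renumber the rows:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the raw table: generic headers, the real header
# in row 0, column numbers in row 1, and a trailing all-empty column
raw = pd.DataFrame({
    "Unnamed: 0": ["Field of study", "1", "Total", "Business"],
    "Unnamed: 1": ["1970-71", "2", "839730", "115396"],
    "Unnamed: 2": [np.nan, np.nan, np.nan, np.nan],
})

# Promote row 0 to the header, then drop rows 0 and 1 (same steps as above)
raw.columns = raw.iloc[0]
table = raw[2:]

# Keep only columns whose label is not NaN, and renumber the rows from 0
table = table.loc[:, table.columns.notna()].reset_index(drop=True)
print(table)
```

Whether to renumber the rows is a matter of taste; keeping the original index can be handy for tracing a row back to the source file.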
