# Create Community College Machine Learning Dataset

This notebook reads the merged data file located [here](https://github.com/BrownRegaSterlingHeinen/PostsecondaryAttainment/blob/master/RawDatasets/mergedData.xlsx) as input. Much of the code is adapted from Dr. Jake Drew's repository for North Carolina public school data, which can be viewed [here](https://github.com/jakemdrew/EducationDataNC/tree/master/2017/Machine%20Learning%20Datasets). The notebook produces a dataset that is preprocessed for machine learning by applying the following transformations:

- Columns that have the same value in every single row are deleted.
- Columns that have a unique value in every single row (all values are different) are deleted, whether nominal or numeric.
- Empty columns (all values are NA or NULL) are deleted.
- Numeric columns with more than the percentage of missing values specified by the missingThreshold parameter are deleted.
- Remaining numeric columns with missing values are imputed / populated with the mean of the column.
- Categorical / text based columns with >= uniqueThreshold unique values are deleted.
- Duplicated or highly similar columns with > 95% correlation are deleted.

In [107]:
import pandas as pd
import numpy as np

# Load in merged community college data
url = 'https://github.com/BrownRegaSterlingHeinen/PostsecondaryAttainment/blob/master/2016/NCCC%20Datasets/mergedData.xlsx?raw=true'
NCCCData = pd.read_excel(url)

In [108]:
# View info for data
NCCCData.info()
NCCCData.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59 entries, 0 to 58
Columns: 721 entries, County to $200,000 or more Number of Returns
dtypes: float64(365), int64(68), object(288)
memory usage: 332.4+ KB


Unnamed: 0,County,College Name,AdvESL_MeasureableSkills_Participant_POP_MSG,AdvESL_MeasureableSkills_IndividualsServed,AdvESL_MeasureableSkills_Participants12+Hours,AdvESL_MeasureableSkills_ParticipServed,AdvESL_MeasureableSkills_POPs,AdvESL_MeasureableSkills_AHSGrad,AdvESL_MeasureableSkills_HSE,AdvESL_MeasureableSkills_Postsecondary Enrollment,...,Tax due at time of filing [11]_Number of returns,Tax due at time of filing [11]_Amount,Overpayments refunded [12]_Number of returns,Overpayments refunded [12]_Amount,"$1 under $25,000 Number of Returns","$25,000 under $50,000 Number of Returns","$50,000 under $75,000 Number of Returns","$75,000 under $100,000 Number of Returns","$100,000 under $200,000 Number of Returns","$200,000 or more Number of Returns"
0,Alamance,Alamance Community College,33.30%,12.0,11.0,91.70%,12.0,0.00%,0.00%,8.30%,...,13290.0,48075.0,58020.0,150599.0,28880.0,19790.0,10120.0,6220.0,7190.0,1500.0
1,Beaufort,Beaufort County Community College,25.00%,9.0,8.0,88.90%,8.0,0.00%,0.00%,0.00%,...,4170.0,15632.0,15750.0,42501.0,9010.0,5100.0,2610.0,1720.0,2080.0,380.0
2,Bladen,Bladen Community College,*,4.0,4.0,100.00%,4.0,*,*,*,...,1800.0,6079.0,8980.0,25248.0,4950.0,3210.0,1330.0,760.0,750.0,100.0
3,Henderson,Blue Ridge Community College,5.90%,42.0,34.0,81.00%,34.0,0.00%,0.00%,0.00%,...,11420.0,47803.0,37170.0,91382.0,19140.0,12730.0,7510.0,5050.0,6080.0,1550.0
4,Brunswick,Brunswick Community College,*,6.0,4.0,66.70%,4.0,*,*,*,...,13910.0,58804.0,40890.0,108478.0,20350.0,12870.0,7970.0,6050.0,8660.0,2170.0


In [109]:
# Missing Data Threshold (Per Column)
missingThreshold = 0.60

# Unique Value Threshold (Per Column)
# Delete columns with >= uniqueThreshold unique values prior to one-hot encoding.
# (each unique value becomes a new column during one-hot encoding)
uniqueThreshold = 25

## Prepare Consolidated Dataset for Machine Learning
Below we perform operations on the entire dataset to remove columns and update row values that could cause problems during machine learning.

The info output above shows that many columns that should be floats are stored as objects, for example percentages such as 33.30% and the * placeholder. We loop through every column and convert these values to numbers.

In [110]:
for column in range(len(NCCCData.columns)): #loops through every column in the data set
    if NCCCData.iloc[:,column].isnull().sum() == 0: #columns with no missing values are mostly left as they are
        print("Column ", column , "stayed the same")
        # Cells containing '%' or a lone '-' are not read correctly; the following code deals with those cells
        for value in range(len(NCCCData)):
            if '%' in str(NCCCData.iloc[value,column]):
                NCCCData.iloc[value,column] = float(NCCCData.iloc[value,column].strip("%"))/100
                NCCCData.iloc[:,column] = pd.to_numeric(NCCCData.iloc[:,column], errors='ignore')
            elif NCCCData.iloc[value,column] == "-":
                NCCCData.iloc[value, column] = None
                NCCCData.iloc[:,column] = pd.to_numeric(NCCCData.iloc[:,column], errors='ignore')
    elif NCCCData.iloc[:,column].dtype == "object": #if the column is of type object (percentages)
        for value in range(len(NCCCData)):
            if pd.isnull(NCCCData.iloc[value,column]):
                continue
            elif "%" in NCCCData.iloc[value,column]: # one of the issues is percentages
                if ' ' in NCCCData.iloc[value,column]:
                    # keep only the text before the first space
                    NCCCData.iloc[value,column] = NCCCData.iloc[value,column].split(' ', 1)[0]
                NCCCData.iloc[value,column] = float(NCCCData.iloc[value,column].strip("%"))/100
            elif "," in NCCCData.iloc[value, column]: # another issue is commas
                NCCCData.iloc[value, column] = float(NCCCData.iloc[value, column].replace(",", ""))
            elif NCCCData.iloc[value,column] == "*":
                NCCCData.iloc[value, column] = None
            #elif NCCCData.iloc[value,column] == "-":
             #   NCCCData.iloc[value, column] = NCCCData.iloc[value, column].replace("-", "")
            else:
                continue
        print("Column ", column, "was changed to type numeric")
        NCCCData.iloc[:,column] = pd.to_numeric(NCCCData.iloc[:,column])
    else:
        print("Column", column, "stayed the same")

Column  0 stayed the same
Column  1 stayed the same
Column  2 was changed to type numeric
Column 3 stayed the same
Column 4 stayed the same
Column  5 was changed to type numeric
Column 6 stayed the same
Column  7 was changed to type numeric
Column  8 was changed to type numeric
Column  9 was changed to type numeric
Column  10 was changed to type numeric
Column  11 was changed to type numeric
Column  12 was changed to type numeric
Column 13 stayed the same
Column  14 was changed to type numeric
Column 15 stayed the same
Column  16 was changed to type numeric
Column 17 stayed the same
Column  18 was changed to type numeric
Column 19 stayed the same
Column  20 was changed to type numeric
Column 21 stayed the same
Column  22 was changed to type numeric
Column 23 stayed the same
Column  24 was changed to type numeric
Column  25 was changed to type numeric
Column  26 was changed to type numeric
Column  27 was changed to type numeric
Column 28 stayed the same
Column  29 was changed to type nu

Column  230 was changed to type numeric
Column  231 was changed to type numeric
Column  232 was changed to type numeric
Column 233 stayed the same
Column  234 was changed to type numeric
Column  235 was changed to type numeric
Column  236 was changed to type numeric
Column  237 was changed to type numeric
Column 238 stayed the same
Column  239 was changed to type numeric
Column  240 was changed to type numeric
Column  241 was changed to type numeric
Column  242 was changed to type numeric
Column 243 stayed the same
Column  244 was changed to type numeric
Column  245 was changed to type numeric
Column  246 was changed to type numeric
Column  247 was changed to type numeric
Column 248 stayed the same
Column  249 was changed to type numeric
Column  250 was changed to type numeric
Column  251 was changed to type numeric
Column  252 was changed to type numeric
Column 253 stayed the same
Column  254 was changed to type numeric
Column  255 was changed to type numeric
Column  256 was changed t
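
The cell-by-cell loop above can also be expressed with vectorized string operations. The following is a minimal sketch on a hypothetical toy column, not the notebook's actual code: the function name `to_numeric_column` and the sample values are assumptions, and unlike the loop it does not handle percentages followed by extra text after a space.

```python
import pandas as pd

# Hypothetical toy column mimicking the raw strings seen in the merged file
raw = pd.Series(["33.30%", "25.00%", "*", "5.90%", None, "1,250"])

def to_numeric_column(col: pd.Series) -> pd.Series:
    """Convert strings such as '33.30%' or '1,250' to floats.

    Placeholders such as '*' or '-' and missing values become NaN.
    """
    text = col.astype(str).str.strip()
    is_pct = text.str.endswith("%")
    # Strip the percent sign and thousands separators before parsing
    cleaned = text.str.rstrip("%").str.replace(",", "", regex=False)
    values = pd.to_numeric(cleaned, errors="coerce")
    # Scale only the entries that were written as percentages
    return values.where(~is_pct, values / 100)

converted = to_numeric_column(raw)
print(converted)
```

Using `errors="coerce"` turns anything that cannot be parsed into NaN, which plays the same role as the `None` assignments in the loop.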

### Impute any Remaining Missing Values with Mean Value
Missing values will be imputed with the mean of the column.

In [111]:
#Print out all the missing value rows
pd.set_option('display.max_rows', 1000)

print('\r\n*********The Remaining Missing Values Below will be Imputed with the Mean!*********')

#Check for Missing values 
missing_values = NCCCData.isnull().sum().reset_index()
missing_values.columns = ['Variable Name', 'Number Missing Values']
missing_values = missing_values[missing_values['Number Missing Values'] > 0] 
missing_values


*********The Remaining Missing Values Below will be Imputed with the Mean!*********


Unnamed: 0,Variable Name,Number Missing Values
2,AdvESL_MeasureableSkills_Participant_POP_MSG,15
3,AdvESL_MeasureableSkills_IndividualsServed,5
4,AdvESL_MeasureableSkills_Participants12+Hours,5
5,AdvESL_MeasureableSkills_ParticipServed,9
6,AdvESL_MeasureableSkills_POPs,5
7,AdvESL_MeasureableSkills_AHSGrad,15
8,AdvESL_MeasureableSkills_HSE,15
9,AdvESL_MeasureableSkills_Postsecondary\nEnroll...,15
10,AdvESL_MeasureableSkills_AHSCredits,15
11,AdvESL_MeasureableSkills_Post-test,15


In [112]:
#Replace all remaining NaN with mean of column
NCCCData = NCCCData.fillna(NCCCData.mean(numeric_only=True))

#Check for Missing values after final imputation 
missing_values = NCCCData.isnull().sum().reset_index()
missing_values.columns = ['Variable Name', 'Number Missing Values']
missing_values = missing_values[missing_values['Number Missing Values'] > 0] 
missing_values

Unnamed: 0,Variable Name,Number Missing Values
191,NucMedTech_PCTPassing2015,59
225,CosmeticInstructor_PCTPassing2016,59
281,VetMedTech_PCTPassing2015,59
387,ACTComposite_25thPercentile,59
388,ACTComposite_75thPercentile,59
389,ACTEnglish_25thPercentile,59
390,ACTEnglish_75thPercentile,59
391,ACTMath_25thPercentile,59
392,ACTMath_75thPercentile,59
393,Students_Submitted_ACTScores,59


The remaining columns with missing values are completely empty, so these will be deleted.

In [113]:
NCCCData = NCCCData.dropna(axis='columns', how='all')
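
To see why completely empty columns survive mean imputation, consider a small sketch on a hypothetical toy dataframe (the column names and values are made up for illustration). The mean of an all-missing column is itself missing, so `fillna` has nothing to fill it with.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one partly missing column, one completely empty column
toy = pd.DataFrame({
    "college": ["A", "B", "C"],
    "pass_rate": [0.8, np.nan, 0.6],
    "act_score": [np.nan, np.nan, np.nan],
})

# Impute numeric columns with their column mean
imputed = toy.fillna(toy.mean(numeric_only=True))

# The empty column still has every value missing after imputation
still_missing = imputed.isnull().sum()
print(still_missing)

# Dropping all-missing columns removes it
imputed = imputed.dropna(axis="columns", how="all")
print(imputed)
```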

### Remove Columns with Problematic Data
Here we remove entire columns that could cause problems during machine learning. The following operations are performed:

- Remove any columns that have the same value in every single row.
- Remove any columns that have a unique value in every single row (all values are different).
- Remove empty columns (all values are NA or NULL).

In [114]:
#Remove any fields that have the same value in all rows
UniqueValueCounts = NCCCData.nunique(dropna=False)
SingleValueCols = UniqueValueCounts[UniqueValueCounts == 1].index
NCCCData = NCCCData.drop(SingleValueCols, axis=1)

#Review dataset contents after drops
print('*********After: Removing columns with the same value in every row.*******************')
NCCCData.info(verbose=False)
print('\r\nColumns Deleted: ', len(SingleValueCols))

*********After: Removing columns with the same value in every row.*******************
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59 entries, 0 to 58
Columns: 657 entries, County to $200,000 or more Number of Returns
dtypes: float64(587), int64(68), object(2)
memory usage: 302.9+ KB

Columns Deleted:  42


In [115]:
#Remove any fields that have unique values in every row
NCCCDataRecordCt = NCCCData.shape[0]
UniqueValueCounts = NCCCData.apply(pd.Series.nunique)
AllUniqueValueCols = UniqueValueCounts[UniqueValueCounts == NCCCDataRecordCt].index
NCCCData = NCCCData.drop(AllUniqueValueCols, axis=1)

#Review dataset contents after drops
print('*********After: Removing columns with unique values in every row.*******************')
NCCCData.info(verbose=False)
print('\r\nColumns Deleted: ', len(AllUniqueValueCols))

*********After: Removing columns with unique values in every row.*******************
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59 entries, 0 to 58
Columns: 632 entries, County to $200,000 or more Number of Returns
dtypes: float64(587), int64(44), object(1)
memory usage: 291.4+ KB

Columns Deleted:  25


In [116]:
#Remove any empty fields (null values in every row)
NCCCDataRecordCt = NCCCData.shape[0]
NullValueCounts = NCCCData.isnull().sum()
NullValueCols = NullValueCounts[NullValueCounts == NCCCDataRecordCt].index
NCCCData = NCCCData.drop(NullValueCols, axis=1)

#Review dataset contents after empty field drops
print('*********After: Removing columns with null / blank values in every row.*************')
NCCCData.info(verbose=False)
print('\r\nColumns Deleted: ', len(NullValueCols))

*********After: Removing columns with null / blank values in every row.*************
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59 entries, 0 to 58
Columns: 632 entries, County to $200,000 or more Number of Returns
dtypes: float64(587), int64(44), object(1)
memory usage: 291.4+ KB

Columns Deleted:  0
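
The three checks in this section can be illustrated on a hypothetical toy dataframe (all names and values below are assumptions made for the example). The sketch also suggests one possible reason the last step deleted 0 columns: with `dropna=False`, an all-missing column counts as having a single value, so the first check would already catch it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one column of each problematic kind
toy = pd.DataFrame({
    "state": ["NC", "NC", "NC", "NC"],           # same value in every row
    "college_id": [101, 102, 103, 104],          # unique value in every row
    "empty": [np.nan] * 4,                       # all values missing
    "region": ["East", "West", "East", "West"],  # informative column
})
n_rows = toy.shape[0]

# Same value in every row (NaN counts as a value because dropna=False)
counts_with_nan = toy.nunique(dropna=False)
constant_cols = counts_with_nan[counts_with_nan == 1].index

# A different value in every row
counts = toy.nunique()
all_unique_cols = counts[counts == n_rows].index

# Missing in every row
null_counts = toy.isnull().sum()
empty_cols = null_counts[null_counts == n_rows].index

kept = toy.drop(columns=constant_cols.union(all_unique_cols).union(empty_cols))
print(list(kept.columns))
```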


### Handle Other Missing Values Types
- Here we eliminate any nominal and continuous columns in which at least the fraction of values specified by the missingThreshold parameter is missing.
- No values are filled in at this step; the remaining missing values were already imputed with the column mean above.

In [117]:
#Isolate continuous and categorical data types
#These are subsets of the NCCCData dataframe, selected by dtype, and can be used like any other dataframe
CCD_boolean = NCCCData.loc[:, (NCCCData.dtypes == bool) ]
CCD_nominal = NCCCData.loc[:, (NCCCData.dtypes == object)]
CCD_continuous = NCCCData.loc[:, (NCCCData.dtypes != bool) & (NCCCData.dtypes != object)]
print("Boolean Columns: ", CCD_boolean.shape[1])
print("Nominal Columns: ", CCD_nominal.shape[1])
print("Continuous Columns: ", CCD_continuous.shape[1])
print("Columns Accounted for: ", CCD_nominal.shape[1] + CCD_continuous.shape[1] + CCD_boolean.shape[1])

Boolean Columns:  0
Nominal Columns:  1
Continuous Columns:  631
Columns Accounted for:  632


In [118]:
#Eliminate nominal columns where at least the missingThreshold fraction of values is missing
NCCCDataRecordCt = CCD_nominal.shape[0]
missingValueLimit = NCCCDataRecordCt * missingThreshold
NullValueCounts = CCD_nominal.isnull().sum()
NullValueCols = NullValueCounts[NullValueCounts >= missingValueLimit].index
NCCCData = NCCCData.drop(NullValueCols, axis=1)

#Review dataset contents after empty field drops
print('*********After: Removing columns with >= missingThreshold % of missing values******')
NCCCData.info(verbose=False)
print('\r\nColumns Deleted: ', len(NullValueCols))

*********After: Removing columns with >= missingThreshold % of missing values******
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59 entries, 0 to 58
Columns: 632 entries, County to $200,000 or more Number of Returns
dtypes: float64(587), int64(44), object(1)
memory usage: 291.4+ KB

Columns Deleted:  0


In [119]:
#Eliminate continuous columns where at least the missingThreshold fraction of values is missing
NCCCDataRecordCt = CCD_continuous.shape[0]
missingValueLimit = NCCCDataRecordCt * missingThreshold
NullValueCounts = CCD_continuous.isnull().sum()
NullValueCols = NullValueCounts[NullValueCounts >= missingValueLimit].index
NCCCData = NCCCData.drop(NullValueCols, axis=1)

#Review dataset contents after empty field drops
print('*********After: Removing columns with >= missingThreshold % of missing values******')
NCCCData.info(verbose=False)
print('\r\nColumns Deleted: ', len(NullValueCols))

*********After: Removing columns with >= missingThreshold % of missing values******
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59 entries, 0 to 58
Columns: 632 entries, County to $200,000 or more Number of Returns
dtypes: float64(587), int64(44), object(1)
memory usage: 291.4+ KB

Columns Deleted:  0
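
Because both cells deleted 0 columns on this dataset, the effect of the threshold is easier to see on a hypothetical toy dataframe. This is a sketch under assumed data; it compares the fraction of missing values per column against the threshold instead of comparing counts, which is equivalent.

```python
import numpy as np
import pandas as pd

missing_threshold = 0.60  # same value as missingThreshold in the notebook

# Hypothetical toy data with different amounts of missing values
toy = pd.DataFrame({
    "mostly_missing": [1.0, np.nan, np.nan, np.nan, np.nan],  # 80% missing
    "at_limit": [1.0, 2.0, np.nan, np.nan, np.nan],           # exactly 60% missing
    "mostly_present": [1.0, 2.0, 3.0, 4.0, np.nan],           # 20% missing
})

# Fraction of missing values per column
missing_fraction = toy.isnull().mean()

# Columns at or above the threshold are removed (>=, as in the cells above)
to_remove = missing_fraction[missing_fraction >= missing_threshold].index
filtered = toy.drop(columns=to_remove)
print(list(filtered.columns))
```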


### Categorical Variables
Categorical columns with uniqueThreshold or more unique values are deleted, because each unique value becomes a new column during one-hot encoding. The remaining categorical variables are one-hot encoded.

In [120]:
#Delete categorical columns with >= uniqueThreshold unique values (each unique value becomes a column during one-hot encoding)
oneHotUniqueValueCounts = NCCCData[CCD_nominal.columns].apply(lambda x: x.nunique())
oneHotUniqueValueCols = oneHotUniqueValueCounts[oneHotUniqueValueCounts >= uniqueThreshold].index
NCCCData.drop(oneHotUniqueValueCols, axis=1, inplace=True) 

#Review dataset contents after dropping categorical columns with too many unique values
print('*********After: Removing columns with >= uniqueThreshold unique values***********')
NCCCData.info(verbose=False)
print('\r\nColumns Deleted: ', len(oneHotUniqueValueCols))

*********After: Removing columns with >= uniqueThreshold unique values***********
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59 entries, 0 to 58
Columns: 631 entries, AdvESL_MeasureableSkills_Participant_POP_MSG to $200,000 or more Number of Returns
dtypes: float64(587), int64(44)
memory usage: 290.9 KB

Columns Deleted:  1


In [121]:
#Isolate remaining categorical variables
begColumnCt = len(NCCCData.columns)
CCD_nominal = NCCCData.loc[:, (NCCCData.dtypes == object)]

#one hot encode categorical variables
if (len(CCD_nominal.columns) != 0):
    NCCCData = pd.get_dummies(data=NCCCData, 
                           columns=CCD_nominal.columns, drop_first=True)
    #Determine change in column count
    endColumnCt = len(NCCCData.columns)
    columnsAdded = endColumnCt - begColumnCt

    #Review dataset contents after one-hot encoding
    print('Columns To One-Hot Encode: ', len(CCD_nominal.columns))
    print('\r\n*********After: Adding New Columns Via One-Hot Encoding*************************')
    NCCCData.info(verbose=False)
    print('\r\nNew Columns Created Via One-Hot Encoding: ', columnsAdded)
else:
    print("No categorical variables to one-hot encode.")

No categorical variables to one-hot encode.
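
Since no categorical columns remained to encode here, the following sketch shows what the two cells above would do on a hypothetical toy dataframe. The column names, values, and the small `unique_threshold` are assumptions chosen so that one column is dropped and one is encoded.

```python
import pandas as pd

unique_threshold = 3  # hypothetical small value for the toy data; the notebook uses 25

# Hypothetical toy data with two categorical columns
toy = pd.DataFrame({
    "county": ["Alamance", "Beaufort", "Bladen", "Henderson"],  # 4 unique values
    "size": ["small", "large", "small", "large"],               # 2 unique values
    "enrollment": [1200, 5400, 900, 4800],
})

# Treat every non-numeric column as nominal
nominal_cols = toy.select_dtypes(exclude="number").columns

# Drop nominal columns with too many unique values
unique_counts = toy[nominal_cols].nunique()
too_many = unique_counts[unique_counts >= unique_threshold].index
toy = toy.drop(columns=too_many)

# One-hot encode what is left; drop_first removes one redundant level per column
remaining = [c for c in nominal_cols if c not in too_many]
encoded = pd.get_dummies(toy, columns=remaining, drop_first=True)
print(encoded)
```

With `drop_first=True`, a column with k levels produces k - 1 indicator columns, which avoids a perfectly collinear set of indicators.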


### Identify and Remove Highly Correlated Features
Find pairs of columns / features with a correlation above 0.95 or below -0.95, and remove one column from each pair.

In [122]:
from IPython.display import display as d

pd.set_option("display.max_rows",None)

#set the threshold above which a pair of variables counts as highly correlated
correlation_value=.95

#find the correlations of all variable pairs and sort it
#then take only those with correlation greater than value entered
so=pd.DataFrame(NCCCData.corr().unstack().sort_values(kind="quicksort",ascending=False))
so.columns=["Correlation"]
pos=so[(so['Correlation']<1)&(so['Correlation']>correlation_value)].drop_duplicates()
neg=so[(so['Correlation']>-1)&(so['Correlation']<(-1*correlation_value))].drop_duplicates()
#print(NCCCData.shape)

    
#create lists of variables to remove (the second variable of each correlated pair)
pos_list=[]
neg_list=[]

#collect the second variable of each pair for the pos and neg lists
if(len(pos)>0):
    pos['correlated_vars']=pos.index
    for i in range(len(pos)):
        pos_list.append(pos['correlated_vars'].iloc[i][1])
    
if(len(neg)>0):    
    neg['correlated_vars']=neg.index
    for i in range(len(neg)):
        neg_list.append(neg['correlated_vars'].iloc[i][1])

#remove duplicates from the lists
pos_list=set(pos_list)
neg_list=set(neg_list)

#print out results
if len(neg_list)>0:
    print('Variables with negative correlation below threshold')
    print(len(neg_list))
    #d(neg)
else:    
    print('No variables with negative correlation below threshold')
    
print("")

if len(pos_list)>0:
    print('Variables with positive correlation above threshold')
    print(len(pos_list))
    #d(pos)
else:
    print('No variables with positive correlation above threshold')
    
#create a list of variables that we want to drop from the data set
if len(neg_list)>0 and len(pos_list)>0:
    to_drop=pos_list.union(neg_list)
elif(len(neg_list)>0):
    to_drop=neg_list
else:
    to_drop=pos_list


Variables with negative correlation below threshold
2

Variables with positive correlation above threshold
306


In [123]:
#drop the variables in the to_drop list from the data set
NCCCData = NCCCData.drop(to_drop, axis=1)
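
A common alternative way to build such a drop list is to look only at the upper triangle of the absolute correlation matrix. The sketch below uses hypothetical synthetic data and is not the method used in this notebook; it may select different columns. In particular, it also flags perfectly correlated columns, whereas the filter above excludes correlations of exactly 1 or -1.

```python
import numpy as np
import pandas as pd

correlation_value = 0.95

# Hypothetical synthetic data: two near-copies of x and one independent column
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "x": x,
    "x_scaled": 2 * x + rng.normal(scale=0.01, size=200),  # strongly positive
    "x_negated": -x + rng.normal(scale=0.01, size=200),    # strongly negative
    "noise": rng.normal(size=200),                         # unrelated to x
})

# Absolute correlations, keeping only entries above the diagonal
corr = toy.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop a column if it is highly correlated with any earlier column
to_drop = [col for col in upper.columns if (upper[col] > correlation_value).any()]
reduced = toy.drop(columns=to_drop)
print(to_drop)
```

Taking the absolute value handles positive and negative correlations in one pass, and scanning the upper triangle keeps the first column of each correlated group.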

In [124]:
#Check columns after drop 
print('\r\n*********After: Dropping Highly Correlated Fields**************************************')
NCCCData.info(verbose=False)

#Save the final dataset to a .csv file
NCCCData.to_csv('NCCCData' + '_ML.csv', sep=',', index=False)


*********After: Dropping Highly Correlated Fields**************************************
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59 entries, 0 to 58
Columns: 323 entries, AdvESL_MeasureableSkills_Participant_POP_MSG to Number of 
farm returns
dtypes: float64(298), int64(25)
memory usage: 149.0 KB


In [125]:
print('*********FINAL DATASET DETAILS*********************************************************\r\n')
NCCCData.info(verbose=True)

*********FINAL DATASET DETAILS*********************************************************

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59 entries, 0 to 58
Data columns (total 323 columns):
AdvESL_MeasureableSkills_Participant_POP_MSG                                                                float64
AdvESL_MeasureableSkills_ParticipServed                                                                     float64
AdvESL_MeasureableSkills_AHSGrad                                                                            float64
AdvESL_MeasureableSkills_HSE                                                                                float64
AdvESL_MeasureableSkills_Postsecondary
Enrollment                                                           float64
AdvESL_MeasureableSkills_MSG                                                                                float64
Beg_ESL_PCTProgress                                                                                         fl