# Create Community College Machine Learning Dataset

This notebook takes the merged data file located [here](https://github.com/BrownRegaSterlingHeinen/PostsecondaryAttainment/blob/master/RawDatasets/mergedData.xlsx) as input. Much of the code is adapted from Dr. Jake Drew's repository for North Carolina public school data, which can be viewed [here](https://github.com/jakemdrew/EducationDataNC/tree/master/2017/Machine%20Learning%20Datasets). The notebook produces a dataset that is preprocessed for machine learning by applying the following transformations:

- Columns that have the same value in every single row are deleted.
- Columns that have a unique value in every single row (all values are different) are deleted.
- Empty columns (all values are NA or NULL) are deleted.
- Numeric columns in which the share of missing values reaches the missingThreshold parameter are deleted.
- Remaining numeric columns with missing values are imputed / populated with the mean of the column.
- Categorical / text based columns with at least uniqueThreshold unique values are deleted.
- Duplicated or highly similar columns with > 95% correlation are deleted.

In [3]:
import pandas as pd
import numpy as np

# Load in merged community college data
url = 'https://github.com/BrownRegaSterlingHeinen/PostsecondaryAttainment/blob/master/2016/Raw%20Datasets/mergedData.xlsx?raw=true'
NCCCData = pd.read_excel(url)

In [4]:
# View info for data
NCCCData.info()
NCCCData.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75 entries, 0 to 74
Columns: 2904 entries, County to $200,000 or more Number of Returns
dtypes: float64(2570), int64(19), object(315)
memory usage: 1.7+ MB


| | County | College Name | AdvESL_MeasureableSkills_Participant_POP_MSG | AdvESL_MeasureableSkills_IndividualsServed | AdvESL_MeasureableSkills_Participants12+Hours | AdvESL_MeasureableSkills_ParticipServed | AdvESL_MeasureableSkills_POPs | AdvESL_MeasureableSkills_AHSGrad | AdvESL_MeasureableSkills_HSE | AdvESL_MeasureableSkills_Postsecondary Enrollment | ... | Tax due at time of filing [11]_Number of returns | Tax due at time of filing [11]_Amount | Overpayments refunded [12]_Number of returns | Overpayments refunded [12]_Amount | $1 under $25,000 Number of Returns | $25,000 under $50,000 Number of Returns | $50,000 under $75,000 Number of Returns | $75,000 under $100,000 Number of Returns | $100,000 under $200,000 Number of Returns | $200,000 or more Number of Returns |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Alamance | Alamance Community College | 33.30% | 12.0 | 11.0 | 91.70% | 12.0 | 0.00% | 0.00% | 8.30% | ... | 13290.0 | 48075.0 | 58020.0 | 150599.0 | 28880.0 | 19790.0 | 10120.0 | 6220.0 | 7190.0 | 1500.0 |
| 1 | Beaufort | Beaufort County Community College | 25.00% | 9.0 | 8.0 | 88.90% | 8.0 | 0.00% | 0.00% | 0.00% | ... | 4170.0 | 15632.0 | 15750.0 | 42501.0 | 9010.0 | 5100.0 | 2610.0 | 1720.0 | 2080.0 | 380.0 |
| 2 | Bladen | Bladen Community College | * | 4.0 | 4.0 | 100.00% | 4.0 | * | * | * | ... | 1800.0 | 6079.0 | 8980.0 | 25248.0 | 4950.0 | 3210.0 | 1330.0 | 760.0 | 750.0 | 100.0 |
| 3 | Henderson | Blue Ridge Community College | 5.90% | 42.0 | 34.0 | 81.00% | 34.0 | 0.00% | 0.00% | 0.00% | ... | 11420.0 | 47803.0 | 37170.0 | 91382.0 | 19140.0 | 12730.0 | 7510.0 | 5050.0 | 6080.0 | 1550.0 |
| 4 | Brunswick | Brunswick Community College | * | 6.0 | 4.0 | 66.70% | 4.0 | * | * | * | ... | 13910.0 | 58804.0 | 40890.0 | 108478.0 | 20350.0 | 12870.0 | 7970.0 | 6050.0 | 8660.0 | 2170.0 |


In [5]:
# Missing Data Threshold (Per Column)
missingThreshold = 0.60

# Unique Value Threshold (Per Column)
# Delete columns with >= uniqueThreshold unique values prior to one-hot encoding.
# (each unique value becomes a new column during one-hot encoding)
uniqueThreshold = 25

## Prepare Consolidated Dataset for Machine Learning
Below we perform operations on the entire dataset to remove columns and update row values that could cause problems during machine learning.

The info() output above shows that many columns meant to hold floats are stored as objects, because their values contain percent signs, thousands separators, or placeholder characters such as "*" and "-". We loop through every column and convert these values to numbers.

In [6]:
for column in range(len(NCCCData.columns)): # loop through every column in the data set
    if NCCCData.iloc[:,column].isnull().sum() == 0: # columns with no missing values are reported as unchanged
        print("Column ", column , "stayed the same")
        # Even so, scan each cell: convert '%' strings to fractions and replace '-' placeholders with missing values
        for value in range(len(NCCCData)):
            if '%' in str(NCCCData.iloc[value,column]):
                NCCCData.iloc[value,column] = float(NCCCData.iloc[value,column].strip("%"))/100
                NCCCData.iloc[:,column] = pd.to_numeric(NCCCData.iloc[:,column], errors='ignore')
            elif NCCCData.iloc[value,column] == "-":
                NCCCData.iloc[value, column] = None
                NCCCData.iloc[:,column] = pd.to_numeric(NCCCData.iloc[:,column], errors='ignore')
    elif NCCCData.iloc[:,column].dtype == "object": # object columns with missing values hold numbers stored as text
        for value in range(len(NCCCData)):
            if pd.isnull(NCCCData.iloc[value,column]):
                continue
            elif "%" in NCCCData.iloc[value,column]: # percentages, e.g. "33.30%"
                if ' ' in NCCCData.iloc[value,column]:
                    # keep only the text before the first space
                    NCCCData.iloc[value,column] = NCCCData.iloc[value,column].split(' ', 1)[0]
                NCCCData.iloc[value,column] = float(NCCCData.iloc[value,column].strip("%"))/100
            elif "," in NCCCData.iloc[value, column]: # thousands separators
                NCCCData.iloc[value, column] = float(NCCCData.iloc[value, column].replace(",", ""))
            elif NCCCData.iloc[value,column] == "*": # suppressed values become missing
                NCCCData.iloc[value, column] = None
            else:
                continue
        print("Column ", column, "was changed to type numeric")
        NCCCData.iloc[:,column] = pd.to_numeric(NCCCData.iloc[:,column])
    else:
        print("Column", column, "stayed the same")

Column  0 stayed the same
Column  1 stayed the same
Column  2 was changed to type numeric
Column 3 stayed the same
Column 4 stayed the same
Column  5 was changed to type numeric
Column 6 stayed the same
Column  7 was changed to type numeric
Column  8 was changed to type numeric
Column  9 was changed to type numeric
Column  10 was changed to type numeric
Column  11 was changed to type numeric
Column  12 was changed to type numeric
Column 13 stayed the same
Column  14 was changed to type numeric
Column 15 stayed the same
Column  16 was changed to type numeric
Column 17 stayed the same
Column  18 was changed to type numeric
Column 19 stayed the same
Column  20 was changed to type numeric
Column 21 stayed the same
Column  22 was changed to type numeric
Column 23 stayed the same
Column  24 was changed to type numeric
Column  25 was changed to type numeric
Column  26 was changed to type numeric
Column  27 was changed to type numeric
Column 28 stayed the same
Column  29 was changed to type numeric
...

Column  235 was changed to type numeric
Column  236 was changed to type numeric
Column  237 was changed to type numeric
Column 238 stayed the same
Column  239 was changed to type numeric
Column  240 was changed to type numeric
Column  241 was changed to type numeric
Column  242 was changed to type numeric
Column 243 stayed the same
Column  244 was changed to type numeric
Column  245 was changed to type numeric
Column  246 was changed to type numeric
Column  247 was changed to type numeric
Column 248 stayed the same
Column  249 was changed to type numeric
Column  250 was changed to type numeric
Column  251 was changed to type numeric
Column  252 was changed to type numeric
Column 253 stayed the same
Column  254 was changed to type numeric
Column  255 was changed to type numeric
Column  256 was changed to type numeric
Column  257 was changed to type numeric
Column 258 stayed the same
Column  259 was changed to type numeric
Column  260 was changed to type numeric
Column  261 was changed to type numeric
...

Column 687 stayed the same
Column 688 stayed the same
Column 689 stayed the same
Column 690 stayed the same
Column 691 stayed the same
Column 692 stayed the same
Column 693 stayed the same
Column 694 stayed the same
Column 695 stayed the same
Column 696 stayed the same
Column 697 stayed the same
Column 698 stayed the same
Column 699 stayed the same
Column 700 stayed the same
Column 701 stayed the same
Column 702 stayed the same
Column 703 stayed the same
Column 704 stayed the same
Column 705 stayed the same
Column 706 stayed the same
Column 707 stayed the same
Column 708 stayed the same
Column 709 stayed the same
Column 710 stayed the same
Column 711 stayed the same
Column 712 stayed the same
Column 713 stayed the same
Column 714 stayed the same
Column 715 stayed the same
Column 716 stayed the same
Column 717 stayed the same
Column 718 stayed the same
Column 719 stayed the same
Column 720 stayed the same
Column 721 stayed the same
Column 722 stayed the same
Column 723 stayed the same
...

Column 1098 stayed the same
Column 1099 stayed the same
Column 1100 stayed the same
Column 1101 stayed the same
Column 1102 stayed the same
Column 1103 stayed the same
Column 1104 stayed the same
Column 1105 stayed the same
Column 1106 stayed the same
Column 1107 stayed the same
Column 1108 stayed the same
Column 1109 stayed the same
Column 1110 stayed the same
Column 1111 stayed the same
Column 1112 stayed the same
Column 1113 stayed the same
Column 1114 stayed the same
Column 1115 stayed the same
Column 1116 stayed the same
Column 1117 stayed the same
Column 1118 stayed the same
Column 1119 stayed the same
Column 1120 stayed the same
Column 1121 stayed the same
Column 1122 stayed the same
Column 1123 stayed the same
Column 1124 stayed the same
Column 1125 stayed the same
Column 1126 stayed the same
Column 1127 stayed the same
Column 1128 stayed the same
Column 1129 stayed the same
Column 1130 stayed the same
Column 1131 stayed the same
Column 1132 stayed the same
Column 1133 stayed the same
...

Column 1636 stayed the same
Column 1637 stayed the same
Column 1638 stayed the same
Column 1639 stayed the same
Column 1640 stayed the same
Column 1641 stayed the same
Column 1642 stayed the same
Column 1643 stayed the same
Column 1644 stayed the same
Column 1645 stayed the same
Column 1646 stayed the same
Column 1647 stayed the same
Column 1648 stayed the same
Column 1649 stayed the same
Column 1650 stayed the same
Column 1651 stayed the same
Column 1652 stayed the same
Column 1653 stayed the same
Column 1654 stayed the same
Column 1655 stayed the same
Column 1656 stayed the same
Column 1657 stayed the same
Column 1658 stayed the same
Column 1659 stayed the same
Column 1660 stayed the same
Column 1661 stayed the same
Column 1662 stayed the same
Column 1663 stayed the same
Column 1664 stayed the same
Column 1665 stayed the same
Column 1666 stayed the same
Column 1667 stayed the same
Column 1668 stayed the same
Column 1669 stayed the same
Column 1670 stayed the same
Column 1671 stayed the same
...

Column 2026 stayed the same
Column 2027 stayed the same
Column 2028 stayed the same
Column 2029 stayed the same
Column 2030 stayed the same
Column 2031 stayed the same
Column 2032 stayed the same
Column 2033 stayed the same
Column 2034 stayed the same
Column 2035 stayed the same
Column 2036 stayed the same
Column 2037 stayed the same
Column 2038 stayed the same
Column 2039 stayed the same
Column 2040 stayed the same
Column 2041 stayed the same
Column 2042 stayed the same
Column 2043 stayed the same
Column 2044 stayed the same
Column 2045 stayed the same
Column 2046 stayed the same
Column 2047 stayed the same
Column 2048 stayed the same
Column 2049 stayed the same
Column 2050 stayed the same
Column 2051 stayed the same
Column 2052 stayed the same
Column 2053 stayed the same
Column 2054 stayed the same
Column 2055 stayed the same
Column 2056 stayed the same
Column 2057 stayed the same
Column 2058 stayed the same
Column 2059 stayed the same
Column 2060 stayed the same
Column 2061 stayed the same
...

Column 2431 stayed the same
Column 2432 stayed the same
Column 2433 stayed the same
Column 2434 stayed the same
Column 2435 stayed the same
Column 2436 stayed the same
Column 2437 stayed the same
Column 2438 stayed the same
Column 2439 stayed the same
Column 2440 stayed the same
Column 2441 stayed the same
Column 2442 stayed the same
Column 2443 stayed the same
Column 2444 stayed the same
Column 2445 stayed the same
Column 2446 stayed the same
Column 2447 stayed the same
Column 2448 stayed the same
Column 2449 stayed the same
Column 2450 stayed the same
Column 2451 stayed the same
Column 2452 stayed the same
Column 2453 stayed the same
Column 2454 stayed the same
Column 2455 stayed the same
Column 2456 stayed the same
Column 2457 stayed the same
Column 2458 stayed the same
Column 2459 stayed the same
Column 2460 stayed the same
Column 2461 stayed the same
Column 2462 stayed the same
Column 2463 stayed the same
Column 2464 stayed the same
Column 2465 stayed the same
Column 2466 stayed the same
...

Column 2848 stayed the same
Column 2849 stayed the same
Column 2850 stayed the same
Column 2851 stayed the same
Column 2852 stayed the same
Column 2853 stayed the same
Column 2854 stayed the same
Column 2855 stayed the same
Column 2856 stayed the same
Column 2857 stayed the same
Column 2858 stayed the same
Column 2859 stayed the same
Column 2860 stayed the same
Column 2861 stayed the same
Column 2862 stayed the same
Column 2863 stayed the same
Column 2864 stayed the same
Column 2865 stayed the same
Column 2866 stayed the same
Column 2867 stayed the same
Column 2868 stayed the same
Column 2869 stayed the same
Column 2870 stayed the same
Column 2871 stayed the same
Column 2872 stayed the same
Column 2873 stayed the same
Column 2874 stayed the same
Column 2875 stayed the same
Column 2876 stayed the same
Column 2877 stayed the same
Column 2878 stayed the same
Column 2879 stayed the same
Column 2880 stayed the same
Column 2881 stayed the same
Column 2882 stayed the same
Column 2883 stayed the same
...
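
The cell-by-cell conversion above can also be expressed column-wise. The following is a minimal sketch rather than the notebook's method: `clean_numeric_text` is a hypothetical helper, the sample values are made up, and it assumes that every entry that cannot be parsed as a number (such as "*" or "-") should become a missing value.

```python
import pandas as pd


def clean_numeric_text(series: pd.Series) -> pd.Series:
    """Hypothetical helper: turn text such as '33.30%' or '1,234' into floats."""
    text = series.astype(str).str.strip()
    # Remember which cells were percentages so they can be rescaled afterwards.
    is_percent = text.str.contains("%", regex=False)
    # Remove percent signs and thousands separators before parsing.
    stripped = text.str.replace("%", "", regex=False).str.replace(",", "", regex=False)
    # Anything that still is not a number ('*', '-', 'nan', 'None') becomes NaN.
    numbers = pd.to_numeric(stripped, errors="coerce")
    # Store percentages as fractions, as the loop above does.
    return numbers.where(~is_percent, numbers / 100)


raw = pd.Series(["33.30%", "1,234", "*", "-", None, "12"])  # made-up sample
cleaned = clean_numeric_text(raw)
print(cleaned)
```

A helper like this would only be applied to columns that are meant to be numeric; identifier columns such as County or College Name would otherwise be turned into missing values.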

### Impute any Remaining Missing Values with Mean Value
Missing values will be imputed with the mean of the column.

In [7]:
#Print out all the missing value rows
pd.set_option('display.max_rows', 1000)

print('\r\n*********The Remaining Missing Values Below will be Imputed with the Mean!*********')

#Check for Missing values 
missing_values = NCCCData.isnull().sum().reset_index()
missing_values.columns = ['Variable Name', 'Number Missing Values']
missing_values = missing_values[missing_values['Number Missing Values'] > 0] 
missing_values


*********The Remaining Missing Values Below will be Imputed with the Mean!*********


| | Variable Name | Number Missing Values |
|---|---|---|
| 2 | AdvESL_MeasureableSkills_Participant_POP_MSG | 31 |
| 3 | AdvESL_MeasureableSkills_IndividualsServed | 21 |
| 4 | AdvESL_MeasureableSkills_Participants12+Hours | 21 |
| 5 | AdvESL_MeasureableSkills_ParticipServed | 25 |
| 6 | AdvESL_MeasureableSkills_POPs | 21 |
| 7 | AdvESL_MeasureableSkills_AHSGrad | 31 |
| 8 | AdvESL_MeasureableSkills_HSE | 31 |
| 9 | AdvESL_MeasureableSkills_Postsecondary\rEnroll... | 31 |
| 10 | AdvESL_MeasureableSkills_AHSCredits | 31 |
| 11 | AdvESL_MeasureableSkills_Post-test | 31 |


In [8]:
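#Replace all remaining NaN with mean of column
#(numeric_only=True skips text columns, which recent pandas versions refuse to average)
NCCCData = NCCCData.fillna(NCCCData.mean(numeric_only=True))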

#Check for Missing values after final imputation 
missing_values = NCCCData.isnull().sum().reset_index()
missing_values.columns = ['Variable Name', 'Number Missing Values']
missing_values = missing_values[missing_values['Number Missing Values'] > 0] 
missing_values

| | Variable Name | Number Missing Values |
|---|---|---|
| 191 | NucMedTech_PCTPassing2015 | 75 |
| 225 | CosmeticInstructor_PCTPassing2016 | 75 |
| 281 | VetMedTech_PCTPassing2015 | 75 |


The remaining columns with missing values are completely empty, so these will be deleted.

In [9]:
NCCCData = NCCCData.dropna(axis='columns', how='all')
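
The two steps above (mean imputation, then dropping columns that are still entirely empty) can be illustrated on a small made-up frame. This is only a sketch: the `toy` data and its column names are assumptions for illustration and do not come from the merged dataset.

```python
import numpy as np
import pandas as pd

# Made-up data: one text column, one partly missing column, one empty column.
toy = pd.DataFrame({
    "college": ["A", "B", "C"],
    "pass_rate": [0.8, np.nan, 0.6],
    "empty_metric": [np.nan, np.nan, np.nan],
})

# Fill numeric gaps with each column's mean; text columns are left untouched.
imputed = toy.fillna(toy.mean(numeric_only=True))

# A column with no observed values has a NaN mean, so it is still empty
# after imputation and is removed here.
imputed = imputed.dropna(axis="columns", how="all")
print(imputed)
```

This also shows why a few columns still report missing values after imputation: a column without any observed value has no mean to impute with.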

### Remove Columns with Problematic Data
Here we remove entire columns that could cause problems during machine learning. The following operations are performed:

- Remove any columns that have the same value in every single row.
- Remove any columns that have a unique value in every single row (all values are different).
- Remove empty columns (all values are NA or NULL).

In [10]:
#Remove any fields that have the same value in all rows
UniqueValueCounts = NCCCData.nunique(dropna=False)
SingleValueCols = UniqueValueCounts[UniqueValueCounts == 1].index
NCCCData = NCCCData.drop(SingleValueCols, axis=1)

#Review dataset contents after drops
print('*********After: Removing columns with the same value in every row.*******************')
NCCCData.info(verbose=False)
print('\r\nColumns Deleted: ', len(SingleValueCols))

*********After: Removing columns with the same value in every row.*******************
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75 entries, 0 to 74
Columns: 2593 entries, County to $200,000 or more Number of Returns
dtypes: float64(2572), int64(19), object(2)
memory usage: 1.5+ MB

Columns Deleted:  308


In [11]:
#Remove any fields that have unique values in every row
NCCCDataRecordCt = NCCCData.shape[0]
UniqueValueCounts = NCCCData.apply(pd.Series.nunique)
AllUniqueValueCols = UniqueValueCounts[UniqueValueCounts == NCCCDataRecordCt].index
NCCCData = NCCCData.drop(AllUniqueValueCols, axis=1)

#Review dataset contents after drops
print('*********After: Removing columns with unique values in every row.*******************')
NCCCData.info(verbose=False)
print('\r\nColumns Deleted: ', len(AllUniqueValueCols))

*********After: Removing columns with unique values in every row.*******************
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75 entries, 0 to 74
Columns: 2586 entries, County to $200,000 or more Number of Returns
dtypes: float64(2572), int64(13), object(1)
memory usage: 1.5+ MB

Columns Deleted:  7


In [12]:
#Remove any empty fields (null values in every row)
NCCCDataRecordCt = NCCCData.shape[0]
NullValueCounts = NCCCData.isnull().sum()
NullValueCols = NullValueCounts[NullValueCounts == NCCCDataRecordCt].index
NCCCData = NCCCData.drop(NullValueCols, axis=1)

#Review dataset contents after empty field drops
print('*********After: Removing columns with null / blank values in every row.*************')
NCCCData.info(verbose=False)
print('\r\nColumns Deleted: ', len(NullValueCols))

*********After: Removing columns with null / blank values in every row.*************
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75 entries, 0 to 74
Columns: 2586 entries, County to $200,000 or more Number of Returns
dtypes: float64(2572), int64(13), object(1)
memory usage: 1.5+ MB

Columns Deleted:  0
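
The column checks in this section can be tried on a small made-up frame. The sketch below assumes hypothetical column names and values; it mirrors the `nunique` based checks used above rather than adding anything new.

```python
import numpy as np
import pandas as pd

# Made-up data with one column of each problematic kind.
toy = pd.DataFrame({
    "state": ["NC", "NC", "NC", "NC"],           # same value in every row
    "college_id": [101, 102, 103, 104],          # different value in every row
    "blank": [np.nan, np.nan, np.nan, np.nan],   # empty
    "enrolled": [120, 95, 120, 80],              # informative
})

# With dropna=False an all-missing column counts as holding one value (NaN),
# so it is caught by the same-value check as well.
counts = toy.nunique(dropna=False)
constant_cols = counts[counts == 1].index
all_unique_cols = counts[counts == len(toy)].index

reduced = toy.drop(columns=constant_cols.union(all_unique_cols))
print(list(reduced.columns))
```

Because an empty column is already removed by the same-value check, a later pass that looks only for empty columns has nothing left to delete, which is consistent with the count of 0 reported above.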


### Handle Other Missing Value Types
- Here we eliminate any nominal and continuous columns in which the share of missing values reaches the missingThreshold parameter.
- Because the remaining missing values were already imputed with the column mean above, no columns are removed at this step.

In [13]:
#Isolate boolean, nominal, and continuous columns
#These are subsets of the NCCCData dataframe, selected by column dtype
CCD_boolean = NCCCData.loc[:, (NCCCData.dtypes == bool) ]
CCD_nominal = NCCCData.loc[:, (NCCCData.dtypes == object)]
CCD_continuous = NCCCData.loc[:, (NCCCData.dtypes != bool) & (NCCCData.dtypes != object)]
print("Boolean Columns: ", CCD_boolean.shape[1])
print("Nominal Columns: ", CCD_nominal.shape[1])
print("Continuous Columns: ", CCD_continuous.shape[1])
print("Columns Accounted for: ", CCD_nominal.shape[1] + CCD_continuous.shape[1] + CCD_boolean.shape[1])

Boolean Columns:  0
Nominal Columns:  1
Continuous Columns:  2585
Columns Accounted for:  2586


In [14]:
#Eliminate nominal columns in which at least missingThreshold of the values are missing
NCCCDataRecordCt = CCD_nominal.shape[0]
missingValueLimit = NCCCDataRecordCt * missingThreshold
NullValueCounts = CCD_nominal.isnull().sum()
NullValueCols = NullValueCounts[NullValueCounts >= missingValueLimit].index
NCCCData = NCCCData.drop(NullValueCols, axis=1)

#Review dataset contents after empty field drops
print('*********After: Removing columns with >= missingThreshold % of missing values******')
NCCCData.info(verbose=False)
print('\r\nColumns Deleted: ', len(NullValueCols))

*********After: Removing columns with >= missingThreshold % of missing values******
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75 entries, 0 to 74
Columns: 2586 entries, County to $200,000 or more Number of Returns
dtypes: float64(2572), int64(13), object(1)
memory usage: 1.5+ MB

Columns Deleted:  0


In [15]:
#Eliminate continuous columns in which at least missingThreshold of the values are missing
NCCCDataRecordCt = CCD_continuous.shape[0]
missingValueLimit = NCCCDataRecordCt * missingThreshold
NullValueCounts = CCD_continuous.isnull().sum()
NullValueCols = NullValueCounts[NullValueCounts >= missingValueLimit].index
NCCCData = NCCCData.drop(NullValueCols, axis=1)

#Review dataset contents after empty field drops
print('*********After: Removing columns with >= missingThreshold % of missing values******')
NCCCData.info(verbose=False)
print('\r\nColumns Deleted: ', len(NullValueCols))

*********After: Removing columns with >= missingThreshold % of missing values******
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75 entries, 0 to 74
Columns: 2586 entries, County to $200,000 or more Number of Returns
dtypes: float64(2572), int64(13), object(1)
memory usage: 1.5+ MB

Columns Deleted:  0
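
The threshold rule is easier to see on data that still contains gaps. The sketch below uses a made-up frame and a local `missing_threshold` variable that mirrors missingThreshold; it computes the share of missing values per column instead of comparing raw counts, which is an equivalent formulation.

```python
import numpy as np
import pandas as pd

missing_threshold = 0.60  # mirrors missingThreshold in the notebook

# Made-up data with different amounts of missing values.
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0, np.nan],  # 80% missing
    "some_missing": [1.0, np.nan, 3.0, 4.0, 5.0],             # 20% missing
    "complete": [1, 2, 3, 4, 5],                              # 0% missing
})

# Share of missing values in each column.
missing_share = toy.isnull().mean()
too_sparse = missing_share[missing_share >= missing_threshold].index

kept = toy.drop(columns=too_sparse)
print(list(kept.columns))
```

A check like this only removes columns when it runs before imputation; once every gap has been filled, no column can reach the threshold.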


### Categorical Variables
Any categorical column with at least uniqueThreshold unique values will be deleted. The remaining categorical variables will be one-hot encoded.

In [16]:
#Delete categorical columns with >= uniqueThreshold unique values (each unique value becomes a column during one-hot encoding)
oneHotUniqueValueCounts = NCCCData[CCD_nominal.columns].apply(lambda x: x.nunique())
oneHotUniqueValueCols = oneHotUniqueValueCounts[oneHotUniqueValueCounts >= uniqueThreshold].index
NCCCData.drop(oneHotUniqueValueCols, axis=1, inplace=True) 

#Review dataset contents after high unique value drops
print('*********After: Removing columns with >= uniqueThreshold unique values***********')
NCCCData.info(verbose=False)
print('\r\nColumns Deleted: ', len(oneHotUniqueValueCols))

*********After: Removing columns with >= uniqueThreshold unique values***********
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75 entries, 0 to 74
Columns: 2585 entries, AdvESL_MeasureableSkills_Participant_POP_MSG to $200,000 or more Number of Returns
dtypes: float64(2572), int64(13)
memory usage: 1.5 MB

Columns Deleted:  1


In [17]:
#Isolate remaining categorical variables
begColumnCt = len(NCCCData.columns)
CCD_nominal = NCCCData.loc[:, (NCCCData.dtypes == object)]

#one hot encode categorical variables
if (len(CCD_nominal.columns) != 0):
    NCCCData = pd.get_dummies(data=NCCCData, 
                           columns=CCD_nominal.columns, drop_first=True)
    #Determine change in column count
    endColumnCt = len(NCCCData.columns)
    columnsAdded = endColumnCt - begColumnCt

    #Review dataset contents after one-hot encoding
    print('Columns To One-Hot Encode: ', len(CCD_nominal.columns))
    print('\r\n*********After: Adding New Columns Via One-Hot Encoding*************************')
    NCCCData.info(verbose=False)
    print('\r\nNew Columns Created Via One-Hot Encoding: ', columnsAdded)
else:
    print("No categorical variables to one-hot encode.")

No categorical variables to one-hot encode.
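
Since no categorical columns survive in this dataset, the encoding branch above never runs. The sketch below shows what the two steps would do on a made-up frame; the column names, the values, and the small `unique_threshold` of 3 are assumptions chosen so that the toy data exercises both outcomes.

```python
import pandas as pd

unique_threshold = 3  # small value for the toy data; the notebook uses 25

# Made-up data: one high-cardinality and one low-cardinality text column.
toy = pd.DataFrame({
    "college": ["A", "B", "C", "D"],             # 4 unique values
    "region": ["East", "West", "East", "West"],  # 2 unique values
    "enrolled": [120, 95, 110, 80],
})

# Treat every non-numeric column as nominal.
nominal_cols = toy.select_dtypes(exclude="number").columns

# Drop nominal columns that would create too many one-hot columns.
cardinality = toy[nominal_cols].nunique()
high_card = cardinality[cardinality >= unique_threshold].index
toy = toy.drop(columns=high_card)

# One-hot encode what is left; drop_first avoids a redundant indicator.
remaining = [col for col in nominal_cols if col not in high_card]
encoded = pd.get_dummies(toy, columns=remaining, drop_first=True)
print(encoded)
```

With `drop_first=True`, a column with k categories yields k - 1 indicator columns, so the two-valued `region` column becomes the single column `region_West`.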


### Identify and Remove Highly Correlated Features
Find and remove any columns / features whose pairwise correlation exceeds 95% in absolute value (positive or negative).

In [18]:
from IPython.display import display as d

pd.set_option("display.max_rows",None)

#set the value you want to see variable pairs correlation greater than
correlation_value=.95

#find the correlations of all variable pairs and sort it
#then take only those with correlation greater than value entered
so=pd.DataFrame(NCCCData.corr().unstack().sort_values(kind="quicksort",ascending=False))
so.columns=["Correlation"]
pos=so[(so['Correlation']<1)&(so['Correlation']>correlation_value)].drop_duplicates()
neg=so[(so['Correlation']>-1)&(so['Correlation']<(-1*correlation_value))].drop_duplicates()

#print out results
if neg.shape[0]>0:
    print('Variables with negative correlation below threshold')
    print(len(neg))
    #d(neg)
else:    
    print('No variables with negative correlation below threshold')
    
print("")

if pos.shape[0]>0:
    print('Variables with positive correlation above threshold')
    print(len(pos))
    #d(pos)
else:
    print('No variables with positive correlation above threshold')
    
#create a list of variables to remove (just the left column from the correlations)
pos_list=[]
neg_list=[]

#create a list of the first variables in the pair for the pos and neg lists
if(len(pos)>0):
    pos['correlated_vars']=pos.index
    for i in range(len(pos)):
        pos_list.append(pos['correlated_vars'][i][1])
    
if(len(neg)>0):    
    neg['correlated_vars']=neg.index
    for i in range(len(neg)):
        neg_list.append(neg['correlated_vars'][i][1])

#remove duplicates from the lists
pos_list=set(pos_list)
neg_list=set(neg_list)

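#create a list of variables that we want to drop from the data set
#note: & binds tighter than >, so the first test below never succeeds as written;
#whenever neg_list is non-empty, only the variables in neg_list are dropped
if(len(neg_list)>0&len(pos_list)>0):
    to_drop=pd.concat(pos_list,neg_list)
elif(len(neg_list)>0):
    to_drop=neg_list
else:
    to_drop=pos_list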
    
#drop the variables in the to_drop list from the data set
NCCCData = NCCCData.drop(to_drop, axis=1)

Variables with negative correlation below threshold
2

Variables with positive correlation above threshold
96637
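
A common alternative way to remove highly correlated features is to scan only the upper triangle of the absolute correlation matrix, so that each pair is considered once and positive and negative correlations are handled together. The sketch below is an illustration under stated assumptions, not the method used in the cell above: `drop_correlated` is a hypothetical helper and the toy data is made up.

```python
import numpy as np
import pandas as pd


def drop_correlated(df: pd.DataFrame, threshold: float = 0.95):
    """Hypothetical helper: drop the later column of each highly correlated pair."""
    corr = df.corr().abs()
    # Keep only entries above the diagonal so each pair appears once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    # A column is dropped if it is highly correlated with any earlier column.
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop


# Made-up data: b tracks a closely, d mirrors a with the opposite sign.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.5],
    "c": [5.0, 3.0, 4.0, 1.0, 2.0],
    "d": [-1.0, -2.0, -3.0, -4.0, -5.2],
})

reduced, dropped = drop_correlated(toy)
print(dropped)
```

Unlike the filter above, which excludes correlations of exactly 1, this sketch also treats perfectly duplicated columns as redundant and keeps only the first of them.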


In [19]:
#Check columns after drop 
print('\r\n*********After: Dropping Highly Correlated Fields**************************************')
NCCCData.info(verbose=False)

#Save the final dataset to a .csv file
NCCCData.to_csv('NCCCData' + '_ML.csv', sep=',', index=False)


*********After: Dropping Highly Correlated Fields**************************************
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75 entries, 0 to 74
Columns: 2583 entries, AdvESL_MeasureableSkills_Participant_POP_MSG to $200,000 or more Number of Returns
dtypes: float64(2570), int64(13)
memory usage: 1.5 MB


In [20]:
print('*********FINAL DATASET DETAILS*********************************************************\r\n')
NCCCData.info(verbose=True)

*********FINAL DATASET DETAILS*********************************************************

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75 entries, 0 to 74
Data columns (total 2583 columns):
AdvESL_MeasureableSkills_Participant_POP_MSG                                                                  float64
AdvESL_MeasureableSkills_IndividualsServed                                                                    float64
AdvESL_MeasureableSkills_Participants12+Hours                                                                 float64
AdvESL_MeasureableSkills_ParticipServed                                                                       float64
AdvESL_MeasureableSkills_POPs                                                                                 float64
AdvESL_MeasureableSkills_AHSGrad                                                                              float64
AdvESL_MeasureableSkills_HSE                                                                     