# Create Community College Machine Learning Dataset

This notebook reads in the merged data file located [here](https://github.com/BrownRegaSterlingHeinen/PostsecondaryAttainment/blob/master/RawDatasets/mergedData.xlsx) as input data. A lot of the code used in this notebook is from Dr. Jake Drew's repository for North Carolina public school data which can be viewed [here](https://github.com/jakemdrew/EducationDataNC/tree/master/2017/Machine%20Learning%20Datasets). This creates a dataset that is preprocessed for Machine Learning by going through the following transformations:

- Columns that have the same value in every single row are deleted.
- Nominal columns that have a unique value in every single row (all values are different) are deleted.
- Empty columns (all values are NA or NULL) are deleted.
- Numeric columns with more than the percentage of missing values specified by the missingThreshold parameter are deleted.
- Remaining numeric columns with missing values are imputed / populated with the mean of the column.  
- Categorical / text based columns with > uniqueThreshold unique values are deleted.
- Duplicated or highly similar columns with > 95% correlation are delelted.

In [80]:
import pandas as pd
import numpy as np

# Load in merged community college data
url = 'https://github.com/BrownRegaSterlingHeinen/PostsecondaryAttainment/raw/master/2016/NCCC%20Datasets/mergedData_withHS.xlsx'
NCCCData = pd.read_excel(url)

In [81]:
# View info for data
NCCCData.info()
NCCCData.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59 entries, 0 to 58
Columns: 1950 entries, County to Lea_Name
dtypes: float64(1168), int64(493), object(289)
memory usage: 898.9+ KB


Unnamed: 0,County,College Name,AdvESL_MeasureableSkills_Participant_POP_MSG,AdvESL_MeasureableSkills_IndividualsServed,AdvESL_MeasureableSkills_Participants12+Hours,AdvESL_MeasureableSkills_ParticipServed,AdvESL_MeasureableSkills_POPs,AdvESL_MeasureableSkills_AHSGrad,AdvESL_MeasureableSkills_HSE,AdvESL_MeasureableSkills_Postsecondary Enrollment,...,SPG Grade_A,SPG Grade_A+NG,SPG Grade_B,SPG Grade_C,SPG Grade_D,SPG Grade_F,EVAAS Growth Status_Exceeded,EVAAS Growth Status_Met,EVAAS Growth Status_NotMet,Lea_Name
0,Alamance,Alamance Community College,33.30%,12.0,11.0,91.70%,12.0,0.00%,0.00%,8.30%,...,0.015657,0.0,0.0,0.984343,0.0,0.0,0.516816,0.307625,0.175558,Alamance-Burlington Schools
1,Beaufort,Beaufort County Community College,25.00%,9.0,8.0,88.90%,8.0,0.00%,0.00%,0.00%,...,0.109143,0.0,0.0,0.206798,0.684059,0.0,0.109143,0.0,0.890857,Beaufort County Schools
2,Bladen,Bladen Community College,*,4.0,4.0,100.00%,4.0,*,*,*,...,0.0,0.0,0.0,0.548252,0.451748,0.0,0.0,0.548252,0.451748,Bladen County Schools
3,Henderson,Blue Ridge Community College,5.90%,42.0,34.0,81.00%,34.0,0.00%,0.00%,0.00%,...,0.047871,0.0,0.952129,0.0,0.0,0.0,0.77551,0.22449,0.0,Henderson County Schools
4,Brunswick,Brunswick Community College,*,6.0,4.0,66.70%,4.0,*,*,*,...,0.0,0.082145,0.645008,0.272847,0.0,0.0,1.0,0.0,0.0,Brunswick County Schools


In [82]:
# Missing Data Threshold (Per Column)
missingThreshold = 0.60

# Unique Value Threshold (Per Column)
# Delete Columns >  uniqueThreshold unique values prior to one-hot encoding. 
# (each unique value becomes a new column during one-hot encoding)
uniqueThreshold = 25

## Prepare Consolidated Dataset for Machine Learning
Below we perform operations on the entire dataset to remove columns and update row values that could cause problems during machine learning.

We can see that there are many columns that are meant to be floats that are listed as objects. We need to iteratively go through all of the columns and change these values.

In [83]:
for column in range(len(NCCCData.columns)): #loops through every column in the data set
    if NCCCData.iloc[:,column].isnull().sum() == 0: #these are the string columns that do not need to be changed
        print("Column ", column , "stayed the same")
        # If a column contains '-', it is not reading correctly, the following code deals with those columns
        for value in range(len(NCCCData)):
            if '%' in str(NCCCData.iloc[value,column]):
                NCCCData.iloc[value,column] = float(NCCCData.iloc[value,column].strip("%"))/100
                NCCCData.iloc[:,column] = pd.to_numeric(NCCCData.iloc[:,column], errors='ignore')
            elif NCCCData.iloc[value,column] == "-":
                NCCCData.iloc[value, column] = None
                NCCCData.iloc[:,column] = pd.to_numeric(NCCCData.iloc[:,column], errors='ignore')
    elif NCCCData.iloc[:,column].dtype == "object": #if the column is of type object (percentages)
        for value in range(len(NCCCData)):
            if pd.isnull(NCCCData.iloc[value,column]):
                continue
            elif "%" in NCCCData.iloc[value,column]: # one of the issues is percentages
                if ' ' in NCCCData.iloc[value,column]:
                    NCCCData.iloc[value,column] = NCCCData.iloc[value,column].split((' ', 1)[0])[0]
                NCCCData.iloc[value,column] = float(NCCCData.iloc[value,column].strip("%"))/100
            elif "," in NCCCData.iloc[value, column]: # another issues is commas
                NCCCData.iloc[value, column] = float(NCCCData.iloc[value, column].replace(",", ""))
            elif NCCCData.iloc[value,column] == "*":
                NCCCData.iloc[value, column] = None
            #elif NCCCData.iloc[value,column] == "-":
             #   NCCCData.iloc[value, column] = NCCCData.iloc[value, column].replace("-", "")
            else:
                continue
        print("Column ", column, "was changed to type numeric")
        NCCCData.iloc[:,column] = pd.to_numeric(NCCCData.iloc[:,column])
    else:
        print("Column", column, "stayed the same")

Column  0 stayed the same
Column  1 stayed the same
Column  2 was changed to type numeric
Column 3 stayed the same
Column 4 stayed the same
Column  5 was changed to type numeric
Column 6 stayed the same
Column  7 was changed to type numeric
Column  8 was changed to type numeric
Column  9 was changed to type numeric
Column  10 was changed to type numeric
Column  11 was changed to type numeric
Column  12 was changed to type numeric
Column 13 stayed the same
Column  14 was changed to type numeric
Column 15 stayed the same
Column  16 was changed to type numeric
Column 17 stayed the same
Column  18 was changed to type numeric
Column 19 stayed the same
Column  20 was changed to type numeric
Column 21 stayed the same
Column  22 was changed to type numeric
Column 23 stayed the same
Column  24 was changed to type numeric
Column  25 was changed to type numeric
Column  26 was changed to type numeric
Column  27 was changed to type numeric
Column 28 stayed the same
Column  29 was changed to type nu

Column  230 was changed to type numeric
Column  231 was changed to type numeric
Column  232 was changed to type numeric
Column 233 stayed the same
Column  234 was changed to type numeric
Column  235 was changed to type numeric
Column  236 was changed to type numeric
Column  237 was changed to type numeric
Column 238 stayed the same
Column  239 was changed to type numeric
Column  240 was changed to type numeric
Column  241 was changed to type numeric
Column  242 was changed to type numeric
Column 243 stayed the same
Column  244 was changed to type numeric
Column  245 was changed to type numeric
Column  246 was changed to type numeric
Column  247 was changed to type numeric
Column 248 stayed the same
Column  249 was changed to type numeric
Column  250 was changed to type numeric
Column  251 was changed to type numeric
Column  252 was changed to type numeric
Column 253 stayed the same
Column  254 was changed to type numeric
Column  255 was changed to type numeric
Column  256 was changed t

Column 482 stayed the same
Column 483 stayed the same
Column 484 stayed the same
Column 485 stayed the same
Column 486 stayed the same
Column 487 stayed the same
Column 488 stayed the same
Column 489 stayed the same
Column 490 stayed the same
Column 491 stayed the same
Column 492 stayed the same
Column 493 stayed the same
Column 494 stayed the same
Column 495 stayed the same
Column 496 stayed the same
Column 497 stayed the same
Column 498 stayed the same
Column 499 stayed the same
Column 500 stayed the same
Column 501 stayed the same
Column 502 stayed the same
Column 503 stayed the same
Column 504 stayed the same
Column 505 stayed the same
Column 506 stayed the same
Column 507 stayed the same
Column 508 stayed the same
Column 509 stayed the same
Column 510 stayed the same
Column 511 stayed the same
Column 512 stayed the same
Column 513 stayed the same
Column 514 stayed the same
Column 515 stayed the same
Column 516 stayed the same
Column 517 stayed the same
Column 518 stayed the same
C

Column  807 stayed the same
Column  808 stayed the same
Column  809 stayed the same
Column  810 stayed the same
Column  811 stayed the same
Column  812 stayed the same
Column  813 stayed the same
Column  814 stayed the same
Column  815 stayed the same
Column  816 stayed the same
Column  817 stayed the same
Column  818 stayed the same
Column  819 stayed the same
Column  820 stayed the same
Column  821 stayed the same
Column  822 stayed the same
Column  823 stayed the same
Column  824 stayed the same
Column  825 stayed the same
Column  826 stayed the same
Column  827 stayed the same
Column  828 stayed the same
Column  829 stayed the same
Column  830 stayed the same
Column  831 stayed the same
Column  832 stayed the same
Column  833 stayed the same
Column  834 stayed the same
Column  835 stayed the same
Column  836 stayed the same
Column  837 stayed the same
Column  838 stayed the same
Column  839 stayed the same
Column  840 stayed the same
Column  841 stayed the same
Column  842 stayed t

Column  1180 stayed the same
Column  1181 stayed the same
Column  1182 stayed the same
Column  1183 stayed the same
Column  1184 stayed the same
Column  1185 stayed the same
Column  1186 stayed the same
Column  1187 stayed the same
Column  1188 stayed the same
Column  1189 stayed the same
Column  1190 stayed the same
Column  1191 stayed the same
Column  1192 stayed the same
Column  1193 stayed the same
Column  1194 stayed the same
Column  1195 stayed the same
Column  1196 stayed the same
Column  1197 stayed the same
Column  1198 stayed the same
Column  1199 stayed the same
Column  1200 stayed the same
Column  1201 stayed the same
Column  1202 stayed the same
Column  1203 stayed the same
Column  1204 stayed the same
Column  1205 stayed the same
Column  1206 stayed the same
Column  1207 stayed the same
Column  1208 stayed the same
Column  1209 stayed the same
Column  1210 stayed the same
Column  1211 stayed the same
Column  1212 stayed the same
Column  1213 stayed the same
Column  1214 s

Column  1500 stayed the same
Column  1501 stayed the same
Column  1502 stayed the same
Column  1503 stayed the same
Column  1504 stayed the same
Column  1505 stayed the same
Column  1506 stayed the same
Column  1507 stayed the same
Column  1508 stayed the same
Column  1509 stayed the same
Column  1510 stayed the same
Column  1511 stayed the same
Column  1512 stayed the same
Column  1513 stayed the same
Column  1514 stayed the same
Column  1515 stayed the same
Column  1516 stayed the same
Column  1517 stayed the same
Column  1518 stayed the same
Column  1519 stayed the same
Column  1520 stayed the same
Column  1521 stayed the same
Column  1522 stayed the same
Column  1523 stayed the same
Column  1524 stayed the same
Column  1525 stayed the same
Column  1526 stayed the same
Column  1527 stayed the same
Column  1528 stayed the same
Column  1529 stayed the same
Column  1530 stayed the same
Column  1531 stayed the same
Column  1532 stayed the same
Column  1533 stayed the same
Column  1534 s

Column  1868 stayed the same
Column  1869 stayed the same
Column  1870 stayed the same
Column  1871 stayed the same
Column  1872 stayed the same
Column  1873 stayed the same
Column  1874 stayed the same
Column  1875 stayed the same
Column  1876 stayed the same
Column  1877 stayed the same
Column  1878 stayed the same
Column  1879 stayed the same
Column  1880 stayed the same
Column  1881 stayed the same
Column  1882 stayed the same
Column  1883 stayed the same
Column  1884 stayed the same
Column  1885 stayed the same
Column  1886 stayed the same
Column  1887 stayed the same
Column  1888 stayed the same
Column  1889 stayed the same
Column  1890 stayed the same
Column  1891 stayed the same
Column  1892 stayed the same
Column  1893 stayed the same
Column  1894 stayed the same
Column  1895 stayed the same
Column  1896 stayed the same
Column  1897 stayed the same
Column  1898 stayed the same
Column  1899 stayed the same
Column  1900 stayed the same
Column  1901 stayed the same
Column  1902 s

### Remove Columns with Problematic Data
Here we remove entire columns that could cause problems during machine learning. The following operations are performed:

- Remove any columns that have the same value in every single row.
- Remove any columns that have a unique value in every single row (all values are different).
- Remove empty columns (all values are NA or NULL).

In [84]:
#Remove any fields that have the same value in all rows
UniqueValueCounts = NCCCData.nunique(dropna=False)
SingleValueCols = UniqueValueCounts[UniqueValueCounts == 1].index
NCCCData = NCCCData.drop(SingleValueCols, axis=1)

#Review dataset contents after drops
print('*********After: Removing columns with the same value in every row.*******************')
NCCCData.info(verbose=False)
print('\r\nColumns Deleted: ', len(SingleValueCols))

*********After: Removing columns with the same value in every row.*******************
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59 entries, 0 to 58
Columns: 1497 entries, County to Lea_Name
dtypes: float64(1423), int64(71), object(3)
memory usage: 690.1+ KB

Columns Deleted:  453


In [85]:
#Remove any fields that have unique values in every row
NCCCDataRecordCt = NCCCData.shape[0]
UniqueValueCounts = NCCCData.apply(pd.Series.nunique)
AllUniqueValueCols = UniqueValueCounts[UniqueValueCounts == NCCCDataRecordCt].index
NCCCData = NCCCData.drop(AllUniqueValueCols, axis=1)

#Review dataset contents after drops
print('*********After: Removing columns with unique values in every row.*******************')
NCCCData.info(verbose=False)
print('\r\nColumns Deleted: ', len(AllUniqueValueCols))

*********After: Removing columns with unique values in every row.*******************
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59 entries, 0 to 58
Columns: 1472 entries, County to Lea_Name
dtypes: float64(1423), int64(47), object(2)
memory usage: 678.6+ KB

Columns Deleted:  25


In [86]:
#Remove any empty fields (null values in every row)
NCCCDataRecordCt = NCCCData.shape[0]
NullValueCounts = NCCCData.isnull().sum()
NullValueCols = NullValueCounts[NullValueCounts == NCCCDataRecordCt].index
NCCCData = NCCCData.drop(NullValueCols, axis=1)

#Review dataset contents after empty field drops
print('*********After: Removing columns with null / blank values in every row.*************')
NCCCData.info(verbose=False)
print('\r\nColumns Deleted: ', len(NullValueCols))

*********After: Removing columns with null / blank values in every row.*************
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59 entries, 0 to 58
Columns: 1472 entries, County to Lea_Name
dtypes: float64(1423), int64(47), object(2)
memory usage: 678.6+ KB

Columns Deleted:  0


### Handle Other Missing Values Types
- Here we eliminate any numeric and continous columns with more than the percentage of missing values specified by the missingThreshold parameter.
- All remaining missing values are populated with 0.

In [87]:
#Isolate continuous and categorical data types
#These are indexers into the NCCCData dataframe and may be used similar to the schoolData dataframe 
CCD_boolean = NCCCData.loc[:, (NCCCData.dtypes == bool) ]
CCD_nominal = NCCCData.loc[:, (NCCCData.dtypes == object)]
CCD_continuous = NCCCData.loc[:, (NCCCData.dtypes != bool) & (NCCCData.dtypes != object)]
print("Boolean Columns: ", CCD_boolean.shape[1])
print("Nominal Columns: ", CCD_nominal.shape[1])
print("Continuous Columns: ", CCD_continuous.shape[1])
print("Columns Accounted for: ", CCD_nominal.shape[1] + CCD_continuous.shape[1] + CCD_boolean.shape[1])

Boolean Columns:  0
Nominal Columns:  2
Continuous Columns:  1470
Columns Accounted for:  1472


In [88]:
#Eliminate nominal columns with more than missingThreshold percentage of missing values
NCCCDataRecordCt = CCD_nominal.shape[0]
missingValueLimit = NCCCDataRecordCt * missingThreshold
NullValueCounts = CCD_nominal.isnull().sum()
NullValueCols = NullValueCounts[NullValueCounts >= missingValueLimit].index
NCCCData = NCCCData.drop(NullValueCols, axis=1)

#Review dataset contents after empty field drops
print('*********After: Removing columns with >= missingThreshold % of missing values******')
NCCCData.info(verbose=False)
print('\r\nColumns Deleted: ', len(NullValueCols))

*********After: Removing columns with >= missingThreshold % of missing values******
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59 entries, 0 to 58
Columns: 1472 entries, County to Lea_Name
dtypes: float64(1423), int64(47), object(2)
memory usage: 678.6+ KB

Columns Deleted:  0


In [89]:
#Eliminate continuous columns with more than missingThreshold percentage of missing values
NCCCDataRecordCt = CCD_continuous.shape[0]
missingValueLimit = NCCCDataRecordCt * missingThreshold
NullValueCounts = CCD_continuous.isnull().sum()
NullValueCols = NullValueCounts[NullValueCounts >= missingValueLimit].index
NCCCData = NCCCData.drop(NullValueCols, axis=1)

#Review dataset contents after empty field drops
print('*********After: Removing columns with >= missingThreshold % of missing values******')
NCCCData.info(verbose=False)
print('\r\nColumns Deleted: ', len(NullValueCols))

*********After: Removing columns with >= missingThreshold % of missing values******
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59 entries, 0 to 58
Columns: 1369 entries, County to Lea_Name
dtypes: float64(1320), int64(47), object(2)
memory usage: 631.1+ KB

Columns Deleted:  103


In [90]:
#Eliminate continuous columns with more than Threshold percentage of 0 values
threshold = 0.40
NCCCDataRecordCt = NCCCData.shape[0]
zeroValueLimit = NCCCDataRecordCt * threshold
zeroValueCounts = pd.Series(np.count_nonzero(NCCCData, axis = 0))
zeroValueCols = zeroValueCounts[zeroValueCounts <= zeroValueLimit].index

newData = NCCCData.drop(NCCCData.columns[zeroValueCols], axis=1)

#Review dataset contents after empty field drops
print('*********After: Removing columns with >= 60% of 0 values******')
newData.info(verbose=False)
print('\r\nColumns Deleted: ', len(zeroValueCols))

*********After: Removing columns with >= 60% of 0 values******
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59 entries, 0 to 58
Columns: 963 entries, County to Lea_Name
dtypes: float64(916), int64(45), object(2)
memory usage: 444.0+ KB

Columns Deleted:  406


### Impute any Remaining Missing Values with Mean Value
Missing values will be imputed with the mean of the column.

In [91]:
#Print out all the missing value rows
pd.set_option('display.max_rows', 1000)

print('\r\n*********The Remaining Missing Values Below will be Imputed with the Mean!*********')

#Check for Missing values 
missing_values = newData.isnull().sum().reset_index()
missing_values.columns = ['Variable Name', 'Number Missing Values']
missing_values = missing_values[missing_values['Number Missing Values'] > 0] 
missing_values


*********The Remaining Missing Values Below will be Imputed with the Mean!*********


Unnamed: 0,Variable Name,Number Missing Values
1,AdvESL_MeasureableSkills_Participant_POP_MSG,15
2,AdvESL_MeasureableSkills_IndividualsServed,5
3,AdvESL_MeasureableSkills_Participants12+Hours,5
4,AdvESL_MeasureableSkills_ParticipServed,9
5,AdvESL_MeasureableSkills_POPs,5
6,AdvESL_MeasureableSkills_HSE,15
7,AdvESL_MeasureableSkills_Postsecondary\nEnroll...,15
8,AdvESL_MeasureableSkills_Post-test,15
9,AdvESL_MeasureableSkills_MSG,15
10,Beg_ESL_Students,17


In [92]:
#Replace all remaining NaN with mean of column
newData = newData.fillna(newData.mean())

#Check for Missing values after final imputation 
missing_values = newData.isnull().sum().reset_index()
missing_values.columns = ['Variable Name', 'Number Missing Values']
missing_values = missing_values[missing_values['Number Missing Values'] > 0] 
missing_values

Unnamed: 0,Variable Name,Number Missing Values


In [55]:
#NCCCData = NCCCData.dropna(axis='columns', how='all')

### Categorical Variables
Any categorical variables that have greater than the set number of unique values, the uniqueThreshold, in each column will be deleted. Other categorical variables will be one-hot encoded.

In [93]:
#Delete categorical columns with > 25 unique values (Each unique value becomes a column during one-hot encoding)
oneHotUniqueValueCounts = newData[CCD_nominal.columns].apply(lambda x: x.nunique())
oneHotUniqueValueCols = oneHotUniqueValueCounts[oneHotUniqueValueCounts >= uniqueThreshold].index
newData.drop(oneHotUniqueValueCols, axis=1, inplace=True) 

#Review dataset contents one hot high unique value drops
print('*********After: Removing columns with >= uniqueThreshold unique values***********')
newData.info(verbose=False)
print('\r\nColumns Deleted: ', len(oneHotUniqueValueCols))

*********After: Removing columns with >= uniqueThreshold unique values***********
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59 entries, 0 to 58
Columns: 961 entries, AdvESL_MeasureableSkills_Participant_POP_MSG to EVAAS Growth Status_NotMet
dtypes: float64(916), int64(45)
memory usage: 443.0 KB

Columns Deleted:  2


In [94]:
#Isolate remaining categorical variables
begColumnCt = len(newData.columns)
CCD_nominal = newData.loc[:, (newData.dtypes == object)]

#one hot encode categorical variables
if (len(CCD_nominal.columns) != 0):
    newData = pd.get_dummies(data=newData, 
                           columns=CCD_nominal, drop_first=True)
    #Determine change in column count
    endColumnCt = len(newData.columns)
    columnsAdded = endColumnCt - begColumnCt

    #Review dataset contents one hot high unique value drops
    print('Columns To One-Hot Encode: ', len(CCD_nominal.columns))
    print('\r\n*********After: Adding New Columns Via One-Hot Encoding*************************')
    newData.info(verbose=False)
    print('\r\nNew Columns Created Via One-Hot Encoding: ', columnsAdded)
else:
    print("No categorical variables to one-hot encode.")

No categorical variables to one-hot encode.


### Identify and Remove Highly Correlated Features
Find and remove any columns / features that are > 95% correlated

In [95]:
from IPython.core.display import display as d

pd.set_option("display.max_rows",None)

#set the value you want to see variable pairs correlation greater than
correlation_value=.95

#find the correlations of all variable pairs and sort it
#then take only those with correlation greater than value entered
so=pd.DataFrame(newData.corr().unstack().sort_values(kind="quicksort",ascending=False))
so.columns=["Correlation"]
pos=so[(so['Correlation']<1)&(so['Correlation']>correlation_value)].drop_duplicates()
neg=so[(so['Correlation']>-1)&(so['Correlation']<(-1*correlation_value))].drop_duplicates()

#print out results
if neg.shape[0]>0:
    print('Variables with negative correlation below threshold')
    print(len(neg))
    #d(neg)
else:    
    print('No variables with negative correlation below threshold')
    
print("")

if pos.shape[0]>0:
    print('Variables with positive correlation above threshold')
    print(len(pos))
    #d(pos)
else:
    print('No variables with positive correlation above threshold')
    
#create a list of variables to remove (just the left column from the correlations)
pos_list=[]
neg_list=[]

#create a list of the first variables in the pair for the pos and neg lists
if(len(pos)>0):
    pos['correlated_vars']=pos.index
    for i in range(len(pos)):
        pos_list.append(pos['correlated_vars'][i][1])
    
if(len(neg)>0):    
    neg['correlated_vars']=neg.index
    for i in range(len(neg)):
        neg_list.append(neg['correlated_vars'][i][1])

#remove duplicates from the lists
pos_list=set(pos_list)
neg_list=set(neg_list)

#create a list of variables that we want to drop from the data set
if(len(neg_list)>0&len(pos_list)>0):
    to_drop=pd.concat(pos_list,neg_list)
elif(len(neg_list)>0):
    to_drop=neg_list
else:
    to_drop=pos_list
    
#drop the variables in the to_drop list form the data set
newData = newData.drop(to_drop, axis=1)

Variables with negative correlation below threshold
129

Variables with positive correlation above threshold
13109


In [96]:
#Check columns after drop 
print('\r\n*********After: Dropping Highly Correlated Fields**************************************')
newData.info(verbose=False)

#Save the final dataset to a .csv file
newData.to_csv('NCCCData_withHS' + '_ML.csv', sep=',', index=False)


*********After: Dropping Highly Correlated Fields**************************************
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59 entries, 0 to 58
Columns: 927 entries, AdvESL_MeasureableSkills_Participant_POP_MSG to EVAAS Growth Status_NotMet
dtypes: float64(882), int64(45)
memory usage: 427.4 KB


In [97]:
print('*********FINAL DATASET DETAILS*********************************************************\r\n')
newData.info(verbose=True)

*********FINAL DATASET DETAILS*********************************************************

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59 entries, 0 to 58
Data columns (total 927 columns):
AdvESL_MeasureableSkills_Participant_POP_MSG                                                                  float64
AdvESL_MeasureableSkills_IndividualsServed                                                                    float64
AdvESL_MeasureableSkills_Participants12+Hours                                                                 float64
AdvESL_MeasureableSkills_ParticipServed                                                                       float64
AdvESL_MeasureableSkills_POPs                                                                                 float64
AdvESL_MeasureableSkills_HSE                                                                                  float64
AdvESL_MeasureableSkills_Postsecondary
Enrollment                                                 