# Setting up a train-test split in scikit-learn

The first step is to split the data into a training set and a test set. Some labels occur only rarely, but we want them to appear in both the training and the test sets. We provide a function, multilabel_train_test_split, that draws the test set so that it contains at least min_count examples of each label, and raises an error if any label has fewer than min_count examples overall.

In [1]:
import pandas as pd
df = pd.read_csv('Training_Data.csv', index_col=0)

In [2]:
print(df.columns)
print("\n\n\n\n\n")
print(df.dtypes)

Index(['Function', 'Use', 'Sharing', 'Reporting', 'Student_Type',
       'Position_Type', 'Object_Type', 'Pre_K', 'Operating_Status',
       'Object_Description', 'Text_2', 'SubFund_Description',
       'Job_Title_Description', 'Text_3', 'Text_4', 'Sub_Object_Description',
       'Location_Description', 'FTE', 'Function_Description',
       'Facility_or_Department', 'Position_Extra', 'Total',
       'Program_Description', 'Fund_Description', 'Text_1'],
      dtype='object')






Function                   object
Use                        object
Sharing                    object
Reporting                  object
Student_Type               object
Position_Type              object
Object_Type                object
Pre_K                      object
Operating_Status           object
Object_Description         object
Text_2                     object
SubFund_Description        object
Job_Title_Description      object
Text_3                     object
Text_4                     object
Sub_O

In [3]:
NUMERIC_COLUMNS = ['FTE', 'Total']

In [4]:
# The columns in the LABELS list are the targets
LABELS=['Function',
 'Use',
 'Sharing',
 'Reporting',
 'Student_Type',
 'Position_Type',
 'Object_Type',
 'Pre_K',
 'Operating_Status']
# Define the lambda function: categorize_label
categorize_label = lambda x: x.astype('category')

# Convert each column of df[LABELS] to a categorical type
df[LABELS] = df[LABELS].apply(categorize_label, axis=0)
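
To see what this conversion does, here is a minimal sketch on a toy DataFrame. The column names echo the dataset, but the values and the names toy and num_unique are made up for illustration.

```python
import pandas as pd

# Toy stand-in for df[LABELS]; the values are illustrative only
toy = pd.DataFrame({
    'Use': ['Instruction', 'O&M', 'Instruction', 'NO_LABEL'],
    'Pre_K': ['NO_LABEL', 'Non PreK', 'NO_LABEL', 'NO_LABEL'],
})

# Same pattern as above: convert every column to the category dtype
toy = toy.apply(lambda col: col.astype('category'), axis=0)
print(toy.dtypes)

# Count the distinct categories in each column
num_unique = toy.apply(pd.Series.nunique)
print(num_unique)
```

Counting the categories per label column is a quick way to see how many classes each target has before building the indicator matrix.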

### The multilabel_train_test_split Function

In [5]:
from warnings import warn

import numpy as np
import pandas as pd

def multilabel_sample(y, size=1000, min_count=5, seed=None):
    """ Takes a matrix of binary labels `y` and returns
        the indices for a sample of size `size` if
        `size` > 1 or `size` * len(y) if `size` <= 1.
        The sample is guaranteed to have >= `min_count` of
        each label.
    """
    try:
        if (np.unique(y).astype(int) != np.array([0, 1])).all():
            raise ValueError()
    except (TypeError, ValueError):
        raise ValueError('multilabel_sample only works with binary indicator matrices')

    if (y.sum(axis=0) < min_count).any():
        raise ValueError('Some classes do not have enough examples. Change min_count if necessary.')

    if size <= 1:
        size = np.floor(y.shape[0] * size)

    if y.shape[1] * min_count > size:
        msg = "Size less than number of columns * min_count, returning {} items instead of {}."
        warn(msg.format(y.shape[1] * min_count, size))
        size = y.shape[1] * min_count

    # RandomState(None) seeds itself from the OS; np.random.randint(1) would always return 0
    rng = np.random.RandomState(seed)

    if isinstance(y, pd.DataFrame):
        choices = y.index
        y = y.values
    else:
        choices = np.arange(y.shape[0])

    sample_idxs = np.array([], dtype=choices.dtype)

    # first, guarantee at least min_count of each label
    for j in range(y.shape[1]):
        label_choices = choices[y[:, j] == 1]
        label_idxs_sampled = rng.choice(label_choices, size=min_count, replace=False)
        sample_idxs = np.concatenate([label_idxs_sampled, sample_idxs])

    sample_idxs = np.unique(sample_idxs)

    # now that we have at least min_count of each, we can just random sample
    sample_count = int(size - sample_idxs.shape[0])

    # get sample_count indices from remaining choices
    remaining_choices = np.setdiff1d(choices, sample_idxs)
    remaining_sampled = rng.choice(remaining_choices,
                                   size=sample_count,
                                   replace=False)

    return np.concatenate([sample_idxs, remaining_sampled])


def multilabel_sample_dataframe(df, labels, size, min_count=5, seed=None):
    """ Takes a dataframe `df` and returns a sample of size `size` where all
        classes in the binary matrix `labels` are represented at
        least `min_count` times.
    """
    idxs = multilabel_sample(labels, size=size, min_count=min_count, seed=seed)
    return df.loc[idxs]


def multilabel_train_test_split(X, Y, size, min_count=5, seed=None):
    """ Takes a features matrix `X` and a label matrix `Y` and
        returns (X_train, X_test, Y_train, Y_test) where all
        classes in Y are represented at least `min_count` times.
    """
    index = Y.index if isinstance(Y, pd.DataFrame) else np.arange(Y.shape[0])

    test_set_idxs = multilabel_sample(Y, size=size, min_count=min_count, seed=seed)

    # np.isin works for both a pandas Index and a plain NumPy array
    test_set_mask = np.isin(index, test_set_idxs)
    train_set_mask = ~test_set_mask

    return (X[train_set_mask], X[test_set_mask], Y[train_set_mask], Y[test_set_mask])

In [6]:
# Create the new DataFrame numeric_data_only, used as the features (X)
numeric_data_only = df[NUMERIC_COLUMNS].fillna(-1000)
# Convert the labels to dummy variables: label_dummies, used as the targets (Y)
label_dummies = pd.get_dummies(df[LABELS])
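
As a small illustration of what pd.get_dummies produces, the sketch below encodes a toy label frame. The values and the names toy_labels, toy_dummies and ones_per_row are made up for this example.

```python
import pandas as pd

# Toy stand-in for df[LABELS]; the values are illustrative only
toy_labels = pd.DataFrame({
    'Use': ['Instruction', 'O&M', 'Instruction'],
    'Pre_K': ['NO_LABEL', 'Non PreK', 'NO_LABEL'],
})

# One indicator column per (column, value) pair, named "<column>_<value>"
toy_dummies = pd.get_dummies(toy_labels)
print(list(toy_dummies.columns))

# Every row has exactly one active indicator per original column
ones_per_row = toy_dummies.astype(int).sum(axis=1)
print(ones_per_row.tolist())
```

This naming scheme is why the real label_dummies has column names such as Function_Aides Compensation, and why each of its rows contains one active indicator for each of the nine label columns.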

In [7]:
print(numeric_data_only)
print("\n\n\n\n")
print(label_dummies)

                FTE          Total
134338     1.000000   50471.810000
206341 -1000.000000    3477.860000
326408     1.000000   62237.130000
364634 -1000.000000      22.300000
47683  -1000.000000      54.166000
229958 -1000.000000      -8.150000
417668 -1000.000000    2000.050000
126378 -1000.000000       0.720000
275539 -1000.000000     228.250000
85262  -1000.000000      69.560000
304569 -1000.000000   -5509.320000
330504 -1000.000000      16.410000
84272      0.600000   38824.790000
64760  -1000.000000 -122544.070000
21870      0.000000     228.530000
18698  -1000.000000      94.357986
169454 -1000.000000     146.510000
169914     1.000000   66651.255981
189701 -1000.000000   30382.320000
43727  -1000.000000    -446.110000
5614   -1000.000000     550.310000
291539     0.012931     329.353815
307038     1.000000  103318.698037
27645  -1000.000000     649.860000
126388     0.000000      71.140000
14962      1.000000   17101.770000
84040  -1000.000000  -21795.930000
61639  -1000.000000 

In [8]:
# Create training and test sets
X_train, X_test, y_train, y_test = multilabel_train_test_split(numeric_data_only,
                                                               label_dummies,
                                                               size=0.2,
                                                               seed=123)

# Print the info
print("X_train info:")
print(X_train.info())
print("\n\nX_test info:")  
print(X_test.info())
print("\n\ny_train info:")  
print(y_train.info())
print("\n\ny_test info:")  
print(y_test.info()) 


X_train info:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 320222 entries, 134338 to 415831
Data columns (total 2 columns):
FTE      320222 non-null float64
Total    320222 non-null float64
dtypes: float64(2)
memory usage: 7.3 MB
None


X_test info:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 80055 entries, 206341 to 72072
Data columns (total 2 columns):
FTE      80055 non-null float64
Total    80055 non-null float64
dtypes: float64(2)
memory usage: 1.8 MB
None


y_train info:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 320222 entries, 134338 to 415831
Columns: 104 entries, Function_Aides Compensation to Operating_Status_PreK-12 Operating
dtypes: uint8(104)
memory usage: 34.2 MB
None


y_test info:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 80055 entries, 206341 to 72072
Columns: 104 entries, Function_Aides Compensation to Operating_Status_PreK-12 Operating
dtypes: uint8(104)
memory usage: 8.6 MB
None
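
The info() output confirms the sizes of the splits but not whether rare labels made it into them. One way to check is to count the positive examples per label column in each split. The helper min_label_count below is a hypothetical name, and the toy matrix is made up for illustration.

```python
import pandas as pd

def min_label_count(Y_split):
    """Smallest number of positive examples over all label columns (hypothetical helper)."""
    return int(Y_split.astype(int).sum(axis=0).min())

# Toy indicator matrix: label 'b' is rare (only 2 positives)
Y_toy = pd.DataFrame({'a': [1, 1, 1, 0, 1, 1],
                      'b': [0, 0, 1, 1, 0, 0]})

# A test split that happens to contain no 'b' example
bad_test = Y_toy.iloc[[0, 1]]
# A split that keeps one 'b' example on each side
good_test = Y_toy.iloc[[1, 2]]
good_train = Y_toy.iloc[[0, 3, 4, 5]]

print(min_label_count(bad_test))
print(min_label_count(good_test), min_label_count(good_train))
```

Applied to the real y_train and y_test, the same helper would show the count of the rarest label in each split.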


# Training a model

Here we import the logistic regression and one-versus-rest classifiers to fit a multi-class logistic regression model to the NUMERIC_COLUMNS of our feature data.

Then we test the model and print its accuracy with the .score() method to see the results of training.

In [9]:
# Import classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Instantiate the classifier: clf
clf = OneVsRestClassifier(LogisticRegression())

# Fit the classifier to the training data
clf.fit(X_train, y_train)

# Print the accuracy
print("Accuracy: {}".format(clf.score(X_test, y_test)))

Accuracy: 0.0
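
An accuracy of 0.0 looks alarming, but for multilabel targets scikit-learn's .score() reports subset accuracy: a row counts as correct only if every one of its label columns is predicted correctly. With 104 indicator columns and only two numeric features, that is a very strict test, which likely explains the score. The sketch below uses NumPy only, with made-up arrays, to contrast subset accuracy with per-label accuracy.

```python
import numpy as np

# Made-up true labels and predictions for 3 rows and 3 label columns
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 1],
                   [1, 0, 0]])

# Subset accuracy: fraction of rows where all columns match
subset_accuracy = (y_true == y_pred).all(axis=1).mean()
# Per-label accuracy: fraction of individual entries that match
per_label_accuracy = (y_true == y_pred).mean()

print(subset_accuracy, per_label_accuracy)
```

Here most individual entries are right, yet no row is fully right, so the subset accuracy is zero. This is one reason to judge the model by its predicted probabilities instead.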


# Using Hold Out Data

We are ready to make some predictions! Remember, the train-test-split we've carried out so far is for model development. The original competition provides an additional test set, for which we'll never actually see the correct labels. This is called the "holdout data."

The point of the holdout data is to provide a fair test for machine learning competitions.

Our original goal is to predict the probability of each label. Here, we'll do just that by using the .predict_proba() method on our trained model.
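
To get a feel for what .predict_proba() returns in this setting, here is a minimal sketch on synthetic data. The arrays X_toy and Y_toy and the rules that generate them are assumptions made up for this example. With a multilabel indicator target, OneVsRestClassifier returns one probability per label column, and the values in a row are not normalized to sum to 1.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.RandomState(0)
X_toy = rng.normal(size=(60, 2))

# Three binary labels, each driven by a different rule (illustrative only)
Y_toy = np.column_stack([
    (X_toy[:, 0] > 0).astype(int),
    (X_toy[:, 1] > 0).astype(int),
    (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int),
])

toy_clf = OneVsRestClassifier(LogisticRegression())
toy_clf.fit(X_toy, Y_toy)

# One row per sample, one column per label
probs = toy_clf.predict_proba(X_toy[:5])
print(probs.shape)
print(probs.sum(axis=1))
```

On the real model, the result therefore has one column for each of the 104 indicator columns in label_dummies.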

In [10]:
# Load the holdout data: holdout
holdout = pd.read_csv('HoldOut.csv', index_col=0)

# Generate predictions: predictions
predictions = clf.predict_proba(holdout[NUMERIC_COLUMNS].fillna(-1000))



## Writing out your results to a csv for submission

Now we'll write our predictions to a .csv file using the .to_csv() method on a pandas DataFrame. Then we'll evaluate our performance according to the LogLoss metric.

We'll use our prediction values to create a new DataFrame, prediction_df.

In [11]:
# Generate predictions: predictions
predictions = clf.predict_proba(holdout[NUMERIC_COLUMNS].fillna(-1000))

# Format predictions in DataFrame: prediction_df
prediction_df = pd.DataFrame(columns=pd.get_dummies(df[LABELS]).columns,
                             index=holdout.index,
                             data=predictions)


# Save prediction_df to csv
prediction_df.to_csv('predictions.csv')

# Submit the predictions for scoring: score
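# Note: score_submission is not defined in this notebook, so this call raises the NameError shown below
score = score_submission(pred_path='predictions.csv')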

# Print score
print('Your model, trained with numeric data only, yields logloss score: {}'.format(score))



NameError: name 'score_submission' is not defined
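
The call fails because score_submission is not defined anywhere in this notebook. To illustrate what a log loss metric measures, here is a minimal sketch that computes the binary log loss for each indicator column and averages the results. The function name mean_column_log_loss and the toy arrays are hypothetical, and the competition's actual scoring function may differ in its details.

```python
import numpy as np

def mean_column_log_loss(y_true, y_prob, eps=1e-15):
    """Binary log loss computed per column, then averaged (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities so that log(0) never occurs
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    per_entry = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    # Average over rows within each column, then over columns
    return per_entry.mean(axis=0).mean()

# Made-up indicator matrix with two label columns
y_true = np.array([[1, 0], [0, 1], [1, 1]])

# Predicting 0.5 everywhere versus being confidently correct
uninformative = np.full(y_true.shape, 0.5)
confident_right = np.where(y_true == 1, 0.99, 0.01)

loss_uninformative = mean_column_log_loss(y_true, uninformative)
loss_confident = mean_column_log_loss(y_true, confident_right)
print(loss_uninformative, loss_confident)
```

Lower is better: confident correct probabilities score close to zero, while confident wrong ones are penalized heavily.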

It's time to step up your game and incorporate the text data.

In [12]:
df

Unnamed: 0,Function,Use,Sharing,Reporting,Student_Type,Position_Type,Object_Type,Pre_K,Operating_Status,Object_Description,...,Sub_Object_Description,Location_Description,FTE,Function_Description,Facility_or_Department,Position_Extra,Total,Program_Description,Fund_Description,Text_1
134338,Teacher Compensation,Instruction,School Reported,School,NO_LABEL,Teacher,NO_LABEL,NO_LABEL,PreK-12 Operating,,...,,,1.000000,,,KINDERGARTEN,50471.810000,KINDERGARTEN,General Fund,
206341,NO_LABEL,NO_LABEL,NO_LABEL,NO_LABEL,NO_LABEL,NO_LABEL,NO_LABEL,NO_LABEL,Non-Operating,CONTRACTOR SERVICES,...,,,,RGN GOB,,UNDESIGNATED,3477.860000,BUILDING IMPROVEMENT SERVICES,,BUILDING IMPROVEMENT SERVICES
326408,Teacher Compensation,Instruction,School Reported,School,Unspecified,Teacher,Base Salary/Compensation,Non PreK,PreK-12 Operating,Personal Services - Teachers,...,,,1.000000,,,TEACHER,62237.130000,Instruction - Regular,General Purpose School,
364634,Substitute Compensation,Instruction,School Reported,School,Unspecified,Substitute,Benefits,NO_LABEL,PreK-12 Operating,EMPLOYEE BENEFITS,...,,,,UNALLOC BUDGETS/SCHOOLS,,PROFESSIONAL-INSTRUCTIONAL,22.300000,GENERAL MIDDLE/JUNIOR HIGH SCH,,REGULAR INSTRUCTION
47683,Substitute Compensation,Instruction,School Reported,School,Unspecified,Teacher,Substitute Compensation,NO_LABEL,PreK-12 Operating,TEACHER COVERAGE FOR TEACHER,...,,,,NON-PROJECT,,PROFESSIONAL-INSTRUCTIONAL,54.166000,GENERAL HIGH SCHOOL EDUCATION,,REGULAR INSTRUCTION
229958,Facilities & Maintenance,O&M,School Reported,School,Unspecified,Custodian,Benefits,NO_LABEL,PreK-12 Operating,CONTRA BENEFITS,...,,,,NON-PROJECT,,UNDESIGNATED,-8.150000,EMPLOYEE BENEFITS,,EMPLOYEE BENEFITS
417668,Instructional Materials & Supplies,Instruction,School Reported,School,Special Education,Non-Position,Supplies/Materials,NO_LABEL,PreK-12 Operating,EDUCATIONAL,...,,,,,,SUPPLIES AND MATERIALS,2000.050000,SPECIAL EDUCATION LOCAL,LOCAL FUND,
126378,Food Services,O&M,School on Central Budgets,Non-School,Unspecified,Coordinator/Manager,Benefits,NO_LABEL,PreK-12 Operating,EMPLOYEE BENEFITS,...,,DISTRICT WIDE ORGANIZATION UNI,,NON-PROJECT,,UNDESIGNATED,0.720000,UNDESIGNATED,,UNDESIGNATED
275539,Teacher Compensation,Instruction,School Reported,School,Unspecified,Teacher,Benefits,NO_LABEL,PreK-12 Operating,EMPLOYEE BENEFITS,...,,,,ELA S - TEACHING SPANISH ONLY,,PROFESSIONAL-INSTRUCTIONAL,228.250000,GENERAL ELEMENTARY EDUCATION,,REGULAR INSTRUCTION
85262,Substitute Compensation,Instruction,School Reported,School,Unspecified,Substitute,Benefits,NO_LABEL,PreK-12 Operating,EMPLOYEE BENEFITS,...,,,,UNALLOC BUDGETS/SCHOOLS,,PROFESSIONAL-INSTRUCTIONAL,69.560000,GENERAL ELEMENTARY EDUCATION,,REGULAR INSTRUCTION
