### This is part one of two notebooks that train a Support Vector Machine to discriminate between two types of collisional events.

### This first part shows how the data were reshaped into the format we need and how a random sample was selected from the larger data set to keep things manageable.

The resulting files have already been provided, so you do not need to run this notebook, but feel free to read through it to get an idea of how it works.

It accompanies Chapter 4 of the book.

Data for this exercise were kindly provided by [Sascha Caron](https://www.nikhef.nl/~scaron/).

Author: Viviana Acquaviva

In [None]:
import numpy as np
import pandas as pd
from matplotlib import rc

In [None]:
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_colwidth', 100)
rc('text', usetex=False)

In [None]:
df = pd.read_csv('../data/TrainingValidationData.csv', delimiter=',',
                 names=['P'+str(i) for i in range(53)], usecols=range(53))
print(df.columns)

In [None]:
df.head()

In [None]:
new = df['P0'].str.split(';',expand=True)

new.columns = ['numID', 'processID', 'weight', 'MET', 'METphi', 'Type_1']

In [None]:
new.head()

In [None]:
df = df.join(new, how='outer') # join the new columns side by side, matching on the row index

In [None]:
df.head()

#### The new columns have been appended at the end; we still need to split the type of product.
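As a minimal sketch of what this split does, the toy frame below mimics a column in which a numerical value and a product type share one cell, separated by a semicolon. The values (`'0.52;j'`, `'-1.30;e-'`) and the name `toy` are made up for illustration and are not taken from the data file.

```python
import pandas as pd

# Hypothetical two-row frame: each cell packs a number and a type label
toy = pd.DataFrame({'P4': ['0.52;j', '-1.30;e-']})

# expand=True returns one column per piece instead of a list per row
parts = toy['P4'].str.split(';', expand=True)

toy['P4'] = parts[0].astype('float64')  # the numerical part stays in P4
toy['Type_2'] = parts[1]                # the type label gets its own column

print(toy)
```

The cast to `float64` is done here right away only to keep the sketch short; the notebook itself re-casts the affected columns in a later cell.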

In [None]:
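# Every fourth column holds two pieces separated by ';': a value and a type
for i in range(4, 53, 4):
    new = df['P'+str(i)].str.split(';', expand=True)
    df['P'+str(i)] = new[0]             # keep the first piece in place
    df['Type_'+str(i//4 + 1)] = new[1]  # store the second piece as a new Type column

print(df.columns)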

In [None]:
df = df.drop('P0', axis=1)

In [None]:
df.columns.values

In [None]:
# just re-ordering the columns

cols = ['numID', 'processID', 'weight',
       'MET', 'METphi', 'Type_1', 'P1', 'P2', 'P3', 'P4',  'Type_2', 'P5', 'P6', 'P7', 'P8', 'Type_3', 'P9', 'P10', 'P11',
       'P12',  'Type_4', 'P13', 'P14', 'P15', 'P16', 'Type_5','P17', 'P18', 'P19', 'P20',
       'Type_6','P21', 'P22', 'P23', 'P24', 'Type_7','P25', 'P26', 'P27', 'P28', 'Type_8','P29',
       'P30', 'P31', 'P32', 'Type_9', 'P33', 'P34', 'P35', 'P36', 'Type_10','P37', 'P38',
       'P39', 'P40', 'Type_11', 'P41', 'P42', 'P43', 'P44', 'Type_12', 'P45', 'P46', 'P47',
       'P48', 'Type_13','P49', 'P50', 'P51', 'P52']

In [None]:
X = df[cols].drop(['numID', 'processID', 'weight'], axis = 1)

In [None]:
len(cols)

In [None]:
X.head() 

In [None]:
X.describe() # remember that, by default, this only summarizes numerical columns

Some columns that should be numerical are of type "object".

In [None]:
X.columns[X.dtypes == object]

Re-cast the data type where appropriate.

In [None]:
for el in ['MET', 'METphi', 'P4', 'P8', 'P12', 'P16', 'P20', 'P24', 'P28',
           'P32', 'P36', 'P40', 'P44', 'P48', 'P52']:
    X[el] = X[el].astype('float64')
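
A possible alternative to `astype`, sketched below on a made-up frame, is `pd.to_numeric` with `errors='coerce'`. It turns entries that cannot be parsed into NaN, whereas `astype('float64')` raises an error on a string such as `'oops'`. The frame `toy` and its contents are hypothetical; whether coercion is appropriate for the real data is an assumption to check, not something this notebook relies on.

```python
import pandas as pd

# Hypothetical frame: numbers stored as strings, one missing and one bad entry
toy = pd.DataFrame({'MET': ['62.5', '110.2', None],
                    'P4': ['0.52', 'oops', '1.7']})

for col in ['MET', 'P4']:
    # errors='coerce' replaces unparseable entries with NaN instead of raising
    toy[col] = pd.to_numeric(toy[col], errors='coerce')

print(toy.dtypes)
print(toy.isna().sum())  # number of NaN entries per column
```

Counting the NaN entries afterwards gives a quick view of how many values were missing or failed to parse.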

In [None]:
X.dtypes

#### Select 5000 rows

In [None]:
np.random.seed(10)

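# note: np.random.choice samples with replacement by default, so an index can be drawn more than once
sel = np.random.choice(df.shape[0], 5000)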

features = X.iloc[sel,:]
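
To illustrate how `np.random.choice` behaves, the sketch below compares the default call (sampling with replacement) with `replace=False` on a small, made-up population. The sizes `pop_size` and `n_draw` are hypothetical stand-ins for `df.shape[0]` and 5000. Switching the real selection to `replace=False` would produce a different sample from the one in the provided files.

```python
import numpy as np

pop_size = 20  # hypothetical population size (stand-in for df.shape[0])
n_draw = 15    # hypothetical sample size (stand-in for 5000)

np.random.seed(10)
with_repl = np.random.choice(pop_size, n_draw)  # default: replace=True

np.random.seed(10)
without_repl = np.random.choice(pop_size, n_draw, replace=False)  # each index at most once

print(len(np.unique(with_repl)), 'unique indices out of', n_draw)
print(len(np.unique(without_repl)), 'unique indices out of', n_draw)
```

Without replacement, the number of unique indices always equals the number of draws; with replacement, it can be smaller.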

In [None]:
features.shape

In [None]:
features.columns

Reset index

In [None]:
features = features.reset_index(drop=True)

In [None]:
features.head()

#### Export feature data frame to file

In [None]:
features.to_csv('../data/ParticleID_features.csv', index_label= 'ID')

#### Select labels

In [None]:
y = df.processID[sel].values # .values converts the Series to a NumPy array

In [None]:
y

#### Export labels to file

In [None]:
np.savetxt('../data/ParticleID_labels.txt', y, fmt = '%s')
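
As a rough sanity check on this kind of export, the sketch below writes a small, made-up feature frame and label array to a temporary directory using the same calls, reads them back, and checks that the row counts agree. The frame contents, the label strings, and the file names inside the temporary directory are all hypothetical.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical stand-ins for the `features` frame and the `y` array
features = pd.DataFrame({'MET': [62.5, 110.2], 'Type_1': ['j', 'b']})
y = np.array(['ttbar', '4top'])

with tempfile.TemporaryDirectory() as tmp:
    feat_path = os.path.join(tmp, 'features.csv')
    lab_path = os.path.join(tmp, 'labels.txt')

    features.to_csv(feat_path, index_label='ID')
    np.savetxt(lab_path, y, fmt='%s')

    # index_col='ID' restores the row index written by index_label='ID'
    features_back = pd.read_csv(feat_path, index_col='ID')
    labels_back = np.loadtxt(lab_path, dtype=str)

# Features and labels should describe the same number of rows
assert len(features_back) == len(labels_back)
print(features_back.shape, labels_back.shape)
```

The same read-back pattern could be pointed at the real output files to confirm that features and labels stay aligned row by row.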