# MiniBooNE Particle Identification

In this project, I explore the <i>MiniBooNE Particle Identification data set</i> from the UCI Machine Learning Repository.

<h2> 1. Data Set </h2>

First, I install and import the libraries we need. Then I load the data set directly from its URL in the UCI archive. If the download fails, save the file to your computer and load it from the local path instead.

In [1]:
# Install pyarrow (used later to save the DataFrame as a Parquet file)
!pip install pyarrow



In [2]:
import pandas as pd
import pyarrow as pa
import numpy as np

In [3]:
# Load the data from the UCI repository; skiprows=1 skips the file's first line
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00199/MiniBooNE_PID.txt"
data = np.loadtxt(url, skiprows=1)
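
If the download in the cell above fails, the local fallback suggested earlier could be sketched as follows. The helper name `load_miniboone` and the local file name are hypothetical and not part of the original notebook, and the sketch assumes that a failed download or missing file surfaces as an `OSError` from `np.loadtxt`.

```python
import numpy as np

UCI_URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/00199/MiniBooNE_PID.txt"


def load_miniboone(source=UCI_URL, fallback_path="MiniBooNE_PID.txt"):
    """Load the event table from `source`, falling back to a local copy.

    Hypothetical helper: `fallback_path` is an assumed location for a
    manually downloaded copy of the file.
    """
    try:
        # skiprows=1 skips the file's first line, as in the notebook
        return np.loadtxt(source, skiprows=1)
    except OSError:
        # Assumption: download and lookup problems are raised as OSError subclasses
        return np.loadtxt(fallback_path, skiprows=1)


# Usage (not executed here, since it needs network access or a local file):
# data = load_miniboone()
```

Catching only `OSError` keeps genuine parsing errors (for example a malformed row) visible instead of silently switching to the local file.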

In [4]:
data.shape

(130064, 50)
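
Note that the array has no label column. According to the data set description on the UCI page, the skipped first line gives the number of signal events followed by the number of background events, with the signal rows listed first; treat that as an assumption to check against your copy of the file. Under that assumption, a label vector could be built as in this sketch (both helper names are hypothetical):

```python
import numpy as np


def parse_event_counts(first_line):
    """Parse 'n_signal n_background' from the file's first line (assumed layout)."""
    n_signal, n_background = (int(tok) for tok in first_line.split())
    return n_signal, n_background


def make_labels(n_signal, n_background, n_rows):
    """Return 1 for signal rows and 0 for background rows, signal first."""
    if n_signal + n_background != n_rows:
        raise ValueError("header counts do not match the number of rows")
    return np.concatenate([np.ones(n_signal, dtype=int),
                           np.zeros(n_background, dtype=int)])


# Toy example with made-up counts (not the real header values):
n_sig, n_bkg = parse_event_counts(" 3 2\n")
labels = make_labels(n_sig, n_bkg, n_rows=5)
```

With the real file, `n_rows` would be `data.shape[0]`, and the check in `make_labels` guards against a header that does not match the loaded rows.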

In [5]:
data

array([[2.59413e+00, 4.68803e-01, 2.06916e+01, ..., 4.57585e-01,
        7.17692e-02, 2.45996e-01],
       [3.86388e+00, 6.45781e-01, 1.81375e+01, ..., 9.35523e-01,
        3.33613e-01, 2.30621e-01],
       [3.38584e+00, 1.19714e+00, 3.60807e+01, ..., 1.01345e+00,
        2.55512e-01, 1.80901e-01],
       ...,
       [3.10842e+00, 2.17814e+00, 5.63651e+01, ..., 7.89276e-01,
        7.30342e-01, 1.52876e-01],
       [5.44560e+00, 1.84570e+00, 1.03463e+02, ..., 2.87259e+00,
        8.19867e-01, 2.10619e-01],
       [4.55062e+00, 1.34174e+00, 8.00887e+01, ..., 2.64744e+00,
        7.42709e-01, 2.76477e-01]])

In [6]:
# Name the columns C1 through C50
cols = [f"C{i}" for i in range(1, 51)]
cols

['C1',
 'C2',
 'C3',
 'C4',
 'C5',
 'C6',
 'C7',
 'C8',
 'C9',
 'C10',
 'C11',
 'C12',
 'C13',
 'C14',
 'C15',
 'C16',
 'C17',
 'C18',
 'C19',
 'C20',
 'C21',
 'C22',
 'C23',
 'C24',
 'C25',
 'C26',
 'C27',
 'C28',
 'C29',
 'C30',
 'C31',
 'C32',
 'C33',
 'C34',
 'C35',
 'C36',
 'C37',
 'C38',
 'C39',
 'C40',
 'C41',
 'C42',
 'C43',
 'C44',
 'C45',
 'C46',
 'C47',
 'C48',
 'C49',
 'C50']

In [7]:
df = pd.DataFrame(data, columns=cols)
# Report the shape of the DataFrame itself rather than the source array
print("Data Frame Dimensions: ", df.shape)
df.head()

Data Frame Dimensions:  (130064, 50)


| | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 | C10 | ... | C41 | C42 | C43 | C44 | C45 | C46 | C47 | C48 | C49 | C50 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.59413 | 0.468803 | 20.6916 | 0.322648 | 0.009682 | 0.374393 | 0.803479 | 0.896592 | 3.59665 | 0.249282 | ... | 101.174 | -31.373 | 0.442259 | 5.86453 | 0.0 | 0.090519 | 0.176909 | 0.457585 | 0.071769 | 0.245996 |
| 1 | 3.86388 | 0.645781 | 18.1375 | 0.233529 | 0.030733 | 0.361239 | 1.06974 | 0.878714 | 3.59243 | 0.200793 | ... | 186.516 | 45.9597 | -0.478507 | 6.11126 | 0.001182 | 0.0918 | -0.465572 | 0.935523 | 0.333613 | 0.230621 |
| 2 | 3.38584 | 1.19714 | 36.0807 | 0.200866 | 0.017341 | 0.260841 | 1.10895 | 0.884405 | 3.43159 | 0.177167 | ... | 129.931 | -11.5608 | -0.297008 | 8.27204 | 0.003854 | 0.141721 | -0.210559 | 1.01345 | 0.255512 | 0.180901 |
| 3 | 4.28524 | 0.510155 | 674.201 | 0.281923 | 0.009174 | 0.0 | 0.998822 | 0.82339 | 3.16382 | 0.171678 | ... | 163.978 | -18.4586 | 0.453886 | 2.48112 | 0.0 | 0.180938 | 0.407968 | 4.34127 | 0.473081 | 0.25899 |
| 4 | 5.93662 | 0.832993 | 59.8796 | 0.232853 | 0.025066 | 0.233556 | 1.37004 | 0.787424 | 3.66546 | 0.174862 | ... | 229.555 | 42.96 | -0.975752 | 2.66109 | 0.0 | 0.170836 | -0.814403 | 4.67949 | 1.92499 | 0.253893 |


In [8]:
# Save the data set as a gzip-compressed Parquet file
df.to_parquet('data/miniboone.parquet', compression='gzip', index=False)
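
`to_parquet` writes to the given path but does not create missing directories, so the cell above fails if `data/` does not exist yet. A minimal sketch for creating the directory first, using a hypothetical helper `ensure_parent_dir` that is not part of the original notebook:

```python
from pathlib import Path


def ensure_parent_dir(path):
    """Create the parent directory of `path` if needed and return the Path."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path


# Usage with the notebook's DataFrame (commented out so the sketch runs on its own):
# out = ensure_parent_dir("data/miniboone.parquet")
# df.to_parquet(out, compression="gzip", index=False)
```

Because `exist_ok=True` is set, calling the helper again when the directory already exists is harmless.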