Pandas is a Python library that enables easy exploration of structured data. This notebook explores a dataset with pandas and introduces some common operations.

For each function below, there are many options available and pandas has excellent documentation.

For documentation in the notebook, type: pd.read_csv?
For source code, type: pd.read_csv??

In [None]:
import pandas as pd
import numpy as np

In [None]:
!unzip higgs-boson.zip

In [None]:
!unzip training.zip
!unzip test.zip

## Read data

pandas has functions to read data from a variety of sources: CSV files, Excel spreadsheets, SQL databases, JSON files, etc. The two central data structures are:

pd.DataFrame (multiple columns) - think tables

pd.Series (single column ordered by an index) - think time-series
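To make the distinction between the two structures concrete, here is a minimal sketch. The column names and values are invented for illustration and are not taken from the dataset used in this notebook.

```python
import pandas as pd

# Made-up values purely for illustration; not taken from the Higgs dataset.
df = pd.DataFrame({"energy": [1.5, 2.0, 3.25], "label": ["s", "b", "s"]})

# Selecting a single column of a DataFrame yields a Series with the same index.
energy = df["energy"]

print(type(df).__name__)      # DataFrame
print(type(energy).__name__)  # Series
print(df.shape)               # (3, 2): three rows, two columns
```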

#### Question: Read data from CSV file into a dataframe. Bonus points: Read ~10% of the rows (hint: see argument skiprows)
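One possible approach to the bonus question, sketched on a small in-memory CSV so that it runs without the real file: `skiprows` accepts a callable that receives the row index and returns `True` for rows to skip. The helper name `skip_row` and the stand-in data are assumptions made for this example.

```python
import io
import random

import pandas as pd

# Stand-in for the real file: a small in-memory CSV with 1000 data rows.
csv_text = "EventId,value\n" + "\n".join(f"{i},{i * 0.5}" for i in range(1000))

rng = random.Random(0)  # seeded so the sample is reproducible


def skip_row(row_index):
    # Row 0 is the header and must always be kept.
    # Every other row is skipped with probability 0.9.
    return row_index > 0 and rng.random() > 0.1


sample = pd.read_csv(io.StringIO(csv_text), skiprows=skip_row)
print(len(sample))  # roughly 100; the exact count depends on the seed
```

With the real data, the `io.StringIO(...)` argument would be replaced by the path of the unzipped CSV file.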

## Explore data

#### Question: Print the number of rows and columns in the dataframe

#### Question: Print the top 10 rows

#### Question: Print the last 10 rows

#### Question: Print the 5th row (hint: see iloc)

#### Question: Print a statistical summary of the data (hint: describe())

#### Question: Print all the column names in the dataframe
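The exploration questions above map onto a handful of DataFrame attributes and methods. The sketch below uses a synthetic frame (the columns `a`, `b`, `c` are invented) rather than the real data.

```python
import numpy as np
import pandas as pd

# Synthetic frame standing in for the real data (20 rows, 3 columns).
df = pd.DataFrame({
    "a": np.arange(20),
    "b": np.arange(20) * 2.0,
    "c": ["x", "y"] * 10,
})

n_rows, n_cols = df.shape    # number of rows and columns
top = df.head(10)            # first 10 rows
bottom = df.tail(10)         # last 10 rows
fifth = df.iloc[4]           # positions are 0-based, so the 5th row is position 4
summary = df.describe()      # count, mean, std, min, quartiles, max (numeric columns)
names = df.columns.tolist()  # column names as a plain list

print(n_rows, n_cols)
print(names)
```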

## What data is this?

The Large Hadron Collider directs two beams of protons into each other at very high speeds/momenta/energies. Each collision is analogous to a collision between two balloons filled with water and glass beads. When they collide, the balloons burst and the beads from one balloon "collide" with beads from the other. Most pairs of beads just "graze" each other and these are called elastic collisions (these actually result in the production of light particles, e.g. pions).

The interesting collisions are the ones where the protons hit "head-on" and disintegrate into their internal constituents, particles called quarks and gluons. Each collision is called an **event**.

Each row of this data is one event. Each event consists of properties of the particles produced during the collision. You'll see terms like **jet**, **met**, **tau**, **lep** that refer to these particles. **PRI** refers to primary, i.e. values that are directly observed. **DER** refers to values *derived* from primary values.

**pt**, **delta**, **phi**, **eta** refer to various angles/momenta measured during the collision.

**Central Task**: the goal here is to predict the value in the column "Label". We will explore the data first and only turn to prediction in the final section.

#### Question: Look at the values in column "Label"

#### Question: Print all unique values in "Label"

#### Question: Find all unique values and their counts in "Label" (avoid for loops)

#### Question: Select two columns - "Label" and "PRI_jet_num"

#### Question: Find every unique combination of the two columns. In other words, only keep unique tuples and remove duplicate rows.

#### Question: After removing duplicates, sort result by "Label" and "PRI_jet_num"

#### Question: Sorting a large dataframe will create another copy. Is there a way to do this in a more memory-efficient manner? (hint: see arguments to the sorting function)
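The questions on unique values, de-duplication and sorting can be sketched as follows. The column names mirror the dataset, but the values are toy data invented for this example.

```python
import pandas as pd

# Toy data; the column names mirror the dataset but the values are invented.
df = pd.DataFrame({
    "Label": ["s", "b", "b", "s", "b", "b"],
    "PRI_jet_num": [1, 0, 0, 2, 1, 0],
})

uniques = df["Label"].unique()       # distinct values, in order of first appearance
counts = df["Label"].value_counts()  # count per value, no explicit loop needed

# Keep one row per distinct (Label, PRI_jet_num) pair, then sort by both columns.
pairs = df[["Label", "PRI_jet_num"]].drop_duplicates()
pairs = pairs.sort_values(["Label", "PRI_jet_num"]).reset_index(drop=True)

print(counts)
print(pairs)
```

The hint in the last question presumably refers to the `inplace=True` argument of `sort_values`, which modifies the dataframe instead of returning a new one. pandas may still allocate temporary copies internally, so the memory saving is not guaranteed.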

### Filtering data

#### Question: Select all rows where "Label" = "s". How many rows do you get?
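Row filtering is usually done with a boolean mask. A minimal sketch on invented data (the column `x` is hypothetical):

```python
import pandas as pd

# Invented data; only the "Label" column name comes from the dataset.
df = pd.DataFrame({"Label": ["s", "b", "b", "s", "b"], "x": [1, 2, 3, 4, 5]})

mask = df["Label"] == "s"  # boolean Series, True where the condition holds
signal = df[mask]          # keeps only the rows where the mask is True

print(len(signal))  # 2
```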

### Group by

#### Question: Group the data by "Label" and get the number of rows in each group ("s" and "b")

#### Question: Group the data by "Label" and get the mean value in each group for each column

#### Question: Group the data by "Label" and get the standard deviation value in each group for each column

#### Question: Group the data by "Label" and
1. get the percentage of rows in each group for the column "DER_mass_MMC"
2. get the sum of the squares of the values in each group for the column "DER_pt_h"
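The group-by questions above can be sketched on a few invented rows. The "percentage of rows" question is open to interpretation; the sketch reads it as each group's share of all rows, which is an assumption.

```python
import pandas as pd

# Invented values; only the column names come from the dataset.
df = pd.DataFrame({
    "Label": ["s", "b", "b", "s", "b"],
    "DER_mass_MMC": [120.0, 90.0, 95.0, 125.0, 100.0],
    "DER_pt_h": [1.0, 2.0, 3.0, 4.0, 5.0],
})
grouped = df.groupby("Label")

sizes = grouped.size()  # number of rows per group
means = grouped.mean()  # per-group mean of every other column
stds = grouped.std()    # per-group standard deviation of every other column

# One reading of "percentage of rows": each group's share of all rows.
pct = grouped["DER_mass_MMC"].count() / len(df) * 100

# Sum of squares per group, using a custom aggregation function.
sum_sq = grouped["DER_pt_h"].agg(lambda s: (s ** 2).sum())

print(sizes)
print(pct)
print(sum_sq)
```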

### Column-wise or row-wise operations - apply

#### Question: Choose the first 10 columns and find the mean across each column

#### Question: Choose the first 10 columns and find the mean across each row
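The difference between the two questions above comes down to the `axis` argument. A small sketch on a synthetic numeric frame (3 rows, 12 columns, values invented):

```python
import numpy as np
import pandas as pd

# Synthetic numeric frame: 3 rows, 12 columns holding the values 0..35.
df = pd.DataFrame(np.arange(36, dtype=float).reshape(3, 12))

first10 = df.iloc[:, :10]  # all rows, first 10 columns by position

col_means = first10.mean(axis=0)  # one value per column (length 10)
row_means = first10.mean(axis=1)  # one value per row (length 3)

# The same results via apply, which calls the function once per column or row.
col_means_apply = first10.apply(np.mean, axis=0)
row_means_apply = first10.apply(np.mean, axis=1)

print(row_means.tolist())  # [4.5, 16.5, 28.5]
```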

### Plotting

Plotting and visualization are key parts of exploring data. One can treat pandas columns as numpy arrays and pass them to matplotlib, or use pandas' own plotting wrappers.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

#### Question: Make a histogram for "PRI_jet_num"

#### Question: Plot "PRI_jet_all_pt"

#### Question: Make a histogram for "PRI_jet_all_pt"


#### Question: (Scatter) Plot PRI_lep_pt vs PRI_jet_all_pt

#### Question: Make a density plot of PRI_jet_all_pt
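The plotting questions above can be approached with pandas' `.plot` wrappers around matplotlib. The sketch below uses randomly generated stand-in data with the dataset's column names, so the shapes of the plots say nothing about the real data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Random stand-in data; only the column names come from the dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "PRI_jet_num": rng.integers(0, 4, size=200),
    "PRI_jet_all_pt": rng.exponential(50.0, size=200),
    "PRI_lep_pt": rng.exponential(30.0, size=200),
})

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
df["PRI_jet_num"].plot(kind="hist", ax=axes[0], title="PRI_jet_num")          # histogram
df["PRI_jet_all_pt"].plot(ax=axes[1], title="PRI_jet_all_pt by row")          # line plot
df.plot(kind="scatter", x="PRI_lep_pt", y="PRI_jet_all_pt", ax=axes[2])       # scatter plot
fig.tight_layout()
```

A density plot follows the same pattern with `kind="density"`, which additionally requires SciPy to be installed.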

### Machine Learning

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score 

#### Question: Break your training dataset up into a "features" dataframe and a "labels" dataframe

#### Question: Now divide your datasets into a training and testing set. (hint: use the train_test_split function provided by scikit-learn)

#### Question: Train a classifier using the SGDClassifier

#### Question: Make predictions using your trained classifier on your testing set

#### Question: What is your model's accuracy?

#### Question: Generate a confusion matrix to get a better sense of your results.

#### Question: Print a classification report for even more insight into your model's performance. How well did it perform?
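The workflow described by the questions above (split, train, predict, evaluate) can be sketched end to end on synthetic data. The feature columns `f1` and `f2` and the rule generating "Label" are invented for this example, so the scores carry no meaning for the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, linearly separable data; f1, f2 and the labeling rule are invented.
rng = np.random.default_rng(0)
n = 400
data = pd.DataFrame({"f1": rng.normal(size=n), "f2": rng.normal(size=n)})
data["Label"] = np.where(data["f1"] + data["f2"] > 0, "s", "b")

# Separate the inputs (features) from the column to predict (labels).
features = data.drop(columns=["Label"])
labels = data["Label"]

# Hold out 25% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, random_state=0
)

clf = SGDClassifier(random_state=0)
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)

accuracy = accuracy_score(y_test, predictions)
matrix = confusion_matrix(y_test, predictions, labels=["b", "s"])

print(accuracy)
print(matrix)  # rows are true labels, columns are predicted labels
print(classification_report(y_test, predictions))
```

Linear models trained with SGD are sensitive to feature scale, so on real data it is usually worth standardizing the features first. Any other scikit-learn classifier with `fit` and `predict` methods can be substituted for `SGDClassifier` in the same pattern.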

#### Question: Train a classifier using the Random Forest Classifier

#### Question: Make predictions using your trained classifier on your testing set

#### Question: What is your model's accuracy?

#### Question: Generate a confusion matrix to get a better sense of your results.

#### Question: Print a classification report for even more insight into your model's performance. How well did it perform?

#### Question: Which model is better in this case? 