# MSAIL Tutorial Series

## Intro to ML

### Some Definitions

__Artificial Intelligence__
* _definition_: the theory and development of computer systems able to perform tasks that normally require human intelligence
* _examples_:
    * __search algorithms__ - pathfinding, graph traversal
    * __constraint satisfaction problems__ - map coloring, concert scheduling
    * __logic__ - resolution refutation, prolog
    * __planning__ - graph planning given some initial state
    * __machine learning__ - the interesting stuff...

__Machine Learning__
* _informal definition_: Algorithms that improve their prediction performance at some task with experience (or data).
<img src="ml.png">

### Categories of ML Algorithms

__Supervised Learning__
* _goal_: given the data X in some feature space and the labels Y, learn to predict Y from X
* regression (continuous labels) - stock market prediction, Airbnb listing price prediction
* classification (discrete labels) - digit recognition, object recognition
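
To make the regression case concrete, here is a minimal sketch that learns to predict Y from X with a least-squares line fit in NumPy. The synthetic data, the "true" slope and intercept, and all variable names are assumptions made up for illustration.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y = 2x + 1 plus small noise
rng = np.random.default_rng(0)
X = np.linspace(0.0, 10.0, 50)
Y = 2.0 * X + 1.0 + rng.normal(0.0, 0.1, size=X.shape)

# Design matrix with a column of ones so the model can learn an intercept
A = np.column_stack([X, np.ones_like(X)])

# Least-squares fit: find (slope, intercept) minimizing ||A w - Y||^2
(slope, intercept), *_ = np.linalg.lstsq(A, Y, rcond=None)

# Predict the continuous label for an unseen input
x_new = 12.0
y_pred = slope * x_new + intercept
print(slope, intercept, y_pred)
```

The learned slope and intercept should land close to 2 and 1, since that is how the toy labels were generated.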

__Unsupervised Learning__
* _goal_: given the data X without any labels, learn the underlying _structure_ of the data
* clustering - image compression
* component analysis - dimensionality reduction, reconstruction
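
As a rough illustration of component analysis, the sketch below uses a singular value decomposition to reduce 2-D points to 1-D and then reconstruct them. The toy data and variable names are assumptions; this is one common way to do principal component analysis, not the only one.

```python
import numpy as np

# Toy data (an assumption for illustration): 100 points in 2-D lying close to a line
rng = np.random.default_rng(0)
t = rng.normal(0.0, 1.0, size=100)
X = np.column_stack([t, 2.0 * t + rng.normal(0.0, 0.05, size=100)])

# Center the data, then take the SVD; rows of Vt are the principal directions
mean = X.mean(axis=0)
U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)

# Dimensionality reduction: keep only the first principal component (2-D -> 1-D)
Z = (X - mean) @ Vt[:1].T

# Reconstruction: map the 1-D codes back into the original 2-D space
X_hat = Z @ Vt[:1] + mean

reconstruction_error = np.mean((X - X_hat) ** 2)
print(Z.shape, reconstruction_error)
```

No labels are used anywhere: the structure (the points lie near a line) is recovered from X alone, which is why the reconstruction error stays small.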

__Reinforcement Learning__
* _setting_: an agent observes a sequence of states X and "rewards", and must choose an action A at each time step
* _goal_: learn to act (i.e., make decisions) so as to maximize the sum of future rewards
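
The "sum of future rewards" can be computed for every time step of an episode with a single backward pass. The sketch below assumes a made-up reward sequence; the optional `discount` parameter is an addition not mentioned above and defaults to a plain sum.

```python
# Hypothetical reward sequence r_0, r_1, ..., r_T collected over one episode
rewards = [0.0, 0.0, 1.0, 0.0, 5.0]

def returns_to_go(rewards, discount=1.0):
    """Return G_t = r_t + discount * G_{t+1} for every time step t."""
    returns = []
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it
    for r in reversed(rewards):
        running = r + discount * running
        returns.append(running)
    returns.reverse()
    return returns

print(returns_to_go(rewards))                # [6.0, 6.0, 6.0, 5.0, 5.0]
print(returns_to_go(rewards, discount=0.5))  # [0.5625, 1.125, 2.25, 2.5, 5.0]
```

An agent that "learns to act" is trying to pick actions that make these per-step totals as large as possible.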

### Development Environment

__IPython Notebooks__
* http://continuum.io/downloads
* Anaconda is really nice because it comes with a bunch of commonly used Python packages for data science all bundled together

__Kaggle__
* https://www.kaggle.com/c/titanic
* Go ahead and make an account on Kaggle
* Download the datasets (train.csv and test.csv) for the Titanic competition
* Place the datasets in the directory where you plan to work on MSAIL material

__MSAIL Curriculum__
* https://github.com/MSAIL/Curriculum
* Start by downloading this IPython notebook, starting up Anaconda, and opening intro.ipynb

Import some libraries

In [32]:
import csv
import numpy as np
import random

Open the file with the training data

In [33]:
with open('train.csv', newline='') as tf:
    csv_file_object = csv.reader(tf, delimiter=',')
    header = next(csv_file_object)  # first row holds the column names
    data = []
    for row in csv_file_object:
        data.append(row)
    data = np.array(data)  # 2-D array of strings, one row per passenger

Open the file with the training data ~pythonically~

In [34]:
with open('train.csv', newline='') as tf:
    csv_file_object = csv.reader(tf, delimiter=',')
    header = next(csv_file_object)  # first row holds the column names
    data = np.array([row for row in csv_file_object])
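
The loading code reads `header` but then relies on numeric column positions. One possible way to make later indexing more readable is to build a name-to-index lookup from the header. This is a sketch: the in-memory sample stands in for train.csv, and its column names are assumed to match the Kaggle Titanic training file.

```python
import csv
import io

# A two-row stand-in for train.csv (column names assumed to match the Kaggle file)
sample = io.StringIO(
    'PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n'
    '1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S\n'
    '891,0,3,"Dooley, Mr. Patrick",male,32,0,0,370376,7.75,,Q\n'
)

reader = csv.reader(sample)
header = next(reader)
rows = list(reader)

# Map each column name to its position so later code can avoid magic numbers
col = {name: i for i, name in enumerate(header)}

print(col['Survived'], col['Pclass'])       # 1 2
print([row[col['Name']] for row in rows])
```

With such a lookup, an expression like `data[:, col['Survived']]` says what it selects without the reader having to count columns.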

Explore the data

In [35]:
print(data[0])   # first passenger
print(data[-1])  # last passenger

['1' '0' '3' 'Braund, Mr. Owen Harris' 'male' '22' '1' '0' 'A/5 21171'
 '7.25' '' 'S']
['891' '0' '3' 'Dooley, Mr. Patrick' 'male' '32' '0' '0' '370376' '7.75' ''
 'Q']


Let's see how many passengers there were, how many survived, and what proportion survived

In [36]:
survived = data[:, 1].astype(float)  # column 1 is Survived (0 or 1)
number_passengers = np.size(survived)
number_survived = np.sum(survived)
proportion_survivors = number_survived / number_passengers
print(number_passengers, number_survived, proportion_survivors)

891 342.0 0.383838383838
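
Because the survival labels are 0 or 1, the proportion of survivors is simply the mean of that column. The short sketch below checks this on a made-up column of labels (the values are hypothetical, not taken from train.csv).

```python
import numpy as np

# Hypothetical Survived column as strings, mimicking what csv.reader yields
survived_column = np.array(['0', '1', '1', '0', '1'])

survived_labels = survived_column.astype(float)

# The mean of 0/1 labels equals (number of ones) / (number of entries)
proportion_via_sum = np.sum(survived_labels) / np.size(survived_labels)
proportion_via_mean = np.mean(survived_labels)

print(proportion_via_sum, proportion_via_mean)
```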


Let's filter the data by the passenger classes

In [37]:
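class_data = data[:, 2].astype(float)  # column 2 is the passenger class
# Boolean masks: True where the passenger belongs to the given class
class_1_stats = class_data == 1
class_2_stats = class_data == 2
class_3_stats = class_data == 3

# Survival labels (column 1) for the passengers selected by each mask
class_1_onboard = data[class_1_stats, 1].astype(float)
class_2_onboard = data[class_2_stats, 1].astype(float)
class_3_onboard = data[class_3_stats, 1].astype(float)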

For each class (1-3), let's see what proportion of the passengers in that class survived

In [38]:
proportion_class_1_survived = np.sum(class_1_onboard) / np.size(class_1_onboard)
proportion_class_2_survived = np.sum(class_2_onboard) / np.size(class_2_onboard)
proportion_class_3_survived = np.sum(class_3_onboard) / np.size(class_3_onboard)

print('Proportion of class 1 who survived is %s' % proportion_class_1_survived)
print('Proportion of class 2 who survived is %s' % proportion_class_2_survived)
print('Proportion of class 3 who survived is %s' % proportion_class_3_survived)

Proportion of class 1 who survived is 0.62962962963
Proportion of class 2 who survived is 0.472826086957
Proportion of class 3 who survived is 0.242362525458
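
The three near-identical computations can also be written as a loop over the class values. The sketch below assumes a small made-up array with only two columns (Survived, Pclass) rather than the real train.csv layout.

```python
import numpy as np

# Hypothetical toy data: columns are (Survived, Pclass), stored as strings
toy = np.array([
    ['1', '1'], ['1', '1'], ['0', '1'],
    ['1', '2'], ['0', '2'],
    ['0', '3'], ['0', '3'], ['1', '3'], ['0', '3'],
])

survived = toy[:, 0].astype(float)
pclass = toy[:, 1].astype(int)

survival_rate = {}
for c in np.unique(pclass):
    mask = pclass == c  # True for passengers in class c
    survival_rate[int(c)] = float(survived[mask].mean())

print(survival_rate)
```

The boolean mask plays the same role as `class_1_stats` and friends; the loop just avoids repeating the code once per class.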


Open the file with the test data

In [39]:
with open('test.csv', newline='') as tf:
    csv_file_object = csv.reader(tf, delimiter=',')
    header = next(csv_file_object)  # first row holds the column names
    data = np.array([row for row in csv_file_object])

Let's predict the survival of a passenger based solely on which class they were in, write these predictions to a csv file, and submit to Kaggle!

In [40]:
with open("classbasedmodel.csv", "w", newline='') as pf:
    predictions = csv.writer(pf, delimiter=',')
    predictions.writerow(["PassengerId", "Survived"])
    for example in data:
        # test.csv has no Survived column, so column 1 is the passenger class.
        # Convert it to int: comparing the raw string to 1 or 3 is always False.
        pclass = int(example[1])
        if pclass == 1:
            predictions.writerow([example[0], '1'])
        elif pclass == 3:
            predictions.writerow([example[0], '0'])
        else:
            # Class 2 survival was close to 50%, so guess at random
            prob = random.uniform(0, 1)
            if prob < 0.5:
                predictions.writerow([example[0], '0'])
            else:
                predictions.writerow([example[0], '1'])
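
Before submitting, it can help to check how often a rule like this is right on examples whose labels are known. The sketch below wraps the same class-based rule in a hypothetical helper, `predict_by_class`, and scores it on a handful of made-up (Pclass, Survived) pairs; neither the helper nor the data comes from the Kaggle files.

```python
import random

def predict_by_class(pclass, rng=random):
    """Class 1 -> survived, class 3 -> died, class 2 -> coin flip."""
    if pclass == 1:
        return 1
    if pclass == 3:
        return 0
    return 1 if rng.uniform(0, 1) >= 0.5 else 0

# Hypothetical labeled examples as (Pclass, Survived) pairs
labeled = [(1, 1), (1, 1), (1, 0), (2, 1), (2, 0), (3, 0), (3, 0), (3, 1)]

rng = random.Random(0)  # fixed seed so the coin flips are repeatable
correct = sum(predict_by_class(p, rng) == s for p, s in labeled)
accuracy = correct / len(labeled)
print(accuracy)
```

Running the same check on train.csv would give a rough idea of the score to expect from the Kaggle submission.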