### What is Machine Learning?

Machine learning is a subset of artificial intelligence in the field of computer science that often uses statistical techniques to give computers the ability to "learn" (i.e., progressively improve performance on a specific task) with data, without being explicitly programmed.

_https://en.wikipedia.org/wiki/Machine_learning_

First let's look at this classic [Kaggle](https://kaggle.com) example - [Titanic: Machine Learning from Disaster](https://www.kaggle.com/c/titanic)

### Pandas - powerful Python data analysis toolkit

Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python.

http://pandas.pydata.org/pandas-docs/stable/index.html

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

print("pandas version: ", pd.__version__)

In [None]:
# First let's create a DataFrame

dates = pd.date_range('20170101', periods=6)
dates

In [None]:
df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
df

In [None]:
df2 = pd.DataFrame({ 'A' : 1.,
                     'B' : pd.Timestamp('20130102'),
                     'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                     'D' : np.array([3] * 4,dtype='int32'),
                     'E' : pd.Categorical(["test","train","test","train"]),
                     'F' : 'foo' })
df2

In [None]:
df2.dtypes

In [None]:
### show available "column names" and "public attributes"

# df2.<Press Tab>

In [None]:
df.head()

In [None]:
df.tail(3)

In [None]:
df.index

In [None]:
df.columns

In [None]:
df.values

In [None]:
# show a quick statistical summary of each column
df.describe()

In [None]:
# transpose
df.T

In [None]:
df.T.describe()

In [None]:
df

In [None]:
# sort by index labels along an axis: 0 sorts the rows, 1 sorts the columns
df.sort_index(axis=0, ascending=False)

In [None]:
# sorting by values
df.sort_values('B')

### Selection

In [None]:
df['A']

In [None]:
df[0:3]

In [None]:
df['20170102':'20170104']

In [None]:
# By Label
df.loc[dates[0]]

In [None]:
# By position
df.iloc[3:5, 0:2]

#### Boolean Indexing
Using a single column's values to select data

In [None]:
df[df.A > 0]

In [None]:
df.A

In [None]:
df[df > 0]

In [None]:
# isin()

df2 = df.copy()
df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
df2

In [None]:
df2[df2['E'].isin(['two','four'])]

In [None]:
s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20170102', periods=6))
s1

In [None]:
df['F'] = s1
df

In [None]:
df.at[dates[0],'A'] = 0
df

In [None]:
df.iat[0,1] = 0
df

### Missing Data

pandas primarily uses the value np.nan to represent missing data. It is by default not included in computations. 

Reindexing allows you to change/add/delete the index on a specified axis. This returns a copy of the data.
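
As a minimal sketch of the "not included in computations" behaviour described above, the following uses a small made-up Series (the values are illustrative and not part of the notebook's data) to show that reductions skip `NaN` by default, and that passing `skipna=False` changes this.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing value among three entries
s = pd.Series([1.0, np.nan, 3.0])

total = s.sum()                      # NaN is skipped: 1.0 + 3.0
average = s.mean()                   # mean over the two non-missing values
strict_total = s.sum(skipna=False)   # NaN propagates when skipping is disabled
n_missing = int(s.isna().sum())      # number of missing entries

print(total, average, strict_total, n_missing)
```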


In [None]:
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df1.loc[dates[0]:dates[1],'E'] = 1
df1

In [None]:
df1.dropna(how='any')  # drop rows containing any NaN; how='all' drops only rows that are entirely NaN

In [None]:
df3 = df1.fillna(value=5)
df3

In [None]:
df1

In [None]:
pd.isna(df1)

### Stats

In [None]:
# Stats
df.mean()

In [None]:
# along the other axis (one mean per row)
df.mean(axis=1)

### The Titanic Example

![titanic-example](https://img.buzzfeed.com/buzzfeed-static/static/2018-05/22/19/asset/buzzfeed-prod-web-03/anigif_sub-buzz-4981-1527032798-1.gif?downsize=715:*&output-format=auto&output-quality=auto)

[**Important** concepts](https://medium.com/technology-nineleaps/some-key-machine-learning-definitions-b524eb6cb48)

#### Feature

Features are the individual independent variables that act as inputs to your system. Prediction models use features to make predictions. New features can also be derived from existing ones through a process known as ‘feature engineering’. More simply, you can think of one column of your data set as one feature. Features are sometimes also called attributes, and the number of features is called the dimensionality of the data.

#### Target

The target is the output that the input variables are used to predict. In a classification problem it is the class that each input is mapped to; in a regression problem it is a value from the output range. For a training set, the target is the set of known output values that the model learns from.

#### Label

Labels are the final output. You can also consider the output classes to be the labels. When data scientists speak of labeled data, they mean groups of samples that have been tagged with one or more labels.
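
To make the terms above concrete, here is a small sketch on a made-up table. The DataFrame `toy` and its values are hypothetical; only the column names echo the Titanic data. It separates the feature columns from the target column and lists the distinct labels.

```python
import pandas as pd

# Hypothetical toy table: each row is one passenger, each column one attribute
toy = pd.DataFrame({
    "Age": [22, 38, 26, 35],
    "Fare": [7.25, 71.28, 7.92, 53.10],
    "Survived": [0, 1, 1, 1],
})

X = toy.drop(columns=["Survived"])  # features: the input columns
y = toy["Survived"]                 # target: the column to predict

print(X.shape)              # (number of samples, number of features)
print(sorted(y.unique()))   # the distinct labels (output classes)
```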

#### Train

To train a machine learning model, you pass training data to a learning algorithm. The algorithm finds patterns in the training data that map the input features to the target. The output of the training process is a machine learning model, which you can then use to make predictions. This process is also called “learning”.
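
A minimal sketch of this training step, assuming scikit-learn's `LogisticRegression` as the learning algorithm and a tiny made-up data set (the arrays `X` and `y` are illustrative only):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one feature; the label is 1 when the feature is large
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()   # the learning algorithm
model.fit(X, y)                # training: find patterns linking X to y

# The fitted model can now predict labels for inputs it has not seen
predictions = model.predict(np.array([[-1.0], [6.0]]))
print(predictions)
```

The object returned by `fit` is the trained model; calling `predict` on it is the "make predictions" step.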

In [None]:
import pandas as pd
import seaborn as sns
from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold, learning_curve
from sklearn.linear_model import LogisticRegression

%matplotlib inline

In [None]:
train = pd.read_csv("./titanic/train.csv")
test = pd.read_csv("./titanic/test.csv")

In [None]:
train.head()

In [None]:
train.describe()

In [None]:
test.head() # which column is missing?

In [None]:
## Join train and test datasets in order to obtain the same number of features during categorical conversion
train_len = len(train)
dataset = pd.concat(objs=[train, test], axis=0).reset_index(drop=True) # Axis is important

print("Train data length =>", len(train))
print("Test data length =>", len(test))
print("Concat dataset length =>", len(dataset))

In [None]:
# Check for Null values
dataset.isnull().sum()

#### Feature Visualization

Correlation matrix between the numerical features (`SibSp`, `Parch`, `Age`, and `Fare`) and `Survived`

In [None]:
g = sns.heatmap(train[["Survived","SibSp","Parch","Age","Fare"]].corr(),annot=True, fmt = ".2f", cmap = "coolwarm")

`Fare` seems to have the clearest positive correlation with `Survived` among these features.

This raises two questions:
1. How many passengers in `train` have a `Fare` greater than 50?
2. What percentage of the training set is that?
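
One way to answer both questions is with a boolean mask. The sketch below runs on a made-up stand-in called `toy_train` (hypothetical values) so that it is self-contained; the same two lines should work on the real `train` DataFrame.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic `train` DataFrame (values are made up)
toy_train = pd.DataFrame({"Fare": [7.25, 71.28, 8.05, 53.10, 263.00]})

high_fare = toy_train["Fare"] > 50     # boolean mask, one entry per row
count = int(high_fare.sum())           # True counts as 1
percentage = 100 * high_fare.mean()    # mean of a boolean mask = fraction of True

print(count, f"{percentage:.1f}%")
```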

#### Explore SibSp feature vs Survived

In [None]:
# factorplot was renamed to catplot in seaborn 0.9, and `size` became `height`
g = sns.catplot(x="SibSp", y="Survived", data=train, kind="bar", height=6,
                palette="muted")
g.despine(left=True)
g = g.set_ylabels("survival probability")

**Sex vs Survived**

In [None]:
g = sns.barplot(x="Sex",y="Survived",data=train)
g = g.set_ylabel("Survival Probability")

Repeat this process for every feature. Afterwards you should have a rough picture of how the final output, `Survived`, may be affected by each feature.
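
As a rough sketch of repeating the comparison for several features at once, the loop below computes the survival rate per category with `groupby`. The DataFrame `toy_train` and its values are hypothetical; with the real data you would pass the actual column names.

```python
import pandas as pd

# Hypothetical miniature of the training data (values are made up)
toy_train = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Pclass": [3, 1, 3, 1, 2, 3],
    "Survived": [0, 1, 1, 0, 1, 0],
})

# The mean of the 0/1 `Survived` column within each group is the survival rate
survival_by_feature = {
    col: toy_train.groupby(col)["Survived"].mean()
    for col in ["Sex", "Pclass"]
}

for col, rates in survival_by_feature.items():
    print(f"Survival rate by {col}:")
    print(rates.to_string(), end="\n\n")
```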

#### Feature Engineering

In [None]:
# Get Title from Name
dataset_title = [i.split(",")[1].split(".")[0].strip() for i in dataset["Name"]]
dataset["Title"] = pd.Series(dataset_title)
dataset["Title"].unique()

In [None]:
g = sns.countplot(x="Title",data=dataset)
g = plt.setp(g.get_xticklabels(), rotation=45) 

In [None]:
# Classify different titles into different priority group 
dataset["Title"] = dataset["Title"].replace(['Lady', 'the Countess','Countess','Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
dataset["Title"] = dataset["Title"].map({"Master":0, "Miss":1, "Ms" : 1 , "Mme":1, "Mlle":1, "Mrs":1, "Mr":2, "Rare":3})
dataset["Title"] = dataset["Title"].astype(int)

In [None]:
g = sns.countplot(x="Title", data=dataset)  # pass the column by keyword; newer seaborn treats the first positional argument as `data`
g = g.set_xticklabels(["Master","Miss/Ms/Mme/Mlle/Mrs","Mr","Rare"])

In [None]:
g = sns.catplot(x="Title",y="Survived",data=dataset,kind="bar")  # factorplot was renamed to catplot in seaborn 0.9
g = g.set_xticklabels(["Master","Miss-Mrs","Mr","Rare"])
g = g.set_ylabels("survival probability")

For each feature of interest, find a way to group, simplify, or scale it so that the model can handle it better mathematically.
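
The three ideas above (group, simplify, scale) can be sketched as follows. This is an illustration on made-up data, not the preprocessing used for the processed file loaded later; the bin edges, the `AgeGroup`, `FamilySize`, and `FareScaled` column names, and the choice of min-max scaling are all assumptions.

```python
import pandas as pd

# Hypothetical toy data (values are made up)
toy = pd.DataFrame({
    "Age": [4, 17, 30, 62],
    "SibSp": [1, 0, 1, 0],
    "Parch": [2, 0, 0, 0],
    "Fare": [10.0, 20.0, 30.0, 50.0],
})

# Group: bucket a continuous feature into ordered categories
toy["AgeGroup"] = pd.cut(toy["Age"], bins=[0, 12, 18, 60, 120],
                         labels=["child", "teen", "adult", "senior"])

# Simplify: combine two related counts into one feature (+1 for the passenger)
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1

# Scale: min-max scaling maps the feature onto the range [0, 1]
fare = toy["Fare"]
toy["FareScaled"] = (fare - fare.min()) / (fare.max() - fare.min())

print(toy)
```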

In [None]:
dataset.head()

We will skip the remaining preprocessing steps and load the fully processed data directly:

In [None]:
dataset = pd.read_csv('./titanic/data-processed.csv')

In [None]:
dataset.head()

#### Modeling

In [None]:
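## Separate train dataset and test dataset
train_len = 881 # hard-coded number of training rows in the processed data

# .copy() and a non-inplace drop avoid pandas' SettingWithCopyWarning on slices
train = dataset[:train_len].copy()
test = dataset[train_len:].drop(labels=["Survived"], axis=1)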

In [None]:
## Separate train features and label 

train["Survived"] = train["Survived"].astype(int)
Y_train = train["Survived"]
X_train = train.drop(labels = ["Survived"],axis = 1)

In [None]:
# Cross validate model with Kfold stratified cross val
kfold = StratifiedKFold(n_splits=10)

In [None]:
train.Sex.unique()

In [None]:
random_state = 2
classifier = LogisticRegression(random_state = random_state)
scores = cross_val_score(classifier, X_train, y = Y_train, scoring = "accuracy", cv = kfold, n_jobs=4)

In [None]:
scores
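
The array above holds one accuracy value per fold. A common way to read it is to look at the mean and the spread. The sketch below shows this on a synthetic data set from `make_classification`, used here only as a stand-in so the example is self-contained; the names `X_toy`, `y_toy`, `toy_kfold`, and `toy_scores` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for X_train / Y_train (not the Titanic data)
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

toy_kfold = StratifiedKFold(n_splits=10)
toy_scores = cross_val_score(LogisticRegression(), X_toy, y_toy,
                             scoring="accuracy", cv=toy_kfold)

# One accuracy per fold: the mean estimates performance on unseen data,
# the standard deviation shows how much it varies between folds
print(f"accuracy: {toy_scores.mean():.3f} +/- {toy_scores.std():.3f}")
```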

#### Predict

In [None]:
classifier = classifier.fit(X_train, Y_train)
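# Reuse test's index so the predictions align with test["PassengerId"] in pd.concat
prediction = pd.Series(classifier.predict(test), index=test.index, name="Survived")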

#### Write Results

In [None]:
results = pd.concat([test["PassengerId"],prediction],axis=1)
# results.to_csv("prediction.csv",index=False)

source - https://www.kaggle.com/leyuanyu/titanic-top-4-with-ensemble-modeling

### What is Deep Learning?

Deep learning (also known as deep structured learning or hierarchical learning) is part of a broader family of machine learning methods based on learning data representations, as opposed to task-specific algorithms. Learning can be supervised, semi-supervised or unsupervised.

_https://en.wikipedia.org/wiki/Deep_learning_

### Image

An image (from Latin: imago) is an artifact that depicts visual perception, for example, a photo or a two-dimensional picture, that has a similar appearance to some subject—usually a physical object or a person, thus providing a depiction of it.

_https://en.wikipedia.org/wiki/Image_

![what-is-image](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQiXIBxwSfJ58XeDpb7Rzvs20bIU2Vjr0hiSLYYLtqXbFBkmy18DA)

## Numpy

From [cs231n note](http://cs231n.github.io/python-numpy-tutorial/).

In [None]:
import numpy as np

a = np.array([1, 2, 3])   # Create a rank 1 array
print(type(a))            # Prints "<class 'numpy.ndarray'>"
print(a.shape)            # Prints "(3,)"
print(a[0], a[1], a[2])   # Prints "1 2 3"

In [None]:
a[0] = 5                  # Change an element of the array
print(a)                  # Prints "[5 2 3]"

b = np.array([[1, 2, 3],[4, 5, 6]])    # Create a rank 2 array
print(b.shape)                     # Prints "(2, 3)"
print(b[0, 0], b[0, 1], b[1, 0])   # Prints "1 2 4"

In [None]:
a = np.zeros((2,2))   # Create an array of all zeros
print(a)              # Prints "[[ 0.  0.]
                      #          [ 0.  0.]]"

b = np.ones((1,2))    # Create an array of all ones
print(b)              # Prints "[[ 1.  1.]]"

In [None]:
c = np.full((2,2), 7)  # Create a constant array
print(c)               # Prints "[[7 7]
                       #          [7 7]]"

d = np.eye(2)         # Create a 2x2 identity matrix
print(d)              # Prints "[[ 1.  0.]
                      #          [ 0.  1.]]"

e = np.random.random((2,2))  # Create an array filled with random values
print(e)                     # Might print "[[ 0.91940167  0.08143941]
                             #               [ 0.68744134  0.87236687]]"

In [None]:
# Create the following rank 2 array with shape (3, 4)
# [[ 1  2  3  4]
#  [ 5  6  7  8]
#  [ 9 10 11 12]]
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])

# Use slicing to pull out the subarray consisting of the first 2 rows
# and columns 1 and 2; b is the following array of shape (2, 2):
# [[2 3]
#  [6 7]]
b = a[:2, 1:3]

# A slice of an array is a view into the same data, so modifying it
# will modify the original array.
print(a[0, 1])   # Prints "2"
b[0, 0] = 77     # b[0, 0] is the same piece of data as a[0, 1]
print(a[0, 1])   # Prints "77"

In [None]:
x = np.array([[1,2],[3,4]], dtype=np.float64)
y = np.array([[5,6],[7,8]], dtype=np.float64)

# Elementwise sum; both produce the array
# [[ 6.0  8.0]
#  [10.0 12.0]]
print(x + y)
print(np.add(x, y))

# Elementwise difference; both produce the array
# [[-4.0 -4.0]
#  [-4.0 -4.0]]
print(x - y)
print(np.subtract(x, y))

# Elementwise product; both produce the array
# [[ 5.0 12.0]
#  [21.0 32.0]]
print(x * y)
print(np.multiply(x, y))

# Elementwise division; both produce the array
# [[ 0.2         0.33333333]
#  [ 0.42857143  0.5       ]]
print(x / y)
print(np.divide(x, y))

# Elementwise square root; produces the array
# [[ 1.          1.41421356]
#  [ 1.73205081  2.        ]]
print(np.sqrt(x))

In [None]:
x = np.array([[1,2],[3,4]])
y = np.array([[5,6],[7,8]])

v = np.array([9,10])
w = np.array([11, 12])

# Inner product of vectors; both produce 219
print(v.dot(w))
print(np.dot(v, w))

# Matrix / vector product; both produce the rank 1 array [29 67]
print(x.dot(v))
print(np.dot(x, v))

# Matrix / matrix product; both produce the rank 2 array
# [[19 22]
#  [43 50]]
print(x.dot(y))
print(np.dot(x, y))

In [None]:
x = np.array([[1,2],[3,4]])

print(np.sum(x))  # Compute sum of all elements; prints "10"
print(np.sum(x, axis=0))  # Compute sum of each column; prints "[4 6]"
print(np.sum(x, axis=1))  # Compute sum of each row; prints "[3 7]"

### Optional: Advanced Numpy

In [None]:
# Check http://cs231n.github.io/python-numpy-tutorial/

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Compute the x and y coordinates for points on a sine curve
x = np.arange(0, 3 * np.pi, 0.1)
y = np.sin(x)

# Plot the points using matplotlib
plt.plot(x, y)
plt.show()  # You must call plt.show() to make graphics appear.

In [None]:
# Compute the x and y coordinates for points on sine and cosine curves
x = np.arange(0, 3 * np.pi, 0.1)
y_sin = np.sin(x)
y_cos = np.cos(x)

# Plot the points using matplotlib
plt.plot(x, y_sin)
plt.plot(x, y_cos)
plt.xlabel('x axis label')
plt.ylabel('y axis label')
plt.title('Sine and Cosine')
plt.legend(['Sine', 'Cosine'])
plt.show()

In [None]:
# Compute the x and y coordinates for points on sine and cosine curves
x = np.arange(0, 3 * np.pi, 0.1)
y_sin = np.sin(x)
y_cos = np.cos(x)

# Set up a subplot grid that has height 2 and width 1,
# and set the first such subplot as active.
plt.subplot(2, 1, 1)

# Make the first plot
plt.plot(x, y_sin)
plt.title('Sine')

# Set the second subplot as active, and make the second plot.
plt.subplot(2, 1, 2)
plt.plot(x, y_cos)
plt.title('Cosine')

# Show the figure.
plt.show()