![Lisbon Data Science Starters Academy](images/starters_academy.png)

# How to get started in data science 

#### What we'll learn today: 
1. Set up a work environment
2. Understand the data 
3. Clean the data 
4. Create a classifier 
5. Evaluate our performance 

# Jupyter notebooks 

A web application that mixes code, text and data in a single document. Excellent for presentations.

Writing some markdown ***(double-click the cell to see the raw markdown)***:

-------

# Big important title 

### some odd type of subtitle 

Do you see any Teletubbies in here? Do you see a slender plastic tag clipped to my shirt with my name printed on it? Do you see a little Asian child with a blank expression on his face sitting outside on a mechanical helicopter that shakes when you put quarters in it? No? Well, that's what you see at a toy store. And you must think you're in a toy store, because you're here shopping for an infant named Jeb.

*uuuh formatting!*

-------

Running some code: 

In [1]:
15 + 4

19

Defining variables and keeping state: 

In [2]:
duck = 7

def i_can_make_functions():
    print('And use them later')

And using it later 

In [3]:
duck + 2

9

In [4]:
i_can_make_functions()

And use them later


Load image from disk: 

![title](images/an_image.jpg)

Load image from url:

![title](https://media.giphy.com/media/ljUXHv2x2BpjG/giphy.gif)

#### Where to get started: 
* [Lessons by datacamp](https://www.datacamp.com/community/tutorials/tutorial-jupyter-notebook)

# RISE

Useful to turn notebooks into interactive presentations. 

###### Note: After installing [RISE](https://github.com/damianavila/RISE), you can click on the button on the far left of the toolbar and see this presentation as slides. 

Oh cool, the cells have become slides! 

In [5]:
print('3 ducks')

3 ducks


#### Where to get started with RISE 
* [RISE](https://github.com/damianavila/RISE)

# Pandas 

Library for data analysis and wrangling 

![pandas](images/pandas.png)

#### All data science projects start with this line

In [6]:
import pandas as pd 

Reading data 

In [7]:
data = pd.read_csv('data/train.csv')

#### Previewing the data

In [8]:
data.head()

|  | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


#### Running basic analysis

In [9]:
data.describe()

|  | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


#### Observing a single column

In [10]:
data['Age'].head()

0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: Age, dtype: float64

#### Creating a histogram

In [11]:
%matplotlib inline
data['Age'].hist()

<matplotlib.axes._subplots.AxesSubplot at 0x1079da160>

#### Aggregations

In [12]:
data.Age.min()

0.41999999999999998

In [13]:
data.Age.max()

80.0

In [14]:
data.Age.mean()

29.69911764705882

In [15]:
data.Age.median()

28.0

#### Group-apply-combine

In [16]:
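# numeric_only=True skips text columns such as Name, which recent pandas versions refuse to average
data.groupby('Pclass').mean(numeric_only=True)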

| Pclass | PassengerId | Survived | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|
| 1 | 461.597222 | 0.62963 | 38.233441 | 0.416667 | 0.356481 | 84.154687 |
| 2 | 445.956522 | 0.472826 | 29.87763 | 0.402174 | 0.380435 | 20.662183 |
| 3 | 439.154786 | 0.242363 | 25.14062 | 0.615071 | 0.393075 | 13.67555 |
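
To make the group-apply-combine idea concrete, here is a minimal sketch on a tiny made-up table (the values are invented for illustration and do not come from `train.csv`). Rows are split by `Pclass`, the mean is applied to each group, and the per-group results are combined into one frame:

```python
import pandas as pd

# Toy data, invented for illustration only (not the Titanic file).
toy = pd.DataFrame({
    'Pclass': [1, 1, 2, 3, 3, 3],
    'Fare': [80.0, 60.0, 20.0, 8.0, 7.0, 9.0],
    'Survived': [1, 1, 0, 0, 1, 0],
})

# Split by class, apply the mean to each group, combine into one frame.
by_class = toy.groupby('Pclass')[['Fare', 'Survived']].mean()
print(by_class)
```

Selecting the columns before calling `.mean()` keeps any non-numeric columns out of the aggregation.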


#### Dealing with missing data

In [17]:
data.Age.head(10)

0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
5     NaN
6    54.0
7     2.0
8    27.0
9    14.0
Name: Age, dtype: float64

In [18]:
data.Age.isnull().sum()

177

#### Replacing missing data

In [19]:
mean_age = data.Age.mean()

In [20]:
data.Age = data.Age.fillna(mean_age)
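
The same mean-imputation step can be sketched on a small made-up Series (the ages below are invented for illustration), which makes it easy to check what `fillna` actually does:

```python
import numpy as np
import pandas as pd

# Toy ages, invented for illustration; NaN marks a missing value.
ages = pd.Series([22.0, 38.0, np.nan, 35.0, np.nan])

# mean() skips NaN by default, so it averages only the known ages.
mean_age = ages.mean()

# fillna returns a new Series; the original is left untouched.
filled = ages.fillna(mean_age)

print(ages.isnull().sum())    # missing values before filling
print(filled.isnull().sum())  # missing values after filling
```

Filling with the mean is only one possible choice; the median or a per-group value would follow the same pattern.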

#### Transforming data

In [21]:
data.Sex.head()

0      male
1    female
2    female
3    female
4      male
Name: Sex, dtype: object

In [22]:
data.Sex = data.Sex.map({'male': 0, 'female': 1})

In [23]:
data.Sex.head()

0    0
1    1
2    1
3    1
4    0
Name: Sex, dtype: int64
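
One caveat worth knowing when using `map` with a dictionary: any value that is not a key in the dictionary becomes `NaN`. A minimal sketch on made-up values (the capitalised `'Female'` is a deliberate inconsistency added for illustration):

```python
import pandas as pd

# Toy values, invented for illustration; note the capitalised entry.
sex = pd.Series(['male', 'female', 'Female', 'male'])

# Values missing from the dictionary are mapped to NaN.
encoded = sex.map({'male': 0, 'female': 1})

# Counting NaN after a map is a cheap sanity check.
unmapped = encoded.isnull().sum()
print(unmapped)
```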

#### Understanding contents

In [24]:
data.Embarked.value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

Subsetting: 

In [25]:
columns_we_want = [
    'Survived', 
    'Pclass', 
    'Sex', 
    'Age', 
    'Parch', 
    'Fare', 
    'Embarked']

In [26]:
data = data.loc[:, columns_we_want]

In [27]:
data.head()

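|  | Survived | Pclass | Sex | Age | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 0 | 22.0 | 0 | 7.25 | S |
| 1 | 1 | 1 | 1 | 38.0 | 0 | 71.2833 | C |
| 2 | 1 | 3 | 1 | 26.0 | 0 | 7.925 | S |
| 3 | 1 | 1 | 1 | 35.0 | 0 | 53.1 | S |
| 4 | 0 | 3 | 0 | 35.0 | 0 | 8.05 | S |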


#### Making dummies

In [28]:
data.head()

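|  | Survived | Pclass | Sex | Age | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 0 | 22.0 | 0 | 7.25 | S |
| 1 | 1 | 1 | 1 | 38.0 | 0 | 71.2833 | C |
| 2 | 1 | 3 | 1 | 26.0 | 0 | 7.925 | S |
| 3 | 1 | 1 | 1 | 35.0 | 0 | 53.1 | S |
| 4 | 0 | 3 | 0 | 35.0 | 0 | 8.05 | S |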


In [29]:
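# dtype=int keeps the indicator columns as 0/1; recent pandas versions default to True/False
data = pd.get_dummies(data, dtype=int)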

In [30]:
data.head()

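|  | Survived | Pclass | Sex | Age | Parch | Fare | Embarked_C | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 0 | 22.0 | 0 | 7.25 | 0 | 0 | 1 |
| 1 | 1 | 1 | 1 | 38.0 | 0 | 71.2833 | 1 | 0 | 0 |
| 2 | 1 | 3 | 1 | 26.0 | 0 | 7.925 | 0 | 0 | 1 |
| 3 | 1 | 1 | 1 | 35.0 | 0 | 53.1 | 0 | 0 | 1 |
| 4 | 0 | 3 | 0 | 35.0 | 0 | 8.05 | 0 | 0 | 1 |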
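
`pd.get_dummies` replaces each text column with one indicator column per distinct value and leaves numeric columns unchanged. A minimal sketch on a made-up frame; passing `dtype=int` is an assumption made here so the indicators print as 0/1 rather than booleans on recent pandas versions:

```python
import pandas as pd

# Toy frame, invented for illustration only.
toy = pd.DataFrame({
    'Fare': [7.25, 71.28, 7.92],
    'Embarked': ['S', 'C', 'S'],
})

# Only the text column is expanded; 'Fare' passes through unchanged.
dummies = pd.get_dummies(toy, dtype=int)
print(list(dummies.columns))
print(dummies['Embarked_S'].tolist())
```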


In [31]:
# Compare with != ; `col not in 'Survived'` would be a substring test on the string
features = [col for col in data.columns if col != 'Survived']

#### Where to get started: 
* [Python for data analysis](https://www.amazon.com/Python-Data-Analysis-Wrangling-IPython/dp/1449319793/ref=asap_bc?ie=UTF8)  
* [10 minutes to pandas](http://pandas.pydata.org/pandas-docs/stable/10min.html)   
* [Pandas cheat sheet](https://www.dataquest.io/blog/pandas-cheat-sheet/)  

# Scikit-learn

### Library for Machine Learning in Python 

![classifiers](images/scikit .png)

#### Key principles

`model.fit(features, target)`

`model.predict(features)`

`model.predict_proba(features)`
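
A minimal sketch of these three calls on toy data (the numbers are invented for illustration, and `DecisionTreeClassifier` is used only as an example estimator):

```python
from sklearn.tree import DecisionTreeClassifier

# Toy data, invented for illustration: one feature, two classes.
features = [[0.0], [1.0], [2.0], [3.0]]
target = [0, 0, 1, 1]

model = DecisionTreeClassifier(max_depth=1, random_state=0)
model.fit(features, target)  # learn from labelled examples

new_rows = [[0.5], [2.5]]
labels = model.predict(new_rows)        # one class label per row
probas = model.predict_proba(new_rows)  # one column per class

print(labels)
print(probas)
```

`predict` returns one label per row, while `predict_proba` returns one probability per row and class.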

#### Before the fun, we still need to do a last step of data preparation

![train test set](images/test_train.png)

Let's split our data into `X_train, X_test, y_train, y_test`

In [32]:
from sklearn.model_selection import train_test_split

In [33]:
X_train, X_test, y_train, y_test = train_test_split(data[features],
                                                    data['Survived'],
                                                    test_size=0.2)

In [34]:
print('The model will be trained on {} rows, and tested on {} rows'.format(X_train.shape[0], 
                                                                           X_test.shape[0]))

The model will be trained on 712 rows, and tested on 179 rows
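
By default `train_test_split` shuffles the rows randomly, so each run yields a different split and slightly different scores. A hedged sketch, on made-up data, of how the `random_state` argument (not used in the cell above) makes the split repeatable:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data, invented for illustration: 10 rows, 2 features.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

# The same random_state gives the same shuffle on every call.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X, y, test_size=0.2, random_state=0)

same_split = np.array_equal(X_te, X_te2)
print(X_tr.shape[0], X_te.shape[0], same_split)
```

The value `0` is arbitrary; any fixed integer works.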


## And now for the fun part! 

In [35]:
from sklearn.tree import DecisionTreeClassifier

In [36]:
tree_classifier = DecisionTreeClassifier(max_depth=3)

In [37]:
tree_classifier.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

![tree model](images/tree_model.png)

Predicting new data:

In [39]:
predictions = tree_classifier.predict(X_test)

In [40]:
predictions

array([0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1,
       0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0,
       0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0,
       1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0,
       1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0])

In [41]:
y_test.values

array([0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1,
       0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1,
       0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0,
       0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0,
       1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0,
       0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0])

How well did we do?

In [42]:
from sklearn.metrics import accuracy_score

In [43]:
# scikit-learn metrics take the true labels first, then the predictions
accuracy_score(y_test.values, predictions)

0.83240223463687146

What other metrics are there? 

In [44]:
from sklearn.metrics import f1_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import auc
from sklearn.metrics import brier_score_loss
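
To see what some of these metrics measure, here is a small sketch that computes precision, recall and F1 by hand, in plain Python, on the first ten true labels and tree predictions printed above. For binary labels, `precision_score`, `recall_score` and `f1_score` compute the same quantities:

```python
# First ten true labels and tree predictions, copied from the arrays above.
y_true = [0, 1, 0, 0, 1, 0, 0, 1, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # survivors found
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # survivors missed

precision = tp / (tp + fp)  # of those predicted to survive, how many did
recall = tp / (tp + fn)     # of those who survived, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)
```

On these ten rows the model makes no false alarms but misses one survivor, so precision is higher than recall.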

How hard is it to try other models? 

In [45]:
from sklearn.ensemble import RandomForestClassifier

In [46]:
from sklearn.naive_bayes import GaussianNB

In [47]:
from sklearn.neighbors import KNeighborsClassifier

In [48]:
from sklearn.neural_network import MLPClassifier
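
Because every scikit-learn classifier exposes the same `fit` / `predict` interface, trying several models can be written as a loop. A minimal sketch on synthetic data from `make_classification` (an assumption for illustration; the dictionary keys and hyperparameters are arbitrary choices, not tuned values):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, generated only for this illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    'tree': DecisionTreeClassifier(max_depth=3, random_state=0),
    'naive_bayes': GaussianNB(),
    'knn': KNeighborsClassifier(n_neighbors=5),
}

# Same two calls for every model: fit on train, score on test.
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, model.predict(X_te))

print(scores)
```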

Can we beat our performance by changing the model?

In [49]:
random_forest = RandomForestClassifier(n_estimators=100, 
                                       max_depth=8, 
                                       n_jobs=-1)

In [50]:
random_forest.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=8, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=100, n_jobs=-1, oob_score=False,
            random_state=None, verbose=0, warm_start=False)

In [51]:
predictions = random_forest.predict(X_test)

In [52]:
# True labels first, then predictions
accuracy_score(y_test.values, predictions)

0.86033519553072624

Predicting probabilities

In [53]:
probabilities = random_forest.predict_proba(X_test)[:,1]

In [54]:
probabilities[0:10]

array([ 0.16793474,  0.86776833,  0.05288249,  0.12536647,  0.98689964,
        0.09179868,  0.10424513,  0.10845246,  0.89541667,  0.08286192])

In [55]:
y_test.values[0:10]

array([0, 1, 0, 0, 1, 0, 0, 1, 1, 0])
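
Probabilities let you choose your own cut-off instead of relying on the labels returned by `predict`. A small sketch in plain Python using the ten probabilities printed above, rounded to three decimals; the thresholds 0.5 and 0.1 are arbitrary illustrative choices:

```python
# First ten survival probabilities from the output above (rounded).
probabilities = [0.168, 0.868, 0.053, 0.125, 0.987,
                 0.092, 0.104, 0.108, 0.895, 0.083]
y_true = [0, 1, 0, 0, 1, 0, 0, 1, 1, 0]


def to_labels(probs, threshold):
    """Label a passenger 1 when the probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probs]


default_labels = to_labels(probabilities, 0.5)
eager_labels = to_labels(probabilities, 0.1)

print(default_labels)
print(eager_labels)
```

Lowering the threshold catches the survivor that the 0.5 cut-off missed (the eighth passenger) at the price of three extra false alarms.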

#### Where to get started: 
* [Scikit-learn documentation](https://scikit-learn.org)
* [Jake VanderPlas at Pydata (video)](https://www.youtube.com/watch?v=HC0J_SPm9co)

# This all sounds great. But how do I learn it? 

![learning](images/learning.gif)

# Anaconda 

#### Installing the tools above one by one is a chore; Anaconda bundles most of them in a single install.

![Anaconda](images/anaconda.png)

#### Where to get started: 
* [Anaconda package](https://www.continuum.io/downloads)

# Dataquest 

#### An interactive way to learn machine learning

![Dataquest](images/dataquest.png)

#### Where to get started: 
* [DataQuest](https://www.dataquest.io)

# Kaggle 

### Challenges to put your knowledge into practice

![Kaggle](images/kaggle.png)

#### Where to get started: 
* [Kaggle website](https://www.kaggle.com)

# Coursera 

We're all Andrew Ng's students, directly or indirectly. 

![Andrew Ng](images/Andrew.png)

#### Where to get started: 
* [Andrew Ng on Coursera](https://www.coursera.org/learn/machine-learning)

# Lisbon Data Science Starters Academy 

If you are in Lisbon, this is the place to start (disclaimer: I'm totally biased on this) 

![Lisbon Data Science Starters Academy](images/starters_academy.png)

#### Where to get started:
* [lisbondatascience.org](http://www.lisbondatascience.org/)