## What is scikit-learn?

* scikit-learn is a Python module for machine learning built on top of NumPy, SciPy, and matplotlib.
* It is one of the most popular open source machine learning libraries (packages) in Python.
* It is included in the Anaconda distribution.
* It has a consistent API and a website, http://scikit-learn.org/stable/, full of examples, documentation, and explanations.
* Book: ***Introduction to Machine Learning with Python*** by Andreas C. Mueller and Sarah Guido


#### Installation

<pre><code>
# With Anaconda (conda)
conda install scikit-learn

# Or with pip
pip install -U scikit-learn
</code></pre>

#### Requirements for working with data in scikit-learn

* Features and response should be numeric (NumPy arrays or SciPy sparse matrices).

* Features and response are separate objects, each in a specific format:


- ***``X``***: features
  - also called attributes or independent variables
  - a two-dimensional array of shape (n_samples, n_features)
- ***``y``***: response
  - also called target, label, or dependent variable
  - a one-dimensional array or pandas Series
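
The shape requirements above can be checked on a small example. This is a minimal sketch using made-up numbers; the values and the name `single_feature` are assumptions for illustration only and do not come from any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 4 samples (rows), 2 numeric features (columns)
X = np.array([[22.0, 7.25],
              [38.0, 71.28],
              [26.0, 7.92],
              [35.0, 53.10]])

# One numeric response value per sample
y = pd.Series([0, 1, 1, 1])

print(X.shape)  # (4, 2): two-dimensional
print(y.shape)  # (4,): one-dimensional

# Even a single feature has to be passed as a 2-D array,
# so a 1-D column is reshaped to (n_samples, 1)
single_feature = X[:, 0].reshape(-1, 1)
print(single_feature.shape)  # (4, 1)
```

The number of rows in `X` has to match the length of `y`, since each row of features is paired with one response value.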




#### Methods

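scikit-learn estimators share a small set of main methods: ``fit``, ``predict``, ``transform``, and ``score``.

- ``model.fit(X, y)``: train the model on the data

- ``model.predict(X_new)``: predict the response for new data
  - ``model.predict_proba(X_new)``: predict class probabilities (for classifiers that support it)

- ``model.score(X, y)``: evaluate the model (mean accuracy for classifiers, R² for regressors)

- ``model.transform(X)``: transform the data
  - *unsupervised learning or feature selection*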
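
To see these methods side by side, here is a minimal sketch on made-up data. The toy arrays are assumptions for illustration. `StandardScaler` is used as an example of an estimator that provides `transform`, because `LogisticRegression` itself does not transform data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one feature, two well-separated classes
X = np.array([[0.0], [1.0], [2.0], [3.0], [10.0], [11.0], [12.0], [13.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
X_new = np.array([[1.5], [11.5]])

# fit: learn the model parameters from the data
model = LogisticRegression()
model.fit(X, y)

# predict: class labels for new samples
y_pred = model.predict(X_new)

# predict_proba: one row per sample, one column per class
proba = model.predict_proba(X_new)

# score: mean accuracy on the given data (for classifiers)
accuracy = model.score(X, y)

# transform: available on transformers such as StandardScaler
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)

print(y_pred)
print(proba.shape)
print(accuracy)
print(X_scaled.mean())
```

Scoring on the same data used for fitting, as done here for brevity, tends to give an overly optimistic result; a held-out test set gives a more honest estimate.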


#### Modelling Steps

1 Import the class (estimator)
     
     from sklearn.linear_model import LogisticRegression
     
2 Instantiate the class (estimator) and assign it to an object ``clf_logreg``.

*We can specify tuning parameters here; otherwise, the default parameters are applied.*

     clf_logreg = LogisticRegression()

3 Build the model (train or fit the model with data)

     # fit on all available data ...
     clf_logreg.fit(X, y)
     # ... or on a training split only
     clf_logreg.fit(X_train, y_train)

4 Predict the response values for X and assign them to an object ``y_pred``
       
     y_pred = clf_logreg.predict(X)

5 Make predictions on unseen test data

     predictions = clf_logreg.predict(testdata)
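
The five steps can be put together into one runnable sketch. The iris dataset bundled with scikit-learn stands in for real data here, and the `test_size`, `random_state`, and `max_iter` values are illustrative assumptions rather than recommended settings.

```python
# Step 1: import the class (estimator) and helpers
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Example data bundled with scikit-learn (an assumption for this sketch)
X, y = load_iris(return_X_y=True)

# Hold back part of the data so the model can be checked on unseen samples
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Step 2: instantiate the estimator; max_iter is raised only to
# avoid convergence warnings on this data
clf_logreg = LogisticRegression(max_iter=1000)

# Step 3: fit the model on the training data
clf_logreg.fit(X_train, y_train)

# Steps 4 and 5: predict the response for the held-out test data
y_pred = clf_logreg.predict(X_test)

# Mean accuracy on the test split
test_accuracy = clf_logreg.score(X_test, y_test)
print(test_accuracy)
```

Fitting on `X_train` and predicting on `X_test` keeps the evaluation separate from the data the model has already seen.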




# Data Analysis and Modelling: Titanic Starter

The datasets can be downloaded from Kaggle: https://www.kaggle.com/c/titanic

In [152]:
import sys
print("Python version: {}".format(sys.version))

Python version: 3.6.0 |Anaconda custom (64-bit)| (default, Dec 23 2016, 12:22:00) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]


In [153]:
import sklearn
print("scikit-learn version: {}".format(sklearn.__version__))

scikit-learn version: 0.18.1


In [154]:
# Import relevant packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns

In [155]:
# Load data
train = pd.read_csv("https://raw.githubusercontent.com/PyDataWorkshop/datasets/master/titanic/train.csv")  
test = pd.read_csv("https://raw.githubusercontent.com/PyDataWorkshop/datasets/master/titanic/test.csv")