# Python Training - Intermediate

## I. numpy

[Numpy](https://www.numpy.org) is the core library for scientific computing in Python. It provides a high-performance multidimensional array object, and tools for working with these arrays.

In [1]:
# Import convention
import numpy as np

The main object is the **`numpy.ndarray`**, usually created with `np.array()`. Note that it is not the same as the Standard Python Library class `array.array`, which only handles one-dimensional arrays and offers less functionality.

### Introduction

In [2]:
# A sample np.array
a = np.arange(15)
print(a)
type(a)

[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14]


numpy.ndarray

Important note: numpy arrays are **homogeneous**: they do not allow elements of different _data types_ (in numpy lingo, `dtype`). See the rich [documentation on data types](https://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html).

In [3]:
np.array([1, 'a']) # the integer 1 is coerced to a string, so every element shares one Unicode string dtype

array(['1', 'a'], dtype='<U11')

numpy supports a wide variety of data types, and choosing compact data types for large datasets can significantly reduce memory use and improve performance.
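
As a rough illustration of the memory point, the sketch below stores the same 100 integers with two different dtypes and compares their sizes through the `nbytes` attribute. The variable names are made up for this example.

```python
import numpy as np

values = np.arange(100)

# Same numbers, different storage size per element
as_int64 = values.astype(np.int64)  # 8 bytes per element
as_int8 = values.astype(np.int8)    # 1 byte per element (valid range -128..127)

print(as_int64.nbytes)  # 800
print(as_int8.nbytes)   # 100
```

The smaller dtype is only safe when every value fits in its range.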

In [4]:
# Notice the difference with Standard Python when you just invoke the array:
a

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])

In [5]:
# Standard Python
b = [1,2,3]
b

[1, 2, 3]

In [6]:
# Correct way to create a numpy array:
c = np.array([[1, 2, 3, 4], [2, 3, 4, 5]])
c

array([[1, 2, 3, 4],
       [2, 3, 4, 5]])

In [14]:
# A numpy 2D array with random numbers
np.random.rand(3,2)

array([[0.92651902, 0.84730702],
       [0.6896085 , 0.12824139],
       [0.07630996, 0.04213123]])

In [18]:
# Various attributes
c.shape # returns a tuple

(2, 4)

In [19]:
# Returns the data type of elements inside the array
a.dtype

dtype('int32')

In [21]:
type(a)

numpy.ndarray

In [22]:
np.zeros((2,2))

array([[0., 0.],
       [0., 0.]])

In [24]:
np.ones((2,2)) # 2x2 array of ones; the shape is passed as a tuple

array([[1., 1.],
       [1., 1.]])

In [27]:
# Ones on the main diagonal, zeros elsewhere; np.eye(3) would give the square 3x3 identity matrix
np.eye(3,4)

array([[1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.]])

### Indexing

Like in Standard Python, there are several ways to access specific elements or subsets of a numpy array.

Similar to Python lists, numpy arrays can be **sliced**. Since arrays may be multidimensional, you can specify a slice for each dimension of the array.

In [28]:
# Slicing
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])
a

array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12]])

In [29]:
# Use slicing to pull out the subarray consisting of the first 2 rows
# and columns 2 and 3
b = a[:2, 1:3]
print(b)

[[2 3]
 [6 7]]


In [31]:
# A slice of an array is a view into the same data, so modifying it
# will modify the original array.
print(a[0, 1])   # Prints "2"
b[0, 0] = 77     # b[0, 0] is the same piece of data as a[0, 1]
print(a[0, 1])   # Prints "77"

2
77


In [32]:
a

array([[ 1, 77,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12]])
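
When the original array must stay untouched, one option is to take an explicit copy of the slice. The sketch below contrasts a view with a copy; `view` and `independent` are names chosen for this example.

```python
import numpy as np

a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

view = a[:2, 1:3]                # shares memory with a
independent = a[:2, 1:3].copy()  # owns its own data

independent[0, 0] = 77           # does not touch a
print(a[0, 1])                   # 2

view[0, 0] = 99                  # writes through to a
print(a[0, 1])                   # 99

print(np.shares_memory(a, view), np.shares_memory(a, independent))  # True False
```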

In [33]:
# We can also use slicing to get certain rows of an array
row_r1 = a[1, :]    # Rank 1 view of the second row of a
row_r2 = a[1:2, :]  # Rank 2 view of the second row of a
print(row_r1, row_r1.shape)  # Prints "[5 6 7 8] (4,)"
print("")
print(row_r2, row_r2.shape)  # Prints "[[5 6 7 8]] (1, 4)"

[5 6 7 8] (4,)

[[5 6 7 8]] (1, 4)


In [None]:
# We can make the same distinction when accessing columns of an array:
col_r1 = a[:, 1]
col_r2 = a[:, 1:2]
print(col_r1, col_r1.shape)  # Prints "[77  6 10] (3,)" (a[0, 1] was set to 77 above)
print("")
print(col_r2, col_r2.shape)  # Prints "[[77]
                             #          [ 6]
                             #          [10]] (3, 1)"

In [34]:
# Integer Array Indexing
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])
print(a)
print("")
print(a[0]) # first row
print("")
print(a[[0, 2], [1, 1]]) # elements (0,1) and (2,1) of a, selected with integer arrays

[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]

[1 2 3 4]

[ 2 10]
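
Integer array indexing pairs up a sequence of row indices with a sequence of column indices. A minimal sketch, with `rows` and `cols` as illustrative names, that picks one element from each row:

```python
import numpy as np

a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

rows = np.array([0, 1, 2])
cols = np.array([0, 2, 3])

# Pairs up rows[i] with cols[i]: elements (0,0), (1,2) and (2,3)
picked = a[rows, cols]
print(picked)  # [ 1  7 12]

# Unlike a slice, the result is a copy, so a stays unchanged
picked[0] = -1
print(a[0, 0])  # 1
```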


In [35]:
# Boolean Indexing
a = np.arange(12)
print(a > 2)

[False False False  True  True  True  True  True  True  True  True  True]


The following method of boolean indexing is **very important** for pandas Series/DataFrames manipulations, as we will see later: 

In [36]:
# Subset array elements larger than 2
a[a>2]

array([ 3,  4,  5,  6,  7,  8,  9, 10, 11])
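
Boolean masks can be combined and can also be used for assignment. The following sketch shows both; the conditions are arbitrary and only serve the example.

```python
import numpy as np

a = np.arange(12)

# Combine conditions with & (and) and | (or); each condition needs parentheses
mask = (a > 2) & (a % 2 == 0)
print(a[mask])  # [ 4  6  8 10]

# A mask can also select the elements to overwrite
capped = a.copy()
capped[capped > 8] = 8
print(capped)  # [0 1 2 3 4 5 6 7 8 8 8 8]
```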

### (numpy) Array Math

In [37]:
# 3x2 array of random numbers drawn uniformly from [0, 1)
a = np.random.rand(3,2)
a

array([[0.02366769, 0.74743654],
       [0.03477518, 0.1475584 ],
       [0.79185064, 0.38527405]])

In [38]:
# Transpose
a.T

array([[0.02366769, 0.03477518, 0.79185064],
       [0.74743654, 0.1475584 , 0.38527405]])

In [41]:
# Dot Product
b = np.random.rand(2)
a.dot(b)

array([0.21450483, 0.06837528, 0.78468288])

In [47]:
# Simple operations on numpy arrays are element-wise:
a = np.array([9,3,15])
b = np.array([10,6,4])
print(a + b)  # Same as np.add(a,b)
print("")
print(a - b)  # Same as np.subtract(a,b)
print("")
print(a * b)  # Same as np.multiply(a,b)
print("")
print(a / b)  # Same as np.divide(a,b)
print("")
print(a ** b) # Same as np.power(a,b); 9**10 overflows int32 here, hence the negative first value in the output
print("")
print(np.sqrt(a))

[19  9 19]

[-1 -3 11]

[90 18 60]

[0.9  0.5  3.75]

[-808182895        729      50625]

[3.         1.73205081 3.87298335]
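
The negative first value in the `a ** b` output above comes from integer overflow: the arrays were stored as `int32`, and 9**10 does not fit. A small sketch of the effect and of one possible remedy (casting to `int64`), with the dtypes set explicitly so the behaviour does not depend on the platform:

```python
import numpy as np

a32 = np.array([9, 3, 15], dtype=np.int32)
b32 = np.array([10, 6, 4], dtype=np.int32)

# 9**10 = 3486784401 exceeds the int32 maximum (2147483647) and wraps around
wrapped = a32 ** b32
print(wrapped)

# Casting to a wider integer type gives the exact values 3486784401, 729, 50625
a64 = a32.astype(np.int64)
b64 = b32.astype(np.int64)
exact = a64 ** b64
print(exact)
```

numpy does not raise an error for this kind of array overflow, so it is worth checking dtypes before raising integers to large powers.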


Other operations: `np.min(), np.max(), np.sum()`, etc.

In [48]:
# Statistics
print('Mean: ', np.mean(a))              # Also: np.nanmean() ignores np.nan values within the array
print('Median: ', np.median(a))          # np.nanmedian()
print('Standard Deviation: ', np.std(a)) # np.nanstd(a)
print('90th percentile: ', np.percentile(a, 90))

Mean:  9.0
Median:  9.0
Standard Deviation:  4.898979485566356
90th percentile:  13.799999999999999


Note: `np.nan` is _like_ `None` in Standard Python, but it is of type `float`!

In [49]:
b = np.array([np.nan, 1])

In [50]:
b

array([nan,  1.])

In [51]:
b.dtype

dtype('float64')
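
A few properties of `np.nan` are easy to trip over in practice. The sketch below shows that it never compares equal to anything, how to detect it, and how the `nan`-aware functions differ from the plain ones.

```python
import numpy as np

b = np.array([np.nan, 1.0, 3.0])

print(np.nan == np.nan)   # False: NaN never compares equal, not even to itself
print(np.isnan(b))        # [ True False False]
print(np.mean(b))         # nan: a single NaN propagates through the calculation
print(np.nanmean(b))      # 2.0
```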

There is an abundance of tutorials covering `numpy` in great detail. I will post some links on the Confluence page.

## II. pandas

`pandas` is a Python library written for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. Its name comes from the econometric term _panel data_.

Pandas is capable of many tasks including:
- Reading/writing many different data formats
- Selecting subsets of data
- Calculating across rows and down columns
- Finding and filling missing data
- Applying operations to independent groups within the data
- Reshaping data into different forms
- Combining multiple datasets together
- Advanced time-series functionality
- Visualization through matplotlib and seaborn

Its documentation is extremely thorough (2,500+ pages long!) but, in my opinion, it is not very helpful for an audience interested in "real" data analysis: all the data is contrived or randomly generated, and the real power of pandas, multiple operations performed in sequence, is not really outlined there.

In [52]:
# Import convention
import pandas as pd

The main data structures in Pandas are the `Series` and the `DataFrame`.

In [57]:
# Load a numpy array into a pd.Series
a = np.random.rand(50)
b = pd.Series(a)
b.head() # Returns the first 5 rows, unless specified otherwise


0    0.614481
1    0.527340
2    0.683276
3    0.100152
4    0.191499
dtype: float64

In [71]:
# Load a Python dictionary into a pd.DataFrame (common variable name: `df`)
data = {
    'number': [1, 10, 100, np.nan, 500],
    'color': ['Red', 'Green', 'Blue', 'Yellow', 'Black'],
    'price': [2.0, 15.50, 3.42, 64.50,np.nan]
}

df = pd.DataFrame(data)
df

Unnamed: 0,number,color,price
0,1.0,Red,2.0
1,10.0,Green,15.5
2,100.0,Blue,3.42
3,,Yellow,64.5
4,500.0,Black,


A `pandas DataFrame` comprises three elements:
- Index
- Columns
- Values

In [72]:
print(df.index) # can be changed with set_index()
print(df.columns) # dtype 'object' is a general string-like data type
type(df.values) # df.values is a numpy array!

RangeIndex(start=0, stop=5, step=1)
Index(['number', 'color', 'price'], dtype='object')


numpy.ndarray

Indexing a pandas DataFrame is an immensely rich topic. Let us just mention that we can manually manipulate the `index` of the DataFrame (loosely comparable to a primary key, although pandas does not require index values to be unique).

In [73]:
df.set_index('color')

Unnamed: 0_level_0,number,price
color,Unnamed: 1_level_1,Unnamed: 2_level_1
Red,1.0,2.0
Green,10.0,15.5
Blue,100.0,3.42
Yellow,,64.5
Black,500.0,


In [74]:
# Note: `color` is still a regular column here, because set_index() returned a new DataFrame and left `df` unchanged
df['color']

0       Red
1     Green
2      Blue
3    Yellow
4     Black
Name: color, dtype: object

Important note: such operations return **copies** of the original DataFrame, without changing the original variable. To keep the result, you either have to assign it to a variable, or use the `inplace` flag.
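
The assignment route can be sketched as follows, on a tiny DataFrame built just for this example (`indexed` is an illustrative name):

```python
import pandas as pd

df = pd.DataFrame({'color': ['Red', 'Green'], 'price': [2.0, 15.5]})

indexed = df.set_index('color')   # returns a new DataFrame

print(list(df.columns))       # ['color', 'price']: df is unchanged
print(list(indexed.columns))  # ['price']: color is now the index
```

Assigning the result keeps the original around, which tends to make intermediate steps easier to inspect than `inplace=True`.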

In [75]:
df

Unnamed: 0,number,color,price
0,1.0,Red,2.0
1,10.0,Green,15.5
2,100.0,Blue,3.42
3,,Yellow,64.5
4,500.0,Black,


In [76]:
df.set_index('color', inplace=True)
df

Unnamed: 0_level_0,number,price
color,Unnamed: 1_level_1,Unnamed: 2_level_1
Red,1.0,2.0
Green,10.0,15.5
Blue,100.0,3.42
Yellow,,64.5
Black,500.0,


### Working with missing data

Our dataframe has two `np.nan` (missing) values: `number` for the Yellow row and `price` for the last (Black) row. Depending on the specifics of our problem, we might either want to `drop` these rows, or impute (`fill`) some value for the missing data:

In [77]:
# Pandas axes: 0 = rows, 1 = columns
df2 = df.dropna(axis=0, how='any') # "Drop all rows where any column has a value of np.nan"
df2

Unnamed: 0_level_0,number,price
color,Unnamed: 1_level_1,Unnamed: 2_level_1
Red,1.0,2.0
Green,10.0,15.5
Blue,100.0,3.42


In [79]:
# Let us impute the mean of the remaining values
df3 = df.fillna(df['number'].mean()) # Fill all empty "cells" of the DataFrame, in every column, with the mean value of the "number" column
df3

Unnamed: 0_level_0,number,price
color,Unnamed: 1_level_1,Unnamed: 2_level_1
Red,1.0,2.0
Green,10.0,15.5
Blue,100.0,3.42
Yellow,152.75,64.5
Black,500.0,152.75
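
Filling every column with the mean of a single column is rarely what is wanted. One alternative, sketched here on a rebuilt copy of the same small DataFrame, is to pass `df.mean()` so that each column is filled with its own mean:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        'number': [1, 10, 100, np.nan, 500],
        'price': [2.0, 15.50, 3.42, 64.50, np.nan],
    },
    index=pd.Index(['Red', 'Green', 'Blue', 'Yellow', 'Black'], name='color'),
)

# df.mean() is a Series of column means; fillna matches it to columns by name
filled = df.fillna(df.mean())
print(filled)
```

A dictionary such as `{'number': 0, 'price': df['price'].median()}` works the same way when different columns need different fill strategies.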


Pandas Series have numpy-like math operations (`mean(), median(), sum()`, etc.). Important distinction: by default, Pandas ignores `np.nan` values during these calculations.

In [80]:
# Get some fast counts of NaN rows for `price` column
print(df['price'].isnull().sum())
print(df['price'].notnull().sum())

1
4


As you can see, a great strength of Pandas is its support for practically limitless **chain operations**. Use them carefully and wisely: it might be easier to perform many operations in one go, but it is harder to debug (if a chain operation fails, you need to figure out which step failed).
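
One way to keep chains debuggable is to wrap them in parentheses and put one step per line. The sketch below uses a small DataFrame built for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'color': ['Red', 'Green', 'Blue', 'Yellow', 'Black'],
    'price': [2.0, 15.50, 3.42, 64.50, np.nan],
})

# Wrapping the chain in parentheses allows one step per line,
# so a single step can be commented out while debugging
result = (
    df
    .dropna(subset=['price'])               # drop rows without a price
    .sort_values('price', ascending=False)  # most expensive first
    .head(2)                                # keep the top two
)
print(result['color'].tolist())  # ['Yellow', 'Green']
```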

In [89]:
# Import a sample dataset from a csv file
import pandas as pd
df = pd.read_csv('d:/Users/PShirbhate/Downloads/Evolent-Python Training/Auto.csv') # Mike, change filepath
df

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name
0,18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165,3693,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150,3436,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150,3433,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140,3449,10.5,70,1,ford torino
5,15.0,8,429.0,198,4341,10.0,70,1,ford galaxie 500
6,14.0,8,454.0,220,4354,9.0,70,1,chevrolet impala
7,14.0,8,440.0,215,4312,8.5,70,1,plymouth fury iii
8,14.0,8,455.0,225,4425,10.0,70,1,pontiac catalina
9,15.0,8,390.0,190,3850,8.5,70,1,amc ambassador dpl


In [90]:
# df.describe() automatically subsets the numerical columns of the dataframe
df.describe()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin
count,392.0,392.0,392.0,392.0,392.0,392.0,392.0,392.0
mean,23.445918,5.471939,194.41199,104.469388,2977.584184,15.541327,75.979592,1.576531
std,7.805007,1.705783,104.644004,38.49116,849.40256,2.758864,3.683737,0.805518
min,9.0,3.0,68.0,46.0,1613.0,8.0,70.0,1.0
25%,17.0,4.0,105.0,75.0,2225.25,13.775,73.0,1.0
50%,22.75,4.0,151.0,93.5,2803.5,15.5,76.0,1.0
75%,29.0,8.0,275.75,126.0,3614.75,17.025,79.0,2.0
max,46.6,8.0,455.0,230.0,5140.0,24.8,82.0,3.0


### Chained indexing

##### Example 1

In [98]:
# Subset the dataframe, and then return one column of the subset
df[['mpg', 'cylinders', 'displacement']]['displacement']

0      307.0
1      350.0
2      318.0
3      304.0
4      302.0
5      429.0
6      454.0
7      440.0
8      455.0
9      390.0
10     383.0
11     340.0
12     400.0
13     455.0
14     113.0
15     198.0
16     199.0
17     200.0
18      97.0
19      97.0
20     110.0
21     107.0
22     104.0
23     121.0
24     199.0
25     360.0
26     307.0
27     318.0
28     304.0
29      97.0
       ...  
362    112.0
363    112.0
364    112.0
365    112.0
366    135.0
367    151.0
368    140.0
369    105.0
370     91.0
371     91.0
372    105.0
373     98.0
374    120.0
375    107.0
376    108.0
377     91.0
378     91.0
379     91.0
380    181.0
381    262.0
382    156.0
383    232.0
384    144.0
385    135.0
386    151.0
387    140.0
388     97.0
389    135.0
390    120.0
391    119.0
Name: displacement, Length: 392, dtype: float64

In [101]:
df.set_index('name', inplace=True)

##### Example 2

In [111]:
df.columns
df.head()

Unnamed: 0_level_0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
chevrolet chevelle malibu,18.0,8,307.0,130,3504,12.0,70,1
buick skylark 320,15.0,8,350.0,165,3693,11.5,70,1
plymouth satellite,18.0,8,318.0,150,3436,11.0,70,1
amc rebel sst,16.0,8,304.0,150,3433,12.0,70,1
ford torino,17.0,8,302.0,140,3449,10.5,70,1


In [113]:
# Location indexing: .loc is primarily label based
df.loc[['chevrolet chevelle malibu']][['year','displacement']]

Unnamed: 0_level_0,year,displacement
name,Unnamed: 1_level_1,Unnamed: 2_level_1
chevrolet chevelle malibu,70,307.0
chevrolet chevelle malibu,71,250.0


##### Example 3

In [108]:
# Location indexing: .iloc is primarily integer position based
df.iloc[2:5].iloc[:, -3:] # Returns rows 2,3,4, and then the last three columns of this subset

Unnamed: 0_level_0,acceleration,year,origin
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
plymouth satellite,11.0,70,1
amc rebel sst,12.0,70,1
ford torino,10.5,70,1


##### Example 4

In [115]:
# Boolean indexing
df[df['acceleration'] == 12][['acceleration', 'mpg']]

Unnamed: 0_level_0,acceleration,mpg
name,Unnamed: 1_level_1,Unnamed: 2_level_1
chevrolet chevelle malibu,12.0,18.0
amc rebel sst,12.0,16.0
chevrolet impala,12.0,14.0
ford country squire (sw),12.0,13.0
pontiac safari (sw),12.0,13.0
chevrolet impala,12.0,13.0
pontiac catalina,12.0,14.0
chevrolet caprice classic,12.0,13.0
ford mustang ii,12.0,13.0
chevy c10,12.0,13.0


All these examples are correct pandas code, but they are not _idiomatic_. Let us rewrite them in idiomatic pandas code.

In [116]:
# Example 1
df['origin'] # No need to subset a dataframe when the desired end result is a single column

name
chevrolet chevelle malibu            1
buick skylark 320                    1
plymouth satellite                   1
amc rebel sst                        1
ford torino                          1
ford galaxie 500                     1
chevrolet impala                     1
plymouth fury iii                    1
pontiac catalina                     1
amc ambassador dpl                   1
dodge challenger se                  1
plymouth 'cuda 340                   1
chevrolet monte carlo                1
buick estate wagon (sw)              1
toyota corona mark ii                3
plymouth duster                      1
amc hornet                           1
ford maverick                        1
datsun pl510                         3
volkswagen 1131 deluxe sedan         2
peugeot 504                          2
audi 100 ls                          2
saab 99e                             2
bmw 2002                             2
amc gremlin                          1
ford f250           

In [117]:
# Example 3
df.iloc[2:5, -3:]

Unnamed: 0_level_0,acceleration,year,origin
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
plymouth satellite,11.0,70,1
amc rebel sst,12.0,70,1
ford torino,10.5,70,1
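
Examples 2 and 4 can be rewritten along the same lines: `.loc` accepts a row selector and a column selector in a single call, and the row selector may be a list of labels or a boolean mask. A sketch on a small hand-typed sample with the same layout as the Auto dataset (three rows only, so the results are shorter than in the notebook output):

```python
import pandas as pd

# Small hand-typed sample with the same layout as the Auto dataset
df = pd.DataFrame(
    {
        'mpg': [18.0, 15.0, 16.0],
        'displacement': [307.0, 350.0, 304.0],
        'acceleration': [12.0, 11.5, 12.0],
        'year': [70, 70, 70],
    },
    index=pd.Index(
        ['chevrolet chevelle malibu', 'buick skylark 320', 'amc rebel sst'],
        name='name',
    ),
)

# Example 2: row labels and column labels in a single .loc call
ex2 = df.loc[['chevrolet chevelle malibu'], ['year', 'displacement']]

# Example 4: boolean mask for the rows, column labels for the columns
ex4 = df.loc[df['acceleration'] == 12, ['acceleration', 'mpg']]

print(ex2)
print(ex4)
```

Besides being shorter, a single `.loc` call avoids building an intermediate DataFrame.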


#### Value sorting

In [118]:
df.sort_values(by='acceleration', ascending=False) # Remember: this does not alter original df

Unnamed: 0_level_0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
peugeot 504,27.2,4,141.0,71,3190,24.8,79,2
vw pickup,44.0,4,97.0,52,2130,24.6,82,2
vw dasher (diesel),43.4,4,90.0,48,2335,23.7,80,2
volkswagen type 3,23.0,4,97.0,54,2254,23.5,72,2
chevrolet chevette,29.0,4,85.0,52,2035,22.2,76,1
oldsmobile cutlass salon brougham,23.9,8,260.0,90,3420,22.2,79,1
chevrolet woody,24.5,4,98.0,60,2164,22.1,76,1
peugeot 504,19.0,4,120.0,88,3270,21.9,76,2
mercedes-benz 240d,30.0,4,146.0,67,3250,21.8,80,2
vw rabbit c (diesel),44.3,4,90.0,48,2085,21.7,80,2


### pandas aggregations: the split-apply-combine framework

One of the greatest strengths of pandas is the speed with which it can perform aggregations on large DataFrames. The `groupby` operation comes straight from the SQL world. Such aggregations in pandas are usually understood in three distinct steps:
- The **split** step involves breaking up and grouping a DataFrame depending on the value of the specified key.
- The **apply** step involves computing some function, usually an aggregate, transformation, or filtering, within the individual groups.
- The **combine** step merges the results of these operations into an output.
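
To make the three steps concrete, the sketch below performs them by hand on a toy DataFrame and compares the result with a direct `groupby` call. The data and the names `pieces`, `means` and `combined` are invented for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'cylinders': [4, 4, 8, 8, 6],
    'mpg': [30.0, 28.0, 14.0, 16.0, 20.0],
})

# Split: one sub-DataFrame per distinct value of the key
pieces = {key: group for key, group in df.groupby('cylinders')}

# Apply: compute an aggregate within each piece
means = {key: group['mpg'].mean() for key, group in pieces.items()}

# Combine: collect the per-group results into a single Series
combined = pd.Series(means).sort_index()

# groupby performs all three steps in one expression
direct = df.groupby('cylinders')['mpg'].mean()
print(combined.tolist() == direct.tolist())  # True
```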

In [119]:
df.groupby('cylinders')['mpg'].mean()

cylinders
3    20.550000
4    29.283920
5    27.366667
6    19.973494
8    14.963107
Name: mpg, dtype: float64

In [120]:
# Another way to perform the operation above:
df.groupby('cylinders')['mpg'].agg('mean')

cylinders
3    20.550000
4    29.283920
5    27.366667
6    19.973494
8    14.963107
Name: mpg, dtype: float64

In [121]:
# Pass a list of function names to get several aggregates per column;
# a list (rather than a set in curly brackets) keeps the order of the output columns predictable
df.groupby('cylinders')[['mpg', 'weight']].agg(['mean', 'median'])

Unnamed: 0_level_0,mpg,mpg,weight,weight
Unnamed: 0_level_1,mean,median,mean,median
cylinders,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
3,20.55,20.25,2398.5,2375
4,29.28392,28.4,2305.110553,2230
5,27.366667,25.4,3103.333333,2950
6,19.973494,19.0,3202.120482,3210
8,14.963107,14.0,4114.718447,4140


In [122]:
# You can write short python functions to use as arguments in `apply` or `agg` for customized results
def add_10(s):
    return s + 10

In [129]:
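# add_10 returns one value per row, so this is a transformation rather than an aggregation.
# The original chain `.apply(add_10).apply(pd.Series)` raised
# "ValueError: cannot reindex from a duplicate axis", most likely because the `name` index contains duplicates.
new_df = df.groupby('cylinders')['mpg'].transform(add_10).to_frame()
# `to_frame()` turns the resulting Series into a one-column pd.DataFrame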

In [None]:
new_df.columns = ['inflated_score']
new_df

In [None]:
# Another example of groupby (using columns that exist in the Auto dataset)
df.groupby('cylinders')['weight'].sum().rename('Total Weight').reset_index() # reset_index brings back default indexing

There are three different `apply` instances in the split-apply-combine process:
- aggregate, if we want to get a single value for each group
- filter, if we want to get a subset of the input rows
- transform, if we want to get a new value for each input row
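
The difference between the three is easiest to see in the size of what comes back. A minimal sketch on a toy DataFrame (five rows, three groups), with thresholds picked arbitrarily for the example:

```python
import pandas as pd

df = pd.DataFrame({
    'cylinders': [4, 4, 8, 8, 6],
    'mpg': [30.0, 28.0, 14.0, 16.0, 20.0],
})
grouped = df.groupby('cylinders')

# aggregate: one value per group
per_group = grouped['mpg'].agg('mean')

# filter: keeps or drops whole groups, returning a subset of the input rows
kept = grouped.filter(lambda g: g['mpg'].max() > 18)

# transform: one value per input row, aligned with the original index
centered = grouped['mpg'].transform(lambda s: s - s.mean())

print(len(per_group), len(kept), len(centered))  # 3 3 5
```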

The `groupby` examples we have seen so far were instances of the `aggregate` method:

In [None]:
# Aggregate
df.groupby('cylinders')['mpg'].agg('mean')
# same as df.groupby('cylinders')['mpg'].mean()
# same as df.groupby('cylinders')['mpg'].aggregate('mean')

In [None]:
# Filter with lambda function (short function defined on the spot instead of `def...`)
# Keeps only the rows of cylinder groups whose best mpg exceeds 40
df.groupby('cylinders').filter(lambda x : x['mpg'].max() > 40)

In [130]:
# Transform - returns one value per input row; here each mpg is divided by the mean mpg of its cylinder group
df['scaled_mpg'] = df.groupby('cylinders')['mpg'].transform(lambda x : x/x.mean())
df

Unnamed: 0_level_0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,scaled_mpg
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
chevrolet chevelle malibu,18.0,8,307.0,130,3504,12.0,70,1,1.202959
buick skylark 320,15.0,8,350.0,165,3693,11.5,70,1,1.002466
plymouth satellite,18.0,8,318.0,150,3436,11.0,70,1,1.202959
amc rebel sst,16.0,8,304.0,150,3433,12.0,70,1,1.069297
ford torino,17.0,8,302.0,140,3449,10.5,70,1,1.136128
ford galaxie 500,15.0,8,429.0,198,4341,10.0,70,1,1.002466
chevrolet impala,14.0,8,454.0,220,4354,9.0,70,1,0.935635
plymouth fury iii,14.0,8,440.0,215,4312,8.5,70,1,0.935635
pontiac catalina,14.0,8,455.0,225,4425,10.0,70,1,0.935635
amc ambassador dpl,15.0,8,390.0,190,3850,8.5,70,1,1.002466


Pandas is a very powerful library for data manipulation and analysis in Python. Because the operations under the hood are **vectorized**, there is almost **NEVER** a need to iterate over a DataFrame using a `for` loop.
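
As a small illustration of the point about loops, the sketch below computes the same ratio twice, once row by row and once as a single column expression. The three rows are taken from the Auto data shown earlier; `weight_per_hp` is a column name invented here.

```python
import pandas as pd

df = pd.DataFrame({'weight': [3504, 3693, 3436], 'horsepower': [130, 165, 150]})

# Row-by-row loop: works, but runs in Python and is slow on large DataFrames
ratios_loop = []
for row in df.itertuples():
    ratios_loop.append(row.weight / row.horsepower)

# Vectorized: one expression operating on whole columns
df['weight_per_hp'] = df['weight'] / df['horsepower']

print(df)
```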

Understanding `pandas` and the underlying structures of `numpy` is a very important "prerequisite" before moving on to `scikit-learn` for predictive analytics. 

## III. scikit-learn

Scikit-learn is a Python module for machine learning, built on top of the numpy, scipy, and matplotlib modules. It comes loaded with many features and components, including:
 - Supervised learning algorithms
 - Unsupervised learning algorithms
 - Cross-validation
 - Feature extraction
 - Dimensionality reduction
 - Preprocessing
 - Model selection and evaluation
 
It's very well documented online and the official documentation can be found here: http://scikit-learn.org/stable/documentation.html


#### Load the data

In [3]:
# import pandas
import pandas as pd

# import the titanic file 
filepath = "d:/Users/PShirbhate/Downloads/Evolent-Python Training/titanic.csv"
titanic = pd.read_csv(filepath)
titanic.head()

Unnamed: 0,pid,pclass,survived,sex,age,sibsp,parch,fare,embarked
0,1,1st,1,female,29.0,0,0,211.337494,Southampton
1,2,1st,1,male,,1,2,151.550003,Southampton
2,3,1st,0,female,2.0,1,2,151.550003,Southampton
3,4,1st,0,male,30.0,1,2,151.550003,Southampton
4,5,1st,0,female,25.0,1,2,151.550003,Southampton


#### Preprocessing

http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing

Account for missing data

In [4]:
# count the rows where age is missing
titanic[titanic['age'].isnull()].shape

(264, 9)

In [5]:
# find exactly where the null values are
titanic.isnull().sum()

pid           0
pclass        0
survived      0
sex           0
age         264
sibsp         0
parch         0
fare          1
embarked      2
dtype: int64

In [6]:
# replace null values in the age column
age_mean = titanic['age'].mean()  # get the mean age used to fill in missing age values
titanic['age'] = titanic['age'].fillna(age_mean)  # assign back rather than calling fillna(inplace=True) on a selected column
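
The same mean imputation can also be expressed with scikit-learn's own preprocessing tools. The sketch below uses `SimpleImputer` from `sklearn.impute` on a made-up `ages` column; it is an alternative to the pandas approach, not what this notebook ran.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

ages = pd.DataFrame({'age': [29.0, np.nan, 2.0, 30.0, 25.0]})

imputer = SimpleImputer(strategy='mean')
filled = imputer.fit_transform(ages)   # 2D numpy array

print(imputer.statistics_)  # learned column mean(s)
print(filled.ravel())
```

An imputer object remembers the mean it learned, so it can be fitted on training data and then applied unchanged to test data.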

In [8]:
# drop the rows that still contain missing values
titanic.dropna(axis=0, inplace=True)

Replace the index with passenger ID

In [9]:
# set the index to the pid variable
titanic.set_index('pid', drop=True, inplace=True)
titanic.head()

Unnamed: 0_level_0,pclass,survived,sex,age,sibsp,parch,fare,embarked
pid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
1,1st,1,female,29.0,0,0,211.337494,Southampton
2,1st,1,male,29.908852,1,2,151.550003,Southampton
3,1st,0,female,2.0,1,2,151.550003,Southampton
4,1st,0,male,30.0,1,2,151.550003,Southampton
5,1st,0,female,25.0,1,2,151.550003,Southampton


Encode categorical variables

In [10]:
# create dummy variables for the categorical variables
titanic_new = pd.get_dummies(titanic)
titanic_new.head(10)

Unnamed: 0_level_0,survived,age,sibsp,parch,fare,pclass_1st,pclass_2nd,pclass_3rd,sex_female,sex_male,embarked_Cherbourg,embarked_Queenstown,embarked_Southampton
pid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
1,1,29.0,0,0,211.337494,1,0,0,1,0,0,0,1
2,1,29.908852,1,2,151.550003,1,0,0,0,1,0,0,1
3,0,2.0,1,2,151.550003,1,0,0,1,0,0,0,1
4,0,30.0,1,2,151.550003,1,0,0,0,1,0,0,1
5,0,25.0,1,2,151.550003,1,0,0,1,0,0,0,1
6,1,48.0,0,0,26.549999,1,0,0,0,1,0,0,1
7,1,63.0,1,0,77.958298,1,0,0,1,0,0,0,1
8,0,39.0,0,0,0.0,1,0,0,0,1,0,0,1
9,1,53.0,2,0,51.479198,1,0,0,1,0,0,0,1
10,0,71.0,0,0,49.5042,1,0,0,0,1,1,0,0
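
Note that columns such as `sex_female` and `sex_male` carry the same information, since one is always the complement of the other. If that redundancy is unwanted (it can matter for unregularized linear models), `pd.get_dummies` has a `drop_first` option. A sketch on a made-up three-row sample:

```python
import pandas as pd

passengers = pd.DataFrame({
    'sex': ['female', 'male', 'female'],
    'age': [29.0, 30.0, 25.0],
})

# drop_first=True drops the first level of each categorical column
encoded = pd.get_dummies(passengers, drop_first=True)
print(list(encoded.columns))  # ['age', 'sex_male']
```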


Split the explanatory and response variables

In [11]:
# create an array of the column names
titanic_names = titanic_new.columns.values
titanic_names

array(['survived', 'age', 'sibsp', 'parch', 'fare', 'pclass_1st',
       'pclass_2nd', 'pclass_3rd', 'sex_female', 'sex_male',
       'embarked_Cherbourg', 'embarked_Queenstown',
       'embarked_Southampton'], dtype=object)

In [12]:
# store the response and explanatory variables
y = titanic_new['survived']
x = titanic_new[titanic_names[1:]]
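
Selecting the explanatory variables by position relies on `survived` being the first column. A variant that does not depend on column order, sketched here on a made-up two-row frame, is to drop the response column by name:

```python
import pandas as pd

titanic_new = pd.DataFrame({
    'survived': [1, 0],
    'age': [29.0, 30.0],
    'fare': [211.34, 151.55],
})

y = titanic_new['survived']
x = titanic_new.drop(columns='survived')  # everything except the response

print(list(x.columns))  # ['age', 'fare']
```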

In [13]:
x.head()

Unnamed: 0_level_0,age,sibsp,parch,fare,pclass_1st,pclass_2nd,pclass_3rd,sex_female,sex_male,embarked_Cherbourg,embarked_Queenstown,embarked_Southampton
pid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
1,29.0,0,0,211.337494,1,0,0,1,0,0,0,1
2,29.908852,1,2,151.550003,1,0,0,0,1,0,0,1
3,2.0,1,2,151.550003,1,0,0,1,0,0,0,1
4,30.0,1,2,151.550003,1,0,0,0,1,0,0,1
5,25.0,1,2,151.550003,1,0,0,1,0,0,0,1


In [14]:
y.head()

pid
1    1
2    1
3    0
4    0
5    0
Name: survived, dtype: int64

#### Creating the training and testing dataset

Cross Validation

Learning the parameters of a prediction function and testing it on the same data is a methodological mistake: a model that simply repeated the labels of the samples it had just seen would achieve a perfect score but would fail to predict anything useful on yet-unseen data. This situation is called overfitting. To avoid it, it is common practice when performing a (supervised) machine learning experiment to hold out part of the available data as a test set `x_test`, `y_test`.
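
A minimal hold-out split on synthetic data might look like the sketch below. The `stratify=y` argument is an optional extra, not used elsewhere in this notebook, that keeps the class proportions the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 toy samples with 2 features
y = np.array([0] * 5 + [1] * 5)    # balanced binary labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
print(sorted(y_test.tolist()))      # [0, 1]
```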

In [15]:
# split the dataset into training and testing sets using train_test_split
# (sklearn.cross_validation was removed in scikit-learn 0.20; use sklearn.model_selection)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape )

(1044, 12) (1044,)
(262, 12) (262,)




#### Logistic Regression

Logistic regression, despite its name, is a linear model for classification rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function.
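
The logistic function itself is short enough to write out. The sketch below applies it to a few hypothetical linear scores (the kind of value a fitted model computes as weights times features plus intercept) and thresholds the result at 0.5.

```python
import numpy as np

def logistic(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

scores = np.array([-4.0, 0.0, 4.0])   # hypothetical linear scores
probabilities = logistic(scores)
print(probabilities)

# A probability above 0.5 corresponds to a positive score
predicted_class = (probabilities > 0.5).astype(int)
print(predicted_class)  # [0 0 1]
```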

In [16]:
# import linear model
from sklearn import linear_model

# create logistic regression object
log = linear_model.LogisticRegression()

In [17]:
# train the model using the training datasets
model = log.fit(x_train, y_train)

In [20]:
# make the predictions using the testing set
predictions = model.predict(x_test)
print(predictions)

[0 0 0 1 0 1 0 0 0 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0 0 0 0 0 1 0 1 0 1 1 0 1
 0 0 0 0 0 1 1 0 0 0 1 0 1 0 0 0 1 1 1 1 0 0 0 0 0 0 0 1 1 0 0 1 1 1 1 0 0
 0 1 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 1 1 1
 0 1 0 0 0 0 1 0 0 0 0 0 0 1 1 0 0 1 0 0 0 1 0 0 0 1 0 0 1 1 1 0 1 0 1 0 0
 0 0 0 1 0 0 0 1 1 1 0 1 1 0 0 1 1 0 0 1 1 0 0 0 1 0 1 0 0 0 0 1 1 0 1 0 0
 0 0 0 0 0 0 1 0 0 1 1 0 0 0 1 0 0 1 1 1 1 0 1 0 0 0 0 0 0 1 1 0 0 1 0 1 0
 0 0 1 0 1 1 0 0 0 0 0 1 0 0 1 1 0 0 0 1 0 1 1 1 1 0 0 0 0 0 0 1 1 1 1 1 1
 0 0 0]


In [21]:
# get the accuracy of the model
model.score(x_test, y_test)

0.7900763358778626

In [22]:
from sklearn.metrics import confusion_matrix
# store the result under a different name so the imported function is not overwritten
cm = confusion_matrix(y_test, predictions)
cm

array([[134,  23],
       [ 32,  73]], dtype=int64)
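
The usual summary metrics can be read directly off this matrix. The sketch below re-enters the four counts shown above by hand and derives accuracy, precision and recall for the "survived" class.

```python
import numpy as np

cm = np.array([[134, 23],
               [32, 73]])

# scikit-learn convention: rows are true classes, columns are predicted classes
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of the predicted survivors, how many survived
recall = tp / (tp + fn)      # of the actual survivors, how many were found

print(round(accuracy, 4), round(precision, 4), round(recall, 4))  # 0.7901 0.7604 0.6952
```

The accuracy agrees with the value returned by `model.score(x_test, y_test)` above.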

In [None]:
# create the classification report
from sklearn.metrics import classification_report
print(classification_report(y_test, predictions))

#### K-fold cross validation

Is this model any good? It helps to verify the effectiveness of the algorithm using KFold. This will split our data into k buckets, then run the algorithm using a different bucket as the test set for each iteration.
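
How the buckets are formed can be inspected with the `KFold` class from `sklearn.model_selection`. The sketch below splits six toy samples into three folds and prints the row positions used for training and testing in each iteration.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(12).reshape(6, 2)   # 6 toy samples

kf = KFold(n_splits=3)
test_indices = []
for train_idx, test_idx in kf.split(X):
    print('train:', train_idx, 'test:', test_idx)
    test_indices.extend(test_idx.tolist())
```

Every sample lands in the test bucket exactly once. Without shuffling, the folds follow the row order of the data, so a dataset sorted by some attribute can produce uneven fold scores.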

In [25]:
from sklearn.model_selection import cross_val_score, cross_val_predict  # replaces the removed sklearn.cross_validation module

# perform 10-fold cross validation
scores = cross_val_score(model, x, y, cv = 10)
print(scores)

[0.54961832 0.80916031 0.86259542 0.83206107 0.76335878 0.81679389
 0.70992366 0.65648855 0.65116279 0.75193798]


In [26]:
# get the mean accuracy of all of the models
scores.mean()

0.7403100775193799