# A Deep Dive into the Python Data Science Ecosystem

### The goal of today's lab is to introduce the Python libraries used in the data science process and to help you explore data with them

## The Scientific Python Ecosystem

![The Scientific Python Ecosystem](img/SciPy_ecosystem.png)

### NumPy lays the foundation for fast vectorized computation using n-dimensional arrays

* __A NumPy array is a homogeneous collection: every element has the same data type (dtype)__

Numpy is the core library for scientific computing in Python. It provides a high-performance multidimensional array object, and tools for working with these arrays. 

##### To use Numpy, we first need to import the numpy package:

In [5]:
import numpy as np

##### Arrays

A numpy array is a grid of values, all of the same type, indexed by a tuple of nonnegative integers.
The number of dimensions is the rank of the array (available as `ndim`); this is not the same thing as the linear-algebra rank of a matrix.
The shape of an array is a tuple of integers giving the size of the array along each dimension.

We can initialize numpy arrays from nested Python lists, and access elements using square brackets:

In [6]:
a = np.array([1, 2, 3])  # Create a rank 1 array
print (type(a))
print (a.shape)
print (a[0], a[1], a[2])
a[0] = 5                 # Change an element of the array
print (a)

<class 'numpy.ndarray'>
(3,)
1 2 3
[5 2 3]


![The Numpy Arrays](img/numpy_arrays.png)

* __Some supported dtypes in numpy arrays are given below__

![](img/numpy_dtypes.jpg)

## We will explore the following ideas

### Attributes of arrays

* __Determining the size, shape, memory consumption, and data types of arrays__

### Indexing of arrays
* __Getting and setting the value of individual array elements__

### Slicing of arrays
* __Getting and setting smaller subarrays within a larger array__

### Reshaping of arrays
* __Changing the shape of a given array__
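
The attributes listed above can be inspected directly on any array. The following is a minimal sketch; the array `arr` is an illustrative example and not part of the lab data.

```python
import numpy as np

# Hypothetical 3x4 array used only to illustrate the attributes.
arr = np.arange(12, dtype=np.int64).reshape(3, 4)

print(arr.ndim)      # number of dimensions (the "rank" in this lab's wording)
print(arr.shape)     # size along each dimension
print(arr.size)      # total number of elements
print(arr.dtype)     # element data type
print(arr.itemsize)  # bytes per element
print(arr.nbytes)    # total bytes = size * itemsize
```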

In [7]:
b = np.array([[1,2,3],[4,5,6]])   # Create a rank 2 array
print (b)

[[1 2 3]
 [4 5 6]]


In [8]:
print (b.shape)
print (b[(0, 0)], b[0, 1], b[1, 0])

(2, 3)
1 2 4


##### Array indexing

Numpy offers several ways to index into arrays.

Slicing: Similar to Python lists, numpy arrays can be sliced. Since arrays may be multidimensional, you must specify a slice for each dimension of the array:


In [9]:
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])
print (a)

[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]


In [15]:
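# Finding the linear-algebra rank of a matrix (number of linearly independent rows),
# which is different from the number of array dimensions
b = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])
np.linalg.matrix_rank(b)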

2

In [10]:
b = a[:2, 1:3]
print (b)

[[2 3]
 [6 7]]


In [13]:
a[:2,1:4]

array([[2, 3, 4],
       [6, 7, 8]])
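
One detail worth keeping in mind when setting values through a slice: basic slicing returns a view onto the original data rather than a copy. The sketch below illustrates this; the names `view` and `independent` are illustrative.

```python
import numpy as np

a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

view = a[:2, 1:3]                 # basic slicing returns a view onto a's data
view[0, 0] = 99                   # writing through the view also changes a
print(a[0, 1])

independent = a[:2, 1:3].copy()   # an explicit copy owns its own data
independent[0, 0] = -1            # a is left untouched by this assignment
print(a[0, 1])
```

If the original array must stay unchanged, calling `.copy()` on the slice is the usual way to get independent data.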

#### ACTIVITY:
Create a 3X4 array as follows:


In [14]:
a = np.array([[1,2,3,5], [10,20,30,50], [100,200,300,500]])
a[1:3,1:3]

array([[ 20,  30],
       [200, 300]])

### Conditional Indexing

Boolean array indexing: Boolean array indexing lets you pick out arbitrary elements of an array. Frequently this type of indexing is used to select the elements of an array that satisfy some condition. Here is an example:

In [44]:
a = np.array([[1,2], [3, 4], [5, 6]])

bool_idx = (a > 2)  # Find the elements of a that are bigger than 2;
                    # this returns a numpy array of Booleans of the same
                    # shape as a, where each slot of bool_idx tells
                    # whether that element of a is > 2.

print (bool_idx)

[[False False]
 [ True  True]
 [ True  True]]


In [45]:
# We use boolean array indexing to construct a rank 1 array
# consisting of the elements of a corresponding to the True values
# of bool_idx
print (a[bool_idx])

# We can do all of the above in a single concise statement:
print (a[a > 2])

[3 4 5 6]
[3 4 5 6]


In [84]:
print (a[(a > 2) & (a<10)])  # combine conditions with &; each condition needs its own parentheses

[3 4 5 6]
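
Boolean masks can be combined and can also appear on the left-hand side of an assignment. This is a small sketch under the assumption that we reuse the 3x2 array from the conditional indexing example; `mask` and `clipped` are illustrative names.

```python
import numpy as np

a = np.array([[1, 2], [3, 4], [5, 6]])

# Combine conditions with & (and), | (or), ~ (not). Parentheses are required
# because these operators bind more tightly than comparisons.
mask = (a > 2) & (a % 2 == 0)
print(a[mask])             # elements that are greater than 2 and even

# A mask can also be used to set values in place.
clipped = a.copy()
clipped[clipped > 4] = 4
print(clipped)
```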


In [46]:
x = np.array([1, 2])  # Let numpy choose the datatype
y = np.array([1.0, 2.0])  # Let numpy choose the datatype
z = np.array([1, 2], dtype=np.int64)  # Force a particular datatype

print (x.dtype, y.dtype, z.dtype)

int64 float64 int64


#### ACTIVITY:

From a numpy array, extract only those elements that are divisible by 2 (the solution below uses a new array `x`)


In [18]:
x = np.array([2, 4, 6, 3, 5])
x[x%2==0]

array([2, 4, 6])

### Min-Max scaling using numpy

In [89]:
import numpy as np

# Create a numpy array that starts at 0 and ends at 99

A = np.arange(100)

print(A)

[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
 96 97 98 99]


In [90]:
Amax, Amin = A.max(), A.min()

print(Amax, Amin)

99 0


In [91]:
# range (min-max) scaling: maps the smallest value to 0 and the largest to 1
Ascaled = (A - Amin)/(Amax - Amin)

print(Ascaled)

[0.         0.01010101 0.02020202 0.03030303 0.04040404 0.05050505
 0.06060606 0.07070707 0.08080808 0.09090909 0.1010101  0.11111111
 0.12121212 0.13131313 0.14141414 0.15151515 0.16161616 0.17171717
 0.18181818 0.19191919 0.2020202  0.21212121 0.22222222 0.23232323
 0.24242424 0.25252525 0.26262626 0.27272727 0.28282828 0.29292929
 0.3030303  0.31313131 0.32323232 0.33333333 0.34343434 0.35353535
 0.36363636 0.37373737 0.38383838 0.39393939 0.4040404  0.41414141
 0.42424242 0.43434343 0.44444444 0.45454545 0.46464646 0.47474747
 0.48484848 0.49494949 0.50505051 0.51515152 0.52525253 0.53535354
 0.54545455 0.55555556 0.56565657 0.57575758 0.58585859 0.5959596
 0.60606061 0.61616162 0.62626263 0.63636364 0.64646465 0.65656566
 0.66666667 0.67676768 0.68686869 0.6969697  0.70707071 0.71717172
 0.72727273 0.73737374 0.74747475 0.75757576 0.76767677 0.77777778
 0.78787879 0.7979798  0.80808081 0.81818182 0.82828283 0.83838384
 0.84848485 0.85858586 0.86868687 0.87878788 0.88888889 0.89898

* __Why does this work?__

### Vectorized Operations in Numpy

![](img/numpy_vectorized_computation.jpg)
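
One way to see why the scaling expression works: `A - Amin` subtracts the scalar from every element (broadcasting), and the division is again applied element by element. The sketch below compares this with an explicit Python loop; `scaled_loop`, `scaled_vec`, `M` and `col_scaled` are illustrative names.

```python
import numpy as np

A = np.arange(100)
Amin, Amax = A.min(), A.max()

# Explicit loop: one Python-level operation per element.
scaled_loop = np.array([(x - Amin) / (Amax - Amin) for x in A])

# Vectorized: the scalars are broadcast across the whole array.
scaled_vec = (A - Amin) / (Amax - Amin)

print(np.allclose(scaled_loop, scaled_vec))

# Broadcasting also works per column of a 2-D array:
# M.min(axis=0) has shape (2,) and is stretched across all rows.
M = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
col_scaled = (M - M.min(axis=0)) / (M.max(axis=0) - M.min(axis=0))
print(col_scaled)
```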

### Reshaping Numpy arrays

In [50]:
print (a)

[[1 2]
 [3 4]
 [5 6]]


In [51]:
a.shape

(3, 2)

In [52]:
a_reshape = a.reshape((2,3))
print (a_reshape)

[[1 2 3]
 [4 5 6]]


In [53]:
a_reshape.shape

(2, 3)
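
Two further points about `reshape` are sketched below: a dimension given as `-1` is inferred by numpy, and the total number of elements must stay the same. The names `flat` and `wide` are illustrative.

```python
import numpy as np

a = np.array([[1, 2], [3, 4], [5, 6]])

flat = a.reshape(-1)        # -1 lets numpy infer the length: 6 elements
wide = a.reshape(2, -1)     # 2 rows, column count inferred as 3
print(flat.shape, wide.shape)

# The element count must match; 6 values cannot fill a 4x2 array.
try:
    a.reshape(4, 2)
except ValueError as err:
    print("reshape failed:", err)
```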

# Understanding Pandas

## Pandas has two core data structures: the Series and the DataFrame

### A pandas series object is like an ordered dictionary, so you can access elements by their position as well as labels

### A pandas series object is analogous to a named list in R

In [92]:
import pandas as pd

dataFromList = pd.Series([0.25, 0.5, 0.75, "Make India Great Again!"], index=['a', 'b', 'c', 'd'])

print(dataFromList)

a                       0.25
b                        0.5
c                       0.75
d    Make India Great Again!
dtype: object


In [94]:
trial_list = pd.Series([0.25, 0.5, 0.75, "Make India Great Again!"])

print(trial_list)

0                       0.25
1                        0.5
2                       0.75
3    Make India Great Again!
dtype: object


In [99]:
dataFromDict = pd.Series({"a" : 0.25, "b" :  0.5, "c" : 0.75, "d" : "Make India Great Again!"})

print(dataFromDict)

a                       0.25
b                        0.5
c                       0.75
d    Make India Great Again!
dtype: object


In [102]:
student_info = pd.Series({"batch" : 47, "percentage" :  [10,10], "name" : "Amith"})

print(student_info)

batch               47
percentage    [10, 10]
name             Amith
dtype: object


* __Below we access the same element from the pandas series object using both the index as well as the position of the element in the series__

In [103]:
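print("Using the Index: " + str(dataFromDict['d']))

# .iloc makes the positional lookup explicit; plain [3] on a label-indexed Series is deprecated in recent pandas
print("Using the Position: " + str(dataFromDict.iloc[3]))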

Using the Index: Make India Great Again!
Using the Position: Make India Great Again!
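
To keep label-based and position-based access apart, the `.loc` and `.iloc` accessors can be used on a Series as well. The following sketch uses two small illustrative Series, `s` and `t`, that are not part of the lab data.

```python
import pandas as pd

s = pd.Series([0.25, 0.5, 0.75], index=["a", "b", "c"])
print(s.loc["c"])    # label-based access
print(s.iloc[2])     # position-based access

# With integer labels, plain [] looks up the label, not the position.
t = pd.Series(["x", "y", "z"], index=[10, 20, 30])
print(t[10])         # label 10
print(t.iloc[0])     # position 0
```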


In [105]:
print (" My Name is: ",student_info["name"])
print (" My Name is: ",student_info.iloc[2])   # positional access via .iloc
print (" My Percentage is: ",student_info["percentage"])

 My Name is:  Amith
 My Name is:  Amith
 My Percentage is:  [10, 10]


## Pandas DataFrames

* __The Pandas DataFrame object is a combination of labelled series objects__

In [108]:
sampleDataFromDictOfDicts = pd.DataFrame({'Batch': { "Akhil" : 25, "Tarun": 27, "Chetan": 30, "Thomas": 32, "Manoj": 35},
                            'Scholarship_winners': { "Akhil" : 5, "Tarun": 4, "Chetan": 3, "Thomas": 5, "Manoj": 5}})

sampleDataFromDictOfDicts

sampleDataFromDictOfDicts = pd.DataFrame({'Batch': { "Akhil" : 25, "Tarun": 27, "Chetan": 30, "Thomas": 32, "Manoj": 35},
                            'Scholarship_winners': { "Akhil" : 5, "Chetan": 3,  "Tarun": 4, "Thomas": 5, "Manoj": 5, "Amith": 50}})
sampleDataFromDictOfDicts

Unnamed: 0,Batch,Scholarship_winners
Akhil,25.0,5
Amith,,50
Chetan,30.0,3
Manoj,35.0,5
Tarun,27.0,4
Thomas,32.0,5
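
The output above shows two effects: the columns are aligned on the union of their row labels, and `Batch` becomes a float column because the missing entry is stored as NaN. A minimal sketch with two illustrative Series, `batch` and `winners`:

```python
import pandas as pd

batch = pd.Series({"Akhil": 25, "Tarun": 27})
winners = pd.Series({"Akhil": 5, "Tarun": 4, "Amith": 50})

# Columns are aligned on the union of their index labels.
df = pd.DataFrame({"Batch": batch, "Scholarship_winners": winners})
print(df)
print(df.dtypes)   # Batch becomes float64 because NaN is a float
```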


In [25]:
sampleDataFromDict = pd.DataFrame({'Batch': [25, 27, 30, 32, 35],
                                   'Scholarship_winners': [5,4,3,5,5]})

sampleDataFromDict

Unnamed: 0,Batch,Scholarship_winners
0,25,5
1,27,4
2,30,3
3,32,5
4,35,5


In [26]:
sampleDataFromList = pd.DataFrame([[25, 5], [27, 4], [30, 3], [32, 5], [35, 5]],
                                  columns = ["Batch", "Scholarship_winners"],
                                  index=["Akhil", "Tarun", "Chetan", "Thomas", "Manoj"])

sampleDataFromList

Unnamed: 0,Batch,Scholarship_winners
Akhil,25,5
Tarun,27,4
Chetan,30,3
Thomas,32,5
Manoj,35,5


In [111]:
me_scholar = pd.DataFrame([[47, 5], [46, 4], [45, 3], [43, 5], [42, 5]],
                                  columns = ["Batch", "Scholarship_winners"],
                                  index=["Me", "Me", "Me", "Me", "Me"])

me_scholar

sample  = pd.DataFrame([1,2,3,4,4], index=["ha","hu","xyz","qrw","kjhg"])
sample

Unnamed: 0,0
ha,1
hu,2
xyz,3
qrw,4
kjhg,4


# Going From Pandas to Numpy and Back

* __Some libraries expect their input as a numpy array, so you need to be comfortable converting pandas DataFrames to numpy arrays and back__

* __When you convert a pandas DataFrame to a numpy array you lose the index and column labels, and sometimes also the additional data types that pandas supports (e.g. datetime)__

In [112]:
sampleNpData = sampleDataFromList.values # Pandas to Numpy Array

sampleNpData

array([[25,  5],
       [27,  4],
       [30,  3],
       [32,  5],
       [35,  5]])
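
The sketch below illustrates what is lost and kept in the conversion. It uses `DataFrame.to_numpy()`, which pandas documents as the recommended alternative to `.values`; the small frame `df` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"Batch": [25, 27], "Scholarship_winners": [5, 4]},
                  index=["Akhil", "Tarun"])

arr = df.to_numpy()           # labels are dropped; only the values remain
print(arr.dtype, arr.shape)

# Mixed column types fall back to a generic object array.
mixed = df.assign(Name=["Akhil", "Tarun"]).to_numpy()
print(mixed.dtype)

# Rebuilding the DataFrame requires passing the labels back in.
restored = pd.DataFrame(arr, columns=df.columns, index=df.index)
print(restored.equals(df))
```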

In [123]:
print(sampleDataFromList.columns)
print(sampleDataFromList.index)
samplePdData = pd.DataFrame(sampleNpData, columns=sampleDataFromList.columns, 
                            index=sampleDataFromList.index)

samplePdData

Index(['Batch', 'Scholarship_winners'], dtype='object')
Index(['Akhil', 'Tarun', 'Chetan', 'Thomas', 'Manoj'], dtype='object')


Unnamed: 0,Batch,Scholarship_winners
Akhil,25,5
Tarun,27,4
Chetan,30,3
Thomas,32,5
Manoj,35,5


# Accessing Elements from a Pandas DataFrame

# iloc vs loc

### __.iloc lets you access the elements using the position of the elements__

![](img/pd_iloc.jpg)

In [34]:
allColumns = samplePdData.iloc[3, : ]
print("All Columns: " + "\n---------------------------------\n" + str(allColumns))

oneColumn = samplePdData.iloc[3, :1 ]

print("\n" + "Only the First Column: " + "\n---------------------------------\n" + str(oneColumn))

twoColumns = samplePdData.iloc[3, :2 ]

print("\n" + "Till The Second Column: " + "\n---------------------------------\n" + str(twoColumns))

All Columns: 
---------------------------------
Batch                  32
Scholarship_winners     5
Name: Thomas, dtype: int64

Only the First Column: 
---------------------------------
Batch    32
Name: Thomas, dtype: int64

Till The Second Column: 
---------------------------------
Batch                  32
Scholarship_winners     5
Name: Thomas, dtype: int64


In [126]:
# Only the value of the last expression in a cell is displayed
samplePdData.iloc[2:4,]
samplePdData.iloc[2:4,:1]
samplePdData.iloc[2:4,:]

Unnamed: 0,Batch,Scholarship_winners
Chetan,30,3
Thomas,32,5


### __.loc lets you access the elements using the column and the index labels of the elements__

![](img/pd_loc.jpg)

In [128]:
samplePdData.loc["Chetan", :"Scholarship_winners"]  # label slices include the end label

Batch                  30
Scholarship_winners     3
Name: Chetan, dtype: int64

In [129]:
samplePdData.loc["Chetan", :"Batch"]

Batch    30
Name: Chetan, dtype: int64

In [131]:
samplePdData.loc["Thomas",]

Batch                  32
Scholarship_winners     5
Name: Thomas, dtype: int64

In [132]:
samplePdData.loc["Thomas"]

Batch                  32
Scholarship_winners     5
Name: Thomas, dtype: int64
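
A difference that is easy to miss: `.iloc` slices exclude the end position, while `.loc` slices include the end label. The sketch below rebuilds the same five-row frame under the illustrative name `df` to show both.

```python
import pandas as pd

df = pd.DataFrame([[25, 5], [27, 4], [30, 3], [32, 5], [35, 5]],
                  columns=["Batch", "Scholarship_winners"],
                  index=["Akhil", "Tarun", "Chetan", "Thomas", "Manoj"])

by_position = df.iloc[2:4, :1]                   # rows 2 and 3, column 0; end excluded
by_label = df.loc["Chetan":"Thomas", :"Batch"]   # both end labels included

print(by_position.equals(by_label))
print(df.loc["Chetan", :"Scholarship_winners"].tolist())
```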

## Indexing using [ ] with a single column label returns a Pandas Series Object

In [133]:
samplePdData['Batch']

Akhil     25
Tarun     27
Chetan    30
Thomas    32
Manoj     35
Name: Batch, dtype: int64

In [135]:
samplePdData['Batch'].mean()

29.8

* __You might come across this method of indexing when you want to conditionally access elements from a pandas DataFrame__

In [65]:
samplePdData.loc[samplePdData['Batch'] > 28, : ] # Here we access only those rows where the Batch number is higher than 28

Unnamed: 0,Batch,Scholarship_winners
Chetan,30,3
Thomas,32,5
Manoj,35,5


In [136]:
samplePdData.loc[samplePdData['Batch'] > 28]

Unnamed: 0,Batch,Scholarship_winners
Chetan,30,3
Thomas,32,5
Manoj,35,5
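
Several conditions can be combined in the same way as with numpy masks, and `DataFrame.query` offers a string form of the same filter. A minimal sketch, with `df`, `mask`, `selected` and `same` as illustrative names:

```python
import pandas as pd

df = pd.DataFrame({"Batch": [25, 27, 30, 32, 35],
                   "Scholarship_winners": [5, 4, 3, 5, 5]},
                  index=["Akhil", "Tarun", "Chetan", "Thomas", "Manoj"])

# Combine row conditions with & / | and pick columns in the same .loc call.
mask = (df["Batch"] > 28) & (df["Scholarship_winners"] == 5)
selected = df.loc[mask, ["Batch"]]
print(selected)

# DataFrame.query expresses the same row filter as a string.
same = df.query("Batch > 28 and Scholarship_winners == 5")[["Batch"]]
print(selected.equals(same))
```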


## GroupBy in Pandas

### Finding the mean age at each Insofe Branch

In [66]:
cityData = pd.DataFrame( {"Name" : ["Akhil", "Tarun", "Chetan", "Thomas", "Manoj" , "Jeevan"] , 
                    "City" : ["Bangalore", "Hyderabad", "Bangalore", "Hyderabad", "Bangalore", "Hyderabad"],
                    "Age" : [25, 25, 23, 26, 27, 40]})
cityData

Unnamed: 0,Name,City,Age
0,Akhil,Bangalore,25
1,Tarun,Hyderabad,25
2,Chetan,Bangalore,23
3,Thomas,Hyderabad,26
4,Manoj,Bangalore,27
5,Jeevan,Hyderabad,40


* __Applying the GroupBy method returns a groupby object__

In [67]:
cityData.groupby(by = "City")

<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x112e84c50>

* __We can view the structure by converting it into a list__

In [68]:
list(cityData.groupby(by = "City"))

[('Bangalore',      Name       City  Age
  0   Akhil  Bangalore   25
  2  Chetan  Bangalore   23
  4   Manoj  Bangalore   27), ('Hyderabad',      Name       City  Age
  1   Tarun  Hyderabad   25
  3  Thomas  Hyderabad   26
  5  Jeevan  Hyderabad   40)]

* __You can perform operations on a groupby object to return a DataFrame__

In [69]:
cityData.groupby(by = "City").mean(numeric_only=True) # mean of the numeric columns only; recent pandas raises an error on text columns otherwise

Unnamed: 0_level_0,Age
City,Unnamed: 1_level_1
Bangalore,25.0
Hyderabad,30.333333


In [70]:
cityData.groupby(by = "City").count()

Unnamed: 0_level_0,Name,Age
City,Unnamed: 1_level_1,Unnamed: 2_level_1
Bangalore,3,3
Hyderabad,3,3


* __We can also apply any function to a groupby object using the apply method__

![](img/pd_apply.png)

In [71]:
cityData.groupby(by = "City").apply(sum)

Unnamed: 0_level_0,Name,City,Age
City,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Bangalore,AkhilChetanManoj,BangaloreBangaloreBangalore,75
Hyderabad,TarunThomasJeevan,HyderabadHyderabadHyderabad,91
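
When different columns need different summaries, named aggregation with `agg` is often easier to read than `apply`. The sketch below reuses the `cityData` frame; the output column names `mean_age`, `oldest` and `headcount` are illustrative choices.

```python
import pandas as pd

cityData = pd.DataFrame({
    "Name": ["Akhil", "Tarun", "Chetan", "Thomas", "Manoj", "Jeevan"],
    "City": ["Bangalore", "Hyderabad", "Bangalore", "Hyderabad", "Bangalore", "Hyderabad"],
    "Age": [25, 25, 23, 26, 27, 40],
})

# Named aggregation: one output column per (input column, function) pair.
summary = cityData.groupby("City").agg(
    mean_age=("Age", "mean"),
    oldest=("Age", "max"),
    headcount=("Name", "count"),
)
print(summary)
```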


### Get and Set Working Directories

In [35]:
import os

In [36]:
os.getcwd() # Print your current working directory

'/Users/amithprasad/repos/insofe_data_sc/20180826/20180826_Batch47_CSE7212c_Python_Intro/Python_Data_Science'

In [37]:
#os.chdir("C:\\Users\\Gaurang\\Desktop\\Batch_47\\CSE7212c\\python_data_science\\") 
os.chdir("/Users/amithprasad/repos/insofe_data_sc/20180826/20180826_Batch47_CSE7212c_Python_Intro/Python_Data_Science") 
# Change the working directory to the desired path

### Read CSV Files

In [137]:
grade = pd.read_csv("Grade1.csv",header=0)

grade

Unnamed: 0,StudentId,English1,Math1,Science1,OverallPct1
0,1,67,78,30,58
1,2,52,72,67,64
2,3,66,56,55,59
3,4,30,42,44,39
4,5,56,53,48,52
5,6,75,53,77,68
6,7,80,47,37,55
7,8,68,75,30,58
8,9,48,36,68,51
9,10,62,34,54,50


### Activity

### Read in the "Grade2.csv" file

In [50]:
grade2 = pd.read_csv("Grade2.csv",header=0)

grade2

Unnamed: 0,StudentId,English2,Math2,Science2,OverallPct2
0,1,62,54,61,59
1,2,43,75,78,65
2,3,52,35,56,48
3,4,64,50,43,52
4,5,49,43,37,43
5,6,65,80,31,59
6,7,48,68,61,59
7,8,48,36,36,40
8,9,65,58,55,59
9,10,74,38,32,48


### Dataset Understanding, Summary

#### Head and Tail Methods

In [51]:
grade.head()

Unnamed: 0,StudentId,English1,Math1,Science1,OverallPct1
0,1,67,78,30,58
1,2,52,72,67,64
2,3,66,56,55,59
3,4,30,42,44,39
4,5,56,53,48,52


In [52]:
grade.tail()

Unnamed: 0,StudentId,English1,Math1,Science1,OverallPct1
15,16,48,58,57,54
16,17,36,37,40,38
17,18,72,74,62,69
18,19,58,56,33,49
19,20,44,57,66,56


#### Data Frame Shape

In [53]:
grade.shape

(20, 5)

#### Summary of the dataset

In [54]:
grade.describe()

Unnamed: 0,StudentId,English1,Math1,Science1,OverallPct1
count,20.0,20.0,20.0,20.0,20.0
mean,10.5,60.15,54.6,52.65,55.85
std,5.91608,13.899167,13.188512,15.89861,8.975258
min,1.0,30.0,34.0,30.0,38.0
25%,5.75,51.0,45.75,39.25,50.75
50%,10.5,64.0,55.0,54.0,56.0
75%,15.25,70.25,59.75,66.25,61.0
max,20.0,80.0,78.0,77.0,70.0


#### Data Types of each of the columns

In [55]:
grade.dtypes

StudentId      int64
English1       int64
Math1          int64
Science1       int64
OverallPct1    int64
dtype: object

#### Row Names (Referred to as Index)

In [56]:
grade.index.values

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])

#### Column Names (Referred to as Columns)

In [57]:
grade.columns.values

array(['StudentId', 'English1', 'Math1', 'Science1', 'OverallPct1'],
      dtype=object)

### Data Types and Conversions in Pandas Data Frames

#### Print the datatypes of all columns of a data frame

In [58]:
grade.dtypes

StudentId      int64
English1       int64
Math1          int64
Science1       int64
OverallPct1    int64
dtype: object

#### Convert the Data Type of elements stored in a column

In [59]:
# The .astype() method can be used to convert column data types

grade['English1'] = grade['English1'].astype('float32')

In [139]:
grade.dtypes

StudentId        int64
English1       float32
Math1            int64
Science1         int64
OverallPct1      int64
dtype: object

#### The most interesting of the data types in pandas is the "category" data type

* It is analogous to the factors data type in R

In [151]:
grade['StudentId'].astype('category')

0      1
1      2
2      3
3      4
4      5
5      6
6      7
7      8
8      9
9     10
10    11
11    12
12    13
13    14
14    15
15    16
16    17
17    18
18    19
19    20
Name: StudentId, dtype: category
Categories (20, int64): [1, 2, 3, 4, ..., 17, 18, 19, 20]
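
Internally, a categorical column stores each distinct level once and keeps an integer code per row. The sketch below shows this on a small illustrative Series, `cities`, rather than on the grade data.

```python
import pandas as pd

cities = pd.Series(["Bangalore", "Hyderabad", "Bangalore", "Hyderabad"])
as_cat = cities.astype("category")

print(as_cat.cat.categories.tolist())  # the distinct levels
print(as_cat.cat.codes.tolist())       # integer code stored for each row
```

Note that `astype` returns a new object; to keep the converted column it has to be assigned back to the DataFrame.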

#### Setting / Changing the values of elements in the pandas data frame

In [152]:
grade[ grade['English1'] == 80 ] = np.nan  # sets every column of the matching rows to NaN

In [153]:
grade.loc[2,'Science1'] = np.nan

### Handling Missing Values 

In [154]:
grade.head()

Unnamed: 0,StudentId,English1,Math1,Science1,OverallPct1
0,1.0,67.0,78.0,30.0,58.0
1,2.0,52.0,72.0,67.0,64.0
2,3.0,66.0,56.0,,59.0
3,4.0,30.0,42.0,44.0,39.0
4,5.0,56.0,53.0,48.0,52.0


In [159]:
grade.StudentId.dtype

dtype('float64')

In [155]:
grade.isnull().sum()

StudentId      1
English1       1
Math1          1
Science1       2
OverallPct1    1
dtype: int64

#### Drop NA Values

In [156]:
grade_no_na = grade.dropna()

In [157]:
grade_no_na.isnull().sum()

StudentId      0
English1       0
Math1          0
Science1       0
OverallPct1    0
dtype: int64

In [158]:
grade_no_na.head()

Unnamed: 0,StudentId,English1,Math1,Science1,OverallPct1
0,1.0,67.0,78.0,30.0,58.0
1,2.0,52.0,72.0,67.0,64.0
3,4.0,30.0,42.0,44.0,39.0
4,5.0,56.0,53.0,48.0,52.0
5,6.0,75.0,53.0,77.0,68.0


#### Mean value Imputation

In [69]:
mean_values_to_impute = { k: i for k, i in zip(grade.columns.values, grade.mean().values)}

mean_values_to_impute

{'StudentId': 10.68421052631579,
 'English1': 59.10526315789474,
 'Math1': 55.0,
 'Science1': 53.388888888888886,
 'OverallPct1': 55.89473684210526}

In [70]:
# central imputation
grade_mean_imputation = grade.fillna(mean_values_to_impute)

In [71]:
grade_mean_imputation.isnull().sum()

StudentId      0
English1       0
Math1          0
Science1       0
OverallPct1    0
dtype: int64
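
`fillna` also accepts a Series keyed by column name, so the result of `mean()` can be passed in directly instead of building a dictionary first. A minimal sketch on an illustrative two-column frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Math": [50.0, np.nan, 70.0],
                   "Science": [np.nan, 40.0, 60.0]})

# Each column's missing values are replaced by that column's own mean.
imputed = df.fillna(df.mean())
print(imputed)
```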

In [72]:
grade.describe()

Unnamed: 0,StudentId,English1,Math1,Science1,OverallPct1
count,19.0,19.0,19.0,18.0,19.0
mean,10.684211,59.105263,55.0,53.388889,55.894737
std,6.018976,13.449059,13.424687,16.346033,9.21891
min,1.0,30.0,34.0,30.0,38.0
25%,5.5,50.0,45.0,41.0,50.5
50%,11.0,62.0,56.0,54.0,56.0
75%,15.5,69.0,61.5,66.75,62.0
max,20.0,78.0,78.0,77.0,70.0


### Binning Data

In [162]:
bins = [30 , 40, 60, 80,100]

group_names = ['LowGrade', 'AverageGrade', 'GoodGrade', 'ExcellentGrade']

grade['Grades'] = pd.cut(grade['OverallPct1'], bins, labels = group_names)

In [163]:
grade.head()

Unnamed: 0,StudentId,English1,Math1,Science1,OverallPct1,Grades
0,1.0,67.0,78.0,30.0,58.0,AverageGrade
1,2.0,52.0,72.0,67.0,64.0,GoodGrade
2,3.0,66.0,56.0,,59.0,AverageGrade
3,4.0,30.0,42.0,44.0,39.0,LowGrade
4,5.0,56.0,53.0,48.0,52.0,AverageGrade
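
By default `pd.cut` builds intervals that are open on the left and closed on the right, so a value equal to the lowest bin edge falls outside every bin. The sketch below shows this on illustrative scores and how `include_lowest=True` changes it.

```python
import pandas as pd

scores = pd.Series([30, 40, 41, 60, 100])
bins = [30, 40, 60, 80, 100]
labels = ["LowGrade", "AverageGrade", "GoodGrade", "ExcellentGrade"]

# Default intervals are (30, 40], (40, 60], ... so 30 itself is not binned.
default_cut = pd.cut(scores, bins, labels=labels)
print(default_cut.tolist())

# include_lowest=True closes the first interval on the left as well.
closed_cut = pd.cut(scores, bins, labels=labels, include_lowest=True)
print(closed_cut.tolist())
```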


#### Summary of dataset along with categorical data

In [164]:
grade.describe()

Unnamed: 0,StudentId,English1,Math1,Science1,OverallPct1
count,19.0,19.0,19.0,18.0,19.0
mean,10.684211,59.105263,55.0,53.388889,55.894737
std,6.018976,13.449059,13.424687,16.346033,9.21891
min,1.0,30.0,34.0,30.0,38.0
25%,5.5,50.0,45.0,41.0,50.5
50%,11.0,62.0,56.0,54.0,56.0
75%,15.5,69.0,61.5,66.75,62.0
max,20.0,78.0,78.0,77.0,70.0


In [165]:
grade.describe(include = 'all')

Unnamed: 0,StudentId,English1,Math1,Science1,OverallPct1,Grades
count,19.0,19.0,19.0,18.0,19.0,19
unique,,,,,,3
top,,,,,,AverageGrade
freq,,,,,,12
mean,10.684211,59.105263,55.0,53.388889,55.894737,
std,6.018976,13.449059,13.424687,16.346033,9.21891,
min,1.0,30.0,34.0,30.0,38.0,
25%,5.5,50.0,45.0,41.0,50.5,
50%,11.0,62.0,56.0,54.0,56.0,
75%,15.5,69.0,61.5,66.75,62.0,


### Dummify Categorical Data

In [166]:
grade['Grades'].dtype

CategoricalDtype(categories=['LowGrade', 'AverageGrade', 'GoodGrade', 'ExcellentGrade'], ordered=True)

In [167]:
dummified_grades = pd.get_dummies(grade)

In [168]:
dummified_grades.head()

Unnamed: 0,StudentId,English1,Math1,Science1,OverallPct1,Grades_LowGrade,Grades_AverageGrade,Grades_GoodGrade,Grades_ExcellentGrade
0,1.0,67.0,78.0,30.0,58.0,0,1,0,0
1,2.0,52.0,72.0,67.0,64.0,0,0,1,0
2,3.0,66.0,56.0,,59.0,0,1,0,0
3,4.0,30.0,42.0,44.0,39.0,1,0,0,0
4,5.0,56.0,53.0,48.0,52.0,0,1,0,0


# Intro to Scikit Learn

__The Scikit Learn library lets us do the following data science operations and more__
* Feature extraction
* Classification
* Regression
* Clustering
* Dimension reduction
* Model selection
* Pipelines and Feature Unions

__Today we will build a scikit learn pipeline that helps us automate the data science process__

## Two main interfaces in scikit learn: estimators and transformers

* __The general api pattern for estimators is as follows__

![](img/sk_estimator_interface.jpg)

* __The main difference between the predictor and transformer APIs is the method they expose: predict() versus transform()__

![](img/sk_transformer.jpg)
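
To make the fit/transform pattern concrete, here is a minimal sketch of a custom transformer. The class `RangeScaler` is hypothetical and written only for illustration; it is not part of scikit learn, and it assumes that no column is constant.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class RangeScaler(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that scales each column to [0, 1]."""

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        self.min_ = X.min(axis=0)                # statistics learned in fit
        self.range_ = X.max(axis=0) - self.min_  # assumes no constant column
        return self                              # fit returns the estimator itself

    def transform(self, X):
        # Apply the statistics learned in fit to any data of the same width.
        return (np.asarray(X, dtype=float) - self.min_) / self.range_

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
scaled = RangeScaler().fit_transform(X)   # fit_transform is provided by TransformerMixin
print(scaled)
```

Attributes learned during `fit` end with an underscore by scikit learn convention, which is why the sketch uses `min_` and `range_`.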

# Scikit Learn Pipelines

* __For now, remember to use numpy arrays with scikit learn pipelines. Packages such as sklearn-pandas are not production-ready yet, and the unexpected behaviour of pandas objects inside pipelines makes them unreliable__

![](img/sk_pipelines.jpeg)

![](img/engineering-pipelines.jpeg)

## Sklearn Transformers

#### Generate Random data for testing out the interface of scikit learn transformers

In [169]:
random_data = np.random.choice(a = 300000, size = 120).reshape(20, 6)

random_data

array([[188753,  15858,  79037, 115955,  11842,  40392],
       [126031, 294235,  32726, 152362,  29910, 274406],
       [101472, 284809,  95102, 182305, 286664, 270598],
       [256371,  99610, 280252, 289122,   4065, 177054],
       [271630, 222164,  56318, 299094, 230001, 217798],
       [166549, 173353, 234548, 258758, 282437,  89641],
       [166167, 255830, 202818, 135936, 152010, 282718],
       [264651,  40332, 237762, 224541, 177571,  46739],
       [  8019,  20114, 264680, 194818,  62679, 188318],
       [ 63819, 216538, 223058, 246361, 102927, 266541],
       [ 90199, 115906, 274965, 206491, 212655,  47954],
       [ 18333,  99073, 127444, 130364, 287112, 126326],
       [ 39499,  67207, 213832,  79613, 285241,  60639],
       [213192, 100977,  90268, 258955, 111046, 111578],
       [274704, 225327,  87376,   4113, 255232,  79341],
       [167797, 160777,  83452, 181222,  23126, 198974],
       [ 86339,  37533, 179816, 166499, 296364, 136589],
       [200947, 267133,  63083,

### Range Normalizing the Data

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
min_max_scaler = MinMaxScaler()

min_max_scaler.fit(random_data)

normalized_data = min_max_scaler.transform(random_data)

normalized_data

### Standardizing the Data

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
standard_scaler = StandardScaler()

standard_scaler.fit(random_data)

standardized_data = standard_scaler.transform(random_data)
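
The sketch below checks that `MinMaxScaler` applies the same min-max formula used in the numpy section, using statistics learned from the data passed to `fit`. The seeded arrays `train` and `test` are illustrative and unrelated to `random_data`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)                 # fixed seed, illustrative data
train = rng.integers(0, 300000, size=(20, 6)).astype(float)
test = rng.integers(0, 300000, size=(5, 6)).astype(float)

scaler = MinMaxScaler().fit(train)             # statistics come from train only
manual = (test - train.min(axis=0)) / (train.max(axis=0) - train.min(axis=0))
print(np.allclose(scaler.transform(test), manual))

# StandardScaler gives each column zero mean and unit standard deviation.
standardized = StandardScaler().fit_transform(train)
print(standardized.mean(axis=0).round(6), standardized.std(axis=0).round(6))
```

Fitting on one dataset and transforming another is the reason `fit` and `transform` are separate methods.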

### Building ML Models using Sklearn Estimators and Pipelines

![](img/mnist_data.jpeg)

In [None]:
from sklearn.datasets import load_digits

X, Y = load_digits(return_X_y=True)  # load the data once and unpack features and labels

### TRAIN - TEST SPLIT 

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size= 0.3, random_state=42)

In [None]:
from sklearn.linear_model import LogisticRegression

predictiveModel = LogisticRegression()

In [None]:
predictiveModel.fit(X_train, Y_train)

In [None]:
Y_preds = predictiveModel.predict(X_test)

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
accuracy_score(Y_test, Y_preds)

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline(steps = [("scale", StandardScaler()), ("logreg", predictiveModel)])

In [None]:
pipe.fit(X_train, Y_train)

In [None]:
Y_pipe_preds = pipe.predict(X_test)

In [None]:
accuracy_score(Y_test, Y_pipe_preds)
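
Because a pipeline behaves like a single estimator, it can be passed to helpers such as `cross_val_score`. This is a sketch, not the lab's own setup: `max_iter=1000` and `cv=5` are assumptions chosen here for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, Y = load_digits(return_X_y=True)

pipe = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("logreg", LogisticRegression(max_iter=1000)),  # assumption: higher iteration cap to help convergence
])

# The scaler is re-fit inside every fold, so held-out rows never influence it.
scores = cross_val_score(pipe, X, Y, cv=5)
print(scores.mean())
```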

# Numpy Cheat Sheet

![](img/numpy_cheat_sheet.png)

# Pandas Cheat Sheet

![](img/pandas_cheat_sheet.jpg)

# Scikit Learn Cheat Sheet

![](img/scikit_cheat_sheet.jpg)