# Introduction to Machine Learning

In course CCE3219 we introduce machine learning. For most of the data sources we study, we will perform the following steps:
* Preprocessing: We clean the data and integrate the data sets needed from multiple sources, then put them in a form suitable for machine learning models.
* Feature Extraction: After collecting the data, we extract features to support model building. We might calculate the variance, standard deviation, mean, median, Fast Fourier Transform, etc. Which features to extract depends on the data type and the application we are building.
* Feature Selection: After extracting what may be hundreds of features, we select the most relevant ones to improve prediction accuracy and remove redundant features from the feature set.
* Machine Learning: Finally, we train a machine learning model on the selected features and evaluate its prediction accuracy.

To carry out these steps, we use Python and some popular libraries. In this course we use NumPy, pandas, SciPy, and scikit-learn.
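
As a rough illustration of the feature extraction step, the sketch below computes the statistics named in the list above for a one-dimensional signal using NumPy. The signal (a 5 Hz sine wave sampled at 100 Hz) and the names `fs`, `signal`, and `features` are assumptions made for this example, not course data.

```python
import numpy as np

# Hypothetical signal: a 5 Hz sine wave sampled at 100 Hz for one second.
fs = 100
t = np.arange(fs) / fs
signal = np.sin(2 * np.pi * 5 * t)

# Magnitude spectrum of the real-valued signal and the matching frequencies.
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(signal.size, d=1 / fs)

features = {
    "mean": signal.mean(),
    "variance": signal.var(),
    "std": signal.std(),
    "median": np.median(signal),
    "dominant_freq_hz": freqs[np.argmax(spectrum)],
}
print(features)
```

Each entry of `features` could become one column of the feature set passed to the later steps.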



## Preprocessing
Python uses lists to hold data. Lists in Python are flexible and can hold multiple data types. However, we need a container that can also perform statistical and mathematical operations on the data.


In [90]:
a=[1, 'hello', True, 1.2]


### Numpy array
We use the NumPy array. A NumPy array is like a list, but it provides many methods that help during preprocessing. Unlike a list, all elements of an array share a single data type (dtype); if the input values have different types, NumPy converts them to a common type.

In [91]:
import numpy as np
b= np.array([1,2,3,4,5,8] , dtype="float32")
type(b)

numpy.ndarray

To show the type of the elements inside the NumPy array, we can use the dtype attribute.

In [92]:
b.dtype

dtype('float32')
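
A minimal sketch of how the dtype is chosen when none is given. The variable names are illustrative only.

```python
import numpy as np

ints = np.array([1, 2, 3])          # all integers: an integer dtype
mixed = np.array([1, 2.5, 3])       # integers are upcast to float
objects = np.array([1, "hello", True], dtype=object)  # keeps Python objects as they are

print(ints.dtype.kind)   # i (signed integer)
print(mixed.dtype)       # float64
print(objects.dtype)     # object
```

An object array can hold values of different types, but it gives up most of the fast numerical operations, so a numeric dtype is usually preferable.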

#### Accessing Values


In [93]:
print(b[:2])
print(b[1:])
print(b[0])
b[0]=5



[ 1.  2.]
[ 2.  3.  4.  5.  8.]
1.0
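
A few more indexing patterns, sketched on the same values as above. Note that a slice is a view of the original array, so writing through the slice changes the original.

```python
import numpy as np

b = np.array([1, 2, 3, 4, 5, 8], dtype="float32")

print(b[-1])    # last element
print(b[::2])   # every second element
print(b[1:4])   # elements at positions 1, 2 and 3

head = b[:2]    # a view, not a copy
head[0] = 5     # also changes b[0]
print(b[0])
```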


#### Array Shape
To view the dimensions of the array, we can use the shape attribute. It returns a tuple with one entry per dimension, so a one-dimensional array has a single entry.

In [94]:
print(b.shape)
print(b.shape[0])

(6,)
6


#### 2D Array

In [95]:
b=np.array([[1,2,3],[4,5,6]], dtype="int32")
b.shape

(2, 3)
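
A short sketch of working with a two-dimensional array: elements are addressed by a row and a column, and `reshape` rearranges the same values into a different shape.

```python
import numpy as np

b = np.array([[1, 2, 3], [4, 5, 6]], dtype="int32")

print(b.ndim)      # 2 (number of dimensions)
print(b[1, 2])     # 6 (row 1, column 2)
print(b[:, 0])     # first column: values 1 and 4

reshaped = b.reshape(3, 2)
print(reshaped.shape)   # (3, 2)
```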

#### Generating arrays
We can generate an array of evenly spaced values between two numbers using arange. The start value is included and the stop value is excluded.

In [96]:
b=np.arange(0,1,.1)
b

array([ 0. ,  0.1,  0.2,  0.3,  0.4,  0.5,  0.6,  0.7,  0.8,  0.9])
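
With a fractional step, floating-point rounding can make the number of elements produced by `arange` hard to predict. A common alternative is `np.linspace`, which takes the number of points instead of the step. The sketch below shows both producing the same values here.

```python
import numpy as np

by_step = np.arange(0, 1, 0.1)       # step size given, stop excluded
by_count = np.linspace(0, 0.9, 10)   # number of points given, stop included

print(by_step.size, by_count.size)
print(np.allclose(by_step, by_count))
```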

We can also generate an array filled with ones or zeros.

In [97]:
b=np.ones((2,3), dtype="float32")
b


array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.]], dtype=float32)

In [98]:
b=np.zeros(7, dtype="int64")
b

array([0, 0, 0, 0, 0, 0, 0], dtype=int64)

To fill all the elements of the array with the same value we can use the method fill.

In [99]:
b.fill(99)
b

array([99, 99, 99, 99, 99, 99, 99], dtype=int64)

To copy the contents of an array by value rather than by reference, we can use the copy method.

In [100]:
z=b.copy()
z.fill(100)
print(b)
print(z)

[99 99 99 99 99 99 99]
[100 100 100 100 100 100 100]
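
For contrast, plain assignment does not copy anything; it only gives the same array a second name. A small sketch (the names `alias` and `independent` are illustrative):

```python
import numpy as np

b = np.zeros(7, dtype="int64")
b.fill(99)

alias = b            # same array, second name
alias[0] = -1        # b[0] changes too

independent = b.copy()
independent[1] = -2  # b[1] is not affected

print(b)
print(independent)
```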


We can search for a value in the array using Python's in operator.

In [101]:
10 in z

False

#### Array Math
We can perform mathematical operations between two arrays.

In [102]:
c=b+z
c

array([199, 199, 199, 199, 199, 199, 199], dtype=int64)

In [103]:
c=b*10
c

array([990, 990, 990, 990, 990, 990, 990], dtype=int64)

In [104]:
a=np.ones((2,3))
b= np.ones(3)
c=a*b
c

array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.]])
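
The last cell multiplies a (2, 3) array by a (3,) array. NumPy handles this through broadcasting: the smaller array is stretched across the missing dimension. The sketch below uses distinct values so the effect is visible; `row` and `col` are names chosen for this example.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])   # shape (2, 3)
row = np.array([10, 20, 30])           # shape (3,): applied to every row
col = np.array([[100], [200]])         # shape (2, 1): applied to every column

scaled = a * row    # rows become [10, 40, 90] and [40, 100, 180]
shifted = a + col   # rows become [101, 102, 103] and [204, 205, 206]
print(scaled)
print(shifted)
```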

We can perform element-wise comparisons between two arrays, or between an array and a scalar. The result is a boolean array.

In [105]:
a>c
a>10

array([[False, False, False],
       [False, False, False]], dtype=bool)

#### Accessing using boolean
We can select elements of an array using the boolean result of a condition.

In [106]:
a=np.array([1,2,3,4,5,6])
a[a>3]

array([4, 5, 6])
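
Conditions can be combined with `&` (and), `|` (or) and `~` (not), with each condition in parentheses. A boolean mask can also be used on the left-hand side to overwrite the selected elements. A brief sketch:

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6])

middle = a[(a > 2) & (a < 6)]   # values 3, 4 and 5
print(middle)

a[a % 2 == 0] = 0               # replace the even values with 0
print(a)
```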

#### Array Statistics
We can use several array methods to perform statistics on the data.

In [107]:
a.sum()
a.mean()
a.var()
a.min()
a.max()
a.std()
np.median(a)  # only the value of the last expression is displayed

3.5
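
On a two-dimensional array, the same methods accept an `axis` argument to compute the statistic per column or per row. A minimal sketch, reusing the six values above arranged as two rows:

```python
import numpy as np

m = np.array([1, 2, 3, 4, 5, 6]).reshape(2, 3)

col_sums = m.sum(axis=0)    # one sum per column: 5, 7, 9
row_means = m.mean(axis=1)  # one mean per row: 2.0, 5.0
print(col_sums)
print(row_means)
print(m.max())              # no axis: maximum over the whole array
```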

### Pandas Series
Pandas is built on the Series, an object similar to a NumPy array but with an index that labels its elements. The difference is like that between lists and dictionaries in Python.

In [108]:
import pandas as pd
s1= pd.Series([1,2,3,4,5], index=['a','b','c','d','e'])
s1

a    1
b    2
c    3
d    4
e    5
dtype: int64

#### Accessing Series
We can access a series either by the position of an element or by its index label.


In [109]:
s1.iloc[1]  # access by position; plain s1[1] is deprecated in recent pandas when the index holds labels

2

In [110]:
s1['d']

4
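
To make the intent explicit, pandas provides `.loc` for access by label and `.iloc` for access by position. A short sketch on the same series:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

print(s1.loc['d'])                   # 4 (by label)
print(s1.iloc[1])                    # 2 (by position)
print(s1.loc[['a', 'e']].tolist())   # [1, 5] (several labels at once)
```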

#### Statistics in Series
Like NumPy arrays, a series has methods to perform statistics and other operations.

In [111]:
s1.max()

5

In [112]:
s1.var()

2.5
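
One difference worth knowing: by default a pandas series computes the sample variance (dividing by n - 1), while a NumPy array computes the population variance (dividing by n). The `ddof` argument controls this in both libraries, as the sketch below shows.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

print(s1.var())                # 2.5 (pandas default, ddof=1)
print(s1.var(ddof=0))          # 2.0
print(np.var(s1.to_numpy()))   # 2.0 (NumPy default, ddof=0)
```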

### Data Frame
A data frame is a collection of series. It is like a table in which each column is a series of a certain type. In addition, the data frame has an index, which works as a label for every row. A data frame can be thought of as a dictionary of series.

In [113]:
df= pd.DataFrame({
    'A': pd.Series([1,2,3,4]),
    'B': pd.Series(['A','B','C','D'])
})
df

| | A | B |
|---|---|---|
| 0 | 1 | A |
| 1 | 2 | B |
| 2 | 3 | C |
| 3 | 4 | D |


In [114]:
df2= pd.DataFrame({
    'A': 1.0,
    'B': pd.Timestamp('20180214'),
    'C': np.array([3]*10, dtype="int32"),
    'F': 'foo'
}, index = ['a','b','c','d','e','aa','bb','cc','dd','ee'])
df2

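| | A | B | C | F |
|---|---|---|---|---|
| a | 1.0 | 2018-02-14 | 3 | foo |
| b | 1.0 | 2018-02-14 | 3 | foo |
| c | 1.0 | 2018-02-14 | 3 | foo |
| d | 1.0 | 2018-02-14 | 3 | foo |
| e | 1.0 | 2018-02-14 | 3 | foo |
| aa | 1.0 | 2018-02-14 | 3 | foo |
| bb | 1.0 | 2018-02-14 | 3 | foo |
| cc | 1.0 | 2018-02-14 | 3 | foo |
| dd | 1.0 | 2018-02-14 | 3 | foo |
| ee | 1.0 | 2018-02-14 | 3 | foo |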


#### Viewing data
We can view the first rows of a data frame using head and the last rows using tail; both return five rows by default. We can also view the index and the column types using index and dtypes respectively. To show summary statistics of the numeric columns, we can use describe.

In [115]:
df2.head()

| | A | B | C | F |
|---|---|---|---|---|
| a | 1.0 | 2018-02-14 | 3 | foo |
| b | 1.0 | 2018-02-14 | 3 | foo |
| c | 1.0 | 2018-02-14 | 3 | foo |
| d | 1.0 | 2018-02-14 | 3 | foo |
| e | 1.0 | 2018-02-14 | 3 | foo |


In [116]:
df2.head(3)

| | A | B | C | F |
|---|---|---|---|---|
| a | 1.0 | 2018-02-14 | 3 | foo |
| b | 1.0 | 2018-02-14 | 3 | foo |
| c | 1.0 | 2018-02-14 | 3 | foo |


In [117]:
df2.tail()

| | A | B | C | F |
|---|---|---|---|---|
| aa | 1.0 | 2018-02-14 | 3 | foo |
| bb | 1.0 | 2018-02-14 | 3 | foo |
| cc | 1.0 | 2018-02-14 | 3 | foo |
| dd | 1.0 | 2018-02-14 | 3 | foo |
| ee | 1.0 | 2018-02-14 | 3 | foo |


In [118]:
df2.index

Index(['a', 'b', 'c', 'd', 'e', 'aa', 'bb', 'cc', 'dd', 'ee'], dtype='object')

In [119]:
df2.describe()

| | A | C |
|---|---|---|
| count | 10.0 | 10.0 |
| mean | 1.0 | 3.0 |
| std | 0.0 | 0.0 |
| min | 1.0 | 3.0 |
| 25% | 1.0 | 3.0 |
| 50% | 1.0 | 3.0 |
| 75% | 1.0 | 3.0 |
| max | 1.0 | 3.0 |
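
The dtypes attribute mentioned above is not shown in the cells, so here is a small sketch of it together with shape and columns. It rebuilds `df2` exactly as defined earlier so the block runs on its own. The exact dtype reported for the timestamp and string columns can differ between pandas versions.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    'A': 1.0,
    'B': pd.Timestamp('20180214'),
    'C': np.array([3] * 10, dtype="int32"),
    'F': 'foo'
}, index=['a', 'b', 'c', 'd', 'e', 'aa', 'bb', 'cc', 'dd', 'ee'])

print(df2.shape)              # (10, 4): rows, columns
print(df2.columns.tolist())   # ['A', 'B', 'C', 'F']
print(df2.dtypes)             # one dtype per column
```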


#### Sorting
We can sort by values with sort_values or by index with sort_index. To sort the current object instead of returning a sorted copy, we pass inplace=True.

In [120]:
df3=df2.sort_values(by= 'B')

In [121]:
df2.sort_values(by= 'B', inplace= True)
df2

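| | A | B | C | F |
|---|---|---|---|---|
| a | 1.0 | 2018-02-14 | 3 | foo |
| b | 1.0 | 2018-02-14 | 3 | foo |
| c | 1.0 | 2018-02-14 | 3 | foo |
| d | 1.0 | 2018-02-14 | 3 | foo |
| e | 1.0 | 2018-02-14 | 3 | foo |
| aa | 1.0 | 2018-02-14 | 3 | foo |
| bb | 1.0 | 2018-02-14 | 3 | foo |
| cc | 1.0 | 2018-02-14 | 3 | foo |
| dd | 1.0 | 2018-02-14 | 3 | foo |
| ee | 1.0 | 2018-02-14 | 3 | foo |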
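
Because every row of `df2` holds the same values, sorting it changes nothing visible. The sketch below uses a small hypothetical data frame (the `score` column and its values are made up for illustration) to show sorting by value in descending order and sorting by index.

```python
import pandas as pd

# Hypothetical data with distinct values so the ordering is visible.
df = pd.DataFrame({'score': [3, 1, 2]}, index=['b', 'c', 'a'])

by_value = df.sort_values(by='score', ascending=False)
by_index = df.sort_index()

print(by_value.index.tolist())   # ['b', 'a', 'c']
print(by_index.index.tolist())   # ['a', 'b', 'c']
print(df.index.tolist())         # ['b', 'c', 'a'] (original is unchanged)
```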


#### Accessing Data
We can use a column name to select a column, or a slice of row numbers to select rows.

In [122]:
df2['A'] # returns a series

a     1.0
b     1.0
c     1.0
d     1.0
e     1.0
aa    1.0
bb    1.0
cc    1.0
dd    1.0
ee    1.0
Name: A, dtype: float64

In [123]:
df2[0:3]

| | A | B | C | F |
|---|---|---|---|---|
| a | 1.0 | 2018-02-14 | 3 | foo |
| b | 1.0 | 2018-02-14 | 3 | foo |
| c | 1.0 | 2018-02-14 | 3 | foo |


##### Selection by label
We can select a subset of the data frame using the labels of both rows and columns.

In [124]:
df2.loc[:,['A','B']]

| | A | B |
|---|---|---|
| a | 1.0 | 2018-02-14 |
| b | 1.0 | 2018-02-14 |
| c | 1.0 | 2018-02-14 |
| d | 1.0 | 2018-02-14 |
| e | 1.0 | 2018-02-14 |
| aa | 1.0 | 2018-02-14 |
| bb | 1.0 | 2018-02-14 |
| cc | 1.0 | 2018-02-14 |
| dd | 1.0 | 2018-02-14 |
| ee | 1.0 | 2018-02-14 |


In [125]:
df2.loc[['a','c','dd'], ['B','C']]

| | B | C |
|---|---|---|
| a | 2018-02-14 | 3 |
| c | 2018-02-14 | 3 |
| dd | 2018-02-14 | 3 |


##### Selection by position
We can access the data frame using the position of both the columns and rows.


In [126]:
df2.iloc[3]

A                      1
B    2018-02-14 00:00:00
C                      3
F                    foo
Name: d, dtype: object

In [127]:
df2.iloc[3:5,0:2]

| | A | B |
|---|---|---|
| d | 1.0 | 2018-02-14 |
| e | 1.0 | 2018-02-14 |


In [128]:
df2.iloc[[1,2,4], [0,2]]

| | A | C |
|---|---|---|
| b | 1.0 | 3 |
| c | 1.0 | 3 |
| e | 1.0 | 3 |


In [129]:
df2.iloc[:, 1:3]

| | B | C |
|---|---|---|
| a | 2018-02-14 | 3 |
| b | 2018-02-14 | 3 |
| c | 2018-02-14 | 3 |
| d | 2018-02-14 | 3 |
| e | 2018-02-14 | 3 |
| aa | 2018-02-14 | 3 |
| bb | 2018-02-14 | 3 |
| cc | 2018-02-14 | 3 |
| dd | 2018-02-14 | 3 |
| ee | 2018-02-14 | 3 |
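
One detail that is easy to miss when switching between the two selection styles: a label slice with `.loc` includes its end label, while a position slice with `.iloc` excludes its end position. A minimal sketch on a hypothetical data frame:

```python
import pandas as pd

# Hypothetical data frame used only to compare the two slicing rules.
df = pd.DataFrame(
    {'A': [1, 2, 3, 4], 'B': [10, 20, 30, 40], 'C': [5, 6, 7, 8]},
    index=['a', 'b', 'c', 'd'],
)

by_label = df.loc['a':'c', 'A':'B']   # rows a, b, c and columns A, B
by_position = df.iloc[0:2, 0:2]       # rows 0, 1 and columns 0, 1

print(by_label.shape)      # (3, 2)
print(by_position.shape)   # (2, 2)
```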


#### Boolean indexing
We can filter the data frame using a boolean condition.

In [130]:
df2.A>0

a     True
b     True
c     True
d     True
e     True
aa    True
bb    True
cc    True
dd    True
ee    True
Name: A, dtype: bool

In [131]:
df2[df2.A>0]

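| | A | B | C | F |
|---|---|---|---|---|
| a | 1.0 | 2018-02-14 | 3 | foo |
| b | 1.0 | 2018-02-14 | 3 | foo |
| c | 1.0 | 2018-02-14 | 3 | foo |
| d | 1.0 | 2018-02-14 | 3 | foo |
| e | 1.0 | 2018-02-14 | 3 | foo |
| aa | 1.0 | 2018-02-14 | 3 | foo |
| bb | 1.0 | 2018-02-14 | 3 | foo |
| cc | 1.0 | 2018-02-14 | 3 | foo |
| dd | 1.0 | 2018-02-14 | 3 | foo |
| ee | 1.0 | 2018-02-14 | 3 | foo |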
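
Several conditions can be combined with `&` and `|`, with each condition in parentheses, and `isin` tests membership in a list of values. The sketch below uses a small hypothetical data frame in which the rows differ, so the filter has a visible effect.

```python
import pandas as pd

# Hypothetical data with varied values.
df = pd.DataFrame(
    {'A': [1, 2, 3, 4], 'F': ['foo', 'bar', 'foo', 'baz']},
    index=['a', 'b', 'c', 'd'],
)

both = df[(df['A'] > 1) & (df['F'] == 'foo')]        # only row c
either = df[(df['A'] > 3) | df['F'].isin(['bar'])]   # rows b and d

print(both.index.tolist())
print(either.index.tolist())
```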


#### Adding a new column


In [132]:
df2['G']= np.ones(10)
df2

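| | A | B | C | F | G |
|---|---|---|---|---|---|
| a | 1.0 | 2018-02-14 | 3 | foo | 1.0 |
| b | 1.0 | 2018-02-14 | 3 | foo | 1.0 |
| c | 1.0 | 2018-02-14 | 3 | foo | 1.0 |
| d | 1.0 | 2018-02-14 | 3 | foo | 1.0 |
| e | 1.0 | 2018-02-14 | 3 | foo | 1.0 |
| aa | 1.0 | 2018-02-14 | 3 | foo | 1.0 |
| bb | 1.0 | 2018-02-14 | 3 | foo | 1.0 |
| cc | 1.0 | 2018-02-14 | 3 | foo | 1.0 |
| dd | 1.0 | 2018-02-14 | 3 | foo | 1.0 |
| ee | 1.0 | 2018-02-14 | 3 | foo | 1.0 |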
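
A new column is often derived from existing ones rather than filled with a constant. The sketch below, on a hypothetical data frame, adds a computed column in place and then uses `assign`, which returns a new data frame and leaves the original untouched. The column names `total` and `ratio` are made up for this example.

```python
import pandas as pd

# Hypothetical data frame for illustration.
df = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'C': [3, 3, 3]})

df['total'] = df['A'] + df['C']               # element-wise sum, added in place
wider = df.assign(ratio=df['A'] / df['C'])    # new data frame with an extra column

print(df.columns.tolist())      # ['A', 'C', 'total']
print(wider.columns.tolist())   # ['A', 'C', 'total', 'ratio']
```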


#### Setting a value

##### By Label

In [133]:
df2.at['a','A']=0
df2

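| | A | B | C | F | G |
|---|---|---|---|---|---|
| a | 0.0 | 2018-02-14 | 3 | foo | 1.0 |
| b | 1.0 | 2018-02-14 | 3 | foo | 1.0 |
| c | 1.0 | 2018-02-14 | 3 | foo | 1.0 |
| d | 1.0 | 2018-02-14 | 3 | foo | 1.0 |
| e | 1.0 | 2018-02-14 | 3 | foo | 1.0 |
| aa | 1.0 | 2018-02-14 | 3 | foo | 1.0 |
| bb | 1.0 | 2018-02-14 | 3 | foo | 1.0 |
| cc | 1.0 | 2018-02-14 | 3 | foo | 1.0 |
| dd | 1.0 | 2018-02-14 | 3 | foo | 1.0 |
| ee | 1.0 | 2018-02-14 | 3 | foo | 1.0 |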


##### By Position


In [134]:
df2.iat[0,2]=0
df2

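| | A | B | C | F | G |
|---|---|---|---|---|---|
| a | 0.0 | 2018-02-14 | 0 | foo | 1.0 |
| b | 1.0 | 2018-02-14 | 3 | foo | 1.0 |
| c | 1.0 | 2018-02-14 | 3 | foo | 1.0 |
| d | 1.0 | 2018-02-14 | 3 | foo | 1.0 |
| e | 1.0 | 2018-02-14 | 3 | foo | 1.0 |
| aa | 1.0 | 2018-02-14 | 3 | foo | 1.0 |
| bb | 1.0 | 2018-02-14 | 3 | foo | 1.0 |
| cc | 1.0 | 2018-02-14 | 3 | foo | 1.0 |
| dd | 1.0 | 2018-02-14 | 3 | foo | 1.0 |
| ee | 1.0 | 2018-02-14 | 3 | foo | 1.0 |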
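
The at and iat accessors set one cell at a time. To set several cells at once, one option is `.loc` with a list of labels or a boolean condition, as in this sketch on a hypothetical data frame:

```python
import pandas as pd

# Hypothetical data frame for illustration.
df = pd.DataFrame(
    {'A': [1.0, 2.0, 3.0, 4.0], 'C': [3, 3, 3, 3]},
    index=['a', 'b', 'c', 'd'],
)

df.loc[df['A'] > 2, 'C'] = 0       # set C to 0 wherever A is greater than 2
df.loc[['a', 'b'], 'A'] = 0.0      # set A to 0.0 for rows a and b

print(df)
```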


## References

* https://engineering.ucsb.edu/~shell/che210d/numpy.pdf
* http://pandas.pydata.org/pandas-docs/stable/10min.html