# DS 422 - Machine Learning Driven Data Analysis I

## Lab 2

**Musabbir Hasan<br>**
*musabbirhasansammak@outlook.com<br>*

**Objectives:**
1. To understand how `ndarray`, the foundational data structure behind almost all Python machine learning libraries, works.
2. To understand how `Series` and `DataFrame`, the primary two data structures of Pandas, can be used to work with structured data.
3. To be able to comfortably read structured data of various formats using Pandas.
4. To try Linear/Polynomial Regression models to predict housing prices.
5. Light introduction to Kaggle competitions and submitting our predictions to Kaggle.

## Software, Libraries, and Websites Requirements
1. Python 3 with NumPy, Pandas, Matplotlib, scikit-learn, and Jupyter. *However, you can just install [Anaconda](https://www.anaconda.com/products/individual).*
2. Git. *Please download Git for your respective platform.*
3. Docker. *Will be used in final project.*
4. Kaggle. *If you do not already have a Kaggle account, please open one [here](https://www.kaggle.com/).*

## Installation
The libraries listed above will probably be enough for the whole course. For Windows users, the best way to get them, I think, is to download and install the [Anaconda Individual Edition](https://www.anaconda.com/products/individual). My personal suggestion is to tick the **Add to path** option during installation even though the installer discourages it. It avoids many unnecessary problems later. Linux or Mac users can install the libraries using `pip`.

## A Possible Workflow of Machine Learning

1. Reading and writing data in a variety of formats (csv, tsv, text, xlsx, json, sql, etc.). Everything starts with getting and sending data.
2. Exploratory data analysis of the acquired data. For example, visualizing data, calculating summaries.
3. Data cleaning, manipulating, combining, normalizing, reshaping, slicing, shuffling.
4. Modeling with machine learning or other statistical algorithms.
5. Presenting the results and addressing questions such as why and how the model works, what does not work and why, etc. In short, interpreting the model.
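The five steps above can be sketched in a few lines. This is only a toy illustration under stated assumptions: the data is a made-up in-memory CSV (the `area` and `price` columns are hypothetical), and `np.polyfit` stands in for whatever model you would actually use in step 4.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical toy data: house area and price; one price is missing.
raw = io.StringIO("area,price\n50,100\n80,160\n100,\n120,240\n")

# 1. Read the data.
houses = pd.read_csv(raw)

# 2. Explore: summary statistics and missing values per column.
print(houses.describe())
print(houses.isnull().sum())

# 3. Clean: drop the rows that contain missing values.
clean = houses.dropna()

# 4. Model: fit a straight line, price = slope * area + intercept.
slope, intercept = np.polyfit(
    clean["area"].to_numpy(), clean["price"].to_numpy(), deg=1
)

# 5. Present: report the fitted parameters and one prediction.
predicted = slope * 100 + intercept
print(f"slope={slope:.2f}, prediction for area 100: {predicted:.1f}")
```

Each step usually takes far more work on real data, but the order of the steps stays roughly the same.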

## NumPy & the `ndarray`
**I want you to be comfortable with vectorized operations and broadcasting.**

`ndarray` is an n-dimensional array. Each array may contain data of only one type. You can create an array of any dimension you want, and it will be fast and memory efficient. Let's create a 1D and a 2D array.

In [None]:
import numpy as np

In [None]:
array1d = np.array([1, 2, 3.5, 4, 5])
array2d = np.asarray([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
])
array = np.arange(10)

`array` copies the input data by default. `asarray` does not copy if the input is already an `ndarray` of the requested data type. `arange` is the NumPy version of `range`; it returns an `ndarray`.

In [None]:
x = np.array([1, 2, 3, 4, 5])
y = np.asarray(x)
z = np.array(x)
print(id(x))
print(id(y))
print(id(z))
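Comparing `id` values works, but the difference is easier to see by modifying the original array. The following sketch uses `np.shares_memory` to check whether two arrays use the same underlying data.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.asarray(x)  # x is already an ndarray, so no copy is made
z = np.array(x)    # np.array copies by default

print(y is x)                  # True: y is the very same object as x
print(np.shares_memory(x, z))  # False: z has its own data

x[0] = 100
print(y[0])  # 100, because y and x are the same array
print(z[0])  # 1, because z is an independent copy
```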

In [None]:
array1d

In [None]:
array2d

In [None]:
array

**Let's check their shapes and dimensions.**

In [None]:
print(array1d.shape)
print(array1d.ndim)
print(array2d.shape)
print(array2d.ndim)

**Let's check their data types.**

In [None]:
print(array1d.dtype)
print(array2d.dtype)

**We can also cast an array into another data type.**

In [None]:
array1d = array1d.astype(np.int16)
print(array1d)
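Note that casting floats to integers drops the fractional part instead of rounding. A small sketch with made-up values:

```python
import numpy as np

floats = np.array([1.7, -1.7, 3.5])
ints = floats.astype(np.int16)  # astype returns a new array

print(ints.tolist())   # [1, -1, 3]: truncated toward zero, not rounded
print(ints.dtype)      # int16
print(floats.dtype)    # float64: the original array is unchanged
```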

**Quiz 1. What other data types does NumPy support?**

**Is there any easy way to create `ndarray`?**

In [None]:
np.zeros((5, 5))

In [None]:
np.ones((2,2,3))

In [None]:
np.empty((5, 3))

In [None]:
np.ones_like(np.zeros((5, 5)))

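**Similarly, you can use `np.zeros_like` and `np.empty_like` to create arrays with the same shape as an existing array. Note that `np.empty` does not initialize its entries, so the values it shows are arbitrary.**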

**Quiz 2. Find out what `np.eye` and `np.identity` functions do. They also create `ndarray` of given dimension. But what's the difference?**

**No matter what operation you perform with an `ndarray` and a scalar, it is applied to every element of the array, and the result is returned as a new array. This is usually called vectorization.**

In [None]:
a = np.ones((5, 5))
a

In [None]:
a - 1
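To see what vectorization saves you from, here is a sketch that compares `a - 1` with the explicit Python loop it replaces. Both give the same result, but the vectorized form is shorter and usually much faster.

```python
import numpy as np

a = np.ones((5, 5))

# Explicit loop over every element
looped = np.empty_like(a)
for i in range(a.shape[0]):
    for j in range(a.shape[1]):
        looped[i, j] = a[i, j] - 1

# Vectorized version: one expression, no Python-level loop
vectorized = a - 1

print(np.array_equal(looped, vectorized))  # True
print(a[0, 0])                             # 1.0: a itself is unchanged
```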

**Slicing an `ndarray` returns a *view*, not a *copy* as slicing a Python list does. Therefore, any change made through the slice also changes the original array.**

In [None]:
a = np.arange(10)
a

**Slicing like Python lists.**

In [None]:
a[0:3]

In [None]:
a[:]

In [None]:
a[0:-1]

In [None]:
print(a[-1])
print(a[-2])
print(a[-3])

**Quiz 3. Write code that gives you the first element of *a*. But use a negative index, and most importantly, you cannot use a hard-coded index. For example, you cannot simply use `a[-10]`.**

**To avoid unnecessary copying and to be memory efficient, slicing `ndarray` returns a view, not a copy.**

In [None]:
b = [1, 2, 3, 4, 5, 6, 7, 8, 9]

In [None]:
c = b[:5]
d = a[:5]

In [None]:
c[0] = 100
d[0] = 100

In [None]:
print(a)
print(b)
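If you do want an independent slice, ask for a copy explicitly with `.copy()`. A minimal sketch:

```python
import numpy as np

a = np.arange(10)
view = a[:5]          # shares memory with a
copy = a[:5].copy()   # independent data

copy[0] = 100
print(a[0])  # 0: changing the copy leaves a alone

view[0] = 100
print(a[0])  # 100: changing the view changes a

print(np.shares_memory(a, view))  # True
print(np.shares_memory(a, copy))  # False
```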

**Indexing higher dimensional arrays.**

In [None]:
a = np.array([
    [0, 1, 2],
    [3, 4, 5],
    [6, 7, 8]
])
a

In [None]:
a[1, 1]

In [None]:
a[:, :]

In [None]:
a[0,:]

In [None]:
a[:, 0]

**Indexing with logical indices.**

In [None]:
a = np.arange(10)
a

In [None]:
mask = [False] * 10  # boolean mask: True marks the positions to keep
mask[3] = True
mask[7] = True
mask

In [None]:
a[mask]

In [None]:
a > 5

In [None]:
a[a > 5]
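Conditions can be combined with `&` (and), `|` (or), and `~` (not). Each condition needs its own parentheses because these operators bind more tightly than comparisons. A boolean mask can also be used on the left-hand side of an assignment. A short sketch:

```python
import numpy as np

a = np.arange(10)

print(a[(a > 2) & (a < 7)].tolist())  # [3, 4, 5, 6]
print(a[(a < 2) | (a > 7)].tolist())  # [0, 1, 8, 9]
print(a[~(a > 5)].tolist())           # [0, 1, 2, 3, 4, 5]

# Assign through a mask: set every element greater than 5 to 0
b = a.copy()
b[b > 5] = 0
print(b.tolist())  # [0, 1, 2, 3, 4, 5, 0, 0, 0, 0]
```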

**Broadcasting**

In [None]:
a = np.array([
    [0, 1, 2],
    [3, 4, 5],
    [6, 7, 8]
])
a

In [None]:
b = np.arange(3)
b

In [None]:
a + b

In [None]:
c = np.array([
    [0],
    [1],
    [2]
])
c

In [None]:
a + c
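In the examples above, NumPy stretches the smaller array so that its shape matches the larger one. Shapes are compared from the last dimension backwards, and two dimensions are compatible when they are equal or one of them is 1. The sketch below checks the resulting shapes and shows what happens when the shapes are not compatible.

```python
import numpy as np

a = np.arange(9).reshape(3, 3)  # shape (3, 3)
b = np.arange(3)                # shape (3,): treated as one row, repeated for every row of a
c = b.reshape(3, 1)             # shape (3, 1): one column, repeated for every column of a

print((a + b).shape)  # (3, 3)
print((a + c).shape)  # (3, 3)
print((b + c).shape)  # (3, 3): both operands are stretched

# Broadcasting a + b is equivalent to repeating b explicitly
explicit = a + np.tile(b, (3, 1))
print(np.array_equal(a + b, explicit))  # True

# Shapes (3, 3) and (4,) are not compatible
try:
    a + np.arange(4)
except ValueError as err:
    print("cannot broadcast:", err)
```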

## Pandas Data Structures

### Series
- An array-like object.
- Contains an array of data.
- Holds any single data type supported by NumPy.
- Has labels associated with the data.

In [None]:
import pandas as pd

**By default, Series assigns numbers 0 to N-1 as labels for the data.**

In [None]:
series1 = pd.Series([1,2,3,4,5])
series1

**But we can assign labels of our own.**

In [None]:
series2 = pd.Series(['Hello', 'Pandas', 'Series'], index=['a', 'b', 'c'])
series2

**We can also use a dictionary to create a Series.**

In [None]:
dic = {
    'a': 'Hello',
    'b': 'Pandas',
    'c': 'Series'
}
series3 = pd.Series(dic)
series3

**We can easily convert a Series to a NumPy array.**

In [None]:
series2.values

**We can access any item by its index label.**

In [None]:
series2.index

In [None]:
series2['a']

In [None]:
series1[[1,2,3]]
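Square brackets on a Series look items up by label. When the labels are integers, it is easy to confuse a label with a position, so being explicit helps: `.loc` always selects by label and `.iloc` always selects by position. A sketch with made-up values:

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(s.loc['b'])  # 20: selected by label
print(s.iloc[1])   # 20: selected by position

# With integer labels, label and position can disagree
t = pd.Series([10, 20, 30], index=[2, 1, 0])
print(t.loc[0])   # 30: the item whose label is 0
print(t.iloc[0])  # 10: the first item
```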

**We can subset a Series with logical conditions.**

In [None]:
series1[series1 > 2]

In [None]:
series1 > 2

In [None]:
series1[[False, False, True, True, True]]

In [None]:
series2[series2 == 'Hello']

In [None]:
series2 == 'Hello'

**Null values.**

In [None]:
series4 = pd.Series(['Hello', 'Pandas', 'Series', np.nan], index=['a', 'b', 'c', 'd'])
series4

In [None]:
series4.isnull()

In [None]:
series4[series4.isnull()]

In [None]:
series4[series4.isnull()] = 'Something'
series4
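Pandas also has dedicated methods for the two most common ways of handling null values: `fillna` replaces them and `dropna` removes them. Both return a new Series and leave the original untouched. A minimal sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series(['Hello', 'Pandas', 'Series', np.nan], index=['a', 'b', 'c', 'd'])

filled = s.fillna('Something')  # replace null values
dropped = s.dropna()            # remove null values

print(filled['d'])             # Something
print(dropped.index.tolist())  # ['a', 'b', 'c']
print(s.isnull().sum())        # 1: s itself still contains the null value
```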

**Finally, a Series can have a name. So can its index.**

In [None]:
series4.name = 'Words'
series4.index.name = 'Index'
series4

### Data Frame
You can think of it as a collection of Series that share the same index, one Series per column.

In [None]:
dic = {
    'feature1': [21, 32, 33, 55],
    'feature2': [1, 6, 7, 3],
    'feature3': [12, 34, 45, 67]
}

In [None]:
df = pd.DataFrame(dic)
df
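A DataFrame can also be built directly from Series objects. In that case Pandas aligns the rows by index label and fills the gaps with `NaN`. The sketch below uses two made-up Series whose labels only partly overlap.

```python
import pandas as pd

s1 = pd.Series([21, 32, 33], index=['a', 'b', 'c'])
s2 = pd.Series([1, 6, 7], index=['b', 'c', 'd'])

# Rows are matched by label, not by position
frame = pd.DataFrame({'feature1': s1, 'feature2': s2})
print(frame)

print(frame.index.tolist())            # ['a', 'b', 'c', 'd']
print(pd.isna(frame.loc['a', 'feature2']))  # True: s2 has no label 'a'
print(type(frame['feature1']))         # each column is a Series
```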

**We can access the columns of a DataFrame using attribute-style or dictionary-like access. Notice that the index of the output Series is the same as that of the DataFrame.**

In [None]:
df.feature1

In [None]:
df['feature2']

**We can create and delete a feature using the dictionary-like style.**

In [None]:
df['feature4'] = [145, 155, 167, 159]
df

In [None]:
del df['feature4']
df

**Indexing a DataFrame.**

In [None]:
df[df.feature1 > 25]

In [None]:
df.loc[(df.feature1 > 25) & (df.feature2 > 3)]

In [None]:
df.loc[(df.feature1 > 25) & (df.feature2 > 3), ['feature1']]
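Besides `&`, row conditions can be combined with `|` (or) and negated with `~` (not), exactly as with NumPy arrays. The sketch below rebuilds the same small DataFrame so that it runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    'feature1': [21, 32, 33, 55],
    'feature2': [1, 6, 7, 3],
    'feature3': [12, 34, 45, 67]
})

# Rows where either condition holds
either = df.loc[(df.feature1 > 50) | (df.feature2 > 6)]
print(either.index.tolist())  # [2, 3]

# Rows where the condition does not hold
negated = df.loc[~(df.feature1 > 25)]
print(negated.index.tolist())  # [0]

# Rows and several columns at once
subset = df.loc[df.feature2 > 3, ['feature1', 'feature3']]
print(subset.shape)  # (2, 2)
```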

## Data Loading
- **`read_csv`**: Loads data from a file or URL using comma as default delimiter. Please read details in [read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)
- **`read_table`**: Loads data from a file or URL using tab as default delimiter. Please read details in [read_table](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_table.html)
- **`read_json`**: Reads JSON files. Please read details in [read_json](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html)
- **`read_excel`**: Reads XLSX files. Please read details in [read_excel](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html)
- **`open`**: To open any file using plain Python. Please read details in [Python File IO](https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files)
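These readers accept file-like objects as well as paths, which makes it easy to try them without any files on disk. The sketch below feeds the same made-up table to three readers through `io.StringIO`; the column names and values are hypothetical.

```python
import io

import pandas as pd

csv_text = "feature1,feature2\n1,2\n3,4\n"
tsv_text = "feature1\tfeature2\n1\t2\n3\t4\n"
json_text = '[{"feature1": 1, "feature2": 2}, {"feature1": 3, "feature2": 4}]'

from_csv = pd.read_csv(io.StringIO(csv_text))     # comma-separated by default
from_tsv = pd.read_table(io.StringIO(tsv_text))   # tab-separated by default
from_json = pd.read_json(io.StringIO(json_text))  # a list of records

print(from_csv.equals(from_tsv))  # True: same data, different delimiter
print(from_json)
```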

### Data Loading Examples (CSV/TSV Files)

**Pandas usually infers data types automatically, assumes the top row is the header row, and creates an index automatically.**

In [None]:
dummy = pd.read_csv('./dummy-data/dummy1.csv')
dummy.head(2)

**You can use the index to access any row in the data frame.**

In [None]:
dummy.loc[0, :]

**However, you can choose any specific column to be the index of the data frame. Also notice that there is actually no difference between `read_csv` and `read_table` except for the default value of the `sep` parameter.**

In [None]:
dummy = pd.read_table('./dummy-data/dummy1.csv', index_col='feature5', sep=',')
dummy.head()

**But make sure that you are using that index column to access your data.**

In [None]:
dummy.loc[12,:]
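Since the contents of `dummy1.csv` are not shown here, the following sketch uses a small made-up table to illustrate both points: `read_table` with `sep=','` behaves like `read_csv`, and once a column becomes the index, `.loc` expects values from that column.

```python
import io

import pandas as pd

# Hypothetical data standing in for the lab's CSV file
text = "feature1,feature5\n10,12\n20,15\n"

a = pd.read_csv(io.StringIO(text), index_col='feature5')
b = pd.read_table(io.StringIO(text), index_col='feature5', sep=',')

print(a.equals(b))            # True
print(a.index.tolist())       # [12, 15]
print(a.loc[12, 'feature1'])  # 10: 12 is a value of feature5, not a row number
```

With this index, `a.loc[0]` would raise a `KeyError` because no row has the label 0.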

**If your dataset does not have any header, tell Pandas. Otherwise, it will use the first observation as a header.**

In [None]:
dummy = pd.read_csv('./dummy-data/dummy2.csv')
dummy.tail(2)

**If you tell Pandas that there is no header, it will number the columns 0 to N-1 and use those integers as the header.**

In [None]:
dummy = pd.read_csv('./dummy-data/dummy2.csv', header=None)
dummy.sample(frac=.5)

**You can supply your preferred column names if you want.**

In [None]:
names = ['feature' + str(i) for i in range(1, 7)]
dummy = pd.read_csv('./dummy-data/dummy2.csv', header=None, names=names)
dummy.sample(n=3)
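The three cases can be compared side by side on a tiny headerless table. The values below are made up and are read from memory with `io.StringIO`.

```python
import io

import pandas as pd

text = "1,2,3\n4,5,6\n"  # two observations, no header row

# Default: the first observation is consumed as the header
with_header = pd.read_csv(io.StringIO(text))
print(with_header.columns.tolist())  # ['1', '2', '3']
print(len(with_header))              # 1

# header=None: columns are numbered from 0 and no observation is lost
no_header = pd.read_csv(io.StringIO(text), header=None)
print(no_header.columns.tolist())    # [0, 1, 2]
print(len(no_header))                # 2

# header=None with names: your own column labels
names = ['feature' + str(i) for i in range(1, 4)]
named = pd.read_csv(io.StringIO(text), header=None, names=names)
print(named.columns.tolist())        # ['feature1', 'feature2', 'feature3']
```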

### Data Loading Examples (Excel Files)
My personal suggestion is to avoid Excel as much as possible if you are going to work with Python.

In [None]:
dummy = pd.read_excel('./dummy-data/dummy.xlsx', sheet_name=0)
dummy.iloc[1:5]

### Data Loading Examples (JSON)

In [None]:
dummy = pd.read_json('./dummy-data/dummy.json')
dummy.iloc[0:3, 0:4]