# A Brief Introduction to Data Handling in Python

This is a very brief introduction to data handling, and possibly revision for many of you.  
Please note that some of the code demonstrated below is going to be used in later exercises during the module.

## Importing Libraries

First, we import libraries: NumPy (`numpy`), Pandas (`pandas`), and Scikit-learn (`sklearn`) are the main libraries we will be working with.
- NumPy is a library adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.
- Pandas is a library for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.
- Scikit-learn is a machine learning library.

In [None]:
import numpy as np
import pandas as pd

From Scikit-learn, we import only selected items.

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.datasets import load_iris

We installed NumPy, Pandas and Scikit-learn on top of a standard Python installation.  
Python itself also has a standard library, which does not require additional installation.  
You may find some of its modules useful in the future, although we will make only minimal use of them in this notebook:
https://docs.python.org/3/library/index.html

## Exploring the Data Set

The 'Iris data set' is included in Scikit-learn, and is a well-known data set commonly used when teaching machine learning.  
We load it below.   
Round brackets are used mainly for function arguments.  
Square brackets are used mainly in the context of indexing and slicing (here, accessing elements of a collection). 

In [None]:
iris = load_iris()
X = iris['data']
y = iris['target']

We can print details of the Iris data set.  
We can print using the standard Python `print()` function, but in a Jupyter Notebook, the value of the last expression in a code cell is displayed automatically.  
Automatic display can be suppressed by ending the line with `;`.  

In [None]:
iris

We can also explore the data set in more detail:

- We can print feature names:

In [None]:
iris['feature_names']

- We can print data dimensions:

In [None]:
X.shape 

Let's assign the `shape` of the data set to variables (we can use them later to iterate over the data set).

`X.shape` is a two-element tuple, where `X.shape[0]` is the number of rows (samples), and `X.shape[1]` is the number of columns (features).  
(Please note that indexing starts at 0, for both rows and columns.)  

The code below is equivalent to:  
`n_sample = X.shape[0]`  
`n_feature = X.shape[1]`

In [None]:
n_sample, n_feature = X.shape 
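As a minimal sketch of this unpacking, the example below uses a small, hypothetical array (`example`, not part of the Iris data) so that the expected numbers are easy to check by eye:

```python
import numpy as np

# Hypothetical small array standing in for a data set: 4 rows, 3 columns.
example = np.zeros((4, 3))

# shape is a tuple, so it can be unpacked into two variables in one line.
n_rows, n_cols = example.shape
print(type(example.shape))
print(n_rows, n_cols)
```

Unpacking only works when the number of variables on the left matches the number of elements in the tuple.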

You can select columns and rows in the dataset. 

Rows are selected before the comma, columns after the comma. Note that this corresponds to the order of the elements describing the shape. This (row, column) order is used consistently by the libraries in this notebook.

You will encounter objects with more than two dimensions. These are indexed using one entry per axis, e.g. `X[:, :, :, :]` for four dimensions.    
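To illustrate indexing with more than two dimensions, here is a short sketch using a hypothetical three-dimensional array called `cube` (an invented name, unrelated to the Iris data):

```python
import numpy as np

# Hypothetical 3-dimensional array: 2 blocks, each with 3 rows and 4 columns.
cube = np.arange(24).reshape(2, 3, 4)

# The shape contains one number per axis.
print(cube.shape)

# Select the whole first block: one index entry per axis.
first_block = cube[0, :, :]
print(first_block.shape)

# Select a single value: second block, third row, fourth column.
single_value = cube[1, 2, 3]
print(single_value)
```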
  
Let's select the first row: 

In [None]:
X[0, :] 

When selecting adjacent rows / columns, use `:` to separate the start and end of the slice (the end index itself is not included). 

When selecting multiple non-adjacent rows / columns, use `,` between them, and wrap the whole selection in square brackets.    
  
Let's select the first three rows:

In [None]:
X[0:3, :]
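Selecting non-adjacent rows or columns, as described above, might look like the sketch below. It uses a small hypothetical array (`example`) rather than the Iris data, so the selected values are easy to verify:

```python
import numpy as np

# Hypothetical array with 5 rows and 4 columns, filled with 0..19.
example = np.arange(20).reshape(5, 4)

# Rows 0 and 3 (non-adjacent), all columns.
selected_rows = example[[0, 3], :]
print(selected_rows)

# All rows, columns 1 and 3 (non-adjacent).
selected_cols = example[:, [1, 3]]
print(selected_cols)
```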

Let's select the first column, then save it to a variable for later use:

In [None]:
first_col = X[:, 0] 
print(first_col)

## Writing Functions 

We can calculate the average 'sepal length' (feature 0) by hand:  
  
The notation `x += y` is a shorthand for `x = x + y`. 

In [None]:
total = 0
for item in range(n_sample):
    total += X[item, 0] 
    
avg = total / n_sample
print(avg)

Above, we used a `for` loop, which is used to iterate over a sequence. If you are not familiar with `for` loops, you can find information here: https://www.w3schools.com/python/python_for_loops.asp

Other common ways of controlling flow of your code are `if`, `elif` and `else` statements. More information can be found here: https://www.w3schools.com/python/python_conditions.asp  
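As a minimal sketch of `if`, `elif` and `else`, the hypothetical function below labels a length. The function name and the thresholds are arbitrary choices made for illustration only:

```python
def describe_length(length_cm):
    """Label a length using arbitrary, illustrative thresholds."""
    if length_cm < 5.0:
        return "short"
    elif length_cm < 6.5:
        return "medium"
    else:
        return "long"


# Conditions are checked from top to bottom; the first one that holds wins.
for value in [4.6, 5.8, 7.1]:
    print(value, describe_length(value))
```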

If we think we will be using the above code again, we can put it into a function which calculates the average of a vector. Then we pass all the required arguments to the function and return the result.  

Since we need to pass the vector as an argument anyway, we can calculate its length inside the function.  
(Note that we switched from operating in two dimensions when working with the whole multidimensional array above, to operating in one dimension below).

In [None]:
def avg_vector(vector): #function definition
    total = 0
    n_sample = len(vector)
    for item in range(n_sample):
        total += vector[item]
    
    avg = total / n_sample
    return avg

In [None]:
avg_sepal_len = avg_vector(first_col) #function call
print(avg_sepal_len)

We can now call the function to calculate the average of another column: 'sepal width':

In [None]:
avg_sepal_width = avg_vector(X[:, 1])
print(avg_sepal_width)

## Using NumPy

We will need custom functions in the future; however, we should leverage existing Python functions and use the imported libraries to avoid having to write code for common operations - like averaging - from scratch.

Simple functions exist in Python by default, and frequently also exist as a part of more advanced libraries like NumPy. You will encounter both versions, but try to make consistent choices in your own code.     
  
The `mean()` method (strictly speaking, this is a method of the NumPy array object rather than a built-in Python function):

In [None]:
py_avg = first_col.mean()
print(py_avg)

In the code below, `axis = 0` averages down the rows and returns one value per column, whereas `axis = 1` would average across the columns and return one value per row. When using new functions, check the documentation to see which arguments they accept.

In [None]:
py_avg_by_col = X.mean(axis = 0)
print(py_avg_by_col)
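The difference between the two axes is easier to see on a tiny array. The sketch below uses a hypothetical array with 2 rows and 3 columns (`example` is an invented name):

```python
import numpy as np

# Hypothetical array with 2 rows and 3 columns.
example = np.array([[1.0, 2.0, 3.0],
                    [5.0, 6.0, 7.0]])

col_means = example.mean(axis=0)  # one value per column (3 values)
row_means = example.mean(axis=1)  # one value per row (2 values)
print(col_means)
print(row_means)
```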

An equivalent standalone function, `np.average()`, exists in NumPy:  
(You can check the documentation of the NumPy average function here:  
https://docs.scipy.org/doc/numpy/reference/generated/numpy.average.html)

In [None]:
np_avg = np.average(first_col)
print(np_avg)

In [None]:
np_avg_by_col = np.average(X, axis = 0)
print(np_avg_by_col)
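For comparison, the sketch below computes the same average in several ways on a small hypothetical array (`values`): with `statistics.mean()` from the Python standard library, with the array's `mean()` method, and with the NumPy functions `np.mean()` and `np.average()`. It also shows the optional `weights` argument of `np.average()`; the weights here are arbitrary.

```python
import statistics

import numpy as np

values = np.array([1.0, 2.0, 3.0, 6.0])

std_lib_mean = statistics.mean(values.tolist())  # Python standard library
method_mean = values.mean()                      # NumPy array method
np_mean = np.mean(values)                        # NumPy function
np_average = np.average(values)                  # NumPy function

# np.average additionally accepts optional weights (arbitrary ones here).
weighted = np.average(values, weights=[1, 1, 1, 3])

print(std_lib_mean, method_mean, np_mean, np_average)
print(weighted)
```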

Characteristic NumPy features include multidimensional array objects called `ndarray`s and functions which work with them. Examples are available here: https://docs.scipy.org/doc/numpy/user/quickstart.html  

In fact, the Iris data set we encountered earlier contains NumPy array objects.  

We can check this using the `type()` function:

In [None]:
dataset_type = type(iris.data)
dataset_dimensions = iris.data.ndim
dataset_shape = iris.data.shape
print("Iris dataset is of type %s, has %d dimensions, %d rows and %d columns.\n" % (dataset_type, dataset_dimensions, dataset_shape[0], dataset_shape[1]))

col_type = type(first_col)
col_dimensions = first_col.ndim
col_shape = first_col.shape
print("A single column is of type %s, has %d dimension, and %d rows.\n" % (col_type, col_dimensions, col_shape[0]))

## Using Masks

If you would like to make changes to the data, but also want to keep the original data for reference, you can create a copy:

In [None]:
X_copy = X.copy()

To work with a selection of your values: 

1) Create a mask for your data: This has the same shape as your data, and contains Boolean values (True / False) indicating whether a given value in the data set fulfills the condition.  

2) Use the mask to perform operations only on the selected values from your data set.

Below, we select all elements of `X_copy` which are larger than 5.0, then replace them with NumPy's `nan` value. 

In [None]:
mask = (X_copy > 5.0)
print(mask)

We then replace values which were larger than 5.0 with `nan`. `nan` stands for "not a number", and is commonly used to indicate undefined values.

In [None]:
X_copy[mask] = np.nan
print(X_copy)
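The two steps above can also be sketched on a tiny hypothetical array (`example`), where it is easy to check which values the mask selects:

```python
import numpy as np

# Hypothetical array with 3 rows and 2 columns.
example = np.array([[4.9, 3.0],
                    [5.8, 2.7],
                    [6.3, 3.3]])

# Step 1: the mask has the same shape as the data.
mask_example = example > 5.0
print(mask_example.shape)

# Step 2: use the mask to count and to extract the selected values.
n_selected = mask_example.sum()   # True counts as 1, False as 0
selected = example[mask_example]  # selected values as a 1-D array
print(n_selected)
print(selected)
```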

In practice, you will most likely be removing or replacing `nan` in your data set.  

The NumPy `isnan()` function returns a Boolean array indicating, element by element, whether each value is `nan`.  

Below, we use it inside the NumPy `any()` function with `axis=1`. Together, they return `True` for every row which contains at least one `nan`, and `False` for rows which contain none.  

Next, we invert the mask with `~` and use it to keep only the rows without `nan`, which removes all rows with `nan` from the data set. 

In the future, we will cover strategies such as replacing `nan` with the feature's average. 

In [None]:
mask_nan = np.any(np.isnan(X_copy), axis=1)
X_copy = X_copy[~mask_nan]
print(X_copy)

You could also use the NumPy `nan_to_num()` function: https://docs.scipy.org/doc/numpy/reference/generated/numpy.nan_to_num.html   

The Pandas library (described in more detail below) provides, amongst other things, the `dropna()` function for removing undefined data.

As you can see, there is often more than one way to achieve the same result. 
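The two alternatives mentioned above can be sketched as follows. The array `example` and the column names `"a"` and `"b"` are hypothetical and chosen only for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical array containing some undefined values.
example = np.array([[1.0, np.nan],
                    [2.0, 3.0],
                    [np.nan, 4.0]])

# nan_to_num replaces nan with 0.0 by default.
filled = np.nan_to_num(example)
print(filled)

# dropna removes every row containing at least one missing value.
cleaned_df = pd.DataFrame(example, columns=["a", "b"]).dropna()
print(cleaned_df)
```

Note that replacing `nan` with zero changes statistics such as the average, so whether this is appropriate depends on the data.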

## Using Pandas

You will frequently work with Pandas data frames.  

When creating data frame objects, `_df` is frequently added to an object's name for clarity, or the object is simply called `df`.  

An empty data frame can be created by not providing values for data and columns:     

In [None]:
empty_df = pd.DataFrame() 
print(empty_df)

Below, we fill the frame with random integers in the range 0 to 9 (the upper bound, 10, is excluded), arranged in 100 rows and 3 columns.

By default, `df.head()` returns the first five rows of the data frame. This is handy when we want to check the data frame, but we do not want to print all rows. An argument can be used to change the number of rows, e.g. `df.head(10)`.

In [None]:
labels = ["age", "blood_pressure", "weight"]
my_data = np.random.randint(low=0, high=10, size=(100, 3))
df1 = pd.DataFrame(data = my_data, columns = labels)
print(df1.head())
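As a short sketch of choosing the number of rows, the example below uses a hypothetical data frame (`example_df`) with ten rows. It also shows `tail()`, which returns rows from the end of the data frame instead:

```python
import numpy as np
import pandas as pd

# Hypothetical data frame with a single column holding the values 0..9.
example_df = pd.DataFrame({"value": np.arange(10)})

first_three = example_df.head(3)  # first three rows
last_two = example_df.tail(2)     # last two rows
print(first_three)
print(last_two)
```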

Further syntax for creating data frames is shown below. 

To initialise an empty data frame only with labels, use:  

In [None]:
empty2_df = pd.DataFrame({"age" : [], "blood_pressure" : [], "weight" : []})

To create a data frame with content:

In [None]:
df2 = pd.DataFrame({
    "age": np.random.randint(0, 99, size=100),
    "blood_pressure": np.random.randint(60, 150, size=100),
    "weight": np.random.randint(50, 120, size=100),
})
print(df2.head())

We can check details of the data frame using the `df.info()` function, where `df` is replaced with the name of your data frame object.

In [None]:
df2.info()

Basic statistical analysis of Pandas dataframes can be performed by using the `df.describe()` function.  
You should always look at your data and do common sense checks before you feed it into machine learning algorithms.

In [None]:
df2.describe()

Both column names and indices (accessed using `iloc[ ]`) can be used to select dataframe columns.  

A single set of square brackets is required when using one column name, and two sets of square brackets are required when using multiple names. For example:  
`df2["age"]`  
`df2[["age", "weight"]]`

In [None]:
df2[["age", "weight"]].head() 
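The difference between single and double square brackets also shows up in the type of the result. In the sketch below, `example_df` is a small hypothetical data frame used only for illustration:

```python
import pandas as pd

# Hypothetical data frame with two columns.
example_df = pd.DataFrame({"age": [30, 41], "weight": [70, 82]})

one_column = example_df["age"]                # single brackets: a Series
two_columns = example_df[["age", "weight"]]   # double brackets: a DataFrame
print(type(one_column))
print(type(two_columns))
```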

In [None]:
df2.iloc[:, [0, 2]].head()

`df.iloc[ ]` can also be used to select rows:

In [None]:
df2.iloc[0:5, :]

## Saving and Reading Data
### Checking the Working Directory
First, let's check which directory we are currently working in:

In [None]:
pwd

Files will be saved here, unless the full file path is provided. 

We can change the working directory (for this we need to import the `os` library): 

In [None]:
import os

my_path = "/home/dariush/Desktop" # replace with your path
os.chdir(my_path)

Check the change:

In [None]:
pwd
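`pwd` works in a notebook because IPython treats it as a special command; it is not available in a plain Python script. A minimal sketch of an equivalent that works in both settings uses `os.getcwd()` from the standard library:

```python
import os

# os.getcwd() returns the current working directory as a string.
current_dir = os.getcwd()
print(current_dir)
```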

### Saving Files
One of the ways to save your data is by using the Pandas `df.to_csv()` function. 

We create a comma-separated file, which is the default. To create a tab-separated file instead, pass `sep="\t"`.

In [None]:
# To be on the safe side, you can pass the whole path, i.e. directory + file_name.
file_name = "my_file.csv"
# We save without the index column. We do not really need it (it was just numbering),
# and it is going to be recreated when we open the file below.
df2.to_csv(file_name, index=False)

To read a file we can use the Pandas `read_csv()` function. It can take many useful arguments, for example to indicate the delimiter used (such as commas or tabs), to skip blank lines, etc. 

Documentation can be found here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

In [None]:
my_saved_file_name = "my_file.csv"
new_df = pd.read_csv(my_saved_file_name)
print(new_df.head())
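As a sketch of the delimiter argument, the example below writes a small hypothetical data frame to a tab-separated file and reads it back. The file name `example.tsv` is invented, and a temporary directory is used so that no files are left behind:

```python
import os
import tempfile

import pandas as pd

# Hypothetical data frame used only for this round trip.
example_df = pd.DataFrame({"age": [30, 41], "weight": [70, 82]})

# Write and read inside a temporary directory so no files are left behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    tsv_path = os.path.join(tmp_dir, "example.tsv")  # hypothetical file name
    example_df.to_csv(tsv_path, sep="\t", index=False)
    restored_df = pd.read_csv(tsv_path, sep="\t")

print(restored_df)
```

The same separator has to be given when writing and when reading; otherwise each line is read as a single column.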

## Final Note
Please remember that most data scientists Google things all the time: libraries change, and the problems we face have often already been solved by someone else.  
The 'Stack Overflow' site is a useful source of problem solutions and code examples.  
Remember that using documentation and libraries, and copying small chunks of code (if not restricted by law), is OK, although this should always be referenced where appropriate.