# Week 2 - Data and Statistics

## Learning Objectives

+ Mounting to Google Drive
+ Different Types of Data File Formats
  + Reading CSV file using Pandas
  + Introduction to Pandas
      + Series and Dataframes
+ Reading Other File Formats in Pandas
+ Network/Graph Data Representation
  + Introduction to Numpy
      + Array
  + Operations on Graph
      + Slicing and Broadcasting
+ Introduction to Scipy
  + Distributions
  + Statistical tests

Most of the materials for this tutorial are from [pandas tutorial](https://pandas.pydata.org/pandas-docs/stable/getting_started/tutorials.html), "Python for Probability, Statistics, and Machine Learning" by José Unpingco, [scipy lectures](https://scipy-lectures.org/packages/statistics/index.html), and [numpy tutorial](https://numpy.org/devdocs/user/whatisnumpy.html). You can refer to these resources for further understanding and practice. 


# Mounting your drive in Google Colab

In [None]:
from google.colab import drive
drive.mount('/content/drive') # alternative is to drag and drop to google colab

In [None]:
import sys
sys.path.insert(0,'/content/drive/My Drive/Colab Notebooks/IT5006/Week 2')

from python_import_demo import *
print(imported_function(3))


# Different Types of Data File Formats

In general, data can be categorized as structured or unstructured. As data scientists, we come across a variety of data file formats, depending on the type and modality of the data we are handling. The same data may be available to us in different formats. For example, while working on clinical data of patients from a hospital, we may have data in any of the following formats:
+ Various tables (about billing and admission of patients) available with patient-wise key mappings \[structured data\]
+ Patient-wise ECG data (time-series)
+ Collection of clinical notes (.txt files) for each patient
+ Patient-wise data for Radiology Images (image data)

Different tools and libraries specialize in handling certain types of data.
Some important file formats that we come across most often in real-life are listed below.
+ Tabular and Structured Data - Excel sheets, csv files, tab separated files, SQL data fall under this category
+ Hierarchical or Nested File Format - JSON and XML are the most popular files under this category. The big-data files of HDF5 may also fall under this category. 
+ Image Data - The image file formats typically depend on the application
+ Text Data - Plain txt files are most commonly used to store unstructured text data (which may have been scraped from the internet)
+ Array Data - Typically graph data or modelling results are stored in the form of array data. This may be available as NPY files or more specific binary file formats. 

In this tutorial, we will see the Python libraries available for handling a few of these data types - Pandas and Numpy. Then, we will explore the usage of another library - Scipy - to perform statistical tests using Python. 

## Introduction to Pandas

Pandas is a Python package which provides flexible and expressive data structures designed to make working with "relational" or "labeled" data easy. It is well suited for different kinds of data:

+ Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet

+ Ordered and unordered (not necessarily fixed-frequency) time series data.

+ Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels

+ Any other form of observational / statistical data sets. The data need not be labeled at all to be placed into a pandas data structure

There are two primary data structures in Pandas: Series and Dataframes

## Reading csv file using Pandas

We can read a csv into a dataframe using pandas. We now read the data contained in the ```brain_size.csv``` file. It gives the observations of brain size and weight and IQ. Reading a CSV in Pandas has many capabilities not available in the basic csv module. You can refer to the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) for all the various parameters. 

Please note that the community agreed alias for pandas is pd, so loading pandas as pd is assumed standard practice for all the pandas documentation.

In [None]:
import pandas as pd

data_df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/IT5006/Week 2/brain_size.csv", sep=';', na_values='.', index_col=0)
data_df
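
To see what the `sep`, `na_values` and `index_col` arguments do without needing the file on Drive, the following is a minimal sketch on a small made-up sample. The string `csv_text` and the name `toy_df` are assumptions for illustration only; they are not the contents of `brain_size.csv`.

```python
import io

import pandas as pd

# Made-up sample in the same style as the call above:
# ';' separates the columns and '.' marks a missing value.
csv_text = "id;Gender;VIQ;Weight\n1;Female;132;118\n2;Male;150;.\n"

toy_df = pd.read_csv(
    io.StringIO(csv_text),  # read from an in-memory string instead of a file
    sep=";",                # column delimiter
    na_values=".",          # treat '.' as a missing value (NaN)
    index_col=0,            # use the first column as the row index
)
print(toy_df)
print(toy_df.dtypes)
```

Because one entry of `Weight` is missing, that column is read as floating point, while `VIQ` stays an integer column.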

We have explored the usage of the most popular file format for tabular data - the CSV (comma separated values) file. A CSV file is essentially a plain text file in which the columns are separated by a delimiter (usually a comma) and every new line represents a new row. We can read a CSV file like a normal text file in Python and split each line on the delimiter, or use the natively available [csv module](https://docs.python.org/3/library/csv.html) for reading and writing CSV files.

However, for performing analysis and integrating with a typical ML pipeline, the Pandas library is typically used. 

## Reading text file using Pandas

We can also read a txt file into a dataframe using the versatile ```read_csv``` function in pandas. We now read the data contained in the ```earth_orbital_data.lst``` file. It gives the location of the Earth with respect to the sun (the radius, latitude and longitude) on various days in the year 2000. We then change the column names so they are easier to understand. The data was retrieved from NASA: ```https://omniweb.gsfc.nasa.gov/coho/helios/heli.html```

In [None]:
earth_df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/IT5006/Week 2/earth_orbital_data.lst", sep = " ", skipinitialspace = True)
earth_df = earth_df.rename(columns={'YEAR': 'Year', 'DAY': 'Day', "RAD_AU": "Radius", "SE_LAT": "Latitude", "SE_LON": "Longitude"})
earth_df

## Using pandas

### Series

A Series object combines an index and the corresponding data values. 

In [None]:
x = pd.Series(index = range(1, 11), data = [x**2 for x in range(1,11)], name="Squares")
x

We can retrieve values from a Series (or a dataframe) using ```iloc```, which is integer-location based indexing used for selection by position.

In [None]:
x.iloc[5]

In [None]:
x.iloc[1:3]
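
Since the index of this Series starts at 1 rather than 0, position and label do not coincide. The sketch below rebuilds the same Series under the assumed name `squares` and contrasts position-based `iloc` with label-based `loc`.

```python
import pandas as pd

# Same construction as above; the index labels run from 1 to 10.
squares = pd.Series(index=range(1, 11), data=[n**2 for n in range(1, 11)], name="Squares")

by_position = squares.iloc[5]    # sixth element, which carries the label 6
by_label = squares.loc[5]        # element whose index label is 5

pos_slice = squares.iloc[1:3]    # positions 1 and 2; the end position is excluded
label_slice = squares.loc[1:3]   # labels 1, 2 and 3; the end label is included

print(by_position, by_label)
print(pos_slice.tolist(), label_slice.tolist())
```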

### Dataframe

Formally, we refer to the two-dimensional, potentially heterogeneous, tabular data structure as a Dataframe. It can be thought of as a dictionary-like container for Series objects. It is similar to a spreadsheet and can also be loaded from csv files.

<img src = "https://pandas.pydata.org/pandas-docs/stable/_images/01_table_dataframe.svg">

Every column in a dataframe is a Series. 

In [None]:
cubes = pd.Series(index=range(1,11), data=[x**3 for x in range(1,11)], name="Cubes")
df = pd.DataFrame({"Squares":x, "Cubes":cubes})
df

In [None]:
type(df['Squares'])

We can see that the datatypes for the dataframe have been inferred. Columns with mixed types are stored with the object dtype. We can also set the dtype to be [categorical](https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html) or [datetime](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html). You can read the [documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#basics-dtypes) for the various data types.

In [None]:
data_df.dtypes
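
As a minimal sketch of setting a dtype explicitly, the example below converts a string column of a small made-up dataframe to the categorical dtype using `astype`. The dataframe `toy_df` and its values are assumptions for illustration.

```python
import pandas as pd

# Made-up data: a string column and a numeric column.
toy_df = pd.DataFrame({"Gender": ["Female", "Male", "Female"], "VIQ": [132, 150, 90]})

# Convert the string column to the categorical dtype.
toy_df["Gender"] = toy_df["Gender"].astype("category")

print(toy_df.dtypes)
print(toy_df["Gender"].cat.categories)  # the distinct category labels
```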

How can we convert the ```Gender``` column, which is stored with the object dtype, into categorical data?

### Operations on Dataframes

We can get some quick properties of the data using Dataframe functions.

In [None]:
data_df.head(10)

We see that the second row in the dataframe has a NaN value in the "Weight" column, which indicates a missing value. Note that it is important to handle these missing values when doing statistical analysis.

In [None]:
data_df.shape

This indicates that the dataframe has 40 rows and 7 columns. We can also see the names of the columns, if they are present in the csv.

In [None]:
data_df.columns

Pandas provides us with a quick way to generate descriptive statistics using the ```describe``` function. We will get the descriptive statistics for each of the numeric columns in the data frame.

These statistics are generated by excluding the missing values in the data. The count value in the statistics can also thus be used to get an idea of the number of missing values in the column. 

In [None]:
data_df.describe()

We can also check the dataframe for a specific value of specific column directly using format ```df[col]==val```. This returns a series with ```True``` values for only those rows which have the specific ```val``` in the ```col``` mentioned. 

How would we find out which rows correspond to observations of female patients?

In [None]:
df_female = data_df['Gender']=="Female"
df_female.head()

We can use the series to filter the dataframe to get only the rows which adhere to the condition mentioned. We can extract further information from the filtered rows.

What is the mean VIQ of female patients?

In [None]:
data_df[data_df['Gender']=="Female"]['VIQ'].mean()

Alternatively, we can use the series to create a new column, ```isFemale```.

In [None]:
data_df["isFemale"] = df_female
data_df.head()

How do we find the mean VIQ of female patients with weight greater than 120 pounds?

#### Grouping on values 

You can use the ```groupby``` function to split the dataframe on the values of a variable.

We can view the VIQ values for each gender in the dataset by grouping the patients by gender.

In [None]:
gender_groupby = data_df.groupby("Gender")
for gender, val in gender_groupby["VIQ"]:
  print("Gender:"+str(gender))
  print("Values")
  print(val)

So, we see that groupby returns an object that contains information for the group - we created a series grouped on the value of Gender. 

Typically we use groupby to get some aggregate statistics for the groups we have done a groupby on. It is important to note that groupby evaluation is lazy - i.e. no computation is done until the aggregation function is applied. 


In [None]:
gender_groupby

For example, we can find the mean FSIQ, VIQ, PIQ, weight, height and MRI count among females and males using groupby.

In [None]:
gender_groupby.mean()
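
The split-then-aggregate pattern can be sketched on a small made-up dataframe. The values and the names `toy_df`, `grouped`, `viq_means` and `summary` are assumptions for illustration; nothing is computed until an aggregation such as `mean` or `agg` is called on the grouped object.

```python
import pandas as pd

# Made-up data with two groups.
toy_df = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Male"],
    "VIQ": [100, 120, 110, 80],
})

grouped = toy_df.groupby("Gender")       # only records how to split the rows

viq_means = grouped["VIQ"].mean()        # one aggregate value per group
summary = grouped["VIQ"].agg(["mean", "min", "max"])  # several aggregates at once

print(viq_means)
print(summary)
```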

How many males/females are included in the study?

What is the mean value for VIQ for entire population? Also, print the mean VIQ for each gender in the following format:
```
Male, VIQmeanvalue
Female, VIQmeanvalue
```


Create a new column named BMI using the Weight and Height columns. The formula is weight (in kg) divided by the square of height (in m). In the data, the height is in inches and the weight is in lbs. You can create two new columns with appropriate units before making the BMI column. 


In [None]:
data_df.head(5)

How does BMI vary among males and females (variation according to underweight (under 18.5), normal (18.5-22.9), and overweight (above 22.9))? 

## Handling Other File Formats in Pandas

Pandas supports reading many different file formats - including general delimited file (which is referred to as table), Excel, JSON, HDF5, SQL, etc. You can refer to the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/io.html) for full details. 

### JSON files
In Python, JSON files can also be handled natively using the [json](https://docs.python.org/3/library/json.html) module. We basically treat the data as a Python dictionary. 

Let us quickly explore the handling of JSON files in Pandas. When reading a JSON file, the orientation (```orient```) plays an important role: it describes how the JSON file is structured and how it must be read.  
+ 'split' : dict like {index -> \[index\], columns -> \[columns\], data -> \[values\]}
+ 'records' : list like \[{column -> value}, ... , {column -> value}\]
+ 'index' : dict like {index -> {column -> value}}
+ 'columns' : dict like {column -> {index -> value}}
+ 'values' : just the values array

We now read the `exchangerates.json` file into a dataframe, using the 'records' orientation. 

In [None]:
exchange_rates_df = pd.read_json("/content/drive/MyDrive/Colab Notebooks/IT5006/Week 2/exchangerates.json", orient = "records")
exchange_rates_df
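
To make the 'records' orientation concrete without the file on Drive, here is a minimal round-trip sketch: a small made-up dataframe is written to a JSON string with `to_json` and read back with `read_json`. The column names and values are assumptions and are unrelated to the contents of `exchangerates.json`.

```python
import io

import pandas as pd

# Made-up data for illustration.
toy_df = pd.DataFrame({"currency": ["USD", "EUR"], "rate": [1.0, 0.9]})

# 'records' writes a list with one {column -> value} object per row.
records_json = toy_df.to_json(orient="records")
print(records_json)

# Read the JSON string back, telling pandas which orientation to expect.
restored_df = pd.read_json(io.StringIO(records_json), orient="records")
print(restored_df)
```

Note that the 'records' orientation does not store the row index, so a non-default index would be lost in this round trip.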

# Handling Network/Graph Data

Suppose we are working on social network data. Let us consider we have the adjacency matrix representation of the social network connectivity available.   

The adjacency matrix indicates the connectivity between two nodes, each represented by the index of the row and column of the matrix. The presence of an edge between the nodes is indicated by 1 (or the weight in case of weighted graph) or 0 otherwise. 


| ![adj_mat.jpg](https://mathworld.wolfram.com/images/eps-gif/AdjacencyMatrices_1002.gif) | 
|:--:| 
| Image source: Wolfram MathWorld |

Rather than representing this graph data as a list of lists in Python, we can use another library - Numpy - to represent the adjacency matrix and perform computation easily. Incidentally, Pandas is built on top of Numpy, so many Pandas operations such as sum and other aggregations rely on Numpy functionality.

## Introduction to Numpy

Numpy is a fundamental package for scientific computing in Python. At the core of the numpy package is the *ndarray* object, which encapsulates *n*-dimensional arrays of **homogeneous** data types. There are many important differences between numpy arrays and standard Python sequences:
+ Numpy arrays have fixed size at creation.
+ The items have to be of homogeneous data type.
+ Numpy arrays efficiently execute advanced mathematical operations. 

### Array

We can create a numpy array from a homogeneous list. Numpy's array class is called [```ndarray```](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html). It is also known by the alias ```array```. The ```ndarray``` has attributes like ```ndim```, ```shape```, ```size```, ```dtype```, ```itemsize```, ```data```, etc.

In [None]:
import numpy as np

a = np.array([1,2,5,3])
print(a)

b = np.array(["This", "is", "a", "np", "array", "of", "strings"])
print(b)

b.shape

We have many options to create new numpy arrays. Let us quickly go through each of the different ways, other than ```np.array```.

In [None]:
c = np.ones((3,4))
d = np.full((3,4), 0.15)
e = np.empty((3,2))

print(c)
print(d)
print(e)

We can use usual element-wise operations on these arrays.

In [None]:
print(c+d)
print(c*d)

```np.arange``` returns evenly spaced values within a given interval, with a start, stop and step size that can be specified. Another function for getting evenly spaced values in a specific interval is [```np.linspace```](https://numpy.org/doc/stable/reference/generated/numpy.linspace.html). Rather than a step size, as used in ```arange```, it takes the number of samples to generate.

Another function which is handy is ```reshape``` - it can change the shape of the array without changing its data. 

In [None]:
a = np.arange(1, 25, 3)
print(a)
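
As a short sketch of the difference between the two functions, the same eight values can be produced either from a step size or from a sample count. The names `by_step` and `by_count` are chosen here for illustration.

```python
import numpy as np

# arange: start, stop (excluded) and step size.
by_step = np.arange(1, 25, 3)

# linspace: start, stop (included by default) and number of samples.
by_count = np.linspace(1, 22, 8)

print(by_step)
print(by_count)
```

Note that `np.arange` with integer arguments returns integers, while `np.linspace` returns floating point values.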

In [None]:
a1 = a.reshape(4,2)
a1

You can think of the reshaping process as reading across rows first, then across columns.

In [None]:
a1.reshape(2,4)

## Operations on Adjacency Matrix using Numpy

Let us use the adjacency matrix for a 4 node graph shown in the figure earlier.
```
[[0 0 1 1]
 [0 0 1 1]
 [1 1 0 0]
 [1 1 0 0]]
```

In [None]:
c = np.array([[0,0,1,1],[0, 0, 1, 1],[1, 1, 0, 0],[1, 1, 0, 0]])
c

Now, if we want to find the degree of each node, we need to sum the number of edges for each node. We can use the Numpy operation ```sum``` for finding the degree of each node. 

In [None]:
c.sum(axis=0) # sum over each column

So, we see that the degree of each node is 2. Numpy allows other reductions as well - such as ```argmin```. For these operations, we need to specify the axis for our operation. If the axis is not mentioned, the returned index is into the flattened array. For a visual representation, the following image helps capture the direction for operation on the array.

| ![axis.jpg](https://i.stack.imgur.com/h1alT.jpg) | 
|:--:| 
| Image source: Stack Overflow |

We can similarly get statistical results on the data, again specifying the axis. Mean, standard deviation, and many more statistics follow the same principle. You can explore the complete list of such Numpy operations in the [documentation](https://numpy.org/doc/stable/reference/routines.statistics.html).
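
The effect of the `axis` argument can be sketched on a small made-up matrix. The matrix `m` is an assumption for illustration and is not the adjacency matrix above.

```python
import numpy as np

m = np.array([[3, 7, 1],
              [9, 2, 5]])

col_sums = m.sum(axis=0)        # collapse the rows: one value per column
row_sums = m.sum(axis=1)        # collapse the columns: one value per row

flat_argmax = m.argmax()        # no axis: index into the flattened array
col_argmax = m.argmax(axis=0)   # row index of the maximum in each column

row_means = m.mean(axis=1)      # mean of each row

print(col_sums, row_sums)
print(flat_argmax, col_argmax)
print(row_means)
```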

### Slicing and Broadcasting

Slicing, indexing and iteration in one-dimensional arrays are very similar to those for lists in Python. Multidimensional arrays can have one index per axis. When fewer indices are provided than the number of axes of the array, the missing indices are considered as complete slices.

In [None]:
print(c[2:4,:])
print(c[2:4])

We have seen that operations are usually performed element-wise, when both the arrays are of the same shape. Numpy’s broadcasting rule relaxes this constraint when the array shapes meet certain constraints. 

When operating on two arrays, numpy compares their shapes. It starts with the rightmost dimension and works its way left. Two dimensions are compatible when
1. they are equal, or
2. one of them is 1

Note that this implies that the arrays do not need to have the same number of dimensions!
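
The compatibility rule above can be sketched with two small arrays of different dimensionality, followed by a pair of shapes that cannot be broadcast. The names `col`, `row`, `grid` and `broadcast_failed` are chosen here for illustration.

```python
import numpy as np

col = np.arange(3).reshape(3, 1)   # shape (3, 1)
row = np.arange(4)                 # shape (4,)

# Rightmost dimensions: 1 vs 4 -> compatible (one of them is 1).
# Next dimension: 3 vs (missing) -> compatible.
grid = col + row                   # result has shape (3, 4)
print(grid.shape)
print(grid)

# Shapes (3, 4) and (3,): rightmost dimensions 4 vs 3 are neither equal nor 1.
broadcast_failed = False
try:
    np.ones((3, 4)) + np.ones(3)
except ValueError as err:
    broadcast_failed = True
    print("Cannot broadcast:", err)
```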

#### Application of Broadcasting - Creating distance matrix

Let’s construct an array of distances (in miles) between cities along Route 66 in the US: Chicago, Springfield, Saint-Louis, Tulsa, Oklahoma City, Amarillo, Santa Fe, Albuquerque, Flagstaff and Los Angeles. The mileposts of these cities (in miles, measured from Chicago) are:

```
[0, 198, 303, 736, 871, 1175, 1475, 1544, 1913, 2448]
```

We can create the distance matrix for these cities, assuming that the only route for travel is the route mentioned. We will use ```np.newaxis``` to create a column vector from the array. Then, using broadcasting, we can compute the matrix of pairwise distances between the cities. 

In [None]:
mileposts = np.array([0, 198, 303, 736, 871, 1175, 1475, 1544, 1913, 2448])
mileposts[:, np.newaxis] # or np.transpose([mileposts])

In [None]:
distance_array = np.abs(mileposts - mileposts[:, np.newaxis])
distance_array

Using an adjacency matrix, you can do various computations, such as finding the shortest path between two nodes. These fall into the category of graph algorithms and are beyond the scope of the course, but representing an adjacency matrix as a numpy array gives the flexibility of performing quick operations.

### Generating random arrays

In [None]:
r = np.random.rand(10,2)
r

How can I find the maximum value in the array?

In [None]:
np.max(r)

How can I find the location of the maximum value in the array?

How do conditionals work on an array?

In [None]:
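# A bare `if r > 0.8:` raises a ValueError, because the truth value of an
# array with more than one element is ambiguous. Reduce it with .all() or .any().
if (r > 0.8).all():
  print("All values are larger than 0.8")
else:
  print("Some values in r are not larger than 0.8")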

Now that we have seen the various types of data available, let us understand the implementation of statistical inference using Python. We will use the library Scipy for the same. 

# Introduction to Scipy

The scipy package contains various toolboxes dedicated to common issues in scientific computing. Its different submodules correspond to different applications, such as interpolation, integration, optimization, image processing, statistics, special functions, etc. 

Although there are some basic statistical functions in numpy (e.g., mean, std, median), the real repository for statistical functions is in ```scipy.stats```. There are over eighty continuous probability distributions implemented in ```scipy.stats``` and an additional set of more than ten discrete distributions, along with many other supplementary
statistical functions.

We will go through these distributions and also do basic hypothesis testing using ```scipy.stats```.

## Distributions

Given observations of a random process, their histogram is an estimator of the random process’s PDF (probability density function). We can use ```np.random``` for generating random samples. 

Here, however, we continue working with the data we have loaded, and fit a distribution to the VIQ column of the data. We assume that the data for VIQ is drawn from a normal distribution. Let us first convert the VIQ column to a numpy array.

In [None]:
samples = data_df.VIQ.to_numpy() # convert series to np array
samples

We can now fit a normal distribution to this observed data. We do a maximum-likelihood fit of the observations to estimate the parameters of the underlying distribution. 

In [None]:
from scipy import stats

loc, std = stats.norm.fit(samples)
print(loc)
print(std)

The mean is an estimator of the center of the distribution. The normal distribution fit by scipy should have the same center as the sample mean, because the maximum-likelihood estimate of the location of a normal distribution is the sample mean. 

In [None]:
np.mean(samples)
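
This relationship can be checked with a minimal sketch on synthetic data. The array `synthetic` is randomly generated here and is an assumption for illustration, not the VIQ data. The fitted scale matches the standard deviation computed with `ddof=0`, which is the default of `np.std`.

```python
import numpy as np
from scipy import stats

# Synthetic sample for illustration; the seed makes the run repeatable.
rng = np.random.default_rng(0)
synthetic = rng.normal(loc=100.0, scale=15.0, size=500)

fit_loc, fit_scale = stats.norm.fit(synthetic)   # maximum-likelihood estimates

print(fit_loc, np.mean(synthetic))    # location vs sample mean
print(fit_scale, np.std(synthetic))   # scale vs standard deviation (ddof=0)
```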

The median is another estimator of the center. It is the value with half of the observations below, and half above. It is the 50th percentile. Unlike the mean, the median is not sensitive to the tails of the distribution. It is “robust”.

In [None]:
np.median(samples)

In [None]:
stats.scoreatpercentile(samples, 50)
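
The robustness of the median can be sketched with two small made-up arrays that differ only in one extreme value. The values are assumptions chosen for illustration.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
with_outlier = np.array([1.0, 2.0, 3.0, 4.0, 500.0])   # last value replaced by an outlier

# The mean is pulled towards the outlier, the median is not.
print(np.mean(values), np.mean(with_outlier))
print(np.median(values), np.median(with_outlier))

# The median is the 50th percentile.
print(np.percentile(with_outlier, 50))
```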

## Statistical tests using Scipy

A statistical test is a decision indicator. For instance, if we have two sets of observations, that we assume are generated from Gaussian processes, we can use a T-test to decide whether the means of two sets of observations are significantly different. 

Let us begin our hypothesis testing using the initial example of brain data. Consider the null hypothesis that the expected value (mean) of the sample of VIQ values is equal to a given population mean. The ```ttest_1samp``` function performs a two-sided test and returns the T-statistic and p-value.

In [None]:
stats.ttest_1samp(samples, 0) # population mean is 0

Let the significance level be 0.05. As the p-value < 0.05, we can reject the null hypothesis that the population mean for the VIQ measure is 0.
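
To show what the one-sample test computes, the following sketch calculates the T-statistic and the two-sided p-value by hand and compares them with ```ttest_1samp```. The array `scores` is synthetic and the hypothesized mean `mu0` is a made-up value; both are assumptions for illustration and not taken from the brain data.

```python
import numpy as np
from scipy import stats

# Synthetic sample for illustration; the seed makes the run repeatable.
rng = np.random.default_rng(1)
scores = rng.normal(loc=110.0, scale=20.0, size=40)
mu0 = 100.0                      # hypothesized population mean (made up)

n = scores.size
standard_error = scores.std(ddof=1) / np.sqrt(n)   # sample std uses ddof=1
t_manual = (scores.mean() - mu0) / standard_error

# Two-sided p-value from the t distribution with n - 1 degrees of freedom.
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

result = stats.ttest_1samp(scores, mu0)
print(t_manual, result.statistic)
print(p_manual, result.pvalue)
```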

We have already seen that the mean for VIQ is different for males and females. So we can perform a test to check if this difference is significant. We use ```ttest_ind``` function for doing a two-sided test for the null hypothesis that the two independent samples have identical average (expected) values.

In [None]:
female_viq = data_df[data_df['Gender'] == 'Female']['VIQ']
male_viq = data_df[data_df['Gender'] == 'Male']['VIQ']
stats.ttest_ind(female_viq, male_viq)

As the p-value > 0.05, we cannot reject the null hypothesis that males and females have identical mean VIQ. 

#### Z-tests using Scipy

If the mean and standard deviation of a population are known, we can use the z-test. The approach is the same as in the lightbulb example in the lecture slides; here we apply it to the following example.

A particular brand of tires claims that its deluxe tire averages at least 50,000 km before it needs to be replaced. From past studies of this tire, the standard deviation is known to be 8,000 km. A survey of owners of that tire design is conducted. From the 28 tires surveyed, the mean lifespan was 46,500 km with a standard deviation of 9,800 km. Using α=0.05, is the data highly inconsistent with the claim?

$H_0: \mu \geq 50,000$

$H_1: \mu < 50,000$

$z = \frac{50000-46500}{8000/\sqrt{28}} = 2.315$

How can we calculate the p-value?

In [None]:
z = (50000-46500)/(8000/np.sqrt(28))
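
One possible way to obtain the p-value is sketched below, assuming the sign convention used above, where z is the claimed mean minus the sample mean, divided by the standard error. Under this convention, a large positive z supports $H_1$, so the one-sided p-value is the upper-tail probability of the standard normal distribution, given by `stats.norm.sf`.

```python
import numpy as np
from scipy import stats

# z as defined above: (claimed mean - sample mean) / (sigma / sqrt(n))
z = (50000 - 46500) / (8000 / np.sqrt(28))

# Upper-tail probability P(Z >= z); identical to stats.norm.cdf(-z).
p_value = stats.norm.sf(z)

print(z)
print(p_value)
```

The p-value comes out at roughly 0.01, which is below α=0.05, so under these assumptions the data is inconsistent with the claim and $H_0$ is rejected.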