<a href="https://colab.research.google.com/github/Amirhatamian/Statistical-Models-For-Data-Science/blob/main/Statistical_Models_For_Data_Science_ToDo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Write your own Google drive path to files
DrivePath = "/content/drive/My Drive/Colab Notebooks"

# Link to Google drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import numpy as np
import pandas as pd
print(np.__version__)
print(pd.__version__)


#**1. Brief Recall to Numerical Python (NumPy)**

NumPy is the fundamental library for high-performance scientific computing and data analysis in Python. It is the foundation on which
many other higher-level tools are built. Some of the main things it
provides are: \\
• *ndarray*, a fast and space-efficient multidimensional array providing vectorized arithmetic operations and sophisticated broadcasting capabilities; \\
• standard mathematical functions for fast operations on entire arrays of data
without having to write loops; \\
• tools for reading/writing array data; \\
• linear algebra, random number generation, and Fourier transform capabilities.

In what follows, we briefly recall how collections of data are handled by Python's built-in types and how NumPy improves on this (note: Python includes several built-in container types, namely lists, dictionaries, sets, and tuples).
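
As a minimal sketch of what "without having to write loops" means in practice, the snippet below doubles a set of values first with plain Python and then with a single vectorized NumPy expression. The variable names (`values`, `doubled_list`, `doubled_array`) are illustrative only.

```python
import numpy as np

values = [1, 2, 3, 4]

# Pure Python: an explicit loop (here a list comprehension) over the items
doubled_list = [2 * v for v in values]

# Careful: multiplying a list by 2 repeats it instead of doubling the items
repeated = 2 * values

# NumPy: the arithmetic is applied to the whole array at once
doubled_array = 2 * np.array(values)

print(doubled_list)   # [2, 4, 6, 8]
print(repeated)       # [1, 2, 3, 4, 1, 2, 3, 4]
print(doubled_array)  # [2 4 6 8]
```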

###**List, Indexing and Slicing**

A **list** is the basic ordered and mutable collection type in Python, and its items can be of different types. For example, a single list can mix numbers and strings.

In [None]:
# Basic Notions
myNumbers = [4, 5, 6, 7]
print(myNumbers)
print('My first number is ' + str(myNumbers[0]))

string1 = [str(c) for c in myNumbers] # list of strings
print(string1) # print(type(string1[0])) <class 'str'>

print('---------------')
myNumbers.append(10) # To add an element to the list
print(myNumbers)

print('---------------')
myNumbers.pop(1) # To remove an element: pop(i) removes the item at index i (here index 1, i.e. the second item)
print(myNumbers)

# myNumbers.pop() # if no position is given as input, the last item is removed
# print(myNumbers)

In [None]:
# Creating a List
L1 = list(range(10)) # range returns a sequence of numbers, starting from 0 (by default), with increments = 1 (by default), and stops before a specified number.
print(L1)

L2 = list(range(1,10,2)) # start, stop, step (note: stop is not included)
print(L2)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[1, 3, 5, 7, 9]


In [None]:
# Indexing: getting (and possibly setting) the value of individual elements
X = [5, 0, 3, 3, 7, 9]
print(X[0]) # single element
print(X[-2]) # a negative index counts from the end of the list

In [None]:
# Slicing: getting (and possibly setting) smaller sub-sequences within a larger one
X = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(X)
print(X[5:]) # if no end position is defined, the slicing goes up to the end of the list (step 1 by default)
print(X[:5]) # if no start position is defined, the slicing starts from the first item of the list, and stop not included (step 1 by default)

# Note = X[start:stop:step] (remember, stop not included)
print(X[::2])  # every other element, starting at the first position
print(X[1::2]) # every other element, starting at index 1
print(X[0:5:]) # same as X[:5], written in the extended form


In [None]:
# Note: a potentially confusing case is when the step value is negative. In this case, the defaults for start and stop are swapped.
print(X[::-1])  # all elements, reversed
print(X[5::-2]) # reversed every other from index 5
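
The comments above mention setting values as well as getting them. A minimal sketch of assignment through an index and through a slice follows; the list contents are arbitrary example values.

```python
X = [0, 1, 2, 3, 4, 5]

X[0] = 10            # set a single element
X[1:3] = [20, 30]    # set a slice (two items replaced by two items)
print(X)             # [10, 20, 30, 3, 4, 5]

Y = X[:3]            # slicing a list creates a new list (a shallow copy)
Y[0] = -1
print(Y)             # [-1, 20, 30]
print(X[0])          # 10, the original list is unchanged
```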

###**Other important data types: Tuples and Dictionaries**

Besides the lists seen above, tuples and dictionaries are also important data types we can use in Python. A **tuple** is similar to a list, but it is created with parentheses instead of square brackets. The main difference is that, while a list is mutable, a tuple is immutable. So tuples are ordered, unchangeable, heterogeneous, and allow duplicate values. \\

A Python **dictionary** is a hash map (an associative array). Dictionaries are indexed by keys, which can be of any hashable type, typically an immutable one such as a string or a number. Thus, dictionaries represent a *key:value* mapping.

In [None]:
# Tuple
simple_tuple = (1, 5, 6, 4, 2)
print(simple_tuple)

slic = simple_tuple[0:3] # Slicing
print(slic)

# Note: unlike lists, tuples cannot be sorted in place (sorted(simple_tuple) returns a new sorted list instead)
animals = ['cat','fish', 'penguin','dog']
animals.sort()
print(animals)

# simple_tuple.sort() # This would give error!

# To convert a tuple into a list
# t = list(slic)
# print(t)


In [None]:
# Simple dictionary
simple_dict = {'five':5, 'three':3, 'one':1} # To create the dictionary
print(simple_dict)
print(simple_dict ['five']) # To access the data

# Adding new items to a dictionary
simple_dict['ninety'] = 90
print(simple_dict)

# Note: dictionaries preserve insertion order (guaranteed since Python 3.7).
# To sort a dictionary by key in alphabetical order:
# print(dict(sorted(simple_dict.items())))

In [None]:
# More complex dictionary
ex1 = {'four': [8, 4.0],
       'two': 'London',
       'three': list(range(3)),
       'one': np.ones([1,5],dtype=int)}

print(ex1) # insertion order
print(dict(sorted(ex1.items()))) # alphabetic order
# print(ex1['four'])
# list(ex1.keys()) # To check the keys
# list(ex1.values()) # To check the values
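
To make the statements about immutability and dictionary keys concrete, here is a small sketch. The names `point` and `distances` are hypothetical examples, not part of the notebook.

```python
point = (1, 2)

try:
    point[0] = 5          # tuples do not support item assignment
except TypeError as err:
    print('TypeError:', err)

# A tuple of immutable items is hashable, so it can serve as a dictionary key
distances = {(0, 0): 0.0, (3, 4): 5.0}
print(distances[(3, 4)])  # 5.0

try:
    bad = {[0, 0]: 0.0}   # a list is mutable (unhashable) and is rejected as a key
except TypeError as err:
    print('TypeError:', err)
```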


###**Creating Arrays from Python Lists with NumPy**

The *np.array()* function can be used to efficiently create arrays from Python lists. Indeed, one of the key features of NumPy is its N-dimensional array object *ndarray*, a fast and flexible container for large data sets that is used to represent both vectors and matrices. Since all of its elements share a single data type, a NumPy array is more compact in memory and faster to operate on than a Python list.

In [None]:
data = list(range(2,20,2))
b = np.array(data) #1D Row vector
print(b)
print(type(data))
print(type(b))
print('--------------')

# Alternative to performing list(range()) + np.array():
b2 = np.arange(2,20,2) # as command range(), but returns an ndarray instead of a range object (or list if we use list(range()))
print(b2)
print(type(b2))
print('--------------')

c = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]) # 3x3 matrix
print(c)
print('--------------')
# Note: we can also specify the type, appending for example ",dtype=np.float64"

# np.insert returns a new array with the new elements added (the original array b is not modified):
a2 = np.insert(b,2,[100,200]) # The second argument defines the position (index) before which the new elements are inserted
print(a2)
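
A short sketch, under the same definitions as the cell above, showing that `np.insert` leaves its input untouched, that item assignment does modify an array in place, and that an array holds a single data type. The array `mixed` is an illustrative addition.

```python
import numpy as np

b = np.arange(2, 20, 2)
a2 = np.insert(b, 2, [100, 200])  # returns a new array

print(b.size, a2.size)            # 9 11, b itself is unchanged

b[0] = -1                         # in-place item assignment does modify b
print(b[0])                       # -1

mixed = np.array([1, 2.5, 3])     # integers are upcast to a common float type
print(mixed.dtype)                # float64
```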

In [None]:
# To access the data in the NumPy array/matrix
# Few examples with matrices
print(c)
print('')

print(c[:2, :2]) # part of the matrix (ending term is not included)
print('')

print(c[1,1]) # To access a specific item
print('')

In [None]:
# Note: to create simulated random integer data, the NumPy random number generator can be used:
np.random.seed(0)  # seed for reproducibility

# function: random.randint(low, high=None, size=None, dtype=int).
# If high is None (the default), then results are from [0, low).
x1 = np.random.randint(100, size=6)  # 1D array
x2 = np.random.randint(10, size=(3, 4))  # 2D array (matrix)

print(x1)
print(x2)

Each ndarray has a set of attributes, including *ndim* (the number of dimensions), *shape* (the size of each dimension) and *size* (the total number of elements in the array). Another useful attribute is *dtype*, which stores the data type of the elements.

In [None]:
print("x2 ndim: ", x2.ndim)
print("x2 shape:", x2.shape)
print("x2 size: ", x2.size)
print("x2 dtype:", x2.dtype)

###**Array Reshaping and Concatenation**

A useful operation when dealing with data is the **reshaping of arrays**, which changes the shape of a given array without modifying its content. Indeed, the number of elements in the initial and final arrays must be the same. The most flexible way of doing this is with the *reshape* function.

In [None]:
Row_V_1D = np.array([1, 2, 3, 4 ]) # ndim = 1, 1D Array (a single list/array of elements)
Row_V_2D = np.reshape(Row_V_1D,(1,4)) # ndim = 2, 2D Array
Col_V = np.reshape(Row_V_1D,(4, 1)) # ndim=2, alternative: Col_V = Row_V_2D.T

print(Row_V_1D)
print(Row_V_2D)
print(Col_V)
print('------')
print("Row_V_1D shape: ", Row_V_1D.shape)
print("Row_V_2D shape: ", Row_V_2D.shape)
print("Col_V shape: ", Col_V.shape)

In [None]:
T = np.array(range(10))
print(T)
T_reshaped_to_2D = np.reshape(T, (-1, 2)) # -1 means that the number of rows is unknown and NumPy will infer it from the other dimension and the number of elements
print(T_reshaped_to_2D)


In [None]:
# As an alternative to reshape, we can use np.newaxis, which adds one dimension to the array:
a = np.array([2, 4, 6]) # 1D array
a_expanded = a[np.newaxis, :] # 2D Row array, otherwise: a_expanded = a[:,np.newaxis] for Column

print(a)
print(a_expanded)
print("a shape: ", a.shape)
print("a dim: ", a.ndim)
print("a_expanded shape: ", a_expanded.shape)
print("a_expanded dim: ", a_expanded.ndim)
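
The following sketch ties the two approaches together: `reshape` with `-1` and `np.newaxis` produce the same column vector, and `reshape` fails when the requested shape does not match the number of elements. Variable names are illustrative.

```python
import numpy as np

a = np.arange(6)                 # [0 1 2 3 4 5]

col_reshape = a.reshape(-1, 1)   # -1 lets NumPy infer the number of rows (6)
col_newaxis = a[:, np.newaxis]   # same result via np.newaxis
print(col_reshape.shape)         # (6, 1)
print(np.array_equal(col_reshape, col_newaxis))  # True

try:
    a.reshape(4, 2)              # 8 slots requested, but only 6 elements exist
except ValueError as err:
    print('ValueError:', err)
```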

In addition, it is possible to combine multiple arrays into one, performing an array concatenation (*np.concatenate*). This takes a tuple or list of arrays as its first argument:

In [None]:
# For 1D Arrays
x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
z = np.array([0, 1])
Conc = np.concatenate([x, y, z]) # We can concatenate even more arrays
print(Conc)
print(Conc.shape)

In [None]:
# For Matrices
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])
print(grid)
print('')

M1 = np.concatenate((grid,grid)) # default is axis=0
print(M1)
print('')

M2 = np.concatenate((grid,grid),axis=1)
print(M2)
print('------')

y = np.array([10,11,12]) # 1D array
y_r = np.reshape(y,(1,len(y))) # 2D array
M3 = np.concatenate((grid,y_r)) # y cannot be concatenated directly because it is a 1D array, so it must be converted to 2D first
print(y_r)
print(M3)
print('------')

# Alternative for this last operation:
M4 = np.vstack((grid,y)) # vstack stack arrays in sequence vertically (row wise). hstack for stacking arrays in sequence horizontally, column wise.
print(M4)
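
As a complement, a minimal sketch of what happens when a 1D array is passed to `np.concatenate` together with a 2D one, and of `np.hstack` for column-wise stacking. The column `col` is a made-up example.

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6]])
y = np.array([10, 11, 12])          # 1D array, shape (3,)

try:
    np.concatenate((grid, y))       # 2D with 1D: the numbers of dimensions differ
except ValueError as err:
    print('ValueError:', err)

col = np.array([[99], [99]])        # 2D column, shape (2, 1)
H = np.hstack((grid, col))          # stack horizontally (column wise)
print(H.shape)                      # (2, 4)
```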


###**Structured Data: Structured Arrays**

When dealing with a heterogeneous set of data, which cannot be easily stored in traditional arrays for further operations, NumPy provides an efficient storage option. Suppose we have different pieces of information stored in separate arrays: they can be combined in a structured array using a syntax like the following:

```
data = np.zeros(N, dtype={'names':('A', 'B', 'C'),
                          'formats':((np.str_, 10), int, np.float32)})
data['A'] = A
data['B'] = B
data['C'] = C

```



In [None]:
name = ['Alice', 'Bob', 'Cathy', 'Doug']
age = [25, 45, 37, 9]
weight = [55.2, 85.5, 68.1, 31.5]

data = np.zeros(4, dtype={'names':('name', 'age', 'weight'),
                          'formats':((np.str_, 10), int, np.float32)})
#print(data.dtype)

# This above is an empty container that we have to fill with the corresponding information:
data['name'] = name
data['age'] = age
data['weight'] = weight
print(data)

# To access some elements in the structured array:
print(data['age']) # To access data according to the label index (e.g. all ages)
print(data[2]['weight']) # To access a specific field of the record at a given position (here index 2)
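
Structured arrays also support boolean masking on a field. The sketch below rebuilds the same example with string format codes (`'U10'`, `'i4'`, `'f4'`), which are an alternative spelling chosen here for brevity, and selects the names of the subjects under 30.

```python
import numpy as np

data = np.zeros(4, dtype={'names': ('name', 'age', 'weight'),
                          'formats': ('U10', 'i4', 'f4')})
data['name'] = ['Alice', 'Bob', 'Cathy', 'Doug']
data['age'] = [25, 45, 37, 9]
data['weight'] = [55.2, 85.5, 68.1, 31.5]

# Boolean mask on the 'age' field, then pick the 'name' field of the result
under_30 = data[data['age'] < 30]['name']
print(under_30)          # ['Alice' 'Doug']
print(data[-1]['name'])  # Doug
```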

##**2. Introducing Pandas Objects**

In Section 1, we briefly recalled NumPy and its important *ndarray* object. In this section we build on that knowledge by looking at the data structures provided by the Pandas library. Pandas is a newer package built on top of NumPy, which provides, among other things, an efficient implementation of a **DataFrame**. DataFrames are essentially multidimensional arrays with attached row and column labels, and often with heterogeneous types and/or missing data. \\
Thus, at the very basic level, Pandas objects can be seen as enhanced versions of NumPy structured arrays in which the rows and columns are identified with labels rather than simple integer indices. Let's start from the three fundamental Pandas data structures: Series, Index, and DataFrame.





**2.1 The Pandas Series and Index Objects**

A Pandas Series is a 1D array of indexed data. It can be created from a list or array. The Series wraps both a sequence of values and a sequence of indices, which can be accessed with the **values** and **index** attributes, respectively. The *values* are simply a NumPy array, while the *index* is an immutable sequence storing axis labels.

In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[5,8,9,7])
#data = pd.Series([0.25, 0.5, 0.75, 1.0]) # default index values [0,1,2,3]

print(data.values)
print(list(data.index))
print('Values type:', type(data.values))
print('Index type:', type(data.index))
print('-----------------')

# Dictionary-like expression to examine the keys/indices (Series as a specialization of a Python dictionary)
print(list(data.keys())) # To examine the index
print(list(data.items())) # To examine the (index, value) pairs

# data['e'] = 1.25 # To add or modify a specific element

[0.25 0.5  0.75 1.  ]
[5, 8, 9, 7]
Values type: <class 'numpy.ndarray'>
Index type: <class 'pandas.core.indexes.numeric.Int64Index'>
-----------------
[5, 8, 9, 7]
[(5, 0.25), (8, 0.5), (9, 0.75), (7, 1.0)]


In [None]:
# Simple slicing and indexing
print(data[0:2]) # Simple slicing, referring to the positions
print(data[8]) # Indexing, referring to the index

*Main difference with 1D NumPy array:* presence of the index. Indeed, while the NumPy array has an <u>implicitly</u> defined integer index used to access the values, the Pandas Series has an <u>explicitly</u> defined index associated with the values. This explicit definition gives more capabilities to the Series object, e.g. the index can consist of values of any desired type (also strings).

In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
print(data)
print(data['c'])
print(data['c':'d'])
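
Because a Series behaves much like a dictionary, labels can be tested for membership and new labels can be assigned directly. A minimal sketch, reusing the Series defined above:

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

print('a' in data)        # True: membership is tested against the index labels
data['e'] = 1.25          # assigning to a new label extends the Series
data['a'] = 0.0           # assigning to an existing label modifies the value
print(list(data.index))   # ['a', 'b', 'c', 'd', 'e']
print(data['e'])          # 1.25
```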

The Pandas Index object can be seen as an immutable array or as an ordered set. For example, we can construct an Index from a list of integers:

In [None]:
ind = pd.Index([2, 3, 5, 7, 11])
print(list(ind))
print(list(ind[::2]))
# Index as immutable array -> it broadly operates as a simple array, but cannot be changed
# ind[1] = 0 -> this will give an error!

# Index as ordered set
#indA = pd.Index([1, 3, 5, 7, 9])
#indB = pd.Index([2, 3, 5, 7, 11])
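#indA.intersection(indB) # intersection (in recent pandas versions the & operator no longer performs set operations on Index objects)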
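
The set-like behaviour of an Index is exposed through methods. A small sketch using the two indices from the commented lines above:

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

print(list(indA.intersection(indB)))  # [3, 5, 7]
print(list(indA.union(indB)))         # [1, 2, 3, 5, 7, 9, 11]
print(list(indA.difference(indB)))    # [1, 9]
```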

Thus, to construct Pandas Series objects, the general syntax is: \\
> `pd.Series(data, index=index)`, where `index` is optional \\



In [None]:
T0 = pd.Series(['a','b','c'], index=[2,1,3])
print(T0)

# Alternative for the same result {index:value}
T1 = pd.Series({2:'a', 1:'b', 3:'c'})
print(T1)


As in the previous example, `data` can also be a dictionary, in which case the index is built from the dictionary keys:

In [None]:
# Example of a dictionary and conversion to a Series object
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
print(population_dict)
print('Type population dict:', type(population_dict))
print('-----')

population = pd.Series(population_dict)
print(population)
print('Type population:', type(population))


In [None]:
# Slicing can be performed with two different methods, as we have labels as indexes:
print(population['Texas':'Florida']) # slicing by explicit index
print('---------')
print(population[1:4]) # slicing by implicit integer index

<u>Important Note:</u> Among the possible operations, slicing may be a source of confusion. Indeed, when slicing with an explicit index (e.g., `data['a':'c']`), the final index is included in the slice, while when slicing with an implicit index (e.g., `data[0:2]`), the final index is excluded from the slice!

In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
print(data)
print('')
print(data['b':'d']) # slicing by explicit index
print('')
print(data[1:3]) # slicing by implicit integer index

# Note: when the index itself consists of integer values, slicing and indexing can be even more confusing!
#data = pd.Series(['a', 'b', 'c','d'], index=[1, 3, 5, 10])
#print(data)
#print(data[10]) # explicit index when indexing --> d
#print(data[1:3]) # implicit index when slicing --> 3 b
#                                                   5 c

As slicing and indexing conventions can be a source of confusion, Pandas provides some special *indexer attributes* that explicitly expose the indexing schemes. These are *loc* and *iloc*.

In [None]:
# loc attribute allows indexing and slicing to refer to the explicit index
data = pd.Series(['a', 'b', 'c','d'], index=[1, 3, 5, 10])
print(data)
print('----')
print(data.loc[1])
print('----')
print(data.loc[1:5])

In [None]:
# iloc attribute allows indexing and slicing to refer to the implicit Python-style index
print(data)
print('----')
print(data.iloc[1])
print('----')
print(data.iloc[1:3])
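
The difference between the two indexers can be summarized in a few lines. This sketch reuses the Series above; `by_label` and `by_position` are illustrative names.

```python
import pandas as pd

data = pd.Series(['a', 'b', 'c', 'd'], index=[1, 3, 5, 10])

by_label = data.loc[1:5]       # labels from 1 to 5, stop label included
by_position = data.iloc[1:3]   # positions 1 and 2, stop position excluded

print(list(by_label))          # ['a', 'b', 'c']
print(list(by_position))       # ['b', 'c']
print(data.loc[10], data.iloc[-1])  # d d
```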

**2.2 The Pandas DataFrame Object**

Similarly to a Series, a DataFrame can be seen as a generalization of a NumPy multidimensional array or as a specialization of a Python dictionary.
A DataFrame is basically a tabular data structure, with rows and columns. Rows have a specific index to access them, as seen before, while the columns form a collection of Series objects sharing that index. Therefore, a DataFrame can be seen as a spreadsheet, but it is much more flexible. \\
Here is an example, taking some of the data seen above:

In [None]:
# Create a DataFrame from two dictionaries (dictionary + conversion to series)
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)

population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)

states = pd.DataFrame({'Popul.': population,
                       'Area': area})

states  # you can try states.shape or states.ndim to check the dimensions

In [None]:
# Two important attributes on DataFrame
print(states.index)
print(states.columns)

Below are some examples of how to construct a DataFrame from scratch:


In [None]:
# From a single Series object
pd.DataFrame(population, columns=['Pop.Number']) # Note: adding index=[1,2,3,4,5] (or any labels not in population.index) would return all NaN,
                                                 # because "population" is a Series that already has its own index (check with population.index)
# pd.DataFrame({'Pop.Number': population}) # alternative

In [None]:
# From a 2D-array
np.random.seed(42)
c = np.array([[1, 2, 3, 10], [4, 5, 6, 11], [7, 8, 9, 21]]) # 3x4 matrix
pd.DataFrame(c,
             columns=['first', 'second', 'third', 'fourth'],
             index=['n1', 'n2', 'n3'])

In [None]:
# From a list of dictionaries (one dictionary per row)
data = [{'a': 1, 'b': 2}, {'b': 3, 'c': 4}] # Two rows
pd.DataFrame(data) # the missing values are filled with NaN

The individual Series that compose the columns of the DataFrame can be accessed via dictionary-style indexing of the column name:

In [None]:
# Previous example
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area':area, 'pop':pop})
data['area']
# data['density'] = data['pop'] / data['area'] -> in this way we can append a new column to the previous DataFrame
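
The commented line above hints at how a new column is added. A minimal sketch on a two-state subset of the same data (the subset is chosen here only to keep the example short):

```python
import pandas as pd

area = pd.Series({'California': 423967, 'Texas': 695662})
pop = pd.Series({'California': 38332521, 'Texas': 26448193})
data = pd.DataFrame({'area': area, 'pop': pop})

# Element-wise division of two columns, stored under a new column name
data['density'] = data['pop'] / data['area']

print(list(data.columns))  # ['area', 'pop', 'density']
print(data['density'].round(1))
```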

When it comes to indexing a DataFrame object, the dictionary-style indexing of columns limits the possibility of treating it as a NumPy array. In particular, passing a single index to a 2D NumPy array accesses a row, while passing a single column name to a DataFrame accesses that column. In this case, the loc and iloc indexers seen above make it possible to index the DataFrame as if it were a NumPy array. For example:

In [None]:
data

In [None]:
data.iloc[:3, :1] # use implicit indexes

In [None]:
data.loc[:'Florida', :'area'] # use explicit index and column names
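
A few more access patterns with the same indexers, sketched on a three-state subset of the data. The threshold of 400000 and the name `large` are arbitrary choices for illustration.

```python
import pandas as pd

data = pd.DataFrame({'area': [423967, 695662, 141297],
                     'pop': [38332521, 26448193, 19651127]},
                    index=['California', 'Texas', 'New York'])

print(data.iloc[0, 1])               # 38332521 (first row, second column)
print(data.loc['Texas', 'area'])     # 695662

# Boolean mask on the rows combined with a list of column names
large = data.loc[data['area'] > 400000, ['pop']]
print(list(large.index))             # ['California', 'Texas']
```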

### **Short Exercise 1:**

Create from scratch a DataFrame object with three rows (A, B, C) and three columns (Name, Weight and Age), to store information about three subjects (Tom, Peter and Alexander), their weights (66, 55, 89) and their ages. In particular, for the variable Age generate random values between 30 and 50.
Try to explore different possibilities. \\
Then, add a fourth column (Height) with values of your choice, and reorder the columns to have 'Name, Age, Height, Weight'. How can you access the data once the object of interest has been created?
Finally, sort the entries according to the Age values (descending order; hint: use `DataFrame.sort_values()`, and for a quick look at the documentation: `help(pd.DataFrame.sort_values)`).

###**2.3 Merging Data**

In some cases, it might be useful to merge the data stored in different DataFrames, to have everything in a single object.


In [None]:
df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})
display(df1, df2)

# Combination:
df3 = pd.merge(df1, df2, on='employee')
df3

Sometimes the two datasets we have to merge have different column names. In this case, we can use the left_on and right_on keywords to specify the two column names:

In [None]:
df3 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'salary': [70000, 80000, 120000, 90000]})
df4_merge = pd.merge(df1, df3, left_on="employee", right_on="name")
display(df4_merge)

In [None]:
# The result has a redundant column that we can drop, for example by using the drop() method of DataFrames:
df4_merge.drop('name',axis=1,inplace=True)
print(df4_merge.index)
df4_merge

In [None]:
# Sometimes it might be useful to merge on an index rather than a column.
# df.set_index sets the DataFrame index using existing columns, i.e. it turns a column into the index:
df1a = df1.set_index('employee')
df2a = df2.set_index('employee')
display(df1a, df2a)
df4_merge = pd.merge(df1a, df2a, on='employee')
display(df4_merge)
print(df1a.index)
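
As a hedged complement to the cell above, the sketch below merges on the index explicitly with `left_index=True, right_index=True` and shows the effect of the `how` parameter. The two small frames are made-up variants of `df1` and `df2` in which not every employee appears in both.

```python
import pandas as pd

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa'],
                    'group': ['Accounting', 'Engineering', 'Engineering']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Sue'],
                    'hire_date': [2004, 2008, 2014]})

# Merge on the index of both frames
by_index = pd.merge(df1.set_index('employee'), df2.set_index('employee'),
                    left_index=True, right_index=True)
print(sorted(by_index.index))   # ['Bob', 'Lisa']

# The default is how='inner'; how='outer' keeps unmatched rows and fills the gaps with NaN
outer = pd.merge(df1, df2, on='employee', how='outer')
print(len(outer))               # 4
```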

###Extra - Additional useful operations

In [None]:
# 1. Replacing all Occurrences of a String in a DataFrame
df = pd.DataFrame({"Student1":['OK','Awful','Acceptable'],
                   "Student2":['Perfect','Awful','OK'],
                   "Student3":['Acceptable','Perfect','Poor']})
print(df)

# Replace the strings by numerical values (0-4)
df.replace(['Awful', 'Poor', 'OK', 'Acceptable', 'Perfect'], [0, 1, 2, 3, 4])
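
Note that `replace` returns a new DataFrame and leaves the original untouched unless the result is assigned. A minimal sketch using a dictionary mapping instead of two parallel lists (the name `scores` is illustrative; depending on the pandas version, a warning about dtype downcasting may be shown):

```python
import pandas as pd

df = pd.DataFrame({'Student1': ['OK', 'Awful', 'Acceptable'],
                   'Student2': ['Perfect', 'Awful', 'OK']})

scores = {'Awful': 0, 'Poor': 1, 'OK': 2, 'Acceptable': 3, 'Perfect': 4}
df_num = df.replace(scores)         # returns a new DataFrame

print(df_num['Student1'].tolist())  # [2, 0, 3]
print(df['Student1'].tolist())      # ['OK', 'Awful', 'Acceptable'], original unchanged
```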

In [None]:
# 2. Applying simple functions to columns or rows in a DataFrame
triple = lambda x: x*3
# This allows defining a small function without using the "def" keyword.
# In this case, the lambda function takes one argument "x" and returns the result of multiplying it by 3.
# e.g, triple(4) returns the result of the operation

df = pd.DataFrame(data=np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), columns=['A', 'B', 'C'])
print(df)

# Apply the triple function to the entire df or to a specific column, e.g. df['A'].apply(triple)
print(df.apply(triple))

# Apply the triple function to a specific row
print(df.loc[1].apply(triple))
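
When the function aggregates instead of transforming element by element, the `axis` argument of `apply` decides whether it receives columns or rows. A short sketch on the same 3x3 DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                  columns=['A', 'B', 'C'])

col_sums = df.apply(lambda s: s.sum(), axis=0)   # function receives each column (default)
row_sums = df.apply(lambda s: s.sum(), axis=1)   # function receives each row

print(col_sums.tolist())  # [12, 15, 18]
print(row_sums.tolist())  # [6, 15, 24]
```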

In [None]:
# 3. Save the DataFrame in a text file (tab-separated here, because of sep='\t')
import os
df.to_csv(os.path.join(DrivePath, 'myDataFrame.csv'), sep='\t')
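
To read the file back, the separator must match the one used for writing. The sketch below uses a temporary directory as a stand-in for `DrivePath`, so that it runs outside Colab as well; the small DataFrame is a made-up example.

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'A': [1, 4], 'B': [2, 5]})

with tempfile.TemporaryDirectory() as tmp_dir:   # stand-in for DrivePath
    path = os.path.join(tmp_dir, 'myDataFrame.csv')
    df.to_csv(path, sep='\t')
    # sep must match the one used for writing; index_col=0 restores the saved index
    df_back = pd.read_csv(path, sep='\t', index_col=0)

print(list(df_back.columns))   # ['A', 'B']
print(df_back['A'].tolist())   # [1, 4]
```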