<h1><center>COMP1008: Exercise 1<br/>Introduction to NumPy and Pandas</center></h1>

NumPy and Pandas are two extremely useful Python libraries for COMP1008 exercises and coursework. They provide powerful and flexible mathematical and data analysis methods for machine learning and data science.

This tutorial provides some examples of basic data manipulation using NumPy and Pandas. Some hands-on tasks are included for you to practice your data manipulation skills.

<div id="partone"><h2><center>Part 1: NumPy Basics</center></h2></div>

The [NumPy](https://numpy.org/doc/stable/user/absolute_beginners.html) library in Python provides a powerful array object, `ndarray`, to handle well-organized data in numerical tasks. NumPy should already be installed with Anaconda. We simply import it and call it `np`.

In [None]:
import numpy as np

## 1.1. Introduction to NumPy arrays

`ndarray` is a homogeneous n-dimensional array object, with <b>methods</b> and <b>attributes</b> to efficiently operate on data. It holds a collection of items that all share one data type, arranged as a vector (one-dimensional), a matrix (two-dimensional), or an array with more dimensions.
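
As a small sketch of what "homogeneous" means in practice (the names `ints` and `mixed` are made up for this illustration): when a list mixes integers and floats, NumPy stores every element using one common, more general type.

```python
import numpy as np

ints = np.array([1, 2, 3])       # all integers -> an integer dtype
mixed = np.array([1, 2.5, 3])    # one float -> every element is stored as a float

print(ints.dtype)    # an integer type (the exact width depends on the platform)
print(mixed.dtype)   # float64
print(mixed)         # the integers 1 and 3 are now stored as floats
```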

### Creating NumPy arrays

The function `np.array()` creates a NumPy array from a Python list (or a list of lists).

In [None]:
arr1 = np.array([1, 2, 3, 4, 5, 6, 7, 8]) # 1D array
arr1

In [None]:
arr2 = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
arr2 # a 3x4 matrix

Attributes such as `shape`, `size`, and `dtype` provide useful information about an array.

In [None]:
# the dimensions of the array.
arr2.shape

In [None]:
# the type of the elements in the array.
arr2.dtype

We can explicitly set the data type of an `ndarray` with the `dtype` parameter of `np.array()`, and convert it to another type using the `astype` method.

In [None]:
# Construct an array with data of type float32
arr3 = np.array([1, 4, 2, 5, 3], dtype='float32')

# Check the type of the elements in the array.
arr3.dtype

In [None]:
# Convert an array from type float32 to int8
arr3 = arr3.astype(dtype=np.int8)
arr3.dtype
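
A brief sketch of two details of `astype` that are easy to miss, using made-up values (`floats` and `as_ints` are illustrative names): it returns a new array rather than changing the original, and converting floats to integers discards the fractional part instead of rounding.

```python
import numpy as np

floats = np.array([1.7, 2.2, -3.9])
as_ints = floats.astype(np.int32)   # a new array; fractional parts are dropped

print(as_ints)   # values: 1, 2, -3 (truncated toward zero, not rounded)
print(floats)    # the original array is unchanged
```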

NumPy also provides other functions to construct arrays easily.

In [None]:
# Create a 1D length-10 integer array filled with zeros
# (without dtype=int, np.zeros returns floating-point zeros)
np.zeros(10, dtype=int)

In [None]:
# Create a 3x5 floating-point array filled with ones
np.ones((3, 5), dtype=float)

In [None]:
# Create a linear sequence starting at 0 and stopping before 20 (20 is excluded), with a step of 2
np.arange(0, 20, 2)

In [None]:
# Create an array of five values evenly spaced between 0 and 1
np.linspace(0, 1, 5)
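
The two functions above treat their end value differently, which the following sketch makes explicit (`stepped` and `spaced` are illustrative names): `np.arange` stops before the end value, whereas `np.linspace` includes it by default.

```python
import numpy as np

stepped = np.arange(0, 20, 2)    # step of 2; the stop value 20 is excluded
spaced = np.linspace(0, 1, 5)    # 5 values; both 0 and 1 are included

print(stepped)   # last value is 18
print(spaced)    # 0.0, 0.25, 0.5, 0.75, 1.0
```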

In [None]:
# Create a 3x3 array of uniformly distributed random values in [0.0, 1.0)
np.random.random((3, 3))

In [None]:
# Create a 5x3 array of random integers in [0, 10)
np.random.randint(0, 10, (5, 3))

<b>Note</b>: You can use the `?` character to access the built-in documentation of functions. Refer to [Help and Documentation in IPython](https://jakevdp.github.io/PythonDataScienceHandbook/01.01-help-and-documentation.html) for more information.

In [None]:
np.zeros?

<div class="alert alert-info">
    <h4>Task 1.1</h4>
</div>

Create a 1D array of numbers from 0 to 9 and name it `arr1`. Display the resulting array.

In [None]:
# Your code goes here




## 1.2. Indexing and Data Slicing in NumPy
NumPy arrays offer fast element access for efficient data manipulation.

### Indexing
The value in row <i>i</i> and column <i>j</i> (both counted from zero) can be accessed by writing the two indices in square brackets, `arr[i, j]`.

In [None]:
# Create a 4x3 array of random integers in the interval [0, 10)
arr4 = np.random.randint(0, 10, (4, 3))
arr4

In [None]:
arr4[0,0]

In [None]:
# Indexing from the end of the array
arr4[3,-1]

Using the above indexing notation, we can modify the values of the NumPy array.

In [None]:
# Modify a value of the array
arr4[0,0] = 0
arr4
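
Because `arr4` holds random values, the results above change on every run. The sketch below repeats the same indexing steps on a hypothetical array `grid` with known values, so each result can be checked by eye.

```python
import numpy as np

# Hypothetical 4x3 array with known values
grid = np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9],
                 [10, 11, 12]])

print(grid[0, 0])     # row 0, column 0 -> 1
print(grid[3, -1])    # row 3, last column -> 12
print(grid[-1, -1])   # the same element, counting both axes from the end
print(grid[1])        # a single index selects a whole row: 4, 5, 6

grid[0, 0] = 100      # assignment changes the array in place
print(grid[0, 0])     # -> 100
```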

### Data slicing

`arr[start:stop:step]` selects a sub-array in NumPy; the element at `stop` is not included. The shorter forms `arr[:stop]` and `arr[start:]` use the default start and end indices.

In [None]:
# the first five elements (indices 0 to 4)
arr1[:5]

In [None]:
# from index 1 till the end
arr1[1:]

In [None]:
# similar syntax for 2D array: every row, the first two columns
arr2[:,:2]

In [None]:
# every other row, all columns
arr2[::2, :]

In [None]:
# use a condition to select the values in the array that lie strictly between 2 and 8
arr1[(arr1>2)&(arr1<8)]

In [None]:
# Replace those values > 3 by 2
arr1[arr1>3] = 2
arr1
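
The conditional selection above works in two steps, which this sketch separates (`values` and `mask` are illustrative names): the comparison first produces a boolean array, and indexing with that array keeps only the positions marked `True`.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8])   # hypothetical data

# Step 1: elementwise comparisons give a boolean array of the same length.
# The parentheses are required because & binds more tightly than > and <.
mask = (values > 2) & (values < 8)
print(mask)

# Step 2: indexing with the boolean array keeps the positions that are True
print(values[mask])   # values: 3, 4, 5, 6, 7
```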

<div class="alert alert-info">
    <h4>Task 1.2</h4>
</div>

Extract all odd numbers from `arr1`. <b>Hint</b>: think about the condition to use for data slicing.

In [None]:
# Your code goes here




## 1.3. Computation in NumPy

### Mathematical operations

NumPy provides a wide variety of straightforward mathematical operations on arrays.

In [None]:
# Use the same length as arr1 so that the two arrays can be added elementwise
arr5 = np.linspace(0, 1, arr1.size)
print('arr1:\n',arr1)
print('arr5:\n',arr5)

In [None]:
# Add items elementwise by using operator +.
arr1+arr5

In [None]:
# Multiply items elementwise by using method multiply.
np.multiply(arr5,10)

In [None]:
# Return square-root of an array, elementwise, using method sqrt.
np.sqrt(arr1)

In [None]:
# Array elements raised to powers elementwise.
np.power(arr1,3)

<div class="alert alert-info">
    <h4>Task 1.3</h4>
</div> 

Given the array `N`, calculate and print: $2*N$, $N^2$, $N^8$, and $2^N$.<br>
<b>Hint</b>: find the suitable [mathematical operations](https://numpy.org/doc/stable/reference/routines.math.html#mathematical-functions) in NumPy.

In [None]:
N = np.arange(10)
# Your code goes here




### Statistical operations

NumPy offers powerful [statistical functions](https://numpy.org/doc/stable/reference/routines.statistics.html) for essential analysis of an array.

In [None]:
# Sum all of the elements of the array
np.sum(arr1)

In [None]:
# Compute the mean of the array
np.mean(arr1)

In [None]:
# Compute the standard deviation of the array
np.std(arr1)

In [None]:
# Return the maximal element in an array
np.max(arr1)

In [None]:
# Sum the array elements along a given axis using the `axis` parameter
# (axis=0 adds up the rows, giving one sum per column)
np.sum(arr2,axis=0)
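
The `axis` parameter names the dimension that is collapsed. A short sketch on a 3x4 matrix with the same values as `arr2` (`m`, `col_sums` and `row_sums` are illustrative names):

```python
import numpy as np

m = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8],
              [9, 10, 11, 12]])

col_sums = np.sum(m, axis=0)   # collapse the rows: one sum per column
row_sums = np.sum(m, axis=1)   # collapse the columns: one sum per row

print(col_sums)   # 15, 18, 21, 24
print(row_sums)   # 10, 26, 42
```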

<div id="partone"><h2><center>Part 2: Basics of Pandas</center></h2></div>

NumPy mostly handles numerical values. <a href="https://pandas.pydata.org/docs/user_guide/index.html"><b>Pandas</b></a> is a newer library built on NumPy. It offers more flexible data analysis and manipulation on numeric, text, and heterogeneous data types for data processing in machine learning and data science.

In [None]:
# We need to import pandas as pd
import pandas as pd

## 2.1. Pandas DataFrame Object

A <b><a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html">DataFrame</a></b> in Pandas can be thought of as a 2D container to support easy data manipulation, similar to a spreadsheet or table. It can be created by method `pd.DataFrame()`, with optional arguments `index` (row labels) and `columns` (column labels) as follows.

In [None]:
df = pd.DataFrame({'Name': ['Alice', 'Bob'], 
                   'Module': ['Chemistry', 'Math'], 
                   'Mark': [73, 68]},
                  index=[1, 2])
df

A <b>DataFrame</b> has the attributes `values`, `index` and `columns` to access its data, row labels and column labels.

In [None]:
df.values

In [None]:
df.index

In [None]:
df.columns
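
A small sketch of what these attributes return, using a hypothetical DataFrame `demo` similar to `df`: `values` is a NumPy array, and because the columns mix strings and integers its `dtype` is the general `object` type.

```python
import pandas as pd

# Hypothetical data for illustration
demo = pd.DataFrame({'Name': ['Alice', 'Bob'],
                     'Mark': [73, 68]},
                    index=[1, 2])

print(type(demo.values))     # a NumPy ndarray
print(demo.values.dtype)     # object, since strings and integers are mixed
print(list(demo.index))      # [1, 2]
print(list(demo.columns))    # ['Name', 'Mark']
```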

## 2.2. Reading & viewing data in Pandas

Pandas provides functions to load, inspect, and process data for machine learning and data science.

### Reading data

Pandas functions such as `read_csv()` or `read_excel()` import data files as a DataFrame.

In [None]:
dfemissions = pd.read_csv("https://raw.githubusercontent.com/plotly/datasets/master/Emissions%20Data.csv")
dfemissions
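
If the URL above is unavailable, `read_csv()` can be tried offline. The sketch below parses a few made-up rows from a string; the column names and values are invented for this illustration and are not taken from the emissions file.

```python
import io
import pandas as pd

# Invented CSV content: the first line holds the column names
csv_text = """Country,Continent,Emission
A,Europe,1.5
B,Asia,2.5
"""

# read_csv accepts a file path, a URL, or a file-like object such as StringIO
demo = pd.read_csv(io.StringIO(csv_text))
print(demo)
print(demo.shape)   # (2, 3): two rows, three columns
```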

### Viewing data

In [None]:
# View the actual data inside the data structure DataFrame using attribute .values
dfemissions.values

In [None]:
# View by default the first five samples using method .head()
dfemissions.head()

In [None]:
# View the last three samples using method .tail()
dfemissions.tail(3)

In [None]:
# View the data type of each column using attribute .dtypes
dfemissions.dtypes

In [None]:
# Convert the `Year` column to `float64` using method .astype()
# (this returns a new Series; dfemissions itself is not modified)
dfemissions.Year.astype('float64')
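
Since `.astype()` returns a new object, the conversion only persists if the result is assigned back. A minimal sketch on a hypothetical DataFrame `demo`:

```python
import pandas as pd

demo = pd.DataFrame({'Year': [2008, 2009, 2010]})   # hypothetical data

converted = demo['Year'].astype('float64')   # a new Series with float values
before = demo['Year'].dtype                  # the column is still an integer type
print(before)

demo['Year'] = demo['Year'].astype('float64')   # assign back to keep the change
print(demo['Year'].dtype)                       # float64
```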

In [None]:
# Check the number of rows and columns of the data using attribute .shape
dfemissions.shape

<div class="alert alert-info">
    <h4>Task 2.2</h4>
</div>

Pandas can import many different file types. Find the suitable function in the [IO tools description](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html) to read the Excel file at path ```data/data-combinatorial-challenge.xlsx``` and name the resulting DataFrame ```dfC```. Then show the number of columns in `dfC`.

In [None]:
path_to_file = 'data/data-combinatorial-challenge.xlsx'

# Your code goes here




### Descriptive statistics
Pandas provides various [functions](https://pandas.pydata.org/docs/user_guide/basics.html#descriptive-statistics) to easily find statistical information of the data.

In [None]:
# Calculate the mean of the data in the `Emission` column using method .mean()
dfemissions['Emission'].mean()

In [None]:
# Calculate the standard deviation of the data in the `Emission` column
dfemissions['Emission'].std()

In [None]:
# Calculate the sum of the data in the `Emission` column
dfemissions['Emission'].sum()

In [None]:
# Generate summary statistics for the numerical columns of the DataFrame
dfemissions.describe()

## 2.3. Indexing and selection in Pandas

### Data indexing

We can access a column of a DataFrame either by `dataframe[colname]` or as an attribute ```dataframe.colname```.

In [None]:
# access a column by `dataFrame[colname]`
dfemissions['Continent']

In [None]:
# access a column as an attribute `dataFrame.colname`
dfemissions.Continent

We can select data with two indexing operators: ```iloc``` (position-based, using integer positions of rows and columns) and ```loc``` (label-based, using row and column labels).

In [None]:
# position-based selection: access the row at position 1 (the second row)
dfemissions.iloc[1] 

In [None]:
# access the column at position 2 (the third column)
dfemissions.iloc[:,2]

In [None]:
# label-based selection: show each row of the columns 'Country' and 'Emission'
dfemissions.loc[:,['Country','Emission']]
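
The difference between the two operators is easiest to see when the row labels are not 0, 1, 2, and so on. The sketch below uses a hypothetical DataFrame `demo` whose row labels are 10, 20 and 30.

```python
import pandas as pd

# Hypothetical data: the row labels deliberately differ from the row positions
demo = pd.DataFrame({'Country': ['A', 'B', 'C'],
                     'Emission': [1.0, 2.0, 3.0]},
                    index=[10, 20, 30])

first_by_position = demo.iloc[0]   # the row at position 0
first_by_label = demo.loc[10]      # the row labelled 10 (the same row)

by_position = demo.iloc[0:2]       # positions 0 and 1; the stop position is excluded
by_label = demo.loc[10:20]         # labels 10 and 20; the stop label is included

print(by_position)
print(by_label)
```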

<div class="alert alert-info">
    <h4>Task 2.3</h4>
</div>

Is there an error in the code cell below? If so, how would you fix it?

In [None]:
dfemissions.iloc[:5,['Country']]

### Data slicing

In [None]:
# Select the rows at positions 5 to 9 (the 6th to 10th rows) of the data
dfemissions[5:10]

In [None]:
# Show the samples where the Continent is in 'continents', using a boolean condition with .loc

continents = ["Oceania","Europe"] # list of selected continents

# dfemissions.Continent.isin(continents): True for rows whose Continent value is in continents
# dfemissions.loc[x]: select the rows of dfemissions where x is True
dfemissions.loc[dfemissions.Continent.isin(continents)]
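
The selection above can be read in two steps, sketched here on a hypothetical DataFrame `demo` with invented values: `isin` builds a boolean Series, and `loc` keeps the rows where that Series is `True`.

```python
import pandas as pd

# Hypothetical data for illustration
demo = pd.DataFrame({'Country': ['A', 'B', 'C', 'D'],
                     'Continent': ['Europe', 'Asia', 'Oceania', 'Africa']})

# Step 1: one True/False value per row
mask = demo['Continent'].isin(['Oceania', 'Europe'])
print(mask)

# Step 2: keep only the rows where the mask is True
selected = demo.loc[mask]
print(selected)
```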

In [None]:
# Access the built-in documentation of functions using `?`
dfemissions.Continent.isin?

<div class="alert alert-info">
    <h4>Task 2.4</h4>
</div>

Show the first ten samples in `dfemissions` where `Emission` is higher than the mean emission value.<br>
<b>Hint</b>: break the task into subtasks. For example, calculate the mean value first, then do the conditional data slicing. 

In [None]:
# Your code goes here.





## 2.4. Grouping, sorting and mapping

We often need to group things together by certain characteristics, to analyse selected data important to us.

### Grouping

The [```groupby()```](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) method returns a `GroupBy` object that holds the grouped data. Each group is the subset of rows of the DataFrame that share the same value in the grouping column.

In [None]:
# Count the number of rows (samples) in each continent, i.e. data is grouped by continent
dfemissions.groupby('Continent')['Continent'].count()

In [None]:
# Calculate the mean emission in each continent, i.e. data grouped by continent, then fetch the 'Emission' column
dfemissions.groupby('Continent')['Emission'].mean()
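
To check by hand what `groupby()` computes, the sketch below applies the same two calls to a hypothetical three-row DataFrame `demo` with invented values.

```python
import pandas as pd

# Hypothetical data: two rows for Europe, one row for Asia
demo = pd.DataFrame({'Continent': ['Europe', 'Europe', 'Asia'],
                     'Emission': [1.0, 3.0, 5.0]})

counts = demo.groupby('Continent')['Emission'].count()   # rows per group
means = demo.groupby('Continent')['Emission'].mean()     # mean per group

print(counts)   # Asia: 1, Europe: 2
print(means)    # Asia: 5.0, Europe: 2.0
```

The result is a Series indexed by the group labels, so a single group can be read with, for example, `means['Europe']`.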

### Sorting

The [```sort_values()```](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html) method sorts the data by the values of the column (or columns) given in `by`. The parameter `ascending` is `True` by default.

In [None]:
# Sort the data by 'Emission' in descending order
dfemissions = dfemissions.sort_values(by='Emission', ascending=False)
dfemissions.head()
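
A minimal sketch of the default behaviour on a hypothetical DataFrame `demo`: `sort_values()` sorts in ascending order unless told otherwise, and it returns a new DataFrame, which is why the result above is assigned back to `dfemissions`.

```python
import pandas as pd

# Hypothetical data for illustration
demo = pd.DataFrame({'Country': ['A', 'B', 'C'],
                     'Emission': [2.0, 3.0, 1.0]})

ascending = demo.sort_values(by='Emission')                     # default: smallest first
descending = demo.sort_values(by='Emission', ascending=False)   # largest first

print(ascending['Country'].tolist())    # ['C', 'A', 'B']
print(descending['Country'].tolist())   # ['B', 'A', 'C']
print(demo['Country'].tolist())         # ['A', 'B', 'C']: the original order is kept
```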

<div class="alert alert-info">
    <h4>Task 2.5</h4>
</div>

Sort the data `dfemissions` in descending order by `Continent` and then by `Country`.

In [None]:
# Your code goes here.




<div class="alert alert-warning">
    <h4>Optional</h4>
</div>

To extend your knowledge and skills in NumPy and Pandas, here are some additional resources:
- NumPy
    - [Standard data types in NumPy](https://numpy.org/doc/stable/user/basics.types.html)
    - [NumPy quickstart](https://numpy.org/doc/stable/user/quickstart.html#numpy-quickstart)
    - [Introduction to NumPy](https://jakevdp.github.io/PythonDataScienceHandbook/02.00-introduction-to-numpy.html) in [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/)
- Pandas
    - [Learn Pandas Tutorial](https://www.kaggle.com/learn/pandas) in Kaggle (<i>highly recommended</i>)
    - YouTube: <a href="https://www.youtube.com/watch?v=dcqPhpY7tWk">What's Pandas?</a>
    - <a href="https://www.youtube.com/watch?v=iGFdh6_FePU">Pandas in 10 minutes</a>
    - [Data Manipulation with Pandas](https://jakevdp.github.io/PythonDataScienceHandbook/03.00-introduction-to-pandas.html) in [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/)
    
<div class="alert alert-success">
    <h2>🍰 End</h2> 
</div>