<h1><center>COMP1008: Exercise 1<br/>Introduction to NumPy and Pandas</center></h1>

`NumPy` and `Pandas` are two extremely useful Python libraries. They provide powerful and flexible mathematical and data analysis methods for machine learning and data science. This tutorial provides some examples of basic data manipulation in <a href="#partone">Part 1: NumPy</a> and <a href="#parttwo">Part 2: Pandas</a>. Some hands-on tasks are designed for you to practice your data manipulation skills.

<div id="partone"><h2><center>Part 1: NumPy Basics</center></h2></div>

The [NumPy](https://numpy.org/doc/stable/user/absolute_beginners.html) library provides a powerful array object, `ndarray`, for handling well-organized data in numerical tasks. NumPy should already be installed with Anaconda. We simply import it under the alias `np`.

In [2]:
import numpy as np
# import SymPy as sp

### 1.1. Introduction to NumPy arrays

`ndarray` is a homogeneous n-dimensional array object, with <b>methods</b> and <b>attributes</b> to efficiently operate on data. It holds a collection of items that all share one data type, arranged as a vector (one-dimensional), a matrix (two-dimensional), or an array with more dimensions.

#### Creating NumPy arrays

The function `np.array()` creates a NumPy array from a Python list (or from nested lists).

In [None]:
# try your COMP1043 tutorial task using numpy
arr0 = np.array([1, 0, -2]) # 1D array
arr0

In [None]:
arr1 = np.array([2, 1, 1]) # 1D array

In [None]:
arr = arr0 + 2 * arr1
arr

In [None]:
arr2 = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
arr2 # a 3x4 2D array
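
Because an `ndarray` is homogeneous, `np.array()` picks a single data type that can represent every item in the input. The short sketch below illustrates this with three made-up lists; the variable names `ints`, `mixed`, and `texts` are only for illustration.

```python
import numpy as np

ints = np.array([1, 2, 3])       # all integers: an integer array
mixed = np.array([1, 2.5, 3])    # integers are upcast to float
texts = np.array([1, 'two'])     # everything is converted to a string

print(ints.dtype, mixed.dtype, texts.dtype)
```

Mixing numbers and strings in one array is rarely what you want, since arithmetic no longer works on the result.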

Attributes such as `shape`, `size`, and `dtype` provide useful information about an array.

In [None]:
# the dimensions of the array.
arr2.shape
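
As a small sketch of the other attributes mentioned here, the cell below rebuilds an array with the same layout as `arr2` under the hypothetical name `demo` and reads a few of its attributes. The exact integer width reported by `dtype` depends on the platform.

```python
import numpy as np

# A 3x4 array with the same values as arr2
demo = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

shape = demo.shape   # (rows, columns)
size = demo.size     # total number of elements
ndim = demo.ndim     # number of dimensions
kind = demo.dtype    # element type (an integer type here)

print(shape, size, ndim, kind)
```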

NumPy also provides other functions for constructing arrays.

In [None]:
# Create a linear sequence starting at 0 and ending before 20 (the stop value is excluded), with a step of 2
np.arange(0, 20, 2)

In [None]:
# Create an array of five values evenly spaced between 0 and 1
np.linspace(0, 1, 5)

In [None]:
# Create a 3x3 array of uniformly distributed random values in [0.0, 1.0)
np.random.random((3, 3))
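
A common source of confusion is how the end value is treated. The sketch below contrasts the two functions using the same arguments as the cells above: `np.arange` excludes its stop value, while `np.linspace` includes both endpoints by default. The names `steps` and `points` are only for illustration.

```python
import numpy as np

steps = np.arange(0, 20, 2)     # stop value 20 is excluded
points = np.linspace(0, 1, 5)   # both endpoints are included by default

print(steps)    # the last element is 18
print(points)   # five values from 0.0 to 1.0
```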

<div class="alert alert-info">
    <h4>Task 1.1</h4>
</div>

Create a 1D array of numbers from 0 to 9 and name it `arr1`. Display the resulting array.

In [3]:
# Your code goes here

arr1 = np.arange(0,10)
print(arr1)



[0 1 2 3 4 5 6 7 8 9]


### 1.2. Indexing and Data Slicing in NumPy
NumPy arrays offer fast element access for efficient data manipulation.

#### Data Indexing
The (<i>i,j</i>)-th value (counting from index zero) can be accessed by giving its row and column index as `arr[i, j]`.

In [None]:
# Create a 4x3 array of random integers in the interval [0, 10)
arr3 = np.random.randint(0, 10, (4, 3))
arr3

In [None]:
arr3[0,0]

In [None]:
# Indexing from the end of the array
arr3[3,-1]

In [None]:
# Modify a value in the array
arr3[0,0] = 0
arr3

#### Data slicing

`arr[start:stop:step]` selects a sub-array in NumPy; the element at `stop` is excluded. The shorthand forms `arr[:stop]` and `arr[start:]` use the default start (the first element) and the default end (the last element).

In [None]:
# from index 0 up to, but not including, index 5 (the first five elements)
arr1[:5]

In [None]:
# from index 1 till the end
arr1[1:]

In [None]:
# similar syntax for 2D array: every row, the first two columns
arr2[:,:2]
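
The cells above leave out the `step` part of `arr[start:stop:step]`. A minimal sketch of how it behaves, using a made-up array `seq` rather than the arrays defined earlier:

```python
import numpy as np

seq = np.arange(10)       # example array: 0, 1, ..., 9

every_other = seq[::2]    # start and stop omitted, step of 2
middle = seq[2:8:3]       # indices 2 and 5 (8 is excluded)
backwards = seq[::-1]     # a negative step walks from the end to the start

print(every_other, middle, backwards)
```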

In [None]:
# use a condition to select the values in the array that lie in the open interval (2, 8)
arr1[(arr1>2)&(arr1<8)]

In [None]:
# Replace the values > 3 with 2 (note: this modifies arr1 in place)
arr1[arr1>3] = 2
arr1
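
A condition such as `arr1 > 3` produces a Boolean array (a mask) with one `True`/`False` entry per element, and assigning through the mask changes the array in place. The sketch below shows one way to keep the original values, assuming you want them later: work on a copy. The names `original`, `mask`, and `capped` are only for illustration.

```python
import numpy as np

original = np.arange(10)    # example array: 0, 1, ..., 9
mask = original > 3         # Boolean array, True where the condition holds

capped = original.copy()    # work on a copy so `original` is untouched
capped[mask] = 2

print(mask)
print(original)
print(capped)
```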

<div class="alert alert-info">
    <h4>Task 1.2</h4>
</div>

Extract all odd numbers from `arr1`. <b>Hint</b>: think about the condition to use for data slicing.

In [4]:
# Your code goes here

arr1[arr1%2==1]

array([1, 3, 5, 7, 9])

### 1.3. Computation in NumPy

#### Mathematical operations

NumPy provides a wide variety of straightforward mathematical operations on arrays.

In [None]:
arr4 = np.linspace(0, 1, arr1.size)  # same length as arr1, so the two arrays can be added elementwise
print('arr1:\n',arr1)
print('arr4:\n',arr4)

In [None]:
# Add items elementwise by using operator +.
arr1+arr4

In [None]:
# Array elements raised to powers elementwise.
np.power(arr1,3)
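
Elementwise operations need operands with compatible shapes: two 1D arrays must have the same length, while a scalar is applied to every element. A minimal sketch with made-up arrays `a` and `b`:

```python
import numpy as np

a = np.arange(4)             # [0 1 2 3]
b = np.linspace(0, 1, 4)     # four values, same length as a

total = a + b                # elementwise sum
cubes = np.power(a, 3)       # same result as a ** 3
shifted = a + 10             # the scalar is applied to every element

try:
    a + np.arange(3)         # lengths 4 and 3 do not match
    mismatch_raises = False
except ValueError:
    mismatch_raises = True

print(total, cubes, shifted, mismatch_raises)
```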

<div class="alert alert-info">
    <h4>Task 1.3</h4>
</div> 

Given the array `N`, calculate and print: $2N$, $N^2$, $N^8$, and $2^N$.<br>
<b>Hint</b>: find the suitable [mathematical operations](https://numpy.org/doc/stable/reference/routines.math.html#mathematical-functions) in NumPy.

In [5]:
N = np.arange(10)
# Your code goes here
print(2*N,N**2,N**8,2**N)


[ 0  2  4  6  8 10 12 14 16 18] [ 0  1  4  9 16 25 36 49 64 81] [       0        1      256     6561    65536   390625  1679616  5764801
 16777216 43046721] [  1   2   4   8  16  32  64 128 256 512]


#### Statistical operations

NumPy offers powerful [statistical functions](https://numpy.org/doc/stable/reference/routines.statistics.html) for basic analysis of arrays.

In [None]:
# Sum all of the elements of the array
np.sum(arr1)

In [None]:
# Compute the standard deviation of the array
np.std(arr1)

In [None]:
# Return the maximal element in an array
np.max(arr1)
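
For arrays with more than one dimension, these functions also accept an `axis` argument that selects the direction to reduce along. A short sketch on a made-up 2x3 array `grid`:

```python
import numpy as np

grid = np.array([[1, 2, 3], [4, 5, 6]])   # example 2x3 array

total = np.sum(grid)              # all elements
col_sums = np.sum(grid, axis=0)   # collapse the rows: one sum per column
row_max = np.max(grid, axis=1)    # collapse the columns: one maximum per row
spread = np.std(grid)             # population standard deviation (ddof=0)

print(total, col_sums, row_max, spread)
```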

<div id="parttwo"><h2><center>Part 2: Basics of Pandas</center></h2></div>

NumPy mostly handles numerical values. <a href="https://pandas.pydata.org/docs/user_guide/index.html"><b>Pandas</b></a> offers more flexible data analysis and manipulation of numeric, alphabetic, and heterogeneous data types for data processing in machine learning and data science.

In [8]:
# We need to import pandas as pd
import pandas as pd

### 2.1. Pandas DataFrame Object

A Pandas <b><a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html">DataFrame</a></b> can be thought of as a 2D container for easy data manipulation, similar to a spreadsheet or table. A DataFrame `df` is created with the constructor `pd.DataFrame()` as follows.

In [None]:
df = pd.DataFrame({'Name': ['Alice', 'Bob'], 
                   'Module': ['Chemistry', 'Math'], 
                   'Mark': [73, 68]})

In [None]:
df

A <b>DataFrame</b> has attributes such as `values`, `index`, and `columns` that give access to the data, the row index, and the column labels.

In [None]:
df.columns
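
To see all three attributes side by side, the sketch below rebuilds the same small table under the hypothetical name `demo` and reads each attribute in turn.

```python
import pandas as pd

demo = pd.DataFrame({'Name': ['Alice', 'Bob'],
                     'Module': ['Chemistry', 'Math'],
                     'Mark': [73, 68]})

cols = list(demo.columns)   # column labels
rows = list(demo.index)     # default integer row labels: 0, 1
data = demo.values          # the underlying data as a NumPy array

print(cols)
print(rows)
print(data)
```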

### 2.2. Reading & viewing data in Pandas

#### Reading data

Pandas functions such as `read_csv()` and `read_excel()` import data files as a DataFrame.

In [9]:
dfemissions = pd.read_csv("https://raw.githubusercontent.com/plotly/datasets/master/Emissions%20Data.csv")
dfemissions

Unnamed: 0,Year,Country,Continent,Emission
0,2008,Aruba,South America,24.750133
1,2009,Aruba,South America,24.876706
2,2010,Aruba,South America,24.182702
3,2011,Aruba,South America,23.922412
4,2008,Andorra,Europe,6.296125
...,...,...,...,...
783,2011,Zambia,Africa,0.212450
784,2008,Zimbabwe,Africa,0.569255
785,2009,Zimbabwe,Africa,0.600521
786,2010,Zimbabwe,Africa,0.646073


#### Viewing data

In [None]:
# View the actual data inside the data structure DataFrame using attribute .values
dfemissions.values

In [None]:
# View by default the first five samples using method .head()
dfemissions.head()

In [None]:
# View the last three samples using method .tail()
dfemissions.tail(3)

In [None]:
# View the data type of each column using attribute .dtypes
dfemissions.dtypes

In [None]:
# Convert the `Year` column to `float64` using method .astype(); this returns a new Series
# and leaves dfemissions unchanged unless the result is assigned back
dfemissions.Year.astype('float64')
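
Note that `.astype()` returns a converted copy rather than changing the column in place. A minimal sketch on a made-up two-row DataFrame `demo` (not the emissions data) shows the difference:

```python
import pandas as pd

demo = pd.DataFrame({'Year': [2008, 2009], 'Emission': [24.75, 24.88]})

converted = demo['Year'].astype('float64')   # new Series; demo is unchanged
dtype_before = demo['Year'].dtype            # still an integer type

demo['Year'] = converted                     # assign back to keep the change
dtype_after = demo['Year'].dtype             # now float64

print(dtype_before, dtype_after)
```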

In [None]:
# Check the number of rows and columns using attribute .shape
dfemissions.shape

#### Descriptive statistics
Pandas provides various [methods](https://pandas.pydata.org/docs/user_guide/basics.html#descriptive-statistics) for computing summary statistics of the data.

In [None]:
# Calculate the mean of the data in the `Emission` column using method .mean()
dfemissions['Emission'].mean()

In [None]:
# Calculate the standard deviation of the data in the `Emission` column
dfemissions['Emission'].std()

In [None]:
# Calculate the sum of the data in the `Emission` column
dfemissions['Emission'].sum()

In [None]:
# Generate summary statistics (count, mean, std, min, quartiles, max) for the numerical columns of the DataFrame
dfemissions.describe()
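
One detail worth knowing when comparing results with Part 1: the Pandas `.std()` method computes the sample standard deviation (`ddof=1`) by default, whereas `np.std()` computes the population standard deviation (`ddof=0`). The sketch below uses a made-up Series `marks` to show the two values and how to make them agree.

```python
import numpy as np
import pandas as pd

marks = pd.Series([2.0, 4.0, 4.0, 6.0])      # example data

sample_std = marks.std()                     # Pandas default: ddof=1
population_std = np.std(marks.to_numpy())    # NumPy default: ddof=0
matched = marks.std(ddof=0)                  # same as the NumPy result

print(sample_std, population_std, matched)
```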

### 2.3. Indexing and selection in Pandas

#### Data indexing

We can access a column of a DataFrame with either `dataframe[colname]` or the attribute form `dataframe.colname`.

In [None]:
# access a column by `dataframe[colname]`
dfemissions['Continent']

In [None]:
# access a column by the attribute `dataframe.colname`
dfemissions.Continent

We can select data either by their labels or by their position.

In [None]:
dfemissions[['Country','Emission']] # select two columns

In [None]:
dfemissions[5:10] # Select the rows at positions 5 to 9 (position 10 is excluded)
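
Pandas also offers two explicit indexers for this: `.iloc` selects by integer position and `.loc` selects by label. The sketch below uses a made-up DataFrame `demo` with hypothetical row labels `'a'`, `'b'`, `'c'` so that positions and labels are easy to tell apart.

```python
import pandas as pd

demo = pd.DataFrame({'Country': ['Aruba', 'Andorra', 'Zambia'],
                     'Emission': [24.75, 6.30, 0.21]},
                    index=['a', 'b', 'c'])   # hypothetical row labels

by_position = demo.iloc[0:2]                 # positions 0 and 1; the stop is excluded
by_label = demo.loc['a':'b', ['Emission']]   # label slices include the end label
single = demo.loc['c', 'Country']            # one cell, by row and column label

print(by_position)
print(by_label)
print(single)
```

Unlike position slices, label slices with `.loc` include both endpoints.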

<div class="alert alert-info">
    <h4>Task 2</h4>
</div>

Show the first ten samples in `dfemissions` where `Emission` is higher than the mean emission value.<br>
<b>Hint</b>: break the task into subtasks. For example, calculate the mean value first, then do the conditional data slicing. 

In [None]:
# Your code goes here.
print(dfemissions[(dfemissions['Emission']>(dfemissions['Emission'].mean()))].head(10))

#readable version below

#mean = dfemissions['Emission'].mean()
#strippedDF = dfemissions[(dfemissions['Emission']>mean)]
#print(strippedDF.head(10))

    Year               Country      Continent   Emission
0   2008                 Aruba  South America  24.750133
1   2009                 Aruba  South America  24.876706
2   2010                 Aruba  South America  24.182702
3   2011                 Aruba  South America  23.922412
4   2008               Andorra         Europe   6.296125
5   2009               Andorra         Europe   6.049173
6   2010               Andorra         Europe   6.124770
7   2011               Andorra         Europe   5.968685
20  2008  United Arab Emirates           Asia  23.033600
21  2009  United Arab Emirates           Asia  21.102296


<div class="alert alert-warning">
    <h4>Optional</h4>
</div>

To extend your knowledge and skills in NumPy and Pandas, here are some additional resources:
- NumPy
    - [Standard data types in NumPy](https://numpy.org/doc/stable/user/basics.types.html)
    - [NumPy quickstart](https://numpy.org/doc/stable/user/quickstart.html#numpy-quickstart)
    - [Introduction to NumPy](https://jakevdp.github.io/PythonDataScienceHandbook/02.00-introduction-to-numpy.html) in [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/)
- Pandas
    - [Learn Pandas Tutorial](https://www.kaggle.com/learn/pandas) in Kaggle (<i>highly recommended</i>)
    - YouTube: <a href="https://www.youtube.com/watch?v=dcqPhpY7tWk">What's Pandas?</a>
    - <a href="https://www.youtube.com/watch?v=iGFdh6_FePU">Pandas in 10 minutes</a>
    - [Data Manipulation with Pandas](https://jakevdp.github.io/PythonDataScienceHandbook/03.00-introduction-to-pandas.html) in [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/)
    
<div class="alert alert-success">
    <h2>🍰 End</h2> 
</div>