<h1><center>COMP1008: Exercise 1<br/>Introduction to NumPy and Pandas</center></h1>

NumPy and Pandas are two extremely useful Python libraries for COMP1008 exercises and coursework. They provide powerful and flexible mathematical and data analysis methods for machine learning and data science.

This tutorial provides examples of basic data manipulation using NumPy and Pandas, along with hands-on tasks for you to practice your data manipulation skills.

<div id="partone"><h2><center>Part 1: NumPy Basics</center></h2></div>

The [NumPy](https://numpy.org/doc/stable/user/absolute_beginners.html) library in Python provides a powerful array object `ndarray` to handle well-organized data in numerical tasks. NumPy should be already installed with Anaconda. We simply import it and call it `np`. 

In [1]:
import numpy as np

## 1.1. Introduction to NumPy arrays

`ndarray` is a homogeneous n-dimensional array object, with <b>methods</b> and <b>attributes</b> to efficiently operate on data. It holds items of a single data type, arranged as a vector (one-dimensional, 1D), a matrix (two-dimensional, 2D), or a higher-dimensional array.

### Creating NumPy arrays

The `np.array()` function creates a NumPy array from a Python list.

In [2]:
arr1 = np.array([1, 2, 3, 4, 5, 6, 7, 8]) # 1D array
arr1

array([1, 2, 3, 4, 5, 6, 7, 8])

In [3]:
arr2 = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
arr2 # a 3X4 matrix

array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12]])

Attributes such as `shape`, `size`, and `dtype` provide useful information about an array.

In [4]:
# the dimensions of the array.
arr2.shape

(3, 4)

In [5]:
# the type of the elements in the array.
arr2.dtype

dtype('int64')
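The `size` attribute mentioned above can be checked in the same way, as can `ndim` (an attribute not otherwise used in this tutorial). A minimal sketch, using a hypothetical array `demo` with the same contents as `arr2`:

```python
import numpy as np

# Hypothetical array with the same contents as arr2
demo = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

n_elements = demo.size   # total number of elements: 3 * 4
n_dims = demo.ndim       # number of axes (2 for a matrix)
print(n_elements, n_dims)
```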

We can explicitly set the data type of an `ndarray` with the `dtype` parameter of `np.array()`, and convert it to another type using the `astype` method.

In [6]:
# Construct an array with data of type float32
arr3 = np.array([1, 4, 2, 5, 3], dtype='float32')

# Check the type of the elements in the array.
arr3.dtype

dtype('float32')

In [7]:
# Convert an array from type float32 to int8
arr3 = arr3.astype(dtype=np.int8)
arr3.dtype

dtype('int8')
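Converting between types can change the stored values. As a small illustration (the array `floats` is made up for this example), casting floating-point values to an integer type drops the fractional part rather than rounding:

```python
import numpy as np

floats = np.array([1.7, 2.2, -3.9])   # made-up example values
ints = floats.astype(np.int8)         # fractional parts are dropped (truncation toward zero)

print(ints.tolist())   # plain Python list of the converted values
print(ints.dtype)
```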

Other functions also make it easy to construct NumPy arrays.

In [8]:
# Create a 1D length-10 floating-point array filled with zeros
np.zeros(10)

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [9]:
# Create a 3x5 floating-point array filled with ones
np.ones((3, 5), dtype=float)

array([[1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.]])

In [10]:
# Create a linear sequence starting at 0 and stopping before 20, with a step of 2
np.arange(0, 20, 2)

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])

In [11]:
# Create an array of five values evenly spaced between 0 and 1
np.linspace(0, 1, 5)

array([0.  , 0.25, 0.5 , 0.75, 1.  ])

In [12]:
# Create a 3x3 array of uniformly distributed random values in [0.0, 1.0)
np.random.random((3, 3))

array([[0.95099559, 0.61571384, 0.20770845],
       [0.87360432, 0.03665051, 0.28125323],
       [0.17963512, 0.13639668, 0.69938507]])

In [14]:
# Create a 5x3 array of random integers in [0, 10)
np.random.randint(0, 10, (5, 3))

array([[4, 4, 6],
       [4, 0, 0],
       [3, 9, 8],
       [9, 8, 7],
       [2, 3, 0]])
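The random arrays above differ on every run. If reproducible values are needed, one option is NumPy's `Generator` interface with a fixed seed. The seed value 42 and the variable names below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)          # seeded generator (seed chosen arbitrarily)
floats = rng.random((3, 3))              # uniform values in [0.0, 1.0)
ints = rng.integers(0, 10, size=(5, 3))  # integers in [0, 10)

# A second generator with the same seed reproduces the same values
rng2 = np.random.default_rng(42)
same_floats = rng2.random((3, 3))
print(np.array_equal(floats, same_floats))
```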

<b>Note</b>: You can use the `?` character to access the built-in documentation of functions. Refer to [Help and Documentation in IPython](https://jakevdp.github.io/PythonDataScienceHandbook/01.01-help-and-documentation.html) for more information.

In [15]:
np?

In [16]:
np.zeros?

<div class="alert alert-info">
    <h4>Task 1.1</h4>
</div>

Create a 1D array of numbers from 0 to 9 and name it `arr1`. Display the resulting array.

In [19]:
np.arange?


In [46]:
# Your code goes here
arr1 = np.arange(0,10,1)
arr1

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

## 1.2. Indexing and Data Slicing in NumPy
NumPy arrays offer fast element access for efficient data manipulation.

### Indexing
The (<i>i,j</i>)-th value (counting from zero) can be accessed by giving its row and column index in square brackets, as in `arr[i, j]`.

In [22]:
# Create a 4x3 array of random integers in the interval [0, 10)
arr4 = np.random.randint(0, 10, (4, 3))
arr4

array([[4, 1, 9],
       [4, 1, 5],
       [0, 8, 3],
       [7, 4, 4]])

In [32]:
arr4[0,0]

4

In [35]:
# Indexing from the end of the array
arr4[3,-2]

4

Using the above indexing notation, we can modify the values of the NumPy array.

In [36]:
# Modify a value of the array
arr4[0,0] = 0
arr4

array([[0, 1, 9],
       [4, 1, 5],
       [0, 8, 3],
       [7, 4, 4]])

### Data slicing

`arr[start:stop:step]` selects a sub-array in NumPy. Short syntax `arr[:stop]` and `arr[start:]` use the default starting and ending index.

In [53]:
# from index 0 up to, but not including, index 5
arr1[:5]

array([0, 1, 2, 3, 4])

In [54]:
# from index 1 till the end
arr1[1:]

array([1, 2, 3, 4, 5, 6, 7, 8, 9])

In [55]:
# similar syntax for 2D array: every row, the first two columns
arr2[:,:2]

array([[ 1,  2],
       [ 5,  6],
       [ 9, 10]])

In [56]:
# every other row, all columns
arr2[::2, :]

array([[ 1,  2,  3,  4],
       [ 9, 10, 11, 12]])
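The full `start:stop:step` form works the same way on 1D arrays. A brief sketch on a hypothetical array `seq` (so that `arr1` is left untouched):

```python
import numpy as np

seq = np.arange(10)      # hypothetical array: 0, 1, ..., 9

middle = seq[2:8:3]      # indices 2 and 5 (the stop index 8 is excluded)
evens = seq[::2]         # every second element from the start
backwards = seq[::-1]    # a negative step walks the array in reverse
print(middle, evens, backwards)
```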

In [57]:
# use a condition to select the values in the array that lie in the open interval (2, 8)
arr1[(arr1>2)&(arr1<8)]

array([3, 4, 5, 6, 7])

In [58]:
# Replace those values > 3 by 2
arr1[arr1>3] = 2
arr1

array([0, 1, 2, 3, 2, 2, 2, 2, 2, 2])
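Conditional selection works through a boolean mask: the comparison produces an array of `True`/`False` values, which then picks out the matching elements. The sketch below uses a hypothetical array `vals` and performs the replacement on a copy, so the original values stay available:

```python
import numpy as np

vals = np.arange(10)             # hypothetical array: 0..9
mask = (vals > 2) & (vals < 8)   # boolean array, True where the condition holds
selected = vals[mask]            # elements where mask is True

capped = vals.copy()             # copy, so vals itself is not modified
capped[capped > 3] = 2           # in-place replacement on the copy
print(mask)
print(selected, capped)
```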

<div class="alert alert-info">
    <h4>Task 1.2</h4>
</div>

Extract all odd numbers from `arr1`. <b>Hint</b>: think about the condition to use for data slicing.

In [59]:
# Your code goes here
arr1[(arr1%2) == 1]


array([1, 3])

## 1.3. Computation in NumPy

### Mathematical operations

NumPy provides a wide variety of straightforward mathematical operations on arrays.

In [62]:
arr5 = np.linspace(0, 1, 10)
print('arr1:\n',arr1)
print('arr5:\n',arr5)

arr1:
 [0 1 2 3 2 2 2 2 2 2]
arr5:
 [0.         0.11111111 0.22222222 0.33333333 0.44444444 0.55555556
 0.66666667 0.77777778 0.88888889 1.        ]


In [63]:
# Add items elementwise by using operator +.
arr1+arr5

array([0.        , 1.11111111, 2.22222222, 3.33333333, 2.44444444,
       2.55555556, 2.66666667, 2.77777778, 2.88888889, 3.        ])

In [64]:
# Multiply items elementwise by using method multiply.
np.multiply(arr5,10)

array([ 0.        ,  1.11111111,  2.22222222,  3.33333333,  4.44444444,
        5.55555556,  6.66666667,  7.77777778,  8.88888889, 10.        ])

In [65]:
# Return square-root of an array, elementwise, using method sqrt.
np.sqrt(arr1)

array([0.        , 1.        , 1.41421356, 1.73205081, 1.41421356,
       1.41421356, 1.41421356, 1.41421356, 1.41421356, 1.41421356])

In [66]:
# Array elements raised to powers elementwise.
np.power(arr1,3)

array([ 0,  1,  8, 27,  8,  8,  8,  8,  8,  8])
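The arithmetic operators are shorthand for the corresponding NumPy functions. A quick check on a made-up array `a`:

```python
import numpy as np

a = np.array([0, 1, 2, 3])   # made-up values

# Each operator applies elementwise, like its function counterpart
same_product = np.array_equal(a * 10, np.multiply(a, 10))
same_power = np.array_equal(a ** 3, np.power(a, 3))
same_sum = np.array_equal(a + a, np.add(a, a))
print(same_product, same_power, same_sum)
```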

<div class="alert alert-info">
    <h4>Task 1.3</h4>
</div> 

Given the array `N`, calculate and print: $2*N$, $N^2$, $N^8$, and $2^N$.<br>
<b>Hint</b>: find the suitable [mathematical operations](https://numpy.org/doc/stable/reference/routines.math.html#mathematical-functions) in NumPy.

In [70]:
N = np.arange(10)
# Your code goes here
print("original",N)
print(np.multiply(N,2))
print(np.power(N,2))
print(np.power(N,8))
print(np.power(2,N))


original [0 1 2 3 4 5 6 7 8 9]
[ 0  2  4  6  8 10 12 14 16 18]
[ 0  1  4  9 16 25 36 49 64 81]
[       0        1      256     6561    65536   390625  1679616  5764801
 16777216 43046721]
[  1   2   4   8  16  32  64 128 256 512]


### Statistical operations

NumPy offers powerful [statistical functions](https://numpy.org/doc/stable/reference/routines.statistics.html) for essential analysis of arrays.

In [71]:
# Sum all of the elements of the array
np.sum(arr1)

18

In [72]:
# Compute the mean of the array
np.mean(arr1)

1.8

In [73]:
# Compute the standard deviation of the array
np.std(arr1)

0.7483314773547883

In [74]:
# Return the maximal element in an array
np.max(arr1)

3

In [76]:
arr2

array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12]])

In [75]:
# Sum the array elements along a given axis using the `axis` parameter (axis=0 sums down each column)
np.sum(arr2,axis=0)

array([15, 18, 21, 24])
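With `axis=0` the sum runs down each column; `axis=1` sums across each row instead. A sketch on a hypothetical array `m` with the same contents as `arr2`:

```python
import numpy as np

m = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])  # same contents as arr2

col_sums = np.sum(m, axis=0)   # one value per column
row_sums = np.sum(m, axis=1)   # one value per row
row_means = m.mean(axis=1)     # the same axis argument works for other reductions
print(col_sums, row_sums, row_means)
```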

<div id="partone"><h2><center>Part 2: Basics of Pandas</center></h2></div>

NumPy mostly handles numerical values. <a href="https://pandas.pydata.org/docs/user_guide/index.html"><b>Pandas</b></a> is a newer library built on NumPy. It offers more flexible data analysis and manipulation of numeric, text, and heterogeneous data for machine learning and data science.

In [77]:
# We need to import pandas as pd
import pandas as pd

## 2.1. Pandas DataFrame Object

A <b><a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html">DataFrame</a></b> in Pandas can be thought of as a 2D container to support easy data manipulation, similar to a spreadsheet or table. It can be created by method `pd.DataFrame()`, with optional arguments `index` (row labels) and `columns` (column labels) as follows.

In [78]:
df = pd.DataFrame({'Name': ['Alice', 'Bob'], 
                   'Module': ['Chemistry', 'Math'], 
                   'Mark': [73, 68]},
                  index=[1, 2])
df

| | Name | Module | Mark |
|---|---|---|---|
| 1 | Alice | Chemistry | 73 |
| 2 | Bob | Math | 68 |
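The `columns` argument is useful when the data arrives as a list of rows rather than as a dictionary. A sketch assuming the same two records as above (the names `rows` and `df_rows` are hypothetical):

```python
import pandas as pd

rows = [['Alice', 'Chemistry', 73],
        ['Bob', 'Math', 68]]

# Column labels and row labels are supplied separately
df_rows = pd.DataFrame(rows, columns=['Name', 'Module', 'Mark'], index=[1, 2])
print(df_rows)
```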


A <b>DataFrame</b> has the attributes `values`, `index` and `columns`, which give access to the data, the row labels and the column labels.

In [79]:
df.values

array([['Alice', 'Chemistry', 73],
       ['Bob', 'Math', 68]], dtype=object)

In [80]:
df.index

Index([1, 2], dtype='int64')

In [81]:
df.columns

Index(['Name', 'Module', 'Mark'], dtype='object')

## 2.2. Reading & viewing data in Pandas

Pandas provides various functions to read, inspect and process data for machine learning and data science.

### Reading data

Pandas functions such as `read_csv()` or `read_excel()` import data files as a DataFrame.

In [82]:
dfemissions = pd.read_csv("https://raw.githubusercontent.com/plotly/datasets/master/Emissions%20Data.csv")
dfemissions

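| | Year | Country | Continent | Emission |
|---|---|---|---|---|
| 0 | 2008 | Aruba | South America | 24.750133 |
| 1 | 2009 | Aruba | South America | 24.876706 |
| 2 | 2010 | Aruba | South America | 24.182702 |
| 3 | 2011 | Aruba | South America | 23.922412 |
| 4 | 2008 | Andorra | Europe | 6.296125 |
| ... | ... | ... | ... | ... |
| 783 | 2011 | Zambia | Africa | 0.212450 |
| 784 | 2008 | Zimbabwe | Africa | 0.569255 |
| 785 | 2009 | Zimbabwe | Africa | 0.600521 |
| 786 | 2010 | Zimbabwe | Africa | 0.646073 |
| 787 | 2011 | Zimbabwe | Africa | 0.691698 |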
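`read_csv()` also accepts file-like objects, which is handy for trying it out without a network connection. The sketch below reads three rows copied from the output above out of an in-memory string; `io.StringIO` stands in for a real file, and the name `sample` is hypothetical:

```python
import io
import pandas as pd

# A few rows copied from the emissions data, held in memory instead of a file
csv_text = """Year,Country,Continent,Emission
2008,Aruba,South America,24.750133
2009,Aruba,South America,24.876706
2008,Andorra,Europe,6.296125
"""

sample = pd.read_csv(io.StringIO(csv_text))
print(sample.shape)
print(sample.dtypes)
```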


### Viewing data

In [83]:
# View the actual data inside the data structure DataFrame using attribute .values
dfemissions.values

array([[2008, 'Aruba', 'South America', 24.75013321],
       [2009, 'Aruba', 'South America', 24.87670585],
       [2010, 'Aruba', 'South America', 24.18270225],
       ...,
       [2009, 'Zimbabwe', 'Africa', 0.600521157],
       [2010, 'Zimbabwe', 'Africa', 0.646072745],
       [2011, 'Zimbabwe', 'Africa', 0.691697897]], dtype=object)

In [84]:
# View by default the first five samples using method .head()
dfemissions.head()

| | Year | Country | Continent | Emission |
|---|---|---|---|---|
| 0 | 2008 | Aruba | South America | 24.750133 |
| 1 | 2009 | Aruba | South America | 24.876706 |
| 2 | 2010 | Aruba | South America | 24.182702 |
| 3 | 2011 | Aruba | South America | 23.922412 |
| 4 | 2008 | Andorra | Europe | 6.296125 |


In [85]:
# View the last three samples using method .tail()
dfemissions.tail(3)

| | Year | Country | Continent | Emission |
|---|---|---|---|---|
| 785 | 2009 | Zimbabwe | Africa | 0.600521 |
| 786 | 2010 | Zimbabwe | Africa | 0.646073 |
| 787 | 2011 | Zimbabwe | Africa | 0.691698 |


In [86]:
# View the data type of each column using attribute .dtypes
dfemissions.dtypes

Year           int64
Country       object
Continent     object
Emission     float64
dtype: object

In [87]:
# Convert the `Year` column to `float64` using method .astype() (returns a new Series; dfemissions itself is unchanged)
dfemissions.Year.astype('float64')

0      2008.0
1      2009.0
2      2010.0
3      2011.0
4      2008.0
        ...  
783    2011.0
784    2008.0
785    2009.0
786    2010.0
787    2011.0
Name: Year, Length: 788, dtype: float64
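`astype()` returns a new Series and leaves the DataFrame unchanged. To keep the converted column, assign the result back. A sketch on a small hypothetical frame `small`:

```python
import pandas as pd

small = pd.DataFrame({'Year': [2008, 2009], 'Emission': [24.75, 24.88]})  # hypothetical data

converted = small['Year'].astype('float64')  # new Series; small is not modified
dtype_before = small['Year'].dtype

small['Year'] = converted                    # assign back to make the change stick
dtype_after = small['Year'].dtype
print(dtype_before, dtype_after)
```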

In [88]:
# Check the number of rows and columns of the data using attribute .shape
dfemissions.shape

(788, 4)

<div class="alert alert-info">
    <h4>Task 2.2</h4>
</div>

To import other file types, find a suitable function in the Pandas [IO tools description](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html) to read the Excel file at path ```data/data-combinatorial-challenge.xlsx``` and name the resulting object ```dfC```. Then show the number of columns in the DataFrame `dfC`.

In [93]:
path_to_file = 'data-combinatorial-challenge.xlsx'

# Your code goes here
dfC = pd.read_excel(path_to_file)
# .columns lists the column labels; len(dfC.columns) or dfC.shape[1] gives their number
dfC.columns



Index(['N', '2*N', 'N^2', 'N^8', '2^N', 'N!'], dtype='object')

### Descriptive statistics
Pandas provides various [functions](https://pandas.pydata.org/docs/user_guide/basics.html#descriptive-statistics) to easily find statistical information of the data.

In [94]:
# Calculate the mean of the data in the `Emission` column using method .mean()
dfemissions['Emission'].mean()

4.9083670030558375

In [95]:
# Calculate the standard deviation of the data in the `Emission` column
dfemissions['Emission'].std()

6.406295153046607

In [96]:
# Calculate the sum of the data in the `Emission` column
dfemissions['Emission'].sum()

3867.793198408

In [97]:
# Generates a high-level summary of the attributes of the DataFrame (numerical columns only)
dfemissions.describe()

| | Year | Emission |
|---|---|---|
| count | 788.0 | 788.0 |
| mean | 2009.5 | 4.908367 |
| std | 1.118744 | 6.406295 |
| min | 2008.0 | 0.020542 |
| 25% | 2008.75 | 0.62463 |
| 50% | 2009.5 | 2.590285 |
| 75% | 2010.25 | 6.704494 |
| max | 2011.0 | 48.60162 |


## 2.3. Indexing and selection in Pandas

### Data indexing

We can access data in DataFrame by either `dataframe[colname]` or an attribute ```dataframe.colname```.

In [98]:
# access a column by `dataFrame[colname]`
dfemissions['Continent']

0      South America
1      South America
2      South America
3      South America
4             Europe
           ...      
783           Africa
784           Africa
785           Africa
786           Africa
787           Africa
Name: Continent, Length: 788, dtype: object

In [99]:
# access a column by an attribute `dataFrame.colname`
dfemissions.Continent

0      South America
1      South America
2      South America
3      South America
4             Europe
           ...      
783           Africa
784           Africa
785           Africa
786           Africa
787           Africa
Name: Continent, Length: 788, dtype: object

We can select data with two indexing operators: ```iloc``` (position-based, using integer positions) and ```loc``` (label-based, using row and column labels).

In [100]:
# position-based selection: access the row at position 1 (the second row) in the DataFrame
dfemissions.iloc[1] 

Year                  2009
Country              Aruba
Continent    South America
Emission         24.876706
Name: 1, dtype: object

In [101]:
# access the column at position 2 (the third column) in the DataFrame
dfemissions.iloc[:,2]

0      South America
1      South America
2      South America
3      South America
4             Europe
           ...      
783           Africa
784           Africa
785           Africa
786           Africa
787           Africa
Name: Continent, Length: 788, dtype: object

In [102]:
# label-based selection: show each row of the columns 'Country' and 'Emission'
dfemissions.loc[:,['Country','Emission']]

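| | Country | Emission |
|---|---|---|
| 0 | Aruba | 24.750133 |
| 1 | Aruba | 24.876706 |
| 2 | Aruba | 24.182702 |
| 3 | Aruba | 23.922412 |
| 4 | Andorra | 6.296125 |
| ... | ... | ... |
| 783 | Zambia | 0.212450 |
| 784 | Zimbabwe | 0.569255 |
| 785 | Zimbabwe | 0.600521 |
| 786 | Zimbabwe | 0.646073 |
| 787 | Zimbabwe | 0.691698 |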
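The difference between the two operators is easiest to see when the row labels are not 0, 1, 2, and so on. In the sketch below, a hypothetical frame `marks` uses the row labels 10, 20 and 30:

```python
import pandas as pd

marks = pd.DataFrame({'Name': ['Alice', 'Bob', 'Carol'],
                      'Mark': [73, 68, 81]},
                     index=[10, 20, 30])   # hypothetical data and labels

by_position = marks.iloc[0]        # first row, whatever its label is
by_label = marks.loc[10]           # row whose label is 10 (the same row here)
first_two = marks.iloc[:2]         # positions 0 and 1; the end position is excluded
up_to_20 = marks.loc[:20]          # labels up to AND including 20
one_cell = marks.loc[30, 'Mark']   # row label 30, column label 'Mark'
print(first_two)
print(one_cell)
```

Note that a position slice excludes its end, while a label slice includes it.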


<div class="alert alert-info">
    <h4>Task 2.3</h4>
</div>

Is there an error in the code cell below? If so, how would you fix it?

In [104]:
dfemissions.iloc[:5,1]

0      Aruba
1      Aruba
2      Aruba
3      Aruba
4    Andorra
Name: Country, dtype: object

### Data slicing

In [105]:
# Select the rows at positions 5 to 9 (the 6th to 10th rows) of the data
dfemissions[5:10]

| | Year | Country | Continent | Emission |
|---|---|---|---|---|
| 5 | 2009 | Andorra | Europe | 6.049173 |
| 6 | 2010 | Andorra | Europe | 6.12477 |
| 7 | 2011 | Andorra | Europe | 5.968685 |
| 8 | 2008 | Afghanistan | Asia | 0.158962 |
| 9 | 2009 | Afghanistan | Asia | 0.249074 |


In [107]:
# Show the samples where the Continent is in 'countries', using label-based selection

countries = ["Oceania","Europe"] # list of selected continents

# dfemissions.loc[mask]: select the rows of dfemissions where the boolean mask is True
# dfemissions.Continent.isin(countries): boolean mask, True for rows whose Continent is in countries
dfemissions.loc[dfemissions.Continent.isin(countries)]

| | Year | Country | Continent | Emission |
|---|---|---|---|---|
| 4 | 2008 | Andorra | Europe | 6.296125 |
| 5 | 2009 | Andorra | Europe | 6.049173 |
| 6 | 2010 | Andorra | Europe | 6.124770 |
| 7 | 2011 | Andorra | Europe | 5.968685 |
| 16 | 2008 | Albania | Europe | 1.580113 |
| ... | ... | ... | ... | ... |
| 759 | 2011 | Vanuatu | Oceania | 0.591266 |
| 764 | 2008 | Samoa | Oceania | 1.039490 |
| 765 | 2009 | Samoa | Oceania | 1.072106 |
| 766 | 2010 | Samoa | Oceania | 1.103871 |


In [111]:
# Access the built-in documentation of functions using `?`
dfemissions.Continent.isin?

<div class="alert alert-info">
    <h4>Task 2.4</h4>
</div>

Show the first ten samples in `dfemissions` where `Emission` is higher than the mean emission value.<br>
<b>Hint</b>: break the task into subtasks. For example, calculate the mean value first, then do the conditional data slicing. 

In [134]:
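# Your code goes here.
meanEmission = dfemissions['Emission'].mean()
# Note: .loc[:10] slices by index label (inclusive), not by position, so it keeps
# only the matching rows labelled 0-10 (eight rows here); .head(10) would return
# the first ten matching rows.
dfemissions[dfemissions['Emission'] > meanEmission].loc[:10]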



| | Year | Country | Continent | Emission |
|---|---|---|---|---|
| 0 | 2008 | Aruba | South America | 24.750133 |
| 1 | 2009 | Aruba | South America | 24.876706 |
| 2 | 2010 | Aruba | South America | 24.182702 |
| 3 | 2011 | Aruba | South America | 23.922412 |
| 4 | 2008 | Andorra | Europe | 6.296125 |
| 5 | 2009 | Andorra | Europe | 6.049173 |
| 6 | 2010 | Andorra | Europe | 6.12477 |
| 7 | 2011 | Andorra | Europe | 5.968685 |


## 2.4. Grouping, sorting and mapping

We often need to group data by certain characteristics in order to analyse the subsets that matter to us.

### Grouping

The [```groupby()```](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) method returns a `GroupBy` object that holds information about the grouped data. Each group is a slice of the DataFrame containing only the rows whose values in the grouping column match.

In [135]:
# Count the number of rows (one per country per year) in each continent, i.e. data is grouped by continent
dfemissions.groupby('Continent')['Continent'].count()

Continent
Africa           212
Asia             180
Europe           180
North America    108
Oceania           56
South America     52
Name: Continent, dtype: int64

In [136]:
# Calculate the mean emission in each continent, i.e. data grouped by continent, then fetch the 'Emission' column
dfemissions.groupby('Continent')['Emission'].mean()

Continent
Africa           1.215859
Asia             7.154809
Europe           6.759568
North America    5.790549
Oceania          4.381733
South America    4.513206
Name: Emission, dtype: float64
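Because the data holds one row per country per year, `count()` counts rows rather than distinct countries. One way to count distinct countries is `nunique()`, and `agg()` can compute several statistics in one pass. A sketch on a few made-up rows (the frame `demo` is hypothetical):

```python
import pandas as pd

demo = pd.DataFrame({
    'Country':   ['A', 'A', 'B', 'C', 'C'],
    'Continent': ['Europe', 'Europe', 'Europe', 'Asia', 'Asia'],
    'Emission':  [6.0, 8.0, 1.0, 0.5, 1.5],
})  # made-up rows

rows_per_continent = demo.groupby('Continent')['Country'].count()         # counts rows
countries_per_continent = demo.groupby('Continent')['Country'].nunique()  # counts distinct countries
summary = demo.groupby('Continent')['Emission'].agg(['mean', 'max'])      # several statistics at once
print(countries_per_continent)
print(summary)
```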

### Sorting

The [```sort_values()```](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html) method sorts the data by the values of the column(s) given in `by`. The `ascending` parameter is `True` by default.

To sort in descending order by `Emission`:

In [137]:
dfemissions = dfemissions.sort_values(by='Emission', ascending=False)
dfemissions.head()

| | Year | Country | Continent | Emission |
|---|---|---|---|---|
| 592 | 2008 | Qatar | Asia | 48.60162 |
| 593 | 2009 | Qatar | Asia | 44.836401 |
| 595 | 2011 | Qatar | Asia | 44.018926 |
| 594 | 2010 | Qatar | Asia | 42.639076 |
| 710 | 2010 | Trinidad And Tobago | North America | 38.337841 |


<div class="alert alert-info">
    <h4>Task 2.5</h4>
</div>

Sort the data `dfemissions` in descending order by `Continent` and then by `Country`.

In [139]:
# Your code goes here.
dfemissions.sort_values(by='Continent',ascending=False)

| | Year | Country | Continent | Emission |
|---|---|---|---|---|
| 735 | 2011 | Uruguay | South America | 2.296201 |
| 586 | 2010 | Paraguay | South America | 0.817267 |
| 733 | 2009 | Uruguay | South America | 2.401221 |
| 102 | 2010 | Brazil | South America | 2.113415 |
| 133 | 2009 | Chile | South America | 3.969434 |
| ... | ... | ... | ... | ... |
| 191 | 2011 | Djibouti | Africa | 0.561941 |
| 645 | 2009 | Sao Tome And Principe | Africa | 0.548309 |
| 58 | 2010 | Benin | Africa | 0.541771 |
| 151 | 2011 | Rep Of Congo | Africa | 0.538098 |


In [140]:
dfemissions.sort_values(by='Country',ascending=False)

| | Year | Country | Continent | Emission |
|---|---|---|---|---|
| 784 | 2008 | Zimbabwe | Africa | 0.569255 |
| 785 | 2009 | Zimbabwe | Africa | 0.600521 |
| 786 | 2010 | Zimbabwe | Africa | 0.646073 |
| 787 | 2011 | Zimbabwe | Africa | 0.691698 |
| 782 | 2010 | Zambia | Africa | 0.192079 |
| ... | ... | ... | ... | ... |
| 18 | 2010 | Albania | Europe | 1.515632 |
| 9 | 2009 | Afghanistan | Asia | 0.249074 |
| 11 | 2011 | Afghanistan | Asia | 0.425262 |
| 8 | 2008 | Afghanistan | Asia | 0.158962 |
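`sort_values()` also accepts a list of column names, which sorts by the first column and breaks ties with the next. The two separate calls above each sort by a single column only. A sketch on made-up rows (the frame `demo` is hypothetical), sorting by `Continent` and then `Country`, both in descending order:

```python
import pandas as pd

demo = pd.DataFrame({
    'Country':   ['Chad', 'Peru', 'Mali', 'Chile'],
    'Continent': ['Africa', 'South America', 'Africa', 'South America'],
})  # made-up rows

# Sort by Continent first, then by Country within each continent
ordered = demo.sort_values(by=['Continent', 'Country'], ascending=False)
print(ordered)

# ascending can also be a list, with one flag per column
mixed = demo.sort_values(by=['Continent', 'Country'], ascending=[False, True])
print(mixed)
```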


<div class="alert alert-warning">
    <h4>Optional</h4>
</div>

To extend your knowledge and skills in NumPy and Pandas, here are some additional resources:
- NumPy
    - [Standard data types in NumPy](https://numpy.org/doc/stable/user/basics.types.html)
    - [NumPy quickstart](https://numpy.org/doc/stable/user/quickstart.html#numpy-quickstart)
    - [Introduction to NumPy](https://jakevdp.github.io/PythonDataScienceHandbook/02.00-introduction-to-numpy.html) in [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/)
- Pandas
    - [Learn Pandas Tutorial](https://www.kaggle.com/learn/pandas) in Kaggle (<i>highly recommended</i>)
    - YouTube: <a href="https://www.youtube.com/watch?v=dcqPhpY7tWk">What's Pandas?</a>
    - YouTube: <a href="https://www.youtube.com/watch?v=iGFdh6_FePU">Pandas in 10 minutes</a>
    - [Data Manipulation with Pandas](https://jakevdp.github.io/PythonDataScienceHandbook/03.00-introduction-to-pandas.html) in [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/)
    
<div class="alert alert-success">
    <h2>🍰 End</h2> 
</div>