# Week 8: NumPy, Pandas, Matplotlib

During our course, we will work in Google Colab. When we work with data in Python, we need additional modules and libraries. In this seminar we will use three of them: **NumPy**, **Pandas**, and **Matplotlib**.

## 1. Introduction to `NumPy`

Today we will get acquainted with the `NumPy` library (short for *Numerical Python*), which is widely used in data science. We won't call it directly very often, but it provides the array data type that other libraries and tasks rely on. The library is preinstalled in Google Colab, so we only need to import it. Let's import it under the shortened name `np`, which saves us from typing the full name everywhere. `np` is the generally accepted alias for `NumPy`, and you will see it throughout the documentation and official tutorials.



*   Here you can find documentation: https://numpy.org/doc/stable/
*   NumPy: the basics: https://numpy.org/doc/stable/user/absolute_beginners.html



In [None]:
import numpy as np # importing the library with the shortened name

Let's recall what lists in Python are.

In [None]:
x = [2, 4, 5] # x and y are lists with integers
y = [10, 4, 2]
print(x + y)

If we add these two lists, there is no error, but the result is not what we wanted: for lists, the `+` operator concatenates.

Suppose that `x` is the first column in our table and `y` is the second one, and we need a new variable that contains the sum of the two columns. This is awkward to do with lists, so we need a new data type: the NumPy array.
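
For comparison, here is a minimal sketch of the element-wise sum written with plain lists only. The name `column_sum` is ours and is used just for illustration.

```python
x = [2, 4, 5]
y = [10, 4, 2]

# Pair up the elements with zip and add each pair by hand
column_sum = [a + b for a, b in zip(x, y)]
print(column_sum)  # [12, 8, 7]
```

It works, but every new operation needs its own loop, which is the boilerplate NumPy arrays let us avoid.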



In [None]:
x_np = np.array(x) # we can initialize NumPy arrays from Python lists
y_np = np.array(y)

In [None]:
print(x_np) # looks like a list but it is a NumPy array

In [None]:
print(type(x_np))

The main NumPy object is `ndarray`, an n-dimensional array.

In [None]:
print(x_np + y_np) # adding two arrays

So now we get a third column that contains the sum of the two previous columns. This data type allows us to perform **element-wise** operations.

In [None]:
print(x_np * 2) # we can apply different operations like addition, multiplication, division and so on.
print(x_np ** 2)

We could have obtained the same result by looping over the lists and adding the elements pair by pair, but NumPy is usually much **faster** on large data.
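
To get a rough feeling for the speed difference, we can time both approaches. This is only a sketch: the array size and variable names are our own choices, and the exact timings depend on the machine.

```python
import time
import numpy as np

n = 200_000
a_list = list(range(n))
b_list = list(range(n))
a_np = np.arange(n)
b_np = np.arange(n)

# Element-wise sum with a plain Python loop (list comprehension)
start = time.perf_counter()
loop_sum = [a + b for a, b in zip(a_list, b_list)]
loop_seconds = time.perf_counter() - start

# The same sum as a single NumPy operation
start = time.perf_counter()
np_sum = a_np + b_np
np_seconds = time.perf_counter() - start

print(f'list loop: {loop_seconds:.4f} s')
print(f'NumPy:     {np_seconds:.4f} s')
```

Both approaches produce the same numbers; only the time needed differs.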

Assume that we have a list with different types of data: a string, an integer, a floating-point number and a boolean. Let's try to convert it to a NumPy array.

In [None]:
z = ['cat', 1, 2.4, True]
print(z)

z_np = np.array(z)
print(z_np) # all elements become strings
print(type(z_np))
print(z_np.dtype)

print(type(z_np[2])) # just like with lists, we can refer to an element

If the objects in an array have different types, they are all converted to one common data type. If there is a string in the array, all other objects become strings.

In [None]:
z_np = np.array([1, 2.4, True]) # without the string, all objects become floating-point numbers
print(z_np)
print(z_np.dtype)

In [None]:
z_np = np.array([1, True]) # without the floating-point number, all objects become integers
print(z_np)
print(z_np.dtype)

So, a NumPy array is a data structure that stores a set of elements of **the same type**: either only integers, or floating-point numbers, or strings, or boolean values.
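
If we want to choose the type ourselves instead of letting NumPy infer it, a small sketch could look like this. It uses the `dtype` argument of `np.array` and the `astype` method; the variable names are illustrative.

```python
import numpy as np

# Ask for floats at creation time, even though the inputs are integers
marks = np.array([1, 2, 3], dtype=float)
print(marks, marks.dtype)

# astype returns a converted copy; converting floats to int drops the fractional part
as_int = np.array([1.9, 2.1, 3.5]).astype(int)
print(as_int)  # [1 2 3]
```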

A NumPy array differs from a Python list:


*   If we add up two arrays, we get an elementwise addition, not a concatenation.
*   We can apply various operations to arrays: addition, exponentiation, and so on.
*   Unlike a list, which can store objects of different types, a NumPy array can only contain objects of one type.



### Filtering data

Imagine that the `age` variable contains the ages of respondents and we need to determine which respondents are 18 or older.

In [None]:
age = np.array([18, 20, 43, 23, 14, 52, 12])

In [None]:
print(age >= 18)

Using a comparison operator, we get a boolean NumPy array that describes our data. Comparison operations are also carried out element-wise.

If we need to check two conditions at once, we combine them with a logical operator. For example, let's identify the respondents who are younger than 18 or aged 35 and older.

In [None]:
print(age < 18 or age >=35)

We get an error here because the standard Python logical operators (`or`, `and`) expect a single True/False value and do not work element-wise on arrays. Instead, we use the operators `|` (element-wise OR) and `&` (element-wise AND), the same symbols that denote union and intersection of sets.

In [None]:
print(age < 18 | age >=35) # union of sets

We run into an error again because `|` has higher precedence than the comparison operators, so Python first evaluates `18 | age`. Thus, we need to enclose each comparison in round brackets.

In [None]:
print((age < 18) | (age >=35)) # union of sets

Now it works! We got an array of boolean values.

In [None]:
age[(age < 18) | (age >=35)] # filtering our array by condition

We can use conditions to filter the data in an array. Later this will be helpful for filtering tables.
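
The `&` operator works the same way as `|`. Here is a short sketch that keeps respondents aged from 18 up to (but not including) 35; the name `young_adults` is ours.

```python
import numpy as np

age = np.array([18, 20, 43, 23, 14, 52, 12])

# Both conditions must hold; each comparison needs its own brackets
young_adults = (age >= 18) & (age < 35)

print(age[young_adults])   # [18 20 23]
print(young_adults.sum())  # True counts as 1, so this prints 3
```

Summing a boolean array is a handy way to count how many elements satisfy a condition.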

In [None]:
# To negate (NOT) a boolean array, use the "~" symbol

print((age < 18) | (age >=35)) # union of sets

print(~((age < 18) | (age >=35))) # not of the union of sets

### Important function - `np.where`

In [None]:
a = np.array([2, 4, 5])
np.where(a > 3) # tell me where in a, entries are > 3

We get the indexes of the elements that are > 3, returned as a tuple with one array of indexes per dimension.

In [None]:
a[np.where(a > 3)]

Now we get the elements that are >3.
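
`np.where` can also take three arguments: a condition, a value to use where the condition is true, and a value to use where it is false. A minimal sketch, with labels chosen only for illustration:

```python
import numpy as np

age = np.array([18, 20, 43, 23, 14, 52, 12])

# For every element: 'adult' if the condition holds, otherwise 'minor'
group = np.where(age >= 18, 'adult', 'minor')
print(group)
```

The result has the same length as `age`, so it can later serve as a new column in a table.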

### N-dimensional array

We have worked with one-dimensional objects. But an array can also be a multi-dimensional object. In this course, we will work with 2D arrays.

A vector is an array with one dimension (there’s no difference between row and column vectors), while a matrix is an array with two dimensions.

A table is a two-dimensional object, it consists of columns and rows.

In [None]:
A = np.array([[2, 4, 5], [10, 4, 2]]) # an array with two lists
print(A)

In [None]:
print(type(A))

`A` is a two-dimensional matrix. The type is still a NumPy array.

We will most often work with two-dimensional arrays. A two-dimensional array can be thought of as a matrix or a table.

### Array characteristics

We can look at the attributes of the matrix.

`ndarray.shape` will display a tuple of integers that indicate the number of elements stored along each dimension of the array. If, for example, you have a 2-D array with 2 rows and 3 columns, the shape of your array is (2, 3).

`ndarray.ndim` will tell you the number of axes, or dimensions, of the array.

`ndarray.size` will tell you the total number of elements of the array. This is the product of the elements of the array’s shape.



They are used without round brackets because they are not methods but attributes, that is, characteristics of the object.

In [None]:
print(A.shape)

In [None]:
print(A.ndim)

In [None]:
print(A.size)

### Working with array elements

If we need to refer to the elements of an array, then this operation will be similar to working with nested lists.

In [None]:
A

In [None]:
A[0] # the first array in A

For instance, let's take the element in the first row and the second column.

In [None]:
A[0][1] # the second element in the first list

In [None]:
print(A[0][1]) # double indexing
print(A[0,1]) # same

The second method does not work with lists, but it is quite convenient to use it for matrices.

The row index comes first, then the column index.

The colon `:` is used for slicing. On its own it means "everything along this axis", so this is how we get all the numbers in the first column:

In [None]:
print(A[:,0]) # only the first element from each row
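
A few more slicing patterns on the same matrix, as a small sketch:

```python
import numpy as np

A = np.array([[2, 4, 5], [10, 4, 2]])

second_row = A[1, :]     # all columns of the second row
last_column = A[:, -1]   # negative indexes count from the end, as with lists
right_part = A[:, 1:]    # every row, columns from index 1 onwards

print(second_row)   # [10  4  2]
print(last_column)  # [5 2]
print(right_part)
```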

## 2. Introduction to `Pandas`

Pandas is a library for working with tabular data in Python. The name is often explained as *Python Data Analysis Library*; according to other sources, it is derived from the term *panel data*. Pandas offers data structures and operations for manipulating numerical tables and time series. For many tasks it can replace Excel or Google Sheets.

*   Documentation: https://pandas.pydata.org/
*   10 minutes intro: https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html
*   Pandas Cheat-Sheet: https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf


In [None]:
import pandas as pd # importing the library with a shortened name

In [None]:
df = pd.DataFrame() # creating an empty dataframe as df

In [None]:
df # this is a created dataframe

In [None]:
type(df)

A `DataFrame` object is a data table. It is a 2-dimensional data structure that can store data of different types (including characters, integers, floating-point values, categorical data and more). It is similar to a spreadsheet or a table.

The data structure has labeled axes (rows and columns).

![picture](https://pandas.pydata.org/docs/_images/01_table_dataframe.svg)

Let's fill the dataframe with data: a table with marks.

In [None]:
df['STATS'] = [10, 8, 9] # create a column STATS and fill it with data
df

Hint: the first column contains the index. We don't need to set these values; they appear automatically, and numbering starts from zero.

Now we add another column and fill it with marks.

In [None]:
df['PROG'] = [9, 7, 10]
df

In [None]:
print(df) # outputs as text

In [None]:
display(df) # outputs our table interactively
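
Instead of adding columns one by one, the same table can be built in a single step from a dictionary that maps column names to lists. A minimal sketch, where the name `marks` is ours:

```python
import pandas as pd

# Keys become column names, lists become the column values
marks = pd.DataFrame({'STATS': [10, 8, 9], 'PROG': [9, 7, 10]})

print(marks)
print(marks.shape)  # (3, 2)
```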

### Working with columns

The syntax of dataframes is dictionary-like: just as we look up a value in a dictionary by its key, we get a column from a dataframe by its name.

In [None]:
df['STATS'] # get a column by its name

In [None]:
type(df['STATS']) # the type of a column is a Pandas Series

Pandas Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating-point numbers, Python objects, etc.).

In [None]:
df[['STATS','PROG']] # get two columns

In [None]:
df[['PROG','STATS']] # the columns come out in the order we list them

In [None]:
type(df[['PROG','STATS']])

It is not a Series object, because the result is two-dimensional: it is a DataFrame.
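
The number of square brackets matters even for a single column. The sketch below, using an illustrative table called `marks`, compares the two forms:

```python
import pandas as pd

marks = pd.DataFrame({'STATS': [10, 8, 9], 'PROG': [9, 7, 10]})

as_series = marks['STATS']    # single brackets: a one-dimensional Series
as_frame = marks[['STATS']]   # double brackets: a DataFrame with one column

print(type(as_series).__name__, as_series.shape)  # Series (3,)
print(type(as_frame).__name__, as_frame.shape)    # DataFrame (3, 1)
```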

![pic](https://www.datasciencemadesimple.com/wp-content/uploads/2020/05/create-series-in-python-pandas-0.png)

### Attributes of a column `name`, `dtype`, `values`

Now we will look at what attributes a column has. These attributes indicate the characteristics of the column.

In [None]:
df['STATS'].name # get a name of a column

In [None]:
df['STATS'].dtype # get a type of elements in a column

Note: don't pay much attention to the number after `int`. The type is an integer; the number is how many bits are used to store each value.

In [None]:
df['STATS'].values # here we can see NumPy array

In [None]:
type(df['STATS'].values) # definitely NumPy array

One more useful attribute is `.shape`. It gives the shape of a dataframe as (number of rows, number of columns).

In [None]:
df.shape

In [None]:
df.info() # gives information about a dataframe

In [None]:
df.describe()   # Some statistics about the dataframe

Now let's compute the students' average marks by adding up the two columns and dividing by 2.

AVERAGE = (First column + Second column) / 2

AVERAGE is a Series, just like the other columns.

In [None]:
df['AVERAGE'] = (df['STATS'] + df['PROG']) / 2
df # a new dataframe with three columns
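
With more than two subjects, writing the sum by hand becomes tedious. One possible alternative is the `mean` method with `axis=1`, which averages across the columns of each row. A sketch on an illustrative table called `marks`:

```python
import pandas as pd

marks = pd.DataFrame({'STATS': [10, 8, 9], 'PROG': [9, 7, 10]})

# axis=1 means "compute the mean across columns, separately for each row"
marks['AVERAGE'] = marks[['STATS', 'PROG']].mean(axis=1)
print(marks)
```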

Let's overwrite the variable with the new order of columns.

In [None]:
df = df[['AVERAGE','STATS', 'PROG']]
df

*   We got acquainted with the Pandas data types: DataFrame and Series.
*   Now we understand how to create a new dataframe and fill it with data.
*   We know how to work with columns of a dataframe.



Next, we will learn to work with rows and to get values by index.

### Other important attributes: `columns` and `index`

We continue to work with our dataframe with marks, now let's see how we can select different data from it.

`.columns` is an attribute that shows the column labels of the DataFrame.

In [None]:
df.columns # an object with the names of all columns

In [None]:
df.index # indexes of our dataframe

We can use the `.columns` attribute to rename the columns:

In [None]:
df.columns = ['AV', 'S', 'P'] # rename columns with shortened names: AV, S, P

In [None]:
df.columns

In [None]:
df

### How to take a row from a dataframe?

Can we use `df[0]`?

In [None]:
df[0]

We get an error here, because square brackets on a dataframe look up a column by name, and there is no column named `0`.

To take a row from a dataframe, we use the `.iloc` indexer, which stands for *integer location*.

In [None]:
df.iloc[0] # first row in our dataframe

This way we can get the marks of the first student.

The values are shown as floats because a row is returned as a Series, which stores a single data type. Since the average is a float, the integer marks are converted to floats too.

### Difference between .loc and .iloc

`.loc` accesses a row by its label rather than by its integer position.

In [None]:
df.loc[0]

We get the same result as with `df.iloc[0]` because the automatically generated labels coincide with the positions.

Now let's give the columns and rows new names; for the rows, this means changing the index.

In [None]:
df = df[['S', 'P', 'AV']] # reorder the columns so that they match the new names
df.columns = ['STATS', 'PYTHON', 'AVERAGE'] # names of columns

In [None]:
df.index = ['Mark', 'Alice', 'Alex'] # names of rows

In [None]:
df

In [None]:
df.iloc[0] # get a row by index

In [None]:
df.loc[0]

We get an error here because there is no longer a row labeled 0.

In [None]:
df.loc['Mark'] # get a row by name

Let's get Alice's mark for Statistics using `loc` and `iloc`.

In [None]:
df

In [None]:
df.iloc[1,0] # positions of the needed mark: row 1, column 0

In [None]:
df.loc['Alice','STATS'] # labels of the needed mark: row label, column label

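One more way to get a mark is to select the column first and then the row.

In [None]:
df['STATS'].iloc[1] # by position within the column; plain [1] on a labeled Series is deprecated in recent pandas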

In [None]:
df['STATS']['Alice']
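
The difference between `.loc` and `.iloc` also shows up in slicing: `.iloc` excludes the end position, as Python lists do, while `.loc` includes the end label. A small sketch on an illustrative table called `marks`:

```python
import pandas as pd

marks = pd.DataFrame(
    {'STATS': [10, 8, 9], 'PYTHON': [9, 7, 10]},
    index=['Mark', 'Alice', 'Alex'],
)

by_position = marks.iloc[0:2]          # positions 0 and 1; position 2 is excluded
by_label = marks.loc['Mark':'Alice']   # from 'Mark' to 'Alice', 'Alice' included

print(by_position)
print(by_label)
```

Both slices return the rows for Mark and Alice.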

### Working with datasets

We learned how to work with a dataframe, namely with its columns and rows.

Of course, most often we work with ready-made datasets. So now we will practice with some of them: loading them and using a few methods and attributes.

Datasets are usually stored in `.xls`/`.xlsx` format (for working in Excel) or, more often, in `.csv` format (comma-separated values). Despite the name, the values in a csv file can also be separated with a semicolon, a tab and so on.

First, we need to upload the datasets to Google Colab. Click on the folder icon on the left, then drag the files from your computer into the panel that opens. Here we will use three files: `coffee_stats.xlsx`, `coffee_stats_2.csv`, `students.csv`. You can open the csv files in a text editor to see what is inside.

To load a file in Python, we use `pd.read_excel('our_table')` or `pd.read_csv('our_table')`, depending on the format. These functions read data from an Excel or csv file and convert it to a `pandas.DataFrame`.

In [None]:
df_excel = pd.read_excel('coffee_stats.xlsx') # loading the Excel file

In [None]:
df_excel

The dataset is not displayed in full: only the first and the last rows are shown, so even a large table is displayed quickly.

In [None]:
df_excel.head() # this method returns the first five rows

In [None]:
df_excel.head(6) # this method returns the first six rows

By default, the method returns 5 rows, but we can ask for any number. NaN stands for Not a Number; here it marks an empty cell.

Such cells are called missing values.
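
To see how many missing values each column has, one common approach is to combine `isna()` with `sum()`. The sketch below uses a small made-up table, not the seminar dataset:

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value in each column
coffee = pd.DataFrame({
    'cups': [2, np.nan, 1],
    'city': ['Oslo', 'Rome', None],
})

print(coffee.isna())        # True where a value is missing
print(coffee.isna().sum())  # number of missing values per column
```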

In [None]:
df_excel.tail() # this method returns the last five rows

In [None]:
df_excel.tail(7)

In [None]:
df_csv = pd.read_csv('students.csv') # loading the csv file
df_csv.head()

In [None]:
df_csv_sep = pd.read_csv('coffee_stats_2.csv')
df_csv_sep.head()

Here something went wrong. The issue is that our data is separated with semicolons `;`. In this case, we need to add `sep=';'`, which sets the separator to a semicolon. In the next cell we also pass `header=None`, which tells pandas that the file has no header row.

In [None]:
df_csv_sep = pd.read_csv('coffee_stats_2.csv', sep=';', header=None)
df_csv_sep.head()

Now the dataframe has the right layout, but the column names are wrong: because of `header=None`, the header row is treated as data. To fix this, we need to change the `header` parameter.

If you are unsure about a parameter, check the documentation: [link](https://pandas.pydata.org/pandas-docs/dev/reference/api/pandas.read_csv.html#pandas.read_csv). There you can find that `header=0` takes the column names from the first row of the file; this is also what pandas does by default when `header` is not given.

In [None]:
df_csv_sep = pd.read_csv('coffee_stats_2.csv',
                     sep=';', header=0)
df_csv_sep.head()
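
The effect of `sep` and `header` can be reproduced without any file by reading from a string. This is a sketch with made-up content standing in for a semicolon-separated file; `io.StringIO` makes the string behave like an open file.

```python
import io
import pandas as pd

# Made-up stand-in for the contents of a semicolon-separated file
raw = 'name;cups\nAnn;2\nBob;3\n'

wrong = pd.read_csv(io.StringIO(raw))                            # default separator is a comma
no_header = pd.read_csv(io.StringIO(raw), sep=';', header=None)  # header row is read as data
right = pd.read_csv(io.StringIO(raw), sep=';')                   # first row gives the column names

print(wrong.columns.tolist())      # ['name;cups']
print(no_header.columns.tolist())  # [0, 1]
print(right.columns.tolist())      # ['name', 'cups']
```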

## 3. Introduction to `Matplotlib`

In [None]:
import numpy as np
import matplotlib.pyplot as plt

### Scatter plot

`plt.scatter()` draws a scatter plot: it takes the X and Y coordinates of the points and draws each point on the plane.

In [None]:
X_data = np.random.random(50) * 100
y_data = np.random.random(50) * 100

plt.scatter(X_data, y_data)  
plt.show()

We can change the color of the points with the `c` argument.

In [None]:
plt.scatter(X_data, y_data, c='orange')  
plt.show()

A color can also be given as a hex code. `#000000` is black.

In [None]:
plt.scatter(X_data, y_data, c='#000000')  
plt.show()

We can change the marker type with the `marker` argument.

The `s` argument changes the size of the marker.

In [None]:
plt.scatter(X_data, y_data, c='red', s=100, marker='*')  
plt.show()

### Line plot

If you need to show how consecutive points are connected, for example in time series data, you can use a line plot.

In [None]:
years = [2000 + x for x in range(24)]
weights = [67, 67, 70, 69, 74, 75, 75, 80, 82, 80, 79, 75,
           76, 75, 72, 79, 80, 80, 81, 80, 83, 82, 80, 82]

weights_ticks = list(range(65, 85, 2))

plt.plot(years, weights, c='g', lw=4, linestyle="--")
plt.title('Weight change with aging', fontsize=15, fontname='Times New Roman')
plt.xlabel('years')
plt.ylabel('weights')
plt.yticks(weights_ticks, [f'{x} kg' for x in weights_ticks])
plt.legend(['Aging'])
plt.show()

In [None]:
# Let's make a legend

weights_1 = [67, 67, 70, 69, 74, 75]
weights_2 = [100, 115, 110, 111, 108, 112]
weights_3 = [80, 85, 87, 83, 90, 91]

plt.plot(weights_1, label='Person 1')
plt.plot(weights_2, label='Person 2')
plt.plot(weights_3, label='Person 3')
plt.legend(loc='upper left')
plt.show()

### Bar plot

To visualize categorical data, we use a bar plot.

In [None]:
languages = ['Russian', 'English', 'Persian', 'Arabic', 'Chinese']
people = [100, 150, 80, 90, 120]

plt.bar(languages, people, color='orange', width=0.8, edgecolor='black', lw=2)
plt.show()

### Histogram

In [None]:
ages = np.random.normal(20, 1.5, 1000)

plt.hist(ages, bins=20, color='orange', edgecolor='black')
plt.show()

### Pie chart

In [None]:
languages = ['Russian', 'English', 'Persian', 'Arabic', 'Chinese']
people = [100, 300, 80, 20, 120]

explodes=[0, 0, 0, 0.12, 0]

plt.pie(people, labels=languages, explode=explodes, autopct='%.2f%%',
        pctdistance=.75)
plt.show()

### Box plot

In [None]:
heights = np.random.normal(172, 8, 300)

plt.boxplot(heights)
plt.show()

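In this plot, the orange line is the median, which is the second quartile.

The first and third quartiles are the bottom and top edges of the box.

The whiskers extend to the smallest and largest values that are not outliers (by default, values within 1.5 times the interquartile range from the box).

The individual dots beyond the whiskers are outliers.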
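
The numbers behind a box plot can be computed by hand with NumPy. The sketch below assumes the default Matplotlib rule of 1.5 times the interquartile range for the whiskers; the seed and the variable names are our own choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed so the example is reproducible
heights = rng.normal(172, 8, 300)

# Quartiles: 25th, 50th (median) and 75th percentiles
q1, median, q3 = np.percentile(heights, [25, 50, 75])
iqr = q3 - q1  # interquartile range, the height of the box

# Points beyond these limits are drawn as separate dots
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = heights[(heights < lower_limit) | (heights > upper_limit)]

print(f'Q1 = {q1:.1f}, median = {median:.1f}, Q3 = {q3:.1f}')
print(f'number of outliers: {outliers.size}')
```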