# NCRM April 2024 - Intro to Python for Data Analysis
### Session 5: Numpy and Pandas

![image.png](attachment:image.png)

## Lewys Brace
#### l.brace@exeter.ac.uk

## Numpy

NumPy is the most foundational package for numerical computing in Python.

If you are going to work on data analysis or machine learning projects, then having a solid understanding of NumPy is nearly mandatory.

Indeed, many other libraries, such as pandas and scikit-learn, use NumPy’s array objects as the lingua franca for data exchange.

One reason NumPy is so important for numerical computing is that it is designed to work efficiently with large arrays of data. Among other things, it does this by:

1. Storing data internally in a contiguous block of memory, independent of other built-in Python objects.
2. Performing complex computations on entire arrays without the need to write for-loops.

### What you’ll find in NumPy

- **ndarray**: an efficient multidimensional array providing fast array-orientated arithmetic operations and flexible broadcasting capabilities.

- Mathematical functions for fast operations on entire arrays of data without having to write loops.

- Tools for reading/writing array data to disk and working with memory-mapped files.

- Linear algebra, random number generation, and Fourier transform capabilities.

- A C API for connecting NumPy with libraries written in C, C++, and FORTRAN. This is why Python is the language of choice for wrapping legacy codebases.

### Importing NumPy

We first need to import NumPy into our working environment, which we do using the import statement.

You'll also notice that we add "as np" afterwards. This "as np" allows us to type "np" when we call a NumPy function instead of "numpy"; e.g. **np.array** instead of **numpy.array**. In principle, you could use anything as the shorthand, e.g. "import numpy as nump", but "np" is the standard convention.

In [None]:
import numpy as np

### The NumPy ndarray: A multi-dimensional array object

The NumPy ndarray object is a fast and flexible container for large data sets in Python. It is a bit like a Python list, but it behaves very differently in some important ways.

### Ndarray vs. lists

By now, you are familiar with Python lists and how incredibly useful they are.

So, you may be asking yourself:

> "I can store numbers and other objects in a Python list and do all sorts of computations and manipulations through list comprehensions, for-loops etc. What do I need a NumPy array for?"

There are some very significant advantages to using NumPy arrays over lists when working with numerical data.

### Creating a NumPy array

To understand these advantages, let's create an array.

There are many ways to create a NumPy array. One of the most common is to create one from a list by passing the list to the np.array() function.

In [None]:
list1 = [0,1,2,3,4]
arr = np.array(list1)
print(type(arr))
print(arr)

### Differences between lists and ndarrays

The key difference between an array and a list is that arrays are designed to handle vectorised operations, while Python lists are not.

That means that, if you apply an operation or function to an array, it is performed on every item in the array, rather than on the array object as a whole.

A second difference is that, once a NumPy array is created, you cannot increase its size. To do so, you have to create a new array.

For example, adding 2 to an array adds 2 to every element:

In [None]:
list1 = [0,1,2,3,4]
arr = np.array(list1)
print("The array: ", arr)
arr = arr+2
print("The array again: ", arr)
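To make both differences concrete, here is a small sketch that runs the same operators on a list and on an array, and then uses np.append() to "grow" an array. The variable name `longer` is just made up for this example.

```python
import numpy as np

list1 = [0, 1, 2, 3, 4]
arr = np.array(list1)

# With a list, * repeats the list and + only joins lists together.
print(list1 * 2)      # [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
print(list1 + [2])    # [0, 1, 2, 3, 4, 2]

# With an array, the same operators work element by element.
print(arr * 2)        # [0 2 4 6 8]
print(arr + 2)        # [2 3 4 5 6]

# np.append() does not grow arr in place; it returns a new, longer array.
longer = np.append(arr, 5)
print(arr.size, longer.size)  # 5 6
```

Note that `arr` still has five elements at the end; only the new array `longer` has six.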

### Create a 2D array from a list of lists

You can convert a list of lists into a 2D array through the use of the np.array() function. When doing so, NumPy treats each of the sub-lists as a row in the output array, meaning that the first elements in each sub-list, for example, will form the first column.

In [None]:
list2 = [[0,1,2], [3,4,5,], [6,7,8]]
print("List2: ", list2)
arr2 = np.array(list2)
print("Array2: ", arr2)
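As a quick check of the row and column layout described above, the sketch below pulls out the first row and the first column of the same array. The names `first_row` and `first_column` are only for illustration.

```python
import numpy as np

arr2 = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])

first_row = arr2[0]        # the first sub-list
first_column = arr2[:, 0]  # the first element of every sub-list

print(first_row)     # [0 1 2]
print(first_column)  # [0 3 6]
```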

### The dtype argument

You can specify the data type of an array (whether its elements are integers, floats, etc.) by using the dtype argument of np.array().

In [None]:
list2 = [[0,1,2], [3,4,5,], [6,7,8]]
print("List2: ", list2)
arr3 = np.array(list2, dtype="float")
print("Array3: ", arr3)

### The astype method

You can also convert a pre-existing array with elements of one particular type to an array with elements of a different type using the .astype() method.

In [None]:
print("Array3: ", arr3)
arr3_int = arr3.astype('int')
print(arr3_int)
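One thing to be aware of is that converting floats to integers drops the decimal part rather than rounding it. A minimal sketch, using made-up values:

```python
import numpy as np

arr_float = np.array([1.2, 1.7, -1.7])

# .astype() returns a new array; the decimal part is simply cut off.
arr_int = arr_float.astype('int')

print(arr_int)    # [ 1  1 -1]
print(arr_float)  # the original array is unchanged
```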

### dtype='object'

You can force NumPy to create an array with elements of different types by creating a list that contains your elements then feeding that into the np.array() function using the dtype='object' argument.

In [None]:
arr_obj = np.array([1, 'a'], dtype='object')
print(arr_obj)

However, it is strongly advised that you do NOT do this, which is why it is best thought of as a "workaround". You should not have an array containing different types, because doing so will eventually cause you issues in your programme when it comes to carrying out calculations.
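The sketch below shows the kind of issue this can cause: a simple addition fails as soon as NumPy reaches the string element. It also shows what happens if you leave out dtype='object'. The name `arr_str` is made up for this example.

```python
import numpy as np

arr_obj = np.array([1, 'a'], dtype='object')

try:
    arr_obj + 1  # 1 + 1 is fine, but 'a' + 1 is not
except TypeError as err:
    print("Calculation failed:", err)

# Without dtype='object', NumPy converts every element to a string instead.
arr_str = np.array([1, 'a'])
print(arr_str.dtype.kind)  # U (a unicode string type)
```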

### The tolist() method

Alternatively, you could convert your array into a list with the .tolist() method.

In [None]:
arr_list = arr_obj.tolist()
print(arr_list)

### Inspecting a NumPy array

NumPy arrays have a range of built-in attributes that allow you to inspect different aspects of an array.

In [None]:
list2 = [[0,1,2], [3,4,5], [6,7,8]]
arr3 = np.array(list2, dtype='float')
#We can view the shape of the array with .shape. Here, we see it has 3 rows and 3 columns.
print("Shape: ", arr3.shape)
#We can use .dtype to tell us the type of the elements.
print("Type of elements: ", arr3.dtype)
#We can get the number of elements with .size. Here it is 9 (3 rows * 3 columns).
print("Size: ", arr3.size)
#We can get the number of dimensions of an array with .ndim.
print("Dimensions: ", arr3.ndim)

## Exercise 1
Do the first set of exercises in the Jupyter notebook of exercises for this session.

### Extracting specific items from an array

You can extract certain elements of an array using indices, much like when you’re working with lists. Unlike lists, however, arrays accept one index (or slice) per dimension inside the square brackets, separated by commas.

In [None]:
print("Whole array: ", arr3)
print("First two elements of the first two rows: ", arr3[:2, :2])
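A few more indexing patterns on the same array, as a short sketch. In each case the first index refers to the row and the second to the column.

```python
import numpy as np

arr3 = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]], dtype='float')

print(arr3[1, 2])    # a single element: row 1, column 2 -> 5.0
print(arr3[1])       # the whole of row 1 -> [3. 4. 5.]
print(arr3[:, 1])    # the whole of column 1 -> [1. 4. 7.]
print(arr3[-1, -1])  # negative indices count from the end -> 8.0
```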

### Boolean indexing

A boolean index array has the same shape as the array being filtered, but it only contains True and False values, where each value depends on whether the corresponding element fulfils a certain criterion.

In [None]:
arr_bool = arr3>2
print(arr_bool)

So we see here that the boolean array tells us which elements in arr3 contain values greater than 2.
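The boolean array becomes useful once you pass it back in as an index. The sketch below filters the array with it and then uses the same idea to overwrite values; `arr_capped` is a made-up name for this example.

```python
import numpy as np

arr3 = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]], dtype='float')
arr_bool = arr3 > 2

# Passing the boolean array as an index keeps only the True positions.
# The result is a flat (1D) array.
print(arr3[arr_bool])  # [3. 4. 5. 6. 7. 8.]

# The same kind of index can be used to overwrite the selected values.
arr_capped = arr3.copy()
arr_capped[arr_capped > 5] = 5
print(arr_capped)
```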

## Exercise 2
Complete the second set of exercises in the Jupyter notebook of exercises for this session.

## Pandas

- Pandas, like NumPy, is one of the most popular Python libraries for data analysis.
- It is a high-level abstraction built on top of the lower-level NumPy, much of which is itself implemented in C.
- The main benefit of Pandas is that it provides high-performance, easy-to-use, data structures and data analysis tools.
- There are two main data structures in Pandas: dataframes and series.

When importing pandas, it's conventional to import it "as pd", much like the "as np" when we import NumPy.

In [None]:
import pandas as pd

#### Pandas Series

A pandas series is similar to a list, but differs in the fact that a series associates a label with each element, forming an index. If an index is not explicitly provided by the user, pandas creates a RangeIndex ranging from 0 to N-1; where N is the number of elements in your series.

In [None]:
#First, let's convert a list of integers into a Series.
list1 = [5, 6, 7, 8, 9, 10]
new_series = pd.Series(list1)
#We'll see here that Pandas has auto-created an index of 0-5 for the Series.
print(new_series)

### Indices in Series

As you may suspect by this point, a series lets you extract all of its values, as well as individual elements by index. If you didn't specify an index, and instead allowed Pandas to create one automatically, then you can access an element by its position in much the same way as you would with a list:

In [None]:
print(new_series[4])
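To get at all of the values or all of the index labels at once, a series has the .index attribute and the .to_numpy() method. The sketch below also uses .iloc, which always selects by position.

```python
import pandas as pd

new_series = pd.Series([5, 6, 7, 8, 9, 10])

print(new_series.index)       # the labels: a RangeIndex from 0 to 5
print(new_series.to_numpy())  # all of the values, as a NumPy array
print(new_series.iloc[-1])    # the last element, by position -> 10
```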

However, if you used the index=[] argument when you created your series, then you can use your custom index in order to access a specific element. You can even use strings as indices in Pandas data structures.

In [None]:
new_series = pd.Series(list1, index=['a','b','c', 'd', 'e', 'f'])
print(new_series['c'])

You can retrieve several elements simultaneously by feeding index values in as a list.

In [None]:
print(new_series[['a', 'd', 'f']])

You can also use indexing to alter the value of one or more elements.

In [None]:
new_series[['b', 'd']] = 0
print(new_series)

### Filtering and maths operations

Filtering and maths operations are easy with Pandas as well. To filter, you write the variable name of the series and then, in the square brackets that follow, you write the series name again, followed by the conditional statement.

In [None]:
#print all elements in the series that have a value greater than 2:
print(new_series[new_series>2])
print("_______")
#Print all elements with a value greater than two and multiply each of these values by 2.
print(new_series[new_series>2]*2)
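You can also combine several conditions in one filter. In this sketch, which rebuilds the series with the values it holds at this point in the session, each condition sits in its own round brackets and the two are joined with &. The name `mid_values` is made up for this example.

```python
import pandas as pd

new_series = pd.Series([5, 0, 7, 0, 9, 10],
                       index=['a', 'b', 'c', 'd', 'e', 'f'])

# Each condition goes in its own brackets; & means "and", | means "or".
mid_values = new_series[(new_series > 2) & (new_series < 10)]
print(mid_values)

# Filtering returns a new series; new_series itself is unchanged.
print(len(new_series))  # 6
```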

### Pandas dataframe

The dataframe object, often abbreviated to df, is the data structure that makes Pandas such a powerful and useful package, particularly for data analysts.

Put simply, you can think of it as a table, where the columns are variables and the rows are observations. For example:

![image.png](attachment:image.png)

### Creating a Pandas dataframe

In order to create a new Pandas dataframe, let's first create a dictionary.

In [None]:
country_dict = {'Country': ['Kazakhstan', 'Russia', 'Belarus', 'Ukraine'],
                   'Population': [17.04, 143.5, 9.5, 45.5],
                   'Square_km': [2724902, 17125191, 207600, 603628]}
print(type(country_dict))
print(country_dict)

Now, let's convert our dictionary into a Pandas dataframe. To do this, we feed our dictionary into Pandas' DataFrame() constructor.

In [None]:
df = pd.DataFrame(country_dict)
print(df)

Note: When you create a dataframe from a dictionary, Pandas will get the column (variable) names for your dataframe from the dictionary's keys.
    
You can also create a dataframe from a list of lists. When you do so, however, you need to use the columns=[] argument in order to specify the names of your columns.

In [None]:
list2 = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
df2 = pd.DataFrame(list2, columns = ['V1', 'V2', 'V3'])
print(df2)

### Dataframe variables

Note that all columns in a Pandas dataframe are of the Series type:

In [None]:
print(type(df['Country']))

### Indexing in a dataframe

A Pandas dataframe object has two indices: a column index and a row index.

As with the series object, if you do not provide one, Pandas will create a RangeIndex from 0 to N-1.

You can select a single column (variable) of your dataframe using string indexing, much as we did to get values from a series.

In [None]:
print(df['Country'])

There are numerous ways to provide row indices explicitly. For example, you could provide an index when creating a dataframe. Here, we are going to use strings for indexing.

In [None]:
df = pd.DataFrame(country_dict, index=['KZ', 'RU', 'BY', 'UA'])
print(df)

Alternatively, you can set the index after the dataframe has been created. Here, we are also going to give the index column a name, 'Country_code'.

In [None]:
df = pd.DataFrame(country_dict)
df.index = ['KZ', 'RU', 'BY', 'UA']
df.index.name = "Country_code"
print(df)

Rows can be accessed by index in several ways. If you're using string (label) indexing, you can use **.loc[]**, which takes square brackets rather than round ones:

In [None]:
print(df.loc['KZ'])

If you're using numerical (positional) indexing, you can use **.iloc[]**:

In [None]:
print(df.iloc[1])

Note: If you specify a string index, you can still use numerical indexing; one does not replace the other.
    
You can equally use both types of indexing simultaneously in order to get a specific cell from a particular column.

In [None]:
print(df['Country'].iloc[1])
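A single cell can also be picked out in one step by giving .loc or .iloc both a row and a column. A minimal sketch, rebuilding the same dataframe so that it runs on its own:

```python
import pandas as pd

country_dict = {'Country': ['Kazakhstan', 'Russia', 'Belarus', 'Ukraine'],
                'Population': [17.04, 143.5, 9.5, 45.5],
                'Square_km': [2724902, 17125191, 207600, 603628]}
df = pd.DataFrame(country_dict, index=['KZ', 'RU', 'BY', 'UA'])

# Row first, then column.
print(df.loc['RU', 'Country'])  # by labels -> Russia
print(df.iloc[1, 0])            # by positions -> Russia
```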

You can select particular rows and columns in the following way. The first argument in the indexing brackets selects the rows and the second selects the columns. Here, we provide a list of row indices so that we get two rows.

In [None]:
print(df.loc[['KZ', 'RU'], 'Population'])

You can also use slicing, like we did with lists in the previous sessions.

In [None]:
print(df.loc[['KZ', 'RU'], :'Population'])
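One difference from list slicing is worth knowing: a label slice with .loc includes its end point, whereas a position slice with .iloc stops before it, just as with lists. The sketch below compares the two on the same dataframe; `by_label` and `by_position` are made-up names.

```python
import pandas as pd

country_dict = {'Country': ['Kazakhstan', 'Russia', 'Belarus', 'Ukraine'],
                'Population': [17.04, 143.5, 9.5, 45.5],
                'Square_km': [2724902, 17125191, 207600, 603628]}
df = pd.DataFrame(country_dict, index=['KZ', 'RU', 'BY', 'UA'])

# Label slice: both 'KZ' and 'RU' are included.
by_label = df.loc['KZ':'RU', 'Country':'Population']
print(by_label)

# Position slice: row 0 is included, row 1 is not.
by_position = df.iloc[0:1, 0:2]
print(by_position)
```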

### Filtering

We can also filter a dataframe using Boolean arrays. For example, we want to get the values for the Country and Square_km variables where the population value is above 18.

In [None]:
print(df[df['Population'] > 18][['Country', 'Square_km']])
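It can help to build the filter up in steps. In this sketch the Boolean series is first stored in a variable (called `mask` here, which is just a made-up name), and a second filter then combines two conditions with &.

```python
import pandas as pd

country_dict = {'Country': ['Kazakhstan', 'Russia', 'Belarus', 'Ukraine'],
                'Population': [17.04, 143.5, 9.5, 45.5],
                'Square_km': [2724902, 17125191, 207600, 603628]}
df = pd.DataFrame(country_dict, index=['KZ', 'RU', 'BY', 'UA'])

# Step 1: a Boolean series with one True/False value per row.
mask = df['Population'] > 18
print(mask)

# Step 2: use it to pick rows, and name the columns we want to keep.
large = df.loc[mask, ['Country', 'Square_km']]
print(large)

# Two conditions at once: each goes in its own round brackets.
large_and_big = df[(df['Population'] > 18) & (df['Square_km'] > 1000000)]
print(large_and_big)
```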

### Deleting columns

You can delete a column using the **.drop()** method. The axis argument states whether you want to drop labels from the rows (0 or ‘index’) or columns (1 or ‘columns’).

In [None]:
df = df.drop(['Population'], axis='columns')
print(df)
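Note that .drop() returns a new dataframe rather than changing the original, which is why the result is assigned back to df above. The sketch below shows this, and also drops a row by using axis='index'. The names `df_no_pop` and `df_no_ru` are made up for this example.

```python
import pandas as pd

country_dict = {'Country': ['Kazakhstan', 'Russia', 'Belarus', 'Ukraine'],
                'Population': [17.04, 143.5, 9.5, 45.5],
                'Square_km': [2724902, 17125191, 207600, 603628]}
df = pd.DataFrame(country_dict, index=['KZ', 'RU', 'BY', 'UA'])

# Dropping a column gives a new dataframe; df keeps all three columns.
df_no_pop = df.drop(['Population'], axis='columns')
print(list(df_no_pop))
print(list(df))

# Dropping a row works the same way, using the row label.
df_no_ru = df.drop(['RU'], axis='index')
print(df_no_ru)
```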

### Reading data from a file and converting it into a Pandas dataframe

As a computational social scientist/social data scientist, you will often have the data you want to work with stored in a file.

These data files could be any one of several different file types, and Pandas supports most of them, including CSV, XML, HTML, Excel, SQL and JSON. It even has functionality to import SPSS data files.

Let's say we're working with a .csv (comma-separated values) file. We can use the Pandas **read_csv()** function to import our data file and automatically convert it into a Pandas dataframe.

If you are working with another file type (e.g. SQL, Excel, etc.), in-depth details of the other versions of the **read_** function that correspond to these other file types can be found [here](https://pandas.pydata.org/docs/reference/io.html). For the time being, the basic input/output methods for Pandas are:

| Data file type | Reader | Writer |
| --- | --- | --- |
| csv | pd.read_csv() | df.to_csv() |
| json | pd.read_json() | df.to_json() |
| Microsoft Excel and OpenDocument (ODF) | pd.read_excel() | df.to_excel() |
| Stata | pd.read_stata() | df.to_stata() |
| SAS | pd.read_sas() | N/A |
| SPSS | pd.read_spss() | N/A |

We'll be using some of these other **read_** methods in future sessions of the course.

The **read_csv()** function takes two main arguments. The first is the file name as a string. If your file is **NOT** in the current working directory, then this string also has to include the path to the directory where the file is stored; an absolute file path is the safest choice. The string must also end with the file extension; in this case, .csv.

The second argument is **header**, which tells Pandas which row contains the column headers for our file; header=0 means the first row. By default, Pandas tries to work this out automatically and is pretty good at doing so, but it is good practice to state it explicitly nonetheless.

There are other optional arguments we can feed in, depending on the nature of our data file. Full details of these arguments can be found [here](https://pandas.pydata.org/docs/reference/io.html). We'll look at some of these optional arguments as we progress through the course.

Below is an example import statement using the **read_csv()** method. We assign our dataframe to the variable name "df", which is a very common shorthand for "dataframe" that you will see regularly in example code.

In [None]:
df = pd.read_csv('Username/filepath/filename.csv', header=0)
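The file path above is only a placeholder, so that cell will not run as it stands. If you want to try **read_csv()** without a file to hand, one option is to read from a string of CSV text instead, as in this sketch. The data and the names `csv_text` and `df_demo` are made up for this example.

```python
import io

import pandas as pd

# Made-up CSV content; the first row holds the column headers.
csv_text = """Country,Population,Square_km
Kazakhstan,17.04,2724902
Russia,143.5,17125191
"""

# io.StringIO makes the string behave like an open file.
df_demo = pd.read_csv(io.StringIO(csv_text), header=0)
print(df_demo)
```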

Once your data is loaded into your Python script as a Pandas dataframe, you can then do your data cleaning, analysis, graphing and the like; the subjects of future sessions on this course.

Before you do any of that, however, it is good practice to first explore your data set so that you understand the nature of the data you are working with. Pandas has a number of in-built functions for this.

One of the most useful tools is Python's built-in **list()** function, which, when given a dataframe, returns a list of the variable (column) names. To use it, you feed **list()** one argument, the name of your dataframe, and then enclose the call within a **print()** statement:

In [None]:
print(list(df))

Another useful method for understanding your data set is the **.head()** method, which prints the top 5 rows of data.

In [None]:
print(df.head())

If you want the **.head()** method to print more or fewer than the top 5 rows of data, feed the number you want in as an argument:

In [None]:
print(df.head(25))
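A few other built-in tools are handy for a first look at a data set. The sketch below tries them on a small made-up dataframe called `df_demo`; on your own data you would use your own dataframe's name instead.

```python
import pandas as pd

df_demo = pd.DataFrame({'Country': ['Kazakhstan', 'Russia', 'Belarus', 'Ukraine'],
                        'Population': [17.04, 143.5, 9.5, 45.5],
                        'Square_km': [2724902, 17125191, 207600, 603628]})

print(df_demo.shape)       # (number of rows, number of columns) -> (4, 3)
print(df_demo.dtypes)      # the data type of each column
print(df_demo.tail(2))     # the last 2 rows
print(df_demo.describe())  # summary statistics for the numeric columns
```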

## Exercise 4
Do the fourth set of exercises in the Jupyter notebook of exercises for this session.

### Save a Pandas dataframe to file
Once you're done with these operations, if you have made any changes to your dataframe and wish to save the amended dataframe as a .csv file, you can do so with the **.to_csv()** method. This only requires one argument: a string that contains the name of the file, ending in ".csv". Again, if you want to save the file to a folder that is not your current working directory, you will have to include the path to that folder in your string:

In [None]:
df.to_csv('path/to/directory/filename.csv')
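By default, **.to_csv()** also writes the row index to the file as an extra first column. If you don't want that, you can pass index=False. The sketch below saves a small made-up dataframe to a temporary folder and reads it back in; the file name countries.csv and the variable names are just for illustration.

```python
import os
import tempfile

import pandas as pd

df_demo = pd.DataFrame({'Country': ['Belarus', 'Ukraine'],
                        'Population': [9.5, 45.5]})

# A temporary folder keeps the example from leaving files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'countries.csv')
    df_demo.to_csv(out_path, index=False)       # leave the row index out
    df_back = pd.read_csv(out_path, header=0)   # read the file back in

print(df_back)
```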

### Final notes on NumPy and Pandas
NumPy and Pandas have the ability to do so much more than has been discussed here. While it is worth developing a strong understanding of both packages, I would recommend against spending a disproportionate amount of time learning NumPy. This is because it is a vast package and you will learn the parts of NumPy that you need to learn, as you spend more time coding different tasks in Python.

## End of session 5