# 1. Handling files in Python

Working with files in Python is much like working with a physical book. To get something from a book you first have to open it; then you can read it from start to finish, skip ahead to a particular line or page, or write in it.

The same rules apply to files in Python. You can opt to open a file by referencing its name and then indicating whether you want to read from the file or write to it.

A file is a named location on disk used to store information persistently.

Let's start off by opening a file.

Before we can read from or write to a file we first need to open it. You can open a file using the built-in open() function.

In [5]:
file = open('file.txt')
file

<_io.TextIOWrapper name='file.txt' mode='r' encoding='cp1252'>

From this we can see that if we don't specify a mode, the file is opened in read mode ('r').

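We can also specify the mode in which we want to open the file: for example, whether we want to read from it ('r'), append new text to it ('a') or write to it ('w'). Let's see what we get when we open the file in write mode. Be aware that opening an existing file with 'w' empties it immediately, so only use this mode when you intend to replace the contents.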

In [6]:
file = open('file.txt','w')
file

<_io.TextIOWrapper name='file.txt' mode='w' encoding='cp1252'>

You can also specify the encoding. The default depends on the platform you are using: on Windows it is often 'cp1252', as in the output above, whereas on Linux it is usually 'utf-8'.
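Because the default encoding varies by platform, passing the encoding explicitly helps a script behave the same everywhere. Here is a minimal sketch, assuming a throwaway file in a temporary directory (the name encoding_demo.txt is made up for illustration and is not part of this notebook's data):

```python
import os
import tempfile

# Hypothetical file path inside a temporary directory, used only for this demo
demo_path = os.path.join(tempfile.mkdtemp(), 'encoding_demo.txt')

# Write and read with the same explicit encoding instead of the platform default
with open(demo_path, 'w', encoding='utf-8') as file:
    file.write('caf\u00e9')  # the word "cafe" with an accented e

with open(demo_path, encoding='utf-8') as file:
    text = file.read()
    used_encoding = file.encoding

print(used_encoding, len(text))
```

Using the same encoding for writing and reading is what matters here; 'utf-8' is simply a common choice.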

Now let's try to read from the file. We can pass read() the number of characters we want to read from the file we have opened. Let us read the first three characters of our file.

In [20]:
file = open('file.txt')
file.read(3)

'Two'
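Each call to read() continues from where the previous one stopped, because the file object keeps track of its current position. A small sketch of this, assuming a temporary demo file rather than file.txt (the path and its contents are made up for illustration):

```python
import os
import tempfile

# Hypothetical demo file so the example does not depend on file.txt
demo_path = os.path.join(tempfile.mkdtemp(), 'read_demo.txt')
with open(demo_path, 'w') as file:
    file.write('Two men went bear hunting.')

with open(demo_path) as file:
    first = file.read(3)    # the first three characters
    second = file.read(4)   # the next four characters, not the first four
    rest = file.read()      # everything that is left

print(repr(first), repr(second), repr(rest))
```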

We can also choose to read the entire file in one go. We can do this by leaving out the number inside the brackets specifying the length of characters we want to read

In [22]:
file = open('file.txt')
file.read()

'Two men went bear hunting. While one stayed in the cabin, the other went out looking for a bear. He soon found a huge bear, shot at it but only wounded it. The enraged bear charged toward him, he dropped his rifle and started running for the cabin as fast as he could.\nHe ran pretty fast but the bear was just a little faster and gained on him with every step. Just ashe reached the open cabin door, he tripped and fell flat. Too close behind to stop, the bear tripped over him and went rolling into the cabin.\nThe man jumped up, closed the cabin door and yelled to his friend inside, "You skin this one while I go and get another one!"'

Once you have opened the file you can also opt to use a loop to read the file line by line. This can work either with a while statement or a for loop.

With a while statement you can read each line in the file until there are no lines remaining. Let's look at a practical example of this.

In [36]:
file = open('file.txt')
while True:
    line = file.readline()
    if len(line) == 0:
        break
    print(line,end="")
file.close()

Two men went bear hunting. While one stayed in the cabin, the other went out looking for a bear. He soon found a huge bear, shot at it but only wounded it. The enraged bear charged toward him, he dropped his rifle and started running for the cabin as fast as he could.
He ran pretty fast but the bear was just a little faster and gained on him with every step. Just ashe reached the open cabin door, he tripped and fell flat. Too close behind to stop, the bear tripped over him and went rolling into the cabin.
The man jumped up, closed the cabin door and yelled to his friend inside, "You skin this one while I go and get another one!"

With this while statement we start off by opening the file 'file.txt'. Once it is open we read one line at a time using the readline() method, which returns the next line of the file including its trailing newline. We then check the length of the line with the built-in len() function, which returns the number of items in a sequence such as a string, list or tuple. At the end of the file readline() returns an empty string, so when the length is 0 we break out of the while loop instead of reading forever.

Generally speaking it is good practice to close a file once we are done using it.

If you are using a with statement to open and read from your file, you do not need to close your file as the with statement will take care of closing the file for you. Here is an example

In [37]:
with open('file.txt') as file:
    for line in file:
        print(line)

Two men went bear hunting. While one stayed in the cabin, the other went out looking for a bear. He soon found a huge bear, shot at it but only wounded it. The enraged bear charged toward him, he dropped his rifle and started running for the cabin as fast as he could.

He ran pretty fast but the bear was just a little faster and gained on him with every step. Just ashe reached the open cabin door, he tripped and fell flat. Too close behind to stop, the bear tripped over him and went rolling into the cabin.

The man jumped up, closed the cabin door and yelled to his friend inside, "You skin this one while I go and get another one!"
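Each line read from a file keeps its trailing newline and print() adds another one, which is why the output above is double spaced. Below is a small sketch of two common ways to avoid that, assuming a temporary demo file instead of file.txt (the path and contents are made up for illustration):

```python
import os
import tempfile

# Hypothetical demo file so the example does not depend on file.txt
demo_path = os.path.join(tempfile.mkdtemp(), 'lines_demo.txt')
with open(demo_path, 'w') as file:
    file.write('first line\nsecond line\n')

stripped_lines = []
with open(demo_path) as file:
    for line in file:
        # Option 1: remove the newline that was read from the file
        stripped_lines.append(line.rstrip('\n'))
        # Option 2: keep the newline and tell print() not to add its own
        print(line, end='')
```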


## 1.1 Writing to a file

Now let's imagine a scenario where you want to write some text to the file using the write() method. You may either want to create a new file and write some text to it or to write to an existing file.

Let's start off by opening our current file and adding new text to it. Because we want to append text to an existing file, we pass the mode 'a' to open(). If you are a bit confused about file modes, you might find this resource helpful: http://book.pythontips.com/en/latest/open_function.html

In [38]:
with open('file.txt','a') as file:
    file.write('How would you respond if you were the friend')


Let's recap what we just did. We opened the file file.txt in append mode, so the new text was added at the end of the file.

**What if I want to prepend a line of text to the beginning of a file?**

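'a' and 'a+' modes only allow you to append text to the end of the file. The pointer moves to the end of the file before any writing is done. If you want to write somewhere other than the end, you can open the file in 'r+' mode, which places the pointer at the start of the file. Note that writing in 'r+' mode overwrites the characters that are already there rather than pushing them along, as the output of this example shows: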

In [50]:
with open('file.txt','r+') as file:
    file.write('This is not based on a real life:')
    file.seek(0)
    content =  file.read()
    print(content)

This is not based on a real life: 
 stayed in the cabin, the other went out looking for a bear. He soon found a huge bear, shot at it but only wounded it. The enraged bear charged toward him, he dropped his rifle and started running for the cabin as fast as he could.
He ran pretty fast but the bear was just a little faster and gained on him with every step. Just ashe reached the open cabin door, he tripped and fell flat. Too close behind to stop, the bear tripped over him and went rolling into the cabin.
The man jumped up, closed the cabin door and yelled to his friend inside, "You skin this one while I go and get another one!"How would you respond if you were the friend
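In the output above the new text replaced the opening words of the story. To insert text at the beginning without losing anything, one option is to read the existing content first and then write the new text followed by the old content. The sketch below assumes a small temporary file in place of file.txt (the path and contents are made up for illustration):

```python
import os
import tempfile

# Hypothetical demo file standing in for file.txt
demo_path = os.path.join(tempfile.mkdtemp(), 'prepend_demo.txt')
with open(demo_path, 'w') as file:
    file.write('Two men went bear hunting.\n')

with open(demo_path, 'r+') as file:
    original = file.read()   # read everything first
    file.seek(0)             # move the pointer back to the start
    # Write the new text followed by the old content
    file.write('This is not based on real life:\n' + original)

with open(demo_path) as file:
    result = file.read()

print(result)
```

Since the rewritten content is longer than the original, nothing from the old version is left behind. This approach reads the whole file into memory, so it is best suited to small files.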


Writing to a new file is a lot easier. To create a new text file we specify the name of the new file in our open() call and select the write mode ('w') to write text to it.

In [53]:
with open('newfile.txt','w') as file:
    file.write('This is a new file \n We are adding spaces between each sentence using backslash n.')


When we write new text to the file we can use \n to start a new line.

Let's take a look at what our file looks like

In [55]:
with open('newfile.txt') as file:
    for line in file:
        print(line)

This is a new file 

 We are adding spaces between each sentence using backslash n.


We strongly recommend that you read these sources to further develop your understanding of file handling in Python

Sources:
* Chapter 13, How to Think Like a Computer Scientist: Learning with Python 3 - http://openbookproject.net/thinkcs/python/english3e/files.html
* Working with file I/O in Python - https://dbader.org/blog/python-file-io

# 2. Introduction to Pandas

Pandas is often described as the "Python Data Analysis Library". From the name you can probably already guess that this library is very important for work involving data analysis, data cleanup and data exploration. For this portion of the notebook we are going to introduce what the pandas library can do, its background and how it can be used, working with a real dataset to give you a practical understanding of the library.

Pandas has a lot of use cases, ranging from computing statistics (averages, correlations, medians etc.), cleaning up data and visualizing data with the help of Matplotlib, to storing the cleaned and transformed data and machine learning with the help of other packages such as Keras and scikit-learn. We won't, however, be getting into data visualization or machine learning in this course, as these will be dealt with separately.

You can read more about Pandas on Pydata.org: https://pandas.pydata.org/about/index.html

**How to use Pandas library**

Using pandas in Python is a relatively simple process. You gain access to the library by using the import statement. Aside from built-in functions and basic types, you cannot use a name in Python without first defining or importing it. This means that you will not be able to use a pandas function unless pandas has been imported. Importing the pandas library into our notebook makes the package and its functions accessible in our current scope.

If you are interested in understanding how the import statement works you can read more here: https://www.codementor.io/@sheena/python-path-virtualenv-import-for-beginners-du107r3o1

Typically we import a package like pandas with an alias, and for pandas this alias is almost always 'pd'. The reason we do this is so we can access a function in pandas using the much shorter form pd.function as opposed to pandas.function. The alias is arbitrary, but pd is used so widely that your code will be easier for other people to read if you stick to it. Let's have a look at how we can do this.

In [None]:
import pandas as pd

If you cannot import pandas and get the error message ModuleNotFoundError, this means that you need to install the package. One way to install packages in Python is by using pip, a tool for installing and managing Python packages.

You can read more about pip here: https://pip.pypa.io/en/stable/

Alternatively you could use conda, a package and environment manager that installs conda packages from the Anaconda repository and Anaconda Cloud. You should, however, be able to install pandas by simply running pip install pandas from a terminal, or %pip install pandas in a notebook cell.

**NumPy**

In most cases you will also need to import the NumPy library to support the computations you will be carrying out with pandas. You can think of NumPy as pandas' best friend. NumPy is a package typically associated with scientific and mathematical computing in Python.

You can read more about NumPy here: https://docs.scipy.org/doc/numpy/user/whatisnumpy.html

Now let's import numpy with the alias np

In [2]:
import numpy as np

## 2.1 The DataFrame

According to the technical definition a Pandas DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. Labeled in this context means that the columns and rows are labeled.

**What does this mean?**

You can think of a pandas dataframe as something similar to an Excel spreadsheet. A dataframe stores data in a tabular format with rows and columns. The columns can have different types, meaning you can store a mixture of strings, numbers and other datatypes.

Dataframes consist of three principal components: the data, the index and the columns. You can think of the index as an address of sorts that points to where particular data can be accessed. Both rows and columns have labels: the row labels are called the index, while the column labels are the column names.
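To make the index and the columns concrete, here is a minimal sketch using a tiny made-up dataframe (the names, ages and row labels are invented for illustration):

```python
import pandas as pd

# Small made-up dataframe with custom row labels
people = pd.DataFrame(
    {'name': ['Ada', 'Bola'], 'age': [30, 25]},
    index=['row_a', 'row_b'],
)

row_labels = people.index.tolist()       # the index holds the row labels
column_labels = people.columns.tolist()  # the columns hold the column names

print(row_labels)
print(column_labels)
```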

You can create a new, empty dataframe, create a dataframe from a list, a list of lists or a dictionary of data, or import data into your Jupyter notebook as a dataframe. Let's start by creating a dataframe from a list.


In [9]:
nums = [1,2,3,4,5,6,7]
pd.DataFrame(nums, columns= ['numbers'])

Unnamed: 0,numbers
0,1
1,2
2,3
3,4
4,5
5,6
6,7


In the above example we used pd.DataFrame() to create our dataframe, passing our list as the first argument.

Since this is a flat list of numbers, the data is presented as a dataframe with one column and one row per list element. We also named our column 'numbers'. You can additionally specify the row labels of the dataframe using the index parameter. For the full list of parameters accepted by pd.DataFrame(), see the pandas documentation:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html
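As a quick illustration of the index parameter mentioned above, this sketch labels the rows with letters instead of the default 0, 1, 2 (the labels are arbitrary and chosen only for this example):

```python
import pandas as pd

nums = [1, 2, 3]
# index supplies the row labels; it must have the same length as the data
labelled = pd.DataFrame(nums, columns=['numbers'], index=['a', 'b', 'c'])

# Rows can now be looked up by their label
value_at_b = labelled.loc['b', 'numbers']
print(labelled)
```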

Pandas also provides a one-dimensional labeled array that, like a dataframe, is capable of holding any type of data. This is referred to as a pandas Series. You can liken a pandas Series to a single column in an Excel spreadsheet.

You can read more about this here: https://towardsdatascience.com/pandas-series-a-lightweight-intro-b7963a0d62a2

Let us try to convert the list we just used into a pandas Series.

In [34]:
nums_series = pd.Series(nums)
nums_series

0    1
1    2
2    3
3    4
4    5
5    6
6    7
dtype: int64

So now we know that we can convert a list or list of lists into a pandas dataframe or series (if it is a one dimensional list). You may be asking if we can also convert a dictionary to a pandas dataframe. If you have already read the links we supplied earlier you will know that the answer to this question is a yes.

Now let us try and create a DataFrame from a dictionary

In [44]:
dictionary = {'Olabisi':'111-111-111','Iylian':'222-222-222','Balogun a Amineet':'333-333-333'}
pd.DataFrame(dictionary.items(),columns=['names','numbers'])

Unnamed: 0,names,numbers
0,Olabisi,111-111-111
1,Iylian,222-222-222
2,Balogun a Amineet,333-333-333


In the above example we start off by creating a dictionary of key-value pairs. The .items() method gives us each key together with its value as a pair, so passing dictionary.items() to pd.DataFrame() produces one row per pair, and we then give the two columns appropriate names. Because each item is a pair, a dictionary that isn't nested will always produce a dataframe with two columns this way. Even if the value of a key is a list, we still get only two columns: one for the keys and one for the values. You can also opt to create an empty dataframe and append values to it at a later stage.

Let's look at an example of this process. In this scenario we will create an empty dataframe, however we will predefine its dimensions, specifically the number and name of the columns in the dataframe

In [20]:
#we start off by creating an empty dataframe
df = pd.DataFrame(columns=['A','B','C'])
#then we append new rows
df.loc[len(df)] = [1,2,3]
df

Unnamed: 0,A,B,C
0,1,2,3


As indicated by the comments in our code, we start off by creating an empty dataframe with labeled columns. We then use the loc[] indexer, which accesses rows by their index label.

Assigning to a label that does not exist yet adds a new row. Here we use the label equal to the current length of the dataframe (its number of rows), which is the next unused label in a default integer index. Running the same command again adds another row, which will appear as the last row of the dataframe.
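Repeating the same assignment adds one row each time. A short sketch with two made-up rows:

```python
import pandas as pd

df = pd.DataFrame(columns=['A', 'B', 'C'])

for row in ([1, 2, 3], [4, 5, 6]):
    # len(df) is the next unused label in a default 0, 1, 2, ... index
    df.loc[len(df)] = row

print(df)
```

This pattern is convenient for a handful of rows. For larger amounts of data it is usually simpler to collect the rows in a list first and build the dataframe in a single call.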

You can visit PyData for more information on the loc[] method: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html

You can read more about creating a dataframe from GeeksforGeeks: https://www.geeksforgeeks.org/creating-a-pandas-dataframe/

Now let's assume we want to read a pandas dataframe from a file. We have already learnt how to open and read a file in Python; reading a file using pandas is often a lot easier, depending on the file type. The good thing about reading a file with pandas is that the data is loaded directly as a dataframe, removing the need to convert it yourself.

We have a dataset named combined_indicators_for_nigeria_2 in our folder. Let's start off by reading this csv as a pandas dataframe.

In [35]:
import pandas as pd
file = pd.read_csv('combined_indicators_for_nigeria_2.csv')

To get an extract of the first five rows we call the head() method on our dataframe. By default head() returns the first 5 rows, but you can specify the number of rows you want by passing an integer. Conversely you can use the tail() method to return the last n rows of the dataframe.

Viewing the first n rows gives us a quick overview of the type of data in each column.

In [36]:
file.head(5)

Unnamed: 0,country_name,country_iso3,year,indicator_name,indicator_code,value
0,Nigeria,NGA,2007,"Agricultural machinery, tractors",AG.AGR.TRAC.NO,24800.0
1,Nigeria,NGA,2006,"Agricultural machinery, tractors",AG.AGR.TRAC.NO,23999.0
2,Nigeria,NGA,2005,"Agricultural machinery, tractors",AG.AGR.TRAC.NO,23000.0
3,Nigeria,NGA,2004,"Agricultural machinery, tractors",AG.AGR.TRAC.NO,23000.0
4,Nigeria,NGA,2003,"Agricultural machinery, tractors",AG.AGR.TRAC.NO,22000.0


As mentioned, you can also opt to print out the last n rows by using the tail() function. By default this will print out the last 5 rows unless you specify otherwise

In [37]:
file.tail()

Unnamed: 0,country_name,country_iso3,year,indicator_name,indicator_code,value
4995,Nigeria,NGA,2007,"Grants, excluding technical cooperation (BoP, ...",BX.GRT.EXTA.CD.WD,1304920000.0
4996,Nigeria,NGA,2006,"Grants, excluding technical cooperation (BoP, ...",BX.GRT.EXTA.CD.WD,11388180000.0
4997,Nigeria,NGA,2005,"Grants, excluding technical cooperation (BoP, ...",BX.GRT.EXTA.CD.WD,5840000000.0
4998,Nigeria,NGA,2004,"Grants, excluding technical cooperation (BoP, ...",BX.GRT.EXTA.CD.WD,165130000.0
4999,Nigeria,NGA,2003,"Grants, excluding technical cooperation (BoP, ...",BX.GRT.EXTA.CD.WD,114860000.0


You may also want to know the dimensions of the dataframe, that is, the number of rows and columns it contains. We can use the shape attribute, which returns a tuple of the form (rows, columns).

In [38]:
file.shape

(5000, 6)
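Since shape is a plain tuple, it can be unpacked into two variables. The sketch below uses a tiny made-up dataframe rather than the Nigeria dataset and also shows head() and tail() with an explicit row count:

```python
import pandas as pd

# Made-up stand-in for the real dataset
df = pd.DataFrame({
    'year': [2007, 2006, 2005],
    'value': [24800.0, 23999.0, 23000.0],
})

n_rows, n_cols = df.shape   # shape is an attribute, so no parentheses
first_two = df.head(2)      # first two rows
last_one = df.tail(1)       # last row

print(n_rows, n_cols)
```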

Now suppose we see the country_iso3 column and decide to change its name to country code(iso) for our own convenience (because we find this name easier to understand).

You can easily change the name of one or several columns in pandas using the rename() method. Pass a dictionary that maps each old column name to its new name to the columns parameter.

You can read more about the rename function in the PyData documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html

Now let's change the country_iso3 column name to country code(iso)

In [39]:
file.rename(columns={'country_iso3':'country code(iso)'},inplace=True)
file.head()

Unnamed: 0,country_name,country code(iso),year,indicator_name,indicator_code,value
0,Nigeria,NGA,2007,"Agricultural machinery, tractors",AG.AGR.TRAC.NO,24800.0
1,Nigeria,NGA,2006,"Agricultural machinery, tractors",AG.AGR.TRAC.NO,23999.0
2,Nigeria,NGA,2005,"Agricultural machinery, tractors",AG.AGR.TRAC.NO,23000.0
3,Nigeria,NGA,2004,"Agricultural machinery, tractors",AG.AGR.TRAC.NO,23000.0
4,Nigeria,NGA,2003,"Agricultural machinery, tractors",AG.AGR.TRAC.NO,22000.0


You may have noticed that we added the argument inplace=True after the mapping from old to new name. This makes rename() modify the dataframe in place and return None, instead of returning a renamed copy. We use it so that we don't have to assign the renamed dataframe back to our variable.

As stated previously, we can also rename multiple columns at a time by adding more key value pairs to the columns dictionary.
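Here is a small sketch of renaming two columns at once without inplace=True, using a made-up one-row dataframe. The new name 'code' is only an example:

```python
import pandas as pd

df = pd.DataFrame({
    'country_iso3': ['NGA'],
    'indicator_code': ['AG.AGR.TRAC.NO'],
})

# Without inplace=True, rename returns a new dataframe and leaves df unchanged
renamed = df.rename(columns={
    'country_iso3': 'country code(iso)',
    'indicator_code': 'code',
})

print(renamed.columns.tolist())
print(df.columns.tolist())
```

Assigning the result to a new variable keeps the original dataframe available, which can be handy while exploring.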

If you want to see the column names in your dataframe you can use the .columns attribute, which returns an Index of the column names.

Let's look at an example of this:


In [40]:
file.columns

Index(['country_name', 'country code(iso)', 'year', 'indicator_name',
       'indicator_code', 'value'],
      dtype='object')

Similar to the columns attribute, you can get the underlying data as a NumPy array by using the values attribute. Used on the entire dataframe, it returns a two-dimensional array with one inner array per row. Used on a single column, it returns a one-dimensional array of the values in that column.

Let us look at the values in the year column.

In [41]:
file['year'].values

array([2007, 2006, 2005, ..., 2005, 2004, 2003], dtype=int64)
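The difference between calling values on a whole dataframe and on a single column can be seen in the shape of the result. A minimal sketch with made-up numbers:

```python
import pandas as pd

df = pd.DataFrame({'year': [2007, 2006], 'value': [24800.0, 23999.0]})

all_values = df.values           # 2-D array: one inner array per row
year_values = df['year'].values  # 1-D array for a single column

print(all_values.shape)
print(year_values.shape)
```

The pandas documentation recommends to_numpy() as the more explicit way to get the same kind of array.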

## 2.2 Selecting data in the dataframe

Now let's assume we want to select a column or multiple columns in our dataframe and find out which unique values they contain.

To put it in practical terms, we want to understand whether *"Agricultural machinery, tractors"* is the only indicator name in our dataframe or if we have other indicators.

We will start off by selecting the relevant column, which in this case is the *'indicator_name'* column, and then use the unique() method to get an array of the unique values in that column.

In [42]:
file['indicator_name'].unique()

array(['Agricultural machinery, tractors',
       'Fertilizer consumption (% of fertilizer production)',
       'Fertilizer consumption (kilograms per hectare of arable land)',
       'Agricultural land (sq. km)', 'Agricultural land (% of land area)',
       'Arable land (hectares)', 'Arable land (hectares per person)',
       'Arable land (% of land area)',
       'Land under cereal production (hectares)',
       'Permanent cropland (% of land area)',
       'Rural land area where elevation is below 5 meters (sq. km)',
       'Rural land area where elevation is below 5 meters (% of total land area)',
       'Forest area (sq. km)', 'Forest area (% of land area)',
       'Agricultural irrigated land (% of total agricultural land)',
       'Average precipitation in depth (mm per year)',
       'Land area (sq. km)', 'Rural land area (sq. km)',
       'Agricultural machinery, tractors per 100 sq. km of arable land',
       'Cereal production (metric tons)',
       'Crop production index (2

You can also select multiple columns. The syntax is similar to our above example, with the difference being that we pass a list of column names.

Let's look at an example of how we would do this by selecting the indicator name and year columns:

In [43]:
file[['indicator_name','year']].head()

Unnamed: 0,indicator_name,year
0,"Agricultural machinery, tractors",2007
1,"Agricultural machinery, tractors",2006
2,"Agricultural machinery, tractors",2005
3,"Agricultural machinery, tractors",2004
4,"Agricultural machinery, tractors",2003


From looking at the unique values in the indicator name column we see that there are multiple indicators in our dataframe. Now let's assume we are interested in indicators relevant to economic inequality and we specifically want to filter our dataframe to show only rows where the indicator name is *'Income share held by lowest 20%'*.

One way to do this is with a boolean mask that checks whether the value in our *'indicator_name'* column is equal to the value we want to filter for. To check for equality we can use either the comparison operator *'=='* or the *'.eq'* method, which is equivalent to *'=='*.

You can read more about *'.eq'* in the pandas pydata documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.eq.html

In [44]:
file[file['indicator_name'].eq('Income share held by lowest 20%')]

Unnamed: 0,country_name,country code(iso),year,indicator_name,indicator_code,value
3793,Nigeria,NGA,2009,Income share held by lowest 20%,SI.DST.FRST.20,5.4
3794,Nigeria,NGA,2003,Income share held by lowest 20%,SI.DST.FRST.20,5.7
3795,Nigeria,NGA,1996,Income share held by lowest 20%,SI.DST.FRST.20,3.7
3796,Nigeria,NGA,1992,Income share held by lowest 20%,SI.DST.FRST.20,4.0
3797,Nigeria,NGA,1985,Income share held by lowest 20%,SI.DST.FRST.20,6.0


This is a condensed one-line version of two steps: first creating a boolean Series that holds True or False for each row, depending on whether the value in the column matches, and then applying this filter to our dataframe to return only the rows where the filter is True.

In [45]:
#creating the boolean variable
boolean_filter = file['indicator_name'].eq('Income share held by lowest 20%')
#applying the filter to our dataframe
file[boolean_filter]

Unnamed: 0,country_name,country code(iso),year,indicator_name,indicator_code,value
3793,Nigeria,NGA,2009,Income share held by lowest 20%,SI.DST.FRST.20,5.4
3794,Nigeria,NGA,2003,Income share held by lowest 20%,SI.DST.FRST.20,5.7
3795,Nigeria,NGA,1996,Income share held by lowest 20%,SI.DST.FRST.20,3.7
3796,Nigeria,NGA,1992,Income share held by lowest 20%,SI.DST.FRST.20,4.0
3797,Nigeria,NGA,1985,Income share held by lowest 20%,SI.DST.FRST.20,6.0


It is also possible to select rows using several boolean conditions chained together. You can combine conditions with or, represented by the symbol *'|'*, or with and, represented by *'&'*.

We are going to look at two examples. In the first example we will filter the dataframe for rows where the indicator is Income share held by lowest 20% or Poverty headcount ratio at $1.90 a day (2011 PPP).

In [46]:
#let us filter for the indicator name being either income share held by the lowest 20% or poverty headcount ratio at $1.90
file[file['indicator_name'].eq('Income share held by lowest 20%')
     | file['indicator_name'].eq('Poverty headcount ratio at $1.90 a day (2011 PPP) (% of population)')]

Unnamed: 0,country_name,country code(iso),year,indicator_name,indicator_code,value
3793,Nigeria,NGA,2009,Income share held by lowest 20%,SI.DST.FRST.20,5.4
3794,Nigeria,NGA,2003,Income share held by lowest 20%,SI.DST.FRST.20,5.7
3795,Nigeria,NGA,1996,Income share held by lowest 20%,SI.DST.FRST.20,3.7
3796,Nigeria,NGA,1992,Income share held by lowest 20%,SI.DST.FRST.20,4.0
3797,Nigeria,NGA,1985,Income share held by lowest 20%,SI.DST.FRST.20,6.0
3798,Nigeria,NGA,2009,Poverty headcount ratio at $1.90 a day (2011 P...,SI.POV.DDAY,53.5
3799,Nigeria,NGA,2003,Poverty headcount ratio at $1.90 a day (2011 P...,SI.POV.DDAY,53.5
3800,Nigeria,NGA,1996,Poverty headcount ratio at $1.90 a day (2011 P...,SI.POV.DDAY,63.5
3801,Nigeria,NGA,1992,Poverty headcount ratio at $1.90 a day (2011 P...,SI.POV.DDAY,57.1
3802,Nigeria,NGA,1985,Poverty headcount ratio at $1.90 a day (2011 P...,SI.POV.DDAY,53.3


For the second example we will filter for data where the indicator is income share held by lowest 20% and the relevant year is 2009 or 2003

In [47]:
file[(file['indicator_name']== 'Income share held by lowest 20%') & file['year'].isin([2003,2009])]

Unnamed: 0,country_name,country code(iso),year,indicator_name,indicator_code,value
3793,Nigeria,NGA,2009,Income share held by lowest 20%,SI.DST.FRST.20,5.4
3794,Nigeria,NGA,2003,Income share held by lowest 20%,SI.DST.FRST.20,5.7
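When conditions are written with == rather than .eq(), each one needs its own parentheses, as in the example above. The sketch below shows this on a tiny made-up dataframe (the indicator names 'A' and 'B' are placeholders) and adds ~, which negates a condition:

```python
import pandas as pd

df = pd.DataFrame({
    'indicator_name': ['A', 'A', 'B', 'B'],
    'year': [2003, 2009, 2003, 2009],
})

# Each comparison is wrapped in parentheses because & and | bind more
# tightly than == in Python
is_a_and_2009 = (df['indicator_name'] == 'A') & (df['year'] == 2009)
is_a_or_2009 = (df['indicator_name'] == 'A') | (df['year'] == 2009)
# ~ flips True and False in a boolean Series
is_not_a = ~(df['indicator_name'] == 'A')

rows_and = df[is_a_and_2009].index.tolist()
rows_or = df[is_a_or_2009].index.tolist()
rows_not = df[is_not_a].index.tolist()

print(rows_and, rows_or, rows_not)
```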


I'm sure you have noticed by now that in many cases there is more than one way to accomplish a particular task. With that in mind let us introduce .loc and .iloc, two indexers used for selecting data in dataframes.

### 2.2.1  Selecting data using iloc

The iloc indexer is used for selection by position, using an integer to represent the index location of some data. You can use *.iloc* to either select rows or columns by their position. Let's start off by selecting the first row from our dataframe.

Remember that Python uses zero-based indexing, which means that the first element has the index 0.

In [48]:
file.iloc[0]

country_name                                  Nigeria
country code(iso)                                 NGA
year                                             2007
indicator_name       Agricultural machinery, tractors
indicator_code                         AG.AGR.TRAC.NO
value                                           24800
Name: 0, dtype: object

As you may have guessed, you are not limited to selecting one row; you can select multiple rows using the iloc indexer. You can for example select the first 3 rows using a slice, similar to how you would slice a list.

In [49]:
file.iloc[0:3]

Unnamed: 0,country_name,country code(iso),year,indicator_name,indicator_code,value
0,Nigeria,NGA,2007,"Agricultural machinery, tractors",AG.AGR.TRAC.NO,24800.0
1,Nigeria,NGA,2006,"Agricultural machinery, tractors",AG.AGR.TRAC.NO,23999.0
2,Nigeria,NGA,2005,"Agricultural machinery, tractors",AG.AGR.TRAC.NO,23000.0


You can also use iloc to select a column in cases where you do not know the exact name of the column(s) you want but you do know their position. The syntax is similar to selecting a row, but you put a colon and a comma before the position of the column(s): the colon selects all rows and the value after the comma selects the column(s).

Let us try selecting the first column and returning the values in it.

In [50]:
file.iloc[:,0]

0       Nigeria
1       Nigeria
2       Nigeria
3       Nigeria
4       Nigeria
         ...   
4995    Nigeria
4996    Nigeria
4997    Nigeria
4998    Nigeria
4999    Nigeria
Name: country_name, Length: 5000, dtype: object
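Row and column positions can also be combined in a single iloc call. A minimal sketch on a made-up three-row dataframe:

```python
import pandas as pd

df = pd.DataFrame({
    'country_name': ['Nigeria', 'Nigeria', 'Nigeria'],
    'year': [2007, 2006, 2005],
    'value': [24800.0, 23999.0, 23000.0],
})

# Rows 0 and 1 (the end of the slice is excluded), columns at positions 0 and 2
subset = df.iloc[0:2, [0, 2]]
# A single cell: third row, second column
single_value = df.iloc[2, 1]

print(subset)
print(single_value)
```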

Knowing what we can do with the iloc indexer, let us use it to filter the dataframe to show only rows where the year column is equal to 2009.

The year column is the third column, so with zero-based indexing its position is 2.

In [51]:
file[file.iloc[:,2] == 2009]

Unnamed: 0,country_name,country code(iso),year,indicator_name,indicator_code,value
54,Nigeria,NGA,2009,Fertilizer consumption (% of fertilizer produc...,AG.CON.FERT.PT.ZS,2.135078e+02
62,Nigeria,NGA,2009,Fertilizer consumption (kilograms per hectare ...,AG.CON.FERT.ZS,5.261031e+00
77,Nigeria,NGA,2009,Agricultural land (sq. km),AG.LND.AGRI.K2,6.900000e+05
133,Nigeria,NGA,2009,Agricultural land (% of land area),AG.LND.AGRI.ZS,7.576007e+01
189,Nigeria,NGA,2009,Arable land (hectares),AG.LND.ARBL.HA,3.200000e+07
...,...,...,...,...,...,...
4825,Nigeria,NGA,2009,"Foreign direct investment, net (BoP, current US$)",BN.KLT.DINV.CD,-7.029619e+09
4866,Nigeria,NGA,2009,"Portfolio investment, net (BoP, current US$)",BN.KLT.PTXL.CD,3.452547e+08
4900,Nigeria,NGA,2009,"Reserves and related items (BoP, current US$)",BN.RES.INCL.CD,-1.051451e+10
4941,Nigeria,NGA,2009,"Net secondary income (BoP, current US$)",BN.TRF.CURR.CD,1.936164e+10


### 2.2.2  Selecting data using loc

Unlike the iloc indexer, loc is label-based, meaning you specify the label of the row or column you would like to select. It also accepts a boolean mask.

Now let's assume you want to select the rows of our dataframe where the year is between 2006 and 2009. Pandas has a between() method that, as the name suggests, selects values between two boundaries (both included by default). Let's see an actual example of this:

In [52]:
file.loc[file['year'].between(2006,2009)]

Unnamed: 0,country_name,country code(iso),year,indicator_name,indicator_code,value
0,Nigeria,NGA,2007,"Agricultural machinery, tractors",AG.AGR.TRAC.NO,2.480000e+04
1,Nigeria,NGA,2006,"Agricultural machinery, tractors",AG.AGR.TRAC.NO,2.399900e+04
54,Nigeria,NGA,2009,Fertilizer consumption (% of fertilizer produc...,AG.CON.FERT.PT.ZS,2.135078e+02
62,Nigeria,NGA,2009,Fertilizer consumption (kilograms per hectare ...,AG.CON.FERT.ZS,5.261031e+00
63,Nigeria,NGA,2008,Fertilizer consumption (kilograms per hectare ...,AG.CON.FERT.ZS,5.876833e+00
...,...,...,...,...,...,...
4974,Nigeria,NGA,2006,"Net capital account (BoP, current US$)",BN.TRF.KOGT.CD,1.055551e+10
4993,Nigeria,NGA,2009,"Grants, excluding technical cooperation (BoP, ...",BX.GRT.EXTA.CD.WD,9.347400e+08
4994,Nigeria,NGA,2008,"Grants, excluding technical cooperation (BoP, ...",BX.GRT.EXTA.CD.WD,8.322900e+08
4995,Nigeria,NGA,2007,"Grants, excluding technical cooperation (BoP, ...",BX.GRT.EXTA.CD.WD,1.304920e+09
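
To make the behaviour of between() at the boundaries concrete, here is a small sketch on made-up data. The dataframe `df` and its values are hypothetical; the explicit comparison is shown only to illustrate what between() is equivalent to with its default settings.

```python
import pandas as pd

# Hypothetical data covering years inside and outside the range
df = pd.DataFrame({
    "year": [2005, 2006, 2007, 2009, 2010],
    "value": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# between() returns a boolean Series; both 2006 and 2009 count as matches
in_range = df["year"].between(2006, 2009)
selected = df.loc[in_range]

# The same selection written as two explicit comparisons joined with &
same = df.loc[(df["year"] >= 2006) & (df["year"] <= 2009)]

# loc can also restrict the columns by label in the same call
years_only = df.loc[in_range, ["year"]]
print(selected)
```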


You can also use loc to change the values in a given column, or to create a new column and conditionally assign values to it.

Let's assume we want to create a column named *'timeline'*, and we want every row from 2009 to have the value *'latest'* in that column. Let's walk through how we would do this.

In [53]:
file.loc[file['year'] == 2009, 'timeline'] = 'latest'
file[file['year'] == 2009]

Unnamed: 0,country_name,country code(iso),year,indicator_name,indicator_code,value,timeline
54,Nigeria,NGA,2009,Fertilizer consumption (% of fertilizer produc...,AG.CON.FERT.PT.ZS,2.135078e+02,latest
62,Nigeria,NGA,2009,Fertilizer consumption (kilograms per hectare ...,AG.CON.FERT.ZS,5.261031e+00,latest
77,Nigeria,NGA,2009,Agricultural land (sq. km),AG.LND.AGRI.K2,6.900000e+05,latest
133,Nigeria,NGA,2009,Agricultural land (% of land area),AG.LND.AGRI.ZS,7.576007e+01,latest
189,Nigeria,NGA,2009,Arable land (hectares),AG.LND.ARBL.HA,3.200000e+07,latest
...,...,...,...,...,...,...,...
4825,Nigeria,NGA,2009,"Foreign direct investment, net (BoP, current US$)",BN.KLT.DINV.CD,-7.029619e+09,latest
4866,Nigeria,NGA,2009,"Portfolio investment, net (BoP, current US$)",BN.KLT.PTXL.CD,3.452547e+08,latest
4900,Nigeria,NGA,2009,"Reserves and related items (BoP, current US$)",BN.RES.INCL.CD,-1.051451e+10,latest
4941,Nigeria,NGA,2009,"Net secondary income (BoP, current US$)",BN.TRF.CURR.CD,1.936164e+10,latest


Let's backtrack and walk through what we just did:

We started off by filtering for rows where the year is equal to 2009. After the comma we gave the column label 'timeline'; because no column with that name existed yet, pandas created it. We then closed the square brackets and assigned the value 'latest', which was written only to the rows that matched the filter. Rows from other years hold NaN in the new column.
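
The walkthrough above can be reproduced on a few rows to see what happens to the rows that do not match. This is a sketch on invented data; the label 'earlier' is a hypothetical value chosen only to show a second conditional assignment.

```python
import pandas as pd

# Hypothetical miniature stand-in for the tutorial's dataframe
df = pd.DataFrame({"year": [2007, 2008, 2009, 2009]})

# Row filter before the comma, column label after it
df.loc[df["year"] == 2009, "timeline"] = "latest"

# Rows that did not match the filter are left as missing values
missing = df["timeline"].isna().tolist()

# A second conditional assignment fills in the remaining rows
df.loc[df["year"] != 2009, "timeline"] = "earlier"
print(df)
```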

Here are a few sources you could use to enrich your knowledge on selecting data:
* i.) Selecting Subsets of Data in Pandas Part 1 - https://medium.com/dunder-data/selecting-subsets-of-data-in-pandas-6fcd0170be9c
* ii.) Selecting pandas dataframe rows based on conditions - https://chrisalbon.com/python/data_wrangling/pandas_selecting_rows_on_conditions/
* iii.) How to select rows from a dataframe based on column values, Stack Overflow - https://stackoverflow.com/questions/17071871/how-to-select-rows-from-a-dataframe-based-on-column-values

## 2.3 Manipulating data

Now that you have a good understanding of how to select, filter and subset columns and rows in a pandas dataframe, let us look at ways we can manipulate the data it holds. Let's start by creating a new column of random floats.

We can carry this out using the numpy package. NumPy has a function, *'numpy.random.rand'*, that generates random floats between 0 (inclusive) and 1 (exclusive). Inside np.random.rand we enter the dimensions of the array we want returned. Since we are looking to create 5000 rows of random floats for 1 column, we pass the dimensions 5000,1 to the function.

Numpy.random.rand documentation: https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.random.rand.html


In [54]:
import numpy as np
file['randoms'] = np.random.rand(5000,1)
file.head(10)

Unnamed: 0,country_name,country code(iso),year,indicator_name,indicator_code,value,timeline,randoms
0,Nigeria,NGA,2007,"Agricultural machinery, tractors",AG.AGR.TRAC.NO,24800.0,,0.370511
1,Nigeria,NGA,2006,"Agricultural machinery, tractors",AG.AGR.TRAC.NO,23999.0,,0.673565
2,Nigeria,NGA,2005,"Agricultural machinery, tractors",AG.AGR.TRAC.NO,23000.0,,0.447208
3,Nigeria,NGA,2004,"Agricultural machinery, tractors",AG.AGR.TRAC.NO,23000.0,,0.842611
4,Nigeria,NGA,2003,"Agricultural machinery, tractors",AG.AGR.TRAC.NO,22000.0,,0.215552
5,Nigeria,NGA,2002,"Agricultural machinery, tractors",AG.AGR.TRAC.NO,21000.0,,0.755003
6,Nigeria,NGA,2001,"Agricultural machinery, tractors",AG.AGR.TRAC.NO,20006.0,,0.41888
7,Nigeria,NGA,2000,"Agricultural machinery, tractors",AG.AGR.TRAC.NO,19400.0,,0.750167
8,Nigeria,NGA,1999,"Agricultural machinery, tractors",AG.AGR.TRAC.NO,18850.0,,0.541283
9,Nigeria,NGA,1998,"Agricultural machinery, tractors",AG.AGR.TRAC.NO,18300.0,,0.964422
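
The dimensions passed to np.random.rand decide the shape of the array that comes back. The sketch below, on an invented four-row dataframe, compares the two-dimensional shape used above with a one-dimensional array of the same length, and uses `len(df)` so the row count does not have to be typed in by hand. The seed is set only so that repeated runs give the same numbers.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # assumption: seeded purely for reproducibility

# Hypothetical miniature stand-in for the tutorial's dataframe
df = pd.DataFrame({"year": [2007, 2008, 2009, 2010]})

column_2d = np.random.rand(len(df), 1)  # 4 rows, 1 column
column_1d = np.random.rand(len(df))     # a flat array of 4 values

print(column_2d.shape)
print(column_1d.shape)

# A one-dimensional array with one value per row maps directly onto a column
df["randoms"] = column_1d
print(df)
```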


Now let us check the data types of all the columns in our dataframe using the .dtypes attribute.

Dtypes documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dtypes.html

In [55]:
file.dtypes

country_name          object
country code(iso)     object
year                   int64
indicator_name        object
indicator_code        object
value                float64
timeline              object
randoms              float64
dtype: object

Now let's assume we want to convert all the random numbers we generated in the randoms column to integers. After converting the column values to integers we will assign this new column to the old randoms column to update it with the changes we have made.

Pandas has an .astype() method that allows for easy conversion from one data type to another.

Astype documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html

In [56]:
file['randoms'] = file['randoms'].astype('int')
file.head()

Unnamed: 0,country_name,country code(iso),year,indicator_name,indicator_code,value,timeline,randoms
0,Nigeria,NGA,2007,"Agricultural machinery, tractors",AG.AGR.TRAC.NO,24800.0,,0
1,Nigeria,NGA,2006,"Agricultural machinery, tractors",AG.AGR.TRAC.NO,23999.0,,0
2,Nigeria,NGA,2005,"Agricultural machinery, tractors",AG.AGR.TRAC.NO,23000.0,,0
3,Nigeria,NGA,2004,"Agricultural machinery, tractors",AG.AGR.TRAC.NO,23000.0,,0
4,Nigeria,NGA,2003,"Agricultural machinery, tractors",AG.AGR.TRAC.NO,22000.0,,0


If you remember from past chapters, converting a float to an integer truncates the float, keeping only the digits to the left of the decimal point. Since every float in the randoms column has 0 as its whole-number part, all the values are truncated to 0.

Let's assume we want to carry out some basic arithmetic operations on the randoms column. Luckily for us, pandas supports arithmetic operations on both Series and dataframes.
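
The difference between truncating and rounding is easier to see on values that are not all below 1. Here is a small sketch with a made-up Series; rounding first is shown as one possible alternative, not as what the tutorial does.

```python
import pandas as pd

# Hypothetical values, including one negative number
s = pd.Series([0.37, 0.99, 1.6, 2.75, -1.6])

# astype('int') drops the fractional part (truncation toward zero)
truncated = s.astype("int")

# Rounding to the nearest whole number first gives a different result
rounded = s.round().astype("int")

print(truncated.tolist())  # [0, 0, 1, 2, -1]
print(rounded.tolist())    # [0, 1, 2, 3, -2]
```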

In [67]:
file['random'] = file['randoms']+10
file.head()

Unnamed: 0,country_name,country code(iso),year,indicator_name,indicator_code,value,timeline,randoms,random
0,Nigeria,NGA,2007,"Agricultural machinery, tractors",AG.AGR.TRAC.NO,24800.0,,0,10
1,Nigeria,NGA,2006,"Agricultural machinery, tractors",AG.AGR.TRAC.NO,23999.0,,0,10
2,Nigeria,NGA,2005,"Agricultural machinery, tractors",AG.AGR.TRAC.NO,23000.0,,0,10
3,Nigeria,NGA,2004,"Agricultural machinery, tractors",AG.AGR.TRAC.NO,23000.0,,0,10
4,Nigeria,NGA,2003,"Agricultural machinery, tractors",AG.AGR.TRAC.NO,22000.0,,0,10


Oops! It looks like we mistakenly created a new column named *'random'* instead of saving the result back to the existing randoms column. Not a problem: we just need to delete the new column and update the correct one.

In order to delete a column we use the drop() method on our dataframe and specify the column we would like to drop. drop() returns a new dataframe rather than modifying the original, so we assign the result back to file.

In [76]:
file = file.drop(['random'],axis=1)
file['randoms'] = file['randoms']+10
file.head()

Unnamed: 0,country_name,country code(iso),year,indicator_name,indicator_code,value,timeline,randoms
0,Nigeria,NGA,2007,"Agricultural machinery, tractors",AG.AGR.TRAC.NO,24800.0,,10
1,Nigeria,NGA,2006,"Agricultural machinery, tractors",AG.AGR.TRAC.NO,23999.0,,10
2,Nigeria,NGA,2005,"Agricultural machinery, tractors",AG.AGR.TRAC.NO,23000.0,,10
3,Nigeria,NGA,2004,"Agricultural machinery, tractors",AG.AGR.TRAC.NO,23000.0,,10
4,Nigeria,NGA,2003,"Agricultural machinery, tractors",AG.AGR.TRAC.NO,22000.0,,10


You may ask why we had to add the argument axis=1 to our drop call. With *'axis=1'* we are telling pandas to look for the label among the columns: 1 represents columns while 0 represents rows.

Drop() documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html
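
As a short sketch of the axis argument, the example below drops a column and a row from an invented dataframe. It also shows the `columns=` keyword of drop(), which is an alternative spelling of the same column drop; `df` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical dataframe with the mistakenly created 'random' column
df = pd.DataFrame({
    "year": [2007, 2008],
    "randoms": [0, 0],
    "random": [10, 10],
})

# axis=1: the label is looked up among the columns
dropped_by_axis = df.drop(["random"], axis=1)

# columns= does the same without needing the axis argument
dropped_by_name = df.drop(columns=["random"])

# axis=0: the label is looked up among the row index instead
dropped_rows = df.drop([0], axis=0)

# drop() returned new dataframes; df itself still has all three columns
print(list(df.columns))
print(list(dropped_by_axis.columns))
```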


### 2.3.1 The groupby function

Here are more sources to learn about pandas:

## 2.4 Primer to data cleaning using pandas