# Lesson 1: Input and Output


**LEARNING OBJECTIVES:**

1. Learn how to use `pandas` to open and import information from `.csv` files
2. Understand how to coordinate data within a DataFrame
3. Learn how to insert a column into the DataFrame
4. Learn how to export DataFrames as `.csv` files 

**Input and output of `.csv` files**

The first step to learning how to use the `pandas` library is to import data. 

`pandas` can open 'comma-separated values' (`.csv`) files. These files are plain text: a series of strings or numbers separated by commas.

For example, if you were to open a `.csv` file in Notepad or in Microsoft Word, you would see plain text with one row per line and a comma between each value.

However, if you were to open this in a spreadsheet program, such as Microsoft Excel, it would produce the following table:

| Current liabilities               |Amount    |
| -------------                     |:--------:|
| Accounts payable                  | 30650    |
| Accrued expenses                  | 4500     |
| Notes payable                     | 15000    |
| Income taxes payable              | 4500     | 
| Current portion of long-term debt |-5000      | 
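
As a rough sketch of the two views described above, the snippet below rebuilds the raw text of such a file by hand from the table and splits each line at its commas. The header text is an assumption copied from the table; the real `Liabilities.csv` may differ slightly.

```python
# Raw text of a small CSV file, rebuilt by hand from the table above.
# Assumption: the real Liabilities.csv may differ in its exact header text.
raw_text = (
    "Current liabilities,Amount\n"
    "Accounts payable,30650\n"
    "Accrued expenses,4500\n"
    "Notes payable,15000\n"
    "Income taxes payable,4500\n"
    "Current portion of long-term debt,-5000\n"
)

# This is what a plain text editor would display.
print(raw_text)

# Each line is one row, and the commas mark where one cell ends and the next begins.
rows = [line.split(",") for line in raw_text.splitlines()]
print(rows[0])  # the header row
print(rows[1])  # the first data row
```

A spreadsheet program does the same splitting for you and draws each value in its own cell.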

Now let's import this data into `pandas`:

Firstly, we need to import the `pandas` library. Let's give it the nickname `pd`.

In [None]:
import pandas as pd

Now we need to import the data into a DataFrame. Let's call our DataFrame object `BalanceSheet`.

We are importing data from a file saved on the internet.

In [None]:
BalanceSheet = pd.read_csv(
    "https://raw.githubusercontent.com/ThomasJewson/datasets/master/Liabilities.csv"
)

Now let's output this DataFrame by calling the object name, `BalanceSheet`.

The numbers on the left-hand side show the row index number; note that Python starts counting from zero. The first row of the output shows the column labels.

In [None]:
BalanceSheet
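
The row index and column labels can also be inspected directly. The sketch below uses a small hypothetical stand-in for the downloaded file (the name `sheet` and the two-row text are assumptions), so that it runs without an internet connection.

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for the downloaded file, so the example runs offline.
csv_text = (
    "Current liabilities:,Amount\n"
    "Accounts payable,30650\n"
    "Accrued expenses,4500\n"
)
sheet = pd.read_csv(io.StringIO(csv_text))

print(sheet.index)    # the automatically generated row index, starting at 0
print(sheet.columns)  # the column labels, taken from the first line of the text
print(sheet.shape)    # (number of rows, number of columns)
```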

Let's save this as a `.csv` file on our own computer.

First we will need a file path to save our `.csv` to. I am going to save my `.csv` to a `Python` folder inside my `Documents` folder; you should find a suitable location of your own. This folder has a file location of `C:\Users\Tom\Documents\Python`. Let's give the file the name `Liabilities.csv`. Therefore, my file path will be `C:\Users\Tom\Documents\Python\Liabilities.csv`.

I found my file location by opening the folder, then right-clicking and opening Properties.

In Python strings the backslash (`\`) is a special character, known as the "escape" character. We use the escape character to produce whitespace characters, for example, `\n` is a new line, `\t` is a tab and `\r` is a carriage return.

So how do we put an actual backslash in our strings? We use a double backslash, `\\`. See below.

In [None]:
print("\tPython\nis\ncool\n")

print("This prints a single backslash \\")

Therefore, we need to change all the single backslashes in our file path to double backslashes. 

My file pathway is now `C:\\Users\\Tom\\Documents\\Python\\Liabilities.csv`
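
Doubling every backslash is not the only option. As a hedged sketch, the snippet below compares the doubled-backslash path with a raw string (prefix `r`, which tells Python to keep backslashes exactly as typed) and with a forward-slash version. It only compares the text of the paths and does not touch any files; the variable names are illustrative.

```python
from pathlib import PureWindowsPath

# The path written with doubled backslashes, as in the lesson.
escaped = "C:\\Users\\Tom\\Documents\\Python\\Liabilities.csv"

# The same path as a raw string: the r prefix switches off escape processing.
raw = r"C:\Users\Tom\Documents\Python\Liabilities.csv"

print(escaped)
print(escaped == raw)

# Windows also accepts forward slashes, which need no escaping at all.
forward = "C:/Users/Tom/Documents/Python/Liabilities.csv"
print(PureWindowsPath(forward) == PureWindowsPath(raw))
```

Any of the three spellings can be passed as the file path; which one to use is a matter of taste.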

Replace my file pathway below with yours and save the `.csv` file.

In [None]:
BalanceSheet.to_csv("C:\\Users\\Tom\\Documents\\Python\\Liabilities.csv", index=False)

We do not want the index to be saved, because `pandas` automatically generates a new index whenever it reads a file. Saving the old index as well would leave a duplicate index column in the data. Therefore, we use the `index=False` argument.
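
The effect of `index=False` can be seen without writing any files. In this sketch the small DataFrame and its `Item` column are illustrative assumptions; calling `.to_csv()` with no file path returns the `.csv` text as a string.

```python
import io

import pandas as pd

# Small illustrative DataFrame; the "Item" label is made up for this example.
sheet = pd.DataFrame(
    {"Item": ["Accounts payable", "Notes payable"], "Amount": [30650, 15000]}
)

# With no path given, to_csv() returns the CSV text instead of writing a file.
with_index = sheet.to_csv()
without_index = sheet.to_csv(index=False)
print(with_index)
print(without_index)

# Reading the first version back produces an extra, unnamed column.
columns_with = pd.read_csv(io.StringIO(with_index)).columns.tolist()
columns_without = pd.read_csv(io.StringIO(without_index)).columns.tolist()
print(columns_with)
print(columns_without)
```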

We can read from this `.csv` file with the same function (`pd.read_csv()`) we used before.

Again, you will need to replace my file pathway.

In [None]:
BalanceSheet_2 = pd.read_csv(
    "C:\\Users\\Tom\\Documents\\Python\\Liabilities.csv"
)

BalanceSheet_2

**Coordination of data**

We already know how to output the whole DataFrame. However, to selectively output data from the DataFrame we need to coordinate the information we want.

To coordinate data we need to tell Python which rows and columns we want to output. This is done by using our row index numbers and column labels.

For example, if we wanted to only output the `Amount` column we would use the following code.

In [None]:
BalanceSheet["Amount"]
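
A single column label in square brackets gives back a one-dimensional `pandas` Series rather than a DataFrame. The sketch below shows this on an illustrative DataFrame built from the table values; the names `sheet` and `amount` are assumptions.

```python
import pandas as pd

# Illustrative DataFrame built from the values in the table earlier in the lesson.
sheet = pd.DataFrame(
    {
        "Current liabilities:": [
            "Accounts payable",
            "Accrued expenses",
            "Notes payable",
            "Income taxes payable",
            "Current portion of long-term debt",
        ],
        "Amount": [30650, 4500, 15000, 4500, -5000],
    }
)

amount = sheet["Amount"]
print(type(amount).__name__)  # the kind of object a single column label returns
print(amount.tolist())        # the values in that column
```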

If we want to only output the first row, which has a row index number of `0`, we would use the following code.  

In [None]:
BalanceSheet[0:1]

The first number in the square brackets is where `pandas` starts outputting from, and the second number is where `pandas` stops. The row at the second number is not itself included, which is why `0:1` outputs only row `0`.

Therefore, if we wanted to output rows `1`, `2` and `3` we would do the following.

In [None]:
BalanceSheet[1:4]

We can even print out all the data from a certain row index onwards, for example, row `2` and every row after it.

In [None]:
BalanceSheet[2:]

Or even all the data before a certain row index, for example, all the rows before `2`.

In [None]:
BalanceSheet[:2]

We can even tell `pandas` to output everything except the last row with the following.

In [None]:
BalanceSheet[:-1]

Or everything but the last two rows.

In [None]:
BalanceSheet[:-2]

And we can tell `pandas` to output only the last row with the following.

In [None]:
BalanceSheet[-1:]

Or the last 3 rows.

In [None]:
BalanceSheet[-3:]
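
To check which rows each of the slices above returns, the sketch below applies them to an illustrative five-row DataFrame (the name `sheet` is an assumption) and prints the row index numbers that come back.

```python
import pandas as pd

# Illustrative DataFrame using the five amounts from the table earlier in the lesson.
sheet = pd.DataFrame({"Amount": [30650, 4500, 15000, 4500, -5000]})

examples = {
    "[0:1]": sheet[0:1],
    "[1:4]": sheet[1:4],
    "[2:]": sheet[2:],
    "[:2]": sheet[:2],
    "[:-1]": sheet[:-1],
    "[:-2]": sheet[:-2],
    "[-1:]": sheet[-1:],
    "[-3:]": sheet[-3:],
}

# Print the row index numbers selected by each slice.
for label, part in examples.items():
    print(label, "gives rows", part.index.tolist())
```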

We can also combine our column and row selections to coordinate our data.

For example, if we want to output the first two rows of the `Current liabilities:` column, we would use the following code.

In [None]:
BalanceSheet[0:2]["Current liabilities:"]

We can output multiple columns by putting the column labels in a list. For example, if we wanted to output both the `Current liabilities:` and the `Amount` columns, we would use a list of the two labels, `["Current liabilities:","Amount"]`. This list then needs to be put in another pair of square brackets.

In [None]:
BalanceSheet[0:2][["Current liabilities:","Amount"]]
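
The same selection can also be written in a single step with the `.iloc` (by position) and `.loc` (by label) indexers, which the `pandas` indexing guide covers in more detail. This is a sketch of that alternative on an illustrative DataFrame, not a replacement for the approach above.

```python
import pandas as pd

# Illustrative DataFrame; the values come from the table earlier in the lesson.
sheet = pd.DataFrame(
    {
        "Current liabilities:": ["Accounts payable", "Accrued expenses", "Notes payable"],
        "Amount": [30650, 4500, 15000],
    }
)

# The two-step selection used in the lesson.
chained = sheet[0:2][["Current liabilities:", "Amount"]]

# .iloc selects by position: rows 0 and 1, columns 0 and 1.
by_position = sheet.iloc[0:2, [0, 1]]

# .loc selects by label, and its row slice includes the end label (rows 0 and 1 here).
by_label = sheet.loc[0:1, ["Current liabilities:", "Amount"]]

print(chained.equals(by_position))
print(chained.equals(by_label))
```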

**Inserting more columns**

Let's start this new section with some new data.

This data is from prison populations in three UK prisons during July 2019 [1].

In [None]:
PrisonsData = pd.read_csv(
    "https://raw.githubusercontent.com/ThomasJewson/datasets/master/PrisonsData.csv"
)

PrisonsData

To add a new column to the right-hand side of the `PrisonsData` DataFrame object we need to provide the data and the column label.

Let's add the ratings of the prisons, with the column label of `Rating` and the following data [2]:

|Prison Name|Rating|
|-|-|
|Altcourse|3|
|Bedford|1|
|Cardiff|2|

In [None]:
PrisonsData["Rating"] = [3,1,2]

PrisonsData
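
The list of new values needs one entry per row. The sketch below uses a cut-down stand-in for `PrisonsData` (only the prison names, with the assumed name `prisons`) to show a successful assignment and the `ValueError` raised when the list is too short.

```python
import pandas as pd

# Cut-down stand-in for PrisonsData, holding only the prison names.
prisons = pd.DataFrame({"Prison Name": ["Altcourse", "Bedford", "Cardiff"]})

# Three rows, three ratings: the new column is added on the right-hand side.
prisons["Rating"] = [3, 1, 2]
print(prisons.columns.tolist())

# Three rows, two values: pandas refuses and raises a ValueError.
try:
    prisons["Too short"] = [3, 1]
except ValueError as error:
    print("ValueError:", error)
```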

Adding a column anywhere other than the right-hand side is slightly more complex.

To do this we need to use the `.insert()` method. Within the brackets we need to give the column index, the column label and the data, in that precise order.

As Python starts counting from zero we give the columns the following indexes:

|Column label|Column Index|
|-|-|
|Prison Name|0|
|Capacity|1|
|Rating|2|

Let's add the prison populations as of July 2019, so that they sit next to the `Capacity` column.

In [None]:
PrisonsData.insert(
    2,                  # Column index
    "Population",       # Column label
    [1131, 351, 718]    # Population data
)

PrisonsData
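
Two details of `.insert()` are worth knowing when re-running a notebook cell: it changes the DataFrame in place (returning `None`), and it raises a `ValueError` if the column label already exists. The sketch below shows both on a cut-down stand-in for `PrisonsData`; the name `prisons` and the column position `1` are assumptions made for this smaller example.

```python
import pandas as pd

# Cut-down stand-in for PrisonsData, using the names and ratings from the lesson.
prisons = pd.DataFrame(
    {"Prison Name": ["Altcourse", "Bedford", "Cardiff"], "Rating": [3, 1, 2]}
)

# insert() changes the DataFrame in place and returns None.
result = prisons.insert(1, "Population", [1131, 351, 718])
print(result)
print(prisons.columns.tolist())

# Running the same insert again fails because the label is already in use.
try:
    prisons.insert(1, "Population", [1131, 351, 718])
except ValueError as error:
    print("ValueError:", error)
```

If a cell containing `.insert()` fails on a second run, re-reading the data from the `.csv` file first restores the original columns.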

**Conclusions:**

*You should now be able to do the following:*
1. Read `.csv` files into `pandas` with the `pd.read_csv()` function
2. Save DataFrames to `.csv` files with the `.to_csv()` function
3. Understand what an escape character is, and why we need to use `"\\"` to output a single backslash
4. Coordinate and output specific data using square brackets
5. Add a new column to the right-hand side of the DataFrame
6. Add a new column at a specific column index in the DataFrame using the `.insert()` function

**Optional Extension:**

The `pandas` library can open a variety of file types other than `.csv` files. In particular, it can open Microsoft Excel (for example `.xlsx` and the binary `.xlsb` format) and OpenDocument spreadsheet (`.ods`) file types. This can be achieved with the following functions, whose engines rely on the optional `pyxlsb` and `odfpy` packages being installed:

`pd.read_excel('path_to_file.xlsb', engine='pyxlsb')` for binary Excel (`.xlsb`) spreadsheet files

`pd.read_excel('path_to_file.ods', engine='odf')` for OpenDocument spreadsheet files

To learn more about this, and about opening other file types, read the following manual entry about [input and output](https://pandas.pydata.org/docs/user_guide/io.html).

`pandas` has a handy trick where it can read from and write to your clipboard. Your clipboard holds whatever you last copied, for example by right-clicking and choosing copy or by pressing Ctrl+C on a piece of highlighted data. To learn how to do this, read the manual entry on [clipboards](https://pandas.pydata.org/docs/user_guide/io.html#clipboard).

To learn more about indexing and selecting data in `pandas`, read the manual entry for [indexing](https://pandas.pydata.org/docs/user_guide/indexing.html).


**Sources**

[1] https://www.gov.uk/government/statistics/prison-population-figures-2019

[2] https://www.gov.uk/government/statistics/prison-performance-ratings-2018-to-2019

    