# 2. Analyse and visualise data with Python - 11:05 to 12:25

---

## [Reading Tabular Data into DataFrames](https://swcarpentry.github.io/python-novice-gapminder/07-reading-tabular/index.html) - 10 min / 10 min exercises

**Learning objectives**:
- Import the Pandas library
- Use Pandas to load a simple CSV data set
- Get some basic information about a Pandas DataFrame

#### 1. Use the Pandas library to do statistics on tabular data:
* Pandas is a widely used Python library for statistics, particularly on tabular data.
* It borrows many features from R's dataframes:
  - A two-dimensional table whose columns have names and can hold different data types.
* Load it with `import pandas as pd`. The alias `pd` is commonly used for Pandas.
* Read a Comma Separated Values (CSV) data file with `pd.read_csv`.
  - The argument is the name of the file to be read.
  - Assign result to a variable to store the data that was read.

In [None]:
import pandas as pd

data = pd.read_csv("https://raw.githubusercontent.com/MaastrichtU-IDS/coding4all/main/data/gapminder_gdp_oceania.csv?token=AMA7DPKNKUI3V6WJR3BZR53AKRNQA")

print(data)

       country  gdpPercap_1952  ...  gdpPercap_2002  gdpPercap_2007
0    Australia     10039.59564  ...     30687.75473     34435.36744
1  New Zealand     10556.57566  ...     23189.80135     25185.00911

[2 rows x 13 columns]


* The columns in a dataframe are the observed variables, and the rows are the observations.
* When output is too wide to fit the screen, Pandas either wraps the lines and marks each wrap with a backslash `\`, or, as in the output above, replaces the omitted columns with `...`.
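
To try these ideas without a network connection, the sketch below reads a small CSV from an in-memory string. The four columns are a subset copied from the output above, and the names `CSV_TEXT` and `oceania` are chosen for this illustration only. `DataFrame.to_string()` renders every column, so nothing is wrapped or elided.

```python
import io

import pandas as pd

# A four-column subset of the Oceania data shown above (illustrative only).
CSV_TEXT = """country,gdpPercap_1952,gdpPercap_1957,gdpPercap_2002,gdpPercap_2007
Australia,10039.59564,10949.64959,30687.75473,34435.36744
New Zealand,10556.57566,12247.39532,23189.80135,25185.00911
"""

# read_csv accepts a file-like object as well as a path or URL.
oceania = pd.read_csv(io.StringIO(CSV_TEXT))

# Rows are observations, columns are observed variables.
n_rows, n_cols = oceania.shape
print(n_rows, "rows x", n_cols, "columns")

# to_string() renders the whole table without eliding any column.
full_text = oceania.to_string()
print(full_text)
```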

**File Not Found:**

> Our lessons store their data files in a `data` sub-directory, which is why the path to the file is `data/...csv`. If you forget to include `data/`, or if you include it but your copy of the file is somewhere else, you will get a runtime error that ends with a line like this:

`ERROR`: _OSError: File b'gapminder_gdp_oceania.csv' does not exist_
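
One way to handle a missing file gracefully is to catch the error, as in the minimal sketch below. The path `no_such_directory/gapminder_gdp_oceania.csv` is hypothetical and assumed not to exist. The exact exception class and message can vary between Pandas versions, so the sketch catches the general `OSError`.

```python
import pandas as pd

# Hypothetical path that is assumed not to exist on this machine.
missing_path = "no_such_directory/gapminder_gdp_oceania.csv"

try:
    data = pd.read_csv(missing_path)
    error_message = None
except OSError as err:
    # FileNotFoundError is a subclass of OSError, so this catches both.
    error_message = str(err)
    print("Could not read", missing_path)
    print("Reason:", error_message)
```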

#### 2. Use `index_col` to specify that a column's values should be used as row headings:
* By default, row headings are numbers (0 and 1 in this case).
* We would rather index by country.
* Pass the name of the column to `read_csv` as its `index_col` parameter to do this.

In [None]:
data = pd.read_csv("https://raw.githubusercontent.com/MaastrichtU-IDS/coding4all/main/data/gapminder_gdp_oceania.csv?token=AMA7DPKNKUI3V6WJR3BZR53AKRNQA", index_col= "country")
print(data)

             gdpPercap_1952  gdpPercap_1957  ...  gdpPercap_2002  gdpPercap_2007
country                                      ...                                
Australia       10039.59564     10949.64959  ...     30687.75473     34435.36744
New Zealand     10556.57566     12247.39532  ...     23189.80135     25185.00911

[2 rows x 12 columns]
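
Once the country names are the row headings, individual values can be looked up by label. The sketch below uses `.loc`, which also appears later in this lesson, on an in-memory subset of the Oceania data. The variable names are illustrative.

```python
import io

import pandas as pd

# A four-column subset of the Oceania data shown above (illustrative only).
CSV_TEXT = """country,gdpPercap_1952,gdpPercap_1957,gdpPercap_2002,gdpPercap_2007
Australia,10039.59564,10949.64959,30687.75473,34435.36744
New Zealand,10556.57566,12247.39532,23189.80135,25185.00911
"""

# index_col="country" turns the country column into the row headings.
data = pd.read_csv(io.StringIO(CSV_TEXT), index_col="country")
print(list(data.index))

# Look up a single value by row label and column label.
australia_1952 = data.loc["Australia", "gdpPercap_1952"]
print(australia_1952)
```

Because `country` is now the index, it no longer counts as a data column, which is why the column count drops by one.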


#### 3. Use the `DataFrame.info()` method to find out more about a dataframe

In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2 entries, Australia to New Zealand
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   gdpPercap_1952  2 non-null      float64
 1   gdpPercap_1957  2 non-null      float64
 2   gdpPercap_1962  2 non-null      float64
 3   gdpPercap_1967  2 non-null      float64
 4   gdpPercap_1972  2 non-null      float64
 5   gdpPercap_1977  2 non-null      float64
 6   gdpPercap_1982  2 non-null      float64
 7   gdpPercap_1987  2 non-null      float64
 8   gdpPercap_1992  2 non-null      float64
 9   gdpPercap_1997  2 non-null      float64
 10  gdpPercap_2002  2 non-null      float64
 11  gdpPercap_2007  2 non-null      float64
dtypes: float64(12)
memory usage: 288.0+ bytes


* This is a DataFrame
* Two rows named `Australia` and `New Zealand`
* Twelve columns, each of which has two actual 64-bit floating point values
  - We will talk later about null values, which are used to represent missing observations
* Uses at least 288 bytes of memory (the `+` in the output means the true figure may be higher)
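
The individual facts that `info()` prints can also be retrieved as values, which helps when you want to check them in code. The sketch below assumes the same in-memory subset of the Oceania data, so its numbers differ from the full twelve-column output above.

```python
import io

import pandas as pd

# A four-column subset of the Oceania data shown above (illustrative only).
CSV_TEXT = """country,gdpPercap_1952,gdpPercap_1957,gdpPercap_2002,gdpPercap_2007
Australia,10039.59564,10949.64959,30687.75473,34435.36744
New Zealand,10556.57566,12247.39532,23189.80135,25185.00911
"""
data = pd.read_csv(io.StringIO(CSV_TEXT), index_col="country")

# Number of non-null values in each column.
non_null_counts = data.count()
print(non_null_counts)

# Data type of each column.
dtypes = data.dtypes
print(dtypes)

# Memory used by the columns and the index, in bytes (a lower bound).
total_bytes = data.memory_usage(index=True).sum()
print(total_bytes)
```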

#### 4. The `DataFrame.columns` variable stores information about the dataframe's columns

* Note that this is data, _not_ a method (it doesn't have parentheses).
  - Like `math.pi`
  - So do not use `()` to call it

In [None]:
print(data.columns)

Index(['gdpPercap_1952', 'gdpPercap_1957', 'gdpPercap_1962', 'gdpPercap_1967',
       'gdpPercap_1972', 'gdpPercap_1977', 'gdpPercap_1982', 'gdpPercap_1987',
       'gdpPercap_1992', 'gdpPercap_1997', 'gdpPercap_2002', 'gdpPercap_2007'],
      dtype='object')
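
The short sketch below illustrates what happens if `columns` is called like a method by mistake. It uses an in-memory subset of the Oceania data, and the variable names are illustrative.

```python
import io

import pandas as pd

# A two-column subset of the Oceania data shown above (illustrative only).
CSV_TEXT = """country,gdpPercap_1952,gdpPercap_2007
Australia,10039.59564,34435.36744
New Zealand,10556.57566,25185.00911
"""
data = pd.read_csv(io.StringIO(CSV_TEXT), index_col="country")

# Correct: columns is an attribute, so no parentheses.
column_names = list(data.columns)
print(column_names)

# Incorrect: calling it like a method raises a TypeError.
try:
    data.columns()
    call_error = None
except TypeError as err:
    call_error = str(err)
    print("TypeError:", call_error)
```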


#### 5. Use `DataFrame.T` to transpose a dataframe:

* Sometimes we want to treat columns as rows and vice versa.
* Transpose (written `.T`) doesn't copy the data, just changes the program's view of it.
* Like `columns`, it is a member variable, so it is used without parentheses.


In [None]:
data.T

country           Australia  New Zealand
gdpPercap_1952  10039.59564  10556.57566
gdpPercap_1957  10949.64959  12247.39532
gdpPercap_1962  12217.22686    13175.678
gdpPercap_1967  14526.12465  14463.91893
gdpPercap_1972  16788.62948  16046.03728
gdpPercap_1977  18334.19751   16233.7177
gdpPercap_1982  19477.00928   17632.4104
gdpPercap_1987  21888.88903  19007.19129
gdpPercap_1992  23424.76683  18363.32494
gdpPercap_1997  26997.93657  21050.41377
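
A quick way to confirm what transposing does is to compare shapes and labels before and after. The sketch below assumes an in-memory subset of the Oceania data with illustrative variable names.

```python
import io

import pandas as pd

# A four-column subset of the Oceania data shown above (illustrative only).
CSV_TEXT = """country,gdpPercap_1952,gdpPercap_1957,gdpPercap_2002,gdpPercap_2007
Australia,10039.59564,10949.64959,30687.75473,34435.36744
New Zealand,10556.57566,12247.39532,23189.80135,25185.00911
"""
data = pd.read_csv(io.StringIO(CSV_TEXT), index_col="country")

transposed = data.T

# Rows and columns swap places, so the shape is reversed.
print(data.shape, transposed.shape)

# The old row labels become the new column labels.
print(list(transposed.columns))

# The same value is reachable with the labels swapped.
same_value = transposed.loc["gdpPercap_1952", "Australia"] == data.loc["Australia", "gdpPercap_1952"]
print(same_value)
```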


#### 6. Use `DataFrame.describe()` to get summary statistics about data

`DataFrame.describe()` returns summary statistics for only the columns that hold numerical data. All other columns are ignored unless you pass the argument `include='all'`.

In [None]:
data.describe()

Unnamed: 0,gdpPercap_1952,gdpPercap_1957,gdpPercap_1962,gdpPercap_1967,gdpPercap_1972,gdpPercap_1977,gdpPercap_1982,gdpPercap_1987,gdpPercap_1992,gdpPercap_1997,gdpPercap_2002,gdpPercap_2007
count,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0
mean,10298.08565,11598.522455,12696.45243,14495.02179,16417.33338,17283.957605,18554.70984,20448.04016,20894.045885,24024.17517,26938.77804,29810.188275
std,365.560078,917.644806,677.727301,43.986086,525.09198,1485.263517,1304.328377,2037.668013,3578.979883,4205.533703,5301.85368,6540.991104
min,10039.59564,10949.64959,12217.22686,14463.91893,16046.03728,16233.7177,17632.4104,19007.19129,18363.32494,21050.41377,23189.80135,25185.00911
25%,10168.840645,11274.086022,12456.839645,14479.47036,16231.68533,16758.837652,18093.56012,19727.615725,19628.685413,22537.29447,25064.289695,27497.598692
50%,10298.08565,11598.522455,12696.45243,14495.02179,16417.33338,17283.957605,18554.70984,20448.04016,20894.045885,24024.17517,26938.77804,29810.188275
75%,10427.330655,11922.958888,12936.065215,14510.57322,16602.98143,17809.077557,19015.85956,21168.464595,22159.406358,25511.05587,28813.266385,32122.777857
max,10556.57566,12247.39532,13175.678,14526.12465,16788.62948,18334.19751,19477.00928,21888.88903,23424.76683,26997.93657,30687.75473,34435.36744
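
The effect of `include='all'` is easiest to see on a table that has a non-numeric column. The sketch below uses three rows and three GDP columns taken from the Americas data that appears in Exercise 1, which includes a text `continent` column. The variable names are illustrative.

```python
import io

import pandas as pd

# Three rows of the Americas data, with a text column (illustrative subset).
CSV_TEXT = """country,continent,gdpPercap_1952,gdpPercap_2002,gdpPercap_2007
Argentina,Americas,5911.315053,8797.640716,12779.379640
Bolivia,Americas,2677.326347,3413.262690,3822.137084
Brazil,Americas,2108.944355,8131.212843,9065.800825
"""
americas = pd.read_csv(io.StringIO(CSV_TEXT), index_col="country")

# Default: only numeric columns are summarised.
numeric_summary = americas.describe()
print(list(numeric_summary.columns))

# include="all": text columns are summarised too (count, unique, top, freq).
full_summary = americas.describe(include="all")
print(list(full_summary.columns))
print(full_summary.loc["top", "continent"])
```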


EXERCISE 1: READING OTHER DATA

Read the data in `gapminder_gdp_americas.csv` into a variable called `americas` and display its summary statistics

In [None]:
path = 'https://raw.githubusercontent.com/MaastrichtU-IDS/coding4all/main/data/gapminder_gdp_americas.csv?token=AMA7DPJFRN6RH4UV6KKMPNDAKRRYY'

americas = pd.read_csv(path, index_col= "country")
print(americas)
# stats
americas.describe()

                    continent  gdpPercap_1952  ...  gdpPercap_2002  gdpPercap_2007
country                                        ...                                
Argentina            Americas     5911.315053  ...     8797.640716    12779.379640
Bolivia              Americas     2677.326347  ...     3413.262690     3822.137084
Brazil               Americas     2108.944355  ...     8131.212843     9065.800825
Canada               Americas    11367.161120  ...    33328.965070    36319.235010
Chile                Americas     3939.978789  ...    10778.783850    13171.638850
Colombia             Americas     2144.115096  ...     5755.259962     7006.580419
Costa Rica           Americas     2627.009471  ...     7723.447195     9645.061420
Cuba                 Americas     5586.538780  ...     6340.646683     8948.102923
Dominican Republic   Americas     1397.717137  ...     4563.808154     6025.374752
Ecuador              Americas     3522.110717  ...     5773.044512     6873.262326
El S

Unnamed: 0,gdpPercap_1952,gdpPercap_1957,gdpPercap_1962,gdpPercap_1967,gdpPercap_1972,gdpPercap_1977,gdpPercap_1982,gdpPercap_1987,gdpPercap_1992,gdpPercap_1997,gdpPercap_2002,gdpPercap_2007
count,25.0,25.0,25.0,25.0,25.0,25.0,25.0,25.0,25.0,25.0,25.0,25.0
mean,4079.062552,4616.043733,4901.54187,5668.253496,6491.334139,7352.007126,7506.737088,7793.400261,8044.934406,8889.300863,9287.677107,11003.031625
std,3001.727522,3312.381083,3421.740569,4160.88556,4754.404329,5355.602518,5530.490471,6665.039509,7047.089191,7874.225145,8895.817785,9713.209302
min,1397.717137,1544.402995,1662.137359,1452.057666,1654.456946,1874.298931,2011.159549,1823.015995,1456.309517,1341.726931,1270.364932,1201.637154
25%,2428.237769,2487.365989,2750.364446,3242.531147,4031.408271,4756.763836,4258.503604,4140.442097,4439.45084,4684.313807,4858.347495,5728.353514
50%,3048.3029,3780.546651,4086.114078,4643.393534,5305.445256,6281.290855,6434.501797,6360.943444,6618.74305,7113.692252,6994.774861,8948.102923
75%,3939.978789,4756.525781,5180.75591,5788.09333,6809.40669,7674.929108,8997.897412,7807.095818,8137.004775,9767.29753,8797.640716,11977.57496
max,13990.48208,14847.12712,16173.14586,19530.36557,21806.03594,24072.63213,25009.55914,29884.35041,32003.93224,35767.43303,39097.09955,42951.65309


EXERCISE 2: INSPECTING DATA

After reading the data for the Americas, use `help(americas.head)` and `help(americas.tail)` to find out what `DataFrame.head` and `DataFrame.tail` do
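
As a hint, the sketch below applies `head` and `tail` to five rows copied from the Americas output above. Both methods return five rows by default, so the example passes `2` to make the difference visible on a small table. The variable names are illustrative.

```python
import io

import pandas as pd

# Five rows of the Americas data shown above (illustrative subset).
CSV_TEXT = """country,continent,gdpPercap_1952,gdpPercap_2007
Argentina,Americas,5911.315053,12779.379640
Bolivia,Americas,2677.326347,3822.137084
Brazil,Americas,2108.944355,9065.800825
Canada,Americas,11367.161120,36319.235010
Chile,Americas,3939.978789,13171.638850
"""
americas = pd.read_csv(io.StringIO(CSV_TEXT), index_col="country")

# First two rows of the table.
first_rows = americas.head(2)
print(first_rows)

# Last two rows of the table.
last_rows = americas.tail(2)
print(last_rows)
```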

EXERCISE 3: READING FILES IN OTHER DIRECTORIES

The data for your current project is stored in a file called `microbes.csv`, which is located in a folder called `field_data`. You are doing analysis in a notebook called `analysis.ipynb` in a sibling folder called `thesis`:
```
your_home_directory
+-- field_data/
|   +-- microbes.csv
+-- thesis/
    +-- analysis.ipynb
```

What path should you pass to `pd.read_csv` to read `microbes.csv` from `analysis.ipynb`?

SOL:
We need to specify the path to the file of interest in the call to `pd.read_csv`. We first need to 'jump' out of the folder `thesis` using `../` and then into the folder `field_data` using `field_data/`. Then we can specify the filename `microbes.csv`. The result is as follows:

`data_microbes = pd.read_csv('../field_data/microbes.csv')`
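
To check how a relative path such as `../field_data/microbes.csv` resolves, you can normalise it against the notebook's folder. The sketch below only manipulates path strings with the standard library and does not touch the file system. The folder names come from the exercise, and the variable names are illustrative.

```python
import posixpath

# Folder that contains analysis.ipynb, as in the exercise.
notebook_dir = "your_home_directory/thesis"

# Relative path passed to pd.read_csv in the solution.
relative_path = "../field_data/microbes.csv"

# Join the two and collapse the '..' step.
resolved = posixpath.normpath(posixpath.join(notebook_dir, relative_path))
print(resolved)  # your_home_directory/field_data/microbes.csv
```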


EXERCISE 4: WRITING DATA

As well as the `read_csv` function for reading data from a file, Pandas provides a `to_csv` function to write dataframes to files. Applying what you’ve learned about reading from files, write one of your dataframes to a file called `processed.csv`. You can use `help` to get information on how to use `to_csv`.

SOL:

In order to write the DataFrame `americas` to a file called `processed.csv`, execute the following command:

In [None]:
americas.to_csv('processed.csv')
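
To confirm that the file was written as expected, one option is to read it back and compare it with the original. The sketch below writes to a temporary directory so that it leaves no files behind. It uses three rows of the Americas data shown earlier, and the variable names are illustrative.

```python
import io
import os
import tempfile

import pandas as pd

# Three rows of the Americas data shown earlier (illustrative subset).
CSV_TEXT = """country,continent,gdpPercap_1952,gdpPercap_2007
Argentina,Americas,5911.315053,12779.379640
Bolivia,Americas,2677.326347,3822.137084
Brazil,Americas,2108.944355,9065.800825
"""
americas = pd.read_csv(io.StringIO(CSV_TEXT), index_col="country")

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "processed.csv")
    # to_csv writes the index (country) as the first column.
    americas.to_csv(out_path)
    # Read the file back, using the same column as the index.
    restored = pd.read_csv(out_path, index_col="country")

print(restored.shape == americas.shape)
print(list(restored.columns) == list(americas.columns))
```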

In [None]:
def print_egg_label(mass):
    #egg sizing machinery prints a label
    if mass >= 90:
        return "warning: egg might be dirty"
    elif mass >= 85:
        return "jumbo"
    elif mass >= 70:
        return "large"
    elif mass < 70 and mass >= 55:
        return "medium"
    elif mass < 50:
        return "too light, probably spoiled"
    else:
        return "small"

In [None]:
import random

for i in range(10):
    # random mass between 50 and 90 grams
    mass = 70 + 20.0 * (2.0 * random.random() - 1.0)
    print(mass, print_egg_label(mass))

#### 8. Encapsulating Data Analysis

In [None]:
import pandas as pd

df = pd.read_csv('data/gapminder_gdp_asia.csv', index_col=0)
japan = df.loc['Japan']

In [None]:
def avg_gdp_in_decade(country, continent, year):
    """Return the mean GDP per capita of `country` over the decade containing `year`."""
    df = pd.read_csv('data/gapminder_gdp_' + continent + '.csv', index_col=0)
    c = df.loc[country]
    # e.g. 1983 -> 'gdpPercap_198', which matches columns such as gdpPercap_1982
    gdp_decade = 'gdpPercap_' + str(year // 10)
    total = 0.0
    num_years = 0
    for yr_header in c.index:  # c's index contains reported years
        if yr_header.startswith(gdp_decade):
            total = total + c.loc[yr_header]
            num_years = num_years + 1
    return total / num_years

In [None]:
avg_gdp_in_decade('Japan','asia',1983)
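
The same calculation can be tried without the `data/` files. The sketch below is a variation, not the function above: the hypothetical helper `avg_gdp_in_decade_from_frame` takes an already loaded dataframe and replaces the explicit loop with a boolean mask. It uses four Oceania columns copied from the transposed output earlier in this section.

```python
import io

import pandas as pd

# Four columns of the Oceania data shown earlier (illustrative subset).
CSV_TEXT = """country,gdpPercap_1977,gdpPercap_1982,gdpPercap_1987,gdpPercap_1992
Australia,18334.19751,19477.00928,21888.88903,23424.76683
New Zealand,16233.7177,17632.4104,19007.19129,18363.32494
"""


def avg_gdp_in_decade_from_frame(df, country, year):
    """Mean GDP per capita of `country` over the decade containing `year`."""
    row = df.loc[country]
    # e.g. 1983 -> 'gdpPercap_198'
    prefix = "gdpPercap_" + str(year // 10)
    # Keep only the entries whose label starts with the decade prefix.
    in_decade = row[row.index.str.startswith(prefix)]
    return in_decade.mean()


oceania = pd.read_csv(io.StringIO(CSV_TEXT), index_col="country")
australia_1980s = avg_gdp_in_decade_from_frame(oceania, "Australia", 1983)
print(australia_1980s)
```

If no column matches the decade, this variation returns `NaN`, whereas the loop version above would raise a `ZeroDivisionError`.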
