<a href="https://colab.research.google.com/github/CBIIT/python-carpentry-workshop/blob/main/python_reading_tabular_data_into_dataframes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Reading Tabular Data into DataFrames

## Use the Pandas library to do statistics on tabular data

- Pandas is a widely used Python library for statistics, particularly on tabular data.
- It borrows many features from R’s dataframes.
- Load it with `import pandas as pd`. The alias `pd` is commonly used for Pandas.
- Read a Comma Separated Values (CSV) data file with `pd.read_csv`.

In [None]:
import pandas as pd

# GDP per capita/person in different countries of Oceania from 1952 to 2007
data = pd.read_csv('https://raw.githubusercontent.com/CBIIT/python-carpentry-workshop/main/week3/gapminder_gdp_oceania.csv')
print(type(data))

<class 'pandas.core.frame.DataFrame'>
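`pd.read_csv` also accepts a file-like object, which is handy for experimenting without a network connection. The sketch below is illustrative only: the names `csv_text` and `sample` are made up for this example, and the text holds just two of the yearly columns, with values copied from the Oceania file.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV: a cut-down copy of the Oceania data.
csv_text = """country,gdpPercap_1952,gdpPercap_2007
Australia,10039.59564,34435.36744
New Zealand,10556.57566,25185.00911
"""

# read_csv accepts a file-like object as well as a path or URL.
sample = pd.read_csv(io.StringIO(csv_text))

print(type(sample))
print(sample.shape)  # (number of rows, number of columns)
```

As with the file read from the URL, the result is a `DataFrame`; here it has two rows and three columns.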


In [None]:
print(data)
data.head()
# The columns in a dataframe are the observed variables, and the rows are the observations.

       country  gdpPercap_1952  ...  gdpPercap_2002  gdpPercap_2007
0    Australia     10039.59564  ...     30687.75473     34435.36744
1  New Zealand     10556.57566  ...     23189.80135     25185.00911

[2 rows x 13 columns]


| | country | gdpPercap_1952 | gdpPercap_1957 | gdpPercap_1962 | gdpPercap_1967 | gdpPercap_1972 | gdpPercap_1977 | gdpPercap_1982 | gdpPercap_1987 | gdpPercap_1992 | gdpPercap_1997 | gdpPercap_2002 | gdpPercap_2007 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Australia | 10039.59564 | 10949.64959 | 12217.22686 | 14526.12465 | 16788.62948 | 18334.19751 | 19477.00928 | 21888.88903 | 23424.76683 | 26997.93657 | 30687.75473 | 34435.36744 |
| 1 | New Zealand | 10556.57566 | 12247.39532 | 13175.678 | 14463.91893 | 16046.03728 | 16233.7177 | 17632.4104 | 19007.19129 | 18363.32494 | 21050.41377 | 23189.80135 | 25185.00911 |


## Use `index_col` to specify that a column’s values should be used as row headings

- By default, the row headings are numbers (0 and 1 in this case).
- We would rather index the rows by country.
- To do this, pass the name of the column to `read_csv` as its `index_col` parameter.

In [None]:
data = pd.read_csv('https://raw.githubusercontent.com/CBIIT/python-carpentry-workshop/main/week3/gapminder_gdp_oceania.csv',
                   index_col='country')

print(data)

             gdpPercap_1952  gdpPercap_1957  ...  gdpPercap_2002  gdpPercap_2007
country                                      ...                                
Australia       10039.59564     10949.64959  ...     30687.75473     34435.36744
New Zealand     10556.57566     12247.39532  ...     23189.80135     25185.00911

[2 rows x 12 columns]
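The effect of `index_col` can be seen by reading the same text twice, once with the default index and once indexed by country. This is a minimal sketch on a made-up, cut-down copy of the data; the names `csv_text`, `default_index` and `by_country` are assumptions for illustration.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV: a cut-down copy of the Oceania data.
csv_text = """country,gdpPercap_1952,gdpPercap_2007
Australia,10039.59564,34435.36744
New Zealand,10556.57566,25185.00911
"""

default_index = pd.read_csv(io.StringIO(csv_text))
by_country = pd.read_csv(io.StringIO(csv_text), index_col='country')

print(list(default_index.index))  # row headings are integers
print(list(by_country.index))     # row headings are country names
print(list(by_country.columns))   # 'country' is no longer a data column
```

Because `country` moves from the columns into the index, the indexed dataframe has one column fewer, which matches the drop from 13 to 12 columns in the output above.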


## Use the `DataFrame.info()` method to find out more about a dataframe

In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2 entries, Australia to New Zealand
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   gdpPercap_1952  2 non-null      float64
 1   gdpPercap_1957  2 non-null      float64
 2   gdpPercap_1962  2 non-null      float64
 3   gdpPercap_1967  2 non-null      float64
 4   gdpPercap_1972  2 non-null      float64
 5   gdpPercap_1977  2 non-null      float64
 6   gdpPercap_1982  2 non-null      float64
 7   gdpPercap_1987  2 non-null      float64
 8   gdpPercap_1992  2 non-null      float64
 9   gdpPercap_1997  2 non-null      float64
 10  gdpPercap_2002  2 non-null      float64
 11  gdpPercap_2007  2 non-null      float64
dtypes: float64(12)
memory usage: 288.0+ bytes


- This is a `DataFrame`.
- It has two rows, named 'Australia' and 'New Zealand'.
- It has twelve columns, each of which holds two non-null 64-bit floating point values.
- It uses at least 288 bytes of memory (the `+` in the output marks the figure as a lower bound).
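Part of the memory figure reported by `data.info()` can be checked by hand: a 64-bit float takes 8 bytes, so 12 columns of 2 values occupy 12 × 2 × 8 = 192 bytes. The remainder is the row index, whose size depends on the Pandas version. The sketch below uses a stand-in dataframe of the same shape; the name `frame` and the zero values are placeholders, not the Gapminder data.

```python
import numpy as np
import pandas as pd

# Same shape and dtype as the Oceania data: 2 rows, 12 float64 columns.
years = range(1952, 2008, 5)  # 1952, 1957, ..., 2007
columns = [f'gdpPercap_{year}' for year in years]
frame = pd.DataFrame(np.zeros((2, 12)),
                     index=['Australia', 'New Zealand'],
                     columns=columns)

# Bytes used by the values alone, leaving out the row index.
value_bytes = frame.memory_usage(index=False).sum()
print(value_bytes)  # 12 columns x 2 rows x 8 bytes
```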

## The `DataFrame.columns` variable stores information about the dataframe’s columns

- Note that this is data, not a method, so it is written without parentheses.
- Such data is called a **member variable**, or just member.
  - A member variable is associated with a specific object and is accessible to all of that object’s methods (member functions).

In [None]:
print(data.columns)

Index(['gdpPercap_1952', 'gdpPercap_1957', 'gdpPercap_1962', 'gdpPercap_1967',
       'gdpPercap_1972', 'gdpPercap_1977', 'gdpPercap_1982', 'gdpPercap_1987',
       'gdpPercap_1992', 'gdpPercap_1997', 'gdpPercap_2002', 'gdpPercap_2007'],
      dtype='object')
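One way to see the difference between a member variable and a method is to ask Python whether each can be called. This is a small sketch on a made-up two-column dataframe; the name `frame` is an assumption for illustration.

```python
import pandas as pd

# Hypothetical cut-down copy of the Oceania data.
frame = pd.DataFrame(
    {'gdpPercap_1952': [10039.59564, 10556.57566],
     'gdpPercap_2007': [34435.36744, 25185.00911]},
    index=['Australia', 'New Zealand'],
)

print(callable(frame.columns))  # False: columns is stored data
print(callable(frame.info))     # True: info is a method
print(list(frame.columns))      # the column names themselves
```

Writing `frame.columns()` would raise a `TypeError`, because the `Index` object stored in `columns` is not callable.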


## Use `DataFrame.T` to transpose a dataframe

- Sometimes we want to treat columns as rows and vice versa.
- When every column has the same type, as here, transposing generally doesn’t copy the data; it just changes the program’s view of it.
- Like `columns`, it is a **member variable**.

In [None]:
print(data.T)

country           Australia  New Zealand
gdpPercap_1952  10039.59564  10556.57566
gdpPercap_1957  10949.64959  12247.39532
gdpPercap_1962  12217.22686  13175.67800
gdpPercap_1967  14526.12465  14463.91893
gdpPercap_1972  16788.62948  16046.03728
gdpPercap_1977  18334.19751  16233.71770
gdpPercap_1982  19477.00928  17632.41040
gdpPercap_1987  21888.88903  19007.19129
gdpPercap_1992  23424.76683  18363.32494
gdpPercap_1997  26997.93657  21050.41377
gdpPercap_2002  30687.75473  23189.80135
gdpPercap_2007  34435.36744  25185.00911
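The swap of rows and columns can be checked on a small example. This sketch uses a made-up three-column copy of the data; the names `frame` and `flipped` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical cut-down copy of the Oceania data: 2 rows, 3 columns.
frame = pd.DataFrame(
    {'gdpPercap_1952': [10039.59564, 10556.57566],
     'gdpPercap_1957': [10949.64959, 12247.39532],
     'gdpPercap_2007': [34435.36744, 25185.00911]},
    index=['Australia', 'New Zealand'],
)

flipped = frame.T  # no parentheses: T is a member variable

print(frame.shape, flipped.shape)  # the two dimensions trade places
print(list(flipped.index))         # the old column names
print(list(flipped.columns))       # the old row headings
```

Transposing twice gives back a dataframe equal to the original.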


## Use `DataFrame.describe()` to get summary statistics about data

- `DataFrame.describe()` gets the summary statistics of only the columns that have numerical data. All other columns are ignored, unless you use the argument `include='all'`.

In [None]:
print(data.describe())

       gdpPercap_1952  gdpPercap_1957  ...  gdpPercap_2002  gdpPercap_2007
count        2.000000        2.000000  ...        2.000000        2.000000
mean     10298.085650    11598.522455  ...    26938.778040    29810.188275
std        365.560078      917.644806  ...     5301.853680     6540.991104
min      10039.595640    10949.649590  ...    23189.801350    25185.009110
25%      10168.840645    11274.086022  ...    25064.289695    27497.598692
50%      10298.085650    11598.522455  ...    26938.778040    29810.188275
75%      10427.330655    11922.958888  ...    28813.266385    32122.777857
max      10556.575660    12247.395320  ...    30687.754730    34435.367440

[8 rows x 12 columns]
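The effect of `include='all'` only shows up when a dataframe has a non-numerical column. In the sketch below, `country` is deliberately left as an ordinary text column instead of being used as the index; the data is a made-up, cut-down copy of the Oceania file, and the names `csv_text`, `sample`, `numeric_only` and `everything` are assumptions for illustration.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV: a cut-down copy of the Oceania data.
csv_text = """country,gdpPercap_1952,gdpPercap_2007
Australia,10039.59564,34435.36744
New Zealand,10556.57566,25185.00911
"""

# No index_col here, so 'country' stays in the data as a text column.
sample = pd.read_csv(io.StringIO(csv_text))

numeric_only = sample.describe()
everything = sample.describe(include='all')

print(list(numeric_only.columns))  # text column is ignored
print(list(everything.columns))    # text column is included
```

For the text column, statistics such as the mean are not defined, so those cells show `NaN` in the `include='all'` table.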



## Homework

### Exercise: Writing Data

As well as the `read_csv` function for reading data from a file, Pandas provides a `to_csv` method for writing dataframes to files. Applying what you’ve learned about reading from files, write the dataframe `data` to a file called `processed.csv`. You can use `help` to get information on how to use `to_csv`.

### Exercise: Reading Other Data

Read the data in `gapminder_gdp_americas.csv` (which is stored at https://raw.githubusercontent.com/swcarpentry/python-novice-gapminder/gh-pages/data/gapminder_gdp_americas.csv) into a variable called `americas` and display its summary statistics.

### Exercise: Inspecting Data

After reading the data for the Americas, use `help(americas.head)` and `help(americas.tail)` to find out what `DataFrame.head` and `DataFrame.tail` do.

1. What method call will display the first three rows of this data?
2. What method call will display the last three columns of this data? (Hint: you may need to change your view of the data.)