![Practicum AI Logo image](https://github.com/PracticumAI/practicumai.github.io/blob/main/images/logo/PracticumAI_logo_250x50.png?raw=true)  <img src='https://github.com/PracticumAI/practicumai.github.io/blob/main/images/icons/practicumai_python.png?raw=true' align='right' width=50>

# *Practicum AI Python*: Data Wrangling - Part 2

This exercise is adapted from Lipp et al. (2020) <i>The Data Wrangling Workshop</i> from <a href="https://www.packtpub.com/product/the-data-wrangling-workshop-second-edition/9781839215001">Packt Publishers</a> and the <a href="https://github.com/swcarpentry/python-novice-gapminder">Software Carpentry</a> lesson on plotting and programming in Python.


***

## 1. Use the Pandas library to do statistics on tabular data

Pandas has a lot of built-in functionality for statistics and visualization. Many of these functions are highly optimized and run quickly.

`matplotlib` is another common visualization library, and we will explore some of it too.

First, let's make a dataframe for this exercise.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# Used in Jupyter to make graphs display in notebooks.

In [None]:
people = ['Ann', 'Brandon', 'Chen', 'David', 'Emily',
         'Farook', 'Gagen', 'Hamish', 'Imran', 'Joseph',
         'Katherine', 'Lily']

age    = [21, 12, 32, 45, 37, 18, 28, 52, 5 ,40, 48, 15]
weight = [55, 35, 77, 68, 70, 60, 72, 69, 18, 65, 82, 48]
height = [160, 135, 170, 165, 173, 168, 175, 159, 105, 171, 155, 158]

people_dict = { 'People':people, 'Age':age, 'Weight':weight, 'Height':height}
df = pd.DataFrame(data = people_dict)
df

### 1.1 Explore dataframe `.shape`, `.count()`, `.sum()`, `.describe()`

As noted, there are a lot of methods that can be used with dataframes. Let's explore some here.

In [None]:
df.shape # Note .shape does not have (), shape is an attribute

In [None]:
df['Age'].count()

In [None]:
# Can everyone go in the elevator at once?

df['Weight'].sum()

In [None]:
df['Height'].mean()

There are also `.max()`, `.min()`, `.std()`, `.median()` and others, but `.describe()` shows lots of information about data distributions:

In [None]:
df.describe()
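To compare just a few of these statistics side by side, one option is `DataFrame.agg`, which accepts a list of statistic names. The sketch below rebuilds the numeric columns of the example dataframe so that it runs on its own; the variable name `summary` is only for illustration.

```python
import pandas as pd

# Rebuild the numeric columns from the example above so this cell is self-contained.
df = pd.DataFrame({
    'Age':    [21, 12, 32, 45, 37, 18, 28, 52, 5, 40, 48, 15],
    'Weight': [55, 35, 77, 68, 70, 60, 72, 69, 18, 65, 82, 48],
    'Height': [160, 135, 170, 165, 173, 168, 175, 159, 105, 171, 155, 158],
})

# Compute several statistics at once; each row of the result is one statistic.
summary = df.agg(['min', 'max', 'median', 'std'])
print(summary)
```

A single value can then be looked up by statistic and column, for example `summary.loc['median', 'Age']`.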

## 2. Visualizing data with Pandas

Pandas has a number of built-in visualization functions, which use Matplotlib behind the scenes by default.

<div style="padding: 10px;margin-bottom: 20px;border: thin solid #30335D;border-left-width: 10px;background-color: #fff"><strong>Note:</strong> Matplotlib is but one Python plotting option.  If you are already familiar with ggplot and the grammar of graphics, the plotnine library is the Python equivalent of this popular R package.  Seaborn is also a great Python graphing library.</div>

In [None]:
df['Weight'].hist()

In [None]:
df.plot.scatter('Weight', 'Height')

In [None]:
# We can also modify the output using matplotlib options
df.plot.scatter('Weight', 'Height', s=150, c='orange')

plt.grid(True)
plt.title('Weight vs. Height scatter plot', fontsize=15)
plt.xlabel('Weight (in kg)', fontsize=10)
plt.ylabel('Height (in cm)', fontsize=10)

In [None]:
# Pair plots, also known as scatter matrices, are a great way to visualize all your numeric data

pd.plotting.scatter_matrix(df)

### 2.1 Saving plots

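We may want to save the output of a plot. This can be done using the `plt.savefig()` function, which saves the current figure. Unless you specify a format explicitly, it is determined from the filename extension you provide. In a notebook that uses `%matplotlib inline`, call `plt.savefig()` in the same cell that creates the plot: figures are closed at the end of each cell, so calling it in a later cell saves an empty figure.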

In [None]:
df.plot.scatter('Weight', 'Height')
plt.savefig('plot.pdf')  # Save in the same cell that draws the plot
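Another option, sketched below, is to keep a handle on the figure and save through that handle, which avoids depending on which figure is "current". `DataFrame.plot.scatter` returns a Matplotlib `Axes`, and its `get_figure()` method gives the parent figure. The filename `weight_vs_height.png`, the `dpi` value, and the shortened data are arbitrary choices for illustration.

```python
import pandas as pd

# A shortened version of the example data, so this cell is self-contained.
df = pd.DataFrame({
    'Weight': [55, 35, 77, 68, 70, 60],
    'Height': [160, 135, 170, 165, 173, 168],
})

# The plotting call returns the Axes it drew on.
ax = df.plot.scatter(x='Weight', y='Height')

# Save through the parent Figure; the .png extension selects the output format.
fig = ax.get_figure()
fig.savefig('weight_vs_height.png', dpi=150)
```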

## 3. Use `DataFrame.info` to find out more about a dataframe

In [None]:
df.info()

## 4. The `DataFrame.columns` variable stores information about the dataframe's columns

* Note that this is an attribute, *not* a method.
    * Like `math.pi`.
    * So do not use `()` to try to call it.
* Called a *member variable*, or just *member*.

In [None]:
df.columns
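As a small illustration of the difference between an attribute and a method, the sketch below accesses `columns` without parentheses and then shows what happens when parentheses are added by mistake. The two-row dataframe is a cut-down stand-in for the example data.

```python
import pandas as pd

df = pd.DataFrame({'People': ['Ann', 'Brandon'], 'Age': [21, 12]})

# Attribute access: no parentheses.
print(list(df.columns))

# Adding parentheses tries to call the Index object, which is not callable.
try:
    df.columns()
except TypeError as err:
    print('TypeError:', err)
```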

## 5. Use `DataFrame.T` to transpose a dataframe

* Sometimes we want to treat columns as rows and vice versa.
* Transpose (written `.T`) doesn't copy the data, just changes the program's view of it.
* Like `columns`, it is a member variable.

In [None]:
df.T
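One quick way to confirm what a transpose did is to compare shapes and labels before and after. The sketch below uses a four-row stand-in for the example data; `transposed` is just an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    'People': ['Ann', 'Brandon', 'Chen', 'David'],
    'Age': [21, 12, 32, 45],
    'Weight': [55, 35, 77, 68],
})

transposed = df.T

print(df.shape)                # (rows, columns) of the original
print(transposed.shape)        # the two numbers swap places
print(list(transposed.index))  # original column names are now row labels
```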

***

## Bonus Questions

#### Q1: Reading Other Data

Read the data in `gapminder_gdp_americas.csv` (which should be in the same directory as `gapminder_gdp_oceania.csv`) into a variable called `americas` and display its summary statistics.

**Solution**

Click on the '...' below to show the solution.

In [None]:
# To read in a CSV, we use `pd.read_csv` and pass the filename 
# 'data/gapminder_gdp_americas.csv' to it. We also once again pass the column 
# name 'country' to the parameter `index_col` in order to index by country: 

americas = pd.read_csv('data/gapminder_gdp_americas.csv', index_col='country')

# Display the summary statistics:
americas.describe()

#### Q2: Inspecting Data 

After reading the data for the Americas, use `help(americas.head)` and `help(americas.tail)` to find out what `DataFrame.head` and `DataFrame.tail` do.

1. What method call will display the first three rows of this data?
2. What method call will display the last three columns of this data? (Hint: you may need to change your view of the data)

**Solution**

Click on the '...' below to show the solution.

In [None]:
# 1. We can check out the first five rows of `americas` by executing 
# `americas.head()` (allowing us to view the head of the DataFrame). We can 
# specify the number of rows we wish to see by specifying the parameter `n` 
# in our call to `americas.head()`. To view the first three rows, execute:

americas.head(n=3)

# The output is then:

#          continent  gdpPercap_1952  gdpPercap_1957  gdpPercap_1962  \
# country
# Argentina  Americas     5911.315053     6856.856212     7133.166023
# Bolivia    Americas     2677.326347     2127.686326     2180.972546
# Brazil     Americas     2108.944355     2487.365989     3336.585802

#           gdpPercap_1967  gdpPercap_1972  gdpPercap_1977  gdpPercap_1982  \
# country
# Argentina     8052.953021     9443.038526    10079.026740     8997.897412
# Bolivia       2586.886053     2980.331339     3548.097832     3156.510452
# Brazil        3429.864357     4985.711467     6660.118654     7030.835878

#           gdpPercap_1987  gdpPercap_1992  gdpPercap_1997  gdpPercap_2002  \
# country
# Argentina     9139.671389     9308.418710    10967.281950     8797.640716
# Bolivia       2753.691490     2961.699694     3326.143191     3413.262690
# Brazil        7807.095818     6950.283021     7957.980824     8131.212843

#           gdpPercap_2007
# country
# Argentina    12779.379640
# Bolivia       3822.137084
# Brazil        9065.800825

# 2. To check out the last three rows of `americas`, we would use the command 
# `americas.tail(n=3)`, analogous to `head()` used above. However, here we want 
# to look at the last three columns, so we need to change our view and then use 
# `tail()`. To do so, we create a new DataFrame in which rows and columns are 
# switched:

americas_flipped = americas.T

# We can then view the last three columns of `americas` by viewing the last 
# three rows of `americas_flipped`:

americas_flipped.tail(n=3)

# The output is then:

# country        Argentina  Bolivia   Brazil   Canada    Chile Colombia  \
# gdpPercap_1997   10967.3  3326.14  7957.98  28954.9  10118.1  6117.36
# gdpPercap_2002   8797.64  3413.26  8131.21    33329  10778.8  5755.26
# gdpPercap_2007   12779.4  3822.14   9065.8  36319.2  13171.6  7006.58

# country        Costa Rica     Cuba Dominican Republic  Ecuador    ...     \
# gdpPercap_1997    6677.05  5431.99             3614.1  7429.46    ...
# gdpPercap_2002    7723.45  6340.65            4563.81  5773.04    ...
# gdpPercap_2007    9645.06   8948.1            6025.37  6873.26    ...

# country          Mexico Nicaragua   Panama Paraguay     Peru Puerto Rico  \
# gdpPercap_1997   9767.3   2253.02  7113.69   4247.4  5838.35     16999.4
# gdpPercap_2002  10742.4   2474.55  7356.03  3783.67  5909.02     18855.6
# gdpPercap_2007  11977.6   2749.32  9809.19  4172.84  7408.91     19328.7

# country        Trinidad and Tobago United States  Uruguay Venezuela
# gdpPercap_1997             8792.57       35767.4  9230.24   10165.5
# gdpPercap_2002             11460.6       39097.1     7727   8605.05
# gdpPercap_2007             18008.5       42951.7  10611.5   11415.8

# Note: we could have done the above in a single line of code by 'chaining' the commands:

americas.T.tail(n=3)
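The same pattern works on any dataframe. As a quick check that does not need the Gapminder files, the sketch below uses a small placeholder table; the row labels, column names, and integers are made up for illustration.

```python
import pandas as pd

# A small placeholder table: 4 rows, 5 columns of made-up integers.
toy = pd.DataFrame(
    [[1, 2, 3, 4, 5],
     [6, 7, 8, 9, 10],
     [11, 12, 13, 14, 15],
     [16, 17, 18, 19, 20]],
    index=['row_a', 'row_b', 'row_c', 'row_d'],
    columns=['c1', 'c2', 'c3', 'c4', 'c5'],
)

first_rows = toy.head(n=3)      # first three rows
last_columns = toy.T.tail(n=3)  # last three columns, shown as rows

print(first_rows)
print(last_columns)
```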



#### Q3: Reading Files in Other Directories

The data for your current project is stored in a file called `microbes.csv`, which is located in a folder called `field_data`. You are doing analysis in a notebook called `analysis.ipynb` in a sibling folder called `thesis`:

In [None]:
your home directory
+-- field_data/
|    +-- microbes.csv
+-- thesis/
     +-- analysis.ipynb

What value(s) should you pass to `read_csv` to read `microbes.csv` in `analysis.ipynb`?

**Solution**

Click on the '...' below to show the solution.

In [None]:
# We need to specify the path to the file of interest in the call to 
# `pandas.read_csv`. We first need to `jump` out of the folder `thesis` using
# `../` and then into the folder `field_data` using `field_data/`. Then we 
# can specify the filename `microbes.csv`.
#
# The result is as follows:

data_microbes = pd.read_csv('../field_data/microbes.csv')
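Relative paths can also be assembled with the standard-library `pathlib` module, and `pandas.read_csv` accepts the resulting `Path` object in place of a string. The sketch below only builds the path and reads the file if it is present, since `microbes.csv` is a hypothetical file from the exercise; `microbes_path` is an illustrative name.

```python
from pathlib import Path

import pandas as pd

# '..' steps up out of `thesis`; the remaining parts step down into `field_data`.
microbes_path = Path('..') / 'field_data' / 'microbes.csv'
print(microbes_path.as_posix())

# Only attempt the read when the file is actually present.
if microbes_path.exists():
    data_microbes = pd.read_csv(microbes_path)
    print(data_microbes.head())
```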

#### Q4: Writing Data

As well as the `read_csv` function for reading data from a file, Pandas provides a `to_csv` method to write dataframes to files. Applying what you've learned about reading from files, write one of your dataframes to a file called `processed.csv`. You can use `help` to get information on how to use `to_csv`.

**Solution**

Click on the '...' below to show the solution.

In [None]:
# In order to write the DataFrame `americas` to a file called `processed.csv`, 
# execute the following command:

americas.to_csv('processed.csv')

# For help on `to_csv`, you could execute, for example,

help(americas.to_csv)

# Note that `help(to_csv)` raises a NameError! `to_csv` is not a standalone 
# function; it is a method of DataFrame objects, so it must be accessed 
# through one, as in `americas.to_csv`. 
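To check that a file was written as intended, one approach is to read it straight back in. The sketch below is a minimal round trip under a few assumptions: it uses a three-row stand-in dataframe indexed by `People`, and it writes into a temporary directory so that no file is left behind.

```python
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame(
    {'Age': [21, 12, 32], 'Weight': [55, 35, 77]},
    index=pd.Index(['Ann', 'Brandon', 'Chen'], name='People'),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / 'processed.csv'

    # By default, to_csv writes the index as the first column of the file.
    df.to_csv(out_path)

    # Reading it back with index_col restores the same row labels.
    restored = pd.read_csv(out_path, index_col='People')

print(restored)
```

If the index is not meaningful, passing `index=False` to `to_csv` leaves it out of the file.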