# Tidying Data for Analysis

## 1. Recognising tidy data
For data to be tidy, it must have:

- Each variable as a separate column.
- Each row as a separate observation.

Data can be represented in many different ways, so it is important to be able to recognise tidy (and untidy) data.


Explore the structure of a DataFrame in the IPython Shell before performing operations on it. Doing so will not only strengthen your understanding of the data cleaning concepts, but will also help you take advantage of the relationship between working in the Shell and in the script.

In [1]:
import pandas as pd
airquality = pd.read_csv("datasets/airquality.csv")
airquality.head()

Unnamed: 0,Ozone,Solar.R,Wind,Temp,Month,Day
0,41.0,190.0,7.4,67,5,1
1,36.0,118.0,8.0,72,5,2
2,12.0,149.0,12.6,74,5,3
3,18.0,313.0,11.5,62,5,4
4,,,14.3,56,5,5


In [2]:
airquality.tail()

Unnamed: 0,Ozone,Solar.R,Wind,Temp,Month,Day
148,30.0,193.0,6.9,70,9,26
149,,145.0,13.2,77,9,27
150,14.0,191.0,14.3,75,9,28
151,18.0,131.0,8.0,76,9,29
152,20.0,223.0,11.5,68,9,30


In [3]:
airquality.shape

(153, 6)
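Beyond `.head()`, `.tail()`, and `.shape`, a few other attributes help when checking whether each column holds exactly one variable. The following is a minimal sketch on a small made-up DataFrame (`toy` and its values are illustrative stand-ins, not the real `airquality` file):

```python
import pandas as pd

# Hypothetical toy data standing in for a few rows of airquality
toy = pd.DataFrame({
    "Ozone": [41.0, 36.0, None],
    "Wind": [7.4, 8.0, 14.3],
    "Month": [5, 5, 5],
    "Day": [1, 2, 5],
})

column_names = list(toy.columns)        # one entry per variable
missing_per_column = toy.isna().sum()   # number of missing values in each column

print(column_names)
print(toy.dtypes)
print(missing_per_column)
```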

## 2. Reshaping data using melt
Melting data is the process of turning columns of data into rows of data. In the tidy `airquality` DataFrame, the variables `Ozone`, `Solar.R`, `Wind`, and `Temp` each have their own column. If you want these variables in rows instead, you can melt the DataFrame. Doing so, however, makes the data untidy! This is important to keep in mind: depending on how the data is represented and what you want to do with it, you will need to reshape it differently (e.g., a melted form could make it easier to plot values).

Use `pd.melt()`. There are two parameters you should be aware of: `id_vars` and `value_vars`. The `id_vars` are the columns that should not be melted (i.e., they keep their current shape), while the `value_vars` are the columns to be melted into rows. By default, if no `value_vars` are provided, all columns not listed in `id_vars` will be melted. This can save a bit of typing, depending on the number of columns that need to be melted.
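To see how `id_vars` and `value_vars` interact, here is a small sketch on a made-up DataFrame (`toy` and its values are assumptions for illustration). It contrasts passing `value_vars` explicitly with relying on the default:

```python
import pandas as pd

# Hypothetical toy data with two identifier columns and three measurements
toy = pd.DataFrame({
    "Month": [5, 5],
    "Day": [1, 2],
    "Ozone": [41.0, 36.0],
    "Wind": [7.4, 8.0],
    "Temp": [67, 72],
})

# Explicit value_vars: only Ozone and Wind are melted.
# Temp appears in neither list, so it is left out of the result.
melted_some = pd.melt(toy, id_vars=["Month", "Day"], value_vars=["Ozone", "Wind"])

# Default value_vars: every column not in id_vars is melted.
melted_all = pd.melt(toy, id_vars=["Month", "Day"])

print(melted_some)
print(melted_all.shape)
```

Note that a column named in neither parameter does not survive the explicit call, which is easy to overlook when only some columns are listed.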

Melt the `Ozone`, `Solar.R`, `Wind`, and `Temp` columns into rows.

In [4]:
# Melt airquality: airquality_melt
airquality_melt = pd.melt(frame=airquality, id_vars=["Month", "Day"])

In [5]:
# Print the head of airquality_melt
airquality_melt.head()

Unnamed: 0,Month,Day,variable,value
0,5,1,Ozone,41.0
1,5,2,Ozone,36.0
2,5,3,Ozone,12.0
3,5,4,Ozone,18.0
4,5,5,Ozone,


In [6]:
# Print the tail of airquality_melt
airquality_melt.tail()

Unnamed: 0,Month,Day,variable,value
607,9,26,Temp,70.0
608,9,27,Temp,77.0
609,9,28,Temp,75.0
610,9,29,Temp,76.0
611,9,30,Temp,68.0


In [7]:
airquality_melt.shape

(612, 4)

Melting a DataFrame is not always appropriate if the data needs to stay tidy. Which transformation to apply depends on how the data is represented.

## 3. Customizing melted data
When melting DataFrames, it is better to have column names that are more meaningful than `variable` and `value` (the default names used by `pd.melt()`).

The default names may work in certain situations, but it's best to always have data that is self-explanatory.

Rename the `variable` column by passing an argument to the `var_name` parameter, and the `value` column by passing an argument to the `value_name` parameter.

In [8]:
# Melt airquality: airquality_melt
airquality_melt = pd.melt(airquality, id_vars=['Month', 'Day'], var_name="measurement", value_name="reading")

In [9]:
# Print the head of airquality_melt
airquality_melt.head()

Unnamed: 0,Month,Day,measurement,reading
0,5,1,Ozone,41.0
1,5,2,Ozone,36.0
2,5,3,Ozone,12.0
3,5,4,Ozone,18.0
4,5,5,Ozone,


The DataFrame is more informative now. Next is pivoting, which is the opposite of melting.

## 4. Pivot data
Pivoting data is the opposite of melting it. Remember the tidy form that the `airquality` DataFrame was in before it was melted? Now pivot it back into that form using the `.pivot_table()` method!

While melting takes a set of columns and turns it into a single column, pivoting will create a new column for each unique value in a specified column.

`.pivot_table()` has an `index` parameter that specifies the columns that should not be pivoted; it is similar to the `id_vars` parameter of `pd.melt()`. Two other parameters have to be specified: `columns` (the name of the column to pivot) and `values` (the values to be used when the column is pivoted).
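As a quick illustration of how `index`, `columns`, and `values` fit together, here is a minimal sketch on a tiny made-up long-format DataFrame (`long_df` and its values are assumptions, not the real data):

```python
import pandas as pd

# Hypothetical long-format data: one row per (Month, Day, measurement)
long_df = pd.DataFrame({
    "Month": [5, 5, 5, 5],
    "Day": [1, 2, 1, 2],
    "measurement": ["Ozone", "Ozone", "Wind", "Wind"],
    "reading": [41.0, 36.0, 7.4, 8.0],
})

# Each unique value in 'measurement' becomes its own column,
# filled with the matching entries from 'reading'.
wide = long_df.pivot_table(index=["Month", "Day"],
                           columns="measurement",
                           values="reading")

print(wide)
```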

In [21]:
# Print the head of airquality_melt
print(airquality_melt.head())

   Month  Day measurement  reading
0      5    1       Ozone     41.0
1      5    2       Ozone     36.0
2      5    3       Ozone     12.0
3      5    4       Ozone     18.0
4      5    5       Ozone      NaN


In [14]:
# Pivot airquality_melt: airquality_pivot
airquality_pivot = airquality_melt.pivot_table(index=["Month", "Day"], columns="measurement", values="reading")

# Print the head of airquality_pivot
airquality_pivot.head()

Unnamed: 0_level_0,measurement,Ozone,Solar.R,Temp,Wind
Month,Day,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
5,1,41.0,190.0,67.0,7.4
5,2,36.0,118.0,72.0,8.0
5,3,12.0,149.0,74.0,12.6
5,4,18.0,313.0,62.0,11.5
5,5,,,56.0,14.3


## 5. Resetting the index of a DataFrame
After pivoting `airquality_melt`, we didn't quite get back the original DataFrame.

What we got back instead was a pandas DataFrame with a hierarchical index (also known as a MultiIndex).

Hierarchical indexes, in essence, group rows or columns by another variable; in this case, by `'Month'` as well as `'Day'`.

There's a very simple method for getting back to the flat layout of the original DataFrame from the pivoted DataFrame: `.reset_index()`.

In [15]:
# Print the index of airquality_pivot
print(airquality_pivot.index)

MultiIndex(levels=[[5, 6, 7, 8, 9], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]],
           codes=[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9

In [16]:
# Reset the index of airquality_pivot: airquality_pivot_reset
airquality_pivot_reset = airquality_pivot.reset_index()

# Print the new index of airquality_pivot_reset
print(airquality_pivot_reset.index)

RangeIndex(start=0, stop=153, step=1)


In [17]:
# Print the head of airquality_pivot_reset
airquality_pivot_reset.head()

measurement,Month,Day,Ozone,Solar.R,Temp,Wind
0,5,1,41.0,190.0,67.0,7.4
1,5,2,36.0,118.0,72.0,8.0
2,5,3,12.0,149.0,74.0,12.6
3,5,4,18.0,313.0,62.0,11.5
4,5,5,,,56.0,14.3
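One leftover detail in the output above is the `measurement` label in the header row: it is the name of the columns index, carried over from the pivot. A minimal sketch of one way to clear it, using a small made-up DataFrame (`long_df` and its values are illustrative assumptions):

```python
import pandas as pd

# Hypothetical long-format data, as in a melted DataFrame
long_df = pd.DataFrame({
    "Month": [5, 5, 5, 5],
    "Day": [1, 2, 1, 2],
    "measurement": ["Ozone", "Ozone", "Wind", "Wind"],
    "reading": [41.0, 36.0, 7.4, 8.0],
})

wide = long_df.pivot_table(index=["Month", "Day"],
                           columns="measurement",
                           values="reading").reset_index()

# The columns index still carries the name of the pivoted column
print(wide.columns.name)

# Setting the columns axis name to None removes that label
tidy = wide.rename_axis(columns=None)
print(tidy)
```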


## 6. Pivoting duplicate values
So far, we have used the `.pivot_table()` method when there are multiple index values we want to hold constant during a pivot. Pivot tables can also deal with duplicate values if we provide an aggregation function through the `aggfunc` parameter. Here, we combine both of these uses.

Let's say the data collection method accidentally duplicated the dataset. Explore the shapes of the DataFrames in the IPython Shell by accessing their `.shape` attributes to confirm that duplicate rows are present.

You'll see that by using `.pivot_table()` and the `aggfunc` parameter, we can not only reshape the data, but also remove duplicates. Finally, we can flatten the index of the pivoted DataFrame using `.reset_index()`.

In [22]:
import numpy as np
# Pivot airquality_melt with an explicit aggfunc: airquality_pivot
airquality_pivot = airquality_melt.pivot_table(index=["Month", "Day"], columns="measurement", values="reading", aggfunc=np.mean)

# Print the head of airquality_pivot before reset_index
airquality_pivot.head()

Unnamed: 0_level_0,measurement,Ozone,Solar.R,Temp,Wind
Month,Day,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
5,1,41.0,190.0,67.0,7.4
5,2,36.0,118.0,72.0,8.0
5,3,12.0,149.0,74.0,12.6
5,4,18.0,313.0,62.0,11.5
5,5,,,56.0,14.3


In [23]:
# Reset the index of airquality_pivot
airquality_pivot = airquality_pivot.reset_index()

# Print the head of airquality_pivot
airquality_pivot.head()

measurement,Month,Day,Ozone,Solar.R,Temp,Wind
0,5,1,41.0,190.0,67.0,7.4
1,5,2,36.0,118.0,72.0,8.0
2,5,3,12.0,149.0,74.0,12.6
3,5,4,18.0,313.0,62.0,11.5
4,5,5,,,56.0,14.3


In [24]:
# Print the head of airquality
airquality.head()

Unnamed: 0,Ozone,Solar.R,Wind,Temp,Month,Day
0,41.0,190.0,7.4,67,5,1
1,36.0,118.0,8.0,72,5,2
2,12.0,149.0,12.6,74,5,3
3,18.0,313.0,11.5,62,5,4
4,,,14.3,56,5,5


The default aggregation function used by `.pivot_table()` is the mean (`np.mean`), so we could have pivoted the duplicate values in this DataFrame even without explicitly specifying the `aggfunc` parameter.
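To make the effect of duplicated rows concrete, here is a small sketch that simulates the duplication by stacking a made-up DataFrame on itself (`long_df`, `long_dup`, and their values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical long-format data
long_df = pd.DataFrame({
    "Month": [5, 5, 5, 5],
    "Day": [1, 2, 1, 2],
    "measurement": ["Ozone", "Ozone", "Wind", "Wind"],
    "reading": [41.0, 36.0, 7.4, 8.0],
})

# Simulate an accidental duplication: every row now appears twice
long_dup = pd.concat([long_df, long_df], ignore_index=True)
print(long_df.shape, long_dup.shape)

# pivot_table aggregates the repeated (Month, Day, measurement) entries;
# the mean of identical readings is the reading itself.
wide = long_dup.pivot_table(index=["Month", "Day"],
                            columns="measurement",
                            values="reading",
                            aggfunc="mean")
print(wide)
```

Taking the mean only removes duplicates cleanly when the repeated readings are identical; if they differ, the result is an average rather than either original value.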

## 7. Splitting a column with `.str`
The dataset here consists of case counts of tuberculosis by country, year, gender, and age group.

Tidy the `'m014'` column, which represents males aged 0-14 years. In order to parse this value, extract the first letter into a new column for `gender`, and the rest into a column for `age_group`. Here, since we can parse values by position, we can take advantage of pandas' vectorized string slicing by using the `str` attribute of columns of type object.

Use the pandas `.columns` attribute to inspect the column names, and take note of the problematic columns.

In [26]:
tb = pd.read_csv("datasets/Tuberculosis.csv")
tb.tail()

Unnamed: 0,country,year,m014,m1524,m2534,m3544,m4554,m5564,m65,mu,f014,f1524,f2534,f3544,f4554,f5564,f65,fu
196,YE,2000,110.0,789.0,689.0,493.0,314.0,255.0,127.0,,161.0,799.0,627.0,517.0,345.0,247.0,92.0,
197,YU,2000,,,,,,,,,,,,,,,,
198,ZA,2000,116.0,723.0,1999.0,2135.0,1146.0,435.0,212.0,,122.0,1283.0,1716.0,933.0,423.0,167.0,80.0,
199,ZM,2000,349.0,2175.0,2610.0,3045.0,435.0,261.0,174.0,,150.0,932.0,1118.0,1305.0,186.0,112.0,75.0,
200,ZW,2000,,,,,,,,,,,,,,,,


In [29]:
# Melt tb: tb_melt
tb_melt = pd.melt(frame=tb, id_vars=["country", "year"])
tb_melt.head()

Unnamed: 0,country,year,variable,value
0,AD,2000,m014,0.0
1,AE,2000,m014,2.0
2,AF,2000,m014,52.0
3,AG,2000,m014,0.0
4,AL,2000,m014,2.0


In [30]:
# Create the 'gender' column
tb_melt['gender'] = tb_melt.variable.str[0]

# Create the 'age_group' column
tb_melt['age_group'] = tb_melt.variable.str[1:]

In [31]:
tb_melt.head()

Unnamed: 0,country,year,variable,value,gender,age_group
0,AD,2000,m014,0.0,m,014
1,AE,2000,m014,2.0,m,014
2,AF,2000,m014,52.0,m,014
3,AG,2000,m014,0.0,m,014
4,AL,2000,m014,2.0,m,014


Notice the new `'gender'` and `'age_group'` columns that were created. It is vital to be able to split columns as needed.
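Positional slicing through `.str` behaves like ordinary Python string slicing applied to every element. A minimal sketch on a made-up Series of column codes (the `codes` Series is an assumption for illustration):

```python
import pandas as pd

# Hypothetical codes in the same style as the tb column names
codes = pd.Series(["m014", "f1524", "mu"], name="variable")

gender = codes.str[0]      # first character of each string
age_group = codes.str[1:]  # everything after the first character

print(gender.tolist())
print(age_group.tolist())
```

Because the results are still strings, any leading zero in the age code is kept as part of the text.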

## 8. Splitting a column with `.split()` and `.get()`
Another common way multiple variables are stored in columns is with a delimiter. Here we are using a dataset consisting of [Ebola cases and death counts by state and country](https://data.humdata.org/dataset/ebola-cases-2014).

Print the columns of ebola in the IPython Shell using `ebola.columns`. Notice that the data has column names such as `Cases_Guinea` and `Deaths_Guinea`. Here, the underscore `_` serves as a delimiter between the first part (cases or deaths), and the second part (country).

This time, we cannot directly slice the variable by position. Instead, we need Python's built-in string method `.split()`. By default, this method splits a string on whitespace. In this case, however, we want it to split on an underscore. We can do this on `'Cases_Guinea'`, for example, using `'Cases_Guinea'.split('_')`, which returns the list `['Cases', 'Guinea']`.
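The behaviour of `.split()` on a single string can be checked in plain Python before applying it to a whole column. A short sketch (the variable names are illustrative):

```python
label = "Cases_Guinea"

# Split on the underscore delimiter
parts = label.split("_")
kind, country = parts  # unpack the two pieces

# With no argument, split() separates on whitespace instead
whitespace_parts = "Cases Guinea".split()

print(parts)
print(kind, country)
print(whitespace_parts)
```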

Extract the first element of this list and assign it to a `type` variable, and the second element of the list to a `country` variable. Accomplish this by accessing the `str` attribute of the column and using the `.get()` method to retrieve the `0` or `1` index, depending on the part needed.

In [37]:
ebola = pd.read_csv("datasets/ebola.csv")
print(ebola.tail())

          Date  Day  Cases_Guinea  Cases_Liberia  Cases_SierraLeone  \
117  3/27/2014    5         103.0            8.0                6.0   
118  3/26/2014    4          86.0            NaN                NaN   
119  3/25/2014    3          86.0            NaN                NaN   
120  3/24/2014    2          86.0            NaN                NaN   
121  3/22/2014    0          49.0            NaN                NaN   

     Cases_Nigeria  Cases_Senegal  Cases_UnitedStates  Cases_Spain  \
117            NaN            NaN                 NaN          NaN   
118            NaN            NaN                 NaN          NaN   
119            NaN            NaN                 NaN          NaN   
120            NaN            NaN                 NaN          NaN   
121            NaN            NaN                 NaN          NaN   

     Cases_Mali  Deaths_Guinea  Deaths_Liberia  Deaths_SierraLeone  \
117         NaN           66.0             6.0                 5.0   
118         

In [38]:
# Melt ebola: ebola_melt
ebola_melt = pd.melt(ebola, id_vars=["Date", "Day"], var_name="type_country", value_name="counts")
ebola_melt.head()

Unnamed: 0,Date,Day,type_country,counts
0,1/5/2015,289,Cases_Guinea,2776.0
1,1/4/2015,288,Cases_Guinea,2775.0
2,1/3/2015,287,Cases_Guinea,2769.0
3,1/2/2015,286,Cases_Guinea,
4,12/31/2014,284,Cases_Guinea,2730.0


In [39]:
# Create the 'str_split' column
ebola_melt['str_split'] = ebola_melt.type_country.str.split("_")

# Create the 'type' column
ebola_melt['type'] = ebola_melt.str_split.str.get(0)

# Create the 'country' column
ebola_melt['country'] = ebola_melt.str_split.str.get(1)

In [45]:
# Print the head of ebola_melt
ebola_melt.head()

Unnamed: 0,Date,Day,type_country,counts,str_split,type,country
0,1/5/2015,289,Cases_Guinea,2776.0,"[Cases, Guinea]",Cases,Guinea
1,1/4/2015,288,Cases_Guinea,2775.0,"[Cases, Guinea]",Cases,Guinea
2,1/3/2015,287,Cases_Guinea,2769.0,"[Cases, Guinea]",Cases,Guinea
3,1/2/2015,286,Cases_Guinea,,"[Cases, Guinea]",Cases,Guinea
4,12/31/2014,284,Cases_Guinea,2730.0,"[Cases, Guinea]",Cases,Guinea


It is a lot easier to make sense of the data now!
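As an aside, the intermediate list column is not strictly required. The sketch below repeats the `.str.split()` and `.str.get()` steps on a small made-up Series and then shows one possible alternative, the `expand=True` option of `.str.split()`, which returns the pieces as separate columns directly (the `labels` Series and variable names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical labels in the same style as the ebola column names
labels = pd.Series(["Cases_Guinea", "Deaths_Liberia"])

# Two-step approach: split into lists, then pick elements by position
parts = labels.str.split("_")
kind = parts.str.get(0)
country = parts.str.get(1)

# Alternative: expand=True returns a DataFrame with one column per piece
split_df = labels.str.split("_", expand=True)
split_df.columns = ["type", "country"]

print(kind.tolist(), country.tolist())
print(split_df)
```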

