<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Data-Transformation:-Combining-and-Structuring-Data" data-toc-modified-id="Data-Transformation:-Combining-and-Structuring-Data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Data Transformation: Combining and Structuring Data</a></span><ul class="toc-item"><li><span><a href="#Combining-Data" data-toc-modified-id="Combining-Data-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Combining Data</a></span><ul class="toc-item"><li><span><a href="#concat" data-toc-modified-id="concat-1.1.1"><span class="toc-item-num">1.1.1&nbsp;&nbsp;</span><code>concat</code></a></span><ul class="toc-item"><li><span><a href="#axis=0" data-toc-modified-id="axis=0-1.1.1.1"><span class="toc-item-num">1.1.1.1&nbsp;&nbsp;</span><code>axis=0</code></a></span></li><li><span><a href="#axis=1" data-toc-modified-id="axis=1-1.1.1.2"><span class="toc-item-num">1.1.1.2&nbsp;&nbsp;</span><code>axis=1</code></a></span></li><li><span><a href="#join=&quot;inner&quot;" data-toc-modified-id="join=&quot;inner&quot;-1.1.1.3"><span class="toc-item-num">1.1.1.3&nbsp;&nbsp;</span><code>join="inner"</code></a></span></li></ul></li><li><span><a href="#merge" data-toc-modified-id="merge-1.1.2"><span class="toc-item-num">1.1.2&nbsp;&nbsp;</span><code>merge</code></a></span></li><li><span><a href="#join" data-toc-modified-id="join-1.1.3"><span class="toc-item-num">1.1.3&nbsp;&nbsp;</span><code>join</code></a></span></li><li><span><a href="#Summary" data-toc-modified-id="Summary-1.1.4"><span class="toc-item-num">1.1.4&nbsp;&nbsp;</span>Summary</a></span></li><li><span><a href="#💡-Check-for-understanding" data-toc-modified-id="💡-Check-for-understanding-1.1.5"><span class="toc-item-num">1.1.5&nbsp;&nbsp;</span>💡 Check for understanding</a></span></li></ul></li><li><span><a href="#Structuring-Data-with-Pivot,-Stack/Unstack,-and-Melt" data-toc-modified-id="Structuring-Data-with-Pivot,-Stack/Unstack,-and-Melt-1.2"><span 
class="toc-item-num">1.2&nbsp;&nbsp;</span>Structuring Data with Pivot, Stack/Unstack, and Melt</a></span><ul class="toc-item"><li><span><a href="#Pivot" data-toc-modified-id="Pivot-1.2.1"><span class="toc-item-num">1.2.1&nbsp;&nbsp;</span>Pivot</a></span></li><li><span><a href="#Stack-and-Unstack" data-toc-modified-id="Stack-and-Unstack-1.2.2"><span class="toc-item-num">1.2.2&nbsp;&nbsp;</span>Stack and Unstack</a></span></li><li><span><a href="#Melt" data-toc-modified-id="Melt-1.2.3"><span class="toc-item-num">1.2.3&nbsp;&nbsp;</span>Melt</a></span></li><li><span><a href="#Summary" data-toc-modified-id="Summary-1.2.4"><span class="toc-item-num">1.2.4&nbsp;&nbsp;</span>Summary</a></span></li><li><span><a href="#💡-Check-for-understanding" data-toc-modified-id="💡-Check-for-understanding-1.2.5"><span class="toc-item-num">1.2.5&nbsp;&nbsp;</span>💡 Check for understanding</a></span></li></ul></li></ul></li></ul></div>

# Data Transformation: Combining and Structuring Data

## Combining Data

When working with data, you often encounter situations where you need to combine or merge multiple datasets to gain more insights or perform further analysis.

Pandas provides functions for [combining different data sets](http://pandas.pydata.org/pandas-docs/stable/merging.html) based on [relational algebra](https://en.wikipedia.org/wiki/Relational_algebra): `join`, `merge` and `concat`.

In [43]:
import pandas as pd
import numpy as np

In [2]:
# DataFrame 1: Sales information
df_sales = pd.DataFrame({
    'Date': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04'],
    'Product': ['A', 'B', 'C', 'D'],
    'Quantity_Sold': [100, 200, 150, 120]
})
df_sales

Unnamed: 0,Date,Product,Quantity_Sold
0,2023-01-01,A,100
1,2023-01-02,B,200
2,2023-01-03,C,150
3,2023-01-04,D,120


In [3]:
# DataFrame 2: Revenue information
df_revenue = pd.DataFrame({
    'Date': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-05'],
    'Revenue': [1000, 1500, 1200, 800]
})
df_revenue

Unnamed: 0,Date,Revenue
0,2023-01-01,1000
1,2023-01-02,1500
2,2023-01-03,1200
3,2023-01-05,800


In [4]:
# DataFrame 3: Costs information
df_costs = pd.DataFrame({
    'Date': ['2023-01-01', '2023-01-03', '2023-01-04', '2023-01-05'],
    'Costs': [500, 700, 600, 400]
})
df_costs

Unnamed: 0,Date,Costs
0,2023-01-01,500
1,2023-01-03,700
2,2023-01-04,600
3,2023-01-05,400


In [5]:
# DataFrame 4: Sales information for the following days
df_sales_2 = pd.DataFrame({
    'Date': ['2023-01-04', '2023-01-05', '2023-01-06', '2023-01-07', '2023-01-08'],
    'Product': ['A', 'A', 'B', 'C', 'D'],
    'Quantity_Sold': [100, 100, 200, 150, 120]
})
df_sales_2

Unnamed: 0,Date,Product,Quantity_Sold
0,2023-01-04,A,100
1,2023-01-05,A,100
2,2023-01-06,B,200
3,2023-01-07,C,150
4,2023-01-08,D,120


### `concat`

- `concat` is usually used when you want to combine two or more DataFrames vertically or horizontally.
- It is commonly used when you have data split across multiple files or sources and want to stack them together to create a larger dataset.
- Vertical concatenation is used when you want to add more rows to an existing DataFrame.
- Horizontal concatenation is used when you want to add more columns to an existing DataFrame.
- Example: combining monthly or yearly sales data. Suppose you have sales data for a retail store split across multiple files, one per month or year. You can use `concat` to stack these DataFrames vertically into a single DataFrame containing the complete sales data.

`pd.concat` is used to concatenate multiple DataFrames.
- The `axis` parameter determines the axis along which the DataFrames will be stacked. `axis=0` (the default) stacks the DataFrames vertically (along rows), while `axis=1` stacks them horizontally (along columns).

#### `axis=0`

In [None]:
pd.concat([df_revenue, df_costs], axis=1)
# For contrast: with axis=1 the rows are aligned on the index (0-3), not on 'Date', so the dates within a row can differ

Unnamed: 0,Date,Revenue,Date.1,Costs
0,2023-01-01,1000,2023-01-01,500
1,2023-01-02,1500,2023-01-03,700
2,2023-01-03,1200,2023-01-04,600
3,2023-01-05,800,2023-01-05,400


In [7]:
# Concatenate the sales, and sales_2 vertically (along rows)
pd.concat([df_sales, df_sales_2], axis=0)

Unnamed: 0,Date,Product,Quantity_Sold
0,2023-01-01,A,100
1,2023-01-02,B,200
2,2023-01-03,C,150
3,2023-01-04,D,120
0,2023-01-04,A,100
1,2023-01-05,A,100
2,2023-01-06,B,200
3,2023-01-07,C,150
4,2023-01-08,D,120
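
In the stacked result above, each DataFrame keeps its original row labels, so labels 0 to 3 appear twice. A minimal sketch of one way to get a clean running index, using the `ignore_index` parameter of `pd.concat`; the frames `first` and `second` are small illustrative stand-ins, not part of the lesson data:

```python
import pandas as pd

# Two small illustrative frames with the same columns (stand-ins for df_sales and df_sales_2)
first = pd.DataFrame({"Product": ["A", "B"], "Quantity_Sold": [100, 200]})
second = pd.DataFrame({"Product": ["C", "D"], "Quantity_Sold": [150, 120]})

# Default behaviour: each frame keeps its own labels, so 0 and 1 appear twice
stacked = pd.concat([first, second], axis=0)

# ignore_index=True discards the old labels and renumbers the rows from 0
renumbered = pd.concat([first, second], axis=0, ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```

Renumbering is usually what you want when the original labels carry no meaning, as with a default integer index.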


In [8]:
# Concatenate the sales, revenue, and costs DataFrames vertically (along rows)
pd.concat([df_sales, df_revenue, df_costs], axis=0)

Unnamed: 0,Date,Product,Quantity_Sold,Revenue,Costs
0,2023-01-01,A,100.0,,
1,2023-01-02,B,200.0,,
2,2023-01-03,C,150.0,,
3,2023-01-04,D,120.0,,
0,2023-01-01,,,1000.0,
1,2023-01-02,,,1500.0,
2,2023-01-03,,,1200.0,
3,2023-01-05,,,800.0,
0,2023-01-01,,,,500.0
1,2023-01-03,,,,700.0
2,2023-01-04,,,,600.0
3,2023-01-05,,,,400.0
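
The output above contains many NaN values because the default `join='outer'` keeps every column found in any of the DataFrames. As a small sketch, `join='inner'` keeps only the columns shared by all inputs; the frames `sales` and `revenue` below are shortened, illustrative versions of the lesson data:

```python
import pandas as pd

# Shortened illustrative frames: only 'Date' is common to both
sales = pd.DataFrame({"Date": ["2023-01-01", "2023-01-02"], "Quantity_Sold": [100, 200]})
revenue = pd.DataFrame({"Date": ["2023-01-01", "2023-01-05"], "Revenue": [1000, 800]})

# Default join='outer': union of the columns, gaps filled with NaN
outer = pd.concat([sales, revenue], axis=0)

# join='inner': only the columns present in every frame survive
inner = pd.concat([sales, revenue], axis=0, join="inner")

print(outer.columns.tolist())
print(inner.columns.tolist())
```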


#### `axis=1`

In [9]:
# Concatenate the sales, revenue, and costs DataFrames horizontally (along columns)
pd.concat([df_sales, df_revenue, df_costs], axis=1) # Notice that the dates are NOT the same for each table

Unnamed: 0,Date,Product,Quantity_Sold,Date.1,Revenue,Date.2,Costs
0,2023-01-01,A,100,2023-01-01,1000,2023-01-01,500
1,2023-01-02,B,200,2023-01-02,1500,2023-01-03,700
2,2023-01-03,C,150,2023-01-03,1200,2023-01-04,600
3,2023-01-04,D,120,2023-01-05,800,2023-01-05,400


In [11]:
# The date discrepancy is easier to see when df_sales_2 is included:
pd.concat([df_revenue, df_costs, df_sales_2], axis=1)

Unnamed: 0,Date,Revenue,Date.1,Costs,Date.2,Product,Quantity_Sold
0,2023-01-01,1000.0,2023-01-01,500.0,2023-01-04,A,100
1,2023-01-02,1500.0,2023-01-03,700.0,2023-01-05,A,100
2,2023-01-03,1200.0,2023-01-04,600.0,2023-01-06,B,200
3,2023-01-05,800.0,2023-01-05,400.0,2023-01-07,C,150
4,,,,,2023-01-08,D,120
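
Horizontal concatenation aligns rows on the index, which is why the dates above do not line up. One possible fix, sketched here under the assumption that each date appears only once per table, is to make 'Date' the index before concatenating; `revenue` and `costs` are local copies of the lesson data:

```python
import pandas as pd

revenue = pd.DataFrame({
    "Date": ["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-05"],
    "Revenue": [1000, 1500, 1200, 800],
})
costs = pd.DataFrame({
    "Date": ["2023-01-01", "2023-01-03", "2023-01-04", "2023-01-05"],
    "Costs": [500, 700, 600, 400],
})

# With 'Date' as the index, axis=1 lines the rows up by date instead of by position
aligned = pd.concat(
    [revenue.set_index("Date"), costs.set_index("Date")],
    axis=1,
).sort_index()

print(aligned)
```

Dates that exist in only one table get NaN in the other table's column, much like an outer join.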


### Join types

When joining two tables, we always keep all the columns from both tables.

Then, depending on the join type, we will:
- **Inner join**: Keep only rows present in both tables
- **Outer join**: Keep all rows present in either table
- **Left join**: Keep all rows present in left/first table
- **Right join**: Keep all rows present in right/second table

**Self-join** and **Cross-join** will be discussed during the SQL classes.
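
To make the four join types concrete, the sketch below counts how many rows each one keeps when the revenue and costs tables are combined on 'Date'. It uses `pd.merge`; the frames are local copies of the lesson data:

```python
import pandas as pd

revenue = pd.DataFrame({
    "Date": ["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-05"],
    "Revenue": [1000, 1500, 1200, 800],
})
costs = pd.DataFrame({
    "Date": ["2023-01-01", "2023-01-03", "2023-01-04", "2023-01-05"],
    "Costs": [500, 700, 600, 400],
})

# Number of rows kept by each join type
row_counts = {
    how: len(pd.merge(revenue, costs, on="Date", how=how))
    for how in ["inner", "outer", "left", "right"]
}
print(row_counts)
```

Three dates are common to both tables, each table has four dates, and five distinct dates exist overall, which is what the counts reflect.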

![](https://cdn.educba.com/academy/wp-content/uploads/2019/10/Types-of-Join-inSQL.jpg.webp)  
(Source: [Educba.com](https://educba.com))

### `merge`

Merge is used to combine DataFrames based on a common column. By default, `merge` performs an *inner join*, where only the matching rows between the DataFrames are included in the result.

In [12]:
df_revenue.Date

0    2023-01-01
1    2023-01-02
2    2023-01-03
3    2023-01-05
Name: Date, dtype: object

In [17]:
# Inner join (the default): only dates present in both DataFrames are kept
pd.merge(left=df_revenue, right=df_costs, on='Date')

Unnamed: 0,Date,Revenue,Costs
0,2023-01-01,1000,500
1,2023-01-03,1200,700
2,2023-01-05,800,400


In [None]:
# how='outer' is a full outer join: dates from either DataFrame are kept
pd.merge(left=df_revenue, right=df_costs, on='Date', how='outer')


Unnamed: 0,Date,Revenue,Costs
0,2023-01-01,1000.0,500.0
1,2023-01-02,1500.0,
2,2023-01-03,1200.0,700.0
3,2023-01-04,,600.0
4,2023-01-05,800.0,400.0
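
After an outer join it is often useful to know which table each row came from. A brief sketch using the `indicator=True` option of `pd.merge`, which adds a `_merge` column; the frames are local copies of the lesson data:

```python
import pandas as pd

revenue = pd.DataFrame({
    "Date": ["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-05"],
    "Revenue": [1000, 1500, 1200, 800],
})
costs = pd.DataFrame({
    "Date": ["2023-01-01", "2023-01-03", "2023-01-04", "2023-01-05"],
    "Costs": [500, 700, 600, 400],
})

# indicator=True adds a '_merge' column with 'both', 'left_only' or 'right_only'
merged = pd.merge(revenue, costs, on="Date", how="outer", indicator=True)

# Map each date to its origin for easy inspection
origin = dict(zip(merged["Date"], merged["_merge"].astype(str)))
print(origin)
```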


In [20]:
pd.merge(left = df_revenue, right = df_costs, on = 'Date', how = 'outer').sort_values(by= 'Costs')

Unnamed: 0,Date,Revenue,Costs
4,2023-01-05,800.0,400.0
0,2023-01-01,1000.0,500.0
3,2023-01-04,,600.0
2,2023-01-03,1200.0,700.0
1,2023-01-02,1500.0,


In [24]:
pd.merge(left = df_revenue, right = df_costs, on = 'Date', how = 'right')

Unnamed: 0,Date,Revenue,Costs
0,2023-01-01,1000.0,500
1,2023-01-03,1200.0,700
2,2023-01-04,,600
3,2023-01-05,800.0,400


In [21]:
# Merge the sales and revenue DataFrames on the 'Date' column (inner join)
# Only rows with a common value in the 'Date' column, present in both DataFrames, are included in the merged result.
pd.merge(df_sales, df_revenue, on='Date')

Unnamed: 0,Date,Product,Quantity_Sold,Revenue
0,2023-01-01,A,100,1000
1,2023-01-02,B,200,1500
2,2023-01-03,C,150,1200


If you want to perform an outer join, where all rows from both DataFrames are included, you can use `how='outer'`.

In [22]:
# Merge the sales and revenue DataFrames on the 'Date' column (outer join)
pd.merge(df_sales, df_revenue, on='Date', how='outer')

Unnamed: 0,Date,Product,Quantity_Sold,Revenue
0,2023-01-01,A,100.0,1000.0
1,2023-01-02,B,200.0,1500.0
2,2023-01-03,C,150.0,1200.0
3,2023-01-04,D,120.0,
4,2023-01-05,,,800.0


If you want to include all rows from the left DataFrame and only the matching rows from the right DataFrame, you can use `how='left'`.

In [23]:
# Merge the sales and revenue DataFrames on the 'Date' column (left join)
pd.merge(df_sales, df_revenue, on='Date', how='left')

Unnamed: 0,Date,Product,Quantity_Sold,Revenue
0,2023-01-01,A,100,1000.0
1,2023-01-02,B,200,1500.0
2,2023-01-03,C,150,1200.0
3,2023-01-04,D,120,


Similarly, if you want to include all rows from the right DataFrame and only the matching rows from the left DataFrame, you can use `how='right'`.

In [25]:
# Merge the sales and revenue DataFrames on the 'Date' column (right join)
pd.merge(df_sales, df_revenue, on='Date', how='right')

Unnamed: 0,Date,Product,Quantity_Sold,Revenue
0,2023-01-01,A,100.0,1000
1,2023-01-02,B,200.0,1500
2,2023-01-03,C,150.0,1200
3,2023-01-05,,,800


In these examples, the key column had the same name ('Date') in both DataFrames, but this is not always the case. When the key columns are named differently, use the `left_on` and `right_on` parameters to name the key in each DataFrame.

`df1.merge(df2, left_on='col_1', right_on='col_2', how='inner')`

In [26]:
df_revenue.merge(df_sales, on='Date')

Unnamed: 0,Date,Revenue,Product,Quantity_Sold
0,2023-01-01,1000,A,100
1,2023-01-02,1500,B,200
2,2023-01-03,1200,C,150


In [30]:
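# This call fails: 'Date_1' and 'Date_2' do not exist yet, since both DataFrames still name their key column 'Date'.
# left_on / right_on must name columns that are actually present in the left and right DataFrame respectively.
pd.merge(
    left=df_revenue,
    right=df_sales,
    how='outer',
    left_on="Date_1",
    right_on="Date_2"
)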

KeyError: 'Date_2'

In [47]:
# Rename date columns
df_revenue.rename({'Date': 'Date_1'}, axis=1, inplace=True)
df_costs.rename({'Date': 'Date_2'}, axis=1, inplace=True)

display(df_revenue)
display(df_costs)

Unnamed: 0,Date_1,Revenue
0,2023-01-01,1000
1,2023-01-02,1500
2,2023-01-03,1200
3,2023-01-05,800


Unnamed: 0,Date_2,Costs
0,2023-01-01,500
1,2023-01-03,700
2,2023-01-04,600
3,2023-01-05,400


In [48]:
# Illustrate merge with different column names
merged = df_revenue.merge(df_costs, left_on='Date_1', right_on='Date_2', how='outer')

In [None]:
# Build a single 'Date' column: take Date_1 where it is present, otherwise fall back to Date_2.
# fillna avoids a row-wise lambda and the identity check against np.NaN (an alias removed in NumPy 2.0).
merged['Date'] = merged['Date_1'].fillna(merged['Date_2'])

In [50]:
merged

Unnamed: 0,Date_1,Revenue,Date_2,Costs,Date
0,2023-01-01,1000.0,2023-01-01,500.0,2023-01-01
1,2023-01-02,1500.0,,,2023-01-02
2,2023-01-03,1200.0,2023-01-03,700.0,2023-01-03
3,,,2023-01-04,600.0,2023-01-04
4,2023-01-05,800.0,2023-01-05,400.0,2023-01-05
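
Merging with `left_on` and `right_on` leaves both key columns in the result, which is why an extra 'Date' column had to be built above. An alternative sketch, assuming the two keys mean the same thing, is to rename them to a common name first and merge with `on`; the frames below are local copies of the renamed lesson data:

```python
import pandas as pd

revenue = pd.DataFrame({
    "Date_1": ["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-05"],
    "Revenue": [1000, 1500, 1200, 800],
})
costs = pd.DataFrame({
    "Date_2": ["2023-01-01", "2023-01-03", "2023-01-04", "2023-01-05"],
    "Costs": [500, 700, 600, 400],
})

# rename() without inplace returns copies, so the original frames are left untouched
merged_single_key = revenue.rename(columns={"Date_1": "Date"}).merge(
    costs.rename(columns={"Date_2": "Date"}),
    on="Date",
    how="outer",
)
print(merged_single_key)
```

The result has one 'Date' column with no missing values, at the cost of losing the original column names.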


### `join`

`join()` works similarly to `merge()`. It is also used to combine DataFrames. However, there are some differences between the two:

1. **Method of Combination:**
   - `join()`: Combines DataFrames **based on their indexes**. It uses the index as the key to align the rows.
   - `merge()`: Combines DataFrames **based on the values in specified columns**. It can use one or more columns as the keys to align the rows.

2. **Default Behavior:**
   - `join()`: By default, performs a left join, keeping all rows from the left DataFrame and filling the right DataFrame's columns with NaN where there is no match.
   - `merge()`: By default, performs an inner join, keeping only the rows with matching values in both DataFrames.
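
The two differences listed above can be checked directly: a default `join()` should give the same result as a `merge()` that is told to match on both indexes with `how='left'`. A small sketch with illustrative two-row frames:

```python
import pandas as pd

# Illustrative frames that already use 'Date' as their index
revenue = pd.DataFrame(
    {"Revenue": [1000, 1500]},
    index=pd.Index(["2023-01-01", "2023-01-02"], name="Date"),
)
costs = pd.DataFrame(
    {"Costs": [500, 700]},
    index=pd.Index(["2023-01-01", "2023-01-03"], name="Date"),
)

# join(): index-based, left join by default
via_join = revenue.join(costs)

# merge(): column-based and inner by default, so index matching and how='left' must be requested
via_merge = revenue.merge(costs, left_index=True, right_index=True, how="left")

print(via_join.equals(via_merge))
```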



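First, we set the 'Date' column as the index of `df_revenue` and `df_costs`, after renaming their date columns back to 'Date'. The `df_sales` line is left commented out, so `df_sales` keeps its default integer index: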

In [53]:
df_revenue.rename({'Date_1': 'Date'}, axis=1, inplace=True)
df_costs.rename({'Date_2': 'Date'}, axis=1, inplace=True)


In [54]:
# df_sales.set_index('Date', inplace=True)  # left commented out: df_sales keeps its default integer index
df_revenue.set_index('Date', inplace=True)
df_costs.set_index('Date', inplace=True)

In [None]:
# Default is a left join: keep every row of the calling (left) DataFrame
df_revenue.join(df_costs)

Unnamed: 0_level_0,Revenue,Costs
Date,Unnamed: 1_level_1,Unnamed: 2_level_1
2023-01-01,1000,500.0
2023-01-02,1500,
2023-01-03,1200,700.0
2023-01-05,800,400.0


In [56]:
df_revenue.join(df_costs, how='inner')

Unnamed: 0_level_0,Revenue,Costs
Date,Unnamed: 1_level_1,Unnamed: 2_level_1
2023-01-01,1000,500
2023-01-03,1200,700
2023-01-05,800,400


In [57]:
df_revenue.join(df_costs, how='outer')

Unnamed: 0_level_0,Revenue,Costs
Date,Unnamed: 1_level_1,Unnamed: 2_level_1
2023-01-01,1000.0,500.0
2023-01-02,1500.0,
2023-01-03,1200.0,700.0
2023-01-04,,600.0
2023-01-05,800.0,400.0


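Next, we use the `join()` method on `df_sales` to combine it with `df_revenue` and `df_costs`. `join()` aligns rows on the index, not on a column, so all three DataFrames need 'Date' as their index for the dates to line up.

Here `df_sales` still has its default integer index (its `set_index` call above is commented out), so none of its labels match the 'Date' labels of `df_revenue` and `df_costs`: the left join keeps every row of `df_sales` but fills `Revenue` and `Costs` with NaN, the inner join is empty, and the outer join lists the two sets of labels separately. Once `df_sales` is also indexed by 'Date', the result contains all rows from `df_sales` along with the corresponding revenue and costs, with NaN where a date is missing from `df_revenue` or `df_costs`.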

In [58]:
# Default is a left join: keep all rows from df_sales; Revenue and Costs are NaN because no index labels match
df_sales.join([df_revenue, df_costs])

Unnamed: 0,Date_2,Product,Quantity_Sold,Revenue,Costs
0,2023-01-01,A,100.0,,
1,2023-01-02,B,200.0,,
2,2023-01-03,C,150.0,,
3,2023-01-04,D,120.0,,


In [59]:
# Inner join: only index labels present in all DataFrames are kept; none match here, so the result is empty
df_sales.join([df_revenue, df_costs], how='inner')

Unnamed: 0,Date_2,Product,Quantity_Sold,Revenue,Costs


In [60]:
# Outer join: keep every index label from all DataFrames and fill with NaN where data is not available
df_sales.join([df_revenue, df_costs],how='outer')

Unnamed: 0,Date_2,Product,Quantity_Sold,Revenue,Costs
0,2023-01-01,A,100.0,,
1,2023-01-02,B,200.0,,
2,2023-01-03,C,150.0,,
3,2023-01-04,D,120.0,,
2023-01-01,,,,1000.0,500.0
2023-01-02,,,,1500.0,
2023-01-03,,,,1200.0,700.0
2023-01-05,,,,800.0,400.0
2023-01-04,,,,,600.0
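
The all-NaN columns above come from joining a DataFrame with an integer index to DataFrames indexed by date. A sketch of the same join when all three frames use 'Date' as their index; the frames are local copies of the lesson data, indexed at creation:

```python
import pandas as pd

sales = pd.DataFrame({
    "Date": ["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04"],
    "Product": ["A", "B", "C", "D"],
    "Quantity_Sold": [100, 200, 150, 120],
}).set_index("Date")
revenue = pd.DataFrame({
    "Date": ["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-05"],
    "Revenue": [1000, 1500, 1200, 800],
}).set_index("Date")
costs = pd.DataFrame({
    "Date": ["2023-01-01", "2023-01-03", "2023-01-04", "2023-01-05"],
    "Costs": [500, 700, 600, 400],
}).set_index("Date")

# Left join on the shared 'Date' index: every sales row is kept
combined = sales.join([revenue, costs])
print(combined)
```

Now only the dates that are really missing from a table produce NaN: revenue for 2023-01-04 and costs for 2023-01-02.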


### Summary

- `concat` is used to combine two or more DataFrames vertically or horizontally. It's often used when data is split across multiple files and you want to create a larger dataset.
  - Vertical concatenation adds more rows to a DataFrame.
  - Horizontal concatenation adds more columns to a DataFrame.
  - `pd.concat` is used with the `axis` parameter determining the axis along which the DataFrames will be stacked (`axis=0` for rows and `axis=1` for columns).
  - The `join` parameter determines how labels on the other axis are handled. When stacking rows (`axis=0`), `join='outer'` (the default) keeps all columns and fills missing values with NaN, while `join='inner'` keeps only the columns shared by every DataFrame.
- `merge` is used to combine DataFrames based on a common column.
  - By default, `merge` performs an inner join, including only the matching rows between the DataFrames.
  - `how='outer'` performs an outer join, including all rows from both DataFrames.
  - `how='left'` performs a left join, including all rows from the left DataFrame and only matching rows from the right.
  - `how='right'` performs a right join, including all rows from the right DataFrame and only matching rows from the left.
  - If the columns to join on don't have the same name, `left_on` and `right_on` parameters are used.
- `join` is used to combine DataFrames based on their indexes.
  - By default, `join` performs a left join.
  - The DataFrame's index can be set using `set_index` and then `join` can be used to merge on this index.
  - Different types of joins (inner, outer) can be performed using the `how` parameter in the `join` function.

### 💡 Check for understanding

In [None]:
import pandas as pd

# Dataset 1: Student information
df_students = pd.DataFrame({
    'StudentID': ['S1', 'S2', 'S3', 'S4'],
    'Name': ['Alice', 'Bob', 'Charlie', 'Dave'],
    'Major': ['Physics', 'Mathematics', 'Chemistry', 'Biology']
})

# Create df_students_2 DataFrame
df_students_2 = pd.DataFrame({
    'StudentID': ['S5', 'S6'],
    'Name': ['Eve', 'Frank'],
    'Major': ['English', 'Computer Science']
})

# Dataset 2: Course enrollment information
df_courses = pd.DataFrame({
    'StudentID': ['S1', 'S2', 'S3', 'S5'],
    'Course': ['Physics 101', 'Mathematics 101', 'Chemistry 101', 'Biology 101']
})

# Dataset 3: Student grades
df_grades = pd.DataFrame({
    'StudentID': ['S1', 'S3', 'S4', 'S6'],
    'Grade': ['A', 'B', 'A', 'C']
})

1. Create a new DataFrame that contains the information from both `df_students` and `df_students_2`.

2. Merge `df_students` and `df_courses` on the 'StudentID' column. Try all four types of merges (inner, outer, left, and right) and observe the differences.

3. Set 'StudentID' as the index for `df_students`, `df_courses`, and `df_grades`. Then use `df_students.join` to combine all three datasets. Try different types of joins (inner, outer) and observe the differences.

In [None]:
# Your answer goes here

## Structuring Data with Pivot, Stack/Unstack, and Melt

These methods are useful for restructuring, aggregating, and reshaping data to better analyze and visualize it.

### Pivot

- Pivot is used to create a new derived table from another one.
- Allows us to reshape a DataFrame based on column values.
- Converts unique values from one column into multiple columns.
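
pandas offers both `pivot` and `pivot_table` for this reshaping. As a rough sketch of the difference: `pivot` only reshapes and raises an error when an index/column pair occurs more than once, while `pivot_table` aggregates the duplicates. The `sales` frame below is invented for illustration:

```python
import pandas as pd

# Invented data: product A appears twice in region North
sales = pd.DataFrame({
    "Product": ["A", "A", "B"],
    "Region": ["North", "North", "South"],
    "Sales": [100, 50, 200],
})

# pivot cannot decide which of the two (A, North) values to keep
try:
    sales.pivot(index="Product", columns="Region", values="Sales")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table combines the duplicates with the chosen aggregation function
table = sales.pivot_table(index="Product", columns="Region", values="Sales", aggfunc="sum")

print(pivot_failed)
print(table)
```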

![](https://github.com/data-bootcamp-v4/lessons/blob/main/img/pivot.png?raw=true)

In [61]:
import pandas as pd

# Load the world statistics dataset (country, year, Population, GDP) from an online source
url = 'https://raw.githubusercontent.com/data-bootcamp-v4/data/main/worldstats.csv'
df = pd.read_csv(url)

In [62]:
df.head(1)

Unnamed: 0,country,year,Population,GDP
0,Arab World,2015,392022276.0,2530102000000.0


In [69]:
# Pivot the DataFrame to show GDP by country (rows) and year (columns).
# pivot_table works like a spreadsheet pivot table; aggfunc='sum' combines any duplicate (country, year) entries.
pivot_df = df.pivot_table(
    index='country',
    columns='year',
    values=['GDP'],
    aggfunc='sum'
)

In [70]:
pivot_df

Unnamed: 0_level_0,GDP,GDP,GDP,GDP,GDP,GDP,GDP,GDP,GDP,GDP,GDP,GDP,GDP,GDP,GDP,GDP,GDP,GDP,GDP,GDP,GDP
year,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,...,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015
country,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
Afghanistan,5.377778e+08,5.488889e+08,5.466667e+08,7.511112e+08,8.000000e+08,1.006667e+09,1.400000e+09,1.673333e+09,1.373333e+09,1.408889e+09,...,7.057598e+09,9.843842e+09,1.019053e+10,1.248694e+10,1.593680e+10,1.793024e+10,2.053654e+10,2.004633e+10,2.005019e+10,1.919944e+10
Albania,,,,,,,,,,,...,8.992642e+09,1.070101e+10,1.288135e+10,1.204421e+10,1.192695e+10,1.289087e+10,1.231978e+10,1.278103e+10,1.327796e+10,1.145560e+10
Algeria,2.723638e+09,2.434767e+09,2.001461e+09,2.703004e+09,2.909340e+09,3.136284e+09,3.039859e+09,3.370870e+09,3.852147e+09,4.257253e+09,...,1.170273e+11,1.349771e+11,1.710007e+11,1.372110e+11,1.612073e+11,2.000131e+11,2.090474e+11,2.097035e+11,2.135185e+11,1.668386e+11
Andorra,,,,,,,,,,,...,3.536452e+09,4.010785e+09,4.001349e+09,3.649863e+09,3.346317e+09,3.427236e+09,3.146178e+09,3.249101e+09,,
Angola,,,,,,,,,,,...,4.178948e+10,6.044892e+10,8.417803e+10,7.549238e+10,8.247091e+10,1.041159e+11,1.153984e+11,1.249121e+11,1.267751e+11,1.026431e+11
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
West Bank and Gaza,,,,,,,,,,,...,4.910100e+09,5.505800e+09,6.673500e+09,7.268200e+09,8.913100e+09,1.045985e+10,1.127940e+10,1.247600e+10,1.271560e+10,1.267740e+10
World,1.364643e+12,1.420440e+12,1.524573e+12,1.638187e+12,1.799675e+12,1.959900e+12,2.125397e+12,2.262923e+12,2.440549e+12,2.686747e+12,...,5.107451e+13,5.758343e+13,6.312856e+13,5.983553e+13,6.564782e+13,7.284314e+13,7.442836e+13,7.643132e+13,7.810634e+13,7.343364e+13
"Yemen, Rep.",,,,,,,,,,,...,1.908173e+10,2.563367e+10,3.039720e+10,2.845950e+10,3.090675e+10,3.107886e+10,3.207477e+10,3.595450e+10,,
Zambia,6.987397e+08,6.823597e+08,6.792797e+08,7.043397e+08,8.226397e+08,1.061200e+09,1.239000e+09,1.340639e+09,1.573739e+09,1.926399e+09,...,1.275686e+10,1.405696e+10,1.791086e+10,1.532834e+10,2.026555e+10,2.345952e+10,2.550306e+10,2.804552e+10,2.713464e+10,2.120156e+10


In [73]:
pivot_df_population = df.pivot_table(
    index='country',
    columns='year', 
    values=['Population'],
    aggfunc='sum'
)
pivot_df_population

Unnamed: 0_level_0,Population,Population,Population,Population,Population,Population,Population,Population,Population,Population,Population,Population,Population,Population,Population,Population,Population,Population,Population,Population,Population
year,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,...,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015
country,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
Afghanistan,8.994793e+06,9.164945e+06,9.343772e+06,9.531555e+06,9.728645e+06,9.935358e+06,1.014884e+07,1.036860e+07,1.059979e+07,1.084951e+07,...,2.518362e+07,2.587754e+07,2.652874e+07,2.720729e+07,2.796221e+07,2.880917e+07,2.972680e+07,3.068250e+07,3.162751e+07,3.252656e+07
Albania,,,,,,,,,,,...,2.992547e+06,2.970017e+06,2.947314e+06,2.927519e+06,2.913021e+06,2.904780e+06,2.900247e+06,2.896652e+06,2.893654e+06,2.889167e+06
Algeria,1.112489e+07,1.140486e+07,1.169015e+07,1.198513e+07,1.229597e+07,1.262695e+07,1.298027e+07,1.335420e+07,1.374438e+07,1.414444e+07,...,3.374933e+07,3.426197e+07,3.481106e+07,3.540179e+07,3.603616e+07,3.671713e+07,3.743943e+07,3.818614e+07,3.893433e+07,3.966652e+07
Andorra,,,,,,,,,,,...,8.337300e+04,8.487800e+04,8.561600e+04,8.547400e+04,8.441900e+04,8.232600e+04,7.931600e+04,7.590200e+04,,
Angola,,,,,,,,,,,...,1.854147e+07,1.918391e+07,1.984225e+07,2.052010e+07,2.121995e+07,2.194230e+07,2.268563e+07,2.344820e+07,2.422752e+07,2.502197e+07
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
West Bank and Gaza,,,,,,,,,,,...,3.406334e+06,3.494496e+06,3.596688e+06,3.702218e+06,3.811102e+06,3.927051e+06,4.046901e+06,4.169506e+06,4.294682e+06,4.422143e+06
World,3.035056e+09,3.076121e+09,3.129064e+09,3.193947e+09,3.259355e+09,3.326054e+09,3.395866e+09,3.465297e+09,3.535512e+09,3.609910e+09,...,6.594722e+09,6.675833e+09,6.758303e+09,6.840956e+09,6.923684e+09,7.006908e+09,7.089452e+09,7.176092e+09,7.260780e+09,7.346633e+09
"Yemen, Rep.",,,,,,,,,,,...,2.109397e+07,2.170110e+07,2.232270e+07,2.295423e+07,2.359197e+07,2.423494e+07,2.488279e+07,2.553322e+07,,
Zambia,3.049586e+06,3.142848e+06,3.240664e+06,3.342894e+06,3.449266e+06,3.559687e+06,3.674088e+06,3.792864e+06,3.916928e+06,4.047479e+06,...,1.238151e+07,1.273868e+07,1.311458e+07,1.350785e+07,1.391744e+07,1.434353e+07,1.478658e+07,1.524609e+07,1.572134e+07,1.621177e+07


In [None]:
pivot_df_aggregation = df.pivot_table(
    index='country',
    columns='year',
    values=['Population', 'GDP'],
    aggfunc='sum'
)
pivot_df_aggregation
# Passing two value columns doubles the number of columns: one block of years per indicator

Unnamed: 0_level_0,GDP,GDP,GDP,GDP,GDP,GDP,GDP,GDP,GDP,GDP,...,Population,Population,Population,Population,Population,Population,Population,Population,Population,Population
year,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,...,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015
country,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
Afghanistan,5.377778e+08,5.488889e+08,5.466667e+08,7.511112e+08,8.000000e+08,1.006667e+09,1.400000e+09,1.673333e+09,1.373333e+09,1.408889e+09,...,2.518362e+07,2.587754e+07,2.652874e+07,2.720729e+07,2.796221e+07,2.880917e+07,2.972680e+07,3.068250e+07,3.162751e+07,3.252656e+07
Albania,,,,,,,,,,,...,2.992547e+06,2.970017e+06,2.947314e+06,2.927519e+06,2.913021e+06,2.904780e+06,2.900247e+06,2.896652e+06,2.893654e+06,2.889167e+06
Algeria,2.723638e+09,2.434767e+09,2.001461e+09,2.703004e+09,2.909340e+09,3.136284e+09,3.039859e+09,3.370870e+09,3.852147e+09,4.257253e+09,...,3.374933e+07,3.426197e+07,3.481106e+07,3.540179e+07,3.603616e+07,3.671713e+07,3.743943e+07,3.818614e+07,3.893433e+07,3.966652e+07
Andorra,,,,,,,,,,,...,8.337300e+04,8.487800e+04,8.561600e+04,8.547400e+04,8.441900e+04,8.232600e+04,7.931600e+04,7.590200e+04,,
Angola,,,,,,,,,,,...,1.854147e+07,1.918391e+07,1.984225e+07,2.052010e+07,2.121995e+07,2.194230e+07,2.268563e+07,2.344820e+07,2.422752e+07,2.502197e+07
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
West Bank and Gaza,,,,,,,,,,,...,3.406334e+06,3.494496e+06,3.596688e+06,3.702218e+06,3.811102e+06,3.927051e+06,4.046901e+06,4.169506e+06,4.294682e+06,4.422143e+06
World,1.364643e+12,1.420440e+12,1.524573e+12,1.638187e+12,1.799675e+12,1.959900e+12,2.125397e+12,2.262923e+12,2.440549e+12,2.686747e+12,...,6.594722e+09,6.675833e+09,6.758303e+09,6.840956e+09,6.923684e+09,7.006908e+09,7.089452e+09,7.176092e+09,7.260780e+09,7.346633e+09
"Yemen, Rep.",,,,,,,,,,,...,2.109397e+07,2.170110e+07,2.232270e+07,2.295423e+07,2.359197e+07,2.423494e+07,2.488279e+07,2.553322e+07,,
Zambia,6.987397e+08,6.823597e+08,6.792797e+08,7.043397e+08,8.226397e+08,1.061200e+09,1.239000e+09,1.340639e+09,1.573739e+09,1.926399e+09,...,1.238151e+07,1.273868e+07,1.311458e+07,1.350785e+07,1.391744e+07,1.434353e+07,1.478658e+07,1.524609e+07,1.572134e+07,1.621177e+07


### Stack and Unstack

In pandas, `stack()` and `unstack()` are two methods used to transform data between "wide" and "long" formats in a DataFrame.

- `stack()`: This method "stacks" the data, converting the **columns into rows**, and results in a multi-level index. It is useful when you have a DataFrame with multiple columns representing similar data, and you want to combine them into a single column.

- `unstack()`: This method does the opposite of `stack()`. It "unstacks" the data, moving an **index level back into the columns**, which produces a wider format. It is useful when you have a DataFrame with a multi-level index and you want to spread one of the levels across separate columns.
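
A tiny round trip can make the two methods easier to follow. In this sketch the frame `wide` is invented for illustration; stacking it moves the column labels into a second index level, and unstacking moves them back:

```python
import pandas as pd

# Invented wide frame: one row per country, one column per indicator
wide = pd.DataFrame(
    {"GDP": [10, 20], "Population": [1, 2]},
    index=pd.Index(["X", "Y"], name="country"),
)

# stack(): columns become the innermost index level, giving a Series
long_form = wide.stack()

# unstack(): the innermost index level becomes columns again
back = long_form.unstack()

print(long_form)
print(back)
```

Depending on the installed pandas version, `stack()` may print a FutureWarning about its new implementation; the result for this small example is the same either way.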


![](https://github.com/data-bootcamp-v4/lessons/blob/main/img/stack.png?raw=true)

In [78]:
df.stack()

0      country            Arab World
       year                     2015
       Population        392022276.0
       GDP           2530101503617.0
1      country            Arab World
                          ...       
11209  GDP              1096646600.0
11210  country              Zimbabwe
       year                     1960
       Population          3752390.0
       GDP              1052990400.0
Length: 44844, dtype: object

In [81]:
# Create a multi-index DataFrame with 'year' as the first index level and 'country' as the second,
# then sort the rows by year
df_multiindex = df.set_index(['year', 'country']).sort_index(level='year')
df_multiindex

Unnamed: 0_level_0,Unnamed: 1_level_0,Population,GDP
year,country,Unnamed: 2_level_1,Unnamed: 3_level_1
1960,Afghanistan,8.994793e+06,5.377778e+08
1960,Algeria,1.112489e+07,2.723638e+09
1960,Australia,1.027648e+07,1.856759e+10
1960,Austria,7.047539e+06,6.592694e+09
1960,"Bahamas, The",1.095260e+05,1.698023e+08
...,...,...,...
2015,Vietnam,9.170380e+07,1.935994e+11
2015,West Bank and Gaza,4.422143e+06,1.267740e+10
2015,World,7.346633e+09,7.343364e+13
2015,Zambia,1.621177e+07,2.120156e+10


In [82]:
# Stack the DataFrame to convert columns into rows and create a Series
stacked_data = df_multiindex.stack()
stacked_data

year  country                
1960  Afghanistan  Population    8.994793e+06
                   GDP           5.377778e+08
      Algeria      Population    1.112489e+07
                   GDP           2.723638e+09
      Australia    Population    1.027648e+07
                                     ...     
2015  World        GDP           7.343364e+13
      Zambia       Population    1.621177e+07
                   GDP           2.120156e+10
      Zimbabwe     Population    1.560275e+07
                   GDP           1.389294e+10
Length: 22422, dtype: float64

In [84]:
# Unstack the Series back into a DataFrame with the 'year' level as columns
unstacked_data = stacked_data.unstack(level ='year')

unstacked_data.head()

Unnamed: 0_level_0,year,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,...,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
Afghanistan,Population,8994793.0,9164945.0,9343772.0,9531555.0,9728645.0,9935358.0,10148840.0,10368600.0,10599790.0,10849510.0,...,25183620.0,25877540.0,26528740.0,27207290.0,27962210.0,28809170.0,29726800.0,30682500.0,31627510.0,32526560.0
Afghanistan,GDP,537777800.0,548888900.0,546666700.0,751111200.0,800000000.0,1006667000.0,1400000000.0,1673333000.0,1373333000.0,1408889000.0,...,7057598000.0,9843842000.0,10190530000.0,12486940000.0,15936800000.0,17930240000.0,20536540000.0,20046330000.0,20050190000.0,19199440000.0
Albania,Population,,,,,,,,,,,...,2992547.0,2970017.0,2947314.0,2927519.0,2913021.0,2904780.0,2900247.0,2896652.0,2893654.0,2889167.0
Albania,GDP,,,,,,,,,,,...,8992642000.0,10701010000.0,12881350000.0,12044210000.0,11926950000.0,12890870000.0,12319780000.0,12781030000.0,13277960000.0,11455600000.0
Algeria,Population,11124890.0,11404860.0,11690150.0,11985130.0,12295970.0,12626950.0,12980270.0,13354200.0,13744380.0,14144440.0,...,33749330.0,34261970.0,34811060.0,35401790.0,36036160.0,36717130.0,37439430.0,38186140.0,38934330.0,39666520.0


### Melt

The `melt()` function in pandas is used to transform a DataFrame from a **wide format to a long format**, which is often more suitable for certain data analysis tasks. In the wide format, each row represents a unique observation, and each column represents a different variable. However, in the long format, multiple rows may represent the same observation, and a new column is introduced to distinguish between the different variables.

![](https://github.com/data-bootcamp-v4/lessons/blob/main/img/melt.png?raw=true)

In [None]:
# Melt the DataFrame, keeping 'country' and 'year' as identifier variables and 'Population' and 'GDP' as value variables.
# Each (country, year) pair now spans two rows: 'Indicator' names the variable and 'Value' holds its value.
melted_data = pd.melt(df, id_vars=['country', 'year'], value_vars=['Population', 'GDP'], var_name='Indicator', value_name='Value')
melted_data.head()

Unnamed: 0,country,year,Indicator,Value
0,Arab World,2015,Population,392022276.0
1,Arab World,2014,Population,384222592.0
2,Arab World,2013,Population,376504253.0
3,Arab World,2012,Population,368802611.0
4,Arab World,2011,Population,361031820.0
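
Melting is reversible: pivoting the long table on the 'Indicator' column gives back one column per variable. A short sketch on an invented three-row frame, assuming each (country, year, indicator) combination occurs only once so that `pivot` can be used:

```python
import pandas as pd

# Invented wide frame with the same column layout as the lesson data
wide = pd.DataFrame({
    "country": ["X", "X", "Y"],
    "year": [2014, 2015, 2015],
    "Population": [1.0, 2.0, 3.0],
    "GDP": [10.0, 20.0, 30.0],
})

# Wide to long: one row per (country, year, indicator)
melted = pd.melt(
    wide,
    id_vars=["country", "year"],
    value_vars=["Population", "GDP"],
    var_name="Indicator",
    value_name="Value",
)

# Long to wide again: each indicator becomes its own column
restored = melted.pivot(
    index=["country", "year"], columns="Indicator", values="Value"
).reset_index()

print(melted)
print(restored)
```

The restored table holds the same values as the original, although the indicator columns come back in alphabetical order.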


### Summary

- `pivot` (used here through `pivot_table`) creates a new derived table from an existing one: it reshapes a DataFrame based on column values, turning the unique values of one column into multiple columns.
- `stack` and `unstack` are used to transform data between "wide" and "long" formats.
  - `stack` converts columns into rows, leading to a multi-level index. It's useful when multiple columns represent similar data that you want to combine into a single column.
  - `unstack` does the opposite of `stack`, moving an index level back into the columns and producing a wider format. It's useful when a DataFrame has a multi-level index and you want to spread one of its levels across separate columns.
- `melt` transforms a DataFrame from a wide format to a long format. In the wide format, each row holds one observation with one column per variable; in the long format, each observation spans several rows, and a new column identifies which variable each value belongs to.

### 💡 Check for understanding

You are given a DataFrame with sales data for a company. The DataFrame contains information about the sales of various products in different regions. Create a summary of the total sales for each product in each region.


Dataset:

```python
import pandas as pd

data = {
    'Product': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
    'Region': ['North', 'North', 'North', 'South', 'South', 'South', 'East', 'East', 'East'],
    'Sales': [100, 150, 200, 120, 180, 240, 80, 110, 160]
}

df = pd.DataFrame(data)
```

Expected output:

```python
Region   East  North  South
Product                    
A          80    100    120
B         110    150    180
C         160    200    240
```

In [None]:
# Your code here