# Python Data Reshaping

- https://pandas.pydata.org/pandas-docs/stable/reshaping.html

## Read the data

- The data: number of tweets per day in my Brexit Twitter data, counted separately for original tweets and retweets
- The data is in long format: one row per (`date`, `retweet`) pair

In [7]:
import pandas as pd

In [6]:
df_tweet_count = pd.read_csv("brexit_tweet_count_by_data.csv")
df_tweet_count[0:10]

Unnamed: 0,date,retweet,total_count
0,2016-01-06,True,1942
1,2016-01-06,False,699
2,2016-01-07,False,3511
3,2016-01-07,True,4811
4,2016-01-08,True,4752
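A minimal sketch of the loading step, for readers without the original file. The inline CSV is a stand-in built from the first rows shown above, and its column layout is an assumption about the real file. `parse_dates` is optional: it makes `date` a real datetime column instead of a string.

```python
import io

import pandas as pd

# Stand-in for the real CSV file (assumed layout: date, retweet, total_count).
csv_text = """date,retweet,total_count
2016-01-06,True,1942
2016-01-06,False,699
2016-01-07,False,3511
2016-01-07,True,4811
"""

# parse_dates converts the `date` strings to datetime64 values on load.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])
print(df.dtypes)
```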


## From long to wide using `pivot`

https://chrisalbon.com/python/pandas_long_to_wide.html
- reshape the data from long to wide
    - one row per date
    - two columns, one for retweets and one for non-retweets
- use pandas' `pivot` function

In [8]:
df_tweet_count_wide = df_tweet_count.pivot(index='date', columns='retweet', values='total_count')
df_tweet_count_wide[0:10]

retweet,False,True
date,,
2016-01-06,699,1942
2016-01-07,3511,4811
2016-01-08,2401,4752
2016-01-09,1620,4193
2016-01-10,2744,5038
2016-01-11,3251,5073
2016-01-12,2443,4661
2016-01-13,2805,5148
2016-01-14,3353,5227
2016-01-15,2533,4174


The output has two columns, `False` and `True` (the values of `retweet`), and `date` becomes the row index.
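The same reshape on a toy frame, as a sketch. The four rows are copied from the output above; the names `original` and `retweet` used in the rename are hypothetical labels, not part of the original data.

```python
import pandas as pd

# Toy long-format frame built from the first rows shown above.
long_df = pd.DataFrame({
    "date": ["2016-01-06", "2016-01-06", "2016-01-07", "2016-01-07"],
    "retweet": [True, False, False, True],
    "total_count": [1942, 699, 3511, 4811],
})

wide_df = long_df.pivot(index="date", columns="retweet", values="total_count")

# `date` is now the row index; the distinct `retweet` values are the column labels.
print(wide_df.index.name)
print(list(wide_df.columns))

# Optional: replace the boolean labels with readable (hypothetical) names.
wide_named = wide_df.rename(columns={False: "original", True: "retweet"})
wide_named.columns.name = None
print(wide_named)
```

If a (`date`, `retweet`) pair appeared more than once, `pivot` would raise a `ValueError`; `pivot_table` with an aggregation function is the usual alternative in that case.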

## From wide to long using `unstack`
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.melt.html


- `unstack()` does something similar to the reverse of `pivot`
- But the result is not ideal: it is a Series whose multi-index holds both the `retweet` status and the `date`. We need one more step, `reset_index()`, to turn them back into columns

In [21]:
df_tweet_count_wide.unstack()

retweet  date      
False    2016-01-06      699
         2016-01-07     3511
         2016-01-08     2401
         2016-01-09     1620
         2016-01-10     2744
         2016-01-11     3251
         2016-01-12     2443
         2016-01-13     2805
         2016-01-14     3353
         2016-01-15     2533
         2016-01-16     1971
         2016-01-17     3378
         2016-01-18     3302
         2016-01-19     2399
         2016-01-20     3162
         2016-01-21     5735
         2016-01-22     4495
         2016-01-23     3359
         2016-01-24     3083
         2016-01-25     5260
         2016-01-26     5181
         2016-01-27     4541
         2016-01-28     4225
         2016-01-29     5122
         2016-01-31     4346
         2016-02-01     6213
         2016-02-02    10051
         2016-02-03     9843
         2016-02-04     8260
         2016-02-05     8315
                       ...  
True     2016-11-19     5463
         2016-12-03    40575
         2016-12-04    

In [25]:
df_unstack = df_tweet_count_wide.unstack().reset_index()
df_unstack[0:4]

Unnamed: 0,retweet,date,0
0,False,2016-01-06,699
1,False,2016-01-07,3511
2,False,2016-01-08,2401
3,False,2016-01-09,1620
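The value column above ends up named `0`. One way to avoid this, sketched on a toy wide frame (values taken from the pivot output above), is to pass `name` to `reset_index`, which names the value column in the same step.

```python
import pandas as pd

# Toy wide frame mirroring the pivot output: `date` index, boolean columns.
wide_df = pd.DataFrame(
    {False: [699, 3511], True: [1942, 4811]},
    index=pd.Index(["2016-01-06", "2016-01-07"], name="date"),
)
wide_df.columns.name = "retweet"

# unstack() returns a Series indexed by (retweet, date);
# reset_index(name=...) turns both index levels into columns
# and names the value column.
long_df = wide_df.unstack().reset_index(name="total_count")
print(long_df)
```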


## `melt()`?
- An `R` user might reach for `melt`, but applied directly to the pivoted frame it does not do what we want
- `melt` ignores the row index by default, so the `date` information is lost

In [30]:
pd.melt(df_tweet_count_wide[0:5])

Unnamed: 0,retweet,value
0,False,699
1,False,3511
2,False,2401
3,False,1620
4,False,2744
5,True,1942
6,True,4811
7,True,4752
8,True,4193
9,True,5038
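A possible workaround, sketched on a toy wide frame (values taken from the pivot output above): move `date` out of the index with `reset_index()` first, then pass it to `melt` as `id_vars` so it is kept.

```python
import pandas as pd

# Toy wide frame mirroring the pivot output: `date` index, boolean columns.
wide_df = pd.DataFrame(
    {False: [699, 3511], True: [1942, 4811]},
    index=pd.Index(["2016-01-06", "2016-01-07"], name="date"),
)
wide_df.columns.name = "retweet"

# reset_index() makes `date` an ordinary column; id_vars tells melt to keep it.
long_df = wide_df.reset_index().melt(
    id_vars="date", var_name="retweet", value_name="total_count"
)
print(long_df)
```

In pandas 1.1 and later, `pd.melt(..., ignore_index=False)` is another option: it keeps the original index on the melted rows instead of discarding it.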
