# Random Data for Reproducible Examples in Python Pandas
<br/>

<div style="width:200px; text-align:left"><img src="images/py_repro.png" width="200" /></div>

# What are Reproducible Examples?
<span style="font-size:24px">Used to troubleshoot issues with teammates or online (GitHub, Stack Overflow)</span>

<span style="font-size:24px">Consists of two components:</span>
<ol style="margin-top:10px;font-size:24px;line-height:26px">
  <li><b>CODE</b>: Runnable, compilable code from an empty environment</li>
  <li><b>DATA</b>: All available data and assigned objects, preferably small and lightweight</li>
</ol>
<br/>
<span style="font-size:24px"><b>Problem</b>: Data can be proprietary or contain confidential information</span>

<span style="font-size:24px"><b>Solution</b>: A randomized data set that mirrors the structure of the actual data.</span>

---

## Example Data

<span style="font-size:20px">Let's assume this public dataset from the U.S. Department of Energy's EIA is proprietary.</span>


In [36]:
import numpy as np
import pandas as pd

energy_df = pd.read_csv("https://www.eia.gov/totalenergy/data/browser/csv.php?tbl=T02.01")

energy_df.head(10)

| | MSN | YYYYMM | Value | Column_Order | Description | Unit |
|---|---|---|---|---|---|---|
| 0 | TXRCBUS | 194913 | 4460.588 | 1 | Primary Energy Consumed by the Residential Sector | Trillion Btu |
| 1 | TXRCBUS | 195013 | 4829.528 | 1 | Primary Energy Consumed by the Residential Sector | Trillion Btu |
| 2 | TXRCBUS | 195113 | 5104.68 | 1 | Primary Energy Consumed by the Residential Sector | Trillion Btu |
| 3 | TXRCBUS | 195213 | 5158.406 | 1 | Primary Energy Consumed by the Residential Sector | Trillion Btu |
| 4 | TXRCBUS | 195313 | 5052.749 | 1 | Primary Energy Consumed by the Residential Sector | Trillion Btu |
| 5 | TXRCBUS | 195413 | 5262.555 | 1 | Primary Energy Consumed by the Residential Sector | Trillion Btu |
| 6 | TXRCBUS | 195513 | 5608.073 | 1 | Primary Energy Consumed by the Residential Sector | Trillion Btu |
| 7 | TXRCBUS | 195613 | 5839.664 | 1 | Primary Energy Consumed by the Residential Sector | Trillion Btu |
| 8 | TXRCBUS | 195713 | 5744.189 | 1 | Primary Energy Consumed by the Residential Sector | Trillion Btu |
| 9 | TXRCBUS | 195813 | 6125.681 | 1 | Primary Energy Consumed by the Residential Sector | Trillion Btu |


In [37]:
energy_df.tail(10)

| | MSN | YYYYMM | Value | Column_Order | Description | Unit |
|---|---|---|---|---|---|---|
| 7052 | TETCBUS | 201911 | 8400.423 | 11 | Primary Energy Consumption Total | Trillion Btu |
| 7053 | TETCBUS | 201912 | 8937.727 | 11 | Primary Energy Consumption Total | Trillion Btu |
| 7054 | TETCBUS | 201913 | 100449.822 | 11 | Primary Energy Consumption Total | Trillion Btu |
| 7055 | TETCBUS | 202001 | 8961.181 | 11 | Primary Energy Consumption Total | Trillion Btu |
| 7056 | TETCBUS | 202002 | 8311.305 | 11 | Primary Energy Consumption Total | Trillion Btu |
| 7057 | TETCBUS | 202003 | 7843.358 | 11 | Primary Energy Consumption Total | Trillion Btu |
| 7058 | TETCBUS | 202004 | 6516.843 | 11 | Primary Energy Consumption Total | Trillion Btu |
| 7059 | TETCBUS | 202005 | 6859.155 | 11 | Primary Energy Consumption Total | Trillion Btu |
| 7060 | TETCBUS | 202006 | 7293.597 | 11 | Primary Energy Consumption Total | Trillion Btu |
| 7061 | TETCBUS | 202007 | 8105.214 | 11 | Primary Energy Consumption Total | Trillion Btu |


## <code>DataFrame.to_dict()</code>

In [38]:
print(energy_df.head(10).to_dict())

{'MSN': {0: 'TXRCBUS', 1: 'TXRCBUS', 2: 'TXRCBUS', 3: 'TXRCBUS', 4: 'TXRCBUS', 5: 'TXRCBUS', 6: 'TXRCBUS', 7: 'TXRCBUS', 8: 'TXRCBUS', 9: 'TXRCBUS'}, 'YYYYMM': {0: 194913, 1: 195013, 2: 195113, 3: 195213, 4: 195313, 5: 195413, 6: 195513, 7: 195613, 8: 195713, 9: 195813}, 'Value': {0: 4460.588, 1: 4829.528, 2: 5104.68, 3: 5158.406, 4: 5052.749, 5: 5262.555, 6: 5608.073, 7: 5839.664000000001, 8: 5744.189, 9: 6125.681}, 'Column_Order': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1}, 'Description': {0: 'Primary Energy Consumed by the Residential Sector', 1: 'Primary Energy Consumed by the Residential Sector', 2: 'Primary Energy Consumed by the Residential Sector', 3: 'Primary Energy Consumed by the Residential Sector', 4: 'Primary Energy Consumed by the Residential Sector', 5: 'Primary Energy Consumed by the Residential Sector', 6: 'Primary Energy Consumed by the Residential Sector', 7: 'Primary Energy Consumed by the Residential Sector', 8: 'Primary Energy Consumed by the Re
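
The printed dictionary can be pasted into a question and turned back into a frame by the reader. A minimal sketch of that round trip, assuming a small stand-in frame (`toy_df` and `rebuilt_df` are illustrative names; the three rows copy the first rows shown above):

```python
import pandas as pd

# Small stand-in frame; the three rows copy the first rows shown above
toy_df = pd.DataFrame({
    "MSN": ["TXRCBUS", "TXRCBUS", "TXRCBUS"],
    "YYYYMM": [194913, 195013, 195113],
    "Value": [4460.588, 4829.528, 5104.68],
})

# to_dict() nests values as {column: {index_label: value}}
as_dict = toy_df.to_dict()
print(as_dict)

# A reader pastes the printed dict and rebuilds the frame from it
rebuilt_df = pd.DataFrame(as_dict)
print(rebuilt_df)
```

Because the index labels travel inside the dictionary, the rebuilt frame keeps the original row labels.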

## <code>DataFrame.to_json()</code>

In [61]:
print(energy_df.tail(10).to_json())

{"MSN":{"7052":"TETCBUS","7053":"TETCBUS","7054":"TETCBUS","7055":"TETCBUS","7056":"TETCBUS","7057":"TETCBUS","7058":"TETCBUS","7059":"TETCBUS","7060":"TETCBUS","7061":"TETCBUS"},"YYYYMM":{"7052":201911,"7053":201912,"7054":201913,"7055":202001,"7056":202002,"7057":202003,"7058":202004,"7059":202005,"7060":202006,"7061":202007},"Value":{"7052":8400.423,"7053":8937.727,"7054":100449.822,"7055":8961.181,"7056":8311.305,"7057":7843.358,"7058":6516.843,"7059":6859.155,"7060":7293.597,"7061":8105.214},"Column_Order":{"7052":11,"7053":11,"7054":11,"7055":11,"7056":11,"7057":11,"7058":11,"7059":11,"7060":11,"7061":11},"Description":{"7052":"Primary Energy Consumption Total","7053":"Primary Energy Consumption Total","7054":"Primary Energy Consumption Total","7055":"Primary Energy Consumption Total","7056":"Primary Energy Consumption Total","7057":"Primary Energy Consumption Total","7058":"Primary Energy Consumption Total","7059":"Primary Energy Consumption Total","7060":"Primary Energy Consump
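
The JSON string works the same way in the other direction. A minimal sketch, assuming a stand-in frame whose rows copy the last three rows shown above (`toy_df`, `json_text`, and `rebuilt_df` are illustrative names):

```python
from io import StringIO

import pandas as pd

# Stand-in frame; rows copy the last three rows shown above
toy_df = pd.DataFrame(
    {
        "MSN": ["TETCBUS", "TETCBUS", "TETCBUS"],
        "YYYYMM": [202005, 202006, 202007],
        "Value": [6859.155, 7293.597, 8105.214],
    },
    index=[7059, 7060, 7061],
)

# to_json() returns one string, {column: {index_label: value}} by default
json_text = toy_df.to_json()

# Wrap the string in StringIO so read_json treats it as file-like input;
# convert_dates=False keeps YYYYMM as plain integers
rebuilt_df = pd.read_json(StringIO(json_text), convert_dates=False)
print(rebuilt_df)
```

Unlike the dictionary form, the JSON string needs no extra imports such as `Timestamp` on the reader's side, though dates would come back only after explicit conversion.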

## Random Data

<span style="font-size:20px">Recreate your data with random values, keeping the same columns and structure as the original dataset</span>

### Assess Existing Structure

In [41]:
energy_df.dtypes

MSN              object
YYYYMM            int64
Value           float64
Column_Order      int64
Description      object
Unit             object
dtype: object

In [42]:
energy_df.describe()

| | YYYYMM | Value | Column_Order |
|---|---|---|---|
| count | 7062.0 | 7062.0 | 7062.0 |
| mean | 199500.649533 | 4067.255133 | 6.0 |
| std | 1513.363743 | 9278.309661 | 3.162502 |
| min | 194913.0 | -7.623 | 1.0 |
| 25% | 198307.0 | 904.10175 | 3.0 |
| 50% | 199511.5 | 1817.035 | 6.0 |
| 75% | 200803.0 | 2728.06025 | 9.0 |
| max | 202007.0 | 101161.852 | 11.0 |


In [59]:
pd.Series(energy_df['MSN'].unique())

0     TXRCBUS
1     TERCBUS
2     TXCCBUS
3     TECCBUS
4     TXICBUS
5     TEICBUS
6     TXACBUS
7     TEACBUS
8     TXEIBUS
9     TEDFBUS
10    TETCBUS
dtype: object

In [60]:
pd.Series(energy_df['Description'].unique())

0     Primary Energy Consumed by the Residential Sector
1       Total Energy Consumed by the Residential Sector
2      Primary Energy Consumed by the Commercial Sector
3        Total Energy Consumed by the Commercial Sector
4      Primary Energy Consumed by the Industrial Sector
5        Total Energy Consumed by the Industrial Sector
6     Primary Energy Consumed by the Transportation ...
7     Total Energy Consumed by the Transportation Se...
8     Primary Energy Consumed by the Electric Power ...
9                     Energy Consumption Balancing Item
10                     Primary Energy Consumption Total
dtype: object
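
The checks above can be gathered into one pass over the columns. The helper below is a hypothetical sketch, not part of pandas: `summarize_structure` and its `max_levels` parameter are illustrative names, and the stand-in frame uses a few values copied from the outputs above.

```python
import pandas as pd


def summarize_structure(df, max_levels=20):
    """Collect dtype, value range, and category levels for each column."""
    summary = {}
    for name, col in df.items():
        info = {"dtype": str(col.dtype)}
        if pd.api.types.is_numeric_dtype(col):
            # Numeric columns: the range is enough to pick random bounds
            info["min"] = col.min()
            info["max"] = col.max()
        else:
            # Text columns: the distinct levels tell us what to sample from
            levels = list(col.unique())
            info["n_unique"] = len(levels)
            info["levels"] = levels[:max_levels]
        summary[name] = info
    return summary


# Stand-in frame with values copied from the outputs above
toy_df = pd.DataFrame({
    "MSN": ["TXRCBUS", "TXRCBUS", "TETCBUS"],
    "YYYYMM": [194913, 195013, 202007],
    "Value": [4460.588, -7.623, 8105.214],
})

for column, info in summarize_structure(toy_df).items():
    print(column, info)
```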

## Random Sample and Generation

- <h3> Character/Categorical Values: <code>np.random.choice</code>, <code>np.random.permutation</code> </h3>
- <h3> Integer/Numeric Values: <code>np.random.randint</code>, <code>np.random.normal</code>, <code>np.random.uniform</code> </h3>
- <h3> Repeat Values: <code>np.repeat</code>, <code>np.tile</code> </h3>
- <h3> Sequence Values: <code>np.arange</code>, <code>np.linspace</code>, <code>pd.date_range</code> </h3>
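
A short sketch of what each family of functions returns, using made-up labels and ranges (the seed value, column names, and bounds are arbitrary choices for illustration):

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # fixed seed so the draws repeat; the value 0 is arbitrary
n = 6

# Character/categorical: sample labels with replacement, or shuffle a fixed set
labels = np.random.choice(["low", "mid", "high"], size=n)
shuffled = np.random.permutation(list("ABCDEF"))

# Integer/numeric: bounded integers, bell-shaped floats, flat floats
counts = np.random.randint(1, 100, size=n)             # integers in [1, 100)
heights = np.random.normal(loc=170, scale=10, size=n)  # mean 170, std 10
prices = np.random.uniform(5.0, 50.0, size=n)          # floats in [5, 50)

# Repeats: repeat each element in place vs. tile the whole sequence
blocks = np.repeat(["X", "Y", "Z"], 2)                 # X X Y Y Z Z
cycles = np.tile(["X", "Y", "Z"], 2)                   # X Y Z X Y Z

# Sequences: stepped integers, evenly spaced floats, calendar dates
ids = np.arange(1, n + 1)                              # 1, 2, ..., 6
fractions = np.linspace(0, 1, n)                       # 0.0, 0.2, ..., 1.0
months = pd.date_range("2020-01-01", periods=n, freq="MS")

demo_df = pd.DataFrame({
    "id": ids, "month": months, "label": labels, "shuffled": shuffled,
    "count": counts, "height": heights, "price": prices,
    "block": blocks, "cycle": cycles, "fraction": fractions,
})
print(demo_df)
```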

In [49]:
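from itertools import product

# Fix the seed so every run draws the same random values
np.random.seed(112020)

# Monthly dates crossed with 11 single-letter group codes
month_series = pd.date_range('2000-01-01', '2020-11-01', freq='MS')
data = list(product(month_series, list('ABCDEFGHIJK')))

# Each tuple is (month, letter), so with this column order 'MSN' receives
# the dates and 'YYYYMM' the letters, as the output below shows
random_energy_df = (pd.DataFrame(data, columns = ['MSN', 'YYYYMM'])
                      # Uniform draws spanning the original Value range
                      .assign(Value = np.random.uniform(energy_df['Value'].min(), 
                                                        energy_df['Value'].max(), len(data)),
                              # Zero-padded labels Group01..Group11, repeated for each month
                              Description = np.tile([f'Group{i:02d}' for i in range(1, 12)],
                                                    len(month_series))
                             )
                   )

random_energy_df.head(20)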

| | MSN | YYYYMM | Value | Description |
|---|---|---|---|---|
| 0 | 2000-01-01 | A | 22958.134739 | Group01 |
| 1 | 2000-01-01 | B | 36740.355289 | Group02 |
| 2 | 2000-01-01 | C | 74939.712005 | Group03 |
| 3 | 2000-01-01 | D | 63739.375862 | Group04 |
| 4 | 2000-01-01 | E | 55439.52397 | Group05 |
| 5 | 2000-01-01 | F | 6427.70366 | Group06 |
| 6 | 2000-01-01 | G | 64662.434873 | Group07 |
| 7 | 2000-01-01 | H | 69077.008478 | Group08 |
| 8 | 2000-01-01 | I | 26076.232504 | Group09 |
| 9 | 2000-01-01 | J | 67487.210014 | Group10 |


In [50]:
random_energy_df.tail(20)

| | MSN | YYYYMM | Value | Description |
|---|---|---|---|---|
| 2741 | 2020-10-01 | C | 31823.33768 | Group03 |
| 2742 | 2020-10-01 | D | 67817.696473 | Group04 |
| 2743 | 2020-10-01 | E | 38201.016731 | Group05 |
| 2744 | 2020-10-01 | F | 66578.10607 | Group06 |
| 2745 | 2020-10-01 | G | 59341.554378 | Group07 |
| 2746 | 2020-10-01 | H | 86175.414602 | Group08 |
| 2747 | 2020-10-01 | I | 56742.062185 | Group09 |
| 2748 | 2020-10-01 | J | 45100.069654 | Group10 |
| 2749 | 2020-10-01 | K | 77825.431432 | Group11 |
| 2750 | 2020-11-01 | A | 98931.798117 | Group01 |
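
If the random frame should also match the original dtypes, where `YYYYMM` is `int64` and `MSN` holds text codes, the columns can be named in the order `product()` emits them and the dates converted to integers. A sketch under those assumptions: the letter codes and `typed_df` are hypothetical stand-ins, and the `Value` bounds are the min and max from the `describe()` output above.

```python
from itertools import product

import numpy as np
import pandas as pd

np.random.seed(112020)

months = pd.date_range("2000-01-01", "2000-03-01", freq="MS")
codes = list("ABC")  # hypothetical stand-ins for the MSN codes

# product() yields (month, code) pairs, so name the columns in that order
pairs = list(product(months, codes))
typed_df = pd.DataFrame(pairs, columns=["YYYYMM", "MSN"])

# Convert timestamps to integers such as 200001, like the original int64 column
typed_df["YYYYMM"] = (pd.to_datetime(typed_df["YYYYMM"])
                        .dt.strftime("%Y%m")
                        .astype("int64"))

# Bounds copied from the describe() output above (min and max of Value)
typed_df["Value"] = np.random.uniform(-7.623, 101161.852, len(typed_df))

# Reorder to follow the original column order
typed_df = typed_df[["MSN", "YYYYMM", "Value"]]
print(typed_df.dtypes)
```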


## <code>DataFrame.to_dict()</code>

In [51]:
print(random_energy_df.sample(10).to_dict())

{'MSN': {230: Timestamp('2001-09-01 00:00:00'), 1437: Timestamp('2010-11-01 00:00:00'), 472: Timestamp('2003-07-01 00:00:00'), 1224: Timestamp('2009-04-01 00:00:00'), 1274: Timestamp('2009-08-01 00:00:00'), 207: Timestamp('2001-07-01 00:00:00'), 762: Timestamp('2005-10-01 00:00:00'), 1107: Timestamp('2008-05-01 00:00:00'), 642: Timestamp('2004-11-01 00:00:00'), 183: Timestamp('2001-05-01 00:00:00')}, 'YYYYMM': {230: 'K', 1437: 'H', 472: 'K', 1224: 'D', 1274: 'J', 207: 'J', 762: 'D', 1107: 'H', 642: 'E', 183: 'H'}, 'Value': {230: 37398.21427814129, 1437: 93772.10738930508, 472: 4333.967150403827, 1224: 92397.74293560344, 1274: 69938.49213559463, 207: 18360.929396354986, 762: 47428.68594870003, 1107: 61170.86860442543, 642: 54535.801772419545, 183: 74595.3716832878}, 'Description': {230: 'Group11', 1437: 'Group08', 472: 'Group11', 1224: 'Group04', 1274: 'Group10', 207: 'Group10', 762: 'Group04', 1107: 'Group08', 642: 'Group05', 183: 'Group08'}}


In [53]:
from pandas import Timestamp

sample_data = (pd.DataFrame({'MSN': {230: Timestamp('2001-09-01 00:00:00'), 1437: Timestamp('2010-11-01 00:00:00'), 472: Timestamp('2003-07-01 00:00:00'), 1224: Timestamp('2009-04-01 00:00:00'), 1274: Timestamp('2009-08-01 00:00:00'), 207: Timestamp('2001-07-01 00:00:00'), 762: Timestamp('2005-10-01 00:00:00'), 1107: Timestamp('2008-05-01 00:00:00'), 642: Timestamp('2004-11-01 00:00:00'), 183: Timestamp('2001-05-01 00:00:00')}, 'YYYYMM': {230: 'K', 1437: 'H', 472: 'K', 1224: 'D', 1274: 'J', 207: 'J', 762: 'D', 1107: 'H', 642: 'E', 183: 'H'}, 'Value': {230: 37398.21427814129, 1437: 93772.10738930508, 472: 4333.967150403827, 1224: 92397.74293560344, 1274: 69938.49213559463, 207: 18360.929396354986, 762: 47428.68594870003, 1107: 61170.86860442543, 642: 54535.801772419545, 183: 74595.3716832878}, 'Description': {230: 'Group11', 1437: 'Group08', 472: 'Group11', 1224: 'Group04', 1274: 'Group10', 207: 'Group10', 762: 'Group04', 1107: 'Group08', 642: 'Group05', 183: 'Group08'}})
                 .reset_index(drop=True))

sample_data

| | MSN | YYYYMM | Value | Description |
|---|---|---|---|---|
| 0 | 2001-09-01 | K | 37398.214278 | Group11 |
| 1 | 2010-11-01 | H | 93772.107389 | Group08 |
| 2 | 2003-07-01 | K | 4333.96715 | Group11 |
| 3 | 2009-04-01 | D | 92397.742936 | Group04 |
| 4 | 2009-08-01 | J | 69938.492136 | Group10 |
| 5 | 2001-07-01 | J | 18360.929396 | Group10 |
| 6 | 2005-10-01 | D | 47428.685949 | Group04 |
| 7 | 2008-05-01 | H | 61170.868604 | Group08 |
| 8 | 2004-11-01 | E | 54535.801772 | Group05 |
| 9 | 2001-05-01 | H | 74595.371683 | Group08 |
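
Two optional refinements may make the pasted sample easier to share: passing `random_state` to `sample()` so the same rows come back on every run, and using `to_dict(orient="list")` to drop the index labels from the printed dictionary. A minimal sketch with a synthetic stand-in frame (`toy_df`, the group labels, and the seed values are illustrative):

```python
import numpy as np
import pandas as pd

np.random.seed(112020)

# Stand-in for the random frame built earlier (values are synthetic)
toy_df = pd.DataFrame({
    "Group": np.tile(["Group01", "Group02", "Group03", "Group04"], 5),
    "Value": np.random.uniform(0, 100, 20),
})

# random_state pins which rows are drawn, so reruns give the same sample
sample_a = toy_df.sample(5, random_state=42)
sample_b = toy_df.sample(5, random_state=42)
print(sample_a.equals(sample_b))

# orient="list" drops the index labels: {column: [values]}
compact = sample_a.reset_index(drop=True).to_dict(orient="list")
print(compact)

# The compact dict rebuilds into a frame with a fresh 0..n-1 index
rebuilt_df = pd.DataFrame(compact)
```

The trade-off is that `orient="list"` discards the original row labels, which is usually fine once the index has been reset anyway.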
