# CME538 - Introduction to Data Science

## Tutorial 4 - JSON, Datetime and Visualization 
By Navid Kayhani, Marc Saleh
### Goals

### Tutorial Structure


1. JSON files Basics

    1.1 Loading JSON from file or string
    
    1.2 Writing JSON to file or string
    
    1.3 Accessing data in a json dictionary


2. Lecture Content Recap

    2.1 Working with strings for text analysis
    
    2.2 Working with datetime data type
    
    2.3 Visualization examples with Seaborn
    
    

<a id='section0'></a>
## Setup Notebook
At the start of a notebook, we need to import the Python packages we plan to use.
* [NumPy](https://numpy.org/) - A library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays. NumPy was introduced in Lecture 4 and we will learn more about its functionality in this lecture. It is customary to `import numpy as np`.
* [Pandas](https://pandas.pydata.org/) - pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language. Lecture 5 and 6 will do a deep dive into the core functionality of Pandas. It is customary to `import pandas as pd`. 
* [Seaborn](https://seaborn.pydata.org/) - Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics. We will use Seaborn throughout CIV1498 for data visualization. It is customary to `import seaborn as sns`.  
* [Matplotlib](https://matplotlib.org/) - Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. We will use Matplotlib throughout CIV1498 for data visualization. It is customary to `import matplotlib.pyplot as plt`. 

Next, we want to configure the Jupyter Notebook.
* `%matplotlib inline` - This code configures the notebook to display all plots, from Seaborn or Matplotlib, in the Notebook as opposed to in a separate pop-up window.
* `plt.style.use('fivethirtyeight')` - This code configures the plots with the "fivethirtyeight" styling, which tries to replicate the styles from the website [FiveThirtyEight](https://fivethirtyeight.com/).
* `sns.set_context("notebook")` - This sets the plotting context parameters to be optimized for a Notebook. This affects things like the size of the labels, lines, and other elements of the plot, but not the overall style.

In [None]:
# Standard library imports
import os
import json
from datetime import datetime

# 3rd party imports
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Configure Notebook
# display plots inline
%matplotlib inline
# disable Jedi (a common workaround when tab autocompletion misbehaves)
%config Completer.use_jedi = False

plt.style.use('fivethirtyeight')
sns.set_context("notebook")

<a id='section1'></a>
## 1. JSON files Basics


We will show what JSON looks like and how to work with it in Python, using the example dictionary shown below. 

In [None]:
dict_doe_family = {     
    "John": {
        "first name": "John", 
        "last name": "Doe", 
        "gender": "male", 
        "age": 30, 
        "favorite_animal": "panda",
        "married": True,
        "children": ["James", "Jennifer"],
        "hobbies": ["photography", "sky diving", "reading"]},
    "Jane": {
        "first name": "Jane", 
        "last name": "Doe", 
        "gender": "female", 
        "age": 27, 
        "favorite_animal": "zebra",
        "married": False,
        "children": None,
        "hobbies": ["cooking", "gaming", "tennis"]}}

We will focus on the following functions from the `json` module:
* **to read JSON**: the function `json.load()`
* **to write JSON**: the function `json.dump()`

### 1.1 Loading JSON from file or string
The `json.load()` function is used to load a JSON encoded file as a Python dictionary:

In [None]:
# read in file content as dict using the json module
# (the with-block closes the file once loading is done)
with open('Doe.json', encoding='utf8') as infile:
    dict_doe_family = json.load(infile)

print(type(dict_doe_family))
print(dict_doe_family)
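
When the JSON text is already held in a Python string rather than in a file, the standard library offers `json.loads()` (note the trailing `s`). The sketch below is only illustrative: `json_text` is a made-up string that mirrors part of the Doe family data.

```python
import json

# Hypothetical JSON text; note JSON's lowercase true/false and null
json_text = '{"Jane": {"age": 27, "married": false, "children": null}}'

# json.loads() parses a string, whereas json.load() reads from a file object
parsed = json.loads(json_text)

print(type(parsed))
print(parsed["Jane"]["married"], parsed["Jane"]["children"])
```

JSON's `false` and `null` come back as Python's `False` and `None`.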

### 1.2 Writing JSON to file or string
Let's first define our Python dictionary again:

In [None]:
dict_doe_family = {     
    "John": {
        "first name": "John", 
        "last name": "Doe", 
        "gender": "male", 
        "age": 30, 
        "favorite_animal": "panda",
        "married": True,
        "children": ["James", "Jennifer"],
        "hobbies": ["photography", "sky diving", "reading"]},
    "Jane": {
        "first name": "Jane", 
        "last name": "Doe", 
        "gender": "female", 
        "age": 27, 
        "favorite_animal": "zebra",
        "married": False,
        "children": None,
        "hobbies": ["cooking", "gaming", "tennis"]}}

The **`json.dump()`** function is used to write a Python dictionary to a JSON encoded file:

In [None]:
# write the dictionary to a JSON encoded file
with open("Doe.json", "w", encoding="utf8") as outfile:
    json.dump(dict_doe_family, outfile)
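
To get the JSON text as a string instead of writing it to a file, the counterpart is `json.dumps()`. The following is a minimal sketch with a small, made-up dictionary (`person` is not part of the tutorial data); `indent=4` is optional and only makes the output easier to read.

```python
import json

# A small, made-up dictionary in the same shape as the Doe family data
person = {"Jane": {"age": 27, "married": False, "children": None}}

# json.dumps() returns the JSON text as a string instead of writing to a file;
# indent=4 adds line breaks and indentation for readability
json_text = json.dumps(person, indent=4)
print(json_text)

# Round trip: parsing the string gives back an equal dictionary
restored = json.loads(json_text)
print(restored == person)
```

The same `indent` argument can also be passed to `json.dump()` when writing to a file.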

### 1.3 Accessing data in a json dictionary

Note that the values in this dictionary can be containers themselves: each top-level key ('John', 'Jane') has another dictionary as its value, and within those dictionaries the keys 'children' and 'hobbies' have lists as values (or `None`, as for Jane's 'children'). We can therefore look at such a dictionary in terms of **layers of nesting**.


Consider the dict below. How many layers (i.e. containers within containers) can you identify?

In [None]:
dict_doe_family = {     
    "John": {
        "first name": "John", 
        "last name": "Doe", 
        "gender": "male", 
        "age": 30, 
        "favorite_animal": "panda",
        "married": True,
        "children": ["James", "Jennifer"],
        "hobbies": ["photography", "sky diving", "reading"]},
    "Jane": {
        "first name": "Jane", 
        "last name": "Doe", 
        "gender": "female", 
        "age": 27, 
        "favorite_animal": "zebra",
        "married": False,
        "children": None,
        "hobbies": ["cooking", "gaming", "tennis"]}}

In [None]:
# access information about John
john_info = dict_doe_family['John']
print(john_info)
# access information about John's hobbies:
john_hobbies = john_info['hobbies']
print(john_hobbies)

# You can also do this in one go:
john_hobbies = dict_doe_family['John']['hobbies']
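
One way to check your count of nesting layers is to compute it. The helper below is a hypothetical sketch, not part of the `json` module. It assumes that the outermost container counts as layer one and that only dictionaries and lists count as containers.

```python
def nesting_depth(obj):
    """Return how many container layers (dicts or lists) are stacked in obj."""
    if isinstance(obj, dict):
        children = obj.values()
    elif isinstance(obj, list):
        children = obj
    else:
        # strings, numbers, booleans and None are not containers
        return 0
    # this container counts as one layer, plus the deepest layer inside it
    return 1 + max((nesting_depth(child) for child in children), default=0)


# A small, made-up example: dict -> dict -> list
example = {"John": {"age": 30, "hobbies": ["reading"]}}
print(nesting_depth(example))
```

Calling `nesting_depth(dict_doe_family)` applies the same count to the family dictionary.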

In [None]:
# iterate over the family dict, accessing the family members (keys)
# and their information (values), and save their hobbies in a list

# create empty list where hobbies of family members will be stored
members_hobbies = []

for key, values in dict_doe_family.items():
    # check what we are accessing:
    print(key, type(values))
    # access hobbies from info_dict
    hobbies = values['hobbies']
    print(hobbies)
    members_hobbies.append(hobbies)
    
print(members_hobbies)
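
The same loop needs a little care for keys whose value may be `None`, such as Jane's 'children': extending a list with `None` raises a `TypeError`. One possible way to guard against this is sketched below on a trimmed-down, hypothetical copy of the dictionary (`family` is not the tutorial's variable).

```python
# A trimmed-down, hypothetical copy of the family dictionary
family = {
    "John": {"children": ["James", "Jennifer"]},
    "Jane": {"children": None},
}

all_children = []
for name, info in family.items():
    # .get() returns None if the key is missing;
    # `or []` also turns a None value into an empty list
    children = info.get("children") or []
    all_children.extend(children)

print(all_children)
```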

## 2.1 Working with strings for text analysis

In [None]:
s = "#Hey, Thi$s is a string. With. Punctuation!!!" # Sample string 

Python's `string` module provides a constant called `string.punctuation` that lists the ASCII punctuation characters.

In [None]:
import string
string.punctuation

Python strings offer a method called `translate()` that maps one set of characters to another.

We can use the function `str.maketrans()` to create the mapping table. Passing two empty strings creates an empty mapping table, and the third argument of this function allows us to list all of the characters to remove during the translation process. For example:

In [None]:
s.translate(str.maketrans('', '', string.punctuation))
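
To see what the mapping table contains, it can help to build it separately and inspect one entry. This is a small sketch using the same sample string; the variable names `table` and `cleaned` are only illustrative.

```python
import string

# With two empty strings, the third argument lists the characters to delete
table = str.maketrans('', '', string.punctuation)

# The table maps each character's Unicode code point to None (meaning "delete")
print(table[ord('!')])

s = "#Hey, Thi$s is a string. With. Punctuation!!!"
cleaned = s.translate(table)
print(cleaned)
```

Building the table once and reusing it also avoids recreating it for every string when cleaning a whole list.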

In [None]:
# survey responses from electric vehicle owners on their opinion of public charging infrastructure
responses_survey = ['!!For in city use, my at home level 2 charger is more than adequate. Because I have limited range and poor battery management system in my car, I rarely go out of town, but I am in the market for a new EV and consider that the number and distribution of level 3 chargers throughout Canada is grossly inadequate and I am concerned that as they are being build they are not being built to provide energy quickly enough to match the capabilites of EVs that are coming on market. In my view, they should be capable of 350kW. This will ensure that there are not lineups at the pumps so to speak.',
                                        'Once installed, public EV charging stations should be recommissioned (eg GO transit parking lots where the units were removed but the rough-in wiring remains).  Give EV ownership a chance to expand given federally zero-emission vehicle mandate by 2035.  Why not offer incentives for phasing in EV charging stations at petrol stations as ICE cars phase-out and EV cars ownership expands?  Reducing/eliminating the cause of range anxiety could help to get us all into zero-emission vehicles ahead of 2035.',
                                        'Given the positive environmental effects of EVs I believe the price of electricity should be MUCH lower for EV owners.  I.e. there should be a way to determine what electricity I uses is going to my car...? separate metre? Also given MB has a lot of extra electricity and gives it away to large users I believe EV owners should get some sort of credit. Also I worry that MB Hydro is NOT prepared to provide adequate electric service to communities with LOTS of EV usage.',
                                        'DCFC is highly over-rated.  Most cars and depending on SoC, can not fully utilize rates over 100KW, and all batteries cut back draw starting about 60% SoC.  It would be better to have more 50/100KW chargers for light duty vehicles than funding for faster chargers. DCFC should normally only be used for inter-city travel, and located in service centres on highways (which is not listed in the survey question above).  Destinations and overnight hotels Level 2 is adequate.']


Remove all punctuation and common words (stopwords) from each response in the list of survey responses.

In [None]:
# create new list without punctuation
punctuation_table = str.maketrans('', '', string.punctuation)
responses_survey_stripped = [response.translate(punctuation_table) for response in responses_survey]


# create new list without stopwords
# (if the corpus is missing, run nltk.download('stopwords') once beforehand)
from nltk.corpus import stopwords
# stopwords include common words like ['what', 'who', 'is', 'a', 'at', 'he']

# build the English stopword set once; set membership tests are fast
stop_words = set(stopwords.words('english'))

responses_filtered = []

for response in responses_survey_stripped:
    words_response = response.split()
    # remove stopwords from words_response (compare in lowercase, as the stopword list is lowercase)
    words_response_filtered = [word for word in words_response if word.lower() not in stop_words]
    # join the remaining words into a sentence and append it to the new list
    responses_filtered.append(" ".join(words_response_filtered))

In [None]:
responses_filtered
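
Once punctuation and stopwords are removed, a common next step in text analysis is counting how often each remaining word appears. The sketch below uses only the standard library. The `responses` list and the tiny `stop_words` set are made up for illustration and stand in for the survey data and the NLTK stopword list.

```python
from collections import Counter

# Hypothetical, already-cleaned responses (punctuation removed)
responses = ["public chargers are too slow", "more public chargers are needed"]

# A tiny illustrative stopword set; NLTK's English list is much longer
stop_words = {"are", "too", "more"}

# Count every remaining word across all responses, ignoring case
word_counts = Counter(
    word.lower()
    for response in responses
    for word in response.split()
    if word.lower() not in stop_words
)

print(word_counts.most_common(2))
```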

## 2.2 Working with datetime data type

In this tutorial, we’ll be working with daily time series of Open Power System Data (OPSD) for Germany, which has been rapidly expanding its renewable energy production in recent years. The data set includes country-wide totals of electricity consumption, wind power production, and solar power production for 2006-2017. 

In [None]:
# load json data
with open('open_power_system_data_OPSD_germany_daily.json') as infile:
    opsd_daily_data = json.load(infile)

In [None]:
# convert the data into pandas dataframe
opsd_daily = pd.DataFrame(opsd_daily_data) 
opsd_daily.tail()

In [None]:
# print dataframe info
opsd_daily.info()

In [None]:
# transform numerical columns to floats
opsd_daily[['Wind', 'Solar', 'Wind+Solar']] = opsd_daily[['Wind', 'Solar', 'Wind+Solar']].apply(pd.to_numeric)
opsd_daily.info()

In [None]:
# convert the 'Date' column to datetime format
opsd_daily['Date']= pd.to_datetime(opsd_daily['Date'])
opsd_daily.info()
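
For a single date string, the standard library's `datetime` class (imported in the setup cell) does similar parsing to `pd.to_datetime()`. A minimal sketch, assuming a made-up date string in year-month-day form:

```python
from datetime import datetime

# A made-up date string in ISO year-month-day form
date_text = "2017-12-31"

# strptime() parses text according to an explicit format string
parsed = datetime.strptime(date_text, "%Y-%m-%d")
print(parsed.year, parsed.month, parsed.day)

# strftime() goes the other way and formats a datetime as text
print(parsed.strftime("%d/%m/%Y"))
```

The same format codes can be passed to `pd.to_datetime()` through its `format` argument when the date layout is known in advance.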

In [None]:
# set the date column as the index
opsd_daily.set_index('Date', inplace=True)
opsd_daily.tail()

#### Time-based indexing

In [None]:
# add new columns 'Year' and 'Month' that indicate the year and month associated with each observation
opsd_daily['Year'] = opsd_daily.index.year
opsd_daily['Month'] = opsd_daily.index.month
opsd_daily.tail()
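
With a datetime index in place, rows can also be selected directly by date strings. The sketch below uses a small synthetic frame (`demo`, with made-up values) rather than the OPSD data, so it runs on its own.

```python
import pandas as pd

# Synthetic daily data standing in for the OPSD table (values are made up)
dates = pd.date_range("2016-12-30", periods=5, freq="D")
demo = pd.DataFrame({"Consumption": [10.0, 11.0, 12.0, 13.0, 14.0]}, index=dates)

# Partial-string indexing: select every row that falls in 2017
rows_2017 = demo.loc["2017"]
print(rows_2017)

# Slice a range of dates; both end points are included
date_slice = demo.loc["2016-12-31":"2017-01-02"]
print(date_slice)
```

The same pattern should apply to `opsd_daily` once 'Date' is its index, for example `opsd_daily.loc['2017-08']` for a single month.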

## 2.3 Visualization examples with seaborn

The following chart was made by ([Abela, 2006](http://extremepresentation.typepad.com/blog/2006/09/choosing_a_good.html)). It provides a first intuition on what kind of visualization to choose for your data. He also asks exactly the right question: **What do you want to show?** It is essential for any piece of communication to first consider: what is my main point? And after creating a visualization, to ask yourself: does this visualization indeed communicate what I want to communicate? (Ideally, also ask others: what kind of message am I conveying here?)

![chart_chooser](./images/chart_chooser.jpg)


In [None]:
opsd_daily

In [None]:
# Plot boxplots of daily energy consumption per year
plt.figure(figsize=(10, 5))
plt.title('Daily Energy Consumption Segmented per Year', fontsize=18)
ax = sns.boxplot(x = opsd_daily['Year'] , y = opsd_daily['Consumption'])
ax.xaxis.set_tick_params(labelsize=14)
ax.yaxis.set_tick_params(labelsize=14)
ax.set_xlabel('Year', fontsize=18)
ax.set_ylabel('Energy', fontsize=18)
plt.show()

In [None]:
# create new df by aggregating data to yearly energy consumption and production
opsd_annual = opsd_daily.groupby(by= ['Year']).sum().drop(columns = ['Month'])
opsd_annual
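
One detail of `groupby(...).sum()` is worth knowing: missing values are skipped, so a group that contains only `NaN` sums to 0 rather than `NaN`. The sketch below shows this on synthetic data; whether the early years of the OPSD data actually contain `NaN` for wind and solar is an assumption not verified here.

```python
import numpy as np
import pandas as pd

# Synthetic data: no wind measurements in 2006, two in 2017 (values are made up)
demo = pd.DataFrame({
    "Year": [2006, 2006, 2017, 2017],
    "Wind": [np.nan, np.nan, 100.0, 150.0],
})

# By default NaN values are skipped, so an all-NaN group sums to 0
annual = demo.groupby("Year").sum()
print(annual)

# min_count=1 keeps NaN when a group has no valid values at all
annual_nan = demo.groupby("Year").sum(min_count=1)
print(annual_nan)
```

Using `min_count=1` makes it possible to tell "no data" apart from a true total of zero.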

In [None]:
# Plot a lineplot of the total energy consumption over each year
plt.figure(figsize=(10, 5))
plt.title('Yearly Energy Consumption', fontsize=18)
ax = sns.lineplot(x = opsd_annual.index, y = opsd_annual['Consumption'], label='Total Energy Consumption')
#ax = sns.lineplot(x = opsd_annual.index, y =opsd_annual['Wind'], label='Wind Energy Production')
#ax = sns.lineplot(x = opsd_annual.index, y = opsd_annual['Solar'], label='Solar Energy Production')
ax.legend(fontsize=16)
ax.xaxis.set_tick_params(labelsize=14)
ax.yaxis.set_tick_params(labelsize=14)
ax.set_xlabel('Year', fontsize=18)
ax.set_ylabel('Energy', fontsize=18)
plt.show()

In [None]:
# Plot lineplots of the total energy production from wind and solar over each year

plt.figure(figsize=(10, 5))
plt.title('Yearly Renewable Energy Production', fontsize=18)
ax = sns.lineplot(x = opsd_annual.index, y =opsd_annual['Wind'], label='Wind Energy Production')
ax = sns.lineplot(x = opsd_annual.index, y = opsd_annual['Solar'], label='Solar Energy Production')
ax.legend(fontsize=16)
ax.xaxis.set_tick_params(labelsize=14)
ax.yaxis.set_tick_params(labelsize=14)
ax.set_xlabel('Year', fontsize=18)
ax.set_ylabel('Energy', fontsize=18)
plt.show()

In [None]:
# add a column that estimates the proportion of wind + solar out of total energy consumed in a year
opsd_annual['Wind+Solar/Consumption [%]'] = (opsd_annual['Wind+Solar'] / opsd_annual['Consumption'] * 100).round(1)
opsd_annual

In [None]:
# Plot a barplot of the proportion of energy consumption that was produced from wind + solar each year
plt.figure(figsize=(10, 5))
plt.title('Yearly Energy Production Proportion from Renewables', fontsize=18)

# drop years where the renewable proportion is zero
opsd_annual = opsd_annual[opsd_annual['Wind+Solar/Consumption [%]'] != 0]

ax = sns.barplot(x = opsd_annual.index, y = opsd_annual['Wind+Solar/Consumption [%]'],
                 palette="Blues_d")
ax.xaxis.set_tick_params(labelsize=14)
ax.yaxis.set_tick_params(labelsize=14)
ax.set_xlabel('Year', fontsize=18)
ax.set_ylabel('Proportion (%)', fontsize=18)
plt.show()