# DATA2001 - Week 2
## Data Exploration with Python

Let's start by importing all required Python libraries.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Step 1: Reading data from a CSV file

For this demo, we are working with an example dataset about the major Australian power stations as published on data.gov.au:

In [None]:
# import CSV data into a Pandas DataFrame - and let's have a first look at its shape and content
rawData = pd.read_csv('MajorPowerStations_v2.csv')

print(rawData.shape)
rawData.head()

### 1.1 Inconsistent Data Examples - Inspect the 9th, 23rd, 31st, and 4th last records (using iloc[])
Datasets often contain inconsistent entries. The following code demonstrates how to inspect specific records within a DataFrame using the **iloc[]** indexer, which selects rows by integer position. Note that row positions start at 0, so the 9th record is at position 8. As we can see, there are NaN and "< Null >" values which we will need to address when preparing this data for further analysis.

In [None]:
idxLs = np.append(np.array([9,23,31])-1, -4)
rawData.iloc[idxLs]
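
To make the position arithmetic explicit, here is a minimal sketch on a hypothetical toy DataFrame (the `station_N` names are invented for illustration and are not part of the power station dataset). It shows that subtracting 1 turns "9th record" into position 8, and that a negative position counts from the end.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 40 rows named station_1 ... station_40
df = pd.DataFrame({"name": [f"station_{i}" for i in range(1, 41)]})

# 1-based ordinals (9th, 23rd, 31st) become 0-based positions; -4 is the 4th-last row
positions = np.append(np.array([9, 23, 31]) - 1, -4)
picked = df.iloc[positions]

print(picked["name"].tolist())
```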

### 1.2 Rename DataFrame Columns
The DataFrame columns are automatically named after the header row of the CSV file. You can change those names, e.g. to make them consistent with other datasets or simply to shorten them, as follows:

In [None]:
# let's see the current axis titles
rawData.axes

In [None]:
# Another way to check the current names of the columns as read from the CSV's header line:
rawData.columns.values.tolist()

In [None]:
# Create a working copy of the raw data
wrkData = rawData.copy()

# Rename columns
wrkData.rename(columns={
    'OBJECTID': 'oid',
    'CLASS': 'class',
    'FID': 'fid',
    'NAME': 'name',
    'OPERATIONALSTATUS': 'status',
    'OWNER': 'owner',
    'GENERATIONTYPE': 'type',
    'PRIMARYFUELTYPE': 'fueltype',
    'PRIMARYSUBFUELTYPE': 'fuelsubtype',
    'GENERATIONMW': 'power',
    'GENERATORNUMBER': 'numGen',
    'SUBURB': 'suburb',
    'STATE': 'state',
    'SPATIALCONFIDENCE': 'spConf',
    'REVISED': 'revised',
    'COMMENT': 'comment',
    'LATITUDE': 'lat',
    'LONGITUDE': 'long'
}, inplace=True)

# Check df column data types
print(wrkData.dtypes)

# View
wrkData.head()

In [None]:
# can also drop some columns which we do not need
wrkData.drop(['fid'], axis=1, inplace=True)
wrkData.head()

In [None]:
wrkData['name'].count()

## Step 2. Data Cleaning and Conversion

In [None]:
# check what we have read in so far
# wrkData

### 2.1 Cleaning nominal data

In [None]:
# check the names of owners again, as done before in the lecture using Refine
wrkData['owner'].unique()

# the same, but alphabetically sorted
#wrkData.sort_values(by=['owner'])['owner'].unique()

In [None]:
# potential fix: map the short spellings onto the full company name
# (assigning the result back is more robust than inplace=True on a column selection)
wrkData['owner'] = wrkData['owner'].replace({"AGL": "AGL Energy Pty Ltd", "AGL Energy": "AGL Energy Pty Ltd"})
wrkData.sort_values(by=['owner'])['owner'].unique()
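
The idea of mapping several spellings onto one canonical name can be sketched in isolation as follows. The owner list and the `aliases` dictionary are hypothetical illustration data; the real dataset may contain further variants.

```python
import pandas as pd

# Hypothetical owner values with inconsistent spellings
owners = pd.Series(["AGL", "AGL Energy", "AGL Energy Pty Ltd", "Origin Energy"])

# Each key is matched against the whole cell value (not as a substring)
aliases = {
    "AGL": "AGL Energy Pty Ltd",
    "AGL Energy": "AGL Energy Pty Ltd",
}
cleaned = owners.replace(aliases)

print(sorted(cleaned.unique()))
```

Keeping all aliases in one dictionary makes it easy to extend the mapping when new spellings turn up.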

### 2.2 Date and Timestamps
Date values should be converted to datetime types to enable corresponding date functions and comparisons.

In [None]:
# the 'revised' column was initially read by Pandas as int values
print("before conversion:",wrkData['revised'].dtypes)
print("  date of 1st row:",wrkData.iloc[0]['revised'],"\n")

# 'revised' is actually a date
# Convert 'revised' column to datetime
wrkData['revised'] = pd.to_datetime(wrkData['revised'],format="%Y%m%d")

# after the conversion, date functions such as .date() become available
print("after conversion:",wrkData['revised'].dtypes)
print(" datetime of 1st row:",wrkData.iloc[0]['revised'])
print(" date of 1st row:",wrkData.iloc[0]['revised'].date())
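
As a minimal sketch of what the conversion enables, the example below parses hypothetical `YYYYMMDD` integers (invented values, including one deliberately invalid entry) and then filters by date. Passing `errors="coerce"` is an optional choice made here so that unparsable entries become `NaT` instead of raising an error.

```python
import pandas as pd

# Hypothetical revision dates stored as integers; 20151340 is not a valid date
revised = pd.Series([20150301, 20161115, 20151340])

# Convert to strings first, then parse with an explicit format
parsed = pd.to_datetime(revised.astype(str), format="%Y%m%d", errors="coerce")

# Datetime values support comparisons and accessors such as .dt.year
recent = parsed[parsed >= pd.Timestamp("2016-01-01")]

print(parsed)
print(recent.dt.year.tolist())
```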

### 2.3 Cleaning numerical data (ordinal, interval, ratio) 
For aggregation and plotting, numerical variables must be free of placeholder strings. Depending on which function you want to use, replacing NaN values might also be needed (e.g. a standard int column cannot hold NaN).

In [None]:
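# replace NaN values which were generated by read_csv()
# (assigning the result back is more robust than inplace=True on a column selection)
wrkData['numGen'] = wrkData['numGen'].fillna(0)
wrkData['fuelsubtype'] = wrkData['fuelsubtype'].fillna('')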
wrkData

In [None]:
# try to convert numGen to integer and power to float values  (note: will fail)
print("before conversion:",wrkData['numGen'].dtypes)
print("before conversion:",wrkData['power'].dtypes)

wrkData['numGen'] = wrkData['numGen'].astype(int)
wrkData['power']  = wrkData['power'].astype(float)

The code above failed because there is an unparsable string in the 'numGen' column, which we need to fix before we can proceed with the type conversion.

In [None]:
# Option 1: Fix the read_csv() call and pass missing values list including "<Null>"
# missing_values = ["<Null>"]
# rawData = pd.read_csv('...', na_values = missing_values)

# Option 2: replace <Null> value in current numGen column
wrkData['numGen'] = wrkData['numGen'].replace(to_replace="<Null>", value=0)
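
Option 1 can be sketched without the real file by reading a small in-memory CSV. The CSV text below is hypothetical and only mimics the kind of placeholder described above; the exact placeholder spelling in the real file should be checked before relying on this.

```python
import io
import pandas as pd

# Hypothetical CSV content with a placeholder string and an empty cell
csv_text = """NAME,GENERATORNUMBER
Station A,4
Station B,<Null>
Station C,
"""

# Without na_values the placeholder forces a non-numeric column
plain = pd.read_csv(io.StringIO(csv_text))

# With na_values the placeholder is treated like a missing value
clean = pd.read_csv(io.StringIO(csv_text), na_values=["<Null>"])

print(plain["GENERATORNUMBER"].dtype)
print(clean["GENERATORNUMBER"].dtype)
print(clean["GENERATORNUMBER"].isna().sum())
```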

In [None]:
print("before conversion:",wrkData['numGen'].dtypes)

# try converting to numeric data types again
wrkData['numGen'] = wrkData['numGen'].astype(int)

print("after conversion:",wrkData['numGen'].dtypes)
print("after conversion:",wrkData['power'].dtypes)
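
An alternative to replacing placeholders with 0 is to keep them as missing values. The sketch below, on hypothetical values, uses `pd.to_numeric` with `errors="coerce"` and the nullable `Int64` dtype; whether 0 or "missing" is the better choice depends on the analysis, so treat this as one option rather than the recommended fix.

```python
import pandas as pd

# Hypothetical generator counts as they might arrive from a CSV file
num_gen = pd.Series(["4", "<Null>", None, "12"])

# Unparsable strings become NaN instead of raising an error
as_number = pd.to_numeric(num_gen, errors="coerce")

# The nullable integer dtype can hold missing values, unlike plain int
nullable = as_number.astype("Int64")

print(nullable)
print(nullable.sum())  # missing values are skipped by default
```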

In [None]:
# Check df column data types
print(wrkData.dtypes)

## Step 3. Data Exploration and Descriptive Statistics
The next step is to explore the data with the following Python / Pandas statements.

### Descriptive Statistics over all data entries

In [None]:
# which status values are used for the major Power Stations?
wrkData['status'].unique()

In [None]:
# which type of power stations are listed in the dataset?
# notice that "< Null >" and "nan" values are used here as well, which we would also need to fix...
wrkData['type'].unique()

In [None]:
# what kind of fuel is used by major power stations to produce energy?
wrkData['fueltype'].unique()

In [None]:
# what is the value range of the power generation capacity across all stations?
print("Minimum power generation capacity:", wrkData['power'].min(), "MW")
print("Maximum power generation capacity:", wrkData['power'].max(), "MW")

In [None]:
# what are the mean (average) and median of the power generation capacity across all stations?
print("Average power generation capacity:", wrkData['power'].mean(), "MW")
print("Median  power generation capacity:", wrkData['power'].median(), "MW")

### Filtering
We can also compute those descriptive statistics for just a selected subset of the dataset using the **.loc[]** indexer:

In [None]:
# Which thermal solar power stations are in the dataset?
wrkData.loc[wrkData['type']=='Solar Thermal']

In [None]:
# How many wind parks are listed in the dataset?
wrkData.loc[wrkData['type']=='Wind Turbine', 'name'].count()

We also can specify complex filter conditions, e.g. connected with a logical and (**&**), and show only selected columns in the output.

In [None]:
# Which wind parks are large, i.e. have more than 100 MW power capacity?
# just showing the name, type, power, numGen and state attributes
wrkData.loc[ (wrkData['type']=='Wind Turbine') & (wrkData['power']>100),  ['name', 'type', 'power', 'numGen', 'state'] ]
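
The same kind of combined filter can be built step by step from named boolean masks, which some readers find easier to follow. The sketch below uses a hypothetical four-row DataFrame; the station names and values are invented.

```python
import pandas as pd

# Hypothetical stations for illustration only
stations = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "type": ["Wind Turbine", "Wind Turbine", "Solar Thermal", "Wind Turbine"],
    "power": [150.0, 40.0, 120.0, 210.0],
})

# Each condition yields a boolean Series with one entry per row
is_wind = stations["type"] == "Wind Turbine"
is_large = stations["power"] > 100

# & combines the masks element-wise; .loc selects rows and columns together
large_wind = stations.loc[is_wind & is_large, ["name", "power"]]

print(large_wind)
```

When the conditions are written inline, each comparison needs its own parentheses because `&` binds more tightly than `==` and `>` in Python.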

**Question:** What do you think: does Australia have more installed power capacity from wind or from solar?

In [None]:
print("Total installed wind power:  ",wrkData.loc[wrkData['fueltype']=='Wind', 'power'].sum(), "MW")
print("Total installed solar power: ",wrkData.loc[wrkData['fueltype']=='Solar','power'].sum(), "MW")

### Grouping
Sometimes, it is useful to **group** data by a certain attribute and then to summarise all entries of the same group, so that we can compare different groups.

In [None]:
# which is the most frequent class of power station?
wrkData['class'].mode()

In [None]:
# What is the frequency distribution of the power station class?
# this can be done with the groupby() function, followed by a size() for each group
classDistr = wrkData.groupby('class').size()
print(classDistr)
print(classDistr.iloc[0])  # size of the first group, accessed by position

In [None]:
# what is the total power output capacity per class of power stations?
wrkData.groupby('class')['power'].sum()
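
Several summaries per group can be computed in one pass with `agg()`, and `value_counts()` is a shortcut for the frequency distribution. The sketch below uses a hypothetical three-row DataFrame; the class labels and power values are invented.

```python
import pandas as pd

# Hypothetical stations for illustration only
stations = pd.DataFrame({
    "class": ["Major", "Major", "Minor"],
    "power": [100.0, 300.0, 50.0],
})

# Frequency of each class, most frequent first
class_counts = stations["class"].value_counts()

# Count, total and mean power per class in a single table
summary = stations.groupby("class")["power"].agg(["count", "sum", "mean"])

print(class_counts)
print(summary)
```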

## Step 4. Data Visualisations
### (a) Frequency Plot / Histogram
Produce a bar chart showing which primary fuel types are used. Bar chart plotting reference: https://pythonspot.com/matplotlib-bar-chart/

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

fuelTypeDistr = wrkData.groupby('fueltype').size().reset_index(name='numStations')

# Plot
plt.bar(fuelTypeDistr['fueltype'], fuelTypeDistr['numStations'], alpha=0.5, align='center')
plt.xticks(rotation=40)
plt.title('Major Australian Power Stations')
plt.xlabel('Primary Fuel Type')
plt.ylabel('Count')
plt.grid()

### (b) Histogram with Binning
Plot the distribution of the power output across all power stations, grouped into 10 equal-width bins.

In [None]:
pyExpFreq = wrkData['power'].hist(bins=10, rwidth=0.9, color='#607c8e')
plt.title('Histogram of Power Output')
plt.xlabel('Power Output [MW]')
plt.ylabel('Number of Power Stations')
plt.grid(axis='y', alpha=0.25)
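
To see what the binning does numerically, the counts behind a histogram can be computed with `np.histogram`. The power values below are hypothetical and chosen to be skewed, which is a common reason why most stations end up in the first bin of an equal-width histogram.

```python
import numpy as np

# Hypothetical power outputs in MW, for illustration only
power = np.array([5, 20, 35, 50, 80, 120, 400, 950, 1000, 2000])

# 10 equal-width bins between the minimum and maximum value
counts, edges = np.histogram(power, bins=10)

# Explicit bin edges are an alternative when the data is skewed
custom_counts, _ = np.histogram(power, bins=[0, 100, 500, 1000, 2500])

print(counts)
print(edges)
print(custom_counts)
```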

### (c) Scatter Plot for comparing power and size

In [None]:
%matplotlib inline

fig = plt.figure()
sub = plt.subplot()
wrkData.plot.scatter(x='numGen', y='power', c='DarkBlue', ax=sub)
sub.set_xlim(0,50)
plt.title('Power Output vs Num Generators')
plt.xlabel('Num Generators')
plt.ylabel('Output [MW]')

Next we want to colour-code the data points in the scatter plot by fuel type. To be able to assign a colour to each point, we need to create a numeric encoding of the fuel type.

In [None]:
# which fuel types exist again?
wrkData['fueltype'].unique()

In [None]:
# What is the frequency distribution of the fuel types?
fuelTypeDistr = wrkData.groupby('fueltype').size()
print(fuelTypeDistr)

In [None]:
# assign colors to some selected fuel types
# the numbers and the order chosen are up to you.
# we have chosen an order that works well with the color schemes used in the subsequent plots
wrkData['fuelEncoding'] = wrkData['fueltype'].map({
    'Biogas': 1,
    'Wind': 2,
    'Solar': 3,
    'Water': 4,
    'Natural Gas': 5,
    'Coal Seam Methane': 6,
    'Coal': 7
})
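
One thing to keep in mind with `map()` is that any value missing from the dictionary becomes NaN. The sketch below shows this on hypothetical fuel types (including an invented unmapped value, "Diesel") and, as an alternative, uses `pd.factorize` to generate codes for every distinct value automatically.

```python
import pandas as pd

# Hypothetical fuel types; "Diesel" is deliberately absent from the mapping
fuel = pd.Series(["Wind", "Coal", "Diesel", "Wind"])

encoding = {"Wind": 2, "Coal": 7}
encoded = fuel.map(encoding)  # unmapped values turn into NaN

# factorize assigns codes in order of first appearance
codes, labels = pd.factorize(fuel)

print(encoded.tolist())
print(codes.tolist(), list(labels))
```

A hand-written mapping gives control over the colour order, while automatic codes guarantee that no category is silently dropped.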

In [None]:
# Now we can use this encoding column to color our plot
%matplotlib inline

fig = plt.figure()
sub = plt.subplot()
wrkData.plot.scatter(x='numGen', y='power', c='fuelEncoding', ax=sub)
sub.set_xlim(0,50)
plt.title('Power Output vs Num Generators')
plt.xlabel('Num Generators')
plt.ylabel('Output [MW]')

The plot of the fuel types is easier to interpret when we use colour. We can either define dedicated colour values, or use one of the pre-defined colour maps from Matplotlib. Choosing an appropriate colour scheme makes it much easier to see that coal or natural gas power stations have few generators but a large output, whereas e.g. wind farms have a high number of generators but a lower overall output.

In [None]:
# the same plot as before, but using a more vivid color scheme (colormap='Accent')
# (for available colormaps from matplotlib, see https://matplotlib.org/3.1.0/tutorials/colors/colormaps.html)
%matplotlib inline

fig = plt.figure()
sub = plt.subplot()
wrkData.plot.scatter(x='numGen',y='power',c='fuelEncoding',colormap='Accent',ax=sub)
sub.set_xlim(0,50)
plt.title('Power Output vs Num Generators')
plt.xlabel('Num Generators')
plt.ylabel('Output [MW]')

### (d) Boxplots for Likert-Scale
Visualise a boxplot for **spConf**, the spatial confidence value for the GPS location of the stations.

In [None]:
%matplotlib inline

ax = wrkData.boxplot(['spConf'])
ax.set_title('Spatial Confidence')
ax.set_yticks(np.arange(1, 6, 1.0))  # tick marks for confidence values 1 to 5
ax.grid(axis='y', alpha=0) # disable grid lines
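
The points a boxplot draws as outliers follow, by Matplotlib's default, the 1.5 × IQR rule. The sketch below reproduces that calculation on hypothetical confidence values (invented for illustration, not taken from the dataset).

```python
import pandas as pd

# Hypothetical confidence ratings on a 1-5 scale
conf = pd.Series([5, 5, 4, 5, 4, 5, 5, 4, 1, 5, 4, 1])

# Quartiles define the box; the interquartile range is its height
q1, q3 = conf.quantile([0.25, 0.75])
iqr = q3 - q1

# Values beyond 1.5 * IQR from the box are treated as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = conf[(conf < lower_fence) | (conf > upper_fence)]

print(q1, q3, iqr)
print(outliers.tolist())
```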

The boxplot shows that most entries have a high to very high location confidence (4 or 5), but there are two outliers with very low confidence in their location values.

## That's it for today