# Data Exploration with Python

In the following, we will get some first experience with using Python and Jupyter notebooks for exploratory data analysis. We will again use the data from the NSW "Work-From-Home" survey as our example scenario.

## EXERCISE 1: Reading and accessing data

### Reading the WFH survey responses data using Pandas

Download the _WFH-Survey-Responses-NSW.csv_ file. This is a clean version of the file. Note that we have changed the format of the file to .csv (comma-separated values), so you can get familiar with another very common file type. **Make sure that you save this file in the same folder as this Jupyter notebook.**

To read the file and store the data, we will use `pandas`, an external Python module which contains useful functionality for processing and transforming data. First we will read our file with the `read_csv` function, and then we will print the first 3 rows of our data using the `head` function, to see what it looks like.

In [2]:
import pandas as pd
pd.options.mode.chained_assignment = None

df = pd.read_csv('WFH-Survey-Responses-NSW.csv')
df.head(3)

Unnamed: 0,Response ID,What year were you born?,What is your gender?,Which of the following best describes your industry?,Which of the following best describes your industry? (Detailed),Which of the following best describes your current occupation?,Which of the following best describes your current occupation? (Detailed),How many people are currently employed by your organisation?,Do you manage people as part of your current occupation?,Which of the following best describes your household?,...,My organisation encouraged people to work remotely,My organisation was well prepared for me to work remotely,It was common for people in my organisation to work remotely,It was easy to get permission to work remotely,I could easily collaborate with colleagues when working remotely,I would recommend remote working to others,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22
0,1,1972,Female,Manufacturing,Food Product Manufacturing,Clerical and administrative,Other Clerical and Administrative,Between 20 and 199,No,Couple with no dependent children,...,,,,,,,,,,
1,2,1972,Male,Wholesale Trade,Other Goods Wholesaling,Managers,"Chief Executives, General Managers and Legisla...",Between 1 and 4,Yes,Couple with dependent children,...,Somewhat agree,Somewhat agree,Somewhat agree,Somewhat agree,Somewhat agree,Somewhat agree,,,,
2,3,1982,Male,"Electricity, Gas, Water and Waste Services",Gas Supply,Managers,"Chief Executives, General Managers and Legisla...",More than 200,Yes,One parent family with dependent children,...,Somewhat agree,Somewhat agree,Neither agree nor disagree,Somewhat agree,Neither agree nor disagree,Neither agree nor disagree,,,,


As you can see, there are four columns at the end of the file that contain no information. We can remove them with the pandas `drop` function, which can remove columns or rows. Since `drop` returns a new DataFrame instead of changing the existing one, we assign the result back to `df`.

In [None]:
df = df.drop(columns=['Unnamed: 19','Unnamed: 20','Unnamed: 21','Unnamed: 22'])
df.head()
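To see why the result of `drop` has to be stored, here is a minimal sketch on a small made-up DataFrame. The names `toy` and `trimmed` and the column values are illustrative assumptions, not part of the survey data.

```python
import pandas as pd

# A small made-up DataFrame with one empty trailing column (illustrative only)
toy = pd.DataFrame({
    'Response ID': [1, 2, 3],
    'Gender': ['Female', 'Male', 'Male'],
    'Unnamed: 3': [None, None, None],
})

# drop() returns a new DataFrame; 'toy' itself is left unchanged
trimmed = toy.drop(columns=['Unnamed: 3'])

print(list(toy.columns))      # still contains 'Unnamed: 3'
print(list(trimmed.columns))  # the empty column is gone
```

Unless the returned DataFrame is assigned to a variable, the original data keeps all of its columns.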

### Let's define column header names (define constants for dictionary keys)

In pandas, we can access a column by passing its _header_, as `df['column_header']`. You can also select multiple columns by passing a list of headers, e.g. `df[['column1_header','column2_header']]`.

Given that the headers in our file are very long questions, we can create a variable with a shorter name to store the original header. That way we can use this shorter version as an input instead of the original header, making it much easier to work with.

In [None]:
RESPONSE = 'Response ID'
YEAR_BORN = 'What year were you born?'
GENDER = 'What is your gender?'
INDUSTRY = 'Which of the following best describes your industry?'
INDUSTRY_DETAILED = 'Which of the following best describes your industry? (Detailed)'
OCCUPATION = 'Which of the following best describes your current occupation?'
OCCUPATION_DETAILED = 'Which of the following best describes your current occupation? (Detailed)'
ORGANISATION_EMPLOYEE_NUMBER = 'How many people are currently employed by your organisation?'
MANAGE_PEOPLE = 'Do you manage people as part of your current occupation?'
HOUSEHOLD = 'Which of the following best describes your household?'
EMPLOYMENT_TIME = 'How long have you been in your current job?'
METRO_REGIONAL = 'Metro / Regional'
PERCENTAGE_WFH_LAST_YEAR = 'Thinking about your current job, how much of your time did you spend remote working last year?'
ORGANISATION_WFH_ENCOURAGEMENT = 'My organisation encouraged people to work remotely'
ORGANISATION_WFH_PREPARATION = 'My organisation was well prepared for me to work remotely'
ORGANISATION_WFH_COMMON = 'It was common for people in my organisation to work remotely'
ORGANISATION_WFH_PERMISSION = 'It was easy to get permission to work remotely'
WFH_COLLABORATION = 'I could easily collaborate with colleagues when working remotely'
WFH_RECOMMEND = 'I would recommend remote working to others'

### Accessing columns

Now that we have created an easier way to access a column, let's see how it works.

Let's select the column that contains the answers to the question _What year were you born?_

In [None]:
df[YEAR_BORN]
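The difference between selecting one column and selecting several can be seen in a small sketch. The DataFrame `toy` and the constants `YEAR` and `GENDER_Q` below are made up for illustration; they only mimic the shorter names defined above.

```python
import pandas as pd

# Made-up data; the constants mimic the shorter names defined above
YEAR = 'What year were you born?'
GENDER_Q = 'What is your gender?'
toy = pd.DataFrame({YEAR: [1972, 1982], GENDER_Q: ['Female', 'Male']})

one_column = toy[YEAR]                # single header -> pandas Series
two_columns = toy[[YEAR, GENDER_Q]]   # list of headers -> pandas DataFrame

print(type(one_column).__name__)
print(type(two_columns).__name__)
```

A single header gives back a Series, while a list of headers (note the double square brackets) gives back a DataFrame.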


Now it is your turn:

### TODO: Select the column with the answers to the questions: _Which of the following best describes your industry?_ and _Which of the following best describes your industry? (Detailed)_

In [None]:
# TODO: replace the content of this cell with your Python solution


## *STOP PLEASE. THE FOLLOWING IS FOR THE NEXT EXERCISE. THANKS.*


## EXERCISE 2: Frequency distribution

Obtaining the frequency distribution or the mode of a column is quite simple when using pandas. We first select the column we want to use and then call the `value_counts()` function on it. This function counts the number of times each value appears in that column and returns the frequency distribution.

Let's obtain the frequency distribution for the question _What year were you born?_

In [None]:
df[YEAR_BORN].value_counts()

You can also chain multiple selectors and function calls with the dot-expression in Python. Each function in such a dot-expression is applied to the output of the previous selector or function. For example, in the following code the `max()` function returns the largest value in the output of `value_counts()`, that is, the number of times the most frequent year occurs:

In [None]:
df[YEAR_BORN].value_counts().max()
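As a small sketch of how these chained calls relate to each other, the following uses a made-up Series called `years` (an assumption for illustration, not the survey data). It also shows `idxmax()`, which returns the most frequent value itself, and `normalize=True`, which returns relative frequencies.

```python
import pandas as pd

# Made-up birth years (illustrative only)
years = pd.Series([1972, 1982, 1972, 1990, 1972, 1982])

counts = years.value_counts()                  # frequency of each distinct value
most_common = counts.idxmax()                  # the value with the highest count (the mode)
highest_count = counts.max()                   # how often that value occurs
shares = years.value_counts(normalize=True)    # relative frequencies that sum to 1

print(counts)
print(most_common, highest_count)
print(shares)
```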

Ok, now it is your turn again:

### TODO: Calculate frequency distribution for the question: _Which of the following best describes your industry?_ and _Which of the following best describes your industry? (Detailed)_

In [None]:
# TODO: replace the content of this cell with your Python solution


### Check types

In [None]:
df.dtypes

Note how some of the variables have reasonable data types, for example the Response ID as an integer (int64), while others are just generically imported as 'object', which means a text string. In some cases these are indeed text strings, but in other cases it can hint at data cleaning tasks ahead.

We can also manually convert some columns to a new data type we find more appropriate. For example, in the following let us convert the 'What year were you born?' column into a Python datetime type:

In [None]:
from datetime import datetime
# Reference https://numpy.org/doc/1.18/reference/arrays.datetime.html
# Convert the years to strings, then parse each one as a date (1 January of that year)
df[YEAR_BORN] = df[YEAR_BORN].apply(str)
df[YEAR_BORN] = pd.Series([datetime.strptime(year, '%Y') for year in df[YEAR_BORN]], index=df.index)
# Make sure the column uses the pandas datetime type (nanosecond resolution).
# astype returns a new DataFrame, so the result is assigned back to df.
df = df.astype({YEAR_BORN: 'datetime64[ns]'})
df.head()
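An alternative way to do the same kind of conversion is the pandas function `pd.to_datetime`. The following is a minimal sketch on a made-up Series (`toy_years` is an illustrative name, and the values are not from the survey), assuming every entry is a valid four-digit year.

```python
import pandas as pd

# Made-up birth years stored as integers (illustrative only)
toy_years = pd.Series([1972, 1982, 1990])

# Convert the integers to strings, then parse them as four-digit years
as_dates = pd.to_datetime(toy_years.astype(str), format='%Y')

print(as_dates)
print(as_dates.dt.year.tolist())   # the .dt accessor gives the year back as a number
```

Each year is stored as 1 January of that year, and the `.dt` accessor can be used to get back to the plain year when needed.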

In [None]:
# Encode values as NaNs (not a number) or NaTs (not a time)
import numpy as np
before = df[YEAR_BORN].min()
df[YEAR_BORN] = df[YEAR_BORN].replace(np.datetime64('1900-01-01'), np.datetime64('NaT'))
after = df[YEAR_BORN].min()
print('before:', before)
print('after:', after)

### TODO: Update a function that cleans a Pandas Series

In [None]:
gender_series = pd.Series(['M', 'Male', 'NB', 'Female', 'F', 'NonBinary', 'Undisclosed'])

# Define the set of allowed values for the Series
from enum import Enum
class Gender(Enum):
    UNKNOWN = 1
    FEMALE = 2
    MALE = 3
    NONBINARY = 4

# A function that applies a transformation to the data in a series
def my_function(value):
    """Example: manually map string values to an Enum"""
    if value in {'Female', 'F'}:
        return Gender.FEMALE
    # TODO: handle other values
    else:
        raise NotImplementedError(f'TODO: Handle {value}.')

gender_series.apply(my_function)

## *STOP PLEASE. THE FOLLOWING IS FOR THE NEXT EXERCISE. THANKS.*


## EXERCISE 3: Calculating descriptive statistics

### Statistics with Pandas

Pandas includes multiple statistical functions, such as `min()`, `max()`, `mean()` and `median()`. Additionally, it includes the function `describe()`, which provides a summary of several descriptive statistics at once.

Let's have a look at the statistics for the question _What year were you born?_

In [None]:
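# Note: the datetime_is_numeric argument was removed in pandas 2.0;
# on newer versions, call describe() without it.
df[YEAR_BORN].describe(datetime_is_numeric=True)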

Now, let's have a look at the statistics we get when dealing with nominal data. To do this, we will obtain the descriptive statistics for the question _Which of the following best describes your industry?_

In [None]:
df[INDUSTRY].describe()
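To make the difference between the two kinds of summary concrete, here is a small sketch with two made-up Series (`ages` and `industries` are illustrative assumptions, not survey columns).

```python
import pandas as pd

# Made-up numerical and nominal data (illustrative only)
ages = pd.Series([25, 31, 31, 47, 52])
industries = pd.Series(['Retail', 'Health', 'Retail', 'Mining', 'Retail'])

numeric_summary = ages.describe()        # count, mean, std, min, quartiles, max
nominal_summary = industries.describe()  # count, unique, top, freq

print(numeric_summary)
print(nominal_summary)
```

For nominal data, `top` is the most frequent value (the mode) and `freq` is how often it occurs; statistics such as the mean are not defined.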

### TODO: Obtain the descriptive statistics for the questions: _Which of the following best describes your current occupation?_ and _How many people are currently employed by your organisation?_

In [None]:
# TODO: replace the content of this cell with your Python solution


## *STOP PLEASE. THE FOLLOWING IS FOR THE NEXT EXERCISE. THANKS.*


## EXERCISE 4: Visualisation with matplotlib

### Making a histogram

`matplotlib` provides functionality for creating various plots.

Let's make a histogram for the question _What year were you born?_ To create a histogram, we use the `hist(x,bins=n)` function from matplotlib, where we need to specify the values (`x`) we want to plot and the number of bins (`n`) we want in our histogram. Additionally, we can specify the space between bars using the `rwidth` option.

In [3]:
import matplotlib.pyplot as plt

plt.hist(df[YEAR_BORN], bins = 5, rwidth=0.8)
plt.ylabel('Number of responses')
plt.xlabel('Year')
plt.title('What year were you born?')
plt.show()


Try changing the number of bins and observe how the plot changes. The higher the number of bins, the narrower the bars become, because the data is divided into more segments. If you know the _min_ and _max_ of your data values, you can calculate an appropriate number of bins depending on what you want to observe. For example, if your data values go from 1 to 100 and you select 10 bins, the data is divided into segments of about 10: 1-10, 11-20, ... , 91-100. If you select 5 bins, it is divided into segments of about 20: 1-20, 21-40, ... , 81-100. Always choose a number of bins that lets you see the overall shape of the data.
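The effect of the number of bins can be checked without plotting. The sketch below uses `numpy.histogram`, which by default splits the range between the minimum and maximum into equal-width bins, on the integers 1 to 100; the variable names are illustrative.

```python
import numpy as np

values = np.arange(1, 101)   # the integers 1, 2, ..., 100

# np.histogram returns the count per bin and the bin edges
counts_10, edges_10 = np.histogram(values, bins=10)
counts_5, edges_5 = np.histogram(values, bins=5)

print(counts_10)               # number of values in each of the 10 bins
print(np.round(edges_10, 1))   # 11 edges between 1 and 100
print(counts_5)                # number of values in each of the 5 bins
print(np.round(edges_5, 1))    # 6 edges between 1 and 100
```

With 10 bins each bin holds 10 of the values, and with 5 bins each bin holds 20, so fewer bins means wider segments.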

Now, let's visualise some nominal data. To do this, we first need to obtain the frequency distribution for the data we want to plot (see Exercise 2), and then use a bar plot to visualise the distribution. In this case, we don't need to use the histogram function, and there's no bin size because we're not dealing with numerical data.

Let's make the bar plot for the question _Which of the following best describes your industry?_ Given that we are dealing with nominal data, it's best to make a horizontal bar plot. Additionally, we can use the pandas function `plot.barh()` to plot the data. This way, we only need to obtain the frequency distribution of the data and then plot it. We can set the title of the plot as an option and then specify the labels of the axes using the `set_xlabel()` and `set_ylabel()` functions.

In [None]:
industry_freq = df[INDUSTRY].value_counts()
ax = industry_freq.plot.barh(title='Which of the following best describes your industry?')
ax.set_xlabel('Frequency')
ax.set_ylabel('Industry')

### TODO: Make a bar plot for the question: _Which of the following best describes your current occupation?_

In [None]:
# TODO: replace the content of this cell with your Python solution


### Making a scatterplot

Finally, let's make a scatterplot to compare the year born with the percentage of time WFH.

In [None]:
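# Work on a copy so that the assignment below does not modify (or warn about) df
data = df[[YEAR_BORN, PERCENTAGE_WFH_LAST_YEAR]].copy()
# Strip the trailing '%' sign and convert the percentages to numbers
data[PERCENTAGE_WFH_LAST_YEAR] = data[PERCENTAGE_WFH_LAST_YEAR].str.rstrip('%').astype('float')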
data_sorted = data.sort_values(by=YEAR_BORN)

plt.scatter( data_sorted[YEAR_BORN], data_sorted[PERCENTAGE_WFH_LAST_YEAR], s=5)
plt.title('Year born vs Percentage WFH')
plt.xlabel('Year born')
plt.ylabel('Percentage')
plt.show()
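The conversion of the percentage column used above can be tried on its own. This is a minimal sketch with a made-up Series `raw`, assuming the values are strings that end in a `%` sign, as in the code above.

```python
import pandas as pd

# Made-up percentage strings (illustrative only)
raw = pd.Series(['50%', '10%', '100%'])

# Remove the trailing '%' and convert the remaining text to floats
as_float = raw.str.rstrip('%').astype('float')

print(as_float.tolist())
print(as_float.mean())   # numerical operations are now possible
```

Without this step the values would be treated as text, and they could not be placed on a numerical axis.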

## BONUS: Exploring other plotting options and customisation

If you would like to explore more plotting options, we recommend you visit the `seaborn` tutorial here https://seaborn.pydata.org/tutorial.html

You will find multiple plotting options and will learn how to edit your plot to your liking. Here, we show you some examples of what you can achieve by using the `seaborn` library.

In [None]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_theme(color_codes=True)
tips = sns.load_dataset("tips")
sns.lmplot(x="total_bill", y="tip", hue="smoker", data=tips, markers=["o", "x"], palette="Set1");

## EXERCISE 5: Boxplot and Correlation

### Draw a boxplot for year born

Mean and standard deviation are not informative for skewed data. A `boxplot` is a good visualisation for viewing and comparing distributions. It also shows outliers, e.g., values greater than `Q3+1.5*IQR` or less than `Q1-1.5*IQR`.

In [None]:
from datetime import datetime, date


data = df[YEAR_BORN].dropna().to_list()

def birthdate_to_age(born):
    #get today's date
    today = date.today()
    return int(today.year - born.year - ((today.month,
                                          today.day) < (born.month,
                                                        born.day)))

age_list = [birthdate_to_age(b) for b in data]

fig = plt.figure(figsize =(10, 7))
plt.boxplot(age_list)
plt.title('Distribution of Age')
plt.show()
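The outlier rule mentioned above (values beyond `Q3+1.5*IQR` or `Q1-1.5*IQR`) can also be computed by hand. The following sketch uses a made-up array of ages (an illustrative assumption, not the survey data) and `numpy.percentile` with its default linear interpolation.

```python
import numpy as np

# Made-up ages with one unusually high value (illustrative only)
ages = np.array([23, 27, 31, 35, 38, 41, 44, 52, 58, 95])

q1, q3 = np.percentile(ages, [25, 75])   # first and third quartile
iqr = q3 - q1                            # interquartile range
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are the ones a boxplot draws as individual points
outliers = ages[(ages < lower_fence) | (ages > upper_fence)]

print(q1, q3, iqr)
print(lower_fence, upper_fence)
print(outliers)
```

Here only the age 95 lies outside the fences, so it is the only value that would be flagged as an outlier.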

### Calculate correlation between two variables

SciPy includes various correlation statistics:

- Pearson’s r for two normally distributed variables: `stats.pearsonr()`

- Spearman’s rho for ratio data, ordinal data, etc. (rank-order correlation): `stats.spearmanr()`

In [None]:
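from scipy import stats

# only keep rows where both year born and percentage wfh last year are defined
data = df[[YEAR_BORN, PERCENTAGE_WFH_LAST_YEAR]].dropna()

# use numeric values: the year as a number (the column was converted to datetime in Exercise 2)
# and the percentage as a float (same conversion as for the scatterplot in Exercise 4)
year_born = data[YEAR_BORN].dt.year
percent_wfh = data[PERCENTAGE_WFH_LAST_YEAR].str.rstrip('%').astype('float')

print(stats.spearmanr(year_born, percent_wfh))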
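To illustrate the difference between the two statistics listed above, here is a small sketch on made-up data where `y` always increases with `x`, but not along a straight line. The arrays `x` and `y` are illustrative assumptions.

```python
import numpy as np
from scipy import stats

x = np.arange(1, 11)
y = x ** 3   # monotonic but non-linear relationship

# Both functions return the correlation coefficient and a p-value
pearson_r, pearson_p = stats.pearsonr(x, y)
spearman_rho, spearman_p = stats.spearmanr(x, y)

print(round(pearson_r, 3))     # below 1, because the relationship is not linear
print(round(spearman_rho, 3))  # the rank order of x and y is identical
```

Spearman's rho only looks at the rank order of the values, so it is the safer choice when the data is ordinal or not normally distributed.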

# End of Exercise. Many Thanks.