# Assignment 2

Your information:
- Student Name: Elias Dekker
- Student Number: 14638487

## Important

You can choose to use Plotly Express or the original Plotly to complete this assignment.

You can use generative AI tools (e.g., ChatGPT) to help you complete this assignment. But you need to report the usage of generative AI tools, as mentioned in the Syllabus.

**You need to restart and run the notebook from scratch and make sure it works before submission.** Errors in the code (such as typos, incorrect syntax, and incorrect file name extensions) will always give zero scores in the corresponding categories on the rubric. TAs will not debug the errors for you. The bottom line is that if the TA cannot just open the file and click the run button to get the result, this will be counted as having errors in the code.

## Assignment 2.1

This assignment uses the `marriage.csv` file that contains the percentage of *unmarried* people in the US between 25 and 34. This file is a subset of the `both_sexes.csv` dataset from [FiveThirtyEight](http://fivethirtyeight.com/), publicly available in their GitHub repository: https://github.com/fivethirtyeight/data/tree/master/marriage.

In [1]:
import plotly.graph_objs as go
import plotly.express as px
import pandas as pd

To load the csv file we use the `read_csv` function from pandas. This function reads in the csv and turns it into a pandas dataframe. A data frame makes it easy to view and process the data.

For example, with the `head()` method we can display the first `n` rows of the csv.

In [2]:
df = pd.read_csv('marriage.csv')

df.head(n=10) # first 10 of the 17 rows

FileNotFoundError: [Errno 2] No such file or directory: 'marriage.csv'

Then we can select certain columns with a syntax similar to python dictionaries. `df['year']` thus returns the first column.
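
As a minimal sketch of this dictionary-style selection, the snippet below uses a small made-up data frame (the values are illustrative and not taken from `marriage.csv`). A single column name returns a pandas Series, while a list of names returns a smaller data frame.

```python
import pandas as pd

# Hypothetical values for illustration only; not the real marriage.csv data.
demo_df = pd.DataFrame({
    'year': [1960, 1970, 1980],
    'all': [0.12, 0.16, 0.26],
    'poor': [0.14, 0.17, 0.27],
})

year_column = demo_df['year']        # one name -> pandas Series
subset = demo_df[['year', 'all']]    # list of names -> DataFrame

print(type(year_column).__name__)
print(list(subset.columns))
```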

In this way we can easily select data and visualize it with plotly. In the next cell there's an example of a bar chart based on the `year` column and the `all` column, i.e. the total percentage of people between 25 and 34 who are **not** married.

In [34]:
trace = go.Bar(
    x = df['year'],
    y = df['all']
)
go.Figure(trace).show()

A few things stand out when we look at the bar chart above. First, the bars are spaced unevenly. Looking at the data we see that this spacing is correct, but it is not very visually appealing.

You may also notice that the scale of the y-axis does not make it clear that it is a percentage. We can solve both of these issues. For the x-axis, we can indicate that it is a categorical axis, rather than a scale. For the y-axis, we can show it as a percentage. Below is the code to do this.

In [35]:
layout = go.Layout(
    xaxis=go.layout.XAxis(
        type='category'
    ),
    yaxis=go.layout.YAxis(
        tickformat=',.0%',
    )
)

fig = go.Figure(data=trace, layout=layout)
fig.show()
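
Plotly's `tickformat` strings follow the d3-format syntax, which is modelled on Python's format specification mini-language. As a rough way to preview what `',.0%'` does, the sketch below applies the same specifier with Python's built-in `format`. It assumes the column stores fractions such as 0.45 rather than whole percentages, and the values are made up.

```python
# Made-up fractions for illustration; the real values come from the CSV.
fractions = [0.1233, 0.45, 0.8]

# '%' multiplies by 100 and appends a percent sign; '.0' keeps no decimals.
tick_labels = [format(value, ',.0%') for value in fractions]
print(tick_labels)  # ['12%', '45%', '80%']
```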

It is now up to you to create a grouped bar chart for the first assignment based on this dataset, where the groups consist of three different income groups (`poor`, `mid`, and `rich`). Your chart should look like the following:

<img src="groupedbar.png">

**Requirements:**
- Plot title
  - `Percentage unmarried in the US between 25 and 34 by income`
- Axis labels
  - x axis: `Year`
  - y axis: `Percentage not married`
- Legend with correct names
  - `Low income`, `Middle income`, and `High income`
- Grouped and colored bars for each income group
- Correct format of x and y axis
  - x axis needs to be categories
  - y axis needs to be percentage
- Based on the data from the csv file
- Each group with the correct bar color that is color-blind safe
  - Low income group: `rgb(102,194,165)`
  - Middle income group: `rgb(252,141,98)`
  - High income group: `rgb(141,160,203)`
- The height of the graph needs to be `400`

*Hints:*
- Search online for ways of adjusting graph properties if you do not know how to do that, such as changing the color of the bar or the height of graph. Often you can find examples on the [official plotly documentation](https://plot.ly/python/reference/) or Stack Overflow. You can also ask generative AI (such as ChatGPT) to help you, but ChatGPT does not always work and may give you bad code. If you use ChatGPT, you need to report its usage as described in the course Syllabus.
- Check the [documentation of bar chart](https://plotly.com/python-api-reference/generated/plotly.graph_objects.Bar.html) for available properties.

In [36]:
trace1 = go.Bar(
    x = df['year'],
    y = df['poor'],
    name = 'Low income',
    marker_color = 'rgb(102,194,165)'
)

trace2 = go.Bar(
    x = df['year'],
    y = df['mid'],
    name = 'Middle income',
    marker_color = 'rgb(252,141,98)'
)

trace3 = go.Bar(
    x = df['year'],
    y = df['rich'],
    name = 'High income',
    marker_color = 'rgb(141,160,203)'
)

layout = go.Layout(
    title = 'Percentage unmarried in the US between 25 and 34 by income',
    xaxis=go.layout.XAxis(
        title = 'Year',
        type='category'
    ),
    yaxis=go.layout.YAxis(
        title = 'Percentage not married',
        tickformat=',.0%',
    ),
    barmode='group',
    height=400
)

fig = go.Figure(data=[trace1, trace2, trace3], layout=layout)

fig.show()

## Assignment 2.2

The second assignment uses `existing_property.csv` (bestaande_koopwoning), which is based on data from the [CBS](http://statline.cbs.nl/Statweb/publication/?DM=SLNL&PA=83910NED&D1=3&D2=1-6&D3=114&HDR=T,G1&STB=G2&VW=T) and contains the number of houses sold in 2017 in the Netherlands, by house type.

For this assignment you have to make a donut chart of this dataset, broken down by type of house. Your chart should look like the following:

<img src="donutchart.png">

**Requirements:**
- Plot title
  - `Sold houses in the Netherlands, by type, in 2017`
- No legend
- Text and percentage around the donut chart
- Based on the data from the csv file
- The color scale on the sections in the pie chart needs to be `px.colors.qualitative.T10`
- The height of the graph needs to be `600`

*Hints:*
- See also the hints in Assignment 2.1
- Check [documentation of the pie chart](https://plotly.com/python-api-reference/generated/plotly.graph_objects.pie.html) for the available properties.
- Check the [documentation of colors](https://plotly.com/python/discrete-color/) for available color scales

In [38]:
df = pd.read_csv('existing_property.csv')

fig = px.pie(df,
             values='Aantal', 
             names='Type woning',
             title='Sold houses in the Netherlands, by type, in 2017',
             color_discrete_sequence=px.colors.qualitative.T10)

fig.update_traces(hole=0.4, textinfo='label+percent', textfont_size=12,
                  insidetextorientation='radial', textposition='outside')
fig.update_layout(showlegend=False)

fig.update_layout(height=600)

fig.show()

## Assignment 2.3

Display the information in the tree below as a Sunburst Chart.

<img src="species.jpg" alt="Figure obtained from http://people.cs.ksu.edu/~schmidt/300s05/">

Your chart should look like the following:

<img src="sunburst.png">

**Requirements:**
- Plot title
  - `Animal Species Tree`
- Make sure that all Mammals have a yellow color (with hex color "#fecb52"), and all Reptiles have a green color.
- The height of the graph needs to be 600

*Hints*:
- Examples of how to work with Sunburst charts can be found in the plot.ly documentation: https://plot.ly/python/sunburst-charts/
- In a sunburst plot you can also scale the size of the 'pie slices' based on a value, but this is not necessary for this exercise.

In [39]:
labels = ["Animal", 
          "Reptile", "Mammal", 
          "Lizard", "Snake", "Bird", "Equine", "Bovine", "Canine",
          "Salamander", "Canary", "Horse", "Zebra", "Cow", "Lassie", "Rintintin",
          "Tweetie", "Bessie"]

parents = ["", 
           "Animal", "Animal", 
           "Reptile", "Reptile", "Reptile", "Mammal", "Mammal", "Mammal",
           "Lizard", "Bird", "Equine", "Equine", "Bovine", "Canine", "Canine",
           "Canary", "Cow"]

color_dict = {
    'Reptile': '#7FDBB6',  # Green
    'Mammal': '#FECB52'   # Yellow
}

colors = [color_dict[label] if label in color_dict else "" for label in labels]

fig = go.Figure(go.Sunburst(
    labels=labels,
    parents=parents,
    marker=dict(colors=colors),
    hoverinfo='label'
))

fig.update_layout(height=600, title_text="Animal Species Tree")  # title as given in the requirements
fig.show()
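
Because the hierarchy is typed by hand as two parallel lists, a misspelled parent is easy to miss. The helper below is a hypothetical sanity check (`check_tree` is not part of Plotly) that assumes labels are unique. It confirms that every parent exists and reports how deep each node sits.

```python
def check_tree(labels, parents):
    """Return each label's depth, raising ValueError on a broken hierarchy."""
    if len(labels) != len(parents):
        raise ValueError("labels and parents must have the same length")
    parent_of = dict(zip(labels, parents))
    depths = {}
    for label in labels:
        depth, node = 0, label
        # Walk up towards the root, which has "" as its parent.
        while parent_of[node] != "":
            node = parent_of[node]
            if node not in parent_of:
                raise ValueError(f"unknown parent: {node!r}")
            depth += 1
            if depth > len(labels):
                raise ValueError("cycle detected")
        depths[label] = depth
    return depths


# Small hypothetical tree for illustration.
demo_labels = ["Animal", "Mammal", "Canine", "Lassie"]
demo_parents = ["", "Animal", "Mammal", "Canine"]
demo_depths = check_tree(demo_labels, demo_parents)
print(demo_depths)
```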

## Assignment 2.4

This question consists of two sub-questions that build on the same dataset about tips (fooien) from groups that have visited a restaurant.

In [40]:
tips_df = pd.read_csv('https://raw.githubusercontent.com/pandas-dev/pandas/main/pandas/tests/io/data/csv/tips.csv')
tips_df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


### Assignment 2.4a

This part of the assignment aims to let you think about different ways of binning data points.
To answer this question, we must first take a number of steps to have the correct information available:

**Step 1)** Quantize the `tip` column into three groups (`Low`, `Medium`, `High`) using **both** the pandas function `.cut` and `.qcut`. Make sure you use `retbins=True` to return the bins in both the `.cut` and `.qcut` functions. You also need to **print** the bins for both `.cut` and `.qcut` functions using the `print` function.

**Step 2)** Make a plot for the `tip` column in the dataset (i.e., the one without quantization) based on the function `px.histogram` from Plotly Express. You need to give the plot a meaningful title.

**Step 3)** Make another two histogram plots for the **quantized variant** of the `tip` column: one from the `.cut` function, and another one from the `.qcut` function. You need to give each plot a meaningful title.

In total, you should have **two printed messages** and create **three plots** for this part of the assignment. Your plots should look like the following:

<img src="distribution1.png">
<img src="distribution2.png">
<img src="distribution3.png">

**Question)** Now compare the two printed messages and three plots, and answer the following questions. Explain in total no more than 100 words.
- What do you notice about how the distribution of the quantized data differs from the original data? Discuss this for both the `.cut` and `.qcut` functions.
- Why may we want to quantize data in different ways?

In [21]:
labels = ['Low', 'Medium', 'High']

# pd.cut splits the range of tips into three equal-width bins
tips_df['tip_cut'], cut_bins = pd.cut(tips_df['tip'], bins=3, labels=labels, retbins=True)
print('Bins from cut:', cut_bins)

# pd.qcut picks bin edges so that each bin holds roughly the same number of tips
tips_df['tip_qcut'], qcut_bins = pd.qcut(tips_df['tip'], q=3, labels=labels, retbins=True)
print('Bins from qcut:', qcut_bins)

# Original (unquantized) tip values
fig = px.histogram(tips_df, x='tip', nbins=30)
fig.update_layout(title_text='Distribution of tips', height=600)
fig.show()

# Tips quantized with cut
fig_cut = px.histogram(tips_df, x='tip_cut', category_orders={'tip_cut': labels})
fig_cut.update_layout(title_text='Tips binned with the cut function', height=600)
fig_cut.show()

# Tips quantized with qcut
fig_qcut = px.histogram(tips_df, x='tip_qcut', category_orders={'tip_qcut': labels})
fig_qcut.update_layout(title_text='Tips binned with the qcut function', height=600)
fig_qcut.show()

#### Written answer to question 2.4a in this cell
1. The distribution of the quantized data from both the cut and qcut functions does not reflect the same shape as the original data distribution. The cut function divides the range of the data into equal intervals, regardless of the number of data points within each interval. In contrast, the qcut function creates intervals with roughly equal numbers of data points, so its intervals may not be of equal length.

2. Quantizing data in different ways can serve various purposes. The cut function's equal intervals might be useful in applications where the range of values is important, ignoring data distribution. On the other hand, the qcut function's equal data distribution per interval can be beneficial when we care more about maintaining an even distribution across categories, such as in machine learning applications to prevent model bias towards more prevalent categories.
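
The difference between equal-width and equal-count binning described above can be sketched on a small, deliberately skewed series. The values are made up for illustration and are not taken from the tips dataset.

```python
import pandas as pd

# Made-up, right-skewed values for illustration; not the real tips data.
values = pd.Series([1, 1, 2, 2, 3, 10])
names = ['Low', 'Medium', 'High']

# cut: three bins of equal width; qcut: three bins of (roughly) equal count
cut_groups, cut_bins = pd.cut(values, bins=3, labels=names, retbins=True)
qcut_groups, qcut_bins = pd.qcut(values, q=3, labels=names, retbins=True)

# Count how many values land in each group.
cut_counts = [int((cut_groups == name).sum()) for name in names]
qcut_counts = [int((qcut_groups == name).sum()) for name in names]

print('cut bins: ', cut_bins, 'counts:', cut_counts)
print('qcut bins:', qcut_bins, 'counts:', qcut_counts)
```

With `cut`, the single outlier stretches the range, so most values fall into `Low` and `Medium` stays empty. With `qcut`, each group receives two values, but the bin widths differ.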

### Assignment 2.4b

To answer this question, we need two steps, as described below:

**Step 1)** Create a Parallel Categories plot based on [`go.Parcats()`](https://plotly.com/python/parallel-categories-diagram/#basic-parallel-categories-diagram-with-graphobjects) to show all the categorical variables except `total_bill`. The `tip` variable must be at the end (far right) of the plot, and each `tip` category (low, medium, high) must have a different color. The `size` variable may be treated as a categorical variable. You need to give the plot a meaningful title.

**Step 2)** Create a box plot based on [`plotly.express.box`](https://plotly.com/python/box-plots/) to show the `tip`, `smoker`, and `sex` variables. The `tip` variable should be at the y-axis. The plot should be a [facet plot](https://plotly.com/python/facet-plots/) that shows two subplots side-by-side, where these two plots share the same y-axis. For each subplot, there should be two box plots in it: one for smoker, and another one for non-smoker. There should be two subplots: one for female, and another one for male. You need to give the plot a meaningful title.

In total, you should create **two big plots** for this part of the assignment. Your plots should look like the following:

<img src="parcats.png">
<img src="box.png">

**Question)** Based on these two plots, answer the following questions. Explain in total no more than 100 words.
- You are a data analyst and want to understand which situations are related to higher tips. Use the parallel categories plot to reason which combination of attributes may be related to higher tips visually. Briefly describe your reasoning process.
- You want to specifically inspect if there are differences in the tips that are given by smokers/non-smokers among female and male groups. Use the box plot to reason about your findings visually. Briefly describe your reasoning process.
- For this assignment, you do not need to use statistics (e.g., hypothesis testing or the actual numbers) in your explanation. Just visual observations are fine. But remember that in a real-world situation, you may need to provide an analysis to show statistical evidence.

*Hints:*
- Sorting the values for each of the attributes in the parallel categories plot in a logical order can help make them more interpretable. This can be done on the basis of `categoryorder` and `categoryarray`.
- Be very careful about making claims of "significant differences" or "large differences". Sometimes you may see differences, but the difference can be small and is not really important. Or the small difference can just be caused by the noise in the data.

In [45]:
tips_df['qcut_tip'] = pd.qcut(tips_df['tip'], 3, labels=['Low', 'Medium', 'High'])

day_order = ['Thur', 'Fri', 'Sat', 'Sun']
time_order = ['Lunch', 'Dinner']
tip_order = ['Low', 'Medium', 'High']
size_order = ['1', '2', '3', '4', '5', '6']

fig = go.Figure(data=[go.Parcats(dimensions=[
    {'label': 'Sex', 'values': tips_df['sex'], 'categoryorder': 'category ascending'},
    {'label': 'Smoker', 'values': tips_df['smoker'], 'categoryorder': 'category ascending'},
    {'label': 'Day', 'values': tips_df['day'], 'categoryorder': 'array', 'categoryarray': day_order},
    {'label': 'Time', 'values': tips_df['time'], 'categoryorder': 'array', 'categoryarray': time_order},
    {'label': 'Size', 'values': tips_df['size'].astype(str), 'categoryorder': 'array', 'categoryarray': size_order},
    {'label': 'Tip', 'values': tips_df['qcut_tip'], 'categoryorder': 'array', 'categoryarray': tip_order}],
    line={'color': tips_df['qcut_tip'].cat.codes, 'colorscale': [[0, 'lightsteelblue'], [0.5, 'gold'], [1, 'mediumseagreen']]}, 
    labelfont={'size': 18, 'family': 'Times'},
    tickfont={'size': 16, 'family': 'Times'},
    arrangement='freeform')])

fig.update_layout(
    title='Analysis of restaurant tips for all variables',
    autosize=False,
    width=1000,
    height=600,
)
fig.show()

fig = px.box(tips_df, x="sex", y="tip", color="smoker", facet_col="sex", title="Analysis of restaurant tips specifically for smokers/non-smokers")
fig.update_layout(height=600)
fig.show()

#### Written answer to question 2.4b in this cell
1. The Parallel Categories plot visually suggests that larger groups (size=4, 5, 6) seem more likely to give higher tips. Further, dining at dinner time and on weekends (especially Sunday) appears to be related to higher tips. Finally, male diners seem to give more high tips than female diners. However, these are visual observations, and more rigorous statistical analysis would be needed to confirm these trends.

2. Looking at the Box plot, it seems that male smokers tend to give slightly higher median tips compared to male non-smokers. On the other hand, for females, the median tip doesn't seem to change much between smokers and non-smokers. Also, male diners, both smokers and non-smokers, tend to give slightly higher tips compared to female diners. However, it's important to note that these differences are not substantial and could be due to randomness or other variables not considered in the plot. A statistical test would be needed to ascertain if these differences are significant.
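
As a first step beyond visual inspection, the medians that the box plot draws can be computed per group. This is only a sketch on made-up rows and not a statistical test. The column names mirror the tips dataset.

```python
import pandas as pd

# Made-up rows for illustration; column names mirror the tips dataset.
demo_df = pd.DataFrame({
    'tip': [1.0, 2.0, 3.5, 2.5, 4.0, 1.5, 3.0, 2.0],
    'sex': ['Female', 'Female', 'Female', 'Female',
            'Male', 'Male', 'Male', 'Male'],
    'smoker': ['No', 'No', 'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes'],
})

# Median tip for every sex/smoker combination
medians = demo_df.groupby(['sex', 'smoker'])['tip'].median()
print(medians)
```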