# Assignment 2

Your information:
- Student Name: Isabella Memoed
- Student Number: 13460803

## Important

You can choose to use Plotly Express or the original Plotly to complete this assignment.

You can use generative AI tools (e.g., ChatGPT) to help you complete this assignment. But you need to report the usage of generative AI tools, as mentioned in the Syllabus.

**You need to restart and run the notebook from scratch and make sure it works before submission.** Errors in the code (such as typos, incorrect syntax, and incorrect file name extensions) will always give zero scores in the corresponding categories on the rubric. TAs will not debug the errors for you. The bottom line is that if the TA cannot just open the file and click the run button to get the result, this will be counted as having errors in the code.

## Assignment 2.1

This assignment uses the `marriage.csv` file that contains the percentage of *unmarried* people in the US between 25 and 34. This file is a subset of the `both_sexes.csv` dataset from [FiveThirtyEight](http://fivethirtyeight.com/), publicly available in their GitHub repository: https://github.com/fivethirtyeight/data/tree/master/marriage.

In [1]:
import plotly.graph_objs as go
import plotly.express as px
import plotly.io as pio
import pandas as pd

To load the csv file we use the `read_csv` function from pandas. This function reads in the csv and turns it into a pandas dataframe. A data frame makes it easy to view and process the data.

For example, with the `head()` method we can display the first `n` rows of the csv.

In [2]:
df = pd.read_csv('marriage.csv')

df.head(n=10) # first 10 of the 17 rows

FileNotFoundError: [Errno 2] No such file or directory: 'marriage.csv'

Then we can select certain columns with a syntax similar to python dictionaries. `df['year']` thus returns the first column.
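
A minimal sketch of this dictionary-style selection, using a small in-memory CSV with made-up values (not the contents of `marriage.csv`). Only the column names `year` and `all` mirror the ones used in this notebook; `csv_text` and `sample_df` are hypothetical names.

```python
import io
import pandas as pd

# Made-up sample data; only the column names mirror marriage.csv.
csv_text = "year,all\n1960,0.12\n1970,0.16\n1980,0.27\n"
sample_df = pd.read_csv(io.StringIO(csv_text))

years = sample_df['year']             # one column name returns a pandas Series
subset = sample_df[['year', 'all']]   # a list of names returns a DataFrame
first_two = sample_df.head(n=2)       # the first two rows

print(years.tolist())
print(subset.shape)
```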

In this way we can easily select data and visualize it with plotly. In the next cell there's an example of a bar chart based on the `year` column and the `all` column, i.e. the total percentage of people between 25 and 34 who are **not** married.

In [64]:
trace = go.Bar(
    x = df['year'],
    y = df['all']
)
go.Figure(trace).show()

A few things stand out in the bar chart above. First, the bars are spaced unevenly. Looking at the data shows that this is correct, because the years in the data are not evenly spaced, but the result is not very visually appealing.

You may also notice that the scale of the y-axis does not make it clear that it is a percentage. We can solve both of these issues. For the x-axis, we can indicate that it is a categorical axis, rather than a scale. For the y-axis, we can show it as a percentage. Below is the code to do this.

In [65]:
# Create a layout object that contains the specifications for the X and Y axes.
layout = go.Layout(
    xaxis=go.layout.XAxis(
        type='category' # the type for the x axis is categorical
    ),
    yaxis=go.layout.YAxis(
        tickformat=',.0%', # show as percentage
    )
)

fig = go.Figure(data=trace, layout=layout)
fig.show()

It is now up to you to create a grouped bar chart for the first assignment based on this dataset, where the groups consist of three different income groups (`poor`, `mid`, and `rich`). Your chart should look like the following:

<img src="groupedbar.png">

**Requirements:**
- Plot title
  - `Percentage unmarried in the US between 25 and 34 by income`
- Axis labels
  - x axis: `Year`
  - y axis: `Percentage not married`
- Legend with correct names
  - `Low income`, `Middle income`, and `High income`
- Grouped and colored bars for each income group
- Correct format of x and y axis
  - x axis needs to be categories
  - y axis needs to be percentage
- Based on the data from the csv file
- Each group with the correct bar color that is color-blind safe
  - Low income group: `rgb(102,194,165)`
  - Middle income group: `rgb(252,141,98)`
  - High income group: `rgb(141,160,203)`
- The height of the graph needs to be `400`

*Hints:*
- Search online for ways of adjusting graph properties if you do not know how to do that, such as changing the color of the bar or the height of graph. Often you can find examples on the [official plotly documentation](https://plot.ly/python/reference/) or Stack Overflow. You can also ask generative AI (such as ChatGPT) to help you, but ChatGPT does not always work and may give you bad code. If you use ChatGPT, you need to report its usage as described in the course Syllabus.
- Check the [documentation of bar chart](https://plotly.com/python-api-reference/generated/plotly.graph_objects.Bar.html) for available properties.

In [66]:
# Code for assignment 2.1 in this cell
# Income groups with their corresponding colors
income_groups = ['poor', 'mid', 'rich']
income_group_names = {
    'poor': 'Low income',
    'mid': 'Middle income',
    'rich': 'High income'
}
colors = ['rgb(102,194,165)', 'rgb(252,141,98)', 'rgb(141,160,203)']

# List to store the bar traces
bar_traces = []

# Create a bar trace for each income group
for i, income_group in enumerate(income_groups):
    trace = go.Bar(
        x=df['year'],
        y=df[income_group],
        name=income_group_names[income_group],
        marker=dict(color=colors[i])
    )
    bar_traces.append(trace)

layout = go.Layout(
    title='Percentage unmarried in the US between 25 and 34 by income',
    xaxis=dict(title='Year', type='category'),
    yaxis=dict(title='Percentage not married', tickformat=',.0%'),
    barmode='group',
    height=400
)

fig = go.Figure(data=bar_traces, layout=layout)

# Show the grouped bar chart
pio.show(fig)


## Assignment 2.2

The second assignment is based on `existing_property.csv` (bestaande_koopwoning) based on data from the [CBS](http://statline.cbs.nl/Statweb/publication/?DM=SLNL&PA=83910NED&D1=3&D2=1-6&D3=114&HDR=T,G1&STB=G2&VW=T) and contains the numbers of houses sold in 2017 in the Netherlands, by house type.

For this assignment you have to make a donut chart based on this dataset in which this data is shown, broken down by type of house. Your chart should look like the following:

<img src="donutchart.png">

**Requirements:**
- Plot title
  - `Sold houses in the Netherlands, by type, in 2017`
- No legend
- Text and percentage around the donut chart
- Based on the data from the csv file
- The color scale on the sections in the pie chart needs to be `px.colors.qualitative.T10`
- The height of the graph needs to be `600`

*Hints:*
- See also the hints in Assignment 2.1
- Check [documentation of the pie chart](https://plotly.com/python-api-reference/generated/plotly.graph_objects.pie.html) for the available properties.
- Check the [documentation of colors](https://plotly.com/python/discrete-color/) for available color scales

In [67]:
# Code for assignment 2.2 in this cell
df = pd.read_csv('existing_property.csv')

# Colors taken from the T10 palette.
# Note: only the values of this dict are passed to px.pie, in this order;
# the keys are not matched against the categories in the data.
color_mapping = {
    'Eengezinswoningen': px.colors.qualitative.T10[0],
    'Tussenwoning': px.colors.qualitative.T10[1],
    'Apartement': px.colors.qualitative.T10[5],
    'Vrijstaand': px.colors.qualitative.T10[4],
    'Hoekwoning': px.colors.qualitative.T10[2],
    '2-onder-1 kap ': px.colors.qualitative.T10[3],
}
# Create the donut chart with Plotly Express
fig = px.pie(df, values='Aantal', names='Type woning', hole=0.8,
             color_discrete_sequence=list(color_mapping.values()))

# Chart title and height
fig.update_layout(
    title='Sold houses in the Netherlands, by type, in 2017',
    height=600
)

# Text and percentages
fig.update_traces(
    textposition='outside',
    textinfo='percent+label'
)

# Remove the legend
fig.update_layout(
    showlegend=False
)

# Show the donut chart
fig.show()


## Assignment 2.3

Display the information in the tree below as a Sunburst Chart.

<img src="species.jpg" alt="Figure obtained from http://people.cs.ksu.edu/~schmidt/300s05/">

Your chart should look like the following:

<img src="sunburst.png">

**Requirements:**
- Plot title
  - `Animal Species Tree`
- Make sure that all Mammals have a yellow color (with hex color "#fecb52"), and all Reptiles have a green color.
- The height of the graph needs to be 600

*Hints*:
- Examples of how to work with Sunburst charts can be found in the plot.ly documentation: https://plot.ly/python/sunburst-charts/
- In a sunburst plot you can also scale the size of the 'pie slices' based on a value, but this is not necessary for this exercise.

In [68]:
# Code for assignment 2.3 in this cell
fig = go.Figure(go.Sunburst(
    labels=["Animal", "Mammal", "Reptile",
            "Equine", "Bovine", "Canine",
            "Lizard", "Snake", "Bird",
            "Salamander", "Canary", "Tweetie",
            "Horse", "Zebra", "Cow",
            "Bessie", "Lassie", "Rintintin"],

    # Each entry is the parent of the label at the same position
    parents=["", "Animal", "Animal",
             "Mammal", "Mammal", "Mammal",
             "Reptile", "Reptile", "Reptile",
             "Lizard", "Bird", "Canary",
             "Equine", "Equine", "Bovine",
             "Cow", "Canine", "Canine"],
))

# Assign colors to Mammal and Reptile
colors = {
    "Mammal": "#fecb52",
    "Reptile": "#00cc96",
}

fig.update_traces(
    marker=dict(
        colors=[colors.get(label, "") for label in fig.data[0].labels]
    ),
)

# Title and height
fig.update_layout(
    title="Animal Species Tree",
    height=600
)

# fig.update_layout(margin=dict(t=0, b=0, r=0, l=0))
fig.update_layout(
    margin=dict(t=100, b=0, r=0, l=0),
    title=dict(
        y=0.95,
        x=0,
        xanchor="left",
        yanchor="top"
    )
)

# Show the sunburst chart
fig.show()
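
If every sector should carry an explicit color instead of relying on how Plotly colors descendants, one possible approach is to walk up the tree from each label until a node with an assigned color is found. The sketch below uses a reduced version of the tree; `branch_color` is a hypothetical helper, and leaving the root with an empty string is an assumption that mirrors the cell above.

```python
# Reduced labels/parents lists in the same layout a sunburst trace expects.
labels = ["Animal", "Mammal", "Reptile", "Equine", "Horse", "Lizard"]
parents = ["", "Animal", "Animal", "Mammal", "Equine", "Reptile"]
parent_of = dict(zip(labels, parents))

group_colors = {"Mammal": "#fecb52", "Reptile": "#00cc96"}


def branch_color(label, default=""):
    """Walk up the tree until a node with an assigned color is found."""
    node = label
    while node:
        if node in group_colors:
            return group_colors[node]
        node = parent_of.get(node, "")
    return default  # reached the root without finding a colored ancestor


sector_colors = [branch_color(label) for label in labels]
print(sector_colors)
```

The resulting list has one entry per label and could be passed to `go.Sunburst` as `marker=dict(colors=sector_colors)`.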


## Assignment 2.4

This question consists of two sub-questions that build on the same dataset about tips (fooien) from groups that have visited a restaurant.

In [69]:
tips_df = pd.read_csv('https://raw.githubusercontent.com/pandas-dev/pandas/main/pandas/tests/io/data/csv/tips.csv')
tips_df.head()

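| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |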


### Assignment 2.4a

This part of the assignment aims to let you think about different ways of binning data points.
To answer this question, we must first take a number of steps to have the correct information available:

**Step 1)** Quantize the `tip` column into three groups (`Low`, `Medium`, `High`) using **both** the pandas function `.cut` and `.qcut`. Make sure you use `retbins=True` to return the bins in both the `.cut` and `.qcut` functions. You also need to **print** the bins for both `.cut` and `.qcut` functions using the `print` function.

**Step 2)** Make a plot for the `tip` column in the dataset (i.e., the one without quantization) based on the function `px.histogram` from Plotly Express. You need to give the plot a meaningful title.

**Step 3)** Make another two histogram plots for the **quantized variant** of the `tip` column: one from the `.cut` function, and another one from the `.qcut` function. You need to give each plot a meaningful title.

In total, you should have **two printed messages** and create **three plots** for this part of the assignment. Your plots should look like the following:

<img src="distribution1.png">
<img src="distribution2.png">
<img src="distribution3.png">

**Question)** Now compare the two printed messages and three plots, and answer the following questions. Explain in total no more than 100 words.
- What do you notice about how the distribution of the quantized data differs from the original data? Discuss this for both the `.cut` and `.qcut` functions.
- Why may we want to quantize data in different ways?
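
To make the difference between the two functions concrete, here is a small sketch on made-up, deliberately skewed values (not the tips data). With `retbins=True`, both `pd.cut` and `pd.qcut` return a pair: the binned values first and the array of bin edges second. The variable names are hypothetical.

```python
import pandas as pd

# Made-up, deliberately skewed values (not the tips data).
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100], dtype=float)
names = ["Low", "Medium", "High"]

# With retbins=True both functions return (binned values, bin edges).
by_width, width_edges = pd.cut(values, bins=3, labels=names, retbins=True)
by_count, count_edges = pd.qcut(values, q=3, labels=names, retbins=True)

print("Equal-width edges:", width_edges)
print("Equal-frequency edges:", count_edges)

# The single outlier stretches the equal-width bins: 8 values land in 'Low'
# and none in 'Medium', while qcut puts 3 values in every bin.
print(by_width.value_counts().to_dict())
print(by_count.value_counts().to_dict())
```

The bin edges are the second element of each returned pair; printing the first element shows the per-row category instead.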

In [70]:
# Code for assignment 2.4a in this cell

# Step 1: Quantize the tip column with .cut and .qcut
tips_df = pd.read_csv('https://raw.githubusercontent.com/pandas-dev/pandas/main/pandas/tests/io/data/csv/tips.csv')

# With retbins=True, both functions return a pair: (binned values, bin edges)
tip_cut, cut_bins = pd.cut(tips_df['tip'], bins=3, labels=["Low", "Medium", "High"], retbins=True)
tip_qcut, qcut_bins = pd.qcut(tips_df['tip'], q=3, labels=["Low", "Medium", "High"], retbins=True)

# Print the bin edges for .cut and .qcut
print("Bins for .cut:")
print(cut_bins)
print("Bins for .qcut:")
print(qcut_bins)

# Step 2: Plot the tip column without quantization
fig = px.histogram(tips_df, x='tip', title='Distribution of the tips')
fig.show()

# Step 3: Histograms of the quantized tip column from .cut and .qcut
fig_cut = px.histogram(x=tip_cut, title='Bins of the tips using the cut function')
fig_cut.update_layout(xaxis=dict(title='tip_cut'))
fig_cut.show()

fig_qcut = px.histogram(x=tip_qcut, title='Bins of the tips using the qcut function')
fig_qcut.update_layout(xaxis=dict(title='tip_qcut'))
fig_qcut.show()


#### Written answer to question 2.4a in this cell

For .cut:

The data quantized with .cut divide the tip values into three bins of equal width: Low, Medium, and High.
The histogram shows that most tips fall into the Low category, with fewer in the Medium and High categories.
The binning is based on value ranges, which results in equal intervals across the range of tip values.

For .qcut:

The data quantized with .qcut divide the tip values into three bins of roughly equal frequency: Low, Medium, and High.

The histogram shows that the tips are spread more evenly over the three categories, with a similar number of tips in each.

The binning is based on quantiles, which gives roughly the same number of data points in each bin, regardless of the value range.

We may want to quantize data in different ways depending on the specific analysis or application. Equal-width bins (.cut) can be useful when we want to divide the data into equal intervals, which gives a better understanding of the value ranges. Equal-frequency bins (.qcut), on the other hand, can be useful when we want every bin to hold a comparable number of data points, which captures the overall distribution more effectively, especially when the data are skewed or contain outliers. Different binning strategies offer flexibility in interpreting and analyzing the data based on specific requirements and goals.

### Assignment 2.4b

To answer this question, we need two steps, as described below:

**Step 1)** Create a Parallel Categories plot based on [`go.Parcats()`](https://plotly.com/python/parallel-categories-diagram/#basic-parallel-categories-diagram-with-graphobjects) to show all the categorical variables except `total_bill`. The `tip` variable must be at the end (far right) of the plot, and each `tip` category (low, medium, high) must have a different color. The `size` variable may be treated as a categorical variable. You need to give the plot a meaningful title.

**Step 2)** Create a box plot based on [`plotly.express.box`](https://plotly.com/python/box-plots/) to show the `tip`, `smoker`, and `sex` variables. The `tip` variable should be at the y-axis. The plot should be a [facet plot](https://plotly.com/python/facet-plots/) that shows two subplots side-by-side, where these two plots share the same y-axis. For each subplot, there should be two box plots in it: one for smoker, and another one for non-smoker. There should be two subplots: one for female, and another one for male. You need to give the plot a meaningful title.

In total, you should create **two big plots** for this part of the assignment. Your plots should look like the following:

<img src="parcats.png">
<img src="box.png">

**Question)** Based on these two plots, answer the following questions. Explain in total no more than 100 words.
- You are a data analyst and want to understand which situations are related to higher tips. Use the parallel categories plot to reason which combination of attributes may be related to higher tips visually. Briefly describe your reasoning process.
- You want to specifically inspect if there are differences in the tips that are given by smokers/non-smokers among female and male groups. Use the box plot to reason about your findings visually. Briefly describe your reasoning process.
- For this assignment, you do not need to use statistics (e.g., hypothesis testing or the actual numbers) in your explanation. Just visual observations are fine. But remember that in a real-world situation, you may need to provide an analysis to show statistical evidence.
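
As a first step towards such numeric evidence, the median and interquartile range per group can be computed with a pandas `groupby`. The sketch below uses made-up tips; only the column names mirror the tips dataset, and `toy_tips` and `summary` are hypothetical names. Plotly may compute box plot quartiles with a slightly different method, so the numbers need not match the drawn boxes exactly.

```python
import pandas as pd

# Made-up tips; only the column names mirror the tips dataset.
toy_tips = pd.DataFrame({
    "sex":    ["Female", "Female", "Female", "Female", "Male", "Male", "Male", "Male"],
    "smoker": ["No", "No", "Yes", "Yes", "No", "No", "Yes", "Yes"],
    "tip":    [2.0, 3.0, 2.5, 3.5, 1.0, 5.0, 3.0, 3.0],
})

# Median and interquartile range of the tip per (sex, smoker) group.
grouped = toy_tips.groupby(["sex", "smoker"])["tip"]
summary = pd.DataFrame({
    "median": grouped.median(),
    "iqr": grouped.quantile(0.75) - grouped.quantile(0.25),
})
print(summary)
```

A summary like this describes the groups but is not a hypothesis test; it only makes the visual comparison explicit.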

*Hints:*
- Sorting the values for each of the attributes in the parallel categories plot in a logical order can help make them more interpretable. This can be done on the basis of `categoryorder` and `categoryarray`.
- Be very careful about making claims of "significant differences" or "large differences". Sometimes you may see differences, but the difference can be small and is not really important. Or the small difference can just be caused by the noise in the data.

In [71]:
# Code for assignment 2.4b in this cell

parcats_data = tips_df.drop("total_bill", axis=1)
# Labels must be listed from the lowest to the highest bin
parcats_data["tip_category"] = pd.qcut(tips_df["tip"], q=3, labels=["Low", "Medium", "High"])

# One color per tip category, converted to plain strings for Plotly
color_mapping = {"Low": "lightsteelblue", "Medium": "mediumseagreen", "High": "darksalmon"}
parcats_data["tip_color"] = parcats_data["tip_category"].map(color_mapping).astype(str)

fig = go.Figure(go.Parcats(
    dimensions=[
        {"label": "Sex", "values": parcats_data["sex"]},
        {"label": "Smoker", "values": parcats_data["smoker"]},
        {"label": "Day", "values": parcats_data["day"]},
        # Order the categories explicitly instead of relabeling the data
        {"label": "Time", "values": parcats_data["time"],
         "categoryorder": "array", "categoryarray": ["Lunch", "Dinner"]},
        {"label": "Size", "values": parcats_data["size"]},
        {"label": "Tip category", "values": parcats_data["tip_category"],
         "categoryorder": "array", "categoryarray": ["Low", "Medium", "High"]}
    ],
    # The colors are named CSS colors, so no colorscale is needed
    line={"color": parcats_data["tip_color"]},
    arrangement="freeform"
))

fig.update_layout(
    title="Analysis of restaurant tips for all variables",
    height=600
)

fig.show()


fig1 = px.box(tips_df, y="tip", x="smoker", color="smoker", facet_col="sex", facet_col_wrap=2)

fig1.update_layout(
    title="Analysis of restaurant tips specifically for smokers/non-smokers",
    height=500
)

fig1.update_xaxes(
    tickvals=[],  
    title=""  
)

fig1.show()

#### Written answer to question 2.4b in this cell

Based on the parallel categories plot, higher tips appear to be associated with certain combinations of attributes. In particular, looking at the sex, smoker, and day attributes, men who do not smoke and visit on Sunday appear to give higher tips. In addition, for the size attribute, larger groups (e.g. size 5 and 6) often give higher tips. It is important to note, however, that these observations are based on visual patterns and should be interpreted with caution, since they provide no statistical evidence of significant or large differences.

In the box plot, some visual observations can be made about the differences in tips given by smokers and non-smokers among the female and male groups. For women, smokers and non-smokers show a similar range of tips, with a slight tendency for non-smokers to tip a little more. For men, on the other hand, smokers typically give higher tips than non-smokers, with a larger interquartile range.