# PR 1: Introduction to Visual Analytics and Visualization libraries in Python

# Lab goals:
In this lab, we will take a deep dive into Visual Analytics and two of the most relevant Python visualization libraries: matplotlib and seaborn.
We will apply these two libraries to EDA: Exploratory Data Analysis, a key stage in a Machine Learning project.

Use the official documentation as reference for both libraries:
- seaborn-> https://seaborn.pydata.org/
- matplotlib-> https://matplotlib.org/

Some useful web references on data visualization are:
- https://datavizproject.com/
- https://www.data-to-viz.com/


### Due date: during the lab session. Submissions after the session are not accepted.
### Submission procedure: via Moodle.
### Complete with your Name: Luca Franceschi
### Complete with your NIA: 253885

# 1 Environment setup and data gathering

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pandas import Series, DataFrame

In [None]:
# Load the dataset and store it in a tips variable.

In [None]:
tips = sns.load_dataset("tips")

**EX1** Print the top 5 and tail 5 of the dataset

In [None]:
# ANSWER
print(tips.head())

print(tips.tail())

With .info() we can identify the structure of the dataframe: columns, types of variables and nulls.

In [None]:
tips.info()

# 2 Exploratory Data Analysis (EDA)

## 2.1 Basics of EDA

**EX2** Is there any null in the dataframe?

**ANSWER:** No, there are none. The *info* method reports 244 entries, and every column has exactly 244 non-null elements.
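A more direct check than reading the *info* output is to count missing values per column. The sketch below uses a small hypothetical frame (not the tips dataset) that contains one missing value on purpose, so the effect is visible.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (not the tips dataset) with one missing value.
toy = pd.DataFrame({
    "total_bill": [10.5, 20.0, np.nan, 15.2],
    "day": ["Sat", "Sun", "Sat", "Thur"],
})

# Number of missing values in each column; all zeros means no nulls.
nulls_per_column = toy.isnull().sum()
print(nulls_per_column)

# Single yes/no answer for the whole frame.
has_nulls = bool(toy.isnull().values.any())
print(has_nulls)
```

Applied to the lab data, the same calls would be `tips.isnull().sum()` and `tips.isnull().values.any()`.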

With .describe() we can obtain the main statistics of numerical variables

In [None]:
tips.describe()

**EX3** Apply .unique() to identify unique values in categorical variables

In [None]:
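# ANSWER
# Show the unique values of every categorical column, not only 'time'
for col in ['sex', 'smoker', 'day', 'time']:
    print(col, tips[col].unique())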

**EX4** To complete our understanding of the categorical variables, we also need to know how many samples each category contains. Use .value_counts() to count the number of samples per *day*, *smoker* and *sex* category.

In [None]:
# ANSWER
print(tips['day'].value_counts(), '\n')
print(tips['smoker'].value_counts(), '\n')
print(tips['sex'].value_counts(), '\n')

**Observation:** We can see that *Male* appears almost twice as often as *Female* in the sex column (157 vs. 87 samples).

## 2.2 Applying visualization to improve the EDA

Let's create an auxiliary variable, *aux*, that holds the value_counts() of tips.day

In [None]:
aux=tips.day.value_counts()

Since *aux* is a Series, we can access the category names using **aux.index** and the category counts using **aux.values**

In [None]:
aux.index

In [None]:
aux.values

**EX5** Let's create our first bar plot with plt.bar(x, height), using the following code. Which day has the most visits?

In [None]:
plt.figure()
plt.bar(x=aux.index,height=aux.values)
plt.show()

**ANSWER:** The day with the most visits is Saturday.

**EX6** Let's use several attributes of a bar plot. Create a new bar plot similar to the previous one, with orange bars, no edgecolor, a width of 0.4, and bars centered on their positions on the horizontal axis. Also include the legend.

In [None]:
plt.figure()
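# edgecolor='none' (a string) disables the edges; edgecolor=None only falls back to the default style
plt.bar(x=aux.index, height=aux.values, width=0.4, color='orange', edgecolor='none', align='center', label='Number of visits')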
plt.legend()
plt.show()

**EX7** Repeat **EX6** but use color "grey" and rotation=90 in the x axis.

In [None]:
plt.figure()
plt.bar(x=aux.index, height=aux.values, width=0.4, color='grey', edgecolor='none', align='center', label='Number of visits')
plt.xticks(rotation=90)
plt.legend()
plt.show()

**EX8** Repeat EX7 but with horizontal bars. Remember to use *plt.barh* instead of *plt.bar*. Use plt.yticks() to rotate the labels of the y axis. Play with different rotation values.

In [None]:
plt.figure()
plt.barh(y=aux.index, width=aux.values, height=0.4, color='grey', edgecolor='none', align='center', label='Number of visits')
plt.yticks(rotation=45)
plt.legend()
plt.show()

**EX9** Repeat EX6 with seaborn. Use **sns.countplot()**. What is the main difference with respect to the previous bar plot?

In [None]:
sns.countplot(x='day', data=tips, width=0.4, color='orange')
plt.show()

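**ANSWER:** With *plt.bar* we had to compute the counts ourselves (value_counts()) and pass them as heights, whereas *sns.countplot* counts the rows per category directly from the dataframe. As a consequence, the bars now follow the category order of the `day` column (Thur, Fri, Sat, Sun) instead of being sorted by count.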
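The two orderings can be reproduced with pandas alone. This is a minimal sketch on hypothetical toy data (the counts are invented): value_counts() sorts by frequency, while sorting the result by its index restores the order of the categories.

```python
import pandas as pd

# Hypothetical toy data; the category order mimics the one of tips['day'].
day_order = ["Thur", "Fri", "Sat", "Sun"]
raw = ["Sat"] * 4 + ["Sun"] * 3 + ["Thur"] * 2 + ["Fri"] * 1
days = pd.Series(pd.Categorical(raw, categories=day_order))

# value_counts() sorts by frequency (descending).
by_count = days.value_counts()
print(list(by_count.index))

# sort_index() follows the declared category order instead.
by_category = by_count.sort_index()
print(list(by_category.index))
```

Passing `by_category` instead of `by_count` to plt.bar would give a matplotlib bar plot with the bars in category order.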

**EX10** Let's apply sns.barplot(data=tips, x="day", y="total_bill"). What does this line of code plot? Tip: Compare the plot with the main statistics of "total_bill" for each of these days using *.describe()*.

In [None]:
print(tips.groupby(['day'],observed=True)['total_bill'].describe())

In [None]:
sns.barplot(data=tips, x="day", y="total_bill")
plt.show()

**ANSWER:** By default, *sns.barplot* shows the mean of `total_bill` for each day, which matches the *mean* column of *.describe()*. The lines on top of each bar (*errorbar*) are, by default, 95% confidence intervals of that mean.
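To make the idea of a bar height with a 95% confidence interval concrete, here is a minimal numpy sketch of a percentile bootstrap. The bill values are hypothetical, and the 1000 resamples and the fixed seed are assumptions of this sketch rather than a claim about seaborn's internals.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical bills for a single day (not taken from the tips dataset).
bills = np.array([12.5, 18.0, 22.3, 9.8, 30.1, 15.4, 27.9, 19.6])

# Bar height: the sample mean.
point_estimate = bills.mean()

# Percentile bootstrap: resample with replacement and recompute the mean.
boot_means = np.array([
    rng.choice(bills, size=bills.size, replace=True).mean()
    for _ in range(1000)
])

# Error bar: the central 95% of the bootstrapped means.
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(point_estimate, ci_low, ci_high)
```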

**EX11** Create a pie plot to represent the percentage of records per day. Use plt.pie() from matplotlib (Recall: aux.index holds the days and aux.values stores the number of records per day). Tip: Use autopct='%1.2f' as an attribute to show each share as a percentage.

In [None]:
plt.figure()
plt.pie(x=aux.values, labels=aux.index, autopct='%1.2f')
plt.show()

If we want to control the color of each slice of the pie, we can create a variable *color* as follows:

In [None]:
color = plt.get_cmap('Blues')(np.linspace(0.2, 0.7, len(aux.index)))

**EX12** Now repeat the previous plot, adding the attribute `colors`=color and also the following ones: explode=[0.2,0,0,0] and radius=3. What are the effects of these parameters on the plot?

In [None]:
plt.figure()
plt.pie(x=aux.values, labels=aux.index, autopct='%1.2f', colors=color, explode=[0.2,0,0,0], radius=3)
plt.show()

**ANSWER:** `colors` paints the slices with evenly spaced shades of the Blues colormap, `radius=3` makes the pie much bigger (the default radius is 1), and `explode` shifts the first slice (Saturday) outwards by 0.2 times the radius.
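The labels written by `autopct` are simply each value divided by the total, times 100. The sketch below checks this on hypothetical counts; doubling the percent sign in the format string (`'%1.2f%%'`) also prints a literal `%` after each number. The `Agg` backend is selected only so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, so no window is needed
import matplotlib.pyplot as plt

# Hypothetical counts and labels (not the tips dataset).
counts = [40, 30, 20, 10]
labels = ["A", "B", "C", "D"]

fig, ax = plt.subplots()
# With autopct set, pie() also returns the percentage text objects.
wedges, texts, autotexts = ax.pie(
    counts, labels=labels, autopct="%1.2f%%", explode=[0.2, 0, 0, 0]
)

# Text shown inside each slice.
shown = [t.get_text() for t in autotexts]
print(shown)
plt.close(fig)
```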

## **Distribution plots:**

**Histograms**:

*matplotlib function*: hist(x)
*parameters*:

- color: Set the color of the bars in the histogram.
- bins: Set the number of bins to display in the histogram, or specify specific bins.



*seaborn function*: sns.displot(x)
*parameters*:

- color: Set the color of the bars in the histogram.
- kind: Approach for visualizing the data. Selects the underlying plotting function and determines the additional set of valid parameters.


**Boxplots**:

*matplotlib function*: boxplot(x)

*seaborn function*: sns.boxplot(x)

**Violin**: A violin plot plays a similar role as a box and whisker plot. It shows the distribution of quantitative data across several levels of one (or more) categorical variables such that those distributions can be compared. Unlike a box plot, in which all of the plot components correspond to actual datapoints, the violin plot features a kernel density estimation of the underlying distribution.

*matplotlib function*: violinplot(x)
*parameters*:
- showmeans: If True, will toggle rendering of the means.
- showextrema: If True, will toggle rendering of the extrema.
- showmedians: If True, will toggle rendering of the medians.

*seaborn function*: sns.violinplot(x)

**EX13** Execute this histogram plot and examine the results. What type of distribution does this variable present?

In [None]:
sns.displot(tips['total_bill'], kind="hist")
plt.show()

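**ANSWER:** This variable shows a unimodal, right-skewed distribution: most bills concentrate roughly between 10 and 25, with a long tail towards higher values. The shape resembles that of a [Poisson distribution](https://en.wikipedia.org/wiki/Poisson_distribution), but since total_bill is continuous and the Poisson distribution is discrete, it is more accurate to describe it simply as right-skewed.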
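The shape read off a histogram can be cross-checked numerically. A minimal sketch, assuming a synthetic gamma-distributed sample as a stand-in for a right-skewed variable (it is not the tips data): a positive skewness and a mean above the median both point to a long right tail.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic right-skewed sample (a stand-in, not the tips dataset).
sample = pd.Series(rng.gamma(shape=2.0, scale=10.0, size=1000))

skewness = sample.skew()        # > 0 suggests a longer right tail
mean_value = sample.mean()
median_value = sample.median()  # the long tail pulls the mean above the median

print(skewness, mean_value, median_value)
```

On the lab data the same check would be `tips['total_bill'].skew()`.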

**EX14** Repeat the previous exercise with matplotlib (i.e. plt.hist()). Enrich the visualization with orange colour and 150 bins. Add a title, X and Y axis names. Any issue in the visualization? How could you improve the distribution visualization?

In [None]:
plt.figure()
plt.hist(tips['total_bill'], bins=150, color='orange')
plt.title('Histogram of total_bill')
plt.xlabel('total_bill')
plt.ylabel('count')
plt.show()

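**ANSWER:** There are too many bins for only 244 samples, so many bins are empty or hold a single value and the shape of the distribution gets lost in the noise. With fewer bins (at most 50) the visualization is much clearer.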
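Instead of picking the number of bins by hand, numpy can suggest one from the data. The sketch below uses a synthetic sample of 244 values (an assumption standing in for total_bill) and compares 150 fixed bins with the `"auto"` rule.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic sample with the same size as the tips dataset (244 rows).
values = rng.gamma(shape=2.0, scale=10.0, size=244)

# 150 bins: far more bins than the data can fill.
counts_150, _ = np.histogram(values, bins=150)
empty_150 = int((counts_150 == 0).sum())

# Data-driven bin edges chosen by numpy's "auto" rule.
edges_auto = np.histogram_bin_edges(values, bins="auto")
counts_auto, _ = np.histogram(values, bins=edges_auto)
n_auto = len(edges_auto) - 1

print("empty bins with 150 bins:", empty_150)
print("bins suggested by 'auto':", n_auto)
```

plt.hist also accepts `bins="auto"`, so the suggestion can be used directly when plotting.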

**EX15** Fix the identified issue and repeat the previous exercise.

In [None]:
plt.figure()
plt.hist(tips['total_bill'], bins=40, color='orange')
plt.title('Histogram of total_bill')
plt.xlabel('total_bill')
plt.ylabel('count')
plt.show()

**EX16** Execute the following code and describe what it is representing.

In [None]:
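from statistics import mode

# Central-tendency statistics of the tip column
mean_tip = tips["tip"].mean()
median_tip = tips["tip"].median()
mode_tip = mode(tips["tip"])  # separate name, so the mode() function is not overwritten

# Smoothed density of tip with the three statistics as vertical lines
sns.displot(tips["tip"], kind="kde")
plt.axvline(mean_tip, color='r', label='mean')
plt.axvline(median_tip, color='b', label='median')
plt.axvline(mode_tip, color='g', label='mode')
plt.legend()
plt.show()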

**ANSWER:** The plot shows a smoothed version of the histogram of `tip` (a.k.a. [kernel density estimation](https://en.wikipedia.org/wiki/Kernel_density_estimation)) together with three descriptive statistics drawn as vertical lines: the mean, the median and the mode, the latter being the most frequent value in the data. The ordering mode < median < mean is consistent with a right-skewed distribution.
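What a kernel density estimate computes can be sketched in a few lines of numpy: one Gaussian bump is placed on every sample and the bumps are averaged. The sample values, the helper name `gaussian_kde_sketch` and the rule-of-thumb bandwidth are assumptions of this sketch, not the exact procedure used by seaborn.

```python
import numpy as np

def gaussian_kde_sketch(samples, grid, bandwidth):
    """Average of Gaussian kernels centered on each sample (hypothetical helper)."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

# Hypothetical tip values (not the tips dataset).
samples = np.array([1.0, 1.5, 2.0, 2.0, 2.5, 3.0, 3.5, 5.0])

# Rule-of-thumb bandwidth: sample std times n^(-1/5).
bandwidth = samples.std(ddof=1) * samples.size ** (-1 / 5)

grid = np.linspace(samples.min() - 5 * bandwidth,
                   samples.max() + 5 * bandwidth, 2000)
density = gaussian_kde_sketch(samples, grid, bandwidth)

# A density must integrate to (approximately) one.
area = float(np.sum(density) * (grid[1] - grid[0]))
print(area)
```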

**EX17** Execute the following code and describe what it is representing.

In [None]:
plt.boxplot(tips["tip"])
plt.show()

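**ANSWER:** This is a boxplot summarizing the distribution of `tip`. The box spans the first to the third quartile, the line inside it is the median, the whiskers reach the most extreme values within 1.5 times the interquartile range from the box, and the points beyond them are outliers.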
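The elements of a boxplot can be computed by hand, which helps when reading the figure. A minimal sketch on hypothetical values, assuming the usual 1.5 × IQR rule for the whiskers:

```python
import numpy as np

# Hypothetical tip values with one clearly extreme point.
data = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 10], dtype=float)

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Fences: points outside them are drawn as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]

# Whiskers end at the most extreme points still inside the fences.
inside = data[(data >= lower_fence) & (data <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

print(q1, median, q3, whisker_low, whisker_high, outliers)
```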

**EX18** Build a boxplot of `tip` variable per day using sns.boxplot()

In [None]:
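# One box per day: day on the x axis, tip on the y axis
sns.boxplot(data=tips, x="day", y="tip")
plt.show()

**Observation:** Unlike the single box of EX17, this plot draws one box per day, so the distribution of tips can be compared across days.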

**EX19** Execute the following code and describe the visualization

In [None]:
plt.violinplot(tips["tip"], showmeans=True, showmedians=True, showextrema=True)
plt.show()

**ANSWER:** This is a violin plot: the shaded shape is a kernel density estimation of `tip`, mirrored around the vertical axis, and the horizontal lines mark the extrema (minimum and maximum), the mean and the median.

**EX20** Build a violin plot of the `tip` variable per day, differentiating by the `sex` variable, using sns.violinplot(x, y, hue="sex")

In [None]:
sns.violinplot(data=tips, x="day", y='tip', hue='sex')
plt.show()

**EX21** Do females tend to give larger tips on Sundays? And on Saturdays? Justify your answer.

**ANSWER:** Overall, there does not seem to be a significant difference in tips between sexes. However, compared with males, the median tip of females is higher on Sundays and lower on Saturdays. There also seem to be more outliers on Saturdays for both sexes.
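A visual comparison of violins can be backed up with a grouped summary table. The sketch below uses a hypothetical toy frame (its values are invented, not taken from the tips dataset) to show the pattern.

```python
import pandas as pd

# Hypothetical toy frame with the same column names as the lab data.
toy = pd.DataFrame({
    "day": ["Sat", "Sat", "Sat", "Sat", "Sun", "Sun", "Sun", "Sun"],
    "sex": ["Male", "Female", "Male", "Female",
            "Male", "Female", "Male", "Female"],
    "tip": [3.0, 2.5, 3.5, 2.0, 3.0, 3.5, 2.0, 4.0],
})

# Median, mean and group size of tip for every (day, sex) pair.
summary = toy.groupby(["day", "sex"])["tip"].agg(["median", "mean", "count"])
print(summary)
```

On the lab data the equivalent call would be `tips.groupby(['day', 'sex'], observed=True)['tip'].agg(['median', 'mean', 'count'])`.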

## **Plots to represent relationships between variables:**

**Scatterplot**:

Scatter plots are used to observe relationships between two variables; each dot represents one observation, placed according to its values on both variables.

*matplotlib function*: scatter(x, y)
*parameters*:

- c: Set the color of the markers.
- s: Set the size of the markers.
- marker: Set the marker style, e.g., circles, triangles, or squares.
- edgecolor: Set the color of the lines on the edges of the markers.


**Jointplot**:

Draw a plot of two variables with bivariate and univariate graphs.

*seaborn*: sns.jointplot (x, y, data).


**Correlation matrix**:
To represent the correlation matrix, we will create a heatmap in matplotlib and seaborn.

*matplotlib function*: plt.imshow(`correlation matrix`)

*seaborn*: sns.heatmap(`correlation matrix`)

**Pairplot**:

This function will create a grid of Axes such that each numeric variable in data will be shared across the y-axes across a single row and the x-axes across a single column. The diagonal plots are treated differently: a univariate distribution plot is drawn to show the marginal distribution of the data in each column.

*seaborn*: seaborn.pairplot()
*parameters*:

- kind: Kind of plot to make
- markers: Single matplotlib marker code or list
- hue: Name of variable in data




**EX22** Execute this scatter plot and examine the results. Which insight do you get from it?

In [None]:
plt.scatter(tips["total_bill"], tips['tip'])
plt.show()

**ANSWER:** I would say that there is a positive correlation between the two variables total_bill and tip.

**EX23** Repeat the previous exercise but adding the `s`=10, `marker`="o" and colour depending on the value of tips.total_bill. Add a title and X and Y labels.

In [None]:
# plt.scatter returns the drawn collection (not an Axes), which plt.colorbar uses as its source
sc = plt.scatter(tips["total_bill"], tips['tip'], s=10, marker='o', c=tips["total_bill"])
plt.colorbar(sc)
plt.title('Scatter plot between total bill and tip')
plt.xlabel('total_bill')
plt.ylabel('tip')
plt.show()

**EX23** Repeat the previous exercise, but make the size of each point depend on the value of the tip multiplied by a factor of 10.

In [None]:
# Marker size scales with the tip; colour still encodes total_bill
sc = plt.scatter(tips["total_bill"], tips['tip'], s=tips['tip']*10, marker='o', c=tips["total_bill"])
plt.colorbar(sc)
plt.title('Scatter plot between total bill and tip')
plt.xlabel('total_bill')
plt.ylabel('tip')
plt.show()

**EX24**: Execute the following jointplot.

In [None]:
sns.jointplot(x="total_bill", y="tip", data=tips, kind="reg")
plt.show()

**EX25**: Is the correlation between `total_bill` and `tip` similar between males and females? To answer this question build a sns.jointplot with hue="sex".

In [None]:
sns.jointplot(x="total_bill", y="tip", data=tips, hue='sex')
plt.show()

**ANSWER:** I would say that the correlation is positive for both sexes and of similar strength.

**EX26** Execute the following code. Which pair of variables has the highest correlation? And the lowest?

In [None]:
plt.figure(figsize=(5,5))
sns.heatmap(tips.corr(numeric_only=True),annot=True)
plt.show()

**ANSWER:** `tip` and `total_bill` are the pair with the highest correlation (about 0.68), while `size` and `tip` have the lowest one (about 0.49), which is still a moderate positive correlation rather than a neutral one.
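Each cell of the heatmap is a Pearson correlation coefficient. The sketch below computes one by hand on hypothetical values and compares it with the result of DataFrame.corr(), which uses Pearson correlation by default.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (not the tips dataset).
toy = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0, 50.0],
    "tip": [1.5, 3.0, 3.5, 6.0, 6.5],
})

x = toy["total_bill"].to_numpy()
y = toy["tip"].to_numpy()

# Pearson r: co-variation divided by the product of the spreads.
dx, dy = x - x.mean(), y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Same quantity as computed by pandas.
r_pandas = toy.corr().loc["total_bill", "tip"]

print(r_manual, r_pandas)
```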

**EX27** Execute the following visualization code. Do the `total_bill`, `tip` and `size` variables have the same distribution for males and females?

In [None]:
sns.pairplot(data=tips,hue='sex', markers=["o", "s"])
plt.show()

**ANSWER:** Yes, they seem to follow roughly the same distribution for both sexes. A few outliers in `size` may explain the irregular tail of its density curve for males.

## **Multi-plots and subplots:**

**Subplots**:

Build multiple charts in a figure.

*matplotlib* subplot(nrows, ncols, plot_number)
- nrows: The number of rows in the figure.
- ncols: The number of columns in the figure.
- plot_number: The placement of the chart (starts at 1).
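As a minimal sketch of the subplot(nrows, ncols, plot_number) call described above, the following places two charts side by side in a grid of one row and two columns. The data is hypothetical, and the `Agg` backend is chosen only so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, so no window is needed
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(8, 3))

# Position 1 of a 1x2 grid: left chart.
ax_left = plt.subplot(1, 2, 1)
ax_left.bar(["A", "B"], [3, 5])

# Position 2 of the same grid: right chart.
ax_right = plt.subplot(1, 2, 2)
ax_right.pie([3, 5], labels=["A", "B"])

n_axes = len(fig.axes)
print(n_axes)
plt.close(fig)
```

In a notebook, `plt.show()` would replace `plt.close(fig)` to display the figure.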


In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
lr=LabelEncoder()

In [None]:
tips['LE Day']=lr.fit_transform(tips['day'])

In [None]:
tips.head()
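One caveat worth knowing: LabelEncoder assigns the integer codes following the sorted order of the labels, so the codes for the days are alphabetical rather than chronological. The sketch below reproduces that behaviour with np.unique (used here as a stand-in, which is an assumption of this sketch) and contrasts it with an ordered pandas Categorical that keeps the weekday order.

```python
import numpy as np
import pandas as pd

# Hypothetical column of days (not the tips dataset).
days = pd.Series(["Thur", "Fri", "Sat", "Sun", "Sat"])

# Alphabetical codes: classes are sorted before being numbered.
classes, alphabetical_codes = np.unique(days.to_numpy(), return_inverse=True)
print(list(classes), alphabetical_codes.tolist())

# Codes that respect an explicit weekday order.
weekday_order = ["Thur", "Fri", "Sat", "Sun"]
ordered_codes = pd.Categorical(days, categories=weekday_order, ordered=True).codes
print(ordered_codes.tolist())
```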

**EX28**: Execute the following code. What are they representing?


In [None]:
plt.figure(figsize=(10,5))
plt.subplot(2,2,1)
tips['sex'].value_counts().plot(kind='bar')
plt.subplot(2,2,2)
tips['sex'].value_counts().plot(kind='pie')
plt.show()

**ANSWER:** Both subplots show the number of records (bills) per `sex` in two different ways: as a bar plot of absolute counts and as a pie chart of each category's share of the total.

**EX29**: Look at the following code. Its output is similar to the previous exercise, but the subplots are swapped. Modify the code to have the same order: i.e. the first subplot should be the bar plot and the second one, the pie chart.

In [None]:
fig, axes=plt.subplots(1,2,figsize=(15,5))
tips['sex'].value_counts().plot(kind='bar',ax=axes[0])
tips['sex'].value_counts().plot(kind='pie',ax=axes[1])
plt.show()