# Visualization with Histograms & Boxplots



When working with a new data set, one of the most useful first steps is to visualize the data. Tables, histograms, box plots, and other visual tools give us a better sense of what the data may be telling us, and they can reveal information that we might not have discovered otherwise.

We will review how to create some basic visualizations in Python and, most importantly, learn how to start exploring data from a graphical perspective.

In [None]:
# We first need to import the packages that we will be using
import seaborn as sns # For plotting
import pandas as pd
import matplotlib.pyplot as plt # For showing plots

# Load in the data set
df = pd.read_csv("Datasets/tips.csv")
df

## Histogram

### Requires the seaborn package

Once we have a general 'feel' for the data, it is often useful to look at the shape of its distribution.

`kde` stands for Kernel Density Estimation. With `kde = True`, seaborn overlays a smoothed density curve on the histogram.
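
To make the idea behind the `kde` curve concrete, here is a minimal sketch of a Gaussian kernel density estimate written with NumPy only. The sample values, the helper name `gaussian_kde_curve`, and the rule-of-thumb bandwidth are assumptions made for illustration; seaborn's own computation may differ in its details.

```python
import numpy as np

# Hypothetical tip amounts, for illustration only (not read from tips.csv)
tips = np.array([1.0, 1.5, 2.0, 2.0, 2.5, 3.0, 3.5, 5.0])

def gaussian_kde_curve(sample, grid):
    """Evaluate a Gaussian kernel density estimate of `sample` on `grid`."""
    n = sample.size
    # Rule-of-thumb bandwidth (Scott-style); an assumption of this sketch
    bandwidth = sample.std(ddof=1) * n ** (-1 / 5)
    # Standardized distance for every (grid point, observation) pair
    z = (grid[:, None] - sample[None, :]) / bandwidth
    # One Gaussian bump per observation
    kernels = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    # Averaging the bumps gives a curve with total area close to 1
    return kernels.mean(axis=1)

grid = np.linspace(tips.min() - 5, tips.max() + 5, 2001)
density = gaussian_kde_curve(tips, grid)

# Riemann-sum approximation of the area under the curve
area = density.sum() * (grid[1] - grid[0])
print(round(area, 3))
```

Each observation contributes one small bell-shaped bump, and the smoothed curve is their average. A wider bandwidth gives a smoother curve, a narrower one follows the individual points more closely.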

In [None]:
# Plot a histogram of the total bill
sns.histplot(df["total_bill"], 
             kde = False).set_title("Histogram of Total Bill")
plt.show()

In [None]:
# Plot a histogram of the total bill with a KDE curve overlaid
sns.histplot(df["total_bill"], kde = True).set_title("Histogram of Total Bill")
plt.show()

In [None]:
# Plot a histogram of the Tips only
sns.histplot(df["tip"], 
             kde = False).set_title("Histogram of Total Tip")
plt.show()

In [None]:
# Plot histograms of both the total bill and the tips on the same axes
sns.histplot(df["total_bill"], 
             kde = False)
sns.histplot(df["tip"], 
             kde = False).set_title("Histogram of Both Tip Size and Total Bill")
plt.show()

## Boxplot
 

Boxplots show less of the distribution's shape than histograms do, but they give a clearer picture of its center and spread, as well as of any potential outliers. Boxplots and histograms often complement each other and help an analyst get more information about the data.

 

A `box plot` is a way of statistically representing the *distribution* of the data through five summary values:

*   **Minimum:** The smallest number in the dataset excluding the outliers.
*   **First quartile:** Middle number between the `minimum` and the `median`; about 25% of the values lie below it.
*   **Second quartile (Median):** Middle number of the (sorted) dataset.
*   **Third quartile:** Middle number between the `median` and the `maximum`; about 75% of the values lie below it.
*   **Maximum:** The largest number in the dataset excluding the outliers.


<img src="boxplot.png" width="440" align="center">
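
The five values listed above can also be computed directly. The sketch below uses pandas on a small, made-up series and assumes the common 1.5 × IQR rule for deciding which points count as outliers (this matches seaborn's default `whis=1.5`). The variable names are illustrative only.

```python
import pandas as pd

# Hypothetical values, chosen so that one point is an obvious outlier
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 30], dtype=float)

# Quartiles (pandas uses linear interpolation by default)
q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Fences: points beyond these are treated as outliers (1.5 * IQR rule)
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# The whiskers end at the most extreme points that are still inside the fences
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = values[(values < lower_fence) | (values > upper_fence)]

print("min (excl. outliers):", whisker_low)
print("Q1, median, Q3:", q1, median, q3)
print("max (excl. outliers):", whisker_high)
print("outliers:", outliers.tolist())
```

Here the "minimum" and "maximum" are the whisker ends rather than the raw extremes of the data, which is why the value 30 is reported separately.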

In [None]:
# Create a boxplot of the total bill amounts
sns.boxplot(x = df["total_bill"]).set_title("Box plot of the Total Bill")
#sns.boxplot(df["total_bill"], whis=(0, 100)).set_title("Box plot of the Total Bill")
plt.show()



In [None]:
# Create a boxplot of the tips amounts
sns.boxplot(x = df["tip"]).set_title("Box plot of the Tip")
plt.show()

In [None]:
# Drawing both boxplots on the same axes overlays them - do not do it like this
sns.boxplot(x = df["total_bill"])
sns.boxplot(x = df["tip"]).set_title("Box plot of the Total Bill and Tips")
plt.show()

In [None]:
# Create a boxplot of the total bill with pandas' built-in DataFrame.boxplot
a = df.boxplot(column=['total_bill'])

In [None]:
# Create side-by-side boxplots of the tip and the total bill with pandas
a = df.boxplot(column=['tip', 'total_bill'])
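
If you prefer to stay with seaborn for the side-by-side view, one possible approach is to reshape the two columns into long format first. The sketch below shows only the reshaping step, on a small hypothetical frame; the names `sample`, `long_df`, `variable`, and `amount` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical rows standing in for the tips data
sample = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0],
    "tip": [1.5, 3.0, 4.5, 6.0],
})

# Stack the two columns: one row per (variable, amount) pair
long_df = sample.melt(
    value_vars=["total_bill", "tip"],
    var_name="variable",
    value_name="amount",
)
print(long_df)
```

A long-format frame like this could then be passed to seaborn, for example `sns.boxplot(data=long_df, x="amount", y="variable")`, which draws one box per variable instead of overlaying them.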

## Creating Histograms and Boxplots Plotted by Groups

While looking at a single variable is interesting, it is often useful to see how a variable changes in response to another. 

Using graphs, we can check whether:

*  the tipping amounts of `smokers` and `non-smokers` differ,
*  tipping varies `according to the time of the day`, or
*  there are other `trends in the data` worth exploring.
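
A grouped plot is often easier to read next to a grouped numeric summary. As a rough sketch, the comparison between smokers and non-smokers might be summarized with a pandas `groupby`. The small frame below is hypothetical and only mimics the `smoker` and `tip` columns used in this notebook.

```python
import pandas as pd

# Hypothetical rows mimicking the 'smoker' and 'tip' columns
sample = pd.DataFrame({
    "smoker": ["No", "No", "No", "Yes", "Yes", "Yes"],
    "tip": [1.0, 2.0, 3.0, 2.0, 4.0, 6.0],
})

# Count, median, and mean of the tips within each group
summary = sample.groupby("smoker")["tip"].agg(["count", "median", "mean"])
print(summary)
```

The per-group median corresponds to the center line of each box in a grouped boxplot, so the table and the plot can be checked against each other.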

In [None]:
print(df.head())

sns.boxplot(x = df["tip"]).set_title("Box plot of the Tip")
plt.show()

In [None]:
# Create a boxplot of the tips GROUPED by smoking status
sns.boxplot(x = df["tip"], 
            y = df["smoker"])
plt.show()

In [None]:
# Create a boxplot of the tips GROUPED by time of day
sns.boxplot(x = df["tip"], 
            y = df["time"])
plt.show()

In [None]:
# Create a boxplot of the tips GROUPED by day
sns.boxplot(x = df["tip"], 
            y = df["day"])
plt.show()

In [None]:
# Plot histograms of the tips, one row per smoking status
g = sns.FacetGrid(df, 
                  row = "smoker")
g = g.map(plt.hist, 
          "tip")
plt.show()

In [None]:
# Plot histograms of the tips, one row per time of day
g = sns.FacetGrid(df, 
                  row = "time")
g = g.map(plt.hist, 
          "tip")
plt.show()

In [None]:

# Plot histograms of the tips, one row per day
g = sns.FacetGrid(df, 
                  row = "day")
g = g.map(plt.hist, 
          "tip")
plt.show()

In [None]:
print('='*30,'Good luck!', '='*30)