# Visualization

![elgif](https://media.giphy.com/media/3og0IExSrnfW2kUaaI/giphy.gif)

### Motivation

In the realm of Python, we possess a powerful set of tools that enable us to craft a wide array of visualizations. In this class, we'll delve into these tools, primarily focusing on the utilization of Matplotlib, Pylab, and Seaborn. Our primary objective is to harness the power of visualizations for conducting Exploratory Data Analysis (EDA).

**Why Visualization Matters**

Visualization serves as a fundamental pillar in the realm of data analysis. It empowers us to not only view data but also uncover intricate patterns, hidden insights, and critical trends. Through visualization, we can effectively describe and communicate the essence of the data, making it an indispensable tool in the field of EDA.

**Matplotlib, Pylab, and Seaborn**

In our journey, we will explore three key libraries: Matplotlib, Pylab, and Seaborn. These libraries provide us with a wide array of tools, functions, and options to create diverse and impactful visualizations. Whether you're looking to craft simple line plots or complex heatmaps, these libraries have you covered.

Our exploration will encompass a range of visualization techniques, including scatter plots, bar charts, histograms, and more. We will learn how to customize and annotate our visualizations to convey precise information effectively.

Through this course, you'll gain the skills needed to extract valuable insights from data, tell compelling data-driven stories, and make informed decisions—all with the power of visualization.

So, let's dive into the world of data visualization with Matplotlib, Pylab, and Seaborn and unlock the potential of your data!

<h3>Table of Contents<span class="tocSkip"></span></h3>
<div class="toc"><ul class="toc-item"><li><span><a href="#COMPARISON" data-toc-modified-id="COMPARISON-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>COMPARISON</a></span><ul class="toc-item"><li><span><a href="#Bar-chart" data-toc-modified-id="Bar-chart-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Bar chart</a></span><ul class="toc-item"><li><span><a href="#seaborn:-countplot" data-toc-modified-id="seaborn:-countplot-1.1.1"><span class="toc-item-num">1.1.1&nbsp;&nbsp;</span>seaborn: countplot</a></span></li><li><span><a href="#seaborn:-barplot" data-toc-modified-id="seaborn:-barplot-1.1.2"><span class="toc-item-num">1.1.2&nbsp;&nbsp;</span>seaborn: barplot</a></span></li><li><span><a href="#matplotlib:-barplot" data-toc-modified-id="matplotlib:-barplot-1.1.3"><span class="toc-item-num">1.1.3&nbsp;&nbsp;</span>matplotlib: barplot</a></span></li></ul></li><li><span><a href="#Column:-grouped-bar-chart" data-toc-modified-id="Column:-grouped-bar-chart-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Column: grouped bar chart</a></span><ul class="toc-item"><li><span><a href="#seaborn:-countplot" data-toc-modified-id="seaborn:-countplot-1.2.1"><span class="toc-item-num">1.2.1&nbsp;&nbsp;</span>seaborn: countplot</a></span></li></ul></li><li><span><a href="#matplotlib:-bar-chart" data-toc-modified-id="matplotlib:-bar-chart-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>matplotlib: bar chart</a></span></li><li><span><a href="#Line-chart" data-toc-modified-id="Line-chart-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Line chart</a></span></li><li><span><a href="#Scatter-plot" data-toc-modified-id="Scatter-plot-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Scatter plot</a></span><ul class="toc-item"><li><span><a href="#seaborn:-scatterplot" data-toc-modified-id="seaborn:-scatterplot-1.5.1"><span class="toc-item-num">1.5.1&nbsp;&nbsp;</span>seaborn: scatterplot</a></span></li><li><span><a href="#matplotlib:-scatterplot" 
data-toc-modified-id="matplotlib:-scatterplot-1.5.2"><span class="toc-item-num">1.5.2&nbsp;&nbsp;</span>matplotlib: scatterplot</a></span></li></ul></li></ul></li><li><span><a href="#DISTRIBUTION" data-toc-modified-id="DISTRIBUTION-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>DISTRIBUTION</a></span><ul class="toc-item"><li><span><a href="#Histograms" data-toc-modified-id="Histograms-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Histograms</a></span></li></ul></li><li><span><a href="#SMALL-RECAP" data-toc-modified-id="SMALL-RECAP-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>SMALL RECAP</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#seaborn:-histograms" data-toc-modified-id="seaborn:-histograms-3.0.1"><span class="toc-item-num">3.0.1&nbsp;&nbsp;</span>seaborn: histograms</a></span></li><li><span><a href="#matplotlib:-histograms" data-toc-modified-id="matplotlib:-histograms-3.0.2"><span class="toc-item-num">3.0.2&nbsp;&nbsp;</span>matplotlib: histograms</a></span></li></ul></li><li><span><a href="#With-categorical-variables" data-toc-modified-id="With-categorical-variables-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>With categorical variables</a></span></li><li><span><a href="#swarmplot" data-toc-modified-id="swarmplot-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>swarmplot</a></span></li><li><span><a href="#Boxplot" data-toc-modified-id="Boxplot-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Boxplot</a></span></li><li><span><a href="#ViolinPlot" data-toc-modified-id="ViolinPlot-3.4"><span class="toc-item-num">3.4&nbsp;&nbsp;</span>ViolinPlot</a></span></li><li><span><a href="#KDE-plot" data-toc-modified-id="KDE-plot-3.5"><span class="toc-item-num">3.5&nbsp;&nbsp;</span>KDE plot</a></span><ul class="toc-item"><li><span><a href="#With-more-variables" data-toc-modified-id="With-more-variables-3.5.1"><span class="toc-item-num">3.5.1&nbsp;&nbsp;</span>With more variables</a></span></li><li><span><a 
href="#Add-KDE-to-the-histplot" data-toc-modified-id="Add-KDE-to-the-histplot-3.5.2"><span class="toc-item-num">3.5.2&nbsp;&nbsp;</span>Add KDE to the histplot</a></span></li></ul></li></ul></li><li><span><a href="#PART-OF-A-WHOLE" data-toc-modified-id="PART-OF-A-WHOLE-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>PART OF A WHOLE</a></span><ul class="toc-item"><li><span><a href="#Pie-plot-👀" data-toc-modified-id="Pie-plot-👀-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Pie plot 👀</a></span></li><li><span><a href="#Stacked-column-chart" data-toc-modified-id="Stacked-column-chart-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Stacked column chart</a></span><ul class="toc-item"><li><span><a href="#seaborn:-stacked-column-(histogram)" data-toc-modified-id="seaborn:-stacked-column-(histogram)-4.2.1"><span class="toc-item-num">4.2.1&nbsp;&nbsp;</span>seaborn: stacked column (histogram)</a></span></li></ul></li></ul></li><li><span><a href="#RELATIONSHIPS" data-toc-modified-id="RELATIONSHIPS-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>RELATIONSHIPS</a></span><ul class="toc-item"><li><span><a href="#Scatter-plot" data-toc-modified-id="Scatter-plot-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Scatter plot</a></span></li><li><span><a href="#Line-chart" data-toc-modified-id="Line-chart-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Line chart</a></span></li><li><span><a href="#Correlation-Matrix" data-toc-modified-id="Correlation-Matrix-5.3"><span class="toc-item-num">5.3&nbsp;&nbsp;</span>Correlation Matrix</a></span></li><li><span><a href="#Pairplot" data-toc-modified-id="Pairplot-5.4"><span class="toc-item-num">5.4&nbsp;&nbsp;</span>Pairplot</a></span></li></ul></li><li><span><a href="#TRENDS" data-toc-modified-id="TRENDS-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>TRENDS</a></span><ul class="toc-item"><li><span><a href="#Line-chart" data-toc-modified-id="Line-chart-6.1"><span class="toc-item-num">6.1&nbsp;&nbsp;</span>Line 
chart</a></span></li><li><span><a href="#Area-chart" data-toc-modified-id="Area-chart-6.2"><span class="toc-item-num">6.2&nbsp;&nbsp;</span>Area chart</a></span></li><li><span><a href="#Column-chart:-grouped-bar-chart" data-toc-modified-id="Column-chart:-grouped-bar-chart-6.3"><span class="toc-item-num">6.3&nbsp;&nbsp;</span>Column chart: grouped bar chart</a></span></li><li><span><a href="#Jointplot:-histograms-&amp;-scatterplot" data-toc-modified-id="Jointplot:-histograms-&amp;-scatterplot-6.4"><span class="toc-item-num">6.4&nbsp;&nbsp;</span>Jointplot: histograms &amp; scatterplot</a></span></li></ul></li><li><span><a href="#Subplots" data-toc-modified-id="Subplots-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Subplots</a></span></li><li><span><a href="#Save-Plots" data-toc-modified-id="Save-Plots-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Save Plots</a></span></li></ul></div>

# We import libraries

In [None]:
# Importing the Seaborn library for advanced data visualization (as sns)
import seaborn as sns

# Importing the Matplotlib library for basic plotting functionalities (as plt)
import matplotlib.pyplot as plt

# Importing the Pandas library for data manipulation and analysis (as pd)
import pandas as pd

# Importing the NumPy library for numerical operations (as np)
import numpy as np

# We load data

In [None]:
# Read the Titanic dataset from a local CSV file
# (the examples below use its 'Pclass', 'Survived' and 'Sex' columns)
titanic = pd.read_csv('titanic.csv')
# Alternative: seaborn's built-in copy, which uses lowercase column names
#titanic=sns.load_dataset("titanic")

# Load the 'penguins' example dataset that ships with Seaborn
penguins = sns.load_dataset("penguins")

# Load the 'tips' example dataset that ships with Seaborn
tips = sns.load_dataset("tips")

# Load the 'flights' example dataset that ships with Seaborn
flights = sns.load_dataset("flights")

In [None]:
titanic.head()

In [None]:
penguins.head()

In [None]:
tips.head()

In [None]:
flights.head()

In [None]:
# Are we missing some extra steps to understand the DataFrames?
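
One possible answer to the question above is to look at the shape, the column types, and the missing values before plotting. The sketch below uses a small hypothetical DataFrame (`df` and its values are made up for illustration); the same calls should work on `titanic`, `penguins`, `tips`, or `flights`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for one of the loaded datasets
df = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Adelie", "Chinstrap"],
    "bill_length_mm": [39.1, 47.5, np.nan, 49.0],
    "sex": ["Male", "Female", None, "Female"],
})

shape = df.shape            # (number of rows, number of columns)
dtypes = df.dtypes          # data type of each column
missing = df.isna().sum()   # number of missing values per column
summary = df.describe()     # basic statistics for the numeric columns

print(shape)
print(dtypes)
print(missing)
print(summary)
```

Knowing which columns contain missing values helps explain why some counts in later charts are smaller than the number of rows.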

# Display Settings

In [None]:
# Render Matplotlib figures inline, directly below the cell
%matplotlib inline

# Render inline figures at higher ("retina") resolution; this affects sharpness, not figure size
%config InlineBackend.figure_format = 'retina'

In [None]:
# Set the default figure size first: sns.set() also resets the context and style
# to their defaults, so it has to run before the two calls below
sns.set(rc={"figure.figsize": (12., 6.)})

# Set the Seaborn context to "poster" for larger text and thicker lines
sns.set_context("poster")

# Set the Seaborn style to "whitegrid" for a white background with gridlines
sns.set_style("whitegrid")

- **COMPARISON**
    - **Bar Chart:** Utilized to compare individual data points across categories. Each bar represents a category and the height or length of the bar is proportional to the value or count of that category.
    - **Grouped Bar Chart:** A variation of the bar chart that displays two or more groups side by side for comparison across categories and sub-categories.
    - **Line Chart:** Ideal for showing trends over time. Each point on the line represents a data value at a specific time period.
    - **Scatter Plot:** Perfect for displaying the relationship between two numerical variables, allowing for the identification of correlations, patterns, or outliers.

- **DISTRIBUTION**
    - **Histogram:** Provides a visualization of the underlying frequency distribution of a set of continuous or discrete numerical data.
    - **Swarmplot:** Represents all data points along an axis, useful for displaying the distribution of a dataset and identifying any potential outliers.
    - **Boxplot:** Summarizes the distribution of a dataset with a five-number summary: minimum, first quartile (Q1), median, third quartile (Q3), and maximum.
    - **Violin Plot:** Combines the properties of boxplots and kernel density plots, providing a summary of the distribution of the data.

- **PART OF A WHOLE**
    - **Pie Chart:** Represents proportions or percentages among categories, by dividing a circle into proportional segments.
    - **Stacked Column Chart:** Displays the comparison of the individual sub-category totals with the overall total.

- **RELATIONSHIPS**
    - **Scatter Plot:** Ideal for depicting the relationship between two numerical variables, allowing for the detection of any correlations, patterns, or outliers.
    - **Line Chart:** Useful for displaying trends over time, showing the relationship between two numerical variables over a continuous time span.
    - **Heatmap:** Provides a colored graphical representation of data where individual values are represented as colors, useful for representing correlations or magnitudes.
    - **Pairplot:** Displays pairwise relationships in a dataset, ideal for quickly visualizing distributions and correlations.

- **TRENDS**
    - **Line Chart:** Effective for depicting trends over a continuous or discrete time interval.
    - **Area Chart:** Similar to line charts but with the area under the line filled, useful for representing the cumulative effect or total over time.
    - **Column Chart:** Useful for comparing discrete categories of data, showing the differences between them through vertical bars.


## COMPARISON

- **Bar chart:** A simple chart that represents data with rectangular bars of varying heights.

- **Grouped bar chart:** A chart that displays multiple bars side by side, allowing for comparison between groups.

- **Line chart:** A chart that shows data points connected by lines, useful for visualizing trends over time.

- **Scatter plot:** A plot that represents individual data points on a two-dimensional plane, useful for displaying relationships between variables.


### Bar chart

#### seaborn: countplot
A countplot is a fundamental visualization that allows us to create a basic bar chart to count the number of occurrences of each category within a categorical variable. In this specific context, we are using the Seaborn library to create a countplot to analyze the distribution of penguin species.

**Description (using the penguins dataset):**
- The countplot is used to visualize and compare the frequency or count of each unique penguin species in the dataset.
- It provides a quick and straightforward way to understand the distribution of species within the penguin population.

`penguins by species❓`

In [None]:
# Display a random sample of 5 rows from the 'penguins' DataFrame
penguins.sample(5)

In [None]:
# Group the 'penguins' DataFrame by the 'species' column and count the non-null values in every other column
# Columns with missing values (such as 'sex') can therefore show a smaller count than the number of rows
penguins.groupby("species").agg("count")

In [None]:
# Create a countplot using Seaborn to visualize the distribution of penguin species
# The 'x' parameter specifies the 'species' column to plot on the x-axis
sns.countplot(x=penguins.species);

We can change the colors of the chart with the argument `palette="palette name"`.
The available palettes are listed in the [seaborn palette tutorial](https://seaborn.pydata.org/tutorial/color_palettes.html).
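
To see which colors a palette name stands for, the palette can be inspected before it is used in a plot. This is a minimal sketch; the choice of three colors is arbitrary, and `colors` and `hex_codes` are names introduced here for illustration.

```python
import seaborn as sns

# Ask seaborn for three colors sampled from the "magma" palette
colors = sns.color_palette("magma", 3)

# Each color is an (r, g, b) tuple with components between 0 and 1
for rgb in colors:
    print(tuple(round(c, 2) for c in rgb))

# The same colors as hex strings, handy for reuse in Matplotlib
hex_codes = colors.as_hex()
print(hex_codes)
```

In a notebook, evaluating `colors` as the last line of a cell should also display the palette as color swatches.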

In [None]:
# Create a countplot using Seaborn to visualize the distribution of penguin species
# The 'x' parameter specifies the 'species' column to plot on the x-axis
# The 'palette' parameter specifies the color palette to use for the plot (in this case, "magma" colormap)
# Note: seaborn 0.13+ warns when 'palette' is passed without 'hue'
sns.countplot(x=penguins["species"], palette="magma");

We can either pass column names together with `data`, or pass the vectors (columns) directly, as in the previous cells:

In [None]:
# Create a countplot using Seaborn to visualize the distribution of penguin species
# The 'data' parameter specifies the DataFrame 'penguins' containing the data
# The 'x' parameter specifies the 'species' column to plot on the x-axis
# The 'palette' parameter specifies the color palette to use for the plot (in this case, "magma" colormap)
sns.countplot(data=penguins, x="species", palette="magma");

#### seaborn: barplot

The Seaborn barplot is a versatile visualization that allows us to depict the central tendency of a numerical variable using rectangular bars. Unlike the countplot, which is used for categorical data, the barplot is tailored for numerical data. It provides a visual representation of the estimated average value for each category within a categorical variable, along with a measure of uncertainty represented by error bars.

**Description:**
- The barplot is particularly useful when you want to compare the average values of a numerical variable across different categories.
- Each rectangular bar in the plot represents a category, and the height of the bar corresponds to the estimated central tendency (typically the mean) of the numerical variable for that category.
- Error bars are included to indicate the uncertainty or variability around the estimated value. These error bars can be customized to represent various statistical measures.
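
To make the description concrete, the quantities behind a barplot can be computed by hand. The sketch below is a rough stand-in rather than seaborn's own method: it uses the standard error of the mean for the error bars (seaborn's default is a bootstrapped confidence interval), and the small `df` is hypothetical data.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data: a few monthly passenger counts for two years
df = pd.DataFrame({
    "year": [1949, 1949, 1949, 1950, 1950, 1950],
    "passengers": [112, 118, 132, 115, 126, 141],
})

# Central tendency (mean) and uncertainty (standard error) per category
stats = df.groupby("year")["passengers"].agg(["mean", "sem"])
print(stats)

# One bar per year, with error bars drawn from the standard error
fig, ax = plt.subplots()
ax.bar(stats.index.astype(str), stats["mean"], yerr=stats["sem"], capsize=4)
ax.set_xlabel("year")
ax.set_ylabel("mean passengers")
```

The bar heights match what `groupby(...).mean()` returns, which is a quick way to check a barplot against the underlying numbers.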

`average passengers by year❓`

In [None]:
# Display the first few rows of the 'flights' DataFrame
flights.head()

In [None]:
# Grouping the data by 'year' and calculating the mean of 'passengers' rounded to 3 decimal places
flights.groupby("year")["passengers"].mean().round(3)

In [None]:
# Creating a barplot to visualize the average number of passengers by year
# The 'ci=None' parameter is used to exclude confidence intervals
# ('ci' is deprecated since seaborn 0.12; newer versions use 'errorbar=None' instead)
barplot = sns.barplot(x="year", y="passengers", data=flights, ci=None)

`total passengers by year❓`

In [None]:
# Creating a barplot to visualize the total number of passengers by year
# 'x' represents the 'year' data, and 'y' represents the 'passengers' data
# 'palette="magma"' sets the color palette to 'magma'
# 'estimator=sum' calculates the sum of passengers for each year
# 'ci=None' is used to exclude confidence intervals
sns.barplot(x=flights["year"], y=flights["passengers"], palette="magma", estimator=sum, ci=None);

Seaborn's `barplot` is a versatile tool for creating bar charts to visualize data. It offers a simplified and visually appealing way to represent your data, making it easier to communicate information effectively. You can explore more about its functionality in the [official Seaborn documentation](https://seaborn.pydata.org/generated/seaborn.barplot.html).

**When to Use Matplotlib or Seaborn?**

Both Matplotlib and Seaborn are powerful libraries for creating visualizations in Python. However, choosing one over the other depends on your specific needs and preferences.

- **Matplotlib**: It's a highly customizable and flexible library, allowing you to create complex visualizations from scratch. If you require full control over every aspect of your plot or need to create unconventional chart types, Matplotlib may be the better choice.

- **Seaborn**: Seaborn is built on top of Matplotlib and provides a higher-level, more concise syntax for creating aesthetically pleasing, modern-looking visualizations. It's particularly useful for common statistical plots and can save you time when creating standard charts.

In summary, Seaborn is a great choice when you want to quickly create elegant visualizations with less code. However, if you need complete control or have specific customization requirements, Matplotlib remains a powerful option. The choice between them ultimately depends on your project's goals and your personal preferences.
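
A side-by-side sketch may help with the choice. It draws the same count chart twice, once with Seaborn and once with plain Matplotlib; the tiny `df` is hypothetical data, and the layout with two axes is just one way to compare them.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical data: one row per observed penguin
df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Chinstrap", "Gentoo", "Adelie"]
})

fig, (ax_sns, ax_mpl) = plt.subplots(1, 2, figsize=(12, 4))

# Seaborn: the counting happens inside countplot
sns.countplot(data=df, x="species", ax=ax_sns)
ax_sns.set_title("seaborn")

# Matplotlib: count first, then draw the bars explicitly
counts = df["species"].value_counts()
ax_mpl.bar(counts.index, counts.values)
ax_mpl.set_xlabel("species")
ax_mpl.set_ylabel("count")
ax_mpl.set_title("matplotlib")
```

The Seaborn version needs one call, while the Matplotlib version exposes every step, which is where the extra control comes from.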

#### matplotlib: barplot
The Matplotlib counterpart of the Seaborn countplot is a plain bar plot: the same data drawn as a bar graph, except that we have to count the categories ourselves.

To create it, aggregate the DataFrame as needed and then call the pandas `.plot` method, which uses Matplotlib to draw the chart.

`penguins by species❓`

In [None]:
# Count the number of penguins in each species and store the counts
species_counts = penguins["species"].value_counts()

In [None]:
# Create a bar plot to visualize the penguin species counts
species_counts.plot(kind="bar", title="Number of penguins by species")

# pandas draws bar-chart tick labels vertically (rotated 90 degrees) by default.
# 'plt.xticks(rotation=0)' sets them horizontal, which reads better for short labels
# such as the species names. A larger angle helps when long labels would overlap.
plt.xticks(rotation=0)

# Customize the y-axis label
plt.ylabel("Number of Penguins")

In [None]:
# Count the number of penguins in each species and store the counts
species_counts = penguins["species"].value_counts()

# Create a bar plot with specific options:
# - 'kind="bar"' specifies the type of plot as a bar chart
# - 'color="salmon"' sets the color of the bars to salmon
# - 'title="Number of penguins by species"' adds a title to the plot
species_counts.plot(kind="bar", color="salmon", title="Number of penguins by species")

# Rotate the x-axis labels 45 degrees
plt.xticks(rotation=45)

### Column: grouped bar chart

In data visualization, a grouped bar chart allows us to present data with an additional level of detail and complexity. We achieve this by utilizing the `hue` parameter, a powerful tool provided by seaborn. When we use the `hue` parameter in a grouped bar chart, it effectively multiplies the number of bars by the distinct values of a specific variable that we specify.

In this example, we are working with the Titanic dataset loaded from `titanic.csv`. By incorporating the `hue` parameter, we can segment and differentiate bars within each category, providing insights into multiple dimensions of our data. This approach is particularly useful when we want to visualize relationships and comparisons across different subgroups or categorical variables, making our charts more informative and insightful.
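
The numbers behind a grouped bar chart are a cross-tabulation: one cell per combination of the x category and the `hue` category. The sketch below builds that table with `pd.crosstab` on a small hypothetical DataFrame (the values are made up; only the column names follow the Titanic CSV) and draws it with pandas.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical passengers: class and survival flag (1 = survived)
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Survived": [1, 0, 1, 0, 0, 1],
})

# Rows are the x categories, columns are the hue categories
table = pd.crosstab(df["Pclass"], df["Survived"])
print(table)

# One group of bars per class, one bar per survival value
ax = table.plot(kind="bar")
ax.set_ylabel("count")
```

With 3 classes and 2 survival values the table has 6 cells, which is the "multiplication" of bars that `hue` produces.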

`how many people survived/didn't survive by class❓`

In [None]:
# Reminder of the columns. Note that matplotlib's plt functions (such as plt.legend) also work on seaborn plots
titanic.head()

In [None]:
# Create a countplot to visualize the number of people surviving per class
# We use the 'Pclass' column for the x-axis (representing the passenger class)
# and the 'Survived' column for the hue (coloring bars based on survival status)
# We also specify a color palette using 'palette' for visual appeal
sns.countplot(x=titanic["Pclass"], hue=titanic["Survived"], palette="magma")

# Add a legend to the plot to label the 'dead' and 'not dead :)' categories
plt.legend(labels=['dead', 'not dead :)'])

`how many penguins are male/female by species❓`

In [None]:
penguins.head()

#### seaborn: countplot

In [None]:
# Create a countplot to visualize the count of penguins' species by sex
# We use the 'species' column for the x-axis (representing penguin species)
# and the 'sex' column for the hue (coloring bars based on gender)
sns.countplot(x="species", hue="sex", data=penguins)

In [None]:
# Create a countplot to visualize the count of penguins' species by sex, passing the columns directly
# We use the 'species' column for the x-axis (vertical bars, one group per penguin species)
# and the 'sex' column for the hue (coloring bars based on gender)
sns.countplot(x=penguins.species, hue=penguins.sex)

The same graph but horizontal 🙃

In [None]:
# Create a countplot to visualize the count of penguins' species by sex
# We use the 'species' column for the y-axis (horizontal bars, one group per penguin species)
# and the 'sex' column for the hue (coloring bars based on gender)
sns.countplot(y=penguins.species, hue=penguins.sex)

`how many people are men/women by class❓`

Now we use a `hue` variable with more categories. Seaborn assigns a distinct color to each category automatically 🌈

In [None]:
titanic.head()

In [None]:
# Create a countplot to visualize the count of passengers by gender and passenger class (Pclass)
# The 'Sex' column is used for the x-axis (one group of vertical bars per gender)
# The 'Pclass' column is used for the hue (coloring bars based on passenger class)
# Keyword arguments can be passed in any order
sns.countplot(hue="Pclass", data=titanic, x="Sex")

### matplotlib: bar chart

In matplotlib, a bar chart is similar to the countplot we've seen in seaborn. It is a graphical representation that uses bars to display data. To create one, you typically start with a DataFrame that's been grouped as needed and then call the pandas `.plot` method, which draws the chart with matplotlib.

Bar charts in matplotlib can be customized extensively, allowing you to control various aspects of the chart's appearance, such as colors, labels, and annotations. While seaborn provides a more streamlined and visually appealing way to create bar charts, matplotlib offers greater flexibility for fine-tuning and customization.

Let's explore how to create a basic bar chart using matplotlib to visualize data in different ways.

`how many penguins are male/female by species❓`

We are going to try, in parts, to get the same graph that we have above, in which we count the penguins by species and sex.

First we have to group the data:

In [None]:
# Let's remember the DataFrame
penguins.head()

In [None]:
# Grouping the penguins DataFrame by both 'species' and 'sex' columns
# Then, counting the occurrences of 'sex' within each group and creating a bar chart
penguins.groupby(["species", "sex"])["sex"].count().plot(kind="bar")

# Adding a title to the bar chart
plt.title('Number of Penguins by Species and Sex')

# Labeling the x-axis
plt.xlabel('Species and Sex')

# Labeling the y-axis
plt.ylabel('Count')

# Displaying the bar chart
plt.show()

**Customizing Bar Chart Colors in Matplotlib**

In data visualization, customizing the colors of your charts and plots is a powerful way to enhance their visual appeal and convey information effectively. Matplotlib, a widely used Python library for creating static, animated, and interactive visualizations, offers various options for customizing colors in your charts.

To customize the colors of bars in a bar chart using Matplotlib, you can use the `color` parameter. This parameter allows you to specify the colors of individual bars in the chart, providing flexibility in achieving the desired color scheme.

Here's how you can customize bar chart colors in Matplotlib:

1. Import the required libraries, typically `matplotlib.pyplot` for plotting.
2. Create your dataset, defining the categories and corresponding values.
3. Create a basic bar chart using `plt.bar()`.
4. Add the `color` parameter to your `plt.bar()` function and specify the colors you want to use. You can use named colors or provide RGB values.

You can experiment with different colors and combinations to create visually appealing and informative bar charts. For a comprehensive list of named colors available in Matplotlib, you can refer to the [list of named colors in Matplotlib](https://matplotlib.org/stable/gallery/color/named_colors.html).
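
The four steps above can be sketched as follows. The category names and values are illustrative, and the specific colors are arbitrary picks that mix two named colors with one RGB tuple.

```python
# Step 1: import the plotting library
import matplotlib.pyplot as plt

# Step 2: define categories and their values (illustrative numbers)
categories = ["Adelie", "Chinstrap", "Gentoo"]
values = [152, 68, 124]

# Steps 3 and 4: draw the bars and give each one its own color,
# using named colors and an (r, g, b) tuple with components in [0, 1]
bars = plt.bar(categories, values, color=["slategray", "coral", (0.2, 0.4, 0.6)])

plt.ylabel("Count")
plt.title("Bar chart with custom colors")
```

Passing a single color, for example `color="coral"`, applies it to every bar; a list assigns colors bar by bar.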

In [None]:
# Group the penguins dataframe by two variables: "species" and "sex," and then count the occurrences of each combination.
# This gives us a count of penguins for each species and gender.
penguin_counts = penguins.groupby(["species", "sex"])["sex"].count()

# Create a bar chart to visualize the penguin counts.
# Use the "kind" parameter to specify that we want a bar chart.
# Additionally, set the colors for the bars using the "color" parameter.
# The two colors are cycled bar by bar. Bars alternate Female, Male within each species,
# so "slategray" marks Female and "coral" marks Male.
penguin_counts.plot(kind="bar", color=["slategray", "coral"])

# Add axis labels, titles, and legends if necessary for clarity.
plt.xlabel("Species and Gender", fontsize=15)
plt.ylabel("Count", fontsize=15)
plt.title("Penguin Counts by Species and Gender", fontsize=20)

# Show the plot.
plt.show()

To make the bars come together in a grouped bar chart, we need to perform an operation called `unstack()` on the DataFrame. This is particularly useful when working with multi-index data, which is a hierarchical arrangement of indexes.

**Stacking and unstacking** are operations in Pandas that allow you to reshape your data between long (stacked) and wide (unstacked) formats:

- **Stacking** typically involves moving data from columns to rows, creating a hierarchical (multi-level) index. It's useful for making your data long and narrow.

- **Unstacking** does the opposite, moving data from rows back to columns, essentially pivoting your data. This is helpful for creating wide-format data, which is often used in visualizations like grouped bar charts.

So, to group bars in a bar chart, we'll unstack our DataFrame, which will reorganize the data into a format that's suitable for creating grouped bars. You can learn more about the `unstack()` method [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.unstack.html).

For a deeper understanding of multi-indexing in Pandas, you can refer to the official documentation on [advanced indexing](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html).

![stack and unstack](https://miro.medium.com/max/1400/1*DYDOif_qBEgtWfFKUDSf0Q.png)
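
A small example may make the reshaping easier to follow. The DataFrame below is hypothetical, and `long_counts` and `wide_counts` are names introduced here; `fill_value=0` is an optional choice so that missing combinations show up as zero instead of `NaN`.

```python
import pandas as pd

# Hypothetical observations: one row per penguin
df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo"],
    "sex": ["Male", "Female", "Male", "Female", "Female"],
})

# Long format: a Series with a two-level index (species, sex)
long_counts = df.groupby(["species", "sex"]).size()
print(long_counts)

# Wide format: 'sex' moves from the index to the columns
wide_counts = long_counts.unstack(fill_value=0)
print(wide_counts)
```

In the wide table each row is a species and each column is a sex, which is the layout pandas needs to draw one group of bars per species.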

In [None]:
# Do you remember the function I provided to ensure uniformity in the DataFrames? Sometimes it is not required,
# and the output is good enough as it is
hola2 = pd.DataFrame(penguins.groupby(["species", "sex"])["sex"].count()).unstack()
hola2.head()

In [None]:
# After unstack(), 'species' stays on the index and each 'sex' value becomes its own column,
# so pandas draws one group of bars per species with one bar per sex
hola2.plot(kind="bar");
plt.show()

In [None]:
# Another way of doing it
save = penguins.groupby(["species", "sex"])["sex"].count().unstack().plot(kind="bar")
plt.show()

We can also draw the same chart with horizontal bars:

In [None]:
save = penguins.groupby(["species", "sex"])["sex"].count().unstack().plot(kind="barh" )
plt.show()

`average of all three lengths by species❓`

In [None]:
penguins.sample(5)

As a reminder, here is the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.agg.html) for `.agg`:

In [None]:
# Group the penguins dataframe by the "species" column and calculate the mean values for three numerical features: bill_length_mm, bill_depth_mm, and flipper_length_mm.
# The result is a new dataframe called "flaps" with the mean values for each feature for each species.
flaps = penguins.groupby(["species"]).agg(
    {"bill_length_mm": "mean",
     "bill_depth_mm": "mean", 
     "flipper_length_mm": "mean" })

# Create a bar chart to visualize the mean values for each feature by species.
# Use the "kind" parameter to specify that we want a bar chart.
# Keep the x-axis labels horizontal (pandas rotates bar labels 90 degrees by default).
flaps.plot(kind="bar")
plt.xticks(rotation=0)

# Add axis labels and titles for clarity.
plt.xlabel("Penguin Species", fontsize=15)
plt.ylabel("Mean Value", fontsize=15)
plt.title("Mean Values of Penguin Features by Species", fontsize=20)

# Show the plot.
plt.show()

### Line chart

A line chart is a popular visualization tool used to display data points over a continuous interval or time period. It's particularly useful for showing trends, patterns, and variations in data. Line charts are commonly used in data analysis and visualization to represent data with a temporal aspect, such as stock prices, temperature changes, or sales figures over time.

**Key Features of Line Charts:**

- **X-Axis**: Typically represents the independent variable, such as time, dates, or numerical values that have a continuous order.

- **Y-Axis**: Represents the dependent variable, which can be any numerical value or quantity that you want to visualize. It can be continuous or discrete.

- **Data Points**: Data points are plotted as dots or markers at specific values along the X and Y axes. These points are connected by lines, which create the characteristic "line" in the chart.

- **Lines**: Lines connecting data points help visualize the trends and patterns in the data. They make it easy to identify whether values are increasing, decreasing, or staying relatively constant over time or across different categories.

- **Use Cases**: Line charts are commonly used for time series data, stock price analysis, sales forecasting, and any scenario where you want to visualize data trends.

- **Customization**: Line charts can be customized with various options, such as colors, markers, labels, and annotations to enhance the presentation and clarity of the chart.

- **Interactivity**: In interactive data visualization libraries like Plotly or Bokeh, line charts can be made interactive, allowing users to zoom in, pan, or hover over data points to get more information.

Line charts are a powerful tool for data exploration and presentation. They help analysts and decision-makers understand how data changes over time or across different categories, making them an essential part of the data visualization toolkit.


`evolution of total passengers by year❓`

In [None]:
flights.head()

In [None]:
# Create a line plot using Seaborn
# - 'flights' is the DataFrame containing the data
# - 'x="year"' specifies the x-axis variable as "year" from the DataFrame
# - 'y="passengers"' specifies the y-axis variable as "passengers" from the DataFrame
# - 'ci=None' indicates that we do not want to display confidence intervals (seaborn 0.12+ spells this 'errorbar=None')
sns.lineplot(data=flights, x="year", y="passengers", ci=None);
plt.ylabel("Average", fontsize=15);
plt.xlabel("The years", fontsize=15);
plt.title("This is the average of passengers throughout the years", fontsize = 20);

In [None]:
# Create a line plot using Seaborn
# - 'flights' is the DataFrame containing the data
# - 'x="year"' specifies the x-axis variable as "year" from the DataFrame
# - 'y="passengers"' specifies the y-axis variable as "passengers" from the DataFrame
# - with no 'ci' argument, seaborn shades the 95% confidence interval around the mean line
sns.lineplot(data=flights, x="year", y="passengers"); 
plt.ylabel("Average", fontsize=15);
plt.xlabel("The years", fontsize=15);
plt.title("This is the average of passengers throughout the years", fontsize = 20);

In [None]:
# Create a line plot using Seaborn
# - 'flights' is the DataFrame containing the data
# - 'x="year"' specifies the x-axis variable as "year" from the DataFrame
# - 'y="passengers"' specifies the y-axis variable as "passengers" from the DataFrame
# - 'ci=None' indicates that we do not want to display confidence intervals
# - 'estimator=sum' specifies that the estimator for the y-values should be the sum of passengers. 
#    By default, when you create a line plot using Seaborn's lineplot function without specifying 
#    the estimator parameter, the estimator used is mean.
sns.lineplot(data=flights, x="year", y="passengers", ci=None, estimator=sum); 
plt.ylabel("Total", fontsize=15);
plt.xlabel("The years", fontsize=15);
plt.title("This is the total of passengers throughout the years", fontsize = 20);
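To see what the `estimator` argument is doing, the same aggregation can be reproduced with a pandas `groupby`. The sketch below uses a small made-up table (`toy_flights`, with hypothetical values, not the real `flights` dataset) that has the same column names.

```python
import pandas as pd

# Hypothetical miniature of the flights table: two years, three months each.
toy_flights = pd.DataFrame({
    "year": [1949, 1949, 1949, 1950, 1950, 1950],
    "month": ["Jan", "Feb", "Mar", "Jan", "Feb", "Mar"],
    "passengers": [112, 118, 132, 115, 126, 141],
})

# What lineplot draws by default: one mean per x value.
mean_by_year = toy_flights.groupby("year")["passengers"].mean()

# What lineplot draws with estimator=sum: one total per x value.
total_by_year = toy_flights.groupby("year")["passengers"].sum()

print(mean_by_year)
print(total_by_year)
```

Each value in `mean_by_year` or `total_by_year` corresponds to one point on the line.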

`evolution of total passengers by year broken down by month❓`

In [None]:
# Create a line plot using seaborn to visualize passenger data over the years.
# The "data" parameter specifies the dataset to be used, which is the "flights" dataframe.
# The "x" parameter specifies the data for the x-axis, which is the "year" column in the dataset.
# The "y" parameter specifies the data for the y-axis, which is the "passengers" column in the dataset.

# The "hue" parameter is used to add color differentiation to the lines based on the "month" column.
# This means that each month will have a different colored line in the plot.

sns.lineplot(data=flights, x="year", y="passengers", hue="month")

# Add axis labels and a title for clarity.
plt.xlabel("Year", fontsize=15)
plt.ylabel("Number of Passengers", fontsize=15)
plt.title("Passenger Trends Over the Years by Month", fontsize=20)

# Show the legend to indicate which color corresponds to each month.
plt.legend(title="Month")

# Show the plot.
plt.show()

### Scatter plot
A scatter plot is a fundamental data visualization technique used to explore the relationship between two continuous variables. It is particularly useful for identifying patterns, trends, clusters, or outliers within data points.

In a scatter plot, each data point is represented as a point on a two-dimensional plane, with one variable on the x-axis and another on the y-axis. By examining the distribution of these points, you can gain insights into the correlation, association, or dispersion of the variables.

Scatter plots are valuable for tasks such as identifying linear or non-linear relationships, assessing the strength and direction of correlations, and detecting anomalies in data.

In this section, we'll explore how to create and customize scatter plots using Python's data visualization libraries, such as Matplotlib and Seaborn, to visually analyze data relationships and uncover valuable insights.

`how are these two quantitative variables (body mass & flipper length) related❓`

A scatterplot uses points to represent the values of two different numerical variables. The position of each point on the horizontal and vertical axes indicates the values of an individual data point. Scatter plots are used to look at relationships between variables.

In [None]:
# We're using Seaborn's scatterplot function to create a scatter plot.
# The 'x' parameter specifies the variable to be plotted on the x-axis, which is 'body_mass_g' (body mass in grams).
# The 'y' parameter specifies the variable to be plotted on the y-axis, which is 'flipper_length_mm' (flipper length in millimeters).
# The 'data' parameter specifies the DataFrame we're using, which is 'penguins'.
sns.scatterplot(x="body_mass_g", y="flipper_length_mm", data=penguins)

# This code will create a scatter plot showing the relationship between body mass and flipper length of penguins.
# Each point on the plot represents a penguin, with its body mass on the x-axis and flipper length on the y-axis.
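A scatter plot shows the relationship visually; the Pearson correlation coefficient puts a number on its linear component. A minimal sketch, assuming a handful of made-up measurements (the values in `toy` are hypothetical and not taken from `penguins`):

```python
import pandas as pd

# Hypothetical measurements, used only to illustrate the calculation.
toy = pd.DataFrame({
    "body_mass_g": [3000, 3500, 4000, 4500, 5000],
    "flipper_length_mm": [181, 186, 197, 203, 213],
})

# Pearson correlation: close to 1 means the points lie near a rising straight line.
r = toy["body_mass_g"].corr(toy["flipper_length_mm"])
print(round(r, 3))
```

On the real data, the same call would be `penguins["body_mass_g"].corr(penguins["flipper_length_mm"])`.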

`how are these quantitative variables (body mass & flipper length) related to this categorical (species)❓`

![image.png](attachment:image.png)

In [None]:
# We're using Seaborn's scatterplot function to create a scatter plot.
# The 'x' parameter specifies the variable to be plotted on the x-axis, which is 'body_mass_g' (body mass in grams).
# The 'y' parameter specifies the variable to be plotted on the y-axis, which is 'flipper_length_mm' (flipper length in millimeters).
# The 'hue' parameter adds color differentiation to the points based on the 'species' variable.
# The 'data' parameter specifies the DataFrame we're using, which is 'penguins'.
# The 'size' parameter specifies the variable that determines the size of the points, which is 'bill_length_mm' (bill length in millimeters).
# The 'sizes' parameter sets the range of sizes for the points from 0 to 300.
# The 'style' parameter adds different point styles for data points based on the 'island' variable.

sns.scatterplot(x="body_mass_g", y="flipper_length_mm", hue="species", data=penguins, size='bill_length_mm',
                sizes=(0, 300), style="island")

# This code will create a scatter plot showing the relationship between body mass and flipper length of penguins.
# Points will be colored based on the penguin species, sized based on bill length, and styled based on the island.

We can make the size of the dots depend on a numeric variable.

In [None]:
# Creating a scatter plot using Seaborn's scatterplot function.
# The 'x' parameter specifies the variable for the x-axis, which is 'body_mass_g' (body mass in grams).
# The 'y' parameter specifies the variable for the y-axis, which is 'flipper_length_mm' (flipper length in millimeters).
# The 'hue' parameter adds color differentiation based on the 'species' variable.
# The 'size' parameter sets the size of points based on the 'flipper_length_mm' variable.

sns.scatterplot(x="body_mass_g", y="flipper_length_mm", hue="species", data=penguins, size="flipper_length_mm")

# This code creates a scatter plot with points colored by species and sized based on flipper length.

In [None]:
# Creating a scatter plot with Seaborn's scatterplot function.
# The 'x' parameter specifies the variable for the x-axis, which is 'body_mass_g' (body mass in grams).
# The 'y' parameter specifies the variable for the y-axis, which is 'flipper_length_mm' (flipper length in millimeters).
# The 'hue' parameter adds color differentiation based on the 'species' variable.
# The 'size' parameter sets the size of points based on the 'bill_length_mm' variable.

sns.scatterplot(x="body_mass_g", y="flipper_length_mm", hue="species", data=penguins, size="bill_length_mm")

# This code generates a scatter plot with points colored by species and sized based on bill length.

In [None]:
# Creating a scatter plot using Seaborn's scatterplot function.
# The 'x' parameter specifies the variable for the x-axis, which is 'body_mass_g' (body mass in grams).
# The 'y' parameter specifies the variable for the y-axis, which is 'flipper_length_mm' (flipper length in millimeters).
# The 'hue' parameter adds color differentiation based on the 'species' variable.
# The 'size' parameter sets the size of points based on the 'bill_length_mm' variable.
# The 'style' parameter adds different point styles based on the 'island' variable.

sns.scatterplot(x="body_mass_g", y="flipper_length_mm", hue="species", data=penguins, size="bill_length_mm", style="island")

# This code creates a scatter plot with points colored by species, sized based on bill length, and styled based on island.

#### matplotlib: scatterplot

`how are these two quantitative variables (body mass & flipper length) related❓`

In [None]:
# Creating a scatter plot using Matplotlib's plt.scatter() function.
# The 'x' parameter specifies the variable for the x-axis, which is 'body_mass_g' (body mass in grams).
# The 'y' parameter specifies the variable for the y-axis, which is 'flipper_length_mm' (flipper length in millimeters).

plt.scatter(x=penguins["body_mass_g"], y=penguins["flipper_length_mm"])

# Adding labels to the x-axis and y-axis for better understanding.
plt.xlabel("Chunkiness", fontsize=15)
plt.ylabel("Flipper Length", fontsize=15)

# Setting a title for the scatter plot.
plt.title("Body Mass vs. Flipper Length", fontsize=20)

# This code creates a basic scatter plot using Matplotlib, representing the relationship between body mass and flipper length.

## DISTRIBUTION


In data visualization, understanding the distribution of your data is essential. It helps you gain insights into how your data is spread out and whether it follows a particular pattern. The following techniques are commonly used for exploring the distribution of data:

- **Histograms:** Histograms provide a visual representation of the frequency or count of data points within specific intervals or bins. They are useful for understanding the shape and central tendencies of numerical data.

- **Boxplots:** Boxplots, also known as box-and-whisker plots, display the distribution of data through quartiles. They help identify outliers and provide a concise summary of the data's spread.

- **Kernel Density Estimation (KDE):** KDE is a technique for estimating the probability density function of a continuous random variable. It helps visualize the smooth distribution of data points and is particularly useful for understanding the underlying data distribution.

Each of these methods has its strengths and is suitable for different types of data exploration. Choose the one that best fits your analytical needs when examining the distribution of your data.


### Histograms

Histograms are graphical representations that resemble vertical bar charts. However, they serve a distinct purpose in data visualization. Histograms are particularly useful for understanding the underlying frequency distribution of a dataset, whether it consists of discrete or continuous values measured on an interval scale.

Key characteristics and benefits of histograms include:

- **Frequency Distribution:** Histograms provide insights into the distribution of data points within specified intervals or bins. By visualizing how data is distributed across these bins, you can observe patterns and trends.

- **Skewness and Kurtosis:** Histograms allow you to assess properties such as skewness (whether the data is skewed to the left or right) and kurtosis (which measures the concentration of values around the central region of the distribution). These properties offer valuable insights into the dataset's shape and characteristics.

- **Interval Representation:** The choice of intervals or bins in a histogram can impact the interpretation of the data. Adjusting bin sizes can reveal different aspects of the distribution.

Histograms are a fundamental tool in exploratory data analysis (EDA) and are often the first step in assessing the nature of your data. They provide a visual foundation for understanding data distributions, which can inform subsequent analysis and decision-making.
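The effect of the bin choice can be checked without drawing anything. The sketch below uses NumPy's `np.histogram` on a small, made-up array of ages (hypothetical values) to count how many points fall into each bin for two different bin counts.

```python
import numpy as np

# Hypothetical ages, used only to illustrate binning.
ages = np.array([2, 5, 18, 21, 22, 25, 27, 30, 31, 35, 38, 40, 45, 52, 60, 71])

# Four equal-width bins over a fixed range: [0, 20), [20, 40), [40, 60), [60, 80].
counts_4, edges_4 = np.histogram(ages, bins=4, range=(0, 80))

# Eight bins over the same range give a finer view of the same data.
counts_8, edges_8 = np.histogram(ages, bins=8, range=(0, 80))

print(counts_4, edges_4)
print(counts_8, edges_8)
```

Both results describe the same 16 values; only the level of detail changes.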

![assymetry](https://d2mk45aasx86xg.cloudfront.net/image1_11zon_4542aedc45.webp)

#### seaborn: histograms

`how is the age distributed❓`

In [None]:
titanic_2.fillna(old_mean, inplace=False).Age.mean() #before dropping values

In [None]:
# Create a histogram of passenger ages from the Titanic dataset
sns.histplot(x=titanic.Age);

In [None]:
# Create a histogram with more bins to show finer detail of passenger ages
sns.histplot(x=titanic.Age, bins=50);  # Do we want this much detail?

In [None]:
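# Fill missing values in the 'Age' column with the mean age of passengers.
# Assigning the result back avoids the chained 'inplace=True' call, which newer pandas versions warn about.
titanic["Age"] = titanic["Age"].fillna(titanic["Age"].mean())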

# Create a histogram of passenger ages after filling missing values with the mean age
sns.histplot(x=titanic.Age);
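Filling missing values with the mean leaves the mean untouched but narrows the spread, because every filled value sits exactly at the center. A minimal sketch with a made-up series (`age` holds hypothetical values, not the Titanic data):

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing values.
age = pd.Series([10.0, 20.0, np.nan, 40.0, np.nan, 50.0])

filled = age.fillna(age.mean())

print(age.mean(), filled.mean())  # the mean is unchanged
print(age.std(), filled.std())    # the standard deviation shrinks
```

In a histogram this shows up as an artificial spike at the mean.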

We have manipulated the data by filling the NaNs with the mean age, and this has changed the distribution of the data: there is now a spike at the mean. We can also manually specify the number of bins.

In [None]:
sns.histplot(x=titanic.Age, bins=30)

We can choose whether the bars are filled or not.

In [None]:
sns.histplot(x=titanic.Age, bins=40, fill=False);

In [None]:
# Create a histogram of passenger ages with 30 bins, no fill, and a kernel density estimate (KDE) overlay
sns.histplot(x=titanic.Age, bins=30, fill=False, kde=True);

# The 'kde' parameter adds a kernel density estimate (probability density function) to the histogram

### Kernel Density Estimate (KDE) in Histograms

In the context of histograms, the `kde` parameter is short for Kernel Density Estimate, a statistical technique used to estimate the probability density function (PDF) of a continuous random variable. When `kde` is set to `True` in the `sns.histplot` function, a smooth curve is drawn over the histogram bars.

This curve represents the estimated probability distribution of the data points. It provides a more continuous and smoothed representation of the data's underlying distribution, making it easier to visualize the shape and characteristics of the data.

In simpler terms, the KDE curve helps us understand how likely different values are in the dataset. Peaks in the KDE curve indicate where data is more concentrated, while valleys indicate regions of lower concentration. It's a useful tool for understanding the overall pattern and density of data points in a histogram.
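The idea behind the curve can be sketched in a few lines of NumPy: place a small Gaussian bump on every data point and average the bumps. This is an illustration only. The function name `gaussian_kde_1d`, the sample values, and the fixed bandwidth of 5 are assumptions made for the example; seaborn chooses its bandwidth automatically.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on grid."""
    samples = np.asarray(samples, dtype=float)
    # Shape (len(grid), len(samples)): scaled distance of every grid point to every sample.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)

ages = np.array([22, 24, 25, 27, 30, 31, 35, 54, 58])  # hypothetical ages
grid = np.linspace(-20, 100, 1201)
density = gaussian_kde_1d(ages, grid, bandwidth=5.0)

# A density integrates to (approximately) 1; a simple Riemann sum checks this.
area = np.sum(density) * (grid[1] - grid[0])
peak = grid[np.argmax(density)]
print(area, peak)
```

The highest point of `density` falls where the samples cluster, which is what the peaks of the KDE curve on a histogram indicate.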

We can ask for an approximation of the distribution / kernel density estimation.

In [None]:
sns.histplot(x=titanic.Age, bins=12, kde=True);

#### matplotlib: histograms

`how is the age distributed❓`

Hex color codes like the ones used below can be looked up at https://htmlcolorcodes.com

In [None]:
# Plotting a histogram of the 'Age' column from the Titanic dataset
# - bins=10: Divides the range of ages into 10 equal-width bins.
# - rwidth=0.90: Reduces the width of each bar in the histogram to 90% of the bin width.
# - color="#fdea14": Sets the color of the bars to a bright yellow.
titanic.Age.plot.hist(bins=10, rwidth=0.90, color="#fdea14")

In [None]:
# Plotting another histogram of the 'Age' column
# - bins=30: Divides the range of ages into 30 equal-width bins, providing more detail.
# - color="#21211d": Sets the color of the bars to a dark gray.
# - rwidth=0.90: Has no visible effect here, because matplotlib ignores rwidth for step histograms.
# - histtype="step": Displays the histogram as a step plot, showing only the outline.
titanic.Age.plot.hist(bins=30, color="#21211d", rwidth=0.90, histtype="step")

### With categorical variables

`how is the body mass distributed across species❓`

In [None]:
# Visualizing how body mass is distributed across penguin species
# - data=penguins: Specifies the dataset to use, which is 'penguins' in this case.
# - x="species": Sets the 'species' column as the x-axis (categorical variable).
# - y="body_mass_g": Sets the 'body_mass_g' column as the y-axis.
sns.scatterplot(data=penguins, x="species", y="body_mass_g")

The default representation of data in `catplot()` is a categorical scatter plot (`kind="strip"`). There are two different categorical scatterplots in seaborn, `stripplot` and `swarmplot`, each addressing the challenge of representing categorical data with a scatter plot. These plots help visualize how data points within each category are distributed along the categorical variable axis.

In [None]:
# Creating a categorical plot to visualize how body mass is distributed across penguin species.
# - data=penguins: Specifies the dataset to use, which is 'penguins' in this case.
# - x="species": Sets the 'species' column as the x-axis (categorical variable).
# - y="body_mass_g": Sets the 'body_mass_g' column as the y-axis.
sns.catplot(data=penguins, x="species", y="body_mass_g")

In [None]:
# Creating a categorical plot with hue and jitter.
# - data=penguins: Specifies the dataset to use, which is 'penguins' in this case.
# - x="species": Sets the 'species' column as the x-axis (categorical variable).
# - y="body_mass_g": Sets the 'body_mass_g' column as the y-axis.
# - hue="sex": Adds color differentiation based on the 'sex' column.
# - jitter=True: Adds jitter to data points to prevent overlapping.
sns.catplot(data=penguins, x="species", y="body_mass_g", hue="sex", jitter=True)

In [None]:
# Creating a categorical plot with hue and alpha transparency.
# - data=penguins: Specifies the dataset to use, which is 'penguins' in this case.
# - x="species": Sets the 'species' column as the x-axis (categorical variable).
# - y="body_mass_g": Sets the 'body_mass_g' column as the y-axis.
# - hue="sex": Adds color differentiation based on the 'sex' column.
# - alpha=0.3: Adjusts the transparency of data points (0.0 for fully transparent, 1.0 for fully opaque).
sns.catplot(data=penguins, x="species", y="body_mass_g", hue="sex", alpha=0.3)

### swarmplot

A swarmplot is a type of categorical scatter plot that aims to show the distribution of data points for different categories while preventing overlapping points. It's particularly useful when you have discrete or categorical data and want to visualize individual data points.

In a swarmplot:
- Each data point is represented as a point along the categorical axis.
- The points are adjusted to avoid overlapping, providing a clear view of the data distribution.
- Swarmplots are great for visualizing the spread and density of data within categories.

Swarmplots are valuable tools for exploratory data analysis when dealing with categorical or discrete data, as they provide insights into the distribution of individual data points within each category.


`how is the body mass distributed across species & male/female❓`

In [None]:
# Create a swarmplot to visualize the distribution of body_mass_g across different sexes for each penguin species.
sns.swarmplot(data=penguins, x="body_mass_g", y="sex", hue="species")

In [None]:
# Create another swarmplot, this time showing the distribution of body_mass_g for each penguin species with distinction by sex.
# The 'alpha' parameter controls the transparency of the points.
sns.swarmplot(data=penguins, x="body_mass_g", y="species", hue="sex", alpha=0.9)

### Boxplot

A boxplot, also known as a box-and-whisker plot, is a valuable tool for visualizing the distribution of a dataset and identifying key statistical measures. It provides a concise summary of the data's central tendency, spread, and presence of outliers. 

Key components of a boxplot:
- **Median (Q2)**: The line inside the box represents the median, which is the middle value of the dataset when arranged in ascending order. It divides the data into two equal halves, with 50% of the values falling below and 50% above it.

- **Interquartile Range (IQR)**: The box encloses the IQR, which is the range between the first quartile (Q1) and the third quartile (Q3). The IQR encompasses the central 50% of the data and provides a measure of data spread.

- **Whiskers**: The whiskers extend from the edges of the box to the lowest and highest values that lie within a defined range, by default 1.5 × IQR from the box edges in Matplotlib and Seaborn. Outliers, if present, are displayed as individual points beyond the whiskers. 

- **Outliers**: Outliers are data points that fall significantly outside the whiskers. They are typically shown as individual points and can indicate unusual or extreme values in the dataset.

Boxplots are valuable for comparing distributions, identifying skewness, and detecting potential outliers in the data. They provide a visual summary of the data's statistical properties, making it easier to grasp essential characteristics of a dataset.

To create a boxplot in Python, you can use libraries like Matplotlib or Seaborn, specifying the variable you want to visualize. Boxplots are a powerful tool for exploratory data analysis and gaining insights into your data's distribution.
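The numbers a boxplot draws can be computed by hand. The sketch below assumes the common 1.5 × IQR rule for the whiskers (the default in Matplotlib and Seaborn) and uses a small made-up array with one extreme value.

```python
import numpy as np

# Hypothetical data with one extreme value.
values = np.array([1, 3, 4, 5, 5, 6, 7, 8, 9, 30])

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Fences: points beyond them are drawn as individual outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

# Whiskers reach the most extreme data points that are still inside the fences.
whisker_low = values[values >= lower_fence].min()
whisker_high = values[values <= upper_fence].max()

print(q1, median, q3, iqr)
print(whisker_low, whisker_high, outliers)
```

Here the value 30 lies above the upper fence, so it would appear as a single point beyond the upper whisker.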


![boxplots](https://www.simplypsychology.org/box-whisker-plot.jpg)

`how is the age distributed❓(titanic)`

In [None]:
sns.boxplot(x="Age", data=titanic);

`how is the age distributed across classes❓ (titanic)`

In [None]:
sns.boxplot(x="Pclass", y="Age", data=titanic);

### ViolinPlot

A Violin Plot is a versatile data visualization technique that combines elements of a box plot with a rotated kernel density plot. It is particularly useful for visualizing the distribution and probability density of a dataset across different categories or variables.

Key characteristics of a Violin Plot:
- **Symmetrical Shape**: The core of a Violin Plot resembles a symmetrical, rotated violin shape. This shape represents the estimated probability density of the data at different values along the y-axis.

- **Box Plot Inside**: Similar to a box plot, a miniature box is drawn within the violin shape. This box denotes the interquartile range (IQR) and the median of the data. It is the width of the violin itself, not of the box, that is proportional to the density of data points at that value.

- **Extended Tails**: The tails of the violin plot extend outward, indicating the range of data values. The width of the tails corresponds to the density of data points at different values.

Violin Plots are beneficial for comparing the distribution of data across multiple categories or groups, especially when dealing with complex datasets. They provide insights into the data's central tendency, spread, and the presence of multiple modes or peaks within the distribution.

By visualizing both the box plot and the kernel density estimation, Violin Plots offer a comprehensive view of the data's statistical properties. This makes them a valuable tool in exploratory data analysis and data visualization.

You can create Violin Plots in Python using libraries like Seaborn or Matplotlib, specifying the variables you want to compare or visualize.


`how is the age distributed❓`

In [None]:
# Create a violin plot for the 'Age' column in the Titanic dataset
violin = sns.violinplot(x=titanic.Age)

# Add a vertical line at the median age
violin.axvline(x=titanic.Age.median(), c="red", label="median")

# Add a legend to the plot to label the red line as 'median'
plt.legend();


`how is the bill_length distributed across species❓`

In [None]:
# Create a violin plot for the 'bill_length_mm' column in the Penguins dataset, grouped by 'species'
sns.violinplot(x=penguins.bill_length_mm, y=penguins.species)

### KDE plot
The `sns.kdeplot` function allows much more control over the resulting plot than the older `seaborn.distplot` function, which is deprecated. If we pass as the first argument the set of y values calculated in the previous section, we get exactly the same graph.

[Seaborn: kdeplot](https://seaborn.pydata.org/generated/seaborn.kdeplot.html)

`how is the age distributed❓`

In [None]:
sns.kdeplot(x=titanic.Age);

#### With more variables


`how is the age distributed across classes❓`

In [None]:
sns.kdeplot(x=titanic.Age, hue=titanic.Pclass);

In [None]:
sns.kdeplot(x=titanic.Age, hue=titanic.Pclass, fill=True);

#### Add KDE to the histplot
The lines that we are going to add come from the matplotlib library. Conveniently, seaborn plots are matplotlib Axes, so we can combine both libraries.

In [None]:
# Create a histogram plot for the 'Age' column in the Titanic dataset with a KDE overlay and save it as a variable 'graf'
graf = sns.histplot(x=titanic.Age, kde=True)

# Add vertical lines to the plot at specific x-values
graf.axvline(x=titanic.Age.mean(), c="red", label="mean")     # Red vertical line at the mean age
graf.axvline(x=titanic.Age.median(), c="green", label="median") # Green vertical line at the median age
graf.axvline(x=titanic.Age.max(), c="blue", label="max")       # Blue vertical line at the maximum age

# Add a horizontal line to the plot at y=60
graf.axhline(y=60, c="black", label="Horizontal")              # Black horizontal line at y=60

# Display a legend to label the lines on the plot
plt.legend()

## PART OF A WHOLE

In this section, we explore visualization techniques that help us understand how individual parts contribute to a whole. These visualizations are particularly useful when we want to analyze the composition of a dataset or understand the distribution of categories within a larger context.

We'll cover the following visualization types:

- Pie plot
- Stacked bar chart
- Stacked column bar

### Pie plot 👀

- ⚠️ Pie plots are visually intuitive but may not be the best choice for displaying quantitative data effectively.
- ⚠️ Small portions, such as those representing a small percentage of the total, run the risk of being too tiny to discern on the chart.
- ⚠️ Comparing areas and angles in a pie chart, especially when they represent similar values, can be challenging for viewers.

Pie plots are best suited for showcasing the proportional distribution of distinct categories within a dataset. However, it's essential to use them judiciously, considering their limitations when presenting data.


`what is the proportion of species❓`

In [None]:
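# Creating a pie chart of the species proportions in the 'penguins' DataFrame
# - value_counts() counts the penguins per species; a pie chart needs one number per slice
# - The 'autopct' parameter specifies the format for displaying the percentage on each pie slice
penguins["species"].value_counts().plot.pie(autopct="%.1f%%");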
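The slice sizes of a pie chart are just relative frequencies, which pandas can compute directly. A minimal sketch with a made-up series (`species` holds hypothetical labels, not the real `penguins` column):

```python
import pandas as pd

# Hypothetical species labels.
species = pd.Series(["Adelie", "Adelie", "Gentoo", "Chinstrap", "Adelie", "Gentoo"])

# normalize=True returns shares that add up to 1 instead of raw counts.
shares = species.value_counts(normalize=True)
print(shares)
```

Printing these shares next to the chart is a cheap way to sidestep the readability problems of small or similar slices.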

### Stacked column chart

In data visualization, a stacked column chart is a type of chart that represents data in a column format, where multiple columns are stacked on top of each other to show the composition of a whole while preserving the individual contributions of each component. Stacked column charts are particularly useful for displaying categorical data where you want to emphasize the total while highlighting the distribution of categories within that total.
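For purely categorical data, one possible way to build such a chart is to count the combinations with `pd.crosstab` and let pandas stack the columns. This is a sketch under assumptions: the `toy` table below is made up, and only its column names mirror the Titanic dataset.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical passengers (not the Titanic data).
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1, 0],
})

# Rows: class, columns: outcome, cells: number of passengers.
counts = pd.crosstab(toy["Pclass"], toy["Survived"])

# stacked=True piles the columns of `counts` on top of each other for every row.
ax = counts.plot(kind="bar", stacked=True)
ax.set_xlabel("Pclass")
ax.set_ylabel("Passengers")
```

The total height of each column is the number of passengers in that class, and the colored segments show how that total splits by outcome.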

#### seaborn: stacked column (histogram)


`how is the age of people that survived/didn't survive distributed❓`

In [None]:
sns.histplot(data=titanic, x="Age", hue="Survived", multiple="stack");

In [None]:
# 1. Create figure
sns.histplot(x=titanic.Age, hue=titanic.Survived, multiple="stack")

# 2. Modify figure
plt.axvline(x=65)  # Vertical line at age 65
plt.axhline(y=60)  # Horizontal line at 60
plt.axhline(y=80, c="red")  # Red horizontal line at 80

# 2.1. Detail
plt.title("Distribution of Age and Survival", size=20)  # Set the title
plt.xlabel("YOUNG -> OLD")  # Set the x-axis label
plt.ylabel("THIS IS HOW MANY")  # Set the y-axis label

# 3. Save figure
# --> plt.savefig("..age_and_death")  # Save the figure to a file
plt.show()  # Display the figure

In [None]:
# Create a stacked histogram of Age distribution based on Pclass
sns.histplot(data=titanic, x="Age", hue="Pclass", multiple="stack")

# Add vertical lines at the mean, 25th percentile, and 75th percentile of Age
plt.axvline(titanic.Age.mean(), color="red", linestyle="--", label="Mean Age")
plt.axvline(titanic.Age.quantile(0.75), color="green", linestyle="--", label="75th Percentile")
plt.axvline(titanic.Age.quantile(0.25), color="blue", linestyle="--", label="25th Percentile")

# Add labels and a legend
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.title("Age Distribution by Pclass")
plt.legend()

# Show the plot
plt.show()

Documentation of histograms --> https://seaborn.pydata.org/generated/seaborn.histplot.html

`how can I compare three different compound distributions at once❓`

In [None]:
# Create subplots with 1 row and 3 columns, and set the figure size
fig, axs = plt.subplots(nrows=1, ncols=3, figsize=(20, 5))

# 1. Regular Histogram
sns.histplot(data=penguins, x="flipper_length_mm", hue="species", ax=axs[0])
axs[0].set_title("Regular Histogram")

# 2. Histogram for flipper lengths above 170
sns.histplot(data=penguins[penguins.flipper_length_mm > 170], x="flipper_length_mm", hue="species", multiple="stack", palette="mako", ax=axs[1])
axs[1].set_title("Histogram for Flipper Lengths Above 170")

# 3. Histogram for all flipper lengths, drawn with the "mako" palette
sns.histplot(data=penguins, x="flipper_length_mm", hue="species", palette="mako", ax=axs[2])
axs[2].set_title("Histogram for All Flipper Lengths (mako Palette)")

# Show the plots
plt.show()

### Correlation Matrix

The correlation matrix is an essential data analysis metric used to summarize data and understand the relationships between various variables. It helps in making informed decisions based on the interdependencies among these variables.

In a correlation matrix:

- Values close to 1 indicate a strong positive correlation.
- Values close to -1 indicate a strong negative correlation.
- Values close to 0 indicate a weak or no correlation.

`to what extent are these quantitative variables related❓`

In [None]:
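# Calculate the correlation matrix for the numeric columns of the Titanic dataset
# (numeric_only=True skips text columns, which recent pandas versions no longer drop silently)
corr = titanic.corr(numeric_only=True)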
display(corr)

In [None]:
# Create a heatmap to visualize the correlation matrix
sns.heatmap(corr);

#### Understanding NumPy Functions in Heatmap Creation

In the context of creating heatmaps, NumPy, a powerful numerical computing library in Python, offers valuable tools to manipulate data before visualizing it. Two essential NumPy functions commonly used in heatmap generation are `np.triu` and `np.ones_like`.

- [**`np.triu` Documentation**](https://numpy.org/doc/stable/reference/generated/numpy.triu.html): This function, short for "triangular upper," is used to create a boolean mask that retains only the upper triangular part of a matrix while setting the lower part to `False`. It is particularly useful when visualizing correlation matrices or similar symmetric data. 

- [**`np.ones_like` Documentation**](https://numpy.org/doc/stable/reference/generated/numpy.ones_like.html): This function generates an array of ones with the same shape and data type as a specified input array. It is often employed to create an initial matrix with `True` or `1` values before applying a mask.

In the context of heatmap creation, these NumPy functions play a crucial role in enhancing the visualization of correlation matrices. By using `np.triu` in combination with `np.ones_like`, we can focus on visualizing only one half of the correlation matrix. Note that `sns.heatmap` hides the cells where the mask is `True`, so an upper-triangular mask hides the upper triangle (including the diagonal) and leaves the lower triangle visible. This avoids redundant information and results in a more concise and interpretable heatmap, especially when dealing with symmetric data like correlation coefficients.

By understanding how to use these NumPy functions in heatmap creation, you can effectively tailor your visualizations to highlight the essential relationships within your data, making it easier to draw meaningful insights and conclusions.
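What the mask looks like is easiest to see on a tiny example. The sketch below uses a made-up 3 × 3 symmetric matrix (`corr_toy`, hypothetical values) and marks with `NaN` the cells that a heatmap would hide, given that masked cells are the ones where the mask is `True`.

```python
import numpy as np

# Hypothetical symmetric correlation matrix.
corr_toy = np.array([[1.0, 0.8, -0.2],
                     [0.8, 1.0, 0.1],
                     [-0.2, 0.1, 1.0]])

# True on and above the diagonal, False below it.
mask = np.triu(np.ones_like(corr_toy, dtype=bool))
print(mask)

# Cells where the mask is True are hidden; only the lower triangle remains.
visible = np.where(mask, np.nan, corr_toy)
print(visible)
```

Every pair of variables still appears exactly once in the visible lower triangle.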

In [None]:
# Creating a boolean mask for the upper triangular part of the correlation matrix
mask = np.triu(np.ones_like(corr, dtype=bool))

# Creating a diverging color map with seaborn's diverging_palette
# (hues 230 and 20 give blue for negative and red for positive values; 0 and 10 would both be red)
color_map = sns.diverging_palette(230, 20, as_cmap=True)

# Creating a heatmap using seaborn
sns.heatmap(corr,  
            mask=mask,               # Applying the upper triangular mask to hide redundant information
            cmap=color_map,          # Setting the color map for the heatmap
            square=True,             # Making the heatmap square
            linewidths=0.5,          # Setting the width of the lines between cells
            vmax=1,                  # Defining the maximum color scale value
            vmin=-1,                 # Defining the minimum color scale value
            cbar_kws={"shrink": .5}  # Customizing the colorbar appearance
);

# Displaying the heatmap
plt.show()

### Pairplot
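A pairplot draws the pairwise relationships between the numeric variables of a dataset, with the distribution of each variable on the diagonal.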

`how do all these variables compare❓`

In [None]:
sns.pairplot(penguins);

`how do all these variables compare across species❓`

In [None]:
sns.pairplot(penguins, hue="species");

## TRENDS

### Line chart

`how are all these groups changing over time❓`

In [None]:
df = pd.DataFrame({'years': [2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023],
                   'Data': [20, 12, 15, 14, 19, 23, 25, 29],
                   'UX': [5, 7, 7, 9, 12, 9, 9, 4],
                   'Web dev': [1, 1, 10, 6, 6, 5, 9, 12]})

In [None]:
data = pd.melt(df, ['years'])
sns.lineplot(x='years', y="value", hue='variable', 
             data=data,
             palette=['red', 'blue', 'purple']);
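
The `pd.melt` call above reshapes the table from wide format (one column per group) to long format (one row per year and group), which is the layout that `hue` needs. A minimal sketch on the first two years of the same table; the names `wide` and `long` are chosen here for illustration:

```python
import pandas as pd

# Small wide-format table: one column per group
wide = pd.DataFrame({"years": [2016, 2017],
                     "Data": [20, 12],
                     "UX": [5, 7]})

# Keep 'years' as the identifier; every other column is unpivoted
# into a pair of columns named 'variable' and 'value'
long = pd.melt(wide, id_vars=["years"])
print(long)
```

Each original column name ends up in `variable` and each cell in `value`, so `sns.lineplot` can draw one line per distinct `variable`.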

### Area chart

`how is the population changing over time and how is it different across these cities❓`

In [None]:
df = pd.DataFrame({'years': [2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023],
                   'Barcelona': [20000, 12000, 15000, 14000, 19000, 23000, 25000, 29000],
                   'Madrid': [5000, 7000, 7000, 9000, 12000, 9000, 9000, 4000],
                   'Valencia': [1000, 8000, 10000, 6000, 6000, 5000, 9000, 12000]})

In [None]:
#define colors to use in chart
color_map = ['red', 'steelblue', 'pink']
    
#create area chart
plt.stackplot(df.years, df.Barcelona, df.Madrid, df.Valencia,
              labels=['Barcelona', 'Madrid', 'Valencia'],
              colors=color_map)

#add legend
plt.legend(loc='upper left')

#add axis labels
plt.xlabel('Years')
plt.ylabel('Population')

#display area chart
plt.show()
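
With its default baseline, `plt.stackplot` draws each series on top of the previous ones, so the upper edge of every band is a running total rather than the city's own value. A small sketch, assuming only the first three years of the table above (the names `pop`, `band_tops` and `total` are illustrative):

```python
import pandas as pd

# First three years of the population table
pop = pd.DataFrame({"years": [2016, 2017, 2018],
                    "Barcelona": [20000, 12000, 15000],
                    "Madrid": [5000, 7000, 7000],
                    "Valencia": [1000, 8000, 10000]}).set_index("years")

# Running total across cities: the upper boundary of each stacked band
band_tops = pop.cumsum(axis=1)
print(band_tops)

# The top of the last band equals the combined population per year
total = pop.sum(axis=1)
print(total)
```

This is why only the bottom band (Barcelona) can be read directly against the y-axis; for the others, the thickness of the band is what carries the value.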

----------------------------------------------------------------

### Jointplot: histograms & scatterplot
A jointplot combines a bivariate plot with univariate plots: the central panel is a scatterplot of the two variables, and the margins show the individual histogram of each one.

In [None]:
penguins.head()

In [None]:
sns.jointplot(data=penguins, x= "bill_depth_mm", y= "flipper_length_mm");

Assigning a `hue` variable colors the scatterplot points by group and draws separate density curves (using `kdeplot()`) on the marginal axes.

In [None]:
sns.jointplot(data=penguins, x= "bill_depth_mm", y= "flipper_length_mm", hue="species");

In [None]:
sns.jointplot(data=penguins, x= "body_mass_g", y= "flipper_length_mm");

- Can you do a jointplot with matplotlib?
Yes, but there is no single function for it: you have to create the scatter axes and the two marginal histogram axes yourself and draw each one separately. Here's a [tutorial](https://stackabuse.com/matplotlib-scatter-plot-with-distribution-plots-histograms-jointplot/).
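
One way to assemble such a figure by hand is sketched below. It is a rough approximation, not a reproduction of `sns.jointplot`, and it uses synthetic data (the arrays `x` and `y` are made up) instead of the penguins dataset.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data standing in for two numeric columns (not the penguins dataset)
rng = np.random.default_rng(0)
x = rng.normal(17, 2, size=200)
y = 200 + 3 * (x - 17) + rng.normal(0, 10, size=200)

# 2x2 grid: a large main panel plus narrow marginal panels that share its axes
fig, axs = plt.subplots(
    2, 2, figsize=(6, 6),
    sharex="col", sharey="row",
    gridspec_kw={"width_ratios": [4, 1], "height_ratios": [1, 4]},
)
ax_top, ax_corner = axs[0]
ax_main, ax_right = axs[1]

ax_main.scatter(x, y, s=10)                          # bivariate view
ax_top.hist(x, bins=20)                              # distribution of x
ax_right.hist(y, bins=20, orientation="horizontal")  # distribution of y
ax_corner.axis("off")                                # unused top-right cell

ax_main.set_xlabel("x")
ax_main.set_ylabel("y")
fig.tight_layout()
```

Sharing the x-axis per column and the y-axis per row keeps the marginal histograms aligned with the scatter panel.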


## Subplots
Subplots let us draw several plots in the same figure, arranged in a grid of rows and columns.

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
# Create a 2x3 grid of subplots with a specified figure size
fig, axs = plt.subplots(nrows=2, ncols=3, figsize=(20, 15))

# Plot 1: Horizontal bar chart showing counts of penguins by species and sex
penguins.groupby(["species", "sex"])["sex"].count().unstack().plot(kind="barh", ax=axs[0, 0])

# Plot 2: Violin plot displaying the distribution of bill lengths by penguin species
sns.violinplot(x=penguins.bill_length_mm, y=penguins.species, ax=axs[0, 1])

# Plot 3: Stack multiple histograms of flipper lengths by species for comparison
sns.histplot(data=penguins, x="flipper_length_mm", hue="species", multiple="stack", ax=axs[0, 2])

# Plot 4: Swarm plot showing body mass distribution by penguin sex and species
sns.swarmplot(data=penguins, x="body_mass_g", y="sex", hue="species", ax=axs[1, 0])

# Plot 5: Pie chart displaying the distribution of penguin species
penguins["species"].value_counts().plot.pie(autopct="%.1f%%", ax=axs[1, 1])

# Plot 6: Count plot showing the distribution of penguin sex
sns.countplot(x=penguins.sex, ax=axs[1, 2])
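
`plt.subplots` returns the figure and a NumPy array of Axes, which is why each plot above is sent to a cell with `ax=axs[row, col]`. A minimal sketch of the same mechanics on a 2x2 grid, using synthetic values rather than the penguins data:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic values, used only to have something to draw
rng = np.random.default_rng(1)
values = rng.normal(size=100)

# 2 rows x 2 columns: axs is a 2-D array holding one Axes per cell
fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(8, 6))

axs[0, 0].hist(values, bins=15)          # top-left
axs[0, 1].plot(np.sort(values))          # top-right
axs[1, 0].boxplot(values)                # bottom-left
axs[1, 1].scatter(values[:-1], values[1:], s=8)  # bottom-right

# axs.flat iterates over every Axes, row by row, whatever the grid shape
for i, ax in enumerate(axs.flat, start=1):
    ax.set_title(f"Plot {i}")

fig.tight_layout()  # reduce overlap between titles and tick labels
```

Iterating over `axs.flat` is handy for settings that apply to every panel, such as titles or grid lines.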

## Save Plots

In [None]:
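# Draw the scatter plot and keep the Axes object that seaborn returns
scatter = sns.scatterplot(x="body_mass_g", y="flipper_length_mm", data=penguins, hue="species")
# Save the parent figure of that Axes to disk; dpi sets the output resolution
scatter.figure.savefig("../scatter.jpg", dpi=1000)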
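
The same `savefig` method works on any matplotlib figure. The sketch below is an illustration under a few assumptions: it uses placeholder points, writes a PNG into a temporary folder so nothing is left behind, and picks a moderate `dpi`, since very high values such as 1000 can produce large files.

```python
import tempfile
from pathlib import Path

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.scatter([1, 2, 3], [2, 4, 3])  # placeholder points, for illustration only

# Write into a temporary folder so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "scatter.png"
    # dpi controls the resolution; bbox_inches="tight" trims surrounding whitespace
    fig.savefig(out_path, dpi=150, bbox_inches="tight")
    saved_ok = out_path.exists() and out_path.stat().st_size > 0

print(saved_ok)
```

The file format is inferred from the extension, so changing the file name to end in `.pdf` or `.svg` would produce a vector image instead.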

## Recap: Exploratory Data Analysis (EDA) Visualizations

In this EDA visualization section, we explored various types of visualizations to gain insights from our datasets. These visualizations help us understand data distributions, relationships between variables, and more. Let's recap the key visualizations we covered:

### Categorical Data Visualizations:

- **Bar Charts:** Used to visualize counts of categorical data.
- **Grouped Bar Charts:** Show counts of categorical data with grouping.
- **Stacked Bar Charts:** Display data counts with stacking.
- **Count Plots:** Show the count of categorical data using bars.

### Numeric Data Visualizations:

- **Histograms:** Visualize the distribution of numeric data.
- **Kernel Density Estimation (KDE):** Estimate the probability density function as a smooth curve, often drawn on top of a histogram.
- **Box Plots:** Display quartiles and identify outliers.
- **Violin Plots:** Combine box plots and KDE to visualize data distribution.

### Relationship Visualizations:

- **Scatter Plots:** Display relationships between two numeric variables.
- **Pair Plots:** Show pairwise relationships between multiple variables.
- **Heatmaps:** Visualize correlations between numeric variables.

### Part of a Whole Visualizations:

- **Pie Charts:** Represent parts of a whole, but become hard to read when there are many categories or when the slices are similar in size.
- **Stacked Bar Charts:** Show parts of a whole with stacked bars.
- **Stacked Column Charts:** Present parts of a whole with stacked columns.

These visualizations allow us to explore our datasets, identify patterns, and gain insights to guide further analysis and decision-making.

Remember to choose the most appropriate visualization type based on your data and the insights you want to extract.