
#**Data Visualization with Python**
###**1. Introduction to Data Visualization**
Data visualization is a critical component of data science, enabling analysts to:

*   Explore patterns, trends, and outliers in data.
*   Communicate insights effectively to stakeholders.
*   Simplify complex data through graphical representation.

Common types of visualizations include line plots, bar charts, scatter plots, histograms, heatmaps, and box plots. Python libraries like `Matplotlib`, `Seaborn`, and `Plotly` simplify the creation of these visualizations.

###**2. Matplotlib: The Foundational Library**
`Matplotlib` is a low-level library for creating static, interactive, or animated visualizations. It provides full customization of plots.
##**Installation**

In [None]:
%pip install matplotlib seaborn

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

`seaborn` is a data visualization library built on top of matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics (e.g., heatmaps and violin plots).
`sns` is the conventional alias for seaborn. `%matplotlib inline` is an IPython magic command available in Jupyter notebooks and Google Colab. It ensures that plots are rendered directly in the notebook (i.e., **inline**), below the code cell that creates them. Without it, a plot may open in a separate window, depending on the environment.

In [None]:
# Example plot with matplotlib
plt.plot([1, 2, 3], [4, 5, 6])
plt.show()

# Example plot with seaborn
sns.set(style="darkgrid")
sns.lineplot(x=[1, 2, 3], y=[4, 5, 6])
plt.show()


##**3. Line Plot**
A **line plot** is a basic type of plot used to visualize data points connected by straight lines. It is especially useful for showing trends over a period of time or any continuous data. In a line plot, the data points are plotted on a graph, and lines are drawn between them to connect the points in order. This helps to illustrate the relationship between the variables.

###**Key Features:**


*   **X-axis (Horizontal):** Represents the independent variable (e.g., time or another ordered quantity).
*   **Y-axis (Vertical):** Represents the dependent variable (the variable you're measuring or tracking).

###**Use Cases:**


*   **Time Series Data:** Line plots are often used to show how a value changes over time (e.g., stock prices, temperature trends).
*   **Trends and Patterns:** They can reveal trends, fluctuations, or patterns in data, making it easier to observe changes and relationships.
*   **Comparing Multiple Series:** You can plot multiple lines on the same graph to compare different sets of data.


**Example:**
The following example creates a simple line plot in Python, first with Matplotlib and then with Seaborn.

Let’s assume the following temperature data (in Celsius) for a city over 7 days:

* Monday: 22°C
* Tuesday: 24°C
* Wednesday: 19°C
* Thursday: 21°C
* Friday: 23°C
* Saturday: 25°C
* Sunday: 20°C

We'll use this data to create a line plot that shows how the temperature changes day by day.

###**Using `Matplotlib`**

In [None]:
import matplotlib.pyplot as plt

# Temperatures in Celsius for a week
days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
temperatures = [22, 24, 19, 21, 23, 25, 20]

# Create a line plot
plt.plot(days, temperatures)

# Add title and labels
plt.title('Weekly Temperature Trend in Celsius')
plt.xlabel('Days')
plt.ylabel('Temperature (°C)')


# Display the plot
plt.show()


###**Using `Seaborn`**
`seaborn` makes it easier to generate aesthetically pleasing line plots with some built-in styling.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Temperatures in Celsius for a week
days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
temperatures = [22, 24, 19, 21, 23, 25, 20]

# Set Seaborn style
sns.set(style="darkgrid")

# Create a line plot
sns.lineplot(x=days, y=temperatures)

# Add title and labels
plt.title('Weekly Temperature Trend in Celsius')
plt.xlabel('Days')
plt.ylabel('Temperature (°C)')

# Display the plot
plt.show()


###**Additional Features in Line Plots:**
* **Markers:** Add markers to the data points to make them more visible.
* **Multiple Lines:** Plot several lines on the same axes by calling the plotting function once per series.
* **Line Styles:** Change the color, style (solid, dashed), and width of the lines.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Temperatures in Celsius for a week
days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
temperatures = [22, 24, 19, 21, 23, 25, 20]

# Set Seaborn style for a cleaner look
sns.set(style="darkgrid")

# Create a line plot
sns.lineplot(x=days, y=temperatures, marker='o', color='blue', linewidth=2)

# Adding title and labels
plt.title('Weekly Temperature Trend in Celsius', fontsize=16)
plt.xlabel('Day', fontsize=12)
plt.ylabel('Temperature (°C)', fontsize=12)

# Display the plot
plt.show()
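
The styling options listed above are also available directly in Matplotlib. The following is a minimal sketch, reusing the same temperature data, that draws a dashed line with square markers. The particular color, marker, and line width are arbitrary choices made for illustration.

```python
import matplotlib.pyplot as plt

# Temperatures in Celsius for a week (same data as above)
days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
temperatures = [22, 24, 19, 21, 23, 25, 20]

fig, ax = plt.subplots()

# linestyle='--' draws a dashed line; marker='s' draws a square at each point
(line,) = ax.plot(days, temperatures, linestyle='--', marker='s',
                  color='tab:red', linewidth=1.5)

ax.set_title('Weekly Temperature Trend in Celsius')
ax.set_xlabel('Day')
ax.set_ylabel('Temperature (°C)')
# With %matplotlib inline, the figure is rendered when the cell finishes.
```

Other common `linestyle` values are `'-'` (solid), `':'` (dotted), and `'-.'` (dash-dot).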



###**Comparing Temperatures in Two Cities Over a Week**
Let’s assume we have the temperature data for City A and City B over the same 7 days:

* **City A:** [22, 24, 19, 21, 23, 25, 20] (Temperatures in Celsius)
* **City B:** [18, 21, 17, 19, 20, 22, 19] (Temperatures in Celsius)

We’ll plot both cities on the same line plot, each with different colors and a legend to identify which line corresponds to which city.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Data for two cities
days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
city_a_temperatures = [22, 24, 19, 21, 23, 25, 20]
city_b_temperatures = [18, 21, 17, 19, 20, 22, 19]

# Set Seaborn style for a cleaner look
sns.set(style="darkgrid")

# Create line plots for both cities
sns.lineplot(x=days, y=city_a_temperatures, label='City A', color='blue', marker='o', linewidth=2)
sns.lineplot(x=days, y=city_b_temperatures, label='City B', color='green', marker='s', linewidth=2)

# Adding title and labels
plt.title('Weekly Temperature Comparison: City A vs City B', fontsize=14)
plt.xlabel('Day', fontsize=12)
plt.ylabel('Temperature (°C)', fontsize=12)

# Display the legend
plt.legend()

# Display the plot
plt.show()


###**Advantages of Line Plots:**
* **Easy to interpret:** Trends and changes are clearly visible.
* **Compact visualization:** Shows data trends in a simple and clean way.
* **Versatile:** Works for both small and large datasets, and you can compare multiple variables.

###**Disadvantages:**
* **Not suitable for unordered categorical data:** The connecting lines imply a progression from one point to the next, so joining categories that have no natural order can suggest a trend that does not exist.

##**3. Scatter Plot**
A scatter plot is a graphical representation used to visualize the relationship between two continuous variables. It is created by plotting individual data points on a two-dimensional axis, where each axis represents one variable. The pattern of the points shows how the two variables vary together; it does not by itself show that one variable causes the other. Scatter plots are ideal for identifying correlations, clusters, and outliers.

* **Positive Correlation:** When data points trend upward (from left to right), it shows that as one variable increases, the other increases.
* **Negative Correlation:** When data points trend downward, it shows that as one variable increases, the other decreases.
* **No Correlation:** If data points are spread randomly without any clear pattern, the variables are uncorrelated.
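
The three cases above can be quantified with the Pearson correlation coefficient, which ranges from -1 to 1. The sketch below uses synthetic data generated with NumPy; the slopes, noise level, and seed are assumptions chosen only to make each case easy to see.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is repeatable
x = np.linspace(0, 10, 200)

y_positive = 2 * x + rng.normal(scale=1.0, size=x.size)   # rises with x
y_negative = -2 * x + rng.normal(scale=1.0, size=x.size)  # falls as x rises
y_unrelated = rng.normal(scale=1.0, size=x.size)          # does not depend on x


def pearson_r(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    return np.corrcoef(a, b)[0, 1]


r_positive = pearson_r(x, y_positive)
r_negative = pearson_r(x, y_negative)
r_unrelated = pearson_r(x, y_unrelated)

print(f"positive: {r_positive:.2f}")
print(f"negative: {r_negative:.2f}")
print(f"unrelated: {r_unrelated:.2f}")
```

Values close to 1 or -1 correspond to points that lie near an upward or downward line, while values near 0 correspond to a cloud with no linear pattern.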

In machine learning and data science, scatter plots are a powerful tool for exploring datasets, identifying patterns, and making preliminary analyses.

###**Example:** Scatter Plot using the Iris Flower Dataset

The Iris flower dataset is one of the most famous datasets in machine learning. It contains measurements of petal length, petal width, sepal length, and sepal width for three different species of Iris flowers:

* Iris Setosa
* Iris Virginica
* Iris Versicolor

This dataset is widely used for classification tasks, and it's often one of the first datasets used when learning machine learning techniques. It has **150 records** (50 samples from each species) and 5 attributes:

* Petal Length (cm)
* Petal Width (cm)
* Sepal Length (cm)
* Sepal Width (cm)
* Class (Species)

The dataset is publicly available from the UCI Machine Learning Repository and is often used in classification problems to predict the species of the Iris flowers based on the four measurement attributes.

###**Scatter Plot of the Iris Dataset**
Let's use a scatter plot to visualize the relationship between petal length and petal width for the three species in the Iris dataset. We will plot each species with a different color for easy distinction.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
import pandas as pd

# Load Iris dataset
iris = load_iris()
data = pd.DataFrame(data=iris.data, columns=iris.feature_names)
data['species'] = iris.target_names[iris.target]

# Create scatter plot for Petal Length vs Petal Width
sns.scatterplot(x='petal length (cm)', y='petal width (cm)', hue='species', data=data, palette='Set1')

# Adding title and labels
plt.title('Petal Length vs Petal Width - Iris Dataset', fontsize=16)
plt.xlabel('Petal Length (cm)', fontsize=12)
plt.ylabel('Petal Width (cm)', fontsize=12)

# Display the plot
plt.show()


###**Key Observations:**
* **Cluster Formation:** The species form visible groups based on petal length and petal width, which suggests that these two features are useful for classifying the species.
* **Distinct Groups:** Iris Setosa forms a clearly separate cluster, while Iris Versicolor and Iris Virginica overlap slightly at their boundary but are still largely distinguishable.
* **Correlations:** There is a strong positive correlation between petal length and petal width: the points trend clearly upward, so flowers with longer petals also tend to have wider petals.

## **4. Histogram**
A histogram is a type of graph used to represent the distribution of a dataset. It is created by grouping the data into bins (or intervals) and counting how many data points fall into each bin. The height of each bar represents the frequency (or count) of data points that fall within the corresponding bin. Histograms are particularly useful for understanding the distribution of numerical data.

**Key Points about Histograms:**

* **Bins/Intervals:** The range of data is divided into intervals, and the bars represent how many data points fall within each interval.
* **Shape of Distribution:** Histograms can reveal the shape of the distribution (e.g., normal, skewed, bimodal), helping us understand the underlying structure of the data.
* **Skewness:** The distribution may be left-skewed (long tail on the left) or right-skewed (long tail on the right), indicating the presence of extreme values on one side.
* **Outliers:** Outliers are data points that fall far outside the majority of the data. In a histogram, outliers appear as isolated bars at the extreme ends.
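
The binning step described above can be reproduced without plotting anything. The sketch below uses `np.histogram` on a small made-up sample (not the Iris data) to count how many values fall into each interval.

```python
import numpy as np

values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 9])  # 9 is a deliberate outlier
bin_edges = [0, 2, 4, 6, 8, 10]                     # five bins of width 2

# Each bin includes its left edge and excludes its right edge,
# except the last bin, which includes both edges.
counts, edges = np.histogram(values, bins=bin_edges)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left} to {right}: {count}")
```

The counts are 1, 5, 3, 0, and 1. The empty bin followed by a single value in the last bin is how the outlier shows up as an isolated bar.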

Now, let's apply histograms to the Iris dataset. We will first inspect summary statistics such as the mean, median, standard deviation, and quartiles, and then plot the distributions.

## **Example**: Histograms for the Iris Dataset
We will load the dataset, summarize it with `describe()`, plot histograms of Sepal Width (in cm) with different bin settings, and finally plot Petal Length (in cm) for the three species of Iris flowers.

In [None]:
# load data into a Pandas dataframe
flowers_df = sns.load_dataset("iris")

In [None]:
flowers_df.sepal_width

In [None]:
flowers_df.describe()

### **Count**  
- Represents the **total number of observations** in the dataset (**150 samples**).  

### **Mean (Average)**  
- The **mean** represents the **central tendency** of the data.  
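- Example: The **average petal length** is **3.758 cm**. The mean is a summary value and not necessarily a typical one: Setosa petals are much shorter than this, while most Versicolor and Virginica petals are longer.  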

### **Standard Deviation (Std)**  
- Measures the **spread** of the data.  
- A **low standard deviation** means values are **closely clustered** around the mean, while a **high standard deviation** indicates **greater variation**.  
- Example: Petal length has the highest **standard deviation (1.765)**, indicating that petal length varies significantly across species.  
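
`describe()` reports the standard deviation but not the variance or the skewness. As a rough sketch, both can be computed with the pandas `Series.var` and `Series.skew` methods. The sample below is made up for illustration and is not taken from the Iris data.

```python
import pandas as pd

# Hypothetical sample with one large value that stretches the right tail
sample = pd.Series([1, 1, 2, 2, 2, 3, 3, 4, 10], name="hypothetical_values")

summary = {
    "mean": sample.mean(),
    "median": sample.median(),
    "std": sample.std(),        # sample standard deviation (ddof=1)
    "variance": sample.var(),   # sample variance (ddof=1), equal to std ** 2
    "skewness": sample.skew(),  # positive when the right tail is longer
}

for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

In this sample the mean is pulled above the median by the single large value, and the skewness is positive, which matches the description of a right-skewed distribution.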

### **Minimum (Min) and Maximum (Max) Values**  
- These represent the **smallest** and **largest** values recorded in the dataset.  
- Example: The **smallest petal length** is **1.0 cm**, while the **largest** is **6.9 cm**, showing a **wide range**.  

### **Quartiles (Q1, Q2, Q3)**  
- **25% (Q1)** → The first quartile represents the value **below which 25%** of observations fall.  
- **50% (Q2 - Median)** → The **middle value** when data is sorted in ascending order.  
- **75% (Q3)** → The third quartile represents the value **below which 75%** of observations fall.  

Example:
- **Median Sepal Width = 3.0 cm**, meaning half of the flowers have a **sepal width** below **3.0 cm** and half have widths above **3.0 cm**.  
- **75% of flowers have petal lengths below 5.1 cm**, while the **remaining 25%** have longer petal lengths.  
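
The quartiles shown by `describe()` can also be computed directly. A minimal sketch with pandas `Series.quantile`, using a small hypothetical sample so the results are easy to check by hand:

```python
import pandas as pd

sample = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])  # hypothetical values

# 0.25, 0.50, and 0.75 correspond to Q1, the median (Q2), and Q3
q1, median, q3 = sample.quantile([0.25, 0.50, 0.75])

# The interquartile range covers the middle 50% of the data
iqr = q3 - q1

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
```

Here Q1 is 3.0, the median is 5.0, and Q3 is 7.0, so the middle half of the values spans an interquartile range of 4.0.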



##**Key Insights from the Data**  
- **Petal Length has the highest variability** in absolute terms (standard deviation **1.765**), and both petal measurements vary widely relative to their means.  
- **Sepal measurements are more stable** relative to their means compared to petal measurements.  
- **Sepal Width has the smallest standard deviation** of the four features, indicating the **least absolute variability**.  
- The **quartiles help us understand data spread**: for example, the middle 50% of flowers have petal lengths between **1.6 cm and 5.1 cm**.

In [None]:
plt.title("Distribution of Sepal Width")
plt.hist(flowers_df.sepal_width);

The following cell creates a histogram of sepal width from `flowers_df`, dividing the data into 5 bins to visualize its frequency distribution.

In [None]:
# Specifying the number of bins
plt.hist(flowers_df.sepal_width, bins=5);

In [None]:
import numpy as np

# Specifying the boundaries of each bin: 2.0, 2.25, ..., 4.75
plt.hist(flowers_df.sepal_width, bins=np.arange(2, 5, 0.25));

In [None]:
# Bins of unequal sizes
plt.hist(flowers_df.sepal_width, bins=[1,3,4,4.5]);
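
When bins have unequal widths, raw counts favor the wider bins, so bar heights are harder to compare. One common remedy is to normalize by bin width with `density=True`. The sketch below shows the arithmetic with `np.histogram` on a small made-up sample and the same bin edges as above.

```python
import numpy as np

values = np.array([1.5, 2.0, 2.5, 3.1, 3.4, 3.6, 4.1, 4.2])  # hypothetical data
bin_edges = np.array([1, 3, 4, 4.5])                          # widths 2, 1, 0.5

counts, _ = np.histogram(values, bins=bin_edges)
density, _ = np.histogram(values, bins=bin_edges, density=True)

widths = np.diff(bin_edges)

# density = count / (total number of values * bin width)
manual_density = counts / (len(values) * widths)

print("counts: ", counts)
print("density:", density)
print("area:   ", (density * widths).sum())  # bar areas sum to 1
```

The first two bins hold the same number of values, but the density of the narrower bin is twice as high. `plt.hist` accepts the same `density=True` argument.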

##**Multiple Histograms**

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris

# Load Iris dataset
iris = load_iris()
data = pd.DataFrame(data=iris.data, columns=iris.feature_names)
data['species'] = iris.target_names[iris.target]

# Plot histogram for Petal Length
plt.figure(figsize=(10, 6))
sns.histplot(data=data, x='petal length (cm)', hue='species', kde=True, bins=15, palette='Set1', edgecolor='black')

# Adding title and labels
plt.title('Histogram of Petal Length by Iris Species', fontsize=16)
plt.xlabel('Petal Length (cm)', fontsize=12)
plt.ylabel('Frequency', fontsize=12)

# Display the plot
plt.show()

#  **5. Bar Chart**

##**What is a Bar Chart?**
A **bar chart** (or bar plot) is a graphical representation used to display **categorical data** using rectangular bars. The height (or length) of the bars represents the **frequency** or **value** of each category.

##**Key Features of a Bar Chart**
- **Categorical Data**: Used to compare different categories (e.g., different flower species, sales data, etc.).  
- **Bars with Equal Width**: The width of bars remains constant, while the height varies based on value.  
- **Gaps Between Bars**: Unlike histograms, bar charts have gaps to indicate separate categories.  
- **Horizontal or Vertical**: Bars can be plotted either horizontally or vertically.  
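
To illustrate the last point, here is a minimal sketch that draws the same values as vertical bars with `ax.bar` and as horizontal bars with `ax.barh`. The category labels and values are hypothetical.

```python
import matplotlib.pyplot as plt

categories = ['A', 'B', 'C']  # hypothetical category labels
values = [3.4, 2.8, 3.0]      # hypothetical values, one per category

fig, (ax_v, ax_h) = plt.subplots(1, 2, figsize=(9, 4))

# Vertical bars: the bar height encodes the value
vertical_bars = ax_v.bar(categories, values)
ax_v.set_title('Vertical')
ax_v.set_ylabel('Value')

# Horizontal bars: the bar length (width) encodes the value
horizontal_bars = ax_h.barh(categories, values)
ax_h.set_title('Horizontal')
ax_h.set_xlabel('Value')

fig.tight_layout()
# With %matplotlib inline, the figure is rendered when the cell finishes.
```

Horizontal bars tend to work better when category names are long, because the labels can be read without rotation.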


##**Example: Average Sepal Width for Each Iris Species**
We'll use the **Iris dataset**, where we visualize the **average sepal width** for each flower species (`setosa`, `versicolor`, and `virginica`).  

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Load the Iris dataset bundled with Seaborn
flowers_df = sns.load_dataset("iris")

# Calculate the mean sepal width for each species
sepal_width_avg = flowers_df.groupby("species")["sepal_width"].mean()

# Plot the bar chart
plt.figure(figsize=(7,5))
sns.barplot(x=sepal_width_avg.index, y=sepal_width_avg.values, palette="viridis")

# Add labels and title
plt.xlabel("Species", fontsize=12)
plt.ylabel("Average Sepal Width (cm)", fontsize=12)
plt.title("Average Sepal Width for Each Iris Species", fontsize=14)

# Show the plot
plt.show()

Let's take a look at another dataset, the `tips` dataset that ships with Seaborn.

In [None]:
tips_df = sns.load_dataset("tips")
tips_df

In [None]:
sns.barplot(x='day', y='total_bill', data=tips_df);

In [None]:
sns.barplot(x='day', y='total_bill', hue='sex', data=tips_df);
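
By default, `sns.barplot` does not plot individual rows: each bar shows the mean of `y` for one category (and, with `hue`, for one subgroup), with an error bar on top. The bar heights correspond roughly to a pandas `groupby` followed by `mean`, sketched here on a tiny hypothetical table that only mimics the column names of `tips_df`.

```python
import pandas as pd

# Hypothetical rows, not taken from the real tips dataset
bills = pd.DataFrame({
    "day": ["Thur", "Thur", "Thur", "Fri", "Fri", "Fri"],
    "sex": ["Male", "Female", "Male", "Female", "Female", "Male"],
    "total_bill": [10.0, 20.0, 30.0, 12.0, 18.0, 24.0],
})

# One mean per day: the bar heights without hue
mean_by_day = bills.groupby("day")["total_bill"].mean()

# One mean per (day, sex) pair: the bar heights with hue='sex'
mean_by_day_and_sex = bills.groupby(["day", "sex"])["total_bill"].mean()

print(mean_by_day)
print(mean_by_day_and_sex)
```

Running the same `groupby` on `tips_df` is a quick way to read off the exact values behind the bars.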

# **6. Heat Map**

##**What is a Heatmap?**
A **heatmap** is a **graphical representation of data** where individual values are represented using a **color gradient**. It is commonly used to visualize **correlations, relationships, and distributions** in a dataset.


## **Key Features of a Heatmap**
- **Color-Coded Representation**: Different colors indicate different magnitudes of data values.  
- **Best for Correlation Matrices**: Shows relationships between numerical variables.  
- **Darker/Brighter Colors Indicate Trends**: Higher values may have **darker** or **brighter** colors, depending on the chosen color scheme.  
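
As a sketch of the correlation-matrix use case, the example below builds a small synthetic DataFrame, computes pairwise correlations with `DataFrame.corr`, and passes the result to `sns.heatmap`. The column names, noise levels, and seed are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(seed=1)  # fixed seed so the sketch is repeatable
n = 100
base = rng.normal(size=n)

# Hypothetical columns: "b" follows "a", "c" moves against "a", "d" is unrelated noise
df = pd.DataFrame({
    "a": base,
    "b": base + rng.normal(scale=0.3, size=n),
    "c": -base + rng.normal(scale=0.3, size=n),
    "d": rng.normal(size=n),
})

corr = df.corr()  # pairwise Pearson correlations, each between -1 and 1

# Fix the color scale to [-1, 1] and center it at 0 so the sign is easy to read
ax = sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm",
                 vmin=-1, vmax=1, center=0)
ax.set_title("Correlation matrix (synthetic data)")
```

A diverging palette such as `coolwarm` suits correlation matrices because positive and negative values get visually distinct colors.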

In [None]:
flights_df = sns.load_dataset("flights").pivot(index="month", columns="year", values="passengers")

In [None]:
flights_df
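
The `pivot` call reshapes the data from long form (one row per observation) to wide form (one row per month, one column per year), which is the grid layout a heatmap needs. A minimal sketch of the same reshaping on a tiny hypothetical table:

```python
import pandas as pd

# Long form: one row per (year, month) observation; the numbers are hypothetical
long_df = pd.DataFrame({
    "year": [2020, 2020, 2021, 2021],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "passengers": [100, 110, 120, 135],
})

# Wide form: months become the row index, years become the columns
wide_df = long_df.pivot(index="month", columns="year", values="passengers")

print(wide_df)
print(wide_df.loc["Jan", 2021])  # value at row "Jan", column 2021
```

Each cell of the wide table holds the `passengers` value for one month and year combination, so `wide_df.loc["Jan", 2021]` is 120.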

In [None]:
plt.title("No. of Passengers (1000s)")
sns.heatmap(flights_df);

The brighter colors indicate higher passenger numbers. By looking at the graph, we can infer two things:
* The number of passengers in any given year tends to be the highest around July & August.
* The number of passengers in any given month tends to grow year by year.

We can also display the actual values in each block by specifying `annot=True`, and use the `cmap` argument to change the color palette.

In [None]:
plt.figure(figsize=(12, 6))
plt.title("No. of Passengers (1000s)")
# fmt="d" prints the annotations as whole numbers instead of scientific notation
sns.heatmap(flights_df, annot=True, fmt="d", cmap="YlGnBu");