### Welcome to the Session on Data Visualization with Python 

#### Graphs are everywhere 

<img src='Graphs.png'>

- Understand the context in which you need to communicate
- Exploratory vs. explanatory analysis
- Ask the right questions
    - Who is the audience?
    - What am I trying to communicate?
    - How can I tell a story?

Choosing an appropriate visual
- Which chart is right for numeric data?
- What graph should I use for categorical variables?
- Should I use plain text to represent this analysis?
- I have used bar charts, but I am not happy with the results. How can I better represent my analysis?

Let us see some examples with Python.

In [None]:
#### Import the following dependencies 

In [None]:
import pandas as pd
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
print("Setup Complete")

#### Load the data

In [None]:
fifa_data = pd.read_csv('fifa.csv', index_col="Date", parse_dates=True)
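
To see what `index_col="Date"` and `parse_dates=True` do in the call above, here is a minimal sketch on a tiny inline CSV. The rows and the name `sample` are made up for illustration and are not taken from `fifa.csv`.

```python
import io
import pandas as pd

# Made-up stand-in for 'fifa.csv': a Date column plus two country columns.
csv_text = """Date,ARG,BRA
1993-08-08,5.0,8.0
1993-09-23,12.0,1.0
1993-10-22,9.0,1.0
"""

# Same arguments as in the notebook, applied to the in-memory text.
sample = pd.read_csv(io.StringIO(csv_text), index_col="Date", parse_dates=True)

# The index is a DatetimeIndex, so rows can be selected with date strings.
print(type(sample.index).__name__)
print(sample.loc["1993-09":"1993-10"])
```

Because the index holds real dates rather than strings, seaborn can place them correctly on the horizontal axis of a line chart.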

About the data
Dataset of historical FIFA rankings for six countries: 
- Argentina (ARG)
- Brazil (BRA) 
- Spain (ESP) 
- France (FRA) 
- Germany (GER)
- Italy (ITA)

In [None]:
fifa_data.head()

In [None]:
fifa_data.tail()

In [None]:
sns.set(rc={'figure.figsize':(16,6)})

In [None]:
# Line chart showing how FIFA rankings evolved over time (first 100 rows only)
sns.lineplot(data=fifa_data.head(100))

In [None]:
# Let us see the same chart on a different dataset

In [None]:

spotify_data = pd.read_csv('spotify.csv', index_col="Date", parse_dates=True)

About the data 
Global daily streams on the music streaming service Spotify, focusing on five popular songs from 2017 and 2018:
- "Shape of You", by Ed Sheeran
- "Despacito", by Luis Fonsi
- "Something Just Like This", by The Chainsmokers and Coldplay
- "HUMBLE.", by Kendrick Lamar
- "Unforgettable", by French Montana

In [None]:
spotify_data.head()

In [None]:
spotify_data.tail()

In [None]:
# Line plot of the data
sns.lineplot(data=spotify_data)

In [None]:
plt.title("Daily Global Streams of Popular Songs in 2017-2018")
 
sns.lineplot(data=spotify_data)

#### Plot a subset of the data

In [None]:
# Set the width and height of the figure
plt.figure(figsize=(14,6))

# Add title
plt.title("Daily Global Streams of Popular Songs in 2017-2018")

# Line chart showing daily global streams of 'Shape of You'
sns.lineplot(data=spotify_data['Shape of You'], label="Shape of You")

# Line chart showing daily global streams of 'Despacito'
sns.lineplot(data=spotify_data['Despacito'], label="Despacito")

# Add label for horizontal axis
plt.xlabel("Date")

In [None]:
# Set the width and height of the figure
plt.figure(figsize=(14,6))

# Add title
plt.title("Daily Global Streams of Popular Songs in 2017-2018")

# Line chart showing daily global streams of 'Shape of You'
sns.lineplot(data=spotify_data['Shape of You'])

# Line chart showing daily global streams of 'Despacito'
sns.lineplot(data=spotify_data['Despacito'])

# Add label for horizontal axis
plt.xlabel("Date")

In [None]:
spotify_data['Shape of You'].head(10)

#### Let us now look at a different dataset and different chart 

In [None]:

flight_data = pd.read_csv('flight_delays.csv', index_col="Month")

In [None]:
flight_data

In [None]:
plt.figure(figsize=(10,6))

# Add title
plt.title("Average Arrival Delay for Spirit Airlines Flights, by Month")

# Bar chart showing average arrival delay for Spirit Airlines flights by month
sns.barplot(x=flight_data.index, y=flight_data['NK'])

# Add label for vertical axis
plt.ylabel("Arrival delay (in minutes)")

Let us try a different type of chart on the same data 

In [None]:
plt.figure(figsize=(14,7))

# Add title
plt.title("Average Arrival Delay for Each Airline, by Month")

# Heatmap showing average arrival delay for each airline by month
sns.heatmap(data=flight_data, annot=True)

# Add label for horizontal axis
plt.xlabel("Airline")

Let us move on to the next dataset and a new chart type.

In [None]:
insurance_data = pd.read_csv('insurance.csv')


In [None]:
insurance_data.head()


In [None]:
insurance_data.tail()

In [None]:
sns.scatterplot(x=insurance_data['bmi'], y=insurance_data['charges'])


In [None]:
sns.regplot(x=insurance_data['bmi'], y=insurance_data['charges'])


#### Color-coded scatter plots

In [None]:
sns.scatterplot(x=insurance_data['bmi'], y=insurance_data['charges'], hue=insurance_data['smoker'])

In [None]:
sns.lmplot(x="bmi", y="charges", hue="smoker", data=insurance_data)


In [None]:
sns.boxplot(x=insurance_data['smoker'],
              y=insurance_data['charges'])

In [None]:
sns.swarmplot(x=insurance_data['smoker'],
              y=insurance_data['charges'])

Inference
- On average, non-smokers are charged less than smokers.
- The customers who pay the most are smokers, whereas the customers who pay the least are non-smokers.
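
The same inference can be checked numerically with a grouped summary. The sketch below assumes the `smoker` and `charges` columns used in the plots. The rows are made-up stand-in values, not the contents of `insurance.csv`, and `sample` and `summary` are illustrative names.

```python
import pandas as pd

# Made-up stand-in rows; the notebook reads these columns from 'insurance.csv'.
sample = pd.DataFrame({
    "smoker": ["yes", "no", "no", "yes", "no", "yes"],
    "charges": [16884.92, 1725.55, 4449.46, 27808.73, 3866.86, 39611.76],
})

# Mean, minimum, and maximum charge for each group.
summary = sample.groupby("smoker")["charges"].agg(["mean", "min", "max"])
print(summary)

# Who pays the most and who pays the least?
highest_payer = sample.loc[sample["charges"].idxmax(), "smoker"]
lowest_payer = sample.loc[sample["charges"].idxmin(), "smoker"]
print(highest_payer, lowest_payer)
```

Running the same `groupby` on `insurance_data` would put numbers behind what the box plot and swarm plot suggest visually.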

In [None]:
# Let us load another dataset and chart 

In [None]:
# Read the file into a variable iris_data
iris_data = pd.read_csv('iris.csv', index_col="Id")

# Print the first 5 rows of the data
iris_data.head()


In [None]:
# Histogram (sns.distplot is deprecated; sns.histplot replaces it in seaborn 0.11+)
sns.histplot(data=iris_data['Petal Length (cm)'])

In [None]:
# KDE plot (fill=True replaces the deprecated shade=True)
sns.kdeplot(data=iris_data['Petal Length (cm)'], fill=True)

In [None]:
# 2D KDE plot
sns.jointplot(x=iris_data['Petal Length (cm)'], y=iris_data['Sepal Width (cm)'], kind="kde")

In [None]:


# Read the files into variables 
iris_set_data = pd.read_csv('iris_setosa.csv', index_col="Id")
iris_ver_data = pd.read_csv('iris_versicolor.csv', index_col="Id")
iris_vir_data = pd.read_csv('iris_virginica.csv', index_col="Id")

# Print the first 5 rows of the Iris versicolor data
iris_ver_data.head()

In [None]:
# Histograms for each species (sns.histplot replaces the deprecated sns.distplot)
sns.histplot(data=iris_set_data['Petal Length (cm)'], label="Iris-setosa")
sns.histplot(data=iris_ver_data['Petal Length (cm)'], label="Iris-versicolor")
sns.histplot(data=iris_vir_data['Petal Length (cm)'], label="Iris-virginica")

# Add title
plt.title("Histogram of Petal Lengths, by Species")

# Force legend to appear
plt.legend()

In [None]:
# KDE plots for each species (fill=True replaces the deprecated shade=True)
sns.kdeplot(data=iris_set_data['Petal Length (cm)'], label="Iris-setosa", fill=True)
sns.kdeplot(data=iris_ver_data['Petal Length (cm)'], label="Iris-versicolor", fill=True)
sns.kdeplot(data=iris_vir_data['Petal Length (cm)'], label="Iris-virginica", fill=True)

# Add title
plt.title("Distribution of Petal Lengths, by Species")

# Force legend to appear
plt.legend()

One interesting pattern in the plots is that the plants seem to fall into two groups: Iris versicolor and Iris virginica have similar values for petal length, while Iris setosa is in a category all by itself.

In fact, according to this dataset, we might even be able to classify any iris plant as Iris setosa (as opposed to Iris versicolor or Iris virginica) just by looking at the petal length: if the petal length of an iris flower is less than 2 cm, it's most likely to be Iris setosa!
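
The 2 cm rule of thumb can be written as a one-line classifier. This is only a sketch: the function name `looks_like_setosa`, the `Species` column, and the six rows below are assumptions made for illustration, not values read from `iris.csv`.

```python
import pandas as pd

def looks_like_setosa(petal_length_cm, threshold_cm=2.0):
    """Hypothetical rule: flag flowers whose petal length is under the threshold."""
    return petal_length_cm < threshold_cm

# Made-up stand-in measurements, two per species.
sample = pd.DataFrame({
    "Petal Length (cm)": [1.4, 1.3, 4.7, 4.5, 6.0, 5.1],
    "Species": ["Iris-setosa", "Iris-setosa",
                "Iris-versicolor", "Iris-versicolor",
                "Iris-virginica", "Iris-virginica"],
})

# Compare the rule's output with the recorded species.
predicted = looks_like_setosa(sample["Petal Length (cm)"])
actual = sample["Species"] == "Iris-setosa"
accuracy = (predicted == actual).mean()
print(f"Share of rows where the rule agrees with the label: {accuracy:.2f}")
```

Applying the same comparison to the full dataset would show how far a single threshold goes before a proper classifier is needed.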

Since it's not always easy to decide how to best tell the story behind your data, I have broken the chart types into three broad categories to help with this.

Trends - A trend is defined as a pattern of change.
- Line charts are best to show trends over a period of time, and multiple lines can be used to show trends in more than one group.


Relationship - There are many different chart types that you can use to understand relationships between variables in your data.
- Bar charts are useful for comparing quantities corresponding to different groups.
- Heatmaps can be used to find color-coded patterns in tables of numbers.
- Scatter plots show the relationship between two continuous variables; if color-coded, we can also show the relationship with a third categorical variable.
- Including a regression line in the scatter plot makes it easier to see any linear relationship between two variables.
- `sns.lmplot` is useful for drawing multiple regression lines when the scatter plot contains multiple color-coded groups.
- Categorical scatter plots show the relationship between a continuous variable and a categorical variable.


Distribution - We visualize distributions to show the possible values that we can expect to see in a variable, along with how likely they are.

- Histograms show the distribution of a single numerical variable.
- KDE plots (or 2D KDE plots) show an estimated, smooth distribution of a single numerical variable (or two numerical variables).
- `sns.jointplot` is useful for simultaneously displaying a 2D KDE plot with the corresponding KDE plots for each individual variable.
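
As a rough way to apply the summary above, the sketch below maps column types to a chart family. The helpers `column_kind` and `suggest_chart` are hypothetical and not part of pandas or seaborn, the sample rows are made up, and the rules are a simplification of the guidance rather than a complete decision procedure.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_numeric_dtype

def column_kind(series):
    """Classify a column as 'time', 'numeric', or 'categorical'."""
    if is_datetime64_any_dtype(series):
        return "time"
    if is_numeric_dtype(series):
        return "numeric"
    return "categorical"

def suggest_chart(df, x, y=None):
    """Hypothetical helper: suggest a chart family from the column types."""
    x_kind = column_kind(df[x])
    if y is None:
        # A single numerical variable: look at its distribution.
        return "histogram or KDE plot" if x_kind == "numeric" else "no suggestion"
    kinds = (x_kind, column_kind(df[y]))
    if kinds == ("time", "numeric"):
        return "line chart"  # trend over time
    if kinds == ("numeric", "numeric"):
        return "scatter plot, optionally with a regression line"
    if sorted(kinds) == ["categorical", "numeric"]:
        return "bar chart or categorical scatter plot"
    return "no suggestion"

# Made-up rows reusing column names from the notebook's datasets.
sample = pd.DataFrame({
    "Date": pd.to_datetime(["2017-01-06", "2017-01-07", "2017-01-08"]),
    "streams": [12287078, 13190270, 13099919],
    "bmi": [27.9, 33.8, 33.0],
    "charges": [16884.92, 1725.55, 4449.46],
    "smoker": ["yes", "no", "no"],
})

print(suggest_chart(sample, "Date", "streams"))
print(suggest_chart(sample, "bmi", "charges"))
print(suggest_chart(sample, "smoker", "charges"))
print(suggest_chart(sample, "bmi"))
```

The suggestion is only a starting point: the audience and the story you want to tell still decide which chart works best.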