## Data Visualization

Data visualization is the graphical representation of information and data. By using visual elements like charts and graphs, data visualization libraries provide an accessible way to see and understand trends, outliers, and patterns in data.

### What is Seaborn
Seaborn is a Python data visualization library built on top of matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

### Importing Necessary Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from warnings import filterwarnings
filterwarnings('ignore')
# With the inline backend, plots appear directly below the cell in which the plot function was called.
%matplotlib inline

### List the datasets that are readily available in the seaborn library

In [None]:
sns.get_dataset_names()

### Load a dataset online using the seaborn library

In [None]:
tips = sns.load_dataset('tips')
tips.head()

In [None]:
tips.shape

In [None]:
df=sns.load_dataset('flights')

In [None]:
df = sns.load_dataset('titanic')

In [None]:
df.head()

In [None]:
import pandas as pd
pd.get_dummies(df['sex']).astype('int')

In [None]:
df=sns.load_dataset('iris')
df.head()

In [None]:
df.species.unique()

In [None]:
df=sns.load_dataset('tips')
df.head()

In [None]:
df.dtypes

In [None]:
cat_cols=df.describe(include='category').columns.to_list()

In [None]:
cat_cols.append('size')

In [None]:
cat_cols

In [None]:
for i in cat_cols:
    print(df[i].value_counts())
    print(df[i].nunique())
    print('------------')

## Uni-Variate Analysis

Univariate analysis is the study of a single variable, without considering any relationship with other variables. It is a fundamental method in exploratory data analysis to understand the distribution, central tendency, dispersion, and other descriptive statistics of a variable.

### Dist Plot
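A distplot in seaborn is useful for visualizing the distribution of a univariate dataset by showing a histogram and a kernel density estimate (KDE) curve. It is commonly used to identify the shape of the distribution, potential outliers, and skewness in the data. `distplot` has been deprecated since seaborn 0.11; `histplot` with `kde=True` draws the same kind of plot.

In [None]:
sns.histplot(tips.total_bill, kde=True, stat='density')    # replaces the deprecated sns.distplot(tips.total_bill)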
plt.show()
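
As a rough sketch of what this plot combines, the cell below computes a density-normalized histogram and a Gaussian KDE by hand with NumPy. The sample values, the `gaussian_kde_1d` helper, and the Scott's-rule bandwidth are assumptions made for illustration; seaborn's own computation may differ in detail.

```python
import numpy as np

# Hypothetical sample standing in for a column such as tips.total_bill.
rng = np.random.default_rng(42)
values = rng.normal(loc=20.0, scale=8.0, size=500)

# Histogram normalized so that the bar areas sum to 1.
density, edges = np.histogram(values, bins=10, density=True)
hist_area = float((density * np.diff(edges)).sum())


def gaussian_kde_1d(sample, grid):
    """Average of Gaussian bumps centred on each observation."""
    n = sample.size
    bandwidth = sample.std(ddof=1) * n ** (-1 / 5)  # Scott's rule (assumed choice)
    z = (grid[:, None] - sample[None, :]) / bandwidth
    return np.exp(-0.5 * z**2).sum(axis=1) / (n * bandwidth * np.sqrt(2 * np.pi))


# Evaluate the KDE on a grid that extends well past the data range.
grid = np.linspace(values.min() - 30, values.max() + 30, 2000)
kde = gaussian_kde_1d(values, grid)
kde_area = float(np.sum(kde[:-1] * np.diff(grid)))  # Riemann sum under the curve

print(round(hist_area, 3), round(kde_area, 3))
```

Both areas come out close to 1, which is why the histogram and the KDE curve can share one density axis.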

### Plotting a Histogram

In [None]:
plt.hist(tips.total_bill,bins = 10)

In [None]:
tips.total_bill.sort_values(ignore_index=True)

#### 5 number summary
The five-number summary is the minimum, first quartile, median, third quartile, and maximum.

In [None]:
tips.total_bill.describe()
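
To make the five numbers concrete, here is a minimal sketch that computes them with the standard library. The `bills` values and the `five_number_summary` helper are made up for illustration; with `method='inclusive'` the quartiles use linear interpolation, which should agree with the default used by `describe()`.

```python
from statistics import quantiles

# Hypothetical bill amounts (not taken from the tips dataset).
bills = [10.0, 12.5, 15.0, 17.5, 20.0, 22.5, 25.0, 30.0, 50.0]


def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using linear interpolation."""
    ordered = sorted(data)
    q1, median, q3 = quantiles(ordered, n=4, method='inclusive')
    return ordered[0], q1, median, q3, ordered[-1]


summary = five_number_summary(bills)
print(summary)  # (10.0, 15.0, 20.0, 25.0, 50.0)
```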

![Box-Plot-and-Whisker-Plot-2.png](attachment:Box-Plot-and-Whisker-Plot-2.png)

In [None]:
sns.boxplot(x= tips.total_bill)
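
A box plot draws the box from Q1 to Q3 and, with the default setting of 1.5 times the interquartile range (IQR), marks points beyond the whiskers as outliers. The sketch below reproduces that rule by hand on made-up values; the variable names are illustrative only.

```python
from statistics import quantiles

# Hypothetical bill amounts (not taken from the tips dataset).
bills = [10.0, 12.5, 15.0, 17.5, 20.0, 22.5, 25.0, 30.0, 50.0]

q1, _, q3 = quantiles(bills, n=4, method='inclusive')
iqr = q3 - q1                      # height of the box
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme observations inside the fences.
whisker_low = min(b for b in bills if b >= lower_fence)
whisker_high = max(b for b in bills if b <= upper_fence)

# Anything outside the fences is drawn as an individual outlier point.
outliers = [b for b in bills if b < lower_fence or b > upper_fence]
print(whisker_low, whisker_high, outliers)  # 10.0 30.0 [50.0]
```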

### Heat map

A heatmap displays a matrix of values as a color-coded grid. It is commonly used to visualize the correlation between variables in a dataset and to identify patterns or relationships between them, especially in large datasets. It can also show where missing values sit in a table.

In [None]:
titanic = sns.load_dataset('titanic')
titanic.isnull().sum()

In [None]:
titanic.shape

In [None]:
sns.heatmap(titanic.isnull())

In [None]:
sns.heatmap(tips.isnull())

In [None]:
tips[tips.describe().columns.to_list()].corr()

In [None]:
sns.heatmap(tips[tips.describe().columns.to_list()].corr(),annot = True)

### Bi-variate Analysis (Scatter plot)

A scatter plot is a graphical representation of two numerical variables in a dataset, where each point represents an observation. It is useful for visualizing the relationship between two variables, identifying patterns or trends, and detecting outliers or clusters in the data.

In [None]:
# See how total_bill and tip are related to each other

sns.scatterplot(x='total_bill',y = 'tip',data = tips)

In [None]:
# See how total_bill and tip are related to each other.
# By default, relplot draws a scatter plot. It returns a FacetGrid, which can hold several subplots in one figure.
sns.relplot(x='total_bill',y='tip',data = tips)

You can see that as the total bill increases, the tip also tends to increase: there is a positive linear relationship.

In [None]:
tips[['total_bill','tip']].corr()

In [None]:
np.corrcoef(tips.total_bill,tips.tip)

You can see that the correlation coefficient is about 0.68, which indicates a moderately strong positive correlation.
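
The number returned by `corr()` and `np.corrcoef` is the Pearson correlation coefficient. A minimal sketch of the formula follows, checked against NumPy; the `bill` and `tip` arrays and the `pearson_r` helper are made up for illustration.

```python
import numpy as np

# Hypothetical paired observations (not taken from the tips dataset).
bill = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
tip = np.array([1.0, 3.0, 4.0, 7.0, 8.0])


def pearson_r(x, y):
    """Covariance of x and y divided by the product of their spreads."""
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum()))


r = pearson_r(bill, tip)
r_numpy = float(np.corrcoef(bill, tip)[0, 1])
print(round(r, 4), round(r_numpy, 4))
```

The coefficient always lies between -1 and 1; values near 1 mean the points fall close to an upward-sloping line.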

![image.png](attachment:image.png)

### Multivariate Analysis (Scatter plot with hue)

Multivariate analysis is a statistical method that involves analyzing data with multiple variables or factors to identify patterns, relationships, and dependencies between them. It is used to explore complex relationships and to identify variables that have the greatest impact on the outcome.

In [None]:
tips.columns

In [None]:
sns.relplot(x='total_bill',y='tip',data=tips,hue='sex')

The figure suggests that male customers gave the highest tips and ran up the highest total bills compared with female customers. We can also observe that there are somewhat more male customers than female customers.

In [None]:
tips.groupby('sex')[['tip','total_bill']].describe()

#### Compare total bill and tip by sex and smoker

In [None]:
sns.relplot(x='total_bill',y='tip',hue='sex',style='smoker',data=tips)

The figure suggests that the highest tips, and more of the tips overall, come from male customers, many of whom are smokers, compared with female non-smokers.

In [None]:
tips.columns

In [None]:
tips.sex.value_counts(normalize=True)*100

In [None]:
tips.groupby(['smoker','sex'])[['total_bill','tip']].describe()

In [None]:
tips.groupby('smoker')['total_bill'].agg(['sum','mean','count','min','max'])

#### Compare total bill and tip by time

In [None]:
sns.relplot(x='total_bill',y='tip',hue='time',data=tips)
plt.show()

The figure shows that most of the tips, and the highest bills, come at dinner time rather than at lunch time.

In [None]:
sns.relplot(x='total_bill',y='tip',hue='time',style='sex',data=tips)
plt.show()

Male customers contribute the highest tip and the largest total bill of all customers, and they do so at dinner time.

In [None]:
sns.relplot(x='total_bill',y='tip',hue='time',style='sex',size='smoker',data=tips)
plt.show()

The figure shows that, at dinner, male smokers gave the highest tips and also ran up the largest total bills compared with the other groups.

In [None]:
tips.groupby(['time','smoker','sex'])[['tip','total_bill']].describe()

### Automating the process

In [None]:
cat_col = ['sex','smoker','day','time','size']
for i in cat_col:
    sns.relplot(x='total_bill',y='tip',hue=i,data=tips)

### Plotting subplots

In [None]:
sns.relplot(x='total_bill',y='tip',hue='smoker',col='time',data=tips)    # col= draws one subplot per category, arranged in columns
plt.show()

In [None]:
sns.relplot(x='total_bill',y='tip',hue='smoker',row='size',data=tips)    # row= draws one subplot per category, arranged in rows
plt.show()

In [None]:
tips.head()

In [None]:
sns.relplot(x='total_bill',y='tip',hue='day',row='size',col='smoker',style='sex',data=tips)

In [None]:
# Or use col_wrap to wrap the columns onto several rows

sns.relplot(x='total_bill',y='tip',hue='day',col='size',style='sex',data=tips,size='smoker',col_wrap =3)

### Having covered scatter plots, we now turn to line plots

### Line Plot:
A line plot is commonly used to show trends or changes in data over time.

In [None]:
Date = np.random.randint(1,31,size=350)
price = np.random.randint(2000,4500,size=350)

In [None]:
df = pd.DataFrame({'Date':Date,'Price':price})
df.head()

In [None]:
df.shape

In [None]:
sns.relplot(x='Date',y='Price',data=df,kind='line')

In [None]:
# By default the data is sorted by x and y before the line is drawn; pass sort=False to turn the sorting off
sns.relplot(x='Date',y='Price',data=df,kind='line',sort=False)

### If the data is not sorted, the line jumps back and forth and is hard to interpret, so seaborn sorts the data by default
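
As a rough sketch of what the default line plot works with, the cell below sorts by the x column and averages the rows that share the same x value (the default estimator is the mean). The `prices` frame mirrors the random `Date`/`Price` data above but uses a seeded generator, which is an assumption added here for reproducibility.

```python
import numpy as np
import pandas as pd

# Random data shaped like the Date/Price frame above (seed added for reproducibility).
rng = np.random.default_rng(0)
prices = pd.DataFrame({
    'Date': rng.integers(1, 31, size=350),
    'Price': rng.integers(2000, 4500, size=350),
})

# Sort by x, then collapse repeated x values to their mean.
line = prices.sort_values('Date').groupby('Date')['Price'].mean()
print(line.head())
```

With 350 rows spread over at most 30 distinct dates, each point on the line summarizes several observations.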

### To understand line plots better, we need a time variable on the x-axis so that we can follow the trend over a period

In [None]:
sns.get_dataset_names()

In [None]:
# Load the fmri dataset using seaborn

fmri = sns.load_dataset('fmri')
fmri.head()

### Functional magnetic resonance imaging or functional MRI (fMRI) measures brain activity by detecting changes associated with blood flow.

In [None]:
# Draw a plain line plot
sns.relplot(x='timepoint',y='signal',kind='line',data=fmri)

### The line comes with a shaded band: the 95% confidence interval. At each timepoint seaborn averages the repeated measurements and draws the mean as the line; the band is the range in which that mean is estimated to lie with 95% confidence (it reflects uncertainty in the mean, not the standard deviation of the data)
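
A 95% confidence interval for a mean can be estimated by bootstrapping: resample the observations with replacement many times and take percentiles of the resampled means. The sketch below illustrates the idea on made-up data; the `bootstrap_ci` helper, the sample, and the seeds are assumptions, not seaborn's internal code.

```python
import numpy as np

# Hypothetical repeated measurements at a single timepoint.
data_rng = np.random.default_rng(0)
signal = data_rng.normal(loc=0.5, scale=1.0, size=200)


def bootstrap_ci(values, n_boot=1000, level=95, seed=0):
    """Percentile bootstrap interval for the mean of `values`."""
    boot_rng = np.random.default_rng(seed)
    means = [
        boot_rng.choice(values, size=len(values), replace=True).mean()
        for _ in range(n_boot)
    ]
    tail = (100 - level) / 2
    low, high = np.percentile(means, [tail, 100 - tail])
    return float(low), float(high)


low, high = bootstrap_ci(signal)
print(round(low, 3), round(float(signal.mean()), 3), round(high, 3))
```

The interval narrows as the number of observations grows, which is why the band is wider where the measurements disagree more.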

In [None]:
# Turn off the confidence interval

sns.relplot(x='timepoint',y='signal',kind='line',data=fmri,errorbar=None)    # seaborn >= 0.12; on older versions pass ci=None instead

In [None]:
# Plot the raw data by turning off both the confidence interval and the estimator

sns.relplot(x='timepoint',y='signal',estimator=None,kind='line',data=fmri,errorbar=None)    # seaborn >= 0.12; on older versions pass ci=None instead

### The raw data is hard to read on its own (we cannot gather much information from it)

- hue = changes the color based on a category
- size = changes the size based on a category
- style = changes the line or marker style based on a category

In [None]:
fmri.head()

In [None]:
sns.relplot(x='timepoint',y='signal',hue='event',kind='line',data=fmri)

In [None]:
sns.relplot(x='timepoint',y='signal',hue='event',style='region',kind='line',data=fmri)

### Add markers to make the plot easier to read

In [None]:
sns.relplot(x='timepoint',y='signal',hue='event',style='region',kind='line',markers=True,data=fmri)

In [None]:
sns.relplot(x='timepoint',y='signal',hue='subject',kind='line',errorbar=None,col='region',row='event',data=fmri)    # seaborn >= 0.12; on older versions pass ci=None instead

## 2. Categorical Data Plotting

- countplot()
- catplot()
- boxplot()
- stripplot()
- swarmplot()


### Catplot
By default, catplot draws a categorical scatter plot (a strip plot) of one numerical variable against one or more categorical variables.

This function provides access to several axes-level functions that show the relationship between a numerical and one or more categorical variables using one of several visual representations.

In [None]:
tips.head()

In [None]:
sns.catplot(x='day',y='tip',data=tips)      # x=categorical variable,y=numerical variable

### Jitter is a small random horizontal offset added to each point so that points in the same category overlap less. To turn it off, pass jitter=False

In [None]:
sns.catplot(x='day',y='tip',jitter=False,data=tips)    # Without jitter, all points of a category fall on one vertical line, so they overlap more
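
A minimal sketch of the jitter idea: each point keeps its value, and only its position along the categorical axis is nudged by a small random amount. The category positions and the `width` value are made up for illustration and are not seaborn's actual defaults.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical category positions on the x-axis (0, 1, 2 stand for three days).
days = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])
width = 0.1  # assumed jitter half-width

# Add a small uniform offset to every point.
jittered = days + rng.uniform(-width, width, size=days.size)

print(np.unique(days).size)      # 3 distinct x positions without jitter
print(np.unique(jittered).size)  # one position per point with jitter
```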

### Swarm Plot
A swarm plot in seaborn draws a categorical scatter plot with non-overlapping points.

In [None]:
sns.catplot(x='day',y='tip',kind='swarm',hue='sex',data=tips)

### Boxplot
A box plot shows the distribution of a numerical variable within each category, using summary statistics such as the minimum, the maximum, and the quartiles.

In [None]:
sns.catplot(x='day',y='tip',data=tips,kind='box')

In [None]:
sns.catplot(x='day',y='tip',kind = 'box',hue='time',data=tips)

In [None]:
# Use tips here: df was reassigned to the random Date/Price frame above
tips.groupby(['time','day'])['tip'].size().reset_index().rename(columns={'tip':'No of customers'})

In [None]:
sns.catplot(x='day',y='total_bill',hue='sex',data=tips,kind='box',dodge=False)  # dodge=False draws the hue levels at the same position instead of side by side

### Violin Plot

#### A violin plot is a combination of a box plot and a kernel density plot, which shows peaks in the data. It is used to visualize the distribution of numerical data. Unlike a box plot that can only show summary statistics, violin plots depict summary statistics and the density of each variable

In [None]:
sns.catplot(x='total_bill',y='day',hue='sex',data=tips,kind='violin')

In [None]:
sns.catplot(x='total_bill',y='day',hue='sex',data=tips,kind='violin',split=True)

### Load the Titanic dataset

In [None]:
titanic = sns.load_dataset('titanic')
titanic.head()

### Compare one categorical feature with another


In [None]:
sns.countplot(x='sex',data=titanic)

In [None]:
sns.catplot(x='sex',y='survived',data=titanic,kind='bar')

In [None]:
titanic.groupby('sex')['survived'].value_counts(normalize=True)*100

### From the graph and the table above, about 74% of female passengers survived and about 19% of male passengers survived
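
The bar height in a bar plot of `survived` is the mean of a 0/1 column, which is the survival rate of each group. A small sketch on a made-up frame (the `passengers` data is hypothetical and is not the Titanic dataset):

```python
import pandas as pd

# Hypothetical passengers: 1 means survived, 0 means did not survive.
passengers = pd.DataFrame({
    'sex': ['female', 'female', 'female', 'female',
            'male', 'male', 'male', 'male', 'male'],
    'survived': [1, 1, 1, 0, 0, 0, 0, 1, 0],
})

# The mean of a 0/1 column is the share of ones, here expressed in percent.
rates = passengers.groupby('sex')['survived'].mean() * 100
print(rates)
```

Here 3 of 4 women and 1 of 5 men survive, so the rates come out at 75% and 20%.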

In [None]:
sns.catplot(x='sex',y='survived',hue='class',kind='bar',data=titanic)    # The line on each bar is the 95% confidence interval of the estimated survival rate

In [None]:
titanic.groupby(['sex','class'])['survived'].value_counts(normalize=True)*100    # group keys must be passed as a list

From the graph above, about 96.8% of female passengers in first class survived, about 92% in second class, and 50% in third class. Among male passengers, about 37% survived in first class, about 16% in second class, and about 14% in third class.

In other words, almost all female passengers with a first-class ticket survived, a higher ticket class goes with a higher probability of survival, and the survival rate of women is higher than that of men.

In [None]:
sns.catplot(x='deck',kind='count',data=titanic,hue='class')

#### You can see that decks A, B and C belong to first class, so passengers travelling on those decks likely had a higher probability of survival

### Point plot

Pointplot is used to display point estimates and confidence intervals of categorical variables. It is useful for comparing different categories of data and identifying trends or differences between them.

In [None]:
sns.catplot(x='sex',y='survived',kind='point',hue='class',data=titanic)

### The graph above shows that female passengers had a higher chance of survival than male passengers

### Joint plot

Jointplot in seaborn is used to visualize the relationship between two variables by plotting their joint distribution and marginal distributions. It helps in understanding the correlation and distribution of variables in a dataset.

In [None]:
# Use a joint plot when you want both the scatter plot and the distribution of each variable
sns.jointplot(x='total_bill',y='tip',data=tips,marginal_ticks =True)

### Pairplot

### Pairplot draws a histogram on each diagonal cell and a scatter plot on each off-diagonal cell

In [None]:
sns.pairplot(tips)
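
To picture the layout, the sketch below lists which kind of plot lands in each cell of the grid for the three numeric columns of `tips`. The `grid` dictionary is only an illustration of the layout, not something seaborn exposes.

```python
from itertools import product

# Numeric columns of the tips dataset, which are the ones a pair plot uses.
numeric_cols = ['total_bill', 'tip', 'size']

# Diagonal cells pair a column with itself; every other cell pairs two columns.
grid = {
    (row, col): ('histogram' if row == col else 'scatter')
    for row, col in product(numeric_cols, repeat=2)
}

for cell, kind in grid.items():
    print(cell, kind)
```

Three columns give a 3 x 3 grid: 3 histograms on the diagonal and 6 scatter plots around them.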

### 4. Linear Regression and Relationship

In [None]:
tips.head()

In [None]:
sns.regplot(x='total_bill',y='tip',data=tips)        # regplot (axes-level) and lmplot (figure-level) draw almost the same plot

In [None]:
sns.lmplot(x='total_bill',y='tip',data=tips)     # The same fit drawn with lmplot, which also supports hue, col and row
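
The straight line in these plots is a linear regression fit. As a minimal sketch, an ordinary least-squares line can be computed with `np.polyfit`; the `bill` and `tip` arrays are made up for illustration and do not come from the tips dataset.

```python
import numpy as np

# Hypothetical paired observations.
bill = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
tip = np.array([1.0, 3.0, 4.0, 7.0, 8.0])

# Degree-1 polynomial fit: tip is approximated by slope * bill + intercept.
slope, intercept = np.polyfit(bill, tip, deg=1)
predicted = slope * bill + intercept

print(round(float(slope), 2), round(float(intercept), 2))  # 0.18 -0.8
```

The slope reads as the expected change in tip per unit of total bill under this linear model.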

In [None]:
sns.lmplot(x='total_bill',y='tip',hue='sex',data=tips)

In [None]:
sns.lmplot(x='total_bill',y='tip',hue='sex',col='time',row='smoker',data=tips)