## Introduction to Data Visualization and Dashboards - Data Delivery

### Benefits of Data Visualization
- Fast Comprehension of Information
- Correlation of Relationships
   - ` Patterns, Trends over time, Frequency, Risk and Rewards `

- **2D Visualizations**
- **Dashboards** - Real-time Visualization
- **Data Storytelling** - Check the slides and video from the last webinar

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("processed_mv.csv")

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31252 entries, 0 to 31251
Data columns (total 18 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 31252 non-null  int64  
 1   title              31252 non-null  object 
 2   original_language  31252 non-null  object 
 3   overview           31252 non-null  object 
 4   tagline            12676 non-null  object 
 5   release_date       31252 non-null  object 
 6   popularity         31252 non-null  int64  
 7   vote_count         31252 non-null  int64  
 8   vote_average       31252 non-null  float64
 9   budget             31252 non-null  int64  
 10  revenue            31252 non-null  int64  
 11  runtime            31252 non-null  int64  
 12  status             31252 non-null  object 
 13  adult              31252 non-null  bool   
 14  genre_names        31252 non-null  object 
 15  popularity_rank    31251 non-null  object 
 16  year               312

In [4]:
df.head()

Unnamed: 0,id,title,original_language,overview,tagline,release_date,popularity,vote_count,vote_average,budget,revenue,runtime,status,adult,genre_names,popularity_rank,year,profit
0,760161,Orphan: First Kill,en,After escaping from an Estonian psychiatric fa...,There's always been something wrong with Esther.,2022-07-27,5089,902,6.9,0,9572765,99,Released,False,"Horror, Thriller",Very Popular,2022,0
1,760741,Beast,en,A recently widowed man and his two teenage dau...,Fight for family.,2022-08-11,2172,584,7.1,0,56000000,93,Released,False,"Adventure, Drama, Horror",Very Popular,2022,0
2,882598,Smile,en,"After witnessing a bizarre, traumatic incident...","Once you see it, it’s too late.",2022-09-23,1864,114,6.8,17000000,45000000,115,Released,False,"Horror, Mystery, Thriller",Very Popular,2022,28000000
3,756999,The Black Phone,en,"Finney Blake, a shy but clever 13-year-old boy...",Never talk to strangers.,2022-06-22,1071,2736,7.9,18800000,161000000,103,Released,False,"Horror, Thriller",Very Popular,2022,142200000
4,772450,Presences,es,A man who loses his wife and goes to seclude h...,,2022-09-07,1021,83,7.0,0,0,0,Released,False,Horror,Very Popular,2022,0


In [5]:
df['release_date'] = pd.to_datetime(df['release_date'])
df['month'] = df['release_date'].apply(lambda x:x.month)

In [6]:
df.head()

Unnamed: 0,id,title,original_language,overview,tagline,release_date,popularity,vote_count,vote_average,budget,revenue,runtime,status,adult,genre_names,popularity_rank,year,profit,month
0,760161,Orphan: First Kill,en,After escaping from an Estonian psychiatric fa...,There's always been something wrong with Esther.,2022-07-27,5089,902,6.9,0,9572765,99,Released,False,"Horror, Thriller",Very Popular,2022,0,7
1,760741,Beast,en,A recently widowed man and his two teenage dau...,Fight for family.,2022-08-11,2172,584,7.1,0,56000000,93,Released,False,"Adventure, Drama, Horror",Very Popular,2022,0,8
2,882598,Smile,en,"After witnessing a bizarre, traumatic incident...","Once you see it, it’s too late.",2022-09-23,1864,114,6.8,17000000,45000000,115,Released,False,"Horror, Mystery, Thriller",Very Popular,2022,28000000,9
3,756999,The Black Phone,en,"Finney Blake, a shy but clever 13-year-old boy...",Never talk to strangers.,2022-06-22,1071,2736,7.9,18800000,161000000,103,Released,False,"Horror, Thriller",Very Popular,2022,142200000,6
4,772450,Presences,es,A man who loses his wife and goes to seclude h...,,2022-09-07,1021,83,7.0,0,0,0,Released,False,Horror,Very Popular,2022,0,9
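
The month column above is built with `apply` and a lambda. As a hedged alternative, pandas also exposes a vectorised `.dt` accessor on datetime columns. The sketch below uses a tiny made-up frame (`toy` is a hypothetical name, not part of the notebook) to show the idea.

```python
import pandas as pd

# Toy frame standing in for the movie data (dates chosen for illustration)
toy = pd.DataFrame({"release_date": ["2022-07-27", "2022-08-11", "2021-12-25"]})
toy["release_date"] = pd.to_datetime(toy["release_date"])

# The .dt accessor works on the whole column at once, no lambda needed
toy["month"] = toy["release_date"].dt.month
toy["year"] = toy["release_date"].dt.year
toy["weekday"] = toy["release_date"].dt.day_name()
print(toy)
```

On a column of this size either approach is fine; the accessor mostly pays off in readability and on larger frames.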


#### Examples of what we want to do!

**Matplotlib**

In [None]:
import matplotlib.pyplot as plt
import numpy as np

In [None]:
mnt = df['month'].value_counts().sort_index().to_dict()  # releases per month, ordered by month number
mnt

In [None]:
mnt_num = list(mnt.keys())
release_count = list(mnt.values())

In [None]:
fig = plt.figure(figsize=(9, 5))
plt.barh(mnt_num, release_count, color='maroon')
plt.yticks(np.arange(1, 13, step=1))
plt.xlabel("Number of Movies Released")
plt.grid(axis='x')
plt.ylabel("Month Number")
plt.title("Movies Released per Month")
plt.show()

**Seaborn**

In [None]:
import seaborn as sns

In [None]:
# Build the two-column frame directly from the month numbers and their counts
sns_df = pd.DataFrame({'Month': mnt_num, 'Release': release_count})
sns_df

In [None]:
sns.set_style("darkgrid", {"grid.color": ".2", "grid.linestyle": ":"})
# Built-in seaborn palettes: deep, muted, pastel, dark, bright, colorblind
s = sns.barplot(x = "Release", y = "Month", data = sns_df, orient = 'h', color="maroon")
s.set_ylabel("Month Number")
s.set_title("Movies Released per Month")

**Plotly**

In [None]:
import plotly.express as px

In [None]:
fig = px.bar(sns_df, x = "Release", y = "Month", orientation = 'h',
             title="Movies Released per Month").update_layout(
    xaxis_title="Number of Movies Released", yaxis_title="Month Number"
)
fig.show()

### Why should I learn Matplotlib?
**Excerpts from Seaborn's Documentation**

` While you can be productive using only seaborn functions, full customization of your graphics will require some knowledge of matplotlib’s concepts and API. One aspect of the learning curve for new users of seaborn will be knowing when dropping down to the matplotlib layer is necessary to achieve a particular customization. On the other hand, users coming from matplotlib will find that much of their knowledge transfers.`

### Matplotlib


- Matplotlib is the foundational plotting library in Python
- Seaborn is built on top of it
- Interactive libraries such as Plotly are commonly used for dashboards


In [None]:
import matplotlib.pyplot as plt

- **Figure**: The overall container for the plot. A figure can contain multiple subplots or axes.

- **Axes**: The area where the data is plotted. An Axes can contain one or more plots, such as lines, bars, or scatter points.

- **Title**: The title of the plot, usually located at the top of the plot.
- **X-axis and Y-axis:** The horizontal and vertical axes, respectively, that define the coordinate system for the data.

- **X-axis Label and Y-axis Label**: The labels for the X-axis and Y-axis, respectively, that describe the data being plotted.

- **Ticks**: The marks on the X-axis and Y-axis that represent the values of the data.

- **Tick Labels**: The labels for the ticks that display the values associated with the data.

- **Legends**: A key that explains the meaning of different markers or colors in the plot.

- **Gridlines**: The lines that divide the plot into smaller units and make it easier to read the data.


- Matplotlib graphs your data on **Figures**, each of which can **contain one or more Axes**, an area where points can be specified in terms of x-y coordinates (or theta-r in a polar plot, x-y-z in a 3D plot, etc.).
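
To connect the vocabulary above to actual calls, here is a minimal sketch that touches each part once. The data points and label strings are made up for illustration, and the `Agg` backend line is only an assumption for running outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; not needed inside Jupyter
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 4))        # Figure and a single Axes
ax.plot([1, 2, 3], [2, 4, 3], label="demo")   # a plot drawn on the Axes
ax.set_title("Title")                         # title
ax.set_xlabel("X-axis label")                 # axis labels
ax.set_ylabel("Y-axis label")
ax.set_xticks([1, 2, 3])                      # tick positions
ax.set_xticklabels(["one", "two", "three"])   # tick labels
ax.grid(True)                                 # gridlines
ax.legend()                                   # legend built from the label above

# The Figure keeps track of every Axes it contains
print(len(fig.axes))
```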

In [None]:
fig = plt.figure(figsize=(11,7), facecolor='lightskyblue', layout='constrained')
fig.suptitle('Figure')
ax = fig.add_subplot()
ax.set_title('Axes', loc='left', fontstyle='oblique', fontsize='large')
plt.show()

!['Figure Anatomy'](matplt.png)

` generating dummy data for explanation `

In [None]:
np.cumsum([2,34,66,7,90])

In [None]:
np.random.randn(200)

In [None]:
np.random.seed(19680801)
t = np.arange(200)
x = np.cumsum(np.random.randn(200))
y = np.cumsum(np.random.randn(200))

In [None]:
y
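
A quick check of what the dummy-data cells rely on: `np.cumsum` turns a list of steps into running totals (which is how the random walks are built), and re-seeding the generator reproduces the same draws. This is only an illustrative sketch; the variable names are hypothetical.

```python
import numpy as np

steps = np.array([2, 34, 66, 7, 90])
running_total = np.cumsum(steps)        # running totals: 2, 36, 102, 109, 199
print(running_total)

# Differencing the running total recovers the original steps
recovered = np.diff(running_total, prepend=0)
print(recovered)

# Same seed, same "random" numbers: this is why the plots are reproducible
np.random.seed(19680801)
first_draw = np.random.randn(5)
np.random.seed(19680801)
second_draw = np.random.randn(5)
print(np.array_equal(first_draw, second_draw))
```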

**Two Plots on Different Axes**

In [None]:
fig, axs = plt.subplots(ncols=2, nrows=2, figsize=(12, 6), layout="constrained")

axs[1,0].plot(t, x)
axs[1,1].plot(t, y)

# You can address each subplot directly
axs[1,1].set_xlabel("number")
axs[1,0].set_xlabel("my number")
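
With more than one row and column, `plt.subplots` returns the Axes as a 2-D NumPy array, which is why both `axs[1][0]` and `axs[1, 0]` work. The sketch below, with made-up data and panel titles, shows the shape and one way to loop over every panel; the `Agg` backend line is an assumption for running outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; not needed inside Jupyter
import matplotlib.pyplot as plt
import numpy as np

fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(8, 5))
print(type(axs), axs.shape)   # a NumPy array of Axes with shape (2, 2)

t = np.arange(50)
# axs.flat walks the grid row by row: (0,0), (0,1), (1,0), (1,1)
for i, ax in enumerate(axs.flat):
    ax.plot(t, t * (i + 1))
    ax.set_title(f"panel {i}")
```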

**Two Plots on a Single Axes**

In [None]:
fig, ax = plt.subplots(figsize=(5, 3), layout='constrained')

linesx = ax.plot(t, x, label='Random walk x', color='r')
linesy = ax.plot(t, y, label='Random walk y')

ax.set_xlabel('Time [s]')
ax.set_ylabel('Distance [km]')
ax.set_title('Random walk example')
ax.legend(loc='lower right')

**Axes limits, scales, and ticking**

In [None]:
fig, ax = plt.subplots(figsize=(12, 6.5), layout='constrained')

np.random.seed(19680801)
t = np.arange(200)
x = 2**np.cumsum(np.random.randn(200))


linesx = ax.plot(t, x)
ax.set_yscale('log')
ax.set_xlim([10, 180])

ax.annotate('Minimum', xy=(140, 0.1), xytext=(70, 103),
            arrowprops=dict(facecolor='green'))
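
The annotation above points at a hard-coded position. As a hedged variation, the minimum inside the visible x-range can be located with NumPy and the ticks set explicitly. The tick spacing and text placement below are arbitrary choices, and the `Agg` backend line is an assumption for running outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; not needed inside Jupyter
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(19680801)
t = np.arange(200)
x = 2 ** np.cumsum(np.random.randn(200))

# Search for the minimum only inside the visible window [10, 180]
visible = (t >= 10) & (t <= 180)
i_min = np.flatnonzero(visible)[np.argmin(x[visible])]

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(t, x)
ax.set_yscale("log")                      # scale
ax.set_xlim(10, 180)                      # limits
ax.set_xticks(np.arange(20, 181, 40))     # ticks at 20, 60, 100, 140, 180

# Place the text relative to the Axes so it stays inside the frame
ax.annotate("Minimum", xy=(t[i_min], x[i_min]),
            xytext=(0.5, 0.9), textcoords="axes fraction",
            arrowprops=dict(facecolor="green"))
```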

**Axes Layout**

In [None]:
fig, axs = plt.subplots(ncols=2, figsize=(10, 4.5), layout='constrained')

np.random.seed(19680801)
t = np.arange(200)
x = np.cumsum(np.random.randn(200))


axs[0].plot(t, x)
axs[0].set_title('aspect="auto"')

axs[1].plot(t, x)
axs[1].set_aspect(2)
axs[1].set_title('aspect=2')

In [None]:
df

In [None]:
df['vote_average'][:200].values

In [None]:
fig, ax1 = plt.subplots(figsize=(10, 4.5), layout='constrained')
x = df['vote_average'][:200].values
z = df['runtime'][:200].values

ax1.plot(t, x, label='Vote Average', color='blue')
ax1.set_ylabel("Vote Average", color='blue')
ax1.tick_params(axis='y', labelcolor='blue')

ax2 = ax1.twinx()

ax2.plot(t, z, label='Runtime', color = 'orange')
ax2.set_ylabel("Runtime", color ='orange')
ax2.tick_params(axis='y', labelcolor='orange')
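
The twin-axis cell gives both lines a `label` but never draws a legend. Each Axes only knows about its own lines, so one possible approach is to collect the line handles and pass them to a single `legend` call. The data here is random and stands in for the real columns; the `Agg` backend line is an assumption for running outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; not needed inside Jupyter
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(50)
votes = rng.uniform(5, 9, size=50)        # stand-in for vote_average
minutes = rng.integers(80, 160, size=50)  # stand-in for runtime

fig, ax1 = plt.subplots(figsize=(8, 4))
line1, = ax1.plot(t, votes, color="blue", label="Vote Average")

ax2 = ax1.twinx()                          # second y-axis sharing the same x-axis
line2, = ax2.plot(t, minutes, color="orange", label="Runtime")

# One legend listing the lines from both Axes
legend = ax1.legend(handles=[line1, line2], loc="upper right")
```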


## Seaborn

In [None]:
import seaborn as sns
import pandas as pd

**Seaborn is a library for making statistical graphics in Python. It builds on top of matplotlib and integrates closely with pandas data structures.**

In [None]:
df.head()

In [None]:
# Create a visualization
sns.relplot(
    data=df,
    x="vote_average", y="vote_count"
)

In [None]:
# Create a visualization
sns.relplot(
    data=df,
    x="vote_average", y="vote_count", col="popularity_rank", 
)

#### color mappings
- ` Distinct colors are used when your hue is categorical `
- ` When mapping a numeric variable, some functions will switch to a continuous gradient `

In [None]:
# Create a visualization
sns.relplot(
    data=df,
    x="vote_average", y="vote_count", hue="popularity_rank", 
)

In [None]:
# Create a visualization
sns.relplot(
    data=df,
    x="vote_average", y="vote_count", hue="budget", 
)

### How much further can I go with Seaborn?

!['Classification of Seaborn Functions'](sea_plot.png)

**Histplot**

In [None]:
df.status.value_counts()

In [None]:
df.popularity_rank.value_counts()

` Sometimes you need to work on your scale `

In [None]:
g = sns.histplot(
        data=df, x="status", hue="popularity_rank", multiple="stack"
)
g.tick_params(axis='x', rotation=40)

In [None]:
g = sns.histplot(
        data=df, x="status", hue="popularity_rank"
)
g.set_yscale('log')

**Displot**

In [None]:
sns.displot(data=df, x="runtime", hue = 'month', kde=True)
plt.xscale('log')

In [None]:
sns.displot(data=df, x="runtime", col='popularity_rank')

In [None]:
sns.displot(data=df, x="runtime", hue='month',  kde=True,multiple='stack')

**Kernel Density Estimation**

In [None]:
sns.kdeplot(data=df, x="runtime", hue = 'month')
plt.xscale('log')

**Count Plot**

In [None]:
plt.figure(figsize=(13,8))

f = sns.countplot(data=df,  x="year", color='green')
f.tick_params(axis='x', rotation=80)
f.tick_params(labelsize=9)

plt.savefig("year_dist.png", dpi=300)

**Strip Plot**

In [None]:
plt.figure(figsize=(12,8))
# `split` was renamed to `dodge` in seaborn and only has an effect together with `hue`, so it is dropped here
sns.stripplot(x='popularity_rank', y='vote_count', data=df)
plt.title('Popularity Rank by Vote Count')
plt.savefig("stripplt.png", dpi=300)

**Joint Plot**

In [None]:
# jointplot creates its own figure, so the size is set with `height` rather than plt.figure
sns.jointplot(data=df,  x="vote_count", y= 'vote_average', hue='popularity_rank', height=10)

### We stopped here 06-04-2024

**Catplot**

In [None]:
df.groupby(['year', 'month']).size()

In [None]:
cat_df = df.groupby(['year', 'month']).size().unstack()
cat_df = cat_df.fillna(0)  # months with no releases in a year count as 0
cat_df

In [None]:
sns.catplot(data=cat_df, kind="box")
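
The year-by-month table feeding the box plot comes from `groupby(...).size().unstack()`. The sketch below replays that reshaping on a tiny made-up frame so the result is easy to inspect. `fill_value=0` and the `reindex` step are optional additions, not part of the notebook; they guarantee that all 12 month columns exist even when some months never occur in the data.

```python
import pandas as pd

# Made-up release records: one row per movie
toy = pd.DataFrame({
    "year":  [2021, 2021, 2021, 2022, 2022],
    "month": [1, 1, 3, 3, 12],
})

counts = (
    toy.groupby(["year", "month"]).size()            # movies per (year, month)
       .unstack(fill_value=0)                        # months become columns
       .reindex(columns=range(1, 13), fill_value=0)  # make sure months 1..12 all exist
)
print(counts)
```

Each column of `counts` is then one month, and each row one year, which is the wide layout that `sns.catplot(data=..., kind="box")` draws as one box per column.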

In [None]:
## Profit

cat_profit_df = df.groupby(['year', 'month']).agg({'profit': 'sum'})
cat_profit_df

In [None]:
cat_profit_df = cat_profit_df.unstack()
cat_profit_df = cat_profit_df.fillna(0)  # no profit recorded for missing year/month pairs

In [None]:
cat_profit_df.columns = np.arange(1,13)

In [None]:
cat_profit_df

In [None]:
sns.catplot(data=cat_profit_df, kind="box")

**Relplot/Lineplot**

In [None]:
sns.lineplot(data = cat_profit_df)

In [None]:
sns.relplot(data=df, x="year", y="profit", hue="month", kind="line")

### Subplots in Seaborn

In [None]:
f, axs = plt.subplots(1, 2, figsize=(12, 6), gridspec_kw=dict(width_ratios=[4, 3]))

sns.countplot(data=df,  x="month", color='green', ax=axs[0])
axs[0].tick_params(axis='x', rotation=80)
axs[0].tick_params(labelsize=9)

sns.scatterplot(data=df,  x="revenue", y= 'profit', hue="month", ax=axs[1])

f.tight_layout()

# Plotly

**Plotly's Python graphing library makes interactive, publication-quality graphs. It supports line plots, scatter plots, area charts, bar charts, error bars, box plots, histograms, heatmaps, subplots, multiple axes, polar charts, and bubble charts.**

## Basic Plots

### Line Plot

In [None]:
import plotly.express as px
import plotly.offline as pyo

**With px.line, each data point is represented as a vertex of a polyline, whose location is given by the x and y columns.**

In [None]:
fig = px.line(df, x="popularity", y="revenue", title='Popularity and Revenue')
fig.show()

In [None]:
# Let us compare with Matplotlib
plt.plot(df.revenue)

In [None]:
fig = px.line(df,  y="revenue")
fig.show()

In [None]:
# Or this other way
pyo.plot([{
    'x' : df.index,
    'y' : df.revenue
}])

In [None]:
sns.lineplot(data=df, x="popularity", y="revenue", hue='popularity_rank')

In [None]:
# Another angle to it
fig = px.line(df, x="popularity", y="revenue", color='popularity_rank', title='Popularity and Revenue')
fig.show()

In [None]:
rev_st = df[['profit', 'revenue']]

In [None]:
rev_st

In [None]:
# Plot all columns on a figure
pyo.plot([{
    'x' : rev_st.index,
    'y' : rev_st[col],
    'name' : col
    # For legend
} for col in rev_st.columns])

### Scatter Plot

In [None]:
fig = px.scatter(rev_st, x="profit", y="revenue")
fig.show()

In [None]:
import plotly.graph_objs as go

In [None]:
# Modify plot to your taste
pyo.plot([
    go.Scatter(
        x = rev_st.profit,
        y = rev_st.revenue,
        mode = 'markers',
        marker = dict(
            size = 14,
            color = 'rgb(23,122,231)',
            symbol = 'square'
        )
    )
])

### Bubble Chart

In [None]:
fig = px.scatter(df.query("year==2022"), x="popularity", y="revenue",
                 size="vote_count", color="popularity_rank",
                 hover_name="title", size_max=60)
fig.show()

### SunBurst

!['Sun Burst'](sunburst.png)

In [None]:
fig = px.sunburst(df, path=['popularity','vote_average'], values='revenue', color='popularity_rank')
fig.show()

### Bar Plot

In [None]:
df.head()

In [None]:
px.histogram(df, x="month", color="status", barmode='group')

In [None]:
# You have to aggregate your data
popularity_rk = pd.pivot_table(df, values = 'profit', index = 'popularity_rank', aggfunc= "sum")

In [None]:
popularity_rk
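
The aggregation step can be written either with `pd.pivot_table` (as above) or with `groupby`. The sketch below runs both on a small made-up frame to show they give the same totals. The profit figures and the "Popular" category are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "popularity_rank": ["Very Popular", "Popular", "Popular", "Very Popular"],
    "profit": [100, 20, 30, 50],
})

# Same aggregation, two spellings
by_pivot = pd.pivot_table(toy, values="profit", index="popularity_rank", aggfunc="sum")
by_group = toy.groupby("popularity_rank")[["profit"]].sum()

print(by_pivot)
print(by_group)
```

Either result has `popularity_rank` as its index, so it can be passed to `px.bar` in the same way as `popularity_rk`.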

In [None]:
fig = px.bar(popularity_rk, x=popularity_rk.index, y='profit')
fig.show()

##### Horizontal Bar Chart with Plotly Express

In [None]:
fig = px.bar(popularity_rk, y=popularity_rk.index, x='profit', orientation='h')
fig.show()

In [None]:
dt = [
    go.Bar(
        y = popularity_rk.index,
        x = popularity_rk.profit,
        # popularity_rk holds the summed profit per popularity rank
        name = 'Total Profit by Popularity Rank',
        orientation = 'h',
        marker=dict(
            color='rgba(246, 78, 139, 0.6)',
            line=dict(color='rgba(246, 78, 139, 1.0)', width=3)
        )
    )
]
layout = go.Layout(title='Total Profit by Popularity Rank')

In [None]:
figure = go.Figure(data = dt, layout=layout)

In [None]:
pyo.plot(
    figure
)

### Gantt Chart

In [None]:
ddf = pd.DataFrame([
    dict(Task="Job A", Start='2009-01-01', Finish='2009-02-28', Resource="Alex"),
    dict(Task="Job B", Start='2009-03-05', Finish='2009-04-15', Resource="Alex"),
    dict(Task="Job C", Start='2009-02-20', Finish='2009-05-30', Resource="Max"),
    dict(Task="Job D", Start='2010-03-05', Finish='2010-04-15', Resource="Alex"),
    dict(Task="Job E", Start='2011-02-20', Finish='2013-05-30', Resource="Max")
])
ddf


In [None]:
fig = px.timeline(ddf, x_start="Start", x_end="Finish", y="Task", color="Resource")
fig.update_yaxes(autorange="reversed")
fig.show()

[Documentation of Plotly](https://plotly.com/python/)

## Streamlit

**Streamlit is an open-source Python library that makes it easy to create and share beautiful, custom web apps for machine learning and data science. In just a few minutes you can build and deploy powerful data apps.**

**Widgets**

#### Let us try some basic analytics on Jupyter Notebook and also on Streamlit

In [None]:
df.head()

### Build a Recommender System

In [None]:
lngs = df['original_language'].unique()
print("Select any of the following:", lngs)
choice = input("Type the Language as seen : ")

In [None]:
df[df["original_language"] == choice].sort_values(by=['popularity'], ascending=False)[["title",'popularity']]
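
To reuse this filter outside the notebook, it helps to wrap it in a function that takes the chosen language as an argument instead of reading it with `input()`. The function name `top_titles` and the parameter `n` are hypothetical. The toy rows reuse titles and popularity values from the preview above.

```python
import pandas as pd

def top_titles(frame, language, n=10):
    """Return the n most popular titles for one original_language code."""
    subset = frame[frame["original_language"] == language]
    ranked = subset.sort_values(by="popularity", ascending=False)
    return ranked[["title", "popularity"]].head(n)

toy = pd.DataFrame({
    "title": ["Smile", "Presences", "Orphan: First Kill", "Beast"],
    "original_language": ["en", "es", "en", "en"],
    "popularity": [1864, 1021, 5089, 2172],
})

result = top_titles(toy, "en", n=2)
print(result)
```

In a Streamlit script, the language could come from a widget such as `st.selectbox("Language", lngs)` and the result could be shown with `st.dataframe(...)`; treat that as a sketch of one option rather than the only layout.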

### Check Vote Average by Year

In [None]:
df

In [None]:
years = df['year'].unique()
print("Select any of the following:", years)
year_to_filter = input("Type the Year as seen : ")

In [None]:
filtered_data = df[df['year'] == int(year_to_filter)]

In [None]:
filtered_data 

In [None]:
sns.lineplot(data = filtered_data,x = filtered_data.index, y = 'vote_average', hue='popularity_rank')

### Search Movie

In [None]:
selected_movie = input("Type Movie Name : ")
# Orphan: First Kill

In [None]:
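def searcher(mvs_name, kw):
    # Case-insensitive substring match, so multi-word queries such as
    # "Orphan: First Kill" also match (comparing against single words would miss them)
    if kw.lower() in mvs_name.lower():
        return mvs_name
    return None

In [None]:
rcd = df['title'].apply(lambda x: searcher(x, selected_movie))
for each in rcd:
    if each is not None:
        print(each)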

In [None]:
df[df['title'] == selected_movie]
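
The exact-match filter above only finds a movie when the full title is typed correctly. A possible alternative is pandas' `Series.str.contains`, which can do a case-insensitive partial match in one step. The titles below are a made-up sample, and `query` stands in for `selected_movie`.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["Orphan: First Kill", "Orphan", "Beast", "The Black Phone"],
})

query = "orphan"  # stand-in for the text typed by the user

# regex=False treats the query as plain text, so characters like ":" or "(" are safe
mask = toy["title"].str.contains(query, case=False, regex=False)
matches = toy[mask]
print(matches)
```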

In [None]:
### Installing Streamlit on Machine

In [None]:
#!pip3 install streamlit --user

In [None]:
#!python -m streamlit run main_page.py