# Visualizing data

This notebook demonstrates how to create data visualizations using: 
- Matplotlib and Seaborn
- Plotly
- Word-clouds

---

In [None]:
# If Seaborn or any other libraries are missing, you can install them using "!pip install"

#!pip install seaborn

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import nltk
from nltk import FreqDist

## Matplotlib

In [None]:
url = "https://raw.githubusercontent.com/CaptSolo/BSSDH_2023_beginners/main/corpora/en_old_newspapers_5k.tsv"
df = pd.read_csv(url, sep="\t")

In [None]:
df.head(5)

In [None]:
# https://dariuslfuller.medium.com/creating-visuals-with-nltks-freqdist-ac4e667e49f3

all_text = "\n".join(df["Text"]).split()
all_fdist = FreqDist(all_text).most_common(20)

In [None]:
all_text[:20]

In [None]:
all_fdist

In [None]:
# converting data to Pandas series
all_fdist = pd.Series(dict(all_fdist))

In [None]:
all_fdist[:10]
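
NLTK's `FreqDist` is a subclass of Python's `collections.Counter`, so the counting step above can be reproduced with the standard library alone. The following is a minimal sketch on a made-up sentence; `sample_text` is an assumption for illustration and stands in for the newspaper corpus.

```python
from collections import Counter

# Hypothetical sample text standing in for the newspaper corpus
sample_text = "the cat sat on the mat and the dog sat on the rug"

# Whitespace tokenization, as in the notebook
tokens = sample_text.split()

# Counter maps each distinct token to its number of occurrences
counts = Counter(tokens)

# most_common(n) returns the n most frequent (token, count) pairs
top_words = counts.most_common(3)
print(top_words)
```

Because the result is a list of `(word, count)` pairs, it can be turned into a Pandas series with `pd.Series(dict(top_words))`, just as the notebook does.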

In [None]:
## Matplotlib bar plot using Pandas attributes + xtick rotation for ease of viewing

all_plot = plt.bar(all_fdist.index, all_fdist.values)
ticks = plt.xticks(rotation=40)

In [None]:
# Add labels and title

all_plot = plt.bar(all_fdist.index, all_fdist.values)
ticks = plt.xticks(rotation=40)

plt.xlabel('Words')
plt.ylabel('Counts')
plt.title('Word Frequency Bar Plot')


In [None]:
# Matplotlib line plot

all_plot = plt.plot(all_fdist.index, all_fdist.values)
ticks = plt.xticks(rotation=40)

In [None]:
# Demo of a scatter plot (with synthetic data about countries)

countries = ['Country A', 'Country B', 'Country C', 'Country D', 'Country E']
population = [10, 50, 30, 80, 45]  # in millions
area = [100, 400, 150, 700, 350]   # in thousand square km

# Create scatter plot
plt.figure(figsize=(10, 6))
plt.scatter(area, population)

# Annotate plot points with country names
for i, country in enumerate(countries):
    plt.annotate(country, (area[i], population[i]), xytext=(5, 5), textcoords='offset points')

plt.xlabel('Area (thousand sq km)')
plt.ylabel('Population (millions)')

plt.show()

### Stopword removal

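Stopwords are very common function words (such as "the" or "of") that carry little topical meaning, so they are usually removed before counting word frequencies. For widely used languages such as English, we can use NLTK's stopword list.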

In [None]:
nltk.download('stopwords')

stopwords = nltk.corpus.stopwords.words('english')
print(stopwords[:10])

In [None]:
# let's convert the list to a set (sets offer more efficient word lookup operations)
stopword_set = set(stopwords)

In [None]:
# removing stopwords
all_text_stopped = [word for word in all_text if word.lower() not in stopword_set]

# let's also remove some special symbols
spec_chars = ['--', '—', '-']
all_text_stopped = [word for word in all_text_stopped if word not in spec_chars]

all_text_stopped[:6]
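
Because `split()` only splits on whitespace, tokens can keep punctuation attached (for example `said,` or `"The`), and such tokens will not match entries in a stopword list. A possible refinement, sketched here on made-up data, is to strip surrounding punctuation before filtering. `demo_stopwords` and `sample_text` are assumptions for illustration, not part of the notebook's data.

```python
import string

# Hypothetical mini stopword set; the notebook uses NLTK's English list instead
demo_stopwords = {"the", "a", "and", "was", "in"}

# Hypothetical sample sentence
sample_text = 'The mayor said, "The budget was approved." And the crowd cheered.'

cleaned_tokens = []
for token in sample_text.split():
    # Strip leading/trailing ASCII punctuation: '"The' -> 'The', 'approved."' -> 'approved'
    word = token.strip(string.punctuation)
    # Skip tokens that were punctuation only, then filter stopwords case-insensitively
    if word and word.lower() not in demo_stopwords:
        cleaned_tokens.append(word)

print(cleaned_tokens)
```

Note that `string.punctuation` covers ASCII punctuation only, so characters such as curly quotes or long dashes would still need separate handling.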

In [None]:
# let's draw freq distribution again

all_fdist_stopped = FreqDist(all_text_stopped).most_common(20)
all_fdist_stopped = pd.Series(dict(all_fdist_stopped))

for word, count in all_fdist_stopped.items():
    print(word, ":\t", count)

In [None]:
# vertical bar chart

all_plot = plt.bar(all_fdist_stopped.index, all_fdist_stopped.values)
ticks = plt.xticks(rotation=60)

In [None]:
# horizontal bar chart

all_plot = plt.barh(all_fdist_stopped.index, all_fdist_stopped.values)

In [None]:
all_plot = plt.barh(all_fdist_stopped.index, all_fdist_stopped.values)
ax = plt.gca()
ax.invert_yaxis()

### Stopwords for languages not included in NLTK

Previously we used NLTK's stopword list, but that won't work for Latvian or other languages that NLTK does not include.

Let's use an existing Latvian stopword list from GitHub:

In [None]:
import requests

stop_url = "https://raw.githubusercontent.com/Xangis/extra-stopwords/master/latvian"
res = requests.get(stop_url)

stopwords_lv = res.text.split()
print(stopwords_lv[:10])

stopword_set_lv = set(stopwords_lv)

In [None]:
# reading our text corpus

import pandas as pd

url_2 = "https://raw.githubusercontent.com/CaptSolo/BSSDH_2023_beginners/main/corpora/lv_old_newspapers_5k.tsv"
df_2 = pd.read_csv(url_2, sep="\t")

In [None]:
all_text_lv = "\n".join(df_2["Text"]).split()
all_fdist_lv = FreqDist(all_text_lv).most_common(20)

In [None]:
# converting data to Pandas series
all_fdist_lv = pd.Series(dict(all_fdist_lv))

In [None]:
# removing stopwords (lowercase each word first so capitalized stopwords also match)
all_text_stopped_lv = [word for word in all_text_lv if word.lower() not in stopword_set_lv]

# removing special characters
spec_chars = ['-', '–', '—']
all_text_stopped_lv = [word for word in all_text_stopped_lv if word not in spec_chars]

all_text_stopped_lv[:6]

In [None]:
# draw freq distribution

all_fdist_stopped_lv = FreqDist(all_text_stopped_lv).most_common(20)
all_fdist_stopped_lv = pd.Series(dict(all_fdist_stopped_lv))

all_plot_lv = plt.barh(all_fdist_stopped_lv.index, all_fdist_stopped_lv.values)
ax = plt.gca()
ax.invert_yaxis()

### Histograms

Let's create a histogram showing the distribution of word lengths in the text.

A histogram is a type of graph that shows how often different numbers or ranges of numbers appear in a dataset. 
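
To make the idea concrete, here is a minimal sketch that builds a histogram by hand with the standard library. It assumes a bin width of 1, so each bin corresponds to a single word length; `sample_tokens` is a made-up stand-in for `all_text`.

```python
from collections import Counter

# Hypothetical sample tokens standing in for `all_text`
sample_tokens = "a histogram groups values into bins and counts each bin".split()

word_lengths = [len(token) for token in sample_tokens]

# With a bin width of 1, each bin is simply one word length
length_counts = Counter(word_lengths)

# Text rendering of the histogram: one row per length, one '#' per word
for length in range(min(word_lengths), max(word_lengths) + 1):
    print(f"{length:2d} | {'#' * length_counts[length]}")
```

Word lengths are integers, so when plotting with Matplotlib it may be worth passing explicit bin edges such as `bins=range(1, max(word_length) + 2)`, which gives each length its own bar.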

In [None]:
all_text[:10]

In [None]:
# for every word, return its length
word_length = [len(word) for word in all_text]

word_length[:10]

In [None]:
n_bins = 20

# Matplotlib histogram plot
plt.hist(word_length, bins=n_bins)

In [None]:
long_words = [word for word in all_text if len(word) >= 15]

long_words[:10]

## Seaborn

Seaborn is a Python data visualization library based on Matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

https://seaborn.pydata.org/

In [None]:
data = FreqDist(all_text_stopped).most_common(20)
data = pd.DataFrame(data, columns = ["Word","Frequency"])

data[:10]

In [None]:
ax = sns.barplot(data, x="Word", y="Frequency")

ax.set_xticks(range(len(data["Word"])))
labels = ax.set_xticklabels(data["Word"], rotation=60)

In [None]:
ax = sns.barplot(data, y="Word", x="Frequency", orient="h")

In [None]:
# Seaborn histplot is similar to Matplotlib hist()
# with some improvements

# https://seaborn.pydata.org/generated/seaborn.histplot.html

sns.histplot(word_length, binwidth=2)

### Visualizing bigrams

In [None]:
import nltk.collocations as collocations
from nltk import FreqDist, bigrams

In [None]:
ngrams = bigrams(all_text_stopped)

from itertools import islice

for item in islice(ngrams, 10):
    print(item)
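
`bigrams()` returns a generator of consecutive word pairs, and a generator can only be consumed once. The sketch below imitates this behaviour with `zip`; `pairwise_bigrams` is a hypothetical helper written for illustration, not NLTK's implementation.

```python
from itertools import islice

def pairwise_bigrams(tokens):
    """Return a lazy iterator over consecutive token pairs (hypothetical helper)."""
    return zip(tokens, tokens[1:])

# Hypothetical sample tokens
sample_tokens = ["new", "york", "city", "council", "meeting"]

pairs = pairwise_bigrams(sample_tokens)
first_two = list(islice(pairs, 2))   # consumes the first two pairs
remaining = list(pairs)              # only the pairs that are left
exhausted = list(pairs)              # nothing left: the iterator is used up

print(first_two, remaining, exhausted)
```

This is why the notebook calls `bigrams(all_text_stopped)` again before passing the result to `FreqDist`: the earlier iterator has already been partially consumed.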

In [None]:
ngrams = bigrams(all_text_stopped)

ngram_freq_list = FreqDist(ngrams).most_common(20)

In [None]:
ngram_freq_list[:5]

In [None]:
# this code expects Python 3.7 or later, where dictionaries are guaranteed
# to preserve insertion order (CPython 3.6 does so as an implementation detail).

ngram_dict = {}

for words, count in ngram_freq_list:
    key = "_".join(words)
    ngram_dict[key] = count

print(ngram_dict)

In [None]:
ngram_freqdist = pd.Series(ngram_dict)

In [None]:
# plot the figure

fig, ax = plt.subplots(figsize=(10,10))

## set the plot to horizontal + set title + display
bar_plot = sns.barplot(x=ngram_freqdist.values, y=ngram_freqdist.index, orient='h', ax=ax)
title = plt.title('Frequency Distribution')

#### Network graph using NetworkX

In [None]:
import networkx as nx
G = nx.Graph()

In [None]:
ngrams = bigrams(all_text_stopped)
ngram_freq_list = FreqDist(ngrams).most_common(30)

In [None]:
for item, cnt in ngram_freq_list:
    print(item, cnt)

In [None]:
# Add edges and their weights
for bigram, freq in ngram_freq_list:
    G.add_edge(bigram[0], bigram[1], weight=freq/5)
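
One detail of the loop above: `nx.Graph` is undirected, so the bigrams `(a, b)` and `(b, a)` map to the same edge, and a later `add_edge` call replaces the weight set by an earlier one. If both orders appear in the frequency list, one option is to merge their counts first. The sketch below does this with the standard library; `bigram_counts` is made-up data for illustration.

```python
from collections import Counter

# Hypothetical bigram counts; note that ("york", "new") reverses ("new", "york")
bigram_counts = [
    (("new", "york"), 12),
    (("york", "new"), 3),
    (("city", "council"), 7),
]

edge_weights = Counter()
for (first, second), count in bigram_counts:
    # Sort the pair so that (a, b) and (b, a) share one key
    edge = tuple(sorted((first, second)))
    edge_weights[edge] += count

print(dict(edge_weights))
```

The merged weights could then be added to the graph with `G.add_edge(u, v, weight=w)` for each `(u, v), w` in `edge_weights.items()`.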

In [None]:
# Plot the network graph (using Kamada-Kawai layout)

plt.figure(figsize=(12, 8))
pos = nx.kamada_kawai_layout(G)
edges = G.edges(data=True)
weights = [edge[2]['weight'] for edge in edges]
nx.draw_networkx_nodes(G, pos, node_size=500, node_color='skyblue')
nx.draw_networkx_edges(G, pos, edgelist=edges, width=weights)
nx.draw_networkx_labels(G, pos, font_size=12, font_family='sans-serif')

plt.title('Bigram Network Graph')
plt.show()

In [None]:
# Plot the network graph (using spring layout)

plt.figure(figsize=(12, 8))
pos = nx.spring_layout(G, k=1.2)
edges = G.edges(data=True)
weights = [edge[2]['weight'] for edge in edges]
nx.draw_networkx_nodes(G, pos, node_size=500, node_color='skyblue')
nx.draw_networkx_edges(G, pos, edgelist=edges, width=weights)
nx.draw_networkx_labels(G, pos, font_size=12, font_family='sans-serif')

plt.title('Bigram Network Graph')
plt.show()

In [None]:
# Save the graph to a file (which can be loaded into Gephi)

nx.write_graphml(G, "graph.graphml")

## Plotly

The Plotly graphing library helps you make interactive, publication-quality graphs:
- https://plotly.com/python/

The `plotly.express` module (typically imported as `px`) contains functions that can create entire figures at once. Plotly Express is built into the plotly library and is the recommended starting point for creating most common figures.
- https://plotly.com/python/plotly-express/

In [None]:
import plotly.express as px

In [None]:
url = "https://raw.githubusercontent.com/CaptSolo/BSSDH_2023_beginners/main/corpora/en_old_newspapers_5k.tsv"
df = pd.read_csv(url, sep="\t")

all_text = "\n".join(df["Text"]).split()
all_fdist = FreqDist(all_text).most_common(40)

df_all_fdist = pd.DataFrame(all_fdist, columns=["Word", "Frequency"])

df_all_fdist.head(5)


In [None]:
# Create a bar chart
fig = px.bar(df_all_fdist, x='Word', y='Frequency')

# renderer="colab" targets Google Colab; in other environments, try fig.show() without arguments
fig.show(renderer="colab")


In [None]:
# Scatter plot example

countries = ['Country A', 'Country B', 'Country C', 'Country D', 'Country E']
population = [10, 50, 30, 80, 45]  # in millions
area = [100, 400, 150, 700, 350]   # in thousand square km

data = pd.DataFrame({
    'Country': countries,
    'Population': population,
    'Area': area
})

In [None]:
# Create scatter plot with Plotly
fig = px.scatter(data, x='Area', y='Population', text='Country')

# Update layout for better readability
fig.update_traces(textposition='top center')
fig.update_layout(
    title='Population vs Area of Countries',
    xaxis_title='Area (thousand sq km)',
    yaxis_title='Population (millions)',
    showlegend=False
)

# Show the plot
fig.show(renderer="colab")

## Word-cloud visualization

https://github.com/amueller/word_cloud

In [None]:
## not needed if the WordCloud library is already installed
#!pip install wordcloud

In [None]:
import matplotlib.pyplot as plt

from wordcloud import WordCloud

In [None]:
# Let's prepare the text to visualize

url = "https://raw.githubusercontent.com/CaptSolo/BSSDH_2023_beginners/main/corpora/en_old_newspapers_5k.tsv"
df = pd.read_csv(url, sep="\t")

all_text = "\n".join(df["Text"]).split()

stopword_set = set(stopwords)

# removing stopwords
all_text_stopped = [word for word in all_text if word.lower() not in stopword_set]

In [None]:
all_text_stopped[:10]

In [None]:
text = " ".join(all_text_stopped)
wordcloud = WordCloud().generate(text)

plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

In [None]:
# set the image size, limit the maximum number of words and lighten the background:

wordcloud = WordCloud(width=1000, height=500, max_words=40, background_color="white").generate(text)

plt.figure()
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
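
A word cloud sizes each word according to how often it occurs. As a rough illustration of that idea (not the library's actual layout algorithm), the sketch below keeps the most frequent words and scales their counts relative to the top word. `sample_tokens` is made-up data and `relative_sizes` is a name chosen for this example.

```python
from collections import Counter

# Hypothetical sample tokens
sample_tokens = "city council city budget council city vote".split()
max_words = 2   # plays the same role as the WordCloud(max_words=...) argument above

counts = Counter(sample_tokens)

# Keep only the most frequent words
top = counts.most_common(max_words)

# Scale every count relative to the most frequent word (which gets 1.0)
largest = top[0][1]
relative_sizes = {word: count / largest for word, count in top}

print(relative_sizes)
```

If you already have word counts, for example from `FreqDist`, the wordcloud library also provides `WordCloud.generate_from_frequencies()`, which accepts a dictionary of words and frequencies instead of raw text.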

In [None]:
# Save the image in a file:
wordcloud.to_file("wordcloud.png")

---

## Additional Information

Matplotlib supports many types of graphs:
- [Matplotlib plot types](https://matplotlib.org/stable/plot_types/index.html)
- [Matplotlib gallery](https://matplotlib.org/stable/gallery/index.html)

More information about Seaborn:
* https://seaborn.pydata.org/tutorial/introduction.html
* https://seaborn.pydata.org/tutorial/distributions.html
* https://seaborn.pydata.org/examples/index.html

Word-cloud generation:
* https://github.com/amueller/word_cloud

Tutorials:
* [Matplotlib tutorial](https://github.com/rougier/matplotlib-tutorial) by Nicolas P. Rougier
* [Pyplot tutorial](https://matplotlib.org/stable/tutorials/introductory/pyplot.html)



---

## Your turn!

Choose a text corpus and **visualize it** using the tools shown in this notebook.

**Write code in notebook cells below**.